Founding Data Operations Lead at Besimple AI (X25)
$70K - $100K
Expert-in-the-loop eval data for AI
Redwood City, CA, US / Remote (US)
Full-time
US citizen/visa only
3+ years
About Besimple AI

Why Us

At Besimple AI, we’re making it radically easier for teams to build and ship reliable AI by fixing the hardest part of the stack: data. Good evaluation, training and safety data require domain experts, robust tooling and meticulous QA. AI teams and labs come to us to get high quality data so they can launch AI safely. We’re a YC X25 company based in Redwood City, CA, already powering evaluation and training pipelines for leading AI companies across customer support, search, and education. Join now to be close to real customer impact, not just demos.

Why This Matters

High-quality, human-reviewed data is still the single biggest driver of model quality, but most teams are stuck with old tools and legacy processes that do not scale to modern, multimodal, agentic workflows. Besimple replaces that mess with instant custom UIs, tailored rubrics, and an end-to-end human-in-the-loop workflow that supports text, chat, audio, video, LLM traces, and more. We meet teams where they are—whether they need on-prem deployments and granular user management or a fast cloud setup—to turn evaluation into a continuous capability rather than a one-time project.

Who You’ll Work With

The founders previously built the annotation platform that supported Meta’s Llama models. We’ve seen how world-class annotation systems shape model quality and iteration speed; we’re bringing those lessons to every AI team that needs to ship with confidence. You’ll work directly with the founders and users, owning problems end-to-end—from an interface that unlocks a tough rubric, to a workflow that reduces disagreement, to an AI judge system that improves quality.

How We Work

  • Bias to shipping and learning with customers
  • Respect for craft: calibration, rubric clarity, inter-annotator agreement (IAA)
  • Tight feedback loops from production back to evaluation
  • Ownership: you’ll shape evaluation as an engineering discipline with real “fail-to-ship” tests tied to business and safety goals

If you’re excited by systems that combine product design, human judgment, and applied AI—and you want to build the data and evaluation layer that keeps AI trustworthy—come build with us. See how fast teams can go from raw logs to a robust, human-in-the-loop eval pipeline—and how that changes the way they ship AI.

About the role

Type: Full-time, Remote (US-first)
Team: Founding Ops & Customer Delivery

About Besimple

We are a safety data research company. Our mission is to bring AI into the real world safely. We believe that AI can meaningfully empower humanity only if we put safety first. We’re a small, nimble team of passionate builders who believe humans must remain in the loop.

The Role (Founding)

This is our founding operations role. You won’t “run a process”—you’ll design the process, the playbooks, and the bar for what world-class, AI-first data operations looks like. You’ll take ambiguous customer needs, turn them into crisp rubrics and workflows, recruit and train a global annotator bench, and stand up the quality systems, dashboards, and SLAs that become Besimple’s operating backbone. As we grow, you’ll scale the org you built—hiring, coaching, and evolving best practices.

You’ll use AI coding tools (Copilot/Cursor/Codex) and lightweight Python/SQL to automate processes, analyze variance/drift, and accelerate delivery. You’ll partner with customers to define and refine annotation requirements, and with Product/Eng to shape UX, guardrails, and platform roadmap.
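To give a flavor of the lightweight scripting involved, here is a hypothetical quality-analytics sketch (not Besimple’s actual tooling; the file names, columns, and thresholds are assumptions): it scores each annotator’s agreement against a gold set and flags week-over-week drift.

```python
# Hypothetical quality-analytics sketch: compare annotator labels against a
# gold set and flag weekly agreement drift. File and column names are assumed.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

labels = pd.read_csv("annotations.csv")  # columns: task_id, annotator, label, week
gold = pd.read_csv("gold_set.csv")       # columns: task_id, gold_label

merged = labels.merge(gold, on="task_id", how="inner")

# Per-annotator agreement (Cohen's kappa) against the gold labels
for annotator, group in merged.groupby("annotator"):
    kappa = cohen_kappa_score(group["label"], group["gold_label"])
    print(f"{annotator}: kappa vs gold = {kappa:.2f} on {len(group)} gold items")

# Simple drift check: weekly accuracy against gold, flag drops of > 5 points
weekly = (merged.assign(correct=merged["label"] == merged["gold_label"])
                .groupby("week")["correct"].mean())
drops = weekly.diff() < -0.05
for week, flagged in drops.items():
    if flagged:
        print(f"Week {week}: accuracy dropped more than 5 points vs prior week")
```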

What You’ll Do

  • Own customer programs end-to-end: translate goals into schemas, rubrics, gold sets, and success metrics; pilot → scale with clear reporting and write-ups.
  • Define & refine requirements with customers: run scoping sessions, lock criteria/edge-case taxonomies/IAA targets; iterate as models and prompts change.
  • Recruit, onboard, and train annotators: source SMEs, design paid trials, build training artifacts, calibrate on gold data, and manage QA/arbitration loops.
  • Ship with AI-accelerated ops: write quick scripts and notebooks for data transforms, audits, log parsing, schema reconciliation, and quality analytics.
  • Build the operating system: SLAs, sampling plans, consensus/appeals, audit trails, and continuous calibration; make quality measurable and repeatable.
  • Close the loop: drive prompt/model/policy experiments; surface insights to Product/Eng; propose UI tweaks and guardrails that raise signal-to-noise.

What Will Make You Successful

  • Company-builder mindset: you’ve built 0→1 programs or teams, created playbooks, and raised the bar for quality and speed.
  • Customer-facing clarity: you convert open-ended asks into precise pass/fail criteria and aren’t afraid to propose a better spec.
  • People leadership: you attract, calibrate, and motivate high-judgment annotators while holding a crisp, documented bar.
  • Hands-on with data & AI tools: comfortable with AI coding assistants plus basic Python/SQL to answer questions fast and automate the dull bits.
  • Execution bias: you prefer small pilots and rapid iteration over lengthy specs, and you over-communicate risks and status.

Qualifications

  • 2–4+ years in data/product/research operations for ML/AI, relevance, or safety—or equivalent “high-judgment at scale” experience.
  • Track record recruiting, onboarding, and training annotators/raters with gold-set calibration and QA loops.
  • Demonstrated program ownership: requirements, change management, stakeholder updates, and postmortems.
  • Excellent writing: rubrics, edge-case guides, SOPs, and crisp weekly reports.

Nice to Have

  • Trust & Safety, RLHF/RLAIF, search/relevance, or regulated domains (medical, legal, finance).
  • Experience designing evaluator UIs, prompt templates, or judgment tasks for LLMs/multimodal models.
  • Familiarity with IAA stats, sampling methods, or experiment design.

Compensation & Ownership

Founding-level role with meaningful equity and scope to define what it means to build an AI-first data annotation company—from playbooks and metrics to culture and hiring.

Technology

Technology & Hard Problems

Product Surface

Besimple generates task-specific annotation interfaces and guidelines on the fly, runs human-in-the-loop (HITL) workflows at scale, and trains AI judges that learn from human decisions to triage easy cases and flag ambiguous ones. We support multimodal data (text, chat, audio, video, traces) and enterprise needs like on-prem deployment and fine-grained access control. Under the hood, we optimize for latency, correctness, and adaptability—simultaneously.
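To make the triage idea concrete, here is a minimal, hypothetical sketch (not Besimple’s implementation; the verdict structure and threshold are assumptions): an AI judge’s verdicts are auto-resolved only above a confidence gate, and everything else is routed to human review.

```python
# Illustrative triage rule only (not Besimple's actual system): route an AI
# judge's verdicts by confidence so only ambiguous cases reach human reviewers.
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    item_id: str
    label: str         # e.g. "pass" / "fail" against a rubric criterion
    confidence: float  # calibrated confidence in [0, 1]; an assumption here

CONFIDENCE_GATE = 0.9  # assumed threshold; in practice tuned against gold data

def triage(verdicts):
    """Split verdicts into auto-resolved items and items routed to humans."""
    auto, review = [], []
    for v in verdicts:
        (auto if v.confidence >= CONFIDENCE_GATE else review).append(v)
    return auto, review

# Example: one clear case is auto-resolved, one ambiguous case goes to review
auto, review = triage([
    JudgeVerdict("trace-001", "pass", 0.98),
    JudgeVerdict("trace-002", "fail", 0.61),
])
print(len(auto), "auto-resolved;", len(review), "sent to human review")
```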

Hard Technical Problems We’re Tackling

  • Generative UI for Any Data Shape: Turn arbitrary inputs—JSON logs, multi-turn dialogs, code diffs, speech transcripts, video frames—into ergonomic, versioned UIs with validation and assistive affordances (schema inference, promptable components, live preview with safe defaults).
  • Human-in-the-Loop Orchestration: Route tasks to the right experts, enforce calibration and quality gates, measure IRR, and run adjudication when disagreement is informative—not noise.
  • AI-Judge Training & Control: Distill human rubrics into model-based evaluators that score live traffic, self-update with new human decisions, and stay inside guardrails (confidence thresholds, policy constraints, auditability).
  • Production-Grade Eval: Build gating suites and regression tests aligned to product KPIs and safety constraints; snapshot datasets; track drift; and plumb production signals back into evaluation and training (a minimal gating-test sketch follows this list).
  • Enterprise Delivery: Optional on-prem installs, per-tenant isolation, SSO/RBAC, and audit trails that satisfy infosec without slowing iteration.
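A minimal sketch of such a gating test, under assumed file names and thresholds (not Besimple’s actual suite): the build fails whenever a candidate’s score on a frozen eval snapshot regresses past an agreed tolerance.

```python
# Minimal "fail-to-ship" gate sketch (hypothetical paths, metrics, and bar):
# block release if the candidate model regresses on a frozen eval snapshot.
import json

BASELINE_FILE = "eval_snapshots/baseline_scores.json"    # assumed locations
CANDIDATE_FILE = "eval_snapshots/candidate_scores.json"
MAX_REGRESSION = 0.02  # allow at most a 2-point drop on any tracked metric

def test_candidate_does_not_regress():
    with open(BASELINE_FILE) as f:
        baseline = json.load(f)   # e.g. {"helpfulness": 0.87, "safety": 0.99}
    with open(CANDIDATE_FILE) as f:
        candidate = json.load(f)
    for metric, base_score in baseline.items():
        drop = base_score - candidate.get(metric, 0.0)
        assert drop <= MAX_REGRESSION, (
            f"{metric} regressed by {drop:.3f}, exceeding the {MAX_REGRESSION} gate"
        )
```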

What You’ll Own

End-to-end slices of the product—e.g., building a new multimodal interface, designing a calibration workflow that improves IRR, shipping a rubric-aware AI judge for a new domain, or tightening dataset lineage so a customer can trace a production decision back to ground truth.

Why This Is a Great Fit for Builders

This work sits at the intersection of product engineering, systems design, and applied AI. You’ll ship tangible interfaces, shape evaluation science, and see your work block real regressions. The feedback loop is measured in better models in production, not vanity benchmarks.

Other jobs at Besimple AI

Full-time · Redwood City, CA, US / Remote (US) · $70K - $100K · 3+ years

Full stack intern · US / Remote (US) · $6K - $10K / monthly · Any experience
