How we designed the yoyo eval harness

We stopped trying to prove that yoyo was a fully autonomous coding agent and started evaluating it for what it actually is: a structural MCP tool layer that should make repository answers more grounded and write execution cleaner under direction.

This is the design logic behind the current eval stack, not a final benchmark report.

The first mistake was benchmarking the wrong thing

The earliest harness work used puncture tasks. Those were cheap to build, easy to score, and useful for validating the plumbing. They also taught the wrong lesson. Small local bug repairs reward direct file reads and direct edits. That makes native agent behavior look strong and under-measures the part of the product we actually care about: grounded repository understanding.

Puncture tasks are still useful, but only as smoke tests. They are now Tier 0 in the eval stack, not product truth.

The second mistake was treating yoyo like a stand-alone agent

yoyo is not a planner. It is not a replacement policy layer for Codex or Claude. It is a collection of structural repository tools over MCP. So the main question cannot be “can the treatment autonomously solve the task with no guidance?” That asks the model runtime to do work that does not belong to the tool layer.

The better question is: when an engineer gives realistic commands, does yoyo make the model more grounded on reads and cleaner on writes?

That is why the harness became directed

The harness now supports directed tool-use tasks. Instead of handing the model a repo and hoping it invents the right workflow, we give it explicit engineering instructions in phases.

  • Read-only: locate the likely files and symbols, explain the ownership layer, state invariants and blast radius.
  • Write-only: start from a known fix surface, make the minimal patch, and run the narrowest verification.
  • Read-then-write: investigate first, then patch based on that understanding.

This matches how an engineer actually uses a tool suite. It also lets us ask principal-engineer-level questions drawn from the codebase instead of only local implementation questions.

We separated read quality from write quality on purpose

If a task starts by rediscovering ownership and ends by changing code, you cannot tell whether the model failed on analysis or failed on execution. So the harness isolates them.

Read-only tasks measure groundedness, ownership calls, invariants, and blast-radius reasoning. Write-only tasks assume the surface is already known and measure patch quality, scope discipline, and verification. Read-then-write tasks are the integration path, but not the only path.

This is how we got to the honest product statement: yoyo is strongest when read judgment narrows the surface first and the write then executes cleanly.

We had to learn contamination rules the hard way

A benchmark can look clean and still be invalid. Two concrete failures forced design changes.

  • We had runs where the repo under test could see Codex session residue. That polluted search results, so runtime artifacts were moved fully outside the workspace.
  • We had a run where the model inspected local git history and effectively found the hidden upstream fix. That made the result unusable, so the directed policy now forbids git log, git show, git blame, and direct oracle access during issue-backed tasks.

The harness is now designed around isolation first: pinned repo state, explicit task files, hidden reference fix, and no accidental leakage from the evaluator’s environment.
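The directed policy's command filter can be sketched as a simple allowlist check, assuming shell commands are screened before execution. The function name and structure are hypothetical; the forbidden subcommands mirror the list above.

```python
import shlex

# git subcommands that could leak the hidden upstream fix during
# issue-backed tasks, per the directed policy described above.
FORBIDDEN_GIT = {"log", "show", "blame"}

def command_allowed(command: str) -> bool:
    """Reject shell commands that would expose oracle history."""
    parts = shlex.split(command)
    if len(parts) >= 2 and parts[0] == "git" and parts[1] in FORBIDDEN_GIT:
        return False
    return True
```

A real policy would also need to catch indirect access (aliases, `git -c`, reading `.git/` directly), which is part of why isolation-first design — pinning state and hiding the reference fix outside the workspace — matters more than filtering alone.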

Verification had to become narrower and more honest

A broad “run everything” verify step is often the wrong measurement. For write tasks, the harness pushes the model toward the exact narrow test or build command that proves the intended fix. That keeps the signal tied to the engineering task instead of letting noise from unrelated failures dominate the result.

We also changed how we judge patch scope. For puncture-backed tasks, the repo is already dirty after setup because the harness injected a regression. So the right baseline is the post-setup state, not raw final git status.
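The post-setup baseline can be sketched as a diff against a ref recorded right after the harness injects the regression. The `post_setup_ref` convention is an assumption for illustration; the point is only that scope is measured from that ref, not from a clean tree.

```python
import subprocess

def changed_files(repo_dir: str, post_setup_ref: str) -> set[str]:
    """Files the model changed relative to the post-setup state.

    Assumes the harness committed or tagged the repo as `post_setup_ref`
    immediately after injecting the regression, so the injected dirt is
    part of the baseline rather than counted against the model's patch.
    """
    out = subprocess.run(
        ["git", "diff", "--name-only", post_setup_ref],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return {line for line in out.stdout.splitlines() if line}
```

Scope discipline then becomes a set comparison: the changed files should be a subset of the expected fix surface.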

The eval stack now has different jobs

  • Tier 0: puncture smoke tests for harness sanity and version regression checks.
  • Tier 1: tool-accuracy checks for structural correctness and write safety.
  • Tier 2: directed tool-use evals, which are the main product benchmark.
  • Tier 3: broader daily-engineering tasks from real repos, kept as integration work rather than the only source of truth.

This separation matters. It stops easy benchmarks from masquerading as product insight.

What the harness is trying to prove now

The current harness is not trying to prove that yoyo makes every agent faster. That is not the product claim. The real claim is narrower and more defensible:

Current product thesis: yoyo should make codebase answers less hallucinated and more grounded on reads, and should make writes cleaner once the correct surface has been identified.

That is why one clean read-only result where Codex stayed on yoyo for 22 out of 22 tool calls matters. It is evidence that, under explicit engineering questions, the model will stay on the structural MCP surface when that surface is useful.
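The "22 out of 22" figure reduces to a simple adherence ratio over a run's tool-call log. A minimal sketch, assuming tool calls are recorded by name and that yoyo's MCP tools share a recognizable prefix (the `yoyo.` prefix here is an assumption, not the real naming scheme):

```python
def mcp_adherence(tool_calls: list[str]) -> float:
    """Fraction of a run's tool calls that stayed on the yoyo MCP surface.

    `tool_calls` is the ordered list of tool names invoked during a run;
    the `yoyo.` prefix is a hypothetical naming convention.
    """
    if not tool_calls:
        return 0.0
    on_surface = sum(1 for name in tool_calls if name.startswith("yoyo."))
    return on_surface / len(tool_calls)
```

An adherence of 1.0 over a nontrivial run is the quantitative form of the claim above: given explicit engineering questions, the model had no reason to leave the structural surface.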

The harness is part of the product

A bad benchmark teaches a product team to optimize the wrong thing. A better benchmark does something else: it clarifies what the product is actually supposed to do.

In our case, the harness forced the product story to get sharper. Search was not the moat. Autonomous one-shot repair was not the point. The important behavior was grounded judgment on the read side and disciplined execution on the write side.

That is what the harness is designed to surface now.