Three proofs the harness matters as much as the model

Mozilla’s agents found 13x more Firefox bugs, Zenith won 5 of 8 tasks at 43% of baseline cost, and Eugene Yan shared his workflow. The signal: invest in your harness, not just the model.

By Santiago Mansilla. Published May 11, 2026; updated Jun 1, 2026. 4 min read.

In May 2026, while you were debating whether Opus 4.7 still earns its premium over Kimi K2.6, Mozilla announced that its agents had found 423 security bugs in Firefox in a single month. In 2025, the monthly figure sat between 20 and 30. That is at least thirteen times more, without changing model providers.

That’s not an isolated data point. Two other results published that same month point in the same direction: the harness (the scaffolding around the model: prompts, tools, verification loops, context management) matters as much as the model itself. Sometimes more. If you’re shipping agents to production, this should change how you allocate next quarter’s hours.

Firefox: 13x more bugs without switching models

Mozilla went from 20-30 monthly fixes in 2025 to 423 in April 2026. When Simon Willison breaks down how they pulled it off, he doesn’t credit a new Anthropic model alone. He says it explicitly: “success depended equally on dramatically improved techniques for harnessing these models — steering them, scaling them, and stacking them.”

The detail matters. If the gain came from the model alone, other teams with access to the same model would see the same curve. They don’t. The difference is what Mozilla built around it: how they steer the agent, how they filter its output, how they chain invocations. The harness carries at least as much of the gain as Mythos.

There’s a secondary data point worth surfacing: many of the agent’s attempts were blocked by Firefox’s existing defenses. That means the 423 fixes are not amplified false positives — they’re real vulnerabilities that made it past a system designed to reject slop. The metric is honest. (source)

Zenith: 5 of 8 tasks at 43% cost

Zenith, Intelligent Internet’s agent orchestration harness, won 5 of 8 tasks at 43% of baseline cost, according to Intelligent Internet. Same underlying model. Different agentic architecture.

Read it again. Same model, better harness, and cost goes down while win rate goes up. That’s exactly the kind of improvement a model swap almost never delivers: switching from Opus to Kimi may cut your cost 5x, sure, but typically with a quality drop. Here there’s no tradeoff, because the lever isn’t the model.

The implication is stronger than it looks. Last year’s consensus said: “pick the most capable model you can afford and then polish the prompt.” What Zenith — and Mozilla with steering/scaling/stacking — show is that this sequence has flipped. Today the lever is in the orchestration. The model is the swappable piece.

Eugene Yan: the workflow as infrastructure

Eugene Yan published how he works with Claude, and it’s not a benchmark: it’s a practitioner sharing his system. Read carefully, it’s a manual on how to build your own personal harness.

The system has concrete pieces, not abstractions:

Directories ~/src and ~/vault with annotated INDEX.md in each. Treat each new session as onboarding someone. The agent’s memory doesn’t exist by default; you manufacture it.
CLAUDE.md hierarchy with three scopes: global (~/.claude/), per-repo, and per-project. Each level encodes a behavior contract. What in 2025 was a prompt repeated every session is now versioned configuration.
Shift verification left. Cheap deterministic linters and hooks before expensive evals. Let the model run its own verification loop when it has something to verify against.
Parallel sessions instead of pair-programming. Yan runs 3-6 at a time. The unit of work stops being “one task with the agent” and becomes “a queue of tasks.”
Mining transcripts. He’s analyzed ~2,500 past turns looking for recurring correction patterns — “can you also…”, “still wrong” — and converts them into new CLAUDE.md rules.
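The transcript-mining item above can be sketched in a few lines of Python. This is a minimal illustration, not Yan’s actual tooling: it assumes transcripts are already available as a list of user-turn strings, the first two phrases come from the article, and the third phrase, the function names, and the `min_hits` threshold are hypothetical.

```python
import re
from collections import Counter

# Correction phrases to look for. "can you also" and "still wrong" come from
# the article; "i said" is an illustrative assumption.
CORRECTION_PATTERNS = {
    "can you also": re.compile(r"\bcan you also\b", re.IGNORECASE),
    "still wrong": re.compile(r"\bstill wrong\b", re.IGNORECASE),
    "i said": re.compile(r"\bi said\b", re.IGNORECASE),
}


def count_corrections(user_turns):
    """Count how many user turns contain each correction phrase."""
    counts = Counter()
    for turn in user_turns:
        for label, pattern in CORRECTION_PATTERNS.items():
            if pattern.search(turn):
                counts[label] += 1
    return counts


def rule_candidates(user_turns, min_hits=2):
    """Return phrases that recur often enough to deserve a written rule."""
    counts = count_corrections(user_turns)
    return [(label, n) for label, n in counts.most_common() if n >= min_hits]


if __name__ == "__main__":
    sample_turns = [
        "Can you also update the tests?",
        "Still wrong, the date format is ISO 8601.",
        "can you also run the linter before committing",
        "Looks good, thanks.",
    ]
    for label, hits in rule_candidates(sample_turns):
        print(f"{label}: {hits} turns")  # can you also: 2 turns
```

The output is only a shortlist. Turning a recurring phrase into a CLAUDE.md rule still means reading the surrounding turns to see what was actually being corrected.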

None of those investments depend on the underlying model. All of it is harness. And all of it compounds: a tuned CLAUDE.md is worth more tomorrow than today, while the model you pick today probably won’t be the model of tomorrow. (source)
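The “shift verification left” item in the list above is the easiest one to show in code. The sketch below is an assumption-heavy illustration rather than anyone’s real pipeline: the `Check` type, the cost numbers, and the three example checks are all hypothetical, and `expensive_eval` is a placeholder for whatever slow eval you actually run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    name: str
    cost: float                 # relative cost in made-up units
    run: Callable[[str], bool]  # True when the artifact passes


def compiles(source: str) -> bool:
    """Cheapest check: does the generated Python even parse?"""
    try:
        compile(source, "<agent-output>", "exec")
    except SyntaxError:
        return False
    return True


def no_todo_markers(source: str) -> bool:
    """Cheap lint-style rule: reject output that leaves TODOs behind."""
    return "TODO" not in source


def expensive_eval(source: str) -> bool:
    """Placeholder for a slow, costly eval; always passes in this sketch."""
    return True


# Deliberately listed out of order to show that the gate sorts by cost.
CHECKS = [
    Check("llm_eval", 100.0, expensive_eval),
    Check("syntax", 1.0, compiles),
    Check("no_todo", 2.0, no_todo_markers),
]


def run_gate(artifact: str, checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run checks cheapest first and stop at the first failure."""
    executed: List[str] = []
    for check in sorted(checks, key=lambda c: c.cost):
        executed.append(check.name)
        if not check.run(artifact):
            return False, executed
    return True, executed


if __name__ == "__main__":
    print(run_gate("def f(:\n    pass\n", CHECKS))  # (False, ['syntax'])
    print(run_gate("x = 1\n", CHECKS))  # (True, ['syntax', 'no_todo', 'llm_eval'])
```

The second element of the result lists which checks actually ran, so you can see that broken output never reaches the expensive step.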

Generating was cheap, verifying was costly: the asymmetry flipped

Simon Willison, analyzing the Firefox case, drops a line about the economics of LLM outputs that deserves its own section:

“What was previously cheap-to-generate but expensive-to-verify has been rebalanced through better model control and signal-filtering approaches.”

Through 2024 and most of 2025, that asymmetry defined the ceiling of agents in production. Generating a thousand suggestions was trivial; auditing them to find the two good ones cost more than the original problem. That’s why so many agent pilots stayed pilots.

What matters in the sentence is the verb: has been rebalanced. Not automatically. It rebalanced because someone built better model control and signal-filtering. If your harness is still the one from 2025 (a loose prompt, manual review of every output, no automatic filters), your economics are still 2025’s. The model doesn’t pull you out. Your architecture does.
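To make “signal-filtering” concrete, here is a toy sketch of an automatic filter that sits between the agent and the human reviewer. It is not Mozilla’s pipeline. The `Finding` shape, the deduplication rule, and the `toy_target` function are all hypothetical, chosen only to show the idea: a claim reaches a person only after a deterministic check confirms it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Finding:
    location: str     # where the agent says the bug lives (hypothetical format)
    description: str
    reproducer: str   # input the agent claims triggers the bug


def filter_findings(
    findings: Iterable[Finding],
    reproduces: Callable[[Finding], bool],
) -> List[Finding]:
    """Keep at most one finding per location, and only if it reproduces."""
    confirmed_locations = set()
    kept: List[Finding] = []
    for finding in findings:
        if finding.location in confirmed_locations:
            continue  # duplicate of something already confirmed
        if not finding.reproducer:
            continue  # nothing to verify against
        if not reproduces(finding):
            continue  # the claim did not hold up
        confirmed_locations.add(finding.location)
        kept.append(finding)
    return kept


def toy_target(text: str) -> int:
    """Stand-in for the system under test; raises ValueError on bad input."""
    return int(text)


def crashes_toy_target(finding: Finding) -> bool:
    """Deterministic verifier: does the reproducer really raise?"""
    try:
        toy_target(finding.reproducer)
    except ValueError:
        return True
    return False


SAMPLE_FINDINGS = [
    Finding("parser.parse_len", "crash on non-numeric input", "abc"),
    Finding("parser.parse_len", "same crash, reported again", "xyz"),
    Finding("net.fetch", "claimed overflow", "42"),
    Finding("ui.render", "vague suspicion, no reproducer", ""),
]

if __name__ == "__main__":
    kept = filter_findings(SAMPLE_FINDINGS, crashes_toy_target)
    print(f"{len(kept)} of {len(SAMPLE_FINDINGS)} findings reach a human reviewer")
    # 1 of 4 findings reach a human reviewer
```

In this made-up sample, three of four candidates never cost a reviewer any time. That ratio is illustrative only; the point is that the filter, not the model, decides how much manual review you pay for.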

How much time on the model, how much on the harness

If you ship agents to production, do the honest math: in the last quarter, how many hours did you spend comparing models, reading Kimi vs Opus benchmarks, measuring latencies? And how many did you spend refining your CLAUDE.md, writing new tools, instrumenting the loop with observability?

The next round of competitive improvements — the one separating the teams that ship agents to production from the ones still in demo — won’t be defined by who runs the newest model. It’ll be defined by who built the most mature harness around it.

What piece of your harness is taking longer than it should?
