Three proofs the harness matters as much as the model
Mozilla’s agents found 13x more Firefox bugs, Zenith won 5 of 8 tasks at 43% of baseline cost, and Eugene Yan shared his workflow. The signal: invest in your harness, not the model.
In May 2026, while you were debating whether Opus 4.7 still earns its premium over Kimi K2.6, Mozilla announced that its agents had found 423 security bugs in Firefox in a single month. In 2025, that number sat between 20 and 30 a month. At least thirteen times more, without changing model providers.
That’s not an isolated data point. Two other results published that same month point in the same direction: the harness — the scaffolding around the model: prompts, tools, verification loops, context management — matters as much as the model itself. Sometimes more. If you’re shipping agents to production, this should change how you allocate next quarter’s hours.
Firefox: 13x more bugs without switching models
Mozilla went from 20-30 monthly fixes in 2025 to 423 in April 2026. When Simon Willison breaks down how they pulled it off, he doesn’t credit a new Anthropic model. He says it explicitly: “success depended equally on dramatically improved techniques for harnessing these models — steering them, scaling them, and stacking them.”
The detail matters. If the gain came from the model, other teams with access to the same model would see the same curve. They don’t. The difference is what Mozilla built around it: how they steer the agent, how they filter its output, how they chain invocations. It’s the harness, not Mythos.
There’s a secondary data point worth surfacing: many of the agent’s attempts were blocked by Firefox’s existing defenses. That means the 423 fixes are not amplified false positives — they’re real vulnerabilities that made it past a system designed to reject slop. The metric is honest. (source)
Zenith: 5 of 8 tasks at 43% cost
Zenith, an agent orchestration harness, won 5 of 8 tasks at 43% of baseline cost, per the Latent Space weekly briefing. Same underlying model. Different agentic architecture.
Read it again. Same model, better harness: cost goes down and win rate goes up at the same time. That’s exactly the kind of improvement a model swap almost never delivers. Switching from Opus to Kimi may cut your cost 5x, sure, but typically with a quality drop. Here there’s no tradeoff, because the lever isn’t the model. (source)
The implication is stronger than it looks. Last year’s consensus said: “pick the most capable model you can afford and then polish the prompt.” What Zenith — and Mozilla with steering/scaling/stacking — show is that this sequence has flipped. Today the lever is in the orchestration. The model is the swappable piece.
Eugene Yan: the workflow as infrastructure
Eugene Yan published how he works with Claude, and it’s not a benchmark: it’s a practitioner sharing his system. Read carefully, it’s a manual on how to build your own personal harness.
The system has concrete pieces, not abstractions:
- Directories ~/src and ~/vault, with an annotated INDEX.md in each. Treat each new session as onboarding someone. The agent’s memory doesn’t exist by default; you manufacture it.
- CLAUDE.md hierarchy with three scopes: global (~/.claude/), per-repo, and per-project. Each level encodes a behavior contract. What in 2025 was a prompt repeated every session is now versioned configuration.
- Shift verification left. Cheap deterministic linters and hooks before expensive evals. Let the model run its own verification loop when it has something to verify against.
- Parallel sessions instead of pair-programming. Yan runs 3-6 at a time. The unit of work stops being “one task with the agent” and becomes “a queue of tasks.”
- Mining transcripts. He’s analyzed ~2,500 past turns looking for recurring correction patterns (“can you also…”, “still wrong”) and converts them into new CLAUDE.md rules.
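To make the transcript-mining item concrete, here is a minimal Python sketch. It is not Yan’s tooling: the pattern labels, the min_hits threshold, and the sample turns are all hypothetical, and only the two quoted phrases come from his description. It counts how often each correction phrase shows up in your own turns and flags the ones that recur.

```python
import re
from collections import Counter

# Hypothetical labels; only the two phrases themselves are quoted in the article.
CORRECTION_PATTERNS = {
    "scope_creep": re.compile(r"\bcan you also\b", re.IGNORECASE),
    "repeat_fix": re.compile(r"\bstill wrong\b", re.IGNORECASE),
}


def count_corrections(user_turns):
    """Count how many user turns match each correction pattern."""
    counts = Counter()
    for turn in user_turns:
        for label, pattern in CORRECTION_PATTERNS.items():
            if pattern.search(turn):
                counts[label] += 1
    return counts


def candidate_rules(counts, min_hits=2):
    """Return the labels that recur often enough to deserve a written rule."""
    return sorted(label for label, hits in counts.items() if hits >= min_hits)


# Made-up sample turns, standing in for an exported transcript.
turns = [
    "Can you also update the tests?",
    "That's still wrong, the date parsing fails.",
    "Looks good, thanks.",
    "Still wrong. Same error as before.",
    "can you also add a changelog entry",
]
counts = count_corrections(turns)
print(candidate_rules(counts))
```

Each flagged label is a prompt to write one rule by hand. The script only tells you where you keep repeating yourself; deciding what the rule should say is still your job.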
None of those investments depend on the underlying model. All of it is harness. And all of it compounds: a tuned CLAUDE.md is worth more tomorrow than today, while the model you pick today probably won’t be the model of tomorrow. (source)
Generating was cheap, verifying was costly: the asymmetry flipped
Simon Willison, analyzing the Firefox case, drops a line about the economics of LLM outputs that deserves its own section:
“What was previously cheap-to-generate but expensive-to-verify has been rebalanced through better model control and signal-filtering approaches.”
Through 2024 and most of 2025, that asymmetry defined the ceiling of agents in production. Generating a thousand suggestions was trivial; auditing them to find the two good ones cost more than the original problem. That’s why so many agent pilots stayed pilots.
What matters in the sentence is the verb: has been rebalanced. It did not rebalance on its own; it rebalanced because someone built better model control and signal filtering. If your harness is still the one from 2025 (a loose prompt, manual review of every output, no automatic filters), your economics are still 2025’s. The model doesn’t pull you out. Your architecture does.
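One way to picture an automatic filter is a tiered pipeline: cheap deterministic checks run first, and only the survivors reach the expensive verifier. The Python sketch below is an illustration under stated assumptions, not Mozilla’s pipeline. The Finding type, the check functions, and the replay_crashes stand-in are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Finding:
    title: str
    reproducer: str  # input the agent claims will trigger the bug


Check = Callable[[Finding], bool]


def has_reproducer(finding: Finding) -> bool:
    """Cheap check: reject findings that come with no reproducer at all."""
    return bool(finding.reproducer.strip())


def make_dedup_check() -> Check:
    """Cheap check: reject findings whose normalized title was already seen."""
    seen = set()

    def check(finding: Finding) -> bool:
        key = finding.title.strip().lower()
        if key in seen:
            return False
        seen.add(key)
        return True

    return check


def replay_crashes(finding: Finding) -> bool:
    """Hypothetical stand-in for an expensive step such as a sandboxed replay."""
    return "crash" in finding.reproducer


def filter_findings(
    findings: Iterable[Finding], cheap_checks: List[Check], expensive_check: Check
) -> Tuple[List[Finding], int]:
    """Run cheap checks first; call the expensive check only on survivors."""
    kept, expensive_calls = [], 0
    for finding in findings:
        if not all(check(finding) for check in cheap_checks):
            continue  # dropped before spending any expensive verification
        expensive_calls += 1
        if expensive_check(finding):
            kept.append(finding)
    return kept, expensive_calls


findings = [
    Finding("Heap overflow in parser", "crash: oversized header"),
    Finding("heap overflow in parser", "crash: oversized header"),  # duplicate
    Finding("Possible race", ""),  # no reproducer
    Finding("Integer underflow", "no visible effect"),  # fails the replay
]
kept, calls = filter_findings(
    findings, [has_reproducer, make_dedup_check()], replay_crashes
)
print(len(kept), calls)
```

In this toy run, four generated findings cost only two expensive verifications and leave one result for a human to read. The ordering is the design choice: every finding the cheap checks reject is one you never pay to verify.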
How much time on the model, how much on the harness
If you ship agents to production, do the honest math: in the last quarter, how many hours did you spend comparing models, reading Kimi vs Opus benchmarks, measuring latencies? And how many did you spend refining your CLAUDE.md, writing new tools, instrumenting the loop with observability?
The next round of competitive improvements — the one separating the teams that ship agents to production from the ones still in demo — won’t be defined by who runs the newest model. It’ll be defined by who built the most mature harness around it.
What piece of your harness is taking longer than it should?