Your accounting agent isn't incapable, it's unreliable
Aggregate pass@1 falls from 76% to 52% as tasks get longer. In an agent that posts to your ledger, that gap is mis-stated entries. Capability isn't reliability.
An agent that posts entries to your general ledger and “succeeds 76% of the time” isn’t 76% good. In accounting, the 24% that fail are mis-stated entries, wrong VAT, and reconciliations that only pretend to balance, and they end up at the tax authority. And it turns out that 76% (the number your benchmark gives you, the standardized test you compare models with) isn’t even the one that matters.
The paper Beyond pass@1 (March 2026) logged 23,392 runs of ten open-source models across 396 tasks and measured something your benchmark doesn’t: how reliability falls as the task gets longer, from 76.3% on short tasks to 52.1% on long ones. It separates two things we treat as one: capability (does it succeed in a single attempt?) and reliability (does it succeed consistently on real-duration tasks?). In a system that touches money, only the second one counts. Here it is, unpacked, with the action each finding implies.
Capability isn’t reliability: pass@1 vs pass^k
A benchmark tells you your agent solves 76% of tasks. That number is pass@1: you give each task a single attempt and count what percentage it solves. It’s the score almost every benchmark reports. But in production your accounting agent doesn’t reconcile an invoice once — it reconciles it thousands of times each close, and what matters is that it succeeds every time.
The honest metric for that is pass^k (read “pass to the k”): you give the same task k attempts, say 5, and count it as a success only if the agent solves it all 5 times. It measures consistency, not luck. The paper’s Reliability Decay Curve plots pass^k against task duration. And the uncomfortable detail is that the decay is super-linear: it doesn’t fall in a straight line but faster and faster, because failures aren’t independent; they’re correlated. An agent that slips once in a close tends to slip again in the same close.
The action is direct: stop reporting pass@1 and measure pass^k with k≥3 repeats per task. Run each eval several times and check how many reconciliations pass across all runs, not in one.
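To make the difference concrete, here is a minimal sketch that computes both numbers from the same eval log. It assumes each task’s repeated runs are stored as a list of booleans; the `runs` layout, the task names, and the function names are illustrative, not taken from the paper, and `pass_pow_k` is the simplest empirical version (all of the first k runs must pass).

```python
from typing import Dict, List

def pass_at_1(runs: Dict[str, List[bool]]) -> float:
    """Average single-attempt success rate across tasks."""
    per_task = [sum(r) / len(r) for r in runs.values()]
    return sum(per_task) / len(per_task)

def pass_pow_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Share of tasks solved on every one of their first k attempts."""
    if any(len(r) < k for r in runs.values()):
        raise ValueError("every task needs at least k recorded runs")
    return sum(all(r[:k]) for r in runs.values()) / len(runs)

# Made-up eval log: 4 tasks, 3 runs each.
runs = {
    "reconcile_bank_feed": [True, True, True],
    "post_vat_entry": [True, False, True],
    "match_invoice": [True, True, False],
    "accrue_payroll": [True, True, True],
}

print(round(pass_at_1(runs), 3))  # 0.833
print(pass_pow_k(runs, 3))        # 0.5
```

On this toy log the agent looks 83% capable but only 50% reliable: two of the four tasks fail at least once in three runs.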
Human-estimated duration isn’t agent complexity
Reliability doesn’t decay uniformly across domains, and that’s where it gets counterintuitive. In software engineering it roughly halves as the task grows: the Graceful Degradation Score (GDS), a 0-to-1 grade for how much of the work the agent completed that weights critical subtasks more, collapses from 0.90 to 0.44. In document processing it barely moves: 0.74 to 0.71.
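As a rough illustration of a partial-credit score of this kind, the sketch below computes a weighted share of completed subtasks. It is not the paper’s exact formula: the `Subtask` type, the weights, and the subtask names are hypothetical, chosen only to show how weighting critical steps changes the grade.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Subtask:
    name: str
    weight: float  # hypothetical importance weight; critical steps get more
    done: bool

def weighted_completion(subtasks: List[Subtask]) -> float:
    """Weighted share of completed subtasks, between 0 and 1."""
    total = sum(s.weight for s in subtasks)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(s.weight for s in subtasks if s.done) / total

# Made-up month-end close: three of four steps done, but a critical one failed.
close = [
    Subtask("import bank statements", 1.0, True),
    Subtask("reconcile accounts", 3.0, True),
    Subtask("post VAT entries", 3.0, False),
    Subtask("generate trial balance", 1.0, True),
]

print(weighted_completion(close))  # 0.625
```

An unweighted count would give 0.75 here; the weighting pulls the score down because the missed step is one of the critical ones.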
The mechanism: the duration a human estimates for a task and the complexity that task carries for the agent are two different things. A month-end close is a long, chained task —each step depends on the last, like the two-hour software job— so it collapses reliability; “categorize 500 receipts” is repetitive and tolerates the long horizon without degrading, like document processing.
That’s why an eval on five-minute atomic entries tells you nothing about your real use. Measure the Reliability Decay Curve on the duration tier you actually run — if your agent runs a full close, evaluate it on a full close, not on a single entry.
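One way to get that curve from your own evals is to bucket tasks by duration tier and compute pass^k inside each bucket. The sketch below assumes each record is a (tier, list of run outcomes) pair; the tier names and the data are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (duration tier, outcomes of the repeated runs of one task)
Record = Tuple[str, List[bool]]

def decay_curve(records: List[Record]) -> Dict[str, float]:
    """pass^k per tier: share of tasks in the tier that passed all their runs."""
    by_tier: Dict[str, List[bool]] = defaultdict(list)
    for tier, outcomes in records:
        by_tier[tier].append(all(outcomes))
    return {tier: sum(passed) / len(passed) for tier, passed in by_tier.items()}

# Made-up records, listed from shortest tier to longest, 3 runs per task.
records = [
    ("single entry", [True, True, True]),
    ("single entry", [True, True, True]),
    ("account reconciliation", [True, True, True]),
    ("account reconciliation", [True, False, True]),
    ("full close", [True, True, False]),
    ("full close", [False, True, True]),
]

for tier, score in decay_curve(records).items():
    print(f"{tier}: {score:.2f}")
```

If the full-close tier is the one you run in production, that last row is the number to report, whatever the single-entry row says.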
More variance means more capability, not more noise
The Beyond pass@1 paper breaks a strong intuition about variance. Frontier models —the most capable ones around— amplify variance (how much the result changes from one run to the next) more than 2x going from short to long tasks. That’s the idea of the Variance Amplification Factor (VAF): how many times more unpredictable results become on long tasks versus short ones. Frontier models reach a VAF ≥ 2.37 (DeepSeek V3 2.49, MiniMax M2.5 2.60); mid-tier ones stay at ≤ 1.26 (barely any change). The two groups split cleanly, no overlap.
The natural read is “more variance” = “more unstable, worse.” It’s the opposite: “high variance amplification is a capability signature, not an instability signature”. Only a capable model produces mixed results on long closes; a weak one fails uniformly, with no variance, because it never even gets close.
It changes how you read your dashboards: when a capable model gives you high variance on a long close, don’t treat it as noise to suppress. Measure that factor and read it as signal — a high value tells you the model is attempting ambitious strategies, not that it’s broken.
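A minimal sketch of how you might track such a factor on your own runs, assuming it is the ratio of score variance on long tasks to score variance on short tasks. That definition is an assumption made for illustration; the paper’s exact formula may differ (for example, it could use standard deviations). The scores below are made up.

```python
from statistics import pvariance
from typing import Sequence

def variance_amplification(short_scores: Sequence[float],
                           long_scores: Sequence[float]) -> float:
    """Ratio of run-to-run score variance on long tasks vs short tasks."""
    short_var = pvariance(short_scores)
    if short_var == 0:
        raise ValueError("short-task scores have zero variance; ratio undefined")
    return pvariance(long_scores) / short_var

# Made-up per-run scores for one model.
short_runs = [0.9, 0.8, 0.9, 0.8]
long_runs = [0.7, 0.5, 0.7, 0.5]

vaf = variance_amplification(short_runs, long_runs)
print(round(vaf, 2))  # 4.0
```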
The ‘add memory’ reflex hurt 6 of 10 models
The 2026 reflex when an agent gets lost mid-close is “give it memory.” The paper tested exactly that: a scaffold —the code-and-prompts framing around the model— with a notebook where the agent jots down what it does, against a bare ReAct (the basic reason → act → observe loop, no extra memory). The result is blunt: memory never improved long-horizon reliability, and hurt 6 of 10 models. Kimi K2.5 lost 0.14 of GDS; Mistral 24B, 0.13.
The mechanism is budget: that notebook consumes step budget —how many actions the agent can take before stopping— and space in the context window, the model’s limited working memory. Exactly the two scarcest resources on a long task. What you gain in “remembering” you lose in steps and space.
It’s not that memory never helps. It’s that it isn’t free and isn’t a safe default. Before deploying a memory system in your close agent, run an A/B test on your own closes: the same task with and without memory, side by side. If you don’t calibrate case by case, you’re most likely spending context to make things worse.
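A paired comparison like that can be as small as the sketch below. It assumes you already have one score per task for each configuration (for instance a completion score between 0 and 1); the task names, the scores, and the `ab_compare` helper are hypothetical.

```python
from typing import Dict

def ab_compare(bare: Dict[str, float], with_memory: Dict[str, float]) -> Dict[str, float]:
    """Compare two configurations on the tasks they both ran."""
    shared = sorted(bare.keys() & with_memory.keys())
    if not shared:
        raise ValueError("no tasks in common between the two runs")
    deltas = [with_memory[t] - bare[t] for t in shared]
    return {
        "tasks": len(shared),
        "mean_delta": sum(deltas) / len(deltas),  # negative means memory hurt on average
        "hurt": sum(d < 0 for d in deltas),
        "helped": sum(d > 0 for d in deltas),
    }

# Made-up scores for the same three closes, without and with the memory scaffold.
bare = {"close_jan": 0.80, "close_feb": 0.70, "close_mar": 0.90}
with_memory = {"close_jan": 0.70, "close_feb": 0.75, "close_mar": 0.80}

result = ab_compare(bare, with_memory)
print(result["hurt"], result["helped"], round(result["mean_delta"], 3))  # 2 1 -0.05
```

Three closes are far too few to decide anything; the point is only that the comparison is paired, task by task, rather than two unrelated averages.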
Decomposition and meltdown detection: the real levers
Two interventions actually move long-task reliability, and neither is adding memory. The first: decompose. Splitting a long task into short subtasks and restarting at the boundaries is, per the authors, the highest-leverage intervention. For Qwen3 30B there’s a 41.5-point gap between its pass@1 on short tasks (75.8%) and on very-long ones (34.3%), and decomposition recovers much of that gap. Accounting plays to your advantage here: it’s already decomposable by account, by period, by client, so put a checkpoint at each reconciled segment instead of treating the close as one block.
The second is the juiciest paradox in the paper: frontier models melt down more, not less. They spiral into incoherent loops and collapse. DeepSeek V3 melts down on 19% of very-long tasks (median onset around step 17), MiniMax on 13%; everyone else, 0–4%. It’s not weakness: they pursue ambitious multi-step strategies and, when they get tangled, their tool calls (read an invoice, check a balance, post an entry…) turn chaotic. That measure of “chaos” is entropy, and the Meltdown Onset Point, the point where the collapse begins, is found by tracking that entropy over a sliding window of the last few steps.
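A minimal sketch of that idea: compute the Shannon entropy of tool-call names over a trailing window and flag the first step where it crosses a threshold. The window size, the threshold, and the tool names are assumptions picked for the example, not values from the paper.

```python
import math
from collections import Counter
from typing import List, Optional

def shannon_entropy(calls: List[str]) -> float:
    """Entropy in bits of the tool-name distribution in `calls`."""
    n = len(calls)
    return -sum((c / n) * math.log2(c / n) for c in Counter(calls).values())

def meltdown_onset(calls: List[str], window: int = 4,
                   threshold: float = 1.9) -> Optional[int]:
    """Index of the first step whose trailing window exceeds the threshold."""
    for end in range(window, len(calls) + 1):
        if shannon_entropy(calls[end - window:end]) > threshold:
            return end - 1
    return None

# Made-up trace: an orderly read/check/post cycle, then erratic calls.
trace = [
    "read_invoice", "check_balance", "post_entry",
    "read_invoice", "check_balance", "post_entry",
    "void_entry", "read_invoice", "list_accounts", "post_entry",
]

print(meltdown_onset(trace))  # 6
```

With a window of 4 and only three tools in the normal cycle, entropy sits at 1.5 bits; the first window containing four different tools reaches 2.0 bits and trips the alarm. A real agent would need a threshold calibrated on its own healthy traces.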
Instrument both things: split the close with checkpoints at each account or period, and measure that disorder in your agent’s tool calls. When the collapse alarm triggers, reset the context —save state, open a fresh window, continue— before the agent starts posting garbage to the ledger.
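The checkpoint-and-reset loop can be sketched as follows. The `run_segment` callable stands in for whatever invokes your agent on one segment with a fresh context; it, the segment names, and the checkpoint format are hypothetical.

```python
from typing import Callable, Dict, List

def run_close(segments: List[str],
              run_segment: Callable[[str, Dict[str, str]], str],
              checkpoint: Dict[str, str]) -> Dict[str, str]:
    """Process each segment with a fresh context, skipping checkpointed ones."""
    for seg in segments:
        if seg in checkpoint:
            continue  # already reconciled in an earlier run
        # Only the compact checkpoint is handed over, not the prior transcript.
        checkpoint[seg] = run_segment(seg, dict(checkpoint))
    return checkpoint

def fake_agent(segment: str, state: Dict[str, str]) -> str:
    """Stand-in for a real agent call on a single segment."""
    return f"reconciled with {len(state)} prior segment(s) in state"

# Resume a close where "cash" was already reconciled before a reset.
saved = {"cash": "reconciled with 0 prior segment(s) in state"}
state = run_close(["cash", "receivables", "payables"], fake_agent, saved)

for seg, status in state.items():
    print(seg, "->", status)
```

Because every segment starts from the saved state rather than from the accumulated conversation, a reset triggered by the collapse alarm costs at most one segment of work.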
The authors put it this way: “task decomposition is the highest-leverage reliability intervention”. The underlying shift is one of mindset: the number you chase isn’t “what percentage it solves on its best day,” it’s “how consistently it solves it when I run it a thousand times over the real duration of my close.”
How many times do you run your agent’s eval over a full close before you let it post to the ledger?