The UK's AI Security Institute (AISI) tested frontier models across seven benchmarks with varying compute budgets. The finding: fixed budget caps systematically underestimate how capable AI agents really are.
An AI agent's performance is a curve that rises with test-time compute, the amount of computation, counted here in tokens, that an agent is allowed to burn while working on a task. Cut the budget while the curve is still climbing, and the measured score is a lower bound on what the agent can do, not its ceiling.
That's what the AISI researchers set out to prove in their latest work. The big question: how much do capabilities scale with compute, and what does that mean for cybersecurity?
The effect shows up across domains. In cybersecurity, about 8 percent of tasks were only solved when the budget exceeded 10 million tokens; some even required 50 million. The newest models hit even higher scores at budgets above 100 million tokens.
On software engineering tasks (TerminalBench 2.0, SWE-Bench Pro), success rates jumped about 25 percent when the token budget went from one million to ten million. For math and academic tasks (Humanity's Last Exam), the gain was around 22 percent up to a budget of five million tokens.
Extra compute doesn't help everywhere equally. On HealthBench, a medical task benchmark, all models hit their plateau within the standard budget. According to AISI, more compute helps most where agents can verify their own work, like running code or testing an exploit. But it barely moves the needle where feedback is missing or delayed.
Another finding ties the time a human expert needs for a task to the agent's token consumption. Across 211 software engineering tasks from the research institute METR and 78 cyber tasks from AISI, this relationship follows a power law. A one-minute task costs the agent thousands of tokens. A one-hour task costs millions. A one-week task costs billions.
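To make the power-law claim concrete, here is a minimal sketch of how such a relationship can be fitted with a straight line in log-log space. The three data points are not AISI or METR measurements; they are illustrative anchors taken from the article's order-of-magnitude statements (thousands, millions, billions), and treating "one week" as a full calendar week is an assumption. The function names are hypothetical.

```python
import math

def fit_power_law(minutes, tokens):
    """Least-squares fit of tokens = a * minutes**b, done in log-log space."""
    xs = [math.log(m) for m in minutes]
    ys = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # exponent of the power law
    a = math.exp(mean_y - b * mean_x)    # tokens for a one-minute task
    return a, b

# Illustrative anchors only (assumption): one minute, one hour, one calendar week.
human_minutes = [1, 60, 7 * 24 * 60]
agent_tokens = [1e3, 1e6, 1e9]

a, b = fit_power_law(human_minutes, agent_tokens)

def predicted_tokens(minutes):
    """Token cost the fitted curve assigns to a task of the given human duration."""
    return a * minutes ** b

print(f"exponent b = {b:.2f}")
print(f"predicted tokens for a 20-hour task: {predicted_tokens(20 * 60):.3g}")
```

An exponent above 1 means token cost grows faster than human time: doubling the task length more than doubles the tokens the agent needs, which is why a fixed cap bites hardest on long tasks.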
A fixed evaluation budget therefore cuts off the longest and hardest tasks. Failure can mean the budget was too tight, not that the agent lacked the skill. AISI points to the cyber task "The Last Ones", which takes a human expert about 20 hours. No tested model could solve it with fewer than 30 million tokens.
Newer models benefit from extra compute far more than older ones, according to the study. The capability curve shifts upward with each generation and changes shape along three axes: reach (harder tasks become solvable), reliability (the same task gets solved more often), and efficiency (the same task needs fewer tokens).
A current frontier model's time horizon grew from about 40 minutes at a budget of 2.5 million tokens to roughly four hours at 50 million tokens. Across the entire frontier, the horizon shifts from about two hours to 14 hours when the budget jumps from 2.5 to 50 million tokens.
AISI had previously estimated that the time horizon of frontier models on cyber tasks doubles roughly every 4.7 months, measured at a fixed budget of 2.5 million tokens. At 50 million tokens, the trend is about 60 percent steeper. Doubling happens every 40 to 50 days instead of every 67 to 91 days.
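The link between "60 percent steeper" and a shorter doubling time is simple arithmetic, sketched below. It assumes "steeper" refers to the slope of the trend on a logarithmic scale, so that a slope 1.6 times as large divides the doubling time by 1.6. The function names are hypothetical and this is not AISI's own calculation.

```python
def doubling_time_after_steepening(base_doubling_days, steeper_by):
    """Doubling time once the log-scale growth slope is (1 + steeper_by) times larger."""
    return base_doubling_days / (1.0 + steeper_by)

def growth_factor(days, doubling_days):
    """How many times larger the time horizon gets over the given number of days."""
    return 2 ** (days / doubling_days)

# Baseline range quoted in the article for the lower budget: 67 to 91 days.
for base in (67, 91):
    faster = doubling_time_after_steepening(base, 0.60)
    print(f"{base} days -> {faster:.1f} days")

# Effect over one year (assumption: the trend simply continues unchanged).
print(f"growth in a year at 67-day doubling: {growth_factor(365, 67):.0f}x")
```

Applied to the 67-to-91-day range, this gives roughly 42 to 57 days, in the same ballpark as the 40 to 50 days quoted for the higher budget.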
The estimated doubling rate is partly a product of the evaluation budget you pick, not a fixed property of frontier progress, AISI says. Progress isn't uniform, though. On about 10 to 30 percent of tasks, newer models actually scored worse than their predecessors.
For AISI, the main lesson is about how you measure. "If we keep treating capability as a fixed score rather than a curve over compute, we will keep being surprised by what these systems can do when more is spent on them."
Test a model with too small a budget, and you get a score that skews decisions about deployment, economic value, and risk. Falling costs per token could also make higher test-time budgets more accessible, meaning capabilities that once seemed unaffordable could get cheaper and easier to reach over time. That would make measurements that factor in compute budgets even more important.
AISI now runs frontier models through tests at several different budgets. The idea behind these "minimum informative budgets" is to check whether a model's reach stops growing with extra compute; only then does a result count as meaningful. The team is also trying to figure out how to predict high-budget performance from cheaper test runs.
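One way to picture the plateau check is the sketch below. The article does not describe AISI's actual criterion, so the rule used here (no larger tested budget improves the score by more than a small tolerance) is an assumption, as are the function name, the tolerance, and the made-up scores.

```python
def minimum_informative_budget(results, tolerance=0.01):
    """Smallest tested budget beyond which no larger budget improves the score
    by more than `tolerance`. Returns None if the score is still climbing at
    the largest budget tested, i.e. the result is only a lower bound.

    results: list of (budget_in_tokens, score) pairs, score between 0 and 1.
    """
    ordered = sorted(results)
    for i, (budget, score) in enumerate(ordered[:-1]):
        best_later = max(s for _, s in ordered[i + 1:])
        if best_later - score <= tolerance:
            return budget
    return None

# Made-up scores for illustration only (assumption, not AISI data).
plateaued = [(1_000_000, 0.40), (2_500_000, 0.55), (10_000_000, 0.60), (50_000_000, 0.60)]
still_climbing = [(1_000_000, 0.20), (10_000_000, 0.40), (50_000_000, 0.60)]

print(minimum_informative_budget(plateaued))
print(minimum_informative_budget(still_climbing))
```

In the second case the function returns None: the curve has not flattened, so the score at the largest budget says what the model can do at least, not what it can do at most.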