Claude Fable 5: The Model Built to Run for Hours
Anthropic’s most powerful public model is here, and it beats Opus 4.8 on nearly every benchmark. But the benchmark deltas are a distraction. The real disruption is the clock: Fable 5 can work autonomously for hours on a single task. That changes what an agent is allowed to attempt.
On June 9, 2026, Anthropic released Claude Fable 5 — its most powerful generally available model, and the first to sit a full tier above Opus. The launch coverage is wall-to-wall benchmark tables, and yes, Fable 5 wins almost all of them. But if you read the release as “Opus 4.8, but the scores went up,” you have missed the actual shift. The headline number is not on a leaderboard. It is a duration: Fable 5 can run autonomously, unattended, for thirty minutes to several hours on a single task.
That is the disruption. Not five points on SWE-bench. A model that holds its thread across a long-horizon job — planning, tool calls, dead ends, recovery — without a human in the loop turns the unit of work from “a prompt” into “a job you hand off and walk away from.” Everything interesting about Fable 5, and everything you need to redesign in your stack to use it, flows from that one capability.
What Anthropic Actually Shipped
Fable 5 is what Anthropic calls a “Mythos-class” model made safe for general use. There is a sibling, Mythos 5, which is the same underlying model with its safeguards lifted in certain areas — and it is not public. Mythos 5 is restricted to vetted cyberdefenders and infrastructure providers through a program Anthropic calls Project Glasswing. Fable 5, then, is Mythos for the rest of us: the frontier model, with guardrails bolted on so it can ship to everyone.
The specs are straightforward. The API identifier is claude-fable-5. It carries a 1M token context window and up to 128K tokens of output, the same envelope as Opus 4.8. Pricing is where the tier shows up: $10 per million input tokens and $50 per million output — exactly double Opus 4.8’s $5 / $25. You are paying a premium, and Anthropic is not hiding it.
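To make the price gap concrete, here is a minimal sketch of the per-request arithmetic at the list prices quoted above. The token counts are illustrative assumptions rather than measurements, and `request_cost_usd` and the dictionary keys are hypothetical names used only for this example.

```python
# List prices quoted in this article, in USD per million tokens: (input, output).
PRICES = {
    "fable-5": (10.0, 50.0),
    "opus-4.8": (5.0, 25.0),
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the token cost of one request at list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Illustrative workload (an assumption): 200K tokens in, 50K tokens out.
fable = request_cost_usd("fable-5", 200_000, 50_000)
opus = request_cost_usd("opus-4.8", 200_000, 50_000)
print(f"Fable 5: ${fable:.2f}  Opus 4.8: ${opus:.2f}  ratio: {fable / opus:.1f}x")
```

Because both the input and output prices double, the ratio stays at 2x for any mix of input and output tokens, as long as the token counts are the same on both models.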
The Number That Matters Isn't a Benchmark
Here is the contrast that reframes everything. A typical Opus 4.8 response on a coding task lands in three to fifteen seconds. Ask Fable 5 the same thing and you may wait sixty seconds to a few minutes. On a multi-step asynchronous run with tools — the kind where the model plans, edits, tests, and iterates — Fable 5 can churn for thirty minutes to several hours before it surfaces a result.
Read naively, that is a regression. Slower is worse. But latency is the wrong lens. Opus is fast because it answers turns. Fable is slow because it completes jobs. The minutes and hours are not the model being sluggish; they are the model doing work that previously required a human to babysit a loop — re-prompting, catching it when it claimed false progress, nudging it back on track. Fable 5’s whole design point is holding coherence long enough that you do not have to.
This is why the benchmark wins are real but secondary. Fable 5 is reported as state-of-the-art on nearly every public eval, and the shape of those wins is the tell: the lead is marginal on quick tasks and widens sharply on long, hard, multi-step ones. The longer and more complex the job, the bigger Fable’s margin over Opus. That is exactly the signature you would expect from a model whose edge is endurance, not raw single-shot smarts.
| Benchmark | Fable 5 | Opus 4.8 |
|---|---|---|
| SWE-Bench Verified (coding) | 95.0% | 88.6% |
| SWE-Bench Pro (hard, agentic) | 80.0% | 69.2% |
| Terminal-Bench 2.1 (long-horizon) | 88.0% | 82.7% |
| GDPval-AA (knowledge work, Elo) | 1932 | 1890 |
The qualitative firsts point the same direction. Fable 5 is reported as the first model to break 90% on a core analytics eval, a ten-point jump over Opus. It is also reported as the top scorer on Hebbia’s finance benchmark and Cognition’s FrontierBench, and as posting the best result on Harvey’s legal agent suites. These are not toy prompts. They are exactly the long, structured, professional jobs that benefit from a model that can stay on task.
Why Hours-Long Autonomy Breaks the Old Agent Playbook
For two years the dominant pattern for getting real work out of a model has been the harness: the engineered scaffolding of context, tools, verification loops, and memory wrapped around the model to keep it honest over long tasks. We have written at length about why the harness, not the model, became the source of agent power. Much of that scaffolding exists to compensate for a model that loses the plot after a few dozen steps — checkpoints to catch drift, re-prompts to recover, humans to confirm progress.
A model that sustains coherence for hours does not eliminate the harness, but it moves the work. You stop building scaffolding to prevent collapse and start building it to delegate and verify. The shape of an agent system tilts from “tight loop with a human watching” toward “hand off a well-specified job, check the result.” Concretely, three things change:
The job replaces the prompt
With Opus you nudge a fast model through a task interactively. With Fable you write the full specification once — goal, constraints, definition of done — and let it run. Underspecified prompts drip-fed over many turns waste a long-horizon model’s biggest advantage. Front-load the context.
Why it matters — When a single dispatch can run for hours, you specify outcomes and constraints up front, not step-by-step instructions.
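One way to enforce a front-loaded spec is to give it a fixed shape and refuse to dispatch until every section is filled in. The sketch below assumes a hypothetical `JobSpec` type built around the three parts named above (goal, constraints, definition of done). It is not an Anthropic API, and the example job is invented.

```python
from dataclasses import dataclass, field


@dataclass
class JobSpec:
    """Everything the model needs up front for one hands-off run."""
    goal: str
    constraints: list[str] = field(default_factory=list)
    definition_of_done: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Name the sections that are still empty, so thin specs are caught early."""
        sections = {
            "goal": self.goal.strip(),
            "constraints": self.constraints,
            "definition_of_done": self.definition_of_done,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Flatten the spec into a single prompt string."""
        lines = [f"Goal: {self.goal}", "Constraints:"]
        lines += [f"- {c}" for c in self.constraints]
        lines.append("Definition of done:")
        lines += [f"- {d}" for d in self.definition_of_done]
        return "\n".join(lines)


# Hypothetical job, for illustration only.
spec = JobSpec(
    goal="Migrate the billing service from REST handlers to gRPC.",
    constraints=["Do not change the public database schema."],
    definition_of_done=["All existing integration tests pass."],
)
assert spec.missing_sections() == []
print(spec.render())
```

Checking `missing_sections()` before dispatch is a cheap guard: an empty definition of done is far easier to fix before the run starts than several hours into it.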
Budgets and checkpoints become load-bearing
Anthropic ships task budgets for exactly this: tell the model how many tokens it has for the whole loop and it self-moderates against a running countdown. On an hours-long autonomous run, a budget and intermediate checkpoints are not nice-to-haves — they are the difference between a finished job and a surprise invoice.
Why it matters — A model that can run for hours can also burn tokens and money for hours. You need explicit ceilings, not vibes.
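The article does not describe the interface for Anthropic's task budgets, so the sketch below does not model that feature. It shows a client-side ceiling you might keep as a backstop in your own orchestration code, assuming you can read token usage after each step. `TokenBudget`, `BudgetExceeded`, and the numbers are all hypothetical.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run would spend past its token ceiling."""


class TokenBudget:
    def __init__(self, ceiling: int, checkpoint_every: int) -> None:
        self.ceiling = ceiling
        self.checkpoint_every = checkpoint_every
        self.spent = 0
        self._next_checkpoint = checkpoint_every

    @property
    def remaining(self) -> int:
        return self.ceiling - self.spent

    def charge(self, tokens: int) -> bool:
        """Record usage for one step. Returns True when a checkpoint is due."""
        if self.spent + tokens > self.ceiling:
            raise BudgetExceeded(f"step needs {tokens}, only {self.remaining} left")
        self.spent += tokens
        if self.spent >= self._next_checkpoint:
            # Advance past every threshold this step crossed.
            while self._next_checkpoint <= self.spent:
                self._next_checkpoint += self.checkpoint_every
            return True
        return False


# Illustrative run: three steps against a 1M-token ceiling.
budget = TokenBudget(ceiling=1_000_000, checkpoint_every=250_000)
for step_tokens in (100_000, 200_000, 300_000):
    if budget.charge(step_tokens):
        print(f"checkpoint at {budget.spent} tokens, {budget.remaining} left")
```

Raising before the spend is recorded means the ceiling is a hard stop rather than a warning, which is the behavior you want on a run nobody is watching.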
Verification moves to the end
When you are not watching the loop, the trustworthy pattern is outcome-graded: define a rubric for what “done” looks like and check the artifact against it. The supervision you used to spend per-step gets reinvested into a strong final gate — tests, review, an independent grader.
Why it matters — You can no longer eyeball every step. The leverage is in grading the output, not supervising the process.
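An outcome-graded gate can be as simple as a list of named checks run against the finished artifact. This is a minimal sketch under the assumption that the artifact is text and each check is a plain function. The `Check` type, the `grade` helper, and the toy rubric are hypothetical, and a real gate would more likely run a test suite or an independent grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    passes: Callable[[str], bool]  # takes the artifact, returns pass/fail


def grade(artifact: str, rubric: list[Check]) -> tuple[bool, list[str]]:
    """Run every check; return (accepted, names of failed checks)."""
    failed = [check.name for check in rubric if not check.passes(artifact)]
    return (not failed, failed)


# Toy rubric for a generated report (hypothetical checks).
rubric = [
    Check("has a summary section", lambda text: "## Summary" in text),
    Check("cites at least one source", lambda text: "Source:" in text),
    Check("no unresolved TODO markers", lambda text: "TODO" not in text),
]

accepted, failed = grade("## Summary\nRevenue rose.\nTODO: add sources", rubric)
print(accepted, failed)
```

Returning the names of the failed checks, not just a boolean, gives you something specific to hand back to the model or to a reviewer when a job is rejected.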
The New Economics: Route by Task, Don't Upgrade Everything
Because Fable 5 is twice the price and far slower, the instinct to “switch to the best model” is wrong here. Opus 4.8 already wins six of seven head-to-head benchmarks against the rest of the field — we covered why Opus 4.8 took the crown when it shipped — and it answers in seconds at half the cost. For the overwhelming majority of everyday work, Opus is still the correct default.
The move is to route by task. Reserve Fable 5 for the jobs where its endurance pays for itself: large codebase migrations, deep multi-source research, overnight autonomous runs, the gnarly long-horizon problems where Opus would need a human babysitting the loop anyway. Send everything else — chat, classification, quick edits, latency-sensitive UX — to Opus or Sonnet. Frontier intelligence has become a premium tier you call selectively, not a blanket upgrade.
This is the same instinct that drives good agentic system design: match the tool to the task. A two-tier frontier — a fast, cheap flagship and a slow, expensive frontier model — just makes the routing decision sharper and more consequential than it used to be.
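The routing rule can be sketched in a few lines. The version below assumes you can estimate two things about a task before dispatch: roughly how many steps it will take and whether a user is waiting on the answer. The `Task` fields, the 50-step threshold, and the `"opus-4.8"` label are assumptions made for illustration; only `claude-fable-5` is an identifier given in this article.

```python
from dataclasses import dataclass

FABLE = "claude-fable-5"   # API identifier given in this article
FAST_DEFAULT = "opus-4.8"  # placeholder label, not a confirmed API identifier


@dataclass
class Task:
    expected_steps: int      # rough guess at tool calls or iterations
    latency_sensitive: bool  # is a user waiting on the answer?


def pick_model(task: Task, long_horizon_steps: int = 50) -> str:
    """Send only long, non-interactive jobs to the slower, pricier tier."""
    if task.latency_sensitive:
        return FAST_DEFAULT
    if task.expected_steps >= long_horizon_steps:
        return FABLE
    return FAST_DEFAULT


print(pick_model(Task(expected_steps=400, latency_sensitive=False)))
print(pick_model(Task(expected_steps=3, latency_sensitive=True)))
```

Latency sensitivity is checked first on purpose: a job that is long but has a user waiting on it is a poor fit for a model that may take hours, whatever its step count.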
The Catch: Safety Routing and the Mythos Shadow
Anthropic released a Mythos-class model to the public days after publicly warning that AI was getting too dangerous, and the safeguards are how it squares that. Fable 5 runs classifiers for three categories (cybersecurity, biology and chemistry, and model distillation), and when a request trips one, it is quietly routed to an Opus 4.8 fallback. Anthropic says this fires in fewer than 5% of sessions on average.
The practical consequence is worth internalizing: when a request trips a classifier in one of those guarded domains, the answer you get from Fable 5 is an Opus 4.8 answer. If your workload is security tooling or computational biology, that 2x price does not buy frontier capability; you are paying double for an Opus response. Add the latency tax on top, and the routing math gets unforgiving. There is also a 30-day data retention policy attached to Mythos-class models that is worth reading before you pipe sensitive material through them.
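If your own pipeline already tags requests by domain, one hedge is to skip Fable 5 for the guarded categories before the request is sent. The sketch below assumes caller-supplied domain tags, which are a coarse stand-in for Anthropic's classifiers and will not match them exactly. The tag strings and the `fable_adds_value` helper are hypothetical.

```python
# Categories named in this article as triggering the Opus 4.8 fallback.
# The tag spellings are this example's own convention.
GUARDED_DOMAINS = {"cybersecurity", "biology_chemistry", "model_distillation"}


def fable_adds_value(domain: str) -> bool:
    """False when the request would likely be served by the fallback anyway."""
    return domain not in GUARDED_DOMAINS


for domain in ("cybersecurity", "codebase_migration"):
    print(domain, fable_adds_value(domain))
```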
What This Means If You're Building Agents Today
Do not rip out Opus. Add Fable as a second gear. The highest-leverage move this week is to find the one or two workflows in your stack that are genuinely long-horizon — the ones where you currently either babysit a loop or break the job into so many small steps that the orchestration is its own headache — and try handing the whole thing to Fable 5 with a clear spec, a token budget, and a verification gate at the end.
If it finishes a job overnight that used to take a developer a day of supervised iteration, the 2x token price is trivial against the time saved. If it does not — if the task was never really long-horizon — you have your answer, and you route it back to Opus. That experiment, run honestly on your own workload, beats any benchmark table for deciding where Fable belongs.
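The trade-off above is simple arithmetic, sketched here with made-up numbers. The dollar figures, the hours saved, and the hourly rate are all assumptions for illustration, and the comparison assumes both models use the same token volume, which a long autonomous run may not. `net_savings_usd` is a hypothetical helper.

```python
def net_savings_usd(
    fable_token_cost: float,
    opus_token_cost: float,
    dev_hours_saved: float,
    dev_hourly_rate: float,
) -> float:
    """Positive means the Fable run was cheaper overall once labor is counted."""
    extra_token_spend = fable_token_cost - opus_token_cost
    labor_saved = dev_hours_saved * dev_hourly_rate
    return labor_saved - extra_token_spend


# Illustrative numbers (all assumptions): a $40 Opus run costs $80 on Fable at
# the 2x list price, and saves six supervised hours at a $100/hour loaded rate.
print(net_savings_usd(80.0, 40.0, dev_hours_saved=6, dev_hourly_rate=100.0))
```

With zero hours saved the same function returns a loss equal to the extra token spend, which is the signal to route that workflow back to Opus.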
The deeper signal is about trajectory. We have spent the agent era engineering around a model’s inability to stay coherent. Fable 5 is the clearest evidence yet that the ceiling on autonomous run length is rising fast — and as it rises, more of the harness becomes about delegation and grading, less about damage control. That is a better problem to have, and it is the one worth designing for now.
Sources and further reading: Anthropic's announcement, TechCrunch on the safety timing, and the AWS availability note.