AI Models
Sakana Fugu: The Model That Matches Fable 5 Without Fable 5
Everyone is calling Fugu “a new LLM that beats Fable 5.” That is not quite what happened, and the real story is more interesting. Fugu is an orchestration model. It commands a pool of other LLMs, and it reaches Fable-5-class results even though Fable 5 is locked behind export controls and isn’t in its pool at all.
On June 22, 2026, the Tokyo lab Sakana AI released Fugu, and the internet immediately filed it under the wrong headline: “new model beats Fable 5.” That is not what this is. Fugu isn't a new frontier LLM in the way people mean. It's a multi-agent orchestration system that happens to be packaged as a single model. You call one OpenAI-compatible API, and behind it Fugu plans the task, hands pieces of it to a pool of other LLMs (including copies of itself), checks their work, and stitches the answer back together. The part the hype skips is the part that actually impressed me. It posts Fable-5-class benchmark numbers without Fable 5 in its pool at all.
That one distinction is the whole story, so it's worth slowing down on. Anthropic's Fable 5 and its Mythos sibling are under export controls and not publicly accessible. Sakana's pitch is that you don't need them anymore. Take the strong models you can reach, Opus 4.8, Gemini 3.1 Pro, GPT-5.5, coordinate them well enough, and you land in the same performance neighborhood as the model you were locked out of.
You don't need access to Fable 5 to get Fable-5-level results. That's the whole bet, and the benchmarks say it isn't crazy.
What actually shipped?
Fugu went straight to general availability on June 22, 2026, after a beta of roughly 500 users. It comes in two flavors. Plain Fugu is the everyday driver, tuned for low latency on coding, code review, and chat, and it only escalates to a bigger team when a task earns it. Fugu Ultra is the heavyweight, with a deeper, fixed pool of expert agents pointed at hard, multi-step problems. Early users threw it at Kaggle competitions, reproducing scientific papers, cybersecurity analysis, and patent and literature searches.
You talk to either one through a single OpenAI-compatible endpoint, so there's nothing to migrate. Point your existing client at Fugu and the swarm stays invisible. Internally, Fugu decides whether to just answer or to assemble a team, and the selecting, delegating, checking, and synthesizing all happen inside that one call. From the outside it looks like a model. Inside, it's a coordinated fleet.
So it's not a normal model. How does the orchestration actually work?
Fugu is a language model, but the thing it's trained to be good at is calling other language models. Instead of a hand-wired “if it's a coding task, use model X” pipeline, the routing is learned. The work sits on two ICLR 2026 papers Sakana published, and both are worth knowing by name, because they're the reason this is more than a clever wrapper.
TRINITY
A lightweight, evolved coordinator that hands each agent one of three roles, Thinker, Worker, or Verifier, across coding, math, reasoning, and knowledge tasks. The Thinker plans, the Workers run in parallel, and the Verifier checks the output before anyone trusts it. That verifier is the quiet hero. It's how a swarm catches its own mistakes instead of confidently shipping them.
Why it matters — It turns a static prompt chain into a small org chart that reshapes itself for each task.
Conductor
Trained with reinforcement learning to find its own natural-language coordination strategies and agent-to-agent communication patterns. Rather than a human designing the handoff protocol between agents, Conductor learns one that works and reuses it.
Why it matters — The coordination is discovered, not authored, which is why it holds up on cases nobody thought to script.
Together, that's the gap between Fugu and the do-it-yourself version most teams have already tried, the “fire off five GPT calls and glue the outputs together” approach. Here the glue is the product, and it was trained, not duct-taped.
The benchmarks: does it really match Fable 5?
Here's the honest version. Sakana's headline table compares Fugu against the frontier models you can actually buy today, Opus 4.8, Gemini 3.1 Pro, and GPT-5.5, because Fable 5 and Mythos aren't public. Against that field, Fugu Ultra leads most of the board.
| Benchmark | Fugu | Fugu Ultra | Opus 4.8 | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|---|---|
| SWE-Bench Pro (agentic coding) | 59.0 | 73.7 | 69.2 | 54.2 | 58.6 |
| LiveCodeBench (coding) | 92.9 | 93.2 | 87.8 | 88.5 | 85.3 |
| GPQA-D (science reasoning) | 95.5 | 95.5 | 92.0 | 94.3 | 93.6 |
| Terminal-Bench 2.1 (terminal coding) | 80.2 | 82.1 | 74.6 | 70.3 | 78.2 |
| Humanity's Last Exam (reasoning) | 47.2 | 50.0 | 49.8 | 44.4 | 41.4 |
| MRCRv2 (long-context recall) | 86.6 | 93.6 | 87.9 | 84.9 | 94.8 |
Read it closely and the shape is clear. Fugu Ultra leads on coding (SWE-Bench Pro 73.7 against Opus 4.8's 69.2), on science (GPQA-D 95.5), and on terminal work. But it's not a clean sweep. On MRCRv2 long-context recall, GPT-5.5 still wins at 94.8. On Humanity's Last Exam the lead over Opus 4.8 is basically a rounding error, 50.0 to 49.8. An orchestrator inherits the ceilings of the models it calls. It can't invent ability that none of them have.
As for Fable 5 specifically, Sakana calls Fugu Ultra “shoulder-to-shoulder” with Fable 5 and Mythos Preview, and secondary reporting from VentureBeat says Fugu edges Fable 5 on LiveCodeBench (92.9 to 89.8) and beats the older Mythos Preview on GPQA-D. So “matches Fable 5” is fair on these particular public benchmarks. “Beats Fable 5 across the board” is a claim nobody can actually make, because Fable 5 isn't in the pool and you can't run a head-to-head against a model you can't touch.
Matching a model you're locked out of, using only the models you're allowed to use, is a genuinely new kind of win. It's just not the win the headlines are selling.
The two catches that decide whether you should care
This is where the press-release shine wears off, and where you should pay attention before you rewire anything.
1. Orchestration isn’t free, and Sakana won’t say how unfree
The Decoder said it plainly: “how much the orchestration drives up token usage and costs remains an open question that Sakana doesn't address.” Sakana softens the blow by billing multiple active agents at the single highest-tier rate instead of stacking fees, but a deep Fugu Ultra run still burns far more tokens than one model answering once. Benchmark a real workload before you hand over a budget.
Why it matters — If three models confer on every hard query, you’re paying for three models on every hard query. The quality might be worth it. The bill might not be.
2. Resilience isn’t sovereignty
Sakana frames Fugu as a hedge against vendor lock-in and shifting export controls. That's true up to a point. If one provider's access disappears, Fugu reroutes. But if several big providers pull back at once, Fugu's pool shrinks right along with them. An orchestrator buys you resilience, not a model you own. The whole thing is only ever as good as the pool it can reach.
Why it matters — A router that leans on the same handful of labs as everyone else is sturdier, not independent.
Pricing and access
Fugu ships with both subscriptions and pay-as-you-go. Subscriptions are $20 a month (Standard), $100 a month (Pro, 10x the usage), and $200 a month (Max, 20x the usage), and all three include both Fugu and Fugu Ultra. On metered billing, Fugu Ultra runs $5 in and $30 out per million tokens, with $0.50 cached input and higher rates once you cross a 272K-token context. Subscribe before July 2026 and the second month is free. One sharp edge worth flagging: Fugu isn't available in the EU or EEA yet, pending GDPR compliance.
So should you actually use it?
If you're export-control-blocked from Fable 5 or Mythos, Fugu is the most credible way to get near that tier today. No real argument there. If you run hard, multi-step agentic work, things like research reproduction, deep code review, or security analysis, and you care more about the answer than the token bill, Fugu Ultra is worth a real trial. Beta testers said it caught “more than twenty” bugs in a code review where GPT-5.5 flagged three. And because the API is OpenAI-compatible, the experiment costs you almost no engineering time.
Where I'd hesitate is high-volume, cost-sensitive, latency-sensitive work. There, a single strong model like Opus 4.8 called directly is predictable in a way a swarm just isn't. You know what each request costs and how long it takes. Fugu trades that predictability for a higher ceiling. Figure out which one your workload actually needs before you commit.
The bigger signal is the one Sakana is really sending. Orchestration is turning into the product. When the base models all converge and access gets political, the lasting edge moves to whoever coordinates them best. Fugu is the first mainstream model to wrap that idea behind a single endpoint, and that, far more than a few benchmark points over a model you can't buy, is why it's worth watching.