Updated on June 17, 2026
Hypernova 60B v2605: Improved Coding and General Capability
- Review the newest Hypernova 60B v2605 with improved coding capabilities and overall intelligence improvements.
- Read our note on Artificial Analysis’ independent evaluation.
HyperNova 60B started with a clear conviction: the market doesn't need another massive, general-purpose model — it needs smaller, specialized ones that do one job exceptionally well, at a fraction of the cost and latency.
That conviction came from a real customer need. In late 2024, we partnered with an important client to build a production-grade code agent in English. The requirements were demanding: frontier-level reasoning in a model light enough to run on a single GPU, with the low latency that real-time developer workflows demand. Our answer was to take OpenAI's gpt-oss-120b — a 117B-parameter open-weight model — and compress it to 59B parameters using CompactifAI's quantum-inspired technology. Half the size, same intelligence.
HyperNova 60B delivered. On reasoning benchmarks — MMLU-Pro, GPQA Diamond, AIME 2025 — the compressed model matched its parent across every major evaluation. More importantly, it worked in production: a specialized, efficient model doing exactly what it was built to do.
Then something interesting happened. When we open-sourced HyperNova 60B, the community took it beyond its original scope. Developers started plugging it into agentic pipelines — multi-turn workflows where models don't just reason, they act: calling APIs, chaining tools, orchestrating real systems. And they told us what they needed next: reliable tool-calling.
HyperNova 60B version 2602 is that next step. We added robust tool-calling capabilities — without increasing the model's footprint. Same 59B parameters, but now usable across the full spectrum of agentic workflows. What began as a specialized code model for one customer has become a general-purpose agent backbone for the open-source community — and it still fits on a single GPU.
The Challenge: How Compression Works and Tool-Calling
To understand what we improved in HyperNova 60B version 2602, it helps to understand how HyperNova 60B is built.
The original gpt-oss-120b uses a "Mixture-of-Experts" (MoE) architecture. Think of it like a massive corporation with 117 billion employees, but only 5.1 billion are actively working on any given task. Most sit in specialized departments ("experts"), and a manager ("router") decides which department handles which request.
Our CompactifAI compression identifies mathematical redundancy in this routing structure using quantum-inspired tensor decomposition, cutting the model to 59B parameters with 4.8B active per token.
This preserved reasoning because the core knowledge pathways survived compression. But tool-calling depends on something different: the ability to produce precisely structured outputs — valid JSON, correct argument types, exact schema compliance — across multi-step interactions.
The redundancy we removed turned out to be carrying important structural generation capabilities, even though it wasn't carrying reasoning capability.
The result was a model that could think clearly but couldn't fill out a form correctly — which, in the world of agentic AI, is a dealbreaker.
The Solution: Teaching the Model to Use Tools
We built HyperNova 60B version 2602, a new model based on gpt-oss-120b and benchmarked against the base model.
HyperNova 60B version 2602 keeps the same compressed architecture. Same 59B parameters, same single-GPU footprint. What changes is a targeted post-training phase built on knowledge distillation.
Here's how it works: Think of knowledge distillation as a master-apprentice dynamic. We used a larger, highly capable "teacher" model to generate thousands of synthetic training examples. It showed our compressed "student" model how to correctly call tools, handle multi-step conversations, and navigate tricky edge cases.
Crucially, this didn't come at the expense of reasoning. The approach adds a capability layer on top of the compressed foundation rather than trading one strength for another.
The Results
Tool-calling was the priority for HyperNova 60B version 2602, and the numbers reflect it in two key benchmarks:
- BFCL v4: Function-calling accuracy across API schemas
- τ²-Bench: Agent multi-turn tool use in stateful conversations
BFCL v4 went from 25 to 62 — recovering 97% of the base model's capability. τ²-Bench went from 12 to 61 — a score that was disqualifying is now competitive.
For teams building AI agents that chain multiple tool calls — customer support automation, code generation pipelines, data retrieval workflows — this is the difference between a prototype that demos well and a system that runs in production.
Beyond Tool-Calling: Smarter Across the Board
When we fine-tuned HyperNova 60B v2602 for tool-calling, something unexpected happened— the model got better at everything else too. The knowledge distillation process didn't just add a new skill. It sharpened existing ones:
- Terminal Bench (generation of correct command-line instructions and interact with terminal environments): climbed from 8 → 16.
- AA-LCR (long-context reasoning over contracts, research papers, and large codebases): 34 → 36.
- IFBench (the ability to follow complex instructions accurately): 56 → 60.
On general intelligence, the benchmarks that validated HyperNova 60B version 2602 held steady or improved:
MMLU-Pro jumped from 71 to 74, putting a 59B-parameter model within 4 points of its 117B parent on one of the most widely used knowledge benchmarks. This is a model half the size, performing at full-size levels — and the distillation process appears to have sharpened it further rather than trading off breadth for the new tool-calling capability.
Inference Speed: Faster Response, Lower Cost
Benchmark scores tell you what a model can do. Inference performance tells you whether you can afford to let it. We measured head-to-head against gpt-oss-120b on identical hardware — a single NVIDIA H200 Tensor Core GPU, at concurrency 128, with a 1k input / 1k output workload.
The results are decisive. On throughput, HyperNova 60B v2602 is 39.5% faster than gpt-oss-120b, and on median time to first token it is 50.8% faster — a latency improvement users feel directly as a more responsive interface. And it does all of this while weighing in at 32 GB versus 65 — less than half the footprint of a model with 120B-like intelligence.
If you care about sustainability, that memory number is the one that matters most. At 32 GB, HyperNova 60B v2602 fits on a single 40 GB GPU — no 80 GB card required. That means less energy per inference, but it's also a direct cost advantage: 40 GB GPUs are cheaper, more widely available, and easier to provision. Or, if you already have an 80 GB card, you can run two parallel instances where you used to run one — doubling throughput without adding a single piece of hardware.
The bottom line: HyperNova 60B v2602 delivers comparable intelligence at a fraction of the infrastructure cost.
At production scale, these numbers translate directly to budget. Take a typical deployment handling 1,000 requests per second: the 39.5% throughput gain means roughly 395 requests per second of additional headroom on the same GPU fleet — or about 28% fewer GPUs to handle the same traffic. A 50.8% faster time-to-first-token keeps the experience snappy even as concurrency climbs. And with peak weights at 32 GB instead of 65, the model drops onto a 40 GB GPU tier that costs significantly less to rent than the 80 GB cards a model of 120B-like intelligence demands.
Every step in an agentic chain runs faster and costs less — a saving that compounds with every user, every query, every call.
HyperNova 60B v2605: Improved Coding and General Capability
If v2602 proved that quantum-inspired compression could deliver frontier intelligence at half the footprint, v2605 is about closing the gap that compression left behind. Same 32 GB. Same single-GPU economics. But across coding, reasoning, and tool use, the model is measurably smarter.
The headline is coding. On LiveCodeBench, HyperNova 60B v2605 jumps from 51.5 to 68.7 — beating gpt-oss-120b's 62.8. A 60B model with 4.8B active parameters now writes code better than a model with 120B-like intelligence, on a benchmark built to resist memorization. The rest of the coding track moves with it: AIDER climbs from 26.2 to 34.2, SciCode from 33.5 to 36.0, and Terminal Bench from 12.1 to 15.9.
The gains aren't confined to code. Across general intelligence benchmarks, v2605 recovers most of the ground that compression initially cost. HLE more than doubles (7.3 → 15.0). AIME25 climbs to 90.0, within striking distance of the 120B model's 93.7. GPQA-d goes from 65.6 to 71.9. And IFBench reaches 66.6 — essentially matching gpt-oss-120b's 67.0, meaning instruction-following at this size is now a solved problem, not a tradeoff. MMLU-Pro, AA-LCR, and Tau2-bench all step up in turn. See more details in HyperNova 60B v2605’s model card.
The pattern is consistent: knowledge distillation recovers the capabilities compression degrades, without giving back the efficiency. Tool calling stays native — function calling with defined schemas, structured JSON outputs, and agent-style workflows in the OpenAI format — so every benchmark gain flows straight into the agentic chains where this model earns its keep.
Faster, too. On the same H200, at concurrency 128, v2605 pushes 5,210 tokens per second against gpt-oss-120b's 3,821 — a 36% throughput gain — and returns the first token in 4.85 seconds instead of 7.04, a 31% latency improvement. v2605's distillation slightly changes the compute profile while improving latency, so its raw throughput sits marginally below v2602's even as time-to-first-token gets faster. All of it at 32 GB, on a single 40 GB GPU, at less than half the weight of a model with 120B-like intelligence.
The takeaway is simple: v2605 keeps everything that made v2602 cheap to run, and removes most of the reasons you'd reach for a bigger model instead. Comparable intelligence, frontier coding, half the infrastructure.
Independent Validation from Artificial Analysis
Artificial Analysis is the independent benchmarking lab the industry watches most closely. They evaluated HyperNova 60B v2605 through our CompactifAI API and placed it on their Intelligence Index.
This is the validation that matters. Independent confirmation that our compression technology delivers a genuinely intelligent model at half the size — and half the size is not an abstraction. It means the model runs on a smaller, cheaper GPU tier, serves more requests per card, and costs less for every inference it handles. The intelligence is real, and so are the savings.
On the Artificial Analysis Intelligence Index v4.0 — a blend of ten evaluations, from GDPval and Terminal-Bench Hard to GPQA Diamond and Humanity's Last Exam — HyperNova 60B v2605 scores 29.3. But the score isn't really the point. Where it sits is.
Artificial Analysis groups models by size, and in the 40B–150B class — the bracket where a model has to earn its keep on a single GPU — Hypernova lands in the "most attractive quadrant": the high-intelligence, low-parameter corner where you actually want to be. Everything that clearly outscores it is paying for it in parameters. gpt-oss-120b, Qwen3.5 122B, Mistral Medium 3.5, NVIDIA Nemotron 3 Super — each roughly twice the size or more. Hypernova does it from 60B, with 4.8B active per token.
This is the class that matters for anyone running a model in production. Teams don't deploy against a trillion-parameter giant — they deploy against a budget, a GPU tier, and a latency target. The 40B–150B class is where those decisions actually get made, and it's where the economics of serving a model are decided. In that class, HyperNova 60B v2605 is the standout: the most intelligence for every parameter you pay to run, and the efficiency leader in the size category that defines real-world deployment.
It's also the most cost-effective way to run that intelligence. Because HyperNova fits on a 40 GB GPU instead of an 80 GB card, the economics carry straight through to what you pay per token on the CompactifAI API — and you can see exactly what that looks like on our pricing page. Smaller model, cheaper hardware, lower price, same tier of intelligence.
So here's where we landed: an outside benchmarking organization, its own benchmarks, its own chart — and the same conclusion we started with. You don't need to pay for 120B parameters to get this kind of intelligence. You can get it at 60B, on one GPU, through an API priced to match.
HyperNova 60B v2605 is available now on the CompactifAI API.
What Comes Next
HyperNova 60B v2605 is a proof point: quantum-inspired compression can deliver frontier intelligence at a fraction of the infrastructure cost, and knowledge distillation can recover capabilities that compression initially degrades, and now improve on them. We're applying these techniques to additional proprietary models and architectures, and we'll have more to share soon.
The model is available now under the Apache 2.0 license on Hugging Face, and through our CompactifAI API on AWS Marketplace.
