Hypernova-60B v2602: Improved Tool-Calling Capability

Hypernova-60B started with a clear conviction: the market doesn't need another massive, general-purpose model — it needs smaller, specialized ones that do one job exceptionally well, at a fraction of the cost and latency.

That conviction came from a real customer need. In late 2024, we partnered with an important client to build a production-grade code agent in English. The requirements were demanding: frontier-level reasoning in a model light enough to run on a single GPU, with the low latency that real-time developer workflows demand. Our answer was to take OpenAI's gpt-oss-120b — a 117B-parameter open-weight model — and compress it to 59B parameters using CompactifAI's quantum-inspired technology. Half the size, same intelligence.

Hypernova-60B delivered. On reasoning benchmarks — MMLU-Pro, GPQA Diamond, AIME 2025 — the compressed model matched its parent across every major evaluation. More importantly, it worked in production: a specialized, efficient model doing exactly what it was built to do.

Then something interesting happened. When we open-sourced Hypernova-60B, the community took it beyond its original scope. Developers started plugging it into agentic pipelines — multi-turn workflows where models don't just reason, they act: calling APIs, chaining tools, orchestrating real systems. And they told us what they needed next: reliable tool-calling.

Hypernova-60B version 2602 is that next step. We added robust tool-calling capabilities — without increasing the model's footprint. Same 59B parameters, but now usable across the full spectrum of agentic workflows. What began as a specialized code model for one customer has become a general-purpose agent backbone for the open-source community — and it still fits on a single GPU.

The Challenge: How Compression Works and Tool-Calling

To understand what we improved in Hypernova-60B version 2602, it helps to understand how Hypernova-60B is built.

The original gpt-oss-120b uses a "Mixture-of-Experts" (MoE) architecture. Think of it like a massive corporation with 117 billion employees, but only 5.1 billion are actively working on any given task. Most sit in specialized departments ("experts"), and a manager ("router") decides which department handles which request.

Our CompactifAI compression identifies mathematical redundancy in this routing structure using quantum-inspired tensor decomposition, cutting the model to 59B parameters with 4.8B active per token.

This preserved reasoning because the core knowledge pathways survived compression. But tool-calling depends on something different: the ability to produce precisely structured outputs — valid JSON, correct argument types, exact schema compliance — across multi-step interactions.

The redundancy we removed turned out to be carrying important structural generation capabilities, even though it wasn't carrying reasoning capability.

The result was a model that could think clearly but couldn't fill out a form correctly — which, in the world of agentic AI, is a dealbreaker.

The Solution: Teaching the Model to Use Tools

We built Hypernova-60B version 2602, a new model based on gpt-oss-120b and benchmarked against the base model.

Hypernova-60B version 2602 keeps the same compressed architecture. Same 59B parameters, same single-GPU footprint. What changes is a targeted post-training phase built on knowledge distillation.

Here's how it works: Think of knowledge distillation as a master-apprentice dynamic. We used a larger, highly capable "teacher" model to generate thousands of synthetic training examples. It showed our compressed "student" model how to correctly call tools, handle multi-step conversations, and navigate tricky edge cases.

Crucially, this didn't come at the expense of reasoning. The approach adds a capability layer on top of the compressed foundation rather than trading one strength for another.

The Results

Tool-calling was the priority for Hypernova-60B version 2602, and the numbers reflect it in two key benchmarks:

BFCL v4: Function-calling accuracy across API schemas
τ²-Bench: Agent multi-turn tool use in stateful conversations

BFCL v4 went from 25 to 62 — recovering 97% of the base model's capability. τ²-Bench went from 12 to 61 — a score that was disqualifying is now competitive.

For teams building AI agents that chain multiple tool calls — customer support automation, code generation pipelines, data retrieval workflows — this is the difference between a prototype that demos well and a system that runs in production.

Beyond Tool-Calling: Smarter Across the Board

When we fine-tuned Hypernova-60B_v2602 for tool-calling, something unexpected happened— the model got better at everything else too. The knowledge distillation process didn't just add a new skill. It sharpened existing ones:

Terminal Bench (generation of correct command-line instructions and interact with terminal environments): climbed from 8 → 16.
AA-LCR (long-context reasoning over contracts, research papers, and large codebases): 34 → 36.
IFBench (the ability to follow complex instructions accurately): 56 → 60.

On general intelligence, the benchmarks that validated Hypernova-60B version 2602 held steady or improved:

MMLU-Pro jumped from 71 to 74, putting a 59B-parameter model within 4 points of its 117B parent on one of the most widely used knowledge benchmarks. This is a model half the size, performing at full-size levels — and the distillation process appears to have sharpened it further rather than trading off breadth for the new tool-calling capability.

Inference Speed: Faster Response, Lower Cost

Benchmark scores tell you what a model can do. Inference performance tells you whether you can afford to let it. We measured head-to-head against gpt-oss-120b on identical hardware (H200 Tensor Core GPU).

If you care about sustainability, the number that matters most is memory footprint. At 32 GB peak memory, HyperNova-60B_v2602 fits on a single 40 GB GPU — no need for an 80 GB card. That means less energy per inference, but it's also a direct cost advantage: 40 GB GPUs are cheaper, more widely available, and easier to provision. Or, if you already have an 80 GB card, you can run two parallel instances where you used to run one — doubling throughput without adding hardware.

The bottom line: HyperNova-60B_v2602 delivers comparable intelligence at a fraction of the infrastructure cost.

At production scale, these numbers translate directly to budget.

Take a typical deployment handling 1,000 requests per second: the 39.5% throughput gain means roughly 400 extra requests per second of headroom on the same GPU fleet — or ~28% fewer GPUs to handle the same traffic. Latency improvements of 36–51% across TTFT, TPOT, and ITL mean your users consistently experience a more responsive interface, even as concurrency scales. And with peak memory at 32 GB instead of 61, the model fits on a 40 GB GPU tier that costs significantly less to rent than the 80 GB cards the parent model requires.

Every step in an agentic chain runs faster and costs less — a saving that compounds with every user, every query, every call.

What Comes Next

Hypernova-60B_v2602 is a proof point: quantum-inspired compression can deliver frontier intelligence at a fraction of the infrastructure cost, and knowledge distillation can recover capabilities that compression initially degrades. We're applying these techniques to additional proprietary models and architectures, and we'll have more to share soon.

The model is available now under the Apache 2.0 license on Hugging Face, and through our CompactifAI API on AWS Marketplace.

We look forward to seeing what you build with it.