For teams putting reasoning agents into production, the question stopped being "is the model good enough?" a while ago. The models are smart. The harder question is whether you can afford to run the smart one at the scale your business actually needs.
That question comes from real work, research assistants reading across hundreds of documents, support copilots living inside a contract or a wiki, planners chaining a dozen tool calls without losing the thread. These workloads share two requirements: the model has to reason at a high level, and it has to hold a very long context while doing it. And they hit the same wall, the model that's smart enough is too big to deploy economically at real traffic.
That's the gap Pulsar-16B is built to close.
Pulsar-16B is the latest model from Multiverse Computing, a 16-billion-parameter Hybrid Mamba2-Transformer with a Mixture-of-Experts architecture. It's built on NVIDIA's Nemotron 3 Nano 30B-A3B and developed with Multiverse Computing's proprietary technology, using NVIDIA ModelOpt and Megatron Bridge in the compression pipeline, and validated on NVIDIA accelerated computing infrastructure.
The result is reasoning performance in line with leading ~30B-class models at roughly half the parameter count, a concrete illustration of how architectural discipline and optimization, not parameter inflation, define the next wave of deployable LLMs. And it's not just our claim: NVIDIA independently reproduced the full evaluation suite on its own hardware, confirming that what came out the other side is the same intelligence in a smaller package.
How we built it
Pulsar-16B is a result of CompactifAI, Multiverse Computing's proprietary technology for designing exceptionally efficient large language models.
Most efficiency techniques are blunt instruments round the weights, prune the small ones, hope nothing important breaks. CompactifAI takes a structural approach. Using quantum-inspired tensor decomposition, it identifies the mathematical redundancy hiding inside large neural networks and removes it without disturbing the patterns the model uses to think.
Think of a large model as an over-engineered bridge. Most methods shave material wherever it looks heavy and wait to see if the span holds. CompactifAI maps the load-bearing structure first, then removes only what was never carrying weight. The cables that matter stay exactly where they were. The result: a 16B model that performs in the class of much larger systems, on a methodology we'll keep applying across our lineup.
Pulsar versions
A key advantage of Multiverse's technology is that it composes with standard quantization rather than competing with it models can be further quantized so teams balance accuracy and footprint without retraining the core weights.
Pulsar-16B ships in BF16, FP8, and NVFP4. In NVIDIA's evaluation, FP8 matched BF16 on quality, while NVFP4 degraded only 1–6% a small accuracy gap for a large memory gain. Pulsar-16B-FP8 fits in ~16 GB of weights and Pulsar-16B-NVFP4 in ~10 GB. That last point is the whole game: a model that fits in 10 GB doesn't just cost less to serve, it changes which GPUs you can serve it on at all.
Performance
A benchmark score tells you whether a model is capable. It says nothing about whether the economics let you put that capability into production. So we measured both.
We ran a structured inference benchmark using guidellm with vLLM 0.18.0, targeting NVIDIA B200 GPUs under a fixed decode profile (temperature 0.0, top_p 1.0), following the model card's 8k/16k input/output shape so results stay comparable to the original spec.
On B200, against the 30B-class Nemotron baseline, Pulsar-16B-BF16 raised system throughput from 3,363 to 3,760 tokens/sec while cutting time-to-first-token from 2.18 to 1.80 seconds. The aggressive quantization builds bend the curve harder: FP8 and NVFP4 reach 4,808 and 4,735 tokens/sec roughly a 40% throughput gain while TTFT drops to ~1.25 seconds. Model weights tell the most direct version of the story: 59 GB collapses to 16 GB at FP8 and 10 GB at NVFP4.
In a real deployment, those gains compound. A 40% throughput gain is ~40% more requests on the same GPU fleet, or a smaller fleet serving the same traffic. A 6x reduction in weight footprint moves the model to a card class that's cheaper, more available, and easier to provision. And because the workload definition never changed across precisions, every gain is apples-to-apples.
The pattern holds on datacenter-class L40S hardware. At FP8, Pulsar trims weights from 33 to 16 GB and pulls TTFT from 2.00 to 1.40 seconds, a 30% latency improvement at half the footprint. At NVFP4, weights drop from 21 to 10 GB and TTFT falls from 4.70 to 3.80 seconds, with throughput steady. The smaller model isn't trading speed for size; it gives back memory and latency at once.
Benchmarks that matter
Across reasoning, knowledge, coding, and instruction-following benchmarks, Pulsar-16B competes head-to-head with reasoning models nearly twice its parameter count and outscores OpenAI's gpt-oss-20B on nearly every measure. A few that stand out:
- AIME 2025: 87.22 within half a point of the 30B-class base model (87.66) and 54 points ahead of Ministral-3-14B-Instruct-2512. A 16B model solving competition math at the level of systems twice its size.
- GPQA-Diamond (PhD-level science): 71.41 well ahead of Qwen3-14B at 63.63.
- The agent cluster instruction following (IFBench), function calling (BFCL-v4), and math reasoning (AIME): Pulsar-16B beats gpt-oss-20B by 14, 11, and 15 points respectively.
That last cluster is where it counts. Instruction following, function calling, and math reasoning are the skills an agent leans on every time it plans a step, calls a tool, or checks its own work, the numbers that separate a prototype that demos well from a system that ships.
Long context
Long-context performance is where compression often quietly degrades and where we focused real attention. An agent that loses the thread at 100K tokens isn't an agent; it's a demo with a short memory.
Pulsar-16B was evaluated across LongBench, AA-LCR, the RULER suite, and NIAH variants at progressively longer contexts. Needle-in-a-haystack retrieval is essentially perfect on both sides of the 100K mark. On the harder RULER tasks at extended context, Pulsar holds strong reasoning across the full evaluated range. For document-heavy work contracts, codebases, support transcripts, research corpora, it performs at the level enterprise workloads demand.
This is what makes Pulsar fit the workloads it was designed around: research assistants over large corpora, enterprise knowledge assistants over manuals and contracts, and multi-step agent loops that don't exhaust the context window or collapse into repetition. Strong reasoning is necessary for all three. Strong reasoning that survives a long context is what makes them production-grade.
Why it matters at scale
Cutting a model in half changes the economics of inference more than the spec sheet suggests. Cost per token drops. Concurrency budgets stretch. Deployment options open up, single-node setups, regulated on-prem environments, latency-sensitive customer-facing systems, that weren't feasible with the larger model. Same NVIDIA accelerated computing underneath, more productive work out of every chip.
What comes next
Pulsar-16B is available today on Hugging Face under the Apache 2.0 license, with standard inference paths via vLLM and Hugging Face Transformers, and full support for tool calling and hybrid reasoning out of the box.
The takeaway reaches past any single release. The distance between what a model can do and what a team can actually deploy has been the real constraint on enterprise AI and Pulsar-16B shows that distance can close. Get the architecture and the compression right, and the capable model and the deployable model stop being two different choices.
Pull it, build on it, and tell us where it takes you.
Pulsar-16B is developed by Multiverse Computing within the Nemotron architectural family pioneered by NVIDIA, and runs on NVIDIA accelerated computing. The evaluation results presented were independently reproduced by NVIDIA. Available under the Apache 2.0 license.

