From Testing Chaos to Production-Grade QA Infrastructure

The work described in this article is the result of a collaborative effort across multiple teams at Multiverse Computing. You can find more at the end of this article.

Model serving is operationally fragile. A seemingly local change — a new model configuration, a different deployment shape, a GPU allocation tweak, or a benchmark-driven optimization — can shift latency, change supported RPS, eat into GPU utilization, and even affect the reliability of other models running in the same environment. What looks like a one-line update touches a system with many moving parts.

That makes production changes expensive to plan. Before QA Harness, moving a model or configuration closer to production required days of careful coordination across scripts, benchmark tools, load-testing utilities, manual reports, and shared GPU environments to make sure there was no collateral impact across workloads.

The Problem: Every Model-Serving Change Can Affect Production Performance

Any team shipping LLMs must answer the same questions on every release: is the model still correct, does it remain competitive on the benchmarks that matter, and can it serve traffic reliably under production-like load? In practice, those answers often live in different places. Functional checks run in terminals. Benchmark scores sit in notebooks or external tools. Performance numbers are captured manually for a specific release and rarely revisited. The work gets done, but it is fragmented, hard to reproduce, and slow to act on exactly when release decisions need to be made.

At Multiverse Computing, this matters even more because we develop and serve optimized models through our Inference API, alongside compressed variants produced with CompactifAI, our quantum-inspired compression technology. Every release touches a shared serving environment where one model's deployment shape can affect another's latency budget. Production-grade validation is not optional; it is the only way to move fast without breaking other workloads.

QA Harness automates the operational coordination required to change model-serving environments safely, work that previously took days. It turns production-readiness checks from a manual, multi-day planning effort into a repeatable workflow that provisions models, runs the right evaluations, tears resources down safely, and generates recurring statistics about the real performance capacity of the system.

Concretely, that workflow rests on three capabilities: a catalog of evaluation suites covering functional behavior, standard benchmarks, and performance under load; an orchestration layer that coordinates model startup, suite execution, scheduling, and teardown; and a reporting layer that persists results so teams can compare system behavior over time. Scheduled runs feed the platform with recurring statistics (supported RPS, latency profiles, saturation points, regression signals), giving teams a continuous view of model-serving capability and production readiness rather than a one-off snapshot per release.

The Challenge: Why Production QA Readiness Does Not Scale

Preparing a model-serving change for production is not like validating a typical web service. The system is stateful, resource-constrained, GPU-dependent, and shared across workloads. A change that improves one model can reduce available capacity for another, shift latency under load, or change the supported RPS of the overall environment.

Four constraints made our previous approach unsustainable:

Heterogeneous evaluation needs. A model, especially a compressed one where every optimization is a potential trade-off, must be correct (does it still answer well?), competitive (does it hold up on standard benchmarks?), and fast under load (does it serve traffic at acceptable latency and throughput?). These are three very different test families, each with their own tooling, runtime, and output format.
Shared GPUs make planning expensive. Running these evaluations means spinning up the model on a GPU, hitting it, and tearing it down. In a shared environment, that planning is non-trivial: forgetting to release a GPU blocks other workloads, and running two heavy jobs on the same hardware costs reliability across the board. The bottleneck is rarely raw compute; it is the coordination overhead of using it safely.
Production changes carry collateral risk. A model-serving change can affect more than the model being changed. Configuration, deployment shape, GPU allocation, traffic patterns, and co-located workloads can all shift latency, RPS, or reliability elsewhere in the system. Validating one model without understanding its effect on the others is not validation at all.
No memory of the past. Without a central record, comparing today's run against last week's means digging through chat threads and spreadsheets. Regressions can go unnoticed for days, and by the time they surface, the trail back to the change that caused them is already cold.

Individually, each problem had a workaround. Together, they slowed every release decision: production readiness became a multi-day coordination exercise instead of an answer the team could pull on demand. We did not need a better script. We needed infrastructure.

The Strategy: A Central Nervous System for QA

QA Harness was designed as a single operational entry point for model-serving readiness. Instead of asking teams to manually coordinate scripts, benchmark tools, load tests, GPU environments, and reports, the platform handles the full workflow: it provisions the model, runs the required checks, captures the system-level metrics, tears resources down safely, and makes the results available for comparison over time.

The platform is model agnostic. Any LLM or VLM the team needs to validate — base models, compressed variants, fine-tuned checkpoints, configuration changes on already-deployed models — flows through the same entry point.

Think of it less as a test runner and more as a nervous system. Each test suite is a sensor; the orchestrator is the brain that decides what runs when, on which hardware, and in what order. The team interacts with the brain, not with the sensors.

The design rests on three principles:

Centralization. Every QA activity (functional, benchmark, performance) lives behind one UI and one API.
Automation. Nothing about a routine evaluation should require a human. Models start and stop on demand, schedules run by themselves, and reports arrive in Slack: the engineer who submitted the job, or the team when a nightly run had no submitter, sees the verdict in their feed without touching anything in between.
Resource awareness. The system never forgets that GPUs are scarce and shared, and it acts accordingly: models are torn down as soon as their jobs finish, two heavy evaluations are not scheduled on the same hardware unless we explicitly allow it, and idle capacity is released rather than held hostage.

The Three Pillars: Functional, Benchmarks, Performance

A QA Harness run can assess production readiness from three angles: whether the model behaves correctly, whether it remains competitive on standard benchmarks, and how much traffic the system can support. Submitting all three suites at once is a common pattern because it gives a complete portrait of the model in a single run.

Coverage at a glance:

Functional

End-to-end coverage of the model's API: chat completions, tool calls, streaming responses, image and audio inputs, structured outputs with strict JSON-mode constraints, reasoning traces, authentication flows, and RBAC permutations. Suites combine golden tests (deterministic input to expected output), contract tests against the API schema, and multi-turn scenarios that exercise stateful conversation handling, retry semantics, and timeout behavior.

Each capability marker targets a specific failure mode that LLM regressions tend to surface in production: tool-call grounding, schema compliance, audio transcription fidelity, function-calling argument types, and cross-capability flows where two skills have to compose.

Tests are organized by tier (smoke, critical, standard, extended, full) and by capability marker, so the same suite can serve a quick pre-release smoke check or an exhaustive nightly run.
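
The tier-and-marker selection described above can be sketched roughly as follows. This is a minimal illustration, not the QA Harness API: it assumes tiers are cumulative (each tier includes the ones before it), and the names `QACase`, `select`, and the sample catalog are hypothetical.

```python
from dataclasses import dataclass

# Tier names come from the article; treating them as cumulative is an assumption.
TIERS = ["smoke", "critical", "standard", "extended", "full"]


@dataclass(frozen=True)
class QACase:
    name: str
    tier: str                # lowest tier at which the test runs
    markers: frozenset[str]  # capability markers, e.g. {"tool_calls"}


def select(tests, tier, markers=None):
    """Return tests at or below `tier` that carry at least one requested marker."""
    cutoff = TIERS.index(tier)
    wanted = set(markers or ())
    return [
        t for t in tests
        if TIERS.index(t.tier) <= cutoff and (not wanted or wanted & t.markers)
    ]


# Hypothetical catalog entries, for illustration only.
catalog = [
    QACase("chat_basic", "smoke", frozenset({"chat"})),
    QACase("tool_call_args", "critical", frozenset({"tool_calls"})),
    QACase("audio_long_clip", "extended", frozenset({"audio"})),
]

print([t.name for t in select(catalog, "smoke")])
# ['chat_basic']
print([t.name for t in select(catalog, "full", markers={"audio", "tool_calls"})])
# ['tool_call_args', 'audio_long_clip']
```

With this shape, a pre-release smoke check and an exhaustive nightly run differ only in the arguments passed to the selector.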

Benchmarks

A benchmark is a standardized evaluation task with a known input distribution and a defined scoring criterion, used to compare models against shared references and against each other.

QA Harness does not replace the evaluation methodology owned by the Model Evaluations team; it consumes their standardized API and orchestrates the runs. The catalog spans reasoning, long-context, code, instruction following, tool use, and agentic safety, with new categories added as the evaluation team incorporates them. Each benchmark exposes its decoding parameters (temperature, top-p, top-k), sample limits, eval batch size, and any benchmark-specific flags; defaults come from the Model Evaluations team, and overrides are saved per job.

Behind the scenes, the platform integrates evaluation libraries (nemo-skills, evalscope, lm-eval-harness, inspect_eval) used as-is, exactly as their maintainers intended. The harness focuses on what those libraries do not cover: provisioning the model, comparing runs, persisting history, and surfacing regressions.

Comparability is a first-class concern. Every run records the versions behind it (benchmark and dataset version, evaluator version, and prompt template), so two results stay comparable even when they are months apart.
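
One way to picture the versions recorded with each run is a small provenance record that can be diffed before two results are compared. The sketch below is an assumption about shape only; the class name, field names, and version strings are illustrative and not taken from the platform.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunProvenance:
    # The four items the article says are recorded with every run.
    benchmark_version: str
    dataset_version: str
    evaluator_version: str
    prompt_template: str


def provenance_diff(a: RunProvenance, b: RunProvenance) -> dict[str, tuple[str, str]]:
    """Return the fields that differ; an empty dict means the runs are directly comparable."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(RunProvenance)
        if getattr(a, f.name) != getattr(b, f.name)
    }


# Illustrative values, not real versions.
january = RunProvenance("1.2", "2024-01", "0.4.1", "chat-v3")
june = RunProvenance("1.2", "2024-01", "0.5.0", "chat-v3")
print(provenance_diff(january, june))
# {'evaluator_version': ('0.4.1', '0.5.0')}
```

A non-empty diff does not make a comparison invalid, but it tells the reader which change other than the model could explain a score shift.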

Performance

Load tests are built around traffic profiles that mirror real-world inference patterns: smoke for a quick health check, peak for finding the throughput knee, stability for fixed-load endurance, soak for slow drift, ramp for gradual increases, step for geometric staircases (×2 each stage to pinpoint where degradation starts), spike for resilience and recovery, load for ramp-up-plateau-ramp-down realism, and max_throughput for closed-loop concurrency sweeps.
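
As a worked example of the step profile, the geometric staircase can be generated in a few lines. This is a sketch under stated assumptions: the function names, the starting rate, the SLO threshold, and the latency figures are all hypothetical and only illustrate the ×2 progression.

```python
def step_stages(start_rps: float, stages: int, factor: float = 2.0) -> list[float]:
    """Target RPS per stage for a geometric staircase: start, start*factor, ..."""
    if start_rps <= 0 or stages < 1 or factor <= 1:
        raise ValueError("need start_rps > 0, stages >= 1, factor > 1")
    return [start_rps * factor**i for i in range(stages)]


def first_degraded_stage(stage_rps, p95_ms, slo_ms):
    """Return the RPS of the first stage whose p95 latency exceeds the SLO, or None."""
    for rps, latency in zip(stage_rps, p95_ms):
        if latency > slo_ms:
            return rps
    return None


plan = step_stages(start_rps=2, stages=5)
print(plan)  # [2.0, 4.0, 8.0, 16.0, 32.0]

# Illustrative latencies, not measurements.
print(first_degraded_stage(plan, [110, 120, 140, 380, 900], slo_ms=300))  # 16.0
```

Doubling brackets the degradation point quickly: here it lies somewhere between 8 and 16 RPS, a range that a finer ramp could then narrow.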

Every profile emits a verdict (passed, degraded, blocked) and the metrics that matter for serving LLMs at scale: goodput vs raw throughput (we care about successful requests, not just requests issued), p50/p95/p99 latency, TTFT (time to first token), TPOT (time per output token), ITL (inter-token latency), queue depth over time, in-flight concurrency, SLO adherence, and per-stage error rates. The saturation curve (goodput versus p50 latency with the knee at the peak stage) is what tells us whether a model can be served at the target SLA or whether we need to revisit the deployment shape.
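
To make a few of these metrics concrete, the sketch below derives throughput, goodput, latency percentiles, TTFT, and TPOT from per-request records. It is an illustration under assumptions: the record layout and function names are hypothetical, the percentile uses the simple nearest-rank method, and TPOT is computed with the common convention of decode time divided by the tokens after the first.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    ok: bool            # request succeeded
    latency_s: float    # end-to-end latency
    ttft_s: float       # time to first token
    output_tokens: int


def percentile(values, pct):
    """Nearest-rank percentile (no interpolation)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records, window_s):
    good = [r for r in records if r.ok]
    if not good:
        raise ValueError("no successful requests to summarize")
    latencies = [r.latency_s for r in good]
    # TPOT: decode time spread over the tokens generated after the first one.
    tpot = [(r.latency_s - r.ttft_s) / (r.output_tokens - 1)
            for r in good if r.output_tokens > 1]
    return {
        "throughput_rps": len(records) / window_s,   # everything issued
        "goodput_rps": len(good) / window_s,         # only what succeeded
        "p50_latency_s": percentile(latencies, 50),
        "p99_latency_s": percentile(latencies, 99),
        "mean_ttft_s": sum(r.ttft_s for r in good) / len(good),
        "mean_tpot_s": sum(tpot) / len(tpot) if tpot else None,
    }


# Illustrative records, not measurements.
records = [
    RequestRecord(True, 1.0, 0.2, 41),
    RequestRecord(True, 2.0, 0.4, 81),
    RequestRecord(True, 3.0, 0.6, 121),
    RequestRecord(False, 10.0, 5.0, 0),
]
stats = summarize(records, window_s=2.0)
print(stats["throughput_rps"], stats["goodput_rps"])  # 2.0 1.5
```

The gap between the two printed numbers is the point of tracking goodput: the failed request counts toward raw throughput but not toward what the system usefully served.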

Under the Hood: An Orchestrator That Manages Itself

From the user's perspective, submitting a run feels almost trivial: pick a model, choose your suites, hit launch. Behind that click sits the orchestrator, the part of QA Harness that turns a one-line job submission into a coordinated sequence of model startups, suite executions, result collection, and teardown, without anyone watching it run.

This is where "production-grade" stops being a slogan and becomes a checklist. Concretely, the orchestrator handles, on its own, all of the following:

Job durability — once submitted, a job survives component restarts and infrastructure failures
Failure isolation — one job crashing does not propagate to others on the queue
Reproducibility and historical comparability — every run is versioned and queryable months later
Resource governance — scarce GPUs are allocated, shared, and released by policy, not by hand
Auditability — every run carries a record of who triggered it, when, and on what
Access control through Teleport
Full observability via OpenTelemetry, Grafana traces, and Loki logs
Scheduling without supervision — nightly and weekly runs fire and report on their own
Notifications that reach the team in Slack as soon as a run completes

The rest of this section shows how the architecture makes each of those properties hold.

Internally, the orchestrator is made of four cooperating components, each with a single responsibility:

Dispatcher. The entry point. Validates the job, assigns priority, and pushes it into the queue. Once it returns, the job is durable and ordered.
Workers. Long-running processes that pick jobs off the queue and execute them end-to-end. Adding capacity means adding workers; nothing else has to change.
LLM manager. Owns model lifecycle: provisions a model on a GPU when a worker needs it, tears it down when no queued job still needs it. This single behavior eliminates the most common waste pattern in shared-GPU environments.
Scheduler. Enqueues jobs on a cadence (nightly regressions, weekly benchmark sweeps, monthly full sweeps). The scheduler does not run jobs itself; it submits them to the dispatcher like any other client, so a scheduled run is indistinguishable from a manual one.
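
The division of labor between dispatcher, queue, and workers can be sketched as below. This is a simplified, in-memory illustration rather than the actual implementation: a real durable queue would need a persistent store, and every name here (`Dispatcher`, `QueuedJob`, `worker_loop`, the priority convention) is hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedJob:
    priority: int                        # lower number runs first (assumed convention)
    seq: int                             # tie-breaker: submission order
    job_id: str = field(compare=False)
    model: str = field(compare=False)
    suites: tuple = field(compare=False)


class Dispatcher:
    """Validates jobs and enqueues them. In-memory only; durability needs a persistent store."""

    KNOWN_SUITES = {"functional", "benchmarks", "performance"}

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, model, suites, priority=10):
        unknown = set(suites) - self.KNOWN_SUITES
        if not model or unknown:
            raise ValueError(f"invalid job: model={model!r}, unknown suites={sorted(unknown)}")
        seq = next(self._seq)
        job = QueuedJob(priority, seq, f"job-{seq}", model, tuple(suites))
        heapq.heappush(self._heap, job)
        return job.job_id

    def next_job(self):
        return heapq.heappop(self._heap) if self._heap else None


def worker_loop(dispatcher, run_suite):
    """Drain the queue; a failing job is recorded and does not stop the others."""
    results = {}
    while (job := dispatcher.next_job()) is not None:
        try:
            results[job.job_id] = [run_suite(job.model, s) for s in job.suites]
        except Exception as exc:
            results[job.job_id] = f"failed: {exc}"
    return results


d = Dispatcher()
d.submit("model-a", ["functional"])              # a manual submission
d.submit("model-b", ["benchmarks"], priority=1)  # a scheduler would call the same method
print(worker_loop(d, lambda model, suite: f"{model}:{suite}:passed"))
```

Because a scheduled run goes through the same `submit` call as a manual one, nothing downstream of the queue needs to know where a job came from.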

The split matters. Treating the queue as the contract between components is what makes the platform tolerant to failure, and the benefits show up in every failure mode that used to be expensive:

Worker crashes are recoverable. A worker failing on one job does not take down the queue. The job is rescheduled, picked up by another worker, and the rest of the system keeps moving.
Model startup failures are local. If a model fails to come up on a GPU, only the jobs that need that specific model wait. Everything else continues to run.
Late or missed schedules do not corrupt state. A scheduled run that fires late simply lands in the queue when it is ready, in the same order it would have arrived if everything had been on time. There is no special path for "recovery from a missed window".
Concurrent jobs do not collide. The queue plus the LLM manager enforce that two heavy evaluations cannot land on the same GPU unless we explicitly allow it. Failures of contention, the kind that used to require an engineer at the keyboard, do not happen.
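
The lifecycle rule behind the last point (one heavy model per GPU unless sharing is allowed, teardown once no job needs the model) can be sketched with a reference count. This is a toy model under assumptions: the class layout, method names, and GPU identifiers are hypothetical, and real provisioning is replaced by a set insertion.

```python
from collections import Counter


class LLMManager:
    """Tracks which models occupy which GPU and how many jobs still need each model."""

    def __init__(self, gpus, allow_sharing=False):
        self.allow_sharing = allow_sharing
        self.gpu_models = {gpu: set() for gpu in gpus}  # gpu -> models loaded
        self.demand = Counter()                         # model -> jobs still needing it

    def job_queued(self, model):
        self.demand[model] += 1

    def acquire(self, model):
        """Return the GPU serving `model`, provisioning one if possible; None means wait."""
        for gpu, models in self.gpu_models.items():
            if model in models:
                return gpu
        for gpu, models in self.gpu_models.items():
            if not models or self.allow_sharing:
                models.add(model)  # placeholder for the real model startup
                return gpu
        return None

    def job_finished(self, model):
        """Tear the model down as soon as no remaining job needs it."""
        self.demand[model] -= 1
        if self.demand[model] <= 0:
            del self.demand[model]
            for models in self.gpu_models.values():
                models.discard(model)


mgr = LLMManager(["gpu-0"])
for m in ("model-a", "model-a", "model-b"):
    mgr.job_queued(m)

print(mgr.acquire("model-a"))  # gpu-0
print(mgr.acquire("model-b"))  # None: the only GPU is occupied
mgr.job_finished("model-a")    # one model-a job left, model stays loaded
mgr.job_finished("model-a")    # last model-a job done, model is torn down
print(mgr.acquire("model-b"))  # gpu-0
```

Returning `None` instead of co-locating the second model is what keeps contention from ever reaching the hardware; the waiting job simply stays in the queue.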

The runtime is implemented in Python, the same language as most of our evaluation tooling, which means new suites can be plugged in without crossing a language boundary. The queue, the workers, and the LLM manager all expose a small set of APIs that any internal team can build on top of, including, eventually, teams outside QA who want to use the same infrastructure for other workloads.

The benefits of this design are concrete:

GPU utilization tracks actual workload. Idle models do not occupy hardware that another job could use.
Jobs are isolated by design. Two heavy evaluations no longer compete for the same GPU unless we explicitly allow it.
Failures are local. A worker crashing on one job does not take down the queue; the next worker picks up the next job.
The system scales horizontally, up to a point. Adding workers adds throughput, and adding GPU capacity lets the LLM manager serve more concurrent models, until a GPU, the benchmark runner, or model-serving becomes the bottleneck. Within those limits, the architecture does not have to change as the catalog of models grows.

Scheduling: Tests That Run While You Sleep

Cadences are configurable through standard cron expressions: nightly regressions, weekly benchmark sweeps, monthly full-coverage runs, all coexisting without supervision. From the dispatcher's perspective, a scheduled job is indistinguishable from a manual one, which is the property that keeps the rest of the system simple.
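
A schedule table of this kind might look like the sketch below. The job names and times are hypothetical, and the matcher is deliberately minimal: it handles only `*` and comma-separated integers in the five standard cron fields and ignores ranges, steps, and cron's special handling when both day fields are restricted. A production system would more likely rely on an established cron library.

```python
from datetime import datetime

# Hypothetical schedule table: names and times are illustrative.
SCHEDULES = {
    "nightly-regression": "0 2 * * *",  # every day at 02:00
    "weekly-benchmarks": "0 3 * * 0",   # Sundays at 03:00
    "monthly-full": "0 4 1 * *",        # 1st of the month at 04:00
}


def cron_matches(expr: str, when: datetime) -> bool:
    """Match a 5-field cron expression; supports only '*' and comma-separated integers."""
    cron_fields = expr.split()
    if len(cron_fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(cron_fields)}")
    actual = [
        when.minute, when.hour, when.day, when.month,
        (when.weekday() + 1) % 7,  # cron counts Sunday as 0; Python counts Monday as 0
    ]
    for spec, value in zip(cron_fields, actual):
        if spec != "*" and value not in {int(x) for x in spec.split(",")}:
            return False
    return True


def due_jobs(now: datetime) -> list[str]:
    """Names of the schedules that fire at `now` (minute resolution)."""
    return [name for name, expr in SCHEDULES.items() if cron_matches(expr, now)]


# 1 September 2024 was a Sunday.
print(due_jobs(datetime(2024, 9, 1, 2, 0)))  # ['nightly-regression']
print(due_jobs(datetime(2024, 9, 1, 3, 0)))  # ['weekly-benchmarks']
print(due_jobs(datetime(2024, 9, 1, 4, 0)))  # ['monthly-full']
```

Each name returned by `due_jobs` would then be handed to the dispatcher like any other submission.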

This is where QA Harness becomes more than a release-time tool. Because runs can be triggered automatically through cron-based schedules, the platform continuously refreshes the team's understanding of what the serving system can actually handle. Supported RPS, latency trends, and saturation points stop being manually collected snapshots and become recurring operational indicators, picking up shifts that no single release review would catch: a regression introduced by a deployment configuration change, a new co-located workload eating into available GPU capacity, an infrastructure update that shifted the saturation knee. The model itself does not need to change for the system around it to degrade; scheduled runs are how we see that happen.

From Raw Data to Actionable Insights

Every completed job produces a structured report. Reports are searchable, filterable, and linkable: a job ID points to its report, a report points to its job, and any failing test points to its prompt, response, and request metadata. The team does not chase artifacts; the artifacts come to the team.

Underneath that experience, every job, report, metric, and trace is persisted in a centralized data layer, the same platform that backs the rest of our production data. That choice has real consequences: QA artifacts inherit the durability, governance, and discoverability of the rest of the company's data, and historical comparisons across weeks or months become queries rather than archaeology. The reporting and analysis views in QA Harness are one consumer of that layer; any other team (Compression, Inference, Marketing, Sales engineering) can build on top of the same source of truth without asking us first.

Three capabilities make this layer practical:

Comparison. Any two reports can be set side by side: same model on different days, different models on the same day, baseline versus candidate. Summary metrics, failure categories, benchmark accuracy, and per-stage performance numbers are diffed automatically.
Analysis. Beyond individual reports, the analysis view aggregates across runs: peak RPS by model, evolution of benchmark scores over time, performance summaries with average / best / last / gap.
Slack integration. When a scheduled run finishes, the team gets a notification with the verdict and a link to the report. No one has to remember to go check.
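
The comparison and analysis capabilities reduce to two small operations: diffing the metrics two reports share, and summarizing a metric across runs. The sketch below is illustrative only. The function names and sample numbers are hypothetical, "gap" is assumed to mean best minus last, and "best" assumes higher is better (as with peak RPS or benchmark scores).

```python
def diff_reports(baseline: dict, candidate: dict) -> dict:
    """Per-metric (baseline, candidate, delta) for metrics present in both reports."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {k: (baseline[k], candidate[k], candidate[k] - baseline[k]) for k in shared}


def summarize_series(values: list[float]) -> dict:
    """average / best / last / gap across runs, oldest first."""
    if not values:
        raise ValueError("need at least one run")
    best, last = max(values), values[-1]
    return {
        "average": sum(values) / len(values),
        "best": best,
        "last": last,
        "gap": best - last,  # assumed definition: how far the latest run is from the best
    }


# Illustrative numbers, not measurements.
baseline = {"peak_rps": 40, "p99_ms": 900, "accuracy": 71}
candidate = {"peak_rps": 44, "p99_ms": 850}

print(diff_reports(baseline, candidate))
# {'p99_ms': (900, 850, -50), 'peak_rps': (40, 44, 4)}
print(summarize_series([40, 48, 44]))
# {'average': 44.0, 'best': 48, 'last': 44, 'gap': 4}
```

A non-zero gap on a metric such as peak RPS is the kind of regression signal that a scheduled run can surface without anyone opening the individual reports.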

The most valuable output is not any single report. It is the recurring visibility the platform builds up over time: supported RPS, latency profiles, saturation points, regression signals, and production-readiness verdicts, refreshed automatically and comparable across every release. A report answers "how did this run go"; the platform answers "what can our serving system actually handle right now, and how has that changed".

Operational Impact: From Manual Release Planning to Continuous Serving Visibility

This article opened with a problem: in model serving, a seemingly local change can ripple across latency, supported RPS, GPU utilization, and the reliability of other models in the same environment, and validating that safely used to take days of manual coordination.

QA Harness collapses that effort. It reduces the manual planning and operational risk involved in moving a model or configuration toward production, turning a fragmented, multi-day exercise into an automated, repeatable, and measurable workflow. The impact shows up as concrete changes in how the organization operates:

Fewer manual release steps. A full production-readiness pass dropped from roughly seven distinct manual steps, spread across tools, machines, and people, to a single job submission. Setup time per evaluation went from at least an hour, sometimes a full working day, to under 30 minutes.
Lower collateral risk. Every change is validated against functional behavior, benchmark accuracy, and performance under load before it reaches production, so the effect of one model's deployment shape on the rest of the shared environment is measured rather than discovered in production.
Better use of shared GPU resources. Because the orchestrator owns model lifecycle, GPUs are provisioned and released by policy, not by hand. Off-hours utilization that historically sat below 5% can reach 30–40% on average, as scheduled runs reclaim capacity that would otherwise be idle.
Faster production-readiness checks. A new model or configuration can be evaluated end-to-end without writing a single new script: the suites, the orchestrator, and the reporting layer absorb the work that used to live in each engineer's local environment.
Recurring system-level performance signals. Through cron-based schedules (currently four unsupervised runs per week across 36 benchmark tasks), the platform continuously refreshes the team's view of supported RPS, latency profiles, and saturation points. Performance capability stops being a release-time snapshot and becomes a live operational indicator.
Stronger historical memory across releases. Every run is versioned and queryable months later, so regressions and trends surface on a chart instead of being lost in chat threads and spreadsheets.

Today one engineering team runs the platform, with results shared across Support, Sales, and Pre-Sales so customer-facing conversations are grounded in measured model behavior rather than anecdote.

The shift is the point. QA Harness turns a manual, risky, fragmented production-readiness process into an automated, measurable, and continuously observable operational capability for model serving. It is less a test runner than a control layer for model-serving readiness: the place the organization goes to answer, on demand and with evidence, whether a change is safe to ship.