The work described in this article is the result of a collaborative effort across multiple teams at Multiverse Computing. You can find more at the end of this article.
1. Introduction: Why This Matters
Most of the world’s video is too long to watch. A football match is 90 minutes but the highlights that matter can be summarized in seconds. An F1 race is two hours but the overtakes that define it could fit on a postcard. A two-hour all-hands has perhaps four decisions worth re-watching. A podcast runs around an hour but the soundbite that gets clipped to Instagram or TikTok lasts twelve seconds. The pattern is the same everywhere: someone has to watch all of it to find the parts worth keeping, and that someone is almost always doing it manually.
Highlights Studio is our answer to that bottleneck. The system uses AI to take a live broadcast or a recorded video, watch it as it arrives, and produce scored, titled, optionally subtitled clips ready for distribution.
Highlights Studio keeps two things separate: what counts as important, and the engine that finds it. Each use case defines, in natural language, what matters. A broadcaster cares about goals and red cards, an F1 team about overtakes and race starts, an enterprise about the one binding decision in a three-hour meeting. The engine reads the footage, scores each moment against that rubric, and ranks what comes out. The pipeline stays the same, but it's built to flex underneath: the rubric is tailored to each use case, and the model running inside can be swapped or upgraded to fit the deployment (a tighter hardware budget, a new vertical, or a newer open-source release). In practice, a new vertical means adapting the rubric and configuration layer rather than rebuilding the underlying intelligence. So, each use case gets a system tuned to exactly what it needs, on the setup that suits it, without a separate model to train or maintain for each one.
The same pipeline is designed to serve sports broadcasters who need a 30-second goal clip on the feed before the replay is over, F1 teams turning a race weekend into shareable overtakes, enterprise teams looking for the decisions buried in a three-hour meeting recording, and content creators cutting a podcast into the five highlights that go viral. The configuration changes, the engine does not.
But the clipping pipeline isn't where the real edge lies. The advantage is the model layer underneath it. Multiverse Computing's compression technology makes the AI that powers this pipeline light enough to deploy either way: in the cloud, where efficiency and compression keep inference cost sustainable even as the workload scales to millions of hours of live content; or on-prem, for cases where data sovereignty and protection matter and footage never has to leave the infrastructure it's processed on. Our compressed Whisper Large v3 Turbo Slim handles the continuous speech-to-text, and efficient models do the rest. That deployment flexibility, sustainable in the cloud at scale and sovereign on-prem when data can't leave, is what makes the difference for broadcasters and enterprises.
2. The Challenge: Long Video, Short Attention
The bottleneck is not just time, whether that is the length of the source footage or the hours spent editing and processing it. Anyone who has tried to automate highlight detection quickly runs into a more fundamental problem: what counts as a highlight is not universal.
A goal and a near-miss are both worth clipping, but not equally. An overtake in the final lap means something different from an overtake in the first. A binding decision in a board meeting matters; the ten minutes of context leading up to it usually do not (except when they do). The definition of "important" shifts with every sport, every format, every customer.
Speed adds a second layer of difficulty. A live broadcast cannot be paused while the model thinks. The system has to analyze footage, score it, and produce a clip within a window tight enough to still feel timely, which puts real constraints on how much context the model can see at once, and how heavyweight the processing can be.
Existing tools tend to solve one side or the other. Rules-based systems are fast but rigid: they recognize what they were programmed to recognize and nothing else. Large cloud-based models are flexible but introduce latency, cost, and data-sovereignty concerns that many broadcasters and enterprise customers cannot accept. The gap is exactly what Highlights Studio was built to fill: fast, flexible, deployable on reasonable hardware, and configurable without retraining.
3. The Solution: How Highlights Studio Works
A simple flow, framed for many use cases
At the top level, Highlights Studio does three things:
- It ingests video. Live, via standard broadcast protocols (SRT or RTMP), or a pre-recorded file.
- It looks at the video in fixed-length windows and asks: is anything important happening here? If so, what is it, and how important?
- It outputs a clip, with a title, a score, and (optionally) burned-in subtitles in the speaker’s language.
Current architecture (live demo)
Target Architecture (production-ready: full delivery API and Publisher Console)
The Clever Part: The Rubric is the Product
We don’t ask the Vision Language Model (VLM) to “find a highlight” in the abstract. We hand it a structured prompt with three slots:
- Context: what kind of video is this? (football match, F1 race, board meeting, podcast…)
- Definition: what counts as a highlight in this context? (a goal, an overtake, a binding decision, a quotable statement…)
- Scoring Rubric: how important is each kind of highlight on a comparable scale? (a goal might be a 10, a near-miss a 6, a foul a 4; a binding decision a 10, an action item an 8, a tangent a 3).
This is the design choice that makes Highlights Studio domain-flexible without re-training. A new sport, a new event, or a new enterprise use case is a new prompt, not a new model. The same engine that finds goals can find overtakes or business decisions, because the rubric travels with the use case.
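To make the three slots concrete, here is a minimal sketch of how a rubric could be represented and rendered into a prompt. The `Rubric` class, the `build_prompt` function, and the exact prompt wording are illustrative assumptions, not the production format; only the three slots and the example scores come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical container for the three prompt slots."""
    context: str             # what kind of video this is
    definition: str          # what counts as a highlight in this context
    scoring: dict[str, int]  # highlight type -> importance on a 0-10 scale


def build_prompt(rubric: Rubric) -> str:
    """Render the three slots into one structured text prompt (assumed layout)."""
    scoring_lines = "\n".join(
        f"- {kind}: {score}" for kind, score in rubric.scoring.items()
    )
    return (
        f"Context: {rubric.context}\n"
        f"Definition: {rubric.definition}\n"
        f"Scoring rubric (0-10):\n{scoring_lines}\n"
        "Return a score, a title, and a short reason."
    )


# Two use cases, one prompt builder: only the rubric changes.
football = Rubric(
    context="football match",
    definition="a goal, a near-miss, or a foul",
    scoring={"goal": 10, "near-miss": 6, "foul": 4},
)
meeting = Rubric(
    context="board meeting",
    definition="a binding decision, an action item, or a tangent",
    scoring={"binding decision": 10, "action item": 8, "tangent": 3},
)

print(build_prompt(football))
print(build_prompt(meeting))
```

Switching from football to board meetings touches only the data passed to `build_prompt`, which is the sense in which a new vertical is a new prompt rather than a new model.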
Example of prompt/rubric
How the demo works
Imagine a live football match is being broadcast over RTMP. The signal arrives at the Ingest Gateway, which doesn't care whether it's a live stream or a file someone dragged into the UI. Its job is to turn any source into a stable, seekable timeline that the rest of the system can work with.
Once the video is flowing, the STT module runs first across the entire source to produce a timestamped transcript of the audio. We use Whisper Large v3 Turbo Slim, our compressed version of the original Whisper model. On long-form and live workloads, speech recognition is not a one-off step: it has to run across the whole timeline, often while the feed is still arriving, and it has to stay ahead of the VLM windows that follow. That is why cheap and fast matters here: a slow or heavy ASR stack becomes the bottleneck of the pipeline. Our compressed model still delivers broadcast-grade accuracy, and it lets us spend the compute budget where it counts: on the VLM scoring each window against the rubric.
After that, the VLM Detector takes over. It slices the timeline into overlapping 30-second windows, advancing every 15 seconds, so every moment of footage (apart from the first and last 15 seconds of the source) appears in two consecutive windows, and nothing falls through the gap between them. For each window, it feeds two things to the Vision-Language Model: the decoded video segment and the matching transcript snippet from STT. The VLM fuses both signals in a single pass using the rubric: it scores the window from 0 to 10, gives it a title, and explains why it matters.
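The windowing step can be sketched in a few lines. The function names and the `(start, end, text)` transcript format below are assumptions made for illustration; the 30-second window and 15-second stride are the defaults described above.

```python
def sliding_windows(duration_s, window_s=30.0, stride_s=15.0):
    """Return (start, end) pairs covering [0, duration_s] with overlap."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the last window already reaches the end of the source
        start += stride_s
    return windows


def transcript_snippet(segments, start, end):
    """Join the text of every transcript segment that overlaps [start, end)."""
    return " ".join(text for s, e, text in segments if s < end and e > start)


# Hypothetical STT output: (start_s, end_s, text)
segments = [
    (2.0, 6.0, "Kick-off here at the stadium."),
    (28.0, 33.0, "He shoots from distance!"),
    (50.0, 55.0, "Back to midfield."),
]

for start, end in sliding_windows(90.0):
    print(start, end, "|", transcript_snippet(segments, start, end))
```

Each `(start, end)` pair, together with its snippet and the decoded frames for the same interval, is what would be handed to the VLM for scoring.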
Windows that fall below the score threshold are discarded immediately. The threshold is a configurable cutoff on that 0–10 rubric score (for example, 7.0 for a tight social reel vs. 5.0 for a longer highlights package), and it is set per use case or per workflow, not baked into the model. Among the surviving clips, a non-maximum suppression (NMS) step ensures the system does not pick two clips from the same passage of play.
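A minimal sketch of threshold filtering followed by temporal NMS is shown below. The greedy, IoU-based formulation and the `max_iou` value are assumptions: the article names NMS but does not specify the overlap measure or its cutoff. The candidate scores are made up for the example.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def select_clips(candidates, threshold=7.0, max_iou=0.3):
    """Drop low scorers, then greedily keep the best non-overlapping windows."""
    survivors = [c for c in candidates if c["score"] >= threshold]
    survivors.sort(key=lambda c: c["score"], reverse=True)
    kept = []
    for cand in survivors:
        span = (cand["start"], cand["end"])
        # Keep the candidate only if it does not overlap a better clip too much.
        if all(temporal_iou(span, (k["start"], k["end"])) <= max_iou for k in kept):
            kept.append(cand)
    return kept


candidates = [
    {"start": 0.0, "end": 30.0, "score": 6.0},
    {"start": 15.0, "end": 45.0, "score": 9.1},
    {"start": 30.0, "end": 60.0, "score": 8.2},  # same passage as the 9.1 window
    {"start": 60.0, "end": 90.0, "score": 7.5},
]

for clip in select_clips(candidates):
    print(clip)
```

Two adjacent 30-second windows with a 15-second stride share 15 of 45 seconds, an IoU of one third, so an assumed `max_iou` of 0.3 suppresses the weaker of two neighbouring windows while leaving non-adjacent ones untouched.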
The winning clips then pass to the Captioner STT stage. Here, the STT module generates timed subtitles in the original language. This is what makes the clips self-contained: a viewer scrolling on mute still gets the context.
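As an illustration of what timed subtitles for a clip can look like, the sketch below converts transcript segments into SubRip (SRT) text with timestamps relative to the start of the clip. The choice of SRT and the function names are assumptions; the article does not state which subtitle format the pipeline uses internally.

```python
def srt_timestamp(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    ms_total = round(seconds * 1000)
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments, clip_start):
    """Render (start_s, end_s, text) segments as SRT, relative to clip_start."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, 1):
        rel_start = max(0.0, start - clip_start)  # clamp cues that begin early
        rel_end = max(0.0, end - clip_start)
        blocks.append(
            f"{index}\n{srt_timestamp(rel_start)} --> {srt_timestamp(rel_end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Hypothetical transcript segments on the source timeline, clip starting at 98 s.
print(to_srt([(100.0, 102.5, "Goal!"), (103.0, 106.0, "What a finish.")], clip_start=98.0))
```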
Finally, the Clip Composer does the craft work. It trims the source video to the exact timestamps, scales and crops it to the target format (16:9 for broadcast, 9:16 for Reels or TikToks), burns in the subtitles and a title card, and renders a polished MP4 ready for distribution.
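The trim, crop, and burn-in steps map naturally onto a single FFmpeg invocation. The sketch below only builds the argument list and does not run it. Using FFmpeg at all, the output resolutions, and the encoder choices are assumptions for illustration, since the article does not say which tooling the Clip Composer is built on; the title card is left out.

```python
def ffmpeg_command(source, start_s, end_s, srt_path, out_path, vertical=False):
    """Build an FFmpeg argument list for one clip (illustrative, not executed)."""
    if vertical:
        # Centre-crop to a 9:16 column, then scale to 1080x1920.
        video_filter = "crop=ih*9/16:ih,scale=1080:1920"
    else:
        # Assumes a 16:9 source; scale to 1920x1080 for broadcast.
        video_filter = "scale=1920:1080"
    # Burn in subtitles; the SRT file is assumed to use clip-relative timestamps.
    video_filter += f",subtitles={srt_path}"
    return [
        "ffmpeg", "-y",
        "-ss", f"{start_s:.3f}",  # seek to the clip start on the input
        "-to", f"{end_s:.3f}",    # stop reading at the clip end
        "-i", source,
        "-vf", video_filter,
        "-c:v", "libx264",
        "-c:a", "aac",
        out_path,
    ]


cmd = ffmpeg_command("match.mp4", 15.0, 45.0, "clip.srt", "goal_9x16.mp4", vertical=True)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would render the clip on a machine with FFmpeg built with libx264 and subtitle support.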
Everything above runs automatically in the background: ingest, transcription, detection, and rendering complete without anyone in the loop. The output is a set of scored, titled, ready-to-publish clips.
In the target architecture, these clips land in a Publisher Console, where the producer steps in only when the clips are ready: to review what the system found, reorder the highlights, remove what does not fit, and publish the final reel. The rubric already defines what counts as important, but the producer still decides what actually goes out.
The Vision-Language model
A Vision-Language Model (VLM) is an AI model that can look at images or video frames and understand them in the same way it understands text. Where a traditional text model reads words, a VLM reads both pictures and words at once. For example, it can be shown a frame from a football match and asked “what is happening here?”, and it will answer in plain language. This ability to reason across vision and language is what makes it useful for highlight detection: rather than relying on hard-coded rules and algorithms, the model is shown the footage and asked to judge it against a rubric written in plain text.
For processing each video window, Highlights Studio uses an open-source VLM (Qwen3-VL-30B-A3B-Instruct). What makes this specific model highly efficient is its Mixture-of-Experts (MoE) architecture. In a standard AI model, the entire network is activated for every single calculation. An MoE model, by contrast, operates like a team of specialists: it divides its neural network into distinct "experts" and dynamically routes each task only to the parts of the network best equipped to handle it.
The MoE architecture activates only ~3B of its 30B parameters per token, reducing active compute per token and improving inference efficiency compared with a dense model of similar total capacity. In practice, deployment cost still depends on quantization, expert loading, batching strategy and hardware configuration.
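The routing idea can be shown with a generic top-k gate in plain Python. This is a toy sketch of the general technique, not the router of the model named above: the number of experts, the value of `k`, and the gate logits are made up for the example. Only the ~3B-of-30B figure comes from the text.

```python
import math


def top_k_route(gate_logits, k):
    """Select the k highest-scoring experts and softmax-normalise their weights."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(gate_logits[i]) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}


# Toy gate output for one token over four hypothetical experts.
routing = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(routing)  # only the two selected experts run for this token

# Share of parameters active per token, using the figures quoted above.
active_fraction = 3 / 30
print(f"active fraction: {active_fraction:.0%}")
```

The experts that are not selected do no work for that token, which is where the reduction in active compute per token comes from.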
From a strategic standpoint, our model layer is designed for flexibility and cost efficiency. Because the rubric is the product and the VLM is the engine executing it, the underlying model can be swapped or upgraded without changing the pipeline. A different vertical, a tighter hardware budget, or a newer open-source release is just a configuration change. This decoupled architecture ensures the pipeline remains highly adaptable, letting us adopt better or more efficient models as the technology evolves, without reworking the rest of the system. And because the speech-to-text runs on a compressed model and the rest of the pipeline uses efficient models, the whole stack stays light enough to deploy either in the cloud, where efficiency and compression keep cost sustainable as volume scales, or fully on-prem, where data sovereignty and protection require footage to stay in place.
How the processing window works
The pipeline exposes a set of configurable parameters that can be adjusted to fit the deployment context. The most relevant are the clip length (how long each output clip runs) and the processing window (the amount of footage the model analyses at once to decide whether a highlight is present). Both are tuned per use case: a sports broadcaster and an enterprise meeting room have different needs in terms of pacing and context.
By default, Highlights Studio scores the video in overlapping 30-second windows that advance every 15 seconds. Apart from the first and last 15 seconds of the source, each second of footage is therefore examined twice, in two consecutive glances with different framings. The intuition: most highlight-worthy events have a build-up that is part of what makes them a highlight, and 30 seconds is the smallest window the model can reliably score for a complete moment. The 50% overlap guarantees that events landing on a window boundary (a shot in window N and the celebration in N+1, for example) are still seen in full by at least one window, as long as the event is no longer than the 15-second overlap.
The trade-off we chose:
- Smaller, non-overlapping windows (5–10 s) feel snappier in a demo and look great on contrived clips. They miss real-world highlights where the lead-up matters and fail completely when an event lands on a boundary.
- Larger windows (60 s+) capture more context per glance but force the VLM to dilute its attention across more frames at lower fps and slow the feed for live broadcasts.
- 30 s windows with 50% overlap are the sweet spot: the model has enough lead-up to score correctly, the overlap eliminates boundary misses, and the system still tracks a live broadcast.
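The boundary argument is easy to check numerically. The sketch below compares non-overlapping and 50%-overlapping 30-second windows on a hypothetical event that straddles the 30-second mark; the helper names and the event timings are assumptions for the example.

```python
def make_windows(duration_s, window_s, stride_s):
    """Full-length (start, end) windows over a source of duration_s seconds."""
    windows = []
    start = 0.0
    while start + window_s <= duration_s:
        windows.append((start, start + window_s))
        start += stride_s
    return windows


def fully_covered(event, windows):
    """True if at least one window contains the whole (start, end) event."""
    return any(s <= event[0] and event[1] <= e for s, e in windows)


no_overlap = make_windows(120.0, window_s=30.0, stride_s=30.0)
half_overlap = make_windows(120.0, window_s=30.0, stride_s=15.0)

shot_and_celebration = (25.0, 35.0)  # 10 s event crossing the 30 s boundary
long_build_up = (10.0, 32.0)         # 22 s event, longer than the 15 s overlap

print(fully_covered(shot_and_celebration, no_overlap))    # split across two windows
print(fully_covered(shot_and_celebration, half_overlap))  # inside the 15-45 s window
print(fully_covered(long_build_up, half_overlap))         # too long for any one window
```

The last case shows the limit of the guarantee: an event is only certain to fit inside a single window when it is no longer than the window length minus the stride, which is 15 seconds with the defaults.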
Multilingual Subtitles
When a producer needs subtitles, our speech-to-text stack kicks in. But it does far more than transcribe: it is the core engine that generates the context driving the entire system. To make this efficient, we use our Whisper Large v3 Turbo compressed by CompactifAI. By shrinking the model from 0.809B parameters (original) to just 0.394B, we made it over 2x faster. Despite the speed boost, its Word Error Rate (WER) stays near-identical to the original. With a memory footprint under 1GB, it is light enough to deploy at the edge and can support additional language packs on demand.
WER measured on the Earnings-22 dataset (long-form speech).
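For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python version of that standard definition, with made-up sentences; it is not the evaluation harness behind the figures above. The last lines compute the parameter reduction from the two sizes quoted in the text.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "the quarter closed above guidance",
    "the quarter closed about guidance",
)
print(f"WER: {wer:.2f}")  # one substitution in five words

# Parameter reduction implied by the sizes quoted above (in billions).
reduction = 1 - 0.394 / 0.809
print(f"parameter reduction: {reduction:.1%}")
```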
4. Use Cases: One Engine, Many Audiences
The same Highlights Studio engine (same model, same pipeline, same windows) can serve many different audiences once the rubric in the prompt changes. Here’s how the same machinery looks across some fields:
Sports Broadcasting
A 90-minute football match might produce one goal, three near-misses, a red card, and forty minutes of midfield passing that nobody will ever watch again. Highlights Studio processes the broadcast as it arrives and surfaces the moments that matter before the final whistle.
A typical run on a league match might return:
- Goal: left-foot finish from outside the box. Score: 9.8
- Red card: second yellow for a late tackle. Score: 8.4
- Near-miss: header off the crossbar, 87th minute. Score: 7.1
- Free kick: wall blocks, keeper claims. Score: 5.2
The rubric decides what makes the cut. A broadcaster focused on social clips sets the threshold at 7.0 and gets three clips. A highlights reel producer sets it at 5.0 and gets the full story.
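Applied to the four example scores listed above, the two thresholds behave as follows. This is a small sketch; the tuple layout and the function name are assumptions.

```python
clips = [
    ("Goal", 9.8),
    ("Red card", 8.4),
    ("Near-miss", 7.1),
    ("Free kick", 5.2),
]


def above_threshold(scored_clips, threshold):
    """Titles of the clips whose score meets or exceeds the cutoff."""
    return [title for title, score in scored_clips if score >= threshold]


social = above_threshold(clips, 7.0)  # tight social reel
reel = above_threshold(clips, 5.0)    # longer highlights package
print(len(social), social)
print(len(reel), reel)
```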
Below you can see an example of what Highlights Studio returns on a real match broadcast:
Debates and speeches (interventions, plenary sessions, hearings)
A debate or session generates hours of footage: rostrum interventions, hearings, rebuttal turns, and chamber reaction. The moments that define the political narrative (a confrontation at the dispatch box, a quotable line, a vote result, a procedural incident) are scattered across all of it. Highlights Studio finds them, so the comms team is not still editing when the session adjourns.
A typical run on a debate might return:
- Confrontation: direct rebuttal at the rostrum. Score: 9.4
- Vote: result with chamber reaction. Score: 8.3
- Interpellation: speaker under sustained pressure. Score: 7.6
- Point of order: interruption of the speaking turn. Score: 6.4
For debates and speeches, the rubric can weight drama and rhetorical intensity over pure procedural significance. For example, a heated point of order that suspends the session can score higher than a routine motion reading, even if both move the legislative agenda forward.
Below you can see an example of what Highlights Studio returns on a real session:
Lectures, Conferences and Keynotes
A full-day conference produces hours of content. Most attendees will never watch the recording. What they will watch is a two-minute clip of the moment a speaker said something genuinely surprising, or the thirty seconds where an audience question landed better than the talk itself.
A typical run on a keynote might return:
- Key claim: "the model outperforms GPT-5 on every benchmark we tested". Score: 9.2
- Audience question: challenge on data privacy, unrehearsed response. Score: 8.5
- Live demo: product shown for the first time on stage. Score: 8.0
- Closing statement: call to action, direct address to camera. Score: 6.8
Below you can see an example of what Highlights Studio returns on a real lecture recording:
5. Conclusion & Results
Highlights Studio started from a simple observation: the most valuable moments in any video make up only a tiny fraction of the total, yet finding them has always required a human to watch everything. We aren't here to change what a good highlight is; that creative judgment still belongs entirely to the editor, the producer, or the social team.
What we are changing is how fast you can get there.
By breaking down the traditional editing bottlenecks, we’ve built a system defined by autonomy and adaptability:
- Zero Retraining (The Rubric): A plain-text rubric replaces expensive and rigid model retraining. You simply tell the system what you are looking for in natural language.
- Configurable Thresholds: The system doesn't blindly cut video. A configurable score cutoff lets each use case decide how selective to be, delivering only the moments above its bar.
- High Flexibility (Model-Agnostic): Because the rubric is the product and the AI is just the engine, underlying models (like our compressed Whisper or MoE VLMs) can be swapped or upgraded as technology evolves, avoiding vendor lock-in.
- Ready-to-Publish Assets: A scored, titled MP4 with burned-in subtitles removes most of the tedious timeline scrubbing and export queues.
The result is a system designed to work across sports broadcasting, enterprise communications, and live events without modifications to the core engine, only to the configuration layer. Running on efficient and compressed models, it can be deployed in the cloud at sustainable cost as volume scales, or fully on-prem where data sovereignty matters.
This is our first public step. The architecture is deliberately designed to grow: more models, more verticals, tighter latency, a fuller publisher experience. But the core insight is already proven: if you can write down what a highlight looks like, the system can find it.
