vLLM-Omni Integration
Voxtral TTS is served through vLLM-Omni, an extension of vLLM for multi-stage multimodal models.
The system decomposes into two pipeline stages:
- Generation stage: predicts semantic and acoustic tokens autoregressively
- Codec decoding stage: converts tokens to waveform
The two stages communicate via an asynchronous chunked streaming protocol over
shared memory, enabling first-audio latency well before the full waveform is generated.
Each emitted chunk overlaps with previous frames to maintain temporal coherence across boundaries.
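The overlap-based stitching can be sketched as follows. This is a minimal illustration assuming a linear cross-fade over the duplicated region; the function name, signature, and cross-fade choice are illustrative, not the actual vLLM-Omni streaming protocol:

```python
import numpy as np

def crossfade_stream(chunks, overlap):
    """Stitch streamed audio chunks whose first `overlap` samples
    re-render the previous chunk's last `overlap` samples, blending
    the duplicated region with a linear cross-fade (illustrative only)."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    it = iter(chunks)
    out = next(it).astype(float)
    for chunk in it:
        # Blend the seam: fade out the old tail, fade in the new head.
        seam = out[-overlap:] * (1.0 - fade_in) + chunk[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], seam, chunk[overlap:]])
    return out
```

When consecutive chunks agree on the overlapped samples, the cross-fade reproduces them exactly; when they differ slightly, it smooths the boundary instead of producing an audible click.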
CUDA Graph Acceleration
The flow-matching transformer is the computational bottleneck. The entire ODE solver is
captured into CUDA graphs, eliminating Python-level overhead and kernel-launch latency.
Because a captured graph requires static shapes, batch sizes are rounded up to the nearest pre-captured bucket boundary; outputs are sliced back to the actual batch size.
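The bucket rounding amounts to a small lookup; the bucket sizes and function name below are illustrative, not vLLM-Omni's actual configuration:

```python
import bisect

# Hypothetical bucket sizes at which CUDA graphs were pre-captured.
BUCKETS = [1, 2, 4, 8, 16, 32]

def round_up_to_bucket(batch_size, buckets=BUCKETS):
    """Return the smallest pre-captured bucket that fits the request.
    The graph runs at this padded size; outputs are sliced back to
    `batch_size` afterwards."""
    i = bisect.bisect_left(buckets, batch_size)
    if i == len(buckets):
        raise ValueError("batch exceeds largest captured graph")
    return buckets[i]
```

A batch of 5 would run through the size-8 graph, with the last 3 padded rows discarded from the output.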
CUDA graphs cut latency by 47% (133 ms → 70 ms) and reduce RTF by 2.5× (0.258 → 0.103)
on a single NVIDIA H200.
CUDA graphs and RTF explained
CUDA graphs pre-record a sequence of GPU operations (kernel calls, memory copies) into a single executable graph. Instead of Python dispatching each CUDA kernel individually at runtime (incurring overhead per launch), the entire ODE solver for the flow-matching transformer runs as one atomic GPU operation. This is especially effective when the computation graph is fixed at a given batch size.
RTF (Real-Time Factor) is the ratio of processing time to audio duration: RTF = 0.103 means generating 1 second of audio takes 0.103 seconds — or about 10× faster than real-time. RTF < 1 is required for practical streaming; the 2.5× improvement from CUDA graphs is the difference between "barely fast enough" and "comfortable streaming."
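The arithmetic above can be checked directly from the reported figures:

```python
def rtf(processing_seconds, audio_seconds):
    """Real-Time Factor: compute time divided by audio duration (lower is better)."""
    return processing_seconds / audio_seconds

# Figures from the text: eager RTF 0.258 vs CUDA-graph RTF 0.103 on one H200.
graph_rtf = rtf(0.103, 1.0)           # 0.103 s of compute per 1 s of audio
realtime_multiple = 1.0 / graph_rtf   # ~9.7x faster than real-time
rtf_improvement = 0.258 / graph_rtf   # ~2.5x over eager mode
```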
CUDA Graph vs Eager Mode
Table 7: Effect of CUDA graph acceleration on the flow-matching transformer.
500-char text input, 10s audio reference, concurrency 1, single H200.
| Configuration | Latency | RTF |
|---|---|---|
| Eager mode | 133 ms | 0.258 |
| CUDA graph | 70 ms | 0.103 |
Serving Performance on Single H200
Table 8: Serving performance of Voxtral TTS with 500-char text input and 10s audio reference.
| Concurrency | Latency | RTF | Throughput (char/s/GPU) | Wait Rate |
|---|---|---|---|---|
| 1 | 70 ms | 0.103 | 119.14 | 0% |
| 16 | 331 ms | 0.237 | 879.11 | 0% |
| 32 | 552 ms | 0.302 | 1,430.78 | 0% |
Wait Rate stays at 0% across all concurrency levels, and throughput scales 12× from concurrency 1 to 32 while RTF remains well below 1. A single H200 can serve 30+ concurrent users with sub-second latency.
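The scaling claim follows from the table's throughput column:

```python
# Throughput figures from Table 8 (char/s/GPU, single H200).
throughput = {1: 119.14, 16: 879.11, 32: 1430.78}

scaling_1_to_32 = throughput[32] / throughput[1]  # ~12x, as stated
per_user_at_32 = throughput[32] / 32              # ~45 char/s per concurrent user
```

The near-linear gain from 1 to 16 (7.4×) flattens somewhat by 32 (12×), consistent with the GPU approaching saturation while per-request latency stays sub-second.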