
Voxtral TTS

An expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio.

58.3%

Flagship Voice Win Rate
vs ElevenLabs Flash v2.5

68.4%

Voice Cloning Win Rate
vs ElevenLabs Flash v2.5

Abstract

We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme.

In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.

"68.4% win rate over ElevenLabs Flash v2.5 in zero-shot voice cloning — preferred across 9 languages."

What is "zero-shot voice cloning"?

Traditional TTS systems require hours of recorded speech from a target speaker to produce a convincing imitation. Zero-shot voice cloning means giving the model just a short clip (here, as little as 3 seconds) and generating new speech in that voice — without any fine-tuning on that speaker. The model generalizes from training on thousands of speakers, learning to extract a "voice fingerprint" from the reference and apply it to arbitrary new text.

Human Evaluation Results

Win Rate bar chart: Voxtral TTS vs ElevenLabs Flash v2.5. Flagship voices: 58.3% vs 41.7%. Voice cloning: 68.4% vs 31.6%.
Figure 1: Voxtral TTS is preferred to ElevenLabs Flash v2.5 in human evaluations. Win rate plotted across two categories — Flagship voices (default voices) and Voice cloning (3s reference clip). 77 unique text examples; native speaker annotators rated audio blindly.

Independent Human Evaluation

  • 77 unique text examples evaluated by native speakers per language
  • Flagship voices: default voices compared across same gender and accent
  • Voice cloning: 3-second reference clip provided; annotators rated likeness and naturalness
  • Annotators chose "slightly better", "much better", or "both good" — ties excluded from win rate
  • Voxtral TTS preferred in 58.3% of flagship comparisons
  • Voxtral TTS preferred in 68.4% of voice cloning comparisons

Model Architecture

Voxtral TTS consists of a novel audio codec (Voxtral Codec) and an autoregressive decoder backbone. The codec encodes a reference voice sample into audio tokens at 12.5 Hz — each frame comprising 1 semantic token and 36 acoustic tokens. The decoder auto-regressively generates semantic tokens, while a lightweight flow-matching transformer predicts acoustic tokens conditioned on decoder states. A codec decoder maps the output tokens to the corresponding audio waveform.

Semantic tokens vs. acoustic tokens — why both?

  • Semantic tokens (1 per frame, VQ codebook size 8192): Capture what is being said — phonemes, prosody, linguistic content. Distilled from Whisper ASR, so they align with text. The autoregressive backbone generates these one by one, like an LLM generating text tokens.
  • Acoustic tokens (36 per frame, FSQ 21 levels): Capture how it sounds — voice timbre, breathiness, subtle resonance. The flow-matching transformer predicts all 36 simultaneously, conditioned on the semantic token, recovering fine-grained audio detail that a single VQ codebook cannot represent.
  • Why 12.5 Hz? Each frame covers 80 ms of audio. 37 tokens × 12.5 frames/s = 462.5 tokens/s — manageable for autoregressive generation without sacrificing audio quality.
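The frame and bitrate arithmetic above can be checked in a few lines; the same numbers also reproduce the 2.14 kbps figure quoted for the codec:

```python
import math

FRAME_RATE_HZ = 12.5          # codec frames per second
SEMANTIC_PER_FRAME = 1        # VQ token, codebook size 8192
ACOUSTIC_PER_FRAME = 36       # FSQ tokens, 21 levels each

tokens_per_frame = SEMANTIC_PER_FRAME + ACOUSTIC_PER_FRAME   # 37
tokens_per_second = tokens_per_frame * FRAME_RATE_HZ         # 462.5
frame_duration_ms = 1000 / FRAME_RATE_HZ                     # 80.0 ms

# Bitrate: log2(codebook size) bits per token, summed over one frame.
bits_per_frame = (SEMANTIC_PER_FRAME * math.log2(8192)       # 13 bits
                  + ACOUSTIC_PER_FRAME * math.log2(21))      # ~158.1 bits
bitrate_kbps = bits_per_frame * FRAME_RATE_HZ / 1000

print(tokens_per_second)       # 462.5
print(round(bitrate_kbps, 2))  # 2.14
```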
Architecture overview of Voxtral TTS: voice reference fed to Voxtral Codec encoder; tokens passed to autoregressive decoder backbone; flow-matching transformer predicts acoustic tokens; codec decoder generates waveform.
Figure 2: Architecture overview of Voxtral TTS. A voice reference (3–30s) is fed to Voxtral Codec to obtain audio tokens at 12.5 Hz. Each audio frame (A) consists of a semantic token and acoustic tokens. The decoder auto-regressively generates semantic tokens until <EOA>. A flow-matching transformer then predicts acoustic tokens at each timestep.

Voxtral Codec

A convolutional-transformer autoencoder that compresses 24 kHz mono audio to 12.5 Hz frames of 37 discrete tokens — 1 semantic (VQ codebook size 8192) and 36 acoustic (FSQ, 21 levels each) — at a total bitrate of 2.14 kbps.

The semantic component is distilled from a supervised Whisper ASR model using a soft-alignment cosine distance loss, enabling text-aligned semantic tokens without requiring forced aligners. The acoustic component uses finite scalar quantization (FSQ) with 21 uniform levels.

An 8-discriminator multi-resolution adversarial training objective ensures high-fidelity waveform reconstruction. The full codec has approximately 300M parameters.

VQ vs. FSQ: two quantization strategies

Vector Quantization (VQ) maintains a learned codebook of discrete embedding vectors; each input is replaced by its nearest-neighbor entry. Codebook size 8192 means 8192 possible semantic "words." Finite Scalar Quantization (FSQ) quantizes each dimension independently to a fixed number of uniform levels (here 21). No codebook lookup — it's simply rounding each scalar to the nearest of 21 values. FSQ avoids VQ's "codebook collapse" problem (where many codebook entries are never used) while providing more stable training.
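As a toy illustration (not the codec's actual implementation), the two schemes can be sketched in NumPy. The codebook here has 8 entries rather than 8192, and the tanh bounding step is an assumption about how inputs are kept in range for FSQ:

```python
import numpy as np

def vq_quantize(x, codebook):
    """Vector quantization: snap each row of x to its nearest codebook vector."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (n, K)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def fsq_quantize(x, levels=21):
    """Finite scalar quantization: bound each scalar, then round it to one of
    `levels` uniform values in [-1, 1]. No learned codebook, so no collapse."""
    x = np.tanh(x)                   # keep activations in (-1, 1) (assumption)
    half = (levels - 1) / 2          # 10 for 21 levels
    return np.round(x * half) / half # grid {-1.0, -0.9, ..., 0.9, 1.0}

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # toy codebook; the paper's has 8192 entries
x = rng.normal(size=(5, 4))

q_vq, idx = vq_quantize(x, codebook)
q_fsq = fsq_quantize(x)
```

Note that FSQ's output grid is fixed by construction, while VQ's output set depends entirely on how well the codebook was learned.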

Voxtral Codec architecture: encoder blocks, VQ and FSQ quantization, decoder blocks, with adversarial and ASR distillation losses.
Figure 3: Architecture and training of Voxtral Codec. Split semantic VQ and acoustic FSQ codebooks. Semantic token has additional ASR distillation loss.

Decoder Backbone & Flow-Matching Transformer

The decoder backbone follows the Ministral 3B architecture — a decoder-only transformer. Input consists of interleaved voice reference audio tokens and text tokens; output audio tokens are auto-regressively generated.

At each timestep, a bidirectional 3-layer flow-matching transformer predicts acoustic tokens from the decoder's hidden state. It uses 8 NFEs (Number of Function Evaluations) and classifier-free guidance (CFG, α=1.2) to balance expressiveness and text adherence.

Float-valued acoustic outputs are discretized to 21 FSQ levels before the next AR step, maintaining a fully discrete token interface with the backbone vocabulary.

What is flow-matching and why use it for acoustic tokens?

Flow-matching is a generative modeling technique that learns to transport a simple distribution (Gaussian noise) to a target distribution (real audio tokens) via an ordinary differential equation (ODE). Unlike autoregressive generation (which is sequential), flow-matching generates all 36 acoustic tokens simultaneously by solving the ODE in a small number of steps (8 NFEs here). This makes it much faster than full diffusion sampling while retaining the ability to model the complex dependencies across acoustic token dimensions needed for high-fidelity voice timbre.
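A minimal sketch of such a sampler, assuming an Euler solver and one common CFG formulation (the paper's exact solver and guidance rule may differ); `toy_velocity` is a made-up field for illustration only:

```python
import numpy as np

def fm_sample(velocity_fn, cond, dim=36, nfe=8, cfg_alpha=1.2, seed=0):
    """Euler-integrate a flow-matching ODE from noise (t=0) toward data (t=1).

    velocity_fn(x, t, cond) is the learned velocity field; calling it with
    cond=None gives the unconditional prediction used for CFG.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)          # start from Gaussian noise
    dt = 1.0 / nfe                        # nfe Euler steps
    for i in range(nfe):
        t = i * dt
        v_cond = velocity_fn(x, t, cond)
        v_uncond = velocity_fn(x, t, None)
        # CFG: extrapolate past the unconditional field toward the conditional one
        v = v_uncond + cfg_alpha * (v_cond - v_uncond)
        x = x + dt * v
    return x

# Toy velocity field whose conditional flow pulls x toward `cond`.
def toy_velocity(x, t, cond):
    target = cond if cond is not None else np.zeros_like(x)
    return target - x

target = np.full(36, 0.5)
x1 = fm_sample(toy_velocity, target)      # ends close to the guided target
```

With this toy field each Euler step contracts the distance to the guided target by a constant factor, which is why a handful of NFEs already suffices.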

Voxtral Codec: Key Hyperparameters

Table 1: Key hyperparameters of the Voxtral Codec.

Parameter                                 Value
Input / Preprocessing
  Sampling rate                           24,000 Hz
  Patch size                              240
AutoEncoder
  Encoder patch projection kernel size    7
  Encoder patch projection dimension      1024
  Encoder transformer layers              2 → 2 → 2 → 2
  Encoder sliding window size             16 → 8 → 4 → 2
  Encoder conv kernels                    4 → 4 → 4 → 3
  Encoder conv strides                    2 → 2 → 2 → 1
Discrete Bottleneck
  Semantic VQ codebook size               8,192
  Acoustic FSQ codebook count × size      36 × 21
Discriminator
  FFT sizes                               2296, 1418, 876, 542, 334, 206, 126, 76
  Channels                                256

Training

Voxtral TTS is trained in two stages: large-scale pretraining on paired audio-transcript data, followed by Direct Preference Optimization (DPO) to improve speech naturalness and speaker similarity.

Pretraining

Trained on paired audio and pseudo-labelled transcripts from Voxtral Mini Transcribe. Each sample is a tuple (A₁, T₂, A₂) — A₁ is the voice reference, T₂ is the transcript for A₂ (generation target).

Loss computed on A₂ tokens only: cross-entropy on semantic tokens and flow-matching loss on acoustic tokens. Decoder backbone initialized from Ministral 3B; text embedding layers frozen to improve robustness on low-frequency tokens.

Voice-activity detection (VAD) suppresses loss on silent frames. Simple LLM-based text rewrites improve robustness to normalized vs un-normalized text.

Direct Preference Optimization (DPO)

Post-training with DPO to improve word error rate (WER) and speaker similarity. For semantic tokens, the standard DPO objective is used. For acoustic tokens, the flow-DPO objective from Ziv et al. (2025) is adapted to the autoregressive setting.

A rejection-sampling pipeline generates winner/loser speech pairs scored by WER, speaker similarity, loudness consistency, and UTMOS-v2. Combined DPO + pretraining objective is trained for 1 epoch on high-quality speech.

βsemantic = 0.1, βacoustic = 0.5. Learning rate 8×10⁻⁸ for training stability.

Why DPO for TTS? Adapting an NLP technique to speech

DPO (Direct Preference Optimization) was designed for aligning language models to human preferences without explicit reward modeling. For TTS, "preferences" are speech quality judgments. The pipeline: (1) generate multiple speech outputs for the same text, (2) score them on WER, speaker similarity, and UTMOS-v2 naturalness, (3) form winner/loser pairs, (4) train the model to increase the likelihood of winners relative to losers. The trick is adapting DPO to continuous acoustic tokens — the flow-DPO variant propagates preference gradients through the ODE solver of the flow-matching step.
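The standard (semantic-token) DPO objective is compact enough to write out. This sketch uses the paper's β = 0.1 but invented log-probabilities, and omits the flow-DPO variant used for acoustic tokens:

```python
import math

def dpo_loss(logp_w_pi, logp_l_pi, logp_w_ref, logp_l_ref, beta=0.1):
    """Standard DPO loss for one winner/loser pair.

    logp_* are sequence log-probabilities of the winner (w) and loser (l)
    under the current policy (pi) and the frozen reference model (ref).
    beta matches the paper's semantic-token setting (0.1).
    """
    margin = beta * ((logp_w_pi - logp_w_ref) - (logp_l_pi - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)

# Policy prefers the winner more than the reference does -> loss below log(2).
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0)
# Policy identical to the reference -> margin 0 -> loss exactly log(2).
baseline = dpo_loss(-10.0, -14.0, -10.0, -14.0)
```

Minimizing the loss pushes the margin positive, i.e. raises the winner's likelihood relative to the loser's compared with the reference model.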

Results

Voxtral TTS is evaluated on codec reconstruction quality, automatic metrics (WER, UTMOS-v2, Speaker Similarity), and human preference studies across 9 languages.

Voxtral Codec vs Mimi

Table 2: Comparison of Voxtral Codec and Mimi on the Expresso dataset. ↓ lower is better, ↑ higher is better.

Model                 fps    tokens/frame × vocab.    bitrate (kbps)   Mel ↓    STFT ↓   PESQ ↑   ESTOI ↑   ASR-WER (%) ↓   Speaker Sim ↑
Mimi – 8cb (Moshi)    12.5   8 × (2048)               1.1              0.702    1.177    2.07     0.803     11.75           0.672
Mimi – 16cb           12.5   16 × (2048)              2.2              0.618    1.100    2.67     0.865     11.01           0.829
Mimi – full 32cb      12.5   32 × (2048)              4.4              0.552    1.040    3.18     0.910     10.25           0.902
Voxtral Codec         12.5   1 × (8192) + 36 × (21)   2.1              0.545    0.982    3.05     0.882     10.66           0.843

At 2.1 kbps, Voxtral Codec matches or exceeds Mimi-16cb (2.2 kbps) on all objective metrics.

What do the codec evaluation metrics mean?

Mel / STFT distance (↓): Spectral reconstruction error — how closely the decoded audio matches the original in frequency space. PESQ (↑): Perceptual speech quality score from telephony standards. ESTOI (↑): Extended Short-Time Objective Intelligibility — predicts how well listeners can understand the speech. ASR-WER (↓): Word error rate when an ASR model transcribes the reconstructed audio — lower means the codec preserves phonetic detail. Speaker Sim (↑): Cosine similarity between speaker embeddings of original and reconstructed audio.
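The last of these metrics is simple enough to sketch; which speaker-embedding model produces the vectors is not specified here:

```python
import numpy as np

def speaker_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (e.g. from a
    speaker-verification model). 1.0 = same direction, 0.0 = orthogonal."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same = speaker_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel -> 1.0
diff = speaker_similarity([1.0, 0.0], [0.0, 1.0])            # orthogonal -> 0.0
```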

Automatic Evaluations — WER, UTMOS, Speaker Similarity

Table 3: WER (%) ↓, UTMOS ↑, and Speaker Similarity ↑ for Voxtral TTS, ElevenLabs v3, and ElevenLabs Flash v2.5 across languages.

Task            WER (%) ↓                 UTMOS ↑                   Speaker Sim ↑
                Vox.   EL v3   Flash      Vox.   EL v3   Flash      Vox.    EL v3   Flash
MiniMax
  Arabic        2.68   3.67    2.86       3.07   2.50    2.89       0.746   0.546   0.539
  German        0.83   0.45    1.08       3.12   2.90    3.27       0.721   0.457   0.489
  English       0.63   0.48    0.33       4.30   4.27    4.27       0.786   0.484   0.489
  Spanish       0.51   0.87    0.49       3.41   3.18    2.99       0.762   0.443   0.541
  French        3.22   2.34    2.26       2.83   2.90    2.94       0.587   0.339   0.378
  Hindi         4.99   8.71    5.08       3.56   3.56    3.35       0.839   0.707   0.679
  Italian       1.32   0.58    0.55       3.43   3.08    3.09       0.739   0.527   0.485
  Dutch         1.99   1.52    0.83       3.89   3.53    3.68       0.720   0.397   0.598
  Portuguese    1.02   0.92    1.15       3.66   3.41    3.41       0.785   0.571   0.642
Seed TTS        1.23   1.26    0.86       4.11   3.92    4.09       0.628   0.392   0.413

Vox. = Voxtral TTS, EL v3 = ElevenLabs v3, Flash = ElevenLabs Flash v2.5.

Voxtral TTS significantly outperforms ElevenLabs models on Speaker Similarity across all languages.

Flagship Voice Evaluation (Emotion Steering)

Table 4: Voxtral TTS win rates by steering type.

Emotion Steering   Opponent Model          Voxtral TTS Win Rate (%)
Explicit           ElevenLabs v3           51.0
Explicit           Gemini 2.5 Flash TTS    35.4
Implicit           ElevenLabs Flash v2.5   58.3
Implicit           ElevenLabs v3           55.4
Implicit           Gemini 2.5 Flash TTS    37.1

Voxtral TTS outperforms both ElevenLabs models in implicit emotion steering, though Gemini 2.5 Flash TTS remains preferred in both settings.

Zero-Shot Voice Cloning Win Rates

Table 5: Voxtral TTS win rate vs ElevenLabs Flash v2.5 across languages.

Language      Voxtral TTS Win Rate (%)
Arabic        72.9
Dutch         49.4
English       60.8
French        54.4
German        72.0
Hindi         79.8
Italian       57.1
Portuguese    74.4
Spanish       87.8
Overall       68.4

Voxtral TTS outperforms ElevenLabs Flash v2.5 in eight of the nine languages and is at near-parity in Dutch (49.4%).

Analysis

We analyze the impact of DPO post-training and ablate key inference parameters: Number of Function Evaluations (NFEs) for the flow-matching transformer and CFG scale α.

DPO Improvements

Table 6: DPO improves WER and UTMOS across languages.

Task            WER (%) ↓                    UTMOS ↑
                Pretrain   DPO               Pretrain   DPO
MiniMax
  Arabic        2.80       2.68 (-0.12)      3.01       3.07 (+0.06)
  German        4.08       0.83 (-3.25)      3.05       3.12 (+0.07)
  English       0.84       0.63 (-0.21)      4.25       4.30 (+0.05)
  Spanish       0.56       0.51 (-0.06)      3.38       3.41 (+0.04)
  French        5.01       3.22 (-1.79)      2.76       2.83 (+0.07)
  Hindi         3.39       4.99 (+1.61)      3.43       3.56 (+0.13)
  Italian       2.18       1.32 (-0.85)      3.36       3.43 (+0.07)
  Dutch         3.10       1.99 (-1.11)      3.85       3.89 (+0.04)
  Portuguese    1.17       1.02 (-0.15)      3.60       3.66 (+0.06)
Seed TTS        1.58       1.23 (-0.35)      4.07       4.11 (+0.04)

DPO improves WER and UTMOS across most languages. Hindi is the exception: DPO improves UTMOS (+0.13) but regresses WER by 1.61 points, trading some intelligibility for naturalness.

Effect of NFEs and CFG Scale α

Six plots showing effect of NFEs and CFG scale alpha on WER, UTMOS, and Speaker Similarity. Increasing NFEs from 2 to 8 improves all metrics. CFG alpha shows trade-off between UTMOS and implicit emotion adherence.
Figure 4: Effect of NFEs and CFG on automatic evaluations. Metrics averaged over SEED-TTS and 9 MiniMax languages. Increasing NFEs from 2 to 8 significantly improves WER and UTMOS. Beyond 8, gains plateau. CFG α=1.2 selected as default — higher α over-adheres to voice prompt and degrades implicit emotion steering.

Inference & Serving

vLLM-Omni Integration

Voxtral TTS is served through vLLM-Omni, an extension of vLLM for multi-stage multimodal models. The system decomposes into two pipeline stages:

  • Generation stage: predicts semantic and acoustic tokens auto-regressively
  • Codec decoding stage: converts tokens to waveform

The two stages communicate via an asynchronous chunked streaming protocol over shared memory, enabling first-audio latency well before the full waveform is generated. Each emitted chunk overlaps with previous frames to maintain temporal coherence across boundaries.
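The exact streaming protocol is internal to vLLM-Omni, but the boundary-overlap idea can be sketched as a linear crossfade between consecutive decoded chunks (a common technique for click-free chunked audio; the real system may stitch differently):

```python
import numpy as np

def crossfade_stream(chunks, overlap):
    """Stitch decoded audio chunks that overlap by `overlap` samples,
    linearly crossfading each boundary to avoid audible clicks."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    out = np.asarray(chunks[0], dtype=float)
    for c in chunks[1:]:
        c = np.asarray(c, dtype=float)
        head, tail = out[:-overlap], out[-overlap:]
        # Fade the old chunk out while the new chunk fades in.
        mixed = tail * (1.0 - fade_in) + c[:overlap] * fade_in
        out = np.concatenate([head, mixed, c[overlap:]])
    return out

# Two 8-sample chunks at different levels, overlapping by 4 samples:
y = crossfade_stream([np.ones(8), 3.0 * np.ones(8)], overlap=4)
# y ramps smoothly from 1.0 to 3.0 across the 4-sample boundary
```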

CUDA Graph Acceleration

The flow-matching transformer is the computational bottleneck. The entire ODE solver is captured into CUDA graphs, eliminating Python-level overhead and kernel-launch latency. Batch sizes are rounded up to bucket boundaries; outputs are sliced back to actual size.

CUDA graphs reduce latency by 47% (133 ms → 70 ms) and improve RTF by 2.5× (0.258 → 0.103) on a single NVIDIA H200.

CUDA graphs and RTF explained

CUDA graphs pre-record a sequence of GPU operations (kernel calls, memory copies) into a single executable graph. Instead of Python dispatching each CUDA kernel individually at runtime (incurring overhead per launch), the entire ODE solver for the flow-matching transformer runs as one atomic GPU operation. This is especially effective when the computation graph is fixed at a given batch size.

RTF (Real-Time Factor) is the ratio of processing time to audio duration: RTF = 0.103 means generating 1 second of audio takes 0.103 seconds — or about 10× faster than real-time. RTF < 1 is required for practical streaming; the 2.5× improvement from CUDA graphs is the difference between "barely fast enough" and "comfortable streaming."
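The definition and the reported numbers can be checked directly:

```python
def rtf(processing_seconds, audio_seconds):
    """Real-Time Factor: processing time divided by audio duration.
    RTF < 1 means faster than real time."""
    return processing_seconds / audio_seconds

# Generating 60 s of audio at the reported RTF of 0.103:
seconds_needed = 0.103 * 60      # ~6.2 s of compute per minute of audio
assert rtf(seconds_needed, 60) == 0.103

# CUDA-graph speedup from the reported RTFs: 0.258 / 0.103 ~ 2.5x
speedup = 0.258 / 0.103
```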

CUDA Graph vs Eager Mode

Table 7: Effect of CUDA graph acceleration on the flow-matching transformer.
500-char text input, 10s audio reference, concurrency 1, single H200.

Configuration   Latency   RTF
Eager mode      133 ms    0.258
CUDA graph      70 ms     0.103

Serving Performance on Single H200

Table 8: Serving performance of Voxtral TTS with 500-char text input and 10s audio reference.

Concurrency   Latency   RTF     Throughput (char/s/GPU)   Wait Rate
1             70 ms     0.103   119.14                    0%
16            331 ms    0.237   879.11                    0%
32            552 ms    0.302   1,430.78                  0%

Wait Rate = 0% at all concurrency levels. Throughput scales 12× from concurrency 1 to 32. A single H200 can serve 30+ concurrent users with sub-second latency.

Conclusion

We introduced Voxtral TTS, a multilingual TTS model leveraging a hybrid architecture for auto-regressive generation of semantic tokens and flow-matching for acoustic tokens. Tokens correspond to those from Voxtral Codec — a speech tokenizer combining ASR-distilled semantic tokens with FSQ acoustic tokens.

Voxtral TTS generates expressive, voice-cloned speech from as little as 3 seconds of reference audio, and is preferred over ElevenLabs Flash v2.5 with a 68.4% win rate in human evaluations. Model weights are released under the CC BY-NC license to support further research and development of expressive TTS systems.

9 Languages 3s Reference Audio 68.4% Voice Cloning Win Rate CC BY-NC Open Weights 70 ms First-Audio Latency
