---
arxiv_id: 2603.25551
title: "Voxtral TTS"
authors:
  - Alexander H. Liu
  - Alexis Tacnet
  - Andy Ehrenberg
  - Andy Lo
  - Chen-Yo Sun
  - Guillaume Lample
  - Henry Lagarde
  - Jean-Malo Delignon
  - Jaeyoung Kim
  - John Harvill
  - Khyathi Raghavi Chandu
  - Lorenzo Signoretti
  - Margaret Jennings
  - Patrick von Platen
  - Pavankumar Reddy Muddireddy
  - Rohin Arora
  - Sanchit Gandhi
  - Samuel Humeau
  - Soham Ghosh
  - Srijan Mishra
  - Van Phung
  - Abdelaziz Bounhar
  - Abhinav Rastogi
  - Adrien Sadé
  - Alan Jeffares
  - Albert Jiang
  - Alexandre Cahill
  - Alexandre Gavaudan
  - Alexandre Sablayrolles
  - Amélie Héliou
  - Amos You
  - Andrew Bai
  - Andrew Zhao
  - Angele Lenglemetz
  - Anmol Agarwal
  - Anton Eliseev
  - Antonia Calvi
  - Arjun Majumdar
  - Arthur Fournier
  - Artjom Joosen
  - Avi Sooriyarachchi
  - Aysenur Karaduman Utkur
  - Baptiste Bout
  - Baptiste Rozière
  - Baudouin De Monicault
  - Benjamin Tibi
  - Bowen Yang
  - Charlotte Cronjäger
  - Clémence Lanfranchi
  - Connor Chen
  - Corentin Barreau
  - Corentin Sautier
  - Cyprien Courtot
  - Darius Dabert
  - Diego de las Casas
  - Elizaveta Demyanenko
  - Elliot Chane-Sane
  - Emmanuel Gottlob
  - Enguerrand Paquin
  - Etienne Goffinet
  - Fabien Niel
  - Faruk Ahmed
  - Federico Baldassarre
  - Gabrielle Berrada
  - Gaëtan Ecrepont
  - Gauthier Guinet
  - Genevieve Hayes
  - Georgii Novikov
  - Giada Pistilli
  - Guillaume Kunsch
  - Guillaume Martin
  - Guillaume Raille
  - Gunjan Dhanuka
  - Gunshi Gupta
  - Han Zhou
  - Harshil Shah
  - Hope McGovern
  - Hugo Thimonier
  - Indraneel Mukherjee
  - Irene Zhang
  - Jacques Sun
  - Jan Ludziejewski
  - Jason Rute
  - Jérémie Dentan
  - Joachim Studnia
  - Jonas Amar
  - Joséphine Delas
  - Josselin Somerville Roberts
  - Julien Tauran
  - Karmesh Yadav
  - Kartik Khandelwal
  - Kilian Tep
  - Kush Jain
  - Laurence Aitchison
  - Laurent Fainsin
  - Léonard Blier
  - Lingxiao Zhao
  - Louis Martin
  - Lucile Saulnier
  - Luyu Gao
  - Maarten Buyl
  - Manan Sharma
  - Marie Pellat
  - Mark Prins
  - Martin Alexandre
  - Mathieu Poirée
  - Mathieu Schmitt
  - Mathilde Guillaumin
  - Matthieu Dinot
  - Matthieu Futeral
  - Maxime Darrin
  - Maximilian Augustin
  - Mert Unsal
  - Mia Chiquier
  - Mikhail Biriuchinskii
  - Minh-Quang Pham
  - Mircea Lica
  - Morgane Rivière
  - Nathan Grinsztajn
  - Neha Gupta
  - Olivier Bousquet
  - Olivier Duchenne
  - Patricia Wang
  - Paul Jacob
  - Paul Wambergue
  - Paula Kurylowicz
  - Philippe Pinel
  - Philomène Chagniot
  - Pierre Stock
  - Piotr Miłoś
  - Prateek Gupta
  - Pravesh Agrawal
  - Quentin Torroba
  - Ram Ramrakhya
  - Randall Isenhour
  - Rishi Shah
  - Romain Sauvestre
  - Roman Soletskyi
  - Rosalie Millner
  - Rupert Menneer
  - Sagar Vaze
  - Samuel Barry
  - Samuel Belkadi
  - Sandeep Subramanian
  - Sean Cha
  - Shashwat Verma
  - Siddhant Waghjale
  - Siddharth Gandhi
  - Simon Lepage
  - Sumukh Aithal
  - Szymon Antoniak
  - Tarun Kumar Vangani
  - Teven Le Scao
  - Théo Cachet
  - Theo Simon Sorg
  - Thibaut Lavril
  - Thomas Chabal
  - Thomas Foubert
  - Thomas Robert
  - Thomas Wang
  - Tim Lawson
  - Tom Bewley
  - Tom Edwards
  - Tyler Wang
  - Umar Jamil
  - Umberto Tomasini
  - Valeriia Nemychnikova
  - Vedant Nanda
  - Victor Jouault
  - Vincent Maladière
  - Vincent Pfister
  - Virgile Richard
  - Vladislav Bataev
  - Wassim Bouaziz
  - Wen-Ding Li
  - William Havard
  - William Marshall
  - Xinghui Li
  - Xingran Guo
  - Xinyu Yang
  - Yannic Neuhaus
  - Yassine El Ouahidi
  - Yassir Bendou
  - Yihan Wang
  - Yimu Pan
  - Zaccharie Ramzi
  - Zhenlin Xu
difficulty: Intermediate
tags:
  - Audio
  - Multimodal
published_at: 2026-03-26
flecto_url: https://flecto.zer0ai.dev/papers/2603.25551/
lang: en
---

An expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio.

Figure: win rates of Voxtral TTS vs ElevenLabs Flash v2.5 for flagship voices and voice cloning.

## Abstract

We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. These tokens are encoded and decoded with Voxtral Codec, a speech tokenizer trained from scratch with a hybrid VQ-FSQ quantization scheme.

In human evaluations conducted by native speakers, Voxtral TTS is preferred for multilingual voice cloning due to its naturalness and expressivity, achieving a 68.4% win rate over ElevenLabs Flash v2.5. We release the model weights under a CC BY-NC license.

> "68.4% win rate over ElevenLabs Flash v2.5 in zero-shot voice cloning — preferred across 9 languages."

#### What is "zero-shot voice cloning"?

Traditional TTS systems require hours of recorded speech from a target speaker to produce a convincing imitation. Zero-shot voice cloning means giving the model just a short clip (here, as little as 3 seconds) and generating new speech in that voice — without any fine-tuning on that speaker. The model generalizes from training on thousands of speakers, learning to extract a "voice fingerprint" from the reference and apply it to arbitrary new text.

## Human Evaluation Results

### Independent Human Evaluation

- 77 unique text examples evaluated by native speakers per language

- Flagship voices: default voices compared across same gender and accent

- Voice cloning: 3-second reference clip provided; annotators rated likeness and naturalness

- Annotators chose "slightly better", "much better", or "both good" — ties excluded from win rate

- Voxtral TTS preferred in 58.3% of flagship comparisons

- Voxtral TTS preferred in 68.4% of voice cloning comparisons

## Model Architecture

Voxtral TTS consists of a novel audio codec (Voxtral Codec) and an autoregressive decoder backbone. The codec encodes a reference voice sample into audio tokens at 12.5 Hz — each frame comprising 1 semantic token and 36 acoustic tokens. The decoder auto-regressively generates semantic tokens, while a lightweight flow-matching transformer predicts acoustic tokens conditioned on decoder states. A codec decoder maps the output tokens to the corresponding audio waveform.

#### Semantic tokens vs. acoustic tokens — why both?

- Semantic tokens (1 per frame, VQ codebook size 8192): Capture what is being said — phonemes, prosody, linguistic content. Distilled from Whisper ASR, so they align with text. The autoregressive backbone generates these one by one, like an LLM generating text tokens.

- Acoustic tokens (36 per frame, FSQ 21 levels): Capture how it sounds — voice timbre, breathiness, subtle resonance. The flow-matching transformer predicts all 36 simultaneously, conditioned on the semantic token, recovering fine-grained audio detail that a single VQ codebook cannot represent.

- Why 12.5 Hz? Each frame covers 80 ms of audio. 37 tokens × 12.5 frames/s = 462.5 tokens/s — manageable for autoregressive generation without sacrificing audio quality.
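The rates above follow directly from the numbers the paper states (12.5 Hz frames, one semantic token from an 8192-entry VQ codebook, 36 FSQ acoustic tokens with 21 levels each). A quick back-of-the-envelope check:

```python
# Token rate and bitrate implied by the codec's stated configuration.
import math

FRAME_RATE_HZ = 12.5
TOKENS_PER_FRAME = 1 + 36          # 1 semantic + 36 acoustic

tokens_per_second = TOKENS_PER_FRAME * FRAME_RATE_HZ
print(tokens_per_second)           # 462.5 tokens/s

# Bits per frame: log2(8192) for the semantic token,
# plus 36 * log2(21) for the acoustic tokens.
bits_per_frame = math.log2(8192) + 36 * math.log2(21)
bitrate_kbps = bits_per_frame * FRAME_RATE_HZ / 1000
print(round(bitrate_kbps, 2))      # 2.14 kbps, matching the reported bitrate
```

The 2.14 kbps figure reported for the codec is thus fully accounted for by the per-frame token budget.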

### Voxtral Codec

A convolutional-transformer autoencoder that compresses 24 kHz mono audio to 12.5 Hz frames of 37 discrete tokens — 1 semantic (VQ codebook size 8192) and 36 acoustic (FSQ, 21 levels each) — at a total bitrate of 2.14 kbps.

The semantic component is distilled from a supervised Whisper ASR model using a soft-alignment cosine distance loss, enabling text-aligned semantic tokens without requiring forced aligners. The acoustic component uses finite scalar quantization (FSQ) with 21 uniform levels.

An 8-discriminator multi-resolution adversarial training objective ensures high-fidelity waveform reconstruction. The full codec has approximately 300M parameters.

#### VQ vs. FSQ: two quantization strategies

Vector Quantization (VQ) maintains a learned codebook of discrete embedding vectors; each input is replaced by its nearest-neighbor entry. Codebook size 8192 means 8192 possible semantic "words." Finite Scalar Quantization (FSQ) quantizes each dimension independently to a fixed number of uniform levels (here 21). No codebook lookup — it's simply rounding each scalar to the nearest of 21 values. FSQ avoids VQ's "codebook collapse" problem (where many codebook entries are never used) while providing more stable training.
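The contrast between the two quantizers can be made concrete with a minimal numpy sketch. The codebook contents and input dimensions here are illustrative stand-ins, not the paper's actual latents:

```python
# VQ vs FSQ on a toy latent vector.
import numpy as np

rng = np.random.default_rng(0)

# --- Vector Quantization: nearest neighbor in a learned codebook ---
codebook = rng.normal(size=(8192, 64))      # 8192 entries, 64-dim (illustrative)

def vq(z):
    """Replace z with its nearest codebook entry; return (index, entry)."""
    idx = int(np.argmin(np.linalg.norm(codebook - z, axis=1)))
    return idx, codebook[idx]

# --- Finite Scalar Quantization: round each dim to 21 uniform levels ---
def fsq(z, levels=21):
    """Quantize each scalar of z (assumed in [-1, 1]) independently."""
    half = (levels - 1) / 2                 # 10 levels on each side of 0
    return np.round(np.clip(z, -1, 1) * half) / half

z = rng.normal(size=64)
idx, z_vq = vq(z)                           # one discrete index for the whole vector
z_fsq = fsq(np.tanh(z))                     # squash into [-1, 1], then round per dim
print(len(np.unique(z_fsq)) <= 21)          # each dim takes one of 21 values
```

Note that FSQ needs no learned parameters at all: the "codebook" is implicit in the rounding grid, which is why codebook collapse cannot occur.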

### Decoder Backbone & Flow-Matching Transformer

The decoder backbone follows the Ministral 3B architecture — a decoder-only transformer. Input consists of interleaved voice reference audio tokens and text tokens; output audio tokens are auto-regressively generated.

At each timestep, a bidirectional 3-layer flow-matching transformer predicts acoustic tokens from the decoder's hidden state. It uses 8 NFEs (Number of Function Evaluations) and classifier-free guidance (CFG, α=1.2) to balance expressiveness and text adherence.

Float-valued acoustic outputs are discretized to 21 FSQ levels before the next AR step, maintaining a fully discrete token interface with the backbone vocabulary.

#### What is flow-matching and why use it for acoustic tokens?

Flow-matching is a generative modeling technique that learns to transport a simple distribution (Gaussian noise) to a target distribution (real audio tokens) via an ordinary differential equation (ODE). Unlike autoregressive generation (which is sequential), flow-matching generates all 36 acoustic tokens simultaneously by solving the ODE in a small number of steps (8 NFEs here). This makes it much faster than full diffusion sampling while retaining the ability to model the complex dependencies across acoustic token dimensions needed for high-fidelity voice timbre.
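The sampling loop can be sketched in a few lines: start from Gaussian noise and integrate a velocity field with a fixed number of Euler steps (the NFEs), combining conditional and unconditional predictions via CFG. The velocity model below is a toy analytic stand-in, not the paper's transformer:

```python
# Toy flow-matching inference: Euler integration of a velocity field
# with classifier-free guidance, mirroring the 8-NFE / alpha=1.2 setup.
import numpy as np

def velocity(x, t, cond):
    # Hypothetical velocity field; a real model would be a neural net
    # conditioned on the AR decoder's hidden state.
    target = cond if cond is not None else np.zeros_like(x)
    return (target - x) / max(1.0 - t, 1e-3)

def sample(cond, dim=36, nfe=8, cfg_alpha=1.2, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)                 # start from noise
    dt = 1.0 / nfe
    for step in range(nfe):
        t = step * dt
        v_cond = velocity(x, t, cond)
        v_uncond = velocity(x, t, None)
        # CFG: push the conditional direction harder than the unconditional one
        v = v_uncond + cfg_alpha * (v_cond - v_uncond)
        x = x + dt * v                       # Euler step
    return x

target = np.full(36, 0.5)
out = sample(target)
print(np.abs(out - target).mean() < 0.2)     # lands near the target (CFG overshoots slightly)
```

All 36 dimensions are updated in parallel at each of the 8 steps, which is the efficiency argument made above: 8 function evaluations instead of 36 sequential token predictions.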

### Voxtral Codec: Key Hyperparameters

Table 1: Key hyperparameters of the Voxtral Codec.

## Training

Voxtral TTS is trained in two stages: large-scale pretraining on paired audio-transcript data, followed by Direct Preference Optimization (DPO) to improve speech naturalness and speaker similarity.

### Pretraining

Trained on paired audio and pseudo-labelled transcripts from Voxtral Mini Transcribe. Each sample is a tuple (A₁, T₂, A₂), where A₁ is the voice reference and T₂ is the transcript for the generation target A₂.

Loss is computed on A₂ tokens only: cross-entropy on semantic tokens and flow-matching loss on acoustic tokens. The decoder backbone is initialized from Ministral 3B; text embedding layers are frozen to improve robustness on low-frequency tokens.

Voice-activity detection (VAD) suppresses the loss on silent frames. Simple LLM-based text rewrites improve robustness to normalized vs. un-normalized text.
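The masking described above (loss only on A₂ frames, silent frames suppressed by VAD) can be sketched as a masked cross-entropy. Tensors and shapes here are toy placeholders:

```python
# Masked semantic-token cross-entropy: restrict the loss to
# generation-target (A2) frames that actually contain speech.
import numpy as np

def masked_semantic_ce(logits, targets, is_target_a2, vad_speech):
    """logits: (T, V); targets: (T,) token ids; masks: (T,) booleans."""
    # Numerically stable log-softmax over the vocabulary
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    mask = is_target_a2 & vad_speech         # A2 frames with speech only
    return (nll * mask).sum() / max(mask.sum(), 1)

T, V = 10, 8192
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
is_a2 = np.array([False] * 4 + [True] * 6)   # first 4 frames are the reference A1
vad = np.array([True] * 8 + [False] * 2)     # last 2 frames are silence
loss = masked_semantic_ce(logits, targets, is_a2, vad)
print(float(loss) > 0)
```

Reference frames and silence thus contribute no gradient, so the model is only penalized for what it is actually asked to generate.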

### Direct Preference Optimization (DPO)

Post-training with DPO improves word error rate (WER) and speaker similarity. For semantic tokens, the standard DPO objective is used; for acoustic tokens, the flow-DPO objective of Ziv et al. (2025) is adapted to the autoregressive setting.

A rejection-sampling pipeline generates winner/loser speech pairs scored by WER, speaker similarity, loudness consistency, and UTMOS-v2. The combined DPO + pretraining objective is trained for 1 epoch on high-quality speech.

β_semantic = 0.1, β_acoustic = 0.5; the learning rate is set to 8×10⁻⁸ for training stability.

#### Why DPO for TTS? Adapting an NLP technique to speech

DPO (Direct Preference Optimization) was designed for aligning language models to human preferences without explicit reward modeling. For TTS, "preferences" are speech quality judgments. The pipeline: (1) generate multiple speech outputs for the same text, (2) score them on WER, speaker similarity, and UTMOS-v2 naturalness, (3) form winner/loser pairs, (4) train the model to increase the likelihood of winners relative to losers. The trick is adapting DPO to continuous acoustic tokens — the flow-DPO variant propagates preference gradients through the ODE solver of the flow-matching step.
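Step (4) of the pipeline is the standard DPO objective on the semantic tokens. A scalar sketch of the loss, with toy log-probabilities standing in for real sequence likelihoods:

```python
# DPO loss on a single winner/loser pair: -log sigmoid of the
# beta-scaled log-likelihood margin relative to a frozen reference.
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # beta = 0.1 matches the paper's beta_semantic
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy already prefers the winner more than the reference does,
# the margin is positive and the loss drops below log(2) (the 0-margin value).
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0,
                ref_logp_w=-11.0, ref_logp_l=-11.0, beta=0.1)
print(loss < math.log(2))
```

The flow-DPO variant for acoustic tokens replaces the likelihood terms with quantities derived from the flow-matching objective, since continuous tokens have no direct token-level likelihood.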

## Results

Voxtral TTS is evaluated on codec reconstruction quality, automatic metrics (WER, UTMOS-v2, Speaker Similarity), and human preference studies across 9 languages.

### Voxtral Codec vs Mimi

Table 2: Comparison of Voxtral Codec and Mimi on the Expresso dataset. ↓ lower is better, ↑ higher is better.

At 2.1 kbps, Voxtral Codec matches or exceeds Mimi-16cb (2.2 kbps) on all objective metrics.

#### What do the codec evaluation metrics mean?

Mel / STFT distance (↓): Spectral reconstruction error — how closely the decoded audio matches the original in frequency space. PESQ (↑): Perceptual speech quality score from telephony standards. ESTOI (↑): Extended Short-Time Objective Intelligibility — predicts how well listeners can understand the speech. ASR-WER (↓): Word error rate when an ASR model transcribes the reconstructed audio — lower means the codec preserves phonetic detail. Speaker Sim (↑): Cosine similarity between speaker embeddings of original and reconstructed audio.

### Automatic Evaluations — WER, UTMOS, Speaker Similarity

Table 3: WER (%) ↓, UTMOS ↑, and Speaker Similarity ↑ for Voxtral TTS, ElevenLabs v3, and ElevenLabs Flash v2.5 across languages.

Voxtral TTS significantly outperforms ElevenLabs models on Speaker Similarity across all languages.

### Flagship Voice Evaluation (Emotion Steering)

Table 4: Voxtral TTS win rates by steering type.

Voxtral TTS consistently outperforms ElevenLabs in implicit emotion steering.

### Zero-Shot Voice Cloning Win Rates

Table 5: Voxtral TTS win rate vs ElevenLabs Flash v2.5 across languages.

Voxtral TTS matches or outperforms ElevenLabs Flash v2.5 on every language.

## Analysis

We analyze the impact of DPO post-training and ablate key inference parameters: the Number of Function Evaluations (NFEs) for the flow-matching transformer and the CFG scale α.

### DPO Improvements

Table 6: DPO improves WER and UTMOS across languages.

DPO improves WER and UTMOS across most languages. Hindi shows WER regression — DPO improves UTMOS (+0.13) but slightly hurts intelligibility.

### Effect of NFEs and CFG Scale α

## Inference & Serving

### vLLM-Omni Integration

Voxtral TTS is served through vLLM-Omni, an extension of vLLM for multi-stage multimodal models. The system decomposes into two pipeline stages:

- Generation stage: predicts semantic and acoustic tokens auto-regressively

- Codec decoding stage: converts tokens to waveform

The two stages communicate via an asynchronous chunked streaming protocol over shared memory, enabling first-audio latency well before the full waveform is generated. Each emitted chunk overlaps with previous frames to maintain temporal coherence across boundaries.
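The overlap trick can be illustrated with a simple crossfade stitcher. Chunk and overlap sizes below are illustrative choices, not the ones used by vLLM-Omni:

```python
# Stitch overlapping audio chunks with a linear crossfade so that
# chunk boundaries stay coherent when streamed incrementally.
import numpy as np

def stitch(chunks, overlap):
    """Crossfade equal-rate chunks that overlap by `overlap` samples."""
    out = chunks[0].copy()
    fade_in = np.linspace(0.0, 1.0, overlap)
    for chunk in chunks[1:]:
        # Blend the tail of the emitted audio with the head of the new chunk
        out[-overlap:] = out[-overlap:] * (1 - fade_in) + chunk[:overlap] * fade_in
        out = np.concatenate([out, chunk[overlap:]])
    return out

sr, overlap = 24000, 240                     # 24 kHz audio, 10 ms overlap
t = np.arange(2 * sr) / sr
full = np.sin(2 * np.pi * 220 * t)           # a 2 s test tone
hop = sr // 2                                # 0.5 s chunks
chunks = [full[i:i + hop + overlap] for i in range(0, len(full) - overlap, hop)]
rebuilt = stitch(chunks, overlap)
print(np.abs(rebuilt - full).max() < 1e-9)   # lossless when chunks agree on the overlap
```

In the real system the consecutive chunks come from separate codec-decoder calls, so the crossfade hides small discontinuities at the seams rather than reproducing the signal exactly.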

### CUDA Graph Acceleration

The flow-matching transformer is the computational bottleneck. The entire ODE solver is captured into CUDA graphs, eliminating Python-level overhead and kernel-launch latency. Batch sizes are rounded up to bucket boundaries; outputs are sliced back to actual size.

#### CUDA graphs and RTF explained

CUDA graphs pre-record a sequence of GPU operations (kernel calls, memory copies) into a single executable graph. Instead of Python dispatching each CUDA kernel individually at runtime (incurring overhead per launch), the entire ODE solver for the flow-matching transformer runs as one atomic GPU operation. This is especially effective when the computation graph is fixed at a given batch size.

RTF (Real-Time Factor) is the ratio of processing time to audio duration: RTF = 0.103 means generating 1 second of audio takes 0.103 seconds — or about 10× faster than real-time. RTF < 1 is required for practical streaming; the 2.5× improvement from CUDA graphs is the difference between "barely fast enough" and "comfortable streaming."
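The arithmetic in the paragraph above, spelled out:

```python
# Real-Time Factor: processing time divided by audio duration.
def rtf(processing_seconds, audio_seconds):
    return processing_seconds / audio_seconds

# RTF = 0.103 means 1 s of audio takes 0.103 s to generate,
# i.e. generation runs at roughly 10x real-time.
speedup_vs_realtime = 1 / 0.103
print(round(speedup_vs_realtime, 1))   # 9.7
```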

### CUDA Graph vs Eager Mode

Table 7: Effect of CUDA graph acceleration on the flow-matching transformer. 500-char text input, 10s audio reference, concurrency 1, single H200.

### Serving Performance on Single H200

Table 8: Serving performance of Voxtral TTS with 500-char text input and 10s audio reference.

Wait Rate = 0% at all concurrency levels. Throughput scales 12× from concurrency 1 to 32. A single H200 can serve 30+ concurrent users with sub-second latency.

## Conclusion

We introduced Voxtral TTS, a multilingual TTS model leveraging a hybrid architecture for auto-regressive generation of semantic tokens and flow-matching for acoustic tokens. Tokens correspond to those from Voxtral Codec — a speech tokenizer combining ASR-distilled semantic tokens with FSQ acoustic tokens.

Voxtral TTS generates expressive, voice-cloned speech from as little as 3 seconds of reference audio, and is preferred over ElevenLabs Flash v2.5 with a 68.4% win rate in human evaluations. Model weights are released under the CC BY-NC license to support further research and development of expressive TTS systems.
