Introduction
Recent text-to-speech (TTS) systems have achieved impressive results for single-speaker, short-utterance synthesis, but a major frontier remains unsolved: generating long-form, natural-sounding dialogue with multiple distinct speakers. Producing a 30-minute podcast in which two speakers maintain consistent voices, realistic turn-taking, and natural prosody across thousands of tokens is far beyond what existing models can reliably accomplish.
The core challenge is scale. Standard audio codecs like Encodec run at 75 frames per second, and with several residual codebooks per frame emit hundreds of tokens per second of audio; even a few minutes of speech produces tens to hundreds of thousands of tokens, far exceeding the context capacity of current LLMs. Without a radical reduction in token density, long-form multi-speaker generation is computationally intractable.
Why Does Frame Rate Matter for Speech?
In audio processing, "frame rate" (measured in Hz) means how many tokens or latent vectors are generated per second of audio. Traditional codecs like Encodec operate at 75 Hz — meaning 1 minute of speech = 75 × 60 = 4,500 tokens. At the scale of a 30-minute podcast with multiple speakers, that's 135,000 tokens — far beyond most LLMs' context windows (typically 4K–32K).
VibeVoice's tokenizer achieves just 7.5 Hz, a 10× lower frame rate than Encodec's 75 Hz (and roughly 80× fewer tokens once Encodec's multiple residual codebooks are counted). This means the same 30-minute podcast needs only ~13,500 tokens, easily fitting in a 64K context window. The challenge: how do you compress audio this aggressively without losing sound quality? The answer is a carefully designed VAE architecture with 6 downsampling stages.
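The token budgets above are just frame rate × duration; a quick sanity check (the function name is ours, the frame rates are from the text):

```python
def token_count(frame_rate_hz: float, minutes: float) -> int:
    """Tokens produced for a clip of the given length at a given frame rate."""
    return int(frame_rate_hz * minutes * 60)

print(token_count(75, 30))    # 135000: a 30-minute podcast at Encodec's 75 Hz
print(token_count(7.5, 30))   # 13500: the same podcast at VibeVoice's 7.5 Hz
```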
VibeVoice solves this with a novel continuous speech tokenizer operating at just 7.5 Hz — a compression of over 80× versus Encodec — while preserving audio fidelity. This ultra-low frame rate makes 90-minute, 4-speaker synthesis feasible within a 64K-token context window using a standard LLM backbone.
The architecture is deliberately streamlined: a pre-trained LLM (Qwen2.5, 1.5B or 7B parameters) processes voice prompts and text scripts, then a lightweight diffusion head generates continuous speech latents token by token. The simplicity is intentional — previous designs required complex separate components that VibeVoice collapses into a single unified pipeline.
Key Contributions
- Ultra-low frame rate tokenizer (7.5 Hz) — novel continuous speech tokenizer achieving 80× compression vs Encodec, preserving audio fidelity at a tiny fraction of the token cost
- Next-token diffusion LLM backbone — unified framework combining LLM sequence modeling with token-level diffusion decoding for high-fidelity continuous speech
- 64K context window / 90-min synthesis — supports up to 4 simultaneous speakers in a single generation run, unprecedented in open-source TTS
- State-of-the-art quality — VibeVoice-7B outperforms Gemini-2.5-Pro-Preview-TTS, Eleven-V3, and all open-source competitors in preference, realism, and richness MOS scores
Method
2.1 Speech Tokenizers
VibeVoice employs two separate tokenizers to learn both acoustic and semantic features. Generating long-form speech benefits from this separation: the acoustic tokenizer preserves audio quality at ultra-low bit rate, while the semantic tokenizer captures linguistic content independently.
Acoustic Tokenizer
The acoustic tokenizer is based on Variational Autoencoder (VAE) principles, specifically the σ-VAE variant from LatentLM, to prevent variance collapse in autoregressive settings. It features a mirror-symmetric encoder-decoder with 7 hierarchical stages of modified Transformer blocks, using 1D depth-wise causal convolutions for efficient streaming.
Six downsampling layers achieve the breakthrough compression. The encoder produces a continuous latent vector z_t per timestep, and the decoder reconstructs the waveform from these latents. The entire design is causal, enabling real-time streaming synthesis.
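The overall compression is the product of the per-stage strides. The exact stride schedule isn't restated here, so the values below are hypothetical, chosen only so that six strides multiply to the 80×-vs-Encodec token reduction implied by mapping 24 kHz audio (the assumed sample rate) down to 7.5 Hz:

```python
from math import prod

SAMPLE_RATE_HZ = 24_000  # assumed audio sample rate

# Hypothetical per-stage strides: the design specifies six downsampling
# layers; any six strides whose product is 3200 yield the same frame rate.
strides = [2, 2, 4, 5, 5, 8]

total_downsample = prod(strides)                  # 3200
frame_rate_hz = SAMPLE_RATE_HZ / total_downsample # 7.5
print(total_downsample, frame_rate_hz)
```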
Semantic Tokenizer
Mirrors the hierarchical architecture of the Acoustic Tokenizer's encoder, but without VAE components — it is deterministic, focused on extracting content-centric linguistic features rather than acoustic fidelity.
Uses Automatic Speech Recognition (ASR) as the proxy training objective. This grounds the semantic latents in linguistic content, ensuring the model understands what is being said (semantic) separately from how it sounds (acoustic) — the key to consistent long-form generation.
What Is a VAE and Why Does Variance Collapse Matter?
A Variational Autoencoder (VAE) learns to encode inputs into a compact probability distribution (a mean and variance) rather than a single fixed vector. During decoding, it samples from this distribution to reconstruct the original. The advantage: the latent space is smooth and continuous, making interpolation between sounds natural.
The problem: in autoregressive models (where each token depends on the previous), standard VAEs suffer from variance collapse: the model learns to ignore the stochastic component entirely, and the latent space degenerates to near-deterministic. The σ-VAE variant from LatentLM fixes this by holding the latent variance at a fixed constant rather than letting the encoder predict it, forcing the model to maintain meaningful uncertainty throughout the generation process.
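A minimal sketch of the fixed-variance reparameterization, assuming σ is held constant as in a σ-VAE (the function name and the σ value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma_vae_sample(mu: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Reparameterized sample z = mu + sigma * eps, with sigma held fixed.

    A standard VAE lets the encoder predict sigma per input; under a strong
    autoregressive decoder it can drive sigma -> 0 (variance collapse).
    Fixing sigma keeps the latent genuinely stochastic.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

mu = np.zeros(10_000)        # encoder output for one frame (toy, large for stats)
z = sigma_vae_sample(mu)
print(z.std())               # close to the fixed sigma: collapse is impossible
```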
2.2 VibeVoice Architecture
VibeVoice uses a Large Language Model as its core sequence model, integrated with specialized audio encoding and diffusion-based decoding. The LLM (Qwen2.5) processes interleaved voice font features and text script embeddings, with role identifiers distinguishing speakers. At each step, a lightweight diffusion head conditioned on the LLM's hidden state generates the next continuous acoustic latent vector.
What Is Next-Token Diffusion?
Standard autoregressive language models generate tokens one by one: each token is discrete (a categorical choice from a vocabulary). Speech is different — it's continuous (a waveform). You can't just pick from a dictionary of possible sounds.
Diffusion models solve the continuous generation problem by starting from random noise and progressively denoising it toward the target signal. But traditional diffusion generates the entire output at once, not token by token.
Next-token diffusion combines both: at each autoregressive step, instead of outputting a discrete token, the model runs a mini-diffusion process conditioned on the LLM's hidden state to generate the next continuous latent vector. Think of it as: LLM provides the "intention" for each audio segment, diffusion fills in the acoustic details. This enables streaming generation of arbitrarily long speech.
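The loop below is a toy illustration of this interplay, not the actual model: a stand-in "LLM" maps the history to a conditioning vector, and a toy 10-step denoiser pulls a noise vector toward it for each generated frame (all function names and the linear schedule are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8           # latent dimension (toy)
STEPS = 10      # few-step sampling, echoing fast samplers like DPM-Solver++

def llm_hidden_state(prev_latents: list) -> np.ndarray:
    """Stand-in for the LLM: maps generated history to a conditioning vector."""
    if not prev_latents:
        return np.zeros(D)
    return np.tanh(np.mean(prev_latents, axis=0))

def denoise(x_t: np.ndarray, h: np.ndarray, t: int) -> np.ndarray:
    """Toy denoiser: one step pulling x_t toward a target defined by h.
    A real diffusion head is a small trained network predicting the noise."""
    target = h                                   # toy: clean latent == conditioning
    return x_t + (target - x_t) / (STEPS - t)    # reaches target exactly at t=9

latents = []
for frame in range(5):                # generate 5 "audio frames" autoregressively
    h = llm_hidden_state(latents)
    x = rng.standard_normal(D)        # start each frame from pure noise
    for t in range(STEPS):            # mini-diffusion per token
        x = denoise(x, h, t)
    latents.append(x)
print(len(latents))                   # 5 continuous latent vectors
```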
Input Representation
X = [Speaker1:Z1, Speaker2:Z2, ..., SpeakerN:ZN] + [Speaker1:T1, Speaker2:T2, ..., SpeakerN:TN]
Z denotes acoustic latent features (from voice prompts), T denotes semantic text embeddings. Speaker role identifiers (Speakerk) interleave the features, allowing the model to track which speaker is producing which audio segment throughout the 64K-token context window.
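One plausible serialization of this layout, with strings standing in for the latent and embedding vectors (the exact token ordering and role-tag syntax are assumptions for illustration):

```python
# Toy interleaving of per-speaker voice-prompt latents (Z) and script
# embeddings (T); tokens are strings here purely for readability.
voice_prompts = {"Speaker1": ["z1a", "z1b"], "Speaker2": ["z2a"]}
script_turns  = [("Speaker1", ["t_hello"]), ("Speaker2", ["t_hi"]),
                 ("Speaker1", ["t_bye"])]

sequence = []
for spk, z in voice_prompts.items():      # voice prompts first, role-tagged
    sequence += [f"[{spk}]"] + z
for spk, t in script_turns:               # then the role-tagged script turns
    sequence += [f"[{spk}]"] + t

print(sequence)
```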
Token-Level Diffusion: At each autoregressive step, the diffusion head is conditioned on the LLM's hidden state h_i and predicts the denoised acoustic latent. During training, it learns to reverse a forward noising process. During inference, it uses DPM-Solver++ for fast sampling in just 10 steps, enabling practical streaming generation.
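The training side can be sketched as a generic DDPM-style noise-prediction objective conditioned on the hidden state (a sketch under standard diffusion assumptions, not the paper's exact loss; names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_training_loss(z0, h, predict_noise, alpha_bar_t):
    """One DDPM-style training step for a diffusion head (sketch).

    z0: clean acoustic latent; h: LLM hidden state (conditioning);
    predict_noise(z_t, h, alpha_bar_t): the diffusion head being trained.
    """
    eps = rng.standard_normal(z0.shape)                            # inject noise
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1 - alpha_bar_t) * eps
    eps_hat = predict_noise(z_t, h, alpha_bar_t)
    return np.mean((eps_hat - eps) ** 2)                           # MSE on the noise

# An untrained head that predicts zero noise pays roughly E[eps^2] = 1.
dummy_head = lambda z_t, h, a: np.zeros_like(z_t)
loss = ddpm_training_loss(np.zeros(1000), None, dummy_head, 0.5)
print(loss)
```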
The model was instantiated with Qwen2.5 at two scales (1.5B and 7B). The diffusion head comprises 4 transformer layers. During training, the acoustic and semantic tokenizers remain frozen — only the LLM and diffusion head are learned, enabling efficient fine-tuning.
Results
3.1 Long-Form Podcast Evaluation
VibeVoice was evaluated against state-of-the-art conversational speech models: Nari Labs Dia, SesameAILabs CSM, Higgs Audio V2, Eleven-V3 Alpha, and Gemini-2.5-Pro-Preview-TTS. The test set consisted of 8 long conversational transcripts totaling approximately 1 hour.
Objective evaluation: Word Error Rate (WER) was measured using Whisper-large-v3 and Nemo ASR. Speaker similarity (SIM) was computed using WavLM-large speaker embeddings.
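The WER metric itself is a word-level edit distance divided by the reference length; in the actual pipeline the hypothesis comes from transcribing the generated audio with the ASR models above. A self-contained implementation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))   # 0.0
print(wer("the cat sat", "the cat sit"))   # 1 substitution / 3 words
```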
Subjective evaluation: 24 human annotators scored each system on three dimensions: Realism (naturalness, prosody, emotion, turn-taking smoothness), Richness (expressiveness in tone, emotion, and conversational dynamics), and Preference (overall listening preference).
VibeVoice models outperform all competing systems across both objective and subjective metrics. The 7B model shows significant gains over 1.5B, particularly in perceptual quality scores. Scaling the LLM backbone directly translates to better speech quality — a consistent finding across all evaluation dimensions.
Key finding: VibeVoice-7B achieves an average MOS of 3.76, surpassing Gemini-2.5-Pro-Preview-TTS (3.40), MOSS-TTSD (3.54), and Eleven-V3 Alpha (3.66) on long-form conversation. WER-Whisper of 1.29% and speaker similarity of 0.692 demonstrate strong objective quality.
3.2 Zero-Shot Short Utterance Evaluation
Although primarily trained on long-form speech, VibeVoice was also evaluated on the SEED short-utterance benchmark: approximately 1,000 English samples (drawn from Common Voice) and 2,000 Chinese samples. Metrics: CER↓ (Chinese character error rate), WER↓ (English word error rate), SIM↑ (speaker similarity).
Compared models operate at 25–50 Hz frame rates; VibeVoice-1.5B uses just 7.5 Hz, generating roughly 3–7× fewer tokens per second of audio. This dramatically reduces inference compute while maintaining competitive accuracy.
Key finding: VibeVoice-1.5B achieves CER 1.16% (Chinese) and WER 3.04% (English) at 7.5 Hz frame rate — competitive with MaskGCT (50 Hz, CER 2.27%), Seed-TTS (–, CER 1.12%), and Spark TTS (50 Hz, CER 1.20%). Strong generalization despite being trained primarily on long-form audio.
3.3 Tokenizer Reconstruction Quality
The fidelity of audio reconstructed from acoustic tokens measures how well the tokenizer preserves essential acoustic information under extreme compression. VibeVoice's tokenizer was benchmarked against ground truth, DAC, EnCodec, SpeechTokenizer, and WavTokenizer on LibriTTS test-clean and test-other, using PESQ↑, STOI↑, MOS↑, SIM↑, and UTMOS↑.
Key finding: At 7.5 Hz (vs DAC at 86 Hz, EnCodec at 75 Hz), VibeVoice's tokenizer delivers competitive PESQ, STOI, and UTMOS scores — demonstrating that the 80× compression does not significantly degrade audio quality. This is the enabling technology for 90-minute synthesis.
Audio Quality Metrics Explained
- PESQ (Perceptual Evaluation of Speech Quality) — simulates human perception of speech degradation vs a reference. Score range 1–4.5, higher = better. Industry standard for telephony quality.
- STOI (Short-Time Objective Intelligibility) — measures how intelligible speech is (can you understand the words?). Range 0–1, higher = clearer.
- UTMOS — an AI system trained to predict human MOS ratings automatically. Useful for large-scale evaluation where human annotation is expensive.
- SIM (Speaker Similarity) — how well the voice identity is preserved after tokenizer reconstruction.
The key insight from Table 3: VibeVoice achieves competitive scores on all metrics at just 7.5 Hz — proving the compression doesn't significantly degrade audio quality.
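SIM scores like these are typically cosine similarities between speaker embeddings of the reference and reconstructed audio (from an embedding model such as WavLM). A minimal sketch with made-up embedding values:

```python
import numpy as np

def speaker_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings; 1.0 = identical voice."""
    return float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

a = np.array([1.0, 0.0, 1.0])   # toy embedding of the reference speaker
b = np.array([1.0, 0.1, 0.9])   # toy embedding of the reconstruction
print(round(speaker_similarity(a, b), 3))   # close to 1.0: identity preserved
```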
Conclusion, Limitations & Risks
VibeVoice introduces a novel framework for long-form and multi-speaker speech generation. By integrating an ultra-low frame rate (7.5 Hz) acoustic tokenizer achieving 80× compression, hybrid acoustic+semantic speech representations, and an end-to-end LLM-based next-token diffusion backbone, VibeVoice enables synthesis of up to 90-minute conversations with up to 4 speakers.
The model achieves state-of-the-art quality on long-form podcast generation while maintaining strong generalization to short-utterance benchmarks. Scaling from 1.5B to 7B parameters consistently improves perceptual quality. Future directions include richer prosody control, broader language support, and background audio modeling.
Limitations & Responsible Use
Language Scope
Currently supports English and Chinese only. Transcripts in other languages may produce unexpected or degraded audio outputs.
Speech Only
Focuses exclusively on speech synthesis. Background noise, music, sound effects, and overlapping speech segments are not supported.
No Overlapping Speech
The current model does not explicitly model or generate overlapping speech — a natural component of real conversations that remains an open research problem.
Ethical Risk: Deepfakes & Disinformation
High-quality synthetic speech can be misused for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable and avoid misleading applications. VibeVoice is not recommended for commercial or real-world deployment without further safety testing. Intended for research purposes only.
References
- [ACC+24a] Philip Anastassiou et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv:2406.02430, 2024.
- [Bos25] Boson AI. Higgs Audio V2: Redefining Expressiveness in Audio Generation. GitHub, 2025.
- [CNM+24] Yushen Chen et al. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. arXiv:2410.06885, 2024.
- [CWC+22] Sanyuan Chen et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE JSTSP, 2022.
- [DCSA22] Alexandre Défossez et al. High fidelity neural audio compression. arXiv:2210.13438, 2022.
- [DWC+24a] Zhihao Du et al. CosyVoice 2: Scalable streaming speech synthesis with LLMs. arXiv:2412.10117, 2024.
- [Ele] ElevenLabs. ElevenLabs v3 alpha.
- [GLS+24] Haohan Guo et al. FireRedTTS: A foundation TTS framework for industry-level applications. arXiv:2409.03283, 2024.
- [Goo] Google. Gemini 2.5 Pro Preview TTS.
- [Goo24] Google. NotebookLM. 2024.
- [GZMY22] Zhifu Gao et al. Paraformer: Fast and accurate parallel transformer for non-autoregressive ASR. Interspeech 2022.
- [HJA20] Jonathan Ho, Ajay Jain, Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS 33, 2020.
- [JCC+25] Dongya Jia et al. DiTar: Diffusion transformer autoregressive modeling for speech. arXiv:2502.03930, 2025.
- [JJW+25] Shengpeng Ji et al. WavTokenizer: An efficient acoustic discrete codec tokenizer. ICLR 2025.
- [JYY+25] Zeqian Ju et al. MoonCast: High-quality zero-shot podcast generation. arXiv:2503.14345, 2025.
- [KSL+23] Rithesh Kumar et al. High-fidelity audio compression with improved RVQGAN. NeurIPS 2023.
- [KW14] Diederik P. Kingma, Max Welling. Auto-Encoding Variational Bayes. ICLR 2014.
- [LTL+24] Tianhong Li et al. Autoregressive image generation without vector quantization. arXiv:2406.11838, 2024.
- [LVS+23] Matthew Le et al. Voicebox: Text-guided multilingual universal speech generation at scale. NeurIPS 2023.
- [LWI+24] Zhijun Liu et al. Autoregressive diffusion transformer for TTS. arXiv:2406.05551, 2024.
- [LZB+22] Cheng Lu et al. DPM-Solver: A fast ODE solver for diffusion probabilistic models. NeurIPS 2022.
- [LZB+25] Cheng Lu et al. DPM-Solver++: Fast solver for guided sampling of diffusion models. MIR, 2025.
- [Nar25] Nari Labs. Nari Labs Dia. GitHub, 2025.
- [Ope25] OpenMOSS Team. MOSS-TTSD. GitHub, 2025.
- [PSJ+24] Se Jin Park et al. Long-form speech generation with spoken language models. arXiv:2412.18603, 2024.
- [RBHH01] Antony W. Rix et al. PESQ: Perceptual evaluation of speech quality. ICASSP 2001.
- [RKX+23] Alec Radford et al. Robust speech recognition via large-scale weak supervision. ICML 2023.
- [SBW+24] Yutao Sun et al. Multimodal latent language modeling with next-token diffusion. arXiv:2412.08635, 2024.
- [Ses25] SesameAILabs. SesameAILabs CSM Model. GitHub, 2025.
- [SHB15] Rico Sennrich, Barry Haddow, Alexandra Birch. Neural machine translation of rare words with subword units. arXiv:1508.07909, 2015.
- [SJT+23] Kai Shen et al. NaturalSpeech 2: Latent diffusion models for zero-shot TTS. arXiv:2304.09116, 2023.
- [SXN+22] Takaaki Saeki et al. UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022. arXiv:2204.02152, 2022.
- [THHJ10] Cees H. Taal et al. A short-time objective intelligibility measure. ICASSP 2010.
- [VSP+17] Ashish Vaswani et al. Attention Is All You Need. NeurIPS 2017.
- [WCW+23] Chengyi Wang et al. Neural codec language models are zero-shot TTS synthesizers. arXiv:2301.02111, 2023.
- [WJM+25] Xinsheng Wang et al. Spark-TTS: An efficient LLM-based TTS model. arXiv:2503.01710, 2025.
- [WZL+24] Yuancheng Wang et al. MaskGCT: Zero-shot TTS with masked generative codec transformer. arXiv:2409.00750, 2024.
- [XJM+23] Hainan Xu et al. Efficient sequence transduction by jointly predicting tokens and durations. INTERSPEECH 2023.
- [YYZ+24] An Yang et al. Qwen2.5 technical report. arXiv:2412.15115, 2024.
- [YZC+25] Zhen Ye et al. LLASA: Scaling train-time and inference-time compute for speech synthesis. arXiv:2502.04128, 2025.
- [ZDC+19] Heiga Zen et al. LibriTTS: A corpus derived from LibriSpeech for TTS. arXiv:1904.02882, 2019.
- [ZQW+25] Leying Zhang et al. CoVoMix2: Advancing zero-shot dialogue generation. arXiv:2503.00872, 2025.
- [ZZL+23] Xin Zhang et al. SpeechTokenizer: Unified speech tokenizer for speech LLMs. arXiv:2308.16692, 2023.