---
arxiv_id: "2508.19205"
title: "VibeVoice Technical Report — Long-Form Multi-Speaker Speech Synthesis"
authors:
  - Zhiliang Peng
  - Jianwei Yu
  - Wenhui Wang
  - Yaoyao Chang
  - Yutao Sun
  - Li Dong
  - Yi Zhu
  - Weijiang Xu
  - Hangbo Bao
  - Zehua Wang
  - Shaohan Huang
  - Yan Xia
  - Furu Wei
difficulty: Intermediate
tags:
  - Audio
  - LLM
  - Diffusion
  - Multimodal
published_at: "2025-08-26"
flecto_url: https://flecto.zer0ai.dev/papers/2508.19205/
lang: en
---

## Abstract

VibeVoice is a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion — a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. A novel continuous speech tokenizer achieves 80× data compression versus Encodec while maintaining comparable performance. VibeVoice can synthesize long-form speech for up to 90 minutes (64K context window) with up to 4 speakers, capturing authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.

## Introduction

Recent Text-to-Speech systems have achieved impressive results for single-speaker, short-utterance synthesis — but a major frontier remains unsolved: generating long-form, natural-sounding dialogue with multiple distinct speakers. Producing a 30-minute podcast where two speakers maintain consistent voices, realistic turn-taking, and natural prosody across thousands of tokens is far beyond what existing models can reliably accomplish.

The core challenge is scale. Standard audio codecs like Encodec run at 75 frames per second, and residual quantization expands each frame into multiple discrete tokens, yielding effective rates of up to 600 tokens per second. Even a few minutes of speech therefore produces hundreds of thousands of tokens, far exceeding the context capacity of current LLMs. Without a radical reduction in token density, long-form multi-speaker generation is computationally intractable.

VibeVoice solves this with a novel continuous speech tokenizer operating at just 7.5 Hz — a compression of over 80× versus Encodec — while preserving audio fidelity. This ultra-low frame rate makes 90-minute, 4-speaker synthesis feasible within a 64K-token context window using a standard LLM backbone.
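
The token budget behind these numbers can be checked directly. A minimal sketch (the 8-codebook configuration for Encodec is an illustrative assumption, not a figure from the paper):

```python
# Token budget for a 90-minute session at different frame rates.
SECONDS = 90 * 60  # 90 minutes of audio

# VibeVoice: 7.5 continuous latent vectors per second of audio.
vibevoice_tokens = int(7.5 * SECONDS)

# Encodec-style codec: 75 frames/s; with 8 residual codebooks
# (an illustrative configuration) each frame costs 8 discrete tokens.
encodec_tokens = 75 * 8 * SECONDS

print(vibevoice_tokens)                    # 40500 -- fits in a 64K context
print(encodec_tokens)                      # 3240000 -- far beyond any LLM context
print(encodec_tokens // vibevoice_tokens)  # 80x compression
```

At 7.5 Hz, even a 90-minute conversation stays comfortably under the 64K-token ceiling, which is what makes a standard LLM backbone viable at all.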

The architecture is deliberately streamlined: a pre-trained LLM (Qwen2.5, 1.5B or 7B parameters) processes voice prompts and text scripts, then a lightweight diffusion head generates continuous speech latents token by token. The simplicity is intentional — previous designs required complex separate components that VibeVoice collapses into a single unified pipeline.

### Key Contributions

- **Ultra-low frame rate tokenizer (7.5 Hz)** — novel continuous speech tokenizer achieving 80× compression vs Encodec, preserving audio fidelity at a tiny fraction of the token cost
- **Next-token diffusion LLM backbone** — unified framework combining LLM sequence modeling with token-level diffusion decoding for high-fidelity continuous speech
- **64K context window / 90-min synthesis** — supports up to 4 simultaneous speakers in a single generation run, unprecedented in open-source TTS
- **State-of-the-art quality** — VibeVoice-7B outperforms Gemini-2.5-Pro-Preview-TTS, Eleven-V3, and all open-source competitors in preference, realism, and richness MOS scores

## Method

### 2.1 Speech Tokenizers

VibeVoice employs two separate tokenizers to learn both acoustic and semantic features. Generating long-form speech benefits from this separation: the acoustic tokenizer preserves audio quality at ultra-low bit rate, while the semantic tokenizer captures linguistic content independently.

**Acoustic Tokenizer:** Built on Variational Autoencoder (VAE) principles, specifically the σ-VAE variant from LatentLM, which prevents variance collapse in autoregressive settings. It features a mirror-symmetric encoder-decoder with 7 hierarchical stages of modified Transformer blocks, using 1D depth-wise causal convolutions for efficient streaming. Six downsampling layers achieve the breakthrough compression. The encoder produces a continuous latent vector Z_t per timestep, and the decoder reconstructs the waveform from these latents. The entire design is causal, enabling real-time streaming synthesis.

- 7.5 Hz frame rate
- 80× compression vs Encodec
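
The streaming property comes from causality: each output sample depends only on current and past inputs, so audio can be decoded as latents arrive. A minimal pure-Python sketch of a 1D depth-wise causal convolution (kernel size and channel count are illustrative, not the paper's):

```python
def causal_depthwise_conv1d(x, kernels):
    """x: list of per-channel sample lists; kernels: one kernel per channel
    (depth-wise: no mixing across channels). Left-padding makes output[t]
    depend only on inputs [t-k+1 .. t], never on the future."""
    out = []
    for channel, k in zip(x, kernels):
        pad = [0.0] * (len(k) - 1)  # causal left padding with zeros
        padded = pad + channel
        out.append([
            sum(k[j] * padded[t + j] for j in range(len(k)))
            for t in range(len(channel))
        ])
    return out

# Two channels, kernel size 3. The second kernel weights only the
# current sample, so that channel passes through unchanged.
x = [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]]
kernels = [[0.5, 0.25, 0.25], [0.0, 0.0, 1.0]]
y = causal_depthwise_conv1d(x, kernels)
print(y[0])  # [0.25, 0.75, 1.75] -- output at t=0 sees only x[0]
```

In the real tokenizer these convolutions are interleaved with downsampling and Transformer stages, but the causal-padding trick is the same.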

**Semantic Tokenizer:** Mirrors the hierarchical architecture of the Acoustic Tokenizer's encoder, but without VAE components — it is deterministic, focused on extracting content-centric linguistic features rather than acoustic fidelity. Uses Automatic Speech Recognition (ASR) as the proxy training objective. This grounds the semantic latents in linguistic content, ensuring the model understands what is being said (semantic) separately from how it sounds (acoustic) — the key to consistent long-form generation.

### 2.2 VibeVoice Architecture

VibeVoice uses a Large Language Model as its core sequence model, integrated with specialized audio encoding and diffusion-based decoding. The LLM (Qwen2.5) processes interleaved voice font features and text script embeddings, with role identifiers distinguishing speakers. At each step, a lightweight diffusion head conditioned on the LLM's hidden state generates the next continuous acoustic latent vector.

**Input Representation:** Z denotes acoustic latent features (from voice prompts) and T denotes text script embeddings. Speaker role identifiers (Speaker_k) interleave the features, allowing the model to track which speaker is producing which audio segment throughout the 64K-token context window.
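
One way to picture this layout is as a flat sequence with role markers preceding each turn's voice latents and text. This sketch illustrates the interleaving idea only; the marker strings and turn structure are hypothetical, not the paper's actual tokenization:

```python
def build_sequence(script):
    """script: list of (speaker_id, text, voice_latents) turns.
    Returns an interleaved stream: role marker, voice-prompt
    latents Z, then text embeddings T for each turn."""
    seq = []
    for speaker_id, text, voice_latents in script:
        seq.append(f"[Speaker_{speaker_id}]")            # role identifier
        seq.extend(f"Z<{z}>" for z in voice_latents)     # acoustic latents
        seq.extend(f"T<{w}>" for w in text.split())      # text embeddings
    return seq

seq = build_sequence([
    (1, "hello there", ["z0", "z1"]),
    (2, "hi", ["z2"]),
])
print(seq)
# ['[Speaker_1]', 'Z<z0>', 'Z<z1>', 'T<hello>', 'T<there>',
#  '[Speaker_2]', 'Z<z2>', 'T<hi>']
```

Because the role marker precedes every turn, the LLM can attribute each stretch of latents to the right voice even thousands of tokens into a conversation.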

**Token-Level Diffusion:** At each autoregressive step, the diffusion head is conditioned on the LLM's hidden state h_i and predicts the denoised acoustic latent. During training, it learns to reverse a forward noising process. During inference, it uses DPM-Solver++ for fast sampling in just 10 steps — enabling practical streaming generation.
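
The generation loop pairs one autoregressive LLM step with a short per-token denoising loop. A toy sketch under stated assumptions: `llm_step` and `denoise` are stand-ins for the real Qwen2.5 backbone and DPM-Solver++ update, and the scalar "latents" exist only to make the control flow concrete:

```python
import random

def llm_step(context):
    """Stand-in for the LLM: maps the running context to a hidden state.
    (Hypothetical: real hidden states are high-dimensional vectors.)"""
    return float(len(context))

def denoise(latent, h, step, total):
    """Stand-in for one solver update: pull the noisy latent toward a
    target determined by the conditioning hidden state h."""
    target = h * 0.1
    return latent + (target - latent) / (total - step)

def generate(num_tokens, num_denoise_steps=10):
    context, latents = [], []
    rng = random.Random(0)
    for _ in range(num_tokens):
        h = llm_step(context)                # 1. hidden state from the LLM
        z = rng.gauss(0.0, 1.0)              # 2. start from pure noise
        for s in range(num_denoise_steps):   # 3. 10 fast solver steps
            z = denoise(z, h, s, num_denoise_steps)
        latents.append(z)
        context.append(z)                    # 4. latent feeds back as context
    return latents

latents = generate(5)
```

The key structural point survives the simplification: the expensive LLM runs once per token, while the lightweight diffusion head runs its 10 steps conditioned on that single hidden state.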

The model was instantiated with Qwen2.5 at two scales (1.5B and 7B). The diffusion head comprises 4 transformer layers. During training, the acoustic and semantic tokenizers remain frozen — only the LLM and diffusion head are learned, enabling efficient fine-tuning.

- 4-layer diffusion head

## Results

### 3.1 Long-Form Podcast Evaluation

VibeVoice was evaluated against state-of-the-art conversational speech models: Nari Labs Dia, SesameAILabs CSM, Higgs Audio V2, Eleven-V3 Alpha, and Gemini-2.5-Pro-Preview-TTS. The test set consisted of 8 long conversational transcripts totaling approximately 1 hour.

**Objective evaluation:** Word Error Rate (WER) was measured using Whisper-large-v3 and Nemo ASR. Speaker similarity (SIM) was computed using WavLM-large speaker embeddings.
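
WER is the word-level edit distance between hypothesis and reference transcripts, normalized by reference length. A minimal implementation for intuition (not the Whisper/Nemo evaluation harness used in the paper):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / len(ref),
    computed as Levenshtein distance over words via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the vibe is right", "the vibe is right"))  # 0.0
print(wer("the vibe is right", "a vibe is bright"))   # 0.5 (two substitutions)
```

In practice the synthesized audio is first transcribed by an ASR model, and WER is computed between that transcript and the input script, so it measures intelligibility end to end.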

**Subjective evaluation:** 24 human annotators scored each system on three dimensions: Realism (naturalness, prosody, emotion, turn-taking smoothness), Richness (expressiveness in tone, emotion, and conversational dynamics), and Preference (overall listening preference).

VibeVoice models outperform all competing systems across both objective and subjective metrics. The 7B model shows significant gains over 1.5B, particularly in perceptual quality scores. Scaling the LLM backbone directly translates to better speech quality — a consistent finding across all evaluation dimensions.

**Key finding:** VibeVoice-7B achieves an average MOS of 3.76, surpassing Gemini-2.5-Pro-Preview-TTS (3.40), MOSS-TTSD (3.54), and Eleven-V3 Alpha (3.66) on long-form conversation. WER-Whisper of 1.29% and speaker similarity of 0.692 demonstrate strong objective quality.

### 3.2 Zero-Shot Short Utterance Evaluation

Although primarily trained on long-form speech, VibeVoice was also evaluated on the SEED short-utterance benchmark — approximately 1,000 English samples and 2,000 Chinese samples from Common Voice. Metrics: CER↓ (Chinese character error rate), WER↓ (English word error rate), SIM↑ (speaker similarity).

Compared models operate at 25–50 Hz frame rates; VibeVoice-1.5B uses just 7.5 Hz, generating roughly 3× to 7× fewer tokens per second of audio. This dramatically reduces inference compute while maintaining competitive accuracy.
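
The per-second token savings follow directly from the frame rates reported in the benchmark:

```python
# Tokens emitted per second of audio at each system's frame rate.
rates_hz = {"VibeVoice-1.5B": 7.5, "MaskGCT": 50, "Spark TTS": 50}

tokens_per_minute = {name: int(hz * 60) for name, hz in rates_hz.items()}
print(tokens_per_minute["VibeVoice-1.5B"])  # 450
print(tokens_per_minute["MaskGCT"])         # 3000

ratio = rates_hz["MaskGCT"] / rates_hz["VibeVoice-1.5B"]  # ~6.7x fewer tokens
```

Against a 25 Hz system the same calculation gives about a 3.3× reduction, which is why the savings span roughly 3× to 7×.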

**Key finding:** VibeVoice-1.5B achieves CER 1.16% (Chinese) and WER 3.04% (English) at 7.5 Hz frame rate — competitive with MaskGCT (50 Hz, CER 2.27%), Seed-TTS (–, CER 1.12%), and Spark TTS (50 Hz, CER 1.20%). Strong generalization despite being trained primarily on long-form audio.

### 3.3 Tokenizer Reconstruction Quality

The fidelity of audio reconstructed from acoustic tokens measures how well the tokenizer preserves essential acoustic information under extreme compression. VibeVoice's tokenizer was benchmarked against ground truth, DAC, EnCodec, SpeechTokenizer, and WavTokenizer on LibriTTS test-clean and test-other, using PESQ↑, STOI↑, MOS↑, SIM↑, and UTMOS↑.

**Key finding:** At 7.5 Hz (vs DAC at 86 Hz, EnCodec at 75 Hz), VibeVoice's tokenizer delivers competitive PESQ, STOI, and UTMOS scores — demonstrating that the 80× compression does not significantly degrade audio quality. This is the enabling technology for 90-minute synthesis.

## Conclusion, Limitations & Risks

VibeVoice introduces a novel framework for long-form and multi-speaker speech generation. By integrating an ultra-low frame rate (7.5 Hz) acoustic tokenizer achieving 80× compression, hybrid acoustic+semantic speech representations, and an end-to-end LLM-based next-token diffusion backbone, VibeVoice enables synthesis of up to 90-minute conversations with up to 4 speakers.

The model achieves state-of-the-art quality on long-form podcast generation while maintaining strong generalization to short-utterance benchmarks. Scaling from 1.5B to 7B parameters consistently improves perceptual quality. Future directions include richer prosody control, broader language support, and background audio modeling.

### Limitations & Responsible Use

**Language Scope:** Currently supports English and Chinese only. Transcripts in other languages may produce unexpected or degraded audio outputs.

**Speech Only:** Focuses exclusively on speech synthesis. Background noise, music, sound effects, and overlapping speech segments are not supported.

**No Overlapping Speech:** The current model does not explicitly model or generate overlapping speech — a natural component of real conversations that remains an open research problem.

**Ethical Risk: Deepfakes & Disinformation:** High-quality synthetic speech can be misused for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable and avoid misleading applications. VibeVoice is not recommended for commercial or real-world deployment without further safety testing. Intended for research purposes only.
