---
arxiv_id: 2601.03233
title: "LTX-2: Efficient Joint Audio-Visual Foundation Model"
authors:
  - Yoav HaCohen
  - Benny Brazowski
  - Nisan Chiprut
  - Yaki Bitterman
  - Andrew Kvochko
  - Avishai Berkowitz
  - Daniel Shalem
  - Daphna Lifschitz
  - Dudu Moshe
  - Eitan Porat
  - Eitan Richardson
  - Guy Shiran
  - Itay Chachy
  - Jonathan Chetboun
  - Michael Finkelson
  - Michael Kupchick
  - Nir Zabari
  - Nitzan Guetta
  - Noa Kotler
  - Ofir Bibi
  - Ori Gordon
  - Poriya Panet
  - Roi Benita
  - Shahar Armon
  - Victor Kulikov
  - Yaron Inger
  - Yonatan Shiftan
  - Zeev Melumian
  - Zeev Farbman
difficulty: Intermediate
tags:
  - Multimodal
  - Audio
  - Vision
published_at: 2026-01-06
flecto_url: https://flecto.zer0ai.dev/papers/2601.03233/
lang: en
---

> LTX-2: Efficient Joint Audio-Visual Foundation Model

**Summary**: The first open-source foundation model for joint text-to-audio+video generation.

## Introduction

### Introduction

The core problem: Current text-to-video models produce visually stunning but silent output. Text-to-audio models remain specialized for individual domains (speech, music, or foley). Attempts at audiovisual generation rely on decoupled sequential pipelines that fail to model the full joint distribution, missing critical dependencies like lip-synchronization and environmental acoustics.

Recent text-to-video (T2V) diffusion models such as LTX-Video, WAN 2.1, and Hunyuan Video have achieved remarkable progress in generating visually realistic, motion-consistent videos from text prompts. However, these models remain fundamentally silent &mdash; they omit the semantic, emotional, and environmental information conveyed by synchronized sound.

In parallel, text-to-audio generation has evolved from task-specific systems toward more general-purpose representations, yet most models remain specialized for specific domains rather than offering a unified approach to audio generation.

Achieving a coherent audiovisual experience requires a unified model that jointly captures the generative dependencies between vision and sound. While proprietary systems such as Veo 3 and Sora 2 have explored this direction, they remain closed-source. LTX-2 is the first open-source model to address this challenge with a unified architecture.

## Experiments

### Experiments & Results

LTX-2 is evaluated across three key dimensions: audiovisual quality via human preference studies, visual-only performance via standard benchmarks, and computational efficiency.

### Audiovisual Evaluation

Human preference studies show that LTX-2 significantly outperforms open-source alternatives such as Ovi. Furthermore, LTX-2 achieves human preference parity with proprietary models at a fraction of their computational cost and inference time.

### Video-Only Benchmarks

Despite being a multimodal model, LTX-2's visual stream maintains top-tier performance on standard video generation tasks. In the Artificial Analysis public leaderboard, LTX-2 achieves competitive results, demonstrating that adding audio does not degrade video quality.

### Inference Performance & Scalability

The primary advantage of the LTX-2 architecture is its extreme efficiency. Compared against Wan 2.2-14B (video-only, 14B parameters) on an H100 GPU:

### ~18x faster than Wan 2.2 per diffusion step on H100 GPU

Despite having more parameters (19B vs 14B) and generating both audio and video simultaneously, LTX-2 is approximately 18x faster per diffusion step. This speed advantage comes from the model's highly compressed latent space.

LTX-2 can generate up to 20 seconds of continuous video with synchronized stereo audio, exceeding the temporal limits of most current T2V models.

## Conclusion

### Conclusion

LTX-2 extends LTX-Video into a joint audiovisual foundation model through four key innovations: an asymmetric dual-stream transformer architecture, deep text conditioning with thinking tokens and multi-layer feature extraction from Gemma3 12B, a compact causal audio VAE with an efficient 1D latent space, and modality-aware classifier-free guidance for fine-grained audiovisual control.

Experiments show that LTX-2 sets a new benchmark for open-source T2AV generation &mdash; achieving state-of-the-art audiovisual quality while being the fastest model in its class.

All model weights and code are publicly released to advance research and democratize audiovisual content creation.



## Contributions

### Key Contributions

### Asymmetric Dual-Stream Architecture

A transformer-based backbone featuring a 14B-parameter video stream and a 5B-parameter audio stream, linked via bidirectional cross-attention with temporal RoPE. This asymmetric design efficiently allocates compute to match each modality's complexity.

### Text Processing with Thinking Tokens

A refined text-conditioning module using Gemma3 12B with multi-layer feature extraction and learned "thinking tokens" for enhanced prompt understanding and phonetic accuracy in generated speech.

### Compact Neural Audio Representation

An efficient causal audio VAE that produces a high-fidelity 1D latent space optimized for diffusion-based training, enabling generation of up to 20 seconds of continuous stereo audio.

### Modality-Aware Classifier-Free Guidance

A novel bimodal CFG scheme with independent text and cross-modal guidance scales, significantly improving audiovisual alignment and providing fine-grained controllability over synchronization.

## Architecture

### Architecture Overview

Figure 1: Overview of the LTX-2 architecture. Raw video and audio signals are encoded into modality-specific latent tokens via causal VAEs, while text is processed through a refined embedding pipeline. The asymmetric dual-stream transformer processes both modalities with bidirectional cross-attention.

### Decoupled Latent Representations

Rather than forcing video and audio into a shared latent space, LTX-2 uses separate modality-specific VAEs. Video employs a spatiotemporal causal VAE, while audio uses a mel-spectrogram-based causal VAE with a 1D latent space. This allows each encoder to be independently optimized.

### Asymmetric Dual Stream

Video and audio have fundamentally different information densities. The 14B-parameter video stream handles complex spatial and temporal visual content, while the 5B-parameter audio stream processes lower-dimensional audio latents. Both share the same architectural blueprint but differ in width and depth.

### Cross-Modal Attention

Bidirectional cross-attention layers throughout the model enable tight temporal alignment. By utilizing 1D temporal RoPE during cross-modal interactions, the model captures dependencies like lip-synchronization and environmental acoustics without degrading unimodal generation quality.

## Dual Stream

### Dual-Stream Architecture Details

Figure 2: (a) The dual-stream backbone processes video and audio latents in parallel, exchanging information via bidirectional cross-attention with temporal 1D RoPE. (b) Single block detail showing Self Attention, Text Cross-Attention, AV Cross-Attention, and FFN with AdaLN timestep conditioning.

The core of LTX-2 is an asymmetric dual-stream Diffusion Transformer. The backbone comprises a high-capacity 14B-parameter video stream and a 5B-parameter audio stream. Both streams share the same architectural blueprint &mdash; each block consists of Self Attention, Text Cross-Attention, Audio-Visual Cross-Attention, and a Feed-Forward Network (FFN). RMS normalization layers are interleaved between operations to stabilize activations.
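The block ordering described above can be sketched as follows in NumPy. The widths, single-head attention, and random untrained weights are illustrative assumptions, and RoPE and AdaLN timestep conditioning are omitted; only the operation order (Self Attention, Text Cross-Attention, AV Cross-Attention, FFN, each applied residually to an RMS-normalized input) follows the description.

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def attn(q_src, kv_src, d_head=16):
    """Single-head attention with random (untrained) projections."""
    d_q, d_kv = q_src.shape[-1], kv_src.shape[-1]
    Wq = rng.standard_normal((d_q, d_head)) / d_head**0.5
    Wk = rng.standard_normal((d_kv, d_head)) / d_head**0.5
    Wv = rng.standard_normal((d_kv, d_q)) / d_q**0.5
    s = (q_src @ Wq) @ (kv_src @ Wk).T / d_head**0.5
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ (kv_src @ Wv)

def stream_block(x, other, text):
    """Self-Attn -> Text X-Attn -> AV X-Attn -> FFN, residual throughout."""
    x = x + attn(rms_norm(x), rms_norm(x))       # self attention
    x = x + attn(rms_norm(x), text)              # text cross-attention
    x = x + attn(rms_norm(x), rms_norm(other))   # AV cross-attention
    d = x.shape[-1]
    W1 = rng.standard_normal((d, 4 * d)) / (4 * d)**0.5
    W2 = rng.standard_normal((4 * d, d)) / d**0.5
    return x + np.maximum(rms_norm(x) @ W1, 0.0) @ W2  # FFN

# Asymmetric widths: the video stream is wider than the audio stream.
video = rng.standard_normal((8, 64))   # 8 video latent tokens, width 64
audio = rng.standard_normal((4, 32))   # 4 audio latent tokens, width 32
text  = rng.standard_normal((5, 48))   # 5 text-conditioning tokens
video_out = stream_block(video, audio, text)   # video attends to audio
audio_out = stream_block(audio, video, text)   # audio attends to video
```

Because the two streams share this blueprint and differ only in width and depth, the same `stream_block` serves both, with the other stream's tokens supplied as cross-attention context.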

### Positional Encoding Strategy

The model uses Rotary Positional Embeddings (RoPE) to encode structure. In the video stream, 3D RoPE injects positional information across spatial dimensions (x, y) and time (t). In the audio stream, 1D RoPE encodes only the temporal dimension. During cross-modal attention, only the temporal component of RoPE is used, enforcing that cross-modal attention focuses on temporal synchronization rather than spatial correspondence.
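The temporal-only rule for cross-modal attention can be sketched as follows; the dimensions and token layout are illustrative assumptions, not the model's actual configuration. Because both streams rotate queries and keys by the shared temporal coordinate alone, attention scores depend on relative time rather than spatial position.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Per-(position, frequency) rotation angles, as in standard RoPE."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, freqs)               # (seq, dim/2)

def apply_rope(x, positions):
    """Rotate consecutive feature pairs of x (seq, dim) by position angles."""
    ang = rope_angles(positions, x.shape[1])
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Video tokens carry (x, y, t) coordinates; audio tokens only t. During
# AV cross-attention only the temporal coordinate is rotated, so two
# video patches from the same frame get identical rotations.
video_t = np.array([0, 0, 1, 1])    # two spatial patches per frame
audio_t = np.array([0, 1])          # one audio token per frame
q = apply_rope(np.ones((4, 8)), video_t)
k = apply_rope(np.ones((2, 8)), audio_t)
scores = q @ k.T                    # (4 video tokens) x (2 audio tokens)
# Tokens at the same timestep score highest, regardless of spatial index.
```

With identical input features, the score between two tokens reduces to a function of their time difference only, which is exactly the synchronization bias the temporal RoPE is meant to provide.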

## Cross Attention Viz

### Audio-Visual Cross-Attention

Figure 3: Visualization of AV cross-attention maps averaged across attention heads and model layers. V2A maps show how audio attends to video frames; A2V maps show how video attends to audio segments. Scenarios include a car passing by, speech with clapping, multi-speaker dialog, and a welcome message.

At each layer, the audio-visual cross-attention module enables bidirectional information flow between streams. The visualizations demonstrate that the model correctly associates sound events with their visual sources &mdash; car engine sounds focus on the vehicle, speech waveforms align with lip movements, and applause timing matches hand clapping.

## Text Conditioning

### Deep Text Conditioning and Thinking Tokens

Figure 4: The text understanding pipeline. The text prompt is encoded by Gemma3 12B, multi-layer activations are processed through the Feature Extractor, combined with learned Thinking Tokens, and refined through the Text Connector transformer blocks.

### Multi-Layer Feature Extractor

Rather than relying only on the final layer of the LLM, LTX-2 extracts features across all decoder layers. Intermediate representations capture a broader spectrum of linguistic features &mdash; from low-level phonetics in early layers to high-level semantics in later layers. The extraction process involves three steps:

1. Mean-centered scaling is applied to intermediate outputs across the sequence and embedding dimensions for each layer.
2. The scaled output is flattened into a representation of shape [B, T, D &times; L].
3. This high-dimensional representation is projected to the target dimension D using a learnable dense projection matrix W, jointly optimized then frozen.

The projection matrix W was jointly optimized with the LTX-2 model during a brief initial training stage using the standard diffusion MSE loss. This yielded an improvement in the model's prompt adherence and overall generation quality.
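The three extraction steps can be sketched in NumPy as below. The exact form of the mean-centered scaling is an assumption here (zero mean, unit variance over the sequence and embedding dimensions); the shapes follow the [B, T, D &times; L] description above.

```python
import numpy as np

def extract_text_features(layer_acts, W):
    """Fuse per-layer LLM activations into one conditioning embedding.

    layer_acts: list of L arrays, each (B, T, D) -- one per decoder layer.
    W: learned dense projection of shape (D * L, D), trained then frozen.
    """
    scaled = []
    for h in layer_acts:
        # Step 1: mean-centered scaling over sequence and embedding dims
        # (zero mean, unit variance here -- the exact scaling is assumed).
        c = h - h.mean(axis=(1, 2), keepdims=True)
        scaled.append(c / (c.std(axis=(1, 2), keepdims=True) + 1e-6))
    # Step 2: flatten across layers -> (B, T, D * L).
    stacked = np.concatenate(scaled, axis=-1)
    # Step 3: project back to the target dimension D with learned W.
    return stacked @ W

B, T, D, L = 2, 7, 16, 4
rng = np.random.default_rng(0)
acts = [rng.standard_normal((B, T, D)) for _ in range(L)]
W = rng.standard_normal((D * L, D)) / np.sqrt(D * L)
out = extract_text_features(acts, W)   # (B, T, D)
```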

### Thinking Tokens

Inspired by register tokens, LTX-2 introduces learned thinking tokens (R per prompt) appended to the text embeddings. These tokens and the original embeddings are processed together through a Text Connector module consisting of two transformer blocks. This enables richer token interactions and contextual mixing before conditioning the diffusion transformer, significantly improving generation quality.

## Audio Vae

### Audio VAE and Latent Space

### Audio VAE

Inspired by the efficient deep latent space from LTX-Video, LTX-2 adopts a compact causal audio VAE. It processes mel-spectrogram inputs and encodes them into 1D latent tokens. This compact representation enables efficient diffusion-based training while maintaining high-fidelity audio reconstruction quality.

### Vocoder

The final waveform is reconstructed using a vocoder based on the HiFi-GAN architecture, modified for joint stereo synthesis and upsampling. It directly converts the decoded mel-spectrograms into high-quality stereo waveforms.

## Inference

### Inference

### Modality-Aware Classifier-Free Guidance (CFG)

Figure 5: Multimodal Classifier-Free Guidance with independent text (s_t) and cross-modal (s_m) control scales.

During inference, LTX-2 employs a multimodal extension of Classifier-free Guidance (CFG) to enhance cross-modal consistency and synchronization while maintaining high fidelity to the text prompt.

Two independent scales govern guidance: s_t controls textual guidance strength and s_m controls cross-modal guidance strength. Increasing s_m promotes mutual information refinement between modalities &mdash; stronger cross-modal guidance leads to tighter lip synchronization and more coherent foley placement.
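One plausible form of this bimodal guidance, sketched in NumPy. The decomposition below (cross-modal guidance taken relative to the unconditional prediction, textual guidance relative to the cross-modal one) is an assumption for illustration and is not the paper's exact formula.

```python
import numpy as np

def bimodal_cfg(eps_uncond, eps_cross, eps_full, s_t, s_m):
    """Combine three denoiser passes into one guided prediction.

    eps_uncond: prediction with no text and no cross-modal conditioning
    eps_cross:  prediction with cross-modal conditioning only
    eps_full:   prediction with both text and cross-modal conditioning
    """
    return (eps_uncond
            + s_m * (eps_cross - eps_uncond)   # cross-modal guidance
            + s_t * (eps_full - eps_cross))    # textual guidance
```

Under this form, setting s_t = s_m = 1 recovers the fully conditioned prediction, while s_t = s_m = 0 recovers the unconditional one; raising s_m alone strengthens cross-modal agreement without changing text adherence.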

### Multi-Scale, Multi-Tile Inference

### Base Generation

Inference begins at a lower resolution, generating a base latent representation at approximately 0.5 megapixels. This captures the overall structure, motion, and audio content at manageable computational cost.

### Latent Upscaling

A dedicated latent upscaler increases the spatial resolution of video latents while preserving temporal consistency and audio alignment.

### Tiled Refinement

Upscaled latents are partitioned into overlapping spatial and temporal tiles. Each tile is refined independently, achieving 1080p fidelity in the final output.
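The overlap-and-blend idea behind tiled refinement can be sketched in one temporal dimension as follows; the tile size, overlap, and linear cross-fade weighting are illustrative assumptions (LTX-2 tiles both spatially and temporally).

```python
import numpy as np

def split_tiles(length, tile, overlap):
    """Start offsets of overlapping 1D tiles covering [0, length)."""
    step = tile - overlap
    starts = list(range(0, max(length - tile, 0) + 1, step))
    if starts[-1] + tile < length:       # make sure the tail is covered
        starts.append(length - tile)
    return starts

def tiled_refine(latents, tile=8, overlap=2, refine=lambda x: x):
    """Refine overlapping temporal tiles and blend with a linear cross-fade."""
    T = latents.shape[0]
    out = np.zeros_like(latents)
    weight = np.zeros((T,) + (1,) * (latents.ndim - 1))
    # Triangular blending ramp: low weight at tile edges, high in the middle.
    ramp = np.minimum(np.arange(1, tile + 1), np.arange(tile, 0, -1))
    ramp = (ramp / ramp.max()).reshape((tile,) + (1,) * (latents.ndim - 1))
    for s in split_tiles(T, tile, overlap):
        out[s:s + tile] += refine(latents[s:s + tile]) * ramp
        weight[s:s + tile] += ramp
    return out / weight                  # normalize overlapping regions

x = np.random.default_rng(0).standard_normal((20, 4))
y = tiled_refine(x)   # identity refiner: output reconstructs the input
```

With an identity refiner the blended output reproduces the input exactly, confirming that the overlap weighting normalizes to one everywhere; a real refiner would run the diffusion model on each tile independently.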

## Training

### Training Data & Captioning

### Training Data

LTX-2 uses a subset of the LTX-Video dataset, filtered for video clips containing significant and informative audio. The focus is on videos where audio is semantically meaningful &mdash; not just background noise, but speech, environmental sounds, and musical elements.

### Captioning System

A new video captioning system was developed to describe both visual and audio content. Captions are comprehensive yet factual, describing only what is seen and heard without emotional interpretation.

## Related Work

### Related Work

### Diffusion Transformers (DiTs)

Diffusion Transformers have emerged as a unifying architecture for large-scale generative modeling. Introduced by Peebles and Xie, DiTs replace convolutional U-Nets with transformer blocks, enabling better scaling behavior and more expressive latent processing.

### Audio and Video Generation

Text-to-video models like LTX-Video and WAN 2.1 demonstrate the power of DiT architectures trained on massive video datasets. Decoupled audio-visual synthesis has been explored through A2V and V2A approaches. Joint T2AV models like MMAudio and Ovi represent the frontier, but face challenges in achieving true audiovisual coherence.

### Text Conditioning

Text-conditioning has evolved from training encoders from scratch to leveraging pretrained encoders like T5, and more recently to using decoder-only LLMs with multi-layer feature extraction. LTX-2 builds on this by using Gemma3 12B with a novel multi-layer feature extractor and thinking tokens.

## Limitations

### Limitations

**Language performance varies**: Prompts in well-represented languages (primarily English) yield better results; performance on less common languages may be limited.

**Audio quality gaps**: Complex musical compositions and heavily overlapping speech remain challenging.

**Temporal scope**: Generation is limited to a maximum of 20 seconds of continuous content.

**Training data bias**: The model's generation diversity may be influenced by biases present in the training dataset.

## Social Impact

### Social Impact

### Opportunities

Text-to-audio+video generation enables content creators, educators, and accessibility tools. Models like LTX-2 can democratize audiovisual content creation, making professional-quality media generation accessible to individuals and small teams.

### Challenges

Realistic synthetic media carries potential for misuse, including deepfakes and disinformation. Responsible deployment requires safeguards such as watermarking, content provenance tracking, and clear disclosure of AI-generated content.

## Supplementary

### Supplementary Figures (2 figures)

Figure A1: LTX-2 training and inference pipelines. (a) Training: audio and video inputs are encoded into latents and the model learns to denoise. (b) Inference: the model denoises from noise, then decodes to audio (via VAE decoder + vocoder) and video (via VAE decoder).

Figure A2: Detailed view of a single stream. The audio and video streams are identical in architecture, featuring RMS Norm, Self Attention with RoPE, Text Cross Attention, AV Cross Attention, and FFN with AdaLN timestep conditioning.
