โ† Flecto๐Ÿค– Agent Ready

Lightricks

LTX-2: Efficient Joint Audio-Visual Foundation Model

The first open-source foundation model for joint text-to-audio+video generation

19B Parameters · 18x Faster · Open Source · Stereo Audio

A unified model that generates synchronized video and rich audio tracks including speech, foley, and ambient sounds, achieving state-of-the-art quality at extreme computational efficiency.

Introduction

The core problem: Current text-to-video models produce visually stunning but silent output. Text-to-audio models remain specialized for individual domains (speech, music, or foley). Attempts at audiovisual generation rely on decoupled sequential pipelines that fail to model the full joint distribution, missing critical dependencies like lip-synchronization and environmental acoustics.

Recent text-to-video (T2V) diffusion models such as LTX-Video, WAN 2.1, and Hunyuan Video have achieved remarkable progress in generating visually realistic, motion-consistent videos from text prompts. However, these models remain fundamentally silent: they omit the semantic, emotional, and environmental information conveyed by synchronized sound.

In parallel, text-to-audio generation has evolved from task-specific systems toward more general-purpose representations, yet most models remain specialized for specific domains rather than offering a unified approach to audio generation.

Achieving a coherent audiovisual experience requires a unified model that jointly captures the generative dependencies between vision and sound. While proprietary systems such as Veo 3 and Sora 2 have explored this direction, they remain closed-source. LTX-2 is the first open-source model to address this challenge with a unified architecture.

Key Contributions

๐Ÿ—

Asymmetric Dual-Stream Architecture

A transformer-based backbone featuring a 14B-parameter video stream and a 5B-parameter audio stream, linked via bidirectional cross-attention with temporal RoPE. This asymmetric design efficiently allocates compute to match each modality's complexity.

💬 Text Processing with Thinking Tokens

A refined text-conditioning module using Gemma3 12B with multi-layer feature extraction and learned "thinking tokens" for enhanced prompt understanding and phonetic accuracy in generated speech.

🎵 Compact Neural Audio Representation

An efficient causal audio VAE that produces a high-fidelity 1D latent space optimized for diffusion-based training, enabling generation of up to 20 seconds of continuous stereo audio.

🎯 Modality-Aware Classifier-Free Guidance

A novel bimodal CFG scheme with independent text and cross-modal guidance scales, significantly improving audiovisual alignment and providing fine-grained controllability over synchronization.

Why asymmetric? Think of it like a movie production team: you need a much larger crew for filming (video) than for recording sound (audio). Similarly, video data is far more complex (it has spatial dimensions x, y plus time), so it needs a bigger neural network (14B parameters) to process. Audio is simpler (just a 1D time signal), so a smaller network (5B parameters) suffices. This saves compute without sacrificing quality.

Architecture Overview

LTX-2 Architecture Overview
Figure 1: Overview of the LTX-2 architecture. Raw video and audio signals are encoded into modality-specific latent tokens via causal VAEs, while text is processed through a refined embedding pipeline. The asymmetric dual-stream transformer processes both modalities with bidirectional cross-attention.

Decoupled Latent Representations

Rather than forcing video and audio into a shared latent space, LTX-2 uses separate modality-specific VAEs. Video employs a spatiotemporal causal VAE, while audio uses a mel-spectrogram-based causal VAE with a 1D latent space. This allows each encoder to be independently optimized.

Asymmetric Dual Stream

Video and audio have fundamentally different information densities. The 14B-parameter video stream handles complex spatial and temporal visual content, while the 5B-parameter audio stream processes lower-dimensional audio latents. Both share the same architectural blueprint but differ in width and depth.

Cross-Modal Attention

Bidirectional cross-attention layers throughout the model enable tight temporal alignment. By utilizing 1D temporal RoPE during cross-modal interactions, the model captures dependencies like lip-synchronization and environmental acoustics without degrading unimodal generation quality.

What is RoPE (Rotary Positional Embeddings)? Neural networks process data as sequences of tokens but don't inherently know the position of each token. RoPE is an elegant technique that encodes position information by rotating the embedding vectors. For video, LTX-2 uses 3D RoPE (encoding x, y, and time positions). For audio, it uses 1D RoPE (encoding only time). When the two streams talk to each other via cross-attention, only the time component matters โ€” ensuring they stay in sync temporally.
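To make the rotation concrete, here is a minimal NumPy sketch of 1D RoPE. This illustrates the general technique, not LTX-2's actual implementation; the key property it demonstrates is that attention scores between rotated queries and keys depend only on the relative temporal offset, which is exactly what cross-modal synchronization needs.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Apply 1D rotary positional embedding to vectors x at positions pos.

    x:   (T, D) array with D even -- query or key vectors
    pos: (T,) array of (temporal) positions
    """
    T, D = x.shape
    freqs = base ** (-np.arange(0, D, 2) / D)   # one frequency per dim pair
    angles = np.outer(pos, freqs)               # (T, D/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]             # split dims into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # rotate each pair by its angle
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Key property: attention scores depend only on the relative offset.
rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))
score_a = rope_1d(q, np.array([3])) @ rope_1d(k, np.array([5])).T
score_b = rope_1d(q, np.array([10])) @ rope_1d(k, np.array([12])).T
print(np.allclose(score_a, score_b))  # same offset (2) -> same score: True
```

Because rotations compose, the dot product of a query rotated by angle a and a key rotated by angle b depends only on b − a, so positions 3/5 and 10/12 yield identical scores.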

Dual-Stream Architecture Details

Dual-Stream Architecture
Figure 2: (a) The dual-stream backbone processes video and audio latents in parallel, exchanging information via bidirectional cross-attention with temporal 1D RoPE. (b) Single block detail showing Self Attention, Text Cross-Attention, AV Cross-Attention, and FFN with AdaLN timestep conditioning.

The core of LTX-2 is an asymmetric dual-stream Diffusion Transformer. The backbone comprises a high-capacity 14B-parameter video stream and a 5B-parameter audio stream. Both streams share the same architectural blueprint: each block consists of Self Attention, Text Cross-Attention, Audio-Visual Cross-Attention, and a Feed-Forward Network (FFN). RMS normalization layers are interleaved between operations to stabilize activations.
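The per-block computation order can be sketched as follows. This is an illustrative single-head NumPy toy: real blocks use multi-head attention, learned projections, AdaLN timestep conditioning, and a proper FFN, all of which are omitted or replaced by placeholders here.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS normalization over the feature dimension."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attention(q, k, v):
    """Single-head scaled dot-product attention (projections omitted)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def dual_stream_block(video, audio, text):
    """One sketched block: Self Attention -> Text Cross-Attention ->
    AV Cross-Attention -> FFN, with RMS norm and residuals per stream."""
    def stream(x, other):
        n = rms_norm(x); x = x + attention(n, n, n)          # self-attention
        n = rms_norm(x); x = x + attention(n, text, text)    # text cross-attention
        n = rms_norm(x); x = x + attention(n, other, other)  # AV cross-attention
        n = rms_norm(x); x = x + np.tanh(n)                  # toy FFN stand-in
        return x
    # Each stream attends to the other's block input (parallel exchange).
    return stream(video, audio), stream(audio, video)

rng = np.random.default_rng(0)
v = rng.normal(size=(32, 16))   # video latent tokens
a = rng.normal(size=(8, 16))    # audio latent tokens (shorter sequence)
t = rng.normal(size=(12, 16))   # text conditioning tokens
v2, a2 = dual_stream_block(v, a, t)
print(v2.shape, a2.shape)  # (32, 16) (8, 16)
```

Note how the two streams run the same block recipe on sequences of different lengths; in the real model they also differ in width and depth, matching the asymmetric 14B/5B split.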

Positional Encoding Strategy

The model uses Rotary Positional Embeddings (RoPE) to encode structure. In the video stream, 3D RoPE injects positional information across spatial dimensions (x, y) and time (t). In the audio stream, 1D RoPE encodes only the temporal dimension. During cross-modal attention, only the temporal component of RoPE is used, enforcing that cross-modal attention focuses on temporal synchronization rather than spatial correspondence.

Understanding the Positional Encoding Design

The key insight here is about what information matters when:

  • Within video: Position in space (x, y) and time (t) all matter. A face at position (100, 200) at frame 5 is different from a face at (300, 400) at frame 10.
  • Within audio: Only time matters; audio is a 1D signal. There's no spatial "position" in a waveform.
  • Between modalities: When syncing audio and video, only temporal alignment matters. You want the sound of a clap to match the frame showing hands meeting, regardless of where in the frame the hands are. That's why cross-attention uses only the temporal RoPE component.

Audio-Visual Cross-Attention

AV Cross-Attention Maps
Figure 3: Visualization of AV cross-attention maps averaged across attention heads and model layers. V2A maps show how audio attends to video frames; A2V maps show how video attends to audio segments. Scenarios include a car passing by, speech with clapping, multi-speaker dialog, and a welcome message.

At each layer, the audio-visual cross-attention module enables bidirectional information flow between streams. The visualizations demonstrate that the model correctly associates sound events with their visual sources: car engine sounds focus on the vehicle, speech waveforms align with lip movements, and applause timing matches hand clapping.

What do these attention maps show? Each heatmap reveals what the model "focuses on" when processing one modality given the other. Hot spots in V2A maps show which video frames are most relevant when generating a particular audio segment. For example, when generating the sound "car passing by," the model attends strongly to the video frames showing the car. This bidirectional attention is what makes the generated audio sound natural and synchronized.

Deep Text Conditioning and Thinking Tokens

Text Understanding Pipeline
Figure 4: The text understanding pipeline. The text prompt is encoded by Gemma3 12B, multi-layer activations are processed through the Feature Extractor, combined with learned Thinking Tokens, and refined through the Text Connector transformer blocks.

Multi-Layer Feature Extractor

Rather than relying on the final causal layer of the LLM, LTX-2 extracts features across all decoder layers. Intermediate representations capture a broader spectrum of linguistic features, from low-level phonetics in early layers to high-level semantics in later layers. The extraction process involves three steps:

  1. Mean-centered scaling is applied to intermediate outputs across the sequence and embedding dimensions for each layer.
  2. The scaled output is flattened into a representation of shape [B, T, D × L].
  3. This high-dimensional representation is projected to the target dimension D using a learnable dense projection matrix W, jointly optimized then frozen.

The projection matrix W was jointly optimized with the LTX-2 model during a brief initial training stage using the standard diffusion MSE loss. This yielded an improvement in the model's prompt adherence and overall generation quality.
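The three extraction steps can be sketched in NumPy. All shapes below are toy values, and mean_centered_scale is one plausible reading of step 1; the exact normalization details are assumptions, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, D, L = 2, 6, 16, 4          # toy batch, tokens, width, decoder layers

# Stand-in for per-layer hidden states from the text encoder: L arrays of (B, T, D)
layers = [rng.normal(size=(B, T, D)) for _ in range(L)]

def mean_centered_scale(h):
    """Step 1 (assumed form): center and scale one layer's activations
    over the sequence and embedding dimensions."""
    mu = h.mean(axis=(1, 2), keepdims=True)
    sd = h.std(axis=(1, 2), keepdims=True)
    return (h - mu) / (sd + 1e-6)

scaled = [mean_centered_scale(h) for h in layers]
flat = np.concatenate(scaled, axis=-1)             # step 2: (B, T, D*L)
W = rng.normal(size=(D * L, D)) / np.sqrt(D * L)   # step 3: learnable projection
features = flat @ W                                # (B, T, D) conditioning features
print(flat.shape, features.shape)  # (2, 6, 64) (2, 6, 16)
```

In training, W would be optimized jointly with the diffusion loss and then frozen, as described above; here it is just a random stand-in.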

Why extract from ALL layers, not just the last one? Large language models process text through many layers, and each layer captures different information. Early layers capture low-level features like phonetics and character patterns (critical for generating realistic speech). Middle layers capture syntax and word relationships. Later layers capture high-level semantics and meaning. By extracting and combining information from all layers, LTX-2 gets a much richer understanding of the prompt than models that only use the final layer output.

Thinking Tokens

Inspired by register tokens, LTX-2 introduces learned thinking tokens (R per prompt) appended to the text embeddings. These tokens and the original embeddings are processed together through a Text Connector module consisting of two transformer blocks. This enables richer token interactions and contextual mixing before conditioning the diffusion transformer, significantly improving generation quality.

Thinking Tokens Explained

Imagine you're solving a math problem and write down "scratch work" before giving your final answer. Thinking tokens work similarly: they're extra learned slots that give the model space to "think" and mix information before conditioning the generation.

Specifically, R extra tokens (with learned initial values) are appended to the text embedding and processed together through transformer blocks. These tokens don't correspond to any input text; instead, they serve as a computational workspace where the model can combine and refine the text representation. This concept is inspired by register tokens used in vision transformers.
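A minimal sketch of the mechanism, with a single toy self-attention pass standing in for the two Text Connector transformer blocks (R, T, and D are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, R = 10, 16, 4                      # prompt tokens, width, thinking tokens

text_emb = rng.normal(size=(T, D))       # encoded prompt features
thinking = rng.normal(size=(R, D))       # learned initial values, prompt-independent

seq = np.vstack([text_emb, thinking])    # append R thinking tokens: (T + R, D)

# One toy self-attention pass standing in for the Text Connector blocks:
# every token, including the thinking tokens, can read from every other token.
scores = seq @ seq.T / np.sqrt(D)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
mixed = w @ seq                          # refined conditioning sequence
print(mixed.shape)  # (14, 16)
```

The thinking tokens carry no input text; they only gain content through this mixing, acting as the "scratch work" slots described above.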

Audio VAE and Latent Space

Audio VAE

Inspired by the efficient deep latent space from LTX-Video, LTX-2 adopts a compact causal audio VAE. It processes mel-spectrogram inputs and encodes them into 1D latent tokens. This compact representation enables efficient diffusion-based training while maintaining high-fidelity audio reconstruction quality.
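For intuition only, here is a shape-level sketch of the representation. Every number below (mel bins, frame rate, compression ratio, latent width) is an assumed placeholder; the summary above does not specify these values.

```python
import numpy as np

# All constants here are hypothetical, chosen only to illustrate the shapes.
frames_per_sec = 80     # assumed mel frames per second
seconds = 20            # maximum continuous duration mentioned in the text
n_mels = 128            # assumed mel bins
downsample = 16         # assumed temporal compression of the causal VAE
latent_dim = 64         # assumed latent channel count

mel = np.zeros((n_mels, frames_per_sec * seconds))   # 2D mel-spectrogram input
n_tokens = mel.shape[1] // downsample                # causal encoder output length
latent = np.zeros((n_tokens, latent_dim))            # 1D sequence of latent tokens
print(mel.shape, latent.shape)  # (128, 1600) (100, 64)
```

The point is the shape change: a 2D spectrogram becomes a short 1D token sequence, which is what keeps diffusion training over audio cheap.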

Vocoder

The final waveform is reconstructed using a vocoder based on the HiFi-GAN architecture, modified for joint stereo synthesis and upsampling. It directly converts the decoded mel-spectrograms into high-quality stereo waveforms.

Inference

Modality-Aware Classifier-Free Guidance (CFG)

Multimodal CFG Diagram
Figure 5: Multimodal Classifier-Free Guidance with independent text (s_t) and cross-modal (s_m) control scales.

During inference, LTX-2 employs a multimodal extension of Classifier-free Guidance (CFG) to enhance cross-modal consistency and synchronization while maintaining high fidelity to the text prompt.

$$M'(x,t,m) = M(x,t,m) + s_t \cdot \bigl(M(x,t,m) - M(x,\varnothing,m)\bigr) + s_m \cdot \bigl(M(x,t,m) - M(x,t,\varnothing)\bigr)$$

Where s_t controls textual guidance strength and s_m controls cross-modal guidance strength. Increasing s_m promotes mutual information refinement between modalities: stronger cross-modal guidance leads to tighter lip synchronization and more coherent foley placement.

Understanding Modality-Aware CFG

Classifier-Free Guidance (CFG) is a widely used technique to improve generation quality. The basic idea: during inference, the model makes two predictions, one conditioned on the text prompt and one unconditional. The difference between them is amplified to push the output closer to what the prompt describes.

LTX-2 extends this with two separate guidance scales:

  • s_t (text guidance): Controls how strongly the output follows the text prompt. Higher values = more prompt-faithful output.
  • s_m (cross-modal guidance): Controls how strongly audio and video influence each other. Higher values = tighter synchronization (e.g., better lip-sync).

This separation means you can independently tune "how well does it match my description?" vs. "how well are audio and video synchronized?" That independence is a significant advantage over single-scale CFG.
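The guidance rule is straightforward to apply once the three denoiser predictions are available. A hedged NumPy sketch (bimodal_cfg is a hypothetical helper name, not LTX-2 API):

```python
import numpy as np

def bimodal_cfg(m_full, m_no_text, m_no_modal, s_t, s_m):
    """Combine three denoiser predictions per the formula above:
    m_full     = M(x, t, m): conditioned on text and the other modality
    m_no_text  = M(x, 0, m): text condition dropped
    m_no_modal = M(x, t, 0): other-modality condition dropped
    """
    return (m_full
            + s_t * (m_full - m_no_text)    # push toward the text prompt
            + s_m * (m_full - m_no_modal))  # push toward cross-modal agreement

rng = np.random.default_rng(0)
m_full, m_no_text, m_no_modal = rng.normal(size=(3, 8))
# Sanity check: with both scales at zero, guidance leaves the prediction unchanged.
print(np.allclose(bimodal_cfg(m_full, m_no_text, m_no_modal, 0.0, 0.0), m_full))
```

The cost of this scheme is three forward passes per denoising step instead of one, traded for independent control of prompt fidelity and synchronization.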

Multi-Scale, Multi-Tile Inference

1. Base Generation

Inference begins at a lower resolution, generating a base latent representation at approximately 0.5 megapixels. This captures the overall structure, motion, and audio content at manageable computational cost.

2. Latent Upscaling

A dedicated latent upscaler increases the spatial resolution of video latents while preserving temporal consistency and audio alignment.

3. Tiled Refinement

Upscaled latents are partitioned into overlapping spatial and temporal tiles. Each tile is refined independently, achieving 1080p fidelity in the final output.

Why use tiles? Generating a full 1080p video in one pass would require enormous GPU memory. Instead, LTX-2 first generates a low-resolution "sketch," upscales it, then refines overlapping patches (tiles) separately. The overlap ensures smooth transitions between tiles. This is similar to how image editors process large photos in patches: practical engineering that enables high-resolution output on available hardware.
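The overlap-and-blend idea can be illustrated in one dimension. This sketch uses a Hann-window cross-fade and an identity "refiner"; the actual tile sizes, weighting, and refinement in LTX-2 are not specified here, so everything below is illustrative.

```python
import numpy as np

def refine(tile):
    """Stand-in for running the diffusion refiner on one tile (identity here)."""
    return tile

def tiled_refine_1d(x, tile=64, overlap=16):
    """Process a long latent sequence in overlapping tiles, blending each
    refined tile back with tapered weights so tile seams stay smooth."""
    out = np.zeros_like(x, dtype=float)
    acc = np.zeros(len(x))
    step = tile - overlap
    for start in range(0, len(x) - overlap, step):
        end = min(start + tile, len(x))
        w = np.hanning(end - start) + 1e-3   # taper toward the tile edges
        out[start:end] += refine(x[start:end]) * w
        acc[start:end] += w
    return out / acc                         # normalize by accumulated weight

x = np.random.default_rng(0).normal(size=256)
print(np.allclose(tiled_refine_1d(x), x))  # identity refiner reconstructs exactly
```

With a real refiner the tapered weights matter: each sample in an overlap region is a weighted average of two independently refined tiles, which hides seams.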

Training Data & Captioning

Training Data

LTX-2 uses a subset of the LTX-Video dataset, filtered for video clips containing significant and informative audio. The focus is on videos where audio is semantically meaningful: not just background noise, but speech, environmental sounds, and musical elements.

Captioning System

A new video captioning system was developed to describe both visual and audio content. Captions are comprehensive yet factual, describing only what is seen and heard without emotional interpretation.

Experiments & Results

LTX-2 is evaluated across three key dimensions: audiovisual quality via human preference studies, visual-only performance via standard benchmarks, and computational efficiency.

Audiovisual Evaluation

Human preference studies show that LTX-2 significantly outperforms open-source alternatives such as Ovi. Furthermore, LTX-2 achieves human preference parity with proprietary models at a fraction of their computational cost and inference time.

Video-Only Benchmarks

Despite being a multimodal model, LTX-2's visual stream maintains top-tier performance on standard video generation tasks. On the Artificial Analysis public leaderboard, LTX-2 achieves competitive results, demonstrating that adding audio does not degrade video quality.

Inference Performance & Scalability

The primary advantage of the LTX-2 architecture is its extreme efficiency. Compared against Wan 2.2-14B (video-only, 14B parameters) on an H100 GPU:

Table 1: Inference Speed (time per diffusion step on an H100 GPU)

Model          Modality         Params   Sec/Step
Wan 2.2-14B    Video Only       14B      22.30
LTX-2          Audio + Video    19B       1.22

LTX-2 is ~18x faster than Wan 2.2 per diffusion step.

Despite having more parameters (19B vs 14B) and generating both audio and video simultaneously, LTX-2 is approximately 18x faster per diffusion step. This speed advantage comes from the optimized latent space mechanism.

The 18x speed advantage seems counterintuitive: LTX-2 has more parameters (19B vs 14B) and generates both audio and video, yet it's 18 times faster. The secret is the optimized latent space: by encoding video and audio into very compact latent representations before processing, the actual computation happens on much smaller tensors. This is like compressing a file before sending it; the compression step adds work, but the savings on the main task far outweigh it.
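To see why compact latents matter so much, note that self-attention cost grows roughly quadratically with token count. The numbers below are purely illustrative assumptions, not figures from the paper; they only show how a modest reduction in sequence length compounds quadratically.

```python
# Hypothetical token counts (assumed, not reported anywhere in this article).
tokens_dense = 30_000    # video tokens under a weakly compressive latent space
tokens_compact = 7_000   # video tokens under a highly compressive VAE

# Self-attention FLOPs scale roughly with tokens^2, so the cost ratio is:
cost_ratio = (tokens_dense / tokens_compact) ** 2
print(round(cost_ratio, 1))
```

Under these toy numbers, a ~4x shorter sequence yields a ~18x cheaper attention pass, which is the flavor of saving a compact latent space buys.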

LTX-2 can generate up to 20 seconds of continuous video with synchronized stereo audio, exceeding the temporal limits of most current T2V models.

Human preference studies are considered the gold standard for evaluating generative models because automated metrics often don't capture perceptual quality well. In these studies, human evaluators compare outputs from different models side-by-side and choose which one they prefer. The fact that LTX-2 achieves parity with proprietary models (like Veo 3 and Sora 2) while being open-source and much faster is a significant achievement.

Limitations and Social Impact

Opportunities

Text-to-audio+video generation enables content creators, educators, and accessibility tools. Models like LTX-2 can democratize audiovisual content creation, making professional-quality media generation accessible to individuals and small teams.

Challenges

Realistic synthetic media carries potential for misuse, including deepfakes and disinformation. Responsible deployment requires safeguards such as watermarking, content provenance tracking, and clear disclosure of AI-generated content.

Conclusion

LTX-2 extends LTX-Video into a joint audiovisual foundation model through four key innovations: an asymmetric dual-stream transformer architecture, deep text conditioning with thinking tokens and multi-layer feature extraction from Gemma3 12B, a compact causal audio VAE with an efficient 1D latent space, and modality-aware classifier-free guidance for fine-grained audiovisual control.

Experiments show that LTX-2 sets a new benchmark for open-source T2AV generation, achieving state-of-the-art audiovisual quality while being the fastest model in its class.

All model weights and code are publicly released to advance research and democratize audiovisual content creation.

Supplementary Figures
Training and Inference Pipelines
Figure A1: LTX-2 training and inference pipelines. (a) Training: audio and video inputs are encoded into latents and the model learns to denoise. (b) Inference: the model denoises from noise, then decodes to audio (via VAE decoder + vocoder) and video (via VAE decoder).
Single-Stream Architecture Detail
Figure A2: Detailed view of a single stream. The audio and video streams are identical in architecture, featuring RMS Norm, Self Attention with RoPE, Text Cross Attention, AV Cross Attention, and FFN with AdaLN timestep conditioning.
References
  1. Benita et al. CAFA: A Controllable Automatic Foley Artist. arXiv:2504.06778, 2025.
  2. Cheng et al. MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis. ICML, 2025.
  3. Dar et al. Analyzing Transformers in Embedding Space. ACL, 2023.
  4. Darcet et al. Vision Transformers Need Registers. arXiv:2309.16588, 2023.
  5. Esser et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML, 2024.
  6. Gao et al. Kling 1.6: A Universal Media Generation System. arXiv:2507.10898, 2025.
  7. Gao et al. WAN-S2V: Audio-Driven Cinematic Video Generation. arXiv:2506.06033, 2025.
  8. Google DeepMind. Veo 3: A Diffusion-Based Audio+Video Generation System. 2025.
  9. Guan et al. Taming Text-to-Sounding Video Generation. 2025.
  10. Gutflaish et al. Generating an Image from 1,000 Representations. arXiv:2502.14148, 2025.
  11. HaCohen et al. LTX-Video: Realtime Video Latent Diffusion. arXiv:2501.00103, 2024.
  12. Ho and Salimans. Classifier-Free Diffusion Guidance. arXiv:2207.12598, 2022.
  13. Kong et al. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. NeurIPS, 2020.
  14. Kong et al. HunyuanVideo: A Systematic Framework for Large Video Generative Models. arXiv:2412.03603, 2024.
  15. Lipman et al. Flow Matching for Generative Modeling. arXiv:2210.02747, 2022.
  16. Liu et al. Playground v3: Improving Text-to-Image Alignment with Deep Text Understanding. 2024.
  17. Liu et al. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. ICML, 2023.
  18. Liu et al. AudioLDM 2: Learning Holistic Audio Generation. IEEE/ACM TASLP, 2024.
  19. Luo et al. Diff-Foley: Synchronized Video-to-Audio Synthesis. NeurIPS, 2024.
  20. Nichol et al. GLIDE: Towards Photorealistic Image Generation and Editing. ICML, 2022.
  21. OpenAI. Sora 2 is here. 2025.
  22. Pan et al. Transfer Between Modalities with MetaQueries. NeurIPS, 2024.
  23. Peebles and Xie. Scalable Diffusion Models with Transformers. ICCV, 2023.
  24. Character.AI Research. Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation. arXiv:2510.01284, 2025.
  25. Saharia et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS, 2022.
  26. Skean et al. Layer by Layer: Uncovering Hidden Representations in Language Models. arXiv:2502.04975, 2025.
  27. Gemma Team. Gemma 3 Technical Report. arXiv:2503.19786, 2025.
  28. Team Wan et al. WAN: Open and Advanced Large-Scale Video Generative Models. arXiv:2503.20314, 2025.
  29. Wang et al. A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation. CVPR, 2025.
  30. Wen et al. Efficient Vision-Language Models by Summarizing Visual Tokens. arXiv:2410.14072, 2024.
  31. Xie et al. SANA: Efficient High-Resolution Image Synthesis. ICML, 2025.
  32. Zhang et al. FoleyCrafter: Bring Silent Videos to Life. arXiv:2407.01494, 2024.

View Services Contact Us