Lightricks
The first open-source foundation model for joint text-to-audio+video generation
A unified model that generates synchronized video and rich audio tracks including speech, foley, and ambient sounds, achieving state-of-the-art quality at extreme computational efficiency.
The core problem: Current text-to-video models produce visually stunning but silent output. Text-to-audio models remain specialized for individual domains (speech, music, or foley). Attempts at audiovisual generation rely on decoupled sequential pipelines that fail to model the full joint distribution, missing critical dependencies like lip-synchronization and environmental acoustics.
Recent text-to-video (T2V) diffusion models such as LTX-Video, WAN 2.1, and Hunyuan Video have achieved remarkable progress in generating visually realistic, motion-consistent videos from text prompts. However, these models remain fundamentally silent: they omit the semantic, emotional, and environmental information conveyed by synchronized sound.
In parallel, text-to-audio generation has evolved from task-specific systems toward more general-purpose representations, yet most models remain specialized for specific domains rather than offering a unified approach to audio generation.
Achieving a coherent audiovisual experience requires a unified model that jointly captures the generative dependencies between vision and sound. While proprietary systems such as Veo 3 and Sora 2 have explored this direction, they remain closed-source. LTX-2 is the first open-source model to address this challenge with a unified architecture.
A transformer-based backbone featuring a 14B-parameter video stream and a 5B-parameter audio stream, linked via bidirectional cross-attention with temporal RoPE. This asymmetric design efficiently allocates compute to match each modality's complexity.
A refined text-conditioning module using Gemma3 12B with multi-layer feature extraction and learned "thinking tokens" for enhanced prompt understanding and phonetic accuracy in generated speech.
An efficient causal audio VAE that produces a high-fidelity 1D latent space optimized for diffusion-based training, enabling generation of up to 20 seconds of continuous stereo audio.
A novel bimodal CFG scheme with independent text and cross-modal guidance scales, significantly improving audiovisual alignment and providing fine-grained controllability over synchronization.
Rather than forcing video and audio into a shared latent space, LTX-2 uses separate modality-specific VAEs. Video employs a spatiotemporal causal VAE, while audio uses a mel-spectrogram-based causal VAE with a 1D latent space. This allows each encoder to be independently optimized.
Video and audio have fundamentally different information densities. The 14B-parameter video stream handles complex spatial and temporal visual content, while the 5B-parameter audio stream processes lower-dimensional audio latents. Both share the same architectural blueprint but differ in width and depth.
Bidirectional cross-attention layers throughout the model enable tight temporal alignment. By utilizing 1D temporal RoPE during cross-modal interactions, the model captures dependencies like lip-synchronization and environmental acoustics without degrading unimodal generation quality.
The core of LTX-2 is an asymmetric dual-stream Diffusion Transformer. The backbone comprises a high-capacity 14B-parameter video stream and a 5B-parameter audio stream. Both streams share the same architectural blueprint: each block consists of Self Attention, Text Cross-Attention, Audio-Visual Cross-Attention, and a Feed-Forward Network (FFN). RMS normalization layers are interleaved between operations to stabilize activations.
The model uses Rotary Positional Embeddings (RoPE) to encode structure. In the video stream, 3D RoPE injects positional information across spatial dimensions (x, y) and time (t). In the audio stream, 1D RoPE encodes only the temporal dimension. During cross-modal attention, only the temporal component of RoPE is used, enforcing that cross-modal attention focuses on temporal synchronization rather than spatial correspondence.
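A small NumPy sketch can illustrate the temporal-only RoPE used in cross-modal attention. The implementation below is illustrative, not the released code: it rotates channel pairs of query and key vectors by position-dependent angles and shows that, when both streams are rotated using only their temporal indices, attention scores depend on relative time, so tokens at matching timesteps attend most strongly to each other.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotary angles for integer positions over `dim` channels (dim must be even)."""
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, freqs)  # shape: (num_positions, dim // 2)

def apply_rope(x, angles):
    """Rotate channel pairs of x (..., dim) by the given angles (..., dim // 2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Video tokens carry (x, y, t) positions; audio tokens carry only t.
# For cross-modal attention, both sides are rotated with the *temporal*
# angles alone, so scores depend on relative time, not spatial layout.
dim = 8
video_t = np.array([0, 1, 2, 3])   # temporal indices of 4 video tokens
audio_t = np.array([0, 1, 2, 3])   # temporal indices of 4 audio tokens
q = np.ones((4, dim))
k = np.ones((4, dim))
q_rot = apply_rope(q, rope_angles(video_t, dim))
k_rot = apply_rope(k, rope_angles(audio_t, dim))
scores = q_rot @ k_rot.T
# Diagonal entries (same timestep) are the largest: the relative-position
# property of RoPE encourages temporal synchronization across streams.
```

Because RoPE makes dot products a function of the position difference, dropping the spatial components during cross-attention removes any bias toward spatial correspondence while keeping the temporal alignment signal intact.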
The key insight is about which information matters where: full spatial positions are needed for attention within the video stream, but aligning audio with video only requires knowing when things happen, not where.
At each layer, the audio-visual cross-attention module enables bidirectional information flow between streams. The visualizations demonstrate that the model correctly associates sound events with their visual sources: car engine sounds focus on the vehicle, speech waveforms align with lip movements, and applause timing matches hand clapping.
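The block structure described above can be sketched as a minimal forward pass. This is a simplified toy (single-head attention, no learned projection weights, identity FFN) meant only to show the ordering of operations, the pre-op RMS normalization, and the bidirectional audio-visual exchange; the real streams differ in width and depth.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMS normalization over the channel dimension."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attention(q_tokens, kv_tokens):
    """Plain single-head scaled dot-product attention (learned projections omitted)."""
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv_tokens

def dual_stream_block(video, audio, text):
    """One LTX-2-style block per stream: self-attention, text cross-attention,
    then bidirectional audio-visual cross-attention, each with a residual
    connection and RMS norm before the op. The FFN is elided for brevity."""
    ops = (
        lambda s, other: attention(rms_norm(s), rms_norm(s)),      # self-attention
        lambda s, other: attention(rms_norm(s), rms_norm(text)),   # text cross-attention
        lambda s, other: attention(rms_norm(s), rms_norm(other)),  # AV cross-attention
    )
    for op in ops:
        v_new = video + op(video, audio)
        a_new = audio + op(audio, video)
        video, audio = v_new, a_new   # both streams update simultaneously
    return video, audio

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))   # 4 video tokens
audio = rng.normal(size=(3, 8))   # 3 audio tokens (different sequence length)
text = rng.normal(size=(2, 8))    # 2 text-conditioning tokens
v_out, a_out = dual_stream_block(video, audio, text)
```

Note that both streams read from each other's pre-update state, which is what makes the cross-attention bidirectional rather than sequential.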
Rather than relying on the final causal layer of the LLM, LTX-2 extracts features across all decoder layers. Intermediate representations capture a broader spectrum of linguistic features, from low-level phonetics in early layers to high-level semantics in later layers. The per-layer features are then combined and mapped into the conditioning space through a learned projection:
The projection matrix W was jointly optimized with the LTX-2 model during a brief initial training stage using the standard diffusion MSE loss. This yielded an improvement in the model's prompt adherence and overall generation quality.
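A minimal sketch of this multi-layer conditioning, assuming the per-layer hidden states are concatenated channel-wise before the learned projection W (the exact combination scheme and shapes here are illustrative):

```python
import numpy as np

def condition_from_all_layers(hidden_states, W):
    """Combine per-layer LLM features into one conditioning sequence.

    hidden_states: list of L arrays, each (seq_len, d_model), one per decoder layer.
    W: learned projection of shape (L * d_model, d_cond), trained jointly with
       the diffusion model under the standard MSE loss.
    """
    stacked = np.concatenate(hidden_states, axis=-1)  # (seq_len, L * d_model)
    return stacked @ W                                 # (seq_len, d_cond)

# Toy usage: 3 decoder layers, 5 tokens, d_model = 4, d_cond = 2.
hidden = [np.ones((5, 4)) for _ in range(3)]
W = np.ones((12, 2))
cond = condition_from_all_layers(hidden, W)
```

Because W sees every layer at once, it can learn to weight phonetic detail from early layers and semantics from late layers per output channel.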
Inspired by register tokens, LTX-2 introduces learned thinking tokens (R per prompt) appended to the text embeddings. These tokens and the original embeddings are processed together through a Text Connector module consisting of two transformer blocks. This enables richer token interactions and contextual mixing before conditioning the diffusion transformer, significantly improving generation quality.
Imagine you're solving a math problem and write down "scratch work" before giving your final answer. Thinking tokens work similarly: they're extra learned slots that give the model space to "think" and mix information before conditioning the generation.
Specifically, R extra tokens (with learned initial values) are appended to the text embedding and processed together through transformer blocks. These tokens don't correspond to any input text; instead, they serve as a computational workspace where the model can combine and refine the text representation. This concept is inspired by register tokens used in vision transformers.
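The mechanism can be sketched in a few lines. The function names and the identity connector below are illustrative, not the released API; in LTX-2 the connector is two transformer blocks and the register values are learned parameters.

```python
import numpy as np

def with_thinking_tokens(text_emb, registers, connector):
    """Append R learned 'thinking' tokens to the text embedding, then run the
    combined sequence through a connector module so the extra tokens can mix
    with (and refine) the text representation before conditioning."""
    seq = np.concatenate([text_emb, registers], axis=0)  # (T + R, d)
    return connector(seq)

# Toy usage: 4 text tokens, R = 2 registers, identity connector.
text_emb = np.zeros((4, 8))
registers = np.ones((2, 8))   # learned initial values in the real model
out = with_thinking_tokens(text_emb, registers, lambda s: s)
```

The registers are the same for every prompt at input time; what varies is how the connector lets them absorb and redistribute prompt-specific information.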
Inspired by the efficient deep latent space from LTX-Video, LTX-2 adopts a compact causal audio VAE. It processes mel-spectrogram inputs and encodes them into 1D latent tokens. This compact representation enables efficient diffusion-based training while maintaining high-fidelity audio reconstruction quality.
The final waveform is reconstructed using a vocoder based on the HiFi-GAN architecture, modified for joint stereo synthesis and upsampling. It directly converts the decoded mel-spectrograms into high-quality stereo waveforms.
During inference, LTX-2 employs a multimodal extension of Classifier-free Guidance (CFG) to enhance cross-modal consistency and synchronization while maintaining high fidelity to the text prompt.
$$M'(x,t,m) = M(x,t,m) + s_t \cdot \bigl(M(x,t,m) - M(x,\varnothing,m)\bigr) + s_m \cdot \bigl(M(x,t,m) - M(x,t,\varnothing)\bigr)$$
where $s_t$ controls textual guidance strength and $s_m$ controls cross-modal guidance strength. Increasing $s_m$ promotes mutual information refinement between modalities: stronger cross-modal guidance leads to tighter lip synchronization and more coherent foley placement.
Classifier-Free Guidance (CFG) is a widely used technique to improve generation quality. The basic idea: during inference, the model makes two predictions โ one conditioned on the text prompt and one unconditional. The difference between them is amplified to push the output closer to what the prompt describes.
LTX-2 extends this with two separate guidance scales: $s_t$, which weights the text-conditional direction, and $s_m$, which weights the cross-modal direction.
This separation means you can independently tune "how well does it match my description?" vs. "how well are audio and video synchronized?", a significant advantage over single-scale CFG.
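The guidance formula is a direct three-term combination, which makes it easy to express in code. A minimal sketch (argument names are illustrative):

```python
import numpy as np

def bimodal_cfg(m_full, m_no_text, m_no_other, s_t, s_m):
    """Bimodal classifier-free guidance:

        M' = M(x,t,m) + s_t * (M(x,t,m) - M(x,0,m)) + s_m * (M(x,t,m) - M(x,t,0))

    m_full:     prediction with both the text prompt and the other modality;
    m_no_text:  prediction with the text condition dropped;
    m_no_other: prediction with the other modality's latents dropped.
    s_t amplifies the text-conditional direction, s_m the cross-modal one.
    """
    return m_full + s_t * (m_full - m_no_text) + s_m * (m_full - m_no_other)

# Toy numeric check with scalar "predictions".
guided = bimodal_cfg(np.array([2.0]), np.array([1.0]), np.array([0.0]),
                     s_t=1.0, s_m=0.5)
```

The cost is one extra forward pass per step compared to standard CFG (three conditioned evaluations instead of two), which the guidance flexibility generally justifies.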
Inference begins at a lower resolution, generating a base latent representation at approximately 0.5 Megapixels. This captures the overall structure, motion, and audio content at manageable computational cost.
A dedicated latent upscaler increases the spatial resolution of video latents while preserving temporal consistency and audio alignment.
Upscaled latents are partitioned into overlapping spatial and temporal tiles. Each tile is refined independently, achieving 1080p fidelity in the final output.
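The tiled refinement stage needs overlapping spans along each spatial and temporal axis. The helper below is a 1D sketch of how such spans might be generated, not the actual implementation; in practice each tile is denoised independently and the overlapping regions are blended to hide seams.

```python
def overlapping_tiles(length, tile, overlap):
    """Yield (start, end) spans covering [0, length) where consecutive spans
    share `overlap` positions, so independently refined tiles can be blended
    at the seams. The final span is right-aligned to cover the tail exactly."""
    if tile >= length:
        yield (0, length)
        return
    step = tile - overlap
    start = 0
    while True:
        end = start + tile
        if end >= length:
            yield (length - tile, length)  # right-align the last tile
            return
        yield (start, end)
        start += step

tiles = list(overlapping_tiles(10, 4, 1))
# Each span is 4 wide, consecutive spans overlap, and together they cover 0..10.
```

Applying the same generator along height, width, and time produces the overlapping spatiotemporal tiles used for the final 1080p refinement pass.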
LTX-2 uses a subset of the LTX-Video dataset, filtered for video clips containing significant and informative audio. The focus is on videos where audio is semantically meaningful โ not just background noise, but speech, environmental sounds, and musical elements.
A new video captioning system was developed to describe both visual and audio content. Captions are comprehensive yet factual, describing only what is seen and heard without emotional interpretation.
LTX-2 is evaluated across three key dimensions: audiovisual quality via human preference studies, visual-only performance via standard benchmarks, and computational efficiency.
Human preference studies show that LTX-2 significantly outperforms open-source alternatives such as Ovi. Furthermore, LTX-2 achieves human preference parity with proprietary models at a fraction of their computational cost and inference time.
Despite being a multimodal model, LTX-2's visual stream maintains top-tier performance on standard video generation tasks. In the Artificial Analysis public leaderboard, LTX-2 achieves competitive results, demonstrating that adding audio does not degrade video quality.
The primary advantage of the LTX-2 architecture is its extreme efficiency. Compared against Wan 2.2-14B (video-only, 14B parameters) on an H100 GPU:
| Model | Modality | Params | Sec/Step |
|---|---|---|---|
| Wan 2.2-14B | Video Only | 14B | 22.30s |
| LTX-2 | Audio + Video | 19B | 1.22s |
Despite having more parameters (19B vs 14B) and generating both audio and video simultaneously, LTX-2 is approximately 18x faster per diffusion step. This speed advantage comes from its highly compressed latent space, which greatly reduces the number of tokens processed at each step.
LTX-2 can generate up to 20 seconds of continuous video with synchronized stereo audio, exceeding the temporal limits of most current T2V models.
LTX-2 extends LTX-Video into a joint audiovisual foundation model through four key innovations: an asymmetric dual-stream transformer architecture, deep text conditioning with thinking tokens and multi-layer feature extraction from Gemma3 12B, a compact causal audio VAE with an efficient 1D latent space, and modality-aware classifier-free guidance for fine-grained audiovisual control.
Experiments show that LTX-2 sets a new benchmark for open-source T2AV generation, achieving state-of-the-art audiovisual quality while being the fastest model in its class.
All model weights and code are publicly released to advance research and democratize audiovisual content creation.
Social Impact
Opportunities
Text-to-audio+video generation enables content creators, educators, and accessibility tools. Models like LTX-2 can democratize audiovisual content creation, making professional-quality media generation accessible to individuals and small teams.
Challenges
Realistic synthetic media carries potential for misuse, including deepfakes and disinformation. Responsible deployment requires safeguards such as watermarking, content provenance tracking, and clear disclosure of AI-generated content.