LG's First Open-Weight Vision-Language Model for Industrial Intelligence
EXAONE 4.5 is the first open-weight vision-language model from LG AI Research, built by integrating a custom 1.2B-parameter visual encoder into the powerful EXAONE 4.0 language model. Trained on document-centric corpora with a focus on industrial applications, it achieves state-of-the-art results in document understanding and Korean language reasoning while supporting six languages and a massive 256K token context window.
The EXAONE foundation models have been continuously engineered to address complex and demanding challenges in real-world industrial environments. Earlier releases established strong capabilities in domain-specific understanding across fields such as finance, law, biomedical research, and chemical process engineering. Building upon EXAONE 4.0's hybrid reasoning architecture, EXAONE 4.5 takes the next major step: adding native visual perception.
EXAONE 4.5 integrates a custom-built 1.2B-parameter vision encoder into the EXAONE 4.0 32B base model. This allows the system to process text, images, and documents together, enabling a new class of industrial applications where understanding complex documents, charts, and diagrams is critical.
The model supports six languages (Korean, English, Spanish, German, Japanese, and Vietnamese) and extends context length up to 256K tokens. It is designed for practical deployment in enterprise settings, where long-context document reasoning across multiple languages and modalities is a key requirement.
EXAONE 4.5 addresses a core challenge in vision-language models: efficiently processing large numbers of visual tokens alongside text. To maintain computational efficiency, the team adopted hybrid attention mechanisms and Grouped Query Attention (GQA) applied within the vision encoder itself, an innovative choice that substantially reduces the KV-cache memory footprint during inference. The visual encoder was trained from scratch as a 1.2B-parameter model, since existing encoders did not meet LG's requirements for scalability and efficiency. It uses 2D Rotary Position Embedding (2D RoPE) to handle images at their native resolution without resizing, preserving spatial relationships critical for document understanding.
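To make the 2D RoPE idea concrete, here is a minimal sketch of one common formulation: half of each vector's channels are rotated by the patch's row index and the other half by its column index. This is an illustration only. The function names, the base of 10000, and the half/half channel split are assumptions, not details confirmed for EXAONE 4.5's encoder.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_2d(vec, row, col):
    """Assumed split: first half of the channels encodes the row, second half the column."""
    half = len(vec) // 2
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (hypothetical values, head dimension 8)
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.75, 1.0]
k = [1.0, 0.5, -0.25, 2.0, -1.0, 0.5, 1.25, -0.75]

# Same relative offset (+1 row, +2 columns) at two different absolute positions
score_near_origin = dot(rope_2d(q, 3, 5), rope_2d(k, 2, 3))
score_far_away = dot(rope_2d(q, 10, 20), rope_2d(k, 9, 18))
print(score_near_origin, score_far_away)  # the two scores agree up to rounding
```

The attention score depends only on the relative row and column offset between two patches, not on their absolute positions. That property is what lets an encoder of this kind accept grids of arbitrary height and width instead of a fixed, resized input.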
Standard Transformer attention requires storing a separate Key-Value (KV) cache for every attention head, which becomes enormous for large models. Grouped Query Attention (GQA) lets a group of query heads share one key/value head, so the cache shrinks by the ratio of query heads to KV heads, with minimal accuracy loss.
Why does this matter for EXAONE 4.5? Applying GQA to the vision encoder is unusual. Images generate thousands of visual tokens, so keeping the KV cache small is critical for running the model efficiently on real hardware.
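The memory saving can be checked with simple arithmetic. The sketch below compares the cache size when every query head has its own key/value head against a grouped layout. All configuration numbers (layers, heads, head dimension, token count) are hypothetical and chosen only for illustration. They are not EXAONE 4.5's actual settings.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Size of the cached keys and values (factor 2) for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the shared key/value head that a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

# Hypothetical configuration, 16-bit values
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM, TOKENS = 32, 16, 4, 64, 4096

mha_bytes = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, TOKENS)   # one KV head per query head
gqa_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS)  # 4 query heads per KV head

print(mha_bytes // 2**20, "MiB vs", gqa_bytes // 2**20, "MiB")  # 512 MiB vs 128 MiB
print([kv_head_for(h, Q_HEADS, KV_HEADS) for h in range(Q_HEADS)])
```

With 16 query heads sharing 4 KV heads, the cache is exactly 4 times smaller, and the saving scales linearly with the number of tokens.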
Multimodal pre-training proceeds in two stages. Stage 1 focuses on large-scale visual-linguistic alignment, training on 420B image tokens and 400B text tokens at 8K sequence length. Stage 2 refines with higher-quality data at a reduced scale (225B image tokens, 110B text tokens), using 6.43×10²² FLOPs compared to Stage 1's 1.57×10²³. The pre-training data spans image captions (Korean-English bilingual), interleaved image-text documents, OCR/document corpora (critical for LG's document-centric focus), and video data for temporal understanding.
FLOPs (Floating Point Operations) measure the total computation used during training. Stage 1 required 1.57×10²³ FLOPs. As a rough sense of scale, a single device sustaining 10¹² FLOP/s would need about 5,000 years to perform that many operations, and even one sustaining 10¹⁵ FLOP/s (an illustrative round figure for a modern accelerator) would need about 5 years. This is why LLM training requires hundreds of GPUs running in parallel for weeks.
Stage 2's reduced scale (6.43×10²² FLOPs, about 41% of Stage 1's budget) reflects that refining a good base model requires less compute than building it from scratch.
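The scale of these numbers is easier to judge with a short calculation. The sketch below uses the FLOP and token counts quoted above. The sustained throughput values and the cluster size are assumptions picked for illustration, not figures from the report.

```python
STAGE1_FLOPS = 1.57e23
STAGE2_FLOPS = 6.43e22
STAGE1_TOKENS = 420e9 + 400e9  # image + text tokens
STAGE2_TOKENS = 225e9 + 110e9

SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

def single_device_years(total_flops, sustained_flops_per_s):
    """Years one device would need at a given sustained throughput."""
    return total_flops / sustained_flops_per_s / SECONDS_PER_YEAR

def cluster_days(total_flops, sustained_flops_per_s, num_devices):
    """Days for a cluster, assuming perfect scaling (an optimistic simplification)."""
    return total_flops / (sustained_flops_per_s * num_devices) / SECONDS_PER_DAY

# Assumed throughputs: 1e12 FLOP/s and 1e15 FLOP/s per device
years_at_1_tflops = single_device_years(STAGE1_FLOPS, 1e12)
years_at_1_pflops = single_device_years(STAGE1_FLOPS, 1e15)
days_on_512_devices = cluster_days(STAGE1_FLOPS, 4e14, 512)  # hypothetical cluster

flops_ratio = STAGE2_FLOPS / STAGE1_FLOPS
token_ratio = STAGE2_TOKENS / STAGE1_TOKENS

print(f"{years_at_1_tflops:.0f} years at 1e12 FLOP/s, {years_at_1_pflops:.1f} years at 1e15 FLOP/s")
print(f"{days_on_512_devices:.1f} days on 512 devices at 4e14 FLOP/s each")
print(f"Stage 2 / Stage 1: {flops_ratio:.2f} of the FLOPs, {token_ratio:.2f} of the tokens")
```

The single-device estimate changes by three orders of magnitude depending on the assumed throughput, so it is best read as an order-of-magnitude illustration. The FLOP ratio and the token ratio between the two stages both come out near 0.41, which suggests compute scaled roughly in proportion to the tokens processed.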
EXAONE 4.5 extends context to 256K tokens by integrating length extension directly into the supervised fine-tuning stage rather than as a separate pre-training phase. Starting from a base model already capable of 128K tokens provides stability and fast convergence. Context Parallelism distributes the 256K-length sequences across multiple GPUs, keeping memory requirements manageable. This is especially important for industrial use cases involving long legal documents, technical manuals, or multi-page financial reports.
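As a rough illustration of what Context Parallelism does to the input, the sketch below splits one long sequence into contiguous token ranges, one per GPU. The helper name, the 8-GPU group size, and the reading of 256K as 256 × 1024 tokens are assumptions. Real implementations also exchange keys and values between ranks so that attention still covers the full sequence, and may use non-contiguous layouts to balance causal-attention work. Neither is shown here.

```python
def shard_sequence(seq_len, num_ranks):
    """Return one contiguous (start, end) token range per rank, end exclusive."""
    base, extra = divmod(seq_len, num_ranks)
    shards, start = [], 0
    for rank in range(num_ranks):
        size = base + (1 if rank < extra else 0)  # spread any remainder over the first ranks
        shards.append((start, start + size))
        start += size
    return shards

SEQ_LEN = 256 * 1024  # assumed meaning of "256K": 262,144 tokens
NUM_RANKS = 8         # hypothetical context-parallel group size

shards = shard_sequence(SEQ_LEN, NUM_RANKS)
for rank, (start, end) in enumerate(shards):
    print(f"rank {rank}: tokens [{start}, {end}) -> {end - start} tokens")
```

Each rank holds activations for only 32,768 tokens instead of 262,144, which is what keeps per-GPU memory manageable at this sequence length.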
Many earlier language models cap at 4K to 32K tokens. A 256K context window means the model can process roughly 200,000 English words in one pass, on the order of 3 to 4 full-length novels, or a lengthy legal contract with all its appendices.
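The word estimate follows from a rule of thumb rather than a fixed conversion. The sketch below assumes about 0.75 English words per token and treats novel length as a parameter. Both values are assumptions, and the real ratio depends on the tokenizer and the language.

```python
CONTEXT_TOKENS = 256 * 1024  # assumed meaning of "256K"
WORDS_PER_TOKEN = 0.75       # common rule of thumb for English text, not a measured value

def approx_words(num_tokens, words_per_token=WORDS_PER_TOKEN):
    """Rough word count that fits in a context window of `num_tokens`."""
    return num_tokens * words_per_token

def approx_novels(num_tokens, words_per_novel):
    """How many novels of an assumed length fit in the window."""
    return approx_words(num_tokens) / words_per_novel

words = approx_words(CONTEXT_TOKENS)
print(f"about {words:,.0f} words")
for novel_length in (60_000, 80_000):  # hypothetical novel lengths in words
    print(f"{approx_novels(CONTEXT_TOKENS, novel_length):.1f} novels of {novel_length:,} words")
```

The count of novels shifts noticeably with the assumed novel length, so the comparison is best treated as an order-of-magnitude guide.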
For enterprise document AI, this is transformative: financial analysts can feed entire quarterly reports (with footnotes), legal teams can process complete contracts, and engineers can reason over multi-chapter technical manuals, all without chunking or losing context.
Rather than relying solely on public datasets, the team constructed a high-quality SFT dataset spanning multiple domains and modalities. This includes domain-specific instruction data for finance, law, science, and Korean-language tasks, as well as multimodal instruction data for document Q&A, chart understanding, and OCR tasks.
Offline preference optimization is applied in a multi-stage framework using Direct Preference Optimization (DPO). Each phase targets a specific capability: instruction following, document understanding, and multilingual alignment. The DPO loss encourages the model to prefer higher-quality responses over lower-quality alternatives from a reference model.
The DPO formula in the paper looks complex, but the core idea is simple: teach the model to prefer better answers by comparing pairs of responses.
For each training example, the model sees a prompt x, a good answer y⁺ (preferred by human raters), and a bad answer y⁻ (rejected). The model learns to increase the probability of generating y⁺ and decrease that of y⁻, relative to a reference model, without needing a separate reward model as in traditional RLHF. This makes training more stable and efficient.
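The pairwise comparison can be written in a few lines. The sketch below implements the standard DPO loss for a single preference pair: the negative log-sigmoid of beta times the difference between the policy's and the reference model's log-probability margins. It is a minimal illustration. The paper's exact variant, its hyperparameters, and the beta value of 0.1 used here are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (prompt, chosen, rejected) triple.

    Inputs are summed log-probabilities of each full response under the
    policy being trained (logp_*) and under the frozen reference (ref_logp_*).
    """
    chosen_margin = logp_chosen - ref_logp_chosen        # how much the policy upweights y+
    rejected_margin = logp_rejected - ref_logp_rejected  # how much the policy upweights y-
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) computed in a numerically stable way as softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical log-probabilities; the reference model scores both answers at -10
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy identical to the reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # policy now favors the preferred answer
regressed = dpo_loss(-12.0, -8.0, -10.0, -10.0) # policy now favors the rejected answer

print(neutral, improved, regressed)
```

The loss equals log 2 when the policy matches the reference, falls as the policy shifts probability toward the preferred answer, and rises when it shifts toward the rejected one. No reward model appears anywhere in the computation.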
Joint multimodal reinforcement learning is applied across both text and vision modalities. Text data covers mathematical reasoning, coding, and science problems. Vision data focuses on diagram understanding, chart Q&A, and document interpretation. RL helps the model develop robust reasoning that generalizes across input types.
EXAONE 4.5 is evaluated on a comprehensive set of vision and language benchmarks, comparing against leading models including GPT-4.5 mini, Qwen3.5-VL-32B, and Qwen3.5-72B. Benchmarks span four vision categories (STEM/Puzzle, Document Understanding, General, Korean) and four language categories (Reasoning, Long Context, Multilinguality, Korean).
On vision benchmarks, EXAONE 4.5 delivers competitive and well-balanced performance. Its standout results come in Document Understanding, where it outperforms models of similar scale on chart and document-centric tasks such as ChartQAPRO and CharXiv. The model's specialized document training pipeline and native-resolution visual encoder give it a clear edge on real-world document AI tasks.
On language benchmarks, EXAONE 4.5's greatest strengths are in reasoning, long-context understanding, and Korean. It achieves 91.0 on IF-Eval (instruction following), 95.0 on CCSum (68K-context long document summarization), and 92.1 on KMMLU (Korean language understanding). These results reflect the model's deep specialization in document-centric, enterprise-grade tasks where EXAONE was designed to excel.
This Korean performance is not an accident: LG AI Research specifically built Korean excellence into the training pipeline. Korean is particularly challenging for AI because it is agglutinative (words formed by stacking morphemes), has complex honorific systems, and has limited public training data compared to English.
EXAONE's 92.1 KMMLU score (vs. competitors typically in the 70s to 80s) comes from LG's intentional inclusion of large-scale Korean industrial documents, Korean OCR corpora, and Korean-English bilingual data throughout every training stage.
Like all large multimodal models, EXAONE 4.5 has limitations and may occasionally generate inaccurate or inappropriate responses. The model's strong Korean-language specialization means its multilingual capabilities, while supporting six languages, may not be uniformly strong across all non-Korean languages.
The document understanding capabilities are particularly strong for Korean and English documents, reflecting the training data distribution. The model's industrial focus may limit its performance on highly creative or open-ended tasks compared to models optimized for such use cases. Users should evaluate EXAONE 4.5 carefully for their specific domain before production deployment.
EXAONE 4.5 is released as an open-weight model under the EXAONE AI Model License Agreement 1.2 - NC (Non-Commercial). The model weights are hosted on Hugging Face, and the reference code is available on GitHub. The model is designed for practical industrial deployment with inference support for long-context (256K token) workloads. For commercial licensing, please refer to the official license documentation in the paper's appendix.
EXAONE 4.5 represents a significant advancement in the EXAONE model series, marking LG AI Research's first open-weight vision-language model. By integrating a custom-built 1.2B-parameter vision encoder with EXAONE 4.0's powerful language decoder, the model achieves a strong balance between visual understanding and text reasoning, particularly for industrial document-centric applications.
Key technical contributions include: GQA-based vision encoder design for efficiency, 2D RoPE for native resolution image handling, the MTP (Multi-Token Prediction) module for improved generation, and context parallelism enabling 256K token windows. Evaluations validate these choices, with EXAONE 4.5 establishing new standards in Korean language reasoning and competitive document understanding performance at the 33B scale.
First open-weight VLM from LG, combining a 1.2B visual encoder with EXAONE 4.0's 32B language decoder for native multimodal reasoning.
Document-centric training pipeline enables strong performance on real-world document AI tasks: chart Q&A, OCR, long-form document summarization.
92.1 KMMLU score establishes EXAONE 4.5 as a leading model for Korean-language industrial applications, a key differentiator for LG's enterprise customers.