EXAONE 4.5 Technical Report

Introduction

The EXAONE foundation models have been continuously engineered to address complex and demanding challenges in real-world industrial environments. Earlier releases established strong capabilities in domain-specific understanding across fields such as finance, law, biomedical research, and chemical process engineering. Building upon EXAONE 4.0's hybrid reasoning architecture, EXAONE 4.5 takes the next major step: adding native visual perception.

EXAONE 4.5 integrates a custom-built 1.2B-parameter vision encoder into the robust EXAONE 4.0 32B base model. This allows the system to seamlessly process text, images, and documents together — enabling a new class of industrial applications where understanding complex documents, charts, and diagrams is critical.

The model supports six languages — Korean, English, Spanish, German, Japanese, and Vietnamese — and extends context length up to 256K tokens. It is designed for practical deployment in enterprise settings, where long-context document reasoning across multiple languages and modalities is a key requirement.

33B parameters · 6 languages · 256K context window

Model Architecture & Training

Model Configuration

EXAONE 4.5 addresses a core challenge in vision-language models: efficiently processing large numbers of visual tokens alongside text. To keep computation manageable, the team adopted hybrid attention mechanisms and applied Grouped Query Attention (GQA) within the vision encoder itself, a choice that reduces the KV-cache memory footprint during inference. The 1.2B-parameter visual encoder was trained from scratch because existing encoders did not meet LG's requirements for scalability and efficiency. It uses 2D Rotary Position Embedding (2D RoPE) to handle images at their native resolution without resizing, preserving the spatial relationships that document understanding depends on.

What is GQA (Grouped Query Attention)?

Standard multi-head attention stores a separate set of keys and values (the KV cache) for every attention head, which becomes enormous for large models and long sequences. Grouped Query Attention (GQA) lets a group of query heads share a single key-value head, so the cache shrinks by a factor equal to the group size with minimal accuracy loss.

Why does this matter for EXAONE 4.5? Applying GQA to the vision encoder is unusual and innovative. Images generate thousands of visual tokens, so keeping KV-cache small is critical for running the model efficiently on real hardware.
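The memory saving described above can be sketched with simple arithmetic. The layer count, head counts, head dimension, and token count below are hypothetical values chosen for illustration; they are not EXAONE 4.5's actual configuration.

```python
def kv_head_for_query(query_head, num_query_heads, num_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Leading factor of 2: one tensor for keys, one for values.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration (assumption, not taken from the report).
NUM_LAYERS = 32
NUM_QUERY_HEADS = 32
NUM_KV_HEADS = 8      # GQA: 4 query heads share each KV head
HEAD_DIM = 128
NUM_TOKENS = 4096     # e.g. visual tokens from one large image

mha_bytes = kv_cache_bytes(NUM_TOKENS, NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM)
gqa_bytes = kv_cache_bytes(NUM_TOKENS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

print(f"multi-head cache: {mha_bytes / 2**20:.0f} MiB")
print(f"grouped-query cache: {gqa_bytes / 2**20:.0f} MiB")
print(f"reduction: {mha_bytes // gqa_bytes}x")
```

With these assumed numbers the cache shrinks by the group size (4x), which is the general pattern: the saving scales with how many query heads share each key-value head.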

**Figure 1:** EXAONE 4.5 architecture. A dedicated visual encoder (1.2B params) processes images at native resolution and feeds visual tokens into the EXAONE 4.0 language decoder (32B params). The system supports multi-image inputs and video.

2D RoPE (2D Rotary Position Embedding) encodes where each visual patch sits in the 2D image grid, unlike standard 1D position encodings that assume a single sequential order. This lets the model handle images at any resolution without distortion, which is essential for reading scanned documents where text layout carries meaning.
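One common way to build a 2D rotary embedding is to split each feature vector in half and rotate one half by the patch's row index and the other by its column index. The sketch below follows that scheme as an assumption; the report's exact formulation may differ, and the function names and sample vectors are illustrative.

```python
import math


def rope_1d(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_2d(vec, row, col):
    """Encode a patch position: first half by row, second half by column."""
    half = len(vec) // 2  # len(vec) must be divisible by 4
    return rope_1d(vec[:half], row) + rope_1d(vec[half:], col)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary query/key vectors for illustration.
q = [0.5, -1.0, 2.0, 0.3, 1.5, -0.7, 0.2, 0.9]
k = [1.0, 0.4, -0.6, 0.8, -1.2, 0.1, 0.7, -0.3]

# Two patch pairs with the same relative offset (1 row, 2 columns).
score_a = dot(rope_2d(q, 2, 3), rope_2d(k, 1, 1))
score_b = dot(rope_2d(q, 7, 9), rope_2d(k, 6, 7))
print(score_a, score_b)
```

The two attention scores match up to floating-point error because the rotation makes the dot product depend only on the relative row and column offset between patches, not on their absolute positions. That property is what allows the same encoder to work on grids of different sizes.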

Pre-training

Multimodal pre-training proceeds in two stages. Stage 1 focuses on large-scale visual-linguistic alignment, training on 420B image tokens and 400B text tokens at 8K sequence length. Stage 2 refines the model with higher-quality data at a reduced scale (225B image tokens, 110B text tokens), consuming 6.43×10²² FLOPs versus 1.57×10²³ for Stage 1. The pre-training data spans image captions (Korean-English bilingual), interleaved image-text documents, OCR/document corpora (critical for LG's document-centric focus), and video data for temporal understanding.

Understanding FLOPs in Training Scale

FLOPs (floating-point operations) measure the total computation used during training. Stage 1 required 1.57×10²³ FLOPs. On a single accelerator sustaining 10¹⁵ FLOP/s, roughly the peak dense BF16 throughput of a current high-end data-center GPU, that is about 1.6×10⁸ seconds, or around 5 years of continuous compute. This is why training at this scale requires hundreds of GPUs running in parallel for weeks.

Stage 2's reduced scale (6.43×10²² FLOPs) reflects that refining a good base model requires far less compute than building it from scratch.
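Both FLOPs figures are consistent with the widely used rule of thumb that training costs about 6 FLOPs per parameter per token. The check below assumes that rule and counts only the 32B language decoder's parameters; this summary does not state how the report actually computed its totals, so treat the match as a plausibility check rather than the authors' method.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute using the 6 * N * D rule of thumb."""
    return 6 * num_params * num_tokens


DECODER_PARAMS = 32e9  # assumption: only the 32B decoder is counted

stage1_tokens = 420e9 + 400e9  # image + text tokens, Stage 1
stage2_tokens = 225e9 + 110e9  # image + text tokens, Stage 2

stage1_flops = training_flops(DECODER_PARAMS, stage1_tokens)
stage2_flops = training_flops(DECODER_PARAMS, stage2_tokens)

print(f"Stage 1: {stage1_flops:.2e} FLOPs")  # 1.57e+23
print(f"Stage 2: {stage2_flops:.2e} FLOPs")  # 6.43e+22
print(f"Stage 2 / Stage 1: {stage2_flops / stage1_flops:.2f}")
```

Under these assumptions Stage 2 uses roughly 41% of Stage 1's compute, in line with its smaller token budget.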

**Table 1:** Two-stage pre-training configuration. Stage 1 establishes broad visual-linguistic alignment; Stage 2 refines with quality data at reduced compute.

Context Length Extension

EXAONE 4.5 extends context to 256K tokens by integrating length extension directly into the supervised fine-tuning stage rather than as a separate pre-training phase. Starting from a base model already capable of 128K tokens provides stability and fast convergence. Context Parallelism distributes the 256K-length sequences across multiple GPUs, keeping memory requirements manageable. This is especially important for industrial use cases involving long legal documents, technical manuals, or multi-page financial reports.
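The core idea of context parallelism, splitting one long sequence across devices so each holds only a slice of the activations, can be sketched as below. The GPU count and the contiguous sharding scheme are assumptions for illustration. Real implementations also exchange keys and values between ranks during attention and often use load-balanced sharding for causal masks; neither is shown here.

```python
def shard_sequence(seq_len, world_size):
    """Return one (start, end) token range per rank, covering [0, seq_len)."""
    base, extra = divmod(seq_len, world_size)
    shards = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional token each.
        length = base + (1 if rank < extra else 0)
        shards.append((start, start + length))
        start += length
    return shards


SEQ_LEN = 256 * 1024   # assumption: "256K" taken as 262,144 tokens
WORLD_SIZE = 8         # hypothetical number of GPUs in the group

shards = shard_sequence(SEQ_LEN, WORLD_SIZE)
for rank, (start, end) in enumerate(shards):
    print(f"rank {rank}: tokens [{start}, {end}) -> {end - start} tokens")
```

With eight ranks, each device holds 32,768 tokens of the sequence, so per-device activation memory scales with the shard length rather than the full 256K context.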

Why is 256K Context Window Significant?

Many earlier language models were limited to 4K–32K tokens. Assuming roughly 0.75 English words per token (the ratio varies by language and tokenizer), a 256K-token window holds on the order of 200,000 words in one pass, comparable to several full-length novels or a lengthy legal contract with all its appendices.

For enterprise document AI, this is transformative: financial analysts can feed entire quarterly reports (with footnotes), legal teams can process complete contracts, and engineers can reason over multi-chapter technical manuals — all without chunking or losing context.

Post-training

Supervised Fine-Tuning (SFT)

Rather than relying solely on public datasets, the team constructed a high-quality SFT dataset spanning multiple domains and modalities. This includes domain-specific instruction data for finance, law, science, and Korean-language tasks, as well as multimodal instruction data for document Q&A, chart understanding, and OCR tasks.

Offline Preference Optimization

Offline preference optimization is applied in a multi-stage framework using Direct Preference Optimization (DPO). Each phase targets a specific capability: instruction following, document understanding, and multilingual alignment. The DPO loss trains the model to assign higher likelihood to the preferred response than to the rejected one, measured relative to a frozen reference model.

Direct Preference Optimization (DPO) Explained

The DPO formula in the paper looks complex, but the core idea is simple: teach the model to prefer better answers by comparing pairs of responses.

For each training example, the model sees a prompt x, a good answer y⁺ (preferred by human raters), and a bad answer y⁻ (rejected). The model learns to increase the probability of generating y⁺ and decrease y⁻, compared to a reference model — without needing a separate reward model like traditional RLHF. This makes training more stable and efficient.
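The comparison described above can be written in a few lines. This is a minimal sketch of the standard DPO loss for a single preference pair; the report's multi-stage variant may add terms or use different hyperparameters, and the log-probabilities and `beta` value below are made up for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Illustrative log-probabilities (not real model outputs).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy identical to reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)  # policy favors the chosen answer more

print(f"baseline loss: {baseline:.4f}")
print(f"improved loss: {improved:.4f}")
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response's likelihood and lowers the rejected one's relative to the reference, which is the behavior the text describes.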

Reinforcement Learning (RL)

Joint multimodal reinforcement learning is applied across both text and vision modalities. Text data covers mathematical reasoning, coding, and science problems. Vision data focuses on diagram understanding, chart Q&A, and document interpretation. RL helps the model develop robust reasoning that generalizes across input types.

Evaluation Results

EXAONE 4.5 is evaluated on a comprehensive set of vision and language benchmarks, comparing against leading models including GPT-4.5 mini, Qwen3.5-VL-32B, and Qwen3.5-72B. Benchmarks span four vision categories (STEM/Puzzle, Document Understanding, General, Korean) and four language categories (Reasoning, Long Context, Multilinguality, Korean).

Vision Benchmarks

On vision benchmarks, EXAONE 4.5 delivers competitive and well-balanced performance. Its standout results come in Document Understanding, where it outperforms models of similar scale on chart and document-centric tasks such as ChartQAPRO and CharXiv. The model's specialized document training pipeline and native-resolution visual encoder give it a clear edge on real-world document AI tasks.

**Table 2:** Vision benchmark results. EXAONE 4.5 33B vs GPT-4.5 mini, Qwen3.5-VL-32B-A22B, and Qwen3.5-72B across STEM, Document Understanding, General, and Korean categories.

ChartQAPRO: 71.7 · KMMLU (Korean): 92.1 · CCSum (long context): 95.0

Language Benchmarks

On language benchmarks, EXAONE 4.5's greatest strengths are in reasoning, long-context understanding, and Korean. It achieves 91.0 on IF-Eval (instruction following), 95.0 on CCSum (68K-context long document summarization), and 92.1 on KMMLU (Korean language understanding). These results reflect the model's deep specialization in document-centric, enterprise-grade tasks where EXAONE was designed to excel.

**Table 3:** Language benchmark results. Reasoning, Long Context Understanding, Multilinguality (WMT25), and Korean (KMMLU, ReleKA) categories.

Why Does EXAONE 4.5 Excel at Korean?

It is not by accident — LG AI Research specifically built Korean excellence into the training pipeline. Korean is particularly challenging for AI because it is agglutinative (words formed by stacking morphemes), has complex honorific systems, and has limited public training data compared to English.

EXAONE's 92.1 KMMLU score (vs. competitors typically in the 70s–80s) comes from LG's intentional inclusion of large-scale Korean industrial documents, Korean OCR corpora, and Korean-English bilingual data throughout every training stage.

Limitations

Like all large multimodal models, EXAONE 4.5 has limitations and may occasionally generate inaccurate or inappropriate responses. The model's strong Korean-language specialization means its multilingual capabilities, while supporting six languages, may not be uniformly strong across all non-Korean languages.

The document understanding capabilities are particularly strong for Korean and English documents, reflecting the training data distribution. The model's industrial focus may limit its performance on highly creative or open-ended tasks compared to models optimized for such use cases. Users should evaluate EXAONE 4.5 carefully for their specific domain before production deployment.

Deployment & Availability

EXAONE 4.5 is released as an open-weight model under the EXAONE AI Model License Agreement 1.2 - NC (Non-Commercial). The model weights are hosted on Hugging Face, and the reference code is available on GitHub. The model is designed for practical industrial deployment, with inference support for long-context (256K-token) workloads. For commercial licensing, refer to the official license documentation in the paper's appendix.


Conclusion

EXAONE 4.5 represents a significant advancement in the EXAONE model series, marking LG AI Research's first open-weight vision-language model. By integrating a custom-built 1.2B-parameter vision encoder with EXAONE 4.0's powerful language decoder, the model achieves a strong balance between visual understanding and text reasoning — particularly for industrial document-centric applications.

Key technical contributions include a GQA-based vision encoder design for efficiency, 2D RoPE for native-resolution image handling, the MTP (Multi-Token Prediction) module for improved generation, and context parallelism enabling 256K-token windows. Evaluations support these choices: EXAONE 4.5 scores 92.1 on KMMLU for Korean and delivers competitive document understanding at the 33B scale.

Vision-Language Integration

First open-weight VLM from LG, combining a 1.2B visual encoder with EXAONE 4.0's 32B language decoder for native multimodal reasoning.

Industrial Document Focus

Document-centric training pipeline enables strong performance on real-world document AI tasks: chart Q&A, OCR, long-form document summarization.

Korean Language Excellence

92.1 KMMLU score establishes EXAONE 4.5 as a leading model for Korean-language industrial applications, a key differentiator for LG's enterprise customers.