
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

A unified framework integrating cutting-edge efficient training methods for flexibly customizing the fine-tuning of 100+ LLMs without coding

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, Yongqiang Ma — Beihang University & Peking University

25K+ GitHub stars · 100+ supported LLMs · 3K+ forks · Apache 2.0 license

Abstract

Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks, yet implementing these methods across different models requires non-trivial effort. LlamaFactory is a unified framework that integrates a suite of cutting-edge efficient training methods. Through its built-in web UI, LlamaBoard, users can flexibly customize the fine-tuning of 100+ LLMs without writing code. The framework has been empirically validated on language modeling and text generation tasks, and has received over 25,000 GitHub stars and 3,000 forks.

Introduction

Large language models (LLMs) demonstrate remarkable reasoning capabilities and power a wide range of applications including question answering, machine translation, and information extraction. With over 5,000 models on Hugging Face's open LLM leaderboard, the ecosystem is growing rapidly. However, adapting these models to specific tasks presents significant challenges.

Key Challenges in LLM Fine-Tuning

  • Resource constraints: Fine-tuning billions of parameters with limited GPU memory is the primary bottleneck for most practitioners
  • Implementation complexity: Each efficient fine-tuning method requires custom implementation for different model architectures
  • Fragmented ecosystem: Existing frameworks cover only subsets of available methods and models, lacking a unified solution

To address these problems, LlamaFactory provides a modular framework with three core modules — Model Loader, Data Worker, and Trainer — that minimize dependencies on specific models and datasets. This allows flexible scaling to hundreds of models and training approaches, including pre-training, supervised fine-tuning (SFT), RLHF, and DPO.

What is Fine-Tuning and Why Does It Matter?

Fine-tuning is the process of taking a pre-trained language model (like GPT or Llama) and further training it on your specific data so it performs better at your particular task.

Think of it like hiring a broadly educated college graduate and then giving them on-the-job training for your specific role. The “pre-training” gave them general knowledge; “fine-tuning” makes them an expert in your domain.

  • SFT (Supervised Fine-Tuning): Training the model on example input-output pairs you provide
  • RLHF (Reinforcement Learning from Human Feedback): Using human preferences to guide the model toward better responses
  • DPO (Direct Preference Optimization): A simpler alternative to RLHF that directly optimizes from preference data
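To make the DPO bullet concrete, here is a minimal sketch of the DPO loss for a single preference pair, in plain Python. The function name and arguments are illustrative, not LlamaFactory's API: each argument is the summed log-probability of a full response under the trainable policy (`logp_*`) or the frozen reference model (`ref_*`).

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # Log-ratios measure how far the policy has moved from the reference.
    chosen_ratio = logp_chosen - ref_chosen
    rejected_ratio = logp_rejected - ref_rejected
    # The loss pushes the chosen response's ratio above the rejected one's.
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# The loss shrinks as the policy prefers the chosen response more strongly.
easy = dpo_loss(-10.0, -30.0, -20.0, -20.0)  # policy already prefers chosen
hard = dpo_loss(-30.0, -10.0, -20.0, -20.0)  # policy prefers rejected
```

Note that no reward model appears anywhere: the preference data alone drives the update, which is exactly why DPO is simpler than RLHF.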

Feature Comparison with Existing Frameworks

LlamaFactory stands out by providing comprehensive support across optimization methods, computation efficiency techniques, and training paradigms — a breadth that no single competing framework matches.

Table 1: Feature comparison of LlamaFactory with FastChat, LitGPT, LMFlow, and Open-Instruct across optimization methods, computation techniques, and training paradigms

Efficient Fine-Tuning Techniques

LlamaFactory's efficient fine-tuning techniques fall into two categories: efficient optimization, which reduces the number of parameters that must be updated, and efficient computation, which reduces the cost of each training step. Combined, they can cut the memory footprint from 18 bytes per parameter down to just 0.6 bytes per parameter.
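A quick back-of-envelope calculation shows what those per-parameter figures mean in practice. The 7B model size here is illustrative; the 18 and 0.6 bytes-per-parameter figures come from the text above, and activations are ignored.

```python
def training_memory_gb(n_params, bytes_per_param):
    """Rough estimate of weight + optimizer-state memory, ignoring activations."""
    return n_params * bytes_per_param / 1024**3

n = 7_000_000_000                     # an illustrative 7B-parameter model
full = training_memory_gb(n, 18.0)    # full fine-tuning with optimizer states
qlora = training_memory_gb(n, 0.6)    # 4-bit base weights plus small adapters
# full is over 100 GB; qlora is a few GB -- a 30x reduction.
```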

Efficient Optimization Methods

  • Freeze-tuning: Freeze most parameters, only fine-tune a small subset of decoder layers
  • LoRA: Low-rank adaptation — adds trainable low-rank matrices to frozen weights
  • Understanding LoRA and Its Variants

    LoRA (Low-Rank Adaptation) is one of the most popular efficient fine-tuning techniques. Instead of updating all billions of parameters in a model, LoRA freezes the original weights and adds small, trainable “adapter” matrices.

    Imagine you have a massive library (the model). Instead of rewriting every book, you add small sticky notes (adapters) that modify how certain pages are read. This is much cheaper and faster.

    • QLoRA: Combines LoRA with 4-bit quantization — compresses the “library” to take up less shelf space while still allowing sticky notes
    • DoRA: Decomposes weights into magnitude and direction, giving more stable training
    • LoRA+: Uses different learning speeds for different adapter matrices for faster convergence
    • PiSSA: Uses math (SVD) to find the best starting point for adapter initialization

    These variants all share the core idea: train a small number of parameters instead of all of them, reducing GPU memory from tens of GBs to just a few GBs.

  • QLoRA: LoRA on 4-bit quantized models for extreme memory savings
  • DoRA: Weight-decomposed LoRA for improved training stability
  • LoRA+: Different learning rates for A and B matrices for faster convergence
  • PiSSA: Initializes adapters using principal singular values for better starting points
  • GaLore: Gradient low-rank projection for full-parameter learning with reduced memory
  • GaLore stands for Gradient Low-Rank Projection. While LoRA adds small adapters, GaLore takes a different approach: it projects the gradients (the signals that tell the model how to update) into a lower-dimensional space. This allows full-parameter learning while using much less memory — you're updating all parameters, but the gradient computation is compressed.

  • BAdam: Block-coordinate optimization with Adam, training parameter blocks sequentially
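The LoRA idea above fits in a few lines of NumPy. This is a minimal sketch of a single linear layer with a LoRA adapter, not LlamaFactory's or PEFT's implementation; dimensions, the scaling factor, and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 8               # rank << d keeps the adapter tiny

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, rank))                 # trainable up-projection, zero init

def lora_forward(x, scale=2.0):
    """y = W x + scale * B (A x): the frozen base path plus a low-rank update."""
    return W @ x + scale * (B @ (A @ x))

x = rng.normal(size=d_in)
y = lora_forward(x)
# With B initialized to zero, LoRA starts as an exact no-op on the base model,
# and only A.size + B.size parameters (far fewer than W.size) ever train.
```

The zero initialization of `B` is the standard LoRA trick: training begins from the unmodified pre-trained model and only gradually departs from it.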

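GaLore's gradient projection, described in the bullet above, can also be sketched in NumPy. This toy version projects one gradient matrix onto its top singular directions; the real method refreshes the projection periodically during training, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank = 64, 32, 4

G = rng.normal(size=(m, n))                  # full-rank gradient of one weight
U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :rank]                              # projector onto top-r directions

G_low = P.T @ G                              # compact (rank x n) gradient
# Optimizer state (momentum, variance) lives at this small size, not m x n.
update = P @ G_low                           # project back for the weight update
```

Unlike LoRA, every weight still receives an update; only the gradient bookkeeping is compressed.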
Efficient Computation Methods

  • Mixed precision: Train in fp16/bf16 to halve memory usage with minimal quality loss
  • Activation checkpointing: Trade compute for memory by recomputing activations during backward pass
  • Flash Attention: IO-aware attention computation that is both faster and more memory-efficient
  • Flash Attention Explained

    The attention mechanism is the core of transformer models, but it's extremely memory-hungry — memory usage grows quadratically with sequence length. Flash Attention reorganizes the computation to be “IO-aware,” meaning it minimizes expensive memory reads/writes between GPU compute cores and GPU memory.

    The result: attention computation that is both 2-4x faster and uses significantly less memory, with mathematically identical results. It's like reorganizing your workflow to avoid unnecessary trips to the filing cabinet.

  • S2 Attention: Shifted sparse attention for handling extended context lengths
  • Unsloth: Custom CUDA kernels for accelerated LoRA training
  • Quantization: 4-bit/8-bit via bitsandbytes, GPTQ, AWQ, or AQLM for compressed models
Table 2: Compatibility matrix showing which fine-tuning techniques can be combined in LlamaFactory
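The "IO-aware" reorganization behind Flash Attention rests on the online softmax: keys and values are processed block by block with a running maximum and denominator, so the full score vector is never materialized. The NumPy sketch below shows that trick for a single query; it is a didactic sketch of the math, not the fused CUDA kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d, block = 16, 8, 4

q = rng.normal(size=d)
K = rng.normal(size=(seq, d))
V = rng.normal(size=(seq, d))

def naive_attention(q, K, V):
    """Reference: materializes all seq scores at once."""
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def tiled_attention(q, K, V, block):
    """Online softmax over key/value blocks: only one block is live at a time."""
    m = float("-inf")      # running max of scores seen so far
    denom = 0.0            # running softmax denominator
    acc = np.zeros(d)      # running weighted sum of values
    for i in range(0, len(K), block):
        s = K[i:i + block] @ q
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)   # rescale earlier blocks to the new max
        w = np.exp(s - m_new)
        denom = denom * correction + w.sum()
        acc = acc * correction + w @ V[i:i + block]
        m = m_new
    return acc / denom
```

The two functions agree to floating-point precision, which mirrors the paper's claim that Flash Attention is exact, not approximate.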

Framework Architecture

LlamaFactory consists of three main modules: Model Loader (handles model architectures for both LLMs and VLMs), Data Worker (processes data through a unified pipeline supporting single-turn and multi-turn dialogues), and Trainer (applies efficient fine-tuning techniques across pre-training, SFT, RLHF, and DPO). On top sits LlamaBoard, a web UI for codeless fine-tuning.

Figure 1: The architecture of LlamaFactory, showing the three main modules and the LlamaBoard interface

Model Loader

Handles model initialization, patching, quantization, adapter attachment, and precision adaptation across diverse architectures.

  • Model Initialization: Uses Transformers Auto Classes to load pre-trained models (AutoModelForCausalLM, AutoModelForVision2Seq)
  • Model Patching: Monkey patches for S2 attention; native Flash Attention support since Transformers 4.34.0
  • Model Quantization: Dynamic 4/8-bit quantization via bitsandbytes, GPTQ, AWQ, AQLM
  • Adapter Attaching: Automatic layer identification for LoRA, rsLoRA, DoRA, PiSSA via the PEFT library
  • Precision Adaptation: Automatic fp16/bf16 selection based on GPU compute capability

Data Worker

Standardizes datasets from different formats into a unified structure for flexible fine-tuning.

  • Dataset Loading: Loads from Hugging Face Hub or local files via the Datasets library with Arrow-backed efficient memory usage
  • Dataset Aligning: Data description specification converts diverse formats (Alpaca, ShareGPT, plain text, preference) into a standardized structure
  • The data description specification is LlamaFactory's way of handling the “data format chaos” in the LLM ecosystem. Different datasets use different JSON structures (Alpaca uses instruction/input/output; ShareGPT uses conversations array). Instead of writing custom loaders for each format, you define a small config that tells LlamaFactory how to map your dataset's fields to its standardized format. This means you can mix and match datasets from different sources in a single training run.

  • Dataset Merging: Concatenation for non-streaming datasets; interleaved reading for streaming mode
  • Pre-processing: Automatic chat template selection per model type; optional sequence packing for faster training
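The dataset-aligning step above can be illustrated with a toy converter. The real data description specification is a declarative config and richer than this, but the idea is the same: map each source format's fields onto one shared messages structure. The function names and the target structure here are illustrative assumptions.

```python
def align_alpaca(record):
    """Map an Alpaca-style record (instruction/input/output) onto shared messages."""
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n" + record["input"]
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": record["output"]},
    ]}

def align_sharegpt(record):
    """Map a ShareGPT-style conversations array onto the same structure."""
    roles = {"human": "user", "gpt": "assistant"}
    return {"messages": [
        {"role": roles[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]}

# Records from either format can now be mixed in a single training set.
alpaca = align_alpaca({"instruction": "Translate to French",
                       "input": "Hello", "output": "Bonjour"})
sharegpt = align_sharegpt({"conversations": [
    {"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello!"}]})
```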

Trainer

Integrates state-of-the-art training methods with distributed training support.

  • Efficient Training: LoRA+, GaLore, BAdam integration as drop-in replacements for default optimizer components
  • Model-Sharing RLHF: Enables full RLHF training on a single GPU by sharing weights between actor and critic models through adapter separation
  • Model-Sharing RLHF: A Clever Resource Trick

    Standard RLHF training requires four separate models running simultaneously: the policy model, a reference model, a reward model, and a value model. This makes RLHF extremely GPU-hungry — typically requiring multiple high-end GPUs.

    LlamaFactory's Model-Sharing RLHF solves this by using a single base model with multiple lightweight adapters:

    • First, train one adapter + value head for reward scoring
    • Then, initialize a second adapter for the policy (the model being improved)
    • Both adapters share the same frozen base model

    This brings RLHF from “needs a cluster” to “runs on a single consumer GPU” — a major step toward democratizing alignment training.

  • Distributed Training: DeepSpeed ZeRO Stage 1-3 with data parallelism for multi-GPU training; memory reduction via partitioning and offloading
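The adapter-sharing steps above can be sketched with one frozen base matrix and two named adapters. This is a conceptual NumPy toy, not LlamaFactory's trainer: the roles, shapes, and the scalar value head are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 16, 2

W_base = rng.normal(size=(d, d))  # one frozen backbone shared by all roles

adapters = {  # each role costs only rank * 2d trainable parameters
    "reward": (rng.normal(size=(rank, d)) * 0.01, rng.normal(size=(d, rank)) * 0.01),
    "policy": (rng.normal(size=(rank, d)) * 0.01, rng.normal(size=(d, rank)) * 0.01),
}
value_head = rng.normal(size=d)   # scalar head used only by the reward role

def forward(x, role):
    """Shared base plus the selected role's low-rank adapter."""
    A, B = adapters[role]
    h = W_base @ x + B @ (A @ x)
    return float(value_head @ h) if role == "reward" else h

x = rng.normal(size=d)
reward_score = forward(x, "reward")   # scalar preference score
policy_out = forward(x, "policy")     # hidden state for generation
```

The memory win is visible in the sizes: the backbone is stored once, and each additional role adds only its small adapter.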

Supported Data Formats

LlamaFactory supports five dataset structures through its Data Worker pipeline: plain text, Alpaca-like data, ShareGPT-like data, preference data, and a standardized format that unifies all others. This flexibility allows users to bring their own data in any common format.

Table 3: Dataset structures supported by LlamaFactory, with JSON format examples for each type

LlamaBoard: Codeless Fine-Tuning Interface

LlamaBoard is a Gradio-based web interface that lets users customize LLM fine-tuning without writing any code. It provides a streamlined experience from configuration to evaluation.

Easy Configuration

Customize fine-tuning arguments through a web interface with sensible defaults for most parameters. Preview datasets directly in the UI to validate them before training.

Monitorable Training

Training logs and loss curves are visualized and updated in real time, allowing users to monitor training progress and gain insights into the fine-tuning process.

Flexible Evaluation

Calculate text similarity scores (BLEU-4, ROUGE) automatically, or perform human evaluation by chatting with your fine-tuned model directly.

Multilingual Support

The interface is localized in English, Russian, and Chinese, allowing a broader range of users to leverage LlamaBoard for their fine-tuning workflows.

Training Efficiency Results

Training efficiency was evaluated using the PubMed dataset (36M+ biomedical records) on Gemma-2B, Llama2-7B, and Llama2-13B models. Methods compared include full fine-tuning, GaLore, LoRA, and QLoRA, measuring peak memory usage, training throughput (tokens/s), and perplexity (PPL).

  • 5.21 GB: QLoRA peak memory for Gemma-2B (vs. 17.06 GB for full fine-tuning)
  • 5,608 tokens/s: LoRA training throughput for Gemma-2B
  • 12.61 GB: QLoRA peak memory for Llama2-13B (full fine-tuning exceeds a single GPU)

Table 4: Training efficiency comparison across Gemma-2B, Llama2-7B, and Llama2-13B, showing trainable parameters, memory, throughput, and perplexity

QLoRA consistently achieves the lowest memory footprint because pre-trained weights are stored in lower precision. LoRA delivers the highest throughput in most cases. Notably, full fine-tuning of Llama2-13B causes memory overflow on a single A100 40GB GPU, while QLoRA handles it with just 12.61 GB.

Reading the Efficiency Results

Perplexity (PPL) measures how well the model predicts the next word — lower is better. A PPL of 10 means the model is roughly as confused as if it had to pick from 10 equally likely words at each step.

Key insight: QLoRA uses only 5.21 GB for Gemma-2B (vs 17.06 GB for full fine-tuning) — a 3.3x memory reduction. Yet the perplexity difference is small (10.46 vs 10.34), meaning the quality loss is minimal. For Llama2-13B, full fine-tuning is simply impossible on a single A100 GPU, but QLoRA handles it with just 12.61 GB.
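The perplexity definition behind these numbers is short enough to compute directly: it is the exponential of the average negative log-likelihood per token. A minimal sketch, with illustrative log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token; lower is better."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that gives every token probability 1/10 has perplexity 10,
# matching the "as confused as picking from 10 equally likely words" intuition.
uniform10 = perplexity([math.log(0.1)] * 5)
```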

Downstream Task Performance

Performance was evaluated on three text generation tasks: CNN/DailyMail and XSum (English summarization) and AdGen (Chinese advertisement generation). Eight instruction-tuned models were tested with full fine-tuning, GaLore, LoRA, and QLoRA, measuring averaged ROUGE-1, ROUGE-2, and ROUGE-L scores.

Table 5: ROUGE score comparison across CNN/DailyMail, XSum, and AdGen for different models and fine-tuning methods

A key finding is that LoRA and QLoRA achieve the best performance in most cases, often matching or exceeding full fine-tuning. This demonstrates that efficient methods do not sacrifice quality — they can actually improve it through regularization effects, while using a fraction of the memory.

ROUGE scores measure text overlap between generated and reference summaries. ROUGE-1 counts matching single words, ROUGE-2 counts matching word pairs, and ROUGE-L finds the longest common subsequence. The key takeaway: LoRA and QLoRA often outperform full fine-tuning, likely because the parameter constraints act as a form of regularization that prevents overfitting.
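ROUGE-1 is simple enough to sketch from the description above. This toy version computes a unigram-overlap F1 with clipped counts on whitespace tokens; production ROUGE implementations add stemming and other normalization that is omitted here.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a generated text and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each word's matches to its count in the reference.
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
# Five of six unigrams match, so precision = recall = F1 = 5/6.
```

ROUGE-2 would apply the same counting to adjacent word pairs, and ROUGE-L replaces the counts with a longest-common-subsequence length.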

Conclusion & Future Work

LlamaFactory demonstrates that a unified, modular framework can democratize LLM fine-tuning. By minimizing dependencies between models, datasets, and training methods, it enables fine-tuning of over 100 LLMs with diverse efficient techniques. LlamaBoard further lowers the barrier by providing a codeless web interface for configuration, training, and evaluation.

Future Roadmap

  1. Multi-modal fine-tuning: Extending support to audio and video modalities beyond text and vision
  2. Advanced parallelism: Integrating sequence parallelism and tensor parallelism for even larger-scale training
  3. Conversational fine-tuning: Exploring self-play and other advanced methods for improving conversational model quality

Broader Impact & Responsible Use

LlamaFactory has attracted a large community of LLM practitioners, contributing significantly to open-source growth. Featured in Hugging Face's Awesome Transformers list, it serves as a representative efficient fine-tuning framework. The authors emphasize responsible use and adherence to model licenses when building upon the framework.

Supported Models

Table 6: Complete list of 50+ supported model families, including Llama, Gemma, Qwen, Mistral, Phi, DeepSeek, and many more

References

  1. Abdin et al. (2024). Phi-3 Technical Report.
  2. Abid et al. (2019). Gradio: Hassle-Free Sharing and Testing of ML Models in the Wild.
  3. AI@Meta (2024). Llama 3.
  4. Bai et al. (2023). Qwen Technical Report.
  5. Canese & Weis (2013). PubMed: The Bibliographic Database.
  6. Chen et al. (2016). Training Deep Nets with Sublinear Memory Cost.
  7. Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
  8. Dettmers et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
  9. Dettmers et al. (2023). QLoRA: Efficient Finetuning of Quantized Language Models.
  10. Hu et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models.
  11. Luo et al. (2024). BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models.
  12. Meng et al. (2024). PiSSA: Principal Singular Values and Singular Vectors Adaptation.
  13. Rasley et al. (2020). DeepSpeed: System Optimizations Enable Training Deep Learning Models.
  14. Touvron et al. (2023). LLaMA: Open and Efficient Foundation Language Models.
  15. Wolf et al. (2020). Transformers: State-of-the-Art Natural Language Processing.
  16. Zhao et al. (2024). GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
  17. Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
