---
arxiv_id: 2403.13372
title: "LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models"
authors:
  - Yaowei Zheng
  - Richong Zhang
  - Junhao Zhang
  - Yanhan Ye
  - Zheyan Luo
  - Zhangchi Feng
  - Yongqiang Ma
difficulty: Intermediate
tags:
  - LLM
  - Benchmark
published_at: 2026-04-05
flecto_url: https://flecto.zer0ai.dev/papers/2403.13372/
lang: en
---

> A unified framework integrating cutting-edge efficient training methods for flexibly customizing the fine-tuning of 100+ LLMs without coding

**Authors**: Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, Yongqiang Ma &mdash; Beihang University & Peking University

## Abstract

Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks, yet implementing these methods across different models requires non-trivial effort. LlamaFactory is a unified framework that integrates a suite of cutting-edge efficient training methods and, through its built-in web UI LlamaBoard, enables flexible, codeless fine-tuning of 100+ LLMs. The framework has been empirically validated on language modeling and text generation tasks and has received over 25,000 GitHub stars and 3,000 forks.

## Introduction

Large language models (LLMs) demonstrate remarkable reasoning capabilities and power a wide range of applications including question answering, machine translation, and information extraction. With over 5,000 models on Hugging Face's open LLM leaderboard, the ecosystem is growing rapidly. However, adapting these models to specific tasks presents significant challenges.

### Key Challenges in LLM Fine-Tuning

- **Resource constraints**: Fine-tuning billions of parameters with limited GPU memory is the primary bottleneck for most practitioners.
- **Implementation complexity**: Each efficient fine-tuning method requires custom implementation for different model architectures.
- **Fragmented ecosystem**: Existing frameworks cover only subsets of available methods and models, lacking a unified solution.

To address these problems, LlamaFactory provides a modular framework with three core modules &mdash; Model Loader, Data Worker, and Trainer &mdash; that minimize dependencies on specific models and datasets. This allows flexible scaling to hundreds of models and training approaches, including pre-training, supervised fine-tuning (SFT), RLHF, and DPO.
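The modular decomposition can be sketched in miniature. All class and method names below are hypothetical stand-ins for illustration, not LlamaFactory's real API; the point is that each module only sees the others' interfaces, so a new model or training stage plugs in without touching the rest of the pipeline.

```python
# Illustrative sketch of a three-module pipeline in the spirit of
# LlamaFactory's design. All names here are hypothetical, not the
# framework's actual API.

class ModelLoader:
    def load(self, name: str) -> dict:
        # Would normally call Transformers Auto classes, attach adapters, etc.
        return {"model": name, "adapters": []}

class DataWorker:
    def prepare(self, dataset: list[dict]) -> list[dict]:
        # Would align Alpaca/ShareGPT/plain-text records into one schema.
        return [{"messages": d.get("messages", [])} for d in dataset]

class Trainer:
    def __init__(self, stage: str):
        self.stage = stage  # e.g. "pt", "sft", "rm", "ppo", "dpo"

    def run(self, model: dict, data: list[dict]) -> str:
        return f"{self.stage} on {model['model']} with {len(data)} samples"

result = Trainer("sft").run(
    ModelLoader().load("llama2-7b"),
    DataWorker().prepare([{"messages": [{"role": "user", "content": "hi"}]}]),
)
print(result)  # sft on llama2-7b with 1 samples
```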

## Efficient Fine-Tuning Techniques

LlamaFactory's efficient fine-tuning techniques fall into two categories: efficient optimization (reducing which parameters need updating) and efficient computation (reducing the cost of each computation step). Together, they can reduce memory footprint from 18 bytes per parameter down to just 0.6 bytes per parameter.
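The 18-byte figure follows one common accounting for mixed-precision Adam training (an fp16 weight plus fp32 master weight, gradient, and two optimizer states), while 0.6 bytes roughly corresponds to 4-bit weight storage plus quantization overhead. A quick sanity check, with the byte breakdown being an assumption rather than taken from the paper:

```python
# Back-of-the-envelope memory estimate per parameter. The breakdown is
# one common accounting and an assumption here; actual numbers vary
# with optimizer and implementation.
FULL_FT_BYTES = 2 + 4 + 4 + 4 + 4   # fp16 weight + fp32 master weight,
                                     # fp32 gradient, Adam momentum, Adam variance
QLORA_BYTES = 0.6                    # ~0.5 B for 4-bit weights + overhead

def estimate_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1024**3

print(f"full FT, 7B model: {estimate_gb(7e9, FULL_FT_BYTES):.1f} GB")
print(f"QLoRA,   7B model: {estimate_gb(7e9, QLORA_BYTES):.1f} GB")
```

The gap explains why a 7B model that overwhelms a single GPU under full fine-tuning fits comfortably under QLoRA.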

### Efficient Optimization Methods

- **Freeze-tuning**: Freeze most parameters; fine-tune only a small subset of decoder layers.
- **LoRA**: Low-rank adaptation; adds trainable low-rank matrices to frozen weights.
- **QLoRA**: LoRA on 4-bit quantized models for extreme memory savings.
- **DoRA**: Weight-decomposed LoRA for improved training stability.
- **LoRA+**: Different learning rates for the A and B matrices for faster convergence.
- **PiSSA**: Initializes adapters using principal singular values for better starting points.
- **GaLore**: Gradient low-rank projection for full-parameter learning with reduced memory.
- **BAdam**: Block-coordinate optimization with Adam, training parameter blocks sequentially.
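As a concrete illustration of the most widely used of these methods, here is a minimal NumPy sketch of a LoRA forward pass; shapes and init scales are illustrative:

```python
import numpy as np

# Minimal LoRA forward pass: the frozen weight W is augmented with a
# trainable low-rank update (alpha / r) * B @ A.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

def lora_forward(x: np.ndarray) -> np.ndarray:
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# With B initialized to zero, LoRA starts as an exact no-op:
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B (2 × r × d parameters) receive gradients, which is why LoRA's optimizer state is tiny compared with full fine-tuning.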

### Efficient Computation Methods

- **Mixed precision**: Train in fp16/bf16 to halve memory usage with minimal quality loss.
- **Activation checkpointing**: Trade compute for memory by recomputing activations during the backward pass.
- **Flash Attention**: IO-aware attention computation that is both faster and more memory-efficient.
- **S²-Attn**: Shifted sparse attention for handling extended context lengths.
- **Unsloth**: Custom CUDA kernels for accelerated LoRA training.
- **Quantization**: 4-bit/8-bit via bitsandbytes, GPTQ, AWQ, or AQLM for compressed models.
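To make the quantization entry concrete, here is a toy symmetric 4-bit (absmax) quantizer in NumPy. Real schemes such as NF4, GPTQ, and AWQ are considerably more sophisticated than this uniform grid, but the storage intuition (about 0.5 bytes per weight) is the same:

```python
import numpy as np

# Toy symmetric 4-bit quantization (absmax). Illustrative only; real
# 4-bit schemes (NF4, GPTQ, AWQ, AQLM) use smarter grids and grouping.
def quantize_4bit(w: np.ndarray):
    scale = np.abs(w).max() / 7            # int4 symmetric range: -7..7
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, s = quantize_4bit(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"mean absolute quantization error: {err:.4f}")
```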

Table 2: Compatibility matrix showing which fine-tuning techniques can be combined together in LlamaFactory

## Framework Architecture

LlamaFactory consists of three main modules: Model Loader (handles model architectures for both LLMs and VLMs), Data Worker (processes data through a unified pipeline supporting single-turn and multi-turn dialogues), and Trainer (applies efficient fine-tuning techniques across pre-training, SFT, RLHF, and DPO). On top sits LlamaBoard, a web UI for codeless fine-tuning.

Figure 1: The architecture of LlamaFactory showing the three main modules and the LlamaBoard interface

### Model Loader

Handles model initialization, patching, quantization, adapter attachment, and precision adaptation across diverse architectures.

- **Model Initialization**: Uses Transformers Auto Classes (AutoModelForCausalLM, AutoModelForVision2Seq) to load pre-trained models.
- **Model Patching**: Monkey patching for S²-Attn; native Flash Attention support since Transformers 4.34.0.
- **Model Quantization**: Dynamic 4-/8-bit quantization via bitsandbytes, GPTQ, AWQ, AQLM.
- **Adapter Attaching**: Automatic layer identification for LoRA, rsLoRA, DoRA, PiSSA via the PEFT library.
- **Precision Adaptation**: Automatic fp16/bf16 selection based on GPU compute capability.
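The precision-adaptation step can be sketched as a small selection rule: bfloat16 requires Ampere-class GPUs (compute capability 8.0 or higher), while older cards fall back to float16. The function name and exact logic below are illustrative, not LlamaFactory's API:

```python
# Illustrative precision selection by GPU compute capability.
# bf16 needs Ampere (CC 8.0+); earlier GPUs fall back to fp16.
def select_precision(compute_capability: tuple[int, int]) -> str:
    major, _minor = compute_capability
    return "bf16" if major >= 8 else "fp16"

assert select_precision((8, 0)) == "bf16"   # e.g. A100
assert select_precision((7, 0)) == "fp16"   # e.g. V100
```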

### Data Worker

Standardizes datasets from different formats into a unified structure for flexible fine-tuning.

- **Dataset Loading**: Loads from the Hugging Face Hub or local files via the Datasets library, with Arrow-backed memory efficiency.
- **Dataset Aligning**: A data description specification converts diverse formats (Alpaca, ShareGPT, plain text, preference) into a standardized structure.
- **Dataset Merging**: Concatenation for non-streaming datasets; interleaved reading in streaming mode.
- **Pre-processing**: Automatic chat template selection per model type; optional sequence packing for faster training.
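Dataset aligning can be illustrated with a toy converter from an Alpaca-style record into a unified chat-message structure; the field names below are illustrative, not the framework's actual schema:

```python
# Toy Alpaca-to-unified-messages converter. Field names are
# illustrative, not LlamaFactory's real internal schema.
def align_alpaca(record: dict) -> dict:
    prompt = record["instruction"]
    if record.get("input"):                 # optional extra context field
        prompt += "\n" + record["input"]
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["output"]},
        ]
    }

sample = {"instruction": "Translate to French:", "input": "Hello", "output": "Bonjour"}
aligned = align_alpaca(sample)
assert aligned["messages"][0]["content"] == "Translate to French:\nHello"
```

Once every format is funneled into one message structure, a single chat-template and tokenization step serves all datasets downstream.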

### Trainer

Integrates state-of-the-art training methods with distributed training support.

- **Efficient Training**: LoRA+, GaLore, and BAdam integrate as drop-in replacements for default optimizer components.
- **Model-Sharing RLHF**: Enables full RLHF training on a single GPU by sharing weights between the actor and critic models through adapter separation.
- **Distributed Training**: DeepSpeed ZeRO stages 1-3 with data parallelism for multi-GPU training; memory reduction via partitioning and offloading.
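The model-sharing idea can be shown with a toy object that keeps one copy of the base weights and switches between actor and critic adapters; the structure and names are illustrative:

```python
# Toy illustration of model-sharing RLHF: actor and critic share one
# frozen base model and differ only in which small adapter is active,
# so a single GPU holds just one copy of the base weights.
class SharedPolicyValueModel:
    def __init__(self):
        self.base = "frozen-7B-weights"             # stored once
        self.adapters = {"actor": {}, "critic": {}}  # tiny per-role weights
        self.active = "actor"

    def set_adapter(self, name: str):
        assert name in self.adapters
        self.active = name

    def forward(self, prompt: str) -> str:
        return f"{self.active} output for {prompt!r}"

m = SharedPolicyValueModel()
m.set_adapter("actor")
print(m.forward("hi"))    # actor output for 'hi'
m.set_adapter("critic")
print(m.forward("hi"))    # critic output for 'hi'
```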

## Downstream Task Performance

Performance was evaluated on three text generation tasks: CNN/DailyMail and XSum (English summarization) and AdGen (Chinese advertisement generation). Eight instruction-tuned models were tested with full fine-tuning, GaLore, LoRA, and QLoRA, measuring averaged ROUGE-1, ROUGE-2, and ROUGE-L scores.

Table 5: ROUGE score comparison across CNN/DailyMail, XSum, and AdGen tasks for different models and fine-tuning methods

A key finding is that LoRA and QLoRA achieve the best performance in most cases, often matching or exceeding full fine-tuning. This demonstrates that efficient methods need not sacrifice quality; their low-rank constraint can even act as a regularizer, all while using a fraction of the memory.
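The ROUGE metrics used in this evaluation can be made concrete with a minimal unigram ROUGE-1 F1 implementation; real toolkits add stemming, ROUGE-2, and ROUGE-L:

```python
# Minimal unigram ROUGE-1 F1 between a candidate and a reference.
# Illustrative only; production ROUGE adds stemming and n-gram variants.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())    # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat is on the mat")
print(f"ROUGE-1 F1: {score:.3f}")  # 5 of 6 unigrams overlap -> 0.833
```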

## Training Efficiency Results

Training efficiency was evaluated using the PubMed dataset (36M+ biomedical records) on Gemma-2B, Llama2-7B, and Llama2-13B models. Methods compared include full fine-tuning, GaLore, LoRA, and QLoRA, measuring peak memory usage, training throughput (tokens/s), and perplexity (PPL).

Highlights: QLoRA sharply cuts memory for Gemma-2B versus 17.06 GB under full fine-tuning; LoRA delivers the best throughput (tokens/s) on Gemma-2B; and QLoRA fits Llama2-13B where full fine-tuning cannot run.

Table 4: Training efficiency comparison across Gemma-2B, Llama2-7B, and Llama2-13B showing trainable parameters, memory, throughput, and perplexity

QLoRA consistently achieves the lowest memory footprint because pre-trained weights are stored in lower precision. LoRA delivers the highest throughput in most cases. Notably, full fine-tuning of Llama2-13B causes memory overflow on a single A100 40GB GPU, while QLoRA handles it with just 12.61 GB.
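Perplexity, the quality metric reported in Table 4, is simply the exponential of the mean per-token negative log-likelihood:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token), natural log.
def perplexity(token_log_probs: list[float]) -> float:
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model assigning every token probability 0.25 has perplexity 4:
assert abs(perplexity([math.log(0.25)] * 10) - 4.0) < 1e-9
```

Lower perplexity means the fine-tuned model assigns higher probability to the held-out text, so efficiency gains in Table 4 come at little or no cost in modeling quality.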

## Conclusion & Future Work

LlamaFactory demonstrates that a unified, modular framework can democratize LLM fine-tuning. By minimizing dependencies between models, datasets, and training methods, it enables fine-tuning of over 100 LLMs with diverse efficient techniques. LlamaBoard further lowers the barrier by providing a codeless web interface for configuration, training, and evaluation.

### Future Roadmap

- **Multi-modal fine-tuning**: Extending support to audio and video modalities beyond text and vision.
- **Advanced parallelism**: Integrating sequence parallelism and tensor parallelism for even larger-scale training.
- **Conversational fine-tuning**: Exploring self-play and other advanced methods for improving conversational model quality.

## Broader Impact & Responsible Use

LlamaFactory has attracted a large community of LLM practitioners, contributing significantly to open-source growth. Featured in Hugging Face's Awesome Transformers list, it serves as a representative efficient fine-tuning framework. The authors emphasize responsible use and adherence to model licenses when building upon the framework.


## Supported Models

The framework covers 50+ supported model families.

Table 6: Complete list of supported models including Llama, Gemma, Qwen, Mistral, Phi, DeepSeek, and many more

## Feature Comparison with Existing Frameworks

LlamaFactory stands out by providing comprehensive support across optimization methods, computation efficiency techniques, and training paradigms &mdash; a breadth that no single competing framework matches.

Table 1: Feature comparison of LlamaFactory with FastChat, LitGPT, LMFlow, and Open-Instruct across optimization methods, computation techniques, and training paradigms

## LlamaBoard: Codeless Fine-Tuning Interface

LlamaBoard is a Gradio-based web interface that lets users customize LLM fine-tuning without writing any code. It provides a streamlined experience from configuration to evaluation.

### Easy Configuration

Customize fine-tuning arguments through a web interface with sensible defaults for most parameters. Preview datasets directly in the UI to validate them before training.

### Monitorable Training

Training logs and loss curves are visualized and updated in real time, allowing users to monitor training progress and gain insights into the fine-tuning process.

### Flexible Evaluation

Calculate text similarity scores (BLEU-4, ROUGE) automatically, or perform human evaluation by chatting with your fine-tuned model directly.

### Multilingual Support

Interface localization supporting English, Russian, and Chinese, allowing a broader range of users to leverage LlamaBoard for their fine-tuning workflows.

## Supported Data Formats

LlamaFactory supports five dataset structures through its Data Worker pipeline: plain text, Alpaca-like data, ShareGPT-like data, preference data, and a standardized format that unifies all others. This flexibility allows users to bring their own data in any common format.

Table 3: Dataset structures supported by LlamaFactory, showing JSON format examples for each type
