---
arxiv_id: 2603.26164
title: "DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models"
authors:
  - Hao Liang
  - Zhengyang Zhao
  - Meiyi Qiang
  - Mingrui Chen
  - Lu Ma
  - Rongyi Yu
  - Hengyi Feng
  - Shixuan Sun
  - Zimo Meng
  - Xiaochen Ma
  - Xuanlin Yang
  - Qifeng Cai
  - Ruichuan An
  - Bohan Zeng
  - Zhen Hao Wong
  - Chengyu Shen
  - Runming He
  - Zhaoyang Han
  - Yaowei Zheng
  - Fangcheng Fu
  - Conghui He
  - Bin Cui
  - Zhiyu Li
  - Weinan E
  - Wentao Zhang
difficulty: Advanced
tags:
  - LLM
  - Training
  - Data Curation
  - Data Selection
  - Domain Mixture
  - LLaMA-Factory
published_at: 2026-03-27
flecto_url: https://flecto.zer0ai.dev/papers/2603.26164/
lang: en
---

## Html Page Title

### DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

## Html Meta Description

DataFlex unifies dynamic data selection, domain mixture optimization, and sample reweighting into a single LLaMA-Factory compatible framework for reproducible, scalable LLM training.

## Small Badge In Hero Section

### arXiv 2603.26164 &middot; cs.AI &middot; Mar 2026

## Main Paper Title In Hero

### DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

## Author List In Hero

Hao Liang*, Zhengyang Zhao*, Meiyi Qiang*, Mingrui Chen*, Lu Ma, Rongyi Yu, et al. &middot; Peking University / Shanghai AI Lab / LLaMA-Factory Team / OpenDCAI

## Key Claim / Tagline In Hero

One framework. Three paradigms. DataFlex unifies dynamic data selection, domain mixture, and sample reweighting — all as a drop-in replacement for standard LLM training.

## Arxiv Button In Hero

### arXiv

## Github Button In Hero

### GitHub

## Alt Text For Thumbnail Image

Abstract illustration of three data streams converging into a neural network node, representing DataFlex unified training paradigm

## Section Heading

### Abstract

### Three-Layer Architecture

### Three Data-Centric Paradigms

### Drop-in YAML Configuration

### Experimental Results: Data Selection

### Experimental Results: Data Mixture

### Runtime Efficiency

### Why DataFlex Matters

## Paper Abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. We present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms: dynamic sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and supports large-scale settings including DeepSpeed ZeRO-3.

## Sub Heading For Contribution Cards

### Key Contributions

## Card Title

### Unified Interface

### Drop-in Replacement

### DeepSpeed ZeRO-3

### Consistent MMLU Gains

## Card Body

Single framework unifying three data-centric paradigms previously scattered across isolated codebases with inconsistent APIs.

Plugs directly into LLaMA-Factory via YAML config. No code changes required to the standard training pipeline.

Full support for large-scale distributed training with DeepSpeed ZeRO-3 and FSDP parallel optimization.

Dynamic data selection consistently outperforms static full-data training on MMLU across Mistral-7B and Llama-3.2-3B.

## Alt Text For Architecture Diagram

DataFlex three-layer framework architecture: Base Layer from LLaMA-Factory, Trainer Layer with Select/Mix/Weight trainers, Component Layer with pluggable algorithm strategies

## Figure Caption

Figure 2: DataFlex Framework overview. (a) Base Layer inherits LLaMA-Factory infrastructure. (b) Trainer Layer introduces three unified trainers. (c) Component Layer provides pluggable strategy algorithms.

Figure 5: Runtime comparison between DataFlex and the original TSDS implementation. DataFlex achieves consistent speedups as the training set grows from 5K to 100K samples (left) and the validation set from 50 to 1,000 samples (right).

## Description Below Architecture Figure

DataFlex is structured in three layers: the Base layer inherits model management, data processing, and optimizers directly from LLaMA-Factory with DeepSpeed/FSDP support; the Trainer layer introduces three unified trainer abstractions (SelectTrainer, MixTrainer, WeightTrainer); the Component layer provides pluggable, swappable algorithm modules, enabling researchers to compare methods side-by-side without integration overhead.

## Flecto Note Heading

### What is data-centric training?

### What is LLaMA-Factory?

### Why does dynamic data selection help?

### What is DoReMi and why does domain proportion matter?

### What is DeepSpeed ZeRO-3?

## Flecto Note Body

Traditional LLM training optimizes model parameters while keeping the training data fixed. Data-centric training flips this perspective: it treats the data pipeline itself as an optimization target. Which samples are most informative for the current model state? What domain proportions maximize generalization? Which training examples deserve more gradient weight? DataFlex answers all three of these questions automatically during a single training run.

LLaMA-Factory is a widely used open-source framework for efficiently fine-tuning and pretraining large language models. It abstracts the boilerplate of distributed training, dataset loading, and optimizer configuration, and supports dozens of model architectures including LLaMA, Mistral, Qwen, and Phi. DataFlex builds directly on top of LLaMA-Factory's trainer infrastructure, inheriting all of its capabilities while adding a data-centric optimization layer above them.

In static training, every sample in the dataset gets equal opportunity to influence the model, regardless of its actual utility at that training step. Dynamic selection uses the current model state to score each sample's expected learning value — for example, by measuring how much the model's hidden representations change for that sample, or how large the gradient signal would be. High-utility samples get selected more frequently, low-utility ones less so. This is like a student focusing more study time on problems they haven't yet mastered, rather than re-reading pages they already know.
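The selection step above can be sketched as a simple top-k filter over per-sample utility scores. This is an illustrative sketch, not DataFlex's actual API: the function name `select_batch` and the origin of the scores (e.g. a gradient-norm or representation-shift estimate) are assumptions.

```python
import numpy as np

def select_batch(scores, k):
    """Pick the indices of the k highest-utility samples.

    `scores` is a hypothetical per-sample utility signal, e.g. a
    gradient-norm or representation-shift estimate from the current
    model state.
    """
    scores = np.asarray(scores, dtype=float)
    # argpartition finds the top-k in O(n) without a full sort
    top_k = np.argpartition(-scores, k - 1)[:k]
    # order the survivors by descending utility for readability
    return top_k[np.argsort(-scores[top_k])]

# Toy example: samples 2 and 0 have the largest utility estimates
utilities = [0.9, 0.1, 1.3, 0.4]
print(select_batch(utilities, 2))  # -> [2 0]
```

A real selector would recompute `scores` periodically as the model trains, which is what makes the selection dynamic rather than a one-off filtering pass.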

Language model pretraining datasets like SlimPajama combine multiple text domains: Wikipedia, GitHub code, C4 web text, books, ArXiv papers, etc. The proportion of each domain in each training batch is a critical hyperparameter — too much code makes the model worse at natural language; too little Wikipedia hurts factual recall. DoReMi (Domain Reweighting with Minimax Optimization) uses a reference model trained on uniform proportions as a baseline, then learns to upweight domains where the main model has the highest excess loss. This ensures data-scarce but important domains get more training signal. ODM (Online Domain Mixing) takes a similar adaptive approach without requiring a reference model.
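The excess-loss idea behind DoReMi can be sketched as an exponentiated-gradient update over domain weights. This is a simplified sketch of the general recipe, not the authors' implementation; the learning rate and smoothing values are illustrative.

```python
import numpy as np

def doremi_step(weights, proxy_losses, ref_losses, lr=0.1, smoothing=1e-3):
    """One DoReMi-style multiplicative update of domain weights.

    Domains where the proxy model's loss exceeds the reference
    model's ("excess loss") are upweighted, then the weights are
    renormalized and lightly smoothed toward uniform.
    """
    excess = np.maximum(np.asarray(proxy_losses) - np.asarray(ref_losses), 0.0)
    w = np.asarray(weights) * np.exp(lr * excess)   # multiplicative update
    w = w / w.sum()                                 # project back onto the simplex
    u = np.full_like(w, 1.0 / len(w))               # uniform distribution
    return (1 - smoothing) * w + smoothing * u      # smooth toward uniform

# Toy run: domain 1 has the largest excess loss, so its weight grows
w = np.array([1 / 3, 1 / 3, 1 / 3])
for _ in range(10):
    w = doremi_step(w, proxy_losses=[2.0, 3.5, 2.2], ref_losses=[2.0, 2.5, 2.1])
print(np.argmax(w))  # -> 1
```

The smoothing term prevents any domain's weight from collapsing to zero, so data-scarce domains keep receiving some training signal.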

Training large language models with billions of parameters requires more GPU memory than a single accelerator can hold. DeepSpeed's ZeRO (Zero Redundancy Optimizer) algorithm splits the model across multiple GPUs by sharding optimizer states, gradients, and model parameters across devices. ZeRO-3 is the most aggressive partitioning — all three are distributed — allowing models that are otherwise too large for any single GPU to be trained efficiently on a cluster. DataFlex's support for ZeRO-3 means that data-centric methods (which require auxiliary operations like embedding extraction and model inference) are fully compatible with modern multi-GPU training setups.
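In practice, ZeRO stage 3 is enabled through a DeepSpeed JSON configuration. The sketch below shows the general shape; the exact values (and whether CPU offload is wanted) depend on the cluster, so treat them as illustrative rather than as DataFlex's shipped defaults.

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "offload_optimizer": { "device": "cpu" }
  },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": "auto"
}
```

With stage 3, optimizer states, gradients, and parameters are all sharded; stages 1 and 2 shard progressively fewer of these, trading memory savings for communication cost.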

## Panel Title

### Dynamic Sample Selection

### Domain Mixture Adjustment

### Sample Reweighting

## Panel Body

Iteratively identifies and selects the most informative training samples at each step. DataFlex supports LESS, TSDS, and custom selector components, all sharing a unified interface for scoring and filtering.

Dynamically adjusts the sampling proportions across data domains (e.g., SlimPajama subsets) during pretraining. DoReMi and ODM are the built-in mixture algorithms, with a mixer abstraction for custom strategies.

Assigns per-sample loss weights during gradient updates based on model-dependent quality signals. Pluggable weighter components enable custom reweighting strategies with minimal boilerplate.
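The core of sample reweighting is a weighted loss. The sketch below shows a per-sample weighted negative log-likelihood in plain numpy; the weight source (a pluggable weighter producing quality signals) and the normalization choice are assumptions for illustration, not DataFlex's actual weighter interface.

```python
import numpy as np

def weighted_nll(log_probs, targets, sample_weights):
    """Per-sample weighted negative log-likelihood.

    `sample_weights` would come from a weighter component (a
    model-dependent quality signal); here they are just given.
    Weights are normalized to mean 1 so the overall gradient
    scale matches the unweighted loss.
    """
    log_probs = np.asarray(log_probs)  # (batch, vocab) log-probabilities
    per_sample = -log_probs[np.arange(len(targets)), targets]
    w = np.asarray(sample_weights, dtype=float)
    w = w * len(w) / w.sum()           # normalize to mean weight 1
    return float(np.mean(w * per_sample))

# Two samples; the second ("higher quality") counts twice as much
logp = np.log([[0.7, 0.3], [0.2, 0.8]])
print(round(weighted_nll(logp, [0, 1], [1.0, 2.0]), 3))  # -> 0.268
```

With uniform weights this reduces to the ordinary mean cross-entropy, which is why such reweighting can act as a drop-in modification of the standard loss.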

## Alt Text For Config Figure

YAML configuration snippet showing DataFlex DoReMi dynamic mix training settings alongside standard LLaMA-Factory fields

## Figure Caption For Config

Figure 3: Minimal YAML configuration: add a single dataflex block to your existing LLaMA-Factory config to enable dynamic data optimization.

## Description For Config Section

DataFlex integrates via a minimal YAML configuration block appended to the standard LLaMA-Factory config. Setting train_type, component_name, and component-specific hyperparameters activates dynamic data optimization, with no changes to training code required. The example shows DoReMi domain mixture for Qwen2.5-0.5B pretraining on the wiki_demo and c4_demo datasets.
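A rough sketch of what such a config might look like. The field names train_type and component_name come from the paper; the block layout, the extra hyperparameter, and the specific values are illustrative assumptions, so consult the DataFlex repository for the exact schema.

```yaml
# Standard LLaMA-Factory fields (unchanged)
model_name_or_path: Qwen/Qwen2.5-0.5B
stage: pt
dataset: wiki_demo,c4_demo
output_dir: saves/qwen2.5-0.5b-doremi

# DataFlex block -- field names per the paper; values illustrative
dataflex:
  train_type: mix            # select | mix | weight
  component_name: doremi     # pluggable strategy, e.g. doremi or odm
  update_interval: 100       # hypothetical component hyperparameter
```

Everything above the dataflex block is a plain LLaMA-Factory config, which is what makes DataFlex a drop-in addition rather than a fork of the training pipeline.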

## Alt Text

MMLU accuracy curves over training steps for dynamic data selection methods on Mistral-7B (left) and Llama-3.2-3B (right), showing consistent gains over static full-data training

Two line charts comparing runtime in seconds between DataFlex and TSDS original implementation as training dataset size (5K-100K samples) and validation dataset size (50-1000 samples) grow

## Caption For Results Figure

Figure 4: MMLU accuracy during training for dynamic data selection methods vs. static full-data baseline. Dynamic methods consistently outperform across both Mistral-7B and Llama-3.2-3B backbones.

## Description For Results Section

Comprehensive experiments across seven dynamic data selection algorithms confirm that data-centric dynamic training delivers measurable, consistent improvements over static baselines. The gains hold across both 7B and 3B model scales, demonstrating that the unified DataFlex infrastructure does not introduce regressions compared to standalone implementations.

## Alt Text For Mixture Results Table

Table comparing DoReMi and ODM data mixture methods versus default proportions on MMLU accuracy and corpus perplexity at 6B and 30B token scales with Qwen2.5-1.5B

## Table Caption

Table: DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity vs. default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales.

## Description For Mixture Results

For domain mixture optimization, both DoReMi and ODM outperform fixed default proportions when pretraining Qwen2.5-1.5B on the SlimPajama dataset. The gains persist as the token budget grows from 6B to 30B, indicating that automatically learned domain proportions generalize across practical pretraining scales.

## Description For Efficiency Section

Despite unifying multiple algorithms under a single framework, DataFlex introduces no meaningful computational overhead. Benchmarks against the original TSDS implementation show DataFlex matching or beating standalone runtimes across all dataset scales. The unified abstraction layer amortizes per-algorithm setup costs and enables shared buffer reuse across components, yielding the observed efficiency gains.

## Conclusion Paragraph

Data-centric training is an emerging paradigm that treats data quality and composition as a first-class optimization target alongside model parameters. DataFlex removes the fragmentation that has hindered this field — different groups publishing incompatible implementations — and provides a reproducible, extensible platform that the community can build on. Built on LLaMA-Factory's production-grade infrastructure, DataFlex is ready for both research and practical LLM training pipelines.

## Takeaways Card Title

### Key Takeaways

## Bullet Point

DataFlex is the first framework to unify dynamic data selection, domain mixture, and sample reweighting under a single LLaMA-Factory-compatible interface.

Dynamic data selection consistently outperforms static full-data training on MMLU for both Mistral-7B and Llama-3.2-3B — confirming that what data you train on matters as much as how you train.

DoReMi and ODM domain mixture methods improve both accuracy and perplexity at 6B and 30B token scales, showing that the benefits persist as training scales up.

Unification introduces no runtime overhead — DataFlex matches or beats original standalone implementations in benchmarks.

## Footer Heading For Citation

### Citation

## Bibtex Citation

@article{liang2026dataflex,
  title   = {DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models},
  author  = {Liang, Hao and Zhao, Zhengyang and Qiang, Meiyi and Chen, Mingrui and others},
  journal = {arXiv preprint arXiv:2603.26164},
  year    = {2026}
}

## Footer Heading For Links

### Links

## Footer Acknowledgement

HTML generated by Flecto. Content based on the original paper. All figures and tables from the authors.
