Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data over the course of training. However, existing approaches are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. We present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms: dynamic sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and supports large-scale settings including DeepSpeed ZeRO-3.
Traditional LLM training optimizes model parameters while keeping the training data fixed. Data-centric training flips this perspective: it treats the data pipeline itself as an optimization target. Which samples are most informative for the current model state? What domain proportions maximize generalization? Which training examples deserve more gradient weight? DataFlex automates all three of these questions during a single training run.
Single framework unifying three data-centric paradigms previously scattered across isolated codebases with inconsistent APIs.
Plugs directly into LLaMA-Factory via YAML config. No code changes required to the standard training pipeline.
Full support for large-scale distributed training with DeepSpeed ZeRO-3 and FSDP.
Dynamic data selection consistently outperforms static full-data training on MMLU across Mistral-7B and Llama-3.2-3B.
DataFlex is structured in three layers: the Base layer inherits model management, data processing, and optimizers directly from LLaMA-Factory with DeepSpeed/FSDP support; the Trainer layer introduces three unified trainer abstractions (SelectTrainer, MixTrainer, WeightTrainer); the Component layer provides pluggable, swappable algorithm modules, enabling researchers to compare methods side-by-side without integration overhead.
LLaMA-Factory is a widely-used open-source framework for efficiently fine-tuning and pretraining large language models. It abstracts the boilerplate of distributed training, dataset loading, and optimizer configuration, and supports dozens of model architectures including LLaMA, Mistral, Qwen, and Phi. DataFlex builds directly on top of LLaMA-Factory's trainer infrastructure, inheriting all of its capabilities while adding a data-centric optimization layer above them.
Iteratively identifies and selects the most informative training samples at each step. DataFlex supports LESS, TSDS, and custom selector components, all sharing a unified interface for scoring and filtering.
Dynamically adjusts the sampling proportions across data domains (e.g., SlimPajama subsets) during pretraining. DoReMi and ODM are the built-in mixture algorithms, with a mixer abstraction for custom strategies.
Assigns per-sample loss weights during gradient updates based on model-dependent quality signals. Pluggable weighter components enable custom reweighting strategies with minimal boilerplate.
Before unified frameworks, each research group built their data-centric method on top of their own training codebase. To compare Method A (built on PyTorch Lightning) with Method B (built on Hugging Face Trainer), a researcher had to either port one method to the other's codebase (which introduces implementation bugs) or accept that the comparison is confounded by infrastructure differences. DataFlex's pluggable architecture means all methods share exactly the same training loop, distributed training setup, and data loading pipeline. When results differ, it's because the algorithms differ, not the infrastructure.
DataFlex integrates via a minimal YAML configuration block appended to the standard LLaMA-Factory config. Setting train_type, component_name, and component-specific hyperparameters activates dynamic data optimization; no changes to training code are required. The example shows DoReMi domain mixture on Qwen2.5-0.5B pretraining on wiki_demo and c4_demo datasets.
### model
model_name_or_path: Qwen2.5-0.5B
### method
stage: pt
finetuning_type: full
deepspeed: ds_z3_config.json
### dataset
dataset: wiki_demo, c4_demo
template: qwen
### train
learning_rate: 5.0e-5
num_train_epochs: 1.0
### dataflex
train_type: dynamic_mix
component_name: doremi
mixture_sample_rule: mixture
init_mixture_proportions: [0.5, 0.5]
warmup_step: 100
update_step: 200
update_times: 3
Language model pretraining datasets like SlimPajama combine multiple text domains: Wikipedia, GitHub code, C4 web text, books, ArXiv papers, etc. The proportion of each domain in each training batch is a critical hyperparameter: too much code makes the model worse at natural language; too little Wikipedia hurts factual recall. DoReMi (Domain Reweighting with Minimax Optimization) uses a reference model trained on uniform proportions as a baseline, then learns to upweight domains where the main model has the highest excess loss. ODM (Online Domain Mixing) takes a similar adaptive approach without requiring a reference model.
Append the ### dataflex block to your existing LLaMA-Factory config to enable dynamic data optimization.
Comprehensive experiments across seven dynamic data selection algorithms confirm that data-centric dynamic training delivers measurable, consistent improvements over static baselines. The gains hold across both 7B and 3B model scales, demonstrating that the unified DataFlex infrastructure does not introduce regressions compared to standalone implementations.
In static training, every sample in the dataset gets equal opportunity to influence the model, regardless of its actual utility at that training step. Dynamic selection uses the current model state to score each sample's expected learning value, for example by measuring how much the model's hidden representations change for that sample, or how large the gradient signal would be. High-utility samples get selected more frequently, low-utility ones less so. This is like a student focusing more study time on problems they haven't yet mastered, rather than re-reading pages they already know.
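The score-then-select loop can be sketched in a few lines. This is a minimal illustration of the idea, not the DataFlex API; using per-sample loss as the utility signal is one assumed choice among the proxies mentioned above.

```python
import numpy as np

def select_topk(score_fn, samples, k):
    """Illustrative dynamic-selection step: score every candidate
    against the current model state and keep the k highest-utility
    ones for the next training phase."""
    scores = np.array([score_fn(s) for s in samples])
    return np.argsort(-scores)[:k]  # indices of the top-k utility scores

# Toy usage: utility = current per-sample loss (higher = less mastered).
losses = {0: 0.2, 1: 1.7, 2: 0.9, 3: 2.4}
chosen = select_topk(lambda i: losses[i], list(losses), 2)
# chosen holds the two highest-loss samples: 3 and 1
```

In a real run, `score_fn` would query the live model (loss, representation drift, or gradient magnitude) and the selection would repeat periodically as the model state changes.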
MMLU (Massive Multitask Language Understanding) is a benchmark covering 57 academic subjects from elementary mathematics to professional law and medicine. It evaluates factual knowledge and reasoning across domains with multiple-choice questions. It's widely used because it's comprehensive (57 subjects), standardized (fixed evaluation protocol), and correlates well with real-world utility. In the context of DataFlex experiments, MMLU measures whether smarter data selection during fine-tuning leads to a model that knows more and reasons better, making it a strong downstream signal for training quality.
For domain mixture optimization, both DoReMi and ODM outperform fixed default proportions when pretraining Qwen2.5-1.5B on the SlimPajama dataset. Benefits scale from 6B to 30B tokens, validating the approach at practical pretraining scales. This demonstrates that automatically learned domain proportions generalize across scales.
DoReMi's minimax objective upweights the domains where the main model's loss most exceeds that of a uniform-proportion reference model, ensuring that data-scarce but important domains receive more training signal; ODM achieves similar adaptivity online, without requiring a reference model.
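The excess-loss upweighting described above can be sketched as a multiplicative-weights update. This is our illustration of the DoReMi-style rule, not the DataFlex implementation; the learning rate, clipping, and smoothing values are assumptions.

```python
import numpy as np

def doremi_update(weights, main_losses, ref_losses, lr=1.0, smoothing=1e-3):
    """One DoReMi-style domain-weight update: domains where the main
    model's loss exceeds the reference model's loss are upweighted."""
    excess = np.maximum(main_losses - ref_losses, 0.0)  # clipped excess loss
    logits = np.log(weights) + lr * excess              # multiplicative update
    new = np.exp(logits - logits.max())
    new /= new.sum()                                    # renormalize to a distribution
    k = len(weights)
    return (1 - smoothing) * new + smoothing / k        # smooth toward uniform

# Three domains, starting from uniform proportions.
w = np.full(3, 1 / 3)
w = doremi_update(w,
                  main_losses=np.array([2.1, 1.4, 1.9]),
                  ref_losses=np.array([1.8, 1.5, 1.6]))
# Domains 0 and 2 (positive excess loss) end up above 1/3.
```

The smoothing term keeps every domain's probability strictly positive, so no domain is ever starved entirely.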
SlimPajama is a cleaned, deduplicated version of the RedPajama dataset, containing approximately 627 billion tokens of text across seven domains: Commoncrawl, C4, GitHub, Books, ArXiv, Wikipedia, and Stackexchange. It was created specifically to enable fair comparisons of language model pretraining strategies by providing a standardized, well-cleaned, multi-domain dataset. For domain mixture experiments, SlimPajama is ideal because its seven distinct domains have different intrinsic characteristics and sizes, making the domain proportion choice non-trivial. Previous work (DoReMi, ODM) also used SlimPajama, enabling direct comparisons.
Perplexity measures how surprised a language model is by a given text. Mathematically, it's the exponentiated average negative log-likelihood per token: a lower score means the model assigns higher probability to the actual next tokens. In pretraining evaluation, corpus-level perplexity is measured on held-out text from the same domains as the training data. A lower perplexity indicates the model has learned the statistical structure of the language better. For domain mixture experiments, perplexity is measured per domain (e.g., Wikipedia perplexity, code perplexity), making it possible to see whether the mixture optimization improved all domains or just traded one off against another.
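The definition above fits in a few lines; this sketch assumes per-token log-probabilities in natural log:

```python
import math

def perplexity(token_logprobs):
    """Corpus perplexity: the exponentiated average negative
    log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is, on average, as uncertain as a uniform choice among 4 tokens.
ppl = perplexity([math.log(0.25)] * 100)  # -> 4.0
```

The "uniform choice among N tokens" reading is what makes per-domain perplexities comparable when judging whether a mixture change helped one domain at another's expense.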
Despite unifying multiple algorithms under a single framework, DataFlex introduces no meaningful computational overhead. Benchmarks against the original TSDS implementation show DataFlex matching or beating standalone runtimes across all dataset scales. The unified abstraction layer amortizes per-algorithm setup costs and enables shared buffer reuse across components, leading to the observed efficiency gains.
Training large language models with billions of parameters requires more GPU memory than a single accelerator can hold. DeepSpeed's ZeRO (Zero Redundancy Optimizer) algorithm splits the model across multiple GPUs by sharding optimizer states, gradients, and model parameters across devices. ZeRO-3 is the most aggressive partitioning (all three are distributed), allowing models that are otherwise too large for any single GPU to be trained efficiently on a cluster. DataFlex's support for ZeRO-3 means that data-centric methods (which require auxiliary operations like embedding extraction and model inference) are fully compatible with modern multi-GPU training setups.
Some data selection algorithms (like LESS) score training samples by computing how much each sample's gradient would influence the model's performance on a target task. This requires computing gradients for candidate samples β an expensive operation. DataFlex unifies this operation across all algorithms that need it, caching and reusing gradient computations where possible. It also handles the engineering complexity of performing this within a DeepSpeed ZeRO-3 context, where model parameters are distributed across GPUs and must be gathered before gradient computation. This is a significant engineering contribution: making gradient-based data selection feasible at scale.
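A gradient-influence score of this kind can be sketched as follows. The cosine-similarity form is our simplification of LESS-style scoring, not the exact DataFlex implementation; in practice the gradients would be low-dimensional projections of per-sample LoRA or full-model gradients.

```python
import numpy as np

def influence_score(candidate_grad, target_grad):
    """Score a candidate training sample by how well its gradient
    aligns with the gradient of the target-task loss (cosine
    similarity): aligned gradients suggest training on the candidate
    would reduce target-task loss."""
    num = float(candidate_grad @ target_grad)
    denom = np.linalg.norm(candidate_grad) * np.linalg.norm(target_grad)
    return num / (denom + 1e-12)  # guard against zero gradients

target = np.array([1.0, -2.0, 0.5])
aligned = influence_score(np.array([2.0, -4.0, 1.0]), target)   # ~ +1.0
opposed = influence_score(np.array([-1.0, 2.0, -0.5]), target)  # ~ -1.0
```

Caching these projected gradients, rather than recomputing them per algorithm, is where a shared infrastructure layer saves most of the cost.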
Data-centric training is an emerging paradigm that treats data quality and composition as a first-class optimization target alongside model parameters. DataFlex removes the fragmentation that has hindered this field (different groups publishing incompatible implementations) and provides a reproducible, extensible platform that the community can build on. Built on LLaMA-Factory's production-grade infrastructure, DataFlex is ready for both research and practical LLM training pipelines.
Model-centric AI assumes the training dataset is fixed and focuses on improving the model architecture, loss functions, and optimization algorithms. Data-centric AI asks: given a fixed model architecture, how can we improve performance by improving the data? This includes cleaning labels, removing duplicates, curating diverse examples, and β as in DataFlex β dynamically adjusting which data influences training at each step. The data-centric view has gained traction because practitioners often find that data quality improvements yield larger gains than architectural tweaks, especially once the model architecture is mature.
DataFlex provides abstract base classes for each paradigm: BaseSelector, BaseMixer, and BaseWeighter. To add a new algorithm, a researcher subclasses the appropriate base, implements the required methods (typically a scoring function and an update step), and registers the component with a name string. The YAML config then references this name string to activate the custom algorithm. The base class handles all the infrastructure concerns (distributed training, checkpoint compatibility, device placement, and hook injection into the training loop), so the researcher only needs to implement the algorithm logic itself, typically 50 to 200 lines of Python.