Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data over the course of training. However, existing approaches are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. We present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms: dynamic sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and supports large-scale settings including DeepSpeed ZeRO-3.
Traditional LLM training optimizes model parameters while keeping the training data fixed. Data-centric training flips this perspective: it treats the data pipeline itself as an optimization target. Which samples are most informative for the current model state? What domain proportions maximize generalization? Which training examples deserve more gradient weight? DataFlex automates all three of these questions during a single training run.
Single framework unifying three data-centric paradigms previously scattered across isolated codebases with inconsistent APIs.
Plugs directly into LLaMA-Factory via YAML config. No code changes required to the standard training pipeline.
Full support for large-scale distributed training with DeepSpeed ZeRO-3 and FSDP.
Dynamic data selection consistently outperforms static full-data training on MMLU across Mistral-7B and Llama-3.2-3B.
DataFlex is structured in three layers: the Base layer inherits model management, data processing, and optimizers directly from LLaMA-Factory with DeepSpeed/FSDP support; the Trainer layer introduces three unified trainer abstractions (SelectTrainer, MixTrainer, WeightTrainer); the Component layer provides pluggable, swappable algorithm modules, enabling researchers to compare methods side-by-side without integration overhead.
LLaMA-Factory is a widely-used open-source framework for efficiently fine-tuning and pretraining large language models. It abstracts the boilerplate of distributed training, dataset loading, and optimizer configuration, and supports dozens of model architectures including LLaMA, Mistral, Qwen, and Phi. DataFlex builds directly on top of LLaMA-Factory's trainer infrastructure, inheriting all of its capabilities while adding a data-centric optimization layer above them.
Iteratively identifies and selects the most informative training samples at each step. DataFlex supports LESS, TSDS, and custom selector components, all sharing a unified interface for scoring and filtering.
Dynamically adjusts the sampling proportions across data domains (e.g., SlimPajama subsets) during pretraining. DoReMi and ODM are the built-in mixture algorithms, with a mixer abstraction for custom strategies.
Assigns per-sample loss weights during gradient updates based on model-dependent quality signals. Pluggable weighter components enable custom reweighting strategies with minimal boilerplate.
Before unified frameworks, each research group built their data-centric method on top of their own training codebase. To compare Method A (built on PyTorch Lightning) with Method B (built on Hugging Face Trainer), a researcher had to either port one method to the other's codebase (which introduces implementation bugs) or accept that the comparison is confounded by infrastructure differences. DataFlex's pluggable architecture means all methods share exactly the same training loop, distributed training setup, and data loading pipeline. When results differ, it's because the algorithms differ, not the infrastructure.
DataFlex integrates via a minimal YAML configuration block appended to the standard LLaMA-Factory config. Setting train_type, component_name, and component-specific hyperparameters activates dynamic data optimization; no changes to training code are required. The example shows DoReMi domain mixture on Qwen2.5-0.5B pretraining on wiki_demo and c4_demo datasets.
### model
model_name_or_path: Qwen2.5-0.5B
### method
stage: pt
finetuning_type: full
deepspeed: ds_z3_config.json
### dataset
dataset: wiki_demo, c4_demo
template: qwen
### train
learning_rate: 5.0e-5
num_train_epochs: 1.0
### dataflex
train_type: dynamic_mix
component_name: doremi
mixture_sample_rule: mixture
init_mixture_proportions: [0.5, 0.5]
warmup_step: 100
update_step: 200
update_times: 3
Language model pretraining datasets like SlimPajama combine multiple text domains: Wikipedia, GitHub code, C4 web text, books, ArXiv papers, etc. The proportion of each domain in each training batch is a critical hyperparameter: too much code makes the model worse at natural language; too little Wikipedia hurts factual recall. DoReMi (Domain Reweighting with Minimax Optimization) uses a reference model trained on uniform proportions as a baseline, then learns to upweight domains where the main model has the highest excess loss. ODM (Online Domain Mixing) takes a similar adaptive approach without requiring a reference model.
Append the ### dataflex block to your existing LLaMA-Factory config to enable dynamic data optimization.
Comprehensive experiments across seven dynamic data selection algorithms confirm that data-centric dynamic training delivers measurable, consistent improvements over static baselines. The gains hold across both 7B and 3B model scales, demonstrating that the unified DataFlex infrastructure does not introduce regressions compared to standalone implementations.
In static training, every sample in the dataset gets equal opportunity to influence the model, regardless of its actual utility at that training step. Dynamic selection uses the current model state to score each sample's expected learning value, for example by measuring how much the model's hidden representations change for that sample, or how large the gradient signal would be. High-utility samples get selected more frequently, low-utility ones less so. This is like a student focusing more study time on problems they haven't yet mastered, rather than re-reading pages they already know.
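The score-then-select loop can be sketched in a few lines. This is a minimal illustration of the idea, not the DataFlex API; using per-sample loss as the utility signal is one assumed choice among the proxies mentioned above.

```python
import numpy as np

def select_topk(score_fn, samples, k):
    """Illustrative dynamic-selection step: score every candidate
    against the current model state and keep the k highest-utility
    ones for the next training phase."""
    scores = np.array([score_fn(s) for s in samples])
    return np.argsort(-scores)[:k]  # indices of the top-k utility scores

# Toy usage: utility = current per-sample loss (higher = less mastered).
losses = {0: 0.2, 1: 1.7, 2: 0.9, 3: 2.4}
chosen = select_topk(lambda i: losses[i], list(losses), 2)
# chosen holds the two highest-loss samples: 3 and 1
```

In a real run, `score_fn` would query the live model (loss, representation drift, or gradient magnitude) and the selection would repeat periodically as the model state changes.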
MMLU (Massive Multitask Language Understanding) is a benchmark covering 57 academic subjects from elementary mathematics to professional law and medicine. It evaluates factual knowledge and reasoning across domains with multiple-choice questions. It's widely used because it's comprehensive (57 subjects), standardized (fixed evaluation protocol), and correlates well with real-world utility. In the context of DataFlex experiments, MMLU measures whether smarter data selection during fine-tuning leads to a model that knows more and reasons better, making it a strong downstream signal for training quality.
For domain mixture optimization, both DoReMi and ODM outperform fixed default proportions when pretraining Qwen2.5-1.5B on the SlimPajama dataset. Benefits scale from 6B to 30B tokens, validating the approach at practical pretraining scales. This demonstrates that automatically learned domain proportions generalize across scales.
DoReMi's minimax objective upweights the domains where the main model's loss most exceeds that of a uniform-proportion reference model, ensuring that data-scarce but important domains receive more training signal; ODM achieves similar adaptivity online, without requiring a reference model.
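The excess-loss upweighting described above can be sketched as a multiplicative-weights update. This is our illustration of the DoReMi-style rule, not the DataFlex implementation; the learning rate, clipping, and smoothing values are assumptions.

```python
import numpy as np

def doremi_update(weights, main_losses, ref_losses, lr=1.0, smoothing=1e-3):
    """One DoReMi-style domain-weight update: domains where the main
    model's loss exceeds the reference model's loss are upweighted."""
    excess = np.maximum(main_losses - ref_losses, 0.0)  # clipped excess loss
    logits = np.log(weights) + lr * excess              # multiplicative update
    new = np.exp(logits - logits.max())
    new /= new.sum()                                    # renormalize to a distribution
    k = len(weights)
    return (1 - smoothing) * new + smoothing / k        # smooth toward uniform

# Three domains, starting from uniform proportions.
w = np.full(3, 1 / 3)
w = doremi_update(w,
                  main_losses=np.array([2.1, 1.4, 1.9]),
                  ref_losses=np.array([1.8, 1.5, 1.6]))
# Domains 0 and 2 (positive excess loss) end up above 1/3.
```

The smoothing term keeps every domain's probability strictly positive, so no domain is ever starved entirely.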
SlimPajama is a cleaned, deduplicated version of the RedPajama dataset, containing approximately 627 billion tokens of text across seven domains: Commoncrawl, C4, GitHub, Books, ArXiv, Wikipedia, and Stackexchange. It was created specifically to enable fair comparisons of language model pretraining strategies by providing a standardized, well-cleaned, multi-domain dataset. For domain mixture experiments, SlimPajama is ideal because its seven distinct domains have different intrinsic characteristics and sizes, making the domain proportion choice non-trivial. Previous work (DoReMi, ODM) also used SlimPajama, enabling direct comparisons.
Perplexity measures how surprised a language model is by a given text. Mathematically, it's the exponentiated average negative log-likelihood per token: a lower score means the model assigns higher probability to the actual next tokens. In pretraining evaluation, corpus-level perplexity is measured on held-out text from the same domains as the training data. A lower perplexity indicates the model has learned the statistical structure of the language better. For domain mixture experiments, perplexity is measured per domain (e.g., Wikipedia perplexity, code perplexity), making it possible to see whether the mixture optimization improved all domains or just traded one off against another.
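The definition above fits in a few lines; this sketch assumes per-token log-probabilities in natural log:

```python
import math

def perplexity(token_logprobs):
    """Corpus perplexity: the exponentiated average negative
    log-likelihood per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every token has perplexity 4:
# it is, on average, as uncertain as a uniform choice among 4 tokens.
ppl = perplexity([math.log(0.25)] * 100)  # -> 4.0
```

The "uniform choice among N tokens" reading is what makes per-domain perplexities comparable when judging whether a mixture change helped one domain at another's expense.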
Despite unifying multiple algorithms under a single framework, DataFlex introduces no meaningful computational overhead. Benchmarks against the original TSDS implementation show DataFlex matching or beating standalone runtimes across all dataset scales. The unified abstraction layer amortizes per-algorithm setup costs and enables shared buffer reuse across components, leading to the observed efficiency gains.
Training large language models with billions of parameters requires more GPU memory than a single accelerator can hold. DeepSpeed's ZeRO (Zero Redundancy Optimizer) algorithm splits the model across multiple GPUs by sharding optimizer states, gradients, and model parameters across devices. ZeRO-3 is the most aggressive partitioning (all three are distributed), allowing models that are otherwise too large for any single GPU to be trained efficiently on a cluster. DataFlex's support for ZeRO-3 means that data-centric methods (which require auxiliary operations like embedding extraction and model inference) are fully compatible with modern multi-GPU training setups.
Some data selection algorithms (like LESS) score training samples by computing how much each sample's gradient would influence the model's performance on a target task. This requires computing gradients for candidate samples β an expensive operation. DataFlex unifies this operation across all algorithms that need it, caching and reusing gradient computations where possible. It also handles the engineering complexity of performing this within a DeepSpeed ZeRO-3 context, where model parameters are distributed across GPUs and must be gathered before gradient computation. This is a significant engineering contribution: making gradient-based data selection feasible at scale.
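A gradient-influence score of this kind can be sketched as follows. The cosine-similarity form is our simplification of LESS-style scoring, not the exact DataFlex implementation; in practice the gradients would be low-dimensional projections of per-sample LoRA or full-model gradients.

```python
import numpy as np

def influence_score(candidate_grad, target_grad):
    """Score a candidate training sample by how well its gradient
    aligns with the gradient of the target-task loss (cosine
    similarity): aligned gradients suggest training on the candidate
    would reduce target-task loss."""
    num = float(candidate_grad @ target_grad)
    denom = np.linalg.norm(candidate_grad) * np.linalg.norm(target_grad)
    return num / (denom + 1e-12)  # guard against zero gradients

target = np.array([1.0, -2.0, 0.5])
aligned = influence_score(np.array([2.0, -4.0, 1.0]), target)   # ~ +1.0
opposed = influence_score(np.array([-1.0, 2.0, -0.5]), target)  # ~ -1.0
```

Caching these projected gradients, rather than recomputing them per algorithm, is where a shared infrastructure layer saves most of the cost.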
Data-centric training is an emerging paradigm that treats data quality and composition as a first-class optimization target alongside model parameters. DataFlex removes the fragmentation that has hindered this field (different groups publishing incompatible implementations) and provides a reproducible, extensible platform that the community can build on. Built on LLaMA-Factory's production-grade infrastructure, DataFlex is ready for both research and practical LLM training pipelines.
Model-centric AI assumes the training dataset is fixed and focuses on improving the model architecture, loss functions, and optimization algorithms. Data-centric AI asks: given a fixed model architecture, how can we improve performance by improving the data? This includes cleaning labels, removing duplicates, curating diverse examples, and β as in DataFlex β dynamically adjusting which data influences training at each step. The data-centric view has gained traction because practitioners often find that data quality improvements yield larger gains than architectural tweaks, especially once the model architecture is mature.
DataFlex provides abstract base classes for each paradigm: BaseSelector, BaseMixer, and BaseWeighter. To add a new algorithm, a researcher subclasses the appropriate base, implements the required methods (typically a scoring function and an update step), and registers the component with a name string. The YAML config then references this name string to activate the custom algorithm. The base class handles all the infrastructure concerns (distributed training, checkpoint compatibility, device placement, and hook injection into the training loop), so the researcher only needs to implement the algorithm logic itself, typically 50 to 200 lines of Python.