Data Curation
The process of selecting, cleaning, and organizing training data to maximize AI model quality.
Overview
Data curation is the process of selecting, cleaning, deduplicating, filtering, and organizing data for AI model training. Improving data quality often does more for model performance than changing the model architecture, which is why the adage 'garbage in, garbage out' applies with particular force to machine learning.
Key Practices
Effective data curation includes removing duplicates and near-duplicates, filtering toxic or low-quality content, balancing representation across topics and demographics, removing personally identifiable information (PII), keeping the data up to date, and applying quality classifiers to score and filter documents. Open dataset projects such as The Pile, RedPajama, and FineWeb demonstrate the significant effort required to create high-quality training datasets.
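A few of these practices can be sketched in a small, self-contained Python example. It is an illustrative toy, not the pipeline used by any of the projects named above: the function names (normalize, shingles, jaccard, curate), the thresholds (min_words, near_dup_threshold), the email-only PII pattern, and the [EMAIL] placeholder are all assumptions made for this sketch. It covers exact deduplication by hashing normalized text, near-duplicate removal by Jaccard similarity over word trigrams, a crude length-based quality filter, and redaction of email addresses.

```python
import hashlib
import re

# Assumption: a simple pattern that catches common email addresses only.
# Real PII removal covers many more categories (names, phone numbers, IDs).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams used for near-duplicate comparison."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def curate(docs, min_words=5, near_dup_threshold=0.8):
    """Filter short documents, drop exact and near duplicates, redact emails.

    The pairwise near-duplicate check is quadratic in the number of kept
    documents, so it only suits small collections.
    """
    seen_hashes = set()
    kept_shingles = []
    kept = []
    for doc in docs:
        norm = normalize(doc)

        # Crude quality filter: skip very short documents.
        if len(norm.split()) < min_words:
            continue

        # Exact deduplication on the normalized text.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue

        # Near-duplicate check against every document kept so far.
        doc_shingles = shingles(norm)
        if any(jaccard(doc_shingles, other) >= near_dup_threshold
               for other in kept_shingles):
            continue

        seen_hashes.add(digest)
        kept_shingles.append(doc_shingles)
        # Redact email addresses in the text that is actually kept.
        kept.append(EMAIL_RE.sub("[EMAIL]", doc))
    return kept
```

Calling curate on a list of strings returns the surviving documents in their original order, with email addresses replaced by the placeholder. Because the pairwise comparison does not scale, large pipelines typically replace it with approximate methods such as MinHash with locality-sensitive hashing, and replace the length check with trained quality classifiers.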