Data Labeling
The process of annotating raw data with meaningful tags or categories, creating the labeled datasets needed to train supervised machine learning models.
Methods
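Manual labeling: human annotators tag each example; typically the most accurate approach, but slow and expensive. Semi-automated labeling: a model pre-labels the data and humans review and correct its output. Crowdsourcing and labeling platforms: annotation work is managed or outsourced through services such as Amazon Mechanical Turk, Scale AI, or Labelbox. Programmatic labeling: heuristics written as code assign labels in bulk, as with Snorkel's labeling functions.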
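To make the programmatic approach concrete, here is a minimal sketch of the labeling-function idea in plain Python. It does not use Snorkel's actual API; the function names, the keyword rules, and the "complaint"/"praise" label set are hypothetical, and the votes are combined by a simple majority rather than by a learned label model.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when its rule does not apply


# Hypothetical heuristics for a customer-message classifier.
def lf_mentions_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN


def lf_mentions_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return "complaint" if text.count("!") >= 3 else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_mentions_thanks, lf_many_exclamations]


def weak_label(text, labeling_functions=LABELING_FUNCTIONS):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics with equal support
    return ranked[0][0]
```

Examples on which every heuristic abstains, or on which the heuristics tie, stay unlabeled and can be routed to human annotators.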
Challenges
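Quality control: measuring and maintaining consistency between annotators (inter-annotator agreement). Scale: labeling millions of examples. Subjectivity: some labels have no single correct answer (for example, whether a text is offensive). Domain expertise: medical and legal data often require specialist annotators. Cost: labeling can be the most expensive part of an ML project.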
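Inter-annotator agreement is commonly quantified with Cohen's kappa, which corrects the raw agreement rate between two annotators for the agreement expected by chance. The sketch below is one plain-Python way to compute it; the function name and the example labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items on which the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same class independently,
    # given each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["offensive", "offensive", "ok", "ok"]
annotator_2 = ["offensive", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.75 observed, 0.5 by chance
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance; low values on subjective tasks often indicate that the labeling guidelines need tightening.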
Trends
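LLMs are increasingly used to generate labels automatically (synthetic labeling). Active learning reduces labeling effort by sending only the most informative examples to annotators. Self-supervised pre-training reduces the amount of labeled data needed. Foundation models enable few-shot learning from a handful of labeled examples.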
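One common active learning strategy is uncertainty sampling: the current model scores the unlabeled pool, and the examples it is least sure about are sent for labeling first. The sketch below assumes the model's class probabilities are already available as a dictionary; the function names and the example ids are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` most uncertain examples.

    predictions: dict mapping example id -> list of predicted class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda example_id: entropy(predictions[example_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled examples.
pool = {
    "doc_a": [0.99, 0.01],  # confident prediction
    "doc_b": [0.50, 0.50],  # maximally uncertain
    "doc_c": [0.70, 0.30],
}
to_label = select_for_labeling(pool, budget=2)
```

After the selected examples are labeled, the model is retrained and the pool is re-scored, so each round of annotation is spent where the model is weakest.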