AI Glossary

Data Labeling

The process of annotating raw data with meaningful tags or categories, creating the labeled datasets needed to train supervised machine learning models.

Methods

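Manual labeling: Human annotators tag each example by hand. This is usually the most accurate option, but it is slow and expensive.

Semi-automated labeling: A model pre-labels the data, and humans review and correct its output.

Crowdsourcing: Work is distributed to many annotators through platforms and services such as Scale AI, Labelbox, and Amazon Mechanical Turk (MTurk).

Programmatic labeling: Rules and heuristics written as code assign labels automatically, as with Snorkel's labeling functions.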
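To make the programmatic approach concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API: the three labeling functions, the toy spam/ham task, and the `weak_label` helper are all hypothetical, and the votes are combined by a simple majority rather than by a learned label model.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion

# Hypothetical labeling functions for a toy spam/ham task (illustrative only).
def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN

def lf_has_link(text):
    return "spam" if "http" in text.lower() else ABSTAIN

def lf_greeting(text):
    return "ham" if text.lower().startswith(("hi", "hello")) else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_has_link, lf_greeting]

def weak_label(text):
    """Majority vote over non-abstaining functions; ABSTAIN on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    # A tie between the top two labels is treated as "no label".
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]

if __name__ == "__main__":
    for message in ["Get a FREE prize at http://example.com",
                    "Hello, are we still meeting?",
                    "See you tomorrow"]:
        print(repr(message), "->", weak_label(message))
```

Examples on which every function abstains, or on which the votes tie, stay unlabeled and can be routed to human annotators.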

Challenges

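Quality control: Labels must be consistent across annotators, which is commonly checked by measuring inter-annotator agreement.

Scale: Annotation workflows have to cope with millions of examples.

Subjectivity: Some labels have no single correct answer (is this text offensive?).

Domain expertise: Medical, legal, and similar data often require specialist annotators.

Cost: Labeling can be the most expensive part of an ML project.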
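One common way to quantify inter-annotator agreement between two annotators is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The sketch below is a plain-Python illustration; the function name `cohens_kappa` and the example labels are assumptions made for this entry.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    annotator_1 = ["offensive", "offensive", "ok", "ok"]
    annotator_2 = ["offensive", "ok", "ok", "ok"]
    print(cohens_kappa(annotator_1, annotator_2))
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. Low values on a subjective task often point to unclear labeling guidelines rather than careless annotators.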

Trends

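LLM-based labeling: LLMs are increasingly used to generate labels automatically (synthetic labeling).

Active learning: The model selects the most informative examples for annotation, which reduces labeling effort.

Self-supervised pre-training: Models learn representations from unlabeled data, which reduces the amount of labeled data needed.

Foundation models: Pre-trained models enable few-shot learning from only a handful of labeled examples.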
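Active learning can be illustrated with uncertainty sampling, one of several possible selection strategies: the examples whose predicted class probabilities have the highest entropy are sent to annotators first. This is a sketch under assumptions; the names `entropy` and `select_for_labeling`, the example ids, and the probabilities are hypothetical, and in practice the probabilities would come from the current model.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool_probs, k):
    """Return the ids of the k unlabeled examples the model is least sure about.

    pool_probs maps an example id to the model's predicted class probabilities.
    """
    ranked = sorted(pool_probs, key=lambda ex: entropy(pool_probs[ex]), reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    # Hypothetical model outputs for three unlabeled examples.
    pool = {
        "a": [0.99, 0.01],  # confident prediction, low labeling priority
        "b": [0.50, 0.50],  # maximally uncertain
        "c": [0.70, 0.30],
    }
    print(select_for_labeling(pool, k=2))
```

After the selected examples are labeled, the model is retrained and the selection step is repeated on the remaining pool.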

Last updated: March 5, 2026