Data Leakage
Data leakage occurs when information that would not be available at prediction time improperly influences model training, producing overly optimistic evaluation results that don't hold up in production.
Common Causes
Temporal leakage: using future data to predict the past.
Preprocessing (scaling, encoding) fit on the full dataset before the train/test split.
Duplicate or near-duplicate entries shared between train and test splits.
Target variable information encoded directly in a feature.
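The preprocessing cause above can be made concrete with a minimal sketch: standardizing features before splitting lets test-row statistics (mean, standard deviation) leak into the training data. Only NumPy is assumed; the 80/20 split point is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# LEAKY: scaler statistics are computed over ALL rows,
# including the rows that will later become the test set.
mean_all, std_all = X.mean(axis=0), X.std(axis=0)
X_scaled_leaky = (X - mean_all) / std_all
X_test_leaky = X_scaled_leaky[80:]

# SAFE: split first, then fit the scaler on the training rows only
# and reuse those training statistics to transform the test rows.
X_train, X_test = X[:80], X[80:]
mean_tr, std_tr = X_train.mean(axis=0), X_train.std(axis=0)
X_test_safe = (X_test - mean_tr) / std_tr

# The two scaled test sets differ, proving that the leaky version
# used information from the test rows themselves.
print(np.allclose(X_test_leaky, X_test_safe))  # -> False
```

The difference is small here, which is exactly why this bug is easy to miss: it silently inflates evaluation scores rather than crashing.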
Why It's Dangerous
Models trained on leaked data appear to perform brilliantly in evaluation but fail catastrophically in production, because the model has effectively "cheated": it relied on information that won't exist at prediction time.
Prevention
Always split the data before any preprocessing.
Use temporal splits for time-series data.
Check for near-duplicates across splits.
Be skeptical of suspiciously high performance.
Use proper cross-validation, with preprocessing fit inside each fold.
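The prevention steps above can be sketched with scikit-learn, which is the common way to keep preprocessing inside the cross-validation loop: a `Pipeline` refits the scaler on each training fold, so validation rows never contribute statistics. This is a sketch assuming scikit-learn is installed; the dataset is synthetic.

```python
# Leakage-safe evaluation: StandardScaler is refit on every training
# fold inside cross_val_score, never on the held-out fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

For time-series data, swap `cv=5` for `sklearn.model_selection.TimeSeriesSplit`, which only ever validates on rows that come after the training rows.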