AI Glossary

Data Leakage

When information that would not be available at prediction time — for example, data from the test set or from the future — improperly influences model training, leading to overly optimistic performance estimates that don't hold in production.

Common Causes

- Using future data to predict the past (temporal leakage)
- Preprocessing (scaling, encoding) before splitting train/test
- Duplicate entries across train and test splits
- Target variable information encoded in features
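One of these causes — preprocessing before splitting — can be demonstrated directly. If a scaler's statistics are computed on the full dataset before the split, the test rows influence the transform applied to the training rows. A minimal sketch in plain Python (variable names are illustrative):

```python
import random
import statistics

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(100)]
train, test = data[:80], data[80:]

# Correct: scaling statistics come only from the training split.
mu, sigma = statistics.mean(train), statistics.pstdev(train)
train_ok = [(x - mu) / sigma for x in train]

# Leaky: statistics computed on ALL data, so the test rows leak
# into the transform applied to the training rows.
mu_all, sigma_all = statistics.mean(data), statistics.pstdev(data)
train_leaky = [(x - mu_all) / sigma_all for x in train]

diff = max(abs(a - b) for a, b in zip(train_ok, train_leaky))
print(f"max difference between correct and leaky scaling: {diff:.4f}")
```

The two versions of the training features differ, which means the leaky pipeline has absorbed information about the test distribution before training ever starts.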

Why It's Dangerous

Models with data leakage appear to perform brilliantly in evaluation but fail catastrophically in production. The model has essentially 'cheated' during training.

Prevention

- Always split data before any preprocessing
- Use temporal splits for time-series data
- Check for near-duplicates across splits
- Be skeptical of suspiciously high performance
- Use proper cross-validation
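Two of these checks — temporal splitting and duplicate detection — can be sketched with small helpers. `temporal_split` and `shared_keys` are hypothetical names used here for illustration, not functions from any particular library:

```python
def temporal_split(records, cutoff):
    """Split time-stamped records so every training row precedes the cutoff."""
    train = [r for r in records if r["t"] < cutoff]
    test = [r for r in records if r["t"] >= cutoff]
    return train, test

def shared_keys(train, test, key):
    """Return keys that appear in both splits — candidate duplicate leaks."""
    return {key(r) for r in train} & {key(r) for r in test}

records = [
    {"t": 1, "user": "a", "y": 0},
    {"t": 2, "user": "b", "y": 1},
    {"t": 3, "user": "a", "y": 1},  # same user reappears after the cutoff
    {"t": 4, "user": "c", "y": 0},
]

train, test = temporal_split(records, cutoff=3)
overlap = shared_keys(train, test, key=lambda r: r["user"])
print(len(train), len(test), overlap)  # 2 2 {'a'}
```

Whether shared keys count as leakage depends on the task: an entity appearing in both splits is fine for some problems, but for per-user prediction it can let the model memorize identities rather than learn generalizable patterns.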


Last updated: March 5, 2026