Data Leakage
Data leakage occurs when information that would not be available at prediction time improperly influences model training, producing overly optimistic evaluation results that don't hold up in production.
Common Causes
Temporal leakage: using future data to predict the past.
Preprocessing (scaling, encoding) fit on the full dataset before the train/test split.
Duplicate or near-duplicate entries shared between train and test splits.
Target variable information encoded directly in a feature.
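The preprocessing cause above can be made concrete with a minimal sketch: standardizing features before splitting lets test-row statistics (mean, standard deviation) leak into the training data. Only NumPy is assumed; the 80/20 split point is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# LEAKY: scaler statistics are computed over ALL rows,
# including the rows that will later become the test set.
mean_all, std_all = X.mean(axis=0), X.std(axis=0)
X_scaled_leaky = (X - mean_all) / std_all
X_test_leaky = X_scaled_leaky[80:]

# SAFE: split first, then fit the scaler on the training rows only
# and reuse those training statistics to transform the test rows.
X_train, X_test = X[:80], X[80:]
mean_tr, std_tr = X_train.mean(axis=0), X_train.std(axis=0)
X_test_safe = (X_test - mean_tr) / std_tr

# The two scaled test sets differ, proving that the leaky version
# used information from the test rows themselves.
print(np.allclose(X_test_leaky, X_test_safe))  # -> False
```

The difference is small here, which is exactly why this bug is easy to miss: it silently inflates evaluation scores rather than crashing.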
Why It's Dangerous
Models trained on leaked data appear to perform brilliantly in evaluation but fail catastrophically in production, because the model has effectively "cheated": it relied on information that won't exist at prediction time.
Prevention
Always split the data before any preprocessing.
Use temporal splits for time-series data.
Check for near-duplicates across splits.
Be skeptical of suspiciously high performance.
Use proper cross-validation, with preprocessing fit inside each fold.
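The prevention steps above can be sketched with scikit-learn, which is the common way to keep preprocessing inside the cross-validation loop: a `Pipeline` refits the scaler on each training fold, so validation rows never contribute statistics. This is a sketch assuming scikit-learn is installed; the dataset is synthetic.

```python
# Leakage-safe evaluation: StandardScaler is refit on every training
# fold inside cross_val_score, never on the held-out fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

For time-series data, swap `cv=5` for `sklearn.model_selection.TimeSeriesSplit`, which only ever validates on rows that come after the training rows.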