Synthetic Data
Artificially generated data that mimics the statistical properties of real data, used when real data is scarce, expensive, sensitive, or imbalanced.
Generation Methods
Rule-based generation, statistical sampling, GANs (for images), LLMs (for text), simulation engines (for robotics/autonomous vehicles), and domain-specific generators.
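As a rough illustration of the two simplest methods listed above, the sketch below fits a Gaussian to a handful of values and samples from it (statistical sampling), then builds records from a hand-written constraint (rule-based generation). The function names, the height values, and the retirement rule are all hypothetical and chosen only for the example. Real data rarely follows a single Gaussian, so treat this as a starting point rather than a recommended generator.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate mean and sample standard deviation from real observations."""
    return statistics.mean(values), statistics.stdev(values)

def sample_synthetic(values, n, seed=0):
    """Statistical sampling: draw n synthetic values from the fitted Gaussian."""
    rng = random.Random(seed)
    mu, sigma = fit_gaussian(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]

def rule_based_record(rng):
    """Rule-based generation: fields follow hand-written constraints."""
    age = rng.randint(18, 90)
    # Hypothetical rule: only people aged 65 or over may be marked retired.
    status = rng.choice(["retired", "employed"]) if age >= 65 else "employed"
    return {"age": age, "status": status}

# Made-up illustrative values standing in for a small real dataset.
real_heights_cm = [162.0, 175.5, 168.2, 181.0, 170.3, 166.8]

synthetic_heights = sample_synthetic(real_heights_cm, n=1000)
records = [rule_based_record(random.Random(i)) for i in range(5)]
```

Fixing the seed keeps the output reproducible, which helps when synthetic data is used for testing and validation.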
Use Cases
Training data augmentation, testing and validation, privacy-preserving data sharing, addressing class imbalance, and generating edge cases that are rare in real data.
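One way to address class imbalance is to create new minority-class points by interpolating between existing ones. The sketch below is a simplified version of that idea: it interpolates between random pairs of minority samples, whereas SMOTE, the better-known technique, interpolates toward nearest neighbors. The function name `oversample_minority` and the sample points are hypothetical.

```python
import random

def oversample_minority(minority, target_count, seed=0):
    """Create synthetic points until the minority class reaches target_count.

    Each synthetic point lies on the line segment between two randomly
    chosen real minority points (simplified; no nearest-neighbor search).
    """
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_count:
        a, b = rng.sample(minority, 2)   # two distinct real points
        t = rng.random()                 # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Made-up 2-D feature vectors for a minority class with 3 samples.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]]

# Assume the majority class has 10 samples, so 7 synthetic points are needed.
new_points = oversample_minority(minority, target_count=10)
```

Because every new point sits between two real ones, this approach cannot produce values outside the range already observed, so it does not help with generating genuinely rare edge cases.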
In LLM Training
Synthetic data generated by stronger models is increasingly used to train or fine-tune smaller models. This approach, a form of model distillation through data, is behind many capable open-source LLMs. Quality filtering is crucial, because errors, duplicates, and low-diversity outputs in the synthetic set are learned by the model trained on it.
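A minimal sketch of what a quality filter for synthetic text might look like is shown below. It applies three cheap checks: a word-count bound, removal of duplicates that differ only in case or whitespace, and an optional score threshold. The function names, thresholds, and `score_fn` hook are assumptions made for illustration; production pipelines typically use stronger signals such as a trained classifier or a judge model.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def filter_synthetic(samples, min_words=3, max_words=200,
                     score_fn=None, min_score=0.5):
    """Keep samples that pass length, duplicate, and optional score checks.

    score_fn is a hypothetical callable returning a quality score for a
    text; when it is None, the score check is skipped.
    """
    seen = set()
    kept = []
    for text in samples:
        key = normalize(text)
        n_words = len(key.split())
        if not (min_words <= n_words <= max_words):
            continue  # too short or too long
        if key in seen:
            continue  # duplicate after normalization
        if score_fn is not None and score_fn(text) < min_score:
            continue  # judged low quality by the scorer
        seen.add(key)
        kept.append(text)
    return kept

candidates = [
    "Paris is the capital of France.",
    "paris is the capital of   france.",   # duplicate after normalization
    "Ok.",                                  # too short
    "Water boils at 100 degrees Celsius at sea level.",
]
filtered = filter_synthetic(candidates)
# filtered keeps the first and last candidates only
```

Exact-match deduplication misses paraphrases; catching those would need a similarity measure, which is outside the scope of this sketch.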