Pre-Training
The initial phase of training a model on a large, general-purpose dataset to learn broad representations before fine-tuning on specific tasks.
For LLMs
Pre-training involves next-token prediction on trillions of tokens drawn from the internet, books, code, and other sources. This phase typically costs millions of dollars in compute and produces a model with broad knowledge but no optimization for any particular task.
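The next-token prediction objective is just the average cross-entropy between the model's predicted distribution at each position and the token that actually comes next. A minimal sketch in pure Python (the toy logits and tokens below are illustrative, not from any real model):

```python
import math

def next_token_loss(logits, tokens):
    """Average cross-entropy for next-token prediction.

    logits: one list of vocabulary scores per position;
    tokens: token ids. Scores at position t predict token t+1,
    so the last position has no target and is dropped.
    """
    total = 0.0
    for scores, target in zip(logits[:-1], tokens[1:]):
        # log of the softmax normalizer, computed stably
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # negative log-probability assigned to the true next token
        total += log_z - scores[target]
    return total / (len(tokens) - 1)

# toy example: vocabulary of 5 tokens, sequence of 4 token ids
tokens = [1, 3, 0, 2]
uniform_logits = [[0.0] * 5 for _ in tokens]
loss = next_token_loss(uniform_logits, tokens)  # ≈ log(5) ≈ 1.609
```

With uniform logits the model is maximally uncertain, so the loss equals log(vocab_size); training drives it down by sharpening the distribution toward the observed continuations.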
The Pre-Train Then Fine-Tune Paradigm
First introduced by ULMFiT and popularized by BERT and GPT, this two-stage approach is now standard: pre-train on massive data to acquire general capabilities, then fine-tune (or align) the resulting model for specific tasks. It is vastly more efficient than training from scratch for each task, because the expensive general-purpose stage is paid for once and amortized across every downstream use.
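The efficiency gain can be seen even in a toy setting: a model initialized from pre-trained weights needs far fewer steps on a small task dataset than one trained from scratch. A sketch with a one-parameter linear model standing in for the network (the slopes 2.0 and 2.2 and all dataset sizes are arbitrary choices for illustration):

```python
import random

def train(w, data, lr=0.05, steps=50):
    # plain gradient descent on mean squared error for y ≈ w * x
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

random.seed(0)

# stage 1: "pre-train" on a large, general dataset (toy stand-in)
big = [(x, 2.0 * x + random.gauss(0, 0.1))
       for x in (random.gauss(0, 1) for _ in range(1000))]
w_pre = train(0.0, big)

# stage 2: fine-tune on a small task whose target slope is slightly
# shifted; starting from w_pre, a handful of steps suffices, while
# training from scratch on the same budget lags far behind
task = [(x, 2.2 * x + random.gauss(0, 0.1))
        for x in (random.gauss(0, 1) for _ in range(20))]
w_ft = train(w_pre, task, steps=5)
w_scratch = train(0.0, task, steps=5)
```

After the same five fine-tuning steps, `w_ft` sits much closer to the task's true slope than `w_scratch`, which is the whole argument for the paradigm in miniature: general pre-training puts the parameters near every related task's solution.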