CLIP
Contrastive Language-Image Pre-training (CLIP) is an OpenAI model that learns to connect images and text by training on 400 million image-text pairs collected from the internet.
How CLIP Works
CLIP consists of an image encoder and a text encoder trained jointly. Given a batch of N image-text pairs, CLIP maximizes the cosine similarity between the embeddings of the N matching pairs while minimizing it for the non-matching pairs, using a symmetric cross-entropy loss over the batch's similarity matrix (contrastive learning).
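The contrastive objective above can be sketched in NumPy. This is a minimal illustration, not CLIP's actual implementation: the embeddings would come from the two encoders, and the temperature (assumed 0.07 here) is a learned parameter in the real model.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of paired embeddings.

    image_emb, text_emb: (N, D) arrays where row i of each forms a matching pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(logits))          # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Training pushes the diagonal of the similarity matrix up and everything else down, which is what makes the two embedding spaces comparable.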
Zero-Shot Classification
CLIP can classify images into arbitrary categories without task-specific training. You encode candidate text labels (e.g. 'a photo of a dog', 'a photo of a cat') and compare each text embedding against the image embedding; the closest text match becomes the prediction.
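The comparison step can be sketched as follows. In a real pipeline the embeddings would come from CLIP's image and text encoders (e.g. via the `open_clip` or `transformers` libraries); the arrays here are stand-ins so the logic is self-contained.

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Return the label whose prompt embedding is most similar to the image.

    image_emb: (D,) image embedding (stand-in for CLIP's image encoder output).
    label_embs: (K, D) embeddings of prompts like 'a photo of a dog'.
    labels: list of K label strings, aligned with label_embs rows.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = txt @ img                  # cosine similarity to each label prompt
    return labels[int(np.argmax(sims))]
```

Because no classifier head is trained, swapping in a new set of categories is just a matter of encoding new prompt strings.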
Impact
CLIP bridged the gap between vision and language, enabling text-based image search, guidance for image generation (CLIP re-ranks candidates in DALL-E, and Stable Diffusion uses CLIP's text encoder to condition generation), and zero-shot transfer to new visual tasks.