CLIP
Contrastive Language-Image Pre-training (CLIP) is an OpenAI model that learns to connect images and text by training on 400 million image-text pairs collected from the internet.
How CLIP Works
CLIP consists of an image encoder and a text encoder trained jointly. Given a batch of N image-text pairs, CLIP maximizes the cosine similarity between the embeddings of the N matching pairs while minimizing it for the non-matching pairs, using a symmetric cross-entropy loss over the batch's similarity matrix (contrastive learning).
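The contrastive objective above can be sketched in NumPy. This is a minimal illustration, not CLIP's actual implementation: the embeddings would come from the two encoders, and the temperature (assumed 0.07 here) is a learned parameter in the real model.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of paired embeddings.

    image_emb, text_emb: (N, D) arrays where row i of each forms a matching pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(logits))          # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Training pushes the diagonal of the similarity matrix up and everything else down, which is what makes the two embedding spaces comparable.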
Zero-Shot Classification
CLIP can classify images into arbitrary categories without task-specific training. You encode candidate text labels (e.g. 'a photo of a dog', 'a photo of a cat') and compare each text embedding against the image embedding; the closest text match becomes the prediction.
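The comparison step can be sketched as follows. In a real pipeline the embeddings would come from CLIP's image and text encoders (e.g. via the `open_clip` or `transformers` libraries); the arrays here are stand-ins so the logic is self-contained.

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Return the label whose prompt embedding is most similar to the image.

    image_emb: (D,) image embedding (stand-in for CLIP's image encoder output).
    label_embs: (K, D) embeddings of prompts like 'a photo of a dog'.
    labels: list of K label strings, aligned with label_embs rows.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = txt @ img                  # cosine similarity to each label prompt
    return labels[int(np.argmax(sims))]
```

Because no classifier head is trained, swapping in a new set of categories is just a matter of encoding new prompt strings.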
Impact
CLIP bridged the gap between vision and language, enabling text-based image search, guidance for image generation (CLIP re-ranks candidates in DALL-E, and Stable Diffusion uses CLIP's text encoder to condition generation), and zero-shot transfer to new visual tasks.