What is a Vision Transformer?

For most of the 2010s, Convolutional Neural Networks (CNNs) were the dominant architecture in computer vision. From image classification to object detection to facial recognition, CNNs set the standard on nearly every benchmark. Then, in 2020, a team at Google Brain asked a radical question: what if we took the transformer architecture -- the engine behind GPT and BERT that had revolutionized natural language processing -- and applied it directly to images? The result was the Vision Transformer (ViT), and it reshaped the field of computer vision.

The key insight behind ViT is deceptively simple. Transformers process sequences of tokens. Text is naturally a sequence of words. Images are not sequences at all -- they are two-dimensional grids of pixels. But what if you could turn an image into a sequence? That is exactly what ViT does: it chops an image into small patches, flattens each patch into a vector, and feeds the resulting sequence into a standard transformer. No convolutions needed. The approach sounds almost too simple to work, and yet it matches or surpasses CNNs on major image classification benchmarks when trained on sufficient data.

Image Patches as Tokens

The first step in a Vision Transformer is to divide the input image into a grid of fixed-size, non-overlapping patches. For a standard 224 by 224 pixel image, ViT typically uses 16 by 16 pixel patches, producing a sequence of 196 patches (14 rows times 14 columns). Each of these patches becomes the equivalent of a "word" in a text transformer.
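The patch arithmetic above is easy to check by hand. The following is a minimal sketch, assuming a square image whose side is an exact multiple of the patch size; the function name `patch_grid` is made up for illustration.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid(224, 16)
print(per_side, total)  # 14 196
```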

Each patch is then flattened into a one-dimensional vector and projected through a linear layer to create a patch embedding. If a patch is 16 by 16 pixels with 3 color channels (RGB), that is 16 x 16 x 3 = 768 values per patch. The linear projection maps this to the transformer's hidden dimension, which is 768 for ViT-Base. (That the two numbers match is a coincidence of this configuration; larger ViT variants use wider hidden dimensions.) The result is one embedding per patch that the transformer can process.
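Here is one way the flatten-and-project step could look in NumPy. It is a sketch, not a reference implementation: the helper name `patchify` is hypothetical, the image is random noise, and the projection weights are random stand-ins for parameters that a real model would learn.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns shape (num_patches, patch_size * patch_size * C), with patches
    ordered row by row starting from the top-left corner.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # stand-in for a real RGB image
patches = patchify(image, 16)          # shape (196, 768)

hidden_dim = 768                       # ViT-Base hidden size
# Random stand-ins for the learned projection weights and bias.
W_proj = rng.normal(0.0, 0.02, size=(patches.shape[1], hidden_dim))
b_proj = np.zeros(hidden_dim)

patch_embeddings = patches @ W_proj + b_proj   # shape (196, 768)
print(patches.shape, patch_embeddings.shape)
```

Many implementations express the same operation as a single convolution whose kernel size and stride both equal the patch size; the reshape-and-matmul version here is just the most literal reading of the description.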

But there is a problem: self-attention has no built-in notion of order, so once patches are flattened into a sequence, the transformer cannot tell where each patch came from in the original image. Two patches with identical pixels produce identical embeddings whether they sit in the top-left corner or the bottom-right. To solve this, ViT adds a positional embedding to each patch embedding -- a learnable vector that encodes each patch's position in the grid. The model learns during training that position 1 means "top-left" and position 196 means "bottom-right," preserving spatial information.
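A small sketch can make the effect of positional embeddings concrete. Everything here is illustrative: the embeddings are random rather than learned, and the first and last patches are forced to have the same content to mimic two identical-looking image regions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, hidden_dim = 196, 768

# Stand-in for projected patches; pretend patch 1 and patch 196 look identical.
patch_embeddings = rng.normal(size=(num_patches, hidden_dim))
patch_embeddings[195] = patch_embeddings[0]

# One vector per grid position (learned in a real model, random here).
pos_embeddings = rng.normal(0.0, 0.02, size=(num_patches, hidden_dim))

# Element-wise sum: the sequence shape does not change.
tokens = patch_embeddings + pos_embeddings

same_content = np.array_equal(patch_embeddings[0], patch_embeddings[195])
same_token = np.allclose(tokens[0], tokens[195])
print(same_content, same_token)  # True False
```

After the addition, the two identical patches are represented by different vectors, so later layers have something to distinguish them by.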

Finally, ViT prepends a special classification token (called the CLS token) to the sequence, just like BERT does for text classification. After processing through the transformer, the output corresponding to this CLS token is fed to a classification head to produce the final prediction.
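The CLS mechanism can be sketched as follows. The encoder is replaced by an identity placeholder, and the token and head weights are random, so the predicted class is meaningless; the point is only which row of the output feeds the classifier. In the original ViT formulation the position table also includes a slot for this extra token, which the sketch leaves out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, hidden_dim, num_classes = 196, 768, 1000

patch_tokens = rng.normal(size=(num_patches, hidden_dim))
cls_token = rng.normal(0.0, 0.02, size=(1, hidden_dim))  # learned in a real model

# Prepend the CLS token: the sequence grows from 196 to 197 tokens.
sequence = np.concatenate([cls_token, patch_tokens], axis=0)

# Placeholder for the transformer encoder (identity, for illustration only).
encoded = sequence

# Classification head: a linear layer applied to the CLS output only.
W_head = rng.normal(0.0, 0.02, size=(hidden_dim, num_classes))
b_head = np.zeros(num_classes)
logits = encoded[0] @ W_head + b_head
predicted_class = int(np.argmax(logits))
print(sequence.shape, logits.shape)
```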

Patch Size Matters

Smaller patches (like 8x8) give the model more tokens to work with and finer-grained detail, but they are expensive: halving the patch side quadruples the number of tokens, and because self-attention scales quadratically with sequence length, the attention cost grows by roughly 16x. Larger patches (like 32x32) are cheaper but lose fine detail. The 16x16 default is a practical sweet spot.
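To see how quickly the cost moves with patch size, count the tokens and the pairwise attention scores for a 224 by 224 image. This is a rough sketch: it counts only the entries of one attention matrix and ignores the feed-forward layers, the number of heads, and the CLS token. The function name `attention_cost` is hypothetical.

```python
def attention_cost(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (num_tokens, num_pairwise_scores) for one attention matrix."""
    tokens = (image_size // patch_size) ** 2
    return tokens, tokens * tokens


_, base_pairs = attention_cost(224, 16)
for patch_size in (8, 16, 32):
    tokens, pairs = attention_cost(224, patch_size)
    print(f"{patch_size}x{patch_size}: {tokens} tokens, "
          f"{pairs} scores, {pairs / base_pairs}x the 16x16 cost")
```

Going from 16x16 to 8x8 patches takes the sequence from 196 to 784 tokens (4x) and the attention matrix from 38,416 to 614,656 entries (16x).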

Self-Attention on Images

Once the patches are embedded and positioned, they pass through multiple layers of the standard transformer encoder. Each layer contains two key components: multi-head self-attention and a feed-forward network, connected with residual connections and layer normalization.

Self-attention is what makes ViT fundamentally different from CNNs. In a CNN, each convolutional filter has a small receptive field -- it can only see a tiny local patch of the image at each layer. Understanding long-range relationships (like "the object in the top-left is casting a shadow in the bottom-right") requires stacking many layers so information can propagate gradually across the image. In ViT, every patch attends to every other patch in a single attention operation. The model can directly compare patch 1 (top-left corner) with patch 196 (bottom-right corner) in the very first layer.
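A single-head version of scaled dot-product self-attention shows where the global view comes from. This is a minimal sketch with random tokens and random weights (a trained model would have learned both), and it uses a small per-token dimension of 64 for speed.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a (num_tokens, dim) sequence."""
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (num_tokens, num_tokens)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
n, d = 196, 64
tokens = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(0.0, d ** -0.5, size=(d, d)) for _ in range(3))

out, weights = self_attention(tokens, W_q, W_k, W_v)
# weights[0, 195] is how much patch 1 (top-left) attends to patch 196 (bottom-right).
print(weights.shape, weights[0, 195] > 0)
```

The attention matrix is 196 by 196 with no zero entries: every patch receives some contribution from every other patch in a single layer, which is what "global receptive field" means in practice.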

This global receptive field gives ViT an inherent advantage for tasks that require understanding relationships across the entire image. A ViT processing a photo of a person in a landscape can simultaneously attend to the person's face, their clothing, the horizon, and the lighting conditions, relating all these elements in one pass. A CNN would need many layers to build up this global understanding.

Multi-head attention means the model runs several attention operations in parallel, each learning to focus on different types of relationships. Some heads might learn to attend to color patterns, others to edges, others to spatial proximity, and others to semantic similarity. The diversity of attention patterns is what gives ViT its representational power.
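Multi-head attention can be sketched by splitting the hidden dimension into independent slices, attending within each slice, and concatenating the results. The sizes below (768 hidden units, 12 heads) follow ViT-Base; the weights are random stand-ins, so the heads here do not specialize in anything.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(tokens, W_q, W_k, W_v, W_o, num_heads):
    n, d = tokens.shape
    head_dim = d // num_heads

    def split(x):
        # (n, d) -> (num_heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ W_q), split(tokens @ W_k), split(tokens @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (heads, n, n)
    weights = softmax(scores, axis=-1)                       # one map per head
    heads = weights @ v                                      # (heads, n, head_dim)
    concat = heads.transpose(1, 0, 2).reshape(n, d)          # back to (n, d)
    return concat @ W_o, weights


rng = np.random.default_rng(0)
n, d, num_heads = 196, 768, 12
tokens = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(0.0, d ** -0.5, size=(d, d)) for _ in range(4))

out, weights = multi_head_attention(tokens, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)  # (196, 768) (12, 196, 196)
```

Each of the 12 heads produces its own 196 by 196 attention map, which is what lets different heads track different kinds of relationships after training.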

ViT vs. CNN

The comparison between Vision Transformers and CNNs is nuanced and depends heavily on the amount of training data available. When trained on small to medium datasets (like ImageNet alone, with about 1.2 million images), CNNs tend to outperform ViT. This is because CNNs have strong inductive biases -- architectural assumptions that are well-suited to images, such as translation equivariance (a cat is a cat regardless of where it appears in the image) and locality (nearby pixels are more related than distant ones).
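Translation equivariance can be demonstrated directly: shifting an image and then convolving gives the same result as convolving and then shifting. The sketch below assumes wrap-around (circular) padding so the property holds exactly; real CNNs with zero padding, strides, and pooling are only approximately equivariant. The helper `circular_conv2d` is written for this example and is not a library function.

```python
import numpy as np


def circular_conv2d(image, kernel):
    """Cross-correlate a 2D image with a small kernel, wrapping at the edges."""
    kh, kw = kernel.shape
    out = np.zeros_like(image, dtype=float)
    for i in range(kh):
        for j in range(kw):
            # A negative roll moves pixel (r + i, c + j) to position (r, c).
            out += kernel[i, j] * np.roll(image, shift=(-i, -j), axis=(0, 1))
    return out


rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.random((3, 3))
shift = (2, 3)   # move content 2 pixels down and 3 pixels right

shift_then_conv = circular_conv2d(np.roll(image, shift, axis=(0, 1)), kernel)
conv_then_shift = np.roll(circular_conv2d(image, kernel), shift, axis=(0, 1))

equivariant = np.allclose(shift_then_conv, conv_then_shift)
print(equivariant)  # True
```

Because the same filter is applied at every location, a CNN does not need to relearn what a cat looks like for each position in the image.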

ViT, by contrast, has almost no image-specific inductive biases. It treats the image as a generic sequence and must learn spatial relationships entirely from data. This lack of built-in assumptions means ViT needs much more data to learn what CNNs get for free from their architecture. However, when trained on very large datasets (tens or hundreds of millions of images), ViT's flexibility becomes an advantage. Without the constraints of convolutional architecture, ViT can learn more general and powerful representations that ultimately surpass CNNs.
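The claim that a transformer sees a generic sequence can be checked on plain self-attention. In this sketch (random tokens and weights, no positional embeddings), shuffling the input tokens simply shuffles the output in the same way: nothing in the computation depends on where a patch sat in the image.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, W_q, W_k, W_v):
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return weights @ v


rng = np.random.default_rng(0)
n, d = 16, 8
tokens = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(0.0, d ** -0.5, size=(d, d)) for _ in range(3))
perm = rng.permutation(n)   # a random reordering of the patches

out = self_attention(tokens, W_q, W_k, W_v)
out_shuffled = self_attention(tokens[perm], W_q, W_k, W_v)

# Shuffling the input only shuffles the output the same way.
order_blind = np.allclose(out_shuffled, out[perm])
print(order_blind)  # True
```

Any sense of spatial layout therefore has to come from the positional embeddings and from what the model learns about them during training.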

This led to a practical insight: ViT's appetite for data can be reduced, either with better training recipes or by reintroducing some CNN-style structure. DeiT (Data-efficient Image Transformer) keeps the ViT architecture but uses knowledge distillation and strong data augmentation to make it competitive even when trained on ImageNet alone. Swin Transformer introduces hierarchical processing and shifted windows, restricting attention to local windows that shift between layers, which brings back the locality that serves CNNs well while still passing information across the image. Approaches like these have become common backbones of modern computer vision systems, in applications ranging from autonomous driving to medical image analysis.

Today, Vision Transformers and their variants have largely displaced pure CNNs at the frontier of computer vision research. Widely used CLIP models and SAM are built on ViT image encoders, and generative systems like DALL-E and Stable Diffusion rely on transformers or attention layers as core components, demonstrating that the same family of mechanisms can serve both text and image understanding.

Key Takeaway

The Vision Transformer proved that the transformer architecture is not limited to language. By re-imagining images as sequences of patches, ViT demonstrated that self-attention can replace convolution as the primary mechanism for visual understanding. The key trade-off is data: ViT needs more data to match CNNs but scales to higher performance when that data is available.

ViT's success triggered a broader movement toward unified architectures that handle text, images, audio, and video with the same underlying transformer mechanism. This convergence is one of the most important trends in modern AI, pointing toward general-purpose models that can perceive and reason across modalities. Whether you are building an image classifier, a multimodal chatbot, or a generative art system, understanding the Vision Transformer is essential because it sits at the heart of many state-of-the-art vision systems today.
