Image segmentation goes beyond object detection by classifying every single pixel in an image. Rather than drawing bounding boxes around objects, segmentation produces precise masks that outline the exact shape of each object. This pixel-level understanding is critical for applications ranging from medical imaging and autonomous driving to photo editing and augmented reality.
Types of Image Segmentation
Before diving into specific architectures, it is important to understand the three main types of segmentation tasks:
Semantic Segmentation
Semantic segmentation assigns a class label to every pixel in the image. All pixels belonging to "road" are labeled as road, all pixels belonging to "car" are labeled as car, and so on. However, semantic segmentation does not distinguish between individual instances: if there are three cars, all car pixels get the same label.
Instance Segmentation
Instance segmentation goes further by distinguishing individual object instances. Each of the three cars gets a unique mask, allowing the system to count objects and track them individually. Instance segmentation typically only covers "thing" categories (countable objects) and ignores "stuff" categories (amorphous regions like sky or road).
Panoptic Segmentation
Panoptic segmentation unifies both approaches. Every pixel receives a label: "thing" pixels get both a class label and an instance ID, while "stuff" pixels get only a class label. This provides a complete understanding of the entire scene.
Semantic segmentation tells you what is in each pixel. Instance segmentation tells you which specific object each pixel belongs to. Panoptic segmentation does both.
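The distinction is easy to see on a toy label map. Below, a hypothetical 4x4 scene with two "car" instances on "road" shows how each task labels the same pixels (the class IDs and arrays are made up purely for illustration):

```python
import numpy as np

# Hypothetical class IDs: 0 = road ("stuff"), 1 = car ("thing")
semantic = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# Instance segmentation: each countable object gets its own ID;
# "stuff" pixels (road) are ignored (0 = no instance here).
instance = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 2],
])

# Panoptic: every pixel gets a (class_id, instance_id) pair;
# stuff pixels carry instance_id 0.
panoptic = np.stack([semantic, instance], axis=-1)

# Semantic labels can only say whether cars exist; instance IDs can count them.
num_cars_instance = len(np.unique(instance)) - 1  # excludes background 0
print(num_cars_instance)  # 2
```

Note how the semantic map alone cannot separate the two cars, while the instance map loses the road: only the panoptic pair keeps both pieces of information.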
Key Takeaway
The three types of segmentation serve different needs: semantic for scene understanding, instance for object-level analysis, and panoptic for complete scene decomposition.
U-Net: The Medical Imaging Workhorse
U-Net, introduced in 2015 by Olaf Ronneberger et al., was designed for biomedical image segmentation where labeled data is scarce. Its elegant architecture has made it one of the most influential segmentation models, with applications far beyond its original medical imaging domain.
U-Net has a symmetric encoder-decoder structure shaped like the letter U:
- Encoder (contracting path): Successive blocks of convolutions and max pooling reduce spatial resolution while increasing feature channels, capturing high-level context
- Decoder (expanding path): Transposed convolutions upsample the feature maps back to the original resolution, recovering spatial detail
- Skip connections: The defining feature -- feature maps from corresponding encoder levels are concatenated with decoder feature maps, allowing the decoder to recover fine-grained spatial information that was lost during downsampling
Skip connections are what make U-Net special. Without them, the decoder must reconstruct spatial details purely from the compressed bottleneck representation. With skip connections, high-resolution features from the encoder are directly available to the decoder, enabling precise boundary delineation.
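The shape flow through a U-Net can be sketched without any learned weights. The toy code below substitutes average pooling for conv + max pool and nearest-neighbor upsampling for transposed convolutions; the point is only how skip connections carry full-resolution encoder features directly into the decoder:

```python
import numpy as np

def downsample(x):
    """2x2 average pool over (C, H, W), standing in for conv + max pool."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """2x nearest-neighbor upsampling, standing in for a transposed conv."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.random.rand(16, 64, 64)  # (channels, H, W) input features

# Contracting path: keep each level's features for the skip connection.
skip1 = x                        # (16, 64, 64)
skip2 = downsample(skip1)        # (16, 32, 32)
bottleneck = downsample(skip2)   # (16, 16, 16)

# Expanding path: upsample, then concatenate the matching encoder features
# along the channel axis -- the skip connection.
d2 = np.concatenate([upsample(bottleneck), skip2], axis=0)  # (32, 32, 32)
d2 = d2[:16]  # crude stand-in: a real U-Net uses convs to fuse 32 -> 16 channels
d1 = np.concatenate([upsample(d2), skip1], axis=0)          # (32, 64, 64)

print(d1.shape)  # back at full 64x64 spatial resolution
```

Without the `skip1`/`skip2` concatenations, the decoder would see only the 16x16 bottleneck, which is exactly the information loss the skip connections are there to repair.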
U-Net's success in medical imaging led to numerous variants including U-Net++ (dense skip connections), Attention U-Net (attention gates in skip connections), and nnU-Net (self-configuring framework that automatically adapts U-Net for any medical segmentation task).
Mask R-CNN: Instance Segmentation Pioneer
Mask R-CNN, proposed by Kaiming He et al. in 2017, extends Faster R-CNN with a parallel branch that predicts a binary segmentation mask for each detected object. The architecture adds a small fully convolutional network (FCN) on top of each Region of Interest (RoI).
The key components of Mask R-CNN include:
- Backbone + FPN: A ResNet or similar backbone extracts features, and a Feature Pyramid Network provides multi-scale representations
- Region Proposal Network: Proposes candidate object regions, inherited from Faster R-CNN
- RoIAlign: A critical improvement over RoI pooling that uses bilinear interpolation instead of quantization, preserving precise spatial information needed for pixel-level masks
- Three parallel heads: Classification head, bounding box regression head, and mask prediction head operate simultaneously on each proposal
Mask R-CNN's mask head predicts a mask independently for each class, avoiding competition between classes. This design choice, combined with RoIAlign's spatial precision, enabled Mask R-CNN to achieve state-of-the-art instance segmentation results that held for several years.
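To see why RoIAlign matters, here is a minimal single-sample-per-bin version in numpy (real RoIAlign averages several sample points per output bin; this sketch uses one bin-center sample to stay short). A box with fractional coordinates is cropped exactly, where quantized RoI pooling would first snap it to integer cell boundaries:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at continuous (y, x) via bilinear interpolation."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] +
            (1 - wy) * wx       * feat[y0, x1] +
            wy       * (1 - wx) * feat[y1, x0] +
            wy       * wx       * feat[y1, x1])

def roi_align(feat, box, out_size=2):
    """Crop box = (y0, x0, y1, x1), given in continuous feature-map
    coordinates, to an out_size x out_size grid by sampling each bin center."""
    y0, x0, y1, x1 = box
    bh, bw = (y1 - y0) / out_size, (x1 - x0) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(feat, y0 + (i + 0.5) * bh,
                                              x0 + (j + 0.5) * bw)
    return out

feat = np.arange(36, dtype=float).reshape(6, 6)
# Fractional box: RoI pooling would quantize these coordinates to integers,
# shifting the crop; RoIAlign samples them exactly where they fall.
crop = roi_align(feat, (0.5, 0.5, 4.5, 4.5))
print(crop)  # [[10.5 12.5]
             #  [22.5 24.5]]
```

The half-integer outputs are the sub-pixel values that quantization would destroy, and it is exactly this sub-pixel fidelity that pixel-accurate masks depend on.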
SAM: Segment Anything Model
Meta AI's Segment Anything Model (SAM), released in 2023, represents a paradigm shift in segmentation. Rather than training a model for specific segmentation tasks, SAM is a foundation model for segmentation that can segment any object in any image given a prompt.
How SAM Works
SAM consists of three components:
- Image encoder: A Vision Transformer (ViT) pre-trained with MAE that produces image embeddings. This runs once per image and can be cached for interactive use.
- Prompt encoder: Encodes sparse prompts (points and bounding boxes) and dense prompts (rough masks) into prompt embeddings. Text prompting was explored in the SAM paper but is not part of the released model.
- Mask decoder: A lightweight transformer decoder that combines image and prompt embeddings to produce segmentation masks. It generates multiple possible masks along with confidence scores.
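This split between a heavy, cacheable image encoder and a lightweight prompt-conditioned decoder is what makes SAM interactive. The toy sketch below (plain Python; every function and class here is invented for illustration, not SAM's actual API) mimics that encode-once, prompt-many structure:

```python
import time

def encode_image(image):
    """Stand-in for SAM's ViT image encoder: expensive, runs once per image."""
    time.sleep(0.01)   # pretend this is the slow part
    return sum(image)  # toy "embedding"

def decode(embedding, prompt):
    """Stand-in for SAM's lightweight mask decoder: cheap, runs per prompt."""
    return {"mask": f"mask near {prompt}", "score": 0.9}  # toy output

class ToyPredictor:
    """Mimics the encode-once / prompt-many pattern of a promptable model."""
    def __init__(self):
        self.embedding = None

    def set_image(self, image):
        self.embedding = encode_image(image)  # cached for all later prompts

    def predict(self, point):
        return decode(self.embedding, point)

predictor = ToyPredictor()
predictor.set_image([1, 2, 3])    # slow step, paid once per image
m1 = predictor.predict((10, 20))  # each click is near-instant
m2 = predictor.predict((30, 40))
print(m1["mask"], m2["mask"])
```

In the real model the cached embedding comes from the ViT and the decoder runs in milliseconds, which is why a user can click around an image and get masks in real time.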
SAM was trained on the SA-1B dataset, containing 1.1 billion masks across 11 million images, by far the largest segmentation dataset released at the time. This massive training data enables SAM's remarkable zero-shot generalization ability.
SAM changed the game by turning segmentation into a promptable task. Instead of training a specialized model for each domain, you can point at what you want to segment and SAM figures out the rest.
SAM 2 and Beyond
SAM 2, released in 2024, extends the Segment Anything concept to video, enabling users to segment and track objects across video frames with minimal interaction. It processes video in a streaming fashion, maintaining a memory of previously segmented frames.
Key Takeaway
SAM represents the foundation model approach to segmentation: train once on massive data, then apply to any segmentation task through prompting. This mirrors the shift from task-specific models to general-purpose models seen in NLP with LLMs.
Choosing the Right Segmentation Approach
The best segmentation method depends on your application:
- Medical imaging: U-Net variants (especially nnU-Net) remain the standard due to their ability to work with limited labeled data and produce precise boundaries
- Autonomous driving: Panoptic segmentation models that combine semantic and instance segmentation for complete scene understanding
- Interactive editing: SAM excels when users need to select and segment objects interactively
- Industrial inspection: Instance segmentation with Mask R-CNN or its descendants for counting and measuring specific defects
- Research and prototyping: SAM's zero-shot capabilities make it ideal for quick experimentation without task-specific training
The field is converging toward foundation models that can handle multiple segmentation tasks. Models like OneFormer unify semantic, instance, and panoptic segmentation into a single architecture, while SAM demonstrates that prompting can replace task-specific training. As these approaches mature, the boundaries between segmentation subtypes continue to blur.
