Stable Diffusion is the technology that brought AI image generation to the masses. Released as open source by Stability AI in August 2022, it enabled anyone with a decent GPU to generate stunning images from text descriptions. But while millions have used it, few understand how it actually works. This article takes you under the hood of Stable Diffusion, explaining each component of its latent diffusion architecture in accessible terms.
The Key Insight: Latent Space Diffusion
Before Stable Diffusion, diffusion models operated directly in pixel space -- denoising a full-resolution image at every step. For a 512x512 image with 3 color channels, that means working with 786,432 values at each of the 50-1000 denoising steps, which is computationally prohibitive.
Stable Diffusion's breakthrough was performing the diffusion process not on the image itself but in a compressed latent space. A pretrained autoencoder compresses images to a fraction of their size -- typically a 64x64x4 representation for a 512x512 image. That's a 48x reduction in dimensionality. The denoising happens in this compact space, and the result is decoded back to pixel space only at the very end.
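The arithmetic behind that 48x figure is easy to verify:

```python
# Values the denoiser must process per step, in pixel space
# versus Stable Diffusion's compressed latent space.
pixel_values = 512 * 512 * 3      # full-resolution RGB image
latent_values = 64 * 64 * 4       # VAE-compressed latent

print(pixel_values)                    # 786432
print(latent_values)                   # 16384
print(pixel_values // latent_values)   # 48
```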
Latent diffusion is like sculpting in miniature and then scaling up the result. Working in the compressed space makes the process dramatically faster and cheaper while preserving the quality of the final output.
The Three Core Components
1. The VAE (Variational Autoencoder)
The VAE has two parts: an encoder that compresses pixel-space images into latent representations, and a decoder that converts latents back to pixel-space images. During training, the VAE learns to preserve perceptually important details while discarding redundant pixel-level information.
The encoder is needed only when an existing image must be brought into latent space -- during training (to compress training images) and for image-to-image tasks. The decoder runs at the end of every generation to convert the denoised latent back into a viewable image. The quality of the VAE directly affects the sharpness and fidelity of generated images.
2. The U-Net (Denoising Network)
The U-Net is the core of Stable Diffusion -- the neural network that actually performs the denoising. It takes three inputs: the noisy latent image, the current timestep (noise level), and the text conditioning. It outputs a prediction of the noise present in the latent, which the scheduler then uses to compute a slightly cleaner latent.
The architecture follows the classic U-Net encoder-decoder pattern with skip connections, but augmented with cross-attention layers where the text conditioning is injected. These attention layers allow every spatial location in the latent to attend to the text embeddings, ensuring the generated content aligns with the prompt.
In Stable Diffusion 1.x, the U-Net has approximately 860 million parameters. SDXL increased this to about 2.6 billion parameters with a more sophisticated architecture including dual text encoders.
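A single denoising step can be sketched with a toy stand-in for the real U-Net (the function below is purely illustrative -- the actual network is a large convolutional model with cross-attention; the step-size logic belongs to the scheduler):

```python
import numpy as np

def toy_unet(latent, timestep, text_emb):
    """Stand-in for the real U-Net: returns a fake noise prediction
    with the same shape as the latent it receives."""
    rng = np.random.default_rng(timestep)  # deterministic per timestep
    return 0.1 * latent + 0.01 * rng.standard_normal(latent.shape)

# The three inputs described above: noisy latent, timestep, text conditioning.
latent = np.random.default_rng(0).standard_normal((4, 64, 64))
text_emb = np.zeros((77, 768))   # would come from CLIP in the real pipeline

# One step: predict the noise, then remove a scheduler-chosen fraction of it.
noise_pred = toy_unet(latent, timestep=999, text_emb=text_emb)
step_size = 0.5                  # set by the scheduler in a real pipeline
latent = latent - step_size * noise_pred
print(latent.shape)              # (4, 64, 64): shape is preserved every step
```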
3. The Text Encoder (CLIP)
The text encoder converts your text prompt into numerical representations that the U-Net can use. Stable Diffusion 1.x uses CLIP ViT-L/14, which produces a sequence of 77 token embeddings, each with 768 dimensions. SDXL uses both CLIP ViT-L and OpenCLIP ViT-bigG, concatenating their outputs for a richer text representation.
CLIP was trained on hundreds of millions of image-text pairs from the internet, learning to associate visual concepts with language descriptions. This pre-existing knowledge of image-text relationships is what gives Stable Diffusion its ability to understand and visualize natural language prompts.
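The fixed 77x768 output shape can be illustrated with a toy encoder (the word-hashing, vocabulary size, and embedding table here are invented stand-ins; real CLIP uses a BPE tokenizer followed by a transformer):

```python
import numpy as np

MAX_TOKENS = 77   # CLIP's fixed context length
EMB_DIM = 768     # embedding width of CLIP ViT-L/14

def toy_encode(prompt, vocab_size=1000):
    """Toy stand-in for CLIP's text encoder: map words to IDs,
    pad/truncate to 77 tokens, then look up embedding vectors."""
    ids = [sum(map(ord, w)) % vocab_size for w in prompt.lower().split()]
    ids = ids[:MAX_TOKENS]
    ids += [0] * (MAX_TOKENS - len(ids))             # pad to fixed length
    table = np.random.default_rng(0).standard_normal((vocab_size, EMB_DIM))
    return table[ids]                                 # shape (77, 768)

emb = toy_encode("a majestic lion in a field of sunflowers")
print(emb.shape)  # (77, 768): one embedding per token slot, prompt or padding
```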
Key Takeaway
Stable Diffusion's architecture elegantly separates three concerns: the VAE handles compression/decompression, the U-Net handles the actual image generation through denoising, and CLIP handles understanding the text prompt. Each component is trained separately and combined into the final system.
The Generation Process Step by Step
- Text Encoding: Your prompt "a majestic lion in a field of sunflowers, oil painting" is tokenized and passed through CLIP to produce text embeddings
- Noise Initialization: A random tensor of shape 64x64x4 (latent space) is generated from a Gaussian distribution. The random seed determines this initial noise
- Iterative Denoising: Over 20-50 steps, the U-Net repeatedly predicts and removes noise from the latent, guided by the text embeddings. Each step makes the latent slightly more structured
- Classifier-Free Guidance: At each step, two predictions are made -- one conditioned on the text and one unconditional. The difference is amplified by the guidance scale (typically 7-12), strengthening the text influence
- VAE Decoding: The final denoised latent is passed through the VAE decoder to produce the full-resolution pixel image
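The steps above can be sketched as a minimal toy loop (stand-in functions replace the real CLIP, U-Net, and VAE decoder; classifier-free guidance is omitted here for brevity):

```python
import numpy as np

# Steps 1-2: text embeddings (stand-in) and seeded noise initialization.
text_emb = np.zeros((77, 768))                 # would come from CLIP
rng = np.random.default_rng(seed=42)           # the seed fixes the noise
latent = rng.standard_normal((4, 64, 64))

def toy_unet(latent, t, text_emb):
    return 0.1 * latent                        # stand-in noise prediction

# Step 3: iterative denoising over a fixed number of steps.
for t in reversed(range(20)):
    noise_pred = toy_unet(latent, t, text_emb)
    latent = latent - 0.5 * noise_pred         # scheduler sets the step size

def toy_vae_decode(latent):
    """Stand-in decoder: a real VAE upsamples 4x64x64 -> 512x512x3."""
    return np.zeros((512, 512, 3))

# Step 5: decode the final latent to a full-resolution image.
image = toy_vae_decode(latent)
print(image.shape)  # (512, 512, 3)
```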
Key Parameters and Their Effects
Guidance Scale (CFG): Controls how strongly the image adheres to the text prompt. Low values (1-3) produce creative but potentially off-topic results. High values (10-20) produce highly prompt-adherent but sometimes oversaturated images. The sweet spot is typically 7-12.
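The guidance computation itself is a one-line extrapolation between the two noise predictions (the vectors below are illustrative placeholders for full latent-shaped tensors):

```python
import numpy as np

def cfg(noise_uncond, noise_cond, scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional one, along the direction the text suggests."""
    return noise_uncond + scale * (noise_cond - noise_uncond)

uncond = np.array([0.0, 0.0])
cond = np.array([1.0, -1.0])
print(cfg(uncond, cond, scale=1.0))   # [1. -1.]   scale 1 = pure conditional
print(cfg(uncond, cond, scale=7.5))   # [7.5 -7.5] higher scale amplifies it
```

At scale 0 the text is ignored entirely; at scale 1 the guided prediction equals the conditional one; above 1 the text's influence is exaggerated, which is why very high values can oversaturate.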
Steps: The number of denoising iterations. More steps generally produce higher quality but with diminishing returns. 20-30 steps is usually sufficient with modern schedulers like DPM++ 2M Karras.
Seed: The random seed determines the initial noise tensor. Same prompt + same seed = same image (deterministic generation). This allows reproducibility and systematic exploration of variations.
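The determinism is easy to demonstrate with seeded noise generation (using numpy here as a stand-in for the pipeline's actual random-number source):

```python
import numpy as np

def initial_noise(seed):
    """Generate the 4x64x64 starting latent from a fixed seed."""
    return np.random.default_rng(seed).standard_normal((4, 64, 64))

a = initial_noise(1234)
b = initial_noise(1234)
c = initial_noise(5678)

print(np.array_equal(a, b))  # True:  same seed, identical starting noise
print(np.array_equal(a, c))  # False: different seed, different image
```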
Scheduler/Sampler: The algorithm that determines how noise is removed at each step. Options include DDPM, DDIM, Euler, DPM++, and others. Each has different speed-quality tradeoffs, with DPM++ variants generally offering the best balance.
The Stable Diffusion Ecosystem
The open-source nature of Stable Diffusion has spawned an enormous ecosystem of extensions and techniques.
- ControlNet -- Adds spatial conditioning through edge maps, depth maps, pose skeletons, or segmentation masks, giving precise structural control over generated images
- LoRA (Low-Rank Adaptation) -- Efficient fine-tuning technique that adapts the model to specific styles, characters, or concepts using small additional weight matrices
- Textual Inversion -- Teaches the model new concepts by learning custom text embeddings from a few example images
- IP-Adapter -- Enables image prompting, where a reference image guides the style or content of the generation
- ComfyUI / Automatic1111 -- Popular web interfaces that provide access to the full Stable Diffusion pipeline with visual workflow builders
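The core idea behind LoRA from the list above can be shown in a few lines: instead of fine-tuning a full weight matrix, train two small low-rank factors and add their product to the frozen weight (the dimensions and scaling convention below are typical but illustrative):

```python
import numpy as np

d_out, d_in, rank = 768, 768, 8    # attention projection size, small rank

W = np.zeros((d_out, d_in))        # frozen pretrained weight (stand-in)
B = np.random.default_rng(0).standard_normal((d_out, rank)) * 0.01
A = np.random.default_rng(1).standard_normal((rank, d_in)) * 0.01
alpha = 8                          # LoRA scaling hyperparameter

W_adapted = W + (alpha / rank) * (B @ A)   # low-rank update, same shape as W

full_params = d_out * d_in         # what full fine-tuning would train
lora_params = rank * (d_out + d_in)  # what LoRA actually trains
print(full_params, lora_params)    # 589824 12288 -- roughly 48x fewer
```

Because only B and A are trained, a LoRA file for a whole model is typically a few megabytes rather than gigabytes, which is why sharing style and character adaptations became so widespread.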
Key Takeaway
Stable Diffusion's open-source model created a vibrant ecosystem that has pushed the boundaries of what's possible in AI image generation. Understanding the core architecture helps you make better use of its many extensions and parameters.
Stable Diffusion's latent diffusion architecture represents one of the most elegant solutions in modern AI -- achieving high-quality image generation at a fraction of the computational cost of pixel-space diffusion. As the architecture continues to evolve through SD3 and beyond, the fundamental principles remain the same: compress, denoise, decode.
