Flash Attention
An IO-aware attention algorithm that computes exact (not approximate) attention much faster by minimizing reads and writes between slow GPU high-bandwidth memory (HBM) and fast on-chip SRAM.
The Problem It Solves
Standard attention materializes the full N x N attention matrix in GPU memory, and the repeated traffic to and from HBM dominates runtime because attention is memory-bandwidth bound, not compute bound. Flash Attention instead tiles the computation into blocks that fit in fast on-chip SRAM, using an online (streaming) softmax with running row maxima and normalizers so the full matrix is never stored.
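The tiling idea above can be sketched in NumPy. This is illustrative only: real Flash Attention is a fused GPU kernel, and the function names here are ours. The block loop keeps a running row maximum m, a running softmax denominator l, and a partial output O, rescaling them as each new key/value block arrives, so only a small N x block score tile ever exists at once.

```python
import numpy as np

def standard_attention(Q, K, V):
    # Baseline: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def blockwise_attention(Q, K, V, block=4):
    # Flash-Attention-style tiling with an online softmax:
    # process K/V in blocks, never storing the full score matrix.
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full((N, 1), -np.inf)   # running row maximum
    l = np.zeros((N, 1))           # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)                       # small N x block tile
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)                           # tile probabilities
        alpha = np.exp(m - m_new)                       # rescale prior blocks
        l = alpha * l + P.sum(axis=-1, keepdims=True)
        O = alpha * O + P @ Vj
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(np.allclose(standard_attention(Q, K, V), blockwise_attention(Q, K, V)))  # True
```

Both functions produce the same result; the blockwise version simply reorders the computation so the working set stays small, which on a GPU means it stays in SRAM.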
Performance Gains
Typically 2-4x speedup over standard attention, with memory usage that grows linearly rather than quadratically in sequence length. This enables training with longer context windows, and it has become the default attention implementation in most modern LLM training.
Versions
Flash Attention 1 (2022): initial algorithm.
Flash Attention 2 (2023): better work partitioning, ~2x faster than v1.
Flash Attention 3 (2024): leverages newer GPU features (Hopper architecture) for further gains.