Flash Attention
An IO-aware attention algorithm that computes exact (not approximate) attention much faster by minimizing reads and writes between slow GPU high-bandwidth memory (HBM) and fast on-chip SRAM.
The Problem It Solves
Standard attention materializes the full N x N attention matrix in GPU memory, and the repeated traffic to and from HBM dominates runtime because attention is memory-bandwidth bound, not compute bound. Flash Attention instead tiles the computation into blocks that fit in fast on-chip SRAM, using an online (streaming) softmax with running row maxima and normalizers so the full matrix is never stored.
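The tiling idea above can be sketched in NumPy. This is illustrative only: real Flash Attention is a fused GPU kernel, and the function names here are ours. The block loop keeps a running row maximum m, a running softmax denominator l, and a partial output O, rescaling them as each new key/value block arrives, so only a small N x block score tile ever exists at once.

```python
import numpy as np

def standard_attention(Q, K, V):
    # Baseline: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def blockwise_attention(Q, K, V, block=4):
    # Flash-Attention-style tiling with an online softmax:
    # process K/V in blocks, never storing the full score matrix.
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full((N, 1), -np.inf)   # running row maximum
    l = np.zeros((N, 1))           # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T / np.sqrt(d)                       # small N x block tile
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)                           # tile probabilities
        alpha = np.exp(m - m_new)                       # rescale prior blocks
        l = alpha * l + P.sum(axis=-1, keepdims=True)
        O = alpha * O + P @ Vj
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(np.allclose(standard_attention(Q, K, V), blockwise_attention(Q, K, V)))  # True
```

Both functions produce the same result; the blockwise version simply reorders the computation so the working set stays small, which on a GPU means it stays in SRAM.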
Performance Gains
Typically 2-4x speedup over standard attention, with memory usage that grows linearly rather than quadratically in sequence length. This enables training with longer context windows, and it has become the default attention implementation in most modern LLM training.
Versions
Flash Attention 1 (2022): initial algorithm.
Flash Attention 2 (2023): better work partitioning, ~2x faster than v1.
Flash Attention 3 (2024): leverages newer GPU features (Hopper architecture) for further gains.