Flash Attention
An IO-aware attention algorithm that reduces memory usage and speeds up transformers through tiling.
Overview
Flash Attention is a hardware-aware implementation of the attention mechanism that avoids materializing the full N x N attention matrix in GPU memory. Instead, it computes attention block by block using a tiling scheme with an online (streaming) softmax, keeping each block's working set in fast on-chip SRAM rather than repeatedly reading and writing the slower HBM (high-bandwidth memory).
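The tiling idea can be sketched in plain NumPy. This is an illustrative CPU sketch, not the actual fused CUDA kernel: it processes K and V in blocks and maintains a running row maximum `m` and normalizer `l` (the online softmax), so the full N x N score matrix is never formed. The function names and block size are illustrative choices.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=16):
    """Blocked attention with an online softmax: only an N x block score
    tile exists at any time. Running max m and normalizer l are rescaled
    as each new K/V block arrives, then the output is normalized once."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full((N, 1), -np.inf)   # running row maximum
    l = np.zeros((N, 1))           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                          # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)                         # tile-local exponentials
        alpha = np.exp(m - m_new)                     # rescale old accumulators
        l = alpha * l + P.sum(axis=-1, keepdims=True)
        O = alpha * O + P @ Vj
        m = m_new
    return O / l
```

Running both on random inputs, `tiled_attention` matches `naive_attention` to floating-point tolerance, which is the point: the blocked computation is mathematically exact, not an approximation.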
Key Details
Flash Attention reduces attention memory from O(N^2) to O(N) and achieves a 2-4x wall-clock speedup over standard attention implementations. Flash Attention 2 further improved parallelism and work partitioning across GPU thread blocks, and Flash Attention 3 exploits hardware features of newer GPUs such as NVIDIA's Hopper architecture. It is now widely integrated into major deep learning frameworks and is a key enabler of long-context LLMs.
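The O(N^2) to O(N) reduction is easy to quantify. A back-of-the-envelope sketch for one attention head at a 32K context (the numbers are illustrative, assuming fp16 scores and fp32 softmax statistics):

```python
# Memory for attention intermediates at sequence length N = 32768, one head:
# standard attention materializes an N x N score matrix, while Flash Attention
# keeps only O(N) softmax statistics (a running max and normalizer per row).
N = 32768
full_matrix_bytes = N * N * 2    # N x N scores in fp16 (2 bytes each)
flash_stats_bytes = N * 2 * 4    # per-row max + normalizer in fp32

print(f"full matrix:  {full_matrix_bytes / 2**30:.1f} GiB")  # 2.0 GiB
print(f"flash stats:  {flash_stats_bytes / 2**10:.0f} KiB")  # 256 KiB
```

Multiply the first figure by the number of heads and the batch size, and it is clear why materializing the matrix becomes the bottleneck at long context while the O(N) statistics stay negligible.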