Flash Attention
An IO-aware attention algorithm that reduces memory usage and speeds up transformers through tiling.
Overview
Flash Attention is a hardware-aware implementation of the attention mechanism that avoids materializing the full N x N attention matrix in GPU memory. Instead, it computes attention block by block using a tiling scheme with an online (streaming) softmax, keeping each block's working set in fast on-chip SRAM rather than repeatedly reading and writing the slower HBM (high-bandwidth memory).
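The tiling idea can be sketched in plain NumPy. This is an illustrative CPU sketch, not the actual fused CUDA kernel: it processes K and V in blocks and maintains a running row maximum `m` and normalizer `l` (the online softmax), so the full N x N score matrix is never formed. The function names and block size are illustrative choices.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=16):
    """Blocked attention with an online softmax: only an N x block score
    tile exists at any time. Running max m and normalizer l are rescaled
    as each new K/V block arrives, then the output is normalized once."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full((N, 1), -np.inf)   # running row maximum
    l = np.zeros((N, 1))           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                          # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)                         # tile-local exponentials
        alpha = np.exp(m - m_new)                     # rescale old accumulators
        l = alpha * l + P.sum(axis=-1, keepdims=True)
        O = alpha * O + P @ Vj
        m = m_new
    return O / l
```

Running both on random inputs, `tiled_attention` matches `naive_attention` to floating-point tolerance, which is the point: the blocked computation is mathematically exact, not an approximation.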
Key Details
Flash Attention reduces attention memory from O(N^2) to O(N) and achieves a 2-4x wall-clock speedup over standard attention implementations. Flash Attention 2 further improved parallelism and work partitioning across GPU thread blocks, and Flash Attention 3 exploits hardware features of newer GPUs such as NVIDIA's Hopper architecture. It is now widely integrated into major deep learning frameworks and is a key enabler of long-context LLMs.
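The O(N^2) to O(N) reduction is easy to quantify. A back-of-the-envelope sketch for one attention head at a 32K context (the numbers are illustrative, assuming fp16 scores and fp32 softmax statistics):

```python
# Memory for attention intermediates at sequence length N = 32768, one head:
# standard attention materializes an N x N score matrix, while Flash Attention
# keeps only O(N) softmax statistics (a running max and normalizer per row).
N = 32768
full_matrix_bytes = N * N * 2    # N x N scores in fp16 (2 bytes each)
flash_stats_bytes = N * 2 * 4    # per-row max + normalizer in fp32

print(f"full matrix:  {full_matrix_bytes / 2**30:.1f} GiB")  # 2.0 GiB
print(f"flash stats:  {flash_stats_bytes / 2**10:.0f} KiB")  # 256 KiB
```

Multiply the first figure by the number of heads and the batch size, and it is clear why materializing the matrix becomes the bottleneck at long context while the O(N) statistics stay negligible.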