FP16 / BF16 (Half Precision)
16-bit floating-point formats that use half the memory of standard 32-bit floats, enabling faster training and inference of neural networks with minimal quality loss.
FP16 vs BF16
FP16 (IEEE half precision): a 10-bit mantissa gives more precision, but a 5-bit exponent limits the range to roughly ±65,504, so activations and gradients can overflow or underflow; training typically requires loss scaling. BF16 (Brain Float 16): an 8-bit exponent gives the same dynamic range as FP32, but only a 7-bit mantissa, so less precision. It is simpler to use, needs no loss scaling, and is preferred for modern training on hardware that supports it.
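The range/precision trade-off can be demonstrated with only the Python standard library. This is an illustrative sketch: `to_fp16` round-trips a value through IEEE-754 half precision via `struct`'s `"e"` format, and `to_bf16` emulates bfloat16 by truncating a float32 encoding to its top 16 bits (real hardware rounds rather than truncates, so exact values may differ slightly).

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE-754 half precision (1 sign, 5 exp, 10 mantissa).
    # Raises OverflowError for values beyond the FP16 range (~65504).
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    # Emulate bfloat16 (1 sign, 8 exp, 7 mantissa) by keeping the top
    # 16 bits of the float32 encoding and zeroing the rest (truncation).
    bits, = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Range: 70000 overflows FP16 but fits easily in BF16.
try:
    to_fp16(70000.0)
except OverflowError:
    print("FP16 overflow: 70000 exceeds the ~65504 maximum")
print(to_bf16(70000.0))   # representable, just coarsely rounded

# Precision: FP16 resolves 1.001; BF16's 7-bit mantissa cannot.
print(to_fp16(1.001))     # close to 1.001
print(to_bf16(1.001))     # collapses to 1.0
```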
Mixed Precision Training
Keep a master copy of the weights in FP32 for stability, and compute the forward and backward passes in FP16/BF16 for speed. This gives roughly a 2x speedup with minimal accuracy loss, and is supported by all modern frameworks, e.g. via torch.cuda.amp (now torch.amp) in PyTorch.
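A minimal PyTorch sketch of the pattern above, using BF16 autocast on CPU so it runs anywhere (on a GPU you would use device_type="cuda", and with FP16 you would additionally wrap the loss in a torch.amp.GradScaler for loss scaling). The model sizes and data here are arbitrary placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                    # parameters stay in FP32
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 16)
target = torch.randn(8, 4)

opt.zero_grad()
# Ops inside this context run in BF16 where it is safe to do so.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)                          # matmul executes in BF16
    loss = nn.functional.mse_loss(out.float(), target)
loss.backward()                             # gradients flow into FP32 params
opt.step()

print(out.dtype)            # low-precision activations
print(model.weight.dtype)   # FP32 master weights
```

No GradScaler is needed here because BF16 shares FP32's exponent range, which is exactly why it avoids the loss-scaling machinery FP16 requires.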