Softmax
A mathematical function that converts a vector of raw scores (logits) into a probability distribution, where all values are between 0 and 1 and sum to 1.
The Formula
softmax(x_i) = exp(x_i) / Σ_j exp(x_j), where the sum runs over every component of the vector. It exponentiates each value (making them all positive) and divides by the sum (so they total 1). Because the exponential grows so fast, larger inputs receive disproportionately larger probabilities.
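The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged, since softmax is shift-invariant.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; softmax(x) == softmax(x - c)
    # for any constant c, so this does not change the result.
    shifted = x - np.max(x)
    exps = np.exp(shifted)           # all values become positive
    return exps / np.sum(exps)       # normalize so they sum to 1

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)              # roughly [0.66, 0.24, 0.10]
```

Note how the largest logit (2.0) ends up with well over half the probability mass even though it is only about twice the second-largest input.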
Where It's Used
In the output layer of classification models to produce class probabilities. In the attention mechanism of transformers to compute attention weights. Everywhere a probability distribution over discrete options is needed.
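The attention use case can be sketched with a toy scaled dot-product example. The query and key vectors below are made-up illustrative values; the point is that softmax turns raw similarity scores into attention weights that sum to 1.

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Hypothetical toy example: one query attending over three keys.
q = np.array([1.0, 0.0])             # query vector
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # three key vectors
scores = K @ q / np.sqrt(q.size)     # scaled dot-product similarity scores
weights = softmax(scores)            # attention weights: a distribution over keys
```

Here keys 0 and 2 match the query equally well, so they receive equal weight, while key 1 (orthogonal to the query) receives the least.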
Temperature Scaling
Dividing logits by a temperature parameter T before applying softmax controls the distribution's sharpness: T < 1 makes it peakier (more confident), T > 1 makes it flatter (more random). This is the temperature setting that trades off determinism against creativity in LLM text generation.
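The effect is easy to see numerically. A small sketch, using the same example logits as before: the probability of the top option grows as T shrinks and evens out as T grows.

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
sharp = softmax(logits / 0.5)   # T = 0.5: peakier, more confident
base  = softmax(logits)         # T = 1.0: unmodified softmax
flat  = softmax(logits / 2.0)   # T = 2.0: flatter, more random
```

At T = 0.5 the top option's probability is around 0.86, versus about 0.66 at T = 1 and about 0.50 at T = 2, even though the underlying logits never changed.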