Without activation functions, a neural network is just a fancy linear regression. No matter how many layers you stack, the output is still a linear (affine) function of the inputs. Activation functions introduce nonlinearity, giving neural networks the power to learn complex, curved decision boundaries and hierarchical representations.
Why Nonlinearity Matters
Consider stacking two linear layers: the first computes z1 = W1*x + b1 and the second computes z2 = W2*z1 + b2. Substituting gives z2 = (W2*W1)*x + (W2*b1 + b2), which is a single linear layer with effective weights W = W2*W1 and bias b = W2*b1 + b2. No matter how many linear layers you add, the stack collapses in the same way. Activation functions break this linearity, allowing each layer to introduce new curves and boundaries that the previous layers cannot represent alone.
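This collapse is easy to check numerically. The sketch below stacks two linear layers with arbitrary small shapes (chosen here purely for illustration) and verifies that they equal a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative shapes: 3 inputs -> 4 hidden units -> 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Two stacked linear layers...
z1 = W1 @ x + b1
z2 = W2 @ z1 + b2

# ...collapse to one linear layer with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
assert np.allclose(z2, W @ x + b)
```

However deep the stack, repeating the substitution yields the same result: one matrix and one bias vector.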
"Activation functions are the secret ingredient that turns a stack of linear transformations into a universal function approximator capable of learning virtually any pattern."
Sigmoid
The sigmoid function squashes any input into the range (0, 1): sigma(z) = 1 / (1 + exp(-z)). Historically, it was the default activation because its output can be interpreted as a probability.
- Pros: Smooth, differentiable, output bounded between 0 and 1, useful for output layers in binary classification.
- Cons: Suffers from the vanishing gradient problem. For very large or very small inputs, the gradient approaches zero, making learning extremely slow. Also, its outputs are not zero-centered, which can slow down gradient descent convergence.
When to use: Primarily in the output layer for binary classification. Rarely used in hidden layers of modern networks.
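The vanishing-gradient behavior is visible directly from sigmoid's derivative, sigma'(z) = sigma(z) * (1 - sigma(z)), which peaks at 0.25 and decays toward zero for large |z|. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), squashes input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative: sigma(z) * (1 - sigma(z)), at most 0.25 (at z = 0).
    s = sigmoid(z)
    return s * (1.0 - s)

# At z = 0 the gradient is 0.25; at |z| = 10 it is nearly zero,
# so neurons driven into saturation learn extremely slowly.
```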
Tanh (Hyperbolic Tangent)
Tanh is a rescaled version of sigmoid that maps inputs to the range (-1, 1): tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)). Because its output is zero-centered, it generally trains faster than sigmoid in hidden layers.
- Pros: Zero-centered output, stronger gradients than sigmoid near zero.
- Cons: Still suffers from vanishing gradients at extreme values. Computationally more expensive than ReLU.
When to use: In recurrent neural networks where zero-centered outputs help with hidden state dynamics. Sometimes in hidden layers where bounded outputs are desired.
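The "rescaled sigmoid" relationship can be stated exactly: tanh(z) = 2*sigmoid(2z) - 1, which is why tanh's output lands in (-1, 1) and is centered at zero. A quick numerical check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 13)

# tanh is a rescaled, shifted sigmoid: tanh(z) = 2*sigmoid(2z) - 1.
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

# Its gradient at zero, 1 - tanh(0)^2 = 1, is four times sigmoid's 0.25,
# which is the "stronger gradients near zero" advantage.
```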
ReLU (Rectified Linear Unit)
ReLU is the most widely used activation function in modern deep learning: ReLU(z) = max(0, z). It simply outputs zero for negative inputs and the input itself for positive inputs.
- Pros: Computationally cheap (just a comparison). Does not saturate for positive inputs, avoiding vanishing gradients. Leads to sparse activations (many neurons output zero), which improves efficiency. Empirically trains faster than sigmoid or tanh.
- Cons: The dying ReLU problem: if a neuron's input is always negative, its gradient is always zero and it never learns again. Not zero-centered. Not differentiable at exactly zero (though this rarely matters in practice).
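ReLU and its (sub)gradient take only a couple of lines; the sketch below uses the common convention of assigning gradient 0 at exactly z = 0:

```python
import numpy as np

def relu(z):
    # max(0, z): negative inputs are clamped to zero, positives pass through.
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient convention: 0 for z <= 0, 1 for z > 0.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# relu(z)      -> [0, 0, 0, 0.5, 2]  (sparse: three of five outputs are zero)
# relu_grad(z) -> [0, 0, 0, 1, 1]    (no saturation on the positive side)
```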
Key Takeaway
ReLU is the default choice for hidden layers in most deep learning architectures. It is simple, fast, and avoids the vanishing gradient problem for positive inputs. Start with ReLU unless you have a specific reason to choose something else.
Leaky ReLU and Parametric ReLU
Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero gradient for negative inputs: LeakyReLU(z) = max(alpha*z, z) where alpha is typically 0.01. This ensures that neurons with negative inputs can still learn.
Parametric ReLU (PReLU) makes alpha a learnable parameter. The network learns the optimal slope for negative inputs during training, which can lead to slightly better performance at the cost of additional parameters.
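A minimal sketch of Leaky ReLU follows; here alpha is a fixed hyperparameter, whereas a PReLU layer would store it as a trainable parameter updated by backpropagation:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # For 0 < alpha < 1, max(alpha*z, z) equals z when z > 0
    # and alpha*z when z <= 0, so the negative side keeps a small slope.
    return np.where(z > 0, z, alpha * z)
```

Because the gradient for negative inputs is alpha rather than zero, a neuron whose inputs are all negative still receives a (small) learning signal.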
ELU (Exponential Linear Unit)
ELU uses an exponential curve for negative inputs: ELU(z) = z if z > 0, alpha*(exp(z) - 1) if z <= 0. This produces outputs that are closer to zero-centered than ReLU and avoids the dying neuron problem. However, the exponential computation makes it slightly slower than ReLU.
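A sketch of ELU, with a small numerical guard so the exponential branch never overflows for large positive inputs:

```python
import numpy as np

def elu(z, alpha=1.0):
    # Clamp the argument of exp so the unused branch of np.where
    # does not overflow for large positive z.
    neg = alpha * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return np.where(z > 0, z, neg)

# For z -> -inf, the output saturates at -alpha instead of dying at 0,
# and the mean activation sits closer to zero than ReLU's.
```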
GELU (Gaussian Error Linear Unit)
GELU has become the activation of choice in Transformer models, including BERT and GPT. It multiplies the input by the probability that it is positive under a standard normal distribution: GELU(z) = z * P(Z <= z) where Z follows N(0,1).
Unlike ReLU, which makes a hard cut at zero, GELU provides a smooth transition. This smoothness helps with optimization in very deep networks and attention mechanisms. GELU is the default in most modern language models.
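The exact form can be written with the error function, since the standard normal CDF is Phi(z) = 0.5 * (1 + erf(z / sqrt(2))). The sketch below also shows the tanh approximation that many frameworks use for speed:

```python
import math

def gelu(z):
    # Exact form: z * Phi(z), with Phi the standard normal CDF.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # Common tanh approximation; very close to the exact form.
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * z * (1.0 + math.tanh(c * (z + 0.044715 * z ** 3)))

# Unlike ReLU, mildly negative inputs still produce a small nonzero
# output, giving a smooth transition through zero.
```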
Swish and SiLU
Swish (also called SiLU) is defined as Swish(z) = z * sigmoid(z). Popularized by Google researchers who found it through automated architecture search, it has been shown to outperform ReLU on some deeper networks. Like GELU, it is smooth and non-monotonic (it dips slightly below zero for negative inputs), which appears to help optimization.
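Swish is a one-liner, and its non-monotonic dip is easy to observe numerically:

```python
import math

def swish(z):
    # Swish / SiLU: z * sigmoid(z) = z / (1 + exp(-z)).
    return z / (1.0 + math.exp(-z))

# Non-monotonic: the output dips below zero for moderate negative inputs
# (the minimum sits near z = -1.28) and returns toward 0 as z -> -inf,
# while behaving almost like the identity for large positive z.
```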
Softmax
Softmax is used almost exclusively in the output layer for multi-class classification. It converts a vector of raw scores (logits) into a probability distribution where all values are positive and sum to 1: softmax(z_i) = exp(z_i) / sum_j exp(z_j). Each output represents the probability of the corresponding class.
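In practice, implementations subtract the maximum logit before exponentiating. This leaves the result unchanged (softmax is invariant to adding a constant to every logit) but prevents exp from overflowing on large scores:

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability; the output is identical
    # because the shift cancels in the ratio.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs is all-positive and sums to 1; the largest logit gets
# the largest probability.
```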
Choosing the Right Activation Function
Here is a practical decision guide:
- Hidden layers (general): ReLU as default. Leaky ReLU or ELU if you encounter dying neurons.
- Hidden layers (Transformers): GELU or Swish.
- Hidden layers (RNNs): Tanh for hidden state, sigmoid for gates.
- Output layer (binary classification): Sigmoid.
- Output layer (multi-class): Softmax.
- Output layer (regression): Linear (no activation).
- Very deep networks: Swish or GELU, combined with batch normalization and skip connections.
Key Takeaway
The choice of activation function affects training speed, stability, and final performance. ReLU remains the safe default for CNNs and feedforward networks. GELU dominates Transformer architectures. Always match your output activation to your task type.
Common Mistakes
- Using sigmoid in hidden layers: This causes vanishing gradients and slow training. Use ReLU variants instead.
- Using ReLU in the output layer for classification: ReLU outputs are unbounded. Use sigmoid or softmax instead.
- Ignoring the dying ReLU problem: If training stalls, check if many neurons have zero output. Switch to Leaky ReLU or lower the learning rate.
- Overthinking the choice: For most problems, ReLU in hidden layers with the appropriate output activation works well. Only experiment with alternatives when baseline performance is unsatisfactory.
Activation functions are a small detail with an outsized impact. Getting them right is one of the foundations of successful deep learning, enabling networks to learn the rich, nonlinear representations that make modern AI possible.
