Activation Functions: ReLU, Sigmoid, Tanh, and Softmax
Abhay
Imagine stacking ten layers of neurons, training for hours, burning through GPU credits — and ending up with something mathematically indistinguishable from a single straight line. That’s not a horror story; it’s exactly what happens if you forget the activation functions. They are the small, unglamorous nonlinear twists that turn a tall stack of linear algebra into something that can actually learn. Let’s meet the usual suspects.
Why nonlinearity is the whole game
A neural layer, stripped down, just multiplies inputs by weights and adds a bias. That’s a linear operation (strictly, an affine one once the bias is included). And here’s the inconvenient truth: chaining linear operations together gives you… another linear operation. Multiply ten matrices in a row and the result is still one matrix, and the biases fold into a single bias as well. So a ten-layer network with no activation functions has the exact same expressive power as a single layer. All that depth, collapsed into a single straight line through your data.
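To see the collapse concretely, here is a minimal NumPy sketch. The layer sizes, the random weights, and the names `W_single` and `b_single` are illustrative assumptions, not anything from a real model. It builds two activation-free layers and then the single layer that reproduces them exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between (sizes chosen arbitrarily)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# The equivalent single layer: one weight matrix and one bias
W_single = W2 @ W1
b_single = W2 @ b1 + b2

x = rng.normal(size=3)
two_layer_out = W2 @ (W1 @ x + b1) + b2
one_layer_out = W_single @ x + b_single

print(np.allclose(two_layer_out, one_layer_out))  # True
```

The same folding works for any number of stacked layers, which is why depth alone buys nothing without a nonlinearity in between.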
Activation functions break this curse. By inserting a nonlinear squish between layers, each layer gets to bend space in its own way, and the stack can finally approximate curves, decision boundaries, and the wonderfully messy patterns of the real world. No nonlinearity, no deep learning. Full stop.
Sigmoid and tanh: the elegant elders
Sigmoid squashes any number into the range (0, 1) with a graceful S-curve. For years it was the activation function, partly because that output looks like a probability. Tanh is its better-centered cousin, mapping inputs to (-1, 1), which keeps activations symmetric around zero and usually trains a little nicer.
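For reference, both functions fit in a few lines of NumPy. This is a plain sketch rather than a hardened implementation: a production version would usually guard against overflow in `np.exp` for very negative inputs.

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^-x): mathematically always between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # NumPy ships tanh directly; outputs lie between -1 and 1
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x))  # centered on 0.5
print(tanh(x))     # centered on 0.0

# The family resemblance is literal: tanh is a rescaled, shifted sigmoid
print(np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```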
Both, however, have a fatal flaw in deep networks: saturation. Feed a sigmoid a large positive or negative number and it pins to 1 or 0, where the curve goes nearly flat. Flat curve means a gradient near zero. And when you backpropagate through many such layers, those tiny gradients multiply together into something microscopic — the dreaded vanishing gradient problem. The early layers receive almost no learning signal and effectively freeze. Your network stops improving, and you wonder why.
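A rough back-of-the-envelope sketch shows how quickly this shrinks. The helper name `sigmoid_grad` and the depth of 10 are assumptions for illustration. It also ignores the weight matrices, which scale the real gradient too, so treat it as intuition rather than a precise model of backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s), which peaks at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the largest the slope ever gets
print(sigmoid_grad(10.0))  # about 4.5e-05, deep in the saturated region

# Best case: every one of 10 layers sits exactly at the peak slope
depth = 10
best_case = 0.25 ** depth
print(best_case)           # about 9.5e-07
```

Even in the most favorable case the activation slopes alone shrink the signal by roughly a factor of a million over ten layers, and saturated units make it far worse.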
ReLU: brutally simple, wildly effective
Enter the Rectified Linear Unit, the function that quietly powered the deep learning boom. Its definition is almost insultingly basic:
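import numpy as np

def relu(x):
    # Keep positive values, clamp negatives to zero
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Keep positive values, scale negatives by a small slope alpha
    return np.where(x > 0, x, alpha * x)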
If the input is positive, pass it straight through. If it’s negative, output zero. That’s it. Because its gradient is exactly 1 for positive inputs, ReLU doesn’t saturate on the positive side, so gradients flow happily through deep stacks. It’s also dirt cheap to compute — no exponentials, just a comparison. Faster training, better convergence, fewer headaches.
But ReLU has a quirk of its own: the dying ReLU problem. If a neuron’s weights drift such that it always receives negative input, it outputs zero forever, its gradient is zero, and it never updates again. It’s flatlined — a dead neuron taking up space. When chunks of your network die, capacity quietly evaporates.
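Here is a small sketch of what a dead neuron looks like in numbers. The batch `X`, the weights `w`, and the exaggerated bias of -50 are all made up for illustration. In practice a neuron usually ends up here through an oversized update rather than a hand-picked bias.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))     # hypothetical batch of inputs

w = np.array([0.5, -0.3, 0.2])     # illustrative weights
b = -50.0                          # bias pushed far into negative territory

z = X @ w + b                      # pre-activation: negative for every sample
a = np.maximum(0, z)               # ReLU output
relu_grad = (z > 0).astype(float)  # ReLU slope: 1 where z > 0, else 0

print(a.max())          # 0.0
print(relu_grad.sum())  # 0.0

# Whatever gradient arrives from the loss, the ReLU slope zeroes it out
upstream = rng.normal(size=1000)
grad_w = X.T @ (upstream * relu_grad)
print(np.all(grad_w == 0))  # True: the weights get no update
```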
The fix is to give negatives a small escape hatch. Leaky ReLU lets a sliver of the negative signal through (that alpha * x above), so the gradient on the negative side is alpha rather than zero. GELU (Gaussian Error Linear Unit) goes smoother still: it multiplies each input by the standard normal CDF evaluated at that input, so values are scaled down gradually instead of being chopped off at a hard corner. GELU is the activation used in heavyweight architectures like BERT and the GPT series, where its gentle curve is often reported to edge out plain ReLU on test error.
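GELU is short enough to write out by hand. The sketch below shows the exact form, x times the standard normal CDF, next to the tanh-based approximation that many implementations use. The function names `gelu_exact` and `gelu_tanh` are my own labels, and wrapping `math.erf` with `np.vectorize` is a convenience assumption to avoid pulling in SciPy.

```python
import math
import numpy as np

# math.erf only accepts scalars, so wrap it to work elementwise on arrays
_erf = np.vectorize(math.erf)

def gelu_exact(x):
    # x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh-based approximation of the same curve
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(inner))

x = np.linspace(-4.0, 4.0, 9)
print(gelu_exact(x))
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # a small gap
```

Unlike ReLU, the output for a mildly negative input is small but nonzero, so the neuron keeps a gradient there.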
Softmax: turning scores into a verdict
The functions above mostly live in hidden layers. Softmax is a specialist for the output of a multiclass classifier. Given a vector of raw scores (logits), it exponentiates each, then normalizes so they sum to 1, producing a clean probability distribution across your classes. Ask “is this image a cat, dog, or hedgehog?” and softmax answers “71% dog, 22% cat, 7% hedgehog.” The exponentiation exaggerates the leader, making the network commit to a confident pick while staying differentiable for training.
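A minimal softmax sketch follows. The logits are hypothetical values I picked so the result lands on the dog, cat, hedgehog split above. Subtracting the maximum before exponentiating is a standard stability trick: it leaves the result unchanged but keeps `np.exp` from overflowing on large scores.

```python
import numpy as np

def softmax(logits):
    # Shift by the max so the largest exponent is e^0 = 1 (avoids overflow)
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Hypothetical raw scores for dog, cat, hedgehog
logits = np.array([2.0, 0.83, -0.32])
probs = softmax(logits)

print(np.round(probs, 2))  # [0.71 0.22 0.07]
print(probs.sum())         # sums to 1, up to floating-point rounding
```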
For a binary yes/no output, you don’t even need softmax — a single sigmoid does the job, since one probability implies the other.
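A quick check of that equivalence, as a sketch: a sigmoid on one logit z gives the same probability as a two-class softmax over the scores [z, 0]. The value 1.7 is an arbitrary example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = 1.7  # arbitrary example logit for the "yes" class
p_sigmoid = sigmoid(z)
p_softmax = softmax(np.array([z, 0.0]))[0]

print(np.isclose(p_sigmoid, p_softmax))  # True
```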
A practical which-to-use-where guide
You rarely need to agonize. The field has converged on a few reliable defaults:
- Hidden layers: Reach for ReLU first — fast, simple, hard to beat. If you see dead neurons or you’re training a transformer, upgrade to GELU (or Leaky ReLU as a cheap middle ground).
- Binary classification output: Sigmoid — one neuron, one probability.
- Multiclass classification output: Softmax — one neuron per class, probabilities that sum to 1.
- Regression output: Often no activation at all — you want a raw, unbounded number.
- Tanh: Mostly retired to hidden layers in specific recurrent setups; rarely your first choice today.
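Put together, the defaults for a multiclass classifier look roughly like this in NumPy. It is a forward pass only, with random untrained weights and layer sizes (4 inputs, 8 hidden units, 3 classes) invented for illustration. The `forward` function and `params` tuple are hypothetical names.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(logits):
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def forward(x, params):
    W1, b1, W2, b2 = params
    hidden = relu(x @ W1 + b1)  # hidden layer: ReLU
    logits = hidden @ W2 + b2   # output layer: raw class scores
    return softmax(logits)      # multiclass output: softmax

rng = np.random.default_rng(0)
params = (
    rng.normal(size=(4, 8)), np.zeros(8),  # input -> hidden
    rng.normal(size=(8, 3)), np.zeros(3),  # hidden -> 3 classes
)

batch = rng.normal(size=(5, 4))            # 5 hypothetical samples
probs = forward(batch, params)

print(probs.shape)        # (5, 3)
print(probs.sum(axis=1))  # each row sums to 1
```

For regression you would return `logits` directly, and for a binary output you would shrink the last layer to one unit and apply a sigmoid instead.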
The one-line takeaway: ReLU (or GELU) in the hidden layers, sigmoid or softmax at the output, and never, ever leave them out — because a deep network without activation functions is just an expensive way to draw a straight line.
Sources: Towards Data Science — The Dying ReLU Problem, Prodia — GELU vs ReLU, Towards AI — Is GELU the ReLU successor?, GeeksforGeeks — ReLU Activation Function.