Why does softmax look like that?
Softmax is the function that turns a vector of arbitrary numbers into probabilities. The exponential in the middle isn't decorative — it's what makes the whole machine differentiable, well-behaved, and historically inevitable.
Why it exists
Picture ChatGPT guessing the next word in “I’m hungry, let’s order ___”.
Internally it produces a row of raw scores — say pizza: 8.2, sushi: 6.0,
gravel: -3.1, and so on for every word in its vocabulary. Those numbers
are unbounded and don’t add up to anything meaningful. Before the model can
roll dice and pick one, something has to convert them into actual
percentages — pizza 90%, sushi 10%, gravel 0.001%. Softmax is that
conversion. Every token a language model emits passes through it.
Anyone who has stared at the softmax formula has had the same small moment of suspicion:
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Why exp? Why not just take absolute values, or square, or shift everything
to be positive and divide by the sum? Any of those would also turn a vector
of arbitrary real numbers into something that sums to 1. The exponential
looks like a flourish — a particular choice where many would do.
It isn’t a flourish. The exp is doing several jobs at once that no other
elementary function does together, and once you see them, the formula stops
looking arbitrary and starts looking like the unique answer to a fairly
specific question: what’s the smoothest possible way to turn unbounded
“scores” into a probability distribution, in a way a gradient-descent
optimizer can actually train?
The neural-network framing is recent. The function itself is older — it’s the same shape as the Boltzmann distribution from 19th-century statistical mechanics, and the same shape as the multinomial logistic regression used in statistics for decades before deep learning showed up. The standard account is that softmax was inherited from those older traditions rather than invented for neural nets — but I don’t have a clean citation for the first paper that put it on top of a neural network classifier, so treat that lineage as “common knowledge in the field” rather than a sourced claim.
Why it matters now
Softmax is the last layer of nearly every classifier and the last layer of
every modern LLM.
When a language model picks the next token, it computes a vector of logits
— one real number per vocabulary entry — and softmax converts those into
the probability distribution you sample from. The temperature knob you
see in chat APIs is literally a divisor inside the softmax. Cross-entropy
loss, the function nearly every classifier is trained against, is defined
as −log of the softmax output for the correct class.
If softmax is wrong, training is wrong. If softmax is numerically unstable,
your loss explodes. If you swap it for something “simpler” without
understanding what exp was doing, you usually discover the model stops
learning. It’s load-bearing in a quiet way.
The short answer
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
It’s the function that takes a vector of real numbers (positive, negative,
unbounded) and returns a probability distribution where each entry is
proportional to exp of its score. The exp makes everything positive,
makes the ratios depend only on differences between scores, and makes the
gradients clean. There isn’t a simpler function that does all three.
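Spelled out as code, that's all there is to it. A minimal NumPy sketch (ignoring the numerical-stability point discussed below; the function name is mine, not from any library):

```python
import numpy as np

def softmax(z):
    """Exponentiate each score, then normalize so the results sum to 1."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

scores = np.array([8.2, 6.0, -3.1])   # pizza, sushi, gravel from the example above
print(softmax(scores))                 # approximately [0.90, 0.10, 0.00001]
print(softmax(scores).sum())           # 1.0
```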
How it works
The cleanest way to see why softmax has the shape it does is to ask what constraints we want any “scores → probabilities” function to satisfy, and then notice that those constraints almost completely pin down the answer.
The four constraints
- Output is a probability distribution. All entries non-negative, sum to 1.
- Strictly increasing in each score. Raising z_i should raise p_i, never lower it.
- Translation invariant. Adding the same constant to every score shouldn’t change the output. (If all scores go up by 5, no class became more likely relative to the others.)
- Smooth and differentiable everywhere. This is the gradient-descent constraint. We want to backpropagate through it.
The first constraint suggests dividing by the sum: p_i = f(z_i) / Σ_j f(z_j) for some non-negative f. The second says f must be increasing.
The third — translation invariance — is the surprisingly strong one.
Translation invariance says f(z_i + c) / f(z_j + c) must equal f(z_i) / f(z_j) for all c. Equivalently, the ratio f(z + c) / f(z) can't depend on z, so f(z + c) = g(c) · f(z) for some function g. The only continuous increasing function satisfying that multiplicative property under addition is the exponential: f(z) = exp(α·z) for some α > 0, up to a constant factor that cancels in the normalization. (This is essentially the “Cauchy functional equation” in disguise.)
So the exp isn’t a stylistic pick — it’s forced by wanting the function
to depend only on score differences, which is what we mean by “the
absolute scale of logits is meaningless, only their gaps matter.” The free
parameter α is exactly the inverse temperature: softmax_T(z)_i = exp(z_i / T) / Σ_j exp(z_j / T).
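Both claims in that paragraph are easy to check numerically. A small NumPy sketch (the function name and defaults are mine): shifting every logit by a constant leaves the output unchanged, and scaling the logits by α = 1/T is exactly softmax with temperature T.

```python
import numpy as np

def softmax(z, T=1.0):
    """exp(z / T) normalized to sum to 1; T = 1 is plain softmax."""
    e = np.exp(np.asarray(z, dtype=float) / T)
    return e / e.sum()

z = np.array([8.2, 6.0, -3.1])

# Translation invariance: adding the same constant to every logit
# cancels in the ratio exp(z_i + c) / sum_j exp(z_j + c).
print(np.allclose(softmax(z), softmax(z + 5.0)))         # True

# The free parameter alpha is the inverse temperature:
# scaling the logits by 2 is the same as using T = 0.5.
print(np.allclose(softmax(2.0 * z), softmax(z, T=0.5)))  # True
```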
Why translation invariance matters in practice
This is the part where the seam shows. Logits coming out of a neural
network can be enormous — exp(1000) overflows a 32-bit float
instantly. But because softmax is translation invariant, you can subtract
max(z) from every entry before exponentiating without changing the
answer. The largest input becomes 0, exp(0) = 1, everything else is
between 0 and 1, no overflow. Every production softmax implementation does
this. It’s a free numerical-stability trick that falls out of the same
property that justified the exp in the first place.
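In code, the trick is one line. A minimal sketch of the standard pattern (not pulled from any particular library's source):

```python
import numpy as np

def stable_softmax(z):
    """Softmax with the standard max-subtraction trick."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()        # largest entry becomes 0, so exp() never overflows
    e = np.exp(shifted)
    return e / e.sum()

huge = np.array([1000.0, 999.0, 998.0])
# A naive exp(1000) overflows to inf and the division produces nan;
# the shifted version returns the correct distribution.
print(stable_softmax(huge))      # approximately [0.665, 0.245, 0.090]
```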
The gradient that made it stick
Even if you accepted some other positive-and-increasing function, softmax has one more property that explains why the field standardized on it: when you compose softmax with cross-entropy loss, the gradient simplifies to something almost embarrassingly clean.
If p = softmax(z) and the loss is L = −log p_y (cross-entropy with the
correct class y), then:
∂L/∂z_i = p_i − 1[i = y]
That’s it. The gradient with respect to the logits is just “predicted
probability minus 1 for the right class.” No exponentials in the gradient,
no chain-rule explosion, no special cases. Every modern deep-learning
framework exploits this by fusing softmax and cross-entropy into a single
op (log_softmax + nll_loss, or softmax_cross_entropy_with_logits)
because computing them separately is both slower and less numerically
stable.
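You can watch the identity hold in a few lines of PyTorch (assuming PyTorch is available; F.cross_entropy is the fused op mentioned above):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[8.2, 6.0, -3.1]], requires_grad=True)
target = torch.tensor([0])                  # index of the correct class y

loss = F.cross_entropy(logits, target)      # fused log_softmax + nll_loss
loss.backward()

p = torch.softmax(logits.detach(), dim=-1)
one_hot = F.one_hot(target, num_classes=3).float()
print(logits.grad)                          # gradient computed by autograd
print(p - one_hot)                          # p_i - 1[i = y]: the same numbers
```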
That clean gradient is, I’d argue, the single biggest reason softmax survived the transition from statistics to deep learning. Other positive, increasing functions can be normalized into a probability distribution, but none of them pair with cross-entropy to give a gradient this simple.
The Boltzmann connection
The same formula p_i ∝ exp(−E_i / kT) describes the probability of
finding a thermodynamic system in state i with energy E_i at
temperature T. The mapping is exact: a logit is a negative energy, and
the softmax temperature is the physical temperature. High temperature →
the distribution flattens (all states roughly equally likely). Low
temperature → it concentrates on the lowest-energy (highest-logit) state.
At T → 0, softmax becomes argmax.
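A quick sweep makes both limits visible. A NumPy sketch, reusing the stable form from the previous section:

```python
import numpy as np

def softmax_T(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                 # max-subtraction trick from above
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
for T in [100.0, 1.0, 0.1, 0.01]:
    print(T, softmax_T(logits, T).round(4))
# T = 100  -> nearly uniform (high temperature flattens the distribution)
# T = 0.01 -> essentially one-hot on the largest logit (argmax)
```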
The standard account in physics derives the Boltzmann distribution by maximizing entropy subject to a constraint on expected energy. Run the same derivation in ML — what’s the maximum-entropy distribution consistent with these expected sufficient statistics? — and you get softmax. Same math, different vocabulary.
Famous related terms
- Sigmoid — sigmoid(x) = 1 / (1 + exp(−x)) — softmax for two classes. Falls out of softmax over {x, 0} after cancellation. Same exponential, same translation-invariance argument (see the quick check after this list).
- Logit — logit(p) = log(p / (1 − p)) — the inverse of sigmoid. The “raw scores” that go into softmax are called logits because in the binary case they literally are logits.
- Cross-entropy loss — H(p, q) = −Σ p(x) log q(x) — the loss function softmax was made for; see entropy.
- Temperature — softmax_T(z) = softmax(z / T) — the knob that controls how peaked or flat the output distribution is. See temperature.
- Boltzmann distribution — p_i ∝ exp(−E_i / kT) — the physics ancestor. Maximum-entropy distribution given an energy constraint.
- Argmax — softmax at T → 0 — the non-differentiable function softmax exists to smooth out.
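The sigmoid item is worth verifying once. A short NumPy check that softmax over the two scores {x, 0} reduces to sigmoid(x):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# exp(x) / (exp(x) + exp(0)) = 1 / (1 + exp(-x))
for x in [-3.0, 0.0, 2.5]:
    print(np.isclose(softmax(np.array([x, 0.0]))[0], sigmoid(x)))   # True
```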
Going deeper
- Bridle, J. S. (1990), Probabilistic interpretation of feedforward classification network outputs — one of the early treatments of softmax in a neural-net classifier. (I’m naming this from memory of the literature; I haven’t re-verified the title and year against the original — treat as a starting point, not a citation.)
- Goodfellow, Bengio, Courville, Deep Learning — chapter 6 walks through the softmax + cross-entropy pairing and the clean gradient.
- Jaynes, E. T., Probability Theory: The Logic of Science — for the maximum-entropy derivation that connects softmax to Boltzmann via a constrained optimization rather than via physics.
A note on what I’m sure of. The translation-invariance argument forcing exp is a standard piece of the exponential-family story, and the clean softmax–cross-entropy gradient is straightforward calculus. The historical claim about who first used softmax on a neural network is the part I’d most want a real source for and don’t have one on hand; the broader “this came from statistical mechanics and multinomial logistic regression” framing is the standard account, but I haven’t traced individual citations.