Why does softmax look like that?
Softmax is the function that turns a vector of arbitrary numbers into probabilities. The exponential in the middle isn't decorative — it's what makes the whole machine differentiable, well-behaved, and historically inevitable.
Why it exists
Picture ChatGPT guessing the next word in “I’m hungry, let’s order ___”.
Internally it produces a row of raw scores — say pizza: 8.2, sushi: 6.0,
gravel: -3.1, and so on for every word in its vocabulary. Those numbers
are unbounded and don’t add up to anything meaningful. Before the model can
roll dice and pick one, something has to convert them into actual
percentages — pizza 90%, sushi 10%, gravel 0.001%. Softmax is that
conversion. Every token a language model emits passes through it.
Anyone who has stared at the softmax formula has had the same small moment of suspicion:
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Why exp? Why not just take absolute values, or square, or shift everything
to be positive and divide by the sum? Any of those would also turn a vector
of arbitrary real numbers into something that sums to 1. The exponential
looks like a flourish — a particular choice where many would do.
It isn’t a flourish. The exp is doing several jobs at once that no other
elementary function does together, and once you see them, the formula stops
looking arbitrary and starts looking like the unique answer to a fairly
specific question: what’s the smoothest possible way to turn unbounded
“scores” into a probability distribution, in a way a gradient-descent
optimizer can actually train?
The neural-network framing is recent. The function itself is older — it’s the same shape as the Boltzmann distribution from 19th-century statistical mechanics, and the same shape as the multinomial logistic regression used in statistics for decades before deep learning showed up. The standard account is that softmax was inherited from those older traditions rather than invented for neural nets — but I don’t have a clean citation for the first paper that put it on top of a neural network classifier, so treat that lineage as “common knowledge in the field” rather than a sourced claim.
Why it matters now
Softmax is the last layer of nearly every classifier and the last layer of
every modern LLM.
When a language model picks the next token, it computes a vector of logits
— one real number per vocabulary entry — and softmax converts those into
the probability distribution you sample from. The temperature knob you
see in chat APIs is literally a divisor inside the softmax. Cross-entropy
loss, the function nearly every classifier is trained against, is defined
as −log of the softmax output for the correct class.
If softmax is wrong, training is wrong. If softmax is numerically unstable,
your loss explodes. If you swap it for something “simpler” without
understanding what exp was doing, you usually discover the model stops
learning. It’s load-bearing in a quiet way.
The short answer
softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
It’s the function that takes a vector of real numbers (positive, negative,
unbounded) and returns a probability distribution where each entry is
proportional to exp of its score. The exp makes everything positive,
makes the ratios depend only on differences between scores, and makes the
gradients clean. There isn’t a simpler function that does all three.
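Spelled out as code, that's all there is to it. A minimal NumPy sketch (ignoring the numerical-stability point discussed below; the function name is mine, not from any library):

```python
import numpy as np

def softmax(z):
    """Exponentiate each score, then normalize so the results sum to 1."""
    e = np.exp(np.asarray(z, dtype=float))
    return e / e.sum()

scores = np.array([8.2, 6.0, -3.1])   # pizza, sushi, gravel from the example above
print(softmax(scores))                 # approximately [0.90, 0.10, 0.00001]
print(softmax(scores).sum())           # 1.0
```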
How it works
The cleanest way to see why softmax has the shape it does is to ask what constraints we want any “scores → probabilities” function to satisfy, and then notice that those constraints almost completely pin down the answer.
The four constraints
- Output is a probability distribution. All entries non-negative, sum to 1.
- Strictly increasing in each score. Raising z_i should raise p_i, never lower it.
- Translation invariant. Adding the same constant to every score shouldn’t change the output. (If all scores go up by 5, no class became more likely relative to the others.)
- Smooth and differentiable everywhere. This is the gradient-descent constraint. We want to backpropagate through it.
The first constraint suggests dividing by the sum: p_i = f(z_i) / Σ_j f(z_j) for some non-negative f. The second says f must be increasing.
The third — translation invariance — is the surprisingly strong one.
Translation invariance says f(z_i + c) / f(z_j + c) must equal f(z_i) / f(z_j) for all c. Equivalently, the ratio f(z + c) / f(z) can't depend on z, so f(z + c) = g(c) · f(z) for some function g. The only continuous increasing function satisfying that multiplicative property under addition is the exponential: f(z) = exp(α·z) for some α > 0, up to a constant factor that cancels in the normalization. (This is essentially the “Cauchy functional equation” in disguise.)
So the exp isn’t a stylistic pick — it’s forced by wanting the function
to depend only on score differences, which is what we mean by “the
absolute scale of logits is meaningless, only their gaps matter.” The free
parameter α is exactly the inverse temperature: softmax_T(z)_i = exp(z_i / T) / Σ_j exp(z_j / T).
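Both claims in that paragraph are easy to check numerically. A small NumPy sketch (the function name and defaults are mine): shifting every logit by a constant leaves the output unchanged, and scaling the logits by α = 1/T is exactly softmax with temperature T.

```python
import numpy as np

def softmax(z, T=1.0):
    """exp(z / T) normalized to sum to 1; T = 1 is plain softmax."""
    e = np.exp(np.asarray(z, dtype=float) / T)
    return e / e.sum()

z = np.array([8.2, 6.0, -3.1])

# Translation invariance: adding the same constant to every logit
# cancels in the ratio exp(z_i + c) / sum_j exp(z_j + c).
print(np.allclose(softmax(z), softmax(z + 5.0)))         # True

# The free parameter alpha is the inverse temperature:
# scaling the logits by 2 is the same as using T = 0.5.
print(np.allclose(softmax(2.0 * z), softmax(z, T=0.5)))  # True
```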
Why translation invariance matters in practice
This is the part where the seam shows. Logits coming out of a neural
network can be enormous — exp(1000) overflows a 32-bit float
instantly. But because softmax is translation invariant, you can subtract
max(z) from every entry before exponentiating without changing the
answer. The largest input becomes 0, exp(0) = 1, everything else is
between 0 and 1, no overflow. Every production softmax implementation does
this. It’s a free numerical-stability trick that falls out of the same
property that justified the exp in the first place.
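In code, the trick is one line. A minimal sketch of the standard pattern (not pulled from any particular library's source):

```python
import numpy as np

def stable_softmax(z):
    """Softmax with the standard max-subtraction trick."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()        # largest entry becomes 0, so exp() never overflows
    e = np.exp(shifted)
    return e / e.sum()

huge = np.array([1000.0, 999.0, 998.0])
# A naive exp(1000) overflows to inf and the division produces nan;
# the shifted version returns the correct distribution.
print(stable_softmax(huge))      # approximately [0.665, 0.245, 0.090]
```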
The gradient that made it stick
Even if you accepted some other positive-and-increasing function, softmax has one more property that explains why the field standardized on it: when you compose softmax with cross-entropy loss, the gradient simplifies to something almost embarrassingly clean.
If p = softmax(z) and the loss is L = −log p_y (cross-entropy with the
correct class y), then:
∂L/∂z_i = p_i − 1[i = y]
That’s it. The gradient with respect to the logits is just “predicted
probability minus 1 for the right class.” No exponentials in the gradient,
no chain-rule explosion, no special cases. Every modern deep-learning
framework exploits this by fusing softmax and cross-entropy into a single
op (log_softmax + nll_loss, or softmax_cross_entropy_with_logits)
because computing them separately is both slower and less numerically
stable.
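You can watch the identity hold in a few lines of PyTorch (assuming PyTorch is available; F.cross_entropy is the fused op mentioned above):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[8.2, 6.0, -3.1]], requires_grad=True)
target = torch.tensor([0])                  # index of the correct class y

loss = F.cross_entropy(logits, target)      # fused log_softmax + nll_loss
loss.backward()

p = torch.softmax(logits.detach(), dim=-1)
one_hot = F.one_hot(target, num_classes=3).float()
print(logits.grad)                          # gradient computed by autograd
print(p - one_hot)                          # p_i - 1[i = y]: the same numbers
```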
That clean gradient is, I’d argue, the single biggest reason softmax survived the transition from statistics to deep learning. Other positive, increasing functions can be normalized into a probability distribution, but none of them pair with cross-entropy to give a gradient this simple.
The Boltzmann connection
The same formula p_i ∝ exp(−E_i / kT) describes the probability of
finding a thermodynamic system in state i with energy E_i at
temperature T. The mapping is exact: a logit is a negative energy, and
the softmax temperature is the physical temperature. High temperature →
the distribution flattens (all states roughly equally likely). Low
temperature → it concentrates on the lowest-energy (highest-logit) state.
At T → 0, softmax becomes argmax.
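A quick sweep makes both limits visible. A NumPy sketch, reusing the stable form from the previous section:

```python
import numpy as np

def softmax_T(z, T):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                 # max-subtraction trick from above
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
for T in [100.0, 1.0, 0.1, 0.01]:
    print(T, softmax_T(logits, T).round(4))
# T = 100  -> nearly uniform (high temperature flattens the distribution)
# T = 0.01 -> essentially one-hot on the largest logit (argmax)
```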
The standard account in physics derives the Boltzmann distribution by maximizing entropy subject to a constraint on expected energy. Run the same derivation in ML — what’s the maximum-entropy distribution consistent with these expected sufficient statistics? — and you get softmax. Same math, different vocabulary.
Famous related terms
- Sigmoid — sigmoid(x) = 1 / (1 + exp(−x)) — softmax for two classes. Falls out of softmax over {x, 0} after cancellation. Same exponential, same translation-invariance argument (see the quick check after this list).
- Logit — logit(p) = log(p / (1 − p)) — the inverse of sigmoid. The “raw scores” that go into softmax are called logits because in the binary case they literally are logits.
- Cross-entropy loss — H(p, q) = −Σ p(x) log q(x) — the loss function softmax was made for; see entropy.
- Temperature — softmax_T(z) = softmax(z / T) — the knob that controls how peaked or flat the output distribution is. See temperature.
- Boltzmann distribution — p_i ∝ exp(−E_i / kT) — the physics ancestor. Maximum-entropy distribution given an energy constraint.
- Argmax — softmax at T → 0 — the non-differentiable function softmax exists to smooth out.
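The sigmoid item is worth verifying once. A short NumPy check that softmax over the two scores {x, 0} reduces to sigmoid(x):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# exp(x) / (exp(x) + exp(0)) = 1 / (1 + exp(-x))
for x in [-3.0, 0.0, 2.5]:
    print(np.isclose(softmax(np.array([x, 0.0]))[0], sigmoid(x)))   # True
```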
Going deeper
- Bridle, J. S. (1990), Probabilistic interpretation of feedforward classification network outputs — one of the early treatments of softmax in a neural-net classifier. (I’m naming this from memory of the literature; I haven’t re-verified the title and year against the original — treat as a starting point, not a citation.)
- Goodfellow, Bengio, Courville, Deep Learning — chapter 6 walks through the softmax + cross-entropy pairing and the clean gradient.
- Jaynes, E. T., Probability Theory: The Logic of Science — for the maximum-entropy derivation that connects softmax to Boltzmann via a constrained optimization rather than via physics.
A note on what I’m sure of. The translation-invariance argument forcing exp is a standard piece of the exponential-family story, and the clean softmax–cross-entropy gradient is straightforward calculus. The historical claim about who first used softmax on a neural network is the part I’d most want a real source for and don’t have one on hand; the broader “this came from statistical mechanics and multinomial logistic regression” framing is the standard account, but I haven’t traced individual citations.