Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

What is a neural network?

A pile of multiplications and a 'how wrong was I?' signal — somehow, when you stack enough of them, the thing learns to read, see, and play chess.

AI & ML intro Apr 30, 2026

Why it exists

For most of computing history, getting a machine to do something useful meant writing the rules down. If you wanted a program that recognized handwritten digits, you sat down and tried to describe what makes a 7 a 7. Slanted top stroke. Sometimes a crossbar. By the time you finished, your code didn’t generalize, and someone’s grandmother wrote a 7 you’d never seen before and broke it.

A whole category of problems — recognizing faces, reading speech, telling spam from not-spam, predicting the next word — refused to yield to hand-coded rules. Not because the rules don’t exist. Because there are millions of them, they’re tangled, and no human can enumerate them faster than reality invents new edge cases.

Neural networks exist because you don’t have to know the rules. You only need examples. Show a flexible enough function enough labeled examples, give it a way to measure how wrong it is, and let it nudge itself toward less wrong. The rules fall out as a side effect — encoded in millions of small numbers nobody has to interpret. Trade “I understand exactly what my program does” for “my program does things I couldn’t have written by hand.” The twenty-first century mostly took the second deal.

Why it matters now

Almost everything currently labeled “AI” is a neural network somewhere underneath. LLMs, image generators, speech recognition, recommendation feeds, fraud detection — all neural networks, differing mostly in shape, size, and what they were trained on.

You need a picture of one because the vocabulary of the field assumes you have it. Weights, layers, training, fine-tuning, gradients, parameters aren’t metaphors — they’re the literal mechanical parts. With the picture, almost everything else in modern ML becomes “okay, but bigger” or “okay, but with a clever twist.”

The short answer

neural network = stacked layers of (linear transform + non-linear activation), trained by gradient descent on a loss

A neural net is a long pipeline of “multiply by some numbers, add some numbers, then bend the result.” Start with random numbers, feed in examples, measure how wrong the output is, and adjust the numbers slightly toward less-wrong. Repeat a few billion times. The numbers that survive are the model.

How it works

Start with a single neuron:

output = activation( w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b )

Take the inputs, multiply each by a weight, add them up, add a bias, then pass the result through a non-linear activation like ReLU, which clips negatives to zero. Without that non-linearity, you could collapse a hundred-layer network into one matrix multiply and learn nothing interesting. The kink is what makes depth pay off.
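The formula above can be sketched directly in a few lines of NumPy (the numbers here are arbitrary, just to make the arithmetic concrete):

```python
import numpy as np

def relu(z):
    # ReLU: clip negatives to zero -- the non-linearity
    return np.maximum(0.0, z)

def neuron(x, w, b):
    # weighted sum of inputs, plus a bias, then bend the result
    return relu(np.dot(w, x) + b)

x = np.array([1.0, -2.0, 0.5])   # inputs
w = np.array([0.4, 0.3, -0.5])   # weights
b = 0.1                          # bias
print(neuron(x, w, b))           # relu(-0.35) = 0.0
```

Note how the negative pre-activation gets clipped to zero: that is the "kink" doing its job.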

Now stack them. A layer is many neurons in parallel, each with its own weights. A network is layers feeding into layers. The simplest shape is the MLP: input vector in, a few fully-connected hidden layers, output out. Each layer can only do something simple, but stacks of simple bends can approximate any reasonable function. The universal approximation theorem behind that is real; the practical question is always whether we can find the right weights, not whether they exist.
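Stacking is literally a loop over (matrix multiply, bend). A minimal MLP forward pass, with made-up layer sizes and random weights standing in for trained ones:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    # layers: list of (W, b) pairs; each is one linear transform plus a bend
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]            # last layer: raw scores, no activation
    return W @ x + b

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(16, 4)), np.zeros(16)),   # hidden layer: 4 -> 16
    (rng.normal(size=(3, 16)), np.zeros(3)),    # output layer: 16 -> 3
]
scores = mlp_forward(rng.normal(size=4), layers)
print(scores.shape)  # (3,)
```

With random weights the output is meaningless; training is what turns this pile of numbers into a function that does something.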

Training is a loop:

1. Forward pass. Feed an input through the network and read off the prediction.

2. Loss. A loss function turns “predicted vs. actual” into a single number — squared error for regression, cross-entropy for classification, next-token cross-entropy for an LLM.
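Both losses named above reduce to one line each. A sketch (the stability shift in the cross-entropy is a standard trick, not anything specific to this post):

```python
import numpy as np

def squared_error(pred, target):
    # regression loss: mean of squared differences
    return np.mean((pred - target) ** 2)

def cross_entropy(logits, label):
    # classification loss: minus the log-probability of the true class
    z = logits - logits.max()               # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

print(squared_error(np.array([1.0, 2.0]), np.array([1.5, 2.0])))  # 0.125
print(cross_entropy(np.array([2.0, 0.5, -1.0]), label=0))
```

Either way, "predicted vs. actual" collapses to a single number the rest of the loop can differentiate.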

3. Backward pass. The loss depends on the last layer’s weights, which depend on the previous layer’s weights, and so on. The chain rule lets you push “how does the loss change if I nudge this weight?” backward through every layer in one sweep, producing a gradient for every weight. Backprop is the efficient bookkeeping for that calculation; without it, training deep nets would be combinatorially hopeless.

4. Step. Move each weight a little against its gradient: w ← w − η · ∂L/∂w. Do that on batch after batch, for a few million batches. The weights drift toward a configuration where the loss is small.

That’s the entire algorithm. Pick a shape, pick a loss, initialize randomly, forward / loss / backward / step. Everything fancier in modern deep learning — convolutions, attention, residual connections, Adam — is a refinement of one of those pieces, not a replacement.

Two things worth flagging. First, the “knowledge” is just numbers — a big tensor of floats with no symbolic representation you can read off, which is why interpretability is its own active field. Second, gradient descent shouldn’t really work this well. The loss surface of a big network is wildly non-convex, and yet plain SGD consistently finds solutions that generalize. Why it works so well is honestly not fully understood; there are partial explanations (overparameterization, implicit regularization, high-dimensional geometry) but no clean closed story.

Going deeper

A note on what I’m sure of: the mechanical story (neurons, layers, forward/loss/backward, gradient descent) is established and hasn’t changed in decades. Why deep nets generalize as well as they do, and what their internal representations actually mean, are still open research questions — treat confident-sounding stories about either with appropriate suspicion, including mine.