AI & ML
61 posts in this domain.
- What does 'X parameters' mean in an LLM? Llama 3.1 70B, DeepSeek-V3 671B, Phi-4 14B — what is that number actually counting, and why is it the headline figure on every model release? May 4, 2026 · intro
- How can I tell when an LLM is making the answer up? True answers and fabricated ones come out of the same pipe, in the same tone. There's no red light. But there are seams — places hallucinations cluster, shapes they tend to take, tells you can learn to read. May 2, 2026 · intro
- Why dropout disappeared from modern LLMs Dropout was the regularization workhorse of the deep-learning era. Frontier LLM pretraining quietly stopped using it. The reason isn't that dropout broke — it's that the problem dropout solved stopped being the problem. May 2, 2026 · intermediate
- Why LLMs can't count the r's in 'strawberry' A model that can write a sonnet stumbles on a question a five-year-old gets right. The reason isn't intelligence — it's that the model never sees the letters. May 2, 2026 · intro
- Why reward hacking is RLHF's hardest problem You can't write down a loss function for 'be helpful,' so you train a model to predict it — and then a much bigger model spends all its optimization pressure looking for holes in that prediction. That gap is reward hacking, and it doesn't go away with scale. May 2, 2026 · intermediate
- Why synthetic data works for modern LLM training The open web ran out of high-quality text years before frontier models stopped getting better. The new training signal didn't come from a fresh internet — it came from models writing for models, with filters in front. May 2, 2026 · intermediate
- 10 famous AI/ML terms The vocabulary you keep hearing on every podcast — neural network, transformer, RLHF, RAG — compressed to one line each, then unpacked. Apr 30, 2026 · intro
- What is harness engineering? Most of the work that turns a frontier model into a reliable product happens around the model, not inside it. Harness engineering is the name for that work. Apr 30, 2026 · intermediate
- How does an AI model decide what to say? It looks like one big choice — you type a question, you get an answer. Underneath it's thousands of tiny choices, made one token at a time, with no plan and no rewind. Apr 30, 2026 · intro
- What is attention (in transformers)? Every token in a sequence gets to peek at every other token and decide which ones matter. That trick is the engine inside every modern LLM. Apr 30, 2026 · intro
- What is a neural network? A pile of multiplications and a 'how wrong was I?' signal — somehow, when you stack enough of them, the thing learns to read, see, and play chess. Apr 30, 2026 · intro
- What is a transformer? The neural network architecture behind every modern LLM, image model, and protein folder — and the one big idea that made it work: drop recurrence, let every token look at every other token directly. Apr 30, 2026 · intro
- Why do attention sinks exist? Trained transformers funnel a startling fraction of their attention onto the very first token — a token that's usually semantically meaningless. The pattern looks like a bug, behaves like a feature, and falls out cleanly from one constraint in the softmax. Apr 30, 2026 · intermediate
- Why FlashAttention was a breakthrough Same math, same exact outputs, same asymptotic compute — and yet it made attention several times faster and unlocked long context. The trick was noticing attention was a memory problem, not a compute problem. Apr 30, 2026 · intermediate
- Why FP8 training is stable FP8 has only 256 representable values. Training a frontier model in it sounds insane — and it almost is. Here's the trick that makes it work. Apr 30, 2026 · intermediate
- Why grouped-query attention exists Multi-head attention is a memory-bandwidth disaster at decode time. GQA keeps most of the quality and throws away most of the bandwidth bill. Apr 30, 2026 · intermediate
- Why MLA replaced MHA DeepSeek-V2 cut its KV cache by 93% by attacking the bottleneck differently than GQA — and it scored higher, not lower, on benchmarks. Apr 30, 2026 · intermediate
- Why does PagedAttention exist? Naive KV-cache allocation reserves a contiguous slab for the worst-case sequence length, then watches 60–80% of it sit unused. PagedAttention asks: what if we treated GPU memory the way an operating system treats RAM? Apr 30, 2026 · intermediate
- Why RoPE replaced sinusoidal positional encoding The original transformer added a fixed sine/cosine vector to each token. Almost no frontier model does that anymore. RoPE rotates queries and keys instead — and that one structural change is what made long context tractable. Apr 30, 2026 · intermediate
- Why AI runs away in verifiable domains AI is getting superhuman fastest at things a computer can grade — math, code, formal proofs — and lagging behind on things it can't. The reason isn't that those domains are 'easier.' It's that training has a feedback step, and feedback needs a verifier. Apr 30, 2026 · intermediate
- A short history of AI, from Turing to today's LLMs Seventy years of trying to make machines think — and how a single architecture from 2017 finally cashed the check that 1950s AI wrote. Apr 29, 2026 · intermediate
- Why does chain-of-thought prompting work? Adding 'let's think step by step' to a prompt makes models measurably better at hard problems. Nobody fully agrees on why, and the wrong story will mislead you about how to use it. Apr 29, 2026 · intermediate
- Why do LLMs hallucinate confidently instead of saying 'I don't know'? The model isn't lying. It was never trained to know when to stop talking. Apr 29, 2026 · intro
- What is an agent harness? The loop and scaffolding around a language model that turns 'a thing that emits tokens' into 'a thing that does work in the world.' Apr 29, 2026 · intro
- Why do embeddings exist? Computers want numbers, but you also want 'cat' and 'kitten' to live next to each other. Embeddings are the trick that makes both true at once. Apr 29, 2026 · intro
- Why does in-context learning work? You paste three examples into a prompt and the model suddenly does the task. Nothing got trained. So what just happened? Apr 29, 2026 · intermediate
- What is an LLM? A neural network trained to predict the next token of text — and why that simple goal scaled into something that feels like reasoning. Apr 29, 2026 · intro
- Why does GPU memory bandwidth matter more than FLOPS for LLM inference? You bought the GPU for the teraflops. At inference time, almost none of them are doing anything. The bottleneck is moving the weights, not multiplying them. Apr 29, 2026 · intermediate
- Why does MCP exist? Every AI app was reinventing the same plumbing to talk to the same tools. MCP is the standard that turns an M×N integration mess into M+N. Apr 29, 2026 · intro
- RAG: why retrieval didn't die when context windows got huge Long context windows were supposed to kill retrieval-augmented generation. They didn't. Here's why the bottleneck moved instead of disappearing. Apr 29, 2026 · intro
- Why do small models exist? If bigger models always benchmark better, why does anyone ship a 3B model? The answer is mostly about latency, cost, and the place the model has to live. Apr 29, 2026 · intro
- Why does temperature exist as a knob? If the model knows the right answer, why is there a dial that asks it to be wrong on purpose? Apr 29, 2026 · intro
- Why is the KV cache a thing? The model has to read your whole prompt every time it picks a token. Why doesn't it choke? Because of a quiet trick almost nobody mentions in the docs. Apr 29, 2026 · intermediate
- Why does tokenization exist? Computers can already read bytes. So why do language models insist on chopping text into these weird half-words first? Apr 29, 2026 · intro
- Why Adam beat plain SGD for LLMs Vision models are mostly trained with SGD + momentum. Transformers are almost always trained with Adam or AdamW. Why did one optimizer win one regime and lose the other? Apr 29, 2026 · intermediate
- Why agents fall apart over long horizons Your agent solves any single step beautifully. Run it for fifty steps and it falls off a cliff. The math behind that cliff is older than LLMs, but a newer twist makes it worse. Apr 29, 2026 · intermediate
- Why vector search is approximate on purpose Exact nearest-neighbor search exists, works, and is correct. At scale, the AI-era retrieval stack quietly walks away from it. The reason is more interesting than 'it's faster.' Apr 29, 2026 · intermediate
- Why is attention quadratic? Doubling the context length makes attention 4× more expensive, not 2×. That single fact shapes every trade-off in modern LLM serving — and explains what FlashAttention actually changed (it's not what most people think). Apr 29, 2026 · intermediate
- Why beam search died for LLMs Beam search was the default way to decode neural sequence models for years. Then chatbots arrived and quietly stopped using it. The reason is stranger than 'sampling is more creative.' Apr 29, 2026 · intermediate
- Why does continuous batching exist? Static batching works fine for image classifiers and breaks immediately for LLMs. The problem isn't the batch — it's that generation lengths vary, and the slowest sequence holds the GPU hostage. Apr 29, 2026 · intermediate
- Why image generation went diffusion, not autoregressive LLMs are autoregressive: predict the next token. Image models could have been the same — predict the next pixel. Almost none of the dominant ones are. Here's why the field walked away from that approach. Apr 29, 2026 · intermediate
- Why model distillation exists A small model trained on a big model's outputs often beats the same small model trained on the original labels. That shouldn't be obvious — and the reason it works is the actually interesting part. Apr 29, 2026 · intermediate
- Why is fine-tuning so cheap compared to pretraining? Pretraining a frontier model costs tens of millions of dollars. Fine-tuning the same model on your data can cost less than a pizza. Why the six-orders-of-magnitude gap? Apr 29, 2026 · intermediate
- Why GPU kernels are still hand-tuned A modern GPU can do tens of teraflops of matrix math. A naive, correct implementation of the same math leaves most of that on the floor. Here's why moving the bytes — not doing the FLOPs — is the actual job. Apr 29, 2026 · intermediate
- Why LayerNorm (and RMSNorm) exist Every transformer block has a normalization step. Pull it out and training falls apart in the first thousand steps. Why is this tiny operation load-bearing? Apr 29, 2026 · intermediate
- Why is evaluating an LLM so much harder than testing normal software? Unit tests pass or fail. LLM outputs don't. The hard part isn't running the eval — it's deciding what 'correct' even means when there are a million right answers. Apr 29, 2026 · intermediate
- Why LoRA exists Fully fine-tuning a 70B model means storing optimizer state for 70 billion weights. LoRA trains under 1% of the parameters and, on the tasks people have tested, often matches the result. The trick is a hypothesis about the shape of the update. Apr 29, 2026 · intermediate
- Why long-context models still get lost in the middle Your model has a 1M token context window. It can recall the first paragraph perfectly. It can recall the last paragraph perfectly. The thing in the middle? Coin flip. This is not a bug — it's what happens when you ask a model trained one way to behave a different way. Apr 29, 2026 · intermediate
- Why does mixture-of-experts exist? A 671B-parameter model whose per-token compute is closer to a 37B one. The trick isn't compression — it's that most of the weights sit out most of the time. Apr 29, 2026 · intermediate
- Why does predicting the next token end up doing reasoning? An LLM is trained on one objective: guess the next token. From that one task, you get translation, code, arithmetic, and arguments. Why is autocomplete this powerful? Apr 29, 2026 · intermediate
- Why do positional encodings exist? A transformer cannot tell 'dog bites man' from 'man bites dog' on its own. The attention math is symmetric in token order — until you bolt on a position signal. Every modern LLM does, and the choice of how shapes long-context behavior more than people realize. Apr 29, 2026 · intermediate
- Why does prompt caching exist? Your agent sends the same 50,000-token system prompt on every turn. The provider charges you 90% less when they recognize it. They're not being generous — they're charging you for work they didn't do. Apr 29, 2026 · intermediate
- Why quantization works Stuffing a 70-billion-parameter model into 4-bit weights sounds like it should ruin it. It mostly doesn't — and the reason is more about how the model gets used at inference than about the math of rounding. Apr 29, 2026 · intermediate
- Why reasoning models exist Why we suddenly have a separate class of LLMs that 'think before answering' — and what changed to make spending compute at inference, not training, the new lever. Apr 29, 2026 · intermediate
- Why RLHF exists A pretrained language model knows everything and answers nothing. RLHF exists because the gap between 'predict the next token' and 'do what the user asked' is wider than prompt engineering can paper over. Apr 29, 2026 · intermediate
- Why do scaling laws exist? Bigger model, more data, more compute — and the loss falls along a straight line on a log-log plot for seven orders of magnitude. Nobody fully knows why that line is so straight. Apr 29, 2026 · intermediate
- Why does speculative decoding exist? A small fast model guesses, a big slow model checks. Somehow you get the big model's exact output, faster. The trick isn't cleverness — it's that your GPU was already sitting idle. Apr 29, 2026 · intermediate
- Why is structured output so hard? You ask the model for JSON. Sometimes it gives you a trailing comma. Sometimes a markdown fence. Sometimes prose. Why is this still a problem? Apr 29, 2026 · intermediate
- Why SwiGLU replaced ReLU in transformers Modern LLMs ditched the simplest activation function in deep learning for a multiplicative gate nobody can fully explain. Here's why. Apr 29, 2026 · intermediate
- Why isn't temperature 0 actually deterministic? You set temperature to 0, send the same prompt twice, get two different answers. The math says argmax is a function. The hardware disagrees. Apr 29, 2026 · intermediate
- Why VRAM is the bottleneck for LLM serving It's not FLOPS, it's not network, it's not the CPU. The thing that decides whether your model fits and how many users you can serve is a number printed on the GPU's spec sheet — and three things fight to consume it. Apr 29, 2026 · intermediate
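A few of the teasers above make claims concrete enough to check in a few lines. The temperature knob, for instance: this is a minimal pure-Python sketch of temperature-scaled softmax. The logits are made-up illustration values, not output from any real model.

```python
import math

def softmax(logits, temperature=1.0):
    # Dividing logits by the temperature before exponentiating is the whole
    # trick: T < 1 sharpens the distribution, T > 1 flattens it toward uniform.
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]              # hypothetical scores for three tokens
cold = softmax(toy_logits, temperature=0.5)
hot = softmax(toy_logits, temperature=2.0)
```

At temperature 0.5 almost all the probability mass piles onto the top token; at 2.0 the runners-up get real probability, which is why higher temperatures read as 'more creative' and, occasionally, as wrong on purpose.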
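The quadratic-attention and KV-cache teasers also make a claim you can check by counting attention-score evaluations. This is a back-of-envelope sketch, not a benchmark: it counts full n-by-n score matrices and ignores causal masking, attention heads, and every other cost in the model.

```python
def scores_without_cache(prompt_len: int, new_tokens: int) -> int:
    # No cache: every decode step re-runs attention over the whole sequence
    # so far, rebuilding the full n x n score matrix from scratch.
    return sum((prompt_len + i) ** 2 for i in range(new_tokens))

def scores_with_cache(prompt_len: int, new_tokens: int) -> int:
    # With a KV cache: one prefill pass builds the matrix over the prompt
    # (and yields the first new token); after that, each step scores only
    # the newest token's query against the cached keys, one row at a time.
    prefill = prompt_len ** 2
    decode = sum(prompt_len + i for i in range(1, new_tokens))
    return prefill + decode

# Doubling the context quadruples the prefill work: that is the quadratic.
assert scores_without_cache(2048, 1) == 4 * scores_without_cache(1024, 1)
```

For a 1,000-token prompt and 100 generated tokens, the cached count comes out roughly a hundred times smaller, which is the 'quiet trick' the KV-cache teaser alludes to.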