Why Linux has an OOM killer
Linux promises memory it doesn't have, then has to break the promise — the OOM killer is the reaper that decides who dies so the system can keep running.
Why it exists
You launch a process. It calls malloc(8 GB). The kernel says yes. You run a
few of these and the total “allocated” memory now exceeds the physical RAM
plus swap on the box. Nothing has crashed yet, because none of those processes
have actually touched most of that memory — malloc only reserved address
space, not pages.
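The reservation-vs-touch distinction is visible from userspace. A minimal Linux-only sketch, using Python's mmap as a stand-in for malloc (on Linux, ru_maxrss is peak resident set size in KiB):

```python
import mmap
import resource

GiB = 1 << 30

rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Reserve 1 GiB of anonymous address space. Under overcommit this is just a
# promise: no physical pages back the region until something writes to it.
region = mmap.mmap(-1, GiB)
rss_after_map = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Touch the first 64 MiB. Only now does the kernel fault in real pages.
region[: 64 * 1024 * 1024] = b"\x00" * (64 * 1024 * 1024)
rss_after_touch = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(rss_before, rss_after_map, rss_after_touch)
```

On a typical box the second number barely moves from the first, while the third jumps: address space is cheap, pages are not.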
Then one of them starts writing. The kernel has to find a real, physical page to back each virtual page being touched. At some point it can’t. Every page is in use, swap is full, the page cache is already squeezed flat. The kernel made a promise it can’t keep.
Two options exist at that moment. One: panic the whole machine — every process dies, the box reboots. Two: pick a victim, kill it, free its pages, and let everyone else live. Linux chose option two and built the OOM killer to do it.
The OOM killer exists because Linux deliberately overcommits memory. Overcommit is a calculated bet that programs reserve far more than they touch — and most of the time the bet pays off. The OOM killer is what happens when it doesn’t.
Why it matters now
If you run anything in containers — Kubernetes pods,
Docker
containers, an ML training job inside a cgroup with a memory limit — you
have already met the OOM killer. The “exit code 137” you see in your logs is
128 + SIGKILL(9): the kernel reaped your process.
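The 137 arithmetic is easy to reproduce without running out of memory: kill a child with the same signal the OOM killer uses and look at how its death is reported. A small sketch:

```python
import signal
import subprocess
import sys

# Spawn a child that would sleep forever, then SIGKILL it, which is exactly
# what the OOM killer does to its victim. Python reports death-by-signal as
# a negative returncode; a shell (and Kubernetes) reports 128 + signal number.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.send_signal(signal.SIGKILL)
child.wait()

print(child.returncode)        # -9: killed by SIGKILL
print(128 + signal.SIGKILL)    # 137: what shows up in container logs
```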
It matters even more now because of large models. An LLM that loads 70 GB of weights, a fine-tuning run that spikes activation memory, a vector index that grows past what you sized for — these are exactly the workloads that will push real pages into existence and force the kernel to make good on its promises. AI infra is OOM-killer infra.
The short answer
OOM killer = "system is out of memory" trigger + heuristic that scores processes + SIGKILL on the highest scorer
When allocation fails and there’s nowhere left to reclaim from, the kernel
walks the process list, computes a “badness” score for each one, and sends
SIGKILL to the worst offender. The score roughly tracks how much memory the
process is using, with adjustments for privilege and policy.
How it works
The trigger is a failed page allocation under __alloc_pages that the kernel can’t satisfy by reclaiming page cache or swapping. At that point it calls into the OOM killer.
Selection uses a score per process, exposed at /proc/<pid>/oom_score. The
modern formula (post the rewrite around the 2.6.36 era — I haven’t re-sourced
the exact release, so don’t quote me on the year) is roughly:
- Base score = the process’s resident memory as a fraction of available memory, scaled to 0–1000.
- Plus an adjustment, oom_score_adj, in /proc/<pid>/oom_score_adj, ranging from -1000 (immune) to +1000 (kill me first).
So in plain terms: bigger processes get killed first, but you can bias the
decision either way. Setting oom_score_adj to -1000 makes a process
effectively unkillable by the OOM killer — which is how things like sshd or
critical daemons opt out.
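That scoring logic can be sketched as a toy model. This is an illustration of the heuristic as described above, not the kernel's actual code (which lives in mm/oom_kill.c and also counts things like swap and page-table pages):

```python
def badness(rss_pages: int, total_pages: int, oom_score_adj: int) -> int:
    """Toy model of the OOM badness score: resident share scaled to
    0..1000, biased by oom_score_adj, clamped to the valid range."""
    if oom_score_adj <= -1000:
        return 0  # opted out entirely, the sshd trick
    base = rss_pages * 1000 // total_pages
    return max(0, min(1000, base + oom_score_adj))

# Half of memory, no bias: mid-range score.
print(badness(500_000, 1_000_000, 0))       # 500
# Same footprint, biased to "kill me last":
print(badness(500_000, 1_000_000, -500))    # 0
# A small process that volunteered to die first:
print(badness(10_000, 1_000_000, 1000))     # 1000
```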
The kill itself is a SIGKILL. There is no graceful shutdown, no chance to
flush, no atexit handlers. The process is gone.
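There is no opting into a graceful version of this, because SIGKILL is one of the two signals a process cannot handle at all. The kernel rejects the attempt outright:

```python
import signal

# SIGKILL cannot be caught, blocked, or ignored; installing a handler
# for it fails with EINVAL.
try:
    signal.signal(signal.SIGKILL, lambda signum, frame: None)
    handler_installed = True
except OSError as exc:
    handler_installed = False
    print(exc)  # typically "[Errno 22] Invalid argument"

print(handler_installed)
```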
Containers complicate the story
When you run inside a memory-limited cgroup, there’s a second OOM situation: the cgroup hits its limit even though the host has plenty of RAM. That’s a cgroup OOM, and the kernel runs the same kind of selection — but only over processes inside that cgroup. Your container gets killed; the host doesn’t notice.
This is why a Kubernetes pod with resources.limits.memory: 4Gi can OOM at
4 GB on a node with 256 GB free. The host has memory; the cgroup doesn’t.
Show the seams
A few things the textbook version skips:
- Overcommit is configurable. vm.overcommit_memory has three modes: heuristic (default, allow most allocations), always (lie aggressively), and never (refuse allocations that wouldn't fit). The "never" mode trades the OOM killer for malloc returning NULL — which most programs handle even worse than they handle being killed, because nobody actually checks malloc's return value.
- The killer's choice can be surprising. The biggest process is often the one doing the work you care about — your database, your model server. The kernel doesn't know that. Tools like systemd-oomd and policy via oom_score_adj exist precisely because the default heuristic is naive about importance.
- OOM != "no free memory." Linux deliberately keeps memory full — unused RAM is wasted RAM, so the page cache will eat anything free. "Free" memory in top being near zero is normal. OOM only fires when allocation truly fails after reclaim.
- It can also kill the wrong cgroup-mate. Inside a cgroup with several processes, the killer picks the worst-scored one in that cgroup, which isn't always the leaker — sometimes it's the innocent neighbor that just happens to be larger.
dmesg is where the evidence lives. When the OOM killer fires, it logs a whole dump: which process, the score, the memory state at the time. If you’ve ever debugged a mysterious “my container just disappeared,” that log is the breadcrumb.
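Pulling the victim out of that log is a one-regex job. The sample line below is illustrative, not captured from a real machine, and the exact wording drifts across kernel versions, but recent kernels log roughly this shape:

```python
import re

# Illustrative dmesg line of the form recent kernels emit when the
# OOM killer fires (wording varies by kernel version).
sample = ("Out of memory: Killed process 4321 (model-server) "
          "total-vm:73400320kB, anon-rss:68157440kB, file-rss:0kB")

m = re.search(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)", sample)
if m:
    print(m["pid"], m["name"])
```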
There’s a deeper philosophical seam here too. The OOM killer is the kernel
admitting that perfect memory accounting under overcommit is impossible — at
the moment of failure, it has no good options, only less-bad ones. Some
operating systems (notably some of the BSDs, at least historically — I don’t
have a current confident summary of where each one stands today) refuse to
overcommit at all and force malloc to fail honestly. Linux’s bet is that
overcommit + a reaper produces better aggregate throughput, even though the
tail experience — your job dying with exit 137 — is uglier.
Famous related terms
- Overcommit — overcommit = "promise more memory than you have" + "hope nobody calls the bet" — the policy that makes the OOM killer necessary.
- cgroup memory limit — cgroup memory limit ≈ per-process-group RAM cap — creates a second, narrower OOM domain inside a host.
- oom_score_adj — oom_score_adj = per-process bias on who gets killed first — the knob for “please don’t kill my database.”
- systemd-oomd — systemd-oomd ≈ userspace OOM killer that acts before the kernel's does — uses pressure signals to kill earlier and more selectively.
- PSI (Pressure Stall Information) — PSI ≈ "how stalled is the system on memory/CPU/IO right now" — the modern signal systemd-oomd reads to act before the kernel has to.
Going deeper
- The kernel source: mm/oom_kill.c is surprisingly readable for a piece of load-bearing kernel code.
- man 5 proc — for oom_score, oom_score_adj, and the overcommit knobs.
- Chris Down’s writing on systemd-oomd and PSI — the case for killing in userspace before the kernel has to.