Why Linux has an OOM killer
Linux promises memory it doesn't have, then has to break the promise — the OOM killer is the reaper that decides who dies so the system can keep running.
Why it exists
You launch a process. It calls malloc(8 GB). The kernel says yes. You run a
few of these and the total “allocated” memory now exceeds the physical RAM
plus swap on the box. Nothing has crashed yet, because none of those processes
have actually touched most of that memory — malloc only reserved address
space, not pages.
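The reservation-vs-touch distinction is visible from userspace. A minimal Linux-only sketch, using Python's mmap as a stand-in for malloc (on Linux, ru_maxrss is peak resident set size in KiB):

```python
import mmap
import resource

GiB = 1 << 30

rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Reserve 1 GiB of anonymous address space. Under overcommit this is just a
# promise: no physical pages back the region until something writes to it.
region = mmap.mmap(-1, GiB)
rss_after_map = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Touch the first 64 MiB. Only now does the kernel fault in real pages.
region[: 64 * 1024 * 1024] = b"\x00" * (64 * 1024 * 1024)
rss_after_touch = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(rss_before, rss_after_map, rss_after_touch)
```

On a typical box the second number barely moves from the first, while the third jumps: address space is cheap, pages are not.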
Then one of them starts writing. The kernel has to find a real, physical page to back each virtual page being touched. At some point it can’t. Every page is in use, swap is full, the page cache is already squeezed flat. The kernel made a promise it can’t keep.
Two options exist at that moment. One: panic the whole machine — every process dies, the box reboots. Two: pick a victim, kill it, free its pages, and let everyone else live. Linux chose option two and built the OOM killer to do it.
The OOM killer exists because Linux deliberately overcommits memory. Overcommit is a calculated bet that programs reserve far more than they touch — and most of the time the bet pays off. The OOM killer is what happens when it doesn’t.
Why it matters now
If you run anything in containers — Kubernetes pods,
Docker
containers, an ML training job inside a cgroup with a memory limit — you
have already met the OOM killer. The “exit code 137” you see in your logs is
128 + SIGKILL(9): the kernel reaped your process.
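The 137 arithmetic is easy to reproduce without running out of memory: kill a child with the same signal the OOM killer uses and look at how its death is reported. A small sketch:

```python
import signal
import subprocess
import sys

# Spawn a child that would sleep forever, then SIGKILL it, which is exactly
# what the OOM killer does to its victim. Python reports death-by-signal as
# a negative returncode; a shell (and Kubernetes) reports 128 + signal number.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.send_signal(signal.SIGKILL)
child.wait()

print(child.returncode)        # -9: killed by SIGKILL
print(128 + signal.SIGKILL)    # 137: what shows up in container logs
```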
It matters even more now because of large models. An LLM that loads 70 GB of weights, a fine-tuning run that spikes activation memory, a vector index that grows past what you sized for — these are exactly the workloads that will push real pages into existence and force the kernel to make good on its promises. AI infra is OOM-killer infra.
The short answer
OOM killer = "system is out of memory" trigger + heuristic that scores processes + SIGKILL on the highest scorer
When allocation fails and there’s nowhere left to reclaim from, the kernel
walks the process list, computes a “badness” score for each one, and sends
SIGKILL to the worst offender. The score roughly tracks how much memory the
process is using, with adjustments for privilege and policy.
How it works
The trigger is a failed page allocation under __alloc_pages that the kernel can’t satisfy by reclaiming page cache or swapping. At that point it calls into the OOM killer.
Selection uses a score per process, exposed at /proc/<pid>/oom_score. The
modern formula (post the rewrite around the 2.6.36 era — I haven’t re-sourced
the exact release, so don’t quote me on the year) is roughly:
- Base score = the process’s resident memory as a fraction of available memory, scaled to 0–1000.
- Plus an adjustment, oom_score_adj, in /proc/<pid>/oom_score_adj, ranging from -1000 (immune) to +1000 (kill me first).
So in plain terms: bigger processes get killed first, but you can bias the
decision either way. Setting oom_score_adj to -1000 makes a process
effectively unkillable by the OOM killer — which is how things like sshd or
critical daemons opt out.
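That scoring logic can be sketched as a toy model. This is an illustration of the heuristic as described above, not the kernel's actual code (which lives in mm/oom_kill.c and also counts things like swap and page-table pages):

```python
def badness(rss_pages: int, total_pages: int, oom_score_adj: int) -> int:
    """Toy model of the OOM badness score: resident share scaled to
    0..1000, biased by oom_score_adj, clamped to the valid range."""
    if oom_score_adj <= -1000:
        return 0  # opted out entirely, the sshd trick
    base = rss_pages * 1000 // total_pages
    return max(0, min(1000, base + oom_score_adj))

# Half of memory, no bias: mid-range score.
print(badness(500_000, 1_000_000, 0))       # 500
# Same footprint, biased to "kill me last":
print(badness(500_000, 1_000_000, -500))    # 0
# A small process that volunteered to die first:
print(badness(10_000, 1_000_000, 1000))     # 1000
```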
The kill itself is a SIGKILL. There is no graceful shutdown, no chance to
flush, no atexit handlers. The process is gone.
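There is no opting into a graceful version of this, because SIGKILL is one of the two signals a process cannot handle at all. The kernel rejects the attempt outright:

```python
import signal

# SIGKILL cannot be caught, blocked, or ignored; installing a handler
# for it fails with EINVAL.
try:
    signal.signal(signal.SIGKILL, lambda signum, frame: None)
    handler_installed = True
except OSError as exc:
    handler_installed = False
    print(exc)  # typically "[Errno 22] Invalid argument"

print(handler_installed)
```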
Containers complicate the story
When you run inside a memory-limited cgroup, there’s a second OOM situation: the cgroup hits its limit even though the host has plenty of RAM. That’s a cgroup OOM, and the kernel runs the same kind of selection — but only over processes inside that cgroup. Your container gets killed; the host doesn’t notice.
This is why a Kubernetes pod with resources.limits.memory: 4Gi can OOM at
4 GB on a node with 256 GB free. The host has memory; the cgroup doesn’t.
Show the seams
A few things the textbook version skips:
- Overcommit is configurable. vm.overcommit_memory has three modes: heuristic (default, allow most allocations), always (lie aggressively), and never (refuse allocations that wouldn't fit). The "never" mode trades the OOM killer for malloc returning NULL — which most programs handle even worse than they handle being killed, because nobody actually checks malloc's return value.
- The killer's choice can be surprising. The biggest process is often the one doing the work you care about — your database, your model server. The kernel doesn't know that. Tools like systemd-oomd and policy via oom_score_adj exist precisely because the default heuristic is naive about importance.
- OOM != "no free memory." Linux deliberately keeps memory full — unused RAM is wasted RAM, so the page cache will eat anything free. "Free" memory in top being near zero is normal. OOM only fires when allocation truly fails after reclaim.
- It can also kill the wrong cgroup-mate. Inside a cgroup with several processes, the killer picks the worst-scored one in that cgroup, which isn't always the leaker — sometimes it's the innocent neighbor that just happens to be larger.
dmesg is where the evidence lives. When the OOM killer fires, it logs a whole dump: which process, the score, the memory state at the time. If you’ve ever debugged a mysterious “my container just disappeared,” that log is the breadcrumb.
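Pulling the victim out of that log is a one-regex job. The sample line below is illustrative, not captured from a real machine, and the exact wording drifts across kernel versions, but recent kernels log roughly this shape:

```python
import re

# Illustrative dmesg line of the form recent kernels emit when the
# OOM killer fires (wording varies by kernel version).
sample = ("Out of memory: Killed process 4321 (model-server) "
          "total-vm:73400320kB, anon-rss:68157440kB, file-rss:0kB")

m = re.search(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)", sample)
if m:
    print(m["pid"], m["name"])
```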
There’s a deeper philosophical seam here too. The OOM killer is the kernel
admitting that perfect memory accounting under overcommit is impossible — at
the moment of failure, it has no good options, only less-bad ones. Some
operating systems (notably some of the BSDs, at least historically — I don’t
have a current confident summary of where each one stands today) refuse to
overcommit at all and force malloc to fail honestly. Linux’s bet is that
overcommit + a reaper produces better aggregate throughput, even though the
tail experience — your job dying with exit 137 — is uglier.
Famous related terms
- Overcommit — overcommit = "promise more memory than you have" + "hope nobody calls the bet" — the policy that makes the OOM killer necessary.
- cgroup memory limit — cgroup memory limit ≈ per-process-group RAM cap — creates a second, narrower OOM domain inside a host.
- oom_score_adj — oom_score_adj = per-process bias on who gets killed first — the knob for “please don’t kill my database.”
- systemd-oomd — systemd-oomd ≈ userspace OOM killer that acts before the kernel's does — uses pressure signals to kill earlier and more selectively.
- PSI (Pressure Stall Information) — PSI ≈ "how stalled is the system on memory/CPU/IO right now" — the modern signal systemd-oomd reads to act before the kernel has to.
Going deeper
- The kernel source: mm/oom_kill.c is surprisingly readable for a piece of load-bearing kernel code.
- man 5 proc — for oom_score, oom_score_adj, and the overcommit knobs.
- Chris Down’s writing on systemd-oomd and PSI — the case for killing in userspace before the kernel has to.