Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why Linux has an OOM killer

Linux promises memory it doesn't have, then has to break the promise — the OOM killer is the reaper that decides who dies so the system can keep running.

Systems · intermediate · Apr 29, 2026

Why it exists

You launch a process. It calls malloc(8 GB). The kernel says yes. You run a few of these and the total “allocated” memory now exceeds the physical RAM plus swap on the box. Nothing has crashed yet, because none of those processes have actually touched most of that memory — malloc only reserved address space, not pages.

Then one of them starts writing. The kernel has to find a real, physical page to back each virtual page being touched. At some point it can’t. Every page is in use, swap is full, the page cache is already squeezed flat. The kernel made a promise it can’t keep.
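Here’s a minimal sketch of that two-step failure (sizes are hypothetical; running it as-is is harmless, since the page-touching loop is commented out). The malloc succeeds immediately because it only reserves address space; the write loop is what would force physical pages into existence.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t size = 8UL * 1024 * 1024 * 1024;   /* 8 GB of address space */

    /* Under default overcommit this succeeds even on a box with far
     * less than 8 GB free -- the kernel hands out addresses, not pages. */
    char *p = malloc(size);
    if (!p) {
        /* With strict accounting (vm.overcommit_memory = 2) you can land here. */
        perror("malloc");
        return 1;
    }
    printf("reserved %zu bytes -- almost no physical RAM used yet\n", size);

    /* Touching one byte per 4 KB page would force real pages in, and on a
     * small machine could genuinely summon the OOM killer:
     *
     *     for (size_t i = 0; i < size; i += 4096) p[i] = 1;
     */
    free(p);
    return 0;
}
```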

Two options exist at that moment. One: panic the whole machine — every process dies, the box reboots. Two: pick a victim, kill it, free its pages, and let everyone else live. Linux chose option two and built the OOM killer to do it.

The OOM killer exists because Linux deliberately overcommits memory. Overcommit is a calculated bet that programs reserve far more than they touch — and most of the time the bet pays off. The OOM killer is what happens when it doesn’t.
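The bet is visible, and tunable, at /proc/sys/vm/overcommit_memory. A tiny reader, with the mode meanings as documented in proc(5) and the kernel docs:

```c
#include <stdio.h>

int main(void) {
    /* 0 = heuristic overcommit (the default: obviously-impossible requests
     *     are refused, everything else is granted),
     * 1 = always overcommit,
     * 2 = strict accounting: the commit limit is swap plus overcommit_ratio%
     *     of RAM, and malloc can fail honestly. */
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (!f) { perror("fopen"); return 1; }

    int mode;
    if (fscanf(f, "%d", &mode) == 1)
        printf("vm.overcommit_memory = %d\n", mode);
    fclose(f);
    return 0;
}
```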

Why it matters now

If you run anything in containers — Kubernetes pods, Docker containers, an ML training job inside a cgroup with a memory limit — you have already met the OOM killer. The “exit code 137” you see in your logs is 128 + SIGKILL(9): the kernel reaped your process.
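You can reproduce the 137 arithmetic without going anywhere near an OOM. The kernel reports “terminated by signal 9” through wait(); it’s the shell that re-encodes that as 128 + 9. A small demonstration, with kill(2) standing in for the OOM killer:

```c
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }
    if (pid == 0) {          /* child: sit and wait to be reaped */
        pause();
        _exit(0);
    }

    kill(pid, SIGKILL);      /* stand-in for the OOM killer */

    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status))
        printf("killed by signal %d -> a shell would report %d\n",
               WTERMSIG(status), 128 + WTERMSIG(status));
    return 0;
}
```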

It matters even more now because of large models. An LLM that loads 70 GB of weights, a fine-tuning run that spikes activation memory, a vector index that grows past what you sized for — these are exactly the workloads that will push real pages into existence and force the kernel to make good on its promises. AI infra is OOM-killer infra.

The short answer

OOM killer = "system is out of memory" trigger + heuristic that scores processes + SIGKILL on the highest scorer

When allocation fails and there’s nowhere left to reclaim from, the kernel walks the process list, computes a “badness” score for each one, and sends SIGKILL to the worst offender. The score roughly tracks how much memory the process is using, with adjustments for privilege and policy.
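A hedged sketch of that loop, with made-up struct fields standing in for what the kernel actually reads off each task (the real logic lives in mm/oom_kill.c; this is a paraphrase, not a copy):

```c
#include <stdio.h>

/* Hypothetical stand-in for the per-task state the kernel consults. */
struct candidate {
    int  pid;
    long rss_pages, swap_pages, pagetable_pages;
    int  oom_score_adj;                 /* -1000 .. +1000, set via procfs */
};

/* Rough paraphrase of oom_badness(): the score is dominated by how much
 * memory killing the task would actually free. */
static long badness(const struct candidate *c, long totalpages) {
    if (c->oom_score_adj == -1000)      /* opted out: never kill */
        return -1;
    long points = c->rss_pages + c->swap_pages + c->pagetable_pages;
    points += (long)c->oom_score_adj * totalpages / 1000;
    return points > 0 ? points : 1;     /* every killable task scores >= 1 */
}

int main(void) {
    long totalpages = 4000000;          /* pretend: 16 GB of 4 KB pages */
    struct candidate tasks[] = {
        { 101, 2500000, 0, 5000,     0 },  /* memory hog             */
        { 102,  300000, 0, 1000, -1000 },  /* sshd-like: unkillable  */
        { 103,  900000, 0, 2000,   500 },  /* policy-biased to die   */
    };

    struct candidate *victim = NULL;
    long worst = -1;
    for (int i = 0; i < 3; i++) {
        long b = badness(&tasks[i], totalpages);
        if (b > worst) { worst = b; victim = &tasks[i]; }
    }
    printf("would SIGKILL pid %d (badness %ld)\n", victim->pid, worst);
    return 0;
}
```

Note that the policy-biased task (103) outscores the raw hog here: an adjustment of +500 adds half of total memory to its score, a bigger thumb on the scale than the hog’s extra resident pages.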

How it works

The trigger is a failed page allocation under __alloc_pages that the kernel can’t satisfy by reclaiming page cache or swapping. At that point it calls into the OOM killer.
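For the curious, the approximate call path, with function names taken from recent kernel source (mm/page_alloc.c and mm/oom_kill.c). They drift between versions, so treat this as a map rather than gospel:

```c
/*
 * __alloc_pages()                   a page allocation request
 *   -> __alloc_pages_slowpath()     reclaim, compaction, retries...
 *     -> __alloc_pages_may_oom()    nothing left to reclaim
 *       -> out_of_memory()          the OOM killer proper
 *         -> select_bad_process()   scores every candidate task
 *           -> oom_kill_process()   SIGKILL the highest scorer
 */
```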

Selection uses a score per process, exposed at /proc/<pid>/oom_score. The modern formula (post the rewrite around the 2.6.36 era — I haven’t re-sourced the exact release, so don’t quote me on the year) is roughly:

badness ≈ (resident pages + swap entries + page-table pages) + oom_score_adj × total pages / 1000

So in plain terms: bigger processes get killed first, but you can bias the decision either way. Setting oom_score_adj to -1000 makes a process effectively unkillable by the OOM killer — which is how things like sshd or critical daemons opt out.
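Opting in or out is a one-line write to procfs. Raising your own score needs no privilege; lowering it generally requires CAP_SYS_RESOURCE. A sketch:

```c
#include <stdio.h>

int main(void) {
    /* Volunteer ourselves as a more attractive victim (+500 of 1000). */
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (!f) { perror("open oom_score_adj"); return 1; }
    fprintf(f, "500\n");
    fclose(f);

    /* Read back the kernel's current opinion of us. */
    f = fopen("/proc/self/oom_score", "r");
    if (!f) { perror("open oom_score"); return 1; }
    int score;
    if (fscanf(f, "%d", &score) == 1)
        printf("oom_score is now %d\n", score);
    fclose(f);
    return 0;
}
```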

The kill itself is a SIGKILL. There is no graceful shutdown, no chance to flush, no atexit handlers. The process is gone.
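There’s no negotiating with it, either: SIGKILL is one of the two signals (SIGSTOP is the other) that the kernel refuses to let a process catch, block, or ignore. Attempting to install a handler fails outright:

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static void handler(int sig) { (void)sig; }

int main(void) {
    struct sigaction sa = { .sa_handler = handler };

    /* The kernel rejects this with EINVAL -- "Invalid argument". */
    if (sigaction(SIGKILL, &sa, NULL) == -1)
        printf("sigaction(SIGKILL) failed: %s\n", strerror(errno));
    return 0;
}
```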

Containers complicate the story

When you run inside a memory-limited cgroup, there’s a second OOM situation: the cgroup hits its limit even though the host has plenty of RAM. That’s a cgroup OOM, and the kernel runs the same kind of selection — but only over processes inside that cgroup. Your container gets killed; the host doesn’t notice.

This is why a Kubernetes pod with resources.limits.memory: 4Gi can OOM at 4 GB on a node with 256 GB free. The host has memory; the cgroup doesn’t.
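On a cgroup v2 system you can watch the counter tick: every non-root cgroup exposes a memory.events file whose oom_kill line counts kills inside it. A sketch that locates the calling process’s own cgroup and reads that file, assuming a pure cgroup v2 hierarchy mounted at /sys/fs/cgroup:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* /proc/self/cgroup on pure v2 is a single line: "0::/some/path" */
    char line[256], path[512];
    FILE *f = fopen("/proc/self/cgroup", "r");
    if (!f || !fgets(line, sizeof line, f)) { perror("cgroup"); return 1; }
    fclose(f);

    char *rel = strchr(line, ':');
    rel = rel ? strchr(rel + 1, ':') : NULL;
    if (!rel) return 1;
    rel++;                               /* skip past "0::" */
    rel[strcspn(rel, "\n")] = '\0';
    snprintf(path, sizeof path, "/sys/fs/cgroup%s/memory.events", rel);

    /* memory.events holds lines like "oom_kill 3" */
    f = fopen(path, "r");
    if (!f) { perror(path); return 1; }
    char key[64];
    long val;
    while (fscanf(f, "%63s %ld", key, &val) == 2)
        if (strcmp(key, "oom_kill") == 0)
            printf("processes OOM-killed in this cgroup: %ld\n", val);
    fclose(f);
    return 0;
}
```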

Show the seams

One thing the textbook version skips: the philosophical seam underneath.

The OOM killer is the kernel admitting that perfect memory accounting under overcommit is impossible — at the moment of failure, it has no good options, only less-bad ones. Some operating systems (notably some of the BSDs, at least historically — I don’t have a current confident summary of where each one stands today) refuse to overcommit at all and force malloc to fail honestly; Linux itself can be pushed into that mode with vm.overcommit_memory=2. The default bet is that overcommit + a reaper produces better aggregate throughput, even though the tail experience — your job dying with exit 137 — is uglier.

Going deeper