Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why containers won over VMs

Both promise isolated, reproducible environments. One boots in milliseconds and ships in megabytes; the other boots in seconds and ships in gigabytes. The reason isn’t “containers are lighter VMs” — they’re a different kind of thing entirely.

Systems · intermediate · Apr 29, 2026

Why it exists

Around 2010, deploying a service usually meant a virtual machine: a full guest operating system, kernel and all, running on top of a hypervisor on top of the host kernel. It worked, but it was heavy. A “small” service shipped as a multi-gigabyte disk image, took tens of seconds to boot, and spent most of its RAM on a kernel and userland its actual workload would never use. Running ten copies of your service for testing meant ten kernels.

The pain point was a mismatch. What developers actually wanted was: “give me my code, my dependencies, and a filesystem that looks the way I expect — and please don’t let me see anyone else’s stuff.” They didn’t want a second kernel. They wanted isolation of the things above the kernel, not duplication of the kernel.

Linux had been quietly accumulating the pieces to do exactly that: namespaces (mount namespaces arrived first, in 2002; the core set early containers relied on was in place by around 2013, when user namespaces landed; cgroup namespaces came later, in kernel 4.6 in 2016) and cgroups (merged in 2007, driven by Google engineers). In 2013, Docker packaged those primitives behind a friendly CLI and an image format you could push and pull, and over the next several years much of the industry’s deployment story shifted from VMs to containers. Why it shifted is the heart of this post.

Why it matters now

Almost every piece of modern software you touch as an engineer assumes containers somewhere in the path: CI pipelines build and test inside them, Kubernetes exists to schedule them, serverless platforms run your functions inside them (or inside microVMs standing in for them), and the local dev environment increasingly starts from a Dockerfile.

VMs didn’t disappear — they’re still the substrate cloud providers use to isolate tenants from each other on shared hardware, and they show back up in the container world when stronger isolation is needed: Firecracker is a microVM monitor (and is what runs AWS Lambda functions); Kata launches each container inside a lightweight VM; gVisor takes a different route, intercepting syscalls in a userspace application kernel. But for the day-to-day “how do I ship my service” slot, containers have been the dominant answer for years.

The short answer

container = process + namespaces + cgroups + a layered filesystem image

A container isn’t a tiny VM. It’s a normal Linux process that the kernel has been told to show a different view of the system to — its own PID 1, its own mount tree, its own network interfaces, its own user IDs — with hard limits on how much CPU and memory it can use. There’s only one kernel: the host’s. That’s why it boots in milliseconds and weighs megabytes.

How it works

What a VM actually is

A VM goes deep. The hypervisor emulates a whole computer: virtual CPUs (with help from hardware virtualization extensions like Intel VT-x), virtual RAM, virtual NICs, virtual disks. On top of that emulated hardware, you boot a complete guest operating system — kernel, init, drivers, libc, shell, everything. Your application then runs as a normal process inside that OS.

The isolation is excellent precisely because it’s at the hardware boundary: the guest can’t see the host kernel because it has its own. The cost is also at the hardware boundary: every guest pays for a kernel, memory for that kernel, and the latency of booting it.

What a container actually is

There is no second kernel. A container is a process (or a small process tree) the host kernel has decorated with two things:

  1. Namespaces scope what the process can see. Linux now has eight kinds: mount, PID, network, IPC, UTS (hostname), user, cgroup (which only landed in 4.6, after the others), and time (newer still, added in 5.6). The first process in a new PID namespace gets PID 1 inside that namespace; from its point of view, no other processes on the host exist. A process in its own mount namespace sees a filesystem rooted somewhere completely different — typically an unpacked image with its own /usr, /lib, etc.
  2. Cgroups scope what the process can consume. CPU shares, memory limit, block-I/O bandwidth, number of PIDs. The kernel enforces these from outside the container’s view.

That’s the whole isolation story. Run ps -ef on the host while a container is running and you’ll see the container’s processes right there in the host’s process list — they’re just regular processes, with extra restrictions on what they’re allowed to look at and use.
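To make that concrete, here is a minimal sketch in Go (the language runc and containerd happen to be written in), not any real runtime’s code. It assumes root on a Linux host; the “child” re-exec convention and the hostname are illustrative. The parent launches a copy of itself inside new UTS, PID, and mount namespaces; the child sees itself as PID 1 with a private hostname, while the host’s ps -ef still shows it as an ordinary process.

```go
// Minimal namespace sketch. Assumptions: root on a Linux host; the "child"
// re-exec convention and the hostname are illustrative, not a real tool's API.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// When re-invoked with "child", we are already inside the new namespaces.
	if len(os.Args) > 1 && os.Args[1] == "child" {
		child()
		return
	}
	// Parent: re-exec this binary, asking the kernel for new namespaces.
	cmd := exec.Command("/proc/self/exe", "child")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | // own hostname
			syscall.CLONE_NEWPID | // own PID numbering
			syscall.CLONE_NEWNS, // own mount table
	}
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "parent:", err)
		os.Exit(1)
	}
}

func child() {
	// Inside the new namespaces: this process is PID 1, and hostname
	// changes are invisible to the host.
	syscall.Sethostname([]byte("inside-the-box"))
	fmt.Printf("pid as seen from inside: %d\n", os.Getpid()) // prints 1

	// Make mounts private so the /proc remount below cannot propagate
	// back to the host, then remount /proc so ps reflects the new PID space.
	syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, "")
	syscall.Mount("proc", "/proc", "proc", 0, "")

	sh := exec.Command("/bin/sh")
	sh.Stdin, sh.Stdout, sh.Stderr = os.Stdin, os.Stdout, os.Stderr
	sh.Run()
}
```

Real runtimes add the user, network, IPC, and cgroup namespaces, drop capabilities, and swap in an image root, but the kernel-facing mechanism is this same clone-with-flags call.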

The image trick: why docker pull is fast

The other half of why containers won is the image format. A container image is a stack of read-only filesystem layers, each layer a tarball of changed files relative to the layer below, addressed by the SHA-256 of its contents. A typical Python service image might be:

  1. A base layer with a minimal Debian or Alpine userland.
  2. A layer adding the Python interpreter.
  3. A layer adding the pip-installed dependencies.
  4. A thin top layer with the application code itself.
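As a small illustration of what “addressed by the SHA-256 of its contents” buys you, here is a sketch of just the hashing step (the file name is made up; this is not a registry client): because a layer’s identity is its content, any two byte-identical layers dedupe automatically.

```go
// A layer's identity is the digest of its tarball, spelled the way image
// manifests spell it: "sha256:<hex>". The file name below is illustrative.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// layerDigest streams a layer tarball through SHA-256 and formats the result.
func layerDigest(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("sha256:%x", h.Sum(nil)), nil
}

func main() {
	digest, err := layerDigest("app-layer.tar.gz")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(digest)
}
```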

When you pull an image, the registry only sends the layers your host doesn’t already have. When you run it, the kernel mounts the layers as a single filesystem using a union filesystem (overlayfs on modern Linux), with a thin writable layer on top for the running container. Two containers from the same image share the underlying read-only layers in the page cache, so the second one starts even faster than the first.
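A minimal sketch of that union mount, assuming root on a Linux host with overlayfs and made-up paths for the unpacked layers: the read-only layers stack up in lowerdir, and a fresh upperdir becomes the container’s thin writable layer.

```go
// Overlay mount sketch. Assumptions: root on Linux with overlayfs; the layer,
// upper, work, and merged paths are made up for illustration. A real runtime
// points lowerdir at the unpacked image layers it pulled.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	for _, d := range []string{"/mnt/merged", "/tmp/upper", "/tmp/work"} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	// lowerdir lists the read-only layers top to bottom; every write the
	// container makes lands in upperdir, leaving the image layers untouched.
	opts := "lowerdir=/var/lib/layers/app:/var/lib/layers/base," +
		"upperdir=/tmp/upper,workdir=/tmp/work"
	if err := syscall.Mount("overlay", "/mnt/merged", "overlay", 0, opts); err != nil {
		fmt.Fprintln(os.Stderr, "mount:", err)
		os.Exit(1)
	}
	fmt.Println("merged root at /mnt/merged")
}
```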

VM disk formats can do something similar — qcow2 supports backing files and copy-on-write snapshots, for instance — but the ecosystem around layered, content-addressed, registry-distributed images standardized on the container side. In practice, “share a base, only ship the diff” is what docker pull makes routine, while VM images are usually shipped as whole filesystems.

Why “containers boot in milliseconds”

Because there is no boot. Starting a container is approximately:

  1. A low-level container runtime — runc is the common one, sitting under higher-level systems like containerd (which Docker and Kubernetes-via-CRI typically use) — calls clone() with flags asking for new namespaces. This is a fork-style syscall — see why fork is weird.
  2. It sets up cgroups and mounts the image’s overlayfs as the new root.
  3. It execs your entrypoint binary.

That’s it. No kernel boot, no init system traversal, no driver probing. The first instruction of your application runs almost immediately after the syscall returns. A VM, in contrast, has to POST virtual hardware, run a bootloader, boot a kernel, run an init system, start services, and only then execute your code.
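Here is a compressed sketch of those three steps, assuming root, cgroup v2 mounted at /sys/fs/cgroup, and made-up paths for the image root and entrypoint; namespace setup is omitted (see the earlier sketch), and chroot stands in for the pivot_root a real runtime would perform.

```go
// Limit, re-root, exec. Assumptions: root, cgroup v2 at /sys/fs/cgroup, an
// unpacked image root at /containers/app, and an entrypoint at
// /usr/local/bin/app (all made up for illustration).
package main

import (
	"fmt"
	"os"
	"syscall"
)

func must(err error) {
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}

func main() {
	// Step 2 (cgroups): create a group, cap its memory, move ourselves in.
	cg := "/sys/fs/cgroup/demo"
	must(os.MkdirAll(cg, 0o755))
	must(os.WriteFile(cg+"/memory.max", []byte("256M"), 0o644))
	must(os.WriteFile(cg+"/cgroup.procs", []byte(fmt.Sprint(os.Getpid())), 0o644))

	// Step 2 (root filesystem): re-root into the image's merged directory.
	// Real runtimes use pivot_root; chroot keeps the sketch short.
	must(syscall.Chroot("/containers/app"))
	must(os.Chdir("/"))

	// Step 3: exec the entrypoint. This process becomes the workload;
	// there was never a kernel to boot.
	must(syscall.Exec("/usr/local/bin/app", []string{"app"}, os.Environ()))
}
```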

Show the seams

Containers won, but the reasons they didn’t fully replace VMs are worth knowing:

  1. The isolation boundary is the shared kernel’s syscall interface, a much larger attack surface than virtual hardware; a kernel exploit can escape every container on the host. That’s exactly why multi-tenant platforms keep VMs (or microVMs like Firecracker) underneath.
  2. One kernel means one kernel: a container can’t run a different operating system or kernel version than its host, while a VM can.

The shape to keep: a VM virtualizes the machine; a container virtualizes the view from inside one process. Same goal — isolated, reproducible environments — but at completely different layers of the stack. The container won the deployment slot because its layer was the right one for “ship my code and its dependencies.”

Going deeper