Why containers won over VMs
Both promise isolated, reproducible environments. One boots in milliseconds and ships in megabytes; the other boots in seconds and ships in gigabytes. The reason isn't 'containers are lighter VMs' — they're a different kind of thing entirely.
Why it exists
Around 2010, deploying a service usually meant a virtual machine: a full guest operating system, kernel and all, running on top of a hypervisor on top of the host kernel. It worked, but it was heavy. A “small” service shipped as a multi-gigabyte disk image, took tens of seconds to boot, and spent most of its RAM on a kernel and userland its actual workload would never use. Running ten copies of your service for testing meant ten kernels.
The pain point was a mismatch. What developers actually wanted was: “give me my code, my dependencies, and a filesystem that looks the way I expect — and please don’t let me see anyone else’s stuff.” They didn’t want a second kernel. They wanted isolation of the things above the kernel, not duplication of the kernel.
Linux had been quietly accumulating the pieces to do exactly that: namespaces (starting with mount namespaces in 2002, with the core set early containers relied on landing by around 2013 with user namespaces; cgroup namespaces came later, in 4.6 / 2016) and cgroups (merged in 2007 by Google engineers). In 2013, Docker packaged those primitives behind a friendly CLI and an image format you could push and pull, and over the next several years much of the industry’s deployment story shifted from VMs to containers. The reason it shifted is the heart of this post.
Why it matters now
Almost every piece of modern software you touch as an engineer assumes containers somewhere in the path:
- CI often runs in containers. Many GitHub Actions and GitLab jobs use a container: key or a container-based executor; even when the outer runner is a VM, build and test steps are routinely wrapped in containers for reproducibility.
- Production runs in containers. Kubernetes scheduling, the entire cloud-native ecosystem, your Dockerfile — all of it.
- Local dev runs in containers. Devcontainers, docker compose, the Postgres you spin up for a feature branch.
- AI/ML serving runs in containers. GPU-enabled containers (via the NVIDIA Container Toolkit) are a common way model servers, training jobs, and inference endpoints get shipped. Model weights are sometimes baked into an image and often mounted or fetched at runtime; either way the runtime is a container with the host’s GPU exposed.
VMs didn’t disappear — they’re still the substrate cloud providers use to isolate tenants from each other on shared hardware, and they show back up in the container world when stronger isolation is needed: Firecracker is a microVM monitor (and is what runs AWS Lambda functions); Kata launches each container inside a lightweight VM; gVisor takes a different route, intercepting syscalls in a userspace application kernel. But for the day-to-day “how do I ship my service” slot, containers have been the dominant answer for years.
The short answer
container = process + namespaces + cgroups + a layered filesystem image
A container isn’t a tiny VM. It’s a normal Linux process that the kernel has been told to show a different view of the system to — its own PID 1, its own mount tree, its own network interfaces, its own user IDs — with hard limits on how much CPU and memory it can use. There’s only one kernel: the host’s. That’s why it boots in milliseconds and weighs megabytes.
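One way to convince yourself there is only one kernel is to ask for it from both sides. The minimal C sketch below just calls uname(2); compiled and run on the host and then inside any container on that same host, it reports the same kernel release (a VM guest, by contrast, reports its own).

```c
#include <stdio.h>
#include <sys/utsname.h>

int main(void) {
    struct utsname u;
    if (uname(&u) == -1) { perror("uname"); return 1; }
    // Identical output on the host and inside any container on that host:
    // namespaces hide processes and mounts, but there is only one kernel.
    printf("kernel: %s %s\n", u.sysname, u.release);
    return 0;
}
```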
How it works
What a VM actually is
A VM goes deep. The hypervisor emulates a whole computer: virtual CPUs (with help from hardware virtualization extensions like Intel VT-x), virtual RAM, virtual NICs, virtual disks. On top of that emulated hardware, you boot a complete guest operating system — kernel, init, drivers, libc, shell, everything. Your application then runs as a normal process inside that OS.
The isolation is excellent precisely because it’s at the hardware boundary: the guest can’t see the host kernel because it has its own. The cost is also at the hardware boundary: every guest pays for a kernel, memory for that kernel, and the latency of booting it.
What a container actually is
There is no second kernel. A container is a process (or a small process tree) the host kernel has decorated with two things:
- Namespaces scope what the process can see. Linux now has eight kinds: mount, PID, network, IPC, UTS (hostname), user, cgroup, and time (the last two landed later, in 4.6 and 5.6). The first process in a new PID namespace gets PID 1 inside that namespace; from its point of view, no other processes on the host exist. A process in its own mount namespace sees a filesystem rooted somewhere completely different — typically an unpacked image with its own /usr, /lib, etc.
- Cgroups scope what the process can consume. CPU shares, memory limit, block-I/O bandwidth, number of PIDs. The kernel enforces these from outside the container’s view.
That’s the whole isolation story. Run ps -ef on the host while a
container is running and you’ll see the container’s processes right there
in the host’s process list — they’re just regular processes, with extra
restrictions on what they’re allowed to look at and use.
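To see the primitive with nothing else in the way, here is a minimal C sketch (the raw syscall, not how runc structures it; it needs root or a user namespace). clone(2) with CLONE_NEWPID puts the child in a fresh PID namespace, so it reports itself as PID 1, while the parent, and ps -ef on the host, sees the same process under an ordinary host PID.

```c
// Minimal namespace sketch; real runtimes do much more around this one call.
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child_main(void *arg) {
    // Inside the new PID namespace, this process believes it is PID 1.
    printf("inside:  getpid() = %d\n", (int)getpid());
    sleep(30);  // long enough to run `ps -ef` on the host and spot it there
    return 0;
}

int main(void) {
    // CLONE_NEWPID: fresh PID namespace; CLONE_NEWUTS: fresh hostname.
    pid_t pid = clone(child_main, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); exit(EXIT_FAILURE); }

    // From the host's point of view the same process has an ordinary PID.
    printf("outside: child pid = %d\n", (int)pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```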
The image trick: why docker pull is fast
The other half of why containers won is the image format. A container image is a stack of read-only filesystem layers, each layer a tarball of changed files relative to the layer below, addressed by the SHA-256 of its contents. A typical Python service image might be:
- debian:slim base layer (shared by every Debian-based image)
- Python runtime layer (shared by every Python image based on this base)
- Your pip install layer (shared by every build with the same requirements.txt)
- Your application code (changes every commit)
When you pull an image, the registry only sends the layers your host doesn’t already have. When you run it, the kernel mounts the layers as a single filesystem using a union filesystem (overlayfs on modern Linux), with a thin writable layer on top for the running container. Two containers from the same image share the underlying read-only layers in the page cache, so the second one starts even faster than the first.
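To peek under the hood without Docker, the union mount itself is a single mount(2) call with the overlay filesystem type. Here is a sketch with hypothetical layer directories (they must already exist, and it needs root): the read-only image layers go in lowerdir, listed topmost first, and writes land only in upperdir.

```c
// Sketch of the overlayfs mount a runtime performs. All paths below are
// hypothetical placeholders for unpacked layer directories.
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    const char *opts =
        "lowerdir=/tmp/layers/app:/tmp/layers/python:/tmp/layers/debian,"
        "upperdir=/tmp/c1/upper,workdir=/tmp/c1/work";

    if (mount("overlay", "/tmp/c1/merged", "overlay", 0, opts) == -1) {
        perror("mount overlay");
        return 1;
    }
    // /tmp/c1/merged now shows the stacked image plus any writes, which land
    // only in /tmp/c1/upper; the image layers themselves are never modified.
    puts("merged view at /tmp/c1/merged");
    return 0;
}
```

On a host running containers with the overlay2 storage driver, you can usually see mounts shaped exactly like this in /proc/mounts while a container is up.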
VM disk formats can do something similar — qcow2 supports backing
files and copy-on-write snapshots, for instance — but the ecosystem
around layered, content-addressed, registry-distributed images
standardized on the container side. In practice, “share a base, only
ship the diff” is what docker pull makes routine, while VM images are
usually shipped as whole filesystems.
Why “containers boot in milliseconds”
Because there is no boot. Starting a container is approximately:
- A low-level container runtime — runc is the common one, sitting under higher-level systems like containerd (which Docker and Kubernetes-via-CRI typically use) — calls clone() with flags asking for new namespaces. This is a fork-style syscall — see why fork is weird.
- It sets up cgroups and mounts the image’s overlayfs as the new root.
- It execs your entrypoint binary.
That’s it. No kernel boot, no init system traversal, no driver probing. The first instruction of your application runs almost immediately after the syscall returns. A VM, in contrast, has to POST virtual hardware, run a bootloader, boot a kernel, run an init system, start services, and only then execute your code.
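Strung together, those three steps fit in a page of C. The sketch below is a hypothetical mini-runtime, not runc: it assumes an image has already been unpacked at /tmp/c1/merged (with a /proc directory inside), uses chroot(2) instead of pivot_root(2) to stay short, applies one token cgroup v2 limit, and needs root. Real runtimes are far more careful about mount propagation, capabilities, and the order of these operations.

```c
// Hypothetical mini-runtime: clone into new namespaces, cap memory via
// cgroup v2, chroot into an unpacked image, exec the entrypoint. Needs root.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define ROOTFS "/tmp/c1/merged"       /* assumed: already-unpacked image  */
#define CGROUP "/sys/fs/cgroup/demo"  /* assumed: cgroup v2 hierarchy     */

static char stack_mem[1024 * 1024];

static void write_file(const char *path, const char *val) {
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror(path); return; }
    if (write(fd, val, strlen(val)) < 0) perror(path);
    close(fd);
}

static int container_main(void *arg) {
    char **cmd = arg;
    /* Step 2: make the unpacked image the root and give it a /proc. */
    if (chroot(ROOTFS) == -1 || chdir("/") == -1) { perror("chroot"); exit(1); }
    mount("proc", "/proc", "proc", 0, NULL);
    /* Step 3: become the entrypoint. No kernel boot happened anywhere. */
    execvp(cmd[0], cmd);
    perror("execvp");
    exit(1);
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]); return 1; }

    /* A token cgroup limit: 256 MiB of memory for everything in the group. */
    mkdir(CGROUP, 0755);
    write_file(CGROUP "/memory.max", "268435456");

    /* Step 1: clone() with flags asking for new PID, mount, UTS and net
     * namespaces; the child starts life as PID 1 of its own little world. */
    pid_t pid = clone(container_main, stack_mem + sizeof(stack_mem),
                      CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET | SIGCHLD,
                      &argv[1]);
    if (pid == -1) { perror("clone"); return 1; }

    /* Move the child into the cgroup (real runtimes do this before exec). */
    char buf[16];
    snprintf(buf, sizeof(buf), "%d", (int)pid);
    write_file(CGROUP "/cgroup.procs", buf);

    waitpid(pid, NULL, 0);
    return 0;
}
```

Run as root with something like ./mini /bin/sh (assuming the unpacked image actually contains /bin/sh), and you get a shell that believes it is PID 1 on a one-process machine.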
Show the seams
Containers won, but the reasons they didn’t fully replace VMs are worth knowing:
- Same kernel = a shared blast radius. A kernel-level security bug (sandbox escape, namespace bug) lets a container break out to the host in ways a VM bug usually can’t, because the VM’s “kernel” is the guest kernel, not the host’s. This is why public cloud providers still wrap customer containers in a stronger sandbox (Firecracker or Kata Containers’ lightweight VMs, gVisor’s user-space kernel) when running untrusted code.
- No Windows containers on a Linux host (and vice versa). Containers share the host kernel, so the kernel ABI has to match. VMs don’t care.
- GPUs and other devices need explicit pass-through. GPU containers work because the NVIDIA Container Toolkit injects the host’s driver and device files into the container’s mount namespace. It’s not magic; someone wired up the seams.
- “Stateless” is doing real work in this story. Containers are easy to throw away because the design assumes state lives elsewhere (databases, object storage). The moment you put real state inside a container, the layered-image and “cattle, not pets” framing starts fighting you. Kubernetes’ StatefulSets exist exactly because this is awkward.
- Cold-start is fast but not free. Pulling a multi-gigabyte AI/ML image (CUDA + PyTorch + model weights) over the network on a cold node is the dominant cost of a “container start” in practice — not the syscalls. Image-streaming and lazy-pull (e.g. SOCI, stargz) exist to attack this.
The shape to keep: a VM virtualizes the machine; a container virtualizes the view from inside one process. Same goal — isolated, reproducible environments — but at completely different layers of the stack. The container won the deployment slot because its layer was the right one for “ship my code and its dependencies.”
Famous related terms
- Namespace — namespace = a per-process kernel-level scope for some kind of resource (filesystem, PID, network…) — the “what can I see” half of containerization.
- cgroup — cgroup = kernel-enforced quota and accounting for a group of processes — the “what can I use” half.
- Hypervisor — hypervisor = thin layer that virtualizes hardware so multiple guest OSes can run on one machine — what a VM runs on.
- Image layer — layer = tarball of file changes + content-addressed hash — why docker pull is incremental and image storage deduplicates.
- OCI — OCI = Open Container Initiative ≈ vendor-neutral specs for container images and runtimes — what docker build produces is an OCI-compatible image; OCI runtimes like runc then run an unpacked OCI bundle derived from it.
- Firecracker — Firecracker = minimal VMM + microVMs that boot in milliseconds — what AWS Lambda and Fargate use to get VM-grade isolation at container-grade speed.
- Kata Containers — Kata = OCI runtime + each container in its own lightweight VM — container ergonomics, VM isolation boundary.
- gVisor — gVisor = userspace application kernel that intercepts syscalls + OCI runtime — a different bet: keep one host kernel, but have container syscalls hit a sandbox kernel first.
Going deeper
- Linux man pages: namespaces(7), cgroups(7), clone(2) — the primary sources on what the kernel actually provides.
- Jérôme Petazzoni’s classic talk “Cgroups, namespaces and beyond: what are containers made from?” — a good walk through the primitives without the Docker marketing layer.
- The OCI image-spec and runtime-spec on GitHub — the real, vendor-neutral contract behind docker push.