Why containers won over VMs
Both promise isolated, reproducible environments. One boots in milliseconds and ships in megabytes; the other boots in seconds and ships in gigabytes. The reason isn't 'containers are lighter VMs' — they're a different kind of thing entirely.
Why it exists
Around 2010, deploying a service usually meant a virtual machine: a full guest operating system, kernel and all, running on top of a hypervisor on top of the host kernel. It worked, but it was heavy. A “small” service shipped as a multi-gigabyte disk image, took tens of seconds to boot, and spent most of its RAM on a kernel and userland its actual workload would never use. Running ten copies of your service for testing meant ten kernels.
The pain point was a mismatch. What developers actually wanted was: “give me my code, my dependencies, and a filesystem that looks the way I expect — and please don’t let me see anyone else’s stuff.” They didn’t want a second kernel. They wanted isolation of the things above the kernel, not duplication of the kernel.
Linux had been quietly accumulating the pieces to do exactly that: namespaces (starting with mount namespaces in 2002, with the core set early containers relied on landing by around 2013 with user namespaces; cgroup namespaces came later, in 4.6 / 2016) and cgroups (merged in 2007 by Google engineers). In 2013, Docker packaged those primitives behind a friendly CLI and an image format you could push and pull, and over the next several years much of the industry’s deployment story shifted from VMs to containers. The reason it shifted is the heart of this post.
Why it matters now
Almost every piece of modern software you touch as an engineer assumes containers somewhere in the path:
- CI often runs in containers. Many GitHub Actions and GitLab jobs use a container: key or a container-based executor; even when the outer runner is a VM, build and test steps are routinely wrapped in containers for reproducibility.
- Production runs in containers. Kubernetes scheduling, the entire cloud-native ecosystem, your Dockerfile — all of it.
- Local dev runs in containers. Devcontainers, docker compose, the Postgres you spin up for a feature branch.
- AI/ML serving runs in containers. GPU-enabled containers (via the NVIDIA Container Toolkit) are a common way model servers, training jobs, and inference endpoints get shipped. Model weights are sometimes baked into an image and often mounted or fetched at runtime; either way the runtime is a container with the host’s GPU exposed.
VMs didn’t disappear — they’re still the substrate cloud providers use to isolate tenants from each other on shared hardware, and they show back up in the container world when stronger isolation is needed: Firecracker is a microVM monitor (and is what runs AWS Lambda functions); Kata launches each container inside a lightweight VM; gVisor takes a different route, intercepting syscalls in a userspace application kernel. But for the day-to-day “how do I ship my service” slot, containers have been the dominant answer for years.
The short answer
container = process + namespaces + cgroups + a layered filesystem image
A container isn’t a tiny VM. It’s a normal Linux process that the kernel has been told to show a different view of the system to — its own PID 1, its own mount tree, its own network interfaces, its own user IDs — with hard limits on how much CPU and memory it can use. There’s only one kernel: the host’s. That’s why it boots in milliseconds and weighs megabytes.
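One way to convince yourself there is only one kernel is to ask for it from both sides. The minimal C sketch below just calls uname(2); compiled and run on the host and then inside any container on that same host, it reports the same kernel release (a VM guest, by contrast, reports its own).

```c
#include <stdio.h>
#include <sys/utsname.h>

int main(void) {
    struct utsname u;
    if (uname(&u) == -1) { perror("uname"); return 1; }
    // Identical output on the host and inside any container on that host:
    // namespaces hide processes and mounts, but there is only one kernel.
    printf("kernel: %s %s\n", u.sysname, u.release);
    return 0;
}
```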
How it works
What a VM actually is
A VM goes deep. The hypervisor emulates a whole computer: virtual CPUs (with help from hardware virtualization extensions like Intel VT-x), virtual RAM, virtual NICs, virtual disks. On top of that emulated hardware, you boot a complete guest operating system — kernel, init, drivers, libc, shell, everything. Your application then runs as a normal process inside that OS.
The isolation is excellent precisely because it’s at the hardware boundary: the guest can’t see the host kernel because it has its own. The cost is also at the hardware boundary: every guest pays for a kernel, memory for that kernel, and the latency of booting it.
What a container actually is
There is no second kernel. A container is a process (or a small process tree) the host kernel has decorated with two things:
- Namespaces scope what the process can see. Linux now has eight kinds: mount, PID, network, IPC, UTS (hostname), user, cgroup, and time (the last two landed later, in 4.6 and 5.6). The first process in a new PID namespace gets PID 1 inside that namespace; from its point of view, no other processes on the host exist. A process in its own mount namespace sees a filesystem rooted somewhere completely different — typically an unpacked image with its own /usr, /lib, etc.
- Cgroups scope what the process can consume. CPU shares, memory limit, block-I/O bandwidth, number of PIDs. The kernel enforces these from outside the container’s view.
That’s the whole isolation story. Run ps -ef on the host while a
container is running and you’ll see the container’s processes right there
in the host’s process list — they’re just regular processes, with extra
restrictions on what they’re allowed to look at and use.
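To see the primitive with nothing else in the way, here is a minimal C sketch (the raw syscall, not how runc structures it; it needs root or a user namespace). clone(2) with CLONE_NEWPID puts the child in a fresh PID namespace, so it reports itself as PID 1, while the parent, and ps -ef on the host, sees the same process under an ordinary host PID.

```c
// Minimal namespace sketch; real runtimes do much more around this one call.
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child_main(void *arg) {
    // Inside the new PID namespace, this process believes it is PID 1.
    printf("inside:  getpid() = %d\n", (int)getpid());
    sleep(30);  // long enough to run `ps -ef` on the host and spot it there
    return 0;
}

int main(void) {
    // CLONE_NEWPID: fresh PID namespace; CLONE_NEWUTS: fresh hostname.
    pid_t pid = clone(child_main, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | CLONE_NEWUTS | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); exit(EXIT_FAILURE); }

    // From the host's point of view the same process has an ordinary PID.
    printf("outside: child pid = %d\n", (int)pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```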
The image trick: why docker pull is fast
The other half of why containers won is the image format. A container image is a stack of read-only filesystem layers, each layer a tarball of changed files relative to the layer below, addressed by the SHA-256 of its contents. A typical Python service image might be:
- debian:slim base layer (shared by every Debian-based image)
- Python runtime layer (shared by every Python image based on this base)
- Your pip install layer (shared by every build with the same requirements.txt)
- Your application code (changes every commit)
When you pull an image, the registry only sends the layers your host doesn’t already have. When you run it, the kernel mounts the layers as a single filesystem using a union filesystem (overlayfs on modern Linux), with a thin writable layer on top for the running container. Two containers from the same image share the underlying read-only layers in the page cache, so the second one starts even faster than the first.
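To peek under the hood without Docker, the union mount itself is a single mount(2) call with the overlay filesystem type. Here is a sketch with hypothetical layer directories (they must already exist, and it needs root): the read-only image layers go in lowerdir, listed topmost first, and writes land only in upperdir.

```c
// Sketch of the overlayfs mount a runtime performs. All paths below are
// hypothetical placeholders for unpacked layer directories.
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    const char *opts =
        "lowerdir=/tmp/layers/app:/tmp/layers/python:/tmp/layers/debian,"
        "upperdir=/tmp/c1/upper,workdir=/tmp/c1/work";

    if (mount("overlay", "/tmp/c1/merged", "overlay", 0, opts) == -1) {
        perror("mount overlay");
        return 1;
    }
    // /tmp/c1/merged now shows the stacked image plus any writes, which land
    // only in /tmp/c1/upper; the image layers themselves are never modified.
    puts("merged view at /tmp/c1/merged");
    return 0;
}
```

On a host running containers with the overlay2 storage driver, you can usually see mounts shaped exactly like this in /proc/mounts while a container is up.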
VM disk formats can do something similar — qcow2 supports backing
files and copy-on-write snapshots, for instance — but the ecosystem
around layered, content-addressed, registry-distributed images
standardized on the container side. In practice, “share a base, only
ship the diff” is what docker pull makes routine, while VM images are
usually shipped as whole filesystems.
Why “containers boot in milliseconds”
Because there is no boot. Starting a container is approximately:
- A low-level container runtime — runc is the common one, sitting under higher-level systems like containerd (which Docker and Kubernetes-via-CRI typically use) — calls clone() with flags asking for new namespaces. This is a fork-style syscall — see why fork is weird.
- It sets up cgroups and mounts the image’s overlayfs as the new root.
- It execs your entrypoint binary.
That’s it. No kernel boot, no init system traversal, no driver probing. The first instruction of your application runs almost immediately after the syscall returns. A VM, in contrast, has to POST virtual hardware, run a bootloader, boot a kernel, run an init system, start services, and only then execute your code.
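Strung together, those three steps fit in a page of C. The sketch below is a hypothetical mini-runtime, not runc: it assumes an image has already been unpacked at /tmp/c1/merged (with a /proc directory inside), uses chroot(2) instead of pivot_root(2) to stay short, applies one token cgroup v2 limit, and needs root. Real runtimes are far more careful about mount propagation, capabilities, and the order of these operations.

```c
// Hypothetical mini-runtime: clone into new namespaces, cap memory via
// cgroup v2, chroot into an unpacked image, exec the entrypoint. Needs root.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define ROOTFS "/tmp/c1/merged"       /* assumed: already-unpacked image  */
#define CGROUP "/sys/fs/cgroup/demo"  /* assumed: cgroup v2 hierarchy     */

static char stack_mem[1024 * 1024];

static void write_file(const char *path, const char *val) {
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror(path); return; }
    if (write(fd, val, strlen(val)) < 0) perror(path);
    close(fd);
}

static int container_main(void *arg) {
    char **cmd = arg;
    /* Step 2: make the unpacked image the root and give it a /proc. */
    if (chroot(ROOTFS) == -1 || chdir("/") == -1) { perror("chroot"); exit(1); }
    mount("proc", "/proc", "proc", 0, NULL);
    /* Step 3: become the entrypoint. No kernel boot happened anywhere. */
    execvp(cmd[0], cmd);
    perror("execvp");
    exit(1);
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s cmd [args...]\n", argv[0]); return 1; }

    /* A token cgroup limit: 256 MiB of memory for everything in the group. */
    mkdir(CGROUP, 0755);
    write_file(CGROUP "/memory.max", "268435456");

    /* Step 1: clone() with flags asking for new PID, mount, UTS and net
     * namespaces; the child starts life as PID 1 of its own little world. */
    pid_t pid = clone(container_main, stack_mem + sizeof(stack_mem),
                      CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWNET | SIGCHLD,
                      &argv[1]);
    if (pid == -1) { perror("clone"); return 1; }

    /* Move the child into the cgroup (real runtimes do this before exec). */
    char buf[16];
    snprintf(buf, sizeof(buf), "%d", (int)pid);
    write_file(CGROUP "/cgroup.procs", buf);

    waitpid(pid, NULL, 0);
    return 0;
}
```

Run as root with something like ./mini /bin/sh (assuming the unpacked image actually contains /bin/sh), and you get a shell that believes it is PID 1 on a one-process machine.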
Show the seams
Containers won, but the reasons they didn’t fully replace VMs are worth knowing:
- Same kernel = a shared blast radius. A kernel-level security bug (sandbox escape, namespace bug) lets a container break out to the host in ways a VM bug usually can’t, because the VM’s “kernel” is the guest kernel, not the host’s. This is why public cloud providers still wrap customer containers in a stronger sandbox (Firecracker or Kata Containers’ lightweight VMs, gVisor’s user-space kernel) when running untrusted code.
- No Windows containers on a Linux host (and vice versa). Containers share the host kernel, so the kernel ABI has to match. VMs don’t care.
- GPUs and other devices need explicit pass-through. GPU containers work because the NVIDIA Container Toolkit injects the host’s driver and device files into the container’s mount namespace. It’s not magic; someone wired up the seams.
- “Stateless” is doing real work in this story. Containers are easy to throw away because the design assumes state lives elsewhere (databases, object storage). The moment you put real state inside a container, the layered-image and “cattle, not pets” framing starts fighting you. Kubernetes’ StatefulSets exist exactly because this is awkward.
- Cold-start is fast but not free. Pulling a multi-gigabyte AI/ML image (CUDA + PyTorch + model weights) over the network on a cold node is the dominant cost of a “container start” in practice — not the syscalls. Image-streaming and lazy-pull (e.g. SOCI, stargz) exist to attack this.
The shape to keep: a VM virtualizes the machine; a container virtualizes the view from inside one process. Same goal — isolated, reproducible environments — but at completely different layers of the stack. The container won the deployment slot because its layer was the right one for “ship my code and its dependencies.”
Famous related terms
- Namespace — namespace = a per-process kernel-level scope for some kind of resource (filesystem, PID, network…) — the “what can I see” half of containerization.
- cgroup — cgroup = kernel-enforced quota and accounting for a group of processes — the “what can I use” half.
- Hypervisor — hypervisor = thin layer that virtualizes hardware so multiple guest OSes can run on one machine — what a VM runs on.
- Image layer — layer = tarball of file changes + content-addressed hash — why docker pull is incremental and image storage deduplicates.
- OCI — OCI = Open Container Initiative ≈ vendor-neutral specs for container images and runtimes — what docker build produces is an OCI-compatible image; OCI runtimes like runc then run an unpacked OCI bundle derived from it.
- Firecracker — Firecracker = minimal VMM + microVMs that boot in milliseconds — what AWS Lambda and Fargate use to get VM-grade isolation at container-grade speed.
- Kata Containers — Kata = OCI runtime + each container in its own lightweight VM — container ergonomics, VM isolation boundary.
- gVisor — gVisor = userspace application kernel that intercepts syscalls + OCI runtime — a different bet: keep one host kernel, but have container syscalls hit a sandbox kernel first.
Going deeper
- Linux man pages: namespaces(7), cgroups(7), clone(2) — the primary sources on what the kernel actually provides.
- Jérôme Petazzoni’s classic talk “Cgroups, namespaces and beyond: what are containers made from?” — a good walk through the primitives without the Docker marketing layer.
- The OCI image-spec and runtime-spec on GitHub — the real, vendor-neutral contract behind docker push.