Why syscalls are expensive
A function call costs a few cycles. A system call costs hundreds — sometimes thousands. The gap isn't sloppy engineering; it's the price of the user/kernel boundary.
Why it exists
A normal function call is like asking your friend sitting next to you to pass the salt — quick, no fuss, a couple of words and you’re done. A syscall is like asking airport security to fetch something from your checked bag. They have to check your ID, scan you, decide whether you’re allowed, get the item, escort you back, and re-lock the door. The actual task might take a millisecond; the security ritual takes the rest of the time. Every time your program reads a file, opens a network socket, or even asks for the current time, it’s doing the airport-security version of a function call. Do it ten times a second and nobody cares. Do it a million times a second and your program spends most of its life waiting at the checkpoint.
You write read(fd, buf, n) and it looks like any other function call. It
isn’t. A normal function call is a handful of cycles — push some registers,
jump, pop, return. A syscall
is, on a modern x86-64 box, on the order of hundreds of nanoseconds in the
best case, and often more once you count the indirect costs that hit after
the call returns. That’s a 50–200× gap, and it’s not because kernel
programmers are bad at their jobs.
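If you want to see the gap on your own machine, here is a minimal sketch of the classic microbenchmark (assuming Linux and a C compiler): time syscall(SYS_getpid), which does almost nothing except cross the boundary, against a trivial function call. Absolute numbers vary with CPU and mitigation settings; the ratio is the interesting part.

```c
/* Minimal sketch: the canonical "cost of the boundary" microbenchmark.
 * getpid is invoked through syscall(2) so no libc caching can interfere.
 * Build: cc -O2 syscall_cost.c */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#define N 1000000

/* noinline so the "plain function call" side stays a real call */
static long __attribute__((noinline)) plain_call(long x) { return x + 1; }

static double elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    struct timespec t0, t1;
    volatile long sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        sink += plain_call(i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("function call:  %6.1f ns/iter\n", elapsed_ns(t0, t1) / N);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        sink += syscall(SYS_getpid);   /* one boundary crossing per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("getpid syscall: %6.1f ns/iter\n", elapsed_ns(t0, t1) / N);

    (void)sink;
    return 0;
}
```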
The gap exists because a syscall is not really a function call at all. It’s a controlled crossing of a hardware-enforced security boundary. The CPU is in one of two modes — user mode or kernel mode — and a huge amount of machinery exists to stop user code from forging its way into kernel mode. A syscall is the one sanctioned door, and going through it costs what the door costs.
The pain points the boundary is solving:
- Untrusted code can’t be allowed to touch hardware (disks, network cards, page tables, other processes’ memory) directly. One buggy program would take down the whole machine.
- The kernel runs with full privileges — it can write any RAM, talk to any device. You absolutely do not want user code to trick the CPU into running user instructions in that mode.
- The transition has to be safe in both directions. Going in, the kernel must not trust a single register, pointer, or flag the user set up. Coming out, the kernel must not leak its own state.
All of that costs cycles. The “expense” is the integrity tax.
Why it matters now
It shows up the moment your workload is bottlenecked on lots of small I/O:
- A web server doing one read and one write per byte of payload will saturate on syscalls long before the network or disk.
- A database doing a pread per 4 KB page is paying syscall overhead on every page.
- An AI inference server streaming tokens out over HTTP is doing a write per token chunk, which is fine until you have thousands of concurrent streams.
- The reason io_uring exists at all is “syscalls per I/O operation is too expensive for modern NVMe and 100 GbE; let’s submit and reap many at once.”
The AI-era version: when people batch tokenization, batch GPU launches, or use shared-memory IPC instead of sockets, “amortize the syscall” is usually among the unstated reasons. It’s also why the cost of a single syscall is a useful yardstick for how much work a small operation has to do before it’s worth a crossing.
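To make “amortize the syscall” concrete, here is a hedged sketch of the batching idea for the token-streaming case: accumulate chunks in an iovec array and flush them with a single writev instead of a write per chunk. The token strings and batch size are invented for illustration; any buffered writer gets you the same effect.

```c
/* Hedged sketch: batch small "token" writes into one writev(2) instead of
 * one write(2) per chunk. Tokens and BATCH size are made up for illustration. */
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH 64

int main(void)
{
    const char *tokens[] = { "The", " quick", " brown", " fox", "\n" };
    struct iovec iov[BATCH];
    int n = 0;

    for (size_t i = 0; i < sizeof(tokens) / sizeof(tokens[0]); i++) {
        iov[n].iov_base = (void *)tokens[i];
        iov[n].iov_len  = strlen(tokens[i]);
        if (++n == BATCH) {                     /* flush a full batch: one syscall */
            if (writev(STDOUT_FILENO, iov, n) < 0)
                perror("writev");
            n = 0;
        }
    }
    if (n > 0 && writev(STDOUT_FILENO, iov, n) < 0)   /* flush the remainder */
        perror("writev");

    return 0;
}
```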
The short answer
syscall cost = mode switch + register save/restore + KPTI page-table swap + cache & TLB & branch-predictor pollution
A function call stays in user mode and shares everything with the caller. A syscall flips the CPU into kernel mode, switches address-space context (on post-Meltdown kernels), saves a pile of state, and disturbs CPU caches the user code was depending on. You pay all of that twice — once going in, once coming out.
How it works
The hardware crossing
On x86-64, the user-space syscall path is the syscall instruction. It does
roughly this in microcode:
- Save the user RIP (instruction pointer) into RCX and RFLAGS into R11.
- Load the kernel entry point from a CPU model-specific register (MSR_LSTAR).
- Switch the CPU’s CPL from ring 3 (user) to ring 0 (kernel).
- Mask interrupts (and other RFLAGS bits) according to MSR_SFMASK.
That much is cheap-ish — tens of cycles. The expensive part is what software has to do next.
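For the curious, this is roughly what the user-space half of the crossing looks like with libc out of the way: a sketch (assuming x86-64 Linux and GCC/Clang-style inline asm) that issues the syscall instruction directly to invoke write, syscall number 1. The kernel entry point it lands at is whatever was loaded into MSR_LSTAR at boot.

```c
/* Sketch of the raw crossing: write(2) via the syscall instruction itself.
 * x86-64 Linux ABI: syscall number in rax, args in rdi/rsi/rdx; the
 * instruction clobbers rcx (saved RIP) and r11 (saved RFLAGS), which is
 * exactly the state described in the list above. */
#include <unistd.h>

static long raw_write(int fd, const void *buf, unsigned long len)
{
    long ret;
    asm volatile ("syscall"
                  : "=a"(ret)                          /* return value in rax */
                  : "a"(1L),                           /* __NR_write == 1 */
                    "D"((long)fd), "S"(buf), "d"(len)  /* rdi, rsi, rdx */
                  : "rcx", "r11", "memory");
    return ret;
}

int main(void)
{
    raw_write(STDOUT_FILENO, "hello from a raw syscall\n", 25);
    return 0;
}
```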
What the kernel has to do on entry
The kernel can’t trust a single thing about the CPU state it just inherited. So the entry stub:
- Swaps stacks. The user stack pointer can’t be trusted (could be garbage, could point at kernel memory); the kernel switches to a per-CPU kernel stack via swapgs and a load from a per-CPU area.
- Saves user registers. All general-purpose registers go onto the kernel stack so the syscall handler can use them, and so the user state can be restored exactly.
- Switches the page table — on post-Meltdown kernels with KPTI enabled. User mode and kernel mode now have different page tables, so Meltdown-style attacks can’t speculatively read kernel memory from user mode. Every syscall writes CR3 on the way in and again on the way out, and CR3 writes flush most of the TLB (PCID softens this, but doesn’t make it free).
- Validates user pointers. Any pointer the user passed in (buf in read(fd, buf, n)) has to be checked: is it actually in the user’s address space? Is it readable/writable? If the user passed a kernel address, the syscall has to refuse rather than let the kernel happily dereference it (the sketch after this list shows that refusal from user space).
- Does the actual work.
- Reverses everything on return — restore registers, swap CR3 back, sysretq back to user mode.
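You can watch the pointer-validation step refuse from user space. This sketch passes an address in the kernel half of the address space as the read buffer and gets EFAULT back instead of a crash; the file path is just an example.

```c
/* Sketch: the kernel refuses a bad user pointer. Passing an address in the
 * kernel half of the address space as the buffer makes read(2) fail with
 * EFAULT rather than letting the kernel dereference it. */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file will do */
    if (fd < 0) { perror("open"); return 1; }

    /* Any upper-half canonical address is outside the user's address space. */
    void *bad = (void *)0xffff888000000000UL;

    if (read(fd, bad, 16) < 0)
        printf("read failed as expected: %s\n", strerror(errno));  /* EFAULT */

    close(fd);
    return 0;
}
```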
Why it’s still expensive even after the cycles
The direct cycle count is only part of it. The real bill is the microarchitectural collateral damage:
- TLB flushes from CR3 writes mean the next several user-mode memory accesses miss the TLB and pay full page-walk cost. Your hot loop’s carefully-warmed translations are gone.
- L1 / L2 cache pollution. The kernel ran code and touched data; that evicted some of yours. When you come back to user mode, the first accesses miss caches that were hot a microsecond ago.
- Branch predictor and indirect-branch state get poisoned, partly as a side effect of running other code, partly deliberately on mitigated kernels (the IBPB / IBRS family). Predictors that had learned your loops have to relearn.
- Speculative-execution mitigations add real overhead per crossing on affected CPUs. Meltdown’s KPTI alone roughly doubled the cost of a syscall on older Intel chips when it landed in 2018; AMD and newer Intel parts have less of this tax, but it’s not zero anywhere. (I don’t have a current per-vendor breakdown that I’d quote precisely — it shifts with every microcode update.)
So a “200 ns syscall” is really “120 ns of direct work plus another few hundred nanoseconds of cold-cache pain that you feel as your user code runs slower for a while afterwards.” The second number doesn’t show up in benchmarks that measure the syscall in isolation, which is part of why syscall cost is consistently underestimated.
Show the seams
- The numbers are not stable. Syscall cost depends heavily on CPU generation, which mitigations are enabled, and whether KPTI is on. A no-mitigation server CPU does a syscall in roughly 50–100 ns; a mitigated laptop part can be 3–5× that. Treat any single number with suspicion.
- Not every syscall is the same price. getpid is nearly pure overhead and is the canonical “cost of the boundary itself” microbenchmark. read from a hot page-cache page is overhead-dominated. read that hits disk is dominated by the disk, not the syscall. Don’t optimize a syscall that’s already dwarfed by what it’s calling.
- vDSO cheats. A few “syscalls” — gettimeofday, clock_gettime, getcpu — aren’t real syscalls at all on Linux. They’re code mapped into your address space that reads kernel-maintained pages directly. That’s why high-rate timing code feels free.
- io_uring doesn’t make syscalls faster — it means you need far fewer of them. Submission and completion go through shared-memory ring buffers; one syscall can submit hundreds of I/Os (a small liburing sketch follows this list). The boundary cost hasn’t gone away, it’s been amortized.
- Containers don’t add a second crossing. A container is a process with fancier namespaces; its syscalls go straight to the host kernel like any other. Hypervisors do add a second one — VM exits are the syscall-of-syscalls — which is why “cloud overhead” is partly real.
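Here is what the amortization looks like with liburing, as a sketch rather than production code: four reads staged in the submission ring, then a single io_uring_submit to hand them all to the kernel at once. The file path, sizes, and counts are arbitrary, and error handling is minimal.

```c
/* Hedged sketch of io_uring's amortization using liburing: stage four reads
 * in the shared submission ring, then cross into the kernel once with
 * io_uring_submit(). Build with -luring. */
#include <stdio.h>
#include <fcntl.h>
#include <liburing.h>

#define NREADS 4
#define BLK    4096

int main(void)
{
    struct io_uring ring;
    static char bufs[NREADS][BLK];

    int fd = open("/etc/os-release", O_RDONLY);
    if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    /* Staging entries writes to shared memory only; no syscalls yet. */
    for (int i = 0; i < NREADS; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], BLK, (off_t)i * BLK);
    }

    io_uring_submit(&ring);            /* one syscall submits all four I/Os */

    for (int i = 0; i < NREADS; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d completed: %d bytes\n", i, cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```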
The deep idea is the same as virtual memory: the OS doesn’t actually have power over user code except at moments when the hardware hands control over. The syscall is one of those moments, deliberately constructed. Everything expensive about it is the cost of making that handover safe.
Famous related terms
- vDSO — vDSO = kernel code page mapped into user space + lets a few "syscalls" run without crossing — why clock_gettime is essentially free.
- io_uring — io_uring = shared-memory submission + completion rings + one syscall amortized over many I/Os — Linux’s answer to syscall-per-I/O.
- KPTI — KPTI = separate page tables for user and kernel + swap on every crossing — the post-Meltdown tax on syscall cost.
- Context switch — context switch ≈ syscall + scheduler picks a different process to resume — strictly more expensive than a syscall, for the same reasons plus a fresh address space.
- VM exit — VM exit ≈ syscall but from guest to hypervisor — same shape, one level up.
- eBPF — eBPF = verified bytecode the kernel runs in-kernel on your behalf — partly motivated by “if I could just run my filter inside the kernel, I wouldn’t pay the boundary cost per packet.”
Going deeper
- Brendan Gregg’s writeups on KPTI / Meltdown overhead are the clearest numbers I’ve seen on what mitigations actually cost in production.
- The Linux arch/x86/entry/entry_64.S syscall entry stub is short and surprisingly readable — every line is paying for something on this list.
- Jens Axboe’s io_uring design notes (LWN coverage from 2019 onward) lay out the “amortize the boundary” argument explicitly.