Why UTF-8 won
Unicode could have been a fixed 4-byte-per-character encoding. Instead, the web runs on a variable-width hack — and that hack is why everything still works.
Why it exists
Imagine it’s 1992. You’re writing C code. A string is a char* — a pointer to bytes that ends with a 0. Every library, every kernel call, every config file, every protocol assumes this. ASCII fits in 7 bits, so one byte per character, and the world holds together.
Then you want to support Japanese. And Arabic. And Cyrillic. And you discover there are roughly a million possible characters in the world’s writing systems once you count CJK ideographs and historic scripts. A byte isn’t enough. So what do you do?
The obvious answer — make every character 4 bytes — is what UTF-32 does. It’s clean. Index i of the string is at offset i * 4. No ambiguity. The problem: every existing program that reads bytes, every filesystem, every network protocol, every C library breaks. A file that used to be 1KB is now 4KB. And three out of every four bytes in an English document are zeros, because ASCII characters fit in one byte but you’re padding them out to four. The disk groans. The wire groans. And worst of all, a \0 byte appears inside every English character, which means strlen stops at the first character and every string function in your C library silently truncates your data.
UTF-8 is the answer to “how do we add a million characters without breaking anything that already exists.” It exists because backwards compatibility with ASCII and with byte-oriented C code wasn’t a nice-to-have — it was the only way the new encoding could possibly win adoption.
Why it matters now
Every web page, every JSON payload, every source file, every commit message you’ve written this week is UTF-8. The W3C made it the default for HTML5. JSON requires it. Linux filenames are byte sequences that are almost always interpreted as UTF-8. Go and Rust string literals are UTF-8 by definition. Even Windows, which spent two decades on UTF-16 internally, has been quietly migrating APIs toward UTF-8.
This means a software engineer who doesn’t understand UTF-8 will eventually hit a bug they can’t explain — a string that reverses wrong, a length that’s off, a regex that matches the middle of a character. AI-era engineers hit this constantly because LLM tokenizers operate on byte-pair-encoded UTF-8, not on “characters” in any human sense. (See tokenization.)
The short answer
UTF-8 = variable-width encoding (1–4 bytes) + ASCII as a literal subset + self-synchronizing byte pattern
UTF-8 encodes each Unicode code point in 1, 2, 3, or 4 bytes. The first 128 code points (ASCII) are encoded as a single byte identical to ASCII — so every ASCII file is already valid UTF-8. Non-ASCII characters use multi-byte sequences whose bytes are all in the range 0x80–0xFF, which means they never collide with ASCII control characters or with \0.
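You can see both claims by dumping the bytes of a short mixed string. A minimal C sketch; the é is written as its explicit bytes C3 A9 so it doesn’t depend on how the source file is saved:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "Hi, café" with the é spelled as its two UTF-8 bytes. The ASCII
       characters are one byte each, unchanged; the é's bytes are both
       >= 0x80, so they can never be mistaken for ASCII or for \0. */
    const char *s = "Hi, caf\xC3\xA9";
    for (size_t i = 0; i < strlen(s); i++)
        printf("%02X ", (unsigned char)s[i]);
    printf("\n");   /* 48 69 2C 20 63 61 66 C3 A9 */
    return 0;
}
```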
How it works
The encoding rule is almost suspiciously simple. Look at the high bits of each byte:
0xxxxxxx → 1 byte, 7 bits of payload (ASCII)
110xxxxx 10xxxxxx → 2 bytes, 11 bits
1110xxxx 10xxxxxx 10xxxxxx → 3 bytes, 16 bits
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx → 4 bytes, 21 bits
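That table translates almost line for line into code. Here’s a minimal encoder sketch (utf8_encode is my own name, not a standard library function, and it skips finer validity rules such as rejecting surrogate code points):

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch: encode one code point (0..0x10FFFF) into UTF-8. Writes up to 4
   bytes into out and returns how many were written, or 0 for values outside
   the Unicode range. Happily encodes surrogates, which a real encoder must
   reject. */
static int utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp < 0x80) {                      /* 0xxxxxxx: ASCII, one identical byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void) {
    uint32_t samples[] = { 0x41, 0xE9, 0x4E2D, 0x1F600 };  /* A, é, 中, 😀 */
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        unsigned char buf[4];
        int n = utf8_encode(samples[i], buf);
        printf("U+%04lX ->", (unsigned long)samples[i]);
        for (int j = 0; j < n; j++)
            printf(" %02X", buf[j]);
        printf("\n");   /* 41 / C3 A9 / E4 B8 AD / F0 9F 98 80 */
    }
    return 0;
}
```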
A leading byte tells you how long the sequence is. A continuation byte always starts with 10. For multi-byte sequences, the number of leading 1s in the first byte (110, 1110, 11110) is the total byte length. From any byte in the stream, you can find the start of the current character by scanning backward at most 3 bytes for one that doesn’t start with 10. That property — self-synchronizing — means a corrupted byte loses one character, not the whole rest of the file.
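Here’s that backward scan as a sketch (utf8_char_start is a made-up helper, not a library call):

```c
#include <stdio.h>

/* Back up from an arbitrary byte offset to the start of the code point that
   contains it, by skipping continuation bytes (at most 3 of them). */
static size_t utf8_char_start(const unsigned char *s, size_t i) {
    while (i > 0 && (s[i] & 0xC0) == 0x80)   /* 10xxxxxx: continuation byte */
        i--;
    return i;
}

int main(void) {
    /* "naïve": the ï is the two bytes C3 AF at offsets 2 and 3. */
    const unsigned char s[] = "na\xC3\xAFve";
    printf("%zu\n", utf8_char_start(s, 3));  /* prints 2: start of the ï */
    return 0;
}
```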
The clever bit: continuation bytes are 10xxxxxx, which is 0x80–0xBF. A leading byte for a multi-byte sequence is 0xC0–0xFF (valid UTF-8 only ever uses 0xC2–0xF4, but the whole range is reserved for leading bytes). ASCII bytes are 0x00–0x7F. These three ranges don’t overlap. So you can search a UTF-8 string for the ASCII byte '/' (0x2F) using plain memchr and you will never get a false hit inside a multi-byte character. This is the magic that let UTF-8 slot into existing C code without rewriting it. Byte-oblivious code mostly keeps working.
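A small sketch of that guarantee (the path is made-up sample data, with the ø spelled as its explicit UTF-8 bytes):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "/home/søren/notes.txt" with the ø written as C3 B8. Both of those
       bytes are >= 0x80, so a byte-level search for '/' (0x2F) can never
       land inside the ø. */
    const char *path = "/home/s\xC3\xB8ren/notes.txt";
    const char *p = path;
    int slashes = 0;
    while ((p = memchr(p, '/', strlen(p))) != NULL) {
        slashes++;
        p++;                        /* step past this slash, keep searching */
    }
    printf("%d\n", slashes);        /* prints 3, never a false hit */
    return 0;
}
```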
The trade you make: indexing is no longer O(1). s[5] doesn’t mean “the 6th character” anymore — to find the 6th character you have to walk from the start, decoding one code point at a time. In practice this matters less than people fear, because most string operations (search, concatenate, split on a delimiter, send over a socket) don’t need character-indexed access. They just need bytes that round-trip cleanly.
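Here’s what “the 6th character” actually costs (utf8_nth is a made-up helper; real code would usually lean on a library or just iterate forward anyway):

```c
#include <stddef.h>
#include <stdio.h>

/* Walk to the nth code point (0-based) by skipping continuation bytes.
   O(n) in bytes; there is no way to jump straight to it. */
static const char *utf8_nth(const char *s, size_t n) {
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {   /* start of a code point */
            if (n == 0)
                return s;
            n--;
        }
    }
    return NULL;                                    /* string was too short */
}

int main(void) {
    const char *s = "h\xC3\xA9llo";          /* "héllo": 6 bytes, 5 code points */
    printf("%td\n", utf8_nth(s, 2) - s);     /* prints 3: the first 'l' starts at byte 3 */
    return 0;
}
```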
The seam: “characters” is a lie
Here’s the honest part. Even after you decode UTF-8 into code points, “character” still doesn’t mean what you think. The user-perceived character é can be one code point (U+00E9) or two (U+0065 U+0301 — e plus a combining acute accent). Family emoji like 👨‍👩‍👧 are sequences of multiple code points joined by zero-width joiners. What humans call a “character” is a grapheme cluster, and finding grapheme boundaries requires a Unicode-aware library and a giant lookup table that ships with every new emoji release.
So “string length” has at least four reasonable answers: bytes (UTF-8), code units (UTF-16), code points (Unicode scalars), and grapheme clusters (what users count). Bugs love this gap. UTF-8 didn’t cause it — Unicode itself did — but UTF-8 made the gap visible at the byte layer where most engineers live.
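Here’s a sketch of two of those four answers in plain C. Counting UTF-16 code units or grapheme clusters would need a Unicode library such as ICU, so those are only noted in the comments:

```c
#include <stdio.h>
#include <string.h>

/* Count Unicode code points in a UTF-8 string by counting every byte that
   is not a continuation byte. */
static size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

int main(void) {
    const char *precomposed = "\xC3\xA9";    /* é as the single code point U+00E9 */
    const char *decomposed  = "e\xCC\x81";   /* e + U+0301 combining acute accent */

    printf("precomposed: %zu bytes, %zu code points\n",
           strlen(precomposed), utf8_codepoints(precomposed));   /* 2, 1 */
    printf("decomposed:  %zu bytes, %zu code points\n",
           strlen(decomposed),  utf8_codepoints(decomposed));    /* 3, 2 */

    /* Both render as "é" and both are exactly one grapheme cluster, which is
       the answer a user would give; none of the numbers above say so. */
    return 0;
}
```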
Famous related terms
- UTF-16 — UTF-16 = variable-width 2-or-4-byte encoding + endianness baggage — what Java, JavaScript strings, and the Windows API use internally. Lost the web because it isn’t ASCII-compatible.
- ASCII — ASCII = 7-bit fixed encoding for English + control characters — UTF-8’s superset relationship with ASCII is the entire reason it won.
- Code point — code point = a Unicode integer (0 to 0x10FFFF) — the abstract identity of a character, separate from how it’s encoded as bytes.
- Grapheme cluster — grapheme ≈ what a human would call "one character" — usually a sequence of code points. Where most “string length” bugs come from.
Going deeper
- Rob Pike and Ken Thompson’s original UTF-8 design note (the “designed on a placemat at a New Jersey diner” story is theirs; I don’t have a primary source link to drop in here, but the design itself is documented in RFC 3629).
- RFC 3629 — the IETF specification for UTF-8.
- The Unicode Standard, chapter 3 (Conformance) — heavy reading but the canonical source. I’m relying on the general shape of the spec here rather than citing specific section numbers I haven’t re-verified.