Why columnar storage won for analytics
Row stores read every column to answer a one-column question. Columnar stores refuse — and that refusal is what makes Parquet, ClickHouse, DuckDB, and every modern data warehouse fast.
Why it exists
Imagine a table of a billion clickstream events with 40 columns: timestamp, user_id, latency_ms, country, browser, url, referrer, and 33 others. You want one number out of it — median latency last week. That answer lives in two columns: timestamp and latency_ms.
A traditional row-oriented database stores each row contiguously: all 40 fields of event 1, then all 40 fields of event 2, and so on. To compute the median latency, the engine has to read past 38 columns it doesn’t care about for every single row, because rows are the unit on disk. You pay for the data you didn’t ask for, and on a billion rows that bill is the whole query.
Columnar storage is what you get when you stop accepting that bill. Rotate the layout 90°: store all the timestamps together, then all the latencies, then all the user_ids. The query reads exactly the two columns it needs. The other 38 never leave disk.
This isn’t a new idea — the C-Store paper (Stonebraker et al., 2005) and MonetDB before it laid the academic groundwork, and the design appears in proprietary systems earlier than that. What changed is that the format went mainstream: Apache Parquet, Apache ORC, and Apache Arrow turned columnar into the default shape of analytical data, and engines like ClickHouse, DuckDB, BigQuery, Snowflake, and Redshift all bet the farm on it.
Why it matters now
If you work anywhere near data, columnar is already under your feet:
- Data lakes are Parquet. Object-storage buckets full of Parquet files are how most companies hold their analytical data in 2026. Spark, Trino, Athena, DuckDB — they all read Parquet natively, and the orchestration layers above them (dbt, Airflow) expect that shape.
- In-memory analytics is Arrow. Apache Arrow is a columnar in-memory format designed so different tools can pass tables around without serialization cost. Pandas 2.x can sit on top of it. Polars uses Arrow heavily for interchange (though its internal engine is its own).
- AI training data is increasingly columnar. Large training pipelines often store filtered, deduped text and metadata in Parquet because shuffling and filtering by column is what the loader actually wants. Embeddings tables — billions of vectors plus metadata — frequently end up as Parquet-on-object-storage; I don’t have hard market-share numbers but the shape is now common enough to be unsurprising.
- Vector databases borrow the trick. Several modern vector databases store the embedding column separately from the metadata columns for the same reason warehouses do: filters and projections touch different columns, and you don’t want to pay for both on every read. (Implementations vary — this isn’t universal.)
You don’t choose columnar by typing it. You choose it by picking a tool whose authors already did.
The short answer
columnar storage = transpose the table on disk + compress each column + skip the columns you don't need
Three wins stack. Layout: each column is a contiguous run, so a one-column query reads one column’s worth of bytes. Compression: values in a column are the same type and often similar, so they compress 5–20x where mixed rows would barely compress at all. Skipping: with per-block min/max statistics, the engine can prove an entire chunk of a column is irrelevant and never decompress it.
Row stores aren’t wrong; they’re just answering a different question — give me this whole record — that columnar stores are bad at.
How it works
Three things make it work, and they reinforce each other.
1. The transpose
A row layout for a tiny table:
row 1: 2026-04-29 | alice | US | 142
row 2: 2026-04-29 | bob | DE | 88
row 3: 2026-04-30 | carol | US | 201
The same data, columnar:
timestamp: 2026-04-29 | 2026-04-29 | 2026-04-30
user: alice | bob | carol
country: US | DE | US
latency_ms: 142 | 88 | 201
Now SELECT avg(latency_ms) FROM events reads exactly one column. The bytes you don’t read are the bytes you don’t pay for — in disk I/O, in memory bandwidth, and (on a cloud warehouse) in literal dollars.
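Here is what that projection looks like from Python, as a minimal sketch with pyarrow (the library choice, file name, and exact values are illustrative, not something the format requires):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Build the tiny table from the example and write it as a columnar file.
events = pa.table({
    "timestamp":  ["2026-04-29", "2026-04-29", "2026-04-30"],
    "user":       ["alice", "bob", "carol"],
    "country":    ["US", "DE", "US"],
    "latency_ms": [142, 88, 201],
})
pq.write_table(events, "events.parquet")

# Columnar read: only the latency_ms column is fetched and decoded.
latencies = pq.read_table("events.parquet", columns=["latency_ms"])
print(pc.mean(latencies["latency_ms"]))  # about 143.67 (average of 142, 88, 201)
```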
The catch: INSERT of one row touches every column. That’s why columnar formats are usually written in batches (Parquet groups rows into “row groups” sized in bytes — the spec recommends roughly 512 MB to 1 GB; concrete row counts depend on the writer and the schema) and why most columnar systems are bad at single-row updates. Updates are typically expressed as appends plus a periodic rewrite, which is the same bargain LSM trees make for a different reason.
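In practice that batching is explicit in the writer API. A minimal sketch with pyarrow's ParquetWriter (file name, schema, and batch size are all illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("timestamp", pa.string()), ("latency_ms", pa.int64())])

# Buffer events in memory, then flush whole batches; each write becomes
# one or more row groups on disk, never a per-row append to every column.
with pq.ParquetWriter("events_batched.parquet", schema) as writer:
    batch = pa.table({
        "timestamp":  ["2026-04-29"] * 100_000,
        "latency_ms": list(range(100_000)),
    }, schema=schema)
    writer.write_table(batch)
```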
2. The compression
A column is type-homogeneous and often value-homogeneous. That makes it dramatically more compressible than a row.
Common encodings columnar formats stack:
- Dictionary encoding. A column of country codes with cardinality 200 doesn’t store the strings — it stores integers 0–199 plus a small dictionary. "United States" once, 73 a billion times.
- Run-length encoding. A column where values repeat (sorted timestamps, sparse flags, low-cardinality enums) is stored as (value, count) pairs. A million identical values become one pair.
- Bit-packing. If a column’s values fit in 9 bits, store them in 9 bits, not 32 or 64.
- Delta encoding. Sorted timestamps become a starting value plus tiny differences, which then compress further.
- General-purpose codecs. Snappy, LZ4, Zstd on top of all of the above.
Real-world Parquet files routinely compress 5–20x versus the equivalent CSV — I’m giving a range here because the actual ratio is brutally dependent on the data, and any single number would be a lie. The point is that compression on a column is qualitatively better than compression on a row, because the entropy is lower when types and values are clustered.
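A toy sketch of two of those encodings in plain Python, just to show the shape of the idea (real formats apply them per page, bit-pack the integer codes, and then run a general-purpose codec over the result):

```python
def dictionary_encode(values):
    """Replace repeated strings with small integer codes plus a dictionary."""
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def run_length_encode(values):
    """Collapse runs of identical values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

countries = ["US", "US", "US", "DE", "DE", "US"]
print(dictionary_encode(countries))   # (['US', 'DE'], [0, 0, 0, 1, 1, 0])
print(run_length_encode(countries))   # [['US', 3], ['DE', 2], ['US', 1]]
```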
3. The skipping (this is the underrated one)
A Parquet file is divided into row groups, and each row group carries per-column statistics: min, max, null count, and optionally a Bloom filter and finer-grained page indexes. When you ask WHERE timestamp >= '2026-04-22', the reader checks each row group’s timestamp.max. If it’s less than 2026-04-22, the entire row group — every column, not just timestamp — is skipped without reading or decompressing.
This is predicate pushdown + min/max pruning, and it’s where a lot of the real-world speed comes from. A query that selectively touches a small slice of the data can physically read a small slice of the data, instead of streaming the whole table and filtering — assuming the layout cooperates. ClickHouse builds on this idea even more aggressively with sparse primary indexes and “skip indexes.” DuckDB, BigQuery, Snowflake — different implementations, same shape.
The corollary: columnar is only fast if your data is laid out to be skippable. A Parquet file where timestamps are scrambled across row groups can’t be pruned by timestamp. That’s why warehouses care so much about clustering / partitioning / sort keys — those are the knobs that decide whether min/max pruning actually works.
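You can watch the pruning decision from outside the engine by reading the row-group statistics yourself. A sketch with pyarrow's metadata API (the file name and predicate are the illustrative ones from above; real readers also consult page indexes and Bloom filters when present):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
ts_index = pf.schema_arrow.get_field_index("timestamp")

# WHERE timestamp >= '2026-04-22': a row group whose max is below the cutoff
# can be skipped entirely, every column, without decompressing anything.
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(ts_index).statistics
    if stats is not None and stats.max < "2026-04-22":
        print(f"row group {i}: prunable")
    else:
        print(f"row group {i}: must be read")
```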
Show the seams
A few things the marketing slides skip.
- Single-row reads are slow. SELECT * FROM events WHERE id = 42 on a columnar store has to seek into every column file. Row stores destroy columnar at this. If your workload is “fetch this user’s profile,” columnar is the wrong tool.
- Updates are awkward. In-place update of a single row means rewriting every column’s encoded block. Most columnar systems implement updates via appends + periodic rewrites + tombstones, like an LSM tree — which is why “lakehouse” formats like Delta Lake, Iceberg, and Hudi exist on top of Parquet.
- The CPU effect is bigger than people expect. Reading a contiguous run of one type lets the CPU vectorize — process 4 or 8 values per instruction with SIMD. Modern columnar engines (DuckDB, ClickHouse, Velox) lean on this hard; see the sketch after this list. Row stores struggle to do the same, because the next field on the cache line is usually a different type.
- Schema evolution is real. Adding a column to a Parquet dataset is cheap (new files have it, old files don’t, the reader handles missing columns). Renaming or retyping is not. Iceberg/Delta exist partly to make this less painful.
- Write amplification is the cost. To get the compression and pruning wins, you batch. To batch, you accept latency between write and visibility, plus background work to compact small files. There is no free lunch — you traded random-write friendliness for read efficiency.
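A rough way to feel that CPU effect, sketched with NumPy as a stand-in for a vectorized engine (the data is synthetic and the timings will vary by machine, but the gap between the two passes is the point):

```python
import time
import numpy as np

n = 2_000_000
rows = [("2026-04-29", "US", i % 500) for i in range(n)]          # row-shaped data
latency_col = np.fromiter((r[2] for r in rows), dtype=np.int64)   # column-shaped data

t0 = time.perf_counter()
total_rows = sum(r[2] for r in rows)     # walk every row, pluck one field each time
t1 = time.perf_counter()
total_col = int(latency_col.sum())       # one contiguous, SIMD-friendly pass
t2 = time.perf_counter()

assert total_rows == total_col
print(f"row-at-a-time sum: {t1 - t0:.3f}s")
print(f"columnar sum:      {t2 - t1:.4f}s")
```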
Famous related terms
- Apache Parquet — Parquet ≈ columnar file format + row groups + per-column stats + nested types — the de facto on-disk format for analytical data.
- Apache Arrow — Arrow = columnar in-memory format + zero-copy interop — the in-RAM counterpart; lets Pandas, Polars, DuckDB, and Spark hand each other tables without serializing.
- Apache ORC — ORC ≈ Parquet's older cousin from the Hive lineage — same shape, different file format; widely used in Hadoop-era stacks.
- Predicate pushdown — pushdown = move the filter down to the storage layer — what lets min/max pruning skip entire row groups instead of reading them.
- Vectorized execution — vectorized = process columns in batches with SIMD — the CPU-side payoff that makes columnar engines feel fast even on already-cached data.
- Lakehouse formats (Iceberg / Delta / Hudi) — lakehouse table = immutable data files + a transactional metadata layer — bolt-on ACID, schema evolution, and time travel for columnar files (typically Parquet) on object storage.
Going deeper
- Stonebraker et al., C-Store: A Column-oriented DBMS, 2005 — the canonical academic argument for column stores. Daniel Abadi’s follow-up papers on column-store performance are the standard references for why the wins compound.
- The Parquet format documentation on the Apache Parquet site — file structure, encodings, statistics. Drier than the topic deserves.
- Mark Raasveldt and Hannes Mühleisen’s DuckDB papers — the modern story of how an in-process columnar engine actually executes queries, vectorization included.
- Andy Pavlo’s CMU database lectures (15-721, advanced database systems) — multiple lectures on storage models, compression, and vectorization, freely available online.