Apache Parquet became the default storage format for analytics because it is columnar and compressed, which lets query engines read only the columns a query touches and skip the rest. That single design choice is why Parquet now sits underneath nearly every modern analytics tool, from cloud warehouses to local-first engines, and why the Parquet-vs-CSV comparison almost always lands in Parquet's favor for analytical work.

The format you use without noticing

If you have run an analytical query in the last few years, you have almost certainly read Parquet, whether you knew it or not. It is the on-disk format for data lakes, the export target for warehouse unloads, the interchange format between Python and SQL engines, and the thing your local analytics tool quietly writes when it caches a result set.

Parquet is so common that it has become invisible infrastructure. Nobody markets "now with Parquet support" anymore, the way nobody markets "now with TCP/IP." It is just assumed. That invisibility is worth pausing on, because it is rare for a file format to win this decisively. CSV, JSON, and a handful of proprietary formats all competed for the analytics storage slot. Parquet won, and the reasons are entirely about how analytical queries actually behave.

This piece is a plain-English tour of why. It is deliberately distinct from our DuckDB writing: the rise of in-process analytics with DuckDB is about the engine, while this piece is about the file format that engine reads. The two reinforce each other, but they are different layers of the stack.

Row stores vs columnar stores

Start with how data is laid out on disk, because that is the whole story.

A row-oriented format like CSV stores one record at a time, field after field, row after row. To read the third column of every row, you still have to walk past columns one and two on every line. The layout is optimized for "give me this entire record," which is exactly what a transactional system wants when it loads a single customer or order.

A columnar format like Parquet flips that around. It stores all the values for one column together, then all the values for the next column, and so on. To read the third column of every row, you jump straight to that column's block and read it contiguously. Everything else stays on disk, untouched.

Analytics queries almost never want whole records. They want "the sum of amount, grouped by month, for the last year." That query touches maybe three columns out of forty. With a row store, you pay to read all forty. With a columnar store, you read three. As tables get wider and queries stay narrow, the gap between those two behaviors widens into the difference between a fast dashboard and a slow one.
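To make the "forty versus three" arithmetic concrete, here is a minimal sketch in plain Python. It is a toy model, not how Parquet or any engine is implemented: the table shape (1,000 rows by 40 columns) and the function names are assumptions made up for illustration, and "values touched" stands in for bytes read from disk.

```python
# Toy model: the same table stored row-wise and column-wise.
# Counts how many values each layout must touch to sum one column.

NUM_ROWS = 1_000
NUM_COLS = 40  # a "wide" table; the query only cares about one column

# Row layout: a list of records, each holding all 40 fields.
rows = [[r * NUM_COLS + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

# Column layout: one contiguous list per column.
columns = [[rows[r][c] for r in range(NUM_ROWS)] for c in range(NUM_COLS)]


def sum_from_rows(rows, col_index):
    """Sum one column from row-oriented data, counting values walked past."""
    total = 0
    touched = 0
    for record in rows:
        # A text row format such as CSV must scan the whole line
        # to find the field and then the start of the next record.
        touched += len(record)
        total += record[col_index]
    return total, touched


def sum_from_columns(columns, col_index):
    """Sum one column from column-oriented data, reading only that column."""
    column = columns[col_index]
    return sum(column), len(column)


row_total, row_touched = sum_from_rows(rows, 2)
col_total, col_touched = sum_from_columns(columns, 2)

print(row_total == col_total)    # True: same answer either way
print(row_touched, col_touched)  # 40000 1000
```

Both layouts give the same sum, but the row layout touches forty times as many values to get there. That ratio is simply the table width, which is why the gap grows as tables get wider.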

Why compression and column pruning matter

Columnar layout unlocks two compounding advantages: compression and column pruning.

Compression gets dramatically better when similar values sit next to each other. A column of order statuses is mostly the same handful of strings repeated thousands of times. A column of dates marches in near-sorted order. When values are this homogeneous and adjacent, compression algorithms shrink them aggressively, often by an order of magnitude or more. Parquet layers on encodings like dictionary encoding and run-length encoding that are tailored to exactly this situation. A row store cannot compress nearly as well, because each row mixes a date, a string, a float, and an integer together, and that variety defeats the compressor.
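The two encodings named above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Parquet's actual on-disk encoding (which combines run-length encoding with bit-packing and works on binary pages). The status column and the helper names are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


def decode(dictionary, runs):
    """Reverse both steps to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Hypothetical status column: few distinct values, long runs.
statuses = ["shipped"] * 5000 + ["pending"] * 3000 + ["returned"] * 2000

dictionary, codes = dictionary_encode(statuses)
runs = run_length_encode(codes)

print(dictionary)  # ['shipped', 'pending', 'returned']
print(runs)        # [[0, 5000], [1, 3000], [2, 2000]]
print(decode(dictionary, runs) == statuses)  # True
```

Ten thousand strings collapse to three dictionary entries and three runs. The same trick does little for a row store, because adjacent values in a row belong to different columns and rarely repeat.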

Column pruning, sometimes called projection pushdown, means the engine reads only the columns a query references. Parquet files also typically carry row-group statistics that record the min and max of each column chunk, so the engine can skip entire blocks that cannot possibly match a filter. If a row group's max date is earlier than the earliest date your WHERE clause accepts, the engine never reads it. This block skipping is one form of predicate pushdown, and it turns a full scan into a targeted one.
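The min/max skipping logic is simple enough to sketch in plain Python. This is a toy model under stated assumptions: the RowGroup class, the group size of 100 rows, and the date range are all hypothetical, and real engines read these statistics from the Parquet file's metadata rather than computing them at query time.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RowGroup:
    """A block of rows plus min/max statistics for one column."""
    order_dates: list
    min_date: date
    max_date: date


def build_row_groups(dates, group_size):
    """Split a column into fixed-size groups and record each group's min and max."""
    groups = []
    for start in range(0, len(dates), group_size):
        chunk = dates[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def count_on_or_after(groups, cutoff):
    """Count rows with order_date >= cutoff, skipping groups the stats rule out."""
    matched = 0
    groups_read = 0
    for group in groups:
        if group.max_date < cutoff:
            continue  # no row in this group can match, so never read it
        groups_read += 1
        matched += sum(1 for d in group.order_dates if d >= cutoff)
    return matched, groups_read


# 1,000 consecutive days starting 2020-01-01, 100 rows per group: 10 groups.
dates = [date(2020, 1, 1) + timedelta(days=i) for i in range(1000)]
groups = build_row_groups(dates, 100)

cutoff = date(2020, 1, 1) + timedelta(days=850)
matched, groups_read = count_on_or_after(groups, cutoff)
print(matched, groups_read, len(groups))  # 150 2 10
```

Only two of the ten groups are read. The skipping works this well because the dates are sorted, so each group covers a narrow range; on randomly ordered data every group's range would overlap the filter and far fewer groups could be skipped.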

Put the two together and a typical analytical query reads a small fraction of the bytes a CSV would force it to. Reduced I/O is usually the largest saving in analytical scans, ahead of CPU and ahead of memory. That is why the Parquet-vs-CSV question has such a lopsided answer for analytical scans.

Where Parquet fits in a local-first workflow

For years the columnar advantage was associated with big, expensive cloud warehouses. That association is now outdated. The same properties that make Parquet great for a data lake make it great on a laptop.

A local-first workflow reads Parquet files directly, often without any import step at all. You point a tool at a file or a directory of files and query it in place. Because Parquet carries its own schema and statistics, the engine knows the column types and value ranges before it reads a single data block. Because the files are compressed, a dataset that would be gigabytes as CSV fits comfortably in a directory you can commit, copy, or sync.

This is the practical foundation under the current wave of local-first versus cloud BI tooling. You do not need a cluster to get columnar performance. You need Parquet files and an engine that reads them well. That combination is why a single machine can now handle analytical workloads that used to demand distributed infrastructure, a theme we explored from the engine side in choosing a real-time analytics database.

Parquet also pairs naturally with vectorized engines like DuckDB. Vectorized execution processes data in batches of column values at a time, using the CPU's wide instructions efficiently. A columnar file is already laid out in exactly the shape a vectorized engine wants to consume, so there is little translation cost between reading and processing. Format and engine are designed for the same world.

The tradeoffs to know

Parquet is not a universal answer, and pretending otherwise would be dishonest. It is built for one job and is the wrong tool for others.

It is not for transactional row updates. Parquet is an immutable, write-once format. There is no efficient way to update a single field in a single row in place, because the value is buried inside a compressed column block. Systems that need frequent small updates, such as an order status flipping or an inventory count decrementing, want a row-oriented transactional store, not Parquet. The format assumes you write data in bulk and then read it many times.

Tiny datasets see little benefit. If your data is a few hundred rows, the columnar machinery adds overhead without a payoff you would notice. CSV is fine, and arguably friendlier, at trivial scale.

It is a binary format. You cannot open a Parquet file in a text editor and eyeball it the way you can a CSV. You need a tool. In exchange you get a self-describing file with embedded schema and statistics, which is usually the better trade for analytics, but it is a trade.

The honest summary: Parquet is purpose-built for analytical scans over columns of data that are written in bulk and read often. That describes the vast majority of BI and analytics work, which is exactly why it won the slot. It does not describe transactional systems, and it should not be forced into them.

How Visivo uses Parquet under the hood

This is where the format stops being abstract. Visivo runs a local-first workflow: you define models, metrics, dimensions, relations, and insights as code, and the Explorer lets you build and preview against real data fast. Keeping that loop fast at scale is a storage problem, and the answer is Parquet.

In Visivo v1.0.79 we moved the Explorer's data tables onto Parquet-backed storage for exactly the reasons above. When you preview a model or render an insight result, the underlying table is columnar, so the Explorer reads only the columns the current view needs and benefits from Parquet's compression as result sets grow. A preview that would have stalled on a wide, hundred-thousand-row result stays responsive, because the work scales with the columns the view reads rather than the full width of the table.

You never configure any of this. You write your project, run visivo serve, and the tool uses the right format underneath. That is the point of good infrastructure: the format that makes everything fast is the one you never have to think about. If you want to see the workflow it enables, the examples gallery and the get started guide are the fastest way in.

Previously in Visivo

This post is the conceptual companion to our last release recap, Visivo v1.0.79: Parquet-backed tables and a smarter Explorer. That recap covers the concrete changes: Parquet-backed data tables, the query-profile panel, background schema jobs, Snowflake schema parsing, and full object editability. Read it for the what, and read this for the why behind the storage choice underneath it.

Why Parquet Quietly Became the Default for Analytics Data

Columnar, compressed, and ubiquitous: a plain-English look at why Apache Parquet is the storage format under nearly every modern analytics tool.