Inside the Delta Lake Transaction Log

Every Delta Lake table is two things at once: a directory of Parquet data files, and a transaction log that describes what those files mean. The Parquet files are dumb storage. All the intelligence — schema, history, ACID guarantees — lives in the log.

Understanding the log is understanding Delta. Most docs explain what Delta does. This explains how.

The `_delta_log/` directory

When you create a Delta table, Delta writes a _delta_log/ directory alongside your data:

my_table/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   ├── ...
│   └── 00000000000000000010.checkpoint.parquet
├── part-00000-a1b2c3.snappy.parquet
├── part-00000-d4e5f6.snappy.parquet
└── ...

Each .json file is one commit — a complete, atomic record of one write operation. The files are named with 20-digit zero-padded version numbers. Version 0 is the table creation. Version 1 is the first write. And so on, monotonically, forever.
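To make the naming scheme concrete, here is a minimal Python sketch of mapping between version numbers and commit file names. The helper names commit_file_name and parse_commit_version are made up for illustration and are not part of any Delta API.

```python
def commit_file_name(version: int) -> str:
    """Return the log file name for a commit version (20 digits, zero-padded)."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"


def parse_commit_version(file_name: str) -> int:
    """Recover the version number from a commit file name."""
    stem, _, ext = file_name.partition(".")
    if ext != "json" or len(stem) != 20 or not stem.isdigit():
        raise ValueError(f"not a commit file: {file_name}")
    return int(stem)


print(commit_file_name(2))  # 00000000000000000002.json
```

A side effect of the zero padding is that sorting the file names as strings gives the same order as sorting by version number.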

The checkpoint file is a compacted snapshot — more on that shortly.

Anatomy of a commit file

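A commit file is newline-delimited JSON: each line is one action, and a commit is a sequence of actions applied atomically. (The examples below are pretty-printed for readability; in the log, each action occupies a single line.) The main action types: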

add — a new Parquet file is part of the table:

{
  "add": {
    "path": "part-00000-a1b2c3.snappy.parquet",
    "partitionValues": {"date": "2024-01-15"},
    "size": 102400,
    "modificationTime": 1705276800000,
    "dataChange": true,
    "stats": "{\"numRecords\":50000,\"minValues\":{\"id\":1},\"maxValues\":{\"id\":50000}}"
  }
}

remove — a file is no longer part of the table (it still exists on disk until VACUUM):

{
  "remove": {
    "path": "part-00000-old.snappy.parquet",
    "deletionTimestamp": 1705276800000,
    "dataChange": true
  }
}

metaData — schema or configuration change:

{
  "metaData": {
    "id": "3f7a2b91-...",
    "schemaString": "{\"type\":\"struct\",\"fields\":[...]}",
    "partitionColumns": ["date"],
    "configuration": {"delta.autoOptimize.optimizeWrite": "true"}
  }
}

protocol — minimum reader/writer versions required to interact with this table. When Delta enables a new feature (deletion vectors, column mapping), it bumps the protocol version so older readers fail fast rather than silently misread data.

A full INSERT into a partitioned table produces one commit with several add actions — one per output file. An UPDATE that rewrites three files produces three remove actions and three add actions in the same commit. All-or-nothing.
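As a rough illustration of that structure, the sketch below parses the text of a commit file into (action type, payload) pairs using only the standard library. The sample commit is hypothetical, and the code assumes every line is a JSON object with exactly one top-level key naming the action.

```python
import json
from collections import Counter


def parse_commit(text: str) -> list[tuple[str, dict]]:
    """Split a commit file's contents into (action_type, payload) pairs."""
    actions = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumption: one top-level key per line, naming the action type.
        (action_type, payload), = record.items()
        actions.append((action_type, payload))
    return actions


# Hypothetical commit for an UPDATE that rewrites a single file.
sample_commit = "\n".join([
    json.dumps({"remove": {"path": "part-00000-old.snappy.parquet",
                           "deletionTimestamp": 1705276800000,
                           "dataChange": True}}),
    json.dumps({"add": {"path": "part-00000-a1b2c3.snappy.parquet",
                        "partitionValues": {"date": "2024-01-15"},
                        "size": 102400,
                        "dataChange": True}}),
])

actions = parse_commit(sample_commit)
counts = Counter(kind for kind, _ in actions)
print(dict(counts))  # {'remove': 1, 'add': 1}
```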

How reads reconstruct table state

When you run SELECT * FROM my_table, Delta doesn’t just scan Parquet files — it first reconstructs the current table snapshot. The algorithm:

Find the latest checkpoint file (if any)
Load that checkpoint as the base state
Replay all JSON commits after the checkpoint, in order
The result is the set of currently-active add actions = the set of files to scan

This is why Delta reads have a small overhead even before touching data: the log replay. For a table with a recent checkpoint and a handful of commits since, this is fast. For a table with thousands of commits and no checkpoint, it’s a full log scan — which is why checkpointing matters.
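The replay step can be sketched in a few lines of Python. This is a simplified model, not Delta's implementation: it assumes the checkpoint has already been loaded into a dict of active files keyed by path, and that each commit is a list of already-parsed actions. The function name replay and the file names are illustrative.

```python
def replay(base_files: dict[str, dict], commits: list[list[dict]]) -> dict[str, dict]:
    """Apply commits in order to a base state; return active files keyed by path.

    base_files: active add payloads from the latest checkpoint (empty if none).
    commits: one list of actions per commit, oldest first, each action shaped
             like {"add": {...}} or {"remove": {...}}.
    """
    active = dict(base_files)
    for commit in commits:
        for action in commit:
            if "add" in action:
                active[action["add"]["path"]] = action["add"]
            elif "remove" in action:
                active.pop(action["remove"]["path"], None)
            # metaData, protocol, and other actions do not change the file set.
    return active


checkpoint_state = {
    "a.parquet": {"path": "a.parquet"},
    "b.parquet": {"path": "b.parquet"},
}
later_commits = [
    [{"add": {"path": "c.parquet"}}],
    [{"remove": {"path": "a.parquet"}}, {"add": {"path": "d.parquet"}}],
]
snapshot = replay(checkpoint_state, later_commits)
files_to_scan = sorted(snapshot)
print(files_to_scan)  # ['b.parquet', 'c.parquet', 'd.parquet']
```

The cost of this loop grows with the number of commits since the checkpoint, which is the overhead described above.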

The stats field on each add action is where data skipping comes from. Delta stores per-file min/max values and null counts. When your query has a filter like WHERE id > 40000, Delta can rule out files whose maxValues.id <= 40000 without opening them. This is entirely driven by log metadata: no index files, no separate statistics store.
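Here is a minimal sketch of that pruning decision for a filter of the form id > threshold. The function name and the sample files are hypothetical. The one detail taken from the log format is that stats is itself a JSON-encoded string inside the add action.

```python
import json


def may_contain_id_greater_than(add_action: dict, threshold: int) -> bool:
    """Return False only when stats prove no row in the file has id > threshold."""
    raw_stats = add_action.get("stats")
    if raw_stats is None:
        return True  # no stats recorded: the file cannot be skipped safely
    stats = json.loads(raw_stats)  # stats is a JSON string nested in the action
    max_id = stats.get("maxValues", {}).get("id")
    if max_id is None:
        return True
    return max_id > threshold


# Hypothetical add actions with per-file id ranges.
candidates = [
    {"path": "f1.parquet", "stats": json.dumps({"minValues": {"id": 1}, "maxValues": {"id": 20000}})},
    {"path": "f2.parquet", "stats": json.dumps({"minValues": {"id": 20001}, "maxValues": {"id": 40000}})},
    {"path": "f3.parquet", "stats": json.dumps({"minValues": {"id": 40001}, "maxValues": {"id": 50000}})},
    {"path": "f4.parquet"},  # no stats, so it must be scanned
]
to_scan = [f["path"] for f in candidates if may_contain_id_greater_than(f, 40000)]
print(to_scan)  # ['f3.parquet', 'f4.parquet']
```

Note that the check is conservative: a file is skipped only when the stats prove it cannot match, and a file without stats is always scanned.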

Checkpoint files

Every 10 commits by default (configurable via delta.checkpointInterval), Delta writes a checkpoint. A checkpoint is a Parquet file that encodes the same state you would get by replaying all JSON commits from version 0: the currently active add actions, the remove tombstones that have not yet expired, and the current metadata and protocol.

A checkpoint for version 10 means: to read the current state of the table, load 00000000000000000010.checkpoint.parquet, then replay commits 11, 12, 13… You never need to go back further.

Delta also writes a _last_checkpoint file:

{"version": 10, "size": 5}

This tells readers where to start — no need to list the entire _delta_log/ directory to find the latest checkpoint.
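Putting the checkpoint and the _last_checkpoint pointer together, a reader's plan might look like the sketch below. It assumes a single-file checkpoint and takes the latest commit version as an input (a real reader would discover it by listing files from the checkpoint version onward). The function name plan_log_read is hypothetical.

```python
import json
from typing import Optional


def plan_log_read(last_checkpoint: Optional[str], latest_version: int):
    """Return (checkpoint_file_or_None, commit_files_to_replay).

    last_checkpoint: contents of _last_checkpoint, or None if it does not exist.
    latest_version: highest commit version present in the log (assumed known).
    """
    if last_checkpoint is None:
        checkpoint_file, first_commit = None, 0
    else:
        version = json.loads(last_checkpoint)["version"]
        checkpoint_file = f"{version:020d}.checkpoint.parquet"
        first_commit = version + 1
    commits = [f"{v:020d}.json" for v in range(first_commit, latest_version + 1)]
    return checkpoint_file, commits


checkpoint, commits = plan_log_read('{"version": 10, "size": 5}', latest_version=12)
print(checkpoint)  # 00000000000000000010.checkpoint.parquet
print(commits)     # ['00000000000000000011.json', '00000000000000000012.json']
```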

For large tables with millions of files, a single checkpoint Parquet can itself become large. The Delta protocol also supports multi-part checkpoints: the checkpoint is split into multiple Parquet files that can be read in parallel.

Optimistic concurrency and conflict detection

Delta uses optimistic concurrency: multiple writers proceed in parallel and only conflict at commit time. The protocol:

Writer reads the current table version (say, version 5)
Writer performs its work, prepares a set of add/remove actions
Writer attempts to write version 6 (the next JSON file)
If someone else already wrote version 6, the writer re-reads version 6, checks for conflicts, and retries at version 7

Conflict detection isn’t “did anyone else write?” but “did anyone else write in a way that invalidates my assumptions?” Delta tracks which files a transaction read and checks whether any of those files were modified by intervening commits. An INSERT that only appends files (a blind append) doesn’t conflict with a concurrent INSERT, since neither one read anything the other changed. Two concurrent UPDATEs that rewrite the same files do conflict.
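The commit loop can be sketched on a local filesystem, where creating a file with O_EXCL plays the role of the atomic "write version N only if it does not exist" step. This is a simplified model under stated assumptions: the function names are made up, the conflict check only looks at removed files, and a production writer would also make the file contents appear atomically (for example by writing to a temporary file and renaming without overwrite).

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list[dict]) -> bool:
    """Create the commit file for `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if another writer got there first.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions))
    return True


def commit_with_retry(log_dir, read_version, actions, files_read, max_attempts=5):
    """Try read_version + 1; after a lost race, check the winner, then retry."""
    version = read_version + 1
    for _ in range(max_attempts):
        if try_commit(log_dir, version, actions):
            return version
        # Someone else wrote this version: did they remove a file we read?
        with open(os.path.join(log_dir, f"{version:020d}.json")) as f:
            winner = [json.loads(line) for line in f if line.strip()]
        removed = {a["remove"]["path"] for a in winner if "remove" in a}
        if removed & set(files_read):
            raise RuntimeError(f"conflict with version {version}")
        version += 1
    raise RuntimeError("too many retries")


log_dir = tempfile.mkdtemp()
# Another writer wins the race for version 6.
try_commit(log_dir, 6, [{"add": {"path": "x.parquet"}}])
# Our writer read version 5 and only appends, so it retries and lands at 7.
landed = commit_with_retry(log_dir, 5, [{"add": {"path": "y.parquet"}}], files_read=[])
print(landed)  # 7
```

The append lands because it read no files. A writer that had read a file the winning commit removed would get the RuntimeError instead of a retry.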

This is what makes concurrent writes to the same Delta table possible without a distributed lock manager — but it’s also why high-concurrency, overlapping writes produce retries, and why very high write concurrency eventually needs careful partitioning strategy.

Time travel and `VACUUM`

Because remove actions only mark files as logically deleted (they never delete the Parquet file itself), old table versions remain readable. SELECT * FROM my_table VERSION AS OF 3 replays the log up to version 3 and reads those files. This is time travel — zero copy, no separate snapshot storage.

VACUUM is what actually reclaims disk space. It:

Scans the log to find all files referenced in the current snapshot and any version within the retention window (default: 7 days)
Lists all physical Parquet files in the table directory
Deletes any physical files not referenced by any retained version
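As a simplified model of those three steps, the sketch below decides which physical files are safe to delete. All names are hypothetical. It assumes removed files are judged by their deletionTimestamp and unreferenced files by their modification time, so files from an in-progress write are not deleted too early; the real command's bookkeeping is more involved.

```python
DAY_MS = 86_400_000


def files_to_vacuum(physical_files: dict[str, int], active_files: set[str],
                    removed_at: dict[str, int], now_ms: int,
                    retention_ms: int = 7 * DAY_MS) -> set[str]:
    """Return paths that are safe to delete.

    physical_files: path -> modification time (ms) for files on disk.
    active_files: paths with a live add action in the current snapshot.
    removed_at: path -> deletionTimestamp (ms) from remove actions.
    """
    cutoff = now_ms - retention_ms
    deletable = set()
    for path, mtime_ms in physical_files.items():
        if path in active_files:
            continue  # still part of the table
        # Removed files use their tombstone time; unreferenced files use mtime.
        reference_ts = removed_at.get(path, mtime_ms)
        if reference_ts < cutoff:
            deletable.add(path)
    return deletable


on_disk = {
    "live.parquet": 1 * DAY_MS,
    "old_removed.parquet": 1 * DAY_MS,
    "recent_removed.parquet": 1 * DAY_MS,
    "orphan_old.parquet": 2 * DAY_MS,
    "orphan_new.parquet": 9 * DAY_MS,
}
tombstones = {"old_removed.parquet": 2 * DAY_MS, "recent_removed.parquet": 8 * DAY_MS}
doomed = files_to_vacuum(on_disk, {"live.parquet"}, tombstones, now_ms=10 * DAY_MS)
print(sorted(doomed))  # ['old_removed.parquet', 'orphan_old.parquet']
```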

The critical implication: VACUUM with the default retention window means you can time-travel up to 7 days back. Lowering the retention window frees disk space faster but shortens how far back you can query. Setting it to zero and running VACUUM is irreversible — those old snapshots are gone.

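One subtlety: VACUUM doesn’t touch the _delta_log/ directory. Old JSON commits are cleaned up by a separate mechanism, log retention: when Delta writes a checkpoint, it deletes log entries older than delta.logRetentionDuration (default: 30 days). Time travel to a version needs both its log entries and its data files, so the shorter of the two retention windows sets the practical limit.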

What this means in practice

A few things that fall out of understanding the log:

Small file compaction is a log operation. OPTIMIZE rewrites N small Parquet files into M larger ones, with M smaller than N. From the log’s perspective, it’s just a commit with N remove actions and M add actions, all marked dataChange: false because no rows are logically changed. The old small files still exist on disk until VACUUM.

Schema evolution is a metadata action. Adding a nullable column is a single metaData action in the log, zero bytes of Parquet rewritten. Renaming a column (with column mapping enabled) is similarly log-only — the Parquet files stay untouched, and Delta uses the column mapping in the metadata to reconcile names.

The log is the source of truth, not the filesystem. If a Parquet file exists in the table directory but has no add action in the log, Delta ignores it. This is how partial write failures are handled — the failed writer never committed its JSON file, so the orphaned Parquet files are invisible to readers and get cleaned up by VACUUM.