Git for Data: What MatrixOne Gives Up to Diff a Table in 3 Seconds


Every “git for data” pitch opens with the same scene. An engineer files a pull request on a dataset. A teammate reviews the rows. They comment, they argue, they approve, they merge. It is a clean story. It is the story Dolt tells, the story lakeFS tells, the story Project Nessie tells, and now the story Version Control System for Data with MatrixOne (Gou et al., MatrixOrigin, arXiv, April 2026) tells. I really love the concept in principle, but I always struggle with the utility and use cases. Every time I ask around, I find few people who are using it in production, but maybe I’m not looking in the right places. Anyway, code review of notebooks, sure. Review of migration scripts, all the time. Review of the rows themselves, with humans looking at a diff of records before merge? Unlikely.

That’s not a knock on the paper. It is the question I want to hold in my head while reading it: who is this actually for, and what is being given up to make it fast enough that they might one day show up?

The MatrixOne paper is another take on a thesis I’ve been kicking around. Git-for-data sits on a tradeoff triangle of semantic correctness, speed, and scale. You pick two. The “Discussion” section of this paper is the price list for picking speed and scale.

How I think about data versioning

Most schema migrations are reversible. Some are not. Rails ships an entire exception class for this, IrreversibleMigration, and the reason it has that name is that reversibility is the meaningful axis, not “schema change yes/no.” Adding a column with a default? Reversible. Renaming a column? Manageable. Dropping a column with no backup of the dropped values, or changing the values in a column? All bets are off. I see this in my work. We promise in our API documentation that we’ll give developers notice when there are “breaking” changes. In practice, the only breaking changes happen when you remove something or change the meaning of a field.

The interesting case for any data versioning system is the migration that mutates the semantic meaning of metadata. The bytes might be the same. The meaning is not. A Merkle tree over rows does not save you when the column score used to mean “log-odds” and now means “calibrated probability.” That class of change is where versioning systems either earn their keep or fall apart, and so far the answer for every system I’ve looked at is: fall apart.

So when I read a paper that promises git-like workflows over terabyte-scale data, the first thing I want to know is what the system does when the schema changes. Not whether it claims to handle it. What it actually does. In MatrixOne’s case, the answer is in §5.5.6 and it is honest. We will get to it.

What the paper does

MatrixOne is an HTAP database where storage is immutable column-store objects on object storage (think S3), organized as an LSM tree keyed by primary key. MVCC is PostgreSQL-style. A snapshot, in MatrixOne’s vocabulary, is just a list of which objects exist at a moment in time. Because the objects are immutable, a snapshot is metadata, not data. That part is unsurprising and reads like Iceberg, Snowflake zero-copy clone, Delta clones, and every other metadata-only branching system from the last few years.

The paper’s contribution sits one layer up. The authors define what clone, diff, and merge mean over snapshots of an LSM table, and then they implement those operations in time proportional to the change set rather than the table.

The mechanism is what they call Δ-scan with diff aggregation. The paper puts it like this:

The diff between two snapshots is the set difference of the objects added on each side; deletions are modeled as additions of tombstone objects. Scanning Δ_sn2 ∪ Δ_sn3 (and only that) is sufficient to compute the multiset diff.

In other words, if you cloned a 600M-row table and modified 1M rows, the diff machinery looks at the handful of new objects holding those 1M rows. It does not scan 600M. The SQL baseline they compare to does scan 600M, twice, which is part of why the headline numbers look the way they do. We’ll get to that too.
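To make that concrete, here is a minimal Python sketch of the idea. It is not MatrixOne’s implementation. It assumes a snapshot is a set of object ids and that each immutable object is a list of signed rows, with -1 standing in for a tombstone. The names net_changes, delta_scan_diff, and store are mine, not the paper’s.

```python
from collections import Counter

# Hypothetical model, not MatrixOne's storage format:
#   snapshot = frozenset of object ids
#   store    = dict mapping object id -> list of (sign, row)
#   sign     = +1 for an inserted row, -1 for a tombstone (a delete)

def net_changes(object_ids, store):
    """Aggregate signed row counts over the given objects only."""
    net = Counter()
    for oid in sorted(object_ids):
        for sign, row in store[oid]:
            net[row] += sign
    return net

def delta_scan_diff(sn2, sn3, store):
    """Return (signed multiset diff, ids of the objects actually scanned).

    Objects shared by both snapshots (the common base) cancel out,
    so they are never read.
    """
    delta2 = sn2 - sn3  # objects only the first snapshot has
    delta3 = sn3 - sn2  # objects only the second snapshot has
    net2 = net_changes(delta2, store)
    net3 = net_changes(delta3, store)
    diff = {}
    for row in net2.keys() | net3.keys():
        d = net2[row] - net3[row]
        if d != 0:
            diff[row] = d  # positive: extra on sn2's side, negative: on sn3's
    return diff, delta2 | delta3

store = {
    "base-0": [(+1, (1, "a")), (+1, (2, "b")), (+1, (3, "c"))],
    "t-1": [(-1, (2, "b")), (+1, (2, "B"))],  # table updates row 2
    "c-1": [(+1, (4, "d"))],                  # clone inserts row 4
}
sn1 = frozenset({"base-0"})
sn2 = sn1 | {"t-1"}
sn3 = sn1 | {"c-1"}
diff, scanned = delta_scan_diff(sn2, sn3, store)
print(diff, sorted(scanned))
```

The point of the toy is the second return value: base-0 is in both snapshots and never appears in the scanned set, no matter how many rows it holds.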

On top of Δ-scan they bolt a conflict taxonomy. For tables with a primary key, they enumerate six cases distinguishing true conflicts (both sides modified the same row relative to the common base) from false conflicts (only one side did, auto-resolved). For tables without a primary key, they use multiset cardinalities and signed deltas (δ_T = N_sn2 − N_sn1, δ_TClone = N_sn3 − N_sn1) to decide whether both branches actually moved the value-set count for a given row shape. There is also a “row movement” carve-out so that LSM compaction physically moving a row between snapshots does not get classified as a modification. That detail is real correctness work, and it is the kind of thing papers in this space usually skip.
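The no-primary-key bookkeeping is easier to follow with numbers in hand. The sketch below computes the two signed deltas exactly as written above and labels which side moved. It counts whole snapshots for clarity, where the real system would only look at the changed objects. The labels are my own, and what MatrixOne does with a “both moved” row is not something this sketch claims to know.

```python
from collections import Counter

def signed_deltas(sn1_rows, sn2_rows, sn3_rows):
    """Per row value: (delta_T, delta_TClone, which side moved).

    delta_T      = N_sn2 - N_sn1
    delta_TClone = N_sn3 - N_sn1
    Rows whose count moved on neither side are omitted.
    """
    n1, n2, n3 = Counter(sn1_rows), Counter(sn2_rows), Counter(sn3_rows)
    out = {}
    for row in n1.keys() | n2.keys() | n3.keys():
        d_t = n2[row] - n1[row]
        d_clone = n3[row] - n1[row]
        if d_t == 0 and d_clone == 0:
            continue
        if d_t != 0 and d_clone != 0:
            kind = "both moved"        # candidate for a true conflict
        elif d_t != 0:
            kind = "only T moved"
        else:
            kind = "only clone moved"
        out[row] = (d_t, d_clone, kind)
    return out

base = [("x", 1), ("x", 1), ("y", 2)]
table = [("x", 1), ("y", 2)]                               # T deleted one x
clone = [("x", 1), ("x", 1), ("x", 1), ("y", 2), ("z", 3)]  # clone added x, z
for row, info in sorted(signed_deltas(base, table, clone).items()):
    print(row, info)
```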

The available conflict modes are SKIP (keep target), ACCEPT (keep source), and FAIL (abort). No cell-level resolution. No “take A’s columns, take B’s columns.” Row-level only. The paper is upfront about it:

Currently the system supports row-level conflict resolution only. Cell-level conflict resolution would require materializing per-column deltas and joining against the base. We may consider relaxing this rule in the future.

So far this all sounds like a tidy, well-engineered piece of work. It is. The paper then walks through clone (copy the metadata directory), restore (point the table back at an earlier directory), three-way merge (the diff aggregation over Δ_sn2 and Δ_sn3 with the common base T_sn1 in hand), and two-way merge (three-way with an empty base, which means move detection fails and every physical move becomes a true conflict).
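A rough sketch of that row-level three-way logic, with the three conflict modes, looks like this. It is a generic three-way classification keyed by primary key, not the paper’s exact six cases, and it walks every key for readability where the real system would walk only the deltas. The names Mode, MergeConflict, and three_way_merge are hypothetical.

```python
from enum import Enum

class Mode(Enum):
    SKIP = "skip"      # on a true conflict, keep the target row
    ACCEPT = "accept"  # on a true conflict, keep the source row
    FAIL = "fail"      # on a true conflict, abort the merge

class MergeConflict(Exception):
    pass

def three_way_merge(base, target, source, mode):
    """Merge source into target; each argument maps primary key -> whole row.

    A missing key means the row does not exist in that snapshot.
    Returns (merged table, list of keys that were true conflicts).
    """
    merged = dict(target)
    conflicts = []
    for pk in sorted(base.keys() | target.keys() | source.keys()):
        b, t, s = base.get(pk), target.get(pk), source.get(pk)
        if s == t or s == b:
            continue  # source brings nothing new for this key
        if t == b:
            choice = s  # only source changed: auto-resolved
        else:
            conflicts.append(pk)  # both sides changed, and differently
            if mode is Mode.FAIL:
                raise MergeConflict(f"true conflict on primary key {pk!r}")
            choice = s if mode is Mode.ACCEPT else t
        if choice is None:
            merged.pop(pk, None)  # the chosen side deleted the row
        else:
            merged[pk] = choice
    return merged, conflicts

base = {1: ("a", 10), 2: ("b", 20), 3: ("c", 30)}
target = {1: ("a", 11), 2: ("b", 20), 3: ("c", 31)}
source = {1: ("a", 10), 2: ("b", 22), 3: ("c", 33), 4: ("d", 40)}
print(three_way_merge(base, target, source, Mode.SKIP))
print(three_way_merge({}, target, source, Mode.SKIP))  # two-way: empty base
```

Passing an empty base shows why two-way merge is the worse deal: with nothing to compare against, every key that exists on both sides with different values becomes a true conflict.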

And then there is §5.5, the Discussion. The list of things the system does not do.

What MatrixOne gives up to be fast

This is the load-bearing section, both for the paper and for my read of it. Every item below is a deliberate engineering choice that buys speed by narrowing the semantic surface.

  • No schema-change-aware diff or merge (§5.5.6). If the schema of the clone differs from the parent in any way after the clone is created, diff and merge “cannot run.” The paper’s guidance is to make schema changes before cloning. That is the inverse of how software branching works.
  • Secondary indices are not cloned (§5.5.4). Clone copies the directory metadata for the primary LSM tree. Index tables, which are themselves LSMs, are not copied. The clone is a correct table that has lost its query plans until the indices are rebuilt.
  • Row-level conflicts only (§5.5.3). Two engineers editing different non-PK columns of the same row produce a conflict, even though the edits are semantically disjoint.
  • Datalink content is not tracked (§5.5.5). LOB values are diffed via SHA256. Datalink values (URLs pointing at external resources, like images or model artifacts on S3) are diffed only on the URL string. If the bytes behind the URL drift, MatrixOne is blind to it.

Read those four together. What MatrixOne sacrifices is precisely the semantic surface that would force per-row, per-cell, or cross-schema work. The Δ-scan is fast because the algorithm is allowed to assume identical schema, identical PK definition, no cell-level merge, and a common physical base. The §5.5 list is not a list of bugs to be fixed later. It is the precondition for the speed.
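The row-level bullet is the easiest of the four to show in code. Below, two edits touch different columns of the same row. A row-level check, which is what the paper describes, flags a conflict. A cell-level merge, which MatrixOne does not do and which I am sketching purely for contrast, would resolve it cleanly. Both function names and the example row are hypothetical.

```python
def row_level_conflict(base, target, source):
    """True when both sides changed the row and ended up different."""
    return target != base and source != base and target != source

def cell_level_merge(base, target, source):
    """Column-by-column three-way merge of one row (dicts with the same keys).

    Returns (merged row, list of columns both sides changed differently).
    On a per-column conflict the target value is kept.
    """
    merged = {}
    conflicts = []
    for col in base:
        b, t, s = base[col], target[col], source[col]
        if t == s or s == b:
            merged[col] = t   # source has nothing new for this column
        elif t == b:
            merged[col] = s   # only source changed this column
        else:
            merged[col] = t
            conflicts.append(col)
    return merged, conflicts

base = {"id": 7, "score": 0.2, "label": "cat"}
target = {"id": 7, "score": 0.9, "label": "cat"}  # one engineer edits score
source = {"id": 7, "score": 0.2, "label": "dog"}  # another edits label

print(row_level_conflict(base, target, source))
print(cell_level_merge(base, target, source))
```

The second function is also a reminder of the cost: it needs the base row in hand, column by column, which is the per-column materialization the paper says it is avoiding.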

The §5.5.6 schema-change limit deserves a second beat. The abstract of this paper sells the system on AI/ML feature engineering, where “data engineers can adopt established software engineering workflows: creating branches for isolated experimentation.” But the most common reason an ML engineer wants a branch is to try a new feature column. To widen a type. To rename label to label_v1. MatrixOne says: do that first, then branch. Which is a fine engineering rule, but it is not the workflow the abstract is pitching. The paper does not connect this dot, so I will: the abstract’s headline use case is in direct tension with the system’s §5.5.6 carve-out.

There is a sharper version of this point. Plenty of databases and managed providers now ship “branching”: an ephemeral copy of production an engineer can test against and then throw away. That is genuinely useful. But look at what the workflow actually needs. It needs a cheap clone. It does not need MERGE. A feature is produced by code. When the experiment on the branch works, you change the code that generates the feature and let production rebuild it from there. You do not hand-merge the rows back. The branch was scaffolding. To me, the heart of the git-for-data pitch, merging data into production, feels like the least needed feature. Now, maybe if you take the concept of production out of it and you’re trying to compare two different versions of the feature you’re building and want to see what changed in the rows themselves, that might be useful, but that’s a lot of git-for-data scaffolding to build to get that functionality. Seems like a bit of overkill.

What the numbers actually show

The test is simple to describe. Take a big table, around 600 million rows. Copy it. Change some rows on the copy, anywhere from a thousand of them up to a million. Then measure how long it takes to find the difference between the two versions and to merge the changes back. The benchmark compares MatrixOne’s built-in commands against doing the same job by hand in ordinary SQL.

Copying the table is basically free. About a fifth of a second. Every system in this space can do that, so it settles nothing on its own.

The result that matters is diff and merge. When the table has a primary key, MatrixOne is genuinely fast. Finding the difference after a million changed rows takes about 3 seconds. The plain-SQL version takes about 7 minutes, roughly 130 times slower. Merging is 16 seconds against about 8 minutes. The gap holds up across repeated runs.

Why the gap is that wide is the whole point. Plain SQL scans the entire 600-million-row table every time, so it takes the same 7 minutes whether you changed ten rows or a million. MatrixOne only looks at the rows that actually changed. Its time grows with the size of your change, not the size of your table. You can watch this happen: when four engineers merge their work back one after another, MatrixOne’s merge time climbs as the pile of accumulated changes grows, while the SQL version sits flat at 7-ish minutes the whole way. That climbing curve is the algorithm doing exactly what it promises.
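A toy cost model shows the shape of both curves. It counts rows scanned as a stand-in for time, which is my simplification and not a measurement from the paper. It also assumes all branches were cloned from the same base, each changes a disjoint set of rows, and they merge back one after another. All names here are hypothetical.

```python
TABLE_ROWS = 600_000_000

def full_scan_rows(table_rows):
    """The SQL baseline reads both versions of the table in full."""
    return 2 * table_rows

def delta_scan_rows(target_changed, source_changed):
    """A delta scan reads only what each side changed since the base."""
    return target_changed + source_changed

def sequential_merges(num_branches, rows_per_branch):
    """Rows scanned by each merge when branches merge back in order."""
    costs = []
    target_changed = 0  # changes the target has accumulated since the base
    for _ in range(num_branches):
        costs.append(delta_scan_rows(target_changed, rows_per_branch))
        target_changed += rows_per_branch  # assumes no overlapping rows
    return costs

print(sequential_merges(4, 1_000_000))
print([full_scan_rows(TABLE_ROWS)] * 4)
```

The first list climbs by one branch’s worth of rows per merge. The second is flat, and three orders of magnitude larger.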

Then the catch. All of that is the primary-key case. Take the primary key away and the picture gets worse. The speedup drops from over a hundred times to roughly six or seven. And the numbers stop behaving: in one run, changing 10,000 rows took longer than changing 100,000. More work, less time. The authors blame the query planner picking different strategies at different sizes, which is a polite way of saying performance is unpredictable in that range. They report a single run per test with no averaging, so there is no way to tell how much of this is just noise.

One last thing worth knowing. The SQL baseline is not quite a fair fight. The authors admit, in passing, that for the tests without a primary key they quietly let the SQL version use primary-key knowledge it was not supposed to have. That makes the baseline faster than what you would really have to write, so the true speedup is probably a little better than six or seven times. But it also means the whole comparison is against something no engineer would actually deploy.

Honest summary: with a primary key and a large batch of changes, MatrixOne is somewhere around a hundred times faster than doing the job in SQL. Without a primary key, it is closer to six or seven times, and the numbers wobble. That is still a good result. It is just not the “100 to 500x” the abstract advertises.

Where this sits next to Dolt, Iceberg, Nessie, lakeFS

I find the tradeoff triangle clearer when each system is placed on it explicitly.

  • Dolt. Cell-level three-way merge of SQL data with full schema awareness. The only system in this comparison that genuinely understands two engineers writing to disjoint columns of the same row. Pays for that correctness in raw query throughput (about 4.7x slower than MySQL on average as of late 2025). In July 2025 they shipped a Prolly-tree-based fast merge that hits ~1000x speedups on a 5M-row test, but the fast path is gated behind the same kind of carve-outs MatrixOne has: no secondary indexes, no check constraints, no schema change. Correctness + scale, paying in speed on the general case.
  • Iceberg branches and tags. A branch is a named reference to a snapshot. Branching is free. There is no native merge in the git sense. fast_forward is a pointer move. When branches actually diverge, you fall back to MERGE INTO SQL against specific snapshots, which puts you back at full-scan economics. Speed + scale. Correctness is the user’s problem.
  • Project Nessie. Catalog-level git over Iceberg. Cross-table commits, real branches, a MERGE command. Conflict detection is at the table pointer level, not the row level. If two branches both touched a table, Nessie either rejects, forces, or punts to the table format underneath. Speed + scale. Same fundamental compromise as Iceberg.
  • lakeFS. Git-like branching at the file level over object storage. Three-way merge with a real common ancestor, but conflict resolution is whole-file and applied globally as source-wins or dest-wins. Their own docs are direct about this: “currently it is not possible to treat conflicts individually.” Speed + scale. Sells correctness for format-agnostic generality.
  • MatrixOne. PK-aware row-level diff and merge inside a SQL surface. Δ-scan over an LSM of immutable objects gives diff in seconds at 100GB. Concedes cell-level merge, schema-change handling, indexed clones, and external artifact content. Speed + scale, with semantic correctness scoped via §5.5. Closer to the lakehouse camp than to Dolt, with two genuine advantages over Iceberg/Nessie/lakeFS: PK-aware row granularity, and the whole thing happens in the same database that runs your analytics.

Nobody is on all three corners. That is the triangle. It is real.

On Dolt and convergent evolution

The Related Work section of the paper cites Git, git-lfs, ZFS, VMware VMDK, Snowflake, and Supabase. It does not cite Dolt. Which is awkward, because Dolt shipped cell-level three-way merge of SQL data in 2020, and then in July 2025, nine months before this paper, shipped a Prolly-tree fast merge that is structurally the same insight MatrixOne is selling: exploit immutable, ordered storage so that diff is bounded by changes, not table size.

I want to be careful about how I frame this. The temptation is to call MatrixOne unoriginal. I don’t think that’s right. Two teams converging on the same algorithmic shape within a year of each other, independently, against the same problem shape (immutable ordered storage, snapshot semantics, branch/merge surface), is much better read as evidence the shape is correct. This is convergent evolution. Crabs keep evolving because the body plan works. Δ-scan over immutable LSM keeps getting reinvented because that is what diff over content-addressed ordered storage looks like.

But the algorithmic story is stronger once you consider Dolt and MatrixOne together, not weaker.

My take

The Δ-scan with diff aggregation algorithm is the right answer for this problem shape. The PK numbers are real. The collaborative-workload curve is the algorithm’s signature and it shows up clean across the experiments. The conflict taxonomy and move detection are genuine correctness work that the metadata-only branching systems (Iceberg, Nessie, lakeFS) do not even attempt.

The abstract sells git-like workflows for AI/ML feature engineering, which, for me, means schema changes, disjoint-column edits, and external artifacts. The system does none of those things well, and the paper is honest enough to say so in §5.5 but does not connect the §5.5 list back to the abstract’s framing. If your workload is “I have a 600M-row table with a stable schema and a real primary key, and I want to branch it, run a CI pipeline that mutates 1M rows, validate, and merge back atomically,” MatrixOne is fast, credible, and roughly the right tool. If your workload is “I want to add a feature column, fill it on a branch, and merge back,” MatrixOne tells you to plan the schema first, and at that point you’re not really branching, you’re scheduling.

The other thing I keep coming back to is the PR-review-on-data observation. The entire category sells the workflow, and I have not seen anyone actually run it. Engineers are willing to review notebooks. They are willing to review SQL. They are not, in my experience, willing to sit down and look at 1M-row diffs of records and approve them. The tooling is racing ahead of the social practice, and until the social practice exists, “diff and merge in 3 seconds instead of 7 minutes” is speeding up a workflow nobody runs.

That is not a reason to dismiss the work. Tools sometimes have to exist before the practice catches up. These tools are obviously onto something, because their popularity seems to be growing, at least by rough heuristics like the activity on their GitHub repos and the size of their associated communities. The Δ-scan algorithm will outlast whatever survives of the “git for data” pitch, and the next time somebody builds a CI/CD-for-data system for git-like data operations, it will look something like what MatrixOne and Dolt are converging on. I just don’t want to nod along while the tradeoff triangle gets papered over by abstract-tier language.

What I’d want next

A head-to-head between MatrixOne’s Δ-scan and Dolt’s Prolly fast-merge on the same workload would settle the algorithmic question. Dolt’s 5M-row benchmark and MatrixOne’s 600M-row benchmark are not directly comparable on shape or scale.

I’d also want the missing measurements: variance across trials, the cost of move detection when compaction has actually run between snapshots, storage growth on long-lived named branches (which pin objects from compaction), and any number at all on the cloud-native CN/TN/LogService path the architecture section sells. The 100GB single-server eval is fine as a unit test. It does not back the “terabyte-scale” claim, but how many people and companies are really dealing with datasets of that size? I feel like it’s rare.

And cell-level merge. That is the next visible gap, and both Dolt’s Prolly fast-merge constraints and MatrixOne’s §5.5.3 land on the same future-work bullet. Whoever ships it first changes the triangle’s shape.

Until then: pick two. The paper is honest about which two if you read §5.5 before the abstract.