eazyware
Engineering·September 23, 2024·10 min read

Dataset versioning for AI: DVC, LakeFS, and the patterns that scale

Training data, eval sets, RAG corpora — all need version control. The tooling landscape and the workflow patterns that hold up.

KR
Kushal R.
Engineering lead

Reproducibility in AI is harder than in traditional software because the source code includes data. A fine-tune trained on dataset v1 vs v2 produces different weights; RAG retrieval against corpus v1 vs v2 produces different answers; evals run against test-set v1 vs v2 produce different scores. Without dataset versioning, you can't reliably reproduce last quarter's results. This post is the tooling and patterns we deploy.

What to version
What to version in AI systems Training data DVC, LakeFS, Git-LFS snapshots per run Eval sets Git + structured files PR reviews for changes RAG corpora content + chunking cfg index snapshots Prompts Git (like code) semantic versioning Fine-tuned models model registry lineage to data version Embeddings indexes snapshot + model ver reindex on model swap goal: any production incident traceable to exact versions of all six artifacts
Training data, eval sets, RAG corpora, prompts, fine-tuned models, embedding indexes. Six artifact types, each needing version control, lineage tracking.

What to version

Six artifact types in a typical AI system, each needing version control:

Training data — what was used to fine-tune or train models. DVC, LakeFS, or Git-LFS work. Snapshot before each training run.

Eval sets — test cases used for quality measurement. Git + structured files (JSONL, YAML). Treat changes like code changes, with PR reviews.

RAG corpora — the knowledge base fed to retrieval. Document content plus chunking configuration. Version both so you can reproduce exact retrieval behavior.

Prompts — see prompt version control post. Git, not spreadsheets.

Fine-tuned model artifacts — weights, configs, lineage to training data. Model registry (MLflow, Weights & Biases, internal).

Embedding indexes — snapshot the index file with the embedding model version. Reindex when the embedding model changes; preserve old indexes for rollback.

DVC — the workhorse

Data Version Control (DVC) extends Git for large files and datasets. Stores data in cloud storage (S3, GCS); stores small pointer files in Git. Team members checkout data on demand; don't bloat Git repos.

Works well for training datasets where you want strong integration with Git workflow. Less ideal for multi-gigabyte corpora where LakeFS or git-lfs-with-s3 backends are better fits.

LakeFS — at scale

Git-like branching and merging for data lakes. Useful when your data is terabytes in S3 or equivalent. Branch for experiments, merge back atomically, rollback by pointing at an earlier commit.

Overkill for smaller setups. Essential when you're managing corpora at the terabyte scale with team-wide collaboration.

Lineage tracking

The real value of dataset versioning is lineage. A production incident: quality regression on endpoint X. Investigation needs to trace: which model version (model registry), trained on which data version (DVC/LakeFS), with which prompt version (Git), against which RAG index version (index registry).

Link all these in a metadata layer — we use simple structured files (a release YAML listing versions of every artifact) that travels with each production deploy. Future you will thank present you.

Eval set hygiene

Eval sets deserve the most care. They're how you know whether quality is regressing. Critical practices:

Never test on your training set. Keep holdouts that never enter training. Refresh periodically from production traffic samples (with redaction). Curate deliberately — add cases that catch known failure modes. Review eval-set changes in PR just like code.

RAG corpus changes

Corpus updates are a common source of silent quality drift. A documentation team restructures the wiki; retrieval results shift; chat answers subtly degrade. Without corpus versioning, you can't attribute the regression.

Snapshot the corpus periodically (monthly is typical). Retain at least the last 3-6 snapshots. When a retrieval quality issue surfaces, you can diff corpora to understand what changed.

Rollback procedures

Your versioning is only useful if rollback actually works. Practice it. Roll back a model version; verify the system returns to prior behavior. Roll back a corpus version; re-run evals; confirm expected scores. Monthly rollback drill is worth the minor inconvenience of planning.

Read next
Data strategy for AI: what to fix before you buy models
Read next
Why evaluation infrastructure matters more than prompts
Read next
When to fine-tune (and when RAG is fine)
Tags
data versioningDVCLakeFSreproducibility
/ Next step

Want to talk about this?

We love debating this stuff. 30-minute call, no pitch, just engineering conversation.

~4h
avg response
Q2 '26
next slot
100%
NDA on request