Reproducibility in AI is harder than in traditional software because the source code includes data. A fine-tune trained on dataset v1 vs v2 produces different weights; RAG retrieval against corpus v1 vs v2 produces different answers; evals run against test-set v1 vs v2 produce different scores. Without dataset versioning, you can't reliably reproduce last quarter's results. This post is the tooling and patterns we deploy.

What to version

Training data, eval sets, RAG corpora, prompts, fine-tuned models, embedding indexes. Six artifact types, each needing version control, lineage tracking.

What to version

Six artifact types in a typical AI system, each needing version control:

Training data — what was used to fine-tune or train models. DVC, LakeFS, or Git-LFS work. Snapshot before each training run.

Eval sets — test cases used for quality measurement. Git + structured files (JSONL, YAML). Treat changes like code changes, with PR reviews.

RAG corpora — the knowledge base fed to retrieval. Document content plus chunking configuration. Version both so you can reproduce exact retrieval behavior.

Prompts — see prompt version control post. Git, not spreadsheets.

Fine-tuned model artifacts — weights, configs, lineage to training data. Model registry (MLflow, Weights & Biases, internal).

Embedding indexes — snapshot the index file with the embedding model version. Reindex when the embedding model changes; preserve old indexes for rollback.

DVC — the workhorse

Data Version Control (DVC) extends Git for large files and datasets. Stores data in cloud storage (S3, GCS); stores small pointer files in Git. Team members checkout data on demand; don't bloat Git repos.

Works well for training datasets where you want strong integration with Git workflow. Less ideal for multi-gigabyte corpora where LakeFS or git-lfs-with-s3 backends are better fits.

LakeFS — at scale

Git-like branching and merging for data lakes. Useful when your data is terabytes in S3 or equivalent. Branch for experiments, merge back atomically, rollback by pointing at an earlier commit.

Overkill for smaller setups. Essential when you're managing corpora at the terabyte scale with team-wide collaboration.

Lineage tracking

The real value of dataset versioning is lineage. A production incident: quality regression on endpoint X. Investigation needs to trace: which model version (model registry), trained on which data version (DVC/LakeFS), with which prompt version (Git), against which RAG index version (index registry).

Link all these in a metadata layer — we use simple structured files (a release YAML listing versions of every artifact) that travels with each production deploy. Future you will thank present you.

Eval set hygiene

Eval sets deserve the most care. They're how you know whether quality is regressing. Critical practices:

Never test on your training set. Keep holdouts that never enter training. Refresh periodically from production traffic samples (with redaction). Curate deliberately — add cases that catch known failure modes. Review eval-set changes in PR just like code.

RAG corpus changes

Corpus updates are a common source of silent quality drift. A documentation team restructures the wiki; retrieval results shift; chat answers subtly degrade. Without corpus versioning, you can't attribute the regression.

Snapshot the corpus periodically (monthly is typical). Retain at least the last 3-6 snapshots. When a retrieval quality issue surfaces, you can diff corpora to understand what changed.

Rollback procedures

Your versioning is only useful if rollback actually works. Practice it. Roll back a model version; verify the system returns to prior behavior. Roll back a corpus version; re-run evals; confirm expected scores. Monthly rollback drill is worth the minor inconvenience of planning.

Dataset versioning for AI: DVC, LakeFS, and the patterns that scale

What to version

DVC — the workhorse

LakeFS — at scale

Lineage tracking

Eval set hygiene

RAG corpus changes

Rollback procedures

Continue the thread.

Data strategy for AI: what to fix before you buy models

Why evaluation infrastructure matters more than prompts

When to fine-tune (and when RAG is fine)

Want to talk about this?