Prompts belong in Git. Not spreadsheets, not Notion pages, not hosted prompt tools that add a separate workflow nobody follows. This is a position I've defended across many client engagements, and every team that has made the switch ends up thanking me for the push. The arguments are practical: diff review, CI integration, rollback, audit — all things Git does well and that purpose-built prompt tools do badly.

Git workflow for prompts

Prompt PR → CI eval gate (auto-block on regression) → merge and deploy (tagged, canary rollout). Rollback is git revert.

Why spreadsheets fail at scale

At 5 prompts, a spreadsheet works. At 50, it doesn't. Changes get made without review. Nobody knows the history. A prompt that regressed a week ago is hard to identify because change attribution is weak. Backup and restore are manual.

Production AI systems often have hundreds of prompts — system prompts, tool descriptions, example sets, safety filters, fine-tune training templates. Managing all of them outside version control is begging for incidents.

Why hosted prompt tools often fail

Hosted prompt management tools (LangSmith, PromptLayer, others) have clean UIs. They work well for small teams or prompt engineers working in isolation. They struggle at scale in multi-team organizations.

The issues: a hosted tool is a separate workflow that engineers have to remember to use. It is not integrated with CI or with code review. Secrets and access control diverge from the rest of the codebase. When something goes wrong in production, the prompt's state lives somewhere other than the code deploy, which adds one more variable to debugging.

Hybrid use is fine — use the hosted tool for prototyping and experimentation, then migrate shipped prompts to Git. Don't leave production prompts in the hosted tool.

The Git workflow

Prompts live in the same repo as application code, organized by feature area, at paths like src/features/chat/prompts/system.md or src/features/extraction/prompts/invoice-extraction-v3.yaml. Keep them text-based so diffs stay readable.

Changes come as PRs. Same review process as code. A prompt change requires a reviewer — ideally someone who understands both the AI system and the domain the prompt targets.

CI runs evals on the PR. A prompt change that drops the eval pass rate by more than a set threshold blocks the merge. See eval infrastructure post. The CI gate is what makes this workflow safer than freeform editing.
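A minimal sketch of such a gate, assuming the eval harness reports passed and total counts for both the main-branch baseline and the PR candidate. The names EvalSummary, gate, and exit_code, and the 2-point default threshold, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Aggregate result of one eval run (hypothetical shape)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Treat an empty eval run as 0% rather than dividing by zero.
        return self.passed / self.total if self.total else 0.0


def gate(baseline: EvalSummary, candidate: EvalSummary, max_drop: float = 0.02) -> bool:
    """Return True if the candidate prompt may merge.

    The merge is blocked when the pass rate falls by more than max_drop
    relative to the baseline. Improvements always pass.
    """
    drop = baseline.pass_rate - candidate.pass_rate
    return drop <= max_drop


def exit_code(baseline: EvalSummary, candidate: EvalSummary, max_drop: float = 0.02) -> int:
    """Map the gate decision to a process exit code for a CI step."""
    return 0 if gate(baseline, candidate, max_drop) else 1


# Hypothetical numbers: main passes 94/100, the PR passes 90/100.
main_run = EvalSummary(passed=94, total=100)
pr_run = EvalSummary(passed=90, total=100)
print(exit_code(main_run, pr_run))  # a CI step would pass this to sys.exit
```

A CI job would pass the returned code to sys.exit so that a nonzero value fails the check. The right threshold depends on how noisy your eval suite is; a tight threshold on a flaky suite blocks harmless changes.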

Merge triggers deploy. The new prompt goes through your normal deploy pipeline (canary, gradual rollout). See canary deployments post.

Rollback is git revert: one commit restores a known-good prompt, and it ships through the same pipeline as any other change. The most underrated benefit.

File formats and structure

Markdown for prompt templates with long text. YAML for structured prompts with multiple fields (system, user, examples). JSON for tool definitions where strictness matters.
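One way the application side might load these three formats is to dispatch on file extension. This is a sketch, not a prescribed loader: parse_prompt and load_prompt are hypothetical names, and the YAML branch assumes the third-party PyYAML package is installed.

```python
import json
from pathlib import Path


def parse_prompt(filename: str, text: str):
    """Parse prompt file contents according to the file's extension.

    Markdown is returned as plain text, JSON and YAML as parsed objects.
    """
    suffix = Path(filename).suffix.lower()
    if suffix == ".md":
        return text
    if suffix == ".json":
        # Strict parsing: malformed tool definitions raise json.JSONDecodeError.
        return json.loads(text)
    if suffix in (".yaml", ".yml"):
        import yaml  # PyYAML, a third-party dependency (assumed installed)
        return yaml.safe_load(text)
    raise ValueError(f"unsupported prompt format: {filename}")


def load_prompt(path: Path):
    """Read a prompt file from disk and parse it."""
    return parse_prompt(path.name, path.read_text(encoding="utf-8"))
```

Failing loudly on an unknown extension keeps stray files from being picked up as prompts by accident.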

Keep prompts close to the code that uses them. src/features/chat/prompts/ rather than a root-level /prompts directory. Locality makes changes easier to review.

Version the prompts that major stakeholders depend on, with the version in the filename: invoice-extraction-v3.yaml. When a backward-incompatible change is needed, a new version file lets old and new coexist until the old is retired.
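To see which versions currently coexist, a small helper can group files by prompt name. The sketch below assumes a name-vN.ext filename convention like the example above; the regex and the function name versions_by_prompt are illustrative, not an established scheme.

```python
import re
from collections import defaultdict

# Assumed convention: <name>-v<major>.<ext>, e.g. invoice-extraction-v3.yaml
VERSIONED = re.compile(r"^(?P<name>.+)-v(?P<major>\d+)\.(?:ya?ml|md|json)$")


def versions_by_prompt(filenames):
    """Map each versioned prompt name to a sorted list of its major versions.

    Files that do not follow the naming convention are ignored.
    """
    found = defaultdict(list)
    for filename in filenames:
        match = VERSIONED.match(filename)
        if match:
            found[match["name"]].append(int(match["major"]))
    return {name: sorted(majors) for name, majors in found.items()}


files = ["invoice-extraction-v2.yaml", "invoice-extraction-v3.yaml", "system.md"]
print(versions_by_prompt(files))
```

Any prompt with more than one entry in its list still has an old version waiting to be retired. Versions are compared as integers, so v10 sorts after v9.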

Branch-based experiments

Feature branches get their own prompts. Developers can experiment with prompt variations in a branch without affecting main. CI runs evals on the branch prompts; deploys go to staging environments for integration testing.

When the branch merges, its prompts become part of production. This is identical to how code branches work — Git-versioned prompts integrate naturally.

Multi-environment prompts

Sometimes production and staging need different prompts (different models, different data sources). Handle this with environment-specific overrides in config: the base prompt lives in Git, and each environment's config layers its overrides on top.
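The layering can be sketched as a recursive dictionary merge. The field names (model, system, params) and the model names below are made up for illustration; the real shape depends on how your prompts are structured.

```python
from copy import deepcopy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Layer environment overrides onto a base prompt config.

    Nested dicts are merged key by key; any other value replaces the
    base value. Neither input is mutated.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base prompt config, as it might be parsed from a file in Git.
base = {
    "model": "prod-model",
    "system": "You extract invoice fields.",
    "params": {"temperature": 0.2, "max_tokens": 512},
}
# Hypothetical staging overrides: only the keys that differ.
staging = {"model": "staging-model", "params": {"max_tokens": 256}}

print(apply_overrides(base, staging))
```

Because the override file lists only the keys that differ, its size is a direct measure of how far staging has drifted from production.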

Keep overrides minimal. If prompts diverge significantly between environments, you're not actually testing what you're shipping. Test with production-like prompts in staging.

Metrics on the prompt registry

Useful metrics: total prompts (watching for sprawl), prompt age (long-static prompts may be stale), PR throughput (how active is prompt development), eval pass rates over time (quality trend).
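A rough sketch of computing three of these from per-prompt records. It assumes you have already collected each prompt's last change date (for example from Git history) and its latest eval pass rate. PromptRecord, registry_metrics, and the 180-day staleness cutoff are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PromptRecord:
    """One prompt file with its last change date and latest eval pass rate."""
    path: str
    last_changed: date
    eval_pass_rate: float


def registry_metrics(records, today: date, stale_after_days: int = 180) -> dict:
    """Summarize prompt count, stale prompts, and mean eval pass rate."""
    stale = [
        record.path
        for record in records
        if (today - record.last_changed).days > stale_after_days
    ]
    mean_pass_rate = (
        sum(record.eval_pass_rate for record in records) / len(records)
        if records
        else 0.0
    )
    return {
        "total_prompts": len(records),
        "stale_prompts": stale,
        "mean_pass_rate": mean_pass_rate,
    }
```

Tracking the output per commit or per week gives the trend lines; a single snapshot says little on its own.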

These metrics guide engineering investment. Rising prompt count with flat quality scores signals consolidation opportunity. Low PR throughput on critical prompts might mean nobody's maintaining them.

