Model versioning is where a lot of AI systems quietly break. Providers ship silent updates to 'latest' aliases. A team pins a specific version, gets busy, and misses the deprecation warning. A rollback happens but the fallback model was never kept in sync. This post is the four strategies we use for managing model versions in production, with the specific gotchas for each.

Four strategies

Pinning is the default safety pattern. Alias-based deployment enables instant rollback. Shadow testing validates before cutover. Gradual rollout is the high-stakes migration path.

Pinning — the default

Never use 'latest' or 'default' in production. Always pin a specific model version in config. gpt-4o-2024-11-20, not gpt-4o. claude-sonnet-4-5-20250929, not claude-sonnet-latest. Upgrade is an explicit code change subject to your normal review process.

This sounds obvious. Teams still get it wrong because providers' default aliases 'just work' in development. The alias silently rolls over to a new model and your production behavior changes without a deployment. Debugging a quality regression that has no corresponding git commit is painful.

Alias-based deployment

One level of indirection: application code references an internal alias (prod-chat-model-v2); alias resolves to a specific pinned version. Alias flips require no code deploy.

Benefit: instant rollback. When a model upgrade causes issues, flip the alias back. Seconds, not minutes. For critical systems where rollback speed matters (financial services, high-traffic consumer products), this is worth the infrastructure.

Implementation: model registry (MLflow, internal DB) stores alias to version mappings. Gateway or middleware resolves the alias at request time. Registry access is privileged — not everyone can flip aliases.

Shadow testing

Before switching to a new model, run it in parallel on live traffic for N days. Compare outputs. Measure quality deltas. See shadow testing post for implementation details.

Pairs naturally with evals. Shadow mode tests behavior on real traffic distribution; evals test behavior on curated cases. Both should pass before a production switch. See eval infrastructure post.

Gradual rollout

For high-stakes migrations: 1% to 10% to 50% to 100% traffic over days or weeks. Metrics watched at each gate. Auto-rollback on regression. See canary deployments post.

When to use: switching to a meaningfully different model (4o to Claude, for instance), not when upgrading within a model family. The pattern overhead is too much for minor version bumps; save it for upgrades where regression risk is real.

Handling provider deprecations

Providers deprecate model versions. You get 3-12 months of notice, then the version stops working. Track deprecation timelines; plan migrations with buffer.

Your registry should warn on pinned versions approaching deprecation. This prevents the 'oh no, gpt-4-turbo-preview was turned off and we're paged at 2am' scenario.

Multi-model systems

If you route to multiple models (see multi-model routing), every one needs version management. A degradation in one model's quality should route traffic away while you debug, not silently fail.

Health checks per model: periodic evals against representative tasks; alert on regressions. Automated routing adjustments based on health. This is especially important when a model provider is having issues — you route around them automatically while they recover.

What to track per version

Model identifier, date introduced to system, date promoted to production, associated prompt version, eval scores at promotion, any known issues or caveats, deprecation date if known. Lineage from model version to prompt version to data version matters for reproducibility. See dataset versioning post.

Model versioning strategies for production AI

Pinning — the default

Alias-based deployment

Shadow testing

Gradual rollout

Handling provider deprecations

Multi-model systems

What to track per version

Continue the thread.

Canary deployments for AI: the rollout pattern that saves weekends

A/B testing LLM features: the pitfalls that invalidate results

Why evaluation infrastructure matters more than prompts

Want to talk about this?