Model versioning is where a lot of AI systems quietly break. Providers ship silent updates to 'latest' aliases. A team pins a specific version, gets busy, and misses the deprecation warning. A rollback happens but the fallback model was never kept in sync. This post is the four strategies we use for managing model versions in production, with the specific gotchas for each.
Pinning — the default
Never use 'latest' or 'default' in production. Always pin a specific model version in config. gpt-4o-2024-11-20, not gpt-4o. claude-sonnet-4-5-20250929, not claude-sonnet-latest. Upgrade is an explicit code change subject to your normal review process.
This sounds obvious. Teams still get it wrong because providers' default aliases 'just work' in development. The alias silently rolls over to a new model and your production behavior changes without a deployment. Debugging a quality regression that has no corresponding git commit is painful.
Alias-based deployment
One level of indirection: application code references an internal alias (prod-chat-model-v2); alias resolves to a specific pinned version. Alias flips require no code deploy.
Benefit: instant rollback. When a model upgrade causes issues, flip the alias back. Seconds, not minutes. For critical systems where rollback speed matters (financial services, high-traffic consumer products), this is worth the infrastructure.
Implementation: model registry (MLflow, internal DB) stores alias to version mappings. Gateway or middleware resolves the alias at request time. Registry access is privileged — not everyone can flip aliases.
Shadow testing
Before switching to a new model, run it in parallel on live traffic for N days. Compare outputs. Measure quality deltas. See shadow testing post for implementation details.
Pairs naturally with evals. Shadow mode tests behavior on real traffic distribution; evals test behavior on curated cases. Both should pass before a production switch. See eval infrastructure post.
Gradual rollout
For high-stakes migrations: 1% to 10% to 50% to 100% traffic over days or weeks. Metrics watched at each gate. Auto-rollback on regression. See canary deployments post.
When to use: switching to a meaningfully different model (4o to Claude, for instance), not when upgrading within a model family. The pattern overhead is too much for minor version bumps; save it for upgrades where regression risk is real.
Handling provider deprecations
Providers deprecate model versions. You get 3-12 months of notice, then the version stops working. Track deprecation timelines; plan migrations with buffer.
Your registry should warn on pinned versions approaching deprecation. This prevents the 'oh no, gpt-4-turbo-preview was turned off and we're paged at 2am' scenario.
Multi-model systems
If you route to multiple models (see multi-model routing), every one needs version management. A degradation in one model's quality should route traffic away while you debug, not silently fail.
Health checks per model: periodic evals against representative tasks; alert on regressions. Automated routing adjustments based on health. This is especially important when a model provider is having issues — you route around them automatically while they recover.
What to track per version
Model identifier, date introduced to system, date promoted to production, associated prompt version, eval scores at promotion, any known issues or caveats, deprecation date if known. Lineage from model version to prompt version to data version matters for reproducibility. See dataset versioning post.