The teams that ship reliable AI to production share specific cultural habits, distinct from both traditional software engineering culture and ML research culture. Tools matter, but culture matters more. This post describes the practices we've observed in high-performing AI engineering organizations, and which of them most consistently separate the teams that deliver from the teams that only demo.
Evals-first shipping
PRs affecting AI behavior don't merge without eval results. The eval suite is part of CI. Regressions block merges. This is controversial at first — engineers used to shipping fast resist the friction. The teams that get over that hump ship more reliably than teams without the gate.
Key detail: the eval suite must be fast enough that it runs on every PR. If it takes 30 minutes, people skip it. Aim for 2-5 minute eval runs for the core suite; longer suites run nightly.
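As a rough illustration of what "regressions block merges" can mean in practice, here is a minimal sketch of a gate that compares a candidate run against a stored baseline. The per-category score format, the `eval_gate` name, and the 0.02 tolerance are all assumptions made for this example, not a prescribed setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    regressions: dict[str, float]  # category -> score drop vs. baseline


def eval_gate(baseline: dict[str, float],
              candidate: dict[str, float],
              max_drop: float = 0.02) -> GateResult:
    """Compare per-category eval scores (0.0 to 1.0) against a baseline.

    A category is a regression when its score falls by more than
    `max_drop`. A category missing from the candidate run is scored 0.0,
    so silently dropping an eval also fails the gate.
    The 0.02 tolerance is an illustrative assumption.
    """
    regressions: dict[str, float] = {}
    for category, base_score in baseline.items():
        drop = base_score - candidate.get(category, 0.0)
        if drop > max_drop:
            regressions[category] = round(drop, 4)
    return GateResult(passed=not regressions, regressions=regressions)
```

A CI step would call `eval_gate` with the scores from the core suite and exit non-zero when `passed` is false, which is what makes the gate blocking rather than advisory.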
Weekly eval review
A recurring meeting (30 minutes, same time every week) where the team reviews aggregate eval trends, specific regressions, wins, and edge cases from production that were added to the eval set. Not a status meeting. A quality ritual.
The teams that do this consistently spot patterns before they become incidents. 'Quality on category X has been drifting down for three weeks' is a signal that triggers investigation, not a postmortem topic three months later.
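The "drifting down for three weeks" signal is easy to compute ahead of the meeting. The sketch below assumes weekly scores are stored per category, oldest first; the function name and data shape are hypothetical.

```python
def drifting_categories(weekly_scores: dict[str, list[float]],
                        weeks: int = 3) -> list[str]:
    """Return categories whose score fell in each of the last `weeks` steps.

    `weekly_scores` maps a category name to its weekly scores, oldest
    first. Detecting a decline over N weeks needs N + 1 data points.
    """
    flagged = []
    for category, scores in weekly_scores.items():
        recent = scores[-(weeks + 1):]
        if len(recent) < weeks + 1:
            continue  # not enough history to judge
        # Every week-over-week step must be a strict decline.
        if all(later < earlier for earlier, later in zip(recent, recent[1:])):
            flagged.append(category)
    return sorted(flagged)
```

A strict monotonic decline is a deliberately crude rule. A team might prefer a slope threshold or a comparison against a trailing average, but even this version puts the right categories on the agenda.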
Cost as a first-class metric
Cost dashboards visible to every engineer. PRs include estimated cost impact. Cost regressions (a PR that increases per-request cost by 20%) are incidents. This discipline transforms AI spending from a CFO worry to an engineering habit.
In the organizations we've observed, teams that do this end up with 30-50% lower per-user AI costs than teams that don't. Not because they cut features, but because they catch waste early and route requests intelligently.
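One way to put an estimated cost impact on a PR is to compare per-request cost before and after the change on a fixed sample of requests. The sketch below assumes token-based pricing. The prices passed in are placeholders rather than any provider's real rates, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int


def cost_per_request(usage: Usage,
                     input_price_per_1k: float,
                     output_price_per_1k: float) -> float:
    """Cost of one request under simple per-1k-token pricing (an assumption)."""
    return (usage.input_tokens / 1000 * input_price_per_1k
            + usage.output_tokens / 1000 * output_price_per_1k)


def is_cost_regression(before: float, after: float,
                       threshold: float = 0.20) -> bool:
    """True when per-request cost rose by at least `threshold` (20% by default)."""
    if before <= 0:
        return after > 0
    return (after - before) / before >= threshold
```

In practice `before` and `after` would be averages over a representative request sample, since a single request says little about a prompt change that affects output length.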
Canary + fast rollback
No 100% deployments. Every change rolls out to 1-5% first for at least a few hours. Rollback path tested monthly (literally — practice rolling back to verify it works). Mean time to revert a bad deploy: under 5 minutes.
The teams without this discipline have a story: 'we shipped a prompt change, it looked fine in testing, production saw 200k requests in the next hour, 30% were worse, we didn't notice for a day, the rollback took 3 hours because of caching issues, we're still recovering.' The teams with canary discipline don't have this story — they caught it at 1% traffic.
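A canary split does not need heavy infrastructure to get started. The following is a minimal sketch of deterministic percentage routing, assuming each request carries a stable key such as a user ID. The function names and the salt value are made up for illustration.

```python
import hashlib


def canary_bucket(request_key: str, salt: str) -> int:
    """Map a key to a stable bucket in [0, 10000) using SHA-256."""
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def pick_variant(request_key: str, canary_percent: float,
                 salt: str = "prompt-change-1") -> str:
    """Return "canary" for roughly `canary_percent`% of keys, else "stable".

    Setting canary_percent to 0 is the rollback: every key maps to "stable".
    The salt is a hypothetical per-rollout label, so that different
    rollouts do not always pick the same users.
    """
    threshold = int(canary_percent * 100)  # 1% -> 100 of 10,000 buckets
    return "canary" if canary_bucket(request_key, salt) < threshold else "stable"
```

Hashing the key instead of sampling randomly keeps each user on one variant for the whole rollout, which makes bad responses easier to trace. This covers routing only. Anything cached from the canary variant still has to be invalidated on rollback, which is exactly the part worth rehearsing.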
User feedback loops everywhere
Thumbs up/down in the UI on every AI response. Weekly review of negative feedback patterns. Every pattern that reveals a failure case becomes an eval case. The production stream becomes an ever-growing quality signal — and it compounds as the product matures.
Teams that skip this optimize against their own imagination. Teams that use it optimize against real user frustration, which is always richer.
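The step from "negative feedback pattern" to "eval case" can be mostly mechanical. This sketch assumes each feedback record is tagged with a failure pattern during the weekly review. The `Feedback` fields and the eval-case dictionary shape are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    request_id: str
    prompt: str
    response: str
    thumbs_up: bool
    tag: str  # failure pattern assigned during review (assumed field)


def negative_patterns(feedback: list[Feedback]) -> list[tuple[str, int]]:
    """Count failure tags across thumbs-down feedback, most common first."""
    return Counter(f.tag for f in feedback if not f.thumbs_up).most_common()


def to_eval_cases(feedback: list[Feedback], tag: str) -> list[dict]:
    """Turn every thumbs-down record with the given tag into an eval case."""
    return [
        {"id": f"fb-{f.request_id}", "input": f.prompt,
         "bad_output": f.response, "tag": tag}
        for f in feedback
        if not f.thumbs_up and f.tag == tag
    ]
```

Keeping the original bad output alongside the input gives a reviewer a concrete example of what the eval is meant to catch.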
Living documentation
Runbooks in the same repo as code. ADRs (architecture decision records) for significant choices. Evaluation methodology documented and updated. Documentation decay is a failure signal: a six-month-old runbook that references systems that no longer exist means the team isn't maintaining the docs.
Engineers new to the on-call rotation can function because the runbooks are current. New hires ramp faster because the ADRs explain why decisions were made, not just what was built. This is boring operational hygiene. It produces outsized returns.
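Documentation decay can be checked partly by machine. The sketch below assumes runbooks mark service references with a `svc:name` convention in backticks and that a list of live services is available. Both the convention and the function name are invented for this example.

```python
import re

# Assumed convention: runbooks reference services as `svc:<name>`.
SERVICE_REF = re.compile(r"`svc:([a-z0-9-]+)`")


def stale_references(runbook_text: str, live_services: set[str]) -> list[str]:
    """Return services a runbook mentions that are not in the live inventory."""
    referenced = set(SERVICE_REF.findall(runbook_text))
    return sorted(referenced - live_services)
```

Run in CI over every runbook in the repo, a non-empty result points at the runbook that needs updating before someone discovers it during an incident.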
The meta-pattern: engineering discipline applied to non-deterministic systems
All six practices share a theme: applying software engineering rigor to systems that don't behave deterministically. Non-deterministic systems need different tools (evals instead of unit tests) but the same cultural commitment to quality, measurement, and iterative improvement.
Teams that treat AI as a magical exception — 'you can't test LLMs the same way,' 'prompts are creative writing not code,' 'quality is subjective' — consistently ship worse systems than teams that apply standard discipline with AI-appropriate tools.
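To make "evals instead of unit tests" concrete: a unit test asserts on one output, while an eval samples the system repeatedly and asserts on a rate. The sketch below uses a stand-in model that fails on a fixed schedule so that the example is reproducible. `make_flaky_model` is a hypothetical stub, not a real client.

```python
import itertools
from typing import Callable


def make_flaky_model(fail_every: int) -> Callable[[str], str]:
    """Stand-in for a non-deterministic model: every Nth call goes wrong."""
    calls = itertools.count(1)

    def generate(prompt: str) -> str:
        if next(calls) % fail_every == 0:
            return "I cannot help with that."
        return f"ANSWER: {prompt.upper()}"

    return generate


def pass_rate(generate: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              samples: int = 20) -> float:
    """Fraction of sampled outputs that satisfy `check`."""
    passes = sum(1 for _ in range(samples) if check(generate(prompt)))
    return passes / samples
```

The assertion then becomes something like `pass_rate(...) >= 0.95` rather than an equality check on a single string. The threshold itself is a product decision that depends on how costly a failure is.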
What research-heritage teams get wrong
Overemphasis on model novelty and underemphasis on production basics. The team that spends three months trying a new architecture and ignores deployment reliability ships later and worse than the team that picks a proven architecture and obsesses over deployment.
Comfort with 'unknown failure modes.' Research teams are used to systems that sometimes fail interestingly. Product teams need systems that fail predictably — known failure modes with known responses. This shift in expectations is cultural and takes effort.
What traditional engineering teams get wrong
Trying to apply unit-test thinking to non-deterministic systems. 'We wrote tests' doesn't cover LLM behavior. You need evals — a different discipline with different tools.
Underestimating the importance of prompt and retrieval tuning. These aren't 'configuration'; they are primary engineering artifacts that need version control, review, evals, and thoughtful iteration.
Treating AI features as external dependencies that 'just work.' The model is upstream; the prompt, retrieval, guardrails, and integration are your code. You own quality on your side of that boundary.
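Treating prompts and retrieval settings as primary artifacts gets easier when every variant has an identity that can appear in logs, eval results, and rollbacks. This is a minimal sketch, assuming the behavior-relevant settings fit in one small config object. The field names are illustrative.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptConfig:
    template: str
    model: str
    temperature: float
    retrieval_top_k: int


def fingerprint(config: PromptConfig) -> str:
    """Short, stable content hash of everything that affects behavior.

    Keys are sorted so the hash depends only on the values, not on
    field order.
    """
    canonical = json.dumps(asdict(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Logging the fingerprint with every response and every eval run makes it possible to tie a quality change to the exact prompt and retrieval settings that produced it.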