Field notes from shipping AI.
Engineering posts, playbooks, and opinions from the team. No thought leadership, no "AI-powered" buzzwords. Just what we learned from actually deploying systems.
The 2026 guide to picking an AI vendor
Not all AI agencies are the same. A framework for evaluating agencies vs consultancies vs freelancers vs in-house, with real cost data and time-to-ship benchmarks.
Why evaluation infrastructure matters more than prompts
Prompt engineering gets all the attention. Eval infrastructure is what actually ships reliable AI. Here's what that looks like in production.
Total cost of ownership for LLM systems
The per-token API price is maybe 30% of your real LLM cost. The other 70% is what nobody talks about. A complete TCO framework.
Six RAG patterns that actually work in production
Beyond "top-k + prompt". The retrieval patterns we deploy most — hybrid search, query rewriting, reranking, parent-document — with when to use each.
Embedding models compared: OpenAI vs Cohere vs Jina vs BGE vs Nomic
Which embedding model should you use in 2026? A head-to-head across retrieval quality, cost, speed, and context window.
Vector databases in 2026: Pinecone vs Qdrant vs Weaviate vs pgvector
When to pick a managed vector DB versus pgvector, and what actually matters at production scale.
LLM security basics every team should know
Prompt injection, jailbreaks, data exfiltration, and the concrete mitigations that actually work.
PII redaction patterns for LLM pipelines
How to strip sensitive data before it hits a model, and the three places this usually breaks.
Guardrails and validators: keeping LLM outputs safe
Schema validators, content filters, topic guards — the layers between LLM output and your users.
Making structured outputs actually reliable
JSON mode, function calling, and constrained decoding — what works, what fails, and how to test.
Function calling patterns that hold up in production
Five tool-use patterns we use across agentic systems, with failure modes and workarounds.
Streaming LLM UX: architecture and pitfalls
Users expect streaming. Servers, proxies, and clients have opinions. Here is how we make it work end-to-end.
Latency budgeting for LLM systems
Every stage of an LLM request costs milliseconds. Here is how we allocate budget and hit targets.
Self-hosting vs managed: GPU decisions in 2026
When to pay for managed inference and when to run your own GPUs. Real costs from real deployments.
Open-source models in production: what actually holds up
Llama 3.3, Qwen, Mistral, DeepSeek — which open-weights models we ship and where they beat closed ones.
Context window engineering: working within and beyond the limits
Long-context models sound great until you hit the middle-of-context problem. Patterns that actually use long windows well.
Multi-model routing: cutting LLM costs 40-60% with zero quality loss
Route by task, not by vendor. A deep dive into how we classify queries and route them to the cheapest capable model — with real cost data from production.
Reasoning models in production: where they actually help
o3, DeepSeek-R1, and friends: when the extra latency and cost are worth it, and when regular models win.
Synthetic data for AI: when to generate, when to buy
LLM-generated training data has gone from novelty to necessity. The patterns that work, the traps to avoid.
Red-teaming AI systems before your users do
A practical playbook for stress-testing LLM apps: prompt injection, jailbreaks, tool misuse, privilege escalation.
The AI readiness audit: 10 questions before you write a single prompt
Most AI failures happen before the first sprint. A structured readiness check across data, team, infrastructure, and use case.
The AI-ops runbook: what to do when things break at 3am
Concrete response patterns for the seven AI-specific incidents, with exact first-five-minute actions.
AI for legal teams: patterns that pass review
Contract analysis, due diligence, clause extraction. What works at law firms and legal ops teams, what fails review.
Build vs buy: when custom AI beats off-the-shelf
Custom AI is expensive and slow. Off-the-shelf AI SaaS is generic and locks you in. Here's the clear line for when each wins.
Healthcare AI: compliance-first design for HIPAA and beyond
How to ship clinical and operational AI without a compliance incident. BAA, PHI, audit trails, model routing.
AI in insurance: claims, underwriting, and fraud in practice
Patterns we deploy at P&C and life insurers. Where LLMs add value, where classical ML still wins.
AI agents in production: what actually breaks
Agentic workflows look great in demos. At 100,000 calls a day, different problems emerge. A tour of the failure modes we've fixed.
AI in manufacturing: the use cases that earn payback
Predictive maintenance, quality inspection, supplier intelligence, SOP search. What actually ships on the shop floor.
AI in real estate: listings, valuation, and tenant screening
Where AI adds real value in proptech, and where fair-housing regulation makes it dangerous.
Building voice AI that passes the "grandma test"
Voice AI is unforgiving. One wrong word and the caller hangs up. How to build voice agents people actually want to talk to.