The AI engineer job market in 2026 is a strange place. A million resumes mention LLMs. 'Prompt engineer' is a job title again. Vendors are selling certifications. The signal-to-noise ratio in hiring is brutal. This post is the filter we use: the skills that actually separate juniors from seniors, and the table-stakes skills that are no longer worth asking about in an interview.

Skills map

The map has three tiers: critical must-haves (eval design, retrieval tuning, production debugging, cost awareness, system design), role-dependent valuable skills, and table-stakes knowledge you shouldn't interview on.

What has become table stakes

Prompt engineering. Anyone can do basic prompting. It's a skill in the way that 'using a search engine' is a skill — necessary, not differentiating. If your interview question is 'how would you prompt an LLM to do X', you're selecting for people who took a weekend course.

Using LangChain, LlamaIndex, or similar frameworks. Library knowledge. Valuable, shallow. 'I've used LangChain' tells you little about whether the person can ship production systems.

Knowing the latest model names. Whether it's GPT-4o, Claude Opus, or Gemini this-or-that, the lineup changes quarterly. Knowing the current state is useful. Interviewing on it is not.

Building a demo. The wild west is over. Anyone can ship a demo app in a weekend. Demos don't reveal engineering judgment.

What separates seniors from juniors

Evaluation design

Can the candidate design and maintain an eval suite? Do they think about what to measure, how to measure it, when to add a case? A senior who has built eval infrastructure can describe the datasets, scoring methods, and failure modes they've seen. A junior can name LangSmith or Braintrust and gesture vaguely.

Interview question that works: 'walk me through an eval you built. How did you pick cases? What regressions did it catch? What did you wish you'd measured differently?' Seniors will tell you a story with specifics; juniors will describe features of tools.
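For readers who want the vocabulary made concrete, here is a minimal sketch of what an eval suite tracks: cases tagged by the failure mode they cover, a scoring method, and a comparison against a previous run to surface regressions. Every name in it (`EvalCase`, `run_suite`, `find_regressions`, the substring scorer, the two toy systems) is hypothetical and chosen for illustration; it is not the API of any particular eval tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    must_contain: str     # simplest possible scoring rule: a required substring
    tag: str = "general"  # e.g. the failure mode this case was added for


def run_suite(system: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case through the system and record pass/fail per case id."""
    results = {}
    for case in cases:
        output = system(case.prompt)
        results[case.case_id] = case.must_contain.lower() in output.lower()
    return results


def find_regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Case ids that passed in the baseline run but fail in the current run."""
    return sorted(
        cid for cid, passed in baseline.items()
        if passed and not current.get(cid, False)
    )


# Toy stand-ins for two versions of the system under test (assumptions).
def system_v1(prompt: str) -> str:
    if "refund" in prompt:
        return "Refunds are accepted within 30 days."
    return "I don't know."


def system_v2(prompt: str) -> str:
    return "Please contact support."  # a change that broke both behaviors


cases = [
    EvalCase("refund-window", "What is the refund policy?", "30 days", tag="policy"),
    EvalCase("unknown-topic", "What is the CEO's shoe size?", "don't know", tag="abstain"),
]

baseline = run_suite(system_v1, cases)
current = run_suite(system_v2, cases)
regressions = find_regressions(baseline, current)
print(regressions)
```

A substring check is the crudest scorer there is. The structure is the point: a senior's story usually explains why each case exists and what replaced the crude scorer once it stopped being good enough.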

Retrieval tuning

Everyone uses RAG. Few people have gotten a RAG system from 60% to 90% precision. The path from 'I've used a vector DB' to 'I know how to tune a retrieval stack' involves embedding choice, chunking experiments, hybrid search, reranking, and systematic measurement. Ask about a specific retrieval problem they solved — the answer will reveal their depth.
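The 'systematic measurement' part is the piece most often skipped, so here is a rough sketch of it: precision@k averaged over a labeled query set, computed before and after a change such as adding a reranker. The function names, document ids, and relevance labels are all made up for illustration.

```python
from typing import Dict, List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots filled by a document labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def mean_precision_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average precision@k over every labeled query; missing queries score 0."""
    if not labels:
        return 0.0
    scores = [precision_at_k(runs.get(q, []), rel, k) for q, rel in labels.items()]
    return sum(scores) / len(scores)


# Hypothetical relevance labels and two retrieval runs over the same queries.
labels = {"q1": {"a", "b"}, "q2": {"x"}}
before = {"q1": ["a", "c", "d", "b"], "q2": ["y", "z", "x", "w"]}
after = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z", "w"]}

score_before = mean_precision_at_k(before, labels, k=2)
score_after = mean_precision_at_k(after, labels, k=2)
print(score_before, score_after)
```

With a harness like this in place, each chunking or embedding experiment becomes a number you can compare rather than an impression.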

Production debugging

LLM systems fail in subtle ways, and finding the cause means working with traces, latency percentiles, token accounting, and cost anomaly detection. A candidate who has been on-call for AI systems speaks differently about operations than one who has only built prototypes. Ask: 'tell me about a production AI incident you debugged.' The good answers involve specifics: what they checked, in what order, what they ruled out.
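Two of those signals are easy to sketch. The example below computes nearest-rank latency percentiles and flags days whose spend is well above the median. The sample data, the function names, and the 2x-median threshold are assumptions for illustration, not a recommended alerting policy.

```python
import math
import statistics
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cost_anomalies(daily_cost: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Days whose cost exceeds `factor` times the median across all days."""
    median = statistics.median(daily_cost.values())
    return sorted(day for day, cost in daily_cost.items() if cost > factor * median)


# Hypothetical request latencies in milliseconds; one request hit a slow path.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 122, 2400]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# Hypothetical daily spend in USD.
spend = {"mon": 100.0, "tue": 110.0, "wed": 95.0, "thu": 420.0, "fri": 105.0}
flagged = cost_anomalies(spend)
print(p50, p95, flagged)
```

The median latency looks healthy here while the tail does not, which is why candidates with on-call experience tend to ask for percentiles before averages.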

Cost awareness

Senior AI engineers think about tokens. They know that a loop generating 10k tokens per user per minute ends badly. They route work that doesn't need the most expensive model to cheaper ones. They cache. They compress prompts. Interview question: 'a product feature gets 100k calls/day, averaging 2k input and 800 output tokens per call. How much does this cost and what would you do about it?' The right answer starts with math.
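Here is roughly what that math looks like for the workload in the question. The per-token prices below are placeholders, not any vendor's real rates, and the 70/30 routing split is an assumption picked only to show the shape of the calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # HYPOTHETICAL prices in USD per million tokens; substitute real rates.
    input_per_million: float
    output_per_million: float


def daily_cost(calls: int, input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Daily spend for `calls` requests with the given average token counts."""
    input_total = calls * input_tokens
    output_total = calls * output_tokens
    return (
        input_total / 1_000_000 * pricing.input_per_million
        + output_total / 1_000_000 * pricing.output_per_million
    )


large_model = Pricing(input_per_million=3.0, output_per_million=15.0)   # placeholder
small_model = Pricing(input_per_million=0.25, output_per_million=1.25)  # placeholder

# Workload from the interview question: 100k calls/day, 2k input, 800 output tokens.
all_large = daily_cost(100_000, 2_000, 800, large_model)

# Assumed mitigation: 70% of calls are simple enough for the small model.
routed = (
    daily_cost(30_000, 2_000, 800, large_model)
    + daily_cost(70_000, 2_000, 800, small_model)
)

print(f"all large: ${all_large:,.2f}/day, routed: ${routed:,.2f}/day")
```

Under these placeholder prices the unrouted feature costs $1,800 per day, about $54,000 over a 30-day month, and output tokens account for two thirds of it despite being the smaller count. That kind of observation is what the question is fishing for.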

System design fundamentals

The bar hasn't moved. Queues, state, failure modes, idempotency, observability. These are as important for agentic AI systems as for any other distributed system. A candidate who can't describe how to handle partial failure in a multi-step agent cannot ship a reliable agent.
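One common way to handle partial failure is to checkpoint each step under an idempotency key so that a retry resumes instead of repeating side effects. The sketch below assumes an in-memory dict standing in for durable storage; `run_agent`, the step functions, and the run id are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict[str, str]], str]]


def run_agent(run_id: str, steps: List[Step], checkpoint: Dict[str, str]) -> Dict[str, str]:
    """Run steps in order, skipping any whose result is already checkpointed.

    `checkpoint` stands in for durable storage (an assumption of this sketch).
    """
    context: Dict[str, str] = {}
    for name, fn in steps:
        key = f"{run_id}:{name}"  # idempotency key: same run and step, same record
        if key not in checkpoint:
            checkpoint[key] = fn(context)  # may raise; earlier results stay saved
        context[name] = checkpoint[key]
    return context


calls = {"charge": 0, "email": 0}  # counts how often each side effect ran


def charge(ctx: Dict[str, str]) -> str:
    calls["charge"] += 1
    return "charged"


def send_email(ctx: Dict[str, str]) -> str:
    calls["email"] += 1
    if calls["email"] == 1:
        raise RuntimeError("transient failure")  # simulated flaky dependency
    return f"emailed after {ctx['charge']}"


steps = [("charge", charge), ("email", send_email)]
store: Dict[str, str] = {}

try:
    run_agent("run-42", steps, store)
except RuntimeError:
    pass  # the first attempt fails partway through

result = run_agent("run-42", steps, store)  # the retry resumes after the saved step
print(result, calls)
```

The customer is charged once even though the run was attempted twice. In a real system the side effect and the checkpoint write are not atomic, so the downstream call usually needs to accept the same idempotency key as well.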

Role-specific valuable skills

Fine-tuning: valuable if your domain requires it; over-weighted if not. Most production systems ship without fine-tuning. A candidate who has done meaningful fine-tuning brings depth, but don't require it unless the role demands it.

Agent system design: becoming more important as agentic products proliferate. Candidates who have shipped non-trivial agents have scar tissue worth having.

Domain expertise (legal, medical, financial): hard to teach on the job and highly leveraged if the role demands it. Often worth trading technical breadth for domain depth.

Inference infrastructure: critical if you self-host; irrelevant if you use APIs. Match to your actual architecture.

The interview pattern we use

Ninety-minute technical interview. Thirty minutes: candidate walks through a production AI system they built, with numbers. Thirty minutes: design exercise — given a problem, design an AI system including the eval plan, failure modes, and cost model. Thirty minutes: debugging scenario — given logs and traces from a failing system, diagnose.

This sequence surfaces judgment, breadth, and operational thinking. It does not select for prompt-engineering trivia or framework fluency. Juniors can still do well on it if they have the judgment foundation; seniors with only prompt skills struggle to answer the system design and debugging portions.

Hiring AI engineers: the skills that matter in 2026

