eazyware
Engineering·March 24, 2025·10 min read

AI code review in CI: what actually catches bugs

LLM-based code review is useful and has sharp edges. The patterns that add signal without adding noise.

KR
Kushal R.
Engineering lead

AI code review in CI has gone from 'curiosity' to 'standard practice' at forward-leaning engineering teams over the last 18 months. The tools — CodeRabbit, GitHub's built-in review AI, custom implementations — are genuinely useful. They also produce noise that can undermine trust. This post is how we tune AI code review to add signal without adding friction.

Signal vs noise
AI code review — what to comment on High-signal categories · Null/undefined handling missed · Race conditions, shared state bugs · Security: secrets, injection, authz gaps · API contract inconsistencies Low-signal (disable these) · Naming preferences · Style / formatting (use linter) · "Consider adding a comment" · Refactor suggestions without bugs Tuning to reduce noise · System prompt: "only flag likely bugs; skip style" · Threshold: only comment if confidence > 70% · Dismiss-rate metric: if > 60% dismissed, retune · Engineer can reply "not a bug" to train future reviews · Skip files: tests, generated code, vendor dirs
High-signal categories vs low-signal. Tuning: system prompt, confidence threshold, dismissal-rate metric, per-file skip rules.

What AI catches well

Null or undefined handling gaps. Race conditions, shared state bugs, async/await mistakes. Security vulnerabilities — secrets in code, SQL injection patterns, authorization gaps, CORS misconfigurations. API contract inconsistencies across files. LLMs spot these reliably.

What AI does badly

Style and naming preferences. Commentary and documentation suggestions. Refactor suggestions for code that already works. All these add clutter, dilute attention from actual bugs, and train engineers to dismiss.

Tuning for signal

System prompt: 'Only flag likely bugs, security issues, or API inconsistencies. Do not suggest style changes, renames, or additional comments. Do not suggest refactors unless they fix a bug.' This alone drops low-signal comments by 60-80%.

Confidence threshold: ask the model to self-rate confidence. Only publish comments with confidence above threshold (70% starting point). Tune based on dismissal rates — if engineers dismiss over 50%, raise the threshold.

Dismissal feedback loop: when engineers dismiss with 'not a bug,' capture the signal. Use it in future prompts or to fine-tune. Convert dismissal from symptom into fix.

File and language scope: skip tests, generated code, vendored dependencies, migrations. Focus AI attention on application code where bugs have real impact.

Integration patterns

Comment inline on PR as a bot. Make sure attributed clearly — pretending it's human is worse than labeling. Don't block merges on AI comments; humans have final say. Track comments-per-PR (should trend down), dismissal rate (below 40% is healthy), and bugs-caught-vs-shipped.

The cultural piece

Some engineers resent AI reviewing their code. Frame it right: AI is a fast first pass for obvious issues so human reviewers can focus on architecture and design. Not 'AI judging your code.' Framing matters.

Read next
AI pair programming: Copilot, Cursor, Claude Code patterns
Read next
Building AI-native developer tools: what developers actually want
Read next
Why evaluation infrastructure matters more than prompts
Tags
code reviewCIdeveloper tools
/ Next step

Want to talk about this?

We love debating this stuff. 30-minute call, no pitch, just engineering conversation.

~4h
avg response
Q2 '26
next slot
100%
NDA on request