The new review landscape

Something quietly shifted in the last 18 months. PRs got longer. They arrived faster. And for a growing number of teams, the author listed at the top of the diff barely wrote a line of the code underneath it.

AI wrote it. A developer described what they wanted, reviewed the output, maybe made a few tweaks, and pushed. That's the workflow now. And it's not a bad thing — AI-generated code ships faster, handles boilerplate cleanly, and rarely makes the trivial mistakes that used to dominate review comments.

But it creates a new problem: more code, less time, and a review culture that hasn't caught up with the reality of who's actually writing the code. Teams are still reviewing AI-generated PRs the same way they reviewed human-written ones — which means they're spending time on things that don't matter and missing the things that do.

Code review in the AI era needs a different focus. Not because AI code is worse — in many ways, it's more consistent, better-formatted, and more thoroughly commented than average developer output. But because AI has specific, predictable blind spots. And if your review process doesn't target those blind spots, you're reviewing the wrong things.

What AI gets right

Before you can focus your review on what matters, it helps to understand what you can safely move past. AI is genuinely excellent at:

Consistent formatting and style
Standard boilerplate and scaffolding
Well-known patterns and idiomatic structure
Thorough, readable comments

The practical implication: you can compress the parts of code review that used to take the most time — style, structure, standard patterns — and redirect that attention to the things AI gets wrong.

The review shift: Stop reviewing like a linter. Start reviewing like a product engineer who understands the system's real constraints. AI can pass a linter. It can't fully understand your business.

The 5 things AI consistently misses

These aren't random failures. They're structural gaps — categories of reasoning that are genuinely hard for AI because they require context that isn't in the code itself.

01. Business logic edge cases

AI implements what you describe. It doesn't know that "cancel subscription" also needs to halt any in-flight webhook deliveries, or that "delete account" in your system triggers a 30-day grace period before data removal. These aren't coding problems — they're product decisions baked into your codebase. AI can't discover them from a prompt. You have to verify the implementation matches the actual business rule, not just the surface description of it.
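To make the gap concrete, here's a minimal sketch (all names hypothetical): the implementation an AI produces from the prompt alone versus the one that honors the real business rule about in-flight webhooks.

```typescript
// Hypothetical sketch: "cancel subscription" per the prompt vs. per the
// actual business rule, which also halts in-flight webhook deliveries.
interface Subscription { id: string; status: "active" | "cancelled" }
interface Webhook { subscriptionId: string; state: "pending" | "halted" }

// What an AI typically produces from the prompt alone.
function cancelSubscriptionNaive(sub: Subscription): Subscription {
  return { ...sub, status: "cancelled" };
}

// What the product actually requires: the reviewer has to know this rule
// exists, because nothing in the prompt or the diff hints at it.
function cancelSubscription(
  sub: Subscription,
  webhooks: Webhook[]
): { sub: Subscription; webhooks: Webhook[] } {
  return {
    sub: { ...sub, status: "cancelled" },
    webhooks: webhooks.map(w =>
      w.subscriptionId === sub.id ? { ...w, state: "halted" } : w
    ),
  };
}
```

Both versions pass a linter and look complete in isolation; only the second matches the rule.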

02. Security assumptions

AI-generated auth and authorization code is often correct in isolation and broken in context. The isAdmin check might be right; the fact that you're checking it client-side when the data is fetched server-side without a corresponding server check is what AI misses. It generates code that looks secure. It doesn't verify that the security model is coherent across the entire request path. The same gap that breaks vibe-coded apps in production shows up here: local correctness isn't global safety.
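A minimal sketch of that gap, with illustrative names: the client-side check is correct on its own, but the endpoint is only safe if the server repeats it on the request path.

```typescript
// Illustrative sketch: a locally-correct admin check that is only safe
// if the server-side handler enforces it too. Names are hypothetical.
interface Session { userId: string; isAdmin: boolean }

// Client side: hides the button. Correct, but not a security boundary —
// anyone can still call the endpoint directly.
function showDeleteButton(session: Session): boolean {
  return session.isAdmin;
}

// Server side: the check AI-generated code often omits, because the
// client-side version already "looks" secure.
function handleDeleteUser(session: Session, targetUserId: string): number {
  if (!session.isAdmin) return 403; // enforce at the request path, not the UI
  // ...perform deletion of targetUserId...
  return 200;
}
```

The review question isn't "is the isAdmin check right?" — it's "does every path to the data repeat it?"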

03. State mutations and side effects

AI code tends to treat state optimistically. It updates the UI before confirming the server write. It assumes an array is immutable when something upstream mutates it. It forgets that calling a function on one component triggers a re-render in a parent that resets the form the user is filling out. These are emergent behaviors — they only appear when the code runs in the context of everything else. AI wrote the module in isolation; your job is to verify it behaves correctly in the system.
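The mutation pitfall in miniature (a generic sketch, not tied to any framework): the mutating version silently changes every snapshot that still holds a reference to the old array.

```typescript
// Sketch of the shared-state mutation bug. `State` is illustrative; the
// same applies to Redux state, React props, or context values.
interface State { items: number[] }

// Buggy: pushes into the array the caller (and anything else holding a
// reference to it) still sees — the "upstream mutation" described above.
function addItemMutating(state: State, item: number): State {
  state.items.push(item);
  return state;
}

// Safe: returns a fresh object and array; previous snapshots stay intact.
function addItemImmutable(state: State, item: number): State {
  return { ...state, items: [...state.items, item] };
}
```

In review, the tell is any assignment or push into an object the function didn't create itself.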

04. Async race conditions

Concurrency bugs are hard for humans to spot in review. For AI, they're a persistent blind spot. AI-generated async code often looks correct — the promises are awaited, the errors are caught, the loading states are set. What it misses: two requests fired in quick succession where the second resolves first and overwrites the correct state; a debounced function that doesn't cancel the previous call; a websocket that reconnects mid-transaction. These require thinking about time and interleaving, not just about what the code does in a single execution path.
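Here's the out-of-order resolution bug in its smallest form, plus one common guard (a request token). fetchResults and the setter are stand-ins for whatever async source and state sink the real code uses.

```typescript
// Sketch of a stale-response guard. Without the token check, a slow
// earlier request overwrites state written by a faster later one.
type Fetch = (query: string) => Promise<string[]>;

function makeSearcher(fetchResults: Fetch, setResults: (r: string[]) => void) {
  let latest = 0; // monotonically increasing request token
  return async (query: string) => {
    const token = ++latest;
    const results = await fetchResults(query);
    if (token === latest) setResults(results); // drop out-of-date responses
  };
}
```

Firing search("a") then search("b"), where "a" resolves last, still leaves the UI showing "b" — the thing the user asked for most recently.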

05. Test fidelity

AI writes tests that pass. That's a different thing from tests that catch regressions. AI tends to write tests that verify the happy path, match the implementation structure too closely (so they break when implementation details change, not when behavior changes), and avoid the edge cases that are hardest to set up. If you asked AI to write tests for a function it also wrote, there's a high chance the tests and the implementation share the same blind spot. Review the tests harder than the code.
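A concrete contrast, using a hypothetical helper: the happy-path test passes while every failure path goes unchecked. The failure-path tests below are the ones AI tends not to write unprompted.

```typescript
// Hypothetical function under test.
function parseQuantity(input: string): number {
  const n = Number(input);
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError(`invalid quantity: ${input}`);
  }
  return n;
}

// What AI-written tests usually cover:
function happyPathTest(): void {
  if (parseQuantity("3") !== 3) throw new Error("happy path failed");
}

// What they usually skip — the inputs most likely to regress:
function failurePathTests(): void {
  for (const bad of ["", "0", "-1", "2.5", "abc"]) {
    let threw = false;
    try { parseQuantity(bad); } catch { threw = true; }
    if (!threw) throw new Error(`accepted invalid input: ${bad}`);
  }
}
```

If a PR's tests only resemble happyPathTest, that's the review comment to leave.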

A practical review checklist for AI-generated PRs

This isn't a checklist for reviewing every line. It's a checklist for the human gates — the things that require judgment, context, and knowledge of your actual system. Run through this before approving any AI-generated PR going to production.

Human Gates — AI-Generated PR Review Checklist

Business Logic

Does this implementation match the actual business rule, not just the description in the prompt/ticket?
Are there downstream effects this change doesn't account for? (webhooks, queued jobs, related records)
What happens with existing data? Does the migration handle legacy records correctly?

Security

Is authorization checked server-side on every endpoint that needs it? Not just in middleware — at the data layer.
Are user-supplied inputs validated and sanitized before they touch the database or get serialized?
Does this expose any data the calling user shouldn't have access to? (over-fetching, unfiltered joins)
Any hardcoded secrets, API keys, or environment values that should be in config?

State & Side Effects

Does the UI update happen only after the server confirms success — or does it optimistically update and silently fail?
Are there shared state objects being mutated directly? (Redux state, context, prop objects)
What happens on unmount, navigation away, or session timeout mid-operation?
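The first item in the list above can be sketched in a few lines (all names hypothetical): update local state only after the write is confirmed, and surface the failure instead of swallowing it.

```typescript
// Confirmed-write pattern. Assumes a saveToServer-style function that
// resolves on success and rejects on failure.
type Save = (value: string) => Promise<void>;

async function submitField(
  save: Save,
  setValue: (v: string) => void,
  setError: (msg: string) => void,
  value: string
): Promise<void> {
  try {
    await save(value); // wait for the server to confirm
    setValue(value);   // only then update the UI
  } catch (err) {
    setError(`save failed: ${(err as Error).message}`); // don't fail silently
  }
}
```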

Async & Concurrency

What happens if this async operation is called twice quickly? (double-click, rapid navigation)
Are previous in-flight requests cancelled when a new one is fired?
Are retry and timeout behaviors defined, or does the code hang indefinitely on a slow response?
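One way to satisfy the cancellation item above, sketched generically with AbortController. The task shape is an assumption; real code would hand the signal to fetch or an equivalent abortable API.

```typescript
// Wraps an abortable async task so that starting a new call aborts the
// previous in-flight one.
type Task<T> = (signal: AbortSignal, arg: string) => Promise<T>;

function makeCancelling<T>(task: Task<T>) {
  let controller: AbortController | null = null;
  return (arg: string): Promise<T> => {
    controller?.abort();               // cancel the previous request
    controller = new AbortController();
    return task(controller.signal, arg);
  };
}
```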

Tests

Do the tests cover the failure paths, not just the happy path?
Would these tests catch a behavioral regression, or only an implementation change?
Is the critical path covered by at least one integration test, not just unit tests?

This is what the Golden Code methodology calls the human gate layer. Your AI tools — including the prompts you use to generate code — can enforce a lot of structure automatically. But the list above requires a human who understands the product and the system to actually run through it.

How to use AI to review AI

Here's where it gets interesting: you can use AI's own capabilities against its blind spots. The key is adversarial prompting — asking the AI to critique its own output from the perspective of a skeptical senior engineer, not a helpful assistant.

The instinct to ask "does this code look good?" is wrong. The model wants to be helpful; it will find reasons to say yes. Instead, frame it as a red-team exercise:

# Adversarial review prompts that actually work

"You are a security engineer doing a threat model review.
 This code was written by an AI assistant.
 List every assumption this code makes about trust boundaries,
 authentication state, and input validity.
 Then tell me which assumptions are dangerous."

"You are a senior engineer who has debugged three production
 outages in the last month. Review this async code and list
 every scenario where the UI could display stale or incorrect
 state due to a race condition or out-of-order resolution."

"Read these tests. Your job is NOT to confirm they pass —
 assume they pass. Your job is to find what they DON'T test.
 What inputs, states, or sequences would cause the underlying
 code to fail without any of these tests catching it?"

Adversarial prompting works because it reframes the AI's goal. Instead of optimizing for "the code is fine," it optimizes for "find the failure." The model is just as capable of both — you just have to ask for the right one.

This isn't a replacement for the checklist above. It's a force multiplier before you run it. Use adversarial prompts to surface issues in the diff, then use the checklist to verify your human-layer concerns are addressed.

The right division of labor

The goal isn't to review AI code less rigorously — it's to review it differently. AI is the first pass. You are the final gate. Matching the tool to the task makes both faster.

Review Layer | AI First-Pass | Human Final Gate
Syntax & formatting | ✓ Linter + AI review | Skip — trust the tools
Common patterns | ✓ AI self-review | Spot-check only
Security assumptions | Adversarial prompt | ✓ Manual verification
Business logic | AI can surface questions | ✓ Human must verify
State mutations | Adversarial prompt | ✓ Human traces through
Race conditions | Adversarial prompt | ✓ Human verifies + tests
Test coverage gaps | Adversarial prompt | ✓ Human adds missing cases

The pattern is consistent: AI is good at checking against known rules; humans are essential for reasoning about system-level coherence. Automated tools (linters, type checkers, the four-gate build check from the ship-faster methodology) handle the bottom layer. Adversarial AI prompts surface issues in the middle layer. Human reviewers own the top layer — the business, the security model, the system behavior.

This isn't about trusting AI less. It's about deploying human attention where it creates the most value — which is exactly the kind of leverage that makes AI-assisted teams genuinely faster rather than just superficially faster.


The developers who ship reliable AI-generated code to production aren't doing more review. They're doing smarter review. They've stopped re-litigating formatting and pattern choices that AI gets right, and they've doubled down on the five categories where AI consistently fails.

Use the checklist. Use adversarial prompts. Trust AI as your first pass, and treat your human review time as the final gate — the one thing in the process that can catch what no automated tool can.

The code might be AI-generated. The judgment about whether it's safe to ship is still yours.

Make your review process systematic

Install midas-mcp to enforce gates, run completeness audits, and use phase-aware prompts that build in the right checks from the start — so there's less to catch in review.

npx merlyn-mcp

Also read: Golden Code: The Methodology That Turns Vibe-Coding into Production Software →