A year ago, my AI-assisted coding looked like everyone else's. A Copilot suggestion here, a ChatGPT prompt there, a "thanks, looks great" without really reading the output. It worked, sort of. Mostly sort of.
The wake-up call was a Friday evening. I was working on an internal POC — nothing critical, just a tool for the team. I'd just merged a PR almost entirely generated by AI. Tests green, lint clean, looked solid. Except Monday morning, a colleague pings me: "Your endpoint returns data from the wrong tenant when there are two active sessions." On a POC, it's not the end of the world. But it got me thinking: if I let this slide on a project with zero stakes, what am I letting through when the pressure's on? I'd trusted blindly, and even on a POC, that's a bad habit to build.
That day, I decided: if I'm going to delegate to AI for real, I need to treat it like a brilliant junior dev who still needs supervision. And one pair of eyes is never enough.
This is going to sound ridiculously simple, but the first thing that transformed my workflow was a file. A plain CLAUDE.md file at the root of my machine.
It's the file Claude Code reads automatically at the start of every session. And in it, I put everything I used to repeat to the AI at the start of every conversation:
"Give me a plan before you code." "Don't tell me you're done if you haven't run the tests." "Keep it simple, stop over-engineering."
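A starter version might look like this. The exact rules are yours to choose; these are the three above plus a couple I'd consider baseline:

```markdown
# CLAUDE.md

## Workflow
- Give me a plan before you code. Wait for my approval before touching files.
- Don't tell me you're done if you haven't run the tests.
- Keep it simple; stop over-engineering. Prefer the smallest change that works.

## Conventions
- Run the linter and the full test suite before declaring a task finished.
- Ask before adding a new dependency.
```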
Sounds like nothing, but before that, every session started from scratch. I'd spend the first 5 minutes getting the AI back on track. Now Claude Code shows up with the right mindset from the start. It's like working with someone who actually read the project docs before day one — refreshing.
I also added a rule I'm particularly fond of: the self-improvement loop. Every time I correct Claude Code on something, it logs it in a lessons file. Next session, it doesn't make the same mistake. It's simple, but after a few weeks, the difference is striking.
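In CLAUDE.md the rule itself is one line; the lessons file does the accumulating. A sketch of both, with the file name and format being my own (the author just says "a lessons file"):

```markdown
<!-- In CLAUDE.md -->
- Every time I correct you, append a one-line lesson to LESSONS.md,
  and re-read LESSONS.md at the start of each session.

<!-- LESSONS.md, a few weeks in -->
- Don't add new dependencies without asking first.
- Pagination endpoints must handle the expired-cursor case.
- Stop wrapping everything in abstraction layers "for later".
```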
This is rule number one in my CLAUDE.md, and it saves me the most time.
Before, when I wanted a feature, I'd type something like "add pagination to the /users endpoint" and Claude Code would dive straight into the code. Result: it'd modify 8 files, break something I hadn't noticed, and I'd spend more time understanding what it did than it would've taken to do it myself.
Now, it gives me a plan first. Which files it'll touch, in what order, why. And I read it. Actually read it. I push back. "Why do you want to modify the middleware when it's just pagination?" "Did you think about the expired cursor case?"
Sometimes I reject the whole thing and we start over. And that's fine. A plan takes 30 seconds to generate. Poorly thought-out code takes hours to untangle.
The counterintuitive thing is that it slows down the beginning but speeds up everything else. When the plan is solid, the implementation comes out nearly clean on the first pass.
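To make "a plan" concrete, here's the shape I push for, using the pagination example from earlier. This is an illustration, not a transcript; file names are hypothetical:

```markdown
## Plan: cursor pagination for GET /users

1. `routes/users.ts` — accept `cursor` and `limit` query params; validate limit <= 100.
2. `services/users.ts` — fetch limit+1 rows ordered by (created_at, id); the
   extra row decides whether a `next_cursor` is returned.
3. `tests/users.test.ts` — cases: first page, middle page, last page,
   expired or garbage cursor.

Not touched: middleware, auth, other endpoints.
```

That last line matters as much as the first three. A plan that states what it won't touch is the easiest kind to push back on.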
Here's a detail that always surprises people when I mention it: I barely type my prompts anymore. I dictate them.
Tools like Wispr Flow or SuperWhisper do intelligent dictation — you talk, and the text lands directly in your terminal, your IDE, your chat window. No button to press, no corrections needed. You speak as you think and the text follows.
At first, I felt a bit silly talking to my screen. Now it's become so natural that the keyboard feels slow. When I'm challenging a Claude Code plan — "no wait, why do you want to go through a middleware here when a simple wrapper would do?" — it's smoother to say it than to type it. The thought comes out faster, rawer, more honest.
And there's a side effect I didn't anticipate: speaking forces me to articulate clearly. When you type, you can write a vague prompt and hope the AI figures it out. When you speak, you hear yourself. And if your sentence is muddled, you know it immediately.
OK, so Claude Code has an approved plan and implements it. Tests pass, lint is happy. At this point, the old me would've merged. The new me does something a bit weird: I send everything to Codex.
Not to fix it. To review it.
Codex didn't write this code. It doesn't have the creator's bias. It shows up with fresh eyes and isn't there to be nice. And regularly, it finds things. A missed edge case, an ambiguous name, a test that tests the wrong thing.
But the most interesting part is what happens next. I don't fix Codex's remarks myself. I send them back to Claude Code. "Codex says your error handling here is fragile, what do you think?"
And Claude Code can respond. Sometimes it agrees and fixes. Sometimes it argues: "No, that's intentional because [reason]." Codex re-reviews, counter-argues or validates.
I don't stop until both are aligned.
It's a bit like watching two senior developers debate in code review. Except it takes 3 minutes instead of 3 hours, nobody takes the feedback personally, and I'm sipping my coffee watching it unfold.
Is it perfect? No. They sometimes share blind spots. But the code that comes out of this loop is consistently better than what either one would've produced alone.
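The loop itself is easy to script. Here's a minimal sketch in Python with both model calls stubbed out; in practice each stub would shell out to the respective CLI or hit an API. The function names, the approval token, and the three-round cap are all my choices, not anyone's official interface:

```python
from typing import Callable

# An agent takes (current code, the other agent's last message) and
# returns its reply. Real models would go here; these are stand-ins
# so the control flow is runnable.
Agent = Callable[[str, str], str]

APPROVED = "LGTM"

def review_loop(author: Agent, reviewer: Agent, code: str,
                max_rounds: int = 3) -> tuple[str, list[str]]:
    """Bounce code between an authoring model and a reviewing model
    until the reviewer approves or we hit the round cap."""
    transcript: list[str] = []
    feedback = reviewer(code, "")          # fresh-eyes first pass
    transcript.append(f"reviewer: {feedback}")
    for _ in range(max_rounds):
        if feedback.strip() == APPROVED:
            break
        code = author(code, feedback)      # author fixes, or argues back
        transcript.append("author: revised")
        feedback = reviewer(code, "")      # reviewer re-reads the result
        transcript.append(f"reviewer: {feedback}")
    return code, transcript

# Toy demo: the reviewer insists on error handling, the author complies.
def toy_reviewer(code: str, _prev: str) -> str:
    return APPROVED if "try" in code else "error handling is fragile here"

def toy_author(code: str, feedback: str) -> str:
    return code + "\n# wrapped in try/except per review"  # pretend fix

final, log = review_loop(toy_author, toy_reviewer, "def handler(): ...")
```

The round cap is the important design choice: two models can loop forever politely disagreeing, so you bound the debate and arbitrate yourself when they don't converge.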
An important point: I don't trust these AIs because they're AIs. I trust them because I have solid guardrails around them.
All my projects run strict linting, automated tests, pre-commit hooks. Non-negotiable. If the lint breaks, we fix it. If a test fails, we start over. The AI can't cheat its way through.
That's what lets me sleep at night. The tests aren't there to verify that the AI coded well. They're there to verify that the code does what's expected, regardless of who wrote it.
And once everything's green and both AIs agree, I still do a quick pass. But a targeted one. I don't re-read every line — the machines already did that. I look at architecture choices, exposed interfaces, and business edge cases that only someone who knows the context can evaluate. The kind of things no test can catch.
Last thing that changed my life: PR descriptions.
Before, my descriptions were "fix pagination" or "add user endpoint". About as useful as a "road" sign in the middle of a highway.
Now, I ask Claude Code to write the description focusing on the why. Not "added a try/catch at line 42" — you can see that in the diff. I want to know: why this approach? What trade-off was made? What's the impact on the rest of the system?
A good PR description is a mini-ADR (Architecture Decision Record). Someone reading it 6 months from now should understand the reasoning, not just the changes.
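The prompt I give Claude Code boils down to a template like this. The wording is mine; the point is that every heading asks "why", not "what":

```markdown
## Why
The problem this solves, and why now.

## Approach & trade-offs
The approach chosen, the alternatives rejected, and what we gave up.

## Impact
What else in the system this touches; migration or rollout notes.

## How to verify
The one or two scenarios a reviewer should actually exercise.
```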
Unexpected side effect: review discussions got more interesting. When the description talks architecture, reviewers talk architecture. When it talks about lines of code, everyone nitpicks style.
This workflow isn't magic. It works because I invested time in three things:
A good CLAUDE.md. It's a living document. Every time the AI does something that annoys me, I add a rule. It starts at 10 lines, after a few months it's a real playbook.
Solid tests. That's the foundation of everything. Without tests, delegating to AI is just hope. The AI can write beautiful code that does absolutely nothing you expected.
Staying in the loop. I approve the plan. I approve the result. The AI accelerates, but the judgment stays mine. The day I stop reviewing plans and results is the day I end up with a tenant bug in prod on a Friday night.
If you want to try this, start small. A CLAUDE.md with 10 lines and your 3 most important rules. A plan required before every task. And a second pair of eyes, whether that's Codex, another LLM, or a human colleague. Nobody should merge code that only one pair of eyes has seen.
The best code I shipped this year, I didn't write. But every line carries my judgment.