
Building With Two LLM Agents in Deliberately Separated Roles:

An Experiment in Strict Agent Role Separation

May 2026 · 17 min

I was watching Claude work through Codex’s first review of the milestone where I assembled the public corpus and the eval cases. Seven blockers, mostly small — change a return value to a typed exception, raise on integrity errors instead of returning a status object, that kind of thing. Standard fix-up cycle. Then I noticed Claude had opened a test file.

The whole point of the workflow was that Claude doesn’t open test files. Codex writes the tests. Claude implements the code. The line is bright. That’s the rule. So I had to hit the stop button and directly ask Claude if it was touching test files.

Screenshot of me stopping Claude from tampering with test scripts.

Claude described what it had done: three tests rewritten, one new test added. Codex’s existing tests had been written for the old contract — return a status object on hash mismatch and let the caller decide what to do next. Codex’s review now demanded the new contract — raise a typed exception on hash mismatch, fail fast at the document level. The implementation Codex wanted broke the tests Codex had already written. Claude couldn’t ship Codex’s demanded change without touching Codex’s territory.
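To make the contract change concrete, here is a minimal sketch of the two shapes. Every name in it (`VerifyStatus`, `HashMismatchError`, `verify_old`, `verify_new`) is hypothetical and chosen for illustration; the project’s real identifiers and hashing details may differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class VerifyStatus:
    """Old contract (illustrative): a status object the caller must inspect."""
    ok: bool
    detail: str = ""


class HashMismatchError(Exception):
    """New contract (illustrative): a typed exception raised per document."""

    def __init__(self, doc_id: str, expected: str, actual: str) -> None:
        super().__init__(f"{doc_id}: expected {expected[:8]}, got {actual[:8]}")
        self.doc_id = doc_id
        self.expected = expected
        self.actual = actual


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_old(doc_id: str, data: bytes, expected: str) -> VerifyStatus:
    # Old behavior: report the mismatch and let the caller decide what to do.
    if sha256_hex(data) != expected:
        return VerifyStatus(ok=False, detail=f"hash mismatch for {doc_id}")
    return VerifyStatus(ok=True)


def verify_new(doc_id: str, data: bytes, expected: str) -> None:
    # New behavior: fail fast, so there is no status object to forget to check.
    actual = sha256_hex(data)
    if actual != expected:
        raise HashMismatchError(doc_id, expected, actual)
```

A test written against the first shape asserts on the returned status. Once the function raises instead, that assertion is never reached, so the test fails no matter how correct the implementation is.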

I had Claude roll back to the pre-edit state and walked through the diff one test at a time. Then I sent Codex the test diff for re-review before commit, not after, inverting the normal order. Codex’s verdict, recorded in the build journal verbatim: “legitimate test updates, not Claude watering down Codex tests… new assertions are actually stronger than mine were.”

I logged it as a “test-update exception (acknowledged)” in the build journal because the auditor/auditee separation rule normally forbids it. The honest version: the workflow didn’t catch this on its own; I did. The two-agent split had narrowed the surface area for slips like this, but it took a human at the screen to actually catch one when it came. Naming the exception was the mitigation; the workflow itself wasn’t what detected the breach.

This post is about why I designed the workflow, what it caught during the first half of the build, and, importantly, what it didn’t catch.


What Self-Grading Bias Actually Is

The bias here is well documented at this point; the open question is what to actually do about it. I’d seen it discussed in power-user circles, and having now experienced it firsthand, I decided to dig into the research. The failure mode is structural, not fixable through better prompting, which means the mitigation has to be structural too. The workflow is my answer; what follows is the research that justifies its shape.

The sharpest test-generation finding I found comes from a 2026 paper. Haroon et al. evaluated LLM-generated unit tests under software evolution and found that more than 99% of failing tests under semantic-altering changes still pass on the original program, executing the modified region without catching the change.[1] The tests aren’t reasoning about program semantics — they’re reproducing surface patterns from the training distribution. When the same model owns code and tests, the assertions drift toward the implementation’s behavior, not the spec’s.

The mechanism behind that drift shows up in two adjacent places in the literature.

The first is self-preference bias in LLM-as-judge work. Spiliopoulou et al. in 2025 built a statistical framework over 5,000+ prompt-completion pairs with expert human annotations and nine LLM judges, and found that GPT-4o and Claude 3.5 Sonnet systematically assign higher scores to their own outputs — and to outputs from other models in the same product family.[2] This is the failure mode whenever you reach for LLM-as-judge.

The second is the limits of self-correction without an external verifier. Stechly, Valmeekam, and Kambhampati at ICLR 2025 tested GPT-4 on Game of 24, Graph Coloring, and STRIPS planning tasks, and found that intrinsic self-critique — the model checking its own answers without an external reasoner — fails to improve performance and often degrades it. An external correct verifier is what fixes it.[3] The framing matters: a model can’t reliably grade its own work without something outside itself doing the grading.

Stack the three threads together and the picture gets sharp. A model writing tests for code it just wrote is grading its own work, can’t reliably catch its own errors without external feedback, and tends to write tests that encode its own assumptions about the implementation rather than the spec. The mitigation that survives all three is straightforward: don’t let the same model own both surfaces.

That’s the workflow I built.


Why I Rebuilt

The first version of this system worked. Locally, on my laptop, with one corpus loaded, NDCG@5 went from 0.101 at the vibe-coded baseline to 0.498 after I tore the retrieval pipeline apart and put it back together. I shipped that result to my own browser and called it a win. It worked for me, but it wasn’t deployable anywhere else; in reality it was the equivalent of a functional demo.

Since I built the original system I’ve learned a lot, and I realized that just because the system was working didn’t mean it was good.

The schema had no migrations. Embeddings lived in a .npy sidecar with no transactional coupling to the SQLite rows they were supposed to mirror. Document identity was the file path — rename a folder and half the labels went stale. The evaluation harness existed, but it only tested search retrieval before deployment — no long-term evaluation. The ingest pipeline ran extract → chunk → embed → index → sync inside a single function, so swapping the embedding model meant re-extracting text from PDFs I had extracted last week.

The reality is, none of it would survive being read by another engineer. Some of it wouldn’t survive my own re-reading three months later. So I wrote a lessons-learned doc and started over — with a workflow constraint I’d been thinking about for a while but hadn’t implemented on a real project yet.


The Constraint

I wanted to use two LLM agents on the rebuild. Not collaborating. Deliberately separated.

Claude implements. Codex verifies. Ownership is strict and non-overlapping. The rules are captured in an architecture decision record (ADR) — a short markdown document that records a single decision, the alternatives considered, and the consequences accepted — that I drafted at M0 and have not amended since.

| Area | Owner |
|---|---|
| Application code, ingestion, retrieval, agent loop, FastAPI, frontend, CLI | Claude |
| Architecture, build journal, ADRs (drafts), blog drafts | Claude |
| Test plan, all tests, prompt-injection battery, code-review reports | Codex |
| Eval cases (the domain-judgment work) | Me, with Claude drafting |
| Plan, final architectural calls, arbitration | Me |

The bolded row is the load-bearing one. Codex never edits application code. Claude never edits tests or review reports. If a Codex-authored test exposes an implementation bug, Codex writes a review report citing the line and severity — and stops. The fix is Claude’s.

This is not how most teams work, and the cost is real. A misunderstood requirement now needs an ADR amendment or a Codex review to surface. Some boilerplate fixes that one brain could resolve in seconds require a round-trip across the boundary. The same hands never get to write the code and the test that defends it. I accepted that explicitly in the ADR’s “Consequences” section. The benefit I was buying is what the research above describes — the workflow narrows the surface area where any one model is grading its own work — and an audit trail that no single agent could have produced for itself.


How The Agents Communicate

The two-agent setup needs plumbing under it. Without shared state and consistent prompting, the agents drift past each other and the boundary blurs across milestones. Three pieces hold the workflow together.

The build journal as inter-agent message bus. This is how the agents communicate. It’s a requirement that they show their work in one file, on disk, that both of them read. When Codex makes a statement about a failure, it goes in the journal. When Claude describes a schema choice, in the journal it goes. The journal is the durable shared state — not chat history, not whatever an agent remembers from a prior session, but a flat append-only log that survives session limits and re-runs. Codex’s review reports cite specific journal entries; the journal cites specific commits and decisions. The artifact-level link is what makes “this was deferred to M3” survive between sessions.
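A minimal sketch of what “flat append-only log” can mean in practice. The JSON Lines format, the field names, and the helper names `append_entry` and `read_entries` are all assumptions made for illustration; the real journal may well be plain markdown.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def append_entry(journal: Path, author: str, milestone: str, text: str) -> dict:
    """Add one entry to the end of the journal; existing lines are never rewritten."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "author": author,        # e.g. "claude", "codex", or the human
        "milestone": milestone,  # e.g. "M1"
        "text": text,
    }
    # Mode "a" only ever writes at the end of the file.
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def read_entries(journal: Path, milestone: Optional[str] = None) -> list:
    """Read entries in the order they were written, optionally for one milestone."""
    if not journal.exists():
        return []
    with journal.open(encoding="utf-8") as fh:
        entries = [json.loads(line) for line in fh if line.strip()]
    if milestone is None:
        return entries
    return [e for e in entries if e["milestone"] == milestone]
```

Because every session starts by reading the file rather than relying on chat memory, a note like “deferred to M3” is still there after a session limit or a re-run.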

Architecture Decision Records. Each phase starts with decisions: hardline choices on the framework, the schema, the cadence rule, the exception protocol. Each agent is responsible for adhering to it, with the other agent checking their work. Defined once, cited many times, never amended silently. The agent-separation ADR I drafted at M0 is the load-bearing one; it’s been cited in every review report since.

One prompt generation point. I use one agent with one context window to generate every prompt across the project. The practical effect is consistency: every Codex prompt opens with “do not edit Claude’s territory” and lists exactly what that is. Every Claude prompt opens with the inverse. If those constraint blocks were re-typed by hand across milestones they would drift — the boundary would blur in the prompt language even if the rule itself stayed fixed. They don’t drift, because they don’t get re-typed. The prompt-generating agent reads the build journal before each generation to keep the source of truth current.
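The same idea in code form: a rough sketch in which the constraint block is rendered from one ownership table instead of being typed by hand. In the workflow above an agent does the generating, so this function is only a stand-in; the `OWNERSHIP` dictionary, its area labels, and `constraint_block` are hypothetical names for illustration.

```python
# Single source of truth for who owns what (illustrative labels only).
OWNERSHIP = {
    "Claude": ["application code", "architecture docs", "build journal", "ADR drafts"],
    "Codex": ["test plan", "tests", "prompt-injection battery", "code-review reports"],
}


def constraint_block(agent: str, ownership: dict = OWNERSHIP) -> str:
    """Render the opening constraint lines for one agent's prompt."""
    if agent not in ownership:
        raise KeyError(f"unknown agent: {agent}")
    lines = []
    # First, everything that belongs to the other agent(s) and is off limits.
    for other, areas in ownership.items():
        if other != agent:
            lines.append(f"Do not edit {other}'s territory: {', '.join(areas)}.")
    # Then, what this agent is responsible for.
    lines.append(f"You own: {', '.join(ownership[agent])}.")
    return "\n".join(lines)


print(constraint_block("Codex"))
```

Since both prompts come from the same table, changing the boundary means changing one place, and the two constraint blocks cannot disagree with each other.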

The point of all three: non-overlapping ownership only works if both agents are talking to the same boundary specification across time. Without the journal as state, they’d contradict prior milestones. Without the prompt generation point, the boundary would drift in the prompt language. Without the ADRs, the rules of the workflow itself would get re-litigated every milestone.


How Each Milestone Closes

The unit of progress is a milestone, not a commit. Nine milestones, from M0 (skeleton + CI) through M8 (deployment + release scorecard); five reviewed so far.

Each milestone closes the same way. Claude commits implementation plus matching architecture and journal updates. No code-only PRs. Codex reviews the diff, runs the test suite, runs a conformance scan against a prohibited-dependency list — the framework-free ADR makes LangChain, LangGraph, LlamaIndex, and similar wrappers a hard prohibition — and writes a markdown review report under a date-prefixed naming convention with one of three sign-offs:

  • approved — merge.
  • changes-requested — Claude addresses the report and goes back for re-review.
  • architectural-concern — escalates to me before any code change.
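One way a conformance scan like this could work is to walk each source file’s imports and flag any that resolve to a prohibited top-level package. This is a sketch under assumptions: the import names in `PROHIBITED` and the function `prohibited_imports` are illustrative, and the actual scan may check declared dependencies rather than imports.

```python
import ast

# Top-level import names assumed for the wrappers named in the ADR.
PROHIBITED = {"langchain", "langgraph", "llama_index"}


def prohibited_imports(source: str, prohibited: set = PROHIBITED) -> list:
    """Return the sorted, de-duplicated prohibited modules imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are local code.
            names = [node.module]
        else:
            continue
        for name in names:
            if name.split(".")[0] in prohibited:
                found.add(name)
    return sorted(found)


sample = "import os\nimport langchain.chains\nfrom fastapi import FastAPI\n"
print(prohibited_imports(sample))
```

A non-empty result would be enough for the reviewer to withhold `approved`, since the ADR treats these wrappers as a hard prohibition rather than a style preference.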

The review reports are durable artifacts. They live in the repo and they don’t get rewritten. If a milestone came back changes-requested, the report stays in git history with the original blockers visible — even after the re-review approves the fix. That was the shape I wanted. Every sign-off is a public record of what failed first and what was done about it.

Branch protection on main requires a green CI run and a Codex review. The protection isn’t theater; the workflow only holds if the review report is mandatory.


What The Second Pair Of Eyes Caught

The pattern repeats through nearly every milestone. Not always with three blockers (M2 cleared with zero open issues, which felt like an outlier and probably should have), but otherwise always with something the first pair of eyes had walked past.

The catches don’t break into “huge gotchas.” They’re contract violations, hygiene gaps, integrity-error shapes that should have been exceptions instead of return values, log fields that should have been redacted instead of emitted, schema files that should have been linted, validation rules that should have been at the route layer instead of the request model, missing canonical files that the plan called for and the implementation split. None of them, individually, would headline a blog post. Cumulatively, they’re the audit trail.

Severity counts as initially raised by Codex, across the five reviewed milestones:

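| Milestone | Blocker | Major | Minor | Nit | Sign-off |
|---|---|---|---|---|---|
| M0 | 3 | 3 | 1 | 0 | changes-requested → approved |
| M1 | 7 | 1 | 1 | 1 | changes-requested → approved |
| M2 | 0 | 0 | 0 | 0 | approved |
| M3 | 0 | 1 | 1 | 0 | changes-requested (open) |
| M4 | 0 | 2 | 1 | 0 | changes-requested (open) |
| Total | 10 | 7 | 4 | 1 | |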

Twenty-two issues across five reviews, ten of them blockers, and every blocker landed in M0 or M1, where the foundations were being laid. After M1 the catch volume drops sharply, which is the signal you’d expect when the substrate is doing the work that used to require manual review. M2’s zero-issue cycle is the outlier; M3 and M4 are still open, with one inherited blocker carrying forward across both.

The most representative cycle is M0. The first milestone is a skeleton — pyproject.toml, multi-stage Dockerfile, FastAPI app with /health, Alembic with one migration, structlog JSON logging, Pydantic settings, CI workflow with five jobs. The implementation got committed; Codex came back with three blockers and two majors against it: required-config defaults that wouldn’t fail loudly when absent, secret-shaped log fields rendered verbatim to JSON, a mypy pre-commit hook that wasn’t actually green in its isolated environment, the pre-commit smoke unit-test hook missing entirely, and a CI integration job running on the runner host instead of inside the production Docker image. The fix sweep took one session — Pydantic-required fields without defaults, a redaction processor before the JSON renderer, a system-language mypy hook sharing one source of truth with CI, the missing smoke hook, a CI test stage extending the runtime image — and Codex re-reviewed and approved.

The interesting one is the redacted-secrets blocker. The structlog setup had been written and reviewed as part of the M0 commit. Specific-named secret keys flowing through to JSON output is the kind of bug a pre-commit lint hook doesn’t catch and a unit test absolutely should — but only if the test exists. Codex authored the test, the test failed, the failure pointed at the line. That’s the workflow doing exactly what it was designed to do: the pair of eyes that wrote the code wasn’t the pair of eyes that ran the verification.
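For readers who haven’t written one, a structlog processor is a callable that takes `(logger, method_name, event_dict)` and returns the event dict, and processors run in list order, so a redaction step placed before `structlog.processors.JSONRenderer()` sees every field before it is serialized. The sketch below is not the project’s processor; the key hints and the `[REDACTED]` marker are assumptions for illustration.

```python
from typing import Any, MutableMapping

# Hypothetical substrings that mark a log key as secret-shaped.
SECRET_KEY_HINTS = ("password", "secret", "token", "api_key", "authorization")
REDACTED = "[REDACTED]"


def redact_secrets(
    logger: Any, method_name: str, event_dict: MutableMapping[str, Any]
) -> MutableMapping[str, Any]:
    """Replace the value of any secret-shaped key before the event is rendered."""
    for key in list(event_dict):
        if any(hint in key.lower() for hint in SECRET_KEY_HINTS):
            event_dict[key] = REDACTED
    return event_dict


# The processor is a plain function, so a unit test can call it directly.
event = {"event": "db_connect", "db_password": "hunter2", "host": "localhost"}
print(redact_secrets(None, "info", event))
```

The unit test that matters is the one that feeds a secret-shaped key through the configured logging pipeline and asserts the raw value never appears in the output, which is the kind of test the author of the logging setup is least likely to think of writing.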

The M1 test-file moment from the opening of this post sits inside the same pattern, with the additional wrinkle that Codex’s contract demand forced the boundary itself to flex. I have been watching this build closely, so I noticed, but there’s a distinct possibility I could’ve missed it if I had been getting a coffee refill.


What’s Important For This Workflow

The communication plumbing covered earlier is what keeps the agents on the same page across time. The other half of the substrate is the technical guardrails that keep individual milestones honest.

Standard professional hygiene is in place — pre-commit hooks running black, ruff, and mypy --strict across the application and eval surfaces; a fast unit-test smoke that fires before the commit lands; a CI matrix with five jobs (lint, typecheck, unit, docker-build, integration-interface) that runs integration tests inside a production-derived test image against a real Postgres testcontainer with pgvector. No mocks at the database, embedding, or service-boundary layer. None of this is exotic; the point is that all of it is enforced by either pre-commit or CI, so a “good intention” can’t survive without something blocking the merge.

Three pieces are project-distinctive and worth flagging:

  • Adversarial test catalog. Each milestone declares its known failure modes as a numbered catalog (ten items in M2 alone, seven in M3, and so on) and Codex authors a test per item. The catalog lives in the test plan and grows per milestone.
  • Eval harness frozen before tuning. Forty retrieval cases and ninety-two agent cases authored at M1 and frozen before M3 retrieval tuning, so the harness can’t be unconsciously biased toward whatever the implementation happened to produce. Page-range labels rather than chunk-ID labels — the same eval set survives a chunk-size change. The prior project’s harness was a one-shot pre-deployment retrieval test with no ongoing regression coverage; the M3 sweep against the new frozen set produced an NDCG@5 of 0.165 — far lower than the prior project’s 0.498, because the corpus is harder and the case set is broader, and that’s the entire point of revalidating. I’ll be covering the sweep in a future post.
  • Performance gate. M3 added a slow latency test behind an opt-in flag. Gating target was /search p95 < 500 ms; measured was 107 ms. The gate is the value, not the headroom.
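To show why page-range labels survive a chunk-size change, here is a small sketch that scores NDCG@5 by page overlap instead of chunk ID. It assumes binary relevance and credits each labeled span once; the harness’s real gain function and data model are not described in this post, so `PageSpan` and `ndcg_at_k` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PageSpan:
    doc_id: str
    first_page: int
    last_page: int  # inclusive

    def overlaps(self, other: "PageSpan") -> bool:
        return (
            self.doc_id == other.doc_id
            and self.first_page <= other.last_page
            and other.first_page <= self.last_page
        )


def ndcg_at_k(retrieved: list, relevant: list, k: int = 5) -> float:
    """Binary-gain NDCG@k where a retrieved chunk hits a label if their pages overlap."""
    credited = set()  # indexes of labels already matched, so each counts once
    dcg = 0.0
    for rank, chunk in enumerate(retrieved[:k], start=1):
        hit = next(
            (i for i, label in enumerate(relevant)
             if i not in credited and chunk.overlaps(label)),
            None,
        )
        if hit is not None:
            credited.add(hit)
            dcg += 1.0 / math.log2(rank + 1)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(r + 1) for r in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


label = [PageSpan("report", 10, 12)]
small_chunks = [PageSpan("other", 1, 1), PageSpan("report", 12, 13)]
large_chunks = [PageSpan("other", 1, 4), PageSpan("report", 9, 16)]
# Same label, different chunking, same score.
print(ndcg_at_k(small_chunks, label), ndcg_at_k(large_chunks, label))
```

With chunk-ID labels, re-chunking the corpus would invalidate every label; with page spans, the frozen case set stays valid and only the retrieved side changes.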

What The Workflow Taught

A few observations from the build so far, not all of which I expected.

The discipline needs an explicit exception protocol. The M1 test-file moment from the opening of this post forced this lesson. Codex’s contract demand had no path forward without Claude touching test code, and the rule said Claude couldn’t touch test code. The resolution wasn’t to suspend the rule — it was to invert the review order, run Codex over the test diff before commit rather than after, and log the deviation as an acknowledged exception in the build journal. The discipline held because the exception was named and bounded, not silently applied. If I’d let the edit go through without surfacing it, the next time the rule got inconvenient I’d have less ground to stand on.

The build journal becomes the project’s actual story. This was a happy surprise. Because Claude is forced to write a journal entry on every milestone (and on most sessions that land a real change), and because the entries get reviewed alongside the code, the journal is a solid record of what happened during the rebuild, including the test-file moment, the redaction blocker, the integrity-error contract, every fix sweep, and every acknowledged exception. Drafting this post was largely a search-and-quote exercise across the journal and the milestone reviews. The journal isn’t documentation built after the fact; it’s more like a courtroom transcript captured by a stenographer.

Latency is the price. A round-trip review at every PR boundary, with a milestone re-review on top, adds time. Several milestones have spent more than a few hours in the changes-requested → fix → re-review loop. On a delivery deadline, that would have squeezed the timeline. For a solo build where the audit trail is part of the deliverable, the trade is fine. For something else, it might not be. As I use this flow more, I expect to find more efficient ways of managing the handoffs.

What the second agent actually does. A reasonable peer-reviewer question: is the second agent doing the work, or is the human the actual auditor and the second agent is decoration? The honest answer is three roles, not two. Codex does the line-by-line verification — the kind of work neither I nor Claude would do reliably on every diff. I do the meta-arbitration: exception protocols when the rule needs to flex, ADR amendments when a decision has to change, the M1 boundary call. Claude defends its implementation choices in the journal so Codex has something concrete to verify against. If any of the three roles is missing, the workflow folds. The second agent isn’t decoration; it’s doing systematic line-by-line verification, and removing it would put that work back on me, which slows the project down significantly.


Where It Works And Where It Might Not

I don’t think this is a workflow most teams will use. Two reasons.

Cost. Running two frontier models on every milestone, with a human in the loop arbitrating the ADRs, is more expensive in both API spend and human attention than a single-agent workflow with reasonable PR review. Across the first five milestones, the bill so far is roughly two hours of my own time per milestone (about ten hours total in active arbitration), about six million tokens through Claude, and enough Codex usage on a Pro subscription to hit my weekly limit at least once. I’ve also run into Claude’s session limit five or six times on Max 5X. For a solo project where the audit trail is part of the goal, those costs are the point. For a team with delivery pressure, they’re a tax. If I were running this project through the API rather than subscriptions, I’d likely be more cost sensitive.

The bias the separation prevents is mostly a solo-build risk. The research grounding the workflow — Haroon et al. on LLM-generated tests drifting toward the original program semantics under code change, Spiliopoulou et al. on current frontier models systematically self-preferring, Stechly et al. on intrinsic self-critique failing without an external verifier — describes a failure mode that emerges sharply when one model owns both surfaces. If a team has a dedicated test author who isn’t the implementer, the same property holds without needing two agents in opposition. The strict separation matters most when one model, or one person, would otherwise own both code and verification.

Where I think the workflow works: solo or small-team builds where the engineer wants the verification path to be independent of the implementation path, and where the audit trail is itself a deliverable the project benefits from having. A peer engineer reading this codebase six months from now should be able to see what was demanded, what was caught, what was deferred, and what was acknowledged as an exception — without taking my word for it. Where I think it might not: a team with humans already filling the test-author role, or a project where time-to-ship dominates everything else.


Closing

The most important lesson of the rebuild so far isn’t the two-agent workflow itself — it’s what the workflow forces. Pre-commit hooks that actually have to pass. Migrations that have to live in Alembic, not in ad-hoc ALTER TABLE. Eval cases that have to exist before tuning, not after. Review reports that have to land in the repo, with changes-requested visible in git history even after the fix approves. Exceptions to the auditor/auditee rule that have to be named, scoped, and re-reviewed at the boundary they cross. The prior project had none of those substrates. This one couldn’t have shipped a single milestone without them.

If I’d been getting a coffee refill during the M1 moment, this post would be different. The two-agent split narrows the surface area for slips like that; the substrate underneath narrows it further; but the residual responsibility doesn’t go to zero. The discipline I’m building still depends on me being awake at the screen, and that’s the version of “rigor” I’m actually shipping — necessary plumbing plus a human who has to keep paying attention. Worth shipping anyway, because the alternative is a single-agent workflow where I’d be paying attention to the same things without any structural help at all. I’m keen to understand how others are managing this, and I’m sure I’ll have new strategies to add on to the next build which I’ll share.

The next post in the series picks up the half of the project I most want to write about: the hand-rolled agent loop with parallel tool use, SSE-streamed intermediate steps, and the designed-for-citation-faithfulness system prompt. That’s M5. The full repo, including the build journal and the milestone review reports, will be on GitHub once it’s ready.


References

[1] Haroon, Khan, Gulzar. “Evaluating LLM-Based Test Generation Under Software Evolution.” arXiv preprint, 2026. arXiv:2603.23443

[2] Spiliopoulou, Fogliato, Burnsky, Soliman, Ma, Horwood, Ballesteros. “Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge.” arXiv preprint, 2025. arXiv:2508.06709

[3] Stechly, Valmeekam, Kambhampati. “On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks.” ICLR 2025. arXiv:2402.08115