From Ignored Comments to Blocked PRs: Making AI Review Enforceable

The PR was clean. Correct. Complete. Somewhere else in the repo, a document quietly became a lie.

Not because anyone was careless. The agent followed its instructions. The team member who opened the PR reviewed it carefully. The conventions were all there: a glossary, architecture principles, examples. But nobody remembered the doc that referenced the flow that just changed. The agent never loaded it. The human hadn’t thought about it in weeks.

The change was perfect. Three files away, a doc that referenced it had no idea.

I kept noticing this pattern. A term renamed in the code but still used in a guide. A flow updated but the diagram describing it frozen in the past. Each drift was small. Across fifty PRs, it added up.

Working at AI velocity makes this structurally harder. Agents work on what they read. Humans working alongside them stop maintaining the full map too. There’s too much moving to track what references what. The intent is there. The coverage isn’t.

This post is about what I built to close that gap: an LLM-based review gate in CI that catches what neither the agent nor the developer thought to check. Not just a reviewer that runs, but one the team trusts and knows how to respond to.

The Problem: Conventions Without Coverage

We had good conventions. A shared glossary, documented architecture principles, reference examples for common patterns. We also had a local review skill developers could run before pushing. The kind of setup that, on paper, should keep quality high.

It didn’t. Not because people ignored the conventions. Because the conventions only helped with what you were actively thinking about.

AI-assisted contributors ship a lot of PRs fast. Each one is focused, often well-executed. But the agent working on a specific task loads the files for that task. It doesn’t scan the repo for everything that might reference what it’s about to change. Neither does the developer reviewing it. The conventions had no reach beyond the current context window.

What Didn’t Work

The solution seemed obvious: have CI run the same kind of review an LLM can do locally, and post the findings on the PR. The first version was non-blocking. Advisory. It felt like the right balance: information without friction.

It got lost in the noise. We had other bots and agents commenting on PRs. Status checks, deployment previews, automated labels. One more comment, even a good one, was easy to scroll past. Even if only 5% of comments were missed, that’s still real drift accumulating silently.

So I made it blocking. Any Critical finding would fail the CI check. Immediately the problem flipped. The reviewer flagged too many things as Critical. A terminology inconsistency that was technically valid but marginal. A structural concern that was a matter of taste. Developers started treating the gate as adversarial. Trust eroded fast.

I tightened the severity definitions. Fewer things qualified as Critical. The false positive rate dropped. And then came the worst phase: whack-a-mole.

Fix the Critical finding, push, wait for CI. A new finding appears. Not because the reviewer was inconsistent. Because each CI run sees the current diff in context, and context shifts. Fix one thing, and something adjacent becomes visible. Fix that, and something else surfaces. Three round trips to merge a single PR.

That’s when I understood the problem wasn’t just the reviewer. It was the feedback loop.

The Pattern That Works

Getting to something the team trusted took three pieces working together. But first: how does the reviewer even find what drifted?

How it catches what nobody loaded

The reviewer isn’t working from a pre-built index of cross-references. It’s an LLM agent running inside a full checkout of the repo with tool-calling capabilities. It reads the diff, then traces outward: what docs describe the flow that just changed? What glossary terms appear in the modified files? What diagrams reference the updated architecture?

The review prompt gives it a mandate (“from the diff, verify that documentation still accurately describes what exists”) and enough turns to reason about it. It reads files, follows references, and reports what no longer matches. No retrieval script, no hardcoded file list. Just an agent with full repo access and a clear question: did anything else break because of this change?
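To make "tool-calling" concrete, here is a minimal sketch of the kind of read-only search tool such an agent might call on demand. The post doesn't say which tools the reviewer actually has, so the function name `search_docs` and its parameters are hypothetical. This is not a pre-built index: the agent decides which terms to search for after reading the diff.

```python
from pathlib import Path


def search_docs(repo_root: str, terms: list[str],
                suffixes: tuple[str, ...] = (".md",)) -> dict[str, list[str]]:
    """Hypothetical agent tool: map each doc file to the given terms it mentions.

    The terms come from the agent's own reading of the diff (renamed
    identifiers, changed flow names); nothing is hardcoded here.
    """
    root = Path(repo_root)
    hits: dict[str, list[str]] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        found = [term for term in terms if term in text]
        if found:
            hits[path.relative_to(root).as_posix()] = found
    return hits
```

The agent would then read each returned file and judge whether it still matches the code. The search only narrows where to look.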

This is what makes it different from a local review skill. The local skill operates within the developer’s session context and depends on the developer choosing to run it. The CI reviewer always runs. It can’t be skipped. It’s not subject to the developer’s prompt or their session’s context window. Same principle as linting: the developer might not run it locally, but the CI will catch it regardless.

A common language for findings

The first piece was severity discipline. Not just “Critical means bad” but a precise contract: to classify something as Critical or Major, the reviewer must cite specific evidence from the diff. Not a heuristic, not a concern. Evidence.

The review prompt also includes a self-challenge step. Before finalizing severity, the reviewer asks: “Would a reasonable engineer push back on this?” If yes, downgrade. Severity is classified last, after the finding is fully described.

Critical: the change introduces something that will mislead or break. Evidence required: cite the exact line or phrase that causes the issue.

Major: a significant quality problem that will cause confusion or drift. Evidence required: cite the specific part of the diff that shows it. Ask: would a reasonable engineer push back?

Minor: improvement recommended, but not blocking.

Suggestion: optional. No action required.

This reduced false positives. More importantly, it made findings legible. Developers could tell at a glance whether a Critical was earned.
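The evidence contract can also be checked mechanically once the reviewer's output is parsed. The sketch below is an illustration, not the actual implementation: the `Finding` type, the downgrade-to-Minor policy for findings without evidence, and the default blocking threshold are all assumptions.

```python
from dataclasses import dataclass, replace
from enum import IntEnum


class Severity(IntEnum):
    SUGGESTION = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Finding:
    description: str
    severity: Severity
    evidence: str = ""  # exact line or phrase cited from the diff


def enforce_evidence(finding: Finding) -> Finding:
    """Assumed policy: Critical/Major without cited evidence becomes Minor."""
    if finding.severity >= Severity.MAJOR and not finding.evidence.strip():
        return replace(finding, severity=Severity.MINOR)
    return finding


def blocks_merge(findings: list[Finding],
                 threshold: Severity = Severity.CRITICAL) -> bool:
    """True if any finding at or above the threshold survives the evidence check."""
    return any(enforce_evidence(f).severity >= threshold for f in findings)
```

The threshold is a parameter because the text only states that Critical findings fail the check; a team that also blocks on Major would pass `Severity.MAJOR`.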

Findings as desired-state descriptions

The second piece was how findings were written. Early versions included things like “change line 42 to X.” That breaks the moment someone pushes a fix commit. Line numbers shift.

The format I settled on: describe the desired state, not the change.

In `docs/architecture.md`, ensure the diagram reflects that the validator
runs before the transformer, because the code was updated in this PR and
the diagram now shows the wrong order.

Paste this into any AI agent and it can satisfy it. No line numbers. No exact replacement text. Just: here’s what needs to be true, and here’s why.
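One way to keep findings in this shape is to build them from three parts (where, what must be true, why) and reject anything that mentions a line number. This is a sketch under assumptions: the helper name `fix_prompt` and the line-number check are illustrative, not part of the setup described here.

```python
import re

# Matches phrases like "line 42" or "lines 10"; these go stale after a fix commit.
_LINE_REF = re.compile(r"\blines?\s+\d+", re.IGNORECASE)


def fix_prompt(path: str, desired_state: str, reason: str) -> str:
    """Render a finding as a desired-state description with no line numbers."""
    prompt = f"In `{path}`, ensure {desired_state}, because {reason}."
    if _LINE_REF.search(prompt):
        raise ValueError("describe the desired state, not a line-numbered edit")
    return prompt
```

Called with the three parts of the example above, it reproduces that finding as a single line.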

The local fix skill with a self-verify loop

The third piece solved whack-a-mole. A local fix skill that reads the CI comment, fixes all Critical and Major findings, then re-runs the same review criteria to verify. It loops until the check is clean, up to three passes.

The key: same criteria. The skill doesn’t use a looser version of the review. It runs exactly what CI would run, from the same prompt file. So when the skill says it’s clean, CI agrees. One push, one green check.

If the loop doesn’t converge after three passes, the skill stops and reports what’s left. The developer can handle the remaining findings manually or use the override. In practice, three passes is almost always enough.
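The control flow of that loop is small enough to sketch. The version below is a simplified illustration: `apply_fixes` and `review` are hypothetical callables standing in for the agent's fix step and for re-running the shared review prompt, and findings are reduced to plain strings.

```python
from typing import Callable, Sequence

MAX_PASSES = 3  # the cap described above


def fix_until_clean(
    initial_findings: Sequence[str],
    apply_fixes: Callable[[Sequence[str]], None],
    review: Callable[[], list[str]],
    max_passes: int = MAX_PASSES,
) -> tuple[list[str], int]:
    """Fix, re-review with the same criteria, repeat until clean or out of passes.

    Returns the findings still open and the number of passes used.
    """
    remaining = list(initial_findings)  # e.g. parsed from the CI comment
    passes = 0
    while remaining and passes < max_passes:
        apply_fixes(remaining)
        remaining = review()  # same prompt file CI uses
        passes += 1
    return remaining, passes
```

An empty first element means the skill believes CI will pass. A non-empty one is what gets reported back to the developer.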

The UX That Makes It Stick

The gate is only as good as the experience of hitting it.

When CI blocks a PR, the comment includes one thing front and center: a copy-pasteable command to fix everything.

/fix-llm-feedback https://github.com/your-org/your-repo/pull/123

The developer copies it, runs it locally, pushes. The gate feels like “CI helped me fix this” rather than “CI blocked me and I have to figure out why.”

Individual findings are there too, in collapsible sections, each with its paste-ready fix prompt. For cases where the developer disagrees with a finding, there’s an override label. Add it to the PR, re-run the workflow, done. No escalation required.
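As a rough sketch of how such a comment could be assembled: the fix command goes first, then one collapsible section per finding using the HTML `<details>` element, which GitHub renders as a collapsed section in comments. The function name, the tuple layout, and the exact wording are assumptions for illustration.

```python
import html


def render_comment(pr_url: str, findings: list[tuple[str, str, str]]) -> str:
    """Build the PR comment body: fix command first, findings collapsed below.

    Each finding is a (severity, title, fix_prompt) tuple.
    """
    lines = [
        "Run this locally to fix all blocking findings:",
        "",
        f"    /fix-llm-feedback {pr_url}",
        "",
    ]
    for severity, title, prompt in findings:
        lines += [
            "<details>",
            f"<summary>{html.escape(severity)}: {html.escape(title)}</summary>",
            "",
            prompt,
            "",
            "</details>",
            "",
        ]
    return "\n".join(lines)
```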

The override exists for edge cases. In practice, it’s rarely used. When the fix path takes under two minutes, there’s little reason to bypass the gate. The friction of overriding is about the same as just fixing. And since the override is a PR label, it’s visible in the history. Anyone reviewing the PR can see it was overridden. That’s enough accountability without adding process.
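The override itself reduces to a single check in the gate's final decision. A minimal sketch, assuming a hypothetical label name `llm-review-override` and a count of blocking findings computed earlier in the workflow:

```python
def gate_result(blocking_count: int, pr_labels: set[str],
                override_label: str = "llm-review-override") -> tuple[bool, str]:
    """Decide whether the check passes, and say why, so the reason shows up in CI logs."""
    if blocking_count == 0:
        return True, "no blocking findings"
    if override_label in pr_labels:
        return True, f"{blocking_count} blocking finding(s) overridden via label"
    return False, f"{blocking_count} blocking finding(s)"
```

Returning the reason alongside the verdict keeps an overridden pass distinguishable from a clean one.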

Tradeoffs and Gotchas

I should be honest about what this costs and what it doesn’t solve.

Latency. LLM review adds time to every CI run. Not seconds. Minutes. On a team that ships fast, that’s real. I’ve accepted it because the alternative (manual review catching the same things repeatedly) costs more.

False positives still happen. The severity definitions and self-challenge step reduced them significantly. They didn’t eliminate them. The override label is the pressure valve. It doesn’t require justification. If the gate is wrong, skip it.

Prompt maintenance is real work. As the project evolves, the review criteria need to evolve too. New conventions need to be added. Old ones that no longer apply need to be removed. A stale review prompt is worse than no prompt. It blocks for the wrong reasons.

Cost. Running an LLM review on every PR isn’t free. Worth it, in our case, because the value compounds. Every time the gate catches a documentation drift, it’s protecting future agents from working with stale context. Those agents read the rule files to make decisions. Stale rule files produce stale decisions, which produce more drift. The gate breaks that cycle early.

Same Principle, New Capability

AI agents are powerful. They write code fast, follow instructions diligently, and produce clean PRs. But they don’t have the full picture, and at AI velocity, neither do the humans alongside them.

The patterns we’ve relied on for years (CI gates, linting, automated checks) still work. They just need to evolve for a world where the code is written by agents with bounded context. An LLM-based review gate is a CI gate adapted for that world: same principle, new capability.

If you’ve run into the same coverage gap, tried something similar, or found a different way to keep AI-assisted teams consistent, I’d love to compare notes.