Essay · Epistemic Engineering

The Disagreement Engine

On building AI systems that are useful precisely because they refuse to agree.

thereserve.fi 2026

There is a design flaw at the center of almost every AI tool deployed in the world today, and it is not the one you are probably thinking of. It is not hallucination, though hallucination is real. It is not bias in the training data, though that too is real. The flaw is structural, and it operates at the level of the interaction rather than the model: the AI is optimized to make you feel good about the conversation.

This matters because the things you most need an intelligent system to do (find the hole in your argument, surface the scenario you haven't considered, name the bias operating in your reasoning) are precisely the things that most conflict with making you feel good. Satisfaction and accuracy are not the same thing. Kahneman's work on the illusion of validity made a version of this point decades ago: how confident a judgment feels says little about whether it is correct. The technology industry has not yet absorbed it.

The result is a class of AI systems that are sophisticated validators of whatever you already believe. You arrive with a conviction. The AI, trained on human feedback that rewards agreement, confirms it. You leave more confident than you arrived. Both parties have optimized for the wrong thing — and the human is worse off than if they had thought alone, because at least solitary error is honest about its origin.

This essay is about a different design. It describes a protocol — the Multi-Agent Disagreement (MAD) system — built on the premise that the most useful thing an AI can do for a human is argue with them. Not helpfully. Structurally.

I. Why One Model Is Not Enough

The standard critique of AI assistance is that a single model cannot genuinely challenge the person it is talking to, because that model has been trained to avoid exactly the kind of friction that genuine challenge requires. Ask a sycophantic AI to play devil's advocate, and it will — performing devil's advocacy pitched at the level of discomfort the user finds stimulating rather than the level that would actually expose their reasoning. This is adversarialism in costume. The underlying reward model is unchanged.

The intuitive response is to instruct the model more firmly: be more critical, push harder, do not retreat when I push back. This is a guardrail, and guardrails are brittle. They break under the pressure of the same human psychology they are designed to resist. A user who signals displeasure at being challenged will find that the model softens — not because the model is defective, but because it is functioning exactly as designed.

The MAD protocol attacks this problem at the architectural level. Rather than instructing a single model to hold multiple postures simultaneously — build the strongest version of this idea, now attack it, now judge the exchange — it assigns each posture to a distinct agent with an explicit, non-negotiable mandate. The separation is not cosmetic. Each agent is given a different model, a different optimization target framed into its prompt, and, crucially, no visibility into the other agents' incentive to please.

The protocol's fundamental insight is that you cannot instruct your way out of sycophancy. You have to design around it.

II. The Three Agents

The protocol defines three roles, each of which is incompatible with the others. They are not perspectives or modes. They are jobs, with distinct mandates and distinct failure conditions.

Agent A
The Architect

Build the most robust, high-functioning version of the idea. Ignore all risks. Focus entirely on what this looks like at its absolute best. Identify the three core assumptions required for success.

Agent B
The Inquisitor

Find the load-bearing flaw that causes the structure to collapse. You are not a hater; you are a safety inspector. If you miss the flaw, the system fails and people get hurt. Attack the strongest version of the idea, not the weakest.

Agent C
The Moderator

Map the residual logic — what remains after the collision between A and B. Do not comment on whether the idea is good or bad. Report only what survived and what did not. No praise. No compromise. No middle ground.

The Architect uses the most capable available model — depth, elaboration, the ability to hold a complex argument together are what the construction phase requires. The Inquisitor uses a different model — speed and precision, the capacity to find the specific joint where load transfers and ask whether that joint is sound. The Moderator, bound to the current session, synthesizes what neither adversarial agent can see from inside the exchange.

The choice of different models is not incidental. Even within the same model family, capability variation produces meaningfully different attack vectors. The agent that built the structure cannot credibly demolish it, not because of any technical limitation, but because the framing of construction contaminates the framing of demolition. Separate agents have separate priors. That separation is the mechanism.
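
The role separation described above can be written down as plain configuration. What follows is a minimal sketch, not the protocol's actual implementation: the `Agent` class, the `check_separation` helper, and the model identifiers (`model-deep`, `model-fast`, `model-session`) are hypothetical placeholders, and the mandates are abbreviated from the role descriptions in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    model: str    # placeholder identifier, not a real model name
    mandate: str  # abbreviated from the role descriptions above


AGENTS = {
    "A": Agent(
        "Architect",
        "model-deep",
        "Build the strongest version of the idea. Ignore all risks. "
        "Identify the three core assumptions required for success.",
    ),
    "B": Agent(
        "Inquisitor",
        "model-fast",
        "Find the load-bearing flaw. Attack the strongest version "
        "of the idea, not the weakest.",
    ),
    "C": Agent(
        "Moderator",
        "model-session",
        "Report only what survived and what did not. "
        "No praise. No compromise. No middle ground.",
    ),
}


def check_separation(agents: dict) -> None:
    """Raise ValueError if the builder and the attacker share a model."""
    if agents["A"].model == agents["B"].model:
        raise ValueError("Architect and Inquisitor must use different models")


check_separation(AGENTS)  # passes silently for the configuration above
```

Making the check a hard failure rather than a warning reflects the argument of this section: if the separation is the mechanism, a configuration that collapses it should not run at all.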

III. The Five Phases

The protocol runs in five phases, and the sequencing is not arbitrary. Each phase is designed to make the next phase harder to evade.

Phase 1 — Architectural Build. The Architect constructs the strongest possible version of the idea and names three core assumptions. This step is optimistic by mandate. It produces the object that will be stress-tested, and it forces the idea into its most defensible form before the attack begins. A protocol that attacks the weakest version of an idea is a ritual. A protocol that attacks the strongest version is a test.

Phase 2 — The Incursion. The Inquisitor receives the Architect's output verbatim — not summarized, not paraphrased — and does four things: identifies the single most dangerous vulnerability; stress-tests each of the three core assumptions; identifies a regime change scenario (where the logic does not merely bend but breaks, producing a qualitatively different outcome); and runs a hallucination audit, separately flagging unsupported factual claims. The distinction between logical flaws and factual gaps matters. They require different responses.
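
The Phase 2 handoff can be sketched in a few lines. This is an illustration under stated assumptions: `build_incursion_prompt`, `INCURSION_TASKS`, and `IncursionReport` are hypothetical names, and the prompt wording is paraphrased from the description above rather than taken from the protocol's real prompt. The two points it tries to capture are that the Architect's text is passed through untouched and that logical flaws and factual flags are stored in separate fields.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The four tasks named in Phase 2, paraphrased.
INCURSION_TASKS = (
    "Identify the single most dangerous vulnerability.",
    "Stress-test each of the three core assumptions.",
    "Identify a regime change scenario where the logic breaks rather than bends.",
    "Run a hallucination audit and flag unsupported factual claims separately.",
)


def build_incursion_prompt(architect_output: str) -> str:
    """Embed the Architect's output verbatim: no summary, no paraphrase."""
    tasks = "\n".join(
        f"{i}. {task}" for i, task in enumerate(INCURSION_TASKS, start=1)
    )
    return (
        "You are the Inquisitor.\n"
        f"{tasks}\n\n"
        "ARCHITECT OUTPUT (verbatim):\n"
        f"{architect_output}"
    )


@dataclass
class IncursionReport:
    primary_vulnerability: str
    assumption_findings: dict[str, str]  # assumption -> stress-test result
    regime_change_scenario: str
    # Kept apart on purpose: the two kinds of finding need different responses.
    logical_flaws: list[str] = field(default_factory=list)
    hallucination_flags: list[str] = field(default_factory=list)
```

Keeping `logical_flaws` and `hallucination_flags` as distinct lists means a later phase can patch an argument without pretending to have verified a fact, and the reverse.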

Phase 3 — The Rebuttal. The Architect receives the Inquisitor's attack and attempts to patch each vulnerability. The mandate is specific: if a hole is genuinely unpatchable, concede it explicitly. Do not rationalize away real flaws. This is the phase where the protocol most strains against the model's trained disposition — the model wants to defend the structure it built. The mandate forces it to choose between defense and honesty. Where it chooses defense, the Inquisitor will find out.

Phase 3.5 — Inquisitor Final Pass. The Inquisitor evaluates the patches. For each patch: does it hold structurally, or does it introduce new unverified dependencies? Is the patch itself load-bearing — does it only work if conditions that have not been established are true? The verdict for each patch is one of three: HOLDS, LOAD-BEARING, or FAILS. This phase closes the loop that most adversarial protocols leave open. A rebuttal that goes unevaluated is an assertion. An evaluated rebuttal is evidence.
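
The three-way patch verdict can be sketched as a small classification. In the protocol the judgment itself is made by the Inquisitor; this sketch only shows how its two answers might map onto the three labels. `PatchReview` and `classify_patch` are hypothetical names, and reducing the review to one boolean plus a list of conditions is a simplifying assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class PatchVerdict(Enum):
    HOLDS = "HOLDS"
    LOAD_BEARING = "LOAD-BEARING"
    FAILS = "FAILS"


@dataclass
class PatchReview:
    # Does the patch structurally close the vulnerability it targets?
    closes_vulnerability: bool
    # Conditions the patch relies on that have not been established.
    unestablished_conditions: list[str] = field(default_factory=list)


def classify_patch(review: PatchReview) -> PatchVerdict:
    """Map an Inquisitor review onto HOLDS, LOAD-BEARING, or FAILS."""
    if not review.closes_vulnerability:
        return PatchVerdict.FAILS
    if review.unestablished_conditions:
        # The patch works only if unverified conditions turn out to be true.
        return PatchVerdict.LOAD_BEARING
    return PatchVerdict.HOLDS
```

For example, `classify_patch(PatchReview(True, ["supplier agrees to terms"]))` returns `PatchVerdict.LOAD_BEARING`: the patch closes the hole, but only on a condition nobody has confirmed.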

Phase 4 — Judicial Summary. The Moderator maps surviving logic against structural compromises and issues one of three verdicts:

Verdict | Criteria
Airworthy | All three core assumptions survived stress tests. Regime change scenario was defended or shown implausible. Hallucination flags were minor or resolved.
Groundable | One or two assumptions survived. Patches proposed but load-bearing: they depend on conditions that have not been established.
In Need of Redesign | Regime change scenario undefended. A core assumption failed with no viable patch. Hallucinated facts are structural to the argument.

The verdict is not an opinion. It follows from the table. The Moderator is forbidden from suggesting compromise, praising the idea, or hedging toward "both sides have merit." The function of a judge who issues no verdict is indistinguishable from the function of no judge at all.
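
If the verdict follows from the table, it can be expressed as a function of the findings. The sketch below is one possible reading, not the protocol's own code. `SessionFindings` and `judge` are hypothetical names, and two choices are assumptions the table does not settle: any redesign condition takes precedence over the other rows, and a session in which no core assumption survived is treated as a redesign.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    AIRWORTHY = "Airworthy"
    GROUNDABLE = "Groundable"
    REDESIGN = "In Need of Redesign"


@dataclass(frozen=True)
class SessionFindings:
    assumptions_survived: int           # 0 to 3 core assumptions
    unpatched_assumption_failure: bool  # a core assumption failed, no viable patch
    regime_change_defended: bool        # defended or shown implausible
    structural_hallucination: bool      # hallucinated facts carry the argument


def judge(findings: SessionFindings) -> Verdict:
    """Apply the verdict table; there is no branch for compromise."""
    if not 0 <= findings.assumptions_survived <= 3:
        raise ValueError("assumptions_survived must be between 0 and 3")
    # Assumption: any redesign trigger overrides the other rows.
    if (
        not findings.regime_change_defended
        or findings.unpatched_assumption_failure
        or findings.structural_hallucination
        or findings.assumptions_survived == 0  # assumption: not in the table
    ):
        return Verdict.REDESIGN
    if findings.assumptions_survived == 3:
        return Verdict.AIRWORTHY
    return Verdict.GROUNDABLE
```

The function has three return values and no fourth. That is the code-level version of forbidding the Moderator from hedging toward the middle.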

IV. What the Protocol Actually Produces

The output of a MAD session is not a better version of your idea. It is a clearer map of where your idea is solid and where it is not. These are different things, and confusing them is the most common misuse of structured critique.

A session that finds no flaws should increase your confidence modestly, assuming the Inquisitor was genuinely trying to find them and not performing rigor. A session that finds a flaw in a core assumption and cannot patch it should stop you. Not discourage you: stop you. "Groundable" means the plane can taxi. It does not mean the plane can fly.

The more useful output is what the protocol surfaces about the structure of your reasoning rather than its content. Which assumptions did you treat as obvious? Where did you anchor on a single piece of evidence? Which failure scenario did you instinctively categorize as implausible — and is that categorization itself available for audit? The Inquisitor does not care about your conclusion. It cares about the joints where the load transfers. That is where structures fail.

The protocol does not tell you whether your idea is good. It tells you whether it is load-bearing — whether the weight you are putting on it is weight it can actually hold.

V. The Honest Limitations

Any protocol that does not name its own failure modes is selling something.

The MAD protocol has three structural limitations that no version of the prompt can eliminate.

Same-family model bias. Using two models from the same developer and training lineage produces capability variance, not orthogonality. Both the Architect and the Inquisitor were trained on the same corpora, shaped by the same constitutional principles, and calibrated against the same preference data. Where they agree, that agreement is not independent confirmation. It is convergence toward a shared prior. The protocol is more powerful when the models come from genuinely different families — different training pipelines, different feedback mechanisms, different systematic errors. Within a single family, the adversarial structure produces real value. But the value is real precisely in proportion to how different the agents actually are, and that difference is smaller than the architecture implies.

Disclosure is not mitigation. The protocol names its failure modes openly. The Moderator is required to note when patches are load-bearing rather than structural. The Inquisitor is required to flag hallucinations separately from logical flaws. But naming a vulnerability does not close it. A user who reads "sycophancy residual is real" and proceeds as if it were not is in the same epistemic position as a user who was never told. The protocol makes the failure modes legible. It does not make them inert.

The protocol is weakest when it is most needed. High-stakes decisions — the ones where a regime change scenario would be catastrophic, where a load-bearing assumption is doing the most work — are precisely the decisions where human psychology is most resistant to genuine challenge. Ego investment, loss aversion, and motivated reasoning peak at exactly the moment the Inquisitor needs access. The protocol can be abandoned. A user who finds the Phase 2 output intolerable can close the session. No structural commitment survives that choice. The protocol cannot force the ape into the ring. It can only make the ring available.

VI. Why It Matters Anyway

The limitations above are not arguments against the protocol. They are arguments for precision about what the protocol is for.

The MAD system is not a replacement for judgment. It is an audit function — the epistemic equivalent of a financial review or a code inspection. Audits do not make companies profitable. They make bad accounting visible before it becomes catastrophic. The value is not in the finding of flaws per se; it is in the forcing of explicit accountability for the assumptions that have been treated as invisible.

The deepest argument for structured adversarial AI collaboration is one that applies across every version of the protocol: the quality of a decision is not visible from inside the reasoning that produced it. Confirmation bias feels like evidence. Narrative coherence feels like truth. Anchoring feels like having a sensible starting point. These feelings are not alerts. They are the failure modes, doing their job. The only way to see them is from outside — and the only way to be outside your own reasoning is to hand it to something whose job is to find the load-bearing flaw.

Most human environments do not provide this. Social environments punish disagreement. Hierarchies reward confidence. Advisors who share your cognitive architecture share your blind spots. For the first time, we have access to systems that fail differently than we do — that can hold the full distribution of outcomes visible when we collapse it to a narrative, that can generate failure scenarios without the emotional flinch that makes us edit them out, that can steelman the opposition without the social cost that makes humans avoid doing so.

The question is not whether these systems are reliable. They are not: not fully, not yet, not by themselves. The question is whether the combination is more reliable than either component alone. The argument, at the level of mechanism, says yes. The combination is not a solution. It is a better error rate.

That is enough to matter. Better error rates, applied consistently to decisions with real consequences, compound. They compound in the same direction as calibration: slowly, without drama, in ways that are only legible in retrospect. The ape that uses the protocol is not smarter. It is wrong less often, in different places, in ways that are harder to miss.

Given the alternative — a flawed human and a flattering machine, each confirming the other's errors — that is a meaningful improvement. Build the engine. Let it disagree with you. Pay attention to what survives.

The MAD (Multi-Agent Disagreement) protocol was designed through adversarial collaboration between a human and Claude (Anthropic). Its initial specification was produced by Gemini (Google DeepMind) and subsequently stress-tested, patched, and evaluated using the protocol itself — which is to say: the tool was built with the tool it describes.

This essay is part of a series. See also The Calibrated Ape and Dangerous Minds.