Why "Multi-Agent" Doesn't Mean "Independent": Lessons from Compliance AI in Oil & Gas

The promise of multi-agent AI systems for regulatory compliance is intuitive: if one language model can make mistakes, surely three will make fewer. Deploy an ensemble of LLM agents over your regulatory corpus, let them vote, and watch accuracy climb. It's a compelling story. It's also largely wrong — and the oil and gas sector in Brazil offers an unusually clear lens for understanding why.

The Core Problem: Multiplicity Is Not Independence

Consider what actually happens when you run three LLM agents over the same RAG corpus to verify a regulatory classification. Each agent ingests the same documents, activates similar internal representations, and produces outputs that are statistically correlated in exactly the failure modes that matter. You get multiple tokens, not multiple perspectives.

This isn't speculation. Berg, Kölbel, and Rigobon (MIT Sloan, 2019) documented a correlation of only ~0.5 between ESG rating agencies, organizations with different methodologies, different teams, and different incentive structures; that level of independence is hard to achieve even between separate institutions, let alone between agents that share a corpus. Sharma et al. (2023) demonstrated systematic sycophancy in LLMs, where models shift their answers toward the views a user expresses instead of holding to an independent assessment. Magesh et al. (2025) reported hallucination rates of 17–33% in commercial RAG-based legal research tools that were marketed as hallucination-resistant.

The problem has a long history in software engineering. Knight and Leveson's N-version programming experiments in 1986 already showed that independent teams of programmers produce correlated errors on the same hard subproblems. The same failure mode appears in LLM ensembles, ESG rating agencies, and inter-rater reliability studies in waste classification.

The concept we need is epistemic independence (IE): the degree to which two verifiers can fail independently. Multiplicity without IE is noise amplification, not error correction.
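The intuition that "three will make fewer" mistakes can be checked with a small calculation. The sketch below uses an assumed common-cause model that is not part of any cited study: with probability rho all three verifiers share a single pass/fail outcome, otherwise they fail independently. The function name and parameters are illustrative only.

```python
def majority_error(p: float, rho: float) -> float:
    """Error rate of a 3-verifier majority vote under a common-cause model.

    Assumed model (illustrative): with probability rho all three verifiers
    share one pass/fail draw; otherwise they fail independently. Each verifier
    has marginal error rate p, and rho is also the pairwise correlation
    between their error indicators.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= rho <= 1.0):
        raise ValueError("p and rho must lie in [0, 1]")
    # Independent case: the vote is wrong when at least 2 of 3 verifiers fail.
    independent = 3 * p**2 * (1 - p) + p**3
    # Shared case: the vote is wrong exactly when the single shared draw fails.
    return rho * p + (1 - rho) * independent


for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}  majority error={majority_error(0.2, rho):.3f}")
```

Under this model, three verifiers with a 20% individual error rate reach about 10.4% when fully independent, 15.2% at rho = 0.5, and no improvement at all (20%) when their failures coincide.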

A Framework Built Around Epistemic Tiers

A more rigorous approach decomposes compliance decisions by their epistemic character and routes each sub-decision to the oracle type best suited to it:

Tier 1 — Formal oracles: deterministic lookup against codified rules (list membership, numeric thresholds, boolean predicates). Implementable in SMT solvers, Datalog, or policy engines like Cedar or OPA.
Tier 2 — Statistical oracles: calibrated ML models with explicit confidence intervals, anomaly detectors, ensemble regressors. Appropriate where ground truth exists but uncertainty is irreducible.
Tier 3 — Structured human judgment: protocols like IDEA or SHELF that elicit expert probability distributions while controlling anchoring and social influence biases.
Tier 4 — Deliberation: multi-stakeholder governance for decisions involving incommensurable values — cost vs. environmental impact vs. social license.

The key architectural constraint: verifiers that cross-check the same decision must draw from structurally independent sources, whether they sit at the same tier or at different ones. A Tier 2 ML model trained on operator sensor data and a Tier 1 symbolic oracle over a canonical regulatory corpus are epistemically independent. Two LLMs reading the same PDF are not.
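One way to make this constraint auditable is to declare each verifier's upstream sources and flag any pair that overlaps. The sketch below is a minimal illustration; the `Verifier` type, the verifier names, and the source identifiers are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Verifier:
    name: str
    sources: frozenset  # identifiers of the upstream evidence it consumes


def shared_source_pairs(verifiers):
    """Return (name_a, name_b, shared_sources) for every overlapping pair."""
    conflicts = []
    for a, b in combinations(verifiers, 2):
        shared = a.sources & b.sources
        if shared:
            conflicts.append((a.name, b.name, sorted(shared)))
    return conflicts


# Hypothetical configuration mirroring the examples in the text.
verifiers = [
    Verifier("llm_a", frozenset({"supplier_pdf"})),
    Verifier("llm_b", frozenset({"supplier_pdf"})),
    Verifier("sensor_model", frozenset({"operator_sensor_data"})),
    Verifier("rule_oracle", frozenset({"regulatory_corpus"})),
]
print(shared_source_pairs(verifiers))
```

Disjoint sources are a necessary condition only. Two verifiers built on the same model family can still fail together, so a check like this is a first filter and not a measurement of independence.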

Three Processes, Three Readiness Profiles

Brazilian oil and gas operations offer three natural test cases for this framework, each with a distinct readiness profile.

Operational Waste Management (P1)

This is the strongest pilot candidate by a wide margin. A large operator generates tens of thousands of waste transfer records (MTRs) per day, each requiring classification under ABNT NBR 10004:2024 and routing under CONAMA 362/430/499 and IBAMA's SINIR/MTR system.

The framework maps cleanly here because the three epistemic sources are genuinely independent:

A third-party laboratory report (Tier 1/2 for threshold checks)
Operational sensor and ERP data (Tier 2 for process-origin inference)
The canonical regulatory corpus (Tier 1 for list lookup and routing constraints)

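# Illustrative Tier 1 oracle: NBR 10004:2024 list lookup.
# `lgr` is assumed to map waste codes to entries with "class" and "hazard" keys.
def classify_waste_tier1(waste_code: str, lgr: dict) -> dict:
    entry = lgr.get(waste_code)
    if entry is None:
        # Code is not in the list: hand the decision to the statistical tier.
        return {"tier": 1, "result": "not_listed", "escalate_to": "tier2"}
    return {
        "tier": 1,
        "result": "listed",
        "class": entry["class"],          # "Classe 1" or "Classe 2"
        "hazard_flags": entry["hazard"],
        "confidence": 1.0,                # deterministic lookup
        "escalate_to": None,
    }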

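# Illustrative Tier 2 oracle: ML model for borderline classification.
# Assumes `model` exposes a scikit-learn-style predict_proba that accepts a
# one-element list of feature dicts (for example, a pipeline that vectorizes
# dicts before the classifier).
def classify_waste_tier2(features: dict, model, threshold: float = 0.85) -> dict:
    prob = model.predict_proba([features])[0]
    top = float(max(prob))                # probability of the most likely class
    confident = top >= threshold
    return {
        "tier": 2,
        "class_prob": prob,
        "confident": confident,
        "escalate_to": None if confident else "tier3",
    }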

The critical tension is the Classe 1 vs. Classe 2 boundary for novel or mixed streams — spent FCC catalyst, additive-laden drilling muds. This is the canonical case where the heterogeneous framework should outperform any monolithic pipeline, because the decision simultaneously requires deterministic list lookup, statistical uncertainty quantification over laboratory measurements, and interpretive judgment about process origin.

NBR 10004:2024 also introduced a significant structural change in November 2024: the old three-class scheme (I, IIA, IIB) was replaced by a two-class scheme (Classe 1 hazardous, Classe 2 non-hazardous), with the Laudo de Classificação de Resíduo (LCR) now a formal document under the generator's legal responsibility. The two versions coexist until December 2026 — itself a source of decisional noise that any deployed system must model explicitly.
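A deployed system could model the coexistence explicitly by recording which scheme a label came from before any downstream routing. The sketch below is one possible shape. The label-level mapping (I to Classe 1, IIA and IIB to Classe 2) is an assumption based on the hazardous versus non-hazardous split described above and does not re-apply any classification criteria. The function name, the field names, and the exact cutoff date are hypothetical.

```python
from datetime import date

# Assumption: label-level correspondence only; no criteria are re-evaluated.
LEGACY_TO_CURRENT = {"I": "Classe 1", "IIA": "Classe 2", "IIB": "Classe 2"}
CURRENT_CLASSES = {"Classe 1", "Classe 2"}


def normalize_class(label: str, issued_on: date, transition_end: date) -> dict:
    """Map a class label from either scheme onto the two-class scheme."""
    if label in CURRENT_CLASSES:
        return {"class": label, "scheme": "current", "needs_review": False}
    if label in LEGACY_TO_CURRENT:
        return {
            "class": LEGACY_TO_CURRENT[label],
            "scheme": "legacy",
            # A legacy label issued after the coexistence window is suspect.
            "needs_review": issued_on > transition_end,
        }
    raise ValueError(f"unknown class label: {label!r}")


# Placeholder cutoff: the text says only "December 2026".
cutoff = date(2026, 12, 31)
print(normalize_class("IIA", date(2025, 3, 1), cutoff))
print(normalize_class("I", date(2027, 1, 15), cutoff))
```

Keeping the originating scheme in the output lets later tiers treat legacy and current labels as distinct evidence instead of silently merging them.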

Supplier and Recycled Materials Qualification (P2)

The normative coverage is reasonable, but the epistemic independence problem is acute. A multi-agent ESG screen — LLM-A plus LLM-B plus an ML classifier — looks like three verifiers. If all three consume the same self-declared sustainability PDF from the supplier, they provide one perspective with decorative multiplicity.

Ong et al. (2025) benchmarked LLMs on 1,679 SGX sustainability reports for greenwashing detection and found that state-of-the-art models rely on superficial lexical patterns with poor cross-category generalization. Bingler et al. documented the dual-use risk: the same LLMs that detect greenwashing can produce it, creating hidden correlation between writer and auditor when both are drawn from the same model family.

The mitigation is architectural: at least one verifier must be routed to a structurally independent state source — IBAMA's CTF/APP registry, CETESB's SIMA licensing database, ANP's SIMP. Physical resampling as Tier 1 ground truth, periodic and unannounced, is the only reliable anchor against document-level correlation.

End-of-Life Equipment (P3)

The literature here is the most mature — Caprace et al. (2025) validated MCDA PROMETHEE with 37 attributes on the Brazilian Espadarte field; Nguyen et al. (2022) used gradient boosting and ANNs to predict platform removal age in the Gulf of Mexico. CNEN NN 8.01 establishes a hard 1 Bq/g threshold for unconditional NORM clearance, making it a clean Tier 1 predicate.

The problem is that the high-consequence decisions — whole-platform life extension, borderline NORM at 0.8–1.5 Bq/g, rig-to-reef disposition — structurally resist formalization. Not because the technology is immature, but because failures are catastrophic and irreversible, data is sparse, and regulatory acceptability is partly endogenous to the decision itself. Routing these to Tier 3 structured judgment is not a concession; it's the correct epistemic assignment.
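The split between a clean Tier 1 predicate and a Tier 3 escalation can be sketched for the NORM case. The example below uses the 1 Bq/g threshold and the 0.8–1.5 Bq/g borderline band quoted above; the function name, the return fields, and the choice to ignore measurement uncertainty are assumptions made for illustration.

```python
CLEARANCE_LIMIT_BQ_G = 1.0          # threshold quoted in the text
BORDERLINE_BAND_BQ_G = (0.8, 1.5)   # borderline range quoted in the text


def norm_clearance(activity_bq_g: float) -> dict:
    """Decide NORM clearance, escalating borderline readings to Tier 3."""
    if activity_bq_g < 0:
        raise ValueError("activity concentration cannot be negative")
    low, high = BORDERLINE_BAND_BQ_G
    if low <= activity_bq_g <= high:
        # Too close to the limit for a deterministic answer.
        return {"tier": 1, "result": "borderline", "escalate_to": "tier3"}
    cleared = activity_bq_g < CLEARANCE_LIMIT_BQ_G
    return {
        "tier": 1,
        "result": "cleared" if cleared else "not_cleared",
        "escalate_to": None,
    }


for reading in (0.3, 1.0, 2.0):
    print(reading, norm_clearance(reading))
```

The predicate stays deterministic outside the band, and everything inside it is handed to structured judgment instead of being forced through a threshold comparison.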

The practical implication: a pilot here should be scoped to valves and heat exchanger bundles in a single refinery turnaround, not to offshore decommissioning. The latter involves Board-level decisions with US$-billion consequences and active political dimensions — precisely the Tier 4 territory where the framework provides process structure, not automated answers.

What the Empirical Literature Demands

The case against monolithic RAG pipelines for regulatory compliance is now empirically grounded, not merely theoretical. Dahl et al. (2024) documented citation hallucination rates of 58–88% for frontier LLMs on legal tasks. Verbalized confidence intervals are systematically too narrow (FermiEval, arXiv:2510.26995). Sycophancy is endemic and has been shown to correspond to separable activation directions in model internals (arXiv:2509.21305). Adaptive prompt injection attacks break all known defenses (Debenedetti et al., AgentDojo).

These results, taken together, constitute the empirical argument for architectural heterogeneity. The diversity prediction theorem provides the mathematical foundation: the squared error of the averaged prediction equals the average individual squared error minus the variance of the individual predictions around their mean. But that variance term is only large when the verifiers' errors are not strongly correlated; verifiers that fail together also agree with each other, which leaves little diversity to subtract. Achieving that requires deliberate architectural choices about source independence, not just adding more agents.
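The identity can be verified numerically. The sketch below computes the three terms for two sets of made-up predictions against a made-up true value of 10: one set whose errors point in different directions and one whose errors are shared. The function name is illustrative.

```python
def diversity_decomposition(predictions, truth):
    """Return (collective_error, avg_individual_error, diversity).

    collective_error     = (mean prediction - truth)^2
    avg_individual_error = mean of (prediction - truth)^2
    diversity            = mean of (prediction - mean prediction)^2
    The theorem states: collective = avg_individual - diversity.
    """
    n = len(predictions)
    mean_pred = sum(predictions) / n
    collective = (mean_pred - truth) ** 2
    avg_individual = sum((p - truth) ** 2 for p in predictions) / n
    diversity = sum((p - mean_pred) ** 2 for p in predictions) / n
    return collective, avg_individual, diversity


# Made-up numbers for illustration only.
for name, preds in [("diverse", [8, 10, 12]), ("correlated", [12, 12, 13])]:
    c, a, d = diversity_decomposition(preds, truth=10)
    print(f"{name:>10}: collective={c:.3f} avg_individual={a:.3f} diversity={d:.3f}")
```

In the first set the errors cancel and diversity absorbs the entire individual error. In the second, all three verifiers overshoot together, diversity is small, and the averaged prediction is nearly as wrong as each individual one.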

The Measurement Gap

Perhaps the most striking finding from surveying this space: inter-rater reliability in Brazilian environmental compliance has never been empirically measured. No published study quantifies agreement between technical assessors on NBR 10004 classification, LCR issuance, or ANP 817 comparative assessment. The noise is universally assumed; it has never been instrumented.

This is simultaneously a gap in the literature and an opportunity. A pilot that deploys three epistemically independent verifier types on borderline waste classification decisions (a symbolic Tier 1 oracle over the codified LGR, a calibrated Tier 2 ML model trained on operator laboratory history, and two independent technical assessors under a blinded IDEA protocol) would produce exactly the kind of disagreement data needed to operationalize IE as a measurable quantity rather than an architectural aspiration.
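As a starting point for instrumenting that disagreement, chance-corrected agreement between two assessors can be computed with Cohen's kappa. The sketch below is a plain-Python version. The labels are invented for illustration and are not pilot data; kappa is offered as one common agreement statistic, not as the definition of IE.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labelled at random with their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for eight borderline waste streams.
assessor_1 = ["Classe 1", "Classe 1", "Classe 2", "Classe 2",
              "Classe 1", "Classe 2", "Classe 2", "Classe 2"]
assessor_2 = ["Classe 1", "Classe 2", "Classe 2", "Classe 2",
              "Classe 1", "Classe 2", "Classe 1", "Classe 2"]
print(f"kappa = {cohens_kappa(assessor_1, assessor_2):.3f}")
```

With ground truth available, the same paired records also allow a direct comparison of each verifier's errors, which is closer to what IE is meant to capture than agreement alone.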

Crucially, this pilot has something most AI compliance benchmarks lack: authoritative ground truth. The CDF (Certificado de Destinação Final) and IBAMA/CETESB enforcement records provide post-hoc validation that ESG benchmarks built on self-declared disclosures cannot offer.

Conclusion

The heterogeneous verification framework isn't a novel invention — it's the rigorous composition of four mature traditions: formal oracles, calibrated statistical ensembles, structured expert elicitation, and deliberative governance. What's new is the metric (epistemic independence) that makes the composition principled rather than intuitive, and the empirical literature that makes it non-optional.

The practical takeaway for teams building compliance AI in regulated industries: audit your architecture for source independence before auditing for model accuracy. A single LLM with access to the right canonical sources, paired with an independent statistical model and a structured human review protocol, will outperform three correlated LLMs reading the same document — and the Brazilian waste management context, with its dense regulatory corpus, high decision volume, and available ground truth, is an unusually good place to prove it.
