Content moderation systems face a critical evaluation paradox: human-agreement metrics assume a single ground truth, yet rule-governed domains inherently permit multiple defensible decisions. This Agreement Trap penalizes valid, policy-consistent outputs and misreports genuine ambiguity as classification error. The authors formalize an alternative, policy-grounded correctness, which asks whether a decision logically derives from the governing rule hierarchy rather than whether it matches historical labels.
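To make the contrast concrete, the toy sketch below (not the authors' implementation) scores the same decisions two ways: once against the historical moderator labels, and once against the set of outcomes permitted by a rule hierarchy. The rule predicates, decision labels, and cases are invented for illustration.

```python
# Toy contrast between label-agreement accuracy and a policy-grounded
# defensibility check. Rule predicates, decisions, and cases are hypothetical.

from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass
class Case:
    text: str
    historical_label: str   # what the human moderator recorded ("remove" / "keep")
    model_decision: str     # what the audited system decided

# A toy rule hierarchy: each rule maps a case to the set of decisions it permits.
RULES: List[Callable[[Case], Set[str]]] = [
    lambda c: {"remove"} if "slur" in c.text else {"remove", "keep"},
    lambda c: {"remove", "keep"},   # a vague rule that permits either outcome
]

def permitted_decisions(case: Case) -> Set[str]:
    """Decisions consistent with every rule in the hierarchy."""
    allowed = {"remove", "keep"}
    for rule in RULES:
        allowed &= rule(case)
    return allowed

cases = [
    Case("comment containing a slur", "remove", "remove"),
    Case("borderline sarcasm", "remove", "keep"),   # disagrees with the label, yet defensible
]

agreement = sum(c.model_decision == c.historical_label for c in cases) / len(cases)
defensible = sum(c.model_decision in permitted_decisions(c) for c in cases) / len(cases)

print(f"label agreement: {agreement:.2f}")   # 0.50 -- the second case counts as an error
print(f"policy-grounded: {defensible:.2f}")  # 1.00 -- both decisions derive from the rules
```

The second case illustrates the Agreement Trap: the decision diverges from the historical label, yet it sits inside the decision set the rules permit.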
The proposed framework introduces two complementary indices. The Defensibility Index (DI) measures whether a decision is logically consistent with the policy rules, while the Ambiguity Index (AI) quantifies inherent policy ambiguity: cases where multiple contradictory decisions satisfy the rule set. Critically, the authors derive a Probabilistic Defensibility Signal (PDS) from the LLM's token log-probabilities without additional inference passes. Rather than producing a classification of its own, the audit model functions as a reasoner: given a proposed decision and the rule hierarchy, it assesses whether the decision is logically derivable. A variance decomposition of PDS shows that measured ambiguity stems primarily from the governance structure rather than from decoding noise.
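A minimal sketch of how a PDS-like quantity could be read off an audit model's verdict-token log-probabilities, together with toy DI/AI aggregates and a between-/within-case variance split. The renormalization over two verdict tokens, the 0.5 threshold, and the decomposition are illustrative assumptions, not the paper's exact definitions.

```python
# Sketch of a PDS-style signal computed from verdict-token log-probabilities,
# plus illustrative DI/AI aggregates and a between-/within-case variance split.
# The renormalization, the 0.5 threshold, and the decomposition are assumptions.

import math
from statistics import mean, pvariance
from typing import Dict, List

def pds_from_logprobs(logp_yes: float, logp_no: float) -> float:
    """Probability mass on the 'defensible' verdict token after renormalizing."""
    p_yes, p_no = math.exp(logp_yes), math.exp(logp_no)
    return p_yes / (p_yes + p_no)

def defensibility_index(pds_per_decision: List[float], threshold: float = 0.5) -> float:
    """Fraction of decisions the audit model judges derivable from the rules."""
    return mean(p >= threshold for p in pds_per_decision)

def ambiguity_index(pds_decision: List[float], pds_opposite: List[float],
                    threshold: float = 0.5) -> float:
    """Fraction of cases where the decision AND its opposite both look defensible."""
    return mean(d >= threshold and o >= threshold
                for d, o in zip(pds_decision, pds_opposite))

def variance_decomposition(audits: Dict[str, List[float]]) -> Dict[str, float]:
    """Split PDS variance into between-case (rule structure) and
    within-case (decoding noise) shares across repeated audits of each case."""
    within = mean(pvariance(v) for v in audits.values())
    between = pvariance([mean(v) for v in audits.values()])
    total = within + between
    return {"between_case": between / total, "within_case": within / total}

# Toy usage: PDS for each case's decision and for the opposite decision,
# plus repeated audits of each case to separate noise from structure.
print(pds_from_logprobs(-0.2, -2.5))                # ~0.91: mass on 'defensible'
pds_decision = [0.91, 0.52, 0.08]
pds_opposite = [0.05, 0.61, 0.95]
print(defensibility_index(pds_decision))            # 2/3 of decisions judged defensible
print(ambiguity_index(pds_decision, pds_opposite))  # 1/3: only the middle case is ambiguous
print(variance_decomposition({
    "case_a": [0.91, 0.93, 0.90],
    "case_b": [0.52, 0.55, 0.49],
    "case_c": [0.08, 0.10, 0.07],
}))  # between-case variance dominates: ambiguity is structural, not decoding noise
```

In this toy decomposition, the spread of PDS across cases dwarfs the spread across repeated audits of the same case, which is the pattern the authors attribute to governance structure rather than decoding noise.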
Validation across more than 193,000 Reddit moderation decisions shows a 33–46.6 percentage-point divergence between agreement-based and policy-grounded metrics. Notably, 79.8–80.6% of agreement-based false negatives are in fact policy-defensible decisions. Controlled experiments auditing the same 37,286 decisions under three rule-specificity tiers show that AI decreases by 10.8 percentage points while DI remains stable, isolating rule clarity as the driver of ambiguity. A Governance Gate
This work reconceptualizes LLM reasoning traces as governance signals rather than classification artifacts, shifting evaluation from label-matching toward explicit rule consistency—a methodologically sound approach for policy-critical domains where interpretability and defensibility outweigh raw agreement.