Agentic RAG Citation: The Five-Gate × Two-Channel Model

The Five-Gate × Two-Channel Coordinate System for Agentic-RAG Authority

By Manuel Hürlimann for GaryOwl.com | Published: May 30, 2026 | Updated: July 13, 2026
Expertise: Digital Authority Engineering | Agentic-RAG Architecture | AI Citation Pipeline Diagnostics
Time to read: 56 minutes · ~13,900 words
Series: Operative Article 6 — DAE Glossary


📌 Reading Guide

If you read one section: “The Five Gates” — the gate definitions and the dual-channel principle are the diagnostic instrument.

If you are a content strategist: Start with “The Six-Step Triage Protocol” and the “GummySearch Case Study”, then read the gate sections relevant to your current bottleneck.

If you already know the classical Five-Gate model: Skip to “Three Forces Driving This Framework” and the new boxes in §G1b (Fan-Out Inflection) and §G2 (Asymmetric Bot Blocking) for what is new in the agentic-era reframing.

If you are skeptical of yet another agentic-RAG framework: Read “Honest Limitations” and “Independent Industry Validation” first. The framework’s structural claims are externally triangulated by practitioners who arrived at the same diagnosis without knowledge of DAE.

📌 Core Definition

The Five-Gate × Two-Channel Coordinate System is a behavioral model of how agentic AI search systems decide which content survives long enough to be cited. Five Gates (G1 Query Triage, split into G1a Resolution/Routing and G1b Fan-Out Planning; G2 Retrievability; G3 Credibility Filter; G4 Consensus Pool; G5 Generation-Time Citation × Faithfulness) operate over Two Channels (Parametric, Retrieval). Each Gate is mapped onto one or more of the six DAE Authority Types (#63–#68), modulated by four Cross-Cutting Modifiers (#69 Temporal, #70 Platform, #71 Consensus, #72 Reflection-Iteration).

The five gates are a diagnostic model — a synthesis abstracted from peer-reviewed studies, patent literature and practitioner observation, not a verified blueprint of any one vendor’s pipeline. Formally, the gates are diagnostic categories: neither computational stages nor causal mechanisms. Evidence for an individual gate mechanism does not validate the five-way decomposition itself — the decomposition is a modeling choice, made for diagnostic utility (see Open Question 9). The model treats citation as requiring an open condition across all five gate categories on at least one channel — a definitional property of the model, not an empirically established necessity; the gates have no fixed order (see Architectural Variations). In the model’s terms, closing any gate strongly suppresses the citation, and opening four out of five is not sufficient. Citation suppression frequently behaves cliff-shaped rather than gradual — a closed gate strongly suppresses citation probability while leaving a non-zero floor. Whether cliff-shaped or gradual suppression dominates in production systems is an open empirical question this framework has not measured.

TL;DR — Key Takeaways

Agentic RAG is no longer single-shot retrieval. Production pipelines from Google AI Mode to ChatGPT Search pass through five gates — processing functions whose order, frequency and activation depend on the architecture; linear traversal is the exception, not the rule — that decide which content survives long enough to be cited. The visible symptom of failure — your name does not appear in the answer — is identical across all five failure modes. The diagnostic work is to determine which gate is closed before spending budget on remedies that target the wrong gate.

Relative to the standard Five-Gate model, this article — the architectural backbone of the DAE framework’s Agentic-AEO layer — integrates four agentic-era updates: Gate 1 splits into G1a Resolution/Routing and G1b Fan-Out Planning (typically 5–20 sub-queries per query); Gate 5 becomes two-dimensional (Survival × Faithfulness), since up to 57% [Tier A] of citations can be post-rationalized rather than causal (an adversarial-condition upper bound — see FAQ Q4); Tool/Endpoint Authority joins #67 Structural Authority as MCP scales to 97 million [Tier D] monthly SDK downloads; and a new #72 Reflection-Iteration Modifier extends the Article-4 modifier set (#69 Temporal, #70 Platform, #71 Consensus). Each is detailed in the Key Insights below.

The framework is triangulated across three independent sources per core claim, Wilson-bounded on every GCS sub-metric, and every industry source has its conflict-of-interest explicitly disclosed.

📌 Key Insights — What This Article Establishes

1. Agentic RAG is a planner → router → tool-mediated retrieval → critic → synthesis loop. One user query typically produces 5–20 internal sub-queries — a practitioner-observed range (King 2026 [Tier E] (COI: iPullRank Founder/CEO)); the fan-out mechanism itself is peer-reviewed (Trivedi IRCoT, ACL 2023 [Tier A]; Jeong Adaptive-RAG, NAACL 2024 [Tier A]), the universal production span is not. Gate 1 therefore has two distinct sub-stages.

2. Tool surfaces — MCP servers, OpenAPI endpoints, function-callable APIs — are first-class anchorable substrates within the Retrieval channel. (Throughout, the two channels are the parallel paths a citation can travel: the Parametric channel — training-time, encoded in model weights — and the Retrieval channel — query-time, fetched from a live index. They can pass and fail independently — the paths are architecturally distinct; whether their outcomes are statistically independent is untested (see Open Question 11). This is the second axis of the coordinate system, developed in full below.) The MCP donation to the Linux Foundation’s Agentic AI Foundation on 9 December 2025 establishes this as adopted production infrastructure (March 2026 adoption snapshot: 97M monthly SDK downloads, 10K+ active servers) — whether tool inclusion translates into citation visibility is a separate, open question (the ToolInclusion dimension below).

3. Correctness ≠ Faithfulness. Up to 57% of citations on Cohere Command-R+ / NaturalQuestions are post-rationalized in the relevant-but-uncited adversarial condition (Wallat et al., ICTIR 2025 — see FAQ Q4). Citation survival alone is no longer a sufficient KPI.

4. Public architecture guidance and survey literature consistently point to single-LLM-multi-prompt — not multi-agent — as the prevailing production pattern in mid-2026. Systematic market-share data does not exist, so this is a working assumption for diagnosis, not a measured population claim. Three converging sources: King (iPullRank, 2026) (COI: iPullRank Founder/CEO), Anthropic “Building effective agents” [Tier E] (Schluntz & Zhang, Dec 19, 2024), and Singh Survey §3.4 [Tier C] (arXiv:2501.09136v4, 2026).

5. The Generative Citation Score (GCS) is a six-dimensional, Wilson-bounded citation-likelihood score — higher is better. Default weights are deliberately unset and will be empirically calibrated in Article 7. The metric-construction methodology follows Aggarwal et al. (GEO, KDD 2024) [Tier A], who established the legitimacy of user-defined visibility metrics for generative engines; the six-dimensional GCS itself is a proposed diagnostic score, not a validated instrument.
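To make "Wilson-bounded" concrete, here is a minimal sketch of a Wilson score interval. It assumes each GCS sub-metric can be expressed as a simple proportion (successes out of trials), which this article has not yet specified; the function name and the sample counts are illustrative only.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts: a brand cited in 12 of 40 sampled answers.
low, high = wilson_interval(12, 40)
print(f"observed {12 / 40:.2f}, interval {low:.2f} to {high:.2f}")
```

With small samples the interval is wide, which is the point of reporting the bounds rather than the raw rate: two sub-metrics whose intervals overlap should not be ranked against each other.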

📌 Evidence Tiers Used in This Article

[Tier A] Peer-reviewed academic research

[Tier B] Large-scale industry dataset (>100K samples, vendor-independent)

[Tier C] Independent Meta-Analysis (aggregates ≥ 10 external sources, transparent methodology, vendor affiliation disclosed)

[Tier D] Industry study with documented methodology, not vendor-self-published

[Tier E] Vendor study (self-published, regardless of sample size or methodology quality); COI disclosed inline

[Tier DAE] Framework term (synthesized from empirical sources, attributed to DAE)

Triangulation principle — applied throughout: every core claim must be supported by three independent sources (or be flagged explicitly as industry consensus without peer-reviewed support).

Source-hierarchy principle: Peer-reviewed primary sources outweigh vendor reports. When the two conflict, the peer-reviewed evidence governs and the vendor figure appears with a COI flag.

Vendor sources include Conflict-of-Interest (COI) disclosures — commercial or affiliation-based interests that may influence findings — in the Sources section. This article follows the DAE Tier System established in Operative Article 1.

📌 First Publication — Original DAE Contributions in This Article

The following constructs appear for the first time in DAE-framework literature in this article:

The Five-Gate × Two-Channel Coordinate System (the integrated matrix itself)

The Gate 1 split into G1a (Resolution/Routing) + G1b (Fan-Out Planning)

The Two-Dimensional Gate 5 (Survival × Faithfulness)

The Tool/Endpoint Authority sub-type within #67 Structural Authority

The #72 Reflection-Iteration Modifier as the fourth Cross-Cutting Modifier (extending the Article-4 set of #69 Temporal, #70 Platform, #71 Consensus)

The Six-Step Triage Protocol

The Six-Dimensional Generative Citation Score (GCS)

Disclaimer: These constructs are first published here, on this site, by this author. The synthesis draws on Mike King’s Beyond RAG (iPullRank, 20 May 2026), the Quality-Gate-Audit conducted in late May 2026, and the same-week integration of Bettinga’s LinkedIn double-gate analysis (which first publicly surfaced the substrate problem; the robots.txt facts are independently re-verified first-hand here), Landwehr’s Peec-AI Fan-Out-Inflection observation (COI: Peec AI CPO/CMO), and Cummins/Ramp’s marketing-incentives-to-AI-agents experiment (COI: vendor self-published). No claim is made that the underlying ideas (gates, cascades, credibility filters, fanout planning) are novel — every gate in this article rests on prior peer-reviewed work, cited inline. The novelty is the integration into a single 5×2 system with operational metrics and the Article-4-aligned modifier extension.

📌 Key DAE Terms in This Article

Gate — A near-binary filter. A closed gate strongly suppresses citation probability even when other gates are open — probabilistic effects (parametric leakage, hallucination) leave a non-zero floor, not a hard zero.

Channel — One of two parallel paths a citation can travel: Parametric (training-time, model weights) or Retrieval (query-time, live index — including text substrates and tool/endpoint substrates).

Gate 1a — Resolution/Routing — Entity disambiguation and channel routing (text vs. tool).

Gate 1b — Fan-Out Planning — Sub-query decomposition; typically 5–20 sub-queries per user query (practitioner-observed range).

Tool/Endpoint Authority — Sub-type of #67 Structural Authority. The capacity to be invoked as a tool, not merely cited as prose.

Faithfulness Axis (G5) — Causal use of a cited source vs. post-hoc justification (Wallat ICTIR 2025).

Reflection-Iteration (#72) — Number of critic-driven re-retrieval cycles before synthesis. New cross-cutting modifier, agentic-specific.

Generative Citation Score (GCS) — Six-dimensional Wilson-bounded citation-likelihood metric — higher = better — across SubQueryCov, RetrievalToCit, RefSurvival, Faithfulness, ToolInclusion, BridgeCentrality. A low dimension score localizes the closed gate; GCS is not a “gate-closure score”.

Dual-Assignment Gate — A property of this framework’s mapping: a single authority type assigned to more than one gate. #66 Network Authority is the only authority type so classified, assigned to G1 + G2 + G4.

What You Need from the Previous Articles

This article assumes you have read Article 4, Six Types of Authority AI Systems Actually Measure, and Article 5, Where Structure Actually Works. If you have not:

From Article 4 you need the six DAE authority types: #63 Entity, #64 Topical, #65 Content, #66 Network, #67 Structural, #68 Reputational — plus the three modifier dimensions: #69 Temporal, #70 Platform, #71 Consensus. This article extends this set with a fourth agentic-specific modifier, #72 Reflection-Iteration.
From Article 5 you need its core reframing of Structural Authority (#67): it is not one decision but a four-stage cascade — parsing quality, parsing robustness, retrieval granularity, and markup preservation — whose effects are multiplicative. Most brands get the HTML right and lose on the other three, so optimizing the wrong stage wastes budget. This article generalizes that four-stage logic to the full taxonomy.
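A small worked example shows why multiplicative effects make the choice of stage matter. The pass-through rates below are hypothetical, and treating the cascade as a plain product of independent rates is a simplifying assumption, not a result from Article 5.

```python
from math import prod

# Hypothetical pass-through rates for the four stages named above.
stages = {
    "parsing_quality": 0.95,
    "parsing_robustness": 0.60,
    "retrieval_granularity": 0.50,
    "markup_preservation": 0.70,
}


def cascade_yield(rates: dict[str, float]) -> float:
    """Overall yield when stage effects multiply."""
    return prod(rates.values())


def yield_after_fix(rates: dict[str, float], stage: str, new_rate: float) -> float:
    """Overall yield if a single stage is improved to new_rate."""
    return cascade_yield({**rates, stage: new_rate})


baseline = cascade_yield(stages)
polish_strong = yield_after_fix(stages, "parsing_quality", 1.00)
repair_weak = yield_after_fix(stages, "retrieval_granularity", 0.80)

print(f"baseline      {baseline:.4f}")
print(f"polish strong {polish_strong:.4f}")
print(f"repair weak   {repair_weak:.4f}")
```

Under these assumed numbers, perfecting the already strong stage moves the overall yield far less than a partial repair of the weakest one, which is the budgeting argument in miniature.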

Everything else builds from these two foundations.

Three Forces Driving This Framework

The Five-Gate × Two-Channel matrix in this article is the synthesis of three converging developments between late 2024 and mid-2026. Each shaped a specific structural decision in the framework.

(1) Mike King’s “Beyond RAG: Why Every AI Search Platform Is Now Agentic and What That Means for Your Content” [Tier E] (COI: King is iPullRank Founder/CEO) (iPullRank, 20 May 2026) provided the strongest single industry synthesis of the agentic-RAG production stack to date, triangulated internally against ReAct (Yao et al., ICLR 2023) [Tier A], Toolformer (Schick et al., NeurIPS 2023) [Tier A], IRCoT (Trivedi et al., ACL 2023) [Tier A], and Self-RAG (Asai et al., ICLR 2024 Oral) [Tier A]. King’s synthesis established that Gate 1 must be split into G1a (Resolution/Routing) and G1b (Fan-Out Planning) — the planner generates 5–20 internal sub-queries per user query, and a brand resolved at G1a can still lose four of five sub-retrievals at G1b. The architectural framing King derives from practice is independently established in the peer-reviewed-grade literature: Nowaczyk (“Architectures for Building Agentic AI”, Springer Nature, forthcoming; arXiv:2512.09458, 10 Dec 2025) [Tier B] argues that reliability in agentic systems is first and foremost an architectural property — emerging from componentisation, schema-validated interfaces, and control/assurance loops. This lifts the agentic-stack claim above a single vendor synthesis and supplies the component vocabulary (planner, tool router, verifier, supervisor) that the five gates operationalize.

(2) Two peer-reviewed papers reframed Gate 5 specifically: Wallat, Heuss, de Rijke & Anand (ICTIR 2025 Best Paper Honorable Mention, DOI 10.1145/3731120.3744592, arXiv:2412.18004) [Tier A] established that up to 57% of citations on Cohere Command-R+ / NaturalQuestions are post-rationalized rather than causally grounded — Faithfulness must be measured as a separate axis from Survival. Saxena, Bommireddy, Padia & Gaur (arXiv:2509.21557 v2, Dec 2025, submitted to NeurIPS 2025 LLM Eval Workshop) [Tier C] quantified the G-Cite vs. P-Cite trade-off across ALCE, LongBench-Cite, REASONS, and FEVER. Gate 5 in this framework is therefore two-dimensional (Survival × Faithfulness), not one-dimensional.

(3) The Linux Foundation’s announcement of the Agentic AI Foundation (AAIF) [Tier D] on 9 December 2025 made Tool/Endpoint Authority an operational reality: by the March 2026 adoption snapshot (corroborated independently by Pento.ai, Truto.one, DigitalApplied, and BraivIQ), the MCP SDK had reached 97 million monthly downloads across Python and TypeScript, with 10,000+ active public servers. Founders Anthropic (donating MCP), OpenAI (donating AGENTS.md), and Block (donating goose) plus eight platinum members (AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, OpenAI) establish this as an interop standard rather than a single-vendor protocol. Tool/Endpoint Authority is therefore a first-class sub-type of #67 Structural Authority in this framework, not an afterthought.

A fourth structural decision follows from the agentic-pipeline reality: the Article-4 modifier set (#69 Temporal, #70 Platform, #71 Consensus) is extended with one agentic-specific modifier, #72 Reflection-Iteration, capturing the Self-RAG / Singh-§3 / HiPRAG / King family of behaviors where the planner re-issues retrieval rounds based on draft critique.
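The reflection behavior can be sketched as a loop in which a critic step checks for unsupported sub-queries and triggers another retrieval round. This is a toy illustration, not any vendor's pipeline: the planner, retriever, rewrite rule, index contents, and the `max_reflections` cap are all hypothetical stand-ins.

```python
from typing import Callable


def agentic_loop(
    query: str,
    plan: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    rewrite: Callable[[str], str],
    max_reflections: int = 2,
) -> tuple[dict[str, list[str]], int]:
    """Return (evidence per planned sub-query, number of reflection iterations)."""
    # Maps each planned sub-query to the wording currently used for retrieval.
    active = {sq: sq for sq in plan(query)}
    evidence: dict[str, list[str]] = {sq: [] for sq in active}
    reflections = 0
    while True:
        for sq, wording in active.items():
            if not evidence[sq]:
                evidence[sq] = retrieve(wording)
        # Critic step: which sub-queries still lack any supporting document?
        gaps = [sq for sq, docs in evidence.items() if not docs]
        if not gaps or reflections >= max_reflections:
            return evidence, reflections
        for sq in gaps:
            active[sq] = rewrite(active[sq])
        reflections += 1


# Toy components (all hypothetical).
INDEX = {
    "crm pricing": ["vendor-a.example/pricing"],
    "crm migration": ["vendor-b.example/migrate"],
}


def toy_plan(q: str) -> list[str]:
    return [f"{q} pricing", f"{q} migration checklist"]


def toy_retrieve(wording: str) -> list[str]:
    return INDEX.get(wording, [])


def toy_rewrite(wording: str) -> str:
    return wording.rsplit(" ", 1)[0]  # broaden by dropping the last word


evidence, reflection_count = agentic_loop("crm", toy_plan, toy_retrieve, toy_rewrite)
print(reflection_count, evidence)
```

In this toy run the second sub-query only finds a document after one rewrite, so a page that matches the broadened wording is reachable only because a reflection round happened. That dependency is what the #72 modifier is meant to capture.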

What this framework does NOT change relative to Article 4’s authority taxonomy: The Dual-Channel Principle (Parametric vs. Retrieval) is preserved as the mechanistic decomposition, anchored to Sun et al. ReDeEP ICLR 2025 (Knowledge FFNs vs. Copying Heads). Tool surfaces are a substrate type within the Retrieval channel, not a third channel. The triangulation Algaba NAACL 2025 + Sun ReDeEP ICLR 2025 + Wallat ICTIR 2025 anchors this distinction in peer-reviewed mechanistic interpretability work.

The Dual-Channel Principle

A modern AI system can cite your content through two architecturally distinct channels — the parametric channel (training-time: your brand surfaces from the model’s weights, with no document retrieved) and the retrieval channel (query-time: your URL is fetched live and cited). Treating “citation” as a single phenomenon is the single most common mistake in this entire field.

The two paths in detail:

Parametric channel (training-time). The model was trained on a corpus that included a representation of your content or your brand. At inference time the model can produce your name without ever retrieving a document — purely from its weights. Algaba and colleagues at Vrije Universiteit Brussel showed this directly in their April 2025 follow-up study (arXiv:2504.02767) [Tier C] (preprint): prompting GPT-4o for references on 10,000 focal papers produced 274,951 LLM-generated references with structural and bibliometric properties closely matching the human citation graph — strong evidence that citation networks are internalized parametrically and reproducible without retrieval. The mechanistic substrate of this channel is the late-layer Knowledge FFNs that inject parametric knowledge into the residual stream (Sun et al., ReDeEP, ICLR 2025) [Tier A].

Retrieval channel (query-time). The system performs a search at inference time — through Bing, Google, a vector database, a domain-specific index, or a tool/endpoint surface — and produces a citation by attributing one or more sentences in the answer to the retrieved evidence. This is the path Perplexity, ChatGPT Search, Google AI Mode, and most “AI Overviews”-style features take. The mechanistic substrate is the set of Copying Heads (attention heads with positive OV-matrix eigenvalues) that propagate retrieved-context tokens through the residual stream (Sun et al., ReDeEP, ICLR 2025).

The Retrieval channel itself has two substrate types:

Text substrate — web pages, vector embeddings, BM25 lexical indexes, document chunks. The classical RAG payload.
Tool/Endpoint substrate — MCP servers, REST APIs, function-callable schemas, code interpreters, structured-data endpoints. Per Mike King (iPullRank, May 2026) (COI: iPullRank Founder/CEO): “When a tool exists, the router calls the tool instead of citing prose.” (In practice this is a strong tendency, not an absolute rule — many systems run tool calls and text retrieval in parallel, or fall back between them by latency and cost.) This substrate type became operationally significant with the MCP donation to the Linux Foundation on 9 December 2025; by the March 2026 adoption snapshot, MCP had reached 97 million monthly SDK downloads with 10,000+ active servers, validated independently by Lumer et al. (ScaleMCP, arXiv:2505.06416) [Tier E] (COI: PwC co-affiliated) and Pento.ai’s “A Year of MCP” retrospective (independent industry analysis).

Tool substrate is governed by Tool/Endpoint Authority as a sub-type of #67 Structural Authority. It is not a separate channel — it lives entirely within the Retrieval channel. This preserves the mechanistic mapping (Parametric/Retrieval = FFN/Copying-Head substrate) and admits tool surfaces as a first-class anchorable target without adding axes the underlying transformer architecture does not differentiate.

These two channels — Parametric and Retrieval — can pass and fail independently: the paths are architecturally distinct, and individual cases dissociate cleanly, though the statistical independence of their outcomes at population level has not been measured (see Open Question 11). The most painful diagnostic pattern in our consulting practice in 2025–2026 has been the following: a brand is named correctly by GPT-4o in 70% of relevant prompts (parametric channel open) but is never linked in Perplexity or ChatGPT Search (retrieval channel closed, usually at Gate 2). The owner thinks “we’re doing fine” because the name appears; in fact half the user journey is invisibly broken.

The framework axiom — the Dual-Channel Principle — is:

Within the model, a citation is diagnosed as requiring at least one of the two channels to pass all five gates — an axiom in the strict sense: a modeling postulate, not an empirical law. Treating channel-pass on one path as evidence of channel-pass on the other is a category error.
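The axiom can be written down as a small check over a gate matrix. This is a sketch under a deliberate simplification: gates are treated as strictly boolean, whereas the framework describes them as near-binary with a non-zero floor. The function names and the example matrix are illustrative.

```python
GATES = ("G1", "G2", "G3", "G4", "G5")
CHANNELS = ("parametric", "retrieval")

GateMatrix = dict[str, dict[str, bool]]


def channel_passes(gate_states: dict[str, bool]) -> bool:
    """A channel passes only if every gate on it is open."""
    return all(gate_states.get(gate, False) for gate in GATES)


def citation_possible(matrix: GateMatrix) -> bool:
    """The axiom: at least one channel must pass all five gates."""
    return any(channel_passes(matrix.get(ch, {})) for ch in CHANNELS)


def closed_gates(matrix: GateMatrix) -> dict[str, list[str]]:
    """Per channel, list the gates that are closed (the diagnostic output)."""
    return {
        ch: [g for g in GATES if not matrix.get(ch, {}).get(g, False)]
        for ch in CHANNELS
    }


# The pattern described above: named from weights, never linked in live search.
example: GateMatrix = {
    "parametric": {g: True for g in GATES},
    "retrieval": {"G1": True, "G2": False, "G3": True, "G4": True, "G5": True},
}

print(citation_possible(example))  # True, via the parametric channel only
print(closed_gates(example))       # {'parametric': [], 'retrieval': ['G2']}
```

The example illustrates the category error named in the axiom: the overall check returns true, yet the per-channel view still reports a closed Gate 2 on the retrieval path.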

The Five Gates

The framework retains the five-gate skeleton from the classical Five-Gate model with two agentic-era modifications: Gate 1 splits into G1a (Resolution/Routing) and G1b (Fan-Out Planning); Gate 5 acquires a second axis (Faithfulness alongside Survival).

Gate 1 — Query Triage

G1a — Resolution/Routing

G1a (Resolution/Routing) — the question this sub-gate asks: Has the model (a) resolved this query to one or more entities/topics it has parametric or retrieval anchors for, and (b) decided which channel(s) and which retrieval surface(s) to route this query to?

What “closed” looks like: The model produces an answer about your topic without ever generating your brand or URL as a candidate token — even before any retrieval or ranking step. Or: the model produces a candidate but routes the query to a channel/surface where your content is not present.

The mechanistic-interpretability literature has, over the last eighteen months, given us an unusually concrete picture of where this gate lives inside a transformer. Sun and colleagues, in their ICLR 2025 paper ReDeEP, decomposed RAG hallucination into two attributable substrates: Knowledge FFNs (later-layer feed-forward modules that inject parametric knowledge into the residual stream) and Copying Heads (attention heads that, identified by positive eigenvalues of their OV matrix, transfer information from context tokens into the residual stream). Hallucinations occurred when Knowledge FFNs over-added parametric knowledge while Copying Heads failed to retain external context. Their AARF method (Add Attention Reduce FFN) is an inference-time intervention that increases Copying-Head contribution and dampens Knowledge-FFN contribution, reducing hallucinated content without retraining.

Park & Kim (EMNLP 2025 Main, pp. 29766–29785) [Tier A] extended this with the SIPS metric (Semantic-Informed Parametric Signal), which measures the divergence between hidden states before and after the FFN layer using a semantic-entropy probe rather than ReDeEP’s Jensen–Shannon divergence on raw activations. Augenstein’s ECIR 2025 keynote (arXiv:2603.09654, March 2026) [Tier A] frames the underlying open problem: the interplay between parametric and contextual knowledge is still underexplored, and “when contextual knowledge should overwrite parametric knowledge” is itself a research question.

The routing component of G1a depends on a separate mechanism. Patent evidence: Google’s “Custom Corpus” routing application US20240362093A1 [Tier C] (published patent application, not yet granted) describes selecting between query-time corpora based on classifier output. Singh Survey Section 4.1 (arXiv:2501.09136 v4, April 2026) [Tier C] formalizes this as the Single-Agent Router architecture.

📌 Box — Diagnostic Pattern: Title-Tag Loss (LinkedIn Posts)

LinkedIn generates post-page titles from a fixed template, not from a per-post, author-chosen <title>. In Bettinga’s German-locale view the template renders as <title>Posten | LinkedIn</title> (or “Beitrag von [Name]” in the feed variant); other locales and crawlers see the equivalent template in their own language, sometimes in the slightly richer form “[Name] on LinkedIn: [opening words]”. Either way the title is machine-generated boilerplate, not a deliberate, topic-specific page title — so it gives a model only a weak, generic Gate-1 anchor for which post this is and what it is about, far below what a dedicated article page with a hand-crafted title provides. (This title-pattern observation comes from Bettinga’s German-language analysis and — unlike the robots.txt facts below — is not independently re-verified first-hand here; the displayed string is locale-dependent.)

Newsletter pages under /pulse/, by contrast, derive the <title> from the article headline (schematically <title>[Newsletter Article Headline] | LinkedIn</title>), giving a genuine Gate-1 anchor that ordinary posts lack. This matters differently per channel: where Gate 2 is open (OAI-SearchBot, Googlebot — see §G2 below), the weak post-level title anchor is the relevant limit — a constraint on anchoring, not an absolute wall, since LinkedIn content does still surface on those channels; where Gate 2 is closed (the blocked training and live-fetch crawlers), the policy block dominates regardless of title quality. The net is a double-gate weakness specific to LinkedIn-as-substrate: limited Gate-1 anchoring on posts, plus Gate-2 policy blocking for the AI crawlers LinkedIn disallows.

Diagnostic implication: When a brand publishes only on LinkedIn (no own site), it is structurally disadvantaged in AI Search regardless of content quality — not because the content is poor, but because two gates are closed simultaneously at the platform level. Indirect discovery (Googlebot access, reposts, mirrored or cached copies) can still leak some signal, so the effect is strong suppression rather than literal invisibility.

Sources: robots.txt directives — LinkedIn robots.txt, first-hand verified 28 May 2026 (primary source). Title-tag / SERP-pattern observation and the original public surfacing of this double-gate analysis — Juliane Bettinga (SEO consultant & Co-Founder @SEOSOON), LinkedIn-Post May 2026 (COI: SEO consultancy); the title-tag pattern is itself independently verifiable.
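For readers who want to check their own pages for the title pattern described in this box, the following sketch extracts the `<title>` element and applies a crude genericness test. The heuristic (at most one word before the site suffix) is an assumption chosen for illustration; it would not catch longer templated forms such as the name-based variants mentioned above. The newsletter headline in the example is invented.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text content of the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html: str) -> str:
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


def is_generic_title(title: str, site_suffix: str = "| LinkedIn") -> bool:
    """Crude heuristic (assumption): one word or fewer before the site suffix."""
    head = title.removesuffix(site_suffix).strip()
    return len(head.split()) <= 1


post_html = "<html><head><title>Posten | LinkedIn</title></head></html>"
# Hypothetical newsletter headline, for contrast.
pulse_html = "<html><head><title>Why Fan-Out Planning Matters | LinkedIn</title></head></html>"

for html in (post_html, pulse_html):
    title = extract_title(html)
    print(title, "->", "generic" if is_generic_title(title) else "topic-specific")
```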

Practical implication for content owners. A G1a failure cannot be fixed by adding more pages. It can only be addressed by changing the training-corpus signal (entity disambiguation, schema markup that survives ingestion, Wikipedia/Wikidata presence, citations from already-indexed corpora) or by changing the retrieval signal such that Copying Heads have something to copy from.

G1a is the gate at which #63 Entity Authority primarily lives. It is also one of the three gates at which #66 Network Authority lives — see the dual-assignment discussion in the Six-Authority Mapping section.

G1b — Fan-Out Planning

G1b (Fan-Out Planning) — the question this sub-gate asks: Has the planner decomposed the user query into a set of internal sub-queries, and does at least one of those sub-queries semantically match content I have published?

What “closed” looks like: The user query is resolved correctly at G1a (your brand is anchored), but the planner’s fan-out sub-queries (typically 5–20 in practitioner observation) all miss the angle, sub-topic, or framing your content covers. You are anchored but un-retrievable at the sub-query level.

This sub-stage is implicit in the classical Gate 1 and is made explicit here because the empirical evidence for fanout-planning as a separate behavioral stage has become unambiguous:

Jeong et al., Adaptive-RAG (NAACL 2024 Long, pp. 7036–7050) [Tier A] established a three-class query-complexity classifier (no-retrieval / single-step / multi-step). Multi-step queries trigger fanout — fanout is therefore complexity-conditional, not universal.
Trivedi et al., IRCoT (ACL 2023 Long, arXiv:2212.10509 v2) [Tier A] interleaves retrieval with chain-of-thought reasoning and demonstrates “11–21 recall points under a fixed-budget optimal recall setup” and “up to 15 F1 points… in downstream few-shot QA performance” on HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. Fan-out with iterative retrieval is the pattern this work established; how widely production systems follow it is a practitioner observation, not a finding of the paper.
King “Beyond RAG” (iPullRank, May 2026) (COI: iPullRank Founder/CEO): “Every modern AI search platform fans out one user query into multiple internal sub-queries before any retrieval happens. If your content only matches the surface query, you lose at the planner stage.”

Three independent sources triangulate (academia / academia / industry practitioner). Triangulation met.
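The complexity-conditional nature of fan-out can be sketched as a three-way dispatch. Adaptive-RAG trains a classifier for this decision; the keyword heuristic and the decomposition rule below are hypothetical stand-ins used only to make the control flow visible.

```python
from enum import Enum
from typing import Callable


class Complexity(Enum):
    NO_RETRIEVAL = "no-retrieval"
    SINGLE_STEP = "single-step"
    MULTI_STEP = "multi-step"


def plan_retrieval(
    query: str,
    classify: Callable[[str], Complexity],
    decompose: Callable[[str], list[str]],
) -> list[str]:
    """Return the sub-queries to retrieve for; empty means answer from weights."""
    label = classify(query)
    if label is Complexity.NO_RETRIEVAL:
        return []
    if label is Complexity.SINGLE_STEP:
        return [query]
    return decompose(query)  # fan-out happens only on this branch


def toy_classify(query: str) -> Complexity:
    """Stand-in heuristic (assumption); the paper uses a trained classifier."""
    words = query.lower().split()
    if "vs" in words:
        return Complexity.MULTI_STEP
    if len(words) <= 2:
        return Complexity.NO_RETRIEVAL
    return Complexity.SINGLE_STEP


def toy_decompose(query: str) -> list[str]:
    """Split a comparison into one sub-query per side, plus the original."""
    parts = [part.strip() for part in query.split(" vs ")]
    return parts + [query]


for q in ("hello there", "best crm for startups", "hubspot vs salesforce pricing"):
    print(q, "->", plan_retrieval(q, toy_classify, toy_decompose))
```

Only the third query fans out. Content that matches just the surface wording is therefore competing on one of several sub-queries, while content aligned with a decomposed angle can be retrieved even when the surface query never mentions it.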

📌 Box — Empirical Anchor: ChatGPT Fanout-Query Inflection (8 May 2026)

Industry telemetry from Peec AI (May 2026) provides the first dated, real-world signal of the Gate-1b fanout-planning mechanism. Reddit’s citation share in ChatGPT shifted from a baseline ~2.6% to over 11% within a single week around 8 May 2026. Per Malte Landwehr (CPO/CMO, Peec AI), the cause was a change in ChatGPT’s fanout-planning behavior: the planner began appending “reddit” to a substantially larger share of generated sub-queries, shifting which sources surfaced at G2 retrieval. A concurrent rise of the third-party site GummySearch (0.005% → 0.1% in the same week — see Case Study in the Six-Authority section below) is consistent with the same shift.

Lily Ray (Founder, Algorythmic; VP SEO & AI Search, Amsive) confirmed the interpretation in the comment thread: “When fanout queries contain the term ‘reddit’, these pages rank in addition to Reddit. Makes a ton of sense. I also imagine it might not work forever.”

Diagnostic implication: Gate-1b fanout-planning behavior is non-stationary. The same brand can score open at G1b in one platform-week and closed in the next, without any change to the brand’s own content. Monitoring G1b is continuous, not one-shot.

Sources: Landwehr, M. (26 May 2026). “How to Become a Top Source in ChatGPT with Recycled Reddit Content.” LinkedIn-Article. Tier E (Peec AI proprietary telemetry, vendor-self-published; not externally replicated). Lily Ray (Algorythmic / Amsive), comment thread on the same article — Tier D (independent practitioner comment).
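The two shifts reported in this box differ sharply depending on whether they are read as relative or absolute changes. The short calculation below uses only the figures quoted above; since Reddit's new share is given as "over 11%", its results are lower bounds.

```python
def fold_change(before_pct: float, after_pct: float) -> float:
    """Relative change: how many times larger the new share is."""
    if before_pct <= 0:
        raise ValueError("baseline share must be positive")
    return after_pct / before_pct


def point_shift(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points."""
    return after_pct - before_pct


reddit_fold = fold_change(2.6, 11.0)         # a bit over 4x
reddit_points = point_shift(2.6, 11.0)       # about 8.4 percentage points
gummysearch_fold = fold_change(0.005, 0.1)   # about 20x
gummysearch_points = point_shift(0.005, 0.1) # under 0.1 percentage points

print(f"Reddit:      {reddit_fold:.1f}x, +{reddit_points:.1f} pp")
print(f"GummySearch: {gummysearch_fold:.0f}x, +{gummysearch_points:.3f} pp")
```

The smaller site shows the larger relative jump and a negligible absolute one, so a monitoring setup that tracks only one of the two measures would miss one of these signals.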

Practical implication for content owners. G1b cannot be optimized by writing one canonical “best answer” page. It demands sub-topic breadth — multiple semantically distinct pages or sections, each addressing a plausible fanout angle. Topic coverage maps that pre-empt the planner’s likely sub-queries (see also Article 4‘s discussion of #64 Topical Authority) are the operational instrument.
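A topic coverage map can be approximated with a very small script. The sketch below uses lexical containment between a sub-query and page headings as a crude stand-in for semantic matching (a production check would more plausibly use embeddings); the sub-queries, headings, and the 0.6 threshold are all hypothetical, and this is not the article's definition of SubQueryCov.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def containment(sub_query: str, heading: str) -> float:
    """Share of the sub-query's tokens that also appear in the heading."""
    sq = tokens(sub_query)
    return len(sq & tokens(heading)) / len(sq) if sq else 0.0


def coverage(
    sub_queries: list[str], headings: list[str], threshold: float = 0.6
) -> tuple[float, list[str]]:
    """Return (share of sub-queries covered, list of uncovered sub-queries)."""
    uncovered = [
        sq
        for sq in sub_queries
        if not any(containment(sq, h) >= threshold for h in headings)
    ]
    if not sub_queries:
        return 0.0, []
    return 1 - len(uncovered) / len(sub_queries), uncovered


# Hypothetical fan-out for one user query, and a site's existing headings.
fan_out = [
    "crm pricing comparison",
    "crm data migration",
    "crm gdpr compliance",
    "crm reddit reviews",
]
site_headings = [
    "CRM pricing comparison for small teams",
    "How to plan a CRM data migration",
]

share, gaps = coverage(fan_out, site_headings)
print(f"covered {share:.0%}; gaps: {gaps}")
```

The list of gaps is the useful output: each uncovered sub-query is a candidate page or section, which turns the breadth requirement into a concrete backlog.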

Gate 2 — Retrievability

Gate 2 (Retrievability) — the question this gate asks: When the live system executes its retrieval step for each sub-query, does this document (or this tool endpoint) appear in the candidate set?

What “closed” looks like: Your content exists, is well-anchored at G1a, the planner produces fanout sub-queries that semantically match your content — and yet your URL does not appear in the candidate set because (a) the live RAG index does not crawl your domain, (b) your bot policy blocks the AI fetcher, (c) your URL does not rank in the underlying lexical/dense retriever, or (d) the substrate the planner queries (e.g. a tool surface) does not include you.
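Cause (b) is the one most easily checked directly. The sketch below uses Python's standard-library `urllib.robotparser` to test which crawler user-agents a robots.txt file admits. The robots.txt content and the URL are hypothetical; the user-agent tokens are examples, and the check says nothing about causes (a), (c), or (d).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler blocked, everything else allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def fetch_permissions(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Map each user-agent to whether robots.txt permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


permissions = fetch_permissions(
    ROBOTS_TXT,
    "https://example.com/guide",
    ["GPTBot", "OAI-SearchBot", "Googlebot"],
)
for agent, allowed in permissions.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Running the same check per user-agent rather than once for the whole site matters because a policy can admit one crawler and block another from the same vendor, so Gate 2 can be open on one retrieval surface and closed on the next.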

Gate 2 is the most boringly mechanical gate and therefore the easiest to misdiagnose. Three findings frame it.

First, modern production stacks are not single-stage retrievers but multi-stage retrieve-then-rerank pipelines (Gao et al., RAG Survey, arXiv:2312.10997 [Tier C]; Gao et al., Modular RAG, arXiv:2407.21059 [Tier C]; Barnett et al., CAIN 2024 Seven Failure Points [Tier A]). Gate 2 itself decomposes into a bi-encoder first stage (top-k retrieval, typically k = 100) and a cross-encoder rerank to top-10. Cohere’s Rerank 4 (Dec 2025) [Tier E; vendor-independent benchmark by Agentset is Tier D] documents this stack and reports an overall +170 Elo improvement over Rerank 3.5 (Pro 1627 vs. v3.5 1457), with up to +400 Elo on business-domain-specific tasks and +300 / +140 Elo on Business/Finance for the Fast variant. Use the Agentset numbers in any decision-relevant context.
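The two-stage shape can be sketched in a few lines. Both scoring functions here are toy lexical stand-ins (assumptions): a real first stage would use bi-encoder embeddings and a real second stage a cross-encoder model. Only the control flow, a cheap cut to k followed by a costlier re-sort to n, reflects the pipeline described above.

```python
import re
from typing import Callable

Scorer = Callable[[str, str], float]


def _tok(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, doc: str) -> float:
    """Cheap first-stage stand-in: count of shared tokens."""
    return float(len(set(_tok(query)) & set(_tok(doc))))


def phrase_score(query: str, doc: str) -> float:
    """Costlier second-stage stand-in: token overlap plus a bonus for shared bigrams."""
    q, d = _tok(query), _tok(doc)
    doc_bigrams = set(zip(d, d[1:]))
    bonus = sum(1 for bigram in zip(q, q[1:]) if bigram in doc_bigrams)
    return overlap_score(query, doc) + 2.0 * bonus


def retrieve_then_rerank(
    query: str,
    docs: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    k: int = 100,
    n: int = 10,
) -> list[str]:
    # Stage 1: keep only the top-k candidates by the cheap score.
    candidates = sorted(docs, key=lambda doc: first_stage(query, doc), reverse=True)[:k]
    # Stage 2: re-sort the survivors by the costlier score and keep the top-n.
    return sorted(candidates, key=lambda doc: reranker(query, doc), reverse=True)[:n]


corpus = [
    "pricing crm guide",
    "crm pricing guide for startups",
    "gardening tips",
]
top = retrieve_then_rerank("crm pricing", corpus, overlap_score, phrase_score, k=2, n=1)
print(top)
```

The design consequence for Gate 2 is that the reranker can only reorder what the first stage admits: a document that misses the top-k cut is never scored by the stronger model, however well it would have done there.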

Second, the strongest 2026 evidence for what closes Gate 2 in real production AI search comes from the Trustpilot / Seer Interactive study released 12 May 2026 via PR Newswire [Tier D] (COI: Trustpilot-commissioned, Seer-executed): methodology of 804,491 AI responses across ChatGPT, Gemini, Perplexity, and Google AI Mode; 15,783 unique prompts; 1,926 brands segmented into four cohorts (T0=437 / T1=497 / T2=497 / T3=495). The headline of that study is about a Reputational-Authority effect — but the mechanism is a Gate-2 mechanism: the study itself attributes ~99% of Trustpilot citations to the Trustpilot domain ranking organically in retrieval, not to the AI model seeking Trustpilot out by name (Trustpilot’s “3Rs” framing: Recency, Relevance, Ranking; Moz domain authority 94/100 as of 8 May 2026). The result is published as a brand-cohort progression: T0 = 1% citation rate, T1 = 53.5%, T2 = intermediate, T3 = 75.3%. Tier D, COI explicitly disclosed; triangulated by 5W Public Relations Q1 2026 Citation Source Audit (Torossian, 11 May 2026, PR Newswire) reporting a ~3× citation multiplier for brands present across G2/Capterra/Trustpilot/Yelp.

Industry-vertical caveat for the Trustpilot finding. The Trustpilot/Seer methodology states only that the study covered “a range of products and services” with 15,000+ prompts; specific industry verticals are not disclosed in the PR Newswire methodology block. The 1% → 75.3% magnitude is empirically established for industries where Trustpilot is the dominant consumer-facing review platform — retail/e-commerce, travel, financial services, and consumer hospitality, the verticals where Trustpilot’s 361 million review base concentrates. The mechanism (review-platform presence as a Gate-2 lever via organic-search ranking) generalizes to other verticals through their respective dominant review platforms, but the magnitude does not transfer 1:1. Practitioners should read the magnitude as an industry-conditional anchor: in B2B SaaS the analog is G2 / Capterra / TrustRadius; in healthcare it is condition-specific (Healthgrades, Vitals, ZocDoc, Jameda in DACH); in local services it is Google Reviews and Yelp. Re-running the cohort design on those platforms would be required to establish the magnitude per vertical. The framework’s position is that the structural finding (review platforms close Gate 2 via organic ranking) is robust; the 1%/53.5%/75.3% numerical anchors are robust for consumer-facing brands and should not be quoted as universal benchmarks.

📌 Box — Diagnostic Pattern: Asymmetric Bot Blocking (LinkedIn Case)

LinkedIn’s robots.txt (verified first-hand against the live file, 28 May 2026) blocks the major AI crawlers with a full Disallow: / — among them GPTBot, ChatGPT-User, Google-Extended, anthropic-ai, ClaudeBot, Claude-Web, Claude-User, Claude-SearchBot, cohere-ai, Google-CloudVertexBot, PerplexityBot, and Perplexity-User — and adds a catch-all User-agent: * → Disallow: / that blocks any crawler not explicitly listed (roughly two dozen AI and scraper agents are fully disallowed in total). One consequential exception qualifies the picture: OAI-SearchBot — OpenAI’s search-indexing crawler, distinct from the blocked training crawler GPTBot and the blocked live-fetch agent ChatGPT-User — is not globally blocked; it receives only the same path-level restrictions as Googlebot. The generic Googlebot is likewise not globally disallowed and retains access to /posts/ and /pulse/ (LinkedIn Newsletter), whereas Google-Extended (Gemini training/grounding) is blocked. The precise net effect at Gate 2 is therefore narrower than a blanket block: AI training and live-fetch access is policy-blocked, but the two search-index channels that feed ChatGPT-search and Google’s AI surfaces remain open at the robots level — subject to the same path limits as any conventional search engine.

The asymmetry explains a measured citation pattern: LinkedIn ranks #7 in cross-platform citation share (llmpulse.ai data-studies [Tier E], May 2026: 4.43% of all citations) — but the citations concentrate in Google AI Mode and AI Overviews (Googlebot-routed), while ChatGPT, Gemini, Claude, and Perplexity have LinkedIn nearly invisible in their citation pools.

Diagnostic implication: A Gate-2 audit cannot be reduced to “is the site crawlable?” It must be bot-specific. The same domain can be open for one channel and closed for four others.

Source: Juliane Bettinga (SEO-Expertin & Co-Founder @SEOSOON), LinkedIn-Post May 2026 — Tier D (industry analysis with documented methodology). Data anchor: llmpulse.ai/data-studies/top-cited-domains — Tier D.
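A bot-specific audit of the kind described in this box can be sketched with Python's standard-library `urllib.robotparser`. The robots.txt below is a hypothetical file on `example.com` that loosely mimics the pattern, not LinkedIn's actual file. The standard-library parser is also a simplification of how production crawlers interpret rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; NOT LinkedIn's real file.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

def audit_bots(robots_txt: str, url: str, bots: list[str]) -> dict[str, bool]:
    """Return, per user agent, whether the rules allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}

report = audit_bots(
    ROBOTS_TXT,
    "https://example.com/posts/hello",
    ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Googlebot", "UnlistedBot"],
)
for bot, allowed in report.items():
    print(f"{bot:15s} {'open' if allowed else 'closed'}")
```

The same URL comes back open for the two search-index agents and closed for the training crawlers and for any unlisted agent, which is why a single "is the site crawlable?" check is not enough.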

Tool-Surface Sub-Section (#67 Sub-Type)

Gate 2 has, since Q4 2025, acquired a second retrieval substrate: tool/endpoint surfaces. When a tool exists for a query class, the router (G1a) increasingly calls the tool instead of dispatching a text retrieval. This makes Tool/Endpoint Authority — a sub-type of #67 Structural Authority — a Gate-2-relevant capability for any brand whose content could be exposed as an endpoint rather than as prose.

The empirical anchor for tool-substrate prevalence is the Model Context Protocol (MCP). Anthropic open-sourced MCP in November 2024 and donated it to the Linux Foundation’s Agentic AI Foundation (AAIF) on 9 December 2025 (Linux Foundation press release; Anthropic news; modelcontextprotocol.io blog). AAIF co-founders: Anthropic donates MCP, OpenAI donates AGENTS.md, Block donates goose (per Paperclipped industry reporting). Platinum members: AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, OpenAI.

By the March 2026 adoption snapshot (corroborated independently by four industry sources: Pento.ai’s “A Year of MCP” retrospective [Tier E], Truto.one’s “What is MCP? 2026 Guide for SaaS PMs”, DigitalApplied.com’s “MCP Adoption Statistics 2026”, and BraivIQ’s infrastructure analysis), the MCP SDK had reached 97 million monthly downloads across Python and TypeScript (SDK downloads — including CI runs and transitive dependency installs, not unique users), with 10,000+ active public servers. Compare to launch (November 2024): approximately 2 million downloads per month. The donation date (9 December 2025) preceded the 97M/10K snapshot by approximately three months; the snapshot does not date to the donation. Adoption figures establish infrastructure momentum — they do not, by themselves, establish that tool inclusion drives citation visibility; that link is the open ToolInclusion question in the GCS section below.

Lumer et al. ScaleMCP (arXiv:2505.06416) (COI: PricewaterhouseCoopers U.S.A. co-affiliated) provides the academic-format stress-test: “5,000 financial metric MCP servers, across 10 LLM models, 5 embedding models, and 5 retriever types.”

Triangulation for Tool/Endpoint Authority as a #67 sub-type: Schick Toolformer (NeurIPS 2023) + Lumer ScaleMCP (Tier E, COI: PwC) + Linux Foundation AAIF press release (Tier D) + Pento.ai retrospective (Tier D) — academic / empirical / industry-standard / independent industry analysis. Triangulation met.

Practical implication for content owners. If your content can be answered by a tool call (price, availability, calculation, structured data), the question is no longer “is my page indexed?” but “is my endpoint discoverable in the MCP registry the agent’s router queries?” This is a different operational discipline than classical SEO — closer to API product management than to content marketing.

Gate 3 — Credibility Filter

Gate 3 (Credibility Filter) — the question this gate asks: Given the retrieved candidates (text passages or tool outputs), will the generator treat this evidence as credibility-worthy — or will it down-weight it before generation?

What “closed” looks like: Your URL or tool output appears in the candidate set but is filtered out (or down-weighted into invisibility) before generation, because the model has learned that documents of your type, domain class, structural quality, or formatting style are low-credibility.

Three peer-reviewed anchors:

Pan et al. “Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation” (arXiv:2404.06809, EMNLP 2024) [Tier A] — the CAG framework demonstrated that explicit credibility signaling during generation allows LLMs to discriminate among retrieved documents and significantly outperform vanilla RAG, particularly under noisy/misinformation contexts. The credibility filter is therefore not merely an emergent property; it is a trainable behavior, and production systems are increasingly trained on it.
Tan et al. HtmlRAG (WWW 2025) [Tier A] — chunkability, hierarchical heading structure, and lead-with-answer prose measurably increase extraction probability. Plain-text dumps from HTML perform worse than structured HTML for the same content.
Aggarwal et al. GEO: Generative Engine Optimization (KDD ’24), DOI 10.1145/3637528.3671900 [Tier A] — GEO-bench experiments document +41% visibility for Statistics Addition and +115% visibility for Cite-Sources Addition on position-5 content. Quotation Addition contributes +28%. These are content-authority operations that act at Gate 3.

This is the gate at which #67 Structural Authority lives almost exclusively, and at which #65 Content Authority lives partly.

The format question — Markdown vs HTML in the million-token era

A 2026-specific reframing of the Gate-3 extractability question is whether Markdown serving increases citation probability over HTML. The popular practitioner narrative (serve Markdown to AI crawlers via Accept: text/markdown content negotiation, save tokens, get cited more) is debated by four independent lines of 2025–2026 evidence — and the evidence is methodologically heterogeneous rather than convergent.

First, Tan et al. HtmlRAG (WWW 2025) — already cited above — provides the peer-reviewed retrieval-modeling perspective. Their finding: structured HTML with hierarchical headings, semantic tags, and chunkable structure outperforms plain-text dumps for retrieval-modeling. Markdown sits structurally closer to plain text than to richly-structured HTML. This is a retrieval-stage finding (Gate 2 to Gate 3 transition), not a citation-outcome finding.

Second, the Profound controlled A/B experiment (February 2026) (COI: Profound sells Agent Analytics; documented methodology) — 381 pages across 6 websites, 3-week measurement window, Profound Agent Analytics tracking OpenAI/Anthropic/Perplexity/Meta/DuckAssistBot bots, identical bot-detection logic across treatment and control — found a marginal ~16% mean lift driven almost entirely by high-traffic outlier pages and ~1 extra median visit per page, not statistically significant. Profound’s stated conclusion: “If Markdown were a game-changer, we would have seen it at this scale. We didn’t.”

Third, Thariq Shihipar (Engineering Lead, Claude Code, Anthropic) — “The Unreasonable Effectiveness of HTML” (thariqs.github.io/html-effectiveness/, 8 May 2026) [Tier E] (personal site of Anthropic engineer; 4.4 million views in 16 hours) reframes the original Markdown-default rationale. The token-economy argument that made Markdown the obvious choice in the GPT-4 era (8K–32K context windows, every token billed) has been structurally obsoleted: Claude Opus 4.7 and Sonnet 4.6 now run 1M-token context windows [Tier E] (Anthropic Docs, May 2026), GPT-5.5 supports 1M tokens (OpenAI, 24 April 2026), Gemini 3.1 Pro exceeds 2M. The few-bytes-of-overhead-per-paragraph cost that made Markdown the default is, in Shihipar’s words, now noise. HTML’s richer structural expressiveness — interactive elements, semantic depth, machine-parseable visual hierarchy — wins for both human inspection and machine extraction. Shihipar’s argument is about output format, not crawl-time format; the token-economy point transfers.

Fourth, Grace Cummins / Ramp (April 30, 2026) — “We Tested Marketing Incentives to AI Agents” (builders.ramp.com/post/marketing-to-ai-agents) [Tier E] (COI: vendor self-published on builders.ramp.com; documented methodology; reads as counter-evidence to the Profound/Tan/Shihipar consensus). Ramp ran a three-variant test across ~50 marketing pages: pure Markdown, stripped semantic HTML, and schema-injected HTML. Their headline finding: “Markdown was the only format that reliably surfaced in LLM responses.” Schema-markup, which Ramp had expected to win (literally designed for machines), performed worst.

Targeting confound — important caveat to Ramp’s headline. The three Ramp variants did not share bot-targeting rules. Markdown was served broadly (Cloudflare “AI Assistant” category OR unverified bots with low bot scores); stripped HTML and schema were served only to verified “AI Search” or “AI Assistant” bots. Compounding the issue, Ramp’s own diagnostic Finding #1 documents that Cloudflare’s “AI Search” label does not include ChatGPT, Claude, or Perplexity — those three are classified as “AI Assistants.” A strict “AI Search”-only rule misses the three largest LLM platforms. Ramp acknowledges this directly: “Part of this may reflect a targeting issue.” The cleanest reading: Markdown served to a broader bot set produced more downstream LLM responses than HTML served to a narrower bot set. That is not the same as “Markdown is causally better at producing citations than HTML.”

Synthesis (honest). The four-source evidence base is methodologically heterogeneous: Tan (peer-reviewed, retrieval-modeling, HTML > plain text); Profound (clean A/B, no significant Markdown effect); Shihipar (Anthropic engineering, token-economy argument obsoleted by 1M-context-window era); Ramp (Markdown won, but targeting-confounded). Three of four data points argue against a strong “Markdown is the citation lever” causal claim. The fourth (Ramp) is the strongest pro-Markdown data point but is methodologically not directly comparable to Profound’s clean A/B. The framework’s current synthesis: Markdown serving is a plausible Gate-2 hygiene optimization (token-cost, parser-friendliness), but the citation-causal evidence is mixed, and Tan WWW 2025 retains the peer-reviewed status for the retrieval-modeling layer. Practitioners adopting Markdown content negotiation should expect at most marginal gains, not multipliers.

Practical implication for content owners. Content negotiation via Accept: text/markdown is at most a hygiene-tier optimization — useful for token-cost reduction in API-billed agentic crawls, possibly marginal lift in narrow bot populations (Ramp), but not load-bearing for Gate 3 pass rates per the controlled-A/B (Profound) and peer-reviewed retrieval-modeling evidence (Tan). Investments in HTML structural quality (heading hierarchy, lead-with-answer prose, machine-parseable lists/tables, inline statistics and source-citations per Aggarwal GEO operations) are the empirically more robust Gate-3 levers.
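For readers who want to see what `Accept: text/markdown` negotiation involves mechanically, here is a minimal server-side sketch. The function names and the tie-breaking policy (serve Markdown only when the client lists it explicitly and ranks it at least as high as HTML) are assumptions for illustration, not a description of how Profound, Ramp, or any crawler implements it.

```python
def parse_accept(header: str) -> dict[str, float]:
    """Parse an Accept header into {media_type: q_value}."""
    prefs: dict[str, float] = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality value when none is given
        for field in fields[1:]:
            if field.lower().startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        prefs[media_type] = q
    return prefs

def choose_representation(accept_header: str) -> str:
    """Pick 'text/markdown' only on explicit client preference; else HTML."""
    prefs = parse_accept(accept_header)
    markdown_q = prefs.get("text/markdown", 0.0)
    html_q = max(prefs.get("text/html", 0.0),
                 prefs.get("text/*", 0.0),
                 prefs.get("*/*", 0.0))
    if markdown_q > 0 and markdown_q >= html_q:
        return "text/markdown"
    return "text/html"

print(choose_representation("text/markdown, text/html;q=0.9"))
print(choose_representation("text/html,application/xhtml+xml,*/*;q=0.8"))
```

A server that varies its response this way would normally also send a `Vary: Accept` response header so that caches keep the two representations apart.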

Two related industry findings address the Schema-markup question that practitioners ask first: Search Atlas (Dec 2024, domain-level correlational) [Tier E] (COI: SEO-tool vendor self-published) found no correlation between schema coverage and LLM citation rate. Ahrefs [Tier E] (May 2026, 1,885 pages adding JSON-LD Aug 2025 – Mar 2026, page-level difference-in-differences) (COI: SEO-tool vendor self-published) found Google AI Overviews −4.6%, Google AI Mode +2.4% (n.s.), ChatGPT +2.2% (n.s.). Ramp’s Variant C (schema-heavy) (COI: vendor self-published) also underperformed Markdown and stripped HTML in their three-variant test. Three independent industry findings (all Tier E with disclosed COI) converge: schema markup is a hygiene factor, not a Gate-3 lever — and certainly not a remedy for a closed Gate 1 or Gate 2. Microsoft/Canel (SMX München March 2025, paraphrased via Schwartz, Search Engine Land 20 March 2025; cross-confirmed via David Mihm LinkedIn coverage) (third-party trade-press reporting of vendor statement) is the counter-example: Bing/Copilot uses schema for entity-graph signaling, which is a Gate-1 mechanism, not a Gate-3 lever.

Gate 4 — Consensus Pool & Pairwise Re-rank

Gate 4 (Consensus Pool) — the question this gate asks: Across the (typically 3–10) candidate documents that have survived Gate 3, do multiple of them agree, and is your document part of the agreeing set?

What “closed” looks like: Your URL is retrieved and is credibility-acceptable but lies off the consensus axis. The generator produces an answer dominated by the consensus and either omits your URL or cites it only as a contrast.

Gate 4 is where most of the heavy 2026 industry-published data lives, and where the noisiest noise around “generative engine optimization” originates. The framework’s job here is to separate the peer-reviewed mechanism (which is real and Tier A) from the vendor anecdote (which is real-but-COI-flagged).

Peer-reviewed mechanism (Tier A). Yang & Menczer (ACM WebSci 2025, DOI 10.1145/3717867.3717903; arXiv:2304.00228 v3, Feb 2025) [Tier A] — “Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models” found that “LLMs exhibit a high level of agreement among themselves (average Spearman’s ρ = 0.79)” when rating news-source credibility across 9 LLMs and 3 providers. Schuster et al. (arXiv:2601.03746, Jan 2026) [Tier C] (preprint) — “Whose Facts Win? LLM Source Preferences under Knowledge Conflicts” — showed that multi-source agreement is the dominant signal for which sentence-level claim a generator decides to attribute. Naser (arXiv:2603.03299, March 2026) [Tier B; 69,557 citation instances across 10 commercial LLMs in four academic domains] found that “multi-model consensus (with more than 3 LLMs citing the same work) yields 95.6% accuracy, a 5.8-fold improvement” over baseline.
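The consensus criterion quoted from Naser ("more than 3 LLMs citing the same work") reduces to a simple count. The sketch below shows that counting step only. The model names and cited works are hypothetical, and nothing here reproduces the paper's dataset or accuracy figures.

```python
from collections import Counter

def consensus_works(citations_by_model: dict[str, set[str]],
                    min_models: int = 4) -> list[str]:
    """Works cited by at least min_models distinct models.
    min_models=4 encodes 'more than 3' from the quoted criterion."""
    counts = Counter(
        work for works in citations_by_model.values() for work in works
    )
    return sorted(work for work, n in counts.items() if n >= min_models)

# Hypothetical citation sets per model.
citations_by_model = {
    "model_a": {"work_1", "work_2"},
    "model_b": {"work_1", "work_3"},
    "model_c": {"work_1", "work_2"},
    "model_d": {"work_1"},
    "model_e": {"work_2", "work_4"},
}

in_consensus = consensus_works(citations_by_model)
print(in_consensus)
```

In Gate-4 terms, `work_2` is retrieved by three models yet still falls outside the agreeing set under this threshold.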

The Pairwise-Rerank sub-mechanism within G4 is documented by Google patent US20250124067A1 [Tier C] (published patent application, not yet granted) — Pairwise Ranking Prompting. King (iPullRank, May 2026) (COI: iPullRank Founder/CEO): “Your content is being compared head-to-head against every other surviving candidate. Most production stacks now use an LLM-as-judge cross-encoder for this step.”
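A head-to-head comparison of every surviving candidate can be sketched as an all-pairs tournament. This is a generic illustration of pairwise ranking, not the method claimed in the patent application. `toy_judge` is a hypothetical stand-in for an LLM judge and simply prefers the text containing more digits.

```python
from itertools import combinations
from typing import Callable, List

def pairwise_rerank(candidates: List[str],
                    prefer: Callable[[str, str], str]) -> List[str]:
    """Rank candidates by how many head-to-head comparisons they win."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[prefer(a, b)] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

def toy_judge(a: str, b: str) -> str:
    """Hypothetical judge: prefers the text with more digits
    (a crude proxy for 'contains statistics')."""
    def digits(text: str) -> int:
        return sum(ch.isdigit() for ch in text)
    return a if digits(a) >= digits(b) else b

candidates = [
    "no numbers here",
    "cites 3 studies",
    "reports 41% and 115% lifts",
]

ranked = pairwise_rerank(candidates, toy_judge)
print(ranked)
```

All-pairs comparison needs n(n-1)/2 judge calls for n candidates, which is one reason it is typically applied only to a small surviving pool.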

Industry-published trend evidence (Tier D, vendor-commissioned, COI-flagged). The Trustpilot/Seer Interactive March 2026 study (PR Newswire, 12 May 2026) (Trustpilot-commissioned, executed by Seer Interactive — methodology disclosed, COI explicit) reports that review-and-trust websites are “the second most cited source type, accounting for 14% of all citations in AI responses.” Direction triangulated by 5W Q1 2026 Citation Source Audit reporting a ~3× citation multiplier for brands across G2/Capterra/Trustpilot/Yelp; the absolute magnitudes remain provisional pending peer-reviewed replication.

This is #68 Reputational Authority acting at Gate 4. It is also one of the three places where #66 Network Authority acts — see the dual-assignment discussion in the Six-Authority Mapping section.

Gate 5 — Generation-Time Citation × Faithfulness (the two-axis gate)

Gate 5 is two-dimensional. A citation must both survive into the generated answer (Survival) and reflect causal use of the source rather than post-rationalization (Faithfulness) — Wallat et al. (ICTIR 2025) found up to 57% of citations are post-rationalized in an adversarial probe condition, which forced the second axis.

Questions the gate asks:

Survival: Once an answer has been drafted, will the generator attach a citation marker to your URL — or to a different surviving candidate — or to nothing at all?
Faithfulness: If a citation marker is attached, does it reflect causal use of the cited source — or is it post-rationalized?

Classical Five-Gate treatments model Gate 5 as one-dimensional (Survival only). The agentic-era framework here makes it two-dimensional. The second axis was forced by the peer-reviewed evidence that surfaced in Q4 2025 / Q1 2026.

The Faithfulness finding (Wallat et al., ICTIR 2025). Wallat, Heuss, de Rijke & Anand (ICTIR 2025, DOI 10.1145/3731120.3744592, arXiv:2412.18004), a Best Paper Honorable Mention at an ACM SIGIR-affiliated conference, established four desiderata for trustworthy citations: Correctness, Faithfulness, Appropriateness, Comprehensiveness. Verbatim faithfulness definition: “the model’s reliance on cited documents is genuine, reflecting actual reference use rather than superficial alignment with prior beliefs, which we call post-rationalization.” Experimental anchor: Cohere Command-R+ (104B parameters, 128k context, 4-bit quantized) on NaturalQuestions (1,444 questions, Top-5 BM25+ColBERTv2 retrieval) — up to 57% of citations lack faithfulness in the relevant-but-uncited-document adversarial condition (273 of 476 recovery cases). At random-adversarial baseline the rate was 12% (116 of 936); for “cited for different reasons” the rate was 55% (290 of 525). Tier A; ACM Best Paper Honorable Mention verified via uva.nl/IRLab announcement (19 July 2025) and ACM conferences best-paper-awards listing.
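The three percentages follow directly from the counts quoted above. A short check, using only those counts:

```python
# Counts as quoted in the text: (unfaithful citations, total cases).
conditions = {
    "relevant-but-uncited adversarial": (273, 476),
    "random-adversarial baseline": (116, 936),
    "cited for different reasons": (290, 525),
}

rates = {
    name: round(100 * unfaithful / total, 1)
    for name, (unfaithful, total) in conditions.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Rounded to whole numbers these are the 57%, 12% and 55% figures cited in the paragraph.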

Million-token-era caveat. The Wallat experiment used Command-R+ with a 128K context window, the 2024-era standard. The conceptual finding — that Faithfulness is a separate axis from Survival, that post-rationalization is a measurable failure mode — is architectural and transfers to 2026-era models (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro), all of which now run 1M-token windows (Anthropic Docs, May 2026; OpenAI GPT-5.5 release, 24 April 2026). The specific 57% magnitude on million-token-era models is an open empirical question; one mechanistic prediction is that post-rationalization rates drop as Copying Heads operate under less context-budget pressure, but this prediction has not yet been directly tested. Article 7’s cross-model calibration will include 1M-context-window models specifically to measure this.

The Survival trade-off (Saxena et al., arXiv:2509.21557 v2, Dec 2025). Saxena, Bommireddy, Padia & Gaur (arXiv preprint v2, submitted to NeurIPS 2025 LLM Evaluation Workshop; workshop submission, accepted-papers list not externally verified as of 28 May 2026) introduced a formal distinction between two citation paradigms:

G-Cite (Generation-Time Citation): model produces answer text and citation markers in a single decoding pass. Citation decisions are local — based only on what has been written so far and currently-attended evidence.
P-Cite (Post-hoc Citation): model first drafts the answer, then a second pass adds or verifies citations across the complete draft.
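The difference between the two paradigms is one of control flow, which a toy can make visible. The sketch below is an assumption-laden illustration. Token overlap stands in for the model's evidence use, the `attended` passage per step is supplied by hand, and the passages are invented. It does not reproduce the paper's systems or numbers.

```python
import re

def overlap(a: str, b: str) -> int:
    """Shared lowercase word tokens, a crude proxy for textual support."""
    def words(text: str) -> set:
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(words(a) & words(b))

def g_cite(steps, passages, min_support=2):
    """Generation-time: each sentence is emitted with its citation,
    using only the passage attended at that step (a local decision)."""
    answer = []
    for sentence, attended in steps:
        supported = overlap(sentence, passages[attended]) >= min_support
        answer.append((sentence, attended if supported else None))
    return answer

def p_cite(steps, passages, min_support=2):
    """Post-hoc: draft the whole answer first, then a second pass
    searches all passages for the best support per sentence."""
    draft = [sentence for sentence, _ in steps]
    answer = []
    for sentence in draft:
        best = max(passages, key=lambda pid: overlap(sentence, passages[pid]))
        supported = overlap(sentence, passages[best]) >= min_support
        answer.append((sentence, best if supported else None))
    return answer

# Hypothetical passages and generation steps (sentence, attended passage id).
passages = {
    "p1": "Gate 2 audits must be bot specific",
    "p2": "The search crawler is distinct from the training crawler",
}
steps = [
    ("Audits must be bot specific", "p1"),
    ("The search crawler is distinct from the training crawler", "p1"),
]

g_result = g_cite(steps, passages)
p_result = p_cite(steps, passages)
print([cite for _, cite in g_result])
print([cite for _, cite in p_result])
```

In this toy the second sentence goes uncited under the local rule and picks up `p2` in the post-hoc pass. That matches the direction of the coverage gap reported in the paper, though not its mechanism or magnitude.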

Empirical trade-off (as reported in Table 2 ALCE and Table 4 FEVER of arXiv:2509.21557 v2):

Benchmark / metric	G-Cite	P-Cite
ALCE — Coverage	0.372	0.748
ALCE — Citation Correctness	0.205	0.422
LongBench-Cite — Coverage	0.65	0.78
Human-Eval — Answer Correctness	0.69	0.78
Human-Eval — Citation Hallucination	0.41	0.37
FEVER — Citation Correctness	0.937	0.75
FEVER — Coverage	0.272	0.74
ALCE — Latency (s)	17.237	6.077 (Zero-Shot: 2.925)

Bold values indicate the better performer per row. Source: Saxena et al., arXiv:2509.21557 v2, Tables 2–4.

Paper Finding 1 (verbatim): “On ALCE, the advanced P-Cite achieve 75% coverage with 42% correctness, substantially outperforming the advanced G-Cite which reaches 37% coverage and 21% correctness.” Headline recommendation: “We recommend a retrieval-centric, P-Cite-first approach for high-stakes applications, reserving G-Cite for precision-critical settings such as strict claim verification.”

Flag: arXiv:2509.21557 is Tier C (preprint, workshop submission, main-conference peer-review not applicable). Treat magnitudes as directionally robust but quantitatively provisional. Tier-A replication would change the recommendation’s confidence interval, not its direction.

Mechanistic anchor (Sun et al., ReDeEP) — a hypothesis, not a finding. The two-axis structure of G5 has a candidate neural correlate: we hypothesize that Wallat-style Faithfulness failures relate to over-active Knowledge FFNs (parametric over-injection), and Saxena-style Survival failures to under-active Copying Heads (retrieved-context under-propagation). ReDeEP itself studies RAG hallucination detection, not citation pipelines — this mechanistic bridge is our own, is untested, and is offered as a falsifiable prediction rather than a result. What Sun et al. (ICLR 2025) does establish is the parametric-vs-retrieval substrate distinction on which both G5 axes are modeled.

Triangulation Gate 5 (Survival × Faithfulness): Wallat ICTIR 2025 (Tier A — Faithfulness axis) + Saxena arXiv:2509.21557 (Tier C — Survival trade-off across G-Cite/P-Cite) + Sun ReDeEP ICLR 2025 (Tier A — mechanistic substrate). Three independent methodologies. Triangulation met (with Saxena tier-caveat).

The Mapping: Six Authority Types in the Five-Gate × Two-Channel System

This is the heart of the framework. Each authority type from Article 4 maps onto a primary gate (the gate the authority type predominantly acts through) and, where evidence supports, secondary gates. Network Authority (#66) is the one deliberate exception: it acts at three gates simultaneously (dual-assignment — a property of this mapping, not an external finding; see Honest Limitations §1).

Authority Type	Primary	Secondary	Channel	Evidence anchor
#63 Entity	G1a	—	Parametric	Tier A — Algaba 2025; Sun 2025
#64 Topical	G3	G1b, G4	Both	Tier A — Aggarwal 2024; Tan 2025
#65 Content	G3	G5	Retrieval	Tier A / C — Aggarwal 2024; Saxena 2025
#66 Network	G4	G1a + G2 dual-assignment	Both	Tier A — Algaba 2025 (×2); Yang & Menczer 2025
#67 Structural incl. Tool/Endpoint	G2	G3	Retrieval	Tier A + B — Tan 2025; Lumer 2025
#68 Reputational	G4	G5	Retrieval	Tier A + D — Yang & Menczer 2025; Seer/Trustpilot 2026 (COI); 5W 2026

Gate labels: G1a = Resolution/Routing · G1b = Fan-Out Planning · G2 = Retrievability · G3 = Credibility Filter · G4 = Consensus Pool · G5 = Generation × Faithfulness.

The 6×2 Matrix

Gate	Parametric Channel	Retrieval Channel
G1a Resolution/Routing	Entity #63, Network #66	Network #66 (graph-recall features)
G1b Fan-Out Planning	(planner uses parametric anchors)	Topical #64 (sub-query coverage)
G2 Retrievability	(n/a — parametric is not “retrieved”)	Structural #67 incl. Tool/Endpoint sub-type, Network #66
G3 Credibility Filter	Topical #64 (via training)	Topical #64, Content #65, Structural #67
G4 Consensus Pool	(Network #66, indirect)	Network #66, Reputational #68
G5 Generation × Faithfulness	(rare; mostly retrieval)	Content #65, Reputational #68
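The matrix can be encoded as a plain data structure and inverted to list the gates each authority type touches. The encoding below is a sketch: the parenthetical, indirect parametric entry for Network #66 at G4 is treated as an ordinary entry, which is an assumption, and cells with no authority type listed are left empty.

```python
# Encoding of the 6x2 matrix above; names are shortened labels from the table.
MATRIX = {
    "G1a": {"parametric": ["Entity #63", "Network #66"],
            "retrieval": ["Network #66"]},
    "G1b": {"parametric": [],
            "retrieval": ["Topical #64"]},
    "G2":  {"parametric": [],
            "retrieval": ["Structural #67", "Network #66"]},
    "G3":  {"parametric": ["Topical #64"],
            "retrieval": ["Topical #64", "Content #65", "Structural #67"]},
    "G4":  {"parametric": ["Network #66"],  # listed as indirect in the table
            "retrieval": ["Network #66", "Reputational #68"]},
    "G5":  {"parametric": [],
            "retrieval": ["Content #65", "Reputational #68"]},
}

def gates_by_authority(matrix: dict) -> dict[str, set[str]]:
    """Invert the matrix: authority type -> set of gates it appears at."""
    result: dict[str, set[str]] = {}
    for gate, channels in matrix.items():
        for authority_types in channels.values():
            for authority in authority_types:
                result.setdefault(authority, set()).add(gate)
    return result

gates = gates_by_authority(MATRIX)
for authority, gate_set in sorted(gates.items()):
    print(authority, sorted(gate_set))
```

Inverting the matrix this way makes it easy to check which types span several gates and which gates have an empty parametric cell.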

Four Observations About This Mapping

Network Authority is the only authority type that appears at three different gates. This is not a bug — it is the framework’s recognition that citation-graph signal is the only authority modality that operates simultaneously in training, retrieval, and consensus. Every other authority type has a primary gate and at most one secondary gate. The triangulation: Algaba et al. NAACL 2025 Findings (Tier A) + Algaba et al. arXiv:2504.02767 follow-up (Tier B — 274K samples, vendor-independent academic team) + Yang & Menczer arXiv:2304.00228 (Tier A). No single peer-reviewed paper tests all three gates simultaneously — the dual-assignment is a theoretically motivated reclassification supported by three independent peer-reviewed lines of evidence. This caveat appears explicitly in Honest Limitations §1 below.
Gate 2 has no parametric column. This is structural: you cannot be “retrieved” by your own training weights; retrieval is by definition a query-time operation. Authority types that act on Gate 2 (#67 Structural including Tool/Endpoint sub-type, #66 Network’s retrieval-channel manifestation) must work through the live index or the live tool registry.
Gates 4 and 5 are where vendor-published “AI citation studies” overwhelmingly cluster. This is because vendor research can measure citation outcomes (G5) and pool composition (G4) without instrumenting the model. It cannot easily measure G1a–G3 from outside, which is why almost all peer-reviewed mechanistic work concentrates on G1a–G3 (Sun, Pan, Wallat, Augenstein) and almost all industry research on outcomes concentrates on G4–G5 (Trustpilot, 5W, Ahrefs, Search Atlas). A separate methodology-side strand of industry research — exemplified by Graphite’s “Demystifying Randomness in AI” (Druck & Smith, 2026) [Tier E] (COI: Graphite sells AEO services) — sits orthogonal to that classification: it does not measure citation outcomes but rather how visibility can be measured at all (Wilson-Score confidence intervals on n=10, Sequential Sampling reducing sample needs by 51%, API-vs-Logged-Out cosine similarity of 0.48). The framework cites Graphite as a methodology reference in the GCS construction, not as outcome evidence. Practitioners need all three literatures — peer-reviewed mechanistic, vendor outcome, and measurement-methodology — because they describe complementary parts of the same pipeline.
Existing citation authority operates as a content-surfacing gatekeeper — independent of format. Grace Cummins / Ramp (“We Tested Marketing Incentives to AI Agents”, 30 April 2026) (COI: vendor self-published on builders.ramp.com; documented methodology, ~50 pages × 1,300+ bot visits × 32-day tracking) documented an empirically clean version of the framework’s #66 Network Authority claim. Their headline finding, separate from the format-variant question: “Pages with higher existing AI citation volumes were far more likely to surface our embedded content. Pages with low existing citation volume got zero incentive mentions, regardless of format.” Ramp names this “a concept of ‘agent trust’ that’s analogous to domain authority in traditional SEO, but the signals are different.” Operationally this is the dual-assignment of #66 Network Authority across G1a (resolution/routing), G2 (retrievability), and G4 (consensus): a page that the model has already learned to trust is a page whose new content gets a probabilistic head-start at every gate it must pass. This is the strongest 2026 industry-side corroboration of the dual-assignment classification.
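The Wilson-score interval mentioned in the third observation is worth seeing at n=10, because it shows how wide the uncertainty is at that sample size. The sketch below uses the standard Wilson formula. How Graphite applies it in their study is not reproduced here, and the example of 3 citations in 10 runs is hypothetical.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.
    z=1.96 corresponds to an approximate 95% confidence level."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical: a brand is cited in 3 of 10 repeated runs of the same prompt.
low, high = wilson_interval(3, 10)
print(f"observed 30%, 95% interval roughly {low:.1%} to {high:.1%}")
```

An observed 30% citation rate on ten runs is compatible with a true rate anywhere from about 11% to about 60%, which is why single small-sample visibility readings need careful handling.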

📌 Box — Case Study: Five Gates Opening Simultaneously (GummySearch, May 2026)

In the same week as the Reddit-fanout inflection at ChatGPT (around 8 May 2026 — see §G1b Box above), an obscure third-party site — GummySearch, originally a Reddit search-and-analytics tool — rose from 0.005% to 0.1% of all ChatGPT citations. GummySearch had stopped accepting new customers on 30 November 2025 (the creator failed to negotiate a Reddit API deal at $35k MRR), but its bot-accessible landing pages remained crawlable and indexed.

The landing pages (e.g. /best-clothing-brands-on-reddit/) exhibit a textbook multi-authority stack:

Authority Type	How GummySearch landing pages satisfy it
#64 Topical	narrow deep topic (“Best clothing brands on Reddit”), 306 reviews from 45 subreddits
#65 Content	10+ verbatim quotes per ranked brand, with original Reddit-user attribution and dates
#66 Network	piggyback on Reddit’s citation-graph position via on-top-of-Reddit data layer
#67 Structural	listicle #1–#6 ranking, “By Brand / By Product” toggle, machine-parseable
#68 Reputational	star ratings + user-quote provenance
#70 Platform Modifier	Reddit’s platform authority transferred via the on-top-of-Reddit layer

All five Authority Types and the Platform Modifier light up simultaneously, without GummySearch having any of its own brand-, backlink-, or content-investment budget. The visibility lift is structural, not editorial.

Landwehr’s reverse-engineering (verbatim): “Almost everything on these landing pages is perfect for ChatGPT while it is in its ‘I Love Reddit’ phase.” Four mechanisms identified: listicle format; the term “reddit” mentioned 5+ times per page; subreddit names (r/BuyItForLife) prominent; every score backed by 10+ real-user quotes.

Diagnostic implication: Multi-Authority stacking on a single page is an achievable strategy — but per Lily Ray’s caveat (“might not work forever”), sustainability depends on a specific fanout-configuration that can shift platform-side at any time. The lift is real, but the moat is structurally fragile.

Source: Malte Landwehr (CPO/CMO, Peec AI), LinkedIn-Article 26 May 2026 — Tier E (vendor-self-published, Peec AI proprietary telemetry).

Cross-Cutting Modifiers

Four dimensions are not gates but bend gate-pass probabilities across multiple gates. This framework inherits the three Article-4 modifiers (#69 Temporal, #70 Platform, #71 Consensus) without renaming or restructuring, and adds one agentic-pipeline-specific modifier (#72 Reflection-Iteration). “Authority Density” and “Multimodal Surface” — sometimes proposed elsewhere as standalone modifiers — are not maintained as separate dimensions here; they are sub-concepts under #66 Network Authority and #67 Structural Authority respectively.

#69 Temporal Modifier (Freshness)

Affects: G1b (planner may inject year-tokens into sub-queries), G2 (retrieval indices prefer recent content), G4 (consensus pools rotate), G5 (generation prefers fresh-dated citations).
Evidence tier: A + D. The causal anchor is Yubo Fang et al. (SIGIR APIR 2025) [Tier A]: seven LLMs were tested in a controlled experiment in which only the date of otherwise identical passages was changed; passages with newer dates rose by up to 95 ranking positions, and up to 25% of all relevance decisions flipped solely because of the date change. Industry corroboration: Trustpilot’s “3Rs” framework (Recency, Relevance, Ranking); the Ahrefs 17M-citation study, which found AI-cited content to be 25.7% fresher than organic Google results; and Qwairy’s finding that AI systems inject the current year into 28.1% of all sub-queries even when users do not specify it.
The mechanism appears to be designed rather than emergent: ChatGPT’s production configuration reportedly contains use_freshness_scoring_profile: true (Metehan Yesilyurt, October 2025 discovery via prompt-injection leak; Tier D).
Operational path (heuristic): The freshness bias itself is causally established (Fang, SIGIR APIR 2025); the cadence is not. Quarterly content refresh cycles for key pages are our operational heuristic — not an evidence-derived optimal interval — so calibrate the cadence to your vertical’s index-refresh dynamics. Systematic updating of data points and year references. Content age monitoring as a KPI.
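The content-age KPI mentioned above can be tracked with very little tooling. The following is a minimal sketch, assuming that "quarterly" is read as roughly 90 days and that you maintain your own mapping of page paths to last-substantive-update dates; the function name `stale_pages` and the 90-day constant are illustrative choices, not part of the framework.

```python
from datetime import date

# ASSUMPTION: "quarterly" is approximated as 90 days; calibrate per vertical.
REFRESH_DAYS = 90

def stale_pages(last_updated: dict[str, date], today: date,
                max_age_days: int = REFRESH_DAYS) -> list[tuple[str, int]]:
    """Return (page, age_in_days) for pages older than the refresh heuristic,
    oldest first."""
    ages = {page: (today - updated).days for page, updated in last_updated.items()}
    overdue = [(page, age) for page, age in ages.items() if age > max_age_days]
    return sorted(overdue, key=lambda item: item[1], reverse=True)

# Hypothetical inventory of key pages and their last update dates.
inventory = {
    "/pricing-guide": date(2026, 1, 1),
    "/category-overview": date(2026, 6, 1),
}
overdue = stale_pages(inventory, today=date(2026, 7, 13))
```

The output is a refresh queue, not a ranking of importance; which pages count as "key" remains an editorial decision.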

#70 Platform Modifier (Inherited Trust)

Affects: G1a (some platforms have privileged entity-resolution), G2 (some platforms have privileged bot access — see Bettinga LinkedIn case), G4 (platform-trust transfers to documents hosted on the platform).
Evidence tier: D. Semrush 100M-citation study: Reddit appeared in ~60% of ChatGPT answers (before September 2025), Wikipedia at ~55%. Profound 680M-citation analysis: only 11% of domains are cited by both ChatGPT and Perplexity; only 7 of the top 50 domains appear across all three major platforms. Writesonic (2.4M domains): 67.4% of all cited domains appear on exactly one AI platform.
Platform Authority is systemically unstable. Reddit citations on ChatGPT collapsed from ~60% to ~10% in September 2025; recovered from 2.6% to >11% on 8 May 2026 via the fanout-planning shift documented in §G1b. Same source, opposite directions, within nine months.
Operational path: Multi-platform presence strategy prioritized by AI platform preferences. Monthly Cross-AI Coverage tracking. YouTube, Reddit, and LinkedIn as citation entry points — with the explicit caveat that LinkedIn closes Gate 2 for the AI channel (see §G2 Bettinga Box A).

#71 Consensus Modifier (Cross-Source Corroboration)

Affects: G4 (definitionally), G5 (citation attachment converges on consensus sources).
Evidence tier: A. Yang & Menczer arXiv:2304.00228 — Spearman ρ = 0.79 cross-LLM agreement; Schuster et al. arXiv:2601.03746 — multi-source agreement as dominant attribution signal; Naser arXiv:2603.03299 — multi-model consensus yields 95.6% citation accuracy, a 5.8-fold improvement.
Limitations: Consensus is a property of the pool, not of your document. You can write the most accurate document in the world and still fail G4 if it lies off the consensus axis. The remedy is not to write more; it is to seed consensus (third-party coverage, citations from already-consensus sources, structured presence on consensus platforms — Wikipedia, Reddit, G2, Trustpilot, YouTube).

#72 Reflection-Iteration Modifier (Agentic-Specific)

Affects: G1b (re-planning), G2 (re-retrieval), G3 (re-filtering), G5 (re-citation). Operates across the whole agentic loop.
Evidence tier: A + B + C. Asai et al. Self-RAG (ICLR 2024 Oral): the model emits reflection tokens (Retrieve / IsRel / IsSup / IsUse) and decides on the fly whether to re-retrieve. Singh Survey §3 (arXiv:2501.09136 v4) catalogues Reflection as one of four Agentic Design Patterns. Wu et al. HiPRAG (arXiv:2510.07794) [Tier C: preprint, self-declared “under review”, venue unconfirmed] reports a 27% → 2.3% reduction in over-retrieval and a 29% under-retrieval floor, with overall accuracy of 65.4% (in-domain) / 67.2% (out-of-domain) on a process-level reward-shaped pipeline.
Mechanism: The agent’s critic emits a re-retrieval signal when the candidate-set quality falls below a threshold. Each iteration is an opportunity for your content to re-enter the candidate set — and an opportunity for it to be filtered out again.
Why a modifier and not a gate: Reflection-Iteration is not a separate decision point in the pipeline; it is a count of how many times the existing Gates 1b–5 are re-traversed. A brand can be invisible at iteration 1 and become visible at iteration 3 (or vice versa). The modifier captures iteration-stability rather than single-pass survival.
Operational path: Audit candidate-set membership across iterations, not only at iteration 1. Brands whose visibility is iteration-1-only are structurally fragile to changes in critic thresholds.
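The iteration audit described above can be sketched as a small membership profile. This assumes you can log the candidate set at each reflection iteration in a system you control; the function name `iteration_profile` and the "stability" ratio are hypothetical conveniences, not framework-defined metrics.

```python
def iteration_profile(candidate_sets: list[set[str]], url: str) -> dict:
    """Summarize in which reflection iterations a URL was in the candidate set.

    candidate_sets[i] is the set of URLs retrieved at iteration i + 1.
    """
    present = [url in candidates for candidates in candidate_sets]
    return {
        "present": present,
        # Share of iterations in which the URL survived (illustrative metric).
        "stability": sum(present) / len(present) if present else 0.0,
        # True when visibility exists at iteration 1 and nowhere else.
        "iteration_1_only": bool(present) and present[0] and not any(present[1:]),
    }

# Hypothetical log of three iterations for one prompt.
log = [{"a.com/x", "b.com/y"}, {"b.com/y"}, {"b.com/y", "c.com/z"}]
fragile = iteration_profile(log, "a.com/x")
late = iteration_profile(log, "c.com/z")
```

A URL flagged `iteration_1_only` is the structurally fragile case named above; a URL that appears only in later iterations is the opposite pattern and is worth tracking separately.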

Triangulation #72 Reflection-Iteration: Asai Self-RAG (Tier A) + Singh §3 (Tier C) + HiPRAG (Tier C) + King “Beyond RAG” (Tier E, COI: iPullRank). Triangulation met.

Production Reality: Single-LLM-Multi-Prompt, Not Multi-Agent

An important course-correction is required for any practitioner reading this article: the production architecture of agentic AI search in 2026 does not look like a constellation of communicating specialized agents. The best-documented pattern is a single large language model running tight loops with different prompts at each stage, plus tool calling. One epistemic note before the evidence: what follows is a working assumption grounded in converging public sources, not a measured market share — commercial providers disclose only part of their architectures, and no systematic deployment survey exists.

Three independent sources converge on this pattern:

Mike King (“Beyond RAG”, iPullRank, May 2026) (COI: iPullRank Founder/CEO): “Most production systems are not literal multi-agent constellations. They are a single LLM running tight loops with different prompts at each stage, plus tool calling. The ‘multi-agent’ framing is a presentation layer, not the underlying architecture.”
Anthropic, “Building effective agents” (Schluntz & Zhang, anthropic.com/research/building-effective-agents, 19 December 2024) [Tier E]: “Consistently, the most successful implementations weren’t using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.” And: “For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.”
Singh Survey §3.4 (arXiv:2501.09136 v4, April 2026): “While multi-agent collaboration offers significant potential, it is a less predictable design pattern compared to more mature workflows like Reflection and Tool Use.”

Singh’s full taxonomy in v4 deserves precise restatement, because earlier framings often simplified it: two macro-classes of Agentic RAG (Single-Agent, Multi-Agent), six concrete architecture patterns within them (Router, Multi-Agent Collaboration, Hierarchical, Corrective, Adaptive, Graph-Based), plus four cross-cutting Agentic Design Patterns (Reflection, Planning, Tool Use, Multi-Agent Collaboration — the last appears in both layers, a Singh-specific convention). For agentic-RAG diagnosis, the macro-class is “Single-Agent” across the production deployments publicly documented in 2026 — a pattern in the observable subset, not a census — with Reflection and Tool Use as the dominant cross-cutting patterns and Multi-Agent Collaboration as the least predictable.

Three sectors converge: practitioner-vendor (King) + vendor-engineering (Anthropic) + academic-survey (Singh §3.4). Triangulation met for the pattern; note that no Tier-A anchor exists (see Triangulation Audit) and the population-level claim remains a working assumption.

Practical implication. Diagnostic effort that targets imagined multi-agent architectures (e.g., “which agent rejected my brand?”) is wasted. The correct unit of diagnosis is the prompt-stage within a single-LLM loop: G1a-prompt, G1b-prompt, G2-retrieval-prompt, G3-rerank-prompt, G4-consensus-prompt, G5-citation-prompt. The Triage Protocol below operationalizes this stage-by-stage diagnosis.

Operationalization — The Generative Citation Score (GCS)

The Generative Citation Score (GCS) is a six-dimensional, Wilson-bounded citation-likelihood metric — one dimension per gate-component (SubQueryCov, RetrievalToCit, RefSurvival, Faithfulness, ToolInclusion, BridgeCentrality) — generalizing the classical one-dimensional gate-closure diagnostic into a likelihood score. Polarity, fixed here once: higher GCS = more citable. GCS is not a “gate-closure score”; a low value on a dimension indicates low citation likelihood there, and it is the gate diagnostic that then localizes the closed gate. The construction methodology follows Aggarwal et al. (GEO: Generative Engine Optimization, KDD ’24, DOI 10.1145/3637528.3671900), who established the legitimacy of user-defined visibility metrics for generative engines; the six-dimensional GCS itself is a proposed diagnostic score, pending the calibration and validation work scheduled for Article 7.

📌 The Six-Dimensional GCS — Definition

For each dimension d ∈ {SubQueryCov, RetrievalToCit, RefSurvival, Faithfulness, ToolInclusion, BridgeCentrality}, let n_d = sample size and k_d = observed positive events for that dimension.

Step 1 — Point estimate:

p̂_d = k_d / n_d

Step 2 — Wilson lower bound (95% CI):

Wilson_Lower(p̂_d) = [ p̂_d + z²/(2n_d) − z · √( p̂_d(1−p̂_d)/n_d + z²/(4n_d²) ) ] / [ 1 + z²/n_d ]

Step 3 — Per-dimension GCS (citability):

GCS_d = Wilson_Lower(p̂_d)

(high = citable / gate open; low = gate closed)

Step 4 — Composite GCS_total:

GCS_total = w · [ GCS_SubQ, GCS_RtoC, GCS_RefS, GCS_Faith, GCS_Tool, GCS_Bridge ]^T

where z = 1.96 for a 95% confidence interval, and w is a six-component weight vector. The default weight vector is deliberately unset. Empirical calibration is an open task scheduled for Article 7. The conservative Wilson lower bound means small samples pull GCS_d down — under-claiming citability rather than over-claiming it.
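Steps 1 to 4 translate directly into code. The sketch below is one possible implementation under stated assumptions: the example counts are illustrative, not measured data, and the equal weight vector is a placeholder chosen only because the default weights are deliberately unset. The names `wilson_lower` and `gcs` are ours.

```python
import math

Z_95 = 1.96  # z for a 95% interval, as in the definition above

DIMENSIONS = (
    "SubQueryCov", "RetrievalToCit", "RefSurvival",
    "Faithfulness", "ToolInclusion", "BridgeCentrality",
)

def wilson_lower(k: int, n: int, z: float = Z_95) -> float:
    """Step 2: Wilson lower bound for k positive events out of n samples."""
    if n <= 0:
        return 0.0  # ASSUMPTION: with no samples, claim no citability
    p_hat = k / n  # Step 1: point estimate
    centre = p_hat + z * z / (2 * n)
    spread = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, (centre - spread) / (1 + z * z / n))

def gcs(counts: dict[str, tuple[int, int]], weights: dict[str, float]):
    """Steps 3 and 4: per-dimension GCS and the weighted composite."""
    per_dim = {d: wilson_lower(*counts[d]) for d in DIMENSIONS}
    total = sum(weights[d] * per_dim[d] for d in DIMENSIONS)
    return per_dim, total

# Illustrative (k, n) counts per dimension; not measured data.
example_counts = {d: (12, 40) for d in DIMENSIONS}
# ASSUMPTION: equal weights, used only because no calibrated vector exists yet.
equal_weights = {d: 1 / 6 for d in DIMENSIONS}
per_dim, total = gcs(example_counts, equal_weights)
```

Because the lower bound shrinks as n falls, the same observed rate yields a lower GCS_d at n = 10 than at n = 100, which is the under-claiming behavior described above.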

Dimension-Level Tier-A Anchors (per-dimension three-source triangulation)

GCS Component	What it measures	Tier-A Anchor
SubQueryCov	Share of fanout sub-queries semantically matching content	Jeong Adaptive-RAG NAACL 2024; Trivedi IRCoT ACL 2023
RetrievalToCit	Retrieve → cite conversion rate	ALCE (Gao et al., arXiv:2305.14627, ACL 2023)
RefSurvival	Reference survival across iteration cycles	Asai Self-RAG ICLR 2024 (Oral)
Faithfulness	Causal vs. post-rationalized citation	Wallat ICTIR 2025 (Best Paper Honorable Mention); Sun ReDeEP ICLR 2025
ToolInclusion	Brand endpoint inclusion in tool registry	Schick Toolformer NeurIPS 2023; Lumer ScaleMCP (Tier E, COI: PwC)
BridgeCentrality	Citation-graph betweenness as bridge node	Algaba NAACL 2025 (Network Authority); multi-hop QA literature

Sampling Parameters — Three Tiers

The sample size you need depends on the decision you are making. This framework distinguishes three sampling tiers — do not spend decision-grade budget on a screening question, and do not make budget decisions on screening data:

Screening (n ≈ 10 per dimension): “Is there a signal at all?” A cheap first read. The ~5–10% MAE expectation at this tier is extrapolated from Graphite’s visibility experiments (~5.6% MAE at n=10 for its visibility construct); whether that error level transfers to the six GCS dimensions is untested. Treat screening reads as signal detection, not diagnosis.
Baseline (n = 30–40 per dimension): Per-dimension rates with usable Wilson intervals — the tier for tracking movement over time.
Decision-grade (minimum n = 200 prompts per dimension per content item): Required before spending remediation budget. Sampling window ≥ 9 days to absorb day-of-week and index-refresh variance (Sielinski R., arXiv:2603.08924, March 2026) [Tier C]. Bootstrap 95% CI on top of the Wilson interval when n is small at this tier (200 ≤ n < 500). Flag the diagnostic as inconclusive when the bootstrap CI of the test item overlaps the control item’s CI by > 50%.
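The decision-grade check can be sketched as a percentile bootstrap plus an overlap test. Two points are assumptions on our part: the text does not specify the bootstrap variant (a plain percentile bootstrap is used here), and "overlaps by > 50%" is read as the share of the test item's interval width that lies inside the control item's interval. The function names are hypothetical.

```python
import random

def bootstrap_ci(outcomes: list[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for a rate, from 0/1 outcomes per prompt."""
    rng = random.Random(seed)  # fixed seed so reruns are reproducible
    n = len(outcomes)
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    low = rates[int((alpha / 2) * n_resamples)]
    high = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

def overlap_fraction(test_ci: tuple[float, float],
                     control_ci: tuple[float, float]) -> float:
    """ASSUMPTION: overlap is measured relative to the test interval's width."""
    width = test_ci[1] - test_ci[0]
    if width <= 0:
        return 1.0 if control_ci[0] <= test_ci[0] <= control_ci[1] else 0.0
    shared = min(test_ci[1], control_ci[1]) - max(test_ci[0], control_ci[0])
    return max(0.0, shared) / width

def is_inconclusive(test_ci, control_ci, threshold: float = 0.5) -> bool:
    """Flag the diagnostic when the intervals overlap by more than the threshold."""
    return overlap_fraction(test_ci, control_ci) > threshold

# Illustrative outcomes for n = 200 prompts (1 = positive event); not real data.
test_item = [1] * 100 + [0] * 100
ci = bootstrap_ci(test_item)
```

Other overlap definitions (for example, relative to the union of both intervals) are equally defensible; whichever is used should be reported alongside the result.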

Triangulation for GCS construction methodology: King 6 Operations Metrics (iPullRank, May 2026) (COI: iPullRank Founder/CEO — used as practitioner-side anchor, not as load-bearing evidence) + Wilson Score (DAE-internal, derived from Wilson 1927; Brown, Cai & DasGupta 2001; Cao arXiv:1809.07694) + Aggarwal GEO KDD 2024 (Tier A; user-defined visibility metric framework). Three sectors: practitioner-operations, statistical-method, peer-reviewed-methodology. Triangulation met for the construction-method choice; the load-bearing peer-reviewed anchor is Aggarwal KDD 2024 (Tier A) plus the Wilson-statistics-methodology layer.

The Wilson choice in practice, and what it does and does not license: Druck & Smith / Graphite (“Demystifying Randomness in AI”, 2026) (COI: Graphite is an AEO agency selling visibility-measurement services) applied Wilson-Score binomial confidence intervals plus Sequential Sampling to over 200,000 LLM responses across gpt-5.2-chat-latest, ChatGPT-Logged-Out, and Gemini-Logged-Out conditions on 200 entity-comparison prompts × 400 responses. Their empirical findings on the Wilson-Score-plus-Sequential-Sampling workflow: Visibility is estimable with n=10 at Mean Absolute Error ~5.6% across entities; Sequential Sampling reduces required responses from a fixed 60 to an average of 29.4 (a 51% efficiency gain) without loss of CI tightness; the median ratio of observed-to-expected variance is ~1.02, which is consistent with independent API calls behaving as independent draws from the same distribution. The Graphite paper is not peer-reviewed and the author team has a commercial COI, but its interval choice is consistent with the peer-reviewed statistics literature: Bowyer, Aitchison & Ivanova (ICML 2025 Position Paper Track, Spotlight) argue against CLT-based confidence intervals at small n and recommend using Wilson-Score or Bayesian intervals in practice. That is support for Wilson as an appropriate small-n interval method for binomial proportions; it is not a validation of the GCS construct, its six dimensions, or their aggregation. Appropriate uncertainty estimation and construct validation are two different things; GCS currently has the former and awaits the latter (Article 7).

Two operational consequences for GCS users:

n ≈ 10 per dimension is a legitimate screening tier: Graphite reports ~5.6% MAE at n=10 for its visibility construct, and the transfer of that error level to the six GCS dimensions is untested — treat screening reads as signal detection, not diagnosis (see Sampling Parameters above).
The API-vs-Logged-Out cosine similarity of 0.48 reported by Graphite means GCS measured purely on API calls is at best a parallel-reality estimate of the user-facing reality — practitioners running GCS should report which condition they sampled and ideally sample both.
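If you sample both conditions, the same kind of comparison can be run on your own data. The sketch below assumes each condition is summarized as a per-entity visibility rate; how Graphite constructed the vectors behind its 0.48 figure is not specified here, so this is an analogous check rather than a reproduction. Entity names and rates are invented for illustration.

```python
import math

def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two per-entity visibility profiles.

    Entities missing from one condition are treated as visibility 0.0.
    """
    entities = sorted(set(a) | set(b))
    va = [a.get(e, 0.0) for e in entities]
    vb = [b.get(e, 0.0) for e in entities]
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # ASSUMPTION: an empty profile is reported as no similarity
    return dot / (norm_a * norm_b)

# Hypothetical visibility rates per entity under each sampling condition.
api_condition = {"BrandA": 0.8, "BrandB": 0.1, "BrandC": 0.0}
logged_out_condition = {"BrandA": 0.2, "BrandB": 0.6, "BrandD": 0.3}
similarity = cosine_similarity(api_condition, logged_out_condition)
```

A low value on your own prompt set is a reason to report both conditions separately instead of pooling them into one GCS estimate.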

What GCS does and does not do. GCS measures, per dimension, how likely your content is to pass a gate — and with what confidence. A low dimension score localizes the closed gate. It does not, by itself, tell you why. The Six-Step Triage Protocol below uses GCS as input.

The Diagnostic Tool — Six-Step Triage Protocol

The classical Five-Gate protocol covers five steps; this framework’s six-step version adds Faithfulness (Step 6) and refactors Steps 2–3 to reflect the G1a/G1b split.

Step 1 — Establish Baseline + GCS_total

Build a prompt set of n ≥ 200 prompts representative of your buyer’s journey (informational, comparative, transactional, branded). Sample over ≥ 9 days. Run the same set across at least four platforms (ChatGPT, Gemini, Perplexity, Google AI Mode). Compute GCS per dimension. Two heuristic triage thresholds — chosen for triage, not empirically calibrated; calibration is scheduled with the weight-vector work in Article 7: if GCS_total > 0.7 with tight CI, you have no citation problem at the level you’re measuring. If GCS_total < 0.3 with tight CI, you have a problem — and the next five steps localize it.
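The two triage thresholds can be written down as a small decision rule. In this sketch the 0.7 and 0.3 cut-offs come from the text; what counts as a "tight" CI is not defined there, so the `max_width` of 0.10 is our assumption, as is leaving the computation of the composite's interval to the caller.

```python
def triage(gcs_total: float, ci_low: float, ci_high: float,
           max_width: float = 0.10) -> str:
    """Map GCS_total and its CI onto the heuristic Step 1 outcomes."""
    # ASSUMPTION: "tight CI" means an interval no wider than max_width.
    if ci_high - ci_low > max_width:
        return "inconclusive"  # sample more before deciding anything
    if gcs_total > 0.7:
        return "no_problem"    # no citation problem at the measured level
    if gcs_total < 0.3:
        return "problem"       # continue with Steps 2 to 6 to localize it
    return "grey_zone"         # between thresholds: inspect per-dimension GCS

# Illustrative inputs; not measured values.
verdict = triage(gcs_total=0.2, ci_low=0.17, ci_high=0.23)
```

The grey zone between 0.3 and 0.7 is not addressed by the thresholds; reading the per-dimension scores there is a suggestion, not a calibrated rule.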

Step 2 — Sub-Query Coverage Audit (G1b)

For each prompt in your set, inspect (where the platform exposes it) the planner’s fanout sub-queries. In platforms that do not expose the fanout (the majority), use Lily-Ray-style reverse engineering: prompt the platform multiple times with controlled variations of the surface query and observe which sub-topic angles produce citations. If your content matches the surface query but no sub-query, G1b is closed. Remediation: expand topical coverage (multiple semantically distinct pages per topic cluster); see #64 Topical Authority.
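Once sub-queries have been observed or reverse-engineered, the coverage count can be sketched as below. Token containment is used as a crude, loudly labeled stand-in for semantic matching (an embedding model would be the more faithful choice), and the 0.6 threshold is arbitrary. The function returns (k, n) so the result can feed the Wilson bound for SubQueryCov.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def containment(query: str, page_text: str) -> float:
    """Share of the sub-query's tokens that occur in the page text."""
    query_tokens = tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & tokens(page_text)) / len(query_tokens)

def sub_query_coverage(sub_queries: list[str], pages: list[str],
                       threshold: float = 0.6) -> tuple[int, int]:
    """Return (k, n): sub-queries matched by at least one page, and the total.

    ASSUMPTION: lexical containment approximates a semantic match.
    """
    matched = sum(
        any(containment(q, page) >= threshold for page in pages)
        for q in sub_queries
    )
    return matched, len(sub_queries)

# Hypothetical fanout sub-queries and one page of your own content.
fanout = ["best clothing brands reddit 2026", "durable boots review"]
my_pages = ["Best clothing brands on Reddit, ranked for 2026"]
k, n = sub_query_coverage(fanout, my_pages)
```

A surface-query match with k = 0 on the fanout is the G1b-closed pattern described above.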

Step 3 — Anchoring Probe (G1a)

Prompt the model directly: “List the top 10 brands in [your category]. Do not search the web.” This is a probe, not a validated test: results are confounded by prompt wording, model version, decoding settings, and instruction-following behavior — repeat it across phrasings, sessions, and models before concluding anything. If your brand consistently does not appear — the EntityResolve read stays below 0.5 — treat Gate 1a as closed. Remediation lives in Entity Authority (#63) and Network Authority (#66) work: Wikipedia/Wikidata, structured entity data, cross-domain co-occurrence.
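Repeating the probe only helps if the repetitions are aggregated consistently. The sketch below assumes you have already collected response texts across phrasings, sessions, and models. Treating the EntityResolve read as the plain share of responses that mention the brand is our assumption, since this section does not define how that read is computed.

```python
def anchoring_rate(responses: list[str], brand: str) -> float:
    """Share of no-search responses that mention the brand at all.

    ASSUMPTION: a case-insensitive substring match counts as an appearance;
    this can over-count brands whose name is a common word.
    """
    if not responses:
        return 0.0
    hits = sum(brand.lower() in response.lower() for response in responses)
    return hits / len(responses)

def gate_1a_looks_closed(responses: list[str], brand: str,
                         threshold: float = 0.5) -> bool:
    """Apply the 0.5 cut-off from the text to the aggregated probe results."""
    return anchoring_rate(responses, brand) < threshold

# Hypothetical responses from four probe runs; not real model output.
runs = [
    "1. Acme 2. Globex 3. Initech",
    "Top brands include Globex and Initech.",
    "Globex, Initech, Umbrella",
    "acme is often mentioned alongside Globex.",
]
rate = anchoring_rate(runs, "Acme")
```

Keeping the raw responses alongside the rate makes it possible to re-check borderline matches by hand.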

Step 4 — Retrievability Probe (G2)

Run the same prompts in a system you can introspect (Perplexity with source-list, or a self-built RAG using Bing/Google APIs, or a self-built MCP-enabled agent). Are your URLs (or your tool endpoints) in the candidate set at all? If not — Gate 2 is closed in that system. One epistemic boundary applies: a probe system measures retrievability in the system you built or can inspect; mapping a negative result onto a proprietary platform’s Gate 2 is an inference — indices, retrievers, and rerankers differ (see Open Question 10). Treat probe results as observed behavioral failure, not as an identified internal mechanism. Remediation is technical SEO, llms.txt, bot-policy review, and indexability — not content marketing. For tool-substrate brands: audit MCP registry presence, OpenAPI discoverability, and function-call schema completeness.
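The bot-policy part of this step can be checked offline with Python's standard-library robots.txt parser. This is a minimal sketch: the user-agent tokens listed are ones commonly published by AI vendors and should be verified against each vendor's current documentation, and a permissive robots.txt only shows that crawling is allowed, not that a page is actually in any index.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: tokens as commonly published; confirm against vendor docs.
AI_BOTS = ("GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot")

def bot_access(robots_txt: str, url: str, bots=AI_BOTS) -> dict[str, bool]:
    """Report which bots the given robots.txt content allows to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    return {bot: parser.can_fetch(bot, url) for bot in bots}

# Illustrative policy: one AI bot blocked, everything else allowed.
sample_policy = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
access = bot_access(sample_policy, "https://example.com/guide")
```

An asymmetric result such as this one (one bot blocked, others allowed) is worth recording per platform, because Gate 2 can be closed for one system and open for another.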

Step 5 — Credibility + Consensus Differential (G3 / G4)

When your URL is in the candidate set but does not appear in the answer:

If competitor URLs appear in similar candidate sets and are cited, while yours is not → Gate 3 (credibility) is closed. Remediation: Structural Authority (#67) — schema, citations, structure — plus Content Authority (#65) — extractability, lead-with-answers, fact density. Aggarwal-GEO operations: +41% Statistics Addition, +115% Cite-Sources Addition, +28% Quotation Addition on position-5 content.
If competitor URLs appear in the answer but draw from a different cluster of co-cited sources → Gate 4 (consensus) is closed. Remediation: Reputational (#68) + Network (#66) — third-party coverage, presence on consensus platforms (Wikipedia, Reddit, G2, Trustpilot, YouTube).
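The G3/G4 differential can be sketched as a rule over sets of URLs. This operationalization is our interpretation, not a validated test: "different cluster of co-cited sources" is approximated by the Jaccard overlap between the sources cited in the answer and the sources that usually appear alongside your content, and both the `my_co_cited` input and the 0.2 threshold are hypothetical.

```python
def diagnose_uncited(my_url: str, candidates: set[str], cited: set[str],
                     my_co_cited: set[str], min_overlap: float = 0.2) -> str:
    """Classify one observation from a system whose candidate set you can see."""
    if my_url not in candidates:
        return "G2"            # not retrieved at all: back to Step 4
    if my_url in cited:
        return "cited"         # nothing to localize here: continue with Step 6
    if not cited:
        return "undetermined"  # no competitor was cited either
    # ASSUMPTION: Jaccard overlap approximates "same cluster of sources".
    overlap = len(cited & my_co_cited) / len(cited | my_co_cited)
    # Same neighbourhood but passed over suggests credibility (G3);
    # a different neighbourhood suggests the consensus pool (G4).
    return "G3" if overlap >= min_overlap else "G4"

# Hypothetical observation for one prompt.
verdict = diagnose_uncited(
    my_url="me.com/page",
    candidates={"me.com/page", "rival.com/a", "forum.com/t"},
    cited={"rival.com/a", "forum.com/t"},
    my_co_cited={"rival.com/a", "forum.com/t", "wiki.org/x"},
)
```

A single observation is weak evidence; the verdict is more useful as a tally across the full prompt set.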

Step 6 — Faithfulness Check (G5)

Even when your URL does appear in the answer with a citation marker, audit whether the citation is causal or post-rationalized. Use the Wallat-style adversarial probe: replace your URL in the retrieval context with a semantically similar but distinct URL, regenerate, and check whether the citation marker moves. If the model still attributes the claim to your URL despite the URL being absent from the retrieval context, the citation is post-rationalized — your brand is technically “cited” but functionally not driving the answer.
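In a pipeline you control, the swap probe can be sketched as a small harness. The `generate` callable is hypothetical: it stands for any wrapper that takes the retrieval context and returns the set of cited URLs. The two stub generators exist only to show both outcomes and do not model any real system.

```python
from typing import Callable

Doc = dict[str, str]                        # {"url": ..., "text": ...}
Generate = Callable[[list[Doc]], set[str]]  # context -> cited URLs (hypothetical)

def swap_probe(generate: Generate, context: list[Doc],
               my_url: str, decoy: Doc) -> str:
    """Replace my_url's document with a similar decoy and see where the citation goes."""
    if my_url not in generate(context):
        return "not_cited"           # nothing to test for this prompt
    swapped = [decoy if doc["url"] == my_url else doc for doc in context]
    if my_url in generate(swapped):
        return "post_rationalized"   # cited although absent from the context
    return "context_driven"          # the citation followed the retrieved document

# Stub generators for illustration only.
def grounded(context: list[Doc]) -> set[str]:
    return {doc["url"] for doc in context}   # cites only what it was given

def parametric(context: list[Doc]) -> set[str]:
    return {"https://example.com/me"}        # always cites from priors

context = [{"url": "https://example.com/me", "text": "claim"},
           {"url": "https://example.org/other", "text": "background"}]
decoy = {"url": "https://example.net/decoy", "text": "claim, reworded"}
result = swap_probe(parametric, context, "https://example.com/me", decoy)
```

Because generation is stochastic, a single swap is only indicative; repeating the probe per prompt and reporting the rate fits the sampling guidance above.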

Investment Priority Matrix

Closed Gate	First investment (Tier-A-grounded)	Second investment	Avoid
G1a (Anchoring)	Wikipedia/Wikidata + entity disambiguation	Citation-network seeding (#66)	Schema-only campaigns (Search Atlas + Ahrefs null results)
G1b (Fan-Out)	Topical coverage breadth — multiple pages per sub-topic	Anticipate planner’s likely sub-queries	Single “canonical” pages without sub-topic structure
G2 (Retrievability)	Indexability, bot policy, llms.txt, technical SEO	MCP/OpenAPI endpoint exposure for tool-substrate	More on-domain prose content
G3 (Credibility)	Restructure for extractability (Aggarwal +41% / +115%)	Sourced statistics + quotations in content	Style-only rewrites
G4 (Consensus)	Earned third-party presence on consensus platforms	Reviews / review-platform profile build	Brand-controlled “thought leadership” only
G5 Survival (Citation)	Use systems that prefer P-Cite (most production)	Multi-model consensus (Naser 95.6%)	One-platform optimization
G5 Faithfulness	Make citations causally necessary (unique data the model cannot post-rationalize from priors)	Audit prompt patterns where competitors get post-rationalized citations	Bulk citations of widely-available facts

Architectural Variations

The Five Gates × Two Channels are universal in shape but their strictness varies across architectures.

Pure-RAG systems (Perplexity, ChatGPT Search with web tools enabled, Google AI Mode): G1b–G5 dominate; G1a is partially bypassed by aggressive retrieval. Network Authority’s G2 expression is highest here. Tool-substrate negligible.
Pure-parametric systems (raw GPT-4o without web tools, Claude without document upload): G1a dominates; G1b–G4 are skipped; G5 simplifies to “does the model produce a verifiable URL or hallucinate one?” Naser’s audit found hallucination rates spanning a fivefold range, 11.4–56.8%, depending on model and domain.
Hybrid systems (GPT-4o + light retrieval, Gemini Deep Research, Claude Projects with Web Search): All five gates active. Most production behavior in 2026 falls here.
Agentic systems (autonomous deep-research agents, MCP-enabled assistants, Cowork-class file-and-task automation): G1a–G5 are iterated under the Reflection-Iteration Modifier (#72). Tool-substrate is dominant; text-substrate often acts only as fallback when no tool matches. The Singh-Survey taxonomy’s “Adaptive” and “Corrective” patterns describe this architectural family. Empirical citation behavior of agentic stacks is still under-measured outside controlled benchmarks.

Independent Industry Validation

The Five-Gate × Two-Channel structure and the Multi-Authority-Stack diagnosis are framework constructs: they describe failure-mode topology, not a directly falsifiable mechanism. The strongest available external evidence that the topology is real comes when independent practitioners arrive at the same diagnosis without knowledge of the framework.

Dana Billingsley (AI Discovery Intelligence; AI Search & AI Visibility specialist), commenting on Malte Landwehr’s GummySearch analysis (LinkedIn-Article, 26 May 2026), described the same phenomenon in different vocabulary:

“What is especially interesting here is not just the Reddit association itself, but the structure of the recommendation environment being created around it. These pages package: socially validated recommendations, buyer-intent phrasing, comparative context, quote-backed reinforcement, explicit recommendation formatting — into something extremely easy for retrieval and synthesis systems to process. It feels less like classic ‘ranking’ behavior and more like AI systems reinforcing environments that already resemble consensus-oriented recommendation layers. That may end up being the more important takeaway long term than the specific Reddit tactic itself.”

— Dana Billingsley, comment on Landwehr LinkedIn-Article, 26 May 2026 — independent specialist comment

Billingsley’s elements map cleanly onto the framework:

Billingsley’s element	DAE-Framework Mapping
“socially validated recommendations”	#68 Reputational Authority
“buyer-intent phrasing”	#64 Topical Authority (intent-mapped)
“comparative context”	#66 Network Authority (co-citation context)
“quote-backed reinforcement”	#65 Content Authority + #71 Consensus Modifier
“explicit recommendation formatting”	#67 Structural Authority
“consensus-oriented recommendation layers”	Gate 4 (Consensus Pool) + #71 Consensus Modifier
“less like classic ‘ranking’ behavior”	Near-binary failure-mode thesis (cliff-shape — an open empirical question; see Core Definition) (Article 4 §Five Gates)

Five of six Authority Types, the Consensus Modifier, and the cliff-shape thesis are independently named by Billingsley in her own analytic language, without reference to the DAE framework. This is external triangulation of the framework’s structural claims: Tier D (industry voice), not peer-reviewed, but evidentially independent. The same article’s comment thread contains Lily Ray’s complementary observation about the non-stationary nature of fanout-planning behavior (cited in §G1b above), giving the validation set two independent practitioner observations within one discussion.

Honest Limitations

The framework is operational, not finished. Eleven open questions are worth flagging by name, because they are the places where the framework’s confidence is lowest.

Open question 1 — Network Authority dual-assignment is theory-led, not single-paper-confirmed. The reclassification of #66 across G1a + G2 + G4 rests on three independent peer-reviewed lines of evidence (Algaba NAACL Findings 2025; Algaba arXiv:2504.02767 April 2025 follow-up — 274K samples, vendor-independent; Yang & Menczer arXiv:2304.00228) plus the three-source triangulation requirement. No single study simultaneously tests all three gates for #66. A targeted experiment that does so would either confirm or refute the dual-assignment; the framework treats the assignment as the best current synthesis and is open to revision.

Open question 2 — The Six-Dimensional GCS is statistically established but not AI-citation-established. The Wilson interval (1927) and the Wilson-Lower-Bound-for-ranking technique (Cao X. arXiv:1809.07694, 2018) are well-established statistical instruments. The Aggarwal et al. KDD 2024 framework legitimizes user-defined visibility metrics for generative engines. But the specific six-dimensional aggregation in GCS has not been peer-reviewed as the standard metric for agentic-RAG diagnostics — nor has its construct validity (do the six dimensions measure what they claim?), criterion validity, or test-retest reliability been established. Appropriate confidence intervals are uncertainty estimation, not construct validation. Practitioners should report the underlying p̂, n, and 95% CI per dimension alongside any GCS_total number so downstream readers can verify the calculation. Default weights are deliberately unset and will be empirically calibrated in Article 7.

Open question 3 — #72 Reflection-Iteration Modifier is not isolated in production-RAG pipelines. Asai Self-RAG, Singh §3, HiPRAG, and King “Beyond RAG” all describe Reflection as a real architectural pattern, but no peer-reviewed paper isolates the citation-visibility effect of Reflection-Iteration count in a production RAG pipeline (as opposed to Self-RAG’s controlled benchmark). The modifier is included in this framework because the qualitative evidence is strong; the quantitative magnitude is open.

Open question 4 — Temporal Modifier (#69) magnitude in agentic pipelines is unknown. Yubo Fang et al. (SIGIR APIR 2025) established the causal-direction in classic IR settings. Whether the same magnitudes apply when fanout-planning + reflection-iteration sit between the user query and the retrieval step is empirically untested.

Open question 5 — Saxena G-Cite vs. P-Cite trade-off is Tier C. arXiv:2509.21557 v2 is a workshop preprint whose acceptance could not be externally verified as of 28 May 2026. The numerical findings (37%/75% Coverage, 21%/42% Citation Correctness on ALCE) are directionally consistent with the framework’s qualitative claims but should be treated as quantitatively provisional pending main-conference replication.

Open question 6 — Tier-D industry findings have documented methodology but generalizability constraints. Three industry-side anchors in this framework are Tier D rather than peer-reviewed, each with a distinct generalizability boundary that practitioners should hold in mind.

Bettinga (LinkedIn-robots.txt asymmetry) and Landwehr/Peec AI (8 May 2026 Reddit/GummySearch fanout-inflection): Diagnostic illustrations, not peer-reviewed claims. The robots.txt directives are no longer taken from Bettinga’s post — they are first-hand verified against the live LinkedIn robots.txt (28 May 2026; see Primary Sources), so that part of the analysis rests on a primary source rather than a practitioner report. The Peec AI citation telemetry remains proprietary and not externally replicated.
Trustpilot / Seer Interactive 1% → 75.3% magnitude: The PR Newswire methodology block states only “a range of products and services” — specific industry verticals are not disclosed. Trustpilot’s 361 million review base concentrates in consumer-facing brands (retail/e-commerce, travel, financial services, hospitality), and the magnitude is empirically established for those verticals. The structural finding (review platforms close Gate 2 via organic-search ranking) generalizes; the magnitude does not transfer 1:1 to verticals where Trustpilot is not the dominant review platform — B2B SaaS (G2/Capterra/TrustRadius), healthcare (Healthgrades/Vitals/ZocDoc/Jameda), local services (Google Reviews/Yelp). Practitioners should treat 1%/53.5%/75.3% as industry-conditional anchors for consumer-facing brands, not as universal benchmarks. Vertical-specific replication on the respective dominant review platforms would be required to establish per-vertical magnitudes.
Cummins/Ramp marketing-incentives-to-AI-agents (April 2026): Targeting confound between the three variants is acknowledged by the authors; format-effect is not isolable from targeting-effect in this design. The “agent trust” finding (existing-citation-volume as a content-surfacing prerequisite) is robust against the targeting issue and triangulates with #66 Network Authority’s dual-assignment; the format-variant finding (Markdown won) does not survive the targeting confound as a clean causal claim.

Open question 7 — Million-token-era applicability of pre-2026 numerical findings. Several numerical anchors in this article were measured on pre-2026 models with 128K-or-smaller context windows: Wallat ICTIR 2025 (Cohere Command-R+ at 128K), Aggarwal GEO KDD 2024 (GPT-3.5/GPT-4-era), Pan CAG EMNLP 2024 (similar generation), and the bulk of ALCE/LongBench-Cite benchmarks. The conceptual claims — Faithfulness as a separate axis from Survival (Wallat), content-authority operations as Gate-3 levers (Aggarwal), credibility-aware generation as a trainable behavior (Pan) — are architectural and transfer to million-token-era models. The specific magnitudes (Wallat 57%, Aggarwal +41% / +115% / +28%, Pan’s vanilla-RAG performance gap) are pre-2026 snapshots and are not 1:1 transferable to Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro, or other 1M+-context models. The mechanistic prediction is that some failure modes (post-rationalization under context-budget pressure, lead-position bias under retrieval truncation) attenuate at million-token context; other failure modes (consensus-pool dynamics at Gate 4, faithfulness of attribution at Gate 5) may be largely unchanged. Article 7’s cross-model calibration will measure these explicitly. Practitioners should treat the pre-2026 numerical findings as architectural-direction evidence, not as production-model-current magnitudes.

Open question 8 — Per-model variance in Gate-4/Gate-5 behavior is large and mechanistically unexplained. Grace Cummins / Ramp (April 2026; COI: vendor self-published on builders.ramp.com) documented a striking per-model disparity in how content surfaces across LLMs even when all three platforms’ bots crawl the same content: Perplexity surfaced an embedded incentive vaguely from day 2; Claude surfaced it specifically (brand name, exact amount, tracked URL, step-by-step instructions) starting day 12 with a ~4× step-change at week 3; ChatGPT crawled the page but produced zero surface mentions across the entire 32-day window. The framework locates the disparity at Gate 4 (consensus pool composition) and Gate 5 (generation faithfulness filter), but it currently has no peer-reviewed mechanistic account of why ChatGPT’s Gate-4 / Gate-5 filter behaves so differently from Claude’s on identical input content. The Claude step-change at week 3 (no external cause identified, no model release) is itself an open phenomenon: plausibly index-refresh dynamics or trust-aggregation crossover, not yet characterized in the literature. Article 7’s multi-platform Cross-AI Coverage benchmark will quantify the dispersion across the five most-cited platforms; the mechanism behind the dispersion remains an open empirical question.

Open question 9 — Falsifiability of the five-way decomposition itself. What observation would falsify the Five-Gate model? The honest answer today: the decomposition is a diagnostic modeling choice, and no comparative evaluation against alternative decompositions has been run — for example a three-stage model (retrieval → integration → attribution) or a seven-stage model that separately treats indexing, reranking, and citation formatting. A benchmark comparing the predictive and diagnostic utility of competing decompositions on the same source–query pairs is the required experiment. Until it exists, evidence for individual gate mechanisms should not be read as evidence for the five-way factorization itself.

Open question 10 — Gate-construct independence and the ground truth for “closed”. Can a source fail Gate 2 but pass Gate 3? Downstream gates are only observable conditional on upstream passage, which makes the individual gate constructs hard to validate independently. And there is no internal ground truth for a “closed” gate: every diagnosis in this article is an inference from external behavior — observed behavioral failure, not an identified internal mechanism. This is why the Triage Protocol’s probe steps (Steps 3–4) carry explicit inference caveats, and why the framework describes failure-mode topology rather than vendor internals.

Open question 11 — Channel independence is asserted architecturally, not measured statistically. The two channels are architecturally distinct (ReDeEP’s FFN/Copying-Head decomposition), and individual diagnostic cases dissociate cleanly (parametric open, retrieval closed). Whether the two channels’ citation outcomes are statistically independent at population level — the strong reading of “pass and fail independently” — has not been measured. It is a defined measurement task for Article 7’s cross-platform benchmark.
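A minimal sketch of what such a population-level check could look like, assuming each probe yields one binary outcome per channel (open or closed). The function name, the 2×2 layout, and the counts are hypothetical illustrations, not Article 7’s methodology and not measured data. It computes the phi coefficient and an uncorrected chi-square test of independence using only the standard library.

```python
import math


def channel_association(both_open, parametric_only, retrieval_only, both_closed):
    """Phi coefficient and chi-square test (1 d.o.f.) for a 2x2 table.

    Rows: parametric channel open / closed.
    Columns: retrieval channel open / closed.
    """
    a, b, c, d = both_open, parametric_only, retrieval_only, both_closed
    n = a + b + c + d
    row_open, row_closed = a + b, c + d   # parametric marginals
    col_open, col_closed = a + c, b + d   # retrieval marginals
    denom = row_open * row_closed * col_open * col_closed
    if denom == 0:
        raise ValueError("each channel needs at least one open and one closed case")
    phi = (a * d - b * c) / math.sqrt(denom)
    chi2 = n * phi ** 2
    # Survival function of chi-square with 1 d.o.f.: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return phi, chi2, p_value


# Hypothetical counts, for illustration only.
coupled = channel_association(40, 10, 10, 40)      # outcomes move together
independent = channel_association(25, 25, 25, 25)  # no association

for label, (phi, chi2, p) in (("coupled", coupled), ("independent", independent)):
    print(f"{label}: phi={phi:.2f} chi2={chi2:.1f} p={p:.3g}")
```

A phi near zero with a large p-value is what the strong reading of “pass and fail independently” would predict. The chi-square approximation becomes unreliable when expected cell counts are small (a common rule of thumb is below 5), in which case an exact test is the safer choice.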

Two additional caveats apply to the entire framework:

The mapping table is post-hoc-rationalist in the same sense any classification is. We do not claim the model “knows” about gates; we claim the gates are an accurate description of the failure-mode topology as observed from outside.
The architectures keep moving. GPT-5.2, Gemini 3, and Claude 4.x all shipped substantial behavior changes in Q1–Q2 2026. The gate topology has been stable for ~18 months; specific gate-pass probabilities are not. Re-run GCS against your current platforms quarterly.

What Comes Next

Article 7 (forthcoming) will turn the Six-Step Triage Protocol into a public, replicable methodology with code, prompt templates, and an empirical calibration of the GCS weight vector across five verticals (B2B SaaS, healthcare, finance, legal, consumer retail). Article 7 will also publish the first multi-platform Cross-AI Coverage benchmark for the DAE framework.

The framework’s claim in this article is one of scope, not completeness. The five gates × two channels, the six authority types with the Tool/Endpoint sub-type, and the four cross-cutting modifiers are the most complete description our triangulated evidence currently supports of where citation outcomes are decided in agentic-RAG systems as of mid-2026 — but known factors sit outside the model by design: policy and safety filtering, source-licensing constraints, personalization, geographic and language routing, diversity constraints, and latency/cost optimization. Where those factors dominate, this coordinate system will misdiagnose. The framework’s claim of operationalization is bounded — GCS is the proposed metric, Article 7 will be the calibration.

Frequently Asked Questions

Q1. Is Tool/Endpoint Authority a separate channel?

No. Tool surfaces are a substrate type within the Retrieval channel, governed by #67 Structural Authority as a sub-type. The Two-Channel structure (Parametric / Retrieval) remains the mechanistic decomposition, anchored to Sun et al. ReDeEP (Knowledge FFNs vs. Copying Heads). Treating tool surfaces as a third channel was considered and rejected because the underlying transformer architecture does not differentiate text-token retrieval from tool-output-token retrieval at the residual-stream level — they go through the same Copying Heads.

Q2. Why are the modifiers aligned to Article 4’s set (#69–#71)?

Because Article 4 is the live, authoritatively published anchor of the DAE series (3 April 2026) and established the modifier numbering (#69 Temporal, #70 Platform, #71 Consensus). This article adds one new modifier (#72 Reflection-Iteration) that is agentic-pipeline-specific. “Authority Density” and “Multimodal Surface”, sometimes proposed as standalone modifiers, are not maintained here as separate modifiers; they are sub-concepts under #66 Network Authority and #67 Structural Authority respectively.

Q3. Saxena et al. — is it NeurIPS or not?

arXiv:2509.21557 v2 (18 Dec 2025) was submitted to the NeurIPS 2025 LLM Evaluation Workshop. Workshop acceptance could not be externally verified as of 28 May 2026. The framework treats it as Tier C (preprint with claimed workshop submission, no main-track peer review). All numerical claims from the paper are quoted against Tables 2–4 of the v2 PDF; magnitudes are directionally robust but quantitatively provisional.

Q4. Is the 57% post-rationalization rate real?

Yes, but with context. Wallat et al. (ICTIR 2025, Best Paper Honorable Mention) measured 57% as the upper-bound on Cohere Command-R+ / NaturalQuestions in the relevant-but-uncited-document adversarial condition. The random-adversarial baseline was 12%. The figure is not a universal RAG failure rate — it is the failure rate under a specific adversarial probe, on a specific model, on a specific benchmark. The implication is structural (Faithfulness is a separate axis from Survival), not numerical (do not extrapolate “57% of all RAG citations are fake”).

Q5. What is the GummySearch case really showing?

A page that hits five Authority Types and one Modifier simultaneously can ride a temporary platform-specific fanout configuration to substantial citation share (0.005% → 0.1% in one week). The lift is real and structurally explainable. The moat is structurally fragile — it depends on ChatGPT continuing to inject “reddit” into fanout sub-queries at the May 2026 rate. Per Lily Ray’s caveat: “might not work forever.”

Q6. Does posting on LinkedIn help AI Search visibility?

It depends on the channel — and the precise answer is narrower than a blanket no. LinkedIn’s robots.txt fully blocks AI training and live-fetch crawlers (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, and ~20 others) plus a catch-all Disallow: / — so Claude, Perplexity, and Gemini-grounding have no robots-level path to LinkedIn content. But OpenAI’s OAI-SearchBot and Google’s Googlebot are only path-restricted, not blocked, so ChatGPT-search and Google AI Overviews/AI Mode can technically still index LinkedIn /posts/ and /pulse/. Where those channels are open, the harder constraint is Gate 1, not Gate 2: LinkedIn posts share an identical Title-Tag pattern that prevents G1a anchoring (newsletters under /pulse/ are the partial exception, with individualized titles). For sustained visibility the framework’s recommendation is unchanged: host on your own domain and use LinkedIn as a distribution channel, not as the canonical publication surface.
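The robots-level part of this answer can be checked mechanically. The sketch below uses Python’s standard-library `urllib.robotparser` against a hypothetical, heavily simplified robots.txt that only mimics the pattern described above (one fully blocked agent, one path-restricted agent, a catch-all block). The file content, the `example.com` URLs, and the helper name are assumptions for illustration; it is not a copy of LinkedIn’s live file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt modeled on the pattern described in the text.
# It is NOT the live LinkedIn robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /posts/
Allow: /pulse/
Disallow: /

User-agent: *
Disallow: /
"""


def robots_access(robots_txt, agents, url):
    """Return {agent: allowed} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


agents = ["GPTBot", "OAI-SearchBot", "SomeOtherBot"]  # last one hits the catch-all
post_access = robots_access(ROBOTS_TXT, agents, "https://www.example.com/posts/some-post")
profile_access = robots_access(ROBOTS_TXT, agents, "https://www.example.com/in/some-profile")

print("posts:  ", post_access)
print("profile:", profile_access)
```

Python’s parser applies the first matching rule in file order, which is why the `Allow` lines precede `Disallow: /` in the sample; crawlers that follow longest-match semantics may resolve differently ordered files another way. A robots-level pass only shows that Gate 2 is not closed by the file itself; it says nothing about the Gate 1 title-pattern constraint.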

Q7. Is MCP adoption really at 97 million monthly downloads?

Per the March 2026 adoption snapshot, yes — corroborated independently by Pento.ai, Truto.one, DigitalApplied, and BraivIQ. The figure is not as of the donation date (9 December 2025); it is approximately three months later. At launch (November 2024), monthly downloads were approximately 2 million. The trajectory is steeper than typical OSS-protocol adoption curves but consistent with the AAIF-member endorsement cascade. Two caveats on interpretation: SDK downloads are not unique users — the figure includes CI/CD runs, mirror traffic, and transitive dependency installs, so it tracks ecosystem momentum rather than an adopter headcount; and server counts vary by method — the official registry is in preview and excludes private enterprise servers, so public tallies range from ~5,800 to ~15,900 depending on whether registry, package-manager, or GitHub-topic signals are counted. And one caveat on inference: adoption figures establish infrastructure momentum, not citation relevance — whether tool inclusion translates into citation visibility is the open ToolInclusion question.

Q8. Why is the GCS weight vector unset?

Because empirical calibration requires cross-vertical data the framework does not yet have. Each industry vertical (B2B SaaS, healthcare, finance, legal, consumer retail) likely has different gate-criticality profiles. Article 7 will publish the first calibration. In the interim, practitioners should treat each GCS dimension independently and report dimension-level Wilson intervals — not a single composite score.
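As an illustration of dimension-level reporting, here is a minimal sketch of the Wilson score interval (Wilson 1927) applied per dimension. The dimension names and the counts are placeholders, not the six GCS dimensions or measured results, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Placeholder counts: (probes where the dimension was observed, total probes).
observations = {
    "dimension_a": (7, 10),
    "dimension_b": (0, 10),
    "dimension_c": (10, 10),
}

# One interval per dimension, deliberately not collapsed into a composite score.
report = {name: wilson_interval(k, n) for name, (k, n) in observations.items()}
for name, (low, high) in report.items():
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative at 0/n and n/n, which matters at the small probe counts typical of manual triage.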

Q9. How does this map onto Article 4’s six authority types?

One-to-one with one exception. Each of #63–#68 has a primary gate (mostly) and a secondary gate. #66 Network Authority is the dual-assignment exception (G1a + G2 + G4 simultaneously). The mapping table in §“The Mapping: Six Authority Types in the Five-Gate × Two-Channel System” above is the canonical reference.

Sources & Methodology

All sources below are classified by tier and, where applicable, by conflict-of-interest disclosure. Peer-reviewed sources are listed before vendor sources within each tier. Tier definitions appear in the Evidence Tiers box at the start of this article.

Primary Sources (First-Hand Verified)

LinkedIn robots.txt. https://www.linkedin.com/robots.txt, verified first-hand against the live file on 28 May 2026. Primary, directly reproducible source for the §G2 Gate-2 analysis. Confirmed: full Disallow: / for GPTBot, ChatGPT-User, Google-Extended, anthropic-ai, ClaudeBot, Claude-Web, Claude-User, Claude-SearchBot, cohere-ai, Google-CloudVertexBot, PerplexityBot, Perplexity-User, and ~12 further AI/scraper agents (DuckAssistBot, Meta-ExternalAgent/-Fetcher, CCBot, Bytespider, Diffbot, Quora-Bot, DataForSeoBot, Timpibot, others), plus a catch-all User-agent: * → Disallow: /. Notable exceptions, also confirmed first-hand: OAI-SearchBot and Googlebot are path-restricted but not globally blocked, retaining access to /posts/ and /pulse/. Because LinkedIn disallows automated access to its own robots.txt, the file was retrieved through a normal browser session rather than an automated fetch. (Accessed: May 28, 2026)

[Tier A] Peer-Reviewed Primary Research

Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). GEO: Generative Engine Optimization. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), 5–16. DOI: 10.1145/3637528.3671900. arXiv:2311.09735. (Accessed: May 28, 2026)
Algaba, A., Mazijn, C., Holst, V., Tori, F., Wenmackers, S., & Ginis, V. (2025). Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias. Findings of the Association for Computational Linguistics: NAACL 2025, 6844–6879. aclanthology.org/2025.findings-naacl.381. (Accessed: May 28, 2026)
Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2024). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. ICLR 2024 (Oral). arXiv:2310.11511. (Accessed: May 28, 2026)
Augenstein, I. (2025). Understanding the Interplay between LLMs’ Utilisation of Parametric and Contextual Knowledge. ECIR 2025 Keynote. arXiv:2603.09654. (Accessed: May 28, 2026)
Barnett, S., Kurniawan, S., Thudumu, S., Brannelly, Z., & Abdelrazek, M. (2024). Seven Failure Points When Engineering a Retrieval-Augmented Generation System. CAIN 2024. arXiv:2401.05856. (Accessed: May 28, 2026)
Bowyer, S., Aitchison, L., & Ivanova, D. R. (2025). Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints. ICML 2025 Position Paper Track (Spotlight), PMLR 267:81143–81184. arXiv:2503.01747. Argues against CLT-based confidence intervals at small n; recommends Wilson-Score or Bayesian intervals in practice. Cited in the GCS section for the interval-method choice — not as a validation of the GCS construct. (Accessed: July 13, 2026)
Fang, Y. et al. (2025). Do Large Language Models Favor Recent Content? SIGIR APIR 2025. DOI: 10.1145/3767695.3769493. (Accessed: May 28, 2026)
Gao, L. et al. (2023). ALCE: Enabling Large Language Models to Generate Text with Citations. EMNLP 2023. arXiv:2305.14627. (Accessed: May 28, 2026)
Jeong, S., Baek, J., Cho, S., Hwang, S.J., & Park, J.C. (2024). Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. NAACL 2024 Long, pp. 7036–7050. aclanthology.org/2024.naacl-long.389. (Accessed: May 28, 2026)
Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., & Hajishirzi, H. (2023). When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL 2023 Long Papers, pp. 9802–9822. aclanthology.org/2023.acl-long.546. (Accessed: May 28, 2026)
Pan, R., Cao, B., Lin, H., Han, X., Zheng, J., Wang, S., Cai, X., & Sun, L. (2024). Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation (CAG framework). EMNLP 2024. arXiv:2404.06809. (Accessed: May 28, 2026)
Park, S.-J., & Kim, K.-M. (2025). Measuring and Mitigating Media Outlet Name Bias in LLMs. EMNLP 2025 Main, pp. 29766–29785. aclanthology.org/2025.emnlp-main.1513. (Accessed: May 28, 2026)
Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arXiv:2302.04761. (Accessed: May 28, 2026)
Sun, J. et al. (2025). ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability. ICLR 2025 (Spotlight). arXiv:2410.11414. (Accessed: May 28, 2026)
Tan, J., Dou, Z. et al. (2025). HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems. WWW 2025. DOI: 10.1145/3696410.3714546. arXiv:2411.02959. (Accessed: May 28, 2026)
Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2023). Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions (IRCoT). ACL 2023 Long. arXiv:2212.10509 v2. (Accessed: May 28, 2026)
Wallat, J., Heuss, M., de Rijke, M., & Anand, A. (2025). Correctness is not Faithfulness in RAG Attributions. ICTIR 2025 (Best Paper Honorable Mention; ACM SIGIR-affiliated). DOI: 10.1145/3731120.3744592. arXiv:2412.18004. (Accessed: May 28, 2026)
Yang, K.-C., & Menczer, F. (2025). Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models. ACM WebSci 2025. DOI: 10.1145/3717867.3717903. arXiv:2304.00228 v3. (Accessed: May 28, 2026)
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arXiv:2210.03629. (Accessed: May 28, 2026)

[Tier B] Large-Sample Vendor-Independent Datasets (>100K Samples)

Algaba, A., Holst, V., Tori, F., Mobini, M., Verbeken, B., Wenmackers, S., & Ginis, V. (April 2025). How Deep Do Large Language Models Internalize Scientific Literature and Citation Practices? arXiv:2504.02767. 274,951 GPT-4o-generated references across 10,000 focal papers; Vrije Universiteit Brussel academic team — vendor-independent. Preprint, peer-reviewed venue pending; treated as Tier B based on sample size, vendor independence, and reproducible methodology. (Accessed: May 28, 2026)
Naser, M. Z. (2026). How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing. arXiv:2603.03299. 69,557 citation instances × 10 commercial LLMs (~696K total observations); multi-model consensus ≥3 LLMs yields 95.6% accuracy, 5.8× improvement. Academic single-author study (Clemson University) — vendor-independent. (Accessed: May 28, 2026)

[Tier C] Independent Meta-Analyses & Surveys (Aggregating ≥10 External Sources)

Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997. Comprehensive RAG-architecture survey across the academic literature. (Accessed: May 28, 2026)
Gao, Y. et al. (2024). Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks. arXiv:2407.21059. Modular-RAG meta-synthesis. (Accessed: May 28, 2026)
Singh, A., Ehtesham, A., Kumar, S., & Khoei, T.T. (2025/2026). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv:2501.09136 v4 (April 2026). Independent academic meta-synthesis. Taxonomy: two macro-classes (Single-Agent, Multi-Agent), six concrete architecture patterns, four cross-cutting Agentic Design Patterns. §3.4 explicitly notes Multi-Agent Collaboration is less predictable than Reflection and Tool Use. (Accessed: May 28, 2026)

[Tier C] (Article-6-specific) Preprints, Patents & Workshop Submissions (Primary Sources Pending Peer Review)

Article-6-specific Tier-C sub-classification for primary sources that are neither peer-reviewed (Tier A), large-sample-vendor-independent (Tier B), independent meta-analyses (Tier C, strict definition), nor vendor-published (Tier D/E). These are academic preprints, patent documents, and workshop submissions pending venue acceptance.

Cao, X. (2018). Improved Online Wilson Score Interval Method for Community Answer Quality Ranking. arXiv:1809.07694. (Accessed: May 28, 2026)
Saxena, Y., Bommireddy, R., Padia, A., & Gaur, M. (Sep 2025; v2 18 Dec 2025). Generation-Time vs. Post-hoc Citation: A Holistic Evaluation of LLM Attribution. arXiv:2509.21557 v2. Submitted to NeurIPS 2025 LLM Eval Workshop. Workshop acceptance not externally verified as of 28 May 2026. Numerical findings cited verbatim from Tables 2–4 of the v2 PDF.
Schuster, T., Gautam, V., & Markert, K. (2026). Whose Facts Win? LLM Source Preferences under Knowledge Conflicts. arXiv:2601.03746. (Accessed: May 28, 2026)
Sielinski, R. (March 2026). Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement. arXiv:2603.08924. (Accessed: May 28, 2026)
Wu, Y., Zhang, Z., Wan, C., Zhao, X., He, X., Du, B., & Chen, J. (October 2025). HiPRAG: Hierarchical Process-Reward Optimization for Adaptive Retrieval in RAG. arXiv:2510.07794. Preprint, self-declared “under review”, venue unconfirmed. (Accessed: May 28, 2026)

Patents — granted:

US11769017B1 — Google. Generative Summaries. Granted patent. patents.google.com/patent/US11769017B1. (Accessed: May 28, 2026)

Patents — published applications (not yet granted):

US20240362093A1 — Google. Custom Corpus / Routing. Published October 2024. patents.google.com/patent/US20240362093A1. (Accessed: May 28, 2026)
US20250124067A1 — Google. Pairwise Ranking Prompting. Published 2025. patents.google.com/patent/US20250124067A1. (Accessed: May 28, 2026)
US20240289407A1 — Google. Stateful Chat / Memory. Published patent application. patents.google.com/patent/US20240289407A1. (Accessed: May 28, 2026)
WO2024064249A1 — Google. Promptagator (Few-Shot Dense Retrieval). PCT international application, published. patents.google.com/patent/WO2024064249A1. (Accessed: May 28, 2026)

[Tier D] Industry Study with Documented Methodology (Not Vendor-Self-Published)

Agentset (2025). Cohere Rerank 4: A real upgrade over 3.5. Independent benchmark, December 2025. agentset.ai/blog/cohere-reranker-v4. Independent industry benchmark of a vendor product (Cohere Rerank 4); Agentset is not testing its own product, satisfying the not-vendor-self-published criterion for this specific test. (Accessed: May 28, 2026)
Bettinga, J. (May 2026). Hilft LinkedIn wirklich für Sichtbarkeit in AI Search? LinkedIn-Post / Carousel (3 slides). COI: SEO consultant & Co-Founder @SEOSOON. The §G2 LinkedIn-as-substrate double-gate analysis originates from this post (German-language LinkedIn) and is credited accordingly. The load-bearing robots.txt directives are not relied on from the post — they are independently re-verified first-hand against the live primary source (see Primary Sources above). The title-tag / SERP-pattern observation is drawn from this post. (Accessed: May 28, 2026)
Billingsley, D. (May 2026). Comment on Landwehr LinkedIn-Article (26 May 2026). AI Discovery Intelligence; AI Search & AI Visibility specialist. Cited verbatim for the framework’s Independent Industry Validation section. (Accessed: May 28, 2026)
5W Public Relations (Torossian, R.). Q1 2026 Citation Source Audit. 11 May 2026, PR Newswire. Synthesis of 9 prior industry studies (Similarweb, SEMrush, Profound, Peec AI, SE Ranking, Goodie, Ahrefs, Evertune, Passionfruit). COI: 5W is a PR agency synthesizing third-party data; not vendor-self-published. (Accessed: May 28, 2026)
Linux Foundation. (9 December 2025). Linux Foundation Announces the Formation of the Agentic AI Foundation. linuxfoundation.org/press. Founding donations: Anthropic (MCP), OpenAI (AGENTS.md), Block (goose). Platinum members: AWS, Anthropic, Block, Bloomberg, Cloudflare, Google, Microsoft, OpenAI. Standards-body announcement, not a vendor-self-published study. (Accessed: May 28, 2026)
Mihm, D. / Schwartz, B. (20 March 2025). Microsoft Bing/Copilot use schema for its LLMs. Search Engine Land. searchengineland.com/microsoft-bing-copilot-use-schema-for-its-llms-453455 + David Mihm LinkedIn coverage. Third-party SEO-trade-press reporting of a vendor (Microsoft/Canel) statement at SMX München. The statement is a vendor-confirmed paraphrase, not an original transcript. (Accessed: May 28, 2026)
Pento.ai. A Year of MCP: From Internal Experiment to Industry Standard. pento.ai/blog/a-year-of-mcp-2025-review. Independent industry retrospective; corroborating March 2026 adoption snapshot of 97M monthly downloads / 10K+ active servers. Pento.ai is not an MCP vendor — independent analysis. (Accessed: May 28, 2026)
Ray, L. (May 2026). Comment on Landwehr LinkedIn-Article. Founder of Algorythmic; VP, SEO & AI Search at Amsive. Cited for fanout-planning observation and stability caveat. (Accessed: May 28, 2026)
Trustpilot / Seer Interactive. “What AI says about you” report. 12 May 2026. PR Newswire and seerinteractive.com/insights. Methodology: 804,491 AI responses across ChatGPT/Gemini/Perplexity/Google AI Mode; 15,783 prompts covering “a range of products and services” (specific industry verticals not disclosed); 1,926 brands; T0–T3 cohort design (n = 437 / 497 / 497 / 495). COI: Trustpilot-commissioned, Seer-executed (third-party agency execution). Industry-vertical caveat (Honest Limitations §6): magnitudes (1% / 53.5% / 75.3%) are robust for consumer-facing brands where Trustpilot is the dominant review platform; they should not be quoted as universal benchmarks for B2B SaaS, healthcare, or local services. (Accessed: May 28, 2026)

[Tier E] Vendor Study (Self-Published, COI Disclosed Inline)

Ahrefs. (11 May 2026). We Tracked 1,885 Pages Adding Schema. AI Citations Barely Moved. ahrefs.com/blog/schema-ai-citations. COI: Ahrefs is an SEO-tool vendor; published on its own blog. Cited for the schema-effect-on-AI-citations finding. (Accessed: May 28, 2026)
Anthropic (Schluntz, E., & Zhang, B.). (19 December 2024). Building effective agents. anthropic.com/research/building-effective-agents. COI: Anthropic is an AI vendor. Cited for the explicit production-architecture finding that simple composable patterns + single-LLM-multi-prompt outperform complex multi-agent frameworks. (Accessed: May 28, 2026)
Anthropic Docs. (May 2026). Context windows. platform.claude.com/docs/en/build-with-claude/context-windows. COI: Anthropic platform documentation. Cited for the May 2026 status: Opus 4.7, Opus 4.6, Sonnet 4.6 at 1M-token context; Haiku 4.5 at 200K. (Accessed: May 28, 2026)
Cohere. (16 December 2025). Introducing Rerank 4. cohere.com/blog/rerank-4. COI: vendor self-report. Disambiguated against the Agentset benchmark [Tier D] in §G2. (Accessed: May 28, 2026)
Cummins, G. / Ramp. (30 April 2026). We Tested Marketing Incentives to AI Agents. Here’s What Happened. Ramp Builders Blog. builders.ramp.com/post/marketing-to-ai-agents. COI: Published on Ramp’s own builders blog; Ramp is a corporate-card and finance-tools vendor. Methodology documented in detail: 3-variant test (pure Markdown / stripped HTML / schema-injected) across ~50 marketing pages, Cloudflare Workers conditional serving, unique tracked incentives per variant, 32-day measurement window; 1,300+ bot visits over first 2.5 weeks, ~370 agent relays by day 32 with Claude dominant, ChatGPT zero, Perplexity vague-then-branded by day 33. Known limitation acknowledged by authors: targeting confound — Markdown served broadly (AI Assistant OR unverified low-bot-score), HTML+schema served strictly (verified bots only). Format-effect is not isolable from targeting-effect in this design. Cited for: (a) “agent trust” / existing-citation-volume as content-surfacing prerequisite (§Three/Four Observations); (b) per-model variance in Gate-4/Gate-5 behavior (Open Question 8); (c) format-variant evidence with explicit methodological caveat (§G3 Markdown-vs-HTML synthesis); (d) bot-detection diagnostic findings (Cloudflare label mismatch; OpenAI SearchBot caching; DeepSeek Chrome-58 UA spoofing) for §G2 operational layer. (Accessed: May 28, 2026)
Druck, G., & Smith, E. / Graphite. (2026). Demystifying Randomness in AI. Graphite Five Percent White Paper. graphite.io/five-percent/demystifying-randomness-in-ai. Methodology: 200 entity-comparison prompts × 400 responses across gpt-5.2-chat-latest (OpenAI API), ChatGPT-Logged-Out, and Gemini-Logged-Out conditions; >200,000 LLM responses total; Wilson-Score binomial confidence intervals, Sequential Sampling, McNemar’s test, Z-tests; all experimental data publicly accessible via Google Drive. Key findings used in this article: (1) Visibility estimable with n=10 at MAE ~5.6% for Graphite’s visibility construct — transfer to the six GCS dimensions untested; (2) Sequential Sampling reduces required responses by 51% without CI-tightness loss; (3) API-vs-Logged-Out cosine similarity 0.48 — API measurements are not a valid proxy for the user-facing reality. COI: Graphite is an AEO agency selling visibility-measurement services; the paper is not peer-reviewed. Author credentials: Druck holds PhD UMass Amherst NLP (1,200+ citations, McCallum lab); Smith is Graphite CEO (MSc UCL, growth marketing). Methodology is statistically rigorous; its interval choice is consistent with the small-n recommendations of Bowyer, Aitchison & Ivanova (ICML 2025 Position Paper Track) [Tier A] — which supports the interval method, not the GCS construct; all experimental data are publicly available; sample size (200K+ responses) is substantial. Cited in this article as a methodology reference for GCS construction (§GCS Triangulation). Limitations: scope is entity-comparison prompts only; entity-extraction accuracy not formally evaluated; temperature parameter not specified. (Accessed: May 28, 2026)
King, M. (May 20, 2026). Beyond RAG: Why Every AI Search Platform Is Now Agentic and What That Means for Your Content. iPullRank. ipullrank.com/agentic-rag. COI: King is iPullRank Founder/CEO. Substantive vendor synthesis with per-claim triangulation against peer-reviewed sources (ReAct, Toolformer, IRCoT, Self-RAG) — every load-bearing claim from King is independently triangulated against Tier-A evidence in this article. (Accessed: May 28, 2026)
Nowaczyk, S. (Dec 10, 2025). Architectures for Building Agentic AI (Chapter 3). In Generative and Agentic AI Reliability: Architectures, Challenges, and Trust for Autonomous Systems, Springer Nature (accepted, forthcoming). arXiv:2512.09458v1 [cs.AI], CC BY 4.0. arxiv.org/abs/2512.09458. Center for Applied Intelligent Systems Research, Halmstad University. [Tier B] Peer-review-grade academic anchor for the “reliability is an architectural property” framing; supplies the component vocabulary (planner, tool router, verifier, supervisor) that the five gates operationalize. (Accessed: May 29, 2026)
Landwehr, M. (26 May 2026). How to Become a Top Source in ChatGPT with Recycled Reddit Content. LinkedIn-Article (peec.ai/blog). COI: Author is CPO/CMO at Peec AI; data source is Peec AI proprietary citation telemetry. Vendor-affiliated; Peec AI sells citation-tracking tools. Methodology documented in-article. Cited for the Reddit-recycle observation in §G1b. (Accessed: May 28, 2026)
llmpulse.ai. Data Studies: Top Cited Domains. llmpulse.ai/data-studies/top-cited-domains. COI: llmpulse.ai is an AI-citation-analytics vendor; data published on own platform. Citation-share data May 2026: YouTube 26.47%, Reddit 17.39%, Google 15.45%, Instagram 6.78%, Facebook 6.7%, TikTok 4.7%, LinkedIn 4.43%, Apple 2.55%, Wikipedia 2.33%, Trustpilot 1.99%. (Accessed: May 28, 2026)
Lumer, E. et al. (2025). ScaleMCP: Scaling Tool Selection for Large-Scale Agentic AI Systems. arXiv:2505.06416. COI: PricewaterhouseCoopers U.S.A. co-authored. Cited for 5,000 financial-metric MCP-server stress-test methodology. Although published in academic preprint format, the PwC co-affiliation classifies this as vendor-side under the strict vendor-independence requirement. (Accessed: May 28, 2026)
Profound. (February 2026). We Ran a Controlled Experiment on Markdown vs. HTML for AI Bots. tryprofound.com/blog/does-markdown-increase-ai-bot-traffic. COI: Profound sells Agent Analytics measurement; published on own blog. Methodology: 381 pages across 6 websites, controlled A/B (192 treatment + 189 control), Profound Agent Analytics, 19 January – 8 February 2026. Result: ~16% mean lift, ~1 median extra visit, statistically not significant. (Accessed: May 28, 2026)
Search Atlas. (December 2024). The Limits of Schema Markup for AI Search. searchatlas.com/research. COI: Search Atlas is an SEO-tool vendor; published on own research blog. Cited for schema-markup-effect findings. (Accessed: May 28, 2026)
Shihipar, T. (8 May 2026). Using Claude Code: The Unreasonable Effectiveness of HTML. thariqs.github.io/html-effectiveness/. Personal site of Anthropic engineer (Engineering Lead, Claude Code). 4.4 million views in 16 hours; widely covered (Simon Willison, Lenny’s Newsletter, Hacker News). COI: Anthropic employee; not an Anthropic publication. Cited in §G3 for the million-token-era reframing of the Markdown-vs-HTML format question. (Accessed: May 28, 2026)

Statistical Methodology References

Wilson, E. B. (1927). Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association 22, 209–212. DOI: 10.1080/01621459.1927.10502953. (Accessed: May 28, 2026)
Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval Estimation for a Binomial Proportion. Statistical Science 16(2), 101–117. DOI: 10.1214/ss/1009213286. (Accessed: May 28, 2026)
Bowyer, S., Aitchison, L., & Ivanova, D. R. (2025). Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints. ICML 2025 Position Paper Track (Spotlight), PMLR 267:81143–81184. arXiv:2503.01747. (Accessed: July 13, 2026)
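
The Wilson score interval referenced above (Wilson 1927; Brown, Cai & DasGupta 2001) is the interval method behind the Step-3 formula GCS_d = Wilson_Lower(p̂_d) named in the Update Log. The following is a minimal Python sketch of that lower bound, not the framework's reference implementation. The function names `wilson_lower` and `triage_label`, the 95% z-value of 1.96, and the "inconclusive" middle label are assumptions made for illustration; only the >0.7 and <0.3 heuristic thresholds and the three sampling tiers come from the article.

```python
import math


def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    successes: number of probes in which the source was cited (assumed meaning)
    n:         total number of probes
    z:         normal quantile; 1.96 is the usual two-sided 95% value (assumption)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p_hat = successes / n
    z2 = z * z
    denominator = 1 + z2 / n
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, (centre - margin) / denominator)


def triage_label(score: float) -> str:
    """Hypothetical helper applying the article's heuristic Step-1 thresholds."""
    if score > 0.7:
        return "no citation problem"
    if score < 0.3:
        return "problem"
    return "inconclusive"  # middle band label is an assumption, not from the article


# Same observed citation rate (80%) at the three sampling tiers named in the article.
for cited, probes in [(8, 10), (32, 40), (160, 200)]:
    score = wilson_lower(cited, probes)
    print(f"n={probes:>3}  observed={cited / probes:.2f}  "
          f"wilson_lower={score:.3f}  -> {triage_label(score)}")
```

With an identical observed rate of 0.80, the lower bound comes out at roughly 0.49 for n=10, 0.65 for n=40 and 0.74 for n=200. Under these assumptions only the decision-grade sample clears the 0.7 heuristic, which is one way to see why the sampling tier matters more than the point estimate.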

Triangulation Audit Results

Core claim	Source 1	Source 2	Source 3	Triangulated
Five-Gate cascade architecture	Gao 2024 [Tier C]	Gao 2023 [Tier C]	Barnett CAIN 2024 [Tier A]	✅ mechanisms, not the 5-way split (OQ9)
Dual-Channel (Parametric / Retrieval)	Sun ICLR 2025 [Tier A]	Augenstein 2025 [Tier A]	Pan EMNLP 2024 [Tier A]	✅
G1a/G1b split (Fan-Out Planning)	Trivedi ACL 2023 [Tier A]	Jeong NAACL 2024 [Tier A]	King [Tier E] (COI)	✅
G5 two-axis (Survival × Faithfulness)	Wallat ICTIR 2025 [Tier A]	Saxena 2025 [Tier C]	Sun ICLR 2025 [Tier A]	✅ C-caveat; FFN/Copying-Head bridge = hypothesis
Tool/Endpoint as #67 sub-type	Schick NeurIPS 2023 [Tier A]	Lumer 2025 [Tier E] (COI: PwC)	LF AAIF 2025 [Tier D] + Pento.ai [Tier D]	✅ adoption ≠ citation relevance
#66 dual-assignment (G1a + G2 + G4)	Algaba NAACL 2025 [Tier A]	Algaba 2025 follow-up [Tier B]	Yang & Menczer 2025 [Tier A]	✅ theory-led
Consensus-based citation accuracy	Yang & Menczer 2025 [Tier A]	Naser 2026 [Tier B]	Schuster 2026 [Tier C]	✅
Single-LLM-multi-prompt as prevailing production pattern (working assumption)	King 2026 [Tier E] (COI)	Anthropic Dec 2024 [Tier E]	Singh §3.4 [Tier C]	⚠️ no Tier-A anchor; no population data
GCS six-dim Wilson construction	King 6 Metrics [Tier E] (COI)	Wilson / Bowyer ICML 2025 [Tier A] (interval method)	Aggarwal KDD 2024 [Tier A]	✅ method, not construct (OQ2)
#72 Reflection-Iteration Modifier	Asai ICLR 2024 [Tier A]	Singh §3 [Tier C]	HiPRAG 2025 [Tier C] + King [Tier E] (COI)	✅
Reputational/review-platform magnitude	Seer/Trustpilot 2026 [Tier D] (COI)	5W Q1 2026 [Tier D]	—	⚠️ industry consensus only
Schema markup is hygiene, not lever	Search Atlas 2024 [Tier E] (COI)	Ahrefs 2026 [Tier E] (COI)	Ramp 2026 [Tier E] (COI)	⚠️ 3× vendor convergence, no Tier-A
Temporal Modifier (#69) magnitude in agentic pipelines	Fang SIGIR APIR 2025 [Tier A] classic IR setting	Trustpilot “3Rs” [Tier D]	—	⚠️ agentic magnitude open

Full citations and DOIs in the Sources & Methodology section above. Tier letters per Article 1 standard: [Tier A] peer-reviewed academic research · [Tier B] large-scale industry dataset (>100K samples, vendor-independent) · [Tier C] independent meta-analysis aggregating ≥10 external sources (plus Article-6-specific extension for preprints/patents/workshop submissions pending peer review) · [Tier D] industry study with documented methodology, not vendor-self-published · [Tier E] vendor study (self-published, COI disclosed inline).

Update Log

V1.1 (July 13, 2026): Epistemic-language revision following §48 cross-model validation Round 2 (external claim-by-claim stress test and SIGIR-style external review). No sources removed, no data changed. Substantive changes:
(1) GCS polarity aligned with the canonical definition: GCS is a citation-likelihood score, higher = better. The Step-3 formula is now GCS_d = Wilson_Lower(p̂_d) (the previous version inverted it), and the Step-1 triage thresholds are inverted accordingly (>0.7 = no citation problem, <0.3 = problem) and explicitly labeled heuristic.
(2) “Five sequential gates” (TL;DR, closing quote) reframed: gates are processing functions without fixed order, consistent with the Architectural Variations section.
(3) The “complete description” claim reframed as a bounded scope statement with named omissions (policy/safety filtering, licensing, personalization, localization, latency/cost).
(4) The cliff-shape claim reframed as a thesis with open empirical status.
(5) The ReDeEP mechanistic bridge at Gate 5 (Knowledge FFNs → Faithfulness / Copying Heads → Survival) explicitly labeled an untested hypothesis.
(6) “Dominant production reality / virtually all deployments” reframed as a working assumption without population data.
(7) Sampling guidance unified into three tiers (screening n≈10 / baseline n=30–40 / decision-grade n≥200), resolving an internal inconsistency; the Graphite n=10 MAE transfer to GCS dimensions flagged as untested.
(8) Bowyer, Aitchison & Ivanova (ICML 2025) re-characterized against the original paper: it argues against CLT intervals at small n and recommends Wilson-score or Bayesian intervals, which supports the interval method but does not validate the GCS construct. The paper was also added to the Sources (it was cited but unlisted in V1.0).
(9) “Dual-Assignment” relabeled from “finding” to a property of this framework’s mapping.
(10) Probe caveats added to Triage Steps 3–4 (probe ≠ validated test; probe system ≠ proprietary platform).
(11) Three open questions added: OQ9 falsifiability of the five-way decomposition, OQ10 gate-construct independence and ground truth for “closed”, OQ11 statistical channel independence.
(12) Necessity language aligned with the diagnostic-category status: “must clear all five gates” and the Dual-Channel axiom are now explicitly framed as definitional properties of the model (“the model treats citation as requiring…”), not empirically established necessity.

V1.0 (May 30, 2026): First publication.

About the Author

Manuel Hürlimann is the creator of Digital Authority Engineering (DAE) — the systematic discipline of building machine-verifiable expertise that AI systems recognize, cite, and recommend. Based in Switzerland, he works as a consultant and lecturer at the intersection of AI search behavior, citation analysis, and brand authority.

Through the Authority Intelligence Lab at GaryOwl.com, he publishes original research on how AI systems select, evaluate, and cite sources — applying every principle to GaryOwl.com itself as a living lab. The Five-Gate × Two-Channel Coordinate System is the architectural backbone of the DAE framework’s Agentic-AEO layer, building on the six-type authority taxonomy (#63–#68) and three modifier dimensions (#69–#71) established in Article 4 and extending it with the agentic-pipeline-specific #72 Reflection-Iteration Modifier.

Connect: GaryOwl.com · LinkedIn · manuel@octyl.io

Framework Disclosure: The DAE framework is independently developed and not affiliated with any vendor whose products are evaluated in this article. The author has no equity, employment, or paid-advisory relationship with Cohere, Anthropic, Block, OpenAI, Google, Microsoft, AWS, Bloomberg, Cloudflare, Trustpilot, Seer Interactive, 5W Public Relations, Ahrefs, Search Atlas, Graphite, iPullRank, PricewaterhouseCoopers, Peec AI, llmpulse.ai, Pento.ai, Truto.one, DigitalApplied, or BraivIQ as of publication date. Where vendor-published research is used (Tier D / Tier E), COI is disclosed inline and again in Sources & Methodology. The framework’s stated preference for peer-reviewed Tier-A evidence under the source-hierarchy principle is consistently applied: when Tier-A and Tier-D evidence conflict, Tier-A governs and Tier-D is reported with its COI flag. The DAE framework is applied to GaryOwl.com itself as a living lab — every framework principle is simultaneously tested on this site. The framework is open for use with attribution. Validation is ongoing and published transparently; no guarantees implied. AI behavior varies by model and platform.

Article Navigation: ← Article 5 | Next: Article 7 (forthcoming) →

GaryOwl.com – Authority Intelligence Lab

“A citation is not a ranking outcome. It is the outcome of five near-binary gates — and most of them can close before any ‘ranking’ step is reached. Diagnose the gate, not the rank.” — Manuel Hürlimann, Digital Authority Engineering