[Figure: open-weight detection stack, three layers ordered by cost. Query time: GSI check, ~3 ms, single forward pass, d = 0.34-2.05; catches Type 1, blind to Types 2, 3, 4. Token 2: t1_decay, ~50 ms, 3 parallel branches; catches Types 3 and 4, blind to Type 2. Full response: confab_score, ~100 ms, cv x form_sim; confirms Types 1 and 3, blind to Type 2. RAG document injection: GSI drift 0.00-0.08, exactly 0.000 in Bielik and Mistral; passive semantic context does not disrupt FFN gate sparseness. Type 2, wrong parametric knowledge, is invisible at all detection points: the model confidently retrieves the wrong fact from its weights, and no internal signal distinguishes correct from incorrect parametric knowledge.]
Fig 1 — Three detection layers ordered by cost. GSI at query time, t1_decay at token 2, confab_score on full response. Type 2 invisible.

Every RAG system in production today applies retrieval the same way: always, to every query. This is wasteful and sometimes harmful. The model already knows the answer to most queries — it just doesn't know that it knows. The TES signal stack turns RAG from a blanket policy into a precision instrument.

The core insight is simple. Before a language model generates a single token, its internal state already reveals whether it has parametric knowledge for the query. If it does, retrieval is unnecessary overhead — it adds latency, cost, and the risk of injecting irrelevant or contradictory documents. If it does not, retrieval is the right intervention. The decision to retrieve should come from the model itself, not from a keyword classifier sitting outside it.

This paper maps the TES signal stack — GSI, t1_decay, B_slope, and confab_score — onto two real-world deployment architectures: open-weight models where you have full access to internal states, and API-only models where you do not. The detection capabilities differ sharply between the two.

RAG is not ICL — and the difference matters

One of the most consequential findings in this work is that RAG document injection and in-context learning (ICL) few-shot examples affect the model's internal state in fundamentally different ways. GSI — the Gate Sparseness Index that measures FFN gate activation patterns — survives RAG injection almost perfectly. In Bielik and Mistral, the drift is exactly 0.000. In the remaining models, drift ranges from −0.01 to −0.08. The model's epistemic signal stays intact.

Under ICL few-shot prompting, GSI collapses to zero. The signal disappears entirely. Why? ICL bombards the FFN gates with repeated question-answer patterns — short, high-density instruction pairs that create broadband activation across the gate layer. RAG, by contrast, injects a single continuous document: passive semantic context with low instruction density. The gates see additional context but are not re-patterned by it. Different instruction density, different gate behavior.

This distinction has a direct engineering consequence. The constraint "measure GSI before any context is injected" — which appeared absolute in TES1 — applies only to ICL pipelines. In RAG pipelines, you can measure GSI even after the retrieval document is in the context window. The measurement remains valid.
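The measurement itself can be sketched in a few lines. The exact GSI formula is not reproduced here, so the following is a hypothetical stand-in: treat GSI as the fraction of near-zero gate activations in the upper FFN layers, captured in one prompt-only forward pass. The threshold `eps`, the layer selection, and the activation representation (plain lists instead of hook-captured tensors) are all illustrative assumptions.

```python
import random

def gate_sparseness_index(gate_activations, eps=1e-3):
    """Hypothetical GSI: fraction of FFN gate activations whose
    magnitude falls below eps, averaged over the supplied layers.
    In a real open-weight setup these would be captured with forward
    hooks on the upper FFN layers during a single prompt-only pass;
    here they are plain lists of floats for illustration."""
    per_layer = [
        sum(abs(a) < eps for a in layer) / len(layer)
        for layer in gate_activations
    ]
    return sum(per_layer) / len(per_layer)

# Toy contrast: a sparse gate pattern vs. a broadband one.
random.seed(0)
sparse = [[random.gauss(0, 1) if random.random() < 0.1 else 0.0
           for _ in range(4096)] for _ in range(4)]
dense = [[random.gauss(0, 1) for _ in range(4096)] for _ in range(4)]
print(gate_sparseness_index(sparse) > gate_sparseness_index(dense))  # True
```

Because the check needs only activations from a pass that would happen anyway, it adds essentially no marginal compute, which is what makes the ~3 ms figure plausible.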

Open-weight: three layers of defense

When you have access to model weights, you get three detection checkpoints arranged by cost. GSI comes first: a single forward pass through the upper FFN layers, computed in roughly 3 milliseconds. It catches Type 1 hallucinations — queries about topics entirely absent from training data. If GSI is below threshold, the model has no parametric anchor for this query. Retrieve immediately; do not generate.

If GSI passes — the model appears to have relevant knowledge — the second checkpoint fires at token 2. t1_decay measures how much the model's epistemic confidence changes between the first and second generated tokens by running three parallel generation branches. This catches Type 3 (schema-driven confabulation) and Type 4 (progressive commitment to fiction), where the model starts with knowledge but drifts into fabrication. Cost: roughly 50 milliseconds.

The third checkpoint, confab_score, operates post-generation. It combines the coefficient of variation across three response branches with their formal similarity. Confabulated outputs are paradoxically more internally coherent than factual ones — the model copies a single template rather than choosing among valid alternatives. This confirms Type 1 and Type 3 detections at a cost of roughly 100 milliseconds for full three-branch generation.

The pipeline is cascading: only queries that pass the cheap check reach the expensive one. Most queries are resolved by the GSI check in 3 ms; the minority that need deeper inspection pay the additional 50 ms or 100 ms. The average cost across a production query distribution is therefore far lower than applying retrieval to every request.
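The cascade economics can be made concrete with a back-of-the-envelope expected-cost calculation. The per-stage latencies are the ones quoted above; the traffic split is a hypothetical assumption, not a measured distribution.

```python
# Hypothetical traffic split (illustrative, not measured): most queries
# stop at the cheap GSI check; only a minority reach the deeper layers.
stages = [
    ("GSI",          0.80, 3.0),                  # query-time triage only
    ("t1_decay",     0.15, 3.0 + 50.0),           # triage + token-2 check
    ("confab_score", 0.05, 3.0 + 50.0 + 100.0),   # full cascade
]
expected_ms = sum(frac * cost for _, frac, cost in stages)
print(f"expected cost per query: {expected_ms:.1f} ms")  # 18.0 ms
```

Under this split the average query pays about 18 ms of detection overhead, versus roughly 153 ms if every query ran the full stack, which is the flat-cost structure API-only deployments are stuck with.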

API-only: what you lose

[Figure: API-only detection stack (closed weights), limited detection. Query time: GSI not available; requires weight access, so no pre-generative triage, no GSI survival measurement, no zero-cost epistemic check. Tokens 1-2: B_slope, 3x inference, d = 1.06-1.65, partial Type 1 detection; B_slope under RAG is planned R&D. Full response: confab_score, 3x generation, cv x form_sim, post-generative Type 1+3 detection only. What API-only CAN do: B_slope on the raw query, t1_decay across branches, confab_score on full generations. Type 2, wrong parametric knowledge, remains invisible, same as open-weight. External fact-checking remains mandatory for high-stakes applications regardless of deployment architecture.]
Fig 2 — Without weight access, GSI is unavailable. B_slope and confab_score provide partial coverage.

When you deploy through an API — GPT-4, Claude, Gemini — you lose the cheapest and most powerful detection layer. GSI requires access to FFN gate activations, which means access to model weights. No API exposes this. The 3-millisecond pre-generative triage disappears entirely.

The next best pre-generative signal is B_slope, the commitment rate. It measures how quickly the model converges on its answer across parallel generation branches. B_slope works through any API that supports temperature sampling — you generate three responses and compare early-token convergence. Effect sizes are strong: d = 1.06 to 1.65 across 7 of 8 tested models. But it costs 3x inference. Characterizing B_slope behavior under RAG document injection is a planned R&D direction — determining whether it survives retrieval augmentation the way GSI does will define the viable pre-screen for API-only pipelines.
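One way the early-token convergence check could be sketched, with the caveat that the actual B_slope definition is not reproduced here: measure the fraction of early token positions on which all sampled branches agree. Whitespace splitting stands in for real tokenizer output, and in practice the three branches would each come from one API call at temperature > 0.

```python
def b_slope(branches, k=5):
    """Sketch of a commitment-rate signal (an assumed stand-in for
    B_slope): fraction of the first k token positions on which all
    sampled branches agree. Whitespace tokenization is illustrative;
    a real implementation would compare tokenizer output."""
    tokenized = [b.split()[:k] for b in branches]
    positions = list(zip(*tokenized))
    if not positions:
        return 0.0
    return sum(len(set(p)) == 1 for p in positions) / len(positions)

committed = ["Paris is the capital of France today",
             "Paris is the capital of France and",
             "Paris is the capital of France ."]
uncommitted = ["Perhaps the answer involves several factors",
               "It could be argued that many elements",
               "One possible interpretation is that the"]
print(b_slope(committed) > b_slope(uncommitted))  # True
```

High agreement on early tokens signals fast commitment; branches that immediately diverge into hedging phrasings signal the absence of a single retrieved answer.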

confab_score — the post-generative confirmation signal — works fully in API deployments. It requires three complete generations, making it the most expensive check, but it detects the characteristic uniformity of confabulated outputs without any weight access. The tradeoff is that by the time confab_score fires, you have already generated three full responses. The damage — latency, cost, potentially serving a hallucinated answer — has already occurred.

API-only reality check: not every model, not every signal

An important caveat for API-only deployment: the detection signals do not transfer equally well to every closed model. Initial results on Gemini 2.5 Flash confirm that Dual Confirmation (trajectory similarity combined with token-level entropy) produces a viable epistemic signal through API access alone. This is a promising proof of concept, but further research is needed to determine how robust and production-ready the signal is across different API tiers and query distributions.

Open-weight models remain significantly more promising for epistemic detection. With direct access to FFN gate activations, the signal is stronger, cheaper, and validated across 8 architectures with thousands of queries. For API-only deployments, Dual Confirmation — sim for form-level confabulation, entropy for token-level uncertainty — is the recommended starting architecture, with the understanding that production calibration on closed models is an active research direction.
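A minimal sketch of the two Dual Confirmation components, under two stated assumptions: the API exposes per-token top-k logprobs for the generated response, and character-level similarity is an acceptable stand-in for trajectory similarity. Neither assumption is confirmed by the source; this is a starting-point illustration, not the validated implementation.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def trajectory_sim(responses):
    """Form-level component: mean pairwise similarity across the
    sampled responses (character-level ratio as a stand-in)."""
    return mean(SequenceMatcher(None, a, b).ratio()
                for a, b in combinations(responses, 2))

def mean_token_entropy(token_logprobs):
    """Token-level component: average Shannon entropy over per-token
    top-k logprob lists. Top-k truncation underestimates the true
    entropy, so this is a sketch, not an exact measurement."""
    entropies = []
    for top_k in token_logprobs:
        probs = [math.exp(lp) for lp in top_k]
        z = sum(probs)  # renormalize the truncated distribution
        entropies.append(-sum(p / z * math.log(p / z) for p in probs))
    return mean(entropies)

confident = [[-0.01, -5.0, -6.0]] * 3   # mass concentrated on one token
uncertain = [[-1.1, -1.1, -1.1]] * 3    # mass spread evenly
print(mean_token_entropy(confident) < mean_token_entropy(uncertain))  # True
```

High trajectory similarity flags form-level confabulation (template copying), while high token entropy flags token-level uncertainty; the calibration of how the two are combined and thresholded on closed models is, as noted above, still open research.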

The decision to retrieve should come from the model's internal state, not from a keyword classifier. GSI below threshold means the model has no parametric anchor — retrieve. GSI above threshold means the model has the knowledge — serve directly. This turns RAG from an always-on cost center into an on-demand safety net.

The asymmetry between deployment modes is stark. Open-weight models get a cascading pipeline: cheap triage first, expensive confirmation only when needed. API-only models get a flat cost structure: every detection attempt costs 3x or more, with no way to skip cheap queries past the expensive checks. For organizations running millions of queries per day, this difference in detection economics may be the strongest argument for open-weight deployment — not model quality, not customization, but the ability to know what your model knows before it speaks.

And in both cases, Type 2 — the model confidently retrieving a wrong fact from its own weights — remains invisible. No internal signal distinguishes correct parametric knowledge from incorrect parametric knowledge. For high-stakes domains, external fact-checking is not optional. It is the only defense against the one hallucination type that neither architecture can detect.