The previous five papers in this series each measured a binary contrast — known vs unknown, confabulation vs fact. This paper reveals the structure underneath. Knowledge inside a language model is not a switch. It is a gradient with seven levels, three zones, and two structural breaks that appear consistently across every architecture tested.
Across 1,000 standardized queries and eight open-source models, the Gate Sparseness Index maps a continuous landscape of epistemic certainty. At one end, core empirical facts (L6: “What is the capital of France?”) produce a GSI of 0.62 — the model is reaching into a precise memory slot. At the other, boundary questions (L3: “Is there free will?”) produce the lowest values of all — 0.112 — below even deliberate confabulation prompts. The seven levels are not a taxonomy imposed on the data. They are the structure the data reveals.
Three zones, not two states
The gradient organizes into three zones with clear boundaries. The absence zone (L1 and L3, GSI 0.11–0.17) is where the model has no parametric knowledge to retrieve. The transition zone (L2, GSI 0.23) is the intermediate state — the model has partial signal but not enough for confident retrieval. The knowledge zone (L4–L6, GSI 0.35–0.62) is where specific memories are encoded in FFN weights and can be reliably accessed.
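The zone structure can be sketched as a simple lookup. The level ranges below come from the figures quoted above; the cutoff values between zones and the function name are illustrative choices, not part of the papers' tooling:

```python
def gsi_zone(gsi: float) -> str:
    """Map a Gate Sparseness Index value to one of the three zones.

    Cutoffs are illustrative midpoints between the level ranges quoted
    in the text (absence 0.11-0.17, transition ~0.23, knowledge 0.35-0.62).
    """
    if gsi < 0.20:          # L1/L3 territory: no parametric knowledge
        return "absence"
    if gsi < 0.30:          # L2 territory: partial signal, no confident retrieval
        return "transition"
    return "knowledge"      # L4-L6: specific memories encoded in FFN weights

print(gsi_zone(0.112))  # boundary question → absence
print(gsi_zone(0.62))   # core empirical fact → knowledge
```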
The model does not flip between knowing and not-knowing. It passes through intermediate states. L2 queries — unknowable questions about real entities (“What was Einstein’s shoe size?”) — produce measurably higher GSI than confabulation or boundary questions, even though the model cannot answer them correctly. The entity activates something in the weights, even when the specific fact is absent. This is partial knowledge, and GSI measures it.
The L3→L4 break: where knowing begins
The sharpest transition in the gradient sits between L3 (boundary questions) and L4 (niche facts): Cohen’s d = 1.094, with p < 10⁻⁸⁶. This is two to four times stronger than any other adjacent step in the gradient. Below this threshold, the model is guessing or confabulating. Above it, the model is retrieving.
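For reference, Cohen’s d between two groups of GSI scores uses the standard pooled-standard-deviation formula; this is a textbook definition, not code from the papers:

```python
import math

def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)   # sample variance of b
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (mb - ma) / pooled
```

A d of 1.094 means the L3 and L4 GSI distributions are separated by more than one pooled standard deviation — large by any conventional benchmark.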
This is where the RAG routing threshold should sit — not as a design choice but as alignment with the model’s own epistemic structure. Paper 5 identified a practical threshold for RAG injection; this paper shows that threshold corresponds to a structural break in the knowledge gradient itself. The L3→L4 boundary is not an engineering parameter. It is a property of the data.
Two surprises in the data
The gradient is not monotonic. Two steps go in the wrong direction, and both reveal something important about how language models store and retrieve knowledge.
L3 < L1: Boundary questions (“Is there free will?”, “Does consciousness require a body?”) produce lower GSI than confabulation prompts. This is counterintuitive — shouldn’t confabulation be the floor? But confabulation at least activates a schema. When you ask for the melting point of a fictional compound, the model has a template for “melting point of X” and fills it with invented values. Boundary questions activate nothing — no template, no schema, no specific memory pathway. They are the truest form of epistemic absence.
L7 < L6: Mathematical and logical truths (“What is 17 times 23?”, “Is every prime greater than 2 odd?”) produce lower GSI than core empirical facts. This is not a failure of the model — it is a feature of the measurement. GSI measures retrieval from FFN memory. Mathematical reasoning uses distributed computation across attention heads and residual streams, not a concentrated FFN lookup. The model knows the answer, but it computes it rather than retrieving it. GSI, by design, does not capture this.
The practical consequence is direct. Any system that routes queries based on model confidence — deciding whether to retrieve external context, escalate to a larger model, or abstain from answering — is making a decision about where on this gradient a query falls. The gradient provides the map. The L3→L4 boundary provides the threshold. And the two non-monotonic steps provide the warnings: boundary questions will read as more uncertain than confabulations, and mathematical reasoning will read as less certain than factual recall.
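A minimal routing sketch that encodes both the threshold and the two warnings might look like the following. All names, flags, and the 0.30 cutoff are illustrative assumptions; the papers' actual threshold value is not restated here:

```python
def route_query(gsi: float, is_math: bool, is_boundary: bool,
                threshold: float = 0.30) -> str:
    """Decide how to handle a query from its GSI score.

    `threshold` is an illustrative stand-in for the L3->L4 break. The two
    flags encode the non-monotonic warnings: math queries score low on GSI
    because the answer is computed rather than retrieved, and boundary
    questions score below even confabulation prompts.
    """
    if is_math:
        return "answer"      # low GSI here does not indicate ignorance
    if is_boundary:
        return "abstain"     # nothing to retrieve, internally or externally
    if gsi < threshold:
        return "retrieve"    # below the break: inject external context
    return "answer"          # above the break: parametric retrieval is reliable
```

The point of the flags is exactly the warning in the text: a GSI threshold alone would mis-route mathematical queries to retrieval and treat boundary questions as ordinary knowledge gaps.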
Eight models, five organizations, seven levels, 1,000 queries. The epistemic knowledge gradient is not a taxonomy — it is the terrain that every measurement in this series has been mapping, one contrast at a time.