The recipe: what it takes to make a model admit it doesn't know
Part 3 showed one system work, traced end to end. This is the recipe behind it — the loop the model runs, and the things that have to be true for that loop to make it say "I don't have this" instead of inventing an answer. Tamil is the worked example; the spine holds in any domain that is small, specialised, and mostly absent from training data.
Across this series we went from what the problem is, to why a language model fails at it invisibly, to a system that doesn't. That last post told the story of one run. This one pulls back to the method, because the interesting part isn't Tamil: it's that the same handful of ingredients make any model honest about the edge of its knowledge. Miss one and the guarantee leaks: the model goes back to sounding confident about things it has no basis for.
Here is the whole recipe in one view. Each row is a section below.
| # | Ingredient | In one line | What breaks without it |
|---|---|---|---|
| 1 | The loop the model runs | A defined ReAct loop — propose, ground, revise, commit — with a clear bar for "done": groundedness, not confidence. | The model calls tools ad hoc and stops when it feels sure: Part 2 all over again. |
| 2 | A ground truth worth grounding in | A small, curated, well-linked store of what's actually attested — meaning flows through structure, never orphaned facts. | You ground in garbage and get confident garbage, just with a citation. |
| 3 | Retrieval tuned for relevance | The query under the tool puts the right nodes in front of the model — minimal noise, minimal tokens. | The real answer is in the graph but never reaches the model, or arrives buried in distractors. |
| 4 | Typed tools that expose it | A fixed set of schema-validated tools is the only path in; they wrap the queries and adjudicate writes. | The model runs raw queries, invents structure, and the curated store rots. |
| 5 | Monitoring and evaluation | Three levels of signal — tool-call logging, reason-before / review-after annotations, and trajectory metrics + reflection — over a gold set when you have one. | "It works on the demo." You can't see drift, can't localise failure, can't improve on purpose. |
01 The loop the model runs
Start with the shape of the whole thing, because everything else hangs off it. The method is a loop: the model proposes a hypothesis, grounds every piece of it against a source of truth, treats a rejection as structured feedback and revises, and commits only once nothing is left ungrounded. This is the ReAct loop — Reason, Act, Observe, repeat — and the crucial detail is the bar for "done": not the model's confidence, but groundedness. It writes when everything resolves, not when it feels sure. One definition, held throughout the series: grounded means every claim resolves to attested structure in the graph through a typed validation path — not "has a citation," not "came from retrieval," not "probably true."
Why insist on a loop at all, rather than just good tools? Because a model handed excellent tools and no process will use them ad hoc — call a couple, form an impression, and stop the moment it feels confident. That feeling is exactly the failure from Part 2: fluent, confident, and wrong. A defined loop with a clear committable target — a state it is allowed to write only once everything checks out — is what converts a capable model from a confident guesser into something that keeps checking until it actually has support, or reports that it can't get any.
The loop has three moves, and the rest of this post is what each move needs to be real rather than theatre:
- Act grounds a claim against something it can't override — so it needs a well-linked ground truth (§02), found through a relevance-tuned query (§03), reached through a typed tool (§04). Three ingredients, one move.
- Observe turns the result into the structured feedback the model reasons over next — and, scaled across runs, into the raw material for evaluation (§05). A rejection isn't a dead end; it's an Observe that feeds the next Reason.
- Reason is the move this section already covered: the process and the committable target that keep the model honest about when it's allowed to stop.
This is why ReAct pairs so naturally with grounding: its unit of work — one Reason/Act/Observe step — is also the unit you constrain, the unit you log, and the unit you evaluate. The loop is the connective tissue that makes the four ingredients below composable and observable instead of a pile of disconnected best practices.
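The loop described in this section can be sketched in a few lines. This is a minimal illustration, not the Tamil system's code: `run_loop`, `Outcome`, and the three callables are hypothetical names, the attested set stands in for the graph, and only the (வி)டு claim comes from the series; the other claim strings are placeholders. The one thing it is meant to show is the stopping rule: the loop commits when nothing is left ungrounded, and otherwise reports that it has no support.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Outcome:
    committed: bool      # True only when every claim grounded
    claims: List[str]    # the claims at the point the loop stopped
    steps: int           # Reason/Act/Observe iterations used


def run_loop(
    propose: Callable[[], List[str]],
    ground: Callable[[str], bool],
    revise: Callable[[List[str], List[str]], Optional[List[str]]],
    max_steps: int = 5,
) -> Outcome:
    claims = propose()                                   # Reason: initial hypothesis
    for step in range(1, max_steps + 1):
        rejected = [c for c in claims if not ground(c)]  # Act: check every claim
        if not rejected:                                 # the bar for "done": groundedness
            return Outcome(True, claims, step)
        revised = revise(claims, rejected)               # Observe the rejections, Reason again
        if revised is None:                              # no support to be had: say so
            return Outcome(False, claims, step)
        claims = revised
    return Outcome(False, claims, max_steps)             # budget spent, still ungrounded


# Toy stand-in for the graph; the second entry is a placeholder, not real data.
ATTESTED = {"(வி)டு=completive", "placeholder-morpheme=placeholder-concept"}


def propose() -> List[str]:
    return ["(வி)டு=completive", "placeholder-morpheme=wrong-guess"]


def ground(claim: str) -> bool:
    return claim in ATTESTED


def revise(claims: List[str], rejected: List[str]) -> Optional[List[str]]:
    known_fixes = {
        "placeholder-morpheme=wrong-guess": "placeholder-morpheme=placeholder-concept"
    }
    if not all(r in known_fixes for r in rejected):
        return None                                      # nothing better to propose
    return [known_fixes.get(c, c) for c in claims]


grounded_run = run_loop(propose, ground, revise)
unsupported_run = run_loop(lambda: ["unknown=anything"], ground, revise)
print(grounded_run.committed, grounded_run.steps)
print(unsupported_run.committed, unsupported_run.steps)
```

Note that `run_loop` has no confidence parameter at all. The only ways out are a fully grounded set of claims, an explicit "no support", or an exhausted step budget.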
02 A ground truth worth grounding in
Take the Act move first: it grounds a claim against a source of truth. That source is only as good as its contents — grounding in a noisy, half-attested, or internally contradictory store just produces confident nonsense that now looks sourced. So this ingredient is the least glamorous and the most load-bearing: a small, curated, authoritative store that can only confirm or deny, never generate. In the Tamil system that's a hand-verified Neo4j graph encoding Schiffman's Reference Grammar of Spoken Tamil.
But size and curation aren't enough. The store has to be well-linked,
and in a specific way: meaning must flow through structure rather than hang off
individual facts. Part 3's triad is the example — the graph never draws a Morpheme → meaning edge,
because the same morpheme (வி)டு means different things in
different constructions. Meaning is read off the Construction → Morpheme →
Concept triple. That linking decision is what lets a claim like "this
(வி)டு means completive" be checked in a single query: it
either resolves to an existing triad or it doesn't. Orphaned facts can't be checked
that cleanly — there's nothing for a hypothesis to resolve against.
Don't ask "do I have a knowledge base?" Ask "is meaning in my knowledge base relational — can an arbitrary claim be reduced to 'does this specific structure exist?'" If yes, an unsupported structural claim becomes catchable instead of merely regrettable. If your facts are flat key→value strings, the model can still drift between plausible neighbours and nothing will object.
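A small sketch of why the relational shape matters, using an in-memory set in place of the Neo4j graph. The `Triad` type, the `resolves` function, and the construction and concept labels are all hypothetical placeholders; only the morpheme (வி)டு and the "completive" reading come from the series. The comparison with a flat dictionary is there to show what gets lost, not to describe any real store.

```python
from typing import FrozenSet, NamedTuple


class Triad(NamedTuple):
    construction: str
    morpheme: str
    concept: str


# Placeholder triads: meaning is attached to the whole triple, never to the morpheme alone.
GRAPH: FrozenSet[Triad] = frozenset({
    Triad("construction-A", "(வி)டு", "completive"),
    Triad("construction-B", "(வி)டு", "concept-B"),
})


def resolves(claim: Triad, graph: FrozenSet[Triad] = GRAPH) -> bool:
    """A claim is checkable because it reduces to 'does this exact structure exist?'."""
    return claim in graph


# The flat alternative: one meaning per morpheme, with the construction thrown away.
FLAT = {"(வி)டு": "completive"}

good = Triad("construction-A", "(வி)டு", "completive")
drifted = Triad("construction-B", "(வி)டு", "completive")  # right morpheme, wrong context

print(resolves(good))      # the triad exists
print(resolves(drifted))   # the triad does not exist, so the claim is rejected
print(FLAT["(வி)டு"] == drifted.concept)  # the flat store cannot tell the two apart
```

The flat lookup accepts the drifted claim because it has no place to record the construction; the triad check rejects it in a single membership test.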
03 Retrieval tuned for relevance
Act can only ground against a fact it can actually find — which is a different problem from having it. And here's the distinction that's easy to blur. word_lookup is not the retrieval system — it's a tool call, a doorway. The retrieval system is the query underneath it: the Cypher that ranks surface-similar neighbours by longest common substring, caps the result set, and returns them with match scores. The tool is the contract (ingredient 4); the query is the information-retrieval engine, and it has to be tuned in its own right.
Tuned toward what, exactly? Maximising relevant retrieval — which is not the same as maximising retrieval. The goal is to get the right nodes in front of the model with as little noise as possible and as few tokens as possible. There are two ways to fail, in opposite directions:
- Under-retrieval. The relevant node exists in the graph but the query doesn't surface it: wrong ranking, too tight a filter, the wrong entry point. The model, finding nothing, falls back on its prior and invents. The graph had the answer; the retrieval never delivered it.
- Over-retrieval. The query dumps the whole neighbourhood (two hundred loosely related nodes) and the one that matters is buried among distractors. Noise pulls the model toward a plausible red herring, and every irrelevant node spends tokens and attention you don't get back. More context is not more grounding.
So the levers are all about precision: ranking and scoring so the best candidates come first; capping result counts; returning a tight closure (the triad, not a raw subgraph dump); and offering multiple entry points — search_* reaches the graph by surface form, by gloss, or by meaning, because you arrive at a problem knowing different things on different turns. A perfectly curated graph (ingredient 2) is inert if the right node never reaches the model, or reaches it drowned in a hundred it didn't need.
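The precision levers above (rank, score, cap) can be illustrated outside the database. The series runs this as Cypher; the sketch below is a Python stand-in that uses the standard library's `difflib.SequenceMatcher` to compute the longest common substring. The function names, the normalisation by the longer string, the `min_score` and `limit` defaults, and the romanised strings are all assumptions made up for the example, not values from the Tamil system.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common contiguous substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def rank(
    query: str,
    surface_forms: Sequence[str],
    limit: int = 3,
    min_score: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return at most `limit` candidates, best first, each with its match score."""
    if not query:
        return []
    scored = []
    for form in surface_forms:
        score = lcs_length(query, form) / max(len(query), len(form))
        if score >= min_score:                    # the filter guards against over-retrieval
            scored.append((form, round(score, 2)))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]                         # the cap keeps the token cost bounded


# Made-up romanised strings, used only to exercise the ranking.
FORMS = ["vandhutten", "vandhen", "vandhaan", "poyitten", "saapten"]

hits = rank("vandhutten", FORMS, limit=2)
misses = rank("zzz", FORMS)
print(hits)
print(misses)
```

Both failure directions are visible in the two knobs: raise `min_score` too far and the relevant neighbour disappears (under-retrieval); drop the filter and the cap and every loosely similar form comes back (over-retrieval). An empty result is a legitimate answer, and it is what lets the model report that nothing matched.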
04 Typed tools that expose it
Both the graph and its queries reach the model through one narrow channel — and that channel is the third thing Act needs. The model never touches the graph directly. It can't run a query or hand-write a node. It gets a fixed set of typed tools over the Model Context Protocol — schema-validated requests, so a malformed call fails harmlessly instead of corrupting a curated store. This is the layer where the queries from ingredient 3 are wrapped and exposed: one typed tool is the stable, documented contract; the carefully tuned query is what runs behind it. Keeping them separate matters — you re-tune the query without changing the model's interface, and you constrain what the model can ask without constraining how well the graph can answer.
Two properties earn their keep. First, the tools constrain the action space: a fixed vocabulary of calls is the only path in, so the model can't invent structure or reach around the guardrails. Second, they answer in prose with embedded structured fields, never raw JSON — that forces the model to weigh the evidence and judge, rather than branch on a magic number. And the unglamorous invariants — validation, normalisation, atomic writes, the ~12 checks before create_wordform commits — live once, in code, instead of being re-derived from a prompt every session.
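To make the two properties concrete, here is a minimal sketch of a typed tool layer in plain Python. It does not use the Model Context Protocol SDK or any real schema library; `WordLookupRequest`, `call_tool`, the parameter names, and the bounds on `limit` are assumptions for illustration. Only the tool name `word_lookup` comes from the series, and the store contents are made-up strings.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


class ToolInputError(ValueError):
    """Raised when a request does not satisfy the tool's schema."""


@dataclass(frozen=True)
class WordLookupRequest:
    surface_form: str
    limit: int = 5

    def __post_init__(self) -> None:
        if not isinstance(self.surface_form, str) or not self.surface_form.strip():
            raise ToolInputError("surface_form must be a non-empty string")
        valid_int = isinstance(self.limit, int) and not isinstance(self.limit, bool)
        if not valid_int or not 1 <= self.limit <= 20:
            raise ToolInputError("limit must be an integer between 1 and 20")


def word_lookup(request: WordLookupRequest, store: Tuple[str, ...]) -> str:
    """Answer in prose with embedded fields, so the model weighs evidence."""
    hits = [form for form in store if request.surface_form in form][: request.limit]
    if not hits:
        return f"No attested form matches '{request.surface_form}' (matches=0)."
    return (f"Found attested forms for '{request.surface_form}': "
            f"{', '.join(hits)} (matches={len(hits)}).")


# The fixed vocabulary: anything not listed here is not callable.
TOOLS = {"word_lookup": (WordLookupRequest, word_lookup)}


def call_tool(name: str, arguments: Dict[str, Any], store: Tuple[str, ...]) -> str:
    if name not in TOOLS:
        return f"Rejected: unknown tool '{name}'."
    schema, handler = TOOLS[name]
    try:
        request = schema(**arguments)             # validation happens before any work
    except (TypeError, ToolInputError) as exc:
        return f"Rejected: {exc}"
    return handler(request, store)


STORE = ("vandhen", "vandhaan", "poyitten")        # made-up, read-only stand-in

ok = call_tool("word_lookup", {"surface_form": "vandh", "limit": 2}, STORE)
blank = call_tool("word_lookup", {"surface_form": "   "}, STORE)
wrong_field = call_tool("word_lookup", {"query": "vandh"}, STORE)
raw_query = call_tool("run_cypher", {"text": "MATCH (n) DETACH DELETE n"}, STORE)
print(ok, blank, wrong_field, raw_query, sep="\n")
```

The malformed and out-of-vocabulary calls come back as rejections the model can read and react to, and the store is never touched. The handler could be re-tuned freely without changing `WordLookupRequest`, which is the separation between contract and query that this section argues for.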
05 Monitoring and evaluation
The last ingredient is the Observe move scaled up — across one run, and then across many. This is where most grounded systems are thinnest. "It works on the demo" is not a measurement, and a single held-out accuracy number — useful as it is — misses most of what you need to actually run and improve the thing. Evaluation here is a stack of signals at three levels, most of which fall straight out of logging the loop and need no external harness, plus the gold-standard accuracy eval on top.
- Level 1 · Tool-call logging: the mechanical signals. One structured line per call, keyed by session. The cheap, automatic, label-free questions: was the tool called? Did it fail (schema error, exception)? Did it return anything, or null? How long did the call take? How many times was each tool used? These catch regressions and degenerate behaviour the moment they appear (a tool that suddenly always returns null, a loop that calls the same search forty times) with no annotation required.
- Level 2 · Reason-before / review-after: the intent signals. The important addition, and the one people skip. Before each call the agent states why it's making it and what it expects back. After, it reviews: did the result match the expectation, did the tool do its job, what were the pain points. Now a log line carries intent + prediction + post-hoc judgement, so you can find where the model's picture of the world diverged from the graph, not just where a call threw an error. A call that "succeeded" but returned the opposite of what the agent predicted is invisible at Level 1 and obvious here. Why it matters: errors are easy to log; the failures that hurt a grounded system are the ones that are wrong without raising anything. Expectation-vs-result is how you surface them.
- Level 3 · Trajectory: the whole-run signals. Per run: how many tools were used, total wall-clock time, path length, how many refine loops, did it reach a verified commit. Plus trajectory reflection: the agent looking back over the entire run and saying what was hard and what it would do differently. Aggregate these across many runs and you're measuring behavioural drift (the path getting longer, refines climbing, time creeping up), a regression you'd never see from a pass/fail on the final answer.
Everything above measures consistency and behaviour — that the system grounded its claims and behaved sanely. It does not measure linguistic correctness; for that you need a held-out, expert-annotated set of segmentations and gloss-level accuracy. That set is still the open milestone for the Tamil system, and I'd rather say so than imply a metric I don't have. The point of the three levels is that you get most of your eval — and all of your drift detection — long before, and independent of, the gold set.
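The three levels can fall out of one wrapper around each tool call. The sketch below is an assumption about shape, not the Tamil system's logger: `CallRecord`, `logged_call`, `trajectory_summary`, and every field name are hypothetical, and the tool itself is faked with a lambda. It covers the mechanical fields (Level 1), the stated reason, expectation, and post-hoc review (Level 2), and a per-run roll-up (Level 3). Free-text trajectory reflection is left out.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class CallRecord:
    session: str
    tool: str
    reason: str                                   # Level 2: why the call is being made
    expected: str                                 # Level 2: what the agent predicts
    ok: bool = False                              # Level 1: did it run without raising
    empty: bool = True                            # Level 1: did it return nothing
    seconds: float = 0.0                          # Level 1: latency
    error: Optional[str] = None
    matched_expectation: Optional[bool] = None    # Level 2: review after the call


def logged_call(log: List[CallRecord], session: str, tool: str,
                fn: Callable[[], Any], reason: str, expected: str,
                review: Callable[[Any], bool]) -> Any:
    record = CallRecord(session, tool, reason, expected)
    start = time.perf_counter()
    result = None
    try:
        result = fn()
        record.ok = True
    except Exception as exc:                      # a failed call is logged, not hidden
        record.error = repr(exc)
    record.seconds = time.perf_counter() - start
    record.empty = not result
    if record.ok:
        record.matched_expectation = bool(review(result))
    log.append(record)
    return result


def trajectory_summary(log: List[CallRecord], committed: bool) -> Dict[str, Any]:
    """Level 3: whole-run metrics, comparable across runs to spot drift."""
    return {
        "calls": len(log),
        "per_tool": dict(Counter(r.tool for r in log)),
        "failures": sum(not r.ok for r in log),
        "empty_results": sum(r.ok and r.empty for r in log),
        "surprises": sum(r.matched_expectation is False for r in log),
        "seconds": sum(r.seconds for r in log),
        "committed": committed,
    }


log: List[CallRecord] = []
logged_call(log, "s1", "word_lookup", lambda: [],
            reason="check the stem is attested", expected="at least one match",
            review=lambda res: len(res) > 0)
logged_call(log, "s1", "word_lookup", lambda: ["hit"],
            reason="retry with a shorter stem", expected="at least one match",
            review=lambda res: len(res) > 0)
summary = trajectory_summary(log, committed=False)
print(summary)
```

The first call is the case Level 2 exists for: it raised nothing, so `failures` stays at zero, yet it contradicted the agent's prediction and shows up under `surprises`.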
∴ Groundedness isn't robustness
The five ingredients share one unstated assumption: that a single run, fully grounded, is a single run you can trust. It is grounded — but grounded and robust are different properties, and the gap between them is its own failure mode. A grounded agent isn't reasoning over the whole knowledge base; it's reasoning over the slice its retrieval path exposed. Every move in §01's loop — a search term, an intermediate hypothesis, which neighbour it anchored on first — conditions the hypothesis space the next move can even consider. Retrieval isn't passive evidence access; it's active trajectory shaping. One unlucky early step (a slightly-off query, an overconfident first guess) can steer the rest of the run into the wrong region of the graph, and every later step can still be perfectly grounded relative to that trajectory. The result isn't hallucination — Part 3 watched the graph catch that. It's premature convergence: the system confidently consistent inside a bad basin.
Part 2 already used the cure once. Eight runs of the ungrounded model landed in different basins, and that disagreement was the tell that it was generating, not recovering. The grounded agent earns the same test — only now the variation lives one level up, in the trajectory rather than the answer. Run it several times — different temperatures, retrieval seeds, prompt phrasings, even different model providers — and read the distribution. Agreement across independent traces is evidence the conclusion is robust to how the search unfolded; disagreement is the signal that the answer depends on the path taken to reach it. This is repeated-sampling uncertainty in the spirit of Monte-Carlo dropout: one run gives an answer, a spread of runs reveals stability — though it stays a diagnostic of instability, not a calibrated probability.
The catch: those traces aren't truly independent. They share an architecture, a retrieval strategy, a tool vocabulary — and shared bias steers them the same way. Ten runs of one brittle loop can converge on the same wrong basin and look reassuringly stable. So vary what actually breaks the symmetry (different providers especially, since they don't share priors), and lean on post-run validation: re-check a finished trajectory against an alternative trace, a verifier model, or a different retrieval strategy. The goal isn't a majority vote for truth — it's detecting when a conclusion is fragile to the search that produced it. None of which is free: every extra trace, verifier pass and refine loop is more tokens and more wall-clock. Reliability, here, is purchased with search — a real line on the bill, not a rounding error, and worth spending deliberately rather than by default.
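Reading the distribution across traces needs very little machinery. The sketch below assumes each run has been reduced to a single final conclusion (or `None` when the run declined to commit); `trace_agreement`, `is_fragile`, the 0.8 threshold, and the example labels are all hypothetical choices for illustration. In line with the caveats above, the output is a diagnostic of instability, not a probability of being right.

```python
from collections import Counter
from typing import Any, Dict, Hashable, Optional, Sequence


def trace_agreement(conclusions: Sequence[Optional[Hashable]]) -> Dict[str, Any]:
    """Summarise how far independent runs converged on one conclusion."""
    if not conclusions:
        raise ValueError("need at least one trace")
    counts = Counter(conclusions)
    top, top_count = counts.most_common(1)[0]
    return {
        "top": top,                                   # the most frequent conclusion
        "agreement": top_count / len(conclusions),    # share of runs that reached it
        "distinct": len(counts),                      # number of different basins
    }


def is_fragile(conclusions: Sequence[Optional[Hashable]], threshold: float = 0.8) -> bool:
    """Flag a conclusion whose support depends on the path taken to reach it."""
    return trace_agreement(conclusions)["agreement"] < threshold


# Placeholder labels; None marks a run that reported "I don't have this".
varied_runs = ["analysis-A", "analysis-A", "analysis-B", "analysis-A", None]
uniform_runs = ["analysis-A"] * 5

print(trace_agreement(varied_runs), is_fragile(varied_runs))
print(trace_agreement(uniform_runs), is_fragile(uniform_runs))
```

A low agreement score is informative; a high one is only as good as the independence of the runs behind it. Five identical conclusions from one brittle loop will pass this check, which is why the traces should differ in provider, retrieval strategy, or phrasing before the number is trusted.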
- Groundedness: did every claim resolve to attested structure? (§02–04)
- Correctness: does it match expert truth? (the gold set, §05)
- Robustness: do independent trajectories converge? (this section)
- Calibration: does it know when it might be wrong? (search_matching_wordforms returning null)

The recipe nails the first; this section is how you stop mistaking it for the other three.
The five ingredients make one run honest about the edge of the graph. This makes you honest about the edge of the run.
★ What generalises
A grounded system can still fail in three ways, and the whole point is that each one needs a different mechanism to catch it:
| Failure | Example | Caught by |
|---|---|---|
| Unsupported structure | an illegal allomorph; a relabelled morpheme | grounding · §02–04 |
| Wrong trajectory | a plausible but incomplete basin, reached by one bad early step | robustness · multiple traces (∴) |
| Missing / wrong ontology | the graph lacks the phenomenon, or its schema mis-frames it | human eval · the gold set (§05) |
- The domain shape, not the domain. This recipe earns its keep wherever the ground truth is small, specialised, and largely absent from a model's training data — Spoken Tamil morphology here, clinical entity linking in the work Part 3 gestured at. Different field, identical spine.
- Each ingredient guards a different failure. No loop and the model stops when it feels sure; a bad store gives sourced nonsense; bad retrieval starves or drowns the model; missing tools let it freelance; no eval hides the drift. They aren't interchangeable, and the guarantee is only as strong as the weakest one — and even all five guard only a single run: robustness is a separate axis, so run it again, differently, and see if it agrees.
- What you get, honestly. Consistency, observability, and a model that says "I don't have this" instead of fabricating: search_matching_wordforms returning null rather than inventing a parent. What you don't get for free is omniscience. Grounding buys provenance, not truth: a claim resolves to attested structure, which is not the same as being correct, and that gap is exactly what the gold set measures. And it buys no new ground: absence from the graph becomes absence from reality, so the system inherits the blind spots of its substrate and the assumptions baked into it, because a schema is a theory and a construction an interpretation, not a neutral container. Which is the honest tell about where the work went: the expertise isn't discovered by the agent at run time; it was moved there ahead of time, into the ontology, the schema, the retrieval shaping and the invariants the model can't override. The recipe makes a model honest about the edge of its knowledge. It doesn't move that edge. A human drew it.