Creating the first morphological analyser for Spoken Tamil · Part 3 of 4

Teaching a model to admit it doesn't know — by forcing it to find evidence

The architecture that makes an ill-specified problem tractable: let the model break a Spoken Tamil word into parts without inventing parts that don't exist. It proposes; a knowledge graph disposes. A fully traced run is the story within.

Curtis Murray · worked example: படிச்சிட்டேன் · every Tamil & English word below is a live card

The series so far. Part 1: morphological analysis means breaking a word into its smallest meaningful pieces — learners → learn‑er‑s, and படிச்சிட்டேன் → study + participle + completive + "I". Part 2: hand that word to a language model and it invents, relabels and drifts — fluently and invisibly — because for Spoken Tamil there is no ground truth in its training data to recover, only plausible fictions to generate. This post is the fix.

01 The fix: the model proposes, the graph disposes

So stop making the model the source of truth about what exists. The architecture is neuro-symbolic: it splits the system into two halves.

The symbolic half is a curated Neo4j knowledge graph encoding Schiffman's Reference Grammar of Spoken Tamil: every morpheme, every grammatical construction, every semantic concept, every rule that is actually attested. It is small, hand-verified, and authoritative. It cannot generate anything — it can only confirm or deny.

The neural half is the LLM. It is good at exactly what the graph can't do: looking at an unfamiliar surface string and proposing a plausible decomposition, and writing fluent learner-facing explanations.

The bargain: the LLM proposes; the graph disposes. Every hypothesis is checked against the graph. A rejection isn't a dead end — it comes back as structured feedback ("that morpheme ID doesn't exist; here are the ones that do, with their canonical forms"), which the model uses to revise. It loops until everything it wants to write is grounded in a node that actually exists. The model is allowed to be creative about hypotheses and never allowed to be creative about facts.

The two halves never touch directly. They're connected through a single MCP server exposing a fixed set of typed tools — the only path to the graph — so the model can't run raw queries or invent structure; it proposes, the graph adjudicates. Analysis runs as a ReAct loop: at each step the model reasons about the current state, calls one tool, and observes structured feedback — match scores, construction coverage, position-by-position diffs against the closest existing word — then refines before the next step, repeating until the analysis is grounded enough to commit as a new node. You can watch exactly this happen in the traced run below.

Step	What the system does	The transferable principle
① Propose (LLM · creative)	Hypothesise a segmentation and gloss for the input word.	Define a committable target; give the model retrieval to find what's real.
② Ground (MCP typed tools)	Check every claim against the graph: match scores, construction coverage, position-by-position diffs.	Ground in a source of truth it can't override; constrain the action space to typed tools.
③ Commit (create_wordform)	Write the new, fully-grounded node back to the graph.	Evaluate the tool calls, not just the final answer.
↩ Not grounded (revise ①)	The rejection returns as structured feedback (canonical forms, the legal allomorph set, the exact slot that failed) and the model revises its hypothesis.	Make failure useful: structured feedback, never a dead end.
↻ Repeat (outer loop)	Run real words, watch where it drifts, tighten the tools and prompts.	Test and iterate until it converges.

The analyser as a ReAct loop: the LLM proposes, deterministic tools ground every claim against the graph, rejections come back as structured feedback, and only a fully-grounded analysis is committed. The right-hand column is the transferable part — the same spine holds in any domain where the ground truth is small, specialised, and largely absent from training data.
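The loop in the table above can be sketched in a few lines. This is a toy under heavy assumptions, not the project's code: the graph is reduced to a set of morpheme IDs (only `morph_vidu` appears in this post; the others are hypothetical), the "model" is a stub that simply drops whatever was rejected, and `ground`, `propose` and `analyse` are illustrative names.

```python
from dataclasses import dataclass, field

# Toy stand-in for the curated graph: the only morpheme IDs that "exist".
# Apart from morph_vidu, these IDs are hypothetical.
KNOWN_MORPHEMES = {"morph_padi", "morph_avp", "morph_vidu", "morph_1sg"}


@dataclass
class Feedback:
    grounded: bool
    unknown: list = field(default_factory=list)     # IDs the graph rejected
    candidates: list = field(default_factory=list)  # IDs that do exist


def ground(hypothesis):
    """Deterministic check: the graph confirms or denies, it never generates."""
    unknown = [m for m in hypothesis if m not in KNOWN_MORPHEMES]
    return Feedback(not unknown, unknown, sorted(KNOWN_MORPHEMES))


def propose(previous, feedback):
    """Stub for the LLM. A real model would choose among feedback.candidates;
    this stub only removes the IDs that were rejected."""
    if feedback is None:
        # First guess contains one invented morpheme.
        return ["morph_padi", "morph_avp", "morph_completive_x", "morph_vidu", "morph_1sg"]
    return [m for m in previous if m not in feedback.unknown]


def analyse(max_steps=5):
    hypothesis, feedback = None, None
    for step in range(max_steps):
        hypothesis = propose(hypothesis, feedback)  # Reason + Act
        feedback = ground(hypothesis)               # Observe
        if feedback.grounded:
            return list(hypothesis), step + 1       # the only point where a write happens
    return [], max_steps                            # never grounded: commit nothing


result, steps = analyse()
print(result, steps)
```

The design point is that `analyse` has exactly one exit that produces output, and it sits behind `feedback.grounded`. If the budget runs out, nothing is committed.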

02 The unit of truth: the triad

You might expect the graph to just say "morpheme (வி)டு means completive aspect". It deliberately doesn't, and the reason is the most important modelling decision in the system — and a clue to where the expertise in this setup actually lives: not in the agent at run time, but in choices like this one, made by a human and committed to the graph's shape long before the agent runs.

Morphemes are polysemous. A morpheme's meaning is not intrinsic; it depends on the construction it appears in. The classic Tamil example: the clitic உம் means "also / too" in one construction (கோட்டையும் "the fort too") and "even if" in another (வந்தாலும் "even if X comes"). If you attach meaning directly to the morpheme, one of those has to be wrong. So the graph never draws a Morpheme → meaning edge. Meaning always flows through a Construction: a multi-morpheme pattern, in the Construction Grammar sense, that is the real locus of meaning. The load-bearing structure is a triad of construction, semantic concept and morpheme, shown here with the rule that governs it:

Construction: constr_marked_completion
pattern: VERB-AVP + COMPL-TENSE-PNG
"Action definitely completed"
போய்ட்டேன் "I definitely went"

Semantic Concept: sem_completive_aspect
"Action viewed as completed, definite, brought to conclusion"
props: [certainty, completion]

Morpheme: (வி)டு · morph_vidu
gloss COMPL · aspect marker

Rule: rule_aspectual_verb_syntax
"Aspectual verbs attach to the AVP of the main verb and carry tense / PNG marking"

The same morph_vidu also participates in constr_aspectual_vidu — one morpheme, multiple constructions. Meaning is read off the construction, never the morpheme alone.

— live subgraph, explore_nodes(constr_marked_completion, morph_vidu)

The completive-aspect triad, pulled live from the graph for this walkthrough. This is the actual closure the agent reasons over for the (வி)டு bead of படிச்சிட்டேன்.

Read it as a sentence: the construction constr_marked_completion uses the morpheme morph_vidu, expresses the concept sem_completive_aspect, and follows the rule that aspectual verbs attach to the main verb's AVP. That whole triad is the unit of truth. When the agent claims "this (வி)டு here means completive", it is not asserting a fact about a morpheme — it is asserting that this specific triad exists. The graph can check that in one query. That is what makes unsupported structure catchable instead of merely regrettable.
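To make "asserting that this specific triad exists" concrete, here is a minimal in-memory sketch. It assumes triads can be flattened to (construction, morpheme, concept) tuples, which is a simplification of the real Neo4j schema; `Triad`, `triad_exists` and `meanings_of` are illustrative names. The first triad uses IDs from this post. The concept attached to `constr_aspectual_vidu` is a placeholder, since the post does not say what that construction expresses.

```python
from typing import NamedTuple


class Triad(NamedTuple):
    construction: str
    morpheme: str
    concept: str


TRIADS = {
    # IDs taken from the post.
    Triad("constr_marked_completion", "morph_vidu", "sem_completive_aspect"),
    # Placeholder concept: the post does not name it.
    Triad("constr_aspectual_vidu", "morph_vidu", "sem_PLACEHOLDER"),
}


def triad_exists(construction, morpheme, concept):
    """'This morpheme means X here' is a membership test on a whole triad."""
    return Triad(construction, morpheme, concept) in TRIADS


def meanings_of(morpheme):
    """There is no morpheme -> meaning lookup; meaning is keyed by construction."""
    return {t.construction: t.concept for t in TRIADS if t.morpheme == morpheme}


print(triad_exists("constr_marked_completion", "morph_vidu", "sem_completive_aspect"))
print(meanings_of("morph_vidu"))
```

Because `meanings_of` returns a mapping keyed by construction rather than a single value, a caller cannot ask what the morpheme means without also naming the construction.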

03 Seeing it work: one word, traced end to end

Everything above is the architecture. This is the story within it: the same word, படிச்சிட்டேன், but now watch the loop actually run — including the two moments it caught itself being confidently wrong.

The agent never touches the graph directly. It cannot run a database query or hand-write a file. It gets a handful of typed tools over the Model Context Protocol — schema-validated requests, so a malformed call fails harmlessly instead of corrupting a curated graph. Validation, normalisation and atomic writes live in that tool layer, enforced once in code rather than re-derived from a prompt each session. The tools answer in prose with embedded structured fields, never raw JSON — that forces the model to weigh the evidence and judge, rather than branch on a magic number. The agent decides when it is grounded; the server only ever surfaces evidence.
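A rough sketch of what "a malformed call fails harmlessly" can look like, using only the standard library. The request shape (`surface`, `segment_ids`) and the `morph_` prefix rule are assumptions made for illustration; the actual tool schemas are not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidateHypothesisRequest:
    """Hypothetical request type for a validate_hypothesis-style tool."""
    surface: str
    segment_ids: tuple

    def __post_init__(self):
        if not isinstance(self.surface, str) or not self.surface.strip():
            raise ValueError("surface must be a non-empty string")
        if not self.segment_ids:
            raise ValueError("segment_ids must not be empty")
        for seg in self.segment_ids:
            if not isinstance(seg, str) or not seg.startswith("morph_"):
                raise ValueError(f"not a morpheme ID: {seg!r}")


def call_tool(raw):
    """Parse an untrusted tool call. A malformed one is refused before any graph access."""
    try:
        req = ValidateHypothesisRequest(
            raw.get("surface", ""), tuple(raw.get("segment_ids", ()))
        )
    except (ValueError, TypeError) as exc:
        # The reply is prose, in line with the tools answering in prose.
        return f"Request refused: {exc}. Nothing was read or written."
    return f"Request accepted: {len(req.segment_ids)} segment(s) for {req.surface}."


print(call_tool({"surface": "படிச்சிட்டேன்", "segment_ids": ["morph_vidu"]}))
print(call_tool({"surface": "x", "segment_ids": ["MATCH (n) DETACH DELETE n"]}))
```

The second call shows the point of typing the action space: a string that looks like a raw query is just an invalid morpheme ID, and it is rejected as such.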

The control loop is ReAct — Reason, Act, Observe, repeat — until every piece it wants to write resolves to a node that already exists:

Reason: form / revise a hypothesis → Act: one tool call → Observe: structured feedback ↺ … until grounded → Act: atomic write

Four tools carry the run. word_lookup finds surface-similar neighbours: the right-anchored match போய்ட்டேன் shares the ending ட்டேன், which is evidence for the completive-past-"I" tail before any analysis begins. validate_hypothesis scores a whole proposed segmentation against the graph and reports exactly which segments fail to resolve. The search_* tools reach the graph by surface form, by gloss, or by meaning, because the agent may start out knowing any one of the three. create_wordform is the only writer: it checks ~12 invariants before the write and then re-queries the node it just wrote to confirm it landed.
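The right-anchored match described above can be approximated as a longest-common-suffix comparison. This is a sketch of the idea only: the real word_lookup reports LCS neighbours and its exact scoring is not described here, so `common_suffix`, `right_anchored_neighbours` and the `min_len` threshold are assumptions.

```python
def common_suffix(a: str, b: str) -> str:
    """Longest shared ending of two strings, compared code point by code point.

    Note: this can cut through a Tamil grapheme cluster; a production tool
    would probably align on clusters instead.
    """
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return a[len(a) - n:]


def right_anchored_neighbours(word, lexicon, min_len=3):
    """Known words ranked by how much of their ending they share with `word`."""
    scored = [(len(common_suffix(word, w)), w) for w in lexicon if w != word]
    return [w for n, w in sorted(scored, reverse=True) if n >= min_len]


target = "படிச்சிட்டேன்"
neighbour = "போய்ட்டேன்"
shared = common_suffix(target, neighbour)
print(shared)
print(right_anchored_neighbours(target, [neighbour, "படி"]))
```

For the two words from the trace, the shared ending is the same ட்டேன் tail the post mentions, while the bare root படி shares no ending and drops out of the neighbour list.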

Status legend: accepted · rejected by the graph (forced a refine) · read / discovery

Tool	What it asked the graph · ReAct step	Recorded outcome	Status
word_lookup	Does படிச்சிட்டேன் exist? Near-matches? (step 0)	exact: false · 7 LCS neighbours	new word
validate_hypothesis	Score the proposed root·participle·completive·past·"I" split (step 2)	constr_marked_completion · coverage 50% · all segments resolve · jaccard 0.40	scored
explore_nodes · search_general · search_construction	Confirm the COMPL triad; inspect `word_padi`; check parallels `word_kandupidicchittu`, `word_veesiduven`, `constr_class_vii_past` (steps 3 & 5)	triads confirmed; AVP+COMPL construction = 1 match	grounding
search_matching_wordforms	Find a derivational parent (step 4)	best_match: null · 0 alternatives	branch is new
search_wordform_tree	Walk descendants of root `word_padi`	2 nodes	read
explore_nodes	Verify the pattern against `word_pannikitteen`	18 nodes	read
create_wordform	`word_padichchu` — first write attempt (step 7)	REJECTED — gloss prefix mismatch	refine ↻
create_wordform	`word_padichchu` — corrected gloss	post_write_verified: true	committed
create_wordform	`word_padichchittu` (+ COMPL)	post_write_verified: true	committed
create_wordform	`word_padichchitt` — first write attempt	REJECTED — underlying form ≠ morpheme	refine ↻
create_wordform	`word_padichchitt` — corrected notation	post_write_verified: true	committed
create_wordform	`word_padichchitten` — target leaf (+ 1SG)	post_write_verified: true	committed

Note the shape: a wide read-only discovery phase grounding every morpheme and the triad, then a careful write phase. And search_matching_wordforms returned null — the system reported it did not already know this branch rather than guessing a parent. Knowing what it doesn't know is the whole point.

Two times the graph caught the model

The model's linguistic reasoning was strong, yet it still made two of the exact errors from Part 2. Neither reached the graph. The rejection text below is verbatim.

Catch 1: the gloss drifted (failure mode 3) · create_wordform · word_padichchu

A child word must extend its parent by one morpheme and inherit the parent's gloss prefix exactly — that's what keeps the derivation forest coherent. The model glossed the root study; its parent word_padi is glossed read/study. A silent drift no human would catch at scale. The write was refused before it happened:

❌ Gloss prefix mismatch:
  Parent gloss: ['read/study']
  Child prefix: ['study']
  First 1 elements must match parent exactly.

Refined and re-submitted with the gloss aligned to the parent → committed, post-write verified.
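The invariant behind this rejection is small enough to sketch. The function below is an illustration, not the server's code: it assumes glosses are stored as lists with one element per morpheme, and the child gloss `AVP` for the participle is my assumption. The message layout mirrors the rejection quoted above.

```python
def check_gloss_prefix(parent_gloss, child_gloss):
    """Return (ok, message).

    The child must repeat the parent's gloss elements exactly,
    then add exactly one new element.
    """
    n = len(parent_gloss)
    if child_gloss[:n] != parent_gloss:
        return False, (
            "Gloss prefix mismatch:\n"
            f"  Parent gloss: {parent_gloss}\n"
            f"  Child prefix: {child_gloss[:n]}\n"
            f"  First {n} elements must match parent exactly."
        )
    if len(child_gloss) != n + 1:
        return False, (
            f"Child must add exactly one gloss element, got {len(child_gloss) - n}."
        )
    return True, "ok"


# The drifted gloss from the trace, then the corrected one.
print(check_gloss_prefix(["read/study"], ["study", "AVP"])[1])
print(check_gloss_prefix(["read/study"], ["read/study", "AVP"]))
```

An exact list comparison is deliberately strict: `study` is a reasonable paraphrase of `read/study`, and that is precisely the kind of drift the check exists to refuse.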

Catch 2: an illegal allomorph (failure mode 2) · create_wordform · word_padichchitt

For the past-tense piece the model wrote the underlying form ட்ட் — with a trailing pulli (the vowel-killing dot). The graph knows the canonical form and the legal allomorphs, and that this notation is none of them. The rejection didn't merely say "wrong" — it returned the canonical form and the full allomorph set, so the correction was deterministic, not a guess:

❌ Underlying form 'ட்ட்' does not match morph_past_strong_tt.

✓ Canonical form: த்த
✓ Acceptable variants: ['ச்ச', 'ட்ட', 'த்த']
✓ Allomorphs: ['ச்ச', 'ட்ட']

💡 Suggestion: Use exact canonical form: த்த

Refined and re-submitted with ட்ட (the legal retroflex allomorph, no trailing pulli) → committed, post-write verified.
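A sketch of why this rejection makes the correction deterministic: the check returns the legal set, not just a verdict. The canonical form and allomorphs for `morph_past_strong_tt` are copied from the rejection text above; the `Morpheme` class, the field names and the dict-shaped result are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    morph_id: str
    canonical: str
    allomorphs: tuple

    @property
    def acceptable(self):
        # sorted() orders by code point here, not by Tamil collation.
        return sorted({self.canonical, *self.allomorphs})


# Values copied from the rejection message in the trace.
PAST_STRONG = Morpheme("morph_past_strong_tt", "த்த", ("ச்ச", "ட்ட"))


def check_underlying_form(form, morpheme):
    """Accept only an exact match; on failure, return everything needed to fix it."""
    if form in morpheme.acceptable:
        return {"ok": True}
    return {
        "ok": False,
        "morpheme": morpheme.morph_id,
        "canonical": morpheme.canonical,
        "acceptable": morpheme.acceptable,
        "suggestion": morpheme.canonical,
    }


print(check_underlying_form("ட்ட்", PAST_STRONG))  # trailing pulli: rejected
print(check_underlying_form("ட்ட", PAST_STRONG))   # legal allomorph: accepted
```

The rejected and accepted forms differ by a single trailing code point, which is why this kind of error is hard to spot by eye and cheap to catch by exact comparison.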

Two of the three failure modes from Part 2, happening for real — and both intercepted with a structured, machine-actionable correction. Not "the LLM is usually right" — the LLM is checked, every time, against something that cannot hallucinate.

The end state: the completive branch of படி now exists where there was nothing — four nodes written in dependency order, each re-queried and verified after its write.

word_padi (root, pre-existing) → word_padichchu → word_padichchittu → word_padichchitt → word_padichchitten (target)
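The write phase above follows a simple discipline: each node is written only once its parent exists, and each write is read back. The sketch below uses a plain dict in place of Neo4j, so the read-back is trivially true here; against a real database it is a separate query. Only two of the roughly 12 invariants are shown, and this `create_wordform` is a toy with the same name, not the real tool.

```python
def create_wordform(graph, node_id, parent_id):
    """Write one node only if its parent already exists, then read it back."""
    if parent_id not in graph:
        return {"committed": False, "reason": f"parent {parent_id} does not exist"}
    if node_id in graph:
        return {"committed": False, "reason": f"{node_id} already exists"}
    graph[node_id] = {"parent": parent_id}
    # Post-write verification: re-query rather than trust the write call.
    verified = graph.get(node_id) == {"parent": parent_id}
    return {"committed": True, "post_write_verified": verified}


graph = {"word_padi": {"parent": None}}  # pre-existing root
chain = [
    "word_padichchu",
    "word_padichchittu",
    "word_padichchitt",
    "word_padichchitten",
]

results = []
parent = "word_padi"
for node in chain:
    results.append(create_wordform(graph, node, parent))
    parent = node

print(results)
```

Writing the target leaf first would fail the parent check, which is what forces the dependency order seen in the trace.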

04 What this shows

The neuro-symbolic guarantee is narrow but real: unsupported structure cannot silently enter the graph. Two LLM errors — a drifted gloss and a fine morphophonological one — were caught by the symbolic layer and never reached the graph. The exact rejection text is in the log; the run replays from tool-calls.jsonl.
Rejections are structured, so self-correction is cheap. The graph returned the canonical form and the legal allomorph set, not just "invalid" — one extra call and the model is right. This is the same constrain-to-verified-entities pattern that beat direct LLM generation in clinical entity linking — different domain, identical spine.
The system reports what it doesn't know. search_matching_wordforms returned null; the agent built a fresh branch instead of fabricating a parent. A model that says "I don't have this" is worth more than one that always answers.
Observability is built in. One structured line per tool call, keyed by session. Replaying a session reconstructs any run; aggregating by the outcome field measures behavioural drift — no external eval harness required.
Honest limitation. This proves graph consistency, not linguistic correctness. Ground truth is Schiffman plus expert judgement; there is no held-out labelled accuracy set yet — and the surface segmentations in the cards above are my best analysis, not gold data. Building that set — sampled words, expert-annotated segmentations, segment- and gloss-level accuracy — is the next milestone, and I'd rather say so than imply a metric I don't have.
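The observability point above can be sketched as follows. The log lines are invented sample data, and the field names (`session`, `tool`, `outcome`) are assumptions about what a line in tool-calls.jsonl might contain; only the existence of an outcome field and a session key comes from the post.

```python
import json
from collections import Counter, defaultdict

# Hypothetical sample log: one JSON object per tool call.
SAMPLE_LOG = """\
{"session": "s1", "tool": "word_lookup", "outcome": "read"}
{"session": "s1", "tool": "create_wordform", "outcome": "rejected"}
{"session": "s1", "tool": "create_wordform", "outcome": "committed"}
{"session": "s2", "tool": "create_wordform", "outcome": "committed"}
"""


def load(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def replay(records, session):
    """Ordered (tool, outcome) pairs for one session."""
    return [(r["tool"], r["outcome"]) for r in records if r["session"] == session]


def rejection_rate(records):
    """Share of write attempts the graph refused, per session."""
    counts = defaultdict(Counter)
    for r in records:
        if r["tool"] == "create_wordform":
            counts[r["session"]][r["outcome"]] += 1
    return {s: c["rejected"] / sum(c.values()) for s, c in counts.items()}


records = load(SAMPLE_LOG)
print(replay(records, "s1"))
print(rejection_rate(records))
```

Tracking a number like the per-session rejection rate over time is one way to read "behavioural drift" off the log: a rising rate after a prompt or model change would suggest the proposals are moving away from what the graph accepts.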