Creating the first morphological analyser for Spoken Tamil · Part 2 of 4

Where a language model goes wrong — and why you can't see it

Ask a capable model to break one Tamil word into its parts. Ask eight times. You'll get fluent, confident, Leipzig-formatted answers that don't agree with each other — and the majority is wrong. This is the production failure mode that matters.

In Part 1 we did morphological analysis by hand: breaking learners into learn‑er‑s, and the Spoken Tamil word படிச்சிட்டேன் into four welded pieces: study + a "having done it" participle + a "definitely finished" auxiliary + "I".

Part 1 ended on the catch: Spoken Tamil has no canonical algorithm to recover those pieces from the surface string, so you reach for a language model. This post is about what happens when you do — and why the result is more dangerous than a plain wrong answer.

If you ask a large language model to do something it has seen a million examples of, it is excellent. If you ask it to do something precise in a domain where the ground truth is small, specialised, and largely absent from its training data, it does something worse than fail — it produces a fluent, confident, plausible answer that is wrong in ways you cannot see without being an expert yourself. That is the failure mode that matters in production, and morphological analysis of Spoken Tamil walks straight into it.

It's worth being concrete about why the ground truth is missing, because Spoken Tamil is under-resourced in a way that compounds on itself. Start with the agglutination from Part 1: grammar is built by stacking morphemes onto a root — and not just tense but aspect, mood, person, number, modality, all welded on at once. Where English gives one root a handful of forms (run, ran, runs, running), a single Tamil root spins off thousands of distinct words, and — as every card here shows — the morpheme boundaries don't survive in the finished surface form. Now multiply that: Tamil is dialectally diverse, so each dialect has its own way of forming those thousands of variants. And none of it is standardised — no canonical dialect, no agreed spelling, often not even agreement within a dialect — so there is no single target to aim at, and what little data exists is noisy and inconsistent.

Then the kicker, the one that empties the training set outright: Spoken Tamil has diverged from Literary Tamil, the formal written register whose script has barely changed in a thousand years. Essentially all written Tamil — documents, news, books, the web — is Literary. Spoken Tamil is spoken; it is hardly ever written down, surfacing mostly in places like film subtitles. So a model trained on the world's text has seen vast amounts of a different register and almost none of this one. Tiny corpus, astronomically large form-space: nearly every spoken word the analyser meets is one the model has effectively never seen. There is no fact to recover — only a plausible fiction to generate.

I didn't have to invent a cautionary example. I gave a capable model the same one word, படிச்சிட்டேன், with the same prompt, eight times. Every answer was fluent, Leipzig-formatted, and sure of itself. They did not agree with each other. Collapsing the duplicates, here are the distinct analyses it returned:

| # | Segmentation it returned | Its Leipzig gloss | What it claims the middle piece is |
|---|---|---|---|
| A | படி‑ச்ச்‑இட்ட்‑ஏன் · paṭi‑cc‑iṭṭ‑ēṉ | read‑PST‑CMPL‑1SG | ‑ச்ச்‑ is the past tense; the completive is separate |
| B | படி‑ச்சு‑ட்ட்‑ஏன் · paḍi‑cc‑ṭṭ‑ēn | read‑PST‑COMPL‑1SG | same theory, but the segment boundary & romanisation moved |
| C | படி‑ச்சு‑(வி)ட்ட்‑ஏன் · paṭi‑ccu‑(v)iṭṭ‑ēṉ | read‑CVB‑COMPL.PST‑1SG | ‑ச்சு‑ is the converb (not tense); tense rides on the completive |

Look at what they agree on: the root படி "read", the suffix ஏன் "1SG", the whole-word translation, and the literary equivalent படித்துவிட்டேன். Every part a non-expert could spot-check is correct and consistent. Now look at where they split: is ‑ச்சு‑ the past tense, or the converb — the "having ___" participle — with tense carried by the completive auxiliary? That is not a detail. It is two different theories of what the word is. The model picks one or the other at random across runs, and never once says it is choosing.
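The agree/split pattern can be made mechanical. Here is a minimal sketch, not the tooling used for this series: it takes the three glosses from the table, retyped with ASCII hyphens, and groups the labels seen in each slot. The function name `slot_report` is made up for illustration, and the alignment is naive (it assumes every gloss has the same number of slots, which happens to hold here).

```python
# The three distinct Leipzig glosses from the table, keyed by row label.
# Retyped with ASCII hyphens so that str.split("-") works.
GLOSSES = {
    "A": "read-PST-CMPL-1SG",
    "B": "read-PST-COMPL-1SG",
    "C": "read-CVB-COMPL.PST-1SG",
}


def slot_report(glosses):
    """Return, for each slot position, the sorted set of labels seen there."""
    rows = {key: gloss.split("-") for key, gloss in glosses.items()}
    lengths = {len(parts) for parts in rows.values()}
    if len(lengths) != 1:
        # Naive alignment only: refuse to guess when slot counts differ.
        raise ValueError("glosses have different slot counts; cannot align")
    n_slots = lengths.pop()
    return [sorted({parts[i] for parts in rows.values()}) for i in range(n_slots)]


report = slot_report(GLOSSES)
for position, labels in enumerate(report):
    status = "agree" if len(labels) == 1 else "SPLIT"
    print(position, status, labels)
```

The first and last slots agree and the two middle slots split, which is the pattern described above. Note what the check cannot do: it shows where the readings diverge, not which one is right.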

It gets worse: the confident majority is the wrong one. Reading C — converb plus a completive that carries the past — is the defensible analysis; it is exactly how the literary form படித்துவிட்டேன் decomposes, and it is the reading the grounded system commits to (you'll watch it do so in Part 3). The two-to-one majority that calls ‑ச்சு‑ "past tense" is the more fluent and the more wrong. A reviewer trusting the majority ships the error.

That single experiment contains the three failure modes worth naming:

1. It invents or relabels a morpheme. Calling the converb "PST" isn't a typo — it's a plausible-looking category assigned to a slot that doesn't hold it. Fiction in a citation voice.

2. It picks an illegal allomorph. The morpheme is real but the form is one the language never uses in that position — a notation that looks like Tamil and is not. (You will watch this happen, exactly once, in the traced run in Part 3: a past-tense form written with a stray pulli.)

3. It lets the gloss drift. "study" where the established label is "read/study"; "cc" here, "ccu" there, ச்ச் vs ச்சு for the same claim. Harmless-looking, corrosive over time, and invisible in prose.
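Of the three, gloss drift is the one that is cheap to separate from real disagreement. The sketch below assumes a small alias table mapping label spellings onto one canonical form. The table `ALIASES`, and the choice of COMPL rather than CMPL as canonical, are assumptions for illustration, not an established convention of this project.

```python
# Hypothetical alias table: which spelling counts as canonical is an assumption.
ALIASES = {"CMPL": "COMPL"}


def normalise(gloss):
    """Rewrite every label in a hyphen-separated gloss to its canonical spelling.

    Labels fused with a dot (e.g. COMPL.PST) are normalised piece by piece.
    """
    slots = []
    for slot in gloss.split("-"):
        pieces = [ALIASES.get(piece, piece) for piece in slot.split(".")]
        slots.append(".".join(pieces))
    return "-".join(slots)


a = normalise("read-PST-CMPL-1SG")
b = normalise("read-PST-COMPL-1SG")
c = normalise("read-CVB-COMPL.PST-1SG")

print(a == b)  # drift only: identical once the labels are normalised
print(a == c)  # a real disagreement about the analysis survives normalisation
```

After normalisation, readings A and B collapse into one gloss, while C stays distinct. Spelling drift disappears under a lookup table; the converb-versus-past question does not.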

All of it is fluent, confident, plausible, and unverifiable by a non-expert — the failure mode that actually matters in production. Note the deeper tell: a model that was recovering a fact would be stable across runs. One that is generating a plausible analysis has more than one basin, and lands in a different one each time. The fix is therefore not a better prompt or more samples. It is to stop letting the model be the source of truth about what exists.
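The stability tell is easy to turn into a check. This is a rough sketch under stated assumptions: `analyse` stands for whatever function calls the model and returns a gloss string, and `make_stub_model` is a made-up stand-in that simply cycles through the three glosses from the table so the example runs without a model. The per-answer counts it produces are an artefact of the stub, not the distribution observed in the eight real runs.

```python
from collections import Counter
from itertools import cycle


def sample_analyses(analyse, word, runs=8):
    """Call `analyse` on the same word `runs` times and count each distinct answer."""
    return Counter(analyse(word) for _ in range(runs))


def make_stub_model():
    """Hypothetical stand-in for a model call: cycles through the table's glosses."""
    answers = cycle([
        "read-PST-CMPL-1SG",
        "read-PST-COMPL-1SG",
        "read-CVB-COMPL.PST-1SG",
    ])
    return lambda word: next(answers)


counts = sample_analyses(make_stub_model(), "படிச்சிட்டேன்")
stable = len(counts) == 1
print("distinct analyses:", len(counts), "stable:", stable)
```

A check like this only detects instability. It cannot say which basin is the correct one, and a majority vote over `counts` would repeat the mistake described earlier, which is why sampling more is not the fix.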

Next in the series

If the model can't be trusted to know what's real, give it something that can — a curated graph of every morpheme, construction and rule that's actually attested, and make the model check every claim against it before it's allowed to write anything down. Part 3: Teaching a model to admit it doesn't know — by forcing it to find evidence.