Creating the first morphological analyser for Spoken Tamil · Part 1 of 4

What is morphological analysis?

Words often contain grammatical structure that influences their function in a sentence, and understanding that structure can be significantly harder than it seems. We explore this idea in Tamil, using the Tamil root, படி "study".

Every word you read, you pull apart without noticing. Morphological analysis is the process of breaking a word into its smallest meaningful pieces (its morphemes) and naming what each one does. In your own language you manage it effortlessly. A computer faced with a low-resource, unstandardised, linguistically diverse language that is almost never written (looking at you, spoken Tamil) has, shall we say, some difficulties. Tamil is strongly diglossic: its literary register has been written and codified for well over a millennium, and it differs substantially from the colloquial Tamil that people actually speak, which is only rarely written down. The grammar, the vocabulary and even the pronunciation diverge. Where this series refers to spoken Tamil, it means the everyday colloquial register of Tamil Nadu. This post is the first part of the story, exploring what the analysis actually is. Everything is built on one spoken Tamil verb root, படி "study", and you need no Tamil to follow: you can hear and unpack every word as you read.

Try it — this page is the app

Every Tamil word below is live. Tap one to hear it pronounced (with a karaoke-style highlight) and to open its card — segmentation, gloss, description and the build from root to word. Tap a single morpheme for its own quick info. Two cards are embedded directly in the next section so the format is clear before you start clicking.


Tamil is a well-known example of a highly agglutinative language: it builds grammatical meaning by suffixing morphemes (the smallest fragments of a word that carry meaning) onto a root. English does this too, but to a much smaller degree. Take the English word learners, which carries three morphemes: the root learn, the derivational suffix -er that turns the verb into a noun, and the plural inflection -s. Splitting a word into its morphemes is segmentation; labelling what each one does is glossing; chaining them from the root outward is the build. Tamil does the same, only it stacks far more. Below, side by side, are learners and the spoken-Tamil word for "those who are studying", on the same படி "study" root we'll use throughout, each shown as one of the app's cards:

Both cards work the same way. Tap the word to hear it, tap any single piece for its own quick note, and the build shows how each grows from its root outward. English stacks three morphemes; spoken Tamil stacks four: the present stem, an adjectival participle (a non-finite form that lets the verb behave like an adjective or adverb rather than the main verb of the clause) that turns the verb into "(the one) who is studying", and an animate-plural pronoun head folded onto the end (animate covers people and animals, which Tamil treats differently from inanimate things). Tamil goes much further than this, to the point of folding an entire clause into a single word. Here is "if one has to go and study": a purpose infinitive (the untensed verb form that marks the aim of an action), a motion verb, a necessity modal ("has to"), and a conditional (the verb form that marks an "if" clause), all welded together:

Five morphemes, and not one survives as a clean slice. The necessity modal even deletes its own first syllable (வேணும் → ணும்), and the conditional rides on it because the modal has no past stem of its own. This is the kind of complexity the analyser has to handle. The single word the rest of the series follows, the one we feed to the system and watch it analyse, is "I've studied it, and it's done":

One word, four morphemes: the verb படி "study", a participle, the completive auxiliary விடு "let go" (which loses its v‑ in the spoken language and absorbs the past tense), and "I". The completive is an aspect that marks an action as definitely completed; in spoken Tamil it rides on an auxiliary verb fused onto the main one. The speaker's identity, the tense, and a sense of decisive finality are all welded on. The card is open about the awkward part: those morphemes do not survive as separate chunks, because a vowel elides (is dropped) at every join, blurring where one piece ends and the next begins, so you cannot recover them by cutting the string. A learner who taps that word needs every morpheme named, and the system, as we will see, has to name them without ever inventing one.

The fundamental reason this is so challenging in Tamil is that there is no canonical algorithm for unpacking a spoken surface form (the word as it appears after sandhi, rather than its clean underlying morphemes) back into these pieces. Literary Tamil has formal rules and mature analysers; the spoken register has neither. The completive auxiliary விடு never appears as a clean suffix: in speech it loses its v‑ (விடு → இட்டு) and its vowels elide at every join, which is why the cards above show morphemes that do not cleanly cut out of the string. You cannot recover the pieces with string rules, and there is no lookup table to fall back on. So for a language like this, segmenting even a single word remains an open problem.
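To make the "cutting the string" problem concrete, here is a minimal sketch in Python. It takes the four underlying morphemes named in this post for படிச்சிட்டேன் and checks two naive string rules: gluing the pieces together, and searching for each piece inside the surface word. The function names and the data layout are assumptions made for illustration only; they are not the analyser this series builds.

```python
# Illustrative sketch only: names and layout are hypothetical, not the series' analyser.

SURFACE = "படிச்சிட்டேன்"  # "I've studied it, and it's done"

# Underlying morphemes as named in the post, each paired with a gloss label.
UNDERLYING = [
    ("படி", "study"),   # verb root
    ("ச்சு", "AVP"),     # participle
    ("இட்டு", "COMPL"),  # spoken form of the auxiliary விடு after it loses its v-
    ("ஏன்", "1SG"),      # "I"
]


def naive_concat(morphemes):
    """String rule 1: glue the underlying forms together with no sound changes."""
    return "".join(form for form, _ in morphemes)


def found_as_substring(surface, morphemes):
    """String rule 2: report, per gloss, whether the form occurs intact in the surface word."""
    return {gloss: form in surface for form, gloss in morphemes}


if __name__ == "__main__":
    print("concatenation matches surface:", naive_concat(UNDERLYING) == SURFACE)
    for gloss, found in found_as_substring(SURFACE, UNDERLYING).items():
        print(f"{gloss}: {'found' if found else 'not found'}")
```

Run as written, the glued pieces do not equal the surface word, and only the root படி is found intact inside it. The participle, the completive and the person marker have each been altered at a join, so no substring search can locate them.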

Terminology

Morpheme
Smallest meaningful unit. English: learn, -er, -s. Tamil: ‑ச்சு (AVP), ‑ஏன் ("I").
Root
The lexical core a word is built on. Here: படி "study".
Gloss
A short standardised label for a morpheme's job. PAST, 1SG (1st person singular), COMPL (completive).
Agglutinative
A language that builds words by stacking morphemes, each carrying one clean piece of grammar.
AVP
Adverbial / verbal participle — a non-finite "having ___ed" form. படிச்சு = "having studied". Tamil uses it to chain actions.
Aspect
Not when an action happened (that's tense) but how it's viewed — completed, ongoing, habitual. COMPL = viewed as definitely completed.
Sandhi
Sound changes at morpheme boundaries. படி+ச்சு doesn't stay neatly separable — the join mutates. The system tracks an underlying form (the clean morphemes) and a surface form (what's actually pronounced).
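The terms above can be pictured as a small data structure. The sketch below, in Python, stores a word's surface form alongside its underlying morphemes, then derives the segmentation, the gloss and the build from them. The class and field names (Morpheme, Analysis) and the English labels NMLZ and PL are assumptions for illustration; the post does not specify how the real system represents an analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str   # underlying (clean) form of the piece
    gloss: str  # short label for the piece's job


@dataclass(frozen=True)
class Analysis:
    surface: str                      # the word as actually pronounced or written
    morphemes: tuple[Morpheme, ...]   # underlying pieces, root first

    def segmentation(self) -> str:
        """The underlying pieces, separated by '+'."""
        return "+".join(m.form for m in self.morphemes)

    def gloss(self) -> str:
        """One label per piece, in the same order as the segmentation."""
        return "-".join(m.gloss for m in self.morphemes)

    def build(self) -> list[str]:
        """Chain the underlying forms from the root outward, one step per piece."""
        steps, so_far = [], ""
        for m in self.morphemes:
            so_far += m.form
            steps.append(so_far)
        return steps


# English example from the post; NMLZ and PL are labels assumed for this sketch.
learners = Analysis(
    "learners",
    (Morpheme("learn", "learn"), Morpheme("er", "NMLZ"), Morpheme("s", "PL")),
)

# Spoken Tamil example from the post, using the glosses listed above.
studied = Analysis(
    "படிச்சிட்டேன்",
    (
        Morpheme("படி", "study"),
        Morpheme("ச்சு", "AVP"),
        Morpheme("இட்டு", "COMPL"),
        Morpheme("ஏன்", "1SG"),
    ),
)

if __name__ == "__main__":
    for word in (learners, studied):
        print(word.surface, "|", word.segmentation(), "|", word.gloss())
        print("  build:", " > ".join(word.build()))
```

For learners the last build step is the word itself. For the Tamil word the build only chains underlying forms, so its last step differs from the surface form; that gap is what sandhi accounts for, and it is why the surface form is stored separately.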

Next in the series

So you reach for the one tool that is good with unfamiliar strings: a large language model. Hand it படிச்சிட்டேன் and ask it to break the word down, and it answers fluently, confidently, and differently every time you ask. That inconsistency is the problem. Part 2: Where a language model goes wrong.