2025 – Present · AI Engineer | Tech Founder · Neuro-Symbolic Agentic AI
Spoken Tamil Morphology Analyser
The first working computational morphological analyser for spoken Tamil. Spoken Tamil is highly agglutinative and non-standardised across dialects, with no canonical rules for unpacking surface forms, so direct LLM generation is unreliable. The analyser runs a ReAct loop: the agent proposes a segmentation hypothesis, validates it against a ~1,000-node Neo4j knowledge graph via a custom MCP server, observes the structured feedback, and refines. It iterates until the segmentation grounds to verified graph entities, which reduces hallucination by constraining outputs to the graph.
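A minimal sketch of the propose, validate, refine loop described above. Everything named here is hypothetical: `KNOWN_MORPHEMES` is a toy set standing in for the Neo4j graph, `scripted_proposer` stands in for the LLM agent, and the segments are placeholder strings rather than real Tamil morphemes. The real system validates through MCP tool calls.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the knowledge graph: a set of verified entities.
KNOWN_MORPHEMES = {"ab", "c", "de"}


def validate(segmentation: list[str]) -> list[str]:
    """Return the segments that do NOT ground to a known graph entity."""
    return [seg for seg in segmentation if seg not in KNOWN_MORPHEMES]


def analyse(
    word: str,
    propose: Callable[[str, list[str]], list[str]],
    max_steps: int = 5,
) -> Optional[list[str]]:
    """Iterate until every segment grounds, or give up after max_steps."""
    feedback: list[str] = []
    for _ in range(max_steps):
        hypothesis = propose(word, feedback)   # act: propose a segmentation
        feedback = validate(hypothesis)        # observe: ungrounded segments
        if not feedback:
            return hypothesis                  # fully grounded, accept
    return None                                # never grounded, return nothing


def scripted_proposer(candidates: list[list[str]]):
    """Toy proposer: skips any candidate containing a rejected segment."""
    rejected: set[str] = set()

    def propose(word: str, feedback: list[str]) -> list[str]:
        rejected.update(feedback)
        for cand in candidates:
            if not rejected.intersection(cand):
                return cand
        return [word]  # nothing left to try: fall back to the whole word

    return propose
```

Returning `None` instead of the last ungrounded hypothesis is the design choice that matters in this sketch: an answer is only emitted when every segment was verified against the graph.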
The MCP server exposes 12 Zod-validated typed tools that return structured prose with embedded scores and IDs, roughly 3–5× cheaper per call than the JSON equivalent. The analyser runs as a multi-agent system with agent-to-agent (A2A) delegation across three non-overlapping roles (analyst, validator, and grammar-reference sub-agents), where the validator fans out to N parallel analysts. Every tool call is written to a structured trajectory stream with automated graph-health invariant checks, giving per-run observability without an external eval framework.
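A rough sketch of what a trajectory stream with invariant checks could look like. The `Trajectory` class, the record fields, and the single "no dangling edges" invariant are assumptions made for illustration; the actual record schema and graph-health checks are not described here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical per-run log: one record per tool call."""
    records: list = field(default_factory=list)

    def call(self, tool_name, tool, **args):
        """Run a tool and append a structured record of the call."""
        result = tool(**args)
        self.records.append({"tool": tool_name, "args": args, "result": result})
        return result

    def to_jsonl(self) -> str:
        """Serialise the run as JSON Lines, one tool call per line."""
        return "\n".join(json.dumps(r, sort_keys=True) for r in self.records)


def check_invariants(nodes: set, edges: list) -> list:
    """Example invariant (assumed): every edge endpoint must be an existing node."""
    violations = []
    for src, dst in edges:
        for end in (src, dst):
            if end not in nodes:
                violations.append(f"edge {src}->{dst} references missing node {end}")
    return violations


# Usage: wrap a tool call, then check graph health after the run.
traj = Trajectory()
traj.call("lookup", lambda morpheme: morpheme.upper(), morpheme="ab")
problems = check_invariants({"a", "b"}, [("a", "b"), ("a", "z")])
```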
Stack:
Python · TypeScript · Custom MCP server · Neo4j · LLM agent orchestration
I wrote a series of posts on this.
2025 – Present · Tech Founder · Full-Stack EdTech
Kathaivazhi: Tamil Language Learning Platform
EdTech platform for Tamil diaspora learners. Features click-to-analyse morphological breakdowns of words (via the analyser above), spaced repetition, and learner progression tracking. Currently pre-launch.
Stack:
React · Express · PostgreSQL · Neo4j
2024 – 2025 · Clinical NLP at RMIT / UoM
RAG Pipeline for Clinical Entity Linking
Mapped free-text clinical mentions to ontology codes (DrugBank, RxNorm) by embedding ontology entities, retrieving candidates via similarity search, and adjudicating with an LLM. Direct LLM generation hallucinated codes about half the time, which is unacceptably dangerous in clinical settings. The retrieval-plus-adjudication pattern outperformed string matching, ClinicalBERT embeddings, and direct generation on F1, because constraining the LLM to choose between real candidates prevents it from inventing codes that don't exist.
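A minimal sketch of the retrieval-plus-adjudication pattern, under stated assumptions: the two-dimensional vectors and `CODE-*` identifiers are made up, `adjudicate` is a plain callable standing in for the LLM, and brute-force cosine similarity stands in for the vector search. The point it illustrates is the final guard, which rejects any answer that is not one of the retrieved candidates.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, index, k=2):
    """Return the k codes whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda code: cosine(query_vec, index[code]), reverse=True)
    return ranked[:k]


def link(query_vec, index, adjudicate, k=2):
    """Let the adjudicator choose, but only accept a retrieved candidate."""
    candidates = retrieve(query_vec, index, k)
    choice = adjudicate(candidates)
    return choice if choice in candidates else None  # invented codes are dropped


# Toy index of hypothetical ontology codes and embeddings.
INDEX = {"CODE-A": [1.0, 0.0], "CODE-B": [0.9, 0.1], "CODE-C": [0.0, 1.0]}
```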
Added an LLM-driven query reformulation loop (e.g., brand → generic drug name) that measurably improved hit rates on hard cases. Built inside a regulated clinical data environment with governance and privacy constraints; presented to clinical stakeholders at MCBK 2024.
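The reformulation loop might be sketched as follows. The `search` and `reformulate` callables, the placeholder names `BrandX` and `genericx`, and the round limit are illustrative assumptions; in the real pipeline the reformulation step is LLM-driven rather than a lookup table.

```python
def search_with_reformulation(mention, search, reformulate, max_rounds=3):
    """Retry a failed lookup with a rewritten query, tracking what was tried."""
    query = mention
    tried = []
    for _ in range(max_rounds):
        tried.append(query)
        hits = search(query)
        if hits:
            return hits, tried
        new_query = reformulate(query)
        if new_query is None or new_query in tried:
            break  # nothing new to try, stop rather than loop
        query = new_query
    return [], tried


# Hypothetical stand-ins for the candidate index and the LLM reformulator.
TOY_INDEX = {"genericx": ["CODE-A"]}
BRAND_TO_GENERIC = {"brandx": "genericx"}


def toy_search(query):
    return TOY_INDEX.get(query.lower(), [])


def toy_reformulate(query):
    return BRAND_TO_GENERIC.get(query.lower())
```

Recording the `tried` list alongside the hits keeps each reformulation step auditable, which fits the governance constraints mentioned above.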
Stack:
Python · ClinicalBERT · Vector search · LLM adjudication · Slurm HPC
Watch a conference talk I presented on this.