Projects

2025 – Present · AI Engineer | Tech Founder · Neuro-Symbolic Agentic AI

Spoken Tamil Morphology Analyser

The first working computational morphological analyser for spoken Tamil. Spoken Tamil is highly agglutinative and non-standardised across dialects, with no canonical rules for unpacking surface forms, so direct LLM generation is unreliable. The analyser runs a ReAct loop: the agent proposes a segmentation hypothesis, validates it against a ~1,000-node Neo4j knowledge graph via a custom MCP server, observes the structured feedback, and refines. It iterates until the segmentation grounds to verified graph entities; constraining outputs to the graph reduces hallucination.
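
The propose, validate, refine loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: an in-memory set stands in for the Neo4j graph and MCP server, a scripted function stands in for the LLM agent, and every name and morpheme string here (`VERIFIED_MORPHEMES`, `validate`, `analyse`, `"ab"`, `"c"`, `"de"`) is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Placeholder stand-in for the knowledge graph: the set of verified morphemes.
# These strings are abstract tokens, not real Tamil morphology.
VERIFIED_MORPHEMES = {"ab", "c", "de"}


@dataclass
class Feedback:
    grounded: bool       # True when every segment matched a graph entity
    unknown: list[str]   # segments that failed to ground


def validate(segmentation: list[str]) -> Feedback:
    """Check each proposed segment against the (toy) graph."""
    unknown = [m for m in segmentation if m not in VERIFIED_MORPHEMES]
    return Feedback(grounded=not unknown, unknown=unknown)


Proposer = Callable[[str, Optional[Feedback]], list[str]]


def analyse(word: str, propose: Proposer, max_steps: int = 5) -> Optional[list[str]]:
    """Iterate propose -> validate -> observe until the hypothesis grounds."""
    feedback: Optional[Feedback] = None
    for _ in range(max_steps):
        hypothesis = propose(word, feedback)   # act: propose a segmentation
        feedback = validate(hypothesis)        # observe: structured feedback
        if feedback.grounded:
            return hypothesis                  # only grounded output is returned
    return None                                # refuse rather than guess


def scripted_proposer(word: str, feedback: Optional[Feedback]) -> list[str]:
    """Stand-in for the LLM: a coarse first guess, refined after feedback."""
    if feedback is None:
        return ["abc", "de"]
    return ["ab", "c", "de"]


result = analyse("abcde", scripted_proposer)
```

Returning `None` when nothing grounds within the step budget is one way to express the constraint that outputs come only from the graph; the actual stopping policy may differ.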

The MCP server exposes typed tools that return structured prose with embedded scores and IDs. This format is roughly 3–5× cheaper per call than the JSON equivalent and keeps the model reasoning over the evidence. Every tool call is written to a structured trajectory stream (latency, outcome, success), alongside automated graph-health invariant checks, giving per-run observability without an external eval framework. The graph itself was built via multi-agent adjudication over OCR'd reference material, with deduplication and edge-case handling across agents.
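
As a rough sketch of the trajectory stream, a decorator can record latency, outcome, and success for each tool call. The record fields follow the description above; the decorator name `traced`, the list-based sink, and the toy `lookup` tool are assumptions for illustration, and the real stream format is not shown here.

```python
import functools
import json
import time
from typing import Any, Callable


def traced(tool_name: str, sink: list[str]) -> Callable:
    """Wrap a tool so every call appends one JSON record to `sink`."""
    def wrap(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def inner(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            record = {"tool": tool_name, "success": False, "outcome": None}
            try:
                result = fn(*args, **kwargs)
                record["success"] = True
                record["outcome"] = "ok"
                return result
            except Exception as exc:
                # Record the failure type, then let the error propagate.
                record["outcome"] = type(exc).__name__
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                record["latency_ms"] = round(elapsed_ms, 3)
                sink.append(json.dumps(record))
        return inner
    return wrap


trajectory: list[str] = []


@traced("lookup_morpheme", trajectory)
def lookup(key: str) -> int:
    """Toy tool: a dictionary lookup standing in for a graph query."""
    return {"ab": 1}[key]


lookup("ab")
```

Writing the record in `finally` means failed calls are logged as well as successful ones, which is what makes per-run success rates computable from the stream alone.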

Stack: Python · TypeScript · Custom MCP server · Neo4j · LLM agent orchestration

2024 – 2025 · Clinical NLP at RMIT / UoM

RAG Pipeline for Clinical Entity Linking

Mapped free-text clinical mentions to ontology codes (DrugBank, RxNorm) by embedding ontology entities, retrieving candidates via similarity search, and adjudicating with an LLM. Direct LLM generation hallucinated codes about half the time, which is unacceptably dangerous in clinical settings. The retrieval-plus-adjudication pattern outperformed string matching, ClinicalBERT embeddings, and direct generation on F1, because constraining the LLM to choose among real candidates prevents it from inventing codes that don't exist.
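
The retrieval-plus-adjudication pattern can be illustrated with a small self-contained sketch. Everything here is a stand-in: character-trigram counts replace learned embeddings, a plain callable replaces the LLM adjudicator, and the `CODE-00x` identifiers are placeholders rather than real DrugBank or RxNorm codes.

```python
import math
from collections import Counter
from typing import Callable, Optional

# Placeholder ontology: code -> preferred name (codes are not real identifiers).
ONTOLOGY = {
    "CODE-001": "paracetamol",
    "CODE-002": "ibuprofen",
    "CODE-003": "amoxicillin",
}


def embed(text: str) -> Counter:
    """Toy embedding: counts of padded character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(mention: str, k: int = 2) -> list[str]:
    """Return the k ontology codes whose names are most similar to the mention."""
    query = embed(mention)
    ranked = sorted(
        ONTOLOGY, key=lambda code: cosine(query, embed(ONTOLOGY[code])), reverse=True
    )
    return ranked[:k]


Adjudicator = Callable[[str, list[str]], str]


def link(mention: str, adjudicate: Adjudicator, k: int = 2) -> Optional[str]:
    """Let the adjudicator choose, but accept only codes that were retrieved."""
    candidates = retrieve(mention, k)
    choice = adjudicate(mention, candidates)
    return choice if choice in candidates else None


linked = link("paracetamol 500mg", lambda mention, candidates: candidates[0])
```

The membership check in `link` is the point of the pattern: an adjudicator that answers with a code outside the candidate list is rejected instead of trusted.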

Added an LLM-driven query reformulation loop (e.g., brand → generic drug name) that measurably improved hit rates on hard cases. Built inside a regulated clinical data environment with governance and privacy constraints; presented to clinical stakeholders at MCBK 2024.
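
A minimal sketch of such a reformulation loop, under stated assumptions: an exact-match dictionary stands in for the vector index, a lookup table stands in for the LLM reformulator, and `CODE-001` is a placeholder code. The function names (`search`, `link_with_reformulation`, `brand_to_generic`) are hypothetical.

```python
from typing import Callable, Optional

# Toy index keyed by generic drug name; the code is a placeholder.
INDEX = {"paracetamol": "CODE-001"}


def search(query: str) -> Optional[str]:
    """Stand-in for candidate retrieval: exact match on the lowercased query."""
    return INDEX.get(query.lower())


Reformulator = Callable[[str, list[str]], Optional[str]]


def link_with_reformulation(
    mention: str, reformulate: Reformulator, max_rounds: int = 3
) -> tuple[Optional[str], list[str]]:
    """Retry retrieval with reformulated queries until a hit or the budget ends."""
    query = mention
    tried: list[str] = []
    for _ in range(max_rounds):
        tried.append(query)
        hit = search(query)
        if hit is not None:
            return hit, tried
        next_query = reformulate(mention, tried)
        if next_query is None or next_query in tried:
            break  # nothing new to try
        query = next_query
    return None, tried


def brand_to_generic(mention: str, tried: list[str]) -> Optional[str]:
    """Stand-in for the LLM: maps a brand name to its generic name."""
    return {"panadol": "paracetamol"}.get(mention.lower())


code, queries = link_with_reformulation("Panadol", brand_to_generic)
```

Keeping the list of tried queries both bounds the loop and leaves an audit trail of how a hard case was resolved.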

Stack: Python · ClinicalBERT · Vector search · LLM adjudication · Slurm HPC

2025 – Present · Tech Founder · Full-Stack EdTech

Kathaivazhi: Tamil Language Learning Platform

EdTech platform for Tamil diaspora learners. Features click-to-analyse words with morphological breakdowns (via the analyser above), spaced repetition, and learner progression tracking. Pre-launch.

Stack: React · Express · PostgreSQL · Neo4j

Apps & Talks