June 17, 2026
Riyaz with a machine
I wanted an AI practice companion for surbahar and sitar. The model kept inventing things about raags it barely knew. What followed was a Wikipedia corpus of 1,385 articles, some careful retrieval engineering, and a knowledge base nobody had built before.
tl;dr: I built a local AI practice companion for Hindustani classical music, grounded in a retrieval database of 1,385 Wikipedia articles about raags, talas, instruments, and gharanas, so it would stop hallucinating raag grammar. The system runs entirely on my own hardware, knows the Dagarvani bani and the Etawah gharana, and cites its sources. As far as I could find in the literature, nobody had done this before.
Table of contents
- The practice context
- The model that made things up
- Why the music has always been hard to digitise
- What I built
- Getting retrieval right
- Two models, two voices
- Before and after
- Nobody had done this
- What it isn’t
- Five in the morning
The practice context
I play surbahar and sitar. The surbahar is the bass version of the sitar — larger instrument, lower register, much slower music, designed for the kind of unhurried alap that can take an hour to reach the middle octave. I practise in the Dagarvani bani, the tradition associated with the Dagar family, which treats the unmeasured opening of a raag as the heart of the music rather than a preamble to it. On the khayal side — the more common Hindustani vocal form adapted for instrumental playing — I practise in the Etawah gharana, also called the Imdadkhani gharana, descended from Imdad Khan and his son Inayat Khan — the lineage that runs through Vilayat Khan and Imrat Khan.
This matters for what follows. The instrument, the bani, and the gharana are not interchangeable details. They are the grammar of how I approach a raag. The same raag treated in the Dagarvani way has a different character and different priorities than the same raag treated in the Kirana or Agra way. When I practise Bhimpalasi on the surbahar, I’m not practising a scale with some ornamentation guidelines. I’m exploring the personality of that raag through the particular lens of a tradition that has been orally transmitted for several generations.
A practice companion needs to understand this to be useful. A tool that just knows Bhimpalasi belongs to the Kafi thaat and has komal Ni is a dictionary, not a guide.
The model that made things up
Earlier this year I began experimenting with using a locally-running language model as a practice journal. The idea was simple enough. After a session, I wanted somewhere to describe what I’d worked on, what felt right, what slipped — and have something intelligent to think alongside. Something that would remember previous sessions and ask useful questions. Not a teacher. More like a reflective surface.
The model I was running was qwen3.6:27b — twenty-seven billion parameters, quantised to four bits, running entirely on my own hardware. It is capable of remarkable things. But the first serious test — asking it to walk me through the grammar of Raag Bhimpalasi, the vadi and samvadi swaras, the characteristic phrase structure, the treatment of komal Ni in the Dagarvani approach — produced something disturbing.
The response was confident and mostly wrong. The aroha it gave me belonged to a different raag. The vadi swar was misidentified. The ornamentation guidance was plausible-sounding but would have caused a real musician to wince. I asked about Marwa — one of the more demanding raags in the late afternoon repertoire, with its sharpened Ma and flat Re, the careful avoidance of Pa, the way the raag builds tension through what it withholds — and the response was worse. Darbari Kanada, the great late-night raag of the Mian Tansen tradition, came back garbled in ways that were internally consistent but factually wrong.
In music practice, wrong information isn’t an inconvenience. It can actively degrade what you’re trying to build. A model that confidently tells you the vadi is Sa when it is Ma is worse than no model at all. I didn’t abandon the experiment. I wanted to fix it.
Why the music has always been hard to digitise
The short answer: most of the music has never been digitised at all.
Language models absorb what’s abundant in their training data, and Western music theory is everywhere — textbooks, forums, Wikipedia, software documentation, YouTube transcripts, millions of blog posts. Hindustani classical music has not been transmitted that way. Its primary mechanism has always been the guru-shishya relationship: a student learning from a teacher, in person, over years or decades. The technical vocabulary exists in texts — Vishnu Narayan Bhatkhande’s Kramik Pustak Malika, the Sangit Ratnakar, numerous Urdu and Hindi treatises — but these are largely untranslated, non-digitised, outside the training corpus. What a model learns about Indian classical music is a thin impression. The broad shape of the tradition, some famous names, a few raags by title. The grammar of those raags — which is what actually matters for practice — is where the knowledge gets sparse and the model starts generating plausible-sounding confabulation.
A paper published in June 2026 did the first systematic evaluation of language models on Hindustani classical theory. Thirty-three models, 504 questions covering raag grammar, thaat classification, tala systems, compositional reasoning. The large cloud-based frontier models scored eighty-five to ninety percent. Smaller locally-runnable models scored considerably worse. The paper described the failure precisely and offered no fix. I found it while building the thing described here, which felt both vindicating and somewhat lonely.
What I built
The approach is called retrieval-augmented generation. Before the model answers a question, it searches a library of documents, retrieves the closest passages, and reasons from those rather than from training alone. If the library has accurate information about a raag’s grammar, the model cites and reasons from that. If the library doesn’t have something, it usually says so — which is better than inventing it.
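The last step of that loop, turning retrieved passages into a prompt the model can cite from, can be sketched as follows. This is a minimal illustration and not Open WebUI's actual template: the function name `build_prompt`, the instruction wording, and the `(title, text)` passage shape are all assumptions made for the example.

```python
def build_prompt(question, passages):
    """Assemble a grounded prompt: numbered sources first, then the question.

    `passages` is a list of (title, text) pairs. The shape and the wording
    of the instruction are hypothetical, chosen only for illustration.
    """
    if passages:
        sources = "\n".join(
            f"[{i}] {title}: {text}"
            for i, (title, text) in enumerate(passages, start=1)
        )
    else:
        sources = "(no passages retrieved)"
    return (
        "Answer using only the numbered sources below and cite them as [n]. "
        "If the sources do not cover the question, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "What is the vadi of Bhimpalasi?",
    [("Bhimpalasi", "The vadi is Madhyam (Ma).")],
)
```

The explicit instruction to admit gaps is what makes a missing article show up as "I don't have a source for that" rather than as invented grammar, although a model can still ignore it.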
The critical question was what to put in the library. I was sceptical of Wikipedia at first. But the major Hindustani raags have Wikipedia articles, and many are surprisingly thorough: vadi and samvadi swaras, aroha and avaroha, pakad phrases, thaat classification, time of day, seasonal associations, characteristic ornamentation, relationships to other raags. The coverage is uneven — some articles are detailed, others thin stubs, and a handful of raags I work in have no article at all. But for the core repertoire, the information is there and largely accurate. Beyond the raags themselves, Wikipedia covers the thaats, the melakarta system in Carnatic music, talas, instruments, gharanas, individual musicians. Building a library from this means the model has context not just for individual raags but for the web of relationships between them.
So I wrote a script to walk Wikipedia’s category system. It starts from root categories such as Ragas, Ragas of Hindustani music, Melakarta ragas, Indian musical instruments, Gharanas, Hindustani singers, Carnatic music composers, and twelve others, descends two levels into each category tree, and fetches the plain-text extract for each article through Wikipedia’s API. Four parallel workers, politely rate-limited. The final corpus: 1,385 articles, roughly 650,000 words of cleaned text.
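The shape of that walk can be sketched without any networking. The parameter dictionaries below use the MediaWiki API's `categorymembers` list and the TextExtracts `extracts` property. The `crawl` function, its `list_members` callback, and the reading of "two levels deep" as two levels below each root are assumptions for illustration; the real script's HTTP calls, result paging, worker pool, and rate limiting are left out.

```python
from collections import deque

API_URL = "https://en.wikipedia.org/w/api.php"


def category_params(category):
    """Query parameters for listing the pages and subcategories of a category."""
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page|subcat",
        "cmlimit": "500",
        "format": "json",
    }


def extract_params(title):
    """Query parameters for fetching one article as plain text."""
    return {
        "action": "query",
        "prop": "extracts",
        "explaintext": "1",
        "titles": title,
        "format": "json",
    }


def crawl(roots, list_members, max_depth=2):
    """Breadth-first walk over category trees.

    `list_members(category)` is a hypothetical callback that yields
    (title, is_subcategory) pairs; in a real script it would call the API
    with `category_params`. Roots are depth 0, so max_depth=2 visits the
    roots plus two levels of subcategories beneath them.
    """
    seen_categories = set(roots)
    articles = set()
    queue = deque((root, 0) for root in roots)
    while queue:
        category, depth = queue.popleft()
        for title, is_subcategory in list_members(category):
            if not is_subcategory:
                articles.add(title)
            elif depth < max_depth and title not in seen_categories:
                seen_categories.add(title)
                queue.append((title, depth + 1))
    return articles
```

Collecting titles into a set matters here because the same raag article is often reachable from several categories.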
The whole thing runs on a desktop I built for this purpose — Ryzen 7 9850X3D, 32GB of DDR5, an RX 9070 XT with 16GB of VRAM. Ollama handles model serving on the GPU; Open WebUI provides the chat interface. There are two language models in use: qwen3.6:27b (27.8B parameters, Q4_K_M quantisation, ~15.6GB on GPU) for sessions where I want the model to think carefully; and qwen3:14b (14.8B parameters, ~8.6GB on GPU) for faster everyday questions. The smaller model leaves enough VRAM headroom for a 32,768-token context window; the larger one is managed to 24,576 tokens using quantised KV cache to fit alongside the model weights. The embedding model — BAAI/bge-m3, which handles IAST romanisation and Devanagari script alongside English — runs on CPU, hard-separated from the GPU so neither language model loses VRAM to it. CPU inference for a single query embedding takes about two seconds, which is perfectly acceptable before a generation that takes thirty.
Getting retrieval right
The retrieval uses hybrid search: dense vector similarity combined with BM25 keyword matching. For Indian classical music specifically, this matters — the word vadi means a particular thing, samvadi means something related but distinct, these are not synonyms and cannot be swapped. Vector similarity finds semantically related passages; keyword matching finds passages containing the exact terms. Both are necessary.
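A toy version shows why both signals are needed. The sketch below ranks three one-line passages by BM25 and by cosine similarity, then merges the two rankings with reciprocal rank fusion. The fusion method is an assumption chosen for clarity (Open WebUI's own weighting may differ), and the two-dimensional vectors are hand-written stand-ins for real bge-m3 embeddings.

```python
import math
from collections import Counter


def keyword_ranking(query_terms, docs_tokens, k1=1.5, b=0.75):
    """Rank documents by BM25; documents with no exact term match are left out."""
    n = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n
    doc_freq = Counter(t for d in docs_tokens for t in set(d))
    scores = {}
    for i, doc in enumerate(docs_tokens):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[i] = score
    return sorted(scores, key=scores.get, reverse=True)


def dense_ranking(query_vec, doc_vecs):
    """Rank all documents by cosine similarity to the query vector."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    sims = [cosine(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: sims[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Merge rankings: each list contributes 1 / (k + position) per document."""
    fused = Counter()
    for ranking in rankings:
        for position, doc in enumerate(ranking, start=1):
            fused[doc] += 1.0 / (k + position)
    return sorted(fused, key=fused.get, reverse=True)


docs = [
    "vadi is the most prominent note of a raag",
    "samvadi is the second most prominent note of a raag",
    "the surbahar is a bass sitar",
]
docs_tokens = [d.split() for d in docs]
doc_vecs = [[0.8, 0.6], [0.9, 0.1], [0.0, 1.0]]  # toy vectors, not real embeddings
query_vec = [1.0, 0.0]

by_keyword = keyword_ranking(["vadi"], docs_tokens)   # [0]
by_vector = dense_ranking(query_vec, doc_vecs)        # [1, 0, 2]
fused = reciprocal_rank_fusion([by_keyword, by_vector])  # [0, 1, 2]
```

In this contrived setup the vector ranking prefers the samvadi passage, and only the exact keyword match lifts the vadi passage back to the top of the fused list.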
Open WebUI generates three sub-queries from each user message, runs hybrid search for each, merges the results, and passes the top five passages as context. The model reasons from those passages and cites them with numbered references.
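The merge step can be sketched as below. Keeping the best score a chunk earns across the sub-queries is one plausible merging rule and is an assumption here, as are the function name `merge_top_k` and the `(chunk_id, score)` result shape; Open WebUI's internal logic may differ.

```python
def merge_top_k(results_per_query, k=5):
    """Merge hybrid-search results from several sub-queries.

    `results_per_query` is a list of result lists, each holding
    (chunk_id, score) pairs where a higher score is better. A chunk found
    by more than one sub-query keeps its best score and appears once.
    """
    best = {}
    for results in results_per_query:
        for chunk_id, score in results:
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]


top = merge_top_k([
    [("a", 0.9), ("b", 0.5)],
    [("b", 0.7), ("c", 0.6), ("d", 0.2)],
    [("e", 0.4), ("f", 0.3), ("a", 0.1)],
])
# top == ["a", "b", "c", "e", "f"]
```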
During testing I found passages appearing twice in the retrieval results — identical content with different database identifiers, exact duplicates flooding the context window. This had happened because the process of adding passages to the database had run in two phases — an initial phase that partially populated the main collection, and a consolidation phase that copied from staging collections without checking for duplicates. Scanning for passages with identical (source file, character offset) pairs and deleting the redundant ones removed 5,482 duplicates, reducing the collection from 12,560 to 7,078 unique chunks. At that point, the retrieval started behaving cleanly.
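The scan itself is simple. A minimal sketch, assuming each chunk record exposes its database identifier, source file, and character offset under the hypothetical keys `id`, `source`, and `offset`:

```python
def find_duplicate_ids(chunks):
    """Return the ids of chunks whose (source, offset) pair was already seen.

    The first chunk at each (source file, character offset) position is
    kept; every later chunk at the same position is marked for deletion.
    """
    seen = set()
    to_delete = []
    for chunk in chunks:
        key = (chunk["source"], chunk["offset"])
        if key in seen:
            to_delete.append(chunk["id"])
        else:
            seen.add(key)
    return to_delete


chunks = [
    {"id": "c1", "source": "Bhimpalasi.txt", "offset": 0},
    {"id": "c2", "source": "Bhimpalasi.txt", "offset": 800},
    {"id": "c3", "source": "Bhimpalasi.txt", "offset": 0},   # copy of c1
    {"id": "c4", "source": "Marwa.txt", "offset": 0},
]
duplicates = find_duplicate_ids(chunks)  # ["c3"]
```

Keying on position rather than on the text means two legitimately similar passages from different articles are never collapsed into one.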
Two models, two voices
There are two models in the interface, each with a distinct system prompt and configuration.
The Dhrupad guide knows the Dagarvani bani and the surbahar. Its system prompt references the tradition’s characteristic aesthetic — the long unhurried alap, the weight given to the lower octave, the gravity and patience that distinguishes Dagar bani from other dhrupad traditions. It knows I play surbahar, which changes the practical advice: the guidance around meend on a bass instrument is different from sitar, the constraints on gamak are different. This is where I take dhrupad practice sessions.
The Khayal guide knows the Etawah gharana — the lineage of Imdad Khan, Inayat Khan, Vilayat Khan, and their descendants. It understands the gayaki ang, the vocal style adapted for instruments that defines this gharana’s approach to khayal. This is where I take questions about bandish exploration and the characteristic phrasing of the Imdadkhani style.
Each persona runs in two versions. The deep reasoning version uses qwen3.6:27b with its chain-of-thought mode enabled — the model works through the problem internally before producing a visible answer, which adds time but meaningfully improves accuracy on questions that require musical reasoning rather than simple retrieval. This runs at around 14 tokens per second and is best suited to sessions where I want to think carefully about a raag. The fast version uses qwen3:14b with reasoning off, generating at around 38 tokens per second — better suited to quick questions in the middle of practising. The 14B model also fits comfortably within VRAM at 32,768 tokens of context, which is about a third more than the 27B can safely hold, and that matters for longer sessions.
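The trade-off is easy to put in numbers. The context sizes and token rates below come from the configuration described above; the 420-token answer length is a hypothetical figure picked for illustration, and the estimate ignores the hidden reasoning tokens the deep version generates first.

```python
CONTEXT_14B = 32_768  # tokens of context on qwen3:14b
CONTEXT_27B = 24_576  # tokens of context on qwen3.6:27b

# Fraction of extra context the smaller model can hold: one third.
extra_context = CONTEXT_14B / CONTEXT_27B - 1


def generation_seconds(tokens, tokens_per_second):
    """Time to stream a visible answer of the given length."""
    return tokens / tokens_per_second


ANSWER_TOKENS = 420  # hypothetical answer length
deep_seconds = generation_seconds(ANSWER_TOKENS, 14)  # 30.0
fast_seconds = generation_seconds(ANSWER_TOKENS, 38)  # about 11.1
```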
Before and after
The practical difference was immediate.
Before the knowledge base, a direct question about Bhimpalasi’s grammar returned something like this:
Without the knowledge base
Raag Bhimpalasi belongs to the Khamaj thaat. Its vadi is Gandhar (Ga) and the samvadi is Nishad (Ni). The aroha follows: Sa Re Ga Ma Pa Dha Ni Sa'. The raga is characterised by its use of shuddha swaras throughout and is performed in the afternoon hours...
Most of the grammar in that passage is wrong. Bhimpalasi belongs to Kafi thaat, not Khamaj. Its vadi is Madhyam (Ma), not Gandhar. Both Re and Dha are omitted in the ascent. Ni is komal, not shuddha. The model produces this with complete confidence and no signal that it might be wrong.
After building the knowledge base, the same question pulls from the Bhimpalasi Wikipedia article. The response cites its sources, and I can check them. Here is an edited excerpt from an actual session:
Session — June 2026
You: Today while practising aalap on surbahar on Bhimpalasi raag, I was noticing that I tend to repeat my phrases when I go back and forth between the octaves. While I understand that there are only a few phrases that can take me back or forward, how do I improvise such that it doesn't feel like direct repetition?
Guide: It is good to hear that you are spending time with Bhimpalasi. The observation about repetition during octave movement is a very common hurdle in developing vistar. In the Dagarvani tradition, the alap is not merely about playing notes, but about breathing into them, so that even the same swara feels different when approached with fresh intention.
In Bhimpalasi, whose vadi is Madhyam — Ma [1] — the challenge is that the melodic vocabulary can feel limited if one relies only on standard phrases. Repetition feels jarring because the listener expects an evolution of thought, not transposition.
Vary the gamak and meend application. Even if the swara sequence is identical, say Ni Sa Ma, the physical execution on the surbahar can differ. A slow meend from Ni to Sa in one phrase can give way to a gamak-touched approach in the next. The note is the same; the texture is different.
Shift the point of arrival. Instead of always returning to Sa as your resting point in the mandra saptak, try resting on Pa, or allow Ma to carry the phrase before ascending again...
For Darbari Kanada the responses are now accurate: I have not recorded a misidentified vadi since the knowledge base was populated. Marwa is accurate too. For raags with thin or absent Wikipedia coverage (Charukeshi, Kalavati, Nand, Hem Kalyan, several others), the model falls back on its training data and the old problem recurs. The knowledge base raises the floor for the raags that are covered. It doesn’t solve coverage.
The journaling function is where the system is most useful in practice. After a session I describe what I worked on — which raag, which passage of the alap, where the phrasing kept collapsing, what felt different — and the model connects that to what I’ve described in previous sessions, asks questions, suggests something specific to focus on. Because the system prompt establishes the instrument, the tradition, and the gharana, the guidance is grounded. When it says something about alap, it means the Dagarvani treatment of alap. When it asks about meend, it knows the surbahar context. These are not small things.
Nobody had done this
After building it, I went looking for prior work. I expected to find something.
A paper published in June 2026 was the first systematic evaluation of language models on Hindustani classical theory — thirty-three models, 504 questions, precisely describing the failure modes I’d observed. It offered no fix. A December 2025 paper built a Wikipedia-filtered music retrieval database spanning global music across 163 countries, 3.2 million passages from 144,000 articles, deployed on cloud infrastructure. Same architecture; entirely different domain. Stanford’s WikiChat system won the Wikimedia Research Award of the Year 2024 for using Wikipedia as a grounding corpus to reduce hallucination — 97.9% factual accuracy in human-user testing.
For Hindustani or Carnatic music specifically — Wikipedia corpus, locally deployed, personalised to a specific instrument and gharana, used as a riyaz journal — nothing. Not a paper, not a repository, not a blog post. Whether this is because the problem didn’t seem worth solving to the people who build these systems, or because the people who care deeply about the music and the people who build these systems rarely sit in the same room, I can’t say with confidence. Probably some of both.
What it isn’t
There are things this can’t do, and it would be wrong not to say them.
It can’t replace a teacher. The guru-shishya relationship is the mechanism by which this music has survived for as long as it has. It transmits not just information but judgment — the ear that has heard thousands of hours of performance, the correction of a phrase that’s technically right but aesthetically wrong, the sense of when a student needs to sit with something longer. None of that is in a language model, however well it retrieves Wikipedia.
It can’t transmit sruti — the microtonal pitch placement, below the resolution of notation, that distinguishes a great performance of Darbari Kanada from a merely correct one. The tradition holds that sruti is demonstrated rather than described. A language model can describe it. It cannot demonstrate it.
It still hallucinates. When a raag is absent from the knowledge base, or the Wikipedia article is thin, the model may produce confident prose that sounds right and isn’t. The response includes source citations when the knowledge base was used; the absence of citations is a signal to be more sceptical.
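That signal can be checked mechanically. A small sketch, assuming citations appear in the response as bracketed numbers such as [1]; the function names are hypothetical:

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def cited_sources(response):
    """Return the set of source numbers cited in a response."""
    return {int(n) for n in CITATION.findall(response)}


def needs_scepticism(response):
    """True when the response cites nothing from the knowledge base."""
    return not cited_sources(response)


grounded = "In Bhimpalasi, whose vadi is Madhyam [1], the phrase rests on Ma."
ungrounded = "The vadi of this raag is Gandhar."
```

A citation only shows that a passage was retrieved, not that the sentence reflects it faithfully, so the cited source is still worth opening.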
And it’s a text interface. It can’t hear me playing and correct my meend. It can’t notice I’ve been avoiding the upper Sa in a raag that calls for it. For a practice companion, these are significant things to be missing.
Five in the morning
At five in the morning, before work, with the surbahar on the stand and a session still fresh, there’s something useful about having a place to be precise about the music. Not a recording, not a notation, but a description: what I was trying in this passage, where it held, where it slipped. Putting that in language forces a clarity that helps the next session.
The knowledge base makes the model accurate enough to ask real questions of. The questions I bring to a session — which approach to the upper octave serves this raag better, what the characteristic phrase structure says about where to spend time in the alap, what a musician in this tradition has said about approaching the weaker notes — often have factual components. Getting those consistently right was the piece that was missing.
What’s there now is not a teacher. It’s a well-read practice companion that knows the grammar of the music and remembers what you told it last time. That turns out to be a useful thing to have at five in the morning.