June 20, 2026

Technology

Ten models, four domains

I built a local AI coaching system for Hindustani music, tennis, photography, and career coaching — then spent two weeks benchmarking ten models to find the right one. Speed, quality, thinking mode, MoE architecture, and a lot of out-of-memory crashes.

tl;dr — After building a local RAG coaching system for Hindustani classical music, tennis, photography, and career coaching, I spent two weeks systematically benchmarking ten locally-runnable language models to find the right ones. The winner was a Mixture of Experts model — Google’s Gemma4 26B-A4B in a Quantization-Aware Training build — running at 85–91 tokens per second while using only 13.8GB of VRAM. For sessions requiring deeper reasoning, a smaller 12B model with extended thinking mode turned out to be the right call. Getting to those conclusions involved two hardware crashes, one architecture I had never heard of, and a 40-minute test run that went nowhere useful.

Table of contents

What I was trying to solve
The ten models
Speed and VRAM — the full table
Hindustani classical music
Tennis coaching
Photography — text coaching
Career coaching
Photography — visual critique
The thinking mode question
The MoE detour
Extended conversations and OOM
What I chose
What I learned

What I was trying to solve

The system described in the previous article — a local AI practice companion for Hindustani classical music, grounded in a Wikipedia retrieval database — works by pairing a language model with retrieved passages from 1,385 articles. The model’s job is to synthesise what the knowledge base returns and speak in a specific voice: Dagarvani bani for dhrupad practice on the surbahar, Etawah gharana for khayal. The knowledge base handles factual accuracy; the model handles how to say things.

The original model was qwen3.6:27b. It worked, but it was slow — around 16 tokens per second with thinking enabled, producing a full response in 30–50 seconds. It also had a factual error that appeared without the knowledge base: it attributed the Dagarvani bani to the Maihar gharana of Allauddin Khan, which is wrong. More importantly, it was sitting right at the VRAM limit of the card, leaving almost no headroom for the context window to grow across a long conversation.

The question was whether something better existed — better on quality, faster, or both — and whether I could find it without spending a week reading specifications.

The ten models

All models were run locally on an AMD Ryzen 7 9850X3D with an RX 9070 XT (16GB GDDR6), using Ollama on Bazzite Linux.

The candidates came in three categories. Dense models of various sizes: qwen3.6:27b (the baseline), qwen3:14b, gemma4:12b, gemma4:31b, gemma4:e4b, and phi4-reasoning:plus. Dedicated reasoning models: deepseek-r1 variants, which I eliminated after research showed their training was optimised for mathematics and code rather than nuanced language tasks. And Mixture of Experts models: gemma4’s 26B-A4B architecture in two quantization builds, and Qwen3’s 30B-A3B from HuggingFace.

One model — gpt-oss:20b — appeared in community benchmarks as the fastest thing tested on the exact GPU I was using, at 101 tokens per second. It generated tokens happily but produced empty output. The MXFP4 quantization format it used is too new for Ollama to extract content from properly. I mention it only because 101 tokens per second with nothing to show for it is a specific kind of frustrating.

Speed and VRAM — the full table

All numbers measured on RX 9070 XT (16GB GDDR6, ROCm, Ollama 0.30.10). Generation speed is tokens per second for a 280-token response. Wall time is end-to-end including prompt processing.

Model	Size	VRAM used	Speed	Wall time (280t)	Thinking	Status
gemma4:26b QAT (Unsloth UD-Q4_K_XL)	14.4 GB	13.8 GB	85–91 tok/s	~4s	❌ template missing	✅ Recommended
gpt-oss:20b	12.8 GB	—	101 tok/s	—	—	❌ Empty output
Qwen3 30B-A3B Q3_K_M	13.1 GB	—	100 tok/s	—	✅	❌ Broken output (Q3 too aggressive)
gemma4:26b Q4_K_M (ggml-org)	16.8 GB	13.4 GB	57–59 tok/s	~5s	✅	❌ OOM after first query
phi4-reasoning:plus	10.4 GB	—	43 tok/s	~9s	✅	⚠️ Needs system prompt
qwen3:14b	8.6 GB	8.6 GB	37 tok/s	~8s	✅ (1500t needed)	✅
gemma4:12b	7.0 GB	7.5 GB	36 tok/s	~9s	✅ (1500t needed)	✅ Recommended
gemma4:e4b	8.9 GB	—	41–72 tok/s	~7s	✅ (1500t needed)	⚠️ 8B quality ceiling
qwen3.6:27b (baseline)	16.2 GB	14.3 GB	16 tok/s	~18s	✅	✅
gemma4:31b	18.5 GB	16 GB	7 tok/s	~40s	✅	✅ Best prose

Hindustani classical music

Two test questions. The first required technical knowledge of raag grammar — specifically the relationship between the slow oscillation on komal Ga in Darbari Kanada and how it should inform phrasing in the lower octave. The second asked about the Dagarvani tradition’s philosophy of dwelling in a note, and what to do when alap feels restless.

Model	Speed	Factual accuracy	Domain vocabulary	Quality
gemma4:31b	7 tok/s	✅	vazan, gambhir, nyasa	⭐⭐⭐⭐⭐
gemma4:26b QAT	85 tok/s	✅	andolan, mandra saptak	⭐⭐⭐⭐½
qwen3:14b	37 tok/s	✅	Adequate	⭐⭐⭐⭐
gemma4:12b	36 tok/s	❌ Surbahar = “bowed”	Good	⭐⭐⭐⭐
qwen3.6:27b	16 tok/s	❌ Dagarvani = Maihar gharana	Adequate	⭐⭐⭐⭐

The two errors are instructive. qwen3.6:27b attributed the Dagarvani bani to the Maihar gharana of Allauddin Khan — that is the wrong lineage entirely. gemma4:12b described the surbahar as a bowed instrument; it is plucked. Both errors disappear when the retrieval knowledge base returns the correct passages, but they indicate gaps in baseline parametric knowledge that matter in edge cases. The QAT model and the 31B made neither error.

On prose quality, the 31B introduced terms like vazan (weight) and nyasa (the resting point of a note) naturally, in context, rather than decoratively. The 12B produced the best single phrase across all models on the restlessness question: “your mind is living in the next phrase while your voice is still in the current one.”

Tennis coaching

Two questions: a match collapse diagnosis (won 6-3, lost 3-6 after tightening at 3-1 up), and a technical question about maintaining forehand topspin while moving laterally at full stretch.

Model	Speed	Diagnosis	Technical advice	Quality
gemma4:31b	7 tok/s	”Shift from Process to Outcome”	Specific, biomechanically correct	⭐⭐⭐⭐⭐
gemma4:26b QAT	85 tok/s	”Loss Aversion Trap”	Good structure	⭐⭐⭐⭐½
gemma4:12b	36 tok/s	”Success Trap”	Solid	⭐⭐⭐⭐
qwen3:14b	37 tok/s	Structured, slightly generic	Good drills	⭐⭐⭐⭐
qwen3.6:27b	16 tok/s	Good, used emojis	Adequate	⭐⭐⭐½

The tennis domain is well-covered in training data across all models — quality differences were smaller here than for music. The 31B’s diagnosis “Shift from Process to Outcome” was the sharpest frame, and its forehand drill description used correct biomechanical terminology. The QAT matched it in speed while producing nearly comparable depth.

Photography — text coaching

Text-only coaching on B&W photographic technique: one question about creating tonal separation from flat overcast light on industrial architecture, one about developing Zone System pre-visualisation.

Model	Speed	Tonal analysis	Zone System vocabulary	Quality
gemma4:31b	7 tok/s	”colour from luminance” — precise	Adams’ “score vs performance” analogy	⭐⭐⭐⭐⭐
gemma4:26b QAT	85 tok/s	”local contrast over global contrast”	Solid Adams reference	⭐⭐⭐⭐½
gemma4:12b	36 tok/s	”texture as proxy for value” — excellent	Good	⭐⭐⭐⭐
qwen3:14b	37 tok/s	Mentioned polarising filter — generic	Standard Zone System explanation	⭐⭐⭐
qwen3.6:27b	16 tok/s	Good but textbook	Adequate	⭐⭐⭐

Photography coaching is where the smaller models started to diverge more visibly. gemma4:12b’s phrase “texture as proxy for value” — using surface detail to create tonal separation when light isn’t doing the work — is exactly the expert-level insight a B&W photographer needs. The 31B correctly used Adams’ own “score vs performance” analogy for pre-visualisation. The QAT held its own at 85 tokens per second.

Career coaching

Two questions tested on the same models: a seven-year software engineer wanting to transition into AI product management, and a mid-level engineer wanting to position for promotion in a three-week window.

Model	Speed	Strategic depth	Actionability	Quality
gemma4:26b QAT	86 tok/s	”change the narrative of how the work is presented”	Concrete 30-day framing	⭐⭐⭐⭐½
gemma4:31b	7 tok/s	”already operating at the next level”	Specific visibility tactics	⭐⭐⭐⭐½
gemma4:12b	36 tok/s	”executing tasks well vs owning outcomes”	Good structure	⭐⭐⭐⭐
qwen3:14b	37 tok/s	Solid framing	Practical steps	⭐⭐⭐⭐
qwen3.6:27b	16 tok/s	Strong narrative arc	Good tactics	⭐⭐⭐⭐

Career coaching is the most generic of the four domains — all models had adequate knowledge here, and quality differences were the smallest. The QAT’s speed advantage becomes decisive when quality is roughly equivalent across models; at 86 tokens per second versus 7, you receive comparable advice twelve times faster.

Photography — visual critique

A separate category: models that can actually receive and analyse photographs, not just give text advice about photography. Seven vision-capable models were tested with a standardised greyscale gradient image and asked for a B&W photograph critique covering tonal range, contrast, and composition.

Model	Size	VRAM	Speed	Saw the image	Quality
Qwen2.5-VL 7B	5.6 GB	~5 GB	116 tok/s	✅	⭐⭐⭐⭐
Qwen3-VL 8B	5.7 GB	2.5 GB	84 tok/s	✅	⭐⭐⭐⭐
MiniCPM-V 4.5	5.7 GB	4.7 GB	54 tok/s	✅	⭐⭐⭐⭐
Pixtral 12B	7.4 GB	~8 GB	43 tok/s	✅	⭐⭐⭐½
gemma4:12b	7.0 GB	7.5 GB	49 tok/s	✅	⭐⭐⭐⭐
gemma4:e4b	8.9 GB	—	41 tok/s	❌ Ignored image	❌
llama3.2-vision:11b	7.3 GB	—	—	❌ 500 error	❌

Qwen2.5-VL was the fastest at 116 tokens per second, and both Qwen VL models produced accurate analysis of the image’s tonal structure. MiniCPM-V 4.5 was notable for its analytical framing, recognising the gradient composition and commenting on the transition from shadow to highlight with specific language. Pixtral 12B, which handles artistic and photographic content well, produced competent critique but did not meaningfully outperform the Qwen VL models on this test. gemma4:e4b received the image file but appeared to discard it, asking the user to provide the photograph it had already been sent.

For photography critique in practice: Qwen3-VL:8b as the primary model (fast, accurate, stable), with MiniCPM-V 4.5 as an alternative for sessions where very high-resolution image analysis matters (it supports images up to 1.8 megapixels with fewer visual tokens than most models).

The thinking mode question

Several models support a reasoning mode in which the model produces an internal chain of thought before answering. Enabling it consistently improved the quality of responses to nuanced questions — but with a significant cost. A 300-token budget was routinely consumed entirely by the thinking chain before a single word of visible response appeared. Useful results required 1,200–1,500 tokens.

Scenario	gemma4:12b + thinking	gemma4:26b QAT (no think)
Token budget	1,500	350
Speed	30 tok/s	85 tok/s
Wall time	~51s per turn	~8s per turn
Quality	Comparable	Comparable

The direct comparison settled the question. At 51 seconds per turn versus 8 seconds, the 12B with thinking costs more than six times as long for responses that are roughly equal in quality. The thinking chain improved the 12B’s outputs noticeably on genuinely hard analytical questions, but not enough to consistently beat the larger QAT model answering directly. For everyday journaling, the QAT without thinking wins. For specific hard questions where precision matters, the 12B with thinking is worth the wait.

The MoE detour

Mixture of Experts models were the most interesting detour and the most disappointing practical result. The architecture — routing each token through a subset of expert networks rather than the full model — theoretically offers the quality of a large model at the compute cost of a small one.

The gemma4:26b in standard Q4_K_M quantization (16.8GB) used 13.4GB of VRAM and crashed with an out-of-memory error after the first response. The growing KV cache from conversation history filled the remaining VRAM.

The QAT build at 14.4GB used 13.8GB of VRAM — smaller despite containing the same model — and remained stable across all ten test turns. The difference was headroom: the QAT build left 3.1GB for the KV cache, the standard build left under 1GB.

Qwen3’s 30B-A3B MoE, pulled from HuggingFace at Q3_K_M quantization, was worse. The aggressive 3-bit quantization degraded the model enough that it leaked its raw reasoning process as visible output rather than producing a properly formatted response. Speed was extraordinary — 100 tokens per second — but the output was unusable.

Extended conversations and OOM

The ten-turn conversation test — ten consecutive exchanges of a practice session, each building on the previous — produced the most useful practical finding.

	gemma4:12b + thinking	gemma4:31b + thinking
num_ctx	65,536 tokens	32,768 tokens
Speed	30 tok/s	6–7 tok/s
Avg time per turn	51s	242s
Full 10-turn session	~8.5 min	~40 min
OOM events	0	0
Response quality	⭐⭐⭐⭐½	⭐⭐⭐⭐⭐

Neither model crashed across the full ten turns. Both produced substantive responses throughout, with the 31B maintaining slightly more precise musical vocabulary — the terms vazan and nyasa appearing naturally on later turns. The 12B completed the session in 8.5 minutes. The 31B took 40 minutes. The quality gap does not justify the time cost for daily use.

What I chose

For daily sessions: the gemma4:26b QAT from Unsloth’s HuggingFace repository, pulled via Ollama’s direct HuggingFace integration (hf.co/unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL). It runs at 85–91 tokens per second, supports 64K context, uses 13.8GB of VRAM, and produces responses in 4–8 seconds. It replaced both the fast and the deep reasoning tiers that previously ran separate models.

For sessions where careful reasoning matters: gemma4:12b with thinking enabled, 1,500-token budget, ~51 seconds per turn.

For photography critique with actual images: Qwen3-VL 8B as primary, MiniCPM-V 4.5 as high-resolution alternative.

The gemma4:31b stays installed for occasional use when maximum prose quality is the priority.

What I learned

Running ten models across four domains taught more about the tradeoffs than reading specifications ever would have.

Thinking mode is not always better, only differently distributed. A model thinking for 900 tokens before answering is not more accurate than a larger model answering in 300 tokens. The reasoning chain adds precision on genuinely hard analytical tasks and adds time everywhere.

MoE architecture solves a different problem than the one I had. It reduces compute per token, not memory. On a constrained VRAM card used for multi-turn conversation, where memory is the actual constraint, the storage cost of the full model cancels out the compute savings. The architecture makes more sense for server inference with many simultaneous users than for a personal practice companion running one conversation at a time.

Quantization-Aware Training is worth seeking out. The QAT build consistently outperformed the standard Q4_K_M of the same model — smaller, faster, and comparably accurate — because the quantization was part of training rather than applied afterwards.

The knowledge base still matters more than the model. The factual errors that appeared in models running without retrieval — the wrong gharana attribution, the wrong instrument classification — disappeared when the correct passages were returned by the knowledge base. The model’s parametric knowledge is a floor, not a ceiling.

Vision models are a separate category. The best text models and the best vision models are not the same models at this hardware tier. Qwen2.5-VL at 116 tokens per second and 5.6GB outperforms gemma4:12b for image analysis despite being smaller, because it was trained specifically for vision understanding. Choosing a vision model requires different criteria than choosing a reasoning model.

← All thoughts