TL;DR
I'm building a system that takes Quran audio and tells you exactly which verse is being recited. Fully offline, no API calls. The pipeline: Whisper encoder (fine-tuned on Quran by Tarteel) -> CTC head -> WFST beam search -> verse ID. After one epoch of training with the encoder frozen, it's already identifying 8% of verses, up from 0%. The decoding graph covers all 6,236 verses and fits in 7.9 MB.
Where this started
I came across yazinsai/offline-tarteel, a really well-documented project trying to solve the same problem: identify which Quran verse is being recited, entirely on-device. Yazin's analysis of the problem space is sharp. His central insight is correct - there are no small Arabic wav2vec2 models, and that's the fundamental blocker for traditional ASR approaches.
His best results were 67-72% accuracy with Tarteel's Whisper model using transcribe-then-fuzzy-match. An 81% result with a larger CTC model proved the ceiling is high when the acoustic model actually understands Arabic, but that model was 1.2 GB - way too big for on-device use.
I thought I could do better with a different decoding strategy. The Quran is a closed corpus - 6,236 verses, and the text will never change. That's the dream scenario for constrained decoding.
The key insight: WFST decoding
Instead of transcribing audio to text and then fuzzy-matching against a database (which is what most approaches do), I compiled the entire Quran into a weighted finite state transducer. It's basically a massive trie where every verse is a valid path through token states, and the decoder can only follow valid paths.
Think of it as autocomplete but for the entire Quran. Even a mediocre acoustic model becomes dramatically more accurate because it literally can't hallucinate. If the model is 60% confident between two similar-sounding words, the WFST resolves the ambiguity by checking which one is valid in that context.
```
Audio (16 kHz) -> Mel Spectrogram -> Whisper Encoder (512-dim)
  -> CTC Head (512 -> 3000) -> Log-Softmax
  -> WFST Beam Search -> Verse ID
```
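The constraint idea can be sketched in plain Python with a trie and a greedy decoder. This is a simplified stand-in for the real system, which runs a proper beam search over a compiled FST; the token ids and verse keys below are made up for illustration:

```python
from math import log

# Toy stand-in for the decoding graph: a trie over verse token sequences.
# Token ids and verse keys are invented; the real graph covers 6,236 verses.
verses = {
    "1:1": [5, 12, 7],
    "1:2": [5, 12, 9],
    "112:1": [3, 8],
}

def build_trie(verses):
    root = {}
    for verse_id, tokens in verses.items():
        node = root
        for t in tokens:
            node = node.setdefault(t, {})
        node["$"] = verse_id  # end-of-verse marker
    return root

def constrained_decode(log_probs, trie):
    """Greedy decode where each step may only pick a token that keeps
    us on a valid path through the trie (i.e. a real verse prefix)."""
    node, score = trie, 0.0
    for frame in log_probs:  # frame: {token_id: log_prob}
        valid = [t for t in node if t != "$"]
        if not valid:
            break
        best = max(valid, key=lambda t: frame.get(t, float("-inf")))
        score += frame.get(best, float("-inf"))
        node = node[best]
    return node.get("$"), score

trie = build_trie(verses)
# The acoustic model is only 60/40 between tokens 7 and 9 on the last
# frame, but the trie forces a valid verse either way.
frames = [{5: log(0.9)}, {12: log(0.8)}, {7: log(0.6), 9: log(0.4)}]
verse_id, _ = constrained_decode(frames, trie)  # -> "1:1"
```

The beam-search version keeps several trie positions alive per frame instead of one, but the "only valid prefixes exist" property is the same.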
Building the pieces
Tokeniser
I trained a SentencePiece Unigram tokeniser on all 6,236 verses with a vocab of 3,000 tokens. Unigram over BPE because Arabic is morphologically rich - you want subword units that respect the structure of the language rather than just frequent byte pairs.
100% round-trip accuracy on all 6,236 verses. I set normalization_rule_name="identity" to preserve the Uthmani script exactly. Quran text is not the place to get creative with normalisation.
Average tokens per verse: 32.7. Max: 382 (Al-Baqarah 2:282, the longest verse in the Quran).
The decoding graph
Two practical notes from building it:
I originally tried k2 for the FST operations, but its C extension wouldn't load on Python 3.13 with macOS ARM. kaldifst (a lighter OpenFst wrapper) worked straight away.
For CTC with SentencePiece tokens, the lexicon is implicit - each word's token sequence is deterministic from the tokeniser, so the grammar FST already operates at token level. Traditional L∘G composition would only be needed with a separate phone/grapheme level; SentencePiece handles the word-to-token mapping directly. Simpler than expected.
Training data
I needed ayah-level audio from multiple reciters. Tarteel's own CDN returns 403s. qurancdn.com works for some reciters but has limited coverage. alquran.cloud was the reliable one - public API, no auth, 27+ reciters.
Downloaded 68,500 MP3 files across 11 reciters, about 18 GB total. The split is by reciter, not by verse - I want the model to generalise across voices, not memorise how Alafasy sounds on verse 2:255.
| Split | Entries | Reciters |
|---|---|---|
| Train | 49,887 | 8 Arabic reciters |
| Val | 12,464 | Sudais, Husary Mujawwad |
| Test | 6,236 | Ibrahim Walk (English) |
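The reciter-level split reduces to a few lines. The reciter names and file tuples below are illustrative, not the real manifest:

```python
from collections import defaultdict

# Split by reciter, not by verse: a held-out voice never appears in train,
# so the model can't pass by memorising how one reciter sounds per verse.
def split_by_reciter(files, val_reciters, test_reciters):
    """files: iterable of (reciter, surah, ayah, path) tuples."""
    splits = defaultdict(list)
    for entry in files:
        reciter = entry[0]
        if reciter in test_reciters:
            splits["test"].append(entry)
        elif reciter in val_reciters:
            splits["val"].append(entry)
        else:
            splits["train"].append(entry)
    return splits

# Illustrative manifest entries.
files = [
    ("alafasy", 2, 255, "audio/alafasy/2_255.mp3"),
    ("sudais", 2, 255, "audio/sudais/2_255.mp3"),
    ("walk", 2, 255, "audio/walk/2_255.mp3"),
]
splits = split_by_reciter(files, val_reciters={"sudais"},
                          test_reciters={"walk"})
```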
Training
I'm using tarteel-ai/whisper-base-ar-quran as the base - a Whisper model already fine-tuned on Quran recitation by Tarteel. I stripped the decoder, kept only the encoder (512-dim), and bolted a linear CTC head onto it (512 -> 3000, matching the tokeniser vocab).
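The head itself is tiny - one linear layer. A sketch with a random tensor standing in for the encoder output (shapes follow Whisper-base: 512-dim, 1500 frames per 30-second window):

```python
import torch
import torch.nn as nn

VOCAB = 3000  # matches the SentencePiece vocab

class CTCHead(nn.Module):
    """Linear projection from encoder features to token log-probs."""
    def __init__(self, d_model=512, vocab_size=VOCAB):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, encoder_out):
        # encoder_out: (batch, frames, 512) -> (batch, frames, 3000)
        return self.proj(encoder_out).log_softmax(dim=-1)

head = CTCHead()
# Stand-in for the frozen Whisper encoder's output.
fake_encoder_out = torch.randn(2, 1500, 512)
log_probs = head(fake_encoder_out)  # (2, 1500, 3000), rows sum to 1 in prob space
```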
Three-phase training schedule:
| Phase | What's trained | Learning rate | Epochs |
|---|---|---|---|
| A | CTC head only (encoder frozen) | 1e-3 | 5 |
| B | Top 4 encoder layers + head | 1e-5 | 10 |
| C | Everything | 5e-6 | 5 |
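A sketch of how the phases might be wired up, assuming the model exposes `encoder.layers` (as transformers' WhisperEncoder does) and a `ctc_head` attribute; the DummyModel here is a stand-in so the snippet runs on its own:

```python
import torch.nn as nn

class DummyModel(nn.Module):
    """Stand-in with the attribute names assumed for the real model."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Module()
        self.encoder.layers = nn.ModuleList(nn.Linear(8, 8) for _ in range(6))
        self.ctc_head = nn.Linear(8, 16)

def configure_phase(model, phase):
    # Freeze everything, then selectively unfreeze per phase.
    for p in model.parameters():
        p.requires_grad = False
    if phase == "A":        # CTC head only
        trainable = [model.ctc_head]
    elif phase == "B":      # top 4 encoder layers + head
        trainable = [model.ctc_head, *model.encoder.layers[-4:]]
    else:                   # "C": everything
        trainable = [model]
    for m in trainable:
        for p in m.parameters():
            p.requires_grad = True

lrs = {"A": 1e-3, "B": 1e-5, "C": 5e-6}
model = DummyModel()
configure_phase(model, "A")
```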
One annoyance - nn.CTCLoss doesn't work on MPS. The workaround is running the forward pass on GPU, then moving log-probs to CPU for the loss computation. Gradients still flow through the .cpu() call (I verified), but it adds about 10% overhead. Better than running the whole encoder on CPU though.
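The workaround in sketch form; on a CPU-only machine the `.cpu()` call is a no-op, but the pattern is the same (shapes here are illustrative):

```python
import torch
import torch.nn as nn

# Forward on the accelerator, CTC loss on CPU. Gradients flow back
# through .cpu() because it's a differentiable device copy.
device = "mps" if torch.backends.mps.is_available() else "cpu"
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

head = nn.Linear(512, 3000).to(device)
enc_out = torch.randn(4, 100, 512, device=device)          # (batch, frames, dim)
log_probs = head(enc_out).log_softmax(-1).transpose(0, 1)  # CTCLoss wants (T, N, C)

targets = torch.randint(1, 3000, (4, 30))   # dummy token targets, no blanks
input_lens = torch.full((4,), 100)
target_lens = torch.full((4,), 30)

loss = ctc_loss(log_probs.cpu(), targets, input_lens, target_lens)
loss.backward()
assert head.weight.grad is not None  # gradients survived the .cpu() hop
```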
Where it's at
After one epoch with just the CTC head trained:
| Metric | Value |
|---|---|
| Correct verses | 4/50 (8%) |
| Empty predictions | 33/50 |
| Wrong predictions | 13/50 |
| Avg inference time | 0.16s/sample (CPU) |
The four correct predictions came from four different reciters (verses 2:145, 5:54, 39:52, 25:68), which is a good sign - no single-reciter bias.
Inference is fast: 0.16s per sample on CPU. The target is under 1 second, so there's plenty of headroom even before optimisation.
Why I think this gets to 95%+
Yazin's 67% with the Tarteel model came from transcribe-then-match, and most errors were partial - close enough that the WFST graph should resolve them. The encoder already understands Quranic Arabic; it just needs the CTC head trained to produce useful frame-level outputs, and the WFST handles the disambiguation.
The Whisper encoder was fine-tuned by Tarteel for seq2seq transcription, not CTC, so the frame-level representations probably need some adjustment. That's what Phase B and C are for - unfreezing encoder layers should make a big difference.
The full decoding graph (7.9 MB) plus the quantised model should fit comfortably under 200 MB, which was the whole point - something that can run on a phone or cheap device without connectivity.
What's next
Finish the three-phase training. If it hits >85% SeqAcc, I'll move to quantisation (INT8, maybe INT4) and ONNX export for on-device deployment via sherpa-onnx. If the Tarteel encoder isn't enough, I have a fallback path to Whisper-small or training a Zipformer from scratch using k2/icefall.
I'll post results once I have numbers from the full training run. The QUL dataset (62 reciters with word-level timestamps) is also sitting there waiting to be used for a more ambitious fine-tune if needed - that's potentially 4.8 million word-level audio-text pairs across dozens of speakers.
All code is in quran-asr if you want to follow along.