Building an offline Quran verse recogniser from scratch

How I'm building a system that listens to Quran recitation and identifies the exact verse, fully offline, using a Whisper encoder, CTC, and finite state transducers.

TL;DR

I'm building a system that takes Quran audio and tells you exactly which verse is being recited. Fully offline, no API calls. The pipeline: Whisper encoder (fine-tuned on Quran by Tarteel) -> CTC head -> WFST beam search -> verse ID. After one epoch of training with the encoder frozen, it's already identifying 8% of verses, up from 0%. The decoding graph covers all 6,236 verses and fits in 7.9 MB.

Where this started

I came across yazinsai/offline-tarteel, a really well-documented project trying to solve the same problem: identify which Quran verse is being recited, entirely on-device. Yazin's analysis of the problem space is sharp. His central insight is correct - there are no small Arabic wav2vec2 models, and that's the fundamental blocker for traditional ASR approaches.

His best results were 67-72% accuracy with Tarteel's Whisper model using transcribe-then-fuzzy-match. An 81% result with a larger CTC model proved the ceiling is high when the acoustic model actually understands Arabic, but that model was 1.2 GB - way too big for on-device use.

I thought I could do better with a different decoding strategy. The Quran is a closed corpus - 6,236 verses, and the text will never change. That's the dream scenario for constrained decoding.

The key insight: WFST decoding

Instead of transcribing audio to text and then fuzzy-matching against a database (which is what most approaches do), I compiled the entire Quran into a weighted finite state transducer. It's basically a massive trie where every verse is a valid path through token states, and the decoder can only follow valid paths.

Think of it as autocomplete but for the entire Quran. Even a mediocre acoustic model becomes dramatically more accurate because it literally can't hallucinate. If the model is 60% confident between two similar-sounding words, the WFST resolves the ambiguity by checking which one is valid in that context.
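To make the autocomplete idea concrete, here's a toy sketch - pure Python, not the real WFST machinery, with made-up tokens and scores. A trie over token sequences stands in for the decoding graph, and the picker only considers tokens the trie allows at the current state, so an acoustically tempting but invalid token can't win.

```python
def build_trie(sequences):
    """Nested-dict trie; every full sequence is a valid path."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def constrained_pick(scores, node):
    """Pick the highest-scoring token that the trie allows here."""
    allowed = {t: s for t, s in scores.items() if t in node}
    return max(allowed, key=allowed.get) if allowed else None

# Two made-up "verses" as token sequences.
trie = build_trie([["bis", "mi", "llah"], ["bis", "ma", "tin"]])

# Context: we already matched "bis mi". The acoustic model is unsure -
# "lla" scores highest - but only "llah" is valid from this state.
node = trie["bis"]["mi"]
scores = {"llah": -0.9, "lla": -0.5, "tin": -0.7}
print(constrained_pick(scores, node))  # -> llah
```

The real decoder does this with beam search over log-probabilities and an OpenFst graph, but the constraint principle is the same.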

Audio (16kHz) -> Mel Spectrogram -> Whisper Encoder (512-dim)
    -> CTC Head (512 -> 3000) -> Log-Softmax
    -> WFST Beam Search -> Verse ID

Building the pieces

Tokeniser

I trained a SentencePiece Unigram tokeniser on all 6,236 verses with a vocab of 3,000 tokens. Unigram over BPE because Arabic is morphologically rich - you want subword units that respect the structure of the language rather than just frequent byte pairs.

100% round-trip accuracy on all 6,236 verses. I set normalization_rule_name="identity" to preserve the Uthmani script exactly. Quran text is not the place to get creative with normalisation.

Average tokens per verse: 32.7. Max: 382 (Al-Baqarah 2:282, the longest verse in the Quran).

The decoding graph

Getting the graph built took a couple of attempts:

I originally tried k2 for the FST operations, but its C extension wouldn't load on Python 3.13 on macOS ARM. kaldifst (a lighter OpenFst wrapper) worked straight away.

For CTC with SentencePiece tokens, the lexicon is implicit - each word's token sequence is deterministic from the tokeniser. So the grammar FST already operates at token level. Traditional L∘G composition would be needed if I had a separate phone/grapheme level, but SentencePiece handles word-to-token mapping directly. Simpler than expected.
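A minimal sketch of what "implicit lexicon" means in practice, with a stub `encode` standing in for the real SentencePiece tokeniser: each verse maps deterministically to one token sequence, so paths go straight into a token-level grammar, and the final state carries the verse ID - no separate L FST to compose.

```python
def encode(verse):
    """Hypothetical tokeniser stub; the real system calls sp.encode()."""
    return verse.split()

def build_grammar(verses):
    """Trie over token sequences; a final marker stores the verse ID."""
    root = {}
    for verse_id, text in verses.items():
        node = root
        for tok in encode(text):
            node = node.setdefault(tok, {})
        node["<final>"] = verse_id   # complete-verse state -> verse ID
    return root

# Two placeholder "verses" keyed by surah:ayah.
grammar = build_grammar({"1:1": "bis mi llah", "112:1": "qul hu wa"})
print(grammar["bis"]["mi"]["llah"]["<final>"])  # -> 1:1
```

In the real graph this is an OpenFst transducer with weights, determinisation, and epsilon arcs, but the structural point is the same: tokens in, verse IDs out, nothing in between.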

Training data

I needed ayah-level audio from multiple reciters. Tarteel's own CDN returns 403s. qurancdn.com works for some reciters but with limited coverage. alquran.cloud was the reliable one - a public API, no auth, 27+ reciters.

Downloaded 68,500 MP3 files across 11 reciters, about 18 GB total. The split is by reciter, not by verse - I want the model to generalise across voices, not memorise how Alafasy sounds on verse 2:255.
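The split logic itself is simple - partition by reciter so no voice appears in more than one split. A sketch with placeholder reciter names (not the actual train/val/test assignment):

```python
from collections import defaultdict

def split_by_reciter(entries, val_reciters, test_reciters):
    """Partition (reciter, verse_key, path) records by reciter name."""
    splits = defaultdict(list)
    for reciter, verse_key, path in entries:
        if reciter in test_reciters:
            splits["test"].append((reciter, verse_key, path))
        elif reciter in val_reciters:
            splits["val"].append((reciter, verse_key, path))
        else:
            splits["train"].append((reciter, verse_key, path))
    return splits

# The same verse appears in every split - only the voice differs.
entries = [
    ("alafasy", "2:255", "alafasy/2_255.mp3"),
    ("sudais",  "2:255", "sudais/2_255.mp3"),
    ("walk",    "2:255", "walk/2_255.mp3"),
]
splits = split_by_reciter(entries, val_reciters={"sudais"},
                          test_reciters={"walk"})
print({k: len(v) for k, v in splits.items()})
```

Splitting by verse instead would leak every verse's text into training, and the model could score well by memorising per-verse acoustics of specific reciters.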

Split  Entries  Reciters
Train  49,887   8 Arabic reciters
Val    12,464   Sudais, Husary Mujawwad
Test   6,236    Ibrahim Walk (English)

Yes, the test set is an English reciter. Deliberately harsh - if the model can handle a completely different language rendering of the same verses, it's actually learning the content and not just the accent.

Training

I'm using tarteel-ai/whisper-base-ar-quran as the base - a Whisper model already fine-tuned on Quran recitation by Tarteel. I stripped the decoder, kept only the encoder (512-dim), and bolted a linear CTC head onto it (512 -> 3000, matching the tokeniser vocab).
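Roughly what the model surgery looks like. A stub module stands in for the Whisper encoder so this sketch runs without downloading weights - in the real system the encoder comes from loading tarteel-ai/whisper-base-ar-quran with Hugging Face transformers and keeping only its encoder:

```python
import torch
import torch.nn as nn

class EncoderCTC(nn.Module):
    """Frozen-or-not encoder + linear CTC head over the token vocab."""
    def __init__(self, encoder, d_model=512, vocab_size=3000):
        super().__init__()
        self.encoder = encoder
        self.ctc_head = nn.Linear(d_model, vocab_size)

    def forward(self, feats):
        hidden = self.encoder(feats)              # (batch, frames, 512)
        return self.ctc_head(hidden).log_softmax(dim=-1)

# Stand-in for the Whisper encoder: maps 80 mel bins to 512 dims.
stub_encoder = nn.Linear(80, 512)
model = EncoderCTC(stub_encoder)
log_probs = model(torch.randn(2, 100, 80))        # 2 clips, 100 frames
print(log_probs.shape)                            # torch.Size([2, 100, 3000])
```

The log-softmax output is exactly what both `nn.CTCLoss` (after transposing to time-major) and the WFST beam search consume.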

Three-phase training schedule:

Phase  What's trained                   Learning rate  Epochs
A      CTC head only (encoder frozen)   1e-3           5
B      Top 4 encoder layers + head      1e-5           10
C      Everything                       5e-6           5

Currently in Phase A. One epoch takes about 83 minutes on MPS (Apple Silicon). CTC loss dropped from 605 (random init, as expected) down to around 3.0 by end of epoch 1. Val loss: 3.79.
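The phase switches boil down to toggling requires_grad. A sketch with stub layers standing in for the encoder stack - Whisper's encoder exposes its transformer blocks as an ordered layer list, which is what makes "top 4 layers" easy to address:

```python
import torch.nn as nn

# Stand-ins: 6 "encoder layers" and a CTC head.
encoder = nn.ModuleList([nn.Linear(512, 512) for _ in range(6)])
head = nn.Linear(512, 3000)  # always trained, in every phase

def set_phase(phase):
    """A: head only. B: top 4 encoder layers + head. C: everything."""
    for p in encoder.parameters():
        p.requires_grad = False
    if phase == "B":
        for layer in encoder[-4:]:
            for p in layer.parameters():
                p.requires_grad = True
    elif phase == "C":
        for p in encoder.parameters():
            p.requires_grad = True

set_phase("A")
trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
print(trainable)  # -> 0: the whole encoder is frozen in phase A
```

The optimiser should be rebuilt (or given fresh param groups) at each phase boundary so the new learning rate only applies to the newly unfrozen parameters.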

One annoyance - nn.CTCLoss doesn't work on MPS. The workaround is running the forward pass on GPU, then moving log-probs to CPU for the loss computation. Gradients still flow through the .cpu() call (I verified), but it adds about 10% overhead. Better than running the whole encoder on CPU though.
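A minimal, runnable version of the workaround, assuming blank index 0 (shapes follow CTC's time-major (frames, batch, vocab) convention; on a machine without MPS it just runs everything on CPU):

```python
import torch
import torch.nn as nn

device = "mps" if torch.backends.mps.is_available() else "cpu"

# Forward pass stays on the accelerator.
logits = torch.randn(50, 2, 3000, device=device, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)      # (frames, batch, vocab)

targets = torch.randint(1, 3000, (2, 20))   # token IDs, blank=0 excluded
input_lengths = torch.full((2,), 50)
target_lengths = torch.full((2,), 20)

# Loss on CPU - nn.CTCLoss has no MPS kernel. The .cpu() call is a
# differentiable op, so backward() still reaches the device tensor.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs.cpu(), targets, input_lengths, target_lengths)
loss.backward()
print(logits.grad is not None)  # -> True
```

Only the (frames, batch, vocab) log-prob tensor crosses the device boundary each step, which is where the roughly 10% overhead comes from.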

Where it's at

After one epoch with just the CTC head trained:

Metric              Value
Correct verses      4/50 (8%)
Empty predictions   33/50
Wrong predictions   13/50
Avg inference time  0.16 s/sample (CPU)

8% from a frozen encoder and one epoch of head training. Not useful yet, but the trajectory is right - it was 0% with a random head. The 33 empty predictions are cases where beam search can't complete a valid verse path because the CTC output is still too noisy. That should improve as training continues and especially after unfreezing the encoder in Phase B.

The four correct predictions came from four different reciters (verses 2:145, 5:54, 39:52, 25:68), which is a good sign - no single-reciter bias.

Inference is fast: 0.16s per sample on CPU. The target is under 1 second, so there's plenty of headroom even before optimisation.

Why I think this gets to 95%+

Yazin's 67% with the Tarteel model was using transcribe-then-match. Most errors were partial - close enough that the WFST graph should resolve them. The encoder already understands Quranic Arabic; it just needs the CTC head trained to produce useful frame-level outputs, and then the WFST handles disambiguation.

The Whisper encoder was fine-tuned by Tarteel for seq2seq transcription, not CTC, so the frame-level representations probably need some adjustment. That's what Phase B and C are for - unfreezing encoder layers should make a big difference.

The full decoding graph (7.9 MB) plus the quantised model should fit comfortably under 200 MB, which was the whole point - something that can run on a phone or cheap device without connectivity.

What's next

Finish the three-phase training. If it hits >85% SeqAcc, I'll move to quantisation (INT8, maybe INT4) and ONNX export for on-device deployment via sherpa-onnx. If the Tarteel encoder isn't enough, I have a fallback path to Whisper-small or training a Zipformer from scratch using k2/icefall.

I'll post results once I have numbers from the full training run. The QUL dataset (62 reciters with word-level timestamps) is also sitting there waiting to be used for a more ambitious fine-tune if needed - that's potentially 4.8 million word-level audio-text pairs across dozens of speakers.

All code is in quran-asr if you want to follow along.