The Problem
When I was building a word-by-word Quran reading app with synchronized audio highlighting, the original segment timings were all over the place. Some gaps between words stretched 200-800ms where the highlight would just disappear. The last word of each ayah was worse: 3-6 seconds of duration that included trailing silence. And the timing offsets were so inconsistent across ayahs that I couldn't just apply a simple global correction.
Example: Original Timings for Surah Al-Fatiha, Ayah 7
Word 0 (ṣirāṭa): [ 0- 640ms] 640ms
Word 1 (alladhīna): [ 880-1280ms] 400ms ← 240ms gap
Word 2 (anʿamta): [1680-2480ms] 800ms ← 400ms gap
Word 3 (ʿalayhim): [2880-3280ms] 400ms ← 400ms gap
Word 4 (ghayri): [3480-3600ms] 120ms ← 200ms gap
Word 5 (l-maghḍūbi): [4360-5200ms] 840ms ← 760ms gap
Word 6 (ʿalayhim): [5600-5960ms] 360ms ← 400ms gap
Word 7 (walā): [6250-6350ms] 100ms ← 290ms gap
Word 8 (l-ḍālīna): [6720-13280ms] 6560ms ← 370ms gap, last word too long!
Total gaps: 8, ranging from 200ms to 760ms
Total duration: 13,187ms
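As a sanity check, the eight gaps called out above can be recomputed straight from the segment boundaries (timings copied from the table):

```python
# Original timings for Surah Al-Fatiha, Ayah 7, as (start_ms, end_ms),
# copied from the table above.
segments = [
    (0, 640), (880, 1280), (1680, 2480), (2880, 3280), (3480, 3600),
    (4360, 5200), (5600, 5960), (6250, 6350), (6720, 13280),
]

# A gap is the silence between one word's end and the next word's start.
gaps = [nxt_start - cur_end
        for (_, cur_end), (nxt_start, _) in zip(segments, segments[1:])]

print(len(gaps), min(gaps), max(gaps))  # → 8 200 760
```

Every word boundary contributes a gap here, which is exactly why the highlight kept blinking out between words.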
Surah Al-Fatiha had 21 gaps total. The highlighting wouldn't flow smoothly from one word to the next.
The Solution: WhisperX with Forced Alignment
Rather than manually tweaking timings or trying to fix offsets algorithmically, I used WhisperX, a speech recognition pipeline that layers forced alignment on top of Whisper.
Why WhisperX?
Forced alignment with Wav2Vec2. After transcription, WhisperX uses a wav2vec2 model to align each detected word to its precise audio timestamp.
Word-level accuracy. Standard Whisper gives segment-level timestamps, but WhisperX provides precise word boundaries.
Language support. Works well with Arabic Quranic recitation.
Consistent results. Produces continuous word timings without artificial gaps.
Key Design Decision: Trusting Timings, Not Transcription
I do NOT use the transcribed text from WhisperX. I only use the timestamp data.
Here's why: Quran text is sacred and must be 100% accurate. Whisper occasionally mistranscribes Arabic words, so I rely on the correct word count and text already in the database to realign WhisperX's word boundaries to match the expected count.
The process:
- Run WhisperX on the audio file
- Get word-level timestamps (start_ms, end_ms for each detected word)
- If word count matches expected, use timings directly
- If word count differs, realign by merging/splitting segments
- Update only the timestamp columns in the database
Technical Implementation
Database Schema
I used the audio_segments table in quran.db:
CREATE TABLE audio_segments (
id INTEGER PRIMARY KEY AUTOINCREMENT,
surah_id INTEGER NOT NULL,
ayah_number INTEGER NOT NULL,
word_number INTEGER NOT NULL,
audio_edition_id INTEGER NOT NULL,
start_ms INTEGER NOT NULL,
end_ms INTEGER NOT NULL,
timestamp_from INTEGER,
UNIQUE(surah_id, ayah_number, word_number, audio_edition_id)
);
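Updating "only the timestamp columns" (step 5 of the process above) comes down to one parameterized UPDATE per word against this table. A minimal sketch, with a helper name of my own; the 1-based word_number matches the progress query shown later in this post:

```python
import sqlite3
from typing import List, Tuple

def update_timings(conn: sqlite3.Connection, surah_id: int, ayah_number: int,
                   audio_edition_id: int,
                   segments: List[Tuple[int, int]]) -> None:
    """Write new (start_ms, end_ms) pairs for one ayah, touching nothing else."""
    # word_number assumed 1-based, matching the WHERE word_number = 1
    # progress check used elsewhere in this post.
    for word_number, (start_ms, end_ms) in enumerate(segments, start=1):
        conn.execute(
            "UPDATE audio_segments SET start_ms = ?, end_ms = ? "
            "WHERE surah_id = ? AND ayah_number = ? "
            "AND audio_edition_id = ? AND word_number = ?",
            (start_ms, end_ms, surah_id, ayah_number,
             audio_edition_id, word_number),
        )
    conn.commit()
```

Because the WHERE clause pins all four key columns, the sacred text columns are never written, only the timings.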
The Fixing Script
# Key components:
# 1. WhisperX for transcription + alignment
# 2. Word count validation against expected
# 3. Merge/split logic for word count mismatches
# 4. Database update with new timings
import whisperx
from typing import List

device = "cuda"  # "cpu" works too, just slower
model = whisperx.load_model("medium", device, language="ar")
align_model, align_metadata = whisperx.load_align_model(
    language_code="ar", device=device
)

def process_audio(audio_path: str, expected_words: int) -> List[WordSegment]:
    # Load audio
    audio = whisperx.load_audio(audio_path)

    # Transcribe with Whisper
    result = model.transcribe(audio, batch_size=16, language="ar")

    # Align with wav2vec2 for word-level timestamps
    result = whisperx.align(
        result["segments"],
        align_model,
        align_metadata,
        audio,
        device,
        return_char_alignments=False,
    )

    # Extract word segments (WordSegment, extract_words, and
    # realign_segments are defined elsewhere in the script)
    segments = extract_words(result)

    # Realign if word count doesn't match
    if len(segments) != expected_words:
        segments = realign_segments(segments, expected_words)

    return segments
Handling Word Count Mismatches
When WhisperX detects a different number of words than expected:
Too many words (e.g., WhisperX detected 5, expected 4): Merge adjacent segments with smallest gap between them. Preserve total time span, just consolidate boundaries.
Too few words (e.g., WhisperX detected 3, expected 4): Split longest segments evenly. Distribute the timestamp range.
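The merge/split rules above can be sketched in a few lines. This is an illustrative standalone version working on bare (start_ms, end_ms) tuples; the real script's realign_segments operates on richer word objects:

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms)

def realign_segments(segs: List[Segment], expected: int) -> List[Segment]:
    segs = list(segs)
    # Too many words: merge the adjacent pair with the smallest gap,
    # keeping the combined span's outer boundaries.
    while len(segs) > expected:
        i = min(range(len(segs) - 1),
                key=lambda k: segs[k + 1][0] - segs[k][1])
        segs[i:i + 2] = [(segs[i][0], segs[i + 1][1])]
    # Too few words: split the longest segment at its midpoint,
    # distributing its timestamp range.
    while len(segs) < expected:
        i = max(range(len(segs)), key=lambda k: segs[k][1] - segs[k][0])
        start, end = segs[i]
        mid = (start + end) // 2
        segs[i:i + 1] = [(start, mid), (mid, end)]
    return segs
```

For example, merging `[(0, 100), (110, 200), (900, 1000)]` down to two words consolidates the first pair (10ms gap) rather than bridging the 700ms gap, giving `[(0, 200), (900, 1000)]`.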
Results
Before vs After: Surah Al-Fatiha, Ayah 7
Before (Original):
Gaps: 8 (ranging 200-760ms)
Total duration: 13,187ms
Last word duration: 6,560ms (50% of total!)
After (WhisperX):
Gaps: 3 (ranging 120-161ms)
Total duration: 8,978ms
Last word duration: 1,906ms (21% of total)
Summary Statistics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total gaps (Surah 1) | 21 | 5 | 76% reduction |
| Average gap size | 300ms | 120ms | 60% smaller |
| Max last word duration | 6,560ms | 3,500ms | 47% shorter |
Usage
Process All Surahs
cd /Users/bilawalriaz/coding/islam-llm/quran-dump
# Run the fixing script
python3 fix_all_segments.py --model medium
# Check progress
sqlite3 quran.db "SELECT COUNT(*) FROM (
SELECT DISTINCT surah_id, ayah_number
FROM audio_segments
WHERE audio_edition_id = 29
AND word_number = 1
AND start_ms < 200
);"
Model Selection
I used the medium model for the best balance of speed and accuracy when processing all 6,236 ayahs. I'm not looking for accurate word transcriptions here, just accurate word timings.
Files Created
fix_all_segments.py - Main processing script
whisper_segment_fixer.py - Alternative script with more options
segment-demo.html - Visual comparison demo page
Backup Safety
The script automatically creates a backup table before making changes:
CREATE TABLE audio_segments_whisperx_backup AS
SELECT * FROM audio_segments;
To restore if needed:
DELETE FROM audio_segments WHERE audio_edition_id = 29;
INSERT INTO audio_segments SELECT * FROM audio_segments_whisperx_backup;
Conclusion
WhisperX's forced alignment cut gaps between word highlights by 76%. The reading experience is smoother now, with continuous highlighting and word boundaries that match actual speech patterns. The whole pipeline runs unattended across all 6,236 ayahs of the Quran.
The real insight here: I can use speech recognition models for timing accuracy without risking transcription errors in sacred text. Take the timestamps, apply them to verified word data, and you're done.