The Problem
When I was building a word-by-word Quran reading app with synchronized audio highlighting, the original segment timings were all over the place. Some gaps between words stretched 200-800ms where the highlight would just disappear. The last word of each ayah was worse: 3-6 seconds of duration that included trailing silence. And the timing offsets were so inconsistent across ayahs that I couldn't just apply a simple global correction.
Example: Original Timings for Surah Al-Fatiha, Ayah 7
Word 0 (ṣirāṭa): [ 0- 640ms] 640ms
Word 1 (alladhīna): [ 880-1280ms] 400ms ← 240ms gap
Word 2 (anʿamta): [1680-2480ms] 800ms ← 400ms gap
Word 3 (ʿalayhim): [2880-3280ms] 400ms ← 400ms gap
Word 4 (ghayri): [3480-3600ms] 120ms ← 200ms gap
Word 5 (l-maghḍūbi): [4360-5200ms] 840ms ← 760ms gap
Word 6 (ʿalayhim): [5600-5960ms] 360ms ← 400ms gap
Word 7 (walā): [6250-6350ms] 100ms ← 290ms gap
Word 8 (l-ḍālīna): [6720-13280ms] 6560ms ← 370ms gap, last word too long!
Total gaps: 8, ranging from 200ms to 760ms
Total duration: 13,187ms
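As a sanity check, the eight gaps called out above can be recomputed straight from the segment boundaries (timings copied from the table):

```python
# Original timings for Surah Al-Fatiha, Ayah 7, as (start_ms, end_ms),
# copied from the table above.
segments = [
    (0, 640), (880, 1280), (1680, 2480), (2880, 3280), (3480, 3600),
    (4360, 5200), (5600, 5960), (6250, 6350), (6720, 13280),
]

# A gap is the silence between one word's end and the next word's start.
gaps = [nxt_start - cur_end
        for (_, cur_end), (nxt_start, _) in zip(segments, segments[1:])]

print(len(gaps), min(gaps), max(gaps))  # → 8 200 760
```

Every word boundary contributes a gap here, which is exactly why the highlight kept blinking out between words.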
Surah Al-Fatiha had 21 gaps total. The highlighting wouldn't flow smoothly from one word to the next.
The Solution: WhisperX with Forced Alignment
Rather than manually tweaking timings or trying to fix offsets algorithmically, I used WhisperX, a speech recognition pipeline that layers forced alignment on top of Whisper.
Why WhisperX?
Forced alignment with Wav2Vec2. After transcription, WhisperX uses a wav2vec2 model to align each detected word to its precise audio timestamp.
Word-level accuracy. Standard Whisper gives segment-level timestamps, but WhisperX provides precise word boundaries.
Language support. Works well with Arabic Quranic recitation.
Consistent results. Produces continuous word timings without artificial gaps.
Key Design Decision: Trusting Timings, Not Transcription
I do NOT use the transcribed text from WhisperX. I only use the timestamp data.
Here's why: Quran text is sacred and must be 100% accurate. Whisper occasionally mistranscribes Arabic words, so I rely on the correct word count and text already in the database to realign WhisperX's word boundaries to match the expected count.
The process:
- Run WhisperX on the audio file
- Get word-level timestamps (start_ms, end_ms for each detected word)
- If word count matches expected, use timings directly
- If word count differs, realign by merging/splitting segments
- Update only the timestamp columns in the database
Technical Implementation
Database Schema
I used the audio_segments table in quran.db:
CREATE TABLE audio_segments (
id INTEGER PRIMARY KEY AUTOINCREMENT,
surah_id INTEGER NOT NULL,
ayah_number INTEGER NOT NULL,
word_number INTEGER NOT NULL,
audio_edition_id INTEGER NOT NULL,
start_ms INTEGER NOT NULL,
end_ms INTEGER NOT NULL,
timestamp_from INTEGER,
UNIQUE(surah_id, ayah_number, word_number, audio_edition_id)
);
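Updating "only the timestamp columns" (step 5 of the process above) comes down to one parameterized UPDATE per word against this table. A minimal sketch, with a helper name of my own; the 1-based word_number matches the progress query shown later in this post:

```python
import sqlite3
from typing import List, Tuple

def update_timings(conn: sqlite3.Connection, surah_id: int, ayah_number: int,
                   audio_edition_id: int,
                   segments: List[Tuple[int, int]]) -> None:
    """Write new (start_ms, end_ms) pairs for one ayah, touching nothing else."""
    # word_number assumed 1-based, matching the WHERE word_number = 1
    # progress check used elsewhere in this post.
    for word_number, (start_ms, end_ms) in enumerate(segments, start=1):
        conn.execute(
            "UPDATE audio_segments SET start_ms = ?, end_ms = ? "
            "WHERE surah_id = ? AND ayah_number = ? "
            "AND audio_edition_id = ? AND word_number = ?",
            (start_ms, end_ms, surah_id, ayah_number,
             audio_edition_id, word_number),
        )
    conn.commit()
```

Because the WHERE clause pins all four key columns, the sacred text columns are never written, only the timings.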
The Fixing Script
# Key components:
# 1. WhisperX for transcription + alignment
# 2. Word count validation against expected
# 3. Merge/split logic for word count mismatches
# 4. Database update with new timings
import whisperx
from typing import List

device = "cuda"  # "cpu" works too, just slower
model = whisperx.load_model("medium", device, language="ar")
align_model, align_metadata = whisperx.load_align_model(
    language_code="ar", device=device
)

def process_audio(audio_path: str, expected_words: int) -> List[WordSegment]:
    # Load audio
    audio = whisperx.load_audio(audio_path)

    # Transcribe with Whisper
    result = model.transcribe(audio, batch_size=16, language="ar")

    # Align with wav2vec2 for word-level timestamps
    result = whisperx.align(
        result["segments"],
        align_model,
        align_metadata,
        audio,
        device,
        return_char_alignments=False,
    )

    # Extract word segments (WordSegment, extract_words, and
    # realign_segments are defined elsewhere in the script)
    segments = extract_words(result)

    # Realign if word count doesn't match
    if len(segments) != expected_words:
        segments = realign_segments(segments, expected_words)

    return segments
Handling Word Count Mismatches
When WhisperX detects a different number of words than expected:
Too many words (e.g., WhisperX detected 5, expected 4): Merge adjacent segments with smallest gap between them. Preserve total time span, just consolidate boundaries.
Too few words (e.g., WhisperX detected 3, expected 4): Split longest segments evenly. Distribute the timestamp range.
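The merge/split rules above can be sketched in a few lines. This is an illustrative standalone version working on bare (start_ms, end_ms) tuples; the real script's realign_segments operates on richer word objects:

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms)

def realign_segments(segs: List[Segment], expected: int) -> List[Segment]:
    segs = list(segs)
    # Too many words: merge the adjacent pair with the smallest gap,
    # keeping the combined span's outer boundaries.
    while len(segs) > expected:
        i = min(range(len(segs) - 1),
                key=lambda k: segs[k + 1][0] - segs[k][1])
        segs[i:i + 2] = [(segs[i][0], segs[i + 1][1])]
    # Too few words: split the longest segment at its midpoint,
    # distributing its timestamp range.
    while len(segs) < expected:
        i = max(range(len(segs)), key=lambda k: segs[k][1] - segs[k][0])
        start, end = segs[i]
        mid = (start + end) // 2
        segs[i:i + 1] = [(start, mid), (mid, end)]
    return segs
```

For example, merging `[(0, 100), (110, 200), (900, 1000)]` down to two words consolidates the first pair (10ms gap) rather than bridging the 700ms gap, giving `[(0, 200), (900, 1000)]`.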
Results
Before vs After: Surah Al-Fatiha, Ayah 7
Before (Original):
Gaps: 8 (ranging 200-760ms)
Total duration: 13,187ms
Last word duration: 6,560ms (50% of total!)
After (WhisperX):
Gaps: 3 (ranging 120-161ms)
Total duration: 8,978ms
Last word duration: 1,906ms (21% of total)
Summary Statistics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total gaps (Surah 1) | 21 | 5 | 76% reduction |
| Average gap size | 300ms | 120ms | 60% smaller |
| Max last word duration | 6,560ms | 3,500ms | 47% shorter |
Usage
Process All Surahs
cd /Users/bilawalriaz/coding/islam-llm/quran-dump
# Run the fixing script
python3 fix_all_segments.py --model medium
# Check progress
sqlite3 quran.db "SELECT COUNT(*) FROM (
SELECT DISTINCT surah_id, ayah_number
FROM audio_segments
WHERE audio_edition_id = 29
AND word_number = 1
AND start_ms < 200
);"
Model Selection
I used the medium model for the best balance of speed and accuracy when processing all 6,236 ayahs. I'm not looking for accurate word transcriptions here, just accurate word timings.
Files Created
fix_all_segments.py - Main processing script
whisper_segment_fixer.py - Alternative script with more options
segment-demo.html - Visual comparison demo page
Backup Safety
The script automatically creates a backup table before making changes:
CREATE TABLE audio_segments_whisperx_backup AS
SELECT * FROM audio_segments;
To restore if needed:
DELETE FROM audio_segments WHERE audio_edition_id = 29;
INSERT INTO audio_segments SELECT * FROM audio_segments_whisperx_backup;
Conclusion
WhisperX's forced alignment cut gaps between word highlights by 76%. The reading experience is smoother now, with continuous highlighting and word boundaries that match actual speech patterns. The whole pipeline runs unattended across all 6,236 ayahs of the Quran.
The real insight here: I can use speech recognition models for timing accuracy without risking transcription errors in sacred text. Take the timestamps, apply them to verified word data, and you're done.