The Challenge of Arabic ASR
Automatic Speech Recognition (ASR) for Arabic is notoriously difficult. In English, homographs like "read" (present) and "read" (past) are the exception; in Arabic they are the rule, because the script omits most short vowels and relies on diacritics (Tashkeel) to define pronunciation and meaning.
A model that ignores these nuances fails to capture the true semantic meaning of the spoken word.
The Contenders
In this benchmark, we compared two models fine-tuned from OpenAI's Whisper:
- Nahw.ai Model: Trained on 1,000+ audio snippets recorded by diverse participants. Crucially, our transcripts are pronunciation-based.
- Example: If a speaker says فَعَلَ (fa'ala) vs. فَعِلَ (fa'ila), we transcribe exactly what was said, ensuring the model learns the acoustic-to-diacritic mapping.
- Tarteel Model: A robust model trained primarily on the Quran. While excellent for recitation, we wanted to see how it fares on general storytelling.
The Test Data
We used a short story, cut into snippets and manually transcribed with full diacritics to serve as our Ground Truth.
Evaluation Methodology
We ran a direct comparison using a custom Python evaluation script. Unlike simple string matching, our improved evaluation logic uses Levenshtein distance to account for insertions, deletions, and substitutions. This prevents a single missing character from "shifting" the entire comparison and unfairly penalizing the rest of the sentence.
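To make the "shifting" problem concrete, here is a minimal sketch (the strings are hypothetical) contrasting naive positional matching with the Levenshtein-based CER that jiwer computes:

```python
import jiwer

truth = "abcdef"
pred = "bcdef"  # the first character was dropped

# Naive positional matching: the single deletion misaligns every position.
positional = sum(t == p for t, p in zip(truth, pred)) / len(truth)
print(positional)  # 0.0

# Levenshtein-based CER counts it as just one deletion out of six characters.
print(jiwer.cer(truth, pred))  # ~0.167
```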
The script calculates three key metrics:
1. Character Accuracy
Instead of simple position checking, we use the Character Error Rate (CER): the minimum number of character edits needed to turn the prediction into the ground truth, normalized by the length of the ground truth, computed with jiwer.cer.
2. Word Accuracy
We use Word Error Rate (WER), computed with jiwer.wer, to align the word sequences and count a word as correct only when it matches exactly, including all diacritics.
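In code, both metrics are one call each. Assuming the reported accuracy figures are the complement of the error rate (1 − CER, 1 − WER), a hypothetical single-diacritic error shows why word accuracy is the harshest metric:

```python
import jiwer

# Hypothetical pair: identical words except one diacritic on the first word.
ground_truth = "ذَهَبَ الوَلَدُ إِلَى المَدْرَسَةِ"
prediction = "ذَهَبِ الوَلَدُ إِلَى المَدْرَسَةِ"

word_accuracy = 1 - jiwer.wer(ground_truth, prediction)
char_accuracy = 1 - jiwer.cer(ground_truth, prediction)

print(word_accuracy)  # 0.75 -- one wrong mark invalidates the whole word
print(char_accuracy)  # ~0.97 -- only one character in the string is wrong
```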
3. Tashkeel (Diacritic) Accuracy (Alignment-Aware)
This is the most complex metric. Naively extracting the diacritics and comparing them position by position fails whenever the underlying text doesn't match (e.g., a missing letter shifts every diacritic that follows it).
Our improved method first aligns the base characters (without diacritics) using a sequence matcher. Once the "skeleton" of the text is aligned, we compare the diacritics only on the matching characters. The listing below is a self-contained version of this logic; the remove_tashkeel and tashkeel_map helpers are minimal illustrative implementations of the utilities our script uses.
```python
import re
from difflib import SequenceMatcher

# Common Arabic diacritics: fathatan (U+064B) through sukun (U+0652)
TASHKEEL = re.compile(r"[\u064B-\u0652]")

def remove_tashkeel(text: str) -> str:
    return TASHKEEL.sub("", text)

def tashkeel_map(text: str) -> list:
    """Diacritics attached to each base character, in order."""
    marks = []
    for ch in text:
        if TASHKEEL.match(ch):
            if marks:
                marks[-1] += ch  # attach mark to the preceding base character
        else:
            marks.append("")
    return marks

def calculate_tashkeel_accuracy(predicted: str, ground_truth: str) -> float:
    # 1. Strip diacritics to align the "skeleton" of the text
    base_gt, base_pred = remove_tashkeel(ground_truth), remove_tashkeel(predicted)
    marks_gt, marks_pred = tashkeel_map(ground_truth), tashkeel_map(predicted)
    # 2. Align base characters
    matcher = SequenceMatcher(None, base_gt, base_pred)
    aligned_matches = total_aligned = 0
    # 3. Check diacritics only on aligned characters
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                aligned_matches += marks_gt[i1 + k] == marks_pred[j1 + k]
                total_aligned += 1
    return aligned_matches / total_aligned if total_aligned else 0.0
```
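A quick sanity check with the فَعَلَ / فَعِلَ pair from earlier: the skeletons align perfectly, and two of the three letters carry identical marks.

```python
score = calculate_tashkeel_accuracy(predicted="فَعِلَ", ground_truth="فَعَلَ")
print(score)  # 0.666... (2 of 3 aligned base letters have matching diacritics)
```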
The Results
The results were clear: Nahw.ai outperformed the Tarteel model across every single metric.
| Metric | Nahw.ai | Tarteel | Relative Improvement |
|---|---|---|---|
| Character Accuracy | 77.49% | 68.10% | ▲ 13.79% |
| Word Accuracy | 21.08% | 9.21% | ▲ 128.88% |
| Tashkeel Accuracy | 85.97% | 75.31% | ▲ 14.15% |
Why the huge gap in Word Accuracy?
The most telling statistic is Word Accuracy, where Nahw.ai scored 21.08% vs. Tarteel's 9.21%, more than double the performance.
This massive difference comes down to training data distribution. Tarteel is optimized for Quranic recitation—a very specific, formal, and slow style of speech. When faced with a natural storytelling pace and general vocabulary, it struggles to predict the correct diacritized form.
Nahw.ai, being trained on diverse, human-verified voice snippets, generalizes far better to new contexts.
Conclusion
Data quality is not just about volume; it's about relevance and precision. By focusing on accurate, pronunciation-based labeling, we were able to fine-tune a Whisper model that significantly outperforms domain-specific alternatives for general Arabic tasks.
If you are building Arabic AI, your labels matter more than your architecture.