The Challenge of Arabic ASR
Automatic Speech Recognition (ASR) for Arabic is notoriously difficult. In English, homographs like "read" (present) and "read" (past) are the exception; in Arabic they are the rule, because the script omits most short vowels and relies on diacritics (Tashkeel) to define pronunciation and meaning.
A model that ignores these nuances fails to capture the true semantic meaning of the spoken word.
The Contenders
In this benchmark, we compared two models fine-tuned from OpenAI's Whisper:
- Nahw.ai Model: Trained on 1,000+ audio snippets recorded by diverse participants. Crucially, our transcripts are pronunciation-based.
- Example: If a speaker says فَعَلَ (fa'ala) vs. فَعِلَ (fa'ila), we transcribe exactly what was said, ensuring the model learns the acoustic-to-diacritic mapping.
- Tarteel Model: A robust model trained primarily on the Quran. While excellent for recitation, we wanted to see how it fares on general storytelling.
The Test Data
We used a short story, cut into snippets and manually transcribed with full diacritics to serve as our Ground Truth.
Evaluation Methodology
We ran a direct comparison using a custom Python evaluation script. Unlike simple string matching, our improved evaluation logic uses Levenshtein distance to account for insertions, deletions, and substitutions. This prevents a single missing character from "shifting" the entire comparison and unfairly penalizing the rest of the sentence.
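To make the "shifting" problem concrete, here is a minimal sketch (the strings are hypothetical) contrasting naive positional matching with the Levenshtein-based CER that jiwer computes:

```python
import jiwer

truth = "abcdef"
pred = "bcdef"  # the first character was dropped

# Naive positional matching: the single deletion misaligns every position.
positional = sum(t == p for t, p in zip(truth, pred)) / len(truth)
print(positional)  # 0.0

# Levenshtein-based CER counts it as just one deletion out of six characters.
print(jiwer.cer(truth, pred))  # ~0.167
```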
The script calculates three key metrics:
1. Character Accuracy
Instead of simple position checking, we use the Character Error Rate (CER): the minimum number of character edits needed to turn the prediction into the ground truth, normalized by the length of the ground truth, computed with jiwer.cer.
2. Word Accuracy
We use Word Error Rate (WER), computed with jiwer.wer, to align the word sequences and count a word as correct only when it matches exactly, including all diacritics.
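In code, both metrics are one call each. Assuming the reported accuracy figures are the complement of the error rate (1 − CER, 1 − WER), a hypothetical single-diacritic error shows why word accuracy is the harshest metric:

```python
import jiwer

# Hypothetical pair: identical words except one diacritic on the first word.
ground_truth = "ذَهَبَ الوَلَدُ إِلَى المَدْرَسَةِ"
prediction = "ذَهَبِ الوَلَدُ إِلَى المَدْرَسَةِ"

word_accuracy = 1 - jiwer.wer(ground_truth, prediction)
char_accuracy = 1 - jiwer.cer(ground_truth, prediction)

print(word_accuracy)  # 0.75 -- one wrong mark invalidates the whole word
print(char_accuracy)  # ~0.97 -- only one character in the string is wrong
```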
3. Tashkeel (Diacritic) Accuracy (Alignment-Aware)
This is the most complex metric. Naively extracting the diacritics and comparing them position by position fails whenever the underlying text doesn't match (e.g., a missing letter shifts every diacritic that follows it).
Our improved method first aligns the base characters (without diacritics) using a sequence matcher. Once the "skeleton" of the text is aligned, we compare the diacritics only on the matching characters. The listing below is a self-contained version of this logic; the remove_tashkeel and tashkeel_map helpers are minimal illustrative implementations of the utilities our script uses.
```python
import re
from difflib import SequenceMatcher

# Common Arabic diacritics: fathatan (U+064B) through sukun (U+0652)
TASHKEEL = re.compile(r"[\u064B-\u0652]")

def remove_tashkeel(text: str) -> str:
    return TASHKEEL.sub("", text)

def tashkeel_map(text: str) -> list:
    """Diacritics attached to each base character, in order."""
    marks = []
    for ch in text:
        if TASHKEEL.match(ch):
            if marks:
                marks[-1] += ch  # attach mark to the preceding base character
        else:
            marks.append("")
    return marks

def calculate_tashkeel_accuracy(predicted: str, ground_truth: str) -> float:
    # 1. Strip diacritics to align the "skeleton" of the text
    base_gt, base_pred = remove_tashkeel(ground_truth), remove_tashkeel(predicted)
    marks_gt, marks_pred = tashkeel_map(ground_truth), tashkeel_map(predicted)
    # 2. Align base characters
    matcher = SequenceMatcher(None, base_gt, base_pred)
    aligned_matches = total_aligned = 0
    # 3. Check diacritics only on aligned characters
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                aligned_matches += marks_gt[i1 + k] == marks_pred[j1 + k]
                total_aligned += 1
    return aligned_matches / total_aligned if total_aligned else 0.0
```
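A quick sanity check with the فَعَلَ / فَعِلَ pair from earlier: the skeletons align perfectly, and two of the three letters carry identical marks.

```python
score = calculate_tashkeel_accuracy(predicted="فَعِلَ", ground_truth="فَعَلَ")
print(score)  # 0.666... (2 of 3 aligned base letters have matching diacritics)
```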
The Results
The results were clear: Nahw.ai outperformed the Tarteel model across every single metric.
| Metric | Nahw.ai | Tarteel | Relative Improvement |
|---|---|---|---|
| Character Accuracy | 77.49% | 68.10% | ▲ 13.79% |
| Word Accuracy | 21.08% | 9.21% | ▲ 128.88% |
| Tashkeel Accuracy | 85.97% | 75.31% | ▲ 14.15% |
Why the huge gap in Word Accuracy?
The most telling statistic is Word Accuracy, where Nahw.ai scored 21.08% vs. Tarteel's 9.21%, more than double the performance.
This massive difference comes down to training data distribution. Tarteel is optimized for Quranic recitation—a very specific, formal, and slow style of speech. When faced with a natural storytelling pace and general vocabulary, it struggles to predict the correct diacritized form.
Nahw.ai, being trained on diverse, human-verified voice snippets, generalizes far better to new contexts.
Conclusion
Data quality is not just about volume; it's about relevance and precision. By focusing on accurate, pronunciation-based labeling, we were able to fine-tune a Whisper model that significantly outperforms domain-specific alternatives for general Arabic tasks.
If you are building Arabic AI, your labels matter more than your architecture.