Today we're releasing the Nahw Arabic Tashkeel Speech Dataset, a collection of over 1,000 Arabic speech recordings with fully diacritized transcriptions, available on Hugging Face under CC-BY-4.0.
Most publicly available Arabic ASR datasets strip diacritics (tashkeel) from their transcriptions, discarding information that is critical to pronunciation and meaning. The undiacritized word "كتب" could be read as "kataba" (he wrote), "kutub" (books), or "kutiba" (it was written); only the diacritics tell these readings apart. Our dataset preserves full diacritization, which makes it useful for fine-tuning ASR models that need to produce accurate, vowelized Arabic text.
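To make the ambiguity concrete, here is a minimal Python sketch that separates the base letters of a word from its combining marks using only the standard library. The helper name `split_skeleton_and_marks` and the `readings` dictionary are our own illustrations and are not part of the dataset or any tooling around it.

```python
import unicodedata


def split_skeleton_and_marks(word: str) -> tuple[str, list[str]]:
    """Split a word into its base letters and the Unicode names of its combining marks."""
    skeleton = []
    marks = []
    for ch in word:
        # Arabic tashkeel characters have a non-zero canonical combining class.
        if unicodedata.combining(ch):
            marks.append(unicodedata.name(ch))
        else:
            skeleton.append(ch)
    return "".join(skeleton), marks


# Three vowelized readings of the same three-letter skeleton (illustrative only).
readings = {
    "kataba": "\u0643\u064e\u062a\u064e\u0628\u064e",  # he wrote
    "kutub": "\u0643\u064f\u062a\u064f\u0628",         # books
    "kutiba": "\u0643\u064f\u062a\u0650\u0628\u064e",  # it was written
}

for name, word in readings.items():
    skeleton, marks = split_skeleton_and_marks(word)
    print(name, skeleton, marks)
```

All three readings reduce to the same skeleton, so a transcription without tashkeel cannot tell them apart; the list of marks is the only thing that differs.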
Dataset
The dataset contains 1,000+ recordings of native Arabic speakers reading fully diacritized sentences, reviewed and approved by human annotators.
Columns
- audio: The speech recording, resampled to 16 kHz mono WAV.
- transcription: The fully diacritized Arabic sentence that was read aloud.
- sentence: The same sentence with diacritics removed.
- speaker_id: An anonymized speaker identifier.
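The relationship between the transcription and sentence columns can be sketched as follows. This is a rough illustration under stated assumptions, not the exact procedure used to build the dataset: the character range in `TASHKEEL` is our guess at what "diacritics" covers (the marks from fathatan through sukun, plus superscript alef), and the example row, its `speaker_id` value, and the helper names are hypothetical. The audio column is left out for brevity.

```python
import re

# Assumption: "diacritics" means U+064B..U+0652 (tanwin, short vowels, shadda,
# sukun) plus U+0670 (superscript alef). The dataset's actual rule may differ.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    """Remove Arabic diacritics, leaving the base letters untouched."""
    return TASHKEEL.sub("", text)


def row_is_consistent(row: dict) -> bool:
    """Check that `sentence` equals `transcription` with diacritics removed."""
    return strip_tashkeel(row["transcription"]) == row["sentence"]


# Hypothetical row; the values are made up for illustration.
row = {
    "transcription": "\u0643\u064e\u062a\u064e\u0628\u064e",  # diacritized form
    "sentence": "\u0643\u062a\u0628",                         # same word, no diacritics
    "speaker_id": "spk_0001",
}

print(row_is_consistent(row))
```

A check like this can be useful when adding your own recordings, or for confirming that a preprocessing step has not silently dropped the diacritics from the training targets.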
License
The dataset is released under CC-BY-4.0, which permits both commercial and non-commercial use with attribution.
Future work
We plan to continue growing this dataset with more recordings, speakers, and dialect coverage. We're also working on additional annotation layers and new datasets to support Arabic NLP research.
Check out the dataset on Hugging Face: NahwAI/arabic-tashkeel-speech
