1,000+ Diacritized Arabic Speech Recordings

Today we're releasing the Nahw Arabic Tashkeel Speech Dataset, a collection of over 1,000 Arabic speech recordings with fully diacritized transcriptions, available on Hugging Face under CC-BY-4.0.

Most publicly available Arabic ASR datasets strip diacritics (tashkeel) from transcriptions, losing critical pronunciation and meaning information. A word like "كتب" could be "kataba" (he wrote), "kutub" (books), or "kutiba" (it was written). The diacritics are what disambiguate these readings. Our dataset preserves full diacritization, making it useful for fine-tuning ASR models that need to produce accurate, vowelized Arabic text.
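To make the ambiguity concrete, here is a minimal Python sketch that removes tashkeel from three vocalized readings of the same root and shows that they collapse to one bare string. The helper name strip_tashkeel and the choice of Unicode range are our own assumptions for illustration; this is not necessarily how the dataset's undiacritized text was produced.

```python
import re

# Assumption: treat U+064B-U+0652 (fathatan through sukun, including shadda)
# and U+0670 (superscript alef) as the tashkeel marks to remove.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")


def strip_tashkeel(text: str) -> str:
    """Return `text` with Arabic diacritic marks removed (hypothetical helper)."""
    return TASHKEEL.sub("", text)


# Three distinct vocalized words built on the same consonant skeleton.
readings = {
    "kataba (he wrote)": "كَتَبَ",
    "kutub (books)": "كُتُب",
    "kutiba (it was written)": "كُتِبَ",
}

# After stripping, every reading maps to the same undiacritized string.
bare_forms = {strip_tashkeel(word) for word in readings.values()}
print(bare_forms)
```

An ASR model trained only on stripped text sees one target string for all three words, which is why diacritized transcriptions matter for vowelized output.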

Dataset

The dataset contains 1,000+ recordings of native Arabic speakers reading fully diacritized sentences, reviewed and approved by human annotators.

Columns

audio: The speech recording, resampled to 16 kHz mono WAV.
transcription: The fully diacritized Arabic sentence that was read aloud.
sentence: The same sentence with diacritics removed.
speaker_id: An anonymized speaker identifier.
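As a rough sketch of how these columns might be used, the Python snippet below checks that a row carries the four columns listed above and that its sentence field matches its transcription with diacritics removed. It assumes the Hugging Face datasets library is installed, that the data is exposed as a "train" split (the split name is a guess), and that sentence is an exact character-level strip of transcription; the helper names are hypothetical.

```python
import re

# Column names as listed in this post.
EXPECTED_COLUMNS = {"audio", "transcription", "sentence", "speaker_id"}

# Assumption: tashkeel marks are U+064B-U+0652 plus U+0670 (superscript alef).
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")


def missing_columns(row: dict) -> set:
    """Return the expected column names that are absent from `row`."""
    return EXPECTED_COLUMNS - set(row)


def sentence_matches(row: dict) -> bool:
    """True if `sentence` equals `transcription` with tashkeel stripped.

    Assumes no other normalization was applied when `sentence` was created.
    """
    return TASHKEEL.sub("", row["transcription"]) == row["sentence"]


def load_split(split: str = "train"):
    """Download the dataset from the Hub. The split name is an assumption."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("NahwAI/arabic-tashkeel-speech", split=split)
```

With network access, calling load_split() and passing its first row to missing_columns and sentence_matches gives a quick sanity check before fine-tuning. If sentence_matches returns False, the undiacritized text was likely normalized in some additional way.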

License

The dataset is released under CC-BY-4.0, free for commercial and non-commercial use with attribution.

Future work

We plan to continue growing this dataset with more recordings, speakers, and dialect coverage. We're also working on additional annotation layers and new datasets to support Arabic NLP research.

Check out the dataset on Hugging Face: NahwAI/arabic-tashkeel-speech
