AlKhalil Diacritizer



The absence of short vowels in Arabic texts causes difficulties for many automatic Arabic language processing systems. This paper presents and evaluates several hybrid systems for the automatic diacritization of Arabic texts. All of these approaches consist of three phases: a morphological step followed by statistical phases based on Hidden Markov Models at the word level and at the character level. Both versions of the Alkhalil morpho-syntactic analyzer were used and tested; the output of this stage is the set of possible diacritizations of each word. A lexical database containing the most frequent words in Arabic was incorporated into some of the systems to make them faster. Training was performed on a large Arabic corpus, and the impact of the size of this training corpus on system performance was studied. The systems use smoothing techniques to handle missing word transitions and the Viterbi algorithm to select the optimal solution. The proposed system, which benefits from the wealth of morphological analysis and a large diacritized corpus, achieves interesting experimental results in comparison with other known automatic diacritization systems.
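
The word-level statistical phase described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a bigram model over diacritized words, uses add-one smoothing as a stand-in for the smoothing techniques (which are not specified here), and omits the character-level phase and the lexical database. The function names, the toy corpus, and the transliterated candidate lists (standing in for the output of the morphological analyzer) are all hypothetical.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start marker


def train_bigrams(corpus):
    """Count unigrams and bigrams of diacritized words in a list of sentences."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in corpus:
        prev = START
        unigrams[prev] += 1
        for word in sentence:
            bigrams[(prev, word)] += 1
            unigrams[word] += 1
            prev = word
    return unigrams, bigrams


def log_transition(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed log P(word | prev); unseen transitions stay non-zero."""
    count = bigrams.get((prev, word), 0) + 1
    total = unigrams.get(prev, 0) + vocab_size
    return math.log(count / total)


def viterbi(candidates, unigrams, bigrams):
    """Pick the most probable sequence, one candidate per input word.

    candidates[i] lists the possible diacritizations of the i-th word
    (assumed to come from a morphological analysis step).
    """
    vocab_size = len(unigrams)
    # best maps each state to (log probability, best path ending in that state)
    best = {START: (0.0, [])}
    for options in candidates:
        new_best = {}
        for cand in options:
            score, path = max(
                (
                    (s + log_transition(prev, cand, unigrams, bigrams, vocab_size), p)
                    for prev, (s, p) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy training corpus in Latin transliteration (illustrative data only).
corpus = [
    ["kataba", "Alwaladu", "Aldarsa"],
    ["kataba", "Alwaladu"],
    ["kutubu", "Alwaladi"],
]
unigrams, bigrams = train_bigrams(corpus)

# Hypothetical analyzer output for the undiacritized input "ktb Alwld".
candidates = [["kataba", "kutiba", "kutubu"], ["Alwaladu", "Alwaladi"]]
result = viterbi(candidates, unigrams, bigrams)
print(result)
```

Because of the smoothing, the candidate "kutiba", which never occurs in the toy corpus, still receives a small non-zero probability instead of eliminating every path that passes through it.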

For further details, please see the following paper:

Amine Chennoufi and Azzeddine Mazroui, “Impact of morphological analysis and a large training corpus on the performances of Arabic diacritization,” International Journal of Speech Technology (IJST), 2015. DOI: 10.1007/s10772-015-9313-5.