NEMLAR Corpus




The first version of the NEMLAR corpus was produced within the NEMLAR project. It is a set of annotated Arabic texts collected from 13 different domains and contains about 500,000 words.

The Arabic Language Processing team (ALP team) of Mohammed First University in Morocco enriched this corpus by adding a lemma label to every word and by correcting some annotation errors present in the first version.

This new version is in XML format, and each word is accompanied by the following tags:

  • Vowelized form
  • Lemma
  • POS tag
  • Clitics attached to the stem
  • Root
  • Pattern
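
As a rough illustration of how such per-word annotations might be read, here is a minimal Python sketch. The element name `word` and the attribute names (`vowelized`, `lemma`, `pos`, `proclitic`, `enclitic`, `root`, `pattern`) are assumptions made for this example only, and the two sample words are invented; the actual schema of the corpus files may differ and should be checked against the distributed data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: element and attribute names are assumed for
# illustration and are NOT taken from the actual corpus files.
SAMPLE = """<corpus>
  <sentence>
    <word form="كتب" vowelized="كَتَبَ" lemma="كَتَبَ" pos="verb"
          proclitic="" enclitic="" root="كتب" pattern="فَعَلَ"/>
    <word form="وكتاب" vowelized="وَكِتَابٌ" lemma="كِتَاب" pos="noun"
          proclitic="و" enclitic="" root="كتب" pattern="فِعَال"/>
  </sentence>
</corpus>"""


def read_words(xml_text):
    """Return one dict of annotations per <word> element."""
    tree_root = ET.fromstring(xml_text)
    return [dict(word.attrib) for word in tree_root.iter("word")]


def count_by(words, tag):
    """Count how often each value of a given tag (e.g. 'root') occurs."""
    return Counter(word[tag] for word in words)


words = read_words(SAMPLE)
root_counts = count_by(words, "root")
pos_counts = count_by(words, "pos")
```

With annotations loaded this way, the lemma and root tags can be used to group inflected or cliticized forms under a single entry, for example when building frequency lists.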

For further details, please see the following paper:

  • Boudchiche, M.; Mazroui, A. (2015). “Enrichment of the Nemlar corpus by the lemma tag”. Workshop Language Resources of Arabic NLP: Construction, Standardization, Management and Exploitation. Rabat, Morocco. November 26, 2015.
