TreeTagger

Abstract:

Several probabilistic methods for part-of-speech (POS) tagging are based on Hidden Markov Models (HMMs). These methods have difficulty estimating transition probabilities accurately from limited amounts of training data. To avoid this problem, a newer method estimates the transition probabilities using a decision tree. Based on this method, a language-independent POS tagger called TreeTagger has been implemented. The main purpose of this work is to create the language model needed to adapt TreeTagger for Arabic POS tagging and lemmatization. To this end, several preparatory steps were carried out, namely collecting lexical resources and annotated training corpora. In addition, we used the proposed universal tagset, which consists of POS categories common to 22 different languages, including Arabic.

THE BASIC TAGS OF OUR TAGSET:

Each entry gives the tag, its tag symbol, the tag in Arabic, and an example with its Buckwalter transliteration and English gloss.

1. Verbs (all tenses and moods): VERB, فعل. Example: "كَتَبَ" (kataba, "to write")
2. Nouns: NOUN, اسم. Example: "مَدْرَسَة" (madrasap, "school")
3. Proper nouns: PN, اسم علم. Example: "مُحَمَّد" (muHam~ad, "Mohamed")
4. Pronouns: PRON, ضمير. Example: "هِيَ" (hiya, "she")
5. Adjectives: ADJ, صفة. Example: "جَمِيل" (jamyl, "beautiful")
6. Adverbs: ADV, ظرف. Example: "بَعْدَ، فَوْقَ" (baEda, fawoqa; "after, above")
7. Utility words (particles, conjunctions…): PRT, أداة. Example: "إلى، ذلك، الذي" (<ilY, *lk, Al*y; "to, that, who")
8. Disconnected letters (Quranic initials): DISL, حروف مقطعة. Example: "الم، طه، كهيعص" (Alm, Th, khyES)
9. Speech-specific sounds: Uh, حرف صوت. Example: "آه، هيهات" (|h, hayhAt)
10. Other (foreign words, typos, abbreviations…): X, أخرى. Example: "أوبك، مانشستر" (>wbk, mAn$str; "OPEC, Manchester")
11. Punctuation marks: SENT, علامة ترقيم. Example: "."

For further details, please see the following paper:

Imad Zeroual and Abdelhak Lakhouaja, “Adapting a decision tree based tagger for Arabic,” in 2016 International Conference on Information Technology for Organizations Development (IT4OD), 2016, pp. 1–6. DOI: 10.1109/IT4OD.2016.7479306.

Download the parameter files:

Note that the input file must already be tokenized (using white space and newlines as delimiters), the text must be transliterated using the Buckwalter table, and the file must be encoded in UTF-8.
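The input requirements above can be sketched in Python as follows. This is a minimal illustration, not part of the distributed parameter files: the function names (`to_buckwalter`, `prepare_input`, `write_input`) are hypothetical, the mapping is the standard Buckwalter table, and writing one token per line is an assumption consistent with the newline delimiter mentioned above. The sketch splits on white space only, so punctuation attached to a word is not separated.

```python
# Standard Buckwalter table: Arabic code point -> ASCII symbol.
BUCKWALTER = {
    "\u0621": "'", "\u0622": "|", "\u0623": ">", "\u0624": "&",
    "\u0625": "<", "\u0626": "}", "\u0627": "A", "\u0628": "b",
    "\u0629": "p", "\u062a": "t", "\u062b": "v", "\u062c": "j",
    "\u062d": "H", "\u062e": "x", "\u062f": "d", "\u0630": "*",
    "\u0631": "r", "\u0632": "z", "\u0633": "s", "\u0634": "$",
    "\u0635": "S", "\u0636": "D", "\u0637": "T", "\u0638": "Z",
    "\u0639": "E", "\u063a": "g", "\u0640": "_", "\u0641": "f",
    "\u0642": "q", "\u0643": "k", "\u0644": "l", "\u0645": "m",
    "\u0646": "n", "\u0647": "h", "\u0648": "w", "\u0649": "Y",
    "\u064a": "y",
    # Diacritics
    "\u064b": "F", "\u064c": "N", "\u064d": "K", "\u064e": "a",
    "\u064f": "u", "\u0650": "i", "\u0651": "~", "\u0652": "o",
    "\u0670": "`", "\u0671": "{",
}
_TABLE = str.maketrans(BUCKWALTER)


def to_buckwalter(text: str) -> str:
    """Replace every Arabic character that has a Buckwalter symbol."""
    return text.translate(_TABLE)


def prepare_input(text: str) -> str:
    """Transliterate, split on white space, and emit one token per line."""
    tokens = to_buckwalter(text).split()
    return "\n".join(tokens) + "\n"


def write_input(text: str, path: str) -> None:
    """Write the prepared tokens to a UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(prepare_input(text))
```

For example, `write_input("ذلك الذي", "input.txt")` would produce a file with the tokens `*lk` and `Al*y` on separate lines, matching the transliterations in the tagset examples (`input.txt` is a placeholder name).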
