OSIAN corpus

The full corpus will be available for download soon.
In the meantime, sample files are available for download.

Abstract:

The Open Source International Arabic News (OSIAN) corpus was collected from international Arabic news websites such as CNN, DW, RT, and Aljazeera. Using a server-friendly crawling policy, we extracted 1 million web pages. After the necessary cleaning and filtering steps, the OSIAN corpus contains 477,556 articles comprising 2,861,944 sentences and roughly 157 million words. The corpus is encoded in XML. Each article is annotated with metadata giving its web location and the date of its extraction, and each word is annotated with its lemma and part of speech.
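
As a rough illustration of how a corpus annotated this way might be read, here is a minimal Python sketch. The actual OSIAN schema is not given on this page, so the element names (corpus, article, sentence, word), the attribute names (url, extraction_date, lemma, pos), and the sample content are all hypothetical placeholders that only mirror the kinds of annotation described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag names, attribute names, and values are
# illustrative assumptions, not the actual OSIAN schema or data.
SAMPLE = """<corpus>
  <article url="https://example.org/news/1" extraction_date="2018-01-01">
    <sentence>
      <word lemma="كتب" pos="VERB">كتب</word>
      <word lemma="ولد" pos="NOUN">الولد</word>
    </sentence>
  </article>
</corpus>"""


def read_articles(xml_text):
    """Yield (metadata, tokens) for each article.

    metadata is a dict with the article's web location and extraction date;
    tokens is a list of (surface form, lemma, part-of-speech) tuples.
    """
    root = ET.fromstring(xml_text)
    for article in root.iter("article"):
        meta = {
            "url": article.get("url"),
            "date": article.get("extraction_date"),
        }
        tokens = [
            (word.text, word.get("lemma"), word.get("pos"))
            for word in article.iter("word")
        ]
        yield meta, tokens


# Print the metadata and token count of every article in the sample.
for meta, tokens in read_articles(SAMPLE):
    print(meta["url"], meta["date"], len(tokens))
```

For the full corpus, a streaming parser such as ET.iterparse would likely be a better fit than loading everything into memory at once; the names would need to be adapted to the real files once they are available.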

ISLRN: 255-977-746-042-1

For further details, please check the following paper(s):

Or contact us at: mr.imadine@gmail.com
