OSIAN corpus

The full corpus will be available for download soon.
In the meantime, you can download sample files.


The Open Source International Arabic News (OSIAN) corpus was collected from international Arabic news websites such as CNN, DW, RT, and Aljazeera, among others. Using a server-friendly crawling policy, we extracted 1 million web pages. After the necessary cleaning and filtering steps, the OSIAN corpus contains 477,556 articles comprising 2,861,944 sentences and roughly 157 million words. The corpus is encoded in XML. Each article is annotated with metadata giving its web location and the date of its extraction, and each word is annotated with its lemma and part of speech.
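As a rough illustration of how such an XML file might be read, here is a minimal Python sketch. The schema is not documented on this page, so the element names (`corpus`, `article`, `s`, `w`) and attribute names (`url`, `date`, `lemma`, `pos`) are assumptions made for the example, as is the sample content; check them against the sample files before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element and attribute names below are
# assumptions for illustration, not the published OSIAN schema.
SAMPLE = """<corpus>
  <article url="https://example.org/news/1" date="2018-01-01">
    <s>
      <w lemma="كتب" pos="VERB">كتب</w>
      <w lemma="ولد" pos="NOUN">الولد</w>
    </s>
  </article>
</corpus>"""


def read_articles(xml_text):
    """Return one dict per article with its metadata and annotated words."""
    root = ET.fromstring(xml_text)
    articles = []
    for art in root.iter("article"):
        # Each sentence becomes a list of (surface form, lemma, POS) tuples.
        sentences = [
            [(w.text, w.get("lemma"), w.get("pos")) for w in s.iter("w")]
            for s in art.iter("s")
        ]
        articles.append(
            {
                "url": art.get("url"),    # web location of the article
                "date": art.get("date"),  # extraction date
                "sentences": sentences,
            }
        )
    return articles


articles = read_articles(SAMPLE)
for art in articles:
    print(art["url"], art["date"], len(art["sentences"]))
```

For the full corpus, a streaming parser such as `ET.iterparse` would likely be a better fit than loading a whole file into memory at once.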

ISLRN: 255-977-746-042-1

For further details, please check the following paper(s):

Or contact us at: mr.imadine@gmail.com
