
MulTed corpus


The corpus will be available for download soon.


MulTed is a multilingual, aligned, and tagged parallel corpus: it covers many languages and is Part-of-Speech (PoS) tagged, but its sentence alignment is bilingual, with English as the pivot language. The corpus is designed for NLP applications in which sentence alignment, PoS tagging, and corpus size matter, such as statistical machine translation, language recognition, and bilingual dictionary generation. It is a collection of subtitles extracted from TEDx talks. Currently, the subtitles cover 1100 talks and are available in over 30 languages. The subtitles are also classified by topic, such as Business, Education, and Sport. PoS tagging is performed with TreeTagger, a language-independent PoS tagger. To make the tagging maximally useful, the language-specific tags are then mapped to a common universal tagset. Finally, we believe that making the MulTed corpus publicly available can be a significant contribution to the literature of NLP, Information Retrieval, and Corpus Linguistics, especially for under-resourced languages.
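As a rough illustration of two ideas in the description above (mapping tagger-specific tags to a common tagset, and using English as a pivot to relate two other languages), here is a minimal Python sketch. It is not the authors' implementation. The tag table, the fallback tag "X", the function names, and the data layout (dictionaries keyed by an English sentence id) are all assumptions made for the example, since the corpus format has not been published on this page.

```python
from typing import Dict, List, Tuple

# Hypothetical excerpt of a tag mapping (Penn-Treebank-style tags to
# coarse universal-style tags). The real MulTed mapping is not shown here.
TAG_MAP: Dict[str, str] = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "DT": "DET",
}


def map_tags(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Replace each language-specific tag with its universal counterpart.

    Tags missing from TAG_MAP fall back to "X" (an assumed catch-all).
    """
    return [(token, TAG_MAP.get(tag, "X")) for token, tag in tagged]


def pivot_pairs(en_to_a: Dict[int, str], en_to_b: Dict[int, str]) -> List[Tuple[str, str]]:
    """Pair sentences of languages A and B that align to the same English sentence.

    Both inputs map an English sentence id to its translation; only ids
    present in both bitexts yield a pair.
    """
    return [(en_to_a[i], en_to_b[i]) for i in sorted(en_to_a) if i in en_to_b]


# Usage with made-up data
print(map_tags([("talks", "NNS"), ("inspire", "VBP")]))
print(pivot_pairs({1: "Bonjour", 2: "Merci"}, {2: "Gracias", 3: "Hola"}))
```

Joining on the English side only yields pairs for sentences that both languages translate, so pivot-derived bitexts may be smaller than either English-aligned bitext.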

ISLRN: 367-302-230-252-1

For further details, please check the following paper(s):

Or contact us at: i.zeroual@ump.ac.ma

Under revision
