Datasets

DUC 2004 Gujarati Dataset

https://github.com/nisargjhaveri/duc2004-translated

This is an evaluation dataset for cross-lingual summarization from English to Gujarati. It can be obtained from the link above.

You’ll also need the original DUC 2004 dataset, as the repository linked above does not include the source documents for licensing reasons.

MultiLing Pilot 2011 Dataset

http://users.iit.demokritos.gr/~ggianna/TAC2011/MultiLing2011.html

This dataset contains parallel document sets in seven languages: English, Arabic, Czech, French, Greek, Hebrew and Hindi. Summaries for each document set are available in all seven languages, making the dataset usable for cross-lingual summarization evaluation.

The data needs to be cleaned and formatted before it can be used with clstk.