Traditional multi-document summarization aims at generating a summary from a set of text documents, e.g., news, of a given topic/event. We investigate a problem known as reader-aware multi-document summarization (RA-MDS). Under the RA-MDS setting, one should jointly consider news documents and reader comments when generating the summaries.
One challenge of the RA-MDS problem is how to conduct salience estimation by jointly considering the focus of news reports and the reader interests revealed by comments. Meanwhile, the model should be insensitive to the availability of diverse aspects of reader comments. Another challenge is that reader comments are very noisy, often ungrammatical, and written in informal language.
Existing datasets from DUC and TAC are not appropriate for this task. Therefore, we introduce a new dataset for RA-MDS. We employed experts to carry out data collection, aspect annotation, and the writing and scrutinizing of summaries. To the best of our knowledge, this is the first dataset for RA-MDS.
The terminology related to the dataset is defined as follows. For the core terms, namely topic, document, category, and aspect, we follow the MDS task in TAC.
- Topic: A topic refers to an event and it is composed of a set of news documents from different sources.
- Document: A news article describing some aspects of the topic. The set of documents in the same topic typically spans a period, say a few days.
- Category: Each topic belongs to a category. There are 6 predefined categories: (1) Accidents and Natural Disasters, (2) Attacks (Criminal/Terrorist), (3) New Technology, (4) Health and Safety, (5) Endangered Resources, and (6) Investigations and Trials (Criminal/Legal/Other).
- Comment: A piece of text written by a reader conveying his or her attitude, emotion, or any thought on a particular news document.
The dataset contains 45 topics from those 6 predefined categories. Some examples of topics are "Malaysia Airlines Disappearance", "Flappy Bird", and "Bitcoin Mt. Gox".
Each topic contains 10 news documents and 4 model summaries. The length limit of the model summary is 100 words (split by space).
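The 100-word limit can be checked with a few lines of Python. This is only a minimal sketch: the function names `word_count`, `within_limit`, and `truncate_to_limit` are hypothetical and not part of the dataset, and splitting on any run of whitespace with `str.split()` is an assumption about how "split by space" is applied.

```python
def word_count(summary: str) -> int:
    """Count words by splitting on whitespace (assumed tokenization)."""
    return len(summary.split())


def within_limit(summary: str, limit: int = 100) -> bool:
    """Return True if the summary has at most `limit` whitespace-separated words."""
    return word_count(summary) <= limit


def truncate_to_limit(summary: str, limit: int = 100) -> str:
    """Keep only the first `limit` whitespace-separated words of a summary."""
    return " ".join(summary.split()[:limit])
```

A system summary longer than the limit could be passed through `truncate_to_limit` before comparing it against the model summaries, so that all summaries are evaluated at the same length.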
On average, each topic contains 215 comments and 940 comment sentences. Each news document contains an average of 27 sentences, and each sentence contains an average of 25 words.
85% of non-stop model summary terms (entities, unigrams, bigrams) appeared in the news documents, and 51% of them appeared in the reader comments.
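A coverage figure of this kind can be approximated as follows. This is a hedged sketch, not the procedure used to produce the numbers above: the tiny `STOPWORDS` set, the regex tokenizer, the choice to build bigrams after stop-word removal, and the function names are all assumptions made for illustration, and entity terms are left out because they would need a named-entity recognizer.

```python
import re

# Illustrative stop-word list only; the list used for the dataset is not specified.
STOPWORDS = frozenset({"a", "an", "the", "of", "in", "on", "and", "to", "is", "was", "for"})


def tokenize(text):
    """Lowercase the text and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def terms(text, stopwords=STOPWORDS):
    """Return the set of non-stop unigrams (str) and bigrams (tuple) in `text`."""
    tokens = [t for t in tokenize(text) if t not in stopwords]
    unigrams = set(tokens)
    bigrams = set(zip(tokens, tokens[1:]))
    return unigrams | bigrams


def coverage(summary, source_texts):
    """Fraction of the summary's terms that also occur in any of the source texts."""
    summary_terms = terms(summary)
    if not summary_terms:
        return 0.0
    source_terms = set()
    for text in source_texts:
        source_terms |= terms(text)
    return len(summary_terms & source_terms) / len(summary_terms)
```

Calling `coverage(model_summary, news_documents)` and `coverage(model_summary, reader_comments)` for each model summary and averaging the results would give two numbers comparable in spirit to the 85% and 51% reported above, though the exact values depend on the tokenizer and stop-word list.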
Download link: RA-MDS.zip
If you have any problems, please raise a question under Issues or send an email directly to Piji Li: lipiji.pz [at] gmail.com
How to cite?
Piji Li, Lidong Bing, Wai Lam, Hang Li and Yi Liao. Reader-Aware Multi-Document Summarization via Sparse Coding. Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI'15). Buenos Aires, Argentina. July 25-31, 2015. [pdf] [bib]
Piji Li, Lidong Bing, Wai Lam. Reader-Aware Multi-Document Summarization: An Enhanced Model and The First Dataset. Proceedings of the EMNLP 2017 Workshop on New Frontiers in Summarization (EMNLP-NewSum'17). Sep 2017. [pdf] [bib]
1, Respect the privacy of personal information of the original source.
2, The original copyright of all the news reports and reader comments belongs to the source owner.
3, The copyright of annotated model summaries belongs to our group, and they are free to the public.
4, If you find the dataset useful, we would appreciate it if you cite our papers.
5, The dataset is only for the specified applicant or study groups for research purposes. Without permission, it may not be used for any commercial purposes.