EXTRACTIVE SUMMARIZATION AS A GRAPH-BASED PROBLEM

Van-Hau Nguyen1,*, Van-Chien Nguyen2, Minh-Tien Nguyen1, Le Chi Ngoc2
1 Hung Yen University of Technology and Education
2 Ha Noi University of Technology and Science
* Corresponding author: haunv@utehy.edu.vn
Received: 10/04/2020; Revised: 10/08/2020; Accepted for publication: 09/09/2020

Abstract: Graph-based approaches have been successfully used in many domains (e.g., computer vision, recommendation systems, big data, information extraction). In recent years, several studies have exploited graph-based approaches for extractive summarization and obtained significant results compared to other approaches. In this paper, we first encode the extractive summarization task as a graph-based problem. Then, we evaluate several approaches on two well-known datasets: SoLSCSum and USAToday-CNN. Finally, we draw some insights that should be helpful for future research.

Keywords: Text Summarization, Graph-based Methods, Extractive Summarization, Machine Learning.

1. Introduction
Nowadays, people are overwhelmed by a massive amount of information from numerous sources. Reading and understanding billions of documents (a number still growing exponentially), mostly textual, is a laborious and time-consuming task. The motivation to produce condensed and fluent summaries from source documents while preserving the key information and overall meaning has led to the development of many automatic text summarization systems. In the natural language processing (NLP) community, automatic text summarization is a fundamental and interesting task. In the 1950s, Luhn [1] introduced a method to select salient sentences from the source based on words that are more descriptive of its content than others. Luhn's work became one of the most important pillars of NLP.
Text summarization is a challenging document-understanding task in which a summarization system compresses the content of input documents while preserving their meaning. The task has been deeply investigated in two directions: unsupervised learning [2-3] and supervised learning [4]. The results of text summarization systems benefit many NLP applications, ranging from search engines (e.g., Google or Bing), which return a short description of the Web pages corresponding to a search query; online news providers, which produce highlights of a Web document on their interfaces; clinical decision support, which helps clinicians quickly find relevant information in published journals, conference proceedings, medical sites and portals, electronic medical records, etc. [5]; and scientific article summarization [6].
Text summarization is mainly classified into two types: extraction and abstraction. Extractive approaches aim to build a short summary of a document by selecting salient phrases or sentences from the original text, while abstractive approaches try to generate a summary from scratch, even using words that do not appear in the original text. Since the output of abstraction is still far from human satisfaction, most of the promising results of today's text summarization systems come from extraction. Therefore, this survey is intended to review graph-based approaches to extractive summarization. Intuitively, the extraction problem can be formulated as follows.
Given a document (or multiple documents) consisting of M sentences, the goal is to extract N sentences (N < M). Methods differ in how they identify and retrieve salient sentences and then build a coherent summary that collects the main points of the original document. In fact, there are several summarization tasks: generic, query-oriented, multi-document, single-document, abstractive, and extractive. However, when building a summary for a given document, we have to deal with the same challenges [10]:
• Relevance (information coverage or significance): finding the salient sentences of a document (or multiple documents).
• Non-redundancy: building a summary that does not repeat the same information.
• Coherence: forming a summary that is syntactically well-formed and readable.
Extraction and abstraction are normally considered the two types of text summarization. While the former aims to build a short summary of a document by selecting salient phrases or sentences from the original text, the latter attempts to create a summary from scratch, even using words that might not appear in the given text. Compared to the extractive task, the abstractive task is much more complex because it must imitate the writing style of humans. That is the reason this paper focuses on extractive summarization.
Finding cross-sentence relations plays a key role in effectively extracting sentences to build a summary. Several approaches have been extensively studied to improve the summarization process. The traditional way is feature-based approaches [1, 3], which rely on statistical methods employing different features of the documents (e.g., word frequency, sentence position, sentence length). The drawback of feature-based approaches is that they do not take semantics into account. In many real problems, understanding the text is very important for multi-document summarization. Therefore, semantic-based approaches have attempted to understand the context and meaning of the text [7]. Recently, neural methods have become popular in text summarization thanks to the power of encoder-decoder frameworks. For example, recurrent neural networks (RNNs) have achieved great success [8]. However, these RNN approaches have not been successful for the multi-document summarization task because the models are inadequate to capture sentence-level and complex relations across a long document or multiple documents. To overcome this challenge, a number of graph-based methods have been studied [2, 9, 10, 11]. Although graph-based approaches have made great progress, with many state-of-the-art results, there remains an open question: how to build an effective graph for a (specific) extractive summarization task.
Intuitively, a graph-based method is an algorithm that identifies the importance of each vertex within a graph by taking advantage of the global and informative representation of the graph. Graph-based approaches have been successfully studied and used in social networks. Brin and Page [12] introduced a prototype of a large-scale search engine that improves search quality, including PageRank. Google's PageRank became one of the most important pillars of Web search technology. The authors encoded each document as a set of word occurrences, called hits, which record each word and its position in the document.
The main objective of this paper is to review graph-based approaches to extractive summarization. We aim to provide a starting point for new researchers. The motivation for our study comes from how well graphs can represent documents. Like other methods, graph-based approaches exploit statistical or linguistic information between sentences. Furthermore, instead of taking into account only local information, the graph lets us recursively compute the importance of a vertex (i.e., a text unit of the given documents) from the entire graph. Therefore, a summary is expected to contain the most salient sentences by exploiting global information.
The rest of the paper is organized as follows. In Section 2, we formulate text summarization as a graph-based problem. In Section 3, we evaluate several state-of-the-art graph-based algorithms. Finally, Section 4 presents conclusions.

2. Extractive Summarization as a Graph-based Problem
In the context of extractive summarization, any graph-based approach follows four main steps:
• Step 1 – Graphical Vertices: representing text units (e.g., words/sentences) as vertices in the graph. Normally, a sentence, encoded as a vector, is represented as a node.
• Step 2 – Graphical Edges: building relations among vertices and considering these relations as edges in the graph. Depending on the specific problem, the edges may be directed or undirected, weighted or unweighted.
• Step 3 – Vertex Ranking: using graph-based algorithms to rank vertices.
• Step 4 – Summary: selecting salient sentences (sentence extraction).
In the following sections, we address these steps one after another. For the rest of the paper, we denote a graph G = (V, E), where V is the set of vertices and E is the set of edges (i.e., a subset of V × V).

2.1. Graphical Vertices
The most important goal of extraction is to rank entire sentences in the text. In most cases, each vertex (node) represents a sentence in the graph [2, 9]. However, Parveen and Strube [10] used two sets of nodes in a bipartite graph: one represents entities (e.g., words/terms) and the other represents sentences, with no edges within the same set of nodes. By doing this, Parveen and Strube take advantage of strong entity transitions to exploit information central to a text. Similarly, Wang et al. [11] adjusted a heterogeneous graph network to exploit not only sentence-level nodes but also semantic units as additional nodes representing words, which enrich the relationships between sentences. Each word-sentence edge in the heterogeneous graph connects a sentence to one of its contained words; there is no direct edge between any two sentences or between any two words.

2.2. Graphical Edges
As mentioned above, each vertex or node in the graph represents a unit (e.g., a word or sentence) of a given document. Each edge in the graph connects two vertices. In the original context of Web links [12, 13], one vertex connects directly to another vertex, so these graph-based algorithms use unweighted graphs. Later on, most models [2, 9, 12, 13] build multiple or partial links between vertices. Hence, these models employ weighted graphs that express the strength of the relationship or similarity between two vertices.
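To make Steps 1 and 2 concrete, the following minimal Python sketch (an illustration, not code from the surveyed papers) builds such a weighted sentence graph: sentences become vertices and TF-IDF cosine similarities become edge weights, one common choice discussed next. The pruning threshold and function name are our own assumptions.

```python
# Illustrative sketch of Steps 1-2: sentences as vertices,
# pairwise TF-IDF cosine similarities as weighted edges.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_sentence_graph(sentences, threshold=0.1):
    """Return a weighted adjacency matrix over the input sentences."""
    tfidf = TfidfVectorizer().fit_transform(sentences)  # one vector per sentence
    weights = cosine_similarity(tfidf)                  # pairwise edge weights
    np.fill_diagonal(weights, 0.0)                      # drop self-loops
    weights[weights < threshold] = 0.0                  # prune weak edges (assumed cutoff)
    return weights                                      # adjacency matrix of G = (V, E)
```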
The weighted edges are evaluated in various ways, for example, cosine similarity [2] or tf-idf cosine similarity [14]. Mihalcea and Tarau [9] introduced a vertex score in which the edge weight w_{ij} between vertices V_i and V_j directly determines the score associated with a vertex:

WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)   (1)

where In(V_i) is the set of vertices that point to V_i, Out(V_j) is the set of vertices that V_j points to, and d \in [0, 1] is a damping factor.
Erkan and Radev [2] introduced the concept of sentence centrality. The authors observed that a given document can be considered a set of clusters consisting of many similar sentences, and that a sentence is central (or salient) to the topic if it is similar to many other sentences. They define the similarity between two sentences x and y as:

cosine(x, y) = \frac{\sum_{w \in x, y} tf_{w,x} \, tf_{w,y} \, (idf_w)^2}{\sqrt{\sum_{x_i \in x} (tf_{x_i,x} \, idf_{x_i})^2} \times \sqrt{\sum_{y_i \in y} (tf_{y_i,y} \, idf_{y_i})^2}}   (2)

To balance coherence against the relevance of summary sentences, Christensen et al. [15] introduced the G-FLOW system to exploit discourse relationships across sentences. G-FLOW represents an Approximate Discourse Graph (ADG) in which each edge, connecting two sentences, is constructed based on discourse relation indicators (e.g., deverbal nouns, co-reference mentions, event/entity continuations, inferred edges, discourse markers). In particular, an edge from vertex (or sentence) u to v is constructed if and only if the two sentences share a discourse relationship when v is placed right after u in a final summary. The authors set the edge weight w_{ADG}(u, v) in the ADG to the number of distinct indicators between u and v:

w_{ADG}(u, v) = \sum indicators(u, v)   (3)

To make the edge weights more diverse, [14] adjusted the ADG into a Personalized Discourse Graph (PDG). The authors score each edge with the following weight:

w_{PDG}(u, v) = \frac{w_{ADG}(u, v) \, s(u)}{\sum_{u' \in V} w_{ADG}(u', v) \, s(u')}   (4)

where s(u) is a sentence personalization score computed from the surface features of each sentence.

2.3. Ranking Vertices
To obtain salient sentences, we use graph-based ranking algorithms to calculate the score (i.e., the importance) of every node; graph-based approaches differ from each other in their ranking algorithms. A number of approaches have been proposed [10, 12, 13, 16]. Brin and Page [12] determined the importance of a vertex by its PageRank score. The PageRank score of a given vertex V_i is:

S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}   (5)

where In(V_i) is the set of vertices that point to V_i, Out(V_j) is the set of vertices that V_j points to, and d \in [0, 1] is a damping factor (usually d = 0.85). It is worth noting that PageRank forms a probability distribution (i.e., \sum_{V_i \in V} S(V_i) = 1). PageRank initializes arbitrary values for each node and runs a simple iterative algorithm until convergence within a given threshold.
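As an illustration only, the next sketch runs the weighted update of Eq. (1) as a power iteration over the adjacency matrix produced by the previous sketch. The damping factor d = 0.85 and the convergence tolerance follow common PageRank practice rather than any specific paper.

```python
# Illustrative power iteration for the weighted update in Eq. (1).
import numpy as np

def textrank_scores(weights, d=0.85, tol=1e-6, max_iter=100):
    """Score vertices of a weighted graph; higher score = more salient."""
    n = weights.shape[0]
    out_sum = weights.sum(axis=1, keepdims=True)     # sum_k w_jk for each row j
    out_sum[out_sum == 0.0] = 1.0                    # guard isolated vertices
    transition = weights / out_sum                   # entry (j, i) holds w_ji / sum_k w_jk
    scores = np.ones(n)                              # arbitrary initial values
    for _ in range(max_iter):
        new_scores = (1.0 - d) + d * (transition.T @ scores)
        converged = np.abs(new_scores - scores).sum() < tol
        scores = new_scores
        if converged:                                # stop within the given threshold
            break
    return scores
```

For Step 4 (Section 2.4 below), the extract is then simply the top-n ranked sentences, e.g., np.argsort(-scores)[:n], restored to document order.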
Kleinberg [13] introduced the HITS algorithm, which ranks vertices (corresponding to Web pages) using two scores per vertex: an "authority" score (for pages with many incoming links) and a "hub" score (for pages with many outgoing links):

HITS_A(V_i) = \sum_{V_j \in In(V_i)} HITS_H(V_j)   (6)

HITS_H(V_i) = \sum_{V_j \in Out(V_i)} HITS_A(V_j)   (7)

The algorithm extracts information from the link structure by exploiting semantic relations within a bipartite network. Parveen and Strube [10] adapted the HITS algorithm for a bipartite graph in which one set of nodes represents entities while the other represents sentences. In particular, after transforming documents into a bipartite graph, Parveen and Strube applied two steps to update two matrices:

HubScore = A \cdot AuthorityScore   (8)

AuthorityScore = A^T \cdot HubScore   (9)

Here, A is an adjacency matrix representing the edges in the graph; HubScore and AuthorityScore are the score matrices for hub (i.e., entity) and authority (i.e., sentence) nodes, respectively. The algorithm assigns initial scores to all nodes, then updates HubScore and AuthorityScore until convergence. Finally, every node has a score.
Herings et al. [16] proposed the positional power function for digraphs, in which the power (i.e., score) of a node is measured by both the number of its successors and the powers of its successors:

POS_P(V_i) = \frac{1}{|V|} \sum_{V_j \in Out(V_i)} (POS_P(V_j) + 1)   (10)

POS_W(V_i) = \frac{1}{|V|} \sum_{V_j \in In(V_i)} (POS_W(V_j) + 1)   (11)

2.4. Summary
Straightforwardly, a summary can be built by extracting salient sentences: the more important a sentence is, the higher the score of its corresponding vertex. Hence, we can pick the top n sentences with the highest scores from the result of the previous step (i.e., Ranking Vertices).

3. Evaluation
We choose SVM as a traditional and very competitive machine-learning model to compare with three state-of-the-art graph-based approaches: TextRank [9], LexRank [2], and heterogeneous graph neural networks [11].

3.1. Data Preparation
To evaluate graph-based models (both supervised and unsupervised learning methods), we conducted experiments on the SoLSCSum and USAToday-CNN datasets for multi-document summarization tasks.
SoLSCSum: The dataset [17] consists of 157 documents collected from Yahoo News. Humans were involved in creating the label of each sentence: each sentence was labeled (i.e., summary or non-summary) by majority voting. Finally, each document includes the summary sentences and comments.
USAToday-CNN: The dataset [18] includes 121 events, 455 highlights, and 6,413 sentences. Each sentence was annotated (i.e., summary or non-summary) by two people.

3.2. Settings and Evaluation Metrics
In this paper, we utilized several features for the SVM model, such as sentence position, length, TF-IDF, and the number of words in common with the title, to train a summarizer. Furthermore, we used 80% of each dataset for training and 20% for testing. We used the ROUGE-1 (unigram) and ROUGE-2 (bigram) metrics to measure the graph-based approaches. These measurements are popular and considered appropriate for evaluating text summarization.

Table 1. Statistics of the multi-document summarization datasets

Dataset        #docs  #sentences  #refs
SoLSCSum        157     3,462     5,858
USAToday-CNN    121     6,413       455

Furthermore, we only selected several studies to demonstrate the performance of graph-based approaches.
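For reference, the following is a minimal sketch of the ROUGE-N recall underlying our comparison, assuming plain whitespace tokenization; official ROUGE toolkits additionally apply stemming, stopword handling, and precision/F-measure variants.

```python
# Illustrative ROUGE-N recall: clipped n-gram overlap over reference n-grams.
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-1 for n=1 (unigrams), ROUGE-2 for n=2 (bigrams)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())   # clipped counts of shared n-grams
    total = sum(ref.values())
    return overlap / total if total else 0.0
```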
As we can see in Table 2, although SVM, a traditionally competitive method, is slightly better than TextRank and LexRank in ROUGE-2, it is worse than both TextRank and LexRank in ROUGE-1. Remarkably, HeterSumGraph [11] performs the best in both ROUGE-1 and ROUGE-2.

Table 2. ROUGE scores of graph-based methods on SoLSCSum. Bold is the best

Algorithms          ROUGE-1  ROUGE-2
SVM [19]             0.385    0.314
TextRank [9]         0.395    0.306
LexRank [2]          0.389    0.295
HeterSumGraph [11]   0.457    0.370

As we can see in Table 3, SVM is much better than TextRank and marginally better than LexRank with respect to ROUGE-2. For ROUGE-1, SVM scores higher than LexRank but considerably lower than TextRank. Again, HeterSumGraph [11] performs the best by a significant margin in both ROUGE-1 and ROUGE-2. In particular, HeterSumGraph achieves a substantial ROUGE-2 compared to the others; for example, it is about 4.3 times better than SVM.

Table 3. ROUGE scores of graph-based methods on USAToday-CNN. Bold is the best

Algorithms          ROUGE-1  ROUGE-2
SVM [19]             0.247    0.082
TextRank [9]         0.269    0.008
LexRank [2]          0.239    0.079
HeterSumGraph [11]   0.425    0.353

Through these experiments we draw some insights. LexRank [2] and TextRank [9] demonstrate that graph-based approaches are very competitive compared to SVM, itself a very competitive machine-learning method. Furthermore, by combining with neural networks, graph-based approaches can achieve state-of-the-art results for extractive summarization. Therefore, we are convinced that graph-based approaches may be further exploited with different neural network models, which we leave for future work.

4. Conclusions
Although many studies on summarization have made great progress, current systems still fall short of our expectations. Compared to other approaches, graph-based methods have demonstrated their power on text summarization through many state-of-the-art results. There are two main advantages that facilitate the success of graph-based approaches. The first is the