EXTRACTIVE SUMMARIZATION AS A GRAPH-BASED PROBLEM
Van-Hau Nguyen1,*, Van-Chien Nguyen2, Minh-Tien Nguyen1, Le Chi Ngoc2
1 Hung Yen University of Technology and Education
2 Ha Noi University of Technology and Science
* Corresponding author: haunv@utehy.edu.vn
Received: 10/04/2020
Revised: 10/08/2020
Accepted for publication: 09/09/2020
Abstract:
Graph-based approaches have been successfully used in many domains (e.g., computer vision, recommendation systems, big data, information extraction, etc.). In recent years, some studies have exploited graph-based approaches for extractive summarization and obtained significant results compared to other approaches. In this paper, we first encode the extractive summarization task as a graph-based problem. Then, we evaluate several approaches on two well-known datasets: SoLSCSum and USAToday-CNN. Finally, we draw some insights, which would be helpful for future research.
Keywords: Text Summarization, Graph-based methods, Extractive Summarization, Machine Learning.
1. Introduction
Nowadays, people are overwhelmed by a massive amount of information from a number of sources. Reading and understanding billions of documents (a number still growing exponentially), most of them textual, is a laborious and time-consuming task. The need for condensed and fluent summaries that preserve the key content and overall meaning of the source documents has led to the development of many automatic text summarization systems. In the natural language processing (NLP) community, automatic text summarization is a fundamental and interesting task. In the 1950s, Luhn [1] introduced a method to select salient sentences from the source based on words which are more descriptive of its content than others. Luhn's work became one of the most important pillars of NLP. Text summarization remains a challenging document-understanding task, in which a summarization system compresses the content of input documents while preserving their meaning. The task has been deeply investigated with a lot of studies in two directions: unsupervised learning [2-3] and supervised learning [4].
The results of text summarization systems benefit many NLP applications, ranging from search engines (e.g., Google or Bing), which return a short description of the Web pages matching a search query; online news providers, which produce highlights of a Web document on their interfaces; clinical decision support, which must quickly find relevant information among the large number of published journals, conference proceedings, medical sites and portals on the Web, electronic medical records, etc. [5]; and scientific article summarization [6].
Text summarization is mainly classified into two types: extraction and abstraction. Extraction approaches aim at producing a short summary of a document by selecting salient phrases or sentences from the original text, while abstraction approaches try to generate a summary for a document from scratch, even using words which do not appear in the original text. Since the output of abstraction is still far from human satisfaction, most of the promising results of today's text summarization systems come from extraction. Therefore, this survey is intended to review graph-based approaches to extractive summarization.
Intuitively, the extraction problem can be formulated as follows. Given a document (or multiple documents) consisting of M sentences, the purpose
of the problem is to extract N sentences from the document (N < M). The difference between methods lies in the way they identify and retrieve the salient sentences and then build a coherent summary that collects the main points from the original document.
In fact, there are several summarization settings: generic, query-oriented, multi-document, single-document, abstractive, and extractive. However, when building a summary for a given document, we have to deal with the same challenges [10]:
• Relevance (information coverage or significance): finding the salient sentences in a document (or multiple documents).
• Non-redundancy: building a summary that does not contain repeated information.
• Coherence: forming a summary that is syntactically well-formed and readable.
Extraction and abstraction are normally considered the two main types of text summarization. While the former aims at extracting a short summary of a document by selecting salient phrases or sentences from the original text, the latter attempts to create a summary from scratch, even using words that might not appear in the given text. Compared to the extractive task, the abstractive task is much more complex because it must emulate the human writing style. That is the reason this paper focuses on extractive summarization.
Finding cross-sentence relations plays a key role in effectively extracting sentences to build a summary. Several different approaches have been extensively studied to improve the summarization process. The traditional way is feature-based approaches [1, 3], which rely on statistical methods employing different features of the documents (e.g., word frequency, sentence position, sentence length, etc.). The drawback of feature-based approaches is that they do not take semantics into account. In many real problems, understanding the text is very important, especially for multi-document summarization. Therefore, semantic-based approaches have attempted to understand the context and meaning of the text [7].
Recently, neural methods have become popular in text summarization thanks to the power of encoder-decoder frameworks. For example, recurrent neural networks (RNNs) have achieved great success [8]. However, these RNN approaches have not been successful for the multi-document summarization task because the models are inadequate to capture sentence-level and complex relations across long documents or multiple documents. In order to overcome this challenge, a number of graph-based methods have been studied [2, 9, 10, 11]. Although graph-based approaches have made great progress by obtaining many state-of-the-art results, there remains an open question: how to build an effective graph for a (specific) extractive summarization task.
Intuitively, a graph-based method is an algorithm that identifies the importance of each vertex within a graph by taking advantage of the global and informative representation of the graph.
Graph-based approaches have been successfully studied and used in social networks. Brin and Page [12] introduced a prototype of a large-scale search engine that improves search quality via PageRank. Google's PageRank became one of the most important pillars in the field of Web search technology. The authors encoded each document as a set of word occurrences, called hits, which record each word and its position in the document.
The main objective of this paper is to review graph-based approaches to extractive summarization. We aim to provide a starting point for new researchers by reviewing the graph-based approaches. The motivation for our study comes from the high representational power of graphs for documents. Like other methods, graph-based approaches exploit statistical or linguistic information between sentences. Furthermore, instead of taking into account only local information, a graph lets us recursively compute the importance of a vertex (i.e., a text unit of the given documents) from the entire graph. Therefore, a summary is expected to contain the most salient sentences, obtained by exploiting global information.
The rest of the paper is organized as follows. In Section 2, we formulate text summarization as a graph-based problem. In Section 3, we evaluate several state-of-the-art graph-based algorithms. Finally, Section 4 concludes the paper.
2. Extractive Summarization as a Graph-based
Problem
In the context of extractive summarization, there are four main steps for any graph-based approach, as follows.
• Step 1 – Graphical Vertices: representing text units (e.g., words, sentences, or paragraphs) as vertices in the graph. Normally, a sentence, encoded as a vector, is represented as a node.
• Step 2 – Graphical Edges: building relations among vertices and considering these relations as edges in the graph. Depending on the specific problem, we may have directed or undirected, weighted or unweighted edges.
• Step 3 – Vertex Ranking: using graph-based algorithms to rank the vertices.
• Step 4 – Summary: selecting the salient sentences (sentence extraction).
In the following sections, we will address the above steps one after another. For the rest of the paper, we denote a graph G = (V, E), where V is the set of vertices and E is the set of edges (i.e., a subset of V × V).
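To make Steps 1 and 2 concrete, the following minimal Python sketch builds a weighted, undirected sentence graph. The `similarity` function and the `threshold` value are illustrative assumptions, not prescribed by any particular method.

```python
from itertools import combinations

def build_sentence_graph(sentences, similarity, threshold=0.1):
    # Step 1: each sentence index is a vertex of the graph.
    vertices = list(range(len(sentences)))
    # Step 2: connect every pair of sentences whose similarity exceeds
    # the threshold; edges is a weighted subset of V x V.
    edges = {}
    for i, j in combinations(vertices, 2):
        w = similarity(sentences[i], sentences[j])
        if w > threshold:
            edges[(i, j)] = w
    return vertices, edges
```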
2.1. Graphical Vertices
The most important goal of extraction is to rank the sentences of the text. In most cases, each vertex (node) in the graph represents a sentence [2, 9].
However, Parveen and Strube [10] used two sets of nodes in a bipartite graph: one represents entities (e.g., words/terms) and the other represents sentences, with no edges between nodes of the same set. By doing this, Parveen and Strube take advantage of strong entity transitions to exploit information that is central to a text. Similarly, Wang et al. [11] designed a heterogeneous graph network that exploits not only sentence-level nodes but also semantic units as additional nodes representing words, which enrich the relationships between sentences. Each word-sentence edge in the heterogeneous graph is constructed by connecting a sentence to the words it contains. This means there is no direct edge between any two sentences, or between any two words.
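As a rough illustration, a word-sentence bipartite graph in the spirit of [10, 11] can be sketched as follows; the naive whitespace tokenization and the unweighted edges are simplifying assumptions (the cited models use richer node features and weights such as TF-IDF).

```python
def build_word_sentence_graph(sentences):
    # Two disjoint node sets: words and sentence indices. Edges only
    # connect a sentence to the words it contains; there is no
    # sentence-sentence or word-word edge.
    word_nodes = set()
    edges = set()  # (word, sentence_index) pairs
    for idx, sentence in enumerate(sentences):
        for word in sentence.lower().split():
            word_nodes.add(word)
            edges.add((word, idx))
    return word_nodes, edges
```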
2.2. Graphical Edges
As mentioned above, each vertex or node in the graph represents a unit (e.g., a word, a sentence, etc.) of a given document. Each edge in the graph connects two vertices.
In the original context of Web connections [12, 13], an edge simply records that one vertex links to another. Therefore, these graph-based algorithms operate on unweighted graphs.
Later on, most models [2, 9, 12, 13] built multiple or partial links between vertices. Hence, these models employ weighted graphs, which express the strength of the relationship or similarity between two vertices. The weighted edges are evaluated in various ways, for example, cosine similarity [2] or tf-idf cosine similarity [14].
Mihalcea and Tarau [9] introduced a new edge weight w_ij between vertices Vi and Vj that directly determines the score associated with a vertex:
$$WS(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j) \qquad (1)$$
where In(Vi) is the set of vertices that point to Vi; Out(Vj) is the set of vertices that Vj points to; and d ∈ [0, 1] is a damping factor.
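A minimal sketch of iterating Eq. (1) on an undirected sentence graph might look as follows; the fixed iteration count is an assumption (a convergence threshold could be used instead), and for an undirected graph In(V) and Out(V) coincide with the neighbour set.

```python
def textrank_scores(n, edges, d=0.85, iters=50):
    # Build symmetric neighbour maps: neighbours[i][j] = w_ij.
    neighbours = {i: {} for i in range(n)}
    for (i, j), w in edges.items():
        neighbours[i][j] = w
        neighbours[j][i] = w
    # Precompute the denominator of Eq. (1) for every vertex.
    out_sum = {i: sum(ws.values()) for i, ws in neighbours.items()}
    scores = [1.0] * n  # arbitrary initial values
    for _ in range(iters):
        scores = [(1 - d) + d * sum(w / out_sum[j] * scores[j]
                                    for j, w in neighbours[i].items())
                  for i in range(n)]
    return scores
```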
Erkan and Radev [2] introduced the concept of sentence centrality. The authors observed that a given document can be considered a set of clusters, each consisting of many similar sentences, and that a sentence is central (or salient) to the topic if it is similar to many other sentences. The authors define the similarity between two sentences x and y as:
$$\text{cosine}(x, y) = \frac{\sum_{w \in x, y} tf_{w,x} \, tf_{w,y} \, (idf_w)^2}{\sqrt{\sum_{x_i \in x} (tf_{x_i,x} \, idf_{x_i})^2} \times \sqrt{\sum_{y_i \in y} (tf_{y_i,y} \, idf_{y_i})^2}} \qquad (2)$$
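The idf-modified cosine of Eq. (2) can be sketched as below, where tf_{w,s} is the count of word w in sentence s and idf_w its inverse document frequency; the `idf` dictionary is assumed to be precomputed over the corpus.

```python
import math
from collections import Counter

def idf_modified_cosine(x_tokens, y_tokens, idf):
    # Term frequencies of each tokenized sentence.
    tf_x, tf_y = Counter(x_tokens), Counter(y_tokens)
    # Numerator: words shared by both sentences, weighted by idf^2.
    num = sum(tf_x[w] * tf_y[w] * idf.get(w, 0.0) ** 2
              for w in tf_x.keys() & tf_y.keys())
    # Denominator: product of the tf-idf norms of the two sentences.
    norm_x = math.sqrt(sum((tf * idf.get(w, 0.0)) ** 2 for w, tf in tf_x.items()))
    norm_y = math.sqrt(sum((tf * idf.get(w, 0.0)) ** 2 for w, tf in tf_y.items()))
    return num / (norm_x * norm_y) if norm_x and norm_y else 0.0
```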
In order to consider coherence as one of the criteria to balance against the relevance of summary sentences, Christensen et al. [15] introduced the G-FLOW system to exploit discourse relationships across sentences. G-FLOW builds an Approximate Discourse Graph (ADG) in which each edge connecting two sentences is constructed based on discourse relation indicators (e.g., deverbal nouns, co-referent mentions, event/entity continuations, inferred edges, discourse markers, etc.). In particular, an edge from vertex (or sentence) u to v is constructed if and only if the two sentences share a discourse relationship
and v can be placed right after u in a final summary. The authors set the edge weight w_ADG(u, v) in the ADG to the number of distinct indicators between u and v:

$$w_{ADG}(u, v) = \sum \text{indicators}(u, v) \qquad (3)$$
In order to make the edge weights more diverse, [14] adjusted the ADG to propose the Personalized Discourse Graph (PDG). The authors score each edge with the following weight:
$$w_{PDG}(u, v) = \frac{w_{ADG}(u, v) \, s(u)}{\sum_{u' \in V} w_{ADG}(u', v) \, s(u')} \qquad (4)$$
where s(u) is the sentence personalization score of u, computed to take into account the surface features of each sentence.
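A small sketch of Eq. (4), assuming the ADG weights and the personalization scores s(·) have already been computed:

```python
def pdg_weights(adg_weights, s, vertices):
    # adg_weights maps (u, v) to w_ADG(u, v); s maps a vertex to its
    # personalization score. Each edge weight is rescaled by s(u) and
    # normalized over all possible predecessors u' of v, as in Eq. (4).
    pdg = {}
    for (u, v), w in adg_weights.items():
        denom = sum(adg_weights.get((u2, v), 0.0) * s[u2] for u2 in vertices)
        pdg[(u, v)] = w * s[u] / denom if denom else 0.0
    return pdg
```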
2.3. Ranking Vertices
In order to obtain salient sentences, we use graph-based ranking algorithms to calculate the score (i.e., importance) of all the nodes. Graph-based approaches differ from each other in their graph ranking algorithms; a number of them have been studied [10, 12, 13, 16].
Brin and Page [12] determined the importance of a vertex by its score. PageRank scores a given vertex Vi as follows:
$$S(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|} \qquad (5)$$
where In(Vi) is the set of vertices that point to Vi; Out(Vj) is the set of vertices that Vj points to; and d ∈ [0, 1] is a damping factor (usually d = 0.85). It is worth noting that PageRank forms a probability distribution (i.e., the scores S(Vi) over all vertices sum to 1). PageRank initializes arbitrary values for each node and runs a simple iterative algorithm until the scores converge within a given threshold.
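A minimal sketch of this iteration, with the graph given as out-link sets and convergence decided by a tolerance on the largest score change (both the representation and the tolerance are illustrative choices):

```python
def pagerank(out_links, d=0.85, tol=1e-6):
    # out_links maps each vertex to the set of vertices it points to.
    nodes = list(out_links)
    in_links = {v: [u for u in nodes if v in out_links[u]] for v in nodes}
    scores = {v: 1.0 for v in nodes}  # arbitrary initial values
    while True:
        # One application of Eq. (5) to every vertex.
        new = {v: (1 - d) + d * sum(scores[u] / len(out_links[u])
                                    for u in in_links[v])
               for v in nodes}
        if max(abs(new[v] - scores[v]) for v in nodes) < tol:
            return new
        scores = new
```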
Kleinberg [13] introduced the HITS algorithm, which ranks vertices (corresponding to Web pages) using two scores for each vertex: an "authority" score (for pages with many incoming links) and a "hub" score (for pages with many outgoing links):
$$HITS_A(V_i) = \sum_{V_j \in In(V_i)} HITS_H(V_j) \qquad (6)$$

$$HITS_H(V_i) = \sum_{V_j \in Out(V_i)} HITS_A(V_j) \qquad (7)$$
The algorithm extracts information from the link structure by exploiting semantic relations within a bipartite network.
Parveen and Strube [10] adapted the HITS algorithm for a bipartite graph in which one set of nodes represents entities, while the other represents sentences. In particular, after transforming documents into a bipartite graph, Parveen and Strube apply two steps to update two score matrices:
$$HubScore = A \cdot AuthorityScore \qquad (8)$$

$$AuthorityScore = A^{T} \cdot HubScore \qquad (9)$$
Here, A is an adjacency matrix representing the edges in the graph; HubScore and AuthorityScore are the scores for the hub (i.e., entity) and authority (i.e., sentence) nodes, respectively. The algorithm assigns initial scores to all nodes, and HubScore and AuthorityScore are updated until convergence. Finally, every node has a score.
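The matrix form of Eqs. (8)-(9) is easy to iterate with NumPy; the per-round normalization below is an assumption added so the scores stay bounded (only their relative order matters for ranking):

```python
import numpy as np

def bipartite_hits(A, iters=100):
    # A is the entity-by-sentence adjacency matrix of the bipartite graph.
    hub = np.ones(A.shape[0])        # one score per entity node
    authority = np.ones(A.shape[1])  # one score per sentence node
    for _ in range(iters):
        hub = A @ authority          # Eq. (8)
        authority = A.T @ hub        # Eq. (9)
        hub /= np.linalg.norm(hub) or 1.0
        authority /= np.linalg.norm(authority) or 1.0
    return hub, authority
```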
Herings et al. [16] proposed the positional power function for digraphs, in which the power (i.e., score) of a node is measured by both the number of its successors and the powers of its successors:
$$POS_P(V_i) = \frac{1}{|V|} \sum_{V_j \in Out(V_i)} \left(1 + POS_P(V_j)\right) \qquad (10)$$

$$POS_W(V_i) = \frac{1}{|V|} \sum_{V_j \in In(V_i)} \left(1 + POS_W(V_j)\right) \qquad (11)$$
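A fixed-point iteration for Eq. (10) can be sketched as follows (Eq. (11) is the same computation over In(V) instead of Out(V)); the fixed iteration count is an illustrative assumption:

```python
def positional_power(out_links, iters=100):
    # out_links maps each vertex to the set of its successors.
    nodes = list(out_links)
    n = len(nodes)
    power = {v: 0.0 for v in nodes}
    for _ in range(iters):
        # Each successor contributes (1 + its power) / |V|, as in Eq. (10).
        power = {v: sum(1.0 + power[u] for u in out_links[v]) / n
                 for v in nodes}
    return power
```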
2.4. Summary
Straightforwardly, a summary can be built by extracting the salient sentences: the more important a sentence is, the higher the score of its corresponding vertex. Hence, we can pick the top n sentences with the highest scores from the ranking step above (i.e., Ranking Vertices).
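For instance, a minimal selection step might look like this; restoring the original document order is an assumption made here for readability:

```python
def build_summary(sentences, scores, n):
    # Indices of the n highest-scoring sentences...
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    # ...re-sorted into their original order in the document.
    return [sentences[i] for i in sorted(top)]
```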
3. Evaluation
We choose SVM as a traditional and very competitive machine-learning model to compare with three state-of-the-art graph-based approaches: TextRank [9], LexRank [2], and heterogeneous graph neural networks [11].
3.1. Data Preparation
In order to evaluate graph-based models (covering both supervised and unsupervised learning methods), we conducted experiments on the SoLSCSum and USAToday-CNN datasets for multi-document summarization tasks.
SoLSCSum: The dataset [17] consists of 157 documents collected from Yahoo News. Human annotators created the label of each sentence: each sentence was labeled (i.e., summary or non-summary) by majority voting. Finally, each
document includes the summary sentences and
comments.
USAToday-CNN: The dataset [18] includes 121 events, 455 highlights, and 6,413 sentences. Each sentence was annotated (i.e., summary or non-summary) by two people.
3.2. Settings and Evaluation Metrics
In this paper, we utilized several features for the SVM model, such as sentence position, sentence length, TF-IDF, and the number of words in common with the title, to train a summarizer. Furthermore, we used 80% of the data for training and 20% for testing. We used the ROUGE-1 (uni-gram) and ROUGE-2 (bi-gram) metrics to measure the graph-based approaches; these measures are popular and considered appropriate for evaluating text summarization.
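For illustration, ROUGE-N recall can be sketched as the fraction of reference n-grams that also occur in the candidate summary; real evaluations typically rely on the official ROUGE toolkit, which adds stemming, stopword handling, and F-scores.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    def ngrams(tokens, n):
        # Multiset of n-grams in a token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    total = sum(ref.values())
    return overlap / total if total else 0.0
```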
Table 1. Statistics of the SoLSCSum and USAToday-CNN multi-document summarization datasets
Dataset #docs #sentences #refs
SoLSCSum 157 3,462 5,858
USAToday-CNN 121 6,413 455
Furthermore, we only selected several representative studies to demonstrate the performance of graph-based approaches.
As we can see in Table 2, although SVM, a traditionally competitive method, is slightly better than TextRank and LexRank on ROUGE-2, it is worse than both of them on ROUGE-1. Remarkably, HeterSumGraph [11] performs the best on both ROUGE-1 and ROUGE-2.
Table 2. ROUGE scores of graph-based methods on SoLSCSum. Bold marks the best result
Algorithms ROUGE-1 ROUGE-2
SVM [19] 0.385 0.314
TextRank [9] 0.395 0.306
LexRank [2] 0.389 0.295
HeterSumGraph [11] 0.457 0.370
As we can see in Table 3, SVM is much better than TextRank and marginally better than LexRank with respect to ROUGE-2. On ROUGE-1, however, SVM scores higher than LexRank but considerably lower than TextRank. Again, HeterSumGraph [11] performs the best on both ROUGE-1 and ROUGE-2 by a significant margin. In particular, HeterSumGraph achieves a substantially higher ROUGE-2 than the others, e.g., about 4.3 times better than SVM.
Table 3. ROUGE scores of graph-based methods on USAToday-CNN. Bold marks the best result
Algorithms ROUGE-1 ROUGE-2
SVM [19] 0.247 0.082
TextRank [9] 0.269 0.008
LexRank [2] 0.239 0.079
HeterSumGraph [11] 0.425 0.353
Through the experiments, we draw some insights. LexRank [2] and TextRank [9] demonstrate that graph-based approaches are very competitive compared to SVM, a very competitive representative of classical machine learning methods. Furthermore, by combining with neural networks, graph-based approaches can obtain state-of-the-art results for extractive summarization. Therefore, we are convinced that graph-based approaches may be further exploited with different neural network models, which we leave for future work.
4. Conclusions
Although many studies on summarization have made great progress, there is still a big gap between current systems and our expectations. Compared to other approaches, graph-based methods have demonstrated their power on text summarization through many state-of-the-art results. There are two main advantages which facilitate the success of the graph-based approaches. The first is the