
2017 International Conference on Machine Learning and Data Science

Extractive Text Summarization using Word Vector Embedding
Aditya Jain, Divij Bhatia, Manish K Thakur
Department of Computer Science Engineering & IT
Jaypee Institute of Information Technology, Noida, India
[email protected], [email protected], [email protected]

Abstract—Text summarization is an active research field these days, aimed at identifying the relevant information in large documents produced in various domains such as finance, news media, academics, politics, etc. Text summarization is the process of shortening documents while preserving the important content of the text. This can be achieved through extractive and abstractive summarization. In this paper, we propose an approach that extracts a good set of features followed by a neural network for supervised extractive summarization. Our experimental results on the Document Understanding Conferences (DUC) 2002 dataset show the effectiveness of the proposed method against various online extractive text summarizers.

Index Terms—Extractive Text Summarization; Neural Network; Machine Learning; Word Vector Embedding

I. INTRODUCTION

Electronic documents have been a principal medium for business and academic information for many years. Millions of documents are produced and made available on the internet every day. To get relevant information from these documents, it is necessary to extract features from them in an efficient way. Various approaches have been developed in the literature to address this issue; in particular, text summarization based approaches are popular among researchers [1-13].

There are two methods to summarize text data, namely abstractive summarization [8] and extractive summarization [1][2][3]. In abstractive summarization based approaches, natural language generation techniques are used to create a summary that represents the internal semantics of the text [10][11]. The generated summary is along the lines of what a human would express. Generally, these methods are complex in practice because they need fine-grained linguistic information [10][11].

In extractive summarization based approaches, the summary is generated by cohesively collecting the relevant sentences from the documents [1][2][3]. These approaches are easy to implement, being conceptually simpler and requiring minimal language understanding [1][2][3].

Extractive summarization can be carried out through supervised or unsupervised learning [1][2][3][4]. In supervised learning, a large amount of annotated or labeled data is needed, and summarization is modeled as a binary classification problem in which the positive class consists of the sentences included in the summary and the negative class of those excluded. Examples of such methods are Naive Bayes [1], Support Vector Machines [13], and neural networks [2][3][4]. Unsupervised approaches [5][6][7] do not require labeled data; examples are K-Means [8] and DBSCAN [9].

There are four major challenges for extractive text summarization:

• identification of the most important pieces of information in the document,
• removal of irrelevant information,
• minimizing details, and
• assembling the extracted relevant information into a compact, coherent report.

To overcome these challenges, we propose an approach that extracts a good set of features followed by a neural network for supervised extractive summarization. In addition to the standard features [1], we use word vector embedding based features. Our proposed method yields higher accuracy when tested against various online extractive text summarizers on the DUC 2002 dataset.

The rest of the paper is organized as follows: Section 2 describes our proposed approach, Section 3 details the experiments and observations, and Section 4 concludes the work and outlines future directions.

II. METHODOLOGY

In this section, we present our methodology for text summarization, treating it as a binary classification problem: each sentence of the explored text is categorized either as relevant and included in the summary, or as irrelevant and excluded from it. In this process, the input document is broken down into sentences, feature extraction is performed over these sentences, and the features are fed into a neural network for training and prediction. The neural network decides the inclusion of a sentence in the summary. This process is described in Algorithm 1.

Algorithm 1: Text Summarizer

Input: text data X with the corresponding abstractive summary, learning rate η, number of epochs t
Output: summarized text

1. The summarized text is initialized to empty.
2. Data preprocessing: based on the abstractive summary of the text dataset, generate an extractive summary of the text data using a similarity score between the sentences of the text and the sentences of the abstractive summary.
3. Feature extraction: compute ten features for each of the sentences in the dataset.
4. Apply a neural network classifier with three fully connected hidden layers to the processed dataset.
5. Assign the sentences with a high predictive score (from the neural network) to the summarized text.

A. Dataset Processing

We have used the first 10K documents of the 90K-document CNN news article corpus [14] for training. The dataset also contains abstractive summaries corresponding to the documents. Since a supervised learning approach requires labeled training data, an extractive summary is needed, i.e. a summary whose sentences are borrowed from the text. To generate the labeled training data, we calculated the similarity score of every sentence in the abstractive summary with every sentence in the corresponding document using 100-dimensional GloVe vectors [15]. The document sentences with the highest similarity scores are chosen for the extractive summary in place of the corresponding sentences of the abstractive summary.
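As a rough illustration of this labeling step, the sketch below pairs each abstractive-summary sentence with its most similar document sentence and marks it as a positive example. The glove dictionary (token to 100-dimensional vector), the helper names, and the argmax pairing are our own assumptions, not the authors' code.

import numpy as np

def sentence_vector(tokens, glove, dim=100):
    """Average the GloVe vectors of the tokens of one sentence."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def extractive_labels(doc_sents, summary_sents, glove):
    """doc_sents, summary_sents: lists of token lists. For each abstractive-summary
    sentence, label the most similar document sentence as positive (1)."""
    doc_vecs = [sentence_vector(s, glove) for s in doc_sents]
    labels = [0] * len(doc_sents)
    for summ in summary_sents:
        sv = sentence_vector(summ, glove)
        best = max(range(len(doc_sents)), key=lambda i: cosine(doc_vecs[i], sv))
        labels[best] = 1
    return labels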
B. Feature Extraction

We have computed the following ten features for the sentences, along the lines of Neto et al. [1].

(i) Mean TF-ISF: The basic feature in text processing tasks is TF-IDF. For text summarization, this feature is termed Term Frequency-Inverse Sentence Frequency (TF-ISF), where the document d of TF-IDF is analogous to the sentence S in summarization [1]. The TF-ISF of the jth word (token) of the ith sentence is computed using Equation 1, and the TF-ISF of a sentence is the mean of the TF-ISF scores of the words present in it (Equation 2):

TF-ISF_{i,j} = TF_j × log10(N / ISF_j)                                  (1)

TF-ISF_i = (1/n) × Σ_{j=1}^{n} TF-ISF_{i,j}                             (2)

where TF-ISF_i denotes the TF-ISF of the ith sentence, TF-ISF_{i,j} denotes the TF-ISF of the jth word of the ith sentence, TF_j denotes the term frequency of the jth word of the sentence, ISF_j denotes the inverse sentence frequency of the jth word of the sentence, N denotes the total number of sentences, and n denotes the total number of words in the sentence.
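A small sketch of this feature under our reading of Equations 1 and 2, where the denominator of Equation 1 is taken as the number of sentences containing the word; the function and variable names are illustrative only.

import math
from collections import Counter

def mean_tf_isf(sentences):
    """sentences: list of token lists for one document.
    Returns one mean TF-ISF score per sentence."""
    N = len(sentences)
    # number of sentences in which each word occurs
    sent_freq = Counter()
    for sent in sentences:
        sent_freq.update(set(sent))
    scores = []
    for sent in sentences:
        tf = Counter(sent)
        tf_isf = [tf[w] * math.log10(N / sent_freq[w]) for w in sent]
        scores.append(sum(tf_isf) / len(sent) if sent else 0.0)
    return scores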
(ii) Sentence Length: This feature gives less weight to short sentences, as short sentences are relatively less important than longer sentences in the text [1][2]. It is measured as the ratio of the number of words in the sentence to the number of words in the longest sentence of the document.

(iii) Sentence Position: This feature captures the position of a sentence in the document [1][16]. The position is computed as a normalized percentile score in the range between 0 and 1.

(iv) Sentence-to-Sentence Cohesion: This feature [1] uses the cosine similarity of each sentence S with every other sentence S'. The value is then normalized to the range between 0 and 1 by taking the ratio of the raw value for S to the largest raw value obtained in the document. Values closer to 1.0 indicate high cohesion and vice versa.

(v) Sentence-to-Centroid Cohesion: For this feature, the centroid vector of a document is computed as the average of the sentence vectors in it [1]. The similarity of each sentence vector with the centroid vector is calculated and normalized by its ratio to the largest similarity value in the corresponding document. Sentences with larger ratios are deemed to represent the intrinsic information contained in the document.
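Both cohesion features, (iv) and (v), can be computed from precomputed sentence vectors such as the mean GloVe vectors described under feature (x). The following sketch is our illustration, not the authors' implementation.

import numpy as np

def cohesion_features(sent_vecs):
    """sent_vecs: (num_sentences, dim) array of sentence vectors.
    Returns the normalized sentence-to-sentence and sentence-to-centroid cohesion."""
    V = np.asarray(sent_vecs, dtype=float)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    unit = V / np.where(norms == 0, 1.0, norms)

    # (iv) sum of cosine similarities of each sentence with every other sentence
    sims = unit @ unit.T
    np.fill_diagonal(sims, 0.0)
    s2s = sims.sum(axis=1)
    if s2s.max() > 0:
        s2s = s2s / s2s.max()            # normalize by the largest raw value

    # (v) cosine similarity of each sentence with the document centroid
    centroid = V.mean(axis=0)
    c_norm = np.linalg.norm(centroid)
    s2c = unit @ (centroid / c_norm if c_norm > 0 else centroid)
    if s2c.max() > 0:
        s2c = s2c / s2c.max()            # normalize by the largest similarity
    return s2s, s2c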
(vi) Depth of Tree: This feature operates at the document level and is used to group similar sentences based on their lexical similarity [1]. To compute it, agglomerative clustering is applied to the document; the root of the resulting tree represents the entire document, and the depth of every sentence is obtained from this cluster tree. Sentences at the same depth are assumed to be lexically similar and are therefore grouped together.
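The paper does not specify which agglomerative clustering procedure or linkage is used; the sketch below is one possible realization with SciPy's hierarchical clustering over sentence vectors, taking the depth of each leaf in the merge tree as the feature value.

import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def sentence_depths(sent_vecs):
    """Agglomeratively cluster sentence vectors and return the depth of
    each sentence (leaf) in the resulting binary merge tree."""
    Z = linkage(np.asarray(sent_vecs, dtype=float), method="average", metric="cosine")
    root = to_tree(Z)
    depths = {}

    def walk(node, depth):
        if node.is_leaf():
            depths[node.id] = depth
        else:
            walk(node.left, depth + 1)
            walk(node.right, depth + 1)

    walk(root, 0)
    return [depths[i] for i in range(len(sent_vecs))]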
(vii) Sentences Having Main Concepts: For this feature, the main aim is to look for nouns present in the document [1]. We chose the fifteen most frequent nouns or noun phrases in the document. The feature is assigned a score of 1 for sentences that contain any of these 15 nouns or noun phrases, and 0 otherwise.

(viii) Occurrence of Proper Names: Proper names referring to places and people might be useful for deciding the relevance of a sentence [1]. This is a binary feature, with a value of 1 if a sentence contains such proper names and 0 otherwise.

(ix) Occurrence of Non-Essential Information: Some words (e.g. "because", "moreover", "additionally", etc.) are indicators of non-essential information [1]. This is a binary feature, taking the value 1 if the sentence contains at least one of these words and 0 otherwise.

(x) Word Vector Embedding: Every word in a sentence is represented as a 100-dimensional vector using the pre-trained GloVe vectors. The sentence score for this feature is calculated by taking the mean of each dimension over all the word vectors, forming a vector as shown in Equation 3. Say a sentence S has words = [word1, word2, word3]; these words are represented as:

word1 = [x_{1,1}, x_{1,2}, …, x_{1,100}]
word2 = [x_{2,1}, x_{2,2}, …, x_{2,100}]
word3 = [x_{3,1}, x_{3,2}, …, x_{3,100}]

Hence, the sentence vector is computed as:

( (x_{1,1} + x_{2,1} + x_{3,1})/3, (x_{1,2} + x_{2,2} + x_{3,2})/3, …, (x_{1,100} + x_{2,100} + x_{3,100})/3 )        (3)
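As a small worked example of Equation 3, using made-up 4-dimensional vectors instead of 100-dimensional GloVe vectors for brevity:

import numpy as np

# hypothetical 4-dimensional embeddings of a three-word sentence
word1 = np.array([0.2, -0.1, 0.5, 0.0])
word2 = np.array([0.4,  0.3, 0.1, 0.2])
word3 = np.array([0.0,  0.1, 0.3, 0.4])

# Equation 3: element-wise mean across the word vectors
sentence_vec = np.mean([word1, word2, word3], axis=0)
print(sentence_vec)   # [0.2 0.1 0.3 0.2]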
C. Training

In the CNN training corpus, the summary length is approximately one third of the text document, i.e. the sentences belonging to class 0 (not to be included in the summary) dominate. We therefore randomly under-sampled the dataset so that the number of sentences with each class label becomes equal. This ensures that the model does not become biased towards the sentences with label 0.

For summarization, a Multi-Layer Perceptron (MLP) has been used with some modifications, described subsequently. An MLP is a feed-forward neural network with one or more layers between the input and output layers, where data flows in one direction from input to output (i.e. forward).

We used an MLP with three fully connected hidden layers. In our model, the output layer has two nodes for deciding the inclusion of a sentence in the summary. The ReLU function has been used as the activation function (Equation 4). The learning rate η has been taken as 0.001 and the number of epochs t as 10.

f(x) = max(0, x)                                                        (4)

This neural network was configured to predict the probability of each sentence belonging to a particular class. To predict the probability, we used the scikit-learn package in Python [17]. This gives two important measures for every sentence: first, the class to which the sentence belongs, and second, the probability of the sentence belonging to that class. The higher the probability of a sentence belonging to the positive class (i.e. that the sentence should be included in the summary), the higher its relevance to the text fed to the summarizer.
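A sketch of this training setup with scikit-learn's MLPClassifier is given below. The hidden-layer widths, the SGD solver, and the under-sampling code are our assumptions; the paper only fixes three hidden layers, ReLU activation, η = 0.001, and t = 10 epochs.

import numpy as np
from sklearn.neural_network import MLPClassifier

def train_summarizer(X, y, hidden=(64, 64, 64), lr=0.001, epochs=10, seed=0):
    """X: (num_sentences, num_features) matrix, y: 0/1 labels.
    Randomly under-samples the majority class, then fits a three-hidden-layer MLP."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    neg = rng.choice(neg, size=len(pos), replace=False)   # balance the two classes
    idx = rng.permutation(np.concatenate([pos, neg]))

    clf = MLPClassifier(hidden_layer_sizes=hidden, activation="relu",
                        solver="sgd", learning_rate_init=lr, max_iter=epochs)
    clf.fit(X[idx], y[idx])
    return clf

# At prediction time, sentences are ranked by the probability of the positive class:
# probs = clf.predict_proba(X_test)[:, 1]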
III. EXPERIMENTS

We trained our model using the steps discussed in the previous section. To test the performance of the trained model, we used the first 284 documents of the DUC 2002 dataset. We used the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score [18], specifically ROUGE-1, ROUGE-2, and ROUGE-L, as the performance metric. ROUGE-1 refers to the overlap of unigrams (single words) between the system and reference summaries, ROUGE-2 refers to the overlap of bigrams between the system and reference summaries, and ROUGE-L refers to Longest Common Subsequence (LCS) based statistics.
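These metrics can be reproduced with, for example, the rouge-score Python package; this is an assumed substitute, since the paper itself cites the original ROUGE toolkit [18] and does not state whether precision, recall, or F1 is reported. The sketch below returns F1; recall is available from the same Score objects.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_f1(system_summary, reference_summary):
    """ROUGE-1/2/L F1 of one system summary against one reference summary."""
    scores = scorer.score(reference_summary, system_summary)
    return {name: s.fmeasure for name, s in scores.items()}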

In our experiments, the length of the summary used for testing was fixed at six, i.e. each document summary contained six sentences. Table I shows the ROUGE-1, ROUGE-2 and ROUGE-L scores computed on different numbers of test documents, ranging between 50 and 284. Table II shows the 95% confidence intervals of these ROUGE scores; it indicates that, if the same set of documents were summarized again, their ROUGE scores would lie within these intervals in 95% of the cases.

We also tested the performance of our proposed model against some online text summarizers. The comparative performance of all the summarizers in terms of the ROUGE scores (1, 2, and L) is shown in Table III. As seen from Table III, our proposed model outperformed all the mentioned summarizers.

TABLE I. ROUGE SCORES OF THE EXPERIMENTS PERFORMED ON THE TESTING DATASET, FROM THE FIRST 50 DOCUMENTS UP TO ALL 284 DOCUMENTS

No. of Documents   ROUGE-1   ROUGE-2   ROUGE-L
First 50           0.46606   0.22770   0.43899
First 100          0.46125   0.23034   0.43587
First 150          0.38092   0.18508   0.35960
First 200          0.35885   0.16583   0.33833
First 250          0.37094   0.16692   0.34913
All 284            0.36625   0.15735   0.34410
TABLE II. 95% CONFIDENCE INTERVALS FOR THE DATASET USED IN TABLE I

No. of Documents   ROUGE-1              ROUGE-2              ROUGE-L
First 50           0.43836 to 0.49346   0.19398 to 0.26529   0.41183 to 0.46739
First 100          0.43510 to 0.48589   0.20469 to 0.25758   0.40943 to 0.46066
First 150          0.34667 to 0.41281   0.16160 to 0.20873   0.32662 to 0.39035
First 200          0.33043 to 0.38472   0.14606 to 0.18718   0.31059 to 0.36374
First 250          0.34730 to 0.39535   0.15054 to 0.18654   0.32608 to 0.37336
All 284            0.34287 to 0.38700   0.14106 to 0.17320   0.32148 to 0.36430

TABLE III. ROUGE COMPARISON WITH OTHER ONLINE SUMMARIZERS

Online Summarizer     ROUGE-1   ROUGE-2   ROUGE-L
AutoSummarizer [19]   0.33651   0.11738   0.24874
SPLITBRAIN [20]       0.34483   0.14211   0.19565
Text Compactor [21]   0.34287   0.15382   0.20315
Tools4noobs [22]      0.25138   0.16258   0.22437
Our Proposed Model    0.38249   0.2256    0.27486

IV. CONCLUSION AND FUTURE WORK

In this paper, we presented an extractive text summarization approach using a neural network. The neural network has been trained on ten features, including word vector embeddings, extracted from the training dataset. Testing has been performed on the DUC 2002 dataset, where up to 284 documents were used in the various test experiments. The ROUGE scores (1, 2, and L) computed for our proposed model and for four online text summarizers show the effectiveness of the proposed model. The performance of the proposed model may be improved further by increasing the size and diversity of the training dataset and by applying more effective approaches [10] to convert the abstractive summaries into extractive ones.

REFERENCES

[1] J. L. Neto, A. A. Freitas, C. A. A. Kaestner, "Automatic Text Summarization using a Machine Learning Approach," in Proc. Brazilian Symposium on Artificial Intelligence (SBIA 2002), Lecture Notes in Computer Science, Vol. 2507, Springer, 2002, pp. 205-215.
[2] R. Nallapati, F. Zhai, B. Zhou, "SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents," in Proc. AAAI Conference on Artificial Intelligence (AAAI), arXiv:1611.04230 [cs.CL].
[3] A. T. Sarda, A. R. Kulkarni, "Text Summarization using Neural Networks and Rhetorical Structure Theory," International Journal of Advanced Research in Computer and Communication Engineering, Vol. 4, Issue 6, June 2015.
[4] J. Cheng, M. Lapata, "Neural summarization by extracting sentences and words," arXiv:1603.07252 [cs.CL].
[5] M. S. Patil, M. S. Bewoor, S. H. Patil, "A Hybrid Approach for Extractive Document Summarization Using Machine Learning and Clustering Technique," International Journal of Computer Science & Information Technology, Vol. 5, Issue 2, 2014, pp. 1584.
[6] R. A. García-Hernández, R. Montiel, Y. Ledeneva, E. Rendón, A. Gelbukh, R. Cruz, "Text Summarization by Sentence Extraction Using Unsupervised Learning," in A. Gelbukh, E. F. Morales (eds), MICAI 2008: Advances in Artificial Intelligence, Lecture Notes in Computer Science, Vol. 5317, Springer, Berlin, Heidelberg.
[7] T. Nomoto, Y. Matsumoto, "A New Approach to Unsupervised Text Summarization," in Proc. ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2001, pp. 26-34.
[8] S. Akter, A. S. Asa, Md. P. Uddin, "An extractive text summarization technique for Bengali document(s) using K-means clustering algorithm," in Proc. IEEE International Conference on Imaging, Vision & Pattern Recognition, 2017.
[9] S. D'Silva, N. Joshi, S. Rao, S. Venkatraman, S. Shrawne, "Improved Algorithms for Document Classification & Query-based Multi-Document Summarization," International Journal of Engineering and Technology (IACSIT), Vol. 3, No. 4, August 2011.
[10] R. Nallapati, B. Zhou, C. N. Dos Santos, C. Gulcehre, B. Xiang, "Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond," in Proc. SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2016, arXiv:1602.06023 [cs.CL].
[11] K. Ganesan, C. Zhai, J. Han, "Opinosis: A graph-based approach to abstractive summarization of highly redundant opinions," in Proc. International Conference on Computational Linguistics (COLING), 2010, pp. 340-348.
[12] J. Carbonell, J. Goldstein, "The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries," in Proc. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '98), pp. 335-336.
[13] S. Km, R. Soumya, "Text summarization using clustering technique and SVM technique," International Journal of Applied Engineering Research, Vol. 10, 2015, pp. 28873-28881.

[14] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, "Teaching Machines to Read and Comprehend," arXiv:1506.03340 [cs.CL].
[15] J. Pennington, R. Socher, C. D. Manning, "GloVe: Global Vectors for Word Representation."
[16] T. Sri Rama Raju, B. Allarpu, "Text Summarization using Sentence Scoring Method," International Research Journal of Engineering and Technology (IRJET), Vol. 04, Issue 04, 2017.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, Vol. 12, pp. 2825-2830, 2011.
[18] C.-Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," in Proc. ACL Workshop: Text Summarization Branches Out, 2004.
[19] AutoSummarizer, http://www.autosummarizer.com/, accessed on 1st Aug 2017.
[20] SplitBrain, https://www.splitbrain.org/services/ots, accessed on 1st Aug 2017.
[21] TextCompactor, http://www.textcompactor.com/, accessed on 1st Aug 2017.
[22] Tools4noobs, https://www.tools4noobs.com/summarize/, accessed on 1st Aug 2017.
[23] G. Erkan, D. R. Radev, "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization," Journal of Artificial Intelligence Research, Vol. 22, Issue 1, July 2004, pp. 457-479.
[24] V. Gupta, G. S. Lehal, "A Survey of Text Summarization Extractive Techniques," Journal of Emerging Technologies in Web Intelligence, Vol. 2, Issue 3, pp. 258-268.
[25] S. Gupta, A. Nenkova, D. Jurafsky, "Measuring Importance and Query Relevance in Topic-focused Multi-Document Summarization," in Proc. 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, 2007, pp. 193-196.
[26] M. Kageback, O. Mogren, N. Tahmasebi, D. Dubhashi, "Extractive summarization using continuous vector space models," in Proc. EACL Workshop on Continuous Vector Space Models and their Compositionality (CVSC), 2014, pp. 31-39.
[27] W. Yin, Y. Pei, "Optimizing sentence modeling and selection for document summarization," in Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2015, pp. 1383-1389.
[28] Y. Zhao, G. Karypis, U. Fayyad, "Hierarchical Clustering Algorithms for Document Datasets," Data Mining and Knowledge Discovery Journal, Vol. 10, Issue 2, 2005, pp. 141-168.
