
Volume 64, Issue 1, 2020

Journal of Scientific Research


Institute of Science,
Banaras Hindu University, Varanasi, India.

Accountability of NLP Tools in Text Summarization for Indian Languages
Pradeepika Verma1* and Anshul Verma2

1* Dept. of CSE, Indian Institute of Technology (ISM) Dhanbad, India. [email protected]


2 Dept. of CS, Banaras Hindu University, Varanasi, India. [email protected]

Abstract: In the digital era, online information is growing exponentially, which makes it difficult to find relevant information in a reasonable time. In this regard, an automatic text summarizer proves to be a good tool. It creates a brief and meaningful form of the given text using a natural language toolkit so that users can access the information quickly. Today, many summarization tools are available for resource-rich languages such as English. However, it remains difficult to summarize text in Indian languages (low-resource languages) because of the limited availability of NLP tools and techniques for them. In this paper, we present a survey of existing text summarization methods and NLP tools for Indian languages. We also discuss the issues associated with Indian languages that are the bottlenecks for summarizing Indian-language text.

Index Terms: Text summarization, Natural language processing, Indian languages, Language dependency.

DOI: http://dx.doi.org/10.37398/JSR.2020.640149

I. INTRODUCTION
The area of natural language processing has gained much attention since the emergence of online information. It involves processing human-generated language in either form to facilitate the user, for example through information extraction. This introduces several challenges, such as human language understanding and human language generation, and the task of text summarization inherits them. To make human languages machine-understandable, several NLP tools are available, such as stemmers, PoS taggers, parsers, and named entity recognition systems, but very few exist for low-resource languages.

Generally, the text summarization task can be classified into two categories, extractive summarization and abstractive summarization (Verma and Om 2016a,b,c). Extractive summarization extracts the most relevant sentences as they appear in the document, while abstractive summarization generates new sentences from the set of concepts or topics in the document using natural language generation tools. Summarization can also be categorized along several other dimensions, such as single- and multi-document summarization, query- and topic-focused summarization, and monolingual and multilingual summarization. Single-document summarization produces a summary of the text of a single document, and producing a summary from more than one document is called multi-document summarization. Multi-document summarization is more difficult than single-document summarization because of redundancy in the text, compression of text from multiple documents, collection of significant information, etc. Next, summarization on the basis of an interrogative phrase is called query-based summarization. Here, the system itself generates the topic according to the given query and creates a summary of the topic-related documents. Topic-based summarization involves summarization of topic-related documents. It helps in getting a quick view of several documents in less time. Monolingual summarization produces a summary for one particular language, whereas multilingual summarization generates summaries for more than one language. Gupta (2013) proposed a summarization system for both Hindi and Punjabi, so this system is called a multilingual summarization system.

Fig. 1. Basic methodology for text summarization

The rest of the paper is organized as follows. Section II reviews the state of the art in automatic text summarization for Indian languages. Section III briefly describes the issues in text summarization techniques for Indian languages. Section IV illustrates the impact of different NLP tools on the performance of summarization techniques. Finally, the paper concludes with directions for future work.

II. SUMMARIZATION TECHNIQUES FOR INDIAN LANGUAGES
The literature contains several papers on automatic text summarization for different Indian languages, using both extractive and abstractive techniques. A few of them are discussed here.

A. Hindi
Kumar et al. (2015) proposed a graph-based technique for Hindi text summarization. The basic idea is to find the important information in a Hindi document. The graph-based approach is used to find the relation between two sentences and the importance of each sentence with respect to the document. They used semantic similarity to measure the relevance between sentences. They assumed that sentences with high relevance to each other contain the same information, and that such a sentence should be added to the summary only if its importance is high. Kumar and Yadav (2015) proposed a technique based on thematic words. Thematic words are identified by evaluating the frequency of terms and their respective inverse frequency in the document. The method generates a list of thematic words and creates the summary using these words. The generated summary is further processed using Hindi WordNet. Gupta (2013) proposed an algorithm that summarizes documents written in both Hindi and Punjabi. It is based on a statistical approach that uses the following features: key phrases, cue phrases, nouns and verbs, negative terms, font feature, named entities, sentence position, sentence length, and numerical data. The system uses a regression function to weight each feature.

B. Punjabi
Gupta and Lehal (2012) proposed a summarization method for Punjabi text. It scores the sentences based on nine weighted text features such as named entities, title words, and keywords. They used rule-based and dictionary-based approaches to recognise the Punjabi words related to the text features. Gupta and Kaur (2016) proposed another Punjabi text summarization method based on a hybrid model of a support vector machine and simple text features. They used an entropy-based approach for discovering important words in the document.

C. Bengali
Abujar et al. (2017) proposed a heuristic approach for Bengali text summarization. Different linguistic rules for the extraction of each text feature were used to obtain better results. For example, they find the effect rate of every word in the document from several parameters, such as the number of sentences in which the word appears, its appearance in paragraphs, and its repetition. Efat et al. (2013) also focused on text feature scores to summarize Bengali documents. Akter et al. (2017) proposed a Bengali summarizer based on K-means clustering. They clustered the sentences of the document into two clusters according to their feature scores, and the top-scored sentences from each cluster are extracted as summary sentences. Sarkar (2012) also proposed a Bengali text summarizer based on the text feature scores of sentences.

D. Marathi
Rathod (2018) proposed a Marathi text summarizer based on the TextRank technique of Mihalcea and Tarau (2004). It is a graph-based approach in which the PageRank algorithm is used to obtain the significance of sentences. It includes two unsupervised methods, one for keyword extraction and one for sentence extraction. Gaikwad (2018) proposed a rule-based Marathi text summarization method in which a set of questions based on the nouns is generated for each sentence. Thereafter, the questions are ranked according to their importance and the top-ranked questions are selected to obtain their answers. The collection of answers to these questions is taken as the summary of the document.

E. Tamil
Devi et al. (2011) proposed a graph-based summarization approach which was tested on Tamil text. It ranks each sentence based on its word frequencies and its Levenshtein distance to the other sentences. The average of these ranks is taken as the final rank of the sentence, and the top-ranked sentences are extracted to generate the summary. However, this method is language independent. Next, Banu et al. (2007) introduced a method for Tamil document summarization using a semantic graph. A set of linguistic rules is used to create the semantic graph for the document. Moreover, a support vector machine is used to extract the sub-graphs from the graph of the document. An LF parser is used to find the semantic similarity features.

F. Kannada
Jayashree et al. (2012) proposed a Kannada text summarizer using keyword extraction. They combined the GSS (Galavotti, Sebastiani, Simi) coefficient and the TF-IDF method to extract keywords from the document. A list of keywords for each category is discovered using GSS and TF-IDF, and the weight of each sentence is calculated as the sum of the weights of these keywords. Geetha and Deepamala (2015) also proposed a Kannada text summarizer, based on latent semantic analysis (LSA). In this work, they find the semantic relationship between sentences using LSA. Moreover, SVD is used to generate the summary of documents. Kallimani et al. (2010) proposed a Kannada text summarizer, 'KanSum', which is based on the concept of the AutoSum summary system, for single-document Kannada summarization. AutoSum is based on the features of first line, sentence position, numerical data, keywords, and a combination function.
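Several of the systems surveyed above rank sentences on a graph. As a rough illustration of that idea only, the following is a minimal sketch of TextRank-style sentence ranking in the spirit of Mihalcea and Tarau (2004). It is not the implementation of any surveyed system: the whitespace tokenizer, the function names, the damping factor of 0.85, the fixed number of iterations, and the English sample sentences are all assumptions made for this example.

```python
import math

PUNCTUATION = ".,!?;:\"'()|।"


def tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation (script-agnostic)."""
    stripped = (word.strip(PUNCTUATION).lower() for word in sentence.split())
    return [word for word in stripped if word]


def similarity(tokens_a, tokens_b):
    """Shared-word count normalised by the log lengths of both sentences."""
    if len(tokens_a) < 2 or len(tokens_b) < 2:
        return 0.0  # simplification: skip one-word sentences, since log(1) = 0
    shared = len(set(tokens_a) & set(tokens_b))
    return shared / (math.log(len(tokens_a)) + math.log(len(tokens_b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence-similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours in proportion
        # to the edge weight, using the scores of the previous iteration.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    """Return the k top-ranked sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The cat sat on the mat.",
    "The cat ate the fish.",
    "The fish swam in the pond.",
    "Quantum physics remains hard.",
]
print(summarize(sentences, k=2))
```

Only the tokenizer touches the language of the text; the ranking itself is language independent. A sentence that shares no words with the others keeps the baseline score of 1 - damping and is therefore never preferred.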

Fig. 2. Dependency in the performance of extractive text summarization (LD represents language dependent and LI represents
language independent).
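To make the language-dependent (LD) stage of Fig. 2 more concrete, here is a minimal sketch of LD preprocessing for Hindi: sentence segmentation, stop word removal, and longest-match suffix stripping. The stop word set and the suffix list are hypothetical and deliberately tiny, chosen only for this illustration; they are not the resources of any published tool, and the function names are likewise assumptions.

```python
import re

# Sentence terminators; "।" (U+0964) is the danda used in Hindi.
SENTENCE_END = re.compile(r"[।.!?|]")

# Hypothetical, deliberately tiny resources, for illustration only.
STOP_WORDS = {"है", "हैं", "में", "और", "का", "की", "के"}
SUFFIXES = ["ियाँ", "ों", "ें", "ी", "े", "ा"]


def split_sentences(text):
    """Naive segmentation: split on every terminator.

    A period inside an abbreviation would be split as well, which is
    exactly the ambiguity a real segmenter has to resolve.
    """
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def stem(word, suffixes=SUFFIXES, min_stem=2):
    """Strip the longest listed suffix that leaves at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    """Return, for each sentence, its stemmed non-stop-words."""
    result = []
    for sentence in split_sentences(text):
        content_words = [w for w in sentence.split() if w not in STOP_WORDS]
        result.append([stem(w) for w in content_words])
    return result


# "The books are in the cupboard. Ram is at home."
text = "किताबें अलमारी में हैं। राम घर में है।"
print(preprocess(text))
```

Everything after this stage (feature scoring, ranking, selection) can be shared across languages, which is why the quality of these small resources has such a large effect on the final summary.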

III. ISSUES IN AUTOMATIC TEXT SUMMARIZATION FOR INDIAN LANGUAGES
As illustrated in Fig. 1, the automatic text summarization (ATS) process is partitioned into two major parts. One part of the summarization process is language dependent and the other part is language independent. The latter is common to all summarizers regardless of language, while the former depends on the resources and tools available for the language of the document. Therefore, the main challenge for Indian-language summarizers is accurate preprocessing and feature extraction. Fig. 2 shows the dependencies in the performance of text summarization tools. It illustrates that a stop words list, stemmer, segmentation rules, named entity recognition system, wordnet and word vector dictionaries, sentiment analyzer, and training corpus are the basic essential tools and resources for text summarization. These are easily available for resource-rich languages but are limited for Indian languages. Here, we highlight some resources that are available for Indian languages.

A. Stop words list
The first step in summary generation is to remove stop words from the document and to recognize the unique keywords. In this regard, some stop words lists are available from IIIT Hyderabad [1], TDIL [2], Kaggle [3], and Ranks NL [4], where Kaggle and Ranks NL provide stop words only for Hindi, Bengali, and Marathi. Some researchers have also proposed methods for automatic generation of stop words lists from a corpus. Rani and Lobiyal (2018) proposed a method for constructing a Hindi stop words list using statistical and information-based methods. The information model, which is based on entropy, is used to find the significance of the terms in the corpus, while the statistical model, which is based on the TF-IDF feature, is used for weighting the terms. Every term receives one rank from each of these two models, and the final rank of a term is the sum of these two ranks. They found a total of 1475 stop words, which is large in comparison to other existing lists. This technique can also be applied to corpora of other languages to generate their stop words. In a similar way, Raulji and Saini (2017) proposed a method for generating Sanskrit stop words based on term frequency. However, no standard stop words list is available for Indian languages, which reduces the performance of text summarizers.

[1] https://ltrc.iiit.ac.in/showfile.php?filename=ltrc/internal/nlp/corpus/index.html
[2] http://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=1637&lang=en
[3] https://www.kaggle.com/rtatman/stopword-lists-for-19-languages
[4] https://www.ranks.nl/stopwords

B. Stemmer
Stemming is the process of normalizing the inflected words in natural language text. Only a little work has been done in this research area. Ramanathan and Rao (2003) introduced a lightweight Hindi stemmer which works by longest-match stripping using a human-generated list of 65 suffixes. Islam et al. (2007) proposed a lightweight Bengali stemmer based on the same approach as Ramanathan and Rao (2003). They used 72 suffixes for verbs, 22 for nouns, and 8 for adjectives. Majumder et al. (2007) proposed a corpus-based stemmer which works effectively for primarily suffixing languages such as Bengali. It clusters words that share a stem but are different variants, using a distance function, to find the stem word. Saharia et al. (2014) proposed a rule-based approach for stemming the words of the Assamese, Bengali, Bishnupriya Manipuri, and Bodo languages. They introduced a dictionary of frequent words to reduce over-stemming and under-stemming errors, and an HMM model to prevent errors in special cases. Dasgupta and Ng (2006) proposed a Bengali stemmer which is based on segmenting the words according to morphemes discovered in a large annotated corpus. Pandey and Siddiqui (2008) introduced a Hindi stemmer based on the probabilities of suffixes and stems occurring together, estimated from the EMILLE corpus. Majgaonker (2010) introduced a Marathi stemmer based on suffix stripping rules generated by human experts. Suba et al. (2011) introduced two versions of a Gujarati stemmer. One is a lightweight stemmer based on a hybrid approach and the other is a heavyweight rule-based stemmer. However, despite all these methods, no standard stemming tool is available for Indian languages whose performance is comparable to that of tools for resource-rich languages, which reduces the performance of text summarizers.

C. Sentence boundary detection rules
Sentence boundary detection, or segmentation, is the primary step of the text summarization task. It is the task of detecting every sentence in the document. Four punctuation marks are used to end sentences in Indian languages: period (.), exclamation mark (!), question mark (?), and pipe (|), i.e. the danda (।). Hindi, Bengali, and Punjabi use the pipe to end a declarative sentence, while the other languages use a period. The period is ambiguous because it is also used to mark an abbreviation. English also uses a period to end sentences, but it has other features, such as capitalization of the first character of every sentence, which are very helpful in detecting the sentence boundary. This is not an option for Indian languages, which places an extra burden on segmentation. Moreover, very limited work has been done in this area. Wanjari et al. (2016) proposed a rule-based segmentation method for Marathi, Parakh et al. (2011) proposed a rule-based segmentation method for Kannada, Ghosh et al. (2010) proposed a method for Bengali, and Devi and Lakshmi (2013) reported one for Malayalam. No work has been reported for other Indian languages.

D. Feature Extraction
Feature extraction is an essential part of summarizing text and requires a lot of text processing. It is mainly used to find the relevant or important sentences in the document. Oliveira et al. (2016), Verma and Om (2018, 2019a,b,c,d), and Verma et al. (2019) introduce 18 text features which require named entity recognition, semantic analysis, sentiment analysis, cue-phrase recognition, etc. on natural language text. Named entity recognition (NER) in text summarization can help in finding the centrality of the text. Moreover, a sentence that contains a number of these entities can be considered a significant sentence. A number of NER methods are available for some Indian languages, but their performance is still limited by the rules and corpora available for the language. Very limited work has been done in these areas, which affects summary generation for text in Indian languages. Moreover, the available wordnets contain few synsets in comparison to English, and word vectors are also limited.

IV. IMPACT OF NLP TOOLS IN THE PERFORMANCE OF TEXT SUMMARIZATION
In this section, we take four techniques which were implemented for summarization of Indian-language text. To show the impact of NLP tools, we have implemented these methods for English and compare the results. The techniques considered are the graph-based technique for Hindi text summarization (Kumar et al. 2015), a hybrid model for Punjabi text summarization (Gupta and Kaur 2016), the TextRank-based technique for Marathi (Rathod 2018), and the semantic graph-based Tamil summarizer (Banu et al. 2007). We have experimented on 100 news articles for each language.

Method                  Language  Precision  Recall  F1 score
Kumar et al. (2015)     Hindi     0.44       0.32    0.37
Kumar et al. (2015)     English   0.46       0.38    0.41
Gupta and Kaur (2016)   Punjabi   0.45       0.21    0.29
Gupta and Kaur (2016)   English   0.49       0.28    0.35
Rathod (2018)           Marathi   0.43       0.27    0.33
Rathod (2018)           English   0.47       0.31    0.37
Banu et al. (2007)      Tamil     0.42       0.31    0.35
Banu et al. (2007)      English   0.45       0.36    0.40
Table 1. Precision, recall, and F1 score of the summarization methods for different languages
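For readers who want to relate the three columns of Table 1, the sketch below shows how precision, recall, and F1 are commonly computed. The paper does not state the unit of comparison, so treating the system and reference summaries as sets of selected sentence identifiers is an assumption of this example, as are the function names. F1 is the harmonic mean of precision and recall; because the published precision and recall are rounded to two decimals, recomputing F1 from them may differ from the published value in the second decimal place.

```python
def precision_recall(system, reference):
    """Precision and recall of a system summary against a reference.

    Both arguments are sets of selected sentence identifiers (an
    assumption; the survey does not specify the matching unit).
    """
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hindi row of Table 1 (Kumar et al. 2015): P = 0.44, R = 0.32.
print(round(f1_score(0.44, 0.32), 2))

# Toy example with hypothetical sentence identifiers.
p, r = precision_recall({1, 2, 3}, {2, 3, 4, 5})
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))
```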

As illustrated in Table 1, all summarization methods perform better on English than on the corresponding Indian language in every case. This indicates that English NLP tools are more mature than those for the other languages.

CONCLUSION
In this paper, we briefly described existing text summarization methods for Indian texts. We also discussed the need for NLP tools during text summarization and their importance. We compared the results of existing text summarization techniques on Indian languages with their results on English and found that NLP tools affect the performance of any summarizer. Our analysis shows that, although many summarizers have been proposed, there is still a lack of summarizers for the Indian context, as most of the tools are either not easily available or do not perform satisfactorily. Most of the proposed summarizers are based on statistical approaches or on a combination of statistical and semantic models. There are also learning-based, fuzzy-based, and neural network-based approaches to text summarization. These approaches can be applied to achieve better performance for Indian languages. NLP tools for low-resource languages can also be matured with these techniques.

REFERENCES
Sheikh Abujar, Mahmudul Hasan, MSI Shahin, and Syed Akhter Hossain. 2017. A heuristic approach of text summarization for Bengali documentation. In Computing, Communication and Networking Technologies (ICCCNT), 2017 8th International Conference on. IEEE, 1–8.
Sumya Akter, Aysa Siddika Asa, Md Palash Uddin, Md Delowar Hossain, Shikhor Kumer Roy, and Masud Ibn Afjal. 2017. An extractive text summarization technique for Bengali document (s) using K-means clustering algorithm. In Imaging, Vision & Pattern Recognition (icIVPR), 2017 IEEE International Conference on. IEEE, 1–6.
M Banu, C Karthika, P Sudarmani, and TV Geetha. 2007. Tamil document summarization using semantic graph method. In iccima. IEEE, 128–134.
Sajib Dasgupta and Vincent Ng. 2006. Unsupervised morphological parsing of Bengali. Language Resources and Evaluation 40, 3-4 (2006), 311–330.
Sobha Lalitha Devi et al. 2011. Text Extraction for an Agglutinative Language. Language in India 11, 5 (2011).
Sobha Lalitha Devi and S Lakshmi. 2013. Malayalam clause boundary identifier: Annotation and evaluation. In Proceedings of the 4th Workshop on South and Southeast Asian Natural Language Processing. 83–90.
Md Iftekharul Alam Efat, Mohammad Ibrahim, and Humayun Kayesh. 2013. Automated Bangla text summarization by sentence scoring and ranking. In Informatics, Electronics & Vision (ICIEV), 2013 International Conference on. IEEE, 1–5.
Deepali Kailash Gaikwad. 2018. Rule Based Text Summarization for Marathi Text. Journal of Global Research in Computer Science 9, 5 (2018), 19–21.
JK Geetha and N Deepamala. 2015. Kannada text summarization using Latent Semantic Analysis. In Advances in Computing, Communications and Informatics (ICACCI), 2015 International Conference on. IEEE, 1508–1512.
Aniruddha Ghosh, Amitava Das, and Sivaji Bandyopadhyay. 2010. Clause identification and classification in bengali. In Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing. 17–25.
Vishal Gupta. 2013. Hybrid algorithm for multilingual summarization of Hindi and Punjabi documents. In Mining Intelligence and Knowledge Exploration. Springer, 717–727.
Vishal Gupta and Narvinder Kaur. 2016. A novel hybrid text summarization system for Punjabi text. Cognitive Computation 8, 2 (2016), 261–277.
Vishal Gupta and Gurpreet Lehal. 2012. Automatic Punjabi text extractive summarization system. Proceedings of COLING 2012: Demonstration Papers (2012), 191–198.
Md Islam, Md Uddin, Mumit Khan, et al. 2007. A light weight stemmer for Bengali and its Use in spelling Checker. (2007).
R Jayashree, Srikanta Murthy, and Basavaraj S Anami. 2012. Categorized Text Document Summarization in the Kannada Language by Sentence Ranking. In Intelligent Systems Design and Applications (ISDA), 2012 12th International Conference on. IEEE, 776–781.
Jagadish S Kallimani, KG Srinivasa, et al. 2010. Information retrieval by text summarization for an Indian regional language. In Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on. IEEE, 1–4.
K Vimal Kumar and Divakar Yadav. 2015. An Improvised Extractive Approach to Hindi Text Summarization. In Information Systems Design and Intelligent Applications. Springer, 291–300.
K Vimal Kumar, Divakar Yadav, and Arun Sharma. 2015. Graph Based Technique for Hindi Text Summarization. In Information Systems Design and Intelligent Applications. Springer, 301–310.
Mudassar M Majgaonker. 2010. Discovering suffixes: A case study for Marathi language. (2010).
Prasenjit Majumder, Mandar Mitra, Swapan K Parui, Gobinda Kole, Pabitra Mitra, and Kalyankumar Datta. 2007. YASS: Yet another suffix stripper. ACM transactions on information systems (TOIS) 25, 4 (2007), 18.
Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing.
Hilário Oliveira, Rafael Ferreira, Rinaldo Lima, Rafael Dueire Lins, Fred Freitas, Marcelo Riss, and Steven J Simske. 2016. Assessing shallow sentence scoring techniques and combinations for single and multi-document summarization. Expert Systems with Applications 65 (2016), 68–86.
Amaresh Kumar Pandey and Tanveer J Siddiqui. 2008. An unsupervised Hindi stemmer with heuristic improvements. In Proceedings of the second workshop on Analytics for noisy unstructured text data. ACM, 99–105.

Mona Parakh, N Rajesha, and M Ramya. 2011. Sentence Boundary Disambiguation in Kannada Texts. Language in India, www.languageinindia.com, Special Volume: Problems of Parsing in Indian Languages (2011), 17–19.
Ananthakrishnan Ramanathan and Durgesh D Rao. 2003. A lightweight stemmer for Hindi. In the Proceedings of EACL.
Ruby Rani and DK Lobiyal. 2018. Automatic Construction of Generic Stop Words List for Hindi Text. Procedia Computer Science 132 (2018), 362–370.
Yogeshwari V Rathod. 2018. Extractive Text Summarization of Marathi News Articles. (2018).
Jaideepsinh K Raulji and Jatinderkumar R Saini. 2017. Generating Stopword List for Sanskrit Language. In 2017 IEEE 7th International Advance Computing Conference (IACC). IEEE, 799–802.
Navanath Saharia, Utpal Sharma, and Jugal Kalita. 2014. Stemming resource-poor Indian languages. ACM Transactions on Asian Language Information Processing (TALIP) 13, 3 (2014), 14.
Kamal Sarkar. 2012. Bengali text summarization by sentence extraction. arXiv preprint arXiv:1201.2240 (2012).
Kartik Suba, Dipti Jiandani, and Pushpak Bhattacharyya. 2011. Hybrid inflectional stemmer and rule-based derivational stemmer for gujarati. In Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP). 1–8.
Nagmani Wanjari, GM Dhopavkar, and Nutan B Zungre. 2016. Sentence Boundary Detection for Marathi Language. Procedia Computer Science 78 (2016), 550–555.
Verma, P., & Om, H. (2016a). Extraction based text summarization methods on user’s review data: A comparative study. In International Conference on Smart Trends for Information Technology and Computer Communications (pp. 346-354). Springer, Singapore.
Verma, P., & Om, H. (2016b). Theme driven Text Summarization using k-means with gap statistics for Hindi Documents. In International conference on Computing, Communication and Sensor Network (pp. 90-94), IASTM Kolkata.
Verma, P., & Om, H. (2016c). A survey on Indian Language Text Summarization Techniques, Evaluation, and Existing Tools. In International conference on Innovative Systems (pp. 32), IRO Bangalore.
Verma, P., & Om, H. (2018). Fuzzy Evolutionary Self-Rule Generation and Text Summarization. In 15th International Conference on Natural Language Processing (p. 115).
Verma, P., & Om, H. (2019a). MCRMR: Maximum coverage and relevancy with minimal redundancy based multi-document summarization. Expert Systems with Applications, 120, 43-56.
Verma, P., & Om, H. (2019b). Collaborative ranking-based text summarization using a metaheuristic approach. In Emerging Technologies in Data Mining and Information Security (pp. 417-426). Springer, Singapore.
Verma, P., & Om, H. (2019c). A variable dimension optimization approach for text summarization. In Harmony Search and Nature Inspired Optimization Algorithms (pp. 687-696). Springer, Singapore.
Verma, P., & Om, H. (2019d). A novel approach for text summarization using optimal combination of sentence scoring methods. Sādhanā, 44(5), 110.
Verma, P., Pal, S., & Om, H. (2019). A Comparative Analysis on Hindi and English Extractive Text Summarization. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 18(3), 30.
***
