
Text Analytics: An Introduction to the Science and Applications of
Unstructured Information Analysis

John Atkinson-Abutridy
First edition published 2022
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL ­33487-2742

and by CRC Press


4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2022 John ­Atkinson-Abutridy

Reasonable efforts have been made to publish reliable data and information, but the author and
publisher cannot assume responsibility for the validity of all materials or the consequences of
their use. The authors and publishers have attempted to trace the copyright holders of all material
reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write
and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access
www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive,
Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact
[email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and
are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data


Names: Atkinson-Abutridy, John, author.
Title: Text analytics : an introduction to the science and applications of unstructured information
analysis / John Atkinson-Abutridy.
Description: First edition. | Boca Raton : CRC Press, 2022. |
Includes bibliographical references and index.
Identifiers: LCCN 2021056159 | ISBN 9781032249797 (hardback) |
ISBN 9781032245263 (paperback) | ISBN 9781003280996 (ebook)
Subjects: LCSH: Text data mining. | Natural language processing (Computer science) |
Semantics—Data processing. | Document clustering.
Classification: LCC QA76.9.D343 A85 2022 | DDC 006.3/12—dc23/eng/20220105
LC record available at https://lccn.loc.gov/2021056159

ISBN: 978-1-032-24979-7 (hbk)
ISBN: 978-1-032-24526-3 (pbk)
ISBN: 978-1-003-28099-6 (ebk)

DOI: 10.1201/9781003280996

Typeset in Minion
by codeMantra

eResources are available for download at
https://www.routledge.com/Atkinson-Abutridy/p/book/9781032249797
To Ivana, my wife and my light
Contents

List of Figures
List of Tables
Preface
Acknowledgments
Author

Chapter 1 ◾ Text Analytics
1.1 INTRODUCTION
1.2 TEXT MINING AND TEXT ANALYTICS
1.3 TASKS AND APPLICATIONS
1.4 THE TEXT ANALYTICS PROCESS
1.5 SUMMARY
1.6 QUESTIONS

Chapter 2 ◾ Natural-Language Processing
2.1 INTRODUCTION
2.2 THE SCOPE OF NATURAL-LANGUAGE PROCESSING
2.3 NLP LEVELS AND TASKS
    2.3.1 Phonology
    2.3.2 Morphology
    2.3.3 Lexicon
    2.3.4 Syntax
    2.3.5 Semantics
    2.3.6 Reasoning and Pragmatics
2.4 SUMMARY
2.5 EXERCISES
    2.5.1 Morphological Analysis
    2.5.2 Lexical Analysis
    2.5.3 Syntactic Analysis

Chapter 3 ◾ Information Extraction
3.1 INTRODUCTION
3.2 RULE-BASED INFORMATION EXTRACTION
3.3 NAMED-ENTITY RECOGNITION
    3.3.1 N-Gram Models
3.4 RELATION EXTRACTION
3.5 EVALUATION
3.6 SUMMARY
3.7 EXERCISES
    3.7.1 Regular Expressions
    3.7.2 Named-Entity Recognition

Chapter 4 ◾ Document Representation
4.1 INTRODUCTION
4.2 DOCUMENT INDEXING
4.3 VECTOR SPACE MODELS
    4.3.1 Boolean Representation Model
    4.3.2 Term Frequency Model
    4.3.3 Inverse Document Frequency Model
4.4 SUMMARY
4.5 EXERCISES
    4.5.1 TFxIDF Representation Model

Chapter 5 ◾ Association Rules Mining
5.1 INTRODUCTION
5.2 ASSOCIATION PATTERNS
5.3 EVALUATION
    5.3.1 Support
    5.3.2 Confidence
    5.3.3 Lift
5.4 ASSOCIATION RULES GENERATION
5.5 SUMMARY
5.6 EXERCISES
    5.6.1 Extraction of Association Rules

Chapter 6 ◾ Corpus-Based Semantic Analysis
6.1 INTRODUCTION
6.2 CORPUS-BASED SEMANTIC ANALYSIS
6.3 LATENT SEMANTIC ANALYSIS
    6.3.1 Creating Vectors with LSA
6.4 WORD2VEC
    6.4.1 Embedding Learning
    6.4.2 Prediction and Embeddings Interpretation
6.5 SUMMARY
6.6 EXERCISES
    6.6.1 Latent Semantic Analysis
    6.6.2 Word Embedding with Word2Vec

Chapter 7 ◾ Document Clustering
7.1 INTRODUCTION
7.2 DOCUMENT CLUSTERING
7.3 K-MEANS CLUSTERING
7.4 SELF-ORGANIZING MAPS
    7.4.1 Topological Maps Learning
7.5 SUMMARY
7.6 EXERCISES
    7.6.1 K-means Clustering
    7.6.2 Self-organizing Maps

Chapter 8 ◾ Topic Modeling
8.1 INTRODUCTION
8.2 TOPIC MODELING
8.3 LATENT DIRICHLET ALLOCATION
8.4 EVALUATION
8.5 SUMMARY
8.6 EXERCISES
    8.6.1 Modeling Topics with LDA

Chapter 9 ◾ Document Categorization
9.1 INTRODUCTION
9.2 CATEGORIZATION MODELS
9.3 BAYESIAN TEXT CATEGORIZATION
    9.3.1 Conditional Class Probability
    9.3.2 A Priori Probability
    9.3.3 Evidence
    9.3.4 Classification
9.4 MAXIMUM ENTROPY CATEGORIZATION
9.5 EVALUATION
9.6 SUMMARY
9.7 EXERCISES
    9.7.1 Naïve Bayes Categorization
    9.7.2 MaxEnt Categorization

CONCLUDING REMARKS

BIBLIOGRAPHY

GLOSSARY

INDEX
List of Figures

Figure 1.1 Search versus data discovery
Figure 1.2 The scope of text mining
Figure 1.3 Document clustering
Figure 1.4 Information extraction
Figure 1.5 Text categorization
Figure 1.6 Relationship inference
Figure 1.7 The text mining process
Figure 2.1 Levels, tasks, and linguistic resources in NLP
Figure 2.2 A simple Markov model
Figure 2.3 Hidden Markov Models (HMM) with transition and emission probabilities
Figure 2.4 Syntactic analysis task
Figure 2.5 Context-Independent Grammar (CFG) rules
Figure 2.6 Parse tree for the sentence “The flight arrived”
Figure 2.7 Dependency grammar for “The flight arrived without problems”
Figure 2.8 Semantic graph for the sentence “The flight arrived without problems”
Figure 2.9 The task of Word Sense Disambiguation (WSD)
Figure 2.10 Structure of rhetorical relationships for an example text
Figure 3.1 A complaint text
Figure 3.2 Relation extraction to feed a template
Figure 3.3 Steps in information extraction
Figure 3.4 Simple and relational information extraction
Figure 3.5 Example of extraction based on cascading rules
Figure 3.6 Association and search for specific relationships
Figure 3.7 Extracting protein-protein interaction relationships
Figure 4.1 The indexing or characteristics generation task
Figure 4.2 Vector representation of sample texts
Figure 5.1 Documents transactions as shopping baskets
Figure 5.2 Generating frequent itemsets using the APRIORI method
Figure 5.3 Association rules search space
Figure 6.1 Word embeddings representation
Figure 6.2 Matrix decomposition using Singular Value Decomposition (SVD)
Figure 6.3 Selecting the best number of dimensions
Figure 6.4 Types of Word2Vec models
Figure 6.5 Examples of context windows for training
Figure 6.6 Architecture of a Continuous Bag of Words (CBOW) model
Figure 6.7 Updating hidden layer values
Figure 6.8 Computing values for output neurons
Figure 6.9 Output words prediction using the SoftMax classifier
Figure 6.10 Continuous Bag of Words (CBOW) training process to generate predictions
Figure 6.11 General network structure for context prediction
Figure 7.1 Simple news grouping
Figure 7.2 Generating document groups
Figure 7.3 Example of three clusters represented by 2-dimensional vectors
Figure 7.4 A document clustering system
Figure 7.5 Clusters distribution based on their centroids
Figure 7.6 Spatial distribution of document vectors
Figure 7.7 Generating three clusters from a corpus of documents
Figure 7.8 Different clusters based on their initial centroids
Figure 7.9 Selecting the best number of clusters
Figure 7.10 SOM general architecture
Figure 7.11 Some shapes of SOM topologies: (a) rectangle (2D), (b) octahedron (2D), and (c) linear (1D)
Figure 7.12 Geometric representation of clusters in Self-Organizational Map (SOM)
Figure 8.1 Topic distribution, words and documents
Figure 8.2 A mixture of three distributions
Figure 8.3 Geometric interpretation of topics
Figure 8.4 Dependency modeling between documents, topics and words
Figure 8.5 Modeling topics with Dirichlet distribution
Figure 8.6 Dirichlet distribution on a 2-simplex for different values of α
Figure 8.7 Topic visualization for a sample corpus
Figure 8.8 Evaluating a topic model according to perplexity
Figure 8.9 Evaluating a topic model according to coherence
Figure 9.1 Testing a text classification model
Figure 9.2 Training a text classification model
Figure 9.3 Entropy for probability distribution of X
Figure 9.4 Computing indicator functions and probability of classes
Figure 9.5 An example of a Receiver Operating Characteristic (ROC) curve
List of Tables

Table 3.1 Confusion Matrix
Table 3.2 Example of a Confusion Matrix
Table 4.1 Sample Documents and Terms
Table 4.2 Normalized TF Vector Representation
Table 4.3 Correlation between Document Vectors: (a) TF Model (b) TFxIDF Model
Table 6.1 Initial Matrix of Terms by Documents
Preface

When I thought about writing this book, I wasn’t sure where to start, as there
are many related topics that could be addressed. However, it was clear to me
that I wanted to share knowledge in the simplest possible way, so that
professionals with basic knowledge could not only understand the theoretical
and practical topics related to text analytics but also appreciate their
applications and cross-cutting impact in many areas.
The challenge wasn’t simple. As an academic, I’m used to writing specialized
scientific articles (papers) and technical books, giving technical talks to
scientific audiences, and directing scientific and technological projects
evaluated by experts; that is, activities aimed at professionals and
scientists who are able to understand complex issues. But this isn’t the case
for professionals and specialists in general, who need accessible and
understandable formal literature that explains both the fundamentals and the
applications in a simple way.
There were several alternatives to address the above, such as theoretical
books, which are already available. However, these are usually aimed at more
advanced professionals and/or postgraduate students, leaving aside practical
and applied aspects. On the other hand, there are several practical books
focused on programming aspects; however, they’re usually highly biased toward
the author’s own conception of the subject. Furthermore, in my experience
teaching text analytics and text mining courses for many years, both within
and outside Chile, I’ve seen how students, professionals, and technicians who
rely on such books often don’t understand what they’re doing: they can’t
compare or analyze methods or models, and they can’t go beyond what’s written.
This is mainly because such books are over-focused on coding in specific
languages, where, unfortunately, many functionalities are provided as “black
boxes” that hide all the technical details, making it hard to replicate the
results using other computational methods or tools.
The book’s nature was clear, then. It should be an introductory book that
combines the basic theoretical foundations of the different paradigms and
methods of text analytics with practical aspects, accompanied by examples in
a programming language. This would allow readers to fully understand the
background and logic of the computational methods while remaining flexible
enough to use and implement them in other languages or computational tools.
I wanted to take advantage of my academic experience to help professionals
better understand computational concepts and methods. For more than 25 years,
I’ve taught undergraduate and postgraduate courses related to text analytics,
text mining, natural-language processing, and artificial intelligence at
several national and foreign universities. While doing this, I learned a lot
from students and professionals alike: what was easy or difficult for them to
understand, how they questioned what was established, how they took something
complex and simplified it, and so on.
On the other hand, my extensive experience as a scientific researcher and
consultant, developing and leading scientific projects and transferring
technologies to the public and private sectors, has also shaped the way I’ve
approached this book. Indeed, much of what the book conveys has to do not only
with the foundations behind computational methods but also with the challenges
and considerations involved in the study, use, and design of computational
methods for real practical problems. Thus, the book is the result of both
types of experience, which allows us to understand not only the how but also
the why.
After all these years, why am I writing this text analytics book now?
Globally, this is a long-standing topic, and in Chile we started seeing it in
academia in the mid-1990s. However, in many ways, society and industry weren’t
prepared, and, to a certain extent, many didn’t see the need to be prepared.
They considered it abstract and, therefore, impractical, given the small
amount of data available to them at the time and the fact that many analytical
tasks were still performed manually, as in the old days.
However, the accessibility and size of data sources, particularly unstructured
ones such as texts and documents, have grown exponentially in the last 10
years. This has led to important scientific advances in new and improved text
analytics methods, as well as a greater need for companies not only to analyze
structured information automatically per se but also to generate insights from
it, improving their decision-making and business productivity.
Nor is it a coincidence that this book was written during two major crises,
one local and one international. First, there was the unprecedented social
crisis that affected Chile in October 2019. Then, since the end of 2019, the
world has been affected by the global COVID-19 pandemic, and even now, as I
finish this book, it is still uncertain when it will end. But what do these
two recent events have in common?
In addition to the already increasing information overload, in the case of
Chile, the social crisis generated an enormous flow of information, especially
on social media, whose analysis was key to making strategic decisions by
discovering links in informal textual information. In the case of the
pandemic, a large body of scientific literature (i.e., formal textual
information) has accumulated around the investigations that shed light on
various factors related to COVID-19. Due to its size (more than 45,000
articles) and complexity, advanced text analytics methods are needed to
automatically extract and discover relevant knowledge that allows research to
progress, a task that clearly cannot be performed manually by human experts.
The above shows once again not only the technological, critical, and strategic
relevance of text analytics but also the need for a book capable of
introducing this technology as a powerful decision-making tool.
Crises bring opportunities, and this book was written in the midst of a
crisis.

ABOUT THIS BOOK


This is an introduction to the science and applications of text analytics, or
text mining, which enables automatic knowledge discovery from unstructured
information sources. To this end, the book introduces the main concepts,
models, and computational techniques for solving real decision-making problems
that involve textual and/or documentary sources.

Audience
If you have textual data sources that you need to understand, this is the
right book for you. If you want to obtain textual data and take advantage of
it to discover and/or detect important knowledge, this is also the right book
for you. If you want to understand the paradigms and computational methods
used to perform text analytics, this book is a must for you. If you are
looking to understand the basic theoretical and practical aspects needed to
teach a course in text analytics, this is definitely a suitable guide for you.
Although this book is introductory, it would help if you are familiar with
the following:

• Basic knowledge of linear algebra, statistics, and probability.
• Basic concepts of Machine Learning.
• Some knowledge of Python. If you don’t know how to use it, it’s not a
problem since it’s really easy to learn. In particular, it would be useful
if you’re familiar with the definition of functions, the handling of storage
structures (e.g., tables, lists, matrices, data frames, dictionaries, and
files), and some basic visualization libraries.

Organization of the Book


This book has nine chapters, each of which contains two parts: (1) an
introductory part that teaches the main concepts, paradigms, methods, and
models; and (2) a second part showing practical exercises in Python on
everything that was studied in the chapter. In addition, for the reader’s
familiarity and as a complement to other literature, each chapter ends with
the basic terminology used internationally.

­Chapter 1: Text Analytics


This chapter introduces the main concepts, approaches, and applications for
the automatic analysis of unstructured information (i.e., texts), known as
text analytics. In addition, the text analytics process, tasks, and main
challenges are described.

Chapter 2: Natural-Language Processing
This chapter introduces the basic concepts and the computational and
linguistic techniques that make it possible for a computer to process natural
language. In addition, it describes the main techniques (e.g., morphological,
syntactic, semantic, and discourse analysis) and the way in which they address
different problems associated with processing language in texts written by
humans.

Chapter 3: Information Extraction
This chapter introduces the concepts and some methodologies for identifying
and extracting specific information from the body of a document using NLP
techniques (e.g., relation extraction, named-entity recognition). In addition,
the main problems, and how they can be solved to support text analytics tasks,
are described.

Chapter 4: Document Representation
This chapter introduces the different concepts, approaches, and models used to
computationally characterize and represent textual information in the form of
documents so that they can be used in text analytics tasks. Typical approaches
based on indexing methods and vector space models for documents are described
(e.g., term frequency models, inverse document frequency models).

Chapter 5: Association Rules Mining
This chapter introduces the main concepts, methods, and problems associated
with extracting patterns from documents in the form of association rules. The
main approaches (e.g., the APRIORI algorithm) and the metrics for evaluating
the quality of the discovered patterns are described.

Chapter 6: Corpus-based Semantic Analysis
This chapter explores the fundamentals of different techniques and models that
allow readers to study and model the meaning of words and documents. To this
end, different approaches are described for automatically generating
low-dimensional distributed representations, or word embeddings (e.g., LSA,
Word2Vec), that efficiently capture the meaning of words and documents in
context from a training corpus.

Chapter 7: Document Clustering
This chapter describes computational concepts and methods for performing
document clustering. Modern grouping principles, metrics, and algorithms
(e.g., K-means, self-organizing maps) are introduced to find hidden patterns
in a document corpus.

Chapter 8: Topic Modeling
This chapter introduces the main concepts and methods for grouping documents
based on latent topics within those documents. The main approaches for the
automatic generation of topics based on probabilistic models (e.g., pLSA,
LDA) are discussed.

Chapter 9: Document Categorization
This chapter describes the main concepts, models, and techniques for
performing automatic text categorization. Different probabilistic and
stochastic methods (e.g., the Naïve Bayes classifier, the maximum entropy
classifier) are described for predicting the category to which a document
belongs, based on a training corpus.

Exercises
Each chapter has an exercise section that shows examples and simple practical
applications, written in Python, of the models and methods described. The
concepts introduced in each chapter are cumulative, so the hands-on exercises
also reuse code from previous chapters. The sample programs are organized by
chapter, and a library of functions shared by several exercises across
chapters is provided in “utils.py”. All of this, together with the text
documents used in each exercise, is available on the book site:
https://www.routledge.com/Atkinson-Abutridy/p/book/9781032249797
Everything is programmed in Python 3.7 in the Spyder environment for Windows,
under the Anaconda¹ distribution. In addition, we use methods and functions
that are available in two libraries:

• NLTK²: The Natural Language Toolkit (NLTK) was created by academics and
researchers as a tool for building complex natural-language processing (NLP)
functions and algorithms for multiple human languages using Python. In
addition to libraries for different NLP tasks, NLTK includes sample datasets
and graphical demos that explain the principles of NLP tasks.
• spaCy³: This is an NLP library written in Python and Cython and designed
for building industrial systems. spaCy includes several pretrained machine
learning and statistical models for more than 50 languages.

¹ https://www.anaconda.com/products/individual
² https://www.nltk.org/
³ https://spacy.io/

Both provide functionality for similar tasks. However, NLTK is aimed more at
scientists and researchers who want to build and experiment with more complex
algorithms, while spaCy is oriented toward application development: it
provides access to the best-performing methods for each task but allows less
in-depth experimentation than NLTK.
To use both tools, some packages must be installed through the Anaconda
command console:

pip install spacy
pip install --user -U nltk

In addition, a pretrained language model for English⁴ must be installed in
spaCy. The small English model, en_core_web_sm, is trained on written web text
(blogs, news, and comments):

python -m spacy download en_core_web_sm

The sample documents used in the exercises of each chapter, in the form of
short news items, are available in the corpus folder on the book site. This
folder contains a sample of international news summaries divided into three
subdirectories: music (22 summaries), sports (10 summaries), and coronavirus
(16 summaries). In addition, some exercises use data sets available in three
CSV files: “training.csv” (data for model training), “testing.csv” (data for
model testing), and “new.csv” (data for model application).
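The setup described above can be checked with a short script. The following is only a minimal sketch using the Python standard library: it reports whether the two libraries can be located by the current interpreter and counts the files in each subdirectory of the corpus folder. The helper names (installed, count_documents) and the local path "corpus" are assumptions made for illustration; they are not part of the book's utils.py, NLTK, or spaCy.

```python
import importlib.util
from pathlib import Path


def installed(package_names):
    """Map each package name to True if the interpreter can locate it."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in package_names
    }


def count_documents(corpus_dir):
    """Count the files in each immediate subdirectory of corpus_dir."""
    root = Path(corpus_dir)
    if not root.is_dir():
        # Nothing to count if the folder has not been downloaded yet.
        return {}
    return {
        sub.name: sum(1 for f in sub.iterdir() if f.is_file())
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


if __name__ == "__main__":
    for name, found in installed(["nltk", "spacy"]).items():
        print(f"{name}: {'found' if found else 'missing'}")
    # "corpus" is an assumed local path to the folder from the book site.
    for topic, n in count_documents("corpus").items():
        print(f"{topic}: {n} documents")
```

If the downloaded folder matches the description above, and assuming one file per summary, the counts for music, sports, and coronavirus should correspond to the numbers of summaries listed.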

John Atkinson-Abutridy
Santiago, Chile

⁴ https://spacy.io/models/en
Acknowledgments

I thank all those who helped with the reading and revision
of the draft of this book in its different stages, providing comments,
corrections, and invaluable feedback: Graciela Mardones, Margarita
Hantke, Mabel Vega, Diego Palma, Eladio Lisboa, Diego Reyes, Victor
Toledo, Rodolfo Abanto, Gonzalo Gómez, Carlos Parrá, and Francisco
Pallauta. Special thanks to Andrés Morales who made a wonderful front
cover design.
I want to thank my wife Ivana, who always supported and urged me
to write this book. I owe her a lot for all the time she sacrificed so I could
move forward and finish this book.

Author

John Atkinson-Abutridy earned a PhD in Artificial Intelligence from the
University of Edinburgh, UK. He is currently a full-time professor at the
Faculty of Engineering and Sciences of the Adolfo Ibañez University (Santiago,
Chile) and has been a full-time professor at other Chilean universities such
as the Technical University Federico Santa María (Valparaíso) and the
University of Concepción (Concepción), as well as visiting professor and
researcher at various European, North American, and Asian universities. His
main research areas include natural-language processing, text analytics,
artificial intelligence, and bio-inspired computing, on which he has published
more than 90 scientific articles. In the last 24 years, Dr. Atkinson-Abutridy
has directed several scientific and technological projects at national and
international levels and has been a visiting researcher at several
universities and research centers in the USA, Europe, Asia, and South America.
As a business consultant, he has led the implementation of advanced
technological projects in artificial intelligence, as well as transferred and
commercialized intelligent systems for multiple productive sectors. He is a
professional member of the AAAI (Association for the Advancement of Artificial
Intelligence) and a senior member of the ACM (Association for Computing
Machinery).
Chapter 1

Text Analytics

DOI: 10.1201/9781003280996-1

1.1 INTRODUCTION
There are thousands of scientific articles in the world on viruses and
diseases that human specialists aren’t able to read or analyze. How could
computers process such documents and make discoveries and/or detect patterns
of interest so that humans can make decisions about new treatments, drugs, and
interactions between bio-components? A company receives hundreds of complaints
or inquiries from customers daily through its website or emails. How could
this company analyze those complaints to study and determine common behaviors
and customer profiles in order to offer them a better service? An Internet
news outlet receives hundreds of national and international news reports
weekly. How could this outlet synthesize, group, or characterize them to
provide more filtered and digested information to readers seeking specific
data? As a result of several national events, various public bodies receive
thousands of opinion messages through social networks such as Twitter. How
could these messages be analyzed in order to determine users’ trends and/or
preferences regarding those events?
Clearly, in recent decades, we’ve experienced enormous growth in the data
available in various electronic media. The information overload is such that
it becomes very difficult to take advantage of such data using conventional
technologies, so new capabilities are required to analyze it efficiently. The
appropriate approach depends on the nature of the information, which in
general can be divided into two large groups:
• Structured data: Data that have been organized in repositories such as a
database, so that their elements can be accessed by effective analysis and
processing methods (e.g., an SQL table).
• Unstructured data: Data that don’t have a predefined structure or model, or
that aren’t organized in a predefined way, making them hard to analyze using
traditional computational methods (e.g., news and customer complaints).

Depending on the nature of the data, we can perform two types of tasks on
them: search and discovery, as shown in Figure 1.1.
A search task is goal-oriented, which means that you must provide a clear
criterion to receive the results that you need (i.e., a condition that must be
met by the data attributes). In this scenario, we’re not looking for anything
new; we’re only reducing the information overload by retrieving only the data
that satisfy certain conditions (Zhai & Massung, 2016). Then,

• If data are structured: We must specify some condition, key, or
characteristic of the data we want to search. For example, you may want
to retrieve from an SQL database the information of all the clients
that registered with a company in 2018. For this, there are usually
database engines capable of efficiently accessing, querying, and
retrieving data from a previously specified combination of attributes
(i.e., a structured query).
• If data are not structured: We must then search for documents relevant
to a query consisting of a list of keywords. For example, you may want
to search online for documents that contain the terms rent and houses.
For this, information retrieval (IR) technologies (Büttcher et al.,
2010) are usually available in the form of web search engines such as
Google and Yahoo or specialized search systems (e.g., MEDLINE medical
literature search engines¹).

FIGURE 1.1 Search versus data discovery.
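The two kinds of search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clients table, its columns, and the sample documents are invented for the example, and the keyword search is a naive exact-word match rather than a real IR engine.

```python
import sqlite3

# Structured search: an in-memory table of clients (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (name TEXT, registered_year INTEGER)")
conn.executemany(
    "INSERT INTO clients VALUES (?, ?)",
    [("Ana", 2017), ("Luis", 2018), ("Marta", 2018)],
)
# The structured query states the condition on an attribute explicitly.
rows = conn.execute(
    "SELECT name FROM clients WHERE registered_year = ? ORDER BY name",
    (2018,),
).fetchall()
clients_2018 = [name for (name,) in rows]

# Unstructured search: keep the documents that contain every keyword.
documents = [
    "Houses for rent near the city center",
    "Rent prices keep rising",
    "New houses for sale",
]


def keyword_search(docs, keywords):
    """Return the documents containing all keywords (case-insensitive)."""
    wanted = {k.lower() for k in keywords}
    return [d for d in docs if wanted <= set(d.lower().split())]


matches = keyword_search(documents, ["rent", "houses"])
print(clients_2018)
print(matches)
```

In both cases nothing new is discovered: the query only narrows the data down to the items that satisfy a condition given in advance.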

Unlike search, a discovery task is by nature opportunistic; that is, you
don’t know in advance what you’re looking for, so hypotheses about the
data are automatically explored to discover new opportunities in the
form of hidden (or latent) patterns, which can be interesting and novel.
Then,

• If data are structured: We must have some discovery task in mind so
that some Data Mining technology (Tan et al., 2018) can later mine the
data to discover or extract hidden patterns that are actionable, that
is, patterns that can be acted upon in some process to produce real
results. For example, given a database of purchase transactions made by
customers in a supermarket, we would like to know if there’s any
behavior pattern that allows us to understand how these purchases are
associated with each other, in order to make recommendations, create
better promotions, adjust the product layout, etc.
• If data are not structured: We must have some discovery task in mind
about textual data, so that some Text Mining or Text Analytics
technology can later automatically discover hidden patterns in texts
that support decision-making. For example, given a set of documents
that describe complaints from clients of a company, we would like to
find patterns that allow characterizing these complaints, finding
nonobvious connections between them, and grouping them to generate
recommendations.
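As a rough illustration of discovery on structured data, the supermarket example can be approximated by counting which pairs of products are bought together. This is a minimal sketch, assuming a toy list of transactions and an arbitrary support threshold; it is not a full association-rule mining method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase transactions (one set of products per customer visit).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

# Count how many transactions contain each pair of products.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep only the pairs that appear in at least `min_support` transactions.
min_support = 3  # assumed threshold, chosen for this toy data
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)
```

No pair was asked for in advance: the frequent pair emerges from the data, which is what distinguishes discovery from search.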

The nature of unstructured data and the complexity of its analysis have
generated a growing need for technologies that can analyze it and
automatically discover insights (i.e., hidden aspects of how
users/clients act, which can generate opportunities for new
products/services, strategies, etc.). This need is even more evident at
the business level, considering that unstructured information represents
more than 85% of the data handled by corporations. Hence, this has had
an impact that cuts across practically all industrial, public,
scientific, and technological areas. Thus, we can find different types
of textual information, including emails, insurance claim statements,
news pages, scientific articles, patent descriptions, customer
complaints, business contracts, and opinions on forums and/or social
networks, among others.

¹ https://www.nlm.nih.gov/bsd/medline.html.
Clearly, it’s not possible to analyze this kind of data with known Data
Mining techniques, because of its linguistic nature: human knowledge is
expressed in an unstructured and free way. Hence, specific computational
techniques are required to discover patterns of interest in those textual
information sets.

1.2 TEXT MINING AND TEXT ANALYTICS


Text mining and text analytics are terms that are often used
interchangeably. Text mining is the automated process of examining large
collections of documents or corpora to discover patterns or insights
that may be interesting and useful (Ignatow & Mihalcea, 2017; Struhl,
2015; Zhai & Massung, 2016). For this, text mining identifies facts,
relationships, and patterns that would otherwise be buried in textual
data (Atkinson & Pérez, 2013). This information can be converted to a
structured form that can later be analyzed and integrated with other
types of systems (e.g., business intelligence, databases, and data
warehouses). On the other hand, text analytics synthesizes the results
of text mining so that they can be quantified and visualized in a way
that supports decision-making, producing actionable insights, so text
mining encompasses broader aspects than text analytics.
The applications of text analytics in industrial and business areas are
many, including document clustering, text categorization, information
extraction to populate databases, text generation, association discovery,
etc. However, since the goal is to automatically analyze textual informa-
tion sources that are written in natural language by humans, computa-
tional methods (­Jurafsky et al., 2014) must be able to address three key
linguistic problems:

1. Ambiguity: Natural language is by nature a communication mode
characterized by inherent ambiguity. In linguistics, ambiguity arises
when some linguistic object has multiple interpretations or meanings.
Thus, ambiguity can be lexical (i.e., a single word with more than one
meaning), syntactic (i.e., a single sentence that has several possible
grammatical structures), semantic (i.e., a sentence with several
possible interpretations), or pragmatic (i.e., a sentence with several
possible contexts to determine its intention). To understand why this
is relevant to text mining, consider the following two sentences
extracted from informal texts, found when searching for the word nail:
The nails of the installation are rusty.
Her nails are split after falling out.
Assume the desired task was to group phrases like these to determine
common patterns. In this case, if we compare these sentences using only
a few words, both would be placed in the same group. However, you know
that this isn’t right: the sentences refer to very different topics,
because the same word has two interpretations.
2. Dimensionality: Given the lexical ambiguity of the previous example,
if you try to compare both sentences, which have a simple syntactic
structure and just a few words, you could surely do so without much
difficulty, although the analysis would be quite limited. However, the
reality is much more complex, since a text written in natural language
is highly dimensional, that is, it has many characteristics or
dimensions that can describe it. Each dimension could be a word, a term
(e.g., “San Francisco”), a phrase, etc.; so, if you consider
collections of many texts or documents, conventional data analysis
methods are clearly not enough. For example, the dimensions of a Twitter
message are all the words and symbols it contains, and if thousands or
millions of messages are considered, the number of dimensions grows
enormously, increasing the difficulty of some analysis tasks.
3. Linguistic Knowledge: For a human reader, the previous example
sentences are relatively simple to understand for further analysis.
However, for a computational method to be able to understand them, it
needs a lot of lexical (i.e., Do I know the word?), syntactic (i.e., Is
the phrase well formed?), semantic (i.e., What’s the meaning of the
phrase?), and pragmatic (i.e., What’s the text trying to communicate as
a whole?) knowledge.
For example, consider the following opinion taken from a social
network: “I didn’t like your customer service”. Suppose we want to
automatically determine if it expresses a positive or negative emotion
about a product or service. Clearly, for this to be effective, a
computational method should have or infer lexical (i.e., Are the words
known and relevant?), syntactic (i.e., Is the sentence well written?),
and semantic (i.e., What’s the literal meaning of the phrase?)
knowledge. However, this analysis is not enough, as pragmatic knowledge
is also required (To whom is this opinion referring in the context?
What is it trying to communicate?), which allows reasoning about the
implicit intentions of that statement and can feed further analysis
tasks. Otherwise, the question remains open: the writer didn’t like the
customer service, but whose?
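The first two problems can be made concrete with the two example sentences about nails. The following is a minimal sketch, assuming a naive whitespace tokenizer and word overlap (Jaccard) as the comparison measure; neither is prescribed by the text.

```python
def tokens(text):
    """Lowercase a sentence and split it into a set of words."""
    return {w.strip(".,").lower() for w in text.split()}


s1 = "The nails of the installation are rusty."
s2 = "Her nails are split after falling out."

# Ambiguity: a purely lexical comparison sees the shared word "nails"
# and cannot tell that it has a different meaning in each sentence.
shared = tokens(s1) & tokens(s2)

# Dimensionality: every distinct word is one dimension of the data.
vocabulary = tokens(s1) | tokens(s2)
jaccard = len(shared) / len(vocabulary)

print(sorted(shared))
print(len(vocabulary))
print(round(jaccard, 2))
```

Even two short sentences already need one dimension per distinct word, and the shared words say nothing about which sense of nail is intended; resolving that requires the linguistic knowledge described in the third point.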

To address these insight discovery aspects in textual information, text
mining combines three areas: Natural-Language Processing, Machine
Learning, and IR, as shown in Figure 1.2.
FIGURE 1.2 The scope of text mining.

Natural-Language Processing (NLP) provides theories, models, and
methods so that a computer can understand natural language (written or
spoken) at different linguistic levels (i.e., phonetic, morphological,
lexical, syntactic, semantic, discursive, and pragmatic). In practice,
NLP techniques focus on creating systems that process textual
information in order to make it accessible to other computer
applications. Many of these applications require retrieving information
from specific unstructured data sources (e.g., texts, images, and
videos), where the key step is to analyze the sources using some measure
of relevance (i.e., importance) with respect to a given input query, so
that the results can be made available to other tasks and applications.
To this end, information retrieval (IR) methods and models can be used
(Büttcher et al., 2010), in which NLP plays a fundamental role in
characterizing and “understanding” some elements of the information
(e.g., documents) that’s retrieved. Many NLP tasks can be solved by
using traditional rule-based approaches and probabilistic methods.
However, in many language comprehension and textual analytics tasks,
problems are complex and non-deterministic in nature, so there are no
efficient algorithmic methods to address them. In those cases, machine
learning (ML) is applied, an AI area that provides computational
techniques allowing a computer to learn how to perform a task based on
experience (Wilmott, 2020; Mohri et al., 2018). Thus, an ML system
improves its performance with experience, without the need to write
explicit rules or models. Models automatically created by ML are thus
capable of generalizing behaviors to unknown cases, improving the
performance of certain tasks.
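The idea of a relevance measure with respect to an input query, mentioned above for IR, can be illustrated with a very simple scoring function. This sketch assumes that relevance is just the number of times the query terms occur in a document; real IR models use more refined weighting, and the documents here are invented for the example.

```python
from collections import Counter


def relevance(query, document):
    """Count how often the query terms occur in the document."""
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query.lower().split())


documents = [
    "weather report for the city",
    "rent prices rise again",
    "cheap houses for rent",
]
query = "rent houses"

# Rank the documents from most to least relevant to the query.
ranked = sorted(documents, key=lambda d: relevance(query, d), reverse=True)
for doc in ranked:
    print(relevance(query, doc), doc)
```

Unlike a structured query, the result is a ranking rather than an exact match: every document gets a score, and the most relevant ones come first.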

1.3 TASKS AND APPLICATIONS


Depending on the text analytics question you want to answer or the
discovery task you have in mind, virtually all kinds of textual
information can clearly be analyzed at different levels of linguistic
and analytical complexity. Although specific problems and tasks are
described in detail throughout this book, it’s important to provide an
overview to understand some of the typical tasks. For this, let’s assume
the following:

• You have a collection of documents (e.g., news, reviews, articles,
tweets, etc.).
• For simplicity, each document is represented as a list of two
real-valued features, which results in a 2-dimensional numeric vector.
• For visualization purposes, each document will simply be represented
as a data point in a 2-dimensional space.

Then some tasks could include:

• Text Clustering: Suppose that you want to group written customer
complaints according to some mathematical measure of closeness (e.g.,
cosine similarity), which helps in understanding how such information is
related. For example, Figure 1.3 shows black data points representing
such complaints (i.e., documents) according to two characteristics.
These have been grouped by some method into two complaint groups (i.e.,
red circles) that could express certain “hidden” patterns in the
complaints contained in the documents, which might aid in business
decision-making.
FIGURE 1.3 Document clustering.

• Information Extraction: Suppose you have a large collection of
documents describing traffic accidents, from which you want to extract
specific pieces of information to feed other applications (e.g., Date,
Place, Type of Accident, City). Figure 1.4 shows data points
representing these documents, from which specific information can be
extracted to fill out some template or to be transferred to an SQL
database.

FIGURE 1.4 Information extraction.
• Text Categorization: Suppose you receive many messages from social
networks like Twitter and want to automatically separate them depending
on whether they express a positive or negative emotion about your
products and services. Figure 1.5 shows data points representing
different types of opinions. For this, categorization methods can be
applied, which automatically classify newly received opinions using
models built from experience. This should allow separating positive
polarity views (i.e., yellow data points) from those expressing negative
sentiments (i.e., blue data points).

FIGURE 1.5 Text categorization.

• Relationship Inference: Suppose that you suspect that complaints
about a product/service (e.g., broadband) from your company are closely
associated with complaints about certain staff of that company.
However, the specific link between the two is unknown, so you want to
explore the information to find out more details about the possible
link. Figure 1.6 shows that, once both are identified in a collection of
complaint documents, a textual analysis method could be used to identify
some semantic references between them. Next, this method could identify
specific relationships that account for this link, to support future
decisions.

FIGURE 1.6 Relationship inference.
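Under the simplifying assumptions stated earlier (each document is a 2-dimensional vector), the closeness measure used for grouping can be sketched as follows. The document vectors and the two reference points are invented for the illustration, and the code shows only the assignment of documents to the most similar reference point, not a complete clustering or classification algorithm.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Hypothetical complaints, each described by two real-valued features.
documents = {
    "c1": (0.9, 0.1),
    "c2": (0.8, 0.3),
    "c3": (0.1, 0.9),
    "c4": (0.2, 0.7),
}

# Assumed reference points standing in for two group centers.
centers = {"group_a": (1.0, 0.0), "group_b": (0.0, 1.0)}

# Assign every document to the center it is most similar to.
groups = {name: [] for name in centers}
for doc_id, vector in documents.items():
    best = max(centers, key=lambda name: cosine(vector, centers[name]))
    groups[best].append(doc_id)

print(groups)
```

In clustering, the centers would themselves be estimated from the data; in categorization, they would be learned from previously labeled examples. The similarity computation is the same in both cases.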

1.4 THE TEXT ANALYTICS PROCESS


In a perfect world, tasks such as those described above could be carried
out using some computational textual analysis method (Bengfort & Bilbro,
2018; Ignatow & Mihalcea, 2017; Aggarwal, 2018), assuming that the input
textual data are ready to be processed and that the results are always
the best. However, this assumption is quite far from reality, and
several activities must be carried out before the analytical task
itself.
The process starts from a collection of texts automatically extracted
from a repository (e.g., a website or document database) or available
directly in some electronic media. Linguistically speaking, a collection
of documents on some subject, topic, genre, etc., is usually called a
corpus, and it can contain plain text or text with purpose-specific
annotations made by humans for further analysis.
As shown in Figure 1.7, the following activities of the text mining
process are carried out:

FIGURE 1.7 The text mining process.

• Text Pre-processing: This stage transforms an input text into a more
“digestible” form for computers. Usually, this involves performing
tasks such as tokenization, normalization, and noise removal.
Tokenization separates text strings into smaller pieces called tokens.
For example, paragraphs can be tokenized into sentences, and sentences
can be tokenized into words. Normalization, on the other hand,
standardizes all text (e.g., converting characters to lowercase), while
noise removal cleanses text, for example, by removing extra whitespace,
removing irrelevant characters, or reducing variability in the form of
the words (e.g., the word infected is reduced to the form infect). Note
that noise could also be reduced by removing many words that don’t
contribute to the analysis of a text, such as some articles,
prepositions, etc., usually known as stopwords.
The approaches to doing this stage are discussed in Chapters 2 and 3.
• Feature Generation: At this stage, relevant features must be extracted
from textual data, which can be used later to build textual analysis
models. This stage is critical since we need to convert free text into
some numerical representation that can be understood by further
analysis methods (Kuhn & Johnson, 2019). These features can usually be
words, short sequences of words, phrases, etc. As a text can have many
features or dimensions, the usual representation of a document is based
on vector models of the features and their importance.
The approaches to doing this stage are discussed in Chapters 3 and 4.
• Feature Selection: The previous activity could end up extracting too
many features to represent a document, and not all of them may be
relevant. This high dimensionality can not only make further tasks
inefficient (e.g., in computation time and memory space required) but
also generate too much noise in the data, making it difficult to
understand and therefore to discover patterns of interest. For textual
information, we need to evaluate and weigh the characteristics in order
to select the most relevant ones to represent the texts, a process
usually known as indexing (Büttcher et al., 2010). For example, the most
important words or phrases could be the characteristics of a text;
however, in an opinion analysis task, types of words such as adjectives
and names could be the important characteristics, as they allow
discriminating an opinion expressing a positive sentiment from another
expressing a negative sentiment. Usually, feature selection for
indexing purposes can be based on statistical methods, information
theory, ML models, linguistic taxonomies, etc. (Kendall & McGuinness,
2019).
Approaches to doing this stage are discussed in Chapter 4.
• Pattern Discovery: Once the best features are extracted and selected
to represent textual data, several textual analytical tasks can be
performed, depending on whether our focus is on the patterns we want to
discover or the result that can be generated from those patterns:
  • Patterns as the focus of discovery: This considers relationship
  extraction tasks, which explain specific links or associations
  between document elements.
  • Result as the focus of discovery: This considers tasks in which some
  analytical method tries to discover implicit patterns that are used
  to make decisions. This includes document classification, document
  clustering, analysis of similarity between texts, etc.
Approaches to doing this stage are discussed in Chapters 5–9.
• Pattern Evaluation and Interpretation: As in the previous stage, the
interpretation or evaluation of the patterns discovered by each task
will depend on whether the focus is on the patterns or the results.
This aims, on the one hand, to generate patterns that are ideally
useful, actionable, and novel, and, on the other hand, to feed back into
the previous stages of the process in case the quality of those patterns
needs to be improved. For example, a textual categorization task can be
evaluated by the accuracy of the classification of previously unseen
documents; a pattern discovery system can be evaluated by the degree of
novelty or the degree of understanding that the detected relationship
represents; the quality of a document clustering task can be determined
by the degree of novelty of the groups discovered or learned by the
method; an association analysis task between document items can be
evaluated by the level of relevance and correlation of the associations;
and similarly for other analysis methods.
Approaches to doing this stage are discussed in Chapters 5–9.
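The first three stages of the process can be sketched end to end on a toy corpus. This is only an illustration under stated assumptions: the stopword list is a tiny invented one, tokens are runs of letters, features are single words, and features are selected with a simple document-frequency threshold instead of the statistical or ML-based methods mentioned above.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "was", "and", "of"}  # tiny illustrative list


def preprocess(text):
    """Normalize to lowercase, tokenize into words, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


corpus = [
    "The service was slow and the staff was rude.",
    "Slow service, rude answers.",
    "The delivery was fast.",
]

# Stage 1: pre-processing.
tokenized = [preprocess(doc) for doc in corpus]

# Stage 3 (selection): keep the words found in at least two documents.
doc_freq = Counter(term for doc in tokenized for term in set(doc))
vocabulary = sorted(t for t, df in doc_freq.items() if df >= 2)

# Stage 2 (generation): one term-count vector per document, restricted
# here to the selected vocabulary to keep the vectors short.
vectors = [[doc.count(term) for term in vocabulary] for doc in tokenized]

print(vocabulary)
print(vectors)
```

The resulting vectors are the numerical representation on which a pattern discovery method, such as clustering or categorization, would then operate.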

1.5 SUMMARY
Text analytics is the science of examining and discovering interesting
and ideally actionable patterns in large collections of documents
(corpora) written in natural language. To make this possible, text
analytics combines techniques and models from NLP, ML, and IR. This
combination allows performing tasks such as document categorization,
text clustering, specific information extraction from documents,
association discovery, topic detection, etc. These tasks are the basis
for cross-cutting applications in all domains, both in the private and
public spheres, from scientific applications to industrial and business
applications.

1.6 QUESTIONS
In this section, some questions are proposed for discussion around tasks
and applications of text analytics:

1. What are the main differences between a text analytics application
and an NLP application?
2. What difficulties does linguistic ambiguity cause in a text analytics
or text mining task?
3. List two differences between analyzing informal texts on social
media and analyzing formal news texts.
4. How does the dimensionality of documents affect the performance
of a text analytics task?
5. Describe two problems that can arise when analyzing documents
using only lexical analysis, that is, at the word level.
6. Consider two applications that involve the handling of textual
information: one that allows hotel reservations to be made through
natural language and another that allows the detection of names of
personalities in news texts. In which of them is it necessary to use NLP
tools and in which is it required to use a textual analysis method?
7. How can an ML method help a text analytics task?
8. You need to automatically group all news reaching your email and
then store it in specific folders. What analysis methodology would
you use, a clustering method or a categorization method?
9. What’s the fundamental difference between an IR task and an infor-
mation extraction task?
10. What type of features could be selected as input to a text mining or
text analytics task?
11. In the text mining process, the evaluation of patterns discovered by
some analysis task is essential. In what ways could these patterns be
evaluated in order to generate insights?
12. Describe two types of patterns that can be discovered in text
analytics tasks.
13. In a textual analysis task, such as sentiment classification on
social networks, you could simply use keywords as features, so that an
automatic classifier can determine whether an opinion expresses positive
or negative sentiments. What problem will we encounter if such an
application uses only that type of feature?
14. State which of the following applications use NLP models and which
use text analytics approaches:
• Simple document search engine
• ­Keyword-based sentiment classifier
• ­Rules-based spam filter
• Virtual tutor that helps a child understand math
• Assess quality of texts written by job applicants
• Classify medical diagnoses
15. You know there is an important link between an organization X and
a person Y in a set of news articles. What text analytics approach
would you take to determine the specific link that exists?
16. A service company receives many written complaints (no more than
2–3 paragraphs) through its web portal. What kind of analysis method
would you use to generate statistics about the clients’ complaints for a
given date?
17. You have a large database of invention patent descriptions and want
to determine whether a new patent “­application” that comes to you is
similar to one that you already have in your database. What text ana-
lytics and/­or NLP approach would you use to address this problem?
Chapter 2

Natural-Language
Processing

2.1 INTRODUCTION
The following news excerpt was published in an international newspaper:
UK Prime Minister Boris Johnson has entered the Intensive Care Unit at
St. Thomas Hospital in London tonight as the symptoms of his coronavirus
worsen, acknowledged a Downing Street spokesman.
Johnson was admitted to the hospital on Sunday and until Monday after-
noon “­took the reins of the government”, according to Foreign Secretary
Dominic Raab, who may be forced to temporarily assume the leadership of
the cabinet.
Once you’ve carefully read this text, try to answer the following
questions:

1. What’s Boris Johnson?
2. Why was Boris Johnson admitted to St. Thomas?
3. What’s the role of Dominic Raab?
4. What’s the relationship between the first and second paragraphs?

With some ease you can recognize that Boris Johnson is a two-word term
that forms the name of a person (and not a thing) in question (1); that
there’s a verbal relationship (i.e., “…has entered”) indicating that
Boris Johnson was admitted to a hospital called St. Thomas in question
(2); that the term Dominic Raab is the name of a person, and the words
around it indicate that he’s the Foreign Secretary, in question (3); and
that the second paragraph explains a consequence of what’s described in
the first paragraph in question (4). Furthermore, with all this
information, you could even infer the objective of communicating this
news.

DOI: 10.1201/9781003280996-2
As humans, answering the above seems relatively straightforward.
However, cognitively speaking, our brains had to carry out multiple
language processing tasks, which are not only complex but also require
access to a lot of linguistic knowledge that we have acquired over the
years, almost without realizing it.

• Perhaps we had never seen the sentences in the example news.
However, we were able to generalize structures and/or “rules” regarding
the words used and the sentence formation.
• Every time we tried to answer some of the questions, we had to
understand the way the words are structured in the text and their
functions in it, the way they connect to form sentences, the mean-
ing of those words and sentences, how these sentences logically con-
nect to create a meaningful text, and, finally, in which way they’re
expressed to provide the communicative objective of the message.
• For this to be possible, we had to access several internal linguistic
resources acquired over the years: dictionaries, grammar rules, previ-
ous knowledge about the meaning of those words and sentences, etc.

Now suppose that we want to carry out these natural-language
comprehension tasks on longer texts than the example, such as all the
news items that appear every week in the media. Even with lots of
experience and training, it wouldn’t be feasible to perform such tasks
manually in a reasonable time. Hence, we clearly need computational
methods that allow us to process language automatically and efficiently
to understand human-written texts and infer important knowledge that
allows us to make decisions.

2.2 THE SCOPE OF NATURAL-LANGUAGE PROCESSING
Natural-Language Processing (NLP) is the area of Artificial Intelligence (AI)
that allows computers to understand human language to perform complex
tasks on different linguistic objects (e.g., speech, words, phrases, meaning).
Another random document with
no related content on Scribd:
monument to his memory in the portico of the church of St. Lorenzo,
with an inscription stating that it was built in honour of the traveller’s
father. Neither the exact date of his father’s death nor of his own has
hitherto been ascertained; but it is supposed that our illustrious
traveller’s decease took place either in the year 1323 or 1324.
According to Mr. Marsden’s opinion, he was then seventy years of
age; but if we follow the opinion of the majority of writers, and of M.
Walkenaer among the rest, he must have attained the age of
seventy-three or seventy-four. The male line of the Polos became
extinct in 1417, and the only surviving female was married to a
member of the noble house of Trevisino, one of the most illustrious in
Venice.
When the travels of Marco Polo first appeared, they were
generally regarded as a fiction; and this absurd belief had so far
gained ground, that when he lay upon his deathbed, his friends and
nearest relatives, coming to take their eternal adieu, conjured him,
as he valued the salvation of his soul, to retract whatever he had
advanced in his book, or at least such passages as every person
looked upon as untrue; but the traveller, whose conscience was
untroubled upon that score, declared solemnly in that awful moment,
that far from being guilty of exaggeration, he had not described one-
half of the wonderful things which he had beheld. Such was the
reception which the discoveries of this extraordinary man
experienced when first promulgated. By degrees, however, as
enterprise lifted more and more the veil from central and eastern
Asia, the relations of our traveller rose in the estimation of
geographers; and now that the world, though still containing many
unknown tracts, has been more successfully explored, we begin to
perceive that Marco Polo, like Herodotus, was a man of the most
rigid veracity, whose testimony presumptuous ignorance alone can
call in question.
To relate the history of our traveller’s work since its first publication
would be a long and a dry task. It was translated during his lifetime
into Latin (for the opinion of Ramusio that it was originally composed
in that language seems to be absurd), as well as into several modern
languages of Europe; and as many of those versions were made,
according to tradition, under the author’s own direction, he is thought
to have inserted some numerous particulars which were wanting in
others; and in this way the variations of the different manuscripts are
accounted for. The number of the translations of Marco Polo is
extraordinary; one in Portuguese, two in Spanish, three in German,
three in French, three or four in Latin, one in Dutch, and seven in
English. Of all these numerous versions, that of Mr. Marsden is
generally allowed to be incomparably the best, whether the
correctness of the text or the extent, riches, and variety of the
commentary be considered.
IBN BATŪTA.
Born about 1300.—Died after 1353.

This traveller, whose name and works were little known in Europe
before the publication of Professor Lee’s translation, was born at
Tangiers, in Northern Africa, about the year 1300. He appeared to be
designed by nature to be a great traveller. Romantic in his
disposition, a great lover of the marvellous, and possessing a
sufficient dash of superstition in his character to enable him
everywhere to discover omens favourable to his wishes, the slightest
motives sufficed to induce him to undertake at a day’s notice the
most prodigious journeys, though he could reckon upon deriving
from them nothing but the pleasure of seeing strange sights, or of
believing that he was fulfilling thereby the secret intentions of
Providence respecting him.
Being by profession one of those theologians who in those times
were freely received and entertained by princes and the great in all
Mohammedan countries, he could apprehend no danger of wanting
the necessaries of life, and had before him at least the chance, if not
the certain prospect, of being raised for his learning and experience
to some post of distinction. The first step in the adventures of all
Mohammedan travellers is, of course, the pilgrimage to Mecca, as
this journey confers upon them a kind of sacred character, and the
title of Hajjî, which is a passport generally respected in all the
territories of Islamism.
Ibn Batūta left his native city of Tangiers for the purpose of
performing the pilgrimage in the year of the Hejira 725 (A. D. 1324-
5). Traversing the Barbary States and the whole breadth of Northern
Africa, probably in company with the great Mogrebine caravan which
annually leaves those countries for Mecca, he arrived without
meeting with any remarkable adventure in Egypt, where, according
to the original design of his travels, he employed his time in visiting
the numerous saints and workers of miracles with which that
celebrated land abounded in those days. Among the most
distinguished of these men then in Alexandria was the Imam
Borhaneddin el Aaraj. Our traveller one day visiting this man,
“Batūta,” said he, “I perceive that the passion of exploring the various
countries of the earth hath seized upon thee!”—“I replied, Yes,” says
the traveller, “though I had at that time no intention of extending my
researches to very distant regions.”—“I have three brothers,”
continued the saint, “of whom there is one in India, another in Sindia,
and the third in China. You must visit those realms, and when you
see my brothers, inform them that they are still affectionately
remembered by Borhaneddin.”—“I was astonished at what he said,”
observes Batūta, “and determined within myself to accomplish his
desires.” He in fact regarded the expressions of this holy man as a
manifestation of the will of Heaven.
Having thus conceived the bold design of exploring the remotest
countries of the East, Ibn Batūta was impatient to be in motion; he
therefore abridged his visits to the saints, and proceeded on his
journey. Nevertheless, before his departure from this part of Egypt
he had a dream, which, being properly interpreted by a saint, greatly
strengthened him in his resolution. Falling asleep upon the roof of a
hermit’s cell, he imagined himself placed upon the wings of an
immense bird, which, rising high into the air, fled away towards the
temple at Mecca. From thence the bird proceeded towards Yemen,
and, after taking a vast sweep through the south and the regions of
the rising sun, alighted safely with his burden in the land of darkness,
where he deposited it, and disappeared. On the morrow the sage
hermit interpreted this vision in the sense most consonant with the
wishes of the seer, and, presenting our traveller with some dirhems
and dried cakes, dismissed him on his way. During the whole of his
travels Ibn Batūta met with but one man who equalled this hermit in
sanctity and wisdom, and observes, that from the very day on which
he quitted him he experienced nothing but good fortune.
At Damietta he saw the cell of the Sheïkh Jemaleddin, leader of
the sect of the Kalenders celebrated in the Arabian Nights, who
shave their chins and their eyebrows, and spend their whole lives in
the contemplation of the beatitude and perfection of God. Journeying
onwards through the cities and districts of Fariskūr, Ashmūn el
Rommān, and Samānūd, he at length arrived at Misr, or Cairo,
where he appears to have first tasted the pure waters of the Nile,
which, in his opinion, excel those of all other rivers in sweetness.
Departing from Cairo, and entering Upper Egypt, he visited,
among other places, the celebrated monastery of Clay and the
minyet of Ibn Khasib. Upon the mention of this latter place, he takes
occasion to relate an anecdote of a poet, which, because it is in
keeping with our notions of what a man of genius should be, we shall
here introduce. Ibn Khasib, raised from a state of slavery to the
government of Egypt, and again reduced to beggary, and deprived of
sight by the caprice and cruelty of a calif of the house of Abbas, had
while in power been a munificent patron and protector of literary
men. Hearing of his magnificence and generosity, a poet of Bagdad
had undertaken to celebrate his praises in verse; but before he had
had an opportunity of reciting his work, Khasib was degraded from
his high office, and thrown out in blindness and beggary into the
streets of Bagdad. While he was wandering about in this condition,
the poet, who must have known him personally, encountered him,
and exclaimed, “O, Khasib, it was my intention to visit thee in Egypt
to recite thy praises; but thy coming hither has rendered my journey
unnecessary. Wilt thou allow me to recite my poem?”—“How,” said
Khasib, “shall I hear it? Thou knowest what misfortunes have
overtaken me!” The poet replied, “My only wish is that thou shouldst
hear it; but as to reward, may God reward thee as thou hast others.”
Khasib then said, “Proceed with thy poem.” The poet proceeded:—

“Thy bounties, like the swelling Nile,
Made the plains of Egypt smile,” &c.

When he had concluded, “Come here,” said Khasib, “and open this
seam.” He did so. Khasib then said, “Take this ruby.” The poet
refused; but being adjured to do so, he complied, and went away to
the street of the jewellers to offer it for sale. From the beauty of the
stone, it was supposed it could have belonged to no one but the
calif, who, being informed of the matter, ordered the poet before him,
and interrogated him respecting it. The poet ingenuously related the
whole truth; and the tyrant, repenting of his cruelty, sent for Khasib,
overwhelmed him with splendid presents, and promised to grant him
whatever he should desire. Khasib demanded and obtained the
small minyet in Upper Egypt in which he resided until his death, and
where his fame was still fresh when Ibn Batūta passed through the
country.
Frustrated in his attempt to reach Mecca by this route, after
penetrating as far as Nubia, our traveller returned to Cairo, and from
thence proceeded by way of the Desert into Syria. Here, like every
other believer in the Hebrew Scriptures, he found himself in the
midst of the most hallowed associations; and strengthened at once
his piety and his enthusiasm by visiting the graves of Abraham,
Isaac, and Jacob, as well as the many spots rendered venerable by
the footsteps of Mohammed. As the believers in Islamism entertain a
kind of religious respect for the founder of Christianity, whom they
regard as a great prophet, Batūta did not fail to include Bethlehem,
the birthplace of Christ, in the list of those places he had to see.
Upon this town, however, as well as upon Jerusalem, Tyre, Sidon,
and others of equal renown in Syria, he makes few observations
which can assist us in forming an idea of the state of the country in
those times; but in return for this meagerness, he relates a very
extraordinary story of an alchymist, who had discovered the secret of
making gold, and exercised his supernatural power in acts of
beneficence.
From Syria he proceeded towards Mesopotamia, by Emessa,
Hameh, and Aleppo, and having traversed the country of the Kurds,
and visited the fortresses of the Assassins, the people who, as he
says, “act as arrows for El Malik El Nāsir,” returned to Mount
Libanus, which he pronounces the most fruitful mountain in the
world, and describes as abounding in various fruits, fountains of
water, and leafy shades. He then visited Baalbec and Damascus;
and, after remaining a short time at the latter city, departed with the
Syrian caravan for Mecca. His attempt to perform the pilgrimage, a
duty incumbent on all true Mussulmans, was this time successful:
the caravan traversed the “howling wilderness” in safety; arrived at
the Holy City; and the pilgrims having duly performed the prescribed
rites, and spent three days near the tomb of the prophet, at Medina,
Ibn Batūta joined a caravan proceeding through the deserts of Nejed
towards Persia.
The early part of this journey offered nothing which our traveller
thought worthy of remark; but he at length arrived at Kadisia, near
Kufa, anciently a great city, in the neighbourhood of which that
decisive victory was obtained by Saad, one of the generals of Omar,
over the Persians, which established the interests of Islamism, and
overthrew for ever the power of the Ghebers. He next reached the
city of Meshed Ali, a splendid and populous place, where the grave
of Ali is supposed to be. The inhabitants, of course, were Shiahs, but
they were rich; and Ibn Batūta, who was a tolerant man, thought
them a brave people. The gardens were surrounded by plastered
walls, adorned with paintings, and contained carpets, couches, and
lamps of gold and silver. Within the city was a rich treasury,
maintained by the votive offerings of sick persons, who then
crowded, and still crowd, to the grave of Ali, from Room, Khorasān,
Irak, and other places, in the hope of receiving relief. These people
are placed over the grave a short time after sunset, while other
persons, some praying, others reciting the Koran, and others
prostrating themselves, attend expecting their recovery, and before it
is quite dark a miraculous cure takes place. Our traveller, from some
cause or another, was not present on any of these occasions, and
remarks that he saw several afflicted persons who, though they
confidently looked forward to future benefit, had hitherto received
none.
The whole of that portion of Mesopotamia was at this period in the
power of the Bedouin Arabs, without whose protection there was no
travelling through the country. With them, therefore, Ibn Batūta
proceeded from Basra, towards various holy and celebrated places,
among others to the tomb of “My Lord Ahmed of Rephaā,” a famous
devotee, whose disciples still congregate about his grave, and
kindling a prodigious fire, walk into it, some eating it, others trampling
upon it, and others rolling in it, till it be entirely extinguished, while
others take great serpents in their teeth, and bite the head off. From
hence he again returned to Basra, the neighbourhood of which
abounded with palm-trees. The inhabitants were distinguished for
their politeness and humanity towards strangers. Here he saw the
famous copy of the Koran in which Othman, the son of Ali, was
reading when he was assassinated, and on which the marks of his
blood were still visible.
Embarking on board a small boat, called a sambūk, he descended
the Tigris to Abbadān, whence it was his intention to have proceeded
to Bagdad; but, adopting the advice of a friend at Basra, he sailed
down the Persian Gulf, and landing at Magul, crossed a plain
inhabited by Kurds, and arrived at a ridge of very high mountains.
Over these he travelled during three days, finding at every stage a
cell with food for the accommodation of travellers. The roads over
these mountains were cut through the solid rock. His travelling
companions consisted of ten devotees, of whom one was a priest,
another a muezzin, and two professed readers of the Koran, to all of
whom the sultan of the country sent presents of money.
In ten days they arrived in the territories of Ispahan, and remained
some days at the capital, a large and handsome city. From thence
he soon departed for Shiraz, which, though inferior to Damascus,
was even then an extensive and well-built city, remarkable for the
beauty of its streets, gardens, and waters. Its inhabitants likewise,
and particularly the women, were persons of integrity, religion, and
virtue; but our singular traveller remarks, that for his part he had no
other object in going thither than that of visiting the Sheïkh Majd
Oddin, the paragon of saints and workers of miracles! By this holy
man he was received with great kindness, of which he retained so
grateful a remembrance, that on returning home twenty years
afterward from the remotest countries of the east, he undertook a
journey of five-and-thirty days for the mere purpose of seeing his
ancient host.
The greater portion of the early life of Ibn Batūta was consumed in
visiting saints, or the birthplaces and tombs of saints: but his time
was not therefore misemployed; for, besides the positive pleasure
which the presence or sight of such objects appears to have
generated in his own mind, at every step he advanced in this sacred
pilgrimage his personal consequence, and his claims upon the
veneration and hospitality of princes and other great men, were
increased. As he may be regarded as the representative of a class of
men extremely numerous in the early ages of Islamism, and whose
character and mode of life are highly illustrative of the manners of
those times, it is important to follow the footsteps of our traveller in
his whimsical wanderings a little more closely than would otherwise
be necessary.
Proceeding, therefore, at the heels of the honest theologian, we
next find him at Kazerun, beholding devoutly the tomb of the Sheïkh
Abu Is-hāk, a saint held in high estimation throughout India and
China, especially by sailors, who, when tossed about by adverse or
tempestuous winds upon the ocean, make great vows to him, which,
when safely landed, they pay to the servants of his cell. From hence
he proceeded through various districts, many of which were desert
and uninhabitable, to Kufa and Hilla, whence, having visited the
mosque of the twelfth imam, whose readvent is still expected by his
followers, he departed for Bagdad. Here, as at Rome or Athens, the
graves of great men abounded; so that Ibn Batūta’s sympathies were
every moment awakened, and apparently too painfully; for,
notwithstanding that it was one of the largest and most celebrated
cities in the world, he almost immediately quitted it with Bahadar
Khan, sultan of Irak, whom he accompanied for ten days on his
march towards Khorasān. Upon his signifying his desire to return,
the prince dismissed him with large presents and a dress of honour,
together with the means of performing the pilgrimage to Mecca,
which, as an incipient saint, he imagined he could not too frequently
repeat.
Finding, on his return to Bagdad, that a considerable time would
elapse before the departure of the caravan for the Holy City, he
resolved to employ the interval in traversing various portions of
Mesopotamia, and in visiting numerous cities which he had not
hitherto seen. Among these places the most remarkable were
Samarā, celebrated in the history of the Calif Vathek; Mousul, which
is said to occupy the site of ancient Nineveh; and Nisibēn, renowned
throughout the east for the beauty of its position, and the
incomparable scent of the rose-water manufactured there. He
likewise spent some time at the city and mountain of Sinjar, inhabited
by that extraordinary Kurdish tribe who, according to the testimony of
several modern travellers, pay divine honours to the Devil.
This little excursion being concluded, Batūta found the caravan in
readiness to set out for Mecca, and departing with it, and arriving
safe in the Holy City, he performed all the ceremonies and rites
prescribed, and remained there three years, subsisting upon the
alms contributed by the pious bounty of the inhabitants of Irak, and
conveyed to Mecca by caravans. His travelling fit now returning, he
left the birthplace of the prophet, and repairing to Jidda, proceeded
with a company of merchants towards Yemen by sea. After being
driven by contrary winds to the coast of Africa, and landing at
Sūakin, he at length reached Yemen; in the various cities and towns
of which he was entertained with a hospitality so generous and
grateful that he seems never to be tired of dwelling on their praises.
He did not, however, remain long among his munificent hosts, but,
taking ship at Aden, passed over once more into Africa, and landed
at Zaila, a city of the Berbers. The inhabitants of this place, though
Mohammedans, were a rude, uncultivated people, living chiefly upon
fish and the flesh of camels, which were slaughtered in the streets,
where their blood and offals were left putrefying to infect the air.
From this stinking city he proceeded by sea to Makdasha, the
Magadocia of the Portuguese navigators; a very extensive place,
where the hospitable natives were wont, on the arrival of a ship, to
come down in a body to the seashore, and select each his guest
from among the merchants. When a theologian or a nobleman
happened to be among the passengers, he was received and
entertained by the kazi; and as Ibn Batūta belonged to the former
class, he of course became the guest of this magistrate. Here he
remained a short time, passing his days in banqueting and pleasure;
and then returned to Arabia.
During the stay he now made in this country he collected several
particulars respecting the trade and manners of the people, which
are neither trifling nor unimportant. The inhabitants of Zafār, the most
easterly city of Yemen, carried on at that period, he observes, a great
trade in horses with India, the voyage being performed in a month.
The practice he remarked among the same people of feeding their
flocks and herds with fish, and which, he says, he nowhere else
observed, prevails, however, up to the present day, among the
nations of the Coromandel coast, as well as in other parts of the
east. At El Ahkāf, the city of the tribe of Aād, there were numerous
gardens, producing enormous bananas, with the cocoanut and the
betel. Our fanciful traveller discovered a striking resemblance
between the cocoanut and a man’s head, observing that exteriorly
there was something resembling eyes and a mouth, and that when
young the pulp within was like brains. To complete the similitude, the
hair was represented by the fibre, from which, he remarks, cords for
sewing together the planks of their vessels, as also cordage and
cables, were manufactured. The nut itself, according to him, was
highly nourishing, and, like the betel-leaf, a powerful aphrodisiac.
Still pursuing his journey through Arabia, he crossed the desert of
Ammān, and met with a people extraordinary among
Mohammedans, whose wives were liberal of their favours, without
exciting the jealousy of their husbands, and who, moreover,
considered it lawful to feed upon the flesh of the domestic ass. From
thence he crossed the Persian Gulf to Hormuz, where, among many
other extraordinary things, he saw the head of a fish resembling a
hill, the eyes of which were like two doors, so that people could walk
in at one eye and out at the other! He now felt himself to be within
the sphere of attraction of an object whose power he could never
resist. There was, he heard, at Janja-bal, a certain saint, and of
course he forthwith formed the resolution to refresh himself with a
sight of him. He therefore crossed the sea, and hiring a number of
Turcomans, without whose protection there was no travelling in that
part of the country, entered a waterless desert, four days’ journey in
extent, over which the Bedouins wander in caravans, and where the
death-bearing simoom blows during the hot months of summer.
Having passed this desolate and dreary tract, he arrived in Kusistān,
a small province of Persia, bordering upon Laristān, in which Janja-
bal, the residence of the saint, was situated. The sheïkh, who was
secretly, or, as the people believed, miraculously, supplied with a
profusion of provisions, received our traveller courteously, sent him
fruit and food, and contrived to impress him with a high idea of his
sanctity.
He now entered upon the ancient kingdom of Fars, an extensive
and fertile country, abounding in gardens producing a profusion of
aromatic herbs, and where the celebrated pearl-fisheries of Bahrein,
situated in a tranquil arm of the sea, are found. The pearl divers
employed here were Arabs, who, tying a rope round their waists, and
wearing upon their faces a mask made of tortoise-shell, descended
into the water, where, according to Batūta, some remained an hour,
others two, searching among forests of coral for the pearls.
Ibn Batūta was possessed by an extraordinary passion for
performing the pilgrimage to Mecca; and now (A. D. 1332), the year
in which El Malik El Nāsir, sultan of Egypt, visited the holy city, set
out from Persia on his third sacred expedition. Having made the
necessary genuflexions, and kissed the black stone at the Kaaba, he
began to turn his thoughts towards India, but was prevented, we
know not how, from carrying his design into execution; and
traversing a portion of Arabia and Egypt, entered Room or Turkey.
Here, in the province of Anatolia, he was entertained by an
extraordinary brotherhood, to whom, as to all his noble hosts and
entertainers, he devotes a portion of his travels. This association,
which existed in every Turcoman town, consisted of a number of
youths, who, under the direction of one of the members, called “the
brother,” exercised the most generous hospitality towards all
strangers, and were the vigorous and decided enemies of
oppression. Upon the formation of one of these associations, the
brother, or president, erected a cell, in which were placed a horse, a
saddle, and whatever other articles were considered necessary. The
president himself, and every thing in the cell, were always at the
service of the members, who every evening conveyed the product of
their industry to the president, to be sold for the benefit of the cell;
and when any stranger arrived in the town, he was here hospitably
entertained, and contributed to increase the hilarity of the evening,
which was passed in feasting, drinking, singing, and dancing.
Travelling to Iconium, and other cities of Asia Minor, in all of which
he was received and entertained in a splendid manner, while
presents of slaves, horses, and gold were sometimes bestowed
upon him, he at length took ship at Senab, and sailed for Krim
Tartary. During the voyage he endured great hardships, and was
very near being drowned; but at length arrived at a small port on the
margin of the desert of Kifjāk, a country over which Mohammed
Uzbek Khan then reigned. Being desirous of visiting the court of this
prince, Ibn Batūta now hired one of those arabahs, or carts, in which
the inhabitants travel with their families over those prodigious plains,
where neither mountain nor hill nor tree meets the eye, and where
the dung of animals serves as a substitute for fuel, and entered upon
a desert of six months’ extent. Throughout these immense steppes,
which are denominated desert merely in reference to their
comparative unproductiveness, our traveller found cities, but thinly
scattered; and vast droves of cattle, which, protected by the
excessive severity of the laws, wandered without herdsmen or
keepers over the waste. The women of the country, though they
wore no veils, were virtuous, pious, and charitable; and consequently
were held in high estimation.
Arriving at the Bish Tag, or “Five Mountains,” he there found the
urdu (whence our word horde) or camp of the sultan, a moving city,
with its streets, palaces, mosques, and cooking houses, “the smoke
of which ascended as they moved along.” Mohammed Uzbek, then
sovereign of Kifjāk, was a brave and munificent prince; and Ibn
Batūta, having, according to Tartar etiquette, first paid a visit of
ceremony to each of his wives, was politely received by him.
From this camp our traveller set out, with guides appointed by the
sultan, for the city of Bulgār, which, according to the Maresid Al
Etluā, is situated in Siberia. Here, in exemplification of the extreme
shortness of the night, he observes, that while repeating the prayer
of sunset he was overtaken, though he by no means lagged in his
devotions, by the time for evening prayer, which was no sooner over
than it was time to begin that of midnight; and that before he could
conclude one voluntary orison, which he added to this, the dawn had
already appeared, and morning prayer was to be begun. Forty days’
journey to the north of this place lay the land of darkness, where, he
was told, people travelled over interminable plains of ice and snow,
on small light sledges, drawn by dogs; but he was deterred from
pushing his researches into these Cimmerian regions by the fear of
danger, and considerations of the inutility of the journey. He returned,
therefore, to the camp of the sultan.
Mohammed Uzbek had married a daughter of the Greek Emperor
of Constantinople, who, being at this time pregnant, requested his
permission to be confined in her father’s palace, where it was her
intention to leave her child. The sultan consented, and Ibn Batūta,
conceiving that an excellent opportunity for visiting the Greek capital
now presented itself, expressed a desire to accompany the princess,
but the sultan, who regarded him apparently as something too gay
for a saint, at first refused to permit him. Upon his pressing the
matter, however, representing that he should never appear before
the queen but as his servant and guest, so that no fears need be
entertained of him, the royal husband, relenting, allowed him to go,
and presented him, on his departure, with fifteen hundred dinars, a
dress of honour, and several horses; while each of his sultanas,
together with his sons and daughters, caused the traveller to taste of
their bounty.
The queen, while she remained in her husband’s territories,
respected the religion and manners of the Mohammedans; but she
had no sooner entered her father’s dominions, and found herself
surrounded by her countrymen, than she drank wine, dismissed the
ministers of Islamism, and was reported to commit the abomination
of eating swine’s flesh. Ibn Batūta was still treated with respect,
however, and continuing to be numbered among the suite of the
sultana, arrived at length at Constantinople, where, in his zeal to
watch over the comfort of his royal mistress, he exposed himself to
the risk of being squeezed to death in the crowd. On entering the
city, his ears appear to have been much annoyed by the ringing of
numerous bells, which, with the inveterate passion of all Europeans
for noise when agitated by any joyous emotions, the Greeks of
Constantinople substituted for their own voices in the expression of
their satisfaction.
Remaining about five weeks in Constantinople, where, owing to
the difference of manners, language, and religion, he does not
appear to have tasted of much pleasure, he returned to Mohammed
Uzbek, whose bounty enabled him to pursue his journey towards the
east in a very superior style. The country to which his desires now
pointed was Khavāresm, the road thither traversing, during the
greater part of the way, a barren desert, where little water and a very
scanty herbage were to be found. Crossing this waste in a carriage
drawn by camels, he arrived at Khavāresm, the largest city at that
period possessed by the Turks. Here he found the people friendly
towards strangers, liberal, and well-bred,—and no wonder; for in
every mosque a whip was hung up, with which every person who
absented himself from church was soundly flogged by the priest,
besides being fined in five dinars. This practice, which Ibn Batūta
thought highly commendable, no doubt contributed greatly towards
rendering the people liberal and well-bred. Next to the refinement of
the people, the most remarkable thing he observed at Khavāresm
was a species of melon, green on the outside, and red within, which,
being cut into thin oblong slices and dried, was packed up in cases
like figs, and exported to India and China. Thus preserved, the
Khavāresm melon was thought equal to the best dried fruits in the
world, and regarded as a present worthy of kings.
From hence Ibn Batūta departed for Bokhāra, a city renowned
throughout the east for the learning and refinement of its inhabitants,
but at this period so reduced and impoverished by the long wars of
Genghis Khan and his successors, that not one man was to be found
in it who understood any thing of science. Leaving this ancient seat
of oriental learning, he proceeded to Māwarā El Nahr, the sultan of
which was a just and powerful prince, who received him hospitably,
and furnished him with funds to pursue his wanderings. He next
visited Samarkand, Balkh, and Herat, in Khorasān; and scaling the
snowy heights of the Hindoo Koosh, or Hindoo-Slayer, so called
because most of the slaves attempted to be carried out of India by
this route are killed by the severity of the cold, he entered Kabul.
Here, in a cell of the mountain called Bashāi, he found an old man,
who, though he had the appearance of being about fifty, pretended
to be three hundred and fifty years old, and assured Ibn Batūta that
at the expiration of every hundred years he was blessed with a new
growth of hair and new teeth, and that, in fact, he was the Rajah Aba
Rahim Ratan of India, who had been buried in Mooltan.
Notwithstanding his innate veneration for every thing saintly, and this
man bore the name of Ata Evlia, or “Father of Saints,” our honest
traveller could not repress the doubts which arose in his mind
respecting his extraordinary pretensions, and observes in his travels
that he much doubted of what he was, and that he continued to
doubt.
Ibn Batūta now crossed the Indus, and found himself in Hindostan,
where, immediately upon his arrival, he met, in a city which he
denominates Janai, one of the three brothers of Borhaneddin, the
Egyptian saint, whose prediction, strengthening his natural bent of
mind, had made a great traveller of him. Traversing the desert of
Sivastān, where the Egyptian thorn was the only tree to be seen, and
then descending along the banks of the Sinde, or Indus, he arrived
at the city of Lahari, on the seashore, in the vicinity of which were the
ruins of an ancient city, abounding with the sculptured figures of men
and animals, which the superstitious natives supposed to be the real
forms of the ancient inhabitants transformed by the Almighty into
stone for their wickedness.
At Uja, a large city on the Indus, our traveller contracted a
friendship with the Emīr Jelaleddin, then governor of the place, a
brave and generous prince, whom he afterward met at Delhi. In
journeying eastward from this place, Batūta proceeded through a
desert lying between two ridges of mountains, inhabited by Hindoos,
whom the traveller terms infidel and rebellious, because they
adhered to the faith of their ancestors, and refused submission to the
power of the Mohammedan conquerors of their country. Ibn Batūta’s
party, consisting of twenty-two men, was here attacked by a large
body of natives, which they succeeded in repulsing, after they had
killed thirteen of their number. In the course of this journey he
witnessed the performance of a suttee, and remarks upon the
occasion, that these human sacrifices were not absolutely required
either by the laws or the religion of Hindostan; but that, owing to the
vulgar prejudice which regarded those families as ennobled who
thus lost one of their members, the practice was greatly encouraged.
On arriving at Delhi, which, for strength, beauty, and extent, he
pronounces the greatest city, not only of all Hindostan, but of all
Islamism in the east, he resorted to the palace of the queen-mother,
and presenting his presents, according to custom, was graciously
received and magnificently established by the bounty of that princess
and the vizier. It is to be presumed, that the money he had received
in presents from various princes on the way had exceeded his
travelling expenses, and gone on accumulating, until, on his arrival
at Delhi, it amounted to a very considerable sum; for with his house,
costly furniture, and forty attendants, his expenditure seems greatly
to have exceeded the munificence of his patrons; indeed, he very
soon found that all the resources he could command were too scanty
to supply the current of his extravagance.
Being of the opinion of that ancient writer who thought a good
companion better than a coach on a journey, Ibn Batūta appears to
have increased his travelling establishment with a mistress, by whom
he seems to have had several children, for shortly after his arrival at
the capital, he informs us that “a daughter of his,” evidently implying
that he had more than one, happened to die. At this time our worthy
theologian was so deeply intoxicated with the fumes of that vanity
which usually accompanies the extraordinary smiles of fortune, that,
although by no means destitute of natural affection, nothing in the
whole transaction appears to have made any impression upon his
mind except the honour conferred upon him by the condescension of
the vizier and the emperor. The latter, then at a considerable
distance from the capital, on being informed of the event,
commanded that the ceremonies and rites usually performed at the
funeral of the children of the nobility should now take place; and
accordingly, on the third day, when the body was to be removed to its
narrow house, the vizier, the judges, and the nobles entered the
chamber of mourning, spread a carpet, and made the necessary
preparations, consisting of incense, rose-water, readers of the
Koran, and panegyrists. Our traveller, who anticipated nothing of all
this, confesses ingenuously that he was “much gratified.” To the
mother of the child the queen-mother showed the greatest kindness,
presenting her with magnificent dresses and ornaments, and a
thousand dinars in money.
The Emperor Mohammed having been absent from Delhi ever
since our traveller’s arrival, he had hitherto found no opportunity of
presenting himself before the “Lord of the World;” but upon that great
personage’s returning, soon after the funeral, the vizier undertook to
introduce him to the presence. The emperor received him graciously,
taking him familiarly by the hand, and, in the true royal style,
lavishing the most magnificent promises. As an earnest of his future
bounty, he bestowed upon each of the many travellers who were
presented at the same time, and met with the same reception, a
gold-embroidered dress, which he had himself worn; a horse from
his own stud, richly caparisoned with housings and saddle of silver;
and such refreshments as the imperial kitchen afforded. Three days
afterward Ibn Batūta was appointed one of the judges of Delhi, on
which occasion the vizier observed to him, “The Lord of the World
appoints you to the office of judge in Delhi. He also gives you a
dress of honour with a saddled horse, as also twelve thousand
dinars for your present support. He has moreover appointed you a
yearly salary of twelve thousand dinars, and a portion of lands in the
villages, which will produce annually an equal sum.” Ibn Batūta then did
homage and withdrew.
The fortune of Ibn Batūta was now changed. From the condition of
a religious adventurer, wandering from court to court, and from
country to country, subsisting upon the casual bounty of the great, he
had now been elevated to a post of great honour and emolument in
the greatest city then existing in the world. But it is very certain he
was not rendered happier by this promotion. The monarch upon
whose nod his destiny now depended was a man of changeful and
ferocious nature, profuse and lavish in the extreme towards those
whom he affected, but when provoked, diabolically cruel and
revengeful. In the very first conference which our traveller held with
his master after his appointment, he made a false step, and gave
offence; for when the emperor had informed him that he would by no
means find his office a sinecure, he replied that he belonged to the
sect of Ibn Malik, whereas the people of Delhi were followers of
Hanīfa; and that, moreover, he was ignorant of their language. This
would have been a good reason why he should not in the first
instance have accepted the office of judge; but, having accepted of
it, he should by no means have brought forward his sectarian
prejudices, or his ignorance, in the hope of abridging the extent of
his duties. The emperor, with evident displeasure, rejoined, that he
had appointed two learned men to be his deputies, and that these
