Modified MLKNN Algorithm
Abstract Social media has become a very rich source of information. Labeling unstructured social media text is a critical task, as features can belong to multiple labels. Without appropriate labels, raw data does not make sense, so assigning suitable labels is essential. In this work, we propose a modified multi-label k-nearest neighbor algorithm (Modified ML-KNN) for generating multiple labels for tweets which, when configured with a suitable distance measure and number of nearest neighbors, performs better than conventional ML-KNN. To validate the proposed approach, we use two different Twitter datasets: a disease-related tweet set that we prepared using five different disease keywords, and the benchmark Seattle dataset consisting of incident-related tweets. The modified ML-KNN improves the performance of conventional ML-KNN by a minimum of 5% on both datasets.
Social media is a place where people post a large volume of text. Social media text classification systems retrieve such posts and summarize them according to users' interests and views. Textual data on social media belongs to either the unstructured or the semi-structured category. With the emergence of Web 3.0, online information in particular is growing enormously, so automatic tools are required for analyzing such large collections of textual data. In this regard, the work in [1] proposed an architecture to track real-time disease-related postings for early prediction of disease outbreaks; a support vector machine (SVM) used for classifying postings achieved up to 88% accuracy.
2 Algorithms Used
In multi-label classification, reported work falls into two well-known categories of algorithms.
Problem transformation methods [5] are multi-label learning algorithms that transform the learning problem into one or more single-label classification problems. Popular problem transformation methods are binary relevance, label powerset, and classifier chains.
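As a rough illustration of the first two transformation methods, the following sketch turns a toy multi-label dataset into per-label binary problems (binary relevance) and into a single multi-class problem over label combinations (label powerset). The data, label names, and helper functions are illustrative, not taken from the paper.

```python
# Toy multi-label data: each instance has a feature vector and a label set.
X = [[1.0, 0.2], [0.9, 0.1], [0.2, 0.8], [0.3, 0.9]]
Y = [{"cough"}, {"cough", "nausea"}, {"nausea"}, set()]
labels = ["cough", "nausea"]

def binary_relevance(X, Y, labels):
    """Return one (features, binary targets) problem per label."""
    return {l: (X, [1 if l in y else 0 for y in Y]) for l in labels}

def label_powerset(Y):
    """Map each distinct label combination to a single class id."""
    combos = sorted({frozenset(y) for y in Y}, key=sorted)
    class_of = {c: i for i, c in enumerate(combos)}
    return [class_of[frozenset(y)] for y in Y], combos

br = binary_relevance(X, Y, labels)
lp_classes, combos = label_powerset(Y)
print(br["cough"][1])  # binary targets for the "cough" classifier: [1, 1, 0, 0]
print(lp_classes)      # one multi-class target per instance: [1, 2, 3, 0]
```

Any off-the-shelf single-label classifier can then be trained on each transformed problem; classifier chains work like binary relevance but feed earlier labels' predictions into later classifiers.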
Multi-label Classification of Twitter Data Using Modified ML-KNN 33
Algorithm adaptation methods [5] adapt machine learning algorithms to the task of multi-label classification. Popular machine learning algorithms adapted in the literature include boosting, k-nearest neighbors, decision trees, and neural networks. The adapted methods can handle multi-label data directly. In this research work, we present a modified multi-label k-nearest neighbor method that upgrades the nearest-neighbor family with an appropriate similarity measure and number of nearest neighbors.
ML-KNN is derived from the popular k-nearest neighbor (KNN) algorithm [6]. It works in two phases. First, the k nearest neighbors of each test instance in the training set are identified. Then, based on the number of neighboring instances belonging to each possible class, the maximum a posteriori (MAP) principle is used to determine the label set of the test instance. The original ML-KNN uses the Euclidean distance measure with a default of 8 nearest neighbors. In our work, the effectiveness of ML-KNN is evaluated with four similarity measures of the Minkowski family described in [7], varying the number of nearest neighbors.
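The two phases above can be sketched compactly. The following is an illustrative Python implementation of the ML-KNN decision rule with a configurable Minkowski-family distance and neighbor count (the experiments in the paper use the MULAN Java library, not this code); the toy data, parameter values, and smoothing constant are our own assumptions.

```python
import math

def minkowski(a, b, p):
    """Minkowski-family distance: p=1 Manhattan, p=2 Euclidean, p=inf Chebyshev."""
    if p == math.inf:
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def mlknn_predict(X_train, Y_train, x, k=8, p=2, s=1.0):
    """Predict the 0/1 label vector of one test instance x with ML-KNN.

    Y_train holds 0/1 label vectors; s is the Laplace smoothing constant.
    """
    n, L = len(X_train), len(Y_train[0])
    # Prior P(H_l): how often label l appears in the training set.
    prior = [(s + sum(y[l] for y in Y_train)) / (2 * s + n) for l in range(L)]
    # Likelihood counts: for each training point, how many of its own k
    # neighbors carry label l, split by whether the point itself has l.
    kc = [[0] * (k + 1) for _ in range(L)]   # point has label l
    knc = [[0] * (k + 1) for _ in range(L)]  # point lacks label l
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: (minkowski(X_train[i], X_train[j], p), j))
        neigh = order[:k]
        for l in range(L):
            c = sum(Y_train[j][l] for j in neigh)
            (kc if Y_train[i][l] else knc)[l][c] += 1
    # Neighbors of the test instance, then a MAP decision per label.
    neigh = sorted(range(n),
                   key=lambda j: (minkowski(x, X_train[j], p), j))[:k]
    y_hat = []
    for l in range(L):
        c = sum(Y_train[j][l] for j in neigh)
        p_has = prior[l] * (s + kc[l][c]) / (s * (k + 1) + sum(kc[l]))
        p_not = (1 - prior[l]) * (s + knc[l][c]) / (s * (k + 1) + sum(knc[l]))
        y_hat.append(1 if p_has > p_not else 0)
    return y_hat

# Two tight clusters, each carrying its own label.
X = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
Y = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
print(mlknn_predict(X, Y, [0.05, 0.05], k=2, p=2))  # → [1, 0]
```

Changing `p` and `k` here corresponds to the configuration search performed in our experiments over the Minkowski family and the number of nearest neighbors.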
Most single-label approaches assign each tweet to only one category, but in real-world scenarios each tweet is associated with multiple labels. The following architecture shows our methodology for efficient result evaluation.
We have considered two different configurations of the dataset. The first is the raw category (C0), obtained by removing links, special symbols, and duplicate tweets from the corpus. The second is the processed category (C3), in which stop words are additionally removed and all text is stemmed. For both datasets, we identified the similarity measure and number of nearest neighbors (NN) that give the best performance. We used these configurations in ML-KNN to improve the multi-label algorithm, and the MULAN library [8] for result evaluation.
In our research work, we created our own disease corpus and found some motivating examples that belong to multiple disease categories. The tweet dataset was manually annotated with the help of a medical domain expert, and the prepared corpus is used for result evaluation. Some motivating examples that belong to multiple categories are as follows (Fig. 1, Table 1).
We used two different datasets: (1) the Disease corpus and (2) the Seattle dataset, both based on Twitter data. Seattle is a standard dataset described in [9]. We prepared our own dataset based on the disease keywords suggested in [10]. The disease data preparation phases are as follows.
In the data collection phase, raw tweets are collected to build a corpus. Twitter is the information source, and disease keywords are used to capture relevant disease tweets from social media. The disease corpus is built by collecting tweets for five different diseases (D-1 to D-5): abdominal pain, conjunctivitis, cough, diarrhea, and nausea. The keywords used to search for tweets related to these diseases are taken from the classical work in [10]. We used the Tweepy streaming API [11] for tweet collection and collected only the textual content of tweets in the five categories. All tweets were processed to remove duplicates as well as URLs. A total of 2009 unique disease tweets across the five disease categories were used in the final disease corpus.
In the cleaning phase, raw tweets are cleaned before they are subjected to the different preprocessing configurations. Cleaning is generally performed to reduce noise and thereby improve the quality of the training model. The idea behind these steps is to remove noise from the dataset: special symbols, special syntax, duplicates, and stop words are viewed as noise and are not beneficial as input to any model.
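A minimal sketch of the two preprocessing configurations described above might look as follows. The regular expressions, the tiny stop-word list, and the crude suffix-stripping stemmer are illustrative stand-ins (a real pipeline would likely use a full stop-word list and a proper stemmer such as Porter's); none of them are taken from the paper.

```python
import re

STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "i", "have"}  # tiny illustrative list

def clean_c0(tweets):
    """C0: strip links and special symbols, then drop duplicate tweets."""
    seen, out = set(), []
    for t in tweets:
        t = re.sub(r"https?://\S+", " ", t)    # remove URLs
        t = re.sub(r"[^a-zA-Z0-9\s]", " ", t)  # remove special symbols
        t = re.sub(r"\s+", " ", t).strip().lower()
        if t and t not in seen:
            seen.add(t)
            out.append(t)
    return out

def stem(word):
    """Crude suffix stripper standing in for a real stemmer."""
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def process_c3(tweets):
    """C3: C0 cleaning plus stop-word removal and stemming."""
    return [" ".join(stem(w) for w in t.split() if w not in STOPWORDS)
            for t in clean_c0(tweets)]

tweets = ["I have a bad cough!! http://t.co/x",
          "I have a bad cough!! http://t.co/x",   # duplicate, dropped by C0
          "Coughing and nausea today..."]
print(clean_c0(tweets))    # → ['i have a bad cough', 'coughing and nausea today']
print(process_c3(tweets))  # → ['bad cough', 'cough nausea today']
```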
36 S. K. Srivastava and S. K. Singh
The following measures are used for the performance evaluation of Modified ML-KNN and ML-KNN.
Subset accuracy [5] evaluates the fraction of samples whose predicted label set exactly matches the ground-truth label set. It is a multi-label counterpart of the traditional accuracy metric.
The F measure score is the harmonic mean of precision and recall. As an example-based metric, it is averaged over all examples in the dataset: $F_1 = \frac{1}{N} \sum_{i=1}^{N} \frac{2\,|h(x_i) \cap y_i|}{|h(x_i)| + |y_i|}$, where $N$ is the number of examples, $h(x_i)$ the predicted label set, and $y_i$ the ground-truth label set of instance $x_i$. The F measure reaches its best value at 1 and its worst at 0.
Micro-precision (precision averaged over all example/label pairs) is defined as $\text{Micro-precision} = \frac{\sum_{j=1}^{Q} tp_j}{\sum_{j=1}^{Q} tp_j + \sum_{j=1}^{Q} fp_j}$, where $tp_j$ and $fp_j$ are defined as for macro-precision.
Micro-recall (recall averaged over all example/label pairs) is defined as $\text{Micro-recall} = \frac{\sum_{j=1}^{Q} tp_j}{\sum_{j=1}^{Q} tp_j + \sum_{j=1}^{Q} fn_j}$, where $tp_j$ and $fn_j$ are defined as for macro-recall.
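The three evaluation measures can be computed directly from predicted and ground-truth label sets. The sketch below (with an invented three-tweet example) mirrors the definitions above; the paper itself obtains these values from the MULAN library.

```python
def subset_accuracy(Y_true, Y_pred):
    """Fraction of examples whose predicted label set matches the ground truth exactly."""
    return sum(t == p for t, p in zip(Y_true, Y_pred)) / len(Y_true)

def example_f1(Y_true, Y_pred):
    """Example-based F1: 2|h(x) ∩ y| / (|h(x)| + |y|), averaged over all examples."""
    score = 0.0
    for t, p in zip(Y_true, Y_pred):
        score += 1.0 if not (t or p) else 2 * len(t & p) / (len(t) + len(p))
    return score / len(Y_true)

def micro_precision_recall(Y_true, Y_pred, labels):
    """Pool tp/fp/fn over all example/label pairs, then compute precision and recall."""
    tp = sum(l in t and l in p for t, p in zip(Y_true, Y_pred) for l in labels)
    fp = sum(l not in t and l in p for t, p in zip(Y_true, Y_pred) for l in labels)
    fn = sum(l in t and l not in p for t, p in zip(Y_true, Y_pred) for l in labels)
    return tp / (tp + fp), tp / (tp + fn)

labels = ["cough", "nausea"]
Y_true = [{"cough"}, {"cough", "nausea"}, {"nausea"}]
Y_pred = [{"cough"}, {"cough"}, {"nausea"}]
print(subset_accuracy(Y_true, Y_pred))                 # → 0.666…
print(example_f1(Y_true, Y_pred))                      # → 0.888…
print(micro_precision_recall(Y_true, Y_pred, labels))  # → (1.0, 0.75)
```

Note how the second example counts as a full miss for subset accuracy (the predicted set is not identical) but still earns partial credit under the example-based F1 and the micro-averaged measures.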
7.1.1 C0 Configuration
With the C0 configuration (Tables 2 and 4), it is clearly visible that the Euclidean and Minkowski distance measures with eight nearest neighbors perform best, reaching 84.72% subset accuracy on the Disease dataset. On the Seattle dataset, Euclidean and Minkowski perform best when configured with 5 nearest neighbors, with a subset accuracy of 48.49%. The Chebyshev distance measure performs worst among the considered distance measures on both datasets.
7.1.2 C3 Configuration
With the C3 configuration (Tables 3 and 5), we use a concrete feature set for the classification task. We found that Manhattan, Euclidean, and Minkowski with eight nearest neighbors perform best, with 91.44% overall subset accuracy on the Disease dataset. For the Seattle dataset, Manhattan, Euclidean, and Minkowski with 14 nearest neighbors perform best, with 53.15% overall subset accuracy.
The experiments clearly show around 7% higher accuracy on the Disease dataset and 5% higher on the Seattle dataset. This demonstrates that concrete features play an important role in the classification task, irrespective of whether it is single-label, multi-class, or multi-label classification.
8 Conclusion
9 Future Work
Complex statistical information beyond the membership-counting statistics could further facilitate the use of the maximum a posteriori principle. This is an interesting direction for future work.
References
1. Sofean M, Smith M (2012) A real-time disease surveillance architecture using social networks.
Stud Health Technol Inf 180:823–827
2. Guo J, Zhang P, Guo L (2012) Mining hot topics from twitter streams. Procedia Comput Sci
9:2008–2011
3. Rui W, Xing K, Jia Y (2016) BOWL: Bag of word clusters text representation using word
embeddings. In: International conference on knowledge science, engineering and management.
Springer International Publishing
4. Ding W et al (2008) LRLW-LSI: an improved latent semantic indexing (LSI) text classifier.
Lect Note Comput Sci 5009:483
5. Zhang ML, Zhou ZH (2014) A review on multi-label learning algorithms. IEEE Trans Knowl
Data Eng 26(8):1819–1837
6. Aha DW (1991) Incremental constructive induction: an instance-based approach. In: Proceed-
ings of the eighth international workshop on machine learning
7. Cha SH (2007) Comprehensive survey on distance/similarity measures between probability
density functions. City 1(2):1
8. Tsoumakas G et al (2011) Mulan: a java library for multi-label learning. J Mach Learn Res,
2411–2414
9. Schulz A et al (2014) Evaluating multi-label classification of incident-related tweets. In: Making
Sense of Microposts (Microposts2014), vol 7
10. Velardi P et al (2014) Twitter mining for fine-grained syndromic surveillance. Artif Intell Med
61(3):153–163
11. Roesslein J (2009) Tweepy documentation. http://tweepy.readthedocs.io/en/v3.5