
International Journal of Applied Engineering Research ISSN 0973-4562, Volume 13, Number 24 (2018), pp. 16787-16791
© Research India Publications. http://www.ripublication.com

A Survey on Partitioning and Hierarchical based Data Mining Clustering Techniques

¹M. Kiruthika, ²Dr. S. Sukumaran
¹Ph.D Research Scholar, ²Associate Professor
Department of Computer Science,
Erode Arts and Science College (Autonomous),
Erode, India.

Abstract

In data mining, clustering is a general technique for statistical data analysis that is used in diverse fields, including machine learning, pattern recognition, image analysis and bioinformatics. Clustering is an excellent data mining tool for huge and multivariate databases. It is a data mining technique in which data is separated into sets of related objects. Clustering is a typical example of unsupervised classification: it does not depend on pre-defined classes or training examples when classifying the data objects. Partitioning and hierarchical algorithms are among the most actively researched of the proposed clustering algorithms. Several factors or themes determine the optimal clustering. The central idea of this paper is to classify the methods on the basis of different themes, so as to aid in choosing algorithms for further improvement and optimization. In this survey paper, clustering in general, partitioning and hierarchical clustering techniques, and evaluation metrics for clustering are reviewed.

Keywords: Clustering, Clustering Technique (Partitioning, Hierarchical), Data Mining, Evaluation Metrics.

1. INTRODUCTION

Data mining is mainly meant for finding hidden details or information in large databases. It involves different types of algorithms to accomplish different types of tasks, and attempts to make sense of the information explosion embedded in big volumes of data [15]. Data mining consists of extracting, transforming and loading transaction information onto the data warehouse system; storing and managing the information in a multidimensional database system; providing data access to business analysts and information technology professionals; analyzing the information with application software; and presenting the information in a useful format, such as a graph or table. Two learning approaches are used in data mining: supervised learning and unsupervised learning [19].

Data mining tasks can be divided into two types: descriptive data mining and predictive data mining. Descriptive data mining, also called pattern or relationship mining, is used for finding interesting patterns or describing the relationships in data. Predictive data mining uses historical or predefined data [13]; it classifies or predicts the model behavior based on the available data.

The data mining tasks include clustering, association rules, summarization, regression, classification and prediction. Clustering is also known as segmentation or unsupervised learning. It is usually performed by finding the connections among the data on predefined attributes. Clustering is the process of grouping physical objects into classes: the most similar data are grouped into clusters. There are many types of clustering, such as centroid-based, density-based, connectivity-based and distribution-based [3]. An association rule is a model that identifies specific types of data associations; associations are used in many real applications.

Summarization is the process of mapping data into subsets with associated descriptions. Summarization derives or extracts the needed information about the database; it compactly characterizes the contents of the database and is also known as generalization or characterization. Regression is mainly used to map between data items and a prediction variable; it is used to predict a range of numeric values in a particular dataset. Classification maps between data and predefined classes or groups; it is also known as supervised learning. Regression and classification are used to solve similar problems. Prediction is a type of classification that is mainly used to predict a future state rather than the present state.

In this paper, clustering analysis is surveyed. Cluster analysis is an automatic process to find related objects in a database; it is a fundamental operation in data mining.

Clustering

Clustering is the technique of combining sets of similar objects, known as clusters. It is used in information retrieval, statistical data analysis, machine learning, pattern recognition, image analysis, medical analysis, spatial data analysis and bioinformatics. Clustering aims at discovering groups and identifying interesting distributions and patterns in data sets.

Generally, the output produced by a clustering algorithm is the assignment of the data objects in a dataset to various groups. In addition, it is often sufficient to identify each data object with a unique cluster label.
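To make the notion of "a unique cluster label per object" concrete, the following toy Python sketch (the points, centroids and labels are invented for illustration and are not from the paper) represents a clustering result as one label per data object:

```python
# Illustrative sketch: a clustering result is a mapping from each data
# object to a cluster label. Points and centroids are made up.
points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centroids = [1.0, 8.0]  # assumed centroids, for illustration only

def nearest_label(x, centroids):
    """Return the index of the closest centroid, used as the cluster label."""
    return min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))

labels = [nearest_label(p, centroids) for p in points]
print(labels)  # each object receives a cluster label: [0, 0, 0, 1, 1, 1]
```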


The clustering objectives are:

• To reveal natural groupings.
• To generate hypotheses about the data.
• To find a dependable and convincing organization of the data.

Clustering groups the objects depending upon the information found in the data describing the objects or their associations. The objective of clustering is that the objects in a group are related to one another and different from the objects in other groups [15]. A high-quality cluster has high intra-class similarity and low inter-class similarity. The procedure of clustering is performed in four basic steps.

Feature selection or extraction
Feature selection decides characteristic features from a set of candidates, while feature extraction applies transformations to the data to generate useful and novel features from the original ones. Both are crucial to the effectiveness of clustering applications.

Clustering algorithm design or selection
This step is generally combined with the selection of a suitable proximity measure and the formation of a criterion function. It is important to carefully investigate the distinctive characteristics of the problem at hand in order to select or design an appropriate clustering strategy.

Cluster validation
Different approaches usually lead to different clusters, and even for the same algorithm, parameter settings or the presentation order of input patterns may affect the final results. Therefore, effective evaluation standards and criteria are important to provide users with a degree of confidence in the clustering results.

Results interpretation
The ultimate aim of clustering is to provide the user with novel insights from the data, so that the problems encountered can be solved effectively. Further analysis and experiments may be required to guarantee the reliability of the extracted knowledge.

2. CLUSTERING TECHNIQUES

Clustering approaches can be categorized as partitioning, hierarchical, density-based and grid-based. In this paper, we examine various partitioning and hierarchical clustering techniques.

2.1 Partitioning Clustering Algorithms

Partitioning clustering attempts to decompose the data set into a set of disjoint clusters. More specifically, it endeavors to create an integer number of partitions that optimize a certain criterion function. The criterion function may emphasize the local or global arrangement of the data, and its optimization is an iterative process. The main types of partitioning clustering algorithms are:

A. K-Means
K-Means is the most popular partitioning algorithm. It iteratively computes the clusters and their centroids, taking a top-down approach to clustering: 'n' data points are separated into 'K' clusters based on a similarity measure criterion. The result generated by the algorithm generally depends on the initial cluster centroids chosen. It is an iterative algorithm in which items are moved among the clusters until the required partition is reached. As such, it can be viewed as a squared-error algorithm, though the convergence criterion need not be defined in terms of the squared error [18]. A high degree of similarity among the components within a cluster is obtained, while a high degree of variation among components in different clusters is achieved at the same time.

B. K-Medoids
The K-Medoids algorithm, also termed PAM (Partitioning Around Medoids), represents a cluster by its medoid. Initially, a random set of k items is taken to be the collection of medoids. Then, at every step, all items in the input dataset that are not currently medoids are examined one by one to ascertain whether they should become medoids [18]. That is, the algorithm determines whether there is an item that should replace one of the existing medoids. PAM is more robust than K-Means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. PAM works efficiently for small data sets but does not scale well to huge data sets.

C. FCM
Fuzzy C-Means (FCM) is one of the most popular fuzzy clustering algorithms [26]. It is a clustering technique that assigns membership levels and uses them to assign data elements to one or more clusters. It uses mutual distance to calculate fuzzy weights. Each element of the universe can belong to a fuzzy set with a membership level that varies from 0 to 1. FCM introduces fuzziness in the belongingness of every object and can therefore retain more information about the dataset. The algorithm works by assigning each data point of the dataset a membership in every cluster, based on the distance between the cluster center and the data point. The memberships and cluster centers are updated after every iteration.
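The assign-and-update loop that K-Means iterates can be sketched in a few lines. This is an illustrative pure-Python implementation on one-dimensional toy data (the data values and seed are invented for this example), not the paper's own code:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-Means sketch: pick k initial centroids, then alternate
    between assigning points to the nearest centroid and recomputing
    each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
centroids, clusters = kmeans(data, k=2)
print(sorted(round(c, 2) for c in centroids))  # two well-separated centers
```

As the survey notes, the final partition depends on the initial centroids; on well-separated data like this toy set, the loop converges to the same two centers regardless of the seed.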


D. CLARANS
CLARANS (Clustering Large Applications based upon Randomized Search) is one of the partitioning techniques used for huge databases. It is a clustering method that draws samples of neighbours dynamically while searching for the spatial clusters in the data. It takes every node in the graph as a feasible solution [3] and dynamically draws a random sample at each new search. It forms the most "natural" clusters with the help of the "silhouette coefficient", which indicates the degree of belongingness of a data point to a particular cluster. It outperforms PAM (also a partitioning clustering method) in terms of run time and cluster quality. CLARANS combines a sampling procedure with PAM, and it does not confine the search to a localized region. The minimum distance among neighbouring nodes enhances the effectiveness of the algorithm [17].

2.2 Hierarchical Clustering Algorithms

Hierarchical clustering is also referred to as connectivity-based clustering. A hierarchical clustering algorithm creates a collection of clusterings; variants differ in how the sets are created. A tree data structure, referred to as a dendrogram, may be used to illustrate the hierarchical clustering process and the sets of different clusters [8]. The root of the dendrogram contains one cluster in which every element is together, while the leaves contain single-element clusters. Internal nodes represent new clusters formed by merging the clusters that appear as their children in the tree. Every level of the tree is associated with the distance measure that was used to merge the clusters: the clusters created at a particular level were combined because the children clusters had a distance between them smaller than the distance value associated with that level. According to the way they produce clusters, hierarchical clustering algorithms can further be separated into agglomerative and divisive algorithms.

• Agglomerative - A bottom-up approach. It produces a sequence of clustering schemes with a decreasing number of clusters at each step. The clustering scheme produced at each step results from the preceding one by merging the two nearest clusters into one.
• Divisive - A top-down approach. It produces a sequence of clustering schemes with an increasing number of clusters at each step [7]. Conversely to the agglomerative algorithms, the clustering produced at each step results from the preceding one by splitting a cluster into two.

A. BIRCH
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an agglomerative hierarchical clustering algorithm that works best for huge amounts of numerical data. The fundamental scheme is that a tree is created that captures the required information; clustering is performed on the tree itself, and the nodes of the tree contain the information that is used for the computation of distance values. BIRCH contains two novel features, the Clustering Feature (CF) and the CF-Tree, both of which summarize the cluster representations and help achieve good speed and scalability for huge databases [17]. The BIRCH procedure is fundamentally separated into four phases, and it runs on the consideration that not all data points are equally significant. The CF-Tree is built in the first phase and condensed in the second phase. The third is the global clustering phase, in which simple traditional clustering is performed on the CFs rather than on the data points. In the last phase, new clusters are formed and the data points closest to the seeds are redistributed to obtain better clusters. BIRCH is reasonably fast; its drawbacks are its inability to deal with non-spherical clusters of varying size and its sensitivity to the order of the data [3].

B. CHAMELEON
CHAMELEON is an agglomerative hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters. It is used to discover natural clusters of various sizes and shapes by automatically adapting the merging decision based on different clustering model features. It uses a two-stage clustering method: first, a graph partitioning algorithm is used to divide the data points into sub-clusters [1]; in the second stage, these sub-clusters are repeatedly merged until the actual clusters are found. The key feature of CHAMELEON is that it identifies the pair of most similar sub-clusters by considering both the relative closeness and the relative interconnectivity of the clusters. The relative closeness between a pair of clusters is the absolute closeness between the two clusters normalized with respect to their internal closeness, and the relative interconnectivity between a pair of clusters is the absolute interconnectivity between the two clusters normalized with respect to their internal interconnectivity.

C. CURE
CURE (Clustering Using Representatives) is a hierarchical clustering algorithm that draws a random sample and further partitions the sample, so as to obtain partially clustered partitions. The data points taken for the random sample should be well scattered, and the shape of each cluster is decided by the selected representative points; therefore, this algorithm can form non-spherical clusters [1]. Subsequently, the representatives are shrunk towards the cluster centroid to remove outliers; a shrink factor between 0.2 and 0.7 is used to find accurate clusters. CURE thus follows a middle ground between the centroid-based and all-points extremes. This algorithm fundamentally addresses the key issues of hierarchical clustering, i.e., outliers and clusters of only spherical shape.
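The bottom-up merging that agglomerative algorithms perform can be sketched as follows. This is an illustrative single-linkage implementation on one-dimensional toy data (the data values are invented for this example); it is not an implementation of BIRCH, CHAMELEON or CURE themselves, which add their own cluster representations on top of this basic scheme:

```python
def single_link_agglomerative(points, num_clusters):
    """Start with every point in its own cluster (the dendrogram leaves)
    and repeatedly merge the two closest clusters until num_clusters
    remain. Cluster distance is the minimum pairwise distance
    (single linkage)."""
    clusters = [[p] for p in points]          # leaves of the dendrogram
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # internal node: merged cluster
        del clusters[j]
    return clusters

print(single_link_agglomerative([1.0, 1.2, 8.0, 8.1, 4.0], 2))
```

Stopping the loop at successive values of num_clusters reproduces the levels of the dendrogram described above, from single-element leaves up to one all-inclusive root cluster.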


CURE ignores the interconnectivity among the clusters and gives preference to the distance between the representative points of two clusters [6]. It also fails to consider important features of particular clusters, thereby affecting the cluster merging decisions. The CURE technique works only for metric data.

D. ROCK
ROCK (Robust Clustering using Links) is a hierarchical algorithm that utilizes a link strategy to form clusters: from the bottom up, links merge points together into clusters. It introduced the concepts of link and neighbour. A link incorporates comprehensive information about sufficiently similar neighbours, so that more than just two points are considered whenever clusters are merged or split [17]. The larger the number of links, the higher the likelihood of two points being in the same cluster. Traditional algorithms used distance functions for categorical and Boolean attributes, but here the concept of links (common neighbours) is introduced instead. ROCK has demonstrated its power by being effectively applied to real datasets.

3. EVALUATION METRICS FOR CLUSTERING

The process of validating the results of a clustering algorithm is called cluster validity. The two families of cluster validation metrics are:

1. Internal measures: The quality of a clustering is measured using the basic information of the clustering itself. Connectivity, Silhouette Width and the Dunn Index are the internal measures of clusters.
2. Stability measures: A special version of internal measures, which assess the reliability of a clustering outcome by comparing it with the clusterings obtained after each column is removed, one at a time. The Average Proportion of Non-overlap (APN), Average Distance (AD), Average Distance between Means (ADM) and Figure of Merit (FOM) are the stability measures of clusters.

Connectivity indicates the degree of connectedness of the clusters, as determined by the k-nearest neighbors. It has a range between 0 and ∞ and should be minimized. The Silhouette width is the average of every observation's Silhouette value [21]. Silhouette validation assesses the outcome of clustering to determine the accuracy of the obtained clusters. It has a range between -1 and 1, and the Silhouette value is interpreted as:

• < 0.25: horrible split
• 0.26 - 0.50: weak structure
• 0.51 - 0.70: reasonable structure
• 0.71 - 1.00: excellent split

The Dunn Index is the ratio between the minimum distance between observations not in the same cluster and the largest intra-cluster distance. It has a range between 0 and ∞ and should be maximized.

The cluster stability measures are based on the cross-classification table of the actual clustering of the complete data with the clustering based on the removal of one column. The values of APN, ADM and FOM range from 0 to 1, with smaller values corresponding to highly consistent clustering results. AD has a value between 0 and infinity, and smaller values are also preferred [21]. Some other external measures include F-measures, pair-counting F-measures, the Rand measure, the Jaccard index, the Fowlkes-Mallows index, the confusion matrix and mutual information.

4. CONCLUSION

Clustering is significant in data analysis and data mining applications. This paper discussed various partitioning and hierarchical clustering techniques and evaluation metrics for clustering. Partitioning clustering algorithms are very helpful when the clusters are of convex shape and similar size and the number of clusters can be determined in advance. When the number of clusters cannot be predicted in advance, hierarchical clustering algorithms are used; they partition the dataset into numerous levels of partitioning, termed dendrograms. These algorithms are very useful in mining, but the cost of forming dendrograms is extremely high for huge datasets. All the clustering algorithms are validated using cluster validation metrics such as internal and stability measures.

REFERENCES

[1]. Abdullah, Z. and A. R. Hamdan, "Hierarchical Clustering Algorithms in Data Mining", World Academy of Science, Engineering and Technology International Journal of Computer, Electrical, Automation, Control and Information Engineering, Vol. 9, No. 10, 2015.
[2]. Amandeep Kaur Mann and Navneet Kaur, "Survey Paper on Clustering Techniques", International Journal of Science, Engineering and Technology Research (IJSETR), Volume 2, Issue 4, April 2013.
[3]. Amudha, S., "An Overview of Clustering Algorithm in Data Mining", International Research Journal of Engineering and Technology (IRJET), Volume 3, Issue 12, Dec. 2016.
[4]. Anoop Kumar Jain and Satyam Maheswari, "Survey of Recent Clustering Techniques in Data Mining", International Journal of Computer Science and Management Research, pp. 72-78, 2012.
[5]. Ding, K., C. Huo, Y. Xu, Z. Zhong, and C. Pan, "Sparse hierarchical clustering for VHR image change detection", IEEE Geoscience and Remote Sensing Letters, 12(3), pp. 577-581, 2015.
[6]. Fahad, A., N. Alshatri, Z. Tari and A. Alamri, "A survey of clustering algorithms for Big Data: Taxonomy and empirical analysis", IEEE Transactions on Emerging Topics in Computing, pp. 267-279, 2014.


[7]. Garima, Hina Gulati and P. K. Singh, "Clustering Techniques in Data Mining: A Comparison", International Conference on Computing for Sustainable Global Development (INDIACom), pp. 410-415, IEEE, 2015.
[8]. Harshada S. Deshmukh and P. L. Ramteke, "Comparing the Techniques of Cluster Analysis for Big Data", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 4, Issue 12, 2015.
[9]. Namrata S. Gupta, Bijendra S. Agrawal and Rajkumar M. Chauhan, "Survey on Clustering Techniques of Data Mining", American International Journal of Research in Science, Technology, Engineering & Mathematics, pp. 206-111, 2015.
[10]. Lori Dalton, Virginia Ballarin and Marcel Brun, "Clustering Algorithms: On Learning, Validation, Performance, and Applications to Genomics", Current Genomics, Vol. 10, No. 6, 2009.
[11]. Pandove, D. and S. Goel, "A comprehensive study on clustering approaches for Big Data mining", IEEE Transactions on Electronics and Communication Systems, Coimbatore, pp. 26-27, Feb. 2015.
[12]. Pradeep Rai and Shubha Singh, "A Survey of Clustering Techniques", International Journal of Computer Applications, October 2010.
[13]. Pragati Shrivastava and Hitesh Gupta, "A Review of Density-Based Clustering in Spatial Data", International Journal of Advanced Computer Research, pp. 2249-7277, Sep. 2012.
[14]. Rashedi, E., A. Mirzaei, and M. Rahmati, "An information theoretic approach to hierarchical clustering combination", Neurocomputing, 148, pp. 487-497, 2015.
[15]. Rama Kalaivani, E., G. Suganya and J. Kiruba, "Review on K-Means and Fuzzy C Means Clustering Algorithm", Imperial Journal of Interdisciplinary Research (IJIR), Vol. 3, Issue 2, 2017.
[16]. Saroj and Tripti Chaudhary, "Study on Various Clustering Techniques", International Journal of Computer Science and Information Technologies, Vol. 6(3), pp. 3031-3033, 2015.
[17]. Sajana, T., C. M. Sheela Rani and K. V. Narayana, "A Survey on Clustering Techniques for Big Data Mining", Indian Journal of Science and Technology, Vol. 9(3), DOI: 10.17485/ijst/2016/v9i3/75971, January 2016.
[18]. Sonamdeep Kaur, Sarika Chaudhary and Neha Bishnoi, "A Survey: Clustering Algorithms in Data Mining", International Journal of Computer Applications, ISSN: 0975-8887, 2015.
[19]. Soni Madhulatha, T., "An Overview on Clustering Methods", IOSR Journal of Engineering, Vol. 2(4), pp. 719-725, Apr. 2012.
[20]. Sukhvir Kaur, "Survey of Different Data Clustering Algorithms", International Journal of Computer Science and Mobile Computing, Vol. 5, Issue 5, pp. 584-588, May 2016.
[21]. Sukhdev Singh Ghuman, "Clustering Techniques - A Review", International Journal of Computer Science and Mobile Computing, Vol. 5, Issue 5, pp. 524-530, May 2016.
[22]. Suman and Pinki Rani, "A Survey on STING and CLIQUE Grid Based Clustering Methods", International Journal of Advanced Research in Computer Science, Volume 8, No. 5, May-June 2017.
[23]. Sunil Chowdary, D. Sri Lakshmi Prasanna and P. Sudhakar, "Evaluating and Analyzing Clusters in Data Mining using Different Algorithms", International Journal of Computer Science and Mobile Computing, Vol. 3, Issue 2, pp. 86-99, 2014.
[24]. Vaishali R. Patel and Rupa G. Mehta, "Clustering Algorithms: A Comprehensive Survey", International Conference on Electronics, Information and Communication Systems Engineering, 2011.
[25]. Vijayalakshmi, M. and M. Renuka Devi, "A Survey of Different Issues of Different Clustering Algorithms Used in Large Data Sets", International Journal of Advanced Research in Computer Science and Software Engineering, pp. 305-307, 2012.
[26]. Zeynel Cebeci and Figen Yildiz, "Comparison of K-Means and Fuzzy C-Means Algorithms on Different Cluster Structures", Journal of Agricultural Informatics, Vol. 6, No. 3, 2015.

AUTHOR PROFILE

M. Kiruthika received the Bachelor of Computer Science (B.Sc.) degree from Anna University in 2014 and the Master of Computer Application (MCA) degree from Anna University in 2016. She also received the M.Phil degree from Bharathiar University, Coimbatore, in 2018. Currently she is pursuing her Ph.D in Computer Science at Erode Arts and Science College. Her research area includes Data Mining.

Dr. S. Sukumaran graduated in 1985 with a degree in Science. He obtained his Master degree in Science and M.Phil in Computer Science from Bharathiar University. He received the Ph.D degree in Computer Science from Bharathiar University. He has 30 years of teaching experience, from Lecturer to Associate Professor. At present he is working as Associate Professor of Computer Science in Erode Arts and Science College, Erode, Tamil Nadu. He has guided more than 55 M.Phil research scholars in various fields and 13 Ph.D scholars. Currently he is guiding 3 M.Phil scholars and 6 Ph.D scholars. He is a member of the Board of Studies of various autonomous colleges and universities. He has published around 63 research papers in national and international journals and conferences. His current research interests include Image Processing, Network Security and Data Mining.

