R18CSE4102-UNIT 2 Data Mining Notes
Data Mining:
Data mining refers to extracting or mining knowledge from large amounts of data.
The term is actually a misnomer. Thus, data mining should have been more appropriately
named knowledge mining, which emphasizes mining knowledge from large amounts of data.
Clustering – is the task of discovering groups and structures in the data that are in some way
or another "similar", without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression – attempts to find a function which models the data with the least error.
(Figure: an example plot generated from the Iris data set of the UCI Machine Learning Repository; the data set contains three class labels: Setosa, Versicolor, and Virginica.)
Support Vector Machine (SVM) Classifier Method: Support Vector Machines (SVMs) form a
supervised learning strategy used for classification and also for regression.
When the output of the support vector machine is a continuous value, the learning
method is said to perform regression; when the learning method predicts a category
label for the input object, it is performing classification. The independent
variables may or may not be quantitative. Kernel equations are functions that transform
data that is not linearly separable in one domain into another domain where the
instances become linearly separable. Kernel equations may be linear, quadratic, Gaussian,
or anything else that achieves this purpose. A linear classification technique is a
classifier that bases its decision on a linear function of its inputs. Applying the kernel
equations arranges the data instances within the multi-dimensional space so that there is a
hyperplane separating data instances of one class from those of another. The advantage of
Support Vector Machines is that they can use certain kernels to transform the problem, so
that linear classification techniques can be applied to nonlinear data. Once the data has
been divided into two classes, the aim is to find the best hyperplane that separates the
two kinds of instances.
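To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the kernel and parameter choices are illustrative, not prescribed by these notes) that trains an SVM with a Gaussian (RBF) kernel on the Iris data set mentioned earlier:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the Iris data set (Setosa, Versicolor, Virginica)
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The RBF (Gaussian) kernel maps the data into a space where a separating hyperplane exists
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))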
Generalized Linear Models: Generalized Linear Models (GLM) is a statistical technique
for linear modeling. GLM provides extensive coefficient statistics and model statistics, as
well as row diagnostics. It also supports confidence bounds.
Bayesian Classification: A Bayesian classifier is a statistical classifier. It can predict
class membership probabilities, for instance, the probability that a given sample belongs to
a particular class. Bayesian classification is based on Bayes' theorem. Studies
comparing classification algorithms have found a simple Bayesian classifier, known as
the naive Bayesian classifier, to be comparable in performance with decision tree and neural
network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when
applied to large databases. Naive Bayesian classifiers assume that the effect of an attribute value on
a given class is independent of the values of the other attributes. This assumption is termed
class conditional independence. It is made to simplify the calculations involved, and in this
sense it is considered "naive". Bayesian belief networks are graphical models which, unlike naive
Bayesian classifiers, allow the representation of dependencies among subsets of attributes.
Bayesian belief networks can also be used for classification.
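A minimal naive Bayesian classifier sketch, assuming scikit-learn (a tool choice not specified in these notes); GaussianNB applies the class conditional independence assumption to numeric attributes:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB()                      # treats attributes as conditionally independent given the class
nb.fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))    # class membership probabilities for three samples
print("Test accuracy:", nb.score(X_test, y_test))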
Classification by Backpropagation: A backpropagation network learns by iteratively processing a
set of training samples, comparing the network's prediction for each sample with the actual
known class label. For each training sample, the weights are modified so as to minimize the mean
squared error between the network's prediction and the actual class. These modifications are
made in the "backward" direction, i.e., from the output layer, through each hidden layer,
down to the first hidden layer (hence the name backpropagation). Although it is not
guaranteed, in general the weights will eventually converge, and the learning process stops.
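As one possible illustration (not the only way to build such a network), a sketch using scikit-learn's MLPClassifier, whose solver performs the backward weight updates described above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# One hidden layer of 10 units; weights are adjusted backward from the output layer each iteration
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=1)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))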
5. Regression
Regression can be defined as a statistical modeling method in which previously obtained
data are used to predict a continuous quantity for new observations. This classifier is also
known as the Continuous Value Classifier. There are two types of regression models:
linear regression and multiple linear regression.
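A minimal linear regression sketch on made-up data points, assuming scikit-learn, shown only to make the idea of predicting a continuous quantity concrete:

import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: predict a continuous quantity (y) from one attribute (x)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x = 6:", model.predict([[6.0]])[0])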
4. Retail Industry
The organized retail sector holds sizable quantities of data points covering sales, purchasing
history, delivery of goods, consumption, and customer service. The databases have become
even larger with the arrival of e-commerce marketplaces.
In modern-day retail, data warehouses are being designed and constructed to get the full
benefits of data mining. Multidimensional data analysis helps deal with data related to
different types of customers, products, regions, and time zones. Online retailers can also
recommend products to drive more sales revenue and analyze the effectiveness of their
promotional campaigns. So, from noticing buying patterns to improving customer service and
satisfaction, data mining opens many doors in this sector.
5. Higher Education
As the demand for higher education goes up worldwide, institutions are looking for
innovative solutions to cater to the rising needs. Institutions can use data mining to predict
which students would enrol in a particular program and who would require additional assistance
to graduate, thereby refining enrollment management overall.
Moreover, forecasting students' career paths and presenting the data become easier with
effective analytics. In this manner, data mining techniques can help uncover the hidden
patterns in massive databases in the field of higher education.
6. Energy Industry
Big Data is available even in the energy sector nowadays, which points to the need for
appropriate data mining techniques. Decision tree models and support vector machine
learning are among the most popular approaches in the industry, providing feasible solutions
for decision-making and management. Additionally, data mining can also achieve productive
gains by predicting power outputs and the clearing price of electricity.
7. Spatial Data Mining
Geographic Information Systems (GIS) and several other navigation applications make use of
data mining to secure vital information and understand its implications. This new trend
includes the extraction of geographical, environmental, and astronomical data, including images
from outer space. Typically, spatial data mining can reveal aspects like topology and
distance.
8. Biological Data Analysis
Biological data mining practices are common in genomics, proteomics, and biomedical
research. From characterizing patients’ behaviour and predicting office visits to identifying
medical therapies for their illnesses, data science techniques provide multiple advantages.
Some of the data mining applications in the Bioinformatics field are:
Semantic integration of heterogeneous and distributed databases
Association and path analysis
Use of visualization tools
Structural pattern discovery
Analysis of genetic networks and protein pathways
Data Preprocessing
Data preprocessing consists of the steps that should be applied to make the data more suitable
for data mining. It comprises a number of different strategies and techniques that are
interrelated in complex ways.
Strategies and techniques
(1) Aggregation
(2) Sampling
(3) Dimensionality reduction
(4) Feature subset selection
(5) Feature creation
(6) Discretization and binarization
(7) Variable transformation
Two categories of Strategies and techniques
Selecting data objects and attributes for the analysis.
Creating/changing the attributes.
Goal:
To improve the data mining analysis with respect to time, cost, and quality.
Aggregation
Quantitative attributes are typically aggregated by taking a sum or an average.
A qualitative attribute can either be omitted or summarized.
Disadvantage of aggregation
potential loss of interesting details.
Sampling Approaches
Random sampling.
Progressive or Adaptive Sampling
Random sampling
Sampling without replacement: as each item is selected, it is removed from the set
of all objects that together constitute the population.
Sampling with replacement: objects are not removed from the population as they
are selected for the sample. The same object can be picked more than once.
Progressive or Adaptive Sampling
Difficult to determine proper sample size.
So adaptive or progressive sampling schemes are used.
These approaches start with a small sample, and then increase the sample size
until a sample of sufficient size has been obtained.
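A small NumPy sketch (an assumed tool choice, with an illustrative population) contrasting sampling without and with replacement:

import numpy as np

rng = np.random.default_rng(0)
population = np.arange(100)            # the set of all data objects

without_repl = rng.choice(population, size=10, replace=False)  # each object picked at most once
with_repl = rng.choice(population, size=10, replace=True)      # the same object can be picked more than once
print(without_repl)
print(with_repl)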
Dimensionality reduction
Data mining algorithms work better if the dimensionality - the number of
attributes in the data - is lower.
Eliminate irrelevant features and reduce noise.
Lead to a more understandable model due to fewer attributes.
Allow the data to be more easily visualized.
Amount of time and memory required by the data mining algorithm is reduced.
Feature subset selection
The reduction of dimensionality by selecting new attributes that are a subset of the
old.
Noisy Data
Noise is a random error or variance in a measured variable. Given a numerical attribute,
the following data smoothing techniques are used:
1. Binning:
Binning methods smooth a sorted data value by consulting its "neighborhood,"
that is, the values around it.
The sorted values are distributed into a number of “buckets,” or bins. Binning is
also used as a discretization technique.
2. Regression:
Data can be smoothed by fitting the data to a function, such as with regression.
Linear regression involves finding the “best” line to fit two attributes (or variables),
so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface.
3. Clustering:
Outliers may be detected by clustering, where similar values are organized into
groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may
be considered outliers.
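As an illustration of the binning technique above, a sketch of smoothing by bin means (one common variant, assumed here for illustration) on made-up sorted values:

import numpy as np

# Sorted, noisy attribute values (illustrative data)
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

# Partition into equal-frequency bins of 3 values and replace each value by its bin mean
bins = values.reshape(-1, 3)
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]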
Dimensionality Reduction
Data sets can have a large number of features.
Example:
Consider a set of documents, where each document is represented by a vector whose
components are the frequencies with which each word occurs in the document.
In such cases, there are typically thousands or tens of thousands of attributes
(components), one for each word in the vocabulary.
As another example, consider a set of time series consisting of the daily closing price
of various stocks over a period of 30 years.
In this case, the attributes, which are the prices on specific days, again number in the
thousands.
Proximity
o refers to either similarity or dissimilarity.
o proximity between two objects is a function of the proximity between the
corresponding attributes of the two objects.
o This includes measures such as
Correlation and Euclidean distance, which are useful for dense data
such as time series or two-dimensional points
Jaccard and cosine similarity measures, which are useful for sparse
data like documents.
Similarity
o Similarity between two objects is a numerical measure of the degree to which
the two objects are alike.
o are higher for pairs of objects that are more alike.
o usually non-negative and are often between 0 (no similarity) and 1 (complete
similarity).
Dissimilarity
o Dissimilarity between two objects is a numerical measure of the degree to
which the two objects are different.
o are lower for more similar pairs of objects.
o term distance is used as a synonym for dissimilarity.
o fall in the interval [0,1], but it is also common for them to range from 0 to .
Transformations
o are often applied to convert a similarity to a dissimilarity, or vice versa, or to
transform a proximity measure to fall within a particular range, such as [0,1].
o proximity measures, especially similarities, are defined or transformed to have
values in the interval [0,1].
o motivation for this is to use a scale in which a proximity value indicates the
fraction of similarity (or dissimilarity) between two objects.
o transformation of similarities to the interval [0,1] is given by the expression
where max_s and min_s are the
maximum and minimum similarity values, respectively.
o dissimilarity measures with a finite range can be mapped to the interval [0,1]
by using the formula
Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the data.
These sources may include multiple data cubes, databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous source of schemas,
M stands for the mapping between the queries of the source and global schemas.
There are mainly 2 major approaches for data integration – one is the “tight coupling
approach” and another is the “loose coupling approach”.
Tight Coupling:
In this coupling, data is combined from different sources into a single physical location
through the process of ETL – Extraction, Transformation, and Loading.
Loose Coupling:
Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.
2. Redundancy:
The method of data reduction may achieve a condensed description of the original data which
is much smaller in quantity but keeps the quality of the original data.
Suppose there are the following attributes in the data set, in which a few attributes are
redundant.
3. Data Compression:
The data compression technique reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two types
based on their compression techniques.
Lossless Compression –
Encoding techniques (such as Run Length Encoding) allow a simple and minimal reduction in
data size. Lossless data compression uses algorithms to restore the precise original data from
the compressed data.
Lossy Compression –
Methods such as the Discrete Wavelet Transform and PCA (Principal Component Analysis)
are examples of this compression. For example, the JPEG image format uses lossy compression,
but we can still find meaning equivalent to the original image. In lossy data compression, the
decompressed data may differ from the original data but are useful enough to retrieve
information from them.
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller
representation of the data; in the parametric case it is important to store only the model
parameters. Alternatively, non-parametric methods such as clustering, histograms, and sampling
can be used.
Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to divide
the whole range of attribute values, and then repeat this recursively on the resulting intervals,
the process is known as top-down discretization, also known as splitting.
Bottom-up discretization –
If you first consider all the continuous values as potential split points and then discard some
by merging neighbouring values into intervals, the process is called bottom-up discretization,
also known as merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing low-level concepts (such as 43 for
age) with high-level concepts (categorical values such as middle age or senior).
Binning –
Binning is the process of changing numerical variables into categorical counterparts. The
number of categorical counterparts depends on the number of bins specified by the user.
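A small pandas sketch (assumed tool; the bin edges and labels are illustrative) that discretizes a numeric attribute into user-specified bins:

import pandas as pd

ages = pd.Series([22, 25, 43, 51, 67])
# Three user-specified bins turn the numeric attribute into categorical counterparts
categories = pd.cut(ages, bins=[0, 30, 55, 100], labels=["young", "middle age", "senior"])
print(categories.tolist())   # ['young', 'young', 'middle age', 'middle age', 'senior']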
Histogram analysis –
Like binning, histogram analysis is used to partition the values of an attribute X into
disjoint ranges called buckets. There are several partitioning rules, such as equal-width and
equal-frequency partitioning.
The data are transformed into forms appropriate for mining. Data transformation involves the
following steps:
1.Smoothing:
It is a process that is used to remove noise from the dataset using some algorithms. It allows
important features present in the dataset to be highlighted and helps in predicting patterns.
When collecting data, it can be manipulated to eliminate or reduce any variance or any other
form of noise.
The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns. This serves as a help to analysts or traders who need to
look at a lot of data which can often be difficult to digest for finding patterns that they
wouldn’t see otherwise.
2.Aggregation:
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources
into a data analysis description. This is a crucial step since the accuracy of data analysis
insights is highly dependent on the quantity and quality of the data used. Gathering accurate
data of high quality and a large enough quantity is necessary to produce relevant results.
The collection of data is useful for everything from decisions concerning financing or
business strategy of the product, pricing, operations, and marketing strategies.
For example, sales data may be aggregated to compute monthly and annual total amounts.
3.Discretization:
It is a process of transforming continuous data into a set of small intervals. Most data mining
activities in the real world involve continuous attributes, yet many of the existing data mining
frameworks are unable to handle these attributes.
Also, even if a data mining task can manage a continuous attribute, its efficiency can be
significantly improved by replacing the continuous attribute with its discrete values.
4. Attribute Construction:
New attributes are constructed from the given set of attributes and applied to assist the
mining process. This simplifies the original data and makes the mining more efficient.
5.Generalization:
It converts low-level data attributes to high-level data attributes using a concept hierarchy.
For example, age initially in numerical form (22, 25) is converted into a categorical value
(young, old).
For example, Categorical attributes, such as house addresses, may be generalized to higher-
level definitions, such as town or country.
Min-Max Normalization:
Suppose that min_A is the minimum and max_A is the maximum value of an attribute A, and v' is
the new value obtained after normalizing an old value v. Min-max normalization maps v onto a
new range [new_min_A, new_max_A] using:
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Solved example:
Suppose the minimum and maximum values for an attribute profit (P) are Rs. 10,000 and Rs.
100,000, and we want to map the profit into the range [0, 1]. Using min-max normalization, the
value of Rs. 20,000 for attribute profit maps to:
v' = ((20,000 − 10,000) / (100,000 − 10,000)) × (1 − 0) + 0 ≈ 0.111
Z-Score Normalization:
In z-score normalization (or zero-mean normalization), the values of an attribute A are
normalized based on the mean and standard deviation of A:
v' = (v − mean_A) / std_A
For example:
Let the mean of an attribute P be 60,000 and its standard deviation be 10,000. Using z-score
normalization, a value of 85,000 for P is transformed to:
v' = (85,000 − 60,000) / 10,000 = 2.5
Decimal Scaling:
It normalizes the values of an attribute by moving their decimal point. The number of places
the decimal point is moved is determined by the maximum absolute value of attribute A:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
For example:
If the largest values of the attribute are 98 and 97, we divide the numbers by 100 (i.e., j = 2,
the number of digits in the largest value) so that the values become 0.98, 0.97, and so on.
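A small NumPy sketch (assumed tool; the sample values are loosely drawn from the examples above) applying the three normalizations:

import numpy as np

profit = np.array([10_000.0, 20_000.0, 60_000.0, 100_000.0])

# Min-max normalization to [0, 1]
minmax = (profit - profit.min()) / (profit.max() - profit.min())

# Z-score normalization (zero mean, unit standard deviation)
zscore = (profit - profit.mean()) / profit.std()

# Decimal scaling: divide by 10^j so the largest absolute value is below 1 (here j = 6)
j = int(np.ceil(np.log10(np.abs(profit).max() + 1)))
decimal_scaled = profit / 10 ** j

print(minmax, zscore, decimal_scaled, sep="\n")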
Data visualization converts large and small data sets into visuals that are easy for humans to
understand and process.
Data visualization tools provide accessible ways to understand outliers, patterns, and trends in
the data.
In the world of Big Data, the data visualization tools and technologies are required to analyze
vast amounts of information.
Data visualizations are common in everyday life and often appear in the form of graphs and
charts. A combination of multiple visualizations and bits of information is referred to as an
infographic.
Data visualizations are used to discover unknown facts and trends. You can see visualizations
in the form of line charts to display change over time. Bar and column charts are useful for
observing relationships and making comparisons. A pie chart is a great way to show parts-of-
a-whole. And maps are the best way to share geographical data visually.
Today's data visualization tools go beyond the charts and graphs used in Microsoft Excel
spreadsheets, displaying data in more sophisticated ways such as dials and gauges, geographic
maps, heat maps, pie charts, and fever charts.
Effective data visualizations are created where communication, data science, and design
collide. Data visualizations done right turn key insights from complicated data sets into
something meaningful and natural.
To craft an effective data visualization, you need to start with clean data that is well-sourced
and complete. After the data is ready to visualize, you need to pick the right chart.
After you have decided the chart type, you need to design and customize your visualization to
your liking. Simplicity is essential - you don't want to add any elements that distract from the
data.
Similarity Measures
Similarity and dissimilarity are important because they are used by a number of data mining
techniques, such as clustering, nearest neighbour classification, and anomaly detection.
The term proximity is used to refer to either similarity or dissimilarity.
Definitions
The similarity between two objects is a numerical measure of the degree to which the two
objects are alike. Consequently, similarities are higher for pairs of objects that are more
alike. Similarities are usually non-negative and are often between 0 (no similarity) and
1 (complete similarity).
The dissimilarity between two objects is the numerical measure of the degree to which the
two objects are different. Dissimilarity is lower for more similar pairs of objects.
Proximity Measures
Proximity measures, especially similarities, are defined to have values in the interval [0,1]. If
the similarity between objects can range from 1 (not at all similar) to 10 (completely similar),
we can make them fall into the range [0,1] by using the formula: s'=(s-1)/9, where s and s' are
the original and the new similarity values, respectively.
In the more general case, s' is calculated as s' = (s − min_s)/(max_s − min_s), where min_s and
max_s are the minimum and maximum similarity values, respectively.
Likewise, dissimilarity measures with a finite range can be mapped to the interval [0,1] by
using the formula d'=(d-min_d)/(max_d- min_d)
If the proximity measure originally takes values in the interval [0, ∞], then we usually use the
formula d' = d/(1+d) to map the dissimilarity measure into [0,1].
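A short NumPy sketch (assumed tool; the values are illustrative) applying these proximity transformations:

import numpy as np

similarities = np.array([1, 4, 7, 10], dtype=float)         # original scale: 1 to 10
s_new = (similarities - similarities.min()) / (similarities.max() - similarities.min())

dissimilarities = np.array([0.0, 2.0, 5.0, 40.0])            # range 0 to infinity
d_new = dissimilarities / (1 + dissimilarities)               # maps [0, inf) into [0, 1)

print(s_new)   # [0.    0.333 0.667 1.   ]
print(d_new)   # [0.    0.667 0.833 0.976]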
The proximity of objects with a number of attributes is defined by combining the proximities
of individual attributes.
1) For interval or ratio attributes, the natural measure of dissimilarity between two attributes
is the absolute difference of their values. For example, we might compare our current weight
to our weight one year ago. In such cases the dissimilarities range from 0 to ∞.
2) For objects described with one nominal attribute, the attribute value describes whether the
attribute is present in the object or not. Comparing two objects with one nominal attribute
means comparing the values of this attribute. In that case, similarity is traditionally defined as
1 if attribute values match and as 0 otherwise. A dissimilarity would be defined in the
opposite way: 0 if the attribute values match, 1 if they do not.
3) For objects with a single ordinal attribute, information about order should be taken into
account. Consider an attribute that measures the quality of a product on the scale {poor, fair,
OK, good, wonderful}. It would be reasonable that a product P1 which was rated wonderful
would be closer to a product P2 rated good rather than a product P3 rated OK. To make this
observation quantitative, the values of the ordinal attribute are often mapped to successive
integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}. Then
d(P1, P2) = 4 − 3 = 1.
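A tiny Python sketch (illustrative attribute values; the helper names are hypothetical) implementing the nominal and ordinal rules above:

# Nominal attribute: similarity 1 if values match, 0 otherwise
def nominal_similarity(a, b):
    return 1 if a == b else 0

# Ordinal attribute: map values to successive integers; dissimilarity is the absolute difference
quality_rank = {"poor": 0, "fair": 1, "OK": 2, "good": 3, "wonderful": 4}

def ordinal_dissimilarity(a, b):
    return abs(quality_rank[a] - quality_rank[b])

print(nominal_similarity("red", "red"))           # 1
print(ordinal_dissimilarity("wonderful", "good"))  # 1  (d(P1, P2) = 4 - 3)
print(ordinal_dissimilarity("wonderful", "OK"))    # 2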
Distances:
Distances are dissimilarities with certain properties. The Euclidean distance, d, between two
points x and y in one-, two-, or higher-dimensional space is given by the formula:
d(x, y) = √( ∑_{k=1}^{n} (x_k − y_k)^2 )
where n is the number of dimensions and x_k and y_k are, respectively, the kth attribute
(component) of x and y.
The Euclidean distance measure is generalized by the Minkowski distance metric, given by:
d(x, y) = ( ∑_{k=1}^{n} |x_k − y_k|^r )^{1/r}
r = 1: also known as the City block (Manhattan or L1 norm) distance. A common example is the
Hamming distance, which is the number of bits that are different between two objects that
have only binary attributes (i.e., binary vectors).
r = 2: gives the Euclidean (L2 norm) distance described above.
As r → ∞, the Minkowski distance becomes the supremum (Lmax or L∞ norm) distance, the maximum
difference between any attribute of the two objects:
d(x, y) = lim_{r→∞} ( ∑_{k=1}^{n} |x_k − y_k|^r )^{1/r} = max_k |x_k − y_k|
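A small NumPy sketch (assumed tool; the vectors are illustrative) computing the Minkowski distance for r = 1, r = 2, and the r → ∞ limit:

import numpy as np

x = np.array([1.0, 0.0, 3.0])
y = np.array([4.0, 4.0, 1.0])

def minkowski(x, y, r):
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

print(minkowski(x, y, 1))            # City block (L1) distance: 9.0
print(minkowski(x, y, 2))            # Euclidean (L2) distance: ~5.39
print(np.max(np.abs(x - y)))         # Supremum (L-infinity) distance: 4.0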
If s(x, y) is the similarity between points x and y, then typically s(x, y) = 1 only if x = y,
and 0 ≤ s(x, y) ≤ 1.
Non-symmetric Similarity
Consider an experiment in which people are asked to classify a small set of characters as they
flash on the screen. The confusion matrix for this experiment records how often each
character is classified as itself, and how often it is classified as another character. For
example, suppose "0" appeared 200 times and was classified as "0" 160 times but as "o" 40
times. Likewise, suppose that "o" appeared 200 times and was classified as "o" 170 times and
as "0" 30 times
Non-symmetric Similarity
If we take these counts as a measure of similarity between the two characters, then we have a
similarity measure, but not a symmetric one.
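A tiny Python sketch using the counts from the example above, showing why such a confusion-based similarity measure is not symmetric:

# Confusion counts from the character-classification experiment above
counts = {("0", "0"): 160, ("0", "o"): 40,
          ("o", "o"): 170, ("o", "0"): 30}
appearances = {"0": 200, "o": 200}

def similarity(a, b):
    # Fraction of times character a was classified as character b
    return counts.get((a, b), 0) / appearances[a]

print(similarity("0", "o"))   # 0.20
print(similarity("o", "0"))   # 0.15  -> s("0","o") != s("o","0"), so the measure is not symmetric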