Evans Analytics2e PPT 10 Data Mining
Introduction to Data Mining
Data Mining
Data mining focuses on better understanding the
characteristics of, and patterns among, variables in
large databases using a variety of statistical and
analytical tools.
◦ It is used to identify relationships among variables in
large data sets and understand hidden patterns that
they may contain.
◦ XLMiner software implements many basic data mining
procedures in a spreadsheet environment.
The Scope of Data Mining
Data Exploration and Reduction
identifying groups in which elements are in some way similar
Classification
analyzing data to predict how to classify a new data element
Association
analyzing databases to identify natural associations among
variables and create rules for target marketing or buying
recommendations
Cause-and-effect Modeling
developing analytic models to describe relationships between
metrics that drive business performance
Data Exploration in XLMiner
XLMiner ribbon
Cluster # Colleges
1 23
2 22
3 3
4 1
Classification
Classification methods seek to classify a
categorical outcome into one of two or more
categories based on various data attributes.
For each record in a database, we have a
were approved
◦ Classification rule: Reject if credit score ≤ 640
2 misclassifications
out of 50 = 4%
Example 10.7 Continued
Alternate classification rule using visualization
Reject if years + 0.095(credit score) ≤ 74.66
3 misclassifications
out of 50 = 6%
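The two rules above can be sketched in a few lines of Python to show how a misclassification rate is counted. The five records below are hypothetical values made up for illustration (they are not the 50 records from the example), and the function names are assumptions of this sketch.

```python
# Hypothetical records, for illustration only:
# (credit_score, years_of_credit_history, actual_decision)
records = [
    (725, 20, "Approve"),
    (573, 9, "Reject"),
    (677, 11, "Approve"),
    (625, 15, "Reject"),
    (660, 2, "Reject"),
]


def rule_score_only(score, years):
    """Rule from the slide: reject if credit score <= 640."""
    return "Reject" if score <= 640 else "Approve"


def rule_score_and_years(score, years):
    """Alternate rule from the slide: reject if years + 0.095 * score <= 74.66."""
    return "Reject" if years + 0.095 * score <= 74.66 else "Approve"


def misclassification_rate(rule, data):
    """Fraction of records whose predicted decision differs from the actual one."""
    errors = sum(1 for score, years, actual in data if rule(score, years) != actual)
    return errors / len(data)


print(misclassification_rate(rule_score_only, records))
print(misclassification_rate(rule_score_and_years, records))
```

The same counting gives the figures on the slides: 2 errors out of 50 records is 2/50 = 4%, and 3 errors out of 50 is 3/50 = 6%.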
Measuring Classification Performance
specified.
Example 10.9: Partitioning Data Sets in XLMiner
Modified Credit Approval Decisions data
XLMiner > Partition Data > Standard Partition
Select the variables
Choose partitioning options and percentages
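Outside the spreadsheet, a random partition into training and validation sets might look like the sketch below. It is not XLMiner's procedure: the 60/40 split, the seed value, and the function name `standard_partition` are all assumptions chosen for illustration.

```python
import random


def standard_partition(rows, train_fraction=0.6, seed=12345):
    """Randomly split rows into (training, validation) lists.

    train_fraction and seed are illustrative assumptions, not XLMiner settings.
    """
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = list(rows)           # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for 50 records, identified only by row number.
rows = list(range(1, 51))
training, validation = standard_partition(rows)
print(len(training), len(validation))
```

The model is then fitted on the training rows only, and its misclassification rate is checked on the validation rows it has not seen.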
Example 10.9 Continued
Results
Classifying New Data
After a classification scheme is chosen and the
best model is developed based on existing data,
we use the predictor variables as inputs to the
model to predict the output.
Example 10.9: Classifying New Data for Credit
Decisions Using Credit Scores and Years of Credit
History
Classify new data using the prior rules developed
PC Purchase Data
We might want to know which components are often
ordered together.
Measuring Strength of Association
Support for the (association) rule is the percentage (or number) of
transactions that include all items in both the antecedent and the consequent.
Confidence of the (association) rule is the ratio of the number of
transactions that include all items in the consequent as well as the
antecedent (namely, the support) to the number of transactions that
include all items in the antecedent.
◦ Confidence (Conf.%) means that of the people who bought a 15-inch screen and an
Intel Core i7 processor, all (100%) bought 750 GB hard drives as well.
◦ Support (a) indicates that 5 customers bought a 15-inch screen and an Intel Core i7
processor.
◦ Support (c) indicates the total number of transactions that include the
consequent (here, a 750 GB hard drive).
◦ Support (a ∪ c) is the number of transactions in which a 15-inch screen, Intel Core
i7, and 750 GB hard drive were ordered.
◦ Lift Ratio indicates how much more likely we are to encounter a transaction that
includes a 750 GB hard drive if we consider just those transactions where a 15-inch
screen and Intel Core i7 are purchased, as compared to the entire population of transactions.
Cause-and-Effect Modeling
Correlation analysis can help us develop cause-
and-effect models that relate lagging and leading
measures.
Lagging measures tell us what has happened and are
often external business results such as profit, market
share, or customer satisfaction.
Leading measures predict what will happen and are
usually internal metrics such as employee satisfaction,
productivity, and turnover.
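As a rough sketch of the correlation step, the Python below computes Pearson correlations between pairs of measures and arranges them as a matrix. The yearly values are hypothetical numbers invented for illustration (not the survey data), and the function and variable names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(measures):
    """Nested dict of pairwise correlations for a dict of named series."""
    return {
        row: {col: pearson(measures[row], measures[col]) for col in measures}
        for row in measures
    }


# Hypothetical yearly averages on a 1-5 scale, for illustration only.
measures = {
    "employee_satisfaction": [3.0, 3.2, 3.5, 3.7, 4.0],   # leading measure
    "customer_satisfaction": [3.1, 3.3, 3.4, 3.8, 4.1],   # lagging measure
}

matrix = correlation_matrix(measures)
for name, row in matrix.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

A high correlation between a leading and a lagging measure is a candidate link for a cause-and-effect model, but correlation alone does not establish which way the effect runs.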
Example 10.19: Using Correlation for
Cause-and-Effect Modeling
Ten Year Survey data
◦ Satisfaction was measured on a 1-5 scale.
Correlation matrix
Example 10.19 Continued
Logical model