Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU

DATA MINING
CSE-443
Ayesha Aziz Prova

Lecturer,
Dept. of CSE
CWU
Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy
generation
Data Cleaning

Importance
- “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
- “Data cleaning is the number one problem in data warehousing” (DCI survey)

Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
- Resolve redundancy caused by data integration
Missing Data

Data is not always available
- E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
- equipment malfunction
- data that was inconsistent with other recorded data and was therefore deleted
- data not entered due to misunderstanding
- data not considered important at the time of entry
- history or changes of the data not being recorded

Missing data may need to be inferred.
How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.

Fill in the missing value manually: tedious and often infeasible

Fill it in automatically with
- a global constant, e.g., “unknown” (which may effectively become a new class)
- the attribute mean
- the attribute mean for all samples belonging to the same class (smarter)
- the most probable value: inference-based, such as a Bayesian formula or a decision tree
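
As a rough illustration of two of the automatic strategies above (the attribute mean and the per-class attribute mean), here is a minimal sketch in plain Python. The sample tuples, the class labels "budget" and "premium", and the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales tuples: (class label, income). None marks a missing value.
rows = [
    ("budget", 30000), ("budget", None), ("budget", 34000),
    ("premium", 90000), ("premium", 86000), ("premium", None),
]

def fill_with_attribute_mean(rows):
    """Replace each missing income with the mean of all known incomes."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing income with the mean income of the same class."""
    known = defaultdict(list)
    for c, v in rows:
        if v is not None:
            known[c].append(v)
    class_means = {c: mean(vs) for c, vs in known.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]

by_attribute = fill_with_attribute_mean(rows)
by_class = fill_with_class_mean(rows)
```

With this data, the attribute mean fills both gaps with 60000, while the class mean fills them with 32000 and 88000, which is why the slide calls the per-class variant smarter.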
Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

Other data problems that require data cleaning
- duplicate records
- incomplete data
- inconsistent data
How to Handle Noisy Data?

Binning
- first sort the data and partition it into (equal-frequency) bins
- then smooth by bin means, bin medians, bin boundaries, etc.

Regression
- smooth by fitting the data to regression functions

Clustering
- detect and remove outliers

Combined computer and human inspection
- detect suspicious values automatically and have a human check them (e.g., to deal with possible outliers)
Simple Discretization Methods: Binning

Equal-width (distance) partitioning
- Divides the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
- The most straightforward approach, but outliers may dominate the presentation
- Skewed data is not handled well

Equal-depth (frequency) partitioning
- Divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky
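
The two partitioning schemes can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, the price list is the one used in the smoothing example of this deck, and the equal-width version assumes the values are not all identical (otherwise W would be zero).

```python
def equal_width_bins(values, n):
    """Split values into n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n  # assumes b > a
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value B is placed in the last interval.
        idx = min(int((v - a) / width), n - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split sorted values into n bins holding about the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional sample each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
```

For these prices, the equal-width bins hold 3, 3, and 6 values, while the equal-depth bins hold 4 values each, which illustrates how skew affects the first scheme.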
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (each mean rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value replaced by the closer of the bin's minimum and maximum):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
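
The smoothing steps in this example can be reproduced with a short sketch. The function names are hypothetical, the means are rounded to the nearest integer to match the slide, and sending ties to the lower boundary is an assumption (no ties occur in this data).

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: a value equidistant from both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```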
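Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary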
Data Integration

Data integration:
- Combines data from multiple sources into a coherent store

Schema integration: e.g., A.cust-id ≡ B.cust-#
- Integrate metadata from different sources

Entity identification problem:
- Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton

Detecting and resolving data value conflicts
- For the same real-world entity, attribute values from different sources are different
- Possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundancy in Data Integration

Redundant data often occur when multiple databases are integrated
- Object identification: the same attribute or object may have different names in different databases
- Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue

Redundant attributes can often be detected by correlation analysis

Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson’s product-moment coefficient)

$$
r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n - 1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n - 1)\,\sigma_A \sigma_B}
$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (AB)$ is the sum of the AB cross-product.
- If $r_{A,B} > 0$, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
- $r_{A,B} = 0$: no linear correlation (this alone does not imply independence); $r_{A,B} < 0$: negatively correlated
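
A minimal sketch of the coefficient, using the cross-product form (Σ(AB) − n·mean(A)·mean(B)) / ((n − 1)·σA·σB) with sample standard deviations (n − 1 in the denominator). The function name and the small data lists are hypothetical.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation of two equally long numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample standard deviations, dividing by n - 1 as in the formula.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum(x * y for x, y in zip(a, b))  # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / ((n - 1) * sd_a * sd_b)

# Hypothetical attributes: B rises with A, C falls as A rises.
attr_a = [1, 2, 3, 4, 5]
attr_b = [2, 4, 6, 8, 10]
attr_c = [10, 8, 6, 4, 2]

r_ab = pearson_r(attr_a, attr_b)
r_ac = pearson_r(attr_a, attr_c)
```

Here r_ab is close to +1 and r_ac is close to −1, the two extremes of the coefficient. The sketch assumes neither attribute is constant, since a zero standard deviation would make the ratio undefined.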
Correlation Analysis (Categorical Data)

χ² (chi-square) test

$$
\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}
$$

The larger the χ² value, the more likely the variables are related

The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count

Correlation does not imply causality
- The number of hospitals and the number of car thefts in a city are correlated
- Both are causally linked to a third variable: population
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories)

$$
\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} \approx 507.94
$$

It shows that like_science_fiction and play_chess are correlated in the group
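
The calculation above can be checked with a short sketch. It derives each expected count as (row sum × column sum) / total, which reproduces the parenthesized values in the table, and then sums (observed − expected)² / expected over all cells. The function name is hypothetical.

```python
def chi_square(observed):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_sums = [sum(r) for r in observed]
    col_sums = [sum(c) for c in zip(*observed)]
    total = sum(row_sums)
    # Expected count of a cell = row sum * column sum / grand total.
    expected = [[rs * cs / total for cs in col_sums] for rs in row_sums]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return chi2, expected

# Rows: like / not like science fiction. Columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
chi2, expected = chi_square(observed)
```

The expected counts come out as 90, 360, 210, and 840, and the statistic evaluates to roughly 507.9.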
Data Transformation

Smoothing: remove noise from data

Aggregation: summarization, data cube construction

Generalization: concept hierarchy climbing

Normalization: values are scaled to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling

Attribute/feature construction
- New attributes constructed from the given ones
Data Transformation: Normalization

Min-max normalization: to [new_minA, new_maxA]

$$
v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A
$$

- Ex. Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$
\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716
$$

Z-score normalization (μ: mean, σ: standard deviation):

$$
v' = \frac{v - \mu_A}{\sigma_A}
$$

- Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to

$$
\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225
$$

Normalization by decimal scaling

$$
v' = \frac{v}{10^j}
$$

where j is the smallest integer such that max(|v'|) < 1
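
The three normalization methods can be sketched as below, using the income figures from the examples above. The function names are hypothetical, and the value list passed to the decimal scaling function (−986 to 917) is an assumed example that does not come from the slides. The decimal scaling sketch also restricts j to non-negative integers, which is a simplifying assumption.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73600, 12000, 98000)
income_z = z_score(73600, 54000, 16000)
scaled, j = decimal_scaling([-986, 917])  # assumed example values
```

The first two results match the worked examples (about 0.716 and 1.225). For the assumed list, j is 3, so −986 becomes −0.986 and 917 becomes 0.917.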
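Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary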
Data Reduction Strategies

Why data reduction?
- A database/data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction
- Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies
- Data cube aggregation
- Dimensionality reduction, e.g., remove unimportant attributes
- Data compression
- Numerosity reduction, e.g., fit data into models
- Discretization and concept hierarchy generation
THANKS
Any questions?
