
Operational Data Analysis for Industrial Systems
(Lecture 3)
Synopsis
• Data Preprocessing
✓ Data cleaning
✓ Data integration
✓ Data reduction
✓ Data transformation
Data Preprocessing
• Real-world data are highly susceptible to noise, missing values, and inconsistencies.

• Improving the data quality

• Data quality – accuracy, completeness, consistency, timeliness, believability, and interpretability
Major Tasks in Data Preprocessing
1. Data cleaning – filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies

2. Data integration – include and integrate data from multiple sources in data analysis

3. Data reduction – reduce the complexity of the data

• Dimensionality reduction – reduce the number of attributes under consideration
• Numerosity reduction – replace the data by a parametric model (regression or log-linear models) or a nonparametric model (histograms, clusters, sampling)

4. Data transformation – normalization, discretization, concept hierarchy generation
Data Cleaning
Missing Values

1. Ignore the data point – usually done when the class label is missing (not very effective)
2. Fill in the missing value manually – time-consuming
3. Use a global constant to fill in the missing value – e.g., "Unknown" or -∞
4. Use a measure of central tendency for the attribute – mean or median
5. Use the attribute mean or median of all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
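
Strategies 3 to 5 above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names, the use of `None` as the missing-value marker, and the sample records are assumptions made for this sketch, not part of the lecture.

```python
from collections import defaultdict
from statistics import mean, median

def fill_with_constant(values, constant="Unknown"):
    """Strategy 3: replace every missing value (None) with a global constant."""
    return [constant if v is None else v for v in values]

def fill_with_central_tendency(values, measure=mean):
    """Strategy 4: replace missing values with the attribute mean (or median)."""
    observed = [v for v in values if v is not None]
    fill = measure(observed)
    return [fill if v is None else v for v in values]

def fill_with_class_central_tendency(labels, values, measure=mean):
    """Strategy 5: replace missing values with the mean (or median) of the
    observed values that share the same class label.
    Assumes every class has at least one observed value."""
    by_class = defaultdict(list)
    for label, v in zip(labels, values):
        if v is not None:
            by_class[label].append(v)
    fills = {label: measure(vs) for label, vs in by_class.items()}
    return [fills[label] if v is None else v for label, v in zip(labels, values)]

# Hypothetical records: class labels and one numeric attribute with gaps.
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, None, 14.0, 30.0, 34.0, None]

print(fill_with_constant(values))
print(fill_with_central_tendency(values))                 # global mean
print(fill_with_central_tendency(values, measure=median)) # global median
print(fill_with_class_central_tendency(labels, values))   # per-class mean
```

The per-class variant usually preserves more structure than the global one, because a single global mean pulls the filled values of every class toward the same number.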
Data Cleaning
Noisy Data - Noise is a random error or variance in a measured variable

• Binning: binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it
1. Smoothing by bin means
2. Smoothing by bin medians
3. Smoothing by bin boundaries
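
The three smoothing variants can be sketched as follows. The sketch assumes equal-frequency bins over the sorted values and, for smoothing by bin boundaries, sends ties to the lower boundary; both choices, the function names, and the sample values are assumptions for illustration only.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_bin_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_bin_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum
    (ties go to the lower boundary in this sketch)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Illustrative values, not taken from the lecture.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                            # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_bin_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_bin_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```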
Data Cleaning
Noisy Data - Noise is a random error or variance in a measured variable

• Regression – smooth the data by fitting the values to a function (e.g., a regression line)
• Outlier analysis – outliers may be detected by clustering
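
As a rough illustration of smoothing by regression, the sketch below fits a least-squares line to one attribute as a function of another and replaces each measured value with the fitted one. Simple linear regression is an assumption here (the slide does not say which regression model is meant), and the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each noisy y value with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements scattered around a straight line.
xs = [0, 1, 2, 3]
noisy_ys = [1.0, 3.2, 4.8, 7.0]
print(fit_line(xs, noisy_ys))
print(smooth_by_regression(xs, noisy_ys))
```

Points that lie far from the fitted line are also natural candidates for closer inspection as possible outliers.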
Data Integration
Redundancy and Correlation Analysis

• Redundancy – an attribute may be redundant if it can be derived from another attribute or a set of attributes
• Some redundancy can be detected by correlation analysis
✓ χ² correlation test for nominal data
✓ Correlation coefficient for numeric data
✓ Covariance of numeric data
Data Integration
χ² Correlation Test for Nominal Data

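• Suppose A has c distinct values, $a_1, a_2, \ldots, a_c$; B has r distinct values, $b_1, b_2, \ldots, b_r$

$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$$

• $o_{ij}$ is the observed frequency of the joint event $(A = a_i, B = b_j)$
• $e_{ij}$ is the expected frequency
• n is the number of data points
• degrees of freedom = (r-1)(c-1)

The observed counts are arranged in a contingency table with rows $a_1, a_2, \ldots, a_c$ for A, columns $b_1, b_2, \ldots, b_r$ for B, and a Total row and column.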
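
The computation above can be sketched directly from a contingency table. The layout (rows for the values of A, columns for the values of B), the function name, and the counts are assumptions for illustration; the sketch also assumes that no expected frequency is zero.

```python
def chi_square(observed):
    """Return (chi2, degrees_of_freedom) for a contingency table.

    observed[i][j] is the count of data points with A = a_i and B = b_j.
    """
    row_totals = [sum(row) for row in observed]        # count(A = a_i)
    col_totals = [sum(col) for col in zip(*observed)]  # count(B = b_j)
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n   # expected frequency
            chi2 += (o_ij - e_ij) ** 2 / e_ij

    dof = (len(col_totals) - 1) * (len(row_totals) - 1)  # (r-1)(c-1)
    return chi2, dof

# Illustrative 2 x 2 table (made-up counts, n = 1500).
table = [
    [250, 200],
    [50, 1000],
]
chi2, dof = chi_square(table)
print(round(chi2, 2), dof)
```

For this table the statistic is about 507.94 with 1 degree of freedom. Whether that is significant is decided by comparing it with the critical value of the χ² distribution for the chosen significance level; a value this large would indicate that the two attributes are strongly correlated.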
Data Integration
Correlation Coefficient for Numeric Data

• Correlation coefficient (Pearson’s product moment coefficient)

$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{n \sigma_A \sigma_B}$$

• n is the number of data points
• $a_i$ and $b_i$ are the values of A and B
• $\bar{A}$ and $\bar{B}$ are the mean values of A and B
• $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B
• $r_{A,B}$ is in the range [-1, +1]
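
A small sketch can confirm that the two forms of the formula give the same result. It assumes population standard deviations (dividing by n), which is what the n in the denominator implies; the function names and data are hypothetical.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def population_std(xs):
    """Standard deviation with n (not n - 1) in the denominator."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def pearson_definition(a, b):
    """First form: sum of products of deviations from the means."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    numerator = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return numerator / (n * population_std(a) * population_std(b))

def pearson_shortcut(a, b):
    """Second form: sum of raw products minus n times the product of means."""
    n = len(a)
    numerator = sum(ai * bi for ai, bi in zip(a, b)) - n * mean(a) * mean(b)
    return numerator / (n * population_std(a) * population_std(b))

# Illustrative paired observations (made-up values).
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]
print(pearson_definition(a, b))
print(pearson_shortcut(a, b))
```

Both calls return roughly 0.867 here, a fairly strong positive correlation. A value near 0 would suggest no linear relationship, and a value near -1 a strong negative one.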
Data Integration
Covariance of Numeric Data


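$$E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}$$

$$E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}$$

$$Cov(A, B) = E((A - \bar{A})(B - \bar{B})) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$$

$$r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}$$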
Data Reduction
Principal Components Analysis (PCA)
Data Reduction
Attribute Subset Selection
Data Reduction
• Regression and Log-linear Models: Parametric Data Reduction
• Histograms
• Clustering
• Sampling
Data Transformation
Data Transformation by Normalization
• Min-max normalization – a linear transformation. Suppose that $min_A$ and $max_A$ are the minimum and maximum values of an attribute A. Min-max normalization maps a value $v_i$ of A to $v_i'$ in the range $[new\_min_A, new\_max_A]$ by computing

$$v_i' = \frac{v_i - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A$$

• Z-score normalization

$$v_i' = \frac{v_i - \bar{A}}{\sigma_A}$$

• Decimal scaling normalization: normalization by moving the decimal point of values of attribute A

$$v_i' = \frac{v_i}{10^j}$$

where j is the smallest integer such that $\max(|v_i'|) < 1$
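
The three normalization methods can be sketched with the standard library. The sketch assumes the population standard deviation for the z-score and takes j as the smallest integer that brings every scaled magnitude below 1; the function names and sample values are illustrative only.

```python
from statistics import mean, pstdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min_A, max_A] to [new_min, new_max].
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Center on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling_normalize(values):
    """Divide by 10**j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Illustrative values, not taken from the lecture.
incomes = [12000, 73600, 98000]
print(min_max_normalize(incomes))                 # middle value maps to about 0.716
print(z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling_normalize([-986, 917]))     # j = 3
```

Min-max normalization preserves the relationships among the original values but is sensitive to outliers, since a single extreme value stretches the range. Z-score normalization is often preferred when the true minimum and maximum are unknown.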
Data Transformation
Data Transformation by Discretization - The raw values of a numeric attribute
are replaced by interval or conceptual labels

• Discretization by binning
• Discretization by histogram analysis
• Discretization by cluster analysis
• Discretization by decision tree analysis
• Discretization by correlation analysis
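
As an illustration of the first option, the sketch below discretizes a numeric attribute by equal-width binning and replaces each raw value with either an interval label or a conceptual label. Equal-width bins are only one possible binning choice, and the function name, labels, and ages are hypothetical.

```python
def discretize_equal_width(values, k, labels=None):
    """Replace each value with the label of its equal-width bin.

    With labels=None, interval labels such as "[13, 32)" are generated;
    otherwise labels must be a list of k conceptual labels.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k + 1)]
    if labels is None:
        # Every bin is half-open except the last, which includes the maximum.
        labels = [
            f"[{edges[i]:g}, {edges[i + 1]:g}" + ("]" if i == k - 1 else ")")
            for i in range(k)
        ]
    result = []
    for v in values:
        # The maximum value would land in bin k, so clamp it into bin k - 1.
        idx = 0 if width == 0 else min(int((v - lo) / width), k - 1)
        result.append(labels[idx])
    return result

# Hypothetical ages.
ages = [13, 22, 35, 46, 52, 70]
print(discretize_equal_width(ages, 3))
print(discretize_equal_width(ages, 3, labels=["young", "middle_aged", "senior"]))
```

Equal-width bins are simple but can leave some bins nearly empty when the data are skewed; equal-frequency bins are a common alternative in that case.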
