DWDM Unit-2


UNIT-2

Syllabus :

Data Mining:
Types of Data,
Data Mining Functionalities,
Interestingness of Patterns, Classification of Data Mining systems,
Data Mining Task Primitives,
Integration of a Data Mining System with a Database or a Data Warehouse System,
Major issues in Data Mining,
Applications of Data mining

Data Preprocessing:
Data cleaning,
Data integration and data transformation,
Data reduction: data cube aggregation, dimensionality reduction
UNIT-II : Chapter 1

• 2.1 Motivation: Why data mining?


• 2.2 What is data mining?
• 2.3 Data Mining: On what kind of data?
• 2.4 Data mining functionality: What kinds of Patterns Can Be Mined?
• 2.5 Are all the patterns interesting?
• 2.6 Classification of data mining systems
• 2.7 Data Mining Task Primitives
• 2.8 Integration of data mining system with a DB and DW System
• 2.9 Major issues in data mining



2.1 Motivation: Why data mining?

• The Explosive Growth of Data: from terabytes (1000^4 bytes) to yottabytes (1000^8 bytes)


– Data collection and data availability
• Automated data collection tools, database systems, web
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: bioinformatics, scientific simulation, medical research …
• Society and everyone: news, digital cameras, …
• Data rich but information poor!
– What do those data mean?
– How to analyze data?

• Data mining — Automated analysis of massive data sets


2.2 What is data mining?

• Data mining (knowledge discovery from data)


– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.



• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– Bioinformatics and bio-data analysis
• Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, surveys …
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.,
• E.g. Most customers with an income level of 60k – 80k and food expenses of $600 – $800 a month live in a particular area
– Determine customer purchasing patterns over time
• E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k usually buy this type of
CD player

• Cross-market analysis—Find associations/correlations between product sales, and predict based on such associations
– E.g. Customers who buy computer A usually buy software B



• Customer requirement analysis
– Identify the best products for different customers
– Predict what factors will attract new customers
• Provision of summary information
– Multidimensional summary reports
• E.g. Summarize all transactions of the first quarter from three different branches
Summarize all transactions of last year from a particular branch
Summarize all transactions of a particular product
– Statistical summary information
• E.g. What is the average age for customers who buy product A?

• Fraud detection
– Find outliers of unusual transactions
• Financial planning
– Summarize and compare the resources and spending

• Learning the application domain
– relevant prior knowledge and goals of application
• Identifying a target data set: data selection
• Data preprocessing
– Data cleaning (remove noise and inconsistent data)
– Data integration (multiple data sources may be combined)
– Data selection (data relevant to the analysis task are retrieved from database)
– Data transformation (data transformed or consolidated into forms appropriate for
mining)
(Done with data preprocessing)
– Data mining (an essential process where intelligent methods are applied to extract
data patterns)
– Pattern evaluation (identify the truly interesting patterns)
– Knowledge presentation (mined knowledge is presented to the user with
visualization or representation techniques)
• Use of discovered knowledge
Increasing potential to support business decisions (bottom layer to top layer), with the typical user at each layer:

– Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
– Data Preprocessing/Integration, Data Warehouses (DBA)
– Data Exploration: statistical summary, querying, and reporting (Data Analyst)
– Data Mining: information discovery (Data Analyst)
– Data Presentation: visualization techniques (Business Analyst)
– Decision Making (End User)
• Database, data warehouse, WWW or other information
repository (store data)
• Database or data warehouse server (fetch and
combine data)
• Knowledge base (turn data into meaningful groups
according to domain knowledge)
• Data mining engine (perform mining tasks)
• Pattern evaluation module (find interesting patterns)
• User interface (interact with the user)



Data mining is a confluence of multiple disciplines: database technology, statistics, machine learning, information science, visualization, and other disciplines.

• Not all “data mining systems” perform true data mining

 machine learning systems, statistical analysis (small amounts of data)
 database systems (information retrieval, deductive querying, …)



2.3 Data Mining: On what kind of data?

• Database-oriented data sets and applications


– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Object-Relational Databases
– Temporal Databases, Sequence Databases, Time-Series databases
– Spatial Databases and Spatiotemporal Databases
– Text databases and Multimedia databases
– Heterogeneous Databases and Legacy Databases
– Data Streams
– The World-Wide Web



• DBMS – database management system: contains a collection of interrelated databases
e.g. Faculty database, student database, publications database
• Each database contains a collection of tables and functions to
manage and access the data.
e.g. student_bio, student_graduation, student_parking
• Each table contains columns and rows, with columns as attributes of
data and rows as records.
• Tables can be used to represent the relationships between or
among multiple tables.

• With a relational query language, e.g. SQL, we will be able to find
answers to questions such as:
– How many items were sold last year?
– Who has earned commissions higher than 10%?
– What were the total sales of Dell laptops last month?
• When data mining is applied to relational databases, we can search
for trends or data patterns.
• Relational databases are one of the most commonly available and
rich information repositories, and thus are a major data form in our
study.



• A repository of information collected from multiple sources, stored
under a unified schema, and that usually resides at a single site.
• Constructed via a process of data cleaning, data integration, data
transformation, data loading and periodic data refreshing.



• Data are organized around major subjects, e.g. customer, item,
supplier and activity.
• Provide information from a historical perspective (e.g. from the past
5 – 10 years)
• Typically summarized to a higher level (e.g. a summary of the
transactions per item type for each store)
• Users can perform drill-down or roll-up operations to view the data at different degrees of summarization



• Consists of a file where each record represents a transaction
• A transaction typically includes a unique transaction ID and a list of
the items making up the transaction.

• Either stored in a flat file or unfolded into relational tables


• Easy to identify items that are frequently sold together



2.4 Data mining functionality: What kinds of Patterns Can
Be Mined?

• Concept/Class Description: Characterization and Discrimination
– Data can be associated with classes or concepts.
• E.g. classes of items – computers, printers, …
concepts of customers – bigSpenders, budgetSpenders, …
• How to describe these items or concepts?
– Descriptions can be derived via
• Data characterization – summarizing the general characteristics of a
target class of data.
– E.g. summarizing the characteristics of customers who spend more than $1,000 a year
at AllElectronics. Result can be a general profile of the customers, such as 40 – 50 years
old, employed, have excellent credit ratings.



• Data discrimination – comparing the target class with one or a set of
comparative classes
– E.g. Compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by 30% during the same period

• Or both of the above

• Mining Frequent Patterns, Associations and Correlations
– Frequent itemset: a set of items that frequently appear
together in a transactional data set (e.g. milk and bread)
– Frequent subsequence: a pattern that customers tend to purchase
product A, followed by a purchase of product B



– Association Analysis: find frequent patterns
• E.g. a sample analysis result – an association rule:
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
(if a customer buys a computer, there is a 50% chance that she will buy software.
1% of all of the transactions under analysis showed that computer and software
are purchased together. )
• Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
– Correlation Analysis: additional analysis to find statistical correlations
between associated pairs
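
To make support and confidence concrete, here is a minimal Python sketch (illustrative, not from the text) that computes both measures for the rule buys(X, “computer”) => buys(X, “software”) over a toy transaction set:

```python
# Minimal sketch: support and confidence for {computer} => {software}.
# The transactions are invented for illustration.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"milk", "bread"},
    {"computer", "printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n               # fraction of all transactions with both items
confidence = both / antecedent   # fraction of computer buyers who also buy software
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

With these four transactions the rule has support 50% and confidence 67%; a real miner would keep it only if both values met the minimum thresholds.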



• Classification and Prediction
– Classification
• The process of finding a model that describes and distinguishes the data classes or
concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown.
• The derived model is based on the analysis of a set of training data (data objects
whose class label is known).
• The model can be represented in classification (IF-THEN) rules, decision trees,
neural networks, etc.
– Prediction
• Predict missing or unavailable numerical data values
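
As a hedged illustration of the classification step above (assuming scikit-learn is available; the features, labels, and tree depth are invented), a decision tree can be derived from training data whose class labels are known and then used to predict unknown labels:

```python
# Minimal sketch: learn a decision-tree classifier from labeled training
# data, print its IF-THEN structure, and predict the class of an object
# whose class label is unknown.
from sklearn.tree import DecisionTreeClassifier, export_text

# training data: [age, income]; labels: 0 = budgetSpender, 1 = bigSpender
X = [[25, 20_000], [30, 28_000], [45, 80_000], [50, 95_000]]
y = [0, 0, 1, 1]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(model, feature_names=["age", "income"]))  # rule-like view
print(model.predict([[40, 70_000]]))  # class for an unseen customer
```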



• Cluster Analysis
– Class label is unknown: group data to form new classes
– Clusters of objects are formed based on the principle of maximizing
intra-class similarity & minimizing interclass similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters may
represent individual target groups for marketing.
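
A minimal clustering sketch (again assuming scikit-learn; the customer attributes are invented) that forms groups without any class labels:

```python
# Minimal sketch: k-means groups unlabeled customers so that records
# within a cluster are similar and records across clusters differ.
from sklearn.cluster import KMeans

customers = [[22, 300], [25, 350], [48, 900], [52, 950], [23, 320]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)           # cluster assignment per customer
print(km.cluster_centers_)  # centroid of each discovered cluster
```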



• Outlier Analysis
– Data that do not comply with the general behavior or model.
– Outliers are usually discarded as noise or exceptions.
– Useful for fraud detection.
• E.g. Detect purchases of extremely large amounts
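
A simple sketch of this idea using only the standard library: flag any purchase more than two standard deviations from the mean (the 2-sigma threshold is an assumption, not from the text):

```python
# Minimal sketch: z-score rule for spotting an extremely large purchase.
import statistics

amounts = [120, 95, 130, 110, 105, 9_500]  # one suspiciously large amount
mu = statistics.mean(amounts)
sigma = statistics.stdev(amounts)

outliers = [a for a in amounts if abs(a - mu) / sigma > 2]
print(outliers)  # -> [9500]
```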

• Evolution Analysis
– Describes and models regularities or trends for objects whose
behavior changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the stocks of
particular companies.



2.5 Are all the patterns interesting?
• Data mining may generate thousands of patterns: Not all of them
are interesting
• A pattern is interesting if it is
– easily understood by humans
– valid on new or test data with some degree of certainty,
– potentially useful
– novel
– validates some hypothesis that a user seeks to confirm
• An interesting pattern represents knowledge!



2.5 Are all the patterns interesting?
• Objective measures
– Based on statistics and structures of patterns, e.g., support, confidence, etc.
(Rules that do not satisfy a threshold are considered uninteresting.)
• Subjective measures
– Reflect the needs and interests of a particular user.
• E.g. A marketing manager is only interested in characteristics of customers who shop
frequently.

– Based on user’s belief in the data.


• e.g., Patterns are interesting if they are unexpected, or can be used for strategic planning,
etc

• Objective and subjective measures need to be combined.



2.5 Are all the patterns interesting?
• Find all the interesting patterns: Completeness
– Unrealistic and inefficient
– User-provided constraints and interestingness measures should be used
• Search for only interesting patterns: An optimization problem
– Highly desirable
– No need to search through the generated patterns to identify truly
interesting ones.
– Measures can be used to rank the discovered patterns according to their interestingness.



2.6 Classification of data mining systems

Data mining systems can be categorized by the disciplines that converge in them: database technology, statistics, machine learning, information science, visualization, and other disciplines.
2.6 Classification of data mining systems
• Database
– Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
2.7 Data Mining Task Primitives
• How to construct a data mining query?
– The primitives allow the user to interactively communicate with the data mining system during discovery, to direct the mining process or examine the findings



2.7 Data Mining Task Primitives
– The primitives specify:

(1) The set of task-relevant data – which portion of the database is to be used
– Database or data warehouse name

– Database tables or data warehouse cubes

– Condition for data selection

– Relevant attributes or dimensions

– Data grouping criteria



2.7 Data Mining Task Primitives
– The primitives specify:

(2) The kind of knowledge to be mined – what DM functions are to be performed


– Characterization
– Discrimination
– Association
– Classification/prediction
– Clustering
– Outlier analysis
– Other data mining tasks



2.7 Data Mining Task Primitives
(3) The background knowledge to be used – what domain knowledge, concept hierarchies, etc.

(4) Interestingness measures and thresholds – support, confidence, etc.

(5) Visualization methods – what form to display the result, e.g. rules, tables, charts, graphs, …



2.7 Data Mining Task Primitives
• DMQL – Data Mining Query Language
– Designed to incorporate these primitives
– Allow user to interact with DM systems
– Providing a standardized language like SQL



An Example Query in DMQL

[Figure: an example DMQL query, with each clause annotated by the task primitive it specifies: (1) task-relevant data, (2) kind of knowledge, (3) background knowledge, (5) visualization methods]
• Automated vs. query-driven?
– Finding all the patterns autonomously in a database?—unrealistic
because the patterns could be too many but uninteresting
• Data mining should be an interactive process
– User directs what to be mined
• Users must be provided with a set of primitives to be used to
communicate with the data mining system
• Incorporating these primitives in a data mining query language
– More flexible user interaction
– Foundation for design of graphical user interface
– Standardization of data mining industry and practice



2.8 Integration of data mining system with a DB and DW System

• No coupling
– Flat file processing, no utilization of any functions of a DB/DW
system
– Not recommended
• Loose coupling
– Fetching data from DB/DW
– Does not explore data structures and query optimization
methods provided by DB/DW system
– Difficult to achieve high scalability and good performance with
large data sets



2.8 Integration of data mining system with a DB and DW System

• Semi-tight coupling
– Efficient implementations of a few essential data mining primitives in
a DB/DW system are provided, e.g., sorting, indexing, aggregation,
histogram analysis, multiway join, precomputation of some statistical functions
– Enhanced DM performance
• Tight coupling
– DM is smoothly integrated into a DB/DW system, mining query is
optimized based on mining query analysis, data structures, indexing,
query processing methods of a DB/DW system
– A uniform information processing environment, highly desirable
2.9 Major issues in data mining

• Mining methodology and User interaction


– Mining different kinds of knowledge
• DM should cover a wide spectrum of data analysis and knowledge discovery tasks
• Enable use of the database in different ways
• Require the development of numerous data mining techniques
– Interactive mining of knowledge at multiple levels of abstraction
• Difficult to know exactly what will be discovered
• Allow users to focus the search, refine data mining requests
– Incorporation of background knowledge
• Guide the discovery process
• Allow discovered patterns to be expressed in concise terms and different levels of
abstraction
– Data mining query languages and ad hoc data mining
• High-level query languages need to be developed
• Should be integrated with a DB/DW query language
2.9 Major issues in data mining

– Presentation and visualization of results


• Knowledge should be easily understood and directly usable
• High level languages, visual representations or other expressive forms
• Require the DM system to adopt the above techniques
– Handling noisy or incomplete data
• Require data cleaning methods and data analysis methods that can handle noise
– Pattern evaluation – the interestingness problem
• How to develop techniques to assess the interestingness of discovered patterns, especially with subjective measures based on user beliefs or expectations



2.9 Major issues in data mining

• Performance Issues
– Efficiency and scalability
• Huge amount of data
• Running time must be predictable and acceptable
– Parallel, distributed and incremental mining algorithms
• Divide the data into partitions and process them in parallel
• Incorporate database updates without having to mine the entire data again from
scratch

• Diversity of Database Types


– Other databases that contain complex data objects, multimedia data,
spatial data, etc.
– Expect to have different DM systems for different kinds of data
– Heterogeneous databases and global information systems
• Web mining becomes a very challenging and fast-evolving field in data mining
Unit-II
— Chapter 2 —

Data Preprocessing



Chapter 2: Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary



Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes
or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
Why Is Data Preprocessing Important?

 No quality data, no quality mining results!


 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
 Data warehouse needs consistent integration of quality
data
 Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse



Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same
or similar analytical results



Forms of Data Preprocessing

[Figure: forms of data preprocessing – data cleaning, data integration, data transformation, and data reduction]


Chapter 2: Data Preprocessing
 Why preprocess the data?

 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
Data Cleaning
 Importance
 “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball
 “Data cleaning is the number one problem in data warehousing”—DCI survey
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
Missing Data: Reasons?
 Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time of
entry
 history or changes of the data were not registered



How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill in it automatically with
 a global constant : e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class:
smarter
 the most probable value: inference-based such as Bayesian formula
or decision tree
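
A minimal sketch of two of these automatic fill-in strategies (assuming pandas; the column names are illustrative):

```python
# Minimal sketch: fill missing values with a global constant for a
# categorical attribute and with the attribute mean for a numeric one.
import pandas as pd

df = pd.DataFrame({
    "occupation": ["engineer", None, "teacher"],
    "income": [52_000, None, 48_000],
})

df["occupation"] = df["occupation"].fillna("unknown")    # global constant
df["income"] = df["income"].fillna(df["income"].mean())  # attribute mean
print(df)
```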
Noisy Data: Reasons?
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning


 duplicate records

 incomplete data

 inconsistent data



How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency) bins
 then one can smooth by bin means, by bin medians, by bin boundaries, etc.


 Regression
 smooth by fitting the data into regression functions

 Clustering
 detect and remove outliers

 Combined computer and human inspection
 detect suspicious values and check by human (e.g., deal with possible outliers)



Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
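
The same example can be reproduced in a few lines of Python (a sketch; the rounding convention for bin means is an assumption):

```python
# Minimal sketch: equal-frequency binning, then smoothing by bin means
# and by bin boundaries, reproducing the price example above.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```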
Regression

[Figure: data smoothing by regression – points are fitted to a line, e.g. y = x + 1, and a value Y1 at X1 is replaced by its fitted value Y1′ on the line]


Cluster Analysis

[Figure: cluster analysis – data points grouped into clusters; values falling outside all clusters may be treated as outliers]


Chapter 2: Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary



Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store



Issues to be considered during Data Integration
 Entity Identification problem:
How can equivalent real-world entities from multiple data sources be matched up?
Example: How can the data analyst or the computer be sure that customer_id in one database and cust_number in another database refer to the same attribute?
 Redundancy problem: An attribute may be redundant if it can be derived from another attribute or set of attributes.
Example: An “Age” attribute can be derived from a “Date-of-birth” attribute
Solution for Entity Identification Problem

 Metadata can be used to solve the Entity Identification Problem.



Solution for Data Redundancy Problem

 Redundant attributes may be detected by correlation analysis.
 For numerical attributes, we can evaluate the correlation between two attributes by using the correlation coefficient measure.
 For categorical attributes, we can evaluate the correlation between two attributes by using the chi-square measure.



Correlation Analysis (Numerical Data)
 Correlation coefficient (also called Pearson's product moment coefficient):

$$ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, \sigma_A and \sigma_B are the respective standard deviations of A and B, and \sum(AB) is the sum of the AB cross-product.
 If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
 r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
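
A minimal Python sketch of this formula, using sample standard deviations to match the (n − 1) term (the data values are invented):

```python
# Minimal sketch: Pearson's correlation coefficient from the formula above.
import statistics

A = [2, 4, 6, 8]
B = [1, 3, 5, 9]

n = len(A)
mean_a, mean_b = statistics.mean(A), statistics.mean(B)
sd_a, sd_b = statistics.stdev(A), statistics.stdev(B)  # (n - 1) denominators

r = sum((a - mean_a) * (b - mean_b) for a, b in zip(A, B)) / ((n - 1) * sd_a * sd_b)
print(round(r, 3))  # close to 1.0: strongly positively correlated
```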
Correlation Analysis (Categorical Data)

 Χ² (chi-square) test:

$$ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} $$

 The larger the Χ² value, the more likely the variables are related



Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

$$ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 $$

 It shows that like_science_fiction and play_chess are correlated in the group
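
The arithmetic can be verified with a short sketch that derives each expected count from the row and column totals:

```python
# Minimal sketch: chi-square statistic for the 2x2 contingency table above.
observed = [[250, 200], [50, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / grand  # e.g. 450*300/1500 = 90
        chi2 += (observed[i][j] - expected) ** 2 / expected
print(round(chi2, 2))  # 507.94 (the slide rounds this to 507.93)
```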
Data Transformation
 Smoothing: remove noise from data
 Aggregation: summarization, data cube construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small, specified
range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones
Data Transformation: Normalization
 Min-max normalization: to [new_min_A, new_max_A]

$$ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A $$

 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
 Z-score normalization (μ: mean, σ: standard deviation):

$$ v' = \frac{v - \mu_A}{\sigma_A} $$

 Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000)/16,000 = 1.225
 Normalization by decimal scaling:

$$ v' = \frac{v}{10^j} \quad \text{where } j \text{ is the smallest integer such that } \max(|v'|) < 1 $$
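
A minimal sketch of the three methods, applied to the income example (the function names are my own):

```python
# Minimal sketch: min-max, z-score, and decimal-scaling normalization.
import math

def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(v, max_abs):
    # smallest j with max_abs / 10**j < 1 (assumes max_abs is not an
    # exact power of ten)
    j = math.ceil(math.log10(max_abs))
    return v / 10 ** j

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
print(decimal_scaling(73_600, 98_000))            # 0.736
```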
Chapter 2: Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary



Data Reduction Strategies

 Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time to run on the complete data set
 Data reduction
 Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
 Data reduction strategies
 Data cube aggregation
 Dimensionality reduction — e.g., remove unimportant attributes


Data Cube Aggregation
 The lowest level of a data cube (base cuboid)
 The aggregated data for an individual entity of interest
 E.g., a customer in a phone calling data warehouse
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to
solve the task
 Queries regarding aggregated information should be
answered using data cube, when possible
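
A small sketch of such a roll-up (assuming pandas; the call-record columns are illustrative): detailed records aggregate to one row per customer per year, a far smaller representation that still answers aggregate queries:

```python
# Minimal sketch: aggregate detailed call records up to a
# (customer, year) level, as a data cube roll-up would.
import pandas as pd

calls = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "B"],
    "year":     [2020, 2020, 2021, 2020, 2021],
    "minutes":  [12, 30, 7, 45, 3],
})

rollup = calls.groupby(["customer", "year"], as_index=False)["minutes"].sum()
print(rollup)  # one row per (customer, year) instead of one per call
```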
