DWDM Unit-2
Syllabus :
Data Mining:
Types of Data,
Data Mining Functionalities,
Interestingness Patterns-Classification of Data Mining systems,
Data Mining Task Primitives,
Integration of a Data Mining System with a Database or a Data Warehouse System,
Major issues in Data Mining
Applications of Data mining
Data Preprocessing:
Data cleaning,
Data integration and data transformation,
Data reduction: data cube aggregation, dimensionality reduction
UNIT-II : Chapter1
• Fraud detection
– Find outliers of unusual transactions
• Financial planning
– Summarize and compare the resources and spending
Data Exploration
Statistical Summary, Querying, and Reporting
• Evolution Analysis
– Describes and models regularities or trends for objects whose
behavior changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the stocks of
particular companies.
[Figure: Data mining as a confluence of multiple disciplines: database technology, statistics, information science, machine learning, visualization, and other disciplines]
2.6 Classification of data mining systems
• Databases to be mined
– Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
2.7 Data Mining Task Primitives
• How to construct a data mining query?
– The primitives allow the user to interactively communicate with the data mining system
(1) The set of task-relevant data – which portion of the database is to be used
– Database or data warehouse name
(5) Visualization methods – the form in which to display the results, e.g., rules
Data Mining: Concepts and Techniques 40
• Automated vs. query-driven?
– Finding all the patterns in a database autonomously? Unrealistic, because
the patterns could be too many yet uninteresting
• Data mining should be an interactive process
– The user directs what is to be mined
• Users must be provided with a set of primitives to be used to
communicate with the data mining system
• Incorporating these primitives in a data mining query language
– More flexible user interaction
– Foundation for design of graphical user interface
– Standardization of data mining industry and practice
2.8 Integration of a Data Mining System with a Database or a Data Warehouse System
• No coupling
– Flat file processing; no use of any functions of a DB/DW
system
– Not recommended
• Loose coupling
– Fetching data from DB/DW
– Does not exploit the data structures and query optimization
methods provided by the DB/DW system
– Difficult to achieve high scalability and good performance with
large data sets
• Semi-tight coupling
– Efficient implementations of a few essential data mining primitives
are provided in the DB/DW system, e.g., sorting, indexing, aggregation,
histogram analysis, multiway join, and precomputation of some statistical
functions
– Enhanced DM performance
• Tight coupling
– DM is smoothly integrated into the DB/DW system; mining queries are
optimized based on mining query analysis, data structures, indexing, and
the query processing methods of the DB/DW system
– A uniform information processing environment; highly desirable
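The contrast between loose coupling and the tighter schemes above can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory SQLite database, the `sales` table, and its rows are hypothetical and not taken from the slides. The first function fetches raw rows and aggregates them in application code (loose coupling); the second lets the database engine perform the aggregation, in the spirit of the semi-tight scheme where primitives such as aggregation run inside the DB/DW system.

```python
import sqlite3
from collections import defaultdict


def make_demo_db():
    """Build a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("tv", 300.0), ("tv", 200.0), ("radio", 50.0)],
    )
    return conn


def totals_loose(conn):
    """Loose coupling: fetch every row, then aggregate outside the DB."""
    totals = defaultdict(float)
    for item, amount in conn.execute("SELECT item, amount FROM sales"):
        totals[item] += amount
    return dict(totals)


def totals_pushed_down(conn):
    """Aggregation performed by the DB engine itself (GROUP BY)."""
    rows = conn.execute("SELECT item, SUM(amount) FROM sales GROUP BY item")
    return dict(rows)
```

Both functions return the same totals; the difference is where the work happens. With large tables, the second form avoids moving every row out of the database and can benefit from the engine's indexing and query optimization.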
2.9 Major issues in data mining
• Performance Issues
– Efficiency and scalability
• Huge amount of data
• Running time must be predictable and acceptable
– Parallel, distributed and incremental mining algorithms
• Divide the data into partitions and process them in parallel
• Incorporate database updates without having to mine the entire data set again from
scratch
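The two ideas above, partitioned processing and incremental updating, can be sketched with simple item counting. The function names and the toy transactions are assumptions made for illustration, and a thread pool stands in for whatever parallel or distributed machinery a real system would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(transactions):
    """Count how many transactions contain each item."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    return counts


def count_in_partitions(transactions, n_parts=2):
    """Split the data into partitions, count each in parallel, then merge."""
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial = list(pool.map(count_items, parts))
    return sum(partial, Counter())


def update_counts(counts, new_transactions):
    """Incremental step: fold in new transactions without a full rescan."""
    return counts + count_items(new_transactions)
```

Because the per-partition counts simply add up, the merged result equals a single pass over all the data, and new transactions can be folded in without revisiting the old ones. Not every mining task decomposes this cleanly; counting is the easy case.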
Data Preprocessing
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
October 16, 2021 54
Data Cleaning
Importance
“Data cleaning is one of the three biggest problems
in data warehousing” (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
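As a small illustration of the first task, the sketch below fills missing values with the mean of the recorded values. The slides do not name a filling strategy at this point, so the choice of the mean, the use of `None` to mark a missing value, and the function name are all assumptions.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the recorded values."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate from
    fill = mean(known)
    return [fill if v is None else v for v in values]
```

For example, `fill_missing_with_mean([10, None, 30])` replaces the missing entry with 20. The mean is sensitive to outliers, so noisy data may need to be handled before this step.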
Missing Data: Reasons?
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
data inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
history or changes of the data not registered
technology limitation
incomplete data
inconsistent data
Clustering
detect and remove outliers
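[Figure: Regression. Data fitted to the line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1'. Axes: x and y]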
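Smoothing by regression, as in the figure where an observed value Y1 is replaced by the value Y1' on a fitted line such as y = x + 1, can be sketched as follows. This is a minimal ordinary least-squares fit for one attribute; the function names are assumptions, and real data may call for a different model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth(xs, ys):
    """Replace each observed y by its fitted value on the line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]
```

If the points already lie on y = x + 1, the fit recovers a slope and intercept of 1 and smoothing leaves them unchanged; noisy points are pulled onto the fitted line.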
Χ² (chi-square) test
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
The larger the Χ² value, the more likely the variables are
related
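The statistic can be computed directly from a contingency table of observed counts. The sketch below assumes the usual convention that the expected count of a cell under independence is (row total × column total) / grand total; the function names and the counts in the usage note are hypothetical, chosen only for illustration.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
```

For the table `[[250, 200], [50, 1000]]` the expected counts are `[[90, 360], [210, 840]]` and the statistic is about 507.94, a large value that points to related attributes. A table whose rows are proportional, such as `[[10, 20], [20, 40]]`, gives 0.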
Data reduction: obtain a reduced representation of the data set that is much
smaller in volume yet produces the same (or almost the
same) analytical results
Data reduction strategies
Data cube aggregation: