Advanced Databases and Mining

UNIT I: Introduction:
 Concepts and Definitions,
 Relational models,
 Data Modeling and Query Languages,
 Database Objects,
 Normalization Techniques: Functional Dependency,
 1NF, 2NF, 3NF, BCNF,
 Multi valued Dependency;
 Loss-less Join and Dependency Preservation.
UNIT II: Transaction Processing:
 Consistency, Atomicity, Isolation and Durability,
 Serializable Schedule, Recoverable Schedule,
 Concurrency Control, Time-stamp based protocols,
 Isolation Levels, Online Analytical Processing,
 Database performance Tuning and Query optimization: Query Tree, Cost of Query, Join,
Selection and Projection Implementation Algorithms and Optimization Database
Security: Access Control, MAC, RBAC, Authorization, SQL Injection Attacks.
UNIT III: Data Mining:
 Stages and techniques, knowledge representation methods,
 data mining approaches (OLAP, DBMS, Statistics and ML).
 Data warehousing: data warehouse and DBMS, multidimensional data model, OLAP
operations.
 Data processing: cleaning, transformation, reduction, filters and discretization with
weka.
UNIT IV: Knowledge representation:
 background knowledge,
 representing input data and output knowledge,
 visualization techniques and experiments with weka.
 Data mining algorithms: association rules,
 mining weather data,
 generating item sets and rules efficiently,
 correlation analysis.
UNIT V: Classification & Clustering:


 1R algorithm, decision trees, covering rules, task prediction,
 statistical classification, Bayesian network, instance based methods,
 linear models, Cluster/2, Cobweb, k-means,
 Hierarchical methods.
 Mining real data: preprocessing data from a real medical domain,
 Data mining techniques to create a comprehensive and accurate model of data.
 Advanced topics: text mining, text classification, web mining,
 data mining software.
UNIT-1

Database Concepts and Definitions:

A database system is a computer-based record-keeping system. A collection of data, commonly called a database, contains information about a particular enterprise. It maintains any information that may be necessary to the decision-making process involved in the management of that organization. It can also be defined as a collection of interrelated data stored together to serve multiple applications; the data is stored so that it is independent of the programs that use it. A generic and controlled approach is used to add new data and to modify and retrieve existing data within the database. The data is structured so as to provide a basis for future application development.

Purpose of Database:
 The intent of a database is that a collection of data should serve as many applications
as possible. Therefore, a database is often thought of as a repository of the information
needed to run certain functions in a corporation or organization. It permits not only
the retrieval of data but also its continuous modification as needed for the control of
operations. It may be possible to search the database to obtain answers to questions or
information for planning purposes.
 In a typical file-processing system, permanent records are stored in different files.
Many different application programs are written to extract the records and add the
records to the appropriate files. But this scheme has several major limitations and
disadvantages, such as data redundancy (duplication of data), data inconsistency,
maladaptive data, non-standard data, insecure data, incorrect data, etc. A database
management system is an answer to all these problems as it provides centralized
control of the data.
 Database Abstraction
A major purpose of a database is to provide users with only as much information as
they require. This means that the system does not disclose all the details of the data;
rather, it hides some details of how the data is stored and maintained. The complexity
of the database is hidden from users and, where necessary, organized through multiple
levels of abstraction to facilitate their interaction with the system. The different levels
of the database are implemented through three layers:
1. Internal Level (Physical Level): The lowest level of abstraction, the internal level, is closest
to physical storage. It describes how the data is concretely stored on the storage medium.
2. Conceptual Level: This level of abstraction describes what data is concretely stored in the
database. It also describes the relationships that exist between the data. At this level,
databases are described logically in terms of simple data structures. Users at this level are
not concerned with how these logical data structures will be implemented at the physical
level.
3. External Level (View Level): It is the level closest to users and is related to the way the
data is viewed by individual users.

Since a database can be viewed through three levels of abstraction, a change at one
level can affect the other levels. As a database continues to grow, it may undergo
frequent changes. These changes should not force a redesign and re-implementation of
the database; it is in this context that the concept of data independence proves
beneficial.
Concept of Database
To store and manage data efficiently in the database let us understand some key terms:
1. Database Schema: It is the design of the database, or the skeleton of the database that
represents its structure, the types of data that will be stored in the rows and columns, the
constraints, and the relationships between the tables.
2. Data Constraints: Sometimes we restrict the type of data that can be stored in one or
more columns of a table; this is done by using constraints. Constraints are defined while
creating a table.
3. Data dictionary or Metadata: Metadata is the data about the data. The database schema,
along with the different types of constraints on the data, is stored by the DBMS in a data
dictionary and is known as metadata.
4. Database instance: A database instance defines the complete database environment and
its components; that is, it is the set of memory structures and background processes that are
used to access the database files.
5. Query: A query is used to access data from the database, so users write queries to
retrieve or manipulate data.
6. Data manipulation: In a database, data can be manipulated using three main operations:
insertion, deletion, and updation.
7. Data Engine: It is an underlying component that is used to create and manage various
database queries.
Advantages of Database:
Let us consider some of the benefits provided by a database system and see how a database
system overcomes the above-mentioned problems:
1. The database reduces data redundancy to a great extent.
2. The database can control data inconsistency to a great extent.
3. The database facilitates sharing of data.
4. The database enforces standards.
5. The database can ensure data security.
6. Integrity can be maintained through databases.
Therefore, for systems with better performance and efficiency, database systems are
preferred.
Disadvantages of Database:
With the complex tasks to be performed by the database system, certain drawbacks may
arise, which can be termed the disadvantages of using a database system. These are:
1. Security may be compromised without good controls.
2. Integrity may be compromised without good controls.
3. Extra hardware may be required.
4. Performance overhead may be significant.
5. The system is likely to be complex.

Relational models:
Relational Model was proposed by E. F. Codd to model data in the form of relations or tables.
After designing the conceptual model of a database using an ER diagram, we need to convert
the conceptual model into the relational model, which can be implemented using any RDBMS
language such as Oracle SQL or MySQL. So let us see what the Relational Model is.
What is Relational Model?
Relational Model represents how data is stored in Relational Databases. A relational
database stores data in the form of relations (tables). Consider a relation STUDENT with
attributes ROLL_NO, NAME, ADDRESS, PHONE and AGE shown in Table 1.
STUDENT

ROLL_NO  NAME    ADDRESS  PHONE       AGE
1        RAM     DELHI    9455123451  18
2        RAMESH  GURGAON  9652431543  18
3        SUJIT   ROHTAK   9156253131  20
4        SURESH  DELHI                18
IMPORTANT TERMINOLOGIES
 Attribute: Attributes are the properties that define a relation. e.g.; ROLL_NO, NAME
 Relation Schema: A relation schema represents name of the relation with its attributes.
e.g.; STUDENT (ROLL_NO, NAME, ADDRESS, PHONE and AGE) is relation schema for
STUDENT. If a schema has more than 1 relation, it is called Relational Schema.
 Tuple: Each row in the relation is known as tuple. The above relation contains 4 tuples,
one of which is shown as:
1  RAM  DELHI  9455123451  18
 Relation Instance: The set of tuples of a relation at a particular instance of time is called
as relation instance. Table 1 shows the relation instance of STUDENT at a particular time.
It can change whenever there is insertion, deletion or updation in the database.
 Degree: The number of attributes in the relation is known as degree of the relation.
The STUDENT relation defined above has degree 5.
 Cardinality: The number of tuples in a relation is known as cardinality.
The STUDENT relation defined above has cardinality 4.
 Column: Column represents the set of values for a particular attribute. The
column ROLL_NO is extracted from relation STUDENT.
ROLL_NO
1
2
3
4
 NULL Values: The value which is not known or unavailable is called NULL value. It is
represented by blank space. e.g.; PHONE of STUDENT having ROLL_NO 4 is NULL.

Data Modeling:
Data modeling (data modelling) is the process of creating a data model for the data to be
stored in a database. This data model is a conceptual representation of Data objects, the
associations between different data objects, and the rules.

Data modeling helps in the visual representation of data and enforces business rules,
regulatory compliance, and government policies on the data. Data models ensure consistency
in naming conventions, default values, semantics, and security, while also ensuring the quality
of the data.

Data Models in DBMS


The Data Model is defined as an abstract model that organizes data description, data
semantics, and consistency constraints of data. The data model emphasizes what data is
needed and how it should be organized, rather than what operations will be performed on the
data. A data model is like an architect’s building plan: it helps to build conceptual models and
to set relationships between data items.
The two types of Data Modeling Techniques are

1. Entity Relationship (E-R) Model


2. UML (Unified Modeling Language)

Why use Data Model?


The primary goals of using a data model are:

 Ensures that all data objects required by the database are accurately represented.
Omission of data will lead to creation of faulty reports and produce incorrect results.
 A data model helps design the database at the conceptual, physical and logical levels.
 Data Model structure helps to define the relational tables, primary and foreign keys and
stored procedures.
 It provides a clear picture of the base data and can be used by database developers to
create a physical database.
 It is also helpful to identify missing and redundant data.
 Though the initial creation of a data model is labor- and time-intensive, in the long run it
makes upgrading and maintaining your IT infrastructure cheaper and faster.

Types of Data Models in DBMS


Types of Data Models: There are mainly three different types of data models: conceptual data
models, logical data models, and physical data models, and each one has a specific purpose.
The data models are used to represent the data and how it is stored in the database and to set
the relationship between data items.

1. Conceptual Data Model: This Data Model defines WHAT the system contains. This
model is typically created by Business stakeholders and Data Architects. The purpose is
to organize, scope and define business concepts and rules.
2. Logical Data Model: Defines HOW the system should be implemented regardless of the
DBMS. This model is typically created by Data Architects and Business Analysts. The
purpose is to develop a technical map of rules and data structures.
3. Physical Data Model: This Data Model describes HOW the system will be implemented
using a specific DBMS system. This model is typically created by DBAs and developers.
The purpose is the actual implementation of the database.
Conceptual Data Model


A Conceptual Data Model is an organized view of database concepts and their relationships.
The purpose of creating a conceptual data model is to establish entities, their attributes, and
relationships. In this data modeling level, there is hardly any detail available on the actual
database structure. Business stakeholders and data architects typically create a conceptual
data model.

The three basic tenets of the Conceptual Data Model are

 Entity: A real-world thing


 Attribute: Characteristics or properties of an entity
 Relationship: Dependency or association between two entities

Data model example:

 Customer and Product are two entities. Customer number and name are attributes of
the Customer entity
 Product name and price are attributes of product entity
 Sale is the relationship between the customer and product


Characteristics of a conceptual data model

 Offers Organisation-wide coverage of the business concepts.


 This type of data model is designed and developed for a business audience.
 The conceptual model is developed independently of hardware specifications like data
storage capacity, location or software specifications like DBMS vendor and technology.
The focus is to represent data as a user will see it in the “real world.”
Logical Data Model:


 The Logical Data Model is used to define the structure of data elements and to
set relationships between them. The logical data model adds further information
to the conceptual data model elements. The advantage of using a Logical data
model is to provide a foundation to form the base for the Physical model.
However, the modeling structure remains generic.


 At this data modeling level, no primary or secondary key is defined. At this level, you also
need to verify and adjust the connector details that were set earlier for relationships.
Characteristics of a Logical data model

 Describes data needs for a single project but could integrate with other logical data models
based on the scope of the project.
 Designed and developed independently from the DBMS.
 Data attributes will have datatypes with exact precisions and lengths.
 Normalization is typically applied to the model up to 3NF.

Physical Data Model:


 A Physical Data Model describes a database-specific implementation of the data model. It
offers database abstraction and helps generate the schema. This is because of the richness
of meta-data offered by a Physical Data Model. The physical data model also helps in
visualizing database structure by replicating database column keys, constraints, indexes,
triggers, and other RDBMS features.

Characteristics of a physical data model:

 The physical data model describes data needs for a single project or application, though it
may be integrated with other physical data models based on project scope.
 The data model contains relationships between tables, addressing the cardinality and
nullability of the relationships.
 Developed for a specific version of a DBMS, location, data storage or technology to be
used in the project.
 Columns should have exact datatypes, lengths assigned and default values.
 Primary and Foreign keys, views, indexes, access profiles, and authorizations, etc. are
defined.

Query Languages:

A query is a question or a request for information. A query language is a language that is used
to retrieve information from a database.
Query languages are divided into two types as follows −
 Procedural language
 Non-procedural language

Procedural language:

Information is retrieved from the database by specifying the sequence of operations to be


performed.
For Example: Relational algebra
Structured Query Language (SQL) is based on relational algebra.
Relational algebra consists of a set of operations that take one or two relations as an input
and produces a new relation as output.
The different types of relational algebra operations are −
 Select operation
 Project operation
 Rename operation
 Union operation
 Intersection operation
 Difference operation
 Cartesian product operation
 Join operation
 Division operation.
Select, project, and rename come under unary operations (they operate on one table). Union,
intersection, difference, Cartesian product, join, and division come under binary operations
(they operate on two tables).
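
To make these operations concrete, here is a minimal, illustrative sketch in Python that models relations as lists of dictionaries and implements selection, projection and natural join; the STUDENT and MARKS sample data and the helper names are assumptions made only for this example, not part of any DBMS API.

# Relational algebra sketch: relations are modelled as lists of dictionaries.

def select(relation, predicate):
    # Selection: keep only the tuples that satisfy the predicate.
    return [t for t in relation if predicate(t)]

def project(relation, attributes):
    # Projection: keep only the named attributes and drop duplicate tuples.
    seen, result = set(), []
    for t in relation:
        row = tuple((a, t[a]) for a in attributes)
        if row not in seen:
            seen.add(row)
            result.append(dict(row))
    return result

def natural_join(r1, r2):
    # Join: combine tuples that agree on all attributes common to both relations.
    common = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    return [{**t1, **t2} for t1 in r1 for t2 in r2
            if all(t1[a] == t2[a] for a in common)]

STUDENT = [
    {"ROLL_NO": 1, "NAME": "RAM", "ADDRESS": "DELHI", "AGE": 18},
    {"ROLL_NO": 2, "NAME": "RAMESH", "ADDRESS": "GURGAON", "AGE": 18},
    {"ROLL_NO": 3, "NAME": "SUJIT", "ADDRESS": "ROHTAK", "AGE": 20},
]
MARKS = [{"ROLL_NO": 1, "MARKS": 78}, {"ROLL_NO": 3, "MARKS": 65}]

print(select(STUDENT, lambda t: t["ADDRESS"] == "DELHI"))  # unary: selection
print(project(STUDENT, ["ROLL_NO", "NAME"]))               # unary: projection
print(natural_join(STUDENT, MARKS))                        # binary: natural join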

Non-Procedural language:

Information is retrieved from the database without specifying the sequence of operations to
be performed; users only specify what information is to be retrieved.
For Example: Relational Calculus
Query by Example (QBE) is based on relational calculus.
Relational calculus is a non-procedural query language in which information is retrieved from
the database without specifying the sequence of operations to be performed.
Relational calculus is of two types which are as follows −
 Tuple calculus
 Domain calculus
Database Objects:
An OODBMS, which is an abbreviation for object-oriented database management system, uses
a data model in which data is stored in the form of objects, which are instances of classes.
These classes and objects together make up the object-oriented data model.
Components of Object-Oriented Data Model:
The OODBMS is based on three major components, namely: Object structure, Object classes,
and Object identity. These are explained below.
1. Object Structure:
The structure of an object refers to the properties that an object is made up of. These
properties of an object are referred to as attributes. Thus, an object is a real-world entity
with certain attributes that make up the object structure. Also, an object encapsulates the
data and code into a single unit, which in turn provides data abstraction by hiding the
implementation details from the user.
The object structure is further composed of three types of components: Messages, Methods,
and Variables. These are explained below.
1. Messages –
A message provides an interface or acts as a communication medium between an object
and the outside world. A message can be of two types:
 Read-only message: If the invoked method does not change the value of a variable,
then the invoking message is said to be a read-only message.
 Update message: If the invoked method changes the value of a variable, then the
invoking message is said to be an update message.
2. Methods –
When a message is passed then the body of code that is executed is known as a
method. Whenever a method is executed, it returns a value as output. A method can
be of two types:
 Read-only method: When the value of a variable is not affected by a method,
then it is known as the read-only method.
 Update method: When the value of a variable is changed by a method, then it is
known as an update method.

3. Variables –
They store the data of an object. The data stored in the variables makes one object
distinguishable from another.

2. Object Classes:
An object, which is a real-world entity, is an instance of a class. Hence we first need to
define the class, and then the objects are made, which differ in the values they store but
share the same class definition. The objects in turn correspond to the various messages
and variables stored in them.

Example –
#include <string>

class CLERK
{
    // Variables
    std::string name;
    std::string address;
    int id;
    int salary;

public:
    // Messages
    std::string get_name();
    std::string get_address();
    int annual_salary();
};

In the above example, we can see that CLERK is a class that holds the object variables and
messages.
An OODBMS also supports inheritance in an extensive manner as in a database there may be


many classes with similar methods, variables and messages. Thus, the concept of the class
hierarchy is maintained to depict the similarities among various classes.
The concept of encapsulation that is the data or information hiding is also supported by an
object-oriented data model. And this data model also provides the facility of abstract data
types apart from the built-in data types like char, int, float. ADT’s are the user-defined data
types that hold the values within them and can also have methods attached to them.
Thus, OODBMS provides numerous facilities to its users, both built-in and user-defined. It
incorporates the properties of an object-oriented data model with a database management
system, and supports the concept of programming paradigms like classes and objects along
with the support for other concepts like encapsulation, inheritance, and the user-defined
ADT’s (abstract data types).

Object-Oriented Database Examples

There are different kinds of implementations of object databases. Most contain the following
features:

Feature                  Description
Query Language           Language to find objects and retrieve data from the database.
Transparent Persistence  Ability to use an object-oriented programming language for data manipulation.
ACID Transactions        ACID transactions guarantee that all transactions complete without conflicting changes.
Database Caching         Creates a partial replica of the database, allowing access to the database from program memory instead of a disk.
Recovery                 Disaster recovery in case of application or system failure.
Functional Dependency:
A functional dependency is a constraint that specifies the relationship between two sets of
attributes, where one set can accurately determine the value of the other set. It is denoted as
X → Y, where X is a set of attributes that is capable of determining the value of Y. The attribute
set on the left side of the arrow, X, is called the determinant, while Y, on the right side, is called
the dependent. Functional dependencies are used to mathematically express relations among
database entities and are very important for understanding advanced concepts in relational
database systems and for solving problems in competitive exams like GATE.
Example:
roll_no  name  dept_name  dept_building
42       abc   CO         A4
43       pqr   IT         A3
44       xyz   CO         A4
45       xyz   IT         A3
46       mno   EC         B2
47       jkl   ME         B2
From the above table we can conclude some valid functional dependencies:
 roll_no → {name, dept_name, dept_building}: roll_no can determine the values of the
fields name, dept_name and dept_building, hence this is a valid functional dependency.
 roll_no → dept_name: since roll_no can determine the whole set {name, dept_name,
dept_building}, it can also determine its subset dept_name.
 dept_name → dept_building: dept_name can identify the dept_building accurately, since
departments with different dept_name values will also have different dept_building values.
 More valid functional dependencies: roll_no → name, {roll_no, name} → {dept_name,
dept_building}, etc.
Here are some invalid functional dependencies:
 name → dept_name: students with the same name can have different dept_name values,
hence this is not a valid functional dependency.
 dept_building → dept_name: there can be multiple departments in the same building.
For example, in the above table the departments ME and EC are in the same building B2,
hence dept_building → dept_name is an invalid functional dependency.
 More invalid functional dependencies: name → roll_no, {name, dept_name} → roll_no,
dept_building → roll_no, etc.
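
A candidate functional dependency can be checked mechanically against a relation instance: X → Y is violated if two tuples agree on X but differ on Y. Below is a minimal sketch in Python, assuming the department table above is held as a list of dictionaries; note that an instance can only refute a dependency, it cannot prove that the dependency holds in general.

# Check whether the FD X -> Y holds in a relation instance (a list of dicts).
# A violation (two tuples agreeing on X but differing on Y) disproves the FD.

def fd_holds(relation, X, Y):
    seen = {}
    for t in relation:
        key = tuple(t[a] for a in X)
        val = tuple(t[a] for a in Y)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

rows = [
    {"roll_no": 42, "name": "abc", "dept_name": "CO", "dept_building": "A4"},
    {"roll_no": 43, "name": "pqr", "dept_name": "IT", "dept_building": "A3"},
    {"roll_no": 44, "name": "xyz", "dept_name": "CO", "dept_building": "A4"},
    {"roll_no": 45, "name": "xyz", "dept_name": "IT", "dept_building": "A3"},
    {"roll_no": 46, "name": "mno", "dept_name": "EC", "dept_building": "B2"},
    {"roll_no": 47, "name": "jkl", "dept_name": "ME", "dept_building": "B2"},
]

print(fd_holds(rows, ["roll_no"], ["dept_name"]))        # True: valid in this instance
print(fd_holds(rows, ["dept_building"], ["dept_name"]))  # False: B2 maps to EC and ME
print(fd_holds(rows, ["name"], ["dept_name"]))           # False: xyz maps to CO and IT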
Armstrong’s axioms/properties of functional dependencies:
1. Reflexivity: If Y is a subset of X, then X → Y holds by the reflexivity rule.
For example, {roll_no, name} → name is valid.
2. Augmentation: If X → Y is a valid dependency, then XZ → YZ is also valid by the
augmentation rule.
For example, if {roll_no, name} → dept_building is valid, then {roll_no, name,
dept_name} → {dept_building, dept_name} is also valid.
3. Transitivity: If X → Y and Y → Z are both valid dependencies, then X → Z is also valid by the
transitivity rule.
For example, if roll_no → dept_name and dept_name → dept_building, then roll_no →
dept_building is also valid.
Types of Functional dependencies in DBMS:
1. Trivial functional dependency
2. Non-Trivial functional dependency
3. Multivalued functional dependency
4. Transitive functional dependency
1. Trivial Functional Dependency
In Trivial Functional Dependency, a dependent is always a subset of the determinant.
i.e. If X → Y and Y is the subset of X, then it is called trivial functional dependency
For example,
roll_no name age

42 abc 17

43 pqr 18

44 xyz 18

Here, {roll_no, name} → name is a trivial functional dependency, since the dependent name is
a subset of the determinant set {roll_no, name}.
Similarly, roll_no → roll_no is also an example of a trivial functional dependency.
2. Non-trivial Functional Dependency


In Non-trivial functional dependency, the dependent is strictly not a subset of the
determinant.
i.e. If X → Y and Y is not a subset of X, then it is called Non-trivial functional dependency.
For example,
roll_no name age

42 abc 17

43 pqr 18

44 xyz 18

Here, roll_no → name is a non-trivial functional dependency, since the dependent name is not
a subset of the determinant roll_no.
Similarly, {roll_no, name} → age is also a non-trivial functional dependency, since age is not a
subset of {roll_no, name}.
3. Multivalued Functional Dependency
In Multivalued functional dependency, entities of the dependent set are not dependent on
each other.
i.e. If a → {b, c} and there exists no functional dependency between b and c, then it is called
a multivalued functional dependency.

For example,

roll_no name age

42 abc 17

43 pqr 18

44 xyz 18

45 abc 19
Here, roll_no → {name, age} is a multivalued functional dependency, since the dependents
name and age are not dependent on each other (i.e. neither name → age nor age → name
exists).
4. Transitive Functional Dependency
In a transitive functional dependency, the dependent is indirectly dependent on the
determinant; i.e., if a → b and b → c, then by the axiom of transitivity, a → c. This is a
transitive functional dependency.
For example,
enrol_no name dept building_no

42 abc CO 4

43 pqr EC 2

44 xyz IT 1

45 abc EC 2

Here, enrol_no → dept and dept → building_no.
Hence, according to the axiom of transitivity, enrol_no → building_no is a valid functional
dependency. This is an indirect functional dependency, hence called a transitive functional
dependency.
Normal Forms in DBMS:
Normalization is the process of minimizing redundancy from a relation or set of relations.
Redundancy in relation may cause insertion, deletion, and update anomalies. So, it helps to
minimize the redundancy in relations. Normal forms are used to eliminate or reduce
redundancy in database tables.

1. First Normal Form –

A relation is in first normal form if it does not contain any composite or multi-valued
attribute; if a relation contains a composite or multi-valued attribute, it violates first normal
form. In other words, a relation is in first normal form if every attribute in that relation is a
single-valued attribute.
 Example 1 – Relation STUDENT in Table 1 is not in 1NF because of the multi-valued attribute
STUD_PHONE. Its decomposition into 1NF is shown in Table 2.
Example 2 –

ID Name Courses
------------------
1 A c1, c2
2 E c3
3 M C2, c3
In the above table Course is a multi-valued attribute so it is not in 1NF.
Below Table is in 1NF as there is no multi-valued attribute
ID Name Course
------------------
1 A c1
1 A c2
2 E c3
3 M c2
3 M c3
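
The conversion shown above simply repeats the single-valued attributes once for each value of the multi-valued attribute. Below is a minimal sketch in Python, assuming the course list is stored as a comma-separated string; the variable names are illustrative.

# Convert the non-1NF table (multi-valued Courses column) into 1NF rows.

unnormalised = [
    {"ID": 1, "Name": "A", "Courses": "c1, c2"},
    {"ID": 2, "Name": "E", "Courses": "c3"},
    {"ID": 3, "Name": "M", "Courses": "c2, c3"},
]

first_nf = [
    {"ID": row["ID"], "Name": row["Name"], "Course": course.strip()}
    for row in unnormalised
    for course in row["Courses"].split(",")
]

for row in first_nf:
    print(row)   # one single-valued Course per row, matching the 1NF table above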

Second Normal Form:

To be in second normal form, a relation must be in first normal form and must not
contain any partial dependency. A relation is in 2NF if it has no partial dependency, i.e., no
non-prime attribute (attributes which are not part of any candidate key) is dependent on any
proper subset of any candidate key of the table.
Partial Dependency – If the proper subset of candidate key determines non-prime attribute,
it is called partial dependency.

 Example 1 – Consider Table 3, shown below.


STUD_NO  COURSE_NO  COURSE_FEE
1        C1         1000
2        C2         1500
1        C4         2000
4        C3         1000
4        C1         1000
2        C5         2000
{Note that, there are many courses having the same course fee. }
Here,
COURSE_FEE cannot alone decide the value of COURSE_NO or STUD_NO;
COURSE_FEE together with STUD_NO cannot decide the value of COURSE_NO;
COURSE_FEE together with COURSE_NO cannot decide the value of STUD_NO;
Hence,
COURSE_FEE would be a non-prime attribute, as it does not belong to the only candidate
key {STUD_NO, COURSE_NO};
But, COURSE_NO -> COURSE_FEE, i.e., COURSE_FEE is dependent on COURSE_NO, which
is a proper subset of the candidate key. Non-prime attribute COURSE_FEE is dependent
on a proper subset of the candidate key, which is a partial dependency and so this
relation is not in 2NF.
To convert the above relation to 2NF,
we need to split the table into two tables such as :
Table 1: STUD_NO, COURSE_NO
Table 2: COURSE_NO, COURSE_FEE
Table 1:
STUD_NO  COURSE_NO
1        C1
2        C2
1        C4
4        C3
4        C1
2        C5

Table 2:
COURSE_NO  COURSE_FEE
C1         1000
C2         1500
C3         1000
C4         2000
C5         2000
NOTE: 2NF tries to reduce the redundant data getting stored in memory. For instance, if
there are 100 students taking the C1 course, we don’t need to store its fee as 1000 for all
100 records; instead, we can store the course fee for C1 as 1000 once, in the second table.
 Example 2 – Consider following functional dependencies in relation R (A, B , C, D )
 AB -> C [A and B together determine C]
BC ->D [B and C together determine D]
In the above relation, AB is the only candidate key and there is no partial dependency,
i.e., any proper subset of AB doesn’t determine any non-prime attribute.

Third Normal Form:

A relation is in third normal form if it is in second normal form and there is no transitive
dependency for non-prime attributes.
A relation is in 3NF if at least one of the following conditions holds in every non-trivial
functional dependency X -> Y:
1. X is a super key.
2. Y is a prime attribute (each element of Y is part of some candidate key).

Transitive dependency – If A->B and B->C are two FDs then A->C is called transitive
dependency.
 Example 1 – In relation STUDENT given in Table 4,
FD set: {STUD_NO -> STUD_NAME, STUD_NO -> STUD_STATE, STUD_STATE ->
STUD_COUNTRY, STUD_NO -> STUD_AGE}
Candidate Key: {STUD_NO}
For this relation in table 4, STUD_NO -> STUD_STATE and STUD_STATE ->
STUD_COUNTRY are true. So STUD_COUNTRY is transitively dependent on
STUD_NO. It violates the third normal form. To convert it in third normal form, we
will decompose the relation STUDENT (STUD_NO, STUD_NAME, STUD_PHONE,
STUD_STATE, STUD_COUNTRY, STUD_AGE) as:
STUDENT (STUD_NO, STUD_NAME, STUD_PHONE, STUD_STATE, STUD_AGE)
STATE_COUNTRY (STATE, COUNTRY)
 Example 2 – Consider relation R(A, B, C, D, E)


A -> BC,
CD -> E,
B -> D,
E -> A
All possible candidate keys in the above relation are {A, E, CD, BC}. All attributes on the
right-hand sides of the functional dependencies are prime, so the relation is in 3NF.

4. Boyce-Codd Normal Form (BCNF) :

A relation R is in BCNF if R is in Third Normal Form and for every FD, LHS is super key. A
relation is in BCNF iff in every non-trivial functional dependency X –> Y, X is a super
key.
 Example 1 – Find the highest normal form of a relation R(A,B,C,D,E) with FD
set as {BC->D, AC->BE, B->E}
Step 1. As we can see, (AC)+ ={A,C,B,E,D} but none of its subset can
determine all attribute of relation, So AC will be candidate key. A or C can’t
be derived from any other attribute of the relation, so there will be only 1
candidate key {AC}.
Step 2. Prime attributes are those attributes that are part of candidate key
{A, C} in this example and others will be non-prime {B, D, E} in this example.
Step 3. The relation R is in 1st normal form as a relational DBMS does not
allow multi-valued or composite attribute.
The relation is in 2nd normal form because BC->D is in 2nd normal form (BC is
not a proper subset of candidate key AC) and AC->BE is in 2nd normal form
(AC is candidate key) and B->E is in 2nd normal form (B is not a proper subset
of candidate key AC).
The relation is not in 3rd normal form because in BC->D (neither BC is a super
key nor D is a prime attribute) and in B->E (neither B is a super key nor E is a
prime attribute), whereas to satisfy 3rd normal form, either the LHS of an FD
should be a super key or the RHS should be a prime attribute.
So the highest normal form of relation will be 2nd Normal form.
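
The closure computation used in Step 1, (AC)+ = {A, C, B, E, D}, can be automated with the standard attribute-closure algorithm. Below is a minimal sketch in Python, using the FD set of Example 1; single-letter attribute names are assumed so that attribute sets can be written as strings.

# Attribute closure: repeatedly add the right-hand side of any FD whose
# left-hand side is already contained in the closure.

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

fds = [("BC", "D"), ("AC", "BE"), ("B", "E")]   # FD set of Example 1
print(closure("AC", fds))   # {'A', 'C', 'B', 'E', 'D'}: AC determines every attribute
print(closure("B", fds))    # {'B', 'E'}: B alone is not a candidate key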
 Example 2 – Consider relation R(A, B, C) with
A -> BC,
B -> A
A and B are both super keys, so the above relation is in BCNF.
Key Points –
1. BCNF is free from redundancy.
2. If a relation is in BCNF, then 3NF is also satisfied.
3. If all attributes of a relation are prime attributes, then the relation is always in 3NF.
4. A relation in a relational database is always and at least in 1NF form.
5. Every binary relation (a relation with only 2 attributes) is always in BCNF.
6. If a relation has only singleton candidate keys (i.e. every candidate key consists of
only 1 attribute), then the relation is always in 2NF (because no partial functional
dependency is possible).
7. Sometimes going for the BCNF form may not preserve functional dependency. In that
case, go for BCNF only if the lost FD(s) is not required; else normalize till 3NF only.
8. There are many more normal forms that exist after BCNF, like 4NF and more, but in
real-world database systems it is generally not required to go beyond BCNF.
Multi valued Dependency:

o Multivalued dependency occurs when two attributes in a table are independent of each
other but, both depend on a third attribute.
o A multivalued dependency consists of at least two attributes that are dependent on a
third attribute that's why it always requires at least three attributes.
o Example: Suppose there is a bike manufacturer company which produces two
colors (white and black) of each model every year.

BIKE_MODEL  MANUF_YEAR  COLOR
M2001       2008        White
M2001       2008        Black
M3001       2013        White
M3001       2013        Black
M4006       2017        White
M4006       2017        Black

Here columns COLOR and MANUF_YEAR are dependent on BIKE_MODEL and independent of
each other.

In this case, these two columns are said to be multivalued dependent on BIKE_MODEL. The
representation of these dependencies is shown below:

1. BIKE_MODEL →→ MANUF_YEAR
2. BIKE_MODEL →→ COLOR
This can be read as "BIKE_MODEL multidetermines MANUF_YEAR" and "BIKE_MODEL
multidetermines COLOR".
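
A multivalued dependency X →→ Y (with Z the remaining attributes) requires that, for every value of X, the Y values and Z values that appear with it combine freely. Below is a minimal sketch in Python that checks this cross-product condition on rows of the bike table; the encoding of rows as dictionaries is an assumption made for illustration.

from itertools import product

# Check X ->-> Y on a relation instance: for each X value, the (Y, Z) pairs
# actually present must equal the cross product of the Y values and Z values
# seen with that X value (Z = all remaining attributes).

def mvd_holds(relation, X, Y):
    Z = [a for a in relation[0] if a not in X and a not in Y]
    groups = {}
    for t in relation:
        groups.setdefault(tuple(t[a] for a in X), []).append(t)
    for rows in groups.values():
        ys = {tuple(t[a] for a in Y) for t in rows}
        zs = {tuple(t[a] for a in Z) for t in rows}
        present = {(tuple(t[a] for a in Y), tuple(t[a] for a in Z)) for t in rows}
        if present != set(product(ys, zs)):
            return False
    return True

bikes = [
    {"BIKE_MODEL": "M3001", "MANUF_YEAR": 2013, "COLOR": "White"},
    {"BIKE_MODEL": "M3001", "MANUF_YEAR": 2013, "COLOR": "Black"},
    {"BIKE_MODEL": "M4006", "MANUF_YEAR": 2017, "COLOR": "White"},
    {"BIKE_MODEL": "M4006", "MANUF_YEAR": 2017, "COLOR": "Black"},
]

print(mvd_holds(bikes, ["BIKE_MODEL"], ["COLOR"]))       # True: BIKE_MODEL ->-> COLOR
print(mvd_holds(bikes, ["BIKE_MODEL"], ["MANUF_YEAR"]))  # True: BIKE_MODEL ->-> MANUF_YEAR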

Loss-less Join and Dependency Preservation:

Decomposition of a relation is done when a relation in relational model is not in appropriate


normal form. Relation R is decomposed into two or more relations if decomposition is
lossless join as well as dependency preserving.
Lossless Join Decomposition
If we decompose a relation R into relations R1 and R2,
 Decomposition is lossy if R1 ⋈ R2 ⊃ R
 Decomposition is lossless if R1 ⋈ R2 = R
To check for lossless join decomposition using FD set, following conditions must hold:
1. Union of Attributes of R1 and R2 must be equal to attribute of R. Each attribute of R must
be either in R1 or in R2.
Att(R1) U Att(R2) = Att(R)
2. Intersection of Attributes of R1 and R2 must not be NULL.
Att(R1) ∩ Att(R2) ≠ Φ
3. Common attribute must be a key for at least one relation (R1 or R2)
Att(R1) ∩ Att(R2) -> Att(R1) or Att(R1) ∩ Att(R2) -> Att(R2)
For Example, a relation R(A, B, C, D) with FD set {A->BC} is decomposed into R1(ABC) and
R2(AD), which is a lossless join decomposition as:
1. First condition holds true as Att(R1) U Att(R2) = (ABC) U (AD) = (ABCD) = Att(R).
2. Second condition holds true as Att(R1) ∩ Att(R2) = (ABC) ∩ (AD) ≠ Φ
3. Third condition holds true as Att(R1) ∩ Att(R2) = A is a key of R1(ABC) because A->BC is
given.
Dependency Preserving Decomposition
If we decompose a relation R into relations R1 and R2, All dependencies of R either must be a
part of R1 or R2 or must be derivable from combination of FD’s of R1 and R2.
For Example, a relation R(A, B, C, D) with FD set {A->BC} is decomposed into R1(ABC) and
R2(AD), which is dependency preserving because the FD A->BC is a part of R1(ABC).
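
Both checks can be carried out with attribute closures. Below is a minimal sketch in Python for the R(A, B, C, D), {A -> BC} example above, assuming single-letter attribute names; it reuses a closure helper like the one sketched earlier, and the dependency-preservation test is simplified to the common case where each FD lies wholly inside one fragment.

# Lossless-join test for a binary decomposition R1, R2 of R, plus a simplified
# dependency-preservation check.

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def lossless_join(R, R1, R2, fds):
    if set(R1) | set(R2) != set(R):          # condition 1: union of attributes equals R
        return False
    common = set(R1) & set(R2)
    if not common:                           # condition 2: non-empty intersection
        return False
    c = closure(common, fds)                 # condition 3: common attributes key R1 or R2
    return set(R1) <= c or set(R2) <= c

def dependency_preserving(R1, R2, fds):
    # Simplified: every FD falls entirely within R1 or entirely within R2.
    return all(set(l) | set(r) <= set(R1) or set(l) | set(r) <= set(R2)
               for l, r in fds)

fds = [("A", "BC")]
print(lossless_join("ABCD", "ABC", "AD", fds))    # True, as argued above
print(dependency_preserving("ABC", "AD", fds))    # True: A -> BC sits inside R1(ABC)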
GATE Question: Consider a schema R(A,B,C,D) and functional dependencies A->B and C->D.
Then the decomposition of R into R1(AB) and R2(CD) is [GATE-CS-2001]
A. dependency preserving and lossless join
B. lossless join but not dependency preserving
C. dependency preserving but not lossless join
D. not dependency preserving and not lossless join
Answer: For lossless join decomposition, these three conditions must hold true:
1. Att(R1) U Att(R2) = ABCD = Att(R)
2. Att(R1) ∩ Att(R2) = Φ, which violates the condition of lossless join decomposition. Hence
the decomposition is not lossless.
For dependency preserving decomposition,
A->B can be ensured in R1(AB) and C->D can be ensured in R2(CD). Hence it is dependency
preserving decomposition.
So, the correct option is C.
UNIT II:
Transaction Processing:
Consistency:
In order to maintain consistency in a database, before and after the transaction, certain
properties are followed. These are called ACID properties.

Atomicity:
By this, we mean that either the entire transaction takes place at once or doesn’t happen at
all. There is no midway i.e. transactions do not occur partially. Each transaction is considered
as one unit and either runs to completion or is not executed at all. It involves the following
two operations.
—Abort: If a transaction aborts, changes made to the database are not visible.
—Commit: If a transaction commits, changes made are visible.
Atomicity is also known as the ‘All or nothing rule’.
Consider the following transaction T consisting of T1 and T2: Transfer of 100 from
account X to account Y.
If the transaction fails after completion of T1 but before completion of T2 (say, after
write(X) but before write(Y)), then the amount has been deducted from X but not added
to Y. This results in an inconsistent database state. Therefore, the transaction must be
executed in its entirety in order to ensure the correctness of the database state.

Consistency:

This means that integrity constraints must be maintained so that the database is consistent
before and after the transaction. It refers to the correctness of a database. Referring to the
example above,
The total amount before and after the transaction must be maintained.
Total before T occurs = 500 + 200 = 700.
Total after T occurs = 400 + 300 = 700.
Therefore, the database is consistent. Inconsistency occurs in case T1 completes but T2 fails.
As a result, T is incomplete.

Isolation:

This property ensures that multiple transactions can occur concurrently without leading to
the inconsistency of the database state. Transactions occur independently without
interference. Changes occurring in a particular transaction will not be visible to any other
transaction until that particular change in that transaction is written to memory or has been
committed. This property ensures that executing transactions concurrently will result
in a state that is equivalent to the state achieved if they were executed serially in some order.
Let X= 500, Y = 500.
Consider two transactions T and T”.
Suppose T has been executed till Read(Y) and then T'' starts. (In this example, T multiplies X
by 100 and then deducts 50 from Y, while T'' simply reads X and Y and computes their sum.)
As a result, interleaving of operations takes place, due to which T'' reads the correct value of X
but the incorrect value of Y, and the sum computed by
T'': (X + Y = 50,000 + 500 = 50,500)
is thus not consistent with the sum at the end of the transaction:
T: (X + Y = 50,000 + 450 = 50,450).
This results in database inconsistency, due to a discrepancy of 50 units. Hence, transactions
must take place in isolation, and changes should be visible only after they have been written
to main memory.

Durability:

This property ensures that once the transaction has completed execution, the updates and
modifications to the database are stored in and written to disk and they persist even if a
system failure occurs. These updates now become permanent and are stored in non-volatile
memory. The effects of the transaction, thus, are never lost.
Some important points:
Property Responsibility for maintaining properties

Atomicity Transaction Manager

Consistency Application programmer

Isolation Concurrency Control Manager

Durability Recovery Manager

The ACID properties, in totality, provide a mechanism to ensure the correctness and
consistency of a database in a way such that each transaction is a group of operations that
acts as a single unit, produces consistent results, acts in isolation from other operations, and
updates that it makes are durably stored.
Atomicity:
Atomicity means that multiple operations can be grouped into a single logical entity, that is,
other threads of control accessing the database will either see all of the changes or none of
the changes.

In the context of databases, atomicity means that you either:

 Commit to the entirety of the transaction occurring


 Have no transaction at all

Essentially, an atomic transaction ensures that any commit you make finishes the entire
operation successfully. Or, in cases of a lost connection in the middle of an operation, the
database is rolled back to its state prior to the commit being initiated.

This is important for preventing crashes or outages from creating cases where the transaction
was partially finished to an unknown overall state. If a crash occurs during a transaction with
no atomicity, you can’t know exactly how far along the process was before the transaction was
interrupted. By using atomicity, you ensure that either the entire transaction is successfully
completed—or that none of it was.
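
Below is a minimal sketch of this all-or-nothing behaviour using Python's built-in sqlite3 module; the account table, the amounts and the in-memory database are illustrative assumptions, not a prescribed design.

# Atomic transfer of 100 from account X to account Y using sqlite3.
# Either both UPDATE statements are committed together, or the rollback
# leaves the database exactly as it was before the transaction began.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("X", 500), ("Y", 200)])
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 100 WHERE name = 'X'")
    conn.execute("UPDATE account SET balance = balance + 100 WHERE name = 'Y'")
    conn.commit()            # both changes become visible and durable together
except sqlite3.Error:
    conn.rollback()          # on any failure, neither change is applied

print(conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# [('X', 400), ('Y', 300)] if the transaction committed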

Isolation and Durability:

A DBMS manages data that should remain integrated when any changes are made to it,
because if the integrity of the data is affected, the whole data set gets disturbed and
corrupted. Therefore, to maintain the integrity of the data, four properties are described in
the database management system, which are known as the ACID properties. The ACID
properties apply to transactions, which go through different groups of tasks, and it is there
that we see the role of the ACID properties.

In this section, we will learn about the ACID properties: what they stand for, what each
property is used for, and how they behave, with the help of some examples.
ACID Properties

The term ACID stands for the following four properties:

1) Atomicity: The term atomicity means that the data remains atomic: if any
operation is performed on the data, it should either be executed completely
or not be executed at all. The operation should not break in between or
execute partially. In the case of a transaction, its operations should be
executed entirely, not partially.

 Example: Remo has account A with $30 in it, from which he wishes to send $10 to
Sheero's account, which is B. In account B, a sum of $100 is already present. When $10
is transferred to account B, the sum will become $110. Now, two operations will take
place: the amount of $10 that Remo wants to transfer will be debited from his account
A, and the same amount will get credited to account B, i.e., into Sheero's account. Now
suppose the first operation, the debit, executes successfully, but the credit operation
fails. Thus, in Remo's account A the value becomes $20, while Sheero's account B
remains at $100, as it was previously.
In the above diagram, it can be seen that after crediting $10, the amount is still $100 in
account B. So, it is not an atomic transaction.

The below image shows that both debit and credit operations are done successfully. Thus the
transaction is atomic.

Thus, when a transfer loses atomicity, it becomes a huge issue in banking systems, and so
atomicity is a main focus in banking systems.
2) Consistency: The word consistency means that the value should remain preserved always.
In DBMS, the integrity of the data should be maintained, which means if a change in the
database is made, it should remain preserved always. In the case of transactions, the integrity
of the data is very essential so that the database remains consistent before and after the
transaction. The data should always be correct.

Example:


 3) Isolation: The term 'isolation' means separation. In DBMS, isolation is the property of
a database whereby no transaction should affect another transaction that may be
occurring concurrently. In short, one operation should begin only after the operation
already in progress is complete; if two operations are being performed on two
different databases, they do not affect the value of one another. In the case of
transactions, when two or more transactions occur simultaneously, consistency
should remain maintained. Any change that occurs in a particular transaction will not
be seen by other transactions until that change is committed to memory.
 Example: If two operations are concurrently running on two different accounts, then
the value of both accounts should not get affected. The value should remain persistent.
As you can see in the below diagram, account A is making T1 and T2 transactions to
account B and C, but both are executing independently without affecting each other. It
is known as Isolation.

 4) Durability: Durability ensures the permanency of something. In DBMS, the term
durability ensures that, after the successful execution of an operation, the data becomes
permanent in the database. The durability of the data should be such that even if the
system fails or crashes, the database still survives. However, if the data gets lost, it
becomes the responsibility of the recovery manager to ensure the durability of the
database. For committing the values, the COMMIT command must be used every time
we make changes.
 Therefore, the ACID property of DBMS plays a vital role in maintaining the consistency
and availability of data in the database.
 Thus, it was a precise introduction of ACID properties in DBMS. We have discussed these
properties in the transaction section also.

Serializable Schedule:

A schedule, as the name suggests, is a process of lining up the transactions and executing
them one by one. When multiple transactions are running concurrently and the order of
operations needs to be set so that the operations do not overlap each other, scheduling is
brought into play and the transactions are timed accordingly.
Serial Schedules:
Schedules in which the transactions are executed non-interleaved, i.e., a serial schedule is
one in which no transaction starts until a running transaction has ended are called serial
schedules.
Example: Consider the following schedule involving two transactions T 1 and T2.
T1              T2
R(A)
W(A)
R(B)
W(B)
                R(A)
                R(B)

where R(A) denotes that a read operation is performed on some data item ‘A’.
This is a serial schedule since the transactions execute serially in the order T1 -> T2.
Recoverable Schedule:

Schedules in which transactions commit only after all transactions whose changes they read
commit are called recoverable schedules. In other words, if some transaction Tj is reading
value updated or written by some other transaction Ti, then the commit of Tj must occur after
the commit of Ti.
Example 1:
S1: R1(x), W1(x), R2(x), R1(y), R2(y),
W2(x), W1(y), C1, C2;

The given schedule follows the order Ti -> Tj => C1 -> C2. Transaction T1 is executed before T2,
so there is no chance of a conflict: R1(x) appears before W1(x), and transaction T1 is
committed before T2, i.e. the transaction that performed the first update on data item x
completes first; hence the given schedule is recoverable.
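
Recoverability can be checked mechanically from the read-from relationships: whenever Tj reads an item last written by Ti, Ti must commit before Tj commits. Below is a minimal sketch in Python over a schedule written as (transaction, action, item) steps; this encoding is an assumption made for illustration.

# Classify a schedule: it is recoverable if, whenever Tj reads a value written
# by Ti, Ti commits before Tj commits. Actions are "R", "W" or "C" (commit).

def is_recoverable(schedule):
    last_writer = {}      # item -> transaction that last wrote it
    reads_from = set()    # (reader, writer) pairs
    commit_pos = {}       # transaction -> position of its commit in the schedule
    for pos, (txn, action, item) in enumerate(schedule):
        if action == "W":
            last_writer[item] = txn
        elif action == "R" and last_writer.get(item, txn) != txn:
            reads_from.add((txn, last_writer[item]))
        elif action == "C":
            commit_pos[txn] = pos
    return all(reader not in commit_pos or
               (writer in commit_pos and commit_pos[writer] < commit_pos[reader])
               for reader, writer in reads_from)

# Schedule S1 above: R1(x) W1(x) R2(x) R1(y) R2(y) W2(x) W1(y) C1 C2
S1 = [("T1", "R", "x"), ("T1", "W", "x"), ("T2", "R", "x"), ("T1", "R", "y"),
      ("T2", "R", "y"), ("T2", "W", "x"), ("T1", "W", "y"),
      ("T1", "C", None), ("T2", "C", None)]
print(is_recoverable(S1))   # True: T2 reads x from T1, and C1 occurs before C2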
Example 2: Consider the following schedule involving two transactions T 1 and T2
T1              T2
R(A)
W(A)
                W(A)
                R(A)

 This is a recoverable schedule since T1 commits before T2, which makes the value read by
T2 correct.
Irrecoverable Schedule:
 The table below shows a schedule with two transactions: T1 reads and writes A, and that
value is read and written by T2. T2 commits. But later on, T1 fails, so we have to roll back
T1. Since T2 has read the value written by T1, it should also be rolled back, but we have
already committed it. So this schedule is an irrecoverable schedule. When Tj reads the
value updated by Ti and Tj commits before the commit of Ti, the schedule will be
irrecoverable.

Recoverable with Cascading Rollback:


 The table below shows a schedule with two transactions: T1 reads and writes A, and that
value is read and written by T2. But later on, T1 fails, so we have to roll back T1. Since T2
has read the value written by T1, it should also be rolled back. As it has not committed, we
can roll back T2 as well. So it is recoverable with cascading rollback. Therefore, if Tj reads a
value updated by Ti and the commit of Tj is delayed till the commit of Ti, the schedule is
called recoverable with cascading rollback.

Cascadeless Recoverable Rollback:


 The table below shows a schedule with two transactions: T1 reads and writes A and
commits, and that value is read by T2. If T1 fails before it commits, no other transaction
has read its value, so there is no need to roll back any other transaction. So this is a
cascadeless recoverable schedule. Thus, if Tj reads a value updated by Ti only after Ti has
committed, the schedule will be cascadeless recoverable.

Concurrency Control:
The concept of concurrency control comes under transactions in a database management
system (DBMS). It is a procedure in the DBMS that helps manage two simultaneous processes
so that they execute without conflicts with each other; such conflicts occur in multi-user
systems.
Concurrency can simply be described as executing multiple transactions at a time. It is
required to increase time efficiency. If many transactions try to access the same data, then
inconsistency arises. Concurrency control is required to maintain the consistency of data.
For example, if we take ATM machines and do not use concurrency, multiple persons cannot
draw money at a time in different places. This is where we need concurrency.

Advantages

The advantages of concurrency control are as follows −


 Waiting time will be decreased.
 Response time will decrease.
 Resource utilization will increase.
 System performance & Efficiency is increased

Control concurrency

The simultaneous execution of transactions over shared databases can create several data
integrity and consistency problems.
For example, if too many people are logging into ATM machines, serial updates and
synchronization in the bank servers should happen whenever a transaction is done;
otherwise the servers return wrong information and wrong data ends up in the database.
Main problems in using Concurrency

The problems which arise while using concurrency are as follows −


 Updates will be lost − One transaction does some changes and another transaction
deletes that change. One transaction nullifies the updates of another transaction.
 Uncommitted dependency or dirty read problem − A variable is updated in one
transaction, and at the same time another transaction starts and reads the value of that
variable before the update has been committed. If the first transaction is rolled back, the
second transaction has used false values, or the previous values, of the variable; this is a
major problem.
 Inconsistent retrievals − One transaction is updating multiple different variables while
another transaction is in the process of reading those variables; the problem that occurs
is inconsistency of the same variable in different instances.

Concurrency control techniques

The concurrency control techniques are as follows −

Locking

A lock guarantees exclusive use of a data item to the current transaction. The transaction first
accesses the data item by acquiring a lock, and after completion of the transaction it releases
the lock.
Types of Locks
The types of locks are as follows −
 Shared Lock [Transaction can read only the data item values]
 Exclusive Lock [Used for both read and write data item values]

Time Stamping

A timestamp is a unique identifier created by the DBMS that indicates the relative starting
time of a transaction. Whatever transaction we are executing, it stores the starting time of
that transaction and denotes a specific time.
This can be generated using a system clock or logical counter. This can be started whenever a
transaction is started. Here, the logical counter is incremented after a new timestamp has
been assigned.

Optimistic

It is based on the assumption that conflict is rare and it is more efficient to allow transactions
to proceed without imposing delays to ensure serializability.
Time-stamp based protocols:


Concurrency Control can be implemented in different ways. One way to implement it is by
using Locks. Now, let us discuss Time Stamp Ordering Protocol.
As earlier introduced, Timestamp is a unique identifier created by the DBMS to identify a
transaction. They are usually assigned in the order in which they are submitted to the
system. Refer to the timestamp of a transaction T as TS(T).
Timestamp Ordering Protocol –
The main idea for this protocol is to order the transactions based on their Timestamps. A
schedule in which the transactions participate is then serializable and the only equivalent
serial schedule permitted has the transactions in the order of their Timestamp Values.
Stating simply, the schedule is equivalent to the particular Serial Order corresponding to
the order of the Transaction timestamps. An algorithm must ensure that, for each item
accessed by Conflicting Operations in the schedule, the order in which the item is accessed
does not violate the ordering. To ensure this, use two Timestamp Values relating to each
database item X.
 W_TS(X) is the largest timestamp of any transaction that executed write(X) successfully.
 R_TS(X) is the largest timestamp of any transaction that executed read(X) successfully.
Basic Timestamp Ordering –
Every transaction is issued a timestamp based on when it enters the system. Suppose, if an
old transaction Ti has timestamp TS(Ti), a new transaction Tj is assigned timestamp TS(Tj)
such that TS(Ti) < TS(Tj). The protocol manages concurrent execution such that the
timestamps determine the serializability order. The timestamp ordering protocol ensures
that any conflicting read and write operations are executed in timestamp order. Whenever
some transaction T tries to issue a R_item(X) or a W_item(X) operation, the Basic TO algorithm
compares the timestamp of T with R_TS(X) and W_TS(X) to ensure that the timestamp order is
not violated. This describes the Basic TO protocol in the following two cases.
1. Whenever a transaction T issues a W_item(X) operation, check the following conditions:
 If R_TS(X) > TS(T) or W_TS(X) > TS(T), then abort and roll back T and reject the
operation; else
 Execute the W_item(X) operation of T and set W_TS(X) to TS(T).
2. Whenever a transaction T issues a R_item(X) operation, check the following conditions:
 If W_TS(X) > TS(T), then abort and roll back T and reject the operation; else
 If W_TS(X) <= TS(T), then execute the R_item(X) operation of T and set R_TS(X) to the
larger of TS(T) and the current R_TS(X).
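
Below is a minimal sketch in Python of the two Basic TO rules above; the DataItem class, the simplified abort handling and the sample timestamps are assumptions made for illustration.

# Basic Timestamp Ordering: each item keeps R_TS and W_TS, and a conflicting
# operation arriving with an older timestamp causes the transaction to abort.

class DataItem:
    def __init__(self):
        self.r_ts = 0    # largest timestamp that has read this item successfully
        self.w_ts = 0    # largest timestamp that has written this item successfully

def write_item(item, ts):
    if item.r_ts > ts or item.w_ts > ts:
        return "abort and roll back"     # a younger transaction already used the item
    item.w_ts = ts
    return "write executed"

def read_item(item, ts):
    if item.w_ts > ts:
        return "abort and roll back"     # the item was overwritten by a younger transaction
    item.r_ts = max(item.r_ts, ts)
    return "read executed"

X = DataItem()
print(write_item(X, ts=5))   # write executed, W_TS(X) = 5
print(read_item(X, ts=3))    # abort: TS(T) = 3 < W_TS(X) = 5
print(read_item(X, ts=7))    # read executed, R_TS(X) = 7
print(write_item(X, ts=6))   # abort: R_TS(X) = 7 > TS(T) = 6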
Isolation Levels:
As we know that, in order to maintain consistency in a database, it follows ACID properties.
Among these four properties (Atomicity, Consistency, Isolation, and Durability) Isolation
determines how transaction integrity is visible to other users and systems. It means that a
transaction should take place in a system in such a way that it is the only transaction that is
accessing the resources in a database system.
Isolation levels define the degree to which a transaction must be isolated from the data
modifications made by any other transaction in the database system. A transaction isolation
level is defined by the following phenomena
 Dirty Read – A Dirty read is a situation when a transaction reads data that has not yet
been committed. For example, Let’s say transaction 1 updates a row and leaves it
uncommitted, meanwhile, Transaction 2 reads the updated row. If transaction 1 rolls back
the change, transaction 2 will have read data that is considered never to have existed.
 Non Repeatable read – Non Repeatable read occurs when a transaction reads the same
row twice and gets a different value each time. For example, suppose transaction T1 reads
data. Due to concurrency, another transaction T2 updates the same data and commit,
Now if transaction T1 rereads the same data, it will retrieve a different value.
 Phantom Read – Phantom Read occurs when two same queries are executed, but the
rows retrieved by the two, are different. For example, suppose transaction T1 retrieves a
set of rows that satisfy some search criteria. Now, Transaction T2 generates some new
rows that match the search criteria for transaction T1. If transaction T1 re-executes the
statement that reads the rows, it gets a different set of rows this time .

Based on these phenomena, the SQL standard defines four isolation levels:
1. Read Uncommitted – Read Uncommitted is the lowest isolation level. In this level, one
transaction may read not yet committed changes made by other transactions, thereby
allowing dirty reads. At this level, transactions are not isolated from each other.
2. Read Committed – This isolation level guarantees that any data read is committed at the
moment it is read. Thus it does not allow dirty read. The transaction holds a read or write
lock on the current row, and thus prevents other transactions from reading, updating, or
deleting it.
3. Repeatable Read – This is a more restrictive isolation level. The transaction holds read
locks on all rows it references and writes locks on referenced rows for update and delete
actions. Since other transactions cannot read, update or delete these rows, consequently
it avoids non-repeatable read.
4. Serializable – This is the highest isolation level. Under this level, the execution of a set of
concurrent transactions is guaranteed to be serializable, i.e., equivalent to an execution of
operations in which the concurrently executing transactions appear to be executing serially.
The table below depicts the relationship between the isolation levels and the read
phenomena defined above (a lower level permits more of these phenomena):

Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read
Read Uncommitted | Possible | Possible | Possible
Read Committed | Not possible | Possible | Possible
Repeatable Read | Not possible | Not possible | Possible
Serializable | Not possible | Not possible | Not possible
Online Analytical Processing:


OLAP stands for Online Analytical Processing. An OLAP server is a software technology that
allows users to analyze information from multiple database systems at the same time. It is
based on the multidimensional data model and allows the user to query multi-dimensional
data (e.g. Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes,
and these cubes are known as hyper-cubes.

OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In drill-down operation, the less detailed data is converted into highly detailed
data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
In the cube given in overview section, the drill down operation is performed by moving
down in the concept hierarchy of Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP
cube. It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up
in the concept hierarchy of Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In the
cube given in the overview section, a sub-cube is selected by selecting following dimensions
with criteria:
 Location = “Delhi” or “Kolkata”
 Time = “Q1” or “Q2”
 Item = “Car” or “Bus”
4. Slice: It selects a single dimension from the OLAP cube which results in a new sub-cube
creation. In the cube given in the overview section, Slice is performed on the dimension
Time = “Q1”.

5. Pivot: It is also known as rotation operation as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation, performing
pivot operation gives a new view of it.
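As a hedged illustration of these operations, the short Python sketch below uses pandas on a made-up sales table; the column names and figures are assumptions chosen only for the example, and each operation maps onto a one-line pandas expression.

Example (Python, illustrative sketch):

# Roll-up, slice, dice and pivot on a toy sales cube using pandas (made-up data).
import pandas as pd

sales = pd.DataFrame({
    "City":    ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "Quarter": ["Q1", "Q2", "Q1", "Q2"],
    "Item":    ["Car", "Bus", "Car", "Bus"],
    "Amount":  [120, 80, 95, 60],
})

# Roll up: aggregate away the Item dimension (City x Quarter totals).
rollup = sales.groupby(["City", "Quarter"])["Amount"].sum()

# Slice: fix a single dimension value, Time = "Q1".
slice_q1 = sales[sales["Quarter"] == "Q1"]

# Dice: select a sub-cube on two or more dimensions.
dice = sales[sales["City"].isin(["Delhi", "Kolkata"]) & sales["Quarter"].isin(["Q1", "Q2"])]

# Pivot: rotate the view so Items become columns.
pivot = sales.pivot_table(index="City", columns="Item", values="Amount", aggfunc="sum")

print(rollup, slice_q1, dice, pivot, sep="\n\n")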

Query Tree:
Query optimization involves three steps, namely query tree generation, plan generation,
and query plan code generation.
Step 1 − Query Tree Generation
A query tree is a tree data structure representing a relational algebra expression. The tables of
the query are represented as leaf nodes. The relational algebra operations are represented as
the internal nodes. The root represents the query as a whole.
During execution, an internal node is executed whenever its operand tables are available. The
node is then replaced by the result table. This process continues for all internal nodes until the
root node is executed and replaced by the result table.
For example, let us consider the following schemas –
EMPLOYEE

EmpID EName Salary DeptNo DateOfJoining

DEPARTMENT

DNo DName Location


Example 1
Let us consider the query as the following.
$$\pi_{EmpID}(\sigma_{EName = \text{"ArunKumar"}}(EMPLOYEE))$$
The corresponding query tree will have EMPLOYEE as the leaf node, the selection σ (EName = "ArunKumar") as the internal node, and the projection π (EmpID) as the root.
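A query tree can also be sketched as a small data structure in which leaves hold tables and internal nodes hold relational operators, evaluated bottom-up exactly as described above. The Python sketch below is illustrative only; the EMPLOYEE rows are invented sample data.

Example (Python, illustrative sketch):

# Minimal query-tree sketch: leaves hold tables, internal nodes hold operators.
EMPLOYEE = [
    {"EmpID": 1, "EName": "ArunKumar", "Salary": 50000, "DeptNo": 10},
    {"EmpID": 2, "EName": "Meena",     "Salary": 60000, "DeptNo": 20},
]

def leaf(table):
    return {"op": "table", "rows": table}

def select(pred, child):
    return {"op": "select", "pred": pred, "child": child}

def project(cols, child):
    return {"op": "project", "cols": cols, "child": child}

def evaluate(node):
    # Execute the child first, then replace the node by its result table.
    if node["op"] == "table":
        return node["rows"]
    rows = evaluate(node["child"])
    if node["op"] == "select":
        return [r for r in rows if node["pred"](r)]
    if node["op"] == "project":
        return [{c: r[c] for c in node["cols"]} for r in rows]

# pi_EmpID( sigma_{EName = "ArunKumar"} (EMPLOYEE) )
tree = project(["EmpID"], select(lambda r: r["EName"] == "ArunKumar", leaf(EMPLOYEE)))
print(evaluate(tree))   # [{'EmpID': 1}]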

Cost of Query:

In the previous section, we understood about Query processing steps and evaluation plan.
Though a system can create multiple plans for a query, the chosen method should be the best
of all. It can be done by comparing each possible plan in terms of their estimated cost. For
calculating the net estimated cost of any plan, the cost of each operation within a plan should
be determined and combined to get the net estimated cost of the query evaluation plan.

The cost estimation of a query evaluation plan is calculated in terms of various resources that
include:
o Number of disk accesses


o Execution time taken by the CPU to execute a query
o Communication costs in distributed or parallel database systems.

To estimate the cost of a query evaluation plan, we use the number of blocks transferred from
the disk and the number of disk seeks. Suppose the disk takes an average of tS seconds per seek
and an average of tT seconds to transfer one data block; the block access time is the sum of the
disk seek time and the rotational latency. If a plan transfers b blocks and performs S seeks, the
time taken will be b*tT + S*tS seconds. For example, if tT = 0.1 ms and tS = 4 ms (corresponding
to a block size of 4 KB and a transfer rate of 40 MB per second), we can easily calculate the
estimated cost of a given query evaluation plan.
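As a small worked example of the formula b*tT + S*tS, the block and seek counts below are hypothetical; only tT and tS come from the text.

Example (Python, illustrative calculation):

# Disk-cost estimate for a hypothetical plan: b block transfers and S seeks.
t_T = 0.1 / 1000   # transfer time per block: 0.1 ms, in seconds
t_S = 4.0 / 1000   # seek time: 4 ms, in seconds

b = 10_000         # hypothetical number of blocks transferred
S = 200            # hypothetical number of seeks

cost_seconds = b * t_T + S * t_S
print(f"Estimated cost: {cost_seconds:.2f} s")   # 10000*0.0001 + 200*0.004 = 1.80 s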

Generally, for estimating the cost, we consider the worst case that could happen. The users
assume that, initially, the data is read from the disk only. But there is a chance that the
information is already present in the main memory. However, the users usually ignore this
effect, and due to this, the actual cost of execution often comes out less than the estimated value.

The response time, i.e., the time required to execute the plan, could be used for estimating
the cost of the query evaluation plan. But due to the following reasons, it becomes difficult to
calculate the response time without actually executing the query evaluation plan:

o When the query begins its execution, the response time becomes dependent on the
contents stored in the buffer. This information is difficult to obtain while the query is
being optimized, and it may not be available at all.
o When a system with multiple disks is present, the response time depends on how the
accesses are distributed among the disks. This is difficult to estimate without detailed
knowledge of the data layout over the disks.
o Consequently, instead of minimizing the response time for any query evaluation plan,
the optimizer finds it better to reduce the total resource consumption of the query
plan. Thus, to estimate the cost of a query evaluation plan, it is good to minimize the
resources used for accessing the disk and the use of extra resources.

Join:
As the name shows, JOIN means to combine something. In case of SQL, JOIN means "to
combine two or more tables".
In SQL, JOIN clause is used to combine the records from two or more tables in a database.
Types of SQL JOIN
 INNER JOIN
 LEFT JOIN
 RIGHT JOIN
 FULL JOIN

Sample Table
EMPLOYEE

EMP_ID EMP_NAME CITY SALARY AGE

1 Angelina Chicago 200000 30

2 Robert Austin 300000 26

3 Christian Denver 100000 42

4 Kristen Washington 500000 29

5 Russell Los angels 200000 36

6 Marry Canada 600000 48


PROJECT
PROJECT_NO EMP_ID DEPARTMENT

101 1 Testing

102 2 Development

103 3 Designing

104 4 Development

INNER JOIN

In SQL, INNER JOIN selects records that have matching values in both tables as long as the
condition is satisfied. It returns the combination of all rows from both tables where the join
condition is satisfied.

Syntax

SELECT table1.column1, table1.column2, table2.column1, ...
FROM table1
INNER JOIN table2
ON table1.matching_column = table2.matching_column;

Query

SELECT EMPLOYEE.EMP_NAME, PROJECT.DEPARTMENT
FROM EMPLOYEE
INNER JOIN PROJECT
ON PROJECT.EMP_ID = EMPLOYEE.EMP_ID;

Output

EMP_NAME DEPARTMENT

Angelina Testing

Robert Development

Christian Designing

Kristen Development

LEFT JOIN

The SQL LEFT JOIN returns all the rows from the left table and the matching rows from the right
table. If there is no matching value in the right table, it returns NULL.

Syntax

SELECT table1.column1, table1.column2, table2.column1, ...
FROM table1
LEFT JOIN table2
ON table1.matching_column = table2.matching_column;

Query

SELECT EMPLOYEE.EMP_NAME, PROJECT.DEPARTMENT
FROM EMPLOYEE
LEFT JOIN PROJECT
ON PROJECT.EMP_ID = EMPLOYEE.EMP_ID;
Output

EMP_NAME DEPARTMENT

Angelina Testing

Robert Development

Christian Designing

Kristen Development

Russell NULL

Marry NULL

RIGHT JOIN

In SQL, RIGHT JOIN returns all the rows from the right table and the matching rows from the
left table. If there is no matching value in the left table, it returns NULL.

Syntax

SELECT table1.column1, table1.column2, table2.column1, ...
FROM table1
RIGHT JOIN table2
ON table1.matching_column = table2.matching_column;

Query

SELECT EMPLOYEE.EMP_NAME, PROJECT.DEPARTMENT
FROM EMPLOYEE
RIGHT JOIN PROJECT
ON PROJECT.EMP_ID = EMPLOYEE.EMP_ID;

Output

EMP_NAME DEPARTMENT

Angelina Testing
Robert Development

Christian Designing

Kristen Development

FULL JOIN

In SQL, FULL JOIN is the result of a combination of both the left and right outer join. The joined
table has all the records from both tables and puts NULL in the place of matches not found.

Syntax

SELECT table1.column1, table1.column2, table2.column1, ...
FROM table1
FULL JOIN table2
ON table1.matching_column = table2.matching_column;

Query

SELECT EMPLOYEE.EMP_NAME, PROJECT.DEPARTMENT
FROM EMPLOYEE
FULL JOIN PROJECT
ON PROJECT.EMP_ID = EMPLOYEE.EMP_ID;

Output

EMP_NAME DEPARTMENT

Angelina Testing

Robert Development

Christian Designing

Kristen Development

Russell NULL

Marry NULL
Access Control:

Database security means keeping sensitive information safe and preventing the loss of data.
The security of a database is controlled by the Database Administrator (DBA).
The following are the main control measures used to provide security of data in
databases:
1. Authentication
2. Access control
3. Inference control
4. Flow control
5. Database Security applying Statistical Method
6. Encryption
These are explained as following below.
1. Authentication :
Authentication is the process of confirming whether a user logs in only according to the
rights provided to them to perform the activities of the database. A particular user can log
in only up to their privilege level and cannot access other sensitive data. The privilege of
accessing sensitive data is restricted by using authentication.
Biometric authentication tools, such as retina scans and fingerprints, can protect the
database from unauthorized or malicious users.
2. Access Control :
The security mechanism of a DBMS must include some provisions for restricting access to
the database by unauthorized users. Access control is done by creating user accounts and
controlling the login process through the DBMS, so that access to sensitive data is possible
only for those people (database users) who are allowed to access such data, and access is
restricted for unauthorized persons.
The database system must also keep track of all operations performed by a certain user
throughout the entire login time.
3. Inference Control :
This method is known as a countermeasure to the statistical database security problem. It
is used to prevent the user from completing any inference channel. This method protects
sensitive information from indirect disclosure.
Inferences are of two types: identity disclosure or attribute disclosure.
4. Flow Control :
This prevents information from flowing in a way that it reaches unauthorized users.
Pathways through which information flows implicitly, in ways that violate the privacy
policy of a company, are called covert channels.
5. Database Security applying Statistical Method :
Statistical database security focuses on the protection of confidential individual values
stored in and used for statistical purposes, and on retrieving summaries of values based
on categories. It does not permit retrieval of individual information.
This allows access to the database to get statistical information, such as the number of
employees in the company, but not to access detailed confidential or personal
information about a specific individual employee.
6. Encryption :
This method is mainly used to protect sensitive data (such as credit card numbers and
OTPs). The data is encoded using encryption algorithms.
An unauthorized user who tries to access this encoded data will face difficulty in decoding
it, but authorized users are given decoding keys to decode the data.

MAC:
1. DAC :
DAC is identity-based access control. DAC mechanisms will be controlled by user
identification such as username and password. DAC is discretionary because the owners can
transfer objects or any authenticated information to other users. In simple words, the owner
can determine the access privileges.
Attributes of DAC –
1. Users can transfer their object ownership to another user.
2. The access type of other users can be determined by the user.
3. Authorization failure can restrict the user access after several failed attempts.
4. Unauthorized users will be blind to object characteristics called file size, directory path,
and file name.
Examples- Permitting the Linux file operating system is an example of DAC.
2. MAC :
In MAC, the operating system provides access to users based on their identities and the data.
For gaining access, the user has to submit their personal information. It is very secure
because the rules and restrictions are imposed by the admin and are strictly followed.
MAC settings and policy management will be established in a secure network and are limited
to system administrators.
Attributes of MAC –
1. MAC policies can help to reduce system errors.
2. It has tighter security because only the administrator can access or alter controls.
3. MAC has an enforced operating system that can label and delineate incoming application
data.
4. Maintenance will be difficult because only the administrator can have access to the
database.
Examples- Access level of windows for ordinary users, admins, and guests are some of the
examples of MAC.
Differences between DAC and MAC :


DAC | MAC
DAC stands for Discretionary Access Control. | MAC stands for Mandatory Access Control.
DAC is easier to implement. | MAC is difficult to implement.
DAC is less secure to use. | MAC is more secure to use.
In DAC, the owner can determine the access and privileges and can restrict the resources based on the identity of the users. | In MAC, the system only determines the access and the resources will be restricted based on the clearance of the subjects.
DAC has extra labor-intensive properties. | MAC has no labor-intensive property.
Users will be provided access based on their identity and not using levels. | Users will be restricted based on their power and level of hierarchy.
DAC has high flexibility with no rules and regulations. | MAC is not flexible as it contains lots of strict rules and regulations.
DAC has complete trust in users. | MAC has trust only in administrators.
Decisions will be based only on user ID and ownership. | Decisions will be based on objects and tasks, and they can have their own ids.
Information flow is impossible to control. | Information flow can be easily controlled.
DAC is supported by commercial DBMSs. | MAC is not supported by commercial DBMSs.
DAC can be applied in all domains. | MAC can be applied in the military, government, and intelligence.
DAC is vulnerable to trojan horses. | MAC prevents virus flow from a higher level to a lower level.
Role-based Access Control:


Only the administrator should have complete access to the network, while other
employees, such as a junior network engineer, do not need full access to the network devices.
A junior-level engineer generally only needs to cross-check the configuration of a device, not
to add or delete any configuration, so why give full access to that employee?
For these types of scenarios, the administrator defines access to the devices according to the
roles of the users.
Role-based Access Control –
The concept of Role-based Access Control is to create a set of permissions and assign these
permissions to a user or group. With the help of these permissions, only limited access can be
provided to users, and therefore the level of security is increased.
There are different ways to perform RBAC such as creating custom privilege levels or creating
views.
Custom level privilege –
When we take a console of the router, we enter into the user-level mode. The user-level
mode has privilege level 1. By typing enable, we enter into a privileged mode where the
privilege level is 15. A user with privilege level 15 can access all the commands that are at
level 15 or below.
By creating a custom privilege level (between 2 and 14) and assigning commands to it, the
administrator can provide a subset of commands to the user.
Configuration –
First we will add a command to our privilege level say 8 and assign a password to it.
R1(config)#privilege exec level 8 configure terminal
R1(config)#enable secret level 8 0 saurabh
Here, we have assigned the password saurabh. Also note that 0 here means the password
that follows is clear text (non-hashed).
Now, we will create a local username saurabh and associate this user with the configured
level, enable the aaa model, and assign the default list to the various lines.
R1(config)#username saurabh privilege 8 secret cisco123
R1(config)#aaa new-model
R1(config)#line vty 0 4
R1(config)#login local
Now, whenever the user saurabh takes remote access through the vty lines, he will be
assigned privilege level 8.
Creating views:
Role-Based CLI access enables the administrator to create different views of the device for
different users. Each view defines the commands that a user can access. It is similar to
privilege levels. Role-based CLI provides 2 types of views:
1. Root view – Root view has the same access privilege level as a user who has level 15. The
administrator should be in root view, as views can be added, edited or deleted in root view.

Configuration –
To enter into root view, we first have to enable aaa on the device and then have to set
enable password or secret password which will be used when any user will enter the root
view.

To enable aaa on the device and to apply a secret password, the commands are:
R1(config)#aaa new-model
R1(config)#enable secret geeksforgeeks
Now, we will enter the root view with the command:
R1#enable view
By typing this, we will enter into the root level where we can add, delete or edit views.
2. Super view – A super view consists of 2 or more CLI views. A network administrator can
assign a user or group of users a superview which consists of multiple views. Since a super
view can consist of more than one view, it has access to all the commands which are
provided in those views.

Configuration –
As the super view consists of more than one view, first we will create 2 views
named cisco and ibm. In the view cisco, we will allow all show commands in exec mode
and the int e0/0 command in global configuration mode.
R1(config)#parser view cisco
R1(config-view)#secret geeksforgeeks1
R1(config-view)#commands exec include all show
R1(config-view)#commands configure include int e0/0
Now, we will create the ibm view, in which we will allow ping and config terminal in exec
mode and ip address in configuration mode.
R1(config)#parser view ibm
R1(config-view)#secret geeksforgeeks1
R1(config-view)#commands exec include ping
R1(config-view)#commands exec include config terminal


R1(config-view)#commands configure include ip address
Now we will create a super view and name it sup_user. We will assign the secret
password superuser to the superview sup_user and add the views cisco and ibm to it;
therefore it has the privilege to execute only the commands which are included in the
views cisco and ibm.
R1(config)#parser view sup_user superview
R1(config-view)#secret superuser
R1(config-view)#view cisco
R1(config-view)#view ibm

Security, Integrity and Authorization in DBMS:


UNIT III:
Data Mining
Data Mining Techniques:

Data mining includes the utilization of refined data analysis tools to find previously unknown,
valid patterns and relationships in huge data sets. These tools can incorporate statistical
models, machine learning techniques, and mathematical algorithms, such as neural networks
or decision trees. Thus, data mining incorporates analysis and prediction.

Depending on various methods and technologies from the intersection of machine learning,
database management, and statistics, professionals in data mining have devoted their careers
to better understanding how to process and make conclusions from the huge amount of data,
but what are the methods they use to make it happen?

In recent data mining projects, various major data mining techniques have been developed
and used, including association, classification, clustering, prediction, sequential patterns, and
regression.

1. Classification:

This technique is used to obtain important and relevant information about data and metadata.
This data mining technique helps to classify data in different classes.

Data mining techniques can be classified by different criteria, as follows:


i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial
data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented
database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining
functionalities, for example, discrimination, classification, clustering, characterization,
etc. Some frameworks tend to be extensive frameworks offering several data mining
functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks,
machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented
or database-oriented, etc.
The classification can also take into account the level of user interaction involved in the
data mining procedure, such as query-driven systems, autonomous systems, or
interactive exploratory systems.

2. Clustering:

Clustering is a division of information into groups of connected objects. Describing the
data by a few clusters loses certain fine details but achieves simplification. It models data
by its clusters. Data modeling puts clustering in a historical perspective rooted in
statistics, mathematics, and numerical analysis. From a machine learning point of view,
clusters relate to hidden patterns, the search for clusters is unsupervised learning, and
the resulting framework represents a data concept. From a practical point of view,
clustering plays an extraordinary role in data mining applications, for example, scientific
data exploration, text mining, information retrieval, spatial database applications, CRM,
Web analysis, computational biology, medical diagnostics, and much more.

In other words, we can say that clustering analysis is a data mining technique used to
identify similar data. This technique helps to recognize the differences and similarities
between the data. Clustering is very similar to classification, but it involves grouping
chunks of data together based on their similarities.

3. Regression:

Regression analysis is the data mining process used to identify and analyze the
relationship between variables in the presence of other factors. It is used to define the
probability of a specific variable. Regression is primarily a form of planning and
modeling. For example, we might use it to project certain costs, depending on other
factors such as availability, consumer demand, and competition. Primarily, it gives the
exact relationship between two or more variables in the given data set.

4. Association Rules:

This data mining technique helps to discover a link between two or more items. It finds
a hidden pattern in the data set.

Association rules are if-then statements that help to show the probability of
interactions between data items within large data sets in different types of databases.
Association rule mining has several applications and is commonly used to help find
sales correlations in data or medical data sets.

The way the algorithm works is that you have various data, for example, a list of grocery
items that you have been buying for the last six months. It calculates a percentage of
items being purchased together.

These are the three major measurement techniques (here "Item A + Item B" denotes the
number of transactions that contain both items):

o Lift:
This measure indicates how much more often item B is purchased together with item A
than would be expected from item B's overall frequency. It is the confidence of the rule
divided by the overall frequency of item B:

Lift = Confidence / ((Item B) / (Entire dataset))

o Support:
This measure indicates how often the items are purchased together, compared to the
overall dataset:

Support = (Item A + Item B) / (Entire dataset)

o Confidence:
This measure indicates how often item B is purchased when item A is purchased as well:

Confidence = (Item A + Item B) / (Item A)

A small sketch of computing these measures on a toy transaction list is given below.
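The following Python sketch computes the three measures for a hypothetical rule {bread} -> {butter} on a made-up list of transactions; the item names and counts are assumptions used only for illustration.

Example (Python, illustrative sketch):

# Support, confidence and lift for the rule {bread} -> {butter} on toy transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
count_a       = sum(1 for t in transactions if "bread" in t)
count_b       = sum(1 for t in transactions if "butter" in t)
count_a_and_b = sum(1 for t in transactions if {"bread", "butter"} <= t)

support    = count_a_and_b / n            # (Item A + Item B) / (Entire dataset)
confidence = count_a_and_b / count_a      # (Item A + Item B) / (Item A)
lift       = confidence / (count_b / n)   # Confidence / ((Item B) / (Entire dataset))

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.60 confidence=0.75 lift=0.94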

5. Outlier Detection:

This type of data mining technique relates to the observation of data items in the data set
which do not match an expected pattern or expected behavior. This technique may be used in
various domains like intrusion detection, fraud detection, etc. It is also known as Outlier
Analysis or Outlier Mining. An outlier is a data point that diverges too much from the rest of
the dataset. The majority of real-world datasets have outliers. Outlier detection plays a
significant role in the data mining field and is valuable in numerous fields like
network intrusion identification, credit or debit card fraud detection, detecting outliers in
wireless sensor network data, etc.

6. Sequential Patterns:

The sequential pattern is a data mining technique specialized for evaluating sequential data to
discover sequential patterns. It comprises finding interesting subsequences in a set of
sequences, where the value of a sequence can be measured in terms of different criteria like
length, occurrence frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over some time.

7. Prediction:

Prediction uses a combination of other data mining techniques such as trends, clustering,
classification, etc. It analyzes past events or instances in the right sequence to predict a future
event.

knowledge representation methods:

There are mainly four ways of knowledge representation which are given as follows:

1. Logical Representation
2. Semantic Network Representation
3. Frame Representation
4. Production Rules
Logical Representation:

Logical representation is a language with some concrete rules which deals with propositions
and has no ambiguity in representation. Logical representation means drawing a conclusion
based on various conditions. This representation lays down some important communication
rules. It consists of precisely defined syntax and semantics which supports the sound
inference. Each sentence can be translated into logics using syntax and semantics.

Syntax:
o Syntaxes are the rules which decide how we can construct legal sentences in the logic.
o It determines which symbol we can use in knowledge representation.
o How to write those symbols.

Semantics:
o Semantics are the rules by which we can interpret the sentence in the logic.
o Semantic also involves assigning a meaning to each sentence.

Logical representation can be categorised into mainly two logics:

a. Propositional Logics
b. Predicate logics

Advantages of logical representation:


1. Logical representation enables us to do logical reasoning.
2. Logical representation is the basis for the programming languages.

Disadvantages of logical Representation:


1. Logical representations have some restrictions and are challenging to work with.
2. Logical representation technique may not be very natural, and inference may not be so
efficient.
2. Semantic Network Representation:

Semantic networks are an alternative to predicate logic for knowledge representation. In


Semantic networks, we can represent our knowledge in the form of graphical networks.
This network consists of nodes representing objects and arcs which describe the
relationship between those objects. Semantic networks can categorize the object in
different forms and can also link those objects. Semantic networks are easy to understand
and can be easily extended.
This representation consists of mainly two types of relations:

a. IS-A relation (Inheritance)


b. Kind-of-relation

Example: Following are some statements which we need to represent in the form of nodes
and arcs.

Statements:
a. Jerry is a cat.
b. Jerry is a mammal
c. Jerry is owned by Priya.
d. Jerry is brown colored.
e. All Mammals are animal.

In such a diagram, the different types of knowledge are represented in the form of
nodes and arcs, where each object is connected with another object by some relation.
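A minimal sketch of such a network in Python stores the statements above as (subject, relation, object) triples and follows IS-A arcs for simple inheritance reasoning; the relation names are chosen only for illustration.

Example (Python, illustrative sketch):

# Tiny semantic network: nodes and labelled arcs stored as triples.
triples = [
    ("Jerry", "is-a", "Cat"),
    ("Jerry", "is-a", "Mammal"),
    ("Jerry", "owned-by", "Priya"),
    ("Jerry", "has-color", "Brown"),
    ("Mammal", "is-a", "Animal"),
]

def is_a(node, target):
    # Follow IS-A arcs transitively (simple inheritance reasoning).
    parents = [o for s, r, o in triples if s == node and r == "is-a"]
    return target in parents or any(is_a(p, target) for p in parents)

print(is_a("Jerry", "Animal"))   # True: Jerry -> Mammal -> Animal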

Drawbacks in Semantic representation:


1. Semantic networks take more computational time at runtime as we need to traverse the
complete network tree to answer some questions. It might be possible in the worst case
scenario that after traversing the entire tree, we find that the solution does not exist in
this network.
2. Semantic networks try to model human-like memory (which has about 10^15 neurons and
links) to store the information, but in practice, it is not possible to build such a vast
semantic network.
3. These types of representations are inadequate as they do not have any equivalent
quantifier, e.g., for all, for some, none, etc.
4. Semantic networks do not have any standard definition for the link names.
5. These networks are not intelligent and depend on the creator of the system.

Advantages of Semantic network:


1. Semantic networks are a natural representation of knowledge.
2. Semantic networks convey meaning in a transparent manner.
3. These networks are simple and easily understandable

3. Frame Representation:

A frame is a record-like structure which consists of a collection of attributes and their
values to describe an entity in the world. Frames are an AI data structure which divides
knowledge into substructures by representing stereotyped situations. A frame consists of a
collection of slots and slot values. These slots may be of any type and size. Slots have
names and values, and the different aspects of a slot are called facets.
Facets: The various aspects of a slot are known as facets. Facets are features of frames
which enable us to put constraints on the frames. Example: IF-NEEDED facets are called
when the data of a particular slot is needed. A frame may consist of any number of slots,
a slot may include any number of facets, and facets may have any number of values.
A frame is also known as slot-filler knowledge representation in artificial intelligence.
Frames are derived from semantic networks and later evolved into our modern-day
classes and objects. A single frame is not much useful. A frame system consists of a
collection of frames which are connected. In a frame, knowledge about an object or
event can be stored together in the knowledge base. The frame is a type of technology
which is widely used in various applications, including natural language processing and
machine vision.

Example: 1

Let's take an example of a frame for a book

Slots | Fillers
Title | Artificial Intelligence
Genre | Computer Science
Author | Peter Norvig
Edition | Third Edition
Year | 1996
Page | 1152
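As an illustrative sketch, the same frame can be written as a Python dictionary of slots and fillers, with an IF-NEEDED facet represented by a callable that is evaluated only when the slot is requested; the Summary slot is an invented example.

Example (Python, illustrative sketch):

# The book frame expressed as a dictionary of slots and fillers.
book_frame = {
    "Title":   "Artificial Intelligence",
    "Genre":   "Computer Science",
    "Author":  "Peter Norvig",
    "Edition": "Third Edition",
    "Year":    1996,
    "Page":    1152,
    # IF-NEEDED facet: computed only when the slot value is requested.
    "Summary": lambda f: f"{f['Title']} ({f['Edition']}), {f['Page']} pages",
}

def get_slot(frame, slot):
    value = frame[slot]
    return value(frame) if callable(value) else value

print(get_slot(book_frame, "Author"))
print(get_slot(book_frame, "Summary"))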

Advantages of frame representation:


1. The frame knowledge representation makes the programming easier by grouping the
related data.
2. The frame representation is comparably flexible and used by many applications in AI.
3. It is very easy to add slots for new attribute and relations.
4. It is easy to include default data and to search for missing values.
5. Frame representation is easy to understand and visualize.

Disadvantages of frame representation:


1. In a frame system, the inference mechanism is not easily processed.
2. The inference mechanism cannot proceed smoothly with frame representation.
3. Frame representation has a very generalized approach.

4. Production Rules

A production rules system consists of (condition, action) pairs which mean, "If condition then
action". It has mainly three parts:

o The set of production rules


o Working Memory
o The recognize-act-cycle

In production rules, the agent checks for the condition, and if the condition exists, then the
production rule fires and the corresponding action is carried out. The condition part of the rule
determines which rule may be applied to a problem, and the action part carries out the
associated problem-solving steps. This complete process is called a recognize-act cycle.

The working memory contains the description of the current state of problem-solving, and a
rule can write knowledge to the working memory. This knowledge may then match and fire
other rules.

If a new situation (state) is generated, then multiple production rules may be triggered
together; this set of rules is called the conflict set. In this situation, the agent needs to select a
rule from the set, and this is called conflict resolution.

Example:
o IF (at bus stop AND bus arrives) THEN action (get into the bus)
o IF (on the bus AND paid AND empty seat) THEN action (sit down).
o IF (on bus AND unpaid) THEN action (pay charges).
o IF (bus arrives at destination) THEN action (get down from the bus).
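A minimal recognize-act cycle over rules like these can be sketched in a few lines of Python: the working memory is a set of facts, each rule is a (condition, action) pair, and the first matching rule fires (a deliberately simple conflict-resolution strategy chosen only for this example).

Example (Python, illustrative sketch):

# Minimal recognize-act cycle: facts in a set, rules as (condition, action) pairs.
working_memory = {"at bus stop", "bus arrives"}

rules = [
    (lambda wm: {"at bus stop", "bus arrives"} <= wm,
     lambda wm: wm - {"at bus stop"} | {"on the bus", "unpaid"}),
    (lambda wm: {"on the bus", "unpaid"} <= wm,
     lambda wm: wm - {"unpaid"} | {"paid"}),
    (lambda wm: {"on the bus", "paid", "empty seat"} <= wm,
     lambda wm: wm | {"sitting"}),
]

fired = True
while fired:                      # recognize-act cycle
    fired = False
    for condition, action in rules:
        new_wm = action(working_memory) if condition(working_memory) else working_memory
        if new_wm != working_memory:
            working_memory = new_wm
            fired = True          # a rule fired; re-scan the rule set
            break                 # conflict resolution here: first matching rule wins

# Final state: {'bus arrives', 'on the bus', 'paid'}; the sitting rule never fires
# because no "empty seat" fact was ever asserted.
print(working_memory)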

Advantages of Production rule:


1. The production rules are expressed in natural language.
2. The production rules are highly modular, so we can easily remove, add or modify an
individual rule.

Disadvantages of Production rule:


1. A production rule system does not exhibit any learning capabilities, as it does not store
the result of the problem for future use.
2. During the execution of the program, many rules may be active; hence rule-based
production systems are inefficient.

Data mining Approaches (OLAP):

OLAP stands for On-Line Analytical Processing. OLAP is a class of software
technology which enables analysts, managers, and executives to gain insight
into information through fast, consistent, interactive access to a wide variety of
possible views of data that has been transformed from raw information to reflect
the real dimensionality of the enterprise as understood by the clients.

OLAP implements the multidimensional analysis of business information and
supports the capability for complex estimations, trend analysis, and sophisticated
data modeling. It is rapidly becoming the essential foundation for Intelligent
Solutions covering Business Performance Management, Planning, Budgeting,
Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge
Discovery, and Data Warehouse Reporting. OLAP enables end-clients to perform
ad hoc analysis of data in multiple dimensions, providing the insight and
understanding they require for better decision making.
Who uses OLAP and Why?:
OLAP applications are used by a variety of the functions of an organization.

Finance and accounting:


o Budgeting
o Activity-based costing
o Financial performance analysis
o And financial modeling

Sales and Marketing

o Sales analysis and forecasting


o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production

o Production planning
o Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.

The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.

How OLAP Works?

Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries that are
typically very hard to execute over tabular databases, namely aggregation, joining, and
grouping. These queries are calculated during a process that is usually called 'building' or
'processing' of the OLAP cube. This process happens overnight, and by the time end users get
to work - data will have been updated.

OLAP Guidelines (Dr.E.F.Codd Rule):

Dr. E.F. Codd, the "father" of the relational model, formulated a list of 12 guidelines and
requirements as the basis for selecting OLAP systems.

Statistics and ML:


Data mining refers to extracting or mining knowledge from large amounts of data. In other
words, data mining is the science, art, and technology of exploring large and complex
bodies of data in order to discover useful patterns. Theoreticians and practitioners are
continually seeking improved techniques to make the process more efficient, cost-effective,
and accurate. Any situation can be analyzed in two ways in data mining:
 Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to
identify patterns and trends. Alternatively, it is referred to as quantitative analysis.
 Non-statistical Analysis: This analysis provides generalized information and includes
sound, still images, and moving images.
In statistics, there are two main categories:


 Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify
the main characteristics of that data. Graphs or numbers summarize the data. Average,
Mode, SD(Standard Deviation), and Correlation are some of the commonly used
descriptive statistical methods.
 Inferential Statistics: The process of drawing conclusions based on probability theory and
generalizing the data. By analyzing sample statistics, you can infer parameters about
populations and make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with statistics.
Some of these are:
 Population
 Sample
 Variable
 Quantitative Variable
 Qualitative Variable
 Discrete Variable
 Continuous Variable
Now, let’s start discussing statistical methods. This is the analysis of raw data using
mathematical formulas, models, and techniques. Through the use of statistical methods,
information is extracted from research data, and different ways are available to judge the
robustness of research outputs.
In fact, the statistical methods used in the data mining field today are typically
derived from the vast statistical toolkit developed to answer problems arising in other fields,
and these techniques are taught in standard science curriculums. It is necessary to formulate
and test hypotheses, because hypothesis testing helps us assess the validity of our data
mining endeavor when attempting to draw inferences from the data under study. These
issues become more pronounced when more complex and sophisticated statistical
estimators and tests are used.
For extracting knowledge from databases containing different types of observations, a
variety of statistical methods are available in Data Mining and some of these are:
 Logistic regression analysis
 Correlation analysis
 Regression analysis
 Discriminate analysis
 Linear discriminant analysis (LDA)
 Classification
 Clustering
 Outlier detection
 Classification and regression trees,


 Correspondence analysis
 Nonparametric regression,
 Statistical pattern recognition,
 Categorical data analysis,
 Time-series methods for trends and periodicity
 Artificial neural networks
Now, let’s try to understand some of the important statistical methods which are used in
data mining:
 Linear Regression: The linear regression method uses the best linear relationship
between the independent and dependent variables to predict the target variable. In order
to achieve the best fit, make sure that all the distances between the shape and the actual
observations at each point are as small as possible. A good fit can be determined by
determining that no other position would produce fewer errors given the shape chosen.
Simple linear regression and multiple linear regression are the two major types of linear
regression. By fitting a linear relationship to the independent variable, the simple linear
regression predicts the dependent variable. Using multiple independent variables,
multiple linear regression fits the best linear relationship with the dependent variable. For
more details, you can refer linear regression.
 Classification: This is a method of data mining in which a collection of data is categorized
so that a greater degree of accuracy can be predicted and analyzed. An effective way to
analyze very large datasets is to classify them. Classification is one of several methods
aimed at improving the efficiency of the analysis process. A Logistic Regression and a
Discriminant Analysis stand out as two major classification techniques.
 Logistic Regression: It can also be applied to machine learning applications and
predictive analytics. In this approach, the dependent variable is either binary
(binary regression) or multinomial (multinomial regression): either one of the
two or a set of one, two, three, or four options. With a logistic regression
equation, one can estimate probabilities regarding the relationship between the
independent variable and the dependent variable. For understanding logistic
regression analysis in detail, you can refer to logistic regression.
 Discriminant Analysis: A Discriminant Analysis is a statistical method of
analyzing data based on the measurements of categories or clusters and
categorizing new observations into one or more populations that were identified
a priori. The discriminant analysis models each response class independently
then uses Bayes’s theorem to flip these projections around to estimate the
likelihood of each response category given the value of X. These models can be
either linear or quadratic.
 Linear Discriminant Analysis: According to Linear Discriminant
Analysis, each observation is assigned a discriminant score to classify it
into a response variable class. These scores are obtained by combining the
independent variables in a linear fashion. Based on this model, observations
are assumed to be drawn from a Gaussian distribution, and the predictor
variables are assumed to share a common covariance across all k levels of
the response variable Y.
 Quadratic Discriminant Analysis: An alternative approach is provided
by Quadratic Discriminant Analysis. LDA and QDA both assume
Gaussian distributions for the observations of the Y classes. Unlike LDA,
QDA considers each class to have its own covariance matrix. As a
result, the predictor variables have different variances across the k
levels in Y.
 Correlation Analysis: In statistical terms, correlation analysis captures the
relationship between variables in a pair. The value of such variables is usually
stored in a column or rows of a database table and represents a property of an
object.
 Regression Analysis: Based on a set of numeric data, regression is a data mining
method that predicts a range of numerical values (also known as continuous
values). You could, for instance, use regression to predict the cost of goods and
services based on other variables. A regression model is used across numerous
industries for forecasting financial data, modeling environmental conditions, and
analyzing trends.
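As a small, hedged illustration of two of the methods listed above (correlation analysis and simple linear regression), the sketch below uses numpy on made-up advertising and sales figures; the data and variable names are assumptions for the example only.

Example (Python, illustrative sketch):

# Correlation and simple linear regression on made-up data.
import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
sales       = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

# Correlation analysis: strength of the linear relationship between the pair.
r = np.corrcoef(advertising, sales)[0, 1]

# Simple linear regression: best-fitting line sales = slope*advertising + intercept.
slope, intercept = np.polyfit(advertising, sales, 1)

print(f"correlation r = {r:.3f}")
print(f"fitted line: sales = {slope:.2f} * advertising + {intercept:.2f}")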
Data warehouse and DBMS:
Background
A Database Management System (DBMS) stores data in the form of tables, uses the ER model,
and its goal is to maintain the ACID properties. For example, the DBMS of a college has tables
for students, faculty, etc.
A Data Warehouse is separate from a DBMS; it stores a huge amount of data, which is typically
collected from multiple heterogeneous sources like files, DBMSs, etc. The goal is to produce
statistical results that may help in decision making. For example, a college might want to
quickly see different results, like how the placement of CS students has improved over the last
10 years, in terms of salaries, counts, etc.
Need for Data Warehouse
An ordinary database can store MBs to GBs of data, and that too for a specific purpose. For
storing data of TB size, the storage shifts to a Data Warehouse. Besides this, a transactional
database does not lend itself to analytics. To effectively perform analytics, an organization
keeps a central Data Warehouse to closely study its business by organizing, understanding,
and using its historic data for taking strategic decisions and analyzing trends.
Data Warehouse vs DBMS


Database System | Data Warehouse
It supports operational processes. | It supports analysis and performance reporting.
Capture and maintain the data. | Explore the data.
Current data. | Multiple years of history.
Data is balanced within the scope of this one system. | Data must be integrated and balanced from multiple systems.
Data is updated when a transaction occurs. | Data is updated on scheduled processes.
Data verification occurs when entry is done. | Data verification occurs after the fact.
100 MB to GB. | 100 GB to TB.
ER based. | Star/Snowflake.
Application oriented. | Subject oriented.
Primitive and highly detailed. | Summarized and consolidated.


Example Applications of Data Warehousing
Data Warehousing can be applied anywhere where we have a huge amount of data and we
want to see statistical results that help in decision making.

 Social Media Websites: The social networking websites like Facebook, Twitter, Linkedin,
etc. are based on analyzing large data sets. These sites gather data related to members,
groups, locations, etc., and store it in a single central repository. Being a large amount of
data, Data Warehouse is needed for implementing the same.
 Banking: Most of the banks these days use warehouses to see the spending patterns of
account/cardholders. They use this to provide them with special offers, deals, etc.
 Government: Government uses a data warehouse to store and analyze tax payments
which are used to detect tax thefts.
Multidimensional data model:

A multidimensional model views data in the form of a data-cube. A data cube enables data to
be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities concerning which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the store's
sales for the dimensions time, item, and location. These dimensions allow the store to keep
track of things, for example, monthly sales of items and the locations at which the items were
sold. Each dimension has a table related to it, called a dimensional table, which describes the
dimension further. For example, a dimensional table for an item may contain the attributes
item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in
the table. In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the types of an item
sold). The fact or measure displayed is rupee_sold (in thousands).
Now, suppose we want to view the sales data with a third dimension. For example, suppose the
data is considered according to time and item, as well as location, for the cities Chennai,
Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, where the 3D data are
represented as a series of 2D tables.

Conceptually, the same data may also be represented in the form of a 3D data cube.
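One illustrative way to hold such a cube in code is a 3-D numpy array indexed by (location, quarter, item); the dimension values and the random figures below are assumptions made for the example.

Example (Python, illustrative sketch):

# A toy 3-D data cube (location x quarter x item) holding rupee_sold values.
import numpy as np

locations = ["Chennai", "Kolkata", "Mumbai", "Delhi"]
quarters  = ["Q1", "Q2", "Q3", "Q4"]
items     = ["Keyboard", "Mouse", "Monitor"]

cube = np.random.randint(100, 1000, size=(len(locations), len(quarters), len(items)))

# A 2-D "page" of the cube: all quarters x items for Delhi.
delhi_slice = cube[locations.index("Delhi")]

# A single fact: sales of Monitors in Delhi during Q1.
value = cube[locations.index("Delhi"), quarters.index("Q1"), items.index("Monitor")]

# Aggregating over the item dimension gives a location x quarter summary.
totals = cube.sum(axis=2)

print(delhi_slice, value, totals, sep="\n\n")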
OLAP operations:

OLAP stands for Online Analytical Processing. An OLAP server is a software technology that allows
users to analyze information from multiple database systems at the same time. It is based on
multidimensional data model and allows the user to query on multi-dimensional data (eg.
Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes and these
cubes are known as Hyper-cubes.
OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In drill-down operation, the less detailed data is converted into highly
detailed data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
In the cube given in overview section, the drill down operation is performed by moving
down in the concept hierarchy of Time dimension (Quarter -> Month).

2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing
up in the concept hierarchy of Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In
the cube given in the overview section, a sub-cube is selected by selecting following
dimensions with criteria:
 Location = “Delhi” or “Kolkata”
 Time = “Q1” or “Q2”
 Item = “Car” or “Bus”

4. Slice: It selects a single dimension from the OLAP cube which results in a new sub-cube
creation. In the cube given in the overview section, Slice is performed on the dimension
Time = “Q1”.

5. Pivot: It is also known as rotation operation as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation, performing
pivot operation gives a new view of it.

Data Cleaning in Data Mining:

Data cleaning is a crucial process in data mining. It plays an important part in the building of
a model. Data cleaning is a necessary step, but it is often neglected. Data quality is the main
issue in quality information management; data quality problems can occur anywhere in
information systems, and these problems are addressed by data cleaning.

Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or


incomplete data within a dataset. If data is incorrect, outcomes and algorithms are unreliable,
even though they may look correct. When combining multiple data sources, there are many
opportunities for data to be duplicated or mislabeled.

Generally, data cleaning reduces errors and improves data quality. Correcting errors in data
and eliminating bad records can be a time-consuming and tedious process, but it cannot be
ignored. Data mining is a key technique for data cleaning. Data mining is a technique for
discovering interesting information in data. Data quality mining is a recent approach applying
data mining techniques to identify and recover data quality problems in large databases. Data
mining automatically extracts hidden and intrinsic information from the collections of data.
Data mining has various techniques that are suitable for data cleaning.

Steps of Data Cleaning

While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to cleaning your data, such as:

1. Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or


irrelevant observations. Duplicate observations will happen most often during data collection.
When you combine data sets from multiple places, scrape data, or receive data from clients or
multiple departments, there are opportunities to create duplicate data. De-duplication is one
of the largest areas to be considered in this process. Irrelevant observations are when you
notice observations that do not fit into the specific problem you are trying to analyze.

For example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can make
analysis more efficient, minimize distraction from your primary target, and create a more
manageable and performable dataset.

2. Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming
conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled
categories or classes. For example, you may find "N/A" and "Not Applicable" in any sheet, but
they should be analyzed in the same category.

3. Filter unwanted outliers

Often, there will be one-off observations where, at a glance, they do not appear to fit within
the data you are analyzing. If you have a legitimate reason to remove an outlier, like improper
data entry, doing so will help the performance of the data you are working with.

However, sometimes, the appearance of an outlier will prove a theory you are working on.
And just because an outlier exists doesn't mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a
mistake, consider removing it.
4. Handle missing data

You can't ignore missing data because many algorithms will not accept missing values. There
are a couple of ways to deal with missing data. Neither is optimal, but both can be considered,
such as:

o You can drop observations with missing values, but this will drop or lose information, so
be careful before removing it.
o You can input missing values based on other observations; again, there is an opportunity
to lose the integrity of the data because you may be operating from assumptions and
not actual observations.
o You might alter how the data is used to navigate null values effectively.

Methods of Data Cleaning:

There are many data cleaning methods through which the data should be run. The methods
are described below:

1. Ignore the tuples: This method is not very feasible, as it is only useful when a
tuple has several attributes with missing values.
2. Fill the missing value: This approach is also not very effective or feasible. Moreover, it
can be a time-consuming method. In the approach, one has to fill in the missing value.
This is usually done manually, but it can also be done by attribute mean or using the
most probable value.
3. Binning method: This approach is very simple to understand. The smoothing of sorted
data is done using the values around it. The data is then divided into several segments
of equal size. After that, the different methods are executed to complete the task.
4. Regression: The data is made smooth with the help of using the regression function. The
regression can be linear or multiple. Linear regression has only one independent
variable, and multiple regressions have more than one independent variable.
5. Clustering: This method mainly operates on groups. Clustering arranges similar values
into a "group" or a "cluster", and the outliers are then detected with the help of the
clusters (a small sketch of these cleaning methods is given below).
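The sketch below strings several of these steps and methods together with pandas on a tiny made-up dataset (de-duplication, fixing a structural error, filling a missing value with the mean, and binning); it is illustrative only, and the column names and values are assumptions.

Example (Python, illustrative sketch):

# Basic data-cleaning steps on a made-up dataset using pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "D"],
    "status":   ["N/A", "N/A", "Not Applicable", "active", "active"],
    "age":      [25, 25, np.nan, 40, 67],
})

df = df.drop_duplicates()                                     # remove duplicate observations
df["status"] = df["status"].replace("Not Applicable", "N/A")  # fix a structural error
df["age"] = df["age"].fillna(df["age"].mean())                # fill missing value with the mean
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100])      # binning into equal-width segments

print(df)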

Data Transformation in Data Mining:

Raw data is difficult to trace or understand. That's why it needs to be preprocessed before
retrieving any information from it. Data transformation is a technique used to convert the raw
data into a suitable format that efficiently eases data mining and retrieves strategic
information. Data transformation includes data cleaning techniques and a data reduction
technique to convert the data into the appropriate form.

Data transformation is an essential data preprocessing technique that must be performed on
the data before data mining to provide patterns that are easier to understand.

Data integration, migration, data warehousing, and data wrangling may all involve data
transformation. Data transformation increases the efficiency of business and analytic
processes, and it enables businesses to make better data-driven decisions. During the data
transformation process, an analyst will determine the structure of the data. This could mean
that data transformation may be:

o Constructive: The data transformation process adds, copies, or replicates data.


o Destructive: The system deletes fields or records.
o Aesthetic: The transformation standardizes the data to meet requirements or
parameters.
o Structural: The database is reorganized by renaming, moving, or combining columns.

Data Transformation Techniques:

There are several data transformation techniques that can help structure and clean up the
data before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.

Data Smoothing

Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or reduce
any variance or any other noise form.

The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns. This serves as a help to analysts or traders who need to
look at a lot of data which can often be difficult to digest for finding patterns that they
wouldn't see otherwise.

Attribute Construction

In the attribute construction method, the new attributes consult the existing attributes to
construct a new data set that eases data mining. New attributes are created and applied to
assist the mining process from the given attributes. This simplifies the original data and makes
the mining more efficient.

For example, suppose we have a data set referring to measurements of different plots, i.e., we
may have the height and width of each plot. So here, we can construct a new attribute 'area'
from the attributes 'height' and 'width'. This also helps in understanding the relations among the
attributes in a data set.
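
A minimal pandas sketch of attribute construction with hypothetical plot measurements; the
new 'area' attribute is derived from the existing 'height' and 'width' attributes:

import pandas as pd

# Hypothetical plot measurements; 'area' is constructed from existing attributes.
plots = pd.DataFrame({"height": [10, 12, 8], "width": [5, 6, 7]})
plots["area"] = plots["height"] * plots["width"]    # new attribute eases mining
print(plots)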

Data Aggregation

Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources to integrate these data sources
into a data analysis description. This is a crucial step since the accuracy of data analysis insights
is highly dependent on the quantity and quality of the data used.

Gathering accurate data of high quality and a large enough quantity is necessary to produce
relevant results. The collection of data is useful for everything from decisions concerning
financing or business strategy of the product, pricing, operations, and marketing strategies.

For example, we have a data set of sales reports of an enterprise that has quarterly sales of
each year. We can aggregate the data to get the enterprise's annual sales report.
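
A minimal pandas sketch of this kind of aggregation, using made-up quarterly sales figures:

import pandas as pd

# Hypothetical quarterly sales; aggregate them to annual totals as described above.
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "revenue": [100, 120, 90, 150, 110, 130, 95, 160],
})
annual = sales.groupby("year", as_index=False)["revenue"].sum()
print(annual)   # one row per year with total revenue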

Data Normalization

Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or
[0.0, 1.0]. Common methods are min-max normalization, z-score (zero-mean) normalization, and
normalization by decimal scaling.

Consider that we have a numeric attribute A and we have n observed values for attribute A,
namely V1, V2, V3, …, Vn.
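
A small sketch of the two most common normalization methods (min-max and z-score),
applied to illustrative values of A:

import numpy as np

# Observed values V1..Vn of a numeric attribute A (illustrative numbers).
A = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the range [0.0, 1.0]:
#   v' = (v - min(A)) / (max(A) - min(A))
min_max = (A - A.min()) / (A.max() - A.min())

# Z-score (zero-mean) normalization:
#   v' = (v - mean(A)) / std(A)
z_score = (A - A.mean()) / A.std()

print(min_max)
print(z_score)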

Data Generalization

It converts low-level data attributes to high-level data attributes using concept hierarchy. This
conversion from a lower level to a higher conceptual level is useful to get a clearer picture of
the data. Data generalization can be divided into two approaches:

o Data cube process (OLAP) approach.


o Attribute-oriented induction (AOI) approach.

For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a
higher conceptual level into a categorical value (young, old).

Data Reduction in Data Mining:


The method of data reduction may achieve a condensed description of the original data
which is much smaller in quantity but keeps the quality of the original data.
Methods of data reduction:
These are explained as following below.
1. Data Cube Aggregation:
This technique is used to aggregate data into a simpler form. For example, imagine that the
information you gathered for your analysis for the years 2012 to 2014 includes your company's
revenue for every three months. If you are interested in annual sales rather than quarterly
figures, you can summarize the data so that the resulting data reports the total sales per year
instead of per quarter.
2. Dimension reduction:
Whenever we come across attributes that are only weakly relevant, we keep only the attributes
required for our analysis. Dimension reduction shrinks the data size by eliminating outdated or
redundant features.
 Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the best of the remaining
original attributes is added to the set, judged by a measure of statistical relevance (such as a
p-value).
Suppose there are the following attributes in the data set in which few attributes are
redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
 Step-wise Backward Selection –
This selection starts with a set of complete attributes in the original data and at each
point, it eliminates the worst remaining attribute in the set.
Suppose there are the following attributes in the data set in which few attributes are
redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}

Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }


Step-1: {X1, X2, X3, X4, X5}
Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
 Combination of forwarding and Backward Selection –
It allows us to remove the worst and select best attributes, saving time and making the
process faster.
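The following is a rough Python sketch of the step-wise forward selection described above; it
uses a cross-validated model score rather than a p-value as the relevance criterion, and the iris
data set only as a stand-in for real data:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # stand-in data set
remaining = list(range(X.shape[1]))        # candidate attribute indices
selected = []                              # initial reduced attribute set: {}

# Greedy step-wise forward selection: at each step add the attribute that
# gives the best cross-validated score when combined with those already chosen.
for _ in range(2):                         # select, say, 2 attributes
    scores = {f: cross_val_score(LogisticRegression(max_iter=500),
                                 X[:, selected + [f]], y, cv=3).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print("Final reduced attribute set:", selected)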
3. Data Compression:
The data compression technique reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two types
based on their compression techniques.
 Lossless Compression –
Encoding techniques (such as Run Length Encoding) allow a simple and modest reduction in
data size. Lossless data compression uses algorithms to restore the precise original data
from the compressed data.
 Lossy Compression –
Methods such as the Discrete Wavelet Transform and PCA (principal component analysis)
are examples of this compression. For example, the JPEG image format uses lossy
compression, but we can still recover an image whose meaning is equivalent to the original.
In lossy data compression, the decompressed data may differ from the original data but is
still useful enough to retrieve information from.
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller
representation of the data, so that only the model parameters need to be stored, or with a
non-parametric representation such as clustering, histograms, or sampling.
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide the attributes of the continuous nature
into data with intervals. We replace many constant values of the attributes by labels of small
intervals. This means that mining results are shown in a concise, and easily understandable
way.
 Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split points) to
divide the whole set of attribute values, and then repeat this method on the resulting
intervals until the end, the process is known as top-down discretization, also known as splitting.
 Bottom-up discretization –
If you first consider all the continuous values as split-points and then discard some of them by
merging neighbourhood values into intervals, that process is called bottom-up discretization
(also known as merging).
Concept Hierarchies:
It reduces the data size by collecting and then replacing low-level concepts (such as the value
43 for age) with high-level concepts (categorical values such as middle-aged or senior).
For numeric data following techniques can be followed:
 Binning –
Binning is the process of changing numerical variables into categorical counterparts. The
number of categorical counterparts depends on the number of bins specified by the user.
 Histogram analysis –
Like binning, histogram analysis partitions the values of an attribute X into disjoint
ranges called buckets. There are several partitioning rules:
1. Equal Frequency partitioning: Partitioning the values so that each bucket holds roughly
the same number of occurrences from the data set.
2. Equal Width partitioning: Partitioning the values into intervals of fixed width, e.g.
ranges such as 0-20, 20-40, and so on.
3. Clustering: Grouping the similar data together.
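
A small pandas sketch of the binning and partitioning rules above, using illustrative age values:

import pandas as pd

ages = pd.Series([13, 22, 25, 31, 43, 47, 52, 66, 70])   # illustrative values

# Equal-width partitioning: intervals of fixed width (here, three equal-width bins).
equal_width = pd.cut(ages, bins=3)

# Equal-frequency partitioning: each bin holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=3)

# Concept-hierarchy style labels replace the raw numbers with categories.
labelled = pd.cut(ages, bins=[0, 30, 55, 100],
                  labels=["young", "middle-aged", "senior"])
print(labelled.tolist())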

UNIT IV:
Knowledge representation:
Background knowledge:
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here in this tutorial, we will discuss the major
issues regarding −
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −


 Mining different kinds of knowledge in databases − Different users may be interested in
different kinds of knowledge. Therefore it is necessary for data mining to cover a broad
range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − To guide discovery process and to express
the discovered patterns, the background knowledge can be used. Background
knowledge may be used to express the discovered patterns not only in concise terms but
at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − Data Mining Query language
that allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual representations.
These representations should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required to handle
noise and incomplete objects while mining the data regularities. Without such cleaning
methods, the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered may not be interesting if they merely
represent common knowledge or lack novelty, so measures are needed to evaluate the
interestingness of discovered patterns.

Performance Issues

There can be performance-related issues such as follows −


 Efficiency and scalability of data mining algorithms − In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be
efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the huge size
of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions which are processed in parallel, and the
results from the partitions are then merged. Incremental algorithms update the mined
knowledge without mining the data again from scratch.

Diverse Data Types Issues



 Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible
for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems −
The data is available at different data sources on a LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore mining the knowledge from them
adds challenges to data mining.
visualization techniques and experiments with weka:

Weka contains a collection of visualization tools and algorithms for data analysis and
predictive modelling, together with graphical user interfaces for easy access to these
functions. The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party)
modelling algorithms implemented in other programming languages, plus data preprocessing
utilities in C and a makefile-based system for running machine learning experiments.

This original version was primarily designed as a tool for analyzing data from agricultural
domains, but the more recent, fully Java-based version (Weka 3), whose development started in
1997, is now used in many different application areas, particularly for educational purposes
and research.
Weka has the following advantages, such as:

o Free availability under the GNU General Public License.


o Portability, since it is fully implemented in the Java programming language and thus runs
on almost any modern computing platform.
o A comprehensive collection of data preprocessing and modelling techniques.
o Ease of use due to its graphical user interfaces.

History of Weka
o In 1993, the University of Waikato in New Zealand began the development of the
original version of Weka, which became a mix of Tcl/Tk, C, and makefiles.
o In 1997, the decision was made to redevelop Weka from scratch in Java, including
implementing modelling algorithms.
o In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery Service
Award.
o In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for business
intelligence. It forms the data mining and predictive analytics component of the Pentaho
business intelligence suite. Hitachi Vantara has since acquired Pentaho, and Weka now
underpins the PMI (Plugin for Machine Intelligence) open-source component.
Data mining algorithms: association rules:

Association rule learning is a type of unsupervised learning technique that checks for the
dependency of one data item on another data item and maps them accordingly so that the
relationship can be used profitably. It tries to find interesting relations or associations among
the variables of a dataset. It is based on different rules to discover the interesting relations
between variables in the database.

The association rule learning is one of the very important concepts of machine learning, and it
is employed in Market Basket analysis, Web usage mining, continuous production, etc. Here
market basket analysis is a technique used by the various big retailer to discover the
associations between items. We can understand it by taking an example of a supermarket, as
in a supermarket, all products that are purchased together are put together.

For example, if a customer buys bread, he will most likely also buy butter, eggs, or milk, so
these products are stored on the same shelf or mostly nearby.

Association rule learning can be divided into three types of algorithms:

1. Apriori
2. Eclat
3. F-P Growth Algorithm

These algorithms are explained in more detail below.

How does Association Rule Learning work?

Association rule learning works on the concept of if/then statements, such as "if A then B".

Here the "if" element is called the antecedent, and the "then" statement is called the
consequent. Relationships of this kind, where we find an association between two items, are
known as single cardinality. Association rule learning is all about creating rules, and as the
number of items increases, the cardinality increases accordingly. So, to measure the
associations between thousands of data items, there are several metrics. These metrics are
given below:

o Support
o Confidence
o Lift

Let's understand each of them:

Support

Support is the frequency of an item or itemset, i.e. how frequently it appears in the dataset. It is
defined as the fraction of the transactions T that contain the itemset X. For an itemset X over a
set of transactions T, it can be written as:

Support(X) = (Number of transactions containing X) / (Total number of transactions T)

Confidence

Confidence indicates how often the rule has been found to be true, i.e. how often the items X
and Y occur together in the dataset given that X already occurs. It is the ratio of the number of
transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X -> Y) = (Transactions containing both X and Y) / (Transactions containing X)
                   = Support(X ∪ Y) / Support(X)

Lift

Lift measures the strength of a rule. It is the ratio of the observed support to the support that
would be expected if X and Y were independent of each other:

Lift(X -> Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

Its value can be interpreted in three ways:

o If Lift = 1: the occurrence of the antecedent and that of the consequent are independent of
each other.
o If Lift > 1: the two itemsets are positively dependent on each other; they occur together
more often than expected.
o If Lift < 1: one item is a substitute for the other, which means one item has a negative
effect on the occurrence of the other.
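
The three metrics can be computed directly from a toy set of transactions; a minimal sketch
(the transactions are made up purely for illustration):

# Compute support, confidence and lift for the rule {bread} -> {butter}.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "butter", "eggs"},
]
T = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / T

sup_x  = support({"bread"})
sup_y  = support({"butter"})
sup_xy = support({"bread", "butter"})

confidence = sup_xy / sup_x          # freq(X and Y) / freq(X)
lift       = sup_xy / (sup_x * sup_y)

print(f"support={sup_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")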

Types of Association Rule Learning

Association rule learning can be divided into three algorithms:

Apriori Algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on
databases that contain transactions. It uses a breadth-first search and a Hash Tree to count
candidate itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the products that can be
bought together. It can also be used in the healthcare field to find drug reactions for patients.

Eclat Algorithm

Eclat stands for Equivalence Class Transformation. This algorithm uses a depth-first
search technique to find frequent itemsets in a transaction database. It typically executes
faster than the Apriori algorithm.

F-P Growth Algorithm

The F-P growth algorithm stands for Frequent Pattern growth, and it is an improved version of
the Apriori algorithm. It represents the database in the form of a tree structure known as a
frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent
patterns.
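
For the itemset-generation side, the following is a compact, illustrative Python sketch of
Apriori-style level-wise generation of frequent itemsets; the transactions are made up and the
usual candidate-pruning step is omitted for brevity:

def apriori_frequent_itemsets(transactions, min_support):
    """A minimal Apriori sketch: level-wise generation of frequent itemsets."""
    n = len(transactions)
    transactions = [set(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate and keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets to form candidate (k+1)-itemsets.
        keys = list(level)
        current = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        k += 1
    return frequent

transactions = [["bread", "milk"], ["bread", "butter", "eggs"],
                ["milk", "butter"], ["bread", "milk", "butter"]]
print(apriori_frequent_itemsets(transactions, min_support=0.5))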

Applications of Association Rule Learning

It has various applications in machine learning and data mining. Below are some popular
applications of association rule learning:

o Market Basket Analysis: It is one of the popular examples and applications of


association rule mining. This technique is commonly used by big retailers to determine
the association between items.
o Medical Diagnosis: With the help of association rules, diseases can be diagnosed more
easily, as the rules help in identifying the probability of illness for a particular disease.
o Protein Sequence: The association rules help in determining the synthesis of artificial
Proteins.
o It is also used for the Catalog Design and Loss-leader Analysis and many more other
applications.

Correlation Analysis in Data Mining:

Correlation analysis is a statistical method used to measure the strength of the linear
relationship between two variables and compute their association. Correlation analysis
calculates the level of change in one variable due to the change in the other. A high correlation
points to a strong relationship between the two variables, while a low correlation means that
the variables are weakly related.

Researchers use correlation analysis to analyze quantitative data collected through research
methods like surveys and live polls for market research. They try to identify relationships,
patterns, significant connections, and trends between two variables or datasets. There is a
positive correlation between two variables when an increase in one variable leads to an
increase in the other. On the other hand, a negative correlation means that when one variable
increases, the other decreases and vice-versa.

Correlation is a bivariate analysis that measures the strength of association between two
variables and the direction of the relationship. In terms of the strength of the relationship, the
correlation coefficient's value varies between +1 and -1. A value of ± 1 indicates a perfect
degree of association between the two variables.

Why Correlation Analysis is Important:

Correlation analysis can reveal meaningful relationships between different metrics or groups
of metrics. Information about those connections can provide new insights and reveal
interdependencies, even if the metrics come from different parts of the business.

Suppose there is a strong correlation between two variables or metrics, and one of them is
being observed acting in a particular way. In that case, you can conclude that the other one is
also being affected similarly. This helps group related metrics together to reduce the need for
individual data processing.

Types of Correlation Analysis in Data Mining

Usually, in statistics, we measure four types of correlations: Pearson correlation, Kendall rank
correlation, Spearman correlation, and the Point-Biserial correlation.

1. Pearson r correlation

Pearson r correlation is the most widely used correlation statistic to measure the degree of the
relationship between linearly related variables. For example, in the stock market, if we want to
measure how two stocks are related to each other, Pearson r correlation is used to measure
the degree of relationship between the two. The point-biserial correlation is conducted with
the Pearson correlation formula, except that one of the variables is dichotomous. The
following formula is used to calculate the Pearson r correlation:

rxy = [ n Σ xi yi − (Σ xi)(Σ yi) ] / sqrt( [ n Σ xi² − (Σ xi)² ] [ n Σ yi² − (Σ yi)² ] )

rxy = Pearson r correlation coefficient between x and y

n = number of observations

xi = value of x (for the ith observation)

yi = value of y (for the ith observation)

2. Kendall rank correlation

Kendall rank correlation is a non-parametric test that measures the strength of dependence
between two variables. Considering two samples, a and b, where each sample size is n, we
know that the total number of pairings of a with b is n(n-1)/2. The following formula is used to
calculate the value of Kendall rank correlation:

τ = (Nc − Nd) / ( n(n − 1) / 2 )

Nc = number of concordant pairs

Nd = number of discordant pairs

3. Spearman rank correlation

Spearman rank correlation is a non-parametric test that is used to measure the degree of
association between two variables. The Spearman rank correlation test does not carry any
assumptions about the data distribution. It is the appropriate correlation analysis when the
variables are measured on an at least ordinal scale.

This coefficient requires a table of data that displays the raw data, its ranks, and the difference
between the two ranks. The squared differences between the ranks can be shown on a
scatter graph, which will indicate whether there is a positive, negative, or no correlation
between the two variables. The constraint that this coefficient works under is -1 ≤ ρ ≤ +1,
where a result of 0 means that there is no relation between the data whatsoever. The
following formula is used to calculate the Spearman rank correlation:

ρ = 1 − ( 6 Σ di² ) / ( n(n² − 1) )

ρ = Spearman rank correlation

di = the difference between the ranks of corresponding variables

n = number of observations
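
Where SciPy is available, the three coefficients above can be computed directly; a small sketch
with made-up paired observations:

from scipy.stats import pearsonr, spearmanr, kendalltau

# Illustrative paired observations (e.g. two stock price series).
x = [10, 12, 14, 18, 21, 25]
y = [11, 14, 13, 20, 22, 27]

r, _   = pearsonr(x, y)     # linear (Pearson r) correlation
rho, _ = spearmanr(x, y)    # rank-based Spearman correlation
tau, _ = kendalltau(x, y)   # Kendall rank correlation

print(f"Pearson r={r:.3f}  Spearman rho={rho:.3f}  Kendall tau={tau:.3f}")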

UNIT V:
Classification & Clustering:
1R algorithm:
Learn-One-Rule:
This method is used in the sequential learning algorithm for learning the rules. It returns a
single rule that covers at least some examples. However, what makes it really powerful is its
ability to create relations among the attributes given, hence covering a larger hypothesis
space.
For example:
IF Mother(y, x) and Female(y), THEN Daughter(x, y).
Here, any person can be associated with the variables x and y

The Learn-One-Rule algorithm follows a greedy search paradigm where it searches for rules
with high accuracy, even though their coverage may be low. It returns a single rule that
correctly classifies at least some of the positive examples.
Learn-One-Rule(target_attribute, attributes, examples, k):

    Pos = positive examples
    Neg = negative examples

    best-hypothesis = the most general hypothesis
    candidate-hypotheses = {best-hypothesis}

    while candidate-hypotheses is not empty:
        // Generate the next, more specific candidate hypotheses
        constraints_list = all constraints of the form "attribute = value"
        new-candidate-hypotheses = all specializations of the candidate hypotheses
                                   obtained by adding a constraint
        remove all duplicate or inconsistent hypotheses from new-candidate-hypotheses

        // Update best-hypothesis
        best-hypothesis = argmax over h in new-candidate-hypotheses of
                          Performance(h, examples, target_attribute)

        // Update candidate-hypotheses
        candidate-hypotheses = the k best members of new-candidate-hypotheses,
                               according to Performance

    prediction = the most frequent value of target_attribute among the examples
                 that match best-hypothesis
    return the rule: IF best-hypothesis THEN prediction

It involves a Performance method that calculates how well each candidate hypothesis matches
the given set of examples in the training data:

Performance(h, examples, target_attribute):
    h-examples = the examples that match h
    return the number of h-examples whose target_attribute value equals the
           most frequent target_attribute value among h-examples
It starts with the most general rule precondition, then greedily adds the variable that most
improves performance measured over the training examples.
Learn-One-Rule Example
Let us understand the working of the algorithm using an example:

Day   Weather    Temp   Wind     Rain    PlayBadminton
D1    Sunny      Hot    Weak     Heavy   No
D2    Sunny      Hot    Strong   Heavy   No
D3    Overcast   Hot    Weak     Heavy   No
D4    Snowy      Cold   Weak     Light   Yes
D5    Snowy      Cold   Weak     Light   Yes
D6    Snowy      Cold   Strong   Light   Yes
D7    Overcast   Mild   Strong   Heavy   No
D8    Sunny      Hot    Weak     Light   Yes


Step 1 -best_hypothesis = IF h THEN PlayBadminton(x) = Yes
Step 2 - candidate-hypothesis = {best-hypothesis}
Step 3 -constraints_list = {Weather(x)=Sunny, Temp(x)=Hot, Wind(x)=Weak, ......}
Step 4 - new-candidate-hypothesis = {IF Weather=Sunny THEN PlayBadminton=YES,
IF Weather=Overcast THEN PlayBadminton=YES, ...}
Step 5 - best-hypothesis = IF Weather=Sunny THEN PlayBadminton=YES
Step 6 - candidate-hypothesis = the k best hypotheses from Step 4 according to their
performance, e.g. {IF Weather=Sunny THEN PlayBadminton=YES, IF Rain=Light THEN PlayBadminton=YES, ...}
Step 7 - Go to Step 2 and keep doing it till the best-hypothesis is obtained.

Decision Tree Introduction with example:


 Decision tree algorithms fall under the category of supervised learning. They can be used to
solve both regression and classification problems.
 A decision tree uses a tree representation to solve the problem, in which each leaf node
corresponds to a class label and attributes are represented on the internal nodes of the tree.
 We can represent any boolean function on discrete attributes using a decision tree.

Below are some assumptions that we made while using decision tree:
 At the beginning, we consider the whole training set as the root.
 Feature values are preferred to be categorical. If the values are continuous then they are
discretized prior to building the model.
 On the basis of attribute values records are distributed recursively.
 We use statistical methods for ordering attributes as root or the internal node.

A decision tree works on the Sum of Product (SOP) form, which is also known as Disjunctive
Normal Form: each root-to-leaf path is a conjunction (product) of attribute tests, and the tree
as a whole is the disjunction (sum) of these paths. In the illustrative example, the tree predicts
the use of a computer in the daily life of people.
In a decision tree, the major challenge is the identification of the attribute for the root node at
each level. This process is known as attribute selection. We have two popular attribute
selection measures:
1. Information Gain
2. Gini Index
1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets
the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and
Values(A) is the set of all possible values of A. Then

Gain(S, A) = Entropy(S) − Σ over v in Values(A) of ( |Sv| / |S| ) · Entropy(Sv)

Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an
arbitrary collection of examples. The higher the entropy, the more the information content.
Definition: Suppose S is a set of instances and pi is the proportion of instances in S that belong
to class i. Then

Entropy(S) = − Σ over classes i of pi · log2(pi)

Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of a: 3, so p(a) = 0.375
Instances of b: 5, so p(b) = 0.625
Entropy(X) = −[0.375 · log2(0.375) + 0.625 · log2(0.625)]
           = −[0.375 · (−1.415) + 0.625 · (−0.678)]
           = 0.531 + 0.424
           = 0.954
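
A minimal Python sketch of the entropy and information-gain formulas above (the helper
functions are our own, written only for illustration):

import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute_index, labels):
    """Gain(S, A) = Entropy(S) - sum(|Sv|/|S| * Entropy(Sv)) over values v of A."""
    total = len(labels)
    gain = entropy(labels)
    for v in set(r[attribute_index] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attribute_index] == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

print(entropy(list("aaabbbbb")))   # ~0.954, matching the worked example above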
Building Decision Tree using Information Gain
The essentials:
 Start with all training instances associated with the root node
 Use info gain to choose which attribute to label each node with
 Note: No root-to-leaf path should contain the same discrete attribute twice
 Recursively construct each subtree on the subset of training instances that would be
classified down that path in the tree.
The border cases:
 If all positive or all negative training instances remain, label that node “yes” or “no”
accordingly
 If no attributes remain, label with a majority vote of training instances left at that node
 If no instances remain, label with a majority vote of the parent’s training instances
Example:
Now, lets draw a Decision Tree for the following data using Information gain.
Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II

Here, we have 3 features and 2 output classes. To build a decision tree using information gain,
we take each feature and calculate the information gain obtained by splitting on it.

Split on feature X

Split on feature Y

Split on feature Z
From the splits above we can see that the information gain is maximum when we split on
feature Y. So the best-suited feature for the root node is feature Y. We can also see that when
the dataset is split by feature Y, each child contains a pure subset of the target variable, so we
do not need to split the dataset any further.
The final tree for the above dataset is therefore a single split on feature Y: instances with
Y = 1 are labelled class I, and instances with Y = 0 are labelled class II.

2. Gini Index
 Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified.
 It means an attribute with lower Gini index should be preferred.
 Sklearn supports the "gini" criterion for the Gini Index, and by default it uses the "gini" value.
 The formula for the calculation of the Gini Index is given below:

Gini(S) = 1 − Σ over classes i of (pi)²

where pi is the proportion of instances in S that belong to class i.
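
As a brief illustration of using the Gini criterion in practice, here is a minimal scikit-learn sketch
trained on the small X/Y/Z data set above (the new instance passed to predict is made up):

from sklearn.tree import DecisionTreeClassifier

# The training set from the information-gain example above:
# features X, Y, Z and class C (I -> 0, II -> 1).
X = [[1, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y = [0, 0, 1, 1]

# criterion can be "gini" (default) or "entropy" for information gain.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)
print(tree.predict([[0, 1, 1]]))   # classify a new, hypothetical instance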

Covering rules:
Data Mining - Rule Based Classification:

IF-THEN Rules

Rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a
rule in the following from −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
 The IF part of the rule is called rule antecedent or precondition.
 The THEN part of the rule is called rule consequent.

 The antecedent part, the condition, consists of one or more attribute tests, and these tests
are logically ANDed.
 The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ^ (student = yes) => (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a
decision tree.
Points to remember −
To extract a rule from a decision tree −
 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically ANDed.
 The leaf node holds the class prediction, forming the rule consequent.

Rule Induction Using Sequential Covering Algorithm

A Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data.
We do not need to generate a decision tree first. In this algorithm, each rule for a given class
covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general
strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by
the rule are removed, and the process continues for the remaining tuples until a termination
condition is met.
Note − Decision tree induction, by contrast, can be considered as learning a set of rules
simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm where rules are learned for one class at a
time. When learning a rule for a class Ci, we want the rule to cover all the tuples from class Ci
only and no tuples from any other class.
Algorithm: Sequential Covering

Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.

Output: a set of IF-THEN rules.

Method:
    Rule_set = { };   // initial set of rules learned is empty

    for each class c do
        repeat
            Rule = Learn_One_Rule(D, Att_vals, c);
            remove tuples covered by Rule from D;
            Rule_set = Rule_set + Rule;   // add the new rule to the rule set
        until termination condition;
    end for

    return Rule_set;

Rule Pruning

A rule is pruned for the following reasons −

 The assessment of quality is made on the original set of training data. The rule may
perform well on the training data but less well on subsequent data. That is why rule
pruning is required.
 The rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has
greater quality than R as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos − neg) / (pos + neg)

where pos and neg are the number of positive and negative tuples covered by R, respectively.
Data Mining – Tasks:
Data mining deals with the kind of patterns that can be mined. On the basis of the kind of data
to be mined, there are two categories of functions involved in Data Mining −
 Descriptive
 Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions −
 Class/Concept Description
 Mining of Frequent Patterns

 Mining of Associations
 Mining of Correlations
 Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For example, in
a company, the classes of items for sales include computer and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a concept
are called class/concept descriptions. These descriptions can be derived by the following two
ways −
 Data Characterization − This refers to summarizing data of class under study. This class
under study is called as Target Class.
 Data Discrimination − It refers to the mapping or classification of a class with some
predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is the
list of kind of frequent patterns −
 Frequent Item Set − It refers to a set of items that frequently appear together, for
example, milk and bread.
 Frequent Subsequence − A sequence of patterns that occur frequently such as
purchasing a camera is followed by memory card.
 Frequent Sub Structure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with item-sets or subsequences.
Mining of Association
Associations are used in retail sales to identify patterns that are frequently purchased
together. This process refers to the process of uncovering the relationship among data and
determining association rules.
For example, a retailer generates an association rule that shows that 70% of time milk is sold
with bread and only 30% of times biscuits are sold with bread.
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two item sets, in order to analyze
whether they have a positive, negative, or no effect on each other.

Mining of Clusters
Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming group of
objects that are very similar to each other but are highly different from the objects in other
clusters.

Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The
purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −
 Classification (IF-THEN) Rules
 Decision Trees
 Mathematical Formulae
 Neural Networks
The list of functions involved in these processes are as follows −
 Classification − It predicts the class of objects whose class label is unknown. Its objective
is to find a derived model that describes and distinguishes data classes or concepts. The
derived model is based on the analysis of a set of training data, i.e., data objects whose
class labels are well known.
 Prediction − It is used to predict missing or unavailable numerical data values rather
than class labels. Regression Analysis is generally used for prediction. Prediction can also
be used for identification of distribution trends based on available data.
 Outlier Analysis − Outliers may be defined as the data objects that do not comply with
the general behavior or model of the data available.
 Evolution Analysis − Evolution analysis refers to the description and modeling of regularities
or trends for objects whose behavior changes over time.

Data Mining Task Primitives

 We can specify a data mining task in the form of a data mining query.
 This query is input to the system.
 A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
 Set of task relevant data to be mined.
 Kind of knowledge to be mined.
 Background knowledge to be used in discovery process.

 Interestingness measures and thresholds for pattern evaluation.


 Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
This is the portion of database in which the user is interested. This portion includes the
following −
 Database Attributes
 Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis
Background knowledge
The background knowledge allows data to be mined at multiple levels of abstraction. For
example, concept hierarchies are one form of background knowledge that allows data to be
mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns that are discovered by the process of knowledge
discovery. There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These
representations may include the following. −
 Rules
 Tables
 Charts
 Graphs
 Decision Trees
 Cubes

Statistical Methods in Data Mining:


Data mining refers to extracting or mining knowledge from large amounts of data. In other
words, data mining is the science, art, and technology of exploring large and complex
bodies of data in order to discover useful patterns. Theoreticians and practitioners are
continually seeking improved techniques to make the process more efficient, cost-effective,
and accurate. Any situation can be analyzed in two ways in data mining:
 Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to
identify patterns and trends. Alternatively, it is referred to as quantitative analysis.
 Non-statistical Analysis: This analysis provides generalized information and includes
sound, still images, and moving images.
In statistics, there are two main categories:
 Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify
the main characteristics of that data. Graphs or numbers summarize the data. Average,
Mode, SD(Standard Deviation), and Correlation are some of the commonly used
descriptive statistical methods.
 Inferential Statistics: The process of drawing conclusions based on probability theory and
generalizing the data. By analyzing sample statistics, you can infer parameters about
populations and make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with statistics.
Some of these are:
 Population
 Sample
 Variable
 Quantitative Variable
 Qualitative Variable
 Discrete Variable
 Continuous Variable
Now, let’s start discussing statistical methods. This is the analysis of raw data using
mathematical formulas, models, and techniques. Through the use of statistical methods,
information is extracted from research data, and different ways are available to judge the
robustness of research outputs.
As a matter of fact, today’s statistical methods used in the data mining field typically are
derived from the vast statistical toolkit developed to answer problems arising in other fields.
These techniques are taught in science curriculums. It is necessary to check and test several
hypotheses. Testing these hypotheses helps us assess the validity of our data mining
endeavor when attempting to draw inferences from the data under study. When using
more complex and sophisticated statistical estimators and tests, these issues become more
pronounced.
For extracting knowledge from databases containing different types of observations, a
variety of statistical methods are available in data mining. Some of these are:
 Logistic regression analysis
 Correlation analysis
 Regression analysis
 Discriminate analysis
 Linear discriminant analysis (LDA)
 Classification
 Clustering
 Outlier detection
 Classification and regression trees,
 Correspondence analysis
 Nonparametric regression,
 Statistical pattern recognition,
 Categorical data analysis,
 Time-series methods for trends and periodicity
 Artificial neural networks
Now, let’s try to understand some of the important statistical methods which are used in
data mining:
 Linear Regression: The linear regression method uses the best linear relationship
between the independent and dependent variables to predict the target variable. In order
to achieve the best fit, the distances between the fitted line and the actual observations at
each point should be as small as possible. A good fit is one for which no other position of
the line would produce fewer errors.
Simple linear regression and multiple linear regression are the two major types of linear
regression. By fitting a linear relationship to the independent variable, the simple linear
regression predicts the dependent variable. Using multiple independent variables,
multiple linear regression fits the best linear relationship with the dependent variable. For
more details, you can refer linear regression.
 Classification: This is a method of data mining in which a collection of data is categorized
so that a greater degree of accuracy can be predicted and analyzed. An effective way to
analyze very large datasets is to classify them. Classification is one of several methods
aimed at improving the efficiency of the analysis process. A Logistic Regression and a
Discriminant Analysis stand out as two major classification techniques.
 Logistic Regression: It can also be applied to machine learning applications and
predictive analytics. In this approach, the dependent variable is either binary
(binary regression) or multinomial (multinomial regression): either one of the
two or a set of one, two, three, or four options. With a logistic regression
equation, one can estimate probabilities regarding the relationship between the
independent variable and the dependent variable. For understanding logistic


regression analysis in detail, you can refer to logistic regression.
 Discriminant Analysis: A Discriminant Analysis is a statistical method of
analyzing data based on the measurements of categories or clusters and
categorizing new observations into one or more populations that were identified
a priori. The discriminant analysis models each response class independently
then uses Bayes’s theorem to flip these projections around to estimate the
likelihood of each response category given the value of X. These models can be
either linear or quadratic.
 Linear Discriminant Analysis: According to Linear Discriminant
Analysis, each observation is assigned a discriminant score to classify it
into a response variable class. By combining the independent variables
in a linear fashion, these scores can be obtained. Based on this model,
observations are drawn from a Gaussian distribution, and the predictor
variables share a common covariance across all k levels of the response variable, Y.
 Quadratic Discriminant Analysis: An alternative approach is provided
by Quadratic Discriminant Analysis. LDA and QDA both assume
Gaussian distributions for the observations of the Y classes. Unlike LDA,
QDA considers each class to have its own covariance matrix. As a
result, the predictor variables have different variances across the k
levels in Y.
 Correlation Analysis: In statistical terms, correlation analysis captures the
relationship between variables in a pair. The value of such variables is usually
stored in a column or rows of a database table and represents a property of an
object.
 Regression Analysis: Based on a set of numeric data, regression is a data mining
method that predicts a range of numerical values (also known as continuous
values). You could, for instance, use regression to predict the cost of goods and
services based on other variables. A regression model is used across numerous
industries for forecasting financial data, modeling environmental conditions, and
analyzing trends.

Bayesian network:

Bayesian Belief Network

Bayesian Belief Networks specify joint conditional probability distributions. They are also
known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
 A Belief Network allows class conditional independencies to be defined between subsets
of variables.
 It provides a graphical model of causal relationship on which learning can be performed.
 We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
 Directed acyclic graph
 A set of conditional probability tables

Directed Acyclic Graph

 Each node in a directed acyclic graph represents a random variable.


 These variable may be discrete or continuous valued.
 These variables may correspond to the actual attribute given in the data.

Directed Acyclic Graph Representation

The following diagram shows a directed acyclic graph for six Boolean variables.

The arc in the diagram allows representation of causal knowledge. For example, lung cancer is
influenced by a person's family history of lung cancer, as well as whether or not the person is a
smoker. It is worth noting that the variable PositiveXray is independent of whether the patient
has a family history of lung cancer or that the patient is a smoker, given that we know the
patient has lung cancer.

Conditional Probability Table

Each variable has an associated conditional probability table (CPT). The CPT for the variable
LungCancer (LC) gives the conditional probability of each value of LC for every possible
combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S).
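
To make the factorization concrete, the following is a minimal sketch of how such a network
computes a joint probability from its CPTs; all probability values below are invented purely for
illustration and are not taken from any real study:

# Joint probability factorized by the network structure above:
# P(FH, S, LC, PX) = P(FH) * P(S) * P(LC | FH, S) * P(PX | LC)
p_fh = {True: 0.1, False: 0.9}                 # FamilyHistory (made-up prior)
p_s  = {True: 0.3, False: 0.7}                 # Smoker (made-up prior)

# Conditional probability table: P(LungCancer = True | FH, S), made-up values.
p_lc_given = {(True, True): 0.8, (True, False): 0.5,
              (False, True): 0.7, (False, False): 0.1}

p_px_given_lc = {True: 0.9, False: 0.2}        # P(PositiveXray = True | LC)

def joint(fh, s, lc, px):
    p_lc = p_lc_given[(fh, s)] if lc else 1 - p_lc_given[(fh, s)]
    p_px = p_px_given_lc[lc] if px else 1 - p_px_given_lc[lc]
    return p_fh[fh] * p_s[s] * p_lc * p_px

print(joint(fh=True, s=True, lc=True, px=True))   # 0.1 * 0.3 * 0.8 * 0.9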

Instance based methods:

The Machine Learning systems which are categorized as instance-based learning are the
systems that learn the training examples by heart and then generalizes to new instances
based on some similarity measure. It is called instance-based because it builds the
hypotheses from the training instances. It is also known as memory-based learning or lazy-
learning. The time complexity of this algorithm depends upon the size of training data. The
worst-case time complexity of this algorithm is O (n), where n is the number of training
instances.
For example, If we were to create a spam filter with an instance-based learning algorithm,
instead of just flagging emails that are already marked as spam emails, our spam filter would
be programmed to also flag emails that are very similar to them. This requires a measure of
resemblance between two emails. A similarity measure between two emails could be the
same sender or the repetitive use of the same keywords or something else.
Advantages:
1. Instead of estimating the target function over the entire instance space, local
approximations can be made to the target function.
2. The algorithm can easily adapt to new data, which is collected as we go.
Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query involves building
a local model from scratch.
Some of the instance-based learning algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
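
A minimal K Nearest Neighbor sketch using scikit-learn (the iris data set is used only as a
stand-in for real data): the classifier memorizes the training instances and classifies new ones
by a majority vote of the k most similar.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 most similar instances
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))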

Linear models:

Regression refers to a data mining technique that is used to predict the numeric values in a
given data set. For example, regression might be used to predict the product or service cost or
other variables. It is also used in various industries for business and marketing behavior, trend
analysis, and financial forecast. In this tutorial, we will understand the concept of regression,
types of regression with certain examples.

What is regression?

Regression refers to a type of supervised machine learning technique that is used to predict
any continuous-valued attribute. Regression helps any business organization to analyze the
target variable and predictor variable relationships. It is a most significant tool to analyze the
data that can be used for financial forecasting and time series modeling.

Regression involves the technique of fitting a straight line or a curve to numerous data points.
The fitting is done in such a way that the distance between the data points and the curve
comes out to be the lowest.

The most popular types of regression are linear and logistic regressions. Other than that, many
other types of regression can be performed depending on their performance on an individual
data set.

Regression can predict dependent variables expressed in terms of independent variables when
the trend is available over a finite period. Regression provides a good way to predict variables,
but there are certain restrictions and assumptions, such as the independence of the variables
and the inherent normal distributions of the variables. For example, suppose one considers
two variables, A and B, whose joint distribution is a bivariate distribution; by that nature, these
two variables might appear independent, but they can also be correlated, so the marginal
distributions of A and B need to be derived and used. Before applying regression analysis, the
data needs to be studied carefully and certain preliminary tests performed to ensure
regression is applicable. There are non-parametric tests available for such cases.

Regression is divided into five different types

1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression

Regression vs. Classification:

o Regression refers to a type of supervised machine learning technique that is used to predict
any continuous-valued attribute, whereas classification refers to a process of assigning
predefined class labels to instances based on their attributes.
o In regression, the nature of the predicted data is ordered; in classification, the nature of the
predicted data is unordered.
o Regression can be further divided into linear regression and non-linear regression, while
classification is divided into two categories: binary classifiers and multi-class classifiers.
o In the regression process, model quality is basically assessed using the root mean square
error; in the classification process, it is assessed by measuring the accuracy (efficiency).
o Examples of regression methods are regression trees and linear regression; an example of a
classification method is the decision tree.

The regression analysis usually enables us to compare the effects of various kinds of feature
variables measured on numerous scales. Such as prediction of the land prices based on the
locality, total area, surroundings, etc. These results help market researchers or data analysts to
remove the useless feature and evaluate the best features to calculate efficient models.
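
A minimal linear regression sketch in scikit-learn, using hypothetical land-price data of the
kind mentioned above:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: predict land price from total area (a single feature).
area  = np.array([[500], [750], [1000], [1250], [1500]])   # square feet
price = np.array([50, 72, 98, 121, 150])                   # in thousands

model = LinearRegression().fit(area, price)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted price for 1100 sq ft:", model.predict([[1100]])[0])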

Cluster/2:

Cluster is a group of objects that belongs to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.

What is Clustering?

Clustering is the process of making a group of abstract objects into classes of similar objects.
Points to Remember
 A cluster of data objects can be treated as one group.
 While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.
 The main advantage of clustering over classification is that, it is adaptable to changes
and helps single out useful features that distinguish different groups.

Applications of Cluster Analysis

 Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
 Clustering can also help marketers discover distinct groups in their customer base. And
they can characterize their customer groups based on the purchasing patterns.
 In the field of biology, it can be used to derive plant and animal taxonomies, categorize
genes with similar functionalities and gain insight into structures inherent to
populations.
 Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to
house type, value, and geographic location.
 Clustering also helps in classifying documents on the web for information discovery.

 Clustering is also used in outlier detection applications such as detection of credit card
fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining −
 Scalability − We need highly scalable clustering algorithms to deal with large databases.
 Ability to deal with different kinds of attributes − Algorithms should be capable of being
applied to any kind of data such as interval-based (numerical) data, categorical data, and
binary data.
 Discovery of clusters with attribute shape − The clustering algorithm should be capable
of detecting clusters of arbitrary shape. They should not be bounded to only distance
measures that tend to find spherical cluster of small sizes.
 High dimensionality − The clustering algorithm should not only be able to handle low-
dimensional data but also the high dimensional space.
 Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
 Interpretability − The clustering results should be interpretable, comprehensible, and
usable.

Clustering Methods

Clustering methods can be classified into the following categories −


 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs ‘k’
partitions of the data. Each partition will represent a cluster and k ≤ n. This means that it will
classify the data into k groups, which satisfy the following requirements −
 Each group contains at least one object.
 Each object must belong to exactly one group.

Points to remember −
 For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
 Then it uses the iterative relocation technique to improve the partitioning by moving
objects from one group to another.
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can
classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
There are two approaches here −
 Agglomerative Approach
 Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to one
another, and it keeps doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects
in the same cluster. In each successive iteration, a cluster is split up into smaller clusters. This is
done until each object is in its own cluster or the termination condition holds. This method is
rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical clustering −
 Perform careful analysis of object linkages at each hierarchical partitioning.
 Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm
to group objects into micro-clusters, and then performing macro-clustering on the
micro-clusters.
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given
cluster as long as the density in the neighborhood exceeds some threshold, i.e., for each data
point within a given cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.

Grid-based Method
In this method, the objects together form a grid: the object space is quantized into a finite
number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.

 It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given
model. This method locates the clusters by clustering the density function. It reflects spatial
distribution of the data points.
This method also provides a way to automatically determine the number of clusters based on
standard statistics, taking outlier or noise into account. It therefore yields robust clustering
methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of desired
clustering results. Constraints provide us with an interactive way of communication with the
clustering process. Constraints can be specified by the user or the application requirement.

Cobweb:

COBWEB is an incremental system for hierarchical conceptual clustering. COBWEB was
invented by Professor Douglas H. Fisher, currently at Vanderbilt University.
COBWEB incrementally organizes observations into a classification tree. Each node in a
classification tree represents a class (concept) and is labeled by a probabilistic concept that
summarizes the attribute-value distributions of objects classified under the node. This
classification tree can be used to predict missing attributes or the class of a new object.
There are four basic operations COBWEB employs in building the classification tree. Which
operation is selected depends on the category utility of the classification achieved by applying
it. The operations are:

 Merging Two Nodes


Merging two nodes means replacing them by a node whose children are the union of the
original nodes' sets of children and which summarizes the attribute-value distributions of
all objects classified under them.
 Splitting a node
A node is split by replacing it with its children.

 Inserting a new node


A node is created corresponding to the object being inserted into the tree.
 Passing an object down the hierarchy
Effectively calling the COBWEB algorithm on the object and the subtree rooted in the node.

The COBWEB Algorithm

COBWEB(root, record):
Input: A COBWEB node root, an instance to insert record
if root has no children then
    children := {copy(root)}
    newcategory(record)    \\ adds child with record’s feature values
    insert(record, root)   \\ update root’s statistics
else
    insert(record, root)
    for child in root’s children do
        calculate Category Utility for insert(record, child)
        set best1, best2 := children with best CU
    end for
    if newcategory(record) yields best CU then
        newcategory(record)
    else if merge(best1, best2) yields best CU then
        merge(best1, best2)
        COBWEB(root, record)
    else if split(best1) yields best CU then
        split(best1)
        COBWEB(root, record)
    else
        COBWEB(best1, record)
    end if
end

K-Means Clustering Algorithm:


K-Means clustering is an unsupervised learning algorithm that is used to solve clustering
problems in machine learning or data science. In this topic, we will learn what the K-means
clustering algorithm is and how it works, along with a Python implementation of k-means
clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled


dataset into different clusters. Here K defines the number of pre-defined clusters that need to
be created in the process, as if K=2, there will be two clusters, and for K=3, there will be three
clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group whose members have similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover
the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k should be predetermined in
this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids (they need not be taken from the input dataset).

Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each
cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
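A minimal sketch of these steps using the KMeans class from scikit-learn (assumed to be installed; the toy 2-D points and K=3 are illustrative values only) is shown below:

# K-Means sketch: the library performs the assign/update loop of Steps 3-6 internally.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0],
              [5, 9], [6, 8], [5, 8]])        # unlabeled data points

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)   # Steps 1-2: choose K, pick centroids
labels = kmeans.fit_predict(X)                              # Steps 3-6: iterate until assignments stop changing

print("Cluster labels :", labels)
print("Final centroids:\n", kmeans.cluster_centers_)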

Hierarchical Clustering in Data Mining:


A Hierarchical clustering method works via grouping data into a tree of clusters. Hierarchical
clustering begins by treating every data point as a separate cluster. Then, it repeatedly
executes the subsequent steps:
1. Identify the two clusters that are closest together, and
2. Merge the two most comparable clusters. These steps are repeated until all the clusters are
merged together.

In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A
diagram called a dendrogram (a tree-like diagram that records the sequences of merges or
splits) graphically represents this hierarchy; it is an inverted tree that describes the order in
which points are merged (bottom-up view) or clusters are broken up (top-down view).
The basic method to generate hierarchical clustering is
1. Agglomerative: Initially consider every data point as an individual cluster and, at every
step, merge the nearest pair of clusters (it is a bottom-up method). At first, every data
point is considered as an individual entity or cluster. At every iteration, clusters merge
with other clusters until one cluster is formed.
The algorithm for Agglomerative Hierarchical Clustering is:
 Consider every data point as an individual cluster.
 Calculate the similarity of each cluster with all the other clusters (compute the proximity
matrix).
 Merge the clusters which are highly similar or close to each other.
 Recalculate the proximity matrix for the new clusters.
 Repeat steps 3 and 4 until only a single cluster remains.
Let’s see the graphical representation of this algorithm using a dendrogram.
Note: This is just a demonstration of how the actual algorithm works; no calculation has been
performed below, and all the proximities among the clusters are assumed.
Let’s say we have six data points A, B, C, D, E, and F.

Figure – Agglomerative Hierarchical clustering


 Step-1: Consider each alphabet as a single cluster and calculate the distance of one
cluster from all the other clusters.
 Step-2: In the second step, comparable clusters are merged together to form a single
cluster. Let’s say cluster (B) and cluster (C) are very similar to each other, so we merge
them in this step; similarly, clusters (D) and (E) are merged, and at last we get the
clusters [(A), (BC), (DE), (F)]
 Step-3: We recalculate the proximity according to the algorithm and merge the two
nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
 Step-4: Repeating the same process, the clusters DEF and BC are comparable and are merged
together to form a new cluster. We are now left with clusters [(A), (BCDEF)].
 Step-5: At last the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
2. Divisive:
We can say that the Divisive Hierarchical clustering is precisely the opposite of the
Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we take into
account all of the data points as a single cluster and in every iteration, we separate the data
points from the clusters which aren’t comparable. In the end, we are left with N clusters.

Figure – Divisive Hierarchical clustering
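The agglomerative (bottom-up) procedure described above can be sketched with SciPy's hierarchical clustering routines (assumed to be installed); the six 2-D points below stand in for A–F, and their coordinates are invented for demonstration:

# Agglomerative clustering sketch: linkage() repeatedly merges the two closest clusters,
# which corresponds to the proximity-matrix loop described above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0],   # A
                   [1.0, 1.0],   # B
                   [1.1, 1.0],   # C
                   [5.0, 5.0],   # D
                   [5.1, 5.0],   # E
                   [9.0, 9.0]])  # F

merges = linkage(points, method="single")             # records every merge (the dendrogram data)
print(fcluster(merges, t=2.0, criterion="distance"))  # cut the tree to obtain flat clusters

Plotting `merges` with scipy.cluster.hierarchy.dendrogram would show the same bottom-up merge order as the dendrogram discussed above.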

Data mining techniques to create a comprehensive and accurate model of data:

Data mining refers to extracting or mining knowledge from large amounts of data. In other
words, data mining is the science, art, and technology of exploring large and complex
bodies of data in order to discover useful patterns. Theoreticians and practitioners are
continually seeking improved techniques to make the process more efficient, cost-effective,
and accurate. Many other terms carry a similar or slightly different meaning to data mining,
such as knowledge mining from data, knowledge extraction, data/pattern analysis, and data
dredging.
Some treat data mining as a synonym for another popularly used term, Knowledge Discovery
from Data, or KDD, while others view data mining as simply an essential step in the process of
knowledge discovery, in which intelligent methods are applied in order to extract data
patterns.
Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” in 1989.
However, the term ‘data mining’ became more popular in the business and press
communities. Currently, Data Mining and Knowledge Discovery are used interchangeably.
Nowadays, data mining is used in almost all places where a large amount of data is stored
and processed.

Knowledge Discovery From Data Consists of the Following Steps:


 Data cleaning (to remove noise or irrelevant data).
 Data integration (where multiple data sources may be combined).
 Data selection (where data relevant to the analysis task are retrieved from the database).
 Data transformation (where data are transformed or consolidated into forms appropriate
for mining, for example by performing summary or aggregation functions).
 Data mining (an important process where intelligent methods are applied in order to
extract data patterns).
 Pattern evaluation (to identify the fascinating patterns representing knowledge based on
some interestingness measures).
 Knowledge presentation (where knowledge representation and visualization techniques
are used to present the mined knowledge to the user).
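The cleaning and transformation steps can be sketched with the pandas library (assumed to be installed); the file name and column names below are hypothetical and serve only to illustrate the flow:

# KDD preprocessing sketch: clean the raw data, then aggregate it into a mining-ready form.
import pandas as pd

df = pd.read_csv("sales.csv")                              # hypothetical source file

df = df.drop_duplicates()                                  # data cleaning: remove duplicate records
df["amount"] = df["amount"].fillna(df["amount"].mean())    # data cleaning: fill missing values

monthly = df.groupby("month")["amount"].sum()              # data transformation: summary/aggregation
print(monthly)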
We now discuss different types of data mining techniques which are used to predict the
desired output.
Text Mining in Data Mining:
The rapid increase in computerized or digital information has produced an enormous volume
of data. A substantial portion of the available information is stored in text
databases, which consist of large collections of documents from various sources. Text
databases are rapidly growing due to the increasing amount of information available in
electronic form. In excess of 80% of present-day information is in the form of unstructured or
semi-structured data. Traditional information retrieval techniques become inadequate for the
increasingly vast amount of text data. Thus, text mining has become an increasingly popular
and essential part of Data Mining. The discovery of proper patterns and analyzing the text
document from the huge volume of data is a major issue in real-world application areas.
“Extraction of interesting information or patterns from data in large databases is known as
data mining.”

Text mining is a process of extracting useful information and nontrivial patterns from a large
volume of text databases. There exist various strategies and tools to mine the text and
find important data for the prediction and decision-making process. The selection of the
right and accurate text mining procedure helps to improve the speed and reduce the time
complexity. This section briefly discusses and analyzes text mining and its applications in
diverse fields.
“Text Mining is the procedure of synthesizing information, by analyzing relations, patterns,
and rules among textual data.”
As we discussed above, the size of information is expanding at exponential rates. Today all
institutes, companies, organizations, and business ventures store their information
electronically. A huge collection of data is available on the internet and stored in digital
libraries, database repositories, and other textual sources such as websites, blogs, social
media networks, and e-mails. It is a difficult task to determine appropriate patterns and
trends to extract knowledge from this large volume of data. Text mining is a part of data
mining that extracts valuable text information from a text database repository. Text mining is
a multi-disciplinary field based on information retrieval, data mining, AI, statistics, machine
learning, and computational linguistics.
The conventional process of text mining is as follows:
 Gathering unstructured information from various sources accessible in various document
formats, for example, plain text, web pages, PDF records, etc.
 Pre-processing and data cleansing tasks are performed to identify and eliminate
inconsistencies in the data. The data cleansing process makes sure the genuine text is
captured; it typically involves eliminating stop words and stemming (the process of
identifying the root of a word) before indexing the data.
 Processing and controlling tasks are applied to review and further clean the data set.
 Pattern analysis is implemented in Management Information System.
 Information processed in the above steps is utilized to extract important and applicable
data for a powerful and convenient decision-making process and trend analysis.

Procedures of analyzing Text Mining:


 Text Summarization: To automatically extract partial content that reflects the whole
content of a document.
 Text Categorization: To assign a category to the text among categories predefined by
users.
 Text Clustering: To segment texts into several clusters, depending on the substantial
relevance.

Text Mining Techniques:

Text Classification Algorithms are at the heart of many software systems that process large
amounts of text data. Text Classification is used by email software to determine whether
incoming mail is sent to the inbox or filtered into the spam folder. Text classification is used in
discussion forums to determine whether comments should be flagged as inappropriate.

These are two examples of Topic Classification, in which a text document is classified into one
of a predefined set of topics. Many topic classification problems rely heavily on textual
keywords for categorization.


Sentiment Analysis is another common type of text classification, with the goal of determining
the polarity of text content: the type of opinion it expresses. This can be expressed as a binary
like/dislike rating or as a more granular set of options, such as a star rating from 1 to 5.

Sentiment Analysis can be used to determine whether or not people liked the Black Panther
movie by analyzing Twitter posts or extrapolating the general public’s opinion of a new brand
of Nike shoes based on Wal-Mart reviews.

Data Mining:

Data Mining is the process of analyzing data in order to uncover patterns, correlations, and
anomalies in large datasets. These datasets contain information from employee databases,
financial information, vendor lists, client databases, network traffic, and customer accounts,
among other things. Statistics, Machine Learning (ML), and Artificial Intelligence (AI) can be
used to explore large datasets manually or automatically.

The Data Mining process begins with determining the business goal that will be achieved using
the data. Data is then collected from various sources and loaded into Data Warehouses, which
act as a repository for analytical data. Data is also cleansed, which includes the addition of
missing data and the removal of duplicate data. Sophisticated tools and mathematical models
are used to find patterns in data.

Key Features of Data Mining

These are the characteristics of Data Mining:

 Probable Outcome Prediction



 Focuses on Large Datasets and Databases


 Automatic Pattern Predictions are made based on Behavior Analysis
 To compute a feature from other features, any SQL expression can be used

How does Text Classification in Data Mining Work?

The process of categorizing text into organized groups is known as text classification, also
known as text tagging or text categorization. Text Classification in Data Mining can
automatically analyze text and assign a set of pre-defined tags or categories based on its
content using Natural Language Processing (NLP).

Text Classification in Data Mining is becoming an increasingly important part of the business
because it enables easy data insights and the automation of business processes.

The following are some of the most common examples and use cases in Text Classification in
Data Mining for Automatic Text Classification:

 Sentiment Analysis for determining whether a given text is speaking positively or


negatively about a particular subject (e.g. for brand monitoring purposes).
 The task of determining the theme or topic of a piece of text is known as topic detection
(e.g. knowing if a product review is about Ease of Use, Customer Support, or Pricing
when analyzing customer feedback).
 Language detection refers to the process of determining the language of a given text
(e.g. knowing if an incoming support ticket is written in English or Spanish for
automatically routing tickets to the appropriate team).

Here is the Text Classification in Data Mining workflow:

 Step 1: Collect Information


 Step 2: Investigate Your Data
 Step 3: Gather Your Data
 Step 4: Create, Train, and Test Your Model
 Step 5: Fine-tune the Hyperparameters
 Step 6: Put Your Model to Work
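A minimal sketch of this workflow, using a TF-IDF representation and a Naive Bayes classifier from scikit-learn (assumed to be installed; the tiny corpus and its spam/ham labels are invented), is given below:

# Text classification sketch: collect labeled texts, extract features, train, then predict.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at 10 am tomorrow",
         "free lottery ticket inside", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]            # pre-defined categories (Steps 1-3)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)                           # Step 4: create and train the model

print(model.predict(["claim your free prize"]))    # Step 6: put the model to work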

Benefits of Text Classification in Data Mining

Here are some benefits of Text Mining Approaches in Data Mining:

 Text Classification in Data Mining provides an accurate representation of the language


and how meaningful words are used in context.

 Because Text Classification in Data Mining can work at a higher level of abstraction, it
makes it easier to write simpler rules.
 Text Classification in Data Mining uses the fundamental features of semantic technology
to understand the meaning of words in context. Because semantic technology allows
words to be understood in their proper context, this provides superior precision and
recall.
 Documents that do not “fit” into a specific category are identified and automatically
separated once the system is deployed, and the system administrator can fully
understand why they were not classified.

Web Mining:
Web mining is the process of applying data mining techniques to automatically discover and
extract information from Web documents and services. The main purpose of web mining is
to discover useful information from the World Wide Web and its usage patterns.
Applications of Web Mining:
1. Web mining helps to improve the power of web search engines by classifying web
documents and identifying web pages.
2. It is used for web searching (e.g., Google, Yahoo) and vertical searching (e.g., FatLens,
Become).
3. Web mining is used to predict user behavior.
4. Web mining is very useful for a particular website and e-service, e.g., landing page
optimization.
Web mining can be broadly divided into three different types of techniques of mining: Web
Content Mining, Web Structure Mining, and Web Usage Mining. These are explained as
following below.
1. Web Content Mining: Web content mining is the application of extracting useful
information from the content of the web documents. Web content consist of several
types of data – text, image, audio, video, etc. Content data is the collection of facts that a web
page is designed to convey. It can provide effective and interesting patterns about user needs. Text
documents are related to text mining, machine learning and natural language processing.
This mining is also known as text mining. This type of mining performs scanning and
mining of the text, images and groups of web pages according to the content of the input.
2. Web Structure Mining: Web structure mining is the application of discovering structure
information from the web. The structure of the web graph consists of web pages as
nodes, and hyperlinks as edges connecting related pages. Structure mining basically
shows the structured summary of a particular website. It identifies relationship between
web pages linked by information or direct link connection. To determine the connection
between two commercial websites, Web structure mining can be very useful.
3. Web Usage Mining: Web usage mining is the application of identifying or discovering
interesting usage patterns from large data sets. These patterns enable you to
understand user behavior. In web usage mining, users' access data on the web is
collected in the form of logs, so web usage mining is also called log mining.
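As a small illustration of log mining, the sketch below counts page requests in an Apache-style access log; the file name and log layout are assumptions made only for this example:

# Web usage mining sketch: tally which pages users request most often.
from collections import Counter

page_hits = Counter()
with open("access.log") as log:                    # hypothetical access log
    for line in log:
        parts = line.split('"')                    # the request appears as "GET /page HTTP/1.1"
        if len(parts) > 1:
            request = parts[1].split()
            if len(request) >= 2:
                page_hits[request[1]] += 1         # count the requested path

print(page_hits.most_common(5))                    # the most visited pages hint at usage patterns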

Comparison Between Data mining and Web mining:


 Definition: Data mining is the process that attempts to discover patterns and hidden
knowledge in large data sets in any system, whereas web mining is the process of applying
data mining techniques to automatically discover and extract information from web
documents.
 Application: Data mining is very useful for web page analysis, whereas web mining is very
useful for a particular website and e-service.
 Target Users: Data mining is used by data scientists and data engineers, whereas web
mining is used by data scientists along with data analysts.
 Access: Data mining accesses data privately, whereas web mining accesses data publicly.
 Structure: In data mining, the information is obtained from an explicit structure, whereas
in web mining the information is obtained from structured, unstructured and
semi-structured web pages.
 Problem Type: Data mining problems include clustering, classification, regression,
prediction, optimization and control, whereas web mining problems include web content
mining and web structure mining.
 Tools: Data mining includes tools like machine learning algorithms, whereas special tools
for web mining are Scrapy, PageRank and Apache logs.
 Skills: Data mining requires skills in data cleansing, machine learning algorithms, and
statistics and probability, whereas web mining requires application-level knowledge and
data engineering with mathematical modules like statistics and probability.

Data mining software:

How Data Mining Works

Data mining involves exploring and analyzing large blocks of information to glean meaningful
patterns and trends. It can be used in a variety of ways, such as database marketing, credit
risk management, fraud detection, spam Email filtering, or even to discern the sentiment or
opinion of users.

The data mining process breaks down into five steps. First, organizations collect data and load
it into their data warehouses. Next, they store and manage the data, either on in-house
servers or the cloud. Business analysts, management teams, and information technology
professionals access the data and determine how they want to organize it. Then, application
software sorts the data based on the user's results, and finally, the end-user presents the data
in an easy-to-share format, such as a graph or table.

Data Warehousing and Mining Software

Data mining programs analyze relationships and patterns in data based on what users
request. For example, a company can use data mining software to create classes of
information. To illustrate, imagine a restaurant wants to use data mining to determine when
it should offer certain specials. It looks at the information it has collected and creates classes
based on when customers visit and what they order.

In other cases, data miners find clusters of information based on logical relationships or look
at associations and sequential patterns to draw conclusions about trends in consumer
behavior.

Warehousing is an important aspect of data mining. Warehousing is when companies


centralize their data into one database or program. With a data warehouse, an organization
may spin off segments of the data for specific users to analyze and use. However, in other
cases, analysts may start with the data they want and create a data warehouse based on
those specs.

Data Mining Techniques

Data mining uses algorithms and various techniques to convert large collections of data into
useful output. The most popular types of data mining techniques include:

 Association rule mining, also referred to as market basket analysis, searches for relationships
between variables. This relationship in itself creates additional value within the data set
as it strives to link pieces of data. For example, association rules would search a

company's sales history to see which products are most commonly purchased together;
with this information, stores can plan, promote, and forecast accordingly.
 Classification uses predefined classes to assign to objects. These classes describe
characteristics of items or represent what the data points have in common with each other.
This data mining technique allows the underlying data to be more neatly categorized
and summarized across similar features or product lines.
 Clustering is similar to classification. However, clustering identifies similarities between
objects, then groups those items based on what makes them different from other
items. While classification may result in groups such as "shampoo", "conditioner",
"soap", and "toothpaste", clustering may identify groups such as "hair care" and
"dental health".
 Decision trees are used to classify or predict an outcome based on a set list of criteria
or decisions. A decision tree is used to ask for input of a series of cascading questions
that sort the dataset based on responses given. Sometimes depicted as a tree-like
visual, a decision tree allows for specific direction and user input when drilling deeper
into the data.
 K-Nearest Neighbor (KNN) is an algorithm that classifies data based on its proximity to
other data. The basis for KNN is rooted in the assumption that data points that are
close to each other are more similar to each other than other bits of data. This non-
parametric, supervised technique is used to predict features of a group based on
individual data points.
 Neural networks process data through the use of nodes. Each node is comprised of
inputs, weights, and an output. Data is mapped through supervised learning (similar to
how the human brain is interconnected), and threshold values can be fitted to assess
the model's accuracy.
 Predictive analysis strives to leverage historical information to build graphical or
mathematical models to forecast future outcomes. Overlapping with regression
analysis, this data mining technique aims at predicting an unknown future figure based
on current data on hand.
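To make one of these techniques concrete, here is a minimal K-Nearest Neighbor sketch with scikit-learn (assumed to be installed); the feature vectors and class labels are toy values chosen only for illustration:

# KNN sketch: a new point is labeled by a majority vote of its k closest neighbors.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [7.5, 9.5]]   # known data points
y = ["group A", "group A", "group B", "group B"]       # their known classes

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[7.8, 9.2]]))                       # nearest neighbors vote: expected ['group B']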

Applications of Data Mining

In today's age of information, it seems like almost every department, industry, sector, and
company can make use of data mining. Data mining is a broad and flexible process that has many
different applications as long as there is a body of data to analyze.

Sales
The ultimate goal of a company is to make money, and data mining encourages smarter,
more efficient use of capital to drive revenue growth. Consider the point-of-sale register at
your favorite local coffee shop. For every sale, that coffeehouse collects the time a purchase

was made, what products were sold together, and what baked goods are most popular. Using
this information, the shop can strategically craft its product line.

Marketing
Once the coffeehouse above knows its ideal line-up, it's time to implement the changes.
However, to make its marketing efforts more effective, the store can use data mining to
understand where its clients see ads, what demographics to target, where to place digital ads,
and what marketing strategies most resonate with customers. This includes
aligning marketing campaigns, promotional offers, cross-sell offers, and programs to findings
of data mining.

Manufacturing
For companies that produce their own goods, data mining plays an integral part in analyzing
how much each raw material costs, what materials are being used most efficiently, how time
is spent along the manufacturing process, and what bottlenecks negatively impact the
process. Data mining helps ensure the flow of goods is uninterrupted and least costly.

Fraud Detection
The heart of data mining is finding patterns, trends, and correlations that link data points
together. Therefore, a company can use data mining to identify outliers or correlations that
should not exist. For example, a company may analyze its cash flow and find a recurring
transaction to an unknown account. If this is unexpected, the company may wish to
investigate whether funds are being mismanaged.

Human Resources
Human resources often has a wide range of data available for processing including data on
retention, promotions, salary ranges, company benefits and utilization of those benefits, and
employee satisfaction surveys. Data mining can correlate this data to get a better
understanding of why employees leave and what entices recruits to join.

Customer Service
Customer satisfaction may be caused (or destroyed) for a variety of reasons. Imagine a
company that ships goods. A customer may become unhappy with ship time, shipping quality,
or communication on shipment expectations. That same customer may become frustrated
with long telephone wait times or slow e-mail responses. Data mining gathers operational
information about customer interactions and summarizes findings to determine weak points
as well as highlights of what the company is doing right.

Limitations of Data Mining:

The complexity of data mining is one of its largest disadvantages. Data analytics often
requires technical skill sets and specialized software tools, and some smaller companies may
find this barrier to entry too difficult to overcome.

Data mining doesn't always guarantee results. A company may perform statistical analysis,
make conclusions based on strong data, implement changes, and not reap any benefits.
Through inaccurate findings, market changes, model errors, or inappropriate data
populations, data mining can only guide decisions and not ensure outcomes.

There is also a cost component to data mining. Data tools may require ongoing costly
subscriptions, and some bits of data may be expensive to obtain. Security and privacy
concerns can be addressed, though the additional IT infrastructure required may be costly as well. Data
mining may also be most effective when using huge data sets; however, these data sets must
be stored and require heavy computational power to analyze.
