ISBN : 978-81-941546-7-8
Price : ₹ 245/-
University of Mumbai

Machine Learning
(Course Code : CSDLO6021)
Department Level Optional Course - II

Tech-Neo Publications — Where Authors Inspire
M6-14A
A Sachin Shah Venture
Machine Learning
(Course Code : CSDLO6021) (For the Semester 6 / Computer — Rev. Syllabus 2018-2019)

Authors : Dr. Vaikole Shubhangi Liladhar, Dr. Sawarkar Sudhirkumar Deoraoji

First Edition for Rev Syllabus : February 2021
Tech-Neo ID : M6-14

Copyright © by Authors. All rights reserved. No part of this publication may be reproduced, copied, or stored in a retrieval system, distributed or transmitted in any form or by any means, including photocopy, recording, or other electronic or mechanical methods, without the prior written permission of the Publisher.

This book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold, hired out, or otherwise circulated without the publisher's prior written consent, in any form of binding or cover other than that in which it is published, and without a similar condition including this condition being imposed on the subsequent purchaser, and without limiting the rights under copyright reserved above.

Published by : Mr. Sachin S. Shah, Managing Director, B.E. (Industrial Electronics), An Alumnus of IIM Ahmedabad

Address : Tech-Neo Publications LLP, Sr. No. 38/1, Behind Pari Company, Khedekar Industrial Estate, Narhe, Pune-411041, Maharashtra.
Website : www.techneobooks.com

Printed at : …

About Managing Director... Mr. Sachin Shah

Over 25 years of experience in Academic Publishing...
With over two and a half decades of experience in bringing out more than 1200 titles for Engineering, Polytechnic, Pharmacy, Sciences and Information Technology, Sachin Shah is a name synonymous with quality and innovative content.

A driven Educationalist...
1. A B.E. in Industrial Electronics (1992 Batch) from Bharati Vidyapeeth College of Engineering, affiliated to the University of Pune.
2. An Alumnus of IIM Ahmedabad.
3. A Co-Author of a bestselling book on "Engineering Mathematics" for Polytechnic Students of Maharashtra State.
4. Sachin has, for over a decade, been working as a Consultant for Higher Education in the USA and several other countries.

With a path-breaking career...
With a publishing career that started with handwritten cyclostyled notes back in 1992, Sachin Shah has to his credit the setting up and expansion of one of the leading companies in higher education publishing. An energetic, creative and resourceful professional, Sachin Shah's extensive experience of closely working with the most eminent authors of the publishing industry ensures high standards of quality content. This ability has helped students attain better understanding and in-depth knowledge of the subject.
Dedicated to
- Authors
Preface

We are glad to present the New Edition of this book titled "Machine Learning". This book covers the revised syllabus of the Third Year Computer Engineering (Semester 6) course of Mumbai University, which has been effective since the academic year 2018-19.

We have divided the subject into small chapters so that the topics can be arranged and understood properly. The topics within the chapters have been arranged in a proper sequence to ensure a smooth flow of the subject.

If you find any errors, please let us know, because that will help us to improve further.

We are also thankful to our family members and friends for their patience and encouragement.

- Authors
Syllabus...

Course Code : CSDLO6021 | Course Name : Machine Learning | Credits : 4

Prerequisites : Data Structures, Basic Pro…

Module 1 : Introduction to Machine Learning (6 Hrs.)
Types of Machine Learning, Issues in Machine Learning, Application of Machine Learning, Steps in developing a Machine Learning Application. (Refer chapter 1)

Module 2 : Introduction to Neural Network (8 Hrs.)
Introduction - Fundamental concept - Evolution of Neural Networks - Biological Neuron, Artificial Neural Networks, NN architecture, Activation functions, McCulloch-Pitts Model. (Refer chapter 2)

Module 3 : Introduction to Optimization Techniques (6 Hrs.)
Derivative based optimization - Steepest Descent, Newton method. Derivative free optimization - Random Search, Down Hill Simplex. (Refer chapter 3)

Module 4 : Learning with Regression and Trees (10 Hrs.)
Learning with Regression : Linear Regression, Logistic Regression.
Learning with Trees : Decision Trees, Constructing Decision Trees using Gini Index, Classification and Regression Trees (CART). (Refer chapter 4)

Module 5 : Learning with Classification and Clustering (14 Hrs.)
5.1 Classification : Rule based classification, classification by Bayesian Belief networks, Hidden Markov Models.
Support Vector Machine : Maximum Margin Linear Separators, Quadratic Programming solution to finding maximum margin separators, Kernels for learning non-linear functions.
5.2 Clustering : Expectation Maximization Algorithm, … clustering, Radial Basis functions. (Refer chapter 5)

Module 6 : Dimensionality Reduction
Dimensionality Reduction Techniques, Principal Component Analysis, Independent Component Analysis, Singular Value Decomposition. (Refer chapter 6)
Chapter 1

Introduction to Machine Learning

University Prescribed Syllabus : Types of Machine Learning, Issues in Machine Learning, Application of Machine Learning, Steps in developing a Machine Learning Application.
—  Machine learning focuses on the study and development of algorithms that can learn from data and also make predictions on data.
—  Machine learning is defined by Tom Mitchell as : "A program learns from experience 'E' with respect to some class of tasks 'T' and performance measure 'P', if its performance on tasks in 'T', as measured by 'P', improves with 'E'."
—  'E' represents the past experienced data and 'T' represents the tasks such as prediction, classification, etc. As an example of 'P', we might want to increase accuracy in prediction.
Machine learning = Take data + understand it + process it + extract value from it + visualize it + communicate it
—  Let's assume the training dataset contains 14 training records as in Table 1.2.1. Suppose each training record has four features and one target or response variable, as shown in Fig. 1.2.1. The machine learning algorithm is used to predict the target variable.
—  In a classification task the target variable takes a discrete value, and in the task of regression its value could be continuous.
—  In a training dataset we have the value of the target variable. The relationship that exists between the features and the target variable is used by the machine for learning. The target variable is the evaluation of the car. Classes are the target variables in the classification task. In classification systems it is assumed that the classes are of limited number.
—  Attributes or features are the individual values that, when combined with other features, make up a training example. These are usually the columns in a training or test set.
—  A training dataset and a testing dataset are used to test machine learning algorithms. First the training dataset is given as input to the program. The program uses this data to learn. Next, the test set is given to the program. The program decides which instance of test data belongs to which class. The predicted output is compared with the actual output, and we can get an idea about the accuracy of the algorithm. There are better ways to use all the information in the training dataset and test dataset.
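The workflow just described — learn from a training set, then judge accuracy on a held-out test set — can be sketched in a few lines. The following is a minimal illustration using scikit-learn and one of its bundled datasets (our own choice of library and data; the book does not prescribe any particular tool).

# Minimal sketch of the train/test workflow described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # features and target variable

# Hold out part of the data so the test set is never seen during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # the program learns from training data
y_pred = model.predict(X_test)                           # predicted class for each test instance

# Comparing predictions with the actual outputs gives an idea of the accuracy.
print("Accuracy :", accuracy_score(y_test, y_pred))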
—  Assume that in the car evaluation system we have tested the program and it meets the desired level of accuracy. Knowledge representation is used to check what the machine has learned. There are many ways in which knowledge can be represented.
—  We can use a set of rules or a probability distribution to represent the knowledge.
Fig. 1.2.1 : Features and target variable identified

Some of the main types of machine learning are :
3.  Reinforcement learning

—  In reinforcement learning you have an agent who is acting in an environment, and you want to find out what action the agent must take based on the reward or penalty that the agent gets. In this, an agent (e.g., a robot or controller) seeks to learn the optimal actions to take based on the outcomes of past actions.

Fig. 1.3.3 : Reinforcement Learning (the agent takes an action in the environment and receives the next state and a reward)
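The reward-driven update at the heart of reinforcement learning can be illustrated with tabular Q-learning. The toy environment below (its states, actions and rewards) is invented purely for illustration and is not taken from the book.

# Minimal sketch of reinforcement learning via tabular Q-learning.
import random

states, actions = range(4), range(2)
Q = {(s, a): 0.0 for s in states for a in actions}   # value of action a in state s
alpha, gamma = 0.1, 0.9                              # learning rate, discount factor

def step(s, a):
    # Toy environment: action 1 moves the agent forward; state 3 is the goal.
    s_next = min(s + a, 3)
    reward = 1.0 if s_next == 3 else -0.1            # reward at goal, small penalty otherwise
    return s_next, reward

for episode in range(200):
    s = 0
    while s != 3:
        a = random.choice(actions)                   # explore
        s_next, r = step(s, a)
        # Correct the value estimate using the reward actually received.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        s = s_next

print({s: max(actions, key=lambda a: Q[(s, a)]) for s in states})   # learned action per state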
4.  Semi-supervised learning

—  It is a combination of supervised and unsupervised learning. In this there is some amount of labeled training data and also a large amount of unlabeled data, and you try to come up with some learning algorithm that can learn even when part of the training data is not labeled.
—  In a classification task, the aim is to predict the class of an instance of data. Another method in machine learning is regression.
—  Regression is the prediction of a numeric value. An example of regression is to draw a best-fit line which passes through some data points in order to generalize the data points.
—  Classification and regression are examples of supervised learning. These types of problems are called supervised because we are asking the algorithm what to predict.
—  The exact opposite of supervised learning is a task called unsupervised learning. In unsupervised learning, the target value or label is not given for the data.
—  A problem in which similar items are grouped together is called clustering. In unsupervised learning, we may also want to find statistical values that describe the data. This is called density estimation.
—  In dimensionality reduction we reduce the data from many attributes to a small number so that we can properly visualize it in two or three dimensions.
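Both unsupervised tasks mentioned above — grouping similar items and describing the data statistically — can be shown in a short sketch. The library and the synthetic data are illustrative choices, not prescribed by the book.

# Minimal sketch of clustering and a simple density description.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two unlabeled blobs of points; no target variable is given.
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
print("cluster sizes :", np.bincount(labels))

# Simple density description: statistics that summarize the data.
print("mean :", data.mean(axis=0), "std :", data.std(axis=0))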
1.4 ISSUES IN MACHINE LEARNING

—  Which algorithm do we have to select to learn general target functions from a specific training dataset? What should be the settings for a particular algorithm, so as to converge to the desired function, given sufficient training data? Which algorithms perform best for which types of problems and representations?
—  How much training data is sufficient? What should be the general amount of data that can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
—  At which time and in what manner is prior knowledge held by the learner used to guide the process of generalizing from examples? If we have approximately correct knowledge, will it be helpful even when it is only approximately correct?
—  What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
—  To reduce the task of learning to one or more function approximation problems, what will be the best approach? What specific functions should the system attempt to learn? Can this process itself be automated?
—  To improve the knowledge representation and to learn the target function, how can the learner automatically alter its representation?
1.5 HOW TO CHOOSE THE RIGHT ALGORITHM

With all the different algorithms available in machine learning, how can you select which one to use? First you need to focus on your goal : what are you trying to get out of this? What data do you have or can you collect? Secondly, you have to consider the data.

Goal : If you are trying to predict or forecast a target value, then you need to look into supervised learning. Otherwise, you have to use unsupervised learning.
(a) If you have chosen supervised learning, then next you need to focus on what your target value is.
—  If the target value is discrete (e.g. Yes/No, 1/2/3, A/B/C), then use Classification.
—  If the target value is continuous, i.e. a number of values (e.g. 0 to 100, −99 to 99), then use Regression.
(b) If you have chosen unsupervised learning, then next you need to focus on what your aim is. The whole decision procedure is sketched in code below.
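The selection procedure above can be written out as a small helper function. This is only a sketch of the decision flow just described, not a library API.

# The goal-and-data decision procedure as a tiny function.
def choose_method(has_target, target_is_discrete=None, want_groups=None):
    if has_target:                        # supervised learning
        if target_is_discrete:            # e.g. Yes/No, 1/2/3, A/B/C
            return "classification"
        return "regression"               # e.g. 0 to 100, -99 to 99
    # unsupervised learning
    if want_groups:
        return "clustering"
    return "density estimation"

print(choose_method(has_target=True, target_is_discrete=True))   # classification
print(choose_method(has_target=False, want_groups=True))         # clustering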
1.6 STEPS IN DEVELOPING A MACHINE LEARNING APPLICATION

1.  Collection of Data
—  You could collect the samples from a website by extracting data, from an RSS feed or from an API.
—  Once you have the input data, you need to check whether it is in a useable format or not.
—  Some algorithms can accept target variables and features as strings; some need them to be integers.
—  Some algorithms accept features in a special format.
—  Look at the data you have parsed in a text editor to check that the collection and preparation of input data steps are working properly and you don't have a bunch of empty values.
—  You can also look at the data to find out if you can see any patterns or if there is anything obvious, such as a few data points that greatly differ from the remaining set of the data.
5.  Train the algorithm
—  Good clean data from the first steps is given as input to the algorithm. The algorithm extracts information or knowledge. This knowledge is mostly stored in a format that is readily useable by the machine for the next two steps.
—  In case of unsupervised learning, the training step is not there because the target value is not present. The complete data is used in the next step.

6.  Test the algorithm
—  In this step the information learned in the previous step is used. When you are checking an algorithm, you will test it to find out whether it works properly or not. In the supervised case, you have some known values that can be used to evaluate the algorithm.
—  In case of unsupervised learning, you may have to use some other metrics to evaluate the success. In either case, if you are not satisfied, you can go back to step 4, change some things and test again.
—  Often the problem occurs in the collection or preparation of data, and you will have to go back to step 1.
7.  Use it
In this step a real program is developed to do some task, and once again it is checked that all the previous steps worked as you expected. You might encounter some new data and have to revisit steps 1-5.
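The whole sequence — collect, prepare, train, test, use — can be compressed into a short end-to-end sketch. scikit-learn and its bundled wine dataset are illustrative stand-ins for whatever source and algorithm a real application would use.

# Minimal end-to-end sketch of the steps listed above.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# 1. Collect data (a bundled dataset stands in for a website/RSS/API source).
X, y = load_wine(return_X_y=True)

# 2-4. Prepare and analyse the input data (scale features, check for empty values).
assert not (X != X).any(), "no empty (NaN) values expected"
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5. Train the algorithm on the clean data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 6. Test it; if the score is unsatisfactory, go back and change things.
print("test accuracy :", model.score(X_test, y_test))

# 7. Use it on new data as part of a real program.
print("prediction for the first test record :", model.predict(X_test[:1]))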
Fig. 1.6.1 : Typical example of a Machine Learning Application (training phase : a feature extractor turns raw input into features, which together with labels are fed to the machine learning algorithm; prediction phase : the feature extractor processes new input and the learned model predicts its label)

1.7 APPLICATIONS OF MACHINE LEARNING
1.  Learning Associations
—  An example of a retail application of machine learning in a supermarket chain is basket analysis, which is finding associations between products bought by customers.
—  If people who buy P typically also buy Q, and if there is a customer who buys Q but does not buy P, he or she is a potential P customer. Once we identify such customers, we can target them for cross-selling.
—  In finding an association rule, the product we would like to condition on is P — the product or products which we know the customer has already purchased.
P(Milk | Bread) = 0.7
—  It implies that 70% of customers who buy bread also buy milk.
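The conditional probability in such a rule can be estimated directly from transaction data. The tiny set of baskets below is made up for illustration.

# Estimating P(Milk | Bread) from a list of market baskets.
baskets = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"},
    {"milk"}, {"bread", "milk"}, {"bread", "butter"},
]

bread = [b for b in baskets if "bread" in b]
both = [b for b in bread if "milk" in b]

# Fraction of bread-buying customers who also buy milk.
print("P(Milk | Bread) =", len(both) / len(bread))   # 3/5 = 0.6 for this toy data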
2.  Classification
—  In credit scoring, the bank calculates the risk given the amount of credit and the information about the customer (income, savings, collaterals, profession, age, past financial history). The aim is to infer a general rule from this data, coding the association between a customer's attributes and his risk.
—  Other applications of classification are optical character recognition, face recognition, medical diagnosis, speech recognition and biometrics.
3.  Regression
—  An example of regression is predicting the price of a flat. Let's take as inputs the area of the flat, the location, the purchase year and other information that affects the rate of the flat. The output is the price of the flat, modelled as
Y = w1 * x + w0

Fig. 1.7.2 : Regression for prediction of the price of a flat
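Fitting the straight line Y = w1*x + w0 to data is a one-line least-squares problem. The (area, price) pairs below are invented for illustration.

# Minimal sketch of linear regression for flat prices.
import numpy as np

area = np.array([30.0, 45.0, 60.0, 75.0, 90.0])     # input, e.g. area in sq. m
price = np.array([25.0, 35.0, 48.0, 58.0, 70.0])    # output, e.g. price in lakhs

w1, w0 = np.polyfit(area, price, deg=1)             # least-squares fit of a line
print(f"Y = {w1:.2f} * x + {w0:.2f}")
print("predicted price for 50 sq. m :", w1 * 50 + w0)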
Reinforcement Learning
—  Another application of reinforcement learning is robot navigation. The robot can move in all possible directions at any point of time. The algorithm should reach the goal state from an initial state by learning the correct sequence of actions after conducting a number of trial runs.
—  When the system has unreliable and partial sensory information, it makes reinforcement learning complex. Let's take the example of a robot with incomplete camera information. Here the robot does not know its exact location.
May 2015
Q. 1  What are the issues in Machine Learning? (Ans. : Refer section 1.4) (5 Marks)

May 2016
Q. 2  What are the key tasks of Machine Learning? (Ans. : Refer sections 1.2 and 1.3) (5 Marks)
Q. 3  Explain the steps required for selecting the right machine learning algorithm. (Ans. : Refer section 1.5) (8 Marks)
Q. 4  Write a short note on : Machine learning applications. (Ans. : Refer section 1.7) (10 Marks)

May 2017
Q. 5  What is Machine Learning? Explain how supervised learning is different from unsupervised learning.
Explanation : Memory (previous experience), logic and inference (deduction) are required.

Q. 1.3  Which of these is not a type of machine learning?
(a) Supervised Learning (b) Unsupervised Learning (c) Reinforcement Learning (d) Semi-unsupervised Learning ✓ Ans. : (d)
Explanation : Supervised, unsupervised, reinforcement and hybrid are the types of learning.

Q. 1.4  Which of the following is not an application of learning?
(a) Data mining (b) World wide web (c) Speech recognition (d) Data manipulation ✓ Ans. : (d)
Explanation : Data manipulation is a data preprocessing task.

Q. 1.5  Fraud detection, image classification, diagnostics are applications in

In unsupervised learning :
(a) Specific output values are given (b) Specific output values are not given (c) No specific inputs are given (d) Both inputs and outputs are given ✓ Ans. : (b)
Explanation : In unsupervised learning, since a supervisor is not present, we do not have an idea about the expected output. The system learns only from the input.

Q. 1.9  Spam mail filtering is a
(a) Classification problem (b) Clustering problem (c) Classification and clustering problem (d) Time series problem ✓ Ans. : (a)
Explanation : In spam mail filtering, classification is done based on training examples. Training examples contain attributes of a mail and the output (spam or not).

Q. 1.10  Data used to optimize the parameter settings of a supervised learner model
Q. 1.11  In regression the output is
(a) Continuous (b) Discrete (c) May be discrete or continuous (d) Continuous and always lies in a finite range ✓ Ans. : (a)
Explanation : In regression the output is numerical.

Q. 1.12  Machine Learning is a branch of
(a) Natural Language Processing (b) Artificial Intelligence (c) Java (d) … ✓ Ans. : (b)
Explanation : ML is a category of AI.

Q. 1.13  A shop owner has a store that stocks a variety of fabrics. When fabric is brought to the store, various types of fabrics may be mixed together. The shop owner wants a model that will sort the fabric according to type. Which model will be efficient/accurate for this task?
(a) Machine learning model (b) Feature based classification technique (c) Computer vision (d) Fuzzy logic ✓ Ans. : (a)
Explanation : A machine learning model will be efficient since, once the model is trained, it can be used for future use.

Q. 1.14  To understand the role of machine learning in public health and safety and the cultural, societal, and environmental considerations in determining the non-functional requirements of products and processes, which type of learning can be used?
(a) Supervised Learning (b) Unsupervised Learning (c) Competitive Learning (d) Reinforcement Learning ✓ Ans. : (a)
Explanation : Supervised learning, as in this case we know the expected output.

Q. 1.15  Suppose you want to design a system for waste management. In this, first the garbage is collected and it is sent to the main server for analysis. The main server compares the categories of garbage and the appropriate disposal method is selected. For comparison of garbage you will use which machine learning method?
(a) Regression (b) Classification (c) Clustering (d) Dimensionality Reduction ✓ Ans. : (b)
Explanation : Classification — depending on the attributes the waste will be segregated, and also we know the output.

Q. 1.16  You are given reviews of a few movies marked as positive, negative or neutral. Classifying reviews of a new movie is an example of
(a) Supervised learning (b) Unsupervised learning (c) Semi-supervised learning (d) Reinforcement learning ✓ Ans. : (a)
Explanation : Supervised learning, as in this case we know the expected output.

Q. 1.17  Imagine a newly born starts to learn walking. It will try to find a suitable policy to learn walking after repeated falling and getting up. Specify what type of ML algorithm is best suited to do the same.
(a) Supervised (b) Unsupervised (c) Reinforcement (d) Semi-supervised ✓ Ans. : (c)
Explanation : In this case the child learns using the concept of punishment (falling down) and reward (walking properly), which are the characteristics of reinforcement learning.

Q. 1.18  An automated vehicle is an example of
(a) Supervised Learning (b) Unsupervised Learning (c) Active Learning (d) Reinforcement Learning ✓ Ans. : (a)
Explanation : Supervised learning, as in this case we know the expected output.

Q. 1.19  Real-time decisions, game AI, learning tasks are applications in
(a) Active Learning (b) Supervised Learning (c) Reinforcement Learning (d) Unsupervised Learning ✓ Ans. : (c)
Explanation : Reinforcement learning, since in these cases the system learns from punishment and reward.

Q. 1.20  In which of the following learning does the teacher return reward and punishment to the learner?
(a) Active Learning (b) Reinforcement Learning (c) Unsupervised Learning (d) Supervised Learning ✓ Ans. : (b)
Explanation : Definition and working of reinforcement learning.
Q. 1.21  Computational learning theory analyses the sample complexity and computational complexity of
(a) Unsupervised learning (b) Inductive learning (c) Forced based learning (d) Weak learning ✓ Ans. : (b)
Explanation : Since in this the system learns from observed examples.

Q. 1.22  Which of the following is an example of active learning?
(a) News recommender system (b) Dust cleaning machine (c) Automated vehicle (d) Speech recognition ✓ Ans. : (a)
Explanation : In news recommendation, items are presented to the user to learn more about their preferences — their likes or dislikes — which is where active learning comes in.

Q. 1.23  Which of the following is also called exploratory learning?
(a) Supervised Learning (b) Reinforcement Learning (c) Unsupervised Learning (d) Active Learning ✓ Ans. : (c)
Explanation : Since unsupervised learning identifies structure within the data.

Q. 1.24  Supervised learning and unsupervised learning both require at least one
(a) Hidden attribute (b) Output attribute (c) Input attribute (d) Categorical attribute ✓ Ans. : (a)
Explanation : Since the hidden attribute preserves the robustness of semantic attributes and inherits the discrimination ability of visual features.

Q. 1.25  What is the function of supervised learning?
(a) Grouping of data (b) Find centroid of data (c) Find relationship between input and output (d) Learn from punishment and reward ✓ Ans. : (c)
Explanation : In supervised learning the model is trained based on the relationship between input and output.

Q. 1.26  Which modifies the performance element so that it makes better decisions?
(a) Performance element (b) Changing element (c) Learning element (d) Hearing element ✓ Ans. : (c)
Explanation : The learning element learns from evidences so that performance can be improved.

Q. 1.27  Why is Google very successful in Machine Learning?
(a) They have more data than other companies (b) They have better algorithms (c) They have better training sets (d) They work on low level data features ✓ Ans. : (a)
Explanation : For better generalisation and accuracy, more training data is required.

Q. 1.28  Which of the following is not a Machine Learning discipline?
(a) Information theory (b) Neurostatistics (c) Optimization (d) Physics ✓ Ans. : (b)
Explanation : Game theory, control theory, operation research, information theory, optimization, swarm intelligence and genetic algorithms are disciplines of ML.

Q. 1.29  What are the three essential components of a learning system?
(a) Model, gradient descent, learning algorithm (b) Error function, model, learning algorithm (c) Accuracy, sensitivity, specificity (d) Model, error function, cost function ✓ Ans. : (b)
Explanation : The learning algorithm trains the model using the error function.

Q. 1.30  You are reviewing papers for the World's Fanciest Machine Learning Conference, and you see submissions with the following claims. Which ones would you consider accepting?
(a) My method achieves a training error lower than all previous methods.
(b) My method achieves a test error lower than all previous methods.
(c) My method achieves a test error lower than all previous methods when the regularization parameter is chosen so as to minimize cross-validation error.
(d) My method achieves a cross-validation error lower than all previous methods. ✓ Ans. : (c)
Explanation : A test error lower than all previous methods, when the regularization parameter is chosen so as to minimize cross-validation error.

Q. 1.31  What is true about Machine Learning?
(a) Machine Learning (ML) is a field of computer science
(b) ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method
(c) ML …
Q. 1.34 Which of the following is NOT supervised learning ? Q. 1.39 What's the main point of difference between human and
est
(a) PCA (b) Decision Tree machine intelligence?
ns
(c) Linear Regression (d) (a) human perceive everything as a
ou Naive Bayesian pattern while
machine perceive it merely as data
“Ans. : (a)
(b) human have emotions
all Explanation : Principal Component Analysis (PCA) is
(c) human have more IQ and inteMect
not predictive analysis tool. It is a data pre-processing
tool. It helps in picking out the most relevant linear (d) human have sense organs ¥ Ans. : (a)
all
combination of variables and use them in our predictive Explanation : Humans have emotions and thus
form
model.PCA is a technique different patterns on that basis, while a machine(
for reducing the say
ul computer) is dumb and everything is just a data for
dimensionality of large datasets, increasing him,
interpretability but at the same time minimizing Q. 1.40 Choose the options that are correct regarding machin
information loss. e
c. learning (ML) and artificial intelligence (AI),
Q.1.35 Which of the following is a good test dataset (a) ML is an alternate way of Programming intelligent
characteristic? machines.
(a) Large enough to yield meaningful results (b) All options are correct.
(b) Is representative of the dataset as a whole (c) ML isa set of techniques that turns a dataset into a
(c) Both (a) and (b) sof tware,
(d) None of the above (d) Al is a software that can emulate the human mind.
Vv Ans. : (ec)
Explanation ; For better result more records as well as Ans.: (b)
meaningful records are required. Explanation : Since all three options are correct.
Chapter Ends...
Chapter 2

Introduction to Neural Network

University Prescribed Syllabus : Introduction - Fundamental concept - Evolution of Neural Networks - Biological Neuron, Artificial Neural Networks, NN architecture, Activation functions, McCulloch-Pitts Model.

2.1 INTRODUCTION

—  An Artificial Neural Network (ANN) is inspired by the Biological Nervous System. The way in which a biological system such as the brain processes information is the manner in which an ANN also processes information.
—  An ANN is also called an information processing paradigm. It resembles the brain in two respects : a learning process is used to acquire knowledge from the environment by the network, and the acquired knowledge is stored using interneuron connection strengths known as synaptic weights.
—  The information processing system is composed of a large number of highly interconnected processing elements (neurons) which work together to solve specific problems.
—  In biological systems the learning process involves adjustments to the synaptic connections that exist between the neurons; learning is carried out in an ANN in the same manner.
—  An ANN can be used in many applications. Pattern extraction and detection of trends is a tedious process for humans and other computer techniques. Neural networks, with their remarkable ability to derive meaning from complicated or imprecise data, can be used to extract patterns and detect trends.
2.2 BIOLOGICAL NEURON

—  Each neuron has three main regions : the cell body or soma, the axon and the dendrites. The soma contains the nucleus and it processes the information. The axon is a long fiber that serves as a transmission line. The end part of the axon splits into a fine arborization that ends in small bulbs called synapses, almost touching the dendrites of the neighbouring neuron.

Fig. 2.2.1 : Biological Neuron Model

—  Dendrites accept the input from the neighbouring neuron through the axon. Dendrites look like a tree structure and receive signals from other neurons. The synapse is the electro-chemical contact between the organs. They do not physically touch because they are separated by a cleft; the signals are sent through chemical interaction. The neuron sending the signal is called the pre-synaptic cell and the neuron receiving the signal is called the post-synaptic cell.
—  The electrical signals that the neurons use to convey the information of the brain are all identical. The brain can determine which type of information is being received based on the path of the signal. The brain analyzes all patterns of signals sent, and from that information it interprets the type of information received.
—  There are different types of biological neurons. When the neurons are classified by the processes they carry out, they are classified as unipolar neurons, bipolar neurons and multipolar neurons.
—  Unipolar neurons have a single process; their dendrites and axon are located on the same stem. These neurons are found in invertebrates. Bipolar neurons have two processes; their dendrites and axon have two separated processes too. Multipolar neurons are commonly found in mammals; some examples of these neurons are spinal motor neurons, pyramidal cells and Purkinje cells.
—  When biological neurons are classified by function they fall into three categories. The first group is sensory neurons; these neurons provide all information for perception and motor coordination. The second group provides information to muscles and glands; these are called motor neurons. The last group, the interneurons, contains all other neurons and has two subclasses. One group, called relay or projection interneurons, are usually found in the brain and connect different parts of it. The other group, called local interneurons, are only used in local circuits.
2.3 BASIC ANN MODEL / McCULLOCH-PITTS MODEL

McCulloch and Pitts proposed a computational model that resembles the biological neuron model. These neurons were represented as models of biological networks in terms of conceptual components for circuits that could perform computational tasks. The basic model of the artificial neuron is founded upon the functionality of the biological neuron. An artificial neuron is a mathematical function that resembles the biological neuron.

Neuron Model
—  A neuron with a scalar input and no bias appears below. Fig. 2.3.1 shows a simple artificial neural net with n input neurons (X1, X2, ..., Xn) and one output neuron (Y). The interconnection weights are given by W1, W2, ..., Wn.
—  The scalar input X is transmitted through a connection that multiplies its strength by the scalar weight W to form the product W * X, again a scalar. The weighted input W * X is the only argument of the transfer function f, which produces the scalar output Y. The neuron may have a scalar bias, b. You can view the bias as simply being added to the product W * X. The bias is much like a weight, except that it has a constant input of 1.
—  The transfer function net input n, again a scalar, is the sum of the weighted input W * X and the bias b. This sum is the argument of the transfer function f. Here f is a transfer function, typically a step function or a sigmoid function, that takes the argument n and produces the output Y. Note that W and b are both adjustable scalar parameters of the neuron.
—  The central idea of neural networks is that such parameters can be adjusted so that the network exhibits some desired or interesting behaviour. Thus, you can train the network to do a particular job by adjusting the weight or bias parameters, or perhaps the network itself will adjust these parameters to achieve some desired end.
—  As previously noted, the bias b is an adjustable (scalar) parameter of the neuron. It is not an input. However, the constant 1 that drives the bias is an input and must be treated as such when you consider the linear dependence of input vectors in linear filters.
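The computation just described — a weighted input plus a bias, passed through a transfer function — takes only a few lines. The weights, bias and the choice of a hard-limit step function below are illustrative.

# Minimal sketch of the neuron model: Y = f(W * X + b).
import numpy as np

def neuron(X, W, b, f):
    net = np.dot(W, X) + b               # weighted input plus bias
    return f(net)

hardlim = lambda n: 1 if n >= 0 else 0   # step transfer function

X = np.array([1.0, -2.0, 0.5])           # inputs X1..Xn (values are illustrative)
W = np.array([0.4, 0.1, -0.3])           # weights W1..Wn
print(neuron(X, W, b=0.2, f=hardlim))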
Example 1 : Simulation of the NOT gate using the McCulloch-Pitts Model

The truth table of the NOT gate is as follows :

Input X | Output Y
1 | 0
0 | 1

For the first row (i.e. input 1) we may write the net value as W * X = W * 1 = W. According to the McCulloch-Pitts model, if the output is 0 then the net value must be less than the threshold : W < T.
For the second row (i.e. input 0) we may write the net value as W * X = W * 0 = 0. According to the McCulloch-Pitts model, if the output is 1 then the net value must be greater than or equal to the threshold : 0 ≥ T.
Now we have two conditions :
1. W < T    2. 0 ≥ T
Now select the values of W and T such that the above conditions get satisfied. One of the possible choices is T = −0.8, W = −1.
Example 2 : Simulation of the AND gate using the McCulloch-Pitts Model

Input X1 | Input X2 | Output Y
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1

We assume the weight W1 for X1 and W2 for X2.
For the first row, we may write the net value as (W1 * X1) + (W2 * X2) = (W1 * 0) + (W2 * 0) = 0. According to the McCulloch-Pitts model, if the output is 0 then the net value must be less than the threshold : 0 < T.
For the second row, we may write the net value as (W1 * 0) + (W2 * 1) = W2. If the output is 0 then the net value must be less than the threshold : W2 < T.
For the third row, we may write the net value as (W1 * 1) + (W2 * 0) = W1. If the output is 0 then the net value must be less than the threshold : W1 < T.
For the fourth row, we may write the net value as (W1 * 1) + (W2 * 1) = W1 + W2. If the output is 1 then the net value must be greater than or equal to the threshold : W1 + W2 ≥ T.
Now we have four conditions :
1. 0 < T    2. W2 < T    3. W1 < T    4. W1 + W2 ≥ T
Now select the values of W1, W2 and T such that the above conditions get satisfied.
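Both gates can be checked numerically : W = −1, T = −0.8 for NOT (derived above), and W1 = W2 = 1, T = 1.5 for AND — one illustrative choice satisfying 0 < T, W1 < T, W2 < T and W1 + W2 ≥ T.

# Verifying the McCulloch-Pitts parameters for the NOT and AND gates.
def mp_fire(net, T):
    return 1 if net >= T else 0      # the neuron fires when net input reaches the threshold

for x in (0, 1):                     # NOT gate: W = -1, T = -0.8
    print("NOT", x, "->", mp_fire(-1 * x, T=-0.8))

for x1 in (0, 1):                    # AND gate: W1 = W2 = 1, T = 1.5
    for x2 in (0, 1):
        print("AND", x1, x2, "->", mp_fire(1 * x1 + 1 * x2, T=1.5))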
Activation Functions

The activation function f(x) is used to give the output of a neuron in terms of a local field x or net. The various activation functions are as follows.

1.  Linear
Output = net
The linear activation function gives an output that is the same as the input or net value. The MATLAB toolbox has a function, purelin, to realize the linear transfer function shown above. Neurons of this type are used as linear approximators in linear filters :
a = purelin(n)

2.  Hard limit / Unipolar binary
Output = 0 if net < 0
       = 1 if net ≥ 0
The hard-limit transfer function shown above limits the output of the neuron to either 0, if the net input argument n is less than 0, or 1, if n is greater than or equal to 0. This function is used in Perceptrons, to create neurons that make classification decisions :
a = hardlim(n)

3.  Symmetrical hard limit / Bipolar binary
Output = −1 if net < 0
       = 1 if net ≥ 0

4.  Saturating linear
Output = 0 if net < 0
       = net if 0 ≤ net ≤ 1
       = 1 if net > 1

5.  Symmetrical saturating linear
Output = −1 if net < −1
       = net if −1 ≤ net ≤ 1
       = 1 if net > 1
a = satlins(n)

The symbol in the square to the right of each transfer function graph represents the associated transfer function. These icons replace the general f in the boxes of network diagrams to show the particular transfer function being used.

6.  Unipolar continuous
Output = 1 / (1 + exp(−λ * net))

7.  Bipolar continuous
Output = 2 / (1 + exp(−λ * net)) − 1
This function takes the input, which can have any value between plus and minus infinity, and squashes the output into the range −1 to 1 :
a = tansig(n)
Neural network architectures include : Single Layer Perceptron Network, Multi Layer Perceptron Network, Radial Basis Function Network, Competitive Network, Self Organizing Map and Hopfield Network.
Once the net input is calculated, the output of the j-th neuron is calculated by applying the activation function to the net input as follows :
Yj = f(netj)
2.6.1 Single Layer Perceptron Network

—  Fig. 2.6.2 represents the single layer Perceptron network. In this type of network only the input and output layers are present. It consists of a single layer, where the inputs are directly connected to the outputs via a series of weights.
—  The sum of the products of the weights and the inputs is calculated in each neuron node, and if the value is above some threshold (generally 0) the neuron fires and displays the activated value (generally 1), otherwise it displays the inhibited value (generally −1).
Fig. 2.6.3 : Multi Layer Perceptron Network (input layer, hidden layers 1 and 2, output layer)
2.6.3 Radial Basis Function Network

Fig. 2.6.5 : Feedback / Recurrent Network

—  Each neuron is connected to every other neuron but not back to itself.
—  Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons. This is true of ANNs as well; learning in a neural network means setting or updating the weights.
1.  Supervised Learning
In this type of learning, when an input is applied the supervisor provides a desired response. The difference between the actual response (o) and the desired response (d) is calculated; this is called the error measure and is used to correct the network parameters.

Fig. 2.7.1 : Supervised Learning

2.  Unsupervised Learning
In this type of learning the supervisor is not present; due to this there is no idea or guess of the output. The network modifies its weights based on patterns of input and/or output.

Fig. 2.7.2 : Unsupervised Learning
Solution :
For each input and output one neuron is required; hence a total of 8 neurons are required.
In the weight matrix the number of rows represents the output neurons and the number of columns represents the input neurons. The first row represents all the incoming weights of the first output neuron and the second row represents all the incoming weights of the second output neuron. Hence the matrix dimension is 2 × 6, as follows :

W = [ W11 W12 W13 W14 W15 W16 ]
    [ W21 W22 W23 W24 W25 W26 ]

The outputs are to be limited to, and continuous over, the range 0 to 1. Hence we may use the unipolar continuous function :
f(net) = 1 / (1 + exp(−λ * net))
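A layer like this — 6 inputs, 2 output neurons, a 2 × 6 weight matrix and the unipolar continuous function — computes its outputs in one matrix-vector product. The weight and input values below are illustrative, not taken from the example.

# Minimal sketch of the 2 x 6 layer with unipolar continuous outputs in (0, 1).
import numpy as np

def f(net, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * net))       # unipolar continuous activation

W = np.array([[0.2, -0.1, 0.4, 0.0, 0.3, -0.2],   # weights into output neuron 1
              [0.1, 0.5, -0.3, 0.2, 0.0, 0.1]])   # weights into output neuron 2
X = np.array([1.0, 0.0, -1.0, 0.5, 0.2, 1.0])     # 6 input values

print(f(W @ X))                                   # two outputs, each between 0 and 1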
Example 2.7.2 : Given a 2-input neuron with the following parameters : b = 1.2, W = [3, 2], X = [−5, 6], calculate the neuron's output for the following transfer functions :
1. Hard limit  2. Symmetrical hard limit  3. Linear  4. Saturating linear  5. Symmetrical saturating linear  6. Unipolar continuous  7. Bipolar continuous

Solution :
First we calculate the net value, which can be calculated as follows :
net = Σ Wj Xj
But in this particular problem a bias value is given. Bias is a value which is added to the net to improve the performance of the network, so we will use the following formula :
net = Σ Wj Xj + b
net = (−5 * 3) + (6 * 2) + 1.2 = −1.8
1.  Hard limit
Output = 0 if net < 0
       = 1 if net ≥ 0
Hence, Y = 0

2.  Symmetrical hard limit
Output = −1 if net < 0
       = 1 if net ≥ 0
Hence, Y = −1

3.  Linear
Output = net
Hence, Y = −1.8

4.  Saturating linear
Output = 0 if net < 0
       = net if 0 ≤ net ≤ 1
       = 1 if net > 1
Hence, Y = 0

5.  Symmetrical saturating linear
Output = −1 if net < −1
       = net if −1 ≤ net ≤ 1
       = 1 if net > 1
Hence, Y = −1

6.  Unipolar continuous
Output = 1 / (1 + exp(−λ * net))
We assume the value of λ equal to 1 (if the value of λ is not given, then assume the standard value equal to 1).
Hence, Y = 0.1418

7.  Bipolar continuous
Output = 2 / (1 + exp(−λ * net)) − 1
Hence, Y = −0.71
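All seven outputs can be reproduced numerically; running the snippet below confirms the values above (λ assumed to be 1, as in the text).

# Recomputing Example 2.7.2: net = (-5*3) + (6*2) + 1.2 = -1.8.
import math

net = (-5 * 3) + (6 * 2) + 1.2
print("net =", net)                                           # -1.8
print("hard limit          :", 1 if net >= 0 else 0)          # 0
print("symmetrical hardlim :", 1 if net >= 0 else -1)         # -1
print("linear              :", net)                           # -1.8
print("saturating linear   :", min(max(net, 0), 1))           # 0
print("sym. sat. linear    :", min(max(net, -1), 1))          # -1
print("unipolar continuous :", 1 / (1 + math.exp(-net)))      # ~0.1418
print("bipolar continuous  :", 2 / (1 + math.exp(-net)) - 1)  # ~-0.71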
Example 2.7.3 : Compute the output of the following network using the Unipolar continuous function. (Figure : a small layered network with its weights.)

Solution : First we will calculate the net input and output of the hidden nodes.

2.8 PERCEPTRON LEARNING
Perceptron Learning is a supervised type of learning, as the desired response is present. It is applicable only for binary types of neurons (activation functions). The learning signal is the difference between the actual output and the desired output of the neuron, and it is used to update the weights.

Learning signal, r = di − oi
where oi is the output of the i-th neuron and di is the desired response.

Weight increment, ΔWij = C * [di − oi] * Xj
where C is a constant, X is the input and j = 1 to n.

Fig. 2.8.1 : Perceptron Learning rule

Wnew = Wold + ΔWij
In Perceptron learning the weights are updated only if di ≠ oi.
Let's assume :
if di = 1 and oi = −1, then ΔWij = C * [di − oi] * X = C * [1 − (−1)] * X = 2CX
if di = −1 and oi = 1, then ΔWij = C * [di − oi] * X = C * [−1 − 1] * X = −2CX
Hence, ΔWij = ±2CXj
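One update of this rule, with a bipolar binary (sign) activation, looks as follows. With C = 0.1 and the values shown, it reproduces the first weight update of Example 2.8.1 below.

# Minimal sketch of one Perceptron weight update: delta_W = C * (d - o) * X.
import numpy as np

def perceptron_update(W, X, d, C=0.1):
    o = 1 if np.dot(W, X) >= 0 else -1   # actual bipolar binary output
    return W + C * (d - o) * X           # unchanged when d == o

W = np.array([1.0, -1.0, 0.0, 0.5])
X = np.array([1.0, -2.0, 0.0, -1.0])
print(perceptron_update(W, X, d=-1))     # [0.8, -0.6, 0.0, 0.7]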
Example 2.8.1 : The initial weight vector W1 needs to be trained using three input vectors X1, X2 and X3 and their desired responses. Find the final weight vector using the Perceptron learning rule.

W1 = [1, −1, 0, 0.5]ᵀ, X1 = [1, −2, 0, −1]ᵀ, X2 = [0, 1.5, −0.5, −1]ᵀ, X3 = [−1, 1, 0.5, −1]ᵀ, C = 0.1, d1 = −1, d2 = −1, d3 = 1

Solution : Perceptron learning is applicable only for binary functions, and in this problem the desired responses are given in 1 and −1 format. Hence we solve this problem using the bipolar binary function.

net1 = W1 * X1 = [1 −1 0 0.5] · [1, −2, 0, −1]ᵀ = 2.5
o1 = f(net1) = 1, and d1 ≠ o1, so the weights are updated :
W2 = W1 + C * [d1 − o1] * X1 = [1, −1, 0, 0.5]ᵀ + 0.1 * (−1 − 1) * [1, −2, 0, −1]ᵀ = [0.8, −0.6, 0, 0.7]ᵀ

net2 = W2 * X2 = [0.8 −0.6 0 0.7] · [0, 1.5, −0.5, −1]ᵀ = −1.6
o2 = f(net2) = −1 = d2, so no update is required :
W3 = W2

net3 = W3 * X3 = [0.8 −0.6 0 0.7] · [−1, 1, 0.5, −1]ᵀ = −2.1
o3 = f(net3) = −1, and d3 ≠ o3, so the weights are updated :
W4 = W3 + 0.1 * (1 − (−1)) * X3 = [0.6, −0.4, 0.1, 0.5]ᵀ
Example 2.8.2 : Implement the Perceptron Learning rule.

W1 = [0, 1, 0]ᵀ, X1 = [2, 1, −1]ᵀ, X2 = [0, −1, −1]ᵀ, C = 1, d1 = −1, d2 = 1. Continue until two correct responses in a row are achieved.

Solution : We use the bipolar binary function : o = 1 if net ≥ 0, o = −1 if net < 0.

Step 1 : When X1 is applied
net1 = W1 * X1 = [0 1 0] · [2, 1, −1]ᵀ = 1
o1 = f(net1) = 1, and d1 ≠ o1 :
W2 = W1 + C * [d1 − o1] * X1 = [0, 1, 0]ᵀ + 1 * (−1 − 1) * [2, 1, −1]ᵀ = [−4, −1, 2]ᵀ

Step 2 : When X2 is applied
net2 = W2 * X2 = [−4 −1 2] · [0, −1, −1]ᵀ = −1
o2 = f(net2) = −1, and d2 ≠ o2 :
W3 = W2 + 1 * (1 − (−1)) * X2 = [−4, −1, 2]ᵀ + 2 * [0, −1, −1]ᵀ = [−4, −3, 0]ᵀ

Step 3 : When X1 is applied again
net3 = W3 * X1 = [−4 −3 0] · [2, 1, −1]ᵀ = −11
o3 = f(net3) = −1 = d1, so weight updating is not required : W4 = W3

Step 4 : When X2 is applied again
net4 = W4 * X2 = [−4 −3 0] · [0, −1, −1]ᵀ = 3
o4 = f(net4) = 1 = d2, so weight updating is not required : W5 = W4

Two correct responses in a row have been achieved, so the final weight vector is W = [−4, −3, 0]ᵀ.
Example 2.8.3 : A single neuron network using f(net) = sgn(net) has been trained using the pairs of Xi, di as follows. Find the initial weight vector.

W4 = [3, 2, 6, 1]ᵀ, X1 = [1, −2, 3, −1]ᵀ, X2 = [0, −1, −1, −1]ᵀ, X3 = [−2, 0, −3, −1]ᵀ, C = 1, d1 = −1, d2 = 1, d3 = −1

Solution : This problem is different from the previous problems; we have to find the initial weight vector from the final weight vector, working backwards.

Step 1 :
W4 = W3 + ΔW3, so W3 = W4 − ΔW3.
As we know ΔW = ±2 * C * X, and d3 = −1, we consider the negative sign :
ΔW3 = −2 * C * X3 = [4, 0, 6, 2]ᵀ
W3 = W4 − ΔW3 = [−1, 2, 0, −1]ᵀ

Step 2 :
W3 = W2 + ΔW2, so W2 = W3 − ΔW2.
As we know ΔW = ±2 * C * X, and d2 = 1, we consider the positive sign :
ΔW2 = 2 * C * X2 = [0, −2, −2, −2]ᵀ
W2 = W3 − ΔW2 = [−1, 4, 2, 1]ᵀ

Step 3 :
W2 = W1 + ΔW1, so W1 = W2 − ΔW1.
As we know ΔW = ±2 * C * X, and d1 = −1, we consider the negative sign :
ΔW1 = −2 * C * X1 = [−2, 4, −6, 2]ᵀ
W1 = W2 − ΔW1 = [1, 0, 8, −1]ᵀ
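The recovered W1 can be checked by replaying the three updates forward with the Perceptron rule; training from W1 on (X1, d1), (X2, d2), (X3, d3) should end exactly at W4.

# Forward check of Example 2.8.3: from W1 back to the trained vector W4.
import numpy as np

def sgn(n):
    return 1 if n >= 0 else -1

W = np.array([1.0, 0.0, 8.0, -1.0])                # W1 found above
Xs = [np.array([1.0, -2.0, 3.0, -1.0]),            # X1
      np.array([0.0, -1.0, -1.0, -1.0]),           # X2
      np.array([-2.0, 0.0, -3.0, -1.0])]           # X3
ds = [-1, 1, -1]

for X, d in zip(Xs, ds):
    o = sgn(np.dot(W, X))
    W = W + 1 * (d - o) * X                        # C = 1
print(W)                                           # expect [3, 2, 6, 1]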
2.9 PERCEPTRON ARCHITECTURE
—  The input vector for which the net input is zero determines the decision boundary :
W11 * P1 + W12 * P2 + b = 0
—  Let's take the values W11 = W12 = 1 and b = −1, and substitute these values in the above equation. We get :
P1 + P2 − 1 = 0
—  To draw the decision boundary we need to find the intercepting points on the P1 and P2 axes :
Substitute P1 = 0; then we get P2 = 1, i.e. (0, 1).
Substitute P2 = 0; then we get P1 = 1, i.e. (1, 0).
—  Now, to find which decision region belongs to output 1, let's pick one point, (2, 0), and substitute it in the following equation :
O = hardlim(Wᵀ P + b) = hardlim([1 1] · [2, 0]ᵀ + (−1)) = hardlim(1) = 1
—  The decision boundary is always orthogonal to the weight vector, and it always points towards the region where the neuron output is 1.
Example 2.9.1 : Implement a Perceptron network for the AND function using the concept of the decision boundary.
P1 = [0 0], t1 = 0; P2 = [0 1], t2 = 0; P3 = [1 0], t3 = 0; P4 = [1 1], t4 = 1

Solution :
First we plot the given points. For the point P4 the target or desired response is given as 1; this is represented as a filled circle. For the remaining points it is 0, which is represented as an empty circle. After plotting the points, the next step is to draw the decision boundary such that it divides the points into two regions according to their desired responses (i.e. output 1 and 0).
To find the value of the bias, pick one point on the decision boundary, such as P = [1.5 0].
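With the weight vector W = [1, 1] (an illustrative choice orthogonal to a suitable boundary), the boundary point P = [1.5, 0] gives b = −1.5, and the resulting perceptron reproduces the AND truth table.

# Checking the decision-boundary construction for the AND function.
import numpy as np

W = np.array([1.0, 1.0])
P_on_boundary = np.array([1.5, 0.0])
b = -np.dot(W, P_on_boundary)            # W.P + b = 0 on the boundary => b = -1.5

hardlim = lambda n: 1 if n >= 0 else 0
for P, t in [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]:
    o = hardlim(np.dot(W, np.array(P, dtype=float)) + b)
    print(P, "target", t, "output", o)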
In this problem the desired responses are given in 1/0 format, so we will solve this problem using the unipolar binary (hardlim) function. The training pairs, as used in the steps below, are P1 = [2 2], t1 = 0; P2 = [1 −2], t2 = 1; P3 = [−2 2], t3 = 0; P4 = [−1 1], t4 = 1, with initial values W1 = [0 0] and b1 = 0.

Solution :
Iteration 1

Step 1 : When P1 is applied
o1 = hardlim(W1 * P1 + b1) = hardlim(0) = 1
Error, e = t1 − o1 = 0 − 1 = −1
W2 = W1 + e * P1 = [−2 −2], b2 = b1 + e = −1

Step 2 : When P2 is applied
o2 = hardlim(W2 * P2 + b2) = hardlim(1) = 1
Error, e = t2 − o2 = 1 − 1 = 0
W3 = W2, b3 = b2

Step 3 : When P3 is applied
o3 = hardlim(W3 * P3 + b3) = hardlim(−1) = 0
Error, e = t3 − o3 = 0 − 0 = 0
W4 = W3, b4 = b3

Step 4 : When P4 is applied
o4 = hardlim(W4 * P4 + b4) = hardlim(−1) = 0
Error, e = t4 − o4 = 1 − 0 = 1
W5 = W4 + e * P4 = [−2 −2] + 1 * [−1 1] = [−3 −1]
b5 = b4 + e = −1 + 1 = 0
Iteration 2

Step 5 : When P1 is applied
o1 = hardlim(W5 * P1 + b5) = hardlim(−8) = 0
Error, e = t1 − o1 = 0, so W6 = W5, b6 = b5

Step 6 : When P2 is applied
o2 = hardlim(W6 * P2 + b6) = hardlim(−1) = 0
Error, e = t2 − o2 = 1
W7 = W6 + e * P2 = [−3 −1] + 1 * [1 −2] = [−2 −3]
b7 = b6 + e = 0 + 1 = 1

Step 7 : When P3 is applied, e = 0, so W8 = W7, b8 = b7
Step 8 : When P4 is applied, e = 0, so W9 = W8, b9 = b8

In this iteration, for P1, P3 and P4, e = 0. Hence we will again go for Iteration 3.

Iteration 3

Step 9 : When P1 is applied
o1 = hardlim(W9 * P1 + b9) = hardlim(−9) = 0
Error, e = t1 − o1 = 0
Since the error is 0, weight and bias updation is not required.
W10 = W9
b10 = b9

Step 10 : When P2 is applied
o2 = hardlim(W10 * P2 + b10) = hardlim(5) = 1
Error, e = t2 − o2 = 0
Since the error is 0, weight and bias updation is not required.
W11 = W10
b11 = b10

Now for P2 also we are getting the error e = 0. In this iteration e = 0 for all input vectors. Thus, we can say that we have found a solution.

The final weight vector and bias value are as follows :
W = [−2 −3], b = 1

Now we substitute these values in the following equation to find the equation of the decision boundary :
W11 * P1 + W12 * P2 + b = 0
−2 P1 − 3 P2 + 1 = 0

Now, to draw the decision boundary, we need to find the intercepting points.
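The full run can be replayed in code; starting from W = [0, 0], b = 0 and cycling through the four points with the hardlim rule ends at exactly W = [−2, −3], b = 1.

# Replaying the whole training example above (learning rate 1).
import numpy as np

hardlim = lambda n: 1 if n >= 0 else 0
P = [np.array([2.0, 2.0]), np.array([1.0, -2.0]),
     np.array([-2.0, 2.0]), np.array([-1.0, 1.0])]
t = [0, 1, 0, 1]

W, b = np.array([0.0, 0.0]), 0.0
for iteration in range(10):
    errors = 0
    for Pi, ti in zip(P, t):
        e = ti - hardlim(np.dot(W, Pi) + b)   # error = target - output
        if e:
            W, b, errors = W + e * Pi, b + e, errors + 1
    if errors == 0:                           # e = 0 for all inputs: solution found
        break
print(W, b)                                   # [-2. -3.] 1.0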
NOTE : For XOR use (X1 AND (NOT X2)) OR ((NOT X1) AND X2).

Dec. 2019
Q. 5  Write a short note on the McCulloch-Pitts Neuron Model. (Ans. : Refer section 2.3) (5 Marks)
(c) Output (d) Interconnections ✓ Ans. : (b)
Explanation : A node collects all inputs for processing, the same as the soma in a BNN.

(d) Both Supervised and Unsupervised ✓ Ans. : (b)
Explanation : Data is grouped together based on common characteristics in unsupervised learning.

Q. 2.x  ___ are tree-like branches, responsible for receiving the information from the other neurons it is connected to.
(a) Soma (b) Axon (c) Dendrites (d) Synapse ✓ Ans. : (c)
Explanation : The dendrite is the part of the neuron that accepts the input.

Q. 2.7  The maximum time involved in the case of layer calculation is in
(a) Input layer computation (b) Output layer computation (c) Hidden layer (d) Equal effort in each layer ✓ Ans. : (c)
Explanation : The main processing is done in the hidden layers only.

Q. 2.x  ___ is just like a cable through which neurons send the information.
(a) Axon (b) Dendrites (c) Soma (d) Synapse ✓ Ans. : (a)
Explanation : The axon is the part of the neuron that sends the output.

Q. 2.8  Which one is a type of linear activation function?
(a) Identity (b) Unipolar Continuous (c) Bipolar Continuous (d) Binary ✓ Ans. : (a)
Explanation : In identity, the input and output both are the same.
Q. 2.9  Artificial Neural Networks are used in
(a) Unsupervised learning models (b) Supervised learning models (c) Both Unsupervised and Supervised learning models (d) Neither Unsupervised nor Supervised learning models ✓ Ans. : (c)
Explanation : ANN applications are developed using both types of learning methods.

Q. 2.10  A training set of data in supervised learning includes
(a) Input (b) Output (c) Both input and output (d) Neither input nor output ✓ Ans. : (c)
Explanation : In training data, input and output both are present, which is used to learn.

Q. 2.11  ___ is the connection between the axon and other neuron dendrites.
(a) Soma (b) Axon (c) Dendrites (d) Synapse ✓ Ans. : (d)
Explanation : The synapse is the electro-chemical contact between the organs.

Q. 2.12  Which is true for neural networks?
1) It has a set of nodes and connections. 2) Each node computes its weighted input. 3) They have the ability to learn by example. 4) The training time does not depend on the size of the network.
(a) 1, 3, 4 (b) 1, 2, 3 (c) 2, 3, 4 (d) 1, 2, 3, 4 ✓ Ans. : (b)
Explanation : Since the training time depends on the size of the network.

Q. 2.13  The following gate cannot be modelled with a single neuron :
(a) 3-input AND gate (b) 3-input XOR gate (c) NOT gate (d) All can be easily modeled ✓ Ans. : (b)
Explanation : Since to design an XOR gate we require AND, NOT and OR gates.

Q. 2.14  Steps followed in training a perceptron are listed below. What is the correct sequence of the steps?
1. For a sample input, compute an output. 2. Initialize the weights of the perceptron randomly. 3. Go to the next batch of the dataset. 4. If the prediction does not match the output, change the weights.
(a) 1, 4, 3, 2 (b) 2, 1, 4, 3 (c) 1, 2, 3, 4 (d) 2, 3, 1, 4 ✓ Ans. : (b)
Explanation : For an input and initialized weights, an output is computed; if the output is not the same as the expected output then the weights are updated, and then the next iteration is performed.

Q. 2.15  The processing of an ANN depends upon
1) Network topology 2) Adjustments of weights or learning 3) Activation functions
(a) 1, 2 (b) 2, 3 (c) 1, 3 (d) 1, 2, 3 ✓ Ans. : (d)
Explanation : Since all components are used for processing.

Q. 2.16  Suppose you have to design a system where you want to perform word prediction, also known as language modelling. You are to take the output from the previous state and also the input at each step to predict the next word. The inputs at each step are the words for which the next words are to be predicted. Can we use a Recurrent Neural Network for the design?
(a) Yes (b) No ✓ Ans. : (a)
Explanation : Yes, as in the case of an RNN the output is given to the input again, unless and until we get the desired result.

Q. 2.17  Suppose you want to predict cyber bullying so that parents can run this system in the background, and when the children are watching any video/audio/site, if it contains any offending contents then the site is blocked or the contents are blurred. According to you, which method will give you the best result?
(a) Long Short Term Memory (b) Convolutional Neural Network (c) Recurrent Neural Network (d) Artificial Neural Network ✓ Ans. : (a)
Explanation : LSTM, as audio or video is to be converted into a sentence model and then given to the network.

Q. 2.18  Can you represent the following Boolean function with a single logistic threshold unit?

A | B | F(A, B)
1 | 1 | 0
0 | 0 | 0
1 | 0 | 1
0 | 1 | 0

(a) Yes (b) No ✓ Ans. : (a)
Explanation : Yes, you can represent this function with a single logistic threshold unit, since it is linearly separable.
Q. 2.29 We have decided to use a neural network to solve this problem. We have two choices : either to train a separate neural network for each of the diseases, or to train a single neural network with one output neuron for each disease, but with a shared hidden layer. Which method do you prefer? (There are dependencies between diseases.)
(a) First (b) Second (c) Both (d) None  ✓ Ans. : (b)
Explanation :
1. A neural network with a shared hidden layer can capture dependencies between diseases. It can be shown that in some cases, when there is a dependency between the output nodes, having a shared node in the hidden layer can improve the accuracy.
2. If there were no dependency between the diseases (output neurons), then we would prefer to have a separate neural network for each disease.

Explanation : A cell is said to be fired if and only if the potential of the body reaches a certain steady threshold value.

Q. 2.34 The cell body of a neuron can be analogous to what mathematical operation?
(a) summing (b) differentiator (c) integrator (d) none of the mentioned  ✓ Ans. : (a)
Explanation : Because the adding of potential (due to neural fluid) at different parts of the neuron is the reason for its firing.

Q. 2.35 What is Hebb's rule of learning?
(a) the system learns from its past mistakes
(b) the system recalls previous reference inputs and respective ideal outputs
(c) the strength of the neural connection gets modified accordingly
(d) none of the mentioned  ✓ Ans. : (c)
Explanation : The strength of a neuron to fire in future increases if it is fired repeatedly.

Q. 2.30 Let's say you are using activation function X in the hidden layers of a neural network. At a particular neuron, for a given input, you get the output as "-0.0001". Which of the following activation functions could X represent?
(a) ReLU (b) tanh (c) Sigmoid (d) None of these  ✓ Ans. : (b)
Explanation : The function is tanh because this function's output range is (-1, 1).

Q. 2.36 What is the feature of ANNs due to which they can deal with noisy, fuzzy, inconsistent data?
(a) associative nature of networks
(b) distributive nature of networks
(c) both associative and distributive
(d) none of the mentioned  ✓ Ans. : (c)
Explanation : These are general characteristics of ANNs.

Q. 2.31 Why do we need biological neural networks?
(a) to solve tasks like machine vision and natural language processing
(b) to apply heuristic search methods to find solutions of problems
(c) to make smart, human-interactive and user-friendly systems
(d) all of the mentioned  ✓ Ans. : (d)
Explanation : These are the basic aims that a neural network achieves.

Q. 2.37 What was the name of the first model which can perform a weighted sum of inputs?
(a) McCulloch-Pitts neuron model
(b) Marvin Minsky neuron model
(c) Hopfield model of neuron
(d) None of the mentioned  ✓ Ans. : (a)
Explanation : The McCulloch-Pitts neuron model can perform a weighted sum of inputs followed by a threshold logic operation.

Q. 2.32 What is the auto-association task in neural networks?
(a) find relation between 2 consecutive inputs
(c) both Rosenblatt and Pitts models (d) neither Rosenblatt nor Pitts  ✓ Ans. : (b)
Explanation : Weights are fixed in the Pitts model but adjustable in the Rosenblatt model.

(d) none of the mentioned  ✓ Ans. : (a)
Explanation : Competitive learning laws modulate the difference between the synaptic weight and the output signal.

Q. 2.42 When both inputs are 1, what will be the output of the Pitts model NAND gate?
(a) 0 (b) 1 (c) either 0 or 1 (d) z  ✓ Ans. : (a)
Explanation : According to the truth table of a NAND gate.

Q. 2.47 What are the advantages of neural networks over conventional computers?
(i) They have the ability to learn by example
(ii) They are more fault tolerant
(iii) They are more suited for real time operation due to their high computational power
(a) (i) and (ii) (b) (i) and (iii) (c) Only (i) (d) All  ✓ Ans. : (d)

(a) in feedforward manner
(b) in feedback manner
(c) both feedforward and feedback
(d) either feedforward or feedback  ✓ Ans. : (d)
Explanation : Connections across the layers in standard topologies can be in feedforward manner or in feedback manner, but not both.

Which of the following gives non-linearity to a neural network?
(a) Gradient descent (b) Bias (c) ReLU Activation Function (d) None  ✓ Ans. : (c)
Explanation : An activation function such as ReLU gives a non-linearity to the neural network.

Q. 2.44 If the change in the weight vector is represented by Δw_ij, what does it mean?
Q. 2.49 For a fully-connected network with one hidden layer, what effect should increasing the number of hidden units have on bias and variance?
(a) Decrease bias, increase variance
(b) Increase bias, increase variance
(c) Increase bias, decrease variance
(d) No change  ✓ Ans. : (a)
Explanation : Adding more hidden units should decrease bias and increase variance. In general, more complicated models will result in lower bias but larger variance, and adding more hidden units certainly makes the model more complex.

Q. 2.50 Error back propagation uses which learning rule?
(a) Hebbian learning (b) Perceptron learning (c) Delta learning (d) Competitive learning  ✓ Ans. : (c)
Explanation : The delta learning rule is used to update the weights of the hidden and output layers using the error function.
Module 3

Chapter 3 : Introduction to Optimization Techniques

University Prescribed Syllabus

Derivative based optimization - Steepest Descent, Newton method.
Derivative free optimization - Random Search, Down Hill Simplex.

3.1 Introduction
3.2 Derivative Based Optimization
    3.2.1 Gradient based Methods
    3.2.2 Method of Steepest Descent
3.3 Derivative Free Optimization
    3.3.1 Random Search Method
    3.3.2 Down Hill Simplex
3.1 INTRODUCTION

- Optimization methods are used to minimize a scalar function of a number of variables. In unconstrained optimization, the variables are not restricted by inequality or equality relationships. The scalar function which we want to minimize is called an objective function.
- The objective function is an error function for a feed forward network and an energy function for a recurrent network. For the further topics we need the terms Hessian matrix and gradient vector, so first we will see the basic concept of these two terms.
- Gradient Vector - Let's take a scalar function E(x) of a vectorial variable x, defined as an n-element column vector x = [x₁, x₂, ..., xₙ]ᵀ.
- The gradient vector of E(x) with respect to the column vector x is denoted as

  ∇E(x) = [∂E/∂x₁, ∂E/∂x₂, ..., ∂E/∂xₙ]ᵀ

- Hessian Matrix - For a scalar function E(x), the matrix of second derivatives, called the Hessian matrix, is defined as

  ∇²E(x) = [ ∂²E/∂x₁²      ∂²E/∂x₁∂x₂   ...  ∂²E/∂x₁∂xₙ
             ∂²E/∂x₂∂x₁    ∂²E/∂x₂²     ...  ∂²E/∂x₂∂xₙ
             ...
             ∂²E/∂xₙ∂x₁    ∂²E/∂xₙ∂x₂   ...  ∂²E/∂xₙ²   ]
- A wide spectrum of methods exists for unconstrained optimization. These methods can be broadly categorized in terms of whether derivative information is used (derivative based optimization) or is not used (derivative free optimization).
- We have E(X₀) ≥ E(X₁) ≥ E(X₂) ≥ ..., so hopefully the sequence (Xₙ) converges to the desired local minimum. Note that the value of the step size η is allowed to change at every iteration.
- This process is illustrated in Fig. 3.2.1. Here E is assumed to be defined on the plane, and its graph has a bowl shape. The curves are the contour lines, that is, the regions on which the value of E is constant. An arrow originating at a point shows the direction of the negative gradient at that point.
- For minimizing the objective function, the descent procedures are typically repeated until one of the stopping criteria is satisfied, such as : the objective function value is sufficiently small; the length of the gradient vector is smaller than a specified value; or the specified computing time is exceeded.
- In the method of steepest descent, the successive adjustments applied to the weight vector W are in the direction of steepest descent, i.e., in a direction opposite to the gradient vector. We can write

  g = ∇E(W)

- Accordingly, the steepest descent method is formally described by

  W(n + 1) = W(n) - η g(n)

  where η is a positive constant and g(n) is the gradient vector evaluated at the point W(n). When going from iteration n to n + 1, the algorithm applies the correction

  ΔW(n) = W(n + 1) - W(n) = -η g(n)

- To show that the formulation of the steepest descent algorithm satisfies the condition for iterative descent, we use a first order Taylor series expansion around W(n) to approximate E(W(n + 1)) as

  E(W(n + 1)) ≈ E(W(n)) + gᵀ(n) ΔW(n)

- Substituting the value of ΔW(n) we get

  E(W(n + 1)) ≈ E(W(n)) - η ||g(n)||²

  which shows that the objective function decreases at each iteration as long as the gradient is non-zero.
Step 2 : Calculate the optimum step length λᵢ* = (SᵢᵀSᵢ) / (SᵢᵀHSᵢ) and a new point as

  Xᵢ₊₁ = Xᵢ + λᵢ* Sᵢ

Step 3 : Test optimality for the new point Xᵢ₊₁ by ∇f(Xᵢ₊₁) = 0. If met, stop; otherwise repeat step 1 for the new point.
Example 3.2.1 : Minimize f(X₁, X₂) = X₁ - X₂ + 2X₁² + 2X₁X₂ + X₂², starting from the point X₁ = (0, 0).

Solution :

We will calculate the gradient of f as ∇f = [∂f/∂X₁, ∂f/∂X₂]ᵀ. Taking the derivative of the given equation w.r.t. X₁ and X₂,

  ∇f = [1 + 4X₁ + 2X₂, -1 + 2X₁ + 2X₂]ᵀ   ...(1)

Now we will calculate the Hessian matrix as

  H = [ ∂²f/∂X₁²     ∂²f/∂X₁∂X₂
        ∂²f/∂X₂∂X₁   ∂²f/∂X₂²   ] = [ 4  2
                                      2  2 ]

Iteration 1

At X₁ = (0, 0) :

Step 1 : Find S₁ at X₁ : substitute the value (0, 0) in Equation 1 and take the negation of that,

  S₁ = -∇f(X₁) = [-1, 1]ᵀ

Step 2 : Compute λ₁ at X₁ :

  λ₁ = (S₁ᵀS₁) / (S₁ᵀHS₁) = 2/2 = 1

Hence the new point is

  X₂ = X₁ + λ₁S₁ = (-1, 1)

Iteration 2

At X₂ = (-1, 1) :

Step 1 : Find S₂ at X₂ : substitute the value (-1, 1) in Equation 1 and take the negation of that,

  S₂ = -∇f(X₂) = [1, 1]ᵀ
Step 2 : Compute λ₂ at X₂ :

  λ₂ = (S₂ᵀS₂) / (S₂ᵀHS₂) = 2/10 = 0.2

Hence the new point is

  X₃ = X₂ + λ₂S₂ = (-0.8, 1.2)

Step 3 : Check optimality by substituting the value of X₃ in Equation 1 :

  ∇f(X₃) = [0.2, -0.2]ᵀ ≠ 0

So X₃ is not optimum; go to the next iteration.

Iteration 3

At X₃ = (-0.8, 1.2) :

Step 1 : S₃ = -∇f(X₃) = [-0.2, 0.2]ᵀ

Step 2 : λ₃ = (S₃ᵀS₃) / (S₃ᵀHS₃) = 1, so

  X₄ = X₃ + λ₃S₃ = (-1, 1.4)

Step 3 : Check optimality by substituting the value of X₄ in Equation 1 : ∇f(X₄) = [-0.2, -0.2]ᵀ ≠ 0, so go to the next iteration.

Iteration 4

At X₄ = (-1, 1.4) :

Step 1 : Find S₄ at X₄ : substitute the value (-1, 1.4) in Equation 1 and take the negation of that,

  S₄ = -∇f(X₄) = [0.2, 0.2]ᵀ

Step 2 : Compute λ₄ at X₄ :

  λ₄ = (S₄ᵀS₄) / (S₄ᵀHS₄) = 0.2

Hence the new point is

  X₅ = X₄ + λ₄S₄ = (-0.96, 1.44)
Step 3 : Check optimality by substituting the value of X₅ in Equation 1 :

  ∇f(X₅) = [0.04, -0.04]ᵀ ≈ 0

So X₅ is optimum.
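The iteration above is mechanical enough to script. The following is a minimal Python sketch of the same steepest descent loop, assuming the quadratic objective of Example 3.2.1 (the function grad and the hard-coded Hessian H are our own names, not from the text); the step length λ = SᵀS / (SᵀHS) is exact only because f is quadratic.

    import numpy as np

    H = np.array([[4.0, 2.0],
                  [2.0, 2.0]])          # Hessian of f (constant for this quadratic)

    def grad(x):
        # Gradient of f : [1 + 4*x1 + 2*x2, -1 + 2*x1 + 2*x2]
        return np.array([1 + 4*x[0] + 2*x[1], -1 + 2*x[0] + 2*x[1]])

    x = np.zeros(2)                      # starting point X1 = (0, 0)
    for i in range(4):
        S = -grad(x)                     # steepest descent direction
        lam = (S @ S) / (S @ H @ S)      # optimum step length for a quadratic
        x = x + lam * S
        print(f"Iteration {i + 1}: x = {x}")
    # Prints (-1, 1), (-0.8, 1.2), (-1, 1.4), (-0.96, 1.44),
    # matching the hand computation above.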
- The second order Taylor expansion f_Q(x) of the function f(.) around x_k (where Δx = x - x_k) is

  f_Q(x_k + Δx) = f(x_k) + f'(x_k) Δx + (1/2) f''(x_k) Δx²

  It attains its extremum when its derivative with respect to Δx is equal to zero, i.e. when Δx solves the linear equation

  f'(x_k) + f''(x_k) Δx = 0

- Solving this equation in Δx, with constant coefficients : provided that f(x) is a twice-differentiable function well approximated by its second order Taylor expansion, and the initial guess is chosen close enough to x*,

  Δx = x - x_k = -f'(x_k) / f''(x_k)

- The sequence (x_k) defined by

  x_{k+1} = x_k - f'(x_k) / f''(x_k),  k = 0, 1, ...

  will converge towards a root of f', i.e. the x* for which f'(x*) = 0.
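As a quick illustration of the update x_{k+1} = x_k - f'(x_k)/f''(x_k), here is a minimal Python sketch; the sample function f(x) = x⁴ - 3x³ + 2 and the starting guess are our own illustrative choices, not from the text.

    def f1(x):                  # first derivative f'(x) of f(x) = x**4 - 3*x**3 + 2
        return 4*x**3 - 9*x**2

    def f2(x):                  # second derivative f''(x)
        return 12*x**2 - 18*x

    x = 3.0                     # initial guess chosen close enough to the optimum
    for _ in range(20):
        step = f1(x) / f2(x)    # Newton correction
        x = x - step
        if abs(step) < 1e-8:    # stop when the correction becomes negligible
            break
    print(x)                    # converges to x* = 2.25, a root of f'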
Algorithm

Step 1 : Compute the new point Xᵢ₊₁ = Xᵢ - J⁻¹ ∇fᵢ, where J is the Hessian evaluated at Xᵢ.
Step 2 : Test optimality for the new point Xᵢ₊₁ by ∇f(Xᵢ₊₁) = 0. If met, stop; otherwise repeat step 1 for the new point.

Example 3.2.2 : Minimize f(X₁, X₂) = X₁ - X₂ + 2X₁² + 2X₁X₂ + X₂², starting from the point X₁ = (0, 0).

Solution :

We will calculate the gradient of f as ∇f = [∂f/∂X₁, ∂f/∂X₂]ᵀ. Taking the derivative of the given equation w.r.t. X₁ and X₂,

  ∇f = [1 + 4X₁ + 2X₂, -1 + 2X₁ + 2X₂]ᵀ   ...(1)

Now we will calculate ∇f₁ by substituting (0, 0) in the above equation :

  ∇f₁ = [1, -1]ᵀ

Now we will calculate the Jacobian (Hessian) matrix as

  J = [ ∂²f/∂X₁²     ∂²f/∂X₁∂X₂
        ∂²f/∂X₂∂X₁   ∂²f/∂X₂²   ] = [ 4  2
                                      2  2 ]

Now we will find J⁻¹ :

  J⁻¹ = [ 1/2  -1/2
         -1/2    1  ]

Iteration 1

At X₁ = (0, 0) :

Step 1 : X₂ = X₁ - J⁻¹ ∇f₁ = (0, 0) - (1, -1.5) = (-1, 1.5)

Step 2 : ∇f(X₂) = [1 - 4 + 3, -1 - 2 + 3]ᵀ = [0, 0]ᵀ

So X₂ is optimum.
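The single Newton step of Example 3.2.2 is easy to verify with NumPy. This is a minimal sketch (variable names ours), exploiting the fact that J is constant for a quadratic, so one step lands exactly on the optimum.

    import numpy as np

    J = np.array([[4.0, 2.0],
                  [2.0, 2.0]])                       # Hessian of f

    def grad(x):
        return np.array([1 + 4*x[0] + 2*x[1], -1 + 2*x[0] + 2*x[1]])

    x1 = np.zeros(2)                                 # X1 = (0, 0)
    x2 = x1 - np.linalg.inv(J) @ grad(x1)            # one Newton step
    print(x2, grad(x2))                              # [-1.  1.5] [0. 0.]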
3.3 DERIVATIVE FREE OPTIMIZATION

3.3.1 Random Search Method

(Steps 1 and 2, per the flowchart : initialise the current point x and the bias b, then select a random dx.)

3. If f(x + b + dx) < f(x), set the current point x equal to x + b + dx and the bias b equal to 0.2b + 0.4dx; go to step 6. Otherwise go to the next step.
4. If f(x + b - dx) < f(x), set the current point x equal to x + b - dx and the bias b equal to b - 0.4dx; go to step 6. Otherwise go to the next step.
5. Set the bias equal to 0.5b and go to step 6.
6. Stop if the maximum number of function evaluations is reached; otherwise go back to step 2 to find a new point.

(Fig. 3.3.1 : Flowchart of the random search method - initialisation, selection of a random dx, the updates x = x + b + dx with b = 0.2b + 0.4dx and x = x + b - dx with b = b - 0.4dx, the fallback b = 0.5b, and the stop test.)
- The main advantage of this technique is that calculation of the derivative of the function is not required. The technique generates its pseudo-derivative by evaluating enough points of the function for each independent variable to define a derivative. (Fig. 3.3.2)
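A minimal Python sketch of steps 2-6 above; the objective (the same quadratic used in the worked examples), the step scale 0.1 and the evaluation budget are our own illustrative choices.

    import numpy as np

    def f(x):
        return x[0] - x[1] + 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2

    rng = np.random.default_rng(0)
    x = np.zeros(2)                  # current point
    b = np.zeros(2)                  # bias vector
    for _ in range(2000):            # step 6 : fixed budget of function evaluations
        dx = 0.1 * rng.normal(size=2)            # step 2 : select a random dx
        if f(x + b + dx) < f(x):                 # step 3
            x = x + b + dx
            b = 0.2*b + 0.4*dx
        elif f(x + b - dx) < f(x):               # step 4
            x = x + b - dx
            b = b - 0.4*dx
        else:                                    # step 5
            b = 0.5*b
    print(x, f(x))                   # wanders towards the minimum near (-1, 1.5)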
3.3.2 Down Hill Simplex

- The downhill simplex optimization technique requires only function evaluations instead of derivative calculations. If the space is N-dimensional then a polyhedron with N + 1 vertices is used as a simplex. The initial simplex is defined with the help of a starting point and N additional points.
- In Reflection, the worst point, for which the objective function value is maximum, is moved to a point which is reflected through the remaining N points. The Expansion step tries to expand the simplex along this line if the new point is better as compared to the best point.
- In Contraction, the simplex is contracted along one dimension from the maximum point if the new point is not better as compared to the previous point. If the new point is not good as compared to the previous points, then the simplex is contracted along all dimensions toward the best point and steps down the valley. In the downhill simplex search technique these operations are applied serially during each iteration till the method finds the optimal solution. In the downhill simplex technique the following operations are applied serially :
- Arrange the points by rank and relabel the N + 1 points. Points are ranked according to the following condition :

  f(P_{N+1}) > ... > f(P₂) > f(P₁)

- Create the initial point P_r with the help of reflection :

  P_r = P̄ + (P̄ - P_{N+1})

  Here P̄ is calculated by taking the centroid of the N best points among the vertices of the simplex.
- P_{N+1} is replaced by P_r if the following condition is satisfied :
- In Expansion, a new point is created as

  P_e = P̄ + β (P_r - P̄)

- P_{N+1} is replaced by P_e if the following condition is satisfied, else P_{N+1} is replaced by P_r :

  Condition : f(P_e) < f(P_r)

- Create a new point P_c with the help of contraction if the following condition is satisfied.
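Because only function values are needed, the whole reflection / expansion / contraction cycle can be delegated to an off-the-shelf routine. A minimal sketch with SciPy's Nelder-Mead implementation (our choice of library, shown only to illustrate derivative-free use) :

    from scipy.optimize import minimize

    def f(x):
        return x[0] - x[1] + 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2

    res = minimize(f, x0=[0.0, 0.0], method="Nelder-Mead")  # no gradients supplied
    print(res.x)                     # close to the true minimum (-1, 1.5)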
Q. 2 Describe the Down Hill Simplex method. Why is it called a derivative free method? (Ans. : Refer sections 3.3 and 3.3.2) (5 Marks)

Q. 3 Minimize f(X₁, X₂) = 4X₁ - 2X₂ + 2X₁² + 2X₁X₂ + X₂², starting from the point X₁ = (0, 0), using the steepest descent method (perform only two iterations). (Ans. : Refer section 3.2.2) (10 Marks)

Q. 4 Differentiate : Derivative Based and Derivative Free optimization techniques. (Ans. : Refer sections 3.2 and 3.3) (5 Marks)

Dec. 2019

Q. 5 List some advantages of derivative-based optimization techniques. Explain the Steepest Descent method for optimization. (Ans. : Refer sections 3.2 and 3.2.2) (10 Marks)

Q. 6 Write a short note on the Down Hill Simplex Method. (Ans. : Refer section 3.3.2) (5 Marks)
Multiple Choice Questions

Q. 3.1 What type of constraint must the feasible basic solution satisfy for the simplex method?
(a) Non-Negativity (b) Negativity (c) Basic (d) Common  ✓ Ans. : (a)
Explanation : Non-negative points are used for the simplex method.

Q. 3.2 For Gradient Descent (GD) and Stochastic Gradient Descent (SGD), which of the following sentences is correct?
(a) You update a set of parameters in an iterative manner to minimize the error function.
(b) You have to run through all the samples in your training set for a single update of a parameter in each iteration for SGD.
(c) You either use the entire data or a subset of the training data to update a parameter in each iteration in GD.
(d) You update a set of parameters in a parallel manner to minimize the error function.  ✓ Ans. : (a)
Explanation : In SGD, each iteration chooses a batch which generally contains a random sample of the data, but in GD each iteration contains all of the training observations.

Q. 3.3 Which one of the following is incorrect w.r.t. derivative based optimization?
(a) Uses derivative information with the objective function
(b) Slow convergence
(c) Follows a mathematical methodology
(d) Fast convergence  ✓ Ans. : (b)
Explanation : Derivative based optimization methods converge very fast.

Q. 3.4 In the classical Newton's method, the descent direction is determined by which method?
(a) First order derivative of the function
(b) Partial order derivative of the available objective function
(c) Gradient method
(d) Second order derivative of the available objective function  ✓ Ans. : (d)
Explanation : According to the working of the Newton method.

Q. 3.5 In derivative free optimization methods, points are selected based on which criterion?
(a) Minimum value (b) Maximum value (c) Fitness value (d) Error value  ✓ Ans. : (c)
Explanation : Derivative free optimization methods work on the concept of survival of the fittest; fitness denotes how good the point is.

Q. 3.6 When gradient information is used, it is called which type of optimization?
(a) Derivative free optimization
(b) Derivative based optimization
(c) Constrained optimization
(d) Optimization  ✓ Ans. : (b)
Explanation : In derivative based optimization, derivative information is used.

Q. 3.7 Which one is not a derivative free optimization method?
(a) Genetic algorithm (b) Downhill simplex method (c) Simulated Annealing (d) Gradient Descent method  ✓ Ans. : (d)
Explanation : The gradient based method is a derivative based optimization method.

Q. 3.8 The convergence speed of a derivative free optimization method is
(a) Fast (b) Slow (c) Very Fast (d) Medium  ✓ Ans. : (b)
Explanation : More iterations are required in derivative free methods since they work on random points.

Q. 3.9 A derivative based optimization method uses
(a) Heuristic Search (b) Best First Search (c) Breadth First Search (d) Depth First Search  ✓ Ans. : (a)
Explanation : A heuristic function (derivative information) is used for optimization.

Q. 3.10 This is used for evaluation in a derivative based optimization method :
(a) Only the objective function
(b) Derivative information with the objective function
(c) Only derivative information
(d) Only objective information  ✓ Ans. : (b)
Explanation : The derivative of the objective function is calculated in this method.

Q. 3.11 As the iterations are performed, the cost to update parameters in the method of steepest descent
(a) increases (b) decreases (c) is constant (d) fluctuates  ✓ Ans. : (b)
Explanation : As the number of iterations progresses, the cost to update the weight parameters decreases.

Q. 3.12 The evolutionary concept is used in
(a) Derivative free methods (b) Derivative based methods (c) Both the methods (d) None of the methods  ✓ Ans. : (a)
Explanation : Derivative free methods like the genetic algorithm are based on the evolutionary concept.

Q. 3.13 If g(z) is the sigmoid function, then its derivative with respect to z may be written in terms of g(z) as
(a) g(z)(g(z) - 1) (b) g(z)(1 + g(z)) (c) g(z)²(1 + g(z)) (d) g(z)(1 - g(z))  ✓ Ans. : (d)
Explanation : Formula to calculate the derivative of the sigmoid function.

Q. 3.14 We can get multiple local optimum solutions if we solve a linear regression problem by minimizing the sum of squared errors using gradient descent.
(a) True (b) False  ✓ Ans. : (b)
Explanation : We will not get multiple optimum solutions, because the sum-of-squared-errors objective for linear regression is convex.
Q. 3.15 The error function most suited for gradient descent using logistic regression is
(a) The entropy function (b) The squared error
Explanation : Mean squared error function is used.

Explanation : When the function is at a maximum, its second derivative has a negative value and not a positive value.

... minimized?
(a) 0 (b) 1 (c) 5 (d) 3  ✓ Ans. : (b)
Explanation : Probably the easiest way to solve the problem is to recognize that the second derivative is positive. So the function is at a minimum at x = 0, and the correct answer is (b). This is also the absolute minimum because the first derivative is zero only at a single point.

Q. 3.16 Suppose your model is overfitting. Which of the following is NOT a valid way to try and reduce the overfitting?
(a) Increase the amount of training data.

Which of the following is needed for Newton's method for optimization?
(a) The lower bound for the search region
(b) A twice differentiable optimization function
(c) The function to be optimized

Q. 3.22 An initial estimate of an optimal solution is given, to be used in conjunction with the steepest ascent method to determine the maximum of the function. Which of the following statements is correct?
... determine the gradient as ∇f = [0, -4]
(d) decreases as you get closer to the minimum

Q. 3.25 Determine the determinant of the Hessian of the function x² - 2y² - 4y + 6 at the point (0, 0).
(a) 2 (b) 4 (c) 0 (d) -8  ✓ Ans. : (d)
Explanation : To determine the Hessian, the second partial derivatives are determined and evaluated as follows : ∂²f/∂x² = 2, ∂²f/∂y² = -4, ∂²f/∂y∂x = 0. The resulting Hessian matrix and its determinant are

  H = [ 2  0
        0 -4 ]  and  determinant(H) = -8

Explanation : The gradient of a continuous and differentiable function is zero at a maximum.

Q. 3.26 In descent methods, the particular choice of search direction does not matter so much.
(a) True (b) False  ✓ Ans. : (b)
Explanation : The selection of the search direction makes an impact.

Q. 3.27 In descent methods, the particular choice of line search ...

Q. 3.33 Let us say that we have computed the gradient of our cost function and stored it in a vector g. What is the cost of one gradient descent update given the gradient?
(a) O(D) (b) O(N) (c) O(ND) (d) O(ND²)  ✓ Ans. : (a)
Explanation : According to the working of the gradient descent method : the update subtracts a multiple of g from the D-dimensional parameter vector, touching each of the D entries once.

Q. 3.34 Which of the following statements is FALSE?
(a) Multidimensional direct search methods are similar to one-dimensional direct search methods.
(b) Enumerating all possible solutions in a search space and selecting the optimal solutions is an effective approach.
(c) If the optimization function is twice differentiable, multidimensional direct search methods cannot be used to find an optimal solution.
(d) Multidimensional direct search methods are not guaranteed to find the global optimum.  ✓ Ans. : (c)
Explanation : Multidimensional direct search methods can be used with any function to find optimal solutions. If the functions are twice differentiable, there are more computationally efficient techniques for optimization of these functions.

Q. 3.36 If we want to find the value of the variable 'x' that maximizes a differentiable function f(x), then we should :
(a) Find 'x' from df/dx = 0
(b) Find 'x' from df/dx = 0 and check that the second derivative is positive
(c) Find 'x' from df/dx = 0 and check that the second derivative is negative
(d) Find 'x' from second derivative = 0  ✓ Ans. : (c)
Explanation : Find 'x' from df/dx = 0 and check that the second derivative is negative.

Q. 3.37 Each optimization problem must have certain parameters called
(a) linear variables (b) dummy variables (c) design variables (d) none of the above  ✓ Ans. : (c)
Explanation : Design variables are required for every optimization problem.

Q. 3.38 A "≤ type" constraint expressed in the standard form is active at a design point if it has
(a) zero value (b) more than zero value (c) less than zero value (d) (a) and (c)  ✓ Ans. : (a)
Explanation : Standard representation of constraints.

Q. 3.39 Maximization of f(x) is equivalent to minimization of
(a) -f(x) (b) 1/f(x) (c) |f(x)| (d) none of the above  ✓ Ans. : (a)
Explanation : Standard procedure of constraints.

Q. 3.40 An optimization problem in which the functions are ... is referred to as ...
(c) smooth (d) (a) and (b)  ✓ Ans. : (c)
Explanation : According to the definition of the cost function of an optimization problem.

Q. 3.41 The feasible region for the inequality constraints, with respect to the equality constraints,
(a) increases (b) decreases (c) does not change (d) none of the above  ✓ Ans. : (a)
Explanation : Standard procedure of constraints.

Q. 3.42 Which of the following is an objective function in derivative based optimization for a feed forward network?
(a) Error function (b) Energy function (c) Cost function (d) Fitness function  ✓ Ans. : (a)
Explanation : The derivative of the error function is used in feed forward network optimization.

Q. 3.43 When the objective function is smooth and we need efficient local optimization, then it is better to use
(a) Gradient based optimization methods
(b) Derivative based optimization methods
(c) Both of the above methods
(d) None of the above  ✓ Ans. : (a)
Explanation : A gradient based optimization method gives a better solution if the objective function is smooth.

Q. 3.44 The gradient at a point is ___ to the contour line going through that point.
(a) Parallel (b) Orthogonal (c) None of the above (d) All of the above  ✓ Ans. : (b)
Explanation : According to the property of the gradient.

Q. 3.45 In the random search method, if f(x + b + dx) < f(x), then
(a) x = x + b + dx and b = 0.2b + 0.4dx
(b) x = x + b - dx and b = b - 0.4dx
(c) b = 0.5b
(d) None of the above  ✓ Ans. : (a)
Explanation : According to the working of the random search method.

Q. 3.46 The downhill simplex method uses which relationship?
(a) Input-output (b) Geometric (c) Actual output-desired output (d) None of the above  ✓ Ans. : (b)
Explanation : The simplex is a geometrical object. During each iteration four operations are performed : reflection, expansion, one-dimensional contraction, and multiple contractions.

A simplex in n dimensions has how many vertices?
(a) n (b) n + 1 (c) n - 1 (d) n + 2  ✓ Ans. : (b)
Explanation : A simplex is a geometrical object that has n + 1 vertices; here n represents the number of independent variables.
Module 4

Chapter 4 : Learning with Regression and Trees

University Prescribed Syllabus

Learning with Regression : Linear Regression, Logistic Regression. Learning with Trees : Decision Trees, Constructing Decision Trees using Gini Index, Classification and Regression Trees (CART).
- Let's see simple regression first : in this, X contains a single feature. In multiple regression, X contains more than one feature. In simple regression, the training records are plotted as value of X vs. value of Y. The next task is to find a function so that if a random unknown X value is given, we can predict Y. There are different types of functions that can be used; in linear regression, we assume that the function is linear, as shown in Fig. 4.1.2(a).

(Fig. 4.1.2 : (a) Linear Function (b) Non-Linear Function)
- The difference between the value of a point and the predicted value is called the error of prediction. The predicted value is the value of the point on the line.
- Let's take an example : the predicted values (Y') and the errors of prediction (Y - Y') are shown in Table 4.1.1. From the table we can say that the second point has a Y value of 4 and a predicted Y value of 3.2. The error of prediction is 0.8.

Table 4.1.1 : Regression data.

- The regression line equation is

  Y' = aX + b

  a = (n ΣXY - ΣX ΣY) / (n ΣX² - (ΣX)²) = (4 × 86 - 20 × 15) / (4 × 120 - 400) = 0.55

  b = (1/n)(ΣY - a ΣX) = (1/4)(15 - 0.55 × 20) = 1
For X = 4,  Y' = (0.55)(4) + 1 = 3.2
For X = 6,  Y' = (0.55)(6) + 1 = 4.3
For X = 8,  Y' = (0.55)(8) + 1 = 5.4
"In Multiple linear regressions there are two or more
linear regression.
4
]
ing equation, .
The regression line is represented using the follow
s
Yo = Oyt O Xpt Oy, Xy + = +a,X,+¢€
In the above equation Y’ is the predicted value, X, X, _. X, are the predictors, ¢ is random error and ;
Op, O,, %, 0, are regression coefficients.
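In practice the coefficients α₀, ..., αₚ are obtained by least squares. A minimal NumPy sketch (the five records below are invented purely for illustration) :

    import numpy as np

    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])                   # two predictors X1, X2
    y = np.array([6.0, 7.0, 13.0, 14.0, 18.0])   # observed responses

    A = np.column_stack([np.ones(len(X)), X])    # prepend a 1s column for alpha_0
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(coef)                                  # [alpha_0, alpha_1, alpha_2]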
4.2 EXAMPLES OF LINEAR REGRESSION

Example 4.2.1 : The expenditure of an organization (in thousands) for every month is shown in the table below :

X (Month)       | 1  | 2  | 3  | 4  | 5
Y (Expenditure) | 12 | 19 | 29 | 37 | 45

Sr. No. | X  | Y   | XY  | X²
1       | 1  | 12  | 12  | 1
2       | 2  | 19  | 38  | 4
3       | 3  | 29  | 87  | 9
4       | 4  | 37  | 148 | 16
5       | 5  | 45  | 225 | 25
Total   | 15 | 142 | 510 | 55

(a) The equation for the regression line is

  Y' = aX + b

  a = (n ΣXY - ΣX ΣY) / (n ΣX² - (ΣX)²) = (5 × 510 - 15 × 142) / (5 × 55 - 225) = 8.4

  b = (1/n)(ΣY - a ΣX) = (1/5)(142 - 8.4 × 15) = 3.2

  So Y' = 8.4X + 3.2.
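The closed-form coefficients are easy to check in code. A minimal Python sketch applied to the data of this example (variable names are ours) :

    X = [1, 2, 3, 4, 5]                     # months
    Y = [12, 19, 29, 37, 45]                # expenditure

    n = len(X)
    sx, sy = sum(X), sum(Y)
    sxy = sum(x*y for x, y in zip(X, Y))
    sxx = sum(x*x for x in X)

    a = (n*sxy - sx*sy) / (n*sxx - sx**2)   # slope
    b = (sy - a*sx) / n                     # Y-intercept
    print(a, b)                             # 8.4 and 3.2 (up to float rounding)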
Sr. No. | X  | Y  | XY | X²
1       | -1 | -1 | 1  | 1
2       | 2  | 2  | 4  | 4
3       | 3  | 2  | 6  | 9
Total   | 4  | 3  | 11 | 14

  a = (n ΣXY - ΣX ΣY) / (n ΣX² - (ΣX)²) = (3 × 11 - 4 × 3) / (3 × 14 - 16) = 0.807

(Fig. Ex. 4.2.2 : Scatter plot of data)
Example 4.2.3 : (MU - May 17, 10 Marks)
The following table shows the midterm and final exam grades obtained by students in a database course. Use the method of least squares regression to predict the final exam grade of a student who received 86 in the midterm exam.

Midterm exam (X) | 72 | 50 | 81 | 74 | 94 | 86 | 59 | 83 | 86 | 33 | 88 | 81
Final exam (Y)   | 84 | 53 | 77 | 78 | 90 | 75 | 49 | 79 | 77 | 52 | 74 | 90
Solution :

Sr. No. | X   | Y   | XY    | X²
1       | 72  | 84  | 6048  | 5184
2       | 50  | 53  | 2650  | 2500
3       | 81  | 77  | 6237  | 6561
4       | 74  | 78  | 5772  | 5476
5       | 94  | 90  | 8460  | 8836
6       | 86  | 75  | 6450  | 7396
7       | 59  | 49  | 2891  | 3481
8       | 83  | 79  | 6557  | 6889
9       | 86  | 77  | 6622  | 7396
10      | 33  | 52  | 1716  | 1089
11      | 88  | 74  | 6512  | 7744
12      | 81  | 90  | 7290  | 6561
Total   | 887 | 878 | 67205 | 69113

The equation for the regression line is

  Y' = aX + b

  a = (n ΣXY - ΣX ΣY) / (n ΣX² - (ΣX)²) = (12 × 67205 - 887 × 878) / (12 × 69113 - 887²) = 0.65

  b = (1/n)(ΣY - a ΣX) = (1/12)(878 - 0.65 × 887) = 25.12

  Y' = 0.65X + 25.12

The final exam grade of a student who received 86 in the midterm exam is

  Y' = 0.65 × 86 + 25.12 = 81.02 ≈ 81
  b = 113.33

Now the equation for the line becomes

  Y' = 39t + 113.33

The sales of the company for the next two years :

For X = 2019, t = 6 :  Y' = 39 × 6 + 113.33 = 347.33
For X = 2020, t = 7 :  Y' = 39 × 7 + 113.33 = 386.33
- Anything above 0.5 is classified as 1, and anything below 0.5 is classified as 0.

(Fig. 4.3.1 : Logistic or Sigmoid Function)

Pseudo code

- In the gradient descent method we move in the opposite direction of the gradient to find the minimum point of a function.
- The gradient always points in the direction of increase, so to minimize we step against it :

  b = b - α × ∇f(b)

- This step is repeated until we reach the stopping criterion.
- Using the above mentioned method, the optimized b is calculated, the net input is calculated, and then the prediction is calculated by giving the net input to the function :

  Prediction = 1 / (1 + e^(-x))

  Class = 1 if Prediction > 0.5, else 0
Solution :

Initially assume the logistic regression coefficients b₀ = b₁ = b₂ = 0.

For the 1st row, x₁ = 14.5, x₂ = 12.5 and y = 1.

Now we will calculate the prediction for the first row :

  Prediction = 1 / (1 + e^(-(b₀ + b₁ × x₁ + b₂ × x₂)))
  Prediction = 0.5

Now we will calculate the new coefficient values using a simple update equation. Ideal values for alpha are from 0.1 to 0.3; let's take alpha as 0.3. For b₀, by default the input is 1.

  b_new = b_old + alpha × (y - prediction) × prediction × (1 - prediction) × input

  b₀,new = 0 + 0.3 × (1 - 0.5) × 0.5 × (1 - 0.5) × 1 = 0.0375
  b₁,new = 0 + 0.3 × (1 - 0.5) × 0.5 × (1 - 0.5) × 14.5 = 0.54375
  b₂,new = 0 + 0.3 × (1 - 0.5) × 0.5 × (1 - 0.5) × 12.5 = 0.46875

Now we will calculate the prediction for the second row (x₁ = 8.5, x₂ = 4.5) :

  Prediction = 1 / (1 + e^(-(0.0375 + 0.54375 × 8.5 + 0.46875 × 4.5)))
  Prediction = 0.99

Now we will calculate the new coefficient values :

  b₀,new = b_old + alpha × (y - prediction) × prediction × (1 - prediction) × input
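The two updates above can be reproduced with a few lines of Python. This is a minimal sketch of the same update rule (the helper predict and the loop structure are ours); with alpha = 0.3 it gives exactly the coefficients computed by hand.

    import math

    def predict(b, x):
        z = b[0] + b[1]*x[0] + b[2]*x[1]
        return 1.0 / (1.0 + math.exp(-z))       # logistic (sigmoid) of net input

    b = [0.0, 0.0, 0.0]                         # b0, b1, b2 start at zero
    x, y = (14.5, 12.5), 1                      # first training row

    p = predict(b, x)                           # 0.5 for all-zero coefficients
    for j, inp in enumerate((1.0,) + x):        # input is 1 for the bias b0
        b[j] += 0.3 * (y - p) * p * (1 - p) * inp

    print(b)                                    # approx. [0.0375, 0.54375, 0.46875]
    print(predict(b, (8.5, 4.5)))               # ~0.999, i.e. the 0.99 above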
- The tasks or problems in which the records are represented by attribute-value pairs : records are represented by a fixed set of attributes and their values. Example : for the 'temperature' attribute the value is 'hot'.
- When there is a small number of disjoint possible values for each attribute, decision tree learning becomes very simple.
- Although some training records may have unknown values, decision tree methods can still be used.
2. Decision Tree Representation

- A decision tree is a classifier represented in the form of a tree structure, where each node is either a leaf node or a decision node.
  o A leaf node represents the value of the target or response attribute (class) of examples.
  o A decision node represents some test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test.
- A decision tree generates regression or classification models in the form of a tree structure. A decision tree divides a dataset into smaller subsets. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can be used to represent categorical as well as numerical data.
  o Root Node : It represents the entire set of records or the dataset, and this is again divided into two or more similar sets.
  o Splitting : The splitting procedure is used to divide a node into two or more sub-nodes depending on the criteria.

(Figure : a small decision tree whose internal nodes test attribute values - e.g. Small / Big, Low / High - and whose leaf nodes are labelled Acceptable or Unacceptable.)
1. Gini Index

- It is assumed that there exist several possible split values for each attribute.
- The gini index method can be modified for categorical attributes.
- If a data set T contains examples from n classes, the gini index, gini(T), is defined as

  gini(T) = 1 - Σⱼ pⱼ²

  where pⱼ is the relative frequency of class j in T.
- After splitting T into two subsets T₁ and T₂ with sizes N₁ and N₂, the gini index of the split data is

  gini_split(T) = (N₁/N) gini(T₁) + (N₂/N) gini(T₂)   ...(4.4.2)

- The attribute with the smallest gini_split(T) is selected to split the node.
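A minimal Python sketch of these two formulas (helper names ours); as a check, it reproduces the Type split of Example 4.7.3 later in this chapter.

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((labels.count(c)/n)**2 for c in set(labels))

    def gini_split(left, right):
        n = len(left) + len(right)
        return len(left)/n * gini(left) + len(right)/n * gini(right)

    # 'Stolen?' labels split by Type = Sports / SUV (Example 4.7.3) :
    sports = ["Yes", "No", "Yes", "No", "Yes", "Yes"]   # 4 Yes, 2 No
    suv    = ["Yes", "No", "No", "No"]                  # 1 Yes, 3 No
    print(round(gini_split(sports, suv), 3))            # 0.417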
- Stop tree construction early : do not divide a node if this would result in the goodness measure falling below a threshold.
- Postpruning (prune after building the tree) :
  o When we have selected all the attributes, but the records still do not belong to the same class (some are + and some are -), then the node is converted into a leaf node and labelled with the most frequent class of the records in the subset.
  o When there are no records in the subset, this is due to the non-coverage of a specific attribute value for the records in the parent set; for example, if there was no record with income = 40 K. Then a leaf node is generated and labelled with the most frequent class of the records in the parent set.
- The decision tree is generated with each non-terminal node representing the selected attribute on which the data was split, and terminal nodes representing the class label of the final subset of that branch.
Summary

- The entropy of each and every attribute is calculated using the data set.
- Divide the set S into subsets using the attribute for which the resulting entropy (after splitting) is minimum (or, equivalently, information gain is maximum); see the sketch after the pseudocode.

Pseudocode

Add a new tree branch below Root, corresponding to the test A = vᵢ.
Let Records(vᵢ) be the subset of records that have the value vᵢ for A.
If Records(vᵢ) is empty :
    below this new branch add a leaf node labelled with the most frequent target value in the records
Else :
    below this new branch add the subtree ID3(Records(vᵢ), Target_Attribute, Attributes - {A})
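The entropy and information-gain computations that drive this pseudocode fit in a few lines of Python. This is a minimal sketch (helper names and the four illustrative records are ours), using the I(p, n) notation of the examples below.

    import math
    from collections import Counter

    def entropy(labels):
        # I(p, n) generalised to arbitrary class counts
        total = len(labels)
        return -sum((c/total) * math.log2(c/total)
                    for c in Counter(labels).values())

    def info_gain(records, attr, target):
        # Gain(S, A) = I(p, n) - E(A)
        total = len(records)
        e_attr = 0.0
        for v in {r[attr] for r in records}:
            subset = [r[target] for r in records if r[attr] == v]
            e_attr += len(subset)/total * entropy(subset)
        return entropy([r[target] for r in records]) - e_attr

    data = [{"Safety": "High", "Evaluation": "Acceptable"},
            {"Safety": "High", "Evaluation": "Acceptable"},
            {"Safety": "Low",  "Evaluation": "Unacceptable"},
            {"Safety": "Low",  "Evaluation": "Acceptable"}]
    print(round(info_gain(data, "Safety", "Evaluation"), 3))   # 0.311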
4.6 EXAMPLE OF CLASSIFICATION TREE USING ID3

Example 4.6.1 : Suppose we want ID3 to evaluate whether a car is acceptable or not. The target classification is "Should we accept the car?", which can be Acceptable or Unacceptable.

Buying_Price | Maintenance_Price | Lug_Boot | Safety | Evaluation?
High         | High              | Small    | High   | Unacceptable
High         | High              | Small    | Low    | Unacceptable
Medium       | High              | Small    | High   | Acceptable
Low          | Medium            | Small    | High   | Acceptable
Low          | Low               | Big      | High   | Acceptable
Low          | Low               | Big      | Low    | Unacceptable
Medium       | Low               | Big      | Low    | Acceptable
High         | Medium            | Small    | High   | Unacceptable
High         | Low               | Big      | High   | Acceptable
Low          | Medium            | Big      | High   | Acceptable
High         | Medium            | Big      | Low    | Acceptable
Medium       | Medium            | Small    | Low    | Acceptable
Medium       | High              | Big      | High   | Acceptable
Low          | Medium            | Small    | Low    | Unacceptable

Solution :

Step 1 :

Maintenance_Price | pᵢ | nᵢ | I(pᵢ, nᵢ)
High              | 0  | 2  | 0
Medium            | 1  | 1  | 1
Low               | 1  | 0  | 0

E(Maintenance_Price) = (2/5) I(0, 2) + (2/5) I(1, 1) + (1/5) I(1, 0) = 0.4
Step 3 :

Consider now only Maintenance_Price and Safety for Buying_Price = Medium.

Buying_Price | Maintenance_Price | Lug_Boot | Safety | Evaluation?
Medium       | High              | Small    | High   | Acceptable
Medium       | Low               | Big      | Low    | Acceptable
Medium       | Medium            | Small    | Low    | Acceptable
Medium       | High              | Big      | High   | Acceptable

All records are Acceptable, so the Medium branch becomes a leaf labelled [Acceptable].

Step 4 :

Consider now only Maintenance_Price and Safety for Buying_Price = Low.

Buying_Price | Maintenance_Price | Lug_Boot | Safety | Evaluation?
Low          | Medium            | Small    | High   | Acceptable
Low          | Low               | Big      | High   | Acceptable
Low          | Low               | Big      | Low    | Unacceptable
Low          | Medium            | Big      | High   | Acceptable
Low          | Medium            | Small    | Low    | Unacceptable

p₁ = 3 and n₁ = 2
I(p₁, n₁) = I(3, 2) = 0.970

1. Compute entropy for Safety :

Safety | pᵢ | nᵢ | I(pᵢ, nᵢ)
High   | 3  | 0  | 0
Low    | 0  | 2  | 0

E(Safety) = (3/5) I(3, 0) + (2/5) I(0, 2) = 0

Gain(S_Low, Maintenance_Price) = I(p, n) - E(Maintenance_Price) = 0.970 - 0.951 = 0.019

Now we will check the value of 'Evaluation?' from the database, for all branches (High, Medium, Low).
Example 4.6.2 : Suppose we want ID3 to decide whether a loan is to be sanctioned or not. The target classification is "Should we sanction the loan?", which can be Yes or No.

Customer no | Spending_Habit | Collateral | Income | Credit_Score | Sanction?
1           | High           | None       | Low    | Bad          | No
2           | High           | None       | Medium | Unknown      | No
3           | Low            | None       | Medium | Unknown      | No
4           | Low            | None       | Low    | Unknown      | No
5           | Low            | None       | High   | Unknown      | Yes
6           | Low            | Sufficient | High   | Unknown      | Yes
7           | Low            | None       | Medium | Bad          | No
8           | Low            | Sufficient | High   | Bad          | No
9           | Low            | None       | High   | Good         | Yes
10          | High           | Sufficient | High   | Good         | Yes
11          | High           | None       | Low    | Good         | No
12          | High           | None       | Medium | Good         | No

Solution :

Class P : Sanction = "Yes"
Class N : Sanction = "No"

Total records = 12. Number of records with Yes = 4 and No = 8.

I(p, n) = -(4/12) log₂(4/12) - (8/12) log₂(8/12) = 0.918
Step 1 :

1. Compute entropy for Spending_Habit : p₁ = 1 and n₁ = 4.

Spending_Habit | pᵢ | nᵢ | I(pᵢ, nᵢ)
High           | 1  | 4  | 0.722
Low            | 3  | 4  | 0.985

E(A) = Σᵢ ((pᵢ + nᵢ) / (p + n)) I(pᵢ, nᵢ)

E(Spending_Habit) = (5/12)(0.722) + (7/12)(0.985) = 0.875

2. Compute entropy for Collateral :

Collateral | pᵢ | nᵢ | I(pᵢ, nᵢ)
None       | 2  | 7  | 0.764
Sufficient | 2  | 1  | 0.918

E(Collateral) = (9/12)(0.764) + (3/12)(0.918) = 0.803

3. Compute entropy for Income :

Income | pᵢ | nᵢ | I(pᵢ, nᵢ)
Low    | 0  | 3  | 0
Medium | 0  | 4  | 0
High   | 4  | 1  | 0.722

E(Income) = (3/12) × 0 + (4/12) × 0 + (5/12) × 0.722 = 0.3

E(Income) is the smallest, so Income gives the highest gain and is selected as the root node.
Step 3 :

Consider only Spending_Habit, Collateral and Credit_Score for Income = Medium.

Step 4 :

Consider only Spending_Habit, Collateral and Credit_Score for Income = High.

Customer no | Spending_Habit | Collateral | Income | Credit_Score | Sanction?
5           | Low            | None       | High   | Unknown      | Yes
6           | Low            | Sufficient | High   | Unknown      | Yes
8           | Low            | Sufficient | High   | Bad          | No
9           | Low            | None       | High   | Good         | Yes
10          | High           | Sufficient | High   | Good         | Yes

1. Compute entropy for Spending_Habit :

Spending_Habit | pᵢ | nᵢ | I(pᵢ, nᵢ)
Low            | 3  | 1  | 0.811
High           | 1  | 0  | 0

E(A) = Σᵢ ((pᵢ + nᵢ) / (p + n)) I(pᵢ, nᵢ)

E(Spending_Habit) = (4/5)(0.811) + (1/5) × 0 = 0.648

Gain(S_High, Spending_Habit) = I(p, n) - E(Spending_Habit) = 0.721 - 0.648 = 0.073

2. Compute entropy for Collateral :

Collateral | pᵢ | nᵢ | I(pᵢ, nᵢ)
None       | 2  | 0  | 0
Sufficient | 2  | 1  | 0.918

E(Collateral) = (2/5) × 0 + (3/5) × 0.918 = 0.551

3. Compute entropy for Credit_Score :

E(Credit_Score) = (2/5) × 0 + (1/5) × 0 + (2/5) × 0 = 0

So Credit_Score has the highest gain and is selected below the High branch.

(Figure : the decision tree so far - Income at the root; Low → No, Medium → No, and High → a Credit_Score node with branches Unknown, Bad and Good.)
Example 4.6.3 : Suppose we want ID3 to decide whether a car will be stolen or not. The target classification is "Is the car stolen?", which can be Yes or No.

Solution :

Class P : Stolen = "Yes"; Class N : Stolen = "No"

Total records = 10. Number of records with Yes = 5 and No = 5.

I(p, n) = -(5/10) log₂(5/10) - (5/10) log₂(5/10) = 1

Step 1 :

1. Compute entropy for Colour :

Colour | pᵢ | nᵢ | I(pᵢ, nᵢ)
Red    | 3  | 2  | 0.971
Yellow | 2  | 3  | 0.971

E(A) = Σᵢ ((pᵢ + nᵢ) / (p + n)) I(pᵢ, nᵢ)

E(Colour) = (5/10) × 0.971 + (5/10) × 0.971 = 0.971

2. Compute entropy for Type :

Type   | pᵢ | nᵢ | I(pᵢ, nᵢ)
Sports | 4  | 2  | 0.918
SUV    | 1  | 3  | 0.811

E(Type) = (6/10) × 0.918 + (4/10) × 0.811 = 0.875

3. Compute entropy for Origin :

Origin   | pᵢ | nᵢ | I(pᵢ, nᵢ)
Domestic | 2  | 3  | 0.971
Imported | 3  | 2  | 0.971

E(Origin) = (5/10) × 0.971 + (5/10) × 0.971 = 0.971

E(Type) is the smallest (its gain is the highest), so Type is selected as the root.

Step 2 :

With attribute Type at the root, we have to decide on the remaining attributes for the Sports branch. Consider only Colour and Origin for Type = Sports :

Car no | Colour | Type   | Origin   | Stolen?
1      | Red    | Sports | Domestic | Yes
E(Colour) = (4/6) × 0.811 + (2/6) × 1 = 0.874

E(A) = Σᵢ ((pᵢ + nᵢ) / (p + n)) I(pᵢ, nᵢ)
Step 3 :

As attributes Type and Origin are already chosen, we only have to decide on the remaining Colour attribute for the SUV branch.

Now we will check the value of 'Stolen?' from the database, for all branches :

o For Type = Sports and Origin = Domestic, Stolen? = Yes as well as No. For this type of case we have to select the most common class. In this example there are 2 instances of Yes as well as No, so we can select either one. Let's select No.
o For Type = Sports and Origin = Imported, Stolen? = Yes.
o For Type = SUV and Colour = Red, Stolen? = No.
o For Type = SUV and Colour = Yellow, Stolen? = Yes as well as No; the most common class is No.
Example 4.7.1 : Create a decision tree using the Gini index to classify the following dataset.

(Table : records with attributes Income and Age and the class Own_Car; e.g. record 12 is Medium, Old, No.)

Solution :
We will calculate the Split value for each attribute.

Income :
  Split = (n_Low/n) gini(Low) + (n_Medium/n) gini(Medium) + (n_High/n) gini(High)

Age :
  Split = (n_Young/n) gini(Young) + (n_Medium/n) gini(Medium) + (n_Old/n) gini(Old)

- The Split value of Income is the smallest, so we will select Income as the root node.
- From the database we can see that Own_Car = No for Income = Low, so we can directly write down 'No' for the Low branch.
- Since Income is taken as the root node, we now have to decide on the Age attribute, so we will take Age as the next node below the Medium branch.
Example 4.7.2 : A stock market dataset involving only discrete ranges has Profit as a categorical value {up, down}. Use the Gini index method to draw the classification tree.

Age | Competition | Type     | Profit
old | yes         | software | down
old | no          | software | down
old | no          | hardware | down
mid | yes         | software | down
mid | yes         | hardware | down
mid | no          | hardware | up
mid | no          | software | up
new | yes         | software | up
new | no          | hardware | up
new | no          | software | up

Solution :

- In this example there are two classes, down and up.
- Number of records for down = 5; number of records for up = 5; total number of records = 10.
- First we calculate the gini of the complete database :

  gini(D) = 1 - ((5/10)² + (5/10)²) = 0.5

- Next we calculate the Split for all attributes, i.e. Age, Competition and Type.

Age :
  Split = (3/10) gini(old) + (4/10) gini(mid) + (3/10) gini(new)
        = (3/10)(0) + (4/10)[1 - ((2/4)² + (2/4)²)] + (3/10)(0) = 0.2

Competition :
  Split = (4/10) gini(yes) + (6/10) gini(no)
        = (4/10)[1 - ((3/4)² + (1/4)²)] + (6/10)[1 - ((2/6)² + (4/6)²)] = 0.417

Type :
  Split = (6/10) gini(software) + (4/10) gini(hardware)
        = (6/10)[1 - ((3/6)² + (3/6)²)] + (4/10)[1 - ((2/4)² + (2/4)²)] = 0.5

- The Split value of Age is the smallest, so we will select Age as the root node.
- From the database we can see that Profit = down for Age = old, so we can directly write down 'down' for the old branch.
- For the mid branch we compare the remaining attributes :

Competition :
  Split = (2/4) gini(yes) + (2/4) gini(no) = 0

Type :
  Split = (2/4) gini(software) + (2/4) gini(hardware) = 0.5
The Split value of Competition is the smallest, so Competition is selected as the node below the mid branch.

(Figure : Competition node below the mid branch, with branches No and Yes.)
Example 4.7.3 : Suppose we want the Gini index to decide whether a car will be stolen or not. The target classification is "Is the car stolen?", which can be Yes or No.

Car no | Colour | Type   | Origin   | Stolen?
1      | Red    | Sports | Domestic | Yes
2      | Red    | Sports | Domestic | No
3      | Red    | Sports | Domestic | Yes
4      | Yellow | Sports | Domestic | No
5      | Yellow | Sports | Imported | Yes
6      | Yellow | SUV    | Imported | No
7      | Yellow | SUV    | Imported | Yes
8      | Yellow | SUV    | Domestic | No
9      | Red    | SUV    | Imported | No
10     | Red    | Sports | Imported | Yes

Solution :

Colour :
  Split = (5/10)[1 - ((3/5)² + (2/5)²)] + (5/10)[1 - ((2/5)² + (3/5)²)] = 0.48

Type :
  Split = (6/10) gini(Sports) + (4/10) gini(SUV)
        = (6/10)[1 - ((4/6)² + (2/6)²)] + (4/10)[1 - ((1/4)² + (3/4)²)] = 0.417

Origin :
  Split = (5/10)[1 - ((2/5)² + (3/5)²)] + (5/10)[1 - ((3/5)² + (2/5)²)] = 0.48

The Split value of Type is the smallest, so we select Type as the root node. Now consider only Colour and Origin for Type = Sports :

Car no | Colour | Origin   | Stolen?
1      | Red    | Domestic | Yes
2      | Red    | Domestic | No
3      | Red    | Domestic | Yes
4      | Yellow | Domestic | No
5      | Yellow | Imported | Yes
10     | Red    | Imported | Yes

Colour :
  Split = (4/6) gini(Red) + (2/6) gini(Yellow)
        = (4/6)[1 - ((3/4)² + (1/4)²)] + (2/6)[1 - ((1/2)² + (1/2)²)] = 0.417

Origin :
  Split = (4/6) gini(Domestic) + (2/6) gini(Imported)
        = (4/6)[1 - ((2/4)² + (2/4)²)] + (2/6)[1 - ((2/2)² + (0/2)²)] = 0.333

The Split value of Origin is the smallest, so we will select Origin as the next node below the Sports branch.

(Figure : Type at the root; the Sports branch splits on Origin into Domestic and Imported.)
Eyecolor :
  Split = P(Brown) gini(Brown) + P(Blue) gini(Blue)

Married :
  Split = P(Yes) gini(Yes) + P(No) gini(No)

Sex :
  Split = (7/12) gini(male) + (5/12) gini(female)
        = (7/12)[1 - ((7/7)² + (0/7)²)] + (5/12)[1 - ((0/5)² + (5/5)²)] = 0

Hairlength :
  Split = (8/12) gini(long) + (4/12) gini(short) = 0.458

The Split value of Sex is the smallest, so we will select Sex as the root node.

From the database we can see that :
o class = Football for Sex = male, so we can directly write down 'Football' for the male branch.
o class = Netball for Sex = female, so we can directly write down 'Netball' for the female branch.

(Figure : Sex at the root, with the male branch labelled Football and the female branch labelled Netball.)
Example 4.7.5 : (MU - May 17, 10 Marks)
For the Sunburn dataset given below, construct a decision tree.

Name  | Hair   | Height  | Weight | Location | Class
Swati | Blonde | Average | Light  | No       | Yes
...

Solution :

We will calculate the Split value for all attributes, i.e. Hair, Height, Weight and Location.
Hair :
  Split = (4/8) gini(Blonde) + (3/8) gini(Brown) + (1/8) gini(Red)
        = (4/8)[1 - ((2/4)² + (2/4)²)] + (3/8)(0) + (1/8)(0) = 0.25

Height :
  Split = (3/8) gini(Average) + (2/8) gini(Tall) + (3/8) gini(Short)

Weight :
  Split = (2/8) gini(Light) + (3/8) gini(Average) + (3/8) gini(Heavy)

Location :
  Split = (5/8) gini(No) + (3/8) gini(Yes)

The Split value of Hair is the smallest, so we will select Hair as the root node.
Now we will split the remaining attributes considering the Blonde data (4 records).

Height :
  Split = (2/4) gini(Average) + (1/4) gini(Tall) + (1/4) gini(Short)

Weight :
  Split = (2/4) gini(Light) + (2/4) gini(Average)

(Figure : Hair at the root, with branches Blonde, Brown and Red.)
Creditscore :
  Split = P(High) gini(High) + P(Low) gini(Low) = 0.493

Location :
  Split = P(bad) gini(bad) + P(good) gini(good)

(Figure : Location at the root, with branches bad and good.)

The Split value of Creditscore is the smallest, so we will select a Creditscore node below the bad branch.

(Figure : Creditscore node below the bad branch, with branches High and Low.)
Income :
  Split = (2/6) gini(Low) + (2/6) gini(High) + (2/6) gini(Medium) = 0.295

Defaulting :
  Split = P(High) gini(High) + P(Medium) gini(Medium) + P(Low) gini(Low) = 0

The Split value of Defaulting is the smallest, so we will select a Defaulting node below the good branch.
Since only one attribute is remaining, we can directly select Income below the Creditscore = High branch.

o For Location = bad, Creditscore = High and Income = Low : Giveloan = No
o For Location = bad, Creditscore = High and Income = Medium : Giveloan = Yes
o For Location = bad, Creditscore = High and Income = High : Giveloan = Yes
o For Location = bad and Creditscore = Low : Giveloan = No
o For Location = good and Defaulting = High : Giveloan = No
o For Location = good and Defaulting = Low : Giveloan = Yes
o For Location = good and Defaulting = Medium : Giveloan = Yes
Example 1
SD Reduction

It is based on the decrease in SD (standard deviation) after a dataset is split on an attribute. Constructing a tree is all about finding the attribute that returns the highest SDR (standard deviation reduction).

Step 1 :

SD (Maintenance_Price) = 9.32

Step 2 :

The dataset is then split on the different attributes. The SD for each branch is calculated, and the resulting SD is subtracted from the SD before the split.

             |        | Maintenance_Price (SD)
Buying_Price | Low    | 7.78
             | Medium | 3.49
             | High   | 10.87

SD (Maintenance_Price, Buying_Price) = P(Low) SD(Low) + P(Medium) SD(Medium) + P(High) SD(High)
  = (5/14) × 7.78 + (4/14) × 3.49 + (5/14) × 10.87 = 7.66

       |      | Maintenance_Price (SD)
Safety | High | 7.87
       | Low  | 10.59

SD (Maintenance_Price, Safety) = P(High) SD(High) + P(Low) SD(Low)
  = (8/14) × 7.87 + (6/14) × 10.59 = 9.04

SDR = SD (Maintenance_Price) - SD (Maintenance_Price, Safety) = 9.32 - 9.04 = 0.28

The SDR of Buying_Price is the highest (9.32 - 7.66 = 1.66), so we select Buying_Price as our root node.
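A minimal Python sketch of the SDR computation (helper names ours). The branch values below are chosen to reproduce the SDs used in this example - 7.78, 3.49 and 10.87, with overall SD 9.32 - so the printed reduction matches the 9.32 - 7.66 = 1.66 of the root split.

    import statistics

    def sdr(values, branches):
        # SDR = SD(all values) - sum of P(branch) * SD(branch)
        n = len(values)
        weighted = sum(len(b)/n * statistics.pstdev(b) for b in branches)
        return statistics.pstdev(values) - weighted

    low    = [25, 30, 35, 38, 48]        # branch SD 7.78
    medium = [46, 43, 52, 44]            # branch SD 3.49
    high   = [45, 52, 23, 46, 30]        # branch SD 10.87
    all_vals = low + medium + high       # overall SD 9.32

    print(round(sdr(all_vals, [low, medium, high]), 2))   # 1.66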
Step 3 :

Now consider the records with Buying_Price = High.

         |       | Maintenance_Price (SD)
Lug_Boot | Small | 7.5
         | Big   | 12.49

SD (High, Lug_Boot) = P(Small) SD(Small) + P(Big) SD(Big)
  = (2/5) × 7.5 + (3/5) × 12.49 = 10.49

The SDR of Safety is the highest, so we select Safety as the next node below the High branch.

o For Buying_Price = High and Safety = High, we can directly write down the answer.
o For Buying_Price = High and Safety = Low, we can directly write down the answer.

To write down the answer we take the average of the values of the following records :
o For Buying_Price = High and Safety = Low : Maintenance_Price = (23 + 30) / 2 = 26.5
o For Buying_Price = Medium, we can directly write down the answer as 46.3. The answer is calculated by taking the average of the values of Maintenance_Price for the Medium records (average of 46, 43, 52 and 44).

Step 4 :

(Figure : the final regression tree, with Buying_Price at the root.)
Example 2

Day | Outlook  | Temp | Humidity | Windy | Hours Play?
1   | Rainy    | Hot  | High     | False | 25
2   | Rainy    | Hot  | High     | True  | 30
3   | Overcast | Hot  | High     | False | 46
4   | Sunny    | Mild | High     | False | 45
5   | Sunny    | Cool | Normal   | False | 52

Standard deviation

A decision tree is built top-down from the root node and involves partitioning the data into subsets that contain instances with similar values. We use the SD to calculate the homogeneity of a numerical sample.
SD Reduction

It is based on the decrease in SD after a dataset is split on an attribute. Constructing a tree is all about finding the attribute that returns the highest SDR.

Step 1 :

SD (Hours Play) = 9.32

Step 2 :

The dataset is then split on the different attributes. The SD for each branch is calculated, and the resulting SD is subtracted from the SD before the split.

        |          | Hours Play (SD)
Outlook | Rainy    | 7.78
        | Overcast | 3.49
        | Sunny    | 10.87

SD (Hours Play, Outlook) = P(Rainy) SD(Rainy) + P(Overcast) SD(Overcast) + P(Sunny) SD(Sunny)
  = (5/14) × 7.78 + (4/14) × 3.49 + (5/14) × 10.87 = 7.66

SD (Hours Play, Temp) = (4/14) × 10.51 + (4/14) × 8.95 + (6/14) × 7.65 = 8.84

SDR = SD (Hours Play) - SD (Hours Play, Temp) = 9.32 - 8.84 = 0.48
         |        | Hours Play (SD)
Humidity | High   | 9.36
         | Normal | 8.37

SD (Hours Play, Humidity) = P(High) SD(High) + P(Normal) SD(Normal)
  = (7/14) × 9.36 + (7/14) × 8.37 = 8.86
Day | Outlook | Temp | Humidity | Windy | Hours Play?
6   | Sunny   | Cool | Normal   | True  | 23
10  | Sunny   | Mild | Normal   | False | 46
14  | Sunny   | Mild | High     | True  | 30

Within the Sunny branch the SDs of the candidate attributes are computed in the same way (e.g. SD for Temp = Cool is 14.5) :

SD (Sunny, Humidity) = (2/5) × 7.5 + (3/5) × 12.49 = 10.49

      |       | Hours Play (SD)
Windy | False | 3.09

The SDR of Windy is the highest, so we select Windy as the next node below the Sunny branch.

o For Outlook = Sunny and Windy = False : Hours Play = (45 + 52 + 46) / 3 = 47.7
o For Outlook = Sunny and Windy = True : Hours Play = (23 + 30) / 2 = 26.5
"> Step3:
‘ , F x .
* . .
.Now we will consider the records of ‘overcast’. For Overcast we will directly write down the answer.
us w.e.f academic year 18-19) (M6-14) . Tech-Neo Publications...A’ SACHIN SHAH Venture
Machine Learning (MU-Sern 6-Comp) Learning with Regression and Trees ...Page no (4-45)
Day | Outlook | ‘Temp Humidity | Windy | Hours Play?
3 | Overcast | Hot High False | 46
> Step 4:
Now we will consider the records of ‘Rainy’.
Day | Outlook | Temp | Humidity | Windy | Hours Play?
I Rainy Hot High False 25
2 Rainy Hot High True 30
8 Rainy Mild High False 35
9 Rainy Cool Normal False 38
I Rainy Mild Normal True 48
High 5
Humidity
Normal 5
SD (Sunny, Humidity) =
P (High) SD (High) + P (Normal) SD (Normal)
2
=2 x5+5% 5=5
rs Play, Humidity) = 7.78 - 5
= 2.28
SDR = SD (Rainy) - SD (Hou
o For Outlook = Rainy and Temp = Cool, we can directly write down the answer.
the answer.
© For Outlook = Rainy and Temp = Hot, we can directly write down
© For Outlook = Rainy and Temp = Mild, we can directly write down the answer.
o For Outlook = Rainy and Temp = Hot, Hours Play = (25 + 30) /2= 27.5
© For Outlook = Rainy and Temp = Mild, Hours Play = (35 + 48) /2 = 41.5
Mild
415
May 2015

Q. 1 Create a decision tree for the attribute "class" using the respective values. (Ans. : Refer Example 4.7.4) (12 Marks)
Q. 7 For the Sunburn dataset, construct a decision tree. (Ans. : Refer Example 4.7.5)

Q. 8 For the following data, calculate the Gini indexes and determine which attribute is the root attribute; generate a two-level deep decision tree. (Ans. : Refer Example 4.7.6) (10 Marks)
Sr. No. | Income | Defaulting | Creditscore | Location | Give Loan?
1       | Low    | High       | High        | bad      | No
2       | Low    | High       | High        | good     | No
3       | High   | High       | High        | bad      | Yes
4       | Medium | Medium     | High        | bad      | Yes
5       | Medium | Low        | Low         | bad      | No
6       | Medium | Low        | Low         | good     | Yes
7       | High   | High       | Low         | bad      | No
8       | Low    | Medium     | Low         | bad      | No
9       | Low    | Low        | Low         | bad      | No
10      | Medium | Medium     | Low         | good     | Yes
11      | Low    | Medium     | High        | good     | Yes
Q. 9 Explain how a regression problem can be solved using the steepest descent method. Write down the steps.

Q. 10 Given the following data for the sales of cars of an automobile company for six consecutive years, predict the sales for the next two consecutive years. (Ans. : Refer Example 4.2.4) (10 Marks)

Years | 2013 | 2014 | 2015 | 2016 | 2017 | 2018
Sales | 110  | 100  | 250  | 275  | 230  | 300
Q. 4.1 The equation of simple linear regression is :
(a) Y = a + bX where a is the X-intercept and b is the slope of the line
(b) Y = a + bX where a is the slope of the line and b is the X-intercept
(c) Y = a + bX where a is the Y-intercept and b is the slope of the line
(d) Y = a + bX where a is the slope of the line and b is the Y-intercept  ✓ Ans. : (c)
Explanation : Definition of linear regression.

Q. 4.2 The correlation between age and the percentage of getting infected with COVID-19 is -1. What can you interpret from this?
(a) Age is a good predictor and is negatively correlated
(b) Age is a good predictor and is positively correlated
(c) Age is not a good predictor and is negatively correlated
(d) Age is not a good predictor and is positively correlated  ✓ Ans. : (a)
Explanation : Age is an important factor in COVID, so we can say age is a good predictor and it is negatively correlated.

Q. 4.3 In linear regression, which plot should be used to represent the relationship between the dependent variable and the independent variable?
(a) Box plot (b) Bar graph (c) Scatter plot (d) Histogram  ✓ Ans. : (c)
Explanation : A scatter plot is used to plot the predicted output vs. the input.

Q. 4.4 How many coefficients do you need to estimate in a simple linear regression model?
(a) 0 (b) 5 (c) 1 (d) 2  ✓ Ans. : (d)

Q. 4.5 Linear regression is a form of
(a) Supervised learning (b) ... (c) Unsupervised learning (d) Reinforcement Learning  ✓ Ans. : (a)
Explanation : The best fitting line is drawn from the input and output points and is used for prediction.

Q. 4.6 Choose the appropriate method used to find the best fit for the data in logistic regression.
(a) Jaccard Distance (b) Least Square (c) Maximum likelihood (d) Pearson coefficient  ✓ Ans. : (c)
Explanation : Maximum likelihood (probability) is used to decide the class in logistic regression.

Q. 4.7 Choose the appropriate method used to find the best fit for the data in linear regression.
(a) Least Square error (b) Maximum likelihood (c) R-square (d) Pearson coefficient  ✓ Ans. : (a)
Explanation : The least square error is used to draw the regression line.

Q. 4.8 A logistic regression model is built on given data. It has training accuracy X and testing accuracy Y. If new features are added to this already built model, how does it affect the accuracy? Choose the appropriate one.
(a) Testing accuracy will decrease
(b) Training accuracy will decrease
(c) Testing accuracy will increase
(d) Training accuracy will increase or remain the same  ✓ Ans. : (d)
Explanation : If we add more features to a trained model then the training accuracy will increase or remain the same.
Q. 4.9 To find the minimum or the maximum of a function, we set the gradient to zero because
(a) The value of the gradient at extrema of the function is always equal to zero (b) It depends upon the type of problem (c) The value of the gradient at extrema of the function is always equal to maximum (d) The value of the gradient at extrema of the function is always equal to minimum  ✓ Ans. : (a)
Explanation : Property of the gradient.

Q. 4.15 Find the statement which is true in case of gini index.
(a) The gini index is biased towards multivalued attributes (b) Gini index does not favour equal sized partitions (c) Gini index favours the test when impurity is there in both partitions (d) When the number of classes is large, gini index is a good choice  ✓ Ans. : (a)
Explanation : Property of the gini index method.
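The gini index that Q. 4.15 and Q. 4.11 reason about can be computed directly from class proportions; a minimal sketch (function names are ours) :

    def gini(labels):
        # Gini index = 1 - sum over classes of p_c^2
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    print(gini(["yes", "yes", "yes"]))       # 0.0 : a pure node
    print(gini(["yes", "no", "yes", "no"]))  # 0.5 : maximally impure two-class node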
Q. 4.16 The Family of Decision Tree learning algorithm is
Q. 4.10 Which
a of the following iss a disad
Sadvantage of decisi
cision (a) Unsupervised learning model a
(b) Supervised learning model
(a) Factor analysis
(c) Stochastic leaming Model
(b) Decision trees are robust to outliers
(d) Reinforcement learning Model ¥ Ans. : (b)
(c) Decision trees are prone to overfit Explanation : In decision tree training data contains
Ne
(d) Decision trees are not prone to overfit input as well as output.
Y Ans. :(c) Q. 4.17 Temperature Prediction is
Explanation : If we continue to split all the branches (a) Classification problem (b) Regression Problem
then it will lead to overfitting. (c) Clustering problem (d) Astrological Problem
4)
Find the statement which is not true in case of gini index.
¥ Ans. : (b)
nd Q. 4.11
Explanation : Since we are predicting temperature
(a) Thegini index is biased towards multivalued attribute
(numerical value).
he
(b) Gini index does not favour equal sized partitions
(c) Gini index favour test when purity is there in both Q. 4.18 How many Coefficients are needed to estimate a simple
partitions. linear regression model with one independent variable ?
&) 2 @3 @d4 v Ans. : (b)
Gini index is not (@~l
(d) When the number of classes is large we
¥ Ans. : (b) Explanation : Along with one independent variable
)
a good choice.
size partitions. will require Y-Intercept and slope of the
line. Module
Explanation : Gini index favours equal price of house
Q. 4.19 You want to design a system to predict the
actual is nearby
0.4.12 Which one of these is not a
tree based learner? and after prediction if predicted and
ed for the loan
(b) D3 same then deal is registered and process
(a) CART deed is done. The
he
(d) Random forest approval. Once loan is approved sale
(c) Bayesian Classifier : (c) selling price of a house depends on
the following factors.
v Ans. of bedrooms, number
For eg. It depends on the number
on baye’s the year the house was
Explanation : Bayesian classifier is based of kitchen, number of bathrooms,
of the lot. Also the facilities
theorem. built and the square footage
‘a) Loan sanction depends
that are available near the house.
ent cas es is___— customer. Given these factors,
Ww Q. 4.13 Finding the number of Covid Pati on the credit history of of
of the house is an example
(a) Classification model (6) Astr
Regrolog on one
essiical Mode
predicting the selling price
task and also what
process you will follow to
(d)
as (c) Clustering Model Y Ans. : (b)
which
design? Classification, deal
aw . ients ( surveys Binary
ill of pats (a) Locality roval, sale deed
Explanation : Since we are finding number dit sco ring, loan app
registration, cre l classification, deal
numerical output). survey, Multileve
(b) Locality it scoring, loan approval, sale deed
registration, cred sion,, deal
rat, Logistic regression is __— survey,
!
Simple linear regres
dee d
(c) Locality Joan approval, sale
(a) Supervised Regression gi st ra ti on , credit scoring, deal
re regressio n,
Multiple linear sale deed
(b) Supervised Classification (d) Locality survey, t scoring, loan approval,
fe gistration, credi Y Ans. : (d)
(c) Unsupervised Learning
(d) Reinforcement learning
Q. 4.20 [...] (d) A weak negative correlation between the two variables  ✓ Ans. : (a)
Explanation : A correlation coefficient value of exactly 1.0 means there is a perfect positive relationship between the two variables : for a positive increase in one variable, there is also a positive increase in the second variable. A value of −1 means there is a perfect negative relationship between the two variables; this shows that the variables move in opposite directions : for a positive increase in one variable, there is a decrease in the second variable. The strength of the relationship varies in degree based on the absolute value of the correlation coefficient.

Q. 4.21 The sales of a company (in thousands) for each month are shown in the table below. Find the least square regression line y = ax + b. Use the line as a model to estimate the sales of the company in the 6th month.
X (month) | 1 | 2 | 3 | 4 | 5
Y (sales) | 12 | 19 | 29 | 37 | 45
(a) 84 (b) 60 (c) 74 (d) 80  ✓ Ans. : (a)
Explanation : 84, from calculation of the regression line Y = aX + b.

Q. 4.22 Suppose you want to develop a multi-class classifier and you are only allowed to use binary logistic classifiers to solve the multi-class classification problem. Given a training set with 2 classes, this classifier can [...]

Q. 4.23 [...] (b) Multilabel classification (c) Simple linear regression (d) Multiple linear regression  ✓ Ans. : (d)
Explanation : Since we are predicting the price, this is an example of regression. The selling price depends on multiple factors, so it is multiple linear regression.

Q. 4.24 A feature F1 can take the values A, B, C, D, E and represents the grade of students from a college. Which of the following statements is true in the following case?
(a) Feature F1 is an example of a nominal variable (b) Feature F1 is an example of an ordinal variable (c) It doesn't belong to any of the above categories (d) Both of these  ✓ Ans. : (b)
Explanation : Ordinal variables are the variables which have some order in their categories. For example, grade A should be considered a higher grade than grade B.

Q. 4.25 Which of the following is true for a decision tree?
(a) A decision tree is an example of a linear classifier (b) The entropy of a node typically decreases as we go down a decision tree (c) Entropy is a measure of purity (d) An attribute with lower mutual information should be preferred to other attributes  ✓ Ans. : (b)
Explanation : Property of a decision tree.
Q. 4.26 Suppose you got a situation where you find that a linear regression model is underfitting the data. In such a situation, which of the following options would you consider?
(a) [...] (b) Will start introducing higher degree features [...]
Explanation : We can induce more features in the variable space, or add some polynomial degree variables, to make the model more complex and able to fit the data better.

Q. 4.30 Consider a simple linear regression model with one independent variable. If we increase the value of the input X by 1 unit, by how much will the output variable (Y) change?
(a) 1 unit (b) By slope (c) By intercept (d) None  ✓ Ans. : (b)
Explanation : The equation for simple linear regression is Y = a + bX. Now if we increase the value of X by 1, then the value of Y becomes a + b(x + 1), i.e., the value of Y is incremented by b, the slope.
Q. 4.27 Consider the dataset S given below :

Elevation | Road Type | Speed Limit | Speed
steep | Uneven | Yes | Slow
steep | Smooth | Yes | Slow
flat | Uneven | No | Fast
steep | Smooth | No | Fast

Elevation, Road Type and Speed Limit are the features, and Speed is the target label that we want to predict. Find the entropy of the dataset S as given above :
(a) 0.5 (b) 0 (c) 1 (d) 0.7  ✓ Ans. : (c)
Explanation : For a dataset S with C many classes, the entropy of the set S, given by H(S), is defined as H(S) = −Σ p_c log₂ p_c, where p_c is the probability of an element of S belonging to class c. In this case, P(slow) = 0.5 and P(fast) = 0.5, and hence H(S) = 1.

Q. 4.28 Find the information gain if the dataset is split at the feature "Elevation" :
(a) 1 (b) 0 (c) 0.675 (d) 0.325  ✓ Ans. : (d)
Explanation : The feature Elevation has 2 values = {Steep, Flat}. For a split on the feature Elevation, one subtree would have the inputs with feature value Steep and the other those with Flat. For the value Steep there are 3 examples, out of which P(slow) = 2/3 and P(fast) = 1/3; thus Entropy(Steep) ≈ 0.9 and Entropy(Flat) = 0. Thus Information Gain = 1 − ((3/4) × 0.9 + (1/4) × 0) = 1 − 0.675 = 0.325.

Q. 4.29 Find the feature on which the parent node must be chosen to split the dataset S, based on information gain :
(a) Speed Limit (b) Road Type (c) Elevation  ✓ Ans. : (a)
Explanation : Using the information gain formula, we can find the information gain for each of the features. The values are IG(S, Elevation) = 0.325, IG(S, Road Type) = 0 and IG(S, Speed Limit) = 1. Since the decision tree is constructed on the feature with the highest information gain, Speed Limit is the answer.

Q. 4.31 The following table shows the results of a recently conducted study on the correlation of the number of hours spent driving with the risk of developing acute backache. Find the equation of the best fit line for this data.

No of hrs (x) | Risk on a scale (y)
10 | 95
9 | 80
2 | 10
15 | 50
10 | 45
16 | 98
11 | 38
16 | 93

Choose which of the options is correct :
(a) y = 3.39x + 11.62 (b) y = 4.69x + 12.58 (c) y = 4.59x + 12.58 (d) y = 3.59x + 10.58  ✓ Ans. : (c)
Explanation : For each x calculate the value of y using the given equations, then calculate the error for each equation. The equation with the lowest error is the desired answer.

Q. 4.32 Pruning is a technique that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. This is done in order to avoid
(a) overfitting (b) underfitting (c) Both (d) None  ✓ Ans. : (a)
Explanation : Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.
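The entropy and information-gain arithmetic of Q. 4.27 to Q. 4.29 can be reproduced with a short script (a sketch; variable names are ours) :

    import math

    def entropy(labels):
        # H(S) = -sum p_c log2 p_c over the classes present in S
        n = len(labels)
        return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                    for c in set(labels))

    def info_gain(feature, labels):
        # IG = H(S) minus the size-weighted entropy of each subset of the split
        n = len(labels)
        rem = sum(len(sub) / n * entropy(sub)
                  for v in set(feature)
                  for sub in [[l for f, l in zip(feature, labels) if f == v]])
        return entropy(labels) - rem

    speed     = ["Slow", "Slow", "Fast", "Fast"]
    elevation = ["steep", "steep", "flat", "steep"]
    road_type = ["Uneven", "Smooth", "Uneven", "Smooth"]
    speed_lim = ["Yes", "Yes", "No", "No"]

    print(entropy(speed))               # 1.0
    print(info_gain(elevation, speed))  # 0.311 (0.325 with the book's rounding)
    print(info_gain(road_type, speed))  # 0.0
    print(info_gain(speed_lim, speed))  # 1.0 -> split on Speed Limit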
Q. 4.35 Find the number of syntactically distinct hypotheses for the table given in Question 33.
Hint : Each attribute can have 2 more values : ? and ∅.
(a) 160 (b) 320 (c) 80 (d) 40  ✓ Ans. : (b)
Explanation : The possible values for the attributes are 3 × 2 × 2 × 2, and each attribute can have 2 more values, so 5 × 4 × 4 × 4 = 320.

Q. 4.36 We can get multiple local optimum solutions if we solve a linear regression problem by minimizing the sum of squared errors using gradient descent.
(a) True (b) False  ✓ Ans. : (b)
Explanation : We do not get multiple optimum solutions, since the sum-of-squared-errors cost of linear regression is convex.

Q. 4.37 When a decision tree is grown to full depth, it is more likely to fit the noise in the data.
(a) True (b) False  ✓ Ans. : (a)
Explanation : As the tree grows to full depth, the possibility of fitting redundant and noisy data increases.

Q. 4.38 As the number of training examples goes to infinity, your model trained on that data will have
(a) Lower variance (b) Higher variance (c) Same variance  ✓ Ans. : (a)

Q. 4.41 Regarding bias and variance, which of the following statements are true?
(a) Models which overfit are more likely to have high bias (b) Models which overfit are more likely to have low bias (c) Models which overfit are more likely to have high variance (d) Models which overfit are more likely to have low variance  ✓ Ans. : (b), (c)
Explanation : The bias of a classifier gets reduced when the training set error lowers down to zero, causing low bias, while due to overfitting the gap between the training error and the test error becomes higher, causing high variance.

Q. 4.42 Consider a binary classification problem. Suppose I have trained a model on a linearly separable training set, and now I get a new labeled data point which is correctly classified by the model, and far away from the decision boundary. If I now add this new point to my earlier training set and re-train, in which cases is the learnt decision boundary likely to change?
(a) When my model is a perceptron (b) When my model is logistic regression (c) When my model is an SVM (d) When my model is HMM  ✓ Ans. : (b)
Explanation : If we add a new point to an already trained logistic regression model and re-train, the boundary shifts, because every training point contributes to the likelihood; an SVM's boundary depends only on its support vectors, and a converged perceptron makes no update on a correctly classified point.

Explanation (continued) : After adding a feature in the feature space, whether that feature is important or unimportant, R-squared always increases.
Here the condition is a conjunction of attribute tests : it consists of one or more attribute tests which are logically ANDed, and y represents the class label.
- Car 1 triggers rule R1 → unacceptable
- Car 2 triggers rule R2 → acceptable
- Car 3 triggers both R3 and R4
- Car 4 triggers none of the rules
5.1.3 Characteristics of Rule based Classifier

1. Mutually Exclusive Rules : A rule based classifier comprises mutually exclusive rules, where the rules are independent of each other. Each and every record is covered by at most one rule.
Solution : Arrange the rules in order.

Arrangement of Rules in Order

Rules are assigned a priority, and based on this they are arranged and ranks are associated. When a test record is given as an input to the classifier, the label of the class of the highest-priority triggered rule is assigned. If the test record does not trigger any of the rules, then a default class is assigned.
Rule-based ordering

In rule based ordering, individual rules are ranked based on their quality.

— R1 : (Buying_Price = high) ∧ (Maintenance_Price = high) ∧ (Safety = low) → Car_evaluation = unacceptable

Class-based ordering

In class based ordering, rules which belong to the same class are grouped together.

— R2 : (Buying_Price = medium) ∧ (Maintenance_Price = high) ∧ (Lug_Boot = big) → Car_evaluation = acceptable
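Such an ordered rule set behaves like a first-match list; a toy sketch (attribute names follow R1 and R2 above; the default class is our assumption) :

    rules = [
        # R1 : (Buying_Price = high) AND (Maintenance_Price = high) AND (Safety = low)
        ({"Buying_Price": "high", "Maintenance_Price": "high", "Safety": "low"},
         "unacceptable"),
        # R2 : (Buying_Price = medium) AND (Maintenance_Price = high) AND (Lug_Boot = big)
        ({"Buying_Price": "medium", "Maintenance_Price": "high", "Lug_Boot": "big"},
         "acceptable"),
    ]

    def classify(record, rules, default="acceptable"):
        # Rules are tried in priority order; the first triggered rule assigns the class.
        for condition, label in rules:
            if all(record.get(attr) == val for attr, val in condition.items()):
                return label
        return default  # no rule triggered : fall back to the default class

    car1 = {"Buying_Price": "high", "Maintenance_Price": "high", "Safety": "low"}
    print(classify(car1, rules))  # unacceptable (R1 triggers first)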
(b) Grow a rule using the learn-one-rule-at-a-time method. Either we can use the general-to-specific or the specific-to-general strategy.
(c) Remove the training records covered by the rule; otherwise the next rule will be exactly the same as the previous rule.
(d) Repeat the above two steps until the stopping criterion is reached. The stopping criterion can be the computation of significant gain.

2. Indirect Method : From Decision Trees

In the indirect method we will learn how to generate a rule-based classifier by extracting IF-THEN rules from a decision tree.

Points that we need to remember while extracting a rule from a decision tree :
- For each path from the root to a leaf node, one rule is created.
- The leaf node represents the class prediction, forming the rule consequent.
3. The error is propagated back to the hidden layer so that the weights connected in each layer of the network can be properly adjusted.
4. When a back propagation network is trained for the correct classification of the training data, a testing data set is applied to the network in order to check whether the unseen patterns are correctly classified or not.

Solution :

Feedforward stage
Y₁ = f(0.6 × 2 + 1 × 0) = 0.76
Y₂ = f(−1 + 0.6 × 1 + 2 × 0) = 0.4
dO = (t − O) f′(O) = (0.9 − 0.2) × 0.2 × (1 − 0.2) = 0.112

Updation of weights
First we will update the weights between the Hidden and the Output Layer as,
W₁ = W₁ + η × dO × Y₁ = −1 + 0.3 × 0.112 × 0.76 = −0.975
W₂ = W₂ + η × dO × Y₂ = 1 + 0.3 × 0.112 × 0.4 = 1.013
The bias is updated as
W₀ = W₀ + η × dO = −1 + 0.3 × 0.112 = −0.966

Now we will calculate the error of the hidden layer, and update the weights between the Input and the Hidden Layer as,
V₁₁ = V₁₁ + η × dY₁ × Z₁ = 2 + 0.3 × (−0.02) × 0.6 = 1.9964
V₂₁ = V₂₁ + η × dY₁ × Z₂ = 1 + 0.3 × (−0.02) × 0 = 1
The bias is updated as
V₀ = V₀ + η × dY = −1 + 0.3 × 0.026 = −0.992
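The update arithmetic above can be verified in a few lines (a sketch using only the printed values, with learning rate η = 0.3 and a sigmoid output unit) :

    # Output-layer delta for a sigmoid unit : dO = (t - O) * O * (1 - O)
    t, O, eta = 0.9, 0.2, 0.3
    dO = (t - O) * O * (1 - O)
    print(dO)                   # 0.112

    # Hidden-to-output weight updates : w_new = w + eta * dO * y
    Y1, Y2 = 0.76, 0.4
    print(-1 + eta * dO * Y1)   # -0.9745 (about -0.975)
    print( 1 + eta * dO * Y2)   #  1.0134 (about  1.013)
    print(-1 + eta * dO)        # -0.9664 (bias, about -0.966)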
5.3 BAYESIAN BELIEF NETWORK

To understand the Bayesian belief network, first we will revise the concepts of Bayes theorem. The conditional probability is represented in the following way :

P(C|A) = P(A, C) / P(A)
P(A|C) = P(A, C) / P(C)

The Bayes theorem is defined using the following equation :

P(C|A) = P(A|C) P(C) / P(A)

— Now the question is, can we estimate P(C | A₁, A₂, ..., Aₙ) directly from the data? The approach used is to compute the posterior probability P(C | A₁, A₂, ..., Aₙ) for all values of C using the Bayes theorem :

P(C | A₁, A₂, ..., Aₙ) = P(A₁, A₂, ..., Aₙ | C) P(C) / P(A₁, A₂, ..., Aₙ)

— Now we have to select the value of C that maximizes P(C | A₁, A₂, ..., Aₙ). It is as good as selecting the C that gives a maximum value for P(A₁, A₂, ..., Aₙ | C) P(C). The next question is the calculation of P(A₁, A₂, ..., Aₙ | C), which we will see in the following sections.
[Training data table : ten records with attributes Loan (Y/N), Status (S/M/D), Salary (values such as 100k, 85k, 120k, 75k, 90k) and class label Savings (Y/N); the individual rows are not fully recoverable from this scan.]

Class prior : P(C) = N_C / N, computed for each class C.

For the continuous attribute, the class-conditional density is modelled as a Gaussian :

P(Aᵢ | C) = (1 / √(2πσ²)) exp (−(x − μ)² / (2σ²))

For (Salary, N) : mean μ = 110 and variance σ² = 2975.

P(Salary = 120 | N) = (1 / √(2π × 2975)) e^(−(120 − 110)² / (2 × 2975)) = 0.0072

For the categorical attributes the conditional probabilities are estimated by counting :
P(Loan = N | N) = 4/7, P(Loan = N | Y) = 1
P(Status = S | N) = 2/7, P(Status = D | N) = 1/7, P(Status = M | N) = 4/7
P(Status = S | Y) = 2/3, P(Status = D | Y) = 1/3, P(Status = M | Y) = 0

For the record X = (Loan = N, Status = M, Salary = 120K) :

p(X | Savings = N) = P(Loan = N | Savings = N) × P(Status = M | Savings = N) × P(Salary = 120K | Savings = N) = 4/7 × 4/7 × 0.0072 = 0.0024

p(X | Savings = Y) = P(Loan = N | Savings = Y) × P(Status = M | Savings = Y) × P(Salary = 120K | Savings = Y) = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X|N) P(N) > P(X|Y) P(Y), therefore P(N|X) > P(Y|X) ⇒ Savings = N for the given record.
Example 2 : Naive Bayes Classifier (classifying a Red, Domestic SUV)

Colour → P(Red | yes) = 3/5, P(Red | no) = 2/5
Type → P(SUV | yes) = 1/5, P(SUV | no) = 3/5; P(Sports | yes) = 4/5, P(Sports | no) = 2/5
Origin → P(Domestic | yes) = 2/5, P(Domestic | no) = 3/5; P(Imported | yes) = 3/5, P(Imported | no) = 2/5

P(Red, Domestic, SUV | Y) × P(Y) = P(Red | yes) × P(Domestic | yes) × P(SUV | yes) × P(Y) = 3/5 × 2/5 × 1/5 × 5/10 = 0.024
P(Red, Domestic, SUV | N) × P(N) = P(Red | no) × P(Domestic | no) × P(SUV | no) × P(N) = 2/5 × 3/5 × 3/5 × 5/10 = 0.072

Since 0.072 > 0.024, the tuple is classified as No.
Example 5.3.1 (MU - Dec. 19, 10 Marks)
For an unknown tuple t = <Outlook = Sunny, Temperature = Cool, Wind = Strong>, use the naive Bayes classifier to find whether the class for PlayTennis is yes or no. The dataset is given below :

Outlook | Temperature | Wind | Play Tennis
Sunny | Hot | Weak | No
Sunny | Hot | Strong | No
Overcast | Hot | Weak | Yes
Rain | Mild | Weak | Yes
Rain | Cool | Weak | Yes
Rain | Cool | Strong | No
Overcast | Cool | Strong | Yes
Sunny | Mild | Weak | No
Sunny | Cool | Weak | Yes
Rain | Mild | Weak | Yes
Sunny | Mild | Strong | Yes
Overcast | Mild | Strong | Yes
Overcast | Hot | Weak | Yes
Rain | Mild | Strong | No

Solution :
P(yes) × P(Sunny | yes) × P(Cool | yes) × P(Strong | yes) = 9/14 × 2/9 × 3/9 × 3/9 = 0.0158
P(no) × P(Sunny | no) × P(Cool | no) × P(Strong | no) = 5/14 × 3/5 × 1/5 × 3/5 = 0.0257
Since 0.0257 > 0.0158, the class for PlayTennis is No.
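The counting in Example 5.3.1 automates directly; a minimal categorical naive Bayes over the PlayTennis table (no smoothing, to match the hand computation; names are ours) :

    data = [  # (Outlook, Temperature, Wind) -> PlayTennis, as in the table above
        ("Sunny","Hot","Weak","No"),       ("Sunny","Hot","Strong","No"),
        ("Overcast","Hot","Weak","Yes"),   ("Rain","Mild","Weak","Yes"),
        ("Rain","Cool","Weak","Yes"),      ("Rain","Cool","Strong","No"),
        ("Overcast","Cool","Strong","Yes"),("Sunny","Mild","Weak","No"),
        ("Sunny","Cool","Weak","Yes"),     ("Rain","Mild","Weak","Yes"),
        ("Sunny","Mild","Strong","Yes"),   ("Overcast","Mild","Strong","Yes"),
        ("Overcast","Hot","Weak","Yes"),   ("Rain","Mild","Strong","No"),
    ]

    def score(x, label):
        rows = [r for r in data if r[-1] == label]
        p = len(rows) / len(data)              # prior P(C)
        for i, value in enumerate(x):          # product of P(A_i | C)
            p *= sum(1 for r in rows if r[i] == value) / len(rows)
        return p

    t = ("Sunny", "Cool", "Strong")
    print(score(t, "Yes"))   # about 0.0158
    print(score(t, "No"))    # about 0.0257 -> PlayTennis = No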
A Bayesian belief network is an uncommon sort of chart (called a directed graph) together with a table of probabilities. The graph comprises arcs and nodes. The discrete or continuous variables are represented by the nodes, and the causal relationships between the variables are represented by the arcs.

[Conditional probability tables of the Bus strike example; the legible entries are P(Bus strike) : T = 0.2, F = 0.8, and a conditional row 0.4 / 0.6.]

Solution :
P(Ram held up) = P(Ram held up | Bus strike) × P(Bus strike) + P(Ram held up | No bus strike) × P(No bus strike)
[Conditional probability table of the Sprinkler example, with rows for the four combinations of Sprinkler and Rain : (False, False), (False, True), (True, False), (True, True).]

Solution :
First we will write the joint probability distribution equation using the given graph.

Joint probability distribution : P(G, S, R) = P(G | S, R) P(S | R) P(R)

We need to find the reason for the wet grass (Rain or Sprinkler). First we will find the probability of wet grass due to rain :

P(R = T | G = T) = P(G = T, R = T) / P(G = T)

Now each probability term is to be written in the form of the joint probability distribution as follows :
P(G = T, S = T, R = T) = ...
5.4 MARKOV MODEL

5.4.1 Markov Models

A Markov model is a discrete finite system that has N distinct states. The model starts at time t = 1 in what is called the initial state. As the time step increases, the system moves from the present state to the next state with the transition probabilities that are assigned to the present state. Such a type of system is known as a discrete, or finite, Markov model.

In a discrete Markov model, every a_ij indicates the probability of transition to state j from state i. The a_ij are stored in a matrix A = {a_ij}. p_i is the probability to begin from a given state i; these start probabilities are indicated by a vector P.

At time t the state of the system depends only on the state of the system at time t − 1.

Markov Chains : the probabilities are independent of t when the process is "stationary". So, for all t,

P(X_{t+1} = x_j | X_t = x_i) = p_ij

This can be inferred as : p_ij represents the probability with which the system will be present in state j at the next step, without depending on the value of t, if the present system is in state i.

Example 5.4.1 : Suppose a person has purchased milk; then there is a 90% chance that his next purchase will also be milk. If the same person purchased bread, then there is an 80% chance that his next purchase will also be bread. Let's assume that a person currently purchased milk. What is the probability that he will purchase bread two purchases from now and three purchases from now?
[State transition diagram : milk → milk 0.9, milk → bread 0.1, bread → bread 0.8, bread → milk 0.2]

Solution :

A  = | 0.9  0.1 |
     | 0.2  0.8 |

A² = | 0.9  0.1 | | 0.9  0.1 | = | 0.83  0.17 |
     | 0.2  0.8 | | 0.2  0.8 |   | 0.34  0.66 |

The probability that he will purchase bread two purchases from now is 0.17.

A³ = | 0.9  0.1 | | 0.83  0.17 | = | 0.781  0.219 |
     | 0.2  0.8 | | 0.34  0.66 |   | 0.438  0.562 |

The probability that he will purchase bread three purchases from now is 0.219.
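The matrix powers above are easy to check with NumPy (a sketch; state 0 = milk, state 1 = bread) :

    import numpy as np

    A = np.array([[0.9, 0.1],    # milk  -> (milk, bread)
                  [0.2, 0.8]])   # bread -> (milk, bread)

    A2 = A @ A                   # two purchases ahead
    A3 = A2 @ A                  # three purchases ahead

    print(A2[0, 1])              # P(bread two purchases after milk)   = 0.17
    print(A3[0, 1])              # P(bread three purchases after milk) = 0.219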
Example 5.4.2 : Given that the weather on day 1 (t = 1) is sunny, what is the probability for the observation O = sunny, sunny, sunny, rainy, rainy, sunny, cloudy, sunny? The states are represented as Rainy : 1, Cloudy : 2, Sunny : 3. The transition matrix is given as A, whose first row is (0.4, 0.3, 0.3).

Solution :
P(O | Model) = P(3) P(3|3) P(3|3) P(1|3) P(1|1) P(3|1) P(2|3) P(3|2)
= 1.0 × 0.8 × 0.8 × 0.1 × 0.4 × 0.3 × 0.1 × 0.2 = 1.536 × 10⁻⁴
Example 5.4.3 (MU - May 19, 10 Marks)
Consider the Markov chain model for 'Rain' and 'Dry' shown in the following figure. Two states : 'Rain' and 'Dry'. Transition probabilities : P('Rain'|'Rain') = 0.2, P('Dry'|'Rain') = 0.65, P('Rain'|'Dry') = 0.3, P('Dry'|'Dry') = 0.7. Initial probabilities : say P('Rain') = 0.4, P('Dry') = 0.6. Calculate the probability of the sequence of states {'Dry', 'Rain', 'Rain', 'Dry'}.

Solution :
P(O | Model) = P(Dry, Rain, Rain, Dry | Model) = P(Dry) P(Rain | Dry) P(Rain | Rain) P(Dry | Rain)
= 0.6 × 0.3 × 0.2 × 0.65 = 0.0234
1. Evaluation Problem : Given the observation sequence O = O₁, O₂, ....., O_T and λ, how to compute P(O | λ).

Then, for the state path F F B F F B B and observation sequence O = H T T H H H T,
P(O | λ) = b_F(H) b_F(T) b_B(T) b_F(H) b_F(H) b_B(H) b_B(T) = 0.5 × 0.5 × 0.2 × 0.5 × 0.5 × 0.8 × 0.2
Forward procedure

Initialisation : α₁(i) = πᵢ bᵢ(O₁)
Induction : α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(O_{t+1})

O = H T T H H H T

Similarly, α₂ to α_T are calculated. Based on this, the path Q is decided. Once we get Q we can easily calculate P(O | λ).

Backward procedure

Initialisation : β_T(i) = 1
Induction : β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_{t+1}) β_{t+1}(j)

O = H T T H H H T

β_T(F) = 1
β_T(B) = 1
β_{T−1}(F) = a_FF b_F(T) × 1 + a_FB b_B(T) × 1 = 0.9 × 0.5 + 0.1 × 0.2 = 0.47
β_{T−1}(B) = a_BF b_F(T) × 1 + a_BB b_B(T) × 1 = 0.3 × 0.5 + 0.7 × 0.2 = 0.29
The maximum probability is of F, so F is selected.

β_{T−2}(F) = a_FF b_F(H) × β_{T−1}(F) + a_FB b_B(H) × β_{T−1}(B) = 0.9 × 0.5 × 0.47 + 0.1 × 0.8 × 0.29 = 0.2347
β_{T−2}(B) = a_BF b_F(H) × β_{T−1}(F) + a_BB b_B(H) × β_{T−1}(B) = 0.3 × 0.5 × 0.47 + 0.7 × 0.8 × 0.29 = 0.2329

The maximum probability is of F, so F is selected. Similarly, the remaining β values are calculated; based on this the path Q is decided. Once we get Q we can easily calculate P(O | λ).

2. Decoding Problem : Given the observation sequence O and λ, how to choose the state sequence Q = q₁ q₂ ... q_T.

Forward-Backward algorithm

γ_t(i) = α_t(i) β_t(i) / Σ_{j=1}^{N} α_t(j) β_t(j)

γ_t(F) and γ_t(B) are calculated and the maximum is selected.

3. Learning Problem : Use the EM algorithm :
- Compute the probability of each state at each position using the forward and backward probabilities.
- Compute the probability of each pair of states at each pair of consecutive positions i and i + 1 using forward(i) and backward(i + 1).
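The evaluation problem itself is solved by the forward procedure; a sketch using the transition and emission values that appear in the backward computation above (a_FF = 0.9, a_FB = 0.1, a_BF = 0.3, a_BB = 0.7, b_F(H) = 0.5, b_B(H) = 0.8); the uniform initial vector is our assumption :

    import numpy as np

    def forward(obs, a, b, pi):
        # alpha[i] = P(O_1..O_t, q_t = i); summing the final alpha gives P(O | model)
        alpha = pi * b[:, obs[0]]          # initialisation
        for o in obs[1:]:                  # induction
            alpha = (alpha @ a) * b[:, o]
        return alpha.sum()                 # termination

    a  = np.array([[0.9, 0.1], [0.3, 0.7]])   # F and B state transitions
    b  = np.array([[0.5, 0.5], [0.8, 0.2]])   # emission of H (col 0) and T (col 1)
    pi = np.array([0.5, 0.5])                 # assumed start probabilities
    obs = [0, 1, 1, 0, 0, 0, 1]               # O = H T T H H H T
    print(forward(obs, a, b, pi))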
Fig. 5.5.1 : Different types of data (linear data, non-linear data, inseparable data)
f(x) = w·x + b

In the above equation, w represents the weight vector and b represents the bias.

By scaling the values of w and b we can represent the optimal hyperplane in many ways. As a matter of convention, among all the possible notations of the hyperplane the one selected is

|w·x + b| = 1

Here x represents the training records closest to the hyperplane. In general, the training records that are closest to the hyperplane are called support vectors. This notation is called the canonical hyperplane.

The distance between a point x and a hyperplane (w, b) is given by the result of geometry as follows,

Distance = |w·x + b| / ||w||

For the canonical hyperplane the distance to the support vectors is

Distance = |w·x + b| / ||w|| = 1 / ||w||

The margin is twice the distance to the nearest samples,

M = 2 / ||w||

min_{w,b} L(w) = (1/2) ||w||², subject to yᵢ (wᵀxᵢ + b) ≥ 1 for all i,

where yᵢ represents the labels of the training records. (Fig. 5.5.3 : Solution to find maximum margin)

This is a problem of Lagrangian optimization that can be solved using Lagrange multipliers to calculate the weight vector w and the bias b of the optimal hyperplane.

Let's assume that we have 2 classes of 2-dimensional data to separate. Let's also assume that each class consists of only one point.
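Before working through the algebra, the result can be cross-checked numerically with scikit-learn (assumed available); the two points are chosen so that x₁ − x₂ = (−3, −3), matching the worked numbers below :

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[-1.0, -1.0],   # class +1
                  [ 2.0,  2.0]])  # class -1
    y = np.array([1, -1])

    # A very large C approximates the hard-margin SVM
    clf = SVC(kernel="linear", C=1e6).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]

    print(w, b)                    # about [-0.333, -0.333], 0.333
    print(2 / np.linalg.norm(w))   # margin M = 2 / ||w||
    print(np.abs(clf.dual_coef_))  # Lagrange multipliers, about 0.11 each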
The constraints are,

c₁(w, b) = y₁ [w·x₁ + b] − 1 ≥ 0
c₂(w, b) = −[w·x₂ + b] − 1 ≥ 0

The Lagrangian is

L(w, b, m) = (1/2) ||w||² − m₁ (y₁(w·x₁ + b) − 1) − m₂ (y₂(w·x₂ + b) − 1)
           = (1/2) ||w||² − m₁ ((w·x₁ + b) − 1) + m₂ ((w·x₂ + b) + 1)

We solve for the gradient of the Lagrangian,

∇L(w, b, m) = ∇f(w) − m₁ ∇c₁(w, b) + m₂ ∇c₂(w, b) = 0

∂L/∂w = w − m₁ x₁ + m₂ x₂ = 0        ...(5.5.1)
∂L/∂b = −m₁ + m₂ = 0                 ...(5.5.2)
∂L/∂m₁ = (w·x₁ + b) − 1 = 0          ...(5.5.3)
∂L/∂m₂ = (w·x₂ + b) + 1 = 0          ...(5.5.4)

Equating Equation (5.5.3) and (5.5.4), we get

(w·x₁ + b) − 1 = (w·x₂ + b) + 1
(w·x₁) − (w·x₂) = 2
w(x₁ − x₂) = 2
−3w₁ − 3w₂ = 2
w₁ = −(0.67 + w₂)                    ...(5.5.5)

From Equation (5.5.2), m₁ = m₂, and solving, m₁ = m₂ = 0.11.
Fig. 5.5.4 : Mapping of input space to feature space

Particularly, the data is preprocessed with x → φ(x).

Fig. 5.5.5 : Mapping of one feature space to another feature space

A kernel function transforms the data into an easily understandable form. This is done via mapping the input space to a feature space. In support vector machines, the inner product of two vectors is calculated, and the result of this is always a single number. When we replace this inner product by a kernel, it is called the kernel trick.

There are different algorithms that use different kinds of kernel functions. Among the different types of kernel functions, such as nonlinear, radial basis function (RBF), polynomial, and sigmoid, RBF is the most used. The reason for this is that along the X-axis the radial basis function gives a localized and finite response.

The number of support vectors is determined based on different criteria, such as the complexity of the model and how much slack is allowed. One or more support vectors need to be defined for every complication in the final model from the input space. A support vector machine's output comprises support vectors and alphas; the alphas specify the effect of the support vectors on the final decision.

If we select a model with high complexity, it will result in overfitting. For better generalization, if a large margin is selected then it may lead to incorrect classification. Accuracy depends on the trade-off between these two selection criteria. If we overfit the data, then the number of support vectors may vary from very few to every single point. This trade-off is controlled through the selection of the kernel and its parameters.

In a support vector machine, the test data points are classified by taking the dot product of each support vector with the test point. Hence the computational complexity increases if we increase the number of support vectors. Classification of test points will be faster if we have fewer support vectors.
The Gaussian kernel is defined as,

K(x, y) = exp (−||x − y||² / (2σ²))

The sigmoid kernel is defined as,

K(xᵢ, xⱼ) = tanh (κ xᵢ·xⱼ + c), for some (not every) κ > 0 and c < 0.

A related radial-basis kernel, K(x, y) = exp (−σ ||x − y||), is defined similarly and can be used in regression problems.
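The kernel trick can be made concrete by comparing the Gaussian kernel computed from its definition with scikit-learn's (assumed available) implementation; a sketch :

    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    x = np.array([[1.0, 2.0]])
    y = np.array([[2.0, 0.0]])
    sigma = 1.0

    # From the definition : K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    k_manual = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

    # scikit-learn parameterizes it as exp(-gamma ||x - y||^2), so gamma = 1/(2 sigma^2)
    k_lib = rbf_kernel(x, y, gamma=1 / (2 * sigma ** 2))[0, 0]

    print(k_manual, k_lib)   # both about 0.0821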
5.6 CLUSTERING

5.6.1 K-Means Clustering

— To solve the well-known clustering problem, K-means is used, which is one of the simplest unsupervised learning algorithms. The given data set is classified, assuming some prior number of clusters, through a simple and easy procedure. In k-means clustering, one centroid is defined for each cluster; in total there are k centroids.

— The centroids should be defined in a tricky way, because the result differs based on the location of the centroids. To get better results we need to place the centroids as far away from each other as possible. Next, each point from the given data set is assigned to the group with the closest centroid. This process is repeated for all the points.

— The first step is finished when all points are grouped. In the next step, new k centroids are calculated again from the result of the earlier step.

— After finding these new k centroids, a new grouping is done for the data points and the closest new centroids. This process is done iteratively.
The distance between each data point x and each cluster centre c is calculated, and the calculated means are used to group the data points into clusters. In outline :

1. Determine the centroid coordinates (initialise the centroids).
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance.
4. Recalculate the centroids and iterate until stable (no data point moves between groups).
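These steps translate almost line for line into code; a minimal NumPy sketch of the loop (not a production implementation; empty clusters are not handled) :

    import numpy as np

    def kmeans(points, centroids, n_iter=100):
        points = np.asarray(points, dtype=float)
        centroids = np.asarray(centroids, dtype=float)
        for _ in range(n_iter):
            # Steps 2-3 : distance of every object to every centroid, nearest group wins
            d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            groups = d.argmin(axis=1)
            # Step 4 : recompute each centroid as the mean of its group
            new = np.array([points[groups == k].mean(axis=0)
                            for k in range(len(centroids))])
            if np.allclose(new, centroids):  # stable : no centroid moved
                break
            centroids = new
        return centroids, groups

    # The four-item example below : items (1,1), (2,1), (4,3), (5,4); c1 = (1,1), c2 = (2,1)
    print(kmeans([(1, 1), (2, 1), (4, 3), (5, 4)], [(1, 1), (2, 1)]))
    # -> centroids (1.5, 1) and (4.5, 3.5); items 1, 2 in group 1 and items 3, 4 in group 2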
Example : Given the data {2, 4, 10, 12, 3, 20, 30, 11, 25, 31} with k = 2 and initial means m₁ = 3, m₂ = 4 :

Re-assign : k₁ = {2, 3, 4, 10, 11, 12}, k₂ = {20, 25, 30, 31}; m₁ = 7, m₂ = 26.5

Final clusters : k₁ = {2, 3, 4, 10, 11, 12}, k₂ = {20, 25, 30, 31}
Example : Apply k-means with k = 2 on the following items.

Item | X | Y
1 | 1 | 1
2 | 2 | 1
3 | 4 | 3
4 | 5 | 4

Solution :
Suppose we use items 1 and 2 as the first centroids, c₁ = (1, 1) and c₂ = (2, 1).
The distance of item 1 = (1, 1) to c₁ = (1, 1) and to c₂ = (2, 1) is calculated as,
D = √((1 − 1)² + (1 − 1)²) = 0
D = √((1 − 2)² + (1 − 1)²) = 1

The distance of item 2 = (2, 1) to c₁ = (1, 1) and to c₂ = (2, 1) is calculated as,
D = √((2 − 1)² + (1 − 1)²) = 1
D = √((2 − 2)² + (1 − 1)²) = 0

The distance of item 3 = (4, 3) to c₁ = (1, 1) and to c₂ = (2, 1) is calculated as,
D = √((4 − 1)² + (3 − 1)²) = 3.61
D = √((4 − 2)² + (3 − 1)²) = 2.83

The distance of item 4 = (5, 4) to c₁ = (1, 1) and to c₂ = (2, 1) is calculated as,
D = √((5 − 1)² + (4 − 1)²) = 5
D = √((5 − 2)² + (4 − 1)²) = 4.24

Object clustering :
G⁰ = | 1 0 0 0 |   (group 1 : item 1)
     | 0 1 1 1 |   (group 2 : items 2, 3, 4)

Iteration 1 : Determine the centroids
c₁ = (1, 1)
c₂ = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)

The distance of item 1 = (1, 1) to c₁ = (1, 1) and to c₂ = (11/3, 8/3) is calculated as,
D = √((1 − 1)² + (1 − 1)²) = 0

The distance of item 2 = (2, 1) to c₁ = (1, 1) and to c₂ = (11/3, 8/3) is calculated as,
D = √((2 − 1)² + (1 − 1)²) = 1
D = √((2 − 11/3)² + (1 − 8/3)²) = 2.36
Iteration 2 : Determine the centroids
c₁ = ((1 + 2)/2, (1 + 1)/2) = (3/2, 1)
c₂ = ((4 + 5)/2, (3 + 4)/2) = (9/2, 7/2)

The distance of item 1 = (1, 1) to c₁ = (3/2, 1) and to c₂ = (9/2, 7/2) is calculated as,
D = √((1 − 3/2)² + (1 − 1)²) = 0.5
D = √((1 − 9/2)² + (1 − 7/2)²) = 4.30

The distance of item 2 = (2, 1) to c₁ = (3/2, 1) and to c₂ = (9/2, 7/2) is calculated as,
D = √((2 − 3/2)² + (1 − 1)²) = 0.5
D = √((2 − 9/2)² + (1 − 7/2)²) = 3.54

The distance of item 3 = (4, 3) to c₁ = (3/2, 1) and to c₂ = (9/2, 7/2) is calculated as,
D = √((4 − 3/2)² + (3 − 1)²) = 3.20
D = √((4 − 9/2)² + (3 − 7/2)²) = 0.71

The distance of item 4 = (5, 4) to c₁ = (3/2, 1) and to c₂ = (9/2, 7/2) is calculated as,
D = √((5 − 3/2)² + (4 − 1)²) = 4.61
D = √((5 − 9/2)² + (4 − 7/2)²) = 0.71
Objects-centroids distances :
c₁ = (3/2, 1)  : 0.5 | 0.5 | 3.20 | 4.61 → group 1
c₂ = (9/2, 7/2) : 4.30 | 3.54 | 0.71 | 0.71 → group 2

Items 3 and 4 have minimum distance for group 2, so we cluster them in group 2.

Object clustering :
G² = | 1 1 0 0 |
     | 0 0 1 1 |

G² = G¹ : the objects do not move from one group to another any more. So, the final clusters are as follows :
— Items 1 and 2 are clustered in group 1.
— Items 3 and 4 are clustered in group 2.
Example : Apply k-means with k = 3 on the following data points.

Point | X | Y
1 | 2 | 10
2 | 2 | 5
3 | 8 | 4
4 | 5 | 8
5 | 7 | 5
6 | 6 | 4
7 | 1 | 2
8 | 4 | 9

Solution :
Suppose we use data points 1, 4 and 7 as the first centroids, c₁ = (2, 10), c₂ = (5, 8) and c₃ = (1, 2).
The distance of data point 1 = (2, 10) to c₁ = (2, 10), c₂ = (5, 8) and c₃ = (1, 2) is,
D = √((2 − 2)² + (10 − 10)²) = 0
D = √((2 − 5)² + (10 − 8)²) = 3.61
D = √((2 − 1)² + (10 − 2)²) = 8.06
The distance of data point 5 = (7, 5) to c₁ = (2, 10), c₂ = (5, 8) and c₃ = (1, 2) is,
D = √((7 − 2)² + (5 − 10)²) = 7.07
D = √((7 − 5)² + (5 − 8)²) = 3.61
D = √((7 − 1)² + (5 − 2)²) = 6.71

The distance of data point 6 = (6, 4) to c₁ = (2, 10), c₂ = (5, 8) and c₃ = (1, 2) is,
D = √((6 − 2)² + (4 − 10)²) = 7.21
D = √((6 − 5)² + (4 − 8)²) = 4.12
D = √((6 − 1)² + (4 − 2)²) = 5.39

The distance of data point 7 = (1, 2) to c₁ = (2, 10), c₂ = (5, 8) and c₃ = (1, 2) is,
D = √((1 − 2)² + (2 − 10)²) = 8.06
D = √((1 − 5)² + (2 − 8)²) = 7.21
D = √((1 − 1)² + (2 − 2)²) = 0

The distance of data point 8 = (4, 9) to c₁ = (2, 10), c₂ = (5, 8) and c₃ = (1, 2) is,
D = √((4 − 2)² + (9 − 10)²) = 2.24
D = √((4 − 5)² + (9 − 8)²) = 1.41
D = √((4 − 1)² + (9 − 2)²) = 7.62

The distances for the remaining points are computed in the same way.
From the above object-centroid distance matrix we can see :
— Data point 1 has minimum distance for group 1, so we cluster data point 1 in group 1.
— Data point 2 has minimum distance for group 3, so we cluster data point 2 in group 3.
— Data point 3 has minimum distance for group 2, so we cluster data point 3 in group 2.
— Data point 4 has minimum distance for group 2, so we cluster data point 4 in group 2.
— Data point 5 has minimum distance for group 2, so we cluster data point 5 in group 2.
— Data point 6 has minimum distance for group 2, so we cluster data point 6 in group 2.
— Data point 7 has minimum distance for group 3, so we cluster data point 7 in group 3.
— Data point 8 has minimum distance for group 2, so we cluster data point 8 in group 2.

Object clustering :
G⁰ = | 1 0 0 0 0 0 0 0 |
     | 0 0 1 1 1 1 0 1 |
     | 0 1 0 0 0 0 1 0 |

Iteration 1 : Determine the centroids
c₁ = (2, 10)
c₂ = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
c₃ = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Re-computing the distances, data point 8 now moves to group 1 :
G¹ = | 1 0 0 0 0 0 0 1 |
     | 0 0 1 1 1 1 0 0 |
     | 0 1 0 0 0 0 1 0 |

Iteration 2 : Determine the centroids
c₁ = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
c₂ = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
c₃ = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

Objects-centroids distances :
c₁ = (3, 9.5)    : 1.12 | 2.35 | 7.43 | 2.5 | 6.02 | 6.26 | 7.76 | 1.12 → group 1
c₂ = (6.5, 5.25) : 6.54 | 4.51 | 1.95 | 3.13 | 0.56 | 1.35 | 6.38 | 4.51 → group 2
c₃ = (1.5, 3.5)  : 6.52 | 1.58 | 6.52 | 5.7 | 5.7 | 4.52 | 1.58 | 6.04 → group 3

Based on this, data point 4 moves to group 1 :
G² = | 1 0 0 1 0 0 0 1 |
     | 0 0 1 0 1 1 0 0 |
     | 0 1 0 0 0 0 1 0 |

Iteration 3 : Determine the centroids
c₁ = ((2 + 5 + 4)/3, (10 + 9 + 8)/3) = (3.67, 9)
c₂ = ((8 + 7 + 6)/3, (4 + 5 + 4)/3) = (7, 4.33)
c₃ = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)

The final clusters are : data points 1, 4 and 8 in group 1; data points 3, 5 and 6 in group 2; data points 2 and 7 in group 3.
Example 5.6.5 (MU - May 15, 10 Marks)
Apply the K-means algorithm on the given data for k = 3. Use c₁(2), c₂(16) and c₃(38) as initial cluster centres.
Data : 2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30

Solution :
c₁ = 2, c₂ = 16, c₃ = 38
The numbers which are closest to each mean are grouped into the respective clusters :
k₁ = {2, 4, 6, 3}, k₂ = {12, 15, 16, 14, 21, 23, 25}, k₃ = {38, 35, 31, 30}
Again calculate the new mean for each new cluster group :
c₁ = 3.75, c₂ = 18, c₃ = 33.5
New clusters : k₁ = {2, 4, 6, 3}, k₂ = {12, 15, 16, 14, 21, 23, 25}, k₃ = {38, 35, 31, 30}
The clusters remain unchanged.
Final clusters : k₁ = {2, 4, 6, 3}, k₂ = {12, 15, 16, 14, 21, 23, 25}, k₃ = {38, 35, 31, 30}
5.6.2 Hierarchical Clustering

1. Compute the proximity matrix (distance matrix). 2. Assume each data point is a cluster.

In agglomerative hierarchical clustering the proximity matrix is symmetric, i.e., the numbers on the lower half will be the same as the numbers on the top half.
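SciPy (assumed available) implements exactly this procedure; a sketch using the D₁ to D₆ points of the example that follows (merge distances agree with the hand-computed matrices up to rounding) :

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # (x, y) coordinates of D1..D6 from the example below
    pts = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                    [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

    # method='single' is single linkage; 'complete' and 'average' give the other variants
    Z = linkage(pts, method="single")
    print(Z)   # each row : merged cluster indices, merge distance, new cluster size
    # scipy.cluster.hierarchy.dendrogram(Z) plots the dendrogram (needs matplotlib)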
     x | y
D₁ | 0.40 | 0.53
D₂ | 0.22 | 0.38
D₃ | 0.35 | 0.32
D₄ | 0.26 | 0.19
D₅ | 0.08 | 0.41
D₆ | 0.45 | 0.30

Solution :
Distance matrix :
D₁ | 0
D₂ | 0.24 | 0
D₃ | 0.22 | 0.15 | 0
D₄ | 0.37 | 0.20 | 0.15 | 0
D₅ | 0.34 | 0.14 | 0.28 | 0.29 | 0
D₆ | 0.23 | 0.25 | 0.11 | 0.22 | 0.39 | 0
     D₁     D₂     D₃     D₄     D₅     D₆

0.11 is the smallest. D₃ and D₆ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance ((D₃, D₆), D₁) = min (distance (D₃, D₁), distance (D₆, D₁)) = min (0.22, 0.23) = 0.22
Distance ((D₃, D₆), D₂) = min (distance (D₃, D₂), distance (D₆, D₂)) = min (0.15, 0.25) = 0.15
Distance ((D₃, D₆), D₄) = min (distance (D₃, D₄), distance (D₆, D₄)) = min (0.15, 0.22) = 0.15
Distance ((D₃, D₆), D₅) = min (distance (D₃, D₅), distance (D₆, D₅)) = min (0.28, 0.39) = 0.28

Similarly we will calculate all distances.

Distance matrix :
D₁ | 0
D₂ | 0.24 | 0
(D₃, D₆) | 0.22 | 0.15 | 0
D₄ | 0.37 | 0.20 | 0.15 | 0
D₅ | 0.34 | 0.14 | 0.28 | 0.29 | 0
     D₁     D₂   (D₃,D₆)   D₄     D₅

0.14 is the smallest. D₂ and D₅ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance ((D₂, D₅), (D₃, D₆)) = min (distance (D₂, D₃), distance (D₂, D₆), distance (D₅, D₃), distance (D₅, D₆)) = min (0.15, 0.25, 0.28, 0.39) = 0.15

Distance matrix :
D₁ | 0
(D₂, D₅) | 0.24 | 0
(D₃, D₆) | 0.22 | 0.15 | 0
D₄ | 0.37 | 0.20 | 0.15 | 0
     D₁   (D₂,D₅)  (D₃,D₆)   D₄
0.15 is the smallest, and the merging continues in the same way until a single cluster (D₃, D₆, D₄, D₂, D₅, D₁) remains.

Next, we represent the final dendrogram for single linkage :
[Dendrogram : D₃ D₆ D₄ D₂ D₅ D₁ joined under a single root node]

For complete linkage : 0.11 is the smallest. D₃ and D₆ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance ((D₃, D₆), D₁) = max (distance (D₃, D₁), distance (D₆, D₁)) = max (0.22, 0.23) = 0.23
Similarly we calculate all distances, and the merging continues until the following matrix is reached :

Distance matrix :
D₁ | 0
(D₂, D₅) | 0.34 | 0
(D₃, D₆, D₄) | 0.37 | 0.39 | 0
     D₁   (D₂,D₅)  (D₃,D₆,D₄)

0.34 is the smallest. (D₂, D₅) and D₁ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance matrix :
(D₁, D₂, D₅) | 0
(D₃, D₆, D₄) | 0.39 | 0
     (D₁,D₂,D₅)  (D₃,D₆,D₄)

Now a single cluster remains (D₁, D₂, D₅, D₃, D₆, D₄).
Next, we represent the final dendrogram for complete linkage :
D₃ D₆ D₄ D₂ D₅ D₁
For average linkage :

Distance ((D₃, D₆), D₁) = 1/2 (distance (D₃, D₁) + distance (D₆, D₁)) = 1/2 (0.22 + 0.23) = 0.23 (approximately)

0.11 is the smallest, so D₃ and D₆ are combined and all distances are recalculated as averages. The subsequent merges combine D₂ with D₅, then D₄ with (D₃, D₆), then D₁ with (D₂, D₅), giving :

Distance matrix :
(D₁, D₂, D₅) | 0
(D₃, D₆, D₄) | 0.26 | 0
     (D₁,D₂,D₅)  (D₃,D₆,D₄)

Now a single cluster remains (D₁, D₂, D₅, D₃, D₆, D₄).
Next, we represent the final dendrogram for average linkage :
D₃ D₆ D₄ D₂ D₅ D₁
Solution :

Distance matrix :
P₁ | 0
P₂ | 2 | 0
P₃ | 6 | 3 | 0
P₄ | 10 | 9 | 7 | 0
P₅ | 9 | 8 | 5 | 4 | 0
     P₁   P₂   P₃   P₄   P₅

2 is the smallest. P₁ and P₂ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance ((P₁, P₂), P₃) = min (distance (P₁, P₃), distance (P₂, P₃)) = min (6, 3) = 3

Distance matrix :
(P₁, P₂) | 0
P₃ | 3 | 0
P₄ | 9 | 7 | 0
P₅ | 8 | 5 | 4 | 0
     (P₁,P₂)  P₃   P₄   P₅

3 is the smallest. (P₁, P₂) and P₃ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance ((P₁, P₂, P₃), P₄) = min (distance (P₁, P₄), distance (P₂, P₄), distance (P₃, P₄)) = min (10, 9, 7) = 7

Similarly, we will calculate all distances.

Distance matrix :
(P₁, P₂, P₃) | 0
P₄ | 7 | 0
P₅ | 5 | 4 | 0
     (P₁,P₂,P₃)  P₄   P₅

4 is the smallest. P₄ and P₅ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.
Distance matrix :
(P₁, P₂, P₃) | 0
(P₄, P₅) | 5 | 0
     (P₁,P₂,P₃)  (P₄,P₅)

Now a single cluster remains (P₁, P₂, P₃, P₄, P₅).

For average linkage the same data is used :

Distance matrix :
P₁ | 0
P₂ | 2 | 0
P₃ | 6 | 3 | 0
P₄ | 10 | 9 | 7 | 0
P₅ | 9 | 8 | 5 | 4 | 0
     P₁   P₂   P₃   P₄   P₅

2 is the smallest. P₁ and P₂ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance ((P₁, P₂), P₃) = 1/2 (distance (P₁, P₃) + distance (P₂, P₃)) = 1/2 (6 + 3) = 4.5

Similarly, we will calculate all distances.

Distance matrix :
(P₁, P₂) | 0
P₃ | 4.5 | 0
P₄ | 9.5 | 7 | 0
P₅ | 8.5 | 5 | 4 | 0
     (P₁,P₂)  P₃   P₄   P₅

4 is the smallest. P₄ and P₅ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance matrix :
(P₁, P₂) | 0
P₃ | 4.5 | 0
(P₄, P₅) | 9 | 6 | 0
     (P₁,P₂)  P₃   (P₄,P₅)

4.5 is the smallest, so (P₁, P₂) and P₃ are combined :

Distance matrix :
(P₁, P₂, P₃) | 0
(P₄, P₅) | 8 | 0
     (P₁,P₂,P₃)  (P₄,P₅)

Now a single cluster remains, and the final dendrogram is :
P₁ P₂ P₃ P₄ P₅
Example 5.6.9 (MU - May 16, 10 Marks)
Apply the agglomerative clustering algorithm on the given data and draw the dendrogram. Show three clusters with their allocated points. Use the single link method.

A | 0
B | √2 | 0
C | √10 | √8 | 0
D | √17 | 1 | √5 | 0
E | √5 | 1 | √5 | 2 | 0
F | √20 | √18 | 2 | 3 | √13 | 0
    A     B     C    D    E    F

Solution :

Distance matrix :
A | 0
B | 1.414 | 0
C | 3.162 | 2.828 | 0
D | 4.123 | 1 | 2.236 | 0
E | 2.236 | 1 | 2.236 | 2 | 0
F | 4.472 | 4.242 | 2 | 3 | 3.6 | 0
    A      B      C      D     E    F

1 is the smallest. B, D and B, E have the smallest distance. We can select either one; so, we combine B, D in one cluster and recalculate the distance matrix using single linkage.
Distance matrix :
A | 0
(B, D) | 1.414 | 0
C | 3.162 | 2.236 | 0
E | 2.236 | 1 | 2.236 | 0
F | 4.472 | 3 | 2 | 3.6 | 0
    A    (B,D)    C     E    F

1 is the smallest. (B, D) and E have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance matrix :
A | 0
(B, D, E) | 1.414 | 0
C | 3.162 | 2.236 | 0
F | 4.472 | 3 | 2 | 0
    A   (B,D,E)   C    F

Next, C and (B, D, E) are combined and the distance matrix is recalculated :

Distance matrix :
A | 0
(B, D, E, C) | 1.414 | 0
F | 4.472 | 2 | 0
    A   (B,D,E,C)   F

In the question, three clusters are asked with their allocated points. The three clusters are A, (B, D, E, C) and F.
Example 5.6.10 (MU - May 17, 10 Marks)
For the given set of points, identify clusters using complete link and average link using agglomerative clustering.

     A | B
P₁ | 1 | 1
P₂ | 1.5 | 1.5
P₃ | 5 | 5
P₄ | 3 | 4
P₅ | 4 | 4
P₆ | 3 | 3.5

Solution :
Distance matrix :
P₁ | 0
P₂ | 0.707 | 0
P₃ | 5.656 | 4.949 | 0
P₄ | 3.605 | 2.915 | 2.236 | 0
P₅ | 4.242 | 3.535 | 1.414 | 1 | 0
P₆ | 5.201 | 2.5 | 1.802 | 0.5 | 1.118 | 0
     P₁     P₂     P₃     P₄    P₅    P₆

0.5 is the smallest. P₄ and P₆ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix using complete linkage.

Distance matrix :
P₁ | 0
P₂ | 0.707 | 0
P₃ | 5.656 | 4.949 | 0
(P₄, P₆) | 4.403 | 2.707 | 2.019 | 0
P₅ | 4.242 | 3.535 | 1.414 | 1.059 | 0
     P₁     P₂     P₃   (P₄,P₆)   P₅

0.707 is the smallest. P₁ and P₂ have the smallest distance, so we combine these two in one cluster and recalculate the distance matrix.

Distance matrix :
(P₁, P₂) | 0
P₃ | 5.302 | 0
...

1.059 is then the smallest, so P₅ is combined with (P₄, P₆) :

Distance matrix :
(P₁, P₂) | 0
P₃ | 5.302 | 0
(P₄, P₅, P₆) | 3.66 | 1.817 | 0
     (P₁,P₂)   P₃   (P₄,P₅,P₆)

Next we will combine all clusters in a single cluster.
5.6.3 EM Algorithm

This method is based on the concept of mixture models and is executed in an iterative way. In the presence of missing data, the probability distribution parameters that have the maximum likelihood of the attributes are calculated, with the observed data given as input to the EM algorithm. In the Maximization step, the parameters of the probability distribution of each model are calculated again. The stopping criterion of the algorithm is when the distribution parameters converge or the maximum number of iterations is reached. Convergence is guaranteed, as the algorithm increases the likelihood at each iteration until it reaches the local maximum.

The EM algorithm is executed in the following way :
- Initialisation step : The model's parameters are assigned random values.
- Expectation step : Assign points to the model that fits each one best.
- Maximization step : Update the parameters of the model using the points assigned in the earlier step.
- Iterate until the parameter values converge.

Consider a set of starting parameters given a set of incomplete (observed) data, and assume the observed data come from a specific model. Then :
o Use these to "estimate" the missing data : formulate some parameters for that model and use them to guess the missing values (E step).
o Use the "complete" data to update the parameters : from the missing data and observed data, find the most likely parameters (M step).
o Repeat steps 2 and 3 until convergence.

Now let's understand the EM algorithm with the help of examples. In Example 1 we will see what the use of the EM algorithm is, and in Example 2 we will see how to use the EM algorithm.
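Example 5.6.12 below performs one E-step and M-step of this recipe by hand; the full loop can be sketched as follows (the per-round binomial weighting matches the example; names are ours) :

    def em_coins(rounds, pa=0.6, pb=0.5, n_iter=20):
        # rounds : list of (heads, tails) per round; returns the estimated P_A, P_B
        for _ in range(n_iter):
            ah = at = bh = bt = 0.0
            for h, t in rounds:
                # E-step : responsibility of coin A for this round
                la = pa ** h * (1 - pa) ** t
                lb = pb ** h * (1 - pb) ** t
                wa = la / (la + lb)
                ah += wa * h; at += wa * t              # expected counts for A
                bh += (1 - wa) * h; bt += (1 - wa) * t  # expected counts for B
            # M-step : re-estimate the head probabilities from the expected counts
            pa, pb = ah / (ah + at), bh / (bh + bt)
        return pa, pb

    rounds = [(5, 5), (8, 2), (9, 1), (4, 6), (7, 3)]
    print(em_coins(rounds, n_iter=1))   # first iteration : about (0.71, 0.58)
    print(em_coins(rounds))             # values after convergence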
Example 5.6.11 : Suppose coins A and B are used for tossing. Each coin is tossed 10 times. The following table shows the observation sequence of getting H and T when coin A and coin B are used. What is the probability of getting H if coin A or B is used?

Coin used | Tosses 1-10
B | H H H H H T T T T T
A | H H H H H H H H T T
A | H H H H H H H H H T
B | H H H H T T T T T T
A | H H H H H H H T T T

Solution :
Let's first calculate the number of H and T for each coin.
Probability of getting Head when coin A is used : P_A = 24/30 = 0.8
Probability of getting Head when coin B is used : P_B = 9/20 = 0.45
In this example, if the coin state is hidden, i.e., whether coin A or B is used is not given, then how will we calculate the probability?
coin is tossed 10 times. Following table shows the
A and B is used for tossi ng. Each is not
Example 5.6.12 : Suppose Coin of getting H and T for each roun
d. But which coin is used for
whic h round
observation sequence
and B?
known. Then how to calculate the probability of getting H for coin A
Number ofToss
7 | 8/9 | 10
Round number | 1 | 2 | 3 | 4
0 HIH/H{H}H| T/T TIT] T
H|H|H/|H|H|H]|H]H |T | T
1
| H| T
2 H/H|H/|H/]H/H]|H]H
HIH/H/H/T/T{ TI TIT T
3
HI H|H/H|H/H/H/ TIT T
4
M Solution :
is used.
state is not known (Coin A or B) then EM algorithm
When only observation sequence is known, but the
Now Let's solve this example using EM algorithm.
Assume P, = 0.6 and P, = 0.5
B coin as,
Now we will calculate probability of using A and
A (P,)"(1- Pa)” "= (0.6)(1 = 0.6)" * = 0.00079626
il}
Ay = Xap=0-45
ALL
pound
und 2: 2: In round 2 there are 9 H, 1 T and total tosses are 10.
Now we will calculate probability of using A and B coin as,
By = 44p=0.20B a
Now we will calculate probability of getting H and T for Coin A and B for round 2 as,
A; = Ay * Number
of T=0.80 * 1 =0.8
By, = By * Number of H=0.2*9=1.8
Ay = By * Number of
T = 0.2 * | = 0.2
Round 3 : In round 3 there are 4 H, 6 T; A_w = 0.35, B_w = 0.65.
A_H = 0.35 × 4 = 1.4, A_T = 0.35 × 6 = 2.1
B_H = 0.65 × 4 = 2.6, B_T = 0.65 × 6 = 3.9

Round 4 : In round 4 there are 7 H, 3 T and 10 tosses in total; A_w = 0.65, B_w = 0.35.
A_H = A_w × Number of H = 0.65 × 7 = 4.55, A_T = A_w × Number of T = 0.65 × 3 = 1.95
B_H = B_w × Number of H = 0.35 × 7 = 2.45, B_T = B_w × Number of T = 0.35 × 3 = 1.05

Now we will calculate P_A and P_B by summarizing the results of round 0 to round 4 :

Round | A_H | A_T | B_H | B_T
0 | 2.25 | 2.25 | 2.75 | 2.75
1 | 5.84 | 1.46 | 2.16 | 0.54
2 | 7.2 | 0.8 | 1.8 | 0.2
3 | 1.4 | 2.1 | 2.6 | 3.9
4 | 4.55 | 1.95 | 2.45 | 1.05
Probability of getting H when coin A is used : P_A = ΣA_H / (ΣA_H + ΣA_T) = 21.24 / 29.8 ≈ 0.71
Probability of getting H when coin B is used : P_B = ΣB_H / (ΣB_H + ΣB_T) = 11.76 / 20.2 ≈ 0.58

Using these new values of P_A and P_B, a second iteration is applied. This process continues until there is no change in the values of P_A and P_B, and then those will be the final probabilities.
— A hypothetical connection is present between the input layer and the hidden layer, whereas weighted connections are present between the hidden layer and the output layer.

— Let the input vector be P ∈ Rⁿ and let Cᵢ ∈ Rⁿ (1 ≤ i ≤ u) be the prototypes of the input vector. The output of each RBF unit is as follows :
Rᵢ(P) = exp (−||P − Cᵢ||² / σᵢ²)    ...(5.6.2)

where σᵢ represents the width of the iᵗʰ RBF unit. Generally, the Gaussian function is mostly used among all possible radial basis functions, due to the reason that it is factorizable.

yⱼ(P) = Σᵢ₌₁ᵘ Rᵢ(P) × w(i, j)

where yⱼ is the output of the jᵗʰ node, R₀ = 1, and w(i, j) is the weight or strength of the iᵗʰ receptive field to the jᵗʰ output.

The network complexity is reduced by not taking the bias into consideration in the following analysis. We can say that the outputs of a radial basis function classifier are characterized by a linear combination of the hidden-unit responses.
The weights are obtained through the pseudo-inverse, W = G⁺d = (GᵀG)⁻¹ Gᵀ d.

The input patterns are :
(1, 1), (0, 1), (0, 0), (1, 0)
G(||x − t||²) = exp (−||x − t||²)

Now we will construct the G matrix. For the first column we use exp (−(X − t₁)²), for the second column exp (−(X − t₂)²), and the third column is for the bias (1).

For example, for the first row and first column, X = (1, 1) and t₁ = (1, 1) :
exp (−((1 − 1)² + (1 − 1)²)) = exp (0) = 1

For the second row and first column, X = (0, 1) and t₁ = (1, 1) :
exp (−((1 − 0)² + (1 − 1)²)) = exp (−1) = 0.3678

For the second row and second column, X = (0, 1) and t₂ = (0, 0) :
exp (−((0 − 0)² + (1 − 0)²)) = exp (−1) = 0.3678

This gives

GᵀG = | 1.2889  0.5412  1.8709 |
      | 0.5412  1.2889  1.8709 |
      | 1.8709  1.8709  4      |

Now we will find the inverse of this matrix by dividing the adjoint matrix by the determinant.

Determinant = 1.2889 (1.2889 × 4 − 1.8709 × 1.8709) − 0.5412 (0.5412 × 4 − 1.8709 × 1.8709) + 1.8709 (0.5412 × 1.8709 − 1.2889 × 1.8709) = 0.2392

To find the adjoint matrix we compute the cofactors row-wise, with alternating + and − signs :
For the first row, first column : + (1.2889 × 4 − 1.8709 × 1.8709) = 1.6553
For the first row, second column : − (0.5412 × 4 − 1.8709 × 1.8709) = 1.3355
For the first row, third column : + (0.5412 × 1.8709 − 1.2889 × 1.8709) = −1.3989
Proceeding in the same way, the weight vector comes out as

W = | −2.5018 |
    | −2.5018 |
    |  2.8404 |

Alternatively, the interpolation matrix is constructed as

Φ = (φ₁(x), φ₂(x), φ₃(x), φ₄(x))

where φᵢ(x) represents the hidden function corresponding to centre tᵢ. Then the weight matrix is calculated as

W = Φ⁻¹ d

Example 5.6.14 : Design a XOR problem with the given data set as (1,1), (0,1), (0,0), (1,0), and also find the weight vector and interpolation matrix.

Solution :
As we are designing the problem for XOR, the inputs and desired responses are as follows :

Input pattern | Desired output
(1, 1) | 0
(0, 1) | 1
(0, 0) | 0
(1, 0) | 1

As we have to select all the centres : t₁ = (1, 1), t₂ = (0, 1), t₃ = (0, 0), t₄ = (1, 0).

The interpolation matrix is given as

Φ = | 1       0.3678  0.1353  0.3678 |
    | 0.3678  1       0.3678  0.1353 |
    | 0.1353  0.3678  1       0.3678 |
    | 0.3678  0.1353  0.3678  1      |

and the weight vector is

W = Φ⁻¹ d = | −0.9837 |
            |  1.5182 |
            | −0.9837 |
            |  1.5182 |
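The interpolation matrix and weight vector of Example 5.6.14 can be verified numerically (a minimal sketch) :

    import numpy as np

    X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]], dtype=float)
    d = np.array([0, 1, 0, 1], dtype=float)      # XOR targets

    # With every input used as a centre : phi[i, j] = exp(-||x_i - t_j||^2)
    phi = np.exp(-np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2) ** 2)
    print(np.round(phi, 4))      # the 4x4 interpolation matrix (1, 0.3679, 0.1353, ...)

    w = np.linalg.solve(phi, d)
    print(np.round(w, 4))        # about [-0.9837, 1.5182, -0.9837, 1.5182]
    print(np.round(phi @ w, 4))  # reproduces the XOR targets [0, 1, 0, 1]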
[Rows D, E and F of the distance matrix for this question, as in Example 5.6.9.]
> May 2017

Q. 8 What is Support Vector Machine? How to compute the margin? (Ans. : Refer sections 5.5.1 and 5.5.2) (10 Marks)

Q. 9 For the given set of points, identify clusters using complete link and average link using agglomerative clustering. (Ans. : Refer Example 5.6.10) (10 Marks)
     A | B
P₁ | 1 | 1
P₂ | 1.5 | 1.5
P₃ | 5 | 5
P₄ | 3 | 4
P₅ | 4 | 4
P₆ | 3 | 3.5
> May 2019

Q. 10 Consider the Markov chain model for 'Rain' and 'Dry' shown in the following figure. Two states : 'Rain' and 'Dry'. Transition probabilities : P('Rain'|'Rain') = 0.2, P('Dry'|'Rain') = 0.65, P('Rain'|'Dry') = 0.3, P('Dry'|'Dry') = 0.7. Initial probabilities : say P('Rain') = 0.4, P('Dry') = 0.6. Calculate the probability of the sequence of states {'Dry', 'Rain', 'Rain', 'Dry'}. (Ans. : Refer Example 5.4.3) (10 Marks)

Q. 12 Explain the following terms w.r.t. Bayes' theorem with proper examples : (a) Independent probabilities (b) Dependent probabilities (c) Conditional probability (d) Prior and posterior probabilities. Define Bayes theorem based on these probabilities. (Ans. : Refer section 5.3.1) (10 Marks)

Q. 13 Draw and discuss the structure of the Radial Basis Function Network. How can RBFN be used to solve non-linearly separable patterns? (Ans. : Refer section 5.6.5) (10 Marks)

Q. 14 Illustrate Support Vector Machine with a neat labeled sketch and also show how to derive the optimal hyper-plane.

Q. 15 Why is SVM more accurate than logistic regression? (Ans. : Refer section 5.5) (5 Marks)

Q. 16 Explain Radial Basis Function with example. (Ans. : Refer section 5.6.5) (5 Marks)

Q. 17 Explain various basic evaluation measures of a supervised learning algorithm for classification. (Ans. : Refer sections 5.1, 5.2 and 5.3) (10 Marks)

Q. 18 Define Support Vector Machine. Explain how the margin is computed and the optimal hyper-plane is decided. (Ans. : Refer sections 5.5.1 and 5.5.2) (10 Marks)

Q. 19 Write a short note on : Hidden Markov Model. (Ans. : Refer section 5.4) (5 Marks)

Q. 20 Write a short note on : EM algorithm. (Ans. : Refer section 5.6.3) (5 Marks)
. : . 9
Q.5.3° SVM can be used to solve problems. Explanation : Out of the options given. only K-Means
(a) Classification (by Regression clustering algorithm and EM clustering algorithm has the
(c) Clustering drawback of converging at local minima.
(d) Both Classification and Regression Y Ans.:(d) | Q.5.7 | Which of the following algorithm is most sensitive to
Explanation : With the help of SVM we can get outliers ?
categorical as well as numerical output. (a) K-means clustering algorithm
(b) K-medians clusteri Igori
Q.5.4 SVMisa learning algorithm. ering algorithm
(c) K-modes clustering algorithm
(a) Supervised (b) Unsupervised
(d) K-medoids clustering algorithm Y Ans. : (a)
(c) Semisupervised — (d) Reinforcement ¥ Ans, : (a)
Explanation : Expected output is known.
(MU-New Syllabus w.e.f academic year 18-19) (M6-14) al Tech-Neo Publications..A SACHIN SHAH Ventur
e
Q. 5.9  How is the state of the process described in an HMM ?
(a) Literal    (b) Single random variable
(c) Single discrete random variable                                           ✓ Ans. : (c)
Explanation : States are represented in the HMM model using a single discrete random variable.

Q. 5.10 Select the false statement related to Support Vector Machines.
(a) SVM can be used as a binary classifier
(b) SVM can be used as a multi-class classifier
(c) SVM cannot perform non-linear classification
(d) SVM can be used for linear and non-linear classification                  ✓ Ans. : (c)
Explanation : SVM can perform non-linear classification using non-linear kernels.

Q. 5.11 In a regression tree the output attribute is _______.
(a) Categorical    (b) Discrete
(c) Numerical      (d) Range                                                  ✓ Ans. : (c)
Explanation : In a regression tree, at the leaf node we get the numerical output.

Q. 5.12 Which of the following is/are valid iterative strategies for treating missing values before clustering analysis ?
(a) Nearest-neighbour assignment
(b) Imputation with mean
(c) Imputation with the Expectation Maximization algorithm
(d) Imputation with outlier                                                   ✓ Ans. : (c)
Explanation : When we perform imputation with the EM algorithm, missing values can be handled before applying clustering.

Q. 5.13 In a Radial Basis Function network, the connection between the input and hidden layers is called _______.
(a) Weighted connection    (b) Intra connection

Explanation : … are classification algorithms, whereas Apriori is an association algorithm.

Q. 5.15 SVMs are less effective when
(a) The data is linearly separable
(b) The data is clean and ready to use
(c) The data is noisy and contains overlapping points                         ✓ Ans. : (c)
Explanation : If the data is noisy and its distribution is not proper, then we will not be able to draw a proper decision boundary.

Q. 5.16 Calculate the Sensitivity from the given data : TP = 30, TN = 930, FP = 30, FN = 10.
(a) 0.75    (c) 0.86    (d) 0.99                                              ✓ Ans. : (a)
Explanation : Sensitivity = TP / (TP + FN) = 30/40 = 0.75

Q. 5.17 Calculate the Accuracy from the given data : TP = 30, TN = 930, FP = 30, FN = 10.
(a) 0.96    (c) 0.86    (d) 0.99                                              ✓ Ans. : (a)
Explanation : Accuracy = (TP + TN) / (TP + TN + FP + FN) = 960/1000 = 0.96

Q. 5.18 How can the Bayesian network be used to answer any query ?
(a) Full distribution       (b) Joint distribution
(c) Partial distribution    (d) Zero distribution                             ✓ Ans. : (b)
Explanation : In a Bayesian network we first have to find the joint probability distribution; from this we can answer any query.
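The formulas used in Q. 5.16 and Q. 5.17 generalize to any confusion matrix. A small sketch :

    # Confusion-matrix metrics from Q. 5.16 / Q. 5.17.
    def sensitivity(tp, fn):
        # Also called recall or the true-positive rate.
        return tp / (tp + fn)

    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + tn + fp + fn)

    tp, tn, fp, fn = 30, 930, 30, 10
    print(sensitivity(tp, fn))         # 30/40    = 0.75
    print(accuracy(tp, tn, fp, fn))    # 960/1000 = 0.96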
(c) Finds hyperplane without any criteria
(d) Does not find hyperplane for classification                               ✓ Ans. : (b)
Explanation : If the margin is maximum, then we can gain more confidence in our classification.

Q. 5.23 Probabilities in Bayes' theorem that are changed with the help of newly available information are classified as
(a) Independent probabilities    (b) Posterior probabilities
(c) Interior probabilities       (d) Dependent probabilities                  ✓ Ans. : (b)

Q. 5.28 If you are using multinomial mixture models with the expectation-maximization algorithm for clustering a set of data points into two clusters, which of the assumptions are important ?
(a) All the data points follow two Gaussian distributions
(b) All the data points follow n Gaussian distributions (n > 2)
(c) All the data points follow two multinomial distributions
(d) All the data points follow n multinomial distributions (n > 2)            ✓ Ans. : (c)
Explanation : For multinomial mixture models, the data points should follow two multinomial distributions.

(b) allow more than one input attribute in a single rule
(c) require input attributes to take on numeric values
(d) require each rule to have exactly one categorical output attribute        ✓ Ans. : (a)
Explanation : As per the working of association rules such as the Apriori algorithm.

(a) Solving queries
(b) Increasing complexity
(c) Decreasing complexity
(d) Answering probabilistic queries                                           ✓ Ans. : (d)
Explanation : In a Bayes' system, queries are answered based on probability concepts.
Q. 5.33 Given desired class C and population P, lift is defined as
(a) the probability of class C given population P divided by the probability of C given a sample taken from the population
(b) the probability of population P given a sample taken from P

Q. 5.38 What does the Bayesian network provide ?
(a) Complete description of the domain
(b) Partial description of the domain
(c) Complete description of the problem
(d) None of the mentioned                                                     ✓ Ans. : (a)
Explanation : A Bayesian belief network provides the complete scenario of a particular event.
Q. 5.40 The three components of the Bayes decision rule are class prior, likelihood and _______.
(a) Evidence      (b) Instance
(c) Confidence    (d) Salience                                                ✓ Ans. : (a)
Explanation : Rules are updated based on new evidence.

(a) 1 is True and 2 is False    (b) 1 is False and 2 is True
(c) Both are True               (d) Both are False                            ✓ Ans. : (c)
Explanation : Both the statements are true and self-explanatory.

Q. 5.41 In Bayes' theorem, the unconditional probability is called _______.
(a) Evidence    (b) Likelihood
(c) Prior       (d) Posterior                                                 ✓ Ans. : (a)
Explanation : Basic definition.

Q. 5.42 In Bayes' theorem, the class-conditional probability is called _______.
(a) Evidence    (b) Likelihood
(c) Prior       (d) Posterior                                                 ✓ Ans. : (b)
Explanation : Basic definition.

Q. 5.46 State whether the statement is True or False : the k-NN algorithm does more computation at test time than at train time.
(a) True    (b) False                                                         ✓ Ans. : (a)
Explanation : The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the testing phase, a test point is classified by assigning the label which is most frequent among the k training samples nearest to that query point, hence the higher computation at test time.

Q. 5.43 One person is tossing a coin inside a closed room and tells only the output (Head or Tail) to a person standing outside the room. Whether the coin being used is fair or biased is not known. Suppose you want to find out the type of coin : which model will you use ?
(a) Hidden Markov Model    (b) Discrete Markov Model
(c) Prediction Model                                                          ✓ Ans. : (a)

Q. 5.47 Suppose you have been given the following data, where x and y are the 2 input variables and Class is the dependent variable.
Q. 5.48 Which assumption is made by the Naive Bayes classifier ?
(a) All the classes are independent of each other
(b) All the features of a class are independent of each other
(c) The most probable feature for a class is the most important feature to be considered for classification
(d) All the features of a class are conditionally dependent on each other     ✓ Ans. : (b)
Explanation : The Naive Bayes assumption is that all the features of a class are independent of each other, which is not the case in real life. Because of this assumption, the classifier is called the Naive Bayes classifier.

Q. 5.49 Consider the given dataset, where a, b and c are the features and K is the class (1/0). Classify the test instance a = 0, b = 0, c = 1 into class 1/0 using a Naive Bayes classifier.
(a) 0    (b) 1                                                                ✓ Ans. : (b)
Explanation :
P(K = 1 | a = 0, b = 0, c = 1) ∝ 3/6 × 1/3 × 2/3 × 2/3 = 0.0741
P(K = 0 | a = 0, b = 0, c = 1) ∝ 3/6 × 1/3 × 1/3 × 2/3 = 0.0370
Since P(K = 1 | a = 0, b = 0, c = 1) > P(K = 0 | a = 0, b = 0, c = 1), the test instance is classified into class 1.
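The comparison in Q. 5.49 multiplies the class prior by one likelihood per feature. A sketch using the probability values from the explanation above (the dataset rows themselves are not reproduced here) :

    # Naive Bayes score : P(K) * P(a=0|K) * P(b=0|K) * P(c=1|K), per class.
    # Probability values are taken from the worked explanation above.
    score_k1 = (3/6) * (1/3) * (2/3) * (2/3)   # approx. 0.0741
    score_k0 = (3/6) * (1/3) * (1/3) * (2/3)   # approx. 0.0370

    predicted = 1 if score_k1 > score_k0 else 0
    print(score_k1, score_k0, predicted)       # class 1 wins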
Q. 5.50 A patient goes to a doctor with symptoms S1, S2 and S3. The doctor suspects diseases D1 and D2 and creates a Bayesian network for the relation among the diseases and symptoms as the following :

Fig. Q. 5.50

What is the joint probability distribution in terms of conditional probabilities ?
(d) P(D1) P(D2) P(S1 | D1) P(S2 | D1, D2) P(S3 | D2)                          ✓ Ans. : (d)
Explanation : From the figure, we can see that D1 and D2 do not depend on any variable, as they have no incoming directed edges. S1 has an incoming edge from D1, hence S1 depends on D1. S2 has incoming edges from D1 and D2, hence S2 depends on D1 and D2. S3 has an incoming edge from D2, hence S3 depends on D2. Hence, (d) is the answer.

Q. 5.51 What is the Markov blanket of the variable S3 ?
(a) D1           (b) D2
(c) D1 and D2    (d) None                                                     ✓ Ans. : (b)
Explanation : In a Bayesian network, the Markov blanket of a node X is the set consisting of X's parents, X's children and the parents of X's children. In the given diagram, the variable S3 has a parent D2 and no children. Hence, the correct answer is (b).

Q. 5.52 Suppose you are using a linear SVM classifier for a 2-class classification problem. Consider the following data, in which the points circled in red represent the support vectors. Will the decision boundary change if any of the red points are removed ?

Fig. Q. 5.52

(a) Yes    (b) No                                                             ✓ Ans. : (a)
Explanation : These three examples are positioned such that removing any one of them introduces slack in the constraints. So the decision boundary would completely change.
(b) the decision boundary in the transformed feature space is linear
(c) the decision boundary in the original feature space is linear
(d) the decision boundary in the original feature space is non-linear         ✓ Ans. : (b), (d)
Explanation : As per the working of RBF.

Q. 5.53 Consider the data points in the figure below.

Fig. Q. 5.53

Let us assume that the black-coloured circles represent the positive class whereas the white-coloured circles represent the negative class. Which of the following among H1, H2 and H3 is the maximum-margin hyperplane ?
(a) H1    (b) H2
(c) H3    (d) None of the above                                               ✓ Ans. : (c)
Explanation : H1 does not separate the classes. H2 does, but only with a small margin. H3 separates them with the maximal margin.

Q. 5.54 The soft-margin SVM is preferred over the hard-margin SVM when :
(a) The data is linearly separable
(b) The data is noisy and contains overlapping points                         ✓ Ans. : (b)

Q. 5.57 Which of the following statements is/are true about the kernel in SVM ?
1. The kernel function maps low-dimensional data to a high-dimensional space.
2. It is a similarity function.
(a) 1 is True but 2 is False    (b) 1 is False but 2 is True
(c) Both are True               (d) Both are False                            ✓ Ans. : (c)
Explanation : As per the working of SVM.

Q. 5.58 What is true about K-Means clustering ?
1. K-means is extremely sensitive to cluster-centre initialization.
2. Bad initialization can lead to poor convergence speed.
3. Bad initialization can lead to bad overall clustering.
(a) 1 and 3    (b) 1 and 2
(c) 2 and 3    (d) 1, 2 and 3                                                 ✓ Ans. : (d)
Explanation : All three of the given statements are true. K-means is extremely sensitive to cluster-centre initialization; bad initialization can lead to poor convergence speed as well as to bad overall clustering.

Q. 5.59 If in the following figure we draw a horizontal line on the y-axis at y = 2, how many clusters will we get ?                                              ✓ Ans. : (b)
Q. 5.60 Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering algorithm. After the first iteration, the clusters C1, C2 and C3 have the following observations :
C1 : {(2, 2), (4, 4), (6, 6)}
C2 : {(0, 4), (4, 0)}
C3 : {(5, 5), (9, 9)}
What will be the cluster centroids if you want to proceed to the second iteration ?
(a) C1 : (4, 4), C2 : (2, 2), C3 : (7, 7)
(b) C1 : (2, 2), C2 : (0, 0), C3 : (5, 5)
(c) C1 : (6, 6), C2 : (4, 4), C3 : (9, 9)
(d) None of these                                                             ✓ Ans. : (a)
Explanation : Finding the centroid for the data points in cluster C1 = ((2 + 4 + 6)/3, (2 + 4 + 6)/3) = (4, 4). Finding the centroid for the data points in cluster C2 = ((0 + 4)/2, (4 + 0)/2) = (2, 2). Similarly, the centroid of cluster C3 = ((5 + 9)/2, (5 + 9)/2) = (7, 7).

Q. 5.61 If two variables V1 and V2 are used for clustering, which of the following are true for K-means clustering with k = 3 ?
1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line.
2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line.
(a) 1 only         (b) 2 only
(c) Both 1 and 2   (d) None of the above                                      ✓ Ans. : (a)
Explanation : If the correlation between the variables V1 and V2 is 1, then all the data points will be in a straight line. Hence, all three cluster centroids will form a straight line as well.
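The centroid update in Q. 5.60 is just a per-cluster mean. A NumPy sketch :

    import numpy as np

    # Observations assigned to each cluster after the first iteration (Q. 5.60).
    clusters = {
        "C1": np.array([(2, 2), (4, 4), (6, 6)]),
        "C2": np.array([(0, 4), (4, 0)]),
        "C3": np.array([(5, 5), (9, 9)]),
    }

    # K-means centroid update : coordinate-wise mean of the member points.
    for name, pts in clusters.items():
        print(name, pts.mean(axis=0))   # C1 (4,4), C2 (2,2), C3 (7,7)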
Chapter 6 : Dimensionality Reduction

University Prescribed Syllabus
6.1.1  Dimension Reduction Techniques in ML
...
6.5    University Questions and Answers
       Multiple Choice Questions
       Chapter Ends
-  The classifier's performance will usually degrade for a large number of features.

Fig. 6.1.3 : Classifier performance and amount of data (x-axis : number of variables)
6.1.1 Dimension Reduction Techniques in ML

6.1.1(A) Feature Selection

-  Ideally, we would select the subset of features for which a classifier based on it has the lowest probability of error of all such classifiers.
-  It is not possible to go over all 2^n possibilities, so we need some heuristics.
-  Forward search (a greedy code sketch of this procedure follows the list) :
   o  Start from an empty set of features.
   o  Try each remaining feature : estimate the classification/regression error obtained by adding each specific feature.
   o  Select the feature that gives the maximum improvement in validation error.
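A minimal sketch of forward search. It assumes a hypothetical helper validation_error(subset) that trains a model on the given feature subset and returns its error on a validation set :

    # Greedy forward feature selection (sketch).
    # validation_error is a hypothetical helper : it should train a model
    # using only the features in the given subset and return the validation error.
    def forward_search(all_features, validation_error, k):
        selected = []
        while len(selected) < k:
            best_feat, best_err = None, float("inf")
            for f in all_features:
                if f in selected:
                    continue
                err = validation_error(selected + [f])   # try adding f
                if err < best_err:
                    best_feat, best_err = f, err
            selected.append(best_feat)   # keep the feature that helped most
        return selected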
-  You can see that the second principal component is orthogonal (perpendicular) to the first principal component.

Fig. 6.2.1
-  The principal components are sensitive to the scale of measurement. To resolve this issue, we should standardize the variables before applying PCA : we ought to always scale the data set.
-  If interpretability of the outcomes is important, PCA may not be the correct technique for your task.
Solution :

First we will find the mean values :

    X̄ = (ΣX)/N,   N = number of data points = 10   →   X̄ = 1.81
    Ȳ = (ΣY)/N                                       →   Ȳ = 1.91

Next we compute the entries of the covariance matrix :

    C_XX = Σ(X − X̄)² / (N − 1)              = 0.6165
    C_XY = C_YX = Σ(X − X̄)(Y − Ȳ) / (N − 1) = 0.61544
    C_YY = Σ(Y − Ȳ)² / (N − 1)              = 0.7165

    C = | 0.6165    0.61544 |
        | 0.61544   0.7165  |

Now, to find the eigenvalues, the following equation is used :

    |C − λI| = 0

Solving this quadratic in λ gives the eigenvalues λ₁ = 0.0489 and λ₂ = 1.2840.

-  For λ₂, (C − λ₂I)V = 0 gives the two equations

    (0.6165 − 1.2840) V₁ + 0.61544 V₂ = 0      ...(1)
    0.61544 V₁ + (0.7165 − 1.2840) V₂ = 0      ...(2)

-  To find the eigenvector we can take either Equation (1) or (2); for both equations the answer will be the same, so let's take the first one. It gives V₂ = 1.0847 V₁.
-  Now we have to find the principal component; it is equal to the eigenvector corresponding to the maximum eigenvalue. Here λ₂ is maximum, hence the principal component is the (normalized) eigenvector V₂ :

    Principal component = | 0.677 |
                          | 0.735 |
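The covariance matrix, eigenvalues and principal component above can be verified numerically. A sketch, assuming the ten (x, y) data points of this example are held in arrays x and y :

    import numpy as np

    # np.cov uses the (N - 1) divisor, matching the values computed above.
    def principal_component(x, y):
        C = np.cov(np.vstack([x, y]))           # 2 x 2 covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues in ascending order
        return eigvals, eigvecs[:, -1]          # eigenvector of the largest one

    # Expected for this example : eigenvalues approx. (0.0489, 1.284) and
    # principal component approx. (0.677, 0.735), up to sign.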
-  Two beliefs :
   1. The source signals are independent of one another.
   2. The values of each source signal have non-Gaussian distributions.

-  Three effects of mixing source signals :
   1. Independence : By belief 1, the source signals are independent; their signal mixtures, however, are not. This is because the signal mixtures share the same source signals.
   2. Normality : By the Central Limit Theorem, the distribution of a sum of independent random variables with finite variance tends towards a Gaussian distribution. Loosely, a sum of two independent random variables usually has a distribution closer to Gaussian than either of the two original variables.
-  The statistical model x = As is known as independent component analysis. ICA is a generative model, which means that it describes how the observed data are generated by a process of mixing the components sᵢ. The independent components are latent variables, meaning that they cannot be directly observed. Additionally, the mixing matrix is assumed to be unknown. All we observe is the random vector x, and we must estimate both A and s from it. This must be done under assumptions as general as possible.
-  The starting point for ICA is the very simple assumption that the components sᵢ are statistically independent. It will be seen below that we must also assume that the independent components have non-Gaussian distributions.
-  However, in the basic model we do not assume these distributions to be known (if they are known, the problem simplifies considerably). After estimating the mixing matrix A, we can compute its inverse, say W, and obtain the independent components simply by :

    s = Wx

-  ICA is closely related to the technique called blind source separation (BSS) or blind signal separation. A "source" means here an original signal, i.e. an independent component, like the speaker in a cocktail-party problem. "Blind" means that we know very little about the mixing, and make only weak assumptions on the source signals.
-  In many applications, it would be more realistic to assume that there is some noise in the measurements, which would mean including a noise term in the model. For simplicity, we exclude any noise terms, since the estimation of the noise-free model is difficult enough in itself, and appears to be adequate for many applications.
6.3.1 Preprocessing for ICA

-  Preprocessing is extremely helpful before actually applying an ICA algorithm to the data. Now we will see some preprocessing methods that lead to a simpler ICA estimation problem and a better model.

Centering

-  The essential and most basic preprocessing step is to center x. Centering is done by subtracting its mean vector m = E{x} from x, so that x becomes a zero-mean variable. This means s is zero-mean too.
-  This preprocessing is done exclusively to simplify the ICA calculations : it does not imply that the mean could not be estimated. After estimating the mixing matrix A with the centered data, the calculation can be completed by adding the mean vector of s back to the centered estimates of s. The mean vector of s is given by A⁻¹m, where m is the mean that was subtracted in the preprocessing.
Whitening

-  Another important preprocessing procedure in ICA is to whiten the observed variables. This means that before doing the ICA calculation, we transform the observed vector x linearly so that we obtain a new vector x̃ which is white : its components are uncorrelated and their variances equal unity. In other words, the covariance matrix of x̃ equals the identity matrix :

    E{ x̃ x̃ᵀ } = I

-  One popular technique of whitening uses the eigenvalue decomposition (EVD) of the covariance matrix,

    E{ x xᵀ } = E D Eᵀ

   where E is the orthogonal matrix of eigenvectors of E{ x xᵀ } and D is the diagonal matrix of its eigenvalues, D = diag(d₁, ..., dₙ). The covariance matrix E{ x xᵀ } can be estimated in the standard way from the available sample x(1), ..., x(T). Whitening can now be done by

    x̃ = E D^(−1/2) Eᵀ x

   where the matrix D^(−1/2) is computed by a simple component-wise operation as D^(−1/2) = diag(d₁^(−1/2), ..., dₙ^(−1/2)). It is easy to check that now E{ x̃ x̃ᵀ } = I.

-  Whitening transforms the mixing matrix into a new one, Ã. We have

    x̃ = E D^(−1/2) Eᵀ A s = Ã s

-  The utility of whitening resides in the fact that the new mixing matrix Ã is orthogonal. This can be seen from

    E{ x̃ x̃ᵀ } = Ã E{ s sᵀ } Ãᵀ = Ã Ãᵀ = I

-  Whitening decreases the number of parameters to be estimated. Instead of having to estimate the n² parameters that are the elements of the original matrix A, we only need to estimate the new, orthogonal mixing matrix Ã. An orthogonal matrix has n(n − 1)/2 degrees of freedom. For example, in two dimensions, an orthogonal transformation is determined by a single angle parameter. In larger dimensions, an orthogonal matrix contains only about half of the number of parameters of an arbitrary matrix. Consequently we can say that whitening solves half of the problem of ICA. Because whitening is a very simple and standard procedure, much simpler than any ICA algorithm, it is a good idea to reduce the complexity of the problem in this way (a small numerical sketch is given below).
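A numerical sketch of the EVD-based whitening transform x̃ = E D^(−1/2) Eᵀ x described above, assuming the data matrix X is already centered and has one sample per column :

    import numpy as np

    def whiten(X):
        """Whiten centered data X of shape (n, T) so that E{x x^T} = I."""
        C = np.cov(X, bias=True)                 # estimate E{x x^T}
        d, E = np.linalg.eigh(C)                 # C = E diag(d) E^T
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # assumes all eigenvalues > 0
        return E @ D_inv_sqrt @ E.T @ X          # x~ = E D^(-1/2) E^T x

    # Afterwards np.cov(whiten(X), bias=True) is numerically the identity.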
-  It will be very useful to reduce the dimension of the data at the same time as we do the whitening. We look at the eigenvalues dᵢ of E{ x xᵀ } and discard those that are too small, as is often done in the statistical technique of principal component analysis. This frequently has the effect of reducing noise. Moreover, dimension reduction prevents overlearning, which can sometimes be observed in ICA.

Fig. 6.3.1 : The joint distribution of the whitened mixtures

-  In the remainder of this discussion, we assume that the data has been preprocessed by centering and whitening. For simplicity of notation, we denote the preprocessed data just by x, and the transformed mixing matrix by A, omitting the tildes.
-  In the earlier sections, we introduced different measures of non-Gaussianity, i.e. objective functions for ICA estimation. In practice, one also needs an algorithm for maximizing the contrast function. In this section, we present a very efficient method of maximization. Here it is assumed that the data has been preprocessed by centering and whitening as discussed above.
-  Now we will demonstrate the one-unit version of FastICA. By a "unit" we refer to a computational unit, eventually an artificial neuron, with a weight vector w that the neuron can update by a learning rule. The FastICA learning rule finds a direction, i.e. a unit vector w, such that the projection wᵀx maximizes non-Gaussianity. Non-Gaussianity is here measured by the approximation of negentropy J(wᵀx). The variance of wᵀx must be constrained to unity; for whitened data this is equivalent to constraining the norm of w to be unity.
-  FastICA is based on a fixed-point iteration scheme for finding a maximum of the non-Gaussianity of wᵀx. It can likewise be derived as an approximative Newton iteration. Denote by g the derivative of the nonquadratic function G, for example :

    g₁(u) = tanh(a₁u)
    g₂(u) = u exp(−u²/2)

   where 1 ≤ a₁ ≤ 2 is some suitable constant, often taken as a₁ = 1. The basic form of the FastICA algorithm is as per the following :
1. Select an initial (e.g. random) weight vector w.
2. Calculate w⁺ = E{ x g(wᵀx) } − E{ g′(wᵀx) } w.
3. Let w = w⁺ / ‖w⁺‖.
4. If not converged, go back to step 2.

-  Note that convergence means that the old and new values of w point in the same direction, i.e. their dot-product is (nearly) equal to 1. It is not necessary that the vector converges to a single point, since w and −w define the same direction. This is again because the independent components can be defined only up to a multiplicative sign. Note also that it is here assumed that the data is prewhitened.
-  The derivation of FastICA is as per the following. First note that the maxima of the approximation of the negentropy of wᵀx are obtained at certain optima of E{ G(wᵀx) }. According to the Kuhn-Tucker conditions, the optima of E{ G(wᵀx) } under the constraint E{ (wᵀx)² } = ‖w‖² = 1 are obtained at points where

    E{ x g(wᵀx) } − βw = 0

-  Let us try to solve this equation by Newton's method. Denoting the function on the left-hand side by F, we obtain its Jacobian matrix JF(w) as

    JF(w) = E{ x xᵀ g′(wᵀx) } − βI

-  To make the inversion of this matrix easy, we approximate the first term. Since the data is sphered (whitened), a reasonable approximation is

    E{ x xᵀ g′(wᵀx) } ≈ E{ x xᵀ } E{ g′(wᵀx) } = E{ g′(wᵀx) } I

   Thus the Jacobian matrix becomes diagonal, and can easily be inverted. We obtain the following approximative Newton iteration :

    w⁺ = w − [ E{ x g(wᵀx) } − βw ] / [ E{ g′(wᵀx) } − β ]
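The one-unit iteration in steps 1-4 above translates almost line for line into code. A sketch, assuming X is centered, whitened data with one sample per column and taking g(u) = tanh(u) :

    import numpy as np

    def fastica_one_unit(X, max_iter=200, tol=1e-6):
        """One-unit FastICA with g(u) = tanh(u); X : whitened data (n, T)."""
        n, T = X.shape
        w = np.random.randn(n)
        w /= np.linalg.norm(w)                   # step 1 : initial unit vector
        for _ in range(max_iter):
            u = w @ X                            # projections w^T x
            g = np.tanh(u)
            g_prime = 1.0 - np.tanh(u) ** 2      # g'(u) for g = tanh
            w_new = (X * g).mean(axis=1) - g_prime.mean() * w   # step 2
            w_new /= np.linalg.norm(w_new)       # step 3
            if abs(w_new @ w) > 1 - tol:         # converged : same direction
                return w_new
            w = w_new                            # step 4 : iterate again
        return w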
> 6.4 SINGULAR VALUE DECOMPOSITION

-  In the singular value decomposition method, a matrix is decomposed into a product of three other matrices :

    A = U S Vᵀ

   Here, A represents an m × n matrix, U represents an m × n orthogonal matrix, S is an n × n diagonal matrix and V is an n × n orthogonal matrix.
-  The matrix U has the left singular vectors as columns; S is a diagonal matrix which contains the singular values; and V has the right singular vectors as columns (so Vᵀ has them as rows). In singular value decomposition, the original data is re-expressed in a coordinate system in which the covariance matrix is diagonal.
-  To calculate the singular value decomposition we need to find the eigenvalues and eigenvectors of AAᵀ and AᵀA. The columns of U consist of the eigenvectors of AAᵀ, and the columns of V consist of the eigenvectors of AᵀA.
-  The square roots of the eigenvalues of AAᵀ (or of AᵀA) give the singular values in S.
-  The singular values are arranged in descending order and stored as the diagonal entries of the S matrix. The singular values are always real numbers. If the matrix A is a real matrix, then U and V are also real.
Example 1 : Find the SVD of  A = | 1  1 |
                                 | 0  0 |

Solution : Now we will calculate the eigenvectors V₁ and V₂ of AᵀA using the method that we have seen in PCA :

    V₁ = | 1/√2 |        V₂ = |  1/√2 |
         | 1/√2 |             | −1/√2 |

Next we will calculate AV₁ and AV₂ :

    AV₁ = | √2 |         AV₂ = | 0 |
          | 0  |               | 0 |

    U₁ = AV₁ / √2 = | 1 |        U₂ = | 0 |
                    | 0 |             | 1 |

(Since AV₂ = 0, U₂ is taken as a unit vector orthogonal to U₁.)

The SVD is written as

    A = | 1  0 | | √2  0 | | 1/√2    1/√2 |
        | 0  1 | |  0  0 | | 1/√2   −1/√2 |
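The factorization in Example 1 can be verified with NumPy (a quick check, using A as reconstructed above) :

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [0.0, 0.0]])

    U, s, Vt = np.linalg.svd(A)
    print(s)                       # singular values : [sqrt(2), 0]
    print(U @ np.diag(s) @ Vt)     # reconstructs A (up to sign conventions)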
> 6.5 UNIVERSITY QUESTIONS AND ANSWERS

Q. 1  Explain in detail Principal Component Analysis for Dimension Reduction. (Ans. : Refer section 6.2)  (10 Marks)

Q. 3  Use Principal Component Analysis (PCA) to arrive at the transformed matrix for the given data A. (Ans. : Refer section 6.2)  (10 Marks)
Multiple Choice Questions

Q. 6.1  Which of the following methods is a dimensionality reduction method ?
(a) Principal Component Analysis    (b) Regression
(c) Classification                  (d) Clustering                            ✓ Ans. : (a)
Explanation : PCA is a DR technique, whereas regression and classification are supervised learning examples and clustering is an example of unsupervised learning.

Q. 6.2  Which of the following properties is true for the PCA algorithm ?
(a) The data used for PCA has less variance
(b) The maximum number of principal components is greater than the number of features
(c) All principal components are orthogonal to each other
(d) PCA is a supervised learning method                                       ✓ Ans. : (c)
Explanation : Property of the PCA algorithm.

Q. 6.3  Singular Value Decomposition (SVD) is which type of method ?
(a) Regression    (b) Classification
(c) Clustering    (d) Dimension Reduction                                     ✓ Ans. : (d)
Explanation : SVD is used to decompose a matrix into its components.

Q. 6.4  If the eigenvalues are roughly equal then _______.
(a) PCA will perform outstandingly    (b) PCA will perform badly
(c) LDA will perform outstandingly    (d) LDA will perform badly              ✓ Ans. : (b)
Explanation : When all the eigenvectors are the same, you won't be able to select the principal components, because in that case all principal components are equal.

Q. 6.5  In PCA, the principal component is an eigenvector for which the eigenvalue is _______.
(a) maximum    (b) minimum
(c) zero       (d) one                                                        ✓ Ans. : (a)
Explanation : Property of eigenvalues and eigenvectors.

Q. 6.6  Feature selection is a process of _______.
(a) selecting the best k features from the original d features
(b) extracting any k features from the original d features
(c) selecting any k features from the original d features
(d) extracting the best k features from the original d features               ✓ Ans. : (a)
Explanation : Out of all the features, the best features are selected.

Q. 6.7  Out of the following, which method is used for dimension reduction ?
(a) Classification    (b) Clustering
(c) Regression        (d) Independent Component Analysis (ICA)                ✓ Ans. : (d)
Explanation : ICA is a DR technique, whereas regression and classification are supervised learning examples and clustering is an example of unsupervised learning.

Q. 6.8  A dimensionality reduction algorithm will _______.
(a) Reduce time complexity      (b) Increase memory complexity
(c) Increase time complexity    (d) Increase the overfitting problem          ✓ Ans. : (a)
Explanation : Since the data dimension is reduced, the time required to process the data will also be reduced.

Q. 6.9  Which of the following is an example of a supervised dimensionality reduction algorithm ?
(a) Naive Bayes    (b) SVM
(c) PCA            (d) LDA                                                    ✓ Ans. : (d)
(a) 40    (c) 30                                                              ✓ Ans. : (c)
Explanation : We can see in the figure that after 30 principal components the curve remains constant.
2. It is invariant to affine transforms.
3. It can be used for lossy image compression.
4. It is not invariant to shadows.
(a) 1 and 2    (b) 2 and 3
(c) 3 and 4    (d) 1 and 4                                                    ✓ Ans. : (c)
Explanation : Option (c) is correct.

Q. 6.44 Under which condition do SVD and PCA produce the same projection result ?
(a) When the data has zero median
(b) When the data has zero mean
(c) Both are always the same
(d) None of these                                                             ✓ Ans. : (b)
Explanation : When the data has a zero mean vector; otherwise you have to center the data first before taking the SVD.

Q. 6.45 Consider 3 data points in the 2-d space : (−1, −1), (0, 0) and (1, 1).

Fig. Q. 6.45

What will be the first principal component for this data ?
1. [1/√2, 1/√2]    2. [−1/√2, −1/√2]
(a) 1 and 2    (b) 3 and 4                                                    ✓ Ans. : (a)

Q. 6.46 What are their coordinates in the 1-d subspace ?
(a) −√2, 0, √2    (b) √2, 0, √2
(c) √2, 0, −√2    (d) −√2, 0, −√2                                             ✓ Ans. : (a)
Explanation : The coordinates of the three points after projection onto v = [1/√2, 1/√2]ᵀ should be
z₁ = x₁ᵀ v = [−1, −1] [1/√2, 1/√2]ᵀ = −√2,
z₂ = x₂ᵀ v = 0,
z₃ = x₃ᵀ v = √2.

Q. 6.47 For the projected data you just obtained, the projections are −√2, 0, √2. Now if we represent them in the original 2-d space and consider them as the reconstruction of the original data points, what is the reconstruction error ? (Context : Q. 6.45-47)
(a) 0 %    (b) 10 %    (c) 30 %    (d) 40 %                                   ✓ Ans. : (a)
Explanation : The reconstruction error is 0, since all the points are perfectly located on the direction of the first principal component. Or, you can actually calculate the reconstruction zᵢ · v and compare it with the original points.

Q. 6.48

Fig. Q. 6.48

(0.5, 0.5, 0.5, 0.5) and (−0.5, −0.5, 0.5, 0.5)
(a) 1 and 2    (b) 1 and 3
(c) 2 and 4    (d) 3 and 4                                                    ✓ Ans. : (d)
Explanation : For the first two choices, the two loading vectors are not orthogonal.

(b) Perpendicular offset
(c) Both
(d) None of these                                                             ✓ Ans. : (b)
Explanation : In regression we always consider the residual as vertical offsets; perpendicular offsets are useful in the case of PCA.

Chapter Ends...