Introduction-ML Merged
SEM-V
Syllabus Unit Description Duration
1 Introduction: What is Machine Learning. Supervised Learning. Unsupervised Learning 2
Total 30
Teaching and Evaluation Scheme
Program: B. Tech. CSDS Semester : II
Course/Module : Machine Learning Module Code:
Testing: Test the model using unseen test data to assess the model accuracy
CS583, Bing Liu, UIC
What do we mean by Learning?
• Given
• a data set D,
• a task T, and
• a performance measure M,
a computer system is said to learn from D to perform the task T if after learning the system’s
performance on T improves as measured by M.
• In other words, the learned model helps the system to perform T better as compared to no
learning.
An Example
• Data: Loan application data
• Task: Predict whether a loan should be approved or not.
• Performance measure: accuracy.
No learning: classify all future applications (test data) to the majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
• We can do better than 60% with learning.
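The "no learning" baseline above can be sketched in a few lines. This is only an illustration: the label list is a hypothetical stand-in that matches the 9 Yes / 6 No split implied by the 9/15 figure, and the helper names `majority_baseline` and `accuracy` are made up for this sketch.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return the most frequent label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predictions, truths):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)

# Hypothetical label column: 9 approved ("Yes") and 6 rejected ("No") applications.
labels = ["Yes"] * 9 + ["No"] * 6

# "No learning": always answer with the majority class.
guess = majority_baseline(labels)
baseline_accuracy = accuracy([guess] * len(labels), labels)
print(guess, f"{baseline_accuracy:.0%}")
```

Any learned model is worth keeping only if its accuracy on unseen data beats this baseline.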
Fundamental Assumption of Learning
Assumption: The distribution of training examples is identical to the distribution of test
examples (including future unseen examples).
• If the object is rounded, has a depression at the top, and is red in color, then it
will be labeled as Apple.
• If the object is a long curving cylinder with a green-yellow color, then it will
be labeled as Banana.
Now suppose that, after training, the machine is given a new fruit from the basket, say a
banana, and is asked to identify it.
The machine has already learned from the previous data and now has to apply that knowledge.
It will first examine the shape and color of the fruit, then confirm the fruit name
as BANANA and put it in the Banana category. Thus the machine learns from
training data (the basket of fruits) and then applies that knowledge to test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
• Classification: A classification problem is when the output variable is a category, such as
“red” or “blue”, or “disease” or “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as
a dollar amount or a weight.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
Steps
In supervised learning, models are trained using a labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on test
data (a held-out portion of the labelled data that was not used for training), and then it predicts the output.
The working of supervised learning can be understood through the following example:
Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles,
and other polygons. The first step is to train the model on each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
After training, we test the model using the test set, and the task of the model is to identify
the shape.
The machine has already been trained on all types of shapes, so when it encounters a new shape, it classifies
the shape on the basis of the number of sides and predicts the output.
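The train-then-predict flow described above can be sketched with a deliberately simple learner that memorizes, for each combination of features, the label it saw most often. This is an illustrative assumption, not the method of any particular library: the feature encoding (number of sides, whether all sides are equal), the training rows, and the function names `train` and `predict` are all hypothetical.

```python
from collections import Counter, defaultdict

def train(examples):
    """Learn, for each feature combination, the label seen most often."""
    votes = defaultdict(Counter)
    for features, label in examples:
        votes[features][label] += 1
    return {features: counts.most_common(1)[0][0]
            for features, counts in votes.items()}

def predict(model, features):
    """Return the learned label, or 'unknown' for unseen feature combinations."""
    return model.get(features, "unknown")

# Hypothetical labelled training set: (number of sides, all sides equal) -> label
training_set = [
    ((4, True), "square"),
    ((4, True), "square"),
    ((3, False), "triangle"),
    ((3, True), "triangle"),
    ((6, True), "hexagon"),
]

model = train(training_set)
print(predict(model, (6, True)))   # a shape like ones seen in training
print(predict(model, (5, False)))  # a shape never seen in training
```

The second call shows the limitation noted later in these notes: a model cannot label inputs that look nothing like its training data.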
1. Regression
Regression algorithms are used when there is a relationship between the input variable and the
output variable and the output is continuous. They are used to predict continuous quantities, as in weather forecasting
and market-trend analysis. Below are some popular regression algorithms which come under
supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means the output
falls into one of a fixed set of classes, such as Yes-No, Male-Female, True-False, etc.
A typical application is spam filtering. Popular classification algorithms include:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Advantages of supervised learning:
o With the help of supervised learning, the model can predict the output on the basis of
prior experience.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning models help us to solve various real-world problems such as fraud
detection and spam filtering.
o Supervised learning allows data to be collected and produces output based on previous
experience.
o It helps to optimize performance criteria with the help of experience.
o Supervised machine learning helps to solve various types of real-world computation
problems.
Disadvantages of supervised learning:
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is drawn from a different
distribution than the training data.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
o Classifying big data can be challenging.
Unsupervised
For instance, suppose the machine is given an image containing both dogs and cats, which it has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot label them as
‘dogs’ and ‘cats’. But it can group them according to their similarities, patterns, and
differences, i.e., it can split the picture into two parts. The first may
contain all the dogs and the second may contain all the cats.
Nothing was taught to the machine beforehand, which means there was no training data and there were no labelled examples.
Unsupervised learning allows the model to work on its own to discover patterns and information that were
previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
• Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
Types of Unsupervised Learning:-
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
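Clustering Types:-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
(Items 3 to 5 are strictly dimensionality-reduction or decomposition techniques rather than
clustering algorithms; they are commonly listed alongside clustering as unsupervised methods.)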
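To make the idea of discovering groupings concrete, here is a minimal one-dimensional K-means sketch. It is an illustration under stated assumptions: the spending figures and the starting centroids are hypothetical, and a real application would normally use a library implementation with several features per customer.

```python
def kmeans_1d(points, centroids, max_iters=100):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical monthly spending of six customers, with no labels attached.
spending = [10, 12, 11, 50, 52, 49]
centroids, clusters = kmeans_1d(spending, centroids=[10, 50])
print(centroids)
print(clusters)
```

No labels are supplied at any point; the low-spending and high-spending groups emerge from the distances alone.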
Supervised vs. Unsupervised Machine Learning
Parameters | Supervised machine learning | Unsupervised machine learning
Computational complexity | Simpler method | Computationally complex
$$b\,(\text{slope}) = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$
Where,
x and y are two variables on the regression line.
b = Slope of the line.
a = y-intercept of the line.
x = Values of the first data set.
y = Values of the second data set.
Solved Examples
Question: Find linear regression equation for the following two sets of data:
x 2 4 6 8
y 3 7 5 10
Solution:
Construct the following table:
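x | y | x² | xy
2 | 3 | 4 | 6
4 | 7 | 16 | 28
6 | 5 | 36 | 30
8 | 10 | 64 | 80
Σx = 20 | Σy = 25 | Σx² = 120 | Σxy = 144
With n = 4:
$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{4(144) - (20)(25)}{4(120) - (20)^2} = \frac{76}{80}$$
b = 0.95
$$a = \frac{\sum y \sum x^2 - \sum x \sum xy}{n(\sum x^2) - (\sum x)^2} = \frac{(25)(120) - (20)(144)}{4(120) - (20)^2} = \frac{120}{80}$$
a = 1.5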
Linear regression is given by:
y = a + bx
y = 1.5 + 0.95 x
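The solved example can be checked with a short script that applies the same sum formulas. This is a minimal sketch; the function name `fit_line` is an assumption of this example, and it returns the intercept a and slope b in the y = a + bx notation used above.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_x2 - sum_x ** 2
    b = (n * sum_xy - sum_x * sum_y) / denominator       # slope
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # y-intercept
    return a, b

a, b = fit_line([2, 4, 6, 8], [3, 7, 5, 10])
print(f"y = {a:.2f} + {b:.2f} x")
```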
Linear Regression
Problems with Solutions
Linear regression and modelling problems are presented below along with their solutions.
Review
If the plot of n pairs of data (x , y) for an experiment appear to indicate a "linear relationship" between y
and x, then the method of least squares may be used to write a linear relationship between x and y.
The least squares regression line is the line that minimizes the sum of the squares (d1² + d2² + d3² + d4²) of
the vertical deviations from each data point to the line (see Figure 1 for an example with 4 points).
Figure 1. Linear regression where the sum of squared vertical distances d1² + d2² + d3² + d4² between observed and
predicted (line and its equation) values is minimized.
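The least squares regression line for the set of n data points is given by the equation of a line in slope-intercept
form:
y = a x + b
where the slope a and the intercept b are given by
a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)
b = (1/n)(Σy - aΣx)
Note that in this section a is the slope and b is the intercept, whereas the solved example above writes the line as y = a + bx.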
• Problem 1
• Problem 2
a) Find the least square regression line for the following set of data
b) Plot the given points and the regression line in the same rectangular system of axes.
• Problem 3
The values of x and their corresponding values of y are shown in the table below
x 0 1 2 3 4
y 2 3 5 4 6
• Problem 4
The sales of a company (in million dollars) for each year are shown in the table below.
Solution to Problem 1
x | y | xy | x²
-2 | -1 | 2 | 4
1 | 1 | 1 | 1
3 | 2 | 6 | 9
Σx = 2 | Σy = 2 | Σxy = 9 | Σx² = 14
We now use the above formula to calculate a and b as follows
a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) = (3*9 - 2*2) / (3*14 - 2²) = 23/38
b = (1/n)(Σy - aΣx) = (1/3)(2 - (23/38)*2) = 5/19
b) We now graph the regression line given by y = a x + b and the given points.
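Solution to Problem 2
x | y | xy | x²
-1 | 0 | 0 | 1
0 | 2 | 0 | 0
1 | 4 | 4 | 1
2 | 5 | 10 | 4
Σx = 2 | Σy = 11 | Σxy = 14 | Σx² = 6
a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) = (4*14 - 2*11) / (4*6 - 2²) = 34/20 = 1.7
b = (1/n)(Σy - aΣx) = (1/4)(11 - 1.7*2) = 1.9
b) We now graph the regression line given by y = a x + b and the given points.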
Solution to Problem 3
x | y | xy | x²
0 | 2 | 0 | 0
1 | 3 | 3 | 1
2 | 5 | 10 | 4
3 | 4 | 12 | 9
4 | 6 | 24 | 16
Σx = 10 | Σy = 20 | Σxy = 49 | Σx² = 30
We now calculate a and b using the least squares regression formulas for a and b.
a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) = (5*49 - 10*20) / (5*30 - 10²) = 0.9
b = (1/n)(Σy - aΣx) = (1/5)(20 - 0.9*10) = 2.2
b) Now that we have the least squares regression line y = 0.9 x + 2.2, substitute x = 10 to find the
corresponding value of y.
y = 0.9 * 10 + 2.2 = 11.2
Solution to Problem 4
a) We first change the variable x into t such that t = x - 2005, so that t represents the
number of years after 2005. Using t instead of x makes the numbers smaller and therefore
more manageable. The table of values becomes:
t (years after 2005) | 0 | 1 | 2 | 3 | 4
y (sales) | 12 | 19 | 29 | 37 | 45
We now use the table to calculate a and b in the least squares regression line formula.
t | y | ty | t²
0 | 12 | 0 | 0
1 | 19 | 19 | 1
2 | 29 | 58 | 4
3 | 37 | 111 | 9
4 | 45 | 180 | 16
Σt = 10 | Σy = 142 | Σty = 368 | Σt² = 30
We now calculate a and b using the least squares regression formulas for a and b.
a = (nΣty - ΣtΣy) / (nΣt² - (Σt)²) = (5*368 - 10*142) / (5*30 - 10²) = 8.4
b = (1/n)(Σy - aΣt) = (1/5)(142 - 8.4*10) = 11.6
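The change of variable t = x - 2005 only shifts the horizontal axis, so it should leave the slope unchanged and alter only the intercept. The sketch below checks this numerically. It assumes the calendar years are 2005 to 2009 (which follows from t = 0..4), and the function name `least_squares` is hypothetical; it returns the slope first and the intercept second, matching this section's y = a x + b notation.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept

sales = [12, 19, 29, 37, 45]
t = [0, 1, 2, 3, 4]
years = [ti + 2005 for ti in t]  # assumed calendar years behind t

slope_t, intercept_t = least_squares(t, sales)
slope_x, intercept_x = least_squares(years, sales)

print(f"in t:     y = {slope_t:.1f} t + {intercept_t:.1f}")
print(f"in years: y = {slope_x:.1f} x + {intercept_x:.1f}")
```

Both fits describe the same line: the intercept in calendar years equals the intercept in t minus the slope times 2005, which is why the shifted variable is preferred for hand calculation.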
Example 9.9
Calculate the regression coefficient and obtain the lines of regression for the following data
Solution:
Regression coefficient of X on Y
(i) Regression equation of X on Y
= 0.929X+7.284
Example 9.10
Calculate the two regression equations of X on Y and Y on X from the data given below, taking deviations
from the actual means of X and Y.
Solution:
= –0.25 (20)+44.25
= –5+44.25
= 39.25 (when the price is Rs. 20, the likely demand is 39.25)
Example 9.11
Obtain regression equation of Y on X and estimate Y when X=55 from the following
Solution:
(i) Regression coefficients of Y on X
(ii) Regression equation of Y on X
Y–51.57 = 0.942(X–48.29 )
Y = 0.942X–45.49+51.57
Y = 0.942X+6.08
Y= 0.942(55)+6.08=57.89
Example 9.12
Find the means of X and Y variables and the coefficient of correlation between them from the following
two regression equations:
2Y–X–50 = 0
3Y–2X–10 = 0.
Solution:
We are given
We get Y = 90
We get X = 130
2Y = X+50
Example 9.13
Find the means of X and Y variables and the coefficient of correlation between them from the following
two regression equations:
4X–5Y+33 = 0
20X–9Y–107 = 0
Solution:
We are given
We get Y = 17
But this is not possible because both the regression coefficients are greater than 1.
So our above assumption is wrong. Therefore we treat equation (1) as the regression equation of Y on X and
equation (2) as the regression equation of X on Y. So we get
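The reasoning used in Examples 9.12 and 9.13 can be sketched as a small routine. It relies on two standard facts: both regression lines pass through the point of means, and the product of the two regression coefficients equals r², so it cannot exceed 1. The tuple encoding (p, q, c) for pX + qY + c = 0 and the function name `means_and_correlation` are assumptions of this sketch, not notation from the examples.

```python
import math

def means_and_correlation(eq1, eq2):
    """Each equation is (p, q, c), meaning p*X + q*Y + c = 0."""
    p1, q1, c1 = eq1
    p2, q2, c2 = eq2

    # The lines intersect at the means: solve the 2x2 system by Cramer's rule.
    det = p1 * q2 - p2 * q1
    mean_x = (-c1 * q2 + c2 * q1) / det
    mean_y = (-p1 * c2 + p2 * c1) / det

    # Try both role assignments; keep the one whose product b_yx * b_xy <= 1.
    candidates = [
        (-p1 / q1, -q2 / p2),  # eq1 is Y on X, eq2 is X on Y
        (-p2 / q2, -q1 / p1),  # eq2 is Y on X, eq1 is X on Y
    ]
    for b_yx, b_xy in candidates:
        product = b_yx * b_xy
        if 0 <= product <= 1:
            # r takes the common sign of the two regression coefficients.
            r = math.copysign(math.sqrt(product), b_yx)
            return mean_x, mean_y, b_yx, b_xy, r
    raise ValueError("no valid assignment of regression equations")

# Example 9.12: 2Y - X - 50 = 0 and 3Y - 2X - 10 = 0
result_912 = means_and_correlation((-1, 2, -50), (-2, 3, -10))
# Example 9.13: 4X - 5Y + 33 = 0 and 20X - 9Y - 107 = 0
result_913 = means_and_correlation((4, -5, 33), (20, -9, -107))

for name, res in (("9.12", result_912), ("9.13", result_913)):
    mean_x, mean_y, b_yx, b_xy, r = res
    print(f"Example {name}: mean X = {mean_x:g}, mean Y = {mean_y:g}, r = {r:.3f}")
```

For Example 9.13 the routine settles on equation (1) as Y on X and equation (2) as X on Y, the same assignment the solution reaches after rejecting the first assumption.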
Example 9.16
For 5 pairs of observations the following results are obtained: ∑X=15, ∑Y=25, ∑X²=55, ∑Y²=135,
∑XY=83. Find the equations of the lines of regression and estimate the value of X on the first line
when Y=12 and the value of Y on the second line if X=8.
Solution:
Regression equation of Y on X:
Y–5 = 0.8(X–3)
Y = 0.8X+2.6
When X = 8, Y = 0.8(8)+2.6 = 9
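Both regression lines in Example 9.16 can be recomputed directly from the given sums. The sketch below uses the standard sum formulas for the two regression coefficients; the variable and function names are chosen for this illustration only.

```python
n = 5
sum_x, sum_y = 15, 25
sum_x2, sum_y2, sum_xy = 55, 135, 83

mean_x = sum_x / n
mean_y = sum_y / n

# Shared numerator of both regression coefficients.
numerator = n * sum_xy - sum_x * sum_y
b_yx = numerator / (n * sum_x2 - sum_x ** 2)  # slope of Y on X
b_xy = numerator / (n * sum_y2 - sum_y ** 2)  # slope of X on Y

def x_on_y(y):
    """First line: estimate X from Y."""
    return mean_x + b_xy * (y - mean_y)

def y_on_x(x):
    """Second line: estimate Y from X."""
    return mean_y + b_yx * (x - mean_x)

print(f"b_yx = {b_yx:.2f}, b_xy = {b_xy:.2f}")
print(f"X when Y = 12: {x_on_y(12):.2f}")
print(f"Y when X = 8:  {y_on_x(8):.2f}")
```

The Y on X line reproduces Y = 0.8X + 2.6 and the estimate of 9 at X = 8 shown in the solution.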