
Assignment-based Subjective Questions/Answers

Q1. From your analysis of the categorical variables from the dataset, what could you infer about their effect on the
dependent variable?
A1. The categorical variables have a significant effect on the target variable: most of them appear in the final model with sizeable coefficients.
The equation of our best-fitted line is:
cnt = 0.2464 yr - 0.0820 holiday - 0.2353 windspeed + 0.0459 summer + 0.0823 winter + 0.1026 Aug - 0.1653 Dec -
0.2074 Feb - 0.2759 Jan - 0.1111 Nov + 0.1212 Sep - 0.3090 Light_Snow - 0.0922 Mist

Q2. Why is it important to use drop_first=True during dummy variable creation?

A2. 1. drop_first=True is important to use because it drops the redundant extra column created during dummy-variable
creation, and hence reduces the correlations (multicollinearity) introduced among the dummy variables.

2. Suppose a categorical column has three values, say furnished, semi_furnished and unfurnished, and we want to create
dummy variables for it. If a row is neither furnished nor semi_furnished, it is obviously unfurnished, so we do not need a
third variable to identify unfurnished.

Hence, if we have a categorical variable with n levels, we only need n - 1 columns to represent the dummy variables.
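As a small illustration (assuming pandas is available; the column name and its three levels here are hypothetical, chosen to match the furnishing example above), pd.get_dummies with drop_first=True keeps only n - 1 columns:

```python
import pandas as pd

# Hypothetical furnishing-status column with three levels
furnishing = pd.Series(
    ["furnished", "semi_furnished", "unfurnished", "semi_furnished"],
    name="furnishing_status",
)

# drop_first=True drops the first level ("furnished"),
# leaving n - 1 = 2 dummy columns
dummies = pd.get_dummies(furnishing, drop_first=True)
print(list(dummies.columns))  # ['semi_furnished', 'unfurnished']

# A row with 0 in both remaining columns is implicitly "furnished",
# so no information is lost by dropping the first column.
```
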

Q5. Based on the final model, which are the top 3 features contributing significantly towards explaining the
demand of the shared bikes?

A5. Judging by the magnitude of the coefficients in the final fitted equation, the top 3 features contributing significantly towards explaining the demand for shared bikes are:

1. Light_Snow (weather situation, coefficient -0.3090)
2. Jan (month, coefficient -0.2759)
3. yr (year, coefficient +0.2464)

Correlation, p-value and VIF were the criteria used to select and retain these features in the model.

General Subjective Questions


Q1. Explain the linear regression algorithm in detail.
A1. Linear Regression is a machine learning algorithm based on supervised learning; it performs a regression task.
Regression models a target prediction value based on independent variables, and is mostly used for finding the
relationship between variables and for forecasting. Different regression models differ in the kind of relationship they
assume between the dependent and independent variables, and in the number of independent variables being used.
Linear regression predicts a dependent variable value (y) from a given independent variable (x). So, this regression
technique finds a linear relationship between x (input) and y (output); hence the name Linear Regression. The equation
is given by: y = b1 + b2*x, where b1 is the intercept and b2 the slope.
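A minimal sketch of fitting that equation by ordinary least squares, assuming NumPy is available (the toy data here is generated from the assumed true line y = 1 + 2x):

```python
import numpy as np

# Toy data generated from the (assumed) true relationship y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Degree-1 least-squares fit: polyfit returns [b2 (slope), b1 (intercept)]
b2, b1 = np.polyfit(x, y, deg=1)
print(b1, b2)  # intercept ~ 1.0, slope ~ 2.0

# Predict y for a new input using y = b1 + b2*x
y_hat = b1 + b2 * 5.0  # ~ 11.0
```
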
Q2. Explain the Anscombe’s quartet in detail.
A2. Anscombe’s Quartet can be defined as a group of four data sets which are nearly identical in simple descriptive
statistics (the mean and variance of x and y, the correlation, and the fitted regression line), but which contain
peculiarities that fool a regression model built naively on them. They have very different distributions and appear very
different when plotted on scatter plots, even though all four provide the same statistical summary. This tells us about the
importance of visualising the data before applying algorithms to build models: the data features must be plotted in order
to see the distribution of the samples, which helps identify anomalies present in the data such as outliers, the diversity of
the data, the linear separability of the data, etc. It also shows that linear regression can only be considered a fit for data
with linear relationships and is incapable of handling other kinds of datasets.
Q3. What is Pearson’s R?
A3. A Pearson's correlation is used when you want to measure the linear relationship between two quantitative
variables. It can be used in a causal as well as an associative research hypothesis (RH), but it cannot be used with an
attributive RH, because an attributive RH is univariate. Pearson's correlation should be used only when there is a linear
relationship between the variables; it can be a positive or negative relationship, as long as it is significant. Correlation is
used for testing in within-groups studies. A possible research hypothesis for this statistical model would be that there is
a positive linear relationship between the variables; another would be that there is a negative linear relationship. If there
is no linear relationship between the variables, then we retain the null hypothesis. Pearson's correlation should be
interpreted only when there is a significant effect (p < .05), i.e. when there is a relationship between the two variables. It
cannot be interpreted when we retain the null hypothesis, because then there is no relationship; it can be interpreted when
the null is rejected. An associative hypothesis can be tested with a correlation whenever we want, but a causal RH can
only be tested with a correlation when a well-run true experiment has been conducted.
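The statistic itself is the covariance of the two variables divided by the product of their standard deviations; a minimal pure-Python sketch (the helper name pearson_r is our own):

```python
from math import sqrt

def pearson_r(x, y):
    """Sample Pearson correlation: r = cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # ~ 1.0 (perfect positive)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # ~ -1.0 (perfect negative)
```

r always lies in [-1, 1]; values near 0 mean no linear relationship (which says nothing about non-linear ones).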
Q4. What is scaling? Why is scaling performed? What is the difference between normalized scaling and
standardized scaling?
A4. Scaling is a data pre-processing step applied to the independent variables to normalize the data within a
particular range. It also helps in speeding up the calculations in an algorithm.

Most of the time, the collected data set contains features that vary highly in magnitude, units and range. If scaling is not
done, the algorithm takes only magnitude into account and not units, leading to incorrect modelling. To solve this issue,
we do scaling to bring all the variables to the same level of magnitude. It is important to note that scaling affects only the
coefficients, and none of the other parameters such as the t-statistic, F-statistic, p-values, R-squared, etc.

Normalization/Min-Max scaling: it brings all of the data into the range [0, 1]. sklearn.preprocessing.MinMaxScaler
helps to implement normalization in Python.

Min-Max scaling: x' = (x - min(x)) / (max(x) - min(x))

Standardization scaling: standardization replaces the values by their Z-scores. It brings all of the data into a standard
normal distribution, with mean zero (μ = 0) and standard deviation one (σ = 1).

Standardization: x' = (x - mean(x)) / sd(x)
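Both formulas are easy to implement directly; a minimal sketch with NumPy (the function names are our own, chosen to mirror the two formulas above):

```python
import numpy as np

def min_max_scale(x):
    """Normalization: x' = (x - min) / (max - min), mapped into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Standardization: Z-scores, x' = (x - mean) / sd."""
    return (x - x.mean()) / x.std()

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(min_max_scale(x))  # values 0, 0.25, 0.5, 0.75, 1
z = standardize(x)
print(round(z.mean(), 10), round(z.std(), 10))  # 0.0 1.0
```

sklearn's MinMaxScaler and StandardScaler do the same thing column-wise, while also remembering the training-set min/max (or mean/sd) so the test set can be transformed consistently.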

Q.5. You might have observed that sometimes the value of VIF is infinite. Why does this happen?

A5. If there is perfect correlation, then VIF = infinity; this shows a perfect correlation between two independent variables.
In the case of perfect correlation we get R² = 1, which leads to 1/(1 - R²) = infinity. To solve this problem we need to drop
one of the variables causing the perfect multicollinearity from the dataset. An infinite VIF value indicates that the
corresponding variable can be expressed exactly as a linear combination of the other variables (which then show an
infinite VIF as well).
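This can be demonstrated numerically: a sketch, assuming NumPy (the vif helper is our own, regressing column j on the rest via least squares; in practice the R² stops just short of 1 due to floating point, so the VIF comes out astronomically large rather than literally infinite):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress it on the other columns, return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1.0 - (y - A @ coef).var() / y.var()
    with np.errstate(divide="ignore"):
        return np.float64(1.0) / (1.0 - r2)  # inf when R^2 == 1 exactly

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)

# Third column is an exact linear combination of the first two
X = np.column_stack([a, b, 2 * a + 3 * b])
print(vif(X, 2))                         # enormous (or inf): perfect multicollinearity
print(vif(np.column_stack([a, b]), 0))   # close to 1: a and b are unrelated
```
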

Q.6. What is a Q-Q plot? Explain the use and importance of a Q-Q plot in linear regression.

A.6. A Quantile-Quantile (Q-Q) plot is a graphical tool that helps us assess whether a set of data plausibly came from some
theoretical distribution, such as a normal, exponential or uniform distribution. It also helps to determine whether two data
sets come from populations with a common distribution.

This helps in a linear-regression scenario when we receive the training and test data sets separately: we can then confirm
using a Q-Q plot that both data sets come from populations with the same distribution.

A few advantages:

a) The sample sizes of the two data sets do not need to be equal.

b) Many distributional aspects, such as shifts in location, shifts in scale, changes in symmetry and the presence of outliers,
can all be detected from this plot.

It is used to check the following scenarios:

If two data sets —

i. come from populations with a common distribution

ii. have common location and scale

iii. have similar distributional shapes

iv. have similar tail behavior
