Problem 1: Linear Regression

You are part of an investment firm and your work is to research these 759 firms. You are
provided with a dataset containing the sales and other attributes of these 759 firms. Predict the
sales of these firms on the basis of the details given in the dataset so as to help your company
invest consciously. Also, provide them with the 5 attributes that are most important.
As the first step of our analysis, we import all the necessary libraries. After loading the libraries, we
load our data set (Firm_level_data) for the analysis.
1. First, we looked at the entries in the data set by checking the top 5 rows.
From the above, we got an idea of how the data is entered.
2. Next, we need to know the details of the columns, including the number of entries and
the data type of each variable.
From the above, we can infer that there are 10 columns with 759 entries. All the variables are
int or float except sp500, which is an object.
3. Now, we need to know whether any of the variables have null values in the given data set.
From the above output, none of the variables except “tobinq” has null values. As the
number of null values in “tobinq” is small, we can replace them with the mean value. After this
process, we confirmed that no null values remain.
4. Next, we need to know whether any rows are duplicated.
There is no duplication present in the dataset provided.
5. Now, we need to describe the data set.

6. Univariate analysis:
i) Sales:
There are no outliers present in “Sales”. The values range between 0 and 2000.
ii) Capital:
There are no outliers present in “Capital”. The values range between 0 and 1000.
iii) Patents:
From the above, it is understood that there are no outliers present and the values range
from 0 to 12.
iv) Randd:
“Randd” has no outliers present and the values range between 0 and 150.
v) Employment:
The variable “employment” has no outliers present and the values range from 0 to 10.
vi) Tobinq:
There are many outliers present in this variable, which need to be taken care of. The values
range between 1 and 3.
vii) Value:
There are no outliers present in this variable. The values range between 0 and 2000.
viii) Institutions:
There are no outliers present in this variable. The values range from 20 to 60.
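The report does not say how outliers were identified. A minimal sketch, assuming the usual box-plot rule (values beyond 1.5 times the interquartile range from the quartiles), is shown here. The helper name `iqr_outlier_mask` and the sample values are made up for illustration and are not taken from the firm data.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot rule)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up sample standing in for one numeric column (not the real data).
sample = pd.Series([1.0, 1.2, 1.1, 1.3, 1.25, 1.15, 9.0], name="tobinq")

mask = iqr_outlier_mask(sample)
n_outliers = int(mask.sum())
print(f"{sample.name}: {n_outliers} outlier(s), "
      f"range {sample.min()} to {sample.max()}")
```

Applying the same helper to each numeric column gives a per-variable outlier count that can be read alongside the box plots.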
Checking for correlation between the variables:
Sales and capital are highly correlated. So, in order to predict the sales, we can use
“Capital” when splitting the data.
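The correlation check can be sketched with pandas as below. The data is synthetic and built so that sales tracks capital. The column names mirror the report, but the numbers are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
capital = rng.uniform(0, 1000, n)

# Synthetic stand-in for the firm data (values are invented).
df = pd.DataFrame({
    "capital": capital,
    "sales": 2.0 * capital + rng.normal(0, 50, n),  # built to follow capital
    "institutions": rng.uniform(20, 60, n),          # unrelated by construction
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Rank the other variables by absolute correlation with sales.
sales_corr = corr["sales"].drop("sales").abs().sort_values(ascending=False)
top_feature = sales_corr.index[0]
print(sales_corr)
```

A heatmap of `corr` is a common way to present the same matrix visually.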
Multivariate Analysis:
1.2 Impute null values if present? Do you think scaling is necessary in this case? (8 marks)
All the null values present in the data set have been imputed. Scaling is necessary to
bring variables measured on different scales onto a common scale.
Scaling is therefore required for our data set as well. We treated the outliers present in the
dataset and then standardized the variables with StandardScaler.
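A minimal sketch of the imputation and scaling described above, assuming mean imputation for “tobinq” and scikit-learn's `StandardScaler`. The small data frame is made up, and outlier treatment is assumed to have been done beforehand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows; only the column names follow the report.
df = pd.DataFrame({
    "tobinq": [1.2, np.nan, 2.8, 1.6],
    "value": [150.0, 1800.0, 40.0, 900.0],
})

# Replace missing tobinq values with the column mean.
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())
remaining_nulls = int(df.isnull().sum().sum())

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.round(3))
```

In a train/test workflow the scaler is normally fitted on the training rows only and then applied to the test rows, so that no test information leaks into the model.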
Ans 1.3 We have encoded the data (having string values) for modelling and also
done the data split: the data is split into train and test sets (70:30).
We split the given dataset into training and testing sets by separating X and Y into
X_train, X_test, Y_train and Y_test.
We then fit the model.
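The encode, split and fit steps can be sketched as follows. This uses synthetic data. The "yes"/"no" values for sp500, the reduced set of columns and the `random_state` are assumptions, not details confirmed by the report.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in for the firm data (values are invented).
df = pd.DataFrame({
    "capital": rng.uniform(0, 1000, n),
    "employment": rng.uniform(0, 10, n),
    "sp500": rng.choice(["yes", "no"], n),
})
df["sales"] = 1.5 * df["capital"] + 40 * df["employment"] + rng.normal(0, 30, n)

# Encode the string column as 0/1 so the model can use it.
df["sp500"] = (df["sp500"] == "yes").astype(int)

# Separate predictors and target, then split 70:30.
X = df.drop(columns="sales")
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", round(model.score(X_test, y_test), 4))
```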
The performance metrics are as follows:
R Square on training data is 83.15%

RMSE on training data is 6%
RMSE on testing data is 5.19%
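For reference, R-squared and RMSE can be computed with scikit-learn as in this small sketch. The arrays are toy values chosen so that the arithmetic is easy to follow by hand; they are not the report's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions (not from the firm data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

# R^2 = 1 - SS_res / SS_tot = 1 - 0.5 / 20
r2 = r2_score(y_true, y_pred)

# RMSE = sqrt(mean squared error) = sqrt(0.125)
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.4f}")
```

RMSE is expressed in the units of the target variable, so its size should be read against the scale of the sales values used for modelling.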
1.4 Inference:
Before going for the new , we need to check whether the capital invested is good, which is
reflected in the scatterplot.
The important variables are value, employment, sales and patents.
The most important attributes are Employment and Patents.
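The report does not state how the important attributes were chosen. One common approach, shown here purely as an assumption, is to fit the linear model on standardized predictors and rank them by the absolute size of their coefficients. The data is synthetic and the effect sizes are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 400

# Synthetic predictors; only the column names echo the report.
X = pd.DataFrame({
    "employment": rng.normal(size=n),
    "patents": rng.normal(size=n),
    "institutions": rng.normal(size=n),
})
# Invented relationship: employment matters most, institutions not at all.
y = 5.0 * X["employment"] + 2.0 * X["patents"] + rng.normal(0, 0.5, n)

# Standardizing first makes the coefficient magnitudes comparable.
X_scaled = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

importance = pd.Series(np.abs(model.coef_), index=X.columns)
importance = importance.sort_values(ascending=False)
ranking = list(importance.index)
print(importance.round(3))
```

With correlated predictors, coefficient size alone can mislead, so a ranking like this is best treated as a starting point.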

Problem 2: Logistic Regression and LDA
You are hired by the Government to do an analysis of car crashes. You are provided details
of car crashes, among which some people survived and some didn't. You have to help
the government in predicting whether a person will survive or not on the basis of the
information given in the data set so as to provide insights that will help the government to
make stronger laws for car manufacturers to ensure safety measures. Also, find out
the important factors on the basis of which you made your predictions.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis. (8 marks)
We import all the necessary libraries for the data analysis, then check
the head entries.
Description:
Info:
From the above, we can infer that there are 15 columns in total with 11217 entries. The first column is
unnamed. The data types are integer, float and object.
To check the null values in the dataset:
None of the variables except “injSeverity” has any null values.
Multivariate Analysis:
The above shows the collinearity between the variables.
2.2 We have encoded the data (having string values) for modelling.
Data Split: We have split the data into train and test (70:30).
Taking “Survived” as the target variable, we split the data into train and test.
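A minimal sketch of the encoding and 70:30 split for the classification problem. The column names other than “Survived”, the category labels and the use of a stratified split are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows; only the target name "Survived" comes from the report.
df = pd.DataFrame({
    "Survived": ["survived", "Not_Survived"] * 50,
    "sex": ["m", "f", "f", "m"] * 25,
    "speed": [30, 55, 40, 70] * 25,
})

# Binary target: 1 if the person survived, 0 otherwise.
y = (df["Survived"] == "survived").astype(int)

# One-hot encode the string predictors, dropping one level per variable.
X = pd.get_dummies(df.drop(columns="Survived"), columns=["sex"], drop_first=True)

# stratify=y keeps the class proportions equal in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```

Stratifying is optional, but it helps when one outcome is much rarer than the other.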
Ans 2.3 The performance metrics of the Logistic Regression and Linear Discriminant
Analysis models are given below:
We have split the data into training and testing sets.
From the above output, we infer that we have an accuracy of 96% on the testing dataset.
Based on the confusion matrix, the accuracy is 98%.
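The comparison of the two models can be sketched as below, using scikit-learn's `LogisticRegression` and `LinearDiscriminantAnalysis` on a synthetic data set from `make_classification`. The data and settings are assumptions, so the scores will not match the report's figures.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the car-crash data).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = [
    ("logistic", LogisticRegression(max_iter=1000)),
    ("lda", LinearDiscriminantAnalysis()),
]

results = {}
for name, clf in models:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    cm = confusion_matrix(y_test, pred)
    # Accuracy is the share of predictions on the matrix diagonal.
    acc_from_cm = np.trace(cm) / cm.sum()
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "accuracy_from_cm": acc_from_cm,
        "cm": cm,
    }
    print(name, round(results[name]["accuracy"], 3), cm.tolist())
```

Because accuracy is computed from the same predictions as the confusion matrix, the two figures should agree when evaluated on the same data set.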
2.4 Insights:
The accuracy on both the training and the testing data is more or less the same at 98%. The
confusion matrix also shows the similarity. We can conclude that the logistic method is
better for making the prediction.
