Problem 1: Linear Regression
You are part of an investment firm and your work is to research these 759 firms. You are
provided with a dataset containing the sales and other attributes of these 759 firms. Predict the
sales of these firms on the basis of the details given in the dataset so as to help your company
invest consciously. Also, provide them with the 5 attributes that are most important.
As the first step of our analysis, we import all the necessary libraries. After loading the libraries, we
load our dataset (Firm_level_data) for the analysis.
1. First, we looked at the entries in the dataset by checking the top 5 rows.
From the above, we got an idea of how the data is entered.
2. Next, we need the details of the columns: how many entries there are and the data type of
each variable.
From the above, we can infer that there are 10 columns with 759 entries. All the variables are
int or float except sp500, which is an object.
3. Now we need to check whether any variables have null values in the given dataset.
From the above output, no variable except “tobinq” has null values. As the number of null
values in “tobinq” is small, we can replace them with the mean value. After this step, we
confirmed that no null values remain.
4. Next, we need to check whether any entries are duplicated.
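The four checks above can be sketched in pandas as follows. This is a minimal illustration, not the original notebook: the small DataFrame is a made-up stand-in for Firm_level_data, and only the column names "tobinq" and "sp500" come from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Firm_level_data; all values are invented for illustration.
df = pd.DataFrame({
    "sales": [826.9, 407.7, 8407.8, 451.0, 174.9, 451.0],
    "capital": [161.6, 122.1, 6221.1, 266.9, 140.1, 266.9],
    "tobinq": [11.05, 0.84, np.nan, 0.31, 1.99, 0.31],
    "sp500": ["no", "no", "yes", "no", "no", "no"],
})

# Step 1: look at the top 5 rows.
print(df.head())

# Step 2: number of entries, column names and data types.
df.info()

# Step 3: count nulls per column, then fill the "tobinq" gaps with the column mean.
null_counts = df.isnull().sum()
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())

# Step 4: count fully duplicated rows.
n_duplicates = int(df.duplicated().sum())
print(null_counts, n_duplicates)
```

Mean imputation is reasonable here only because few values are missing; with many gaps it would shrink the variance of the column.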
i) Sales:
There is no outlier present in “Sales”. The value ranges between 0 and 2000.
ii) Capital:
There is no outlier present in “Capital”. The value ranges between 0 and 1000.
iii) Patents:
From the above, it is understood that there is no outlier present and the values are ranging
from 0 to 12.
iv) Randd:
The “Randd” has no outliers present and the data ranges between 0 and 150.
v) Employment:
The variable “employment” has no outliers present and the data ranges from 0 to 10.
vi) Tobinq:
There are many outliers present in the data, which need to be taken care of. The value
ranges between 1 and 3.
vii) Value:
There are no outliers present in the data. The value ranges between 0 and 2000.
viii) Institutions:
There are no outliers present in the data. The value ranges from 20 to 60.
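The report does not state which rule was used to flag outliers. A common choice, and the one a box plot draws, is the 1.5 × IQR rule. The sketch below assumes that rule; the helper names and the sample values are hypothetical.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) whisker limits used by a standard box plot."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(series: pd.Series) -> int:
    """Count the values that fall outside the whisker limits."""
    lower, upper = iqr_bounds(series)
    return int(((series < lower) | (series > upper)).sum())

def cap_outliers(series: pd.Series) -> pd.Series:
    """Treat outliers by clipping them to the whisker limits."""
    lower, upper = iqr_bounds(series)
    return series.clip(lower=lower, upper=upper)

# Invented values with one extreme point, standing in for a column such as "tobinq".
tobinq = pd.Series([1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 3.0, 20.0])
print(count_outliers(tobinq))
print(cap_outliers(tobinq).max())
```

Capping keeps every row in the dataset, which matters with only 759 firms; dropping the rows is the alternative.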
Checking for Correlation between the variables:
Sales and capital are the most strongly correlated pair. So, in order to predict sales, we can take
“Capital” for splitting the data.
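A correlation check of this kind can be sketched as below. The data is invented so that "capital" tracks "sales" closely; the column names follow the report, but the values and the ranking they produce are only illustrative.

```python
import pandas as pd

# Invented values; only the column names are taken from the report.
df = pd.DataFrame({
    "sales": [100.0, 200.0, 300.0, 400.0, 500.0],
    "capital": [50.0, 110.0, 140.0, 210.0, 250.0],
    "employment": [5.0, 1.0, 4.0, 2.0, 3.0],
    "sp500": ["no", "no", "yes", "yes", "no"],
})

# Pearson correlation is defined for numeric columns only, so drop the object column first.
corr = df.select_dtypes("number").corr()

# Rank the other variables by their correlation with sales.
sales_corr = corr["sales"].drop("sales").sort_values(ascending=False)
print(sales_corr)
```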
Multivariate Analysis:
1.2 Impute null values if present? Do you think scaling is necessary in this case? (8
marks)
All the null values present in the dataset have been imputed. Scaling is necessary to
bring variables measured on different scales onto the same scale.
Scaling is required in our dataset as well. We treated the outliers present in the dataset
and then standardized the variables with StandardScaler.
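As a minimal sketch of what StandardScaler does, the example below standardizes two made-up columns that sit on very different scales. Each column ends up with mean 0 and standard deviation 1, so regression coefficients become comparable across attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two invented features on very different scales (for example capital and employment).
X = np.array([
    [100.0, 1.0],
    [200.0, 2.0],
    [300.0, 3.0],
    [400.0, 4.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract the column mean, divide by the column std

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

In a modelling workflow the scaler is usually fitted on the training split only and then applied to the test split with `scaler.transform`, so that no information from the test data leaks into training.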
Ans 1.3 We have encoded the data (the columns having string values) for modelling and also
done the data split: the data is split into test and train (70:30).
We split the given dataset into training and testing sets by separating X and Y, giving
X_train, X_test, Y_train, Y_test.
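The encoding and the 70:30 split can be sketched as follows. The report does not say which encoder was used; one-hot encoding with `pd.get_dummies` is an assumption here, as are the toy values and the `random_state`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented rows; "sp500" is the only string column mentioned in the report.
df = pd.DataFrame({
    "sales": [120.0, 340.0, 560.0, 210.0, 980.0, 450.0, 75.0, 630.0, 310.0, 820.0],
    "capital": [60.0, 150.0, 300.0, 90.0, 510.0, 220.0, 30.0, 330.0, 140.0, 400.0],
    "sp500": ["no", "no", "yes", "no", "yes", "no", "no", "yes", "no", "yes"],
})

# Encode the string column as a 0/1 indicator (assumed encoding, not confirmed by the source).
df = pd.get_dummies(df, columns=["sp500"], drop_first=True)

# Separate the predictors (X) from the target (Y), then split 70:30.
X = df.drop(columns="sales")
Y = df["sales"]
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```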
1.4 Inference:
Before going for the new , we need to check whether the capital invested is good, which is
reflected in the scatterplot.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis. (8 marks)
We have to import all the necessary libraries for the data analysis. Then we need to check
the head entries.
Description:
Info:
From the above, we can infer that there are 15 columns in total with 11217 entries. The first column is
unnamed. The data types are integer, float and object.
No variable except “injSeverity” has null values.
Multivariate Analysis:
2.2 We have encoded the data (the columns having string values) for modelling.
Data Split: We have split the data into train and test (70:30).
Taking “Survived” as the target variable, we have split the data into train and test.
Ans 2.3 The performance metrics of the Logistic Regression and Linear Discriminant
Analysis models are as given below:
From the above output, we infer that we have an accuracy of 96% on the testing dataset.
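A comparison of the two classifiers could be set up as sketched below. The data is synthetic (generated with `make_classification`) and merely stands in for the encoded dataset with the “Survived” target, so the numbers it prints are not the report's results; the hyperparameters are assumptions as well.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data as a placeholder for the real dataset.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, test_pred),
        "confusion_matrix": confusion_matrix(y_test, test_pred),
    }

for name, metrics in results.items():
    print(name, metrics["train_accuracy"], metrics["test_accuracy"])
    print(metrics["confusion_matrix"])
```

Reporting train and test accuracy side by side makes it easy to see whether a model overfits: a large gap between the two is the warning sign.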
2.4 Insights:
The accuracy on both the training and the testing data is more or less the same, at about 98%. The
confusion matrices also show this similarity. We can conclude that the logistic regression model is
better for this prediction.