Anshul Dyundi Predictive Modelling Alternate Project July 2022
ALTERNATIVE PROJECT
PREDICTIVE MODELLING
Problem 1: Linear Regression
Problem Statement: You are a part of an investing firm and your work is to do
research about these 759 firms. You are provided with the dataset containing the sales
and other attributes of these 759 firms. Predict the sales of these firms on the basis of
the details given in the dataset so as to help your company in investing consciously.
Also, provide them with 5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
(8 marks)
1.2 Impute null values if present? Do you think scaling is necessary in this case? (8
marks)
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data
into test and train (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using Rsquare, RMSE. (8 marks)
1.4 Inference: Based on these predictions, what are the business insights and
recommendations? (6 marks)
-------------------------------------------------------------------------------------------------------
Data Dictionary for Firm_level_data:
1. sales: Sales (in millions of dollars).
2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock market index
that measures the stock performance of 500 large companies listed on stock
exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a
physical asset's market value and its replacement value.
8. value: Stock market value.
9. institutions: Proportion of stock owned by institutions.
Ans 1.1 We have read the data and have done exploratory data analysis. The brief
summary is as given below:
Univariate Analysis: It refers to the analysis of a single variable. The main purpose of
univariate analysis is to summarize the data and find patterns in it. The key point is that
only one variable is involved in the analysis.
Let us check the distribution of each variable in the data.
Bivariate Analysis: Through bivariate analysis we analyze two variables
simultaneously. As opposed to univariate analysis, where we check the characteristics
of a single variable, in bivariate analysis we try to determine whether there is any
relationship between two variables.
There are essentially 3 major scenarios that we come across when we perform
bivariate analysis:
1. Both variables of interest are qualitative
2. One variable is qualitative and the other is quantitative
3. Both variables are quantitative
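For the third scenario (both variables quantitative), one common summary of the relationship is the Pearson correlation coefficient. The sketch below is illustrative only: the function name and the toy values are hypothetical and are not taken from the firm data set.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Toy values, made up for illustration (not from the data set).
x_values = [10.0, 20.0, 30.0, 40.0]
y_values = [15.0, 25.0, 35.0, 45.0]
r = pearson_correlation(x_values, y_values)
```

A value of r close to +1 or -1 suggests a strong linear relationship, while a value close to 0 suggests little linear relationship.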
1.2 Impute null values if present? Do you think scaling is necessary in this case? (8
marks)
Ans 1.2 We have imputed the null values.
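The imputation method is not stated above. As one common option, the sketch below fills missing values in a numeric column with the median of the observed values. The helper name and the toy column are hypothetical, and median imputation is an assumption rather than the method used in this project.

```python
import statistics

def impute_median(values):
    """Return a copy of `values` with None entries replaced by the median
    of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

# Toy column with two missing entries (made up for illustration).
column = [1.2, None, 0.8, 3.5, None]
filled = impute_median(column)
```

The median is often preferred over the mean when a variable is skewed or has outliers, because it is less affected by extreme values.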
Scaling - Purpose: In this method, we convert variables with different scales of
measurement into a single scale.
In the given data set the attributes are measured in different units and ranges (for
example, sales in millions of dollars and employment in thousands), so we should scale
the data in this case.
Accordingly, we have scaled the dataset after treating the outliers and encoding the
categorical data as numeric values.
Standard Scaler standardizes each variable using the formula (x - mean) / standard deviation,
so that every scaled variable has mean 0 and standard deviation 1.
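The formula can be checked with a short standard-library sketch. It is not the scaler object used in the project; the function name and toy values are hypothetical, and the sketch assumes the population standard deviation (dividing by n).

```python
import statistics

def standardize(values):
    """Apply (x - mean) / standard deviation to every value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (divides by n)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]

# Toy values, made up for illustration.
scaled = standardize([2.0, 4.0, 6.0, 8.0])
```

After this transformation the values are centred on 0 and have unit standard deviation, whatever units the original variable was measured in.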
Ans 1.3 We have encoded the data (having string values) for Modelling and have also
split the data into train and test sets (70:30).
For model building on the train and test data, we have followed the steps given
below:
- 3: Model is introduced.
- 6: Validation.
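A 70:30 split can be sketched with the standard library as shown here. This is an illustration under stated assumptions, not the project's own code: the function name, the seed, and the use of row indices in place of real records are all hypothetical.

```python
import random

def train_test_split_rows(rows, test_fraction=0.30, seed=42):
    """Shuffle the rows reproducibly and hold out `test_fraction` for testing."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Row indices standing in for the 759 firms.
rows = list(range(759))
train, test = train_test_split_rows(rows)
```

Shuffling before splitting avoids a split that depends on the order of the rows in the file, and fixing the seed makes the split reproducible.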
We have applied Linear Regression on the given dataset, and the resulting
performance metrics are as given below:
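For reference, the two metrics named in the question, Rsquare and RMSE, can be computed as in the sketch below. The function names and toy numbers are hypothetical and are not the results of this project.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1 - ss_res / ss_tot

# Toy actual and predicted values, made up for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 6.0, 6.0, 10.0]
error = rmse(y_true, y_pred)
score = r_squared(y_true, y_pred)
```

Computing both metrics on the train set and on the test set, and comparing them, is the usual way to judge whether a model overfits: a much better score on train than on test is a warning sign.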
Univariate analysis explores each variable in a data set separately. It looks at the
range of values, as well as the central tendency of the values. It describes the pattern
of response to the variable. It describes each variable on its own. Descriptive statistics
describe and summarize data.
Univariate analysis refers to the analysis of a single variable. The main purpose of
univariate analysis is to summarize the data and find patterns in it. The key point is that
only one variable is involved in the analysis.
We have performed Univariate analysis as given below:
Bivariate Analysis - Through bivariate analysis we analyze two variables
simultaneously. As opposed to univariate analysis, where we check the characteristics
of a single variable, in bivariate analysis we try to determine whether there is any
relationship between two variables.
There are essentially 3 major scenarios that we will come across when we perform
bivariate analysis
Further, we have also done bivariate analysis between the numerical variables, i.e. frontal
and air bag deployment.
Ans 2.2 We have encoded the data (having string values) for Modelling.
Data Split: We have split the data into train and test sets (70:30).
We have applied Logistic Regression and LDA (Linear Discriminant Analysis) on the
given dataset, taking survived as the target variable.
Ans 2.3 The performance metrics of the Logistic Regression and Linear Discriminant
Analysis models are as given below: