Assignment Project 1

Prepared by: Anil Ulchala
Date: 10th July 2021

Table of Contents
1. Problem Statement:
1.1.1 Use methods of descriptive statistics to summarize data.
1.1.2 Which Region and which Channel spent the most?
1.1.3 Which Region and which Channel spent the least?
1.2 There are 6 different varieties of items that are considered. Describe and comment/explain all the varieties across Region and Channel? Provide a detailed justification for your answer.
1.3 On the basis of the descriptive measure of variability, which item shows the most inconsistent behavior? Which item shows the least inconsistent behavior?
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with the help of detailed comments.
1.5 On the basis of your analysis, what are your recommendations for the business? How can your analysis help the business to solve its problem? Answer from the business perspective.
2. Problem Statement:
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.1.1. Gender and Major
2.1.2. Gender and Grad Intention
2.1.3. Gender and Employment
2.1.4. Gender and Computer
2.2. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
2.2.2. What is the probability that a randomly selected CMSU student will be female?
2.3.1. Find the conditional probability of different majors among the male students in CMSU.
2.3.2 Find the conditional probability of different majors among the female students of CMSU.
2.4. Assume that the sample is a representative of the population of CMSU. Based on the data, answer the following question:
2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.
2.4.2 Find the probability that a randomly selected student is a female and does NOT have a laptop.
2.5.1. Find the probability that a randomly chosen student is a male or has full-time employment?
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in international business or management.
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The Undecided students are not considered now and the table is a 2x2 table. Do you think the graduate intention and being female are independent events?
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text Messages.
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the conditional probability that a randomly selected female earns 50 or more.
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text Messages. For each of them comment whether they follow a normal distribution. Write a note summarizing your conclusions
List of Tables
Table 1 - Average spends across Region and Channel
Table 2 – Contingency Table: Gender Vs Major
Table 3 – Contingency Table: Gender Vs Grad Intention
Table 4 – Contingency Table: Gender Vs Employment
Table 5 – Contingency Table: Gender Vs Computer
Table 6 – Contingency Table: Gender Vs Intent to Graduate (Dropped Undecided)
List of Figures
Figure 1 - Sample Data
Figure 2 - Exploratory Details
Figure 3 - Check for missing Values
Figure 4 - Correlation Map for the Data Set
Figure 5 - Descriptive Information
Figure 6 – Spends across the Region
Figure 7 – Spends across the Channel
Figure 8 – Spends across the Region
Figure 9 – Plots against min and max spends based on the products
Figure 10 – Plots against IQR and CV across different products
Figure 11 – Box Plot for various products
Figure 12 – Q-Q and hist plots for GPA and Salary
Figure 13 – Q-Q and hist plots for Spending and Text Messages
Figure 14 – Box and KDE plots for GPA and Salary
Figure 15 – Box and KDE plots for Spending and Text Messages
Introduction:
The purpose of this exercise is to perform exploratory data analysis on the datasets, using
measures of central tendency and other descriptive parameters. The assignment covers
summary statistics, contingency tables, conditional probabilities and hypothesis testing.
1. Problem Statement:
A wholesale distributor operating in different regions of Portugal has information on annual
spending on several items in its stores across different regions and channels. The data consists of
440 large retailers’ annual spending on 6 different varieties of products in 3 different regions
(Lisbon, Oporto, Other) and across different sales channels (Hotel, Retail).
Data Description:
Region: The location in which the distributor operates; Other, Lisbon, Oporto
Channel: The sales channel used by the distributor; Hotel, Retail
Sample of the Dataset
Figure 1 - Sample Data
Exploratory Data Analysis
Figure 2 - Exploratory Details

There are a total of 440 rows and 10 columns; two columns are of object type and the remaining are
of integer type.
Check for missing values in Data Set
Figure 3 - Check for missing Values
From Figure 3 we can conclude that there are no missing values in the provided data set.
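The checks above (row and column counts, column types, missing values) can be sketched in pandas as follows. This is a minimal illustration on a hypothetical three-row sample; the column names and values are assumptions and are not taken from the actual data set.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumed for illustration only.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel"],
    "Region": ["Other", "Lisbon", "Oporto"],
    "Fresh": [100, 200, 300],
    "Milk": [50, 60, 70],
})

n_rows, n_cols = df.shape                # number of rows and columns
column_types = df.dtypes                 # data type of each column
missing_per_column = df.isnull().sum()   # count of missing values per column

print(n_rows, n_cols)
print(column_types)
print(missing_per_column)
```

A column whose entry in `missing_per_column` is zero has no missing values.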
Correlation plot for Data Set
Figure 4 - Correlation Map for the Data Set
From the correlation plot, we can see which attributes of the data set are highly correlated with each
other. Correlation values near 1 indicate a strong positive correlation and values near -1 indicate a
strong negative correlation. Values near 0 indicate little or no linear correlation.
1.1.1 Use methods of descriptive statistics to summarize data.
Ans: Descriptive statistics help describe and understand the features of a specific data set by giving
short summaries about the sample and measures of the data. The most recognized types of descriptive
statistics are measures of centre: the mean, median, and mode, which are used at almost all levels of
math and statistics.
Figure 5 - Descriptive Information
From the descriptive statistics, we can see that 6 types of products are sold by the
wholesaler. NaN shows that a value cannot be calculated for a particular variable. For example, we
cannot calculate the number of unique values for a numerical variable.
• Channel has two unique values, with "Hotel" the most frequent: 298 out of 440
records (67.7 percent) belong to the "Hotel" channel.
• Region has three unique values, with "Other" the most frequent: 316 out of 440
records (71.8 percent) belong to the "Other" region.
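A summary of this kind, with NaN in the cells that do not apply to a column, can be produced with `DataFrame.describe(include="all")`. The sketch below uses a hypothetical sample; the column names and values are assumptions, not the report's data.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumed for illustration only.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel"],
    "Fresh": [100, 200, 300],
})

# include="all" summarizes categorical and numerical columns in one table.
summary = df.describe(include="all")

# "unique", "top" and "freq" apply only to the categorical column,
# so they are NaN for the numerical column "Fresh".
print(summary)
```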
1.1.2 Which Region and which Channel spent the most?

Ans:
Figure 6 – Spends across the Region

From Figure 6, we can infer the following:
Other is the region that spent the most, with an amount of $ 10677599
Hotel is the channel that spent the most, with an amount of $ 7999569
1.1.3 Which Region and which Channel spent the least?

Ans: From Figure 6, we can infer the following:
Oporto is the region that spent the least, with an amount of $ 1555088
Retail is the channel that spent the least, with an amount of $ 6619931
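Totals such as these can be obtained by summing the product columns per record and then grouping by Region or Channel. The following is a minimal sketch on hypothetical data; the column names, the helper column `Total` and all values are assumptions.

```python
import pandas as pd

# Hypothetical sample; column names and values are assumed for illustration only.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Other", "Lisbon", "Oporto", "Other"],
    "Fresh": [100, 200, 300, 400],
    "Milk": [50, 60, 70, 80],
})

product_columns = ["Fresh", "Milk"]
df["Total"] = df[product_columns].sum(axis=1)   # total spend per record

spend_by_region = df.groupby("Region")["Total"].sum()
spend_by_channel = df.groupby("Channel")["Total"].sum()

top_region, low_region = spend_by_region.idxmax(), spend_by_region.idxmin()
top_channel, low_channel = spend_by_channel.idxmax(), spend_by_channel.idxmin()

print(spend_by_region)
print(spend_by_channel)
```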
1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a
detailed justification for your answer.
Ans:
Figure 7 – Spends across the Channel

Figure 8 – Spends across the Region
Figure 9 – Plots against min and max spends based on the products
Figure 10 – Plots against IQR and CV across different products
Based on Figure 7 through Figure 9, we can infer that categories like Milk, Grocery & Detergents_Paper
have higher spend in the Retail channel versus Hotel, across all regions. On the other hand, Fresh and
Frozen have higher consumption in the Hotel channel versus Retail, across all regions.
Also, from a box plot we can see that spend on Fresh and Grocery is the highest across
region and channel, while spend on Delicatessen is the lowest across region and channel.
Table 1 - Average spends across Region and Channel

In Channel
"Hotel" has the highest average spending on Fresh items and the lowest on Detergents_Paper. "Retail"
has the highest average spending on Grocery items and the lowest on Frozen items. Refer to Figure 7.
In Region
All three regions (Lisbon, Oporto and Other) have the highest average spending on Fresh and the lowest on
Delicatessen items.
1.3 On the basis of the descriptive measure of variability, which item shows
the most inconsistent behavior? Which item shows the least inconsistent
behavior?
Ans: The Coefficient of Variation (CV) was calculated and plotted for all the products; refer to
Figure 10. Product ‘Fresh’ has the lowest CV, 1.05, and product ‘Delicatessen’ has the highest
CV, 1.85.
The product with the lowest CV shows the least inconsistent behavior and the one with the highest CV
shows the most inconsistent behavior. Therefore, among the data provided, Fresh is the least
inconsistent and Delicatessen is the most inconsistent.
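The coefficient of variation is the standard deviation divided by the mean. A minimal sketch of the calculation is shown below on hypothetical values (the numbers are assumptions and do not reproduce the CVs reported above); it uses the sample standard deviation, which is the pandas default.

```python
import pandas as pd

# Hypothetical spends; values are assumed for illustration only.
spend = pd.DataFrame({
    "Fresh": [10, 20, 30],
    "Delicatessen": [1, 2, 15],
})

# CV = standard deviation / mean, computed per column (sample std, ddof=1).
cv = spend.std() / spend.mean()

most_inconsistent = cv.idxmax()    # highest CV
least_inconsistent = cv.idxmin()   # lowest CV

print(cv)
print(most_inconsistent, least_inconsistent)
```

Because CV is unitless, it allows products with very different average spends to be compared on the same scale.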
1.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments.
Ans: Yes, there are outliers. To identify the outliers across the products, a box plot is drawn; refer to
Figure 11.
We can conclude that spends on all the products have outliers, with Fresh showing the most and Delicatessen the least.
Figure 11 – Box Plot for various products
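A box plot conventionally flags points lying more than 1.5 times the IQR below the first quartile or above the third quartile. The sketch below counts such points; the helper name `count_iqr_outliers` and the sample values are hypothetical, and the 1.5 multiplier is the usual convention rather than something stated in this report.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical spends with one unusually large value.
sample = pd.Series([10, 12, 11, 13, 12, 100])
n_outliers = count_iqr_outliers(sample)
print(n_outliers)
```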

1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective.
Ans: Based on the observations above, the following are some recommendations.
1. Some products show high inconsistency in spending. The wholesaler should focus
on those products to bring their spending into a more consistent range.
2. There are many outliers for all the products, which indicates that spends are uneven;
the reasons for these outliers need to be identified.
3. By region, Other is doing well compared to Lisbon and Oporto. The wholesaler needs
to find the grey areas causing lower spends in these two regions and increase marketing
campaigns or promotions to raise the spends.
4. The spending in the Hotel and Retail channels differs, whereas it should be more or less
equal. The wholesaler has to work on prices/discounting to increase spending across the
Hotel channel.
5. The wholesaler also needs to focus on products other than Fresh and Grocery, as mean
spending on the other products is comparatively lower.
2. Problem Statement:
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about
the undergraduate students that attend CMSU. CMSU creates and distributes a survey of 14 questions
and receives responses from 62 undergraduates (stored in the Survey data set).
2.1. For this data, construct the following contingency tables (Keep
Gender as row variable)
2.1.1. Gender and Major
Ans:
Table 2 – Contingency Table: Gender Vs Major

2.1.2. Gender and Grad Intention
Ans:
Table 3 – Contingency Table: Gender Vs Grad Intention
2.1.3. Gender and Employment

Ans:
Table 4 – Contingency Table: Gender Vs Employment
2.1.4. Gender and Computer

Ans:
Table 5 – Contingency Table: Gender Vs Computer
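Contingency tables like Tables 2 to 5 can be built with `pandas.crosstab`, keeping Gender as the row variable. This is a minimal sketch on a hypothetical five-row sample; the column names and values are assumptions and are not the survey data.

```python
import pandas as pd

# Hypothetical survey rows; column names and values are assumed for illustration only.
survey = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Major": ["CIS", "Management", "CIS", "Accounting", "Management"],
})

# Rows are Gender, columns are Major; margins=True adds the "All" totals.
gender_major = pd.crosstab(survey["Gender"], survey["Major"], margins=True)
print(gender_major)
```

Replacing `survey["Major"]` with another categorical column gives the corresponding table for that variable.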
2.2. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be
male?
Ans: Probability that a randomly selected CMSU student will be male is (29/62) = 46.77%
2.2.2. What is the probability that a randomly selected CMSU student will be
female?
Ans: Probability that a randomly selected CMSU student will be female is (33/62) = 53.23%
2.3.1. Find the conditional probability of different majors among the male
students in CMSU.
Ans:
Probability that a male student majors in Accounting is (4/29) = 13.79%
Probability that a male student majors in CIS is (1/29) = 3.45%
Probability that a male student majors in Economics/Finance is (4/29) = 13.79%
Probability that a male student majors in International Business is (2/29) = 6.9%
Probability that a male student majors in Management is (6/29) = 20.69%
Probability that a male student majors in Other is (4/29) = 13.79%
Probability that a male student majors in Retailing/Marketing is (5/29) = 17.24%
Probability that a male student is Undecided is (3/29) = 10.34%
2.3.2 Find the conditional probability of different majors among the female
students of CMSU.
Ans:
Probability that a female student majors in Accounting is (3/33) = 9.09%
Probability that a female student majors in CIS is (3/33) = 9.09%
Probability that a female student majors in Economics/Finance is (7/33) = 21.21%
Probability that a female student majors in International Business is (4/33) = 12.12%
Probability that a female student majors in Management is (4/33) = 12.12%
Probability that a female student majors in Other is (3/33) = 9.09%
Probability that a female student majors in Retailing/Marketing is (9/33) = 27.27%
Probability that a female student is Undecided is (0/33) = 0
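The marginal and conditional probabilities in 2.2 and 2.3 can be reproduced from the Gender and Major counts by dividing each row by its row total. The sketch below enters the counts quoted in the answers above by hand; the variable names are illustrative.

```python
import pandas as pd

majors = ["Accounting", "CIS", "Economics/Finance", "International Business",
          "Management", "Other", "Retailing/Marketing", "Undecided"]

# Counts taken from the numerators in sections 2.3.1 and 2.3.2.
counts = pd.DataFrame(
    [[3, 3, 7, 4, 4, 3, 9, 0],
     [4, 1, 4, 2, 6, 4, 5, 3]],
    index=["Female", "Male"],
    columns=majors,
)

row_totals = counts.sum(axis=1)              # 33 females, 29 males
p_gender = row_totals / row_totals.sum()     # P(Female), P(Male)

# P(Major | Gender): each row divided by its own total.
p_major_given_gender = counts.div(row_totals, axis=0)

print((p_gender * 100).round(2))
print((p_major_given_gender * 100).round(2))
```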
2.4. Assume that the sample is a representative of the population of
CMSU. Based on the data, answer the following question:
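2.4.1. Find the probability that a randomly chosen student is a male and
intends to graduate.
Ans: Probability of a student being male and intending to graduate is (17/62) = 27.42%.
Among the 28 students who intend to graduate, (17/28) = 60.71% are male; that is the conditional
probability of being male given the intention to graduate, not the joint probability.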
2.4.2 Find the probability that a randomly selected student is a female and
does NOT have a laptop.
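Ans: Probability of a student being female and not having a laptop is ((2+2)/62) = 6.45%.
Among the (5+2) = 7 students without a laptop, ((2+2)/(5+2)) = 57.14% are female; that is the
conditional probability of being female given no laptop, not the joint probability.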

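2.5. Assume that the sample is representative of the population of CMSU.
2.5.1. Find the probability that a randomly chosen student is a male or has
full-time employment?
Ans: By the addition rule, P(M ∪ FT) = P(M) + P(FT) − P(M ∩ FT), the probability that a randomly
chosen student is a male or has full-time employment is 51.61%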
2.5.2. Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management.
Ans: Probability that, given a female student is randomly chosen, she is majoring in international
business or management is ((4+4)/33) = 24.24%
2.6. Construct a contingency table of Gender and Intent to Graduate at 2
levels (Yes/No). The Undecided students are not considered now and the
table is a 2x2 table. Do you think the graduate intention and being
female are independent events?
Ans:
Table 6 – Contingency Table: Gender Vs Intent to Graduate (Dropped Undecided)
As we know, if two events A and B are independent, they satisfy the rule below:
P(A ∩ B) = P(A).P(B)
Probability of a student being Female is P(F) = 0.5
Probability of a student intending to graduate is P(Y) = 0.7
Probability of a student being Female and intending to graduate is P(F∩Y) = 0.275
Here P(F).P(Y) = 0.5 × 0.7 = 0.35, which differs from P(F∩Y) = 0.275, so P(F∩Y) ≠ P(F).P(Y)
Therefore, graduate intention and being a female are not independent events.
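The independence check can be written out as a short calculation using the three probabilities stated above. The variable names and the comparison tolerance are illustrative choices.

```python
import math

# Probabilities taken from the answer above (Undecided students dropped).
p_female = 0.5
p_yes = 0.7
p_female_and_yes = 0.275

# Under independence, the joint probability equals the product of the marginals.
product_of_marginals = p_female * p_yes

independent = math.isclose(product_of_marginals, p_female_and_yes, abs_tol=1e-9)
print(product_of_marginals, p_female_and_yes, independent)
```

This compares sample proportions directly; it is not a formal significance test.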
2.7. Note that there are four numerical (continuous) variables in the data
set, GPA, Salary, Spending, and Text Messages.
Answer the following questions based on the data
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA
is less than 3?
Ans: Probability that his/her GPA is less than 3 is 37.0 %
2.7.2. Find the conditional probability that a randomly selected male earns 50
or more. Find the conditional probability that a randomly selected female
earns 50 or more.
2.8. Note that there are four numerical (continuous) variables in the data
set, GPA, Salary, Spending, and Text Messages. For each of them
comment whether they follow a normal distribution. Write a note
summarizing your conclusions
Ans:
Figure 12 – Q-Q and hist plots for GPA and Salary

Figure 13 – Q-Q and hist plots for Spending and Text Messages
Figure 14 – Box and KDE plots for GPA and Salary

Figure 15 – Box and KDE plots for Spending and Text Messages
From the graphs, the continuous variables appear to follow, more or less, a normal distribution.
Empirical Relation for GPA:
1. mu + 1 sigma = 3.129 + 0.377 = 3.506; mu - 1 sigma = 3.129 - 0.377 = 2.752
2. Almost 70% of the data points lie within this range
Empirical Relation for Salary:
1. mu + 1 sigma = 48.548 + 12.080 = 60.628; mu - 1 sigma = 48.548 - 12.080 = 36.468
2. Almost 83% of the data points lie within this range
Empirical Relation for Spending:
1. mu + 1 sigma = 48.548 + 12.080 = 60.628; mu - 1 sigma = 48.548 - 12.080 = 36.468
2. Almost 83% of the data points lie within this range, and the distribution is slightly left skewed
Empirical Relation for Text Messages:
1. mu + 1 sigma = 246.21 + 214.466 = 460.676; mu - 1 sigma = 246.21 - 214.466 = 31.744
2. Almost 82% of the data points lie within this range, and the distribution is left skewed
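For a normal distribution, about 68% of the values fall within one standard deviation of the mean, so the shares quoted above can be compared against that benchmark. The sketch below computes the share for an arbitrary sample; the helper name `share_within_one_sigma` and the sample values are hypothetical, and the sample standard deviation (ddof=1) is an assumed choice.

```python
import numpy as np

def share_within_one_sigma(values) -> float:
    """Fraction of values inside [mu - sigma, mu + sigma]."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std(ddof=1)   # sample standard deviation (assumed choice)
    inside = (values >= mu - sigma) & (values <= mu + sigma)
    return float(inside.mean())

# Hypothetical values, for illustration only.
sample = [2.0, 3.0, 3.0, 3.0, 4.0]
share = share_within_one_sigma(sample)
print(share)
```

A share far from 68% suggests that the variable departs from normality, which can then be checked against the Q-Q plots.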
