SMDM Project
12-14-2021
Statistical Methods for Decision Making
Problem 1 Statement:
A wholesale distributor operating in different regions of Portugal has information on the annual spending on
several items in its stores across different regions and channels. The data consists of 440 large
retailers' annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto,
Other) and across 2 sales channels (Hotel, Retail).
Basic EDA
The data has 440 instances with 9 attributes: 7 of integer type and 2 of object type (string columns), as
evident from the result below.
The dataset has 9 variables: 'Buyer/Spender', 'Channel', 'Region', 'Fresh', 'Milk', 'Grocery', 'Frozen',
'Detergents_Paper' and 'Delicatessen'. Channel and Region are both categorical columns, while all the
others are of integer type.
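A minimal sketch of how the attribute types could be inspected with pandas. The three rows below are made-up values for illustration only (the real data set has 440 rows); only the column names are taken from the text.

```python
import pandas as pd

# Made-up rows for illustration only; the real data set has 440 rows.
df = pd.DataFrame({
    "Buyer/Spender": [1, 2, 3],
    "Channel": ["Retail", "Hotel", "Retail"],
    "Region": ["Other", "Lisbon", "Oporto"],
    "Fresh": [100, 200, 300],
    "Milk": [110, 210, 310],
    "Grocery": [120, 220, 320],
    "Frozen": [130, 230, 330],
    "Detergents_Paper": [140, 240, 340],
    "Delicatessen": [150, 250, 350],
})

# Shape gives (instances, attributes).
n_rows, n_cols = df.shape

# Split the columns into numeric and non-numeric (categorical) ones.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```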
The following table is derived using descriptive statistics to summarize the data.
The bar plot below represents the total spending of all the regions.
From the above plot it can be concluded that the region Other has the highest spending and the region
Oporto has the lowest spending.
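The regional totals behind such a bar plot could be computed roughly as follows. This is a sketch on made-up numbers with only two item columns; the column name Total is a hypothetical helper, not a column of the original data.

```python
import pandas as pd

# Made-up spending figures for illustration only.
df = pd.DataFrame({
    "Region": ["Lisbon", "Oporto", "Other", "Other"],
    "Channel": ["Hotel", "Retail", "Retail", "Hotel"],
    "Fresh": [10, 5, 20, 15],
    "Milk": [4, 3, 8, 6],
})
item_cols = ["Fresh", "Milk"]

# Total spending per buyer (hypothetical helper column).
df["Total"] = df[item_cols].sum(axis=1)

# Sum the totals within each region, highest first.
region_totals = df.groupby("Region")["Total"].sum().sort_values(ascending=False)
print(region_totals)
```

Grouping by "Channel" instead of "Region" gives the channel-wise totals in the same way.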
The above table represents the channel-wise distribution of total spending on all the food items. Here it can
be seen that spending is spread across two channels, Hotel and Retail.
The bar plot below represents the total spending of both the channels.
1.2 There are 6 different varieties of items considered. Do all varieties
show similar behavior across Region and Channel? Provide justification
for your answer.
It can be seen that in the region Lisbon the product Detergents_Paper has the maximum coefficient of
variation, so it is the most inconsistent in Lisbon, followed by Grocery. In Oporto, Frozen
products show the most inconsistent behavior, followed by Detergents_Paper. In the
region Other, Delicatessen shows the highest inconsistency, followed by Detergents_Paper.
In the region Lisbon the product Delicatessen has the least coefficient of variation, so it is the most
consistent product in Lisbon, whereas in Oporto Fresh and Delicatessen are the most consistent. In
the Other region, Fresh alone is the most consistent.
It can be seen that in the channel Hotel the product Delicatessen has the maximum coefficient of variation,
so it is the most inconsistent in Hotel, followed by Frozen. In the channel Retail,
Detergents_Paper shows the highest inconsistency, followed by Milk.
In the channel Hotel the product Detergents_Paper has the least coefficient of variation, so it is the
most consistent product in the Hotel channel, whereas in Retail Frozen products are the most consistent.
On the basis of the above analysis, it can be concluded that the 6 varieties of
items do not all show similar behavior across Region and Channel.
1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behavior? Which items show the least inconsistent
behavior?
The above table represents the descriptive statistics of all six food items: Fresh, Milk, Grocery,
Frozen, Detergents_Paper and Delicatessen.
The consistency of any food item can be measured using the coefficient of variation (CV). The higher
the coefficient of variation, the greater the level of inconsistency, and vice versa.
CV = σ / μ
where:
σ = standard deviation
μ = mean
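As a sketch, the coefficient of variation (standard deviation divided by mean) can be computed per item, overall and within each region. The data below is made up for illustration, and pandas' default sample standard deviation (ddof=1) is assumed; the original analysis may have used a different convention.

```python
import pandas as pd

# Made-up spending figures for illustration only.
df = pd.DataFrame({
    "Region": ["Lisbon", "Lisbon", "Oporto", "Oporto"],
    "Fresh": [10.0, 30.0, 20.0, 20.0],
    "Milk": [5.0, 15.0, 8.0, 12.0],
})
items = ["Fresh", "Milk"]

# Overall CV per item: sample standard deviation divided by mean.
overall_cv = df[items].std() / df[items].mean()

# CV per item within each region.
grouped = df.groupby("Region")[items]
cv_by_region = grouped.std() / grouped.mean()

print(overall_cv)
print(cv_by_region)
```

Replacing "Region" with "Channel" in the groupby gives the channel-wise comparison; the item with the largest CV in a group is the most inconsistent one there.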
To determine the presence of outliers in the data, a common method is to create a box plot of all the variables, as
shown below.
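A box plot flags points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be sketched numerically; the values below are made up and do not come from the wholesale data.

```python
import pandas as pd

# Made-up values for illustration only.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the usual box plot convention.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the whiskers are treated as outliers.
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```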
1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective.
On the basis of the analysis, it can be seen that the region Other and the channel Retail have higher
spending than the other regions and channel. Hence, from the business perspective, if a new business is to
be opened, it should be opened in the Other region with the Retail channel, as the Other region
absorbs the maximum amount of sales, and this can boost revenue compared to opening a new
business in Lisbon or Oporto with the Hotel channel.
Also, the food item Delicatessen shows the least inconsistent behavior across all regions and channels, so
Delicatessen is also recommended to be available at all times in all the businesses.
Problem 2:
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the
undergraduate students that attend CMSU. CMSU creates and distributes a survey of 14 questions and receives
responses from 62 undergraduates.
The data is stored in the Survey data set as follows:
2.1 For this data, construct the following contingency tables (Keep
Gender as row variable)
2.2.1. What is the probability that a randomly selected CMSU student will
be male?
Total No of Students = 62
Total No of Male = 29
Probability that a randomly selected student will be male = Total No of Male / Total No of Students
Hence, from the calculations done in Python, we conclude that:
The probability that a randomly selected CMSU student will be male is 46.77 %
2.2.2. What is the probability that a randomly selected CMSU student will
be female?
Total No of Students = 62
Total No of Female = 33
Probability that a randomly selected student will be female = Total No of Female / Total No of Students
The probability that a randomly selected CMSU student will be female is 53.23 %
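The two marginal probabilities can be reproduced from the counts stated above (62 students, 29 male, 33 female). The variable names are illustrative only.

```python
# Counts taken from the text above.
total_students = 62
male_students = 29
female_students = 33

# Marginal probability = favourable count / total count.
p_male = male_students / total_students
p_female = female_students / total_students

print(f"P(male)   = {p_male:.2%}")
print(f"P(female) = {p_female:.2%}")
```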
2.3.1. Find the conditional probability of different majors among the male
students in CMSU.
2.3.2 Find the conditional probability of different majors among the female
students of CMSU.
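One possible way to obtain the conditional probability of each major given gender is a row-normalized contingency table. The five survey rows below are made up for illustration and are not taken from the CMSU data.

```python
import pandas as pd

# Made-up survey rows for illustration only.
survey = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female"],
    "Major": ["Accounting", "Management", "Accounting", "Management", "Retailing"],
})

# normalize="index" divides each row by its row total,
# so each cell is P(Major | Gender).
cond = pd.crosstab(survey["Gender"], survey["Major"], normalize="index")
print(cond)
```

Each row of the result sums to 1, so the Male row gives the distribution of majors among male students and the Female row gives it among female students.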
2.4.1. Find the probability that a randomly chosen student is a male and
intends to graduate.
The probability that a randomly chosen student is a male and intends to graduate is
27.42 %
2.4.2 Find the probability that a randomly selected student is a female and
does NOT have a laptop.
Contingency table for Gender and Computer:
Probability that a randomly selected student is a female and does NOT have a laptop
= Probability that a randomly chosen student is a female × Probability of not having a laptop
given that the student is female
The probability that a randomly selected student is a female and does NOT have a
laptop is 6.45 %
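The multiplication rule used here can be sketched with exact fractions. The count of 4 females without a laptop is an assumption inferred from the reported 6.45 % (4 / 62 ≈ 6.45 %); the totals of 62 students and 33 females come from the text.

```python
from fractions import Fraction

total_students = 62
female_students = 33
females_without_laptop = 4  # assumption inferred from the reported 6.45 %

# P(Female) and P(No laptop | Female).
p_female = Fraction(female_students, total_students)
p_no_laptop_given_female = Fraction(females_without_laptop, female_students)

# Multiplication rule: P(Female and No laptop) = P(Female) * P(No laptop | Female).
p_female_and_no_laptop = p_female * p_no_laptop_given_female

print(f"{float(p_female_and_no_laptop):.2%}")
```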
2.5.1. Find the probability that a randomly chosen student is either a male
or has full-time employment?
Probability that a randomly chosen student is either a male or has full-time employment
= Probability of a student being male + Probability of a student having full-time employment
- Probability of a student being male and having full-time employment
The probability that a randomly chosen student is either a male or has full-time employment is
79.87 %
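The addition rule can be written as a small helper. The counts in the example are hypothetical and are not the CMSU survey counts; the function name is illustrative.

```python
def probability_of_union(p_a, p_b, p_a_and_b):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b

# Hypothetical counts for illustration only.
total = 20
male = 10
full_time = 5
male_and_full_time = 3

p_male_or_full_time = probability_of_union(
    male / total, full_time / total, male_and_full_time / total
)
print(f"{p_male_or_full_time:.2%}")
```

Subtracting the joint probability avoids counting the students who are both male and employed full-time twice.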
The conditional probability that given a female student is randomly chosen, she is majoring in
international business or management is 24.242 %
2x2 contingency table of Gender and Intent to Graduate, without considering the undecided
students
Two events A and B are independent when they satisfy the condition:
P(A ∩ B) = P(A) × P(B)
In this case, whether being female and graduate intention are independent can be checked using the
condition:
P(F ∩ Yes) = P(F) × P(Yes)
where F = Female
Yes = Grad Intention being Yes
Hence, graduate intention and being female are not independent events.
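A sketch of an independence check on a 2x2 table, comparing the joint probability with the product of the marginals. The counts are hypothetical, not the survey's, and the function name is illustrative.

```python
from fractions import Fraction

def is_independent(both, a_total, b_total, grand_total):
    """Return True when P(A and B) equals P(A) * P(B) exactly."""
    p_a = Fraction(a_total, grand_total)
    p_b = Fraction(b_total, grand_total)
    p_both = Fraction(both, grand_total)
    return p_both == p_a * p_b

# Hypothetical table: Female/Yes = 4, Female/No = 6, Male/Yes = 6, Male/No = 4.
# P(F) = 10/20, P(Yes) = 10/20, P(F and Yes) = 4/20.
dependent_case = is_independent(both=4, a_total=10, b_total=10, grand_total=20)

# Hypothetical table with every cell equal to 5.
independent_case = is_independent(both=5, a_total=10, b_total=10, grand_total=20)

print(dependent_case, independent_case)
```

With sample data the two sides rarely match exactly, so a tolerance or a formal test may be preferable to an exact comparison.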
2.7. Note that there are four numerical (continuous) variables in the data
set, GPA, Salary, Spending, and Text Messages.
GPA is a continuous variable; the probability that a student's GPA is less than 3 is calculated here
using the Poisson distribution.
If a student is chosen at random, the probability that his/her GPA is less than 3
is 39.49 %
2.6.2. Find the conditional probability that a randomly selected male earns
50 or more. Find the conditional probability that a randomly selected
female earns 50 or more.
The above distplot represents the salary of all the males in the data.
As we can see, it is approximately normally distributed; hence the conditional probability that a randomly selected male earns
50 or more can be calculated using the normal distribution.
To calculate this, we calculate the cumulative probability for less than 50 using the normal distribution and
then subtract it from 1.
The conditional probability that a randomly selected male earns 50 or more is 83.04 %
As we can see, it is approximately normally distributed; hence the conditional probability that a randomly selected female earns
50 or more can be calculated using the normal distribution.
To calculate this, we calculate the cumulative probability for less than 50 using the normal distribution and
then subtract it from 1.
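The "one minus cumulative probability" step can be sketched with Python's standard-library NormalDist, used here as a stand-in for whatever routine the original calculation relied on. The mean and standard deviation in the example are hypothetical, not the survey's salary statistics.

```python
from statistics import NormalDist

def prob_at_least(threshold, mean, std_dev):
    """P(X >= threshold) for X ~ Normal(mean, std_dev), computed as 1 - CDF."""
    return 1 - NormalDist(mu=mean, sigma=std_dev).cdf(threshold)

# Hypothetical salary mean and standard deviation for one gender group.
p_earns_50_or_more = prob_at_least(50, mean=55.0, std_dev=5.0)
print(f"{p_earns_50_or_more:.2%}")
```

In practice the mean and standard deviation would be estimated separately from the male and the female salary columns.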
2.8. Note that there are four numerical (continuous) variables in the data
set, GPA, Salary, Spending, and Text Messages. For each of them comment
whether they follow a normal distribution. Write a note summarizing your
conclusions.
Problem 3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount
of moisture the shingles contain when they are packaged. Customers may feel that they have
purchased a product lacking in quality if they find moisture and wet shingles inside the packaging. In
some cases, excessive moisture can cause the granules attached to the shingles for texture and
colouring purposes to fall off the shingles resulting in appearance problems. To monitor the amount
of moisture present, the company conducts moisture tests. A shingle is weighed and then dried. The
shingle is then reweighed, and based on the amount of moisture taken out of the product, the pounds
of moisture per 100 square feet is calculated. The company would like to show that the mean moisture
content is less than 0.35 pound per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles
and 31 for B shingles.
3.1 Do you think there is evidence that mean moisture contents in both
types of shingles are within the permissible limits? State your conclusions
clearly showing all steps.
For the A shingles, the null and alternative hypotheses to test whether the population mean moisture content is
less than 0.35 pound per 100 square feet are given:
The sample is not a large sample, so the t distribution and the tSTAT test statistic are used.
Since we are testing only sample A, we use a one-sample t test. By default, ttest_1samp in
Python returns the two-sided p-value, so it is divided by 2 because ours is a one-sided test.
We do not have enough evidence to reject the null hypothesis since p value > level of significance.
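A sketch of the one-sample, one-sided t test with scipy. It assumes the hypotheses H0: μ ≥ 0.35 against H1: μ < 0.35 and a 0.05 level of significance, both of which are assumptions here; the eight measurements are made up and are not the shingles data.

```python
from scipy import stats

# Made-up moisture measurements for illustration only.
sample = [0.30, 0.33, 0.36, 0.29, 0.31, 0.34, 0.32, 0.35]
hypothesized_mean = 0.35
alpha = 0.05  # assumed level of significance

# ttest_1samp returns the two-sided p-value by default.
t_stat, p_two_sided = stats.ttest_1samp(sample, popmean=hypothesized_mean)

# Halving is only valid when the t statistic points in the direction of H1
# (here: negative, i.e. sample mean below 0.35).
if t_stat < 0:
    p_one_sided = p_two_sided / 2
else:
    p_one_sided = 1 - p_two_sided / 2

reject_null = p_one_sided < alpha
print(t_stat, p_one_sided, reject_null)
```

The sign check matters: dividing the two-sided p-value by 2 when the sample mean lies above 0.35 would understate the p-value for this alternative.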
For the B shingles, the null and alternative hypotheses to test whether the population mean moisture content is
less than 0.35 pound per 100 square feet are given:
The sample is not a large sample, so the t distribution and the tSTAT test statistic are used.
Since we are testing only sample B, we use a one-sample t test. By default, ttest_1samp in
Python returns the two-sided p-value, so it is divided by 2 because ours is a one-sided test.
We have evidence to reject the null hypothesis since p value < level of significance.
3.2 Do you think that the population mean for shingles A and B are equal?
Form the hypothesis and conduct the test of the hypothesis. What
assumption do you need to check before the test for equality of means is
performed?
To perform a test of equality of the population means of the A shingles and B shingles, the null and alternative
hypotheses to test whether the population mean moisture contents are equal are given:
We have two samples, A and B, and we do not know the population standard deviations.
The samples are not large, so the t distribution and the tSTAT test statistic are used.
Since we are testing for equality between samples A and B, we use a two-sample t test.
We do not have enough evidence to reject the null hypothesis in favour of the alternative
hypothesis since p value > level of significance.
Therefore, there is no evidence that the population means for shingles A and B differ.
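A sketch of the two-sample t test with scipy, preceded by Levene's test as one possible check of the equal-variance assumption. The two samples are made up, the 0.05 threshold is an assumption, and the hypotheses are taken to be H0: μA = μB against H1: μA ≠ μB.

```python
from scipy import stats

# Made-up moisture measurements for illustration only.
sample_a = [0.30, 0.33, 0.36, 0.29, 0.31, 0.34]
sample_b = [0.28, 0.27, 0.31, 0.26, 0.30, 0.29]
alpha = 0.05  # assumed level of significance

# Levene's test: H0 is that both samples have equal variances.
levene_stat, levene_p = stats.levene(sample_a, sample_b)
equal_var = bool(levene_p > alpha)

# Two-sided two-sample t test; Welch's version is used when variances differ.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=equal_var)

reject_null = p_value < alpha
print(t_stat, p_value, reject_null)
```

The test also assumes independent samples and approximately normal populations, which can be examined separately before interpreting the p-value.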
Thank You