
2021

SMDM PROJECT
DSBA

NATASHA CHAUHAN
9/10/2021
CONTENT

1. Wholesale Customer Data Analysis
Problem 1.1
Problem 1.2
Problem 1.3
Problem 1.4
Problem 1.5

2. Clear Mountain State University (CMSU) Survey
Problem 2.1
Problem 2.2
Problem 2.3
Problem 2.4
Problem 2.5
Problem 2.6
Problem 2.7
Problem 2.8

3. Hypothesis Testing for Quality of Shingles
Problem 3.1
Problem 3.2

Problem 1: Wholesale Customers Analysis
Problem Statement:
A wholesale distributor operating in different regions of Portugal has information on the
annual spending on several items in its stores across different regions and channels. The
data consists of 440 large retailers' annual spending on 6 different varieties of products
in 3 different regions (Lisbon, Oporto, Other) and across 2 sales channels (Hotel, Retail).
Solution:
Import the required libraries and load the data into the Jupyter notebook.

Basic EDA:
• There are 440 entries and 9 columns in total. The 9 variables are Buyer/Spender,
Channel, Region, Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen.
Two columns are of object (string) type and seven are integers, so there are
2 categorical columns and 7 numerical columns.

• There are no null values in the data (checked with isnull()).

1.1 Use methods of descriptive statistics to summarize data. Which Region and
which Channel spent the most? Which Region and which Channel spent the
least?

The describe() function is used to summarize the data statistically:

For numerical columns, describe() reports the count, mean, standard deviation, minimum,
quartiles (25%, 50%, 75%) and maximum. For categorical columns it reports the count, the
number of unique values, the most frequent value and its frequency. Together these cover
measures of central tendency (mean, median) and of dispersion (standard deviation, range).
• There are 2 categorical variables (Channel, Region) and 7 numerical variables
(Buyer/Spender, Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen).
o Channel: 2 unique values. Hotel has the highest frequency, 298 of 440.
o Region: 3 unique values. Other has the highest frequency, 316 of 440.
o Fresh:
Count 440, mean 12000, sd 12647.33, min 3, max 112151
Quartile 1 (25%): 3127.75, Quartile 2 (50%, the median): 8504.0,
Quartile 3 (75%): 16933.75
o Milk:
Count 440, mean 5796.27, sd 7380.38, min 55, max 73498
Quartile 1 (25%): 1533, Quartile 2 (50%, the median): 3627,
Quartile 3 (75%): 7190.25
o Grocery:
Count 440, mean 7951.28, sd 9503.16, min 3, max 92780
Quartile 1 (25%): 2153.0, Quartile 2 (50%, the median): 4755.5,
Quartile 3 (75%): 10655.75
o Frozen:
Count 440, mean 3071.93, sd 4854.67, min 25, max 60869
Quartile 1 (25%): 742.25, Quartile 2 (50%, the median): 1526.0,
Quartile 3 (75%): 3554.25
o Detergents_Paper:
Count 440, mean 2881.49, sd 4767.85, min 3, max 40827
Quartile 1 (25%): 256.75, Quartile 2 (50%, the median): 816.5,
Quartile 3 (75%): 3922.0
o Delicatessen:
Count 440, mean 1524.87, sd 2820.10, min 3, max 47943
Quartile 1 (25%): 408.25, Quartile 2 (50%, the median): 965.5,
Quartile 3 (75%): 1820.25

To find the highest spending within each Region and Channel, we create a new column
"Spending", which is the total of all 6 item variables.

Method 1: Calculation using groupby

groupby is used to find the highest-spending Channel and Region. It sums the Spending
values for each level of the grouping variable (Region or Channel in this case).
From the Python calculation, we conclude:
• "Other" is the region that spent the most and "Hotel" is the channel that spent
the most.
• "Oporto" is the region that spent the least and "Retail" is the channel that spent
the least.
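
A minimal sketch of the groupby approach, using a tiny made-up data frame rather than the 440-row wholesale file. The column names come from the report; the rows and values are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical rows for illustration; the real data has 440 retailers.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Other", "Other", "Lisbon", "Oporto"],
    "Fresh": [12000, 3000, 8000, 1000],
    "Milk": [2000, 6000, 1500, 3000],
    "Grocery": [3000, 8000, 2500, 4000],
    "Frozen": [4000, 800, 3000, 500],
    "Detergents_Paper": [500, 3000, 400, 2000],
    "Delicatessen": [1000, 1500, 900, 700],
})

items = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# Total spending per retailer across the six item columns.
df["Spending"] = df[items].sum(axis=1)

# Sum the per-retailer totals within each Region and each Channel.
by_region = df.groupby("Region")["Spending"].sum()
by_channel = df.groupby("Channel")["Spending"].sum()

top_region, low_region = by_region.idxmax(), by_region.idxmin()
top_channel, low_channel = by_channel.idxmax(), by_channel.idxmin()

print(by_region.sort_values(ascending=False))
print(by_channel.sort_values(ascending=False))
```

idxmax() and idxmin() return the group label directly, which avoids reading the answer off a sorted table by eye.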

Method 2: Calculation using crosstab


• The highest-spending Region is "Other", with "Hotel" being the highest-spending channel.
• The lowest-spending Region is "Oporto", where "Retail" is the lowest-spending channel.
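
The crosstab approach can be sketched in the same way. The figures below are made up for illustration; `pd.crosstab` with `values` and `aggfunc` is a standard pandas call, but the exact arguments used in the original notebook are not shown in the report, so this is an assumption about how it may have been done.

```python
import pandas as pd

# Hypothetical per-retailer totals, one row per Region/Channel pair.
df = pd.DataFrame({
    "Region": ["Other", "Other", "Lisbon", "Lisbon", "Oporto", "Oporto"],
    "Channel": ["Hotel", "Retail", "Hotel", "Retail", "Hotel", "Retail"],
    "Spending": [500, 400, 200, 150, 120, 100],
})

# Rows are regions, columns are channels, cells are summed spending.
table = pd.crosstab(df["Region"], df["Channel"],
                    values=df["Spending"], aggfunc="sum")

region_totals = table.sum(axis=1)    # total per region
channel_totals = table.sum(axis=0)   # total per channel

print(table)
print(region_totals.idxmax(), channel_totals.idxmax())
```

Compared with two separate groupby calls, the crosstab shows the Region and Channel breakdown in a single table.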

1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a
detailed justification for your answer.
Solution:
After plotting the spending on each item with the crosstab function, we can conclude that
the items do not all behave the same way across regions and channels. Crosstab is used
here to separate the Channels clearly within each Region.

• The items clearly do not behave in a similar manner across channel and region. Fresh
and Frozen have higher spending in the Hotel channel than in the Retail channel across
all regions, whereas Milk, Grocery and Detergents_Paper have higher spending in Retail
than in Hotel across all regions. Delicatessen also has slightly more spending from
Retail than from Hotel.
• It is safe to say that Fresh and Frozen items are mostly bought by the Hotel channel
irrespective of region. On the other hand, items like Milk, Grocery and
Detergents_Paper are mostly bought by the Retail channel irrespective of region.
• Delicatessen, however, is split almost equally between the two channels in Oporto.
It is the one variable whose difference in spending between channels depends on the
region: there is a large difference between the channels in Lisbon but almost none in
Oporto.
• From these observations, the distribution of the other 5 variables between channels is
not affected by region, while the distribution of Delicatessen is affected slightly.

1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behaviour? Which items show the least inconsistent
behaviour?
The table below summarizes the descriptive measures of all six items in the data:

The consistency of the items can be compared using the Coefficient of Variation (CV), a
simple and quick tool for comparing the spread of different data series relative to their
means. A higher CV reflects higher inconsistency.
Formula for the Coefficient of Variation (CV):
CV = σ/μ
Where:
σ = Standard deviation
μ = Mean
From the last row of the table (CV), the Coefficient of Variation is highest for
"Delicatessen" and lowest for "Fresh". Hence the item that shows the most inconsistent
behaviour is Delicatessen and the item that shows the least inconsistent behaviour is
Fresh.
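
As a quick check, the CV can be recomputed from the means and standard deviations reported in the descriptive summary of section 1.1. This is a plain-Python sketch; the dictionary layout and variable names are illustrative, not taken from the original notebook.

```python
# (mean, standard deviation) per item, as reported in section 1.1
summary = {
    "Fresh": (12000.0, 12647.33),
    "Milk": (5796.27, 7380.38),
    "Grocery": (7951.28, 9503.16),
    "Frozen": (3071.93, 4854.67),
    "Detergents_Paper": (2881.49, 4767.85),
    "Delicatessen": (1524.87, 2820.10),
}

# CV = standard deviation / mean
cv = {item: sd / mean for item, (mean, sd) in summary.items()}

most_inconsistent = max(cv, key=cv.get)
least_inconsistent = min(cv, key=cv.get)

for item, value in sorted(cv.items(), key=lambda kv: kv[1]):
    print(f"{item:17s} CV = {value:.2f}")
```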

1.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments.

An outlier is a point that differs significantly from other observations. It is important
to find the outliers in data as they may distort the statistical analysis and violate its
assumptions. They increase the variability in the data, which decreases statistical power.
The easiest way to find outliers is a box plot. Its whiskers are defined from the quartiles
and the interquartile range (IQR), typically extending to Q1 - 1.5 x IQR and
Q3 + 1.5 x IQR. Any data point lying beyond these lower and upper limits is considered an
outlier.
To check for outliers, a box plot was drawn in Python for all the item variables.

From the box plot we can conclude that all the variables have outliers.
Yes, outliers are present in Fresh, Milk, Grocery, Frozen, Detergents_Paper and
Delicatessen.
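
The box plot conclusion can be cross-checked numerically with the usual 1.5 x IQR rule, using only the quartiles, minimum and maximum reported in section 1.1. This is a sketch; the helper name `fences` is hypothetical.

```python
# (Q1, Q3, minimum, maximum) per item, as reported in section 1.1
summary = {
    "Fresh": (3127.75, 16933.75, 3, 112151),
    "Milk": (1533.0, 7190.25, 55, 73498),
    "Grocery": (2153.0, 10655.75, 3, 92780),
    "Frozen": (742.25, 3554.25, 25, 60869),
    "Detergents_Paper": (256.75, 3922.0, 3, 40827),
    "Delicatessen": (408.25, 1820.25, 3, 47943),
}

def fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a point is an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

has_outliers = {}
for item, (q1, q3, smallest, largest) in summary.items():
    lower, upper = fences(q1, q3)
    # If the extreme values fall outside the fences, outliers exist.
    has_outliers[item] = smallest < lower or largest > upper
    print(f"{item:17s} upper fence = {upper:10.2f}  max = {largest}")
```

For every item the maximum lies far above the upper fence, which agrees with the box plots. The lower fences are all negative, so there are no low-side outliers.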
1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective
Based on the analysis above, we can recommend the following to the business:
o "Other" is the region that spends the most compared to Lisbon and Oporto, and "Retail"
is the channel that spends the most compared to Hotel in this region. If the business
is to be extended, it should be in the "Other" region through the Retail channel,
focusing on items like Milk, Grocery, Detergents_Paper and Delicatessen rather than
on all of them. Extending the business along these lines can boost sales and increase
revenue more than focusing on Lisbon or Oporto.
o If the business is interested in growing revenue from the "Hotel" channel, it should
focus strongly on food products like Fresh and Frozen. Irrespective of region, Fresh
and Frozen sell very well in the Hotel channel.
o Fresh has the highest spending in both channels irrespective of region, followed by
Grocery and Milk. It is therefore recommended that these food products be available
across all channels and regions.
o Delicatessen is the product that has shown the most inconsistency irrespective of
region and channel. It is nevertheless a product that can be sold in all channels and
regions, so it should be made available at all times.

Problem 2: Clear Mountain State University (CMSU) Survey
Problem Statement:
The Student News Service at Clear Mountain State University (CMSU) has decided to
gather data about the undergraduate students that attend CMSU. CMSU creates and
distributes a survey of 14 questions and receives responses from 62 undergraduates
(stored in the Survey data set).
Solution:
After importing all the libraries into the Jupyter notebook, load the Survey file.
Basic EDA was then completed:

2.1 For this data, construct the following contingency tables (Keep Gender as
row variable)
2.1.1 Gender and Major
Sol: Below output from Jupyter notebook

2.1.2 Gender and Grad Intention


Sol: Below output from Jupyter notebook

2.1.3 Gender and Employment


Sol: Below output from Jupyter notebook

2.1.4 Gender and Computer


Sol: Below output from Jupyter notebook

2.2 Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:

From the contingency tables:

• Total number of students = 62
• Total number of males = 29
• Total number of females = 33

2.2.1 What is the probability that a randomly selected CMSU student will be
male?
P(randomly selected student is male) = Total number of males / Total number of students
From the calculation done in Python:
• Probability that a randomly selected CMSU student is male: 29/62 =
0.46774193548387094, i.e. 46.77%.
2.2.2 What is the probability that a randomly selected CMSU student will be
female?
P(randomly selected student is female) = Total number of females / Total number of
students
From the calculation done in Python:
• Probability that a randomly selected CMSU student is female: 33/62 =
0.532258064516129, i.e. 53.23%.

2.3 Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.3.1 Find the conditional probability of different majors among the male
students in CMSU.

Using contingency tables of Gender and Majors we got the total numbers of
males opting for different majors.
P (major | male) = P (major ∩ male)/ P (male)
Sol: From the calculation done in python we conclude that:
Probability of Accounting among male student is 13.79%
Probability of CIS among male student is 3.45%
Probability of Economics/Finance among male student is 13.79%
Probability of International Business among male student is 6.9%
Probability of Management among male student is 20.69%
Probability of Other among male student is 13.79%
Probability of Retailing/Marketing among male student is 17.24%
Probability of Undecided among male student is 10.34%.
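
The conditional probabilities can be reproduced from counts. The contingency table itself is not reproduced in this report, so the counts below are back-calculated from the percentages above (each percentage times 29 males) and should be treated as an assumption.

```python
from fractions import Fraction

# Male counts per major, inferred from the reported percentages (assumption).
male_counts = {
    "Accounting": 4,
    "CIS": 1,
    "Economics/Finance": 4,
    "International Business": 2,
    "Management": 6,
    "Other": 4,
    "Retailing/Marketing": 5,
    "Undecided": 3,
}
total_students = 62
total_male = sum(male_counts.values())

p_male = Fraction(total_male, total_students)

# P(major | male) = P(major and male) / P(male)
p_major_given_male = {
    major: Fraction(n, total_students) / p_male
    for major, n in male_counts.items()
}

for major, p in p_major_given_male.items():
    print(f"P({major} | male) = {float(p) * 100:.2f}%")
```

Dividing the joint probability by P(male) cancels the total of 62, so each result is simply the major's male count divided by 29.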

2.3.2 Find the conditional probability of different majors among the female
students of CMSU.
Using contingency tables of Gender and Majors we got the total numbers of
females opting for different majors
P (major | female) = P (major ∩ female)/ P (Female)
Sol: From the calculation done in python we conclude that:
Probability of Accounting among female student is 9.09%
Probability of CIS among female student is 9.09%
Probability of Economics/Finance among female student is 21.21%
Probability of International Business among female student is 12.12%
Probability of Management among female student is 12.12%
Probability of Other among female student is 9.09%
Probability of Retailing/Marketing among female student is 27.27%
Probability of Undecided among female student is 0%

2.4 Assume that the sample is a representative of the population of CMSU. Based
on the data, answer the following question:
2.4.1 Find the probability that a randomly chosen student is a male and
intends to graduate.

Probability that a randomly chosen student is male = 29/62
Probability that a male student intends to graduate = 17/29
P(randomly chosen student is male and intends to graduate) = P(male) x P(intends to
graduate | male)
Sol: From the calculation done in Python:
P(Intends to grad ∩ Male) = P(Intends to grad | Male) x P(Male) = (17/29) x (29/62) =
0.2742, so the probability that a randomly chosen student is male and intends to
graduate is 27.42%.
2.4.2 Find the probability that a randomly selected student is a female and
does NOT have a laptop.

Probability that a randomly chosen student is female = 33/62
Probability that a female student has no laptop = 1 - (29/33) = 4/33
P(randomly selected student is female and does NOT have a laptop) = P(no laptop |
female) x P(female)
Sol: From the calculation done in Python:
P(No laptop ∩ Female) = P(No laptop | Female) x P(Female) = (4/33) x (33/62) =
0.06452, so the probability that a randomly selected student is female and does NOT
have a laptop is 6.45%.
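
Both answers in 2.4 are applications of the multiplication rule P(A ∩ B) = P(A | B) x P(B). A small sketch with exact fractions, using the counts quoted above (variable names are illustrative):

```python
from fractions import Fraction

# 2.4.1: male and intends to graduate
p_male = Fraction(29, 62)
p_grad_given_male = Fraction(17, 29)
p_male_and_grad = p_grad_given_male * p_male

# 2.4.2: female and no laptop
p_female = Fraction(33, 62)
p_no_laptop_given_female = 1 - Fraction(29, 33)
p_female_and_no_laptop = p_no_laptop_given_female * p_female

print(p_male_and_grad, round(float(p_male_and_grad), 4))
print(p_female_and_no_laptop, round(float(p_female_and_no_laptop), 4))
```

The group sizes (29 and 33) cancel, so each joint probability reduces to the cell count divided by the 62 students.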

2.5 Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.5.1 Find the probability that a randomly chosen student is a male or has
full-time employment?

Probability of a student having full-time employment = 10/62
Probability of a student being male = 29/62
Probability of a student being male and having full-time employment = 7/62
P(randomly chosen student is male or has full-time employment) = P(full-time
employment) + P(male) - P(male and full-time employment)
Sol: From the calculation done in Python:
P(Full-time employment ∪ Male) = P(Full-time employment) + P(Male) - P(Full-time
employment ∩ Male) = 10/62 + 29/62 - 7/62 = 0.516129. The probability that a randomly
chosen student is male or has full-time employment is 51.61%.
2.5.2 Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management.

Probability of a student being female = 33/62
Probability of a female student majoring in International Business = 4/33
Probability of a female student majoring in Management = 4/33
P(International Business or Management | Female) = P(International Business | Female)
+ P(Management | Female), since a student has only one major and the two events are
mutually exclusive.
Sol: From the calculation done in Python:
P(International Business or Management | Female) = 4/33 + 4/33 = 0.242424. Given that
a randomly chosen student is female, the probability that she is majoring in
International Business or Management is 24.24%.
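
The two answers in 2.5 use the addition rule, with and without an overlap term. A short sketch using the fractions quoted above (variable names are illustrative):

```python
from fractions import Fraction

# 2.5.1: male OR full-time employment (events overlap, so subtract the overlap)
p_full_time = Fraction(10, 62)
p_male = Fraction(29, 62)
p_male_and_full_time = Fraction(7, 62)
p_male_or_full_time = p_full_time + p_male - p_male_and_full_time

# 2.5.2: International Business OR Management, given female
# (mutually exclusive majors, so there is no overlap term)
p_ib_given_female = Fraction(4, 33)
p_mgmt_given_female = Fraction(4, 33)
p_ib_or_mgmt_given_female = p_ib_given_female + p_mgmt_given_female

print(round(float(p_male_or_full_time), 6))
print(round(float(p_ib_or_mgmt_given_female), 6))
```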

2.6 Construct a contingency table of Gender and Intent to Graduate at 2 levels
(Yes/No). The Undecided students are not considered now and the table is a
2x2 table. Do you think the graduate intention and being female are
independent events?

For two events to be independent, the following condition must hold:
P(A ∩ B) = P(A) * P(B)
To check whether graduate intention and being female are independent events, we need
to check whether:
P(Female ∩ Grad intention Yes) = P(Female) * P(Grad intention Yes)
Equivalently, under independence P(Grad intention Yes | Female) would equal
P(Grad intention Yes).
P(Grad intention Yes) = 28/40 = 0.7
P(Grad intention Yes | Female) = 11/20 = 0.55
Sol: From the calculation done in Python:
P(Female ∩ Grad intention Yes) = 11/40 = 0.275, while P(Female) * P(Grad intention
Yes) = (20/40) * (28/40) = 0.35, so
P(Female ∩ Grad intention Yes) ≠ P(Female) * P(Grad intention Yes)
Hence, graduate intention and being female are not independent events.
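
The independence check can be written out directly from the 2x2 counts quoted above (40 students after dropping Undecided, 20 of them female, 28 with intention Yes, 11 female with intention Yes). Variable names are illustrative.

```python
from fractions import Fraction

total = 40            # students with a Yes/No graduate intention
female = 20
grad_yes = 28
female_and_yes = 11

p_female = Fraction(female, total)
p_yes = Fraction(grad_yes, total)
p_joint = Fraction(female_and_yes, total)

# Independence requires P(Female and Yes) == P(Female) * P(Yes)
product = p_female * p_yes
independent = p_joint == product

print(float(p_joint), float(product), independent)
```

This compares sample proportions only. It shows that the two events are not independent in this sample and is not a formal significance test.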

2.7 Note that there are four numerical (continuous) variables in the data set, GPA,
Salary, Spending, and Text Messages. Answer the following questions based
on the data:
2.7.1 If a student is chosen randomly, what is the probability that his/her
GPA is less than 3?
GPA is a continuous variable. Here the probability is approximated with a Poisson
distribution whose mean is the sample mean of GPA, calculated in the Python notebook.
Note that the Poisson distribution models discrete counts, so this is only a rough
approximation for a continuous variable such as GPA.
Method 1:
To calculate the probability that GPA is less than 3, we add the Poisson probabilities
of 0, 1 and 2.
Mean of GPA (m) = 3.13
P(GPA is less than 3) = P(GPA is 0) + P(GPA is 1) + P(GPA is 2)
Sol: From the calculation done in Python:
stats.poisson.pmf(0, m) + stats.poisson.pmf(1, m) + stats.poisson.pmf(2, m)
The probability that GPA is less than 3 is 0.394703 or 39.47%.
Method 2:
Instead of adding the probabilities of 0, 1 and 2, use the cdf (Cumulative
Distribution Function) of the Poisson distribution:
stats.poisson.cdf(2, 3.13)
The probability that GPA is less than 3 is 0.394703 or 39.47%.
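
The equivalence of the two methods can be seen by writing the Poisson formulas out by hand. This sketch uses only the standard library instead of scipy.stats, and the helper names `poisson_pmf` and `poisson_cdf` are hypothetical.

```python
import math

def poisson_pmf(k, mu):
    """P(X = k) for a Poisson variable with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)

def poisson_cdf(k, mu):
    """P(X <= k), the running sum of the pmf from 0 to k."""
    return sum(poisson_pmf(i, mu) for i in range(k + 1))

m = 3.13  # mean of GPA as reported above

p_sum = poisson_pmf(0, m) + poisson_pmf(1, m) + poisson_pmf(2, m)  # Method 1
p_cdf = poisson_cdf(2, m)                                          # Method 2

print(round(p_sum, 4), round(p_cdf, 4))
```

Both methods give about 0.3947 because the cdf at 2 is by definition the sum of the pmf at 0, 1 and 2.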

2.7.2 Find the conditional probability that a randomly selected male earns 50
or more. Find the conditional probability that a randomly selected
female earns 50 or more.
Method 1: Using the contingency table (calculation in Python)

• Probability that a randomly selected male earns 50 or more:
P(earns 50 or more | Male) = (Males earning 50 or more) / (Total males) = 14/29
Output: 0.48275862, i.e. 48.28%.
• Probability that a randomly selected female earns 50 or more:
P(earns 50 or more | Female) = (Females earning 50 or more) / (Total females) = 18/33
Output: 0.5454545454, i.e. 54.55%.
Method 2: Using crosstab and indexing
Normalizing the crosstab by index (row) gives, for each gender, the probability of each
Salary value. Adding the probabilities for all salaries of 50 or more gives the same
result as above.

From the Python calculation:

P(randomly selected male earns 50 or more) = sum of the row probabilities for salaries
of 50 or more = 0.483, i.e. 48.3% (the individual values below are shown rounded).
Sum(0.138, 0.034, 0.034, 0.103, 0.103, 0.034, 0.00, 0.00, 0.034)

P(randomly selected female earns 50 or more) = sum of the row probabilities for
salaries of 50 or more = 0.5455, i.e. 54.55% (the individual values below are shown
rounded).
Sum(0.1515, 0, 0, 0.151, 0.151, 0, 0.03, 0.03, 0.03)

2.8 Note that there are four numerical (continuous) variables in the data set, GPA
Salary, Spending, and Text Messages. For each of them comment whether
they follow a normal distribution. Write a note summarizing your conclusions.

Using distplot, we can judge from the plots whether the variables are normally
distributed. The 4 numerical (continuous) variables used are GPA, Salary, Spending and
Text Messages.

From the distplots of GPA, Salary, Spending and Text Messages, we can see that:
• GPA is approximately normally distributed, with slight left skewness.
• Salary is also approximately normally distributed, with slight right skewness.
• Spending is not normally distributed and is highly right skewed.
• Text Messages is not normally distributed and is highly right skewed.
The skewness of the variables, calculated in Python, is:
GPA -0.314600
Salary 0.534701
Spending 1.585915
Text Messages 1.295808
• GPA has very little skewness. It is left skewed, hence the negative value.
• Salary also has little skewness. It is right skewed, hence the positive value.
• Spending is highly right skewed.
• Text Messages is highly right skewed.
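
To show what the sign of the skewness means, here is a plain-Python sketch of the bias-adjusted sample skewness. To my understanding this is the estimator that pandas' skew() reports, but treat that as an assumption. The function name and the toy data are illustrative, not the survey data.

```python
import math

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: g1 * sqrt(n(n-1)) / (n-2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]      # long tail of large values
left_tailed = [-10, -3, -2, -2, -1, -1]  # mirror image of the above

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
print(sample_skewness(left_tailed))
```

A value near 0 indicates a roughly symmetric distribution, a positive value indicates a long right tail (as for Spending and Text Messages), and a negative value indicates a long left tail (as for GPA).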

Problem 3: Hypothesis Testing for Quality of Shingles
Problem Statement:
An important quality characteristic used by the manufacturers of ABC asphalt shingles is
the amount of moisture the shingles contain when they are packaged. Customers may feel
that they have purchased a product lacking in quality if they find moisture and wet shingles
inside the packaging. In some cases, excessive moisture can cause the granules attached
to the shingles for texture and coloring purposes to fall off the shingles resulting in
appearance problems. To monitor the amount of moisture present, the company conducts
moisture tests. A shingle is weighed and then dried. The shingle is then reweighed, and
based on the amount of moisture taken out of the product, the pounds of moisture per 100
square feet are calculated. The company would like to show that the mean moisture
content is less than 0.35 pounds per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet)
for ‘A’ shingles and 31 for ‘B’ shingles.
Solution:
After importing all the libraries into the Jupyter notebook, load the A & B shingles.csv
file. Basic EDA was then completed:

3.1 Do you think there is evidence that means moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly
showing all steps.

Shingles A:
For Shingles A, the null and alternative hypotheses, in pounds per 100 square feet, are:
H0: mean moisture content >= 0.35
HA: mean moisture content < 0.35
Level of Significance = 0.05
The company wants to show that the mean moisture content is less than 0.35, so that
claim is the alternative hypothesis and the test is left-tailed.

• The sample size is given for both variables, but the population standard deviation is
not known.
• Because the population standard deviation is unknown, we use the t distribution. We are
testing a single sample against a fixed value, so we run a one-sample t-test.

From the Python calculation:
t_stat, p_value = ttest_1samp(df.A, 0.35)
print('The T statistic is: {0}\nThe corresponding pvalue is: {1}'.format(t_stat, p_value/2))
ttest_1samp returns a two-sided p-value by default. Because we are running a one-sided
test and the t statistic is negative (in the direction of HA), the two-sided p-value is
divided by 2.
One-sample t-test, p-value = 0.07477
The T statistic is: -1.4735046253382782
The corresponding pvalue is: 0.07477633144907513
T test: p-value (0.075) > Level of significance (0.05)
We fail to reject the null hypothesis: there is not enough evidence to conclude that
the mean moisture content for Shingles A is less than 0.35 pounds per 100 sq feet.

Shingles B:
For Shingles B, the null and alternative hypotheses, in pounds per 100 square feet, are:
H0: mean moisture content >= 0.35
HA: mean moisture content < 0.35
Level of Significance = 0.05
• Again, the sample size is given for both variables, but the population standard
deviation is not known.
• Because the population standard deviation is unknown, we use the t distribution. We are
testing a single sample against a fixed value, so we run a one-sample t-test.
From the Python calculation:
t_stat, p_value = ttest_1samp(df.B, 0.35, nan_policy="omit")
print('The T statistic is: {0}\nThe corresponding pvalue is: {1}'.format(t_stat, p_value/2))
ttest_1samp returns a two-sided p-value by default. Because we are running a one-sided
test and the t statistic is negative (in the direction of HA), the two-sided p-value is
divided by 2.
One-sample t-test, p-value = 0.0020904774003191826
The T statistic is: -3.1003313069986995
The corresponding pvalue is: 0.0020904774003191826
T test: p-value (0.002) < Level of significance (0.05)
We reject the null hypothesis: there is evidence that the mean moisture content of
Shingles B is less than 0.35 pounds per 100 sq feet.
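
Halving the two-sided p-value is only valid for the tail that the t statistic points to. A minimal sketch with made-up readings (not the shingles data) compares the halved value with SciPy's `alternative` argument, which to my knowledge is available in `ttest_1samp` from SciPy 1.6 onward:

```python
import numpy as np
from scipy import stats

# Hypothetical moisture readings for illustration only.
sample = np.array([0.31, 0.28, 0.36, 0.33, 0.29, 0.34, 0.30, 0.27, 0.35, 0.32])
mu0 = 0.35

# Default call: two-sided p-value.
t_stat, p_two_sided = stats.ttest_1samp(sample, mu0)
p_halved = p_two_sided / 2

# The same one-sided tests requested directly.
_, p_less = stats.ttest_1samp(sample, mu0, alternative="less")
_, p_greater = stats.ttest_1samp(sample, mu0, alternative="greater")

print("t statistic:", t_stat)
print("halved two-sided p:", p_halved)
print("p for HA: mean < mu0:", p_less)
print("p for HA: mean > mu0:", p_greater)
```

With a negative t statistic, the halved p-value matches the left-tailed test (HA: mean < 0.35). The right-tailed p-value is one minus that, so the direction of the alternative has to be decided before the p-value is halved.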

3.2 Do you think that the population mean for shingles A and B are equal? Form
the hypothesis and conduct the test of the hypothesis. What assumption do
you need to check before the test for equality of means is performed?

To test whether the mean for Shingles A is the same as the mean for Shingles B, we
formulate the hypotheses:
H0: mean moisture content of Shingles A = mean moisture content of
Shingles B
HA: mean moisture content of Shingles A ≠ mean moisture content of
Shingles B
Mathematically, this can also be written as:
H0: μA - μB = 0, i.e. μA = μB
HA: μA - μB ≠ 0, i.e. μA ≠ μB
Level of Significance = 0.05
• We now have 2 samples, Shingles A and Shingles B, and the population means and
standard deviations are still not known.
• The samples are not large and the two samples are independent of each other, so we
use the t distribution and the test statistic for a two-sample unpaired test.
• We use scipy.stats.ttest_ind to calculate the t-test for the means of two
independent samples. This function returns the t statistic and the two-tailed p-value.
From the Python calculation:
t_stat, p_value = ttest_ind(df['A'], df['B'], nan_policy='omit')
print('The T statistic is', t_stat)
print('The corresponding pvalue is', p_value)
ttest_ind returns a two-sided p-value by default, which matches the two-sided
alternative here, so no halving is needed.
Two-sample t-test, p-value = 0.2017496571835306
The T statistic is 1.2896282719661123
The corresponding pvalue is 0.2017496571835306
T test: p-value (0.2017) > Level of significance (0.05)
We fail to reject the null hypothesis: there is not enough evidence to conclude that
the mean moisture content for Shingles A differs from the mean moisture content for
Shingles B.
Therefore, at the 5% level, the data are consistent with the population means for
Shingles A and B being equal.
Assumptions: when running a two-sample t-test, the basic assumptions are that the two
populations are normally distributed and that their variances are equal. If those
assumptions are not likely to be met, another testing procedure could be used.
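
One way the assumptions could be checked before the two-sample test is sketched below, using Shapiro-Wilk for normality and Levene's test for equal variances, with Welch's t-test as the fallback. The data are made up, and this choice of checks is an assumption of the sketch, not something the original notebook is known to have done.

```python
import numpy as np
from scipy import stats

# Hypothetical readings for illustration only (not the shingles data).
a = np.array([0.31, 0.28, 0.36, 0.33, 0.29, 0.34, 0.30, 0.27])
b = np.array([0.26, 0.30, 0.24, 0.29, 0.27, 0.31, 0.25])

# Normality of each sample (H0: the sample comes from a normal distribution).
_, p_norm_a = stats.shapiro(a)
_, p_norm_b = stats.shapiro(b)

# Equality of variances (H0: the two variances are equal).
_, p_levene = stats.levene(a, b)
equal_var = bool(p_levene > 0.05)

# Pooled t-test if variances look equal, Welch's t-test otherwise.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)

print("Shapiro p-values:", p_norm_a, p_norm_b)
print("Levene p-value:", p_levene, "-> equal_var =", equal_var)
print("t statistic:", t_stat, "p-value:", p_value)
```

Passing `equal_var=False` to `ttest_ind` runs Welch's t-test, which does not require equal variances and is one example of the "other testing procedure" mentioned above.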

THANK YOU!!
