
1/30/2022 Business Report Project – SMDM

Sonali Pradhan
1 – Wholesale Customer Data Analysis
1.1 Problem
1.2 Problem
1.3 Problem
1.4 Problem
1.5 Problem

2 – Clear Mountain State University (CMSU) Survey
2.1 Problem
2.2 Problem
2.3 Problem
2.4 Problem
2.5 Problem
2.6 Problem
2.7 Problem
2.8 Problem

3 – Hypothesis Testing for Quality of Shingles
3.1 Problem
3.2 Problem

Wholesale Customers Analysis

Summary – This business report provides a detailed explanation of the approach to each problem
given in the assignment and provides relevant information with regard to solving each problem.

Problem Statement:
A wholesale distributor operating in different regions of Portugal has information on the
annual spending on several items in its stores across different regions and channels. The
data consists of 440 large retailers’ annual spending on 6 different varieties of products in 3
different regions (Lisbon, Oporto, Other) and across 2 sales channels (Hotel, Retail). We
imported the ‘Wholesale Customer data’ dataset into Python to analyse the spend on each
store item across regions and channels and to answer each problem.

The data is given in the file “Wholesale+Customers+Data” as shown below.

Basic EDA
• The data has 440 instances and 9 attributes: 7 of integer type and 2 of object type (string
columns), as is evident from the result below.

Importing the libraries to start the EDA

• There are no null values in any column which is evident from the below result.
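The checks described in these bullets can be sketched as follows. This is a minimal illustration only: the three-row frame is a hypothetical stand-in for the real file, and only the column names are taken from the report.

```python
import pandas as pd

# Hypothetical stand-in for the real data set (values are made up).
df = pd.DataFrame({
    "Buyer/Spender": [1, 2, 3],
    "Channel": ["Retail", "Hotel", "Retail"],
    "Region": ["Other", "Lisbon", "Oporto"],
    "Fresh": [100, 200, 300],
    "Milk": [50, 60, 70],
})

n_rows, n_cols = df.shape                      # instances and attributes
dtype_counts = df.dtypes.astype(str).value_counts().to_dict()
null_counts = df.isnull().sum()                # missing values per column

print(n_rows, n_cols)
print(dtype_counts)
print(null_counts)
```

On the real data the same three calls would be expected to report 440 rows, 9 columns and no nulls, as stated above.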

The dataset has 9 variables: “Buyer/Spender”, “Channel”, “Region”, “Fresh”,
“Milk”, “Grocery”, “Frozen”, “Detergents_Paper” and “Delicatessen”.
Channel and Region are both categorical columns and all the others are of integer type.
The remaining feature, Buyer/Spender, is dropped because it is of no use for our analysis.

Performing Regional count

Performing Channel count
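A minimal sketch of dropping the unused column and counting records per Region and Channel, assuming the column names given in the report. The four rows are hypothetical and are not the report's data.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the report.
df = pd.DataFrame({
    "Buyer/Spender": [1, 2, 3, 4],
    "Channel": ["Retail", "Hotel", "Hotel", "Hotel"],
    "Region": ["Other", "Other", "Lisbon", "Oporto"],
})

# Drop the column that is not used in the analysis.
df = df.drop(columns=["Buyer/Spender"])

# Number of records per category.
region_counts = df["Region"].value_counts()
channel_counts = df["Channel"].value_counts()

print(region_counts)
print(channel_counts)
```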

We start exploring the data with univariate analysis (each feature individually) before
carrying out bivariate analysis, which compares pairs of features to find correlations between
them.

Univariate

From the graphs of the product distributions, it seems that we have some outliers in the
data, so we look deeper to identify them.

Outliers are detected but not necessarily removed; it depends on the situation. Here I will
assume that the wholesale distributor provided us with a dataset containing correct data, so I
will keep them as they are.

Bivariate

From the pair plot above, the correlation between the "detergents and paper products" and
the "grocery products" seems to be pretty strong, meaning that customers who spend more
on one of these product types tend to spend more on the other. Let's look at the Pearson
correlation coefficient to confirm this:
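As an illustration of how such a coefficient can be obtained, the sketch below calls pandas' `DataFrame.corr` on a few hypothetical spending values. The numbers are made up to show the mechanics and are not the report's results.

```python
import pandas as pd

# Hypothetical spending values, chosen so that two columns move together.
df = pd.DataFrame({
    "Grocery": [100, 200, 300, 400],
    "Detergents_Paper": [10, 22, 29, 41],
    "Fresh": [400, 100, 300, 200],
})

# Pairwise Pearson correlation matrix.
corr = df.corr(method="pearson")
r = corr.loc["Grocery", "Detergents_Paper"]

print(corr.round(3))
```

A value of `r` close to 1 indicates a strong positive linear relationship between the two columns.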

1.1 Use methods of descriptive statistics to summarize data. Which Region
and which Channel spent the most? Which Region and which Channel
spent the least?

The following table is derived using Descriptive Statistics to summarize the data

Measures of central tendency: mean, median, mode.
Measures of dispersion: range, IQR, standard deviation.
From the output of the two describe functions above, we can infer the following:

Channel has two unique values, with "Hotel" the most frequent: 298 of the 440 records,
i.e. 67.7 percent of the records belong to the "Hotel" channel.

Region has three unique values, with "Other" the most frequent: 316 of the 440 records,
i.e. 71.8 percent of the records belong to the "Other" region.

For each item, range = max - min and IQR = Q3 - Q1. The IQR is helpful in calculating the
outlier limits (1.5 IQR below Q1 and 1.5 IQR above Q3).

Fresh (440 records)

Mean 12000.3, standard deviation 12647.3, min value 3 and max value 112151.
Q1 (25%) is 3127.75, Q2 (50%) is 8504 and Q3 (75%) is 16933.8.
Range = 112151 - 3 = 112148 and IQR = 16933.8 - 3127.75 = 13806.05.

Milk (440 records)

Mean 5796.27, standard deviation 7380.38, min value 55 and max value 73498.
Q1 (25%) is 1533, Q2 (50%) is 3627 and Q3 (75%) is 7190.25.
Range = 73498 - 55 = 73443 and IQR = 7190.25 - 1533 = 5657.25.

Grocery (440 records)

Mean 7951.28, standard deviation 9503.16, min value 3 and max value 92780.
Q1 (25%) is 2153, Q2 (50%) is 4755.5 and Q3 (75%) is 10655.8.
Range = 92780 - 3 = 92777 and IQR = 10655.8 - 2153 = 8502.8.

Frozen (440 records)

Mean 3071.93, standard deviation 4854.67, min value 25 and max value 60869.
Q1 (25%) is 742.25, Q2 (50%) is 1526 and Q3 (75%) is 3554.25.
Range = 60869 - 25 = 60844 and IQR = 3554.25 - 742.25 = 2812.

Detergents_Paper (440 records)

Mean 2881.49, standard deviation 4767.85, min value 3 and max value 40827.
Q1 (25%) is 256.75, Q2 (50%) is 816.5 and Q3 (75%) is 3922.
Range = 40827 - 3 = 40824 and IQR = 3922 - 256.75 = 3665.25.

Delicatessen (440 records)

Mean 1524.87, standard deviation 2820.11, min value 3 and max value 47943.
Q1 (25%) is 408.25, Q2 (50%) is 965.5 and Q3 (75%) is 1820.25.
Range = 47943 - 3 = 47940 and IQR = 1820.25 - 408.25 = 1412.
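The 1.5 IQR limits can be worked out directly from the quartiles and maxima reported above. The sketch below uses only those reported figures; the helper name `iqr_fences` is hypothetical and introduced here for illustration.

```python
# (Q1, Q3, max) per item, as reported in the text above.
summary = {
    "Fresh": (3127.75, 16933.8, 112151),
    "Milk": (1533, 7190.25, 73498),
    "Grocery": (2153, 10655.8, 92780),
    "Frozen": (742.25, 3554.25, 60869),
    "Detergents_Paper": (256.75, 3922, 40827),
    "Delicatessen": (408.25, 1820.25, 47943),
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier limits Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# An item has at least one high outlier if its maximum exceeds the upper limit.
has_high_outlier = {}
for item, (q1, q3, max_value) in summary.items():
    lower, upper = iqr_fences(q1, q3)
    has_high_outlier[item] = max_value > upper
    print(f"{item}: lower={lower:.3f}, upper={upper:.3f}, max={max_value}")
```

Because every reported maximum lies above its upper limit, each of the six items has at least one high outlier under this rule.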

Observations:
• The highest spend by Region is from "Other" and the lowest is from "Oporto". The highest
spend by Channel is from "Hotel" and the lowest is from "Retail".
• By Channel (output from Python: Hotel 8070603, Retail 6645917), the Hotel channel has the
highest spend amount, 8070603$, and the Retail channel has the lowest, 6645917$.
• Similarly, we grouped the totals by Region. The Other region has the highest spend amount,
10741625$, and the Oporto region has the lowest, 1569987$.
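The grouping step behind these totals can be sketched with `groupby`. The rows below are hypothetical and only two item columns are included, so the printed totals are not the report's figures.

```python
import pandas as pd

# Hypothetical rows; the real data has six item columns and 440 records.
df = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail"],
    "Region": ["Other", "Lisbon", "Other", "Oporto"],
    "Fresh": [100, 200, 300, 10],
    "Milk": [50, 60, 70, 5],
})

item_cols = ["Fresh", "Milk"]
df["Total"] = df[item_cols].sum(axis=1)        # spend per customer

# Total spend per Channel and per Region, largest first.
by_channel = df.groupby("Channel")["Total"].sum().sort_values(ascending=False)
by_region = df.groupby("Region")["Total"].sum().sort_values(ascending=False)

print(by_channel)
print(by_region)
print("Highest region:", by_region.idxmax(), "Lowest region:", by_region.idxmin())
```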

1.2 There are 6 different varieties of items that are considered. Describe and
comment / explain all the varieties across Region and Channel. Provide
a detailed justification for your answer.

On the basis of the above, it can be concluded that the 6 varieties of items do not all
show similar behaviour across Region and Channel.

Upon plotting the coefficient of variation across all the regions, it is evident that the food
products do not all behave similarly across the regions.

In the Lisbon region, Detergents_Paper has the maximum coefficient of variation, so it is the
most inconsistent item in Lisbon, followed by Grocery. In Oporto, Frozen shows the most
inconsistent behaviour, followed by Detergents_Paper. In the Other region, Delicatessen shows
the most inconsistent behaviour, followed by Detergents_Paper.

In the Lisbon region, Delicatessen has the least coefficient of variation, so it is the most
consistent item in Lisbon, whereas in Oporto, Fresh and Delicatessen are the most consistent.
In the Other region, only Fresh is the most consistent.

Upon plotting the coefficient of variation across the two channels, it is evident that the food
products do not all behave similarly across the channels.

In the Hotel channel, Delicatessen has the maximum coefficient of variation, so it is the most
inconsistent item in Hotel, followed by Frozen. In the Retail channel, Detergents_Paper is the
most inconsistent, followed by Milk.

In the Hotel channel, Detergents_Paper has the least coefficient of variation, so it is the most
consistent item in Hotel, whereas in Retail, Frozen is the most consistent.

1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behaviour? Which items show the least inconsistent
behaviour?

By standard deviation alone, Fresh has the highest value and Delicatessen the smallest.
However, standard deviation depends on the scale of each item's spending, so a relative
measure is needed to compare consistency across items.

The above table represents the descriptive statistics for all six food items:
Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen.

Here the consistency of any food item can be calculated using the Coefficient of
Variation (CV). The higher the CV, the greater the level of inconsistency, and
vice versa.

Coefficient of Variation (CV) = Standard deviation / Mean
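As a quick check of this ratio, the sketch below applies it to the means and standard deviations reported in section 1.1. Only those reported figures are used; the variable names are introduced here for illustration.

```python
# (mean, standard deviation) per item, as reported in section 1.1.
stats_by_item = {
    "Fresh": (12000.3, 12647.3),
    "Milk": (5796.27, 7380.38),
    "Grocery": (7951.28, 9503.16),
    "Frozen": (3071.93, 4854.67),
    "Detergents_Paper": (2881.49, 4767.85),
    "Delicatessen": (1524.87, 2820.11),
}

# CV = standard deviation / mean, a unit-free measure of relative spread.
cv = {item: std / mean for item, (mean, std) in stats_by_item.items()}

most_inconsistent = max(cv, key=cv.get)
least_inconsistent = min(cv, key=cv.get)

for item, value in sorted(cv.items(), key=lambda kv: kv[1]):
    print(f"{item}: CV = {value:.2f}")
print("Most inconsistent:", most_inconsistent)
print("Least inconsistent:", least_inconsistent)
```

With these figures the ratio ranges from about 1.05 for Fresh to about 1.85 for Delicatessen.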

The above graphs represent the Coefficient of Variation of all the food items. From the plots
it is evident that the Coefficient of Variation is highest for Delicatessen and lowest for
Fresh.

Hence it can be concluded that the item that shows the most inconsistent behaviour is
Delicatessen and the item that shows the least inconsistent behaviour is Fresh.

1.4 Are there any outliers in the data?
To determine the presence of outliers in the data, a convenient method is to create a box plot
of each variable, as shown.

Yes, there are outliers in all the items across the product range (Fresh, Milk, Grocery, Frozen,
Detergents_Paper and Delicatessen). Outliers are detected but not necessarily removed; it
depends on the situation. Here I will assume that the wholesale distributor provided us with a
dataset containing correct data, so I will keep them as they are.

From the above it can be concluded that yes, there are outliers in the data.

Outliers are present in the variables Fresh, Milk, Grocery, Frozen, Detergents_Paper and
Delicatessen.

1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective.

On the basis of the analysis the following recommendations can be made:

• On the basis of the analysis, the Other region has the highest total spending. By Channel,
Hotel has the higher total spend (8070603$ against 6645917$ for Retail), but Retail has
fewer customers (142 against 298), so its spend per customer is higher (about 46800$
against about 27100$). Hence, from the business perspective, if a new business is to be
set up, it should be opened in the Other region with the Retail channel, as the Other
region absorbs the maximum amount of sales; this can boost revenue compared to
opening a new business in Lisbon or Oporto with the Hotel channel.
• In all the regions, the food item Fresh has the highest spending, followed by Grocery
and Milk. Hence these food products are strongly recommended to be available.
• The food item Delicatessen is the most consistent item in Lisbon and among the most
consistent in Oporto, so it is recommended to be available at all times in those regions.
Overall it has the highest coefficient of variation, so this does not extend to every
region and channel.

Clear Mountain State University (CMSU) Survey

Problem 2 –

The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates.

Summary –

This business report provides a detailed explanation of the approach to each problem given
in the assignment and provides relevant information with regard to solving each problem.

CMSU Survey Data Analysis
We imported the ‘CMSU Survey-1’ dataset into Python to analyse the data about the
undergraduate students who attend CMSU. Below is the detailed approach and answer.

Basic EDA

No missing values are found.

2.1 Problem. For this data, construct the following contingency tables (Keep
Gender as row variable).

Problem 2.1.1. Gender and Major

Solution:

Below is the output from Python,

Problem 2.1.2. Gender and Grad Intention

Below is the output from Python,

Problem 2.1.3. Gender and Employment


Below is the output from Python,

Problem 2.1.4. Gender and Computer
Below is the output from Python,
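Contingency tables of this kind can be built with `pandas.crosstab`, keeping Gender as the row variable. The five survey rows below are hypothetical and serve only to show the call.

```python
import pandas as pd

# Hypothetical survey rows; the real survey has 62 responses.
survey = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male"],
    "Major": ["CIS", "Management", "CIS", "Management", "Management"],
})

# Gender as rows, Major as columns; margins=True adds the "All" totals.
table = pd.crosstab(survey["Gender"], survey["Major"], margins=True)

print(table)
```

The same call with Grad Intention, Employment or Computer in place of Major gives the other three tables.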

2.2. Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:

2.2.1. What is the probability that a randomly selected CMSU student will be male?

Solution: For this we need to find the number of male students out of all the students in the
given data. After calculation, the probability that a randomly selected CMSU student is male
is 46.8%.

2.2.2. What is the probability that a randomly selected CMSU student will be
female?

Solution: For this we need to find the number of female students out of all the students in the
given data. After calculation, the probability that a randomly selected CMSU student is female
is 53.2%.
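The two probabilities are simple relative frequencies. In the sketch below the counts of 29 males and 33 females are an assumption: they are inferred from the reported percentages and the 62 responses, not read from the data file.

```python
# Assumed counts, inferred from the reported 46.8% / 53.2% of 62 responses.
n_male = 29
n_female = 33
n_total = n_male + n_female

# Probability of each gender as a relative frequency.
p_male = n_male / n_total
p_female = n_female / n_total

print(f"P(male)   = {p_male:.1%}")
print(f"P(female) = {p_female:.1%}")
```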

2.3. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.3.1. Find the conditional probability of different majors among the male
students in CMSU.

Solution:

Using the contingency table of Gender and Major, we got the total number of males and the
number of males opting for each major. Below is the output from Python –
Probability of Males opting for Accounting. is 13.79%
Probability of Males opting for CIS. is 3.45%
Probability of Males opting for Economics/Finance. is 13.79%
Probability of Males opting for InternationalBusiness. is 6.90%
Probability of Males opting for Management. is 20.69%
Probability of Males opting for Other. is 13.79%
Probability of Males opting for Retailing/Marketing. is 17.24%
Probability of Males opting for Undecided. is 10.34%

From this output we can say that Management is the major most preferred by male students
and CIS is the least preferred one.

2.3.2 Find the conditional probability of different majors among the female
students in CMSU.

Solution: Using the contingency table of Gender and Major, we got the total number of
females and the number of females opting for each major. Below is the output from Python –

Probability of Females opting for Accounting. is 9.09%

Probability of Females opting for CIS. is 9.09%

Probability of Females opting for Economics/Finance. is 21.21%

Probability of Females opting for InternationalBusiness. is 12.12%

Probability of Females opting for Management. is 12.12%

Probability of Females opting for Other. is 9.09%

Probability of Females opting for Retailing/Marketing. is 27.27%

Probability of Females opting for Undecided. is 0.00%

From this output we can say that Retailing/Marketing is the major most preferred by
female students.
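The conditional probabilities for both genders come from dividing each row of the Gender by Major table by its row total. The counts below are an assumption: they are back-calculated from the reported percentages (29 males, 33 females) rather than read from the data file.

```python
import pandas as pd

majors = ["Accounting", "CIS", "Economics/Finance", "InternationalBusiness",
          "Management", "Other", "Retailing/Marketing", "Undecided"]

# Assumed counts, inferred from the percentages reported above.
counts = pd.DataFrame(
    [[3, 3, 7, 4, 4, 3, 9, 0],
     [4, 1, 4, 2, 6, 4, 5, 3]],
    index=["Female", "Male"],
    columns=majors,
)

# P(Major | Gender): divide every row by its own total.
cond = counts.div(counts.sum(axis=1), axis=0)
pct = (cond * 100).round(2)

print(pct.T)
```

`pd.crosstab(..., normalize="index")` would give the same row-wise proportions directly from the raw survey rows.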

2.4 Problem. Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:

2.4.1 Find the probability that a randomly chosen student is a male and
intends to graduate.

Solution: Using the contingency table of Gender and Grad Intention, we got the number of
males who intend to graduate and the total number of students. After calculation, the
probability that a randomly chosen student is a male and intends to graduate is 27.42%.

2.4.2 Find the probability that a randomly selected student is a female and does
NOT have a laptop.
Solution: Using the contingency table of Gender and Computer, we got the number of
females who do not have a laptop and the total number of students.

After calculation, the probability that a randomly selected student is a female
and does NOT have a laptop is 6.45%.

2.5 Problem. Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:

2.5.1 Find the probability that a randomly chosen student is either a male or has
full-time employment.
Solution: Using the contingency table of Gender and Employment, we got the number of
males, the number of students in full-time employment and the number of males in
full-time employment.

After calculation, the probability that a randomly chosen student is either male
or has full-time employment is 51.61%.
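An "either/or" probability counts students in both groups only once: P(A or B) = P(A) + P(B) - P(A and B). The sketch below checks this identity on five hypothetical rows; the category label "Full-Time" is an assumption about how the column is coded.

```python
import pandas as pd

# Hypothetical rows; the "Full-Time" label is assumed, not taken from the file.
survey = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Employment": ["Full-Time", "Part-Time", "Full-Time", "Part-Time", "Unemployed"],
})

is_male = survey["Gender"] == "Male"
is_full_time = survey["Employment"] == "Full-Time"

# Direct count of rows that satisfy either condition.
p_union_direct = (is_male | is_full_time).mean()

# Same value from the addition rule.
p_union_formula = (is_male.mean() + is_full_time.mean()
                   - (is_male & is_full_time).mean())

print(p_union_direct, p_union_formula)
```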

2.5.2. Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management.

Solution:
Using the contingency table of Gender and Major, we got the total number of females and the
number of females majoring in international business or management. After calculation,
the probability that a randomly chosen female student is majoring in international business
or management is 24.24%.

2.6 Problem. Construct a contingency table of Gender and Intent to Graduate at
2 levels (Yes/No). The Undecided students are not considered now and the table
is a 2x2 table. Do you think the graduate intention and being female are
independent events?

Solution:

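Two events are independent when the conditional probability of one given the other equals its
unconditional probability. The probabilities computed from the 2x2 table are not equal, which
suggests that graduate intention and being female are not independent events.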
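The comparison behind this conclusion can be sketched as a small function on a 2x2 table. The function name and the counts in the two calls are hypothetical; they are not the survey's counts.

```python
def female_marginal_and_conditional(female_yes, female_no, male_yes, male_no):
    """Return P(Female) and P(Female | Yes) from the four cells of a 2x2 table."""
    total = female_yes + female_no + male_yes + male_no
    p_female = (female_yes + female_no) / total
    p_female_given_yes = female_yes / (female_yes + male_yes)
    return p_female, p_female_given_yes


# Hypothetical counts where the two probabilities differ (not independent).
p_f, p_f_given_yes = female_marginal_and_conditional(10, 10, 15, 5)
print(p_f, p_f_given_yes)

# Hypothetical counts where they coincide (consistent with independence).
q_f, q_f_given_yes = female_marginal_and_conditional(10, 10, 10, 10)
print(q_f, q_f_given_yes)
```

With a sample of this size, a small difference between the two probabilities can also arise by chance, so a formal test such as a chi-square test of independence would be a natural follow-up.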

2.7 Problem. Note that there are four numerical (continuous) variables in the
data set, GPA, Salary, Spending, and Text Messages.

Answer the following questions based on the data.

2.7.1 If a student is chosen randomly, what is the probability that his/her GPA is
less than 3?

Solution: Using the contingency table of Gender and GPA, we got the total number of students
and the number of students with a GPA of less than 3. After calculation, the probability
that a randomly chosen student has a GPA of less than 3 is 27.41%.

2.7.2 Find the conditional probability that a randomly selected male earns 50 or
more. Find the conditional probability that a randomly selected female earns 50
or more.

Solution:
Using the contingency table of Gender and Salary, we got the total numbers of males and
females and the numbers of males and females earning 50 or more. After calculation, the
probability that a randomly selected male earns 50 or more is 48.27%, and the probability
that a randomly selected female earns 50 or more is 54.54%.
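Threshold probabilities like these can also be computed without building a table, by averaging a boolean condition. The sketch below uses five hypothetical rows; the column names follow the report.

```python
import pandas as pd

# Hypothetical rows, not the survey data.
survey = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "GPA": [2.8, 3.4, 3.1, 2.9, 3.6],
    "Salary": [55, 40, 50, 45, 60],
})

# P(GPA < 3): share of rows where the condition holds.
p_gpa_below_3 = (survey["GPA"] < 3).mean()

# P(Salary >= 50 | Gender): same idea, averaged within each gender.
p_salary_50_by_gender = (survey["Salary"] >= 50).groupby(survey["Gender"]).mean()

print(p_gpa_below_3)
print(p_salary_50_by_gender)
```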

2.8 Problem. Note that there are four numerical (continuous) variables in the data
set, GPA, Salary, Spending, and Text Messages. For each of them comment
whether they follow a normal distribution. Write a note summarizing your
conclusions.

Solution: We used distplot to examine the distribution of the four numerical (continuous)
variables in the data set (GPA, Salary, Spending and Text Messages) and to judge whether each
follows a normal distribution.

From the above histograms for the continuous variables GPA, Salary, Spending and Text
Messages, the following observations can be made:

• GPA is almost Normally Distributed with a slight skewness toward the left.
• Salary is also Normally distributed with a slight skewness toward the right.
• Spending is not Normally distributed and highly right skewed.
• Text message is not Normally distributed and highly right skewed.

The following table consists of the skewness value of the variables
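Skewness values of this kind can be obtained with `DataFrame.skew`. The sketch below uses hypothetical values chosen so that one column has a long left tail and the other a long right tail; they are not the survey's values.

```python
import pandas as pd

# Hypothetical values: GPA with a low straggler, Text Messages with a high one.
survey = pd.DataFrame({
    "GPA": [2.6, 3.0, 3.1, 3.2, 3.3],
    "Text Messages": [10, 20, 30, 40, 900],
})

# Sample skewness per column: negative = left-skewed, positive = right-skewed.
skewness = survey.skew()

print(skewness)
```

Values near 0 are consistent with a roughly symmetric distribution, while large positive values match the "highly right skewed" description above.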

Hypothesis Testing for Quality of Shingles

Problem 3 –
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that
they have purchased a product lacking in quality if they find moisture and wet shingles inside
the packaging. In some cases, excessive moisture can cause the granules attached to the
shingles for texture and coloring purposes to fall off the shingles resulting in appearance
problems. To monitor the amount of moisture present, the company conducts moisture tests.

A shingle is weighed and then dried. The shingle is then reweighed, and based on the amount
of moisture taken out of the product; the pounds of moisture per 100 square feet are
calculated. The company would like to show that the mean moisture content is less than 0.35
pound per 100 square feet. The file (A & B shingles.csv) includes 36 measurements (in pounds
per 100 square feet) for A shingles and 31 for B shingles.

Summary – This business report provides a detailed explanation of the approach to each problem
given in the assignment and provides relevant information with regard to solving each problem.

3 – Asphalt Shingles Data Analysis
We imported the ‘A & B shingles’ dataset into Python to analyze the data about the asphalt
shingles. Below is the detailed approach and answer.

The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A
shingles and 31 for B shingles.

Basic EDA

3.1 Problem. Do you think there is evidence that mean moisture contents in
both types of shingles are within the permissible limits? State your
conclusions clearly showing all steps.

Solution:

For sample A:

H0: mean moisture content >= 0.35

Ha: mean moisture content < 0.35

Level of significance (α) = 0.05

We have a sample and do not know the population standard deviation.

The sample is not a large sample, so we can use the t distribution and the tSTAT test statistic.

Since we are testing only sample A against a fixed value, we use a one-sample t-test. By default,
ttest_1samp in Python returns the two-sided p-value, so it is divided by 2 because ours is a
one-sided (lower-tail) test. Halving is valid here because the t statistic is negative.

Hence, from the calculation:

Input (Python, Jupyter):

t_statistic, p_value = ttest_1samp(df.A, 0.35)
print('One sample t test \nt statistic: {0} p value: {1} '.format(t_statistic, p_value/2))

Output (Python, Jupyter):

One sample t test
t statistic: -1.4735046253382782 p value: 0.07477633144907513

Since p-value > 0.05, do not reject H0. There is not enough evidence to conclude that the
mean moisture content for Sample A shingles is less than 0.35 pounds per 100 square feet.

p-value = 0.0748. If the population mean moisture content is in fact no less than 0.35 pounds
per 100 square feet, the probability of observing a sample of 36 shingles that will result in a
sample mean moisture content of 0.3167 pounds per 100 square feet or less is 0.0748.

We have no evidence to reject the null hypothesis since p value>Level of significance
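The halving step can be checked against SciPy's own one-sided option. The sketch below runs `scipy.stats.ttest_1samp` both ways on hypothetical moisture readings (not the report's data); the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Hypothetical moisture readings, NOT the values in the shingles file.
sample = np.array([0.30, 0.36, 0.28, 0.33, 0.31, 0.37, 0.29, 0.34])
mu0 = 0.35  # permissible limit under test

# Default call: two-sided p-value.
t_stat, p_two_sided = stats.ttest_1samp(sample, mu0)

# Lower-tail test (Ha: mean < mu0) computed directly by SciPy.
_, p_less = stats.ttest_1samp(sample, mu0, alternative="less")

# Manual conversion: halve the two-sided p-value only when t is negative.
p_manual = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2

print(t_stat, p_less, p_manual)
```

If the t statistic were positive, dividing by 2 would understate the lower-tail p-value, which is why the sign check matters.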

For sample B:

H0: mean moisture content >= 0.35

Ha: mean moisture content < 0.35

Level of significance (α) = 0.05

We have a sample and do not know the population standard deviation.

The sample is not a large sample, so we can use the t distribution and the tSTAT test statistic.

Since we are testing only sample B against a fixed value, we use a one-sample t-test. By default,
ttest_1samp in Python returns the two-sided p-value, so it is divided by 2 because ours is a
one-sided (lower-tail) test. Halving is valid here because the t statistic is negative.

Hence, from the calculation:

Input (Python, Jupyter):

t_statistic, p_value = ttest_1samp(df.B, 0.35, nan_policy='omit')
print('One sample t test \nt statistic: {0} p value: {1} '.format(t_statistic, p_value/2))

Output (Python, Jupyter):

One sample t test
t statistic: -3.1003313069986995 p value: 0.0020904774003191826

Since p-value < 0.05, reject H0. There is enough evidence to conclude that the mean moisture
content for Sample B shingles is less than 0.35 pounds per 100 square feet.

p-value = 0.0021. If the population mean moisture content is in fact no less than 0.35 pounds
per 100 square feet, the probability of observing a sample of 31 shingles that will result in a
sample mean moisture content of 0.2735 pounds per 100 square feet or less is 0.0021.

We have evidence to reject the null hypothesis since p value < level of significance.

3.2 Problem Do you think that the population means for shingles A and B are
equal?

Form the hypothesis and conduct the test of the hypothesis. What assumption
do you need to check before the test for equality of means is performed?

Theoretical Assumptions for the Hypothesis Testing:

To test the equality of the population means of the A shingles and the B shingles, the null
and alternative hypotheses for whether the population mean moisture contents are equal
are:

H0: mean moisture content of A = mean moisture content of B

HA: mean moisture content of A != mean moisture content of B

Level of significance: 0.05

We have samples and do not know the population standard deviations.

The samples are not large, so we can use the t distribution and the tSTAT test
statistic.

Since we are testing for equality between samples A and B, we use a two-sample t-test.

Solution:

H0: μ(A) = μ(B)
Ha: μ(A) != μ(B)
α = 0.05

Input (Python, Jupyter):

t_statistic, p_value = ttest_ind(df['A'], df['B'], equal_var=True, nan_policy='omit')
print("t_statistic={} and pvalue={}".format(round(t_statistic, 3), round(p_value, 3)))

Output (Python, Jupyter):

t_statistic=1.29 and pvalue=0.202

As the p-value > α, we do not reject H0: there is not enough evidence to conclude that the
population means for shingles A and B differ.

Test Assumptions

When running a two-sample t-test, the basic assumptions are that the distributions of the two
populations are normal and that the variances of the two distributions are the same. If those
assumptions are not likely to be met, another testing procedure could be used.
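One possible way to check these assumptions before the test is sketched below. It uses the Shapiro-Wilk test for normality and Levene's test for equal variances, which are a choice made here for illustration and are not stated in the report; the readings are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical moisture readings, NOT the values in the shingles file.
a = np.array([0.30, 0.36, 0.28, 0.33, 0.31, 0.37, 0.29, 0.34])
b = np.array([0.25, 0.31, 0.27, 0.22, 0.30, 0.26, 0.29, 0.24])

# Normality of each sample (H0: the sample comes from a normal distribution).
_, p_normal_a = stats.shapiro(a)
_, p_normal_b = stats.shapiro(b)

# Equality of variances (H0: the two variances are equal).
_, p_levene = stats.levene(a, b)
equal_var = bool(p_levene > 0.05)

# Pooled t-test if variances look equal, Welch's t-test otherwise.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)

print(p_normal_a, p_normal_b, p_levene)
print(t_stat, p_value)
```

Passing `equal_var=False` switches `ttest_ind` to Welch's t-test, which is one common alternative when the equal-variance assumption is doubtful.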
