Assignment Project 1
List of Figures
Figure 1 - Sample Data
Figure 2 - Exploratory Details
Figure 3 - Check for missing Values
Figure 4 - Correlation Map for the Data Set
Figure 5 - Descriptive Information
Figure 6 – Spends across the Region
Figure 7 – Spends across the Channel
Figure 8 – Spends across the Region
Figure 9 – Plots against min and max spends based on the products
Figure 10 – Plots against IQR and CV across different products
Figure 11 – Box Plot for various products
Figure 12 – Q-Q and hist plots for GPA and Salary
Figure 13 – Q-Q and hist plots for Spending and Text Messages
Figure 14 – Box and KDE plots for GPA and Salary
Figure 15 – Box and KDE plots for Spending and Text Messages
Introduction:
The purpose of this exercise is to perform exploratory data analysis on the provided data sets,
using measures of central tendency and other descriptive parameters. The assignment should help
readers from different backgrounds explore summary statistics, contingency tables, conditional
probabilities & hypothesis testing.
1. Problem Statement:
A wholesale distributor operating in different regions of Portugal has information on annual
spending of several items in their stores across different regions and channels. The data consists of
440 large retailers’ annual spending on 6 different varieties of products in 3 different regions
(Lisbon, Oporto, Other) and across different sales channels (Hotel, Retail).
Data Description:
Region: The location in which the distributor operates; Other, Lisbon, Oporto
Channel: The sales channel used by the distributor; Hotel, Retail
Sample of the Dataset
From Figure 3 we can conclude that there are no missing values in the provided data set.
From the correlation plot (Figure 4), we can see that some of the product categories are highly
correlated with each other. Correlation values near 1 indicate a strong positive correlation and
values near -1 a strong negative correlation. Values near 0 indicate little or no linear correlation.
1.1.1 Use methods of descriptive statistics to summarize data.
Ans: Descriptive statistics help describe and understand the features of a specific data set by giving
short summaries about the sample and measures of the data. The most recognized types of descriptive
statistics are measures of centre: the mean, median, and mode, which are used at almost all levels of
math and statistics.
From the descriptive statistics, we can see that 6 types of products are sold by the
wholesaler. NaN shows that a statistic cannot be calculated for a particular variable. For example,
the number of unique values is not reported for a numerical variable.
Channel has two unique values, with "Hotel" the most frequent: 298 of the 440 retailers
(67.7 percent) belong to the "Hotel" channel.
Region has three unique values, with "Other" the most frequent: 316 of the 440 retailers
(71.8 percent) belong to the "Other" region.
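The two percentages above follow directly from the counts. A minimal check, using only the figures quoted in the text (298 and 316 out of 440); the variable names are illustrative:

```python
total_retailers = 440
hotel_count = 298    # most frequent Channel value, from the text
other_count = 316    # most frequent Region value, from the text

hotel_share = 100 * hotel_count / total_retailers
other_share = 100 * other_count / total_retailers

print(f"Hotel: {hotel_share:.1f}%  Other: {other_share:.1f}%")
```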
1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a
detailed justification for your answer.
Ans:
Figure 9 – Plots against min and max spends based on the products
Figure 10 – Plots against IQR and CV across different products
Based on Figure 7 through Figure 9, we can infer that categories like Milk, Grocery & Detergents_Paper
have higher spend in the Retail channel versus Hotel, across all regions. On the other hand, Fresh and
Frozen have higher consumption in the Hotel channel versus Retail, across all regions.
Also, from the box plot we can summarize that spend on Fresh and Grocery is the highest across
region and channel, while spend on Delicatessen is the lowest across region and channel.
In Channel
"Hotel" has the highest average spending on Fresh items and the lowest on Detergents_Paper.
"Retail" has the highest average spending on Grocery items and the lowest on Frozen items. Refer to Figure 7.
In Region
All three regions (Lisbon, Oporto and Other) have the highest average spending on Fresh and the lowest on
Delicatessen items.
1.3 On the basis of the descriptive measure of variability, which item shows
the most inconsistent behavior? Which items shows the least inconsistent
behavior?
Ans: The coefficient of variation (CV), the standard deviation divided by the mean, was calculated and
plotted for all the products; refer to Figure 10. We can infer that product ‘Fresh’ has the lowest CV, 1.05,
and product ‘Delicatessen’ has the highest CV, 1.85.
The product with the lowest CV shows the least inconsistent behavior and the one with the highest CV the
most inconsistent behavior. Therefore, among the data provided, Fresh is the least inconsistent and
Delicatessen is the most inconsistent.
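The CV comparison can be sketched as follows. The two spend series are hypothetical and only illustrate the ranking logic. Using the sample standard deviation is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV = sample standard deviation / mean (sample variant is an assumption)."""
    return stdev(values) / mean(values)

# Hypothetical spends; both series have the same mean but different spread.
spends = {
    "steady": [100, 110, 90, 105, 95],
    "erratic": [10, 300, 40, 5, 145],
}

cv = {name: coefficient_of_variation(values) for name, values in spends.items()}
most_inconsistent = max(cv, key=cv.get)    # highest CV
least_inconsistent = min(cv, key=cv.get)   # lowest CV
print(cv, most_inconsistent, least_inconsistent)
```

Because CV is scaled by the mean, it allows products with very different spend levels to be compared on the same footing.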
1.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments.
Ans: Yes, there are outliers. To identify the outliers across the products, a box plot is drawn. Refer to
Figure 11.
We can conclude that the spends for every product have outliers, with Fresh showing the most and Delicatessen the least.
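A box plot conventionally flags points beyond 1.5 times the interquartile range from the quartiles. The sketch below applies that rule to a hypothetical spend series. The quartile method (`"inclusive"`) and the helper name `iqr_outliers` are assumptions, not details taken from the original analysis.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical annual spends for one product (illustrative only).
spends = [120, 135, 150, 160, 170, 180, 200, 210, 950]
outliers = iqr_outliers(spends)
print(outliers)
```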
1. We can see high inconsistency across some products. The wholesaler therefore has to focus
more on those products to bring them into a less inconsistent range.
2. There are many outliers for all the products, which indicates that spends are uneven; the
reasons for the outliers need to be identified.
3. When spends are compared by region, Other is doing well compared to Lisbon
and Oporto. The wholesaler needs to find the grey areas that cause lower spends in these
regions and try marketing campaigns or promotions to increase the spends.
4. Spending in the Hotel and Retail channels differs, although it should be more or less equal.
The wholesaler has to work on prices/discounting to increase spending in the
Hotel channel.
5. The wholesaler also needs to focus on products other than Fresh and Grocery, as mean
spending on those two is comparatively higher than on the other products.
2. Problem Statement:
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about
the undergraduate students that attend CMSU. CMSU creates and distributes a survey of 14 questions
and receives responses from 62 undergraduates (stored in the Survey data set).
2.1. For this data, construct the following contingency tables (Keep
Gender as row variable)
2.1.1. Gender and Major
Ans:
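A contingency table with Gender as the row variable can be built by counting (gender, major) pairs. The records below are hypothetical placeholders, not the Survey data, and the sketch uses only the standard library.

```python
from collections import Counter

# Hypothetical (gender, major) records; the real Survey data is not reproduced here.
records = [
    ("Female", "Accounting"), ("Female", "Accounting"), ("Female", "Finance"),
    ("Male", "Finance"), ("Male", "Finance"), ("Male", "Management"),
    ("Female", "Management"),
]

counts = Counter(records)
genders = sorted({gender for gender, _ in records})   # row labels
majors = sorted({major for _, major in records})      # column labels

# Nested dict: table[gender][major] -> count (0 when the pair never occurs).
table = {g: {m: counts[(g, m)] for m in majors} for g in genders}

print("Gender".ljust(8) + "".join(m.ljust(12) for m in majors))
for g in genders:
    print(g.ljust(8) + "".join(str(table[g][m]).ljust(12) for m in majors))
```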
2.2.1. What is the probability that a randomly selected CMSU student will be
male?
Ans: The probability that a randomly selected CMSU student will be male is (29/62) = 46.77%
2.2.2. What is the probability that a randomly selected CMSU student will be
female?
Ans: The probability that a randomly selected CMSU student will be female is (33/62) = 53.23%
2.3. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.3.1. Find the conditional probability of different majors among the male
students in CMSU.
Ans:
2.3.2 Find the conditional probability of different majors among the female
students of CMSU.
Ans:
2.5.2. Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management.
Ans: The conditional probability that a randomly chosen female student is majoring in international
business or management is 24.24%
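A conditional probability of this kind is the joint count divided by the count of the conditioning group. In the sketch below, the 33 female respondents come from the text. The count of 8 is an assumption, inferred because 8/33 is the whole-number ratio that matches the reported 24.24%. The split between the two majors is not given.

```python
def conditional_probability(joint_count, condition_count):
    """P(A | B) estimated as count(A and B) / count(B)."""
    return joint_count / condition_count

female_total = 33        # from the text: 33 of 62 respondents are female
female_ib_or_mgmt = 8    # assumption: inferred from 24.24% of 33, not stated in the text

p = conditional_probability(female_ib_or_mgmt, female_total)
print(f"{100 * p:.2f}%")
```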
As we know, if two events A and B are independent, they satisfy the rule below:
P(A ∩ B) = P(A).P(B)
Probability of a student being female is P(F) = 0.5
Probability of a student intending to graduate is P(Y) = 0.7
Probability of a student being female and intending to graduate is P(F∩Y) = 0.275
Since P(F).P(Y) = 0.5 × 0.7 = 0.35 and P(F∩Y) = 0.275, P(F∩Y) ≠ P(F).P(Y)
Therefore, graduate intention and being female are not independent events.
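The independence check can be written out with the three probabilities quoted above. This is a minimal sketch; the tolerance passed to `isclose` is an arbitrary choice to avoid comparing floats exactly.

```python
from math import isclose

p_f = 0.5          # P(F), from the text
p_y = 0.7          # P(Y), from the text
p_f_and_y = 0.275  # P(F ∩ Y), from the text

product = p_f * p_y  # what P(F ∩ Y) would be under independence
independent = isclose(p_f_and_y, product, abs_tol=1e-9)

print(f"P(F).P(Y) = {product:.3f}, P(F∩Y) = {p_f_and_y:.3f}, independent: {independent}")
```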
2.7. Note that there are four numerical (continuous) variables in the data
set, GPA, Salary, Spending, and Text Messages.
Answer the following questions based on the data
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA
is less than 3?
Ans: The probability that his/her GPA is less than 3 is 37.0%
2.7.2. Find the conditional probability that a randomly selected male earns 50
or more. Find the conditional probability that a randomly selected female
earns 50 or more.
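The two conditional probabilities asked for here can be computed by restricting the records to one gender and taking the share with a salary of 50 or more. The records and the helper name `p_salary_at_least` below are hypothetical, so the printed values are not the answer for the Survey data.

```python
# Hypothetical (gender, salary) records; the real Survey data is not reproduced here.
records = [
    ("Male", 55.0), ("Male", 45.0), ("Male", 60.0), ("Male", 50.0),
    ("Female", 40.0), ("Female", 52.5), ("Female", 47.0),
    ("Female", 65.0), ("Female", 50.0),
]

def p_salary_at_least(records, gender, threshold=50.0):
    """P(salary >= threshold | gender), estimated from the records."""
    salaries = [salary for g, salary in records if g == gender]
    return sum(salary >= threshold for salary in salaries) / len(salaries)

p_male = p_salary_at_least(records, "Male")
p_female = p_salary_at_least(records, "Female")
print(p_male, p_female)
```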
2.8. Note that there are four numerical (continuous) variables in the data
set, GPA, Salary, Spending, and Text Messages. For each of them
comment whether they follow a normal distribution. Write a note
summarizing your conclusions
Ans:
From the graphs (Figures 12 to 15), the four continuous variables follow, more or less, a normal distribution.
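A visual judgement of normality can be backed by simple numbers. The sketch below builds the points of a Q-Q plot against a fitted normal distribution and computes the sample skewness, which is near 0 for symmetric data. The two samples are hypothetical, and the plotting positions (i - 0.5)/n are one common convention, assumed here.

```python
from statistics import NormalDist, mean, pstdev, stdev

def qq_points(values):
    """Pair each fitted-normal quantile with the matching sorted observation."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    # Positions (i - 0.5) / n stay strictly inside (0, 1), as inv_cdf requires.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, data))

def skewness(values):
    """Population skewness: mean of cubed standardized deviations."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical samples (illustrative only).
symmetric = [2.4, 2.8, 3.0, 3.0, 3.2, 3.6]
right_skewed = [1, 1, 2, 2, 3, 20]

points = qq_points(symmetric)
print(round(skewness(symmetric), 3), round(skewness(right_skewed), 3))
```

Points that lie close to a straight line in the Q-Q plot, together with a skewness near 0, support the reading of approximate normality; a long tail shows up as a clearly non-zero skewness.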