Chapter 2
Things You Should Know
2.1 Introduction
Most of the readers of this book have probably suffered through an elementary
course in Statistics. For many, the experience was painful, but everything, especially
the formulas, was quickly forgotten. For others, only the formulas have been
forgotten, and years of therapy have been unsuccessful in dealing with the rest.
Whatever the case for you, we will assume that very little, if any, of the material
from this course is at your fingertips. Instead, we will introduce some of the
concepts that we need in this chapter, and save others until we need them later.
If you are comfortable with the notions of statistical significance, p-values and
hypothesis tests, you may want to skip this chapter and go right to Chapter 3. For
those staying on, we first review the basics of displaying data before examining the
output of a statistical program on a specific data set. We then explain each part of
the output in detail, allowing you to skip the sections you feel comfortable with.
7.583 7.711 7.496 8.062 7.434 8.070 7.909 7.208 7.386 8.305
8.052 7.851 8.092 7.924 7.916 7.840 7.748 7.904 7.372 7.534
7.703 7.500 7.631 7.912 8.139 8.562 7.804 7.586 7.570 8.815
7.538 7.797 7.694 7.676 8.538 7.480 7.377 7.321 7.309 8.012
7.619 7.624 7.843 7.958 7.544 7.558 7.697 7.842 8.077 7.805
8.173 7.685 7.843 7.994 7.823 7.870 7.855 7.460 7.434 7.418
7.857 7.231 7.228 7.281 7.413 7.948 7.821 7.798 7.718 7.445
7.385 7.647 7.366 7.807 7.390 7.352 7.274 7.220 7.367 7.321
7.618 8.297 7.672 7.741 8.527 7.370 7.872 8.032 7.785 7.574
7.325 7.741 8.035 7.738 7.670 7.538 7.661 7.947 8.031 7.621
7.591 7.415 7.397 7.424 7.515 7.351 7.588 7.508 7.397 7.594
7.406 7.535 7.595 7.544 7.832 7.995 7.928 7.919 7.604 7.663
7.608 7.350 7.474 7.622 7.283 7.456 7.374 7.466 7.782 7.268
8.044 7.629 8.010 7.491 7.724 7.321 7.452 7.529 7.671 7.558
7.855 7.429 7.622 8.513 7.938 7.496 7.574 7.316 7.528 7.792
7.296 7.401 7.317 7.401 7.249 7.776 7.486 7.509 7.641 7.600
8.018 8.315 8.344 7.941 7.815 7.915 7.774 7.854 7.656 7.343
7.434 7.508 7.341 7.868 7.609 7.456 7.541 7.586 7.453 7.575
7.600 7.457 7.598 8.162 7.894 7.605 7.585 7.648 8.069 7.681
8.143 7.698 8.145 8.102 8.109 7.970 7.944 7.853 7.863 7.675
8.293 7.898 7.737 7.843 7.772 7.487 7.593 7.550 7.462 7.642
8.649 8.046 7.834 7.706 7.521 7.434 7.700 7.460 7.458 7.325
7.341 7.393 7.393 7.431 7.583 7.259 7.425 7.170 7.410 7.849
7.652 7.560 7.409 7.622 7.516 7.601
Table 2.1: Times Per Mile in Minutes for runs of about 4 miles. There are 236 days
represented.
Figure 2.1: A histogram of run rates. The y-axis shows the frequency with which a
rate from each bin on the x-axis occurred. For example, there were 11 runs, out of
the 236 total, with a rate between 7.2 and 7.3 minutes per mile.
We begin with the distribution displayed by the data on a single variable. One way
of displaying the distribution is to consider a histogram of the data (figure 2.1).
On the x-axis of the histogram, the data are collected into bins.[2] In this case the
bins have width 1/10 of a minute per mile. The y-axis of the histogram shows the
number of times that each bin occurs. For example, there were 11 runs with rates
between 7.2 and 7.3 minutes per mile.
From the histogram we see that during this period, I typically ran about 7.5 - 7.6
minutes/mile, although on any given day the rate could vary from around 7.0 to 9.0
minutes/mile. The lack of symmetry is quite apparent in a histogram. Does the
lack of symmetry for these data “make sense” to you? Why might there be some
extremely slow days, but not as many extremely fast days?
[2] The choice of the number of bins is left up to the user – or the computer program.
Usually it’s taken to be between 5 and 20 unless the data set is very large. Often the
end points of the bins are chosen to be “pretty”, that is, with numbers that can be
expressed with few decimals.
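Collecting data into bins of a fixed width is simple to do by hand. The following is a minimal sketch in Python, using only the standard library (the function name `bin_counts` and the choice of a 7.0 starting point are ours, not the book's):

```python
from collections import Counter

def bin_counts(data, width=0.1, start=7.0):
    """Count how many values fall into each bin of the given width.

    Bins are half-open intervals [start + k*width, start + (k+1)*width),
    keyed by their left endpoint.
    """
    counts = Counter()
    for x in data:
        k = int((x - start) / width)          # index of the bin containing x
        counts[round(start + k * width, 1)] += 1
    return dict(sorted(counts.items()))

# A few of the run rates from Table 2.1, just to illustrate:
sample = [7.583, 7.711, 7.496, 8.062, 7.434, 7.208, 7.386]
print(bin_counts(sample))
```

Running this on the full 236 values of Table 2.1 reproduces the bar heights of figure 2.1.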
For data sets that aren’t too large, we can squeeze more out of the histogram by
adding a little additional information. Look at the stem and leaf plot of the same
data in figure 2.2. To understand this display, first turn it sideways and notice that
it resembles figure 2.1, but with numbers replacing the bars.[3] To understand these
added numbers, look at the second line of figure 2.2:
72 : 1233567788
The “stem”, 72, is followed by 10 different “leaves”. The stem and leaves together
represent the numbers 7.21, 7.22, 7.23, 7.23, 7.25, 7.26, 7.27, 7.27, 7.28 and 7.28.
How did I know that 72:1 meant 7.21 and not 72.1 or 721 or 0.721? Because of
the line at the top of the plot that says: “Decimal point is 1 place to the left of the
colon”.
The advantage of the stem and leaf plot over the histogram is that not only do
we now know, as before, that 10 runs were between 7.2 and 7.3, but we now know
what the values (to one more decimal place) actually are. The stem and leaf plot
takes a little getting used to, but is a very useful display of a batch of data.
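A stem and leaf display of this kind is easy to generate. Here is a quick sketch in Python (standard library only; the splitting into a two-digit stem and a hundredths-digit leaf mirrors figure 2.2):

```python
def stem_and_leaf(data):
    """Return the lines of a stem-and-leaf display.

    For a value like 7.21, the stem is 72 (units and tenths) and the
    leaf is 1 (the hundredths digit), as in figure 2.2.
    """
    stems = {}
    for x in sorted(data):
        hundredths = round(x * 100)          # 7.21 -> 721
        stem, leaf = divmod(hundredths, 10)  # -> (72, 1)
        stems.setdefault(stem, []).append(str(leaf))
    return ["%d : %s" % (s, "".join(leaves)) for s, leaves in sorted(stems.items())]

print("Decimal point is 1 place to the left of the colon")
for line in stem_and_leaf([7.17, 7.21, 7.22, 7.23, 7.23, 7.25]):
    print(line)
```

Fed the whole of Table 2.1, this reproduces the rows of figure 2.2.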
Another graphical summary of these data is shown in figure 2.3. This display,
called a box and whisker plot, (or just a box plot) is based on a five number summary
of the data: the lowest value, the lower quartile (the point in the data for which
25% of the values lie below), the median (the point for which half of the data lie
below, half above), the upper quartile and the highest value. Notice that the “box”
includes the central 50% of the data, between the two quartiles, from about 7.5 to
7.8 minutes/mile. The relative closeness of the two quartiles and the extremes to
the median gives an idea of the symmetry (or lack of symmetry) of the batch. The
boxplot also identifies outliers by separating out any point lying more than 1.5 box
widths from either edge of the box.[4] So, the “whisker” of the plot goes out to the
largest (and smallest) values lying within 1.5 box widths of the box edges. Here,
[3] Some computer programs display this plot sideways, that is, in the same direction
as a histogram.
[4] Some plots use another symbol for “far outliers” that are more than 3 box widths
away.
Decimal point is 1 place to the left of the colon

71 : 7
72 : 1233567788
73 : 01222222244455577777889999
74 : 0000111112223333334556666666778999
75 : 000111122333344444566677778889999999
76 : 000000112222222334455566677778889
77 : 0000112244445778889
78 : 000011223344444555555667779
79 : 001112223444556799
80 : 11233345567789
81 : 0144467
82 : 9
83 : 0014
84 :
85 : 1346
Figure 2.2: A stem and leaf diagram of the run data. Each line represents a tenth
of a minute, with the leaves on the right representing the hundredths. The first line
represents 7.17.
Figure 2.3: A box plot of the run times. The black dot in the middle of the box
is the median, while the upper and lower edges of the box are the upper and lower
quartiles of the data. A whisker is drawn out to the largest and smallest values that
lie within 1.5 box heights from the edge of the box. All other points are indicated
as outliers.
Figure 2.4: A box plot of run times by month. Notice that changes in both the level
and the variation of times by month are apparent.
the whisker goes out to about 8.3 and identifies a handful of values from 8.5 to near
9.0 as outliers. As we mentioned before, there are no low outliers.
Figure 2.4 shows a box plot of the run times for each month separately, each
box summarizing the information contained in the other two plots. This makes it
easy to see changes in either the level or the spread of the process. For more
information on the use of the box plot in a production setting see the article by
Hoadley ([?]).
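The five-number summary and the 1.5-box-width rule behind the box plot can be computed directly. Here is a sketch in Python using only the standard library; note that `statistics.quantiles` with `method="inclusive"` is just one of several quartile conventions, so a given program's plot may differ slightly:

```python
import statistics

def five_number_summary(data):
    """Lowest value, lower quartile, median, upper quartile, highest value."""
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

def whisker_fences(data):
    """Points beyond 1.5 IQRs (box widths) from the box edges are outliers."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# A tiny illustrative batch:
print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9]))
print(whisker_fences([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```

Applied to the run data, the fences would flag the 8.5-and-above rates as outliers, as in figure 2.3.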
data: runtimes
sample estimates:
mean of x: 7.6778
median of x: 7.6215
standard deviation of x: 0.2931
Figure 2.5: Output from the run data. A slightly different representation of the box
plot is shown here, where the median is indicated by the white bar in the middle of
the box.
where x₁, x₂, etc. represent the individual data values, and n denotes the size of
the sample. Sometimes we use the summation shorthand for the expression above:

x̄ = (1/n) Σᵢ xᵢ
Which is a better summary of the data set, the mean, or the median of a sample?
Or in other words, which is a better description of a “typical” observation? The
answer is that both are useful. If the data are symmetric, the two summaries will be
equal. Knowing that they are very different tells me that the distribution is far
from symmetric. For example, if the average income on my block is $100,000, but
the median income is $40,000, this tells me something interesting about the income
distribution. It says, for one thing, that there is likely to be at least one outlier
(possibly several) with quite a large income.
For the run rates, the median and mean are close, indicating the data are not far
from symmetric. The fact that the mean is a little larger indicates that there is a
slight tendency for the data to be skewed to the right.
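The income example is easy to play with numerically. Here is a sketch in Python (the six incomes below are made up, chosen just to mirror the $100,000-mean, $40,000-median story):

```python
import statistics

# Five ordinary incomes and one wealthy outlier (invented numbers):
incomes = [35_000, 38_000, 40_000, 40_000, 47_000, 400_000]

print(statistics.mean(incomes))    # pulled way up by the outlier
print(statistics.median(incomes))  # resistant to it
```

The single large income drags the mean to $100,000 while the median stays at $40,000, which is exactly the signature of a right-skewed distribution.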
However, a more common measure of the spread is what is known as the stan-
dard deviation. For a histogram with a Normal distribution shape (the bell shaped
curve), this quantity has a precise meaning. Since, as we will see, the Normal plays
a very important role, we usually use the standard deviation to measure spread. You
may recall from an elementary Statistics class that the sample standard deviation is
found by computing:

s = √( Σᵢ (xᵢ − x̄)² / (n − 1) )   (2.1)

We’ll talk more about this later – and in particular we’ll even talk about why the
mysterious n − 1 appears in the denominator, but for now we’ll just leave it that s is
a measure of how spread out the histogram is.
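Equation 2.1 translates directly into code. A minimal sketch in Python (the function name is ours; the data are an arbitrary illustrative batch):

```python
import math

def sample_sd(data):
    """Sample standard deviation: square root of the sum of squared
    deviations from the mean, divided by n - 1 (equation 2.1)."""
    n = len(data)
    xbar = sum(data) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

print(sample_sd([2, 4, 4, 4, 5, 5, 7, 9]))
```

This matches what `statistics.stdev` computes, which also uses the n − 1 denominator.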
Why do we need the standard deviation? Because, not only does it measure how
spread out the distribution is, but it serves as a yardstick for comparing individuals.
This is especially true for data sets that roughly follow the Normal distribution.
The standard deviation gives us a way of saying how close observations are to each
other. Let’s use heights of students in my Stat 101 class as an example. Suppose
the mean height is 68 inches, and the sd is 4 inches. For any two individuals, we
can use the standard deviation to talk about how big a difference there is between
their measurements. We can think roughly of one standard deviation as being a
substantial difference, two standard deviations as being a very large difference and
three standard deviations as a huge difference. For the students, this makes sense:
four inches is indeed a substantial difference in height, eight inches is very large and
a foot is huge. With the average height at 68 inches, someone six feet (72 inches)
tall is one standard deviation taller than the average. A basketball player 6′8″
(80 inches) tall is
three standard deviations taller than the average. When we express the observations
in standard deviation units like this, we are using standard scores. Or we say that
we have standardized the observations. Technically, we standardize by subtracting
the average and then dividing by the standard deviation:
zᵢ = (xᵢ − x̄)/s   (2.2)
The new observations, the zᵢ, are standardized observations whose values tell how
far from the average the observation is in standard deviation units (positive values
are larger than the mean, negative values are smaller). The basketball player of
80 inches has a standardized score of:

z = (80 − 68)/4 = 3
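Standardizing is one line of arithmetic; here is the heights example as a Python sketch (using the mean of 68 in. and sd of 4 in. from the text):

```python
def standardize(x, mean, sd):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

# Heights example: mean 68 in., sd 4 in.
print(standardize(72, 68, 4))  # the six-footer: 1.0
print(standardize(80, 68, 4))  # the basketball player: 3.0
```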
Of course, the population will not have exactly the same average height as my
sample. Why not? Put very simply, because
OBSERVATIONS VARY!!
If not for this fact, I would be unemployed. Imagine what a wonderful(?) world
it would be if every height were the same, every measurement precise and exactly
the same. In this world, every experiment would consist of just one observation
which would exactly reproduce the value for the entire population. Think how easy
a marketing poll would be – just call up one person and ask which brand she prefers.
Her answer would then be THE answer! Well, fortunately, (at least for me), such
is not the case. Every sample is different. In a population whose average height is,
say, exactly 68.578 in., each sample will have its own average which will probably
be near 68.578 in., but not exactly (your mileage may vary!!). Just how near it
is to the population mean depends on many things, among which are how variable
heights are, who happens to be in my sample, and how many people comprise our
sample. Our goal in Statistics is to quantify how close the sample average might be
to the true population mean.
In order to distinguish the population and sample averages, we use two very
different symbols. We have already seen that the sample average is denoted by
x̄. This number is a statistic, that is, a quantity based on the data that changes
from sample to sample. By contrast, the population mean is a fixed characteristic
of the population, a constant (albeit unknown). These unknown features of the
population are called parameters. To emphasize the difference between statistics
and parameters, statisticians often use Greek letters for the parameters and Latin
letters for the statistics that estimate them. We denote the population mean by the
Greek letter for m (mean), which is μ. Similarly, we use the Greek letter for s, σ, to
denote the population standard deviation (the parameter) and to distinguish it from
the estimate s (the statistic) that varies from sample to sample.
Σᵢ (xᵢ − x̄)² / n   (2.3)

But remember, what we really want is the average squared difference from the
true, population mean μ. But we don’t know μ, so we have to estimate it – and we
use x̄ in its place. And this is the problem. For each sample, the observations are
closer to their own average than they are to μ. This makes each term (xᵢ − x̄)² a little
too small. Remember, we should be using (xᵢ − μ)², and this in turn makes the standard
deviation too small unless we compensate. Suppose we took a sample of 50 people
from a population with mean IQ 100 (say all adults in the U.S.); what would x̄ be?
Well, we don’t know. It will probably be near 100, but not exactly. The observations
vary less around x̄, the sample’s own average, than they do around 100, the mean
of all adults. Why? Because the average of a sample is exactly the number that the
observations are closest to. This makes the sum in equation 2.3 too small. It turns
out that we should adjust the sum in equation 2.3 by the ratio n/(n − 1). If we do
this, it replaces the n in equation 2.3 by n − 1, giving the usual estimate of σ in
equation 2.1. Unfortunately, the engineers who design calculators apparently didn’t
understand this and gave people the choice of using n or n − 1, which has caused
headaches for Statisticians ever since.
To add some more language that we’ll need later on, we say that the process
of estimating μ has cost us one degree of freedom. Consider the 50 IQ’s from my
sample. Without looking at them, we have no way of knowing what specific values
they take. We describe this by saying that the data set has 50 degrees of freedom.
But suppose I tell you their average. Then, you’d only have to know 49 of the IQ’s.
Why? Because I could figure out the last one, knowing that all 50 have to average
to the value I gave you. So we say that there are only 49 degrees of freedom left
in the residuals after I’ve taken out the average. What I’ve done is split the 50
degrees of freedom into 1 for the average and 49 for the residuals.
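The difference between dividing by n and by n − 1 is visible in the standard library, which provides both versions. A quick sketch (the five IQ-like values are invented):

```python
import statistics

sample = [96, 104, 100, 110, 90]  # made-up IQ scores

# pstdev divides by n, treating the sample mean as if it were the true mean;
# stdev divides by n - 1, compensating for having estimated the mean.
print(statistics.pstdev(sample))  # always a bit smaller...
print(statistics.stdev(sample))   # ...than the n - 1 version
```

For a sample this small the gap is noticeable; as n grows, n/(n − 1) approaches 1 and the two versions converge.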
Suppose we took many samples and computed the average of each. What shape
would the histogram of these averages have? It seems logical that it would have a
shape that depended on the original population histogram. But an astounding fact,
known as the Central Limit Theorem, tells us that this histogram of sample averages
will always tend to be Normally distributed. The larger the sample size, the more
Normal looking the histogram will be.
Well, that’s a nice result, but so what? What we’ll soon see is that this fact
enables us to quantify the uncertainty in our guess (the sample average) of the
population mean. That is, we’ll use it to construct the confidence interval for μ.
And we can do this in a way that doesn’t depend on the shape of the histogram
we started with. Since histograms of averages are approximately normal, we will
be able to use some facts known about the normal (see the subsection below) to
construct these confidence intervals.
[Figure: the standard Normal curve, with one standard deviation marked; the x-axis
is in standard units.]
So if somebody tells you you’re about five standard deviations out, what they really
mean is that you’re one in a million. (Your next question should be – which side?)
[Figure: a histogram of the students’ heights, in inches; the y-axis shows the
frequency.]
Notice how spread out around the average the data are. How likely would it be to
find one unusually tall student in class? Translating to standard deviation units, a
height a few SD’s greater than the average is something that happens for normal
data only a small fraction of the time. Not something that happens every day,
certainly, but not impossible. Now, what are the chances that the average of the
students in my class is that far above the mean? Why does this seem so much less
likely to happen? Because the standard error is much smaller than the standard
deviation! How likely would it be for an even larger class to have an average that
extreme? Does this seem even less likely? The standard error is related to the
standard deviation in the following simple way:

SE = s/√n

where n is the size of the sample. For my class, the standard deviation (of the
individuals) is 4 in., so the SE (of the average) is 4/√n. So, in terms of
standard errors, the class average in question would be some 30 standard errors
away. Wow! For a Normal distribution, the probability of an observation being
more than 30 standard deviations away is incredibly small. There’s
probably a greater chance that all the molecules in the room you’re in will jump to
one side of the room, suffocating you. It’s just not going to happen by chance. At
least, not as long as the sample is random.
[Figure: the theoretical histogram of a single roll of a die – a flat, uniform
distribution over the values 1 through 6.]
Figure 2.10: Actual histogram of 100,000 rolls. Notice how much more like the
theoretical histogram this one appears, with 100,000 rolls instead of 30.
rolls, I’m going to collect the average of the two dice for each roll. I’ve shown the
results of the first 15 rolls in table 2.2.
Toss # Die #1 Die #2 Average of
the Two Tosses
1 2 3 2.5
2 5 3 4
3 1 5 3
4 2 4 3
5 2 4 3
6 4 4 4
7 3 5 4
8 1 3 2
9 2 4 3
10 3 4 3.5
11 5 6 5.5
12 6 4 5
13 6 5 5.5
14 2 1 1.5
15 2 1 1.5
16 ... ... ...
Table 2.2: The first 15 tosses of two dice and their average
In the last column of table 2.2 I’ve got a collection of the 100,000 averages from
rolls of two dice. (So far, so good?) What does the histogram of this collection of
averages (the last column in the table above) look like? Will it be just as flat as
the histogram of the first 100,000 rolls? Let’s think a minute before looking at the
answer. In the original sample, every number has exactly the same probability of
being chosen – that’s what the uniform distribution means. So, I’m just as likely to
get a 6 as a 3. But is every average equally likely as well? How could we get an
average of 6 (that is, a total of 12)? We’d have to have a 6 on both rolls! How likely
Figure 2.11: Histogram of the averages from 100,000 rolls of two dice. This distri-
bution is known, not too surprisingly, as the triangular distribution. The point here
is that the distribution of the averages looks different from the original uniform
distribution.
is this (you needn’t be precise!)? On the other hand, how could we get an average
of 3.5 (total of 7)? We could get a 3 and a 4, or a 5 and a 2 or a 6 and a 1. There
are many more scenarios producing an average near 3.5 than an average near 6. For
my sample of 100,000 the histogram of the averages looks like the one shown in
figure 2.11.
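The two-dice experiment is easy to reproduce. Here is a sketch in Python (standard library only; the seed is arbitrary, chosen just to make the run repeatable):

```python
import random
from collections import Counter

random.seed(1)  # reproducible run

# 100,000 rolls of two dice, keeping the average of each pair:
averages = [(random.randint(1, 6) + random.randint(1, 6)) / 2
            for _ in range(100_000)]

counts = Counter(averages)
# An average of 3.5 (total 7) can happen six ways; an average of 6 only one way.
print(counts[3.5], counts[6.0])
```

The count for 3.5 comes out roughly six times the count for 6.0, and the full histogram of `averages` is the triangular shape of figure 2.11.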
Now, what happens if we take another 100,000 rolls and make averages of size
3? First, we are even more likely to be near 3.5. This makes sense if we consider
how hard it would be to get an average of 6 now. So, the standard deviation is
smaller. Also, the shape continues to change, as we can see in figure 2.12.
If we were to keep doing this, making averages of more and more rolls, the
two trends would continue. That is, we’re more and more likely to be close to 3.5,
(the standard deviation continues to shrink), and the shape continues to look less
Figure 2.12: Histogram of the averages of 100,000 rolls of three dice. Notice that
the standard deviation is smaller than in figure 2.11, and how much less like the
uniform this histogram appears.
Figure 2.13: Histogram of the averages of 100,000 rolls of ten dice. What theoreti-
cal histogram does this look like?
and less like the original uniform distribution. Averages of size 10 are shown in
figure 2.13.
Can you see the pattern? For one thing, the histogram is getting narrower, because
the average is more likely to be near 3.5. The standard error is getting smaller!
We knew that from the last section. What’s happening to the shape of the distribution?
You guessed it – it’s the famous Normal distribution. Notice that once the
average is taken over enough observations, the original distribution (the uniform)
plays a very small role in what the histogram of the averages looks like. This same
phenomenon occurs no matter what shape the original histogram has.
Let’s review what happened to the location, the spread and the shape of the averages
as the sample size got larger. First, each histogram stayed centered at the population
mean, which happened here to be 3.5. If it hadn’t, we would say that the average
is biased. (Unbiased means that, on the average, the sample statistic estimates the
parameter we want to estimate.)
What about the spread? Notice that as the sample size gets larger, we tend to be
closer to the mean of 3.5. This is because the standard error is equal to the standard
deviation over the square root of the sample size. As the sample size gets larger,
the averages get closer to the true population mean. What this says is that if you
have a larger sample size, on the average you will be closer to the true population
mean than if your sample size is smaller. (The Law of Large Numbers again.) If
this weren’t true, polling companies would soon go out of business.
And the shape? We see that as the sample size gets larger, the histogram of the
averages gets closer and closer to having a Normal distribution. This is the Central
Limit Theorem. We will put this to very practical use in the next section.
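All three trends can be checked in one simulation. This Python sketch (standard library only, arbitrary seed) compares averages of 1, 2 and 10 dice:

```python
import random
import statistics

random.seed(2)  # reproducible run

def averages_of(k, trials=20_000):
    """The average of k dice rolls, repeated many times."""
    return [sum(random.randint(1, 6) for _ in range(k)) / k
            for _ in range(trials)]

for k in (1, 2, 10):
    avgs = averages_of(k)
    # The mean stays near 3.5 (unbiased) while the spread shrinks like 1/sqrt(k).
    print(k, round(statistics.mean(avgs), 2), round(statistics.stdev(avgs), 2))
```

The printed standard deviations shrink roughly by a factor of √k, just as the SE = s/√n formula predicts.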
Figure 2.14: The histogram of my sample of 200 college students’ number of hours
spent watching TV. Notice the asymmetry.
[Figure 2.15: a histogram of the averages from the simulated samples; the values
run from about 2 to 6 hours, with the y-axis showing the frequency.]
there’s a 95% chance that the interval contains the true mean. This is called a 95%
confidence interval for the population mean.
To understand the mechanism of the confidence interval, let’s turn the problem
around and suppose that we know the answer. That is, suppose we knew that the
mean number of TV hours that all college students watch is really μ hours. While
we can’t see the distribution for all students, we can suspect that it will be skewed
to the right, since there will likely be some students who watch much more than
μ hours/week, but no one can watch less than 0.
Now, let’s imagine that we took many different random samples of 200 students.
What would the histogram of all these averages look like? Will it be as
asymmetric as the original distribution, or will the averaging process make it more
symmetric? To see what it might look like, I’ve simulated many such samples
and shown the histogram of the averages below in figure 2.15. Notice where our
sample average lies – not far from the true mean. The Central
Limit Theorem says that the shape of this histogram will be approximately Normal,
even though the individual observations are not. Notice that averaging over 200
students has practically eliminated the asymmetry. This histogram, on the average,
will be centered at the population mean (because the average is unbiased). The
standard deviation of this histogram, the standard error, depends on the sample
size: the larger the sample, the smaller the spread. Here, the SE is the sample
standard deviation divided by √200 – a small fraction of an hour.
Now, on to the confidence interval. We know that the histogram of the averages
is approximately Normal (the Central Limit Theorem). We also know that for any
Normal (or Gaussian) random variable, approximately 95% of the observations fall
within 2 standard deviations of the mean. So, applying this to our case, about 95%
of all random samples of college students will have their average, x̄, within two
standard errors (2 SE) of the mean μ. (Check the histogram.)
So, the average from a single random sample, x̄, has about a 95% chance to be
within 2 SE’s of μ. Now, we use this fact to turn the problem around. Starting with
our sample average, let’s put a 2 SE window around this average:

x̄ ± 2 SE

Now, how often will this kind of interval contain the population mean? It fails only
for the 5% of the samples whose averages fall farther than 2 standard errors from μ.
In other words, it will work 95% of the time. That’s why we can say that with 95%
confidence, the population mean is within our confidence interval. Of course, 5%
of the time we will be unlucky, and the population mean won’t be in our interval.
If this is too often for you to be comfortable with, you can always take a higher
probability confidence interval. Of course, it will be wider! And of course, we’ll
probably never know if we happen to be in the lucky 95% or in the unlucky 5% of
the samples. But being human, we never think that the 5% will happen to us! But,
trust me, it will – 5% of the time. To protect himself against being in the unlucky
5%, an engineer I knew at Hewlett-Packard decided to have a t-shirt made up for
himself which read, “Rare events don’t happen to me”. You can take whatever
precautions you like to try to ensure that your confidence interval contains the
mean, but the bottom line is that a 95% confidence interval will contain the
population mean 95% of the time on the average, no more and no less.
If we just use s as our estimate of σ (and use the more precise 1.96 instead of 2),
the 95% confidence interval formula would be:

x̄ ± 1.96 s/√n   (2.4)

If you trust in computer programs, you won’t have to be bothered with remembering
equation 2.4, but realize simply that for small sample sizes the confidence
interval will be slightly bigger than 2 standard errors on each side of the average;
for larger samples, it’s about 2. For a quick and dirty confidence interval, just use 2
for the multiple as long as the sample size is reasonably large (30 or so is the usual
rule of thumb).
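The quick-and-dirty interval takes only a few lines of Python (standard library only; the hours of TV below are invented, not the book’s actual sample):

```python
import math
import statistics

def conf_interval_95(data):
    """Quick-and-dirty 95% CI: average +/- 2 standard errors.

    Reasonable once the sample size is 30 or so; smaller samples
    need the slightly wider t adjustment discussed in the text.
    """
    n = len(data)
    xbar = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)
    return xbar - 2 * se, xbar + 2 * se

# Made-up weekly TV hours for 50 students:
hours = [3, 4, 5, 4, 3, 5, 4, 4, 6, 2] * 5
print(conf_interval_95(hours))
```

The interval is centered at the sample average, and its half-width shrinks like 1/√n as the sample grows.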
So, finally, let’s redo the confidence interval for our sample of TV watchers
with the t adjustment. Using the sample average and the sample standard deviation,
a 95% confidence interval for the true population mean of all college students is:
A stem and leaf diagram of the data is shown in figure 2.16. Using the sample
average and standard deviation, with the t statistic adjustment, a 95% confidence
interval for the “true” mileage of the car is:
Table 2.3: Miles per gallon recorded for 100 fill ups from a 1989 Nissan Maxima.
Figure 2.16: A stem and leaf diagram of the Nissan Maxima data.
Figure 2.17: A histogram of the sample averages of size 100, assuming the
hypothesized mean. Notice where our average is!
Suppose the standard deviation is s. (We need to make some assumption about the
sd.) Then averages of size 100 from this population will be approximately normal,
with the hypothesized mean and standard error s/√100. How “far” is our average
from the hypothesized mean? Let’s first look at a picture. Figure 2.17 shows the
histogram of averages under the hypothesized mean with a sample size of 100.
Where is our average on this picture? Does it look like a typical value? Clearly
not! In fact, you probably don’t have to calculate the p-value here. It passes
what I call the “ocular trauma test of statistical significance”.[9] Just to make things
more precise, we’ll go ahead and calculate the p-value. (It’s often said that you use
[9] Translation – it hits you right between the eyes!
a cutoff of about 2 SE’s.) Our average is far more than 2 SE’s below the hypothesized
mean. Now, of course, any sample average is theoretically possible, whatever the
mean and SE, but this is not a very likely average to get. In fact, the chance of
getting an observation this many standard deviations smaller than the mean is
almost unimaginably small. (I looked it up.) We’re back to all the molecules in the
room jumping to one corner again.
What do we make of this? There are only two possibilities. One, we just got
something that’s nearly impossible by chance. (We were really, really unlucky.)
Or, the hypothesized value of μ is WRONG! Not one to believe that I am cursed
with bad luck, I would view this as very strong evidence against the hypothesized
value of μ.
This illustrates the basic idea behind hypothesis testing. We postulate a value
for the mean. Let’s call it μ₀. This hypothesis is known as the null hypothesis. The
alternative to this is that the mean is not μ₀. (And in that clever way of naming
things that Statisticians have, this is known as the alternative hypothesis.)
Then we collect our evidence – the sample average. We then compute how
many standard errors away from μ₀ our x̄ is, and call this t:
t = (x̄ − μ₀)/(s/√n)   (2.5)
Why do we call it t? Because, if the null hypothesis is true, this quantity should fit
the t distribution and be in the range of about −2 to 2. (The exact number depends
on the degrees of freedom and the confidence level.) But, just think of this value
as the number of standard errors away from the hypothesized mean. By using the
t distribution as a reference, we can now compute how likely it is to get a value this
far away (or more) from the hypothesized value. This probability is called the p-
value. If the p-value is small, this means that the probability is small of getting that
far away from the hypothesized mean, and in this case, we view this as evidence
against the hypothesized value μ₀. If the p-value is not particularly small, then
our average is not untypical given the hypothesized mean. It’s consistent with the
hypothesis and thus provides no evidence against it.
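As a sketch (not the book’s own computation – the fill-up figures and the hypothesized 26 mpg below are invented), the t statistic of equation 2.5 and an approximate two-sided p-value can be computed with the standard library, substituting the Normal curve for the t distribution:

```python
import math
import statistics

def t_statistic(data, mu0):
    """Standard errors between the sample average and the hypothesized mean."""
    n = len(data)
    se = statistics.stdev(data) / math.sqrt(n)
    return (statistics.mean(data) - mu0) / se

def p_value(t):
    """Two-sided p-value, using the Normal curve as a stand-in for the t
    distribution (a fine approximation once the sample isn't tiny)."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

# Made-up mileage for 100 fill-ups, and a hypothesized mean of 26 mpg:
mpg = [24.1, 23.8, 24.5, 23.9, 24.2, 24.0, 23.7, 24.3, 24.4, 24.1] * 10
t = t_statistic(mpg, 26.0)
print(t, p_value(t))
```

A t near 0 gives a p-value near 1 (no evidence against μ₀); a t many standard errors out gives a p-value near 0.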
Let’s try another value for the hypothesized mean. With this new value, our average
is only a fraction of a standard error below the mean. Values this close to the mean
happen all the time, and so this is a perfectly reasonable average, given that the new
value is the true mean. The p-value in this case is large, which means that a large
share of the time, just by chance, you’ll get a value this far below the mean. That’s
certainly not much evidence against the null hypothesis. The histogram in figure 2.18
shows how close to the hypothesized value we are now.
So, we’ve seen that one value is absurd for the mean while another is perfectly
reasonable. As a last example, let’s try a value in between as the mean. Now our
average is:
Figure 2.18: A histogram of the sample averages of size 100, under the second
hypothesized mean. Notice how typical looking our average is now.
Figure 2.19: A histogram of the sample averages of size 100, under the third
hypothesized mean. How typical is our average now?
against the null hypothesis. As a rough guide, p-values between 0.05 and 0.10
provide weak evidence against the null hypothesis; between 0.01 and 0.05 –
reasonable evidence; and smaller than 0.01, strong evidence. Some disciplines
sanctify these guidelines by labeling them as in table 2.4. This tends to give these
somewhat arbitrary values far too much credibility and importance. It is important
to realize that the p-value should be viewed as one piece of information in the
decision making process, and not the final say.
Table 2.4: By relegating p-values to one, two or three stars, some disciplines have
created the “Michelin guide” to statistical significance.
P-values are often misunderstood and abused. Let’s review them and define
them a little more formally. An important concept in hypothesis testing is that the
only way to prove something is to assume the opposite and to have the data disprove
that. It’s similar to the role of the district attorney in a court case. In order to prove
guilt (our objective), we assume innocence and let the evidence (the data!) disprove
it. And, of course, as we know all too well these days, failing to establish guilt does
not prove innocence either. It merely fails to convict.
Thinking of ourselves as the district attorney, we put the hypothesis that we’d
like to reject as the null hypothesis. Many real world decisions are based on a
comparison with a current standard, or current accepted value. We hypothesize this
value in our null hypothesis, and see if the data enable us to reject it. In the case
of hypothesizing mpg as our null hypothesis, we are giving benefit of the doubt
to the manufacturer who has claimed that the car delivers mpg, and putting the
burden of proof on the data to reject this claim. In this case, we would write:
On the other hand, it is worth stating again that failing to reject the null hy-
pothesis does not prove it! Failing to reject the null hypothesis simply says that
the evidence (the data) are consistent with the null hypothesis. But, they may be
consistent with many other hypotheses and explanations as well. This is an often
overlooked or misunderstood point, but here’s a silly example to convince you that
failing to reject is not proof that it’s true. Suppose I want to prove De Veaux’s
Law which states: “The sun and the stars revolve around De Veaux”. Now, accord-
ing to De Veaux’s Law (as long as he stays on the planet), from anywhere on Earth,
the Sun will appear to rise in the East and set in the West. I want you to collect data
for the next two months and see if these data verify De Veaux’s Law. Two months
from now, you won’t be able to reject this hypothesis, because the data are consis-
tent with it. They “fail to reject the Null Hypothesis”. Of course, they don’t prove it
either. There are many other theories consistent with these data as well. That’s why
we are careful to choose the null hypothesis in the hope that the data will reject it –
otherwise we are left with inconclusive results.
Now, if the data do seem inconsistent with the null hypothesis, there is always
a question of the degree to which they are inconsistent. This is where the p-value
comes in. We always assume the null hypothesis is true to start. We next collect
data and from these data derive a statistic (like the sample average). The p-value
gives the probability of this statistic’s occurrence given the null hypothesis is true.
If this value is small enough, this is viewed as evidence against the null hypothesis.
The degree to which you need to be convinced is context dependent. If there is
no number small enough to crack your faith in the null hypothesis, then there is
no point in collecting the data!! On the other hand, there is NO specific p-value
that rejects the null hypothesis in all cases. (It certainly isn’t 0.05 as claimed by
table 2.4!!!). In the courts, the correct level is left deliberately vague: “beyond a
reasonable doubt”. In practice, for decision making, how small the p-value must be
depends on the beliefs of the decision maker(s). It should be part of the decision
making process, but NOT the sole determinant. The confidence interval for the
quantity of interest should be examined, as well, with considerations of all the costs
of decisions involved.
There are three probabilities one might report:
- The probability of getting a value larger or smaller than your statistic – that is, farther away from the null hypothesis than your statistic (the two-sided p-value)
- The probability of getting a value larger than your statistic
- The probability of getting a value smaller than your statistic
Which of these three numbers is the appropriate one depends on the context.
The output for our example with a hypothesized mean of mpg is shown in ta-
ble 2.5.
The “Test Statistic” tells us that our estimate, , is standard errors to
the left of the hypothesized mean of . The first p-value tells us that the probability
of being more than standard errors from the mean is . The last one tells
us that the probability of being more than standard errors to the left of the
mean is half of this, or . The middle value is just the probability of being
greater than standard errors to the left of the mean. It doesn’t mean much in
this case. In fact, the last two numbers will always add to , since together they
give the probability of being to the left and right of our value. Which is appropriate
is determined by the problem definition. Since in our case, we would take action
(take the car back, get it tuned, sue the manufacturer) only if we got significantly
Table 2.5: P-values from a test of the Hypothesis that the mean mileage is .
Notice that the p-value depends on whether the test is one or two sided and which
direction one looks.
less than , the appropriate p-value is the last one, which tells us that our car got
something that should occur about times out of .
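The relationship among the three p-values in table 2.5 can be sketched directly. The test statistic below is illustrative (an average 2.0 standard errors below the hypothesized mean), not a value from the text:

```python
import math

def normal_cdf(z):
    # Standard normal cumulative distribution
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Illustrative test statistic: an average 2.0 standard errors below the mean.
z = -2.0

p_left  = normal_cdf(z)              # chance of being even farther to the left
p_right = 1.0 - normal_cdf(z)        # chance of being to the right of z
p_two   = 2.0 * normal_cdf(-abs(z))  # chance of being that far away on either side
```

As the text notes, the two one-sided probabilities always add to 1, since together they cover everything to the left and right of our statistic, and the two-sided p-value is just twice the smaller of them.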
With a large enough sample, a result can be statistically
significant, but still financially and scientifically insignificant. Let me repeat – just
because data show statistical significance does not mean that they also imply finan-
cial or scientific significance. If a new engine delivers “statistically significantly”
more mileage than the old, is it necessarily better? (What about cost? – how much
more mileage does it deliver?) If a new credit card promotion delivers “statistically
significantly” more customers is it better? How much does it cost? What type of
customers does it bring in, how much higher is the response rate? These questions
always need to be asked when talking about business or scientific decisions.
In fact, when talking about real decisions, it is often convenient and practical to
use the confidence interval in place of, or in addition to, a hypothesis test. Let’s use
a credit card response rate as an example. Suppose the current product delivers 4%
response. In a controlled design (we’ll talk much more about how to do this in later
chapters), a new product delivered a statistically significantly higher response rate,
with a p-value of .00001. Even if this raw response rate is what I’m interested in, I
should be unwilling to make the business decision to adopt the new design without
more information. What I would like to see is a confidence interval for the profit
or revenue generated by the new product. Suppose the confidence interval for the
response rate of new product is (4.2% – 5.1%). Then, statistically it is fairly clearly
better than the 4% delivered by the old product. But, I still need to put cost into
the equation. The new design may be more expensive and/or may attract customers
whose behaviors are more costly. What we need to do is to translate the response
rate confidence interval into a profit confidence interval. Once that is done, I am in a
much better position to make a business decision. I almost always prefer confidence
intervals to hypothesis tests as the basis of a business or scientific decision process.
One more point about p-values. There is a common misconception about what
the probability means. Let’s look at a polling example. Suppose I randomly sample
adults in the U.S. Let’s assume that there are equal numbers of males and females in
the population (the number of females is actually a little over , but let’s ignore
this). I’ll take a sample of adults. Suppose I get females in the sample. If I
test the hypothesis that the mean proportion is , I get the results in table 2.6.
We get a p-value (two sided) of . What does this mean, exactly? Does it
Table 2.6: Hypothesis test that the proportion of females in the population is .
Notice that the hypothesis is rejected on the basis of this sample.
mean that there’s a probability of only that the null hypothesis is true? NO!!
If you thought yes – you are not alone. Most people make this mistake. In fact, it’s
the other way around. What it means is that under the null hypothesis assumption
(of equal numbers of males and females), there’s a probability of only of
getting a sample as lopsided (one way or the other – that’s why it’s two sided) as
the one we got. The p-value says nothing directly about how true the null hypothesis
is. This is one of the most misunderstood concepts in applied statistics. The p-value
is a measure of how unlikely our sample statistic is, assuming the null hypothesis to
be true. It is then used as evidence for making decisions about whether we believe
the null hypothesis to be true. Generally, the more confidence one has in the null
hypothesis before the data are collected, the smaller the p-value will have to be for
this individual to be convinced that the null hypothesis is false. That’s one reason
why there’s no universally accepted p-value for which all null hypotheses should be
rejected.
Many people get mistakenly stuck on the idea that the p-value gives the proba-
bility of the null hypothesis being true, so I’ll give a very simple example of why
we can’t switch the probabilities around. Suppose I know that if I run across In-
terstate 95 blindfolded, my chances of getting killed are . Now, if you see my
obituary in the paper, what are the chances that I died by running across Interstate
95 blindfolded? Are they ? Of course not. Similarly, the p-value says nothing
about the probability that the null hypothesis is true. It’s the probability of getting
our sample given that the null hypothesis is true – the other way around.
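The polling test behind table 2.6 can be sketched as a simple z test of a proportion. The counts below (540 females in a sample of 1,000) are illustrative stand-ins for the elided values in the text:

```python
import math

def normal_cdf(z):
    # Standard normal cumulative distribution
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def proportion_test(successes, n, p0):
    """z test of H0: true proportion = p0, reporting a two-sided p-value."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1.0 - p0) / n)   # standard error computed under H0
    z = (p_hat - p0) / se
    return z, 2.0 * normal_cdf(-abs(z))   # lopsided one way OR the other

# Hypothetical poll: 540 females out of 1,000 adults, testing H0 that the
# population is half female.
z, p = proportion_test(540, 1000, 0.5)
```

Note what the p-value is: the chance of a sample this lopsided, computed assuming the null hypothesis is true. It is not the probability that the null hypothesis is true.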
There is one more concept in the context of hypothesis testing that we must review
before moving on to the comparison of two groups in Chapter 3. Let’s think about
testing whether our car gets at least mpg one more time. Unless I have perfect
information – in this case testing the car continually until its demise – there is
always the possibility that I will make an incorrect decision. There are two types
of errors, depending on whether or not the null hypothesis is actually true.
To make things concrete, let’s suppose that the null hypothesis is that the car gets
(at least) mpg. (The alternative hypothesis is that it does not). Now, suppose
the car really does get mpg, so, in fact, the null hypothesis is true, but I happen
to get a low sample average. Suppose it’s so low that I reject the null hypothesis.
If I use a cut off p-value of , how often will I make such a mistake? Well, the
probability of getting a statistic with a p-value of below is exactly so my
probability of such an error is exactly the p-value cut-off that I used. This kind of
mistake is seeing a false signal – or what we would call a false “positive” in the
context of drug or disease testing. That is, the null hypothesis was indeed true, but
I decided to reject it based on a false signal – because I saw something out of the
ordinary (low p-value). In a drug testing situation, the null hypothesis would be
that no drug is present. A false positive implies that we incorrectly rejected the null
hypothesis because, by chance, the data seemed inconsistent with it. Statisticians
have given this false signal type of error (rejecting the null hypothesis when true)
the uninformative name of Type I error. Let’s put both situations in a table (see
table 2.6.3).
Now, let’s reverse the situation. Suppose that the null hypothesis is wrong –
the car really does have mean mileage below mpg. But this time, suppose that
the average we get from our sample is not low enough to reject the null hypothesis.
Here we failed to see the signal associated with the car’s inadequate mileage – this is
a missed signal, or a false “negative”. Can you guess what type of error statisticians
call this? If you said Type II, you’re starting to think frighteningly like a statistician.10
Later we will worry about the frequency of making both Type I and Type II
errors. The probabilities associated with each of these are called and (using
Greek letters just to obscure things as usual). Fixing the value of the p-value to use
as a threshold at some value results in a probability of a Type I error of precisely
. Calculating depends on how false the null hypothesis actually is. For example,
if my car in reality gets only mpg, the chance of missing the signal (getting
a sample mileage consistent with the null hypothesis of mpg) is pretty small.
There’s a much greater chance of missing the signal though if my car actually gets
mpg. The farther away from the null hypothesis we are, the lower will be.
Calculating it can be very tricky – but it turns out it is a very important quantity
for determining an appropriate sample size for an experiment. Fortunately, many
software packages now calculate it. We’ll see lots of examples later in the book.
By changing the cut-off p-value, you can make lower at the expense of raising
10
Somewhat, and only partially facetiously, we sometimes also define a Type III error as giving
the right answer to the wrong question.
, or vice versa. The most effective way of lowering both – reducing our risk
simultaneously of making either kind of mistake is to increase the sample size.