Chap 2

Chapter 2
What you should have learned in that

Stat course you took, but probably
didn’t
2.1 Introduction
Most of the readers of this book have probably suffered through an elementary
course in Statistics. For many, the experience was painful, but everything, espe-
cially the formulas, was quickly forgotten. For others, only the formulas have been
forgotten, and years of therapy been unsuccessful in dealing with the rest. What-
ever the case for you, we will assume that very little, if any, of the material from
this course is at your fingertips. Instead, we will attempt to introduce some of the
concepts that we need in this chapter, and save others as we need them for later. For
those of you comfortable with the notions of statistical significance, p-values and
hypothesis tests, you may want to skip this chapter and go right to Chapter 3. For
those staying on, we first review the basics of displaying data before examining the
output of a statistical program on a specific data set. We then attempt to explain
each part of the output in detail, allowing the reader to skip the sections that he/she
feels comfortable with.
7
8
2.2 Displaying Data

Exploring data is often, and probably should always be, the first step in analyzing
data. And the first step in exploring data can be summarized by what I modestly call
De Veaux’s first law, which can be succinctly abbreviated by the acronym DTDP:
Draw the D Picture! 1 . The great sage, Yogi Berra, once said, “You can see a lot,
just by looking”. This is certainly true with data displays.
Far too often, an investigator “throws the data into the computer” and then starts
trying whatever statistical program seems to fit, looking desparately for a “p-value”,
without spending time to look at or think about the data. We will talk about statis-
tical significance, p-values and all that later, but for now, we will just concentrate
on what we can see. A good display of the data may not arm the researcher with
p-values and significance levels, but it just may answer the question.
There are, of course, many ways to display data. Some are good, some bad,
some honest and some dishonest. Several wonderful books have been written on
this subject and we refer the interested reader to the texts of Tufte (1983 and 1989),
Cleveland (199x) and Wainer (1997). In this book we will just review some of the
basic graphical displays of data.
2.2.1 Displaying a batch

What is a batch of data? A batch is an informal term indicating a group of numbers.
Later we’ll talk about random samples, but for now, a batch is simply a collection
of data about which we want to say something. For example, consider the following
batch of data in table 2.1 on jogging rates for a middle aged lunchtime runner. These
are my average times per mile for runs of between 3 and 5 miles on 236 days in the
period 1988 to 1992.
Looking at a table of data like table 2.1 is a good way to get data vertigo. A
graphical summary is a much better way to start. We’ll begin by displaying the
distribution of the data. The distribution is simply the pattern of variability dis-
1
The DTDP law has been promoted by many people before me and in many versions including
DTDPATLAI!.(Draw the d picture and then look at it! – attributed to Lynn Hare)
Chapter 2 Things You Should Know 9
7.583 7.711 7.496 8.062 7.434 8.070 7.909 7.208 7.386 8.305
8.052 7.851 8.092 7.924 7.916 7.840 7.748 7.904 7.372 7.534
7.703 7.500 7.631 7.912 8.139 8.562 7.804 7.586 7.570 8.815
7.538 7.797 7.694 7.676 8.538 7.480 7.377 7.321 7.309 8.012
7.619 7.624 7.843 7.958 7.544 7.558 7.697 7.842 8.077 7.805
8.173 7.685 7.843 7.994 7.823 7.870 7.855 7.460 7.434 7.418
7.857 7.231 7.228 7.281 7.413 7.948 7.821 7.798 7.718 7.445
7.385 7.647 7.366 7.807 7.390 7.352 7.274 7.220 7.367 7.321
7.618 8.297 7.672 7.741 8.527 7.370 7.872 8.032 7.785 7.574
7.325 7.741 8.035 7.738 7.670 7.538 7.661 7.947 8.031 7.621
7.591 7.415 7.397 7.424 7.515 7.351 7.588 7.508 7.397 7.594
7.406 7.535 7.595 7.544 7.832 7.995 7.928 7.919 7.604 7.663
7.608 7.350 7.474 7.622 7.283 7.456 7.374 7.466 7.782 7.268
8.044 7.629 8.010 7.491 7.724 7.321 7.452 7.529 7.671 7.558
7.855 7.429 7.622 8.513 7.938 7.496 7.574 7.316 7.528 7.792
7.296 7.401 7.317 7.401 7.249 7.776 7.486 7.509 7.641 7.600
8.018 8.315 8.344 7.941 7.815 7.915 7.774 7.854 7.656 7.343
7.434 7.508 7.341 7.868 7.609 7.456 7.541 7.586 7.453 7.575
7.600 7.457 7.598 8.162 7.894 7.605 7.585 7.648 8.069 7.681
8.143 7.698 8.145 8.102 8.109 7.970 7.944 7.853 7.863 7.675
8.293 7.898 7.737 7.843 7.772 7.487 7.593 7.550 7.462 7.642
8.649 8.046 7.834 7.706 7.521 7.434 7.700 7.460 7.458 7.325
7.341 7.393 7.393 7.431 7.583 7.259 7.425 7.170 7.410 7.849
7.652 7.560 7.409 7.622 7.516 7.601
Table 2.1: Times Per Mile in Minutes for runs of about 4 miles. There are 236 days
represented.
10
30
Frequency
20
10
0
7.0 7.5 8.0 8.5 9.0

Minutes per mile
Figure 2.1: A histogram of run rates. The y-axis shows the frequency that a rate
from each bin on the x-axis occurred. For example there were 11 runs out of the
236 total, with a rate between 7.2 and 7.3 minutes per mile.
played by the data on a single variable. One way of displaying the distribution is to
consider a histogram of the data (figure 2.1). On the x-axis of the histogram, the
data are collected into bins. 2 In this case the bins have width 1/10 of a minute per
mile. The y-axis of the histogram shows the number of times that each bin occurs.
For example, there were 11 runs with rates between 7.2 and 7.3 minutes per mile.
From the histogram we see that during this period, I typically ran about 7.5 - 7.6
minutes/mile, although on any given day the rate could vary from around 7.0 to 9.0
minutes/mile. The lack of symmetry is quite apparent in a histogram. Does the
lack of symmetry for these data “make sense” to you? Why might there be some
extremely slow days, but not as many extremely fast days?
2
The choice of the number of bins is left up to the user – or the computer program. Usually it’s
taken to be between 5 and 20 unless the data set is very large. Often the end points of the bins are
chosen to be “pretty” that is, with numbers that can be expressed with few decimals.
For data sets that aren’t too large, we can squeeze more out of the histogram by
adding a little additional information. Look at the stem and leaf plot of the same
data in figure 2.2. To understand this display, first turn it sideways and notice that
it resembles figure 2.1, but with numbers replacing the bars. 3 To understand these
added numbers, look at the second line of figure 2.2:
72 : 1233567788
The “stem”, 72, is followed by 10 different “leaves”. The stem and leaves together
represent the numbers 7.21, 7.22, 7.23, 7.23, 7.25, 7.26, 7.27, 7.27, 7.28 and 7.28.
How did I know that 72:1 meant 7.21 and not 72.1 or 721 or 0.721? Because of
the line at the top of the plot that says: “Decimal point is 1 place to the left of the
colon”.
The advantage of the stem and leaf plot over the histogram is that not only do
we now know, as before, that 10 runs were between 7.2 and 7.3, but we now know
what the values (to one more decimal place) actually are. The stem and leaf plot
takes a little getting used to, but is a very useful display of a batch of data.
Another graphical summary of these data is shown in figure 2.3. This display,
called a box and whisker plot, (or just a box plot) is based on a five number summary
of the data: the lowest value, the lower quartile (the point in the data for which
25% of the values lie below), the median (the point for which half of the data lie
below, half above), the upper quartile and the highest value. Notice that the “box”
includes the central 50% of the data, between the two quartiles, from about 7.5 to
7.8 minutes/mile. The relative closeness of the two quartiles and the extremes to
the median gives an idea of the symmetry (or lack of symmetry) of the batch. The
boxplot also identifies outliers by separating out any point lying more than 1.5 box
widths from either edge of the box 4 . So, the “whisker” of the plot goes out to the
largest (and smallest) values lying within 1.5 box widths from the box edges. Here,
3
Some computer programs display this plot sideways, that is, in the same direction as a his-
togram.
4
Some plots use another symbol for “far outliers” that are more than 3 box widths away.
12
Decimal point is 1 place to the left of the colon
71 : 7
72 : 1233567788
73 : 01222222244455577777889999
74 : 0000111112223333334556666666778999
75 : 000111122333344444566677778889999999
76 : 000000112222222334455566677778889
77 : 0000112244445778889
78 : 000011223344444555555667779
79 : 001112223444556799
80 : 11233345567789
81 : 0144467
82 : 9
83 : 0014
84 :
85 : 1346
High: 8.649 8.815
Figure 2.2: A stem and leaf diagram of the run data. Each line represents a tenth
of a minute, with the leaves on the right representing the hundredths. The first line
represents 7.17.
8.5
Minutes per Miles
8.0
7.5
Figure 2.3: A box plot of the run times. The black dot in the middle of the box
is the median, while the upper and lower edges of the box are the upper and lower
quartiles of the data. A whisker is drawn out to the largest and smallest values that
lie within 1.5 box heights from the edge of the box. All other points are indicated
as outliers.
14
8.5
Minutes per Miles
8.0
7.5
1 2 3 4 5 6 7 8 9 10 11 12
Month
Figure 2.4: A box plot of run times by month. Notice that both the changes in both
level and the variation of times by month are apparent.
the whisker goes out to about 8.3 and identifies a handful of values from 8.5 to near
9.0 as outliers. As we mentioned before, there are no low outliers.
2.2.2 Displaying more than one batch

The boxplot really starts to shine when we look at more than one batch of numbers
at the same time. For example, let’s look at the run data again, this time split by the
month the run took place (figure 2.4). While an overall pattern may not be obvious,
we do see some month to month differences. Why might January (month 1) be a
slow month, and July (month 7) be a fast one?
Those of you familiar with quality control charts know that to monitor a pro-
duction process one traditionally charts successive averages of samples taken from
the process on one graph and the range of the samples on another. This gives the
classic “X-bar” and “R” charts of quality control. As an alternative, by plotting
the whole batch against time in a series of box plots, you combine the information
contained in both the other two plots. This makes it easy to see changes in either
the level or the spread of the process. For more information on the use of the box
plot in a production setting see the article by Hoadley ([?]).
2.3 Numerically Summarizing Data

So far, we have not asked any specific questions of the data. Rather, we have been
content to plot the data and look at interesting features. But such a process usu-
ally generates questions, even some that might not have occurred to the investigator
before looking at the data. While graphics can answer questions about the shape,
location, and other interesting features of the data set, we often want to make infer-
ences as well, generalizing from the data set at hand to a larger group or population.
Making such generalizations is often the central goal of a data analysis.
To get specific, let’s go back to my runtimes in table 2.1.
We’ll quickly review the basics of statistical inference by looking at the output
from an analysis of these data by a typical Statistics program, and then spend the
rest of the chapter exploring each of these topics in more detail. The reader should
feel at liberty to pick and choose among these subjects according to interest.
An output of the analysis of the run data is shown in figure 2.5 Graphics first.
From the two plots we see that the distribution of heights is fairly symmetric, with
a central value of about 7.7 minutes/mile. We also see that the range of the data is
from 7.17 to 8.82 minutes/mile. There are several outliers, as evidenced by the box
plot.
2.3.1 Location – the mean and the median

Now, let’s summarize the data set numerically. What is a typical run rate? Two an-
swers are: the sample mean is 7.68, while the median is 7.62. We have already been
introduced to the median via the boxplot. While the median is the “middle value”
of the data, the value that splits the area of the histogram in half, the sample mean
(or average), is the balancing point of the data histogram, obtained by averaging all
16
8.5
Minutes per Miles
8.0
7.5
30
20
Frequency
10
0
7.0 7.5 8.0 8.5 9.0
Minutes per mile
data: runtimes
sample estimates:
mean of x: 7.6778
median of x: 7.6215
standard deviation of x: 0.2931
Figure 2.5: Output from the run data. A slightly different representation of the box
plot is shown here where the median is indicated by the white bar in the middle of
the box
the data values. Symbolically, we write the average 5 as follows :
where etc. represent the individual data values, and denotes the size of
the sample. Sometimes we use the summation shorthand 6 for the expression above:
Which is a better summary of the data set, the mean, or the median of a sample?
Or in other words, which is a better description of a “typical” observation? The
answer is that both are useful. If the data are symmetric, the two summaries will be
equal. Knowing that they are very different tells me that that the distribution is far
from symmetric. For example, if the average income on my block is $100,000, but
the median income is $40,000, this tells me something interesting about the income
distribution. It says, for one thing, that there is likely to be at least one outlier
(possibly several) who have quite large incomes.
For the run rates, the median and mean are close, indicating the data are not far
from symmetric. The fact that the mean is a little larger indicates that there is a
slight tendency for the data to be skewed to the right.
2.3.2 Spread – the standard deviation

Usually, after the location of the dataset, the second most important characteristic is
its spread. We use the spread to tell how far, on the average, an observation is likely
to be from the center. A simple way of measuring the spread is the difference be-
tween the upper and lower quartiles – the box width. This is called the interquartile
range (IQR).
5
we use the terms mean and average interchangeable, although when speaking of a sample, we
will usually prefer to use the term average
6
Some people say that statisticians tend to like to use Greek symbols whenever there is danger
of comprehension, that it is an effective way to maintain job security. But actually, it’s just to save
space. We’ll try to keep the summations and subscripts to a minimum in this book, but sometimes
they will prove to be much better than the alternative.
18
However, a more common measure of the spread is what is known as the stan-
dard deviation. For a histogram with a Normal distribution shape (the bell shaped
curve), this quantity has a precise meaning. Since, as we will see, the Normal plays
a very important role, we usually use the standard deviation to measure spread. You
may recall from an elementary Statistics class that the sample standard deviation is
found by computing:
(2.1)
We’ll talk more about this later – and in particular we’ll even talk about why the
mysterious in the denominator, but for now we’ll just leave it that is a
measure of how spread out the histogram is.
Why do we need the standard deviation? Because, not only does it measure how
spread out the distribution is, but it serves as a yardstick for comparing individuals.
This is especially true for data sets that roughly follow the Normal distribution.
The standard deviation gives us a way of saying how close observations are to each
other. Let’s use heights of students in my Stat 101 class as an example. Suppose
the mean height is inches, and the sd is inches. For any two individuals, we
can use the standard deviation to talk about how big a difference there is between
their measurements. We can think roughly of one standard deviation as being a
substantial difference, two standard deviations as being a very large difference and
three standard deviations as a huge difference. For the students, this makes sense:
four inches is indeed a substantial difference in height, eight inches is very large and
a foot is huge. With the average height at , someone six feet ( inches) tall is
one standard deviation taller than the average. A basketball player ( ) tall is
three standard deviations taller than the average. When we express the observations
in standard deviation units like this, we are using standard scores. Or we say that
we have standardized the observations. Technically, we standardize by subtracting
the average and then dividing by the standard deviation:
(2.2)
The new observations, the , are standardized observations whose values tell how
far from the average the observation is in standard deviation units (positive values
are larger than the mean, negative values are smaller). The basketball player of
has a standardized score of:
which tells us that he is standard deviations higher than the average.
2.4 Samples and sampling

One of the most important concepts in Statistics is the idea of sampling and infer-
ence. To make things concrete, consider my Stat101 class again. The average of
these students is . What inference can I make from this? Is it true that the
average height of ALL college sophomores is ? Is the average height of all
adults ? Is the average height of all college sophomores near ? Which
statement (if any) are you comfortable with?
In making the leap from sample to population, we first have to decide what
group or population our sample represents. In fact, it is better to turn this question
around. Given a population about which we want to say something, we have to de-
cide how to pick a representative sample. If we are interested in all college students,
is my sample consisting of my Stat 101 class representative? Does my sample give
more preference to students of a certain height? It would if my institution preferred
students of a certain height, or if taking statistics was somehow related to height,
which would certainly be the case if more men took Statistics than women. Making
the sample random, that is, giving every member of the population an equal chance
of being picked, makes the sample representative. Think of other ways to find a
sample of 100 college students: picking your friends, the students in the first row,
or the last row, etc. Can you think of ways these might be biased? (The general
problem of sampling is fascinating, but lies outside the scope of this book. For a
nice introduction see [?] or [?].)
Once we have been successful in obtaining a random sample, we are ready to
make the jump in inference from sample to population. Clearly, the entire popu-
20
lation will not have exactly the same average height as my sample. Why not? Put
very simply, because
OBSERVATIONS VARY!!
If not for this fact, I would be unemployed. Imagine what a wonderful(?) world
it would be if every height were the same, every measurement precise and exactly
the same. In this world, every experiment would consist of just one observation
which would exactly reproduce the value for the entire population. Think how easy
a marketing poll would be – just call up one person and ask which brand she prefers.
Her answer would then be THE answer! Well, fortunately, (at least for me), such
is not the case. Every sample is different. In a population whose average height is,
say, exactly 68.578 in., each sample will have its own average which will probably
be near 68.578 in., but not exactly (your mileage may vary!!). Just how near it
is to the population mean depends of many things, among which are how variable
heights are, who happens to be in my sample, and how many people comprise our
sample. Our goal in Statistics is to quantify how close the sample average might be
to the true population mean.
In order to distinguish the population and sample averages, we use two very
different symbols. We have already seen that the sample average is denoted by
. This number is a statistic, that is, a quantity based on the data that changes
from sample to sample. By contrast, the population mean is fixed characteristic
of the population, a constant (albeit unknown). These unknown features of the
population are called parameters. To emphasize the difference between statistics
and parameters, statisticians often use Greek letters for the parameters and Latin
letters for the statistics that estimate them. We denote the population mean by the
Greek letter for m (mean) which is . Similarly, we use the Greek letter s, to
denote the population standard deviation (the parameter) and to distinguish it from
the estimate (the statistic) that varies from sample to sample.
2.4.1 Why , and what’s a degree of freedom?

Armed with this, for those who care, we can now talk about that damned
in the denominator. (For those who don’t, just skip this subsection). Why ?
The simple answer for why we divide by instead of in the sample standard
deviation is that this makes the estimate equal to on the average, which makes
unbiased. But why? It might seemm natural to take the square root of the
average of the squared differences from :
(2.3)
But remember, what we really want is the average squared difference from the
true, population mean . But, we don’t know , so we have to estimate it – and we
use in its place. And this is the problem. For each sample, the observations are
closer to their own average than they are to . This makes each term a little
too small. Remember, we should be using , and this in turn makes the standard
deviation too small unless we compensate. Suppose we took a sample of people
from a population with mean IQ (say all adults in the U.S.), what would be?
Well, we don’t know. It will probably be near , but not exactly. Suppose the
average of the 50, is . The observations vary less around than they
do around , the mean of all adults. Why? Because the average of a sample
is exactly the number that the observations are closest to. This makes the sum in
equation 2.3 too small. It turns out that we should adjust the sum in equation 2.3
by the ratio . If we do this, this replaces the in equation 2.3 by , giving
the usual estimate of in equation 2.1. Unfortunately, the engineers who design
calculators apparently didn’t understand this and gave people the choice of using
or which has caused headaches for Statisticians ever since.
To add some more language that we’ll need later on, we say that the process
of estimating has cost us one degree of freedom. Consider the IQ’s from my
sample. Without looking at them, we have no way of knowing what specific values
they take. We describe this by saying that the data set has degrees of freedom.
But suppose I tell you that the average is . Then, you’d only have to know
22
of the IQ’s. Why? Because I could figure out last one, by knowing all have to
average to . So we say that there are only degrees of freedom left in the
residuals after I’ve taken out the average. What I’ve done is split the degrees of
freedom into for the average and for the residuals.
2.4.2 Sampling distribution

Our goal is to find the population mean, but typically it is took expensive, or im-
possible to look at the whole population. Even the U.S. census is only a sample
– and not even a random one at that! But for now, let’s just imagine that we do
have the data on some characteristic for everyone in some population. What shape
does this histogram have? Well, that depends on the data collected, of course. A
histogram can have any shape. Many are symmetric, but certainly not all. Many
have one mode (or bump in the histogram) while some are multimodal. Some
have a scattering of values lying far away from the rest. Some follow the famous
Normal, or Gaussian curve. But, the distribution can be just about any shape imag-
inable. Michael Johnson won both the and meter gold medals in Atlanta
in 1996. He won the 200 by running it in seconds which bettered the previ-
ous world record by seconds. Let’s suppose we had the times of all the
meter sprints that Michael ran in 1996. Would you expect the histogram to be bell-
shaped? Would you expect it even to be symmetric? Would there be as many values
to the left as to the right of the mean? What value would you find at the far left of
the histogram?
Each sample’s histogram reproduces the population histogram more or less
faithfully. As the number of observations in the sample gets larger and larger, it
reproduces it more and more closely. By the same token, the average, of the sam-
ple will be closer, on the average, to the population mean, , the larger the sample
size is. (This fact is called the Law of Large Numbers in Theoretical Statistics).
Now we’d never do the following, but suppose we took lots of different random
samples of the same size from the same population and plotted all of their averages
in a histogram. To be specific, let’s imagine I took lots of samples of size
of college students and made a histogram of their average heights. What shape
would the histogram of these averages have? It seems logical that it would have a
shape that depended on the original population histogram. But, an astounding fact,
known as the Central Limit Theorem tells us that this histogram of sample averages
will always tend to be Normally distributed. The larger the sample size, the more
Normal looking the histogram will be.
Well, that’s a nice result, but so what? What we’ll soon see is that this fact
enables us to quantify the uncertainty in our guess (the sample average) of the
population mean. That is, we’ll use it to construct the confidence interval for .
And we can do this in a way that doesn’t depend on the shape of the histogram
we started with. Since histograms of averages are approximately normal, we will
be able to use some facts known about the normal (see the subsection below) to
construct these confidence intervals.
2.4.3 Some Facts about the Normal Distribution

Since the normal distribution will play such an important role in making confidence
intervals, we want to review some important facts about it.
In the idealized Normal histogram shown in figure 2.6, notice that we have cho-
sen the center ( ) to be 0 and the standard deviation, SD, to be 1. This is the
reference, or standard, Normal. You can think of the observations here as having
already been standardized as described in equation 2.2. Notice that for the Normal,
the standard deviation, , is the distance from the center ( ) to the place on the
curve where the curve changes from being concave to convex (the point of inflec-
tion). The point of this is that for the Normal, it is easy to see the meaning of the
standard deviation. By the way, this is the only histogram shape for which this is
true. In addition, we can make some statements about where observations from this
histogram lie:
approximately 68% of all the observations lie within 1 SD of the mean

approximately 95% of all the observations lie within 2 SD’s of the mean7
approximately 99.7% of all the observations lie within 3 SD’s of the mean
approximately 1 out of a million of all the observations lie outside 5 SD’s of the mean
24
0.4
1 sd
0.3
0.2
0.1
0.0
-4 -2 0 2 4
Standard Units
Figure 2.6: The idealized normal histogram
So if somebody tells you you’re , what they really mean is that you’re one in
a million. (Your next question should be – which side?).
2.4.4 The Standard Error

Before we show why the shape of the histogram of averages tends to be Normal,
we first introduce an extremely important quantity known as the standard error.
The standard error is just the standard deviation of the average, but it’s such an
important quantity that it gets its own name. Although I don’t particular like the
term, or find it very informative, we’re stuck with it. Now, the first question might
be: how can an average have a standard deviation? Isn’t there just one average?
Well, yes, but we’re thinking here of how much an average will vary from sample
to sample. Remember the histogram of averages from many different samples that
we constructed? The standard deviation of that histogram is the standard error.
Let’s go back to the samples of students. Suppose again that students are
tall with an SD of . A typical class’ histogram might look like figure 2.7.
20
15
10
5
0
60 65 70 75
heights
Figure 2.7: Heights of 100 college students in Stat 101
Notice how spread out around the data are. How likely would it be to find
one student over tall in class? Translating to standard deviation units, this is
SD’s greater than the average, something that happens for normal data about
of the time, or about times out of . Not something that happens everyday,
certainly, but not impossible. Now, what are the chances that the avearage of the
students in my class is over ? Why does this seem so much less likely to
happen? Because the standard error is much smaller than the standard deviation!
How likely would it be for a class of students to have an average of over
? Does this seems even less likely? The standard error is related to the standard
deviation in the following simple way:
where is the size of the sample. For my class of , the standard deviation (of
the individuals) is , so the SE (of the average) is or . So, in terms of
26
standard errors is:
standard errors away. Wow! For a Normal distribution, the probability of an ob-
servation being more than 30 standard deviations away is incredibly small. There’s
probably a greater chance that all the molecules in the room you’re in will jump to
one side of the room, suffocating you. It’s just not going to happen by chance. At
least, not as long as the sample is random.
2.4.5 The Central Limit Theorem

The Central Limit Theorem says that histograms of sample averages will tend to be
roughly normally distributed as the sample size gets reasonbably large. And, this
is true no matter what the shape of the histogram of the original data themselves.
If this seems obvious or even believable to you, skip to the next section. For those
who stay, to convince you that the original histogram doesn’t influence the shape
for the histogram of the averages, I’m going to perform a dice rolling example.
For a fair die, every whole number from 1 to 6 has an equal probability (1 out of
6) of coming up. If I roll this die over and over, and collect my data, the histogram
for the first 60,000 rolls should look like the histogram in figure 2.8, with every
number coming up equally often – at least in theory. This histogram shape is called
the uniform distribution.
Now, as Yogi Berra says, “In theory, there isn’t much difference between theory
and practice. But in practice, there is.” What this means to us is that if I roll the
die only 30 times, I don’t get perfect uniform frequencies. Instead, I might get the
histogram shown in figure 2.9.
Since observations vary, I get only a (not very good in this case) approximation
to the theoretical histogram. But suppose I roll it 100,000 times instead of 30. I
won’t exactly replicate the (uniform) distribution above, but since my sample is
big, I should be closer. See figure 2.10.
Now let’s do something a bit different. This time, I’m going to roll another die
100,000 times, but instead of plotting a histogram of the values of these 100,000
10000
8000
6000
4000
2000
0
0 2 4 6
Outcome of Die
Figure 2.8: Theoretical histogram of 60,000 rolls

8
6
Frequency
42
0
0 1 2 3 4 5 6
Value of Roll
Figure 2.9: Actual histogram of 30 rolls

28
15000
10000
Frequency
5000
0
0 1 2 3 4 5 6
Value
Figure 2.10: Actual histogram of 100,000 rolls. Notice how much more like the
theoretical histogram this one appears, with 100,000 rolls instead of 30.
rolls, I’m going to collect the average of the two dice for each roll. I’ve shown the
results of the first 15 rolls in table 2.2.
Toss # Die #1 Die #2 Average of
the Two Tosses
1 2 3 2.5
2 5 3 4
3 1 5 3
4 2 4 3
5 2 4 3
6 4 4 4
7 3 5 4
8 1 3 2
9 2 4 3
10 3 4 3.5
11 5 6 5.5
12 6 4 5
13 6 5 5.5
14 2 1 1.5
15 2 1 1.5
16 ... ... ...
Table 2.2: The first 15 tosses of two dice and their average
In the last column of table 2.2 I’ve got a collection of the 100,000 averages from
rolls of two dice. (So far, so good?) What does the histogram of this collection of
averages (the last column in the table above) look like? Will it be just as flat as
the histogram of the first 100,000 rolls? Let’s think a minute before looking at the
answer. In the original sample, every number has exactly the same probability of
being chosen – that’s what the uniform distribution means. So, I’m just as likely to
get a 6 as a 3. But is every average equally likely as well? How could we get an
average of 6 (that is, a total of 12?) We’d have to have 6 on both rolls! How likely
30
15000
10000
Frequency
5000
0
0 1 2 3 4 5 6
Value
Figure 2.11: Histogram of the averages from 100,000 rolls of two dice. This distri-
bution is known, not too surprisingly, as the triangular distribution. The point here
is that the distribution of the averages looks different from the original uniform
distribution.
is this (you needn’t be precise!)? On the other hand, how could we get an average
of 3.5 (total of 7)? We could get a 3 and a 4, or a 5 and a 2 or a 6 and a 1. There
are many more scenarios producing an average near 3.5 than an average near 6. For
my sample of 100,000 the histogram of the averages looks like the one shown in
figure 2.11.
Now, what happens if we take another 100,000 rolls and make averages of size
3? First, we are even more likely to be near 3.5. This makes sense if we consider
how hard it would be to get an average of 6 now. So, the standard deviation is
smaller. Also, the shape continues to change, as we can see in figure 2.12.
If we were to keep doing this, making averages of more and more rolls, the
two trends would continue. That is, we’re more and more likely to be close to 3.5,
(the standard deviation continues to shrink), and the shape continues to look less
12000
10000
8000
Frequency
6000 4000
2000
0
0 1 2 3 4 5 6
Averages of Three Tosses
Figure 2.12: Histogram of the averages of 100,000 rolls of three dice. Notice that
the standard deviation is smaller than in figure 2.11, and how much less like the
uniform this histogram appears.
32
6000
Frequency
4000
2000
0
0 1 2 3 4 5 6
Averages of Ten Tosses
Figure 2.13: Histogram of the averages of 100,000 rolls of ten dice. What theoreti-
cal histogram does this look like?
and less like the original uniform distribution. Averages of size 10 are shown in
figure 2.13.
Can you see the pattern? For one thing, the histogram is getting narrower, be-
cause the average is more likely to be near . The standard error is getting smaller!
We knew that from the last section. What’s happening to the shape of the distribu-
tion? You guessed it – it’s the famous Normal distribution. Notice that once the
average is taken over enough observations, the original distribution (the uniform)
plays a very small role in what the histogram of the averages looked like. This same
phenomenon occurs no matter what shape the original histogram has.
Let’s review what happened to the location, the spread and shape of the averages
as the sample size got larger. First, each histogram stayed centered at the population
mean, which happened here to be . If it hadn’t, we would say that the average
is biased. (Unbiased means, that on the average, the sample statistic estimates the
parameter we want to estimate).
What about the spread? Notice that as the sample size gets larger, we tend to be
closer to the mean of . This is because the standard error is equal to the standard
deviation over the square root of the sample size. As the sample size gets larger,
the averages get closer to the true population mean. What this says is that if you
have a larger sample size, on the average you will be closer to the true population
mean than if your sample size is smaller. (The Law of Large Numbers again). If
this weren’t true, polling companies would soon go out of business.
And the shape? We see that as the sample size gets larger, the histogram of the
averages gets closer and closer to having a Normal distribution. This is the Central
Limit Theorem 8 . We will put this to very practical use in the next section.
2.5 Confidence Intervals

Let’s suppose I want to find out how much TV the average college student watches
per week. To try to answer the question, I took a random sample of students,
and found out that they spend an average of hours a week watching TV, with
a standard deviation of hours (see figure 2.14. What can I say about the TV
watching habits of the whole population of college students from this?
We’ve taken a sample, but what we’d like to do is to make an inference, that
is, some sort of statement about how much TV, on the average, everyone in our
population watches per week. Our best guess of this number is hours, but we’d
like to say more. We’d like to be able to say how precise this estimate is. We’d
like to be able to say something like: “We believe that the mean number of hours
of TV watching/week among college students is in the range to hours”.
We might also quantify the degree of belief by saying something like: “We believe
with 95% confidence that the mean number of hours of TV watching/week among
college students is in the range to hours”. By this we mean that we think
8
A bit more formally, the consequences of the Central Limit Theorem can be stated as follows:
Let be a random sample of size n from a population with mean (average) and standard
deviation . Then as n gets large, the distribution of the average of the sample will be approxi-
mately Normal with mean and standard deviation .
34
60
50
40
Frequency
30
20
10
0
0 5 10 15 20 25
Hours of TV watched per week
Figure 2.14: The histogram of my sample of 200 college students’ number of hours
spent watching TV. Notice the asymmetry.
600
500
400
Frequency
300
200
100
0
2 3 4 5 6
Number of hours watched per week
Figure 2.15: A histogram of the averages from different random samples of

size . What happened to the asymmetry?
there’s a chance that that interval contains the true mean. This is called a
confidence interval for the population mean.
To understand the mechanism of the confidence interval, let’s turn the problem
around and suppose that we know the answer. That is, suppose we known that the
mean number of TV hours that all college students watch is really hours. While
we can’t see the distribution for all students, we can suspect that it will be skewed
to the right since there will likely be some students who will watch much more than
hours/week, but no one can watch less than .
Now, let’s imagine that we took different random samples of stu-
dents. What would the histogram of all these averages look like? Will it be as
asymmetric as the original distribution, or will the averaging process make it more
symmetric? Too see what it might look like, I’ve simulated such samples
and shown the histogram of the averages below in figure 2.15. Notice where our
average of hours lies. Not far from the true mean of hours. The Central
36
Limit Theorem says that the shape of this histogram will be approximately Normal,
even though the individuals observations are not. Notice that the averaging of
students has practically eliminated the asymmetry. This histogram, on the average,
will be centered at the population mean of hours (because the average is unbi-
ased). The standard deviation of this histogram, the standard error, depends on the
sample size: the larger the sample, the smaller the spread. Here, the SE is about
hours.
Now, on to the confidence interval. We know that the histogram of the averages
is approximately Normal (the Central Limit Theorem). We also know that for any
Normal (or Gaussian) random variable, approximately 95% of the observations fall
within 2 standard deviations of the mean. So, applying this to our case, about 95%
of all random samples of college students will have their average, , hours of TV
watching within two standard errors ( ) of the mean, , or from to hours.
(Check the histogram).
So, the average from a single random sample, , has about a 95% chance to be
within 2 SE’s of . Now, we use this fact to turn the problem around. Starting with
our sample average of , let’s put a SE window around this average,
Now, how often will this kind of interval contain the population mean? It fails only
for the of the samples whose averages fall farther than standard errors from .
In other words, it will work of the time. That’s why we can say that with
confidence, the population mean is within our confidence interval. Of course,
of the time we will be unlucky, and the population mean won’t be in our interval.
If this is too often for you to be comfortable with, you can always take a higher
probability confidence interval. Of course, it will be wider! And of course, we’ll
probably never know if we happen to be in the or in the unlucky of the
samples. But being human, we never think that the happen to us! But, trust me,
it will – of the time. To protect himself against being in the unlucky , an
engineer I knew at Hewlett-Packard decided to have a t-shirt made up for himself
which read, “Rare events don’t happen to me”. You can take whatever precautions
you like to try to ensure that your confidence interval contains the mean, but the
bottom line is that a confidence interval will contain the population mean
of the time on the average, no more and no less.
2.5.1 Estimating the standard error

The last wrinkle in finding the confidence interval involves the problem of estimat-
ing . The confidence interval for the mean involves , through the standard error:
If we just use as our estimate of , (and using the more precise instead of ),
the 95% confidence interval formula would be:
(2.4)
Well, there’s a problem. Equation 2.4 is incorrect. Using instead of has

added some extra uncertainty into the equation, reducing the confidence level. To
correct it, we need to adjust the in the equation upwards. This isn’t a huge deal,
and statistical software will take care of this for you, but it’s an historical sticky
wicket in the teaching of Statistics. To get an idea of the size of the adjustment,
here are two signposts. For a sample of size , the number actually should be
instead of at confidence, which is the approximate number I’ve used up to
now anyway. But for a small sample of size , where there’s a lot more uncertainty
in estimating , the changes to , making the confidence interval more than
wider.
This adjustment was first discovered by a quality control engineer named Gos-
sett, who worked for the Guiness brewing company in Dublin in the early 1900’s.
Due to the nature of his work, testing the quality of Stout, he needed very small
sample sizes. But using these small samples to estimate , he noticed, by the fact
that he was rejecting too many good batches of Guinness Stout, that the confidence
interval in equation 2.4 was too narrow. Not only did he realize that the intervals
were too narrow, but he went on to figure out exactly what the adjustment should be
(no small feat – especially at the end of one of his days!). It turns out that instead of
38
basing the number of standard errors ( at ) on the Normal distribution, we

should replace the Normal by a family of distributions that get wider as the sam-
ple size gets smaller. Gossett was not allowed to publish this result because of a
company policy forbidding employees to publish. To get around this, he used the
pseudonym Student. The family of distributions that he discovered to replace the
Normal distribution has been known as Student’s t-distribution since. Computer
programs automatically calculate this and use it when reporting confidence inter-
vals. The family of distributions is referenced by its degrees of freedom which is
(the same used to estimate !).
I hope you’ll never need the formula below, but it’s here just in case. We’ll state
it in generality to relieve us from having to always make confidence intervals as
well. If we let denote the probability that the confidence interval does not contain
the mean, (so that 95% confidence means ), we write the confidence interval
as follows:
If you trust in computer programs, you won’t have to be bothered with remembering
equation 2.5.1, but realize simply that for samples sizes below , the confidence
interval will be slightly bigger than standard errors on each side of the average.
Above , it’s about . For a quick and dirty confidence interval, just use for the
multiple as long as the sample size is at least .
So, finally, let’s redo the confidence interval for our sample of TV watchers
with the adjustment. Since the average was hours and the sample standard
deviation ,a confidence interval for the true population mean of all college
students is:
which, as we can see, is not terrible different from before.

2.6 Hypothesis Tests

Rather than give a confidence interval for the mean, we could ask a specific question
about the population mean. For example, is the mean at least , or is the mean ?
To such a question, we might demand a yes or no answer, although such a black or
white answer might be dangerous. Often we’ll want to go back to the confidence
interval as well, to see how close to the cliff we are. As an example, suppose we
are concerned about the fuel efficiency of the car we recently bought. The first
measurements of the miles per gallon of a 1989 Nissan Maxima are shown in
table 2.3. For simplicity, we’ll assume that the same number of miles were driven
before each fill up, although this is only approximately true.
A stem and leaf diagram of the data is shown in figure 2.16. The average is
mpg with standard deviation of mpg. Using the statistic adjustment,
a confidence interval for the “true” mileage of the car is
or Now, suppose the manufacturer claimed that the mileage of

the car in “normal” driving conditions is at least mpg. What do you think of this
claim? One way to answer the question is to look at the confidence interval and
see that is simply an unreasonable number for the mean, given the confidence
interval of .
The hypothesis test approach turns this question around. Rather than look at
the confidence interval and judge whether a hypothesized value for the mean seems
reasonable, it assumes the hypothesized value for the mean and then computes how
likely it is to get the data we got. More specifically, we calculate the probability of
getting as far or farther from the hypothesized mean as our data. If this probability
is very small (how small is up to you, the judge), then this is viewed as evidence
against the hypothesis.
To see how this works, let’s give the benefit of the doubt to the manufacturer
and hypothesize that the true mean is mpg. Let’s also assume that the standard
40
21.964 23.694 18.824 20.851 26.370

22.810 25.785 24.353 23.385 23.381
28.175 21.232 25.603 21.064 22.067
23.867 24.226 19.512 22.147 19.948
22.650 21.197 20.748 20.351 26.540
25.385 21.613 21.400 22.035 21.810
23.806 23.678 26.462 21.514 19.700
18.362 25.403 23.158 22.091 23.768
20.201 21.111 20.573 27.434 20.256
22.687 25.206 21.169 22.747 24.000
19.375 26.113 20.625 21.511 21.441
21.980 21.667 20.260 19.125 23.652
18.693 18.374 24.885 21.405 21.806
25.111 20.903 19.343 23.235 25.088
20.900 28.037 19.789 22.576 21.374
26.894 20.139 21.399 16.767 14.684
21.739 23.803 20.563 25.188 25.794
24.122 23.356 24.632 23.407 22.958
20.815 20.762 21.194 21.667 23.979
24.514 24.500 22.344 22.847 19.551
Table 2.3: Miles per gallon recorded for 100 fill ups from a 1989 Nissan Maxima.
N = 100 Median = 22.05084

Quartiles = 20.83294, 23.98929
Decimal point is at the colon

14 : 7
15 :
16 : 8
17 :
18 : 4478
19 : 13456789
20 : 12334666788999
21 : 1122224444455677788
22 : 0001113677788
23 : 02244447778889
24 : 001245569
25 : 112244688
26 : 14559
27 : 4
28 : 02
Figure 2.16: A stem and leaf diagram of the Nissan Maxima data.
42
1.5
1.0
0.5
0.0
22 23 24 25 26
Miles per Gallon
Figure 2.17: A histogram of the sample averages of size assuming that the mean
is mpg. Notice where our average is!
deviation is . (We need to make some assumption about the sd). Then averages
of size from this population will be approximately normal with mean and
standard error . Let’s look at our average or How “far”
is this average from the hypothesized mean of ? Let’s first look at a picture.
Figure 2.17 shows the histogram of averages for a mean of mpg with a sample
size of . Where is our average on this picture? Does it look like a typical value?
Clearly not! In fact, you probably don’t have to calculate the p-value here. It passes
what I call the “ocular trauma test of statistical significance”9 . Just to make things
more precise, we’ll go ahead and calculate the p-value. (It’s often said that you use
9
Translation - It hits you right between the eyes!
pictures to convince your friends and p-values to convince your enemies).

Since a standard error is , and , this means is
over standard errors smaller than the mean! More exactly, it’s
SE’s below . Now, of course, any sample average is theoretically possible with
a mean of and an SE of , but is not a very likely average to get. In
fact, the chance of getting an observation standard deviations smaller than
the mean is less than . (I looked it up). This is
small! We’re back to all the molecules in the room jumping to one corner again.
What do we make of this? There are only two possibilities. One, we just got
something that’s nearly impossible by chance. (We were really, really unlucky).
Or, the hypothesized value of is WRONG!. Not one to believe that I am cursed
with bad luck, I would view this as very strong evidence against the hypothesized
value of .
This illustrates the basic idea behind hypothesis testing. We postulate a value
for the mean. Let’s call it . This hypothesis is known as the null hypothesis. The
alternative to this is that it’s not . (And in that clever way of naming things that
Statisticians have, this is known as the alternative hypothesis).
Then we collect our evidence – the sample average. We then compute how
many standard errors away from our is and call this :
(2.5)
Why do we call it ? Because, if the null hypothesis is true, this quantity should fit
the distribution and be in the range of about to . (The exact number depends
on the degrees of freedom and the confidence level). But, just think of this value
as the number of standard errors away from the hypothesized mean. By using the
distribution as a reference, we can now compute how likely it is to get a value this
far away (or more) from the hypothesized value. This probability is called the p-
value. If the p-value is small, this means that the probability is small of getting that
far away from the hypothesized mean, and in this case, we view this as evidence
44
against the hypothesized value of . If the p-value is not particularly small, then
our average is not untypical given the hypothesized mean. It’s consistent with the
hypothesis and thus provides no evidence against it.
Let’s try another value. How about for the hypothesized mean value? If
, then our is only
standard errors lower than the mean. Values less than SE below the mean happen
all the time and so this is a perfectly reasonable average, given that the true mean
is . The p-value is this case is , which means that of the time,
just by chance, you’ll get a value this far below the mean. That’s certainly not
much evidence against the null hypothesis. The histogram in figure 2.18 shows
how close to the hypothesized value we are now.
So, we’ve seen that is an absurd value for the mean, and is reasonable.
As a last example, let’s try mpg as the mean. Now our average is:
standard errors too low. The probability of this happening by chance is .

Hmmm, this isn’t quite as clear. See figure 2.19.
If the probability of our data occuring given some hypothesized value for the
mean is , it’s pretty clear that hypothesized value is ridiculous. By contrast, if
the mean were , data like ours could occur with a probability of . This is
clearly a very reasonable occurrence, and the p-value of provides NO ,or at
least very little, evidence against the null hypothesis. But what about our p-value
of . It says that given a mean of , averages as far away or farther than
ours are somewhat rare, but not impossible. YOU have to decide what a reasonable
amount of evidence is in your problem. The real business or scientific decision will
involve costs of the action and costs of making a mistake not just the p-value. Many
people make the mistake of just looking at the p-value to make the decision. Even
worse, often people look for hard and fast rules about how big a p-value should be.
Traditionally, p-values of over are treated as providing NO reliable evidence
1.5
1.0
0.5
0.0
21.5 22.0 22.5 23.0 23.5
Miles per gallon
is mpg. Notice how typical looking our average is now.
46
1.5
1.0
0.5
0.0
22.0 22.5 23.0 23.5 24.0
Miles per gallon
is mpg. How typical is our average now?
against the null hypothesis. Values between and provide weak evidence
against the null hypothesis. Between and – reasonable evidence and
smaller than , strong evidence. Some disciplines sanctify these guidelines by
labeling them as in table 2.4. This tends to give these somewhat arbitrary values
far too much credibility and importance. It is important to realize that the p-value
should be viewed as one piece of information in the decision making process, and
not the final say.
p-value Degree of evidence against the Null Hypothesis

None
* Weak Evidence
** Evidence
*** Strong Evidence
Table 2.4: By relegating p-values to one, two or three stars, some disciplines have
created the “Michelin guide” to statistical significance.
P-values are often misunderstood and abused. Let’s review them and define
them a little more formally. An important concept in hypothesis testing is that the
only way to prove something is to assume the opposite and to have the data disprove
that. It’s similar to the role of the district attorney in a court case. In order to prove
guilt (our objective), we assume innocence and let the evidence (the data!) disprove
it. And, of course, as we know all too well these days, failing to establish guilt does
not prove innocence either. It merely fails to convict.
Thinking of ourselves as the district attourney, we put the hypothesis that we’d
like to reject as the null hypothesis. Many real world decisions are based on a
comparison with a current standard, or current accepted value. We hypothesize this
value in our null hypothesis, and see if the data enable us to reject it. In the case
of hypothesizing mpg as our null hypothesis, we are giving benefit of the doubt
to the manufacturer who has claimed that the car delivers mpg, and putting the
burden of proof on the data to reject this claim. In this case, we would write:
48
Conclusion: Reject with a p-value of .

Here the conclusion was clear. An average of was so inconsistent with
that it was soundly rejected.
On the other hand, it is worth stating again that failing to reject the null hy-
pothesis does not prove it! Failing to reject the null hypotheses simply says that
the evidence (the data) are consistent with the null hypothesis. But, they may be
consistent with many other hypotheses and explanations as well. This is an often
overlooked or misunderstood point, but here’s a silly example to convince you that
failing to reject is not proof that it’s true. Suppose I want to prove De Veaux’s
Law which states: “The sun and the stars revolve around De Veaux”. Now, accord-
ing to De Veaux’s Law (as long as he stays on the planet), from anywhere on Earth,
the Sun will appear to rise in the East and set in the West. I want you to collect data
for the next two months and see if these data verify De Veaux’s Law. Two months
from now, you won’t be able to reject this hypothesis, because the data are consis-
tent with it. They “fail to reject the Null Hypothesis”. Of course, they don’t prove it
either. There are many other theories consistent with these data as well. That’s why
we are careful to choose the null hypothesis in the hope that the data will reject it –
otherwise we are left with inconclusive results.
Now, if the data do seem inconsistent with the null hypothesis, there is always
a question of the degree to which they are inconsistent. This is where the p-value
comes in. We always assume the null hypothesis is true to start. We next collect
data and from these data derive a statistic (like the sample average). The p-value
gives the probability of this statistic’s occurrence given the null hypothesis is true.
If this value is small enough, this is viewed as evidence against the null hypothesis.
The degree to which you need to be convinced is context dependent. If there is
no number small enough to crack your faith in the null hypothesis, then there is
no point in collecting the data!! On the other hand, there is NO specific p-value
that rejects the null hypothesis in all cases. (It certainly isn’t 0.05 as claimed by
table 2.4!!!). In the courts, the correct level is left deliberately vague: “beyond a
reasonable doubt”. In practice, for decision making, how small the p-value must be
depends on the beliefs of the decision maker(s). It should be part of the decision
making process, but NOT the sole determinant. The confidence interval for the
quantity of interest should be examined, as well, with considerations of all the costs
of decisions involved.
2.6.1 One sided versus two sided p-values

In spite of the fact that people place so much (probably way too much) value in the
actual size of the p-value, determining it precisely can sometimes be important. In
many contexts, like our car example, we are interested only in values on one side of
the null hypothesis. We would be unlikely to complain to the dealer if the mileage
were too high! So, we would probably compute the p-value based only on being
too far to the left of the mean (too small), not being too far on both sides. For this
reason, software packages typically report three possible p-values:
The probability of getting a value larger than your statistic
The probability of getting a value smaller than your statistic
The probability of getting a value larger or smaller than your statistic – that
is, farther away from the null hypothesis than your statistic
Which of these three numbers is the appropriate one depends on the context.
The output for our example with a hypothesized mean of mpg is shown in ta-
ble 2.5.
The “Test Statistic” tells us that our estimate, , is standard errors to
the left of the hypothesized mean of . The first p-value tells us that the probability
of being more than standard errors from the mean is . The last one tells
us that the probability of being more than standard errors to the left of the
mean is half of this, or . The middle value is just the probability of being
greater than standard errors to the left of the mean. It doesn’t mean much in
this case. In fact, the last two numbers will always add to , since together they
give the probability of being to the left and right of our value. Which is appropriate
is determined by the problem definition. Since in our case, we would take action
(take the car back, get it tuned, sue the manufacturer) only if we got significantly
50
Test: Mean = Value

Hypothesized Value: 23
Actual Estimate 22.412
t Test
Test Statistic -2.402849
Prob 0.0181309
Prob 0.9909346
Prob 0.0090654
Table 2.5: P-values from a test of the Hypothesis that the mean mileage is .
Notice that the p-value depends on whether the test is one or two sided and which
direction one looks.
less than , the appropriate p-value is the last one, which tells us that our car got
something that should occur about times out of .
2.6.2 P-values and real decisions

The p-value we just got, is a “two star” p-value, less than but more than
. What should we do? That answer should depend on the cost of the action –
NOT on the p-value!!
This is an important and misunderstood point. It is crucial to realize that statis-
tical significance just tells how many standard errors away from the hypothesized
mean our data happen to be. It doesn’t say anything directly about a business or
financial decision. I’ve seen “statistical significance” trumpeted at meetings and
used as a weapon against anyone doubting the wisdom presenter too many times.
Suppose a new drug is found to be “statistically significantly” more effective than
the currently produced drug. Does this mean we should replace the old drug with
the new? What are the other issues? Among others are cost of the new drug, its side
effects and the actual size of the increase in efficacy. Being 10 standard errors better
doesn’t always translate into a huge real increase. If the sample size is very large,
the average may be statistically far from the hypothesize mean, statistically very
significant, but still financially and scientifically insignificant. Let me repeat – just
because data show statistical significance does not mean that they also imply finan-
cial or scientific significance. If a new engine delivers “statistically significantly”
more mileage than the old, is it necessarily better? (What about cost? – how much
more mileage does it deliver?) If a new credit card promotion delivers “statistically
significantly” more customers is it better? How much does it cost? What type of
customers does it bring in, how much higher is the response rate? These questions
always need to be asked when talking about business or scientific decisions.
In fact, when talking about real decisions, it is often convenient and practical to
use the confidence interval in place of, or in addition to a hypothesis test. Let’s use
a credit card response rate as an example. Suppose the current product delivers 4%
response. In a controlled design (we’ll talk much more about how to do this in later
chapters), a new product delivered a statistically significantly higher response rate,
with a p-value of .00001. Even if this raw response rate is what I’m interested in, I
should be unwilling to make the business decision to adopt the new design without
more information. What I would like to see, is a confidence interval for the profit
or revenue generated by the new product. Suppose the confidence interval for the
response rate of new product is (4.2% – 5.1%). Then, statistically it is fairly clearly
better than the 4% delivered by the old product. But, I still need to put cost into
the equation. The new design may be more expensive and/or may attract customers
whose behaviors are more costly. What we need to do is to translate the response
rate confidence interval into a profit confidence interval. Once that is done, I am in a
much better position to make a business decision. I almost always prefer confidence
intervals to hypothesis tests for the base of a business or scientific decision process.
One more point about p-values. There is a common misconception about what
the probability means. Let’s look at a polling example. Suppose I randomly sample
adults in the U.S. Let’s assume that the there are males and females in
the population (the number of females is actually a little over , but let’s ignore
this). I’ll take a sample of adults. Suppose I get females in the sample. If I
test the hypothesis that the mean proportion is , I get the results in table 2.6.
We get a p-value (two sided) of . What does this mean, exactly? Does it
52
Test mean = value

Hypothesized value
Actual estimate
t-test
Test Statistic
Prob
Prob
Prob
Table 2.6: Hypothesis test that the proportion of females in the population is .
Notice that the hypothesis is rejected on the basis of this sample.
mean that there’s a probability of only that the null hypothesis is true? NO!!
If you thought yes – you are not alone. Most people make this mistake. In fact, it’s
the other way around. What it means is that under the null hypothesis assumption
(of equal numbers of males and females), there’s a probability of only of
getting a sample as lopsided (one way or the other – that’s why it’s two sided) as
the one we got. The p-value says nothing directly about how true the null hypothesis
is. This is one of the most misunderstood concepts in applied statistics. The p-value
is a measure of how unlikely our sample statistic is, assuming the null hypothesis to
be true. It is then used as evidence for making decisions about whether we believe
the null hypothesis to be true. Generally, the more confidence one has in the null
hypothesis before the data are collected, the smaller the p-value will have for this
individual to be to convinced that the null hypothesis is false. That’s one reason
why there’s no universally accepted p-value for which all null hypotheses should be
rejected.
Many people get mistakenly stuck on the idea that the p-value gives the proba-
bility of the null hypothesis begin true, so I’ll give a very simple example of why
we can’t switch the probabilities around. Suppose I know that if I run across In-
terstate 95 blindfolded, my chances of getting killed are . Now, if you see my
obituary in the paper, what are the chances that I died by running across interstate
95 blindfolded? Are they ? Of course not. Similarly, the p-value says nothing
about the probability that the null hypothesis is true. It’s the probability of getting
our sample given that the null hypothesis is true – the other way around.
2.6.3 Two types of error
There is one more concept in the context of hypothesis testing that we must review
before moving on to the comparison of two groups in Chapter 3. Let’s think about
testing whether our car gets at least mpg one more time. Unless I have perfect
information – in this case testing the car continually until its demise – there is
always the possibility that I will make an incorrect decision. There are two types
of errors, depending on whether or not the null hypothesis is actually true or not.
To make things concrete, let’s suppose that the null hypothesis is that the car gets
(at least) mpg. (The alternative hypothesis is that it does not). Now, suppose
the car really does get mpg, so, in fact, the null hypothesis is true, but I happen
to get a low sample average. Suppose it’s so low that I reject the null hypothesis.
If I use a cut off p-value of , how often will I make such a mistake? Well, the
probability of getting a statistic with a p-value of below is exactly so my
probability of such an error is exactly the p-value cut-off that I used. This kind of
mistake is seeing a false signal – or what we would call a false “positive” in the
context of drug or disease testing. That is, the null hypothesis was indeed true, but
I decided to reject it based on a false signal – because I saw something out of the
ordinary (low p-value). In a drug testing situation, the null hypothesis would be
that no drug is present. A false positive implies that we incorrectly rejected the null
hypothesis because, by chance, the data seemed inconsistent with it. Statisticians
have given this false signal type of error (rejecting the null hypothesis when true)
the uninformative name of Type I error. Let’s put both situations in a table (see
54
table 2.6.3:
True State of the World

True False
Decide True Type II Error – False Negative
Decide False Type I Error – False Positive
Now, let’s reverse the situation. Suppose that the null hypothesis is wrong –
the car really does have mean mileage below mpg. But this time, suppose that
the average we get from our sample is not low enough to reject the null hypothesis.
Here we failed to see the signal associated with the car’s inadequate mileage – this is
a missed signal, or a false “negative”. Can you guess what type of error statistician’s
call this? If you said Type II you’re starting to think frighteningly like a statistician
.10 .
Later we will worry about the frequency of making both Type I and Type II
errors. The probabilities associated with each of these are called and (using
Greek letters just to obscure things as usual). Fixing the value of the p-value to use
as a threshold at some value results in a probability of a Type I error of precisely
. Calculating depends on how false the null hypothesis actually is. For example,
if my car in reality gets only mpg, the chance of missing the signal (getting
a sample mileage consistent with the null hypothesis of mpg) is pretty small.
There’s a much greater chance of missing the signal though if my car actually gets
mpg. The farther away from the null hypothesis we are, the lower will be.
Calculating it can be very tricky – but it turns out it is a very important quantity
for determining an appropriate sample size for an experiment. Fortunately, many
software packages now calculate it. We’ll see lots of examples later in the book.
By changing the cut-off p-value, you can make lower at the expense of raising
10
Somewhat, and only partially facetiously, we sometimes also define a Type III error as giving
the right answer to the wrong question
, or vice versa. The most effective way of lowering both – reducing our risk
simultaneously of making either kind of mistake is to increase the sample size.

Chap 2

Uploaded by

Copyright:

Available Formats

Chap 2

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chap 2

Uploaded by

Copyright:

Available Formats

Chapter 2

What you should have learned in that

2.2 Displaying Data

2.2.1 Displaying a batch

7.0 7.5 8.0 8.5 9.0

Decimal point is 1 place to the left of the colon

High: 8.649 8.815

2.2.2 Displaying more than one batch

2.3 Numerically Summarizing Data

2.3.1 Location – the mean and the median

7.0 7.5 8.0 8.5 9.0

Minutes per mile

the data values. Symbolically, we write the average 5 as follows :

2.3.2 Spread – the standard deviation

which tells us that he is standard deviations higher than the average.

2.4 Samples and sampling

2.4.1 Why , and what’s a degree of freedom?

2.4.2 Sampling distribution

2.4.3 Some Facts about the Normal Distribution

approximately 68% of all the observations lie within 1 SD of the mean

Figure 2.6: The idealized normal histogram

2.4.4 The Standard Error

Figure 2.7: Heights of 100 college students in Stat 101

standard errors is:

2.4.5 The Central Limit Theorem

Figure 2.8: Theoretical histogram of 60,000 rolls

Figure 2.9: Actual histogram of 30 rolls

2.5 Confidence Intervals

Hours of TV watched per week

Number of hours watched per week

Figure 2.15: A histogram of the averages from different random samples of

2.5.1 Estimating the standard error

Well, there’s a problem. Equation 2.4 is incorrect. Using instead of has

basing the number of standard errors ( at ) on the Normal distribution, we

which, as we can see, is not terrible different from before.

2.6 Hypothesis Tests

or Now, suppose the manufacturer claimed that the mileage of

21.964 23.694 18.824 20.851 26.370

N = 100 Median = 22.05084

Decimal point is at the colon

Miles per Gallon

pictures to convince your friends and p-values to convince your enemies).

standard errors too low. The probability of this happening by chance is .

21.5 22.0 22.5 23.0 23.5

Miles per gallon

22.0 22.5 23.0 23.5 24.0

Miles per gallon

p-value Degree of evidence against the Null Hypothesis

Conclusion: Reject with a p-value of .

2.6.1 One sided versus two sided p-values

The probability of getting a value larger than your statistic

The probability of getting a value smaller than your statistic

Test: Mean = Value

2.6.2 P-values and real decisions

Test mean = value

2.6.3 Two types of error

True State of the World

You might also like