
LANGARA

COLLEGE

Statistics 4800

Supplementary Notes

September 2014
1.1

STEPS IN GETTING A GOOD SAMPLE


Sampling Methods and Bias

Note: This section reorganizes and expands on the material presented in the text.
Exercises for further study are provided at the end of this section.

Basic Idea in Sampling: Many people think they know how to get a "good" sample from
some larger population, but there are many ways to go wrong. As well, there is often more
than one correct method from which to choose. Here are some steps to follow in order to
get your research off to a good start.

1. The first step in solving any statistical problem is to state the problem clearly, and to
identify the population involved. That is, your research objectives must be fully
specified.

E.g. "I'd like to know the average height of students." This is not a clear question, since
we cannot determine the population it refers to. (Which population of students is
relevant here?).

2. List the variables that relate to the study's objectives. Note that, in many cases, these
variables will be translated into questions.

E.g. If the study's objective is to find the average height of all college students in the
Lower Mainland, then the key variable is clearly 'Height'. The corresponding question
to ask would be "What is your height in centimetres?"

3. When it is impossible or undesirable to do a census of the relevant population, a
sample from this population must suffice.

The sampling frame is a list of elements of the relevant population that serves as the
practical source of the sample. Ideally, the sampling frame and the population should
be identical, but when this is not possible or practical, the frame should at least be
representative of the population.

E.g. "What is the average yearly income of all adults (18 or over) living in Vancouver?"

One possible sampling frame might be the telephone directory, but how does it differ
from the population of interest here? Would a list of all eligible voters be better?

4. In choosing an appropriate sampling procedure, the two basic sources of error in
sampling must be considered:
1) Sampling Error: This error occurs naturally in all samples and results from sampling
variability. Since a sample is only part of the population, different samples will
produce different results, in general; thus, very few samples are exactly like the
population, and some error is introduced. (Note that we expect this error to
decrease as the size of the sample increases.)

2) Non-Sampling Error: Also called systematic bias, this type of error results from one
or more of the following mistakes:

a) Choosing a poor sampling frame; this can lead to a one-sided type of error
called selection bias; this bias is the result of under-representing, or even
excluding, an important part of the target population in the sample.

E.g. The population of interest is all Langara students this term. A poor sampling
frame would be all the students present in the cafeteria next Wednesday
between 11:30 am and 1:30 pm. Why?

b) Allowing inaccurate responses to contaminate the data; this is called response
bias, and can result from interviewer errors, or the use of sensitive questions,
vague questions, or leading questions; you should also remember that the
behaviour of sampled elements can change just because they are under
observation.

E.g. "Wouldn't you agree that it's about time the federal government got rid of
the G.S.T.?"

c) Not getting an answer from some of the elements originally chosen to be in
your sample; this often leads to non-response bias, since the data you did
manage to collect may not be representative of the original population you set
out to study.

E.g. A questionnaire was mailed to all the households in a certain
neighbourhood to determine the level of support for the idea of having a
neighbourhood pub in the area. Only 20% of the households responded, and
most were opposed to the idea.

d) Substituting conveniently available elements for those that are not so easy to
contact; this is called "haphazard substitution", and is NOT the answer to dealing
with non-response problems. (What is the answer?)
E.g. In the above example, using personal interviews to increase the response
rate may do little to reduce bias in the sample. Can you suggest why?

5. There are two basic ways to collect sample data:

1) NONRANDOM SAMPLING: the researcher chooses items that, in his or her opinion
(or judgment) are representative of the population; or the researcher chooses items
which happen to conveniently be available. Note that this is a very subjective
process, and thus cannot be evaluated in an objective, scientific manner.

2) RANDOM SAMPLING: elements are selected without regard to their true nature,
using an objective, probability-based technique (as described below). When a
sample is selected at random, the sampling error can be analyzed objectively (as
we will see in a later chapter); for this reason, statistical inferences are based on
random samples.

6. Some Common Sample Designs


1) Simple Random Sampling: gives every element in the population the same chance of
being chosen. To ensure that a sample is truly random, a table of random
numbers or a computerized random number generator should be used.

E.g. Suppose you want to draw a simple random sample (“SRS”) from among the
approximately 6,000 students on the enrolment list of Langara in a fall term. If the
sample size is 60, you can use StatGraphics Plus to produce a list of 60 pseudo
random numbers between 1 and 6000.

Click the SGP Spreadsheet icon (located at the bottom of the Main Menu screen).
Then click the heading of the column which will contain the data, making sure that
the entire column is highlighted. Click Edit from the Main menu, then click Modify
Column. You will see the Generate Data window, containing a box of Operators.
Scroll through this box to the RINTEGER(?,?,?) operator, and double click on it to
move it to the Expression box. Edit the expression box contents, replacing those
question marks, to obtain RINTEGER(60, 1, 6000). Click OK. You will see a list of
60 numbers between 1 and 6000 in the highlighted column.
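
For readers without StatGraphics, the same kind of draw can be sketched in Python. This is only an illustration and not part of the course software: the function name `draw_srs` and the seed value are assumptions made here. The sketch samples without replacement, so no student number can appear twice in the list.

```python
import random

def draw_srs(population_size, sample_size, seed=None):
    """Return sample_size distinct ID numbers chosen at random from 1..population_size."""
    rng = random.Random(seed)                 # the seed only makes the draw repeatable
    ids = range(1, population_size + 1)       # enrolment list positions 1..population_size
    return sorted(rng.sample(ids, sample_size))   # sampling without replacement

# 60 student numbers out of roughly 6,000, as in the example above
sample_ids = draw_srs(6000, 60, seed=2014)
print(sample_ids)
```

The students whose positions on the enrolment list match these numbers form the sample.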

2) Systematic Sampling: If a "1 in K" systematic sample is required, a random number
between 1 and K is chosen, the element with that number is picked from the
sampling frame, then every Kth element after that.

E.g. Suppose that 20,000 accounts are filed in an order that is completely unrelated
to their accuracy, and suppose that an auditor wants to sample 5% of this
population, or 1000 accounts. Now, 5% = 5/100 = 1/20, so this is a 1 in 20 sample
(i.e., K = 20 here). Picking a random number between 1 and 20, we get, say, 4. So
the account in the 4th position in the drawer is the first element in the sample; the
next will be at position 24, then position 44, and so on until all 1000 elements have
been chosen.
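
The auditor's selection can be sketched in a few lines of Python. The function name `systematic_sample` is an assumption made for this illustration; the starting position 4 is fixed only to mirror the example, and would normally be drawn at random.

```python
import random

def systematic_sample(population_size, k, start=None, seed=None):
    """Positions picked by a 1-in-k systematic sample from 1..population_size."""
    if start is None:
        start = random.Random(seed).randint(1, k)   # random start between 1 and k
    return list(range(start, population_size + 1, k))  # then every k-th position

# 1-in-20 sample of 20,000 accounts, starting at position 4 as in the example
positions = systematic_sample(20000, 20, start=4)
print(positions[:3], len(positions))
```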

Advantages: The population size need not be known exactly, and the items do not
have to be numbered. So this system is usually faster and easier than taking a
SRS. It is also the only sort of system that can be used on "volatile" populations,
like people leaving a movie theatre.

Disadvantages: A bias may result when systematic sampling is used on a
population that has some sort of cycle or repetitiveness to it. For example, the daily
receipts of a grocery store may show a seven-day cycle, and a systematic sample
could over-sample a particular spot in the cycle. (What type of bias would this
create?)

NOTE: If the population is in random order relative to the variable of interest, then a
systematic sample is equivalent to a SRS. In this course, we will go on to analyze
data that are from simple random samples, or the equivalent. Data from the next
two designs are more complicated to analyze; this analysis is an advanced topic.

3) Stratified Sampling: the population is divided into two or more sub-populations
called strata, and then a sample is selected from each stratum. This method is
better than a SRS if the study's key variable shows less variability within each
stratum than it does within the whole population; in such cases, the stratified
sample not only provides a better estimate of the parameter of interest, it also
provides an estimate for each stratum separately.

E.g. If you want to estimate the average weight of all adult males in Vancouver, the
strata could be based on age. Why? Because men of a similar age should be more
likely to have similar weights than would men of quite different ages, since weight
tends to increase with age (up to a point).

E.g. Suppose you wish to estimate the average monthly expenditure on gasoline
(for private use) for all families living in Greater Vancouver. If you can identify
practical, convenient, and more homogeneous strata, then you should use stratified
sampling. Any ideas?

Notes:
a) If possible, choose strata so that there is similarity of objects within each
stratum. But you will often see this method used on large, complex populations
for which SRS or systematic methods are impractical.
b) If the size of each stratum is known, sampling can be done proportionately. For
example, if 18% of the population is in stratum # 3, then 18% of the sample
could come from this stratum.
c) Often, a simple random or systematic sample is selected from each stratum, as
described above. (See diagram on the page after)
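
Proportional allocation (note b above) is simple arithmetic, sketched here in Python. The stratum sizes and the sample size of 200 are hypothetical values chosen for illustration, and the function name `proportional_allocation` is an assumption. Because exact shares are rarely whole numbers, the sketch rounds down and then gives the leftover units to the strata with the largest fractional parts; other rounding rules are possible.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    exact = {name: sample_size * size / total for name, size in stratum_sizes.items()}
    alloc = {name: int(share) for name, share in exact.items()}   # round down first
    leftover = sample_size - sum(alloc.values())
    # give the remaining units to the strata with the largest fractional parts
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc

# Hypothetical population of 10,000 in which stratum 3 holds 18% of the elements
sizes = {"stratum 1": 4100, "stratum 2": 4100, "stratum 3": 1800}
allocation = proportional_allocation(sizes, 200)
print(allocation)
```

A simple random or systematic sample of the allocated size would then be taken within each stratum.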

4) Cluster Sampling: When the elements of the population are found in naturally
occurring groups or clusters, it is often convenient to sample all the elements within
a cluster; thus, a sample of such clusters is selected, and all the elements of each
such cluster get into the sample. Cluster sampling works best when the individual
clusters are heterogeneous -- i.e., like the population.

E.g. Langara students form clusters called classes, and this provides a convenient
way to survey a large number of students. The LARS guide would serve as the
sampling frame, since it lists all the classes being offered. A random or systematic
sample of these classes would be selected, and a census done of each class.
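
The two stages of this design (select clusters at random, then take everyone in them) can be sketched in Python. The class list below is invented for illustration: 70 hypothetical classes of 15 to 35 students each, with made-up labels. The function name `cluster_sample` is also an assumption.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters clusters at random, then keep every element of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)    # SRS of cluster labels
    sample = [member for label in chosen for member in clusters[label]]  # census of each
    return chosen, sample

# Hypothetical sampling frame: 70 classes, each with 15 to 35 students
size_rng = random.Random(1)
classes = {
    f"class {c:02d}": [f"class {c:02d} / student {s}"
                       for s in range(1, size_rng.randint(15, 35) + 1)]
    for c in range(1, 71)
}

chosen, sample = cluster_sample(classes, 7, seed=2)
print(chosen, len(sample))
```

Note that the final sample size is not fixed in advance; it depends on which classes happen to be chosen.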

Note: Although cluster sampling is a time-saver, it should be clear to you that it is
less random than the other methods we have seen, since the members of one
cluster may be quite similar to each other. For this reason, it is not recommended
for use by students. Stick to the basic methods in this course.

7. Once you have collected your data, you can start to summarize and describe it, then go
on to interpret your results. Don't forget that the whole idea of all of this is to answer
your original problem!

Diagrams Contrasting Stratified and Cluster Sampling

Stratified Random Sample: A simple random sample is selected (proportionately) from each of the
9 strata below. In other words, fewer individuals are selected from a smaller stratum whereas
more individuals are selected from a larger stratum.

┌───────────┬─────────────────┬───────────────────┐
│ . . │. . . . │ . . │
│ . .. │ . . │ . . . │
│ . . │ .. . . │ . . . │
│ . . │ . . . │ . . │
├───────────┼─────────────────┼───────────────────┤
│ . . │ . . . . │ . . │
│ .. . │ . . . │ . . .. │
│ . . │ . . . . │ . . . │
│ ... . │ . │ . . │
├───────────┼─────────────────┼───────────────────┤
│ . . . │ . . .│ . . . │
│ . . . │ . . . .│ . .. . │
│ . . │ . . │ . . . . │
│ . .. │ . . . . │ . . │
│ . . │ . .. │ .. . . .. │
└───────────┴─────────────────┴───────────────────┘

Cluster Random Sample: A simple random sample of 7 clusters (shaded regions) is selected from
a population of elements grouped into 70 clusters. Note that the clusters are similar but may not
contain the same number of elements. Also a census is taken in each of the seven selected
clusters.

[Diagram: a population of elements grouped into 70 clusters; the 7 randomly selected clusters are shaded.]

EXERCISES

1. A small undergraduate college has 1,000 students, evenly distributed among the four classes:
freshman, sophomore, junior and senior. A study is being designed to estimate the percentage
of students in the college who own a cellular phone. It is decided to interview a sample of 100
students. One student proposes the following method for selecting the sample: Write the name
of each student on a ticket, put the 1,000 tickets in a box, stir them up so they are thoroughly
mixed, and then take 100 tickets at random from the box. Another student objects on the
grounds that this sample is unlikely to have exactly 25 students in each of the four classes. So,
a second design is proposed: Hire a team of four interviewers, one from each class, and ask
each interviewer to select 25 representative students from their class.

(a) Which method of sampling is preferred in this situation? Why?

(b) Can you suggest a better method of sampling in order to ensure that exactly 25 students
from each class are contained in the sample?

2. A forester wanted to survey spruce trees on one face of a mountain in order to determine the
percentage of spruce trees that were infested or were attacked by a very destructive insect
known as the tomentosus. The forester obtained a map of the mountain and then subdivided
the area of interest into 400 square grids with each square covering an area of 100 square
metres. A simple random sample of 50 squares was selected, and for each selected square,
the survey team determined the total number of spruce trees and the number of such trees that
were infested or had been attacked by the tomentosus.

(a) Identify the parameter that the forester was interested in.

(b) What sampling design was used for this survey?

(c) If the tomentosus is believed to be evenly distributed throughout the face of this mountain,
would you recommend this sampling design? Why or why not?

(d) If the tomentosus prefers to reside in regions of the mountain face with higher temperatures
and hence lower elevations, would you recommend using a different sampling design?
Explain your answer briefly.

3. For each of the following surveys, state which sampling frame you would recommend. Justify
your choice briefly. Be realistic.

(a) The Board of Directors at Shady Hollows Golf Club plans to survey its membership with
respect to opinion on smoking regulations in their new clubhouse. The membership is
made up of several different categories of men and women: Regular, Intermediate, Senior
and Social.

(b) A North Vancouver travel agency wants to determine its most popular winter vacation
destinations during the past three years.

4. For each of the surveys mentioned in question 3, state which method of sampling would be
most appropriate for the sampling frame you recommended. Justify your choice.

5. For each of the studies described below, (i) Identify the source(s) of bias; and (ii) Explain how
the identified bias may affect the results of the study.

(a) To obtain consumer reaction to a new beer, a marketing firm contacts beer drinkers
referred to them from people who had previously participated in their marketing surveys.
Each person received $25 for participating in the study.

(b) After a television debate between federal leadership candidates, viewers are encouraged to
phone a television station to indicate the candidate of their preference.

(c) Cards are left on tables in a restaurant to solicit suggestions and comments about the
quality of the meals and service.

(d) In order to gauge public opinion on a proposed bill on abortion, a Member of Parliament
uses the results of a survey by the editor of a newsletter of a women's organization.
2.1

EXPLORING ASSOCIATION BETWEEN TWO QUANTITATIVE VARIABLES

An Inferential Approach

Are two variables correlated with each other? Is the linear correlation strong? In
particular, does the sample data allow us to conclude that the population has a linear
correlation? Often, if you show a scatter diagram to two different people, one person will
believe that there is a reasonably strong correlation, and the other person will believe
that the correlation is not very strong. Who is right? In order to help decide about scatter
diagrams in which the correlation is not clearly strong or weak, we use a form of
Statistical Inference.

Statistical Inference
In statistical inference, we want to make a decision: either the population correlation is
zero (there is no linear correlation), or it is not zero (there is some linear correlation).
These two possible conclusions are written down as statements called hypotheses (the
singular is hypothesis). The statement of "no linear correlation in the population" is
called the null hypothesis (denoted by H0), while the statement "there is a linear
correlation in the population" is called the alternative hypothesis (denoted by Ha). Our
aim is to decide whether our sample correlation coefficient (denoted “r”) is strong
enough to support the claim that there is a linear correlation in the population.

The starting point in this type of inference (also called "hypothesis testing") is to assume
that the null hypothesis is true. We then determine whether it is likely to get a sample
correlation coefficient as big as the one that we did get in our sample. For example,
suppose we have a sample of size n=10, and we calculate r=0.93. We need to know if
0.93 is likely, or if it is unlikely, when a sample of size 10 is taken from a population
which has correlation coefficient equal to zero.

Now you may be asking yourself: "If the population correlation coefficient is zero, then
shouldn't the sample correlation coefficient also have a value of zero?" Well, we do
expect the sample to be similar to the population, but we also know that, just by bad
luck (also known as …….. sampling error!), we might get an unrepresentative sample.

Decision Points
In order to help in our decision making, a set of decision points has been developed.
These decision points are designed to let us know which values of r are likely to occur
when sampling from a population with correlation coefficient equal to zero. (Be careful
when reading this: distinguish the sample correlation coefficient from the population
correlation.) The decision points for various sample sizes are given below.

DECISION POINTS FOR DETERMINING THE SIGNIFICANCE OF LINEAR
CORRELATION COEFFICIENT FROM A RANDOM SAMPLE OF SIZE n

n Dec. Pt. n Dec. Pt. n Dec. Pt.


5 0.878 17 0.482 29 0.367
6 0.811 18 0.468 30 0.361
7 0.754 19 0.456 35 0.334
8 0.707 20 0.444 40 0.312
9 0.666 21 0.433 45 0.294
10 0.632 22 0.423 50 0.279
11 0.602 23 0.413 60 0.254
12 0.576 24 0.404 70 0.235
13 0.553 25 0.396 80 0.220
14 0.532 26 0.388 90 0.207
15 0.514 27 0.381 100 0.197
16 0.497 28 0.374 200 0.139

To use these decision points (D.P.), refer to the following diagram:

     conclude                  cannot conclude                    conclude
 linear correlation           linear correlation              linear correlation
|------------------|--------------------|--------------------|------------------|
-1                -DP                   0                   +DP                +1

Any value of “r” near -1 or +1 will make us think that the population also has a nearly
perfect correlation. Any value of “r” which is fairly close to zero will make us think that
the population may not have a linear correlation. For in-between values of r, the above
diagram shows us what conclusion to make.

In the preceding diagram, "conclude linear correlation" means that we conclude the
alternative hypothesis is (more or less) true; in statistical language we "reject the null
hypothesis". In other words, it is highly unlikely that the population has zero correlation.

In the preceding diagram, "cannot conclude linear correlation" means that we are
"unable, or fail to, reject the null hypothesis". In other words, it is possible that the
population correlation coefficient is zero; however at the same time, it is also possible
that the population correlation coefficient is not exactly equal to zero. Specifically, we
cannot conclude (or we do not have enough evidence to conclude) that there is a linear
correlation in the population. Make this conclusion whenever r is between “-D.P.” and
“D.P.”.

Question:
Is “unable to reject null hypothesis” the same as “accept null hypothesis”? In other
words, is “we do not have enough evidence to conclude that there is a linear correlation
in the population” the same as “there is no linear correlation in the population”? Think
about this question; it will be discussed in class.

As we shall see later in this course, decision points are designed so that there is only a
5% chance of incorrectly rejecting the null hypothesis. The following example illustrates
this important statistics concept.

EXAMPLE
The book value per share (X-variable, in dollars) and the annual dividend (Y-variable, in
dollars) for 15 randomly selected utility stocks are given in the table below.

Company   X = Book    Y = Annual       Company   X = Book    Y = Annual
          Value ($)   Dividend ($)               Value ($)   Dividend ($)
A         22.44       2.40             I         12.14       0.80
B         20.89       2.98             J         23.31       1.94
C         22.09       2.06             K         16.23       3.00
D         14.48       1.09             L          0.56       0.28
E         20.73       1.96             M          0.84       0.84
F         19.25       1.55             N         18.05       1.80
G         20.37       2.16             O         12.45       1.21
H         26.43       1.60

a) Draw a scatter diagram using StatGraphics.

Before doing anything, we should first look at the scatter diagram. (What is the
purpose of this?)

[Scatter diagram: Plot of Dividend vs Book Value. Horizontal axis: Book Value (0 to 30); vertical axis: Dividend (0 to 3).]

b) Compute r using StatGraphics. [Answer: r=+0.6970]

Use your scatter plot to convince yourself that r=+0.697 is a reasonable value for
the sample correlation. Also check to see if there are problems with outliers or with
curvature.

c) State the hypotheses.

(All you need to do here is to write out the two statements; choosing between these
statements will occur later in the process (see part d).)

The null hypothesis and the alternative hypothesis are abbreviated as H0 and Ha
respectively.
H0: There is no linear correlation between the book value per share and the annual
dividend, in the population of all utility stocks
Ha: There is a linear correlation between the book value per share and the annual
dividend, in the population of all utility stocks

d) Using an appropriate decision point, make a conclusion about the correlation
between the book value and the annual dividend in the population of all utility stocks.

From the decision point table, the decision point is 0.514 when the sample size is 15
(n=15). The decision point diagram for this problem is then

|-------------------------|---------------------------|---------------------------|---------*---------------|
-1 -0.514 0 +0.514 +1

Since r (see the *, which represents r =0.697 in the above diagram) is not between
the two decision points, we conclude Ha, namely that the population correlation
coefficient is significantly different from zero. This means that we conclude there is
a significant linear correlation between the two variables.
Therefore the conclusion is: we have enough evidence to conclude that utility stocks
with higher book values tend to have higher annual dividends.
(Note: we are not saying there is a cause-and-effect relationship between the two
variables – we cannot conclude that changing a stock’s book value will result in a
change in that stock’s annual dividend. We can only conclude that annual dividend
is linearly associated with book value.)
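
Parts b) and d) can also be checked outside StatGraphics. The Python sketch below is an illustration only: it computes r directly from the table of 15 stocks and compares it with the decision point for n = 15. The helper name `pearson_r` is an assumption made here; the result should agree with the StatGraphics answer to about three decimal places.

```python
from math import sqrt

# Companies A to O, in the order given in the table
book_value = [22.44, 20.89, 22.09, 14.48, 20.73, 19.25, 20.37, 26.43,
              12.14, 23.31, 16.23, 0.56, 0.84, 18.05, 12.45]
dividend = [2.40, 2.98, 2.06, 1.09, 1.96, 1.55, 2.16, 1.60,
            0.80, 1.94, 3.00, 0.28, 0.84, 1.80, 1.21]

def pearson_r(x, y):
    """Sample linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(book_value, dividend)
decision_point = 0.514                  # from the decision point table, n = 15
reject_h0 = abs(r) > decision_point     # True means "conclude linear correlation"
print(round(r, 4), reject_h0)
```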

Note: Check the scatter plot again now. What happens to your conclusion? Because
there is curvature affecting the results of this correlation calculation, the inference made
above should be regarded with caution: the relation between the two variables appears
in fact to be curved rather than linear.

Where did the Decision Points come from?


To get some idea of what the decision points represent, imagine that there is a
population in which two variables have a zero correlation with each other. Now imagine
taking a simple random sample of 15 individuals from that population and, for that
sample, computing r. We would expect that the sample will produce a value of r which is
close to zero, but not exactly zero. Now imagine repeating this sampling process: take
another SRS of size n=15 and compute the correlation. And repeat it thousands of
times.
The decision points are developed so that 95% of these thousands of samples will have
r somewhere between -0.514 and +0.514. The other 5% of those thousands of samples
will have r outside of the interval between -0.514 and 0.514. We interpret this to mean
that when a population has zero correlation, there is a 5% chance that the sample we
will take will have correlation coefficient outside of the interval formed by the decision
points, but there is a 95% chance that the sample we will take will have a correlation
coefficient within the decision point interval.
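How do we use this information? When we take our sample and calculate r, if it is in the
decision point interval (i.e. near zero) then we cannot conclude that there is a linear
correlation in the population, because such a value of r is quite likely when the population
correlation is zero. If r falls outside the decision point interval, we have observed a value
that would occur only 5% of the time when the population correlation is zero, so we
conclude that there is a linear correlation in the population.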
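
The thought experiment above can be tried on a computer. The Python sketch below is a simulation under stated assumptions, not the method used to build the table: both variables are drawn independently from a standard normal distribution (so the population correlation is zero), and the number of repetitions and the seed are arbitrary choices. The names `pearson_r` and `fraction_inside` are assumptions made here.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Sample linear correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fraction_inside(n=15, decision_point=0.514, repetitions=20000, seed=1):
    """Share of simulated samples whose r lies between -DP and +DP."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(repetitions):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]   # generated independently of x
        if abs(pearson_r(x, y)) < decision_point:
            inside += 1
    return inside / repetitions

fraction = fraction_inside()
print(fraction)   # should come out close to 0.95
```

Changing n and the decision point to another row of the table should again give a fraction near 0.95.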
3.1

EXPLORING ASSOCIATION WITH QUALITATIVE VARIABLES

Chi-Square Analysis of Contingency Tables

We want to know whether there are differences among certain subgroups in our
population. For example, do men and women vote similarly; do commuters from Burnaby,
Richmond and Surrey have similar opinions about a proposed tax on highways; do
employees in three types of jobs differ in marital status; does whether or not a person has
received a speeding ticket depend on what age group they belong to?

We will see how to analyze these and other questions. Once the data have been collected,
summarized and graphed, we will then calculate a statistic called the chi-square statistic (“chi”
rhymes with "sky"). The chi-square statistic (denoted $\chi^2$) can be used, where
appropriate, to assess the independence of two qualitative variables (or
categorical variables) that have been summarized in a contingency table.

To see how the chi-square method works, consider a marketing survey of beer drinkers'
gender (Male, Female) and the type of beer they prefer (Light, Regular, Dark). The two
qualitative variables are “gender” and “preferred beer type”. The results from a SRS of 150
beer drinkers in City A are summarized in the following contingency table (also known as a
cross-tabulation or a two-way table).

                 Preferred Beer Type
GENDER      Light   Regular   Dark   Total
Male          20       40      20      80
Female        30       30      10      70
Total         50       70      30     150

Graphical Summary of Two Qualitative Variables


A block diagram (or side-by-side bar graph or stacked bar chart) can be drawn to show the
relationship between two qualitative variables. It consists of a series of bar charts, placed
side-by-side, which enables us to compare groups of a qualitative variable. In our
example, we might be interested in comparing Males' and Females' preferred beer type.
In a block diagram, each mini bar chart shows the frequencies for one row (or one column)
of the contingency table.

In Figure 1, we compare the shapes of the two “mini bar charts”, each of which has
three bars.
If the shapes differ a lot (or do not look alike), then we will believe that Males and
Females have different preferences.
On the other hand, if the shapes do not differ by much (or do look alike), then we
will believe that Males' and Females' preferences are not very different.

[Figure 1: Barchart of frequency (0 to 40) for Male and Female, with one bar each for Light, Regular and Dark.]

Another graph that you might want to use is a stacked bar chart. In this graph, collapse
each bar chart into a single bar by stacking the pieces on top of each other. See Figures 2
and 3. Notice the choice of comparing Males to Females (Figure 2) or comparing the three
types of beer (Figure 3).

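[Figure 2: Block Diagram of frequency (0 to 80), one stacked bar per gender (Male, Female), divided into Light, Regular and Dark.]

[Figure 3: Block Diagram of frequency (0 to 80), one stacked bar per beer type (Light, Regular, Dark), divided into Male and Female.]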

Preferably these graphs would be drawn using relative frequencies (percentages) within each
subgroup, since the subgroups differ in size.
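
To compare the two groups on an equal footing, the counts in each row of the contingency table can be converted to percentages of that row's total. A short Python sketch of this arithmetic follows; the function name `row_percentages` is an assumption made for illustration.

```python
observed = {
    "Male":   {"Light": 20, "Regular": 40, "Dark": 20},
    "Female": {"Light": 30, "Regular": 30, "Dark": 10},
}

def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for group, counts in table.items():
        total = sum(counts.values())          # 80 males, 70 females
        result[group] = {beer: 100 * count / total for beer, count in counts.items()}
    return result

percentages = row_percentages(observed)
for group, shares in percentages.items():
    print(group, {beer: round(p, 1) for beer, p in shares.items()})
```

Half of the males in the sample prefer Regular, while the females are split more evenly between Light and Regular.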

Now that you have had a chance to look at the graphs, does it appear to you that there is
any difference between the male group and the female group, in terms of their preferred
beer type? Do you think that there would be a difference if we looked at the population of
all male and female beer drinkers in City A? Below, we will see how to decide whether the
differences in the sample are large enough for us to believe that there would be
differences in the population.

THE CHI-SQUARE METHOD


Using our example, we wish to make an inference about the population: is there any
difference between male and female drinkers in terms of preferred beer type? We can
state our two possible conclusions about the population using two statements called the
hypotheses.
Null hypothesis (or H0):        For all the beer drinkers in City A, there is no
                                difference between males and females in terms of
                                their preferred beer type.
                                or
                                The variable “gender” and the variable “preferred
                                beer type” are independent of each other in the
                                population.

Alternative hypothesis (or Ha): For all the beer drinkers in City A, there is a
                                difference between males and females in terms of
                                their preferred beer type.
                                or
                                The variable “gender” and the variable “preferred
                                beer type” are not independent of each other in the
                                population.

Notice that if H0 is true, then we would expect a sample of females to have roughly the
same distribution of preferred beer type as a sample of males would have. In other words,
in the sample, the percentage of females preferring, say, Regular should be "close to" the
percentage of males who prefer Regular. What we need is a technique for deciding what
"close to" means. That technique is the chi-square analysis.

In a chi-square analysis, we will compare two sets of numbers.


1. The observed frequency (abbreviated as f): the number of responses in each
cell of the contingency table.

2. The expected frequency (abbreviated as e): the value that we would expect to
get if H0 were true. Notice that expected frequencies reflect the idea that knowing a
person’s gender gives no additional information about that person’s preferred beer type.
[H0 is true if, in the population, the percentage of male drinkers who prefer Light equals
the percentage of female drinkers who prefer Light. A similar argument applies to
Regular and Dark beer.]

Expected frequencies are calculated as follows: column totals are used to estimate the
percentage of individuals in each category.

                 Preferred Beer Type
GENDER      Light   Regular   Dark   Total
Male          20       40      20      80
Female        30       30      10      70
Total         50       70      30     150

For example, using the above contingency table:


Fraction of individuals who prefer Light beer is 50/150*100% = 33.33%
Fraction of individuals who prefer Regular beer is 70/150*100% = 46.67%
Fraction of individuals who prefer Dark beer is 30/150*100% = 20%

If H0 is true, the percentage of Males preferring Light beer should equal the percentage of
Females preferring Light beer and they should both be equal to 33.33%. Now, 33.33% of
80 male drinkers is 26.67. In other words, we would expect to get about 26.67 male
drinkers preferring Light beer.
By the same token, 46.67% of 80 is 37.33. We would expect about 37.33 male drinkers
to prefer Regular beer. Similarly, 20% of 80, or 16 male drinkers, would be expected to
prefer Dark beer.
Similar calculations can be performed for female drinkers, and all the expected
frequencies are tabulated in the following two-way table.

                      Preferred Beer Type
GENDER      Light         Regular       Dark       Total
Male        20 (26.67)    40 (37.33)    20 (16)      80
Female      30 (23.33)    30 (32.67)    10 (14)      70
Total       50            70            30          150

Note: Expected frequencies are presented inside parentheses, next to their corresponding
observed frequencies.
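
The percentage calculation above is equivalent to computing, for each cell, (row total × column total) / grand total; for example, 33.33% of 80 is the same as 80 × 50 / 150. The Python sketch below applies that rule to every cell. The function name `expected_frequencies` is an assumption made for illustration.

```python
observed = [[20, 40, 20],    # Male:   Light, Regular, Dark
            [30, 30, 10]]    # Female: Light, Regular, Dark

def expected_frequencies(table):
    """Expected count in each cell if the row and column variables are independent."""
    row_totals = [sum(row) for row in table]            # 80, 70
    col_totals = [sum(col) for col in zip(*table)]      # 50, 70, 30
    grand_total = sum(row_totals)                       # 150
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Rounded to two decimals, the output should match the values in parentheses in the table above.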

The following graph shows the side-by-side bar graph of the expected frequencies. Since
the null hypothesis is assumed to be true when calculating expected frequencies, we
expect the shapes to look exactly alike.

[Figure 4: Barchart of expected frequency (0 to 40) for Male and Female, with one bar each for Light, Regular and Dark.]

We now need to compare the observed frequencies (f) to the expected frequencies (e).
The comparison will be on a cell-by-cell basis. As we usually do when making
comparisons, we start by subtracting: compare 20 to 26.67, 40 to 37.33 etc. to calculate
$f - e$. As has happened in other places in this course, these differences will sum to zero
(you can check this). So we will square the differences in each cell, i.e. calculate $(f - e)^2$.

There is, however, one more part to this calculation. Suppose that $f = 10$ and $e = 15$. The
difference is -5. If $f = 100$ and $e = 105$, the difference is also -5. But the 5-unit difference
seems less important in the second case (100 vs. 105) than in the first case (10 vs. 15).
To incorporate this idea (that 5 units is a lot sometimes and not a lot other times), the chi-
square calculation looks at the square of the differences relative to the $e$ for each cell. Hence
the formula for the $\chi^2$ statistic, summed over all cells, is:

$$\chi^2 = \sum \frac{(f - e)^2}{e}$$

In our example,

$$
\begin{aligned}
\chi^2 &= \frac{(20 - 26.67)^2}{26.67} + \frac{(40 - 37.33)^2}{37.33} + \frac{(20 - 16)^2}{16}
        + \frac{(30 - 23.33)^2}{23.33} + \frac{(30 - 32.67)^2}{32.67} + \frac{(10 - 14)^2}{14} \\
       &= 1.6681 + 0.1910 + 1.0000 + 1.9069 + 0.2182 + 1.1429 \\
       &= 6.1271
\end{aligned}
$$

Notice that cells in which the observed frequency is close to the expected frequency have
contributed a small value of $(f - e)^2 / e$. Cells in which $f$ and $e$ are very different have
contributed a larger value of $(f - e)^2 / e$. Thus if our calculated value of $\chi^2$ is not close to
zero, this indicates that some of the cells must have had $(f - e)^2 / e$ not close to zero.
Hence, some of the observed and expected frequencies were quite different. Our
conclusion would be that H0 is unreasonable - the observed frequencies are not close to
the expected frequencies, where the expected frequencies reflect that there is no
difference between the males’ and females’ preferences.
On the other hand, a relatively small calculated value of $\chi^2$ indicates that all of the
observed frequencies are close to the expected frequencies, so we may not be able to
conclude H0 is wrong.
Remember from Section 2 of this supplement that “we cannot conclude H0 is wrong” is not
the same as “we conclude that H0 is correct”.
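
The chi-square calculation for the beer example can be checked with a minimal Python sketch. The function name `chi_square` is a hypothetical helper introduced here for illustration. The sketch keeps the expected frequencies unrounded, so its result can differ in the later decimals from a hand calculation that rounds each expected frequency to two places.

```python
# Observed frequencies from the beer example: rows = Male, Female;
# columns = Light, Regular, Dark.
observed = [
    [20, 40, 20],
    [30, 30, 10],
]


def chi_square(table):
    """Sum (f - e)^2 / e over all cells, with e computed from the table totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected frequency
            total += (f - e) ** 2 / e                        # contribution of this cell
    return total


chi2 = chi_square(observed)
print(round(chi2, 4))
```

With unrounded expected frequencies the statistic comes out at about 6.12, marginally below the 6.1271 obtained by hand from the rounded values; the difference is due to rounding only.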

Decision Rule:
If $\chi^2$ is near zero, conclude "we fail to reject H0".
If $\chi^2$ is much larger than zero, conclude Ha (i.e. we reject H0).

In our example, comparing the observed frequencies to the expected frequencies gives
$\chi^2 = 6.1271$. Does this mean that the observed frequencies are close to the expected
frequencies? Is 6.1271 near zero or is it much larger than zero?
We answer this by referring to the chi-square decision point table below. To select the
appropriate decision point, we first determine the "degrees of freedom" (abbreviated as
df) for the contingency table. The df are determined by the size of the contingency table
(not counting the totals row and column), and are defined as
df = (# of rows - 1) x (# of columns - 1).
In our example, df = (2-1) x (3-1) = 1 x 2 = 2. With df = 2, the decision point is 5.991.

Our decision rule is:

If the calculated $\chi^2$ statistic exceeds the decision point, then we reject H0 and
conclude that Ha is true. In other words, the groups being compared differ.
If the calculated $\chi^2$ statistic is no bigger than the decision point, then we do not have
enough evidence to reject H0; that is, we cannot conclude that H0 is wrong.

         fail to reject H0                            reject H0
|-----------------------------------|--------------------------------------------->
0                                  D.P.

CHI-SQUARE DECISION POINTS FOR VARIOUS DEGREES OF FREEDOM (D.F.)

D.F.   Dec. Pt.      D.F.   Dec. Pt.      D.F.   Dec. Pt.
  1      3.841        11     19.675        21     32.671
  2      5.991        12     21.026        22     33.924
  3      7.815        13     22.362        23     35.172
  4      9.488        14     23.685        24     36.415
  5     11.071        15     24.996        25     37.652
  6     12.592        16     26.296        26     38.885
  7     14.067        17     27.587        27     40.113
  8     15.507        18     28.869        28     41.337
  9     16.919        19     30.144        29     42.557
 10     18.307        20     31.410        30     43.773

Conclusion:
In this example, we have calculated $\chi^2 = 6.1271$, which is greater than our decision point of
5.991. Therefore, we reject H0 and conclude Ha is true. In other words, we infer that,
among all the beer drinkers in City A, males and females differ in their preferred beer
type.

NOTE: If in our example the calculated $\chi^2$ statistic had been less than the decision point,
then we would fail to reject H0: the sample would not give enough evidence to conclude that,
among all beer drinkers in City A, Gender and Beer Type are related. In other words, male
and female beer drinkers would not differ significantly in their preferred beer type.
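
The decision rule can be written as a small Python sketch. The names `DECISION_POINTS`, `degrees_of_freedom` and `decide` are hypothetical, and only the first five decision points from the table in these notes are copied in; a fuller version would include all thirty rows.

```python
# Decision points copied from the table in these notes (df = 1 to 5 only).
DECISION_POINTS = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.071}


def degrees_of_freedom(n_rows, n_cols):
    """df = (# of rows - 1) x (# of columns - 1), excluding totals."""
    return (n_rows - 1) * (n_cols - 1)


def decide(chi2, n_rows, n_cols):
    """Compare the chi-square statistic with the decision point for its df."""
    point = DECISION_POINTS[degrees_of_freedom(n_rows, n_cols)]
    return "reject H0" if chi2 > point else "fail to reject H0"


# Beer example: 2 rows (Male, Female), 3 columns (Light, Regular, Dark).
print(decide(6.1271, 2, 3))
```

Note that a statistic exactly equal to the decision point falls on the "fail to reject H0" side, matching the wording "no bigger than the decision point" in the rule.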

EXAMPLE
Does the type of automobile repair differ for various makes of car? A random sample of
new cars was selected and the results are tabulated below.

                      Type of Repair
Make of Car    Electrical    Fuel Supply    Other
A                  17            19           24
B                   9            21            9
C                  33            44           19

Notice that the above contingency table contains only the observed frequencies. So, we
will have to first compute the row totals, column totals and the grand total. As a result, the
expected frequency of each cell can be calculated. The complete contingency table is
shown below.

                           Type of Repair
Make of Car    Electrical      Fuel Supply     Other          Total
A              17 (18.15)      19 (25.85)      24 (16.00)       60
B               9 (11.80)      21 (16.80)       9 (10.40)       39
C              33 (29.05)      44 (41.35)      19 (25.60)       96
Total          59              84              52              195

Next, the contribution to the $\chi^2$ statistic from each cell, $(f - e)^2 / e$, can be
calculated, as shown below.

                      Type of Repair
Make of Car    Electrical    Fuel Supply    Other
A                 0.07          1.81         4.00
B                 0.66          1.05         0.19
C                 0.54          0.17         1.70

The $\chi^2$ statistic (the sum of the above 9 cells) is 10.20. With the degrees of freedom
calculated to be (3-1)(3-1) = 4, the corresponding decision point is then 9.488.

         fail to reject H0                            reject H0
|-----------------------------------|--------------------------------------------->
0                                 9.488

Conclusion:
Because our $\chi^2$ statistic (10.20) is greater than the decision point (9.488), we reject H0
and conclude that the three makes of car do not all have similar percentages of electrical,
fuel supply, and other repairs.
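
The whole car repair example (totals, expected frequencies, cell contributions, statistic and decision) can be traced in one Python sketch. Variable names are illustrative, and the decision point 9.488 is copied from the table in these notes rather than computed.

```python
# Observed repair counts from the example: rows = makes A, B, C;
# columns = Electrical, Fuel Supply, Other.
observed = [
    [17, 19, 24],
    [9, 21, 9],
    [33, 44, 19],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count and chi-square contribution for every cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
contributions = [
    [(f - e) ** 2 / e for f, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chi2 = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(observed[0]) - 1)
decision_point = 9.488  # value for df = 4, copied from the decision point table

print("chi-square:", round(chi2, 2), "df:", df)
print("reject H0" if chi2 > decision_point else "fail to reject H0")
```

Because the sketch sums unrounded contributions, it arrives at about 10.20, whereas adding the nine two-decimal values shown in the table by hand gives 10.19.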

Notes:
1. A general formula for calculating the expected value of a cell is
e = (row total) * (column total) / (grand total)

2. For theoretical reasons, a chi-square analysis should be interpreted with caution if
some of the expected frequencies are smaller than 5. If any expected frequencies are less
than 5, you should try to combine rows or columns to obtain a smaller contingency table
with all expected frequencies at least 5. For example, if the table is

     3    7    8
     9   21   24
    12   28   32

combine the first two columns, or the first two rows, and use

    10    8
    30   24        or        12   28   32
    40   32                  12   28   32

Note that the degrees of freedom will be 2 instead of 4. Take special note of the fact
that it is only the expected values that should not be less than 5. There is no problem if
there are some small observed values. Remember that when the sample size is small or
the contingency table has too many cells, it often happens that many of the expected
frequencies will be less than 5.

3. The value of the $\chi^2$ statistic is always greater than or equal to zero. If an expected
frequency is the same as the corresponding observed frequency, then the chi-square
component for that cell will be zero. If all the rows of a contingency table have exactly
the same distribution, then $\chi^2 = 0$.
For example, the following observed values yield $\chi^2 = 0$ (check that the rows are
identically distributed by finding the relative frequencies). You should also draw the
block diagram using these relative frequencies, to see that the side-by-side bar charts
are identical.

     7    63
    23   207

A contingency table in which the row distributions are similar will have $\chi^2$ near zero.

4. If you have a quantitative variable, you can make it qualitative by forming categories.
For example, instead of the actual income, the following groups can be formed:
(1) under $20,000, (2) between $20,000 and $40,000, and (3) over $40,000.

Therefore, the $\chi^2$ analysis can also be used if you want to investigate the relationship
between a qualitative variable and a quantitative variable.
Remember that you can also subjectively compare a qualitative variable to a
quantitative one by visually comparing the dot plot, histogram or boxplot for each
category of the qualitative variable.

5. A table having 2 rows and 3 columns could also be thought of as having 3 rows and 2
columns -- just flip it over. This flipping may make interpretations more natural. In the
example we looked at, we compared male drinkers to female drinkers in terms of beer
preference. Equivalently, we could have compared the three types of beer to see whether
the Male/Female ratios are the same for all three beer types.

6. In general, concluding "fail to reject the null hypothesis" means that any observed
difference could reasonably be explained by chance variation.
Thus, in our example, concluding fail to reject H0 would mean that while the Male and
Female preferences are different in the sample, they are not different enough to
provide evidence of a difference in the population.

Concluding the alternative hypothesis means that the observed differences are large
enough to suggest a real difference in the population. The $\chi^2$ statistic is used to help
us decide if the sample differences are large enough to conclude Ha.
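
Several of the notes above (2, 3 and 5) can be checked numerically with a short Python sketch. The helper names `expected_counts` and `chi_square` are hypothetical, and the 3 x 3 table from Note 2 is treated here as a table of counts, which is an assumption about how that example is meant to be read.

```python
def expected_counts(table):
    """Expected frequencies: (row total) * (column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum of (f - e)^2 / e over all cells."""
    expected = expected_counts(table)
    return sum(
        (f - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for f, e in zip(obs_row, exp_row)
    )


# Note 2: find the smallest expected frequency, then combine the first two columns.
table = [[3, 7, 8], [9, 21, 24], [12, 28, 32]]
combined = [[row[0] + row[1], row[2]] for row in table]
print(min(min(row) for row in expected_counts(table)))     # below 5
print(min(min(row) for row in expected_counts(combined)))  # at least 5

# Note 3: rows with identical distributions give a statistic of zero.
print(chi_square([[7, 63], [23, 207]]))

# Note 5: flipping rows and columns leaves the statistic unchanged.
beer = [[20, 40, 20], [30, 30, 10]]
flipped = [list(col) for col in zip(*beer)]
print(round(chi_square(beer), 4), round(chi_square(flipped), 4))
```

Combining columns is done here by simple addition of counts; which rows or columns it makes sense to merge in practice depends on the meaning of the categories, not only on the arithmetic.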
