
FACULTY OF COMMERCE AND LAW

MASTER OF BUSINESS ADMINISTRATION

STATISTICS FOR MANAGERS

MBAZ504

Author: Amos Tendai Munzara


Master of Business Administration (ZOU)
BSc Mathematics and Statistics (ZOU)
Diploma in Education (Gweru Teachers’ College)


 
Content Reviewer: Kudzanayi Ruvharo
MSc in Statistics (UZ)
BSc Special Honours in Statistics (UZ)
BSc Mathematics and Statistics (ZOU)
Diploma in Education (Gweru Teachers’ College)

Editor: Cuthbert Muza


Master of Commerce in Accounting (MSU)
Bachelor of Business Administration in Accounting (Solusi University)
Certified Public Accountant (CPA(Z)) (ICPAZ)
Certified Professional Forensic Accountant (CPFAcct) (ICFA-Canada)
Public Accountants and Auditors Board (PAAB) Registration Certificate
Intermediate Certificate (ICSAZ)


 
Table of Contents
Module Overview ................................................................................................................... 10
UNIT 1 DATA COLLECTION TECHNIQUES ................................................................ 11
1.0 Introduction .................................................................................................................... 11
1.1 Learning Objectives ....................................................................................................... 11
1.2 Sampling......................................................................................................................... 11
1.3 Random Sampling Methods ........................................................................................... 12
1.3.1 Simple random sampling ......................................................................................... 12
Activity 1.1 ....................................................................................................................... 12
1.3.2 Stratified random sampling ..................................................................................... 12
Activity 1.2 ....................................................................................................................... 13
1.3.3 Systematic random sampling ................................................................................... 13
Activity 1.3 ....................................................................................................................... 14
1.3.4 Cluster sampling ...................................................................................................... 14
1.4 Non-Random Sampling Methods ................................................................................... 14
1.4.1 Judgemental sampling ............................................................................................. 14
1.4.2 Convenience sampling ............................................................................................. 15
1.4.3 Quota sampling ........................................................................................................ 15
1.5 Data Collection Methods ................................................................................................ 15
1.5.1 Direct observation.................................................................................................... 15
1.5.2 Personal interview ................................................................................................... 16
1.5.3 Telephone interview ................................................................................................ 16
1.5.4 Postal questionnaire ................................................................................................. 17
Activity 1.4 ....................................................................................................................... 17
1.6 Questionnaire Design ..................................................................................................... 17
Activity 1.5 ....................................................................................................................... 18
1.7 Summary ........................................................................................................................ 18
Further Reading .................................................................................................................... 19
UNIT 2 DATA REPRESENTATION .................................................................................. 20
2.0 Introduction .................................................................................................................... 20
2.1 Learning Objectives ....................................................................................................... 20
2.2 Tabulation....................................................................................................................... 20
Activity 2.1 ....................................................................................................................... 21
2.3 Stem and Leaf Display ................................................................................................... 22
Activity 2.2 ....................................................................................................................... 22
2.4 Bar Charts....................................................................................................................... 23
2.4.1 Simple bar chart ....................................................................................................... 23
2.4.2 Multiple bar chart .................................................................................................... 23
2.4.3 Component bar chart ............................................................................................... 24
Activity 2.3 ....................................................................................................................... 25
2.5 Pie Charts ....................................................................................................................... 25
2.6 Histogram ....................................................................................................................... 26
2.6.1 Histogram with unequal class width ........................................................................ 27
2.7 Frequency Polygon......................................................................................................... 28
Activity 2.4 ....................................................................................................................... 29
2.8 Cumulative Frequency Curves ....................................................................................... 29
2.8.1 ‘Less than’ ogive ...................................................................................................... 29
2.8.2 ‘More than’ ogive .................................................................................................... 30
Activity 2.5 ....................................................................................................................... 31
2.9 Shape of the Distribution................................................................................................ 31
2.9.1 Skewness ................................................................................................................. 32
2.9.2 Kurtosis .................................................................................................................... 32
Activity 2.6 ....................................................................................................................... 33
2.10 Summary ...................................................................................................................... 33
Further Reading .................................................................................................................... 33
UNIT 3 MEASURES FOR DESCRIBING DATA ............................................................. 34
3.0 Introduction .................................................................................................................... 34
3.1 Learning Objectives ....................................................................................................... 34
3.2 Measures of Central Tendency ....................................................................................... 34
3.2.1 The arithmetic mean ................................................................................................ 34
Activity 3.1 ....................................................................................................................... 36
Activity 3.2 ....................................................................................................................... 38
3.2.2 The median .............................................................................................................. 38
Activity 3.3 ....................................................................................................................... 39
Activity 3.4 ....................................................................................................................... 39
Activity 3.5 ....................................................................................................................... 41
3.2.3 The mode ................................................................................................................. 41
Activity 3.6 ....................................................................................................................... 42
Activity 3.7 ....................................................................................................................... 43
3.2.4 Estimating the mode using a histogram................................................................... 43
3.2.5 Choosing the appropriate average ........................................................................... 44
Activity 3.8 ....................................................................................................................... 45
3.3 Measures of Position ...................................................................................................... 45
3.3.1 Quartiles for discrete data ........................................................................................ 46
Activity 3.9 ....................................................................................................................... 47
3.3.2 Quartiles for continuous data ................................................................................... 47
Activity 3.10 ..................................................................................................................... 48
3.4 Measures of Dispersion .................................................................................................. 48
3.4.1 Range ....................................................................................................................... 48
Activity 3.11 ..................................................................................................................... 48
3.4.2 Variance and standard deviation of ungrouped data ............................................... 49
Activity 3.12 ..................................................................................................................... 49
3.4.3 Variance and standard deviation of grouped data ................................................... 50
Activity 3.13 ..................................................................................................................... 51
3.4.4 Coefficient of variation ............................................................................................ 52
Activity 3.14 ..................................................................................................................... 52
3.5 Coefficient of Skewness ............................................................................................. 53
3.6 The Box-and-Whisker Plot............................................................................................. 53
3.7 Summary ........................................................................................................................ 54
Further Reading .................................................................................................................... 55
UNIT 4 PROBABILITY DISTRIBUTIONS....................................................................... 56
4.0 Introduction ........................................................................................................................ 56
4.1 Learning Objectives ....................................................................................................... 56
4.2 Discrete Random Variables and their Distributions ....................................................... 56
Activity 4.1 ....................................................................................................................... 57
4.2.1 Expectation of a discrete random variable .............................................................. 57
4.2.2 Mean of a function of a random variable ................................................................ 58
Activity 4.2 ....................................................................................................................... 59
4.2.3 Variance and standard deviation of a discrete random variable .............................. 59
Activity 4.3 ....................................................................................................................... 60
4.3 Binomial Probability Distribution .................................................................................. 60
Activity 4.4 ....................................................................................................................... 63
4.4 Poisson Probability Distribution .................................................................................... 63
Activity 4.5 ....................................................................................................................... 65
4.5 The Normal Probability Distribution ............................................................................. 65
4.6.1 Standard normal curve ............................................................................................. 65
Activity 4.6 ....................................................................................................................... 67
4.6.2 Evaluating probabilities using the standard normal tables ...................................... 67
Activity 4.7 ....................................................................................................................... 69
4.6.3 Practical problems ................................................................................................... 69
4.7 Summary ........................................................................................................................ 71
Further Reading ................................................................................................................. 72
UNIT 5 STATISTICAL ESTIMATION.............................................................................. 73
5.0 Introduction .................................................................................................................... 73
5.1 Learning Objectives ....................................................................................................... 73
5.2 What is Statistical Estimation? ....................................................................................... 73
5.3 Point Estimation ............................................................................................................. 74
5.3.1 Point estimator of the population mean ................................................................... 74
5.3.2 Point estimator of the population variance .............................................................. 74
Activity 5.1 ....................................................................................................................... 75
5.3.3 Point estimator of the population proportion ........................................................... 75
Activity 5.2 ....................................................................................................................... 76
5.3.4 Point Estimate of the difference of two population means ...................................... 76
Activity 5.3 ....................................................................................................................... 77
5.3.5 Point estimate of the difference of two population proportions .............................. 77
5.4 Confidence Interval Estimation ...................................................................................... 78
5.4.1 Interval estimate of the population mean ................................................................ 78
Activity 5.6 ....................................................................................................................... 79
Activity 5.7 ....................................................................................................................... 80
Activity 5.8 ....................................................................................................................... 81
5.4.2 Estimation of the population proportion .................................................................. 81
Activity 5.9 ....................................................................................................................... 82
5.4.3 Confidence interval for the difference between two population means (Independent samples) ..... 82
Activity 5.10 ..................................................................................................................... 84
Activity 5.11 ..................................................................................................................... 85
5.4.4 Confidence interval for the difference between two population means (Paired samples) ..... 85
Activity 5.12 ..................................................................................................................... 87
5.4.5 Confidence interval for the difference between two population proportions .......... 87
Activity 5.13 ..................................................................................................................... 89
5.5 Determining Sample Size in Estimation ........................................................................ 89
5.5.1 Sample size for estimating population mean ........................................................... 90
Activity 5.14 ..................................................................................................................... 90
5.5.2 Sample size for estimating a population proportion ................................................ 90
Activity 5.15 ..................................................................................................................... 92
5.6 Summary ........................................................................................................................ 92
Further Reading .................................................................................................................... 93
UNIT 6 HYPOTHESIS TESTING ....................................................................................... 94
6.0 Introduction .................................................................................................................... 94
6.1 Learning Objectives ....................................................................................................... 94
6.2 Statistical Hypotheses .................................................................................................... 94
6.2.1 Types of hypotheses ................................................................................................ 94
6.2.2 Deciding on the null hypothesis .............................................................................. 95
Activity 6.1 ....................................................................................................................... 96
6.3 Type I and Type II Errors ............................................................................................... 96
6.4 Steps Followed in Hypothesis Testing ........................................................................... 96
6.5 Tests Concerning the Population Mean of a Single Population ..................................... 98
Activity 6.2 ..................................................................................................................... 101
6.6 Test Concerning a Population Proportion for a Single Population .............................. 101
Activity 6.3 ..................................................................................................................... 103
6.7 Confidence Interval Approach to Hypothesis Testing ................................................. 103
Activity 6.4 ..................................................................................................................... 104
6.8 Testing for Differences of two Population Means ....................................................... 104
6.8.1 Tests for independent samples ............................................................................... 104
Activity 6.5 ..................................................................................................................... 106
Activity 6.6 ..................................................................................................................... 108
6.8.2 Tests for paired samples ........................................................................................ 108
Activity 6.7 ..................................................................................................................... 110
6.9 Testing for Difference of two Population Proportions ................................................. 110
Activity 6.8 ..................................................................................................................... 112
6.10 Summary .................................................................................................................... 112
Further Reading .................................................................................................................. 113
UNIT 7 CHI-SQUARE TESTS .......................................................................................... 114
7.0 Introduction .................................................................................................................. 114
7.1 Learning Objectives ..................................................................................................... 114
7.2 Conducting a Chi-square Test of Association .............................................................. 114
7.2.1 Hypothesis ............................................................................................................. 114
7.2.2 Test statistic ........................................................................................................... 115
7.2.3 Critical value.......................................................................................................... 116
Activity 7.1 ..................................................................................................................... 118
7.3 Goodness-of-fit Test ..................................................................................................... 119
7.3.1 Hypothesis ............................................................................................................. 119
7.3.2 Test statistic ........................................................................................................... 119
7.3.3 Critical value.......................................................................................................... 119
7.3.4 Decision Criteria .................................................................................................... 120
Activity 7.2 ..................................................................................................................... 122
Activity 7.3 ..................................................................................................................... 123
Activity 7.4 ..................................................................................................................... 125
7.4 Summary ...................................................................................................................... 125
Further Reading .................................................................................................................. 126
UNIT 8 REGRESSION AND CORRELATION ANALYSIS ......................................... 127
8.0 Introduction .................................................................................................................. 127
8.1 Learning Objectives ..................................................................................................... 127
8.2 The Linear Regression Model ...................................................................................... 127
Activity 8.1 ..................................................................................................................... 128
8.2.2 Assumptions of the linear regression model.......................................................... 128
8.3 Scatter Plot ................................................................................................................... 129
Activity 8.2 ..................................................................................................................... 130
8.4 Estimating the Simple Regression Model .................................................................... 131
Activity 8.3 ..................................................................................................................... 133
8.4.1 Prediction using the regression model ................................................................... 133
8.5 Estimating the Multiple Regression Model.................................................................. 133
Activity 8.4 ..................................................................................................................... 134
8.5 Testing the Significance of the Coefficients ................................................................ 135
8.5.1 Testing the overall significance of the model ........................................................ 136
8.7 Correlation Analysis..................................................................................................... 137
8.7.1 Pearson’s product moment correlation coefficient ................................................ 137
8.7.2 Coefficient of determination .................................................................................. 138
8.8 Summary ...................................................................................................................... 140
Further Reading .................................................................................................................. 141
UNIT 9 INTRODUCTION TO TIME SERIES ANALYSIS........................................... 142
9.0 Introduction .................................................................................................................. 142
9.1 Learning Objectives ..................................................................................................... 142
9.2 Components of a Time Series ...................................................................................... 142
9.2.1 Trend component ................................................................................................... 142
9.2.2 Seasonal component .............................................................................................. 143
9.2.3 Cyclical component ............................................................................................... 143
9.2.4 Irregular component .............................................................................................. 144
9.3 Time Series Models ...................................................................................................... 144
9.3.1 Additive model ...................................................................................................... 144
9.3.2 Multiplicative model.............................................................................................. 144
9.4 Isolating the Trend Component .................................................................................... 145
9.4.1 Least squares method............................................................................................. 145
Activity 9.1 ..................................................................................................................... 146
9.4.2 Moving average method ........................................................................................ 146
Activity 9.2 ..................................................................................................................... 148
9.5 Isolating the Seasonal Component ............................................................................... 149
Activity 9.3 ..................................................................................................................... 150
9.5.1 Deseasonalising of data ......................................................................................... 150
Activity 9.4 ..................................................................................................................... 151
9.5.2 Predicted values of the series................................................................................. 151
Activity 9.5 ..................................................................................................................... 151
9.6 Summary ...................................................................................................................... 152
Further Reading .................................................................................................................. 153
UNIT 10 INDEX NUMBERS.............................................................................................. 154
10.0 Introduction ................................................................................................................ 154
10.1 Learning Objectives ................................................................................................... 154
10.2 Types of Index Numbers ............................................................................................ 154
10.2.1 Price indices ......................................................................................................... 154
10.2.2 Quantity indices ................................................................................................... 155
10.2.3 Value indices ....................................................................................................... 155
10.3 Simple Index Numbers ............................................................................................... 155
10.3.1 Simple price index ............................................................................................... 155
Activity 10.1 ................................................................................................................... 155
10.3.2 Simple quantity index .......................................................................................... 156
Activity 10.2 ................................................................................................................... 156
10.3.3 Index number series trends .................................................................................. 156
Activity 10.3 ................................................................................................................... 157
10.3.4 Changing the base period .................................................................................... 158
Activity 10.4 ................................................................................................................... 159
10.4 Weighted Aggregate Indices ...................................................................................... 159
Activity 10.5 ................................................................................................................... 161
10.5 Use of Index Numbers as Deflators ........................................................................... 161
Activity 10.6 ................................................................................................................... 162
10.6 Challenges in Constructing Index Numbers ........................................................... 163
10.7 Summary ................................................................................................................... 163
Further Reading ............................................................................................................... 164
APPENDICES ....................................................................................................................... 165


 
Module Overview
The module is meant to give the student a comprehensive understanding of Statistics for
managerial purposes. It emphasises the application of Statistical concepts to business
situations.

The module has ten units. In Unit 1 we look at sampling procedures, methods of data
collection and the design of good data collection instruments. The different ways of
presenting qualitative and quantitative data are shown in Unit 2. In Unit 3 we look at various
measures of describing data. These different measures will be classified into three broad
categories namely measures of central tendency, measures of location and measures of
dispersion. In Unit 4 we focus on some special probability distributions which are the
Poisson, binomial and normal probability distributions. The two related concepts of statistical
inference which are estimation and hypothesis testing are tackled in Unit 5 and Unit 6
respectively. Unit 7 is about Chi-Square tests while Unit 8 is about regression and correlation
analysis. In Unit 9 you are introduced to time series analysis. The last unit is on index
numbers.

As students, you are advised to go through the units in the order they are arranged. Each unit
has some activities which are meant to give you enough practice in working out solutions to
problems. You are encouraged to study the worked examples before you attempt the activity
questions. We have also included a list of reference textbooks at the end of each unit for
further reading.

Unit 1

Data Collection Techniques

1.0 Introduction

Statistics is a scientific discipline which involves applying a set of rules and procedures to
reduce large masses of data to manageable proportions thereby making it possible to draw
reasonable conclusions from those data. In this unit, we will look at different techniques
which are related to data collection. These techniques include the different sampling methods,
the design of data collection instruments such as questionnaires and the various methods of
collecting data.

1.1 Objectives
By the end of the unit, you should be able to:
• describe the different sampling methods
• describe methods of collecting data
• design good instruments of data collection
• conduct a survey for a particular enquiry

1.2 Sampling

Sampling refers to the process of selecting a smaller set of study units (sample) from the
entire collection of units of interest (population). This is usually necessitated by limited
financial and time resources to carry out a population study. A census, which is the term used
for a study of the whole population, is usually costly and requires more time and labour
compared to a sample survey.

The data that is collected from a sample is meant to provide an insight into the population
from which it was drawn. It is, therefore, necessary that the sample be representative
of the underlying population. A representative sample is one which reflects the characteristics
of the underlying population as closely as possible. It has to be as large as resources would
permit because the larger the sample the more accurate the results would be. To ensure the
use of a representative sample in a survey, the sample elements are selected randomly using a
suitable random sampling method.

The methods of selecting samples are classified into two categories, namely random sampling
and non-random sampling. In random sampling, also known as probability sampling, every
element of the population has an equal chance of being included in the selected sample, that
is, the selection of study units is free from personal bias. This is not the case in non-random
sampling, where the researcher exercises human judgement in selecting study units,
resulting in samples which are biased.

1.3 Random Sampling Methods

In this section we are going to consider the following random sampling methods: Simple
random sampling, Stratified random sampling, Systematic random sampling, and Cluster
sampling.

1.3.1 Simple random sampling


To select a sample using simple random sampling one has to come up with a numbered list of
all the elements in the population of interest. This list is called a sampling frame. The
sampling frame makes it possible for us to draw elements from the population by randomly
generating the numbers of the elements to be included in the sample using a scientific
calculator, a computer or random number tables. Random digit tables are provided in the
appendices.

In some studies, obtaining a frame of all the units in the population is impossible. In these
studies, a random sample can still be obtained by randomising some aspect of the study such
as the location and the time of collecting the observations.

Example 1.1
Using random number tables, obtain a simple random sample of 30 accounts from a
population of 500 bank accounts.

Solution 1.1
You start by numbering the accounts from 001 to 500. In the table of random digits, you
randomly choose a row or column to start from. Since 500 is a three-digit number, you draw
random numbers with three digits, ignoring 000 and any number greater than 500. You should
also ignore any number already obtained.

Suppose you select row six as the starting point. The first account to be included in the
sample is the account with identification number 428. The next three-digit numbers, 664 and
627, are ignored since they are greater than 500. Continuing systematically in the same row
you get the following random numbers: 343, 621 (ignore), 936 (ignore), 362, 358, 259, 351,
298, 285, 300, 606 (ignore), 004. You will continue this process until you have the desired
sample of 30 accounts.
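
Where a computer is used instead of random number tables, the same selection can be sketched in a few lines of Python. This is only an illustration: the function name simple_random_sample and the seed value are assumptions made for the example, and the numbers drawn will differ from those read off the tables in Solution 1.1.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Return `sample_size` distinct unit numbers drawn from 1..population_size."""
    rng = random.Random(seed)  # the seed only makes the illustration repeatable
    # random.sample draws without replacement, so no number can be repeated
    return rng.sample(range(1, population_size + 1), sample_size)


# 30 accounts out of the 500 numbered accounts of Example 1.1
accounts = simple_random_sample(500, 30, seed=1)
print(sorted(accounts))
```

Because the draw is made without replacement from the numbers 1 to 500, the rules of ignoring numbers above 500 and numbers already obtained are applied automatically.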

Activity 1.1
Use random digits tables to select a random sample of 50 elements from a population of 500
elements.

1.3.2 Stratified random sampling


In stratified random sampling, the population is divided into two or more subpopulations
which are called strata. Simple random samples are then taken from each stratum and
combined to give the desired sample.

This method is suitable for selecting a sample from a population which is made up of distinct
subpopulations. Each subpopulation consists of study units with similar characteristics and
has to be adequately represented in the final sample. In order to ensure that this is the case,
the sample size allocated to each stratum is made proportional to the size of that stratum in
the population.

Example 1.2
An interdenominational church pastor would like to hear the opinion of congregants
regarding a proposed fasting programme. On a particular Sunday service the congregation
comprised 250 members from the AFM Church, 200 members from ZAOGA Church and
50 members from the Anglican Church. How would the pastor select a random sample of 40
congregants for use in the study using stratified random sampling?

Solution 1.2
The total number of congregants is 500. The congregation has to be partitioned into three
subgroups according to the church one belongs to. The sample allocation of each subgroup is
determined as follows:
AFM: $\frac{250}{500} \times 40 = 20$ members
ZAOGA: $\frac{200}{500} \times 40 = 16$ members
Anglican: $\frac{50}{500} \times 40 = 4$ members

A sampling frame of each subpopulation is prepared, and then simple random sampling is
used to fill its allocation.
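
The proportional allocation in Solution 1.2 can also be checked with a short calculation. The sketch below is illustrative only; the function name proportional_allocation is an assumption made for the example.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Allocate the sample to each stratum in proportion to the stratum size."""
    total = sum(strata_sizes.values())
    return {name: size * sample_size / total for name, size in strata_sizes.items()}


# Congregation of Example 1.2 and the required sample of 40 congregants
congregation = {"AFM": 250, "ZAOGA": 200, "Anglican": 50}
allocation = proportional_allocation(congregation, 40)
for church, members in allocation.items():
    print(church, members)
```

In this example every allocation is a whole number. Where it is not, the allocations have to be rounded so that they still add up to the required sample size.
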
Activity 1.2
A popular government primary school has 1 000 students of whom 600 are in boarding and
the rest are day scholars. An opinion poll is to be conducted concerning the time to begin
lessons during winter. How would you select a random sample of 200 students to take part in
the poll using stratified random sampling?

1.3.3 Systematic random sampling


Systematic random sampling involves selecting study units at periodic intervals, for example,
selecting every 10th item from a production run for inspection. The first study unit is the only
one selected at random while subsequent units are selected following a specific order. For
this reason, systematic random sampling is viewed as a quasi-random sampling method.

Suppose a population has N elements and a random sample of size n is to be selected using
systematic random sampling. The steps to be followed are:
• number the elements in the population from 1 to N,
• calculate the sampling ratio (k) by dividing the population size N by the desired
sample size n,
• randomly select a number between 1 and k inclusive to get the starting point, and
• select additional elements at evenly spaced intervals (every kth element thereafter)
until the desired sample is obtained.

Example 1.3
A bank has 2 000 account holders. A random sample of 40 accounts is to be selected using
systematic random sampling. List the accounts that would make up the sample if the first
account randomly selected is 12.

Solution 1.3
You begin by numbering the accounts from 1 to 2 000. Then you calculate the sampling ratio
k as follows:
$k = \frac{N}{n} = \frac{2000}{40} = 50$.

Now, starting from 12, every 50th account thereafter is included.


The first account is 12, the second account is 12 + 50 = 62 , the third account is 62 + 50 = 112
and you continue adding 50 like this until a sample of 40 bank accounts is obtained.
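
The full list of accounts in Solution 1.3 can be generated with a short Python sketch. The function name systematic_sample is an assumption made for the illustration, and the starting point is passed in as given in the example rather than drawn at random.

```python
def systematic_sample(population_size, sample_size, start):
    """List the unit numbers chosen by taking every kth unit from `start`."""
    k = population_size // sample_size  # sampling ratio k = N / n
    return [start + i * k for i in range(sample_size)]


# Example 1.3: N = 2000 accounts, n = 40, first account selected is 12
sample = systematic_sample(2000, 40, start=12)
print(sample)
```

The list begins 12, 62, 112 and ends with account 12 + 39 × 50 = 1962, which is still within the 2 000 accounts.
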
Activity 1.3
Refer to Example 1.3. Obtain a random sample of 20 accounts using systematic random
sampling if the first account selected randomly is 15.

1.3.4 Cluster sampling


Where the population consists of subpopulations which are similar to one another, the
population can be divided into these subpopulations, called clusters. Unlike in stratified
random sampling, where the elements within each stratum are homogeneous in terms of the
variable of interest, the elements within each cluster exhibit different characteristics. The
sampling frame is a complete list of all the clusters that make up the population. A random
sample of clusters is selected and all the study units in the sampled clusters are selected for
the survey.

The advantages of cluster sampling are that the method is less costly and less time
consuming. The amount of travelling by interviewers to reach the respondents is reduced and,
as a result, time wastage is minimal. The disadvantage is that the selected sample may not
reflect the diversity of the population elements because some sections of the population may
be under-represented or over-represented.
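
The selection of whole clusters can be sketched in Python as follows. The population of bank branches and customers is entirely hypothetical, and the function name cluster_sample and the seed are assumptions made for the illustration.

```python
import random


def cluster_sample(clusters, number_of_clusters, seed=None):
    """Randomly pick whole clusters and return every study unit in them."""
    rng = random.Random(seed)
    # the sampling frame is the list of clusters, not the list of study units
    chosen = rng.sample(sorted(clusters), number_of_clusters)
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


# Hypothetical population: customers grouped by the branch that serves them
branches = {
    "Branch A": ["A1", "A2", "A3"],
    "Branch B": ["B1", "B2"],
    "Branch C": ["C1", "C2", "C3", "C4"],
    "Branch D": ["D1", "D2"],
}
chosen, units = cluster_sample(branches, 2, seed=7)
print(chosen)
print(units)
```

Note that only the clusters are drawn at random. Every study unit in a chosen cluster is surveyed, which is why the size of the final sample depends on which clusters are drawn.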

1.4 Non-Random Sampling Methods


Non-random sampling methods are non-probabilistic in nature. In this section, we will
consider the following non-random sampling methods: Judgemental sampling, Convenience
sampling, and Quota sampling.

1.4.1 Judgemental sampling


Judgemental sampling is used where special skills are the criteria used to form a sample.
Usually the sample is very small. For example, only renowned economists may be asked to
comment on the economic status of a country. These economists are not randomly selected
but deliberately selected due to their expertise. The method is subjective in that it is purely
based on the researcher’s personal discretion to choose the sample.

1.4.2 Convenience sampling
The sample is drawn to suit the convenience of the researcher. The researcher selects those
elements which are easily accessible, making the sampling process easier and cheaper.
For example, suppose a researcher would like to find out the level of customer service
satisfaction of shoppers in a supermarket. The researcher will simply interview shoppers
coming out of the supermarket rather than following up on customers who once bought goods
from the supermarket. However, the sample chosen is usually not representative of the
population.

1.4.3 Quota sampling


In quota sampling, the sample is drawn on the basis of more specific guidelines about which
study units should be drawn. To start with, you decide on the proportion of the population to
be included in the sample. Then you identify distinct groups within the population. You go on
to determine the quota of respondents needed from each group. You then fill each quota by
finding enough elements from each group of the population.

Quota sampling is similar to stratified random sampling in that you start by
identifying strata in the population and dividing the population accordingly. It differs from
stratified random sampling in that it does not require a sampling frame. It is therefore much
quicker, cheaper and easier compared to stratified random sampling. Its major drawback is
that it is susceptible to bias.
 

1.5 Data Collection Methods

Data collection can be defined as the process of counting or enumerating or measuring
together with the recording of results. One important factor that is considered when selecting
a suitable method of data collection is the source of the data. Sources of data can be classified
as primary source and secondary source giving rise to primary data and secondary data
respectively.

Primary data are those data which are originally collected by a researcher from the field for
the first time and for a specific purpose. The data are first-hand information and hence are
up-to-date, more reliable and accurate. However, primary data are usually expensive to collect
and require more time and labour to gather.

Secondary data are those data which have already been collected, processed and used by
someone else for their own purpose, which may not necessarily be the same as that of the
current researcher. Secondary data are obtained from journals, newspapers, company
newsletters and from publications made by government agencies. Although it is much faster
and cheaper to obtain secondary data, the data might be outdated and, therefore, not suitable
for the current study.

In this section, we will look at four principal ways of collecting primary data.

1.5.1 Direct observation


The researcher records some observations on the study units. If the study units are people,
they should not realise that they are being observed. In some cases, the researcher becomes a
member of the group being studied.

Advantages of direct observation
• The method is easy to implement.
• It is faster and cheaper.

Disadvantages of direct observation


• The researcher would not be able to give reasons for certain observed behaviour by
merely observing without interviewing study units.
• If study units come to realise that they are being observed they may change their usual
behaviour.

1.5.2 Personal interview


Large scale surveys are conducted using this method. The method involves face-to-face
interviews between field investigators and the subjects of the study. The interviewer will ask
questions from a prepared interview schedule and will record the responses of the interviewee
on spaces provided on the schedule. Interviewers must undergo training so that there is some
degree of uniformity in the manner in which questions are asked and the mode of filling the
schedules.

Advantages of personal interviews


• A high response rate.
• There is room for probing and clarification of misunderstood questions.
• Non-verbal responses can be noted.
• Questionnaires are completed in a standard way by the interviewer.
• The presence of the interviewer discourages collective responses.

Disadvantages of personal interviews


• The exercise is costly due to training expenses and the amount of travelling done by
enumerators.
• It is time consuming.
• Respondent anonymity is lost.
• Interviewer bias may arise if the interviewer suggests answers to the respondent
through use of suggestive gestures or asking leading questions.
• The interviewer may also encounter hostile respondents who refuse to be interviewed.

1.5.3 Telephone interview


Information about the respondent is collected through a telephone conversation. The method
requires an efficient telephone system.

Advantages of telephone interviews


• It is less expensive compared to face-to-face personal interviews.
• The interviewer can clarify any unclear questions.
• The interviewer can call back if the respondent is not initially available.
• A large sample of respondents can be interviewed within a short period of time.

Disadvantages of telephone interviews


• Non-response bias: the method is biased against those who do not own telephones.
• Interviewer bias may arise if the interviewer suggests answers to the respondent by
asking leading questions.
• The interview may be terminated prematurely due to technical faults.
• Respondent anonymity is lost.
• Non-verbal responses cannot be observed.

1.5.4 Postal questionnaire


A specially designed questionnaire is distributed to respondents by post, e-mail and
sometimes by hand where it is convenient to do so. Usually the questionnaire contains
closed-ended questions, such as multiple choice questions or questions which elicit a yes or
no answer. The method involves the respondent completing a questionnaire in the absence of
the interviewer. After completing the questionnaires, respondents post them back to the
researcher.

Advantages of postal questionnaires


• Respondent anonymity is guaranteed.
• Respondents have more time with the questionnaire and can give well-considered
answers.
• Expenses are reduced because there is no need to train and deploy interviewers.
• There is no interviewer bias.

Disadvantages of postal questionnaires


• Low response rate, since respondents may not bother to take part in the study.
• There is no room to probe or clarify misunderstood questions.
• Respondents may be tempted to seek assistance from a third party, resulting in
collective responses being given.
• Non-response bias due to the fact that those who cannot read or write and those
simply not interested in the topic are less likely to respond.

Activity 1.4
1. (a) Distinguish between primary data and secondary data.
(b) What are the advantages of primary data over secondary data?

2. Explain the advantages and the disadvantages of administering a survey questionnaire
online.
3. Suggest ways of improving the response rate in postal questionnaire interviews.

1.6 Questionnaire Design


The questionnaire is the major instrument used to collect primary data (from human beings).
It is important to ensure that the questionnaire is well structured in order to solicit the
required information.

Well designed questionnaires should:


• have clear instructions to guide the respondent when completing it,
• not be unnecessarily long,
• have clear and unambiguous questions,
• have questions arranged in logical sequence,
• have short questions rather than long questions,
• avoid questions which require calculations, and
• not contain double-barrelled questions.

The questionnaire should be tested in a pilot survey before the actual survey. This is
necessary for the following reasons:
• to help fine-tune the data organisation procedures
• to test the questionnaire for clarity of instructions and questions
• to test the data analysis tools beforehand
• to determine the appropriate sample size for the actual study
• to train enumerators in questioning skills

Activity 1.5
Design a questionnaire that you would use to collect data concerning working conditions
from employees in your organisation.

1.7 Summary
In this unit we looked at the various methods of selecting random samples from underlying
populations. Random sampling methods are those in which population elements have the
same probability of becoming part of the sample. We then looked at the four principal ways
of obtaining primary data, which are direct observation, personal interviews, telephone
interviews and postal questionnaires. The advantages and disadvantages of these methods
were discussed. We ended the unit by describing the attributes of well designed questionnaires.

Further Reading

Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.

Unit 2

Data Representation

2.0 Introduction

In its original form, collected data is usually a complex mass which is not suitable for
analysis. It is easier and much faster to make sense of the collected data when the data is
presented in tabular, diagrammatic or graphical form. Tables and graphs allow us to summarise
large masses of data into a simple and more comprehensible form which makes comparisons,
analysis and interpretation of data possible. In this unit, you will learn the various ways of
displaying data.

2.1 Objectives
By the end of the unit, you should be able to:
• display data using tables, pie charts, bar charts, frequency distribution tables,
histograms, frequency polygons and ogives
• construct a stem and leaf plot for a set of data
• describe the shape of data using graphical displays

2.2 Tabulation

Tabulation refers to the process of summarising collected data in the form of a table. The
following are the attributes of a good table:
• It should be numbered (at the top) for easy identification and reference
• It should have a suitable title which is placed against the table number
• It should have suitable captions and stubs with units if any
• It should have footnotes where necessary
• It should have a source note (if secondary data is used) at the bottom of the table

Example 2.1
The ages of a random sample of MBA students are as follows:
34 28 46 37 33 24 29 45 37 34 32 25 50 54 32 36 38 41 38 44 28 43 40 49
30 46 27 34 61 33
Draw a table to show the distribution of ages.

Solution 2.1
A frequency distribution table can be used to show the distribution of ages as shown in Table
2.1 below.

Table 2.1 Age Distribution of MBA Students
Age (years)     Number of Students
Below 30        6
30 – 40*        13
40 – 50         8
50 & above      3
Total           30

*30 – 40 means 30 years to less than 40 years
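
The class frequencies in Table 2.1 can be verified by counting the ages by computer. The Python sketch below is illustrative only; the helper name age_class is an assumption made for the example.

```python
from collections import Counter

# Ages of the 30 MBA students in Example 2.1
ages = [34, 28, 46, 37, 33, 24, 29, 45, 37, 34, 32, 25, 50, 54, 32,
        36, 38, 41, 38, 44, 28, 43, 40, 49, 30, 46, 27, 34, 61, 33]


def age_class(age):
    """Return the class of Table 2.1 that an age falls into."""
    if age < 30:
        return "Below 30"
    if age < 40:
        return "30 - 40"  # 30 years to less than 40 years
    if age < 50:
        return "40 - 50"
    return "50 & above"


frequency = Counter(age_class(age) for age in ages)
for label in ["Below 30", "30 - 40", "40 - 50", "50 & above"]:
    print(label, frequency[label])
print("Total", sum(frequency.values()))
```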

Example 2.2
A company has 250 employees, of whom 120 are female. Of the 105 married employees, 50
are male. Show this information in a table.

Solution 2.2
This information can be shown on a contingency table as shown in Table 2.2 below.

Table 2.2: Classification of Employees of a Company


Sex/Marital Status    Married    Not married    Total
Female                55         65             120
Male                  50         80             130
Total                 105        145            250
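
Only four figures are given in Example 2.2, so the remaining cells of Table 2.2 are found by subtraction. The arithmetic can be laid out as a short Python sketch. The variable names are assumptions made for the illustration.

```python
# Figures given in Example 2.2
total_employees = 250
female = 120
married = 105
married_male = 50

# Remaining cells obtained by subtraction
male = total_employees - female
married_female = married - married_male
unmarried_female = female - married_female
unmarried_male = male - married_male
unmarried = total_employees - married

table = {
    "Female": {"Married": married_female, "Not married": unmarried_female, "Total": female},
    "Male": {"Married": married_male, "Not married": unmarried_male, "Total": male},
    "Total": {"Married": married, "Not married": unmarried, "Total": total_employees},
}
for sex, row in table.items():
    print(sex, row)
```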

Activity 2.1
Table 2.3 below gives a summary of revenue collections by ZIMRA for the third quarter of
2013.

Table 2.3: 2013 Third Quarter Revenue Collections


Tax Head             Actual (US$)       Target (US$)       Difference (US$)    % Difference
Individual Tax       211 330 253.63     171 250 000.00      40 080 253.63       23%
Company Tax          102 358 448.49     105 202 000.00      -2 843 551.51       -3%
VAT (Imports)        129 318 123.40     122 600 000.00       6 718 123.40        5%
VAT (Local Sales)    154 769 430.48     168 525 000.00     -13 755 569.52       -8%
Customs Duty          91 760 416.72      94 080 000.00      -2 319 583.28       -2%
Excise Duty          129 904 846.54     125 450 000.00       4 454 846.54        4%
Carbon Tax             9 645 156.12       9 100 000.00         545 156.12        6%
Mining Royalties      39 004 999.94      63 700 000.00     -24 695 000.06      -39%
Other Taxes           29 253 632.15      45 018 000.00     -15 764 367.85      -35%
Total                897 345 307.46     904 925 000.00      -7 579 692.54       -1%

(a) How much money was contributed by Company tax?


(b) Which tax head contributed the largest portion of the revenue?
(c) Which tax head had the greatest variance from target?
(d) What proportion of the collected revenue came from mining royalties?
(e) Did ZIMRA manage to meet its revenue collection target for the period under review?

2.3 Stem and Leaf Display

A stem and leaf display can be viewed as a histogram which instead of rectangular bars uses
actual data values as building blocks. This is an advantage in that one will not lose sight of
the original data values. The purposes of a stem and leaf display are varied and include the
following:
• to show the shape of the distribution,
• to sort the observations that make up the distribution in order of size,
• to make it easy to detect unusual observations and observations which appear more
frequently than the others, and
• to provide a basis for judging the suitability of different types of averages.

In order to display a set of data on a stem and leaf display, you have to decide on the
appropriate stem and leaf to use. For two-digit numbers, the tens are used as stem digits and
the units as the leaf digits. For example, the two-digit numbers 42, 46 and 49 may be recorded
on the same stem line as 4| 2 6 9. For a set of three-digit numbers, you may either use the
hundreds as the stem digits and the combination of tens and units as the leaves, or use
two-digit stems (hundreds and tens) and one-digit leaves (units). To produce a stem and leaf
display for a set of data you should follow these steps:
• List the set of stem digits that appear in the data
• Record each observation by putting its leaf digit alongside its stem digit
• Arrange the leaf digits sharing the same stem digit in order of magnitude

Example 2.3
Display the data of Example 2.1 on a stem and leaf display.

Solution 2.3
Rough Display                   Final Display
Stem | Leaf                     Stem | Leaf
2    | 849587                   2    | 457889
3    | 4737422688043            3    | 0223344467788
4    | 65143096                 4    | 01345669
5    | 04                       5    | 04
6    | 1                        6    | 1
KEY: 2| 4 = 24
Figure 2.1 Stem and Leaf Display Showing Ages of MBA students
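
The final display in Solution 2.3 can be reproduced with a short Python sketch, assuming two-digit data so that the tens form the stem and the units form the leaf. The function name stem_and_leaf is an assumption made for the illustration.

```python
from collections import defaultdict

# Ages of the 30 MBA students in Example 2.1
ages = [34, 28, 46, 37, 33, 24, 29, 45, 37, 34, 32, 25, 50, 54, 32,
        36, 38, 41, 38, 44, 28, 43, 40, 49, 30, 46, 27, 34, 61, 33]


def stem_and_leaf(values):
    """Group two-digit values by tens digit, with the leaves in ascending order."""
    stems = defaultdict(list)
    for value in sorted(values):       # sorting first keeps each row of leaves ordered
        stems[value // 10].append(value % 10)
    return dict(stems)


display = stem_and_leaf(ages)
for stem in sorted(display):
    print(stem, "|", "".join(str(leaf) for leaf in display[stem]))
```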
Activity 2.2
The amounts of cash spent by a random sample of 30 customers in a supermarket were as
follows:
125 36 512 304 301 357 74 50 122 98 130 100 208 295 250 211 436 450 159
101 106 451 23 65 194 500 25 100 102 110
(a) Represent the data on a stem and leaf display.
(b) Hence, construct a frequency distribution table of the data.

2.4 Bar Charts
Bar charts are flexible tools for displaying qualitative data. Bar charts are
fairly easy to interpret and can be produced directly from data using widely available
statistical packages. There are three basic types of bar charts, namely:
• simple bar chart
• multiple (cluster) bar chart
• component bar chart
2.4.1 Simple bar chart
A simple bar chart is used to represent only one attribute. Each category is represented by a
bar. The bars must have similar width and uniform space should be left between bars. The
height of each bar is directly proportional to the frequency of the category.

Example 2.4
The marital status of employees at a local NGO is as follows:
Marital Status Number of employees
Single 25
Married 10
Divorced 30
Draw a simple bar chart of the data.

Solution 2.4
[Bar chart: Marital Status (single, married, divorced) on the horizontal axis; Number of employees, scaled 0 to 35, on the vertical axis]

Figure 2.2: Bar Chart Showing Marital Status of Employees

The bar chart shows that divorced employees form the largest group while married employees form the smallest.
2.4.2 Multiple bar chart
A multiple bar chart is useful when comparisons of data are required. For example, you may
want to compare production of a crop over two farming seasons.

Example 2.5
An A2 farmer’s production figures for the first two years of farming were as follows:

Crop                    Year 2000    Year 2001
Maize (tonnes)          40           45
Sugar beans (tonnes)    25           20
Soya beans (tonnes)     15           35

Construct a multiple bar chart to represent the data.

Solution 2.5

[Multiple bar chart: Crop (maize, sugar beans, soya beans) on the horizontal axis; Production (tonnes), scaled 0 to 50, on the vertical axis; one bar per crop for Year 2000 and one for Year 2001]

Figure 2.3: Multiple Bar Chart Showing Crop Production

Maize and soya beans production was higher in 2001 than in 2000, while sugar beans production
fell. In both years the production of maize was higher than that of sugar beans and soya beans.

2.4.3 Component bar chart


A component bar chart is particularly useful where you want to emphasise the relative
proportions of each category. Each bar is subdivided according to its constituent
components. Different shades of colour may be used to distinguish one component from another.

Example 2.6
Represent the data of Example 2.5 on a component bar chart.

Solution 2.6
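[Component bar chart: Crop (maize, sugar beans, soya beans) on the horizontal axis; Production (tonnes), scaled 0 to 90, on the vertical axis; each bar is divided into a Year 2000 component and a Year 2001 component]

Figure 2.4: Component Bar Chart Showing Crop Production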

Activity 2.3
1. According to the Confederation of Zimbabwe Industries (CZI) the average capacity
utilisation in the manufacturing sector dropped to 39% in 2013, from 44.2% in 2012 and
57% in 2011. Display this information on a simple bar chart.
2. The table shows the number of tobacco farmers in three districts of Mashonaland East
Province for the years 2012 and 2013.

Year Marondera Wedza Goromonzi


2012 40 12 26
2013 48 10 32

Draw a multiple bar chart to compare the number of tobacco farmers in 2012 and 2013.

2.5 Pie Charts


A pie chart is used to show the contribution of each category of an attribute. Each category is
represented by a sector of a circle. The size of the sector is proportional to the contribution of
each category to the whole. A pie chart is easily understood and can be produced directly
from data using widely available computer software. However, it is not suitable when there
are too many categories. A pie chart with too many segments can be quite confusing.

Example 2.7
Use the data presented in Table 2.3 to produce a pie chart to show the contribution of each
revenue head to total revenue. Which revenue heads contributed the most to total revenue?

Solution 2.7

[Pie chart segments: Individual Tax 24%, VAT (Local Sales) 17%, VAT (Imports) 15%, Excise Duty 15%, Company Tax 11%, Customs Duty 10%, Mining Royalties 4%, Other Taxes 3%, Carbon Tax 1%]

Figure 2.5: Pie Chart of Revenue Heads' Contribution

The pie chart shows that VAT contributed a combined 32% to total revenue. Individual Tax
constituted 24% of total revenue while Excise Duty was third with a contribution of 15%.
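
The size of each sector is found by expressing each tax head's actual collection as a percentage of total revenue. The Python sketch below shows the calculation using the actual collections in Table 2.3. The variable names are assumptions made for the illustration, and the computed percentages may differ by a point from the rounded labels on the chart.

```python
# Actual collections (US$) taken from Table 2.3
collections_usd = {
    "Individual Tax": 211_330_253.63,
    "Company Tax": 102_358_448.49,
    "VAT (Imports)": 129_318_123.40,
    "VAT (Local Sales)": 154_769_430.48,
    "Customs Duty": 91_760_416.72,
    "Excise Duty": 129_904_846.54,
    "Carbon Tax": 9_645_156.12,
    "Mining Royalties": 39_004_999.94,
    "Other Taxes": 29_253_632.15,
}

total = sum(collections_usd.values())
# share of each tax head as a percentage of total revenue
shares = {head: 100 * amount / total for head, amount in collections_usd.items()}
# the angle of each sector is the same share applied to 360 degrees
angles = {head: 360 * amount / total for head, amount in collections_usd.items()}

for head in collections_usd:
    print(f"{head}: {shares[head]:.1f}% ({angles[head]:.0f} degrees)")
```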

2.6 Histogram
A histogram is used to represent a frequency distribution. It consists of a set of bars whose
areas represent the frequencies of the various classes. Unlike a bar chart, there are no gaps
between the bars.

Example 2.8
The monthly salaries earned by a sample of 20 salespersons employed in the motor insurance
industry are:
Salary ($)              Number of Employees
200 to less than 300    2
300 to less than 400    4
400 to less than 500    8
500 to less than 600    5
600 to less than 700    1

Construct a histogram of the data.

Solution 2.8

[Histogram: Salary ($00), from 2 to 7, on the horizontal axis; Number of employees, scaled 0 to 8, on the vertical axis]

Figure 2.6: Histogram Showing Monthly Salaries
2.6.1 Histogram with unequal class width
When the classes of a grouped frequency distribution have unequal class widths, we use
frequency density on the vertical axis. The frequency density is found by dividing the class
frequency by the class width. In some cases the first and/or last class is open-ended. In that
case, the usual class width is assumed to be the width of those classes. Where there is no
usual class width, the open-ended classes are given the same width as the adjoining classes.

Example 2.9
The distribution of ages of customers visiting a barber shop on a Saturday is shown in the
table below:
Age Range(years) Number of Customers
11 – 20 2
21 – 40 5
41 – 45 7
46 – 55 4
56 – 65 2
Produce a histogram to portray the distribution.

Solution 2.9
You begin by changing the apparent class limits into real class boundaries and then calculate
the frequency density for each class.

Age Range(years) Class Width Frequency Frequency Density


10.5 - 20.5 10 2 0.2
20.5 – 40.5 20 5 0.25
40.5 – 45.5 5 7 1.4
45.5 – 55.5 10 4 0.4
55.5 – 65.5 10 2 0.2

The histogram is shown in Figure 2.7.

[Figure: histogram of frequency density (vertical axis, 0 to 1.4) against age in years (horizontal axis, 0 to 70)]
Figure 2.7 Histogram Showing Distribution of Ages
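
The frequency densities used for the histogram can be checked with a short script. The following is only a minimal sketch in Python; the function name `frequency_densities` is an illustrative choice and the class boundaries and frequencies are those of Solution 2.9.

```python
# Class boundaries and frequencies taken from Solution 2.9.
boundaries = [(10.5, 20.5), (20.5, 40.5), (40.5, 45.5), (45.5, 55.5), (55.5, 65.5)]
frequencies = [2, 5, 7, 4, 2]


def frequency_densities(boundaries, frequencies):
    """Return frequency divided by class width for each class."""
    return [f / (upper - lower) for (lower, upper), f in zip(boundaries, frequencies)]


densities = frequency_densities(boundaries, frequencies)
for (lower, upper), density in zip(boundaries, densities):
    print(f"{lower} - {upper}: frequency density = {density:.2f}")
```

The height of each bar is the frequency density, so the area of a bar (height multiplied by class width) recovers the class frequency.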

2.7 Frequency Polygon

A frequency distribution can also be represented by a frequency polygon. A frequency
polygon is constructed by plotting class midpoints against frequency. Where a histogram has
been drawn, a frequency polygon (for the same data) can be superimposed on it by simply
marking the midpoints of the bars at the top and joining them by straight lines. This way, it
can be noted that the area covered by the histogram is equal to the area under the
corresponding frequency polygon.

Example 2.10
Construct a frequency polygon using the data of Example 2.8.

Solution 2.10
[Figure: frequency polygon of number of employees (vertical axis, 0 to 8) against salary in $00 (horizontal axis, 0 to 8)]
Figure 2.8 Frequency Polygon

Activity 2.4
1. The table below shows the amount of money collected from motorists as spot fines at
a police roadblock over 30 randomly selected days:

Revenue ($000) Number of Days


Below 5 3
5 to less than 10 9
10 to less than 15 11
15 to less than 20 5
20 or more 2

(a) Draw a histogram of the data.


(b) Superimpose a frequency polygon on the histogram drawn in part (a).

2.8 Cumulative Frequency Curves


These curves, which are also known as ogives, take the form of an elongated S. They are
based on cumulative frequencies. Their main purpose is to estimate quartiles and percentiles.
There are two forms of ogive: the ‘less than’ ogive and the ‘more than’
ogive.

2.8.1 Less than ogive


It is obtained by plotting the ‘less than’ cumulative frequencies against the upper class
boundaries and then joining the plotted points by a smooth curve. The cumulative
frequencies are usually expressed as percentages to make it easier to estimate partition
values such as quartiles and percentiles.

Example 2.11
Construct a ‘less than’ ogive using the frequency distribution of Example 2.8. Use the ogive
curve to estimate the three quartiles and the 80th percentile.

Solution 2.11

Salary ($)             Number of    Cumulative Number    % Cumulative Number
                       Employees    of Employees         of Employees
200 to less than 300   2            0 + 2 = 2            10
300 to less than 400   4            2 + 4 = 6            30
400 to less than 500   8            6 + 8 = 14           70
500 to less than 600   5            14 + 5 = 19          95
600 to less than 700   1            19 + 1 = 20          100

To produce a ‘less than’ ogive we now plot the following points:


Salary less than 200 300 400 500 600 700
% cumulative no. of employees 0 10 30 70 95 100

[Figure: ‘less than’ ogive of % cumulative number of employees (vertical axis, 0 to 120) against salary in $ (horizontal axis, 0 to 800)]

Figure 2.9 ‘Less Than’ Ogive

To estimate the middle quartile (median) a straight horizontal line is produced to touch the
curve from the vertical axis at 50%, then from the point of contact a straight vertical line is
produced to touch the salary axis. The median is read off the salary axis at the point of
contact. Similar lines can be produced at 25%, 75% and 80% to touch the curve and down to
the salary scale to estimate the lower quartile, upper quartile and 80th percentile respectively.
The procedure is demonstrated in Figure 2.10 below.

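[Figure: ‘less than’ ogive of % cumulative number of employees against salary in $, with horizontal and vertical guide lines drawn to read off the quartiles and the 80th percentile]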

Figure 2.10 Using the Ogive Curve to Estimate Quartiles and Percentiles

The lower quartile is estimated to be $380, the median to be $460, the upper quartile to be $520
and the 80th percentile to be $540.
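
The graphical readings can be cross-checked numerically. The sketch below is an illustration only: it assumes the plotted points are joined by straight lines rather than a smooth hand-drawn curve, so its answers may differ slightly from the values read off Figure 2.10. The function name `ogive_value` is hypothetical.

```python
def ogive_value(boundaries, cum_percent, p):
    """Return the x value at which a 'less than' ogive reaches p percent.

    Assumes straight-line segments between the plotted points.
    """
    points = list(zip(boundaries, cum_percent))
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= p <= c1 and c1 > c0:
            return x0 + (x1 - x0) * (p - c0) / (c1 - c0)
    raise ValueError("p lies outside the range of the ogive")


# Points plotted for the 'less than' ogive in Solution 2.11.
boundaries = [200, 300, 400, 500, 600, 700]
cum_percent = [0, 10, 30, 70, 95, 100]

lower_quartile = ogive_value(boundaries, cum_percent, 25)
median = ogive_value(boundaries, cum_percent, 50)
upper_quartile = ogive_value(boundaries, cum_percent, 75)
percentile_80 = ogive_value(boundaries, cum_percent, 80)
print(lower_quartile, median, upper_quartile, percentile_80)
```

With straight-line segments the lower quartile and median come out at $375 and $450, a little below the $380 and $460 read from the smooth curve, while the upper quartile and 80th percentile agree with the graphical estimates.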
2.8.2 ‘More than’ ogive
The ‘more than’ ogive is obtained by plotting the ‘more than’ cumulative frequencies against
the lower class boundaries and then joining the points by a smooth curve.

Example 2.12
Draw a ‘more than’ ogive for the data in Example 2.8.

Solution 2.12
Salary ( ≥ ) 200 300 400 500 600 700
Cumulative no. of employees 20 18 14 6 1 0
% Cumulative no. of employees 100 90 70 30 5 0

[Figure: ‘more than’ ogive of % cumulative number of employees (vertical axis, 0 to 120) against salary in $ (horizontal axis, 0 to 800)]

Figure 2.11 ‘More Than’ Ogive

If a ‘more than’ ogive is superimposed on a ‘less than’ ogive, the two curves intersect
and the x-coordinate of the point of intersection is an estimate of the median.
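
One way to see why the intersection estimates the median is to compute both sets of cumulative percentages for the data of Example 2.8. This is a minimal sketch; the variable names are illustrative.

```python
# Class frequencies from Example 2.8 and the boundaries 200, 300, ..., 700.
frequencies = [2, 4, 8, 5, 1]
boundaries = [200, 300, 400, 500, 600, 700]

total = sum(frequencies)

# 'More than' cumulative frequency at each boundary: the number of
# observations that are greater than or equal to that boundary.
more_than = []
remaining = total
for f in frequencies:
    more_than.append(remaining)
    remaining -= f
more_than.append(remaining)  # nothing lies at or above the last boundary

more_than_percent = [100 * c / total for c in more_than]
less_than_percent = [100 - p for p in more_than_percent]

for b, m, l in zip(boundaries, more_than_percent, less_than_percent):
    print(f"{b}: more than = {m:.0f}%, less than = {l:.0f}%")
```

At every boundary the two percentages add up to 100, so the curves can only cross where both equal 50%, which is the median.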
Activity 2.5
The amounts of cash demanded back by a random sample of 40 customers paying for
groceries using their credit cards in a supermarket were as follows:

Amount ($) Number of customers


Less than 100 12
100 to less than 200 8
200 to less than 300 7
300 to less than 400 7
400 to less than 500 5
500 or more 1

Draw a ‘less than’ ogive and a ‘more than’ ogive on the same axes and hence estimate the
average amount of cash demanded back by customers.

2.9 Shape of the Distribution

The shape of data can be quite evident from a graph like a histogram or a stem and leaf
display. Some distributions are symmetrical while others are skewed. Some are more peaked
while others are flatter.

2.9.1 Skewness
Skewness is a measure of the degree of asymmetry of a frequency distribution. The
distribution is right-skewed or positively skewed when it stretches further to the right than it does to
the left, as shown in Figure 2.12 (a). Most of the data is concentrated to the left of the
histogram. The measures of central tendency are positioned such that mean > median > mode.

A left-skewed or negatively skewed distribution stretches asymmetrically to the left as shown
in Figure 2.12 (b). In this case most of the data is clustered to the right. The relative positions
of the measures of central tendency are reversed, that is, mean < median < mode.

(a) Right-skewed Distribution (b) Left-skewed Distribution

Figure 2.12: Skewness of Distributions

2.9.2 Kurtosis
Kurtosis is a measure of the peakedness of a distribution relative to the normal distribution.
The larger the kurtosis, the more peaked will be the distribution. We will describe two types
of kurtosis which are:
• leptokurtic
• platykurtic

A leptokurtic distribution is a more peaked distribution than the normal distribution. A flatter
distribution compared to the normal distribution is described as a platykurtic distribution.
Figure 2.13 shows the positions of these distributions relative to the normal distribution.

(a) Leptokurtic distribution (b) Platykurtic distribution

Figure 2.13 Kurtosis of Distribution

Activity 2.6
Describe the shape of the data represented on the stem and leaf display in Activity 2.2 and hence
comment on the expenditure pattern of the customers.

2.10 Summary
Data is made more manageable and easier to interpret when it is presented in tabular or
graphical form. In this unit we described the various ways of displaying both qualitative and
quantitative data. The suitability of each method will largely depend on the nature of the data
to be displayed. We also discussed the advantages and disadvantages of each method of data
presentation.

Further Reading
Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.

Unit 3

Measures for Describing Data

3.0 Introduction

Statistics is an area of Science concerned with the extraction of information from numerical
data. Individual values, when taken together in their entirety, form a distribution or a
population. Summary statistics are ways of characterising that distribution: saying whether
the values are very similar; whether there are some exceptionally large or small values; what
a typical value is like, and so on. In this unit, we are going to discuss various statistics that are
used to describe the distributions from which data are obtained.

3.1 Objectives
By the end of this unit, you should be able to:
• calculate the mean, median and mode for discrete and continuous data
• discuss the advantages and disadvantages of each of the measures of central tendency
• estimate quartiles and percentiles for given data sets
• use a box plot to summarise a given data set
• calculate the range, inter-quartile range, variance and standard deviation for given
samples of discrete and continuous data
• calculate coefficient of variation for given data sets

3.2 Measures of Central Tendency


Measures of central tendency are also called measures of central location, averages or
averages of the first order. In this module, we shall use the terms measure of central tendency
or average interchangeably to mean the representative value around which all the values of
the variable cluster or concentrate. The three averages are: the arithmetic mean - commonly
known as the mean, the median and the mode.

3.2.1 The arithmetic mean


The layman calls it "the average" as if it were the only average. He calls it so, probably, since
it is the most commonly used average.

The mean of a small set of discrete data


The mean of a set of n measurements \( x_1, x_2, x_3, \ldots, x_n \) is equal to the sum of the measurements
divided by n. Denoting the mean by \( \bar{x} \) we have:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{3.1} \]

where ∑ is the summation sign.

Example 3.1
The following data set is a record of the amount of money (in $) spent by a sample of 8
customers on groceries in a shop on a particular Saturday. The figures were rounded-off to
whole numbers.
13, 8, 21, 4, 23, 16, 11, 15
Calculate the mean amount spent by customers on groceries.

Solution 3.1
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{111}{8} = 13.875 \]
Based on the sample, a customer spent on average about $14 on groceries in the shop on that
particular Saturday.

The mean of discrete frequency data


When a data set is large with some observations appearing several times each, the mean is
found by multiplying each observed value by the corresponding frequency, adding up and
then dividing the sum by the number of observations.

The mean for discrete frequency data is obtained using formula 3.2.
\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} \tag{3.2} \]

where \( n = \sum_{i=1}^{k} f_i \). Note that n is the total frequency, k is the number of categories of
observations and \( f_i \) is the frequency of category i.

Example 3.2
In Example 3.1, suppose there are: 5 customers who spent $13 each; 3 customers who spent
$8 each; 1 customer who spent $21; 6 customers who spent $4 each; 2 customers who spent
$23 each; 3 customers who spent $16 each; 4 customers who spent $11 each ;and 7 customers
who spent $15 each. Find the mean amount spent in the shop.

Solution 3.2
This is an example of discrete frequency data. We usually record such data in the form of a
frequency table as shown below.

x 13 8 21 4 23 16 11 15
f 5 3 1 6 2 3 4 7

There are a total of 31 individual observations. We multiply each observed value by its
corresponding frequency, add up and then divide by 31.

\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} = \frac{377}{31} = 12.1613 \]

Thus on average, the customers spent $12.16 each.
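
The calculation with formula 3.2 can be reproduced in a few lines of Python. This is a minimal sketch using the frequency table of Solution 3.2; the function name `frequency_mean` is an illustrative choice.

```python
# Frequency table from Solution 3.2: amounts spent (x) and frequencies (f).
values = [13, 8, 21, 4, 23, 16, 11, 15]
freqs = [5, 3, 1, 6, 2, 3, 4, 7]


def frequency_mean(values, freqs):
    """Mean of discrete frequency data: sum of f*x divided by the total frequency."""
    n = sum(freqs)
    total = sum(f * x for x, f in zip(values, freqs))
    return total / n


mean = frequency_mean(values, freqs)
print(f"n = {sum(freqs)}, mean = {mean:.4f}")
```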
Activity 3.1
1. A sample of ten university students had the following weekly expenditure, in dollars.
23, 15, 18, 35, 24, 45, 35, 28, 40, 32
Calculate the mean weekly expenditure for a student.
2. There were 5 categories of cash prizes in a road-show promotion of a product. The
following frequency distribution table shows the number of people who won the various
categories of prizes.
Prize ($) 25 40 60 75 120
No. of winners 20 12 5 3 1

Calculate the mean cash prize won at the road-show.

The mean of grouped continuous data


Continuous variables such as mass, height and distance take values which are not clear-cut.
Large volumes of data of such measures are usually presented in form of grouped frequency
tables. Once data is presented in this form, some information is lost as it is no longer possible
to retrieve the original raw data. The mean is estimated on the basis of this limitation.

Suppose that data is grouped into k classes/categories with frequencies \( f_1, f_2, \ldots, f_k \). Let the
classes have the midpoints \( x_1, x_2, \ldots, x_k \) respectively. Then the mean is estimated by:

\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} \tag{3.3} \]

where \( x_i \) is the class midpoint or class mark.

The class midpoint is a representative value for all the observations falling in the particular class. It is
obtained by adding the lower and upper class boundaries and dividing the result by 2.

Example 3.3
The following data show monthly salaries (in dollars) of 50 employees of a non-
governmental organisation.

Salary($00) No. of employees


0 to less than 10 10
10 to less than 20 23
20 to less than 30 12
30 to less than 40 3
40 to less than 50 2
Calculate the mean monthly salary of the employees.

Solution 3.3
Salary ($00)    Number of employees, f_i    Midpoint, x_i    f_i x_i
0 - 10          10                          5                50
10 - 20         23                          15               345
20 - 30         12                          25               300
30 - 40         3                           35               105
40 - 50         2                           45               90
                ∑ f_i = 50                                   ∑ f_i x_i = 890

\[ \text{Mean, } \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{890}{50} = 17.8 \]

The mean monthly salary is $1 780.00.

Example 3.4
An organisation recorded monthly medical expenses incurred by families of 30 randomly
selected employees.
Amount($00) Number of employees
1 – 10 3
11 – 20 7
21 – 30 11
31 – 40 5
41 – 50 4
Calculate the mean monthly expenditure per family.

Solution 3.4
Note that there are some gaps in between the classes. Amounts spent may assume values
between, say, 10 and 11, so a continuity correction is needed to convert the stated class
limits into real class boundaries. The gaps are one unit each, so we obtain the boundaries by
adding 0.5 to the upper class limits and subtracting 0.5 from the lower class limits.

Amount ($00)    Frequency, f_i    Midpoint, x_i    f_i x_i
0.5 – 10.5      3                 5.5              16.5
10.5 – 20.5     7                 15.5             108.5
20.5 – 30.5     11                25.5             280.5
30.5 – 40.5     5                 35.5             177.5
40.5 – 50.5     4                 45.5             182
                ∑ f_i = 30                         ∑ f_i x_i = 765

\[ \text{Mean, } \bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{765}{30} = 25.5 \]

The mean monthly medical expense per family was $2 550.00.
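
Formula 3.3 can be sketched in Python as follows. The function name `grouped_mean` is hypothetical; the class boundaries and frequencies are those of Solutions 3.3 and 3.4, both in units of $00.

```python
def grouped_mean(boundaries, freqs):
    """Estimate the mean of grouped data from class midpoints (formula 3.3)."""
    midpoints = [(lower + upper) / 2 for lower, upper in boundaries]
    return sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)


# Solution 3.3: monthly salaries.
salary_mean = grouped_mean(
    [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)],
    [10, 23, 12, 3, 2],
)

# Solution 3.4: medical expenses, using the real class boundaries.
medical_mean = grouped_mean(
    [(0.5, 10.5), (10.5, 20.5), (20.5, 30.5), (30.5, 40.5), (40.5, 50.5)],
    [3, 7, 11, 5, 4],
)

print(salary_mean, medical_mean)
```

Because each class is represented only by its midpoint, the result is an estimate of the mean of the original raw data rather than its exact value.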
Activity 3.2
1. The data shows the mass, in kg, of a sample of people who had applied to train as horse-riders.
Mass (kg) 15 - 20 21- 25 26 - 30 31 - 35 36 - 40
Number of applicants 9 5 3 4 2
Calculate the sample mean.
2. The heights of the applicants were recorded as shown in the table below.
Height (cm) 130 - 135 136- 140 141 - 145 146 - 160
Number of applicants 7 9 4 1
Calculate the mean height of the applicants.

3.2.2 The median


The median is a positional average. It is the value such that half the observations in the data
set are larger than it and half are smaller than it. The median is the central value after the
observations are ranked according to size.

Median for small set of discrete data


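Let the ordered values of a data set be \( y_1, y_2, y_3, \ldots, y_n \) where n is the number of observations.
The median is given by

\[ y_{(n+1)/2} \quad \text{if } n \text{ is odd} \tag{3.4} \]

\[ \tfrac{1}{2}\left( y_{n/2} + y_{(n+2)/2} \right) \quad \text{if } n \text{ is even.} \tag{3.5} \]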

Example 3.5
The data set shows scores for university students who wrote a management course.
36, 67, 41, 52, 73, 61, 58, 76, 33, 48, 68
Find the median score.

Solution 3.5
Rearranging in order of size:
y: 33 36 41 48 52 58 61 67 68 73 76
Since n = 11 is odd, the median corresponds to

\[ y_{(n+1)/2} = y_{(11+1)/2} = y_{12/2} = y_6 = 58 \]

Example 3.6
Suppose in Example 3.5, the student who obtained a score of 68 was disqualified for some
reason. Find the median of the remaining scores.

Solution 3.6
Since n = 10 is even, the median is given by

\[ \tfrac{1}{2}\left( y_{n/2} + y_{(n+2)/2} \right) = \tfrac{1}{2}(y_5 + y_6) = \tfrac{1}{2}(52 + 58) = 55 \]
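
Formulas 3.4 and 3.5 can be expressed as a short function. The sketch below is illustrative only; note that Python lists are indexed from 0, so the k-th ordered value is found at position k - 1.

```python
def median(data):
    """Median of a small data set using formulas 3.4 (n odd) and 3.5 (n even)."""
    y = sorted(data)
    n = len(y)
    if n % 2 == 1:
        # y_{(n+1)/2}, shifted by one for 0-based indexing
        return y[(n + 1) // 2 - 1]
    # average of y_{n/2} and y_{(n+2)/2}
    return (y[n // 2 - 1] + y[n // 2]) / 2


scores = [36, 67, 41, 52, 73, 61, 58, 76, 33, 48, 68]
median_all = median(scores)                                # Example 3.5
median_without_68 = median([s for s in scores if s != 68])  # Example 3.6
print(median_all, median_without_68)
```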

Activity 3.3
1. The PXP bank issued loans to farmers toward the rain season. The loans, in thousand
dollars, are listed below.
7.3, 5.4, 14.2, 9.1, 15.0, 8.6, 24.5, 3.7, 6.3, 16.4, 12.5, 18.2
Find the median of the data set.
2. An omnibus operator expects the bus crew to cash in $100 every day. Due to
various reasons, the crew may sometimes fail to meet the target. On 11 randomly selected
days, the following amounts were cashed in:
84, 93, 75, 88, 55, 69, 96, 100, 74, 80, 58
Find the median for the cash remittances.

Median for discrete frequency data


We shall proceed by giving an example to demonstrate how to locate the median of discrete
frequency data. This approach is used in order to avoid a cumbersome task of having to list a
large number of values in the data set.

Example 3.7
In a survey to assess attitude to a new product, a random sample of 41 potential customers
was obtained. A 60-point rating scale was used to measure the potential to become loyal to
the product. The frequency table shows the distribution of scores that were obtained in the
survey.

x 18 24 27 35 42 53
Frequency, f 4 7 5 12 7 6

Find the median score.

Solution 3.7
\[ \text{Rank for median} = \frac{n+1}{2} = \frac{41+1}{2} = 21 \]

The median has a rank of 21, that is, it is the 21st value in the set of ascending values.
We then construct a cumulative frequency table to help us identify the median value.

x                       ≤ 18   ≤ 24   ≤ 27   ≤ 35   ≤ 42   ≤ 53
Cumulative frequency    4      11     16     28     35     41

We note from the table that there are 16 values that are below or equal to 27, that is, the 16th
value is 27. Similarly, the 28th value is 35. A list of values from the 16th to the 28th will
include the 21st. The list will be a 27 followed by a chain of 35s up to the 28th value. Thus the
21st value is 35, the median.
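
The cumulative frequency search can be sketched in Python. The helper names `value_at_rank` and `frequency_median` are hypothetical, and averaging the two middle ranks when n is even is an assumption carried over from formula 3.5.

```python
import math


def value_at_rank(pairs, rank):
    """Return the value holding the given rank in an ordered frequency table."""
    cumulative = 0
    for x, f in pairs:
        cumulative += f
        if cumulative >= rank:
            return x
    raise ValueError("rank exceeds the total frequency")


def frequency_median(values, freqs):
    """Median of discrete frequency data located through cumulative frequencies."""
    pairs = sorted(zip(values, freqs))
    n = sum(freqs)
    rank = (n + 1) / 2
    # When n is even the rank ends in .5, so average the two neighbouring values.
    lower = value_at_rank(pairs, math.floor(rank))
    upper = value_at_rank(pairs, math.ceil(rank))
    return (lower + upper) / 2


scores = [18, 24, 27, 35, 42, 53]
freqs = [4, 7, 5, 12, 7, 6]
median_score = frequency_median(scores, freqs)
print(median_score)
```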
Activity 3.4
During the first quarter of the year EST Department stores conducted promotions for
various products in which prizes worth various amounts of dollars were won. The frequency table
shows the distribution of prizes that were won.
Prize worth 20 35 60 75 120
No. of prizes 83 26 48 52 15
Find the median value of the prizes.

Median for grouped continuous data
To find the median you start by identifying the median class. The median class is the class
that contains the (n/2)th observation, where n is the total frequency. The class containing the
(n/2)th observation can easily be identified using the ‘less than’ cumulative frequencies of the
data.
The median is given by

\[ \text{Median} = L_m + \frac{C_m \left( n/2 - F_{m-1} \right)}{f_m} \tag{3.6} \]

where \( L_m \) = lower class boundary of the median class, \( f_m \) = frequency of the median class, \( F_{m-1} \)
= cumulative frequency up to (but excluding) the median class, \( C_m \) = width of the median
class, n = total frequency and m = subscript used to denote the median class.

Example 3.8
Calculate the median of data in Example 3.4.

Solution 3.8
Class interval Class boundaries Frequency, f i Cumulative frequency, F
1 – 10 0.5 - 10.5 3 3
11 – 20 10.5 – 20.5 7 10
21 – 30 20.5 – 30.5 11 21
31 – 40 30.5 – 40.5 5 26
41 - 50 40.5 – 50.5 4 30

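The median class contains the (30/2)th observation, that is, the 15th observation. This is
contained in the class 20.5 – 30.5. Therefore, \( L_m = 20.5 \), \( C_m = 10 \), \( f_m = 11 \) and \( F_{m-1} = 10 \).
You now substitute these values into the formula.

\[ \text{Median} = L_m + \frac{C_m \left( n/2 - F_{m-1} \right)}{f_m} = 20.5 + \frac{10(15 - 10)}{11} = 20.5 + \frac{50}{11} = 25.04545455 \]

Since the amounts are recorded in hundreds of dollars, the median monthly medical expense is approximately $2 504.55.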
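
Formula 3.6 can be sketched as a small function. The name `grouped_median` is an illustrative choice; the data are the real class boundaries and frequencies of Example 3.4, in units of $00.

```python
def grouped_median(boundaries, freqs):
    """Interpolate the median of grouped data (formula 3.6)."""
    target = sum(freqs) / 2      # rank n/2
    cumulative = 0               # F_{m-1}: cumulative frequency before the current class
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= target:
            # This is the median class: L_m + C_m * (n/2 - F_{m-1}) / f_m
            return lower + (upper - lower) * (target - cumulative) / f
        cumulative += f
    raise ValueError("no observations supplied")


boundaries = [(0.5, 10.5), (10.5, 20.5), (20.5, 30.5), (30.5, 40.5), (40.5, 50.5)]
freqs = [3, 7, 11, 5, 4]
median_expense = grouped_median(boundaries, freqs)
print(round(median_expense, 4))
```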

Example 3.9
The table below shows the distribution of the sales ($) made by vendors at Mbare Musika one
particular morning.
Sales, x 40 - 60 61 - 80 81 - 100 101 - 150 151 - 200 201- 250
Frequency, f 8 5 15 9 13 6

Estimate the median of the sales.

Solution 3.9
Note that there are some gaps in between the classes. For instance, the lowest class ends at 60
and the next class starts from 61. Sales may assume values between 60 and 61, so there is
need to do some continuity correction to the class boundaries to obtain the real limits.

Sales ($) Frequency, f Cumulative frequency, F


39.5 – 60.5 8 8
60.5 – 80.5 5 13
80.5 – 100.5 15 28
100.5 – 150.5 9 37
150.5 – 200.5 13 50
200.5 – 250.5 6 56

The median has a rank of 28, that is, it is the 28th value of the ordered data set. This is in the
class 80.5 – 100.5, which is therefore the median class. The median is then interpolated using
the formula:
\[ \text{Median} = L_m + \frac{C_m \left( n/2 - F_{m-1} \right)}{f_m} = 80.5 + \frac{20(28 - 13)}{15} = 80.5 + \frac{300}{15} = 100.5 \]

The median is $100.50.

Activity 3.5
1. The frequency table shows the distribution of bank deposits, in thousand dollars, made by
companies over the month of February.
Deposits 60 - 80 80 - 100 100 - 200 200 - 250 250 - 400
No. of banks 5 12 5 15 8
Estimate the median for the company deposits.
2. The distribution of investors by value of shares (thousand dollars) in Earthly limited
company is shown in the frequency table.
Value 2.5 - 4.5 5.0 - 7.5 8.0 - 12.5 13.0 - 18.5 19.0 - 25.5
Investors 12 9 24 8 5
Estimate the median share value.

3.2.3 The mode


The mode of a data set is the observation that occurs most frequently. The mode represents
what is most fashionable or popular and is often used in business.

Example 3.10
A cross-border trader is deciding which shoes to order for resale. She will be guided by shoe sales
recorded by a colleague in the same business in order to determine what proportion to order
of each size. The sales (shoe sizes) recorded by the colleague on her last visit were:
4, 7, 8, 8, 9, 3, 8, 8, 7, 9, 5, 6, 7, 5, 8
Determine the modal shoe size.

Solution 3.10
The shoe size which appears most is 8. This is the shoe size with the highest demand, and the
cross-border trader should order more size 8 shoes.

Mode for discrete frequency data

Example 3.11
Suppose in Example 3.10 the sales for the last 30 visits were as follows:
Size 3 4 5 6 7 8 9 10
Frequency 11 7 16 10 23 47 7 1

What was the modal shoe size?

Solution 3.11
Size 8 has the highest frequency of 47, hence it is the mode. This shoe size must constitute
the biggest proportion of the new order.
Activity 3.6
Mukwe Lodge recorded the following number of bookings per week for accommodation in
the first quarter of the year.
15, 23, 9, 15, 25, 17, 18, 12, 15, 26, 13, 21
Find the modal number of bookings.

The mode for grouped continuous data


The mode for discrete data can be found easily by inspection. However, when raw data is
put into classes, it is difficult to tell exactly how many times each value occurs, but you can
tell how many observations fall in each class. The class with the highest frequency is the
modal class. The actual mode lies in the modal class and can be
estimated by calculation or graphically using a histogram.

The mode is calculated using the formula:

\[ \text{Mode} = L_m + \frac{C_m \left( f_m - f_{m-1} \right)}{2 f_m - f_{m-1} - f_{m+1}} \tag{3.7} \]

where \( L_m \) = lower class boundary of the modal class, \( C_m \) = class width of the modal class, \( f_m \) =
frequency of the modal class, \( f_{m-1} \) = frequency of the class one step below the modal class,
\( f_{m+1} \) = frequency of the class one step above the modal class and m = subscript used to denote
the modal class.

Example 3.12
Calculate the mode of the data in Example 3.4.

Solution 3.12
The modal class is 20.5 – 30.5
\[ \text{Mode} = L_m + \frac{C_m \left( f_m - f_{m-1} \right)}{2 f_m - f_{m-1} - f_{m+1}} = 20.5 + \frac{10(11 - 7)}{2(11) - 7 - 5} = 20.5 + \frac{40}{10} = 24.5 \]

The modal expense was $2 450.00.

Example 3.13
The frequency table shows the distribution of loans (in thousand dollars) that were issued to
small businesses by PXP bank.
Amount 2-8 9 - 15 16 - 20 21 - 35 36 - 40
Frequency 15 8 23 14 5

Find the mode.

Solution 3.13
We need to present the table using real limits as shown below. It is important to note that
successive classes have small gaps of 1 unit between them. To close these gaps we subtract
half the distance (0.5) from lower limits, and add the same to the upper class limits.

Amount 1.5 - 8.5 8.5 - 15.5 15.5 - 20.5 20.5 - 35.5 35.5 - 40.5
Frequency 15 8 23 14 5

The modal class is 15.5 - 20.5 since it has the highest frequency.
\[ \text{Mode} = L_m + \frac{C_m \left( f_m - f_{m-1} \right)}{2 f_m - f_{m-1} - f_{m+1}} = 15.5 + \frac{5(23 - 8)}{2(23) - 8 - 14} = 15.5 + \frac{75}{24} = 18.625 \]

The mode is $18 625.00.
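
Formula 3.7 can be checked against Examples 3.12 and 3.13 with the sketch below. The function name `grouped_mode` is hypothetical, and treating a missing neighbouring class as having zero frequency (when the modal class is the first or last class) is an assumption that the text does not address.

```python
def grouped_mode(boundaries, freqs):
    """Estimate the mode of grouped data using formula 3.7."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])    # index of the modal class
    f_m = freqs[m]
    # Assumption: a neighbour that does not exist is given frequency 0.
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    lower, upper = boundaries[m]
    return lower + (upper - lower) * (f_m - f_prev) / (2 * f_m - f_prev - f_next)


# Example 3.12 (data of Example 3.4, in $00).
mode_expense = grouped_mode(
    [(0.5, 10.5), (10.5, 20.5), (20.5, 30.5), (30.5, 40.5), (40.5, 50.5)],
    [3, 7, 11, 5, 4],
)

# Example 3.13 (loans in thousand dollars).
mode_loan = grouped_mode(
    [(1.5, 8.5), (8.5, 15.5), (15.5, 20.5), (20.5, 35.5), (35.5, 40.5)],
    [15, 8, 23, 14, 5],
)

print(mode_expense, mode_loan)
```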
Activity 3.7
The table shows monthly tobacco sales, in tons, made over the last 32 months at BW Tobacco
Auction Floors.

Sales in tons 5.5 - 9.5 10 - 15.5 16 - 20.5 21 - 24.5 25 - 28.5


No. of months 6 3 12 8 3

Estimate the mode for the monthly sales of tobacco by the company.

3.2.4 Estimating the mode using a histogram


The use of a histogram to estimate the mode requires that the bars be of uniform width. The
method is illustrated in Figure 3.1 using the data of Example 3.14.

Example 3.14
The monthly salaries earned by a sample of 20 salespersons employed in the motor insurance
industry are:

Salary ($)              Number of employees
200 to less than 300    2
300 to less than 400    4
400 to less than 500    8
500 to less than 600    5
600 to less than 700    1

Estimate the mode using a histogram.

Solution 3.14
You start by identifying the modal class. The modal class is the one with the tallest bar. You
then estimate the position of the mode within the modal class by drawing diagonals as shown
in Figure 3.1. Where the diagonals intersect you now draw a straight vertical line downwards
to meet the horizontal axis.

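[Figure: histogram of number of employees (vertical axis, 0 to 8) against salary in $ (horizontal axis, 200 to 700), with diagonals drawn in the tallest bar and an arrow marking the mode on the horizontal axis]

Figure 3.1 Estimation of Mode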


The arrow indicates the position of the mode which can be read off from the horizontal scale.
3.2.5 Choosing the appropriate average
The suitability of the mode, median or mean as an average for a given situation largely
depends on the advantages and disadvantages of the particular measure.

Advantages of the median


Using the median in describing a distribution has advantages in that it:
• is easy to calculate;
• eliminates the effect of extreme values;
• is capable of further algebraic use in analysing other measures;
• can be estimated graphically using an ogive.

Disadvantages of the median


Using the median has disadvantages in that it:
• may not be representative of all the items as it ignores the extreme values;
• cannot be determined precisely when it falls between two middle values;
• has no use when items are weighted according to size;
• requires ranking of items, which may be tedious for large data sets.

Advantages of the arithmetic mean
The mean has the following advantages:
• it is easy to calculate;
• it is based on all the observations;
• it has further algebraic use in calculating other measures;
• it is easily understood.

Disadvantages of the arithmetic mean


The mean has the following disadvantages:
• it is affected by extreme values (outliers), if any, in a data set;
• it does not give information on composition of the data;
• it does not depict the entire picture of the data;
• it does not always represent the characteristics of individual items;
• it is usually not one of the observed values.

Advantages of the mode


The advantages of using the mode are that:
• it is easy to find;
• it is easy to understand;
• it is usually one of the observed values of the data set.

Disadvantages of the mode


The disadvantages of mode are that:
• the mode may not exist;
• it may not be unique;
• its use in further statistical analysis is limited;
• it ignores all values other than the most frequent one.
Activity 3.8
The planning department of a Building Society would like to estimate the average household
size of workers at a particular company for which they are to develop a housing project. The
Society gathered the following data pertaining to household sizes from a random sample of
20 workers at the company.
2, 1, 3, 6, 2, 5, 3, 5, 1, 7, 1, 2, 5, 3, 3, 4, 5, 5, 7, 15
(a) Find the mean, median and mode of the data.
(b) Which average is most suitable to estimate household size? Justify your answer by
saying why the other two are not suitable.

3.3 Measures of Position


These measures provide the position of a value in an ordered set of data. The median, for
instance, is a measure of position which divides the distribution into halves. Other commonly
used measures of position are the lower quartile (Q1), the upper quartile (Q3), and the
percentiles. The quartiles (Q1, Q2 and Q3) are positions which divide the entire distribution
into four portions of equal frequency. The lower quartile (Q1) is the value below which lies
25% of the distribution. The median (Q2) has 50% of the distribution lying below it while the
upper quartile (Q3) is the value below which lies 75% of the distribution.

Rank for the quartiles
In finding or estimating the quartiles, the data is first arranged in ascending order. We then
need to know the rank for the quartiles, since these will be used in making estimation. Table
3.1 summarises the rank for the quartiles for discrete and continuous data in which the
number of observations is n .

Table 3.1 Rank for the Quartiles

Quartile    Rank in discrete data    Rank in continuous data
Q1          (n + 1)/4                n/4
Q2          (n + 1)/2                n/2
Q3          3(n + 1)/4               3n/4

3.3.1 Quartiles for discrete data


We demonstrate how the quartiles are estimated for given sets of discrete data, using the
ranks.

Example 3.15
A phone-shop operator recorded the daily revenue she received, in dollars, over 14 days as
shown below.
12, 18, 23, 27, 14, 17, 25, 43, 16, 37, 22, 28, 10, 36
(a) Estimate Q1 and Q3 from the data.
(b) Based on the calculated values, what is the probability that on a given day her revenue
exceeds Q3?

Solution 3.15
(a) There are 14 observations, hence n = 14. We first arrange these values in ascending
order to obtain the ordered data set below.
10, 12, 14, 16, 17, 18, 22, 23, 25, 27, 28, 36, 37, 43
The rank for Q1 is (14 + 1)/4 = 3.75, that is, 3 + 0.75. We therefore consider Q1 to be the third value plus
0.75 of the distance between this and the fourth value. Put mathematically, this is

\[ Q_1 = 14 + 0.75(16 - 14) = 15.5 \]

The rank for Q3 is 3(14 + 1)/4 = 11.25. The upper quartile is, therefore, the 11th value plus 0.25 of the
difference between this and the 12th value.

\[ Q_3 = 28 + 0.25(36 - 28) = 30 \]

(b) The upper quartile, Q3, has 0.25 of the distribution lying above it. The probability that her
revenue exceeds $30 on a given day is therefore 0.25.
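
The rank-and-interpolate procedure can be sketched in Python. The function name `value_at_rank` is hypothetical, and the sketch assumes the rank lies between 1 and n - 1 whenever it has a fractional part.

```python
def value_at_rank(ordered, rank):
    """Return the value at a (possibly fractional) rank in ordered data.

    A rank of 3.75 means the 3rd value plus 0.75 of the gap to the 4th value.
    """
    whole = int(rank)
    fraction = rank - whole
    lower = ordered[whole - 1]          # ranks start at 1, list indices at 0
    if fraction == 0:
        return lower
    return lower + fraction * (ordered[whole] - lower)


revenue = [12, 18, 23, 27, 14, 17, 25, 43, 16, 37, 22, 28, 10, 36]
ordered = sorted(revenue)
n = len(ordered)

q1 = value_at_rank(ordered, (n + 1) / 4)        # rank 3.75
q3 = value_at_rank(ordered, 3 * (n + 1) / 4)    # rank 11.25
print(q1, q3)
```

The same function gives the median when called with the rank (n + 1)/2 from Table 3.1.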

Activity 3.9
Find the lower and upper quartiles for the following data sets
(a) 15, 23, 9, 15, 25, 17, 18, 12, 15, 26, 13, 21
(b) 7.3, 5.4, 14.2, 9.1, 15.0, 8.6, 24.5, 3.7, 6.3, 16.4, 12.5, 18.2

Percentiles are found in a very similar way to quartiles. The 25th percentile and the 75th
percentile are in fact Q1 and Q3 respectively.

3.3.2 Quartiles for continuous data


Quartiles can be obtained from grouped data in a similar way as was used for the median.
You begin by identifying the appropriate quartile class. The lower quartile class is the class
that contains the (n/4)th observation while the upper quartile class contains the (3n/4)th
observation. The following computational formulae are then used to estimate the
quartiles:

\[ \text{Lower quartile, } Q_1 = L_q + \frac{C_q \left( n/4 - F_{q-1} \right)}{f_q} \tag{3.8} \]

\[ \text{Upper quartile, } Q_3 = L_q + \frac{C_q \left( 3n/4 - F_{q-1} \right)}{f_q} \tag{3.9} \]

where \( L_q \) = lower class boundary of the quartile class, \( C_q \) = class width of the quartile class, \( f_q \) =
frequency of the quartile class and \( F_{q-1} \) = cumulative frequency up to (but excluding)
the quartile class.

Example 3.16
Calculate the lower quartile, Q1 and upper quartile, Q3 for the data of Example 3.4.

Solution 3.16
The lower quartile class contains the (30/4)th observation, that is, the 7.5th observation. The
lower quartile class is therefore 10.5 - 20.5.

\[ \text{Lower quartile, } Q_1 = L_q + \frac{C_q \left( n/4 - F_{q-1} \right)}{f_q} = 10.5 + \frac{10(7.5 - 3)}{7} = 10.5 + 6.428571429 = 16.9286 \]

The upper quartile class contains the (3n/4)th observation, that is, the 22.5th observation. This
class is 30.5 - 40.5.

\[ \text{Upper quartile, } Q_3 = L_q + \frac{C_q \left( 3n/4 - F_{q-1} \right)}{f_q} = 30.5 + \frac{10(22.5 - 21)}{5} = 30.5 + 3 = 33.5 \]
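
Formulas 3.8 and 3.9 differ only in the rank used, so both can be handled by one function that takes the required fraction of the distribution as a parameter. This is a minimal sketch; the name `grouped_quantile` and its `fraction` parameter are illustrative choices, not notation used in this module.

```python
def grouped_quantile(boundaries, freqs, fraction):
    """Interpolate the value below which `fraction` of grouped data lies.

    fraction = 0.25 gives Q1 (formula 3.8) and 0.75 gives Q3 (formula 3.9).
    """
    target = fraction * sum(freqs)    # rank n/4 or 3n/4
    cumulative = 0                    # F_{q-1}
    for (lower, upper), f in zip(boundaries, freqs):
        if cumulative + f >= target:
            return lower + (upper - lower) * (target - cumulative) / f
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")


# Data of Example 3.4 with real class boundaries (amounts in $00).
boundaries = [(0.5, 10.5), (10.5, 20.5), (20.5, 30.5), (30.5, 40.5), (40.5, 50.5)]
freqs = [3, 7, 11, 5, 4]

q1 = grouped_quantile(boundaries, freqs, 0.25)
q3 = grouped_quantile(boundaries, freqs, 0.75)
print(round(q1, 4), round(q3, 4))
```

Other percentiles follow in the same way, for example a fraction of 0.80 for the 80th percentile.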

Activity 3.10
Using the data of Example 3.13, calculate the lower and upper quartiles of monthly
salaries of the employees.

3.4 Measures of Dispersion


Measures of dispersion give an indication of how widely scattered the observations are
around their mean. When values in a sample or population are close to the mean, they exhibit
less dispersion. The measures of dispersion we are going to look at are the range, inter-
quartile range, semi inter-quartile range, variance and standard deviation.
3.4.1 Range
The range gives a simple indicator of the variability of a set of observations. The range of a
set of observations is the difference between the largest observation and the smallest
observation.
Range = highest observed value – lowest observed value [3.10]

Example 3.17
Find the range of the following data
13 12 16 19 26 20 14 21 15 18 22 36

Solution 3.17
Range = highest observed value – lowest observed value
Range = 36 - 12 = 24

The range for grouped data is found by subtracting the real lower limit of the lowest class
interval from the real upper limit of the highest class interval. Although it is very easy to use
and understand, the range is not a reliable way of measuring the spread of data because it is
based on only two observations, namely the highest and lowest values. If one of these
two values is an outlier, then the spread of the data is exaggerated. Moreover, the range cannot
be calculated where class intervals are open-ended.

Inter-quartile range
The inter-quartile range (IQR) is the range between quartiles. More specifically, it is the
difference between the upper quartile and the lower quartile, that is:
IQR = Q3 - Q1 [3.11]

In turn, the semi inter-quartile range (SIQR) is half the inter-quartile range and is obtained
from the formula:
\[ \text{SIQR} = \frac{Q_3 - Q_1}{2} \]   [3.12]
The SIQR is limited in that, just like the range, it is based on selected observations in a
distribution so it cannot always detect dispersion in data. However, it is more resistant to
extreme observations compared to the range.
Activity 3.11
Find the range, inter-quartile range and the semi inter-quartile range for the following data
12 19 19 26 20 14 21 15 17 22 36 12 18 33 15 21 18 19 11

3.4.2 Variance and standard deviation of ungrouped data
The variance and standard deviation allow us to avoid the shortcomings of the range and
inter-quartile range as measures of dispersion because they take into account all the
observations in the data set as opposed to just selecting a few.

The variance of a set of data is the average squared deviation of the data points from their
mean. Computationally, the variance of a sample of n observations $x_1, x_2, \ldots, x_n$ is obtained by
the formula:
\[ s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right) \]   [3.13]

The formula for the population variance $\sigma^2$ is given by:
\[ \sigma^2 = \frac{1}{N}\left(\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2\right) \]   [3.14]
The standard deviation of a set of observations is the positive square root of the variance of
the set. The variance is a squared quantity and its units, which are (units)², often have no
practical meaning. For example, the variance of sales data in dollars is in (dollars)², which is
practically meaningless. By taking the square root of the variance, we ‘unsquare’ the units
and get the standard deviation, which has the same units as the quantity being
measured and is thus easier to interpret than the variance. When calculating a variance or
standard deviation, you should verify whether the data relate to a population or a sample.

Example 3.18
The numbers of vehicles stopping to refuel at a service station on 20 randomly selected days
are:
32 37 29 40 35 26 45 37 34 29 30 34 56 74 40 48 45 43 32 35
Find the variance and standard deviation of the data.

Solution 3.18
\[
\begin{aligned}
s^2 &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right) \\
    &= \frac{1}{19}\left(32821 - \frac{(781)^2}{20}\right) \\
    &= 122.2605263
\end{aligned}
\]
The standard deviation is then obtained by finding the positive square root of the variance.
\[ s = \sqrt{122.2605263} = 11.0571 \]
The variance is 122.2605 and the standard deviation is 11.0571.
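The arithmetic of Solution 3.18 can be checked with a short Python sketch. The variable names are illustrative; the only library calls used are from Python's standard `math` and `statistics` modules.

```python
import math
import statistics

vehicles = [32, 37, 29, 40, 35, 26, 45, 37, 34, 29,
            30, 34, 56, 74, 40, 48, 45, 43, 32, 35]

n = len(vehicles)
sum_x = sum(vehicles)                    # 781
sum_x2 = sum(x * x for x in vehicles)    # 32821

# Computational formula [3.13] for the sample variance
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

print(round(variance, 4), round(std_dev, 4))   # 122.2605 11.0571

# Cross-check against the standard library's sample variance (divisor n - 1)
assert math.isclose(variance, statistics.variance(vehicles))
```

For population data, `statistics.pvariance` applies the divisor N of formula [3.14] instead.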
Activity 3.12
The commissions (in dollars) earned by a sample of 17 ice cream vendors in one month were:
78 50 65 79 97 80 102 45 54 75 98 86 92 69 72 75 80
Find the variance and standard deviation of the data.

3.4.3 Variance and standard deviation of grouped data
Suppose that data were put into k classes. Let $x_1, x_2, \ldots, x_k$ be the midpoints of the class
intervals and $f_1, f_2, \ldots, f_k$ be the respective class frequencies, then the population variance is
given by:
\[ \sigma^2 = \frac{1}{N}\left(\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{N}\right) \]   [3.15]
where $N = \sum_{i=1}^{k} f_i$ is the population size.
The sample variance is given by:
\[ s^2 = \frac{1}{n-1}\left(\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right) \]   [3.16]
where $n = \sum_{i=1}^{k} f_i$ is the sample size.
The standard deviation is found by taking the square root of the variance.

Example 3.19
Calculate the variance and standard deviation of the following data.
Class interval Frequency
2 - 9 2
10 - 17 6
18 - 25 12
26 - 33 5
34 - 41 3
42 - 49 2

Solution 3.19

Class boundaries   Frequency, f_i   Class midpoint, x_i   f_i x_i   f_i x_i²
1.5 - 9.5          2                5.5                   11        60.5
9.5 - 17.5         6                13.5                  81        1 093.5
17.5 - 25.5        12               21.5                  258       5 547
25.5 - 33.5        5                29.5                  147.5     4 351.25
33.5 - 41.5        3                37.5                  112.5     4 218.75
41.5 - 49.5        2                45.5                  91        4 140.5
Total              ∑f_i = 30                              ∑f_i x_i = 701   ∑f_i x_i² = 19 411.5

\[
\begin{aligned}
\text{Variance, } s^2 &= \frac{1}{n-1}\left(\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right) \\
    &= \frac{1}{29}\left(19411.5 - \frac{(701)^2}{30}\right) \\
    &= \frac{1}{29}\left(19411.5 - 16380.03333\right) \\
    &= 104.5333334 \\
    &\approx 104.5333
\end{aligned}
\]

\[
\begin{aligned}
\text{Standard deviation, } s &= \sqrt{104.5333334} \\
    &= 10.22415441 \\
    &\approx 10.2242
\end{aligned}
\]
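The table-based calculation for grouped data can be reproduced with the following Python sketch. The layout of `classes` as (lower boundary, upper boundary, frequency) triples is an illustrative choice, and the figures are those of Example 3.19.

```python
import math

# (lower boundary, upper boundary, frequency) for each class of Example 3.19
classes = [(1.5, 9.5, 2), (9.5, 17.5, 6), (17.5, 25.5, 12),
           (25.5, 33.5, 5), (33.5, 41.5, 3), (41.5, 49.5, 2)]

n = sum(f for _, _, f in classes)
# Each class is represented by its midpoint (lo + hi) / 2
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)

variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)   # formula [3.16]
std_dev = math.sqrt(variance)

print(n, sum_fx, sum_fx2)                        # 30 701.0 19411.5
print(round(variance, 4), round(std_dev, 4))     # 104.5333 10.2242
```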

Example 3.20
Calculate the variance and standard deviation of the data in Example 3.4.

Solution 3.20
Amount ($00)   Frequency, f_i   Midpoint, x_i   f_i x_i   f_i x_i²
0.5 - 10.5     3                5.5             16.5      90.75
10.5 - 20.5    7                15.5            108.5     1 681.75
20.5 - 30.5    11               25.5            280.5     7 152.75
30.5 - 40.5    5                35.5            177.5     6 301.25
40.5 - 50.5    4                45.5            182       8 281
Total          ∑f_i = 30                        ∑f_i x_i = 765   ∑f_i x_i² = 23 507.5

\[
\begin{aligned}
\text{Variance, } s^2 &= \frac{1}{n-1}\left(\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}\right) \\
    &= \frac{1}{29}\left(23507.5 - \frac{(765)^2}{30}\right) \\
    &= \frac{1}{29}\left(23507.5 - 19507.5\right) \\
    &= \frac{1}{29}(4000) \\
    &= 137.9310345
\end{aligned}
\]

\[ \text{Standard deviation} = \sqrt{137.9310345} = 11.74440439 \]

Since the amounts are in hundreds of dollars, the standard deviation is $1 174.44.


 

Activity 3.13
The annual profits made by a random sample of 40 companies in the textiles industry are
shown in the table below.

Profit ($00) Number of companies


10 but less than 20 3
20 but less than 30 7
30 but less than 40 12
40 but less than 50 10
50 but less than 60 5
60 but less than 100 3
Calculate the:

i. Mean
ii. Median
iii. Mode
iv. Semi inter-quartile range
v. Variance
vi. Standard deviation

3.4.4 Coefficient of variation


The coefficient of variation is the standard deviation expressed as a percentage of the mean. It is
calculated using the following formula:
\[ \text{Coefficient of variation (CV)} = \frac{s}{\bar{x}} \times 100 \]   [3.17]
where s is the standard deviation and $\bar{x}$ is the mean of the data.
The coefficient of variation is a relative measure and it is used to compare the variability of two
or more distributions, especially where the units of measurement differ.
 
Example 3.21
A German based firm would like to purchase stock in one of two companies (A and B) listed
on the Zimbabwe Stock Exchange. The firm considered the monthly returns of the two
companies over the last 10 months.
A: 34 42 36 38 45 40 32 34 39 41
B: 21 24 32 64 50 35 28 30 42 55
Compare the variability in returns between the two companies. In which company should the
firm invest?
 
Solution 3.21
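A: mean = 38.1, variance = 16.7667, standard deviation = $\sqrt{16.7667}$ = 4.0947
\[ \text{CV} = \frac{4.0947}{38.1} \times 100 = 10.75\% \]
B: mean = 38.1, variance = 202.1, standard deviation = $\sqrt{202.1}$ = 14.2162
\[ \text{CV} = \frac{14.2162}{38.1} \times 100 = 37.31\% \]
The two companies have the same mean return, but the returns of company B are more variable
and therefore riskier than those of company A. The German based firm should invest in
company A.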
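Recomputing both coefficients of variation directly from the raw monthly returns is a useful check on the arithmetic. The sketch below uses Python's standard `statistics` module; the helper name `coefficient_of_variation` is our own.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100


returns_a = [34, 42, 36, 38, 45, 40, 32, 34, 39, 41]
returns_b = [21, 24, 32, 64, 50, 35, 28, 30, 42, 55]

cv_a = coefficient_of_variation(returns_a)
cv_b = coefficient_of_variation(returns_b)

print(round(cv_a, 2), round(cv_b, 2))   # 10.75 37.31
```

Because both companies have a mean return of 38.1, the comparison here reduces to comparing the two standard deviations. The coefficient of variation becomes essential when the means or the units differ.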

Activity 3.14
Sekai and Sam stay in the same suburb and are employed by the same company in town.
Sekai travels to work by bus and Sam cycles. The times (in minutes) taken by each to get to
work on a sample of 10 days were:
Sekai: 35 26 41 38 36 48 37 30 35 24
Sam: 24 28 24 21 27 26 24 28 22 23
Calculate the coefficient of variation for each set of times. Whose travel time is more
consistent? Justify your answer.

3.5 Coefficient of Skewness

Pearson’s coefficient of skewness, denoted Skp, is a measure of the degree of departure from
symmetry which is based on the difference between the mean and the median. It is calculated
using the formula
\[ Sk_p = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} \]   [3.18]
A symmetrical distribution has a coefficient of skewness equal to zero. A coefficient
of skewness close to zero indicates that the distribution is only slightly skewed. A positive
coefficient of skewness shows that the data are positively skewed whilst a negative coefficient
means the data are negatively skewed.

Example 3.22
Calculate the coefficient of skewness for the data in Example 3.18

Solution 3.22
You are now capable of finding the mean, median and standard deviation of ungrouped data.
Show that mean = 39.05, median = 36 and standard deviation = 11.0571.
\[ \text{Coefficient of skewness} = \frac{3(39.05 - 36)}{11.0571} = 0.8275 \]
Since the coefficient of skewness is positive, the data are positively skewed.
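The three summary measures and the resulting coefficient can be verified with a few lines of Python. This is a sketch using the standard `statistics` module on the service station data of Example 3.18; the variable names are illustrative.

```python
import statistics

vehicles = [32, 37, 29, 40, 35, 26, 45, 37, 34, 29,
            30, 34, 56, 74, 40, 48, 45, 43, 32, 35]

mean = statistics.mean(vehicles)
median = statistics.median(vehicles)
std_dev = statistics.stdev(vehicles)       # sample standard deviation

# Pearson's coefficient of skewness, formula [3.18]
skewness = 3 * (mean - median) / std_dev

print(mean, median, round(std_dev, 4), round(skewness, 4))
# 39.05 36.0 11.0571 0.8275
```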

3.6 The Box-and-Whisker Plot


A box-and-whisker plot is useful in comparing distributions. It highlights five summary
measures of a distribution: the smallest observation, the lower quartile, the median, the
upper quartile and the largest observation.

The middle half of the values in a distribution is represented by a box which has the lower
quartile at one end and the upper quartile at the other. The median is shown by a line inside
the box. Observations in the top and bottom quarters are represented by straight lines called
whiskers which extend from each end of the box, one from the lower quartile to the smallest
observation and the other from the upper quartile to the largest observation. Because of these
features, a box plot makes it easier to determine skewness, spread, central tendency and
possible outliers of a distribution.

Example 3.23
Draw a box-and-whisker plot of the following sales data
10 5 14 11 16 24 21 12 16 20 22 15 24 18 10 14 19 8 12 20

Solution 3.23
The smallest observation is 5 while the largest observation is 24. By now you should be able
to show that the lower quartile is 11.25; the median is 15.5 and the upper quartile is 20.
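The five summary measures can be computed with the Python sketch below. It assumes the position rule in which the quartile with cumulative fraction p sits at position p(n + 1) in the ordered data, with linear interpolation between neighbouring values. That rule is our assumption; it reproduces the values quoted in this solution. The function name `value_at_fraction` is illustrative.

```python
def value_at_fraction(sorted_data, fraction):
    """Value at position fraction * (n + 1) in an ordered list (1-based),
    interpolating linearly between neighbouring observations."""
    n = len(sorted_data)
    position = fraction * (n + 1)
    if position <= 1:
        return sorted_data[0]
    if position >= n:
        return sorted_data[-1]
    k = int(position)                      # whole part of the position
    lower, upper = sorted_data[k - 1], sorted_data[k]
    return lower + (position - k) * (upper - lower)


sales = [10, 5, 14, 11, 16, 24, 21, 12, 16, 20,
         22, 15, 24, 18, 10, 14, 19, 8, 12, 20]
ordered = sorted(sales)

summary = (ordered[0],
           value_at_fraction(ordered, 0.25),
           value_at_fraction(ordered, 0.50),
           value_at_fraction(ordered, 0.75),
           ordered[-1])
print(summary)   # (5, 11.25, 15.5, 20.0, 24)
```

Statistical software may use slightly different interpolation rules for quartiles, so small differences from hand-calculated values are possible.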

Figure 3.1 Box-and-Whisker Plot of Sales Data (vertical axis: Sales)

Note: The length of the box (showing inter-quartile range) and that of the whiskers (showing
the range) give an indication of the spread of the data.

3.7 Summary

We looked at three broad categories of measures for describing data, namely measures of
central tendency, measures of position and measures of dispersion. The measures of central
tendency locate the centre of data; these are the mean, mode and median. Quartiles and
percentiles, which are classified as measures of position, provide the position of a value in an
ordered set of data. Measures of dispersion give an indication of how widely scattered the
observations are around their mean. When values in a sample or population are close to the
mean, they exhibit less dispersion. The measures of dispersion we looked at are the range,
inter-quartile range, semi inter-quartile range, variance and standard deviation. We also
considered the advantages and disadvantages of these measures.

Further Reading

Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.

Unit 4

Probability Distributions

4.0 Introduction
A probability distribution shows all possible outcomes of an experiment and how probable
each type of outcome is. It is defined as a rule that assigns probabilities to the different values
of a random variable. In this unit we will look at discrete random variables and their
probability distributions. We will also look at some special probability distributions, namely
the Binomial probability distribution, the Poisson probability distribution and the normal
probability distribution.

4.1 Objectives
By the end of the unit, you should be able to:
• define a probability distribution and the standard normal distribution
• find the probability distribution for a random variable
• calculate the mean, variance and standard deviation of a random variable
• find the mean of a function of a random variable
• solve problems involving the Binomial probability distribution and the Poisson
probability distribution
• state the properties of the normal distribution
• transform a normal distribution into a standard normal distribution
• compute probabilities using standard normal distribution tables

4.2 Discrete Random Variables and their Distributions


Discrete random variables are those that can assume a countable number of values. A
probability distribution for a discrete random variable can be construed as a table listing the
possible values of the random variable (X) alongside their associated probabilities. The
probabilities of all possible values of a discrete random variable X add up to 1.

Example 4.1
The numbers of cell phone handsets sold per day by a phone dealer over the past 30 days are
as follows:

No. of phones per day, x   0    1   2   3
No. of days                12   6   9   3

Find the probability distribution of the number of phones sold per day.

Solution 4.1
The probabilities of the different values of X are calculated as follows:
\[
\begin{aligned}
P(X = 0) &= \frac{12}{30} = 0.4 \\
P(X = 1) &= \frac{6}{30} = 0.2 \\
P(X = 2) &= \frac{9}{30} = 0.3 \\
P(X = 3) &= \frac{3}{30} = 0.1
\end{aligned}
\]
The probability distribution of X is tabled below.

No. of phones sold per day, x   0     1     2     3
P(X = x)                        0.4   0.2   0.3   0.1
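Turning a frequency table into a probability distribution amounts to dividing each frequency by the total. A minimal Python sketch follows, using exact fractions so that the check that the probabilities sum to 1 is not affected by rounding. The dictionary layout and names are illustrative.

```python
from fractions import Fraction

# Number of days on which x phones were sold (Example 4.1)
days_by_sales = {0: 12, 1: 6, 2: 9, 3: 3}
total_days = sum(days_by_sales.values())

# Relative frequency of each value of X
distribution = {x: Fraction(days, total_days)
                for x, days in days_by_sales.items()}

for x, p in distribution.items():
    print(x, float(p))

# The probabilities of all possible values must add up to 1
assert sum(distribution.values()) == 1
```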

Example 4.2
Given the following probability distribution for a random variable X, find the value of h.

x 0 1 2 3 4
P (X = x) 0.05 0.35 0.5 h 0.02

Solution 4.2
For a probability distribution, $\sum_{\text{all } x} P(X = x) = 1$, so
\[
\begin{aligned}
0.05 + 0.35 + 0.5 + 0.02 + h &= 1 \\
0.92 + h &= 1 \\
h &= 0.08
\end{aligned}
\]
Activity 4.1
The probability distribution of a random variable is tabled below:

x 0 1 2 3 4
P (X = x) 0.05 a 2a 0.25 0.1

Find the value of a.

4.2.1 Expectation of a discrete random variable


Let X be a discrete random variable. The expectation of X, denoted by E(X), is found by
multiplying each value of X by its probability and then adding up the products, that is,
\[ \mu = E(X) = \sum_{\text{all } x} x\,P(X = x) \]   [4.1]
The mean is in fact a weighted average of the possible values of X – the weights being the
probabilities.

Example 4.3
Let X be a discrete random variable with the following probability distribution

x 0 1 2
P (X = x) 0.25 0.5 0.25

Find the expectation of X.

Solution 4.3
\[
\begin{aligned}
E(X) &= \sum_{\text{all } x} x\,P(X = x) \\
     &= 0(0.25) + 1(0.5) + 2(0.25) \\
     &= 0 + 0.5 + 0.5 \\
     &= 1
\end{aligned}
\]

Example 4.4
By investing in a particular stock for one year, an investor hopes to make a profit of $100,
$400, $500 or $1 000 with probabilities 0.4, 0.3, 0.18 and 0.12 respectively. What is his
expected profit?

Solution 4.4
The probability distribution is summarised in the following table:

x          100   400   500    1 000
P(X = x)   0.4   0.3   0.18   0.12

\[
\begin{aligned}
E(X) &= 100(0.4) + 400(0.3) + 500(0.18) + 1000(0.12) \\
     &= 40 + 120 + 90 + 120 \\
     &= 370
\end{aligned}
\]
His expected profit is $370.
4.2.2 Mean of a function of a random variable
Let h(X) be a function of the discrete random variable X. The expected value of h(X) is given
by:
\[ E[h(X)] = \sum_{\text{all } x} h(x)\,P(X = x) \]   [4.2]

Example 4.5
Let X be a discrete random variable with the following probability distribution:

x 0 1 2 3
P (X = x) 0.25 0.4 0.1 0.25

Find
a) E (X)
b) E (2X)
c) E (2X – 1)

Solution 4.5
x 0 1 2 3
2x 0 2 4 6
2x – 1 -1 1 3 5
P( X= x) 0.25 0.4 0.1 0.25

a)
\[
\begin{aligned}
E(X) &= \sum_{\text{all } x} x\,P(X = x) \\
     &= 0(0.25) + 1(0.4) + 2(0.1) + 3(0.25) \\
     &= 0 + 0.4 + 0.2 + 0.75 \\
     &= 1.35
\end{aligned}
\]

b)
\[
\begin{aligned}
E(2X) &= \sum_{\text{all } x} 2x\,P(X = x) \\
      &= 0(0.25) + 2(0.4) + 4(0.1) + 6(0.25) \\
      &= 0 + 0.8 + 0.4 + 1.5 \\
      &= 2.7
\end{aligned}
\]

Remark 4.1 Take note that E(2X) = 2E(X). In general, E(aX) = aE(X).

c)
\[
\begin{aligned}
E(2X - 1) &= \sum_{\text{all } x} (2x - 1)\,P(X = x) \\
          &= -1(0.25) + 1(0.4) + 3(0.1) + 5(0.25) \\
          &= 1.7
\end{aligned}
\]
Remark 4.2 You should realise that E(2X − 1) = 2E(X) − 1. In general, the expected value
of a linear function of a random variable is given by:
\[ E(aX + b) = aE(X) + b \]   [4.3]
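Formulae [4.2] and [4.3] can be illustrated with a small Python sketch. The function `expectation` and the dictionary representation of the distribution are our own illustrative choices; the figures are those of Example 4.5.

```python
import math


def expectation(distribution, h=lambda x: x):
    """E[h(X)] for a discrete distribution given as {value: probability}."""
    return sum(h(x) * p for x, p in distribution.items())


dist = {0: 0.25, 1: 0.4, 2: 0.1, 3: 0.25}

e_x = expectation(dist)                               # E(X)
e_2x = expectation(dist, lambda x: 2 * x)             # E(2X)
e_2x_minus_1 = expectation(dist, lambda x: 2 * x - 1) # E(2X - 1)

print(round(e_x, 4), round(e_2x, 4), round(e_2x_minus_1, 4))   # 1.35 2.7 1.7

# Linearity of expectation: E(aX + b) = aE(X) + b
assert math.isclose(e_2x_minus_1, 2 * e_x - 1)
```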
 

Activity 4.2
The number of wooden chairs made per month by a backyard carpentry shop is a random
variable with the following probability distribution.

X 19 20 21 22
P (X = x) 0.4 0.25 0.2 0.15
a) What is the most probable number of chairs produced per month?
b) Find the probability that the number of chairs that will be made next month is at least
20.
c) Find the probability that the number of chairs produced per month is at most 21.
d) Find the expected number of chairs produced per month.
e) Suppose that the carpentry shop incurs fixed monthly costs of $100 and an additional
construction cost of $5 per chair. Find the expected monthly cost of the operation.

4.2.3 Variance and standard deviation of a discrete random variable


The variance of a discrete random variable is the expected squared deviation of the random
variable from its mean. It is given by the following formula:
\[ \sigma^2 = \sum_{\text{all } x} (x - \mu)^2\,P(X = x) \]   [4.4]

However, the following formula is easier to use and is therefore preferred in computations.
\[ \sigma^2 = \mathrm{Var}(X) = E(X^2) - [E(X)]^2 \]   [4.5]

The standard deviation, σ, is then found by taking the positive square root of the variance.

Example 4.6
Find the variance and standard deviation of a discrete random variable X associated with the
following distribution.

x          0      1     2
P(X = x)   0.25   0.5   0.25

Solution 4.6
\[
\begin{aligned}
E(X) &= \sum_{\text{all } x} x\,P(X = x) \\
     &= 0(0.25) + 1(0.5) + 2(0.25) \\
     &= 0 + 0.5 + 0.5 \\
     &= 1
\end{aligned}
\]
\[
\begin{aligned}
E(X^2) &= \sum_{\text{all } x} x^2\,P(X = x) \\
       &= 0^2(0.25) + 1^2(0.5) + 2^2(0.25) \\
       &= 0 + 0.5 + 1 \\
       &= 1.5
\end{aligned}
\]
\[
\begin{aligned}
\sigma^2 = \mathrm{Var}(X) &= E(X^2) - [E(X)]^2 \\
       &= 1.5 - 1^2 \\
       &= 0.5
\end{aligned}
\]
\[ \text{Standard deviation} = \sqrt{0.5} = 0.707106781 \approx 0.7071 \]
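The equivalence of the definition [4.4] and the computational formula [4.5] can be seen numerically in the sketch below, which uses the distribution of Example 4.6. The variable names are illustrative.

```python
import math

dist = {0: 0.25, 1: 0.5, 2: 0.25}     # {value: probability}

mean = sum(x * p for x, p in dist.items())           # E(X)
mean_of_squares = sum(x ** 2 * p for x, p in dist.items())   # E(X^2)

# Computational formula [4.5]
variance = mean_of_squares - mean ** 2
# Definition [4.4]: expected squared deviation from the mean
variance_by_definition = sum((x - mean) ** 2 * p for x, p in dist.items())

std_dev = math.sqrt(variance)
print(mean, mean_of_squares, variance, round(std_dev, 4))   # 1.0 1.5 0.5 0.7071
assert math.isclose(variance, variance_by_definition)
```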

Activity 4.3
The profits to be realised from a certain business venture, to the nearest $1 000, are
believed to follow the probability distribution shown below.

x          -2 000   -1 000   0     1 000   2 000   3 000
P(X = x)   0.1      0.1      0.2   p       0.3     0.1

a) Determine the value of p.


b) Find the probability that the business venture
i. makes a loss
ii. realises profit of at least $2 000
c) Find the expected earnings and standard deviation of profits for the business
venture.
d) Is the venture likely to be successful? Explain.

4.3 Binomial Probability Distribution

The binomial distribution is used to model experiments which consist of a fixed number of
repeated trials, each with only two possible outcomes. By convention, one of these
outcomes is referred to as ‘success’; the other as ‘failure’.
The properties of a binomial experiment are:
• the experiment consists of n repeated trials;
• each trial results in one of two possible outcomes which are mutually exclusive;
• the probability of success (p) remains constant from trial to trial;
• the repeated trials are independent;
• the variable of interest is the number (X) of successes observed during the n trials.

Let X be the number of successes in n trials of a binomial experiment, each with probability of
success p. Then X is a binomial random variable with probability distribution given by:
\[ P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \quad \text{for } x = 0, 1, 2, \ldots, n. \]   [4.6]
A binomial random variable is completely described by specifying the two parameters, n and
p. We write X ~ B(n, p) to show that X is binomially distributed with n trials and
probability of success p in each trial.

The mean and variance of a binomial random variable depend on the two parameters n and
p.
Let X~ B (n, p), then
E(X) = n p [4.7]
Var(X) = n p (1 – p) [4.8]

Example 4.7
A new drug for a rare disease is known to be effective in 70% of the cases treated. Six
patients suffering from the disease are to be treated.
a) Find the probability that
i. 4 patients will be successfully treated.
ii. at least 2 patients will be successfully treated.
b) Find the mean and variance of the number of patients successfully treated.

Solution 4.7
Let X be the number of successfully treated patients, then X ~ B(6, 0.70).
\[ P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \]
a) (i)
\[
\begin{aligned}
P(X = 4) &= \binom{6}{4} \times 0.70^4 \times 0.30^{6-4} \\
         &= \binom{6}{4} \times 0.70^4 \times 0.30^2 \\
         &= 0.324135 \\
         &\approx 0.3241
\end{aligned}
\]

(ii)
\[
\begin{aligned}
P(X \ge 2) &= 1 - P(X < 2) \\
           &= 1 - [P(X = 0) + P(X = 1)] \\
           &= 1 - \left[\binom{6}{0} \times 0.70^0 \times 0.30^6 + \binom{6}{1} \times 0.70^1 \times 0.30^5\right] \\
           &= 1 - [0.000729 + 0.010206] \\
           &= 1 - 0.010935 \\
           &= 0.989065 \\
           &\approx 0.9891
\end{aligned}
\]

b)
\[ E(X) = np = 6 \times 0.70 = 4.2 \]
\[
\begin{aligned}
\mathrm{Var}(X) &= np(1 - p) \\
                &= 6 \times 0.70 \times 0.30 \\
                &= 1.26
\end{aligned}
\]
\[ \therefore \text{standard deviation} = \sqrt{1.26} = 1.122497216 \approx 1.1225 \]
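The binomial probabilities in this solution can be checked with the following Python sketch. It uses `math.comb` from the standard library (Python 3.8 or later) for the binomial coefficient; the function name `binomial_pmf` is our own.

```python
from math import comb, sqrt


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ B(n, p), formula [4.6]."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 6, 0.70

p_exactly_4 = binomial_pmf(4, n, p)
# Complement rule: P(X >= 2) = 1 - [P(X = 0) + P(X = 1)]
p_at_least_2 = 1 - (binomial_pmf(0, n, p) + binomial_pmf(1, n, p))

mean = n * p                    # formula [4.7]
variance = n * p * (1 - p)      # formula [4.8]

print(round(p_exactly_4, 4), round(p_at_least_2, 4))          # 0.3241 0.9891
print(round(mean, 2), round(variance, 2), round(sqrt(variance), 4))   # 4.2 1.26 1.1225
```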

Example 4.8
A recent college graduate is applying for 5 jobs and believes that he has a constant and
independent 0.6 probability of getting an offer for each application. What is the
probability that he will have:

a) no offers at all?
b) at least 4 offers?
c) fewer than 4 offers?

Solution 4.8
Let X be the number of job offers. X ~ B(5, 0.6)
a)
\[
\begin{aligned}
P(X = 0) &= \binom{5}{0} \times 0.6^0 \times 0.4^5 \\
         &= 0.01024 \\
         &\approx 0.0102
\end{aligned}
\]

b)
\[
\begin{aligned}
P(X \ge 4) &= P(X = 4) + P(X = 5) \\
           &= \binom{5}{4} \times 0.6^4 \times 0.4^1 + \binom{5}{5} \times 0.6^5 \times 0.4^0 \\
           &= 0.2592 + 0.07776 \\
           &= 0.33696
\end{aligned}
\]

c) Using the binomial formula here involves finding the probabilities that X = 0, X =
1, X = 2 and X = 3 and then adding them up to find P(X < 4). This process
may take quite some time and it is recommended that you make use of the binomial tables
which are attached in the appendices. The tables give cumulative binomial
probabilities for selected n and p values.

Here n = 5, p = 0.6 and k = 3

∴ P(X < 4) = P(X ≤ 3) = 0.663

This agrees with part (b), since P(X < 4) = 1 − P(X ≥ 4) = 1 − 0.33696 = 0.66304.
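Where tables are not to hand, a cumulative binomial probability can be obtained by summing the individual terms. The sketch below is illustrative; the function name `binomial_cdf` is our own and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ B(n, p), summing formula [4.6] over x = 0, ..., k."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))


# Example 4.8(c): fewer than 4 offers out of 5 applications with p = 0.6
p_fewer_than_4 = binomial_cdf(3, 5, 0.6)
print(round(p_fewer_than_4, 3))   # 0.663
```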

Activity 4.4
1. In a world cup final soccer match, the teams have remained deadlocked after extra
time. To decide the winner, each team is to take 5 penalty kicks. If 60% of all
penalties taken in previous world cup matches are known to be scored, what is the
probability that a team will:
a) score all 5 penalties?
b) score not less than 4 penalties?
c) fail to score all penalties?
2. Two in 10 oranges harvested from an orchard are known to be infected.
a) Calculate the probability that in a basket of 6 oranges taken from the orchard:
i. none are infected.
ii. more than half are infected.
b) Find the:
i. expected number of infected oranges, and
ii. standard deviation of infected oranges from the orchard.

4.4 Poisson Probability Distribution


The Poisson distribution is used to model events that can happen over a period of time or
space. It allows us to find the probability that a specific number of incidents happen over a
particular period. The following are examples of events that can be modelled by the Poisson
probability distribution:
• The number of phone calls received per hour by an operator.
• The number of patients admitted at a hospital per day.
• The number of deaths occurring per day due to a cholera outbreak.
• The number of typing errors per page of a manuscript.

The variable of interest is the discrete number of occurrences per unit time/space.
The conditions for a Poisson distribution are as follows:
• there is no fixed number of trials
• the mean number of events λ is known

The Poisson distribution is specified by one parameter, that is, the mean λ. We write
X ~ P0(λ) to show that X follows the Poisson probability distribution with mean λ. Let X be a
Poisson distributed random variable with mean λ, then the probability that the variable X takes a
particular value x is given by:
\[ P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!} \quad \text{for } x = 0, 1, 2, \ldots \]   [4.9]

The mean and variance of X are both equal to λ .

Example 4.9
The average number of vehicle accidents per day at a busy road intersection is 2. Find the
probability that:
a) no accident occurs in a given day.
b) at least 2 accidents occur in a given day.
c) exactly 3 accidents occur in a period of two days.

Solution 4.9
Let X be the number of accidents per day, then X ~ P0(2).
\[ P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!} \]
a) No accident implies that x takes the value 0.
\[
\begin{aligned}
P(X = 0) &= \frac{e^{-2}\,2^0}{0!} \\
         &= 0.135335283 \\
         &\approx 0.1353
\end{aligned}
\]

b) At least 2 accidents means 2 or 3 or 4 or … up to infinity, since there is no fixed
number of trials. The probability is evaluated using the complement rule, that is:
\[
\begin{aligned}
P(X \ge 2) &= 1 - P(X < 2) \\
           &= 1 - [P(X = 0) + P(X = 1)] \\
           &= 1 - \left[\frac{e^{-2}\,2^0}{0!} + \frac{e^{-2}\,2^1}{1!}\right] \\
           &= 1 - [0.135335283 + 0.270670566] \\
           &= 1 - 0.406005849 \\
           &= 0.59399415 \\
           &\approx 0.5940
\end{aligned}
\]

c) The reference period is now 2 days, therefore the mean becomes 2 × 2 = 4 so that
X ~ P0(4).
\[
\begin{aligned}
P(X = 3) &= \frac{e^{-4}\,4^3}{3!} \\
         &= 0.195366814 \\
         &\approx 0.1954
\end{aligned}
\]
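The three parts of Example 4.9 can be checked with a short Python sketch based on formula [4.9]. The function name `poisson_pmf` and the variable names are illustrative.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with mean lam, formula [4.9]."""
    return exp(-lam) * lam ** x / factorial(x)


daily_mean = 2   # average number of accidents per day

p_none = poisson_pmf(0, daily_mean)
p_at_least_2 = 1 - (poisson_pmf(0, daily_mean) + poisson_pmf(1, daily_mean))
# Over two days the mean scales with the length of the period: 2 x 2 = 4
p_three_in_two_days = poisson_pmf(3, 2 * daily_mean)

print(round(p_none, 4), round(p_at_least_2, 4), round(p_three_in_two_days, 4))
# 0.1353 0.594 0.1954
```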

Example 4.10
Martha makes on average 0.5 typing errors on each page. Suppose she is tasked to type a
report with 8 pages, what is the probability that she makes
a) no mistakes at all?
b) no more than 4 mistakes?

Solution 4.10
Let X be the number of typing errors. If the average is 0.5 per page, then in 8 pages the
average would be 8 x 0.5 = 4.

Now X ~ P0(4)
a)
\[
\begin{aligned}
P(X = 0) &= \frac{e^{-4}\,4^0}{0!} \\
         &= 0.018315638 \\
         &\approx 0.0183
\end{aligned}
\]

b) P(X ≤ 4) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)

Since the calculations here are long and laborious, it is advisable to use the Poisson tables found
in the appendices. The tables give cumulative Poisson probabilities.

With expected value λ = m = 4 and x = c = 4, the tables give P(X ≤ 4) = 0.629.
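The cumulative probability read from the tables can also be obtained by summing the five Poisson terms directly, as in this Python sketch. The function name `poisson_cdf` is our own.

```python
from math import exp, factorial


def poisson_cdf(c, lam):
    """P(X <= c) for a Poisson variable with mean lam."""
    return sum(exp(-lam) * lam ** x / factorial(x) for x in range(c + 1))


# Example 4.10(b): no more than 4 typing errors in 8 pages, mean 8 x 0.5 = 4
p_at_most_4 = poisson_cdf(4, 8 * 0.5)
print(round(p_at_most_4, 3))   # 0.629
```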


Activity 4.5
1. The mean number of car thefts in Harare is 3 per month. What is the probability that a
month passes when there are
a) no car thefts?
b) more than 2 thefts?
c) 4 or fewer car thefts?
2. A rural district hospital admits on average 2 patients per week for malaria treatment. Find
the probability that
a) 3 patients are admitted per week for malaria treatment.
b) at most 5 patients are admitted per week for malaria treatment.
c) at least 10 patients are admitted per month for malaria treatment.

4.5 The Normal Probability Distribution


The normal curve is bell-shaped as shown in Figure 4.1. The curve is completely specified by
the mean μ and variance σ 2 of the distribution under investigation.

Figure 4.1: The Normal Curve

The properties of the normal curve/distribution are:


• The curve is symmetric about the mean
• It is unimodal – has a single peak
• At the line of symmetry, the mean, median and mode coincide, that is, mean = median
= mode
• The curve approaches the horizontal axis asymptotically as we proceed in either
direction away from the centre. This means that the curve never touches the
horizontal axis at either end but extends to infinity.
• The total area under the curve and above the horizontal axis is equal to 1

4.5.1 Standard normal curve


We have an infinite number of normal curves because different values of mean μ and
variance σ 2 will give rise to different normal curves with varying centres and peakedness.
However, one is selected as our standard. A mean of zero and variance of one will give rise to
the standard normal curve.

Let X be a random variable that is normally distributed with mean μ and variance σ². We
write X ~ N(μ, σ²). For example, if the mean of X is 10 and the variance is 25, we write X ~
N(10, 5²) where 5 is the standard deviation. A random variable with mean zero and variance
one is called a standard normal variable and is denoted by Z, that is Z ~ N(0, 1). The
distribution of Z is called the standard normal distribution. An arbitrary normally distributed
variable X is transformed to the standard normal distribution by the transformation
\[ Z = \frac{X - \mu}{\sigma} \]   [4.10]

The area under a normal curve between any two specified points gives the probability that the
random variable assumes values between the two points. In Figure 4.2, the shaded area gives
the probability that a random variable X assumes values between x = x1 and x = x2, that is
P(x1 < X < x2).

Figure 4.2 Area under the Curve between x1 and x2

To find P(x1 < X < x2), we must find standard values corresponding to x1 and x2 by the
transformation
\[ z_1 = \frac{x_1 - \mu}{\sigma} \quad \text{and} \quad z_2 = \frac{x_2 - \mu}{\sigma}. \]
It now follows that P(x1 < X < x2) = P(z1 < Z < z2) and Figure 4.2 is transformed to look like
Figure 4.3 below.

Figure 4.3 Area under the Standard Normal Curve between z1 and z2

Table 1 in the appendices gives values for the area under the standard normal curve lying to
the left of any specified z value for values of z from -3.4 to 3.4. The area corresponds to the
probability that a given value is less than or equal to z, that is, P(Z ≤ z ) .

Example 4.11
A random variable X is normally distributed with a mean of 10 and variance 25. Find
standard values (z-values) corresponding to:
a) x = 12
b) x = 8

Solution 4.11
X ~ N(10, 5²)
We make use of the transformation $Z = \frac{X - \mu}{\sigma}$ with μ = 10 and σ = 5.
a) x = 12:
\[ z = \frac{12 - 10}{5} = 0.4 \]

b) x = 8:
\[ z = \frac{8 - 10}{5} = -0.4 \]

Activity 4.6
A random variable X is normally distributed with a mean of 15 and variance 36. Find
standard values corresponding to:
a) x = 16
b) x = 13

4.5.2 Evaluating probabilities using the standard normal tables


Table 1 in the appendices is used to find the probability that Z is less than or equal to a
certain value z, or greater than or equal to z. We will use a few examples to illustrate how this is
done.

Example 4.12
Let Z ~ N (0, 1). Find
a) P (Z ≤ 1.34)
b) P (Z ≤ −2.75)
c) P (Z ≥ 1.62)
d) P (0.47 ≤ Z ≤ 1.86)

Solution 4.12
a) The probability P (Z ≤ 1.34) is given by the area shown in Figure 4.4

Figure 4.4 The Probability P(Z ≤ 1.34)

To find P (Z ≤ 1.34) , we locate a value of z equal to 1.3 in the left column of Table 1.
We then move across the row to the column under 0.04 where we read 0.9099.
Therefore, P (Z ≤ 1.34) = 0.9099.

b) The area required to find P(Z ≤ −2.75) is shown in Figure 4.5.

Figure 4.5 The Area Required to Find P(Z ≤ −2.75)

We locate a value of z = -2.7 under the left column. We then move across the row to
the column under 0.05, giving P (Z ≤ −2.75) = 0.0030.

c) P (Z ≥ 1.62)
The area required is the area under the standard normal curve to the right of z = 1.62 as
shown in Figure 4.6

Figure 4.6 The Probability P(Z ≥ 1.62)

In the left column of Table 1, go to a value of z equal to 1.6, then move across that row
to the column under 0.02 where you read 0.9474. This is the area to the left of 1.62, but
we want the area to the right of 1.62 as shown in Figure 4.6. You should remember
that the total area under the curve is equal to 1. Therefore, if we subtract the area to the
left of z =1.62 from 1, the remaining area to the right of 1.62 gives us P (Z ≥ 1.62).

\[
\begin{aligned}
P(Z \ge 1.62) &= 1 - P(Z < 1.62) \\
              &= 1 - 0.9474 \\
              &= 0.0526
\end{aligned}
\]

d) Figure 4.7 shows the area between z = 0.47 and z = 1.86

Figure 4.7 The Area between z = 0.47 and z = 1.86

The shaded area is obtained by subtracting the area to the left of z = 0.47 from the area
to the left of z = 1.86, that is,
\[
\begin{aligned}
P(0.47 \le Z \le 1.86) &= P(Z \le 1.86) - P(Z \le 0.47) \\
                       &= 0.9686 - 0.6808 \\
                       &= 0.2878
\end{aligned}
\]
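The table look-ups in Example 4.12 can be cross-checked numerically. The sketch below computes the area to the left of z from the error function in Python's standard `math` module, using the identity Φ(z) = ½[1 + erf(z/√2)]. The function name `phi` is our own choice.

```python
from math import erf, sqrt


def phi(z):
    """Area under the standard normal curve to the left of z, P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))


p_a = phi(1.34)                 # P(Z <= 1.34)
p_b = phi(-2.75)                # P(Z <= -2.75)
p_c = 1 - phi(1.62)             # P(Z >= 1.62), by the complement rule
p_d = phi(1.86) - phi(0.47)     # P(0.47 <= Z <= 1.86)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4), round(p_d, 4))
```

Because the tables round each area to four decimal places before any subtraction, a directly computed answer may differ from a table-based one in the last decimal place.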

Remark 4.3
The probability that a continuous variable takes a precise value is zero. This implies that the
probability that Z is less than or equal to 1.25, say, is the same as the probability that Z is less
than 1.25. In general, P(Z ≤ z) = P(Z < z).

Activity 4.7
Let Z ~ N (0, 1). Find
a) P (Z ≤ 3.10)
b) P (Z ≥ −0.27 )
c) P (-1.45 ≤ Z ≤ 2.63)

Example 4.13
Given a random variable X which is normally distributed with mean 15 and variance 100,
find:
a) P(X < 20)
b) P(X > 12)
c) P( 12 < X < 20)

Solution 4.13
X ~ N(15, 10²)
We begin by finding z-values corresponding to the x-values given using the transformation
given by equation 4.10.
a)
\[
\begin{aligned}
P(X < 20) &= P\left(\frac{X - \mu}{\sigma} < \frac{20 - 15}{10}\right) \\
          &= P(Z < 0.5) \\
          &= 0.6915
\end{aligned}
\]
b)
\[
\begin{aligned}
P(X > 12) &= P\left(\frac{X - \mu}{\sigma} > \frac{12 - 15}{10}\right) \\
          &= P(Z > -0.3) \\
          &= 1 - P(Z < -0.3) \\
          &= 1 - 0.3821 \\
          &= 0.6179
\end{aligned}
\]

c)
\[
\begin{aligned}
P(12 < X < 20) &= P\left(\frac{12 - 15}{10} < \frac{X - \mu}{\sigma} < \frac{20 - 15}{10}\right) \\
               &= P(-0.3 < Z < 0.5) \\
               &= P(Z < 0.5) - P(Z < -0.3) \\
               &= 0.6915 - 0.3821 \\
               &= 0.3094
\end{aligned}
\]

4.5.3 Practical problems


In this section we are going to solve practical problems using the normal probability
distribution.

Example 4.14
The delays that are experienced at a border post by truck drivers to clear their cargo were
found to be normally distributed with mean 48 hours and a standard deviation of 6 hours.
Find the probability that a driver has to wait for:
a) at least 36 hours to clear his cargo.
b) between 40 hours and 50 hours to clear his cargo.

Solution 4.14
Let X be the total waiting time to get clearance. Then X ~ N(48, 6²).

a)
$$
\begin{aligned}
P(X \ge 36) &= P\left(\frac{X - \mu}{\sigma} \ge \frac{36 - 48}{6}\right) \\
&= P(Z \ge -2) \\
&= 1 - P(Z < -2) \\
&= 1 - 0.0228 \\
&= 0.9772
\end{aligned}
$$

b)
$$
\begin{aligned}
P(40 < X < 50) &= P\left(\frac{40 - 48}{6} < \frac{X - \mu}{\sigma} < \frac{50 - 48}{6}\right) \\
&= P(-1.33 < Z < 0.33) \\
&= P(Z < 0.33) - P(Z < -1.33) \\
&= 0.6293 - 0.0918 \\
&= 0.5375
\end{aligned}
$$

Example 4.15
The demand for second hand Japanese cars in Zimbabwe is normally distributed with a mean
of 1 600 cars sold per month and standard deviation of 50 cars. What is the probability that:
a) at most 1 500 cars will be sold in one month?
b) between 1 500 and 1 650 cars will be sold in one month?

Solution 4.15
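Let X be the number of cars sold per month; then X ~ N(1 600, 50²).

a)
$$
\begin{aligned}
P(X \le 1500) &= P\left(\frac{X - \mu}{\sigma} \le \frac{1500 - 1600}{50}\right) \\
&= P(Z \le -2) \\
&= 0.0228
\end{aligned}
$$

b)
$$
\begin{aligned}
P(1500 < X < 1650) &= P\left(\frac{1500 - 1600}{50} < Z < \frac{1650 - 1600}{50}\right) \\
&= P(-2 < Z < 1) \\
&= P(Z < 1) - P(Z < -2) \\
&= 0.8413 - 0.0228 \\
&= 0.8185
\end{aligned}
$$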

Activity 4.8

1. The times that cars took to refuel at a busy service station are normally distributed
with mean 3 minutes and a standard deviation of 0.2 minutes. What is the probability
that a car will take
a) more than 4 minutes to refuel?
b) not more than 2 minutes to refuel?
c) between 2 and 4 minutes to refuel?
2. A fast food restaurant finds that the number of meals it serves in a week is normally
distributed with a mean of 4 000 and a standard deviation of 200. What is the
probability that in a given week the number of meals served will be:
a) at most 4 500?
b) between 4 000 and 4 500?
3. On average a tuck-shop sells 300 loaves of bread per day with a standard deviation of
50 loaves. Find the probability that the tuck-shop will sell at least 400 loaves per day.

4.6 Summary

In this unit, we calculated the mean and variance of a discrete random variable when given
the probability distribution. We defined a probability distribution as a rule that assigns
probabilities to the different values of a random variable.

We also looked at the binomial and Poisson probability distributions. The binomial
distribution is used to model experiments that consist of a fixed number of repeated trials,
each with only two possible outcomes, while the Poisson distribution is used to model the
number of events that occur over a period of time or a region of space.

We ended by looking at the normal probability distribution, which is used to model
continuous random variables. The distribution is completely specified by two parameters,
which are the mean and variance of the distribution. The standard normal distribution has a
mean of 0 and a variance of 1. The area under the standard normal curve gives probabilities.
An arbitrary normal distribution is transformed to the standard normal distribution to
facilitate the evaluation of probabilities using prepared tables.

Further Reading

Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.

Unit 5

Statistical Estimation
5.0 Introduction
Statistical investigations are usually carried out on samples drawn randomly from populations
of interest. As a result, statistical analysis will be based on sample data rather than population
data. The major reasons for this are that it is usually too expensive and time consuming to
collect population data. Sometimes it is also impossible to obtain population data. The results
of the sample study are then used to estimate results for the population thereby allowing
important decisions about the population to be made.

In this unit we introduce you to an important branch of statistical inference called statistical
estimation. You will learn about point estimation and confidence interval estimation for the
mean and proportion of a single population.

5.1 Objectives

By the end of the unit, you should be able to:


• provide justification for estimating population parameters
• distinguish between point estimation and confidence interval estimation
• find point estimates of population parameters
• construct confidence interval estimates of population parameters
• state advantages of interval estimation over point estimation
• find confidence intervals for the difference in two population means and in two
population proportions
• use a confidence interval for hypothesis testing

5.2 What is Statistical Estimation?

Statistical estimation involves the use of sample statistics to predict the corresponding
population parameters. In this process, a sample measure is used as an estimate of the
corresponding population measure. A sample proportion p̂ is used as an estimator of the
population proportion p . For example, an aspiring Member of Parliament (MP) may want to
estimate the true proportion of voters that favour him in a constituency. The MP would obtain
the opinions of a random sample of eligible voters in the constituency. The fraction of voters
in the sample who favour the MP could be used as an estimate of the true proportion of voters
who are likely to vote for the MP.

Similarly, the sample mean x̄ is taken as an estimator of the population mean μ, while the
sample variance s² is taken as an estimator of the population variance σ². It is important to
ensure that samples used in statistical analysis are representative of their parent populations.
The only way to ensure that this is the case is to select random samples using probability
based sampling methods.

Estimation takes two forms, namely point estimation and confidence interval estimation.

5.3 Point Estimation

In point estimation, a single value of a statistic is used as an estimate of the population
parameter. The disadvantages of a point estimate are that:
• It is not exactly equal to the population parameter most of the time. The actual estimate
may or may not be close to it.
• It is uncertain whether it will be a good estimate and we have no idea of the
probability that it will be a good estimate. A point estimate does not reveal any
information about the accuracy of the estimation procedure.

5.3.1 Point estimator of the population mean


The point estimator of the population mean is the sample mean x̄ given by:

$$\bar{x} = \frac{1}{n}\sum x_i \qquad [5.1]$$

where x₁, x₂, ..., xₙ are n randomly selected sample values drawn from the population.

Example 5.1
The daily sales ($) of a vegetable vendor over 30 randomly selected days are:
14 21 28 17 15 34 10 18 25 30 21 15 11 28 17 20 20 29 31 24 11 19 26
34 10 16 25 30 22 17

Find the point estimate of the population mean.

Solution 5.1
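n = 30, ∑x = 638

$$
\begin{aligned}
\bar{x} &= \frac{\sum x}{n} \\
&= \frac{638}{30} \\
&= 21.26666667 \\
&\approx 21.27
\end{aligned}
$$

The point estimate of the population mean is $21.27.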

5.3.2 Point estimator of the population variance


The best estimator of the population variance is the sample variance which is given by:

$$s^2 = \frac{1}{n-1}\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right) \qquad [5.2]$$

Example 5.2
Using the data of Example 5.1, find the point estimate for the variance.

Solution 5.2
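n = 30, ∑x = 638, ∑x² = 15046

$$
\begin{aligned}
s^2 &= \frac{1}{n-1}\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right) \\
&= \frac{1}{30-1}\left(15046 - \frac{(638)^2}{30}\right) \\
&= \frac{1}{29}(15046 - 13568.13333) \\
&= \frac{1}{29}(1477.866667) \\
&= 50.96091954 \\
&\approx 50.9609
\end{aligned}
$$
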
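
The point estimates in Examples 5.1 and 5.2 can be reproduced with the Python standard
library. This is a minimal sketch; the variable names are illustrative. Note that
statistics.variance uses the n − 1 divisor, which matches formula [5.2].

```python
from statistics import mean, variance

# Daily sales ($) of the vegetable vendor over 30 randomly selected days (Example 5.1).
sales = [14, 21, 28, 17, 15, 34, 10, 18, 25, 30, 21, 15, 11, 28, 17,
         20, 20, 29, 31, 24, 11, 19, 26, 34, 10, 16, 25, 30, 22, 17]

n = len(sales)
x_bar = mean(sales)      # point estimate of the population mean
s2 = variance(sales)     # sample variance with divisor n - 1

print(n, sum(sales), sum(x * x for x in sales))
print(round(x_bar, 2), round(s2, 4))
```

The first line of output gives n, ∑x and ∑x², which can be compared with the totals used in
the hand calculation.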
Activity 5.1
1. The weights (kg) of 20 bags of potatoes randomly selected from a truckload of
250 bags of potatoes are:
10.2 10.1 9.8 9.9 10.0 9.8 10.3 10.1 10.4 9.7 8.9 9.0 10.6 10.9 11.0
11.3 10.2 12.0 9.8 10.7
Find point estimates of the
a) population mean weight, and
b) population variance of the weight of all potatoes in the truck.
2. The bank balances of 30 randomly selected savings accounts are:
200 128 132 400 380 24 267 306 86 94 125 106 249 364 59
34 126 184 230 342 311 265 46 38 89 122 241 237 98 106
Find point estimates for the
a) population mean balance, and
b) population variance of balances of all savings accounts.

5.3.3 Point estimator of the population proportion


An estimator of the population proportion p is given by:

$$\hat{p} = \frac{k}{n} \qquad [5.3]$$

where k is the number of elements with the desired characteristic and n is the sample size.

Example 5.3
Refer to the data of Example 5.1. Find the point estimate of the population proportion of daily
sales which are above $20.

Solution 5.3
k = 15, n = 30 (15 of the 30 daily sales figures are above $20)

$$\hat{p} = \frac{k}{n} = \frac{15}{30} = 0.5$$

Example 5.4
In a study to determine the proportion of teachers in Zimbabwe who are degree holders, 800
teachers out of a random sample of 2 000 teachers said they have degree qualifications.
a) Find a point estimate of the proportion of all Zimbabwean teachers who have degrees.
b) If there are 15 000 teachers altogether in the country, estimate how many have degrees.

Solution 5.4
n = 2000, k = 800

a)
$$\hat{p} = \frac{k}{n} = \frac{800}{2000} = 0.4$$

Thus an estimated 40% of all the teachers in Zimbabwe have degrees.

b) Estimated number of teachers with degrees = 40% of 15 000 = 6 000.

Activity 5.2
1. Refer to Question 1 of Activity 5.1 and suppose the standard weight of a bag of potatoes is
10 kg. Find an estimate of the proportion of all potato bags that are underweight.
2. A church organisation has a total membership of 600. A survey conducted at the
church showed that 80 church members out of a random sample of 200 members had
bibles. Find a point estimate of the proportion of church members who do not have
bibles.

We will now consider point estimates of the difference of two population means and
proportions.
5.3.4 Point Estimate of the difference of two population means
The difference between two sample means, x̄₁ − x̄₂, is an unbiased estimator of the difference
between two population means, μ₁ − μ₂, where x̄₁ is the mean of the first sample and x̄₂ is the
mean of the second sample, with μ₁ and μ₂ being the respective population means.

Remark 5.1
A statistic is an unbiased estimator of a population parameter if the mean of values of the
statistic for all samples is equal to the parameter.

Example 5.5
The following are tuition fees in dollars at 8 randomly selected private schools and 10
randomly selected public schools:
Private schools: 500 550 495 450 510 650 450 600
Public schools: 350 390 300 350 375 400 250 380 300 350
Find a point estimate of the difference in the mean of fees charged by the two kinds of
school.

Solution 5.5
Let the sample mean for private schools be x̄₁ and that for public schools be x̄₂.

Average fees at private schools, x̄₁ = 4205/8 = 525.63
Average fees at public schools, x̄₂ = 3445/10 = 344.50
∴ point estimate of the difference in means = 525.63 − 344.50
= $181.13
This means that private schools charge on average an estimated $181.13 more than public schools.
Activity 5.3
An airtime vendor at a Roadport bus terminus recorded over one week the daily sales of two
types of juice card that he sells. The results are summarised below:
Buddie: 40 49 32 50 38 45 47
Easycall: 32 27 29 22 34 39 22
Find a point estimate of the difference in mean sales of the two types of juice card.

5.3.5 Point estimate of the difference of two population proportions


A population proportion is the fraction of the elements of a population that have a certain
desired characteristic. The sample proportion p̂ is used to estimate the population proportion p
and is given by p̂ = k/n, where k is the number of elements in the sample with the desired
characteristic and n is the sample size.

The difference between two sample proportions p̂1 - p̂2 is an unbiased estimator of the
difference between two population proportions, p1 - p2 .

Example 5.6
In a study at an institution of higher learning, it was discovered that 600 out of 1 000 male
students drink alcohol while 200 out of 500 female students drink alcohol. Find a point
estimate of the difference of actual proportions of male and female students who drink
alcohol at this institution.

Solution 5.6
Proportion of male drinkers: $\hat{p}_1 = \frac{600}{1000} = 0.6$
Proportion of female drinkers: $\hat{p}_2 = \frac{200}{500} = 0.4$
∴ a point estimate of the difference in the two proportions is

$$\hat{p}_1 - \hat{p}_2 = 0.6 - 0.4 = 0.2$$

This means that at the institution the proportion of male students who drink alcohol is
estimated to be 20 percentage points higher than the proportion of female students who drink.

Activity 5.4
During the last Heroes Holiday, 6 out of 10 accidents that happened along the Harare-Mutare
road involved commuter omnibuses while 2 out of 10 involved conventional buses. Find a point
estimate of the difference of actual proportions of accidents due to omnibuses and
conventional buses in the country recorded over the Heroes Holiday. Comment on the
distribution of the accidents.

5.4 Confidence Interval Estimation
A confidence interval is a range of numbers believed to include an unknown population
parameter. Attached to the interval is a measure of our confidence that the interval indeed
contains the population parameter.

The general formula for a confidence interval is given by:

estimate ± (table value) × (standard error of the estimate)

where the table value is read from the z-table or the t-table at the chosen level of confidence.

Confidence interval estimation is preferred to point estimation because the probability that
the interval includes the population measure is known. This is an advantage of interval
estimation over point estimation in that the probability is a measure of our confidence in the
estimated result.

Suppose that a 95% confidence interval for the population mean is, say, (10, 13). We are then
95% confident that the mean lies in the range (10, 13), in the sense that 95% of intervals
constructed in this way from repeated random samples would include the population mean. In
this module we express this loosely by saying that the probability that the mean is included
in the interval (10, 13) is 0.95 and the probability that it is not contained in the interval
is 0.05, or 5%. The 5% is the level of error associated with our confidence interval estimate;
it is called the level of significance and it is denoted by α.

Activity 5.5
1. Given that a 99% confidence interval for a population mean is (8, 10), state
i. the lower and upper confidence limits;
ii. the probability of the mean lying in the interval (8, 10); and
iii. the probability of the mean lying outside the interval (8, 10).
2. What are the advantages of interval estimation over point estimation?

5.4.1 Interval estimate of the population mean


The formula that we use to find a confidence interval estimate for the population mean μ
depends on whether the population variance is known and on whether the sample size is large
or small. A sample of size 30 or more is considered a large sample; otherwise it is a small
sample. There are three cases to consider:

Case I
If the population standard deviation σ is known, a 100(1 − α)% confidence interval for μ
is given by:

$$\bar{x} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} \qquad [5.4]$$

where x̄ is the mean of a sample of size n from a population with variance σ², $Z_{\alpha/2}$ is
the value of the standard normal distribution such that the area under the curve to the right
of it is α/2, and σ/√n is the standard error of the mean.

Example 5.7
An electrical firm supplies light bulbs that have a length of life that is approximately
normally distributed with a standard deviation of 20 hours. If a random sample of 40 bulbs
has an average life of 800 hours, find

a) a 95% confidence interval for the population mean life of all bulbs supplied by this
firm
b) a 99% confidence interval for the population mean life of all bulbs supplied by this
firm

Solution 5.7
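σ = 20, n = 40, x̄ = 800

a) α = 0.05 ⇒ $Z_{0.05/2} = Z_{0.025} = 1.96$. A 95% confidence interval for μ is

$$
\begin{aligned}
\bar{x} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} &= 800 \pm 1.96 \times \frac{20}{\sqrt{40}} \\
&= 800 \pm 6.198064214 \\
&= (793.8019,\ 806.1981)
\end{aligned}
$$

Thus we are 95% confident that the mean life of all bulbs is between 793.8019 hours
and 806.1981 hours.

b) α = 0.01 ⇒ $Z_{0.01/2} = Z_{0.005} = 2.5758$. A 99% confidence interval for μ is

$$
\begin{aligned}
\bar{x} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}} &= 800 \pm 2.5758 \times \frac{20}{\sqrt{40}} \\
&= 800 \pm 8.145394797 \\
&= (791.8546,\ 808.1454)
\end{aligned}
$$

Thus we are 99% confident that the mean life of all bulbs is between 791.8546 hours
and 808.1454 hours.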

If we compare the two intervals, you will see that the one based on a higher confidence level
of 99% is wider and conveys less information about the possible value of μ than does the one
based on 95% which is narrower. In general, we say that when sampling is from the same
population, using a fixed sample size, the higher the confidence level, the wider the interval.
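
The effect of the confidence level on the width of the interval can be checked numerically.
The sketch below applies formula [5.4] to the light bulb data; the function name z_interval
is a hypothetical helper introduced here for illustration, and the z-value is computed with
statistics.NormalDist (Python 3.8 or later) instead of being read from the table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known (formula 5.4)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # value with area alpha/2 to its right
    margin = z * sigma / sqrt(n)              # z times the standard error of the mean
    return x_bar - margin, x_bar + margin

ci95 = z_interval(800, 20, 40, 0.95)
ci99 = z_interval(800, 20, 40, 0.99)

print([round(v, 2) for v in ci95], round(ci95[1] - ci95[0], 2))
print([round(v, 2) for v in ci99], round(ci99[1] - ci99[0], 2))
```

The limits should agree with Solution 5.7 to about two decimal places; small differences in
later decimals arise because the code uses an unrounded z-value. The second printed width is
larger than the first, which is the pattern described above.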
Activity 5.6
The burning times of a particular brand of candles imported from Mozambique are
known to be normally distributed with a standard deviation of 5 minutes. The mean
burning time of a random sample of 20 candles was 3 hours. Find a 90% confidence
interval for the mean burning time of all such candles.

Case II
When the population standard deviation σ is unknown and the sample size is large, n ≥ 30,
then a 100(1 − α)% confidence interval for the population mean μ is given by

$$\bar{x} \pm Z_{\alpha/2} \times \frac{s}{\sqrt{n}} \qquad [5.5]$$

where s is the sample standard deviation.

Example 5.8
The Head of a rural primary school is worried by the large number of students who arrive late
for school. In order to be able to adjust the school starting time, he sought to find the average
distance walked by the students to school from home. The mean and standard deviation of the
distances travelled by a random sample of 60 students were 6 km and 800 m respectively.
Construct a 90% confidence interval for the mean distance travelled by all the students to
school.

Solution 5.8
n = 60, x̄ = 6 km, s = 800 m = 0.8 km, $Z_{0.10/2} = Z_{0.05} = 1.6449$
A 90% confidence interval for μ is

$$
\begin{aligned}
\bar{x} \pm Z_{\alpha/2} \times \frac{s}{\sqrt{n}} &= 6 \pm 1.6449 \times \frac{0.8}{\sqrt{60}} \\
&= 6 \pm 0.169884541 \\
&= (5.8301,\ 6.1699)
\end{aligned}
$$

We are 90% confident that the mean distance travelled by the students to school is between
5.8301 km and 6.1699 km.
Activity 5.7
A survey of 400 company executives revealed that the average annual earnings of a
CEO is $200 000 with a standard deviation of $600. Find a 99% confidence interval
for the true average annual earnings for all company executives.

Case III
This is the case where the population standard deviation σ is unknown and the sample size is
small (n < 30). A 100(1 − α)% confidence interval for μ is given by

$$\bar{x} \pm t_{\alpha/2}(n-1) \times \frac{s}{\sqrt{n}} \qquad [5.6]$$

where n − 1 is the number of degrees of freedom.

Remark 5.2
When n < 30, the sample standard deviation is not a reliable enough estimate of the
population standard deviation to justify the use of the z-distribution; as a result, we use
the t-distribution.

Example 5.9
Refer to Activity 5.1 Question 1, find a 95% confidence interval for the mean weight of all
bags of potatoes in the truck.

Solution 5.9
n = 20, x̄ = 10.235, s = 0.7264, α = 0.05 ⇒ $t_{\alpha/2}(n-1) = t_{0.025}(19) = 2.09$
A 95% confidence interval for μ is

$$
\begin{aligned}
\bar{x} \pm t_{\alpha/2}(n-1) \times \frac{s}{\sqrt{n}} &= 10.235 \pm 2.09 \times \frac{0.7264}{\sqrt{20}} \\
&= 10.235 \pm 0.339474473 \\
&= (9.8955,\ 10.5745)
\end{aligned}
$$

We are 95% confident that the true mean weight of all bags of potatoes in the truck is
between 9.8955 kg and 10.5745 kg.

Example 5.10
A stock market analyst wanted to estimate the average return on a certain stock. A random
sample of 20 days yielded an average return of 12% and a standard deviation of 4%.
Construct a 95% confidence interval estimate for the average return on this stock.

Solution 5.10
σ is unknown and n = 20 is small, therefore we use the t-distribution. A 95% confidence
interval for μ is

$$
\begin{aligned}
\bar{x} \pm t_{\alpha/2}(n-1) \times \frac{s}{\sqrt{n}} &= 12 \pm t_{0.025}(19) \times \frac{4}{\sqrt{20}} \\
&= 12 \pm 2.09 \times 0.894427191 \\
&= 12 \pm 1.869352829 \\
&= (10.1306,\ 13.8694)
\end{aligned}
$$

We are 95% confident that the average return on this stock is between 10.13% and 13.87%.
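
The small-sample calculation in formula [5.6] can be sketched in Python as follows. The
Python standard library has no t-distribution, so the table value is passed in as an
argument; the function name t_interval is a hypothetical helper used only for illustration.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_value):
    """Confidence interval for the mean, sigma unknown, small sample (formula 5.6).

    t_value is t_{alpha/2}(n - 1), read from the t-table by the user.
    """
    margin = t_value * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Example 5.10: n = 20, mean return 12%, s = 4%, t_0.025(19) = 2.09 from the table.
low, high = t_interval(x_bar=12, s=4, n=20, t_value=2.09)
print(round(low, 4), round(high, 4))
```

The same function applies to Example 5.9 by substituting x̄ = 10.235, s = 0.7264 and n = 20.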
Activity 5.8
A random sample of 10 cigarettes of a certain type has an average nicotine content of 15
milligrams and a standard deviation of 2.5 milligrams. Construct a 99% confidence interval
for the true average nicotine content of all the cigarettes.

5.4.2 Estimation of the population proportion


A population proportion shows the percentage of a population that possesses the
characteristic of interest. A 100(1 − α)% confidence interval for the population proportion p
is given by:

$$\hat{p} \pm Z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \qquad [5.7]$$

Example 5.11
In a survey of 300 company executives carried out by the Zimbabwe Congress of Trade
Unions (ZCTU), 81 executives said they are willing to publicly disclose their annual salaries.

Find a 99% confidence interval for the proportion of all executives who are willing to
disclose their annual salaries.

Solution 5.11
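n = 300, k = 81, α = 0.01 ⇒ $Z_{\alpha/2} = Z_{0.005} = 2.5758$

$$\hat{p} = \frac{k}{n} = \frac{81}{300} = 0.27$$

A 99% confidence interval for p is

$$
\begin{aligned}
\hat{p} \pm Z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} &= 0.27 \pm 2.5758 \times \sqrt{\frac{0.27 \times 0.73}{300}} \\
&= 0.27 \pm 2.5758 \times 0.025632011 \\
&= 0.27 \pm 0.066022934 \\
&= (0.2040,\ 0.3360)
\end{aligned}
$$

We are 99% confident that between 20.4% and 33.6% of all company executives are willing to
publicly disclose their annual salaries.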
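
Formula [5.7] can be applied in a few lines of Python. This is a minimal sketch under the
assumption that Python 3.8 or later is available for statistics.NormalDist; the function
name proportion_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(k, n, confidence):
    """Confidence interval for a population proportion (formula 5.7)."""
    p_hat = k / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)   # z times the standard error of p_hat
    return p_hat - margin, p_hat + margin

# Example 5.11: 81 of 300 executives are willing to disclose their salaries.
low, high = proportion_interval(k=81, n=300, confidence=0.99)
print(round(low, 4), round(high, 4))
```

The limits should match the interval in Solution 5.11 to about four decimal places.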
Activity 5.9
A random sample of 400 customers who visited a retail shop was interviewed and 280
were found to have a preference for a certain brand of toothpaste. Find a 90%
confidence interval for the proportion of the population of customers who prefer the
particular brand of toothpaste.

5.4.3 Confidence interval for the difference between two population means
(Independent samples)
Independent samples are those obtained from two populations such that the selection of one
sample from a population will not affect the selection of the other sample from a different
population.

To construct a confidence interval for the difference between two population means, the
following measures are required:
• The sample means of the two independent samples, x̄₁ and x̄₂
• The sample sizes of each sample, n₁ and n₂
• A specified level of confidence, 100(1 − α)%
• The standard error of the difference between two sample means

To find the confidence interval estimate for μ₁ − μ₂, there are three cases to consider:
Case I: When population variances are known, a 100(1 − α)% confidence interval for
μ₁ − μ₂ is given by:

$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \qquad [5.8]$$

Case II: When population variances are unknown and samples are large, a 100(1 − α)%
confidence interval for μ₁ − μ₂ is given by:

$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \qquad [5.9]$$

This formula is applicable when variances of the two populations are not the same.

Case III: When population variances are unknown and samples are small, that is n₁, n₂ < 30,
a 100(1 − α)% confidence interval for μ₁ − μ₂ is given by

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \qquad [5.10]$$

Example 5.12
The Zimbabwe Teachers’ Association (ZIMTA) conducted a study to determine the average
salary of teachers serving in private and public schools. A random sample of 60 teachers in
private schools averaged $640 per month with a standard deviation of $50, while a random
sample of 100 teachers in public schools averaged $350 per month with a standard deviation
of $20. Construct a 90% confidence interval estimate for the difference in mean monthly
salaries between teachers in private and public schools.

Solution 5.12
Sample 1: Salaries in private schools     Sample 2: Salaries in public schools
x̄₁ = 640                                  x̄₂ = 350
n₁ = 60                                   n₂ = 100
s₁ = 50                                   s₂ = 20

Since population variances are unknown, but samples are large, a 90% confidence interval
(C.I.) for μ₁ − μ₂ is given by:

$$
\begin{aligned}
(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} &= (640 - 350) \pm z_{0.05}\sqrt{\frac{50^2}{60} + \frac{20^2}{100}} \\
&= 290 \pm 1.6449(6.7577) \\
&= 290 \pm 11.12 \\
&= (278.88,\ 301.12)
\end{aligned}
$$

We are 90% confident that the true mean difference in salaries between teachers in
private and public schools lies between $278.88 and $301.12 in favour of teachers in private
schools.
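
The large-sample interval of formula [5.9] can be sketched in Python as shown below. The
function name two_mean_z_interval is a hypothetical helper, and the z-value is computed with
statistics.NormalDist (Python 3.8 or later) instead of being read from the table, so the
limits may differ from a hand calculation in the second decimal place.

```python
from math import sqrt
from statistics import NormalDist

def two_mean_z_interval(x1, s1, n1, x2, s2, n2, confidence):
    """Large-sample confidence interval for mu1 - mu2 (formula 5.9)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # standard error of the difference
    diff = x1 - x2
    return diff - z * se, diff + z * se

# Example 5.12: private schools (sample 1) against public schools (sample 2).
low, high = two_mean_z_interval(640, 50, 60, 350, 20, 100, confidence=0.90)
print(round(low, 2), round(high, 2))
```

Because both limits are positive, the interval lies entirely above zero, which is the basis
for saying the difference favours teachers in private schools.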

Example 5.13
In Example 5.12, suppose only a random sample of 10 teachers in private schools and 10
teachers in public schools were interviewed and the following results were found:
Sample 1: Salaries in private schools     Sample 2: Salaries in public schools
x̄₁ = 640                                  x̄₂ = 350
n₁ = 10                                   n₂ = 10
s₁ = 50                                   s₂ = 20
Construct a 90% confidence interval estimate for the difference in mean monthly salaries
between teachers in private and public schools.

Solution 5.13
Since population variances are unknown, but samples are small, a 90% confidence interval
for μ₁ − μ₂ is given by:

$$
\begin{aligned}
(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} &= (640 - 350) \pm t_{0.05}(18)\sqrt{\frac{50^2}{10} + \frac{20^2}{10}} \\
&= 290 \pm 1.73(17.0294) \\
&= 290 \pm 29.46 \\
&= (260.54,\ 319.46)
\end{aligned}
$$

We are 90% confident that the true mean difference in salaries lies between 260.54 dollars
and 319.46 dollars. The confidence interval does not span zero, which indicates that the
difference in mean salaries between teachers in private and in public schools is significant
at the 10% level.
Activity 5.10
1. Random samples of 15 male students and 15 female students at a college were taken and
their weight measured. The males averaged 56kg with a standard deviation of 4kg while
the females averaged 64kg with a standard deviation of 7kg. Construct a 99% confidence
interval for the difference in mean weight of male and female students. Comment on the
result obtained.
2. Random samples of 60 male students and 54 female students at a college were taken in a
study to investigate ability to solve mathematics problems. In a mathematics test marked
out of 100, the males averaged 62% with a standard deviation of 2% while the females
averaged 54% with a standard deviation of 7%. Construct a 99% confidence interval for
the difference in mean score of male and female students. Comment on the result obtained.

When we assume that the two populations are normally distributed with equal variance
(homogeneous variance assumption), then the point estimate of the common variance is taken to
be the pooled variance $S_p^2$, which is given by:

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} \qquad [5.11]$$

so that a 100(1 − α )% pair of confidence limits for μ1 − μ 2 is given by:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\, S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \qquad [5.12]$$

This is particularly useful when sample sizes are not equal, that is, n₁ ≠ n₂. When the sample
sizes are the same, the interval obtained using the pooled variance remains largely
the same.
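
The remark about equal sample sizes can be checked numerically. The sketch below compares
the standard error used in formula [5.10] with the pooled standard error used in formula
[5.12]; the function names are hypothetical helpers, and the unequal sample sizes in the
second comparison are chosen purely for illustration.

```python
from math import sqrt, isclose

def pooled_variance(s1, n1, s2, n2):
    """Pooled variance S_p^2 (formula 5.11)."""
    return ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)

def se_pooled(s1, n1, s2, n2):
    """Standard error S_p * sqrt(1/n1 + 1/n2), as used in formula 5.12."""
    return sqrt(pooled_variance(s1, n1, s2, n2)) * sqrt(1 / n1 + 1 / n2)

def se_unpooled(s1, n1, s2, n2):
    """Standard error sqrt(s1^2/n1 + s2^2/n2), as used in formula 5.10."""
    return sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Equal sample sizes (the figures of Example 5.13): the two standard errors coincide.
print(se_pooled(50, 10, 20, 10), se_unpooled(50, 10, 20, 10))

# Unequal sample sizes (illustrative): the two standard errors differ.
print(se_pooled(50, 10, 20, 15), se_unpooled(50, 10, 20, 15))
```

With n₁ = n₂ the two expressions are algebraically identical, so both formulas give the same
interval; with unequal sizes the choice between them changes the width of the interval.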

Example 5.14
In Example 5.12, suppose only a random sample of 10 teachers in private schools and 15
teachers in public schools were interviewed and the following results were found:
Sample 1: Salaries in private schools     Sample 2: Salaries in public schools
x̄₁ = 640                                  x̄₂ = 350
n₁ = 10                                   n₂ = 15
s₁ = 50                                   s₂ = 20
Construct a 90% confidence interval estimate for the difference in mean monthly salaries
between teachers in private and public schools.

Solution 5.14
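We start by finding the pooled variance:

$$
\begin{aligned}
S_p^2 &= \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2} \\
&= \frac{(10 - 1)50^2 + (15 - 1)20^2}{10 + 15 - 2} \\
&= 1221.74
\end{aligned}
$$

so that $S_p = \sqrt{1221.74} = 34.9534$. Now a 90% confidence interval is found by:

$$
\begin{aligned}
(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\, S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} &= (640 - 350) \pm t_{0.05}(23) \times 34.9534\sqrt{\frac{1}{10} + \frac{1}{15}} \\
&= 290 \pm 1.71 \times 14.2697 \\
&= 290 \pm 24.40 \\
&= (265.60,\ 314.40)
\end{aligned}
$$

We are 90% confident that the true mean difference in salaries lies between 265.60 dollars
and 314.40 dollars.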
Activity 5.11
An airtime vendor recorded the daily sales over one week of two types of juice cards that he
sells. The results are summarised below:
Buddie: 40 49 32 50 38 45 47 44
Easycall: 32 27 29 22 34 39 22
Find a 95% confidence interval of the difference in mean sales of the two types of juice cards.

5.4.4 Confidence interval for the difference of two population means (Paired samples)
In Section 5.4.3, we looked at the case of independent samples. In this section, we will
consider the case of paired or matched samples. Here observations are made on the same
subject before and after an intervention. For example, the weights of a group of people can be
measured before and after the group has undergone a weight reducing exercise.

To find a confidence interval for the mean difference of the paired observations, the
following measures will be required:
• the paired differences, $d_i$
• the mean of paired differences, $\bar{d}$
• the standard deviation of paired differences, $s_d$

The mean of paired differences is found by summing the paired differences and dividing
by their count, that is:

$$\bar{d} = \frac{\sum d_i}{n} \qquad [5.13]$$

A 100(1 − α)% confidence interval for the mean difference of the paired observations is
given by:

$$\bar{d} \pm t_{\alpha/2}(n - 1)\,\frac{s_d}{\sqrt{n}} \qquad [5.14]$$

where $\bar{d}$ is the mean of the paired differences for the sample and
$s_d$ is the standard deviation of the paired differences for the sample.

The standard deviation is given by

$$s_d = \sqrt{\frac{1}{n - 1}\left(\sum d_i^2 - \frac{(\sum d_i)^2}{n}\right)} \qquad [5.15]$$

Example 5.15
The share prices of 7 randomly selected stocks on a stock exchange were noted before and
after the president of a certain country was admitted to hospital.

Price before   14   18   21   15   17   19   23
Price after    13   15   20   15   16   16   21

Find a 95% confidence interval for the mean decline in share prices.

Solution 5.15
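Price before   Price after   Difference, dᵢ   dᵢ²
14             13            1                1
18             15            3                9
21             20            1                1
15             15            0                0
17             16            1                1
19             16            3                9
23             21            2                4
Total                        ∑dᵢ = 11         ∑dᵢ² = 25

The mean of the paired differences is

$$\bar{d} = \frac{\sum d_i}{n} = \frac{11}{7} = 1.5714$$

The standard deviation is

$$
\begin{aligned}
s_d &= \sqrt{\frac{1}{n - 1}\left(\sum d_i^2 - \frac{(\sum d_i)^2}{n}\right)} \\
&= \sqrt{\frac{1}{6}\left(25 - \frac{11^2}{7}\right)} \\
&= 1.1339
\end{aligned}
$$

Therefore, a 95% confidence interval for the mean decline in share prices is given by:

$$
\begin{aligned}
\bar{d} \pm t_{\alpha/2}(n - 1)\,\frac{s_d}{\sqrt{n}} &= 1.5714 \pm t_{0.025}(6)\,\frac{1.1339}{\sqrt{7}} \\
&= 1.5714 \pm 2.45(0.4286) \\
&= 1.5714 \pm 1.05 \\
&= (0.52,\ 2.62)
\end{aligned}
$$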

The mean decline in share price is estimated to be between $0.52 and $2.62. The confidence
interval does not span zero, which means that the price after is significantly different from
the price before at the 5% level. This suggests that share prices fell following the
President's admission to hospital.
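
The paired-sample calculation of Example 5.15 can be sketched in Python as follows. The
variable names are illustrative, and the t-value is taken from the t-table because the
Python standard library does not provide the t-distribution.

```python
from math import sqrt
from statistics import mean, stdev

before = [14, 18, 21, 15, 17, 19, 23]
after = [13, 15, 20, 15, 16, 16, 21]

# Paired differences d_i = price before - price after (a positive value is a decline).
d = [b - a for b, a in zip(before, after)]

n = len(d)
d_bar = mean(d)     # formula 5.13
s_d = stdev(d)      # formula 5.15 (divisor n - 1)
t_value = 2.45      # t_0.025(6), read from the t-table

margin = t_value * s_d / sqrt(n)          # formula 5.14
low, high = d_bar - margin, d_bar + margin
print(round(d_bar, 4), round(s_d, 4), round(low, 2), round(high, 2))
```

Working with the list of differences reduces the paired problem to a one-sample
t-interval, which is why only n − 1 = 6 degrees of freedom are used.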
Activity 5.12
Five people were put on a special exercise programme for 2 weeks to lose weight and the
results were summarised as shown in the table.

Before   90     97.5   88.5   110.5   99.5
After    91.5   93.5   80.5   102     98.5

Calculate a 95% confidence interval for the mean weight loss following the programme.

5.4.5 Confidence interval for the difference between two population proportions
An estimate of the difference between two population proportions (p₁ − p₂) can be found by
constructing a confidence interval around the difference between two sample proportions.

The following measures are required to construct a confidence interval for the difference
between two population proportions:
• The sample proportions of the two independent samples p̂1 and p̂2
• The sample size of each sample n1 and n2
• A specified level of confidence
• The standard error of the difference between two sample proportions

We will consider two cases, one where samples are large (n₁ + n₂ ≥ 30) and the other where
samples are small (n₁ + n₂ < 30).
For large samples, the confidence interval estimate for (p₁ − p₂) is given by:

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} \qquad [5.16]$$

where $\hat{q}_1 = 1 - \hat{p}_1$ and $\hat{q}_2 = 1 - \hat{p}_2$.

When samples are small (n₁ + n₂ < 30) we use the t-distribution. A confidence interval for
p₁ − p₂ is given by:

$$(\hat{p}_1 - \hat{p}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} \qquad [5.17]$$

Example 5.16
A mobile phone dealer conducted a study to ascertain market acceptability of a new cellular
phone they wished to market. The study was conducted in the two major cities Bulawayo and
Harare. In Harare 300 shoppers were interviewed and 210 were impressed by the new phone
while in Bulawayo of 250 shoppers who were interviewed, 80 were impressed by the new
phone. Construct a 95% confidence interval of the difference between the actual proportion
of Harare and Bulawayo shoppers who were impressed by the new cellular phone.

Solution 5.16
Sample 1 (Harare)              Sample 2 (Bulawayo)
n₁ = 300                       n₂ = 250
p̂₁ = 210/300 = 0.7             p̂₂ = 80/250 = 0.32
q̂₁ = 0.3                       q̂₂ = 0.68

Since the combined sample size is large, a 95% confidence interval for (p₁ − p₂)
is given by:

$$
\begin{aligned}
(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} &= (0.7 - 0.32) \pm z_{0.025}\sqrt{\frac{0.7(0.3)}{300} + \frac{0.32(0.68)}{250}} \\
&= 0.38 \pm 1.96(0.0396) \\
&= 0.38 \pm 0.0777 \\
&= [0.3023,\ 0.4577] \\
&\approx [30\%,\ 46\%]
\end{aligned}
$$

The 95% confidence interval ranges from about 30% to 46%. The confidence interval does not
contain zero, which indicates that the difference between the true proportions of impressed
shoppers in Harare and Bulawayo is significant at the 5% level.
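
The large-sample interval of formula [5.16] can be sketched in Python as shown below. The
function name two_proportion_z_interval is a hypothetical helper, and the z-value is
computed with statistics.NormalDist (Python 3.8 or later) instead of being read from the
table.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_interval(k1, n1, k2, n2, confidence):
    """Large-sample confidence interval for p1 - p2 (formula 5.16)."""
    p1, p2 = k1 / n1, k2 / n2
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # standard error of p1_hat - p2_hat
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Example 5.16: 210 of 300 Harare shoppers and 80 of 250 Bulawayo shoppers were impressed.
low, high = two_proportion_z_interval(210, 300, 80, 250, confidence=0.95)
print(round(low, 4), round(high, 4))

# The interval excludes zero when both limits have the same sign.
print(low > 0 or high < 0)
```

The final line automates the check of whether the interval spans zero, which is the
criterion used in the interpretation of the example.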

Example 5.17
Refer to Example 5.16, suppose 15 Harare shoppers and 10 Bulawayo shoppers were
interviewed and that 9 Harare shoppers liked the cellular phone compared to 4 Bulawayo
shoppers. Construct a 95% confidence interval of the difference between the actual
proportion of Harare and Bulawayo shoppers who were impressed by the new phone.

Solution 5.17
Sample 1 (Harare): $n_1 = 15$, $\hat{p}_1 = \frac{9}{15} = 0.6$, $\hat{q}_1 = 0.4$
Sample 2 (Bulawayo): $n_2 = 10$, $\hat{p}_2 = \frac{4}{10} = 0.4$, $\hat{q}_2 = 0.6$

Since samples are small (n1 + n2 < 30) we use the t-distribution with
n1 + n2 − 2 = 15 + 10 − 2 = 23 degrees of freedom. A 95% confidence interval
for ( p1 − p2 ) is given by:
$$\begin{aligned}
(\hat{p}_1 - \hat{p}_2) \pm t_{\alpha/2}(n_1 + n_2 - 2)\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}
&= (0.6 - 0.4) \pm t_{0.025}(23)\sqrt{\frac{0.6(0.4)}{15} + \frac{0.4(0.6)}{10}} \\
&= 0.2 \pm 2.069(0.2) \\
&= 0.2 \pm 0.4138 \\
&= [-0.2138,\ 0.6138]
\end{aligned}$$

We are 95% confident that the true difference in the proportion of Harare and Bulawayo
shoppers who liked the cellular phone lies between -0.214 and 0.614. The confidence interval
covers zero, therefore at this level of confidence the data do not show a difference in the
true proportion of shoppers in Harare and Bulawayo who liked the new phone.
Activity 5.13
In a random sample of 100 prospective ZOU students, 60 indicated that they would take up a
master's degree in Development Economics if offered, while 72 of another random sample of
120 prospective students indicated that they would take up a degree in Auditing if offered by
the University. Using a 90% confidence interval, estimate the difference between the true
proportions of prospective students who would take up Development Economics and Auditing.
 

5.5 Determining Sample Size in Estimation


To decide on the sample size appropriate for a survey study you have to make a compromise
between two factors which are:
• The resources available in terms of time and cost of the study. A huge sample is
costly to study and the study requires more time.
• The degree of accuracy required. The larger the sample that is used, the narrower the
interval. A narrower interval is associated with less uncertainty and more accurate
estimation results.

In order to determine the sample size for your study, you need to specify the precision of your
estimate and the level of confidence desired. The precision is given by the error that you are
prepared to tolerate in your estimated results. You also need an estimate of the population
standard deviation. This can be obtained from a pilot survey carried out before the actual
study.

5.5.1 Sample size for estimating population mean
A confidence interval for the population mean μ provides an estimate of the accuracy of our
point estimate $\bar{x}$. If μ is actually the centre value of the interval, then $\bar{x}$ estimates μ without
error. However, $\bar{x}$ will not be exactly equal to μ most of the time, and the point estimate is
usually in error.

We may wish to determine how large a sample is necessary to ensure that the error in
estimating μ will not exceed e, the ‘bound on the error’. In the confidence interval
$\bar{x} \pm Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$, the ‘bound on the error’ is $e = Z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$. Now, making n the subject of the
formula, the sample size necessary so that the error will not exceed e can be shown to be:
$$n = \left[\frac{Z_{\alpha/2} \times \sigma}{e}\right]^2 \qquad [5.18]$$

Example 5.18
In Example 5.7, how large a sample is required if we wish to be 95% confident that our
sample mean will be within 10 hours of the true mean?

Solution 5.18
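$\alpha = 0.05 \Rightarrow Z_{\alpha/2} = 1.96$, $e = 10$, $\sigma = 20$
$$\begin{aligned}
n &= \left[\frac{Z_{\alpha/2} \times \sigma}{e}\right]^2 \\
&= \left[\frac{1.96 \times 20}{10}\right]^2 \\
&= [3.92]^2 \\
&= 15.3664 \\
&\approx 16 \text{ (rounded up)}
\end{aligned}$$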
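Formula [5.18] is easy to wrap in a small helper. The sketch below is illustrative only: the function name `sample_size_for_mean` is hypothetical, and the critical value comes from Python's `statistics.NormalDist` (Python 3.8 or later) instead of a printed z-table. The result is always rounded up, since rounding down would not meet the stated bound on the error.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, e, confidence=0.95):
    """Minimum n so that the error in estimating the mean does not exceed e ([5.18])."""
    # Two-sided critical value z_{alpha/2}
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up so that the precision requirement is met
    return ceil((z * sigma / e) ** 2)


# Example 5.18: sigma = 20 hours, error bound e = 10 hours, 95% confidence
n_required = sample_size_for_mean(sigma=20, e=10)
print(n_required)
```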
Activity 5.14
1. Find the minimum sample size required for estimating the average return on money
market investments to within 0.5% per year with 99% confidence. The standard
deviation of returns is believed to be 2% per year.
2. A market researcher would like to estimate the average amount spent on airtime per
month by each female student at a college. The researcher would like to be able to
determine the average amount spent by all female students at the college to be
within $1 with 95% confidence. From past studies, the population standard
deviation is known to be $2. What is the minimum required sample size?
 

5.5.2 Sample size for estimating a population proportion


The minimum sample size required to estimate the population proportion to be within a
specified amount e with 100 (1 − α )% confidence is given by:
$$n = \frac{\hat{p}(1 - \hat{p})\,Z_{\alpha/2}^2}{e^2} \qquad [5.19]$$

Example 5.19
In Example 5.11, how large a sample is needed if we wish to be 99% confident that our
sample proportion will be within 0.02 of the true proportion of all the CEOs who are willing
to disclose their annual salaries?

Solution 5.19
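We know that $\hat{p} = 0.27$ and $Z_{\alpha/2} = Z_{0.005} = 2.5758$.
The minimum sample size required is
$$\begin{aligned}
n &= \frac{\hat{p}(1 - \hat{p})\,Z_{\alpha/2}^2}{e^2} \\
&= \frac{0.27 \times 0.73 \times 2.5758^2}{0.02^2} \\
&= 3269.2709 \\
&\approx 3270
\end{aligned}$$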

In practice, you cannot collect sample data before deciding on the sample size to use.
Therefore, we require a way of estimating the appropriate sample size for a study which does
not depend on the sample proportion, $\hat{p}$.

The largest value that $\hat{p}(1 - \hat{p})$ can take is 0.25, which occurs when $\hat{p} = 0.5$. You can show this
by working out the value of $\hat{p}(1 - \hat{p})$ using increasing values of $\hat{p}$ starting with $\hat{p} = 0.1$. If we
assume the largest value of $\hat{p}(1 - \hat{p})$, then formula [5.19] is reduced to
$$n = \left[\frac{Z_{\alpha/2}}{2e}\right]^2 \qquad [5.20]$$

Example 5.20
A researcher intends to conduct a study to estimate the proportion of supermarkets that offer
trolleys suitable for customers with difficulty in walking. Determine the sample size needed if
the researcher wishes to be 95 % confident that the estimated proportion is within 8% of the
true proportion.

Solution 5.20
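$\alpha = 0.05 \Rightarrow Z_{\alpha/2} = 1.96$, $e = 0.08$
$$\begin{aligned}
n &= \left[\frac{Z_{\alpha/2}}{2e}\right]^2 \\
&= \left[\frac{1.96}{2 \times 0.08}\right]^2 \\
&= 150.0625
\end{aligned}$$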

This has to be rounded up in order to meet the confidence requirement. Thus a sample size of
151 supermarkets should be used.
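Formulas [5.19] and [5.20] differ only in the value used for p̂(1 − p̂), so one helper can cover both. This is a minimal sketch under stated assumptions: the name `sample_size_for_proportion` and its `p_hat` parameter are hypothetical, and critical values come from `statistics.NormalDist` (Python 3.8 or later) rather than from tables. When no pilot estimate is supplied, the most conservative value 0.25 is used.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(e, confidence=0.95, p_hat=None):
    """Minimum n to estimate a proportion to within e.

    With p_hat given, this follows formula [5.19]; without it, the largest
    possible value of p_hat * (1 - p_hat), namely 0.25, is assumed ([5.20]).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    pq = 0.25 if p_hat is None else p_hat * (1 - p_hat)
    # Round up so that the precision requirement is met
    return ceil(pq * z ** 2 / e ** 2)


# Example 5.19: pilot estimate 0.27, within 0.02, 99% confidence
n_519 = sample_size_for_proportion(e=0.02, confidence=0.99, p_hat=0.27)
# Example 5.20: no pilot estimate, within 0.08, 95% confidence
n_520 = sample_size_for_proportion(e=0.08)
print(n_519, n_520)
```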

Activity 5.15
1. In a survey of a random sample of 300 shoppers, 180 said they would prefer to make
payments using debit cards. How large a sample is needed if we are to be 95%
confident that the estimate is within 5% of the actual proportion of shoppers who
prefer to transact using debit cards.

2. A DStv research team intends to install monitoring devices in a random sample of
households in order to produce 99% interval estimates of the proportion of
households watching specific programmes. In how many households will the team
have to install the devices if they want to estimate to within 1% of the true
proportion.

5.6 Summary

In this unit we introduced you to an important branch of statistical inference called
estimation. Estimation is about the use of sample measurements to predict population values.
The estimation is done in two ways namely point estimation and confidence interval
estimation. The major drawback of a point estimate is that we have no idea of the probability
that it is a good estimate. This makes interval estimates preferable because they are
associated with a known level of confidence which is a measure of how confident we are that
the interval does include within it the population parameter.

You learnt about the construction of confidence intervals for the difference between two
population means and proportions as well as confidence intervals for the mean difference of
paired observations. Samples are paired or matched when observations are made on the same
object before and after an intervention. We looked at how interval estimates for a population
mean and population proportion are constructed. We also looked at how to determine the
appropriate sample size for estimation surveys.


Unit 6

Hypothesis Testing

6.0 Introduction

Hypothesis testing is that branch of statistical inference that is used to verify claims made
concerning population parameters. In this unit, you will be introduced to the terminology
used in hypothesis testing. You will learn how to conduct hypothesis tests concerning the
mean and proportion of a single population. You will also learn about significance tests for
the difference between two population means for both independent and correlated (paired)
samples and for difference of two population proportions.

6.1 Objectives

By the end of the unit, you should be able to:


• define a statistical hypothesis
• formulate the null hypothesis and the alternative hypothesis for a given situation
• distinguish between Type I and Type II errors
• distinguish between one-tailed and two-tailed tests
• outline the steps followed in the procedure of hypothesis testing
• conduct hypothesis tests concerning the population mean and the population
proportion
• distinguish between matched and independent samples
• conduct hypothesis tests for difference of two population means and for difference of
two population proportions
• carry out a test for the mean difference of paired samples

6.2 Statistical Hypotheses

A statistical hypothesis is a claim or a guess or an assumption or a statement, which may or
may not be true, made concerning a population parameter.

Hypothesis testing involves gathering evidence from a random sample drawn from the
population of interest in order to decide whether the null hypothesis is likely to be true or
false. The hypothesis is rejected if evidence from the sample is not consistent with the stated
hypothesis, otherwise it is accepted. However, the acceptance of the stated hypothesis does
not necessarily imply that it is true, rather it is a result of insufficient evidence to reject it.

6.2.1 Types of hypotheses


There are two types of hypotheses which are called the:
• null hypothesis, and
• alternative hypothesis

A null hypothesis is an assertion about the value of a population parameter. It is a formal
statement of the claim being made concerning a population parameter. The null hypothesis is
denoted by H0.

The alternative hypothesis, denoted by H1, is the negation of the null hypothesis. For example,
a null hypothesis might assert that the population mean is equal to a specified value μ0. We
write this as H0 : μ = μ0. The alternative hypothesis opposes this assertion and it is written as
H1 : μ ≠ μ0. In this case, the alternative hypothesis suggests that the mean takes values that
are either below μ0 or above it. Therefore, to investigate H0 we conduct a non-directional test
which is known as a two-tailed test.

A null hypothesis might assert that the population mean is at least equal to a certain specified
value μ0. We write H0 : μ ≥ μ0. In this case, the alternative hypothesis would consist of
values below μ0, that is, H1 : μ < μ0. Similarly, if a null hypothesis asserts that the population
mean is less than or equal to a specified value μ0, that is, H0 : μ ≤ μ0, the alternative will be
H1 : μ > μ0. In both these cases, since the alternative hypothesis consists of values lying on
only one side of the specified value μ0, we conduct a one-sided test or a one-tailed test.

6.2.2 Deciding on the null hypothesis


Determining what the null hypothesis should be in a given situation may prove to be difficult.
However, if the null hypothesis is wrongly formulated, then the test will be meaningless.
The following notes will be handy in deciding what the null hypothesis should
be:
• The null hypothesis always has an element of equality; either an equal to (=) sign or a
greater than or equal to ( ≥ ) sign or a less than or equal to ( ≤ ) sign is used in the
expression of H0.
• The null hypothesis is usually an expression of a claim made by someone. However,
if the claim does not include an equal to (=) sign it becomes the alternative
hypothesis.
• The null hypothesis is the hypothesis that we formulate with the hope of rejecting.
• If the null hypothesis is true, then no corrective action would be necessary, whereas if
the alternative hypothesis is true, some corrective action would be necessary.
Example 6.1
A ZOU Regional Director claims that the average age of a ZOU student is 21. A new
Programme Coordinator at the region doubts this claim. Set up the null and alternative
hypothesis if the Coordinator wishes to show that it is not 21.

Solution 6.1
H 0 : μ = 21
H1 : μ ≠ 21

Example 6.2
A leading bakery claims that the average cost of producing a standard loaf of bread is 80
cents. If you suspect that the claim exaggerates the cost, how would you set up the null and
alternative hypothesis?

Solution 6.2
H 0 : μ ≥ 80c
H1 : μ < 80c

Activity 6.1
1. A ZIMRA official at a busy border post claims that it takes, on average, at most 2
days for a truck driver to clear his consignment. You suspect that the average is
greater than 2 days and you want to test the claim. State the null and alternative
hypothesis for this test.
2. An ice cream vending machine is set to dispense 100 grams per cup. You suspect
that the machine is under-filling the cups. Set up the null and alternative hypothesis
to investigate this case.

6.3 Type I and Type II Errors


In deciding to reject or accept a null hypothesis, there will be chances for erroneously
rejecting or accepting it. Such errors may be due to faulty sampling procedures.

A type I error is committed when a true null hypothesis is rejected. The probability of
committing a type I error is called the level of significance and it is denoted by α . It is
common to use 1%, 5% and 10% level of significance in calculations. If α = 0.05 then there
are only 5 in 100 chances of committing this error.

A type II error is committed if we accept the null hypothesis when it is false. The probability
of committing a type II error is denoted by β. The probability that the test will be able to
detect a false null hypothesis is called the power of a test. In other words, the power of a test
is the probability of rejecting H0 when indeed H0 is false, which is 1 − β.

6.4 Steps Followed in Hypothesis Testing


The following steps should be followed when conducting a hypothesis test:

Step 1: State the null and alternative hypothesis


The null and alternative hypotheses are specified at this initial stage before gathering any
evidence. It would be unethical and rather manipulative to formulate the H0 and H1 at one’s
convenience after gathering evidence; a practice that we refer to as data snooping.

Step 2: Identify the distribution


For problems in this unit, you have to choose between two distributions namely the
z-distribution and the t-distribution.

When testing for the population mean μ we use:


a) The z-distribution when the:
• population standard deviation σ is known
• population standard deviation is unknown and n is large ( n ≥ 30)
b) The t-distribution when the population standard deviation σ is unknown and the
sample size n is small ( n < 30) .

When testing for a population proportion and the sample size is large we use the
z-distribution.

Step 3: Determine the rejection and acceptance region


Depending on the distribution identified and the level of significance desired, you find a
value from statistical tables which we call a critical value. The critical value separates the
acceptance region from the rejection region. The rejection region is made up of a range of
values such that if a test statistic calculated from sample data falls in it the null hypothesis
would be rejected. The rejection region also depends on the nature of the alternative
hypothesis as shown in the following figures.

Figure 6.1 Rejection Region for H1 : μ ≠ μ0 (two-tailed: an area of rejection of α/2 in each tail, beyond the critical values on either side of 0)

Figure 6.2 Rejection Region for H1 : μ > μ0 (one-tailed: an area of rejection of α in the upper tail, to the right of the critical value)

Figure 6.3 Rejection Region for H1 : μ < μ0 (one-tailed: an area of rejection of α in the lower tail, to the left of the critical value)

Step 4: Calculate the test statistic
A test statistic is a value calculated from sample data that is used to decide whether or not to
reject H0. Once the test statistic falls within the rejection region, H0 is rejected. The
calculation of the test statistic depends on whether the population standard deviation σ is
known or unknown and also on the sample size as summarised in Table 6.1 below.

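Table 6.1 Test Statistic for Testing μ

When σ is known
Case I: n is large or small
$$Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1) \qquad [6.1]$$

When σ is unknown
Case II: n is large
$$Z_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim N(0,1) \qquad [6.2]$$
Case III: n is small
$$T_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t(n-1) \qquad [6.3]$$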

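When testing for a single population proportion p , we use the z-distribution and the test
statistic is given by:
$$Z_{cal} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} \qquad [6.4]$$
where $p_0$ is the hypothesised proportion and $q_0 = 1 - p_0$.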

Step 5: Decide whether or not to reject H0


The decision is made on the basis of a comparison between the value of the test statistic and
the critical value. If the test statistic is greater than the critical value in absolute terms, it will
fall in the rejection region thus leading to the rejection of H0.

Step 6: Make a conclusion


If H0 is rejected, we conclude that H1 is probably true. If we fail to reject H0, we conclude
that the evidence gathered is insufficient to warrant the rejection of H0.

6.5 Tests Concerning the Population Mean of a Single Population

Example 6.3
A stock market analyst claims that the average annual return on stocks in the construction
industry is 12%. You want to test whether this claim is true. You collect a random sample of
36 stocks in the construction industry and find that the average annual return is 10% with a
standard deviation of 3%. Use a 5% level of significance to test the analyst’s claim.

Solution 6.3
1. H 0 : μ = 12%
H1 : μ ≠ 12%

2. The population standard deviation σ is unknown, but the sample size n =36 is large, so
we use the z-distribution.

3. The nature of the alternative hypothesis suggests we need to carry out a two-tailed test.
Using α = 0.05, the critical values are $\pm Z_{\alpha/2} = \pm Z_{0.025} = \pm 1.96$.
The rejection criterion is therefore to reject H0 if $|Z_{cal}| > 1.96$.

4. $$Z_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{10 - 12}{3/\sqrt{36}} = -4$$

5. Since $|Z_{cal}| = 4 > 1.96$, we reject H0

6. We conclude that the average annual return is not 12% and therefore the analyst’s
claim is false.

Example 6.4
The average weekly earnings of all bus rank marshals is reported to be $180. You believe it is
too low. You collect a random sample of 100 rank marshals and find that the weekly average
is $250 with a standard deviation of $20. Conduct the test at 10% level of significance.

Solution 6.4
1. H 0 : μ ≤ 180
H1 : μ > 180

2. The population standard deviation σ is unknown, but the sample size n =100 is large,
so we use the z-distribution.

3. α = 0.10 and it is a one-tailed test. The critical value is $Z_{\alpha} = Z_{0.10} = 1.2816$.
We would reject H0 if $Z_{cal} > 1.2816$.

4. $$Z_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{250 - 180}{20/\sqrt{100}} = 35$$

5. Since Zcal = 35 is greater than the critical value = 1.2816, we reject H0.

6. We conclude that the average weekly earnings of all rank marshals is greater than
$180.
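Steps 3 to 5 of Examples 6.3 and 6.4 can be reproduced in a few lines of code. This is only a sketch: the function name `z_test_mean` and its `tail` argument are hypothetical, and the critical value is computed with `statistics.NormalDist` (Python 3.8 or later) in place of statistical tables.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(x_bar, mu0, s, n, alpha, tail):
    """Large-sample z-test for a population mean (formula [6.2]).

    tail: "two" for H1: mu != mu0, "upper" for H1: mu > mu0,
          "lower" for H1: mu < mu0.
    Returns the test statistic, the positive critical value and the decision.
    """
    z_cal = (x_bar - mu0) / (s / sqrt(n))
    if tail == "two":
        z_crit = NormalDist().inv_cdf(1 - alpha / 2)
        reject = abs(z_cal) > z_crit
    elif tail == "upper":
        z_crit = NormalDist().inv_cdf(1 - alpha)
        reject = z_cal > z_crit
    elif tail == "lower":
        z_crit = NormalDist().inv_cdf(1 - alpha)
        reject = z_cal < -z_crit
    else:
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    return z_cal, z_crit, reject


# Example 6.3: two-tailed test at the 5% level
z63, crit63, reject63 = z_test_mean(10, 12, 3, 36, alpha=0.05, tail="two")
# Example 6.4: upper-tailed test at the 10% level
z64, crit64, reject64 = z_test_mean(250, 180, 20, 100, alpha=0.10, tail="upper")
print(z63, round(crit63, 4), reject63)
print(z64, round(crit64, 4), reject64)
```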

Example 6.5
In an advertisement it is claimed that a certain brand of air freshener will last, on average, at
least 40 days. A random sample of 12 households took the following number of days to use
up the air freshener:
28 41 36 50 17 39 21 64 26 30 42 12

Test the claim made for the product using a 5% level of significance.

Solution 6.5
1. H 0 : μ ≥ 40
H1 : μ < 40

2. The population standard deviation σ is unknown, but the sample size n =12 is small,
so we use the t-distribution.

3. α = 0.05 and it is a one-tailed test. The critical value is $-t_{\alpha}(n-1) = -t_{0.05}(11) = -1.80$.
We would reject H0 if $T_{cal} < -1.80$.

4. You should verify that $\bar{x} = 33.8333$ and $s = 14.6339$
$$T_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{33.8333 - 40}{14.6339/\sqrt{12}} = -1.4598$$

5. Since -1.4598 > -1.80, we fail to reject H0.


6. We conclude that the data does not provide sufficient evidence at 5% level of
significance to reject H0.
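The sample mean, sample standard deviation and test statistic of Example 6.5 can be verified from the raw data. The sketch below uses only Python's standard library; because the standard library has no t-distribution, the critical value 1.80 is entered by hand as read from t-tables, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Number of days each of the 12 households took to use up the air freshener
days = [28, 41, 36, 50, 17, 39, 21, 64, 26, 30, 42, 12]
mu0 = 40        # hypothesised mean under H0
t_crit = 1.80   # t_{0.05}(11), read from t-tables as in the text

n = len(days)
x_bar = mean(days)
s = stdev(days)  # sample standard deviation (divisor n - 1)

# Test statistic, formula [6.3]
t_cal = (x_bar - mu0) / (s / sqrt(n))

# Lower-tailed test: reject H0 only if t_cal falls below -t_crit
reject = t_cal < -t_crit
print(round(x_bar, 4), round(s, 4), round(t_cal, 4), reject)
```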

Activity 6.2
1. Average total daily sales of a fruits vendor are known to be at most $26. The
vendor recently changed his site of operation and moved to a new site at a busy
street corner. He now wants to know whether his daily sales have improved since
then. A random sample of 16 trading days gave an average of $30 with a standard
deviation of $5. Does the data provide evidence that the vendor’s average total
daily sales have improved? Use α = 0.05 .

2. A graduate student comes out of college with an average fees debt of $1 500. A
sample of 200 graduates showed that the average debt was $900 with a standard
deviation of $120. Carry out the test at the 5% level of significance.

3. The average time that children who reside in the same neighbourhood spend travelling
to school is claimed to be 35 minutes. A random sample of 10 children taken from
the neighbourhood had their travel times recorded as follows:

37 38 40 35 36 35 39 37 40 42

Test the claim using a 1% level of significance.

6.6 Test Concerning a Population Proportion for a Single Population

Example 6.6
A long distance bus conductor claims that at least 25% of passengers buy some bananas to eat
before reaching their destinations. If 18 out of a random sample of 60 passengers bought
bananas during their trip with the bus, is the conductor right? Use 5% level of significance.

Solution 6.6
1. H 0 : p ≥ 0.25
H1 : p < 0.25

2. n = 60 is a large sample; we use the z-distribution

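3. It is a one-tailed test. The critical value is $-Z_{0.05} = -1.6449$.
We would reject H0 if Zcal < -1.6449

4. $\hat{p} = \frac{18}{60} = 0.3$
$$Z_{cal} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = \frac{0.3 - 0.25}{\sqrt{\dfrac{0.25 \times 0.75}{60}}} = \frac{0.05}{0.05590} = 0.8944$$

5. Since Zcal = 0.8944 > -1.6449, the test statistic does not fall in the rejection region and we fail to reject H0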

6. We conclude that the data does not provide sufficient evidence to reject H0

Example 6.7
Last year, 70% of total student applications received by the Zimbabwe Open University were
from female applicants. Out of a random sample of 150 applications received this year, 90
were from females. Test the hypothesis that the proportion of applications from females has
not changed using a 10% level of significance.

Solution 6.7
1. H0 : p = 0.70
H1 : p ≠ 0.70

2. n = 150 is a large sample, so we use the z-distribution

3. Using α = 0.10, the critical values are $\pm Z_{0.05} = \pm 1.6449$.
We would reject H0 if $|Z_{cal}| > 1.6449$

4. $\hat{p} = \frac{90}{150} = 0.6$
$$Z_{cal} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = \frac{0.60 - 0.70}{\sqrt{\dfrac{0.70 \times 0.30}{150}}} = \frac{-0.1}{0.037416573} = -2.6726$$

5. Since $|Z_{cal}| = 2.6726 > 1.6449$, we reject H0

6. We conclude that the proportion of female applicants has changed.
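The proportion tests in Examples 6.6 and 6.7 follow the same pattern and can be sketched as a single helper. The function name `z_test_proportion` and its `tail` argument are hypothetical; critical values are taken from `statistics.NormalDist` (Python 3.8 or later) rather than from tables.

```python
from math import sqrt
from statistics import NormalDist


def z_test_proportion(x, n, p0, alpha, tail):
    """Large-sample z-test for a single population proportion (formula [6.4]).

    tail: "two" for H1: p != p0, "upper" for H1: p > p0, "lower" for H1: p < p0.
    Returns the test statistic, the positive critical value and the decision.
    """
    p_hat = x / n
    z_cal = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    if tail == "two":
        z_crit = NormalDist().inv_cdf(1 - alpha / 2)
        reject = abs(z_cal) > z_crit
    elif tail == "upper":
        z_crit = NormalDist().inv_cdf(1 - alpha)
        reject = z_cal > z_crit
    elif tail == "lower":
        z_crit = NormalDist().inv_cdf(1 - alpha)
        reject = z_cal < -z_crit
    else:
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    return z_cal, z_crit, reject


# Example 6.6: 18 of 60 passengers, H1: p < 0.25, 5% level
z66, crit66, reject66 = z_test_proportion(18, 60, 0.25, alpha=0.05, tail="lower")
# Example 6.7: 90 of 150 applications, H1: p != 0.70, 10% level
z67, crit67, reject67 = z_test_proportion(90, 150, 0.70, alpha=0.10, tail="two")
print(round(z66, 4), reject66)
print(round(z67, 4), reject67)
```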

Activity 6.3
1. The Traffic Safety Council of Zimbabwe claims that at least 65% of all road accidents are
due to human error. In a random sample of 500 road accidents, it was found that 342
accidents were due to human error. Use 5% level of significance to test the claim.

2. A credit controller of a clothing retail chain estimates that 20% of their customers default
on their monthly bill payment. A random sample of 400 accounts indicated that 130
accounts were at least one month in arrears. Does the data provide evidence to support the
credit controller’s claim? Use α = 0.10

6.7 Confidence Interval Approach to Hypothesis Testing

Suppose (a, b) is a 100 (1 − α )% confidence interval for μ, then the confidence interval (a, b)
provides plausible values of μ. If the value of μ specified under the null hypothesis H0 falls
within (a, b), we do not reject H0, but if it lies outside the interval (a, b) we reject H0. Thus
the lower and upper confidence limits form a pair of critical values beyond which H0 will be
rejected as shown in Figure 6.4.

Figure 6.4 Confidence Interval as Critical Values (rejection region below a, acceptance region between a and b, rejection region above b)

Example 6.8
An electrical firm supplies light bulbs that have a length of life that is approximately
normally distributed with a standard deviation of 20 hours. If a random sample of 40 bulbs
has an average life of 800 hours,
a) Find a 99% confidence interval for the population mean life of all bulbs supplied by
this firm.
b) Hence test at 1% level of significance the claim that the population mean life of all
bulbs supplied by this firm is 800 hours.

Solution 6.8
a) From Solution 5.7, the 99% confidence interval for mean life of bulbs was found to be
(791.8546; 808.1454).

b) The hypotheses tested are:
H0 : μ = 800
H1 : μ ≠ 800
The hypothesised value 800 lies within the confidence interval (791.8546;
808.1454), so we do not reject H0 at the 1% level of significance.
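The confidence interval approach of Example 6.8 amounts to building the interval and checking whether the hypothesised value lies inside it. A minimal sketch follows; the function name `ci_test_mean` is hypothetical, σ is assumed known as in the example, and the critical value comes from `statistics.NormalDist` (Python 3.8 or later), so the limits may differ from the tabulated ones in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist


def ci_test_mean(x_bar, sigma, n, mu0, alpha):
    """Two-tailed test of H0: mu = mu0 via a 100(1 - alpha)% confidence interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / sqrt(n)
    lower, upper = x_bar - margin, x_bar + margin
    # H0 is rejected only when mu0 falls outside the interval
    reject = not (lower <= mu0 <= upper)
    return lower, upper, reject


# Example 6.8: sigma = 20, n = 40, sample mean 800, H0: mu = 800, alpha = 0.01
lower, upper, reject = ci_test_mean(800, 20, 40, mu0=800, alpha=0.01)
print(round(lower, 2), round(upper, 2), reject)
```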
Activity 6.4
Last year, 70% of total student applications received by the Zimbabwe Open
University were from female applicants. Out of a random sample of 150 applications
received this year, 90 were from females. Use the confidence interval approach to test
the hypothesis that the proportion of applications from females has not changed using
a 10% level of significance.

6.8 Testing for Differences of two Population Means


Where we have two random samples, we may want to know if there is significant difference
between their two means. The two samples may be independent or matched. We shall start by
looking at the case of independent samples.
6.8.1 Tests for independent samples
Independent samples are those obtained from two populations such that the selection of one
sample from a population will not affect the selection of the other sample from a different
population. The distribution of sample mean differences is assumed to be normally
distributed.

Let μ1 be the population mean of the first population from which a sample is drawn and μ 2 be
the mean of the second population. The hypotheses to be tested are:
Set A: H0 : μ1 = μ2
       H1 : μ1 ≠ μ2

Set B: H0 : μ1 = μ2   or   H0 : μ1 ≤ μ2
       H1 : μ1 > μ2        H1 : μ1 > μ2

Set C: H0 : μ1 = μ2   or   H0 : μ1 ≥ μ2
       H1 : μ1 < μ2        H1 : μ1 < μ2

The null hypothesis (H0) essentially says there is no difference in the means. Thus the
hypotheses can be expressed as differences, for example set A can be written as
H 0 : μ1 − μ2 = 0
H1 : μ1 − μ 2 ≠ 0

The test statistic depends on whether population variances are known or not known and also
on the sample sizes. We have three cases to consider:

Case I: When the variances ($\sigma_1^2$ and $\sigma_2^2$) are known, irrespective of sample sizes, we use the
standard normal distribution (z-distribution) and the test statistic is given by
$$Z_{cal} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
Assuming H0, μ1 = μ2, the test statistic reduces to
$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

Case II: Where variances are unknown but samples are large (both n1 and n2 are greater than
30), sample variances $s_1^2$ and $s_2^2$ are used as estimates of the population variances $\sigma_1^2$ and $\sigma_2^2$
respectively. The test statistic therefore becomes
$$Z_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad [6.5]$$

The next thing is to find the critical values upon which the criteria to accept or reject H0 will
be based. For Case I and Case II, the critical values are obtained from z- tables. The critical
values and the subsequent rejection criteria are summarised in Table 6.2 below.

Table 6.2: Summary of Decision Criteria for Case I and Case II

Hypothesis tested    Critical value          Decision criteria: Reject H0 if:
Set A                $\pm Z_{\alpha/2}$      $|Z_{cal}| > Z_{\alpha/2}$
Set B                $Z_{\alpha}$            $Z_{cal} > Z_{\alpha}$
Set C                $-Z_{\alpha}$           $Z_{cal} < -Z_{\alpha}$

Example 6.9
A study was made to compare salaries of teachers in the private and public schools. 40
randomly selected teachers in private schools averaged $540 per month with a standard
deviation of $60, while 50 randomly selected teachers in public schools averaged $360 with a
standard deviation of $15. Test at 5% level of significance whether the difference between
these two samples mean is significant.

Solution 6.9
Sample 1 (private schools): $n_1 = 40$, $\bar{x}_1 = 540$, $s_1 = 60$
Sample 2 (public schools): $n_2 = 50$, $\bar{x}_2 = 360$, $s_2 = 15$

The hypotheses are stated as:
H0 : μ1 = μ2
H1 : μ1 ≠ μ2
This is a two-tailed test. The variances are unknown but samples are large, so we use the
z-distribution. The critical values are:
$\pm Z_{\alpha/2} = \pm Z_{0.025} = \pm 1.96$

We would reject H0 if $|Z_{cal}| > 1.96$ otherwise we accept H0.

We can now calculate the test statistic:
$$\begin{aligned}
Z_{cal} &= \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \\
&= \frac{540 - 360}{\sqrt{\dfrac{60^2}{40} + \dfrac{15^2}{50}}} \\
&= \frac{180}{\sqrt{90 + 4.5}} \\
&= 18.5
\end{aligned}$$
Since $|Z_{cal}| = 18.5 > 1.96$, we reject H0 and conclude that the salaries are different.
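The test statistic [6.5] used in Example 6.9 can be computed directly from the summary statistics. The following sketch is illustrative: the function name `two_sample_z` is hypothetical, and the two-tailed critical value is obtained from `statistics.NormalDist` (Python 3.8 or later) instead of z-tables.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(x1, s1, n1, x2, s2, n2):
    """Test statistic for the difference of two means, large samples (formula [6.5])."""
    return (x1 - x2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Example 6.9: private schools (sample 1) versus public schools (sample 2)
z_cal = two_sample_z(540, 60, 40, 360, 15, 50)

# Two-tailed test at the 5% level (Set A in Table 6.2)
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z_cal) > z_crit
print(round(z_cal, 2), round(z_crit, 2), reject)
```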

Activity 6.5
A non-governmental organisation is carrying out a study to compare the standard of
living in Zambia and in Zimbabwe. A random sample of 1 500 Zambian families gave a
mean family income of $8 050 with a standard deviation of $210 while a random sample
of 1 800 Zimbabwean families gave a mean family income of $11 500 with a standard
deviation of $1 000. Test at 5 % whether the standard of living in Zimbabwe is
significantly different from that in Zambia.

Case III: When variances are unknown and sample sizes are small, instead of the
z-distribution we now use the t-distribution. When using the t-distribution, we assume that:
• the two populations are normally distributed,
• the two populations have equal variances (homogeneous variance assumption).

In this case the best estimate of the population variance in each population is the pooled
sample variance given by
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \qquad [6.6]$$
Thus the test statistic to be used will be
$$T_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}} \qquad [6.7]$$
This test statistic is particularly useful when sample sizes are not equal, i.e. n1 ≠ n2, but when
sample sizes are the same, even if the pooled variance is used the value of Tcal obtained will
not be different from the one obtained using
$$T_{cal} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

For Case III, the critical values are obtained from t-tables. The critical values and rejection
criteria are as stated in Table 6.3.

Table 6.3: Decision Criteria for Case III

Hypothesis tested    Critical value                       Decision criteria: Reject H0 if:
Set A                $\pm t_{\alpha/2}(n_1 + n_2 - 2)$    $|T_{cal}| > t_{\alpha/2}(n_1 + n_2 - 2)$
Set B                $t_{\alpha}(n_1 + n_2 - 2)$          $T_{cal} > t_{\alpha}(n_1 + n_2 - 2)$
Set C                $-t_{\alpha}(n_1 + n_2 - 2)$         $T_{cal} < -t_{\alpha}(n_1 + n_2 - 2)$

Example 6.10
A price monitoring agency noted the price of a commodity in 8 different branches of a retail
chain (A) and in 7 branches of another retail chain (B). The results were as follows:
Retailer A: 21 19 20 21 25 23 19 22
Retailer B: 16 20 15 18 22 20 18

Establish at the 1% level of significance whether the average price of the commodity at
retailer B is significantly smaller than the price of the same commodity at retailer A.

Solution 6.10
H 0 : μ1 = μ 2
H1 : μ1 > μ 2
It is a one-tailed test, variances are not known and samples are small, therefore we use the
t-distribution.

The critical value is given by $t_{\alpha}(n_1 + n_2 - 2) = t_{0.01}(8 + 7 - 2) = t_{0.01}(13) = 2.65$

We would therefore reject H0 if Tcal > 2.65.

Before we calculate the test statistic, we need to find the pooled sample variance.
Consider retailer A as the first sample, so that
$n_1 = 8$, $\bar{x}_1 = 21.25$, $s_1^2 = 4.2143$
and retailer B as the second sample to give
$n_2 = 7$, $\bar{x}_2 = 18.4286$, $s_2^2 = 5.9524$
Now the pooled variance is given by
$$\begin{aligned}
s_p^2 &= \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \\
&= \frac{7(4.2143) + 6(5.9524)}{13} \\
&= 5.0165
\end{aligned}$$
$$\begin{aligned}
T_{cal} &= \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}} \\
&= \frac{21.25 - 18.4286}{\sqrt{\dfrac{5.0165}{8} + \dfrac{5.0165}{7}}} \\
&= \frac{2.8214}{\sqrt{1.3437}} \\
&= \frac{2.8214}{1.1592} \\
&= 2.4340
\end{aligned}$$
Since Tcal = 2.4340 < 2.65, we do not reject H0 and conclude that there is insufficient
evidence that the average price of the commodity is smaller at retailer B.
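The pooled variance [6.6] and the test statistic [6.7] of Example 6.10 can be recomputed from the raw prices. This sketch uses only Python's standard library; since it contains no t-distribution, the critical value 2.65 is typed in as read from t-tables, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

retailer_a = [21, 19, 20, 21, 25, 23, 19, 22]
retailer_b = [16, 20, 15, 18, 22, 20, 18]
t_crit = 2.65  # t_{0.01}(13), read from t-tables as in the text

n1, n2 = len(retailer_a), len(retailer_b)

# Pooled sample variance, formula [6.6]; variance() uses the divisor n - 1
sp2 = ((n1 - 1) * variance(retailer_a) + (n2 - 1) * variance(retailer_b)) / (n1 + n2 - 2)

# Test statistic, formula [6.7]
t_cal = (mean(retailer_a) - mean(retailer_b)) / sqrt(sp2 / n1 + sp2 / n2)

# Upper-tailed test (H1: mu1 > mu2)
reject = t_cal > t_crit
print(round(sp2, 4), round(t_cal, 4), reject)
```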
Activity 6.6
1. State the assumptions underlying the use of t-tests for independent samples.
2. An airtime vendor recorded the daily sales of two types of juice cards that he sells on seven
randomly selected days. The results are summarised below:
Buddie: 40 49 32 50 38 45 47
Easycall: 32 27 29 22 34 39 22
Test at 5 % level of significance whether the mean daily sales of the two types of juice
cards are significantly different.
3. A random sample of 12 students doing Physics has a mean score of 40% with a standard
deviation of 5%, while a random sample of 10 students doing Mathematics has a mean of
49% with a standard deviation of 10%. Assuming the scores are normally distributed, test
at the 5% level whether there is a significant difference in the mean scores of the two
categories of students.

6.8.2 Tests for paired samples


When two data values are measured from the same source before and after an intervention,
the two sets of data values are dependent and are called paired or matched samples. For
example, the demand of certain identified products can be noted before and after a salary
review. The results obtained can be used to test whether the salary review had an effect on the
demand of the products.
The hypotheses to be tested are:
A: $H_0: \mu_d = 0$ against $H_1: \mu_d \neq 0$
B: $H_0: \mu_d = 0$ against $H_1: \mu_d > 0$
C: $H_0: \mu_d = 0$ against $H_1: \mu_d < 0$
where $\mu_d$ is the population mean of paired differences.

For the case when n < 30, the test statistic is given by:
$$T_{cal} = \frac{\bar{d}}{s_d / \sqrt{n}} \qquad [6.8]$$
where $\bar{d}$ is the sample mean and $s_d$ is the sample standard deviation of the paired differences, $d_i$.
For hypothesis A, $H_0$ is rejected if $|T_{cal}| > t_{\alpha/2}(n - 1)$.
For hypothesis B, $H_0$ is rejected if $T_{cal} > t_{\alpha}(n - 1)$.
For hypothesis C, $H_0$ is rejected if $T_{cal} < -t_{\alpha}(n - 1)$.

Remark 6.1
For large samples (n ≥ 30), the t-distribution is replaced by the standard normal (z) distribution.

Example 6.11
The share prices of 7 randomly selected stocks on a stock exchange were noted before and
after the president of a certain country was admitted ill in hospital.
Price before 14 18 21 15 17 19 23
Price after 13 15 20 15 16 16 21

Test at 5% significance level if the President’s ill health reduced the stock prices.

Solution 6.11
$H_0: \mu_d = 0$
$H_1: \mu_d > 0$
where each difference is taken as $d_i$ = price before − price after.

Price before    14   18   21   15   17   19   23
Price after     13   15   20   15   16   16   21
$d_i$            1    3    1    0    1    3    2

$\bar{d} = 1.5714$ and $s_d = 1.1339$

The test statistic is given by:
$$T_{cal} = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{1.5714}{1.1339 / \sqrt{7}} = 3.6666$$

The critical value is given by:
$$t_{0.05}(6) = 1.94$$
We would reject $H_0$ if the test statistic is greater than 1.94.

Decision:
Since $T_{cal} = 3.6666 > 1.94$, we reject $H_0$ and conclude that
the ill health probably reduced stock prices.
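
The paired-sample calculation of Example 6.11 can be sketched in a few lines of Python. This is an illustrative check only; the function name `paired_t_statistic` is hypothetical, and the differences are taken as "before minus after" so that a fall in price gives a positive difference.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for the paired-samples t-test."""
    differences = [b - a for b, a in zip(before, after)]
    n = len(differences)
    d_bar = mean(differences)  # sample mean of the paired differences
    s_d = stdev(differences)   # sample standard deviation (divisor n - 1)
    return d_bar / (s_d / math.sqrt(n)), n - 1


# Data from Example 6.11
price_before = [14, 18, 21, 15, 17, 19, 23]
price_after = [13, 15, 20, 15, 16, 16, 21]

t_cal, df = paired_t_statistic(price_before, price_after)
print(f"T_cal = {t_cal:.4f} with {df} degrees of freedom")
```

Note that the two lists must be in the same order, since each difference is formed from one stock's pair of prices.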
Activity 6.7
The productivity level (number of units) of 5 randomly selected workers was measured
before and after the workers underwent a training programme. The results were as follows:
No. of units produced
before 9 11 8 13 15
after 10 13 10 12 17
Test at 5% whether the training programme increased the productivity level of the workers.

6.9 Testing for Difference of two Population Proportions


Here we test the hypothesis that the proportions in two populations are not different. The
notation to be used is:
Sample 1 Sample 2
Sample size n1 n2
Sample proportion p̂1 p̂2
Population proportion p1 p2

The hypotheses to be tested are:

A: $H_0: p_1 = p_2$ against $H_1: p_1 \neq p_2$
B: $H_0: p_1 = p_2$ (or $H_0: p_1 \leq p_2$) against $H_1: p_1 > p_2$
C: $H_0: p_1 = p_2$ (or $H_0: p_1 \geq p_2$) against $H_1: p_1 < p_2$

We shall assume large samples, that is, both $n_1$ and $n_2$ are greater than 30. The test statistic is
given by:
$$Z_{cal} = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{pq}{n_1} + \dfrac{pq}{n_2}}} \qquad [6.9]$$
which, if we assume $H_0$ (so that $p_1 - p_2 = 0$), simplifies to
$$Z_{cal} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{pq}{n_1} + \dfrac{pq}{n_2}}} \qquad [6.10]$$
Here $p$ is the pooled sample proportion given by
$$p = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2} \qquad [6.11]$$
and $q = 1 - p$.

The critical values for a given α are obtained from z-tables (see appendices). Table 6.4
shows the critical values and subsequent rejection criteria for the different hypotheses.

Table 6.4 Summary of Decision Criteria for Proportions

Hypothesis    Critical value        Reject H0 if:
A             $\pm Z_{\alpha/2}$    $|Z_{cal}| > Z_{\alpha/2}$
B             $Z_{\alpha}$          $Z_{cal} > Z_{\alpha}$
C             $-Z_{\alpha}$         $Z_{cal} < -Z_{\alpha}$

Example 6.12
A campaign manager of an aspiring Member of Parliament (MP) took a random sample of
2 000 voters in a certain constituency and found that 500 knew about the MP. After a
vigorous campaign exercise, another sample of 1 500 voters showed that 700 knew the MP.
Test at 5% level of significance if the campaign exercise increased the number of voters who
know the MP.

Solution 6.12
$H_0: p_1 = p_2$
$H_1: p_1 < p_2$
Sample 1 (before the campaign): $n_1 = 2000$, $\hat{p}_1 = \frac{500}{2000} = 0.25$
Sample 2 (after the campaign): $n_2 = 1500$, $\hat{p}_2 = \frac{700}{1500} = 0.4667$
The pooled sample proportion is
$$p = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2} = \frac{0.25(2000) + 0.4667(1500)}{2000 + 1500} = 0.3429$$
$$q = 1 - p = 1 - 0.3429 = 0.6571$$
The test statistic is
$$Z_{cal} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{pq}{n_1} + \dfrac{pq}{n_2}}} = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{pq\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
$$= \frac{0.25 - 0.4667}{\sqrt{0.3429(0.6571)\left(\dfrac{1}{2000} + \dfrac{1}{1500}\right)}} = \frac{-0.2167}{\sqrt{0.000262872}} = -13.3655$$

Critical value:
$$-Z_{\alpha} = -Z_{0.05} = -1.6449$$
We would reject $H_0$ if $Z_{cal} < -1.6449$.
Now, since $Z_{cal} = -13.3655 < -1.6449$, we reject $H_0$ and conclude that the proportion of
voters who know the MP has probably increased.
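
The two-proportion z-test of Example 6.12 can be reproduced with a short Python sketch. It is an illustration only: the function name `two_proportion_z` is our own, and the lower-tail critical value is taken from the standard library's `NormalDist` rather than from the z-tables. Because the code does not round the intermediate proportions, its statistic agrees with the hand calculation only to about two decimal places.

```python
import math
from statistics import NormalDist


def two_proportion_z(successes_1, n1, successes_2, n2):
    """Return the z statistic for H0: p1 = p2 using the pooled proportion."""
    p_hat_1 = successes_1 / n1
    p_hat_2 = successes_2 / n2
    # Pooled proportion; (x1 + x2)/(n1 + n2) equals (p1*n1 + p2*n2)/(n1 + n2).
    pooled_p = (successes_1 + successes_2) / (n1 + n2)
    pooled_q = 1 - pooled_p
    standard_error = math.sqrt(pooled_p * pooled_q * (1 / n1 + 1 / n2))
    return (p_hat_1 - p_hat_2) / standard_error


# Data from Example 6.12: 500 of 2000 before, 700 of 1500 after the campaign
z_cal = two_proportion_z(500, 2000, 700, 1500)

# Lower-tail critical value at the 5% level (hypothesis C)
critical_value = NormalDist().inv_cdf(0.05)

print(f"Z_cal = {z_cal:.4f}, critical value = {critical_value:.4f}")
print("Reject H0" if z_cal < critical_value else "Do not reject H0")
```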
Activity 6.8
An Election Support Network Organisation conducted a survey in Harare and Bulawayo to
ascertain voters’ attitudes towards devolution of power. In Harare 1 000 people were
interviewed and 720 said they do not support devolution. In Bulawayo, 800 people were
interviewed and 680 said they support devolution. Is there a significant difference between
the views of Harare and Bulawayo voters regarding devolution? Test at 1% significance
level.

6.10 Summary

In this unit, you learnt how to conduct hypothesis tests concerning the mean and
proportion of a single population. We defined a statistical hypothesis as an assumption or a
statement, which may or may not be true, made concerning a population parameter.
Hypothesis testing is therefore about verifying whether the claim is true or false. We saw that
there are two types of hypotheses, namely the null and the alternative hypothesis. The null
hypothesis is a statement of the assertion made concerning a population parameter.

The decision to reject or accept H0 is based on evidence gathered from a random sample
drawn from the population of interest. A wrong decision may be arrived at due to sampling
errors. A type I error is committed when H0 is rejected when in actual fact it is true. If H0 is
accepted when in fact it is false, the error committed is called a type II error. The hypothesis
is rejected if evidence from the sample is not consistent with the stated hypothesis, otherwise
it is accepted. However, the acceptance of the stated hypothesis does not necessarily imply
that it is true, rather it is a result of insufficient evidence to reject it.

You were also taught how to conduct hypothesis test for difference of two population means;
hypothesis test for difference of two population proportions and hypothesis test for the mean
difference of paired samples. Through the various examples and activities provided in the
unit you will surely appreciate the usefulness of hypothesis testing in business and economic
applications and in everyday life.

Further Reading

Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.
Arora, P.N. and Malhan, P.K. (2010). Biostatistics. Mumbai: Global Media.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Govindarajulu, Z. (2007). Non-parametric Inference. River Edge, NJ: World Scientific.
Hutcheson, G.D. and Moutinho, L. (2008). Statistical Modelling for Management. London:
SAGE Publications Ltd.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.

Unit 7

Chi-Square Tests

7.0 Introduction

The Chi-square tests were developed by Karl Pearson in 1900. They are classified under
non-parametric statistical tests. In this unit, you will learn about two such tests: the
Chi-Square test of association and the Chi-Square goodness-of-fit test. The Chi-Square test of
association tests whether or not two categorical variables are related to each other, based
on sample observations. If the variables are related, we say that they are associated and
therefore not independent. The Chi-Square goodness-of-fit test, on the other hand, is used to
test whether a sample of data came from a population with a specified distribution.

Since hypothesis testing is involved, you should follow the basic steps of hypothesis testing
outlined in Unit 6.

7.1 Objectives
By the end of the unit, you should be able to:
• define a contingency table
• formulate hypotheses for the Chi-Square tests
• determine critical values using the Chi-Square distribution
• calculate the test statistic
• conduct a Chi-Square test of association for two categorical variables
• conduct a Chi-Square goodness-of-fit test
• distinguish between a Chi-Square test of association and a Chi-Square goodness-of-fit
test

7.2 Conducting a Chi-Square Test of Association

We may wish to test whether students’ choice of a degree programme has anything to do with
gender. The two variables are ‘gender’ and ‘choice of degree programme’. So we ask
ourselves, “Are there differences in choice of degree programme between male and female
students?” The test will tell you whether choice of degree programme and gender are
associated, but it will not tell you directly which programmes are most favoured by men or by
women. If there are no differences, then the variables are independent;
otherwise they are said to be associated.

In carrying out the test, the general procedure of hypothesis testing should be observed. The
general outline of the test is as follows:
7.2.1 Hypothesis
The hypothesis is stated as:
H0: variables are not associated (independent)
H1: variables are associated (not independent)

The data are presented in the form of a table called a contingency table. A contingency table
gives the frequency for two or more variables simultaneously. Table 7.1 has two rows and
two columns of observations and it is called a 2 × 2 contingency table. The two variables are
denoted by X and Y.

Table 7.1: A Typical Contingency Table

          Y1        Y2        Total
X1        a (a1)    b (b1)    a + b
X2        c (c1)    d (d1)    c + d
Total     a + c     b + d     N

In the table, a, b, c and d are the observed frequencies for the different combinations of X and Y,
while a1, b1, c1 and d1 are the respective expected frequencies, and N is the grand total, which
is given by N = a + b + c + d.
The expected frequency for each observation is found by using the formula:
$$\text{Expected frequency} = \frac{\text{row total} \times \text{column total}}{\text{grand total}} \qquad [7.1]$$
For example, in Table 7.1 the expected frequency for b is given by:
$$b_1 = \frac{(a + b)(b + d)}{N}$$
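
Formula [7.1] applies to every cell of a contingency table, whatever its size. The sketch below shows one way of computing all the expected frequencies at once; the function name `expected_frequencies` is hypothetical, and the counts used are those of Example 7.1 later in this unit.

```python
def expected_frequencies(table):
    """Return expected frequencies (row total x column total / grand total)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [
        [row_total * column_total / grand_total for column_total in column_totals]
        for row_total in row_totals
    ]


# Observed frequencies: rows are male/female, columns are Mathematics/Marketing
observed = [[117, 63], [24, 56]]
expected = expected_frequencies(observed)

for row in expected:
    print([round(value, 4) for value in row])
```

The expected frequencies in each row and column add up to the same totals as the observed frequencies, which is a useful check on the working.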

7.2.2 Test statistic


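The test statistic is given by
$$\chi^2_{cal} = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \qquad [7.2]$$
where $E_i$ is the expected frequency for the i-th observed frequency, $O_i$, and the sum is taken
over all the cells of the contingency table.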

The steps to be followed in calculating the test statistic are:


Step 1: Calculate the expected frequencies for each observed frequency.    
Step 2: Subtract each expected frequency from the corresponding observed frequency to
obtain (Oi - Ei ).
Step 3: Square the differences for each i to get (Oi - Ei )2.
Step 4: Divide each square difference by the corresponding expected frequency and sum the
results for all i.
The work is best set out as shown in Table 7.2

Table 7.2: Calculation of $\chi^2$ Test Statistic

Category i    $O_i$    $E_i$    $O_i - E_i$    $(O_i - E_i)^2 / E_i$
1
2
3
etc.
Total         n        n        0              $\chi^2_{cal}$
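
Steps 2 to 4 above translate directly into a short loop. The following Python sketch is illustrative only; the function name `chi_square_statistic` is our own, and the expected frequencies passed to it are the rounded values used in Table 7.3 for Example 7.1.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        difference = o - e         # Step 2: O - E
        squared = difference ** 2  # Step 3: (O - E)^2
        total += squared / e       # Step 4: divide by E and accumulate
    return total


observed = [117, 63, 24, 56]
expected = [98, 82, 43, 37]  # rounded expected frequencies, as in Table 7.3

chi_sq = chi_square_statistic(observed, expected)
print(f"Chi-square statistic = {chi_sq:.4f}")
```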

7.2.3 Critical value
To find the critical value, we first determine the degrees of freedom. In general, given a
contingency table, the degrees of freedom are equal to the number of rows less one multiplied by
the number of columns less one. If a contingency table has r rows and c columns, the degrees of
freedom would be (r – 1)(c – 1). For a contingency table with r rows and c columns and a
known level of significance α, the critical value is given by:
$$\chi^2_{\alpha}[(r - 1)(c - 1)]$$
The rejection criterion is to reject $H_0$ if:
$$\chi^2_{cal} > \chi^2_{\alpha}[(r - 1)(c - 1)]$$

Example 7.1
A survey of first year university students sought to establish any association between choice
of degree programme and sex. Assuming only two degree programmes were on offer, the
following results were obtained:

Degree programme
Sex Mathematics Marketing
male 117 63
female 24 56

Use a 5% level of significance to test whether there is an association between sex and the
choice of degree programme.

Solution 7.1
H0: There is no association between sex and degree choice
H1: There is an association between sex and degree choice

The observed frequencies, with the expected frequencies in brackets, are:

           Mathematics    Marketing    Total
male       117 (98)       63 (82)      180
female     24 (43)        56 (37)      80
Total      141            119          260

The expected frequency for 117, for example, is obtained by
$$\frac{180 \times 141}{260} = 97.6154,$$
which is rounded to 98.

The critical value is given by:
$$\chi^2_{\alpha}[(r - 1)(c - 1)] = \chi^2_{0.05}[(2 - 1)(2 - 1)] = \chi^2_{0.05}(1) = 3.84$$
The rejection criterion is to reject $H_0$ if $\chi^2_{cal} > 3.84$.

The test statistic is calculated as shown in Table 7.3 below.

Table 7.3 Calculation of Test Statistic for Example 7.1

O       E      O − E    (O − E)^2    (O − E)^2 / E
117     98     19       361          3.6837
63      82     −19      361          4.4024
24      43     −19      361          8.3953
56      37     19       361          9.7568
Total                                26.2382

Now, since $\chi^2_{cal} = 26.2382 > 3.84$, we reject $H_0$ and conclude that sex and choice of degree
programme are associated.

Example 7.2
In order to find out whether men and women differ in the political party they prefer, the
following data were obtained from a sample of 1 718 eligible voters.

Party A Party B Party C


Male 313 124 391
Female 344 158 388

Test at 5% significance level whether there are differences in the way that males and females
vote.

Solution 7.2
H0: There is no association between sex and political party affiliation.
H1: Sex and political party affiliation are associated.

The observed frequencies, with the expected frequencies in brackets, are:

           Party A         Party B         Party C         Total
male       313 (316.64)    124 (135.91)    391 (375.44)    828
female     344 (340.36)    158 (146.09)    388 (403.56)    890
Total      657             282             779             1 718

The critical value is $\chi^2_{0.05}[(2 - 1)(3 - 1)] = \chi^2_{0.05}(2) = 5.99$.

Therefore we would reject $H_0$ if $\chi^2_{cal} > 5.99$.

The test statistic is calculated as shown in Table 7.4 below.

Table 7.4 Calculation of Test Statistic for Example 7.2

O       E         O − E     (O − E)^2    (O − E)^2 / E
313     316.64    −3.64     13.25        0.0418
344     340.36    3.64      13.25        0.0389
124     135.91    −11.91    141.85       1.0437
158     146.09    11.91     141.85       0.9710
391     375.44    15.56     242.11       0.6449
388     403.56    −15.56    242.11       0.5999
Total                                    3.3402

Since $\chi^2_{cal} = 3.3402$ is not greater than 5.99, we do not reject $H_0$ and conclude that the voting
patterns of males and females do not differ significantly, that is, there is no evidence of an
association between voting patterns and sex.
Activity 7.1
1. In a survey of drug abuse at the workplace, workers at a gold mine answered the
questions (1) ‘Do you smoke dagga?’ and (2) ‘Do you drink alcohol?’ as shown in the table
below:
Question 1
Question2 Yes No
Yes 56 30
No 18 6
Can you infer whether or not drinking alcohol is associated with smoking dagga?

2. In order to determine whether boys or girls got into trouble more often at school, the
following data in percentages was gathered.
Got in trouble No trouble
girls 25 56
boys 86 19
Is there a link between the sex of a student and getting into trouble at school? Test at 5%.

3. Out of a sample of 120 persons in a village, 76 were administered a new drug for
preventing influenza, and out of them 24 persons were attacked by influenza. Out of those
who were not administered the new drug, 12 persons were not affected by influenza.

Prepare a 2 x 2 contingency table of the observations.
Use Chi-square test for finding out whether the new drug is effective or not.

4. A tobacco company claims that there is no relationship between smoking and lung
ailments. To investigate the claims, a random sample of 300 males in the age group of 40
to 50 is given a medical test. The observed sample results are tabulated below:
                Lung ailment    No lung ailment
Smokers         75              105
Non-smokers     25              95
On the basis of this information, can it be concluded that smoking and lung ailments are
independent?

5. In a survey of 200 boys, of which 75 were intelligent, 40 had skilled fathers, while 85 of
the unintelligent boys had unskilled fathers. Does this support the hypothesis that skilled
fathers have intelligent boys?

7.3 Goodness-of-fit Test


The goodness-of-fit test is used to test whether a sample of data came from a population with a
specific distribution. It tests the appropriateness of a specified distribution based on sample
data. For example, we may want to test whether data are from a Poisson probability distribution or
from a Binomial probability distribution.

The test is applied to data that have been put into classes or categories. One drawback is that
the test is sensitive to the choice of categories, in that the value of the test statistic depends on
how the data have been categorised. Another disadvantage is that the test is not valid for small
samples. It requires a sufficient sample size (n > 30) in order for the $\chi^2$ approximation to be
valid.
7.3.1 Hypothesis
In general the hypothesis to be tested is:
H0: The observations follow the postulated distribution
H1: The observations follow some other distribution.
7.3.2 Test statistic
The test statistic is
$$\chi^2_{cal} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \qquad [7.3]$$
where $O_i$ is the observed frequency for category i, $E_i$ is the expected frequency for category i
and k is the number of categories.

7.3.3 Critical value


Assuming $H_0$, $\chi^2_{cal}$ is approximately an observation from a Chi-square distribution with k − 1
degrees of freedom, where k is the number of categories. However, the number of degrees of
freedom depends on whether or not the parameters of the postulated distribution have been
specified. If m of the parameters have not been specified and therefore need to be estimated,
then the degrees of freedom would be given by k − m − 1.

For a specified level of significance α, the critical value is given by $\chi^2_{k-m-1}(\alpha)$. This value is
obtained from statistical tables (Table 3 of the ZOU statistical tables). The critical value separates
the rejection region (the shaded region) from the acceptance region, as shown in Figure 7.1.
7.3.4 Decision Criteria

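[Figure 7.1: Critical Region for the Chi-Square Distribution. The critical value $\chi^2_{k-m-1}(\alpha)$
separates the acceptance region (to its left) from the shaded rejection region (to its right).]

The decision criterion is to reject $H_0$ if $\chi^2_{cal} > \chi^2_{k-m-1}(\alpha)$.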


Now let us see how the expected frequencies are obtained and how the test statistic is
calculated.

The expected frequency for category i, $E_i$, is the product of the sample size n and the
probability of category i, $p_i$, that is
$$E_i = n p_i \quad \text{where} \quad n = \sum_{i=1}^{k} O_i$$

Remark 7.2 Expected frequencies should be numbers and not percentages or ratios.

Remark 7.3 If any expected frequencies are not large enough, that is, less than 5,
the affected categories should be pooled with neighbouring categories.
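
Pooling of categories can be sketched as follows. This is only one possible approach, shown under stated assumptions: small categories are assumed to occur at the two ends of the table (as is typical for count data) and are merged into their neighbours. The function name `pool_small_categories` and the counts used are hypothetical.

```python
def pool_small_categories(observed, expected, min_expected=5.0):
    """Merge end categories into their neighbours until each expected
    frequency at the ends is at least min_expected."""
    obs, exp = list(observed), list(expected)
    # Pool from the upper end of the table.
    while len(exp) > 1 and exp[-1] < min_expected:
        last_expected = exp.pop()
        last_observed = obs.pop()
        exp[-1] += last_expected
        obs[-1] += last_observed
    # Pool from the lower end of the table.
    while len(exp) > 1 and exp[0] < min_expected:
        first_expected = exp.pop(0)
        first_observed = obs.pop(0)
        exp[0] += first_expected
        obs[0] += first_observed
    return obs, exp


# Hypothetical counts: the last category has an expected frequency below 5
pooled_observed, pooled_expected = pool_small_categories(
    [30, 40, 20, 7, 3], [28.0, 42.0, 21.0, 6.0, 3.0]
)
print(pooled_observed, pooled_expected)
```

Remember that k, and therefore the degrees of freedom, must be based on the number of categories that remain after pooling.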

The probability of each category i is calculated using the distribution specified by $H_0$. For
ease of computation, the working for calculating the test statistic is set out as shown in
Table 7.6.

Table 7.6: Calculation of $\chi^2$ Test Statistic

Category i    $O_i$    $P_i$    $E_i$    $O_i - E_i$    $(O_i - E_i)^2 / E_i$
1
2
3
.
.
k
Total         n        1        n        0              $\chi^2_{cal}$

Example 7.3
The number of AIDS deaths recorded per day in government hospitals over a period of 365
days is summarised below:

Number of deaths 0 1 2 ≥3
Observed frequency 9 80 120 156

Test at 5% significance level, the hypothesis that the number of deaths per day follows a
Poisson distribution with mean 2.

Remark 7.4 Before attempting questions in this section you may need to revise the Poisson
probability distribution, the binomial probability distribution and the normal probability
distribution covered in Unit 4.

Solution 7.3
The hypothesis to be tested is:
H0: The number of deaths per day follows the Poisson distribution with mean 2
H1: The number of deaths per day follows some other distribution
The test statistic is
$$\chi^2_{cal} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
There are k = 4 categories and no parameters had to be estimated (m = 0), so the critical value is
$$\chi^2_{k-m-1}(\alpha) = \chi^2_{4-0-1}(0.05) = \chi^2_{3}(0.05) = 7.81$$
The decision criterion is to reject $H_0$ if $\chi^2_{cal} > 7.81$.

Table 7.7: Calculation of Test Statistic for Example 7.3

Category i    $O_i$    $P_i$     $E_i$                   $O_i - E_i$    $(O_i - E_i)^2 / E_i$
0             9        0.1353    0.1353 × 365 = 49       −40            32.65
1             80       0.2707    0.2707 × 365 = 99       −19            3.65
2             120      0.2707    0.2707 × 365 = 99       +21            4.45
3 or more     156      0.3233    0.3233 × 365 = 118      +38            12.24
Total         365      1.0000    1.0000 × 365 = 365      0              $\chi^2_{cal} = 52.99$

The probabilities in the table were calculated as follows:

Let X be the number of deaths per day, so that $X \sim Po(2)$.
The probability distribution of X is given by
$$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!} \quad \text{for } x = 0, 1, 2, \ldots$$
where λ is the mean.
$$P(X = 0) = \frac{e^{-2} 2^0}{0!} = 0.1353$$
$$P(X = 1) = \frac{e^{-2} 2^1}{1!} = 0.2707$$
$$P(X = 2) = \frac{e^{-2} 2^2}{2!} = 0.2707$$
$$P(X \geq 3) = 1 - P(X \leq 2) = 1 - (0.1353 + 0.2707 + 0.2707) = 0.3233$$
Decision: Since $\chi^2_{cal} = 52.99 > 7.81$, we reject $H_0$ and conclude that the number of deaths
follows some other distribution. Note that the number of deaths may follow the Poisson
distribution with a mean λ which is not necessarily 2.
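
The working of Example 7.3 can be checked with the following Python sketch. It is an illustration only: the helper `poisson_pmf` is our own, and the expected frequencies are left unrounded, so the statistic differs slightly from the hand calculation, which rounds each expected frequency to a whole number.

```python
import math


def poisson_pmf(x, mean):
    """P(X = x) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** x / math.factorial(x)


# Observed frequencies for 0, 1, 2 and "3 or more" deaths per day
observed = [9, 80, 120, 156]
n = sum(observed)

# Probabilities under H0: Poisson with mean 2
probabilities = [poisson_pmf(x, 2) for x in range(3)]
probabilities.append(1 - sum(probabilities))  # P(X >= 3)

expected = [n * p for p in probabilities]  # E_i = n * p_i
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# k = 4 categories, m = 0 estimated parameters
degrees_of_freedom = len(observed) - 0 - 1

print([round(p, 4) for p in probabilities])
print(f"Chi-square = {chi_sq:.2f} on {degrees_of_freedom} degrees of freedom")
```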

Activity 7.2
1. The accounting department of a bank randomly selected 100 accounts and examined them
for errors. The following results were obtained:
Number of errors 0 1 2 3
Number of accounts 36 40 19 5
Test at 5% whether the distribution of errors is the Poisson distribution.

2. The ZRP Harare traffic section is worried about the increasing number of accidents at a
road junction. The number of accidents per week at the junction over a period of 120
weeks was recorded as follows:
Number of accidents Observed frequency
0 15
1 26
2 29
3 or more 50
Test the hypothesis that the number of accidents per week is a Poisson distributed random
variable with mean λ = 3 . Use α = 5% .

Example 7.4
A random sample of 100 employees from a large company were asked to indicate the number
of days they were absent from duty in a particular month.

No. of days absent 0 1 2


No. of employees 62 26 12

Let X be a random variable representing the number of days an employee is absent from duty
in a month. Test at 1% significance level, the hypothesis that X follows a binomial
probability distribution with parameters n = 2 and p = 0.45, that is, X ~ B(2,0.45) .

Solution 7.4
The hypothesis to be tested is:
$H_0: X \sim B(2, 0.45)$
$H_1$: X follows some other distribution.
The test statistic is
$$\chi^2_{cal} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
The critical value is given by $\chi^2_{k-m-1}(\alpha) = \chi^2_{2}(0.01) = 9.21$, so we would reject $H_0$ if
$\chi^2_{cal} > 9.21$.
We now calculate the test statistic. We start by calculating the probabilities for the different
categories as follows:
$X \sim B(2, 0.45)$
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} \quad \text{for } x = 0, 1, 2$$
$$P(X = 0) = \binom{2}{0} 0.45^0 \, 0.55^2 = 0.3025$$
$$P(X = 1) = \binom{2}{1} 0.45^1 \, 0.55^1 = 0.495$$
$$P(X = 2) = \binom{2}{2} 0.45^2 \, 0.55^0 = 0.2025$$
To find the expected frequency for a particular category, we multiply the probability of the
category by the sample size of 100 employees. The information is summarised in Table 7.8 below.

Table 7.8 Calculation of Test Statistic for Example 7.4

Category i    $O_i$    $P_i$     $E_i$    $O_i - E_i$    $(O_i - E_i)^2 / E_i$
0             62       0.3025    30       32             34.13
1             26       0.495     50       −24            11.52
2             12       0.2025    20       −8             3.2
Total         100      1         100      0              48.85

The decision is to reject $H_0$ since $\chi^2_{cal} = 48.85 > 9.21$. Thus, we conclude that the number of
days an employee is absent from duty does not follow the binomial
distribution with the specified parameters.
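
A Python sketch of the same test is given below as a check on the working. The helper `binomial_pmf` is our own name, and the expected frequencies are not rounded to whole numbers, so the statistic is slightly smaller than the value in Table 7.8; the decision is unaffected.

```python
import math


def binomial_pmf(x, trials, p):
    """P(X = x) for a binomial distribution B(trials, p)."""
    return math.comb(trials, x) * p ** x * (1 - p) ** (trials - x)


# Observed number of employees absent for 0, 1 and 2 days
observed = [62, 26, 12]
sample_size = sum(observed)

# Probabilities under H0: X ~ B(2, 0.45)
probabilities = [binomial_pmf(x, 2, 0.45) for x in range(3)]
expected = [sample_size * p for p in probabilities]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
critical_value = 9.21  # chi-square table value for 2 degrees of freedom at 1%

print(f"Chi-square = {chi_sq:.2f}")
print("Reject H0" if chi_sq > critical_value else "Do not reject H0")
```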
Activity 7.3
A typist believes that the chance of her making a typing error on a page is 4 in 10. Her
supervisor proofread a 40-page manuscript that she had typed and recorded the following
results:
No. of errors per page 0 1 2 3 4
No. of pages 4 8 6 12 10

Does the data suggest the number of typing errors per page is binomially distributed? Use
α = 10% .

Example 7.5
The Standards Association of Zimbabwe (SAZ) wanted to ascertain whether the life-time of a
certain brand of light bulbs conforms to the standard life-time of mean 18 months with a
standard deviation of 3 months. The SAZ made the following observations:

Range of life-time(months) No. of light bulbs


Less than 15 months 30
15-20 months 80
More than 20 months 20

Do the data provide enough evidence that the life-time of the light bulbs conforms to the
standard? Use α = 0.05.

Solution 7.5
Let X be the life-time of the bulbs, then
$H_0: X \sim N(18, 3^2)$
$H_1$: X follows some other distribution

Test statistic:
$$\chi^2_{cal} = \sum \frac{(O_i - E_i)^2}{E_i}$$
Critical value: $\chi^2_{0.05}(2) = 5.99$

Rejection criterion: Reject $H_0$ if $\chi^2_{cal} > 5.99$

The probabilities of the three categories are:
$$P(X < 15) = P\left(\frac{X - \mu}{\sigma} < \frac{15 - 18}{3}\right) = P(Z < -1) = 0.1587$$

$$P(15 < X < 20) = P\left(\frac{15 - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{20 - \mu}{\sigma}\right) = P\left(\frac{15 - 18}{3} < Z < \frac{20 - 18}{3}\right)$$
$$= P(-1 < Z < 0.667) = 0.7486 - 0.1587 = 0.5899$$

$$P(X > 20) = P\left(\frac{X - \mu}{\sigma} > \frac{20 - 18}{3}\right) = P(Z > 0.667) = 1 - 0.7486 = 0.2514$$
The probability of each category is multiplied by the sample size (n = 130) to obtain the expected
frequency of each category. We can now calculate the test statistic as shown in Table 7.9
below:

Table 7.9 Calculation of Test Statistic for Example 7.5

Category i              $O_i$    $E_i$    $O_i - E_i$    $(O_i - E_i)^2 / E_i$
Less than 15 months     30       21       9              3.857
15-20 months            80       77       3              0.1169
More than 20 months     20       33       −13            5.1212
Total                                                    9.095

Decision: Since $\chi^2_{cal} = 9.095 > 5.99$, we reject $H_0$ and conclude that the lifetime of a light bulb
follows some other distribution.
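
The category probabilities for a normal goodness-of-fit test can also be obtained in Python instead of from z-tables. The sketch below is illustrative only and uses `NormalDist` from the standard library; because it neither rounds the z-values nor the expected frequencies, its statistic is a little larger than the hand-calculated 9.095.

```python
from statistics import NormalDist

# Distribution postulated under H0: mean 18 months, standard deviation 3 months
lifetime = NormalDist(mu=18, sigma=3)

# Observed counts: less than 15, 15 to 20, more than 20 months
observed = [30, 80, 20]
n = sum(observed)

probabilities = [
    lifetime.cdf(15),                     # P(X < 15)
    lifetime.cdf(20) - lifetime.cdf(15),  # P(15 < X < 20)
    1 - lifetime.cdf(20),                 # P(X > 20)
]
expected = [n * p for p in probabilities]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print([round(e, 2) for e in expected])
print(f"Chi-square = {chi_sq:.3f}")
```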
Activity 7.4
A vegetable vendor recorded her daily sales over a period of 60 days as follows:
Daily sales($) No. of days
Less than 15 16
15-20 32
More than 20 12

Test at 5 % whether the daily sales are normally distributed with a standard deviation
of $2.

7.4 Summary
In this unit we started by looking at the Chi-Square test for association or independence. The
test is meant to ascertain whether two categorical variables are related or not. It compares
experimentally obtained results with those expected theoretically and based on the
hypothesis.

We also considered the Chi-square goodness-of-fit test. The test is used to test whether a sample of
data came from a population with a specified distribution. The test statistic that is used is the
$\chi^2$ value. The test is applied to data that have been put into classes or categories. One
drawback is that the test is sensitive to the choice of categories, in that the value of the test
statistic depends on how the data have been categorised. Also, the test is not valid for small
samples. It requires a sufficient sample size (n > 30) in order for the $\chi^2$ approximation to be
valid.

In both cases, the Chi-Square value is computed on the basis of frequencies in a sample and
thus the value is a statistic and not a parameter. For this reason, the tests are classified under
non-parametric statistical tests.

Further Reading
 

Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.
Arora, P.N. and Malhan, P.K. (2010). Biostatistics. Mumbai: Global Media.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Govindarajulu, Z. (2007). Non-parametric Inference. River Edge, NJ: World Scientific.
Hutcheson, G.D. and Moutinho, L. (2008). Statistical Modelling for Management. London:
SAGE Publications Ltd.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.

Unit 8

Regression and Correlation Analysis

8.0 Introduction

Regression analysis seeks to establish how variables are related by giving a functional form
of the relationship. A typical regression model is made up of a variable to be explained which
is called the dependent variable and some explanatory variables called the independent
variables. We use the term simple regression analysis to refer to an analysis of the
relationship between a dependent variable and a single independent variable. Where more
than one independent variable is involved the analysis is called multiple regression analysis.

Closely related to regression analysis is correlation analysis which seeks to establish the
extent to which variables are related. In this unit, we will demonstrate how to fit a regression
model to a set of data and then use the model for prediction. We will also demonstrate how to
conduct some inference on the regression models.

8.1 Objectives

By the end of the unit, you should be able to:


• distinguish between regression analysis and correlation analysis
• use a scatter plot to infer the kind of relationship that exists between variables
• describe a typical regression model
• explain the significance of the error term
• state the assumptions of the regression model
• fit a linear regression model to sample data
• use the regression model for prediction
• calculate the correlation coefficient
• determine how well the model fits the data

8.2 The Linear Regression Model


The general form of a regression model is given by:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \ldots + \beta_m X_{mi} + e_i, \qquad i = 1, 2, 3, \ldots, n$$
Where:
$Y_i$ is the ith observation on the dependent variable Y
$X_{mi}$ is the ith observation on the independent variable $X_m$
m is the number of independent variables. If m = 1, the model is a simple linear
regression model.
n is the number of observations
$\beta_0$ is the intercept and $\beta_1, \beta_2$ up to $\beta_m$ are the regression coefficients. These are
unknown and have to be estimated from observed data.
$e_i$ is the error term or stochastic disturbance term, which measures the deviation of
each observed $Y_i$ value from the true regression line.

Often we are guided by theory in deciding which independent variables to include in a
model. Where the theory is not very clear, we risk omitting some important variables from
our model. The error term is used as a proxy for all the omitted or neglected variables that
may affect the dependent variable but are not included in the regression model.

In some cases, even if the theory is clear, we may be forced to omit some variables because
data on the excluded variables may not be available or we cannot find suitable proxies for
them. In addition, we may judge that the combined effect of these variables is so
negligible that their inclusion in the model is needlessly expensive and unwarranted.
Activity 8.1
1. An A2 commercial farmer intends to investigate the factors that influence maize yield. He
suspects that some influential factors are soil fertility, amount of rainfall, variety of maize
seed, cultivation methods and amount of fertiliser used. Postulate a model that the farmer
can use to analyse the relationship between these variables.

2. Formulate a regression model which can be used to investigate the relationship between
salary income, working experience and level of education. Suggest other variables that
may have an influence on salary income other than working experience and level of
education.

8.2.1 Assumptions of the linear regression model


The assumptions are necessary for the valid interpretation of the regression estimates, that is,
so that estimates of the regression coefficients are efficient and the tests of hypotheses about
them are not biased.

The following are the assumptions of the linear regression model:


Assumption 1: The regression model is linear in the parameters. The highest power of the
parameters is 1.
Assumption 2: The regressor X is assumed to be non-stochastic, that is, the values of X are
considered fixed in repeated samples.
Assumption 3: The mean value of the error term $e_i$ is zero, $E(e_i \mid X_i) = 0$.
Assumption 4: Homoscedasticity – the error terms are assumed to have equal variance,
$Var(e_i \mid X_i) = \sigma^2$.
Assumption 5: No autocorrelation between the error terms. Given any two X-values, $X_i$ and
$X_j$ ($i \neq j$), the correlation between the error terms $e_i$ and $e_j$ is zero:
$Cov(e_i, e_j \mid X_i, X_j) = 0$.
Assumption 6: Zero covariance between $e_i$ and $X_i$. The disturbance terms and the
explanatory variables are uncorrelated.
Assumption 7: The number of observations (n) must be greater than the number of
parameters to be estimated (k), that is, n > k.
Assumption 8: Variability in X-values. The X-values in a given sample must not all be the
same. It will not be possible to estimate the parameters if all the X-values are the same.
Assumption 9: The regression model is correctly specified. There is no specification bias or
error in the model used in empirical analysis. The assumption serves as a reminder that our
regression analysis and the results from that analysis are conditional upon the chosen model.
Assumption 10: No perfect multicollinearity. There are no perfect linear relationships among
the explanatory variables.

8.3 Scatter Plot

A scatter plot is the first step in analysing the relationship between two variables. It is simply
a plot of one variable against another using observed data. On the scatter plot the independent
variable is usually scaled along the horizontal axis.

Example 8.1
An A2 commercial farmer intends to investigate the relationship between maize yield and
amount of fertiliser used. He subdivided his field into six equal plots. He used the same
variety of maize seed in all the plots and only varied the number of bags of fertiliser applied
to each plot. Suppose the following data was obtained.

Amount of fertiliser (bags) 2 3 5 7 8 12


Yield (tonnes) 3 5 9 11 10 15

Draw a scatter plot of the data and comment on the relationship between yield and amount of
fertiliser.

Solution 8.1
The scatter plot is shown in Figure 8.1. The scatter plot shows a strong positive linear
relationship between yield and amount of fertiliser applied. We say that the relationship is
‘positive linear’ because the points seem to follow a line with positive gradient. The strength
of the linear relationship is inferred from the dispersion of the points from the line. If the
points display little dispersion from the line then the relationship is strong otherwise it is
weak.

[Scatter plot: Yield (tonnes) on the vertical axis against Amount of fertiliser (bags) on the horizontal axis]

Figure 8.1 A Scatter Plot of Yield versus Amount of Fertiliser

Remark 8.1 We have used a few observations to simplify our analysis, but in practice
decisions should be based on many observations.

Example 8.2
The prices ($000) and ages (in years) of ten imported used cars of a specific model are as
follows:

Age(years) 6 9 7 6 8 10 9 11 5 7
Price($000) 12 7 9 10 9 4 6 3 15 10

Draw a scatter plot of the data and use it to infer the kind of relationship that exists between
price of car and age of car.

Solution 8.2
[Scatter plot: Price ($000) on the vertical axis against Age (years) on the horizontal axis]

Figure 8.2 A Scatter Plot of Price of Car versus Age of Car

The scatter plot shows a strong negative linear relationship between price of car and age of
car.
Activity 8.2
The data shown below relate the average monthly earnings of a random sample of factory
workers with their level of education (measured by years of schooling)

Level of education, X (years) 7 9 11 13 14 15 16 17 18


Monthly earnings, Y ($00) 1.0 1.5 2.4 2.0 3.0 3.2 4.0 5.6 8.0

Draw a scatter plot of the data and use it to comment on the relationship between the two
variables.

8.4 Estimating the Simple Regression Model

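The simple linear regression model $Y_i = \beta_0 + \beta_1 X_i + e_i$ is estimated by
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, where $\hat{\beta}_0$ and $\hat{\beta}_1$ are obtained from the following
computational formulae:

$$\hat{\beta}_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2} \qquad [8.1]$$

$$\hat{\beta}_0 = \frac{\sum y_i - \hat{\beta}_1 \sum x_i}{n} \qquad [8.2]$$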

Example 8.3
Use the data of Example 8.1 to estimate the regression equation of the data.

Solution 8.3
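$n = 6$, $\sum x = 37$, $\sum x^2 = 295$, $\sum xy = 403$, $\sum y = 53$, $\sum y^2 = 561$

$$\hat{\beta}_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{6(403) - 37(53)}{6(295) - (37)^2} = \frac{457}{401} = 1.139650873$$

$$\hat{\beta}_0 = \frac{\sum y - \hat{\beta}_1 \sum x}{n} = \frac{53 - 1.139650873(37)}{6} = 1.805486284$$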

The estimated regression equation is given by


Yield = 1.805486284 + 1.139650873 Amount of fertiliser
The value of $\hat{\beta}_1$ indicates that for every additional bag of fertiliser applied the yield of maize
is estimated to increase by 1.139650873 tonnes.
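
The hand calculation in Solution 8.3 can be checked with a short Python sketch. It is only an illustration of formulae [8.1] and [8.2]; the function name fit_simple_regression and the variable names are our own and are not part of any statistical package.

```python
def fit_simple_regression(x, y):
    """Return (intercept, slope) using computational formulae [8.1] and [8.2]."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # [8.1]
    intercept = (sum_y - slope * sum_x) / n                           # [8.2]
    return intercept, slope


# Data of Example 8.1
fertiliser = [2, 3, 5, 7, 8, 12]      # bags
maize_yield = [3, 5, 9, 11, 10, 15]   # tonnes

b0, b1 = fit_simple_regression(fertiliser, maize_yield)
print(f"Yield = {b0:.6f} + {b1:.6f} x Amount of fertiliser")
```

The same function can be reused for any pair of variables, for instance the data of Example 8.4.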

Example 8.4
A company would like to estimate the relationship between its monthly sales and the amount
that the company spends on advertisement per month. A random sample of monthly
observations made over the past year is:

Monthly Expenditure ($00) Monthly Sales ($00)


7 18
9 35
5 12
15 50

12 36
6 24
Draw a scatter plot to represent the data. Comment on the kind of relationship between
monthly expenditure on advertisement and monthly sales. Estimate the equation connecting
monthly expenditure and sales.

Solution 8.4
[Scatter plot: Monthly Sales ($00) on the vertical axis against Monthly Expenditure ($00) on the horizontal axis]

The points seem to be following a line with positive gradient. If you insert a line of best fit
through the points, you will see that the points do not deviate much from the line. We can
therefore conclude that there is a strong positive linear relationship between monthly
expenditure on advertisement and monthly sales.

Using the two variable statistical mode on your calculator, you will obtain the following
results:
$n = 6$, $\sum x = 54$, $\sum x^2 = 560$, $\sum y = 175$, $\sum xy = 1827$

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{6(1827) - 54(175)}{6(560) - (54)^2} = \frac{1512}{444} = 3.405405405$$

$$a = \frac{\sum y - b\sum x}{n} = \frac{175 - 3.405405405(54)}{6} = -1.481981978$$

The regression equation is $\hat{Y} = -1.481981978 + 3.405405405X$

Activity 8.3

1. The data relate to consumption expenditure ($) and disposable income ($) for a random
sample of 10 households in Harare.
Disposable Income X 30 45 60 100 150 200 240 250 300 320
Consumption Y 25 20 40 65 120 190 200 200 260 220

a) Obtain the estimated regression line of the data.


b) Interpret the values of the regression coefficients obtained.

2. The data in the table below relate a manufacturer’s market share (%) with product quality
measured on a scale 0 to 100.

Product quality 27 39 73 66 33 43 47 55 60 68 70 75 82
Market share (%) 2 3 10 9 4 6 5 8 7 9 10 13 12

a) State the independent and dependent variable.


b) Draw a scatter plot of the data. Is it reasonable to fit a linear regression model? Explain.
c) Estimate the simple linear regression equation between market share and product
quality.

8.4.1 Prediction using the regression model


The regression model can be used to predict values of the dependent variable for given values
of the explanatory variable. The prediction should be restricted to the range of X-values used
in the construction of the model, because there is no evidence that the estimated relationship
continues to hold outside this range. Using the model for prediction outside the range
(extrapolation) is therefore unreliable and risky.

Example 8.5
Use the model obtained in Example 8.4 to estimate the monthly sales if the monthly
expenditure on advertisement is $1 000.

Solution 8.5
X = 10 ⇒ Y = −1.481981978 + 3.405405405 (10)
= 32.57207207
≈ $3257.21

The monthly sales are estimated to be $3 257.21 if $1 000 is spent on advertisement per
month.
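
A minimal Python sketch of this prediction is given below. It refits the model of Example 8.4 and also reports whether the chosen X-value lies inside the range of the observed data. The helper names fit_line and predict are hypothetical and serve only as an illustration.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for the line Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def predict(x_new, a, b, x_observed):
    """Return the predicted Y and a flag showing whether x_new is within the observed range."""
    within_range = min(x_observed) <= x_new <= max(x_observed)
    return a + b * x_new, within_range


# Data of Example 8.4 (both variables in $00)
expenditure = [7, 9, 5, 15, 12, 6]
sales = [18, 35, 12, 50, 36, 24]

a, b = fit_line(expenditure, sales)
estimate, within_range = predict(10, a, b, expenditure)   # $1 000 = 10 ($00)
print(f"Predicted sales: ${estimate * 100:,.2f} (within observed range: {within_range})")
```

An expenditure of 20 ($00) would return a flag of False, a reminder that the estimate would then be an extrapolation.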

8.5 Estimating the Multiple Regression Model

The multiple regression model consists of more than one explanatory variable. The
computational formulae become more complicated as the number of explanatory variables
increases. As a result the computations are best handled by use of computer statistical
packages such as SPSS, Minitab, E-views, STATA, SAS and so on. You should be able to

interpret the computer output and use it to make some predictions as well as the necessary
inference.

Example 8.6
The monthly salary ($00) that an employee is entitled to is thought to be dependent upon the
worker’s years of schooling and the years of work experience. The following data was
obtained from a random sample of 10 employees.

Salary($00), Y Years of schooling, X2 Work experience(years), X3


8 13 6
7 11 5
12 16 15
10 15 9
6 11 3
9 13 6
12 18 14
13 18 16
10 16 10
7 11 4

Fit a multiple linear regression model to the data.

Solution 8.6
The estimated regression coefficients are shown in Table 8.1 which is a computer output that
was obtained using the SPSS statistical package.

Table 8.1 Coefficients(a)

Model                            Unstandardized Coefficients   Standardized Coefficients   t       Sig.
                                 B        Std. Error           Beta
1  (Constant)                    2.324    1.603                                            1.450   .190
   years of schooling            .291     .168                 .335                        1.728   .128
   years of working experience   .335     .098                 .663                        3.418   .011
a Dependent Variable: salary of an employee

From Table 8.1, we take the unstandardised coefficients, $\hat{\beta}_1 = 2.324$, $\hat{\beta}_2 = 0.291$ and
$\hat{\beta}_3 = 0.335$.

The multiple linear regression model is stated as:


Salary = 2.324 + 0.291 Years of schooling + 0.335 Years of working experience
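
If a statistical package is not at hand, the coefficients can also be reproduced with a short Python sketch that solves the least squares normal equations $(X'X)\hat{\beta} = X'Y$ directly. This is an illustration under our own assumptions, not the procedure SPSS uses internally; the function names solve_linear_system and fit_multiple_regression are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so that the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def fit_multiple_regression(rows, y):
    """Return [intercept, slope_1, slope_2, ...] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]          # column of ones for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Data of Example 8.6
salary = [8, 7, 12, 10, 6, 9, 12, 13, 10, 7]          # $00
schooling = [13, 11, 16, 15, 11, 13, 18, 18, 16, 11]  # years
experience = [6, 5, 15, 9, 3, 6, 14, 16, 10, 4]       # years

coefficients = fit_multiple_regression(list(zip(schooling, experience)), salary)
print([round(c, 3) for c in coefficients])   # [2.324, 0.291, 0.335]
```

The rounded values agree with the unstandardised coefficients in Table 8.1.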

Activity 8.4
Use the model developed in Example 8.6 to estimate the salary for an employee with 14 years
of schooling and 8 years working experience.

8.6 Testing the Significance of the Coefficients

The significance of the regression coefficients can be tested using the t-test. In this procedure,
sample results are used to verify the truth or falsity of a null hypothesis. The decision to
accept/reject H 0 is made on the basis of the value of a test statistic obtained from sample
observations.

Hypothesis: $H_0: \beta_i = 0$
            $H_1: \beta_i \neq 0$

Critical value: The critical value is obtained from t-tables and corresponds to the value
$t_{\alpha/2}(n-k)$, where $k$ is the number of parameters estimated in the model (the intercept plus
the slope coefficients) and $n$ is the number of observations.

Test statistic: The general test statistic is given by

$$T_{cal} = \frac{\hat{\beta}_i - \beta_i^*}{se(\hat{\beta}_i)} \qquad [8.3]$$

In the particular case where we are testing $H_0: \beta_i = 0$ v $H_1: \beta_i \neq 0$, if we assume $H_0$,
$\beta_i^* = 0$ so that the test statistic [8.3] will be reduced to

$$T_{cal} = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)} \qquad [8.4]$$

Decision criteria: We would reject H 0 if | Tcal |> tα 2 ( n − k ) .

Example 8.7
For the data of Example 8.6, test the significance of the coefficients using the t-test at 5%
level of significance.

Solution 8.7
The first hypothesis that we are testing is
$H_0: \beta_2 = 0$ v $H_1: \beta_2 \neq 0$
The model has three estimated parameters (the intercept and two slopes) and n = 10
observations, so the degrees of freedom are 10 − 3 = 7. This is a two-tailed test, so the
critical values are $\pm t_{0.025}(7) = \pm 2.365$.
The test statistic is

$$T_{cal} = \frac{\hat{\beta}_2}{se(\hat{\beta}_2)} = \frac{0.291}{0.168} = 1.7321$$

Since $|T_{cal}| = 1.7321 < 2.365$, we do not reject $H_0$ and conclude that the coefficient $\beta_2$ is not
significantly different from zero at 5%. This does not prove that years of schooling has no
influence on salary. The safe conclusion is that the data do not provide sufficient
evidence against $H_0$ at the 5% level of significance (probably because of the small sample
size used).

The other hypothesis that we are testing is

$H_0: \beta_3 = 0$ v $H_1: \beta_3 \neq 0$
Again this is a two-tailed test with 10 − 3 = 7 degrees of freedom, so the critical values are
$\pm t_{0.025}(7) = \pm 2.365$.
The test statistic is

$$T_{cal} = \frac{\hat{\beta}_3}{se(\hat{\beta}_3)} = \frac{0.335}{0.098} = 3.4184$$

Now, since $|T_{cal}| = 3.4184 > 2.365$, we reject $H_0$ and conclude that the coefficient $\beta_3$ is
significantly different from zero, implying that years of working experience has a significant
impact on salary.
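
The two t-tests can be summarised in a few lines of Python. The sketch below takes the coefficients and standard errors from Table 8.1 and assumes 7 degrees of freedom (10 observations less 3 estimated parameters), for which t-tables give a two-tailed 5% critical value of 2.365. The function name t_test is our own.

```python
def t_test(coefficient, std_error, critical_value):
    """Return the test statistic [8.4] and whether H0: beta = 0 is rejected."""
    t_cal = coefficient / std_error
    return t_cal, abs(t_cal) > critical_value


T_CRITICAL = 2.365   # from t-tables: 7 degrees of freedom, 0.025 in each tail (assumed)

# (coefficient, standard error) pairs read from Table 8.1
estimates = {
    "years of schooling": (0.291, 0.168),
    "years of working experience": (0.335, 0.098),
}

results = {name: t_test(b, se, T_CRITICAL) for name, (b, se) in estimates.items()}
for name, (t_cal, reject) in results.items():
    decision = "reject H0" if reject else "do not reject H0"
    print(f"{name}: T_cal = {t_cal:.4f}, {decision}")
```
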
8.6.1 Testing the overall significance of the model
The F-test, which employs the analysis of variance (ANOVA) approach, is meant to test
the overall significance of an observed multiple regression, that is, to test the joint hypothesis
that the true partial slope coefficients are zero simultaneously. The procedure of the test is
outlined below:

Hypothesis: $H_0: \beta_2 = \beta_3 = \dots = \beta_k = 0$
            $H_1$: Not all slope coefficients are simultaneously zero.

Critical value: The critical value is obtained from statistical tables. For a given level of
significance $\alpha$, numerator degrees of freedom $k - 1$ and denominator degrees of freedom
$n - k$, where $k$ is the number of parameters estimated (including the intercept), the critical
value is given by $F_\alpha(k-1, n-k)$.

Test statistic: Fcal obtained from the ANOVA table.

Decision criteria: Reject H 0 if Fcal > Fα (k − 1, n − k )

Example 8.8
For the data in Example 8.6, test the joint hypothesis that the partial regression coefficients
equal zero simultaneously.

Solution 8.8
Hypothesis: H 0 : β2 = β3 = 0
H1 : Not all slope coefficients are simultaneously zero.
Critical value: Fα (k − 1, n − k ) = F0.05 (2,7) = 4.74

Test statistic: The SPSS statistical package gave the following ANOVA table.

ANOVA(b)

Model            Sum of Squares   df   Mean Square   F         Sig.
1  Regression    51.002           2    25.501        127.647   .000(a)
   Residual      1.398            7    .200
   Total         52.400           9
a Predictors: (Constant), years of working experience, years of schooling
b Dependent variable: salary of an employee
The value of Fcal from the ANOVA table is 127.647.
Decision: Since $F_{cal} = 127.647 > F_\alpha(k-1, n-k) = 4.74$, we reject $H_0$ and conclude that not
all slope coefficients are equal to zero.
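
The F statistic is the ratio of the regression mean square to the residual mean square, so it can be recomputed from the sums of squares in the ANOVA table. The following Python sketch does this; the function name f_statistic is hypothetical, and the critical value 4.74 is simply the table value quoted in Solution 8.8.

```python
def f_statistic(ss_regression, ss_residual, n, k):
    """F = [SS_regression / (k - 1)] / [SS_residual / (n - k)], k counting the intercept."""
    ms_regression = ss_regression / (k - 1)
    ms_residual = ss_residual / (n - k)
    return ms_regression / ms_residual


F_CRITICAL = 4.74   # F(0.05; 2, 7) from statistical tables

# Sums of squares from the ANOVA table of Example 8.8 (n = 10, k = 3)
f_cal = f_statistic(51.002, 1.398, n=10, k=3)
reject_h0 = f_cal > F_CRITICAL
print(f"F_cal = {f_cal:.2f}, reject H0: {reject_h0}")
```

The result differs slightly from the 127.647 reported by SPSS because the sums of squares in the table are rounded to three decimal places.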

8.7 Correlation Analysis


Correlation analysis is concerned with establishing the strength of the linear relation between
variables. The level of correlation can be measured using the correlation coefficient. The
simple correlation coefficient (r) is a numeric measure of the correlation between a dependent
variable and a single explanatory variable. Its value ranges from -1 to +1. When we have k
explanatory variables, the multiple correlation coefficient R measures the strength of the
linear relationship between a dependent variable and the k independent variables. Because R
is the non-negative square root of $R^2$, its value ranges from 0 to 1.
Symbolically, we write $-1 \le r \le +1$ and $0 \le R \le 1$.

Possible values of r are interpreted as follows:


r = 0 indicates that there is no linear correlation between the variables
r = +1 indicates a perfect positive correlation between the variables
r = − 1 indicates a perfect negative correlation between the variables
r close to +1 indicates a strong positive correlation
r close to -1 indicates a strong negative correlation
r close to zero implies a weak correlation between the variables

8.7.1 Pearson’s product moment correlation coefficient


Pearson’s product moment correlation coefficient between two variables X and Y is
calculated using the following computational formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} \qquad [8.5]$$

Example 8.9
Use the data of Example 8.1 to calculate the Pearson’s correlation coefficient. Interpret the
value of r obtained.

Solution 8.9
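$n = 6$, $\sum x = 37$, $\sum x^2 = 295$, $\sum xy = 403$, $\sum y = 53$, $\sum y^2 = 561$

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

$$= \frac{6(403) - 37(53)}{\sqrt{\left[6(295) - (37)^2\right]\left[6(561) - (53)^2\right]}}$$

$$= \frac{457}{\sqrt{401 \times 557}} = \frac{457}{472.6066017} = 0.966977605$$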
The value of r indicates a strong positive correlation between maize yield and amount of
fertilizer applied.
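
Formula [8.5] translates directly into Python. The sketch below recomputes r for the data of Example 8.1 and squares it to obtain the coefficient of determination; the function name pearson_r is our own.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient, formula [8.5]."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Data of Example 8.1
fertiliser = [2, 3, 5, 7, 8, 12]
maize_yield = [3, 5, 9, 11, 10, 15]

r = pearson_r(fertiliser, maize_yield)
r_squared = r ** 2
print(f"r = {r:.6f}, r squared = {r_squared:.6f}")
```
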

8.7.2 Coefficient of determination


The simple coefficient of determination $r^2$ measures the proportion or percentage of total
variation in the dependent variable that is explained by variation in the independent variable.
Similarly, the multiple coefficient of determination $R^2$ measures the proportion of the
variation in the dependent variable that is explained by the combination of the independent
variables in the regression model. The value of $r^2$ is never negative and ranges from zero to
one, that is, $0 \le r^2 \le 1$. A value of $r^2 = 0$ means that there is no linear relationship between
the variables. In the context of multiple linear regression, the value of $R^2$ is also reflective of
the goodness of fit of the fitted regression model. The larger the value of $R^2$ the better is the fit.

Example 8.10
Find the coefficient of determination for the data of Example 8.1 and interpret the result.

Solution 8.10
r 2 = (0.966977605)2
= 0.935045688

The interpretation is that about 93.50% of the variation in maize yield is attributable to
variation in amount of fertilizer applied.

Using SPSS the output is shown in Table 8.4 below.

Table 8.4 Coefficient of Determination

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .967(a)   .935       .919                1.22780
a Predictors: (Constant), Fertiliser(bags)

Adjusted $R^2$
$R^2$ is a non-decreasing function of the number of explanatory variables present in the model,
that is, an additional explanatory variable when added to the model will not decrease $R^2$ even
when it is only spuriously correlated with the response variable. The more explanatory
variables a model has, the higher $R^2$ will be. Thus we need a measure of goodness of fit that
is adjusted for the number of explanatory variables in the model. This measure is the
adjusted $R^2$, which is denoted by $\bar{R}^2$ and computed as follows:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k} \qquad [8.6]$$

where n is the number of observations and k is the number of parameters estimated.

You will realise that when unimportant variables are added to the model, the value of $\bar{R}^2$
can decrease even though that of $R^2$ increases.
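
Formula [8.6] is easy to apply by hand, but a short Python sketch makes it simple to check computer output. The function name adjusted_r_squared is hypothetical. The first call uses the values of Example 8.10 (n = 6 observations, k = 2 parameters) and should agree with the Adjusted R Square of .919 in Table 8.4.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R squared from formula [8.6]; k counts all estimated parameters."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k)


# Simple regression of Example 8.1: intercept and one slope, six observations
adj_simple = adjusted_r_squared(0.935045688, n=6, k=2)
print(round(adj_simple, 3))   # 0.919
```
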

Example 8.11
The table below shows monthly data on a country’s exports (in millions of US dollars) and
four other economic variables: money supply (M1) in millions of US dollars, minimum bank
lending rate (Interest, %), an index of local prices (CPI), and the exchange rate (Exchange) of
the country’s currency per US dollar.

Exports M1 Interest CPI Exchange


2.6 3.1 7.5 112 2.15
2.6 4.9 7.8 115 2.16
2.7 5.1 8 124 2.16
3 5.1 8.1 127 2.14
2.9 5.1 8.2 129 2.11
3.1 5.2 8.2 130 2.13
3.2 5.1 8.2 136 2.13
3.7 5.2 8.6 139 2.11
3.6 5.3 8.7 141 2.16
3.4 5.8 9 154 2.17
3.7 5.7 9 155 2.13
4 5.7 9.2 157 2.09
4.1 6 9.2 158 2.04
4.9 6.2 9.8 158 2.12
5 6.7 10 162 2.15
5.4 7.1 10 163 2.14
4.2 7.3 10.2 165 2.08
4.1 7.2 10.7 168 2.11
4.6 7.5 10.8 169 2.11
4.4 7.9 11 156 2.13
4.3 7.8 10.2 158 2.13
4.1 8.2 9.8 160 2.11
4 8 9.3 156 2.16
3.7 8.6 9 157 2.17
3.5 8.6 9 154 2.13
3 8.5 8.9 159 2.09
3 8.5 8.2 158 2.04
3.2 8.7 8 158 2.12
3 8.5 7.7 153 2.1
2.9 8.4 7.5 152 2.11

Perform a multiple regression analysis with exports as the dependent variable and the other
four economic variables as explanatory variables.

Solution 8.11
The computer output below shows the regression results obtained using E-views.

Dependent Variable: EXPORTS
Method: Least Squares
Date: 08/22/14 Time: 11:36
Sample: 2009M01 2011M06
Included observations: 30

Variable Coefficient Std. Error t-Statistic Prob.

C -9.927463 5.071488 -1.957505 0.0615


M1 -0.170930 0.071989 -2.374381 0.0256
INTEREST 0.402707 0.105967 3.800321 0.0008
EXCHANGE 3.117108 2.252334 1.383946 0.1786
CPI 0.030099 0.010193 2.952848 0.0068

R-squared 0.809696 Mean dependent var 3.663333


Adjusted R-squared 0.779247 S.D. dependent var 0.747632
S.E. of regression 0.351270 Akaike info criterion 0.896491
Sum squared resid 3.084772 Schwarz criterion 1.130024
Log likelihood -8.447361 Hannan-Quinn criter. 0.971200
F-statistic 26.59211 Durbin-Watson stat 1.208151
Prob(F-statistic) 0.000000

The multiple linear regression model is stated as:


Exports = -9.927463 +0.030099 CPI +3.117108 Exchange + 0.402707 Interest -0.170930 M1

The high $R^2$ value of 0.809696 indicates that the model fits the data well. The combination of
the four economic variables explains about 80.97% of the variation in exports (77.92% after
adjusting for the number of explanatory variables). The variables CPI, Interest, and M1 are
important predictors of exports since the corresponding coefficients are statistically
significant at 1%, 1% and 5% respectively. The results also show that the exchange rate is
not an important predictor of exports (p-value = 0.1786).

Remark 8.2 In practice the series (except interest rates) are expressed in natural logs so that
the interpretation is made in terms of elasticities. The series are also subjected to some tests
before running the regressions.

8.8 Summary
In this unit we looked at two related concepts of regression analysis and correlation analysis.
While regression analysis is about establishing the functional form of the linear relationship
that exists between variables, correlation analysis is about measuring the strength of the
relationship between the variables. You learned how to estimate the linear regression models
and how to use them for prediction. We showed you how to conduct statistical inference
concerning the parameters of the model. You also learned how to calculate the correlation
coefficient and the coefficient of determination. The coefficient of determination gives an
idea of the goodness of fit of the fitted model.


Further Reading
Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.
Arora, P.N. and Malhan, P.K. (2010). Biostatistics. Mumbai: Global Media.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Govindarajulu, Z. (2007). Non-parametric Inference. River Edge, NJ: World Scientific.
Gujarati, D.N. (2004). Basic Econometrics (4th edition). New Delhi: McGraw-Hill.
Hutcheson, G.D. and Moutinho, L. (2008). Statistical Modelling for Management, London:
SAGE Publications Ltd.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.

Unit 9

Introduction to Time Series Analysis


 

9.0 Introduction
A time series is a sequence of measurements or observations made on a variable at regular
time intervals over a relatively long period of time. Many business variables have
observations made on them at regular time intervals. Examples of these variables are daily
sales, monthly payroll, annual exports, annual profits and so on. For instance, a country’s
annual exports recorded over 30 years constitute time series data. Time series data are
important in that they help business managers to review past performance and they provide a
basis for predicting future values of the time series.

In this unit we make use of time series charts to describe the different components of a time
series. We use the method of least squares to fit a trend line to time series data. We
demonstrate how to use the moving average method to smooth a time series.

9.1 Objectives
By the end of this unit, you should be able to:
• define a time series
• describe the components of a time series
• draw time series charts
• carry out a trend analysis of time series data using the least squares method and the
moving average method
• deseasonalise data using the Ratio-to-Moving Average method
• predict future values of a time series

9.2 Components of a Time Series

A time series is best analysed by decomposing it into different components. The components
of a time series are:
• Trend component (T)
• Seasonal component (S)
• Cyclical component (C)
• Irregular Component (I)
9.2.1 Trend component
The trend component is an underlying longer-term movement in the series showing a steady
tendency of increase or decrease through time as illustrated in Figure 9.1.

[Chart: Yt plotted against time in years, with the trend line marked]
Figure 9.1 Long Term Trend

The trend shows the overall movement in the series.


9.2.2 Seasonal component
The seasonal component is a short-term recurrent component, which may be daily, weekly,
monthly, or quarterly. The type of ‘seasonal’ component thus depends on how regularly the
data are collected. However, seasonal variation is usually a feature of data collected quarterly.
The variation follows a complete cycle throughout a whole year, with the same general
pattern repeating itself year after year as illustrated by Figure 9.2.

[Chart: Yt plotted over Year 1, Year 2 and Year 3, with the same pattern repeating each year]


Figure 9.2 Seasonal Component

Examples of time series variables (Yt) that display seasonal variation are:
• Sales of seasonal items such as blankets/jerseys, school uniforms, umbrellas, fruits
• Credit card spending which is generally high towards and during the festive season
• Electricity consumption which varies depending on time of the day
9.2.3 Cyclical component
The cyclical component is a long-term recurrent component that repeats over several years.
Cyclical movements differ in intensity and also vary in lengths usually lasting from 2 to 10
years. In business, cyclical behaviour is often referred to as the business cycle characterised
by troughs and peaks of business activity. Figure 9.3 is an illustration of the cyclical
component.

[Chart: cyclical movement of a series, horizontal axis marked 0, 5, 10, 15]

Figure 9.3 Cyclical Variation

9.2.4 Irregular component


The irregular component accounts for variation which is of a random nature and is not part of
either the trend or the recurrent components. It does not contain any obviously predictable
pattern. The variation is due to sporadic forces such as natural disasters (floods, drought,
cyclones) or man-made disasters such as civil wars, strikes, boycotts. Figure 9.4 is an
illustration of the irregular component.

[Chart: Yt plotted over Year 1, Year 2 and Year 3, showing irregular movements]


Figure 9.4 Irregular Trend

9.3 Time Series Models

The relationship between the components of a time series can be described by two models of
a time series which are the Additive Model and the Multiplicative Model.
9.3.1 Additive Model
The model assumes that the components are added together with each observation Yt being
the sum of a set of components:

Yt = Tt + St + Ct + It [9.1]

The model is appropriate for series that have regular and constant fluctuations around a trend.
To decompose an additive time series, you subtract the estimated components from the
observed series.
9.3.2 Multiplicative Model
The model assumes that the observed time series values are a product of the four components,
when all exist. The model is given by:

Yt = Tt × St × Ct × It [9.2]

This model is more commonly used than the additive model because it is found to describe
time series more appropriately in a wide range of applications. It is more appropriate for
series that have regular but not constant fluctuations around a trend. To decompose a time
series which is assumed to be multiplicative, we divide the observed series by the estimated
components.

9.4 Isolating the Trend Component

It is important to isolate the trend so that we have an idea of the general direction taken by the
series. In this section, we will discuss two methods of trend analysis which are the least
squares method, and the moving average method.

9.4.1 Least squares method


A time series chart of the observations against time may show that a straight line best
describes the increase or decrease in the series. In such cases you will use the least squares
technique borrowed from simple linear regression to estimate the trend equation.

The trend equation is estimated by

Yˆt = a + bX t [9.3]

where:

Ŷt is the estimated value of the dependent variable


X t is the independent variable which is time numbered sequentially from 1
a is the intercept, and
b is the slope of the trend line

The computational formulae for a and b are given by

$$b = \frac{n\sum X_t Y_t - \sum X_t \sum Y_t}{n\sum X_t^2 - (\sum X_t)^2} \qquad [9.4]$$

$$a = \frac{\sum Y_t - b\sum X_t}{n} \qquad [9.5]$$

Example 9.1
The annual maize production (in metric tonnes) at Bere farm for the past ten years is

Year                         2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
Production (metric tonnes)   74   85   87   92   110  115  130  136  142  150

a) Fit a trend line to the data.


b) Forecast production for the year 2012.

Solution 9.1
a) The years are coded sequentially by assigning 2002 = 1, 2003 = 2, 2004 = 3 and so on. The
following results are obtained:
$n = 10$, $\sum X_t = 55$, $\sum X_t^2 = 385$, $\sum X_t Y_t = 6889$, $\sum Y_t = 1121$, $\sum Y_t^2 = 132119$

$$b = \frac{n\sum X_t Y_t - \sum X_t \sum Y_t}{n\sum X_t^2 - (\sum X_t)^2} = \frac{10(6889) - 55(1121)}{10(385) - (55)^2} = \frac{7235}{825} = 8.76969697$$

$$a = \frac{1121 - 8.76969697(55)}{10} = 63.86666667$$

The trend line is given by


Yˆt = 63.86666667 + 8.76969697 X t

b) For 2012, $X_t = 11$ so that the maize production in 2012 is forecast to be

$$\hat{Y}_t = 63.86666667 + 8.76969697(11) = 160.3333333 \approx 160 \text{ metric tonnes}$$
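
The trend line and the forecast of Solution 9.1 can be reproduced with the following Python sketch, which applies formulae [9.4] and [9.5] to the coded years. The variable names are our own.

```python
# Annual maize production at Bere farm, 2002 to 2011 (metric tonnes)
production = [74, 85, 87, 92, 110, 115, 130, 136, 142, 150]
periods = list(range(1, len(production) + 1))   # 2002 = 1, 2003 = 2, ...

n = len(production)
sum_x = sum(periods)
sum_y = sum(production)
sum_xy = sum(x * y for x, y in zip(periods, production))
sum_x2 = sum(x * x for x in periods)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope, [9.4]
a = (sum_y - b * sum_x) / n                                    # intercept, [9.5]

forecast_2012 = a + b * 11                                     # 2012 is period 11
print(f"Trend: Y = {a:.4f} + {b:.4f} X, forecast for 2012 = {forecast_2012:.1f}")
```
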

Activity 9.1
The following data give the quarterly sales figures for a retail outlet for the period
2002 to 2004.

Year Quarter 1 Quarter 2 Quarter 3 Quarter 4


2002 10 12 13 11
2003 12 15 16 13
2004 14 16

a) Estimate the trend line using the Least Squares method


b) Forecast the sales for the last quarter of year 2004.

9.4.2 Moving average method


A moving average (MA) of a time series is an average of a fixed number of observations that
moves as we progress down the series. The moving averages smooth out peaks and valleys
in the original series to leave a relatively smooth trend. The moving averages are
therefore estimates of the trend at different stages of the series.

Each moving average is centred at the middle of the observations from which it has been
calculated. The term of the moving average series is meant to coincide with the periodicity of
the original series. For example, a four-point moving average will be appropriate for
quarterly data.

Example 9.2
The daily sales of an airtime vendor over 12 days are recorded below:
37 24 62 80 77 95 94 133 148 155 128 161
Calculate a 3- point moving average of the sales.

Solution 9.2
A 3-point moving average requires you to find the average of three consecutive observations
at a time.
The first MA = (37 + 24 +62)/3 = 41.00
The second MA = (24 + 62 + 80)/3 = 55.33
The third MA = (62 + 80 + 77)/3 = 73.00 and so on.

Sales        37   24   62     80   77   95     94      133   148     155     128   161
3-point MA   -    41   55.33  73   84   88.67  107.33  125   145.33  143.67  148   -

Note that each moving average is centred at the middle of the observations used to calculate
it, so that we lose two values, one at the start and the other at the end of the series. Centring
is problematic when the number of terms is even because the moving averages then fall
between two time periods and are ‘out of phase’ with the time series observations. To centre
the MA series, a further 2-point MA is found by averaging every consecutive pair, as
illustrated in Example 9.3.
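
The moving average calculation is repetitive and lends itself to a short Python sketch. The two function names below are hypothetical. The first returns the plain (uncentred) moving averages, and the second applies the further 2-point average when the number of terms is even.

```python
def moving_average(series, window):
    """Average of every run of `window` consecutive observations."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def centred_moving_average(series, window):
    """Moving averages aligned with actual time periods.

    For an odd window the plain moving averages are already centred.
    For an even window, every consecutive pair of them is averaged.
    """
    uncentred = moving_average(series, window)
    if window % 2 == 1:
        return uncentred
    return moving_average(uncentred, 2)


# Daily airtime sales of Example 9.2
airtime_sales = [37, 24, 62, 80, 77, 95, 94, 133, 148, 155, 128, 161]
ma3 = moving_average(airtime_sales, 3)
print([round(value, 2) for value in ma3])
```

The first value in ma3 belongs to day 2 and the last to day 11, which is why the series is two values shorter than the original.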

Example 9.3
The data below shows the sales ($000s) of a seasonal good at a retail outlet over three years.

Year Q1 Q2 Q3 Q4
1 14 32 33 6
2 16 35 36 7
3 15 38 41 8

a) Calculate four-point moving averages for the series


b) Plot the four-point MA series on the same graph as the original series.

Solution 9.3
a) Table 9.1 A 4-Point Centred MA of Sales

Year  Quarter  Sales (Yt)  Uncentred 4-point MA  Centred 4-point MA (T)
1     1        14
      2        32
                           21.25
      3        33                                21.500
                           21.75
      4        6                                 22.125
                           22.50
2     1        16                                22.875
                           23.25
      2        35                                23.375
                           23.50
      3        36                                23.375
                           23.25
      4        7                                 23.625
                           24.00
3     1        15                                24.625
                           25.25
      2        38                                25.375
                           25.50
      3        41
      4        8

b) [Time series chart: Sales on the vertical axis against quarters 1 to 4 of Year 1, Year 2 and
Year 3, showing the original series and the moving average series]
Figure 9.5 Original Series and Moving Average Series Showing Trends in Sales

The moving averages remove the fluctuations in the time series and make the curve smooth
as shown in Figure 9.5. The smoothed curve shows a moderate, upward trend in sales during
the three year period.
Activity 9.2
A supplier of school stationery recorded its quarterly sales figures ($00s) for the years
2009 to 2012. The data is shown in the table below.
Year Q1 Q2 Q3 Q4
2009 48 52 16 35
2010 50 46 22 40
2011 68 34 26 35
2012 73 56 16 45

a) Draw a time series chart of the data and comment on the trend and seasonal
components
b) Calculate centred 4- point moving averages for the data.
c) Plot the four-point MA series on the same graph as the original series.

9.5 Isolating the Seasonal Component

If we have a seasonal time series, finding a moving average series for it will have the effect
of smoothing out the seasonality. Assuming a multiplicative model, we divide each
observation by the corresponding value of the moving-average series to isolate the seasonal
and irregular components. This procedure is known as the ratio-to-moving average method.

The stages that are followed in the ratio-to-moving average procedure for quarterly data are:
1. Calculate a centred 4-point moving average series

2. Find seasonal ratios by dividing each actual time series observation, Yt by its
corresponding moving average value

$$\text{Seasonal ratio} = \frac{Y_t}{MA} = \frac{T_t \times C_t \times S_t \times I_t}{T_t \times C_t} = S_t \times I_t \qquad [9.6]$$

3. Find the average seasonal ratio for each quarter. The average could be the mean or
median but in most cases the median is used since it is not affected by outliers.

4. Add up the average seasonal ratios. They should add up to 4. If they do not add up to 4
adjust each average by adding to it one-fourth of the difference between their sum and
4. The results are adjusted seasonal ratios/indexes.

Example 9.4
Calculate adjusted seasonal indexes for the data of Example 9.3.

Solution 9.4
The necessary calculations are presented in the form of a table as illustrated in Table 9.2.

Table 9.2 Calculation of Seasonal Indexes

Year  Quarter  Sales (Yt)  Uncentred 4-point MA  Centred 4-point MA (T)  Seasonal Ratio Yt/T
1     1        14
      2        32
                           21.25
      3        33                                21.500                  1.535
                           21.75
      4        6                                 22.125                  0.271
                           22.50
2     1        16                                22.875                  0.699
                           23.25
      2        35                                23.375                  1.497
                           23.50
      3        36                                23.375                  1.540
                           23.25
      4        7                                 23.625                  0.296
                           24.00
3     1        15                                24.625                  0.609
                           25.25
      2        38                                25.375                  1.498
                           25.50
      3        41
      4        8

After obtaining the seasonal ratios, you then find the mean seasonal index for each quarter.

Year   Q1       Q2       Q3       Q4
1      -        -        1.535    0.271
2      0.699    1.497    1.540    0.296
3      0.609    1.498    -        -
Mean   0.6540   1.4975   1.5375   0.2835

Sum of means = 0.6540 + 1.4975 + 1.5375 + 0.2835 = 3.9725


To ensure that the sum of means is 4, we add to each mean one-fourth of the difference
between 3.9725 and 4, that is, 0.0275 ÷ 4 = 0.006875:

Quarter   Adjustment           Adjusted Seasonal Index
1         0.6540 + 0.006875    0.660875
2         1.4975 + 0.006875    1.504375
3         1.5375 + 0.006875    1.544375
4         0.2835 + 0.006875    0.290375
Total                          4.000000
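
The four stages of the ratio-to-moving average procedure can also be sketched in a few lines of Python. This is only an illustration, not part of the module's method: the function names `centred_ma4` and `adjusted_seasonal_indexes` are hypothetical, the series is assumed to start in Quarter 1, the mean (not the median) is used as the average, and the seasonal ratios are rounded to three decimal places purely to mirror the hand calculation.

```python
# Quarterly sales from Table 9.2 (three years, assumed to start in Quarter 1).
sales = [14, 32, 33, 6, 16, 35, 36, 7, 15, 38, 41, 8]


def centred_ma4(y):
    """Return {position: centred 4-point MA}; positions are 0-based."""
    uncentred = [sum(y[i:i + 4]) / 4 for i in range(len(y) - 3)]
    # Averaging two adjacent uncentred values lines the result up with a quarter.
    return {i + 2: (uncentred[i] + uncentred[i + 1]) / 2
            for i in range(len(uncentred) - 1)}


def adjusted_seasonal_indexes(y, periods=4):
    """Ratio-to-moving-average indexes using the mean (illustrative helper)."""
    ratios = [[] for _ in range(periods)]
    for t, ma in centred_ma4(y).items():
        # Rounded to 3 d.p. only to mirror the hand calculation.
        ratios[t % periods].append(round(y[t] / ma, 3))
    means = [sum(r) / len(r) for r in ratios]
    # Spread the shortfall (or excess) equally so the indexes sum to `periods`.
    correction = (periods - sum(means)) / periods
    return [m + correction for m in means]


indexes = adjusted_seasonal_indexes(sales)
for quarter, s in enumerate(indexes, start=1):
    print(f"Quarter {quarter}: {s:.6f}")
```

Without the rounding step the indexes would differ slightly in the later decimal places, because the hand calculation averages ratios that have already been rounded.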

Activity 9.3
Calculate adjusted seasonal indexes for the data of Activity 9.2.

9.5.1 Deseasonalising of data


Deseasonalising refers to removing the effects of seasonal influence from the data. This is
achieved by dividing the actual Y value for each period by its corresponding adjusted
seasonal index.

\[ \text{Deseasonalised } Y = \frac{\text{Actual } Y}{\text{Adjusted seasonal index } S} \]   [9.7]

Example 9.5
Using the data of Example 9.3, obtain the deseasonalised series.

Solution 9.5
Table 9.3 Calculation of Deseasonalised Sales Values
Year  Quarter  Sales (Yt)  Adjusted Seasonal  Deseasonalised
                           Index (S)          Sales (Yt/S)
1     1        14
      2        32
      3        33          1.544375           21.368
      4         6          0.290375           20.663
2     1        16          0.660875           24.210
      2        35          1.504375           23.265
      3        36          1.544375           23.310
      4         7          0.290375           24.107
3     1        15          0.660875           22.697
      2        38          1.504375           25.260
      3        41
      4         8
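
As a rough check on formula 9.7, the division can be sketched in Python. The function name `deseasonalise` is hypothetical, and the sketch assumes the series starts in Quarter 1 so that position `t` maps to quarter `t % 4`. Unlike the table, it applies the index to every quarter, including the first two and last two.

```python
# Quarterly sales from Example 9.3 (three years, assumed to start in Quarter 1).
sales = [14, 32, 33, 6, 16, 35, 36, 7, 15, 38, 41, 8]

# Adjusted seasonal indexes for Quarters 1 to 4 from Solution 9.4.
seasonal_index = [0.660875, 1.504375, 1.544375, 0.290375]


def deseasonalise(y, s):
    """Divide each observation by the index of its quarter (formula 9.7)."""
    return [value / s[t % len(s)] for t, value in enumerate(y)]


deseasonalised = deseasonalise(sales, seasonal_index)
for t, value in enumerate(deseasonalised):
    print(f"Year {t // 4 + 1} Quarter {t % 4 + 1}: {value:.3f}")
```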
 

Activity 9.4
Using the data of Activity 9.2, obtain the deseasonalised series.

9.5.2 Predicted values of the series


Once we have the adjusted seasonal indexes, we multiply them by the trend estimates to get
the predicted series values.

For the data in Example 9.3, the predicted sales are found by multiplying the trend estimate
by the corresponding adjusted seasonal index as shown in Table 9.4.

Table 9.4 Calculation of Predicted Sales


Year  Quarter  Sales (Yt)  Trend Estimate (T)  Adjusted Seasonal  Predicted
                                               Index (S)          Sales (T × S)
1     1        14
      2        32
      3        33          21.500              1.544375           33.204
      4         6          22.125              0.290375            6.425
2     1        16          22.875              0.660875           15.118
      2        35          23.375              1.504375           35.165
      3        36          23.375              1.544375           36.100
      4         7          23.625              0.290375            6.860
3     1        15          24.625              0.660875           16.274
      2        38          25.375              1.504375           38.174
      3        41
      4         8
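
The multiplication in Table 9.4 can be reproduced with a short Python sketch. The dictionary layout and the names `trend`, `seasonal_index` and `predicted` are illustrative choices only. The trend estimates are the centred 4-point moving averages from Table 9.2.

```python
# Centred 4-point MA (trend estimates), keyed by (year, quarter).
trend = {
    (1, 3): 21.500, (1, 4): 22.125,
    (2, 1): 22.875, (2, 2): 23.375, (2, 3): 23.375, (2, 4): 23.625,
    (3, 1): 24.625, (3, 2): 25.375,
}

# Adjusted seasonal indexes from Solution 9.4, keyed by quarter.
seasonal_index = {1: 0.660875, 2: 1.504375, 3: 1.544375, 4: 0.290375}

# Predicted value = trend estimate x seasonal index of the same quarter.
predicted = {
    (year, quarter): t * seasonal_index[quarter]
    for (year, quarter), t in trend.items()
}

for (year, quarter), value in predicted.items():
    print(f"Year {year} Quarter {quarter}: {value:.3f}")
```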

Activity 9.5
1. Find the predicted sales for the data of Activity 9.2.
2. A local church organisation recorded the following quarterly amounts (in 000s) of
tithes paid by its members for the period 2010 to 2012.

Year Quarter 1 Quarter 2 Quarter 3 Quarter 4


2010 100 120 132 110
2011 125 156 168 130
2012 141 164 180 200
a) Draw a time series plot of the data and comment on the trend shown.
b) Obtain a centred 4-point MA of the series and use it to calculate adjusted seasonal
indexes for the data.
c) Find the deseasonalised series of the data.
d) Forecast the quarterly amounts of tithes for the year 2013.
9.6 Summary

In this unit we discussed the four components of a time series, namely the trend, seasonal,
cyclical and irregular components. An important task in time series analysis is to
decompose a time series into these components using either an additive model or a
multiplicative model. We looked at two methods of isolating the trend: the Least
Squares method and the Moving Averages method. We saw how a moving average smooths
data to reveal trends. We also looked at how to isolate the seasonal component
using the ratio-to-moving average method. Finally, you learnt how to obtain predicted
series values using seasonal indexes.

Further Reading
Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.
Arora, P.N. and Malhan, P.K. (2010). Biostatistics. Mumbai: Global Media.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Govindarajulu, Z. (2007). Non-parametric Inference. River Edge, NJ: World Scientific.
Hutcheson, G.D. and Moutinho, L. (2008). Statistical Modelling for Management, London:
SAGE Publications Ltd.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.

Unit 10

Index Numbers

10.0 Introduction

An index number is a number that measures the relative change in a set of measurements over
time. Index numbers show changes over time by expressing the new value, Vn as a
percentage of some existing value, V0 called the base value.
\[ \text{Index number} = \frac{V_n}{V_0} \times 100 \]   [10.1]

The base value is the value of the variable at some reference point in the past called the base
period. By convention, the index number of the base period is always assumed to be 100.

In this unit, you will be taught how to construct both unweighted and weighted price and
quantity indices. We will also look at the problems that are encountered in the
construction of index numbers.

10.1 Learning Objectives

By the end of the unit, you should be able to:


• define index numbers
• calculate simple index numbers and weighted index numbers
• change the base from one period to another
• compare base weighting with current weighting
• state the purpose of index numbers
• use the CPI to adjust for inflation
• describe the challenges that are encountered in the construction of index numbers

10.2 Types of Index Numbers

There are three major categories of index numbers which are:


1. Price indices,
2. Quantity indices, and
3. Value indices.
10.2.1 Price indices
Price indices measure changes in price over time. Some examples of price indices are:
• Consumer Price Index (CPI) which measures the overall price change, from month to
month, of a representative selection of goods and services that are relevant to a typical
household. The CPI is used to calculate the rate of inflation.
• Producer Price Index (PPI) which measures the average change over time in selling
prices received by domestic producers for their output.

10.2.2 Quantity indices
Quantity indices measure changes in the quantity of a commodity produced or consumed over
time. Some examples of quantity indices are:
• Industrial index which gives a measure of change in industrial output now compared
to a past reference point
• Mining index which gives a measure of change in minerals production now compared
to a specified base period
10.2.3 Value indices
Value indices measure changes in total monetary worth of say exports (export index) or
imports (import index) of an economy between two time periods.

10.3 Simple Index Numbers


The word ‘simple’ implies the measurements are for a single variable. A simple index
number is the ratio of two values of a variable, expressed as a percentage. The most
commonly referred to simple indices are: the Simple Price Index and the Simple Quantity
Index.
10.3.1 Simple price index
The Simple Price Index (SPI), sometimes known as a price relative, measures changes in the
price of a commodity. It shows the effect of a price change on a single product. The current
price is expressed as a percentage of the price at base period, that is:
\[ SPI = \frac{P_n}{P_0} \times 100 \]   [10.2]

Example 10.1
During a back-to-school promotion sale, a pair of school shoes that sold for $22 before the
sale was now selling for $18. Calculate the Simple Price Index.

Solution 10.1
\[ SPI = \frac{P_n}{P_0} \times 100 = \frac{18}{22} \times 100 = 81.82\% \]

The price of the shoes went down by 18.18%.
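
The same calculation can be sketched in Python. The helper name `simple_index` is hypothetical; it simply expresses a current value as a percentage of a base value, so it applies equally to formula 10.1 and formula 10.2.

```python
def simple_index(current, base):
    """Express the current value as a percentage of the base value."""
    if base == 0:
        raise ValueError("base value must be non-zero")
    return current / base * 100


# School shoes: $22 before the sale (base), $18 during the sale (current).
spi = simple_index(18, 22)
print(f"SPI = {spi:.2f}%, change = {spi - 100:+.2f}%")
# prints: SPI = 81.82%, change = -18.18%
```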

Activity 10.1
1. Suppose the price of a 750ml bottle of cane spirit was $4.00 in 2010 and in 2012
the price increased to $4.50. Calculate the simple price index using 2010 as the
base year.
2. The average retail prices of a 2l bottle of cooking oil for the years 2010 to 2012 are
as follows:
Year Price ($)
2010 3.00
2011 3.50
2012 3.10
Determine a Simple Price Index for 2011 and 2012 using 2010 as the base.

10.3.2 Simple quantity index
The Simple Quantity Index (SQI), sometimes called the quantity relative, is used to show
changes in the quantity sold or produced for a single product. The index is calculated as
follows:
\[ SQI = \frac{Q_n}{Q_0} \times 100 \]   [10.3]

Example 10.2
The annual production of maize in Zimbabwe for the years 1997 and 2000 was 5 000 metric
tonnes and 400 metric tonnes respectively. Using 1997 as the base year, determine the change
in maize production.

Solution 10.2
\[ SQI = \frac{Q_n}{Q_0} \times 100 = \frac{400}{5\,000} \times 100 = 8\% \]

The annual production of maize fell by 92% between 1997 and the year 2000.

Activity 10.2
The following data give the prices and quantities of two commodities sold in 2011
and 2012:

Product   2011                2012
          Price   Quantity    Price   Quantity
X         30      3 000       50      1 500
Y         15      400         20      250

Using 2011 as the base period, calculate the


a) Simple Price Index for commodity X.
b) Simple Quantity Index for Commodity Y.

10.3.3 Index number series trends


A time series of index numbers for a given period gives a reflection of trends in the output or
price of commodities.

Consider the index of maize production for the period 1995 to 2006 below:

Year 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
Index 84 96 100 104 112 86 82 79 65 36 38 30

The index numbers show a steady increase in maize production from 1995 to 1999 followed
by a general decline from the year 2000 to 2006. The year 1997 is the base year because it
has index number 100. Production for the other years is then compared in percentage terms
with the production in 1997. For example, compared to the maize production in
1997, the production in 1999 was 12% higher, and by 2006 production had declined by
70%.

Example 10.3
The following figures represent the average annual cost (in dollars per m²) of low density
residential stands in Norton for the years 2004 to 2012.

Year 2004 2005 2006 2007 2008 2009 2010 2011 2012
Price 16 20 24 24 29 30 35 36 40

Construct simple index numbers for the prices using 2005 as the base period (2005 = 100).
Comment on the trend shown.

Solution 10.3
The year 2005 is the reference point, and the index number for 2005 is taken to be 100.

\[ \text{Index for 2004} = \frac{P_{2004}}{P_{2005}} \times 100 = \frac{16}{20} \times 100 = 80\% \]

\[ \text{Index for 2008} = \frac{P_{2008}}{P_{2005}} \times 100 = \frac{29}{20} \times 100 = 145\% \]

The index numbers of the remaining years are calculated in a similar fashion. The results are
summarised in Table 10.1.

Table 10.1 Price Index for Residential Stands


Year 2004 2005 2006 2007 2008 2009 2010 2011 2012
Price 16 20 24 24 29 30 35 36 40
Index 80 100 120 120 145 150 175 180 200

There was a steady increase in the price of residential stands from 2004 to 2012 with the
price in 2012 being double what it was in 2005.
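
A whole index series can be built in one pass, as the following Python sketch suggests. The function name `index_series` is hypothetical. It divides every value by the value in the chosen base period and multiplies by 100.

```python
# Average annual cost (dollars per square metre) from Example 10.3.
prices = {2004: 16, 2005: 20, 2006: 24, 2007: 24, 2008: 29,
          2009: 30, 2010: 35, 2011: 36, 2012: 40}


def index_series(values, base_period):
    """Express every value as a percentage of the base-period value."""
    base = values[base_period]
    return {period: value / base * 100 for period, value in values.items()}


index = index_series(prices, 2005)
for year, value in index.items():
    print(f"{year}: {value:.0f}")
```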

Activity 10.3
The average quarterly sales of a retail chain over three years are shown below:

Year            2009                  2010                  2011
Quarter         1    2    3    4      1    2    3    4      1    2    3    4
Sales ($000s)   98   102  80   120    75   70   64   86     43   54   50   84

Using the third quarter of 2009 as the base period, express the average sales as index
numbers. Comment on the trend in sales over the three year period.

10.3.4 Changing the base period
With the passage of time, the relevance of any reference point in the past decreases in terms
of comparison with values in the present. Therefore, it may be necessary to change the base
period by moving it closer to the present. Moreover, comparison between any two index
number series is only possible if the two series have the same base period, therefore if the
bases are not the same it is necessary to rebase one of them.

To change the base period of an index series, divide every index number in the series by
the index value of the proposed new base period and multiply by 100. The new base period
then has an index number of 100.

\[ \text{New index value} = \frac{\text{Old index value}}{\text{Index value of new base}} \times 100 \]   [10.4]

Example 10.4
Consider the index series of maize production for the period 1995 to 2006 referred to earlier
on.

Year 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
Index 84 96 100 104 112 86 82 79 65 36 38 30

Change the base period of the maize production index series from 1997 to 2003.

Solution 10.4
The year 2003 is now assigned an index number of 100. The old index numbers of the
remaining years are each divided by 65 and the result multiplied by 100 to obtain their
respective new index numbers. For example, the new index number for the year 1995 and
2006 are calculated as follows:
\[ \text{New index number for 1995} = \frac{84}{65} \times 100 = 129.23\% \]

\[ \text{New index number for 2006} = \frac{30}{65} \times 100 = 46.15\% \]

Year Old Index New Index


1995 84 129.23
1996 96 147.69
1997 100 153.85
1998 104 160.00
1999 112 172.31
2000 86 132.31
2001 82 126.15
2002 79 121.54
2003 65 100.00
2004 36 55.38
2005 38 58.46
2006 30 46.15
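
Formula 10.4 translates directly into a short Python sketch. The function name `rebase` is an illustrative choice. It divides each old index value by the old index value of the new base period and multiplies by 100.

```python
# Maize production index (1997 = 100) from Example 10.4.
old_index = {1995: 84, 1996: 96, 1997: 100, 1998: 104, 1999: 112, 2000: 86,
             2001: 82, 2002: 79, 2003: 65, 2004: 36, 2005: 38, 2006: 30}


def rebase(index, new_base):
    """Rebase an index series so that `new_base` becomes 100 (formula 10.4)."""
    divisor = index[new_base]
    return {period: value / divisor * 100 for period, value in index.items()}


new_index = rebase(old_index, 2003)
for year, value in new_index.items():
    print(f"{year}: {value:.2f}")
```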

Activity 10.4
1. In Example 10.4, change the base period of the maize production index from 1997
to 2000.

2. The following data are the monthly commodity price index values, from July 2009 to
July 2010, for a group of consumer goods:

140 138 124 98 100 152 148 143 150 146 155 158 162
a) What is the base period used here?
b) Describe the trend in the price of the commodities over this period.
c) Change the base period to March 2010.

10.4 Weighted Aggregate Indices


Weighted aggregate index numbers take into account the differences in relative influence
exerted by different products in a composite index. We have to decide how much weight to
attach to each of the products. Generally, price indices are constructed by weighting the
prices of items by the corresponding quantities bought, sold, produced or consumed in the
base year or current year. Similarly, quantity indices are constructed by weighting the
quantities of the items by the corresponding prices in the base period or the current period.

In base-period weighting, when comparing prices it is assumed that quantities are held
constant at base period levels whilst when comparing quantities, it is assumed that prices are
held constant at the base period level. Base weighting is less expensive and less time
consuming because there is no continuous calculation of weights. However, the relevance of
the weights may diminish with the passage of time so that rebasing may be necessary.

In current-period weighting, when comparing prices it is assumed that quantities are held
constant at current period levels, whilst when comparing quantities, it is assumed that prices
are held constant at current period level. Current weighting involves continuous calculation
of weights which is expensive and time consuming. This also makes valid comparisons
difficult or impossible due to continuously changing weights. Despite these drawbacks,
current weighting is preferred because it ensures that an item is rated in accordance with its
current importance, so that there is no risk of producing a grossly misleading index through
the use of outdated weights.

The base-period weighted indices are called Laspeyres indices whilst the current-period
weighted indices are called Paasche indices. The computational formulae are presented below:

Laspeyres Price Index:     \[ LPI = \frac{\sum P_n Q_0}{\sum P_0 Q_0} \times 100 \]   [10.7]

Laspeyres Quantity Index:  \[ LQI = \frac{\sum Q_n P_0}{\sum Q_0 P_0} \times 100 \]   [10.8]

Paasche Price Index:       \[ PPI = \frac{\sum P_n Q_n}{\sum P_0 Q_n} \times 100 \]   [10.9]

Paasche Quantity Index:    \[ PQI = \frac{\sum Q_n P_n}{\sum Q_0 P_n} \times 100 \]   [10.10]

A related index number is Fisher’s index, which is the geometric mean of the Laspeyres
and Paasche index numbers.

Fisher Price Index:     \[ FPI = \sqrt{LPI \times PPI} \]   [10.11]

Fisher Quantity Index:  \[ FQI = \sqrt{LQI \times PQI} \]   [10.12]

Example 10.5
The following data give the prices and quantities of three types of food stuff bought by a
private boarding school in 2011 and 2012

Food Type   2011                      2012
            Unit Price   Quantity     Unit Price   Quantity
A           35           500          40           750
B           37           310          45           600
C           40           250          50           290

Calculate:
• Laspeyres and Paasche Price Indices for 2012, with 2011 as the base year and interpret
your results.
• Fisher Price Index and interpret the result.

Solution 10.5
Food Type PnQ0 P0Q0 PnQn P0Qn
A 20 000 17 500 30 000 26 250
B 13 950 11 470 27 000 22 200
C 12 500 10 000 14 500 11 600
Sum 46 450 38 970 71 500 60 050

a) \[ LPI = \frac{\sum P_n Q_0}{\sum P_0 Q_0} \times 100 = \frac{46\,450}{38\,970} \times 100 = 119.19\% \]

Using base-period (2011) quantities, prices increased by 19.19% from 2011 to 2012.

\[ PPI = \frac{\sum P_n Q_n}{\sum P_0 Q_n} \times 100 = \frac{71\,500}{60\,050} \times 100 = 119.07\% \]

Using current-period (2012) quantities, prices increased by 19.07% from 2011 to 2012.

b) \[ FPI = \sqrt{LPI \times PPI} = \sqrt{119.19 \times 119.07} = 119.13\% \]

Combining base-period and current-period weighting, prices increased by 19.13% from 2011
to 2012.
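
The weighted price indices for this example can be checked with a small Python sketch. The variable names and the `weighted_total` helper are illustrative only. The prices and quantities are those given in Example 10.5, listed in the order A, B, C.

```python
from math import sqrt

# Example 10.5 data for food types A, B, C (2011 = base, 2012 = current).
p0 = [35, 37, 40]      # base-period unit prices
q0 = [500, 310, 250]   # base-period quantities
pn = [40, 45, 50]      # current-period unit prices
qn = [750, 600, 290]   # current-period quantities


def weighted_total(prices, quantities):
    """Sum of price x quantity over all items."""
    return sum(p * q for p, q in zip(prices, quantities))


# Laspeyres: base-period quantities as weights (formula 10.7).
lpi = weighted_total(pn, q0) / weighted_total(p0, q0) * 100
# Paasche: current-period quantities as weights (formula 10.9).
ppi = weighted_total(pn, qn) / weighted_total(p0, qn) * 100
# Fisher: geometric mean of the two (formula 10.11).
fpi = sqrt(lpi * ppi)

print(f"LPI = {lpi:.2f}, PPI = {ppi:.2f}, FPI = {fpi:.2f}")
```

Swapping the roles of prices and quantities in the calls to `weighted_total` gives the corresponding quantity indices.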

Activity 10.5
Using the data provided in Example 10.5, calculate:
a) Laspeyres and Paasche Quantity Indices for 2012, with 2011 as the base year and
interpret your results.
b) Fisher Quantity Index and interpret the result.

10.5 Use of Index Numbers as Deflators

The value of money changes as time goes on due to inflation. A dollar today is not worth the
same as a dollar 10 or so years ago. The use of index numbers as deflators allows us to
compare amounts of money across time periods.

The Consumer Price Index is an overall measure of relative changes in the prices of many
goods and thus reflects changes in the value of money. The CPI is used as a deflator to
convert nominal amounts of money to what are called real amounts of money. Real amounts
of money are comparable through time because the effect of changes in the value of money
due to inflation has been removed.

The conversion procedure involves identifying a constant point in time, the base period. By
dividing Y dollars in year i by the CPI value for year i and multiplying by 100, we
convert the Y nominal (year i) dollars to constant (base year) dollars.

The all items CPI for the years 2008 to 2011 as provided by ZIMSTAT are shown in Table
10.2.
Table 10.2 Consumer Price Index (Dec 2008 = 100)
Year CPI
2008 100
2009 92.1
2010 94.9
2011 98.2
Source: www.zimstat.co.zw/index (08-01-2013)

We will now look at an example to illustrate the use of the CPI as a deflator.

Example 10.6
Suppose that during the years 2009 to 2011, the entry salary of a trained teacher was as
follows:
Year Salary ($)
2009 150
2010 250
2011 275

Use the CPI figures provided in Table 10.2 to transform the salaries to 2008 dollars.

Solution 10.6
Year Salary ($) CPI
2009 150 92.1
2010 250 94.9
2011 275 98.2

If we divide the 2009 salary of $150 by the CPI of that year and multiply the result by 100,
we get the equivalent salary in 2008 dollars, that is, the salary in real terms.

\[ \frac{150}{92.1} \times 100 = 162.87 \]

In real terms, the entry salary for a trained teacher in 2009 was $162.87.

We repeat the same procedure for 2010 and 2011.

For 2010: \[ \frac{250}{94.9} \times 100 = 263.44 \]

For 2011: \[ \frac{275}{98.2} \times 100 = 280.04 \]

Thus, the entry salary for 2010 and 2011 was $263.44 and $280.04 respectively in constant
2008 dollars. In real terms, the salary increased by $117.17 (280.04 − 162.87) from 2009 to
2011, which shows that it more than kept up with inflation.
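
The deflation step can be sketched in Python as follows. The function name `to_real` is hypothetical. The CPI values are those of Table 10.2 and the salaries are taken as tabulated in Solution 10.6.

```python
# CPI values (Dec 2008 = 100) from Table 10.2.
cpi = {2009: 92.1, 2010: 94.9, 2011: 98.2}

# Nominal entry salaries as tabulated in Solution 10.6.
salary = {2009: 150, 2010: 250, 2011: 275}


def to_real(nominal, cpi_value):
    """Convert a nominal amount to constant base-period dollars."""
    return nominal / cpi_value * 100


real = {year: to_real(salary[year], cpi[year]) for year in salary}
for year, value in real.items():
    print(f"{year}: ${value:.2f}")

# Change in the real salary between 2009 and 2011.
print(f"Real increase: ${real[2011] - real[2009]:.2f}")
```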

Activity 10.6

The data that follows shows the average price of a 2 litre bottle of cooking oil over
the past three years.
Year Price ($)
2009 2.75
2010 3.10
2011 3.50

Use the CPI figures in Table 10.2 to adjust the price to constant 2008 dollars.

10.6 Challenges in Constructing Index Numbers

The following are some of the problems associated with the construction of index numbers:
1. Unavailability of data – data is expensive to collect and it is not always practicable to
determine the quantities involved (sold).
2. Choice of base year – the base year has to be a reasonably normal year characterised
by stability in business activity and such years are difficult to come by.
3. Selection of items – there may be disagreements on the items to include in the
consumer basket. The selection should be such that movements in prices of those
items chosen will be representative of the movements of prices of all items considered
relevant.
4. Choice of weights – it is difficult to select typical quantities and prices which measure
relative importance in the construction of composite indices. The weights may
become outdated with time giving rise to misleading indices.
5. Comparability of index series – comparison is only possible if two index series have
the same base period.

10.7 Summary
In this unit, we looked at the construction of simple index numbers and weighted index
numbers. Index numbers are used to measure the relative change in a set of measurements
over time. A base period is chosen to serve as a reference point. The base period is given
index number 100.

Simple indices show changes pertaining to a single item while aggregate indices are for a
group of items. Because items do not contribute the same to the envisaged change, the items
are given weights to reflect their relative importance. The weights may be current weights or
base weights. Whilst base weights are less expensive to use, they may be outdated thereby
giving rise to misleading indices. As time goes on, it may be necessary to change the base
period to keep up with current trends. We saw how the base can be changed from one period
to another.

We looked at how the CPI is used as a deflator to adjust for inflation. Finally, we
discussed the problems that are associated with the construction of index numbers. These
include the unavailability of data, the choice of base year, choice of weights and selection of
items to make up the consumer basket.

Further Reading

Aczel, A.D. and Sounderpandian, J. (2005). Complete Business Statistics. India: Tata
McGraw-Hill.

Arora, P.N. and Malhan, P.K. (2010). Biostatistics. Mumbai: Global Media.
Buglear, J. (2005). Quantitative Methods for Business. London: Elsevier Butterworth
Heinemann.
Govindarajulu, Z. (2007). Non-parametric Inference. River Edge, NJ: World Scientific.
Hutcheson, G.D. and Moutinho, L. (2008). Statistical Modelling for Management, London:
SAGE Publications Ltd.
Kazmier, L.J. (2003). Schaum’s Easy Outline: Business Statistics. Blacklick: McGraw-Hill
Trade.
Kemp, S.M. and Kemp, S. (2004). Business Statistics Demystified. Blacklick: McGraw-Hill
Professional Publishing.

Appendices

Statistical Tables
