
Chapter one

1. Introduction

1.1 Definition and Classification of Statistics

What is statistics?

Statistics is the science of conducting studies to collect, organize, summarize, analyze, and draw
conclusions from data. Equivalently, statistics is the methodology for collecting, analyzing, and
interpreting information and drawing conclusions from it.

Put another way, statistics is the methodology that scientists and mathematicians have
developed for interpreting and drawing conclusions from collected data.

Terminology
 Population - the entire group of objects that we are interested in.
 Sample - the subset that we actually measure.
 Characteristic - the particular measurement we are interested in.
 Parameter – a value which summarizes the characteristic of the population.
 Statistic – the estimate of the parameter that we make from the sample.

 A variable is a characteristic or attribute that can assume different values. Variables
whose values are determined by chance are called random variables.
 Data are the values (measurements or observations) that the variables can assume.
Variables can be classified as qualitative or quantitative.
 Qualitative variables are variables that can be placed into distinct categories, according
to some characteristic or attribute. For example, if subjects are classified according to
gender (male or female), then the variable gender is qualitative. Other examples of
qualitative variables are religious preference and geographic locations.
 Quantitative variables are numerical and can be ordered or ranked. For example, the
variable age is numerical, and people can be ranked in order according to the value of
their ages. Other examples of quantitative variables are heights, weights, and body
temperatures. Quantitative variables can be further classified into two groups: discrete
and continuous.
 Discrete variables assume values that can be counted. Discrete variables can be
assigned values such as 0, 1, 2, 3 and are said to be countable. Examples of discrete
variables are the number of children in a family, the number of students in a classroom,
and the number of calls received by a switchboard operator each day for a month.

 Continuous variables can assume an infinite number of values in an interval between any
two specific values. They are obtained by measuring and often include fractions and
decimals. Temperature, for example, is a continuous variable, since it can assume an
infinite number of values between any two given temperatures.
There are two major types of statistics depending on how data are used.

1. Descriptive statistics
2. Inferential statistics

The branch of statistics devoted to the summarization and description of data is called descriptive
statistics and the branch of statistics concerned with using sample data to make an inference
about a population of data is called inferential statistics.
Descriptive statistics consists of the collection, organization, summarization, and presentation of
data.

Descriptive statistics includes the construction of graphs, charts, and tables, and the calculation
of various descriptive measures such as averages, measures of variation, and percentiles.
Descriptive statistics involves methods of organizing, picturing and summarizing information
from data.

1. Descriptive Statistics
Descriptive statistics is the type of statistics that probably springs to most people's minds when
they hear the word “statistics.” Here the goal is to describe.
Numerical measures are used to tell about features of a set of data. There are a number of items
that belong in this portion of statistics, such as:
 The average, or measure of center, consisting of the mean, median, mode or midrange.
 The spread of a data set, which can be measured with the range or standard deviation.
 Overall descriptions of data such as the five number summary.
 Other measurements such as skewness and kurtosis.
 The exploration of relationships and correlation between paired data.
 The presentation of statistical results in graphical form.
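
As a small illustration of the numerical measures listed above, the following Python sketch computes several of them with the standard-library statistics module. The data values are made up for illustration and do not come from these notes.

```python
import statistics

# Hypothetical data set (illustrative values only, not from the notes)
data = [2, 4, 4, 5, 7, 9, 11]

# Measures of center
mean = statistics.mean(data)            # arithmetic average
median = statistics.median(data)        # middle value of the sorted data
mode = statistics.mode(data)            # most frequent value
midrange = (min(data) + max(data)) / 2  # halfway between lowest and highest

# Measures of spread
data_range = max(data) - min(data)      # highest value minus lowest value
std_dev = statistics.stdev(data)        # sample standard deviation

print("mean:", mean, "median:", median, "mode:", mode, "midrange:", midrange)
print("range:", data_range, "standard deviation:", round(std_dev, 3))
```

Note that statistics.stdev computes the sample standard deviation; statistics.pstdev would be the choice if the data were the whole population.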

2. Inferential Statistics

For the area of inferential statistics we begin by differentiating between two groups. The
population is the entire collection of individuals that we are interested in studying. It is typically
impossible or infeasible to examine each member of the population individually. So we choose a
representative subset of the population, called a sample.

Inferential statistics studies a statistical sample, and from this analysis is able to say something
about the population from which the sample came.

Inferential statistics involves methods of using information from a sample to draw conclusions
about the population. Inferential statistics includes methods like point estimation, interval
estimation and hypothesis testing which are all based on probability theory.
Descriptive and inferential statistics are interrelated. It is almost always necessary to use methods
of descriptive statistics to organize and summarize the information obtained from a sample
before methods of inferential statistics can be used to make more thorough analysis of the subject
under investigation. Furthermore, the preliminary descriptive analysis of a sample often reveals
features that lead to the choice of the appropriate inferential method to be later used.

1.2 Steps of statistical investigation


The main steps utilized in a statistical investigation include four components:

1. Clarifying the problem and formulating questions or hypotheses that can be answered with the
data,
2. Designing or creating an appropriate experiment that can collect the data required,
3. Finding and using the appropriate techniques needed to accurately analyze the collected data,
and
4. Interpreting the collected data and results so as to answer the questions and hypotheses that
were proposed in the first place.
The main steps in a statistical investigation form a cyclical process: the answers obtained in
the last step often raise new questions, which return the investigator to the first step. Following
the steps in order helps ensure that all procedures are completed logically. The aim of a
statistical investigation is to answer questions about the world by using data.

1.3. Importance and Limitation of Statistics


1.3.1 Importance of Statistics
There are many functions of statistics. Let us consider the following five important functions.

a). Condensation:
Generally speaking, to condense means to reduce or to lessen. Condensation aims to make a
huge mass of data understandable by summarizing it in only a few values. Statistical measures
thus reduce the complexity of the data and make any large body of data easier to understand.

b). Comparison:
Classification and tabulation are the two methods that are used to condense data. They help
us to compare data collected from different sources. Grand totals, measures of central tendency,
measures of dispersion, graphs and diagrams, the coefficient of correlation, etc. provide ample
scope for comparison.
As statistics is an aggregate of facts and figures, comparison is always possible, and in fact
comparison helps us to understand the data better.

c). Forecasting:
To forecast means to predict or to estimate beforehand. In business, forecasting plays a
dominant role in connection with production, sales, profits, etc. Time series analysis and
regression analysis play an important role in forecasting.

d). Estimation:
One of the main objectives of statistics is to draw inferences about a population from the analysis
of a sample drawn from that population.
In estimation theory, we estimate the unknown value of a population parameter based on the
sample observations.
Suppose we are given the heights of a sample of 100 students in a school. Based upon the
heights of these 100 students, it is possible to estimate the average height of all students in that
school.
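
The height example can be sketched in Python as follows. The population here is simulated, and its size, mean, and spread are assumptions chosen only for illustration. In practice the population mean (the parameter) is unknown and only the sample mean (the statistic) can be computed.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the sketch is reproducible

# Hypothetical population: simulated heights (cm) of 2000 students
population = [random.gauss(160, 8) for _ in range(2000)]

# Measure only a random sample of 100 students
sample = random.sample(population, 100)

parameter = statistics.mean(population)  # population mean (normally unknown)
statistic = statistics.mean(sample)      # sample mean, used as the estimate

print(f"population mean: {parameter:.1f} cm")
print(f"sample estimate: {statistic:.1f} cm")
```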

e). Tests of Hypothesis:


A statistical hypothesis is a statement about the probability distribution that characterises a
population. It is tested on the basis of the information available from the sample observations.
Statistical methods are extremely useful in the formulation and testing of hypotheses.

1.3.2 Limitation of Statistics


Statistics with all its wide application in every sphere of human activity has its own limitations.
Some of them are given below.

1. Statistics is not suitable for the study of qualitative phenomena:


Since statistics is basically a science that deals with sets of numerical data, it is applicable only
to those subjects of enquiry that can be expressed in terms of quantitative
measurements. Qualitative phenomena such as honesty, poverty, beauty, and
intelligence cannot be expressed numerically, so statistical analysis cannot be directly
applied to them.
Nevertheless, statistical techniques may be applied indirectly by first reducing the qualitative
expressions to quantitative terms. For example, the intelligence of a group of students
can be studied on the basis of their marks in a particular examination.

2. Statistics does not study individuals:


Statistics does not give any specific importance to individual items; it deals with
aggregates of objects. Individual items, taken individually, do not constitute
statistical data and do not serve any purpose in a statistical enquiry.

3. Statistical laws are not exact:

It is well known that the mathematical and physical sciences are exact. Statistical laws, however,
are not exact; they are only approximations. Statistical conclusions are not universally
true. They are true only on average.
4. Statistics is liable to be misused:
Statistics must be used only by experts; otherwise, statistical methods are the most dangerous
tools in the hands of the inexpert. The use of statistical tools by inexperienced and untrained
persons might lead to wrong conclusions. Statistics can also easily be misused by quoting wrong
figures.
5. Statistics is only one of the methods of studying a problem:
Statistical methods do not provide a complete solution to a problem, because problems must be
studied taking the background of a country's culture, philosophy, or religion into consideration.
Thus the statistical study should be supplemented by other evidence.

Chapter Two
2. Collection of data

2.1. Types of Statistical Data

Qualitative sounds like 'quality'. Words are used to describe quality (Example 1 and 2 of
data). Hence, qualitative data come from variables that use words to describe categories.
Quantitative sounds like 'quantity'. Numbers are used to describe quantity (Example 3 and 4
of data). Quantitative data can be described in terms of 'how many...' and 'how much...'
where the response is some quantity. 'How many...' indicates that counting is taking place.
The data produced is described as quantitative discrete data, visually represented by dots.

2.2. Sources of Data

There are two sources of data: primary sources and secondary sources.
A primary source is a document created at the time of your research subject, about your
research subject. These documents are directly connected with the events or people being
researched.
A secondary source is a document created at a later time than the event being researched, by
someone who did not experience the said event. These documents have no direct connection
with the events or people being researched.

Below is a chart listing examples of primary and secondary sources.

2.3. Population and Sample

Population and sample are two basic concepts of statistics. A population can be characterized as the
set of individual persons or objects in which an investigator is primarily interested in his or
her research problem. Sometimes the desired measurements are obtained for all individuals in the
population, but often only a subset of the individuals in that population is observed; such a subset
constitutes a sample. This gives us the following definitions of population and
sample.

Population is the collection of all individuals or items under consideration in a statistical study.

For example:

 The students of the University of Hawassa


 The books in a library.

Sample is that part of the population from which information is collected.

A parameter is an unknown numerical summary of the population. A statistic is a known
numerical summary of the sample which can be used to make inferences about parameters.

Individuals are the people or objects included in the study. A variable is the characteristic of the
individual to be measured or observed.

2.4. The Need for Sampling

What is the purpose of sampling?

To draw conclusions about populations from samples, we must use inferential statistics, which
enables us to determine a population's characteristics by directly observing only a portion (or
sample) of the population. We obtain a sample rather than a complete enumeration (a census) of
the population for many reasons. Obviously, it is cheaper to observe a part rather than the whole,
but we should prepare ourselves to cope with the dangers of using samples.

There would be no need for statistical theory if a census rather than a sample was always used to
obtain information about populations. But a census may not be practical and is almost never
economical.

There are six main reasons for sampling instead of doing a census. These are:
-Economy
-Timeliness
-The large size of many populations
-Inaccessibility of some of the population
-Destructiveness of the observation
-Accuracy

Economy: Taking a sample requires fewer resources than a census.
Timeliness: A sample may provide you with needed information quickly.
The large size of many populations: Many populations about which inferences must be made are
quite large. In such a case, for example when information is required from all high school
seniors, selecting a representative sample may be the only way to get it.
Inaccessibility of some of the population: Some populations are so difficult to get access to that
only a sample can be used, for example people in prison, crashed aeroplanes in the deep sea, or
presidents. The inaccessibility may be economic or time related. A particular study population,
such as the population of planets, may be so costly to reach that only a sample can be used. In
other cases, the events of interest may take so long to occur that only sample information can be
relied on.
Destructiveness of the observation: Sometimes the very act of observing the desired
characteristic of a unit of the population destroys it for the intended use. Good examples of this
occur in quality control. For example, to test the quality of a fuse, to determine whether it is
defective, it must be destroyed. To obtain a census of the quality of a lorry load of fuses, you
have to destroy all of them. This is contrary to the purpose served by quality-control testing. In
this case, only a sample should be used to assess the quality of the fuses.
Accuracy: A sample may be more accurate than a census. A sloppily conducted
census can provide less reliable information than a carefully obtained sample.

Chapter Three

3. Sampling Techniques (3 hours)

3.1. Probability Sampling


Researchers use two major sampling techniques: probability sampling and non probability
sampling. With probability sampling, a researcher can specify the probability of an element's
(participant's) being included in the sample. With non probability sampling, there is no way of
estimating the probability of an element's being included in a sample. If the researcher's interest
is in generalizing the findings derived from the sample to the general population, then probability
sampling is far more useful and precise. Unfortunately, it is also much more difficult and
expensive than non probability sampling.
Probability sampling is also referred to as random sampling or representative sampling. The
word random describes the procedure used to select elements (participants, cars, test items) from
a population. When random sampling is used, each element in the population has an equal
chance of being selected (simple random sampling) or a known probability of being selected
(stratified random sampling). The sample is referred to as representative because the
characteristics of a properly drawn sample represent the parent population in all ways.

Probability sampling uses a random process to guarantee that each unit of the population has a
specified chance of selection. There are four major types of probability sampling: simple random,
systematic, stratified, and cluster sampling.

1. Simple Random Sampling

 every subject has an equal probability of being selected for the study.
 the recommended way is to use a table of random numbers or a computer-generated list of
random numbers
 it is the process of enumerating every unit of the accessible population and then selecting the
sample at random. What is needed:
a. accurate listing of the population
b. mechanism to find and enroll those who are chosen
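
A minimal Python sketch of simple random sampling, assuming the accurate listing of the population is available as a list of unit IDs. The frame size and sample size below are illustrative, and the helper name simple_random_sample is hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit is equally likely."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    return rng.sample(frame, n)

# Hypothetical sampling frame: units numbered 1 to 3400
frame = list(range(1, 3401))

chosen = simple_random_sample(frame, 200, seed=1)
print(sorted(chosen)[:10])  # first few selected unit numbers
```

The computer-generated random numbers here play the role of the table of random numbers mentioned above.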

2. Systematic Sampling
 involves selecting by a periodic process; starting point is chosen at random

Example: Take a sample of 200 units from a population of 3400.


Procedure:
 Number all units 1 to 3400; divide the population size by the number to be sampled (3400/200
= 17).
 Select a number at random between 1 and 17 to be the starting point, k.
 Then select every 17th subject thereafter (k, k + 17, k + 34, and so on).

NOTE: Systematic sampling should not be used when a cyclic repetition is inherent in the
sampling frame. For example, it is not appropriate for selecting months of the year in a study of
the frequency of different types of accidents, because some accidents occur most often at certain
times of the year.
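
The systematic procedure above can be sketched in Python as follows. The function name and the seed are assumptions made for illustration; the sketch also assumes that the population size is an exact multiple of the sample size, as in the 3400/200 example.

```python
import random

def systematic_sample(population_size, sample_size, seed=None):
    """Select unit numbers by a periodic process with a random starting point."""
    interval = population_size // sample_size  # 3400 // 200 = 17
    rng = random.Random(seed)
    k = rng.randint(1, interval)               # random start between 1 and 17
    # Take the starting unit, then every 17th unit thereafter
    return [k + i * interval for i in range(sample_size)]

units = systematic_sample(3400, 200, seed=42)
print(units[:5], "...", units[-1])
```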

3. Stratified Random sampling

 involves dividing the population into subgroups according to characteristics and taking a
random sample from each of these “strata”
 characteristics used to stratify should be related to the measurement of interest
 There should be homogeneity within each stratum and heterogeneity between strata
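
A minimal sketch of stratified random sampling in Python. It assumes proportional allocation (the same sampling fraction in every stratum), which is only one possible design and is not prescribed by these notes. The student records and the function name are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Take a simple random sample of the same fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:                      # divide the population into strata
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():         # sample separately within each stratum
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: (student_id, year) with 600 first-year
# and 400 second-year students
students = [(i, "year1" if i <= 600 else "year2") for i in range(1, 1001)]

picked = stratified_sample(students, lambda s: s[1], fraction=0.1, seed=7)
print(len(picked))
```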

4. Cluster Sampling

 process of taking a random sample of natural groupings of individuals in the population;
very useful when the population is widely dispersed and it is impractical or costly to list
and sample from all of its elements

 clusters are commonly based on geographic areas or districts.

 There should be homogeneity between clusters and heterogeneity within the clusters.
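
A minimal sketch of cluster sampling in Python, assuming a one-stage design in which every unit of each selected cluster is included. The district and household names are hypothetical, as is the function name.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep all units inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # sample cluster names
    return {name: clusters[name] for name in chosen}

# Hypothetical population: 10 districts with 5 households each
clusters = {
    f"district_{d}": [f"d{d}_household_{h}" for h in range(1, 6)]
    for d in range(1, 11)
}

selected = cluster_sample(clusters, 3, seed=3)
for name, households in selected.items():
    print(name, len(households))
```

Only the list of clusters is needed to draw the sample, which is why this method suits widely dispersed populations that are costly to list element by element.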

3.2. Non Probability Sampling


Non probability sampling is a sampling method in which the probability that a subject is selected
is unknown. There are different types of non probability sampling methods.

1. Consecutive Sampling
It involves taking every patient who meets the selection criteria over a specified time interval or
number of patients; it amounts to taking the complete accessible population over the duration of
the study.

2. Convenience Sampling
-It is a process of taking those members of the accessible population who are easily available at
the time of study

3. Judgemental Sampling
-It involves handpicking from the accessible population those individuals judged most
appropriate for the study

Chapter Four

4. Classification and Presentation of Data (5 hours)

4.1. Types of Classification


There are four types of classification, viz.,
(i) qualitative;
(ii) quantitative;
(iii) temporal and
(iv) spatial.

(i) Qualitative classification: It is done according to attributes or non-measurable
characteristics, like social status, sex, nationality, occupation, etc. For example, the population of
the whole country can be classified by marital status into four categories: married, unmarried,
widowed and divorced. When only one attribute, e.g., sex, is used for classification, it is called
simple classification. When more than one attribute, e.g., deafness, sex and religion, is used for
classification, it is called manifold classification.

(ii) Quantitative classification: It is done according to numerical size like weights in kg or
heights in cm. Here we classify the data by assigning arbitrary limits known as class-limits. The
quantitative phenomenon under study is called a variable. For example, the population of the
whole country may be classified according to different variables like age, income, wage, price,
etc. Hence this classification is often called 'classification by variables'.

(a) Variable: A variable in statistics means any measurable characteristic or quantity which can
assume a range of numerical values within certain limits, e.g., income, height, age, weight, wage,
price, etc. A variable can be classified as either discrete or continuous.
(1) Discrete variable: A variable which can take only exact values and not any fractional
values is called a 'discrete' variable. Number of workmen in a factory, members of a family,
students in a class, number of births in a certain year, number of telephone calls in a month, etc.,
are examples of discrete variables.
(2) Continuous variable: A variable which can take any numerical value (integral/fractional)
within a certain range is called a 'continuous' variable. Height, weight, rainfall, time,
temperature, etc., are examples of continuous variables. Age of students in a school is a
continuous variable as it can be measured to the nearest fraction of time, i.e., years, months,
days, etc.

(iii) Temporal classification: It is done according to time, e.g., index numbers arranged over a
period of time, population of a country for several decades, exports and imports of India for
different five year plans, etc.

(iv) Spatial classification: It is done with respect to space or places, e.g., production of cereals
in quintals in various states, population of a country according to states, etc.

4.2. Tabular method of data presentation

4.2.1 Frequency distribution
A frequency distribution is the organization of raw data in table form, using classes and
frequencies.
Two types of frequency distributions that are most often used are the categorical frequency
distribution and the grouped frequency distribution. The procedures for constructing these
distributions are shown now.

A. Categorical Frequency Distributions


The categorical frequency distribution is used for data that can be placed in specific categories. For
example, data such as political affiliation, religious affiliation, or major field of study would use
categorical frequency distributions.
Example: Twenty-five army inductees were given a blood test to determine their blood type. The
data set is

Construct a frequency distribution for the data.


Solution:
Since the data are categorical, discrete classes can be used. There are four blood types: A, B, O, and AB.
These types will be used as the classes for the distribution. The procedure for constructing a frequency
distribution for categorical data is given next.
Step 1: Make a table as shown.

Step 2: Tally the data and place the results in column B


Step 3: Count the tallies and place the results in column C.
Step 4: Find the percentage of values in each class by using the formula
% = (f/n)*100
where f = frequency of the class and n = total number of values.

Percentages are not normally part of a frequency distribution, but they can be added since
they are used in certain types of graphs such as pie graphs. Also, the decimal equivalent of a
percent is called a relative frequency.

Step 5: Find the totals for columns C (frequency) and D (percent). The completed
table is shown
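
Steps 2 to 5 can be sketched in Python with collections.Counter. The observations below are made up for illustration and are not the inductees' data from the example.

```python
from collections import Counter

# Hypothetical blood-type observations (not the data set from the example)
observations = ["A"] * 6 + ["B"] * 4 + ["O"] * 8 + ["AB"] * 2

frequency = Counter(observations)   # steps 2 and 3: tally and count
n = sum(frequency.values())         # total number of values

# Step 4: percentage of values in each class, (f/n)*100
percent = {cls: 100 * f / n for cls, f in frequency.items()}

# Step 5: print the table with totals
print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for cls in ["A", "B", "O", "AB"]:
    print(f"{cls:<6}{frequency[cls]:>10}{percent[cls]:>9.1f}")
print(f"{'Total':<6}{n:>10}{sum(percent.values()):>9.1f}")
```

Dividing each percentage by 100 gives the relative frequency of the class.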

B. Grouped Frequency Distributions

When the range of the data is large, the data must be grouped into classes that are more than
one unit in width, in what is called a grouped frequency distribution. For example, a
distribution of the number of hours that boat batteries lasted is the following.

The procedure for constructing the preceding frequency distribution is given in the next
example; however, several things should be noted. In this distribution, the values 24 and 30 of
the first class are called class limits. The lower class limit is 24; it represents the smallest data
value that can be included in the class. The upper class limit is 30; it represents the largest
data value that can be included in the class. The numbers in the second column are called
class boundaries. These numbers are used to separate the classes so that there are no gaps in
the frequency distribution. The gaps are due to the limits; for example, there is a gap between
30 and 31.

Finally, the class width for a class in a frequency distribution is found by subtracting the lower
(or upper) class limit of one class from the lower (or upper) class limit of the next class. For
example, the class width in the preceding distribution on the duration of boat batteries is 7, found
from 31 - 24 = 7.
The class width can also be found by subtracting the lower boundary from the upper boundary
for any given class. In this case, 30.5 - 23.5 = 7.
The researcher must decide how many classes to use and the width of each class. To construct a
frequency distribution, follow these rules:

1. There should be between 5 and 20 classes. Although there is no hard-and-fast rule for the
number of classes contained in a frequency distribution, it is of the utmost importance to have
enough classes to present a clear description of the collected data.
2. It is preferable but not absolutely necessary that the class width be an odd number. This
ensures that the midpoint of each class has the same place value as the data. The class midpoint
Xm is obtained by adding the lower and upper boundaries and dividing by 2, or adding the lower
and upper limits and dividing by 2:
Xm = (lower boundary + upper boundary)/2 or Xm = (lower limit + upper limit)/2

For example, the midpoint of the first class in the example with boat batteries is
Xm = (24 + 30)/2 = 27 or Xm = (23.5 + 30.5)/2 = 27

The midpoint is the numeric location of the center of the class. Midpoints are necessary for
graphing. If the class width is an even number, the midpoint is in tenths. For example, if the class
width is 6 and the boundaries are 5.5 and 11.5, the midpoint is
Xm = (5.5 + 11.5)/2 = 17/2 = 8.5

3. The classes must be mutually exclusive. Mutually exclusive classes have nonoverlapping
class limits so that data cannot be placed into two classes.

4. The classes must be continuous. Even if there are no values in a class, the class must be
included in the frequency distribution. There should be no gaps in a frequency distribution. The
only exception occurs when the class with a zero frequency is the first or last class. A class
with a zero frequency at either end can be omitted without affecting the distribution.

5. The classes must be exhaustive. There should be enough classes to accommodate all the data.

6. The classes must be equal in width. This avoids a distorted view of the data. One exception
occurs when a distribution has a class that is open-ended. That is, the class has no specific
beginning value or no specific ending value. A frequency distribution with an open-ended class is
called an open-ended distribution. Here are examples of distributions with open-ended classes.

Example: These data represent the record high temperatures in degrees Fahrenheit for each of the
50 states. Construct a grouped frequency distribution for the data using 7 classes.

Solution:
The procedure for constructing a grouped frequency distribution for numerical data follows.

Step 1: Determine the classes.

Find the highest value and lowest value: H = 134 and L = 100.
Find the range: R = highest value - lowest value
R = 134 - 100 = 34

Select the number of classes desired (usually between 5 and 20). In this case, 7 is arbitrarily
chosen.
Find the class width by dividing the range by the number of classes.
Width = R/number of classes = 34/7 ≈ 4.9

Round the answer up to the nearest whole number if there is a remainder: 4.9 ≈ 5. (Rounding up
is different from rounding off. A number is rounded up if there is any decimal remainder when
dividing. For example, 85/6 = 14.167 and is rounded up to 15.)

Select a starting point for the lowest class limit. This can be the smallest data value or any
convenient number less than the smallest data value. In this case, 100 is used. Add the width to
the lowest score taken as the starting point to get the lower limit of the next class. Keep adding
until there are 7 classes, as shown, 100, 105, 110, etc.
Subtract one unit from the lower limit of the second class to get the upper limit of the first class.
Then add the width to each upper limit to get all the upper limits. 105 - 1 = 104
The first class is 100–104, the second class is 105–109, etc.
Find the class boundaries by subtracting 0.5 from each lower class limit and adding 0.5 to each
upper class limit: 99.5–104.5, 104.5–109.5, etc.
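
Step 1 can be sketched in Python as follows, using H = 134, L = 100, and 7 classes from the example. The function name and the unit parameter (the precision of the data, assumed to be 1 here) are illustrative choices, not part of the original procedure.

```python
import math

def build_classes(lowest, highest, n_classes, start=None, unit=1):
    """Return the class width and the limits, boundaries and midpoint of each class."""
    data_range = highest - lowest
    width = math.ceil(data_range / n_classes)   # round up: 34/7 -> 5
    start = lowest if start is None else start  # lowest class limit
    classes = []
    for i in range(n_classes):
        lower = start + i * width               # add the width for each new class
        upper = lower + width - unit            # one unit below the next lower limit
        classes.append({
            "limits": (lower, upper),
            "boundaries": (lower - unit / 2, upper + unit / 2),
            "midpoint": (lower + upper) / 2,
        })
    return width, classes

width, classes = build_classes(100, 134, 7)
print("class width:", width)
for c in classes:
    print(c["limits"], c["boundaries"], c["midpoint"])
```

The sketch applies the rounding rule exactly as stated. When the range divides evenly by the number of classes, it is worth checking that the last class still covers the highest value.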

Step 2: Tally the data.

Step 3: Find the numerical frequencies from the tallies. The completed frequency distribution is

Sometimes it is necessary to use a cumulative frequency distribution. A cumulative frequency
distribution is a distribution that shows the number of data values less than or equal to a specific
value (usually an upper boundary).
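
A short Python sketch of how cumulative frequencies follow from class frequencies. The boundaries and frequencies below are hypothetical and are not taken from the temperature example.

```python
from itertools import accumulate

# Hypothetical grouped distribution (illustrative values only)
upper_boundaries = [9.5, 19.5, 29.5, 39.5]
frequencies = [4, 9, 5, 2]

# Running total: number of values less than or equal to each upper boundary
cumulative = list(accumulate(frequencies))

for boundary, total in zip(upper_boundaries, cumulative):
    print(f"less than or equal to {boundary}: {total}")
```

The last cumulative frequency always equals the total number of data values.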

Exercise: The data shown here represent the number of miles per gallon (mpg) that 30 selected
four-wheel-drive sports utility vehicles obtained in city driving. Construct a frequency
distribution, and analyze the distribution.

4.3 Diagrammatic and graphical methods of data presentation
4.3.1 Pie chart
Pie graphs are used extensively in statistics. The purpose of the pie graph is to show the
relationship of the parts to the whole by visually comparing the sizes of the sections. Percentages
or proportions can be used.
A pie graph is a circle that is divided into sections or wedges according to the percentage of
frequencies in each category of the distribution.
Example: This frequency distribution shows the number of pounds of each snack food eaten
during the Super Bowl. Construct a pie graph for the data.

Solution: Step 1: Since there are 360° in a circle, the frequency for each class must be converted into
a proportional part of the circle. This conversion is done by using the formula
Degrees = (f/n)*360°
where f = frequency for each class and n = sum of the frequencies. Hence, the following
conversions are obtained. The degrees should sum to 360°.

Step 2: Each frequency must also be converted to a percentage using the formula
% = (f/n)*100

Step 3: Next, using a protractor and a compass, draw the graph using the appropriate degree
measures found in step 1, and label each section with the name and percentages, as shown below
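
The conversions in steps 1 and 2 can be sketched in Python as follows. The snack names and frequencies are hypothetical and are not the data from the example; the function name is also an illustrative choice.

```python
def pie_sections(frequencies):
    """Convert class frequencies to degrees and percentages of a pie graph."""
    n = sum(frequencies.values())  # sum of the frequencies
    return {
        label: {"degrees": f / n * 360, "percent": f / n * 100}
        for label, f in frequencies.items()
    }

# Hypothetical frequencies (illustrative values only)
snacks = {"chips": 10, "nuts": 5, "popcorn": 5}

sections = pie_sections(snacks)
for label, section in sections.items():
    print(f"{label}: {section['degrees']:.0f} degrees, {section['percent']:.0f}%")
```

As a check, the degrees should sum to 360 and the percentages to 100.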

4.3.2 Histogram
The histogram is a graph that displays the data by using contiguous vertical bars (unless the
frequency of a class is 0) of various heights to represent the frequencies of the classes.
Example: Construct a histogram to represent the data shown for the record high temperatures for
each of the 50 states from the example above.

Solution:

Step 1: Draw and label the x and y axes. The x axis is always the horizontal axis, and the y axis is
always the vertical axis.
Step 2: Represent the frequency on the y axis and the class boundaries on the x axis.
Step 3: Using the frequencies as the heights, draw vertical bars for each class.

As the histogram shows, the class with the greatest number of data values (18) is 109.5–114.5,
followed by 13 for 114.5–119.5. The graph also has one peak with the data clustering around it.

4.3.4 Bar chart

When the data are qualitative or categorical, bar graphs can be used to represent the data. A bar
graph can be drawn using either horizontal or vertical bars.
A bar graph represents the data by using vertical or horizontal bars whose heights or lengths
represent the frequencies of the data.
Example: The table shows the average money spent by first-year college students. Draw a bar
graph for the data.

Solution:
1. Draw and label the x and y axes. For the horizontal bar graph place the frequency scale on the
x axis, and for the vertical bar graph place the frequency scale on the y axis.
2. Draw the bars corresponding to the frequencies.
