Summary sheet: Collecting and interpreting data

K1 Understand and use the terms ‘population’ and ‘sample’

Use samples to make informal inferences about the population
Understand and use sampling techniques, including simple random sampling and
opportunity sampling
Select or critique sampling techniques in the context of solving a statistical problem,
including understanding that different samples can lead to different conclusions about the
population
L1 Interpret diagrams for single-variable data, including understanding that area in a histogram
represents frequency
Connect to probability distributions
L2 Interpret scatter diagrams and regression lines for bivariate data, including recognition of
scatter diagrams which include distinct sections of the population (calculations involving
regression lines are excluded)
Understand informal interpretation of correlation
Understand that correlation does not imply causation
L3 Interpret measures of central tendency and variation, extending to standard deviation
Be able to calculate standard deviation, including from summary statistics
L4 Recognise and interpret possible outliers in data sets and statistical diagrams
Select or critique data presentation techniques in the context of a statistical problem
Be able to clean data, including dealing with missing data, errors and outliers

Populations and samples

A population is a complete collection of people or items.
It is often not possible to gather data about every individual in a population, so a sample is used to
gather information which is then used to draw conclusions about the population.

Some common sampling techniques are:

• Simple random sampling is a sampling method in which the items in the sample are chosen by a
random process such as drawing from a box. Every member of the population has an equal
chance of being selected.
• Opportunity sampling involves choosing individuals for a sample as opportunity arises, such as
interviewing passers-by.
• Systematic sampling involves selecting individuals from a population by a systematic method,
such as selecting every 10th individual on a list of the population.
• Stratified sampling is used when the population can be divided into subgroups (strata) using
criteria such as age or gender, and ensures that all strata are represented in the sample.
Sometimes there is a requirement that the number sampled from each stratum is proportional
to the size of that stratum (this is called proportional stratified sampling). Otherwise, weighting is
needed.
• Quota sampling can also be used when the population can be divided into strata. A certain
number of items from each stratum is required.
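
The sampling techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population, the age groups and the function names are all made up for the example, and a fixed seed is used so that the results are repeatable.

```python
import random

# Hypothetical population: 100 members, each tagged with an age group (stratum).
population = [
    {"id": i, "group": "under 18" if i < 20 else "18 to 64" if i < 80 else "65+"}
    for i in range(100)
]

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def simple_random_sample(items, n):
    """Every member has an equal chance of being selected."""
    return rng.sample(items, n)


def systematic_sample(items, step):
    """Pick a random starting point, then take every `step`-th member."""
    start = rng.randrange(step)
    return items[start::step]


def proportional_stratified_sample(items, n, key):
    """Sample from each stratum in proportion to the size of the stratum."""
    strata = {}
    for item in items:
        strata.setdefault(item[key], []).append(item)
    sample = []
    for members in strata.values():
        # In general, rounding can make the total differ slightly from n.
        k = round(n * len(members) / len(items))
        sample.extend(rng.sample(members, k))
    return sample


srs = simple_random_sample(population, 10)
systematic = systematic_sample(population, 10)
stratified = proportional_stratified_sample(population, 10, "group")
```

With strata of sizes 20, 60 and 20, a proportional stratified sample of 10 takes 2, 6 and 2 members respectively.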

It is important to remember that different samples may give different results!
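
A short simulation can make this point concrete. The sketch below assumes a made-up population of 1000 heights and takes five simple random samples of size 20; the numbers are invented purely for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1000 heights in cm (made-up data).
population = [rng.gauss(170, 10) for _ in range(1000)]

# Take five simple random samples of size 20 and record each sample mean.
sample_means = [statistics.mean(rng.sample(population, 20)) for _ in range(5)]

for m in sample_means:
    print(round(m, 1))
print("population mean:", round(statistics.mean(population), 1))
```

Each sample mean is an estimate of the population mean, and the estimates vary from sample to sample.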

Some sampling techniques, such as opportunity sampling, are prone to bias, since the sample is unlikely
to be representative of the population.

Types of data
• Categorical data are not numerical in value (e.g. colours of cars).
• Discrete data are numerical data that can take only specific values, such as shoe sizes or number
of pets.
• Continuous data are numerical data that can take any real value in a range, such as weights or
heights.

Statistical diagrams
There is a wide variety of statistical diagrams that can be used to illustrate data. Some diagrams are
only appropriate for certain types of data.
• A histogram is used to illustrate grouped data. The vertical axis gives frequency density (found by
dividing the frequency by the class width). The frequency for each group is proportional to the
area of its bar, and there are no gaps between the bars.
• A frequency polygon is a plot of frequency against data values. The points are joined with
straight lines.
• A box-and-whisker diagram, or boxplot, summarises numerical data by showing the lowest
value, the lower quartile, the median, the upper quartile and the highest value.
• A cumulative frequency curve is a graph illustrating numerical data. For each value, x, the total
frequency of data less than or equal to x is plotted against x. A cumulative frequency curve is
useful for estimating the values of the median, quartiles or other percentiles.
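
The frequency density and cumulative frequency calculations described above can be sketched as follows. The grouped data are hypothetical, and the percentile estimate assumes the cumulative frequency points are joined by straight lines (linear interpolation within each class); the function name `estimate_percentile` is just a label for this example.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency).
classes = [(0, 10, 6), (10, 20, 12), (20, 40, 16), (40, 70, 6)]

# Frequency density = frequency / class width.
densities = [freq / (upper - lower) for lower, upper, freq in classes]

# Cumulative frequency plotted at each upper class boundary.
cumulative = []
running_total = 0
for lower, upper, freq in classes:
    running_total += freq
    cumulative.append((upper, running_total))


def estimate_percentile(p):
    """Estimate the p-th percentile by linear interpolation within a class."""
    total = cumulative[-1][1]
    target = p / 100 * total
    below = 0  # cumulative frequency at the start of the current class
    for lower, upper, freq in classes:
        if below + freq >= target:
            return lower + (target - below) / freq * (upper - lower)
        below += freq


median = estimate_percentile(50)
```

Here the class 20 to 40 has the highest frequency (16) but not the tallest bar, because its width is twice that of the class 10 to 20.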

Measures of central tendency

• The mean is found by adding up the data items and dividing by the number of data items.
• The median is the midpoint of the data when they are placed in numerical order.
• The mode is the most frequently occurring data value.
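
As a quick illustration, the three measures can be found with Python's standard `statistics` module. The data set below is made up for the example.

```python
import statistics

# Hypothetical data set: number of pets in ten households.
pets = [0, 1, 1, 2, 2, 2, 3, 3, 4, 7]

mean = statistics.mean(pets)      # sum of the data divided by the number of items
median = statistics.median(pets)  # middle value of the sorted data
mode = statistics.mode(pets)      # most frequently occurring value
```

For these data the mean is 2.5, while the median and mode are both 2; the single large value 7 pulls the mean upwards but does not affect the median or mode.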

Measures of variation
• The range is the difference between the highest and lowest values from the data.
• The interquartile range is the difference between the upper quartile (the value corresponding to
¾ of the data when ranked numerically) and the lower quartile (the value corresponding to ¼ of
the data when ranked numerically).
• Other interpercentile ranges can also be found, such as the difference between the 10th
percentile and the 90th percentile.
• The variance is a measure of the spread of a sample of data.

Variance $= \dfrac{\sum x_i^2 - n\bar{x}^2}{n - 1}$
where $\bar{x}$ is the mean of the data and $n$ is the number of data items.
• The standard deviation is the square root of the variance.
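
The measures of variation above can be sketched in Python as follows. Two assumptions are worth flagging: the variance is calculated from summary statistics using the divisor n - 1 form, and the quartiles use the default method of `statistics.quantiles`, which may differ slightly from the convention in a particular textbook or calculator. The data are hypothetical.

```python
import math
import statistics

data = [4, 7, 8, 10, 11, 12, 18]  # hypothetical sample
n = len(data)

data_range = max(data) - min(data)

# Quartile conventions differ between textbooks and software; here
# statistics.quantiles with its default "exclusive" method is used.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Variance from summary statistics, assuming divisor n - 1.
sum_x = sum(data)                  # sum of the data values
sum_x2 = sum(x * x for x in data)  # sum of the squared data values
mean = sum_x / n
variance = (sum_x2 - n * mean ** 2) / (n - 1)
std_dev = math.sqrt(variance)
```

Only n, the sum of the values and the sum of their squares are needed for the variance, which is why it can be found from summary statistics without the original data.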

Bivariate data
Data which involve two variables, e.g. height and weight, are called bivariate data.
Bivariate data can be illustrated on a scatter diagram in which the axes represent the two variables and
each data item is plotted using coordinates.

If it is believed that one variable is dependent on the other, the independent variable (sometimes called
the explanatory variable) should be plotted on the x-axis, and the dependent variable (sometimes
called the response variable) should be on the y-axis.

If a set of bivariate data plotted on a scatter diagram fall close to a straight line, there is linear
correlation. The closer the data lie to the line, the stronger the correlation. If the line has positive
gradient, there is positive correlation, and if it has negative gradient, there is negative correlation. If all
the data lie on the line, there is perfect linear correlation.

It is important to remember that correlation does not imply causation; if there is correlation or
association, there may be other factors involved.

Sometimes a scatter diagram shows data falling in two or more groups. These may represent different
sections of the population (e.g. adults and children) and it may not be appropriate to treat the data as a
single set.

Outliers and cleaning data

An outlier is an unusually high or low value in a set of data.
Two commonly used definitions of outliers are:
• Any data value which is more than two standard deviations away from the mean.
• Any data value which is more than 1.5 times the interquartile range above the upper quartile or
below the lower quartile.
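
Both definitions can be applied to a data set as in the sketch below. The data are made up, the standard deviation is taken with divisor n - 1, and the quartiles come from the default method of `statistics.quantiles`; other quartile conventions may give slightly different boundaries.

```python
import statistics

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]  # hypothetical data

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation (divisor n - 1)

# Rule 1: more than two standard deviations away from the mean.
outliers_sd = [x for x in data if abs(x - mean) > 2 * sd]

# Rule 2: more than 1.5 x IQR above the upper quartile or below the lower quartile.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
outliers_iqr = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

For these data both rules flag the value 45, but in general the two definitions do not always identify the same values.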

Outliers may be legitimate data items, or they may be errors. Even if they are legitimate, it may
sometimes be appropriate to exclude them from the data.

Cleaning data involves dealing with missing data, errors and outliers. How you deal with these issues
depends on the situation and what you are using the data for.
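
One possible approach is sketched below for a made-up list of recorded heights. The plausibility limits are an assumption chosen for this example, and simply discarding missing or implausible values is only one option; in another context it might be better to check and correct them.

```python
# Hypothetical raw data: recorded heights in cm, with gaps and obvious errors.
raw_heights = [172.0, None, 165.5, -1.0, 180.2, 1710.0, 158.9, None]

# Assumed plausibility limits for a height in cm (a judgement call).
LOWEST_PLAUSIBLE = 50.0
HIGHEST_PLAUSIBLE = 250.0

# Missing values are recorded here as None.
missing = [h for h in raw_heights if h is None]

# Values outside the plausible range are treated as errors.
errors = [
    h for h in raw_heights
    if h is not None and not (LOWEST_PLAUSIBLE <= h <= HIGHEST_PLAUSIBLE)
]

# The cleaned data keep only the values that are present and plausible.
cleaned = [
    h for h in raw_heights
    if h is not None and LOWEST_PLAUSIBLE <= h <= HIGHEST_PLAUSIBLE
]
```

Keeping separate lists of the missing and rejected values makes it possible to report how much data was removed and why.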

