
This is the fourth in a series of articles in this journal on the use of statistics in medicine. In the previous issue, we described how to choose an appropriate statistical test. In this article, we consider this further and discuss how to interpret the results.

More on choosing an appropriate statistical test


Deciding which statistical test to use to analyse a set of data depends on the type
of data (interval or categorical, paired vs unpaired) being analysed and whether or
not the data are normally distributed. Interpretation of the results of statistical
analysis relies on an appreciation and consideration of the null hypothesis, P-
values, the concept of statistical vs clinical significance, study power, type I
and type II statistical errors, the pitfalls of multiple comparisons, and the
choice between one- and two-tailed tests made before conducting the study.

Assessing whether a data set follows a normal distribution


It may be apparent from constructing a histogram or frequency curve that the data
follow a normal distribution. However, with small sample sizes (n < 20), it may not
be obvious from the graph that the data are drawn from a normally distributed
population. The data may be subjected to formal statistical analysis for evidence
of normality using one or more specific tests usually included in computer software
packages, such as the Shapiro–Wilk test. Such tests are fairly robust with larger
sample sizes (n > 100). However, the choice between parametric and non-parametric
statistical analysis is less important with samples of this size as both analyses
are almost equally powerful and give similar results. With smaller sample sizes (n
< 20), tests of normality may be misleading. Unfortunately, non-parametric analysis
of small samples lacks statistical power and it may be almost impossible to
generate a P-value of <0.05, whatever the differences between the groups of sample
data.

When in doubt as to the type of distribution that the sample data follow,
particularly when the sample size is small, non-parametric analysis should be
undertaken, accepting that the analysis may lack power. The best solution to
avoiding mistakes in choosing the appropriate statistical test for analysis of data
is to design a study with sufficiently large numbers of subjects in each group.
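By way of illustration, the following is a minimal sketch of this workflow in Python, using the NumPy and SciPy libraries (which the article itself does not mention; the data are simulated): each sample is tested for normality with the Shapiro–Wilk test, and a non-parametric test is used when normality is in doubt.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=120, scale=15, size=18)  # simulated values, n < 20
group_b = rng.normal(loc=110, scale=15, size=18)

# Shapiro-Wilk test of normality for each sample (may mislead for small n)
_, p_a = stats.shapiro(group_a)
_, p_b = stats.shapiro(group_b)

if p_a > 0.05 and p_b > 0.05:
    # no evidence against normality: use a parametric (unpaired t) test
    _, p = stats.ttest_ind(group_a, group_b)
    test_used = "unpaired t-test"
else:
    # when in doubt, use a non-parametric test, accepting reduced power
    _, p = stats.mannwhitneyu(group_a, group_b)
    test_used = "Mann-Whitney U test"

print(test_used, "P =", round(p, 4))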

Unpaired vs paired data


When comparing the effects of an intervention on sample groups in a clinical study,
it is essential that the groups are as similar as possible, differing only in
respect of the intervention of interest. One common method of achieving this is to
recruit subjects into study groups by random allocation. All subjects recruited
should have an equal chance of being allocated into any of the study groups.
Provided the sample sizes are large enough, the randomization process should ensure
that group differences in variables that may influence outcome of the intervention
of interest (e.g. weight, age, sex ratio, and smoking habit) cancel each other out.
These variables may themselves be subjected to statistical analysis and the null
hypothesis that there is no difference between the study groups tested. Such a
study contains independent groups and unpaired statistical tests are appropriate.
An example would be a comparison of the efficacy of two different drugs for the
treatment of hypertension.

Another method of conducting this type of investigation is the crossover study


design in which all subjects recruited receive either treatment A or treatment B
(the order decided by random allocation for each patient), followed by the other
treatment after a suitable 'washout' period during which the effects of the first
treatment are allowed to wear off. The data obtained in this study would be paired
and subject to paired statistical analysis. The effectiveness of the pairing may be
determined by calculating the correlation coefficient and the corresponding P-value
of the relationship between data pairs.
A third method involves defining all those characteristics that the researcher
believes may influence the effect of the intervention of interest and matching the
subjects recruited for those characteristics. This method is potentially
unreliable, depending as it does on ensuring that key characteristics are not
inadvertently overlooked and therefore not controlled.

The main advantage of the paired over the unpaired study design is that paired
statistical tests are more powerful and fewer subjects need to be recruited in
order to prove a given difference between the study groups. Against this are
pragmatic difficulties and additional time needed for crossover studies, and the
danger that, despite a washout period, there may still be an influence of the first
treatment on the second. The pitfalls of matching patients for all important
characteristics also have to be considered.
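The gain in power from pairing can be demonstrated with a short simulated sketch (again using SciPy; the data and effect sizes are invented). The same crossover data are analysed with both a paired and an unpaired test, and the correlation between the data pairs is calculated to gauge the effectiveness of the pairing.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 15
subject = rng.normal(100, 12, size=n)               # between-subject variability
bp_drug_a = subject + rng.normal(0, 4, size=n)      # response to treatment A
bp_drug_b = subject + 5 + rng.normal(0, 4, size=n)  # response to treatment B

# effectiveness of the pairing: correlation between data pairs
r, p_corr = stats.pearsonr(bp_drug_a, bp_drug_b)

_, p_paired = stats.ttest_rel(bp_drug_a, bp_drug_b)
_, p_unpaired = stats.ttest_ind(bp_drug_a, bp_drug_b)

print("pairing: r =", round(r, 2), "P =", round(p_corr, 4))
print("paired P =", round(p_paired, 4), "unpaired P =", round(p_unpaired, 4))
# The paired test typically yields the smaller P-value because the
# between-subject variability cancels out within each pair.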

The null hypothesis and P-values


Before undertaking statistical analysis of data, a null hypothesis is proposed,
that is, there is no difference between the study groups with respect to the
variable(s) of interest (i.e. the sample means or medians are the same). Once the
null hypothesis has been defined, statistical methods are used to calculate the
probability of observing the data obtained (or data deviating even further from the
prediction of the null hypothesis) if the null hypothesis is true.

For example, we may obtain two sample data sets which appear to be from different
populations when we examine the data. Let us consider that the appropriate
statistical test is applied and the P-value obtained is 0.02. Conventionally, the
P-value for statistical significance is defined as P < 0.05. In the above example,
the threshold is breached and the null hypothesis is rejected. What exactly does a
P-value of 0.02 mean? Let us imagine that the study is repeated numerous times. If
the null hypothesis is true and the sample means are not different, a difference
between the sample means at least as large as that observed in the first study
would be observed only 2% of the time.
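This interpretation can be demonstrated by simulation (a sketch with invented numbers, not taken from the article): generate many repeat studies in which the null hypothesis is true and count how often a difference at least as large as the observed one arises by chance alone.

import numpy as np

rng = np.random.default_rng(2)
n, sigma = 20, 10.0        # per-group sample size and standard deviation
observed_diff = 7.4        # difference between means seen in the "first study"

# simulate 100 000 repeat studies in which the null hypothesis is true
reps = 100_000
diff = (rng.normal(0, sigma, (reps, n)).mean(axis=1)
        - rng.normal(0, sigma, (reps, n)).mean(axis=1))

frac = np.mean(np.abs(diff) >= observed_diff)
print("fraction of null studies with a difference this large:", round(frac, 3))
# with these particular numbers the fraction is about 0.02, i.e. P = 0.02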

Many published statistical analyses quote P-values as ≥0.05 (not significant),
<0.05 (significant), <0.01 (highly significant), etc. However, this practice
resulted from an era before the widespread availability of computers for
statistical analysis when P-values had to be looked up in reference tables. This
approach is no longer satisfactory and precise P-values obtained should always be
quoted. The importance of this approach is illustrated by the following example. In
a study comparing two hypotensive agents, drug A is found to be more effective than
drug B and P < 0.05 is quoted. We are convinced and immediately switch all our
hypertensive patients to drug A. Another group of investigators conduct a similar
study and find no significant difference between the two drugs (P ≥ 0.05). We
immediately switch all our hypertensive patients back onto drug B as it is less
expensive and seems to be equally effective. We may also be somewhat confused by
the apparently contradictory conclusions of the two studies.

In fact, if the actual P-value of the first study was 0.048 and that of the second
study was 0.052, the two studies are entirely consistent with each other. The
conventional value for statistical significance (P < 0.05) should always be viewed
in context and a P-value close to this arbitrary cut-off point should perhaps lead
to the conclusion that further work may be necessary before accepting or rejecting
the null hypothesis.

Another example of the arbitrary nature of the conventional threshold for


statistical significance may be considered. Suppose a new anti-cancer drug has been
developed and a clinical study is undertaken to assess its efficacy compared with
standard treatment. It is observed that mortality after treatment with the new drug
tends to be lower but the reduction is not statistically significant (P = 0.06). As
the new drug is more expensive and appears to be no more effective than standard
treatment, should it be rejected? If the null hypothesis is true (both drugs
equally effective) and we were to repeat the study numerous times, we would obtain
the difference observed (or something greater) between the two study groups only 6%
of the time. At the very least, a further larger study needs to be undertaken
before concluding with confidence that the new drug is not more effective; as we
shall see later, the original study may well have been under-powered.
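A prospective power calculation helps avoid this situation. As a hedged illustration (the effect size of 0.3 is an invented assumption, and TTestIndPower comes from the statsmodels library, which the article itself does not mention), the sample size per group needed for 80% power at the conventional 5% significance level could be estimated as follows.

from statsmodels.stats.power import TTestIndPower

# assume a standardized effect size (Cohen's d) of 0.3 between treatments
n_per_group = TTestIndPower().solve_power(effect_size=0.3,
                                          alpha=0.05, power=0.8)
print("subjects needed per group:", round(n_per_group))  # about 175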

Statistical vs clinical significance


Statistical significance should not be confused with clinical significance. Suppose
two hypotensive agents are compared and the mean arterial blood pressure after
treatment with drug A is 2 mm Hg lower than after treatment with drug B. If the
study sample sizes are large enough, even such a small difference between the two
groups may be statistically significant with a P-value of <0.05. However, the
clinical advantage of an additional 2 mm Hg reduction in mean arterial blood
pressure is small and not clinically significant.
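A simulated sketch makes the point (the figures are invented): with 5000 patients per group, a 2 mm Hg difference in mean arterial blood pressure is overwhelmingly statistically significant, yet clinically trivial.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 5000
map_drug_a = rng.normal(98, 15, size=n)   # mean arterial pressure, drug A
map_drug_b = rng.normal(100, 15, size=n)  # mean arterial pressure, drug B

_, p = stats.ttest_ind(map_drug_a, map_drug_b)
print("P =", p)  # far below 0.05 despite only a 2 mm Hg mean difference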

Confidence intervals
A confidence interval is a range of values, calculated from the sample data, which
is likely to include an unknown population parameter, for example, the mean. The
most commonly reported is the 95% confidence interval (CI 95%), although any other
confidence interval may be calculated. If an investigation is repeated numerous
times, the CI 95% generated will contain the population mean 95% of the time.
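As a minimal sketch of the calculation (using SciPy, with invented data), the CI 95% for a sample mean can be obtained from the t-distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(100, 15, size=30)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1,
                                   loc=mean, scale=sem)
print("mean =", round(mean, 1),
      "CI 95% =", (round(ci_low, 1), round(ci_high, 1)))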

Confidence intervals are important when interpreting the results of statistical
analysis and help to put the P-value obtained in context. They should always be quoted
with the P-value. Consider an investigation comparing the efficacy of a new
hypotensive agent with standard treatment. The investigator considers that the
minimum clinically significant difference in mean arterial blood pressure after
treatment with the two drugs is 10 mm Hg. If P < 0.05, three possible ranges for CI
95% may be considered (Fig. 1). If P = 0.05, four possible ranges for CI 95% may be
considered (Fig. 2). These ranges for the CI 95% are summarized in Table 1.
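To make this concrete, one might compute the CI 95% of the difference in means and compare it against the threshold directly (a sketch with invented data; the 10 mm Hg figure is the investigator's minimum clinically significant difference from the example above).

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
new_drug = rng.normal(88, 14, size=n)   # MAP after the new hypotensive agent
standard = rng.normal(100, 14, size=n)  # MAP after standard treatment

diff = new_drug.mean() - standard.mean()
se = np.sqrt(new_drug.var(ddof=1) / n + standard.var(ddof=1) / n)
t_crit = stats.t.ppf(0.975, df=2 * n - 2)  # approximate, assuming equal variances
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print("difference =", round(diff, 1),
      "CI 95% =", (round(ci_low, 1), round(ci_high, 1)))
# If the whole CI lies below -10 mm Hg, the reduction is both statistically
# and clinically significant; if the CI spans -10 mm Hg, the clinical
# importance of the effect remains uncertain.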
