
The Kolmogorov-Smirnov Test

Vasileios Hatzivassiloglou
University of Texas at Dallas
Kolmogorov-Smirnov test
• A fully non-parametric test for comparing
two distributions
• Does not depend on approximations for
the distribution

Empirical distribution function
• For a random variable X and a sample {x1,
x2, ..., xn} the empirical distribution function
of X is defined as
$$F_X(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x)$$
• where I(condition) is the indicator function,
i.e., 1 if the condition is true and 0
otherwise
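
As a minimal sketch of this definition, the empirical distribution function can be computed by counting the sample values that do not exceed x. The function name `ecdf` and the four-point sample are hypothetical and are used only for illustration.

```python
def ecdf(sample, x):
    """Empirical distribution function: fraction of sample values <= x."""
    # The generator yields one count per x_i with I(x_i <= x) = 1.
    return sum(1 for xi in sample if xi <= x) / len(sample)


# Hypothetical toy sample (not taken from the slides).
sample = [3.0, 1.0, 2.0, 2.0]

# Evaluate the step function at a few points.
values = [ecdf(sample, x) for x in (0.5, 1.0, 2.0, 3.0)]
print(values)  # [0.0, 0.25, 0.75, 1.0]
```

The function jumps by 1/n at every sample point (by 2/n at the repeated value 2.0) and reaches 1 at the largest sample value.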
Example data
• FX is an estimate of
– the cumulative probability function of X
• Consider the following example data:
• {1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08,
0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24,
1.37, 0.17, 6.98, 0.10, 0.94, 0.38}
• n = 20
• Is this data normal?
Examining the data
• Sorted data:
• {0.08, 0.10, 0.15, 0.17, 0.24, 0.34, 0.38,
0.42, 0.49, 0.50, 0.70, 0.94, 0.95, 1.26,
1.37, 1.55, 1.75, 3.20, 6.98, 50.57}
• Mean: 3.61, Standard deviation: 11.2
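
The summary values above can be reproduced with a short sketch using Python's standard `statistics` module. It assumes the quoted standard deviation is the sample standard deviation (divisor n - 1), which is what matches the figure of 11.2.

```python
import statistics

# Example data from the slides (n = 20).
data = [1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08,
        0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24,
        1.37, 0.17, 6.98, 0.10, 0.94, 0.38]

sorted_data = sorted(data)
mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation, divisor n - 1

print(sorted_data)
print(round(mean, 2), round(sd, 1))  # 3.61 11.2
```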

Examining the data for normality
• For normal data,
– about 16% of the samples should lie more than one s.d. below the mean
(3.61 − 11.2 = −7.59)
– here none of the samples is even negative
– about 2% should lie more than two standard
deviations above the mean
(3.61 + 2×11.2 = 26.01)
– here we have one in 20 samples (5%) way beyond
that value (50.57)
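
The informal check above can be sketched in code by comparing the tail fractions a normal distribution would predict with the fractions observed in the sample. The variable names are illustrative, and the sample standard deviation (divisor n - 1) is assumed.

```python
from statistics import NormalDist, mean, stdev

# Example data from the slides (n = 20).
data = [1.26, 0.34, 0.70, 1.75, 50.57, 1.55, 0.08,
        0.42, 0.50, 3.20, 0.15, 0.49, 0.95, 0.24,
        1.37, 0.17, 6.98, 0.10, 0.94, 0.38]

m, s = mean(data), stdev(data)
std_normal = NormalDist()  # standard normal: mean 0, s.d. 1

# Fractions predicted for normally distributed data.
expected_below = std_normal.cdf(-1.0)       # below mean - 1 s.d., about 0.159
expected_above = 1 - std_normal.cdf(2.0)    # above mean + 2 s.d., about 0.023

# Fractions actually observed in the sample.
observed_below = sum(x < m - s for x in data) / len(data)
observed_above = sum(x > m + 2 * s for x in data) / len(data)

print(expected_below, observed_below)
print(expected_above, observed_above)
```

With this sample the observed fractions are 0 in the lower tail and 1/20 in the upper tail, which is the mismatch the bullets describe.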
Example empirical distribution

Log transformation

The Kolmogorov-Smirnov test
• Given two cumulative probability functions
FX and FY, the test statistics are
$$D^+ = \max_x \left( F_X(x) - F_Y(x) \right)$$

$$D^- = \max_x \left( F_Y(x) - F_X(x) \right)$$

• Usually the value $D = \max\{D^+, D^-\}$ is used
(although its distribution is harder to study
than either $D^+$ or $D^-$)
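
A minimal sketch of these statistics for two samples follows, assuming both cumulative functions are empirical distribution functions. Because those are step functions, it is enough to evaluate the differences at the pooled sample points. The names `ecdf` and `ks_statistics` and the two toy samples are hypothetical, and the sketch computes only the statistics, not a p-value.

```python
def ecdf(sample, x):
    """Empirical distribution function: fraction of sample values <= x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def ks_statistics(xs, ys):
    """Return (D+, D-, D) for two samples, using their empirical CDFs."""
    # Both ECDFs only change at sample points, so the maxima of the
    # differences are attained on the pooled set of observed values.
    points = sorted(set(xs) | set(ys))
    d_plus = max(ecdf(xs, p) - ecdf(ys, p) for p in points)
    d_minus = max(ecdf(ys, p) - ecdf(xs, p) for p in points)
    return d_plus, d_minus, max(d_plus, d_minus)


# Hypothetical toy samples: ys is xs shifted to the right by 2.
xs = [1, 2, 3, 4]
ys = [3, 4, 5, 6]

d_plus, d_minus, d = ks_statistics(xs, ys)
print(d_plus, d_minus, d)  # 0.5 0.0 0.5
```

Here the first sample's distribution function is never below the second's, so D- is 0 and D equals D+.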
Comparing distributions

Advantages of Kolmogorov-Smirnov
• It is non-parametric and hence robust
• It does not rely on the mean’s location only (like
the t-test)
• It works for non-normal data (the t-test can fail if
the data is too far from normal)
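• It is not sensitive to scaling: applying the same strictly increasing
transformation (e.g., rescaling or taking logs) to both distributions
leaves D unchanged
• It is more powerful than the χ² test
• However, it is less sensitive than the t-test if the data is
indeed normal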
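
The claim about scaling can be checked numerically with a small sketch: the statistic D depends only on how the two samples interleave, so rescaling both samples, or taking logs of both, should leave it unchanged. The helper `ks_d` and the two positive-valued samples are hypothetical.

```python
import math


def ks_d(xs, ys):
    """Two-sample statistic D = max |F_X - F_Y| over the pooled points."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    points = sorted(set(xs) | set(ys))
    return max(abs(ecdf(xs, p) - ecdf(ys, p)) for p in points)


# Hypothetical positive-valued samples (not taken from the slides).
xs = [0.2, 0.5, 1.1, 3.0, 7.5]
ys = [0.9, 2.4, 4.0, 12.0, 30.0]

d_raw = ks_d(xs, ys)
d_scaled = ks_d([1000 * v for v in xs], [1000 * v for v in ys])
d_log = ks_d([math.log(v) for v in xs], [math.log(v) for v in ys])

print(d_raw, d_scaled, d_log)
```

All three values agree because a strictly increasing transformation preserves the ordering of the pooled sample, and the empirical distribution functions depend only on that ordering.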
