Techniques of Observational

Basic Statistics
We use statistics to analyze a set of observations in order to evaluate just what we can
conclude from those data.
Imagine we have a collection of data: 4.45, 4.50, 4.50, 4.55, 4.55, 4.55, 4.60, 4.65, 4.65

Important characteristics:
• Mean: the “average” value: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = 4.556$

• Median: the individual value from the collection such that ½ of the observations are
less and ½ are greater: 4.55
Note that the median must be extracted from the (sorted) dataset, not simply calculated.

• Mode: the most frequently occurring value (seldom of interest): 4.55
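
As a quick check, these three quantities can be computed for the collection above with Python's standard `statistics` module. This is only a minimal sketch; the variable names are illustrative choices, not part of the original notes.

```python
import statistics

# The nine observations from the example collection above.
data = [4.45, 4.50, 4.50, 4.55, 4.55, 4.55, 4.60, 4.65, 4.65]

mean = statistics.mean(data)      # sum of the values divided by their number
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(f"mean = {mean:.3f}, median = {median}, mode = {mode}")
# prints: mean = 4.556, median = 4.55, mode = 4.55
```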


Why is the median sometimes useful?

Imagine a different data set: 4.45, 4.50, 4.50, 4.55, 4.55, 4.55, 4.60, 4.65 , 8.7

• Mean : 5.006
• Median: 4.55
so the median is unaffected by a single very low or very high outlier (i.e. a point
that is way out of the main group of points).
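
The effect of the outlier can be checked directly. The sketch below (variable names are illustrative) compares the mean and median of the two collections: the mean shifts noticeably while the median stays put.

```python
import statistics

original = [4.45, 4.50, 4.50, 4.55, 4.55, 4.55, 4.60, 4.65, 4.65]
# Same collection, but with the last value replaced by the outlier 8.7.
with_outlier = original[:-1] + [8.7]

for label, values in (("original", original), ("with outlier", with_outlier)):
    print(f"{label:>12}: mean = {statistics.mean(values):.3f}, "
          f"median = {statistics.median(values):.2f}")
```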
How do we get an estimate of how good our data are?
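• (Simple) Deviation: $d_i = x_i - \bar{x}$

• Mean Deviation: $\bar{d} = \frac{1}{N-1}\sum_{i=1}^{N} |x_i - \bar{x}|$
The N-1 reflects the loss of one degree of freedom (if we had a single data point, the
mean deviation would be meaningless).

• Variance: $\sigma^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$

• Standard Deviation ("root mean square deviation" or rms or sigma):
$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$
which is a better estimator for random errors than the mean deviation.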
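
A minimal sketch of these four quantities in Python, assuming the N-1 normalisation described above is applied to the mean deviation, the variance and the standard deviation alike. The function name `spread` is hypothetical.

```python
import math

def spread(values):
    """Return (deviations, mean deviation, variance, standard deviation)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(values) / n
    deviations = [x - mean for x in values]               # simple deviations
    mean_dev = sum(abs(d) for d in deviations) / (n - 1)  # mean deviation
    variance = sum(d * d for d in deviations) / (n - 1)   # variance
    return deviations, mean_dev, variance, math.sqrt(variance)

data = [4.45, 4.50, 4.50, 4.55, 4.55, 4.55, 4.60, 4.65, 4.65]
deviations, mean_dev, variance, sigma = spread(data)
print(f"mean deviation     = {mean_dev:.4f}")
print(f"variance           = {variance:.5f}")
print(f"standard deviation = {sigma:.4f}")
```

Note that the simple deviations always sum to zero (to rounding error), which is why the absolute values or squares are used to measure the spread.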

How repeatable/reliable are our values?

The standard deviation tells us something about the expected value of a single
measurement. If the data are normally distributed:

• 68% of the points will lie within ±1 sigma

• 95% of the points will lie within ±2 sigma
• 99.7% of the points will lie within ±3 sigma

Usually we accept a variation as statistically significant only if it is more than 3 sigma
from the mean.
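
The percentages quoted above follow from the Gaussian integral: the fraction of a normal distribution lying within ±k sigma of the mean is erf(k/√2). A short sketch using only the standard library (the function name is an illustrative choice):

```python
import math

def fraction_within(k_sigma):
    """Fraction of a normal distribution within +/- k_sigma of the mean."""
    return math.erf(k_sigma / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within +/-{k} sigma: {100 * fraction_within(k):.1f}%")
```

A deviation of more than 3 sigma therefore arises by chance only about 0.3% of the time.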

Standard deviation of the mean:

How reliable is our estimate of the mean?

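The “standard deviation in the mean” is given by $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$
or $\sigma_{\bar{x}} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N(N-1)}}$, where N is the
number of data points. This is an estimator of the quality of the mean value and it
reflects the improvement gained by averaging several data points.
Note that to improve the quality of the data by a factor of ten would require one hundred
times as many samplings of the data.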
The “standard deviation in the mean” (sometimes called “standard error”) is the
appropriate value to use to draw “error bars” on a plot of mean values.
Data set        1           2           3           4
            102.7051    96.99768    106.1652    106.7639
            93.74577    87.22317    84.87374    92.7521
            92.91529    102.4426    107.6497    98.48607
            102.2656    112.7647    108.2898    93.23228
            110.5028    111.9835    111.2649    110.5721
            92.93493    117.3313    117.9858    100.0187
            104.8264    78.16412    88.74805    121.3331
            102.9943    97.65819    102.8124    96.12005
            93.52754    110.9502    107.0592    86.2086
            108.0685    89.13299    98.85192    84.94068

mean        100.4486    100.4649    103.3701    99.04276
std dev     6.660202    12.92402    10.09143    11.23548
std err     2.106141    4.086935    3.191189    3.552972

[Figure: plot for data sets 1 to 4 (horizontal axis); vertical axis from 88 to 106]
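
The summary rows of the table can be reproduced for the first column as follows. This is a sketch using the standard library; it assumes the tabulated "std dev" is the sample (N-1) standard deviation and the "std err" is that value divided by √N.

```python
import math
import statistics

# First column (data set 1) of the table above.
set1 = [102.7051, 93.74577, 92.91529, 102.2656, 110.5028,
        92.93493, 104.8264, 102.9943, 93.52754, 108.0685]

mean = statistics.mean(set1)
std_dev = statistics.stdev(set1)          # sample standard deviation (N-1)
std_err = std_dev / math.sqrt(len(set1))  # standard deviation in the mean

print(f"mean = {mean:.4f}, std dev = {std_dev:.4f}, std err = {std_err:.4f}")
```

The standard error, not the standard deviation, is the half-length of the error bar that would be drawn on the mean of each data set.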

Kinds of data collections

“Normal” or “Gaussian” distributions: Most experimental results should follow this
distribution.

“Poisson” or “counting rate” distributions

• Data collections where a value can never be less than 0 may have a Poisson
distribution.
• An example is “photon” statistics: the number of photons that can arrive at a
given moment can be 0, 1, 2, … etc. but never less than 0

The “counts” accumulated in a CCD pixel will have a Poisson distribution.

The “standard deviation” of a Poisson distribution is given by $\sigma = \sqrt{N}$, where N
is the number of counts, so if you have 10,000 counts in a pixel, the error will be ±100.
Signal-to-Noise ratio
• Have we taken enough data?
• How much longer should we observe?

For Poisson statistics (c = total received counts): $S/N = \frac{c}{\sqrt{c}} = \sqrt{c}$
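
Because the noise on c counts is √c, the signal-to-noise ratio is c/√c = √c, so it grows only as the square root of the exposure time. The sketch below turns that into an answer to "how much longer should we observe?". It assumes counts accumulate at a constant rate and that photon (Poisson) noise is the only noise source; both function names are hypothetical.

```python
import math

def poisson_snr(counts):
    """Signal-to-noise ratio for Poisson statistics: c / sqrt(c) = sqrt(c)."""
    return math.sqrt(counts)

def exposure_factor(current_snr, target_snr):
    """Factor by which to lengthen the exposure to reach target_snr.

    Assumes a constant count rate and pure Poisson noise, so S/N scales
    as the square root of the exposure time.
    """
    return (target_snr / current_snr) ** 2

print(poisson_snr(10_000))        # 100.0
print(exposure_factor(100, 200))  # 4.0
```

Doubling the signal-to-noise ratio therefore costs four times the observing time.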

Linear Least Squares or Regression analysis

Assume a straight line fit to some data, $y = a + bx$, with intercept a and slope b.
Let y = focus value, x = temperature.
The errors can be estimated from:

Interpretation: At 0 F the focus will be (about) 31000. The change in focus with
temperature is about –40 counts per degree.
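
A sketch of such a fit in Python, using the standard textbook formulas for an unweighted straight-line fit y = a + bx and for the uncertainties in a and b. These are not necessarily the exact expressions used in the original notes. The temperature and focus readings are hypothetical values invented for illustration, chosen only to give numbers of the same order as those quoted above.

```python
import math

def linear_fit(x, y):
    """Unweighted least-squares fit of y = a + b*x.

    Returns (a, b, sigma_a, sigma_b). The uncertainties assume equal,
    independent errors on every y value, estimated from the fit residuals.
    """
    n = len(x)
    if n < 3:
        raise ValueError("need at least three points to estimate errors")
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    delta = n * sxx - sx * sx
    a = (sxx * sy - sx * sxy) / delta  # intercept
    b = (n * sxy - sx * sy) / delta    # slope
    # Scatter of the points about the line, with N-2 degrees of freedom.
    resid = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sigma_y = math.sqrt(resid / (n - 2))
    sigma_a = sigma_y * math.sqrt(sxx / delta)
    sigma_b = sigma_y * math.sqrt(n / delta)
    return a, b, sigma_a, sigma_b

# Hypothetical temperature (deg F) and focus readings, for illustration only.
temperature = [30.0, 40.0, 50.0, 60.0, 70.0]
focus = [29810.0, 29390.0, 29020.0, 28590.0, 28200.0]

a, b, sigma_a, sigma_b = linear_fit(temperature, focus)
print(f"intercept = {a:.0f} +/- {sigma_a:.0f}")
print(f"slope     = {b:.1f} +/- {sigma_b:.1f} counts per degree")
```

The N-2 in the residual scatter plays the same role as the N-1 in the standard deviation: two degrees of freedom are used up by fitting the intercept and the slope.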
Correlation Coefficient
It is useful to look at the correlation coefficient, rho, between x and y. A correlation
coefficient of 0 means that x and y are not (linearly) correlated; a value of +1 or -1
means the quantities are perfectly positively or negatively correlated.
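
A minimal sketch of the (Pearson) correlation coefficient, written out explicitly rather than taken from a library. The data are hypothetical temperature and focus readings invented for illustration; they are not measurements from the original notes.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical temperature (deg F) and focus readings, for illustration only.
temperature = [30.0, 40.0, 50.0, 60.0, 70.0]
focus = [29810.0, 29390.0, 29020.0, 28590.0, 28200.0]
print(f"rho = {correlation(temperature, focus):.4f}")
```

For these values rho comes out close to -1, as expected for a focus that falls almost linearly with temperature.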

