Basic Biostatistics
Introduction to Biostatistics
E-mail: [email protected]
Mob.: 0914-728565
Presentation
Interpretation
Publication
Variable:
A characteristic that takes on different
values in different persons, places, or things.
For example:
- heart rate,
- the heights of adult males,
- the weights of preschool children,
- the ages of patients seen in a dental clinic.
Quantitative: Interval, Ratio
Qualitative: Nominal, Ordinal
• Target population:
– A collection of items that have something in common
for which we wish to draw conclusions at a particular
time.
• E.g., All hospitals in Ethiopia
– The whole group of interest
[Figure: diagram linking the sample and the information obtained from it]
Characteristics of a good MCT
A measure of central tendency (MCT) is good or satisfactory if it possesses the following
characteristics.
1. It should be based on all the observations
1. Arithmetic Mean
A. Ungrouped Data
The heart rates for n=10 patients were as follows (beats
per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean for the heart rate of these
patients?
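The question above can be checked with a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and the data are the ten heart rates listed in the example.

```python
from statistics import mean

# Heart rates (beats per minute) for the n = 10 patients in the example
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

# Arithmetic mean = sum of the observations divided by their number
mean_hr = mean(heart_rates)

print(f"n = {len(heart_rates)}, sum = {sum(heart_rates)}, mean = {mean_hr:.1f} bpm")
```

The sum of the ten values is 1298, so the mean is 129.8 beats per minute. Note that the single low value (40) pulls the mean below most of the observations.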
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$
where,
k = the number of class intervals
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
Example. Compute the mean age of 169 subjects from the grouped data.
Class interval   Mid-point (mi)   Frequency (fi)   mifi
10-19            14.5             4                58.0
20-29            24.5             66               1617.0
30-39            34.5             47               1621.5
40-49            44.5             36               1602.0
50-59            54.5             12               654.0
60-69            64.5             4                258.0
Total                             169              5810.5
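As a rough check of the grouped-data mean, the sketch below applies the formula (sum of mid-point times frequency, divided by the total frequency) to the six class intervals in the table. The list names are illustrative only.

```python
# Mid-points (mi) and frequencies (fi) of the six age classes in the table
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

# Numerator: sum of mi * fi; denominator: sum of fi
total_mf = sum(m * f for m, f in zip(midpoints, frequencies))
n = sum(frequencies)

grouped_mean = total_mf / n
print(f"sum(mi*fi) = {total_mf}, n = {n}, mean = {grouped_mean:.2f} years")
```

The totals come out to 5810.5 and 169, and the ratio 5810.5/169 evaluates to about 34.38 years.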
When the data are skewed, the mean is “dragged” in the direction of the skewness
• It is possible in extreme cases for all but one of the sample points to be on
one side of the arithmetic mean & in this case, the mean is a poor measure of
central location or does not reflect the center of the sample.
• For a given set of data there is one and only one arithmetic mean
(uniqueness)
Properties of the median
• There is only one median for a given set of data (uniqueness)
a) Ungrouped data
• It is a value which occurs most frequently in
a set of values.
Properties of mode
• It is not affected by extreme values
• The mode can be used for all types of data, but may
be especially useful for nominal and ordinal
measurements.
[Figure: positions of the mode, median, and mean on distribution curves]
Two or more sets may have the same mean and/or median but they may be
quite different.
Measures of Dispersion
• Measures that quantify the variation or dispersion of a set of
data from its central location
Range (R)
• The difference between the largest and smallest
observations in a sample.
• Range = Maximum value – Minimum value
• Example –
– Data values: 5, 9, 12, 16, 23, 34, 37, 42
– Range = 42-5 = 37
IQR = Q3 - Q1
i.e., the middle 50% of the infant girls weigh between 8.8 and 10.2 kg.
Variance (σ², s²)
• The variance is the average of the squares of the deviations
taken from the mean.
a) Ungrouped data
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
is the population mean.
A sample variance is calculated for a sample of individual values
(X1, X2, …, Xn) and uses the sample mean (X̄) rather than the
population mean µ.
$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$
where
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
x = the sample mean
k = the number of class intervals
School of Public Health
Properties of Variance:
• The main disadvantage of variance is that its unit is
the square of the unit of the original measurement
values.
Standard deviation (σ, s)
Example. Compute the variance and SD of the age of 169 subjects from the grouped
data.
Mean = 5810.5/169 = 34.38 years
Class interval   (mi)   (fi)   (mi-Mean)   (mi-Mean)²   (mi-Mean)² fi
10-19            14.5   4      -19.88      395.21       1580.86
20-29            24.5   66     -9.88       97.61        6442.55
30-39            34.5   47     0.12        0.0144       0.68
40-49            44.5   36     10.12       102.41       3686.92
50-59            54.5   12     20.12       404.81       4857.77
60-69            64.5   4      30.12       907.21       3628.86
Total                   169                             20197.63
S² = 20197.63/(169-1) = 120.22
SD = √S² = √120.22 = 10.96
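The grouped-data variance and SD can be reproduced with a short script. This is a minimal sketch that applies the sample-variance formula for grouped data (divisor: total frequency minus 1) to the mid-points and frequencies in the table; the variable names are illustrative.

```python
from math import sqrt

# Mid-points (mi) and frequencies (fi) of the six age classes
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

n = sum(frequencies)
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Sum of squared deviations of the mid-points from the mean, weighted by frequency
sum_sq = sum(f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, frequencies))

variance = sum_sq / (n - 1)   # sample variance, divisor n - 1
sd = sqrt(variance)           # standard deviation, in years

print(f"mean = {grouped_mean:.2f}, S^2 = {variance:.2f}, SD = {sd:.2f}")
```

Computed without intermediate rounding, the variance is about 120.2 and the SD about 10.96 years.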
Properties of SD
• The SD has the advantage of being expressed in the same
units of measurement as the mean
$$CV = \frac{SD}{\bar{x}} \times 100$$
              SD          Mean         CV (%)
SBP           15 mm Hg    130 mm Hg    11.5
Cholesterol   40 mg/dl    200 mg/dl    20.0
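The two CV values in the table follow directly from the formula. The sketch below uses a hypothetical helper, `coefficient_of_variation`, written only for this illustration.

```python
def coefficient_of_variation(sd, mean):
    """Return the coefficient of variation as a percentage: SD / mean * 100."""
    return sd / mean * 100

# Values taken from the table: SBP and cholesterol
cv_sbp = coefficient_of_variation(15, 130)
cv_chol = coefficient_of_variation(40, 200)

print(f"CV (SBP) = {cv_sbp:.1f}%, CV (cholesterol) = {cv_chol:.1f}%")
```

Because the CV is unit-free, it allows the variability of SBP and cholesterol to be compared even though they are measured in different units; here cholesterol is relatively more variable.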
NOTE:
• The range often appears with the median as a numerical
summary measure
• Likelihood of an event.
• Objective probability
1) Classical probability and
2) Relative frequency probability.
Example:
Of 158 people who attended a dinner party, 99 were ill.
P (Illness) = 99/158 = 0.63 = 63%.
Subjective Probability
• Personalistic (represents one’s degree of belief in the
occurrence of an event).
• Example:
– A coin toss cannot produce heads and tails
simultaneously.
– Weight of an individual can’t be classified simultaneously
as “underweight”, “normal”, “overweight”
Example:
– The outcomes on the first and second coin tosses
are independent
• Let A represent the event that a randomly selected newborn is LBW, and B the event that
he or she is from a multiple birth
• The intersection of A and B is the event that the infant is both LBW and from a multiple
birth
• The union of A and B, A U B, is the event that either A happens or B happens or they both
happen simultaneously
P ( A or B ) = P ( A U B )
• In the example above, the union of A and B is the event that the newborn is either LBW
or from a multiple birth, or both
Basic Probability Rules
1. Addition rule
More generally:
P(A or B) = P(A) + P(B) - P(A and B)
P(event A or event B occurs or they both occur)
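The general addition rule can be illustrated with a small example. The events below (outcomes of one roll of a fair six-sided die) are hypothetical and are not taken from the slides; they are chosen only because the union can be counted directly and compared with the rule.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die
sample_space = set(range(1, 7))
A = {2, 4, 6}   # event A: an even number
B = {4, 5, 6}   # event B: a number greater than 3

def prob(event):
    """Classical probability: favourable outcomes / all equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

# Direct count of the union versus the addition rule
p_union_direct = prob(A | B)
p_union_rule = prob(A) + prob(B) - prob(A & B)

print(p_union_direct, p_union_rule)
```

Subtracting P(A and B) is what prevents the shared outcomes (4 and 6) from being counted twice.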
2. Multiplication rule
– More generally,
P(A ∩ B) = P(A) P(B|A) = P(B) P(A|B)
P(A and B) denotes the probability that A and B
both occur at the same time.
Conditional Probability
• The probability of developing retinopathy is:
= 18/21 = 0.86
= 21/39 = 0.54
Exercise:
Culture and Gonodectin (GD) test results for 240 urethral discharge
specimens
                   Culture result
GD test result     Gonorrhea   No gonorrhea   Total
Negative           8           48             56
6. What is the probability that a man who does not
have gonorrhea has a negative GD test?
7. What is the probability that a man who does not
have gonorrhea has a positive GD test?
8. What is the probability that a man with positive
GD test has gonorrhea?
Probability Distributions
• It is the way data are distributed, in order to draw
conclusions about a set of data
0 ≤ P(X = x) ≤ 1
∑ P(X = x) = 1
The following data shows the number of diagnostic
services a patient receives
Probability distributions can also
be displayed using a graph
[Figure: bar chart of probability, P(X = x), against the number of diagnostic services, x (0 to 5)]
Binomial Distribution
• Consider dichotomous (binary) random variable
• n denotes the number of fixed trials
• x denotes the number of successes in
the n trials
• p denotes the probability of success
• q denotes the probability of failure (1- p)
$$P(X=4) = \binom{10}{4}(0.4)^4(1-0.4)^{10-4} = \binom{10}{4}(0.4)^4(0.6)^6 = 210(0.0256)(0.04666) = 0.25$$
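The calculation above can be sketched in Python using only the standard library. The function name `binomial_pmf` is an illustrative choice, not something defined in the slides.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variable: nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Example from the text: n = 10 trials, p = 0.4, x = 4 successes
p_four = binomial_pmf(4, 10, 0.4)

print(f"10C4 = {comb(10, 4)}, P(X = 4) = {p_four:.4f}")
```

The same function can be reused for any other combination of n, p and x, and summing it over x = 0, ..., n should give 1.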
The probability in the above table can be converted
into the following graph
[Figure: bar chart of probability against the number of smokers (0 to 10)]
a.) $$P(X = 2) = \binom{5}{2}(0.25)^2(0.75)^{5-2} = 0.2637$$
$$P(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
• where x = 0, 1, 2, . . .∞
• x is a potential outcome of X
• The constant λ (lambda) represents the rate at
which the event occurs, or the expected number
of events per unit time
• e = 2.71828
b) P(X=1) = 0.244
c) P(X=2) = 0.268
d) P(X=3) = 0.197
e) P(X=4) = 0.108
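The listed probabilities are consistent with a Poisson distribution with mean λ = 2.2. The sketch below, which assumes that value of λ and uses an illustrative function name, recomputes them from the Poisson formula.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson variable with mean lam."""
    return lam**x * exp(-lam) / factorial(x)

lam = 2.2  # assumed mean, matching the probabilities listed above
probs = {x: poisson_pmf(x, lam) for x in range(5)}

for x, p in probs.items():
    print(f"P(X = {x}) = {p:.3f}")
```

Rounded to three decimals, the values for x = 1 to 4 agree with items b) to e).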
[Figure: Poisson distribution with mean 2.2; probability against x = 0 to 7]
Example:
• In a given geographical area, cases of tetanus are
reported at a rate of λ = 4.5/month
• What is the probability that 0 cases of tetanus will
be reported in a given month?
• What is the probability that 1 case of tetanus
will be reported?
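Both questions can be worked directly from the Poisson formula with λ = 4.5. The steps below are a sketch of the arithmetic, with values rounded to three decimals:

$$P(X = 0) = \frac{4.5^{0}\, e^{-4.5}}{0!} = e^{-4.5} \approx 0.011$$

$$P(X = 1) = \frac{4.5^{1}\, e^{-4.5}}{1!} = 4.5 \times e^{-4.5} \approx 0.050$$

So a month with no reported cases is expected only about 1% of the time, and a month with exactly one case about 5% of the time.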
Continuous Probability Distributions
• A continuous random variable X can take on any value in a
specified interval or range
• The area under the curve between any two points x1 and
x2 is the probability that X takes a value between x1 and x2
• Therefore, P(X=x) = 0
• We calculate:
Pr [ a < X < b], the probability of an
interval of values of X.
The Normal distribution
• A random variable X is said to follow a normal distribution (ND) if and
only if its probability density function is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty$$
π (pi) = 3.14159
e = 2.71828, x = Value of X
Range of possible values of X: -∞ to +∞
µ = Expected value of X (“the long run
average”)
σ² = Variance of X.
µ and σ are the parameters of the normal
distribution — they completely define its
shape
1. The mean µ tells you about location -
– Increase µ - Location shifts right
– Decrease µ – Location shifts left
– Shape is unchanged
Properties of the Normal Distribution
1. It is symmetrical about its mean, µ.
2. The mean, the median and the mode are equal. It is unimodal.
3. The total area under the curve above the x-axis is 1 square unit.
5. As the value of σ increases, the curve becomes more and more flat
and vice versa.
Standard Normal Distribution
• It is a normal distribution that has a mean equal to
0 and a SD equal to 1, and is denoted by N(0, 1).
$$Z = \frac{x - \mu}{\sigma}$$
• Z represents the Z-score for a given x value
Some Useful Tips
b) What is the probability that -1.96 < z < 1.96?
c) What is the probability that z > 1.96?
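Questions b) and c) can be answered without a printed Z table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

# b) Area between -1.96 and 1.96
p_between = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# c) Area to the right of 1.96 (upper tail)
p_above = 1 - std_normal.cdf(1.96)

print(f"P(-1.96 < z < 1.96) = {p_between:.3f}")
print(f"P(z > 1.96) = {p_above:.3f}")
```

The two results are approximately 0.95 and 0.025, which is why 1.96 is the critical value for a two-sided test at α = 0.05.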
Exercise
Example:
• The diastolic blood pressures of males 35–44 years
of age are normally distributed with µ = 80 mm Hg
and σ² = 144 (mm Hg)²
[σ = 12 mm Hg].
• Individuals with BP above 95 mm Hg are
considered to be hypertensive
b. What is the probability that a randomly
selected male has a DBP above 110 mm Hg?
$$Z = \frac{110 - 80}{12} = 2.50$$
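The Z-score above can be turned into a probability with the standard library. This is a minimal sketch assuming the stated parameters (µ = 80 mm Hg, σ = 12 mm Hg); it also evaluates the 95 mm Hg hypertension cut-off mentioned in the example.

```python
from statistics import NormalDist

# Diastolic blood pressure of males aged 35-44: mean 80, SD 12 (mm Hg)
dbp = NormalDist(mu=80, sigma=12)

z_110 = (110 - dbp.mean) / dbp.stdev   # Z-score for 110 mm Hg
p_above_110 = 1 - dbp.cdf(110)         # P(DBP > 110)
p_above_95 = 1 - dbp.cdf(95)           # P(DBP > 95), the hypertension cut-off

print(f"Z = {z_110:.2f}, P(DBP > 110) = {p_above_110:.4f}")
print(f"P(DBP > 95) = {p_above_95:.4f}")
```

Under these assumptions, roughly 0.6% of men in this age group have a DBP above 110 mm Hg and roughly 10.6% exceed 95 mm Hg.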
1. Student t-distribution
2. F-distribution
3. χ² (chi-square) distribution
• Researchers often use sample survey methodology to
obtain information about a larger population by
selecting and measuring a sample from that population.
[Figure: diagram relating the population, the sample, and the information obtained from the sample]
Steps needed to select a sample and ensure that
this sample will fulfill its goals.
2. Define the target population
– To ensure that the requirements are operationally sound, the necessary data
terms and definitions also need to be determined.
5. Decide on the methods of measurement
6. Prepare the sampling frame
– List of all members of the population
– The elements must not overlap
Sampling
2) Non-sampling error:
- Observational error
- Respondent error
- Lack of preciseness of definition
- Errors in editing and tabulation of data
Example
• Suppose your school has 500 students and
you need to conduct a short survey on the
quality of the food served in the cafeteria.
• Ignore all random numbers greater than 500 because they do
not correspond to any of the students in the school.
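In software, the random-number-table procedure can be replaced by drawing directly from the list of student numbers. The sketch below assumes the 500 students are numbered 1 to 500; the sample size of 50 and the seed are hypothetical values chosen only for illustration.

```python
import random

rng = random.Random(2024)        # fixed seed, only so the sketch is reproducible

student_ids = range(1, 501)      # sampling frame: students numbered 1..500
sample_size = 50                 # hypothetical sample size, not from the slides

# Simple random sample without replacement: every student has the same
# chance of selection and no student can be chosen twice
srs = rng.sample(student_ids, sample_size)

print(sorted(srs)[:10])
```

Because the draw is restricted to 1..500, there are no out-of-range numbers to discard, which is the software equivalent of ignoring table entries greater than 500.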
2. Systematic random sampling
• Sometimes called interval sampling,
systematic sampling means that there is a gap,
or interval, between each selected unit in the
sample
3. Select a number between one and K at random. This number is called the
random start and would be the first number included in your sample.
• Therefore, K = 4.
• You will need to select one unit out of every four units to
end up with a total of 100 units in your sample.
• Using the above example, you can see that
with a systematic sample approach there are
only four possible samples that can be
selected, corresponding to the four possible
random starts:
A. 1, 5, 9, 13...393, 397
B. 2, 6, 10, 14...394, 398
C. 3, 7, 11, 15...395, 399
D. 4, 8, 12, 16...396, 400
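The four possible samples listed above can be generated programmatically. This sketch assumes a population of N = 400 units and a desired sample of n = 100, which is what the interval K = 4 and the lists above imply; `systematic_sample` is an illustrative helper name.

```python
import random

N = 400          # population size (assumed from the lists above)
n = 100          # desired sample size
K = N // n       # sampling interval

def systematic_sample(start, population_size, interval):
    """Return the unit numbers start, start + K, start + 2K, ... up to N."""
    return list(range(start, population_size + 1, interval))

# The random start is a number between 1 and K chosen at random
random_start = random.randint(1, K)
sample = systematic_sample(random_start, N, K)

print(f"K = {K}, random start = {random_start}, sample size = {len(sample)}")
print(sample[:4], "...", sample[-2:])
```

Whatever the random start, the sample contains exactly 100 units; a start of 1 reproduces list A and a start of 4 reproduces list D.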
3. Stratified random sampling
• Another drawback to cluster sampling is that you do not have total control
over the final sample size.
5. Multi-stage sampling
• Similar to the cluster sampling.
• In the first stage, large groups or clusters are identified and selected.
• In the second stage, population units are picked from within the
selected clusters (using any of the possible probability sampling
methods) for a final sample.
• Also, you do not need to have a list of all of the units in the
population. All you need is a list of clusters and list of the units
in the selected clusters.
B. Non-probability sampling
• The difference between probability and non-probability
sampling has to do with a basic assumption about the
nature of the population under study.
The most common types of non-probability sampling
2. Volunteer sampling
• As the term implies, this type of sampling occurs
when people volunteer to be involved in the study.
• Sampling voluntary participants as opposed to
the general population may introduce strong
biases.
4. Quota sampling
5. Snowball sampling
• A technique for selecting a research sample
where existing study subjects recruit future
subjects from among their acquaintances.
Estimation
Thus,
– A point estimate is of the form: [ Value ],
– Whereas, an interval estimate is of the form:
[ lower limit, upper limit ]
Point estimate (sample statistic)     Population parameter
Sample mean, x̄                        µ
Sample variance, S²                   σ²
Sample proportion, p                  P or π
Sample odds ratio, ÔR                 OR
Sample relative risk, R̂R              RR
Sample correlation coefficient, r     ρ
CIs can also answer the question of whether or not an association exists
• Confidence Level
– The degree of confidence that the interval contains the
unknown population parameter
• P (L, U) = (1 - α)
a. $$1.52 \pm 1.96\sqrt{\frac{2.25}{20}} = 1.52 \pm 1.96(0.33) = 1.52 \pm 0.65 = (0.87, 2.17)$$
b. $$1.52 \pm 1.96\sqrt{\frac{2.25}{32}} = 1.52 \pm 1.96(0.27) = 1.52 \pm 0.53 = (0.99, 2.05)$$
c. A larger sample size makes the CI
narrower (more precise).
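The two intervals can be recomputed as follows. This is a sketch assuming a sample mean of 1.52, a known population variance of 2.25 and the 95% critical value 1.96, as in parts a and b; `z_interval` is an illustrative helper name.

```python
from math import sqrt

def z_interval(mean, variance, n, z=1.96):
    """Confidence interval for a mean when the population variance is known."""
    standard_error = sqrt(variance / n)
    margin = z * standard_error
    return mean - margin, mean + margin

ci_20 = z_interval(1.52, 2.25, 20)   # part a: n = 20
ci_32 = z_interval(1.52, 2.25, 32)   # part b: n = 32

print(f"n = 20: ({ci_20[0]:.2f}, {ci_20[1]:.2f})")
print(f"n = 32: ({ci_32[0]:.2f}, {ci_32[1]:.2f})")
```

Without intermediate rounding the limits are about (0.86, 2.18) and (1.00, 2.04); they differ from the hand calculation in the second decimal only because the standard error was rounded to two decimals there. The interval for n = 32 is narrower, as stated in part c.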
B. Unknown variance (small sample size, n ≤ 30)
• What if the σ for the underlying population is
unknown and the sample size is small?
• Standard error =
• t-value at 90% CL at 19 df =1.729
OR
1. One population
• Indication of equality (either =, ≤ or ≥) must appear
in Ho.
Ho: μ = μo, HA: μ ≠ μo
Ho: P = Po, HA: P ≠ Po
• Can we conclude that a certain population mean is
– not 50?
Ho: μ = 50 and HA: μ ≠ 50
– greater than 50?
Ho: μ ≤ 50 HA: μ > 50
Action (Conclusion)     Reality: Ho True     Reality: Ho False
Type I & II Error Relationship
G. Statistical decision
We reject the Ho because Z = -2.12 is in the rejection region. The
result is significant at α = 0.05.
H. Conclusion
We conclude that µ is not 30. P-value = 0.0340
A Z value of -2.12 corresponds to an area of 0.0170. Since there are two
parts to the rejection region in a two tail test, the P-value is twice this which
is .0340.
• Rejection Region
• With α = 0.05 and the inequality, we have the entire rejection region at the
left. The critical value will be Z = -1.645. Reject Ho if Z < -1.645.
• Statistical decision
– We reject the Ho because -2.12 < -1.645.
• Conclusion
– We conclude that µ < 30.
– p = .0170 this time because it is only a one tail test and not a two tail test.
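The P-values and the one-tailed critical value quoted above can be reproduced from the standard normal distribution. The sketch below takes the test statistic Z = -2.12 as given and uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()   # standard normal: mean 0, SD 1
z_stat = -2.12              # test statistic from the example
alpha = 0.05

# One-tailed (left) P-value: area to the left of the test statistic
p_one_tailed = std_normal.cdf(z_stat)

# Two-tailed P-value: the same area counted in both tails
p_two_tailed = 2 * p_one_tailed

# Critical value for a left-tailed test at alpha = 0.05
critical_left = std_normal.inv_cdf(alpha)

print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
print(f"left-tail critical value = {critical_left:.3f}")
```

Both P-values are below 0.05, and -2.12 lies to the left of the critical value of about -1.645, so Ho is rejected in either version of the test.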
• Test statistic
• If the assumptions are correct and Ho is true, the test statistic follows
Student's t distribution with 13 degrees of freedom.
H0 : p = 0.014
HA: p ≠ 0.014
• P-value = 0.2548
• Suppose d=1
• Then the sample size increases
Sample Size: Two Samples
– for most statistics the skewness assumption is more important than the
kurtosis assumption