
IMG STUDY GROUP

Epidemiology & Biostatistics


Dr. Shakeel Ahmed

Epidemiology
Epidemiology is the study of the distribution and determinants of health-related states and events in
specified populations, and the application of this study to the control of health problems.
• By Distribution we mean Time – Place – Person
• By Determinants we mean causes and factors that influence the risk of disease.

Epidemiological Studies
A) Descriptive Studies
1 - Case report and case series:
Results are not representative of the population. Often the first step in a field or community investigation.
2 - Studies with secondary data:
We use data that already exist to investigate and answer questions for which the original study was not
designed. The original study could be cross-sectional, case-control, cohort, or clinical trial.
These data could be individual data or grouped data.
Individual data: data for every member of a previous study, e.g. clinical history or death certificate.
Grouped data (ecological study): data for groups rather than individuals. Ecological studies
compare the frequency of events in different groups. Caution is required in drawing conclusions and
identifying associations, because an association seen at group level may not hold for individuals.

B) Observational Studies
Analytic (etiologic) studies are performed to test a specific hypothesis about a specific health
problem. In general, associations observed in descriptive studies are the basis for gathering more
specific data and testing hypotheses in additional studies.

a) Cohort Study (Prospective)


Starts with a group of people (a cohort) all considered to be free of a given disease. Information is
obtained to determine persons having a particular characteristic (certain exposure) that is suspected of
being related to the development of disease being investigated. These individuals are then followed for
a period of time to observe who develops or dies from the disease. Finally, incidence or death rates for the
disease are calculated.

b) Cohort Study (Retrospective)


The period of observation starts from some date in the past. These studies usually involve specially exposed
groups or industrial populations and are done using company records of past and present employees.
Information collected: date of employment, date of departure, duration and degree of exposure, and status
(living/dead).
• Incidence Rate is the number of new cases divided by the total exposed (or non-exposed) population
• Relative Risk is the ratio of the incidence rate with the risk factor to the incidence rate without it
• Attributable Risk is the difference between the incidence rate with the risk factor and the incidence rate without it
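
The three cohort measures above can be sketched in a few lines of Python. This is only an illustration: the function name, the 2x2 cell labels and the counts are assumptions made for the example, not figures from these notes.

```python
def cohort_measures(a, b, c, d):
    """Cohort 2x2 table (labels assumed for this sketch):
    a = exposed, diseased      b = exposed, not diseased
    c = unexposed, diseased    d = unexposed, not diseased
    """
    incidence_exposed = a / (a + b)        # new cases / total exposed
    incidence_unexposed = c / (c + d)      # new cases / total non-exposed
    relative_risk = incidence_exposed / incidence_unexposed
    attributable_risk = incidence_exposed - incidence_unexposed
    return incidence_exposed, incidence_unexposed, relative_risk, attributable_risk


# Hypothetical cohort: 100 exposed (30 cases) and 100 unexposed (10 cases)
ie, iu, rr, ar = cohort_measures(a=30, b=70, c=10, d=90)
print(f"Incidence exposed={ie:.2f}, unexposed={iu:.2f}, RR={rr:.1f}, AR={ar:.2f}")
```

With these made-up counts the exposed group has three times the incidence of the unexposed group (RR of about 3) and an excess incidence of about 0.20.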

Advantages
• Correct classification of exposure before disease develops.
• Permits calculation of incidence rates.
• A direct measure of relative risk and attributable risk.
• Many possible outcomes of the same exposure can be studied.
• Suitable for studying the effect of rare exposures.
• Accurate.
Disadvantages
• Not suitable for rare diseases.
• Time consuming (follow-up).
• Losing people in follow-up (attrition).
• Expensive.
• Status of subjects may change, leading to error in classification of exposure, e.g. change in habit or occupation.
• Administrative problems: loss of staff, funding, high costs of study.

Case Control Study


• Odds Ratio: OR = AD / BC
A measure of the strength of association between the risk factor and the disease, calculated as the odds of
exposure among the cases divided by the odds of exposure among the controls.
• Kappa Coefficient = (Observed Agreement - Chance Agreement) / (1 - Chance Agreement)
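
A minimal Python sketch of both formulas follows. The cell layout (A = exposed cases, B = exposed controls, C = unexposed cases, D = unexposed controls) is the conventional one and is assumed here, as are the two-rater yes/no layout for kappa and all counts.

```python
def odds_ratio(a, b, c, d):
    """OR = AD / BC, with a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls (assumed layout)."""
    return (a * d) / (b * c)


def kappa(both_yes, r1_yes_r2_no, r1_no_r2_yes, both_no):
    """Kappa for two raters giving yes/no ratings on the same subjects."""
    total = both_yes + r1_yes_r2_no + r1_no_r2_yes + both_no
    observed = (both_yes + both_no) / total
    r1_yes = (both_yes + r1_yes_r2_no) / total   # rater 1 "yes" proportion
    r2_yes = (both_yes + r1_no_r2_yes) / total   # rater 2 "yes" proportion
    # Agreement expected by chance: both say yes or both say no
    chance = r1_yes * r2_yes + (1 - r1_yes) * (1 - r2_yes)
    return (observed - chance) / (1 - chance)


# Hypothetical case-control counts
or_value = odds_ratio(a=40, b=20, c=60, d=80)
# Hypothetical agreement table: 80 of 100 ratings agree
kappa_value = kappa(both_yes=40, r1_yes_r2_no=10, r1_no_r2_yes=10, both_no=40)
print(f"OR={or_value:.2f}, kappa={kappa_value:.2f}")
```

In the agreement example the raters agree 80% of the time, but half of that is expected by chance, so kappa is about 0.6.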

Cross-Sectional Studies / Prevalence Survey


An observational study based on a single examination of a cross-section of the population at one point
in time, also known as a prevalence study. Useful for chronic conditions.
It measures exposure and outcome at the same time.
Advantages: simple, takes a short time, and the prevalence rate can be measured.
Disadvantages: not appropriate for studying rare diseases or events of short duration. Cannot establish the
temporal relationship between exposure and health outcome.

Risk Reduction (Case Control Study)


• Control Event Rate (CER) = events in control group / total in control group
• Experimental Event Rate (EER) = events in experimental group / total in experimental group
• Absolute Risk Reduction (ARR) = CER - EER
• Relative Risk Reduction (RRR) = (CER - EER) / CER
• Number Needed to Treat (NNT) = 1 / ARR
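
The risk-reduction formulas chain together, which a short Python sketch makes explicit. The function name and the event counts are hypothetical, chosen only to give round numbers.

```python
def risk_reduction(control_events, control_total, treated_events, treated_total):
    cer = control_events / control_total   # Control Event Rate
    eer = treated_events / treated_total   # Experimental Event Rate
    arr = cer - eer                        # risk reduction as a difference
    rrr = arr / cer                        # risk reduction relative to CER
    nnt = 1 / arr                          # number needed to treat
    return cer, eer, arr, rrr, nnt


# Hypothetical trial: 20/100 events in controls, 10/100 in the treated group
cer, eer, arr, rrr, nnt = risk_reduction(20, 100, 10, 100)
print(f"CER={cer:.2f} EER={eer:.2f} ARR={arr:.2f} RRR={rrr:.0%} NNT={nnt:.0f}")
```

Here the treatment halves the event rate (RRR 50%), yet about 10 people must be treated to prevent one event, because the absolute difference is only 0.10.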
Disease Prevention
Primary prevention: Prevention of a disease before it has been able to occur e.g., HPV vaccination
Secondary prevention: Early detection of disease by screening while it is still curable. e.g. Pap smear.
Tertiary prevention: Reduce disease progression and its complications / disability e.g., chemotherapy

Screening and Diagnostic Tests


Screening tests can often also be used as diagnostic tests. Diagnosis involves confirmation of the
presence or absence of disease in someone suspected or at risk for having a disease. Tests can be in the
form of: Questions, Examinations, Laboratory tests, X-Rays
Criteria for population-based screening include: knowledge of the disease, feasibility of screening
procedures, diagnosis and treatment, and cost considerations.
Characteristics of a screening test are validity, reliability, and accuracy.
Validity is the trueness of test measurements: how accurately a test measures the desired value.
Systematic error reduces the accuracy of a test.

Reliability is the consistency and reproducibility of a test and the absence of random variation: the ability
of a test or combination of tests to give consistent results in repeated applications, whether correct or
incorrect. It shows how precise the test is. Random error reduces the precision of a test.
Example: one nurse making repeated blood pressure measurements on an individual (consistency of the test), or
ten different nurses measuring the blood pressure of the same individual (consistency between the persons performing the test).

Accuracy; It is the proportion of true test results (TP+TN) among all test results = (a+d) / (a+b+c+d)

The Gold Standard


"The Gold Standard" refers to a disease diagnosing criteria by which scientific evidence is evaluated. An
ideal "Gold Standard" test has 100% sensitivity and specificity. In real practice there is sometimes no
true "gold standard" test. Example: serum ferritin is a very sensitive test to screen for iron deficiency
anemia; however, the gold standard is bone marrow biopsy.
Sensitivity = TP / (TP + FN)
Likelihood of a positive test result in diseased people. The FN rate determines the test's sensitivity.
A test with 100% sensitivity has no false negatives, so a normal test result excludes disease (it must be a TN).
A positive test result includes all people with disease but does not confirm disease: the result could be a TP or an FP.
Such a test is used most often as a screening test.
Specificity = TN / (TN + FP)
Likelihood of a negative test result in healthy people. The FP rate determines the test's specificity.
A test with 100% specificity has no false positives, so a positive test result confirms disease (it must be a TP).
A negative test result does not exclude disease: the result could be a TN or an FN.
• False Positive Rate = FP / (FP + TN), i.e. false positive results among all healthy people
• False Negative Rate = FN / (FN + TP), i.e. false negative results among all diseased people
• Prevalence = (TP + FN) / (TP + FN + TN + FP), i.e. people with disease as a proportion of the total population

Negative Predictive Value (NPV) = TN / (TN + FN)
Likelihood that a negative test result excludes the disease. NPV best reflects the FN rate of a test.
Tests with 100% sensitivity (no false negatives) always have an NPV of 100%.
Positive Predictive Value (PPV) = TP / (TP + FP)
Likelihood that a positive test result confirms the disease. PPV best reflects the FP rate of a test.
Tests with 100% specificity (no false positives) always have a PPV of 100%.
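
All of the test characteristics above come from the same four counts, so they can be computed together. The Python sketch below uses a hypothetical function name and invented counts; it is an illustration of the formulas, not a validated tool.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Screening-test measures from a 2x2 table of test result vs. true status."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / total,
        "prevalence": (tp + fn) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Hypothetical counts: 100 diseased and 200 healthy people tested
m = diagnostic_metrics(tp=90, fp=40, fn=10, tn=160)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Note that the false positive rate is 1 - specificity and the false negative rate is 1 - sensitivity, which the output confirms for these counts.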

Likelihood Ratio (LR+) = sensitivity / (1 - specificity)


The Likelihood Ratio (LR) is the likelihood that a given test result would be expected in a patient with the
target disorder compared to the likelihood that that same result would be expected in a patient without
the target disorder. The LR is used to assess how good a diagnostic test is.
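
As a small illustration, the positive likelihood ratio can be computed alongside its negative counterpart, LR- = (1 - sensitivity) / specificity, which is the standard definition rather than something stated in these notes. The sensitivity and specificity values are hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    lr_positive = sensitivity / (1 - specificity)
    # LR- is the standard companion formula (not given in the notes)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Hypothetical test with 90% sensitivity and 80% specificity
lr_pos, lr_neg = likelihood_ratios(0.90, 0.80)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
```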

Advantages over Sensitivity and Specificity


• Less likely to change with the prevalence
• Can be calculated for several levels of the symptom/sign or test
• Can be used to combine the results of multiple diagnostic tests
• Can be used to calculate the post-test probability for a target disorder

Strength of the test by likelihood ratio (LR+): Excellent 10, Good 5, Useless 1

Pretest and Post-test Probability


After the serum ferritin test is done and your patient is found to have a result of 60 mmol/l, the
post-test probability of your patient having iron deficiency anaemia is increased to 86 per cent, which
suggests that serum ferritin is a worthwhile diagnostic test.

Bayes' Theorem: Post-test Odds = Pre-test Odds x Likelihood Ratio
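
Bayes' theorem works on odds, so a probability has to be converted to odds and back. The sketch below shows the three steps in Python. The inputs (a pre-test probability of 50% and a likelihood ratio of 6) are assumptions chosen for illustration; the notes do not state which pre-test probability or LR lie behind the ferritin example.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio   # Bayes' theorem
    return post_test_odds / (1 + post_test_odds)        # odds back to probability


# Assumed inputs: pre-test probability 0.50, LR = 6
p_post = post_test_probability(0.50, 6)
print(f"Post-test probability = {p_post:.0%}")   # 6/7, about 86%
```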

BIOSTATISTICS
1. Descriptive statistics:
Provide accurate description, presentation and organization of data. Examples:
• Numbers/calculations used to summarize/describe set of data
• allow for quick analysis of your data by tabulating & graphing
• various indices (mean, variance)
• basis for inferential statistics
2. Inferential Statistics:
Generalizing from the group you have sampled to the population from which the sample was drawn.
We want to infer something about the population from the data we have collected and analyzed. Examples:
• Poll to look at political party affiliation
• Treatment effect
• Educational intervention
• Relationship between exercise and health
Population:
All possible observations/measurements of interest, may be hypothetical (and infinite)
Example: all 21 yr old males, all 21 yr old Canadian males
Population Parameter
• numeric measures of the population
• mean, standard deviation, proportion
• usually not known because all observations cannot be made
Sample:
A portion of the population that is observed or measured
• subset of population
• representative sample allows for generalization
• size of sample is important, but not as important as representativeness
• size of a sample can never compensate for a biased sample
• biased = non-representative (selection bias)
Sample statistic
Any numeric measure computed from a subset (sample) of the population is called Sample Statistic
In research, we use the sample statistic to estimate the population parameter.

VARIABLE
• Inferential and descriptive statistics are concerned with variables
• Anything that can vary (organism, environment, experimental treatment/situation)
• Anything that is being observed, measured, categorized or manipulated
Examples: gender, GPA, blood pressure, "satisfaction with life", number of cigarettes smoked per week,
experimental treatment

TYPES OF VARIABLES
Independent Variable vs. Dependent Variable
Discrete Variable vs. Continuous Variable
Qualitative Variable vs. Quantitative Variable
The type of variable determines what kind of statistical procedure may be performed

Independent Variable
• the intervention
• what is being manipulated by the experimenter
• the variable the experimenter changes
• the experimenter’s variable of interest
• the “cause” or “what is responsible” for the observed effect
• must sometimes rely on existing variation (gender, lung cancer)
Example: treatment (drug, placebo), educational intervention (PBL, lecture)
Dependent Variable
• that which depends on the independent variable
• that which the independent variable influences
• outcome of interest
• changes in response to the independent variable

Discrete vs. Continuous


Discrete variables
• Categorical data ; gender, treatment, nationality, marital status, blood type
• data that can only have whole numbers; hospitalizations, children, traffic violations

Continuous Variable
Data may take any value within a defined range; Blood pressure, height, weight, lung capacity

QUALITATIVE DATA (Nominal vs. Ordinal)


Nominal: (Categorical Data)
Named categories with no particular order; an "existential" variable: the property exists or it doesn't.
Examples: blood type (A, B, O, AB), gender (male, female), university (McMaster, UofT, Western,
Windsor), marital status (single, common law, married, divorced, widowed)
Ordinal:
Ordered categories; the difference from one level to the next is meaningful, but differences between
categories are not considered equal.
Example: stages of cancer (I, II, III, IV), Likert-type rating scale (1, 2, 3, 4, 5, 6, 7) where numbers are
accompanied by a verbal anchor

QUANTITATIVE DATA (Continuous vs. Discrete)


Continuous (numeric data), e.g. weight, height, age, BP
The t-test and ANOVA are used to compare continuous data.
Discrete, e.g. number of persons, pulse, number of cancer patients
Qualitative Variables;
NOMINAL (non-numeric data), e.g. sex, blood group, days, countries
ORDINAL (for ranking), e.g. social classes, staging of breast cancer, grades (A, B, C etc.)

Quantitative data;
Interval
• ordered numbers along a continuum
• differences between the numbers are meaningful
• equal distance between each value
• no meaningful zero point; it is arbitrary
• cannot be turned into a ratio or percent
Example: IQ, temperature (C, F). A person with an IQ of 100 is not twice as smart as a
person with an IQ of 50. A temperature of 20°C is not twice as hot as 10°C.
Ratio
• ordered along a continuum
• differences between numbers are meaningful
• equal distance between each value
• meaningful zero point
• can be converted to a ratio or percent
Example: weight (lbs, kg), height (m, ft). A weight of 200 lbs is twice as heavy as 100 lbs.
A woman 6 ft tall is 1.5 times as tall as a woman of 4 ft.

Measures of Central Tendency


The central tendency measurement indicates middle location of data set or where most of the data fall.

Mean
The mean is used with interval and ratio data, and sometimes with ordinal data (e.g. rating/Likert scales). It is the
most familiar and widely used measure and reflects all the data.
It is the sum of the values divided by the total number of values.
Example (n = 11): 8 3 6 4 11 2 9 4 10 11 4
Sample mean: x̄ = ΣX / n = (8 + 3 + 6 + 4 + ... + 11 + 4) / 11 = 72 / 11 = 6.55
Population mean: μ = ΣX / N
Least-squares criterion: the sum of deviations around the mean is zero.
Σ(X - mean) = 0: (8 - 6.55) + ... + (4 - 6.55) = 1.45 + ... + (-2.55) = 0

Median
The Middle point that divides the data into 2 equal sets after arrangement in order. Median is used for
ordinal, interval and ratio data. It reflects 50th percentile. It is not influenced by extreme values
Example (odd): N=11 8 3 6 4 11 2 9 4 10 11 4
Step 1: place in order: 2 3 4 4 4 6 8 9 10 11 11
Step 2: (N+1)/2= 6th position; Step 3: 6th position value = 6
Note: Median would remain unchanged if the lowest value was changed to 0 or the highest to 22!
Example 2 (even): N=12 8 3 6 4 11 2 9 4 10 11 4 9
Step 1: 2 3 4 4 4 6 8 9 9 10 11 11
Step 2: (N+1)/2 = 6.5, so the median lies between the 6th and 7th positions
Step 3: the median is the arithmetic mean of the 6th and 7th values: (6+8)/2 = 7

Mode
The most frequently occurring value. Used primarily for nominal (and ordinal) data; it can also be used for
continuous data, which are usually grouped to calculate the modal group. There may be more
than one mode, or none.
Examples: modal world nationality = Chinese; modal gender = female
Example : 8 3 6 4 11 2 9 4 10 11 4 Mode = 4
Note: if the highest or lowest value was changed drastically, it would have no effect on the mode
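
The three measures of central tendency can be checked against the worked examples with Python's standard `statistics` module. The data are the 11 values used in the examples; the variable names are arbitrary.

```python
import statistics

data = [8, 3, 6, 4, 11, 2, 9, 4, 10, 11, 4]

mean = statistics.mean(data)       # 72 / 11, about 6.55
median = statistics.median(data)   # 6th of 11 ordered values
mode = statistics.mode(data)       # most frequent value

# Even-sized example from the notes: one extra value of 9
median_even = statistics.median(data + [9])   # mean of 6th and 7th values

# Replacing the highest value (11 -> 22) leaves the median unchanged
extreme = [22 if x == 11 else x for x in data]
median_extreme = statistics.median(extreme)

print(f"mean={mean:.2f} median={median} mode={mode} "
      f"median_even={median_even} median_extreme={median_extreme}")
```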

Measures of Dispersion; Range, Interquartile Range, Variance, Standard Deviation


Dispersion indicates how tightly/loosely data clusters around central tendency and the variability.

Range
It is used with ordinal, interval & ratio and calculated as difference between largest & smallest value
It is entirely dependent on the most extreme scores. Outliers/extreme scores have large effect on range.
Example: 8 3 6 4 11 2 9 4 10 11 4; Range: 11-2 = 9
Example: 8 3 6 4 11 2 9 4 10 19 4; Range: 19-2 = 17
The data are identical except for one point.

Interquartile Range (mid-spread)
It is used for ordinal, interval and ratio data and comprises the middle 50% of the data: the difference
between the 75th and 25th percentiles. The IQR is not influenced by extreme scores because it disregards the
lowest and highest quarters of the data.
Example (N=10): 42 43 45 47 48 49 51 53 53 54
Step 1: calculate Q1 (median of lower half); Step 2: calculate Q3 (median of upper half); Step 3: IQR = Q3 - Q1
Q1 = 45 (42 43 45 47 48); Q3 = 53 (49 51 53 53 54); IQR = 53 - 45 = 8
Example: 42 43 45 47 48 49 51 53 53 54; IQR: 53-45 = 8; Range: 54-42 = 12
Example: 42 43 45 47 48 49 51 53 53 64; IQR: 53-45 = 8; Range: 64-42 = 22
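
The median-of-halves procedure used in these examples can be sketched as follows. The helper name is hypothetical, and other quartile conventions (such as the interpolating methods used by many statistics packages) can give slightly different Q1 and Q3 values.

```python
import statistics


def range_and_iqr(values):
    """Range and IQR using the median-of-halves method from the notes."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]      # lower half (middle value excluded if n is odd)
    upper = ordered[-half:]     # upper half
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return ordered[-1] - ordered[0], q1, q3, q3 - q1


original = [42, 43, 45, 47, 48, 49, 51, 53, 53, 54]
with_outlier = [42, 43, 45, 47, 48, 49, 51, 53, 53, 64]

rng1, q1_a, q3_a, iqr1 = range_and_iqr(original)
rng2, q1_b, q3_b, iqr2 = range_and_iqr(with_outlier)
print(f"original:     range={rng1}, Q1={q1_a}, Q3={q3_a}, IQR={iqr1}")
print(f"with outlier: range={rng2}, Q1={q1_b}, Q3={q3_b}, IQR={iqr2}")
```

The single extreme value changes the range from 12 to 22 while the IQR stays at 8.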

Variance:
Variance is the extent to which observations vary around the mean: the larger the deviations, the
greater the variance. The sum of deviations around the mean is always zero, because the positive differences
cancel out the negative differences.
• the sum of deviations from the mean therefore cannot be used as an index of variability
• the absolute values could be used, but absolute values cannot be manipulated algebraically
• this is overcome by squaring the deviations from the mean
• as a result, the unit of measurement is squared and no longer the same as that of the data

Variance (population): σ² = Σ(X - μ)² / N
Variance (sample): s² = Σ(X - x̄)² / (n - 1). As n increases, the sample variance approximates the population variance.
Example (N = 11): s² = Σ(X - mean)² / (n - 1)
NB: in a statistics package or spreadsheet, the default is usually the sample variance.
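
To make the population/sample distinction concrete, the sketch below computes both variances with Python's `statistics` module. It assumes the N = 11 example refers to the same 11 values used earlier for the mean, which the notes do not state explicitly. The square root of the sample variance is included because it returns the result to the original unit of measurement.

```python
import statistics

# Assumption: the same 11 values as in the mean/median examples
data = [8, 3, 6, 4, 11, 2, 9, 4, 10, 11, 4]

mean = statistics.mean(data)
sum_of_deviations = sum(x - mean for x in data)         # cancels to (almost) zero
sum_of_squares = sum((x - mean) ** 2 for x in data)

population_variance = statistics.pvariance(data)   # divides by N
sample_variance = statistics.variance(data)        # divides by n - 1
sample_sd = statistics.stdev(data)                 # square root of sample variance

print(f"sum of deviations ~ {sum_of_deviations:.10f}")
print(f"population variance = {population_variance:.2f}")
print(f"sample variance = {sample_variance:.2f}, SD = {sample_sd:.2f}")
```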

Analysis of Variance (ANOVA)


Compares mean values from three or more groups simultaneously, considering one or more factors.
• One-way ANOVA compares the groups considering one factor
• Two-way ANOVA compares the groups considering two factors
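
A one-way ANOVA boils down to an F statistic: the variability between group means divided by the variability within groups. The Python sketch below computes it by hand for three small hypothetical groups. It stops at F and does not compute a p-value, which would require the F distribution.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA (one factor, k groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group means vs. grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: each value vs. its own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical measurements in three groups
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F = {f_stat:.2f}")
```

A large F suggests that at least one group mean differs from the others; identical group means would give F = 0.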

Standard Deviation
• take the square root of the variance
• returns value to the original unit of measurement
• Easier to interpret
• Average deviation from the mean
• How much scores vary on average

HYPOTHESIS
Null Hypothesis (Ho)
Hypothesis of no difference, there is no association between the Disease & Risk factor in the population
Alternate Hypothesis (H1)
Hypothesis that there is some difference, there is some association between the Disease & the Risk
factor in the population.

Type I Error (α) “False-Positive Error"


Stating that there is an effect or difference when none exists (mistakenly rejecting the null hypothesis
and accepting the experimental hypothesis). The p-value is judged against a preset level of significance α
(usually .05), which is the accepted probability of making a Type I error. If p < .05, a result at least this
extreme would be expected less than 5% of the time if there were really no effect.

Type II Error (β) “False-Negative Error"
Stating that there is no effect or difference when one exists (failing to reject the null hypothesis
when it is in fact false). β is the probability of making a Type II error.

The Power
Power is the probability of rejecting the null hypothesis when it is in fact false, i.e. the
likelihood of finding a difference if one in fact exists (Power = 1 - β). Power depends upon:
1. Total number of end points experienced by the population
2. Difference in compliance between treatment groups (the mean values between groups)
3. Size of the expected effect
Power ∝ sample size, and errors ∝ 1 / sample size
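
The relationship between power and sample size can be illustrated numerically. The sketch below assumes a two-sided one-sample z-test with known standard deviation, a setting chosen only because its power has a simple closed form; the notes do not specify a test, and the effect size, SD and sample sizes are hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect, sd, n, alpha=0.05):
    """Power of a two-sided one-sample z-test (assumed setting, known SD)."""
    standard_normal = NormalDist()
    z_crit = standard_normal.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    shift = effect * n ** 0.5 / sd                    # true effect in SE units
    # Probability that the test statistic falls in either rejection region
    return standard_normal.cdf(-z_crit + shift) + standard_normal.cdf(-z_crit - shift)


# Hypothetical true difference of 5 units with SD 10
power_small = z_test_power(effect=5, sd=10, n=10)
power_large = z_test_power(effect=5, sd=10, n=40)
power_no_effect = z_test_power(effect=0, sd=10, n=40)   # equals alpha

print(f"n=10: {power_small:.2f}, n=40: {power_large:.2f}, "
      f"no effect: {power_no_effect:.2f}")
```

Quadrupling the sample size raises the power substantially, and when there is no true effect the rejection probability falls back to α, the Type I error rate.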

Confidence Interval (CI) = range from [mean - Z(SEM)] to [mean + Z(SEM)]


The range of values within which the means of repeated samples would be expected to fall with a specified
probability. If the 95% CI for a mean difference between 2 groups includes 0, then there is no significant
difference and the null hypothesis is not rejected. If the 95% CI for an odds ratio or relative risk includes 1,
the null hypothesis is not rejected. If the CIs of 2 groups overlap, the groups are generally taken to be not
significantly different. The 95% CI (corresponding to p = .05) is often used. For the 95% CI, Z = 1.96.
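
A confidence interval for a mean can be computed directly from the formula in the heading. The sketch below assumes SEM = sample SD / √n (the usual definition, which the notes do not spell out) and reuses the 11 values from the earlier examples purely as illustrative data.

```python
import statistics


def mean_confidence_interval(sample, z=1.96):
    """CI = mean ± z * SEM, with SEM assumed to be sample SD / sqrt(n)."""
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - z * sem, mean + z * sem


data = [8, 3, 6, 4, 11, 2, 9, 4, 10, 11, 4]
low, high = mean_confidence_interval(data)
print(f"95% CI for the mean: {low:.2f} to {high:.2f}")
```

With only 11 observations a t-based multiplier would normally be preferred over Z = 1.96; the z version is shown because it matches the formula given here.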

Lead Time Bias: overestimation of survival time due to earlier detection rather than improved treatment.
Chi-squared Test
A statistical test commonly used to compare observed data with the data we would expect to obtain
according to a specific hypothesis. It requires counts (numerical frequencies), not percentages or ratios.
It is designed to test the correspondence between a theoretical frequency distribution and an observed
frequency distribution of categorical data. For example, if one sample of 20 patients is 30% hypertensive and a
comparison group of 25 patients is 60% hypertensive, a chi-squared test can be used to determine whether this
difference is greater than might be expected due to chance alone.
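
The hypertension example can be worked through in Python. The percentages are first converted back to counts (30% of 20 = 6, 60% of 25 = 15), since the test needs frequencies. The sketch computes the uncorrected Pearson chi-squared statistic and compares it with 3.841, the standard critical value for 1 degree of freedom at the .05 level; no continuity correction is applied, which is a simplifying assumption.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # count expected by chance
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows: sample 1 (n=20), sample 2 (n=25); columns: hypertensive, not hypertensive
observed = [[6, 14], [15, 10]]
chi2 = chi_squared_statistic(observed)

CRITICAL_VALUE_DF1_05 = 3.841   # chi-squared critical value, df = 1, alpha = .05
significant = chi2 > CRITICAL_VALUE_DF1_05
print(f"chi-squared = {chi2:.2f}, significant at .05: {significant}")
```

For these counts the statistic is about 4.02, just above the critical value, so the difference would be judged unlikely to be due to chance alone at the .05 level.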

Prevalence
Prevalence is the total number of cases in a population over a defined period of time.
Point prevalence: attempts to measure the frequency of all disease at one specific point in time,
therefore knowledge of the time of onset of disease is not required
Period prevalence: measure constructed from prevalence at a point in time, plus new cases and
recurrences over a defined period of time

Potential Years of Life Lost (PYLL)


Calculated for a population using the difference between the actual age at death and a standard age at
death; increased emphasis is therefore given to mortality at a younger age. Males are more likely to die
at younger ages due to unintentional injuries; this causes PYLL to be higher in males.
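
The PYLL calculation can be sketched as a simple sum. The standard age of 75 and the ages at death below are assumptions for illustration only; the notes do not specify which standard age is used.

```python
def potential_years_of_life_lost(ages_at_death, standard_age=75):
    """Sum of (standard age - age at death); deaths at or after the
    standard age contribute zero. standard_age=75 is an assumed default."""
    return sum(max(standard_age - age, 0) for age in ages_at_death)


# Hypothetical deaths: one young injury death outweighs two later deaths
pyll = potential_years_of_life_lost([20, 60, 80])
print(f"PYLL = {pyll} years")   # 55 + 15 + 0
```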

Top 5 Potential Years of Life Lost (PYLL) Causes in Canada


1. Neoplasms
2. Circulatory diseases
3. Unintentional injuries
4. Suicide
5. Respiratory diseases

Prepared by; Dr. Shakeel Ahmed https://www.facebook.com/imgstudygroup

