Biostatistics and Exercise
BIOSTATISTICS
SESSION 1, LEARNING NOTES
Public health
Medicine
Ecological and environmental
Examples:
The observed proportion of the sample that responds to treatment;
The observed association between a risk factor and a disease in this
sample.
Population
A group of individuals that we would like to know something about
Sample
A subset of a population (hopefully representative)
RANDOM VS NON RANDOM SAMPLING
Random samples
Nominal
Ordinal
Interval
Ratio
NOMINAL DATA/VARIABLE
Nominal= categorical
Preferred measure of central tendency by type of data:
Nominal: Mode
Ordinal: Median
Interval, symmetrical distribution: Mean
Interval, skewed distribution: Median
Continuous
Definition: A set of data is said to be continuous if the values belonging to the set can take on ANY value within a finite or infinite interval.
Discrete
Definition: A set of data is said to be discrete if the values belonging to the set are distinct and separate (unconnected values).
If there are very few discrete values, then discrete data is often
treated as ordinal.
TYPE OF VARIABLE NOT KIND
Two types: variables can be classified as independent (IV) or dependent (DV).
DV vs IV, plasma concentration and time: take the example of a patient who
has taken a drug in the morning. The plasma concentration of this drug is the DV,
since it changes over time during the day after drug intake; time is the IV.
TYPE OF VARIABLE
An intervening variable is the variable that links the independent and dependent variables.
Numerical presentation
Graphical presentation
Mathematical presentation
NUMERICAL PRESENTATION
E.g. frequencies presented in a table, among other tabular summaries
GRAPHICAL PRESENTATION
1. Graphs drawn using Cartesian coordinates
Bar graphs when presenting nominal data (no order to the horizontal axis)
Histograms when presenting continuous or ordinal data (these should be on the horizontal axis)
Box plots when presenting continuous data
2. Pie charts
3. Statistical maps
FREQUENCY POLYGON
WHY IT IS ALWAYS BETTER TO SUMMARIZE YOUR DATA
It is ALWAYS a good idea to summarize your data (at least for important
variables)
You become familiar with the data and the characteristics of the sample
that you are studying
You can also identify problems with data collection or errors in the data
(data management issues)
Median: the score that divides the ordered scores into two halves. With an odd number of scores it is the middle score; with an even number it is the average of the two middle scores.
Mean: the sum of all the scores divided by the total number of scores (= average).
When the distribution of the data is normal, the mean lies in the middle of the distribution and equals the median.
The mean is a good measure of central tendency.
It is preferred whenever possible and is the only measure of central tendency that is
used in advanced statistical calculations:
o More reliable and accurate
o Better suited to arithmetic calculations
C.T
The mean can be misleading because it can be greatly influenced by extreme
scores, called outliers.
For example, the average length of stay at a hospital could be greatly
influenced by one patient who stays for 5 years.
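The effect of a single extreme score can be seen in a small sketch. The lengths of stay below are made-up numbers for illustration only (the last patient stays about 5 years, i.e. 1825 days); they are not data from these notes.

```python
from statistics import mean, median

# Hypothetical lengths of stay in days; the last value is the extreme score.
stays = [2, 3, 3, 4, 5, 6, 1825]

average_stay = mean(stays)     # pulled far upward by the one extreme value
typical_stay = median(stays)   # middle score of the ordered data

print("mean  :", average_stay)
print("median:", typical_stay)
```

Here the mean is 264 days while the median is 4 days, which is why the median is the safer summary when the data are skewed.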
A probability distribution is a device for indicating the values that a random variable may have.
There are two categories of random variables:
a. discrete random variables, and
b. continuous random variables.
Frequency distribution
Probability distribution (relative frequency distribution)
Cumulative frequency
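As a rough illustration of the three terms above, the sketch below builds a frequency distribution, a relative frequency distribution and the cumulative frequency for a small set of scores. The scores are hypothetical, chosen only for the example.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical discrete scores (illustration only).
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

counts = Counter(scores)
values = sorted(counts)                                # distinct scores, in order
frequency = [counts[v] for v in values]                # how often each score occurs
relative = [f / len(scores) for f in frequency]        # proportion of all scores
cumulative = list(accumulate(frequency))               # running total of frequencies

for v, f, r, c in zip(values, frequency, relative, cumulative):
    print(f"score={v}  frequency={f}  relative={r:.2f}  cumulative={c}")
```

The relative frequencies add up to 1, and the last cumulative frequency equals the number of scores.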
PROBABILITY DISTRIBUTION
In a normal distribution (s = standard deviation):
± 1 s contains about 68%;
± 2 s contains about 95%;
± 3 s contains about 99.7% of the area under the curve
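These percentages can be checked numerically. The sketch below assumes a normal distribution and uses the error function from Python's standard library; the function name `area_within` is just a label chosen for this example.

```python
from math import erf, sqrt

def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} s: {area_within(k):.1%}")
```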
WHY SPREAD IS IMPORTANT
Spread is important when comparing 2 or more group means.
For instance, it is more difficult to see a clear distinction between groups in
the upper example because the spread is wider, even though the means are the same.
STANDARD NORMAL DISTRIBUTION
A normal distribution is determined by its mean m and standard deviation s.
This creates a family of distributions depending on whatever the values of
m and s are.
- The standard normal distribution has mean = 0 and standard deviation = 1.
Standard Z-scores: the standard z score is obtained by creating a
variable z whose value is:
z = (x - m) / s
Given the values of m and s we can convert a value of x to a value of z.
A Z-score is the number of standard deviations above or below the mean.
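A minimal sketch of the conversion from x to z, using the usual definition of a z-score (distance from the mean in units of standard deviation). The mean of 120 and standard deviation of 10 are hypothetical values picked for the example.

```python
def z_score(x, m, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean m."""
    return (x - m) / s

# Hypothetical population: mean m = 120, standard deviation s = 10.
print(z_score(135, 120, 10))   # 1.5 standard deviations above the mean
print(z_score(110, 120, 10))   # 1 standard deviation below the mean
```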
Graphical methods mainly include the histogram, box-whisker plot, dot plot, and the
normality plots (Q-Q and P-P plots). Normality plots are widely used.
Notice
Transformation should be justified: it is recommended when including a
non-normally distributed variable in the analysis would reduce the effectiveness at
identifying statistical relationships, i.e. when the lack of normal distribution of
the variable to be analyzed leads to a loss of power.
TYPE OF THE STATISTICS
There are two types of statistics:
Descriptive Statistics
Inferential Statistics
1. Descriptive statistics
Used to summarize, organize, and make sense of a set of data (scores or observations).
Typically presented graphically, in tabular form (in tables), or as summary statistics (single values).
- e.g.: mean, median, mode, frequencies, range, variance, standard deviation, quartiles, standard error of
the mean
They also help to describe the relationship between variables.
NB: descriptive statistics have been largely discussed in the previous paragraphs.
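The summary statistics listed above can be sketched with Python's standard `statistics` module. The scores are hypothetical; the variance and standard deviation shown are the sample versions (dividing by n - 1), and the standard error of the mean is taken as the standard deviation divided by the square root of n.

```python
import statistics as st
from math import sqrt

# Hypothetical scores (illustration only).
scores = [4, 8, 6, 5, 3, 8, 9, 5, 8, 4]
n = len(scores)

summary = {
    "mean": st.mean(scores),
    "median": st.median(scores),
    "mode": st.mode(scores),
    "range": max(scores) - min(scores),
    "variance": st.variance(scores),          # sample variance (n - 1)
    "sd": st.stdev(scores),                   # sample standard deviation
    "quartiles": st.quantiles(scores, n=4),   # three cut points Q1, Q2, Q3
    "sem": st.stdev(scores) / sqrt(n),        # standard error of the mean
}

for name, value in summary.items():
    print(f"{name}: {value}")
```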
INFERENTIAL STATISTICS
The significance level is the pre-determined value used to decide whether to reject or retain the hypothesis.
A significance level of 0.05 is commonly used; the “p-value” obtained from the test is compared against it.
Statistically significant findings mean that the probability of obtaining such findings by chance alone is less than 5%
(i.e. findings would occur no more than 5 out of 100 times by chance alone).
E.g. :
The 20 year risk of lung cancer for smokers is 15%
The 20 year risk of lung cancer among non-smokers is 1%
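One way to compare these two risks is the relative risk (RR, mentioned again under measures of association) and the risk difference. The sketch below uses only the 15% and 1% figures given above; exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# 20-year risks taken from the example above.
risk_smokers = Fraction(15, 100)
risk_non_smokers = Fraction(1, 100)

relative_risk = risk_smokers / risk_non_smokers      # how many times higher the risk is
risk_difference = risk_smokers - risk_non_smokers    # absolute excess risk

print("relative risk  :", relative_risk)
print("risk difference:", float(risk_difference))
```

With these figures the risk among smokers is 15 times the risk among non-smokers, an absolute difference of 14 percentage points.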
MEASURE OF ASSOCIATION
Odds Ratio (OR) is a way of comparing whether the probability of a certain event
is the same for two groups, i.e. it compares an event in two groups.
Used for cross-sectional studies, case-control studies, and retrospective studies
(a retrospective study is one that refers to past events).
o In case-control studies you can't estimate the rate of disease among study
subjects because subjects are selected according to disease/no disease. So, you can't
take the rate of disease in both populations (in order to calculate RR).
o OR is the comparison between the odds of exposure among cases and the odds of
exposure among controls.
o Odds are the same as betting odds. Example: if you have a 1 in 3 chance of winning a
draw, your odds are 1:2.
o To calculate OR, take the odds of exposure (cases) / odds of exposure (controls).
E.g. Smokers are 2.3 times more likely to develop lung cancer than non-smokers.
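A minimal sketch of the odds and odds ratio calculation described above. The counts in the case-control table are hypothetical, and the function names are labels chosen for this example.

```python
def odds(probability):
    """Convert a probability into odds, e.g. 1/3 -> 0.5 (i.e. 1:2)."""
    return probability / (1 - probability)

def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    return odds_cases / odds_controls

# A 1 in 3 chance of winning corresponds to odds of 1:2.
print(odds(1 / 3))

# Hypothetical case-control counts (illustration only):
# cases: 60 exposed, 40 unexposed; controls: 30 exposed, 70 unexposed.
print(odds_ratio(60, 40, 30, 70))
```

With these made-up counts the odds of exposure are 1.5 among cases and about 0.43 among controls, giving an OR of about 3.5.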
CONFIDENCE INTERVALS
When we measure the size of the effect we use confidence intervals
(CI). A CI is the range* within which we would expect the true value (e.g. the risk) to lie in the population.
A CI provides an expected upper and lower limit (= range*) for a statistic
at a specified probability level (usually 95%, and sometimes 99%).
The odds ratio we found from our sample (e.g. Smokers are 2.3 times
more likely to develop cancer than non-smokers) is only true for the
sample we have examined; it might be slightly different if we used
another sample.
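Calculating a CI:
For example, a sample mean is an estimate of the population mean.
A CI provides a band within which the population mean is likely to fall:
CI = mean ± (critical value × Sm), where Sm is the standard error of the mean and the
critical value is set by the confidence level (about 1.96 for 95% when a normal distribution is assumed).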
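A minimal sketch of a confidence interval for a mean, assuming the normal-approximation form (mean ± critical value × standard error of the mean). The measurements are hypothetical, and for small samples a t-distribution critical value is usually preferred over the normal one used here.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(values, confidence=0.95):
    """Return (lower, upper) limits of a normal-approximation CI for the mean."""
    n = len(values)
    m = mean(values)
    sm = stdev(values) / sqrt(n)                                  # standard error of the mean
    critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)     # about 1.96 for 95%
    return m - critical * sm, m + critical * sm

# Hypothetical measurements (illustration only).
sample = [118, 122, 125, 130, 121, 127, 119, 124]

ci95 = mean_ci(sample, 0.95)
ci99 = mean_ci(sample, 0.99)
print("95% CI:", ci95)
print("99% CI:", ci99)
```

The 99% interval is wider than the 95% interval: asking for more confidence costs precision, and a larger sample narrows both.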
One of the best ways to increase the power of your study is to increase
your sample size.
STATISTICAL ANALYSES
Statistical analyses are either parametric or non-parametric, and are performed using:
parametric tests = the variable in question is from a normal distribution;
non-parametric tests =do not require any assumption of normal
distribution, are not sensitive
SPSS (Statistical Package for the Social Sciences) was designed to offer a more
user-friendly data analysis presentation than other statistical software.
It has had different versions over the past years (SPSS, IBM-SPSS, PASW -
Predictive Analytics Software).
TYPE OF THE DATA
Types of data (reminder):
Nominal, Ranked, Scales (measures: Interval, Ratio), Mixed types
Text answers (open ended questions)
Nominal (categorical)
− Order is arbitrary when entering data in SPSS
− e.g. Gender, country of birth, personality type, yes or no.
− Use numeric in SPSS and give value labels.
Measures, scales
− Interval - equal units
− Ratio - equal units, zero on scale
• e.g. Family size, Salary
• Makes sense to say one value is twice another
− Use numeric (or comma, dot or scientific) in SPSS
• NB: numeric if you can manage to use numbers
• E.g. Family size, 1, 2, 3, 4 etc.
• E.g. Salary per year, 25000, 14500, 18650 etc.
Mixed type
− Categorised data
− Actually ranked, but used to identify categories or groups
e.g. age groups
= ratio data put into groups
− Use numeric in SPSS and use value labels.
E.g. Age group, 1=Under 15
2=15-34
3=35-54
4=55 or greater
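The coding scheme above (a numeric code plus a value label) can be sketched outside SPSS as well. The snippet below is plain Python, not SPSS syntax; the function name `age_group` and the dictionary `AGE_GROUP_LABELS` are names invented for this illustration.

```python
# Value labels for the numeric codes, following the age groups listed above.
AGE_GROUP_LABELS = {1: "Under 15", 2: "15-34", 3: "35-54", 4: "55 or greater"}

def age_group(age):
    """Return the numeric code of the age group that an age (in years) falls into."""
    if age < 15:
        return 1
    if age <= 34:
        return 2
    if age <= 54:
        return 3
    return 4

for age in (9, 15, 40, 70):
    code = age_group(age)
    print(age, "->", code, "=", AGE_GROUP_LABELS[code])
```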
Text answers
− E.g. answers to open-ended questions
− Either enter text as given (Use String in SPSS) or
− Code or classify answers into one of a small number of types (Use numeric/nominal in
SPSS)
COMPUTING DESCRIPTIVE STATISTICS
Steps for statistical data analysis
Statistical data analysis is conducted in two steps:
1st step = Descriptive Statistics (to describe the sample) including Testing for
NORMALITY ASSUMPTIONS
2nd step = making inference (Inferential Statistics) (making inferences about the
population using what is observed in the sample).
Association statistics
Comparative statistics
Notice: As an introduction to SPSS for data analysis, we will focus on the first step (Descriptive
statistics); the second step is better covered after or combined with “Research Methodology”
courses/lectures
SPSS
Graphical methods mainly include the histogram, box-whisker plot, dot plot, and the
normality plots (Q-Q and P-P plots). Normality plots are widely used (the Q-Q
plot is more common).
Graphical interpretation has the advantage of
allowing good judgment to assess normality in situations when statistical tests
might be over- or under-sensitive;
however, graphical methods do lack objectivity.
Conclusion: in some cases, both methods complement each other (sometimes
you need to rely on statistical methods when graphical methods do not help you
to decide whether your data is normally distributed or not).
ASSESSING NORMALITY GRAPHICALLY
Q-Q plots and P-P plots are called probability plots.
A probability plot helps to compare two data sets in terms of distribution:
one data set being the data to be analyzed (data you collected yourself) and
the other being reference normally distributed data (theoretical normally
distributed data, usually shown as a straight solid line).
If the data is normally distributed, the result would be a straight line with positive
slope, like in the figure on the right below, indicating a good match for both data
distributions.
WHY DO WE EVEN NEED Q-Q PLOT OR P-P PLOT?
If we plot the quantiles of the two data sets against each other, it is called a Q-Q plot.
If we plot the cumulative probabilities of the two data sets against each other, it is
called a P-P plot. The Q-Q plot is more common.
Histograms can be difficult to interpret, which is why Q-Q or P-P plots are often better.
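To make the idea of a normal Q-Q plot concrete, the sketch below computes the points that such a plot would display, without drawing anything. It is one common construction, not necessarily the one SPSS uses: the plotting positions (i - 0.5) / n are an assumption, and the data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Return (theoretical quantile, standardized observed value) pairs for a normal Q-Q plot."""
    data = sorted(values)                  # order the observations first
    n = len(data)
    m, s = mean(data), stdev(data)
    standard_normal = NormalDist()         # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(data, start=1):
        theoretical = standard_normal.inv_cdf((i - 0.5) / n)   # expected z for this rank
        observed = (x - m) / s                                  # observed z-score
        points.append((theoretical, observed))
    return points

# Hypothetical measurements (illustration only).
sample = [48, 52, 50, 47, 55, 51, 49, 53]
points = qq_points(sample)
for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the data are roughly normal, the two columns rise together and the points lie close to a straight line.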
BOX-WHISKER PLOT
Usually used as a measure of variability (dispersion). A box-whisker
plot shows four equal parts separated by three quartiles:
• Lower quartile (Q1) = median of the lower half of the data
• Second quartile (Q2) = median
• Upper quartile (Q3) = median of the upper half of the data
• Need to order the individuals first (from 1 to “N” individuals)
• One quarter of the individuals fall in each of the four parts
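The "median of each half" description can be sketched as follows. Several conventions for quartiles exist (statistical packages may interpolate differently); this version assumes that, for an odd number of individuals, the overall median is left out of both halves. The data are hypothetical.

```python
from statistics import median

def quartiles_by_halves(values):
    """Lower quartile, median and upper quartile using the median-of-halves method."""
    data = sorted(values)                      # order the individuals first
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # With an odd n, skip the middle value so that both halves have equal size.
    upper_half = data[half:] if n % 2 == 0 else data[half + 1:]
    return {
        "lower": median(lower_half),
        "median": median(data),
        "upper": median(upper_half),
    }

even_case = quartiles_by_halves([15, 1, 9, 3, 13, 5, 11, 7])
odd_case = quartiles_by_halves([2, 4, 6, 8, 10, 12, 14])
print(even_case)
print(odd_case)
```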
ASSESSING NORMALITY STATISTICALLY
A distribution is considered normal if its skewness and kurtosis have values between –1.0 and +1.0.
A perfectly normal distribution will have a skewness statistic of zero.
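A rough sketch of how skewness and kurtosis can be computed and checked against the ±1.0 rule. It assumes the simple moment-based formulas, with kurtosis reported as excess kurtosis (zero for a normal distribution); statistical packages such as SPSS apply sample-size corrections, so their values will differ somewhat. The data sets are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no sample-size correction)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

def within_rule_of_thumb(values, limit=1.0):
    """True if both statistics lie between -limit and +limit."""
    skewness, kurtosis = skewness_and_kurtosis(values)
    return abs(skewness) < limit and abs(kurtosis) < limit

symmetric = [1, 2, 3, 4, 5]          # hypothetical, perfectly symmetric
skewed = [1, 1, 1, 2, 2, 3, 10]      # hypothetical, one large value on the right

print(skewness_and_kurtosis(symmetric))
print(skewness_and_kurtosis(skewed))
print(within_rule_of_thumb(skewed))
```

The symmetric set has a skewness of zero, while the right-skewed set has a skewness above +1 and fails the rule.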
I.
A. Nominal
B. Ordinal
C. Interval
D. Ratio

II. Within 3 standard deviations, the mean picks up how much of the scores?
A. 68
B. 78
C. 99
D. 99.7
E. 99.9

III.
A. Nominal
B. Ordinal
C. Interval
D. Fratio

IV. Has categorical variables and bars are separate, but equal distances apart:
A. Bar Graph
B. Histogram
C. Frequency Polygon

V. Has continuous variables, bars touch and you can always find a third value:
A. Bar Graph
B. Histogram
C. Frequency Polygon

VI. Within 1 standard deviation, the mean picks up over how many of the values?
A. 60
B. 62
C. 65
D. 66
E. 68

VII. The degree to which the independent variable alone brings about the change in the dependent variable is what:
A. Internal Validity
B. External Validity
C. Internal Validity
D. External Validity

VIII. The Student's t-test measures what:
A. Test the difference between 2 means
B. Test for differences between 3 or more means
C. Differences between two frequency distributions
D. Whether two distributions are independent or dependent

IX. The Scientific Method is:
A. Qualitative Research
B. Quantitative Research

X. As income level declines, tooth decay increases. This is an example of what:
A. Positive correlation
B. Negative correlation

XI. Randomly selecting a proportionate amount from subgroups is an example of what:
A. Random Sampling
B. Stratified Sampling
C. Systematic Sampling
D. Convenience Sampling

XII. Retrospective and Prospective are what types of Epidemiological Studies?
A. Analytical
B. Descriptive

XIII. Descriptive statistics make no attempt to generalize the research findings beyond the immediate sample.
A. True
B. False

XIV. Randomly selecting a proportionate amount from subgroups is an example of what:
A. Random Sampling
B. Stratified Sampling
C. Systematic Sampling
D. Convenience Sampling

XV. In systematic sampling, every person has an equal or random chance of being selected.
A. True
B. False

XVI. A zero correlation coefficient shows:
A. A strong relationship
B. No relationship

What is the difference between positive correlation and negative correlation?