1. Student t-test and Contingency Tables 2020 - Solution Laboratory Class
LABORATORY 1: 11-02-2021
Contents
Student t-test
Exercise 1: Different genome sizes between crustaceans
Exercise 2: Independent and Paired t-test - (OPTIONALLY!) Submit before 18/02/2021
Student t-test
The two-sample t-test is used to directly compare the mean of two groups (X and Y). It is required that
measurements in the two groups are statistically independent. The null hypothesis states that the means of
two groups are equal, or equivalently, the difference of the means is zero:
H0: μX = μY (equivalently, μX − μY = 0)
A large t-score indicates that the groups are different. A small t-score indicates that the groups are similar.
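Concretely, for two independent samples X and Y, the pooled-variance statistic takes the standard textbook form (added here for reference; it is not shown in the original handout):

```latex
t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}},
\qquad
s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}
```

Under the null hypothesis, t follows a Student's t distribution with n_x + n_y − 2 degrees of freedom.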
A t-test is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution. The t-test can be used, for example, to determine whether two sets of data are significantly different from each other (see [Wikipedia](https://en.wikipedia.org/wiki/Student%27s_t-test)).
Exercise 1: Different genome sizes between crustaceans
Here we have data on the genome size (measured in picograms of DNA per haploid cell) in two large groups
of crustaceans (Decapods and Isopods). The cause of variation in genome size has been a puzzle for a long
time; we’ll use these data to answer the biological question of whether some groups of crustaceans have
different genome sizes than others.
1. First we should observe the data, load the file into R and graphically explore the dispersion and
normality of the whole dataset. Looking at the histograms, do you think the data is normal?
[Figure: histogram of DNA concentration (y-axis 0-10) and "Boxplot with jitter to observe dispersion" (y-axis: DNA concentration; x-axis: populations, Decapods vs Isopods)]
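The exploration above can be reproduced with a sketch along these lines; the file name genome_size.txt and the two-column layout (group, then DNA concentration) are assumptions, since the loading step is not shown in the handout:

```r
# Assumed: whitespace-separated file with columns V1 (group) and
# V2 (DNA concentration in picograms per haploid cell)
genome_size <- read.table("genome_size.txt", header = FALSE)
str(genome_size)                                  # inspect the structure
hist(genome_size[, 2], xlab = "DNA concentration",
     main = "Histogram of raw DNA concentration")
boxplot(genome_size[, 2] ~ genome_size[, 1],
        xlab = "populations", ylab = "DNA concentration")
```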
Another solution is to represent the empirical distribution of the variable [DNA] using a histogram with a normal curve overlaid:
#SOLUTION:
g <- genome_size[,2]
m <- mean(g)
std <- sd(g)   # standard deviation
hist(g, density=20, breaks=10, prob=TRUE,
xlab="x-variable: DNA concentration", ylim=c(0, 0.5),
main="normal curve over histogram")
curve(dnorm(x, mean=m, sd=std),
col="darkblue", lwd=2, add=TRUE, yaxt="n")   # overlay the fitted normal curve
[Figure: "normal curve over histogram" (histogram of DNA concentration with fitted normal density; y-axis: Density, 0.0-0.5; x-axis: 0-10)]
2. If the variances are not similar between the groups and the variable is not normally distributed, we cannot use the Student t-test directly. Instead, we try to transform the data so that it fits a normal distribution. We will apply the log10 transformation to our data, using the log10() function. Calculate the mean and the variance of the newly transformed data. Are the variances more similar now?
#SOLUTION
genome_size$log10 <- log10(genome_size[,2])
#mean for each group
mean(genome_size[genome_size[,1]=="Decapods",3])
## [1] 0.4020461
mean(genome_size[genome_size[,1]=="Isopods",3])
## [1] -0.1066846
#variance for each group
var(genome_size[genome_size[,1]=="Decapods",3])
## [1] 0.3807424
var(genome_size[genome_size[,1]=="Isopods",3])
## [1] 0.2129233
# the means are different but the variances are more similar
#a double boxplot
library(tidyverse)
library(hrbrthemes)
genome_size %>%
ggplot( aes(x=V1, y=log10, fill=V1)) +
geom_boxplot() +
geom_jitter(color="black", size=0.4, alpha=0.9) +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
ggtitle("Boxplot with jitter to observe dispersion") +
xlab("populations") + ylab("log10 DNA concentration")
[Figure: "Boxplot with jitter to observe dispersion" (y-axis: log10 DNA concentration, −2 to 0; x-axis: populations, Decapods vs Isopods)]
3. Now plot the histogram of the transformed data. Do they look nearly normal?
#SOLUTION
hist(genome_size[,3], density=10, breaks=10, prob=T,
xlab="log10(x-variable: DNA concentration)", ylim=c(0, 0.7), col="green")
[Figure: "Histogram of genome_size[, 3]" (log10-transformed DNA concentration; y-axis: Density, 0.0-0.7)]
4. Check if the transformed data follows a normal distribution using a graphical method.
#SOLUTION
g = genome_size[,3]
m<-mean(g)
std<-sqrt(var(g))
hist(g, density=10, breaks=10, prob=T,
xlab="log10(x-variable: DNA concentration)", ylim=c(0, 0.7), col="green")
curve(dnorm(x, mean=m, sd=std),
col="darkblue", lwd=2, add=TRUE, yaxt="n") #overlap the normal curve
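An alternative, and more standard, graphical normality check is the normal Q-Q plot; a sketch, assuming the transformed values are in the third column of genome_size as above:

```r
# Points falling close to the reference line suggest approximate normality
g <- genome_size[, 3]
qqnorm(g, main = "Normal Q-Q plot of log10(DNA concentration)")
qqline(g, col = "darkblue", lwd = 2)
```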
[Figure: "Histogram of g" (log10-transformed DNA concentration with fitted normal curve; y-axis: Density, 0.0-0.7)]
5. After transforming the data (log10()), now we can apply the Student t test to answer the question: Do
both groups have the same mean genome size? What is the value of the t statistic? And the p-value?
#SOLUTION
result_StudentT <- t.test(genome_size[genome_size[,1]=="Decapods",3],
genome_size[genome_size[,1]=="Isopods",3],var.equal = TRUE)
result_StudentT
##
## Two Sample t-test
##
## data: genome_size[genome_size[, 1] == "Decapods", 3] and genome_size[genome_size[, 1] == "Isopods",
## t = 3.4308, df = 52, p-value = 0.001187
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2111807 0.8062808
## sample estimates:
## mean of x mean of y
## 0.4020461 -0.1066846
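As a side note, t.test() actually defaults to Welch's t-test (var.equal = FALSE), which does not assume equal variances; on the same transformed data the call would simply be:

```r
# Welch's variant: no pooling of variances, adjusted degrees of freedom
t.test(genome_size[genome_size[, 1] == "Decapods", 3],
       genome_size[genome_size[, 1] == "Isopods", 3])  # var.equal = FALSE by default
```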
Exercise 2: Independent and Paired t-test - (OPTIONALLY!) Submit before
18/02/2021
Previously an example of an independent t-test was developed, but there are other forms of the t-test. The PAIRED t-test is performed when the samples consist of matched pairs of similar units, or when there are repeated measures. For example, the same patients may be tested repeatedly, before and after receiving a particular treatment. In such cases, each patient is used as a control sample against themselves.
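In R, the only change is the paired = TRUE argument; a minimal sketch with made-up before/after values (not the B_cell data):

```r
# Hypothetical expression values for the same 6 patients, before and after treatment
before <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7)
after  <- c(5.9, 5.2, 6.4, 6.1, 5.0, 6.3)
t.test(after, before, paired = TRUE)  # tests whether mean(after - before) is 0
```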
The expression of the oncogene SKI-like in B-cells has been measured in patients with leukemia who are at different stages of evolution of the disease. We want to check whether the evolution stage is a significant factor in the expression levels of the gene. Accepting that the evolution stages vary considerably, we employ a t-test to compare the expression levels of the gene.
You can find the data in the B_cell.txt file. Use B_cell as the data set's name.
1) Perform two separate t-tests to test the following null hypotheses:
binom = rbinom(1000,10,0.6)
hist(binom, col="red", breaks=25)
[Figure: "Histogram of binom" (frequencies of rbinom(1000, 10, 0.6); x-axis: binom, 2-10; y-axis: Frequency, 0-250)]
In order to test categorical variables there are many possibilities, but one of the most common is a hypothesis test based on the chi-square statistic.
Chi-square goodness of fit test
The Chi Square goodness of fit test is a very simple and versatile test that quantitatively determines whether
a random variable really should be modeled with a particular probability distribution.
It literally tests the "goodness" of the fit. The test partitions the observed data into bins and calculates the frequency in each bin, similar to a histogram construction. It then compares the observed frequencies with the expected frequencies that would result from a perfect fit to the proposed distribution, and calculates a test statistic that follows a chi-square distribution, where Oi are the observed frequencies and Ei the expected frequencies for n bins:
χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ ,   df = n − 1
We think that the proportions of nucleotides A, T, C, G should be equal in a given DNA sequence, with proportion 0.25 each. This is modeled by a multinomial distribution. Let's see, for a particular gene, how good a fit this really is. That is, our null hypothesis is that the data fit a multinomial distribution with equal probability for each nucleotide. Please test this situation using the following example of a gene, with obs the vector of observed nucleotide counts:
chisq.test(obs,p=rep(1/4,4))
##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 12.792, df = 3, p-value = 0.005109
#Based on the test statistic 12.79 (more extreme on the Chi-Square curve
#with 3 degrees of freedom than the critical value 7.81) we reject our null
#hypothesis at the alpha=0.05 significance level and conclude that the fit
#differs significantly from a multinomial with equal probabilities for each nucleotide.
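The obs vector of nucleotide counts is not reproduced above; this self-contained sketch with made-up counts shows where the statistic and the 7.81 critical value come from:

```r
# Hypothetical counts for A, T, C, G (the original obs vector is not shown)
obs  <- c(260, 240, 270, 230)                 # 1000 nucleotides in total
expd <- sum(obs) * rep(1/4, 4)                # expected counts under H0: 250 each
stat <- sum((obs - expd)^2 / expd)            # chi-square statistic: 4 here
crit <- qchisq(0.95, df = length(obs) - 1)    # critical value, about 7.81
chisq.test(obs, p = rep(1/4, 4))              # same statistic, with a p-value
```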
An example of the analysis of categorical variables is the use of contingency tables (Two-Way Tables). More information at:
<http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm>
Suppose we are doing a genetic study relating the effect of two alleles (of a gene) to the presence of a disease. We performed a genetic test to determine which allele each test subject has and a disease test to determine whether the person has the disease.
The data for a 2 x 2 contingency analysis should be entered in the format below, which works for both tests.
contingencyTestData<-
matrix(c(45,67,122,38), nr=2, dimnames=list("Gene"=c("Allele 1","Allele 2"),
"Disease"=c("Yes","No")))
The tests we want to perform with this contingency table are whether or not the two factors, disease and
gene allele, are independent or whether there is a significant relationship between the factors. The null
hypothesis is that there is no relation and the factors are independent.
The Chi-Square test for independence is similar to the goodness of fit test, but uses for the expected values
the values calculated from the marginal probabilities of the data rather than expected values based on a
distribution.
The Chi-Square test statistic is therefore the sum over all cells of the squared difference between observed and expected values, divided by the expected value. The degrees of freedom for this test are the number of rows minus 1 times the number of columns minus 1, which is 1 in the case of a 2 x 2 table:
χ² = Σᵢ,ⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ ,   df = (r − 1)(c − 1)
where Oij are the observed frequencies fij and Eij the expected frequencies for row i and column j.
Based on this test, are the factors allele and disease independent?
chisq.test(contingencyTestData)
##
## Pearson’s Chi-squared test with Yates’ continuity correction
##
## data: contingencyTestData
## X-squared = 34.662, df = 1, p-value = 3.921e-09
#Based on this result (p-value of 3.921e-9) we would strongly reject the null
#hypothesis that the factors gene allele and disease are independent and conclude
#that there is a significant relation between the disease and which allele of the
#gene a person has.
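The expected counts behind this result can be reconstructed by hand from the marginals; a short sketch (chisq.test applied Yates' continuity correction, so the uncorrected statistic below is slightly larger than 34.662):

```r
O <- matrix(c(45, 67, 122, 38), nrow = 2)     # same data as contingencyTestData
E <- outer(rowSums(O), colSums(O)) / sum(O)   # expected counts under independence
stat <- sum((O - E)^2 / E)                    # uncorrected chi-square, about 36.17
```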
It is also possible to ask about: 1) calculating the row and column sums and the cell percentages.
# Adding row and column sums
addmargins(contingencyTestData)
## Disease
## Gene Yes No Sum
## Allele 1 45 122 167
## Allele 2 67 38 105
## Sum 112 160 272
# Whole-table percentages (each cell divided by the grand total)
prop.table(contingencyTestData)
## Disease
## Gene Yes No
## Allele 1 0.1654412 0.4485294
## Allele 2 0.2463235 0.1397059
# Row percentages
prop.table(contingencyTestData, 1)
## Disease
## Gene Yes No
## Allele 1 0.2694611 0.7305389
## Allele 2 0.6380952 0.3619048
# Column percentages
prop.table(contingencyTestData, 2)
## Disease
## Gene Yes No
## Allele 1 0.4017857 0.7625
## Allele 2 0.5982143 0.2375
As an aid to the decision we can represent a contingency table by means of a graph. Consider now a different question: is there an association between smoking and the level of exercise?
Propose a plot or a graphical representation in order to observe it.
Problem: Test the hypothesis of whether the students' smoking habit is independent of their exercise level at the .05 significance level.
Solution: We apply the chisq.test() function to the contingency table tbl and find the p-value.
As the p-value 0.4828 is greater than the .05 significance level, we do not reject the null hypothesis that the smoking habit is independent of the exercise level of the students.
The warning message found in the solution above is due to the small cell values in the contingency table. To avoid such a warning, we combine the second and third columns of tbl and save the result in a new table named ctbl. Then we apply the chisq.test() function to ctbl instead.
Since the p-value 0.3571 is greater than the .05 significance level, we again do not reject the null hypothesis that the smoking habit is independent of the exercise level of the students, this time without the warning message.
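The original tbl of smokers by exercise level is not reproduced above; with made-up counts, the column-combining step could look like this:

```r
# Hypothetical student counts; rows = smoking habit, columns = exercise level
tbl <- matrix(c(7, 87, 12, 84, 3, 18), nrow = 2,
              dimnames = list(Smoke = c("Yes", "No"),
                              Exercise = c("Freq", "Some", "None")))
chisq.test(tbl)   # warns here: some expected counts are below 5
# combine the 2nd and 3rd columns to avoid the small-cell warning
ctbl <- cbind(tbl[, 1], tbl[, 2] + tbl[, 3])
chisq.test(ctbl)
```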