
1. Student t-test / chi-square test (2021) (SOLUTION)

B.Dobon, H.Laayouni, S.Walsh, T.Monleon-Getino

LABORATORY 1: 11-02-2021

Contents

Student t-test
  Exercise 1: Different genome sizes between crustaceans
  Exercise 2: Independent and Paired t-test - (OPTIONALLY!) Submit before 18/02/2021

Introduction to Categorical probability distributions

Chi-square goodness of fit test
  Exercise 3: Proportion of nucleotides in a gene

Contingency tables. Chi-Square Test (for Independence)
  Exercise 4: genetic study about alleles and disease
  Exercise 5: Students' smoking habit - (OPTIONALLY!) Submit before 18/02/2021
  Exercise 6 - (OPTIONALLY!) Submit before 18/02/2021
  Exercise 7 - (OPTIONALLY!) Submit before 18/02/2021
Based partially on Statistics Using R with Biological Examples, by Kim Seefeld, MS, M.Ed., and Ernst Linder,
Ph.D., University of New Hampshire, Durham, NH

Student t-test
The two-sample t-test is used to directly compare the mean of two groups (X and Y). It is required that
measurements in the two groups are statistically independent. The null hypothesis states that the means of
two groups are equal, or equivalently, the difference of the means is zero:

H0 : µ(X) = µ(Y), or µ(X) − µ(Y) = 0

H1 : µ(X) ≠ µ(Y), or µ(X) − µ(Y) ≠ 0

When the standard deviation is unknown, we can use:


t = (X̄ − Ȳ) / √( S² (1/n₁ + 1/n₂) )

where S² is the pooled sample variance of the two groups and n₁, n₂ are the group sizes.

A large t-score indicates that the groups are different. A small t-score indicates that the groups are similar.
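
As a quick illustration (a minimal sketch, not part of the original hand-out, using simulated data), the statistic can be computed by hand and checked against R's t.test():

set.seed(123)                                   # hypothetical example data
x <- rnorm(20, mean = 5, sd = 1)
y <- rnorm(25, mean = 4.5, sd = 1)
n1 <- length(x); n2 <- length(y)
S2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)   # pooled variance
t_manual <- (mean(x) - mean(y)) / sqrt(S2 * (1/n1 + 1/n2))
t_manual
t.test(x, y, var.equal = TRUE)$statistic        # should match t_manual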

#A t-test is most commonly applied when the test statistic would follow a normal
#distribution if the value of a scaling term in the test statistic were known.
#When the scaling term is unknown and is replaced by an estimate based on the
#data, the test statistics (under certain conditions) follow a Student's
#t distribution. The t-test can be used, for example, to determine if
#two sets of data are significantly different from each other.
#(see [Wikipedia] https://en.wikipedia.org/wiki/Student%27s_t-test)

#Please read the introduction from:

#[Introduction to t-test]
#<http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-t-tests-t-values-and-t-distributions>

#see also https://www.investopedia.com/terms/t/t-test.asp

Exercise 1: Different genome sizes between crustaceans

Here we have data on the genome size (measured in picograms of DNA per haploid cell) in two large groups
of crustaceans (Decapods and Isopods). The cause of variation in genome size has been a puzzle for a long
time; we’ll use these data to answer the biological question of whether some groups of crustaceans have
different genome sizes than others.

1. First we should observe the data, load the file into R and graphically explore the dispersion and
normality of the whole dataset. Looking at the histograms, do you think the data is normal?

genome_size <- read.table("genome_size_long_format.txt")

#BOX PLOT IS A GOOD SOLUTION:


boxplot(genome_size[,2], col="blue")


#A good interpretation of a box-plot can be seen in the link <https://www.r-graph-gallery.com/boxplot.h

#an alternative to observe differences in the variance between groups


#we can also use a double box-plot to compare dispersion
# BOXPlot
library(tidyverse)
library(hrbrthemes)
genome_size %>%
  ggplot(aes(x=V1, y=V2, fill=V1)) +
  geom_boxplot() +
  geom_jitter(color="black", size=0.4, alpha=0.9) +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  ggtitle("Boxplot with jitter to observe dispersion") +
  xlab("populations") + ylab("DNA concentration")

[Figure: "Boxplot with jitter to observe dispersion" - DNA concentration by population (Decapods, Isopods)]

Another solution is to represent the empirical distribution of the variable [DNA] using a histogram with a
normal curve overlaid:

#SOLUTION:
g = genome_size[,2]
m<-mean(g)
std<-sqrt(var(g))
hist(g, density=20, breaks=10, prob=T,
xlab="x-variable: DNA concentration", ylim=c(0, 0.5),
main="normal curve over histogram")
curve(dnorm(x, mean=m, sd=std),
col="darkblue", lwd=2, add=TRUE, yaxt="n")

[Figure: "normal curve over histogram" - density of DNA concentration with fitted normal curve]

2. If the variances are not similar between the groups and the variable is not normally distributed, we
cannot use the Student t test directly. To address this, we can try to transform the data so that it better fits
a normal distribution. We will apply the log10 transformation to our data. Use the log10() function. Calculate
the mean and the variance of the newly transformed data. Are the variances more similar now?

#SOLUTION
genome_size$log10 <- log10(genome_size[,2])
#mean for each group
mean(genome_size[genome_size[,1]=="Decapods",3])

## [1] 0.4020461

mean(genome_size[genome_size[,1]=="Isopods",3])

## [1] -0.1066846

#variance for each group


var(genome_size[genome_size[,1]=="Decapods",3])

## [1] 0.3807424

var(genome_size[genome_size[,1]=="Isopods",3])

## [1] 0.2129233

# the means are different but the variances are more similar

#a double boxplot
library(tidyverse)
library(hrbrthemes)
genome_size %>%
  ggplot(aes(x=V1, y=log10, fill=V1)) +
  geom_boxplot() +
  geom_jitter(color="black", size=0.4, alpha=0.9) +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  ggtitle("Boxplot with jitter to observe dispersion") +
  xlab("populations") + ylab("log10 DNA concentration")

[Figure: "Boxplot with jitter to observe dispersion" - log10 DNA concentration by population (Decapods, Isopods)]

3. Now plot the histogram of the transformed data. Do they look nearly normal?

#SOLUTION
hist(genome_size[,3], density=10, breaks=10, prob=T,
xlab="log10(x-variable: DNA concentration)", ylim=c(0, 0.7), col="green")

[Figure: "Histogram of genome_size[, 3]" - density of log10(DNA concentration)]

4. Check if the transformed data follows a normal distribution using a graphical method.

#SOLUTION
g = genome_size[,3]
m<-mean(g)
std<-sqrt(var(g))
hist(g, density=10, breaks=10, prob=T,
xlab="log10(x-variable: DNA concentration)", ylim=c(0, 0.7), col="green")
curve(dnorm(x, mean=m, sd=std),
col="darkblue", lwd=2, add=TRUE, yaxt="n") #overlap the normal curve

[Figure: "Histogram of g" - density of log10(DNA concentration) with fitted normal curve]

# now the variable seems to be approximately normal
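
An alternative graphical check (not shown in the original solution) is a normal quantile-quantile plot; if the points fall close to the reference line, the log-transformed data are compatible with a normal distribution:

qqnorm(genome_size[,3], main = "Normal Q-Q plot of log10(DNA concentration)")
qqline(genome_size[,3], col = "darkblue", lwd = 2)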

5. After transforming the data (log10()), now we can apply the Student t test to answer the question: Do
both groups have the same mean genome size? What is the value of the t statistic? And the p-value?

#SOLUTION
result_StudentT <- t.test(genome_size[genome_size[,1]=="Decapods",3],
genome_size[genome_size[,1]=="Isopods",3],var.equal = TRUE)
result_StudentT

##
## Two Sample t-test
##
## data: genome_size[genome_size[, 1] == "Decapods", 3] and genome_size[genome_size[, 1] == "Isopods",
## t = 3.4308, df = 52, p-value = 0.001187
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2111807 0.8062808
## sample estimates:
## mean of x mean of y
## 0.4020461 -0.1066846

tvalue <- result_StudentT$statistic


pvalue <- result_StudentT$p.value

Exercise 2: Independent and Paired t-test - (OPTIONALLY!) Submit before
18/02/2021

Previously, an example of an independent t-test was developed, but there are other ways to use the t-test. The
PAIRED t-test is performed when the samples typically consist of matched pairs of similar units, or when
there are cases of repeated measures. For example, there may be instances of the same patients being tested
repeatedly (before and after receiving a particular treatment). In such cases, each patient is being used as a
control sample against themselves.
The expression of the oncogene SKI-like in B-cells has been measured in patients with leukemia who are at
different stages of evolution of the disease. The data is in the B_cell.csv file. We want to check if the
evolution stage is a significant factor in the expression levels of the gene. Since expression may vary
considerably with the evolution stage, a t-test is employed to compare the expression levels of the gene.
You can find the data in the B_cell.txt file. Use B_cell as the data set's name.
1) Perform two separate t-tests to test the following null hypotheses:

• No effect of evolution stage groups 1 versus 2 on expression levels of the gene


• No effect of evolution stage groups 1 versus 3 on expression levels of the gene

2) Now, reorganize the data using the following code:


B_cell.time <- data.frame(expres.basal=B_cell[B_cell[,2]==1,1], expres.6month=B_cell[B_cell[,2]==2,1])
Note the format of the dataset B_cell.time. Rather than organizing the data into the usual long format in
which variables are represented in columns and rows represent individual replicates, these data have been
organized in wide format. Wide format is often used for data containing repeated measures from individual
or other sampling units in time (in this example: basal and 6 months). Even though this is not necessary
(as paired t-tests can be performed on long format data), traditionally it did allow more compact data
management as well as making it easier to calculate the differences between repeated measurements on each
individual.
Perform a paired t-test to test the following null hypothesis:

• No effect of time (basal versus 6 months) on gene expression
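
Since the B_cell data file is not included in this extract, the following is only a minimal sketch (not part of the original hand-out). It assumes, as the reorganization code above implies, that the expression values are in column 1 and the evolution stage (1, 2, 3) in column 2, with the same number of measurements per stage; var.equal = TRUE is one possible choice (the default Welch test could also be used).

B_cell <- read.table("B_cell.txt", header = TRUE)   # adjust header/separator if needed

# 1) independent t-tests: stage 1 vs 2 and stage 1 vs 3
t.test(B_cell[B_cell[,2]==1,1], B_cell[B_cell[,2]==2,1], var.equal = TRUE)
t.test(B_cell[B_cell[,2]==1,1], B_cell[B_cell[,2]==3,1], var.equal = TRUE)

# 2) paired t-test on the wide-format data (basal vs 6 months)
B_cell.time <- data.frame(expres.basal  = B_cell[B_cell[,2]==1,1],
                          expres.6month = B_cell[B_cell[,2]==2,1])
t.test(B_cell.time$expres.basal, B_cell.time$expres.6month, paired = TRUE)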

Introduction to Categorical probability distributions


Categorical variables have finite sets of discrete values. Examples include sex (male/female), genotype,
disease status, taxon, etc. Contrast this with continuous variables, which can take an infinite number of
different values; examples include weight, length, distance, etc.
One of the best known discrete distributions for categorical data is the binomial distribution. A Bernoulli random
variable has two possible outcomes: 0 or 1. A binomial distribution is the sum of independent and identically
distributed Bernoulli random variables.
Equivalently, a single categorical observation is the special case of the multinomial distribution where the
number of trials n is fixed at one (e.g. colors, taxa, health/disease).
The binomial distribution arises whenever we perform a series of independent trials each of which can result
in either success or failure, and the probability of success is the same for each trial. Survey questions that offer
just two alternatives, male or female, smoker or nonsmoker, like or dislike, give rise to binomial distributions.
One can use the R function rbinom() to generate the results of binomial sampling.
For example, binom = rbinom(1000,10,0.6) will store in the vector binom the results of 1000 binomial samples of 10
trials, each with probability 0.6 of success. EX:

binom = rbinom(1000,10,0.6)
hist(binom, col="red", breaks=25)

[Figure: "Histogram of binom" - frequencies of the 1000 binomial samples]

#remember the expectation and variance for a binomial distribution


#binom distribution
paste("theoretical expectation= ",10*0.6,sep="") #theoretical expectation

## [1] "theoretical expectation= 6"

paste("observed mean/sample mean= ",mean(binom),sep="") #observed mean

## [1] "observed mean/sample mean= 6.06"

paste("theoretical variance= ",10*0.6*(1-0.6),sep="") #theoretical variance

## [1] "theoretical variance= 2.4"

paste("observed variance/sample variance= ",var(binom),sep="") # #observed variance

## [1] "observed variance/sample variance= 2.21061061061061"

There are many possibilities for testing hypotheses about categorical variables, but one of the most common is
to use a hypothesis test based on the chi-square statistic.

Chi-square goodness of fit test
The Chi Square goodness of fit test is a very simple and versatile test that quantitatively determines whether
a random variable really should be modeled with a particular probability distribution.
It literally tests the “goodness” of the density fit. The way the test works is it partitions the observed data
into bins and calculates the frequencies in each bin, similar to a histogram construction. It then compares
the observed frequencies with the expected frequencies that would result from a perfect fit to the proposed
distribution. It then calculates a test statistic that follows a chi-square distribution, where Oi are the observed
frequencies and Ei the expected frequencies for the n bins:

χ² = Σ (Oi − Ei)² / Ei ,   df = n − 1   (summing over the i = 1, ..., n bins)

Exercise 3: Proportion of nucleotides in a gene

We think that that the proportion of nucleotides A, T, C, G should be equally present in a given DNA
sequence, with proportion 0.25 for each. This is modeled by a multinomial distribution . Let’s see for a
particular gene how good a fit this really is. That is our null hypothesis is that the data fit a multinomial
distribution with equal probability for each nucleotide. Please test this situation using the following example
of gene:

#Observed nucleotides for gene of myoglobin


#myoglobin
#a c g t
#237 278 309 242
obs <- c(237, 278, 309, 242)

using a test...

chisq.test(obs,p=rep(1/4,4))

##
## Chi-squared test for given probabilities
##
## data: obs
## X-squared = 12.792, df = 3, p-value = 0.005109

#Based on the test statistic 12.79 (more extreme on the Chi-Square curve
#with 3 degrees of freedom than the critical value 7.81) we reject our null hypothesis at an
#alpha=0.05 significance level and conclude that the fit differs significantly from
#a multinomial with equal probabilities for each nucleotide.
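
A quick way to see where these numbers come from is to reproduce the statistic, the critical value and the p-value by hand (a small sketch, not part of the original solution):

expected <- sum(obs) * rep(1/4, 4)     # expected counts under H0 (266.5 each)
sum((obs - expected)^2 / expected)     # reproduces X-squared = 12.792
qchisq(0.95, df = 3)                   # critical value, approx. 7.81
1 - pchisq(12.792, df = 3)             # p-value, approx. 0.0051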

Contingency tables. Chi-Square Test (for Independence)


Contingency tables are a simple, yet powerful, method of analyzing count data that fits into categories. When
analysis of categorical data is concerned with more than one variable, contingency tables (or two-way tables)
are employed. These tables provide a foundation for statistical inference, where statistical tests question
the relationship between the variables on the basis of the data observed. This is also called categorical data
analysis. We organize the data for analysis in a table format. The most typical case of a contingency table
is a 2 by 2 table (2 rows, 2 columns), although the table size can be extended to r (rows) by c (columns).

#more information at:

#An example of the analysis of categorical variables is the use of contingency tables (two-way tables).
#Please read about two-way tables at:
#<http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm>

Exercise 4: genetic study about alleles and disease

Suppose we are doing a genetic study relating the effect of two alleles (in a gene) to the presence of a
disease. Previously, we performed a genetic test to determine which allele each test subject has and a disease
test to determine whether the person has the disease.
The data for a 2 x 2 contingency analysis should be entered in the format below, which works for both tests.

contingencyTestData<-
matrix(c(45,67,122,38), nr=2, dimnames=list("Gene"=c("Allele 1","Allele 2"),
"Disease"=c("Yes","No")))

The tests we want to perform with this contingency table are whether or not the two factors, disease and
gene allele, are independent or whether there is a significant relationship between the factors. The null
hypothesis is that there is no relation and the factors are independent.
The Chi-Square test for independence is similar to the goodness of fit test, but uses for the expected values
the values calculated from the marginal probabilities of the data rather than expected values based on a
distribution.
The Chi-Square test statistic is therefore the sum, over all cells, of the squared difference between observed and
expected values divided by the expected value. The degrees of freedom for this test are the number of rows minus
1 times the number of columns minus 1, which is 1 in the case of a 2 x 2 table:

χ² = Σ (Oij − Eij)² / Eij ,   df = (r − 1)(c − 1)   (summing over all rows i and columns j)

where Oij are the observed frequencies fij and Eij are the expected frequencies for row i and column j.
Based on this test, are the factors allele and disease independent?

chisq.test(contingencyTestData)

##
## Pearson’s Chi-squared test with Yates’ continuity correction
##
## data: contingencyTestData
## X-squared = 34.662, df = 1, p-value = 3.921e-09

#Based on this result (p-value of 3.921e-9) we would strongly reject the null
#hypothesis that the factors gene allele and disease are independent and conclude
#that there is a significant relation between the disease and which allele of the
#gene a person has.
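
To make explicit what "expected values calculated from the marginal probabilities" means, the expected table can be inspected directly (a small sketch, not part of the original solution):

# expected counts under independence, computed from the marginal totals
res <- chisq.test(contingencyTestData)
res$expected                  # E_ij = (row total * column total) / grand total
res$observed                  # observed counts, for comparison

# the same expected counts computed by hand
outer(rowSums(contingencyTestData), colSums(contingencyTestData)) /
  sum(contingencyTestData)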

It is also possible to ask about: 1) Calculate the row and column sums (add margins)

# Adding row and column sums
addmargins(contingencyTestData)

## Disease
## Gene Yes No Sum
## Allele 1 45 122 167
## Allele 2 67 38 105
## Sum 112 160 272

2) Calculate the cell percentages of total

# Cell percentages of total


prop.table(contingencyTestData)

## Disease
## Gene Yes No
## Allele 1 0.1654412 0.4485294
## Allele 2 0.2463235 0.1397059

3) Calculate the row percentages

# Row percentages
prop.table(contingencyTestData, 1)

## Disease
## Gene Yes No
## Allele 1 0.2694611 0.7305389
## Allele 2 0.6380952 0.3619048

4) Calculate the column percentages

# Column percentages
prop.table(contingencyTestData, 2)

## Disease
## Gene Yes No
## Allele 1 0.4017857 0.7625
## Allele 2 0.5982143 0.2375

Exercise 5: Students' smoking habit - (OPTIONALLY!) Submit before 18/02/2021

Based partially on http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence


In the built-in data set survey, the Smoke column records the students' smoking habit, while the Exer
column records their exercise level. The allowed values in Smoke are “Heavy”, “Regul” (regularly), “Occas”
(occasionally) and “Never”. As for Exer, they are “Freq” (frequently), “Some” and “None”.
We can tally the students' smoking habit against the exercise level with the table function in R. The result
is called the contingency table of the two variables.

As an aid to the decision, we can represent the contingency table by means of a graph. Is there an association
between smoking and the level of exercise? Propose a plot or a graphical representation in order to observe it.
Problem: Test the hypothesis of whether the students' smoking habit is independent of their exercise level at the
.05 significance level.
Solution: We apply the chisq.test() function to the contingency table tbl and find the p-value.
As the p-value 0.4828 is greater than the .05 significance level, we do not reject the null hypothesis that the
smoking habit is independent of the exercise level of the students.
The warning message found in the solution above is due to the small cell values in the contingency table.
To avoid such a warning, we combine the second and third columns of tbl, save the result in a new table named
ctbl, and then apply the chisq.test function to ctbl instead.
Since the p-value 0.3571 is greater than the .05 significance level, we again do not reject the null hypothesis
that the smoking habit is independent of the exercise level of the students, this time without the warning
message. A sketch of this analysis is given below.
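
Since no code is shown above, here is a minimal sketch (not part of the original hand-out). It assumes the data set referred to is the built-in survey data from the MASS package, as in the r-tutor tutorial cited above; the quoted p-values come from the text.

library(MASS)                              # provides the student survey data set

# contingency table of smoking habit vs exercise level
tbl <- table(survey$Smoke, survey$Exer)
tbl

# one possible graphical representation of the table
barplot(tbl, beside = TRUE, legend.text = TRUE, xlab = "Exercise level")

chisq.test(tbl)                            # p-value approx. 0.4828, with a warning (small expected counts)

# combine the "None" and "Some" exercise columns to avoid small expected counts
ctbl <- cbind(tbl[, "Freq"], tbl[, "None"] + tbl[, "Some"])
chisq.test(ctbl)                           # p-value approx. 0.3571, no warning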
It is also possible to ask about: 1) Calculate the row and column sums (add margins)

2) Calculate the cell percentages of total


3) Calculate the row percentages
4) Calculate the column percentages

Exercise 6 - (OPTIONALLY!) Submit before 18/02/2021


Suppose that there are two variables: the first is gender (male or female) and the second is the cause of
accidents at work (burns) within a laboratory. This pair of variables has been observed
in a random sample of 250 accidents of these types in the last 10 years. Next we classify the individuals
according to their answers and obtain a contingency table that expresses the relation between these two
variables, which can also be displayed as a barplot. A sketch of the workflow is given after the list below.

1) Calculate the adding row and column sums


2) Calculate the cell percentages of total
3) Calculate the row percentages
4) Calculate the column percentages
5) Test the hypothesis of whether the number of accidents is distributed equally between males and females
(testing for equality of proportions between 2 samples based on the chi-square test) at the .05 significance level.
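
Because the original contingency table is not reproduced in this extract, the sketch below uses clearly hypothetical counts (summing to 250) only to illustrate the workflow; the real counts from the exercise table should be substituted.

# HYPOTHETICAL counts, for illustration only - replace with the real table
accidents <- matrix(c(60, 40, 90, 60), nrow = 2,
                    dimnames = list("Gender" = c("Male", "Female"),
                                    "Cause"  = c("Burn", "Other")))

barplot(accidents, beside = TRUE, legend.text = TRUE)   # the barplot mentioned above

addmargins(accidents)         # 1) row and column sums
prop.table(accidents)         # 2) cell percentages of the total
prop.table(accidents, 1)      # 3) row percentages
prop.table(accidents, 2)      # 4) column percentages

# 5) equality of proportions of burns between males and females
#    (prop.test uses a chi-square statistic; chisq.test(accidents) is equivalent)
prop.test(accidents)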

Exercise 7 - (OPTIONALLY!) Submit before 18/02/2021


Using the previously presented sample (1000 binomial samples, each with probability p=0.6 of success, stored in
the vector binom), use R to generate such a binomial vector of sample results for the following situations: sample
size (number of trials) equal to 1, 2, 4, 10, 20 and 60.
Produce graphs in each case, describe your results and try to explain them. For each situation calculate the
mean and the standard deviation and compare your results to the expected ones. A sketch of one possible
approach is given at the end of this exercise.
When “n” is large, working with the Binomial distribution is laborious and complicated, so the mathematician
Abraham de Moivre (1667-1754) proved that, when certain conditions are met, a Binomial distribution can be
approximated by a Normal distribution with mean µ = n·p and standard deviation σ = √(n·p·(1 − p)).
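
One possible way to organize the comparison (a minimal sketch, not part of the original hand-out); the loop repeats the simulation for each requested number of trials and prints observed versus expected mean and standard deviation:

p <- 0.6
for (n in c(1, 2, 4, 10, 20, 60)) {
  binom <- rbinom(1000, n, p)                       # 1000 samples of n trials each
  hist(binom, breaks = 25, col = "red",
       main = paste("n =", n))                      # one graph per situation
  cat("n =", n,
      " observed mean =", round(mean(binom), 3),
      " expected mean =", n * p,
      " observed sd =", round(sd(binom), 3),
      " expected sd =", round(sqrt(n * p * (1 - p)), 3), "\n")
}
# as n grows the histograms become increasingly symmetric and bell-shaped,
# illustrating de Moivre's normal approximation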

