
BMC Bioinformatics BioMed Central

Methodology article Open Access


Statistical analysis of real-time PCR data
Joshua S Yuan1,2, Ann Reed3, Feng Chen1 and C Neal Stewart Jr*1

Address: 1Department of Plant Sciences, University of Tennessee, Knoxville, TN 37996, USA, 2University of Tennessee Institute of Agriculture
Genomics Hub, University of Tennessee, Knoxville, TN 37996, USA and 3Statistical Consulting Center, University of Tennessee, Knoxville, TN
37996, USA
Email: Joshua S Yuan - [email protected]; Ann Reed - [email protected]; Feng Chen - [email protected]; C Neal Stewart* - [email protected]
* Corresponding author

Received: 04 October 2005
Accepted: 22 February 2006
Published: 22 February 2006
BMC Bioinformatics 2006, 7:85 doi:10.1186/1471-2105-7-85
This article is available from: http://www.biomedcentral.com/1471-2105/7/85
© 2006 Yuan et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Even though real-time PCR has been broadly applied in biomedical sciences, data
processing procedures for the analysis of quantitative real-time PCR are still lacking, specifically in
the realm of appropriate statistical treatment. Confidence interval and statistical significance
considerations are not explicit in many of the current data analysis approaches. Based on the
standard curve method and other useful data analysis methods, we present and compare four
statistical approaches and models for the analysis of real-time PCR data.
Results: In the first approach, a multiple regression analysis model was developed to derive ∆∆Ct
from the estimated interaction of gene and treatment effects. In the second approach, an ANCOVA
(analysis of covariance) model was proposed, from which ∆∆Ct can be derived by analysis of the effects
of variables. The other two models involve calculation of ∆Ct followed by a two-group t-test and its
non-parametric analogue, the Wilcoxon test. SAS programs were developed for all four models, and
data output for analysis of a sample set is presented. In addition, a data quality control model was
developed and implemented using SAS.
Conclusion: Practical statistical solutions with SAS programs were developed for real-time PCR
data and a sample dataset was analyzed with the SAS programs. The analysis using the various
models and programs yielded similar results. Data quality control and analysis procedures
presented here provide statistical elements for the estimation of the relative expression of genes
using real-time PCR.

Background
Real-time PCR is one of the most sensitive and reliably quantitative methods for gene expression analysis. It has been broadly applied to microarray verification, pathogen quantification, cancer quantification, transgenic copy number determination and drug therapy studies [1-4]. A PCR has three phases: exponential phase, linear phase and plateau phase, as shown in Figure 1. The exponential phase is the earliest segment in the PCR, in which product increases exponentially since the reagents are not limited. The linear phase is characterized by a linear increase in product as PCR reagents become limited. The PCR will eventually reach the plateau phase during later cycles and the amount of product will not change because some reagents become depleted. Real-time PCR exploits the fact that the quantity of PCR products in exponential phase is in proportion to the quantity of initial template under ideal conditions [5,6]. During the exponential phase PCR


product will ideally double during each cycle if efficiency is perfect, i.e. 100%. It is possible to make the PCR amplification efficiency close to 100% in the exponential phase if the PCR conditions, primer characteristics, template purity, and amplicon lengths are optimal.

Figure 1. Real-time PCR. (A) Theoretical plot of PCR cycle number against PCR product amount. Three phases can be observed for PCRs: exponential phase, linear phase and plateau phase. (B) Theoretical plot of PCR cycle number against the logarithm of PCR product amount. (C) Output of a serial dilution experiment from an ABI 7000 real-time PCR instrument, with the threshold and Ct marked.

Both genomic DNA and reverse transcribed cDNA can be used as templates for real-time PCR. The dynamics of PCR are typically observed through DNA binding dyes like SYBR green or DNA hybridization probes such as molecular beacons (Stratagene) or TaqMan probes (Applied Biosystems) [2]. The basis of real-time PCR is a direct positive association between the dye signal and the number of amplicons. As shown in Figure 1B and 1C, the plot of the base-2 logarithm of the fluorescence signal versus cycle number will yield a linear range in which the logarithm of the fluorescence signal correlates with the original template amount. A baseline and a threshold can then be set for further analysis. The cycle number at the threshold level of log-based fluorescence is defined as the Ct number, which is the observed value in most real-time PCR experiments, and therefore the primary statistical metric of interest.

Real-time PCR data are quantified absolutely and relatively. Absolute quantification employs an internal or external calibration curve to derive the input template copy number. Absolute quantification is important when the exact transcript copy number needs to be determined; however, relative quantification is sufficient for most physiological and pathological studies. Relative quantification relies on the comparison between expression of a target gene versus a reference gene and the expression of the same gene in the target sample versus reference samples [7].

Since relative quantification is the goal for most real-time PCR experiments, several data analysis procedures have been developed. Two mathematical models are very widely applied: the efficiency-calibrated model [7,8] and the ∆∆Ct model [9]. The experimental systems for both models are similar. The experiment will involve a control sample and a treatment sample. For each sample, a target gene and a reference gene for internal control are included for PCR amplification from serially diluted aliquots. Typically several replicates are used for each diluted concentration to derive amplification efficiency. PCR amplification efficiency can be defined either as a percentage (from 0 to 1) or as the fold increase in PCR product per cycle (from 1 to 2). Unless specified as percentage amplification efficiency (PE), amplification efficiency (E) in this article refers to the fold increase in PCR product per cycle (1 to 2). The efficiency-calibrated model is a more generalized ∆∆Ct model. Ct number is first plotted against cDNA input (or logarithm cDNA input), and the slope of the plot is calculated to determine the amplification efficiency


(E). ∆Ct for each gene (target or reference) is then calculated by subtracting the Ct number of the treatment sample from that of the control sample. As shown in Equation 1, the ratio of target gene expression in treatment versus control can be derived from the ratio between target gene efficiency (Etarget) to the power of target ∆Ct (∆Cttarget) and reference gene efficiency (Ereference) to the power of reference ∆Ct (∆Ctreference). The ∆∆Ct model can be derived from the efficiency-calibrated model if both target and reference genes reach their highest PCR amplification efficiency. In this circumstance, both the target efficiency (Etarget) and the reference efficiency (Ereference) equal 2, indicating amplicon doubling during each cycle, and the same expression ratio is derived from $2^{-\Delta\Delta Ct}$ [7,9].

\[ \text{Ratio} = \frac{(E_{\text{target}})^{\Delta Ct_{\text{target}}}}{(E_{\text{reference}})^{\Delta Ct_{\text{reference}}}} \qquad \text{(Equation 1)} \]

where $\Delta Ct_{\text{target}} = Ct_{\text{control}} - Ct_{\text{treatment}}$ for the target gene and $\Delta Ct_{\text{reference}} = Ct_{\text{control}} - Ct_{\text{treatment}}$ for the reference gene.

\[ \text{Ratio} = 2^{-\Delta\Delta Ct} \qquad \text{(Equation 2)} \]

where $\Delta\Delta Ct = \Delta Ct_{\text{reference}} - \Delta Ct_{\text{target}}$.

Even though both the efficiency-calibrated and ∆∆Ct models are widely applied in gene expression studies, few papers thoroughly discuss the statistical considerations in the analysis of the effect of each experimental factor or significance testing. One of the few studies that employed substantial statistical analysis used the REST® program [8]. The software presented in that article is based on the efficiency-calibrated model and employed randomization tests to obtain the significance level. However, the article did not provide a detailed model for the effects of the different experimental factors involved. Another statistical study of real-time PCR data used a simple linear regression model to estimate the ratio through Ct calculation [10]. However, the logarithm-based fluorescence was used as the dependent variable in the model, which we believe does not adequately reflect the nature of real-time PCR data. It follows that Ct should be the dependent variable for statistical analysis, because it is the outcome value directly influenced by treatment, concentration and sample effects. Both studies used the efficiency-calibrated model. Despite the publication of these two methods, many research articles published with real-time PCR data do not present P values and confidence intervals [11-13]. We believe that these statistics are desirable to facilitate robust interpretation of the data.

A priori, we consider the confidence interval and P value of ∆∆Ct data to be very important because these directly influence the interpretation of the ratio. Without proper statistical modeling and analysis, the interpretation of real-time PCR data may lead the researcher to false positive conclusions, which is especially troublesome in clinical applications. We hereby developed four statistical methodologies for processing real-time PCR data using a modified ∆∆Ct method. The statistical methodologies can be adapted to other mathematical models with modifications. SAS programs implementing the methodologies and data control are presented with real-time PCR practitioners in mind for turnkey data analysis. Standard deviations, confidence levels and P values are presented directly from the SAS output. We also included analysis of the sample data set and SAS programs for the analysis in the online supplementary materials.

Results and discussion
Data quality control
From the two mathematical models for relative quantification of real-time PCR data, we observe disparities between data quality standards. For the efficiency-calibrated method, the author who described this procedure [7] assumed that the amplification efficiency for each gene (target and reference) is the same among different experimental samples (treatment and control). In that method, an amplification efficiency of 2 is not required. In contrast, the ∆∆Ct method is more stringent, assuming that all reactions reach an amplification efficiency of 2. In other words, the amount of product should double during each cycle [9]. Moreover, the ∆∆Ct method assumes that the PCR amplification efficiency for each sample will be 2 if PCRs for one set of the samples reach full amplification efficiency. However, this assumption neglects the effect of different cDNA samples.

Data quality could be examined through a correlation model. Even though examining the correlation between Ct number and concentration can provide an effective quality control, a better approach might be to examine the correlation between Ct and the logarithm (base 2) transformed concentration of template, which should yield a significant simple linear relationship for each gene and sample combination. For example, for a target gene in the control sample, the Ct number should correlate with the logarithm transformed concentration following the simple linear regression model in Equation 3. In the equation, Xlcon represents the logarithm transformed concentration, β0 represents the intercept of the regression line, and βcon represents the slope of the regression line [14]. Acceptable real-time PCR data should have two features in the regression analysis. First, the slope should not be significantly different from -1. Second, the slopes for all four combinations of genes and samples as shown in Table 1 should not be significantly different from one another. A


Table 1: The sample real-time PCR data for analysis. In this data set, there are two types of samples (treatment and control); two genes
(reference and target); and four concentrations of each combination of gene and sample. For data quality control and ANCOVA
analysis, the real-time PCR sample data set can be grouped in four groups according to the combination of sample and gene. The
Control-Target combination was named group 1, Treatment-Target group 2, Control-Reference group 3 and Treatment-
Reference group 4.

Replicate Sample Gene Concentration Ct Group (Class)

1 Control Target 10 23.1102 1


2 Control Target 10 22.9003 1
3 Control Target 10 22.8972 1
1 Control Target 2 26.5801 1
2 Control Target 2 26.2139 1
3 Control Target 2 26.0606 1
1 Control Target 0.4 28.1125 1
2 Control Target 0.4 28.1899 1
3 Control Target 0.4 27.5949 1
1 Control Target 0.08 30.2772 1
2 Control Target 0.08 30.4667 1
3 Control Target 0.08 30.7571 1
1 Treatment Target 10 21.7813 2
2 Treatment Target 10 21.7564 2
3 Treatment Target 10 21.641 2
1 Treatment Target 2 23.7965 2
2 Treatment Target 2 23.7571 2
3 Treatment Target 2 23.724 2
1 Treatment Target 0.4 26.3794 2
2 Treatment Target 0.4 26.2542 2
3 Treatment Target 0.4 25.9621 2
1 Treatment Target 0.08 28.5479 2
2 Treatment Target 0.08 28.3894 2
3 Treatment Target 0.08 28.3416 2
1 Control Reference 10 19.7415 3
2 Control Reference 10 19.494 3
3 Control Reference 10 19.3906 3
1 Control Reference 2 21.9838 3
2 Control Reference 2 22.4435 3
3 Control Reference 2 22.57 3
1 Control Reference 0.4 24.8109 3
2 Control Reference 0.4 24.4327 3
3 Control Reference 0.4 24.2342 3
1 Control Reference 0.08 26.7319 3
2 Control Reference 0.08 26.8206 3
3 Control Reference 0.08 26.822 3
1 Treatment Reference 10 18.4468 4
2 Treatment Reference 10 18.8227 4
3 Treatment Reference 10 18.3061 4
1 Treatment Reference 2 21.2568 4
2 Treatment Reference 2 21.0956 4
3 Treatment Reference 2 20.8473 4
1 Treatment Reference 0.4 23.2322 4
2 Treatment Reference 0.4 22.9577 4
3 Treatment Reference 0.4 23.2415 4
1 Treatment Reference 0.08 25.4817 4
2 Treatment Reference 0.08 25.608 4
3 Treatment Reference 0.08 25.5675 4

SAS program was developed to perform the data quality control in Program1_QC.sas (additional file 1).

\[ Ct = \beta_0 + \beta_{con} X_{lcon} + \varepsilon \qquad \text{(Equation 3)} \]

The input data are grouped as shown in Table 1 and additional file 2. Each combination of gene and sample was classified in one group named from 1 to 4. The SAS procedure Proc Mixed was used for performing simple linear


regression for each group based on the model described above. The 95% confidence levels for slopes were estimated, which are expected not to be significantly different from -1. The abbreviated SAS output for the analysis of a sample data set is presented in SASOutput.doc (additional file 3). Slopes for Ct and logarithm transformed concentrations for all four groups were not significantly different from -1 based on the 95% confidence level. In addition to the numeric output, the program also provides a visualization of data quality as shown in Figure 2, where the Ct number is plotted against logarithm transformed template concentration. A simple linear relationship should be observed between the Ct number and logarithm transformed concentration.

Figure 2. Data quality control. The four classes represent four different combinations of sample and gene, which are reference gene in control sample, target gene in control sample, reference gene in treatment sample, and target gene in treatment sample. Each class should derive a linear correlation between Ct and logarithm transformed concentration of PCR product with a slope of -1.

Multiple regression model
Several effects need to be taken into consideration in the ∆∆Ct method, namely, the effects of treatment, gene, concentration, and replicates. If we consider these effects as quantitative variables and relate the Ct number to these multiple effects and their interactions, we can develop a multiple regression model as follows in Equation 4.

\[ Ct = \beta_0 + \beta_{con} X_{icon} + \beta_{treat} X_{itreat} + \beta_{gene} X_{igene} + \beta_{contreat} X_{icon} X_{itreat} + \beta_{congene} X_{icon} X_{igene} + \beta_{genetreat} X_{igene} X_{itreat} + \beta_{congenetreat} X_{icon} X_{itreat} X_{igene} + \varepsilon \qquad \text{(Equation 4)} \]

In this model, Ct is the dependent variable, β0 is the intercept, the βx terms are the regression coefficients for the corresponding X (independent) terms, and ε is the error term [14].
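As an aside on the quality-control regression of Equation 3 reported above, the slope check can be retraced outside SAS. The following Python sketch is not the authors' Program1_QC.sas: it assumes a separate ordinary least-squares fit per group, a hard-coded tabulated t quantile, and illustrative names (`CT_BY_GROUP`, `slope_with_ci`) that do not appear in the paper. It covers only the first criterion (slope versus -1), not the comparison of slopes between groups.

```python
import math

# Ct values transcribed from Table 1, keyed by group number. Within each group
# the order is concentration 10, 2, 0.4, 0.08 with three replicates per level.
CT_BY_GROUP = {
    1: [23.1102, 22.9003, 22.8972, 26.5801, 26.2139, 26.0606,
        28.1125, 28.1899, 27.5949, 30.2772, 30.4667, 30.7571],
    2: [21.7813, 21.7564, 21.641, 23.7965, 23.7571, 23.724,
        26.3794, 26.2542, 25.9621, 28.5479, 28.3894, 28.3416],
    3: [19.7415, 19.494, 19.3906, 21.9838, 22.4435, 22.57,
        24.8109, 24.4327, 24.2342, 26.7319, 26.8206, 26.822],
    4: [18.4468, 18.8227, 18.3061, 21.2568, 21.0956, 20.8473,
        23.2322, 22.9577, 23.2415, 25.4817, 25.608, 25.5675],
}
CONCENTRATIONS = [10, 2, 0.4, 0.08]
REPLICATES = 3
T_CRIT_DF10 = 2.2281  # assumption: tabulated two-sided 95% t quantile, 10 df


def slope_with_ci(ct_values):
    """Least-squares fit of Ct on log2(concentration); returns slope and 95% CI."""
    x = [math.log2(c) for c in CONCENTRATIONS for _ in range(REPLICATES)]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(ct_values) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, ct_values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, ct_values))
    std_err = math.sqrt(sse / (n - 2) / sxx)  # residual df = n - 2 = 10
    half_width = T_CRIT_DF10 * std_err
    return slope, slope - half_width, slope + half_width


qc = {group: slope_with_ci(values) for group, values in CT_BY_GROUP.items()}
for group, (slope, low, high) in qc.items():
    verdict = "consistent with -1" if low <= -1 <= high else "differs from -1"
    print(f"group {group}: slope {slope:.4f}, 95% CI ({low:.4f}, {high:.4f}): {verdict}")
```

Under these assumptions every group's interval contains -1, in line with the conclusion drawn from the SAS output, although the exact limits may differ from those produced by Proc Mixed.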


The model considers the effect of concentration, treatment, gene and their interactions. We are principally interested in the interaction between gene and treatment, which addresses the degree of the Ct differences between target gene and reference gene in treated vs. control samples: i.e., ∆∆Ct. ∆∆Ct can therefore be estimated from the different combinations of values of βgenetreat. The four groups in Table 1 also represent the options of combinational effects of treatment and gene. The goal is to statistically test for differences between target and reference genes in treatment vs. control samples. Therefore, the null hypothesis is that the Ct differences between target and reference genes will be the same in treatment vs. control samples, which can be represented by combinational effect (CE) as: CE1-CE3 = CE2-CE4. An alternative formula will be: CE1-CE2-CE3+CE4 = 0, which will yield an estimation of ∆∆Ct. If the null hypothesis is not rejected, then the ∆∆Ct would not be significantly different from 0; otherwise, the ∆∆Ct can be derived from the estimation of the test. In this way, we can perform a test of different combinational effects of βgenetreat and estimate the ∆∆Ct from it. As shown in the ∆∆Ct formula in Equation 2, if ∆∆Ct is equal to 0, the ratio will be 1, which indicates no change in gene expression between control and treatment.

A SAS program for multiple regression model
SAS procedure PROC GLM was used for ∆∆Ct estimation in Program2_MR.sas in additional file 4. The multiple regression model is stated in a model statement. The combinational effects of gene and treatment are evaluated in the estimate and contrast statements. The null hypothesis of CE1-CE2-CE3+CE4 = 0 is tested in the contrast statement and the parameter estimation yields the ∆∆Ct value. The SAS input file is available in additional file 5 and the SAS output for the multiple regression is in SASOutput.doc (additional file 3).

The SAS output gives a very comprehensive analysis of the data. We are interested in two aspects of the analysis. First, we want to test whether the ∆∆Ct value is significantly different from 0 at P = 0.05. If the ∆∆Ct is not significantly different from 0, then we conclude the treatment does not have a significant effect on target gene expression; otherwise, the opposite is concluded. If the effect is significant, we are interested in the standard deviation of the ∆∆Ct value, from which we can derive the ratio of gene expression as discussed later. The SAS output provides the point estimation (-0.6848) and standard error (0.1185) for the ∆∆Ct. PROC GLM and PROC MIXED are interchangeable in this application. If the experiments involve multiple biological replicates, the replicate effect can also be considered by modifying the SAS program. Then the estimation will be the combined effect of gene, treatment and replicate.

Analysis of covariance and SAS code
Another way to approach the real-time PCR data analysis is by using an analysis of covariance (ANCOVA). A simplified model can be derived by transforming the data into grouped data as shown in Table 1 and additional file 2, resulting in Equation 5.

\[ Ct = \beta_0 + \beta_{con} X_{icon} + \beta_{group} X_{igroup} + \beta_{groupcon} X_{igroup} X_{icon} + \varepsilon \qquad \text{(Equation 5)} \]

We are interested in two questions here. First, are the covariance adjusted averages among the four groups equal? Second, what is the Ct difference of the target gene between treatment and control samples after correction by the reference gene? In this case, the null hypothesis will be (µ2-µ1)-(µ4-µ3) = 0, and the test will yield a parameter estimation of ∆∆Ct as shown in Program3_ANCOVA.sas (additional file 6).

The SAS code implementing the ANCOVA model is similar to that of the multiple regression model. Either of the SAS procedures PROC GLM or PROC MIXED can be employed to implement the ANCOVA model; we used PROC MIXED here. The class statement defines which variables will be grouped for significance testing. In this case, the variables are concentration and group, and ANCOVA assumes that these are co-varying in nature. The contrast and estimate statements were used to contrast the group effect, which will yield ∆∆Ct (-0.6848), as well as its standard error (0.1185) and 95% confidence interval (-0.9262, -0.4435). The SAS output with both confidence level and P value is presented in SASOutputs.doc (additional file 3).

Simplified alternatives: t-test and Wilcoxon two group test
More simplified alternatives can be used to analyze real-time data with biological replicates for each experiment. The primary assumption with this approach is that the additive effect of concentration, gene, and replicate can be adjusted by subtracting the Ct number of the reference gene from that of the target gene, which will provide ∆Ct as shown in Table 2. The ∆Ct for treatment and control can therefore be subjected to a simple t-test, which will yield the estimation of ∆∆Ct.

As a non-parametric alternative to the t-test, a Wilcoxon two group test can also be used to analyze the two pools of ∆Ct values. Two of the assumptions for the t-test are that both groups of ∆Ct will have Gaussian distributions and that they will have equal variances. However, these assumptions are not valid in many real-time PCR experiments using realistically small sample sizes. Therefore a distribution-free Wilcoxon test will be a more robust and appropriate alternative in this case [15].
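The contrast (µ2-µ1)-(µ4-µ3) used in the multiple regression and ANCOVA sections can be retraced by hand for the balanced data in Table 1. The Python sketch below is an illustration, not a translation of Program2_MR.sas or Program3_ANCOVA.sas. It assumes that concentration and group are both treated as class variables (as the class statement description suggests), so that the residual variance comes from the 16 concentration-by-group cells with 32 degrees of freedom; the t quantile is a hard-coded tabulated value, and the names (`CT_BY_GROUP`, `CONTRAST`, `estimate_ddct`) are hypothetical.

```python
import math

# Ct values transcribed from Table 1, keyed by group number. Within each group
# the order is concentration 10, 2, 0.4, 0.08 with three replicates per level.
CT_BY_GROUP = {
    1: [23.1102, 22.9003, 22.8972, 26.5801, 26.2139, 26.0606,
        28.1125, 28.1899, 27.5949, 30.2772, 30.4667, 30.7571],
    2: [21.7813, 21.7564, 21.641, 23.7965, 23.7571, 23.724,
        26.3794, 26.2542, 25.9621, 28.5479, 28.3894, 28.3416],
    3: [19.7415, 19.494, 19.3906, 21.9838, 22.4435, 22.57,
        24.8109, 24.4327, 24.2342, 26.7319, 26.8206, 26.822],
    4: [18.4468, 18.8227, 18.3061, 21.2568, 21.0956, 20.8473,
        23.2322, 22.9577, 23.2415, 25.4817, 25.608, 25.5675],
}
CELL_SIZE = 3                          # replicates per concentration level
CONTRAST = {1: -1, 2: 1, 3: 1, 4: -1}  # (mu2 - mu1) - (mu4 - mu3)
T_CRIT_DF32 = 2.0369  # assumption: tabulated two-sided 95% t quantile, 32 df


def estimate_ddct(ct_by_group):
    """Contrast of group means with a pooled within-cell error variance."""
    sse, n_obs, n_cells = 0.0, 0, 0
    group_mean = {}
    for group, values in ct_by_group.items():
        group_mean[group] = sum(values) / len(values)
        n_obs += len(values)
        for start in range(0, len(values), CELL_SIZE):
            cell = values[start:start + CELL_SIZE]
            cell_mean = sum(cell) / len(cell)
            sse += sum((v - cell_mean) ** 2 for v in cell)
            n_cells += 1
    df = n_obs - n_cells
    mse = sse / df
    estimate = sum(CONTRAST[g] * group_mean[g] for g in group_mean)
    variance_factor = sum(CONTRAST[g] ** 2 / len(ct_by_group[g]) for g in ct_by_group)
    return estimate, math.sqrt(mse * variance_factor), df


ddct, std_err, df = estimate_ddct(CT_BY_GROUP)
ci = (ddct - T_CRIT_DF32 * std_err, ddct + T_CRIT_DF32 * std_err)
ratio = 2 ** (-ddct)  # Equation 2
print(f"ddCt = {ddct:.4f}, SE = {std_err:.4f}, df = {df}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f}), ratio = {ratio:.3f}")
```

Because the design is balanced, plain group means coincide with the adjusted means, which is why this shortcut agrees with the reported estimate (-0.6848), standard error (0.1185) and confidence interval (-0.9262, -0.4435). For unbalanced data a full linear-model fit would be needed instead.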


Table 2: ∆Ct calculation. The table presents the calculation of ∆Ct, which is derived from subtracting Ct number of reference gene
from that of the target gene. Con stands for concentration.

Sample Gene Con Ct Sample Gene Con Ct ∆Ct

Control Target 10 23.1102 Control Reference 10 19.7415 3.3687


Control Target 10 22.9003 Control Reference 10 19.494 3.4063
Control Target 10 22.8972 Control Reference 10 19.3906 3.5066
Control Target 2 26.5801 Control Reference 2 21.9838 4.5963
Control Target 2 26.2139 Control Reference 2 22.4435 3.7704
Control Target 2 26.0606 Control Reference 2 22.57 3.4906
Control Target 0.4 28.1125 Control Reference 0.4 24.8109 3.3016
Control Target 0.4 28.1899 Control Reference 0.4 24.4327 3.7572
Control Target 0.4 27.5949 Control Reference 0.4 24.2342 3.3607
Control Target 0.08 30.2772 Control Reference 0.08 26.7319 3.5453
Control Target 0.08 30.4667 Control Reference 0.08 26.8206 3.6461
Control Target 0.08 30.7571 Control Reference 0.08 26.822 3.9351
Treatment Target 10 21.7813 Treatment Reference 10 18.4468 3.3345
Treatment Target 10 21.7564 Treatment Reference 10 18.8227 2.9337
Treatment Target 10 21.641 Treatment Reference 10 18.3061 3.3349
Treatment Target 2 23.7965 Treatment Reference 2 21.2568 2.5397
Treatment Target 2 23.7571 Treatment Reference 2 21.0956 2.6615
Treatment Target 2 23.724 Treatment Reference 2 20.8473 2.8767
Treatment Target 0.4 26.3794 Treatment Reference 0.4 23.2322 3.1472
Treatment Target 0.4 26.2542 Treatment Reference 0.4 22.9577 3.2965
Treatment Target 0.4 25.9621 Treatment Reference 0.4 23.2415 2.7206
Treatment Target 0.08 28.5479 Treatment Reference 0.08 25.4817 3.0662
Treatment Target 0.08 28.3894 Treatment Reference 0.08 25.608 2.7814
Treatment Target 0.08 28.3416 Treatment Reference 0.08 25.5675 2.7741
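The ∆Ct column of Table 2 and the two-group comparison built on it can be retraced with the Python standard library alone. This is a sketch under stated assumptions rather than the authors' Program4_TW.sas: it pairs target and reference Ct values row by row as in Table 2, uses a pooled-variance (equal variance) t-test with a hard-coded tabulated t quantile, and takes the Hodges-Lehmann shift estimator (the median of all pairwise treatment minus control differences) as the point estimate that accompanies the Wilcoxon two group test. All variable names are illustrative.

```python
import math
from statistics import mean, median, variance

# Ct values from Table 2, in row order (concentration 10, 2, 0.4, 0.08; three replicates each).
TARGET = {
    "Control": [23.1102, 22.9003, 22.8972, 26.5801, 26.2139, 26.0606,
                28.1125, 28.1899, 27.5949, 30.2772, 30.4667, 30.7571],
    "Treatment": [21.7813, 21.7564, 21.641, 23.7965, 23.7571, 23.724,
                  26.3794, 26.2542, 25.9621, 28.5479, 28.3894, 28.3416],
}
REFERENCE = {
    "Control": [19.7415, 19.494, 19.3906, 21.9838, 22.4435, 22.57,
                24.8109, 24.4327, 24.2342, 26.7319, 26.8206, 26.822],
    "Treatment": [18.4468, 18.8227, 18.3061, 21.2568, 21.0956, 20.8473,
                  23.2322, 22.9577, 23.2415, 25.4817, 25.608, 25.5675],
}
T_CRIT_DF22 = 2.0739  # assumption: tabulated two-sided 95% t quantile, 22 df

# Delta Ct = Ct(target) - Ct(reference), as in the last column of Table 2.
delta_ct = {s: [t - r for t, r in zip(TARGET[s], REFERENCE[s])] for s in TARGET}
control, treatment = delta_ct["Control"], delta_ct["Treatment"]

# Pooled-variance two-sample t-test estimate of ddCt (treatment minus control).
ddct = mean(treatment) - mean(control)
n_c, n_t = len(control), len(treatment)
pooled_var = ((n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)) / (n_c + n_t - 2)
std_err = math.sqrt(pooled_var * (1 / n_c + 1 / n_t))
ci = (ddct - T_CRIT_DF22 * std_err, ddct + T_CRIT_DF22 * std_err)

# Hodges-Lehmann estimate: median of all pairwise differences between the two groups.
hodges_lehmann = median(t - c for t in treatment for c in control)

print(f"t-test: ddCt = {ddct:.4f}, SE = {std_err:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"Hodges-Lehmann shift estimate = {hodges_lehmann:.4f}")
```

Here the difference is taken as treatment minus control, so the sign already matches the ∆∆Ct convention used for the regression models. The sketch does not attempt the distribution-free confidence limits that the paper obtains with the 'moses.sas' macro.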

A SAS program has been developed for both the t-test and the Wilcoxon two group test as shown in the attached program Program4_TW.sas (additional file 7). The SAS procedures TTEST and UNIVARIATE were used to analyze the data. The SAS Macro 'moses.sas' [15] in additional file 8 has been employed to derive the confidence levels. The SAS input file is in additional file 9 and the SAS output for sample data analysis is available in SASOutput.doc (additional file 3). Since the estimate of difference derives from subtracting treatment from control sample, the actual ∆∆Ct is the output estimate with its sign reversed.

Comparison of four approaches and data presentation
A comparison of the four approaches is presented in Table 3. Multiple regression and ANCOVA yield exactly the same result for ∆∆Ct estimation, because both methods employ the same mathematical approach for parameter estimation. The t-test provides the same point estimation of ∆∆Ct; however, the standard error is slightly greater, which leads to a larger confidence interval. The Wilcoxon two group test provides an estimation of ∆∆Ct that is slightly smaller in magnitude. The highly similar results from the four approaches validated the models and SAS programs presented. The choice of the models and programs will depend on the experimental design and the stringency and quality of the experiment. However, the most conservative test, owing to its nonparametric nature, is the Wilcoxon two group test, which is distribution-independent.

Data quality control
Many of the current real-time PCR experiments do not include a standard curve design, nor do they use a method to estimate the amplification efficiency. We argue here that real-time PCR data without proper quality controls are not reliable, since the efficiency of real-time PCR could have significant impact on the ratio estimation and dynamic range. For example, if a PCR has a percentage amplification efficiency (PE) of 0.8 (i.e. PCR product will increase $2^{0.8}$ times instead of two times per cycle), a ∆Ct value of 3 can only be transformed into a 5.27-fold difference in ratio (that is, $(2^{0.8})^{3} = 2^{2.4}$) instead of 8-fold. This problem gets amplified when the ∆∆Ct or ∆Ct values are larger and the amplification efficiency is lower, which could lead to severely skewed interpretations.

We therefore propose two standards for real-time PCR data quality control according to the model using the SAS programs presented in this paper. First, experiments with a serial dilution of template need to be included in order to estimate the amplification efficiency of each gene with each sample. Some researchers assume that the amplification efficiency for each gene is the same in different samples because the same primer pair and amplification


Table 3: The comparison of four approaches. The table lists the ∆∆Ct, standard error, P-value and confidence interval derived from the
four methods presented in the article. Neither the SAS package nor the macro used provides the standard error for the Wilcoxon two group
test. We consider the confidence interval to be sufficient for further data transformation.

Model                 ∆∆Ct      Standard Error   P-Value     Confidence Interval

Multiple Regression   -0.6848   0.1185           < 0.0001    (-0.9262, -0.4435)
ANCOVA                -0.6848   0.1185           < 0.0001    (-0.9262, -0.4435)
t-test                -0.6848   0.1303           < 0.0001    (-0.955, -0.4147)
Wilcoxon Test         -0.6354   n/a              < 0.0001    (-0.8805, -0.4227)

conditions are used. However, we found that sample effect does have an impact on the amplification efficiency. In other words, the amplification efficiency could be different for the same gene when amplified from different cDNA template samples. We therefore consider the experimental design with a standard curve for each gene and sample combination to be optimal. Second, under optimal conditions, a plot of the Ct number against the logarithm (base 2) of template amount should yield a slope not significantly different from -1, which indicates an amplification efficiency of nearly 2. Even though both the efficiency-calibrated model and the modified ∆∆Ct model tolerate amplification efficiencies lower than 2, it is most reliable to have all reactions with amplification efficiency approximating 2 through optimizing primer choices, amplicon lengths and experimental conditions. From our experience, maintaining all amplification efficiencies near 2 is the best way to reach equal amplification efficiency among the samples and thus to ensure high quality data. It is also observed that an amplification efficiency near 2 can help to expand the dynamic range of ratio estimation.

P-value, confidence intervals and data presentation
The P-value is an important parameter for significance level, and confidence intervals help to establish the reliable range for ∆∆Ct estimation. Most current real-time PCR publications do not present P-values and confidence intervals [11-13]. We believe disclosing P-values is important when researchers claim that differential expression between the samples or treatments exists. In the program we present, all the P-values are derived from testing the null hypothesis that ∆∆Ct is equal to 0. Therefore, a small P-value indicates that the ∆∆Ct is significantly different from 0, which demonstrates a significant effect. The interpretation of a P-value will depend on the experimental objectives. For example, at P = 0.05 in a treatment versus control experiment, we can claim that the treatment has a significant effect; and in a tissue comparison experiment, we can claim that the gene expression is significantly different among the tissues.

Some publications present a standard deviation of the ratio as a meaningful metric. However, we argue here that the standard deviation of the ratio should be derived from the standard deviation of ∆∆Ct, and the confidence interval of the ratio should be derived from the confidence interval of ∆∆Ct. In other words, the point estimation of the ratio should be $2^{-\Delta\Delta Ct}$ and the confidence interval for the ratio should be $(2^{-\Delta\Delta Ct_{HCL}},\ 2^{-\Delta\Delta Ct_{LCL}})$, where ∆∆CtHCL and ∆∆CtLCL are the upper and lower confidence limits of ∆∆Ct. Since Ct is the observed value from experimental procedures, it should be the subject of statistical analysis. The practice of performing statistical analysis on the ratio directly is not appropriate. The presentation of data needs to refer to the ∆∆Ct and subsequently the ratio and confidence intervals derived from $2^{-\Delta\Delta Ct}$.

Statistical analysis for real-time PCR data with amplification efficiency less than 2
As stated before, the PCR amplification efficiency can be optimized to be approximately 2 with proper amplification primers, RNA quality, and cDNA synthesis protocol. Recent advancements in real-time PCR primer design have allowed easier experimental optimization [16,17]. However, less than ideal real-time PCR data can occur regardless of stringent control of experimental conditions. There are three scenarios for suboptimal real-time PCR data. In the first scenario, all of the PCR reactions have the same amplification efficiency, yet the efficiency differs from 2. In the second scenario, the PCR amplification efficiency differs by gene only. In other words, the amplification efficiency is the same for the same gene in all the biological samples; however, the amplification efficiency varies among the different genes. In the third scenario, the PCR amplification efficiency differs both by gene and by sample. We considered the data in the third scenario as unacceptable, as many others have reported [10,18]. In any of these scenarios, the adjusted ∆∆Ct can be derived from the ANCOVA model by including the PE in the 'estimate' and 'contrast' statements of the SAS program.

Several approaches have been developed to calculate the amplification efficiency in low quality data. One such approach is the so-called 'dynamic data analysis', in which the fluorescence history of a PCR reaction is employed to calculate the amplification efficiency [19,20]. The advantage of the approach lies in the capacity to analyze low quality data and the economy in cost by avoiding the standard curve. However, due to the mathematical complexity and the reliability controversy, this method is not as widely applied as the traditional standard curve method [10,16,18]. In our method, a standard curve already exists and can be used to derive amplifica-

The Program5_LowQuatilityData.sas in additional file 10 provides the solution to derive the adjusted ∆∆Ct in the first and second scenarios. A data set with amplification efficiency different by gene is provided in LowQualityData.txt in additional file 11 to illustrate the use of the SAS
tion efficiency (E). Considering the simple linear regres- program. The data set is of lower quality mainly because
sion model in Equation 3, if Xlcon represents 10 based of the limited number of replicates involved in the exper-
logarithm transformed concentration, the amplification iment. Four steps are involved in calculating the ∆∆Ctadjust.
The first step is to perform the data quality control test as
efficiency (E) is 10-(1/slope) or 10−(1 / βcon ) according to
shown in Methods. From the SAS output, we can conclude
Ramussen 2001 and Pfaffl 2001 [7,21]. In our model, Xlcon that the LowQualityData dataset does not meet the
represents the 2 based logarithm transformed concentra- requirements for 2-∆∆Ct method, since one group of PCR
tion, the amplification efficiency (E) therefore is 2-(1/slope) has amplification efficiency significantly different from 1
as shown in the data quality control for LowQualityData
or 2−(1 / βcon ) , where the PE can be represented as -(1/
dataset part of SASOutput.doc (additional file 3).
βcon).
The second step is to test the equal PCR efficiency (or
In the first scenario discussed above, all PCR amplifica- slope) by observing the Type III sums of squares for lcon
tion have the same efficiency, but the efficiency is not and class interaction. A low p value will indicate the inter-
equal to 1. Then the ratio of gene expression can be repre- action of different groups of PCR (class) with logarithm
sented in the following equation. transformed concentration, which in turn indicates the
unequal slope among different groups of PCR. If all PCR
Ratio = E− ∆∆Ct = [2−(1 / β con ) ]− ∆∆Ct = 2− ∆∆Ct*PE Equation 6 amplification efficiency are equal, then the pooled ampli-
fication efficiency can be calculated and integrate into the
whereas PE = -(1/βcon), and ∆∆Ctadjust = PE*∆∆Ct SAS program for ∆∆Ctadjust calculation. In this set of data,
the Type III sums of squares has a p value smaller than
In the Equation 6, βcon is the pooled slope of the plot with 0.05, and the amplification efficiency are not equal for all
Ct against logarithm 2 based concentration. The βcon can PCRs. Tests of equal slopes are then performed for each
be calculated with a correlation function in SAS as shown gene to decide whether PCR amplification efficiency is the
in Program5_LowQualityData.sas in Additional file 10. In same for each gene. For either gene, the amplification effi-
the second scenario, the amplification efficiency differs by ciency is not significantly different with an α of 0.05. All
gene only. According to Equation 1, we have the following of the Type III sums of squares outputs can be found in
equation, in which the β0 is the pooled slope of the plot of SASOutput.doc (additional file 3).
Ct against log2 (concentration) for each gene.
The next step is to calculate the pooled slope (βcon) for
∆Ctt arg et
(Et arg et ) [2
−(1 / β conT arg et ) ∆Ctt arg et
] (PEt arg et ∗∆Ctt arg et − PEcontrol ∗∆Ct control ) each gene to derive the percentage amplification efficiency
Ratio = = =2 Equation 7
(Econtrol )∆Ctcontrol [2−(1 / β conControl ) ]∆Ctcontrol (PE = -(1/βcon)) for each gene. The pooled slopes are
derived based on the correlation between Ct and loga-
whereas PEtarget = -(1/βconTarget), PEcontrol = -(1/βconControl), rithm 2 based concentrations. The βcons for the two genes
and ∆∆Ctadjust = PEtarget*∆Cttarget-PEcontrol*∆Ctcontrol are -1.0813 and -1.0137 respectively as shown in SASOut-
put.doc (additional file 3) for the amplification efficiency
In the Equation 7, βconTarget and βconControl are the pooled calculation of LowQualityData dataset. With the βcon, -(1/
slope for the plot of Ct against logarithm 2 based concen- βcon) or PE can be calculated for each gene as 0.925 and
tration for target gene and reference gene respectively. The 0.987 respectively. The ∆∆Ctadjust can then be computed
slopes can be calculated by the with PEs substituting the 1 for each gene in the 'estimate'
Program5_LowQualityData.sas (additional file 10). The and 'contrast' statement. The SAS program is as follows in
∆∆Ctadjust can be calculated with the same program. Theo- additional file 10.
retically, an equation can also be derived for the third sce-
nario when PCR amplification efficiency differs both by Title 2 'Calculate the deltadeltaCt with Adjusted effi-
gene and by sample. However, in actual application, we ciency';
don't consider the data in the third scenario as acceptable
due to the significant variation of the amplification effi- PROC MIXED data=TR2 Order=Data;
ciency [10,18].
CLASS Class Con;


MODEL Ct = Con Class Con*Class/SOLUTION NOINT;
/* The PE values (0.925 and 0.987) take the place of the coefficient 1 for each gene */
Contrast 'Intercepts' Class 0.925 -0.925 -0.987 0.987;
Estimate 'Intercepts' Class 0.925 -0.925 -0.987 0.987/cl;
Run;

The SAS output for the analysis is in SASOutput.doc (additional file 3). The ∆∆Ctadjust is therefore -1.0901 and the change is significant since the p value is very small. The ratio can be represented as discussed in the standard ∆∆Ct method. The point estimation of the ratio in this example is 2.129, and the 95% confidence interval is (1.926, 2.353).

Overall, in the less optimized PCR reactions, statistical analysis is not only complicated but also compromised for precision and efficiency. Therefore caution should be exercised when performing statistical analysis with low quality real-time PCR data, which may easily introduce error due to the efficiency adjustment [10,18].

Conclusion
In this report, we presented four models of statistical analysis of real-time PCR data and one procedure for data quality control. SAS programs were developed for all the applications and a sample set of data was analyzed. The analyses with different models and programs yielded the same estimation of ∆∆Ct and similar confidence intervals. The data quality control and analysis procedures will help to establish robust systems to study relative gene expression with real-time PCR.

Methods
Plant material, RNA extraction, real-time PCR and sample data set
The sample data set (Table 1) used for the analysis came from the experiment described below. Arabidopsis thaliana (Col1) plants were grown in the growth chamber at 23°C with 14 hours of light for four weeks. Total RNA was isolated with RNeasy Plant Mini Kit (Qiagen, Inc.) from methyl-jasmonate treated Arabidopsis, alamethicin treated Arabidopsis and control plants, and DNA contamination was removed with an on-column DNase (Qiagen, Inc.) treatment. One microgram of total RNA was reverse-transcribed into first strand cDNA in a 20 µL reaction using iScript cDNA synthesis kit (BioRad Laboratories). cDNA was then diluted into a 10 ng/µL, 2 ng/µL, 0.4 ng/µL and 0.08 ng/µL concentration series. Three replicates of real-time PCR experiments were performed for each concentration using an ABI 7000 Sequence Detection System (Applied Biosystems). Ubiquitin was used as the reference gene, and the primer sequences for the Arabidopsis ubiquitin gene were CACACTCCACTTGGTCTTGCG (F) and TGGTCTTTCCGGTGAGAGTCTTCA (R). The primers for the target gene (MT_7) were designed by Primer Express software (Applied Biosystems) and the sequences were CCGCGGTACAAACCTTAATT (F) and TGGAACTCGATTCCCTCAAT (R). MT-7 gene is the Arabidopsis thaliana gene At3g44860 encoding a protein with high catalytic specificity for farnesoic acid [22]. Primer titration and dissociation experiments were performed so that no primer dimers or false amplicons would interfere with the result. After the real-time PCR experiment, Ct number was extracted for both reference gene and target gene with auto baseline and manual threshold.

Real-time PCR experimental design, data output, transformation, and programming
A main limitation of the efficiency calibrated method and the ∆∆Ct method is that only one set of cDNA samples is employed to determine the amplification efficiency. It was assumed that the same amplification efficiency could be applied to other cDNA samples as long as the primers and amplification conditions are the same. However, amplification efficiency not only depends on the primer characteristics, but also varies among different cDNA samples. Using a standard curve for only one set of tested samples to derive the amplification efficiency might overlook the error introduced by sample differences. In our experimental design, we have performed standard curve experiments with four concentrations of three replicates for all samples and genes involved. The ∆∆Ct is derived from the standard curves only, and the data quality is examined for each gene and sample combination. The analysis of two samples is presented in the paper as an example. A minimum of two replicates at three concentrations will be required for each sample. Even though more effort is required, the data are more reliable as a result of stringent data quality control and data analysis based on statistical models.

The output dataset included Ct number, gene name, sample name, concentration and replicate. We used Microsoft® Excel to open the exported Ct file from an ABI 7000 sequence analysis system and then to transform the data into a tab delimited text file for SAS processing. The sample data set is shown in Table 1.

All programs were developed with SAS 9.1 (SAS Institute).

Authors' contributions
JSY carried out the real-time PCR experiments, developed the statistical model and SAS programs for analysis, and drafted the article. AR provided assistance in SAS programming and data modeling. FC provided assistance in real-time PCR experiments. CNS provided oversight of the


work, conceptualized non-parametric elements, and finalized the draft.

Additional material

Additional file 1
The SAS program implements the data quality control processes.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S1.sas]

Additional file 2
The data provides the input data for Program1_QC.sas and Program3_ANCOVA.sas. The data has grouped the Ct values according to the different combination of sample and gene.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S2.txt]

Additional file 3
The abbreviated SAS output for all the analyses.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S3.doc]

Additional file 4
The SAS program implements the multiple regression model and derives ∆∆Ct.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S4.sas]

Additional file 5
The input data for Program2_MR.sas, which contains sample, gene, concentration and Ct number.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S5.txt]

Additional file 6
The SAS program implements the ANCOVA model and derives ∆∆Ct.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S6.sas]

Additional file 7
The SAS program performs both student t test and Wilcoxon two group tests on the ∆Ct to derive ∆∆Ct.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S7.sas]

Additional file 8
The SAS program is a macro that derives confidence interval for Wilcoxon two group tests.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S8.sas]

Additional file 9
The input data for Program4_TW.sas. The data contains only sample name and ∆Ct.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S9.txt]

Additional file 10
The SAS program performs test for equal slope, grouped slope and adjusted ∆∆Ct.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S10.sas]

Additional file 11
The input data for Program5_LowQualityData.sas.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-85-S11.txt]

References
1. Klein D: Quantification using real-time PCR technology: applications and limitations. Trends in Mol Med 2002, 8:257-260.
2. Bustin SA: Absolute quantification of mRNA using real-time reverse transcription polymerase chain reaction assays. J Mol Endocrinol 2000, 25:169-93.
3. Mocellin S, Rossi CR, Pilati P, Nitti D, Marincola FD: Quantitative real-time PCR: a powerful ally in cancer research. Trends in Mol Med 2003, 9:189-195.
4. Mason G, Provero P, Vaira AM, Accotto GP: Estimating the number of integrations in transformed plants by quantitative real-time PCR. BMC Biotechnology 2002, 2:20.
5. Heid CA, Stevens J, Livak KJ, Williams PM: Real time quantitative PCR. Genome Res 1996, 6:986-994.
6. Gibson UE, Heid CA, Williams PM: A novel method for real time quantitative RT-PCR. Genome Res 1996, 6:995-1001.
7. Pfaffl MW: A new mathematical model for relative quantification in real-time RT-PCR. Nucl Acids Res 2001, 29:2002-2007.
8. Pfaffl MW, Horgan GW, Dempfle L: Relative expression software tool (REST(C)) for group-wise comparison and statistical analysis of relative expression results in real-time PCR. Nucl Acids Res 2002, 30:e36.
9. Livak KJ, Schmittgen TD: Analysis of relative gene expression data using real-time quantitative PCR and the 2^(-∆∆CT) method. Methods 2001, 25:402-408.
10. Cook P, Fu C, Hickey M, Han ES, Miller KS: SAS programs for real-time RT-PCR having multiple independent samples. Biotechniques 2004, 37:990-995.
11. Zenoni S, Reale L, Tornielli GB, Lanfaloni L, Porceddu A, Ferrarini A, Moretti C, Zamboni A, Speghini A, Ferranti F, Pezzotti M: Downregulation of the Petunia hybrida α-expansin gene PhEXP1 reduces the amount of crystalline cellulose in cell walls and leads to phenotypic changes in petal limbs. Plant Cell 2004, 16:295-308.
12. Eleaume H, Jabbouri S: Comparison of two standardisation methods in real-time quantitative RT-PCR to follow Staphylococcus aureus genes expression during in vitro growth. J Micro Methods 2004, 59:363-370.
13. Shen H, He LF, Sasaki T, Yamamoto Y, Zheng SJ, Ligaba A, Yan XL, Ahn SJ, Yamaguchi M, Hideo S, Matsumoto S: Citrate secretion coupled with the modulation of soybean root tip under aluminum stress. Up-Regulation of transcription, translation, and threonine-oriented phosphorylation of plasma membrane H+-ATPase. Plant Physiol 2005, 138:287-296.
14. Kutner MH, Nachtsheim CJ, Neter J, William L: Applied Linear Statistical Models Fifth edition. McGraw-Hill, Irwin, CA; 2005.
15. Hollander M, Wolfe DA: Nonparametric Statistical Methods John Wiley and Sons, New York; 1973:503.


16. Bustin SA, Benes V, Nolan T, Pfaffl MW: Quantitative real-time RT-PCR – a perspective. J Mol Endocrinol 2005, 34:597-601.
17. Pattyn F, Speleman F, De Paepe A, Vandesompele J: RTPrimerDB: the real-time PCR primer and probe database. Nucl Acids Res 2003, 31:122-123.
18. Peirson SN, Butler JN, Foster RG: Experimental validation of novel and conventional approaches to quantitative real-time PCR data analysis. Nucl Acids Res 2003, 31:e73.
19. Liu W, Saint DA: A new quantitative method of real-time reverse transcription polymerase chain reaction assay based on simulation of polymerase chain reaction kinetics. Anal Biochem 2002, 302:52-59.
20. Tichopad A, Dilger M, Schwarz G, Pfaffl MW: Standardized determination of real-time PCR efficiency from a single reaction set-up. Nucl Acids Res 2003, 31:e122.
21. Rasmussen R: Quantification on the LightCycler. In Rapid Cycle Real-time PCR, Methods and Applications; Heidelberg Edited by: Meuer S, Wittwer C, Nakagawara K. Springer Press; 2001:21-34.
22. Yang Y, Yuan JS, Ross J, Noel JP, Pichersky E, Chen F: An Arabidopsis thaliana methyltransferase capable of methylating farnesoic acid. Arch Biochem Biophys 2005 in press. Corrected Proof
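
As a supplementary illustration of the efficiency adjustment described in the section on amplification efficiency less than 2, the arithmetic behind PE, the adjusted ∆∆Ct and the back-transformed ratio might be sketched in Python as below. This is only a minimal sketch and not the authors' SAS implementation: the function names are hypothetical, the slopes are assumed to come from a regression of Ct on log2 concentration, and the ratio is assumed to be 2 raised to the negative adjusted ∆∆Ct, as in the standard ∆∆Ct method. Because the slopes quoted in the text are rounded, the last digit of a computed PE may differ slightly from the printed values.

```python
def percentage_efficiency(slope):
    """PE = -(1/beta_con); slope is Ct regressed on log2(concentration)."""
    if slope >= 0:
        raise ValueError("a standard-curve slope is expected to be negative")
    return -1.0 / slope


def amplification_efficiency(slope):
    """E = 2 ** (-1/slope); E is 2 when the slope is exactly -1."""
    return 2.0 ** percentage_efficiency(slope)


def adjusted_ddct_same_efficiency(ddct, pooled_slope):
    """First scenario (Equation 6): ddCt_adjust = PE * ddCt."""
    return percentage_efficiency(pooled_slope) * ddct


def adjusted_ddct_by_gene(dct_target, dct_control, slope_target, slope_control):
    """Second scenario (Equation 7): PE_target*dCt_target - PE_control*dCt_control."""
    return (percentage_efficiency(slope_target) * dct_target
            - percentage_efficiency(slope_control) * dct_control)


def ratio_from_ddct(ddct_adjust):
    """Expression ratio, assumed here to be 2 ** (-ddCt_adjust)."""
    return 2.0 ** (-ddct_adjust)


def ratio_interval(ddct_lower, ddct_upper):
    """Back-transform a ddCt confidence interval to the ratio scale.

    2 ** (-x) is a decreasing function, so the two bounds swap places.
    """
    return ratio_from_ddct(ddct_upper), ratio_from_ddct(ddct_lower)


if __name__ == "__main__":
    # Pooled slopes and adjusted ddCt quoted in the text for LowQualityData.
    for slope in (-1.0813, -1.0137):
        print(f"slope={slope}: PE={percentage_efficiency(slope):.3f}")
    print(f"ratio={ratio_from_ddct(-1.0901):.3f}")
```

With the adjusted ∆∆Ct of -1.0901 reported in the text, `ratio_from_ddct` gives a value that agrees with the reported point estimate of 2.129 to three decimal places. The confidence limits themselves would still need to come from the model output, since they depend on the standard error of the estimate.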
