(Methods in Molecular Biology 856) Christian N. K. Anderson, Liang Liu, Dennis Pearl, Scott V. Edwards (auth.), Maria Anisimova (eds.)-Evolutionary Genomics_ Statistical and Computational Method (1).pdf
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Evolutionary Genomics
Statistical and Computational Methods, Volume 2
Edited by
Maria Anisimova
Department of Computer Science, Swiss Federal Institute of Technology (ETHZ),
Zürich, Switzerland
Swiss Institute of Bioinformatics, Lausanne, Switzerland
Editor
Maria Anisimova, Ph.D.
Department of Computer Science
Swiss Federal Institute of Technology (ETHZ)
Zurich, Switzerland
Swiss Institute of Bioinformatics
Lausanne, Switzerland
The photo used for the book cover was taken by one of the authors of the book, Wojciech Makałowski.
ISSN 1064-3745
e-ISSN 1940-6029
ISBN 978-1-61779-584-8
e-ISBN 978-1-61779-585-5
DOI 10.1007/978-1-61779-585-5
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2012931005
© Springer Science+Business Media, LLC 2012
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of
the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013,
USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified
as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Humana Press is part of Springer Science+Business Media (www.springer.com)
Preface
Discovery of genetic material propelled the power of classical evolutionary studies across
the diversity of living organisms. Together with early theoretical work in population
genetics, the debate on sources of genetic makeup initiated by proponents of the neutral
theory made a solid contribution to the spectacular growth in statistical methodologies for
molecular evolution. The methodology developed focused primarily on inferences from
single genes or noncoding DNA segments: mainly on reconstructing the evolutionary
relationships between lineages and estimating evolutionary and selective forces. Books
offering a comprehensive coverage of such methodologies have already appeared, with
Joe Felsenstein's Inferring Phylogenies and Ziheng Yang's Computational Molecular
Evolution among the favorites.
This volume is intended to review more recent developments in the statistical methodology and the challenges that followed as a result of rapidly improving sequencing
technologies. While the first sequenced genome (RNA virus Bacteriophage MS2 in
1976) was not even 4,000 nucleotides long, the sequencing progress culminated with
the completion of the human genome of about 3.3 × 10^9 base pairs and advanced to
sequence many other species' genomes, heading ambitiously towards population sequencing projects such as the 1,000 genome projects for humans and Drosophila melanogaster.
Next-generation sequencing (NGS) technologies sparked the genomics revolution,
which triggered a renewed effort towards the development of statistical and computational
methods capable of coping with the flood of genomic data and its inherent complexity.
The challenge of analyzing and understanding the dynamics of large-system data can
be met only through an integration of organismal, molecular, and mathematical disciplines.
This requires commitment to an interdisciplinary approach to science, where both experimental and theoretical scientists from a variety of fields understand each other's needs and
join forces. Evidently, there remains a gap to be bridged. This book presents works by top
scientists from a variety of disciplines, each of whom embodies the interdisciplinary spirit of
evolutionary genomics. The collection includes a wide spectrum of articles, encompassing
theoretical works and hands-on tutorials, as well as many reviews with much biological
insight.
The evolutionary approach is clearly gaining ground in genomic studies, for it enables
inferences about patterns and mechanisms of genetic change. Thus, the theme of evolution
streams through each chapter of the book, providing statistical models with basic assumptions and illustrated with appealing biological examples. This book is intended for a wide
scientific audience interested in a compressed overview of the cutting-edge statistical
methodology in evolutionary genomics. Equally, this book may serve as a comprehensive
guide for graduate or advanced undergraduate students specializing in the fields of genomics or bioinformatics. The presentation of the material in this volume is aimed to equally
suit both a novice in biology with strong statistics and computational skills and a molecular
biologist with a good grasp of standard mathematical concepts. To cater for differences in
reader backgrounds, Part I of Volume 1 is composed of educational primers to help with
fundamental concepts in genome biology (Chapters 1 and 2), probability and statistics
(Chapter 3), and molecular evolution (Chapter 4). As these concepts reappear repeatedly
throughout the books, the first four chapters will help the neophyte to stay afloat.
The exercises and questions offered at the end of each chapter serve to deepen the
understanding of the material. Additional materials and some solutions to exercises can
be found online: http://www.evolutionarygenomics.net.
Part II of this volume reviews state-of-the-art techniques for genome assembly (Chapter
5), gene finding (Chapter 6), sequence alignment (Chapters 7 and 8), and inference of
orthology, paralogy (Chapter 9), and laterally transferred genes (Chapter 10). Part III opens
with a comparative review of genome evolution in different breeding systems (Chapter 11)
and then discusses genome evolution in model organisms based on the studies of transposable elements (Chapters 12 and 13), gene families, synteny (Chapter 14), and gene order
(Chapters 15 and 16).
Part I of Volume 2 is the evidence that, since embracing Darwin's tree-like representation of evolution and pondering over the universal Tree of Life, the field has moved on.
Nowadays, the evolutionary biologists are well aware of numerous evolutionary processes
that distort the tree, complicating the statistical description of models and increasing
computational complexity, often to prohibitive levels. Each taking a different angle, the
chapters of Part I, Volume 2 discuss how to overcome problems with phylogenetic
discordance, as the Tree of Life turns out to be more like a forest (Chapter 3).
The multispecies coalescent model offers one solution to reconciling phylogenetic discord
between gene and species trees (Chapter 1); others pursue probabilistic reconciliation
for gene families based on a birth-death model along a species phylogeny (Chapter 2).
By some perspectives, constraining the understanding of evolution solely with tree-like
structures omits many important biological processes that are not tree-like (Chapter 4).
Most fundamental questions in genome biology strive to disentangle the evolutionary
forces shaping species' genomes, inferring evolutionary history, and understanding how
molecular changes affect genomic and phenotypic characteristics. To this goal, Part II
of Volume 2 introduces methods for detecting and reconciling selection (Chapters 5
and 6) and recombination (Chapters 9 and 10), and discusses the mechanisms for the
origins of new genes (Chapter 7) and the evolution of protein domain architectures
(Chapter 8). The role of natural selection in shaping genomes is a pinnacle of the classical
neutralist-selectionist debate and sets an important theme of the book; the neoselectionist model of genome evolution is tested on many counts. This theme is also
apparent in Part III dedicated to population genomics, which starts by discussing models
for genetic architectures of complex disease and the power of genome-wide association
studies (GWAS) for finding susceptibility variants (Chapter 11). With the availability of
multiple genomes from closely related species, gleaning the ancestral population history
also became possible, as is illustrated in the following chapter (Chapter 12). Most population
genetics problems rely on ancestral recombination graphs (ARG), and reducing the redundancy of the ARG structure helps to reduce the computational complexity (Chapter 13).
Entering the era of postgenomics biology, recent years have seen rapid growth of
complementary genomic data, such as data on expression and regulation, chemical and
metabolic pathways, gene interactions and networks, disease associations, and more.
Considering the genome as a uniform collection of coding and noncoding molecular
sequences is no longer an option. To address this, great efforts are currently dedicated to
embrace the complexity of biological systems through the emerging -omics disciplines,
the focus of Part IV of this volume. Chapter 14 discusses ways to study the evolution of
gene expression and regulation based on data from old-fashioned microarrays as well
as transcriptomics data obtained with NGS such as RNAseq and ChIPseq. Interactomics
is the focus of the next chapter. Indeed, better understanding of genes, their diversity
and regulation comes from studies of interaction between their protein products and
networks of interacting elements (Chapter 15). Further topics include metabolomics
(Chapter 16), metagenomics (Chapter 17), epigenomics (Chapter 18), and the newly
reinvented discipline with a mysterious name, genetical genetics (Chapter 19). Despite
the effort, complex dependencies and causative effects are difficult to infer. A way forward
must be in the integration of complementary -omics information with genomic sequence
data to understand the fundamentals of systems biology in living organisms. This cannot be
achieved without studying how such information changes over time and across various
conditions. Vast amounts of multifaceted data promise a big future for machine learning,
pattern recognition and discovery, and efficient data mining techniques, as can be seen
from many chapters of this book.
Finally, Part V of the second volume focuses on challenges and approaches for large
and complex data representation and storage (Chapter 20). The rapid pace of computational genomics, as well as research transparency and efficiency, exacerbates the need for
sharing of data and programming resources. Fortunately, some solutions already exist
(Chapter 21). Handling ever increasing amounts of computation requires efficient computing strategies, which are discussed in the closing chapter of the book (Chapter 22).
For a novice in the field, this book is certainly a treasure chest of state-of-the-art
methods to study genomic and omics data. I hope that this collection will motivate
both young and experienced readers to join the interdisciplinary field of evolutionary
genomics. But even the experienced bioinformatician reader is certain to find a few
surprises. On behalf of all authors, I hope that this book will become a source of inspiration
and new ideas for our readers. Wishing you a pleasant reading!
Zürich, Switzerland
Acknowledgments
The foremost gratitude goes to the authors of this book who came together to make this
resource possible and who were enthusiastic and encouraging about the whole project.
Over 100 reviewers have contributed to improving the quality and the clarity of the
presentation with their constructive and detailed comments. Some reviewers have agreed
to be acknowledged by name. With great pleasure, I list them here:
Tyler Alioto, Peter Andolfatto, Miguel Andrade, Irena Artamonova, Richard M.
Badge, David Balding, Mark Beaumont, Chris Beecher, Robert Beiko, Adam Boyko,
Katarzyna Bryc, Kevin Bullaughey, Margarida Cardoso-Moreira, Julian Catchen, Annie
Chateau, Karen Cranston, Karen Crow, Tal Dagan, Dirk-Jan de Koning, Christophe
Dessimoz, Mario dos Reis, Katherine Dunn, Julien Y. Dutheil, Toni Gabaldon, Nicolas
Galtier, Mikhail Gelfand, Josefa Gonzalez, Maja Greminger, Stephane Guindon, Michael
Hackenberg, Carolin Kosiol, Mary Kuhner, Anne Kupczok, Nicolas Lartillot, Adam
Leache, Gerton Lunter, Thomas Mailund, William H. Majoros, James McInerney,
Gabriel Musso, Pjotr Prins, David A. Ray, Igor Rogozin, Mikkel H. Schierup, Adrian
Schneider, Daniel Schoen, Cathal Seoighe, Erik Sonnhammer, Andrea Splendiani, Tanja
Stadler, Manuel Stark, Krister Swenson, Adam M. Szalkowski, Gergely J. Szöllősi, Jijun
Tang, Todd Treangen, Oswaldo R. Trelles Salazar, Albert Vilella, Rutger Vos, Tom
Williams, Carsten Wiuf, Yuri Wolf, Xuhua Xia, S. Stanley Young, Olga Zhaxybayeva, and
Stefan Zoller.
My colleagues from the Computational Biochemistry Research Group at ETH Zurich
deserve much credit for being a constant source of inspiration and for providing such an
enjoyable working environment. Finally, but no less importantly, I would like to thank my
family for their love and for tolerating the overtime that this project required.
Contents
Preface
Contributors
PART I PHYLOGENOMICS (Chapters 1-4)
PART II (Chapters 5-10)
PART III POPULATION GENOMICS (Chapters 11-13)
PART IV THE -OMICS (Chapters 14-19)
PART V (Chapters 20-22)
Index
Contributors
CHRISTIAN N.K. ANDERSON • Department of Organismic and Evolutionary Biology & Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
PETER ANDOLFATTO • Department of Ecology and Evolutionary Biology, The Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
MARIA ANISIMOVA • Department of Computer Science, Swiss Federal Institute of Technology (ETHZ), Zurich, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
YASSEN ASSENOV • Max Planck Institute, Saarbrücken, Germany
ADAM AUTON • Wellcome Trust Centre for Human Genetics, Oxford, UK
DAVID BANKS • Department of Statistical Science, Duke University, Durham, NC, USA
ERIC BAPTESTE • UMR CNRS 7138, UPMC, Paris, France
DOMINIQUE BELHACHEMI • Section of Biomedical Image Analysis, Department of Radiology, University of Pennsylvania, Philadelphia, PA, USA
SØREN BESENBACHER • deCODE Genetics, Reykjavik, Iceland; Bioinformatics Research Center, Aarhus University, Aarhus, Denmark
CHRISTOPH BOCK • Max Planck Institute, Saarbrücken, Germany; Broad Institute, Cambridge, MA, USA
FRÉDÉRIC BOUCHARD • Département de Philosophie, Université de Montréal, Station Centre-ville, Montréal, Québec, Canada
RICHARD M. BURIAN • Department of Philosophy, Virginia Tech, Blacksburg, VA, USA
MARGARIDA CARDOSO-MOREIRA • Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
VINCENT DAUBIN • UMR CNRS 5558, LBBE, Biométrie et Biologie Évolutive, UCB Lyon 1, Villeurbanne, France
JULIEN Y. DUTHEIL • Institut des Sciences de l'Évolution Montpellier (ISE-M), UMR 5554, CNRS, Université Montpellier, Montpellier, France
SCOTT V. EDWARDS • Department of Organismic and Evolutionary Biology & Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
ANDREW EMILI • Banting and Best Department of Medical Research, Donnelly Centre for Cellular and Biomolecular Research, Department of Medical Genetics and Microbiology, University of Toronto, Toronto, ON, Canada
LARS FEUERBACH • Max Planck Institute, Saarbrücken, Germany
CHRISTOPHER FIELDS • Institute for Genomic Biology, The University of Illinois, Urbana, IL, USA
KRISTOFFER FORSLUND • Stockholm Bioinformatics Centre, Stockholm University, Stockholm, Sweden
LAURENT GAUTIER • Department of Systems Biology, DMAC, Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark
YOAV GILAD • Department of Human Genetics, The University of Chicago, Chicago, IL, USA
Part I
Phylogenomics
Chapter 1
Tangled Trees: The Challenge of Inferring Species Trees
from Coalescent and Noncoalescent Genes
Christian N.K. Anderson, Liang Liu, Dennis Pearl, and Scott V. Edwards
Abstract
Analyses based on different genes can produce conflicting phylogenies; methods that resolve such
ambiguities are becoming more popular, and offer a number of advantages for phylogenetic analysis.
We review so-called species tree methods and the biological forces that can undermine them by violating
important aspects of the underlying models. Such forces include horizontal gene transfer, gene duplication,
and natural selection. We review ways of detecting loci influenced by such forces and offer suggestions for
identifying or accommodating them. The way forward involves identifying outlier loci, as is done in
population genetic analysis of neutral and selected loci, and removing them from further analysis, or
developing more complex species tree models that can accommodate such loci.
Key words: Species tree, Gene tree discordance, Non-coalescent genes, Outlier analysis
1. Introduction
The concept of a species tree, a bifurcating dendrogram graphically
depicting the relationships of species to each other, is one of the
oldest and most powerful icons in all of biology (Figs. 1 and 2). After
Charles Darwin sketched the first species tree (in Transmutation of
Species, Notebook B, 1837), he remained fascinated by the image for
22 years, eventually including a species tree as the only figure in On
the Origin of Species (1859). Though species trees reached their
aesthetic apogee with Ernst Haeckel's Tree of Life in 1886, the
pursuit of ever-more scientifically accurate trees has kept phylogenetics a vibrant discipline for the 150 years since.
Because the direct evolution of species is not observable
(not even in the fossil record), relationships are often inferred by
shared characteristics among extant taxa. Until the 1970s, this was
done almost exclusively by using morphological characters.
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_1,
© Springer Science+Business Media, LLC 2012
Fig. 1. An example showing the utility of multiple gene trees in producing species tree topologies. (A) Nine unlinked loci are
simulated (or inferred without error) from a species group with substantial amounts of incomplete lineage sorting. Note that
no single gene recovers the correct relationship between clades. Furthermore, despite identical conditions for all nine
simulations, no two genes agree on the correct topology, let alone the correct divergence times. (B) Superimposing the
nine gene trees on top of each other clarifies the relationships. It can be (correctly) inferred that the true tree is perfectly
ordered, with (ABC) diverging from D about 1,500 generations ago, the (AB)-C split occurring at 800, and A diverging from
B about 600 generations ago. Also, the amount of crossbreeding within the recently diverged taxa implies (correctly) that C
has the smallest effective population size.
Although this approach had many successes, the paucity of characters and the challenges of comparing species with no obvious morphological homologies were persistent problems (1). When
molecular techniques were developed in the late 1960s, it soon
became clear that the sheer volume of molecular data that could
be collected would represent a vast improvement. When DNA
sequences became widely available for a range of species (2), molecular comparisons quickly became de rigueur (3–6). Nonetheless, it
was recognized early on that molecular phylogenies had their own
Fig. 2. The relationship between gene trees and species trees. Lines within the species trees indicate gene lineages.
Simplified gene trees are shown below each species tree. Whereas gene trees on the left vary due to deep coalescence,
gene trees on the right are topologically concordant but vary slightly in branch lengths due to the coalescent. Modified with
permission from Edwards (2009).
suite of problems; the concept that not all gene tree topologies
would match the true species tree topology (i.e., would not be
speciodendric sensu Rosenberg (7)) was implicit in studies as early
as the 1960s ((8), see also ref. 9). However, it was generally assumed
that the idiosyncratic genealogical history of any one gene, as
reconstructed from extant mutations, was an acceptable approximation for the true history of the species given the potentially overwhelming quantity and seductive utility of molecular data (10–14).
By and large, the ensuing decades of molecular phylogenetics
have fulfilled much of this potential, revolutionizing taxonomies and
resolving conundrums previously considered intractable (15).
However, as the amount of genetic data per species becomes ever more voluminous, it has become clear that individual genes can
conflict with each other and with the overarching species tree, both
in topology and branch lengths (16–19). In the meantime, the
term "phylogeny" frequently became conflated with "gene tree,"
the entity produced by many of the leading phylogenetics packages
of the day. The term "species tree," in use since the late 1970s to
emphasize the distinction between lineage histories and gene histories (13, 16), was only gradually acknowledged, despite the fact
that species trees are the rightful heirs to the term "phylogeny" and
better encapsulate the true goals of molecular and morphological
systematics (20).
At first, some researchers treated this phenomenon as though it
were an information problem: when working with only a few
2. The Multispecies Coalescent Model
A plausible probabilistic model for analyzing multilocus sequences
should involve not only the phylogenetic relationship of species
(species tree), but also the genealogical history of each gene (gene
tree), and allow different genes to have different histories. Unlike
concatenation, such a model explains the evolutionary history of
multilocus sequences through a two-stage process: from species
tree to gene tree and from gene tree to sequences (44). Construction of the two-stage model requires an explicit description of
how gene trees evolve in the species tree and how sequences evolve
on gene trees. As the second question has been extensively studied
in the traditional phylogenetic analyses for estimating gene trees, the
key is to address the first question adequately. With a few exceptions
(described below), the genealogical relationship (gene tree) of neutral alleles can be simply depicted by a coalescence process in which
lineages randomly coalesce with each other backward in time. The
coalescence model is simple in the sense that it assumes little or no
effect of evolutionary forces such as selection, recombination, and
gene flow, instead giving a prominent role to random genetic drift.
Despite these seemingly oversimplified assumptions, the pure coalescent model is fundamental in explaining the gene tree-species tree
relationship because it forms a baseline for incorporating additional
evolutionary forces on top of random drift (25). More importantly,
the pure coalescent model provides an analytic tool to detect the
evolutionary forces responsible for the deviation of the observed data
(molecular sequences) from those expected from the model.
The coalescent process works, in effect, by randomly choosing
ancestors from the population backward through time for each
sequence in the original sample. Eventually, two of these lineages
share a common ancestor, and the lineages are said to coalesce.
The process continues until all lineages have coalesced at the most
recent common ancestor (MRCA). Book-length treatments of the
process are available, and readers interested in the mathematical
details can find them in several sources (e.g., Refs. 28, 47–49).
Multispecies coalescence works the same way but places constraints
on how recently the coalescences occur, corresponding to the species divergence times. Given a species tree, the probability density
function of each gene tree is evaluated; and these density functions
are combined to evaluate the likelihood of the species tree. In this
way, multispecies coalescent methods are the converse of consensus
methods; rather than each locus proposing a potentially divergent
species tree, a common species tree is assumed and evaluated in light
of the sometimes-divergent patterns observed across loci (30).
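The backward-in-time process described above can be illustrated with a minimal Python sketch. It assumes a single haploid Wright-Fisher population of constant size with no selection, recombination, or gene flow; the function name and parameter values are hypothetical choices made for illustration and do not come from any of the packages discussed in this chapter.

```python
import random

def simulate_coalescent(n_samples, pop_size, seed=None):
    """Trace n_samples lineages backward in a haploid Wright-Fisher population.

    Returns a list of (generation, lineages_remaining) events, one for each
    generation in which at least two lineages picked the same parent.
    """
    rng = random.Random(seed)
    lineages = n_samples
    generation = 0
    events = []
    while lineages > 1:
        generation += 1
        # Each lineage picks its parent uniformly at random from the
        # previous generation; lineages that share a parent coalesce.
        parents = {rng.randrange(pop_size) for _ in range(lineages)}
        if len(parents) < lineages:
            lineages = len(parents)
            events.append((generation, lineages))
    return events

# Illustrative values only: 5 sampled sequences, population of 100.
events = simulate_coalescent(n_samples=5, pop_size=100, seed=1)
tmrca = events[-1][0]  # generation at which the last lineages coalesce (MRCA)
print(events, tmrca)
```

Repeating the simulation with different seeds gives a feel for how variable coalescence times are even under identical conditions, which is the variation among loci that multispecies coalescent methods are designed to accommodate.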
A number of implementations of this idea have been developed
(20). The BATWING package (50) was originally developed to
3. Sources of Gene Tree/Species Tree Discordance and Violations of the Multispecies Coalescent Model
3.1. Population Processes
The standard and most common reason why gene trees are not
speciodendric is incomplete lineage sorting, i.e., lineages have not
yet been reproductively isolated for long enough for drift to cause
complete genetic divergence in the form of reciprocal monophyly
of gene trees (68). This source of gene tree heterogeneity is guaranteed to be ubiquitous, if only because it arises from the finite
population sizes of all species that have ever come into existence.
Almost all the techniques and software packages discussed above
are designed to approximate uncertainties in species tree topology
arising from this phenomenon.
3.1.1. Accurate Delimitation of Species and Diverging Lineages
3.2. Molecular Processes
Fig. 3. Three examples of noncoalescent gene histories. (a) A duplication event that
precedes a speciation event can lead to incorrect inference of divergence times in the
species tree if copy 1 is compared to copy 2. This can be particularly difficult if one of the
gene copies has been lost or not sequenced by the researcher. (b) Convergent evolution
can occur at the molecular level, for example in certain genes under environmental
selection if both taxa move into the same environment. It tends to bring distantly related
taxa into a jumbled polyphyletic clade, and is likely to be given additional false support by
morphological data. (c) Horizontal gene transfer causes difficulties in current species tree
methods because it establishes a spurious lower bound to divergence times. Though rare
in eukaryotes, it is by no means unknown, and is likely to become a more difficult problem
in the future when species trees are based on tens of thousands of loci.
4. Detecting Violations of the Multispecies Coalescent Model
4.1. Detecting Population Genetic Outliers
4.2. Detecting Phylogenetic Outliers
in the third codon position than in positions one and two. Regions
under balancing selection should have higher nonsynonymous
mutation rates. However, using the dN/dS ratio as a means of
detecting phylogenetic outliers presents some difficulties. Of course,
such a test would only be applicable to coding regions (see Chap. 5
of this Volume; ref. 122). Additionally, although such genes may
exhibit anomalous behavior at the amino-acid level, they may not be
anomalous in their phylogenetic signal, which is our primary concern. Finally, many coding loci may undergo substitutions more
freely than expected due to canalization (sensu Waddington (105))
or genomic redundancy. Many genes exhibit a slight excess
of nonsynonymous substitutions within populations because
even strong directional selection rarely purges all such alleles from
populations (106).
GC ratio and DNA word frequencies: Regions of the genome that
have been acquired from another domain of life (such as a eukaryote with DNA from viruses, bacteria, or archaea) often have an
unusual GC composition relative to the rest of the genome. Indeed,
focusing on genomic regions with anomalous GC content is a
common method for identifying genes that have undergone
HGT. More complex consequences of base composition and mutation patterns, such as the frequencies of DNA oligonucleotides
(words) in coding or noncoding regions, have also been used
to flag potential HGT genes, particularly in bacteria (107, 108).
Like the test above, the results of GC or DNA word frequency
analysis should be considered suggestive, but not conclusive. There
are other reasons for unusual GC content (e.g., leucine zipper
motifs, a GC microsatellite, etc.), which are likely to occur by
chance in a large genome. Again, the phylogenetic consequences
of such deviations in evolutionary pattern are paramount. In this
regard, high variation in GC content among genes can cause strong
deviations in resulting phylogenies, although distinguishing the
true gene tree from the tree suggested by the variation can be
challenging (e.g., using LogDet distances (109)).
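As a rough illustration of the GC screen described above, the following Python sketch computes GC content per gene and flags genes that deviate strongly from the rest. The gene names, the toy sequences, and the z-score cutoff of 2 are all assumptions made for this example; as the text notes, any flag raised this way is suggestive rather than conclusive.

```python
from statistics import mean, stdev

def gc_content(seq):
    """Fraction of G and C among the unambiguous A/C/G/T bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence has no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / acgt

def flag_gc_outliers(genes, z_cutoff=2.0):
    """Return names of genes whose GC content lies more than z_cutoff
    sample standard deviations from the mean over all genes."""
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(gc.values())
    sd = stdev(gc.values())
    if sd == 0:
        return []
    return [name for name, value in gc.items() if abs(value - mu) / sd > z_cutoff]

# Hypothetical toy data: seven genes with 40% GC and one with 80% GC.
genes = {f"gene{i}": "ATGCATGCAT" for i in range(1, 8)}
genes["gene8"] = "GCGCGCGCAT"
print(flag_gc_outliers(genes))
```

With only a handful of genes a z-score is a blunt instrument, because a single extreme value inflates the standard deviation it is measured against; a genome-scale analysis would more plausibly compare each gene with a robust genome-wide background.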
4.3. Statistical Tests to Detect Phylogenetic Outliers
Fig. 4. HGT can be detected by comparing the diversity of genes in all taxa to the diversity
of genes in pairs of taxa. Transfer events should appear as anomalies in regressions or
histograms in each pair of species, in this case locus 21. In the example pair above, 1 of
the 20 normal loci also lies outside the 95% confidence band as expected, but this
locus would not be expected to lie outside the confidence band in all pairs. This particular
locus highlights another hazard of such an analysis: the locus has saturated (100
segregating sites in a 100-bp locus) and thus shows a positive deviation from expectation
in closely related taxa.
5. Future Directions
Species tree methods are likely to continue to gain ascendancy as
the strongest evidence of taxonomic relationship in phylogenetic
research. As with any form of evidence, the conclusions of a species
tree analysis are fallible, with each method susceptible to certain
biases in exceptional cases. In the future, we hope that these biases
and susceptibilities can be overcome, and that species tree methods
will continue to multiply. Because the most robust techniques rely
heavily on a coalescent paradigm, the field needs a method for
detecting loci that violate the assumptions of coalescent theory.
A few ideas for how to do this have been presented and outlined
above, but certainly need rigorous theoretical and empirical testing
to establish their effectiveness in phylogenetic inference.
Detection is just the first step. Currently, when such loci are
discovered, researchers have two options: they can use methods
that are sufficiently robust (hopefully) to overcome the faulty
6. Practice Problems
1. Consider the following discordant set of gene trees. {Gene
1 (A:10,(B:8,C:8):2); Gene 2 (B:9,(A:6,C:6):3); and
Gene 3 ((A:4,B:4):4,C:8)}. Assuming that these genes perfectly delimit the time of genetic divergence and the only
cause of discordance is deep coalescence, what is the correct
species tree?
2. In a study of five closely related species, you sequence five short
loci, and obtain the following matrix of variable sites between
taxon pairs.
Upper triangle: variable sites per gene (genes 1 to 5). Lower triangle: total over all five genes.

Per gene \ Total   Species A   Species B   Species C   Species D   Species E
Species A          -           2,3,6,4,1   3,7,6,9,1   4,7,6,9,1   4,7,6,9,1
Species B          16          -           4,7,1,9,1   4,7,5,9,1   4,6,5,9,1
Species C          26          22          -           3,6,5,8,1   4,7,5,9,1
Species D          27          26          23          -           1,2,2,3,0
Species E          27          27          26          8           -
Which gene is the most likely to have been horizontally transferred, and between which two taxa?
Appendix A: Simulating Gene Trees in Species Trees
Fig. 5. The species tree simulated in the Appendix. Branch lengths are in units of
generations, and branch widths (population sizes) are in units of individuals. This
particular tree has the constraint that ancestral population sizes are the sum of the
population sizes of descendent lineages, but of course one can simulate without these
constraints using either Serial SimCoal or Phybase.
References
1. Hillis DM (1987) Molecular versus morphological approaches to systematics. Annu Rev Ecol Syst 18:23–42
2. Kocher TD, Thomas WK, Meyer A et al (1989) Dynamics of mitochondrial DNA evolution in animals: amplification and sequencing with conserved primers. Proc Natl Acad Sci USA 86:6196–6200
3. Miyamoto MM, Cracraft J (1991) Phylogeny inference, DNA sequence analysis, and the future of molecular systematics. In: Miyamoto MM, Cracraft J (eds) Phylogenetic analysis of DNA sequences. Oxford University Press, New York
4. Swofford DL, Olsen GJ, Waddell PJ et al (1996) Phylogenetic inference. In: Hillis DM, Moritz C, Mable BK (eds) Molecular systematics. Sinauer Associates, Sunderland, MA
5. Nei M (1987) Molecular evolutionary genetics. Columbia University Press, New York
6. Nei M, Kumar S (2000) Molecular evolution and phylogenetics. Oxford University Press, New York
7. Rosenberg NA (2002) The probability of topological concordance of gene trees and species trees. Theor Popul Biol 61:225–247
8. Cavalli-Sforza LL (1964) Population structure and human evolution. Proc R Soc Lond, Ser B: Biol Sci 164:362–379
9. Avise JC, Arnold J, Ball RM et al (1987) Intraspecific phylogeography: the mitochondrial DNA bridge between population genetics and systematics. Annu Rev Ecol Syst 18:489–522
Chapter 2
Modeling Gene Family Evolution and Reconciling
Phylogenetic Discord
Gergely J. Szöllősi and Vincent Daubin
Abstract
Large-scale databases are available that contain homologous gene families constructed from hundreds of
complete genome sequences from across the three domains of life. Here, we discuss approaches
of increasing complexity aimed at extracting information on the pattern and process of gene family
evolution from such datasets. In particular, we consider models that invoke processes of gene birth
(duplication and transfer) and death (loss) to explain the evolution of gene families.
First, we review birth-and-death models of family size evolution and their implications in light of the
universal features of family size distribution observed across different species and the three domains of life.
Subsequently, we proceed to recent developments on models capable of more completely considering
information in the sequences of homologous gene families through the probabilistic reconciliation of
the phylogenetic histories of individual genes with the phylogenetic history of the genomes in which they
have resided.
To illustrate the methods and results presented, we use data from the HOGENOM database, demonstrating that the distribution of homologous gene family sizes in the genomes of the eukaryota, archaea, and
bacteria exhibits remarkably similar shapes. We show that these distributions are best described by models of
gene family size evolution, where for individual genes the death (loss) rate is larger than the birth
(duplication and transfer) rate but new families are continually supplied to the genome by a process of
origination. Finally, we use probabilistic reconciliation methods to take into consideration additional
information from gene phylogenies, and find that, for prokaryotes, the majority of birth events are the
result of transfer.
Key words: Gene family evolution, Gene duplication, Gene loss, Horizontal gene transfer,
Birth-and-death models, Reconciliation
1. Introduction
The strongest evidence for the universal ancestry of all life on Earth
comes from two sources: (1) the shared molecular characters essential to the functioning of the cell, such as fundamental biological
polymers, core metabolism, and the nearly universal genetic
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_2,
© Springer Science+Business Media, LLC 2012
2. Birth-and-Death Processes and the Shape of the Protein Universe
Fig. 1. Distribution of homologous gene family sizes across the three domains. The distribution of homologous gene family
sizes was derived from version 5 of the HOGENOM database (17). The results for the three domains are based on the
complete genomes of 820 bacteria, 62 archaea, and 64 eukaryotes, and correspond to the average of the frequencies of
family sizes across species in the domain. Dashed lines indicate fits with different origination, duplication, and loss (ODL)
models. The linear model corresponds to the model of Reed et al. and the nonlinear model is that proposed by Karev et al.; see
text for details. The bottom row presents the relative rate of duplication as a function of family size corresponding to the fits
of the nonlinear model of Eq. 2 in the two rows above it.
2.2. Interpreting the Pattern of Gene Family Sizes
Huynen and van Nimwegen were the first to describe and interpret
a widespread pattern in the distribution of homologous gene family
sizes: a slowly decaying asymptotic power law. They examined a
diverse set of genomes spanning the bacteria, archaea, eukaryota,
and viruses (12). They found that a simple, but relatively abstract,
stochastic birth-and-death process, one where the duplication and
loss events are correlated within a family, produces power-law
distributions (for details, see below). They found the exponent $\gamma$ to be
between 2 and 4 in their studies. In fact, a value of $\gamma$ between 2 and 3,
consistent with these results, has been observed in all
subsequent studies and can easily be read off from Fig. 1. In the
context of Huynen and van Nimwegen's model, this indicates that
the origination rate (in general, a combination of gain resulting
from transfer and the birth of new families with no homologs in
other genomes) that is required to compensate for the stochastic
loss of families must be significant.
Subsequent work has shown that, for models in which the birth
and death of genes in a gene family are considered independent,
the asymptotic decay of the distribution of gene family sizes can
also become a power law, although such behavior is only exhibited by
a specific subclass of origination–duplication–loss-type
birth-and-death models. As demonstrated by Karev et al. (14),
this is the case for nonlinear models (see below) in which the death
rate approaches the birth rate for large families but is considerably
greater than the birth rate for small families (see bottom row of
Fig. 1). Karev et al. have been able to accurately reproduce the
distributions of gene (and domain) family sizes for a range of
analyzed genomes. The origination rates necessary to fit empirical
family size distributions were found to be relatively high, and
comparable, at least in small prokaryotic genomes, to the overall
intragenomic duplication rate. This has been interpreted as support
for the key role of horizontal gene transfer (HGT) in these
genomes (14, 18, 19).
At about the same time as the work of Karev and colleagues
appeared, Reed et al. demonstrated (20) that a very simple
birth-and-death process can also exhibit an asymptotic power law. They
considered a model in which the birth and death of genes are
independent of each other and of family size, and origination occurs
randomly with a uniform rate (see below), and found asymptotic
power-law behavior under the condition that the rate of birth
(duplication) is larger than the rate of death (loss). In Fig. 1, we
compare the fits of the linear model of Reed et al. and
the nonlinear model of Karev et al. to gene family size distributions
for the three domains. For data from individual species (top row of
Fig. 1), the linear model (described by three parameters), despite its
relative simplicity, provides fits of comparable quality to those of
the model of Karev et al. (described by five parameters). If we
consider, however, the fits to the distributions averaged within each of
the three domains, the nonlinear model
clearly provides a better fit (second row of Fig. 1). As the functions
being fit are discrete probability distributions, one can easily
calculate the probability of the observed empirical distribution given
values of the model parameters, and subsequently perform fitting
by maximizing the likelihood of the model parameters. For the
averaged distributions, fitting by likelihood has a clear
interpretation: it corresponds to the hypothesis that a birth-and-death
process with identical parameter values across all species in the
domain generated the observed distribution.
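The likelihood-based fitting described above can be sketched in a few lines. The example below is only an illustration, not the procedure used for Fig. 1: it assumes a one-parameter truncated power law, $p_n \propto n^{-\gamma}$ for $n = 1, \ldots, N$, in place of the ODL models, and maximizes the log-likelihood by a simple grid search. The function names, the grid, and the synthetic counts are all assumptions made for illustration.

```python
import math

def power_law_pmf(gamma, n_max):
    """Truncated discrete power law: p_n proportional to n**(-gamma), n = 1..n_max."""
    weights = [n ** (-gamma) for n in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(counts, pmf):
    """Multinomial log-likelihood (up to an additive constant) of family-size counts.

    counts[k] is the number of families of size k + 1.
    """
    return sum(c * math.log(p) for c, p in zip(counts, pmf) if c > 0)

def fit_gamma(counts, grid):
    """Return the value of gamma on the grid that maximizes the log-likelihood."""
    n_max = len(counts)
    return max(grid, key=lambda g: log_likelihood(counts, power_law_pmf(g, n_max)))

# Synthetic data (hypothetical): expected counts for 10,000 families, gamma = 2.5.
true_pmf = power_law_pmf(2.5, 200)
counts = [10_000 * p for p in true_pmf]

grid = [1.0 + 0.05 * i for i in range(61)]  # candidate exponents from 1.0 to 4.0
best_gamma = fit_gamma(counts, grid)
print(f"maximum-likelihood exponent on the grid: {best_gamma:.2f}")
```

For a model with several parameters, such as the three- and five-parameter models compared in Fig. 1, the grid search would presumably be replaced by a numerical optimizer, but the likelihood being maximized has the same form.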
Perhaps more conclusively, the parameter values obtained in
the case of the linear model, corresponding to a birth-to-death ratio
of between roughly 2 and 5 ($\delta/\lambda = 4.9$ for the human dataset with
the best apparent fit), are qualitatively at odds with empirical
estimates of the recent duplication and loss rates in eukaryotic
genomes, which unanimously indicate a value much smaller than one
(see Table 1 in ref. 6).
2.3. The Theory of Birth-and-Death Processes
[Figure panels: (a) origination, duplication, and loss; (b) gain and loss; (c) gain, duplication, and loss. States are family sizes of 0, 1, 2, ..., i, i+1 genes.]
Fig. 2. Birth-and-death models of homologous gene family evolution. A birth-and-death process is a stochastic process in
which transitions between states labeled by integers (representing the number of individuals, cells, lineages, etc.) are only
allowed to neighboring states. A jump to the right constitutes a birth, whereas a jump to the left is a death. In the context of
birth-and-death processes that model the evolution of homologous gene families, the number of representatives a
homologous gene family has in a given genome corresponds to the model state. Birth represents the addition of a gene to a
family in a genome as a result of (1) origination of a new family with a single member, (2) duplication of an existing gene, or
(3) gain of a gene by means of horizontal transfer of a gene from the same family from a different genome. The three
models pictured above have been used in different contexts to model observed patterns of gene family size: (a) the
stationary distribution of nonlinear origination–duplication–loss-type models is able to reproduce the general shape and in
particular the power-law-like tail of the distribution of homologous gene family sizes (cf. Subheading 2 and ref. 14), while
transient distributions of linear origination–duplication–loss models can be used to construct models of gene family size evolution
along a phylogeny, modeling the inparalog, i.e., vertically evolving, component of the family size distribution (21); (b) and
(c) linear gain–loss and gain–duplication–loss-type models are used to model the nonvertically evolving, so-called
xenolog, component of the family size distribution along a branch of a phylogenetic tree.
and
$$\lambda_n = \lambda n. \tag{1}$$
In other words, a gene (individual) in a gene family (population)
gives birth to a new gene at a rate $\delta$ and undergoes death at a
rate $\lambda$, independent of the size of the gene family. The stationary
distribution of a linear birth-and-death process with origination
at some rate $\Omega$ can be shown to be (1) a stretched exponential
if $\delta < \lambda$, i.e., the birth rate is smaller than the death rate, or (2)
exhibiting an asymptotic power-law behavior with exponent
$\gamma = \Omega/(\delta - \lambda) + 1$ (30) if $\delta > \lambda$. The transient distribution
can be analytically expressed for the linear version of all three
processes shown in Fig. 2. These distributions are important in
deriving the probability of observing a particular pattern of family
sizes at the leaves of a phylogeny, as well as in estimating
branch-wise duplication, transfer, and loss parameters from a forest of
gene trees that have been mapped using a series of duplication,
transfer, and loss events to the branches of a species phylogeny
(see Subheading 4).
A succession of more complex nonlinear models can be constructed,
the simplest proposed (14) being a model with a family
size-dependent duplication and loss rate parameterized by a pair of
constants $a$ and $b$:
$$\delta_n = \delta_n^0\, n = \delta\,\frac{n + a}{n}\, n \quad \text{and} \quad \lambda_n = \lambda_n^0\, n = \lambda\,\frac{n + b}{n}\, n; \tag{2}$$
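As a hedged illustration of how rates of this form shape the family size distribution, the sketch below computes a stationary distribution numerically. It relies on a standard property of birth-and-death chains (at stationarity the probability flow between neighboring states balances, $\pi_n \delta_n = \pi_{n+1} \lambda_{n+1}$) and makes two simplifying assumptions that are not taken from the text: only family sizes $n \geq 1$ are tracked, and the chain is truncated at a hypothetical maximum size `n_max`. All function names and parameter values are illustrative.

```python
def stationary_distribution(birth, death, n_max):
    """Stationary probabilities of family sizes 1..n_max for a birth-and-death chain.

    birth(n) and death(n) are the total rates out of state n; the balance
    condition pi_n * birth(n) = pi_{n+1} * death(n + 1) is applied recursively.
    """
    weights = [1.0]  # unnormalized weight of size 1
    for n in range(1, n_max):
        weights.append(weights[-1] * birth(n) / death(n + 1))
    total = sum(weights)
    return [w / total for w in weights]

def nonlinear_rates(delta, lam, a, b):
    """Total rates of Eq. 2: delta_n = delta * (n + a), lambda_n = lam * (n + b)."""
    return (lambda n: delta * (n + a)), (lambda n: lam * (n + b))

# Hypothetical parameter values chosen only for illustration.
birth, death = nonlinear_rates(delta=1.0, lam=1.0, a=1.0, b=2.0)
pi = stationary_distribution(birth, death, n_max=1000)
print([round(p, 4) for p in pi[:5]])
```

With $\delta = \lambda$ and $b > a$, the ratio of successive probabilities is $(n + a)/(n + 1 + b)$, so the tail of the computed distribution falls off roughly as $n^{-(1 + b - a)}$, in line with the power-law-like behavior discussed for nonlinear models; with $a = b = 0$ and $\delta < \lambda$ the same function gives a rapidly decaying distribution.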
Table 1
Relative rates of duplication, gain, and loss for prokaryotic
phyla obtained by maximum likelihood using COUNT (43)

Phylum name                   Loss   Duplication   Gain    # of genomes
Actinobacteria                0.75   0.23          0.010   31
Alphaproteobacteria           0.85   0.13          0.008   47
Bacillales                    0.52   0.42          0.048   16
Bacteroidetes/chlorobi        0.59   0.38          0.024   10
Betaproteobacteria            0.63   0.32          0.037   32
Chlamydiae/verrucomicrobia    0.70   0.24          0.043
Clostridia                    0.57   0.37          0.055   11
Cyanobacteria                 0.68   0.28          0.027   14
Deltaproteobacteria           0.64   0.33          0.024   13
Epsilonproteobacteria         0.54   0.29          0.158
Gammaproteobacteria           0.88   0.10          0.009   70
Lactobacillales               0.66   0.29          0.036   21
Mollicutes                    0.49   0.47          0.023   14
Spirochetes                   0.79   0.19          0.014
Crenarchaeota                 0.69   0.28          0.018   11
Euryarchaeota                 0.66   0.31          0.016   25

Rooted reference trees were obtained from concatenates of universal and
near-universal genes and phylogenetic profiles extracted from version 4 of the
HOGENOM database (17). Relative rates correspond to the ratio of the
average of the branch-wise rates (of duplication, gain, and loss) to the average
branch-wise sum of the three rates
3. The Ubiquity of Phylogenetic Discord and the Joint Reconstruction of Pattern and Process
3.1. Phylogenetic Discord Among Homologous Gene Families
[Figure panels: deep coalescence (ancestral polymorphism, sorting of lineages, loss by drift); duplication (speciation events, losses); transfer (speciation events, loss).]
Fig. 3. Evolutionary processes behind phylogenetic discord. Phylogenetic incongruences can be the result of three major
evolutionary processes (45): (1) deep coalescence resulting from incomplete lineage sorting (see previous chapter);
(2) hidden paralogy (resulting from duplication and differential loss); and (3) horizontal gene transfer (HGT). Incomplete
lineage sorting occurs when an ancestral species undergoes two speciation events in rapid succession. If, for a given gene,
the ancestral polymorphism has not been fully resolved into two monophyletic lineages at the time of the second
speciation, with a probability determined by the effective population size, the gene tree will differ from the species tree.
A potential source of incongruence relevant over wider phylogenetic scales is hidden paralogy. If a gene family contains
paralogous copies (genes that are related by a duplication event, e.g., the dashed and grey lines above), the gene
phylogeny will partly reflect the duplication history of the gene that is independent of species divergence history. The third
process is HGT. If genetic exchanges occur between species, then the phylogeny of individual genes will be influenced by
the number and nature of transfers they have undergone. In the above figure, we illustrate how a particular gene tree
topology can be explained by each process. Depending on the parameters (duplication, transfer, and loss rates and
effective population size) describing the branches of the species tree, the three different scenarios have different
probabilities.
The probability of a reconciliation can be hierarchically decomposed into the product of probabilities of the reconstructions of
subtrees of G. This allows the construction of a dynamic programming algorithm that can efficiently sum or take the maximum over
reconciliations, allowing the calculation of both Eqs. 4 and 5.
Furthermore, the same dynamic programming scheme can be
used to calculate the most parsimonious reconciliation given costs
of the possible events with reduced complexity (57).
3.4. Hierarchical Probabilistic Models of Duplication, Transfer, and Loss
where
and the product goes over the set of most likely gene trees {Gf}
encoding the sequence information in families of homologous
genes composing a set of genomes. This expression can be thought
of as being similar to the classic likelihood of a gene tree topology
G and some model of sequence evolution Mseq. with parameters,
[Figure labels: species tree with time slices t = t1, t2, t3, and 0 and leaves Genome A to Genome D; gene tree with root g, internal nodes e and f, and leaves a and b; a transfer scenario and a duplication scenario, annotated with origination, duplication, transfer, loss, and propagation terms such as Q2,7(t1; 0) and Q5(t2).]
Fig. 4. Probabilistic DTL model. If we consider gene trees to be generated by a linear birth-and-death process $M_{BD}$ taking
place on a tree $S'$ with the order of speciation events fully specified, we can express the probability of a gene tree topology
$G$ given a reconciliation. Specifying the order of speciation events corresponds to constructing time slices, which
decompose the branches of the species tree into pieces yielding the tree $S'$. For example, the branch leading to Genome
A is decomposed into three branches labeled 2, 4, 7 (for a formal definition, see ref. 56). Transfers are only possible between
branches in the same time slice, e.g., between 7 and 9, but not 4 and 9. A reconciliation consists of mapping the branches
and nodes of $G$ to the branches and nodes of $S'$. For a given gene tree, there are many possible reconciliations. For $G$, we can
construct (1) a transfer scenario, where node g of $G$ is a speciation at the root of $S'$, e is a transfer from 4 to 9, f is a speciation
at the end of 3, and the branch below f traverses the speciation at the end of 6, implying at least one loss, and also (2) a
duplication scenario, where e maps to the root, g is a duplication above it, the position of f is unchanged, but at least four
losses have occurred. The probability of extinction $Q_e(t)$ and the propagator $Q_{ef}(t, t')$ can be used to construct the probability
of a given reconciliation as shown for the black subtree of $G$. Because the probability of a reconciliation can be hierarchically
decomposed into the product of the probabilities of the reconstructions of the subtrees of $G$, a dynamic programming
algorithm can be derived that is able to calculate the sum or maximum of the probability over all reconciliations.
[Figure panels: Cyanobacteria and Lactobacillales; histograms of the relative rates of duplication, transfer, and loss (x axis: relative rate; y axis: frequency in sample).]
Fig. 5. Relative rates of duplication, transfer, and loss for two prokaryotic phyla. The results were obtained by maximum
likelihood using reference trees inferred from concatenated alignments of universal and near-universal genes and all
homologous gene families with trees available in version 4 of the HOGENOM database (17). These results show that while
the ratio of birth to death is practically identical, taking into consideration phylogenetic information from gene trees, the
majority of birth events are inferred to have resulted from transfer and not duplication, in contrast to results obtained from
phylogenetic profiles (see Table 1). The histograms correspond to results obtained for 1,000 jackknife samples of 20% of all
trees (see Chap. 20 of ref. 42 for a discussion of resampling). The calculation was implemented using results from refs. 56 and 57.
We kept the species tree topology fixed and maximized Eq. 6 over the space of possible orders in time of speciations and
uniform rate parameters. We assumed each branch of $S'$ to have branch lengths compatible with the time order of
speciations with all time slices being of equal width and inferred global rates of duplication, transfer, and loss.
where in this case the product goes over columns of homologous sites
composing a MSA. In Fig. 5, we present results obtained using such
an approach, where we have kept the species tree topology fixed and
maximized the likelihood given by Eq. 6 over the space of possible
orders in time of speciations and uniform rate parameters. We can see
that the inferred ratio of birth to death is in good agreement with that
obtained from phylogenetic profiles (see Table 1). In contrast, taking
into consideration additional information from the sequences of the
proteins in homologous families in the form of gene tree topologies,
we infer for both phyla considered the majority of birth events to be
the result of transfer.
This scheme has two shortcomings. First, instead of complete
sequence information, only the most likely gene tree topologies are
considered. Second, global information on how likely different gene
tree topologies are given $S'$ and $M_{BD}$ is not considered. Both of
$$\prod_{f \in \mathrm{families}} p(G_f \mid S', M_{BD})\, L(G_f \mid \mathrm{MSA\ of\ } f) \tag{8}$$
It is important to note that this hierarchical likelihood function
is amenable to parallel computation, because the $p(G_f \mid S', M_{BD})\,
L(G_f \mid \mathrm{MSA\ of\ } f)$ terms can be computed independently by client
nodes. It is possible to implement an efficient optimization scheme
consisting of a hierarchical optimization loop, wherein clients
optimize the $G_f$ using the independent terms in the hierarchical
likelihood product while keeping $S$ and $M_{BD}$ fixed until
conditionally optimal $G_f$ are attained, using which $S$ and $M_{BD}$ can be
optimized.
4. Conclusion
In conclusion, the distributions of homologous gene family sizes in
the genomes of the eukaryota, archaea, and bacteria show astonishingly similar shapes. These distributions are best described by models of gene family size evolution, where the loss rates of individual
genes are larger than their duplication rate but new families are
continually supplied to the genome by a process of origination
that in general includes both transfer and the generation of new
gene families. This picture is supported by analysis of phylogenetic
profiles using maximum likelihood. Taking into consideration additional information from the sequences of the proteins in homologous families in the form of gene tree topologies, the inferred ratio
of birth to death is found to be in good agreement with that
obtained from phylogenetic profiles; however, in prokaryotes, the
majority of birth events is inferred to be the result of transfer.
5. Exercises
1. Using log-log axes on the range [0.1, $10^6$], plot the following
functions: $e^{-x}$, $e^{-x/10}$, $e^{-x/100}$, $e^{-x/1000}$, $x^{-1}$, $x^{-3}$, $x^{-9}$ and
observe how power-law-like tails decay much slower than any
exponential function.
2. Using both the COG (http://www.ncbi.nlm.nih.gov/COG)
and the HOGENOM (http://pbil.univ-lyon1.fr/databases/HOGENOM)
databases, construct the histogram in Fig. 1 of
the frequency of homologous gene family sizes in the human
genome, i.e., the fraction $f_n$ of times you see a family of size $n$
among all homologous gene families in the human genome.
3. Using the result that the stationary distribution $p_n$ of family sizes
is reached exponentially fast and assuming that this occurs according
to the relationship $|p_n(t) - p_n| \propto e^{-(\delta + \lambda)t}$, considering the
Chapter 3
Genome-Wide Comparative Analysis of Phylogenetic
Trees: The Prokaryotic Forest of Life
Pere Puigbò, Yuri I. Wolf, and Eugene V. Koonin
Abstract
Genome-wide comparison of phylogenetic trees is becoming an increasingly common approach in evolutionary
genomics, and a variety of approaches for such comparison have been developed. In this article, we
present several methods for comparative analysis of large numbers of phylogenetic trees. To compare
phylogenetic trees taking into account the bootstrap support for each internal branch, the Boot-Split
Distance (BSD) method is introduced as an extension of the previously developed Split Distance
method for tree comparison. The BSD method implements the straightforward idea that comparison
of phylogenetic trees can be made more robust by treating tree splits differentially depending on the
bootstrap support. Approaches are also introduced for detecting tree-like and net-like evolutionary
trends in the phylogenetic Forest of Life (FOL), i.e., the entirety of the phylogenetic trees for
conserved genes of prokaryotes. The principal method employed for this purpose includes mapping
quartets of species onto trees to calculate the support of each quartet topology and so to quantify
the tree and net contributions to the distances between species. We describe the application of these
methods to analyze the FOL and the results obtained with these methods. These results support
the concept of the Tree of Life (TOL) as a central evolutionary trend in the FOL as opposed to the
traditional view of the TOL as a species tree.
Key words: Forest of life, Tree of life, Phylogenomic methods, Tree comparison, Map of quartets
Abbreviations
CMDS
COG
BSD
FOL
HGT
ND
NUTs
QT
TNT
TOL
SD
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_3,
© Springer Science+Business Media, LLC 2012
1. Introduction
With the advances of genomics, phylogenetics entered a new era
that is marked by the availability of extensive collections of
phylogenetic trees for thousands of individual genes. Examples of such tree
collections are the phylomes that encompass trees for all sufficiently
widespread genes in a given genome (1–4) or the Forest of Life
(FOL) that consists of all trees for widespread genes in a
representative set of organisms (5). It has been known since the early days of
phylogenetics that trees built on the same set of species often
have different topologies, especially when the set includes distant
species, most notably, in prokaryotes (6, 7). The availability of
forests consisting of numerous phylogenetic trees exacerbated
the problem as an enormous diversity of tree topologies has been
revealed. The inconsistency between trees has several major sources:
(1) problems with ortholog identification caused primarily by
cryptic paralogy; (2) various artifacts of phylogenetic analysis, such as
long branch attraction (LBA); (3) horizontal gene transfer (HGT);
and (4) other evolutionary processes distorting the vertical, tree-like
pattern, such as incomplete lineage sorting and hybridization (1,
8–10). In order to obtain robust results in genome-level
phylogenetic analysis, for instance, to classify phylogenetic trees into clusters
with (partially) congruent topologies or to identify common trends
among multiple trees, reliable methods for comparing trees are
indispensable.
The number and diversity of tree comparison methods
and software have substantially increased in the last few years.
The tree comparison methods variously use tree bipartitions, such
as the partition, or symmetric difference, metric (11) and split distance
(SD) (12); distance between nodes, such as the path length metrics
(13), nodal distance (12, 14), and nodal distance for rooted
trees (15); comparison of evolutionary units, such as triplets and
quartets (16); subtransfer operations, such as subtree transfer
distance (17), nearest-neighbor interchanging (18), Subtree Prune
and Regraft (SPR) using a rooted reference tree (19), SPR for
unrooted trees (20), and Tree Bisection and Reconnection (TBR)
(17); (dis)agreement methods, such as agreement subtrees (21),
disagree (12), corresponding mapping (22), and congruence
index (23); tree reconciliation (24); and topological and branch
lengths methods, such as K-tree score (25). Several algorithms
have been proposed to analyze multifamily trees. For example,
the FMTS algorithm systematically prunes each gene copy from a
multifamily tree to obtain all possible single-gene trees (12), and an
algorithm implemented in TreeKO prunes nodes from the input
rooted trees in which duplication and speciation events are labeled
(26). However, to the best of our knowledge, none of the available
metrics for tree comparison takes into account the robustness of the
2. Materials
2.1. The Forest of Life and Nearly Universal Trees
3. Methods
3.1. Boot-Split Distance: A Method to Compare Phylogenetic Trees Taking into Account Bootstrap Support
3.1.1. Boot-Split Distance
the different splits (Eq. 1). Equations 2 and 3 give the formulas to
calculate the eBSD and dBSD values, respectively.
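$$\mathrm{BSD} = \frac{\mathrm{eBSD} + \mathrm{dBSD}}{2}, \tag{1}$$
$$\mathrm{eBSD} = 1 - \left[\frac{e}{a}\, M_e\right], \tag{2}$$
$$\mathrm{dBSD} = \frac{d}{a}\, M_d. \tag{3}$$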
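Equations 1–3 can be illustrated with a short sketch. The code below is an illustration under stated assumptions rather than the authors' implementation: each tree is represented as a dictionary that maps a split to its bootstrap value on the 0–1 scale, both trees are assumed to contain the same taxa (no pruning step), and the helper names `split` and `boot_split_distance` are hypothetical. The quantities p, q, a, e, d, Me, and Md follow the definitions given in the caption of Fig. 1, and the input values are those shown for the example in Fig. 1a.

```python
def split(side_a, side_b):
    """Order-independent representation of a bipartition of the taxa."""
    return frozenset((frozenset(side_a), frozenset(side_b)))

def boot_split_distance(tree1, tree2):
    """Compute SD, eBSD, dBSD, and BSD for two trees given as {split: bootstrap}."""
    shared = tree1.keys() & tree2.keys()
    # Bootstraps of equal splits are counted once per tree, so p = 2 * len(shared).
    equal = [tree[s] for tree in (tree1, tree2) for s in shared]
    different = [bs for tree in (tree1, tree2)
                 for s, bs in tree.items() if s not in shared]
    p, q = len(equal), len(different)
    e, d = sum(equal), sum(different)
    a = e + d
    if p + q == 0 or a == 0:
        raise ValueError("no splits with positive bootstrap support to compare")
    mean_equal = e / p if p else 0.0      # Me
    mean_diff = d / q if q else 0.0       # Md
    ebsd = 1.0 - (e / a) * mean_equal     # Eq. 2
    dbsd = (d / a) * mean_diff            # Eq. 3
    return {"SD": q / (p + q), "eBSD": ebsd, "dBSD": dbsd,
            "BSD": (ebsd + dbsd) / 2.0}   # Eq. 1

# Splits and bootstrap values of the two six-taxon trees in Fig. 1a.
tree1 = {split("45", "6231"): 0.96, split("62", "4531"): 0.47,
         split("31", "4562"): 0.26}
tree2 = {split("16", "2345"): 0.58, split("162", "345"): 0.08,
         split("2613", "45"): 0.79}

result = boot_split_distance(tree1, tree2)
print({name: round(value, 3) for name, value in result.items()})
```

Rounded to three decimals, the sketch gives SD = 0.667, eBSD = 0.512, dBSD = 0.154, and BSD = 0.333, which matches the values reported for Fig. 1a.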
There are three possible types of comparisons for trees that do not
include paralogs, that is, include one and only one sequence from
each of the constituent species (Fig. 1). In the first case, the two
trees completely overlap, that is, consist of the same set of species
(Fig. 1a). In this case, step 2, the pruning procedure, is not necessary, and the comparison involves only obtaining all possible splits
and the calculation of the BSD. In the second case, one of the
compared trees is a subset of the other tree (Fig. 1b). In this case,
the splits are only pruned and occasionally removed from the bigger
tree. In the third case, when the two trees partially overlap or when
a tree is a subset of another tree, a pruning procedure is required.
[Fig. 1 values. Panel (a): Tree 1 splits [bootstrap] 45|6231 [96], 62|4531 [47], 31|4562 [26]; Tree 2 splits 16|2345 [58], 162|345 [8], 2613|45 [79]; SD = 0.667, BSD = 0.333, eBSD = 0.512, dBSD = 0.154; p = 2, q = 4, m = 6, a = 3.140, e = 1.750, d = 1.390; Ma = 0.53, Me = 0.875, Md = 0.3475. A further panel: SD = 1.000, BSD = 0.681, eBSD = 1.000, dBSD = 0.363; p = 0, q = 4, m = 4, a = 1.450, e = 0.000, d = 1.450; Ma = 0.3625, Me = 0, Md = 0.3625.]
Fig. 1. Examples of the BSD algorithm in single-family trees. (a) Two trees of the same size. (b) Tree 1 is a subtree of
Tree 2. (c) Two trees that partially overlap. SD, split distance; BSD, boot-split distance; eBSD, BSD of equal splits; dBSD, BSD
of different splits; p, number of equal splits; q, number of different splits; m, total number of splits; a, sum of bootstraps in all
splits; e, sum of bootstraps in equal splits; d, sum of bootstraps in different splits; Ma, mean bootstrap value; Me, mean
bootstrap value in equal splits; Md, mean bootstrap value in different splits.
In the example shown in Fig. 2, after the pruning procedure
(step 3), there is only one remaining split (split: AB|CD) that is
repeated several times in both trees. The remaining AB|CD split in
Tree 1 is separated by four nodes that have different bootstrap
values. In this case, the bootstrap of the remaining split is calculated
using Eq. 4, where $n$ is the total number of nodes between the
two sides of the split and $BS_i$ is the bootstrap value (adjusted to the
0–1 range) of node $i$.
$$\mathrm{Bootstrap} = 1 - \prod_{i=1}^{n} \left(1 - BS_i\right). \tag{4}$$
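Equation 4 amounts to one minus the product of the complements of the node bootstraps. The following minimal sketch shows the arithmetic; the function name `combined_bootstrap` and the four input values are assumptions chosen for illustration.

```python
import math

def combined_bootstrap(node_bootstraps):
    """Bootstrap of a split left after pruning (Eq. 4).

    node_bootstraps holds the bootstrap values, on the 0-1 scale, of the nodes
    that separate the two sides of the remaining split.
    """
    for bs in node_bootstraps:
        if not 0.0 <= bs <= 1.0:
            raise ValueError("bootstrap values must be adjusted to the 0-1 range")
    return 1.0 - math.prod(1.0 - bs for bs in node_bootstraps)

# Hypothetical example: one well-supported node (0.90) and three weak ones (0.10).
print(round(combined_bootstrap([0.90, 0.10, 0.10, 0.10]), 4))
```

Under this formula the combined support is never lower than the largest individual value, so a single well-supported node between the two sides is enough to give the remaining split high support.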
[Fig. 2 labels: Tree 1 and Tree 2 with taxa A, B, C, D, Q, and R, shown before and after pruning; node bootstrap labels in the figure include 90, 10, and 93.]
The key question regarding the BSD method is: What is the best
approach to phylogenetic tree comparison: using all branches, reliable
or not, with the appropriate weighting or using only branches supported by high bootstrap values? The first option is illustrated in
Fig. 1, whereas Fig. 3 shows an example of a tree comparison that
employs a bootstrap threshold of 70, i.e., only branches supported by
a higher bootstrap are taken into account in the comparison. The
second procedure appears reasonable and can be recommended in
some cases. However, it is not advisable as a general approach
because, when two large trees with varying bootstrap values are
Fig. 3. Example of the BSD algorithm using a bootstrap cutoff. The figure shows the comparison of two phylogenetic trees
that takes into account only those branches with bootstrap support greater than 70. SD split distance, BSD boot-split
distance, eBSD BSD of equal splits, dBSD BSD of different splits, p number of equal splits, q number of different splits,
m total number of splits, a sum of bootstraps in all splits, e sum of bootstraps in equal splits, d sum of bootstraps in different
splits, Ma mean bootstrap value, Me mean bootstrap value in equal splits, Md mean bootstrap value in different splits.
Values shown in the figure: threshold = 70, SD = 0.600, BSD = 0.536, eBSD = 0.619, dBSD = 0.454. Labeled splits:
71|52346 [75], 34671|25 [77], 71526|34 [80], 34|67125 [98].
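The numbers printed in Figs. 1 and 3 are consistent with the relations SD = q/m, eBSD = 1 - (e/a) Me, dBSD = (d/a) Md, and BSD = (eBSD + dBSD)/2. These relations are inferred here from the figure values rather than quoted from the chapter's equations, so the following Python sketch should be read under that assumption. The function name `boot_split_distance` and the (support, shared) input layout are hypothetical; treating the branch with support 86 in Fig. 3 as a non-shared split is also an inference from the reported SD of 0.600.

```python
def boot_split_distance(splits, threshold=0.0):
    """Return (SD, BSD, eBSD, dBSD) for two trees.

    splits    -- (support, shared) pairs, one per internal branch of either tree;
                 support is on a 0-1 scale, shared is True if the split occurs
                 in both trees
    threshold -- branches with support <= threshold are ignored
    """
    kept = [(bs, shared) for bs, shared in splits if bs > threshold]
    equal = [bs for bs, shared in kept if shared]
    different = [bs for bs, shared in kept if not shared]
    e, d = sum(equal), sum(different)
    a = e + d
    if a == 0:
        raise ValueError("no supported splits to compare")
    mean_e = e / len(equal) if equal else 0.0
    mean_d = d / len(different) if different else 0.0
    ebsd = 1.0 - (e / a) * mean_e   # assumed relation for equal splits
    dbsd = (d / a) * mean_d         # assumed relation for different splits
    sd = len(different) / len(kept)
    return sd, (ebsd + dbsd) / 2.0, ebsd, dbsd

# Fig. 1, first panel (no cutoff): supports 96 and 79 sit on the shared split.
fig1 = [(0.96, True), (0.79, True), (0.47, False), (0.26, False),
        (0.58, False), (0.08, False)]
# Fig. 3 (cutoff 70): supports 98 and 80 sit on the shared split 34|67125.
# The shared flags of the three branches below the cutoff are placeholders;
# those branches are filtered out before any sum is taken.
fig3 = [(0.98, True), (0.80, True), (0.75, False), (0.77, False),
        (0.86, False), (0.37, False), (0.35, False), (0.32, False)]

for name, data, cutoff in (("Fig. 1", fig1, 0.0), ("Fig. 3", fig3, 0.70)):
    print(name, [round(v, 3) for v in boot_split_distance(data, cutoff)])
```

The results agree with the values printed in both figures to within rounding in the third decimal place.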
Fig. 4. Comparisons of trees with six taxa. Bootstrap values were assigned randomly in each comparison.
[(a) Pairs of six-taxon trees (taxa A to F) with SD = 0, 0.33, 0.67, and 1. Splits shown include AB|CDEF, ABC|DEF,
ABCD|EF; AC|BDEF, ACB|DEF, ACBD|EF; and AC|EBDF, ACE|BDF, ACEB|DF. (b) Distance (0 to 1) plotted against
repetition (0 to 1,000), with BSD and SD series for SD = 0, 0.33, 0.67, and 1.]
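The SD values in Fig. 4 can be recomputed directly from the split lists shown there. The sketch below assumes single-letter taxon names and takes SD as the number of splits found in only one tree divided by the total number of splits in both trees; the helper names and the variable names for the three trees are hypothetical.

```python
def parse_split(text):
    """Turn 'AB|CDEF' into an unordered pair of taxon sets (one letter per taxon)."""
    left, right = text.split("|")
    return frozenset((frozenset(left), frozenset(right)))

def split_distance(splits_1, splits_2):
    """SD = splits found in only one tree / total splits in both trees."""
    set_1 = {parse_split(s) for s in splits_1}
    set_2 = {parse_split(s) for s in splits_2}
    return len(set_1 ^ set_2) / (len(set_1) + len(set_2))

tree_1 = ["AB|CDEF", "ABC|DEF", "ABCD|EF"]
one_change = ["AC|BDEF", "ACB|DEF", "ACBD|EF"]    # only AC|BDEF is new
all_changed = ["AC|EBDF", "ACE|BDF", "ACEB|DF"]   # no split shared with tree_1

print(split_distance(tree_1, tree_1))                # 0.0
print(round(split_distance(tree_1, one_change), 2))  # 0.33
print(split_distance(tree_1, all_changed))           # 1.0
```

Storing each split as an unordered pair of sets makes ACB|DEF and ABC|DEF compare as equal regardless of the order in which taxa are written.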
compare the trees using the BSD method, and this procedure was
repeated 1,000 times. The resulting plot (Fig. 4b) shows that, for
the comparisons of trees with SD of 0 and 1, the BSD values ranged
from 0 to 0.5 and from 0.5 to 1, respectively, and, in principle, could
assume all intermediate values. In the case of the comparisons that
differed in one split (SD = 0.33), the BSD value was greater than
0.33 in 75% of the comparisons, whereas for the comparisons that
differed in two splits (SD = 0.67), 25% of the BSD values were
greater than 0.67. Thus, the BSD method for tree comparison
offers better resolution than the SD method, especially for trees
with a small number of species.
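The two extreme ranges reported above (0 to 0.5 for SD = 0 and 0.5 to 1 for SD = 1) can be checked with a small simulation. The sketch assumes the relations eBSD = 1 - (e/a) Me, dBSD = (d/a) Md, and BSD = (eBSD + dBSD)/2, which are inferred from the values printed in Fig. 1, and draws bootstrap supports uniformly at random; the function names, the seed, and the uniform distribution are illustrative choices, not details taken from the chapter.

```python
import random

def bsd(equal, different):
    """BSD from supports (0-1 scale) of equal and different splits (assumed relations)."""
    e, d = sum(equal), sum(different)
    a = e + d
    mean_e = e / len(equal) if equal else 0.0
    mean_d = d / len(different) if different else 0.0
    return ((1.0 - (e / a) * mean_e) + (d / a) * mean_d) / 2.0

def simulate(n_equal, n_different, reps=1000, seed=0):
    """Return the smallest and largest BSD over reps random support assignments."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        equal = [rng.random() for _ in range(n_equal)]
        different = [rng.random() for _ in range(n_different)]
        values.append(bsd(equal, different))
    return min(values), max(values)

# Two six-taxon trees have three internal splits each, six in total.
print("SD = 0:", simulate(6, 0))
print("SD = 1:", simulate(0, 6))
```

With no different splits the expression reduces to (1 - Me)/2, and with no equal splits it reduces to (1 + Md)/2, which is why the two ranges meet at 0.5.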
Figure 5a shows the results of the analysis of six simulated alignments
with an increasing level of noise (divergence with respect to the initial
alignment), i.e., from alignment 0 (without
noise and producing a tree with bootstrap values of 100) to alignment
5 with the maximum level of noise. For each alignment, a tree was
constructed using the UPGMA method with the Web server DendroUPGMA (http://genomes.urv.cat/UPGMA). Distances were
calculated using the Jaccard coefficient, and bootstrap values were generated from 100 replicates. The results of the tree comparison (Fig. 5b)
using three different methods, namely, Nodal Distance (ND), SD,
and BSD, show that the BSD method presents a continuous distribution, resulting in a better resolution of the distances than the other
two methods. Indeed, the SD and ND methods fail to discern the
similarity between trees after six changes, whereas the BSD method
still reports discernible similarity (Fig. 5b). In order to compare the
three tree comparison methods, the distance reported by each
method was normalized to the maximum value in each case, i.e.,
the value after 46 changes (the maximum number of changes in the simulation),
at which the distance to the initial tree is 1.41, 0.30, and 0.42 for ND, SD, and
BSD, respectively. All three distance values indicate that the trees are
similar far above the random expectation, supporting the robustness
of all methods, but the BSD method presents a better resolution in
the tree comparison.
Fig. 5. Comparison of six trees constructed from alignments with increasing noise levels. (a) Comparison of trees from
six simulated alignments. The UPGMA tree from each alignment was reconstructed with the Web server DendroUPGMA
(http://genomes.urv.cat/UPGMA) using the Jaccard coefficient as the measure of distance and generating 100 bootstrap replicates.
Alignment 0 corresponds to the initial alignment without noise that perfectly separates all branches, resulting in a tree with
bootstrap values of 100 for all internal nodes. Alignments 1–5 correspond to the derivatives of the initial
alignment with increasing noise levels at each step (3, 6, 12, 26, and 46 changes, respectively). (b) Results of the comparison
of each tree (1 to 5) with the initial tree (0). The trees were compared using three methods: Split Distance (SD), Nodal
Distance (ND), and Boot-Split Distance (BSD). For the purpose of comparison, the results obtained with each of the three
methods were normalized to the maximum value in each case. [Axes in (b): normalized distance against trees.]

3.1.5. Analysis of Random Trees and the Significance of BSD Results

Fig. 6. Random BSD and SD depending on the tree size. Results of the tree comparison of random trees (with different
sizes ranging from 4 to 100 species) show that the BSD and SD increase up to 0.75 and 0.999, respectively.
[Axes: distance (0 to 1) against tree size (0 to 100); series BSD and SD.]

Fig. 7. The number of permutations and the BSD. (a) BSD depending on the number
of permutations and tree size. (b) Mean and standard deviation of the BSD for up to
100 permutations for trees with 20 species. [Axes: BSD against permutations.]
3.2. Analysis of Topological Trends in a Set of Phylogenetic Trees
Calculation of the Tree Inconsistency
of the given tree. The IS is calculated using Eqs. 5–7, where N is the
total number of trees, X is the number of splits in the given tree,
and Y is the number of times the splits from the given tree are found
in all trees of the FOL.

IS = \frac{1/Y - IS_{min}}{IS_{max}},    (5)

IS_{min} = \frac{1}{X N},    (6)

IS_{max} = \frac{1}{X} - IS_{min}.    (7)
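As a minimal sketch of Eqs. 5–7 in Python (the function name `inconsistency_score` and the example counts are hypothetical; the equations are used as reconstructed from the definitions of N, X, and Y given above):

```python
def inconsistency_score(x, y, n):
    """Inconsistency Score of one tree within a forest.

    x -- number of splits in the given tree (X)
    y -- number of times those splits are found in all trees of the forest (Y)
    n -- total number of trees in the forest (N), which must exceed 1
    """
    if n < 2 or x < 1 or not (x <= y <= x * n):
        raise ValueError("need n > 1, x >= 1 and x <= y <= x * n")
    is_min = 1.0 / (x * n)        # every split found in every tree
    is_max = 1.0 / x - is_min     # every split found only in the tree itself
    return (1.0 / y - is_min) / is_max

# Hypothetical forest: a tree with 3 splits, 8 trees, 6 split occurrences.
print(round(inconsistency_score(3, 6, 8), 3))
```

The score is 0 when every split of the tree occurs in every tree of the forest (Y = XN) and 1 when the splits occur nowhere but in the tree itself (Y = X).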
Using the quartet support values for each quartet, a 100 × 100
between-species distance matrix was calculated as d_ij = 1 - S_ij/Q_ij,
where d_ij is the distance between two species, S_ij is the number of
trees containing quartets in which the two species are neighbors,
and Q_ij is the total number of quartets containing the given two
species. Then, this distance matrix was used to construct different
heat maps using the matrix2png Web server ((70), Fig. 8b). In contrast to the BSD method, which is best suited for the analysis of the
evolution of individual genes, the distance matrices derived from
maps of quartets are used to analyze the evolution of species and to
disambiguate tree-like evolutionary relationships and highways
(preferential routes) of HGT.
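The distance definition above translates directly into code. The following Python sketch uses plain nested lists and a hypothetical three-species example; the function name and the counts are illustrative only, and pairs with no shared quartets are rejected rather than assigned a distance.

```python
def quartet_distance_matrix(neighbor_counts, quartet_counts):
    """Return d_ij = 1 - S_ij / Q_ij for every species pair (diagonal set to 0).

    neighbor_counts -- S, where S[i][j] counts quartets with i and j as neighbors
    quartet_counts  -- Q, where Q[i][j] counts all quartets containing i and j
    """
    n = len(quartet_counts)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if quartet_counts[i][j] <= 0:
                raise ValueError(f"no quartets contain species {i} and {j}")
            dist[i][j] = 1.0 - neighbor_counts[i][j] / quartet_counts[i][j]
    return dist

# Hypothetical counts for three species.
S = [[0, 8, 2], [8, 0, 5], [2, 5, 0]]
Q = [[0, 10, 10], [10, 0, 10], [10, 10, 0]]
for row in quartet_distance_matrix(S, Q):
    print([round(v, 2) for v in row])
```

Species that are neighbors in most of their shared quartets end up close together (small d_ij), which is the property the heat maps rely on.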
The quartet-based between-species distances were used to calculate the Tree-Net Trend (TNT) score. The TNT score is calculated
by rescaling each matrix of quartet distances to a 0–1 scale
Fig. 8. Mapping quartets. (a) Mapping quartets onto a set of ten trees. (b) A schematic of the procedure used to reconstruct
a species matrix from the map of quartets. [Panel labels: quartets q1 to qn, each with three topologies (t1, t2, t3); example
support values of 80%, 10%, 10% (solved) and 30%, 40%, 30% (unsolved); species i and j; distance matrix; heatmap.]
4. Phylogenetic Concepts in Light of Pervasive Horizontal Gene Transfer

4.1. Patterns in the Phylogenetic Forest of Life

Fig. 9. The Forest of Life (FOL). The distribution of the trees in the FOL by the number of
species. Modified from ref. 5. [Axes: number of trees against tree size.]

Fig. 10. Distribution of the gene functions among the NUTs. The functional classification of
genes was from the COG database (59). [Axes: percentage of NUTs against COG functions.]
We define the NUTs as trees for those COGs that were represented
in more than 90% of the included prokaryotes. This definition
yielded 102 NUTs. Not surprisingly, the great majority of the
NUTs are genes encoding proteins involved in translation and the
core aspects of transcription (Fig. 10). Among the NUTs, only 14
corresponded to COGs that consist of strict 1:1 orthologs (all of
them ribosomal proteins), whereas the rest of NUTs included
paralogs in some organisms (only the most conserved paralogs
were used for tree construction (5)). The 1:1 NUTs were similar
to the rest of the NUTs in terms of the connectivity in tree similarity
(1-BSD) networks and their positions in the single cluster of NUTs
obtained using CMDS.
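The selection rule for NUTs (COGs represented in more than 90% of the included species) amounts to a simple coverage filter. The Python sketch below uses a hypothetical mapping from COG identifiers to the species in which they occur; the identifiers, the species labels, and the function name are invented for illustration.

```python
def select_nuts(cog_species, n_species, min_fraction=0.90):
    """Return the COG ids represented in more than min_fraction of the species."""
    return sorted(
        cog for cog, species in cog_species.items()
        if len(set(species)) / n_species > min_fraction
    )

# Hypothetical data set with ten species, sp0 to sp9.
all_species = [f"sp{i}" for i in range(10)]
cog_species = {
    "COG_A": all_species,       # present in 10 of 10 species
    "COG_B": all_species[:9],   # present in 9 of 10: exactly 90%, not more
    "COG_C": all_species[:5],   # present in 5 of 10
}
print(select_nuts(cog_species, n_species=len(all_species)))
```

Because the criterion is "more than 90%", a COG found in exactly 90% of the species is excluded in this sketch.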
The 102 NUTs were compared to trees produced by analysis of
concatenations of universal proteins (47). The results showed that
most of the NUTs were topologically similar to a tree obtained by
the concatenation of 31 universal orthologous genes (5); in other
words, the "Universal TOL" constructed by Ciccarelli et al. (47)
was statistically indistinguishable from the NUTs and showed properties of a consensus topology. Not surprisingly, the 1:1 ribosomal
protein NUTs were even more similar to the universal tree than the
rest of the NUTs, in part because these proteins were used for the
construction of the universal tree and, in part, presumably because
of the low level of HGT among ribosomal proteins.
Fig. 11. Topological similarity between the NUTs and the rest of the FOL. Percentage of trees connected to the NUTs at a
different % of similarity (modified from Puigbò et al. 2009). [Axes: % of trees against NUTs; similarity classes >80%,
>60%, 40-60%, <40%, and <20%.]

Fig. 12. Clusters and patterns in the FOL. The seven clusters identified in the FOL using the
CMDS method and the mean similarity values between the 102 NUTs and all trees from each
of the 7 clusters are shown (modified from Puigbò et al. 2009). [Values shown: (1) 42.43% *, (2) 63.34% *,
(3) 62.11% **, (4) 56.21% **, (5) 50.17% **, (6) 48.6% **, (7) 49.66% **; * p = 0.0014, ** p < 0.000001.]
The TNT map of the NUTs was dominated by the tree-like signal
(green in Fig. 13a): the mean TNT score for the NUTs was 0.63
(Fig. 14b), so the evolution of the nearly universal genes of prokaryotes appears to be almost two-thirds tree-like (i.e., reflects the
topology of the supertree). The rest of the FOL stood in stark
contrast to the NUTs, being dominated by net-like evolution,
with a mean TNT value of 0.39 (Fig. 14c) (about 60% net-like).
Remarkably, areas of tree-like evolution were interspersed with areas
of net-like evolution across different parts of the FOL (Fig. 13b).
The major net-like areas observed among the NUTs were retained,
but additional ones became apparent, including Crenarchaeota, which
showed a pronounced signal of a non-tree-like relationship with
diverse bacteria as well as some Euryarchaeota (Fig. 13b). The
distribution of the tree and net evolutionary signals among different
groups of prokaryotes showed a striking split among the NUTs:
among the archaea, the tree signal was heavily dominant (mean
TNT_NUTs_Archaea = 0.80 ± 0.20), whereas among bacteria the
contributions of the tree and net signals were nearly equal (mean
TNT_NUTs_Bacteria = 0.51 ± 0.38). Among the rest of the trees in
the FOL, archaea also showed a stronger tree signal than bacteria,
but the difference was much less pronounced than it was among
the NUTs (mean TNT_FOL_Archaea = 0.47 ± 0.11 and mean
TNT_FOL_Bacteria = 0.34 ± 0.08). The conclusions on the tree-like
and net-like components of evolution made here are based on the
assumption that the supertree of the NUTs represents the tree-like
(vertical) signal. We did not perform direct tests of the robustness of
these conclusions to the supertree topology. However, observations
presented previously (5) suggest that the results are likely to be
robust given the coherence of the NUT topologies as well as the
similarity of the supertree topology and the topologies of the individual NUTs to the TOL obtained from concatenated sequences
of universally conserved ribosomal proteins (47).
5. Conclusions
The analysis of the phylogenetic FOL is a logical strategy for
studying the evolution of prokaryotes because each set of orthologous genes presents its own evolutionary history and no single
topology may represent the entire forest. Thus, the methods introduced in this article that compare trees without the use of a preconceived representative topology for the entire FOL may be of
wide utility in phylogenomics.
We have shown that, although no single topology may represent
the entire FOL and several distinct evolutionary trends are detectable,
the NUTs contain a strong tree-like signal. Although the tree-like
signal is quantitatively weaker than the sum total of the signals from
HGT, it is the most pronounced single pattern in the entire FOL.
Fig. 13. The Tree/Network Trend (TNT) score heat maps. (a) The 102 NUTs. (b) The FOL without the NUTs (6,799 trees).
The TNT increases from red (low score, close to random, an indication of net-like evolution) to green (high score, close to
the supertree topology, an indication of tree-like evolution). The species are ordered according to the topology of the
supertree of the 102 NUTs. In (a), the major groups of archaea and bacteria are denoted (modified from Puigbò et al. 2010).

Fig. 14. The Tree/Network Trends in the FOL and in the NUTs. (a) A hypothetical
equilibrium between the tree and net trends. (b) A schematic representation of the tree
tendency in the NUTs. (c) A schematic representation of the net tendency in the FOL.
[Values shown: NUTs 0.63; FOL 0.39.]
6. Exercises
1. Calculate the split distance (SD) and BSD of the following two
trees (the trees are in the Newick format):
(((A,B)61,C)53,D,E); (((A,C)76,B)38,D,E).
2. Calculate the Inconsistency Score of the tree X in the forest of
trees Y.
X = (((A,B),C),D,E);
Y = (((A,B),C),D,E); (A,B,(E,D)); (((A,C),B),D,E); (A,C,(B,D));
(A,B,(C,D)); (A,B,(C,E)); (A,E,(B,D)); (((A,C),D),E,F);
(((A,B),D),E,C); (((E,F),A),B,C).
Acknowledgments
The authors' research is supported by the Department of Health
and Human Services intramural program (NIH, National Library
of Medicine).
References
1. Huerta-Cepas, J., Dopazo, H., Dopazo, J., and Gabaldon, T. (2007) The human phylome. Genome Biol 8, R109.
2. Huerta-Cepas, J., Bueno, A., Dopazo, J., and Gabaldon, T. (2008) PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res 36, D491–496.
3. Frickey, T., and Lupas, A. N. (2004) PhyloGenie: automated phylome generation and analysis. Nucleic Acids Res 32, 5231–5238.
4. Sicheritz-Ponten, T., and Andersson, S. G. (2001) A phylogenomic approach to microbial evolution. Nucleic Acids Res 29, 545–552.
5. Puigbò, P., Wolf, Y. I., and Koonin, E. V. (2009) Search for a Tree of Life in the thicket of the phylogenetic forest. J Biol 8, 59.
6. Felsenstein, J. (2004) Inferring Phylogenies. Sunderland, MA: Sinauer Associates.
7. Nei, M., and Kumar, S. (2001) Molecular Evolution and Phylogenetics. Oxford: Oxford Univ.
8. Castresana, J. (2007) Topological variation in single-gene phylogenetic trees. Genome Biol 8, 216.
9. Soria-Carrasco, V., and Castresana, J. (2008) Estimation of phylogenetic inconsistencies in the three domains of life. Mol Biol Evol 25, 2319–2329.
10. Marcet-Houben, M., and Gabaldon, T. (2009) The tree versus the forest: the fungal tree of life and the topological diversity within the yeast phylome. PLoS ONE 4, e4357.
11. Robinson, D. F., and Foulds, L. R. (1981) Comparison of phylogenetic trees. Math Biosci 53, 131–147.
12. Puigbò, P., Garcia-Vallve, S., and McInerney, J. O. (2007) TOPD/FMTS: a new software to compare phylogenetic trees. Bioinformatics 23, 1556–1558.
13. Steel, M. A., and Penny, D. (1993) Distribution of tree comparison metrics - some new results. Systematic Biol 42, 126–141.
14. Bluis, J., and Shin, D.-G. (2003) Nodal distance algorithm: calculating a phylogenetic tree comparison metric. In: Proceedings of the third IEEE symposium on bioInformatics and bioEngineering. IEEE Computer Society, 87–94.
43. Eisen, J. A., and Fraser, C. M. (2003) Phylogenomics: intersection of evolution and genomics. Science 300, 1706–1707.
44. Salzberg, S. L., White, O., Peterson, J., and Eisen, J. A. (2001) Microbial genes in the human genome: lateral transfer or gene loss? Science 292, 1903–1906.
45. Galtier, N. (2007) A model of horizontal gene transfer and the bacterial phylogeny problem. Syst Biol 56, 633–642.
46. Galtier, N., and Daubin, V. (2008) Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond B Biol Sci 363, 4023–4029.
47. Ciccarelli, F. D., Doerks, T., von Mering, C., Creevey, C. J., Snel, B., and Bork, P. (2006) Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287.
48. Choi, I. G., and Kim, S. H. (2007) Global extent of horizontal gene transfer. Proc Natl Acad Sci USA 104, 4489–4494.
49. Koonin, E. V., Wolf, Y. I., and Puigbò, P. (2009) The Phylogenetic Forest and the Quest for the Elusive Tree of Life. Cold Spring Harb Symp Quant Biol.
50. Dagan, T., and Martin, W. (2009) Getting a better picture of microbial evolution en route to a network of genomes. Philos Trans R Soc Lond B Biol Sci 364, 2187–2196.
51. Boucher, Y., Douady, C. J., Papke, R. T., Walsh, D. A., Boudreau, M. E., Nesbo, C. L., et al. (2003) Lateral gene transfer and the origins of prokaryotic groups. Annu Rev Genet 37, 283–328.
52. Bucknam, J., Boucher, Y., and Bapteste, E. (2006) Refuting phylogenetic relationships. Biol Direct 1, 26.
53. Schliep, K., Lopez, P., Lapointe, F. J., and Bapteste, E. (2011) Harvesting evolutionary signals in a forest of prokaryotic gene trees. Mol Biol Evol 28, 1393–1405.
54. Beiko, R. G., Doolittle, W. F., and Charlebois, R. L. (2008) The impact of reticulate evolution on genome phylogeny. Syst Biol 57, 844–856.
55. Doolittle, W. F., and Zhaxybayeva, O. (2009) On the origin of prokaryotic species. Genome Res 19, 744–756.
56. Gogarten, J. P., and Townsend, J. P. (2005) Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol 3, 679–687.
57. Gogarten, J. P., Doolittle, W. F., and Lawrence, J. G. (2002) Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19, 2226–2238.
58. Puigbò, P., Wolf, Y. I., and Koonin, E. V. (2010) The tree and net components of prokaryote evolution. Genome Biol Evol 2, 745–756.
71. Koonin, E. V., and Wolf, Y. I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res 36, 6688–6719.
72. Ge, F., Wang, L. S., and Kim, J. (2005) The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol 3, e316.
73. Brochier, C., Bapteste, E., Moreira, D., and Philippe, H. (2002) Eubacterial phylogeny based on translational apparatus proteins. Trends Genet 18, 1–5.
74. Wolf, Y. I., Rogozin, I. B., Grishin, N. V., and Koonin, E. V. (2002) Genome trees and the tree of life. Trends Genet 18, 472–479.
75. Wolf, Y. I., Rogozin, I. B., Grishin, N. V., Tatusov, R. L., and Koonin, E. V. (2001) Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evolutionary Biology 1.
76. Creevey, C. J., Fitzpatrick, D. A., Philip, G. K., Kinsella, R. J., O'Connell, M. J., Pentony, M. M., et al. (2004) Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc Biol Sci 271, 2551–2558.
77. Brochier-Armanet, C., Boussau, B., Gribaldo, S., and Forterre, P. (2008) Mesophilic Crenarchaeota: proposal for a third archaeal phylum, the Thaumarchaeota. Nat Rev Microbiol 6, 245–252.
78. Elkins, J. G., Podar, M., Graham, D. E., Makarova, K. S., Wolf, Y., Randau, L., et al. (2008) A korarchaeal genome reveals new insights into the evolution of the Archaea. Proc Natl Acad Sci USA in press.
79. Wolf, Y. I., Aravind, L., Grishin, N. V., and Koonin, E. V. (1999) Evolution of aminoacyl-tRNA synthetases - analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res 9, 689–710.
80. Koonin, E. V. (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol 1, 127–136.
Chapter 4
Philosophy and Evolution: Minding the Gap Between
Evolutionary Patterns and Tree-Like Patterns
Eric Bapteste, Frédéric Bouchard, and Richard M. Burian
Abstract
Ever since Darwin, the familiar genealogical pattern known as the Tree of Life (TOL) has been prominent in
evolutionary thinking and has dominated not only systematics, but also the analysis of the units of
evolution. However, recent findings indicate that the evolution of DNA, especially in prokaryotes and
such DNA vehicles as viruses and plasmids, does not follow a unique tree-like pattern. Because evolutionary
patterns track a greater range of processes than those captured in genealogies, genealogical patterns are in
fact only a subset of a broader set of evolutionary patterns. This fact suggests that evolutionists who
focus exclusively on genealogical patterns are blocked from providing a significant range of genuine
evolutionary explanations. Consequently, we highlight challenges to tree-based approaches, and point
the way toward more appropriate methods to study evolution (although we do not present them
in technical detail). We argue that there is significant benefit in adopting a wider range of models, evolutionary representations, and evolutionary explanations, based on an analysis of the full range of evolutionary
processes. We introduce an ecosystem orientation into evolutionary thinking that highlights the importance
of "type 1 coalitions" (functionally related units with genetic exchanges, aka "friends with genetic benefits"), "type 2 coalitions" (functionally related units without genetic exchanges), "communal interactions,"
and emergent evolutionary properties. On this basis, we seek to promote the study of (especially
prokaryotic) evolution with dynamic evolutionary networks, which are less constrained than the TOL,
and to provide new ways to analyze an expanded range of evolutionary units (genetic modules, recombined
genes, plasmids, phages and prokaryotic genomes, pangenomes, microbial communities) and evolutionary
processes. Finally, we discuss some of the conceptual and practical questions raised by such network-based
representation.
Key words: Network, Lateral gene transfer, Horizontal gene transfer, Evolution, Prokaryotes,
Philosophy of biology, Units of evolution
Is the phylogenetic or a definitely nonphylogenetic system (e.g., an idealistic-morphological system) better suited to serve as a general reference
system, or does one of these systems for intrinsic reasons demand this
precedence over all others? (1)
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_4,
© Springer Science+Business Media, LLC 2012
E. Bapteste et al.
1. Genealogical Patterns and Evolutionary Patterns Are Two Different Things

[Figure panel text]
GP: Evolutionary phenomena associated with the genealogy
EP: GP + evolutionary phenomena that do not match the genealogy
Splitting events
GP: Evolutionary relationships = Genealogical relationships
EP: Evolutionary relationships = Genealogical relationships + Other relationships (ecological, functional, genetic partnerships)
GP: Evolutionary units = Genealogical units
EP: Evolutionary units = Genealogical units + Other evolving units

Fig. 1. Relationships between genealogical pattern (GP) (black) and evolutionary pattern (EP) (grey). Evolutionary patterns
encompass genealogical patterns but not the reverse.

2. What Does the Gap Between Genealogical Patterns and Evolutionary Patterns Imply?

3. Richer Conceptualization and Representation of Evolution
The biological world is not easily carved up at its joints. The use of
species/monophyletic groups as the primary unit of evolutionary
change assumes a strong form of uniformity and continuity in what
evolves. LGT is but one of many processes that transgress these
frontiers; it serves us as one indicator that this assumption does not
always obtain. Speciation patterns are of course patterns of
host, they can use different options to maximize their survival and that
of their host by enhancing either cyanobacterial photosynthesis or
ATP production (69). Similarly, phylogenetically heterogeneous
communities known as gut microbiomes, comprising archaea and
bacteria, converge in their repertoires of carbohydrate-active enzymes
to adapt to shared challenges, in large part thanks to LGT mediated
by mobile elements rather than gene family expansion (70).
Gut microbiomes of metazoans are full of "friends with genetic benefits." Last but not least, although the chimeric nature of many eukaryotic genomes is often underappreciated in deep eukaryotic
phylogenetics, type 1 coalitions can also be observed in eukaryotes.
Using the diatoms as an example, Moustafa et al. (71) found that 16%
of the P. tricornutum nuclear genes may have green algal origins (72).
Ignoring the probability that additional genes have been contributed
to the genome over time in a nonvertical manner, this means roughly
one in six of this diatom's genes could be expected to produce a
phylogenetic signal at odds with vertically inherited genes due to
endosymbioses followed by gene transfer to the host nucleus.
On the other hand, tight functional interactions between phylogenetically unrelated partners in symbioses, consortia, etc. can also
occur with few if any gene exchanges. We refer to functionally
related units with a shared evolutionary fate in which no genetic
material is swapped between communities and populations as "type
2 coalitions." Many biologists might find that evolutionary studies
of type 2 coalitions do not require new models of evolution that
go beyond the TOL. However, the consideration of these type
2 coalitions argues for the dependence of the change in the evolutionary fate of various subgroups on what others (often, members of
other species or other types of partner) in the community do, a
phenomenon that cannot be represented with a genealogical tree
alone. Consider the oft-studied Vibrio fischeri–Hawaiian bobtail
squid interaction, where bioluminescence of the squid allows it to
avoid predators. Bioluminescence is generated by quorum sensing
of the bacteria in the constrained environment (i.e., high-density
conditions) of the squid's mantle that they colonize. The fitness
gain from bioluminescence is not obvious for the Vibrio sans
symbiosis, and the squid alone cannot generate light, but as a
coalition they allow for novel adaptations for both the squid and
the Vibrio. To put things a bit simply: Vibrio do not need to glow,
and squids cannot glow, but they have coevolved the adaptations
of bioluminescence and those required for their cooperative behaviors. This illustrates our claim that we should not expect EP to
match GP, since it is the ecological interaction that allows for these
adaptations to occur, not the genealogical confinement alone (73).
Many cases of genuine coevolution (74), e.g., between pollinators
and plants or hosts and parasites, support this same conclusion.
Cases of type 2 coalitions are also well known in prokaryotes.
An example is the interspecific associations of anaerobic
Fig. 2. Distribution of genes of various functional categories in genomes of mobile elements. All functional categories of
genes, except genes of nuclear structure, can be found in mobile elements, many of which should benefit communal
evolution since expression of genes with cellular functions increases the fitness of cells containing the mobile elements,
which, in turn, increases the likelihood of the mobile elements being carried forward to the next cellular generation. Bars
for plasmids are in black; bars for phages are in white. The X-axis corresponds to the functional categories defined by
clusters of orthologous groups (COGs) (100). The Y-axis indicates the percentage of occurrences of these categories in an
unpublished data set of 148,864 plasmids and 79,413 phage sequences, annotated using RAMMCAP (101). Functional
categories are sorted as follows: (1) Information storage and processing; A: RNA processing and modification; B: chromatin
structure and dynamics; J: translation; K: transcription; L: replication and repair; (2) cellular processes; D: cell cycle control
and mitosis; Y: nuclear structure; V: defense mechanisms; T: signal transduction; M: cell wall/membrane/envelope
biogenesis; N: cell motility; Z: cytoskeleton; W: extracellular structures; U: intracellular trafficking, secretion, and vesicular
transport; O: posttranslational modification, protein turnover, and chaperone functions; (3) metabolism; C: energy
production and conversion; E: amino acid metabolism and transport; F: nucleotide metabolism and transport;
G: carbohydrate metabolism and transport; H: coenzyme metabolism and transport; I: lipid metabolism and transport;
P: inorganic ion transport and metabolism; Q: secondary metabolites biosynthesis, transport, and catabolism; (4) poorly
characterized; R: general functional prediction only; S: function unknown.
Fig. 3. Theoretical scheme of a dynamic evolutionary network and real polarized network of genetic partnerships between
Archaea and Bacteria. (a) Nodes are apparent entities that can be selected during evolution. Various -omics help determine
the various edges in such a network in order to describe covariation of fitness between nodes. Note that nodes can contain
other nodes (nodes are multilevel). Smaller grey nodes are genes. Some of these genes have phylogenetic affinities
indicated by long, dashed black edges, and others connected by plain thin edges are coexpressed. Collectively, some of
these gene associations define larger units (here, the two Vibrio genomes or ecological organisms, like the Vibrio–squid
emergent ecological individual). Some of these genes and genomes interact functionally with the products of other genes
and other genomes, defining coalitions (dashed grey lines). In many coalitions, the interaction between partners may be
transient, ephemeral, and not the result of a long coevolution, yet the adaptations they display still deserve evolutionary
analysis. Thus, edge length corresponds to the temporal stability of the association (closer nodes are in a more stable
relationship over time). (b) Network adapted from ref. 47 computed from gene trees, including only Archaea and a single
bacterial OTU in a phylogenetic forest of 6,901 gene trees with 59 species of Archaea and 41 species of Bacteria.
The isolated bacterial OTU (that can differ in different trees) is odd, since the rest of the tree comprises only archaeal
lineages. For this reason, the single odd taxon is called an intruder (47). Archaea are represented by squares, and Bacteria
are represented by circles. Edges are colored based on the lifestyle distance between the pairs of partners, from 0 (darkest
edges, same lifestyle) to 4 (lightest edges, 50% similar lifestyle). The largest lifestyle distance in that analysis was 8, so
the organisms with the greater number of LGT all had close to moderately distant lifestyles. Edge length is inversely
proportional to the number of transferred genes: the greater the number of shared genes between distantly related
organisms, the shorter the edge on the graph. The networks are polarized by arrows pointing from donors to hosts, here
showing LGT from Archaea to Bacteria. [Panel (a) labels: squid genome; Vibrio genomes; genetic and functional edges;
ecological organism.]

4. Exploiting Dynamic Evolutionary Networks
and kinds of connection to each other. Note that even though the
examples described above mainly concern the evolution of organisms,
the biotic entities entering coalitions, partnerships, and ecosystems
can be of many types, e.g., genes, operons, plasmids, genomes, organisms, coalitions, communities, etc. Whereas multilevel selection is
usually focused on the very different levels at each of which entities
of the same type interact (i.e., genes with genes, cells with cells,
organisms with organisms, etc.), a coalition approach is open to the
possibility that entities at different levels of organization can and do
interact. The Vibrio–squid symbiosis is such an example, where a
single organism interacts not with one individual organism but with
a group of individuals (i.e., a bacterial colony). Gut flora in many
metazoa has a similar profile: in those cases, an individual organism
interacts with a community of different microbial species. However, a
network-based representation of this complexity raises serious conceptual and practical questions. How could evolutionists make sense
of such dynamic evolutionary networks (except by reconstructing a
TOL) (13, 17, 88)? It is one thing to claim that whole ecosystems qua
ecosystems can evolve; it is another to try to model interactions, where
the monophyletic groups that are functional parts of those ecosystems
are not the only relevant units that one needs to model to track
evolutionary change. In the dynamic evolutionary networks
approach, it is an open question: Which units of evolution deserve
tracking and which explanatory units should be used in models?
To answer such questions, we need to think about the relation
between units of evolution (i.e., what actually evolves in response
to natural selection) and units of explanation (i.e., the conceptual
objects that should be used to model this change). In the GP
approach, it was largely assumed that representations of the changes
in the evolutionary units of the TOL were sufficient to provide the
explanatory units of evolutionary explanations. Monophyletic
genealogical relationships served both as evolutionary and explanatory units. We, like many others, have argued that while this representation may be appropriate for the evolution of some
monophyletic groups (especially monophyletic groups of eukaryotes), it is woefully inadequate for many microbes and is ruled
out by definition in the evolution of more complex biological
arrangements that we called coalitions (19, 41, 73). Let us now
see how other additional units of evolution and units of explanation
play out in this coalition world.
4.1. Searching Clusters in Networks
E. Bapteste et al.
have shown that clusters in networks, for instance in genome networks, are areas where nodes show a greater number of connections
among themselves than with the other nodes of the graph. We
expect to demonstrate that such patterns might be the result of
evolution, as we explain below.
But first, let us stress that looking for such clusters is consistent
with the natural inclination of biologists to favor significant groupings of phenomena. In tree pattern analysis, the search for clusters is
also central, and it has translated into the classic problems of ranking
and grouping (90). The problem of grouping has been solved by
privileging a single unified type of relation, namely, the genealogical
relation exhibited by nodes. This allowed objective pairs of nodes
shown to share a last common ancestor in a data set to be grouped
together and shown to be distally related. Ranking (e.g., the decision to classify a genealogical group as a species instead of a genus, an
order, etc.) was never truly solved and remains largely arbitrary
(91). This point was explicitly made by Darwin himself in Chapter
1 of the Origin. It is, therefore, somewhat ironic that evolutionary
explanations have reified clusters as real encapsulated (bounded)
evolutionary units by privileging genealogical relations. That is,
evolutionary explanations have treated evolutionary clusters as if
they were stable unitary units impervious to interference from other
clusters, apart from the change in the selective environment caused
by changes in the abiotic environment and the changes that any one
group causes in the other groups with which it interacts. Genealogical explanations have given absolute ontological priority to genealogical change of a certain type and been blind to other natural
processes that have deep consequences in the process of adaptation.
It behooves us to look at the neglected branches created by LGT,
hybridization, and other means of genetic exchange, coevolution,
and reticulation between branches in order to reexamine the adequacy of models that focus exclusively on well-compartmentalized
(i.e., modular) monophyletic groups. By looking at these usual
outliers in shared gene networks for instance, we identify new
clusters, some of which, we argue, are created and maintained by
selective pressures and evolutionary processes. Figure 4 illustrates
how clusters of partners of different types (e.g., clusters of bacteria
and plasmids, bacteria and phages, plasmids and phages) can unravel
the presence of groups of entities affected by processes of conjugation, transduction, and/or recombination, respectively. These entities are candidate friends with genetic benefits.
Importantly, as the ecosystems approach to microbial evolution
has taught us, the networks representing evolutionary dynamics
should not be purely genealogical; they should also be structural
and functional. Ecosystems involve both biotic and abiotic processes. Abiotic processes do not have genealogies (after all, they
are not genetic systems) and the arrangements of species in communities can be initiated or reorganized in ways that do not reflect
[Fig. 4 panel labels: (b) Genetic world of plasmids (conjugation events; recombination between plasmids); Genetic world of phages (transduction events)]
Fig. 4. Remarkable patterns and processes in shared genome networks. (a) Schematic diagram of a connected component,
showing a candidate coalition of friends with genetic benefits, where each node represents a genome, cellular (white for
bacterial chromosome), plasmidic (grey), or phage (black). Data are real and were kindly provided by S. Halary and P. Lopez (9).
Two nodes are connected by an edge if they share homologous DNA (reciprocal best BLAST hit with an E-value cutoff of 1e-20
and a minimum identity of 100%). Edges are weighted by the number of shared DNA families. The layout was produced by
Cytoscape using an edge-weighted spring-embedded model, meaning that genomes sharing more DNA families are closer on
the display (102). Clusters of bacteria and plasmids suggest events of conjugation; clusters of bacteria and phages suggest
events of transduction; clusters of phages and plasmids suggest exchange of DNA between classes of mobile elements, etc.
(b) Three connected components corresponding to three genetic worlds, defined by displaying connections between genomes
(same color code) for a reciprocal best BLAST hit with an E-value cutoff of 1e-20 and a minimum of 20% identity. Their three
gene pools are absolutely distinct, which suggests that some mechanisms and barriers structure the genetic diversity and the
genetic evolution outside the TOL. These real data were also kindly provided by S. Halary and P. Lopez (9).
However, for Hennig and the many evolutionists that his thinking influenced, this multiplicity was in part reducible, since one
dimension (the genealogical) provided the best proxy for all the
others. As Hennig put it: "making the phylogenetic system the
general reference system for special systematics has the inestimable
advantage that the relations to all other conceivable biological systems can be most easily represented through it. This is because the
historical development of organisms must necessarily be reflected in
some way in all relationships between organisms. Consequently,
direct relations extend from the phylogenetic system to all other
possible systems, whereas there are often no such direct relations
between these other systems" (1). However, the -omic disciplines
reveal that the number of processes, interactions, systems, and
relationships affecting evolution, and the various entities that
are, in fact, units of evolution, are more astonishingly diverse than
Hennig (and for that matter, Darwin) recognized. Phylogenomics
also provides a strong case that the TOL is a poor proxy for all the
features of biodiversity (93), as it would explain only the history of
1% of the genes in a complete tree for prokaryotes (12) or of about
10–15% at the level of bacterial phyla (94, 95), and, by definition,
none of the emergent and communal microbial properties. Likewise, some functional analyses of metagenomic data show that the
functional signal is, in some cases, stronger than the genealogical
signal in portions of the genome, showing that the presence of
genetic material with a given function matters more than the presence of a given genealogical lineage in some ecosystems (90).
Thus, the claim that one system has precedence over the others
deserves empirical reassessment. We maintain that such reassessment has the potential to unravel important hidden correlations in the
relationships between evolving entities, overlooked thus far when
they were not consistent with the genealogy.
Network approaches (in contrast to branching genealogical
representations) are precisely the right tool to use for this purpose;
they are better suited to the evolutionary modeling needed here in
that they are agnostic about the structure of the relevant topologies.
Network-based studies can easily represent the multiplicity of relationships discovered by -omics approaches, and test whether,
indeed, one system (i.e., one of the networks) is a better proxy
than the others. In fact, all sorts of relationships between evolving
entities can be represented on these graphs. Proteomics allows one
to draw connections based on protein–protein interaction and functional associations. Metagenomics proposes environmental and
functional connections. Correlation studies between multiple
[Fig. 5 panel labels: for each of Organism/Environment i and Organism/Environment j, four networks: Phylogenetic, Functional, Physical, Regulatory]
Fig. 5. Schematic correlations between -omics networks. Each node corresponds to one individual gene. Four networks
illustrate the relationships inferred by -omics for these genes: black edges between nodes indicate the shortest distances
in terms of phylogenetics, functional interaction, physical distance, and regulatory distances for these genes. The question
whether one of these networks is a better proxy for all the others (within an organism or an environment or between
organisms or environments) is an open (empirical) question. Shaded edges indicate paths that are identical between more
than two networks of a single organism; bold edges indicate paths that are identical between comparable networks of
distinct organisms. For instance, in this graph, a cluster of three interconnected genes showed functional, physical, and
regulatory coherence both in organisms/environments i and j. However, this pattern was not captured by their phylogenetic
affinities in gene trees.
Fig. 6. Functional networks of shared genes for plasmids, phages, and prokaryotes. Four functional genome networks,
including 2,209 genomes of plasmids, 3,477 genomes of phages, and 116 prokaryotic chromosomes (from the same data
set as Fig. 2), were reconstructed by displaying only edges that correspond to the sharing of genetic material involved in
each of these functions on a separate graph. Here, we show only the giant connected components of four functional
genome networks: (a) for J: translation, (b) for C: energy production, (c) for T: signal transduction, and (d) for U: intracellular
trafficking. Bacterial genomes are in black, archaeal genomes in white, plasmids in light grey, and phages in dark grey. It is
clear that these functional networks are quite different because the histories of the genes coding for these functions were
distinct. However, some local correspondence can be found between the GCC of these functional graphs, suggesting that
some functional categories underwent the same evolutionary history in some groups of genomes, sometimes consistently
with the taxonomy (e.g., translation and energy production in bacteria and archaea), sometimes not. The layout was
produced by Cytoscape (102).
of archaeal pathways. Also, Dilthey and Lercher characterized spatially and metabolically coherent clusters of genes in gamma-proteobacteria. Though these genes share connections in spatial and
metabolic networks, they present multiple inconsistent phylogenetic
origins with the rest of the genes of the genomes hosting them. This
lack of correlation between the genealogical affinities of genes otherwise displaying remarkable shared connections in their spatial and
functional interactions suggests that analyses of correlations in these
particular networks could be used to predict LGT of groups of tightly
associated genes (Dilthey and Lercher, in prep.). Here, additional
evolutionary units (gene coalitions), consistent with the selfish
operon theory, could be identified (110).
Our more general point is that, if, at some level of evolutionary
analysis, no network is an objectively better proxy for all the others,
local parts of different networks could still show significant
correlations, useful to elaborate evolutionary scenarios (e.g., involving genetic modules, pathway evolution, etc.). Just as Dilthey and
Lercher suggested for clusters of metabolic genes, locally common
paths between physical and functional networks reconstructed for
many organisms could define clusters of genes with physical and
functional interactions that are found in multiple taxa. If the genes
making these clusters are distantly related in terms of phylogeny,
such findings suggest that these genes may have been laterally
transferred, possibly between distantly related members of a type 1
coalition. With further investigation, the physical and functional
associations observed between these genes, in multiple taxa, could
be interpreted as emerging phenotypes owing to LGT.
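The comparison of networks described above can be illustrated with a minimal sketch. It assumes that each network is available as a plain list of undirected gene pairs; the function names (edge_set, jaccard, discordant_shared_edges) and the toy gene labels are hypothetical and are not taken from any published pipeline. The Jaccard index of two edge sets is used here only as one simple way to ask how well one network serves as a proxy for another, and the last function lists edges that are common to the physical and functional networks but absent from the phylogenetic one, i.e., candidate gene clusters worth inspecting for LGT.

```python
def edge_set(pairs):
    """Normalize undirected edges so that (a, b) and (b, a) compare equal."""
    return {frozenset(pair) for pair in pairs if pair[0] != pair[1]}


def jaccard(edges_a, edges_b):
    """Fraction of edges shared by two networks (1.0 if both are empty)."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0


def discordant_shared_edges(physical, functional, phylogenetic):
    """Edges present in both physical and functional networks but not
    supported by the phylogenetic network."""
    shared = physical & functional
    return sorted(tuple(sorted(edge)) for edge in shared - phylogenetic)


# Toy data (hypothetical): genes g1-g3 form a coherent physical and
# functional cluster that the phylogenetic network does not recover.
physical = edge_set([("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g4", "g5")])
functional = edge_set([("g2", "g1"), ("g2", "g3"), ("g1", "g3"), ("g5", "g6")])
phylogenetic = edge_set([("g1", "g4"), ("g2", "g5"), ("g4", "g5")])

print(jaccard(physical, functional))
print(jaccard(physical, phylogenetic))
print(discordant_shared_edges(physical, functional, phylogenetic))
```

With real data, the same comparison would presumably need a significance test (for instance, against randomized networks) before any cluster is interpreted biologically.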
Correlations between networks based on transcriptomics, proteomics, and metagenomics could also inform evolutionists about the
robustness of coalitions (e.g., the presence of resilient and recurring
edges in various OTUs/coalitions/environments/over time). Think
of a trophic cycle in a given ecosystem. Various species can play the
same functional role, but the cycle remains. A species can be replaced
(via competition, migration, etc.) within a trophic cycle. Representing
this in networks, we would observe that some clusters have changed (a
network focused on genealogical relationships) while others are stable
(those focused on functional properties). The fact that some functional relationships persist longer than some genealogical ones may be
an indication of an evolutionary cluster that cannot be tracked by GP
alone (97), i.e., when the functional composition of a community
remains stable over longer times than the taxonomic composition.
Again, this is typically observed in gut flora: the functional network
and the phylogenetic network are not always well correlated, since the
composition and diversity of microbial populations change within the
gut, even if the microbes keep thriving on a shared gene pool (96).
It would also be observed in natural geochemical cycles (92), which
have the potential to introduce functional, genetic, and environmental
signatures in evolution that might outlive genealogical ones.
Since this search for correlation between networks does not
impose an a priori dominant pattern on biodiversity, it could offer
an improved and finer-grained representation of some aspects of
evolution. In particular, this approach would facilitate the recognition of evolutionary units not revealed in analyses based solely on
monophyletic groupings. The evaluation of the evolutionary
importance of such units cannot properly begin until they are
made into explicit objects of evolutionary study. If significant correlations reveal a pattern worth naming and deserving evolutionary
explanation, they will thus have opened up pathways in the study of
evolutionary origins not accessible in a strictly phylogenetic evolutionary system (Fig. 6).
5. Conclusion
We suggest that in nature coalitions (both friends with genetic
benefits and type 2 coalitions) are an important category of evolving
entities. Developing the tools (e.g., of network analysis) to analyze
the evolutionary impact of the processes into which coalitions enter
and the various roles that coalitions (and their evolutionarily interesting components) play will provide an improved basis for the study
of evolution, one that can include but also go beyond what can be
achieved with TOL-based modeling. We also suggest that modeling
of evolutionary adaptive processes can be significantly improved by
examining the evolutionary dynamics of coalitions, in particular by
including parameters informative about the topology and structure
of the components of the networks classified in various ways, including their evolutionary roots. Such modeling is open to various types
of assortments of partners (whereas GPs focus on same types of
associations), various durations of association (whereas GPs focus
on the long term relative to organismal scale), and all the degrees of
functional integration (whereas GPs focus almost exclusively on the
maximally integrated associations, such as mitochondria, or on the
shallow associations of coevolution). Because genealogical patterns
and evolutionary patterns are not isomorphic, evolutionists should
not be too strict in maintaining the ontological superiority of genealogical patterns. In genealogical patterns, evolutionists had (rightly
or not) an intuition about what persisted through time: species and
monophyletic groups. This allowed for the changing of parts while
maintaining continuity of some entity (which was assumed to
be what evolution was about). In the broader (and a priori less
constrained) perspective for which we argued, i.e., in ecosystem-oriented evolutionary thinking, what persists through evolution
needs to be pinned down more carefully since monophyletic groups
are not the exclusive units and do not provide all of the ways of
carving out the patterns. In particular, studies of the correlations and
clusters in evolutionary dynamic networks could offer a possible
future alternative approach to complete the TOL perspective.
Box 1
Reconstructing Genome and Gene Networks
The various networks described in this chapter can easily be
reconstructed, for instance using genetic similarities.
For genome networks, a set of protein and/or nucleic acid
sequences from complete genomes must be retrieved from a relevant database (e.g., the NCBI (http://www.ncbi.nlm.nih.gov/Entrez)).
All these sequences are then BLASTed against one
another. To define homologous DNA families, sequences are
clustered when they share a reciprocal best-BLAST hit (RBBH)
relationship with at least one of the sequences of the cluster,
and a minimum sequence identity. For each pair of sequences,
all best BLAST hits with an E-value below 1e-20 are stored in a MySQL
database. To define homologous DNA families, sequences must
be clustered, for instance using a single-linkage algorithm or
MCL. With the former approach, a sequence is added to a cluster
if it shares an RBBH relationship with at least one of the sequences
of the cluster. We call the DNA families so defined clusters of homologous DNA
families (CHDs). The requirement that RBBH
pairs share a minimum sequence identity, in addition to a BLAST
homology, can also be taken into account to define the CHDs.
Thus, distinct sets of CHDs can be produced, e.g., for various
identity thresholds (from 100%, to study recent events, to 20%,
to study events of all evolutionary ages). Based on these sets of
CHDs and their distribution in the genomes, genome networks
can be built to summarize the DNA-sharing relationships between
the genomes under study, as summarized in Fig. 7. A network
layout can be produced by Cytoscape software using an edge-weighted spring-embedded model.
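The single-linkage step described above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list of (sequence, sequence, percent identity) tuples for pairs that are already known to be reciprocal best BLAST hits, and the function name single_linkage_families and the sequence labels are hypothetical. A union-find structure merges two sequences into the same family whenever their RBBH pair meets the identity threshold, so that a sequence joins a family if it is linked to at least one of its members.

```python
from collections import defaultdict


def single_linkage_families(rbbh_pairs, min_identity=0.0):
    """Cluster sequences into families by single linkage over RBBH pairs.

    rbbh_pairs: iterable of (seq_a, seq_b, percent_identity) tuples
    (hypothetical input format, one tuple per reciprocal best hit).
    """
    parent = {}

    def find(x):
        # Register unseen sequences and return the root of x's family.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for seq_a, seq_b, identity in rbbh_pairs:
        root_a, root_b = find(seq_a), find(seq_b)
        # Merge the two families only if the pair passes the threshold.
        if identity >= min_identity and root_a != root_b:
            parent[root_a] = root_b

    families = defaultdict(set)
    for seq in list(parent):
        families[find(seq)].add(seq)
    return sorted(sorted(members) for members in families.values())


# Toy RBBH pairs (hypothetical labels: genome name, then gene name).
pairs = [
    ("ph1_g1", "plas1_g3", 100.0),
    ("plas1_g3", "chr1_g7", 85.0),
    ("ph2_g2", "chr2_g5", 30.0),
]
print(single_linkage_families(pairs, min_identity=20.0))
print(single_linkage_families(pairs, min_identity=100.0))
```

Running the function at several identity thresholds yields the distinct sets of families discussed in the text: at 100% only the identical pair stays together, whereas at 20% older relationships are also grouped.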
Several different evolutionary gene networks (EGNs) can be
reconstructed to be contrasted with protein–protein interaction
networks or networks of metabolic pathways. For instance,
an EGN based on sequence similarity can be reconstructed in which
each node in the graph corresponds to a sequence. Two nodes
are connected by edges if their sequences show significant similarity, as assessed by BLAST. Hundreds of thousands of DNA
(or protein) sequences can, thus, be all BLASTed against each
other. The results of these BLASTs (the best BLAST scores
between two sequences, their percent of identity, the length over
which they align, etc.) are stored in databases. Groups of homologous sequences are then inferred using clustering algorithms (such
as the single-linkage algorithm). The BLAST score or the percentage of identity between each pair of sequences, or in fact any
evolutionary distance inferred from the comparison of the two
sequences, can then be used to weight the corresponding edges.
Most similar sequences can then be displayed closer on the EGN.
The less stringent the BLAST E-value cutoff (e.g., 1e-5), the more inclusive
the EGNs. Since not all gene forms resemble one another, however, discontinuous variations structure the graph.
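As an illustration of how the edges of such an EGN might be derived, the sketch below reads BLAST results in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score). The function name parse_blast_tabular, the default thresholds, and the example hits are assumptions made for the example; the original analyses stored hits in a database rather than in Python dictionaries.

```python
def parse_blast_tabular(lines, max_evalue=1e-5, min_identity=0.0):
    """Build weighted, undirected EGN edges from tabular BLAST hits.

    Returns {(seq_a, seq_b): {"identity", "evalue", "bitscore"}} and keeps
    the highest-scoring hit for each unordered pair of sequences.
    """
    edges = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue = float(fields[10])
        bitscore = float(fields[11])
        # Ignore self-hits and hits that fail either threshold.
        if query == subject or evalue > max_evalue or identity < min_identity:
            continue
        key = tuple(sorted((query, subject)))
        if key not in edges or bitscore > edges[key]["bitscore"]:
            edges[key] = {"identity": identity, "evalue": evalue,
                          "bitscore": bitscore}
    return edges


# Toy hits (hypothetical values).
hits = [
    "geneA\tgeneB\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t540",
    "geneB\tgeneA\t98.5\t300\t4\t0\t1\t300\t1\t300\t2e-150\t538",
    "geneA\tgeneC\t35.0\t120\t70\t3\t10\t130\t5\t125\t1e-4\t45.1",
    "geneA\tgeneA\t100.0\t300\t0\t0\t1\t300\t1\t300\t0.0\t600",
]
strict = parse_blast_tabular(hits, max_evalue=1e-5)
relaxed = parse_blast_tabular(hits, max_evalue=1e-3)
print(sorted(strict))
print(sorted(relaxed))
```

Relaxing the E-value threshold adds the weaker geneA–geneC edge, which shows in miniature why less stringent cutoffs produce more inclusive networks.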
Finally, clusters in genome and gene networks can be found
by computing modules, using packages for graph analysis, such
as the MCODE 1.3 Cytoscape plugin (default parameters) and
igraph (98), or by modularity maximization (as described in refs.
11 and 99).
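To make the notion of modularity concrete, the following sketch scores a given partition of an unweighted, undirected network with the usual modularity formula, Q = sum over clusters of (internal edges / m) - (summed degree / 2m)^2, where m is the total number of edges. It does not search for the best partition, which is what the packages cited above do; the function name modularity and the toy network are hypothetical.

```python
def modularity(edges, membership):
    """Modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (node_a, node_b) pairs, each edge listed once.
    membership: dict mapping every node to a cluster label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # number of edges with both ends in the cluster
    degree_sum = {}  # sum of node degrees within the cluster
    for a, b in edges:
        cluster_a, cluster_b = membership[a], membership[b]
        degree_sum[cluster_a] = degree_sum.get(cluster_a, 0) + 1
        degree_sum[cluster_b] = degree_sum.get(cluster_b, 0) + 1
        if cluster_a == cluster_b:
            internal[cluster_a] = internal.get(cluster_a, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Toy network: two triangles joined by a single bridge (c-d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {node: 0 for node in "abcdef"}
print(modularity(toy_edges, two_clusters))
print(modularity(toy_edges, one_cluster))
```

The partition that separates the two triangles scores higher than the one that lumps all nodes together, reflecting the idea that clusters are groups of nodes with more connections among themselves than with the rest of the graph.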
[Fig. 7 panel labels: Genomes (chr1, chr2, ph1, ph2, ph3, ph4, plas1, plas2); BLAST/clustering; Display; Global network (Chromosomes, Plasmids); Connected component; Giant connected component]
Fig. 7. Illustration for Box 1. Genes found in each type of DNA vehicle and belonging to the same homologous DNA family
are represented by a similar dash. The distribution of DNA families in mobile elements and cellular chromosomes can be
summarized by a presence/absence matrix, which can be used to reconstruct a network. With real data, the network of
genetic diversity is disconnected yet highly structured. It presents multiple connected components.
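The step from a presence/absence matrix to a genome network can be sketched as follows, assuming the matrix is held as a dictionary that maps each genome to the set of DNA families it carries. The function names (genome_network, connected_components), the genome labels, and the family identifiers are hypothetical. Two genomes are linked when they share at least one family, the edge weight is the number of shared families, and the connected components are then listed from largest to smallest, so that the first one plays the role of the giant connected component.

```python
from itertools import combinations


def genome_network(family_presence):
    """Return {(genome_a, genome_b): number of shared DNA families}."""
    edges = {}
    for g1, g2 in combinations(sorted(family_presence), 2):
        shared = len(family_presence[g1] & family_presence[g2])
        if shared:
            edges[(g1, g2)] = shared
    return edges


def connected_components(nodes, edges):
    """List connected components (sorted node lists), largest first."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


# Toy presence/absence data (hypothetical family identifiers).
presence = {
    "chr1": {"F1", "F2", "F6"}, "chr2": {"F2"},
    "plas1": {"F1", "F3", "F6"}, "plas2": {"F3"},
    "ph1": {"F4"}, "ph2": {"F4"}, "ph3": {"F5"},
}
network = genome_network(presence)
components = connected_components(presence, network)
print(network)
print(components)
```

In this toy example the chromosomes and plasmids fall into one component and the phages into others, mimicking the disconnected yet structured networks described in the caption.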
6. Exercises
1. What are the computational steps required to reconstruct a
genome network?
2. Cite four examples of communal evolution.
3. Cite three examples of coalitions.
4. In your opinion, is the genealogical pattern the best proxy for all
evolutionary patterns? What aspects of evolution in particular
cannot be described by a TOL only? Are there aspects of
evolution that can be described by the TOL that cannot be
captured in a network-based approach?
5. Are genes from all functional categories found in the genomes
of mobile elements?
Acknowledgments
This paper was made possible through a series of meetings funded
by the Leverhulme Trust (Perspectives on the Tree of Life),
organized by Maureen O'Malley, whom we want to thank dearly.
We also thank P. Lopez, S. Halary, and K. Schliep for help with
some analyses and figures, and P. Lopez and L. Bittner for critical
discussions.
References
1. Hennig, W. (1966) Phylogenetic systematics.
Urbana.
2. Daubin, V., Moran, N.A., and Ochman, H.
(2003) Phylogenetics and the cohesion of
bacterial genomes. Science 301, 829–832.
3. Galtier, N., and Daubin, V. (2008) Dealing
with incongruence in phylogenomic analyses.
Philos Trans R Soc Lond B Biol Sci 363,
4023–4029.
4. Ciccarelli, F.D., Doerks, T., von Mering, C.,
Creevey, C.J., Snel, B., and Bork, P. (2006)
Toward automatic reconstruction of a highly
resolved tree of life. Science 311, 1283–1287.
5. Kurland, C.G., Canback, B., and Berg, O.G.
(2003) Horizontal gene transfer: a critical
view. Proc Natl Acad Sci USA 100,
9658–9662.
6. Lawrence, J.G., and Retchless, A.C. (2009)
The interplay of homologous recombination
and horizontal gene transfer in bacterial speciation. Methods Mol Biol 532, 29–53.
7. Retchless, A.C., and Lawrence, J.G. (2010)
Phylogenetic incongruence arising from fragmented speciation in enteric bacteria. Proc
Natl Acad Sci USA 107, 11453–11458.
8. Retchless, A.C., and Lawrence, J.G. (2007)
Temporal fragmentation of speciation in bacteria. Science 317, 1093–1096.
9. Halary, S., Leigh, J.W., Cheaib, B., Lopez, P.,
and Bapteste, E. (2010) Network analyses
structure genetic diversity in independent
genetic worlds. Proc Natl Acad Sci USA
107, 127–132.
10. Brilli, M., Mengoni, A., Fondi, M., Bazzicalupo, M., Lio, P., and Fani, R. (2008) Analysis
of plasmid genes by phylogenetic profiling
and visualization of homology relationships
using Blast2Network. BMC Bioinformatics
9, 551.
11. Dagan, T., Artzy-Randrup, Y., and Martin, W.
(2008) Modular networks and cumulative
impact of lateral transfer in prokaryote
51. O'Malley, M.A. (2007) Exploratory experimentation and scientific practice: Metagenomics and
the proteorhodopsin case. History and Philosophy of the Life Sciences 29(3), 337–360.
52. Strasser, B.J. (2008) GenBank – Natural History in the 21st Century? Science 322,
537–538.
53. Strasser, B.J. (2010) Laboratories, Museums,
and the Comparative Perspective: Alan A.
Boyden's Serological Taxonomy, 1925–1962.
Historical Studies in the Natural Sciences 40
(2), 149–182.
54. Bapteste, E., and Boucher, Y. (2008) Lateral
gene transfer challenges principles of microbial
systematics. Trends Microbiol 16, 200–207.
55. Walsby, A.E. (1994) Gas vesicles. Microbiol
Rev 58, 94–144.
56. Lo, I., Denef, V.J., Verberkmoes, N.C., Shah,
M.B., Goltsman, D., DiBartolo, G., Tyson, G.
W., Allen, E.E., Ram, R.J., Detter, J.C.,
Richardson, P., Thelen, M.P., Hettich, R.L.,
and Banfield, J.F. (2007) Strain-resolved community proteomics reveals recombining genomes of acidophilic bacteria. Nature 446,
537–541.
57. Nesbo, C.L., Bapteste, E., Curtis, B., Dahle,
H., Lopez, P., Macleod, D., Dlutek, M., Bowman, S., Zhaxybayeva, O., Birkeland, N.K.,
and Doolittle, W.F. (2009) The genome of
Thermosipho africanus TCF52B: lateral
genetic connections to the Firmicutes and
Archaea. J Bacteriol 191, 1974–1978.
58. Wilmes, P., Simmons, S.L., Denef, V.J., and
Banfield, J.F. (2009) The dynamic genetic
repertoire of microbial communities. FEMS
Microbiol Rev 33, 109–132.
59. Vogl, K., Wenter, R., Dressen, M., Schlickenrieder, M., Ploscher, M., Eichacker, L., and
Overmann, J. (2008) Identification and analysis of four candidate symbiosis genes from
Chlorochromatium aggregatum, a highly
developed bacterial symbiosis. Environ
Microbiol 10, 2842–2856.
60. Wanner, G., Vogl, K., and Overmann, J. (2008)
Ultrastructural characterization of the prokaryotic symbiosis in Chlorochromatium aggregatum. J Bacteriol 190, 3721–3730.
61. Lindell, D., Jaffe, J.D., Coleman, M.L.,
Futschik, M.E., Axmann, I.M., Rector, T.,
Kettler, G., Sullivan, M.B., Steen, R., Hess,
W.R., Church, G.M., and Chisholm, S.W.
(2007) Genome-wide expression dynamics
of a marine virus and host reveal features of
co-evolution. Nature 449, 83–86.
62. Lindell, D., Sullivan, M.B., Johnson, Z.I.,
Tolonen, A.C., Rohwer, F., and Chisholm,
86. Bouchard, F. (2011) How ecosystem evolution strengthens the case for functional pluralism. In: Functions: selection and mechanisms.
Huneman, P. ed.: Synthese Library, Springer.
87. Konstantinidis, K.T., and Tiedje, J.M. (2005)
Genomic insights that advance the species
definition for prokaryotes. Proc Natl Acad
Sci USA 102, 2567–2572.
88. Doolittle, W.F. (2009) Eradicating Typological Thinking in Prokaryotic Systematics and
Evolution. Cold Spring Harb Symp Quant
Biol.
89. Popa, O., Hazkani-Covo, E., Landan, G., Martin,
W., and Dagan, T. (2011) Directed networks reveal
genomic barriers and DNA repair bypasses to
lateral gene transfer among prokaryotes.
Genome Res 21(4), 599–609.
90. Brogaard, B. (2004) Species as Individuals.
Biology and Philosophy 19, 223–242.
91. Ereshefsky, M. (2010) Mystery of mysteries:
Darwin and the species problem. Cladistics
26, 1–13.
92. Falkowski, P.G., Fenchel, T., and Delong, E.
F. (2008) The microbial engines that drive
Earth's biogeochemical cycles. Science 320,
1034–1039.
93. Doolittle, W.F., and Zhaxybayeva, O. (2010)
Metagenomics and the Units of Biological
Organization. Bioscience 60, 102–112.
94. Lerat, E., Daubin, V., and Moran, N.A.
(2003) From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol 1, E19.
95. Touchon, M., Hoede, C., Tenaillon, O.,
Barbe, V., Baeriswyl, S., Bidet, P., Bingen,
E., Bonacorsi, S., Bouchier, C., Bouvet, O.,
Calteau, A., Chiapello, H., Clermont, O.,
Cruveiller, S., Danchin, A., Diard, M., Dossat, C., Karoui, M.E., Frapy, E., Garry, L.,
Ghigo, J.M., Gilles, A.M., Johnson, J., Le
Bouguenec, C., Lescat, M., Mangenot, S.,
Martinez-Jehanne, V., Matic, I., Nassif, X.,
Oztas, S., Petit, M.A., Pichon, C., Rouy, Z.,
Ruf, C.S., Schneider, D., Tourret, J., Vacherie, B., Vallenet, D., Medigue, C., Rocha, E.
P., and Denamur, E. (2009) Organised
genome dynamics in the Escherichia coli species results in highly diverse adaptive paths.
PLoS Genet 5, e1000344.
96. Dinsdale, E.A., Edwards, R.A., Hall, D., Angly,
F., Breitbart, M., Brulc, J.M., Furlan, M., Desnues, C., Haynes, M., Li, L., McDaniel, L.,
Moran, M.A., Nelson, K.E., Nilsson, C.,
Olson, R., Paul, J., Brito, B.R., Ruan, Y.,
Swan, B.K., Stevens, R., Valentine, D.L.,
Thurber, R.V., Wegley, L., White, B.A., and
Part II
Natural Selection, Recombination, and Innovation
in Genomic Sequences
Chapter 5
Selection on the Protein-Coding Genome
Carolin Kosiol and Maria Anisimova
Abstract
Populations evolve as mutations arise in individual organisms and, through hereditary transmission, may
become fixed (shared by all individuals) in the population. Most mutations are lethal or have negative
fitness consequences for the organism. Others have essentially no effect on organismal fitness and can
become fixed through the neutral stochastic process known as random drift. However, mutations may also
produce a selective advantage that boosts their chances of reaching fixation. Regions of genes where new
mutations are beneficial, rather than neutral or deleterious, tend to evolve more rapidly due to positive
selection. Genes involved in immunity and defense are a well-known example; rapid evolution in these
genes presumably occurs because new mutations help organisms to prevail in evolutionary arms races
with pathogens. In recent years, genome-wide scans for selection have enlarged our understanding of the
evolution of the protein-coding regions of the various species. In this chapter, we focus on the methods to
detect selection in protein-coding genes. In particular, we discuss probabilistic models and how they have
changed with the advent of new genome-wide data now available.
Key words: Conserved and accelerated regions, Positive selection scans, Codon models, Time and
space heterogeneity of genome evolution, Phylo-HMMs, Selection-mutation models
1. Introduction
Protein-coding genes are the DNA sequences used as templates for
the production of a functional protein. Such sequences consist of
nucleotide triplets called codons. During the protein production
phase, codons are transcribed and then translated into amino acids
(AAs) according to the organism's genetic code. In the past, selection studies on coding DNA mainly focused on the analysis of
particular proteins of interest. With the availability of comparative
genomic data, the emphasis has shifted from the study of individual
proteins to genome-wide scans for selection. The overview of genomic data underlying the genome-wide analysis of protein-coding
genes is included in Subheading 2.
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_5,
© Springer Science+Business Media, LLC 2012
Fig. 1. A diagram illustrating the different data levels to analyze protein-coding sequences
and the relationship of the various approaches modeling spatial and temporal heterogeneity.
2. Comparative Genome Data
Several whole-genome sequence data sets are now available
for selection scans. Mammalian genomes are well represented
(in particular primates), and insect genomes are becoming more
numerous (in particular Drosophila). These data can be downloaded
as orthologous alignments from the Ensembl (2) and UCSC (3)
browsers. Methods for constructing orthologous sets of genes are
reviewed in Chapter 9 of Volume 1 (4).
In light of recent advances in DNA sequencing, with the
so-called next-generation sequencing (NGS) technologies that
have dramatically reduced the cost and time needed to sequence
an organism's entire genome, large-scale sequencing projects involving many organisms have been and are currently being undertaken. In particular, genome projects resequencing 1000 Human,
1000 Drosophila melanogaster, and 1001 Arabidopsis individuals
are ongoing. These polymorphism data from multiple individuals
from several species enable us to detect very recent selection.
Together with the progress in sequencing technologies,
algorithmic advances now allow the de novo assembly of genomes
from NGS data (see Chapter 5 in Volume 1 (5)), including complex
mammalian genomes (e.g., giant panda genome (6)). Announced
shortly after the Human 1000 Genomes Project, the 1000 Plant
Genomes Project is another similarly large-scale genomics
endeavor that takes advantage of the speed and efficiency of NGS.
The Genome 10K project aims to assemble a "genomic zoo", a
collection of DNA sequences representing the genomes of 10,000
vertebrate species, approximately one for every vertebrate genus.
All these genomes can be subject to scans for selection, for which
we outline methods below.
3. Methods
3.1. Probabilistic Models for Genome Evolution
Fig. 2. Visualization of an example phylo-HMM showing the probabilistic graph and the input alignment. The grey columns
represent the conserved state; the white columns the fast state. At each time step, a new state is visited according to the
transition probabilities (μ and ν parameters on arcs) and a multiple alignment column is emitted according to
the conserved and nonconserved phylogenetic models ψc and ψn. Thereby, the phylogenetic models include the
parameters describing the tree and the pattern of substitution.
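The state-switching part of such a two-state phylo-HMM can be sketched in a few lines. This is only an illustration under stated assumptions: the names `mu` (probability of leaving the conserved state) and `nu` (probability of leaving the nonconserved state) are hypothetical labels for the two transition parameters on the arcs, and the emission of alignment columns from the two phylogenetic models is deliberately left out.

```python
import random

def simulate_states(length, mu, nu, seed=0):
    """Simulate a path over conserved ('c') and nonconserved ('n') states.

    mu: assumed per-column probability of switching c -> n (hypothetical name).
    nu: assumed per-column probability of switching n -> c (hypothetical name).
    """
    rng = random.Random(seed)
    # Start from the stationary distribution of the two-state chain.
    state = "c" if rng.random() < nu / (mu + nu) else "n"
    path = []
    for _ in range(length):
        path.append(state)
        switch = mu if state == "c" else nu
        if rng.random() < switch:
            state = "n" if state == "c" else "c"
    return path

def expected_conserved_fraction(mu, nu):
    """Long-run fraction of columns in the conserved state."""
    return nu / (mu + nu)

path = simulate_states(30, mu=0.1, nu=0.05)
print("".join(path))
print(expected_conserved_fraction(0.1, 0.05))
```

In a full phylo-HMM, each visited state would additionally emit an alignment column with a probability computed on the tree under the corresponding phylogenetic model.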
selection acts to preserve the protein sequence so that the nonsynonymous substitution rate is lower than the synonymous rate, with
ω < 1. Neutrally evolving sequences exhibit similar nonsynonymous and synonymous rates, with ω ≈ 1.
The first methods that used the ω-ratio as a criterion to detect
positive selection were based on pairwise estimation of dN and dS
rates with counting methods (e.g., see ref. 32). However, ML
estimates of pairwise dN and dS based on a codon model were
shown to outperform all other approaches (33). Moreover, a
Markov codon model is naturally extended to multiple sequence
alignments, unlike the counting methods. This, together with the
benefits of the probabilistic framework within which codon models
are defined, made codon models very popular in studies of positive
selection in protein-coding genes.
The first two codon models were proposed simultaneously in
the same issue of Molecular Biology and Evolution ((34) and (35)).
The model of Goldman and Yang (34) included the transition/
transversion rate ratio κ, and modeled the selective effect indirectly
using a multiplicative factor based on Grantham (36) distances, but
was later simplified to estimate the selective pressure explicitly using
the ω parameter (37). The main distinction between the first codon
models concerns the way to describe the instantaneous rates
with respect to equilibrium frequencies: (1) proportional to the
equilibrium frequency of a target codon (as in Goldman and Yang
(34)) or (2) proportional to the frequency of a target nucleotide
(as in Muse and Gaut (35)).
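The difference between the two parameterizations can be made concrete with a small sketch. It is a simplified illustration, not either published model in full: both variants here share a κ and ω factor (an assumption made for comparability), and the function names `gy_rate` and `mg_rate` are hypothetical. Only single-nucleotide changes between sense codons of the standard genetic code receive a nonzero rate.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order ('*' marks stop codons).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def base_rate(i, j, kappa, omega):
    """Rate factor from codon i to codon j before any frequency term."""
    diffs = [(a, b) for a, b in zip(i, j) if a != b]
    if len(diffs) != 1 or CODE[i] == "*" or CODE[j] == "*":
        return 0.0  # multiple changes and stop codons are disallowed
    a, b = diffs[0]
    rate = kappa if frozenset((a, b)) in TRANSITIONS else 1.0
    if CODE[i] != CODE[j]:
        rate *= omega  # nonsynonymous change
    return rate

def gy_rate(i, j, kappa, omega, codon_freq):
    """Rate proportional to the frequency of the target codon."""
    r = base_rate(i, j, kappa, omega)
    return r * codon_freq[j] if r else 0.0

def mg_rate(i, j, kappa, omega, nuc_freq):
    """Rate proportional to the frequency of the target nucleotide."""
    r = base_rate(i, j, kappa, omega)
    if not r:
        return 0.0
    target = next(b for a, b in zip(i, j) if a != b)
    return r * nuc_freq[target]

uniform = {c: 1.0 / len(SENSE) for c in SENSE}
nuc = {"T": 0.1, "C": 0.2, "A": 0.3, "G": 0.4}
print(gy_rate("TTT", "TTC", 2.0, 0.5, uniform))  # synonymous transition
print(mg_rate("TTT", "TTA", 2.0, 0.5, nuc))      # nonsynonymous transversion
```

A complete rate matrix would fill the diagonal so that each row sums to zero and would usually be rescaled so that branch lengths are measured in expected substitutions per codon.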
Recently, empirical codon models have been estimated (see refs.
38 and 39) that summarize substitution patterns from large quantities of protein-coding gene families. In contrast to the parametric
codon models that estimate gene-specific parameters (e.g., transition/transversion rate ratio κ, selective pressure ω, etc.), the empirical codon
models do not explicitly consider distinct factors that shape protein
evolution. Standard parametric models assume that protein evolution proceeds only by successive single-nucleotide substitutions.
However, empirical codon models indicate that model accuracy is
significantly improved by incorporating instantaneous doublet and
triplet changes. Kosiol et al. (39) also found that the affiliations
among a codon, the amino acid it encodes, and the physicochemical
properties of the amino acid are the main driving factors of the process of
codon evolution. Neither multiple nucleotide changes nor the
strong influence of the genetic code nor amino acid properties
form a part of the standard parametric models.
On the other hand, parametric models have been very successful
in applications studying biological forces shaping protein evolution
of individual genes. Thus, combining the advantages of parametric
and empirical approaches offers a promising direction. Kosiol,
Holmes, and Goldman (39) explored a number of combined
codon models that incorporated empirical AA exchangeabilities
The first codon models assumed constant nonsynonymous and synonymous rates among sites and over time. Although most proteins
evolve under purifying selection most of the time, positive selection
may drive the evolution in some lineages. During episodes of
adaptive evolution, only a small fraction of sites in the protein
have the capacity to increase the fitness of the protein via AA
replacements. Thus, approaches assuming constant selective pressure over time and over sites lack power in detecting genes affected
by positive selection. Consequently, various scenarios of variation in
selective pressure were incorporated in codon models, making
them more powerful at detecting positive selection, and short
episodes of adaptive evolution in particular. Evidence of positive
selection on a gene can be obtained by an LRT comparing two
nested models: a model that does not allow positive selection
(constraining ω ≤ 1 to represent the null hypothesis) and a
model that allows positive selection (ω > 1 is allowed in the alternative hypothesis). Positive selection is detected if a model with ω > 1
fits the data significantly better than the model restricting
ω ≤ 1 at all sites and lineages. However, the asymptotic null distribution may deviate from the standard χ² due to boundary problems or
if some parameters become inestimable (e.g., see refs. 41 and 42).
The simplest site models use the general discrete distribution with a
prespecified number of site classes. Each site class i has an independent parameter ωi estimated by ML together with the proportions of
sites pi in each class. Since a large number of site categories requires
many parameters, three categories are usually used (requiring five
independent parameters). To test for positive selection, several pairs
of nested site models were defined to represent the null and alternative hypotheses in LRTs. For example, model M1a includes two
site classes, one with ω0 < 1 and another with ω1 = 1, representing the neutral model of evolution (the null hypothesis). The
alternative model M2a extends M1a by adding an extra site class
with ω2 ≥ 1 to accommodate sites evolving under positive selection. Significance of the LRT is tested using the χ² distribution with 2 degrees of freedom for
the M1a vs. M2a comparison. We test the C7 gene for positive
selection by the LRT comparing nested models M1a and M2a
(Table 1).
Model M2a has two additional parameters compared to model
M1a. The resulting LRT statistic is 2 × (log L2 − log L1) = 2 ×
(−6369.67 − (−6377.35)) = 2 × 7.68 = 15.36. This is much
greater than the critical value of the chi-square distribution
χ² (df = 2, at 5%) = 5.99, and we calculate a p-value of
P = 5.0e−04. However, the M1a vs. M2a comparison for genes
C8B and C9 is not significant.
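The arithmetic of this test can be checked with a short sketch. It assumes that −6377.35 is the log likelihood of M1a (as in Table 1) and −6369.67 that of M2a, and it uses the closed form of the chi-square tail probability for two degrees of freedom, exp(−x/2), so no statistics library is needed. The function name `lrt_two_df` is a hypothetical helper.

```python
import math

def lrt_two_df(lnl_null, lnl_alt):
    """Likelihood ratio statistic and chi-square (df = 2) tail probability."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_value = math.exp(-stat / 2.0)  # survival function of chi-square with 2 df
    return stat, p_value

# 5% critical value for df = 2, obtained by inverting the same closed form.
critical_5pct = -2.0 * math.log(0.05)

stat, p = lrt_two_df(lnl_null=-6377.35, lnl_alt=-6369.67)
print(round(stat, 2), round(critical_5pct, 2), p)
```

With these inputs the statistic is 15.36 against a critical value of about 5.99, and the tail probability is on the order of 5e−04, in line with the values quoted above.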
Another LRT can be performed on the basis of the modified
model M8 with two site classes: one with sites where the ω-ratio is
drawn from the beta distribution (with 0 ≤ ω ≤ 1 describing
the neutral scenario), and a second, discrete class with ω ≥ 1.
Constraining ω = 1 for this second class provides a sufficiently
Table 1
Parameter estimates and log likelihoods for an LRT of positive selection for the complement immunity component C7

M1a (nearly neutral)
Site class        0            1
Proportion        p0 = 0.69    (p1 = 1 − p0 = 0.31)
ω ratio           ω0 = 0.07    (ω1 = 1)
Log likelihood    L1 = −6377.35

M2a (selection)
Site class        0            1            2
Proportion        p0 = 0.70    p1 = 0.29    (p2 = 1 − p0 − p1 = 0.01)
ω ratio           ω0 = 0.08    (ω1 = 1)     ω2 = 10.89
Log likelihood    L2 = −6369.67

The model M2a is the alternative model with a class of sites with ω2 ≥ 1.
The null hypothesis M1a is the same model but with ω2 = 1 fixed
Fig. 3. An estimate of ω for each branch of a six-species phylogeny (human, chimp, macaque, mouse, rat, dog). Shown is the maximum
likelihood estimate for the gene C8B. Each branch is labeled with the corresponding
estimate of ω.
the asymptotic null distribution and thus result in an elevated false-positive rate.
In the case of episodic selection where any combination of
branches of a phylogeny can be affected, both a Bayesian approach, in
lieu of the standard LRTs, and multiple testing have been suggested.
The multiple LRT approach is most concerned with controlling the
false-positive rate of selection inference, and is less suited to infer
the best-fitting selection history. In the hypothetical example
(Fig. 3), a total of 2^9 − 1 = 511 selection histories (excluding
the history without selection on any branch) need to be considered.
The Bayesian analysis allows a probability distribution over possible
selection histories to be computed, and therefore permits estimates
of prevalence of positive selection on individual branches and
clades. Such an approach evaluates uncertainty in selection histories
using their posterior probabilities and allows robust inference of
interesting parameters, such as the switching probabilities for gains
and losses of positive selection (43).
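The count of selection histories follows directly from the number of branches. The sketch below assumes a fully resolved unrooted tree, which has 2n − 3 branches for n species, and treats a history as a yes/no label for positive selection on each branch; the function names are hypothetical.

```python
from itertools import product

def n_branches_unrooted(n_leaves):
    """Number of branches in a fully resolved unrooted tree."""
    return 2 * n_leaves - 3

def selection_histories(n_branches):
    """Yield every 0/1 labeling of branches except the all-zero history."""
    for history in product((0, 1), repeat=n_branches):
        if any(history):
            yield history

branches = n_branches_unrooted(6)
total = sum(1 for _ in selection_histories(branches))
print(branches, total)  # 9 511
```

The exponential growth of this space with the number of branches is what makes exhaustive pairwise LRTs impractical for larger trees.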
Other models (e.g., with dS-variation among sites (56)) also
may be extended to allow changes of selective regimes on different
branches. This is achieved by adding further parameters, one per
branch, describing the deviation of selective pressure on a branch
from the average level on the whole tree under the site model. Such a
model is parameter rich and can be used for exploratory purposes
on data with long sequences, but does not provide a robust way of
testing whether ω > 1 on a branch is due to positive selection on a
lineage or due to inaccuracy of the ML estimation.
Kosakovsky Pond and Frost (56) suggested detecting lineage-specific variation in selective pressure using the genetic algorithm
(GA), a computational analogue of evolution by natural selection.
The GA approach was successfully applied to phylogenetic reconstruction (see refs. 57, 58, and 59). In the context of detecting
lineage-specific positive selection, GA does not require an a priori
hypothesis. Instead, the algorithm samples regions of the whole
hypothesis space according to their fitness measured by AICc.
The branch-model selection with GA may also be adapted to incorporate dN and dS among-site variation, although this imposes a
much heavier computational burden.
In branch and branch-site models, change in selection regime is
always associated with nodes of a tree, but the selective pressure
remains constant over the length of each branch. Guindon et al. (60)
proposed a Markov-modulated model, where switches of selection
regimes may occur at any site and any time on the phylogeny. In a
covarion-like manner, this codon model combines two Markov
processes: one governs the codon substitution while the other
specifies rates of switches between selective regimes. These models
can be used to study the patterns of the changes in selective pressures over time and across sites by estimating the relative rates of
4. Notes/Discussion
With the wider use of codon models to detect selection, some
questioned the statistical basis of testing based on branch-site models.
In 2004, Zhang found that the original branch-site test (68)
produced excessive false positives when its assumptions were not
met. The modified branch-site test was shown to be more robust to
model violations (see refs. 45 and 69), and is now commonly used in
genome-wide selection scans (e.g., see ref. 70). Recently, however,
another simulation study by Nozawa et al. (71) suggested that this
modification also showed an excess of false positives. Yang and
Dos Reis (54) defended the branch-site test by examining the null
distribution and showing that Nozawa and colleagues (71) misinterpreted their simulation results. However, it is clear that even tests
with good statistical properties are affected by data quality and the
extent of model violations. Below, we list factors that can affect
the test, and so should be taken into account when analyzing
genome-wide data.
4.2. Overlapping Reading Frames
Another line of development in modeling the evolution of protein-coding genes concerns evaluating selective pressures on overlapping reading frames (ORFs). In particular, viruses are known to
frequently encode genes with ORFs to maximize information content of their short genomes. This may increase codon bias and affect
evolutionary constraints on overlapping regions. Indeed, regions of
genes that encode several protein products evolve under constraints
4.5. Selection on Synonymous Sites
certain (optimal) codons serves to increase the translational accuracy. Pressure to optimize for translational efficiency, robustness,
and kinetics leads to synonymous codon bias, which was shown to
widely affect mammalian genes (100), as well as genes of fast-evolving pathogens like viruses (101). Positive selection on synonymous sites was unheard of until recently, when Resch et al.
(102) conducted a large-scale study of selection on synonymous
sites in mammalian genes. They measured selection by comparing
the average rate of synonymous substitutions (dS) to the average
substitution rate in the corresponding introns (dI). While purifying
selection was found to affect 28% of genes (dS/dI < 1), 12% of
genes were found to have been affected by positive selection on
synonymous sites (dS/dI > 1). The signal of positive selection
correlated with lower predicted mRNA stability compared to
genes with negative selection on synonymous sites, suggesting
that mRNA destabilization (affecting mRNA levels and translation)
could be driving positive selection on synonymous sites.
An increasing number of experimental studies may now explain
how synonymous mutation may be affected by positive or negative
selection. Codon bias to match skews of tRNA abundances may
influence translation (103). Changes at silent sites can disrupt splicing control elements and create new cryptic splice sites, and
mRNA and transcript stability can be affected through preference for or
avoidance of certain sequence motifs (see refs. 104 and 100). Silent
changes may affect gene regulation via constraints for efficient binding of miRNA to sense mRNA (see refs. 105 and 100). The cotranslational protein folding hypothesis suggests that speed-dependent
protein folding may be another source of selective pressure (106)
because slower production could cause the protein to take an altered
final form (as has been shown in multidrug resistance-1 (107)).
Finally, synonymous changes may act to modulate expression by
altering mRNA secondary structure, affecting protein abundance
(108).
Models of codon evolution currently provide the best approach
for studying selection on silent sites. In particular, models with
variable synonymous rates (see refs. 64 and 109) may be applied
to evaluate the extent of variability of synonymous rates in a gene
and to predict the positions of most conserved and most variable
synonymous sites (for example, see ref. 101). Whether or not the
site has been affected by selection requires further testing. For
example, Zhou, Gu, and Wilke (110) suggested distinguishing
two types of synonymous substitution rates: the rate of conserving
synonymous changes dSC (between preferred codons or between
rare codons) and the rate of nonconserving synonymous changes
dSN (between codons from the two different groups, rare and
preferred). Silent sites with dSN/dSC > 1 may be considered to
be under positive selection, and significance can be tested based on
an LRT. Alternatively, synonymous rates at sites may be compared
5. Outlook: Selection Scans Using Population Data
6. Exercises
Q1. Amino acid and codon substitution models: How many
parameters need to be estimated in the instantaneous rate matrix
Q defining a reversible empirical AA model? How many such parameters are necessary to estimate for a reversible empirical codon
model? How many parameters are to be estimated in both cases if a
model is nonreversible?
Q2. Positive selection scans: Go to the UCSC genome browser
(http://genome.ucsc.edu). Search for the HAVCR1 (hepatitis A
virus cellular receptor 1) in the human genome (assembly
NCBI36/hg18) belonging to the mammalian clade.
Genome browser tracks provide a summary of previous analyses of coding regions. Switch "Pos Sel Genes" under "Genes and
Gene Prediction Tracks" to "full" and collect information on the
LRTs that were performed for the six species scan. Next, switch the
"17-Way Cons" under "Comparative Genomics" to "full". Why are
only a few bases in the HAVCR1 gene conserved? Is this consistent
with the results obtained by LRTs?
Click on the "Conservation" track to retrieve the multiple
sequence alignment for the HAVCR1 gene. Use the PAML software
(http://abacus.gene.ucl.ac.uk/software/paml.html) to test the
models for positive selection on any lineage of the mammalian
tree by comparing models M1a and M2a with an LRT.
$$\frac{1 - e^{-2s}}{1 - e^{-4Ns}} \approx \frac{2s}{1 - e^{-4Ns}}, \qquad \frac{4Ns}{1 - e^{-4Ns}} = \frac{2\gamma}{1 - e^{-2\gamma}}$$
Acknowledgments
C.K. is supported by the University of Veterinary Medicine Vienna.
M.A. is supported by the ETH Zurich and also receives funding from
the Swiss National Science Foundation (grant 31003A_127325).
References
1. Pal C, Papp B, Lercher MJ (2006) An
integrated view on protein evolution. Nature
Rev Genet 7:337–348
2. Flicek P, Aken BL, Ballester B, Beal K, Bragin
E, Brent S, Chen Y, Clapham P, Coates G,
Fairley S, Fitzgerald S, Fernandez-Banet J,
Gordon L, Gräf S, Haider S, Hammond M,
Howe K, Jenkinson A, Johnson N,
Kähäri A, Keefe D, Keenan S, Kinsella R,
Kokocinski F, Koscielny G, Kulesha E,
Lawson D, Longden I, Massingham T,
McLaren W, Megy K, Overduin B, Pritchard
B, Rios D, Ruffier M, Schuster M, Slater G,
Smedley D, Spudich G, Tang YA, Trevanion
S, Vilella A, Vogel J, White S, Wilder SP,
Zadissa A, Birney E, Cunningham F, Dunham
I, Durbin R, Fernandez-Suarez XM,
Chapter 6
Methods to Detect Selection on Noncoding DNA
Ying Zhen and Peter Andolfatto
Abstract
Vast tracts of noncoding DNA contain elements that regulate gene expression in higher eukaryotes.
Describing these regulatory elements and understanding how they evolve represent major challenges for
biologists. Advances in the ability to survey genome-scale DNA sequence data are providing unprecedented
opportunities to use evolutionary models and computational tools to identify functionally important
elements and the mode of selection acting on them in multiple species. This chapter reviews some of the
current methods that have been developed and applied to noncoding DNA, what they have shown us, and
how they are limited. Results of several recent studies reveal that a significantly larger fraction of noncoding
DNA in eukaryotic organisms is likely to be functional than previously believed, implying that the functional
annotation of most noncoding DNA in these organisms is largely incomplete. In Drosophila, recent studies
have further suggested that a large fraction of noncoding DNA divergence observed between species may be
the product of recurrent adaptive substitution. Similar studies in humans have revealed a more complex
pattern, with signatures of recurrent positive selection being largely concentrated in conserved noncoding
DNA elements. Understanding these patterns and the extent to which they generalize to other organisms
awaits the analysis of forthcoming genome-scale polymorphism and divergence data from more species.
Key words: Adaptive evolution, Neutrality test, Selective constraint, Deleterious mutations,
McDonald–Kreitman test, Population genetics
1. Introduction and Methods
The lion's share of higher eukaryotic genomes comprises noncoding
DNA, which encodes the information necessary to regulate the
level, timing, and spatial organization of the expression of
thousands of genes (1). A growing body of evidence supports the
view that the evolution of gene expression regulation is the primary
genetic mechanism behind the modular organization, functional
diversification, and origin of novel traits in higher organisms
(2–5). Historically, noncoding DNA has been little studied relative
to proteins and the lack of knowledge about its function has led to
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_6,
© Springer Science+Business Media, LLC 2012
Fig. 1. The effect of directional selection on the distribution of polymorphism frequencies (DPFs). Plotted are the expected
proportion of polymorphisms on the y-axis and the frequency in a sample of 20 chromosomes on the x-axis, based on equations in
Bustamante et al. (90). The curves show neutral (2Ns = 0), positive selection (2Ns = +10), negative selection (2Ns = −10), and a mixture (50% neutral : 40% 2Ns = −10 : 10% 2Ns = +10). Selected variants are assumed to have additive effects on fitness. In brown is a mixture model
that posits 50% of newly arising mutations being neutral, 40% being negatively selected, and 10% positively selected. The
similarity of this mixture model to neutral expectations implies that it may be difficult to detect positive or negative
selection in regions of the genome with pluralistic selective pressures based on the shape of the DPF alone.
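Curves of this kind can be approximated numerically. The sketch below is an illustration under stated assumptions rather than the computation used for the figure: it takes the standard Poisson random field density for a selected allele at frequency x, proportional to (1 − e^(−2γ(1−x))) / ((1 − e^(−2γ)) x (1 − x)), combines it with binomial sampling of n chromosomes, and integrates with a simple midpoint rule. Treating γ as the scaled coefficient labeled 2Ns in the figure is an assumption, and all function names are hypothetical.

```python
import math

def expected_counts(gamma, n=20, steps=5000):
    """Expected number of sites (per unit mutation rate) with 1..n-1 derived copies."""
    def weight(x):
        # Selection term of the density; tends to (1 - x) as gamma -> 0.
        if abs(gamma) < 1e-9:
            return 1.0 - x
        return (1.0 - math.exp(-2.0 * gamma * (1.0 - x))) / (1.0 - math.exp(-2.0 * gamma))
    h = 1.0 / steps
    counts = []
    for i in range(1, n):
        total = 0.0
        for k in range(steps):
            x = (k + 0.5) * h  # midpoint rule avoids the endpoints 0 and 1
            total += x ** (i - 1) * (1.0 - x) ** (n - i - 1) * weight(x)
        counts.append(math.comb(n, i) * total * h)
    return counts

def normalize(values):
    s = sum(values)
    return [v / s for v in values]

def mean_count(props):
    return sum((i + 1) * p for i, p in enumerate(props))

neutral = expected_counts(0.0)
positive = expected_counts(10.0)
negative = expected_counts(-10.0)
# Mixture of newly arising mutations: 50% neutral, 40% negative, 10% positive.
mixture = [0.5 * a + 0.4 * b + 0.1 * c for a, b, c in zip(neutral, negative, positive)]

for name, c in [("neutral", neutral), ("positive", positive),
                ("negative", negative), ("mixture", mixture)]:
    print(name, round(mean_count(normalize(c)), 2))
```

Under neutrality the proportions fall off as 1/i, positive selection shifts weight toward high-frequency variants, and negative selection shifts it toward rare variants.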
Fig. 2. Selective constraint and positive selection on noncoding DNA inferred using polymorphism and divergence. Shown
is the inferred distribution of fitness effects of newly arising mutations (in %, by category of N*E(s)) and the fraction of divergence in excess of
expectations (α) for a sample of intronic sites in D. melanogaster (from Table 6 of 77). The method uses the DPF for
synonymous sites to estimate parameters of a population size change model. The method then uses this demographic
model, with the DPF and divergence at synonymous and intronic sites, to estimate selection on the latter class of sites. The
implication is that 30% of newly arising mutations in these introns are subject to deterministic negative selection and that
20% of the nucleotide divergence observed between species is in excess of expectations under the neutral model. The
error bars indicate standard errors on the estimates.
2. Exercises
Download the coding and noncoding polymorphism data
of Andolfatto (22) from http://genomics.princeton.edu/AndolfattoLab/link_nature2005.html. The first sequence in each file is the
sequence for D. simulans (an appropriate outgroup). The next 12
sequences are from a Zimbabwean population of D. melanogaster.
You will need a script to extract polymorphism and divergence
statistics from these data.
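A starting point for such a script is sketched below. It assumes the files are aligned sequences in FASTA format (the actual file format should be checked before use), treats the first sequence as the outgroup, skips any column with a gap or ambiguous base, and ignores ingroup sites with more than two alleles or where the outgroup matches neither allele. The function names and the toy alignment are hypothetical.

```python
from collections import Counter

def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def polymorphism_and_divergence(seqs):
    """Return (derived-allele count spectrum, number of fixed differences).

    seqs[0] is the outgroup; the remaining sequences are the ingroup sample.
    """
    outgroup, ingroup = seqs[0], seqs[1:]
    spectrum, fixed = Counter(), 0
    for pos, out_base in enumerate(outgroup):
        column = [s[pos] for s in ingroup]
        if out_base not in "ACGT" or any(b not in "ACGT" for b in column):
            continue  # skip gaps and ambiguous bases
        alleles = Counter(column)
        if len(alleles) == 1:
            if out_base not in alleles:
                fixed += 1  # ingroup monomorphic but differs from outgroup
        elif len(alleles) == 2 and out_base in alleles:
            derived = next(b for b in alleles if b != out_base)
            spectrum[alleles[derived]] += 1
    return spectrum, fixed

toy = """>outgroup
ACGTA
>s1
ACGAA
>s2
ATGAA
>s3
ATGAC
>s4
ACGAA
"""
sequences = [seq for _, seq in read_fasta(toy)]
spectrum, fixed = polymorphism_and_divergence(sequences)
print(dict(spectrum), fixed)
```

For the exercise, the same function would be applied separately to noncoding sites and to fourfold synonymous sites, so that the two frequency spectra can be compared.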
1. Compare the distribution of polymorphism frequencies for noncoding sites and fourfold synonymous sites of the D. melanogaster
sequences. Since both demography and selection can influence
polymorphism frequencies, how can you distinguish between
these processes based on this comparison? Katzman et al. (80)
Acknowledgments
Thanks to Stephen Wright, Molly Przeworski, Kevin Bullaughey,
and anonymous reviewers for helpful discussion and comments on
the manuscript. This work was supported in part by NIH grant
R01-GM083228.
References
1. Lewin, B. (2007) Genes IX, Oxford University Press. p 892.
2. Stern, D. L. (2010) Evolution, development
and the predictable genome. Roberts and Co.
Publishing. p 264.
3. Wray, G., Hahn, M., Abouheif, E., Balhoff, J.,
Pizer, M., Rockman, M., and Romano, L.
(2003) The evolution of transcriptional
regulation in eukaryotes, Mol Biol Evol 20,
1377–1419.
4. Davidson, E. H. (2001) Genomic regulatory
systems: development and evolution, Academic
Press, San Diego.
5. Carroll, S. B. (2000) Endless forms: the
evolution of gene regulation and morphological diversity, Cell 101, 577–580.
6. Sakabe, N. J., and Nobrega, M. A. (2010)
Genome-wide maps of transcription regu-
87. Sawyer, S. A., and Hartl, D. L. (1992) Population genetics of polymorphism and divergence, Genetics 132, 1161–1176.
88. Ohta, T. (1993) Amino acid substitution at
the Adh locus of Drosophila is facilitated by
small population size, Proc Natl Acad Sci
U S A 90, 4548–4551.
89. Sawyer, S. A., Parsch, J., Zhang, Z., and
Hartl, D. L. (2007) Prevalence of positive
selection among nearly neutral amino acid
replacements in Drosophila, Proc Natl Acad
Sci U S A 104, 6504–6510.
90. Bustamante, C. D., Wakeley, J., Sawyer, S.,
and Hartl, D. L. (2001) Directional selection
and the site-frequency spectrum, Genetics
159, 1779–1788.
91. Fay, J. C., Wyckoff, G. J., and Wu, C. I.
(2001) Positive and negative selection on the
human genome, Genetics 158, 1227–1234.
92. Eyre-Walker, A., Keightley, P. D., Smith, N.
G., and Gaffney, D. (2002) Quantifying the
slightly deleterious mutation model of
molecular evolution, Mol Biol Evol 19,
2142–2149.
93. Bierne, N., and Eyre-Walker, A. (2004) The
genomic rate of adaptive amino acid substitution in Drosophila, Mol Biol Evol 21,
1350–1360.
94. Welch, J. J. (2006) Estimating the genome-wide rate of adaptive protein evolution in
Drosophila, Genetics 173, 821–837.
95. Jenkins, D. L., Ortori, C. A., and Brookfield,
J. F. (1995) A test for adaptive change in
DNA sequences controlling transcription,
Proc Biol Sci 261, 203–207.
96. Ludwig, M. Z., and Kreitman, M. (1995)
Evolutionary dynamics of the enhancer region
of even-skipped in Drosophila, Mol Biol Evol
12, 1002–1011.
97. Holloway, A., Lawniczak, M., Mezey, J.,
Begun, D., and Jones, C. (2007) Adaptive
gene expression divergence inferred from
population genomics, PLoS Genetics 3,
2007–2013.
98. Kohn, M., Fang, S., and Wu, C. (2004)
Inference of positive and negative selection
on the 5' regulatory regions of Drosophila
genes, Mol Biol Evol 21, 374–383.
99. Torgerson, D., Boyko, A., Hernandez, R.,
Indap, A., Hu, X., White, T., Sninsky, J., Cargill, M., Adams, M., Bustamante, C., and
Clark, A. (2009) Evolutionary processes acting on candidate cis-regulatory regions in
humans inferred from patterns of polymorphism and divergence, PLoS Genetics 5,
e1000592.
Chapter 7
The Origin and Evolution of New Genes
Margarida Cardoso-Moreira and Manyuan Long
Abstract
New genes are a major source of genetic innovation in genomes. However, until recently, understanding
how new genes originate and how they evolve was hampered by the lack of appropriate genetic datasets.
The advent of the genomic era brought about a revolution in the amount of data available to study new
genes. For the first time, decades-old theoretical principles could be tested empirically and novel and
unexpected avenues of research opened up. This chapter explores how genomic data can be and is being used
to study both the origin and evolution of new genes and the surprising discoveries made thus far.
Key words: New genes, Gene duplication, Retrogenes, Gene rearrangements, De novo genes, Genetic
novelty, Copy number variation
1. Introduction
In the 1940s, geneticists were immersed in a debate over the nature
of genetic innovation and organismal complexity (reviewed in
ref. 1). The debate centered on determining which class of
mutations is responsible for the predominant changes observed
between the primordial amoeba and men. Are men and amoeba
separated only by mutations in preexisting genes or have increases
in gene number been a fundamental component of the history of
these two lineages? Fifty years onward, we find ourselves in the
genomic era, and in possession of the genomes of not only a great
number of species, but also of different individuals within the same
species. And a comparison of the (several) amoeba and human
genomes leaves no doubt as to the origination of new genes being
one of the most important sources of evolutionary change.
Most theoretical treatments of the population genetics and
molecular evolution of new genes focused on the particular class
of gene duplication and preceded the genomic revolution by several
decades (e.g., see refs. 2–4). When sequencing technology became
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_7,
© Springer Science+Business Media, LLC 2012
readily available in the 1980s, researchers were finally able to empirically study new genes. Initially, only a limited number of new genes
were studied in detail, and these were discovered mainly serendipitously (5, 6). In spite of the small sample size, the first examples of
new genes began to bring into question long-held views on the
mutational processes that generate new genes and on the evolutionary forces that act upon their formation (5, 7). With the onset
of the genomic era and the many technologies that it fostered (e.g.,
in situ hybridization, microarray technology), whole-genome
surveys of new genes became feasible. These data allowed researchers to start addressing decades-old questions regarding the early
stages of the evolution of new genes. Genome-wide surveys of new
genes confirmed several of the previous theoretical predictions and
provided a wealth of novel and unexpected observations.
This chapter discusses both the origin and early evolution of
new eukaryotic genes, predominantly focusing on the research of
the last 10 years that addresses both topics using genome-wide
approaches. This chapter is divided into two main sections. The
first section explores the different pathways that generate new genes
and how the different classes of new genes can be identified from
genomic data. The second section focuses on the evolutionary
trajectories of new genes. The techniques employed in different
studies are described, and the results that are relevant to understanding the evolutionary forces driving the fixation and preservation of new genes in genomes are examined.
2. Origin of New Genes
2.1. Mechanisms of New Gene Origination
Fig. 1. Schematic depiction of (a) complete, (b) partial, and (c) dispersed gene duplication
events as seen in a phylogenetic context. Please note that for complete and partial
tandem duplications (a and b) it may be impossible to distinguish the ancestral from the
derived copies. In the case of dispersed duplications (c), the parent–offspring relationship
can be inferred by combining phylogenetic and syntenic information.
167
Inversion
Deletion
Chr A
Chr B
Translocation
Fig. 2. Schematic depiction of how different classes of genomic rearrangements (deletions, inversions, and translocations)
can create fusion genes by juxtaposing sequences from two previously independent genes. All these rearrangements can
be preceded by a duplication event, which would allow the creation of a new gene without disrupting the parental genes.
The dashed lines represent the area that is mutated (deleted, inverted, or translocated to another genomic location).
All examples would create a novel chimeric gene structure.
2.1.2. Genomic Rearrangements
2.1.3. Retroposition
Fig. 3. Schematic representation of how retrogenes are created (a) and how they can be identified using a phylogenetic
approach (b). In (a), a retrogene is created after the messenger RNA from the parental gene, intronless and containing a
poly-A tail, is reinserted back into the genome. A new regulatory element is then recruited by the new retrogene.
A retroposition event can be clearly identified and dated using phylogenetic information (b).
Lateral (or horizontal) gene transfer occurs when a gene is transferred between different organisms (as opposed to being vertically
transmitted through the germ line). The laterally transferred gene
and its ortholog in the parental lineage are often called xenologs
(40). Lateral gene transfer has been shown to be rampant among
certain prokaryotic taxa, where it is associated with gains of new
genes with many distinct novel functions that contribute dramatically to the evolution of those taxa (41, 42). Lateral gene transfer
events can be recognized from genome sequence data in several
ways. A lateral gene transfer event generates anomalous or incongruent phylogenetic trees, whereby a given gene may share the highest
sequence similarity with a gene in a distantly related species (Fig. 4).
Without resorting to phylogenetic trees, genes that have been laterally transferred can be identified in genomes when there are contigs
(or sequence reads) that contain sequences readily identified as
belonging to different genomes (for example, the presence of
Fig. 5. A gene can be created de novo when mutations generate a new open reading frame and new regulatory sequences
(a). Although a de novo gene will only be present in the lineage where it was created, orthologous noncoding sequences
will be present in closely related taxa (b).
2.3. Evidence of Functionality in New Genes
2.4. Lessons from Genome-Wide Surveys of New Genes
3. The Evolutionary Trajectories of New Genes
Just like any other mutation, new genes can be neutral, deleterious,
or advantageous. Except in populations with an extremely small
population size, if a new gene is deleterious it will be kept at low
frequency in the population, never reaching fixation (i.e., never
becoming present in all individuals of the species). Examples of
deleterious new genes are duplications of dosage-sensitive genes,
3.1.2. Neofunctionalization
The concept that a pair of duplicate genes can share the same
function of the ancestral gene is old (1). More recently, this concept
has been formalized into distinct models. One of them is called the
duplication, degeneration, complementation model (DCC; more commonly abbreviated DDC in the literature) (93).
It posits that after a gene duplication event that generates two fully
redundant copies, selection is relaxed for both copies and mutations
are allowed to accumulate. A mutation that would be deleterious
when there was only one copy of the gene is now rendered neutral
due to the presence of the other copy. This allows both copies to
accumulate degenerative and complementary mutations, which
result in the two genes being necessary to fulfill the functions of
the original gene. Importantly, this model of subfunctionalization
requires only neutral substitutions (as opposed to beneficial mutations) and applies to the partitioning of functions coded both in
protein and regulatory sequences. An alternative subfunctionalization model is called the escape from adaptive conflict (EAC) (9, 94,
95). This model assumes that the original gene is capable of two or
more distinct functions that cannot be simultaneously optimized by
selection due to pleiotropic effects. Gene duplication would allow
each of the copies to perform one of the functions that could now be
optimized by positive selection. The DCC and EAC models differ in
that in the DCC the mutations that cause the subfunctionalization
are explicitly neutral and in the EAC they are adaptive.
The different models proposed for the fates of new genes make
different predictions regarding the early stages of the evolution of
new genes. The neofunctionalization model proposed by Ohno
predicts that in a duplicate gene pair one member experiences a
period of relaxed constraint, followed by a period of positive selection (after the occurrence of the mutation that confers a new
function), while the other member continuously experiences purifying selection (4). According to this model, there should be an
asymmetric rate of evolution between the two duplicates. This same
asymmetry should also be detected for those new genes whose
origination immediately confers a new advantageous function.
In this case, there should not be any period of relaxed constraint.
Instead, the new genes are expected to be driven to fixation by
positive selection, which is expected to continue to act for some
period of time. Meanwhile, the parental gene is expected to evolve
under purifying selection. New genes that are identical to their parental genes could be immediately favored by positive selection due to
changes in gene dosage, as numerous examples have demonstrated
(e.g., see refs. 99, 101). When this occurs, the new gene is fixed by
positive selection, but in this case both parental and offspring genes
are expected to be under purifying selection and exhibit a symmetrical rate of evolution.
The subfunctionalization models do not make clear predictions
regarding whether gene duplicates are expected to diverge symmetrically or asymmetrically because the functions of the ancestral gene
could potentially be divided equally or unequally between the two
duplicates. However, at least in its earlier stages, the DCC model
would predict both genes to experience relaxed constraint and
during this stage their evolution should be symmetrical. The
DCC and EAC models can be distinguished from each other
because the latter predicts both parental and offspring genes to
experience a period of positive selection.
As mentioned above, subfunctionalization and neofunctionalization are not mutually exclusive. New genes may experience an
initial stage of subfunctionalization (DCC model) followed by a
period of neofunctionalization. This would be translated into an
initial period of evolution under relaxed constraints for both genes
followed by a symmetrical or asymmetrical period of evolution
under positive selection depending on whether the latter acts on
one or both duplicates. Another alternative scenario is the fixation
of a duplicate by positive selection for dosage alteration that then
subsequently evolves a novel function. This scenario would create
an initial period of positive selection driving the duplication to
fixation, followed by a period of symmetrical evolution, where
both members are under purifying selection, and finally another
period of positive selection created by the mutation that confers the
novel function. The fact that different scenarios can be hypothesized and that the different models do not make explicit enough
gene pairs where parent and offspring genes are identical. Han and
colleagues (110) found a similar result when studying lineage-specific duplicates in the human, macaque, mouse, and rat genomes. By focusing on very young duplicates, they also aimed at
detecting signs of positive selection before it was masked by the
purifying selection that follows. Approximately 10% of all lineage-specific genes showed signs of positive selection acting on their
protein sequences. Furthermore, they showed that for gene duplicates, where parental and offspring genes are located in different
genomic locations, 80% of the time that there was evidence for
positive selection it came from the offspring copy. This was true
when the offspring was a retrogene or was created by the classical
model of gene duplication (110).
When divergence data is combined with polymorphism data,
further insight can be gained into the evolutionary forces acting on
new genes. More precisely, combining both types of data allows
distinguishing between the two scenarios that can cause accelerated
rates of protein evolution: relaxation of selective constraints and
positive selection. Cai and Petrov (111) combined human polymorphism data with human-chimp divergence data and found
strong evidence that the elevated rates of protein evolution found
for younger genes are mostly due to relaxed selective constraints
and found weaker evidence that younger genes experience adaptive
evolution more frequently than older genes.
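One widely used way to combine the two data types is the McDonald-Kreitman framework, which contrasts nonsynonymous and synonymous counts within a species (polymorphism) with the same counts between species (divergence). The sketch below is an illustration only: the text above does not specify which statistic Cai and Petrov used, and the function name `mk_summary` and the example counts are hypothetical.

```python
def mk_summary(dn, ds, pn, ps):
    """Summarize McDonald-Kreitman counts for one gene (or a pooled gene set).

    dn, ds: nonsynonymous / synonymous fixed differences between species
    pn, ps: nonsynonymous / synonymous polymorphisms within a species
    Returns (neutrality_index, alpha), where alpha = 1 - neutrality_index.
    """
    if min(dn, ds, ps) <= 0:
        raise ValueError("dn, ds and ps must be positive")
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, 1.0 - neutrality_index


# Hypothetical counts, chosen only to show the arithmetic.
ni, alpha = mk_summary(dn=20, ds=40, pn=5, ps=20)
print(ni, alpha)
```

Under relaxed constraint alone, nonsynonymous changes are elevated in both polymorphism and divergence, so the neutrality index stays near 1. A neutrality index below 1 (alpha above 0) indicates an excess of nonsynonymous divergence, which is the pattern expected under positive selection.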
4. Future Perspectives
It is unquestionable that the wealth of genomic data collected in the
past 10 years dramatically changed our understanding of how new
genes are created. But more than answering long-standing questions, the genomics revolution brought about a brand new set of
questions. Only recently have we learned that new genes could be
created de novo (50-56), and we still lack the proper tools to
study how selection acts on this group of genes. Also, now that we
know that non-protein-coding genes are an important component of genomes, we have to devise more sensitive techniques to detect them and study their evolution. And perhaps the greatest challenge of all: we have to go beyond simply
describing the sequence and evolution of new genes and determine
the novel functions these genes encode. Although genomic data
help us determine whether a gene is functional, determining its
actual function requires a multidisciplinary effort that combines
genomics and proteomics with a multitude of functional assays.
As more genomes are sequenced, phylogenies will become
more and more complete and our capability of detecting new
genes, dating them, and understanding how they are formed will
5. Questions
1. Count the number of genes in the human and chimpanzee
genomes. Does the difference suggest the gain or the loss of
some genes in one lineage? How can you distinguish between
the two possibilities?
2. Imagine the genome sequences of 12 bee species (the phylogeny is known) have just been released. The 12 genomes have
been annotated using both experimental and computational
approaches. What would be the steps needed to find all lineage-specific genes, i.e., genes present in only one of the species?
What genomic hallmarks would you use to distinguish the
different classes of new genes?
Acknowledgments
We thank J. Roman Arguello, Maria Vibranovski, three anonymous
reviewers, and our editor, Maria Anisimova for comments and
critical reading of the manuscript.
References
1. Taylor JS, Raes J (2004) Duplication and
divergence: the evolution of new genes and
old ideas. Annu Rev Genet 38:615-643
2. Haldane JBS (1932) The causes of evolution.
Princeton Science Library
3. Bridges CB (1936) The Bar "gene" a duplication. Science 83:210-211
4. Ohno S (1970) Evolution by gene duplication.
Springer-Verlag
Chapter 8
Evolution of Protein Domain Architectures
Kristoffer Forslund and Erik L.L. Sonnhammer
Abstract
This chapter reviews the current research on how protein domain architectures evolve. We begin by
summarizing work on the phylogenetic distribution of proteins, as this directly impacts which domain
architectures can be formed in different species. Studies relating domain family size to occurrence have
shown that they generally follow power law distributions, both within genomes and larger evolutionary
groups. These findings were subsequently extended to multidomain architectures. Genome evolution
models that have been suggested to explain the shape of these distributions are reviewed, as well as evidence
for selective pressure to expand certain domain families more than others. Each domain has an intrinsic
combinatorial propensity, and the effects of this have been studied using measures of domain versatility or
promiscuity. Next, we study the principles of protein domain architecture evolution and how these have
been inferred from distributions of extant domain arrangements. Following this, we review inferences of
ancestral domain architecture and the conclusions concerning domain architecture evolution mechanisms
that can be drawn from these. Finally, we examine whether all known cases of a given domain architecture
can be assumed to have a single common origin (monophyly) or have evolved convergently (polyphyly).
Key words: Protein domain, Protein domain architecture, Superfamily, Monophyly, Polyphyly,
Convergent evolution, Domain evolution, Kingdoms of life, Domain co-occurrence network, Node
degree distribution, Power law, Parsimony
1. Introduction
1.1. Overview
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_8,
© Springer Science+Business Media, LLC 2012
1.3. Domain Databases
1.5. Mechanisms for Domain Architecture Change
Fig. 1. Examples of mutations that can change domain architectures. Adapted from Buljan and Bateman (BioMed Central,
2010). (a) Gene fusion by a mobile element. LINE refers to a Long Interspersed Nuclear repeat Element, a retrotransposon.
The reverse transcriptase encoded within the LINE causes its mRNA to be reverse transcribed into DNA and integrated into
the genome, making the domain-encoding blue exon from the donor gene integrate along with it in the acceptor gene.
(b) Gene fusion by loss of a stop signal or deletion of much of the intergenic region. Genes 1 and 2 are joined together into a
single, longer gene. (c) Domain insertion through recombination. The blue domain from the donor gene is inserted within
the acceptor gene by either homologous or illegitimate recombination. (d) Right: Gene fission by introduction of
transcription stop (the letter O) and start (the letter A). Left: Domain loss by introduction of a stop codon (exclamation
mark) with subsequent degeneration of the now untranslated domain.
2. Distribution of the Sizes of Domain Families
The power law, but not the GPD, is scale free in the sense of
fulfilling the condition
$f(ax) = g(a)\,f(x),$
where f(x) and g(x) are some functions of a variable x, and a is a
scaling parameter; that is, studying the data at a different scale does
not change the shape of the function. This property has been extensively studied in the literature and is connected to other attributes,
notably when it occurs in network degree distributions (i.e., frequency distributions of edges per node). Here, it has been associated with properties, such as the presence of a few central and
critical hubs (nodes with many edges to other nodes), the similarity
between parts and the whole (as in a fractal), and the growth
process called preferential attachment, under which nodes are
more likely to gain new links the more links they already have.
However, the same power law distribution may be generated
from many different network topologies with different patterns of
connectivity. In particular, they may differ in the extent that hubs
are connected to each other (36). It is possible to extend the
analysis by taking into account the distribution of degree pairs
along network edges, but this is normally not done.
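Returning to the scale-free condition stated at the start of this passage, a short check shows why a power law satisfies it. The form $f(x) = c\,x^{-\alpha}$ is assumed here, with $c$ and $\alpha$ as generic constants that are not tied to any fitted value in this chapter; rescaling $x$ by $a$ then only multiplies $f$ by a constant factor:

```latex
\begin{aligned}
f(x)  &= c\,x^{-\alpha},\\
f(ax) &= c\,(ax)^{-\alpha} = a^{-\alpha}\,c\,x^{-\alpha} = a^{-\alpha} f(x),
\end{aligned}
\qquad\text{so that}\qquad g(a) = a^{-\alpha}.
```

Because $g(a)$ does not depend on $x$, the curve keeps the same shape at every scale; on a log-log plot it remains a straight line with slope $-\alpha$.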
What kind of evolutionary mechanisms give rise to this kind of
distribution of gene or domain family sizes within genomes? In one
model by Huynen and van Nimwegen (26), every gene within a
gene family is more or less likely to duplicate, depending on the
utility of the function of that gene family within the particular
lineage of organisms studied, and they showed that such a model
matches the observed power laws. While they claimed that any
model that explains the data must take into account family-specific
probabilities of duplication fixation, Yanai and coworkers (39) proposed a simpler model using uniform duplication probability for all
genes in the genome, and also reported a good fit with the data.
Later, more complex birth-death (37) and birth-death-and-innovation models (BDIM) (27, 32) were introduced to explain the
observed distributions, and from investigating which model parameter
ranges allow this fit the authors were able to draw several far-ranging
conclusions. First, the asymptotic power law behavior requires that
the rates of domain gain and loss are asymptotically equal. Karev et al.
(32) interpreted this as support for a punctuated equilibrium-type
model of genome evolution, where domain family size distributions
remain relatively stable for long periods of time but may go through
stages of rapid evolution, representing a shift between different
BDIM evolutionary models and significant changes in genome complexity. Like Huynen and van Nimwegen (26), they concluded that the
likelihood of fixated domain duplications or losses in a genome directly
depends on family size. The family, however, only grows as long as
new copies can find new functional niches and contribute to a net
benefit for survival, i.e., as long as selection favors it.
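The link between per-gene duplication and size-dependent family growth can be illustrated with a toy simulation. This is a minimal sketch and not the model of Huynen and van Nimwegen (26), Yanai and coworkers (39), or Karev et al. (32): it ignores gene loss and selection, and the names and values `p_innovation` and `n_initial` are assumptions chosen only for illustration.

```python
import random
from collections import Counter


def simulate_family_sizes(n_steps, p_innovation=0.1, n_initial=10, seed=0):
    """Grow a toy genome one gene at a time and return the family sizes.

    At each step, with probability p_innovation a gene from a brand-new
    family appears; otherwise one existing gene, chosen uniformly at random,
    is duplicated and the copy stays in the same family.
    """
    rng = random.Random(seed)
    genes = list(range(n_initial))  # genes[i] is the family id of gene i
    next_family = n_initial
    for _ in range(n_steps):
        if rng.random() < p_innovation:
            genes.append(next_family)        # innovation: new singleton family
            next_family += 1
        else:
            genes.append(rng.choice(genes))  # uniform per-gene duplication
    return sorted(Counter(genes).values(), reverse=True)


def size_histogram(sizes):
    """Map family size -> number of families of that size."""
    return dict(sorted(Counter(sizes).items()))


sizes = simulate_family_sizes(n_steps=5000)
print(size_histogram(sizes))
```

Because every gene is equally likely to be copied, a family's chance of growing at any step is proportional to its current size. In this toy setting that is enough to give many small families and a few very large ones, although a real analysis would also need loss events and a proper goodness-of-fit test.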
Fig. 2. (a) Distribution of domain family sizes in three selected species. Power law distributions were fitted to these curves
such that, for frequency f of families of size X, $f = cX^{-a}$. For Saccharomyces cerevisiae, a = 1.8; for Escherichia coli,
a = 1.7; and for Homo sapiens, a = 1.5. (b) Distribution of domain family sizes across the three kingdoms. Power law
distributions were fitted to these curves such that, for frequency f of families of size X, $f = cX^{-a}$. For bacteria, a = 2.4;
for archaea, a = 2.4; and for eukaryotes, a = 1.8.
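A fit of the kind described in this caption can be sketched as an ordinary least-squares regression on log-transformed values, assuming the fitted form is f = cX^(-a). The caption does not state which fitting procedure was used, so the log-log regression below is an assumption, and the function name `fit_power_law` and the synthetic data are hypothetical.

```python
import math


def fit_power_law(freq_by_size):
    """Fit f = c * X**(-a) by least squares on (log X, log f).

    freq_by_size maps a family size X to the frequency f of families of
    that size. Returns (c, a).
    """
    points = [(math.log(x), math.log(f))
              for x, f in freq_by_size.items() if x > 0 and f > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive data points")
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free data generated from a known power law.
synthetic = {x: 1000.0 * x ** -1.8 for x in range(1, 51)}
c, a = fit_power_law(synthetic)
print(round(c, 3), round(a, 3))
```

Log-log regression is simple but gives equal weight to the sparse, noisy tail of large families, so exponents obtained this way are best treated as rough descriptive summaries.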
3. Kingdom and Age Distribution of Domain Families and Architectures
Fig. 3. (a) Distribution of multidomain (architecture) family sizes in three selected species. Power law distributions were
fitted to these curves such that, for frequency f of families of size X, $f = cX^{-a}$. For Saccharomyces cerevisiae, a = 2.0;
for Escherichia coli, a = 1.8; and for Homo sapiens, a = 1.7. (b) Distribution of multidomain (architecture) family
sizes across the three kingdoms. Power law distributions were fitted to these curves such that, for frequency f of families of
size X, $f = cX^{-a}$. For bacteria, a = 2.5; for archaea, a = 3.4; and for eukaryotes, a = 2.2.
Fig. 4. (a) Kingdom distribution of unique domains. Values are given as percentages of the total 7,270 domains.
(b) Kingdom distribution of unique domain pairs. Values are given as percentages of the total 6,270 domain pairs.
(c) Kingdom distribution of unique domain triplets. Values are given as percentages of the total 20,396 domain triplets.
(d) Kingdom distribution of unique multidomain architectures. Values are given as percentages of the total 7,862
multidomain architectures.
4. Domain Co-occurrence Networks
Fig. 5. Example of protein domain co-occurrence network, adapted from Kummerfeld and
Teichmann (BioMed Central, 2009). (a) Sample set of domain architectures. The lines
represent proteins, and the boxes their domains in N- to C-terminal order. (b) Resulting
domain co-occurrence (neighbor) network. Nodes correspond to domains, and are linked
by an edge if at least one protein exists in which the two domains are found adjacent to
each other along the amino acid chain.
Table 1
The 20 most densely connected hubs with regard to immediate domain neighbors, according to Pfam 24.0

Identifier  Name                                           Number of different immediate neighbors
CL0123      Helix-turn-helix clan                          202
CL0023                                                     166
CL0063      FAD/NAD(P)-binding Rossmann fold Superfamily   155
CL0159                                                     71
CL0036                                                     71
CL0016                                                     62
CL0172      Thioredoxin like                               52
CL0202      Galactose-binding domain-like superfamily      50
CL0058                                                     50
CL0125      Peptidase clan CA                              46
CL0028                                                     45
CL0304      CheY-like superfamily                          44
CL0137      HAD superfamily                                42
PF00571     CBS domain                                     41
CL0219                                                     41
CL0010                                                     41
CL0300                                                     40
CL0261      NUDIX superfamily                              40
CL0025                                                     39
CL0183                                                     38
Fig. 6. (a) Distribution of domain co-occurrence network node degrees in three selected species. Power law distributions
were fitted to these curves such that, for frequency f of nodes of degree X, $f = cX^{-a}$. For Saccharomyces cerevisiae,
a = 2.7; for Escherichia coli, a = 2.1; and for Homo sapiens, a = 2.3. (b) Distribution of domain co-occurrence
network node degrees across the three kingdoms. This corresponds to a network in which two domains are connected if any
species within the kingdom has a protein in which these domains are immediately adjacent. Power law distributions were
fitted to these curves such that, for frequency f of nodes of degree X, $f = cX^{-a}$. For bacteria, a = 1.8; for archaea,
a = 2.1; and for eukaryotes, a = 2.1.
edge between them if there is a protein in which they are adjacent. Each
domain was assigned a degree as its number of links to other domains.
We then counted the frequency with which each degree occurs in the
co-occurrence network. Figure 6a shows this relationship for the set
of domain architectures found in the same species as for Fig. 2a, and
Fig. 6b shows the equivalent plots for the three kingdoms as found
among the complete proteomes in Pfam. Regressions to a power law
have been added to the plots. The presence of a power law-like
behavior of this type implies that few domains have very many immediate neighbors while most domains have few immediate neighbors.
Note that the observed degrees in our dataset were strongly reduced
by removing all sequences with a stretch longer than 50 amino acids
lacking domain annotation.
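The degree computation described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: each architecture is given as a tuple of domain identifiers in N- to C-terminal order, the identifiers "A" to "D" are hypothetical, and a domain adjacent to another copy of itself is not counted as its own neighbor (the text does not say how that case was handled).

```python
from collections import Counter


def cooccurrence_degrees(architectures):
    """Return {domain: number of distinct immediate neighbors}.

    Two domains are neighbors if they are adjacent in at least one
    architecture. Domains that only occur alone get degree 0.
    """
    neighbors = {}
    for arch in architectures:
        for domain in arch:
            neighbors.setdefault(domain, set())
        for left, right in zip(arch, arch[1:]):
            if left != right:  # assumption: ignore self-adjacency
                neighbors[left].add(right)
                neighbors[right].add(left)
    return {domain: len(adjacent) for domain, adjacent in neighbors.items()}


def degree_frequencies(degrees):
    """Map node degree -> number of domains with that degree."""
    return dict(sorted(Counter(degrees.values()).items()))


architectures = [("A", "B", "C"), ("B", "C"), ("A", "C"), ("D",)]
degrees = cooccurrence_degrees(architectures)
print(degrees)
print(degree_frequencies(degrees))
```

The dictionary returned by `degree_frequencies` holds the (degree, frequency) pairs that would be plotted on log-log axes before fitting a power law.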
5. Supradomains and Conserved Domain Order
6. Domain Mobility, Promiscuity, or Versatility
Another potential reason for the different results is that Basu's list
was based on eukaryotes only while Weiner's analysis was heavily
biased toward prokaryotes. Furthermore, the top ten lists in
Basu et al. (48) and their follow-up paper (49) only overlap by
four domains; yet the main difference is that in the latter study all
28 eukaryotes were considered while the former study was limited
to the subset of 20 animal, plant, and fungal species. The choice of
species, thus, seems pivotal for the results when using this method.
They also used different methods for calculating the average value
of relative versatility across many species, which may influence
the results.
Does domain versatility vary between different functional
classes of domains? Vogel et al. (56) found no difference in
relative versatility between broad functional or process categories
or between SCOP structural classes. In contrast to this,
Basu et al. (48) reported that high versatility was associated with
certain functional categories in eukaryotes. However, no test for
the statistical significance of these results was performed. Weiner
et al. (13) also noted some general trends, but found no significant enrichment of Gene Ontology terms in versatile domains.
This does not necessarily mean that no such correlation exists, but
more research is required to convincingly demonstrate its strength
and its nature.
Another important question is to what extent domain versatility varies across evolutionary lineages. Vogel et al. (56) reported
no large differences in average versatility for domains in different
kingdoms. The versatility measure of Basu et al. (48) can be
applied within individual genomes, which means that according
to this measure domains may be versatile in one organism group
but not in another, as well as gain or lose versatility across evolutionary time. They found that more domains were highly versatile
in animals than in other eukaryotes. Modeling versatility as a
binary property defined for domains in extant species, they further
used a maximum parsimony approach to study the persistence of
versatility for each domain across evolutionary time, and concluded that both gain and loss of versatility are common during
evolution. Weiner et al. (13) divided domains into age categories
based on distribution across the tree of life, and reported that the
versatility index is not dependent on age, i.e., domains have equal
chances of becoming versatile at different times in evolution. This
is consistent with the observation by Basu et al. (48) that versatility is a fast-evolving and varying property. When measuring versatility as a regression within different organism groups, Weiner
et al. (13) found slightly lower versatility in eukaryotes, which is
in conflict with the findings of Basu et al. (48). Again, this underscores the strong dependence of the results on
the method and dataset.
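The parsimony reasoning used for versatility as a binary trait can be illustrated with a small example. The sketch below uses Fitch's algorithm on a rooted binary tree; the text does not say which parsimony variant Basu et al. applied, so this choice is an assumption, and the tree, the species labels, and the trait values are hypothetical.

```python
def fitch(tree, leaf_states):
    """Fitch parsimony for one character on a rooted binary tree.

    tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_states maps leaf names to states (here 1 = versatile, 0 = not).
    Returns (candidate_states_at_this_node, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, leaf_states)
    right_set, right_changes = fitch(right, leaf_states)
    shared = left_set & right_set
    if shared:
        return shared, left_changes + right_changes
    # No shared state: one change is required on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical tree and trait values for a single domain.
tree = ((("human", "mouse"), "fly"), ("yeast", "plant"))
versatile = {"human": 1, "mouse": 1, "fly": 0, "yeast": 0, "plant": 0}
root_states, n_changes = fitch(tree, versatile)
print(root_states, n_changes)
```

In this example, a single change from non-versatile to versatile on the branch leading to the human and mouse leaves explains the data. Repeating such a count over many domains is one way to ask how often versatility is gained or lost.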
7. Principles of Domain Architecture Evolution
8. Inferring Ancestral Domain Architectures
9. Polyphyletic Domain Architecture Evolution
all leaf nodes sharing some domain arrangement (up to and including
an entire architecture) stem from a single ancestral node possessing
this combination of domains. For monophyly to be true for all
architectures containing the reference domain, the same companion
domain cannot have been inserted in more than one place along the
tree describing the evolution of the reference domain. By application
of graph theory and Dollo parsimony (62), they showed that monophyly is only possible if the domain co-occurrence network defined by
all proteins containing the reference domain is chordal, i.e., it
contains no chordless cycles longer than three edges.
Przytycka et al. (46) then evaluated this criterion for all superfamily domains in a large-scale dataset. For all domains where the
co-occurrence network contained fewer than 20 nodes (domains),
the chordal property held, and hence any domain combinations or
domain architectures containing these domains could potentially
be monophyletic. By comparing actual domain co-occurrence networks with a preferential attachment null model, they showed that
far more architectures are potentially monophyletic than would be
expected under a pure preferential attachment process. This finding
is analogous to the observation by Apic et al. (30) that most domain
combinations are duplicated more frequently (or reshuffled less)
than expected by chance. In other words, gene duplication is much
more frequent than domain recombination (56). However, for
many domains that co-occurred with more than 20 other different
domains, particularly for domains previously reported as promiscuous, the chordal property was violated, meaning that multiple
independent insertions of the same domain, relative to the reference domain phylogeny, must be assumed.
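The chordality criterion itself is straightforward to test on a small network. The sketch below checks only the graph property, not the full procedure of Przytycka et al. (46). It relies on the standard fact that a graph is chordal exactly when its vertices can be removed one at a time such that each removed vertex is simplicial (its remaining neighbors are all connected to each other). The function name and the example networks are hypothetical.

```python
from itertools import combinations


def is_chordal(adjacency):
    """Return True if an undirected graph is chordal.

    adjacency maps each node to the set of its neighbors. It is assumed to
    be symmetric and free of self-loops.
    """
    graph = {node: set(nbrs) for node, nbrs in adjacency.items()}  # work on a copy
    while graph:
        simplicial = next(
            (node for node, nbrs in graph.items()
             if all(b in graph[a] for a, b in combinations(nbrs, 2))),
            None,
        )
        if simplicial is None:
            return False  # no simplicial vertex left: a chordless cycle exists
        for nbr in graph[simplicial]:
            graph[nbr].discard(simplicial)
        del graph[simplicial]
    return True


# Four domains forming a cycle of length four without a chord.
square = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "A"}}
# The same cycle with the chord A-C added.
square_with_chord = {"A": {"B", "C", "D"}, "B": {"A", "C"},
                     "C": {"A", "B", "D"}, "D": {"A", "C"}}
print(is_chordal(square), is_chordal(square_with_chord))
```

This version is cubic in the worst case, which is adequate for networks of a few dozen domains. Linear-time methods exist for larger graphs.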
A more direct approach is to do complete ancestral domain
architecture reconstruction of protein lineages and to search for
concrete cases that agree with polyphyletic architecture evolution.
There are two conceptually different methodologies for this type of
analysis. Either one only considers architecture changes between
nodes of a species tree or one considers any node in a reconstructed
gene tree. The advantage of using a species tree is that one avoids
the inherent uncertainty of gene trees, but on the other hand only
events that take place between examined species can be observed.
Gough (51) applied the former species tree-based methodology to SUPERFAMILY domain architectures, and concluded that
polyphyletic evolution is rare, occurring in 0.4-4% of architectures.
The value depends on methodological details, with the lower
bound considered more reliable.
The latter gene tree-based methodology was applied by Forslund
et al. (52) to the Pfam database. Ancestral domain architectures were
reconstructed through maximum parsimony of single-domain phylogenies which were overlaid for multidomain proteins. This strategy
yielded a higher figure, ranging between 6 and 12% of architectures
depending on dataset and whether or not incompletely annotated
proteins were removed. The two different approaches, thus, give very
different results. The detection of polyphyletic evolution is in both
frameworks dependent on the data that is used: its quality, coverage,
filtering procedures, etc. The studies used different datasets, which
makes them hard to compare. However, given that their domain annotations are more or less comparable, the major difference ought to be
the ability of the gene-tree method to detect polyphyly at any point
during evolution, even within a single species. It should be noted that
domain annotation is by no means complete, as only a little less than
half of all residues are assigned to a domain (5), and this is clearly a
limiting factor for detecting architecture polyphyly. The numbers
may, thus, be adjusted considerably upward when domain annotation
reaches higher coverage.
Future work will be required to provide more reliable estimates
of how common polyphyletic evolution of domain architectures is.
Any estimate will depend on the studied protein lineage, versatility
of the domains, and methodological factors. A comprehensive and
systematic study using more complex phylogenetic methods than
the fairly ad hoc parsimony approach, as well as effective ways to
avoid overestimating the frequency of polyphyletic evolution due to
incorrect domain assignments or hidden homology between different domain families, may be the way to go. At this point, all that can
be said is that polyphyletic evolution of domain architectures definitely does happen, but relatively rarely, and that it is more frequent
for complex architectures and versatile domains.
10. Conclusions
As access to genomic data and the available computing
power have grown during the last decade, so has our knowledge of
the overall patterns of domain architecture evolution. Still, no study
is better than its underlying assumptions, and differences in the
representation of data and hypotheses mean that results often
cannot be directly compared. Overall, however, the current state
of the field appears to support some broad conclusions.
Domain and multidomain family sizes, as well as numbers of
co-occurring domains, all approximately follow power laws, which
implies a scale-free hierarchy. This property is associated with many
biological systems in a variety of ways. In this context, it appears to
reflect how a relatively small number of highly versatile components
have been reused again and again in novel combinations to create a
large part of the domain and domain architecture repertoire
of organisms. Gene duplication is the most important factor
to generate multidomain architectures, and as it outweighs
domain recombination only a small fraction of all possible domain
combinations is actually observed. This is probably further
11. Materials and Methods
Updated statistics were generated from the data in Pfam 24.0.
All UniProt proteins belonging to any of the full proteomes
covered in Pfam 24.0 were included. These include 1,359 bacteria,
76 eukaryotes, and 68 archaea. All Pfam-A domains regardless of
type were included. However, as stretches of repeat domains are
highly variable, consecutive subsequences of the same domain were
collapsed into a single pseudo-domain, if it was classified as type
Motif or Repeat, as in several previous works (44, 52, 56, 65).
Domains were ordered within each protein based on their
sequence start position. In the few cases of domains being inserted
within other domains, this was represented as the outer domain
followed by the nested domain, resulting in a linear sequence of
domain identifiers. As long regions without domain assignments are
likely to represent the presence of as-yet uncharacterized domains, we
excluded any protein with unassigned regions longer than 50 amino
acids (more than 95% of Pfam-A domains are longer than this). This
approach is similar to that taken in previous works (51, 52, 57).
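The preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not the code used to produce the statistics in this chapter: the `Hit` record, its field names, and the example coordinates are hypothetical, and positions are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass

COLLAPSIBLE_TYPES = {"Motif", "Repeat"}
MAX_UNASSIGNED = 50  # longest tolerated stretch without domain annotation


@dataclass(frozen=True)
class Hit:
    domain: str           # domain identifier
    start: int            # first residue of the match
    end: int              # last residue of the match
    kind: str = "Domain"  # entry type, e.g. Domain, Family, Motif, Repeat


def architecture(hits, protein_length):
    """Return the domain architecture as a tuple, or None if the protein
    has an unassigned region longer than MAX_UNASSIGNED residues."""
    ordered = sorted(hits, key=lambda h: h.start)  # outer domain precedes nested one
    covered_to = 0
    for hit in ordered:
        if hit.start - covered_to - 1 > MAX_UNASSIGNED:
            return None
        covered_to = max(covered_to, hit.end)
    if protein_length - covered_to > MAX_UNASSIGNED:
        return None
    result = []
    for hit in ordered:
        # Collapse consecutive copies of the same Motif/Repeat entry.
        if result and result[-1] == hit.domain and hit.kind in COLLAPSIBLE_TYPES:
            continue
        result.append(hit.domain)
    return tuple(result)


hits = [Hit("PF_B", 120, 200),
        Hit("PF_R", 10, 40, "Repeat"),
        Hit("PF_R", 45, 80, "Repeat"),
        Hit("PF_R", 85, 115, "Repeat")]
print(architecture(hits, protein_length=210))
```

Sorting by start position places a nested domain directly after the domain that contains it, which matches the linear representation described above.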
13. Exercises/Questions
Table 2
A selection of protein domain databases

Database      URL                                                    Notes
ADDA          http://ekhidna.biocenter.helsinki.fi/sqgraph/pairsdb
CATH          http://www.cathdb.info
CDD           http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
Gene3D        http://gene3d.biochem.ucl.ac.uk
INTERPRO      http://www.ebi.ac.uk/interpro
Pfam          http://pfam.sanger.ac.uk
PRODOM        http://prodom.prabi.fr
SCOP          http://scop.mrc-lmb.cam.ac.uk
SMART         http://smart.embl-heidelberg.de
SUPERFAMILY   http://supfam.cs.bris.ac.uk
References
1. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C and Murzin AG. (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36(Database issue):D419-425.
2. Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J and Orengo CA. (2009) The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. 37(Database issue):D310-314.
3. Wilson D, Pethica R, Zhou Y, Talbot C, Vogel C, Madera M, Chothia C and Gough J. (2009) SUPERFAMILY--sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37(Database issue):D380-386.
4. Lees J, Yeats C, Redfern O, Clegg A and Orengo C. (2010) Gene3D: merging structure and function for a Thousand genomes. Nucleic Acids Res. 38(1):D296-D300.
5. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunesekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR and Bateman A. (2010) The Pfam protein families database. Nucleic Acids Research, Database Issue 38:D211-222.
6. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, Finn RD, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Laugraud A, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Mulder N, Natale D, Orengo C, Quinn AF, Selengut JD, Sigrist CJ, Thimma M, Thomas PD, Valentin F, Wilson D, Wu CH and Yeats C. (2009) InterPro: the integrative protein signature database. Nucleic Acids Res. 37(Database issue):D211-5.
7. Marchler-Bauer A, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Tasneem A, Thanki N, Yamashita RA, Zhang D, Zhang N and Bryant SH. (2009) CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 37(Database issue):D205-210.
8. Letunic I, Doerks T and Bork P. (2009) SMART 6: recent updates and new developments. Nucleic Acids Res. 37(Database issue):D229-232.
50. Bashton M and Chothia C. (2002) The Geometry of Domain Combination in Proteins. J. Mol. Biol. 315:927-939.
51. Gough J. (2005) Convergent evolution of domain architectures (is rare). Bioinformatics 21(8):1464-1471.
52. Forslund K, Hollich V, Henricson A, and Sonnhammer ELL. (2008) Domain Tree Based Analysis of Protein Architecture Evolution. Mol. Biol. Evol. 25:254-264.
53. Brivanlou AH and Darnell JE. (2002) Signal Transduction and the Control of Gene Expression. Science 295(5556):813-818.
54. Weiner J 3rd and Bornberg-Bauer E. (2006) Evolution of Circular Permutations in Multidomain Proteins. Mol. Biol. Evol. 23(4):734-743.
55. Tordai H, Nagy A, Farkas K, Banyai L, Patthy L. (2005) Modules, multidomain proteins and organismic complexity. FEBS J 272(19):5064-5078.
56. Vogel C, Teichmann SA and Pereira-Leal J. (2005) The Relationship Between Domain Duplication and Recombination. J. Mol. Biol. 346:355-365.
57. Bjorklund AK, Ekman D, Light S, Frey-Skott J and Elofsson A. (2005) Domain Rearrangements in Protein Evolution. J. Mol. Biol. 353:911-923.
Chapter 9
Estimating Recombination Rates from Genetic
Variation in Humans
Adam Auton and Gil McVean
Abstract
Recombination acts to shuffle the existing genetic variation within a population, leading to various
approaches for detecting its action and estimating the rate at which it occurs. Here, we discuss the principal
methodological and analytical approaches taken to understanding the distribution of recombination across
the human genome. We first discuss the detection of recent crossover events in both well-characterised
pedigrees and larger populations with extensive recent shared ancestry. We then describe approaches for
learning about the fine-scale structure of recombination rate variation from patterns of genetic variation in
unrelated individuals. Finally, we show how related approaches using individuals of admixed ancestry can
provide an alternative approach to analysing recombination. Approaches differ not only in the statistical
methods used, but also in the resolution of inference, the timescale over which recombination events
are detected, and the extent to which inter-individual variation can be identified.
Key words: Recombination, Pedigree analysis, Linkage disequilibrium, Admixture
1. Introduction
Genetic recombination is of fundamental importance not only in the
generation of gametes within eukaryotes, but also in the process of
evolution. Specifically, while mutation provides a mechanism by
which novel variants are generated, it is recombination that allows
new combinations of variants to be exposed to natural selection.
Despite this importance, it is only recently that the key mechanisms
by which recombination is distributed along the human genome have
begun to be understood. For example, while it has been known for
some time that recombination rates vary at the broad scale (1, 2),
recent advances in experimental and statistical techniques have
revealed a complex landscape of recombination at the fine scale as
well (3-5). In fact, we now know that the majority of recombination
occurs in localized regions of roughly 2 kb in width (6, 7), where the
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_9,
© Springer Science+Business Media, LLC 2012
2. Pedigree Analysis
The first whole-genome measurements of recombination in
humans were obtained in the 1980s by using individuals with a
known ancestral relationship to track the inheritance of genetic
alleles through the genealogical tree or pedigree (24). To give
an example of how transmission of alleles from one generation to
the next is informative about recombination, consider the simple
Fig. 1. (a) Transmission of alleles in a single family quartet. In this diagram, a recombination event has occurred during the
transmission from the mother to child two, as indicated by the line shading. In practice, only the genotypes are observed,
and while it remains possible to determine that a recombination event has happened, it is not possible to resolve in which
individual it occurred without additional data. (b) An example of a simple pedigree. In each non-baseline generation,
each parent can have at most one mate and only one parent can have ancestry within the pedigree. Individuals without
ancestry within the pedigree are indicated by shaded shapes. In this example, all individuals have been genotyped at two
bi-allelic sites.
If the dataset consists of S bi-allelic loci and M non-founder individuals, then the calculation can be performed in at most O(M 2^{6S})
operations (27), meaning that the Elston-Stewart algorithm was
suitable for large pedigrees, but with relatively few loci.
Box 1
The Elston-Stewart Algorithm
Although first described for use in disease linkage studies, the
Elston-Stewart algorithm allows efficient calculation of the likelihood of a given recombination rate from large pedigrees. In
order for the assumptions of the algorithm to be satisfied, the
pedigree must start with a single founder nuclear family, with
every other nuclear family containing exactly one parent with
ancestry within the pedigree and one parent with no ancestry
within the pedigree. There can be no multiple marriages within
the pedigree, and no consanguineous unions.
The Elston-Stewart algorithm works by summing over all
possible data configurations that are compatible with the inheritance structure defined by the pedigree. When using genotype
data to estimate recombination rates, this means summing over
the possible haplotype configurations that are consistent with
the observed genotypes as they are transmitted from parents to
offspring.
We wish to compute the likelihood as a function of the
recombination rate, L(R), in the absence of disease data. Let
the ith individual have a set of compatible haplotype pairs Hi
(i.e. all possible pairs of haplotypes that are consistent with the
individual's genotype data). For n individuals in the pedigree
and a given recombination rate, R, the likelihood can, in a very
general way, be written as
\[ L(R) = \sum_{H_1} \cdots \sum_{H_n} \prod_{\{k,l,m\}} \Pr(H_m \mid H_k, H_l, R). \]
Box 1
(continued)
a trio family with one parent having ancestry within the trio.
Consider the trio family first. In this family, there are two possible
haplotype configurations (arising from the indeterminate phase of
the heterozygotic sites in individual 4). Without knowing the
phase of the parents, it is not clear if the child has inherited a
recombinant type or not.
Now consider the quartet family. As there are three
individuals with ambiguous phase in the quartet, there are
2^3 = 8 possible haplotype configurations. However, given
a haplotype configuration for the quartet, the haplotype configuration of the trio is also determined, and the probability of the
whole pedigree can be calculated by taking the product of the
transmission probabilities.
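The counting argument above can be illustrated with a short Python sketch. It enumerates the unordered haplotype pairs compatible with each individual's genotype and multiplies the counts across individuals, before any transmission constraints are applied. The function names and the example genotypes (three individuals heterozygous at both sites, one fully homozygous) are assumptions chosen for illustration rather than data taken from the figure.

```python
from itertools import product
from math import prod


def compatible_haplotype_pairs(genotype):
    """Unordered haplotype pairs consistent with an unphased genotype.

    genotype: one (allele, allele) tuple per site, e.g. [(0, 1), (0, 1)].
    """
    pairs = set()
    # At each site, choose which allele goes on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = tuple(site[c] for site, c in zip(genotype, choice))
        hap2 = tuple(site[1 - c] for site, c in zip(genotype, choice))
        # Sorting makes the pair unordered, so mirror-image phasings merge.
        pairs.add(tuple(sorted((hap1, hap2))))
    return pairs


def count_configurations(genotypes):
    """Number of joint phase configurations for a list of individuals."""
    return prod(len(compatible_haplotype_pairs(g)) for g in genotypes)


if __name__ == "__main__":
    double_het = [(0, 1), (0, 1)]   # ambiguous phase: two possible pairs
    homozygote = [(1, 1), (1, 1)]   # unambiguous: one possible pair
    print(count_configurations([double_het, double_het, double_het, homozygote]))
```

With three phase-ambiguous individuals the count is 2 x 2 x 2 = 8, matching the quartet described in the box; the likelihood calculation then weights each configuration by its transmission probabilities.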
Box 2
The Lander-Green Algorithm
The Lander-Green algorithm calculates the likelihood of
pedigree data using a commonly used statistical model known
as a Hidden Markov Model (HMM). To describe this model, let
Xj denote the genotypes of all individuals within the pedigree at
site j. The genotypes of children within the pedigree are determined by the alleles transmitted from the parents, and this
information is represented in an inheritance vector, which
records which alleles are transmitted from parent to child.
As an example, consider the pedigree in Fig. 1a. At the first
site, the genotype vector is X0 = ({1,0}, {1,1}, {1,1}, {1,1}),
where entries in curly brackets represent the genotypes of the
mother, father, and two children, respectively. The inheritance
vector for the children is I0 = ({0,1}, {0,0}), with 0 indicating
that the allele from the first parental chromosome was inherited,
and 1 indicating that the allele from the second chromosome was
inherited. In this example, child 1 inherited the allele from the
first maternal chromosome, and the allele from the second paternal chromosome. Conversely, child 2 inherited the alleles from
the first chromosome of both parents. Following this logic to the
second site would give us X1 = ({1,0}, {1,1}, {1,1}, {0,1}) and
I1 = ({0,1}, {1,0}). Given the inheritance vector at a site, we can
calculate the probability of obtaining the observed genotypes,
Pr(Xj|Ij).
In the absence of recombination, there would be a single
inheritance vector for all sites in our data. However, recombination between sites causes the inheritance vector to transition to a
new state as we move from site to site. The probability of
transitioning from one inheritance vector at one site to a different inheritance vector at the next site depends on the probability
of recombination between sites, pr. Assuming the state of the
inheritance vector at site j + 1 only depends on the state at site j,
the probability of transitioning from one vector to the next is
written as Pr(I_{j+1} | I_j). For a single meiosis, there are only two
possible inheritance vectors (either the parent's first allele is
transmitted or the second is). Hence, the probability of transitioning to a new inheritance vector is:
\[ \Pr(I_{j+1} \mid I_j) = \begin{cases} 1 - p_r & \text{if } I_{j+1} = I_j \\ p_r & \text{otherwise.} \end{cases} \]
For a pedigree containing two meioses (such as a family
trio), the possible inheritance vectors can be separated by
R = 0, 1, or 2 recombination events. In this case, the transition
probabilities are:
\[ \Pr(I_{j+1} \mid I_j) = \begin{cases} (1 - p_r)^2 & \text{if } R = 0 \\ (1 - p_r)\,p_r & \text{if } R = 1 \\ p_r^2 & \text{if } R = 2. \end{cases} \]
A recursive formula can be used to calculate the transition
probabilities between inheritance vectors for any number of
meioses, although the number of possible transitions becomes
quite large for more than a few meioses.
In practice, only the genotypes are observed in the data; the
inheritance vector at each site is unknown and hence treated as a
hidden state and has to be summed over when calculating the
likelihood. For m sites, the likelihood can be written in a general
form as
\[ L = \sum_{I_1} \cdots \sum_{I_m} \Pr(I_1) \prod_{i=2}^{m} \Pr(I_i \mid I_{i-1}) \prod_{i=1}^{m} \Pr(X_i \mid I_i). \]
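The sum over hidden inheritance vectors does not have to be enumerated path by path; the standard HMM forward recursion computes it site by site. The Python sketch below is a minimal illustration under several assumptions that are not stated in the box: each inheritance vector is flattened to one bit per meiosis, the prior Pr(I_1) is uniform, the same recombination probability p_r applies between every pair of adjacent sites, meioses recombine independently (so a transition with R changed bits among n meioses has probability p_r^R (1 - p_r)^(n - R), which reduces to the one- and two-meiosis cases given earlier), and the emission probabilities Pr(X_j | I_j) are supplied by the caller. All function names are hypothetical.

```python
from itertools import product


def transition_probability(v_from, v_to, p_r):
    """Pr(I_{j+1} = v_to | I_j = v_from) for independent meioses."""
    n_recombinations = sum(a != b for a, b in zip(v_from, v_to))
    n_unchanged = len(v_from) - n_recombinations
    return (p_r ** n_recombinations) * ((1.0 - p_r) ** n_unchanged)


def lander_green_likelihood(emissions, n_meioses, p_r):
    """Likelihood of the genotype data, summed over all inheritance vectors.

    emissions: one dict per site mapping each inheritance vector
               (a tuple of n_meioses bits) to Pr(X_j | I_j).
    """
    states = list(product((0, 1), repeat=n_meioses))
    prior = 1.0 / len(states)  # assumption: uniform Pr(I_1)
    # forward[s] = Pr(X_1..X_j, I_j = s)
    forward = {s: prior * emissions[0][s] for s in states}
    for site_emissions in emissions[1:]:
        forward = {
            s_to: site_emissions[s_to] * sum(
                forward[s_from] * transition_probability(s_from, s_to, p_r)
                for s_from in states
            )
            for s_to in states
        }
    return sum(forward.values())
```

The cost grows linearly with the number of sites but with the square of the number of inheritance vectors, that is, exponentially with the number of meioses, which is why this style of calculation suits small pedigrees typed at many markers.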
common ancestor, and the two individuals share a common haplotype. In this case, the shared region of the genome is said to have
identity by descent (IBD).
Long-range IBD can be used to obtain highly accurate phasing
of the genotyped individuals (29). First, an individual is selected for
phasing, known as the proband. If the genotypes of both parents of
this individual were known, it would be relatively trivial to phase the
proband individual by identifying which allele was inherited from
each parent (with the exception that this is not possible at those
sites where the child and both parents are heterozygous).
For example, in Fig. 1a, it is possible to identify the haplotypes
transmitted from each parent to child 2. However, in the deCODE
study, the genotypes of either one or both of the parents were
generally not known. To overcome this, the authors divided
the genome of the proband into sections, and for each section
identified a separate pair of individuals within the study showing
high levels of relatedness, or IBD, with the proband. The authors
were able to use the selected individuals as surrogate parents, and
phase the proband as if the parents were known. By exploiting the
relatedness between individuals in the study, the authors were able
to obtain near-perfect phasing for thousands of individuals over
many megabases of the genome (29). Furthermore, because it is
possible to select many surrogate parents on each side, the fraction
of sites that can be phased unambiguously is much higher because
only one of the surrogate parents on each side needs to be homozygous in order to determine transmission.
Using the above method, the 2010 deCODE study was able to
obtain highly accurate phasing for parent-offspring pairs yielding a
total of 15,257 meioses. This number of meioses represented an
order of magnitude increase over previous studies, and in combination with
the increased marker density, the resolution to detect recombination events was improved from ~5 Mb in previous studies to
approximately 10 kb.
An advantage of large-scale pedigree studies is that detected
recombination events can generally be assigned to a specific individual, and it is therefore possible to identify differences in recombination rate between groups of individuals. For example, the 2010
deCODE study compared recombination rates in males and
females and revealed that approximately 15% of hotspots appear
to be sex specific (20). The mechanism of sex-specific hotspot
formation is currently unknown.
Despite the success of pedigree studies, their large-scale nature
means that they cannot be practically applied in many cases; for
example, in many non-human species, the cost may be prohibitive,
and even with thousands of meioses the resolution remains relatively low. Furthermore, the resulting recombination rate estimates
are obtained by averaging across many individuals, as each family
can only provide evidence of a handful of recombination events.
3. Linkage Disequilibrium Based Approaches
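\[ r^2 = \frac{D^2}{f_A f_B f_a f_b} \]
\[ D' = \begin{cases} \dfrac{D}{\min(f_A f_b,\; f_a f_B)} & \text{if } D \geq 0 \\[2ex] \dfrac{D}{\min(f_A f_B,\; f_a f_b)} & \text{if } D < 0. \end{cases} \]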
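As a worked illustration of these two-locus summaries, the Python sketch below computes D, r^2 and D' from a list of phased two-site haplotypes. It assumes the standard definition D = f_AB - f_A f_B, which is not part of the surviving text here, codes alleles A and B as 1 and alleles a and b as 0, and follows the sign convention of the formula above (D' is negative when D is negative). The function name and the example haplotypes are hypothetical.

```python
def ld_statistics(haplotypes):
    """Return (D, r_squared, d_prime) for phased two-site haplotypes.

    haplotypes: list of (first_site_allele, second_site_allele) tuples,
                with 1 coding alleles A/B and 0 coding alleles a/b.
    """
    n = len(haplotypes)
    f_A = sum(h[0] for h in haplotypes) / n
    f_B = sum(h[1] for h in haplotypes) / n
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    if min(f_A, f_a, f_B, f_b) == 0.0:
        raise ValueError("both sites must be polymorphic")
    f_AB = sum(1 for h in haplotypes if h == (1, 1)) / n
    # Assumed standard definition of the disequilibrium coefficient.
    D = f_AB - f_A * f_B
    r_squared = D ** 2 / (f_A * f_B * f_a * f_b)
    if D >= 0:
        d_prime = D / min(f_A * f_b, f_a * f_B)
    else:
        d_prime = D / min(f_A * f_B, f_a * f_b)
    return D, r_squared, d_prime


if __name__ == "__main__":
    # Hypothetical sample in which only haplotypes AB and ab are present.
    print(ld_statistics([(1, 1), (1, 1), (0, 0), (0, 0)]))
```

When only two of the four possible haplotypes are present, both statistics reach their maximum magnitude; when all four occur at the frequencies expected under independence, D is zero and so are both statistics.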
Box 3
The Four-Gamete Test
The four-gamete test aims to identify patterns of population genetic data that are indicative of historical recombination events. In the absence of recombination and reverse
mutation, four haplotype sequences with two bi-allelic sites can be related by the five
possible ancestral histories shown below. Each possible ancestral history corresponds to a
specific haplotype configuration. Note that the labelling of which allele is the mutant is
arbitrary, as is the ordering of sites, and hence all possible haplotype configurations
(without recombination) can be classified into one of the configurations shown here.
However, if all four haplotypes are observed in a sample, as shown above, a simple
tree cannot represent the ancestry of the sample. In the absence of reverse mutation, only
recombination could have generated the observed pattern. The four-gamete test calls a
recombination event between sites if this situation is observed.
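The test itself is simple to express in code. The Python sketch below scans every pair of bi-allelic sites and reports those at which all four gametes are observed. The function name and the example haplotypes are hypothetical, sites are indexed from zero, and the sketch only flags incompatible pairs; it does not attempt to bound the number of recombination events.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return the pairs of site indices at which all four gametes occur.

    haplotypes: equal-length strings (or sequences) over the alleles '0' and '1'.
    """
    n_sites = len(haplotypes[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        # The set of distinct two-site haplotypes (gametes) seen in the sample.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations


if __name__ == "__main__":
    # Hypothetical sample: only the first two sites show all four gametes.
    sample = ["000", "010", "100", "110", "001"]
    print(four_gamete_violations(sample))
```

Because the test assumes no reverse or recurrent mutation, a flagged pair is evidence of recombination only to the extent that the infinite-sites assumption is reasonable for the data.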
Fig. 2. (a) Example of a coalescent tree for six samples. The topology of the tree indicates
the relatedness between samples, with mutations indicated by circles. (b) An example of
an ancestral recombination graph (ARG) for four samples, with three mutations. There is a
single recombination event, indicated by the splitting of the ancestral lineage of the third
chromosome as it is followed backwards in time.
In the presence of recombination, a tree structure is not sufficient to describe the ancestry of a sample as it is possible for the
ancestral history to differ between loci. In this case, the ancestry of
the sample can be represented in the form of a graph known as the
Ancestral Recombination Graph (ARG, Fig. 2b) (41). As with a
coalescent tree, branches in the ARG coalesce and contain mutations. However, the ARG also contains recombination events,
which are represented by a bifurcation of a given branch representing a point in the history in which loci to the left of the recombination event follow a different ancestry to those to the right of the
recombination event. As with the basic coalescent, the shape of a
typical ARG is determined by the relative rate at which mutation,
coalescence, and recombination events occur.
Within the context of the ARG, it is not possible to make
inference of the per-generation recombination rate, r, directly.
Rather, in coalescent theory, the rate of recombination is measured
in terms of the population recombination rate, ρ. The population
recombination rate is related to the per-generation recombination
rate by the formula ρ = 4Ne r, where Ne is known as the effective
population size, and depends on a number of factors, such as the
demographic history of the population. In order to infer r from ρ, it
is necessary to obtain an independent estimate of Ne, which can be
achieved by comparison with existing genetic maps or from diversity estimates. In humans, Ne has generally been estimated in the
range of 10,000-18,000 (5, 42).
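The dependence of the inferred per-generation rate on the assumed effective population size can be made concrete with a small Python sketch. The function names and the example value of the population recombination rate are hypothetical, and the conversion to cM/Mb assumes rates expressed per base pair and small enough that the recombination fraction and the genetic distance in Morgans are approximately equal.

```python
def per_generation_rate(rho, effective_population_size):
    """Invert rho = 4 * Ne * r to obtain r (same physical unit as rho)."""
    return rho / (4.0 * effective_population_size)


def to_cm_per_mb(rate_per_bp):
    """Convert a per-bp, per-generation rate to cM/Mb (small-rate approximation)."""
    return rate_per_bp * 100.0 * 1e6  # 100 cM per Morgan, 1e6 bp per Mb


if __name__ == "__main__":
    rho_per_bp = 4e-4  # hypothetical estimate of the population rate per bp
    for n_e in (10_000, 18_000):  # the range quoted in the text
        r = per_generation_rate(rho_per_bp, n_e)
        print(n_e, r, to_cm_per_mb(r))
```

The same estimate of the population rate therefore maps to per-generation rates that differ by a factor of 1.8 across the quoted range of Ne, which is why an independent calibration is needed.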
Given a specific genealogy, the resulting genetic dataset is
uniquely determined. Furthermore, the probability of obtaining
the genealogy from the coalescent model can be calculated for a
given mutation, coalescence, and recombination rate. Hence, if the
genealogy is known, it is possible to calculate the probability, or
likelihood, of obtaining the observed data.
However, the converse is not true; knowing the genetic dataset
does not uniquely determine the genealogy. Typically, there is no
record of the genealogy of the sample; the genealogy is missing
data. In order to calculate the likelihood of our data, it is therefore
necessary to integrate over all possible genealogies. Unfortunately,
the number of possible genealogies is infinite, and even by restricting
the allowed genealogies to those that conform to the infinite-sites
model, and those with non-trivial recombination events, the number
of genealogies increases at a fantastic rate as the sample size increases.
For example, a dataset with just seven sequences and five SNP sites
could have been generated by over 9.1 × 10^16 genealogies, an infeasible number to sum over even using modern supercomputers (43).
It is, therefore, difficult to calculate the likelihood of the data
under the coalescent model. While it is possible to estimate the
likelihood over a range of recombination rates for a single pair of
SNPs, the calculations do not scale with the number of sites, and
hence full likelihood inference is not practical for all but the smallest
4. Admixture
Pedigree and LD-based studies have provided complementary
insights into the genome-wide patterns of recombination. With
the growing amount of available data, these techniques will continue to improve in resolution. However, scope remains for
continued method development. One novel technique, recently
described (48), makes use of individuals with a history of recent
genetic admixture and provides an additional
resource for the measurement of recombination.
The principle of recombination detection via admixture is that
the genomes of admixed individuals are made of a mosaic of genetic
material inherited from differing ancestral populations (Fig. 3).
If the ancestral populations are sufficiently diverged from each
other, it is possible to detect the regions of the admixed genome
that have been inherited from one population or the other. The
break points between ancestral sections represent recombination
events that have occurred since the time of the admixture event.
The ability of admixture techniques to detect recombination
depends on accurate detection of break points between ancestral
haplotypes. In order to achieve this, a statistical model of the
relationship between haplotypes is needed. Such a model is available in the form of the Li and Stephens model, which is a widely
used model in a number of areas of population genetics (49).
The Li and Stephens model is based on the idea that if a number
of haplotypes have already been observed the next haplotype to be
sampled is likely to look quite similar to those already seen. The new
haplotype could be constructed as a mosaic of sections of the
previously observed haplotypes, allowing some level of mismatch or
mutation. In other words, the new haplotype is constructed by
copying sections of existing haplotypes, and hence traces a path
through the set of existing haplotypes (Box 4). The new haplotype
is modelled using an HMM, in which the hidden state defines which
of the existing haplotypes is being copied.
Box 4
The Li and Stephens Model
The basic idea of the Li and Stephens model is that if we have observed a set of haplotypes
the next haplotype we observe is likely to look similar to those we have already observed
due to their shared common ancestry. Suppose we have observed a collection of eight
haplotypes, h1 to h8, as in the diagram below.
[Diagram: the observed haplotypes h1 to h8, with the new haplotype h* shown beneath them.]
The Li and Stephens model considers the next haplotype, h*, given the set of
previously observed haplotypes. This is achieved by assuming that h* is constructed by
copying sections from the previously observed haplotypes, allowing some level of error.
In the diagram, an example of how h* could be constructed from h1 . . . h8 is indicated by
the path traced out by the arrows.
The path through the collection of haplotypes is unknown, and is therefore modelled
using an HMM, where the hidden state is the haplotype being copied from. Given k
haplotypes have been observed so far, the emission probabilities for possible alleles a at site
j in the next haplotype are given by:
\[ \Pr(h_{*,j} = a \mid X_j = x, h_1, \ldots, h_k) = \begin{cases} \dfrac{k}{k+\theta} + \dfrac{1}{2}\,\dfrac{\theta}{k+\theta} & \text{if } h_{x,j} = a \\[2ex] \dfrac{1}{2}\,\dfrac{\theta}{k+\theta} & \text{if } h_{x,j} \neq a, \end{cases} \]
where X_j defines the haplotype being copied at site j, h_{x,j} is the allele of haplotype x at site
j, and θ is the mutation parameter. The above probability captures the idea that a
haplotype is more likely to have copied from a similar haplotype than a dissimilar one.
Transitions between hidden states (i.e. the haplotype being copied from) occur with
probability that depends on the recombination distance, ρ_j, between sites j and j + 1:
\[ \Pr(X_{j+1} = x' \mid X_j = x) = \begin{cases} e^{-\rho_j/k} + \dfrac{1}{k}\left(1 - e^{-\rho_j/k}\right) & \text{if } x' = x \\[2ex] \dfrac{1}{k}\left(1 - e^{-\rho_j/k}\right) & \text{otherwise.} \end{cases} \]
Using standard HMM machinery (as for the Lander-Green algorithm), it is possible
to sum over all possible paths, and hence calculate the likelihood of obtaining the new
haplotype, given the set of existing haplotypes.
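The forward recursion for this copying model can be sketched in a few lines of Python. The sketch uses the emission and transition probabilities given in the box, with the mutation parameter written as theta and the recombination distances between adjacent sites as a list rhos. Two points are assumptions made for illustration: the copied haplotype at the first site is taken to be uniformly distributed over the k observed haplotypes, and alleles are coded as the characters '0' and '1'. The function names are hypothetical.

```python
from math import exp


def emission_probability(allele, copied_allele, k, theta):
    """Probability of the new allele given the allele on the copied haplotype."""
    mismatch = 0.5 * theta / (k + theta)
    if allele == copied_allele:
        return k / (k + theta) + mismatch
    return mismatch


def li_stephens_likelihood(new_haplotype, haplotypes, rhos, theta):
    """Likelihood of new_haplotype given k observed haplotypes.

    rhos[j] is the recombination distance between sites j and j + 1,
    so len(rhos) must equal the number of sites minus one.
    """
    k = len(haplotypes)
    # Assumption: each haplotype is equally likely to be copied at site 0.
    forward = [
        emission_probability(new_haplotype[0], h[0], k, theta) / k
        for h in haplotypes
    ]
    for j in range(1, len(new_haplotype)):
        stay = exp(-rhos[j - 1] / k)   # no switch of copied haplotype
        jump = (1.0 - stay) / k        # switch to any given haplotype
        total = sum(forward)
        # Summing over the previous state collapses to stay * f[x] + jump * total.
        forward = [
            emission_probability(new_haplotype[j], h[j], k, theta)
            * (stay * forward[x] + jump * total)
            for x, h in enumerate(haplotypes)
        ]
    return sum(forward)
```

Because a switch lands on every haplotype with equal probability, each site costs time proportional to k rather than k squared, which is one reason the model scales to large reference panels.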
5. Conclusion
Recombination detection methods have evolved rapidly over recent
years. The methods described here differ in terms of the achievable
resolution, the regions of the genome that can be analysed, and the
number of generations that recombination events are measured
over (Table 1). Direct experimental methods such as sperm-typing
continue to provide the highest resolution insight into rate variation, but experimental challenges limit their widespread application
and only provide rate estimates within males. LD studies can
achieve similar resolution, but only offer rate estimates averaged
over thousands of generations and cannot provide substantial information of differences between individuals. Between the two lie the
pedigree and admixture studies, which are today limited largely by
sample size, but which currently provide the best prospects for
detecting and understanding variation among individuals and
populations in both local and global rates of recombination.
In recent years, these methods have led to huge leaps in our
understanding of recombination. It is now accepted that recombination hotspots are a ubiquitous feature of the human genome, but
until a few years ago the mechanisms leading to hotspot formation
were largely unknown. This has started to change with the identification of a short DNA sequence motif found to be highly enriched
Table 1
Summary of described methods for recombination rate measurement,
assuming typical parameters of studies to date

Method             Approximate    Size of analysed   Approximate number    Generations
                   resolution     region             of useful meioses     analysed
Sperm typing                                                               1
Pedigree studies   10 kb-5 Mb     Genome wide        1,500-15,000          1-10
LD studies         1-5 kb         Genome wide        ~300,000              ~10,000
Admixture          10-40 kb       Genome wide        1,000-20,000          ~5-15

Comments (LD studies): Fine-scale genome-wide estimates, but estimates
represent an average over many generations, and may be biased by
population genetic history
6. Questions and Exercises
1. Is it possible to detect recombination events using genotype
data obtained from a single nuclear family trio? Explain your
answer.
2. Write down the haplotype configurations that are consistent
with the data shown in Fig. 9.1b. Convince yourself that at least
one recombination event is required in the pedigree.
3. Suppose you have sampled the following five haplotypes with
three segregating sites from a population:
Haplotype 1: 011
Haplotype 2: 000
Haplotype 3: 100
Haplotype 4: 010
Haplotype 5: 101
Using the four-gamete test, calculate the minimum number of
recombination events that have occurred in the population
history between sites 1 and 2. How about sites 2 and 3? And
finally, between sites 1 and 3?
4. Suppose an admixture event occurred between two populations
three generations ago. Assuming a recombination rate of
1 cM/Mb, what would the average ancestry track length be
in an individual sampled from the population today? How
about after seven generations?
References
1. Broman, K.W., et al., Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet, 1998. 63(3): p. 861-9.
2. Kong, A., et al., A high-resolution recombination map of the human genome. Nat Genet, 2002. 31(3): p. 241-7.
3. The International HapMap Consortium, A haplotype map of the human genome. Nature, 2005. 437(7063): p. 1299-320.
4. McVean, G.A., et al., The fine-scale structure of recombination rate variation in the human genome. Science, 2004. 304(5670): p. 581-4.
5. Myers, S., et al., A fine-scale map of recombination rates and hotspots across the human genome. Science, 2005. 310(5746): p. 321-4.
6. Jeffreys, A.J., L. Kauppi, and R. Neumann, Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet, 2001. 29(2): p. 217-22.
7. Jeffreys, A.J., et al., Human recombination hotspots hidden in regions of strong marker association. Nat Genet, 2005. 37(6): p. 601-6.
8. Myers, S., et al., The distribution and causes of meiotic recombination in the human genome. Biochem Soc Trans, 2006. 34(Pt 4): p. 526-30.
9. Baudat, F., et al., PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science, 2010. 327(5967): p. 836-40.
10. Berg, I.L., et al., PRDM9 variation strongly influences recombination hot-spot activity and meiotic instability in humans. Nat Genet, 2010. 42(10): p. 859-63.
Chapter 10
Evolution of Viral Genomes: Interplay Between Selection,
Recombination, and Other Forces
Sergei L. Kosakovsky Pond, Ben Murrell, and Art F.Y. Poon
Abstract
RNA viruses evolve very rapidly, often recombine, and are subject to strong host (immune response) and
anthropogenic (antiretroviral drugs) selective forces. Given their compact and extensively sequenced
genomes, comparative analysis of RNA viral data can provide important insights into the molecular
mechanisms of adaptation, pathogenicity, immune evasion, and drug resistance. In this chapter, we present
an example-based overview of recent advances in evolutionary models and statistical approaches that enable
screening viral alignments for evidence of adaptive change in the presence of recombination, detecting
bursts of directional adaptive evolution associated with phenotypic changes, and detecting coevolving
sites in viral genes.
Key words: Viral evolution, Recombination, Natural selection, Epistasis, Machine learning, Bayesian
networks
1. Introduction
Whether one considers them to be living organisms or not, viruses are
the most extensively sequenced members of the natural world. Virus
genomes, especially those of RNA viruses, present many unique
challenges to genetic sequence analysis. Even though they are comparatively small in size (ranging approximately from 10^3 to 10^6 nucleotides in length) and contain a relatively small number of genes, they
also undergo a very high mutation rate that drives the accumulation
of extensive sequence variation (1). Combined with the extremely
rapid pace of evolution due to high mutation and recombination
rates, short generation times, and strong selection in host environments, viruses provide some of the clearest examples of natural selection in action. Detecting the site-specific signature of selection
in viruses by codon-based models of molecular evolution is one
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_10,
© Springer Science+Business Media, LLC 2012
2. Example Data and Software
Datasets used as examples in this chapter can be downloaded from
http://www.hyphy.org/pubs/book2011/data. All computational
procedures described below are based on the HyPhy software package (4). A basic level of familiarity with the package is expected and
we recommend that readers peruse relevant package documentation, which can be found at http://www.hyphy.org.
3. Recombination
We start by presenting a method for detecting recombination from
an alignment of homologous sequences. This is not a conventional
ordering of topics because methods for detecting recombination are
generally predated by codon model-based methods for detecting
diversifying selection (see Subheading 4). However, we strongly
advocate screening an alignment for recombination before all else
because recombination, which causes different regions of an alignment to be related by different phylogenies, can strongly affect the
results of subsequent analyses, such as selection detection.
Recombination plays a key role in the evolution of many viral
pathogens. For instance, major pandemic strains of the influenza A
virus (IAV) have arisen through segmental reassortment, which can
be thought of as intergenic, or gene-preserving, recombination.
For example, the swine-origin H1N1 virus has undergone at least
two reassortment events, and carries genes from three different
ancestral IAV lineages (5).
Fig. 1. (a) Phylogenetic incongruence caused by the presence of a recombinant sequence in an alignment. Sequence R is a
product of homologous recombination between sequences A and B. Phylogenies reconstructed from sequences A,B,R and
an outgroup sequence (O) differ based on which part of the alignment is being considered: to the left of the break point, R
clusters with A, whereas to the right of the break point R clusters with B. (b) GARD analysis of the Cache Valley Fever Virus
glycoprotein.
Break point   LHS raw p   LHS adjusted p   RHS raw p   RHS adjusted p
588           0.00060     0.00480          0.00140     0.01120
1,080         0.00260     0.02080          0.02130     0.17040
1,491         0.00010     0.00080          0.00010     0.00080
1,693         0.00010     0.00080          0.00010     0.00080
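The adjusted p-values listed for the four break points are each eight times the corresponding raw p-value, which is consistent with a Bonferroni correction over eight tests (four break points, two sides each). The following is a minimal sketch of that arithmetic; the function name `bonferroni` and the assumption that this is the correction used are ours, not taken from the HyPhy documentation.

```python
# Raw p-values as listed for each break point: (LHS, RHS).
raw_p = {
    588: (0.00060, 0.00140),
    1080: (0.00260, 0.02130),
    1491: (0.00010, 0.00010),
    1693: (0.00010, 0.00010),
}


def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1 (hypothetical helper)."""
    return min(1.0, p * n_tests)


# Assumption: one test per side of each break point, i.e. 2 x 4 = 8 tests.
n_tests = 2 * len(raw_p)

adjusted = {
    bp: tuple(round(bonferroni(p, n_tests), 5) for p in pair)
    for bp, pair in raw_p.items()
}

for bp, (lhs, rhs) in adjusted.items():
    print(bp, lhs, rhs)
```

Under this reading, only the right-hand side of break point 1,080 fails to remain below 0.05 after adjustment.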
4. Selection
Selection is the outcome of the variation in fitness induced by the
environment in which genetic variants are expressed. Based on
the excess number of nonsynonymous codon substitutions or a
change in allele frequencies, it is possible to identify sites within
protein-coding regions of a genome that have been targeted by
selection: some of the methods for accomplishing this are presented in preceding chapters. Diversifying (host-specific) selection
on virus genome variation is dominated by the immune
response mounted by the host. Jawed vertebrates, such as humans,
have, in addition to the innate immune system, an adaptive
immune system that is further partitioned into the humoral and
cellular immune responses (21). The humoral response takes place
in the extracellular environment and mounts an antibody-based
defense that attacks exposed surfaces of the virus particle. The
cellular response takes place within the infected cell and involves
the recognition and binding of peptides encoded by the virus
genome, which are displayed on the surface of the cell to trigger
the lysis of the cell by cytotoxic T-lymphocytes (CTLs). Both
components of the adaptive immune system play a crucial role in
managing a viral infection and thereby shaping the genetic variation of the virus population. In addition, many human pathogenic
viruses, particularly HIV-1, influenza virus, hepacivirus, and herpesvirus, are treated by antiviral agents that also target specific sites
of the virus genome (22).
5. Detecting Selection in the Presence of Recombination
Fig. 2. The effect of recombination on inferring diversifying selection. Reconstructed evolutionary history of codon 516 of
the Cache Valley Fever virus glycoprotein alignment is shown according to GARD-inferred segment phylogeny (left ) or a
single phylogeny inferred from the entire alignment (right ). Ignoring the confounding effect of recombination causes the
number of nonsynonymous substitutions to be overestimated. A fixed effects likelihood (FEL (60)) analysis infers codon 516
to be under diversifying selection when recombination is ignored (p = 0.02), but not when it is corrected for using a
partitioning approach (p = 0.28).
Model            Log likelihood   Synonymous CV   NS Exp and CV      N/S Exp and CV     p-Value     Prm   AIC
discr (3), M1a   -10618.85294     0.55189576      0.08138, 3.17446   0.24263, 5.48971   N/A         50    21,337.71
discr (3), M2a   -10613.84434     0.55501859      0.17868, 3.96245   0.66827, 7.00400   0.0066803   52    21,331.69

Model            Log likelihood   Synonymous CV   NS Exp and CV      N/S Exp and CV     p-Value     Prm   AIC
                                  0.42057602      0.08316, 3.16349   0.20055, 7.23691   N/A         84    21,082.99
                                  0.42703723      0.11230, 3.68455   0.28263, 8.72351   0.0781191   86    21,081.89
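The AIC and p-value columns of the M1a/M2a comparison can be checked from the log likelihoods and parameter counts alone. The sketch below assumes that the log likelihoods are negative (-10618.85294 and -10613.84434) and that the likelihood ratio statistic is compared with a chi-square distribution with 2 degrees of freedom (the difference in parameter counts, 52 - 50); both are our assumptions, although they reproduce the tabulated values. For 2 degrees of freedom the chi-square survival function has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def aic(log_l, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_l


def lrt_p_value_df2(log_l_null, log_l_alt):
    """p-value of a likelihood ratio test against chi-square with 2 df."""
    stat = 2.0 * (log_l_alt - log_l_null)
    return math.exp(-stat / 2.0)  # survival function of chi-square, 2 df


# (log likelihood, number of parameters) as read from the table.
m1a = (-10618.85294, 50)
m2a = (-10613.84434, 52)

aic_m1a = aic(*m1a)
aic_m2a = aic(*m2a)
p_value = lrt_p_value_df2(m1a[0], m2a[0])

print(round(aic_m1a, 2), round(aic_m2a, 2), round(p_value, 7))
```

The lower AIC and the small p-value both favor M2a in this comparison.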
6. Directional Selection
HIV-1 replicates extremely rapidly, producing as many as 10^10 viral
particles per day. The fidelity of reverse transcription is low, with a
rate of 3 × 10^-5 errors per base per replication cycle. Together, these properties
give HIV-1 a powerful means of escaping the selective pressure
imposed by antiretroviral therapy (ART), which suppresses HIV-1 replication by interfering with various stages of the viral life cycle;
the result is drug resistance.
Some important features of the evolution of drug resistance
must be encoded by models of evolution to detect substitutions
under selective pressure induced by ART. For this discussion, we
are modeling evolution over a reverse transcriptase phylogeny that
has been constructed from treatment-naive, as well as posttreatment,
sequences (see Fig. 3). The first thing to notice is that the selective
pressure of interest is not constant over the entire phylogeny, but
rather restricted to a subset of branches: it is episodic. A second
critical property of the evolution of drug resistance is that once
ART is introduced, selection is directional, where only substitutions toward one or more target amino acids are favored. This can
be contrasted with diversifying selection, where nucleotide substitutions that change the amino acid are favored, regardless of the
amino acid. Diversifying selection approximates the continuously
shifting coevolutionary environment typified by host-pathogen
arms-race coevolution (28). The evolution of drug resistance,
on the other hand, is characterized by discrete major shifts of fitness
landscape with the introduction of therapies. The probability of the
emergence of particular amino acids contributing to drug resistance
Table 1
HIV-1 reverse transcriptase drug resistance: Episodic directional, episodic diversifying, and constant diversifying selection
Site   Target   MEDS p-value   FEEDS p-value   FEL p-value   Resistance
41
<0.0001
NRTI
74
83
0.0004
103
<0.0001
0.007
NNRTI
184
<0.0001
0.02
NRTI
210
<0.0001
0.008
NRTI
215
<0.0001
<0.0001
<0.0001
NRTI
219
0.0017
NRTI accessory
0.001
0.007
NRTI
Note that the p-value for MEDS is obtained from a likelihood ratio test (LRT) for episodic directional
selection; the FEEDS p-value is obtained from an LRT of the hypothesis β_F > α, which tests for diversifying
selection; and FEL is a test for constant diversifying selection run on Datamonkey. A dash denotes a nonsignificant (α = 0.05) p-value, and an asterisk indicates no target residue because of lack of detection by MEDS.
7. Epistasis
The effect of a mutation depends not only on the host environment, but also on the rest of the genome sequence in which it
occurs. Put another way, the rest of the genome comprises an
extremely significant part of the mutation's environment. The
dependence of a mutation's effect on other sites of the genome is
known as epistasis. Because epistasis is inherently nonlinear, it is
exceedingly difficult to model and hence to estimate from data.
In quantitative genetics, epistasis is assessed as a nonadditive component of variance attributable to interactions among genetic factors (35); however, this framework does not provide a means of
explicitly identifying those interactions. On the other extreme,
population genetics models tend to incorporate epistasis as a nonadditive term for the effects of mutant alleles at two loci (36). While
this scheme is mathematically convenient, it is not adequate for
the purpose of studying the evolution of genomes, even when they
are relatively small in size.
The comparative study of sequence variation offers a practical
approach to identifying which sites in the genome participate in
epistatic interactions. Literally hundreds of investigators across disjoint subdisciplines of biology have proposed various comparative
methods to accomplish this objective. Though we have not yet
encountered a comprehensive review, interested readers may find
useful references in (37-40). Essentially, all of these methods use
correlated patterns of substitution at different sites as evidence of
an interaction. Most methods apply some correlation test statistic
8. Identifying Agents of Selection: The CTL Response
In previous sections, we have outlined several methods for detecting the signature of selection from an alignment of homologous
sequences. It is much more difficult to identify which aspects of the
Fig. 4. HyPhy BGM diagnostics. (Left) A graph depicting a thinned Markov chain Monte Carlo sample from the posterior
probability distribution of Bayesian networks given the HIV-1 p24 example data. (Caveat: Posterior values are labeled as
LogL, which is an abbreviation of log-transformed likelihood.). (Right) A histogram summarizing the edge marginal
posterior probabilities from the same analysis.
Fig. 5. A graph depicting compensatory interactions inferred from the alignment of HIV-1 subtype C gag p24 sequences.
Each square node represents a position in the p24 protein sequence that participates in at least one interaction. The
arrows (edges) representing those interactions are annotated with the fraction of graphs in the chain sample that
contain the edge.
molecules that are encoded by the highly variable major histocompatibility complex (MHC) class I loci.
Consequently, many sites in a virus genome experience strong
selection for amino acid replacements because they encode components of a protein that are preferentially targeted by the antigen-processing pathway, such as the anchor residues that determine
HLA-binding specificities. We would like to know which regions
of a virus genome are enriched for sites targeted by the cellular
immune response; such regions can identify peptides to be
incorporated into anti-HIV vaccine candidates (56). However,
there are hundreds of alleles that have been described at the three
MHC class I loci (denoted A, B, and C) and each one can potentially target a different set of sites in the HIV-1 genome.
This is a situation that is amenable to being analyzed with a
Bayesian network because potential agents of selection in the host
environment can simply be handled as additional variables in the
graph (57). Simply put, we want to know if substitutions tend to
occur more often than random in branches that represent hosts
that are presenting a particular agent of selection. The capacity
of Bayesian networks to find causal relationships in the midst of
potential confounding variables is an important strength of this
application. However, there is a catch: we cannot reconstruct
host environments in the virus phylogeny. This limits an analysis
of associations between agents of selection and site-specific rates of
virus evolution to the terminal branches of the tree: in other words,
the branches that are leading directly to observed virus sequences.
Unfortunately, that means that we must sacrifice a substantial
amount of valuable information on virus genome coevolution
that has been mapped to internal branches of the tree.
In order to accomplish such an analysis, we need to extract the
substitution map that has been generated by the QuickSelectionDetection batch file. The following is a code snippet that writes this
substitution map to a file.
The first column contains the sequence names, which you can
use to link each row of the substitution map to whatever agents of
selection (or even phenotypes) that you have obtained for these
sequences. When we were downloading HIV p24 sequences from
the LANL Web site, we happened to include HLA genotypes into
the sequence annotations. An example file containing a binary-state
matrix corresponding to amino acid substitutions mapped to terminal branches leading to each sequence, as well as columns indicating the presence or absence of common HLA serotypes, is
provided as a comma-delimited file named agents.csv. HLA serotypes are labeled in accordance with standard nomenclature, e.g.,
A24. (Note that codons in HIV p24 are numbered and prefixed
with an X in this example file, which was simply a consequence of
merging the serotype and codon data in the statistical programming environment R, which does not permit variable names to
begin with a number.)
In order to perform a BGM analysis outside of the QuickSelectionDetection batch file, we have provided a custom batch file called
BgmAnalysis.bf. The options for this batch file are very similar to
those raised by QuickSelectionDetection, with two important
exceptions. First, you need to specify a file containing a comma-delimited matrix, where each column represents an integer-valued
variable (substitution map at a given codon site, or the presence/
absence of an agent of selection, for example) and each row represents a terminal branch in the phylogeny. For each column, the
integer values must start at 0 and progress in increments of 1; in
other words, a variable cannot skip 1 and go directly to 2. In the
example matrix agents.csv, columns with HLA serotypes in the
header contain a 0 to indicate that the serotype is absent and a 1
to indicate that it is present in the corresponding host. Second,
the number of steps specified for a burn-in period is appended to
the length of the chain, rather than indicating the number of
steps in the chain to be discarded. For example, setting the
chain length to 100,000 steps and the burn-in to 10,000 steps
now results in a total of 110,000 steps, of which the first 10,000
are discarded before thinning. We recommend setting the
Fig. 6. A Bayesian network inferred from the joint distributions of codon site-specific substitutions mapped to terminal
branches of the HIV-1 p24 phylogeny (open nodes), and HLA serotypes presented by the corresponding host environments
(filled nodes). A marginal edge posterior cutoff of 0.9 was used to generate this consensus network. Edges between HLA
serotypes and HIV p24 codon sites are highlighted in bold.
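The two input rules described above for BgmAnalysis.bf (integer levels in every column must run 0, 1, 2, ... without gaps, and the burn-in is added to the chain length) are easy to check before launching an analysis. The following is a minimal sketch under stated assumptions: the matrix has a header row and every column is a variable; the function names and the tiny example matrices are hypothetical and are not the contents of agents.csv.

```python
import csv
import io


def column_levels(csv_text):
    """Collect the set of integer values observed in each named column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    levels = {name: set() for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            levels[name].add(int(cell))
    return levels


def invalid_columns(csv_text):
    """Names of columns whose levels are not exactly 0, 1, ..., k-1."""
    bad = []
    for name, values in column_levels(csv_text).items():
        if values != set(range(len(values))):
            bad.append(name)
    return bad


def total_chain_steps(chain_length, burn_in):
    """Burn-in is appended to the chain length rather than taken from it."""
    return chain_length + burn_in


# Made-up matrices: X-prefixed codon columns and one HLA serotype column.
good = "X41,X98,A24\n0,0,1\n1,0,0\n2,1,0\n"
bad = "X41,A24\n0,0\n2,1\n"  # X41 skips level 1

print(invalid_columns(good))
print(invalid_columns(bad))
print(total_chain_steps(100000, 10000))
```

A column reported by `invalid_columns` would need its levels recoded to consecutive integers before being passed to the batch file.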
9. Exercises
9.1. Selection in the Presence of Recombination
Launch HyPhy, select Selection/Recombination from the standard analysis menu, and then choose QuickSelectionDetectionMF.bf.
Use Universal genetic code, New Analysis, Custom nucleotide
model, 012345 to specify the general time reversible model, 1 dataset to be analyzed, either CFVg.fas or CFVg-gard.nex for
the input alignment, Estimate dN/dS only, the FEL method, 0.1
for the significance level for LRTs, and All for branch option. Save
results (in comma-separated value format) to a file (taking care to keep
partitioned and unpartitioned results in separate files).
As HyPhy performs the analysis, a typical output line may look
like this:
Site 195 dN/dS inf dN 4.9848 dS 0.0000 dS(dN)
2.3353 Full Log(L) -14.5463 LRT 3.9208 p-value
0.04769 *P
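The p-value in the example output line can be recovered from the reported LRT statistic. The sketch below assumes that the statistic is compared with a chi-square distribution with 1 degree of freedom (a single constraint, dN = dS, at the site); that assumption is ours, although it reproduces the printed value. With 1 degree of freedom the survival function reduces to the complementary error function, so only the standard library is needed.

```python
import math


def chi2_sf_df1(stat):
    """Survival function of chi-square with 1 df: P(X > stat)."""
    return math.erfc(math.sqrt(stat / 2.0))


lrt = 3.9208  # LRT statistic from the example output line for site 195
p_value = chi2_sf_df1(lrt)
significance_level = 0.1  # level chosen in the exercise

print(round(p_value, 5), p_value < significance_level)
```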
-2,032.825
Bias term
32.576
Proportion
0.018
53.828%
p-value
0.000
>
2.78192e + 07)
Preferred residues: N
Substitution counts:
K->N: 8/N->K: 0
Site 184 (max BF
13,813.7)
Preferred residues: V
Substitution counts:
M->V: 10/V->M: 0
Site 210 (max BF
1.96712e + 07)
Preferred residues: W
Substitution counts:
F->L: 0/L->F: 1
L->W: 5/W->L: 0
135.304)
Preferred residues: W
Substitution counts:
K->Q: 1/Q->K: 0
K->W: 1/W->K: 0
9.3. Coevolution
Part III
Population Genomics
Chapter 11
Association Mapping and Disease: Evolutionary Perspectives
Søren Besenbacher, Thomas Mailund, and Mikkel H. Schierup
Abstract
In this chapter, we give a short introduction to the genetics of complex disease with special emphasis on
evolutionary models for disease genes and the effect of different models on the genetic architecture, and
finally give a survey of the state-of-the-art of genome-wide association studies.
Key words: Complex diseases, Association mapping, Genome-wide association studies, Common
disease/common variant
1. Introduction
The phenotype of an individual is determined by a combination of
its genotype and its environment. The degree to which the phenotype is determined by genotype rather than environment (the
balance of nature versus nurture) varies from trait to trait, with
some traits essentially independent of genotype and determined by
the environment and others highly influenced by the genotype and
independent of the environment.
A measure quantifying the importance of genotype as compared to the environment is the heritability. It is the fraction of
the total variance in the population (referred to as the phenotypic
variance) explained by variation in the genotype among the individuals in the population (1). An interesting trait, such as a common disease, that exhibits a nontrivial heritability arouses
interest in finding the genetic explanation behind the trait, that is,
identifying the genetic polymorphisms affecting the trait. The first
step toward this is association mapping, searching for polymorphisms statistically associated with the trait. Polymorphisms associated
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_11,
# Springer Science+Business Media, LLC 2012
with the disease need not influence the trait directly, but it is among
those that we will find the polymorphisms that do.
The variants at the various polymorphisms in the genome are
correlated (they are in linkage disequilibrium, LD), so we need
not examine all polymorphisms. By analyzing a few hundred
thousand to a million polymorphisms, we can capture
most of the common variation in the entire genome (2-4). In
finding such polymorphisms associated with disease risk, we locate
a region of the genome that contains one or more polymorphisms
that affect disease risk, and by examining such a region in more
detail we may locate these.
In the following, we first discuss possible genetic architectures
of complex diseases, mainly based on theoretical considerations
since little is known about this, and then describe the state of the
art in genome-wide association studies (GWASs).
2. The Allelic Architecture of Genetic Determinants for Disease
2.1. Theoretical Models for the Allelic Architecture of Common Diseases
Fig. 1. Mutation, drift, and selection. New mutations enter a population at stochastic
intervals, determined by the mutation rate, u, and the effective population size, N. For low
or high frequencies, where the range of such frequencies is determined by the selection
factor, s, and the effective population size, the frequency of a mutant allele changes
stochastically. At medium frequencies, on the other hand, the frequency of the allele
changes up or down, depending on s, in a practically deterministic fashion. If a positively
selected allele reaches moderate frequency, it will quickly be brought to high frequency,
at a speed also determined by s and N.
Fig. 2. Accumulation of several rare frequencies. If selection works against a set of alleles,
each will be kept at a low frequency. Their accumulated frequency, however, can be high
in the population.
Fig. 4. A population out of equilibrium following changes in the selective landscape. If the
selection of an allele changes direction, so the positively selected allele becomes
negatively selected and vice versa, it will eventually move through moderate frequencies.
Following a change in the selective landscape, it is thus possible to find alleles at
moderate frequencies that would not otherwise be found.
the genome. There are already clear indications that the number of
rare variants will be larger than a simple extrapolation of the common SNPs due to the complex demographic history of humans
(12-15). Further, recent sequencing of 200 exomes in Europeans
reported an enrichment of nonsynonymous variants over synonymous variants among rare polymorphisms (14), strongly suggesting that many nonsynonymous variants are kept in low frequency
by natural selection. The proportion of these variants that are
involved in complex diseases and perhaps selected against due to
this effect is currently unknown.
The European population, where most GWASs so far have been
carried out, reveals a site frequency distribution of synonymous
variants that is generally shifted toward more common alleles as compared to the African population. This is most likely due to a severe
bottleneck connected to the out-of-Africa expansion, but also to
the expected excess of rare variants in a demographically stable
population of the same effective size under selective neutrality.
Excess of low-frequency variants is a hallmark of recent population
growth and/or weak selection against rare alleles. The latter is
visible in the contrast between the frequency distribution for synonymous and nonsynonymous alleles as explained above.
Table 1
Contingency table for allele counts in case/control data
            Allele A       Allele B       Total
Case        N_case,A       N_case,B       N_cases
Control     N_control,A    N_control,B    N_controls
Total       N_A            N_B            N

Table 2
Expected allele counts in case/control data
            Allele A                Allele B                Total
Case        (N_cases × N_A)/N       (N_cases × N_B)/N       N_cases
Control     (N_controls × N_A)/N    (N_controls × N_B)/N    N_controls
Total       N_A                     N_B                     N
$$\mathrm{RR} = \frac{N_{\mathrm{case},A}/N_A}{N_{\mathrm{case},B}/N_B},$$
but it is very rare to have an unbiased population sample in association studies because the studies are generally designed to deliberately oversample the cases to increase the power. This oversampling
affects the RR as calculated by the formula above but not the OR,
which is one of the reasons why the OR is usually reported in
association studies instead of the RR.
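The expected counts of Table 2, the resulting chi-square statistic, and the behavior of the RR and OR under case oversampling can be sketched in a few lines. The function names and the allele counts below are illustrative assumptions (the counts are made up to show the effect); the RR follows the ratio of N_case,A/N_A to N_case,B/N_B, and the OR is the usual cross-product ratio.

```python
import math


def expected_counts(table):
    """Expected 2x2 counts under independence: row total * column total / N."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[r * col / n for col in cols] for r in rows]


def chi_square(table):
    """Pearson chi-square statistic for a 2x2 allele count table."""
    exp = expected_counts(table)
    return sum(
        (table[i][j] - exp[i][j]) ** 2 / exp[i][j]
        for i in range(2)
        for j in range(2)
    )


def odds_ratio(table):
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def relative_risk(table):
    (a, b), (c, d) = table
    return (a / (a + c)) / (b / (b + d))


# Rows: case, control. Columns: allele A, allele B. Counts are made up.
population = [[30, 20], [970, 1980]]
oversampled = [[300, 200], [970, 1980]]  # ten times as many cases

print(odds_ratio(population), odds_ratio(oversampled))
print(relative_risk(population), relative_risk(oversampled))
print(chi_square(oversampled))
```

In this toy example the OR is identical in both tables, whereas the RR drops from 3.0 to roughly 2.6 once the cases are oversampled.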
3.3. Quality Control
Fig. 5. QQ plots from a χ²-distribution. (a) A QQ plot, where the observation follows the expected distribution. (b) A QQ
plot, where the majority of observations follow the expected distribution, but where some have unexpectedly high values,
i.e., are statistically significant. (c) A QQ plot, where the observations all seem to be higher than expected, which is an
indication that the observations are not following the expected distribution.
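The coordinates of such a QQ plot can be computed without a plotting or statistics package. The sketch below assumes test statistics that follow a χ²-distribution with 1 degree of freedom under the null hypothesis and uses the plotting positions (i + 0.5)/n; both choices, and the function names, are our assumptions. It relies on the fact that a χ² variable with 1 degree of freedom is the square of a standard normal variable.

```python
from statistics import NormalDist


def chi2_df1_quantile(p):
    """Quantile of chi-square with 1 df, via the standard normal quantile."""
    z = NormalDist().inv_cdf((1.0 + p) / 2.0)
    return z * z


def qq_points(observed):
    """Pairs (expected quantile, observed statistic), both in sorted order."""
    n = len(observed)
    expected = [chi2_df1_quantile((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, sorted(observed)))


# Made-up test statistics for illustration only.
stats = [0.02, 0.45, 1.3, 0.1, 3.9, 0.7, 2.1, 0.3]
for expected, obs in qq_points(stats):
    print(round(expected, 3), obs)
```

Points that lie close to the diagonal correspond to panel (a); a few points far above the diagonal at the upper end correspond to panel (b); and a systematic departure along the whole range corresponds to panel (c).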
The best way to make sure that a finding is real is to replicate it. If
you find the same signal in another set of cases and controls, it
means that the association was not caused by a confounding factor
specific to your data set. Likewise, if you still see the association
after typing the markers using another genotyping method, it
means that it is not a false positive due to some artifact of the
genotyping method used.
When trying to replicate a finding, the best strategy is to try to
replicate it in a population of similar ancestry. A marker that is
tagging a true causal variant in one population might not be tagging the same variant in a population of different ethnicity, where
the LD structure can be different. This is especially a problem when
trying to replicate an association found in a non-African population
in an African population (26). A marker might easily have 20
completely correlated markers in a European population, but no
good correlates in an African population. This means that if you see
a significant association with an SNP that has 20 equivalent SNPs in
the European population, it is not enough to try to replicate only
that SNP in an African population; you have to test all 20. This,
however, also offers a way to fine map the signal and possibly find
the causative variant (27).
Before spending time and effort to replicate an association
signal in a foreign cohort, it is a good idea to search for the existing
partial replication of the marker within the data. Usually, a marker is
surrounded by several correlated markers on the genotyping chip,
and if one marker shows a significant association then the correlated
markers should show an association too. If a marker is significantly
associated with a disease but no other marker in the region is, then
it should be viewed as suspicious. Decisions in cases like this may be
further validated by investigating markers that according to HapMap are correlated to the marker in question.
4. Imputation: Squeezing More Information Out of Your Data
The current generation of SNP chips includes only 0.3 to 2 million
of the 9 to 10 million common SNPs in the human genome (that is, SNPs with
an MAF of more than 5%). Because of the correlation between SNPs
in LD, however, the SNP chips can still claim to assay most of the
common variants in the genome (in European populations anyway).
Although the Illumina HumanHap300 chip only directly tests about
3% of the 10 million common SNPs, it still covers 77% of the SNPs
in HapMap with a squared correlation coefficient (r^2) of at least 0.8
in a population of European ancestry (11). The corresponding
fraction in a population of African ancestry is only 33%, however.
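Whether a chip SNP "covers" an untyped SNP at this threshold depends on the squared correlation between the two markers. The following is a minimal sketch of that calculation from phased haplotypes, assuming biallelic SNPs coded 0/1 that are both polymorphic in the sample; the function names and the toy haplotypes are hypothetical.

```python
def r_squared(haplotypes):
    """Squared correlation between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2), alleles coded 0/1.
    Assumes both SNPs are polymorphic, otherwise the denominator is zero.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def is_tagged(haplotypes, threshold=0.8):
    """True if the typed SNP captures the other SNP at the given r^2."""
    return r_squared(haplotypes) >= threshold


perfect_ld = [(0, 0), (1, 1), (0, 0), (1, 1)]
no_ld = [(0, 0), (0, 1), (1, 0), (1, 1)]

print(r_squared(perfect_ld), is_tagged(perfect_ld))
print(r_squared(no_ld), is_tagged(no_ld))
```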
These numbers expose two limitations of the basic GWAS
strategy. First, there is a substantial fraction of the common SNPs
that are not well covered by the SNP chips even in European
populations (23% in the case of the HumanHap300 chip). Secondly, we rely on tagging to test a large fraction of the common
SNPs and this diluted signal from correlated SNPs inevitably causes
us to overlook true associations in many instances. An efficient way
of alleviating these limitations is genotype imputation, where genotypes that are not directly assayed are predicted using information
from a reference data set that contains data from a large number of
4.2. Imputation Software
5. Current Status
Association mapping has for the last 5 years had a strong focus on
GWAS using SNP chips with 500,000 to 1,000,000 SNPs, based on
the HapMap identification of common human variation in Europe,
Asia, and Africa. Hundreds of SNPs have been found to be associated with common diseases in a discovery cohort of affected
individuals with matched controls and at least one further population for replication of the initial finding (37). These studies have
typically found increased risks of 5-20% for each variant. While
initially considered a great proof of concept for the CDCV model,
there is a growing awareness that these are unlikely to explain most
of the genetic effect unless there is a very large number of common
alleles with very small ORs that have escaped detection using the
current cohort sizes (typically, 2,000-20,000 individuals). Hence,
there is a renewed focus on the site frequency spectrum of disease
alleles as discussed in Subheading 2. A clear pattern observed in
findings so far is that most variants identified are common and that
the inferred ORs increase with rarity of the variant. This cannot be
taken as evidence that rare variants have higher ORs, though, since,
as demonstrated by Iles (38), we can only detect the rarer variants if
they have higher ORs. However, if analysis is restricted to nonsynonymous disease SNPs, then rare variants do seem to have a
generally larger OR (39). An analysis of the site frequency spectrum
as a function of functional classification of SNPs using PolyPhen also
found that rare variants should be more damaging (which through
selection would explain their rarity) (13). Thus, it cannot be ruled
out that the bulk of heritability not explained by GWASs so far could
be explained by many rare variants, each with ORs larger than
common variants identified (perhaps, ORs in the 2-10 range,
which are still sufficiently low to be easily missed by linkage studies).
6. Perspectives
Future data will provide identification of most, if not all, SNPs and
CNVs in a large set of individuals. To further our quest toward
understanding the genetic etiology of common diseases, methods
are needed that expand information on the role of rare variants and
variants of small effect. This requires statistically powerful ways of
handling information from rare SNPs, rare LD blocks, and amassing distributed local effects. Several promising methods have
recently emerged as ways to add signals together locally (39-43).
It will be of great interest to use these approaches even in cases
where association with common variants has been shown, since it is
possible that some of these associations are due to synthetic association, i.e., that several rare variants accidentally are associated with
the same more common variants (44). Searching for variants adding up to a risk in a certain gene does not identify any specific causal
variants, but it points to causal genes or regions that can then be
further scrutinized either statistically in replication cohorts, by
bioinformatics pathway and functional prediction, or experimentally. With full sequencing, we know that the causal variants have
been included, but many other variants will be associated due to
LD. LD will, thus, be more of a burden than of an asset in future
studies and populations with least LD should be most easily amenable for association mapping. However, other approaches to distinguish real from associated variants that are based on biological
information on their putative function will be very useful. A wealth
of annotation to each position in the genome will soon be available,
including the epigenetic context (e.g., nucleosome positioning and
modifications, transcription factor and enhancer binding, DNAS
structure) and the structure of the protein, including its position in
biological pathways and interaction with other proteins. Thus, each
putative variant can be assigned a posterior probability of true
association, which can be used as hypothesis generator as well as a
prior probability in replication studies.
7. Questions
1. How can you distinguish causal variants from other variants
when all variants have been typed? Is there any statistical way of
distinguishing between correlation and causality just from
genotype data? Could you use functional annotations?
2. Consider a GWAS data set, where in the top ten ranked statistics you have five markers that are close together and the
remaining five scattered across the genome. Would you
22. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genet 2: e190.
23. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55: 997–1004.
24. Sebastiani P, Solovieff N, Puca A, Hartley SW, Melista E, et al. (2010) Genetic Signatures of Exceptional Longevity in Humans. Science.
25. Alberts B (2010) Editorial expression of concern. Science 330(6006): 912. DOI: 10.1126/science.330.6006.912-b.
26. Teo YY, Small KS, Kwiatkowski DP (2010) Methodological challenges of genome-wide association analysis in Africa. Nat Rev Genet 11: 149–160.
27. Zaitlen N, Pasaniuc B, Gur T, Ziv E, Halperin E (2010) Leveraging genetic variability across populations for the identification of causal variants. Am J Hum Genet 86: 23–33.
28. Marchini J, Howie B (2010) Genotype imputation for genome-wide association studies. Nat Rev Genet 11: 499–511.
29. Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, et al. (2010) Diversity of human copy number variation and multicopy genes. Science 330: 641–646.
30. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39: 906–913.
31. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34: 816–834.
32. Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78: 629–644.
33. Servin B, Stephens M (2007) Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 3: e114.
Chapter 12
Ancestral Population Genomics
Julien Y. Dutheil and Asger Hobolth
Abstract
The full genomes of several closely related species are now available, opening an emerging field of
investigation borrowing both from population genetics and phylogenetics. Providing we can properly
model sequence evolution within populations undergoing speciation events, this resource enables us to
estimate key population genetics parameters, such as ancestral population sizes and split times. Furthermore, we can enhance our understanding of the recombination process and investigate various selective
forces. We discuss the basic speciation models for closely related species, including the isolation and
isolation-with-migration models. A major point in our discussion is that even a few complete genomes
contain much information about the whole population. The reason is that recombination unlinks
genomic regions, and therefore a few genomes contain many segments with distinct histories. The
challenge of population genomics is to decode this mosaic of histories in order to infer scenarios of
demography and selection. We survey different approaches for understanding ancestral species from
analyses of genomic data from closely related species. In particular, we emphasize core assumptions and
working hypotheses. Finally, we discuss computational and statistical challenges that arise in the analysis of
population genomics data sets.
Key words: Coalescence, Demography, Selection, Divergence, Speciation, Markov model, Ancestral
population
1. Introduction
We are on the edge of the population genomics era, but the majority
of population genomics data sets, such as the 1000 Genomes
Project (1) and the 1001 Arabidopsis genomes project (2), are still in
the production stage. The data currently available consist of alignments of fully sequenced and closely related genomes. In some
cases, the genomes are consensus genomes obtained by pooling
sequences from several individuals. Under these conditions, the
recent history of species is not available to the investigator (although
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_12,
© Springer Science+Business Media, LLC 2012
[Figure 1 labels: Ancestor; Speciation; Recombination event; Species 1; Species 2]
Fig. 1. Left: Isolation model of two species. Right: The coalescent process along the genomes of the two species. By
comparing the two genomes, we obtain information about the split time of the species and the ancestral population size.
Furthermore, the break points along the genomes correspond to recombination events, so we also have information about
the recombination process.
2. Coalescent Theory and Speciation
Table 1
Methods comparison. This table summarizes and compares existing ancestral population genomics methods. Parameters correspond to those in Fig. 4
[Table body garbled in extraction; the assignment of entries to rows is not recoverable. Column headings: Principle; ARG Approx.; Spec.; Parameters estimated; Rate variation/sequencing errors; Data set; Reference. Surviving entries: Principle: Bayesian inference. ARG Approx. and Spec.: independent loci (I or IM); Markov process. Parameters estimated: T, NA; T1, NA; T1, T2, NA. Rate variation/sequencing errors: RAS; RAS + branch-specific departure from molecular clock; independent estimate of rate; correction with outgroup. Data set: Primates; Drosophila; Primates: 1 Mb alignment; Orangutans: two full genomes. Reference: (9), (10), (11), (12), (14), (17), (20), (25).]
RAS: Rate Across Site model, assuming an a priori distribution of evolutionary rate (usually a discretized gamma distribution) over alignment positions. I: Isolation model. IM: Isolation with migration model.
[Figure 2 labels: waiting times T5, T4, T3, T2 and total time W5]
Fig. 2. Illustration of the coalescent process. The waiting time before two out of n
individuals coalesce is Tn and the time before a sample of n individuals find common
ancestry is Wn.
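\[
\mathrm{Var}[W_n] = \sum_{k=2}^{n} \mathrm{Var}[T_k] = \sum_{k=2}^{n} \binom{k}{2}^{-2} = 8 \sum_{k=1}^{n-1} \frac{1}{k^2} - 4\left(1 - \frac{1}{n}\right)\left(3 + \frac{1}{n}\right).
\]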
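The closed form for the variance can be checked with exact rational arithmetic. The sketch below assumes the standard coalescent time scale in which T_k is exponential with rate k(k-1)/2, so that Var[T_k] is the inverse square of the binomial coefficient; the function names are illustrative and do not come from the chapter.

```python
from fractions import Fraction

def var_tmrca_sum(n):
    """Var[W_n] as the sum of Var[T_k] = (2 / (k (k - 1)))**2 for k = 2..n."""
    return sum(Fraction(2, k * (k - 1)) ** 2 for k in range(2, n + 1))

def var_tmrca_closed(n):
    """Closed form: 8 * sum_{k=1}^{n-1} 1/k^2 - 4 * (1 - 1/n) * (3 + 1/n)."""
    s = sum(Fraction(1, k * k) for k in range(1, n))
    return 8 * s - 4 * (1 - Fraction(1, n)) * (3 + Fraction(1, n))

# The two expressions agree exactly for every sample size tried here.
for n in (2, 3, 5, 10, 50):
    assert var_tmrca_sum(n) == var_tmrca_closed(n)
    print(n, float(var_tmrca_sum(n)))
```

For n = 2 the variance is 1, and it grows only slowly with n, which reflects the fact that most of the variability in the time to common ancestry comes from the last coalescence events.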
2.3. Taking Recombination into Account
Fig. 3. Ancestral recombination graph for two species, (a) genealogy of four sampled sequences from two species.
The bold line shows the divergence of two sequences of interest, (b) a single recombination event happened between
the lineages of sequences 3 and 4 (horizontal line) so that in a part of the sequences the genealogy is as depicted by the
bold line and therefore displays an older divergence, (c) the corresponding ancestral recombination graph. Dotted lines
show the portions of lineages which are not present in the sample composed of sequences 1 and 3. When going backward
in time, a split corresponds to a recombination event and a merger is a coalescence event.
from the distribution of the age of the MRCA, and the distribution
contains information about the ancestral population size and
speciation time.
3. Models of Speciation
In this section, we extend the standard coalescent model. We
consider coalescent models with multiple species and introduce
population splits or speciation events. The models that we describe
are shown in Fig. 4 (see also Table 1) and include (a) the two-species isolation model; (b) the two-species isolation-with-migration
model; (c) the three-species isolation model (and incomplete
lineage sorting); and (d) the three-species isolation-with-migration
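[Figure 4 panels (a) to (d). Parameter labels: T, T1, T2 (speciation times); NA, NA1, NA2 (ancestral population sizes); N1, N2, N3 (contemporary population sizes); m12, m21, m13, m31, m23, m32, mA13, m3A1 (migration rates)]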
Fig. 4. Speciation models and associated parameters. In all exemplified models, effective population size is constant
between speciation events, represented by dashed lines. The timings of the speciation events, denoted T, are parameters of the
models, together with ancestral effective population sizes, denoted NA. In some cases, contemporary population sizes can
also be estimated, and are denoted Ni, where i is the index of the population. Models with postdivergence genetic exchanges
have additional migration parameters labeled m(from→to). The number of putative migration rates increases with the number
of contemporary populations under study, and some models might consider some of them to be equal or possibly null to
reduce complexity.
model. We also discuss the general multiple-species isolation-with-migration model. The two-species isolation model is introduced
in ref. 7 and the isolation-with-migration model is introduced in
ref. 8.
3.1. Isolation Model with Two Species
If the sequences are sampled from two distinct species that
diverged a time T ago (see Fig. 4a), then the distribution of the age
of the MRCA is shifted to the right by the amount T, resulting in
the distribution
\[
f_{T_2}(t) =
\begin{cases}
0 & \text{if } t < T, \\
\dfrac{2}{\theta_A}\, e^{-\frac{2(t-T)}{\theta_A}} & \text{if } t > T.
\end{cases}
\]
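[...]
\[
P_{12}\, \frac{2}{\theta_{A2}}\, e^{-\frac{2(t-T_{12})}{\theta_{A2}}} \quad \text{if } t > T_{12},
\]
where
\[
T_{12} = T_1 + T_2
\]
and
\[
P_{12} = e^{-\frac{2(T_{12}-T_1)}{\theta_{A1}}}.
\]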
[Figure 5 panels: two species (left) and three species (right); isolation model and isolation+migration model. Axes: Density against Time, from 0.000 to 0.025]
Fig. 5. Illustration of the density for coalescent in various models and data layout. The curves are the probability density
functions. In the simplest case with two species, a constant ancestral population size, and a punctual speciation (top left
panel), most genomic regions find a common ancestor close to the species split (the vertical line) while a few regions have a
more ancient common ancestor, distributed in an exponential manner (see Eq. 1). If speciation is not punctual and migration
occurred after isolation of the species, then some sequences have a common ancestor which is more recent than the
species split and the distribution in the ancestor becomes more complex (bottom left panel, see Eqs. 4 and 6). When a third
species is added (right panel), then another discontinuity appears and all distributions depend on additional parameters,
particularly when migration is allowed. We use θA1 = 0.0062, θA2 = 0.0033, and t1 = 0.0038 (the first vertical line),
t2 = 0.0062 (the second vertical line), corresponding to the HCG triplet. Ancestral population sizes are taken from the
simulation study in Table 6 in ref. 14: θ1 = 0.005 and θ2 = 0.003. Migration parameters are all set to 50.
\[
\Pr(\text{incongruence}) = \frac{2}{3} P_{12} = \frac{2}{3}\, e^{-\frac{2(T_{12}-T_1)}{\theta_{A1}}}. \qquad (2)
\]
The event that the gene tree is different from the species tree is
called incomplete lineage sorting (ILS). ILS is important because
species tree incongruence often manifests itself as a relatively clear
signal in a sequence alignment and thereby allows for accurate
estimation of population parameters. In Fig. 6, we show the (in)
[Figure 6: probability of congruence and of incongruence between gene tree and species tree, with the value for ((human,chimpanzee),gorilla) marked. Vertical axis: Probability, 0.0 to 0.8; horizontal axis: 0.0 to 3.0]
Fig. 6. Probability (Eq. 2) of gene tree and species tree being incongruent. In case of the HCG triplet, we obtain
(T12 − T1)/θA1 = (0.0062 − 0.0038)/0.0062 = 0.39, which corresponds to an incongruence probability of 30%.
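As a worked check of the numbers quoted in the caption, the incongruence probability of Eq. 2 can be evaluated directly. This is a minimal sketch; the function and argument names are illustrative, and the parameter values are the HCG values given in the captions of Figs. 5 and 6.

```python
import math

def incongruence_probability(t1, t12, theta_a1):
    """Eq. 2: (2/3) * exp(-2 * (T12 - T1) / theta_A1)."""
    return (2.0 / 3.0) * math.exp(-2.0 * (t12 - t1) / theta_a1)

# HCG values quoted in the figure captions.
scaled_branch = (0.0062 - 0.0038) / 0.0062
p_ils = incongruence_probability(t1=0.0038, t12=0.0062, theta_a1=0.0062)
print(f"scaled internal branch length: {scaled_branch:.2f}")
print(f"incongruence probability: {p_ils:.3f}")
```

The factor 2/3 arises because, once all three lineages reach the common ancestor of the three species, two of the three possible first coalescences give a gene tree that differs from the species tree.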
Wang and Hey (14) consider a situation with two genes. Before
time T, the system is in one of the following five states.
S11: Both genes are in population 1.
S22: Both genes are in population 2.
S12: One gene is in population 1 and the other is in population 2.
S1: The genes have coalesced and the single gene is in population 1.
S2: The genes have coalesced and the single gene is in population 2.
The instantaneous rate matrix Q is given by
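\[
Q =
\begin{array}{c|ccccc}
 & S_{11} & S_{12} & S_{22} & S_{1} & S_{2} \\ \hline
S_{11} & \cdot & 2m_2 & 0 & 2/\theta_1 & 0 \\
S_{12} & m_1 & \cdot & m_2 & 0 & 0 \\
S_{22} & 0 & 2m_1 & \cdot & 0 & 2/\theta_2 \\
S_{1} & 0 & 0 & 0 & \cdot & m_2 \\
S_{2} & 0 & 0 & 0 & m_1 & \cdot
\end{array}
\qquad (5)
\]
where each diagonal entry (shown as a dot) is minus the sum of the other entries in its row.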
(6)
Here, $e^{A} = \sum_{i=0}^{\infty} A^{i}/i!$ is the matrix exponential of the matrix $A$ and
$(e^{A})_{ij}$ is entry $(i,j)$ in the matrix exponential.
After time T, the system only has two states: S_AA corresponding
to two genes in the ancestral population and S_A corresponding to
one single gene in the ancestral population. The rate of going from
state S_AA to state S_A is 2/θA. The density for coalescent in the
ancestral population at time t > T is, therefore,
\[
f(t) = \left[ \left(e^{QT}\right)_{a S_{11}} + \left(e^{QT}\right)_{a S_{12}} + \left(e^{QT}\right)_{a S_{22}} \right] \frac{2}{\theta_A}\, e^{-\frac{2(t-T)}{\theta_A}}. \qquad (7)
\]
In Fig. 5, we illustrate the coalescent density in the two-species
isolation with migration model.
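The ancestral part of the density (Eq. 7) can be sketched numerically. The code below is an illustration under stated assumptions, not the implementation used in refs. 14 or 15: the off-diagonal rates follow the five-state description above, the diagonal is filled so that each row sums to zero, and the matrix exponential is computed from its power-series definition with scaling and squaring in plain Python (a library routine would normally be preferred). All function names are hypothetical.

```python
import math

STATES = ["S11", "S12", "S22", "S1", "S2"]

def rate_matrix(m1, m2, theta1, theta2):
    """Five-state rate matrix; the diagonal makes every row sum to zero."""
    c1, c2 = 2.0 / theta1, 2.0 / theta2
    q = [
        [0.0, 2 * m2, 0.0, c1, 0.0],   # from S11
        [m1, 0.0, m2, 0.0, 0.0],       # from S12
        [0.0, 2 * m1, 0.0, 0.0, c2],   # from S22
        [0.0, 0.0, 0.0, 0.0, m2],      # from S1
        [0.0, 0.0, 0.0, m1, 0.0],      # from S2
    ]
    for i, row in enumerate(q):
        row[i] = -sum(row)
    return q

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(a, terms=30):
    """Matrix exponential: power series on a scaled matrix, then repeated squaring."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = max(0, math.ceil(math.log2(norm))) if norm > 0 else 0
    scaled = [[x / 2 ** squarings for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for i in range(1, terms):
        term = [[x / i for x in row] for row in matmul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def ancestral_density(t, start, split, q, theta_a):
    """Eq. 7: density of coalescence at time t > split when starting in state `start`."""
    p = expm([[x * split for x in row] for row in q])
    a = STATES.index(start)
    not_coalesced = p[a][0] + p[a][1] + p[a][2]
    return not_coalesced * (2.0 / theta_a) * math.exp(-2.0 * (t - split) / theta_a)

# Illustrative values taken from the caption of Fig. 5.
q = rate_matrix(m1=50.0, m2=50.0, theta1=0.005, theta2=0.003)
print(ancestral_density(0.005, "S12", split=0.0038, q=q, theta_a=0.0062))
```

With both migration rates set to zero and one gene sampled from each population (state S12), no coalescence is possible before the split, and the expression reduces to the shifted exponential density of the two-species isolation model.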
The likelihood for a pair of homologous sequences X is given by
\[
P(X \mid \Theta) = L(\Theta \mid X) = \int_{0}^{\infty} P(X \mid t)\, f(t \mid \Theta)\, dt, \qquad (8)
\]
where f(t) = f(t|Θ), given by Eqs. 6 and 7, is the density of the two
sequences finding an MRCA at time t and P(X|t) is the probability
of the two sequences given that they find an MRCA at time t.
The latter term is calculated using a distance-based method.
One possibility is to use the infinite sites model, where it is assumed
that substitutions happen at unique sites, i.e., there are no recurrent
substitutions. In this case, the number of differences between the
two sequences follows a Poisson distribution with rate 1.
For an application of the isolation-with-migration model with
two sequences, we refer to ref. 14; a discussion of their approach
can be found in ref. 15.
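The likelihood integral of Eq. 8 can be approximated by numerical quadrature. The sketch below uses the simpler two-species isolation density rather than the migration model, and it assumes, purely for illustration, that the number of differences k between two sequences of n_sites positions is Poisson with mean 2·t·n_sites; this mean is an assumption of the example and is not taken from the text. All names are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability, computed on the log scale to avoid overflow."""
    if mean == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def isolation_density(t, split, theta_a):
    """Shifted exponential density of the MRCA age in the two-species isolation model."""
    if t < split:
        return 0.0
    return (2.0 / theta_a) * math.exp(-2.0 * (t - split) / theta_a)

def pair_likelihood(k, n_sites, split, theta_a, steps=20000):
    """Eq. 8 by Simpson's rule, with the ASSUMED P(X | t) = Poisson(k; 2 * t * n_sites)."""
    lo = split
    hi = split + 20.0 * theta_a          # density beyond this point is below exp(-40)
    h = (hi - lo) / steps                # steps must be even for Simpson's rule

    def integrand(t):
        return poisson_pmf(k, 2.0 * t * n_sites) * isolation_density(t, split, theta_a)

    total = integrand(lo) + integrand(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lo + i * h)
    return total * h / 3.0

# Hypothetical locus: 10 differences over 1,000 sites.
print(pair_likelihood(k=10, n_sites=1000, split=0.0038, theta_a=0.0062))
```

Under these assumptions the integral also has a closed form (a Poisson count for the time since the split plus a geometric count for the ancestral waiting time), which gives a convenient check on the quadrature.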
3.4. Isolation with Migration Model with Three or More Species and Three or More Samples
(k − 1) speciation times
4. Approximating the Ancestral Recombination Graph
In this section, we discuss three methods of taking recombination into account. The three methods are visualized in Fig. 7c–e
and correspond to (1) independent loci, (2) site patterns, and (3)
hidden Markov model (HMM).
The simplest way to handle issues relating to the ancestral recombination graph is to divide the data into presumably independent
loci. Such analyses are, therefore, restricted to candidate regions
that are not too large (to avoid including a recombination point)
and not too close (to ensure that several recombination events
Fig. 7. The coalescent process along genomes. (a) Four archetypes of coalescence scenarios with three species,
exemplified with human, chimpanzee, and gorilla. In the first scenario, human and chimpanzee coalesce within the
human–chimpanzee common ancestor. In the three other scenarios, all sequences coalesce within the common ancestor
of all species, each with probability 1/3, depending on which two sequences coalesce first. (b) Example of genealogical changes
along a piece of an alignment. The alignment was simulated using the true coalescent process and parameters
corresponding to the human–chimpanzee–orangutan history. The blue line depicts the variation along the genome of
the human–chimpanzee divergence. The background colors depict the change in topology, red and yellow corresponding
to incomplete lineage sorting. Every change in color or break of the blue line is the result of a recombination event. (c–e)
Three possible ways of approximating the ancestral recombination graph. In (c), a number of small loci are analyzed
independently under an assumption of no recombination within loci, which allows estimation of the probability distribution of
sequence divergence. In (d), the alignment is summarized in terms of counts of site patterns, and in (e) the data are analyzed
in terms of a hidden Markov model along the sequence, with distinct genealogies featuring various divergence times
as hidden states. The underlying model includes transition probabilities between genealogies along the genome.
See Subheading 4 for more details.
The work by Hobolth et al. (9) used site patterns in a different way.
With a hidden Markov model, they used the correlation of patterns
along the genome to reconstruct the site-specific genealogy, including divergence times. They further used these divergence estimates
together with the inferred amount of incomplete lineage sorting
to compute the speciation times and ancestral population sizes.
In this approach, the recombination rate is embedded into the
transition matrix of the hidden Markov chain, which specifies the
probabilities of transition from one genealogy to the other along
the genome. Hobolth et al. showed that this matrix is constrained
by symmetric relationships, and estimated the remaining three
parameters together with the divergence parameters. Dutheil
et al. (10) extended this approach by identifying further constraints
on the parameters and fully expressing the divergence times and
probabilities of transition between genealogies as functions of the
speciation times, ancestral population sizes, and recombination
rate, therefore allowing their direct estimation. The analytical
expressions of the parameters as functions of population quantities
are, however, difficult to obtain, notably for the transition probabilities, even in the simplest case.
Mailund et al. (20) used a different approach to compute these
for the two-species isolation model. They used a continuous-time Markov chain to model the evolution of a pair of contiguous positions.
This model features two types of events: when going backward in
time, the two positions can either coalesce (with a rate inversely proportional
to the effective population size) or split (with a rate equal to the
recombination rate). The transition probabilities between genealogies are immediately available from the joint pair of contiguous
positions and the Markov assumption. This approach can be
generalized to more species and potentially allows for more
realistic demographic scenarios, for instance allowing migration
between populations.
The coalescent HMM framework, thus, models recombination, which is assumed to be constant in all lineages and along the
alignment. The model further assumes that the probability of
switching from one genealogy to another when we walk along a
genome alignment only depends on the genealogy at the previous
position, that is, the process of genealogy change along the genome
is Markovian. This is an approximation of the true coalescent
process that greatly simplifies calculation (21). Dutheil et al. (10)
and Mailund et al. (20) used simulated data sets under a coalescent
process with recombination to show that this assumption had,
however, little influence on the parameter estimates. Using this
approach, Hobolth et al. estimated a speciation time between
human and chimpanzee around 4.1 My and a large ancestral effective population size of 60,000 for the human–chimpanzee ancestor. Dutheil et al. (10) found similar estimates with the same data
set while accounting for substitution rate variation across sites, and
estimated an average recombination rate of 1.7 cM/Mb.
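The Markov assumption along the alignment is what makes the likelihood computable in time linear in the number of sites, through the forward algorithm. The toy sketch below illustrates only this generic step: the four hidden states stand for genealogies, the observations are site-pattern classes coded 0 to 3, and every numerical value is hypothetical rather than derived from speciation times, population sizes, or recombination rates as in refs. 9, 10, and 20.

```python
import math
from itertools import product

# Hypothetical toy parameters: four hidden genealogies, four site-pattern classes.
INIT = [0.55, 0.15, 0.15, 0.15]
STAY = 0.97
TRANS = [[STAY if r == s else (1.0 - STAY) / 3.0 for s in range(4)] for r in range(4)]
EMIT = [
    [0.90, 0.06, 0.02, 0.02],
    [0.88, 0.08, 0.02, 0.02],
    [0.88, 0.02, 0.08, 0.02],
    [0.88, 0.02, 0.02, 0.08],
]

def forward_loglik(obs, init, trans, emit):
    """Log-likelihood of the observed patterns, using the scaled forward algorithm."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    loglik = 0.0
    for x in obs[1:]:
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][x]
                 for s in range(n)]
    return loglik + math.log(sum(alpha))

def brute_force_loglik(obs, init, trans, emit):
    """Same quantity by summing over every hidden path (short inputs only)."""
    n = len(init)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for i in range(1, len(obs)):
            p *= trans[path[i - 1]][path[i]] * emit[path[i]][obs[i]]
        total += p
    return math.log(total)

obs = [0, 0, 1, 1, 0, 2, 0, 3]
print(forward_loglik(obs, INIT, TRANS, EMIT))
```

The brute-force sum over all hidden paths is included only to show that the recursion computes the same likelihood; its cost grows exponentially with the alignment length, whereas the forward recursion grows linearly.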
5. Specific Issues Faced When Dealing with Genomic Data
Dealing with genomic data heavily relies on computer performance. Depending on the genome sizes and the method used,
the analysis may cover from millions to billions of genomic positions. As most methods rely on maximum likelihood or Bayesian
inference, efficient algorithms and software implementations are
much needed. Fortunately, the data structure here comes in handy:
independent parts of the genomes, like chromosomes, synteny
blocks, or even loci, depending on the methodology used, can be
analyzed separately, therefore enabling easy parallelization on
computer grids. Aside from the computational issue, the genomic
era has also dramatically changed the structure of the result tables.
While analyzing per-gene result sets, consisting of a few tens of
thousands of rows, is still feasible with statistical software like R, it
becomes much more problematic when per-site result sets are
considered. As our understanding of genome evolution grows, we
are keener on fishing for specific regions with a peculiar demographic or selective history. Such data sets typically reach sizes
of several million rows. While they can still be loaded into the
memory of well-equipped computers, a single pass over
the table to retrieve information becomes prohibitive, which
is problematic when several sets are to be compared (for
instance, in order to compare a window-based calculation with
gene annotations). The only alternative currently available is to
use database engines with proper indexing algorithms. Such databases are currently used in genome browsers, like the UCSC
genome browser. In that respect, cross-information storage and
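The comparison of a window-based calculation with gene annotations mentioned above can be sketched with an embedded database engine. The example uses SQLite through Python's standard `sqlite3` module; the table layout, column names, and values are hypothetical and serve only to show how indexed tables replace repeated full passes over a result table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE windows (chrom TEXT, win_start INTEGER, win_end INTEGER, ils REAL);
    CREATE TABLE genes (name TEXT, chrom TEXT, gene_start INTEGER, gene_end INTEGER);
    CREATE INDEX idx_windows ON windows (chrom, win_start);
    CREATE INDEX idx_genes ON genes (chrom, gene_start);
""")

# Hypothetical per-window estimates (1 kb windows) and two gene annotations.
conn.executemany(
    "INSERT INTO windows VALUES (?, ?, ?, ?)",
    [("chr1", i * 1000, (i + 1) * 1000, 0.25 + 0.01 * (i % 10)) for i in range(100)],
)
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?, ?)",
    [("geneA", "chr1", 2500, 5200), ("geneB", "chr1", 40000, 41000)],
)

# Average the window statistic over all windows overlapping each gene.
rows = conn.execute("""
    SELECT g.name, COUNT(*) AS n_windows, AVG(w.ils) AS mean_ils
    FROM genes AS g
    JOIN windows AS w
      ON w.chrom = g.chrom
     AND w.win_start < g.gene_end
     AND w.win_end > g.gene_start
    GROUP BY g.name
    ORDER BY g.name
""").fetchall()

for name, n_windows, mean_ils in rows:
    print(name, n_windows, round(mean_ils, 3))
```

The overlap condition treats intervals as half-open, so a window that merely touches a gene boundary is not counted. On real data the same join would run against tables of millions of rows, which is where the indexes matter.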
6. Discussion
Studying the speciation process with genome data implies new
modeling challenges, as the basic configuration of a population
genetics data set is drastically changed: instead of having a few loci
sequenced in several individuals, we have an (almost) exhaustive set
of loci sequenced in one individual for a few species. The changes
involve the spatial dimension, but also time, as the process under
study occurred much further back in time than the ones that are
commonly studied with a standard population genetics data set.
The use of the spatial signal has a major consequence, namely, that
recombination has to be dealt with, even if it is not directly modeled.
Apart from these considerations, ancestral population genomics, as population genetics, heavily relies on the study of sequence
genealogy, its shape, as well as its variation. The underlying models
build on existing intraspecies population modeling, as they
only need to add the species divergence process, that is, a moment
in time where two populations stop exchanging genetic
material and evolve fully independently. The simplest isolation model assumes that the speciation is instantaneous while the
isolation-with-migration model assumes that the two neo-species
can still exchange some material, at least for a certain time after the
split. Such a model is not different from a pure isolation model,
where the ancestral population is structured into two subpopulations: in the first case, the speciation time is defined as the time of
the split while in the second case it is the time of the last genetic
exchange. Recent work on primates (11) suggests that the speciation of human and chimp was not instantaneous. If the average
divergence of the human and chimpanzee is a bit more than 6 My
(using a widely accepted mutation rate), then the split of the two
species initiated around 5.5 My ago, and the last genetic exchange
can be dated around 4 My.
The fact that we sample a large number of positions in the
genome, thus, appears to have the power to counterbalance
the reduced sampling of individuals within population, allowing
the estimation of demographic parameters in the ancestor. Nonetheless, complexity limits are rapidly reached when considering, for
example, three closely related species that can exchange migrants.
More complex demographic scenarios, incorporating for instance
variation in population sizes, will also add additional parameters
that might not all be identifiable.
If the ancient speciation processes have left signatures in the
contemporary genomes, we do not know yet how far back in time
this is true. Intuitively, the signal is maximal when the variation
in divergence due to polymorphism is large enough compared
to the total divergence. The divergence due to polymorphism is
proportional to the ancestral population size while the divergence
of species is only dependent on the time when it happened. So the
further back in time we are looking at, the bigger the population
sizes need to be so that the ancient polymorphism leaves a signature
in the total divergence time. In addition to this, one has to take into
consideration sequence saturation due to the large number
of substitutions that have accumulated since an ancient split and the fact
that the complexity of demographic scenarios increases with time. For
instance, when considering the evolution of a species over several
millions of generations, the probability that a bottleneck, resetting
the signal from past events, occurred once is not negligible.
The population genomics era is just ahead, where we will have
full individual genomes for closely related species. Such data sets
are the key to understand the detailed evolutionary processes that
are linked to the formation and evolution of species, as they will
open windows to new periods in time. Analyzing such data sets
with the current methodologies, however, poses major challenges:
(1) developing appropriate computational tools able to handle
such data sets with current machines (both in terms of processor
speed and memory usage) and (2) designing realistic models with
enough complexity to capture the most important historical events
while remaining computationally tractable.
7. Exercises
7.1. ILS in Primates
7.2. Estimating Ancestral Population Size from the Observed Amount of ILS
7.3. Number of Migration Rates in the General k-Population IM Model
Acknowledgments
The authors would like to thank Thomas Mailund for providing
useful comments on this chapter. This publication is contribution
2011-035 of the Institut des Sciences de l'Évolution de Montpellier (UMR 5554-CNRS). This work was supported by the French
Agence Nationale de la Recherche "Domaines Emergents" (ANR-08-EMER-011 "PhylAriane").
References
1. Siva, N. (2008), 1000 genomes project. Nature Biotechnology 26(3), 256
2. Weigel, D., Mott, R. (2009), The 1001 genomes project for Arabidopsis thaliana. Genome Biology 10(5), 107
3. Enard, D., Depaulis, F., Roest Crollius, H. (2010), Human and non-human primate genomes share hotspots of positive selection. PLoS Genet 6(2), e1000840
4. Siepel, A. (2009), Phylogenomics of primates and their ancestral populations. Genome Research 19(11), 1929–1941
5. Wakeley, J. (2008). Coalescent Theory: An Introduction, 1st edn. Roberts & Company Publishers
6. Tavaré, S. (2004). Ancestral inference in population genetics, vol. 1837, pp. 1–188. Springer Verlag, New York
7. Takahata, N., Nei, M. (1985), Gene genealogy and variance of interpopulational nucleotide differences. Genetics 110(2), 325–344
8. Nielsen, R., Wakeley, J. (2001), Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158(2), 885–896
9. Hobolth, A., Christensen, O.F., Mailund, T., Schierup, M.H. (2007), Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet 3(2), e7
10. Dutheil, J.Y., Ganapathy, G., Hobolth, A., Mailund, T., Uyenoyama, M.K., Schierup, M.H. (2009), Ancestral population genomics: the coalescent hidden Markov model approach. Genetics 183(1), 259–274
11. Yang, Z. (2010), A likelihood ratio test of speciation with gene flow using genomic sequence data. Genome Biol Evol 2, 200–211
12. Burgess, R., Yang, Z. (2008), Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Molecular Biology and Evolution 25(9), 1979–1994
13. Tavaré, S. (1979), A note on finite homogeneous continuous-time Markov chains. Biometrics 35, 831–834
14. Wang, Y., Hey, J. (2010), Estimating divergence parameters with small samples from a large number of loci. Genetics 184(2), 363–379
15. Hobolth, A., Andersen, L.N., Mailund, T.
(2011), On computing the coalescence time den-
Chapter 13
Nonredundant Representation of Ancestral
Recombinations Graphs
Laxmi Parida
Abstract
The network structure that captures the common evolutionary history of a diploid population has been
termed an ancestral recombinations graph. When the structure is a tree the number of internal nodes is
usually O(K), where K is the number of samples. However, when the structure is not a tree, this number has
been observed to be very large. We explore the possible redundancies in this structure. This has implications
both in simulations and in reconstructability studies.
Key words: Ancestral recombinations graph, ARG, Redundancies, Minimal descriptor, Coalescent,
Wright–Fisher, Population simulators, Nonredundant
1. Introduction
In keeping with the theme of the book, we study in this chapter the
common evolutionary history of a diploid population. This common history is a phylogeny with the extant members at the terminal
or leaf nodes. The internal nodes of the topology are some common ancestors while the edges can be viewed as conduits for the
flow of genetic material. The direction on the edges represents the
direction of flow. A directed edge from node v1 to node v2 is to be
interpreted as v1 being an ascendant of v2 or v2 is a descendant of v1.
The topology has no cycles since, no matter what the underlying
model, a member is not an ancestor of itself. Thus, the topology is
always a directed acyclic graph (DAG). Under uni-parental (unilinear) transmission each member at a generation derives all its genetic
material from only one parent whereas under a biparental model a
member derives the material from two parents. Then does this
simple difference in inheritance in the two models have an effect
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_13,
© Springer Science+Business Media, LLC 2012
2. Background
The ideal population or Wright–Fisher model assumes some properties of the evolving population such as constant population size
and nonoverlapping generations. While these conditions appear
nonrealistic at first glance, the assumptions are reasonable for the
Fig. 1. The terminal (leaf) nodes are as follows: the 60 brown nodes represent African samples, the 50 blue nodes African-American samples, the 50 yellow nodes Asian samples and the 50 green nodes European samples. The internal cyan and
red nodes are recombination nodes and gray nodes are coalescent nodes. The simulation was generated with COSI (2) and
the visualization using Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/). The red recombination nodes are the ones
reconstructed by the method in (1).
Fig. 2. (a) The first ten generations of the relevant part of the complete pedigree graph (GPG(K, N) with K = 4 and N = 8).
The solid (blue) dots represent one gender, say males, and the hollow (red) dots represent the other gender (females). Each
row is a generation with the direction on edges indicating the flow of the genetic material and the four extant units are at
the bottom row, i.e., row 0. Under the Wright–Fisher population model, there are equal numbers of males and females in
each row and the two distinct parents, one male and one female from the immediately preceding generation, are randomly
chosen. (b) Tracking a locus gives a subgraph of (a).
and the latter from the father. However, if the locus is on the
autosome or even the X chromosome then the genetic material
may be transmitted from two parents. This implies that the topology
of the evolutionary history is no longer a tree, but a network (i.e., it
may have closed paths when edge directions are ignored). Thus, due to the
occurrence of genetic exchange events, such as recombination, the
common evolutionary history can no longer be captured by a tree.
The network that captures both the genetic exchange events (such as
recombinations) and events that do not exchange genetic material
between parents (such as mutations) is the ARG. For simplicity of
exposition we call the latter class of events nonexchange events.
Notice that this important distinction in the topological characteristics arises simply from the basic locus-inheritance model, that
is uniparental or biparental. The rest of the model characteristics
define the depth (or age) distribution of the nodes. Thus, it is
important to note the subtlety that an ARG is a random object
and there are infinitely many instances of the ARG. Usually, when
we say that a topological property holds for the ARG, we mean that
the property holds for every instance of the ARG, i.e., the
property holds with probability 1. Note that some properties may hold only for a
subset of instances (such as unboundedness).
Focusing on the topology of the ARG and its effect on the
samples provides us with insights to identify vertices that do not
matter. Modeling these as missing nodes in the ARG leads to a
core that preserves the essential characteristics. The random object
ARG is defined by at least two parameters: K, the number of extant
samples and 2N, the population size at a generation. A Grand Most
Recent Common Ancestor (GMRCA) plays an important role in
restricting the zone of interest in the common evolutionary
3. A Combinatorial Definition of ARG
The random object ARG is usually parameterized by three essential
parameters: K the number of extant samples, 2N the population
size, and recombination rate r (see texts such as ref. 3 for a detailed
description). The following theorem is paraphrased from (7):
Theorem 1. Every ARG G on K > 1 extant samples is the topological
union of some M ≥ 1 trees (or forests).
[Figure 3 panels: the ARG G and its three embedded trees]
Fig. 3. Here K = 4 and the extant samples are numbered 1, 2, 3, and 4. The hatched nodes are the genetic exchange
nodes. (a) The topology of an ARG, where the GMRCA is marked by an additional rectangle (on top). (b) A possible
embedding of (a) by three trees (shown in green, red, and blue, respectively).
[Figure 4 panels: genetic flow with event labels on the edges; Tree 1; Tree 2; Tree 3]
Fig. 4. (a) Genetic event labels on the edges. At each node the nonmixing segment corresponding to the embedded tree is
shown in the same color as that of the tree. The three embedded trees are shown separately in (b), (c), and (d).
use the value of the label to define the sequence s(u). Let P(s(u))
denote the elements of s(u). Then
\[
P(s(u)) = \bigcup_{i=1}^{M} \{\, x_i \mid x_i \in \mathrm{lbl}(v_1 v_2), \ldots \}
\]
4. Redundancies in an ARG
How do we identify redundancies in the topology of an ARG?
Studying the effect of the topology on the samples provides us
with insights to identify vertices that do not matter. Modeling
these as missing nodes in the ARG leads to a core that preserves the
essential characteristics.
To maintain biological relevance, a missing node is modeled
by the following vertex removal operation. Note that in an ARG,
each node has an implicit depth associated with it that reflects its
age (in generations). An alternative view is that the edge length
denotes the age. Note that in the following the age of the nodes
does not change and the new edges get the edge length from the
ages of the nodes they connect. The definition below is not the only possible definition of vertex removal, but it is a simple and natural one and is the one used
in this chapter. Given G and a node v in G, G\{v} is obtained in the following steps:
1. For each child v_{c,i} of v that is in the embedded tree i, 1 ≤ i ≤ M:
(a) (adding new edges) This child is connected by a new edge
to v_{p,i}, a parent of v in tree i.
(b) (annotating the new edges) The new edges between v_{p,i}
and v_{c,i} are annotated as follows: for each strand i, the label
of the new edge is the union of the labels on the i-path
from v_{p,i} to v_{c,i}. Next, if a label x_i appears on multiple new
outgoing edges of v_{p,i}, then it is removed from all but one
of the outgoing edges. (This is to avoid introducing parallel mutations, i.e., the same label appearing multiple times
on the embedded tree i.)
2. The node v, together with all the edges incident on it, is removed from G.
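The vertex removal steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the edge map (a dictionary from (parent, child) pairs to per-tree label sets), the function name remove_vertex, and the rule of keeping a duplicated label on the first child in sorted order are all hypothetical choices, not part of the chapter. Node ages are not modeled, since they do not change under removal.

```python
def remove_vertex(edges, v):
    """Return a new edge map for G\\{v}; the input map is left untouched.

    edges maps (parent, child) -> {tree index i: set of labels on strand i}.
    This representation is an assumption made for illustration only.
    """
    # Step 2: drop every edge incident on v (copy the rest).
    result = {
        e: {i: set(labels) for i, labels in per_tree.items()}
        for e, per_tree in edges.items()
        if v not in e
    }
    incoming = [(p, per_tree) for (p, c), per_tree in edges.items() if c == v]
    outgoing = sorted(
        ((c, per_tree) for (p, c), per_tree in edges.items() if p == v),
        key=lambda item: item[0],
    )
    for parent, in_labels in incoming:
        for i, path_top in in_labels.items():
            used = set()  # labels already placed on a new edge out of parent in tree i
            for child, out_labels in outgoing:
                if i not in out_labels:
                    continue  # this child is not below v in embedded tree i
                # Step 1(b): union of the labels along the i-path parent -> v -> child.
                merged = path_top | out_labels[i]
                # Keep a repeated label on one new edge only (no parallel mutations).
                result.setdefault((parent, child), {})[i] = merged - used
                used |= merged
    return result


# Hypothetical toy graph: p -> v -> {c1, c2}, all in embedded tree 1.
example_edges = {
    ("p", "v"): {1: {"a"}},
    ("v", "c1"): {1: {"b"}},
    ("v", "c2"): {1: {"c"}},
    ("p", "d"): {1: set()},
}
example_removed = remove_vertex(example_edges, "v")
```

In the toy graph, label a from the removed edge (p, v) ends up on only one of the two new edges out of p, which mirrors the rule in step 1(b).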
4.1. Samples-Preserving Transformation
4.2. Structure-Preserving Transformation
Note that the embedded trees (also called marginal trees) are
very important in an ARG and critical in defining it: not just
the topology but also the branch lengths, which represent the time
(in generations) to the next coalescent event. Is it then possible to
characterize the nodes whose removal leads to a structure-preserving transformation? A coalescent vertex in G is t-coalescent if and only if it is also
a coalescent node in at least one of the M embedded trees. In fact,
the following is proved in (8).
Theorem 2. If G' = G \ U and no t-coalescent vertex of G is in U, then
G' is structure-preserving.
In other words, if a set of coalescent nodes that are not t-coalescent
is removed from G to obtain G', then G' preserves the structure of G.
With this useful property, we are ready to
zero in on a core that preserves the structure.
4.3. Minimal Descriptor
The theorem shows that the vertices that ensure the invariance
of the branch lengths of each embedded tree are also resolvable,
leading to the following definitions.
1. An ARG G is a minimal descriptor if and only if every coalescent
vertex, except the GMRCA, is t-coalescent.
2. An ARG Gmd is a minimal descriptor of G if and only if (a) Gmd
is a minimal descriptor, (b) Gmd preserves the structure of G,
and (c) G and Gmd are samples preserving, i.e., S(G) = S(Gmd)
holds.
Given G, let U be the set of all coalescent vertices in G, other
than the GMRCA, that are not t-coalescent. Let G' = G \ U. By the
definition of a minimal descriptor and the following statement, G' is
a minimal descriptor.
If v1 is a t-coalescent vertex in G and v2 is not, then v1 continues to be
a t-coalescent vertex in G\{v2}. Further, if V1 is a set of t-coalescent
vertices in G, and none of the vertices in V2 is, then each v ∈ V1
continues to be t-coalescent in G\V2.
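As a rough illustration of the definitions above, the set U of removable vertices can be computed directly from the edges of G once each edge is tagged with the embedded trees that use it. The sketch below assumes a hypothetical representation (a dictionary from (parent, child) pairs to sets of tree indices) and treats a vertex as coalescent when it has at least two children; neither choice comes from the chapter.

```python
from collections import defaultdict


def removable_vertices(edges, gmrca):
    """Return U: coalescent vertices of G, other than the GMRCA, that are not t-coalescent.

    edges maps (parent, child) -> set of embedded-tree indices using that edge
    (an illustrative representation, assumed here).
    """
    children_in_g = defaultdict(set)      # node -> children in G
    children_in_tree = defaultdict(set)   # (node, tree index) -> children in that tree
    for (parent, child), trees in edges.items():
        children_in_g[parent].add(child)
        for i in trees:
            children_in_tree[(parent, i)].add(child)

    # Coalescent in G: at least two children in the graph.
    coalescent = {n for n, cs in children_in_g.items() if len(cs) >= 2}
    # t-coalescent: also a coalescent node in at least one embedded tree.
    t_coalescent = {
        n for (n, i), cs in children_in_tree.items()
        if len(cs) >= 2 and n in coalescent
    }
    return coalescent - t_coalescent - {gmrca}


# Hypothetical example with M = 2 embedded trees.
example_arg = {
    ("g", "n"): {1, 2}, ("g", "m"): {1, 2},
    ("n", "c1"): {1}, ("n", "c2"): {2},      # n has one child per tree only
    ("m", "c3"): {1, 2}, ("m", "c4"): {1, 2},
}
example_u = removable_vertices(example_arg, gmrca="g")
```

Here n has two children in G but only one child in each embedded tree, so it is coalescent without being t-coalescent and lands in U.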
The following gives a constructive description of a minimal
descriptor. Let G' be a minimal descriptor of G. Then G' is biologically
and evolutionarily relevant because
1. (Structure preserving) the embedded (marginal) trees of G and
G' are identical.
2. (Samples preserving) the allele statistics (including allele frequencies and LD decay) in the samples in both G and G' are identical.
5. Properties of Minimal Descriptor
Fig. 6. Overall picture: (a) A generic ARG G and all its genetic flow, thus defining the samples S(G). The two marked nodes are
not t-coalescent. (b) A minimal descriptor Gmd, which preserves the structure of G. Although the graphs are topologically
very different, they define exactly the same samples, i.e., S(G) = S(Gmd), and Gmd preserves the structure of G.
Fig. 7. (a) Bounded Gmd of unbounded G of Fig. 5. (b) Pairwise overlap of genetic segments in the children of node v.
6. Population Simulators
A modelless approach to simulation is to take an existing population sample S and perturb it to obtain S' that has properties similar
to those of S. Here, however, we discuss systems that explicitly model a
population evolving under the Wright-Fisher model (9).
The literature abounds with population
simulation systems, and the list of simulators mentioned here is by
no means complete. The attempt here is to classify them
based on their underlying approaches. The simulation systems
follow one of two approaches: forward and backward. In the former, the simulation of the events proceeds forward in time, that is,
from past to present. While this is a natural direction in which to proceed, a
trickier approach is to simulate backward in time, that is, from
present to past. In principle, this is more economical in space and
time. In both approaches an implicit phylogeny structure is constructed; we call the reduced version of this structure the ARG in Fig. 8.
An internal node in an ARG is either a coalescent node or a genetic
exchange node but not both. A mathematically interesting
approach is to simulate the time to the next coalescent, or recombination, event without explicit simulation of every generation.
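To make the idea of jumping directly to the next event concrete, here is a minimal sketch of how waiting times might be drawn when simulating backward in time. It rests on assumptions that are not stated in the chapter: time is measured in units of 2N generations, rho = 4Nr is the scaled recombination rate, k ancestral lineages coalesce at rate k(k-1)/2 and recombine at rate k*rho/2, and only the number of lineages is tracked (no genetic material or graph is recorded). The function names are hypothetical.

```python
import random


def next_event(k, rho, rng):
    """Draw the waiting time and type of the next event for k lineages."""
    coal_rate = k * (k - 1) / 2.0   # assumed coalescence rate
    rec_rate = k * rho / 2.0        # assumed recombination rate
    total = coal_rate + rec_rate
    wait = rng.expovariate(total)   # exponential waiting time with the total rate
    kind = "coalescence" if rng.random() < coal_rate / total else "recombination"
    return wait, kind


def simulate_events(k, rho, seed=0):
    """Track the lineage count backward in time until one lineage remains."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while k > 1:
        wait, kind = next_event(k, rho, rng)
        t += wait
        # A coalescence merges two lineages; a recombination splits one into two.
        k += -1 if kind == "coalescence" else 1
        events.append((t, kind, k))
    return events


example_events = simulate_events(k=4, rho=0.5, seed=1)
```

Each tuple in the returned list holds the cumulative time, the event type, and the number of lineages after the event; no generation-by-generation bookkeeping is needed.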
[Fig. 8 diagram. Node labels: Model-based (ARG); Forwards; Backwards; Hybrid; Coalescent (binary ARG); Exact coalescent; Approximate coalescent; Nonredundant; Minimal Descriptor; Spatial Algo; SMC. Software systems named: FORWSIM, simuPOP, SFS_CODE, FREGENE, GENOME, MS, COSI, SelSim, MaCS, SMC, FastCoal.]
Fig. 8. A classification of the model-based (hence with an associated ARG) population evolution systems based on their
underlying architectures. The software systems are shown either in red or green. The systems in green additionally
incorporate selection and/or demographics to produce genetic diversity patterns that somewhat reflect the current
populations. Bottom to top: Backward and forward are the two basic schemes, with hybrid as a combination of the two.
Coalescent is a mathematically interesting backward scheme whose ARG topology characterizes it as a binary ARG. A set
of simulators are listed here as approximate coalescent; these are attempts at removing redundancies in the underlying
binary ARG. The minimal descriptor, by its definition, is a nonredundant representation of the ARGs resulting from all the
schemes (and additionally it is an exact coalescent model, hence the bifurcation in the coalescent lineage above).
population genetics models are available through their cookbooks. This is a suitable system for experimentation since the
user can engineer complex evolutionary scenarios in the environment.
Next we discuss a few simulators that directly provide the
population samples based on a set of input parameters. SFS_CODE
(11) is a forward simulator that additionally handles effects of
migration, demographics, and selection. The migration model is
the general island model with complex demographic histories.
FREGENE (12) additionally incorporates selection, recombination
(crossovers and gene conversion), population size and structure,
and migration.
6.2. Backward Simulators
7. Conclusion
Population evolution models are important for understanding the differences and similarities among individual genomes, particularly given
the explosion of data in this area. While these models faithfully capture the
genetic dynamics of the evolving population, their structure is
usually very large, involving tens of thousands of internal nodes
for, say, a few hundred samples with a thousand SNPs each. The
complexity of this combinatorial structure raises the question of
redundancies within it. This chapter addressed this precise
question and gave a mathematical description of such a substructure.
This is important not only for simulation and reconstruction
purposes, but also because it opens the door to a comprehensive understanding of the genetic dynamics that ultimately shape the chromosomes.
8. Exercises
1. Construct an instance of GPG(4, 3) with no LCAs.
What is the probability of an instance of GPG(4, 3) having no
LCAs?
(Hint: see ref. 7 for the definition of a natural probability
measure).
2. (a) What is the difference in topology of a pedigree history
graph and ARG?
(Hint: How many parents must a diploid have?)
(b) When tracing a haploid, at most how many parents can the
extant unit have? Why? Does this hold for a unit at every
generation? (Hint: Fig. 9a.)
3. Is it possible to assign labels to the nodes of the ARGs in
Fig. 9b, c? Why?
4. Argue that the number of resolvable nodes decreases with
depth of the nodes.
5. Argue that an ARG may have multiple minimal descriptors.
(Hint: Fig. 10.)
Acknowledgments
I would like to thank Marc Pybus for generating the visualization of
the ARG produced by COSI to show the world populations
(Fig. 1). I am grateful to the anonymous referees whose comments
have substantially improved the exposition.
Fig. 9. (a) Tracking haploids in diploids. (b) and (c) The pattern of connectivity is repeated in both to produce infinite graphs.
Fig. 10. Gmd and G'md are minimal descriptors of G.
References
1. R. R. Hudson. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology, 23(2):183-201, April 1983.
2. R. C. Griffiths and P. Marjoram. An ancestral recombinations graph. Progress in Population Genetics and Human Evolution (P. Donnelly and S. Tavare, Eds.), IMA Vols. in Mathematics and its Applications, 87:257-270, 1997.
3. Jotun Hein, Mikkel H. Schierup, and Carsten Wiuf. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford Press, 2005.
4. Laxmi Parida, Marta Mele, Francesc Calafell, Jaume Bertranpetit, and Genographic Consortium. Estimating the ancestral recombinations graph (ARG) as compatible networks of SNP patterns. Journal of Computational Biology, 15(9):1-22, 2008.
5. Marta Mele, Asif Javed, Marc Pybus, Francesc Calafell, Laxmi Parida, Jaume Bertranpetit, and Genographic Consortium.
6. M. A. Jobling, M. Hurles, and C. Tyler-Smith. Human Evolutionary Genetics: Origins, Peoples and Disease. Mathematical and Computational Biology Series. Garland Publishing, 2004.
7. Laxmi Parida. Ancestral Recombinations Graph: A Reconstructability Perspective using Random-Graphs Framework. To appear in Journal of Computational Biology, 2010.
8. Laxmi Parida, Pier Palamara, and Asif Javed. A minimal descriptor of an ancestral recombinations graph. BMC Bioinformatics, 12(Suppl 1):S6, 2011. http://www.biomedcentral.com/1471-2105/12/S1/S6.
9. R. R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18:337-338, Feb 2002.
10. Bo Peng and Marek Kimmel. simuPOP: a forward-time population genetics simulation environment. Bioinformatics, 21:3686-3687, 2005.
11. R. D. Hernandez. A flexible forward simulator for populations subject to selection and
Part IV
The -omics
Chapter 14
Using Genomic Tools to Study Regulatory Evolution
Yoav Gilad
Abstract
Differences in gene regulation are thought to play an important role in speciation and adaptation.
Comparative genomic studies of gene expression levels have identified a large number of differentially
expressed genes among species, and, in a number of cases, also pointed to connections between interspecies
differences in gene regulation and differences in ultimate physiological or morphological phenotypes.
The mechanisms underlying changes in gene regulation are also being actively studied using comparative
genomic approaches. However, the relative importance of different regulatory mechanisms to interspecies
differences in gene expression levels is not yet well understood. In particular, it is often difficult to infer
causality between apparent differences in regulatory mechanisms and changes in gene expression levels, a
challenge that is compounded by the fact that the link between sequence variation and gene regulation is
not clear. Indeed, in certain cases, gene regulation can be conserved even when sequences at associated
regulatory elements have changed. In this chapter, I examine different genomic approaches to the study of
regulatory evolution and the underlying genetic and epigenetic regulatory mechanisms. I try to distinguish
between hypothesis-driven and exploratory studies, and argue that the latter class of studies provides
valuable information in its own right as well as necessary context for the former. I discuss issues related
to study designs and statistical analyses of genomic studies, and review the evidence for natural selection on
gene expression levels and associated regulatory mechanisms. Most of the issues that are discussed pertain
to the general nature of multivariate genomic data, and thus are often relevant regardless of the technology
that is used to collect high-throughput genomic data (for example, microarrays or massively parallel
sequencing).
Key words: Comparative genomics, Gene regulation, Evolution
1. What Can We Learn from Genomic-Scale Comparative Studies of Gene Regulation?
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_14,
© Springer Science+Business Media, LLC 2012
Y. Gilad
2. How to Compare Gene Expression Levels Across Species?
Fig. 1. RNAseq data from human and chimpanzee liver samples are plotted along the Vanin-family protein 3 (VNN3) gene
region. The human gene structure is provided below each plot and indicates that there are seven annotated exons in this
gene (there is no independent annotation of the chimpanzee genome). The arrows indicate a cluster of sequencing reads
that does not correspond to any part of the human gene model. A de novo definition of transcriptional units clearly
classifies this as an additional exon. Arguably, there is yet another unannotated exon at the 5' end of the region.
used). For simplicity of writing, I will also henceforth refer generally to genes as examples of transcriptional units. It should be
kept in mind, however, that RNAseq data can be used to study the
expression levels of any transcriptional unit, including individual
exons, alternatively spliced transcripts, small RNAs, etc.
2.2. General Issues in Design of Comparative Gene Expression Studies
Fig. 2. Comparative liver gene expression profiles in primates (data from Blekhman et al. 2008). In all panels, the mean
(± s.e.m) log gene expression level (y-axis) of six individuals from each species (x-axis) is plotted relative to the human
value (which was set to zero). Top panels: Although Blekhman et al. did not obtain staged tissues (the samples were
collected opportunistically during postmortem procedures), the expression levels of each of these four genes are
remarkably constant across individuals and species (importantly, these four genes are expressed at moderate to high
levels, so the observed low interindividual variation is not due to lack of expression). Technical or environmental
explanations for these patterns are unlikely. It is, therefore, reasonable to assume that the expression levels of these
genes are tightly regulated (indeed, Blekhman and colleagues argue that the regulation of these genes has likely evolved
under stabilizing selection in primates). Bottom panels: These genes have similar expression levels in chimpanzees and
rhesus macaques, and a significantly different expression level in humans. In these four cases, explanations based on
interspecies genetic or environmental differences are completely confounded.
Fig. 3. Examples of strong concordance between expression levels measured using the multispecies arrays from Blekhman
et al., 2008, and using the RNAseq data from Blekhman et al., 2009. Six genes are displayed, chosen at random from the
data of Blekhman et al., 2008, conditional only on a significant (FDR < 0.05) difference in gene expression level between
humans and chimpanzees (expression levels in the rhesus macaques were not considered for the selection process). For
each gene, the expression estimate (mean ± s.e.m) from the multispecies array (left) and normalized expression level
(mean ± s.e.m) from the RNAseq data (right) are shown for each species (H: human, C: chimpanzee, R: rhesus macaque).
Each study used different individual samples, yet the patterns are consistent across studies, suggesting that the relative
estimates of gene expression levels based on six individuals from each species are mostly stable.
3. What Have We Learned from Comparative Genomic Studies of Gene Expression Levels?
Lemos et al. (19) used their approach to perform a meta-analysis of available gene expression datasets from multiple species,
and found that the overwhelming majority of genes in all datasets
exhibited far less between-species variation than expected under a
neutral model. They interpreted this pattern to be the result of
stabilizing selection acting on within-species gene expression. In
fact, Lemos et al. (19) estimated that even if the mutational input to
gene expression were two orders of magnitude lower than they had
assumed, levels of between-population differentiation in gene
expression would still be inconsistent with neutrality. Only in comparisons between mouse lab strains did an appreciable number of
genes evolve in a manner consistent with neutrality.
The conclusions of Lemos et al. were supported by several
studies that directly measured the mutational input of variation in
gene expression levels per generation in a number of model organisms (22-24). Mutational input can be estimated by measuring the
variance for a phenotypic trait among a set of initially homogeneous
lines maintained with minimally sized populations for many
generations. Natural selection is at its weakest under such conditions because genetic drift in such small populations is extremely
fast. In an extreme case, when a single, randomly chosen individual
propagates each line, the only mutations that can be selected
against are those that kill the organism before reproduction or that
eliminate fertility altogether. Otherwise, most mutations will be
effectively neutral and will quickly either drift to fixation or be lost.
As different lines fix different random mutations, the lines drift
apart. Variation between lines can then be used to estimate the
mutational variance.
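As a small worked illustration of the last step, the sketch below converts between-line variation into an estimate of mutational variance. It assumes a common simplification that the text does not spell out: that the variance among line means grows at about twice the per-generation mutational variance, so that Vm is approximately V_between / (2t) after t generations. It also ignores measurement error and within-line sampling variance. The function name and the numbers are hypothetical.

```python
from statistics import variance


def mutational_variance(line_means, generations):
    """Estimate per-generation mutational variance from line means.

    Assumes (for illustration) that among-line variance equals
    2 * Vm * generations, with no correction for measurement error.
    """
    between_line = variance(line_means)  # sample variance among line means
    return between_line / (2.0 * generations)


# Hypothetical expression levels (log scale) of one gene in three lines
# after 50 generations of mutation accumulation.
example_vm = mutational_variance([1.0, 2.0, 3.0], generations=50)
```

A real analysis would estimate the between-line component from replicated measurements per line rather than from raw line means.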
These mutation accumulation studies (22-24) provided the
first direct estimates of mutational variance in gene expression
levels. When comparative gene expression data were analyzed in
the context of these estimates (by applying a similar modeling
approach to the one used by Lemos et al.) in all systems studied
to date, it was concluded that stabilizing selection places severe
bounds on gene expression divergence.
3.1. Gene Expression in Apes
4. How to Compare Regulatory Mechanisms Across Species?
Because inference of causality almost always relies on prior information, genome-wide studies of regulatory mechanisms should
aspire to build the strongest possible independent circumstantial
case for a relationship between variation in regulatory interactions
and changes in gene expression levels. This can often be done
by combining different sources of genome-wide information.
For example, consider the task of identifying the direct regulatory
targets of a transcription factor. To do so, empirical studies typically
use one of the two main approaches: (1) expression profiling following a perturbation of the transcription factor dosage or (2)
chromatin immunoprecipitation followed by sequencing (ChIPseq) using a specific antibody against the transcription factor.
In the first approach, the dosage of the transcription factor
is perturbed in cells or in model organisms by a treatment of
either overexpression or knockdown (using, for example, siRNA
technology (29, 30)) of the transcription factor. Following the
treatment, the expression profiles of a large number of genes are
studied in order to identify the genes whose regulation has been
affected by the perturbation of the transcription factor dosage (29).
Typically, a large number of genes, often several thousand, are
found to be differentially expressed in such experiments (30, 31).
However, it is clear that not all the differentially expressed genes are
directly regulated by the transcription factor whose dosage was
perturbed. Indeed, a large proportion of the genes are expected
to be secondary targets (i.e., regulated by genes that are themselves
directly regulated by the transcription factor). In addition, a change
in the dosage of a transcription factor often affects the cellular
environment in ways that may trigger larger changes in the gene
expression profiles, not directly related to the regulatory effects of
the perturbed transcription factor (30).
In order to identify the subset of direct transcriptional targets
among all the differentially expressed genes, computational predictions of the transcription factor-binding sites are often used.
Namely, a gene is considered as a direct regulatory target only if it
is differentially expressed following the perturbation of the transcription factor and the binding motif of the transcription factor
can be found within the gene's putative promoter (30, 31). The
problem is that computational searches for transcription factor-binding
sites are known to have a high error rate (32). In particular,
since transcription factor-binding sites are short (6-12-mers), a
large number of false positives are expected. In addition, it is
unclear how to assign significance to the identification of transcription factor-binding sites based on a single sequence (32).
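A back-of-the-envelope calculation shows why short motifs produce so many chance matches. The sketch below assumes an idealized random sequence with equal base frequencies and an exact (non-degenerate, non-palindromic) motif, which is a deliberate oversimplification of real genomes and real position weight matrices; the function name is hypothetical.

```python
def expected_chance_matches(genome_length, motif_length, both_strands=True):
    """Expected number of exact motif matches in a random sequence.

    Assumes independent positions and equal frequencies of A, C, G, and T.
    """
    positions = genome_length - motif_length + 1   # possible start positions
    per_position = 0.25 ** motif_length            # chance of a match at one position
    strands = 2 if both_strands else 1
    return strands * positions * per_position


# Hypothetical genome of three billion bases.
six_mer = expected_chance_matches(3_000_000_000, 6)
twelve_mer = expected_chance_matches(3_000_000_000, 12)
```

Under these assumptions, each extra base in the motif divides the expected number of chance matches by four, so a 6-mer is expected to match by chance 4096 times more often than a 12-mer.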
An alternative approach is to use ChIPseq (33) to directly
identify all the sites in the genome to which the transcription
factor binds (e.g., refs. 34, 35). In these experiments, sequencing
is used to measure the abundance of chromatin that is first precipitated along with the transcription factor of interest. The goal is
to identify genomic regions with peaks of aligned sequencing
reads, which correspond to regions putatively bound by the transcription factor. When the transcription factor-binding locus is in
proximity to a known gene, it is assumed that the gene is being
regulated by the transcription factor (35, 36). However, even if
the antibody against the transcription factor is highly specific and
the number of falsely identified binding events is assumed to be
small (37), it is unclear how many binding events reflect a true
biological function. Namely, it is unclear how often a transcription
factor can bind to genomic regions near genes without participating in the regulation of those genes.
Thus, ChIPseq and dosage perturbation experiments, considered one at a time, suffer from high false-positive rates due to the
nonspecificity of the antibody, random binding of the transcription
factor in the case of the ChIPseq experiment, or the ripple effect of
knocking down a transcription factor in the siRNA experiments.
Considered together, however, these approaches enable the reliable
identification of genes whose promoter regions are bound by the
transcription factor and whose regulation is affected by the perturbation of the transcription factor dosage. In other words, using this
paradigm, one can build a strong circumstantial case for classifying
direct regulatory targets of a specific transcription factor.
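The combined use of the two data types amounts to intersecting two gene lists. A minimal sketch follows, assuming that each experiment has already been reduced to a list of gene identifiers (differentially expressed after perturbation, and bound near the promoter in ChIPseq); the function name, category names, and gene identifiers are hypothetical.

```python
def classify_genes(de_genes, bound_genes):
    """Split genes by the evidence supporting them as regulatory targets."""
    de, bound = set(de_genes), set(bound_genes)
    return {
        # Supported by both lines of evidence.
        "candidate_direct_targets": sorted(de & bound),
        # Expression changed but no binding: possibly secondary targets.
        "expression_change_only": sorted(de - bound),
        # Binding without a detected effect on expression.
        "binding_only": sorted(bound - de),
    }


example_classes = classify_genes(
    de_genes=["geneA", "geneB", "geneC"],
    bound_genes=["geneB", "geneC", "geneD"],
)
```

Only the first category would be reported as direct targets; the other two remain useful for judging how noisy each assay is on its own.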
4.2. Statistical Challenges in Comparative Studies of Gene Regulation
Fig. 4. Example of how a distribution of FDR values can guide the choice of statistical cutoffs. (a) All ChIPseq peaks with
FDR ≤ 20% from a genomic study of histone modification in cell lines from three primate species; the chosen stringent
2% FDR cutoff is indicated with a dashed line. (b) Enrichment peaks with FDR ≤ 20% in each species, which also overlap
peaks with FDR ≤ 2% in any of the other species; the chosen relaxed 5% FDR cutoff for a secondary observation is
indicated with a dashed line.
modification. Accordingly, one can relax the statistical cutoff for the
classification of such secondary observations. Although the choice of
statistical cutoffs may still be arbitrary, the distributions of FDR
values can be used as a guide, especially with respect to the choice of
the second cutoff (Fig. 4). The two-cutoff approach uses information
across all studied species to increase the power to detect histone
modification in any species. This approach is, therefore, conservative
with respect to identifying differences across species.
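The two-cutoff rule can be sketched as follows. The sketch assumes that peaks have already been mapped to shared orthologous region identifiers, so that "overlap" reduces to having the same identifier; it uses the 2% and 5% FDR values from Fig. 4 as defaults and treats the cutoffs as inclusive. The data layout, function name, and example values are all hypothetical.

```python
def call_peaks(fdr_by_species, strict=0.02, relaxed=0.05):
    """Apply a stringent cutoff, then a relaxed one for cross-species support.

    fdr_by_species maps species -> {region_id: FDR}. Returns species -> set of
    region ids called as enriched.
    """
    strict_calls = {
        sp: {r for r, q in regions.items() if q <= strict}
        for sp, regions in fdr_by_species.items()
    }
    calls = {}
    for sp, regions in fdr_by_species.items():
        # Regions passing the stringent cutoff in at least one other species.
        supported = set().union(
            *(hits for other, hits in strict_calls.items() if other != sp)
        )
        secondary = {r for r, q in regions.items() if q <= relaxed and r in supported}
        calls[sp] = strict_calls[sp] | secondary
    return calls


example_fdr = {
    "human": {"r1": 0.01, "r2": 0.04, "r3": 0.04},
    "chimpanzee": {"r1": 0.03, "r2": 0.10, "r3": 0.01},
    "rhesus": {"r1": 0.20, "r2": 0.04, "r3": 0.06},
}
example_calls = call_peaks(example_fdr)
```

Region r2 in human passes the relaxed cutoff but has no stringent support elsewhere, so it is not called; this is what makes the scheme conservative about declaring interspecies differences.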
5. What Have We Learned from Comparative Studies of Regulatory Mechanisms?
Comparative studies of genetic mechanisms. In contrast to the relative abundance of comparative gene expression data from multiple
species, there are far fewer genomic-scale comparative datasets of
regulatory mechanisms. At the genetic level, the largest comparative
study of regulatory mechanisms to date is that of Schmidt and
colleagues (38), who used ChIPseq to compare the genomic locations of binding sites of two transcription factors (CCAAT/
enhancer-binding protein alpha and hepatocyte nuclear factor 4
6. Summary and Additional Topics
We have gained important insights from comparative genomic
studies of gene expression levels. We established that the regulation of most genes evolves under stabilizing selection (51, 52) and
described variation in gene expression levels within and between
species in sufficient detail that we can now use empirical
approaches to identify genes whose regulation likely evolved under
directional selection (53). These would be promising candidates
for further functional studies. Current efforts are moving beyond
the investigation of interspecies variation in gene expression
levels to studies of the underlying regulatory mechanisms. In
that respect, I did not mention in this chapter many of the types
of datasets that are currently being collected, such as measures of
chromatin accessibility (using DNase hypersensitive sites, for
example), different markers of enhancer elements (such as the
cofactors p300 and mediator), maps of nucleosome positions,
and expression levels of small regulatory RNA classes. Once we
combine different sources of comparative genomic data into a
unified model of gene regulation, we should obtain power to
truly dissect the genetic and epigenetic architecture of gene regulatory evolution.
7. Exercises
1. You are ready to design a large study to compare gene expression between species using RNAseq. You know that you need
to take into account a large number of possible biological and
technical effects, but then you also learn that a certain physical
environment (such as temperature, humidity, amount of light,
etc.) might affect your results. You, therefore, decide to
design a pilot experiment to test the effect of this physical
environment on the measurements of gene expression level
using your platform of choice. Your design should not rely
on the availability of gold standards (namely, you are not
able to obtain samples for which the differences in gene
expression are known, neither a priori nor by using additional
techniques).
(a) Explain the study design that allows you to test for the
effects of the physical environment of choice.
(b) What are the expected results if the physical environment of choice has no effect on the measurement of
gene expression levels?
(c)
(d)
What are the expected results if the physical environment of choice is nonrandom? In that case, how will you
take this information into account when you design the
larger study?
Chapter 15
Characterization and Evolutionary Analysis
of Protein-Protein Interaction Networks
Gabriel Musso, Andrew Emili, and Zhaolei Zhang
Abstract
While researchers have known the importance of protein-protein interactions for decades, recent
innovations in large-scale screening techniques have caused a shift in the paradigm of protein function
analysis. Where the focus was once on the individual protein, attention is now directed to the surrounding
network of protein associations. As protein interaction networks can provide useful insights into the
potential function of and phenotypes associated with proteins, the increasing availability of large-scale
protein interaction data suggests that molecular biologists can extract more meaningful hypotheses through
examination of these large networks. Further, increasing availability of high-quality protein interaction data
in multiple species has allowed interpretation of the properties of networks (i.e., the presence of hubs and
modularity) from an evolutionary perspective. In this chapter, we discuss major previous findings derived
from analyses of large-scale protein interaction data, focusing on approaches taken by landmark assays in
evaluating the structure and evolution of these networks. We then outline basic techniques for protein
interaction network analysis with the goal of pointing out the benefits and potential limitations of these
approaches. As the majority of large-scale protein interaction data has been generated in budding yeast,
literature described here focuses on this important model organism with references to other species
included where possible.
Key words: Protein interaction, Network, Modularity, Evolution, Hub, Scale free
1. Introduction: Mining Protein Interaction Networks
Although it has long been known that proteins elicit their function
through association, over the past few years it has become increasingly apparent that analyses of entire networks of protein interactions can provide useful information regarding protein function
and deletion consequence. An increase in the use of genome-scale
interaction detection techniques, such as tandem affinity purification (TAP) and yeast 2-hybrid (Y2H) screening (see Fig. 1), has
generated a wealth of protein-protein interaction (PPI) data in
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_15,
© Springer Science+Business Media, LLC 2012
G. Musso et al.
Fig. 1. Protein interaction detection. Binary detection assays used for protein interaction screening typically employ the
reconstruction of a reporter when two recombinant proteins (each tethered to one component of the reporter's activator)
are in sufficiently close proximity. In the case of traditional Y2H screening (upper left), the DNA binding and activation
domains of GAL4 are tethered to a bait (B) and prey (P) protein, respectively, GAL4 is reconstructed, and a reporter signal is activated.
Split ubiquitin screening (upper right) utilizes a variation of this concept in which ubiquitin is reconstructed, cleaves an
attached transcription factor, and subsequently causes reporter activation. Alternately, detection of complexes typically
involves some form of epitope tagging followed by affinity purification. While there are multiple tags that can be used for
affinity purification assays, traditional tandem affinity purification (TAP; bottom half) uses a tag containing protein A, a
tobacco etch virus (TEV) cleavage site, and calmodulin-binding peptide for two successive rounds of purification based on
immobilization of the tagged bait. In either binary or affinity purification-based techniques, interactions are generally
confirmed through reciprocal assay.
Table 1
Types of interactions used to generate networks
Interaction type | Description | Potential sources
Genetic
Protein
Functional similarity or data integration
Coexpression | Similarity in patterns of expression is a good indication of both physical and genetic association and can be used to derive useful functional relationships | NNN: http://quantbio-tools.princeton.edu/cgi-bin/nnn; Avadis: http://www.strandls.com/Avadis
Listed are four basic types of association used to draw inference regarding overlapping function of genes or
gene products
Fig. 2. Illustration of network types. Preferences in the attachment of edges during the
generation of a network greatly affect its topology. Both networks above contain seven
nodes connected by six edges; however, in the left graph, associations were distributed
uniformly, whereas on the right edges were preferentially attached to nodes with existing
edges. The right graph is an example of a small world design, as the presence of hubs
(black nodes) affords a structure in which any two nodes can be connected by a small
number of edges.
2. Major Works in Protein-Protein Interaction Network Analysis
2.1. Observation of Small World Properties in Protein Interaction Networks
Early analysis of PPI data further suggested that protein interaction networks fit the more stringent definition of having a scale-free
connectivity distribution (6, 7): a subset of small world networks in
which new edges are preferentially connected to highly connected
nodes, and consequently the number of edges incident on each node
follows a power-law distribution. This would have implications not
only for the topological properties of the network, but also in the
interpretation of its evolution, as this would suggest retention and
loss of interactions through specific mechanisms (10). Incorrectly
labeling an interaction graph as being scale free has additional analytic
ramifications. For example, Khanin and Wit (11) argue that this
results in the incorrect assumption that biological networks follow
the same design principles as those observed in the physical and social
sciences. In the past several years, the classification of virtually all
protein interaction networks as scale free has been contested based
on goodness of fit tests (11, 12), although this discrepancy may be
due to an incomplete sampling of the full interaction network (13).
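The effect of preferential attachment on the degree distribution can be illustrated with a small simulation. The following is a minimal sketch rather than a model of any real interaction network: the function names, the one-edge-per-new-node growth rule, and the network size are all assumptions made for illustration.

```python
import random
from collections import Counter

def grow_network(n_nodes, preferential, seed=0):
    """Grow a network one node at a time and return its edge list.

    Each new node attaches to a single existing node, chosen either
    uniformly or with probability proportional to current degree.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]
    # 'stubs' holds each node once per incident edge, so a uniform draw
    # from it is a degree-proportional draw over nodes.
    stubs = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(stubs) if preferential else rng.randrange(new)
        edges.append((new, target))
        stubs.extend([new, target])
    return edges

def degree_counts(edges):
    """Map each degree k to the number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())

if __name__ == "__main__":
    for mode in (False, True):
        counts = degree_counts(grow_network(2000, preferential=mode))
        label = "preferential" if mode else "uniform"
        print(label, "largest degree:", max(counts))
```

Comparing the largest degree under the two attachment rules gives a quick impression of how hubs emerge; whether the resulting distribution is well described by a power law is a separate question that requires a goodness-of-fit test.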
While the presence of scale-free connectivity distributions may
still be a contentious issue, properties of small world networks
appear to be universally apparent in PPI networks. Two characteristics of the small world connection structure that are commonly
observed in PPI networks are the presence of highly connected
nodes (or hubs) and a simplified definition of cliques or subnetworks. Each of these properties is discussed in detail below.
2.2. Properties of Network Hubs
3. Evolutionary Comparisons of Protein Networks
3.1. Cross-Species Comparisons of Protein Interaction Networks
Fig. 3. Mechanisms of gene duplication. Depicted are several common mechanisms for both gene and genome duplication.
Beginning at top left and going clockwise, two well-described mechanisms for tandem duplication are unequal exchange
or crossing over occurring due to misalignment (indicated by small squares and dotted lines) during mitosis and meiosis,
respectively. Retrotransposition involves the reverse transcription of mRNA sequences into the genome as cDNA.
Allopolyploidy events involve the combination of the genomes of two species to increase the genetic complement (one
described case depicted). In contrast, autopolyploidies typically result from errors in the reduction of gametes among a
single species. Portions regarding auto and alloploidization adapted from Campbell and Reece (65), and regarding tandem
duplication adapted from Ohno (66).
4. Hands-On Network Analysis
4.1. Determination of Network Properties
In this section, we present a basic analysis of the topological properties of a protein interaction network. Although the instructions
given in this section are meant to be generally applicable to any
dataset, the example results are derived using the human MAP
kinase protein interaction data published by Bandyopadhyay et al.
(60). This analysis calculates basic network properties (Table 2)
using the NetworkAnalyzer (61) plugin for the network visualization tool Cytoscape (62). As it is a publicly available multiplatform tool with a wealth of analytical features constantly being
added and refined by the Computational Biology community, we
strongly recommend the use of Cytoscape for all forms of network
analysis. While a description of the basic use of Cytoscape is beyond
the scope of this chapter, detailed information regarding the installation and functionality of Cytoscape can be found in the associated
wiki: http://cytoscape.wodaklab.org/wiki, as well as the protocol
written by Cline et al. (63). The NetworkAnalyzer plugin that is
used for this analysis can be downloaded from: http://med.bioinf.mpi-inf.mpg.de/netanalyzer.
This Web site contains further documentation describing the
full capabilities of the NetworkAnalyzer plugin as well as instructions for its implementation. Alternative tools that could provide
more in-depth analysis are Pajek (64), and the network analysis and
visualization package for R: http://igraph.sourceforge.net/doc/R/00Index.html.
Table 2
Network property description
Property | Description | Calculation
Clustering coefficient | |
Characteristic path length | |
Network centralization | |
Described are three network characteristics outputted by NetworkAnalyzer. For a detailed description of
the remaining metrics, see the tool's online help: http://med.bioinf.mpi-inf.mpg.de/netanalyzer/help/2.7/index.html
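For readers who want to check two of these properties outside Cytoscape, the sketch below computes an average clustering coefficient and a characteristic path length from an edge list. It uses common textbook definitions and is not taken from NetworkAnalyzer; in particular, treating nodes with fewer than two neighbors as having a clustering coefficient of zero, and averaging path lengths only over connected pairs, are assumptions of this sketch.

```python
from collections import deque
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # contributes zero by assumption
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def characteristic_path_length(adj):
    """Mean shortest-path length over all connected node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from 'source'
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

Applied to the two-column edge list described in this section, `build_adjacency` would take the gene ID pairs as its input; values computed this way may differ slightly from those reported by a given tool if that tool uses different conventions for isolated or low-degree nodes.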
Interaction data used for this analysis can be obtained as the first
supplementary table published by Bandyopadhyay et al.: http://www.nature.com/nmeth/journal/v7/n10/extref/nmeth.1506-S2.xls.
As downloaded, this file will be in a 10-column format with
columns including names, gene IDs, descriptions, and confidence
information for each interaction. Only the gene IDs are required for
the purpose of this analysis, so columns 2 and 4 should be copied to
a new Excel file and saved (without headers; should have 2,272
rows). This file can be directly imported into Cytoscape using the
Import Network from Table command in the File menu.
Fig. 4. Simple network parameters from NetworkAnalyzer. NetworkAnalyzer outputs a small number of basic network
parameters that can be saved for further analysis and comparison with other networks. A description of some of these
metrics can be found in Table 2.
viewed with any text editor. Selecting a subset of nodes and repeating
this analysis using the Analyze subset of nodes option allows comparison among a specific subset of genes. This is useful, for example,
to identify the local properties of one gene family of interest.
Under the heading of Node Degree Distribution, we see a
log-log plot of node degree versus frequency of occurrence. The
Fit Power Law function can be used to determine whether the
distribution of edges in this graph approximates a power law
(Fig. 5). The MAP kinase protein interaction network seems to fit
this definition (r = 0.955), which is to be expected since a small
number of baits with somewhat overlapping targets were screened
in depth. Graphs visualizing the distributions of network properties
(degree, clustering coefficient, and shortest path length) can be
exported as image files by selecting Export Chart.
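A least squares power-law fit of the kind described here can be sketched in a few lines. The code below regresses log frequency on log degree and reports the fitted coefficients and the correlation of the log-log points. It is an illustration under stated assumptions, not a reproduction of NetworkAnalyzer's implementation; the function name and the input format (a mapping from degree to node count) are hypothetical.

```python
import math

def fit_power_law(degree_counts):
    """Fit count = a * k**b by least squares on log-transformed data.

    degree_counts maps degree k (> 0) to the number of nodes with that
    degree. Returns (a, b, r), where r is the Pearson correlation of
    the log-log points.
    """
    pts = [(math.log(k), math.log(c))
           for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two distinct degrees")
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    if sxx == 0 or syy == 0:
        raise ValueError("degenerate data: no spread in log degree or log count")
    b = sxy / sxx                # slope on the log-log scale (the exponent)
    a = math.exp(my - b * mx)    # back-transformed intercept
    r = sxy / math.sqrt(sxx * syy)
    return a, b, r
```

A high correlation on the log-log scale is a weak criterion: as noted in Subheading 2.1, the scale-free classification of many interaction networks has been contested on the basis of formal goodness-of-fit tests, so a fit of this kind is best read as descriptive.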
5. Questions
1. Describe the major differences in filtering procedures applied in
the 2006 Krogan et al. and Gavin et al. studies. Discuss the
merits and disadvantages of defining clusters that only allow
exclusive membership.
2. Protein interaction data from the Krogan et al. and Gavin et al.
screens is freely available from BioGRID (http://thebiogrid.org).
A comprehensive list of yeast paralogs originating from the
Fig. 5. Fitting network edge distribution to a power law using NetworkAnalyzer. This graph was outputted directly from
NetworkAnalyzer and shows a strong correlation between the degree distribution of our network and a power-law function,
suggesting it to be a scale-free network. NetworkAnalyzer fits the power-law function to degree data using the least
squares technique.
Acknowledgments
AE and ZZ acknowledge a Team Grant from the Canadian Institutes of Health Research (CIHR MOP#82940).
References
1. Krogan NJ, G Cagney, et al. (2006). Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084): 637–643.
2. Gavin AC, P Aloy, et al. (2006). Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084): 631–636.
Chapter 16
Statistical Methods in Metabolomics
Alexander Korman, Amy Oh, Alexander Raskind, and David Banks
Abstract
Metabolomics is the relatively new field in bioinformatics that uses measurements on metabolite abundance
as a tool for disease diagnosis and other medical purposes. Although closely related to proteomics, the
statistical analysis is potentially simpler since biochemists have significantly more domain knowledge about
metabolites. This chapter reviews the challenges that metabolomics poses in the areas of quality control,
statistical metrology, and data mining.
Key words: ALS disease, Machine learning, Mass spectrometry, Metabolomics, Premature labor,
Quality control
1. Introduction
Metabolism may be defined as the complete set of chemical
reactions that take place in the living organism. This set is divided
into two major branches: anabolism (synthesis) and catabolism
(breakdown). The subjects of these reactions are metabolites, a
very diverse group of chemicals comprising all small (nonpolymeric)
molecules found in living cells. Natural metabolites may be roughly
separated into two large groups: primary, which are directly involved in
normal growth, development, and reproduction; and secondary,
which are not directly involved in these processes but may still
play a vital role in the organism's biochemistry. Artificial food
components, drugs, and products of their breakdown constitute a
third large group, often referred to as xenobiotics (from the Greek
xenos, "stranger," and biotic, "related to living beings"). The collection of all metabolites of the cell, tissue, organ, or organism is called
the metabolome, in analogy with genome, proteome, and transcriptome.
In the living system, metabolites are connected by a complex
network of enzyme-assisted reactions. Logical components of this
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_16,
© Springer Science+Business Media, LLC 2012
A. Korman et al.
2. Technology
This section is intended as a brief overview of separation techniques
used in metabolomics. Readers interested in the details of the
subject should refer to other chapters in this book or (7).
3. Quality Control Issues
Good experimental design is essential to ensure quality checks on
each phase of the analysis. This requires executive commitment;
every set of samples should include multiple internal controls of
different kinds. Managers should anticipate that a significant proportion of the runs will be dedicated to quality control goals. There
is an implicit cost-benefit analysis in determining the trade-off
between resources spent on quality control and on analysis, better
equipment, curation, and so forth. Historically, most laboratories
have undervalued process control, which reduces data quality and
often increases total operational costs (8, 9).
The experimental setup includes biological sample collection,
storage, analytical sample preparation, and analysis (data acquisition) itself. Each of these steps contributes to the variance of the
and depends upon the purpose of the experiment and the intended
depth of data analysis.
Most of the current analytical platforms are based on plate
formats, where samples are delivered from a multiwell plate. Plate
geometry may be an additional source of variation since it determines
to some extent the order in which the robotic mechanisms deposit
samples, calibrants, and chemical reagents. In some platforms, the
order in which wells are filled is random; but one should record the
time stamp at which a well is filled in order to allow estimation of
time-dependent biases, such as volatilization. It is good practice to
reserve the center well and the center wells in each plate quadrant
for a known complex calibrant, a process blank, which is identical
in composition and treatment to all other samples but does not
contain biological material. This enables direct correction across
multiple plates, with minimal noise. It also enables detection of
drift and estimation of systematic effects due to plate geometry, and
the complexity of the calibrant provides known anchors that enable
multivariate regression methods to de-bias other measurements. In
addition to process blanks, it is useful to reserve some wells for pure
blanks (or solvent blanks) which are used to estimate and correct for
carryover and to estimate background noise. These locations
should also be geometrically balanced across the plate so that
systematic measurement biases can be assessed.
The appropriate experimental design for assigning samples,
process blanks, and solvent blanks depends upon the geometry of
the plate and the number of replicates per sample. If the plate has
square geometry, then one would consider a Latin square or
Graeco-Latin square design (10). These allow the analyst to control
for two or three possible confounders, such as plate row, plate
column, and order in which the well is filled. If the plate has
rectangular geometry, then there are analogous Latin rectangle
designs (10). Depending upon the situation, it may be appropriate
to use a balanced incomplete block design or the more exotic
partially balanced incomplete block design with some specific number of associate classes (11).
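As an illustration of the square-plate case, the sketch below generates a randomized Latin square in which each treatment label appears exactly once in every row and every column. It starts from a cyclic square and shuffles rows, columns, and labels, which yields a valid design but does not sample uniformly from all Latin squares; the function names and the use of single-letter labels are assumptions made for the example.

```python
import random

def random_latin_square(symbols, seed=None):
    """Return a k x k Latin square over 'symbols' as a list of rows.

    Built by shuffling the rows, columns, and labels of the cyclic
    square; valid, but not a uniform draw over all Latin squares.
    """
    rng = random.Random(seed)
    k = len(symbols)
    rows, cols, labels = list(range(k)), list(range(k)), list(symbols)
    rng.shuffle(rows)
    rng.shuffle(cols)
    rng.shuffle(labels)
    return [[labels[(r + c) % k] for c in cols] for r in rows]

def is_latin(square):
    """Check that every row and every column contains each symbol once."""
    k = len(square)
    full = set(square[0])
    return (len(full) == k
            and all(len(row) == k and set(row) == full for row in square)
            and all(set(col) == full for col in zip(*square)))

if __name__ == "__main__":
    for row in random_latin_square(list("ABCD"), seed=1):
        print(" ".join(row))
```

Here rows and columns of the square would correspond to plate rows and plate columns, so that both are balanced across treatments; controlling a third factor such as fill order would require a Graeco-Latin square, which this sketch does not construct.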
Besides plate geometry, experimental design issues will arise if
the tissue samples come from a research study. For example, one
common goal is to see whether two groups are different (say liver
tissue from sacrificed lab rats, some of whom received a new drug
and some of whom did not). In this case, the analyst should use
randomization for the order in which samples are run, and do
careful double blinding for all significant steps in the process.
Restricted randomization is sensible to do, but it can be hard to
explain. (With restricted randomization, not all possible randomizations are permitted; if the randomization happens to situate all or
most of the treatment group before the control group, it should
probably be excluded.) But most researchers prefer to write papers
which say that the samples were run in random order without any
footnotes. This is not a significant issue with large sample sizes, but
metabolomics research often must use relatively small sample sizes.
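Restricted randomization can be implemented by rejection: draw a random run order and discard it if it violates the restriction. The sketch below uses a cap on the longest run of consecutive samples from the same group as the restriction. That particular criterion, the function names, and the default limits are assumptions chosen for illustration; other restrictions (for example, on the mean run position of each group) could be substituted.

```python
import random

def max_run_length(order):
    """Length of the longest run of identical consecutive labels."""
    longest = run = 1
    for prev, cur in zip(order, order[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def restricted_run_order(labels, max_run=3, seed=None, max_tries=10000):
    """Shuffle 'labels' until no group appears more than max_run times in a row."""
    rng = random.Random(seed)
    order = list(labels)
    for _ in range(max_tries):
        rng.shuffle(order)
        if max_run_length(order) <= max_run:
            return order
    raise RuntimeError("no acceptable order found; relax the restriction")

if __name__ == "__main__":
    groups = ["treated"] * 6 + ["control"] * 6
    print(restricted_run_order(groups, max_run=2, seed=0))
```

Whatever restriction is used, it should be fixed before the run order is drawn and reported alongside the results, since the set of permitted orders affects any randomization-based inference.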
A second common type of analysis is time-course studies;
these look at trends over time within the same subject. For example, studies may examine metabolic changes in blood drawn at
hourly intervals after a drug is administered. It is good if one can
run all samples from the same subject on the same plate, but the
time order should be randomized intelligently. Crossover experiments also generate special structure that requires thought when
laying out the allocation of samples to wells. Often, a useful and
flexible heuristic is to run the samples in blocks, with the order of
the samples randomized within each block. The definition of a
block may vary according to the structure of the experiment.
Finding a good experimental design is not trivial and requires
specialized statistical expertise. For routine operation, it is probably
sufficient to select a good, robust design and use it for nearly all
runs. But if the problem has specific design structure (e.g., yeast
culture cultivated under crossed stress factors), then the operator
should have access to a competent statistician.
Data post-processing usually includes several steps: noise
removal, background subtraction (sometimes), signal deconvolution, and compound identification. Noise and background removal
are self-explanatory. Deconvolution is the most difficult step, and
produces most of the errors. Briefly, the purpose of deconvolution
is to separate the signals from different compounds which entered
the mass spectrometer simultaneously or with significant overlap based
on the shape of their signals, the combination of mass values, and a
set of chemical rules. The complexity of the task is highlighted by
the fact that even the best software packages on the market often
make incorrect assignments during deconvolution. What is more
important is that deconvolution results may be inconsistent
between samples; very minor variations in the raw data can lead
to significant differences in results, since deconvolution output is
used for the ultimate identification and quantification of the metabolites. At present, extremely labor-intensive manual expert curation is an inevitable step if high-quality data are required.
A good metabolomics platform invests in quality. Strategies for
monitoring and improving quality include the following.
• Randomly assign several wells to hold the same known calibrant. The random assignment of the calibrant provides a
measure of how the magnitude of the noise is affected by
geometry. Some noise occurs because of periodic refill of solvents, being last in line for testing on a plate, or degraded robot
fingers.
• Locate aliquot replicates randomly but with restrictions. Random well assignment prevents systematic errors that accrue
from locating triplets or quadruplets in the same locations,
run after run. But balanced random assignment is better
because it avoids chance neighboring that may result in correlated noise. For example, one might randomize the placement
subject to the constraint that there is one aliquot in each
quadrant of the plate.
• Freeze and save some of the sample aliquots until after curation
has been completed. Complex procedures mean that peculiar
things can happen. A single reagent might be bad or one of the
internal calibrants might have been mixed incorrectly. Such
mistakes can affect estimates of certain metabolites but not
others.
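The quadrant-balanced placement of aliquots mentioned above can be sketched as follows. The code assumes an 8 × 12 (96-well) plate, four aliquots per sample, and quadrants defined by splitting rows and columns at the midpoint; all of these, along with the function names, are assumptions for illustration, and reserved wells for calibrants or blanks are not handled.

```python
import random

def quadrant_of(well, n_rows=8, n_cols=12):
    """Return the quadrant index (0-3) of a (row, col) well."""
    r, c = well
    return 2 * int(r >= n_rows // 2) + int(c >= n_cols // 2)

def place_aliquots(sample_ids, n_rows=8, n_cols=12, seed=None):
    """Give each sample four wells, one drawn at random from each quadrant."""
    rng = random.Random(seed)
    quads = {q: [] for q in range(4)}
    for r in range(n_rows):
        for c in range(n_cols):
            quads[quadrant_of((r, c), n_rows, n_cols)].append((r, c))
    layout = {sid: [] for sid in sample_ids}
    for q in range(4):
        if len(sample_ids) > len(quads[q]):
            raise ValueError("more samples than wells in a quadrant")
        # sample without replacement so that no well is used twice
        chosen = rng.sample(quads[q], len(sample_ids))
        for sid, well in zip(sample_ids, chosen):
            layout[sid].append(well)
    return layout

if __name__ == "__main__":
    for sid, wells in place_aliquots(["s1", "s2", "s3"], seed=0).items():
        print(sid, wells)
```

In practice one would first remove from each quadrant the wells reserved for process blanks and solvent blanks, and then draw the sample positions from what remains.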
4. Abundance Estimation
The primary purpose of metabolomics is abundance estimation.
The main steps for achieving this are locating the ion peaks in the
bivariate histogram, integrating the peaks to estimate total ion
counts, and then apportioning those counts to different metabolites.
A given metabolite compound usually has several distinct
fragmentation patterns, depending upon randomness in the ionization step. Sometimes, the molecule breaks at one bond, and
sometimes at another; but usually, breaks occur in a small number
of different ways. Therefore, the abundance signal is typically
distributed across multiple peaks in the two-dimensional histogram in which ion counts are plotted against elution time and
the mass/charge ratio. In general, the analyst knows the probability of each of the major fragmentation patterns and the location at
which the peaks should occur. But it must be borne in mind that
different metabolites may have some ion fragments in common, so
certain peaks are combinations of signal from several different
metabolites.
The first step is to locate the peaks. For nearly all the main
metabolites, the presumptive locations are known. This information is available from corporate or public libraries of fragmentation
outcomes. A prominent one is the NIST/EPA/NIH Mass Spectral
Library (15), initially developed by Steve Stein at the National
Institute of Standards and Technology. So, in principle, one
knows exactly where the peaks for each metabolite should appear
(this is importantly different from the case with proteomics).
However, platforms tend to drift, despite regular recalibration
and quality control. Thus, a particular run might have the peaks
slightly shifted, independently in both the elution time axis and the
mass-to-charge ratio axis. The amount of that shift may not be
constant across the entire range of the instrument; for example,
lightweight ions may be shifted a bit more than heavy ions. Also,
the amount of shift may be affected by the abundance; a dense
cloud of charged ions has internal electrodynamics that affects the
TOF measurements differently from a less dense cloud.
Since it is not possible that the two axis shifts could physically
interact, one can decompose the peak location problem into separate problems. The solution requires estimation of two warping
functions, f1(x) and f2(y), which fit the amount of shift at a given
location on each of the two axes (16). These functions must be
monotonic; if there is no shift, they perfectly prescribe the lines
f1(x) = x and f2(y) = y. If a warping function dips below that ideal
line, then the measurement axis is compressed at that location; if it
is above the line, then the measurement axis is stretched.
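One simple way to estimate a single-axis warping function is to interpolate between anchor peaks whose nominal locations are known, such as those of a calibrant. The sketch below builds a piecewise-linear, monotonic warp from such anchors. It is only one of many possible estimators (splines or monotone regression would be alternatives), and the function name and the idea of using calibrant peaks as anchors are assumptions of this example rather than the method of reference (16).

```python
from bisect import bisect_right

def make_warp(nominal, observed):
    """Return f with f(nominal[i]) = observed[i], linear in between.

    Both sequences must be strictly increasing after sorting by nominal
    location, which guarantees that the warp is monotonic. Points
    outside the anchor range are extrapolated from the end segments.
    """
    pairs = sorted(zip(nominal, observed))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    if len(xs) < 2:
        raise ValueError("need at least two anchor peaks")
    if any(b <= a for a, b in zip(xs, xs[1:])) or \
       any(b <= a for a, b in zip(ys, ys[1:])):
        raise ValueError("anchors do not define a monotonic warp")

    def warp(x):
        # index of the segment containing x, clamped to the valid range
        i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        return ys[i] + slope * (x - xs[i])

    return warp
```

With one such function per axis, say one fitted to elution times and one to mass/charge values, the expected location of every library peak in a given run can be predicted jointly instead of matching each peak to its nearest neighbor in isolation.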
Few platforms or analysts explicitly calculate warping functions.
Most just use software that implements decision rules; i.e., it is
known that cholesterol produces a peak at a given location, so the
system looks for the nearest peak and declares that to be the
appropriate ion fragment of cholesterol. Although this piecemeal
approach is quick to code and avoids some technical mathematics, it
is less accurate than simultaneously warping both axes to best
accommodate all of the signals. One implication is that the curation
step takes longer and is, thus, more costly. Another is that one does
not learn as much about the performance of the platform as one
might.
Once the peak location has been identified, the second step is
to calculate the number of ions at that peak. There are two main
issues: the peak is slightly smeared, with respect to both axes,
and the peak may be an underestimate due to saturation of the
ion counter.
The smearing of the peak can be complex. Typically, the spread
in the elution time axis is greater than the spread in the mass/
charge ratio axis. However, the mass/charge ratio axis has special
structure. First, there are isotope shadows. These occur when the
chemical structure of an ion contains atoms that have distinct but
common isotopes. The instrumentation is now sensitive enough to
resolve these into distinct peaks, nearby but separated along the
mass/charge axis, but essentially simultaneous on the elution time
axis. Additionally, there are several common adducts which characteristically attach to certain ions; in this case, there will be a second
trail of isotope shadow peaks, a little further to the right on the
mass/charge ratio axis and perhaps slightly delayed on the elution
time axis (17).
As previously mentioned, undercount of the ions occurs when
the abundance is high and the ion detector becomes saturated. In
this case, there are two strategies: one can try to adjust for the
undercount or perhaps impute the count in the saturated peak from
an unsaturated isotope shadow. Ideally, a proper statistical analysis
would combine the multiple signals, but this requires some mathematics and a clear understanding of the measurement capability
function of the hybrid ion counter. In practice, the curation process
is used to address peak abundance estimation outside the dynamic
range of the instrument.
$$\sum_{i=1}^{n} w\left(\lVert x - X_i \rVert\right)\left(Y_i - \theta' X_i\right)^2$$
Fig. 1. This graph shows nearly raw m/z, elution time, and abundance data from a process blank measurement. This
profiled data has been preprocessed by commercial software that is part of the mass spectrometer. In most applications,
researchers do not have direct access to nor detailed knowledge of that software, and must rely upon the capability of the
instrument's vendor, as ratified by regular calibration.
[Figure 2 plot: Intensity (×10^6) plotted against Elution Time and Mass/Charge.]
Fig. 2. This graph shows centroided data, in which a measurement error model has been used to deblur the peaks. This
concentrates the smeared signals shown in Fig. 1 into single peaks, and thus provides a much cleaner representation of
the ion abundance.
where wi picks out and weights the peaks that contribute to metabolite i, sm(z) smooths the raw bivariate histogram (i.e., accumulates
ion counts from all the isotope shadows and adducts), and b(m/z, t)
is the baseline correction subtracted during denoising.
In the previous equation, one usually takes the logarithm since
the main interest is ratios of abundances. (Using ratios eliminates
the effect of dilution, which can vary from sample to sample.) The
hope is that this measurement equation creates an independent
homoscedastic error term, and that components of variance analysis
(24) can ascribe a certain portion of the error to each of the
following sources: within-subject variation, within-tissue variation,
miscalibration of standards, measurement error, and so forth.
The law of propagation of error (also known as the delta method
(25)) says that the variance in the univariate estimate xi is approximately
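$$\mathrm{Var}[x_i] \approx \sum_{j=1}^{p} \left(\frac{\partial g_i}{\partial z_j}\right)^{2} \mathrm{Var}[z_j] + 2 \sum_{j<k} \frac{\partial g_i}{\partial z_j}\,\frac{\partial g_i}{\partial z_k}\,\mathrm{Cov}[z_j, z_k].$$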
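The first-order propagation-of-error approximation can be checked numerically. The sketch below evaluates it as the quadratic form of the gradient with the covariance matrix of the inputs, using a central-difference gradient. The function name, the step-size rule, and the log-ratio example are assumptions made for illustration.

```python
import math

def delta_method_variance(g, z, cov, h=1e-6):
    """Approximate Var[g(z)] by grad' * cov * grad.

    g   : function of a list of p inputs
    z   : list of p input values (the point of linearization)
    cov : p x p covariance matrix of the inputs (list of lists)
    The gradient is estimated by central differences.
    """
    p = len(z)
    grad = []
    for j in range(p):
        step = h * max(1.0, abs(z[j]))
        up, down = list(z), list(z)
        up[j] += step
        down[j] -= step
        grad.append((g(up) - g(down)) / (2.0 * step))
    # The full double sum covers the variance terms (j == k) and counts
    # each covariance term twice (j != k).
    return sum(grad[j] * grad[k] * cov[j][k]
               for j in range(p) for k in range(p))

if __name__ == "__main__":
    # Log ratio of two abundances with independent measurement errors.
    log_ratio = lambda z: math.log(z[0] / z[1])
    print(delta_method_variance(log_ratio, [2.0, 4.0],
                                [[0.04, 0.0], [0.0, 0.16]]))
```

The log-ratio case is the one most relevant to the discussion above, since abundances are typically compared as ratios on the logarithmic scale; the approximation is good when the relative errors of the inputs are small.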
focus is on which has the smallest variance and hence the greatest
replicability). In fact, it is fundamentally impossible to decide which
laboratory or platform gives the right answer; one can only
estimate the differences between laboratories. In statistical language, the true value of the measurand is not identifiable (25),
but contrasts between laboratories are identifiable. Therefore,
good metabolomics platforms are ones with small variance within
their range of measurement. Calibration can then tune the output
to match that from other systems.
The basis for such cross-platform calibration is the key comparison
designs (26). Here, the same samples (aliquots of Grob or some
tissue) are sent to multiple labs, and each lab produces its own
estimate and a corresponding estimate of uncertainty. There are
several prominent key comparison designs. In the star design, after
a laboratory measures the sample, it is returned to the starting point
for remeasurement (to ensure that transit has not altered the sample). In the circle design, the sample is not remeasured until all of
the participating laboratories have measured it. The latter is less
expensive, but if there is contamination during the process, it is
difficult to determine where along the exchange that happened.
The Mandel bundle-of-lines model (27) is a standard method
for the analysis of the key comparison designs. Here, the measurement $X_{ij}$ on sample j at laboratory i is modeled as
$$X_{ij} = a_i + b_i t_j + e_{ij},$$
where $t_j$ is the unknown true value of the sample, $a_i$ and $b_i$ determine the linear calibration for lab i, and $e_{ij} \sim N(0, \sigma_{ij}^2)$ is the
measurement error.
Because $t_j$ is not estimable, one must impose constraints. A
frequentist would typically require that $\bar{a} = 1$, $\bar{b} = 1$, and $\bar{t} = 0$.
However, many other constraints would work, and forcing the
average t to some sensible but arbitrary value $v_j$ can be convenient.
A Bayesian would put priors on the laboratory coefficients
and the error variance. Natural priors would be $a_i \sim N(0, \sigma_A^2)$, $b_i \sim N(1, \sigma_B^2)$, and $t_j \sim N(v_j, \sigma_T^2)$.
A multivariate version of the Mandel bundle-of-lines model
would best serve metabolomics needs. The strategy is straightforward, but to our knowledge it has not been developed. Instead,
people do one-at-a-time calibrations. Usually, they use the same
sample but consider the measurement on each metabolite separately, ignoring known correlations among the measurements.
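A one-metabolite-at-a-time fit of the bundle-of-lines model can be sketched with ordinary least squares. The code below takes the mean across laboratories as a consensus value for each sample and then regresses each laboratory's measurements on that consensus. This is a simplified two-step procedure, not the full model: it assumes equal error variances, and it implicitly uses the identifiability constraints that the intercepts average to zero and the slopes average to one, which is one convenient choice among the many that would work. The function name is hypothetical.

```python
def fit_bundle_of_lines(X):
    """Fit X[i][j] = a_i + b_i * t_j by a two-step least squares.

    X[i][j] is the measurement of sample j at laboratory i.
    Returns (a, b, t): per-lab intercepts, per-lab slopes, and the
    consensus sample values (means across laboratories).
    """
    n_labs, n_samples = len(X), len(X[0])
    # Step 1: consensus value of each sample.
    t = [sum(X[i][j] for i in range(n_labs)) / n_labs
         for j in range(n_samples)]
    t_bar = sum(t) / n_samples
    stt = sum((tj - t_bar) ** 2 for tj in t)
    if stt == 0:
        raise ValueError("samples must differ to estimate slopes")
    # Step 2: simple linear regression of each lab on the consensus.
    a, b = [], []
    for row in X:
        x_bar = sum(row) / n_samples
        slope = sum((tj - t_bar) * (xij - x_bar)
                    for tj, xij in zip(t, row)) / stt
        b.append(slope)
        a.append(x_bar - slope * t_bar)
    return a, b, t
```

Because the consensus is itself an average of the laboratory measurements, the fitted intercepts sum to zero and the fitted slopes average to one by construction; a multivariate version would instead have to model the correlations among metabolites that this sketch ignores.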
5. Disease Diagnosis
Although metabolomics may serve many purposes, a key application is the diagnosis of disease. For most situations, this entails
the Curse of Dimensionality. When the data are high dimensional,
Geometric, which includes discriminant analysis, flexible discriminant analysis, partial least squares, and recursive partitioning: These methods tend to be based on fairly specific models.
xn (in Fisher's example, xi was the sepal length, sepal width, petal
length, and petal width on the ith iris). Fisher's linear discriminant
analysis assumes that the two populations have multivariate normal
distributions with common unknown covariance matrix and different unknown mean vectors. It assigns a new observation x to the
population whose mean has the smallest Mahalanobis distance to
the observation:
$$d_M(x, \bar{x}_j) = \left[(x - \bar{x}_j)'\, S^{-1} (x - \bar{x}_j)\right]^{1/2},$$
where $\bar{x}_1$ is the sample mean for the training sample vectors from
the first class, $\bar{x}_2$ is the sample mean for the training sample vectors
from the second class, and S is the sample covariance matrix.
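To analyze the effect of noise in linear discriminant analysis,
suppose one has a fixed sample size n and assume that the covariance
matrices are known to be $\sigma^2 I$. Write the estimates of the means as:
$$\hat{\mu}_1 = \mu_1 + \frac{\sigma}{\sqrt{n}}\,\nu_1, \qquad \hat{\mu}_2 = \mu_2 + \frac{\sigma}{\sqrt{n}}\,\nu_2.$$
Also, write the new observation to classify as:
$$x = \mu_1 + \nu.$$
Fisher's classification rule assigns population 1 if
$$d_M(x, \hat{\mu}_1) < d_M(x, \hat{\mu}_2),$$
and under our assumptions, this is equivalent to:
$$(x - \hat{\mu}_1)'(x - \hat{\mu}_1) < (x - \hat{\mu}_2)'(x - \hat{\mu}_2).$$
Writing $x$, $\hat{\mu}_1$, and $\hat{\mu}_2$ in terms of $\nu_1$, $\nu_2$, and $\nu$ shows that this
criterion is equivalent to:
$$\left(\nu - \frac{\sigma}{\sqrt{n}}\,\nu_1\right)'\left(\nu - \frac{\sigma}{\sqrt{n}}\,\nu_1\right) < \left(\mu_1 - \mu_2 + \nu - \frac{\sigma}{\sqrt{n}}\,\nu_2\right)'\left(\mu_1 - \mu_2 + \nu - \frac{\sigma}{\sqrt{n}}\,\nu_2\right).$$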
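Under the simplifying assumption used in this derivation, namely a known covariance matrix proportional to the identity, the Mahalanobis rule reduces to assigning the observation to the class with the nearer estimated mean in Euclidean distance. The sketch below implements that special case only; it does not estimate or invert a general covariance matrix, and the function names are hypothetical.

```python
def sample_mean(vectors):
    """Componentwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(x, m):
    """Squared Euclidean distance between two vectors."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m))

def classify(x, train1, train2):
    """Return 1 or 2: the class whose estimated mean is closer to x.

    Equivalent to the Mahalanobis rule when the common covariance is
    a multiple of the identity; ties go to class 2.
    """
    m1, m2 = sample_mean(train1), sample_mean(train2)
    return 1 if squared_distance(x, m1) < squared_distance(x, m2) else 2

if __name__ == "__main__":
    class1 = [[0.0, 0.0], [2.0, 0.0]]
    class2 = [[10.0, 10.0], [12.0, 10.0]]
    print(classify([1.0, 1.0], class1, class2))
```

Simulating this rule with a fixed training sample size while increasing the number of uninformative coordinates is one way to observe how noise in the estimated means degrades classification as the dimension grows.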
Now, consider the same problem from a Curse of Dimensionality perspective. Without using asymptotics in the sample size n,
the rule assigning population 1 can be written as
2sn0
r !
2
2
m1 m2
sn1 > km1 m2 k2 s2 n01 n2
n
n
!1=2 3
5;
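$$\max_{\beta,\,\beta_0,\,\lVert\beta\rVert = 1} d \quad \text{subject to} \quad y_i\left(\beta^T x_i + \beta_0\right) \ge d, \qquad i = 1, \ldots, n.$$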
i 1; . . . n
$$\frac{1}{2}\lVert\beta\rVert^2 - \sum_{i=1}^{n} \lambda_i\, y_i\left(x_i'\beta + \beta_0\right) + \sum_{i=1}^{n} \lambda_i.$$
n
X
li
j 1
or the neural network basis $K(x, x^*) = \tanh(a_1 \langle x, x^* \rangle + a_2)$.
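To see how this works, suppose p = 2 and use the rth degree
polynomial basis with r = 2. Then,
$$\begin{aligned}
K(x, x^*) &= \left(1 + \langle x, x^* \rangle\right)^2 \\
&= \left(1 + x_1 x_1^* + x_2 x_2^*\right)^2 \\
&= 1 + 2 x_1 x_1^* + 2 x_2 x_2^* + (x_1 x_1^*)^2 + (x_2 x_2^*)^2 + 2 x_1 x_2 x_1^* x_2^*.
\end{aligned}$$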
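The expansion shows that the degree-2 polynomial kernel is an ordinary inner product in a six-dimensional feature space. The following sketch checks this numerically; the explicit feature map with its factors of the square root of 2 follows directly from the expansion, while the function names are illustrative.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (1 + <x, z>)**2 for 2-D inputs."""
    return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

def poly2_features(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1],
            x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, -1.0)
    # Both values agree up to floating-point rounding.
    print(poly2_kernel(x, z), dot(poly2_features(x), poly2_features(z)))
```

The practical point is that the kernel evaluates this inner product without ever forming the feature vectors, which matters when the degree or the input dimension is large.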
Fig. 3. CART tree used to classify patients with respect to their risk of a heart attack. It is
based upon an example in (46).
It handles missing data very well, and can maintain high levels
of accuracy when up to 80% of the data are missing at random.
Its logic seems well suited to metabolomics; recursive partitioning is a natural way to deal with data generated in pathways.
6. Case Studies
The following two metabolomics case studies make use of data
mining techniques. Also, they illustrate the inferential methods
discussed in Subheading 5.
6.1. Classifying ALS Patients
$$\sum_{i=1}^{n} \left[1 - y_i\left(\beta_0 + \beta^T x_i\right)\right]_+ + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert,$$
where the first sum is over the observations and the second sum is
over the coefficients on the basis elements. The function []+ is zero
when the argument is negative, and otherwise it equals the argument. The L1 penalty encourages most of the coefficients to be
zero, and thus it performs variable selection.
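The objective just described, a sum of hinge losses plus an L1 penalty on the coefficients, is straightforward to evaluate. The sketch below computes it for given coefficients; it does not perform the minimization, and the function name and argument layout are assumptions made for illustration.

```python
def l1_svm_objective(beta0, beta, X, y, lam):
    """Hinge loss plus L1 penalty for a linear classifier.

    beta0 : intercept
    beta  : list of p coefficients
    X     : list of n feature vectors (each of length p)
    y     : list of n class labels coded as +1 or -1
    lam   : penalty weight (lambda)
    """
    hinge = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (beta0 + sum(b * v for b, v in zip(beta, xi)))
        hinge += max(0.0, 1.0 - margin)   # the [.]+ operator
    penalty = lam * sum(abs(b) for b in beta)
    return hinge + penalty

if __name__ == "__main__":
    X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
    y = [1, -1, 1]
    print(l1_svm_objective(0.0, [1.0, 0.0], X, y, lam=0.5))
```

In the example, the second coefficient is zero and therefore contributes nothing to the penalty, which is the mechanism by which the L1 term favors sparse solutions.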
The SCAD SVM replaces the L1 penalty with a non-convex
penalty that asymptotes to a constant. Thus, all large coefficients
tend to have nearly the same penalty, as opposed to having penalties
proportional to their absolute values. As a result, SCAD SVM
requires more computation, but it avoids overpenalizing coefficients that are large but necessary.
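To make the contrast with the L1 penalty concrete, the sketch below implements the smoothly clipped absolute deviation penalty in the form commonly attributed to Fan and Li, with the conventional shape parameter a = 3.7. The chapter does not state which parameterization was used in the analysis, so the formula and default value here should be read as assumptions.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for a single coefficient.

    Linear (like L1) for small |theta|, quadratic in a transition
    zone, and constant for |theta| > a * lam.
    """
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2.0 * a * lam * t - t * t - lam * lam) / (2.0 * (a - 1.0))
    return lam * lam * (a + 1.0) / 2.0

def l1_penalty(theta, lam):
    """L1 penalty for a single coefficient, for comparison."""
    return lam * abs(theta)

if __name__ == "__main__":
    for theta in (0.5, 2.0, 10.0):
        print(theta, l1_penalty(theta, 1.0), scad_penalty(theta, 1.0))
```

For small coefficients the two penalties coincide, while for large coefficients the SCAD penalty levels off at a constant, which is why large but necessary coefficients are not shrunk as heavily.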
Several other analyses were performed but were not definitive.
Besides relatively standard methods of classification, we tried a multiple tree analysis with FIRMPlus™ software from Golden Helix, as
The early labor subsides, and the pregnancy continues for the
normal duration.
True\predicted
Term
Inflammation
Term
39
Inflammation
32
No inflammation
29
No inflammation
For those who had preterm delivery and inflammation or infection, the carbohydrates were low and the amino acids were high.
7. Exercises
1. Using the profiled data shown in Fig. 1 that is available from the
Beecher Laboratory at http://mctp-ap1.path.med.umich.edu:8010/pub, write a program to deconvolve the measurements, thereby producing cleaner data, such as that shown in
Fig. 2. Most analysts assume that the signal is blurred according
to a bivariate Gaussian distribution with mean centered at the
true value and covariance matrix given by the performance specifications for the instrument. A more thoughtful analysis might
assume a gamma distribution to model blur in the elution time
(because, for physical reasons, it is unlikely for an ion to arrive
early, but there are several mechanisms that might delay it) and an
independent univariate Gaussian distribution to model blur in
the m/z measurement.
2. Using the same data, write a program to perform baseline
correction of the profile data (for example, with Loess). In
principle, baseline correction has already been done by the
software in the mass spectrometer. But if the estimated correction is statistically significantly different from zero, this suggests
that the automatic baseline correction software is inadequate.
(Hint: To assess whether the new correction is significantly
different from zero, use the bootstrap.)
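As a starting point for the hint in Exercise 2, the following sketch computes a percentile bootstrap interval for the mean of a set of re-estimated baseline corrections. It assumes that the corrections have already been obtained at a set of points (for example from a Loess fit, which is not shown) and that they can be resampled as if they were independent; the function name and the numbers are hypothetical.

```python
import random


def bootstrap_mean_interval(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        # Resample the observations with replacement.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical baseline corrections estimated at ten points of a profile.
corrections = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 0.95, 1.05, 0.85]
low, high = bootstrap_mean_interval(corrections)
differs_from_zero = not (low <= 0.0 <= high)
```

If the interval excludes zero, the residual baseline is unlikely to be noise alone. Neighboring points of a real profile are correlated, so a block bootstrap may be more appropriate in practice.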
Chapter 17
Introduction to the Analysis of Environmental
Sequences: Metagenomics with MEGAN
Daniel H. Huson and Suparna Mitra
Abstract
Metagenomics is the study of microbial organisms using sequencing applied directly to environmental
samples. Similarly, in metatranscriptomics and metaproteomics, the RNA and protein sequences of such
samples are studied. The analysis of these kinds of data often starts by asking the questions "who is out
there?", "what are they doing?", and "how do they compare?". In this chapter, we describe how these
computational questions can be addressed using MEGAN, the MEtaGenome ANalyzer program. We first
show how to analyze the taxonomic and functional content of a single dataset and then show how such
analyses can be performed in a comparative fashion. We demonstrate how to compare different datasets
using ecological indices and other distance measures. The discussion is conducted using a number of
published marine datasets comprising metagenomic, metatranscriptomic, metaproteomic, and 16S rRNA
data.
Key words: MEGAN, RMA-file, Taxonomic analysis, Functional analysis, Comparative metagenomics, 16S analysis, KEGG pathways, SEED subsystems
1. Introduction
In metagenomics, the aim is to understand the composition and
operation of complex microbial consortia in environmental samples through sequencing and analysis of their DNA. Similarly,
metatranscriptomics and metaproteomics target the RNA and
proteins contained in such samples. Technological advances in
next-generation sequencing methods are fueling a rapid increase
in the number and scope of environmental sequencing projects. In
consequence, there is a dramatic increase in the volume of
sequence data to be analyzed. The first three basic computational
tasks for such data are taxonomic analysis, functional analysis, and
comparative analysis. These are also known as the who is out
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_17,
© Springer Science+Business Media, LLC 2012
2. Getting Started
Throughout this chapter, we use eight published datasets from a
controlled coastal ocean mesocosm study involving an induced
phytoplankton bloom as a running example (8). Four are metagenomes (labeled DNA) and four are metatranscriptomes
(labeled cDNA). Four were sampled at the peak of the bloom
(labeled Time1) and the other four after the bloom had collapsed
(labeled Time2). In each case we report on two replicates (labeled
Bag1 and Bag6, respectively). Based on these labels, we use
the following names for the datasets: DNA-Time1-Bag1, DNA-Time1-Bag2, DNA-Time2-Bag1, DNA-Time2-Bag2, cDNA-Time1-Bag1, cDNA-Time1-Bag2, cDNA-Time2-Bag1, and
cDNA-Time2-Bag2.
2.1. BLAST Computation
3. Taxonomic Analysis
Although the diversity of the microbial world is believed to be
huge, to date fewer than 6,000 microbial species have been named
(13), and most of these are represented by only one or a few
genes in public sequence databases. Current databases are biased
toward organisms of specific interest and were not explicitly populated to provide an unbiased representative sampling of the true
biodiversity. For this reason, at present, taxonomic analysis usually
cannot be based on high-similarity sequence matching, but rather
depends on the detection of remote homologies using more sensitive methods, such as BLASTX.
One type of approach is to use phylogenetic markers to distinguish between different species in a sample. The most widely used
marker is the SSU rRNA gene; others include RecA, EF-Tu, EF-G,
HSP70, and RNA polymerase B (RpoB) (14). A main advantage
of this type of approach is that such genes have been studied in
detail and there are large phylogenies of high quality available that
can be used to phylogenetically place reads. However, one problem
is that the universal primers used to target specific genes are not
truly universal and it can happen that only a portion of the actual
diversity is captured (15). While the use of a random shotgun
approach can overcome this problem, less than 1% of the reads in
a random shotgun dataset correspond to commonly used phylogenetic marker genes (16), and it seems wasteful that more than 99%
of the reads will remain unused (and unclassified).
A second type of method is based on analyzing the nucleotide
composition of reads. In a supervised approach (see, e.g., ref. 11,
12), the nucleotide composition of a collection of reference genomes is used to train a classifier, which is then used to place a given
set of reads into taxonomic bins. In an unsupervised approach (see,
e.g., ref. 17), reads are clustered by composition similarity and then
the resulting clusters are analyzed in an attempt to place the reads.
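To illustrate what the nucleotide composition of a read can look like in practice, the sketch below computes a k-mer frequency profile. The chapter does not say which composition features the cited methods use, so the choice of k-mers and the function name kmer_composition are assumptions made for this example.

```python
from itertools import product


def kmer_composition(seq, k=4):
    """Relative frequency of every k-mer over A, C, G, T in `seq`."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Dinucleotide profile (k = 2) of a short, made-up read.
profile = kmer_composition("ACGTACGTAC", k=2)
```

Such vectors can serve as input to a classifier trained on reference genomes (supervised) or to a clustering method (unsupervised).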
The approach adopted in MEGAN is to compare random
shotgun reads against the NCBI-NR database (or some other
appropriate database) to find homologous sequences, thus making
use of the fact that remote homologies are easier to detect on the
protein level. The program treats all sequence matches of high
significance as equally valid indications that the given read represents a gene that is present in the corresponding organism. In more
detail, each read is placed on the lowest common ancestor (in the
NCBI taxonomy) of all the organisms that are known to contain
the gene present in the read. So, in essence, the placement of a read
is governed by the gene content of the available reference genomes
and thus we refer to our method as the LCA gene-content approach.
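The placement step can be sketched in a few lines of Python. This is not MEGAN's implementation: the toy taxonomy, the child-to-parent dictionary, and the function names are assumptions introduced only to show how a lowest common ancestor is found.

```python
def path_to_root(taxon, parent):
    """List of nodes from `taxon` up to the root of the taxonomy."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node that lies on the root path of every taxon in `taxa`."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("need at least one taxon")
    # Root-first path of the first taxon, shortened as other taxa are added.
    common = path_to_root(taxa[0], parent)[::-1]
    for taxon in taxa[1:]:
        other = path_to_root(taxon, parent)[::-1]
        shared = 0
        while (shared < len(common) and shared < len(other)
               and common[shared] == other[shared]):
            shared += 1
        common = common[:shared]
    if not common:
        raise ValueError("taxa do not share a root")
    return common[-1]


# Hypothetical toy taxonomy, stored as child -> parent.
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Vibrio cholerae": "Proteobacteria",
    "Bacteroidetes": "Bacteria",
    "Bacteroides fragilis": "Bacteroidetes",
}

narrow = lowest_common_ancestor(["Escherichia coli", "Vibrio cholerae"], PARENT)
broad = lowest_common_ancestor(
    ["Escherichia coli", "Vibrio cholerae", "Bacteroides fragilis"], PARENT)
```

A read whose significant matches span several phyla therefore ends up on a high-level node, which is the conservative behavior of the LCA gene-content approach.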
An attractive feature of the LCA gene-content approach is that
it is inherently conservative and is more prone to err toward noninformative assignments of reads (to high-level nodes in the taxonomy) than toward false-positive assignments (placing reads from
one species onto the node of another species). In particular, genes
that are susceptible to horizontal gene transfer will not be assigned
to either of the participating species, if both donor and acceptor
species are represented in the reference database.
MEGAN provides a number of parameters to tune the LCA
algorithm. First, the min-score parameter allows one to set a minimum value that the bit score must attain so that a BLAST match is
considered by the LCA algorithm. Second, the top-percent parameter restricts the set of considered matches further to those whose bit
score lies within the given percentage of the highest score. Third,
the min-support parameter is used to specify the minimum number
of reads that must be assigned to a taxon before that taxon is
considered present. If the number of reads assigned to a node
does not meet the threshold, then the reads are moved up the
taxonomy until they reach a node that has the number of reads
required.
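The effect of the three parameters can be sketched as follows. The code is a simplified illustration and not MEGAN's own code: it assumes that "within the given percentage of the highest score" means a cutoff of best score times (1 - top_percent/100), and the taxonomy, scores, and function names are hypothetical.

```python
def select_matches(matches, min_score, top_percent):
    """Keep (taxon, bit_score) pairs that pass min-score and top-percent."""
    passing = [(t, s) for t, s in matches if s >= min_score]
    if not passing:
        return []
    best = max(s for _, s in passing)
    cutoff = best * (1.0 - top_percent / 100.0)
    return [(t, s) for t, s in passing if s >= cutoff]


def depth(taxon, parent):
    """Number of steps from `taxon` up to the root."""
    d = 0
    while taxon in parent:
        taxon = parent[taxon]
        d += 1
    return d


def apply_min_support(assigned, parent, min_support):
    """Move reads from weakly supported nodes up the taxonomy."""
    counts = dict(assigned)
    max_depth = max((depth(t, parent) for t in counts), default=0)
    # Work from the deepest level upward so that reads pushed to a parent
    # are included when the parent itself is checked.
    for d in range(max_depth, 0, -1):
        for taxon in [t for t in counts if depth(t, parent) == d]:
            if counts[taxon] < min_support:
                up = parent[taxon]
                counts[up] = counts.get(up, 0) + counts.pop(taxon)
    return counts


# Hypothetical toy taxonomy (child -> parent) and example input.
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Vibrio cholerae": "Proteobacteria",
    "Bacteroidetes": "Bacteria",
    "Bacteroides fragilis": "Bacteroidetes",
}
kept = select_matches(
    [("A", 100.0), ("B", 95.0), ("C", 80.0), ("D", 30.0)],
    min_score=35.0, top_percent=10.0)
supported = apply_min_support(
    {"Escherichia coli": 3, "Vibrio cholerae": 40, "Bacteroides fragilis": 2},
    PARENT, min_support=5)
```

In this example, the reads of the two weakly supported species travel upward until their combined count reaches the threshold at the node Bacteria.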
If the program is given paired reads (i.e., pairs of reads each
sequenced from different ends of the same clone), then in its
paired-end mode MEGAN uses a modified version of the LCA
algorithm that boosts the bit score of any match for one read of the
pair that is confirmed by a match to the same reference species for
the other read, by adding an increment of 20% to the bit score.
Moreover, if one read is given a more specific assignment than the
other by the LCA algorithm, then both reads are assigned to the
more specific taxon.
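The score adjustment for paired reads can be illustrated with a short sketch. It assumes that the matches of each read are available as a mapping from reference species to bit score; this data layout and the function name are assumptions, and only the 20% increment is taken from the text.

```python
def boost_paired_scores(matches_a, matches_b, boost=0.20):
    """Raise the bit score of matches confirmed by the mate read.

    matches_a, matches_b : dict mapping reference species -> bit score
    boost                : relative increment for confirmed matches
    """
    confirmed = set(matches_a) & set(matches_b)

    def adjust(matches):
        return {species: score * (1.0 + boost) if species in confirmed else score
                for species, score in matches.items()}

    return adjust(matches_a), adjust(matches_b)


# Hypothetical matches for the two reads of one pair.
read1 = {"Species X": 100.0, "Species Y": 50.0}
read2 = {"Species X": 80.0}
boosted1, boosted2 = boost_paired_scores(read1, read2)
```

Only the matches to Species X are confirmed by both reads, so only their scores are increased before the LCA algorithm is applied.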
In summary, MEGAN uses the NCBI taxonomy to bin all reads
of a given metagenome dataset. The NCBI taxonomy provides
names and accession numbers for over 670,000 taxa, including
approximately 287,000 eukaryota, 28,000 bacteria, and 62,000
viruses. The species are hierarchically classified at the levels of superkingdom, kingdom, phylum, class, order, family, genus, and species
(and some unofficial clades in between, such as groups and subspecies).
We now demonstrate how to perform a taxonomic analysis of
the marine sample DNA-Time1-Bag1 using MEGAN. The first
step is to compare the set of reads (in this case, approximately
200,000) against the NCBI-NR database using BLASTX, in this
case resulting in an 18-GB file containing approximately 30 million
high-scoring pairs (or BLAST hits). The second step is then to
process the BLAST file and reads using MEGAN to obtain an
RMA file DNA-Time1-Bag1.rma, which is about 5 GB in size, if
MEGAN is set to embed all reads and relevant BLAST hits in the
file.
MEGAN can then be used to interactively explore the dataset.
In Fig. 1, we show the assignment of reads to the NCBI taxonomy.
Each node is labeled by a taxon and the number of reads assigned to
it. The size of a node is scaled logarithmically to represent the
number of assigned reads. Optionally, the program can also display
the number of reads summarized by a node, that is, the number of
reads that are assigned to the node or to any of its descendants in
the taxonomy. The program allows one to interactively inspect the
assignment of reads to a specific node, to drill down to the individual BLAST hits that support the assignment of a read to a node, and
to export all reads (and their matches, if desired) that were assigned
to a specific part of the NCBI taxonomy. Additionally, one can
select a set of taxa and then use MEGAN to generate different
types of charts for them.
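The difference between the number of reads assigned to a node and the number summarized by it can be made explicit with a small sketch. The toy taxonomy and function names are hypothetical, and the exact logarithmic scaling used for node sizes is not given in the text, so log(count + 1) is an assumption.

```python
import math


def summarized_counts(assigned, parent):
    """Reads assigned to each node or to any of its descendants."""
    summary = {}
    for taxon, n in assigned.items():
        node = taxon
        summary[node] = summary.get(node, 0) + n
        # Add the reads to every ancestor up to the root.
        while node in parent:
            node = parent[node]
            summary[node] = summary.get(node, 0) + n
    return summary


def node_size(count, scale=1.0):
    """Logarithmically scaled node size (assumed form)."""
    return scale * math.log(count + 1)


# Hypothetical toy taxonomy (child -> parent) and assigned read counts.
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Vibrio cholerae": "Proteobacteria",
    "Bacteroidetes": "Bacteria",
    "Bacteroides fragilis": "Bacteroidetes",
}
assigned = {"Escherichia coli": 10, "Vibrio cholerae": 5,
            "Proteobacteria": 2, "Bacteroides fragilis": 1}
summary = summarized_counts(assigned, PARENT)
```

Here Proteobacteria has 2 assigned reads but summarizes 17, because the reads of its two species are included.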
4. Functional Analysis
MEGAN 4 provides two different methods for analyzing the functional content of a dataset.
4.1. SEED Analysis with MEGAN
Fig. 1. Taxonomic analysis of 200,000 reads of a marine dataset (DNA-Time1-Bag1, (8)) by MEGAN. Different parts of the
taxonomy have been expanded to different ranks. Each node is labeled by a taxon and the number of reads assigned to
the taxon, or to any taxon below it in the taxonomy. The size of each node is scaled logarithmically to represent the number
of assigned reads.
Fig. 2. Part of a SEED-based functional analysis of 200,000 reads from a marine dataset (DNA-Time1-Bag1, (8)). Details of
the Mannose Metabolism subtree of Carbohydrates are shown.
Fig. 3. The citrate cycle KEGG pathway (4), as displayed by MEGAN. Numbered rectangles represent different enzymes
that are shaded on a scale from white (corresponding to 0 reads) to dark green (corresponding to 330 reads, for this
example) to indicate the number of reads assigned to each enzyme.
5. Comparing Datasets
Environmental samples are rarely studied in isolation and thus
the task of comparing different datasets is important. MEGAN
supports both visual and computational comparison of multiple
datasets.
5.1. Visual Comparison of Metagenomes
To facilitate the visual comparison of a collection of different datasets, MEGAN provides a comparison view that is displayed as a tree
in which each node shows the number of reads assigned to it for
each of the datasets. The counts can be displayed as a pie chart, a bar
chart, or a heat map. To construct such a view using MEGAN,
first the datasets must be individually opened in the program. Using
Fig. 4. Comparative visualization of eight marine datasets (8), displaying the bacterial part of the NCBI taxonomy down to
the rank of Phylum. The number of reads assigned to a node is indicated by a logarithmically scaled bar chart. The node
labeled Chlamydiae/Verrucomicrobia group is shown in a selected mode, in which both the number of reads assigned to
the node (Ass) and the number summarized by the node (Sum) are listed for the eight datasets.
Fig. 5. Comparative visualization of eight marine datasets based on their functional content using SEED subsystems. Here,
MEGAN has been set to display the full subtree below the node representing Flagellar motility.
Fig. 6. Split network representing Goodall's index for the eight marine datasets, based on all leaves of the tree shown in
Fig. 4, except for the "Not Assigned" and "No Hits" nodes.
6. Analyzing Other Types of Data
So far, our focus has been on metagenomic and metatranscriptomic data. However, metaproteomic data can be analyzed just as easily. We illustrate this using a set of 8,073 peptide
sequences recently published in (23). In a first analysis, one can
simply compare the sequences against the NCBI-NR database
using the BLASTP program. Because the peptides are very short,
only about 1,700 give rise to significant hits. In a more sophisticated two-stage approach described in (23), the peptide
sequences are first blasted against much longer environmental
sequences that are available from the Global Ocean Sampling
(GOS) project (24). Then the GOS sequences that are hit by
the peptide sequences are blasted against NR and the LCA
algorithm is applied to determine taxonomic assignments for
the reads.
Finally, we would like to demonstrate that MEGAN can also
be used to analyze sequencing reads obtained in an approach
targeted at 16S rRNA sequences (25). To illustrate this, we use a
set of 849 16S rRNA reads published in (23). The sequences were
compared against the Silva database (26) using BLASTN
and then processed by MEGAN. All three analyses are compared
in Fig. 7.
Fig. 7. Comparative visualization of two different analyses of a set of 8,073 metaproteomic sequences (23). The data
labeled Peptides-NR-Morris2010 were obtained as a result of blasting the sequences against the NCBI-NR database. The
data labeled Peptides-GOS-CAMERA-Morris2010 were obtained in a more sophisticated two-stage approach, as described
in (23). In addition, we display the result of an analysis of 849 16S rRNA sequences, based on a BLASTN comparison
against the Silva database (26).
7. Discussion and Outlook
The main goal of MEGAN is to provide a powerful and easy-to-use
tool to explore, analyze, and compare the taxonomic and functional
content of multiple metagenome datasets. MEGAN is based on the
comparison of reads against a reference database. Unfortunately, at
present, publicly available sequence databases cover only a very
small percentage of the true microbial diversity believed to exist in
nature. While projects such as GEBA (27) and the Human Microbiome Project (28) aim at addressing this problem, progress in
sequencing new reference genomes will be slow and so the analysis
of complex environmental samples will remain very challenging.
8. Exercises
Download and install MEGAN from http://www-ab.informatik.uni-tuebingen.de/software/megan/welcome.html. Download four
preprocessed mouse datasets (MEGAN's own RMA files) from
http://www-ab2.informatik.uni-tuebingen.de/megan/rma/BookChap_data.
These analyses are based on datasets described in
(29). Using MEGAN, open the files.
1. Analyze the taxonomic content of mouse samples and compare
the results with the published results.
2. Analyze the functional content of mouse samples and compare
the results with the published results.
3. Compare all four mouse samples and try to identify differences
that are correlated with the different diets.
References
1. Huson DH, Auch AF, Qi J, Schuster SC
(2007) MEGAN analysis of metagenomic
data. Genome Res 17: 377–386.
2. Ashburner M, Ball CA, Blake JA, Botstein D,
Butler H, et al. (2000) Gene ontology: tool for
the unification of biology. The Gene Ontology
Consortium. Nat Genet 25: 25–29.
3. Overbeek R, Begley T, Butler RM, Choudhuri
JV, Chuang HY, et al. (2005) The subsystems
approach to genome annotation and its use in
the project to annotate 1000 genomes. Nucleic
Acids Res 33: 5691–5702.
4. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids
Res 28: 27–30.
5. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search
tool. J Mol Biol 215: 403–410.
6. Benson D, Karsch-Mizrachi I, Lipman D,
Ostell J, Wheeler D (2005) GenBank. Nucleic
Acids Res 1: D34–38.
7. Huson DH, Mitra S, Ruscheweyh HJ, Weber
N, Schuster SC (2011) Integrative analysis of
environmental sequences using MEGAN 4.
Under revision.
8. Gilbert JA, Field D, Huang Y, Edwards R, Li
W, et al. (2008) Detection of large numbers of
novel sequences in the metatranscriptomes of
complex marine microbial communities. PLoS
One 3: e3042.
Chapter 18
Analyzing Epigenome Data in Context of Genome
Evolution and Human Diseases
Lars Feuerbach, Konstantin Halachev, Yassen Assenov,
Fabian Muller, Christoph Bock, and Thomas Lengauer
Abstract
This chapter describes bioinformatic tools for analyzing epigenome differences between species and in
diseased versus normal cells. We illustrate the interplay of several Web-based tools in a case study of CpG
island evolution between human and mouse. Starting from a list of orthologous genes, we use the Galaxy
Web service to obtain gene coordinates for both species. These data are further analyzed in EpiGRAPH,
a Web-based tool that identifies statistically significant epigenetic differences between genome region sets.
Finally, we outline how the use of the statistical programming language R enables deeper insights into the
epigenetics of human diseases, which are difficult to obtain without writing custom scripts. In summary, our
tutorial describes how Web-based tools provide an easy entry into epigenome data analysis while also
highlighting the benefits of learning a scripting language in order to unlock the vast potential of public
epigenome datasets.
Key words: Epigenomics, Computational epigenetics, DNA methylation, CpG islands, Comparative
genomics, Galaxy, EpiGRAPH, R statistical programming language
1. Introduction
Readers who are new to the field of epigenetics may wonder why
DNA sequence alone is not sufficient to encode the information
required by a cell. To answer this question, imagine that the book
you are currently reading consisted of plain text only, without paragraphs, headlines, or any other markup. Finding specific pieces of
information would become a time-consuming task. Likewise, proteins, such as polymerases, need guidance to find gene promoters
among the billions of nucleotides in a mammalian genome. As this
cellular markup differs between cell types, an additional layer of
information is required on top of (which is one of the many
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_18,
© Springer Science+Business Media, LLC 2012
L. Feuerbach et al.
2. Conservation Statistics on CpG Island Promoters
curated and analyzed in a recent study (15) comparing the distribution of CGIs in those promoters. This Jiang dataset can be
downloaded from http://mbe.oxfordjournals.org/cgi/content/full/msm128/DC1
(first supplementary table).
For studies that cannot benefit from such preparatory work,
Exercise 1 (see below) outlines how the approach can be
generalized to an arbitrary selection of species and gene sets. The
Galaxy analysis workflow is available online at
http://main.g2.bx.psu.edu/u/fmueller/w/conservation-of-cpg-island-promoters,
but it is recommended to perform the analysis manually to become
familiar with Galaxy.
2.1. Obtain Human Gene List from BioMart
To load a new dataset into the "History" panel, click on the "Get
Data" menu entry in the "Tools" panel. Several alternatives for data
acquisition are offered. In order to retrieve the human gene list,
we choose the "BioMart Central server" option. The browser
opens the BioMart interface. From the "-CHOOSE DATABASE-" pull-down menu, we choose a recent Ensembl instance
(for this analysis, "ENSEMBL GENES 58 (SANGER UK)" was
used, but the resource is constantly updated). The new pull-down
menu "-CHOOSE DATASET-" is displayed. Select the
"Homo sapiens genes (GRCh37)" option. Galaxy loads the new
dataset and displays it in the left panel. To select the subset of
genes of interest, click on "Filters". To limit the region
list to the genes from the Jiang dataset, choose "Gene:" from the
selection criteria in the right-hand area and check the box "ID list
limit". From the pull-down menu beside this box, we pick
"HGNC symbol(s) (e.g., ZFY)". We can now restrict the selection
of genes to those that match the gene symbols we enter into the
text area below.
Copy the human gene symbol column from the Jiang dataset
(H-M sheet of the Excel file) and paste it into the Human
official gene symbol field.
To specify which additional information we need for our analysis,
we now select the Attributes option in the left panel. In the Gene:
category, we first deselect both preselected attributes. Now, we
choose Chromosome Name, Gene Start (bp), Gene End
(bp), and Strand. Additionally, we expand the External: section
and check the HGNC symbol box in the External Reference
subsection. Note that the order in which the Attributes are selected
determines the format of the output file. For some steps downstream
in our pipeline, the order of the first three columns is important
(Chromosome Name, Gene Start, and Gene End).
Click on Results in the black top panel to export the complete
dataset to Galaxy. You will see a preview of the data that will be
exported. Galaxy is already selected as target. Check the box
Unique results only to exclude duplicates and press the Go
button. The browser returns to the Galaxy interface, which displays
the new dataset in the History panel. The upload from BioMart
to Galaxy may take a few moments. Eventually, we obtain a
tab-separated table containing the data for the subset of the orthologous genes that were retrieved. Note that some genes from the
Jiang dataset are not included in the BioMart database and thus are
not imported into Galaxy.
2.2. Obtain Mouse Gene List from BioMart
2.3. Convert Chromosome Symbols and Strand Symbols to Achieve Compatibility
2.6. Import Whole-Genome CpG Island Annotations
Script 1.
To integrate the previously obtained data into a single file, open the
"Text Manipulation" toolbox and select "Add column". In the text
field "Add this value", enter "True". The "to Query" field should be set to
the dataset of the human gene promoters that overlap with CGIs.
Hit "Execute". Repeat this step for the corresponding dataset in
mouse.
Similarly, we add a column with the value "False" to both
datasets of promoters not overlapping with CGIs.
To join the corresponding datasets for each genome, use the
"Concatenate queries" tool from the same toolbox: first, select both
human datasets using the "Concatenate Query" drop-down menu
and then the "Add new Query" button and the "Select" pull-down
menu. Pressing the "Execute" button joins both queries
head to tail. Repeat this step to concatenate the mouse datasets.
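For readers who prefer to see the two operations as code, the sketch below mimics them on small lists of rows. It is not Galaxy code; the function names and the example rows are hypothetical.

```python
def add_column(rows, value):
    """Append the same value to every row (like the "Add column" tool)."""
    return [row + [value] for row in rows]


def concatenate(*tables):
    """Join tables head to tail (like the "Concatenate queries" tool)."""
    combined = []
    for table in tables:
        combined.extend(table)
    return combined


# Hypothetical promoter rows: chromosome, start, end, gene symbol.
with_cgi = [["chr1", "1000", "2000", "GENEA"]]
without_cgi = [["chr2", "500", "900", "GENEB"]]

human = concatenate(add_column(with_cgi, "True"),
                    add_column(without_cgi, "False"))
```

After this step, every promoter row carries a flag that records whether it overlaps a CGI.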
Finally, we want to integrate both sets in such a way that each
promoter line contains information on its genomic locations in both
genomes and indicators for the existence of CGIs in either species.
First, convert the MGI gene symbols to upper case to match
the HGNC gene symbols by applying the "Change Case" tool from the
"Text Manipulation" toolbox to the symbol column. Choose the
combined dataset of the mouse genes, enter the column number of
the gene symbols in the "Change case of columns:" text field, check
that the correct option is selected in the "To:" pull-down menu,
and execute the operation.
Next, open the "Join, Subtract and Group" toolbox, choose
the "Join two Queries" tool for the human and mouse datasets, and
select the corresponding column numbers of the uppercase gene
symbols. Pressing the "Execute" button generates a new dataset. It contains only genes that appear in both lists and share
exactly the same uppercase gene symbol. Download the dataset by
clicking on the disk symbol in the "History" panel and name it
orthologous-genes.txt. Finally, open the file in a text editor and add
a row containing the column headers separated by tabs.
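The case conversion, the join on gene symbols, and the added header row can be summarized in a short sketch. It does not reproduce Galaxy's tools; the function names, the example rows, and all header names other than hg18_CGI and mm9_CGI are hypothetical.

```python
def to_upper(rows, col):
    """Copy of `rows` with column `col` converted to upper case."""
    return [row[:col] + [row[col].upper()] + row[col + 1:] for row in rows]


def join_on_symbol(left, left_col, right, right_col):
    """Inner join: keep only rows whose symbol occurs in both tables."""
    index = {}
    for row in right:
        index.setdefault(row[right_col], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[left_col], []):
            joined.append(row + match)
    return joined


def to_tsv(header, rows):
    """Tab-separated text with one header line."""
    lines = ["\t".join(header)]
    lines.extend("\t".join(row) for row in rows)
    return "\n".join(lines) + "\n"


# Hypothetical rows: chromosome, start, end, gene symbol, CGI flag.
human = [["chr1", "1000", "2000", "GENEA", "True"],
         ["chr2", "500", "900", "GENEB", "False"]]
mouse = [["chr4", "3000", "4000", "Genea", "False"],
         ["chr7", "100", "300", "Genec", "True"]]

orthologs = join_on_symbol(human, 3, to_upper(mouse, 3), 3)
header = ["hg_chr", "hg_start", "hg_end", "hg_symbol", "hg18_CGI",
          "mm_chr", "mm_start", "mm_end", "mm_symbol", "mm9_CGI"]
table = to_tsv(header, orthologs)
```

Genes whose symbols differ between the two species in more than letter case are lost in such a join, which is one reason why the result contains fewer genes than either input list.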
In order to obtain summary statistics on how many promoters
are included in each of the groups (CGI in human but not in
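Table 1
Conservation of human and mouse CpG island promoters (observed counts, with expected counts under independence in parentheses)

                        Mouse CGI promoter    Non-CGI promoter    Total
Human CGI promoter      1,820 (1,425.8)       284 (678.2)         2,104
Non-CGI promoter        152 (546.2)           654 (259.8)         806
Total                   1,972                 938                 2,910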
Apparently, the null hypothesis that promoter types are independent for
homologous genes in human and mouse can be rejected
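The test behind this statement can be retraced with a few lines of Python. The sketch assumes a chi-square test of independence on the 2 x 2 table; the count 1,820 for promoters that are CGI promoters in both species follows from the row total 2,104 minus 284, and the function names are hypothetical.

```python
def expected_counts(observed):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Pearson chi-square statistic for a contingency table."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))


# Rows: human CGI / non-CGI promoter; columns: mouse CGI / non-CGI promoter.
observed = [[1820, 284],
            [152, 654]]
expected = expected_counts(observed)
statistic = chi_square(observed)

# 3.841 is the 5% critical value of the chi-square distribution with 1 df.
reject_independence = statistic > 3.841
```

The expected counts computed this way agree with the values 1,425.8, 678.2, 546.2, and 259.8 given in Table 1, and the statistic lies far above the critical value.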
3. Genomic Features Analysis with EpiGRAPH
At http://epigraph.mpi-inf.mpg.de/WebGRAPH/faces/Login.jsp,
a free user account can be created that enables the use of EpiGRAPH's advanced custom analyses. Instructive video tutorials are
provided on the same site, and further guidance is available in previously published tutorials (19).
3.2. Identifying Properties of Conserved and Not Conserved Promoters
3.2.1. Uploading the Dataset with Mapped Promoters in EpiGRAPH
3.2.2. Defining an EpiGRAPH Analysis
hg18_CGI column from the list (3) and selecting the "Add Column"
button (4) below the "Inclusion Filter" field and adding the
"True" statement at the end. Before continuing via "Submit Attribute and Proceed", make sure you have assigned an attribute label (5).
Proceeding takes you to a view used for defining control sets. As a
control set is not needed for this analysis, skip the next step by
selecting the "Skip this Step" button.
The next screen (Fig. 5) is the Analysis View, in which the parameters for the actual analysis are specified. First, specify the target
feature that is the basis of the analysis. Partitioning of the region set
for all further statistical and machine learning analyses is based on the
target feature, in this case the mm9_CGI (1). Next, choose the
additional genomic and epigenomic features for EpiGRAPH to
inspect for each genomic region. These features include frequency
counts for various DNA sequence patterns, predicted DNA structure,
information for overlap with repeats, evolutionary history, population
variation, and others. All of the above are automatically obtained and
preprocessed from public sources and databases. A full list with
detailed descriptions of the features and interpretation of the computed representative values can be found on the EpiGRAPH Web site
(http://epigraph.mpi-inf.mpg.de/WebGRAPH/faces/Background.
Fig. 4. The Attribute View used for computing an attribute based on an already existing dataset.
Once the analysis is complete, first inspect the results of the statistical
analysis, which focuses on each computed feature separately. The
values for each feature are split into two groups depending on the
target feature. EpiGRAPH then uses the
Wilcoxon rank sum test (21) to assess the validity of the null hypothesis that these two sets of values come from the same distribution.
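To make the test concrete, the following sketch computes the rank sum statistic and a two-sided p-value from the normal approximation. It is a simplified illustration without tie or continuity correction and says nothing about how EpiGRAPH implements the test; the function names and example values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Rank sum of x, z-score, and two-sided p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    var_w = n1 * n2 * (n1 + n2 + 1) / 12.0
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return w, z, p


# Hypothetical feature values for the two groups defined by the target feature.
group_true = [1.0, 2.0, 3.0]
group_false = [4.0, 5.0, 6.0]
w, z, p = rank_sum_test(group_true, group_false)
```

For the small samples used here the normal approximation is crude; an exact test or a library routine would be preferable for real data.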
Fig. 5. Analysis View allows the user to specify the settings of the analysis he/she wants to perform.
Fig. 6. Results from EpiGRAPH (a) Statistical analysis and (b) Machine learning.
orthologs resemble CGI promoters in mouse as well. These posttranslational modifications of histones are generally associated with
open chromatin and CGIs that are especially enriched for CpGs.
However, the experimental data for those histone modifications
were obtained only from blood tissues (more information can be
found in the EpiGRAPH documentation) and should be interpreted
cautiously, as they do not necessarily correlate with histone modification states in other tissues. More precisely, the presence of these
marks indicates that a promoter is subject to epigenetic regulation
in at least one tissue, but their absence in one tissue does not rule out
that the promoter is epigenetically regulated in other tissues.
Among the most significant sequence patterns are a measure of
the ratio between the CpG frequency and the frequency of the spontaneous deamination products TpG and CpA (CpG_vs_TpG_v_CpA_ratio) and the CpA/TpG frequency (CA_freq; the search is performed on
both strands and thus includes the reverse complement as well). Both
values indicate that deamination products are enriched in those promoters that lost their CGI status in mouse.
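The idea behind these two features can be sketched with simple dinucleotide counts. The exact definitions used by EpiGRAPH are not given here, so the ratio below (CpG count divided by the combined TpG and CpA count on one strand) is an assumption, as are the function names.

```python
def dinucleotide_count(seq, dinucleotide):
    """Number of (overlapping) occurrences of a dinucleotide in `seq`."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide)


def cpg_vs_deamination_ratio(seq):
    """CpG count relative to the deamination products TpG and CpA."""
    cpg = dinucleotide_count(seq, "CG")
    products = dinucleotide_count(seq, "TG") + dinucleotide_count(seq, "CA")
    if products == 0:
        return None  # ratio undefined without any TpG or CpA
    return cpg / products


# Short, made-up sequence: CG occurs twice, TG once, CA once.
ratio = cpg_vs_deamination_ratio("CGCGTGCA")
```

A low ratio means that many former CpGs may have decayed to TpG or CpA, which is the pattern expected for promoters that have lost their CGI.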
As previously mentioned, visual inspection of the data is an
important step. The diagram generation module of EpiGRAPH
allows the user to inspect the distribution of a feature with respect
to the target. This is achieved by selecting the checkboxes of the
features you would like to visualize and clicking Calculate Selected
Diagrams. The box plot presented in Fig. 7 indicates that for
Fig. 8. Visualization of the promoter methylation obtained via RRBS for mouse and human. The black vertical lines indicate
the thresholds chosen to identify methylated (>0.66) and unmethylated (<0.33) cases.
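The two thresholds in Fig. 8 amount to a simple classification rule, sketched below. The function name and the label "intermediate" for scores between the thresholds are assumptions of this example.

```python
def methylation_class(score, low=0.33, high=0.66):
    """Classify a promoter methylation score between 0 and 1."""
    if score > high:
        return "methylated"
    if score < low:
        return "unmethylated"
    return "intermediate"


# Hypothetical methylation scores for four promoters.
labels = [methylation_class(s) for s in (0.05, 0.40, 0.66, 0.90)]
```

Promoters with intermediate scores are neither counted as methylated nor as unmethylated.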
Table 2
Distribution of promoter methylation data visualized by genome and promoter CGI status
Unmethylated (hg18, mm9); Methylated (hg18, mm9)
1,746; 1,759; 28; 10
hg18: 224; 94; 18; 40
mm9: 14; 119; 28
Neither: 34; 42; 165; 137
Script 2.
in our case, the choice of cutoff values would not influence the
results significantly. Readers may also want to modify the R code
from Script 1 to visualize the distributions separately for each
group (the True/True group, the False/True group, etc.).
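How sensitive the classification is to the cutoffs can be checked directly. The sketch below is a Python illustration and not the chapter's R code; the function names and the example scores are hypothetical. It labels methylation scores with the cutoffs used here (0.33 and 0.66) and with a stricter pair.

```python
def classify_methylation(score, low=0.33, high=0.66):
    """Label a promoter methylation score using two cutoffs."""
    if score < low:
        return "unmethylated"
    if score > high:
        return "methylated"
    return "intermediate"


def count_classes(scores, low=0.33, high=0.66):
    """Count how many scores fall into each class."""
    counts = {"unmethylated": 0, "intermediate": 0, "methylated": 0}
    for score in scores:
        counts[classify_methylation(score, low, high)] += 1
    return counts


# Hypothetical scores, for illustration only.
scores = [0.02, 0.10, 0.30, 0.50, 0.70, 0.95]
print(count_classes(scores))              # cutoffs used in the text
print(count_classes(scores, 0.25, 0.75))  # a stricter alternative
```

If most scores lie close to 0 or 1, as the distributions in Fig. 8 suggest, the counts change little between the two settings.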
Script 3.
First, make sure that you are working on the mouse dataset. Using the EpiGRAPH filtering options on the already computed datasets, we
extract the mouse promoters that are CGI in human but not
CGI in mouse. Repeat the steps of the "Defining an EpiGRAPH
analysis" section, defining the inclusion filter (point (4) in Fig. 4) such
that the hg18_CGI feature has the value True and the
mm9_CGI feature has the value False. We also exclude all cases
that do not have strong methylation scores by adding to the inclusion filter the restriction that the methylation score is either less than 0.33
or more than 0.66. We also add a new column that contains the
Fig. 9. Visualization of some of the most significant features differentiating between methylated and unmethylated mouse
non-CGI promoters orthologous to human CGI promoters.
computed by partitioning the region into multiple consecutive subregions, estimating the CpG frequency in each of them, and computing the standard deviation of the obtained set of frequencies. It
has low values when the distribution of CpGs is similar along the
whole region and higher values if certain parts have high CpG
frequencies while others are CpG poor. High values of CG_std
are usually indicative of regions overlapping with a CGI. A possible
explanation for the significantly elevated values of this feature in
unmethylated non-CGI promoters is the previously described erosion process (15) that starts from the edges of the CGI. Alternatively, the mouse genome may possess smaller CGIs that are
somewhat below the minimal length of human CGIs.
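The computation behind CG_std, as described above, can be sketched in a few lines. This is a Python illustration under stated assumptions: the number of subregions (four), the use of the population standard deviation, and the function names are choices made here and are not taken from EpiGRAPH.

```python
from statistics import pstdev


def cpg_frequency(seq):
    """Fraction of dinucleotide positions in seq that are CpG."""
    seq = seq.upper()
    positions = len(seq) - 1
    if positions <= 0:
        return 0.0
    hits = sum(1 for i in range(positions) if seq[i:i + 2] == "CG")
    return hits / positions


def cg_std(seq, n_windows=4):
    """Standard deviation of the CpG frequency across consecutive subregions.

    Bases left over after the last full window, and CpGs spanning a window
    boundary, are ignored in this simplified version.
    """
    size = len(seq) // n_windows
    if size < 2:
        raise ValueError("sequence too short for the requested number of windows")
    windows = [seq[i * size:(i + 1) * size] for i in range(n_windows)]
    return pstdev([cpg_frequency(w) for w in windows])


# Hypothetical sequences: CpGs spread evenly versus clustered at one end.
even = "CGCG" * 4
clustered = "CGCGCGCG" + "AT" * 12
print(cg_std(even), cg_std(clustered))
```

The evenly distributed sequence gives a value of zero, while the sequence with all CpGs at one end gives a clearly positive value, which matches the interpretation of CG_std given in the text.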
3.3.3. Discussion
3.4. Differential Analysis on Human and Mouse Promoter Traits
As a follow-up analysis, we test which genomic features are significantly different between human and mouse promoters. For this
purpose, we use the full attribute data computed in the beginning
of the previous section for human and mouse promoters.
We download the data from both the human and the mouse analyses
(see Subheading 3) by clicking the Download Data Table button
on the analysis pages. We then run an R
script (Script 4) that (1) reads the two datasets (the specific file
names need to be set by the user); (2) selects only the properties
that are common to both datasets; (3) builds the combined
dataset with an additional column called Genome, which
indicates the genome from which each case originates;
and finally (4) stores this combined dataset in a new file. Once
the file is prepared, we are ready to perform an EpiGRAPH analysis
to identify significant differences between the properties of the
two genomes.
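The four steps performed by Script 4 can be sketched as follows. This is a Python illustration and not the chapter's R script; it assumes that the downloaded data tables are tab-delimited text files with a header row, and all function names are hypothetical.

```python
import csv


def read_table(path):
    """Step 1: read a tab-delimited table with a header row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def combine_datasets(human_rows, mouse_rows):
    """Steps 2 and 3: keep the shared columns and add a Genome column."""
    if not human_rows or not mouse_rows:
        raise ValueError("both datasets must contain at least one row")
    common = [c for c in human_rows[0] if c in mouse_rows[0]]
    combined = []
    for genome, rows in (("human", human_rows), ("mouse", mouse_rows)):
        for row in rows:
            merged = {c: row[c] for c in common}
            merged["Genome"] = genome  # records the source genome of the case
            combined.append(merged)
    return combined


def write_table(rows, path):
    """Step 4: store the combined table in a new tab-delimited file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

With real data, the calls would be chained as `write_table(combine_datasets(read_table(human_file), read_table(mouse_file)), output_file)`, where the three file names are set by the user.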
Script 4.
3.4.2. Statistical and Machine Learning Analysis of the Variation of Genome Features Between Human and Mouse
Fig. 10. List of genomic features that are most significantly different between human and mouse promoters.
3.5. Summary
We can conclude that there are two independent biological explanations for the loss of CGIs in some mouse promoters. First, a
general trend was observed in mouse toward smaller but functional
CGIs with fewer CpGs that are not captured by the Takai-Jones CGI
definition, which leads to false-negative classifications in the assessment of the promoter type. Second, a number of genes actually
differ in their promoter type, which is reflected by an increase in
DNA methylation and a uniform loss of CpGs over the whole
promoter region. While studying the epigenetic regulation of promoters from the first group in mouse may also grant insights into their
regulation in humans, this is implausible for promoters from the
second group.
In the next section, we use this knowledge to enhance candidate selection in a computational screen for methylation-associated
cancer markers in human. We search for those candidates that are
amenable to functional studies in the mouse model system.
4. Methylation and Disease
Epigenetic drugs can bring important progress to anticancer therapy. For example, the drug 5-azacytidine is applied in the treatment
of myelodysplastic syndrome. With the acquired knowledge on
enzymes involved in gene regulation, epigenetic therapies could
provide an effective alternative to chemotherapy (26). In this section, we are going to identify genes that are differentially methylated in ovarian cancer and normal tissue. The epigenetic state of
some of these genes might have a causal effect on tumor progression and proliferation. Such genes are putative targets for epigenetic therapy. Finally, we are going to prioritize the list of potential
Fig. 11. (a) Table of methylation values, also referred to as beta values. They depict methylation degree and span the range
between 0 (completely unmethylated) and 1 (fully methylated). Row names are Infinium probe identifiers; column names
are sample identifiers. Note that only the first two rows and four columns are displayed. The full table contains 27,578 rows
and 540 columns. (b) Table of sample clinical information. Row names are sample identifiers. Note that only the first two
rows and two columns are displayed. The full table contains 540 rows and 14 columns. (c) Table summarizing the Illumina
Infinium platform specification. Row names are probe identifiers. Note that only the first two rows and four columns are
displayed. The full table contains 27,578 rows and 38 columns.
The methylation dataset is deposited in the Gene Expression Omnibus (GEO) under the identifier GSE19711 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19711).
The SOFT-formatted family file contains the methylation degrees of all probes, as well
as clinical information about the samples. SOFT files are text files
with a simple structure. A short script can be used to extract from
this file the tables presented in Fig. 11 (an example R script is
provided in the supplementary information). The mapping from
Infinium probe identifiers to genes is available in GEO under
platform specification GPL8490 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL8490).
The table can be downloaded
by clicking the button Download full table....
The information available in the tables shown in Fig. 11 enables
us to compare the methylation states of healthy and cancerous
samples. For every gene, we can obtain a set of 266 numbers
describing its methylation status in healthy samples, as well as a set
of 274 numbers corresponding to its methylation status in ovarian
cancer cells. A gene is then classified as OV associated if the
two sets of numbers are significantly different. Several statistical
tests can be applied for that purpose; we apply the Wilcoxon rank sum
test. Instead of a long script, the necessary R code to perform the
analysis is presented as snippets within the associated subsections.
The tables presented in Fig. 11 are loaded from files named
GSE19711-betas.txt (Fig. 11a), GSE19711-clinical.txt
(Fig. 11b), and GPL8490-65.txt (Fig. 11c). The first two tables can
be obtained after parsing the SOFT file in GEO record GSE19711.
The third table can be downloaded from GEO record GPL8490.
4.2. Determining OV-Associated Genes
4.2.1. Step 1
The first step is to load the table of methylation values for all
samples, the table with clinical information on the samples, as well
as a table containing information on the probes used in the Illumina
Infinium platform.
Fig. 12. R objects created in the analysis of disease association. Matrices are represented by tables, and lists by charts
with inner borders only. Arrows indicate how the objects are derived. Numbers in the arrows correspond to the steps
described in Subheading 4.2.
Script 5.
4.2.2. Step 2
Many genes are represented by more than one probe in the Infinium assay. CpG methylation is highly correlated over short
distances (30). Therefore, we can estimate the methylation level
of a gene promoter by averaging the methylation value for all
associated probes. The next code snippet performs this task and
creates a matrix of methylation values named betas.genes. In this
matrix, every row corresponds to a gene and every sample is
represented by a column.
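The averaging step can be sketched as follows. This is a Python illustration rather than the chapter's R snippet: the data structures (dictionaries keyed by probe identifier), the handling of missing values as None, and the probe and gene names in the example are assumptions made here.

```python
from collections import defaultdict


def average_probes_per_gene(betas, probe_to_gene):
    """Average probe-level beta values into one row per gene.

    betas: dict mapping probe id -> list of beta values, one per sample
           (None marks a missing measurement)
    probe_to_gene: dict mapping probe id -> gene symbol
    """
    grouped = defaultdict(list)
    for probe, values in betas.items():
        gene = probe_to_gene.get(probe)
        if gene:  # skip probes without a gene annotation
            grouped[gene].append(values)

    betas_genes = {}
    for gene, rows in grouped.items():
        averaged = []
        for sample_values in zip(*rows):  # iterate sample by sample
            present = [v for v in sample_values if v is not None]
            averaged.append(sum(present) / len(present) if present else None)
        betas_genes[gene] = averaged
    return betas_genes


# Hypothetical probes and genes, two samples each.
betas = {"p1": [0.2, 0.4], "p2": [0.4, None], "p3": [0.9, 0.8]}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
print(average_probes_per_gene(betas, probe_to_gene))
```

Each key of the result plays the role of a row of betas.genes, and each list position corresponds to a sample column.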
Script 6.
4.2.3. Step 3
Script 7.
4.2.4. Step 4
With this definition of sets, the Wilcoxon rank sum test (introduced in
Subheading 2) is applied to every row of betas.genes, that is, to
every gene. The resulting p-values are stored in a matrix gene.p.values.
Every column in the matrix corresponds to a gene. The first
row of the matrix lists the p-values for hypomethylation, and the
second row those for hypermethylation of the respective gene. A p-value
is the probability of observing a difference at least as extreme as the
one measured if there were no true difference between the groups.
If an individual gene is tested, a p-value of 0.001 is highly
significant. If multiple genes are tested, this level loses its significance.
For instance, in a set of 20,000 genes, we can expect to observe 20
genes with a p-value below 0.001 by chance alone. In Subheading 3,
EpiGRAPH automatically provided two alternative multiple-testing
correction procedures that vary in their strictness. As the objective
of the current analysis is to retain only the most promising candidates,
the more conservative Bonferroni method is applied in the last
line of the code snippet.
Note that gene.p.values is transposed as a side effect of the last
command. Afterwards, genes are represented by rows and methylation states by columns.
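The logic of this step, two one-sided rank sum tests per gene followed by a Bonferroni correction, can be sketched without any statistics library. The code below is a Python illustration and not the chapter's R snippet. It uses the normal approximation with a tie correction and no continuity correction, so its p-values can differ slightly from those of R's wilcox.test; the function names and the example values are hypothetical.

```python
import math


def rank_sum_p_values(cancer, normal):
    """One-sided Wilcoxon rank sum p-values (normal approximation).

    Returns (p_hypo, p_hyper): p_hypo tests whether cancer values tend to be
    lower than normal values, p_hyper whether they tend to be higher.
    """
    n1, n2 = len(cancer), len(normal)
    values = sorted(list(cancer) + list(normal))
    # Midranks: tied values share the average of the ranks they occupy.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u = sum(rank_of[v] for v in cancer) - n1 * (n1 + 1) / 2  # Mann-Whitney U
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0, 1.0  # all values tied: no evidence in either direction
    z = (u - mean_u) / math.sqrt(var_u)
    p_hypo = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    p_hyper = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return p_hypo, p_hyper


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Expected number of false positives for 20,000 tests at level 0.001.
print(20000 * 0.001)

# Hypothetical beta values for one gene that is hypomethylated in cancer.
cancer = [0.10, 0.20, 0.15, 0.12, 0.18]
normal = [0.80, 0.90, 0.85, 0.70, 0.95]
print(rank_sum_p_values(cancer, normal))
```

In a full analysis, the function would be called once per gene, and bonferroni would be applied separately to the list of hypomethylation p-values and to the list of hypermethylation p-values.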
Script 8.
Script 9.
4.3. Relationship Between Disease and CpG Island Association
Script 10.
Table 3
Contingency table of the properties OV association and promoter CpG island status

             OV associated          Not OV associated
             Observed   Expected    Observed   Expected    Total
CGI          96         120         1,882      1,858       1,978
Non-CGI      70         46          686        710         756
Total        166                    2,568                  2,734
Script 11.
The last step in this analysis performs Fisher's exact test in order to
check for a significant association between CGI promoter status
and OV association among the set of orthologous genes.
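Both the expected counts in Table 3 and the test itself can be reproduced from the observed counts. The following is a Python sketch and not the chapter's R script (in R, the function fisher.test performs this test); the function names are hypothetical. The two-sided p-value is computed by summing the probabilities of all tables with the same margins that are at most as likely as the observed one.

```python
from math import comb


def expected_count(row_total, col_total, grand_total):
    """Expected cell count under independence of rows and columns."""
    return row_total * col_total / grand_total


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    return sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-7)  # tolerance for floating-point ties
    )


# Observed counts from Table 3: rows CGI / non-CGI,
# columns OV associated / not OV associated.
print(expected_count(1978, 166, 2734))   # expected CGI and OV associated
print(expected_count(756, 166, 2734))    # expected non-CGI and OV associated
print(fisher_exact_two_sided(96, 1882, 70, 686))
```

The expected counts round to the values of 120 and 46 listed in Table 3, and the observed excess of non-CGI promoters among OV-associated genes (70 versus 46) is what the test evaluates.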
Script 12.
Script 13.
4.4. Summary
In this section, we applied the Wilcoxon rank sum test gene-wise, comparing the methylation profiles of healthy ovary and ovarian cancer
tissues. We identified a list of differentially methylated gene promoters. The majority of these promoters are hypomethylated in cancer.
We also observed that the methylation state of non-CGI promoters is
preferentially modified in ovarian cancer cells. As a final outcome, we
created a list of genes that are suitable candidates for studying epigenetic deregulation associated with ovarian cancer. These are the genes
that exhibit significant hypo- or hypermethylation in ovarian cancer,
and have similarly regulated orthologs in mouse.
5. Concluding Remarks
In this chapter, we outlined methods and tools for comparative
epigenomics analysis in the context of genome evolution and
human diseases. The combination of Web-based tools is becoming
increasingly powerful and provides a productive start into epigenome data analysis. As users become more experienced, it is a natural
extension that they start learning a scripting language (e.g., R or
Python) that can often be combined with Web services to perform
more advanced and individualized data analysis tasks. A biologist
equipped with ever more powerful Web-based tools and basic
scripting skills will be in a good position to capitalize on the
increasing wealth of public epigenome datasets.
We point out that each of these tools not only has certain advantages, but also drawbacks. Galaxy and EpiGRAPH offer easy access to
powerful operations on genome-wide datasets while simple text
manipulations can usually be performed more efficiently in text
6. Exercises
1. In Subheading 2 of this chapter, we used a published set of
orthologous genes as the starting point of our analysis, and furthermore we used the similarity of the gene symbols in human
and mouse to map them to each other. To generalize this
approach, the reader should explore ways to repeat this step
with a comprehensive set of human genes and use the LiftOver
tool to map them to the mouse genome. This yields a larger but
noisier set of putative homologous genes. How can we
identify unconserved genes? Considering the lessons from
Chap. 9 of volume 1 (32), can you discriminate between
orthologous and paralogous genes? Does the larger gene set
influence the statistic on promoter-type conservation?
2. In Subheading 3, we used loose thresholds for classifying a
promoter as methylated or unmethylated. We then observed
that gene promoters overlapping with CGI in human, but not
in mouse, still appear to be mostly unmethylated and epigenetically active in mouse. Use the script from this section to test if
such observation still holds if stricter thresholds of 0.25 and
0.75 are applied.
3. In Subheading 4, the methylation value for each gene in
each sample is obtained by averaging over all probes that correspond to the gene's promoter region. However, the methylation values of the probes might differ drastically. In such a
case, the average value is probably an unreliable estimate of the
gene's promoter methylation. Write an R script that filters out
every gene with multiple probes for which the methylation
values in at least one sample differ by 0.5 or more.
4. In Subheading 4, the Wilcoxon rank sum test is applied in
order to obtain a p-value for OV association for every gene. Use
the R function ks.test to apply the Kolmogorov-Smirnov
(KS) test instead. Inspect to what extent the resulting gene
associations change compared to applying the Wilcoxon rank
sum test. Why is the KS test inappropriate if we need to find
genes with differential methylation?
Acknowledgment
The contribution of Y.A. was partially supported by the EU STREP
CancerDIP (EU grant HEALTH-F2-2007-200620).
References
1. Jaenisch, R., and Bird, A. (2003) Epigenetic
regulation of gene expression: how the genome
integrates intrinsic and environmental signals,
Nat Genet 33 Suppl, 245-254.
2. Bird, A. (2002) DNA methylation patterns and
epigenetic memory, Genes Dev 16, 6-21.
3. Novik, K. L., Nimmrich, I., Genc, B., Maier,
S., Piepenbrock, C., Olek, A., and Beck, S.
(2002) Epigenomics: genome-wide study of
methylation phenomena, Current issues in
molecular biology 4, 111-128.
4. Noushmehr, H., Weisenberger, D. J., Diefes,
K., Phillips, H. S., Pujara, K., Berman, B. P.,
Pan, F., Pelloski, C. E., Sulman, E. P., Bhat, K.
P., Verhaak, R. G., Hoadley, K. A., Hayes, D.
N., Perou, C. M., Schmidt, H. K., Ding, L.,
Wilson, R. K., Van Den Berg, D., Shen, H.,
Bengtsson, H., Neuvial, P., Cope, L. M.,
Buckley, J., Herman, J. G., Baylin, S. B.,
Laird, P. W., and Aldape, K. (2010) Identification of a CpG island methylator phenotype that
defines a distinct subgroup of glioma, Cancer
Cell 17, 510-522.
5. Figueroa, M. E., Lugthart, S., Li, Y.,
Erpelinck-Verschueren, C., Deng, X., Christos,
P. J., Schifano, E., Booth, J., van Putten, W.,
Skrabanek, L., Campagne, F., Mazumdar, M.,
Greally, J. M., Valk, P. J., Lowenberg, B.,
Delwel, R., and Melnick, A. (2010) DNA
methylation signatures identify biologically
distinct subtypes in acute myeloid leukemia,
Cancer Cell 17, 13-27.
6. Yi, J. M., Dhir, M., Van Neste, L., Downing, S.
R., Jeschke, J., Glockner, S. C., de Freitas
Calmon, M., Hooker, C. M., Funes, J. M.,
Boshoff, C., Smits, K. M., van Engeland, M.,
Weijenberg, M. P., Iacobuzio-Donahue, C. A.,
Herman, J. G., Schuebel, K. E., Baylin, S. B.,
and Ahuja, N. (2011) Genomic and Epigenomic Integration Identifies a Prognostic Signature in Colon Cancer, Clin. Cancer Res. 17,
1535-1545.
7. Bock, C., Kiskinis, E., Verstappen, G., Gu, H.,
Boulting, G., Smith, Z. D., Ziller, M., Croft,
G. F., Amoroso, M. W., Oakley, D. H., Gnirke,
A., Eggan, K., and Meissner, A. (2011) Reference Maps of Human ES and iPS Cell Variation
Enable High-Throughput Characterization of
Pluripotent Cell Lines, Cell 144, 439-452.
Chapter 19
Genetical Genomics for Evolutionary Studies
Pjotr Prins, Geert Smant, and Ritsert C. Jansen
Abstract
Genetical genomics combines acquired high-throughput genomic data with genetic analysis. In this
chapter, we discuss the application of genetical genomics for evolutionary studies, where new high-throughput molecular technologies are combined with mapping quantitative trait loci (QTL) on the
genome in segregating populations.
The recent explosion of high-throughput data (measuring thousands of proteins and metabolites, deep
sequencing, chromatin, and methyl-DNA immunoprecipitation) allows the study of the genetic variation
underlying quantitative phenotypes, together termed xQTL. At the same time, mining information is not
getting easier. To deal with the sheer amount of information, powerful statistical tools are needed to analyze
multidimensional relationships. In the context of evolutionary computational biology, a well-designed
experiment may help dissect a complex evolutionary trait using proven statistical methods for associating
phenotypical variation with genomic locations.
Evolutionary expression QTL (eQTL) studies of the last years focus on gene expression adaptations,
mapping the gene expression landscape, and, tentatively, eQTL networks. Here, we discuss the possibility of
introducing an evolutionary prior, in the form of gene families displaying evidence of positive selection, and
using that in the context of an eQTL experiment for elucidating host-pathogen protein-protein interactions. Through the example of an experimental design, we discuss the choice of xQTL platform, analysis
methods, and scope of results. The resulting eQTL can be matched, resulting in putative interacting genes
and their regulators. In addition, a prior may help distinguish QTL causality from reactivity, or independence of traits, by creating QTL networks.
Key words: Genetical genomics, QTL, eQTL, xQTL, R-genes, Evolution, R/qtl, NGS, Genomics,
Metabolomics, Network inference
1. Introduction
Genetics, as it is used here, concerns the study of quantitative, or
complex, traits. A quantitative trait is influenced by multiple factors,
including gene interactions and environmental factors, and typically
does not lead to discrete phenotypes. Many traits of interest, such as
milk production in cattle, response to fertilizer in crops, and most
QTL link complex traits with one or more locations on the genome
(Fig. 1). Such a location is a wide measure because a QTL is a
statistical estimate, and rarely a precise indicator. On the genome, a
single QTL may represent tens, hundreds, or even thousands of real
genes. Combining the QTL with high-throughput technologies,
such as microarrays, can add information. To zoom in on the genes
underlying QTL, information from other sources can be utilized.
Such a priori knowledge could consist of results from traditional
linkage studies or association studies of, for example, human disease.
That way, one can assign a specific regulatory role to polymorphic
sites in a genomic region known to be associated with disease (14).
Other useful priors can be the existing information on gene ontology
terms, metabolic pathways, and protein-protein interactions, which
can be used to identify genes and pathways (16), provided these
databases are sufficiently informative.
Zou et al. (11), for example, used gene ontology as a prior and
concluded that trans-acting eQTL divergence between duplicate
pairs is related to fitness defects under treatment conditions, but not
to fitness under normal conditions.
Chen et al. (17) identified strong candidate genes for resistance
to leaf rust in barley and on the general pathogen response pathway
using a custom barley microarray on 144 doubled haploid lines of
the St/Mx population. 15,685 eQTL were mapped from 9,557
genes. Correlation analysis identified 128 genes that were correlated with resistance, of which 89 had eQTL colocating with the
phenotypic QTL (phQTL) or classic QTL. Transcript abundance in
the parents and conservation of synteny with rice prioritized six
genes as candidates for Rphq11, the phQTL of largest effect (17).
1.3. Evidence of Positive Selection as the Prior
2. Designing an Evolutionary xQTL Experiment
Box 1
R-Genes
Plant resistance genes (R-genes) are a homologous family of
genes, formed by gene duplication events and hypothesized to
be involved in an evolutionary arms race with pathogen effectors. R-genes are involved in recognizing specific pathogens with
cognate avirulence genes and initiating defense signaling that
results in disease resistance (25). R-genes are characterized by a
molecular gene-for-gene interaction (26) in which a specific
allele of a disease resistance gene recognizes an avirulence protein or pathogen allele. This specificity is often encoded, at least
in part, in a relatively fast-evolving leucine-rich-repeat (LRR)
region (27), which consists of a varying number of LRR modules. Activation of at least some of these proteins is regulated in
trans, as has been shown for RPM1 and RPS2 (28).
A single A. thaliana plant has about 150 R-genes, representing
a subset of R-genes in the overall population. The protein products
of R-genes are involved in molecular interactions. They generally
have a recognition site which can dock against, i.e., recognize,
one or more specific molecules. The proteins encoded
by the largest class of R-genes carry a nucleotide-binding site LRR
domain (NB-LRR, also referred to as NB-ARC-LRR and NBS-LRR). NB-LRR R-genes can be further subdivided based on their
N-terminal structural features into TIR-NB-LRR, which have
homology to the Drosophila Toll and mammalian interleukin-1
receptors and CC-NB-LRR, which contain a putative coiled-coil
motif (29). The LRR domain appears to mediate specificity in
pathogen recognition while the N-terminal TIR, or coiled-coil
motif, is likely to play a role in downstream signaling (27). When
a molecule is docked, the R-protein is able to activate pathways in
the cell, resulting in, for example, a hypersensitive response causing
apoptosis and preventing spread of infection.
Meanwhile, a single R-protein recognizes only one type of
invading molecule. Therefore, through its R-genes, an individual plant recognizes only a limited number of strains of invading
pathogens, as the individual pathogens vary in their effectors
too. When a pathogen evolves to use nonrecognized effectors, the
plant becomes susceptible. The success of plant defense is determined by both evolution and the variation of specificity in a
population. Unlike the mammalian immune system, which
can change in a living organism and learn about invasions on the
fly (30), plant R-genes depend on the variation inside a gene pool
to provide resistance against a pathogen; see, for example,
Holub et al. (31). Even so, many genes involved in pathogen
recognition undergo rapid adaptive evolution (24), and studies
have found that A. thaliana R-genes show evidence of positive
selection, e.g., refs. 32-34.
3. Discussion
A QTL is a statistical property connecting genotype with phenotype.
In this chapter, we reviewed studies which, with various degrees of
success, combine some type of prior information with xQTL. We
propose that a search for genome-wide evidence of positive selection
can produce a valid and interesting prior for xQTL analysis. This is
achieved by tying genomic locations of putative gene families, possibly involved in plant-pathogen interactions, with QTL locations
derived from a genetical genomics experiment. Both the eQTL
example and the search for genome-wide evidence of positive
selection pressure are essentially exploratory and result in a list of
putative genes, or gene families, with known genomic locations. The
combined information yields candidate genes and pathways that are
under positive selection pressure and, potentially, involved in
host-pathogen interactions. We explain that it is possible to design
an eQTL experiment using existing experimental populations, e.g.,
using an A. thaliana RIL population, and analyze results with the
existing free and open-source software, such as the R/qtl tool set.
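The matching of gene family locations to QTL locations described above amounts to an interval lookup. The sketch below is a Python illustration only; it is not part of R/qtl, and the tuple layouts, function name, identifiers, and coordinates are hypothetical.

```python
def genes_in_qtl(qtl_intervals, genes):
    """Return, per QTL, the candidate genes located inside its interval.

    qtl_intervals: list of (name, chromosome, start, end)
    genes: list of (gene_id, chromosome, position)
    """
    hits = {}
    for name, chrom, start, end in qtl_intervals:
        hits[name] = [
            gene_id
            for gene_id, gene_chrom, pos in genes
            if gene_chrom == chrom and start <= pos <= end
        ]
    return hits


# Hypothetical QTL confidence intervals and positions of genes from a
# family showing evidence of positive selection.
qtl_intervals = [("eQTL_1", "chr1", 1_000_000, 3_000_000),
                 ("eQTL_2", "chr4", 500_000, 900_000)]
genes = [("geneA", "chr1", 1_500_000),
         ("geneB", "chr1", 5_000_000),
         ("geneC", "chr4", 600_000)]
print(genes_in_qtl(qtl_intervals, genes))
```

Because a QTL interval may span many genes, the resulting lists are candidates only, in line with the exploratory nature of both analyses.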
Genetical genomics bridges the study of quantitative traits
with molecular biology and gives new impetus to QTL population
studies. Genetic variation at multiple loci in combination with environmental factors can induce molecular or phenotypic variation.
Variation may manifest itself as linear patterns among traits at different levels that can be deconstructed. Correlations can be attributed
to detectable QTL, and a logical framework based on common and
distinct QTL and on the propagation of biological variation can be
used to infer network causality, reactivity, or independence (61).
Unexplained biological variation can be used to infer direction
between traits that share a common QTL and have no distinct
QTL, though it may be difficult to separate biological from technical
variation. Prior knowledge and complementary experiments, such
as deletion mapping followed by independent gene expression
4. Questions
1. What is an eQTL, and why does it present two genomic locations?
2. Can a prior, as used here, really add statistical power, or is it no
more than circumstantial evidence?
3. When designing an evolutionary genetical genomics experiment, what are the steps to consider?
4. How can causal inference be used in QTL networks?
Acknowledgments
The European Commission's Integrated Project BIOEXPLOIT
(FOOD-2005-513959 to GS and PP); the Netherlands Organization for Scientific Research/TTI Green Genetics (1CC029RP to
PP); the EU 7th Framework Programme under the Research Project PANACEA (222936 to RJ).
References
1. Nandi S, Subudhi P K, Senadhira D et al.
(1997) Mapping QTLs for submergence
tolerance in rice by AFLP analysis and selective
genotyping. Mol Gen Genet. 255:1-8
2. Meaburn E, Butcher L M, Schalkwyk L C &
Plomin R (2006) Genotyping pooled DNA
using 100K SNP microarrays: a step towards
genomewide association scans. Nucleic Acids
Res. 34:e27p
3. Kim S, Plagnol V, Hu T T et al. (2007) Recombination and linkage disequilibrium in Arabidopsis thaliana. Nat Genet. 39:1151-1155.
http://www.ncbi.nlm.nih.gov/pubmed/17676040
4. Dixon A L, Liang L, Moffatt M F et al. (2007)
A genome-wide association study of global
gene expression. Nat Genet. 39:1202-1207
5. Jansen R C & Nap J P (2001) Genetical genomics: the added value from segregation. Trends
Genet. 17:388-391
6. Gibson G & Weir B (2005) The quantitative
genetics of transcription. Trends Genet.
21:616-623
7. Li Y, Alvarez O A, Gutteling E W et al. (2006)
Mapping determinants of gene expression plasticity by genetical genomics in C. elegans.
PLoS Genet. 2:e222p
8. Jansen R C, Tesson B M, Fu J, Yang Y &
Mcintyre L M (2009) Defining gene and
QTL networks. Curr Opin Plant Biol.
12:241-246
9. Brem R B & Kruglyak L (2005) The landscape
of genetic complexity across 5,700 gene
expression traits in yeast. Proc Natl Acad Sci
USA. 102:1572-1577
10. Fraser H B, Moses A M & Schadt E E (2010)
Evidence for widespread adaptive evolution of
gene expression in budding yeast. Proc Natl
Acad Sci USA. 107:2977-2982
11. Zou Y, Su Z, Yang J, Zeng Y & Gu X (2009)
Uncovering genetic regulatory network divergence between duplicate genes using yeast eqtl
landscape. J Exp Zool B Mol Dev Evol.
312:722-733
12. Li Y, Breitling R & Jansen R C (2008) Generalizing genetical genomics: getting added value
from environmental perturbation. Trends
Genet. 24:518-524.
http://www.ncbi.nlm.nih.gov/pubmed/18774198
13. Kliebenstein D J, West M A, van Leeuwen H
et al. (2006) Identification of QTLs controlling
gene expression networks defined a priori.
BMC Bioinformatics. 7:308p
14. Gilad Y, Rifkin S A & Pritchard J K (2008)
Revealing the architecture of gene regulation:
41. Goto N, Prins P, Nakao M et al. (2010) BioRuby: bioinformatics software for the Ruby programming
language. Bioinformatics. 26:2617-2619.
doi:10.1093/bioinformatics/btq475
42. Altschul S F, Madden T L, Schaffer A A et al.
(1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25:3389-3402
43. Rhee S Y, Beavis W, Berardini T Z et al. (2003)
The Arabidopsis Information Resource
(TAIR): a model organism database providing
a centralized, curated gateway to Arabidopsis
biology, research materials and community.
Nucleic Acids Res. 31:224-228.
http://www.ncbi.nlm.nih.gov/pubmed/12519987
44. Anisimova M, Nielsen R & Yang Z (2003)
Effect of recombination on the accuracy of
the likelihood method for detecting positive
selection at amino acid sites. Genetics.
164:1229-1236
45. (2000) Analysis of the genome sequence of the
flowering plant Arabidopsis thaliana. Nature.
408:796-815
46. Michelmore R W & Meyers B C (1998) Clusters of resistance genes in plants evolve by
divergent selection and a birth-and-death process. Genome Res. 8:1113-1130.
http://www.ncbi.nlm.nih.gov/pubmed/9847076
47. Salinas J & Sanchez-serrano J (2006) Arabidopsis protocols. Humana Pr Inc, Totowa, NJ
48. Fu J & Jansen R C (2006) Optimal design and
analysis of genetic studies on gene expression.
Genetics. 172:1993-1999.
doi:10.1534/genetics.105.047001
49. Mortazavi A, Williams B A, Mccue K, Schaeffer L
& Wold B (2008) Mapping and quantifying
mammalian transcriptomes by rna-seq. Nat
Methods. 5:621-628. doi:10.1038/nmeth.1226
50. Eklund A C, Turner L R, Chen P et al. (2006)
Replacing cRNA targets with cDNA reduces
microarray cross-hybridization. Nat Biotechnol.
24:1071-1073. doi:10.1038/nbt0906-1071
51. Hoen P A, Ariyurek Y, Thygesen H H et al.
(2008) Deep sequencing-based expression
analysis shows major advances in robustness,
resolution and inter-lab portability over five
microarray platforms. Nucleic Acids Res. 36:
e141p. doi:10.1093/nar/gkn705
52. Keurentjes J J, Sulpice R, Gibon Y et al. (2008)
Integrative analyses of genetic variation in
enzyme activities of primary carbohydrate
metabolism reveal distinct modes of regulation
in Arabidopsis thaliana. Genome Biol. 9:
R129p. doi:10.1186/gb-2008-9-8-r129
53. Fu J, Swertz M A, Keurentjes J J & Jansen R C
(2007) Metanetwork: a computational
protocol for the genetic study of metabolic
networks. Nat Protoc. 2:685-694.
doi:10.1038/nprot.2007.96
54. Fu J, Keurentjes J J, Bouwmeester H et al.
(2009) System-wide molecular evidence
for phenotypic buffering in Arabidopsis.
Nat Genet. 41:166-167. doi:10.1038/ng.308
55. Breitling R, Li Y, Tesson B M et al. (2008)
Genetical genomics: spotlight on QTL hotspots.
PLoS Genet. 4:e1000232p.
doi:10.1371/journal.pgen.1000232
56. Development core team R (2010) R: a language and environment for statistical computing. http://www.R-project.org
57. Broman K & Sen (2009) A guide to QTL
mapping with R/qtl. Springer Verlag, New
York, NY
58. Arends D, Prins P, Jansen R C & Broman K W
(2010) R/qtl: high-throughput multiple QTL
mapping. Bioinformatics. 26:2990-2992.
doi:10.1093/bioinformatics/btq565
59. Tierney L, Rossini A & Li N (2009) SNOW: a
parallel computing framework for the R system.
Part V
Handling Genomic Data: Resources and Computation
Chapter 20
Genomics Data Resources: Frameworks and Standards
Mark D. Wilkinson
Abstract
The emergence of genomics tools for the evolutionary and comparative biology community led to a rapid
explosion in the number of online resources targeted at this specialized community, including Web-based
comparative genomics software, such as the Artemis Comparison Tool (WebACT); databases, such as
PaleoDB, Global Biodiversity Information Facility, and TreeBase; and knowledge frameworks, such as
the Evolution Ontology. Unfortunately, these providers are largely independent of one another and
therefore the individual resources do not share any centralized plan for how the data or tools would or
should be provided. As a result, there are a myriad of often incompatible technologies and frameworks
being used by this community of providers. In this chapter, we explore approaches to online resource
publication, both those already in use by the community, as well as new and emergent frameworks and
standards. Exploration of the strengths and weaknesses of each approach, together with a brief exploration
of the philosophy or informatics theory behind the varying approaches, will hopefully help readers as they
navigate this data space. The discussion is constructed such that it lays the groundwork for exploration of a
new global standard for data and knowledge representation, the Semantic Web, that holds promise of
providing solutions to many of the complexities users face in their attempts to discover and integrate
biodiversity data, and examples are provided.
Key words: Interoperability, REST, Identifier systems, HTTP protocol, URI, URL, LSID, Web
services, Semantic Web
1. Introduction
Informatics is the field of study that examines technological
approaches that improve access to, and utilization of, information.
For bioinformaticians, informatics research and development provides the core computational communications standards and messaging syntaxes through which they can find the data they need, and then
integrate, organize, format, and analyze it. For biologists, informatics
technologies lie underneath the software applications they use day by
day that allow them to do relatively complex bioinformatics analyses
without necessarily having to become computer programmers
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_20,
© Springer Science+Business Media, LLC 2012
themselves. Informatics is also, arguably, the most broadly interdisciplinary research domain under the bioinformatics umbrella,
spanning the biological sciences, computer sciences, library sciences,
legal/ethical studies, and (increasingly) pure philosophy (1).
Broadly speaking, this chapter covers two main topics:
1. How do we name things, and what things are being named?
2. How do we get information about or analyze a named thing?
These topics are examined primarily in the context of the Web,
and, in particular, Web resources related to Evolutionary Genomics; however, the discussion occasionally extends to more general
themes, since the informatics issues faced by Evolution researchers
are shared with most other biological domains.
2. Naming
2.1. How Do We Name Things?
The process of retrieving the data and/or metadata that is identified by any Web identifier is called resolution;
therefore, URIs of all types are resolved to data or resolved to metadata by calling a server using a protocol
that is appropriate for that type of URI.
Fig. 1. An RDF Graph, representing a small portion of the data from LSID record for Pternistis leucoscepus from TDWG.
Notice that RDF is able to link URIs to textual content (dark rectangles), as well as linking URIs to other URIs (light ovals),
and moreover that these linkages themselves take the form of a URI. In this way, machines are able to traverse these vast
global graphs of inter-linked data while maintaining and understanding the context or meaning of each data linkage
without human intervention. Exercise 2 includes additional exploration of RDF retrieval and visualization.
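The structure described in the caption of Fig. 1 can be sketched as a list of subject-predicate-object triples. The following is a minimal illustration in plain Python, not the TDWG record itself: the `example.org` URIs, the `parentTaxon` predicate, and the helper names `match` and `follow` are all assumptions made for the example; only the Dublin Core `title` predicate is a real vocabulary term.

```python
class Literal(str):
    """Textual content (a dark rectangle in Fig. 1), as opposed to a URI node."""


# Illustrative URIs (hypothetical, apart from the Dublin Core "title" term).
TITLE = "http://purl.org/dc/elements/1.1/title"
PARENT = "http://example.org/terms/parentTaxon"
SPECIES = "http://example.org/taxon/species-1"
GENUS = "http://example.org/taxon/genus-1"

# Each triple is (subject URI, predicate URI, object URI or Literal).
GRAPH = [
    (SPECIES, TITLE, Literal("Pternistis leucoscepus")),
    (SPECIES, PARENT, GENUS),
    (GENUS, TITLE, Literal("Pternistis")),
]


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def follow(graph, start, predicate):
    """Walk URI-to-URI links of one predicate, starting from one node."""
    path, seen = [start], {start}
    while True:
        targets = [o for _, _, o in match(graph, s=path[-1], p=predicate)
                   if not isinstance(o, Literal)]
        if not targets or targets[0] in seen:
            return path
        path.append(targets[0])
        seen.add(targets[0])


print(follow(GRAPH, SPECIES, PARENT))
print(match(GRAPH, s=GENUS, p=TITLE))
```

Because the predicate is itself a URI, a program can tell a "title" link from a "parent taxon" link without human help, which is the property the caption emphasizes.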
page look like, and act like, typical hyperlinks. To keep the distinction clear, it is necessary to understand that:
The hyperlink:
http://lsid.tdwg.org/summary/urn:lsid:ubio.org:namebank:11815.
Displays the data obtained by resolving the following LSID:
urn:lsid:ubio.org:namebank:11815.
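The relationship between the LSID and the hyperlink that displays it can be sketched in a few lines of Python. This is a minimal sketch, assuming the usual LSID layout of `urn:lsid:authority:namespace:object` with an optional trailing revision; the names `LSID`, `parse_lsid`, and `summary_url` are hypothetical, and the resolver prefix is simply the one that appears in the hyperlink above.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text):
    """Split an LSID string into its named parts, or raise ValueError."""
    parts = text.strip().split(":")
    prefix = [p.lower() for p in parts[:2]]
    if len(parts) not in (5, 6) or prefix != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    return LSID(*parts[2:])


def summary_url(lsid_text, resolver="http://lsid.tdwg.org/summary/"):
    """Build the HTTP hyperlink that asks a resolver to display the LSID."""
    parse_lsid(lsid_text)  # validate before building the URL
    return resolver + lsid_text.strip()


lsid = "urn:lsid:ubio.org:namebank:11815"
print(parse_lsid(lsid))
print(summary_url(lsid))
```

The LSID names the thing; the URL names a Web page that happens to show the result of resolving it.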
3. Analysing
3.1. How Do We Get Information About, or Analyze a Named Thing?
The HTTP methods, GET, PUT, POST, and DELETE, roughly mimic the database operations of Retrieve,
Create, Update, and Delete. The fifth method, HEAD, is used to retrieve basic metadata about the page, such as
its expiry date, its size, or its date of creation.
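The correspondence between HTTP methods and database operations can be sketched without any network at all. The following minimal Python sketch uses a hypothetical in-memory `DocumentStore` (not part of any library) that dispatches each method according to the rough mapping given above; real Web servers are free to interpret PUT and POST differently, and the status codes are chosen only for illustration.

```python
class DocumentStore:
    """Hypothetical in-memory stand-in for the documents behind a Web server."""

    def __init__(self):
        self._docs = {}  # URL -> document text

    def handle(self, method, url, body=""):
        """Return (status, headers, body) for one request."""
        method = method.upper()
        if method == "PUT":                       # Create
            self._docs[url] = body
            return 201, {}, ""
        if url not in self._docs:
            return 404, {}, ""
        if method == "GET":                       # Retrieve
            return 200, self._headers(url), self._docs[url]
        if method == "HEAD":                      # metadata only, no body
            return 200, self._headers(url), ""
        if method == "POST":                      # Update
            self._docs[url] = body
            return 200, self._headers(url), ""
        if method == "DELETE":                    # Delete
            del self._docs[url]
            return 204, {}, ""
        return 405, {}, ""                        # method not supported

    def _headers(self, url):
        size = len(self._docs[url].encode("utf-8"))
        return {"Content-Length": str(size)}


store = DocumentStore()
url = "http://example.org/trees/1"               # illustrative URL
store.handle("PUT", url, "(A,(B,C));")
status, headers, body = store.handle("HEAD", url)
print(status, headers, repr(body))
```

HEAD returns the same headers as GET but an empty body, which is why it is suited to checking a document's size or freshness before fetching it.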
3.1.1. REST and GET
There is no formal requirement for RESTful applications to be Web based at all: REST is a design pattern,
not a Web architecture. On the contrary, the Web follows the REST pattern, not the other way around.
This URL contains the verb "find"; the "find" interface function
of PhyloWS is, therefore, exposed by the URL, together with parameters needed by the "find" function call, such as the requirement for
a contributor name. All of these become part of the name of that
document (part of its URL), and this is not considered appropriate
in REST.
In a true REST architecture, the same operation might be done
by asking the REST interface to assign you a novel URL, one
which (eventually) contains your query results. You would then use
HTTP POST to send your query parameters (find: contributor
Huelsenbeck) to this URL. This has the effect of updating (POST
corresponds to Update) the state of the document identified by that URL such
that it now contains the result of the query. These results can be
obtained by calling GET on that URL. The "find" functionality is
not exposed within any of the URLs themselves; rather, it is exposed
by allowing you to POST a set of find-query parameters to a URL
that was created specifically to identify/contain your result set.
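The request sequence just described can be sketched as follows. This is a minimal in-memory Python illustration of the pattern only, not of PhyloWS or any real service: the class `QueryService`, its methods, the `example.org` URLs, and the sample records are all hypothetical.

```python
import itertools


class QueryService:
    """Hypothetical service: result sets live at URLs that carry no verb."""

    def __init__(self, records, base="http://example.org/results/"):
        self._records = records
        self._base = base
        self._counter = itertools.count(1)
        self._documents = {}  # URL -> current state of that document

    def new_result_url(self):
        """Step 1: the service assigns a novel, initially empty, URL."""
        url = f"{self._base}{next(self._counter)}"
        self._documents[url] = []
        return url

    def post(self, url, params):
        """Step 2: POST query parameters, updating the document's state."""
        if url not in self._documents:
            raise KeyError(url)
        field, value = params["find"], params["value"]
        self._documents[url] = [r for r in self._records
                                if r.get(field) == value]

    def get(self, url):
        """Step 3: GET the document, which now holds the query results."""
        return list(self._documents[url])


records = [
    {"tree": "T1", "contributor": "Huelsenbeck"},
    {"tree": "T2", "contributor": "Smith"},
    {"tree": "T3", "contributor": "Huelsenbeck"},
]
service = QueryService(records)
result_url = service.new_result_url()
service.post(result_url, {"find": "contributor", "value": "Huelsenbeck"})
print(result_url, service.get(result_url))
```

Note that the URL identifies only the result document; the query itself travels in the POST body rather than in the name of the resource.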
Nevertheless, while PhyloWS's "RESTful" interface is not truly
RESTful, it is extremely clear how it should be used and what
functionalities it has; moreover, the PhyloWS interface exposes all
of its search and retrieval functionalities as GET-strings (URLs
with parameters); thus, the most important parts of PhyloWS
functionality, from the perspective of the comparative biologist,
can be accessed via a Web browser. This easy accessibility clearly
trumps the desire to create a philosophically pure REST interface, and, in fact, this type of faux-REST interface is almost ubiquitous in bioinformatics and life science Web frameworks for precisely
this reason! For example, other GET-string-based interfaces are
offered by the Dryad project (10), a repository of the data underlying scientific publications, and by the EOL (26).
3.1.2. Web Forms
While the true REST philosophy does provide a means for executing
analyses on data, the method for doing so (as described in the earlier
example) is quite arcane. It is far more common (and intuitive) to
simply send data to a computational tool, and be presented with a
result. This functionality has traditionally been served by Web
Forms: Web pages with fields that can be filled in or selected by
users to achieve their desired outcome. Web Forms are specifically
designed to be utilized by a human operator and are generally
embedded within the content and visual layout elements of the
HTML page. Moreover, because they are manual interfaces, Web
While many of the Web Service and Semantic Web Service projects
have their own repositories, there has recently been a move towards
providing a single source for searching and browsing both Web
Services and Workflows, particularly within the bioinformatics and
genomics communities.
The BioCatalogue (41) is "The Life Science Web Service Registry," with functionality to discover, register, annotate, and monitor biological Web Services. A primary objective of the
BioCatalogue project team is to overcome the frustrating lack of
annotation that is currently true of most bioinformatics Web Service interfaces. They plan to achieve this by (a) creating a standard
minimal set of annotation elements required to be a "well-behaved" Service provider and (b) opening up the annotation
interface in "Web 2.0" open and collaborative style, where any
user can annotate any Web Service. The goal of this focus on
annotation is to make Service discovery more accurate and complete, as well as assist end-users in correctly wiring together
service input and output data components into meaningful functional workflows.
In parallel with BioCatalogue, and led by the same group, is the
myExperiment (42) project. Like BioCatalogue, myExperiment is a
Web 2.0-style repository, but with a focus on Workflows as the
primary deposition. myExperiment encourages sharing of and
social media-type discussion about workflows, as well as keeping
track of the edit history of workflows as they are reused and repurposed by varying end-users.
4. Summary
This chapter provided a high-level overview of the widely divergent
approaches to data and tool provision in evolutionary biology and
genomics. Technological, social, and philosophical decisions are
made by individual resource providers largely in response to the
specific needs of their target communities. Moreover, given limited
resources, data and tool providers are often loath to buy in to new
and potentially transient or flawed technologies. As a result, data
integration, particularly automated data integration, from one
resource to another can be difficult and error prone. While the
discussed technology has implications beyond evolutionary biology, even beyond bioinformatics, it is highly relevant because in
evolutionary biology we are dealing with complex data integration.
In discussing in some detail the issues related to data integration
in general and how the various Evolutionary Genomics projects have
dealt with these issues, it will hopefully be easier for the users of these
resources to utilize their offerings. In addition, the emergence of new
semantic technologies, technologies that are now starting to be
5. Exercises
5.1. Exercise 1: LSIDs
5.2. Exercise 2: Exploring RDF
5.4. Exercise 4: The SHARE Interface into SADI Semantic Web Services
Browse to the SADI Framework homepage at http://sadiframework.org. Click the "Show Me" tab, and then follow the link to the
SHARE demonstration.
The SHARE demo presents SADI Semantic Web Services as if
they represented a massive global database of bioinformatics information. The SHARE interface is simply a text box, which is where
you type queries over this database. The query language used is
called SPARQL, the approved language for querying RDF data.
Understanding SPARQL queries is quite straightforward.
1. The SELECT clause details the variables that you wish to be
filled with your query results.
5.5. Exercise 5: myExperiment
Users: Those individuals who have contributed anything matching those keywords
Under "Workflows," scroll down to the workflow titled "Compare two genomes for similarity." Clicking on it brings you to a
preview pane, where you can examine the workflow and see the
contributor's comments about it. In this case, the workflow reads in
two FASTA files (representing whole genomes) and then uses the
M-GCAT algorithm to compare them.
Under the "Download" heading, click on the link and save the
file to your desktop. Now open Taverna and load that file. You will
see the workflow in Taverna's preview pane, and are now ready to
run that analysis on your own FASTA-formatted data, following
what you learned in the tutorials from Exercise 3.
References
1. Stein, L. (2003). Bioinformatics: Gone in
2012. O'Reilly Bioinformatics Technology
Conference, 2003, San Diego, California,
USA.
2. Pearson, H. (2001). Biology's name game.
Nature 411 (7 June), 631-632.
3. Good, B.; Wilkinson, M. D. (2006). The Life
Sciences Semantic Web is Full of Creeps! Briefings in Bioinformatics 7 (3), 275-286.
4. World Wide Web Consortium. Cool URIs.
http://www.w3.org/TR/cooluris.
5. World Wide Web Consortium. URIs, URLs,
and URNs: Clarifications and Recommendations 1.0. http://www.w3.org/TR/uri-clarification.
6. Clark, T.; Martin, S.; Liefeld, T. (2004). Globally distributed object identification for
biological knowledgebases. Briefings in Bioinformatics 5 (1), 57-70.
7. Bafna, S.; Humphries, J.; Miranker, D. (2008).
Schema driven assignment and implementation
of life science identifiers (LSIDs). Journal of
Biomedical Informatics 41 (5), 730-738.
8. Mendelsohn, N. My conversation with Sean
Martin about LSIDs. http://lists.w3.org/Archives/Public/www-tag/2006Jul/0041.
Chapter 21
Sharing Programming Resources Between Bio*
Projects Through Remote Procedure Call and Native
Call Stack Strategies
Pjotr Prins, Naohisa Goto, Andrew Yates, Laurent Gautier,
Scooter Willis, Christopher Fields, and Toshiaki Katayama
Abstract
Open-source software (OSS) encourages computer programmers to reuse software components written by
others. In evolutionary bioinformatics, OSS comes in a broad range of programming languages, including
C/C++, Perl, Python, Ruby, Java, and R. To avoid writing the same functionality multiple times for different
languages, it is possible to share components by bridging computer languages and Bio* projects, such as
BioPerl, Biopython, BioRuby, BioJava, and R/Bioconductor. In this chapter, we compare the two principal
approaches for sharing software between different programming languages: either by remote procedure call
(RPC) or by sharing a local call stack. RPC provides a language-independent protocol over a network
interface; examples are RSOAP and Rserve. The local call stack provides a between-language mapping not
over the network interface, but directly in computer memory; examples are R bindings, RPy, and languages
sharing the Java Virtual Machine stack. This functionality provides strategies for sharing of software between
Bio* projects, which can be exploited more often. Here, we present cross-language examples for sequence
translation, and measure throughput of the different options. We compare calling into R through native R,
RSOAP, Rserve, and RPy interfaces, with the performance of native BioPerl, Biopython, BioJava, and
BioRuby implementations, and with call stack bindings to BioJava and the European Molecular Biology
Open Software Suite. In general, call stack approaches outperform native Bio* implementations and these, in
turn, outperform RPC-based approaches. To test and compare strategies, we provide a downloadable
BioNode image with all examples, tools, and libraries included. The BioNode image can be run on VirtualBox-supported operating systems, including Windows, OSX, and Linux.
Key words: Bioinformatics, R, BioPerl, BioRuby, Biopython, BioJava, Web services, Remote
procedure call, Java virtual machine
1. Introduction
Bioinformatics has created its tower of Babel. The full set of
functionality for bioinformatics, including statistical and computational methods for evolutionary biology, is implemented in a range
of computer languages, including Java, C/C++, Perl, Python,
Ruby, and R. This comes as no surprise, as language design is the
result of multiple trade-offs, for example, in strictness, convenience,
and performance.
For example, Java is a statically typed compiled language, and R
is a dynamically typed interpreted language. In principle, a compiled language is converted into machine code once by a language
compiler, and an interpreted language is compiled every time at
runtime, the moment it is run by the interpreter. Static typing
allows the compiler to optimize machine code for speed. Dynamic
typing resolves variable and function types at runtime, and is typically suited for an interpreter. Design decisions cause Java to have
stronger type checking and faster execution speed than R. Meanwhile, R offers sophisticated interactive analysis of data in an interpreted shell, not directly possible with Java. When comparing
runtime performance of these languages, compiled statically typed
languages, such as C and Java, outperform interpreted dynamically
typed languages, such as Python, Perl, and R. For comparisons,
see ref. 1.
Runtime performance, however, is not the only criterion for
selecting a computer language. Another important criterion may
be conciseness. All mentioned interpreted languages allow functionality to be written in fewer lines of code than Java. The number
of lines matters, as it is often easier to grasp something expressed in
a short and concise fashion, if done competently, leading to easier
coding and maintenance of software, i.e., programmer productivity. In general, with R, Perl, Python, and Ruby, it takes fewer lines of
code to write software than with C or Java; see also ref. 1. Based on
the conciseness criterion, these languages fall into the same two
groups as when split on performance. This may suggest a trade-off
between execution speed and conciseness or execution speed and
programmer productivity.
Discussing other important criteria for selecting a programming
language, such as ease of understanding, productivity, portability,
and the size and dynamics of the supporting Bio* project developer
communities, is beyond the scope of this book. The authors, who
have different individual preferences, wish to emphasize that every
language has characteristics driven by language design and there is
no single perfect all-purpose computer language.
In practice, the choice of a computer language depends mainly
on the individuals involved in a project partly due to the investment
1.4. Comparing Approaches
2. Results
2.1. Calling into R from Other Languages
RSOAP
Next, we added an R/SOAP (34) adapter for codon translation
and invoke it from Python. RSOAP provides a SOAP interface
for R. After starting up the R instance, which acts as a SOAP server,
usage is
MSMVRNVSNQSEKLEIL
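import pyRserve
# Connect to a running Rserve instance. Newer pyRserve releases spell these
# calls pyRserve.connect() and conn.eval("...") instead.
conn = pyRserve.rconnect()
conn('library(GeneR)')
conn('strTranslate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")')
MSMVRNVSNQSEKLEIL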
where Biopython (7) is used for parsing FASTA, and the Rserve + GeneR
service translates. At 767 Seq/s, Python + Rserve's speed is
comparable to calling within R, and seven times faster than
Python + RSOAP (Fig. 1). The script is named DNAtranslate.py.
2.1.3. Calling into R from Other Languages with the Call Stack Approach
RPy2 executes R code from within Python over a local call stack (3).
Invoking the same GeneR functions from Python:
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
# Load the GeneR package into the embedded R session
importr('GeneR')
strTranslate = robjects.r['strTranslate']
strTranslate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")[0]
MSMVRNVSNQSEKLEIL
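from Bio.Seq import Seq
# Bio.Alphabet (and generic_dna) was removed in Biopython 1.78; with newer
# releases, construct the Seq without the alphabet argument.
from Bio.Alphabet import generic_dna
coding_dna = Seq("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt", generic_dna)
coding_dna.translate()
Seq('MSMVRNVSNQSEKLEIL', ExtendedIUPACProtein())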
val rna = dna.getRNASequence(transcriber)
rna.getProteinSequence(transcriber)
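For reference, the computation that each of the bridged calls above delegates to another library can be sketched in pure Python. This is a minimal sketch assuming the standard genetic code and a single forward reading frame; the function name `translate` and the handling of a trailing partial codon (dropped) are choices made for the example, not the behavior of GeneR, Biopython, or BioJava.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA (or RNA) in frame 1; unknown codons become 'X'."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


sequence = "atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt"
print(translate(sequence))
```

Running the same input through each bridge and comparing the result against such a reference is a simple way to confirm that the different interfaces agree.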
3. Discussion
Cross-language interfacing is a topic of importance to evolutionary
genomics because computational biologists need to provide tools
that are capable of complex analysis and cope with the amount of
biological data generated by the latest technologies. Cross-language
interfacing allows sharing of code. This means computer software
can be written in the computer language of choice for a particular
purpose. Flexibility in choice of computer programming language
allows optimizing of computational resources, and, perhaps even
more important, software developer resources, in bioinformatics.
When some functionality is needed that exists in a different
computer language than the one used for a project, a developer has
the following options: either rewrite the code in the preferred
language, essentially a duplication of effort, or bridge from one
language to the other. For bridging, there are essentially two
technical methods that allow full programmatic access to functionality: through RPC or a local call stack.
RPC function invocation, over a network interface, has the
advantage of being language agnostic, and even machine independent. A function can run on a different machine or even over the
Internet, which is the basis of Web services and may be attractive
even for running services locally. RPC XML-based technologies,
however, are slow because of expensive parsing and high data load.
Metrics suggest that it may be worth experimenting with binary
protocols, such as Rserve.
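The data overhead of XML-based RPC can be made visible with the XML-RPC support in the Python standard library. This is only a rough illustration: XML-RPC stands in here for SOAP (which is typically more verbose still), the remote method name `translate` is hypothetical, and no server is contacted.

```python
import xmlrpc.client

sequence = "atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt"

# Serialize a call to a hypothetical remote method named "translate".
request_xml = xmlrpc.client.dumps((sequence,), methodname="translate")

# The receiving side must parse the XML again before it can do any work.
params, method = xmlrpc.client.loads(request_xml)

payload_size = len(request_xml.encode("utf-8"))
print(f"sequence: {len(sequence)} bytes, XML request: {payload_size} bytes")
print(request_xml)
```

Every call pays for this wrapping and parsing on both the request and the response, which is one reason a binary protocol can be attractive when a function is invoked many times.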
When performance is critical, e.g., when much data needs to be
transported, or functions are invoked millions of times, a native call
stack approach may be preferred over RPC. Metrics suggest that the
EMBOSS C implementation performs well, and that binding to the
native C libraries with SWIG is efficient. Alternatively, it is possible
to use R as an intermediate to C libraries. Interestingly, calling R
libraries, many of which are written in C, may give higher performance than calling into native Bio* implementations. For example,
Python + RPy + GeneR is faster than Biopython's pure Python
implementation of sequence translation.
Even though RPC may perform less well than local stack-based
approaches, RPC has some real advantages. For example, if you
have a choice between calling a local BLAST library and calling into a remote
and ready NCBI RPC interface, the latter avoids the deployment
complexity. Also, the public resource may be more up to date than a
copied server running locally. This holds for many curated services
that involve large databases, such as PDB (38), Pfam (39), KEGG
(40), and UniProt (41). Chapter 20 of this volume gives a deeper
treatment of these Internet resources (12).
From the examples given in this chapter, it may be clear that
actual invocation of functions through the different technologies is
similar, i.e., all listed Python scripts look similar, provided the
underlying dependencies on tools and libraries have been resolved.
The main difference between implementations is with the deployment of software, rather than invocation of functionality. The JVM
approach is of interest, as it makes bridging between supported
languages transparent and deployment straightforward. Not only
can languages be mixed, but also the advanced Java tool chain is
available, including debuggers, profilers, load distributors, and
build tools. Other shared virtual machines, such as .NET and
Parrot, potentially offer similar advantages, but are currently less
used in bioinformatics.
When striving for reliable and correct software solutions, the
alternative strategy of calling computer programs as external units
via the command line should be discouraged: not only is it less
efficient, because a program gets started every time a function gets called,
but a potential deployment nightmare is also introduced. What
happens when the program is not installed, or the interface
4. Questions
1. Install BioNode and run the different test scripts. Can you
replicate the differences in throughput statistics?
2. Why is SOAP the slowest protocol?
3. What are the possible advantages of using a virtual machine,
such as the JVM?
4. If you were to bridge between your favorite language and an R
library, what options do you have?
Acknowledgments
We thank all OSS developers for creating such great tools and
libraries for the scientific community.
References
1. The computer language benchmarks game.
http://shootout.alioth.debian.org
2. Gentleman R C, Carey V J, Bates D M et al.
(2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5:R80p. doi:10.1186/
gb-2004-5-10-r80
3. Gautier L (2010) An intuitive Python interface
for Bioconductor libraries demonstrates the
utility of language translators. BMC Bioinfor-
24:303-311. http://dx.doi.org/10.1007/s00180-008-0132-x
6. Stajich J E, Block D, Boulez K et al. (2002)
The Bioperl toolkit: Perl modules for the life
sciences. Genome Res. 12:1611-1618.
doi:10.1101/gr.361602
7. Cock P J, Antao T, Chang J T et al. (2009)
Biopython: freely available Python tools for
computational molecular biology and bioinformatics.
Bioinformatics. 25:1422-1423.
doi:10.1093/bioinformatics/btp163
8. Goto N, Prins P, Nakao M et al. (2010)
Bioruby: bioinformatics software for the Ruby
programming language. Bioinformatics.
26:2617-2619. doi:10.1093/bioinformatics/btq475
9. Holland R C, Down T A, Pocock M et al. (2008)
BioJava: an open-source framework for bioinformatics.
Bioinformatics. 24:2096-2097.
doi:10.1093/bioinformatics/btn397
10. Rice P, Longden I & Bleasby A (2000)
EMBOSS: the European molecular biology
open software suite. Trends Genet.
16:276-277. http://www.ncbi.nlm.nih.gov/pubmed/10827456
11. Dutheil J, Gaillard S, Bazin E et al. (2006)
Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and
population genetics. BMC Bioinformatics.
7:188p. doi:10.1186/1471-2105-7-188
12. Wilkinson M (2012) Genomics data resources:
frameworks and standards. In: Anisimova M
(ed) Evolutionary genomics: statistical and computational methods (volume 2). Methods in
Molecular Biology, Springer Science+Business
Media New York
13. Yang Z (1997) PAML: a program package for
phylogenetic analysis by maximum likelihood.
Comput Appl Biosci. 13:555-556
14. Eddy S R (2008) A probabilistic model of
local sequence alignment that simplifies
statistical significance estimation. PLoS Comput Biol. 4:e1000069p. doi:10.1371/journal.
pcbi.1000069
15. Larkin M A, Blackshields G, Brown N P et al.
(2007) Clustal W and Clustal X version 2.0.
Bioinformatics. 23:2947-2948. doi:10.1093/bioinformatics/btm404
16. Katoh K, Kuma K, Toh H & Miyata T (2005)
MAFFT version 5: improvement in accuracy of
multiple sequence alignment. Nucleic Acids
Res. 33:511-518. doi:10.1093/nar/gki198
17. Edgar R C (2004) MUSCLE: a multiple
sequence alignment method with reduced
time and space complexity. BMC Bioinformatics. 5:113p. doi:10.1186/1471-2105-5-113
mapping. Bioinformatics. 26:2990-2992.
doi:10.1093/bioinformatics/btq565
31. Yandell B S, Mehta T, Banerjee S et al. (2007)
R/qtlbim: QTL with Bayesian interval
mapping in experimental crosses. Bioinformatics. 23:641-643. doi:10.1093/bioinformatics/btm011
32. Harris T W, Antoshechkin I, Bieri T et al.
(2010) WormBase: a comprehensive resource
for nematode research. Nucleic Acids Res. 38:
D463-D467. doi:10.1093/nar/gkp952
33. Cottret L, Lucas A, Marrakchi E et al. GeneR: R
for genes and sequences analysis.
http://www.bioconductor.org/help/bioc-views/release/bioc/html/GeneR.html
34. Warnes G (2004) RSOAP provides a SOAP interface for the open-source statistical package R.
http://research.warnes.net/statcomp/projects/RStatServer/rsoap
35. Koenig D, Glover A, King P, Laforge G &
Skeet J (2007) Groovy in action. Manning
Publications Co. Greenwich, CT, USA
Chapter 22
Scalable Computing for Evolutionary Genomics*
Pjotr Prins, Dominique Belhachemi, Steffen Möller, and Geert Smant
Abstract
Genomic data analysis in evolutionary biology is becoming so computationally intensive that analysis of
multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, we discuss
techniques for scaling computations through parallelization of calculations, after giving a quick overview of
advanced programming techniques. Unfortunately, parallel programming is difficult and requires special
software design. The alternative, especially attractive for legacy software, is to introduce "poor man's
parallelization by running whole programs in parallel as separate processes, using job schedulers. Such
pipelines are often deployed on bioinformatics computer clusters.
Recent advances in PC virtualization have made it possible to run a full computer operating system, with
all of its installed software, on top of another operating system, inside a "box," or virtual machine (VM).
Such a VM can flexibly be deployed on multiple computers, in a local network, e.g., on existing desktop
PCs, and even in the Cloud, to create a virtual computer cluster. Many bioinformatics applications in
evolutionary biology can be run in parallel, running processes in one or more VMs. Here, we show how a
ready-made bioinformatics VM image, named BioNode, effectively creates a computing cluster, and
pipeline, in a few steps. This allows researchers to scale-up computations from their desktop, using available
hardware, anytime it is required.
BioNode is based on Debian Linux and can run on networked PCs and in the Cloud. Over 200
bioinformatics and statistical software packages, of interest to evolutionary biology, are included, such as
PAML, Muscle, MAFFT, MrBayes, and BLAST. Most of these software packages are maintained through
the Debian Med project. In addition, BioNode contains convenient configuration scripts for parallelizing
bioinformatics software. Where Debian Med encourages packaging free and open source bioinformatics
software through one central project, BioNode encourages creating free and open source VM images, for
multiple targets, through one central project.
BioNode can be deployed on Windows, OSX, Linux, and in the Cloud. Next to the downloadable
BioNode images, we provide tutorials online, which empower bioinformaticians to install and run BioNode
in different environments, as well as information for future initiatives, on creating and building such images.
Key words: BioNode, Bioinformatics, Evolutionary biology, Big data, Parallelization, MPI, Cloud
computing, Cluster computing, Virtual machine, Amazon EC2, OpenStack, PAML, MrBayes,
VirtualBox, Debian Linux
Availability: The 32-bit and 64-bit BioNode desktop images for VirtualBox and the BioNode Cloud images are
based on free and open source software and can be found at http://www.evolutionarygenomics.net/ and
http://biobeat.org/bionode.
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_22,
© Springer Science+Business Media, LLC 2012
1. Introduction
Investigative evolutionary biology, nowadays, includes comparative
analysis of genomes, transcriptomes, proteomes, and interactomes,
across individuals and even across species. The analysis of data,
generated by the latest acquisition technologies, is becoming so
computationally intensive that either an analysis won't run on a
desktop computer or it is so slow that it prevents researchers from
trying different scenarios and/or hypotheses.
Evolutionary genomics often requires lengthy computations in
a multidimensional search space. Examples of such expensive computations are Bayesian analysis, inference based on Hidden Markov
Models, and maximum likelihood analysis, implemented, e.g., by
MrBayes (1), HMMER (2), and phylogenetic analysis by maximum
likelihood (PAML) (3), respectively. Genome-sized data, or Big
Data (4), such as produced by next-generation sequencers, as well
as growing sample sets, such as from the 1,000 Genomes Project (5),
are exacerbating the computational time problem.
In addition to being computationally expensive, many implementations of major algorithms and tools in bioinformatics do not
scale automatically. An example of legacy software requiring
lengthy computation is Ziheng Yang's codeml implementation of
PAML (3). PAML can find amino acid sites which show evidence of
positive selection using dN/dS ratios, i.e., the ratio of nonsynonymous to synonymous substitution rates; see also Chapter 5
of this volume on selection on the protein coding genome (6).
Executing PAML over an alignment of a hundred sequences may
take hours, sometimes days, even on a fast PC. PAML (version 4.x)
is designed as a single-threaded process and can only utilize one
single central processing unit (CPU) to complete a calculation. To
test hundreds of alignments, e.g., different gene families, PAML is
invoked hundreds of times in a serial fashion, possibly taking days
on a single computer. Here, we use PAML as an example, but the
idea holds for any software program that is both CPU bound, i.e.,
the CPU speed determines total program execution time. A CPU
bound program will show (close to) 100% usage for a CPU. A large
number of such legacy programs are CPU bound and do not scale
by themselves.
Scaling up of computations may be possible through parallelization. Parallelization means the computational effort is distributed
among multiple CPUs. This can be among multiple cores within a
single processor, a multiprocessor system, or a network of computers, a so-called computing cluster. While CPUs are still getting
faster, in recent years most of the gain in computational processing
power has come from parallelization.
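As an illustration of distributing independent jobs over several CPUs on one machine, the following minimal sketch runs one job per input name, a fixed number at a time. It assumes an xargs that supports the -P option (the GNU and BSD versions do). The function name run_parallel and the job names are hypothetical, and the job body is a placeholder that only writes a marker file; a real job would invoke the CPU bound program instead.

```shell
#!/bin/sh
# run_parallel WORKERS OUTDIR NAME...
# Runs one placeholder job per NAME, at most WORKERS jobs at a time.
# Hypothetical sketch: replace the body of the sh -c script with the real
# CPU bound command (for example, one analysis per alignment).
run_parallel() {
    workers="$1"
    outdir="$2"
    shift 2
    printf '%s\n' "$@" |
        xargs -P "$workers" -I {} \
            sh -c 'echo "finished $1" > "$2/$1.out"' _ {} "$outdir"
}

# Demo: three placeholder jobs, two at a time, in a scratch directory.
demo_dir=$(mktemp -d)
run_parallel 2 "$demo_dir" geneA geneB geneC
ls "$demo_dir"
```

Because the jobs do not communicate with each other, the number of workers can be raised up to the number of available CPUs without changing the jobs themselves.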
2. On-Demand Scalability with BioNode
2.1. Packaging Software for BioNode
2.3. Parallelizing an Application with BioNode on a Desktop PC
For the PC, BioNode comes as an Internet downloadable VirtualBox image, ready for the desktop. VirtualBox is an x86 virtualization application, with similar functionality to, e.g., VMware or
Xen, that is installed in an existing host operating system (e.g.,
ref. 29). Within this box, additional operating systems, each
known as a Guest OS, can be loaded and run, each with its own environment. This means a researcher can run a BioNode on an existing
installation of Microsoft Windows, Apple OSX, or Linux. While
VirtualBox is a commercial product, there also exists a free and open
source edition (OSE), which can be freely deployed on existing
PCs on a local area network (LAN). VirtualBox uses hardware
virtualization, which gives it close to native performance (e.g.,
ref. 30). On a PC, install the free VirtualBox on Windows, OSX,
or Linux; download our BioNode image and add it to VirtualBox.
For example, see VirtualBox online tutorials or our BioNode tutorial (31). In VirtualBox, specify the number of CPUs to use as well
as computer memory. When the image boots up, it presents a
standard Debian Linux desktop; log in with user "guest" and password "guest". The desktop allows the use of both graphics and
command-based tools.
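The CPU and memory settings can also be applied from the command line of the host. The following is a minimal sketch, assuming that the VBoxManage tool shipped with VirtualBox is installed on the host and that the imported machine was registered under the name BioNode (a hypothetical name, as are the values of 4 CPUs and 4096 MB). The helper only prints the command, so that it can be inspected before it is run; the virtual machine has to be powered off when the command is applied.

```shell
#!/bin/sh
# vm_resources_cmd NAME CPUS MEMORY_MB
# Prints (does not run) the VBoxManage call that sets the number of virtual
# CPUs and the memory, in megabytes, of a powered-off virtual machine.
vm_resources_cmd() {
    echo "VBoxManage modifyvm \"$1\" --cpus $2 --memory $3"
}

# Hypothetical machine name and sizes; adjust to the resources of the host.
vm_resources_cmd BioNode 4 4096
```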
We have created a number of tests, or examples, for checking
that a BioNode works and for measuring its performance. Tests are
available as icons, or can be run from the command line, in the
/home/guest/Springer/Scalability directory. For example, to
run the PAML20 test, click on the icon, which runs the prepared
script:
Shell
cd /home/guest/Springer/Scalability
# run the single CPU version
./scripts/run-CPU1-PAML20.sh
# run the parallel version on four CPUs
./scripts/run-rq-PAML20.sh 4
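To compare the two runs, each script can be timed. The sketch below is a hypothetical helper, not part of BioNode; it assumes a date command that supports the +%s format (seconds since the epoch, available in GNU and BSD date) and it reports whole seconds only.

```shell
#!/bin/sh
# elapsed_seconds COMMAND [ARGS...]
# Prints the wall-clock time of COMMAND in whole seconds.
# Hypothetical helper; the output of the command itself is discarded.
elapsed_seconds() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Intended use on a BioNode (left commented out because the scripts exist
# only inside the image):
#   t1=$(elapsed_seconds ./scripts/run-CPU1-PAML20.sh)
#   t4=$(elapsed_seconds ./scripts/run-rq-PAML20.sh 4)
#   echo "single CPU: ${t1}s, four CPUs: ${t4}s"
```

For runs that finish within a few seconds, the one-second resolution is too coarse and a longer test case gives a more meaningful comparison.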
2.4. Parallelizing an Application with BioNode on Multiple Machines
2.5. Parallelizing an Application with BioNode in the Cloud
Fig. 1. Schematic diagram of scaling up computations on BioNode, here an example of SNP detection, both on a local area
network (LAN) and in the Cloud. From the PC, BioNodes are started, first virtualizing BioNode with VirtualBox on idle
computers on the LAN, e.g., on office or laboratory computers, and next by running BioNode in the Cloud, e.g., Amazon
EC2, when more calculation power is required. Jobs are distributed across nodes. This way, a virtual computing cluster
is created, where nodes communicate through a shared file storage (FS), which can be located either on the LAN or in the
Cloud. BioNode provides a full Debian Linux environment, with the largest collection of free and open source bioinformatics
software currently available. From the user's perspective, scaling BioNode from a PC, onto the LAN, and into the cloud,
amounts to a single investment. Note that clustering computers in the cloud does not escape the physical bottlenecks
of computing, i.e., computer networks are a bottleneck for big data, see also ref. 8. Public domain graphics courtesy of
http://www.openclipart.org.
3. Discussion
In this chapter, we discuss the scaling up of computations through
parallelization, a necessary strategy because the rate of data
acquisition in biology increases rapidly and outpaces increases in computer
hardware speed. In bioinformatics, the common parallelization strategy is to take an existing nonparallel application and
divide the data into discrete units of work, or jobs, which are distributed across multiple
CPUs and clustered computers. Ideally, parallelizing processes
shows a linear performance increase for every CPU added, but in
reality the increase is usually less than linear. Resource contention on the
machine, e.g., for disk or network IO, makes processes wait for each other.
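The gap between ideal and observed scaling can be quantified from two timings. The sketch below uses the standard definitions of speedup (single-CPU time divided by parallel time) and parallel efficiency (speedup divided by the number of CPUs); the function names and the timings in the demo are hypothetical and are not measurements from BioNode.

```shell
#!/bin/sh
# speedup T1 TN: time on one CPU divided by time on N CPUs.
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# efficiency T1 TN N: speedup divided by the number of CPUs (1.00 is linear).
efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN { printf "%.2f\n", t1 / (tn * n) }'
}

# Hypothetical timings: 120 s on one CPU, 40 s on four CPUs.
speedup 120 40        # prints 3.00
efficiency 120 40 4   # prints 0.75
```

An efficiency well below 1.00 suggests that the jobs are waiting on a shared resource, such as disk or network IO, rather than on the CPU.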
We created BioNode, a ready-made Linux image for
parallelized computing that can be downloaded from the Internet
and deployed as a virtual machine, so that it can run on a single
multicore desktop computer, a few networked computers, and even
in the Cloud. BioNode is based on Debian Linux and includes
software packages, and meta-packages, of the Debian Med team.
large file systems. One important issue is that Cloud providers offer
hardware that is not necessarily designed for high throughput at
every level. For example, hard disk IO may be a bottleneck. Network speeds in the Cloud can fluctuate and can be low, e.g., when
transferring data between S3 and EC2. Also, multiple VMs may be
competing for resources on a single machine, whether it concerns
disk or network IO. We strongly recommend validating assumptions and running trials first. Cloud computing is of interest for
bioinformatics, currently for computational problems that can be
split into jobs that require little computer memory and avoid large
data transfers. For other types of problems, such as sequence
assembly, it is more attractive to use a single large multicore computer with large memory and fast storage (8).
For additional information on downloading, installing, and
using BioNode, see the provided online tutorial and wiki space
(31). We also include online resources that contain build instructions for creating these images and information for running
TORQUE and setting up a Cloud cluster with Amazon EC2 or
Eucalyptus. BioNode can be used as the basis for specialized bioinformatics Linux (cluster) VMs. Finally, BioNode provides a flexible
cluster environment with a low barrier to entry, even for researchers
who normally use a Microsoft Windows desktop. BioNode is not
only useful for scaling computations, but can also be used for
educational purposes, especially as the experience gained with
tools and techniques applies to Unix and HPC setups.
4. Questions
1. Download and install BioNode on a desktop, using the instructions in the tutorial (31). How much time does it take to run
the test script discussed above?
2. Install BioNode on a second machine with a bridged network
interface. Mount NFS or sshfs. How much time does it take to
run the test script now?
3. Using online tutorials, create a free EC2 instance, create keys,
and locate and fire up a BioNode AMI. Log in to BioNode
using ssh and record how much time it takes to run the test
script.
4. Use the Amazon EC2 calculation sheet and calculate how
much it would cost to store 100 GB in S3, and execute a
calculation on 100 large nodes, each reading 20 GB of
data. Do the same for another Cloud provider.
Acknowledgments
The European Commission's Integrated Project BIOEXPLOIT
(FOOD-2005-513959 to G.S. and P.P.); the Netherlands Organization for Scientific Research/TTI Green Genetics (1CC029RP to P.P.).
References
1. Ronquist F & Huelsenbeck J P (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–1574
2. Eddy S R (2008) A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol. 4:e1000069p
3. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13:555–556
4. Doctorow C (2008) Big data: welcome to the petacentre. Nature 455:16–21
5. Durbin R M, Abecasis G R, Altshuler D L et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073
6. Kosiol C & Anisimova M (2012) Selection on the protein coding genome. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
7. Schadt E E, Linderman M D, Sorenson J, Lee L & Nolan G P (2010) Computational solutions to large-scale data management and analysis. Nat Rev Genet. 11:647–657
8. Trelles O, Prins P, Snir M & Jansen R C (2012) Big data, but are we ready? Nat Rev Genet. 12:224p. http://www.ncbi.nlm.nih.gov/pubmed/21301471
9. Patterson D A & Hennessy J L (1998) Computer organization and design (2nd ed.): the hardware/software interface. Morgan Kaufmann Publishers Inc
10. Mattson T, Sanders B & Massingill B (2004) Patterns for parallel programming. Addison-Wesley Professional, 384 pages. http://portal.acm.org/citation.cfm?id=1406956
11. Graham R L, Woodall T S & Squyres J M (2005) Open MPI: a flexible high performance MPI
12. Stamatakis A & Ott M (2008) Exploiting fine-grained parallelism in the phylogenetic likelihood function with mpi, pthreads, and openmp: a performance study. Pattern Recognition in Bioinformatics, Springer Berlin/Heidelberg, 424–435. http://dx.doi.org/10.1007/978-3-540-88436-1_36
13. Tierney L, Rossini A & Li N (2009) Snow: a parallel computing framework for the R system. International Journal of Parallel Programming 37:78–90. http://dx.doi.org/10.1007/s10766-008-0077-2
14. Cesarini F & Thompson S (2009) Erlang programming. 1st. O'Reilly Media, Inc.
15. Peyton Jones S (2003) The Haskell 98 language and libraries: the revised report. Journal of Functional Programming 13:0–255
16. Odersky M, Altherr P, Cremet V et al. (2004) An overview of the Scala programming language. LAMP-EPFL
17. Okasaki C (1998) Purely functional data structures. Cambridge University Press, doi:10.2277/0521663504
18. Alexandrescu A (2010) The D programming language. 1st. Addison-Wesley Professional, 460p
19. Griesemer R, Pike R & Thompson K (2009) The Go programming language. http://golang.org
20. Hoare C A R (1978) Communicating sequential processes. Commun. ACM 21:666–677. doi: http://doi.acm.org/10.1145/359576.359585
21. Welch P, Aldous J & Foster J (2002) Csp networking for java (jcsp.net). Computational Science - ICCS 2002. 695–708
22. Sufrin B (2008) Communicating scala objects. Communicating Process Architectures. 35p
23. Dean J & Ghemawat S (2008) MapReduce: Simplified data processing on large clusters. Communications of the ACM 51:107–113
24. White T (2009) Hadoop: the definitive guide. first edition. O'Reilly, http://oreilly.com/catalog/9780596521981
25. May P, Ehrlich H & Steinke T (2006) Zib structure prediction pipeline: composing a complex biological workflow through web services. Euro-Par 2006 Parallel Processing, Springer Berlin/Heidelberg, 1148–1158. http://dx.doi.org/10.1007/11823285_121
26. Mungall C J, Misra S, Berman B P et al.
(2002) An integrated computational pipeline
and database to support whole-genome
sequence annotation. Genome Biol. 3:
throughput. Nucleic Acids Res. 32:1792–1797. doi:10.1093/nar/gkh340
34. Schneider A, Souvorov A, Sabath N et al. (2009) Estimates of positive darwinian selection are inflated by errors in sequencing, annotation, and alignment. Genome Biol Evol. 1:114–118. doi:10.1093/gbe/evp012
35. Pond S L, Frost S D & Muse S V (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics 21:676–679. http://www.ncbi.nlm.nih.gov/pubmed/15509596
36. Gentzsch W (2002) Sun grid engine: towards creating a compute power grid. Cluster Computing and the Grid, 2001. Proceedings. First IEEE/ACM International Symposium on, IEEE, 35–36
37. Staples G (2006) Torque resource manager. Proceedings of the 2006 ACM/IEEE conference on Supercomputing, ACM, doi: http://doi.acm.org/10.1145/1188455.1188464
38. Openstack open source cloud computing software. http://www.openstack.org
39. Nurmi D, Wolski R, Grzegorczyk C et al. (2009) The Eucalyptus open-source cloud-computing system. Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, IEEE Computer Society, 124–131
40. Matthews S J & Williams T L (2010) Mrsrf: an efficient mapreduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinformatics 11 Suppl 1:S15p
INDEX
A
Actors...................................................................532, 533
Adaptation. See Adaptive, evolution; Selection, positive
Adaptive
evolution.............. 34, 103, 121, 151, 181, 471, 475
immune system..............................................242, 474
Admixture................................................... 218, 230235
Akaike Information Criterion (AIC)......................... 126,
237, 239241, 247, 248
Algorithm
ElstonStewart algorithm ............................. 219221
LanderGreen algorithm ..................... 221223, 233
Alleles ..................................................................8, 10, 13,
14, 16, 130, 153, 166, 218, 219, 221, 222,
224226, 232, 242, 252, 259, 261, 276283,
288, 322324, 470, 475, 478, 499
ALS disease. See Amyotrophic lateral sclerosis
(ALS) disease
Amazon................................................................ 539543
Amplified fragment length polymorphism
(AFLP) ........................................................470
Amyotrophic lateral sclerosis (ALS)
disease........................................ 382, 407, 408
Analysis benchmarks ....................................................387
Ancestral recombination graph (ARG).............227, 228,
299, 304307, 315331
Anomaly zone...............................................................6, 9
Apes.................................................... 341, 349350, 518
Apoptosis ......................................................................475
Application programming interface (API)................ 490,
506, 533, 539
Arabidopsis thaliana ................................. 163, 470472,
475477, 481
Archaea .........................................30, 32, 33, 47, 56, 68,
69, 71, 72, 74, 75, 82, 89, 94, 100, 101, 194,
195, 198, 202, 211
Association mapping ........................................... 275290
ATP .................................................................................89
Autosome ............................................................174, 318
B
Balancing selection. See Selection
Baseline correction ............................ 393, 394, 397, 411
C
Caenorhabditis elegans ..................... 163, 173, 175, 470,
472, 509, 510
Calibrants............................................................. 388390
Call stack.............................................................. 513515
Causality ............................................ 289, 350, 480481
C/C++ ................................................................. 504507
cDNAs .............................114, 168, 373, 417, 471, 472,
476, 509
Cell
cycle ..........................................................................92
division............................................................... 82, 90
membrane...............................................................122
Cellulose .......................................................................474
Centroided data ..................................................394, 396
CgiHunter ....................................................................439
Chaperone ......................................................................92
Chimeric .............................................. 89, 167, 173, 174
ChIP-seq............................................ 142, 346, 351356
Chromatid ....................................................................166
Chromatography .................................................384, 407
Chromosome
rearrangement ........................................................477
Cis ............................................................... 147, 336, 472
Classification rule ......................................399, 400, 402,
403, 405, 407, 409, 412
Clique ..................................................................367, 369
Cloud computing......................531534, 539, 541543
Clustering ................................ 70, 72, 73, 97, 105, 188,
212, 213, 369, 370, 376, 377, 407, 410, 411,
538540, 542
Clusters of orthologous genes (COG(s))............. 48, 56,
70, 72, 92
Coalescence .................. 5, 8, 9, 14, 17, 18, 21, 42, 228,
299, 300, 305, 328, 329
Coalescent model ........... 820, 228, 295298, 327329
Coalitions......................................... 8895, 97, 101103
Coarse grained.....................................................531, 533
Codon
translation .................................... 505, 508, 509, 513
usage bias .............................................. 130, 132, 248
Co(-)evolution ....................................................... 89, 91,
94, 96, 103, 121, 248, 253, 255, 256, 259,
264265, 468
Colombia (Col) ............................................................477
Command line................................... 257, 515, 536, 537
Common disease
common variant (CDCV)........... 276, 279, 280, 288
rare variant (CDRV) .....................................276, 279
Communicating sequential processes (CSP) ..............532
Communities ............................9, 13, 8789, 9193, 95,
96, 98, 102, 375, 478, 482, 483, 486488,
492, 494, 496, 497, 504, 505, 513
D
DAG. See Directed acyclic graph
Darwin ............................................. 3, 55, 82, 90, 96, 99
Darwin Core (DC) .............................................487, 488
Dating ......................................................... 164, 168, 181
Deamination ............................. 433, 443, 448, 450, 457
Debian
E
Ecosystems..................9193, 9597, 99, 102, 103, 535
Effective population size.................... 6, 22, 31, 42, 114,
135, 151, 228, 276, 277, 297, 300, 307, 311
Effector ................................................................474, 475
EM algorithm. See Expectation-maximization (EM)
algorithm
EMBOSS. See European Molecular Biology Open
Software Suite (EMBOSS)
Emergent evolutionary properties ................................92
Emission probability ....................................................232
Empirical bayes....................................................247, 251
codon model(s) .............................................120, 134
Encyclopedia of Life (EOL) ...............................482, 490
Endosymbioses ...............................................................89
Enhancer..................144, 146, 289, 350, 354, 356, 432
Environmental
factors.................................................... 142, 469, 481
Epigenetics ................................................289, 350, 352,
353, 355, 356, 431434, 442, 443, 447450,
457, 463, 464
Epigenomics .....................432435, 442, 444, 458, 464
Episodic selection. See Selection
Epistasis ............................................. 236, 252257, 365
Erlang ...........................................................................532
Error handling and exceptions ....................................506
Escape from adaptive conflict (EAC) model .....176, 178
Eucalyptus ......................................... 539, 540, 542, 543
Euchromatin.................................................................432
Eukaryotes .................................... 1113, 16, 30, 32, 89,
95, 100, 152, 170, 193195, 200, 202, 204,
205, 211, 213, 217
European Molecular Biology Open Software Suite
(EMBOSS)....................... 505, 510, 513, 514
Euryarchaeota.......................................................... 39, 74
Eutherians.....................................................................121
Evolutionary
biology ..............236, 471, 482, 483, 496, 504, 505,
508, 530
distance ........................................ 105, 116, 147, 355
expression QTL (eQTL).............471474, 476478,
480482
genetical genomics ............................... 469482, 534
models........................85, 92, 93, 99, 116, 118, 192,
239, 476, 477
F
False discovery rate (FDR) .......................122, 309, 343,
353, 354, 390, 479
False positive (error) ...........................................128, 481
Fine grained....................................... 211, 506, 531, 532
Fixation probability.................................... 130, 173, 175
Fixed effect models ......................................................123
Forest of Life (FOL) ............................................... 5376
Four-Gamete test ....................................... 225227, 235
Frameshift .....................................................................172
Fst ........................................................................... 13, 142
Functional
analysis ................................... 99, 415418, 420422
relationship .................................. 102, 365, 374, 480
Fusion ...................18, 19, 84, 167, 173, 176, 190, 203,
207, 208
G
Gag.............................................................. 253, 257, 259
Galaxy ........................................434439, 442, 462, 464
Gametes ...............................................................217, 373
GC-content ........................................................... 16, 130
Gene
accelerated ..................................................... 117118
cluster......................................................................118
comparison ........................ 13, 14, 20, 33, 122, 164,
337340, 349, 350, 376
conserved ......................................................... 71, 433
conversion................................... 116, 130, 164, 189,
218, 328
duplication..................................................11, 1314,
18, 161167, 171176, 180, 181, 200, 203,
208211, 372, 373, 472, 475
evolution...................................................................88
H
Haploid segregants ......................................................471
Haploimbalance ..................................................373, 374
HardyWeinberg model .....................................283, 284
Haskell ..........................................................................532
Heterochromatin .........................................................432
Hidden Markov model (HMM) ............................... 114,
118119, 147, 188, 213, 221, 223, 230, 232,
233, 287, 296, 304307, 530
Hidden paralogy...................................................... 41, 42
High performance computing (HPC)...............479, 531
Histone modification ....................... 352356, 432, 434,
442, 447449
Histones........................................................................448
HIV-1 ..............................237, 242, 248, 251, 253, 258,
259, 261, 264, 265
HMMER .............................................................506, 530
HOGENOM ....................................... 32, 39, 41, 46, 48
Homologous
pairs of chromosomes ............................................166
recombination (HR) .....................................166, 238
Homology (homologous) ......................3032, 38, 104,
166, 201, 210, 339, 473, 475, 477
Horizontal gene transfer (HGT) ......................9, 1113,
33, 42, 54, 6974, 169170, 419
Host-pathogen ................................. 248, 474, 476, 477,
480, 481
HTTP protocol ........................ 482, 488, 489, 497, 506
HudsonKreitmanAguade test (HKA) .....................133
HyPhy ..............................127, 239, 240, 242, 245, 246,
248, 250, 253, 255, 257, 258, 262265, 537
Hypothesis-driven ........................................................335
I
Identity by descent (IBD) ......................... 224, 253, 278
Illegitimate recombination ..........................................190
Illumina.............................281, 286, 434, 458, 459, 462
Incomplete lineage sorting ................................4, 6, 7, 9,
18, 19, 42, 54, 298, 300302, 305, 306, 312
Incongruence ...................7, 42, 71, 237, 238, 301, 302
Inconsistency score ................................................. 65, 76
Independence .................................... 116, 147, 480, 481
Inhibitors .............................................................122, 476
Initiation of DNA replication.............................165, 166
Innate immune system........................................242, 474
In-paralog .......................................................................38
Insertion of domains...........................................208, 209
Instantaneous rate matrix ................. 116, 129, 134, 303
Interacting genes..........................................................121
Interaction network
clustering ....................................... 97, 376, 377, 540
degree distribution...................... 192, 201, 377, 378
guilt-by-association ....................................... 369370
modularity ............................................ 105, 369370
robustness ...............................................................372
Interoperability........................................... 487, 492, 509
Inter-species differences........... 337, 341, 345, 354, 355
Interspersed repeats .....................................................190
Intrinsic information...........................................147, 531
Intron...............................131, 132, 143, 144, 149151,
168, 189, 190, 207, 208, 340
Inversion ..............................................................165, 167
Ion counter...................................................................392
Isochores.......................................................................130
J
Jaccard coefficient ................................................... 62, 63
Jackknife .................................................................. 17, 46
Java.................. 416, 504, 505, 507, 508, 511, 514, 532
Java Virtual Machine (JVM)...................... 507, 512, 514
Job scheduler................................................................477
JRI........................................................................505, 508
Junk DNA ....................................................................142
Jython ......................................................... 507, 512, 513
K
KEGG pathways ................................ 382, 418, 421, 422
L
Landsberg erecta (Ler).................................................477
Last universal common ancestor (LUCA)..................193
Lateral gene transfer (LGT). See Horizontal gene
transfer (HGT)
Latin square ..................................................................388
Leucine-rich-repeat (LRR) ..........................................475
Likelihood
composite (CL) ......................................................229
function .................................................. 47, 246, 264
ratio test (LRT) ........................... 117, 247, 250, 251
Lineage specific
gene duplications ...................................................173
tests .........................................................................471
Linkage ....................... 7, 104, 105, 218, 220, 288, 365,
470472, 479, 483
Linkage disequilibrium (LD) ................... 142, 225230,
261, 276, 280, 286, 287, 289, 323, 324,
470, 481
Linked data ................................................. 483, 485, 495
Long-branch-attraction (LBA)......................................54
Long interspersed nucleotide element-1 (LINE1) ....190
Long non-coding RNAs ..............................................171
Lower envelopes...........................................................393
LSID ........................................................... 482484, 497
M
Machine learning.............399, 407, 443, 444, 447, 449,
456457
Macro language............................................................507
Mahalanobis distance ...................................................400
Mandel bundle-of-lines................................................398
Mapping power ............................................................479
MapReduce..........................................................533, 542
Marginal trees......................................................324, 326
Marker ....................................... 13, 219, 223, 224, 281,
285, 286, 289, 290, 352, 355, 356, 418, 419,
432, 456, 457, 463, 470, 478, 479
Marker map ................................................ 470, 478, 479
Markov
chain.....................................254, 258, 307, 394, 531
Chain Monte Carlo (MCMC).....................9, 17, 18,
129, 130, 229, 254, 256258, 394
clustering ................................................................369
models................ 114, 118119, 188, 213, 287, 530
(see also Evolutionary, models)
Mass Spectral Library...................................................391
Mass spectrometry ..................................... 384, 394, 407
Mating system ..............................................................153
Maximum
estimate (see Maximum likelihood estimate (MLE))
estimator .....................................................239
likelihood (ML)............................ 39, 46, 47, 56, 69,
116, 118, 124, 237, 239, 250, 253, 256, 262,
309, 476, 479, 530
parsimony (see Parsimony)
Maximum likelihood estimate (MLE) ........................124
McDonaldKreitman test (MK)................ 133, 149, 154
Measurement equation .......................................396, 397
MEGAN .............................................................. 415428
Meiosis ...................................... 130, 219, 222, 316, 373
Meloidogyne hapla.......................................................474
Message passing interface (MPI) ......................239, 240,
257, 479, 531534, 537
Messenger RNA (mRNA) ...........................................168
Metabolic pathways.............................................104, 472
Metabolite QTL (mQTL) ......................... 471, 478, 480
Metabolites ............... 92, 381384, 386, 387, 389391,
393, 394, 398, 399, 401, 407410, 471
Metabolomics ...................................................... 381411
Metagenomics ..............................................................415
Methyl-DNA immunoprecipitation ............................471
Metropolis-Hastings algorithm...................................254
Microarray ........................................ 182, 192, 337, 338,
343345, 382, 470472, 474, 478, 505, 508
MicroRNAs ..................................................................171
Microsatellite ......................................................9, 16, 165
Minimal descriptor.............................324327, 329331
N
Natural population..................................... 167, 470, 479
Natural selection. See Selection
Nearly Universal Trees (NUTs) .......................55, 56, 70
Negating QTL (QTL in repulsion phase) ..................479
Nematode ...........................11, 143, 163, 170, 173, 474
Neofunctionalization .......................................... 175–178
Network
analyzer .......................................................... 375–378
of domain co-occurrence .......................................202
hubs ..............................................367–368, 372–374
inference ............................................... 473, 480–481
Neutrality test...................................................... 133–134
Next(-)generation sequencing (NGS) .....115, 165, 177,
182, 337, 350, 415
NGS. See Next(-)generation sequencing (NGS)
NHEJ. See Non-homologous end-joining (NHEJ)
NIST .............................................................................391
Non-coding .........................................................153, 171
Non-homologous end-joining (NHEJ) .....................166
Nonsynonymous mutation.............................14–16, 143
Nonsynonymous to synonymous rate ratio................135
Nonsynonymous to synonymous rates ratio.
See dN/dS
Normalization ............................................ 343–345, 357
NP-complete ..................................................................43
N-terminus ...................................................................211
Nucleoid ..................................................... 235, 475, 505
Nucleosome.........................................................289, 356
Nucleotide binding site leucine rich repeat domain
(NB-LRR) ...................................................475
NUTs. See Nearly Universal Trees (NUTs)
O
Olfactory receptor ........................................................175
Oligomer ........................................................................16
OpenPBS ......................................................................534
Open reading frame (ORF)....................... 163, 171, 172
Open source software ............. 257, 481, 505, 529, 533,
535, 542
OpenStack ...........................................................539, 542
Operon ........................................................... 86, 95, 101
Optimization ...................... 47, 118, 246, 310, 401–403
ORF. See Open reading frame (ORF)
Origins of DNA replication................................165, 166
Ortholog........................ 54, 56, 70, 121, 169, 179, 180
Overlapping reading frames ............................... 128–129
OWL. See Web Ontology Language (OWL)
P
PAML ........................................................127, 134, 135,
245, 476–477, 480, 505, 530, 533, 534, 536,
537, 539
Parallelization ............................309, 530–534, 541, 542
Paralog ...................14, 57, 70, 172, 180, 372–374, 377
Paralogy .............................................................41, 42, 54
Parent............................... 165, 166, 179–181, 219–224,
256, 261, 264, 294, 315–318, 320, 322, 323,
330, 474, 477, 478, 482
Parrot native compiler interface ..................................507
Parsimony ..................................164, 200, 205, 207–210
Pathogen................................ 13, 88, 91, 122, 130, 131,
165, 236, 242, 248, 474–477, 480–482, 537
Pattern
discovery ........................................................128, 218
Pedigree analysis.................................................. 218–224
Perl ..................................................... 212, 504, 505, 508
Permutation
strategy....................................................................479
Pfam ......................................... 188, 189, 193, 197, 198,
200–202, 204, 209, 211–213, 514
Phenotype............................ 48, 86, 275, 283, 284, 470,
471, 473, 477, 481, 482
Phybase ...........................................................................22
Phylogenetic hidden Markov models
(phylo-HMMs) ........114, 118–119, 127, 129
Phylogenetic
footprinting ............................................................143
network.....................................................................87
outliers ............................................................... 13–16
shadowing...............................................................143
tree .............................18, 35, 55, 59, 67, 69, 82, 93,
118, 164, 252, 477
phylo-HMMs. See Phylogenetic hidden Markov models
(phylo-HMMs)
Phytophthora infestans .......................................474, 537
Protein (continued)
domain .......................... 30, 167, 187–213, 371, 372
neighbor pair ........................................ 198, 204, 206
order ..............................................................189, 199
promiscuity/versatility .................................. 203–206
QTL (pQTL)........................................ 471, 478, 480
sequence ....................................... 11, 104, 116, 120,
123, 142, 143, 145, 179–181, 188, 252, 253,
259, 262, 265, 371, 420, 487
structure................................................ 114, 116, 208
triplets ............................................................120, 189
Protein-coding gene ..........................................113, 117,
120, 128, 132134, 152, 171
Protein-protein interaction(s) .................... 99, 100, 104,
123, 200, 363–378, 472, 537
Pruning ......................................................... 57, 129, 406
Pseudogene .................................................. 31, 172, 191
Pseudogenization .................................................. 31, 175
Punctuated equilibrium ...............................................192
Purifying selection. See Selection
Python ..............................438, 464, 504, 505, 507–514
Q
Quality control ........ 281, 283–284, 383, 385–391, 394
Quantitative
phenotypes.................. 470, 471, 473, 477, 479–481
trait loci (QTL) ................... 470–474, 476–481, 508
R
R (statistical language).................................................479
Random effect (RE) models........................................123
Random forest.................. 399, 401, 405–408, 410–412
Random variable
continuous ................................................................37
discrete......................................................................37
Rate
heterogeneity.................................................118, 239
shift ................................................................114, 118
RDF. See Resource Description Framework (RDF)
Reactivity .............................................................480, 481
Rearrangement ................................. 166, 167, 241, 308,
433, 477
Reasoning ...............................................6, 367, 480, 532
Reassortment................................................................236
Recessive lethal alleles ..................................................470
Recombinant inbred line (RIL) ..................................470
Recombination ...............................8, 86, 116, 147, 166,
189, 217–265, 287, 294, 316, 365, 470
Redundancy............................................16, 43, 152, 316
Regulation .............. 131, 141, 142, 196, 200, 335–337,
341, 342, 346–356, 432–434, 443, 448, 449,
457, 463, 471, 472, 480
Regulator ............................................................. 471–473
Regulatory
element ..........................................................152, 168
genomic regions ............................................352, 353
mechanisms ..................................336, 337, 350–356
Relative rate test ...........................................................179
RELL. See Resampling of estimated log-likelihoods
(RELL)
Remote procedure call (RPC) ............................ 503–515
Repeat ......................... 57, 61, 143, 165, 166, 168, 190,
203, 206, 211, 225, 295, 308, 309, 322, 331,
338, 340, 376, 410, 437, 439, 441, 444, 450,
451, 453, 456, 465, 475
Replication................................. 92, 165, 166, 238, 248,
249, 285–286, 288, 289, 478
Representational State Transfer (REST)...........489–490,
492, 506
Resampling of estimated log-likelihoods (RELL)......241
Residual variance ..........................................................479
Resolution schema .............................................481, 482
Resource Description Framework (RDF)................. 483,
485, 487, 494, 495, 497, 499, 506
Restriction fragment length polymorphism
(RFLP) ................................................. 23, 470
Retrogenes........................167–169, 173, 174, 176, 180
Retroposition..................................... 167–169, 171, 173
Retrotransposons ................................................169, 189
RFLP. See Restriction fragment length polymorphism
(RFLP)
R-gene .................................................................477, 482
Ribosomal RNA (rRNA) .................. 416, 418, 426, 427
Ribosome............................................................... 72, 369
RIL. See Recombinant inbred line (RIL)
RNA-seq ..................................................... 142, 345, 478
RPM1 ...........................................................................475
RPS2 .............................................................................475
RPy............................................ 505, 508, 510, 511, 514
rq ........................................................ 477, 536–540, 542
rRNA. See Ribosomal RNA (rRNA)
Rserve ...............................505, 506, 508, 510, 511, 514
RSOAP ................................................................ 508–510
RSPerl ...........................................................................508
RSRuby .........................................................................508
S
Saccharomyces cerevisiae.................... 163, 164, 193–195,
202, 371, 471, 472
16S analysis..........................................................416, 427
Scaffolding ...........................................................306, 491
Scala ................................................... 507, 512, 513, 532
S. cerevisiae. See Saccharomyces cerevisiae
SEED subsystem ..........................................................425
Segmental duplication ...................... 145, 164–166, 477
Segment alignment ......................................................244
Segregating sites.................................. 19, 133, 225, 235
T
Tajima's D ....................................................................133
TAMBIS. See Transparent Access to Multiple
Bioinformatics Services (TAMBIS)
Tandem affinity purification (TAP)...................363–365,
368, 369
TAP. See Tandem affinity purification (TAP)
TAPIR. See TDWG Access Protocol for Information
Retrieval (TAPIR)
Target gene................................................. 464, 470, 472
Taverna .............................................. 493–495, 498, 500
Taxonomic
analysis .................................................. 415–420, 422
Taxonomic database working group (TDWG) ........ 482,
483, 487, 492, 497
TDWG. See Taxonomic database working group
(TDWG)
TDWG Access Protocol for Information Retrieval
(TAPIR) ......................................................492
Telomeres .....................................................................130
TORQUE................................................... 536, 538–543
Trade-off............................................ 385, 390, 504, 507
Training sample ................................. 399–407, 409, 410
Trait ................... 38, 84, 85, 90, 91, 98, 141, 275, 276,
348, 349, 442, 455–457, 469–474, 479–481,
487, 488
Trans ....................................................................147, 475
trans-band ...........................................................471, 472
Transcript.........................131, 167, 169, 338–340, 344,
345, 471, 474, 480
Transcription
factor .................................. 152, 203, 209, 336, 346,
349–352, 354, 364, 432, 474
factor binding sites ..............145, 146, 149, 351, 355
start sites (TSSs) .....................................................439
Transcriptome assembly...............................................340
Transition probability .............. 116, 119, 222, 305, 307
Transition/transversion (rate).....................................264
Translation.................... 70, 71, 92, 101, 131, 142, 172,
193, 265, 432, 505, 508–514, 536
Translocation .......................................................167, 201
Transparent Access to Multiple Bioinformatics Services
(TAMBIS) .......................................... 492–493
Transposition.......................................................173, 450
Tree
of life (TOL)............................. 3, 11, 39, 55, 70–72,
74, 76, 82, 84–87, 89, 93–95, 97, 99, 103,
104, 205, 207
reconciliation ...............................................41, 43, 54
rooted ............................................ 6, 39, 44, 54, 421
search ........................................................................96
topology..................... 4–7, 9, 10, 17, 21, 40, 42–47,
54, 55, 67, 76, 240, 241, 244
ultrametric ................................................................66
unrooted ...................................... 54, 55, 66, 67, 125
Triple...............................6, 17, 485–487, 495, 498, 499
tRNA............................................................... 71, 72, 131
Two color cDNA microarray..............................471, 472
U
Unequal crossing over .................................................373
Uniform Resource Identifier (URI) ......... 481–485, 495
Uniform Resource Locator (URL) .......... 213, 481–484,
488–490, 497, 499, 510
Uniparental inheritance ...................................... 315–316
UniProt.............................................. 188, 211, 213, 514
Untranslated regions....................................................147
V
Vertebrates.........................13, 115, 117, 145, 168, 196,
207, 242, 354, 432
VirtualBox ................................ 529, 536, 538, 540, 542
Virtualization..................................... 532–536, 540, 542
Virtual machine (VM) .............507, 508, 514, 534–536,
539–543
Virus..................................... 11, 16, 30, 33, 82, 88, 128,
130, 131, 226, 235–240, 242, 243, 247, 258,
259, 364, 420
VM. See Virtual machine (VM)
VMWare........................................................................536
W
Wald confidence interval,
Warping ...............................................................392, 394
Wassilewskijai (Ws).......................................................477
Web Ontology Language (OWL) .... 485–488, 494, 495
Web-services ....................434, 464, 489, 491–496, 498,
506, 514, 540
Web Services Description Language (WSDL) ......... 501,
503, 504
Weight array matrix,
WGD. See Whole genome duplication (WGD)
Whole genome duplication (WGD) .................162–164,
373, 374, 377, 378
Wolbachia............................................................... 11, 170
WormBase............................................................509, 510
Wright-Fisher population ................. 317, 318, 322, 328
Wright-Fisher model ......... 295, 316–318, 326, 328, 329
X
XEN ............................................................ 536, 540, 542
Xenologs .................................................35, 38, 169, 194
XML......................... 487, 491–495, 497, 506, 511, 514
xQTL ..................................................471–472, 474–482
Y
Yeast ........................100, 147, 150, 153, 163, 164, 365,
371, 377, 378, 389, 471
Yeast-2-Hybrid (Y2H) ..............363–365, 367, 368, 372
Z
Zinc (Zn) finger protein ...............................................234