
Bioinformatics/Computational Tools for NGS Data Analysis

An Overview

Dr. S. Balaji
Bauplechain Technologies Private Limited
www.bauplechain.com
Agenda
• Next Generation Sequencing (NGS)
• Second Generation Sequencing Platforms
• Third Generation Sequencing Platforms
• NGS Bioinformatics
• Tools for Primary Analysis
• Tools for Secondary Analysis
• Tools for Tertiary Analysis
• NGS Pitfalls
• Concluding Remarks
2
Genetics and Medical Practice
• Genetics is extremely important to medical practice.
• Provides a definitive diagnosis for many clinically heterogeneous diseases;
• Enables accurate disease prognosis;
• Provides guidance towards the selection of the best possible options of care for the patients.
• Its current potential derives from the capacity to interrogate the human genome at different levels, from chromosomal to single-base alterations.
• Pioneering work on DNA sequencing enabled several advances, resulting in Sanger sequencing and later the automated DNA sequencer, which facilitated human genome sequencing in 2001.
• R&D in nanotechnology and informatics contributed to the new generation of sequencing methods.
3
DNA Sequencing Timeline

NG—next generation
PCR—polymerase chain reaction
SMS—single molecule sequencing
SeqLL—sequence the lower limit
4
Next Generation Sequencing (NGS)
• New approaches targeted to complement and eventually replace Sanger sequencing.
• This technology is collectively referred to as Next-Generation Sequencing (NGS) or Massively Parallel Sequencing (MPS).
• An umbrella term designating a wide diversity of approaches.
• Through this technology, it is possible to generate massive amounts of data per instrument run, in a faster and cost-effective way.
• It is now possible to sequence in parallel huge amounts of data from several genes or even the entire genome.

5
NGS Market
• The global market is projected to reach 21.62 billion US dollars by 2025, growing at about 20% from 2017 to 2025.
• Several brands are presently on the NGS market – the top sequencing companies:
• Illumina
• Ion Torrent (Thermo Fisher Scientific)
• BGI Genomics
• PacBio
• Oxford Nanopore Technologies
• All provide different strategies towards the same problem – massification of sequencing data.
• Second-generation sequencing is based on massive parallel and clonal amplification of molecules (polymerase chain reaction (PCR)).
• Third-generation sequencing relies on single-molecule sequencing without a prior clonal amplification step.

6
NGS Library
• A library is a collection of DNA/RNA fragments that represents either the entire genome/transcriptome or a target region.
• Each NGS platform has its specificities, but, in simple terms, the preparation of an NGS library starts with the fragmentation of the starting material, followed by connecting sequence adaptors to the fragments to allow their enrichment.
• A good library should have great sensitivity and specificity.
• All fragments of interest should be equally represented in the library and should not contain random errors.
• As genomic regions are not equally prone to be sequenced, the construction of a sensitive and specific library is challenging.

7
NGS Library Preparation – Step 1
• The first step to prepare libraries in most NGS workflows is the fragmentation of nucleic acid.
• Fragmentation can be done either by physical or enzymatic methods.
• Physical methods include acoustic shearing, sonication and hydrodynamic shearing.
• Enzymatic methods include digestion by DNase or Fragmentase.
• Both methods give similar yields.
• The choice between physical and enzymatic methods depends only on experimental design or external factors, such as lab facilities.
8
NGS Library Preparation – Step 2
• Once the starting DNA has been fragmented, adaptors are connected to these fragments.
• The adaptors are introduced to create known beginnings and ends for the random sequences, allowing the sequencing process.
• An alternative strategy combines fragmentation and adaptor ligation in a single step, thus making the process simpler, faster and requiring a reduced sample input.
• This process is known as tagmentation and is based on transposon-based technology.

9
NGS Library Preparation – Step 3
• Upon nucleic acid fragmentation, the fragments are selected according to the desired library size.
• Limited both by the type of NGS instrument and by the specific sequencing application.
• Short-read sequencers, such as Illumina and Ion Torrent, present best results when DNA libraries contain shorter fragments of similar sizes.
• Illumina fragments are longer than those in Ion Torrent.
• In Illumina, the fragments can go up to 1500 bases in length.
• In Ion Torrent, the fragments can go up to 400 bases in length.
• Long-read sequencers, like the PacBio RS II, produce ultra-long reads by fully sequencing a DNA fragment.
• The optimal library size is also limited by the sequencing application.
• For whole-genome sequencing, longer fragments are preferable, while for RNA sequencing and exome sequencing smaller fragments are feasible, since most human exons are under 200 base pairs in length.
10
NGS Library Preparation – Step 4 (Enrichment Step)
• In the enrichment step, the amount of target material is increased in a library to be sequenced.
• When just a part of the genome needs to be investigated, both for research and clinical applications, the libraries are known as targeted libraries.
• Two methods are commonly used for such targeted approaches:
• Capture hybridization-based sequencing
• In the hybrid capture method, after the fragmentation step, the fragmented molecules are hybridized specifically to DNA fragments complementary to the targeted regions of interest.
• This can be done by different methods, such as microarray technology or biotinylated oligonucleotide probes, which aim to physically capture and isolate the sequences of interest.
• Two well-known examples of commercial library preparation solutions based on hybrid capture methods are SureSelect (Agilent Technologies) and SeqCap (Roche).
• Amplicon-based sequencing
• HaloPlex (Agilent Technologies) and AmpliSeq (Ion Torrent) are two examples of commercial library preparation solutions based on amplicon-based strategies.
11
Second-Generation Sequencing Platforms
• Second-generation platforms belong to the group of cyclic-array sequencing technologies.
• Basic workflow for second-generation platforms includes:
• Preparation of libraries (prepared from DNA/RNA samples),
• Amplification of libraries,
• Clonal expansion,
• Sequencing,
• Analysis.
• The two most widely known sequencing companies of second-generation sequencing platforms are:
• Illumina
• Ion Torrent
12
Second-Generation Sequencing Platforms (contd.)
• Illumina commercializes several integrated systems for the analysis of genetic variation and biological function that can be applied in multiple biological systems from agriculture to medicine.
• The process of Illumina sequencing is based on the sequencing-by-synthesis (SBS) concept.
• Capture, on a solid surface, of individual molecules, followed by bridge PCR that allows their amplification into small clusters of identical molecules.
• The DNA/cDNA is fragmented, and adapters are added to both ends of the fragments.
• Next, each fragment is attached to the surface of the flow cell by means of oligos on the surface that have a nucleotide sequence complementary to the adapters, allowing hybridization and the subsequent bridge amplification, forming a double-strand bridge.
• Next, it is denatured to give single-stranded templates, which are attached to the substrate.
13
Second-Generation Sequencing Platforms (contd.)
• The process is continuously repeated to generate several millions of dense clusters of double-stranded DNA in each channel of the flow cell.
• Sequencing can then occur by the addition to the template (on the flow cell) of a single labelled complementary deoxynucleotide triphosphate (dNTP), which serves as a “reversible terminator”.
• The fluorescent dye is identified through laser excitation and imaging, and subsequently it is enzymatically cleaved to allow the next round of incorporation.
• In Illumina, the raw data are the signal intensity measurements detected during each cycle.
• This technology is considered highly accurate and robust but has some drawbacks.
• For instance, during the sequencing process, some dyes may lose activity and/or there may be partial overlap between the emission spectra of the fluorophores, which limits base calling on the Illumina platform.
14
Second-Generation Sequencing Platforms (contd.)
• Ion Torrent, the major competitor of Illumina, employs a distinct sequencing concept, named semiconductor sequencing.
• The sequencing method in Ion Torrent is based on pH changes caused by the release of hydrogen ions (H+) during the polymerization of DNA.
• In Ion Torrent, instead of being attached to the surface of a flow cell, the fragmented molecules are bound to the surface of beads and amplified by emulsion PCR, generating beads with a clonally amplified target molecule (one molecule/bead).
• Each bead is then dispersed into a micro-well on the semiconductor sensor array chip, named the complementary metal-oxide-semiconductor (CMOS) chip.
15
Second-Generation Sequencing Platforms (contd.)
• Every time the polymerase incorporates the complementary nucleotide into the growing chain, a proton is released, causing pH changes in the solution, which are detected by an ion sensor incorporated on the chip.
• These pH changes are converted into voltage signals, which are then used in base calling.
• There are also limitations in this technology.
• The major source of sequencing errors is the occurrence of homopolymeric template stretches.
• During the sequencing, multiple incorporations of the same base on each strand will occur, generating the release of a higher concentration of H+ in a single flow.
• The increased shift in pH generates a greater incorporation signal, indicating to the system that more than one nucleotide was incorporated.
• For longer homopolymers the system is not effective, making their quantification inaccurate.
16
Third-Generation Sequencing Platforms
• The 3rd generation NGS technology brought the possibility of circumventing common and transversal limitations of PCR-based methods:
• Nucleotide misincorporation by a polymerase,
• Chimera formation,
• Allelic drop-outs (preferential amplification of one allele) causing an artificial homozygosity call.
• The first commercial 3rd generation sequencer was the Helicos Genetic Analysis System platform.
• In contrast to the second-generation sequencers, here the DNA was simply sheared, tailed with poly-A, and hybridized to a flow cell surface containing oligo-T nucleotides.
• The sequencing reaction occurs by the incorporation of the labelled nucleotide, which is captured by a camera.
• With this technology, every strand is uniquely and independently sequenced.
• However, it was relatively slow and expensive and did not stay long in the market.
17
Third-Generation Sequencing Platforms (contd.)
• In 2011, Pacific Biosciences introduced the concept of single-molecule real-time (SMRT) sequencing, with the PacBio RS II sequencer.
• This technology enables the sequencing of long reads (with average read lengths up to 30 kb).
• Individual DNA polymerases are attached to zero-mode waveguide (ZMW) wells, which are nanoholes where a single molecule of the DNA polymerase enzyme can be directly placed.
• A single DNA molecule is used as a template for polymerase incorporation of fluorescently labelled nucleotides.
• Each base has a different fluorescent dye, thereby emitting a signal out of the ZMW.
• A detector reads the fluorescent signal and, based on the colour of the detected signal, identifies the incorporated base.
• The fluorescent tag is then cleaved off by the polymerase, allowing the next base to be added.
• A closed, circular single-strand DNA (ssDNA) is used as template (called a SMRTbell) and can be sequenced multiple times to provide higher accuracy.
18
Third-Generation Sequencing Platforms (contd.)
• In addition, this technology:
• Enables de novo assembly and direct detection of haplotypes, with high consensus accuracy.
• Allows epigenetic characterization (direct detection of DNA base modifications at one-base resolution).
• The first sequencers using the SMRT technology faced the drawbacks of limited throughput, higher costs and higher error rates.
• However, significant improvements have been made to overcome these limitations.
• More recently, PacBio launched the Sequel II System, which claims to reduce project costs and timelines with highly accurate individual long reads (HiFi reads, up to 175 kb).
• Using a PacBio System, researchers demonstrated a successful application of long-read genome sequencing to identify a pathogenic variant in a patient with Mendelian disease.
• Suggests that this technology has significant potential for the identification of disease-causing structural variation.
19
Third-Generation Sequencing Platforms (contd.)
• A second approach to single-molecule sequencing, named MinION, was commercialized by Oxford Nanopore Technologies and made commercially available in 2015.
• This sequencer does not rely on SBS but instead on the electrical changes in current as each nucleotide (A, C, T and G) passes through the nanopore.
• Nanopore sequencing uses electrophoresis to transport an unknown sample through a small opening; an ionic current passes through the nanopores, and the current changes as the bases pass through the pore in different combinations.
• This information allows the identification of each molecule and the performance of the sequencing.
• In May 2017, Oxford Nanopore Technologies launched the GridION Mk1, a flexible benchtop sequencing and analysis device that offers real-time, long-read, high-fidelity DNA and RNA sequencing.
• It was designed to allow up to five experiments to run simultaneously or individually, with simple library preparation, enabling the generation of up to 150 Gb of data during a run.
20
Third-Generation Sequencing Platforms (contd.)
• New advances were launched with the PromethION 48 system, which offers 48 flow cells; each flow cell allows up to 3000 nanopores to sequence simultaneously, which can deliver yields of up to 7.6 Tb in 72 hours.
• Longer reads are of utmost importance to unveil repetitive elements and complex sequences, such as transposable elements, segmental duplications and telomeric/centromeric regions, which are difficult to address with short reads.
• This technology has already permitted the identification, annotation and characterization of tens of thousands of structural variants (SVs) from a human genome.
• The accuracy of nanopore sequencing is not yet comparable with that of short-read sequencing (for example, Illumina platforms claim 99.9% accuracy).
• Updates are constant, and ongoing developments aim to expand the range of genomes and further improve the accuracy of the technology.
21
Third-Generation Sequencing Platforms (contd.)
• 10X Genomics was founded in 2012 and offers innovative solutions from single-cell analysis to complex SV and copy number variant analysis.
• In 2016, 10X Genomics launched the Chromium instrument, which includes the gel beads in emulsion (GEMs) technology.
• In this technology, each gel bead is infused with millions of unique oligonucleotide sequences and mixed with a sample (which could be high molecular weight (HMW) DNA, individual cells or nuclei).
• Then, gel beads with the samples are added to an oil-surfactant solution to create gel beads in emulsion (GEMs).
• GEMs act as individual reaction vesicles in which the gel beads are dissolved and the sample is barcoded in order to create barcoded short-read sequences.
• The advantage of GEMs technology is that it reduces the time, the amount of starting material and costs.
22
Third-Generation Sequencing Platforms (contd.)
• For structural variant analysis, the short-read libraries are computationally reconstructed to produce megabase-scale haplotype genome sequences using small amounts of input DNA.
• This technology:
• allows linked-read phasing of SVs that distinguishes true SVs from false predictions,
• has the potential to be applied to de novo genome assembly,
• can remap difficult regions of the genome,
• can detect rare alleles and
• can elucidate complex structural rearrangements.
• The Chromium system also offers single-cell genome and transcriptional profiling, immune profiling and analysis of chromatin accessibility at single-cell resolution, with low error rates and high throughput.
• Opens up exciting new applications, especially for the development of new techniques for epigenetics research, de novo genome assembly and long sequencing reads.
23
NGS Bioinformatics
• Sequencing platforms are getting more efficient and productive.
• It is now possible to completely sequence the human genome within days at a relatively affordable price.
• PromethION promises the delivery of human genomes for less than $1000.
• The amount of data generated demands computational and bioinformatics skills to manage, analyse and interpret the huge amount of NGS data.
• Considerable development in NGS (bio)informatics is taking place, enabled largely by increasing computational capacities (hardware), as well as algorithms and applications (software) to assist all the required steps.
• From the raw data processing to more detailed data analysis and interpretation of variants in a clinical context.
• NGS bioinformatics is subdivided into primary, secondary and tertiary analysis.
• The overall goal of each analysis is basically the same regardless of the NGS platform.
• However, each platform has its own particularities and specificities.
• For simplicity, we focus on the two main commercial 2nd generation platforms in the rest of the discussion.
• Illumina
• Ion Torrent
24
NGS Bioinformatics Workflow
25
NGS Bioinformatics Workflow (contd.)
• NGS bioinformatics is subdivided into:
• Primary (blue),
• Secondary (orange) and
• Tertiary (green) analysis.
• Primary data analysis consists of the detection and
analysis of raw data.
• In the secondary analysis, the reads are aligned against
the reference human genome (or de novo assembled)
and the calling is performed.
• Tertiary analysis includes the variant annotation, variant
filtering, prioritization, data visualization and reporting.
CNV—copy number variation
ROH—runs of homozygosity
VCF—variant calling format.
26
Primary Analysis
• The primary data analysis consists of:
• detection and analysis of raw data (signal analysis),
• targeting the generation of legible sequencing reads (base-calling) and
• scoring base quality.
• Typical outputs from this primary analysis are:
• FASTQ file (Illumina) and
• unmapped binary alignment map (uBAM) file (Ion Torrent).
27
Primary Analysis – Illumina
• The principle for signal detection relies on fluorescence.
• Therefore, the base-calling is apparently much simpler, and is made directly from the fluorescent signal intensity measurements resulting from the incorporated nucleotides during each cycle.
• Illumina’s SBS technology delivers the highest percentage of error-free reads.
• The latest versions of Illumina’s chemistry have been reoptimized to enable accurate base-calling in difficult genomic regions.
• The dNTPs have been chemically modified to contain a reversible blocking group that acts as a temporary terminator for DNA polymerization.
• After each dNTP incorporation, the image is processed to identify the corresponding base, and the blocking group is then enzymatically cleaved off to allow incorporation of the next one.
• A single flow cell often contains billions of DNA clusters tightly and randomly packed into a very small area.
• Such physical proximity can lead to crosstalk events between neighbouring DNA clusters.
28
Primary Analysis – Illumina (contd.)
• As fluorophores attached to each base produce light emissions, there can be some degree of interference between the nucleotide signals, which can overlap with the optimal emissions of the fluorophores of the surrounding clusters.
• Although the base-calling is simpler than in Ion Torrent, the image processing step is quite complex.
• The overall process requires aligning each image to the template of cluster positions on the flow cell, image extraction to assign an intensity value for each DNA cluster, followed by intensity correction.
• Besides this crosstalk correction, other problematic aspects occur during the sequencing process and influence the base-calling process:
• phasing (failures in nucleotide incorporation),
• fading (or signal decay) and
• T accumulation (thymine fluorophores are not always efficiently removed after each iteration, causing a build-up of the signal along the sequencing run).
• Over many cycles, these errors accumulate and decrease the overall signal-to-noise ratio per single cluster, causing a decrease in quality towards the ends of the reads.
• Some of the initial base-callers for the Illumina platform were Alta-Cyclic and Bustard.
• Currently, there are multiple other base-callers differing in the statistical and computational methodologies used to infer the correct base.
• Despite this variability, the most widely used base-caller is Bustard, and several base-calling algorithms were built using Bustard as the starting point.
29
Widely used base-caller software for the Illumina platform

INT – intermediate executable code for old platforms
CIF – cluster intensity files for recent platforms
30
Primary Analysis – Illumina (contd.)
• The Bustard algorithm is based on the conversion of fluorescence signals into actual sequence data.
• The intensities of the four channels for every cluster in each cycle are taken, which allows the determination of the concentration of each base (a minimal crosstalk-correction sketch is given after this slide).
• The Bustard algorithm is based on a parametric model and applies a Markov algorithm to determine a transition matrix modelling the probability of phasing (no new base synthesized), prephasing (two new bases synthesized) and normal incorporation.
• The Bustard algorithm assumes a crosstalk matrix that is constant for a given sequencing run and that phasing affects all nucleotides in the same way.
• Aiming to improve performance and decrease error rates, several base-callers have been developed.
• There is no evidence that a given base-caller is better than another.
• Comparison of the performance of different base-callers regarding alignment rates, error rates and running time shows that:
• AYB presents the lowest error rate, BlindCall is the fastest, while BayesCall has the best alignment rate.
• BayesCall, freeIbis, Ibis, naiveBayesCall, Softy FB and Softy SOVA did not show significant differences among each other, but all showed improvements in error rates compared to the standard Bustard.
• 3Dec is a recently developed base-caller for Illumina sequencing platforms, which claims to reduce base-calling errors by 44–69% compared to the previous ones.
31
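As a rough illustration of the intensity-to-base step, the snippet below corrects a four-channel intensity vector with a hypothetical crosstalk matrix and picks the strongest corrected channel. This is a minimal sketch, not the Bustard implementation; the matrix values are invented and a real matrix is estimated per run.

```python
import numpy as np

# Hypothetical 4x4 crosstalk matrix (channel order A, C, G, T); values are
# illustrative only. A real matrix is estimated from the sequencing run itself.
crosstalk = np.array([
    [1.00, 0.35, 0.02, 0.01],
    [0.30, 1.00, 0.01, 0.02],
    [0.02, 0.01, 1.00, 0.25],
    [0.01, 0.02, 0.30, 1.00],
])

def call_base(observed_intensities):
    """Correct channel crosstalk and return the most likely base.

    observed_intensities: length-4 vector of raw channel intensities
    for one cluster in one cycle (order A, C, G, T).
    """
    # Invert the crosstalk model to estimate per-base concentrations.
    concentrations = np.linalg.solve(crosstalk, observed_intensities)
    return "ACGT"[int(np.argmax(concentrations))]

print(call_base(np.array([0.9, 0.4, 0.05, 0.03])))  # strongest corrected channel: A
```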
Primary Analysis – Ion Torrent
• Performed in the Ion Torrent Suite Software.
• Starts with signal processing, in which the signal of nucleotide incorporation is detected by the sensor at the bottom of the chip well, converted to voltage and transferred from the sequencer to the server as raw voltage data, named a DAT file.
• For each nucleotide flow, one acquisition file is generated that contains the raw signal measurement in each chip well for the given nucleotide flow.
• During the analysis pipeline, these raw signal measurements are converted into incorporation measures, named a WELLS file.
• The base-calling is the final step of primary analysis and is performed by a base-caller module.
• The objective is to determine the most likely base sequence, the best match for the incorporation signal stored in a WELLS file.
• The mathematical models behind this base-calling are complex and comprise three sub-steps:
• key-sequence based normalization,
• iterative/adaptive normalization and
• phase correction.
32
Primary Analysis Workflow in Ion Torrent
• The signal emitted from nucleotide incorporation is inspected by the sensor, which converts the raw voltage data into a DAT file.
• The DAT file serves as input to the server, which converts it into a WELLS file.
• The WELLS file is used as input to the Ion Torrent base-caller module, which gives a final BAM file, ready for the secondary analysis.
33
Primary Analysis – Ion Torrent (contd.)
• Such a procedure is required to address some of the errors occurring during the SBS process, namely phasing or signal droop (DR).
• DR is signal decay over each flow; it occurs because some of the template clones attached to a bead become terminated and there is no further nucleotide incorporation.
• These errors occur quite frequently and thus, as an initial step, Ion Torrent performs phase-correction and signal decay normalization algorithms.
• The three parameters that are involved in the signal generation are:
• the carry-forward (CF, that is, an incorrect nucleotide binding),
• incomplete extension (IE, e.g., the flown nucleotide did not attach to the correct position in the template) and
• droop (DR).
• CF and IE regulate the rate of out-of-phase polymerase build-up, while DR measures the DNA polymerase loss rate during a sequencing run.
• The chip area is divided into specific regions, and the wells are further divided into two groups (odd- and even-numbered), each of which receives its own, independent set of estimates.
• Then, some training wells are selected and used to find the optimal CF, IE and DR, using the Nelder–Mead optimization algorithm (a toy fitting sketch is given after this slide).
• The Nelder–Mead optimization algorithm uses a simplex, i.e., a generalized triangle in N dimensions, to search for an optimal solution in a multidimensional space.
34
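To make the CF/IE/DR estimation concrete, the sketch below fits three phase parameters to a simulated training-well signal with SciPy's Nelder–Mead optimizer. The forward model predict_signal() is a toy assumption for illustration only, not the Ion Torrent phase model.

```python
import numpy as np
from scipy.optimize import minimize

def predict_signal(params, n_flows=40):
    """Toy forward model: predicted incorporation signal for a training
    sequence given carry-forward (cf), incomplete extension (ie) and droop (dr)."""
    cf, ie, dr = params
    in_phase = 1.0
    signal = []
    for _ in range(n_flows):
        in_phase *= (1.0 - dr)      # polymerase loss (droop) per flow
        blur = cf + ie              # fraction of out-of-phase templates
        signal.append(in_phase * (1.0 - blur) + 0.5 * blur)
    return np.array(signal)

# "Observed" signal from training wells, simulated here with noise.
true_params = (0.01, 0.008, 0.001)
observed = predict_signal(true_params) + np.random.normal(0, 0.005, 40)

def loss(params):
    # Sum of squared differences between model prediction and observation.
    return np.sum((predict_signal(params) - observed) ** 2)

fit = minimize(loss, x0=[0.02, 0.02, 0.005], method="Nelder-Mead")
print("estimated CF, IE, DR:", fit.x)
```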
Primary Analysis – Ion Torrent (contd.)
• The CF, IE and DR measurements, as well as the normalized signals, are used by the Solver, which follows the branch-and-bound algorithm.
• A branch-and-bound algorithm consists of a systematic listing of all partial sequences forming a rooted tree, with the set of candidate solutions being placed as branches of this tree.
• The algorithm expands each branch by checking it against the optimal (theoretical) solution and continues until finding the solution closest to the optimal one.
• Before the first run of the Solver, key normalization occurs.
• The key normalization is based on the premise that the signal emitted during nucleotide incorporation is theoretically 1 and, for non-incorporation, the signal emitted is 0.
35
Primary Analysis – Ion Torrent (contd.)
• Key normalization scales the measurements by a constant factor, which is selected to bring the signal of the known expected 1-mers produced by sequencing through the key as close to unity as possible.
• Once the Solver has selected a base sequence, the predicted signal is used to create an adaptive normalization, which compares the theoretical signal with the real signal to subsequently generate an improved normalized signal with a reduced error rate.
• This enters a cycle of subsequent iterations (applying a function repeatedly, with the output from one being the input of the next) with the Solver for several rounds, allowing the normalized signal and the predicted signal to converge towards the best solution.
• This algorithm ends with a list of promising partial sequences, which it evaluates to determine the most likely base sequence in the well.
36
Primary Analysis – Quality Control: Read Filtering and Trimming
• Errors occur in both Ion Torrent and Illumina sequencing, and they are expressed in quality scores of the base-call using the Phred score, a logarithmic error probability (see the conversion sketch after this slide).
• A Phred score of 10 (Q10) refers to a base with a 1 in 10 probability of being incorrect, or an accuracy of 90.0%.
• Q30 means a 1 in 1000 probability of an incorrect base, or 99.9% accuracy.
• FASTQ files are important to the first quality control step, as they contain the raw sequencing reads, their identifiers and the quality values, with higher numbers indicative of higher quality.
• The quality of the raw sequence is critical for the overall success of NGS analysis.
• Several bioinformatics tools were developed to evaluate the quality of raw data:
• NGS QC toolkit
• QC-Chain
• FastQC
37
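The Phred relation is Q = −10·log10(P), so converting between quality scores and error probabilities is a one-liner. The sketch below just reproduces the Q10/Q30 figures quoted above.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score for an error probability: Q = -10*log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {(1 - p) * 100:.1f}%")
# Q10: 0.1 (90.0%), Q20: 0.01 (99.0%), Q30: 0.001 (99.9%)
```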
Primary Analysis – Quality Control: Read Filtering and Trimming (contd.)
• FastQC is the most popular.
• As output, FastQC gives a report containing well-structured and graphically illustrated information about different aspects of read quality.
• If the sequence reads have enough quality, the sequences are ready to be aligned/mapped against the reference genome.
• The Phred score is also useful for filtering and trimming sequences.
• Additional trimming is performed at the end of each read to remove adapter sequences.
• The trimming step, although it reduces the overall number and length of reads, raises quality to acceptable levels.
• Several tools were developed to perform trimming with Illumina data.
• Btrim, leeHom, AdapterRemoval and Trimmomatic
• The choice of tool is highly dependent on the dataset, the downstream analysis and the parameters used.
• In Ion Torrent, sequencing and data management are processed in the Torrent Suite Software, which has the Torrent Browser as a web interface.
• To further analyze the sequences, a demultiplexing process is required, which separates the sequencing reads into separate files according to the barcodes used for each sample.
• Most demultiplexing protocols are specific to NGS platform manufacturers.
38
Primary Analysis – Quality Control: Read Filtering and Trimming (contd.)
• Demultiplexing is followed by the adapter trimming step, whose function is to remove the remaining library adapter sequences from the ends of the reads.
• In most cases from the 3’ end, but this can depend on the library preparation.
• This step is necessary, since residual adapter sequences in the reads may interfere with mapping and assembly.
• Trimming is important in RNA-Seq, SNP identification and genome assembly procedures to increase the quality and reliability of the analysis and to optimize the execution time and computational resources needed.
• Several tools are used to perform the trimming with Illumina data.
• There is no best tool; instead, the choice is dependent on the downstream analysis and user-decided parameter-dependent trade-offs.
• In Ion Torrent, this is also done in the Torrent Suite Software.
39
Secondary Analysis
• Secondary analysis in the NGS data analysis pipeline includes the alignment of reads against the reference human genome (typically hg19 or hg38) and variant calling.
• To map sequencing reads, two different alternatives can be followed:
• read alignment, that is, the alignment of the sequenced fragments against a reference genome, or
• de novo assembly, which involves assembling a genome from scratch without the help of external data.
• The choice between approaches relies on the existence or not of a reference genome.
• For most NGS applications in clinical genetics, mapping against a reference sequence is the first choice.
• De novo assembly is still mostly confined to more specific projects, especially those targeting the correction of inaccuracies in the reference genome and the improved identification of SVs and other complex rearrangements.
40
Secondary Analysis (contd.)
• Sequence alignment is a classic problem addressed by bioinformatics.
• Sequencing reads from most NGS platforms are short; therefore, to sequence a genome, billions of DNA/RNA fragments are generated that must be assembled, like a puzzle.
• This represents a great computational challenge, especially when dealing with reads derived from repetitive elements.
• The algorithm must choose which repeat copy the read belongs to.
• In such a context, it is impossible to make high-confidence calls; the decision must be taken either by the user or by the software through a heuristic approach.
• Sequence alignment errors may emerge for multiple reasons.
• Errors in sequencing (caused by processes such as fading and signal droop).
• Discrepancies between the sequenced data and the reference genome also cause misalignment problems.
• A major difficulty is to establish a threshold between what is a real variation and what is a misalignment.
41
Secondary Analysis (contd.)
• The most widely accepted data input file format for an assembly is FASTQ, the typical output file produced by the various sequencing platforms.
• The typical outputs of read aligners are the binary alignment map (BAM) and sequence alignment map (SAM) formats (a minimal record-parsing sketch is given after this slide).
• Both formats include basically the same information, namely, read sequence, base quality scores, location of alignments, differences relative to the reference sequence and mapping quality scores (MAPQ).
• The main distinction between them is that the SAM format is a text file, created to be informatically easier to process with simple tools, while the BAM format provides a binary version of the same data.
• Alignments can be viewed using user-friendly and freely available software, such as the Integrative Genomics Viewer (IGV) or Genome Browse
(http://goldenhelix.com/products/GenomeBrowse/index.html).
42
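The eleven mandatory SAM columns can be pulled apart with a few lines of code. The record below is invented for illustration; a production parser would normally use a dedicated library rather than manual splitting.

```python
# One SAM alignment record (tab-separated); any fields after the 11 mandatory
# columns are optional tags. The record below is illustrative only.
sam_line = ("read_001\t0\tchr1\t10468\t60\t100M\t*\t0\t0\t"
            + "A" * 100 + "\t" + "I" * 100)

FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_record(line):
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, cols[:11]))
    record["FLAG"] = int(record["FLAG"])
    record["POS"] = int(record["POS"])      # 1-based leftmost mapping position
    record["MAPQ"] = int(record["MAPQ"])    # mapping quality (Phred-scaled)
    record["OPTIONAL"] = cols[11:]          # optional TAG:TYPE:VALUE fields
    return record

rec = parse_sam_record(sam_line)
print(rec["QNAME"], rec["RNAME"], rec["POS"], rec["MAPQ"], rec["CIGAR"])
```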
Secondary Analysis – Sequence Alignment
• The preferential assembly method when the reference genome is known is alignment against the reference genome.
• A mapping algorithm will try to find a position in the reference sequence that matches the read, tolerating a certain number of mismatches to allow subsequent variation detection.
• More than 60 tools for genome mapping have been developed and the number is increasing.
• As the NGS platforms are updated, more and more tools appear, in an ongoing evolving process.
• Commonly used methods to perform short-read alignments:
• Burrows–Wheeler Aligners (BWAs) and Bowtie are mostly used for Illumina;
• For Ion Torrent, the Torrent Mapping Alignment Program (TMAP) is the most common alignment software, as it was specifically optimized for this platform.

43
Secondary Analysis – Sequence Alignment (contd.)
• BWA uses the Burrows–Wheeler transform algorithm (a naive construction sketch is given after this slide).
• A data transformation algorithm that restructures data to be more compressible.
• Initially developed to prepare data for compression techniques such as bzip2.
• It is a fast and efficient aligner, performing very well for both short and long reads.
• Bowtie (now Bowtie 2) has the advantage of being faster than BWA for some types of alignment, but it may compromise quality, with a reduction in sensitivity and accuracy.
• Bowtie may fail to align some reads with valid mappings when configured for maximum speed.
• It is usually applied to align reads derived from RNA sequencing experiments.
44
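The Burrows–Wheeler transform underlying BWA can be illustrated with a naive rotation-sort construction. Real aligners build it via suffix arrays and pair it with FM-index structures, so the sketch below is conceptual only.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform via sorted cyclic rotations.

    Only meant to show why the transform groups similar characters
    together, which is what makes the data more compressible and
    searchable; production aligners never build it this way.
    """
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("GATTACA"))
```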
Secondary Analysis – Sequence Alignment (contd.)
• The simulation and evaluation suite Seal can be run to compare the most widely used mapping tools, such as Bowtie, BWA, mr- and mrsFAST, Novoalign, SHRiMP and SOAPv2.
• It compares different parameters, including sequencing error, indels and coverage.
• There is no perfect tool, as every tool has advantages and disadvantages.
• Each presents different specificities and performances that depend on the user’s choices regarding accuracy and coverage.
• The Ion Torrent tool, TMAP, comprises two stages:
i. initial mapping (using Smith–Waterman or Needleman–Wunsch algorithms; a minimal Smith–Waterman scoring sketch is given after this slide), allowing reads to be roughly aligned against the reference genome, and
ii. alignment refinement, in which particular positions of the read are realigned to the corresponding position in the reference.
• This alignment refinement is designed to compensate for specific systematic biases of the Ion Torrent sequencing process (such as homopolymer alignment and phasing errors with low indel scores).
45
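TMAP's initial mapping stage relies on classic dynamic-programming alignment. The sketch below is not TMAP's optimized code; it is a minimal Smith–Waterman score computation (no traceback) to illustrate the recurrence, with arbitrary example scores.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Minimal Smith-Waterman local alignment score.

    Fills the dynamic-programming matrix and returns the best local
    score; real implementations add traceback and heavy optimization.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))
```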
Secondary Analysis – De Novo Assembly
• De novo assembly circumvents:
• the bias from a reference genome,
• the limitations of inaccuracies in the reference genome,
• and it is the most effective approach to identify SVs and complex rearrangements, avoiding the loss of new sequences.
• Most of the algorithms for de novo assembly are based on an overlap-layout strategy, in which equal regions in the genome are identified and then overlapped by fragmented sequenced ends.
• This strategy has the limitation of being prone to incorrect alignment, especially with short read lengths.
• Mainly due to highly repetitive regions making it difficult to identify which region of the genome they belong to.
• De novo assembly is preferred when longer reads are available.
• De novo assembly is much slower and requires more computational resources compared to mapping assemblies.
46
De novo assemblers are based on graph theory and can be categorized into three major groups:
(1) the Overlap-Layout-Consensus (OLC) methods,
(2) the de Bruijn graph (DBG, also known as Eulerian) methods using some form of k-mer graph and
(3) the greedy graph algorithms that can use either OLC or DBG.
47
Secondary Analysis – De Novo Assembly (contd.)
• The greedy algorithm starts by adding a read to another identical one.
• This process is repeated until all assembly options for that fragment are exhausted, and it is repeated for all fragments.
• Each operation uses the next highest-scoring overlap to make the next join (a toy greedy merge sketch is given after this slide).
• To make the scoring, the algorithm measures the number of matching bases in the overlap.
• This algorithm is suitable for a small number of reads and smaller genomes.
• In contrast, the OLC method is optimized for low-coverage long reads.
• The OLC begins by identifying the overlaps between pairs of reads and builds a graph of the relationships between those reads, which is highly computationally demanding, especially with a large number of reads.
• As soon as a graph is constructed, a Hamiltonian path (a path in an undirected/directed graph that visits each vertex exactly once) is required, which gives rise to contig sequences.
• To end the process, further computational and manual inspection is required.
48
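A toy version of the greedy overlap strategy described above, assuming exact suffix–prefix matches and error-free reads; real assemblers score approximate overlaps and handle repeats far more carefully.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the highest-scoring overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap_length(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:   # no usable overlaps left; stop merging
            break
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG"]))
```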
Secondary Analysis – De Novo Assembly (contd.)
• The Eulerian (or DBG method) assembler is particularly suitable for representing the short-read overlap relationship.
• A graph is also constructed, where the nodes of the graph represent k-mers and the edges represent adjacent k-mers that overlap by k-1 bases (a minimal construction sketch is given after this slide).
• The graph size is determined by the genome size and repeat content of the sequenced sample.
• In principle, it will not be affected by the high redundancy of deep read coverage.
• With long-range sequencing and mapping technologies, the third-generation sequencers, newer bioinformatics tools are continuously being created.
• They leverage the unique features and overcome the limitations of these new sequencing and mapping platforms:
• high error rates dominated by false insertions and deletions, and
• sparse sequencing rather than true long reads.
49
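A minimal construction of the k-mer graph described above: each k-mer contributes one edge from its (k−1)-mer prefix to its (k−1)-mer suffix. Error correction and graph simplification, which real assemblers require, are omitted; the reads are invented.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k=4):
    """Build a simple de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # edge: prefix -> suffix
    return graph

reads = ["ATTAGACCTG", "TAGACCTGCC", "GACCTGCCGG"]
for node, neighbours in de_bruijn_graph(reads).items():
    print(node, "->", neighbours)
```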
Secondary Analysis – Post-Alignment Processing
• Post-alignment processing is recommended prior to performing the variant call.
• Its objective is to increase variant call accuracy and the quality of the downstream process, by reducing base-call and alignment artifacts.
• In the Ion Torrent platform, this is included in the TMAP software, whereas in Illumina other tools are required.
• In general terms, it consists of:
• filtering (removal) of duplicate reads,
• intensive local realignment (mostly near INDELs) and
• base quality score recalibration.

50
Secondary Analysis – Post-Alignment Processing (contd.)
• SAMtools, the Genome Analysis Toolkit (GATK) and Picard are some of the bioinformatic tools used to perform this post-alignment processing.
• Since variant calling algorithms assume that, in the case of fragmentation-based libraries, all reads are independent, removal of PCR duplicates and non-unique alignments (i.e., reads with more than one optimal alignment) is critical.
• This step can be performed using Picard tools (e.g., MarkDuplicates; a minimal invocation sketch is given after this slide).
• If not removed, a given fragment will be counted as different reads, increasing the number of incorrect variant calls and leading to an incorrect coverage and/or genotype assessment.
• Reads spanning INDELs impose further processing.
• Given the fact that each read is independently aligned to the reference genome, when an INDEL is part of the read there is a higher chance of alignment mismatches.
51
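The duplicate-marking step is usually run as a command-line tool rather than reimplemented. Below is a minimal sketch, wrapped in Python for consistency with the other examples, assuming a local picard.jar and placeholder file names.

```python
import subprocess

# Sketch of a post-alignment duplicate-marking step; picard.jar location,
# input BAM and output names are placeholders, not fixed conventions.
cmd = [
    "java", "-jar", "picard.jar", "MarkDuplicates",
    "I=sample.sorted.bam",        # aligned, coordinate-sorted input BAM
    "O=sample.dedup.bam",         # output BAM with duplicates flagged
    "M=sample.dup_metrics.txt",   # duplication metrics report
]
subprocess.run(cmd, check=True)
```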
Secondary Analysis – Post-Alignment Processing (contd.)
• The realigner tool first determines suspicious intervals requiring realignment due to the presence of INDELs.
• Next, the realigner runs over those intervals, combining shreds of evidence to generate a consensus score to support the presence of the INDEL.
• IndelRealigner from the GATK suite can be used to run this step.
• The confidence of the base-call is given by the Phred-scaled quality score, which is generated by the sequencer machine and represents the raw quality score.
• This score may be influenced by multiple factors, such as the sequencing platform and the sequence composition, and may not reflect the base-calling error rate.
• It is necessary to recalibrate this score to improve variant calling accuracy.
• BaseRecalibrator from the GATK suite is one of the most used tools.

52
Secondary Analysis – Variant Calling
• The variant calling step has the main objective of identifying variants using the post-processed BAM file.
• Several tools are available for variant calling; some identify variants based on the number of high-confidence base calls that disagree with the reference genome at the position of interest.
• Others use Bayesian, likelihood, or machine learning and statistical methods that factor in parameters such as base and mapping quality scores to identify variant differences (a toy likelihood sketch is given after this slide).
• Machine learning algorithms have evolved greatly in recent years and will be critical to assist scientists and clinicians to handle large amounts of data and to solve complex biological challenges.
53
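As a concrete illustration of likelihood-based calling, the toy sketch below computes binomial likelihoods of the observed alt-read count under the three diploid genotypes, assuming a uniform base error rate. It is not the model of any specific caller, which would also weigh base/mapping qualities and priors.

```python
from math import comb

def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Binomial likelihoods of the alt-read count under three genotypes."""
    n = ref_reads + alt_reads
    # Expected fraction of alt reads for hom-ref, het and hom-alt genotypes.
    alt_fraction = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    return {
        gt: comb(n, alt_reads) * p ** alt_reads * (1 - p) ** ref_reads
        for gt, p in alt_fraction.items()
    }

likelihoods = genotype_likelihoods(ref_reads=12, alt_reads=10)
print(max(likelihoods, key=likelihoods.get), likelihoods)  # most likely: 0/1
```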
Secondary Analysis – Variant Calling (contd.)
• SAMtools, GATK and FreeBayes are among the most widely used toolkits for Illumina data.
• Ion Torrent has its own variant caller, known as the Torrent Variant Caller (TVC).
• Running as a plugin on the Ion Torrent server, TVC calls single-nucleotide variants (SNVs), multi-nucleotide variants (MNVs) and INDELs in a sample across a reference or within a targeted subset of that reference.
• Several parameters can be customized, but often predefined configurations (germline vs. somatic, high vs. low stringency) can be used depending on the type of experiment performed.
54
Secondary Analysis – Variant Calling (contd.)
• Most of these tools use the SAM/BAM format as input and generate a variant calling format (VCF) file as their output (a minimal line-parsing sketch is given after this slide).
• The VCF format is a standard file format, currently in version 4.2, developed by large sequencing projects such as the 1000 Genomes Project.
• A VCF is basically a text file containing meta-information lines and a header line, followed by data lines, each containing information on the chromosomal position, the reference base and the identified alternative base or bases.
• The format also contains genotype information on samples for each position.
• VCFtools provides the possibility to easily manipulate VCF files, e.g., merging multiple files or extracting SNPs from specific regions.
55
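A minimal sketch of how the fixed VCF columns and a single sample column can be split apart; the example line is invented, and production code would normally rely on a dedicated VCF library.

```python
# Illustrative VCF data line (tab-separated): CHROM, POS, ID, REF, ALT,
# QUAL, FILTER, INFO, FORMAT and one sample column.
vcf_line = "chr7\t117559590\t.\tATCT\tA\t50\tPASS\tDP=112\tGT:DP\t0/1:112"

def parse_vcf_line(line):
    chrom, pos, vid, ref, alt, qual, flt, info, fmt, *samples = line.split("\t")
    # INFO is a semicolon-separated list of key=value pairs or flags.
    info_dict = dict(field.split("=", 1) if "=" in field else (field, True)
                     for field in info.split(";"))
    # Each sample column follows the FORMAT keys (e.g. GT:DP).
    genotypes = [dict(zip(fmt.split(":"), s.split(":"))) for s in samples]
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt.split(","), "qual": qual, "filter": flt,
            "info": info_dict, "samples": genotypes}

print(parse_vcf_line(vcf_line))
```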
Secondary Analysis – Structural Variant Calling
• Genetic variations in the human genome range from SNVs and INDELs to more complex (submicroscopic) SVs.
• These SVs include both large insertions/duplications and deletions (also known as copy number variants, CNVs) and large inversions, and can have a great impact on health.
• Longer-read sequencers hold the promise to identify large structural variations and the causative mutations in unsolved genetic diseases.
• Incorporating the calling of such SVs would increase the diagnostic yield of these NGS approaches, overcoming some of the limitations present in other methods, with the potential to eventually replace them.
• Reflecting this growing tendency, several bioinformatics tools have been developed to detect CNVs from NGS data.
• Currently, five approaches are used to detect CNVs from NGS data, according to the type of algorithms/strategies used: paired-end mapping, split read, read depth, de novo genome assembly and combinatorial approaches.
56
Main Methods for Calling Structural Variants (SVs) and Copy Number Variations (CNVs) from NGS Data
57
Secondary Analysis – Structural Variant Calling (contd.)
• Detection of CNVs mainly relies on whole-genome sequencing (WGS) data, since it includes non-coding regions, which are known to encompass a significant percentage of SVs.
• Whole-exome sequencing (WES) has emerged as a more cost-effective alternative to WGS, and the interest in detecting CNVs from WES data has grown considerably.
• Since only a small fraction of the human genome is sequenced by WES, it is not able to detect the complete spectrum of CNVs.
• The lower uniformity of WES as compared with WGS may reduce its sensitivity to detect CNVs.
• Usually, WES generates higher depth for targeted regions as compared with WGS.
• Most of the tools developed for CNV detection using WES data have depth-based calling algorithms implemented and require multiple samples or matched case-control samples as input.
• Ion Torrent also developed a proprietary algorithm, as part of the Ion Reporter software, for the detection of CNVs in NGS data derived from amplicon-based libraries.
61
Tertiary Analysis
• The third main step of the NGS analysis pipeline addresses the important issue of “making sense” of the data, i.e., data interpretation.
• Finding, in the human clinical genetics context, the fundamental link between variant data and the phenotype observed in a patient.
• The tertiary analysis starts with variant annotation.
• Adds a further layer of information to predict the functional impact of all variants previously found in the variant calling steps.
• Variant annotation is followed by variant filtering, prioritization and data visualization tools.
• These analytical steps can be performed by a combination of a wide variety of software.
• Constantly updated and enhanced to include the recent scientific findings.
• Require constant support and further improvements by the developers.
62
Tertiary Analysis – Variant Annotation
• Variant annotation is a key initial step for the analysis of sequencing variants.
• The output of variant calling is a VCF file.
• Each line in such a file contains high-level information about a variant, such as genomic position, reference and alternate bases, but no data about its biological consequences.
• Variant annotation offers such biological context for all the variants found.
• Given the massive amount of NGS data, data annotation is performed automatically.
• Several tools are currently available, and each uses different methodologies and databases for variant annotation.
• Most of the tools can perform both the annotation of SNVs and the annotation of INDELs.
• Annotation of SVs or CNVs is more complex and is not performed by all methods.
63
Tertiary Analysis – Variant Annotation (contd.)
• One basic step in the annotation is to provide the variant’s context.
• In which gene the variant is located, its position within the gene and the impact of the variation (missense, nonsense, synonymous, stop-loss, etc.).
• Such annotation tools offer additional annotation based on functionality.
• Integrating other algorithms such as SIFT, PolyPhen-2, CADD and Condel, which compute consequence scores for each variant based on various parameters:
• Degree of conservation of amino acid residues
• Sequence homology
• Evolutionary conservation
• Protein structure or statistical prediction based on known mutations
• Additional annotation can refer to disease variant databases such as ClinVar and HGMD, from which information about the variant’s clinical association is retrieved.
64
Tertiary Analysis – Variant Annotation (contd.)
• Numerous annotation tools are available.
• The most widely used are ANNOVAR, Variant Effect Predictor (VEP), snpEff and SeattleSeq.
• ANNOVAR is a command-line tool that can identify SNPs, INDELs and CNVs.
• ANNOVAR requires software installation and experienced users.
• It annotates the functional effects of variants with respect to genes and other genomic elements and compares variants to existing variation databases.
• ANNOVAR can also evaluate and filter out subsets of variants that are not reported in public databases, which is especially important when dealing with rare variants causing Mendelian diseases.
• Like ANNOVAR, VEP from Ensembl (EMBL-EBI) can provide genomic annotation for numerous species.
• VEP has a user-friendly interface through a dedicated web-based genome browser, although it also offers programmatic access via a standalone Perl script or a REST API.
• A wider range of input file formats is supported, and it can annotate SNPs, indels, CNVs or SVs.
• VEP searches the Ensembl Core database, determines where in the genomic structure the variant falls and, depending on that, gives a consequence prediction.
• snpEff is another widely used annotation tool, standalone or integrated with other tools commonly used in sequencing data analysis pipelines such as Galaxy, GATK and GKNO.
• In contrast with VEP and ANNOVAR, snpEff does not annotate CNVs but has the capability to annotate non-coding regions.
• snpEff can perform annotation for multiple variants, being faster than VEP.
65
Tertiary Analysis – Variant Annotation (contd.)
• Variant annotation may seem like a simple and straightforward process.
• It can be very complex considering the intricacy of genome organization.
• In theory, the exonic regions of the genome are transcribed into RNA, which in turn is translated into a protein.
• One gene would originate only one transcript and ultimately a single protein.
• Such a concept (the one gene–one enzyme hypothesis) is completely outdated, as the genetic organization and its machinery are much more complex.
• Due to a process known as alternative splicing, several transcripts, and thus different proteins, can be produced from the same gene.
• Alternative splicing is the major mechanism for the enrichment of transcriptome and proteome diversity.
66
Tertiary Analysis – Variant Annotation (contd.)
• Depending on the transcript choice, the biological information and implications concerning the variant can be very different.
• Blurriness concerning annotation tools is caused by the existence of a diversity of databases and reference genome datasets, which are not completely consistent and overlapping in terms of content.
• The most frequently used are Ensembl (http://www.ensembl.org), RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/) and the UCSC genome browser (http://genome.ucsc.edu), which contain such reference datasets and additional genetic information for several species.
• These databases also contain a compilation of the different sets of transcripts that have been observed for each gene and are used for variant annotation.
• Each database has its own particularities, and thus, depending on the database used for annotation, the outcome may be different.
• For instance, if for a given locus one of the possible transcripts has an intron retained while the others have not, a variant located in such a region will be considered as located in the coding sequence in only one of the isoforms.
67
Tertiary Analysis – Variant Annotation (contd.)
• To minimize the problem of multiple transcripts, the collaborative Consensus Coding Sequence (CCDS) project was born.
• This project aims to catalog identical protein annotations on both the human and mouse reference genomes with stable identifiers and to uniformize their representation in the different databases.
• The use of different annotation tools also introduces more variability to NGS data.
• For instance, ANNOVAR by default uses a 1 kb window to define upstream and downstream regions, while snpEff and VEP use 5 kb.
• This makes the classification of variants different even though the same transcript was used.
• There are significant differences between VEP and ANNOVAR annotations of the same transcript.
• Besides the problems related to multiple transcripts and annotation tools, there are also problems with overlapping genes, i.e., more than one gene in the same genomic position.
• There is still no complete/definitive solution to deal with these limitations; thus results from variant annotation should be analyzed with respect to the research context and, if possible, resorting to multiple sources.
68
Tertiary Analysis – Variant Filtering, Prioritization and Visualization
• After annotation of a VCF file from WES, the total number of variants may range between 30,000 and 50,000.
• To make clinical sense of so many variants and to identify the disease-causing variant(s), some filtering strategies are required.
• Although quality control was performed in previous steps, several false-positive variants will still be present.
• When starting the third level of NGS analysis, it is highly recommended to reduce the number of false-positive calls and variant call errors, based on quality parameters or previous knowledge of artifacts.
• Parameters such as the total number of independent reads, the percentage of reads showing the variant and the homopolymer length (particularly for Ion Torrent, with stretches longer than five bases being suspicious) are examples of filters that could be applied.
• The user should define the threshold based on the observed data and the research question, but, relative to the first parameter, variants supported by fewer than 10 independent reads are usually rejected, since they are likely due to sequencing bias or low coverage.
69
Tertiary Analysis – Variant Filtering, Prioritization and Visualization (contd.)
• One commonly used NGS filter is the population frequency filter.
• Minor allele frequency (MAF), one of the metrics used to filter based on allele frequency, can sort variants into three groups (a minimal classification sketch is given after this slide):
• rare variants (MAF < 0.5%, usually selected when studying Mendelian diseases),
• low-frequency variants (MAF between 0.5% and 5%) and
• common variants (MAF > 5%).
• Population databases that include data from thousands of individuals from several populations represent a powerful source of variant information about the global patterns of human genetic variation.
• They help not only to better identify disease alleles but are also important to understand population origins, migrations, relationships, admixtures and changes in population size, which can be useful to understand some disease patterns.
• Population databases such as the 1000 Genomes Project, the Exome Aggregation Consortium (ExAC) and the Genome Aggregation Database (gnomAD) are the most widely used.
• Nonetheless, this filter also has limitations and could cause erroneous exclusions, which are difficult to overcome.
70
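A minimal sketch of the MAF-based classification described above, using the thresholds from the text and invented allele frequencies.

```python
def maf_class(maf):
    """Classify a variant by minor allele frequency (thresholds from the text)."""
    if maf < 0.005:
        return "rare"           # typically retained for Mendelian disease studies
    if maf <= 0.05:
        return "low frequency"
    return "common"

# Hypothetical variants annotated with population allele frequencies.
variants = [
    {"id": "var1", "maf": 0.0001},
    {"id": "var2", "maf": 0.02},
    {"id": "var3", "maf": 0.31},
]

rare_only = [v for v in variants if maf_class(v["maf"]) == "rare"]
print(rare_only)
```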
Tertiary Analysis – Variant Filtering, Prioritization and Visualization (contd.)
• As carriers of recessive disorders do not show any signs of the disease, the frequency of damaging alleles in population variant databases can be higher than the established threshold.
• In-house variant databases are important to assist variant filtering, namely, to help understand the frequency of variants in a study population and to identify systematic errors of an in-house system.
• Numerous standardized, widely accepted guidelines exist for the evaluation of genomic variations obtained through NGS.
• American College of Medical Genetics and Genomics (ACMG)
• European Society of Human Genetics guidelines.
• These provide standards and guidelines for the interpretation of genomic variations.

71
Tertiary Analysis – Variant Filtering, Prioritization and Visualization (contd.)
• For a recognizable inheritance pattern, it is advisable to perform family inheritance-based model filtering (a toy autosomal dominant filter is sketched after this slide).
• This is especially useful if more than one patient from such families is available for study, as it greatly reduces the number of variants to be thoroughly analyzed.
• For instance, for diseases with an autosomal dominant (AD) inheritance pattern, the ideal situation would be testing at least three patients, each from a different generation, and selecting only the heterozygous variants located in the autosomes.
• If a pedigree indicates a likely X-linked disease, variants located in the X chromosome are selected, and those in other chromosomes are not primarily inspected.
• As for autosomal recessive (AR) diseases with more than one affected sibling, it would be important to study as many patients as possible and to select homozygous variants in patients that were found in heterozygosity in both parents, or genes with two heterozygous variants with distinct parental origins.
72
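As an illustration of inheritance-based model filtering, the sketch below encodes the autosomal recessive and X-linked selection rules from the slide above for a simple trio; genotypes are reduced to alternate-allele counts and the variant records are hypothetical, whereas a real pipeline would read them from a jointly genotyped multi-sample VCF.

def fits_autosomal_recessive(child: int, mother: int, father: int) -> bool:
    # Homozygous in the affected child (2 alternate alleles), heterozygous in both parents.
    return child == 2 and mother == 1 and father == 1

def fits_x_linked(chrom: str) -> bool:
    # For a pedigree suggesting X-linked disease, chromosome X variants are inspected first.
    return chrom in ("X", "chrX")

variants = [
    {"id": "v1", "chrom": "chr3", "child": 2, "mother": 1, "father": 1},
    {"id": "v2", "chrom": "chr3", "child": 1, "mother": 1, "father": 0},
    {"id": "v3", "chrom": "chrX", "child": 1, "mother": 1, "father": 0},
]
ar_candidates = [v["id"] for v in variants if fits_autosomal_recessive(v["child"], v["mother"], v["father"])]
x_candidates = [v["id"] for v in variants if fits_x_linked(v["chrom"])]
print(ar_candidates, x_candidates)  # ['v1'] ['v3']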
Tertiary Analysis - Variant Filtering, Prioritization and Visualization (contd.)
• For sporadic cases (and cases in which the disease pattern is not known), trio analysis can be extremely useful to reduce the analytical burden.
• In such a context, heterozygous variants found only in the patient and not present in either parent would indicate a de novo origin.
• Even in non-related cases with very homogeneous phenotypes, such as typically syndromic ones, it is possible to use an overlap-based strategy, assuming that the same gene or even the same variant is shared among all the patients.
• An additional filter, useful when many variants persist after applying the others, is based on the predicted impact of variants (functional filter).
• In some pipelines, intronic or synonymous variants are not analyzed, based on the assumption that they are likely to be benign (non-disease associated).
• Care should be taken, since numerous intronic and apparently synonymous variants have been implicated in human diseases.
• A functional filter is applied in which variants are prioritized based on their genomic location (exonic or splice sites).
• Additional information for filtering missense variants includes evolutionary conservation and the predicted effect on protein structure, function or interactions.
• To enable such filtering, the scores generated by algorithms that evaluate missense variants (for instance PolyPhen-2, SIFT and CADD) are annotated in the VCF.
• The same applies to variants that might have an effect on splicing, as prediction algorithms, such as Human Splicing Finder in VarAFT, are being incorporated in VCF annotation.
• More examples are given in the table next.
73
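A minimal sketch of a functional filter combining genomic location with annotation scores, under the assumption that the VCF annotations have already been parsed into dictionaries; the keys "region", "cadd" and "sift", and the example records, are illustrative, while the CADD > 10 and SIFT < 0.05 cut-offs follow the values mentioned in the text.

def keep_variant(v: dict) -> bool:
    in_relevant_region = v["region"] in ("exonic", "splice_site")
    deleterious_score = (v.get("cadd", 0.0) > 10        # CADD: scores above 10 as an example cut-off
                         or v.get("sift", 1.0) < 0.05)  # SIFT: scores below 0.05 suggest deleterious
    return in_relevant_region and deleterious_score

annotated = [
    {"id": "v1", "region": "exonic", "cadd": 24.3, "sift": 0.01},
    {"id": "v2", "region": "intronic", "cadd": 3.1, "sift": 0.62},
    {"id": "v3", "region": "splice_site", "cadd": 15.0, "sift": 0.20},
]
prioritized = [v["id"] for v in annotated if keep_variant(v)]
print(prioritized)  # ['v1', 'v3']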
Some Software Tools to Perform NGS Functional Filtering

• PhyloP (Phylogenetic p-values): Based on a model of neutral evolution, the patterns of conservation (positive scores) and acceleration (negative scores) are analysed for various annotation classes and clades of interest.
• SIFT (Sorting Intolerant from Tolerant): Predicts, based on sequence homology, whether an AA substitution will affect protein function and potentially alter the phenotype. Scores less than 0.05 indicate a deleterious variant.
• PolyPhen-2 (Polymorphism Phenotyping v2): Predicts the functional impact of an AA replacement from its individual features using a naive Bayes classifier. Includes two tools: HumDiv (designed to be applied to complex phenotypes) and HumVar (designed for the diagnosis of Mendelian diseases). Higher scores (>0.85) predict damaging variants more confidently.
• CADD (Combined Annotation Dependent Depletion): Integrates diverse genome annotations and scores all human SNVs and indels. It prioritizes functional, deleterious and disease-causal variants according to functional categories, effect sizes and genetic architectures. Scores above 10 are often applied as a cut-off for identifying pathogenic variants.
• MutationTaster: Analyses evolutionary conservation, splice-site changes, loss of protein features and changes that might affect the amount of mRNA. Variants are classified as polymorphisms or disease-causing.
74
Some Software Tools to Perform NGS Functional Filtering (contd.)

• Human Splicing Finder: Predicts the effects of mutations on splicing signals or identifies splicing motifs in any human sequence.
• nsSNPAnalyzer: Extracts structural and evolutionary information from a query nsSNP and uses a machine learning method (Random Forest) to predict its phenotypic effect. Classifies the variant as neutral or disease-associated.
• TopoSNP (Topographic mapping of SNPs): Analyses each SNP based on its geometric location and conservation information and produces an interactive visualization of disease- and non-disease-associated SNPs.
• Condel (Consensus Deleteriousness): Integrates the output of different methods to predict the impact of nsSNPs on protein function. The algorithm, based on the weighted average of the normalized scores, classifies variants as neutral or deleterious.
• ANNOVAR (Annotate Variation): Annotates variants based on several parameters, such as identification of whether SNPs or CNVs affect the protein (gene-based), identification of variants in specific genomic regions outside protein-coding regions (region-based) and identification of known variants documented in public and licensed databases (filter-based).
75
Some Software Tools to Perform NGS Functional Filtering (contd.)

• VEP (Variant Effect Predictor): Determines the effect of multiple variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts and protein sequence, as well as on regulatory regions.
• snpEff: Annotation and classification of SNVs based on their effects on annotated genes, such as synonymous/nsSNP status, start- or stop-codon gains or losses, and their genomic locations, among others. Considered a structure-based tool for annotation.
• SeattleSeq: Provides annotation of SNVs and small indels, assigning to each the dbSNP rs ID, gene name and accession number, variation function, protein position and AA change, conservation scores, HapMap frequencies, PolyPhen predictions and clinical associations.
76
Tertiary Analysis - Variant Filtering, Prioritization and Visualization (contd.)
• Although functional annotation adds an important layer of information for filtering, the fundamental question to be answered, especially in the context of gene discovery, is whether a specific variant or mutated gene is indeed the disease-causing one.
• To address this complex question, a new generation of tools is being developed that, instead of merely excluding information, ranks variants, thereby allowing their prioritization.
• Different approaches have been proposed.
• For instance, PHIVE explores the similarity between human disease phenotypes and those derived from knockout experiments in animal model organisms.
• Other algorithms attempt to solve this problem in an entirely different way, through the computation of a deleteriousness score (also known as a burden score) for each gene, based on how intolerant genes are to normal variation and using data from population variation databases.
• Human disease genes are much more intolerant to variants than non-disease-associated genes.
• Human Phenotype Ontology (HPO) enables hierarchical sorting by disease names and clinical features (symptoms) for describing medical conditions.
77
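As a rough illustration of ranking rather than excluding, the sketch below scores each candidate gene by combining a hypothetical gene-intolerance metric with the overlap between the patient's HPO terms and the terms linked to the gene; the input tables, the equal 0.5 weights and the HPO identifiers are all assumptions for the example, not part of any published tool.

patient_hpo = {"HP:0001250", "HP:0002376"}  # the patient's clinical features (illustrative terms)
gene_hpo = {                                # HPO terms associated with each candidate gene
    "GENE_A": {"HP:0001250", "HP:0002376", "HP:0000252"},
    "GENE_B": {"HP:0001263"},
}
gene_intolerance = {"GENE_A": 0.97, "GENE_B": 0.12}  # hypothetical 0-1 intolerance metric

def priority(gene: str) -> float:
    # Fraction of the patient's terms explained by the gene, averaged with its intolerance.
    overlap = len(patient_hpo & gene_hpo.get(gene, set())) / len(patient_hpo)
    return 0.5 * overlap + 0.5 * gene_intolerance.get(gene, 0.0)

ranked = sorted(gene_hpo, key=priority, reverse=True)
print(ranked)  # ['GENE_A', 'GENE_B'] -- GENE_A explains more symptoms and is less tolerant to variation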
Tertiary Analysis - Variant Filtering, Prioritization and Visualization (contd.)
• HPO can also provide an association between symptoms and known disease genes.
• Several tools attempt to use these phenotype descriptions to generate a ranking of potential candidates in variant prioritization.
• As an example, some attempt to simplify analysis in a clinical context, such as the phenotypic interpretation of exomes, which only reports genes previously associated with genetic diseases.
• Others can also be used to identify novel genes, such as Phevor, which uses data gathered from other related ontologies, for example Gene Ontology (GO), to suggest novel gene–disease associations.
• The main goal of these tools is to end up with a few variants for further validation with molecular techniques.
• Recently, several commercial software packages have been developed to aid in the interpretation and prioritization of variants in a diagnostic context; they are simple to use, intuitive and can be used by clinicians, geneticists and researchers:
• VarSeq/VSClinical (Golden Helix),
• Ingenuity Variant Analysis (Qiagen),
• Alamut® software (Interactive Biosoftware) and
• VarElect.
• Besides these tools that aid in interpretation and variant analysis, clinicians currently have at their disposal several medical genetics companies, such as Invitae (https://www.invitae.com/en/) and CENTOGENE (https://www.centogene.com/), that provide a precise medical diagnosis.
78
NGS Pitfalls
• Seventeen years have passed since the introduction of the first commercially available NGS platform, the 454 GS FLX from 454 Life Sciences.
• Since then, the “genomics” field has greatly expanded our knowledge about structural and functional genomics and the underlying genetics of many diseases.
• Besides, it allowed the creation of the “omics” concepts (transcriptomics, genomics, metabolomics, etc.), which provide new insights into all living beings: how different organisms use genetics and molecular biology to survive and reproduce in health and disease, their population networks and their responses to changing environmental conditions.
• This information is also very useful to understand human health.
• NGS brought a panoply of benefits and solutions to medicine and to other areas, such as agriculture, where it helped to increase quality and productivity.
• However, it has also brought new challenges.
79
NGS Pitfalls (contd.)
• The first challenge concerns sequencing costs.
• Although the overall costs of NGS are coming down, an NGS experiment is not cheap and is still not accessible to all laboratories.
• It imposes high initial costs for the acquisition of the sequencing machine, plus consumables and reagents.
• Costs of experimental design, sample collection and sequencing library preparation also must be considered.
• Many times, the costs of developing sequencing pipelines and the bioinformatics tools that improve those pipelines and perform the downstream sequence analysis, as well as the costs of data management, informatics equipment and downstream data analysis, are not considered in the overall NGS costs.
• A typical BAM file from a single WES experiment consumes up to 30 GB of space, so storing and analyzing data from several patients requires substantial computational power and storage space (for example, 1,000 exomes could already occupy roughly 30 TB of BAM files alone), which clearly adds significant costs.
• Expert bioinformaticians are needed to deal with data analysis.
• These additional costs are evidently part of the NGS workflow and must be accounted for.
80
NGS Pitfalls (contd.)
• Concerns about data sharing and confidentiality arise from the massive amount of data that is generated with NGS and its analysis.
• It is debatable what degree of protection should be adopted for genomic data.
• Should genomic data be shared among multiple parties (including laboratory staff, bioinformaticians, researchers, clinicians, patients and their family members) or not?
• When analyzing NGS data, it is important to be aware of its technical limitations, namely PCR amplification bias (a significant source of bias, due to the random errors that can be introduced) and sequencing errors.
• High coverage is needed to understand which variants are true and which are caused by sequencing or PCR errors.
• Limitations also exist in the downstream analysis of read alignment/mapping, especially for indels, which some alignment tools detect poorly or not at all.
• Though bioinformatics tools have helped to make data analysis more automatic, a manual inspection of variants in the BAM file is frequently needed.
• Thus, it is critical to understand the limitations of each NGS platform and workflow, to try to overcome them and increase the quality of variant detection.
81
NGS Pitfalls (contd.)
• Another major challenge for clinicians and researchers is to correlate the findings with the relevant medical information.
• This may not be a simple task, especially when dealing with new variants or new genes not previously associated with disease.
• It requires additional effort to validate the pathogenicity of variants (which in a clinical setting may not be feasible).
• More importantly, both clinicians and patients must be clearly aware that a positive result, although providing an answer that often terminates a long and expensive diagnostic journey, does not necessarily mean that a better treatment will be offered, nor that it will be possible to find a cure.
• In many cases, that genetic information may not alter the prognosis or the outcome for an affected individual.
• This is an inconvenient and hard truth that clinicians should clearly explain to patients.
• Nevertheless, huge efforts have been made to improve the choice of the best therapeutic options based on DNA sequencing results, for cancer and a growing number of rare diseases.
82
Concluding Remarks
• Despite all the accomplishments made so far, a long journey lies ahead before genetics can provide a definitive answer towards the diagnosis of all genetic diseases.
• Further improvements in sequencing platforms and data handling strategies are required to reduce error rates and to increase variant detection quality.
• To increase our understanding of disease, especially complex and heterogeneous diseases, scientists and clinicians must combine information from multiple -omics sources (such as genomics, transcriptomics, proteomics and epigenomics).
• NGS is evolving rapidly beyond the classic genomic approach and is gaining broad acceptance.
• The major challenge continues to be dealing with, and interpreting, all the distinct layers of information.
• Current computational methods may not be able to handle and extract the full potential of the large genomic and epigenomic datasets being generated.
• Bioinformaticians, scientists and clinicians will have to work together to interpret the data and to develop novel tools for integrated, systems-level analysis.
• Machine learning algorithms, as well as emerging developments in artificial intelligence, will be decisive in improving NGS platforms and software.
• This will help scientists and clinicians to solve complex biological challenges, thus improving clinical diagnostics and opening new avenues for the development of novel therapies.
83
Thank You.

84
