Title: Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis
1Introduction to Next-Generation Sequencing Data
and Related Bioinformatic Analysis
- Han Liang, Ph.D.
- Department of Bioinformatics and Computational Biology
- 3/25/2014 @ Rice University
2Outline
- History
- NGS Platforms
- Applications
- Bioinformatics Analysis
- Challenges
3Central Dogma
4Sanger sequencing
- DNA is fragmented
- Cloned to a plasmid vector
- Cyclic sequencing reaction
- Separation by electrophoresis
- Readout with fluorescent tags
5Sanger vs NGS
- Sanger sequencing has been the only DNA sequencing method for 30 years, but there is hunger for even greater sequencing throughput and more economical sequencing technology
- NGS can process millions of sequence reads in parallel rather than 96 at a time (at 1/6 of the cost)
- Objections: fidelity, read length, infrastructure cost, handling large volumes of data
6Platforms
- Roche/454 FLX: 2004
- Illumina Solexa Genome Analyzer: 2006
- Applied Biosystems SOLiD™ System: 2007
- Helicos HeliScope™: recently available
- Pacific Biosciences SMRT: launching 2010
7Quickly reduced Cost
8Three Leading Sequencing Platforms
- Roche 454
- Illumina Solexa
- Applied Biosystems SOLiD
9 The general experimental procedure
Wang et al. Nature Reviews Genetics 2009
10 454: bead microreactor
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
12Illumina (Solexa): bridge amplification
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
13SOLiD: color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
14Comparison of existing methods
15Real Data: nucleotide space
- Solexa
    @SRR002051.1 81325773 length=33
    AAAGAACATTAAAGCTATATTATAAGCAAAGAT
    +SRR002051.1 81325773 length=33
    IIIIIIIIIIIIIIIIIIIIIIIII'II@I)-
    @SRR002051.2 81409432 length=33
    AAGTTATGAAATTGTAATTCCAATATCGTAAGC
    +SRR002051.2 81409432 length=33
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII07
    @SRR002051.3 81488490 length=33
    AATTTCTTACCATATTAGACAAGGCACTATCTT
    +SRR002051.3 81488490 length=33
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
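The Solexa reads above are in the four-line FASTQ layout: a header starting with `@`, the bases, a separator starting with `+`, and one quality character per base. A minimal Python sketch of reading such records follows. It assumes the Phred+33 quality encoding (under that assumption `I` means quality 40); the function name `parse_fastq` and the uniform quality string in the sample record are illustrative, not taken from the slides.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, qualities) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (s.strip() for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        yield header[1:].split()[0], seq, [ord(c) - 33 for c in qual]


record = [
    "@SRR002051.1 81325773 length=33",
    "AAAGAACATTAAAGCTATATTATAAGCAAAGAT",
    "+SRR002051.1 81325773 length=33",
    "I" * 33,  # illustrative quality string: 33 bases, all quality 40
]

for read_id, seq, quals in parse_fastq(record):
    print(read_id, len(seq), min(quals))  # SRR002051.1 33 40
```

Older Solexa pipelines used a different quality offset, so the encoding should be checked before interpreting the scores.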
16Real Data: color space
- SOLiD Data
    >1_24_47_F3
    T1.1.23..0120230.320033300030030010022.00.0201.0201
    >1_24_52_F3
    T2.3.21..2122321.213110332101132321002.11.0111.1222
    >1_24_836_F3
    T0.2.22..2222222.010203032021102220200.01.2211.2211
    >1_24_1404_F3
    T2.3.30..2013222.222103131323012313233.22.2220.0213
    >1_25_202_F3
    T0.3213.111202312203021101111330201000313.121122211
    >1_25_296_F3
    T0.1130.100123202213120023121112113212121.013301210
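Each SOLiD read above starts with a known primer base (`T`) followed by colors 0-3, where every color encodes the transition between two adjacent bases and `.` marks a missing call. A minimal Python sketch of decoding such a read is given here. It relies on the standard two-base encoding, in which the color equals the XOR of the base codes when A=0, C=1, G=2, T=3; the function name `decode_colorspace` and the choice to emit `N` after a missing call are assumptions made for illustration.

```python
BASES = "ACGT"  # index = base code: A=0, C=1, G=2, T=3


def decode_colorspace(read: str) -> str:
    """Translate a SOLiD color-space read into bases.

    The first character is the known primer base. Each digit 0-3 is the
    XOR of the codes of two adjacent bases. After a missing call ('.')
    the following bases cannot be determined, so 'N' is emitted.
    """
    prev = BASES.index(read[0])
    out = []
    known = True
    for color in read[1:]:
        if known and color in "0123":
            prev ^= int(color)
            out.append(BASES[prev])
        else:
            known = False
            out.append("N")
    return "".join(out)


print(decode_colorspace("T0120"))  # TGAA
print(decode_colorspace("T1.1"))   # GNN
```

Because one missing or wrong color affects every base after it, reads of this kind are usually aligned in color space rather than decoded first.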
17Data output difference among the three platforms
- Nucleotide space vs. color space
- Length of short reads
- 454 (400-500 bp) > SOLiD (70 bp) ~ Solexa (36-120 bp)
18Applications with Digital output
- De novo genome assembly
- Genome re-sequencing
- RNA-Seq (gene expression, exon-intron structure, small RNA profiling, and mutation)
- ChIP-Seq (protein-DNA interaction)
- Epigenetic profiling
19Ancient Genomes Resurrected
- Degraded state of the sample → mtDNA sequencing
- Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp)
- Problems: contamination with modern human DNA and co-isolation of bacterial DNA
21Elucidating DNA-protein interactions through
chromatin immunoprecipitation sequencing
- DNA-protein interactions play a key part in regulating gene expression
- ChIP is the technique used to study DNA-protein interactions
- Recently: genome-wide ChIP-based studies of DNA-protein interactions
- Readout of ChIP-derived DNA sequences on NGS platforms
- Insights into transcription factor/histone binding sites in the human genome
- Enhances our understanding of gene expression in the context of specific environmental stimuli
22Discovering noncoding RNAs
- The presence of ncRNAs in a genome is difficult to predict computationally with high certainty because of their evolutionary diversity
- Detecting expression level changes that correlate with changes in environmental factors, with disease onset and progression, or with complex disease set or severity
- Enhances the annotation of sequenced genomes (making the impact of mutations more interpretable)
23Metagenomics
- Characterizing the biodiversity found on Earth
- The growing number of sequenced genomes enables us to interpret partial sequences obtained by direct sampling of specific environmental niches
- Examples: ocean, acid mine sites, soil, coral reefs, and the human microbiome, which may vary according to the health status of the individual
24Defining variability in many human genomes
- Common variants have not yet completely explained complex disease genetics → rare alleles also contribute
- Also structural variants: large and small insertions and deletions
- Accelerating biomedical research
25Epigenomic variation
- Enables the study of genome-wide patterns of methylation and how these patterns change through the course of an organism's development
- Enhanced potential to combine the results of different experiments: for example, correlative analyses of genome-wide methylation, histone binding patterns, and gene expression
26Integrating Omics
Mutation discovery
Protein-DNA interaction
Copy number variation
mRNA expression
microRNA expression
Alternative Splicing
Kahvejian et al. 2008
27Data Analysis Flow
SOLiD machine: raw data
Central server: basic processing (decoding, filtering, and mapping)
Local machine: downstream analysis
28Short Read Mapping
- DNA-Resequencing
- BLAST-like approach
- RNA-Seq
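A "BLAST-like approach" is commonly read as seed-and-extend: look up a short exact seed in an index of the reference, then check the rest of the read at each candidate position. The following Python sketch illustrates that idea on a toy reference. The function names, the seed length `k`, the mismatch limit, and the sequences are all hypothetical, and real short-read mappers use far more compact indexes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: Dict[str, List[int]],
             k: int, max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Seed with the read's first k bases, then extend without gaps.

    Returns (position, mismatches) for every candidate within the limit.
    """
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTTGCAAGGCTTACGGATCCA"  # toy reference, not real data
index = build_index(reference, k=4)
print(map_read("GGCTTACGCA", reference, index, k=4))  # [(9, 1)]
```

Seeding only on the first `k` bases misses reads with an error inside the seed; this is a simplification kept for brevity.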
31Read length and pairing
TCGTACCGATATGCTG
ACTTAAGGCTGACTAGC
- Short reads are problematic, because short sequences do not map uniquely to the genome.
- Solution 1: Get longer reads.
- Solution 2: Get paired reads.
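Both solutions can be illustrated on a toy genome that contains a repeated segment. In the Python sketch below, the genome, the reads, and the gap limits are made up for illustration, and both mates are given on the forward strand, which is a simplification of real paired-end data.

```python
from typing import List


def find_all(genome: str, read: str) -> List[int]:
    """Return every start position where the read matches exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]


def place_pair(genome: str, mate1: str, mate2: str,
               min_gap: int, max_gap: int) -> List[int]:
    """Keep mate1 positions with mate2 downstream at the expected distance."""
    hits2 = find_all(genome, mate2)
    return [p for p in find_all(genome, mate1)
            if any(min_gap <= q - p <= max_gap for q in hits2)]


repeat = "TCGTACCGATAT"
genome = "AAC" + repeat + "GCTGGA" + repeat + "TTCAGG"  # toy genome

print(find_all(genome, "CGATAT"))      # [9, 27]: short read is ambiguous
print(find_all(genome, "CGATATGCTG"))  # [9]: longer read is unique
print(place_pair(genome, "CGATAT", "TTCAGG", 4, 10))  # [27]: mate resolves it
```

The longer read escapes the repeat by reaching into unique flanking sequence, while the pair is placed by its uniquely mapping mate and the expected distance between the two ends.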
32Post-alignment Analysis
- DNA-Seq
- SNP calling
- RNA-Seq
- Quantifying gene expression level
33Concepts
The reference genome: hg19 (GRCh37)
Main assembly: Chr1-22, X, and Y; 3,095,677,412 bp
Target region (exome):
Ensembl: 85.3 million bp (2.94%)
RefSeq: 67.7 million bp (2.34%)
CCDS: 31,266,049 bp (1.08%), consisting of 185,446 non-redundant exons
34Target Coverage
36SOLiD: color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
37SNP calling
39Array-based High-throughput Dataset
40Limitations of hybridization-based approach
- Reliance on existing knowledge about the genome sequence
- Background noise and a limited dynamic range of detection
- Cross-experiment comparison is difficult
- Requires complicated normalization methods
Wang et al. Nature Reviews Genetics 2009
41Quantifying gene expression using RNA-Seq data
- RPKM: Reads Per Kilobase of exon length per Million mapped reads
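RPKM normalizes a gene's read count by both its exon length and the sequencing depth, so that values can be compared across genes and samples. A minimal Python sketch of the calculation follows. The function name and the example gene (1,500 reads over 2,000 bp of exon) are hypothetical; the 75 million mapped reads echo the per-sample figure quoted later in this deck.

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon per million mapped reads.

    RPKM = reads / (exon length in kb * mapped reads in millions)
         = reads * 1e9 / (exon length in bp * total mapped reads)
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 1,500 reads on 2,000 bp of exon in a 75 M read sample.
print(rpkm(1500, 2000, 75_000_000))  # 10.0
```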
42Large Dynamic Range
Mortazavi et al. Nature Methods 2008
43High reproducibility
Mortazavi et al. Nature Methods 2008
44High Accuracy
Wang et al. Nature 2008
45Advantages of RNA-Seq
- Not limited to the existing genomic sequence
- Very low (if any) background signal
- Large dynamic range of detection
- High reproducibility
- High accuracy
- Requires less sample
- Low cost per base
Wang et al. Nature Reviews Genetics 2009
46Huge amount of data!
- For a typical RNA-Seq SOLiD run:
- 2 TB image file
- 120 GB text file for downstream analysis
- 75 M short reads per sample
Efficient methods for data storage and
management
47Considerable sequencing error
High-quality image analysis for base calling
48Genome alignment and assembly: time consuming
and memory demanding
- To perform genome mapping for SOLiD data:
- 32-Opteron HP DL785 with 128 GB of RAM
- 12-14 hours per sample
High-performance parallel computing
49Bioinformatics Challenges
- Efficient methods to store, retrieve, and process huge amounts of data
- Reducing errors in image analysis and base calling
- Fast and accurate methods for genome alignment and assembly
- New algorithms for downstream analyses
50Experimental Challenges
Library fragmentation
Strand specific
Wang et al. Nature Reviews Genetics 2009
51Question & Answer
Han Liang
E-mail: hliang1@mdanderson.org
Tel: 713-745-9815