Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis

1
Introduction to Next-Generation Sequencing Data
and Related Bioinformatic Analysis
  • Han Liang, Ph.D.
  • Department of Bioinformatics and Computational
    Biology
  • 3/25/2014 @ Rice University

2
Outline
  • History
  • NGS Platforms
  • Applications
  • Bioinformatics Analysis
  • Challenges

3
Central Dogma
4
Sanger sequencing
  • DNA is fragmented
  • Cloned to a plasmid vector
  • Cyclic sequencing reaction
  • Separation by electrophoresis
  • Readout with fluorescent tags

5
Sanger vs NGS
  • Sanger sequencing was the only DNA
    sequencing method for 30 years, but
  • there is a hunger for even greater sequencing
    throughput and more economical sequencing technology
  • NGS can process millions of
    sequence reads in parallel rather than 96 at a
    time (at 1/6 of the cost)
  • Objections: fidelity, read length, infrastructure
    cost, handling large volumes of data

6
Platforms
  • Roche/454 FLX: 2004
  • Illumina Solexa Genome Analyzer: 2006
  • Applied Biosystems SOLiD™ System: 2007
  • Helicos HeliScope™: recently available
  • Pacific Biosciences SMRT: launching 2010

7
Rapidly Reduced Cost
8
Three Leading Sequencing Platforms
  • Roche 454
  • Illumina Solexa
  • Applied Biosystems SOLiD

9
The general experimental procedure
Wang et al. Nature Reviews Genetics 2009
10
454: bead microreactor
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
11
(No Transcript)
12
Illumina (Solexa): bridge amplification
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
13
SOLiD: color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
14
Comparison of existing methods
15
Real Data: nucleotide space
  • Solexa
  • @SRR002051.1 81325773 length=33
  • AAAGAACATTAAAGCTATATTATAAGCAAAGAT
  • +SRR002051.1 81325773 length=33
  • IIIIIIIIIIIIIIIIIIIIIIIII'II@I)-
  • @SRR002051.2 81409432 length=33
  • AAGTTATGAAATTGTAATTCCAATATCGTAAGC
  • +SRR002051.2 81409432 length=33
  • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII07
  • @SRR002051.3 81488490 length=33
  • AATTTCTTACCATATTAGACAAGGCACTATCTT
  • +SRR002051.3 81488490 length=33
  • IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
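The records above follow the four-line FASTQ layout: an "@" header, the bases, a "+" separator, and one quality character per base. The following is a minimal sketch of reading such records in Python, assuming the qualities use the Phred+33 encoding (ASCII code minus 33), which is an assumption not stated on the slide. The function name parse_fastq and the short example record are hypothetical and only for illustration.

```python
def parse_fastq(text):
    """Return a list of (read_id, sequence, phred_scores) tuples.

    Assumes plain four-line FASTQ records and Phred+33 quality encoding.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33: quality score = ASCII code of the character minus 33
        scores = [ord(c) - 33 for c in qual]
        read_id = header[1:].split()[0]
        records.append((read_id, seq, scores))
    return records


# Hypothetical 8-base record, not taken from the slide
example = "@read1 length=8\nAAAGAACA\n+read1 length=8\nIIIII'@-\n"
for read_id, seq, scores in parse_fastq(example):
    print(read_id, seq, scores)
```

Under this encoding the character "I" corresponds to a score of 40, so long runs of "I" indicate uniformly high-confidence base calls, while characters such as "'" or "-" near the end of a read mark low-confidence positions.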

16
Real Data: color space
  • SOLiD Data
  • >1_24_47_F3
  • T1.1.23..0120230.320033300030030010022.00.0201.0201
  • >1_24_52_F3
  • T2.3.21..2122321.213110332101132321002.11.0111.1222
  • >1_24_836_F3
  • T0.2.22..2222222.010203032021102220200.01.2211.2211
  • >1_24_1404_F3
  • T2.3.30..2013222.222103131323012313233.22.2220.0213
  • >1_25_202_F3
  • T0.3213.111202312203021101111330201000313.121122211
  • >1_25_296_F3
  • T0.1130.100123202213120023121112113212121.013301210
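Each color-space read above starts with a known primer base (here "T") followed by digits 0-3, where each digit encodes the transition between two adjacent bases and "." marks a missing color call. The following is a minimal sketch of translating such a read to nucleotides, assuming the standard two-base encoding in which the color equals the XOR of the 2-bit base codes (A=0, C=1, G=2, T=3). The function name decode_colorspace is hypothetical.

```python
BASES = "ACGT"  # index in this string is the 2-bit code: A=0, C=1, G=2, T=3


def decode_colorspace(read):
    """Translate a color-space read (primer base + colors) to bases.

    Every base depends on the previous one, so once a missing call
    ('.') is reached, all remaining positions are reported as 'N'.
    """
    prev = BASES.index(read[0])  # the leading primer base is not part of the read
    decoded = []
    colors = read[1:]
    for i, color in enumerate(colors):
        if color == ".":
            decoded.extend("N" * (len(colors) - i))
            break
        prev ^= int(color)  # color = previous base code XOR current base code
        decoded.append(BASES[prev])
    return "".join(decoded)


print(decode_colorspace("T0120"))
print(decode_colorspace("T1.1"))
```

Because a single wrong or missing color corrupts every base after it, this kind of direct translation is only illustrative; color-space reads are normally aligned in color space and converted to bases after mapping.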

17
Data output difference among the three platforms
  • Nucleotide space vs. color space
  • Length of short reads
  • 454 (400-500 bp) > SOLiD (70 bp), Solexa
    (36-120 bp)

18
Applications with Digital output
  • De novo genome assembly
  • Genome re-sequencing
  • RNA-Seq (gene expression, exon-intron structure,
    small RNA profiling, and mutation)
  • ChIP-Seq (protein-DNA interaction)
  • Epigenetic profiling

19
Ancient Genomes Resurrected
  • Degraded state of the sample → mtDNA sequencing
  • Nuclear genomes of ancient remains: cave bear,
    mammoth, Neanderthal (10^6 bp)

Problems: contamination by modern human DNA and
co-isolation of bacterial DNA
20
(No Transcript)
21
Elucidating DNA-protein interactions through
chromatin immunoprecipitation sequencing
  • Key part in regulating gene expression
  • ChIP technique to study DNA-protein interactions
  • Recently genome-wide ChIP-based studies of
    DNA-protein interactions
  • Readout of ChIP-derived DNA sequences onto NGS
    platforms
  • Insights into transcription factor/histone
    binding sites in the human genome
  • Enhance our understanding of the gene expression
    in the context of specific environmental stimuli

22
Discovering noncoding RNAs
  • ncRNA presence in a genome is difficult to predict by
    computational methods with high certainty because
    of their evolutionary diversity
  • Detecting expression level changes that correlate
    with changes in environmental factors, with
    disease onset and progression, complex disease
    set or severity
  • Enhance the annotation of sequenced genomes
    (impact of mutations more interpretable)

23
Metagenomics
  • Characterizing the biodiversity found on Earth
  • The growing number of sequenced genomes enables
    us to interpret partial sequences obtained by
    direct sampling of specific environmental niches.
  • Examples ocean, acid mine site, soil, coral
    reefs, human microbiome which may vary according
    to the health status of the individual

24
Defining variability in many human genomes
  • Common variants have not yet completely explained
    complex disease genetics → rare alleles also
    contribute
  • Also structural variants, large and small
    insertions and deletions
  • Accelerating biomedical research

25
Epigenomic variation
  • Enables the study of genome-wide patterns of methylation and
    how these patterns change through the course of an
    organism's development.
  • Enhanced potential to combine the results of
    different experiments, correlative analyses of
    genome-wide methylation, histone binding patterns
    and gene expression, for example.

26
Integrating Omics
Mutation discovery
Protein-DNA interaction
Copy number variation
mRNA expression
microRNA expression
Alternative Splicing
Kahvejian et al. 2008
27
Data Analysis Flow
  • SOLiD machine: raw data
  • Central server: basic processing (decoding, filtering,
    and mapping)
  • Local machine: downstream analysis
28
Short Read Mapping
  • DNA-Resequencing
  • BLAST-like approach
  • RNA-Seq

29
(No Transcript)
30
(No Transcript)
31
Read length and pairing
TCGTACCGATATGCTG
ACTTAAGGCTGACTAGC
  • Short reads are problematic because short
    sequences do not map uniquely to the genome.
  • Solution 1: Get longer reads.
  • Solution 2: Get paired reads.
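The uniqueness problem can be illustrated with a toy exact-match search: the shorter the read, the more places it fits. The following is a minimal sketch, assuming exact matching against a made-up reference string. The function name find_hits and both sequences are hypothetical, and real short-read mappers use indexed, mismatch-tolerant algorithms rather than a linear scan.

```python
def find_hits(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Resume one position later so overlapping matches are also found
        start = reference.find(read, start + 1)
    return hits


reference = "ACGTACGTTTACGA"  # made-up toy reference
for read in ("ACG", "ACGTT"):
    hits = find_hits(reference, read)
    status = "unique" if len(hits) == 1 else "ambiguous"
    print(read, hits, status)
```

In this toy case the 3-base read matches three positions, while extending it to 5 bases leaves a single placement, which is the effect Solution 1 relies on. Paired reads help in a similar way by adding a second sequence and an expected distance as extra constraints on where a fragment can be placed.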

32
Post-alignment Analysis
  • DNA-SEQ
  • SNP calling
  • RNA-SEQ
  • Quantifying gene expression level

33
Concepts
  • Reference genome: hg19 (GRCh37)
  • Main assembly: Chr1-22, X, and Y; 3,095,677,412 bp
  • Target region: exome
  • Ensembl: 85.3 million bp (2.94%)
  • RefSeq: 67.7 million bp (2.34%)
  • CCDS: 31,266,049 bp (1.08%), consisting of 185,446 nr exons

34
Target Coverage
35
(No Transcript)
36
SOLiD: color coding
Mardis, Annu. Rev. Genomics Hum. Genet. 2008
37
SNP calling
38
(No Transcript)
39
Array-based High-throughput Dataset
40
Limitations of hybridization-based approach
  • Reliance on existing knowledge about genome sequence
  • Background noise and a limited dynamic range
    of detection
  • Cross-experiment comparison is difficult
  • Requiring complicated normalization methods

Wang et al. Nature Reviews Genetics 2009
41
Quantifying gene expression using RNA-Seq data
  • RPKM: Reads Per Kilobase of exon length per Million
    mapped reads
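RPKM normalizes a gene's read count by both the gene's exon length and the sequencing depth, so that values are comparable across genes and samples. The following is a minimal sketch of the calculation; the function name rpkm, its parameter names, and the example numbers are hypothetical.

```python
def rpkm(gene_reads, exon_length_bp, total_mapped_reads):
    """Reads Per Kilobase of exon length per Million mapped reads.

    RPKM = 10^9 * gene_reads / (total_mapped_reads * exon_length_bp)
    The factor 10^9 combines the per-kilobase (10^3) and
    per-million-reads (10^6) scaling.
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return gene_reads * 1e9 / (total_mapped_reads * exon_length_bp)


# Hypothetical gene: 500 reads on 2,000 bp of exons, 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))
```

With these made-up numbers, 500 reads over 2 kb is 250 reads per kilobase, and dividing by 10 (millions of mapped reads) gives 25.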

42
Large Dynamic Range
Mortazavi et al. Nature Methods 2008
43
High reproducibility
Mortazavi et al. Nature Methods 2008
44
High Accuracy
Wang et al. Nature 2008
45
Advantages of RNA-Seq
  • Not limited to the existing genomic sequence
  • Very low (if any) background signal
  • Large dynamic detecting range
  • High reproducibility
  • High accuracy
  • Less sample required
  • Low cost per base

Wang et al. Nature Reviews Genetics 2009
46
Huge amount of data!
  • For a typical RNA-Seq SOLiD run:
  • 2 TB of image files
  • 120 GB text file for downstream analysis
  • 75 M short reads per sample

Efficient methods for data storage and
management
47
Considerable sequencing error
High-quality image analysis for base calling
48
Genome alignment and assembly: time consuming
and memory demanding
  • To perform genome mapping for SOLiD data:
  • 32-Opteron HP DL785 with 128 GB of RAM
  • 12-14 hours per sample

High-performance parallel computing
49
Bioinformatics Challenges
  • Efficient methods to store, retrieve and process
    huge amount of data
  • To reduce errors in image analysis and base
    calling
  • Fast and accurate methods for genome alignment and
    assembly
  • New algorithms in downstream analyses

50
Experimental Challenges
Library fragmentation; strand specificity
Wang et al. Nature Reviews Genetics 2009
51
Questions & Answers
Han Liang
E-mail: hliang1@mdanderson.org
Tel: 713-745-9815