Next Generation Sequencing - PowerPoint PPT Presentation

Provided by: Gulci6
1
Next Generation Sequencing
2
Sequencing techniques
  • ChIP-seq
  • MBD-seq (MIRA-seq)
  • BS-seq
  • RNA-seq
  • miRNA-seq

3
ChIP-seq
  • ChIP-Seq is a new frontier technology to analyze
    in vivo protein-DNA interactions.
  • ChIP-Seq
  • Combination of chromatin immunoprecipitation
    (ChIP) with ultra-high-throughput massively
    parallel sequencing
  • Allows mapping of protein-DNA interactions in vivo
    on a genome scale

4
Workflow of ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
7
The advantages of ChIP-seq
  • Current microarray and ChIP-chip designs require
    knowing the sequence of interest in advance, such
    as a promoter, enhancer, or RNA-coding domain.
  • Lower cost
  • Higher resolution
  • Higher accuracy
  • Alterations in transcription-factor binding in
    response to environmental stimuli can be
    evaluated for the entire genome in a single
    experiment.

8
Sequencers
  • Solexa (Illumina)
  • 1 GB of sequences in a single run
  • 35 bases in length
  • 454 Life Sciences (Roche Diagnostics)
  • 25-50 MB of sequences in a single run
  • Up to 500 bases in length
  • SOLiD (Applied Biosystems)
  • 6 GB of sequences in a single run
  • 35 bases in length

9
Illumina Genome Analysis System
10
Sequencing
11
Sequencer Output
12
Sequence Files
  • 10-40 million reads per lane
  • 500 MB files

13
Quality Score Files
  • Quality scores describe the confidence of bases
    in each read
  • Solexa pipeline assigns a quality score to the
    four possible nucleotides for each sequenced base
  • 9 million sequences (500 MB file) → 6.5 GB quality
    score file
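
What a quality score "confidence" means can be made concrete with the standard Phred relation Q = -10 log10(p), where p is the probability that the base call is wrong. Early Solexa pipelines used a related log-odds scale, Q = -10 log10(p / (1 - p)). The sketch below converts both scales to error probabilities; the function names are illustrative assumptions and are not taken from any pipeline.

```python
def phred_to_error_prob(q):
    """Phred scale: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q):
    """Solexa log-odds scale: Q = -10 * log10(p / (1 - p))."""
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


# Compare the two scales at a few typical scores.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q), solexa_to_error_prob(q))
```

The two scales nearly agree at high scores and diverge at low ones, which matters when a mapper applies a quality cutoff.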

14
Bioinformatics Challenges
  • Rapid mapping of these short sequence reads to
    the reference genome
  • Visualize mapping results
  • Thousands of enriched regions
  • Peak analysis
  • Peak detection
  • Finding exact binding sites
  • Compare results of different experiments
  • Normalization
  • Statistical tests

15
Mapping of Short Oligonucleotides to the
Reference Genome
  • Mapping Methods
  • Need to allow mismatches and gaps
  • SNP locations
  • Sequencing errors
  • Reading errors
  • Indexing and hashing
  • genome
  • oligonucleotide reads
  • Use of quality scores
  • Use of SNP knowledge
  • Performance
  • Partitioning the genome or sequence reads

16
Mapping Methods: Indexing the Genome
  • Fast sequence similarity search algorithms (like
    BLAST)
  • Not specifically designed for mapping millions of
    query sequences
  • Take a very long time
  • e.g. 2 days to map half a million sequences to a
    70 MB reference genome (using BLAST)
  • Indexing the genome is memory expensive

18
SOAP (Li et al, 2008)
  • Both reads and reference genome are converted to
    numeric data type using 2-bits-per-base coding
  • Load reference genome into memory
  • For human genome, 14GB RAM required for storing
    reference sequences and index tables
  • 300 (gapped) to 1200 (ungapped) times faster than
    BLAST
  • 2 mismatches or 1-3 bp continuous gap
  • Errors accumulate during the sequencing process
  • Much higher number of sequencing errors at the
    3'-end (sometimes making the reads unalignable to
    the reference genome)
  • Iteratively trim several base pairs at the 3'-end
    and redo the alignment
  • Improve sensitivity
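
The iterative 3'-end trimming described above can be sketched as a simple retry loop. This is a toy illustration only: exact substring search stands in for a real aligner, and the function name, `min_len`, and `step` are assumptions rather than SOAP's actual parameters.

```python
def align_with_trimming(read, reference, min_len=20, step=2):
    """Try the full read first; on failure, trim `step` bases off the 3' end and retry.

    Exact substring search (str.find) stands in for a real aligner here.
    Returns (position, aligned_length), or None if nothing aligns.
    """
    length = len(read)
    while length >= min_len:
        pos = reference.find(read[:length])
        if pos != -1:
            return pos, length
        length -= step
    return None


reference = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGC"
read = "GCCGATAGCTAGGCTAACGAA"  # the last two bases are 3'-end errors
print(align_with_trimming(read, reference, min_len=15))
```

The full 21-base read does not align, but after one trimming round the remaining 19 bases match, which is the sensitivity gain the slide refers to.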

19
Mapping Methods: Indexing the Oligonucleotide
Reads
  • ELAND (Cox, unpublished)
  • Efficient Large-Scale Alignment of Nucleotide
    Databases (Solexa Ltd.)
  • SeqMap (Jiang, 2008)
  • Mapping massive amount of oligonucleotides to
    the genome
  • RMAP (Smith, 2008)
  • Using quality scores and longer reads improves
    accuracy of Solexa read mapping
  • MAQ (Li, 2008)
  • Mapping short DNA sequencing reads and calling
    variants using mapping quality scores

20
Mapping Algorithm (2 mismatches)
  • Partition reads into 4 seeds A,B,C,D
  • At least 2 seeds must map with no mismatches
  • Scan genome to identify locations where the seeds
    match exactly
  • 6 possible combinations of the seeds to search
  • AB, CD, AC, BD, AD, BC
  • 6 scans to find all candidates
  • Do approximate matching around the
    exactly-matching seeds.
  • Determine all targets for the reads
  • Ins/del can be incorporated
  • The reads are indexed and hashed before scanning
    genome
  • Bit operations are used to accelerate mapping
  • Each nt encoded into 2-bits
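
Two ideas from this slide can be sketched in a few lines: the pigeonhole argument (with 4 seeds and at most 2 mismatches, at least 2 seeds are error-free, giving 6 seed pairs to scan) and counting mismatches with bit operations on 2-bit-encoded sequences. All names below are illustrative assumptions and do not come from ELAND, SeqMap, or any other tool.

```python
from itertools import combinations

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per nucleotide."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(a, b, length):
    """Count differing bases between two encoded sequences of `length` >= 1 bases."""
    diff = a ^ b
    # Collapse each 2-bit group to one flag bit: set if either bit differs.
    mask = int("01" * length, 2)
    flags = (diff | (diff >> 1)) & mask
    return bin(flags).count("1")


def split_into_seeds(read, n_seeds=4):
    """Split a read into n_seeds contiguous pieces of near-equal length."""
    size, extra = divmod(len(read), n_seeds)
    seeds, start = [], 0
    for i in range(n_seeds):
        end = start + size + (1 if i < extra else 0)
        seeds.append(read[start:end])
        start = end
    return seeds


# The six seed pairs that must each be scanned for exact matches.
seed_pairs = ["".join(p) for p in combinations("ABCD", 2)]
print(seed_pairs)
print(split_into_seeds("ACGTACGTACGTACGTACGTACGTACGTACGT"))
print(count_mismatches(encode("ACGTACGT"), encode("ACGAACTT"), 8))
```

A single XOR compares all bases at once, which is why the 2-bit encoding speeds up the approximate-matching step around each exact seed hit.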

21
ELAND (Cox, unpublished)
  • Commercial sequence mapping program that comes
    with the Solexa machine
  • Allow at most 2 mismatches
  • Map sequences up to 32 nt in length
  • All sequences have to be same length

24
RMAP (Smith et al, 2008)
  • Improve mapping accuracy
  • Possible sequencing errors at 3'-ends of longer
    reads
  • Base-call quality scores
  • Use of base-call quality scores
  • Quality cutoff
  • High quality positions are checked for mismatches
  • Low quality positions always induce a match
  • Quality control step eliminates reads with too
    many low quality positions
  • Allow any number of mismatches
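
The quality-cutoff idea can be sketched as follows: positions at or above the cutoff are compared against the reference, while positions below it are treated as matching anything. This is an illustration of the rule as stated on the slide, not RMAP's implementation; the function names and the default `cutoff` and `max_low` values are assumptions.

```python
def quality_aware_mismatches(read, quals, target, cutoff=20):
    """Count mismatches only at positions whose quality is >= cutoff.

    Positions below the cutoff always count as a match.
    """
    return sum(
        1
        for base, q, ref in zip(read, quals, target)
        if q >= cutoff and base != ref
    )


def passes_quality_control(quals, cutoff=20, max_low=5):
    """Reject reads that have too many low-quality positions."""
    return sum(q < cutoff for q in quals) <= max_low


read, target = "ACGTAC", "ACGAAC"          # differ at position 3 only
print(quality_aware_mismatches(read, [30, 30, 30, 5, 30, 30], target))
print(quality_aware_mismatches(read, [30] * 6, target))
print(passes_quality_control([5] * 6))
```

Because low-quality positions never count against a read, the quality-control filter is what keeps reads made mostly of wildcards from mapping everywhere.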

27
Visualization
  • BED files are built to summarize mapping results
  • BED files can be easily visualized in Genome
    Browser
  • http://genome.ucsc.edu
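
As a minimal sketch of how mapping results become BED records: each mapped read is written as one tab-separated line with chromosome, 0-based start, end-exclusive end, name, score, and strand. The helper name and the example hits below are made up for illustration.

```python
def to_bed_line(chrom, start, read_len, name, strand, score=0):
    """Format one mapped read as a 6-column BED line.

    BED coordinates are 0-based, and the end position is exclusive.
    """
    end = start + read_len
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


# Hypothetical mapping results: (chrom, start, read length, read name, strand)
hits = [
    ("chr1", 1000, 35, "read1", "+"),
    ("chr2", 5000, 35, "read2", "-"),
]
bed_text = "\n".join(to_bed_line(*hit) for hit in hits)
print(bed_text)
```

Text in this form can be saved to a file and uploaded to the browser as a custom track.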

28
Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657
(2007)
29
Visualization: Custom
300 kb region from mouse ES cells
Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)
30
Screen shot for ZNF263 peaks
Frietze et al JBC 2010
31
ChIP-seq peak analysis programs
  • SISSRs (Site Identification from Short Sequence
    Reads) Jothi et al. NAR, 2008.
  • MACS (Model-based Analysis of ChIP-Seq) Zhang et
    al. Genome Biology, 2008.
  • QuEST (Genome-wide analysis of transcription
    factor binding sites based on ChIP-seq data)
    Valouev, A. et al. Nature Methods, 2008.
  • PeakSeq (PeakSeq enables systematic scoring of
    ChIP-seq experiments relative to controls)
    Rozowsky, J. et al. Nature Biotech. 2009.
  • FindPeaks (FindPeaks 3.1: a tool for identifying
    areas of enrichment from massively parallel
    short-read sequencing technology) Fejes, A.P.
    et al. Bioinformatics, 2008.
  • HPeak (An HMM-based algorithm for defining
    read-enriched regions from massive parallel
    sequencing data) Xu et al. Bioinformatics, 2008.

32
MBD-seq (MIRA-seq)
  • Methyl-CpG binding domain (MBD)-based capture
    (MBDCap) technology is used to capture methylation
    sites. Double-stranded methylated DNA fragments
    can be detected. It is sensitive to different
    methylation densities.
  • Genome-wide sequencing technology is used to get
    the sequence of each short fragment.
  • The sequenced reads are mapped to the human genome
    to find their locations.

33
Application on MBD-seq data (MCF7)
Lan et al Unpublished
34
BS-seq
  • BS-seq: genomic DNA is treated with sodium
    bisulphite (BS) to convert cytosine, but not
    methylcytosine, to uracil, followed by
    high-throughput sequencing.
  • Truly single-base resolution
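
The conversion chemistry can be illustrated in silico. In the sketch below, unmethylated C is written as T, because uracil is read as thymine after amplification and sequencing, while methylated C stays C; comparing the read with the reference then recovers methylation at single-base resolution. This is a one-strand toy model with made-up sequences and hypothetical function names, not a real BS-seq analysis pipeline.

```python
def bisulfite_convert(seq, methylated_positions=()):
    """Simulate bisulfite treatment of one strand.

    Unmethylated C becomes U, which is read as T after amplification;
    methylated C is protected and stays C.
    """
    methylated = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted_read):
    """Return positions where a reference C still reads as C (protected)."""
    return [
        i
        for i, (ref, base) in enumerate(zip(reference, converted_read))
        if ref == "C" and base == "C"
    ]


reference = "ACGTCCGA"
converted = bisulfite_convert(reference, methylated_positions=[1, 5])
print(converted)
print(call_methylation(reference, converted))
```

A real mapper must also account for the reduced sequence complexity after conversion, since most C positions in the reads now look like T.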

35
RNA-seq
  • RNA-Seq is a new approach to transcriptome
    profiling that uses deep-sequencing technologies.
  • Studies using this method have already altered
    our view of the extent and complexity of
    eukaryotic transcriptomes. RNA-Seq also provides
    a far more precise measurement of levels of
    transcripts and their isoforms than other methods.

36
RNA-seq protocol
37
The advantages of RNA-seq
  • Single base resolution
  • High throughput
  • Low background noise
  • Ability to distinguish different isoforms and
    allelic expression
  • Relatively low cost