Next Generation Sequencing - PowerPoint PPT Presentation

About This Presentation

Title:

Next Generation Sequencing

Description:

BS-seq: genomic DNA is treated with sodium bisulphite (BS) to convert cytosine, but not methylcytosine, to uracil, and subsequent high-throughput ...

Slides: 38

Provided by: Gulci6

Transcript and Presenter's Notes

1
Next Generation Sequencing
2
Sequencing techniques

ChIP-seq
MBD-seq (MIRA-seq)
BS-seq
RNA-seq
miRNA-seq

3
ChIP-seq

ChIP-Seq is a new frontier technology for analyzing
in vivo protein-DNA interactions.
ChIP-Seq
Combines chromatin immunoprecipitation
(ChIP) with ultra-high-throughput massively
parallel sequencing
Allows mapping of protein-DNA interactions in vivo
on a genome scale

4
Workflow of ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
5
(No Transcript)
6
(No Transcript)
7
The advantages of ChIP-seq

Current microarray and ChIP-chip designs require
prior knowledge of the sequence of interest, such as
a promoter, enhancer, or RNA-coding domain.
Lower cost
Higher resolution
Higher accuracy
Alterations in transcription-factor binding in
response to environmental stimuli can be
evaluated for the entire genome in a single
experiment.

8
Sequencers

Solexa (Illumina)
1 GB of sequences in a single run
35 bases in length
454 Life Sciences (Roche Diagnostics)
25-50 MB of sequences in a single run
Up to 500 bases in length
SOLiD (Applied Biosystems)
6 GB of sequences in a single run
35 bases in length

9
Illumina Genome Analysis System
10
Sequencing
11
Sequencer Output
12
Sequence Files

10-40 million reads per lane
500 MB files

13
Quality Score Files

Quality scores describe the confidence of bases
in each read
Solexa pipeline assigns a quality score to the
four possible nucleotides for each sequenced base
9 million sequences (500 MB file) → 6.5 GB quality
score file
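
As a rough illustration of what a quality score encodes, the sketch below converts scores to error probabilities. It assumes the standard Phred definition Q = -10 log10(p) and the early Solexa log-odds definition Q = -10 log10(p / (1 - p)); neither formula appears on the slide, and the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Error probability under the Phred definition Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q):
    """Error probability under the Solexa log-odds definition
    Q = -10 * log10(p / (1 - p))."""
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


# Compare the two scales at a few example scores.
for q in (10, 20, 30):
    print(q, phred_to_error_prob(q), solexa_to_error_prob(q))
```

The two scales nearly coincide for high scores and diverge for low ones, which matters when a mapper applies a quality cutoff.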

14
Bioinformatics Challenges

Rapid mapping of these short sequence reads to
the reference genome
Visualize mapping results
Thousands of enriched regions
Peak analysis
Peak detection
Finding exact binding sites
Compare results of different experiments
Normalization
Statistical tests

15
Mapping of Short Oligonucleotides to the
Reference Genome

Mapping Methods
Need to allow mismatches and gaps
SNP locations
Sequencing errors
Reading errors
Indexing and hashing
genome
oligonucleotide reads
Use of quality scores
Use of SNP knowledge
Performance
Partitioning the genome or sequence reads

16
Mapping Methods: Indexing the Genome

Fast sequence similarity search algorithms (like
BLAST)
Not specifically designed for mapping millions of
query sequences
Take a very long time
e.g. 2 days to map half a million sequences to a 70 MB
reference genome (using BLAST)
Indexing the genome is memory-expensive

17
(No Transcript)
18
SOAP (Li et al, 2008)

Both reads and reference genome are converted to
numeric data type using 2-bits-per-base coding
Load reference genome into memory
For the human genome, 14 GB of RAM is required to store
the reference sequences and index tables
300 (gapped) to 1200 (ungapped) times faster than
BLAST
Allows up to 2 mismatches or one continuous gap of 1-3 bp
Errors accumulate during the sequencing process
Much higher number of sequencing errors at the
3'-end (sometimes making the reads unalignable to
the reference genome)
Iteratively trims several base pairs at the 3'-end
and redoes the alignment
Improve sensitivity

19
Mapping Methods: Indexing the Oligonucleotide
Reads

ELAND (Cox, unpublished)
Efficient Large-Scale Alignment of Nucleotide
Databases (Solexa Ltd.)
SeqMap (Jiang, 2008)
Mapping massive amount of oligonucleotides to
the genome
RMAP (Smith, 2008)
Using quality scores and longer reads improves
accuracy of Solexa read mapping
MAQ (Li, 2008)
Mapping short DNA sequencing reads and calling
variants using mapping quality scores

20
Mapping Algorithm (2 mismatches)

Partition reads into 4 seeds A,B,C,D
At least 2 seeds must map with no mismatches
Scan genome to identify locations where the seeds
match exactly
6 possible combinations of the seeds to search
AB, CD, AC, BD, AD, BC
6 scans to find all candidates
Do approximate matching around the
exactly-matching seeds.
Determine all targets for the reads
Insertions/deletions can be incorporated
The reads are indexed and hashed before scanning
the genome
Bit operations are used to accelerate mapping
Each nucleotide is encoded in 2 bits
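
The seed strategy above rests on a pigeonhole argument: if a read has at most 2 mismatches, at least 2 of its 4 seeds must match exactly. A minimal sketch of the six-scan procedure follows. It uses plain strings and dictionaries rather than 2-bit encoding, ignores insertions/deletions, and is not the code of any particular mapper; all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_bounds(read_len, n_seeds=4):
    """Split positions 0..read_len into n_seeds contiguous (start, end) pieces."""
    step = read_len // n_seeds
    bounds = [(i * step, (i + 1) * step) for i in range(n_seeds)]
    bounds[-1] = (bounds[-1][0], read_len)  # last seed absorbs any remainder
    return bounds


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, genome, max_mismatches=2):
    """Return {read: sorted genome positions with <= max_mismatches}."""
    read_len = len(reads[0])  # all reads are assumed to have the same length
    bounds = seed_bounds(read_len)
    hits = defaultdict(set)
    # One genome scan per seed pair: AB, AC, AD, BC, BD, CD.
    for i, j in combinations(range(4), 2):
        (s1, e1), (s2, e2) = bounds[i], bounds[j]
        # Index the reads by the content of the two chosen seeds.
        index = defaultdict(list)
        for read in reads:
            index[(read[s1:e1], read[s2:e2])].append(read)
        # Scan the genome; only windows whose two seeds match exactly
        # are checked with the full approximate comparison.
        for pos in range(len(genome) - read_len + 1):
            window = genome[pos:pos + read_len]
            for read in index.get((window[s1:e1], window[s2:e2]), ()):
                if hamming(read, window) <= max_mismatches:
                    hits[read].add(pos)
    return {read: sorted(p) for read, p in hits.items()}
```

Indexing the reads rather than the genome keeps memory proportional to the read set, at the cost of six passes over the genome.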

21
ELAND (Cox, unpublished)

Commercial sequence mapping program that comes with
the Solexa machine
Allows at most 2 mismatches
Maps sequences up to 32 nt in length
All sequences have to be the same length

22
(No Transcript)
23
(No Transcript)
24
RMAP (Smith et al, 2008)

Improve mapping accuracy
Possible sequencing errors at 3'-ends of longer
reads
Base-call quality scores
Use of base-call quality scores
Quality cutoff
High quality positions are checked for mismatches
Low quality positions always induce a match
Quality control step eliminates reads with too
many low quality positions
Allow any number of mismatches
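
The quality-cutoff idea on this slide can be sketched as below: positions under the cutoff never count as mismatches, and reads with too many such positions are discarded beforehand. This is a hedged illustration, not RMAP's code; the default `cutoff` and `max_low` values and the function names are assumptions.

```python
def quality_aware_mismatches(read, quals, target, cutoff=20):
    """Count mismatches only at positions with quality >= cutoff.

    Low-quality positions are treated as matching whatever the target has.
    """
    return sum(
        1
        for base, q, ref in zip(read, quals, target)
        if q >= cutoff and base != ref
    )


def passes_quality_control(quals, cutoff=20, max_low=4):
    """Reject reads that have too many low-quality positions."""
    return sum(q < cutoff for q in quals) <= max_low
```

For example, a read whose only disagreements with the reference fall on low-quality bases is scored as a perfect match.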

25
(No Transcript)
26
(No Transcript)
27
Visualization

BED files are built to summarize mapping results
BED files can be easily visualized in the UCSC Genome
Browser
http://genome.ucsc.edu
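
A minimal sketch of turning mapping results into BED text for upload as a custom track. BED uses 0-based start coordinates and exclusive end coordinates. The input tuple layout, the function name, and the track name are assumptions made for illustration.

```python
def hits_to_bed(hits, track_name="mapped_reads"):
    """Format (chrom, start, length, name, strand) hits as BED text.

    start is assumed to be 0-based; BED end coordinates are exclusive.
    """
    lines = [f'track name="{track_name}"']
    for chrom, start, length, name, strand in hits:
        end = start + length
        # Columns: chrom, chromStart, chromEnd, name, score, strand.
        lines.append(f"{chrom}\t{start}\t{end}\t{name}\t0\t{strand}")
    return "\n".join(lines) + "\n"


print(hits_to_bed([("chr1", 100, 35, "read1", "+")]), end="")
```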

28
Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657
(2007)
29
Visualization: Custom
300 kb region from mouse ES cells
Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)
30
Screen shot for ZNF263 peaks
Frietze et al JBC 2010
31
ChIP-seq peak analysis programs

SISSRs (Site Identification from Short Sequence
Reads) Jothi et al. NAR, 2008.
MACS (Model-based Analysis of ChIP-Seq) Zhang et
al, Genome Biology, 2008.
QuEST (Genome-wide analysis of transcription
factor binding sites based on ChIP-seq data)
Valouev, A. et al. Nature Methods, 2008.
PeakSeq (PeakSeq enables systematic scoring of
ChIP-seq experiments relative to controls)
Rozowsky, J. et al. Nature Biotech. 2009.
FindPeaks (FindPeaks 3.1: a tool for identifying
areas of enrichment from massively parallel
short-read sequencing technology) Fejes, A.P.
et al. Bioinformatics, 2008.
Hpeak (An HMM-based algorithm for defining
read-enriched regions from massive parallel
sequencing data) Xu et al, Bioinformatics, 2008.

32
MBD-seq (MIRA-seq)

The methyl-CpG binding domain (MBD)-based capture (MBDCap)
technology is used to capture methylated sites.
Double-stranded methylated DNA fragments can be
detected. It is sensitive to different
methylation densities.
Genome-wide sequencing is used to obtain
the sequence of each short fragment.
The sequenced reads are mapped to the human genome to
find their locations.

33
Application on MBD-seq data (MCF7)
Lan et al Unpublished
34
BS-seq

BS-seq: genomic DNA is treated with sodium
bisulphite (BS) to convert cytosine, but not
methylcytosine, to uracil, followed by
high-throughput sequencing.
Truly single-base resolution
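
Because converted uracil is read as thymine after amplification, methylation can be called base by base by comparing an aligned read to the reference. The sketch below assumes a forward-strand read that is already aligned without gaps at a known offset; the function name and the output labels are hypothetical.

```python
def call_methylation(reference, read, offset):
    """Classify each reference cytosine covered by a forward-strand read.

    Returns a list of (genome_position, state), where state is
    "methylated" (C retained) or "unmethylated" (C read as T).
    """
    calls = []
    for i, base in enumerate(read):
        pos = offset + i
        if reference[pos] != "C":
            continue  # only reference cytosines are informative
        if base == "C":
            calls.append((pos, "methylated"))
        elif base == "T":
            calls.append((pos, "unmethylated"))
        # Any other base is a mismatch or error and is left uncalled.
    return calls


print(call_methylation("ACGTCCGA", "TCTG", 3))
```

A real pipeline would also handle reverse-strand reads (where G-to-A changes are informative) and aggregate calls across many reads per site.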

35
RNA-seq

RNA-Seq is a new approach to transcriptome
profiling that uses deep-sequencing technologies.
Studies using this method have already altered
our view of the extent and complexity of
eukaryotic transcriptomes. RNA-Seq also provides
a far more precise measurement of levels of
transcripts and their isoforms than other methods.

36
RNA-seq protocol
37
The advantages of RNA-seq