1
ChIP Sequencing BMI/IBGP 730
  • Victor Jin, Ph.D.
  • (Slides from Dr. H. Gulcin Ozer)
  • Department of Biomedical Informatics

2
What is ChIP-Sequencing?
  • ChIP-Sequencing is a new frontier technology for
    analyzing protein interactions with DNA.
  • ChIP-Seq
  • Combination of chromatin immunoprecipitation
    (ChIP) with ultra-high-throughput massively
    parallel sequencing
  • Allows mapping of protein-DNA interactions in vivo
    on a genome scale

3
Workflow of ChIP-Seq
Mardis, E.R. Nat. Methods 4, 613-614 (2007)
4
Workflow of ChIP-Seq
6
Johnson et al, 2007
  • ChIP-Seq technology is used to study in vivo
    binding of the neuron-restrictive silencer factor
    (NRSF)
  • Results are compared to known binding sites
  • ChIP-Seq signals agree strongly with existing
    knowledge
  • Sharp resolution of binding position
  • New noncanonical NRSF binding motifs are
    identified

8
Robertson et al, 2007
  • ChIP-Seq technology is used to study genome-wide
    profiles of STAT1 DNA association
  • STAT1 targets in interferon-γ-stimulated and
    unstimulated human HeLa S3 cells are compared
  • The performance of ChIP-Seq is compared to the
    alternative protein-DNA interaction methods of
    ChIP-PCR and ChIP-chip.
  • 41,582 and 11,004 putative STAT1 binding regions
    are identified in stimulated and unstimulated
    cells, respectively.

9
Why ChIP-Sequencing?
  • Current microarray and ChIP-chip designs require
    prior knowledge of the sequence of interest, such
    as a promoter, enhancer, or RNA-coding domain.
  • Lower cost
  • Less work
  • Higher accuracy
  • Alterations in transcription-factor binding in
    response to environmental stimuli can be
    evaluated for the entire genome in a single
    experiment.

11
Sequencers
  • Solexa (Illumina)
  • 1 GB of sequences in a single run
  • 35 bases in length
  • 454 Life Sciences (Roche Diagnostics)
  • 25-50 MB of sequences in a single run
  • Up to 500 bases in length
  • SOLiD (Applied Biosystems)
  • 6 GB of sequences in a single run
  • 35 bases in length

12
Illumina Genome Analysis System
13
Sequencing
14
Sequencer Output
15
Sequence Files
  • 10 million sequences per lane
  • 500 MB files

16
Quality Score Files
  • Quality scores describe the confidence of bases
    in each read
  • Solexa pipeline assigns a quality score to the
    four possible nucleotides for each sequenced base
  • 9 million sequences (500 MB file) → 6.5 GB quality
    score file

17
Bioinformatics Challenges
  • Rapid mapping of these short sequence reads to
    the reference genome
  • Visualize mapping results
  • Thousands of enriched regions
  • Peak analysis
  • Peak detection
  • Finding exact binding sites
  • Compare results of different experiments
  • Normalization
  • Statistical tests

18
Mapping of Short Oligonucleotides to the
Reference Genome
  • Mapping Methods
  • Need to allow mismatches and gaps
  • SNP locations
  • Sequencing errors
  • Reading errors
  • Indexing and hashing
  • genome
  • oligonucleotide reads
  • Use of quality scores
  • Use of SNP knowledge
  • Performance
  • Partitioning the genome or sequence reads

19
Mapping Methods: Indexing the Genome
  • Fast sequence similarity search algorithms (like
    BLAST)
  • Not specifically designed for mapping millions of
    query sequences
  • Take a very long time
  • e.g. 2 days to map half a million sequences to a
    70 MB reference genome (using BLAST)
  • Indexing the genome is memory-expensive

21
SOAP (Li et al, 2008)
  • Both reads and reference genome are converted to
    numeric data type using 2-bits-per-base coding
  • Load reference genome into memory
  • For the human genome, 14 GB of RAM is required to
    store the reference sequences and index tables
  • 300 (gapped) to 1,200 (ungapped) times faster than
    BLAST
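
The 2-bits-per-base coding mentioned above can be sketched as follows. This is a minimal illustration, not SOAP's actual implementation: the bit assignment (A=00, C=01, G=10, T=11) and the function names are assumptions chosen for the example. Once two sequences are packed into integers, an XOR exposes the differing positions, which is one reason this encoding makes comparison fast.

```python
# Hypothetical bit assignment; the slides do not say which codes SOAP uses.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def encode_2bit(seq: str) -> int:
    """Pack a DNA string into one integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_2bit(value: int, length: int) -> str:
    """Unpack an integer produced by encode_2bit back into a DNA string."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(BITS_TO_BASE[(value >> shift) & 0b11])
    return "".join(bases)


def count_mismatches(a: int, b: int, length: int) -> int:
    """Count differing bases between two encoded sequences of the same length."""
    diff = a ^ b
    # Keep one flag bit per base: set if either bit of the 2-bit pair differs.
    mask = int("01" * length, 2)
    flags = (diff | (diff >> 1)) & mask
    return bin(flags).count("1")
```

For example, `count_mismatches(encode_2bit("GATT"), encode_2bit("CATG"), 4)` compares two 4-base sequences with two integer operations plus a bit count, instead of a character-by-character loop.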

22
SOAP (Li et al, 2008)
  • Allows 2 mismatches or a 1-3 bp continuous gap
  • Errors accumulate during the sequencing process
  • Much higher number of sequencing errors at the
    3'-end (sometimes making the reads unalignable to
    the reference genome)
  • Iteratively trim several base pairs at the 3'-end
    and redo the alignment
  • Improves sensitivity
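
The iterative 3'-end trimming idea can be sketched as below. This is an illustration under stated assumptions, not SOAP's code: the brute-force scan in `find_hits`, the parameter names `trim_step` and `min_length`, and their default values are all hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(read: str, reference: str, max_mismatches: int = 2) -> list:
    """Brute-force scan: every reference offset where the read fits within the mismatch limit."""
    n = len(read)
    return [
        i
        for i in range(len(reference) - n + 1)
        if hamming(read, reference[i:i + n]) <= max_mismatches
    ]


def align_with_trimming(read, reference, max_mismatches=2, trim_step=2, min_length=12):
    """Try the full read first; if it does not align, trim the 3'-end and retry.

    Returns (aligned_portion, hit_positions), or (None, []) if nothing aligns
    before the read becomes shorter than min_length.
    """
    current = read
    while len(current) >= min_length:
        hits = find_hits(current, reference, max_mismatches)
        if hits:
            return current, hits
        current = current[:-trim_step]  # drop error-prone bases at the 3'-end
    return None, []
```

A read whose last few bases are wrong fails at full length but aligns once the damaged tail is removed, which is how trimming recovers sensitivity. Shorter reads are more likely to map to multiple places, so `min_length` guards against trimming too far.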

23
Mapping Methods: Indexing the Oligonucleotide
Reads
  • ELAND (Cox, unpublished)
  • Efficient Large-Scale Alignment of Nucleotide
    Databases (Solexa Ltd.)
  • SeqMap (Jiang, 2008)
  • Mapping massive amount of oligonucleotides to
    the genome
  • RMAP (Smith, 2008)
  • Using quality scores and longer reads improves
    accuracy of Solexa read mapping
  • MAQ (Li, 2008)
  • Mapping short DNA sequencing reads and calling
    variants using mapping quality scores

24
Mapping Algorithm (2 mismatches)
GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG
25
Mapping Algorithm (2 mismatches)
  • Partition each read into 4 seeds: A, B, C, D
  • With at most 2 mismatches, at least 2 seeds must
    map with no mismatches
  • Scan genome to identify locations where the seeds
    match exactly
  • 6 possible combinations of the seeds to search
  • AB, CD, AC, BD, AD, BC
  • 6 scans to find all candidates
  • Do approximate matching around the
    exactly-matching seeds.
  • Determine all targets for the reads
  • Insertions/deletions can be incorporated
  • The reads are indexed and hashed before scanning
    the genome
  • Bit operations are used to accelerate mapping
  • Each nucleotide is encoded in 2 bits
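
The seed-based strategy above can be sketched as follows. This is a simplified illustration, not the code of ELAND or any other listed tool. It assumes all reads have the same length, uses plain strings instead of 2-bit encoding, ignores insertions/deletions, and checks all six seed pairs in a single pass over the genome rather than in six separate scans. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def seed_bounds(read_len, n_seeds=4):
    """Split positions 0..read_len into n_seeds contiguous seeds (last seed takes any remainder)."""
    size = read_len // n_seeds
    bounds = [(k * size, (k + 1) * size) for k in range(n_seeds)]
    bounds[-1] = (bounds[-1][0], read_len)
    return bounds


def build_read_index(reads, bounds):
    """Hash every read under each of the 6 seed pairs (AB, AC, AD, BC, BD, CD)."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for i, j in combinations(range(len(bounds)), 2):
            (s1, e1), (s2, e2) = bounds[i], bounds[j]
            index[(i, j, read[s1:e1], read[s2:e2])].append(read_id)
    return index


def map_reads(reads, genome, max_mismatches=2):
    """Return {read_id: sorted genome positions} for reads matching within max_mismatches."""
    if not reads:
        return {}
    read_len = len(reads[0])  # assumption: every read has this length
    bounds = seed_bounds(read_len)
    index = build_read_index(reads, bounds)
    hits = defaultdict(set)
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for i, j in combinations(range(len(bounds)), 2):
            (s1, e1), (s2, e2) = bounds[i], bounds[j]
            # Exact match of two seeds nominates a candidate read ...
            for read_id in index.get((i, j, window[s1:e1], window[s2:e2]), ()):
                # ... which is then verified by approximate matching of the whole read.
                mismatches = sum(a != b for a, b in zip(reads[read_id], window))
                if mismatches <= max_mismatches:
                    hits[read_id].add(pos)
    return {read_id: sorted(positions) for read_id, positions in hits.items()}
```

The guarantee rests on a counting argument: two mismatches can spoil at most two of the four seeds, so at least one of the six seed pairs is always mismatch-free for any true hit.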

26
ELAND (Cox, unpublished)
  • Commercial sequence mapping program that comes
    with the Solexa machine
  • Allows at most 2 mismatches
  • Maps sequences up to 32 nt in length
  • All sequences have to be the same length

29
RMAP (Smith et al, 2008)
  • Improve mapping accuracy
  • Possible sequencing errors at 3'-ends of longer
    reads
  • Base-call quality scores
  • Use of base-call quality scores
  • Quality cutoff
  • High-quality positions are checked for mismatches
  • Low-quality positions are always treated as a
    match
  • Quality control step eliminates reads with too
    many low-quality positions
  • Allows any number of mismatches
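
The quality-aware matching rule described above can be sketched as below. This is an illustration of the idea only, not RMAP's implementation: the function names, the cutoff of 20, and the limit of 5 low-quality positions are hypothetical values chosen for the example.

```python
def quality_aware_mismatches(read, quals, target, quality_cutoff=20):
    """Count mismatches at high-quality positions only.

    Positions whose quality is below the cutoff are treated as matching,
    whatever base was called there.
    """
    return sum(
        1
        for base, quality, ref_base in zip(read, quals, target)
        if quality >= quality_cutoff and base != ref_base
    )


def passes_quality_control(quals, quality_cutoff=20, max_low_quality=5):
    """Reject reads that have too many low-quality positions."""
    low_quality = sum(1 for quality in quals if quality < quality_cutoff)
    return low_quality <= max_low_quality
```

The quality control step matters because a read with many low-quality positions would match almost anywhere under this rule, so such reads are discarded before mapping.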

32
Bioinformatics Challenges
  • Rapid mapping of these short sequence reads to
    the reference genome
  • Visualize mapping results
  • Thousands of enriched regions
  • Peak analysis
  • Peak detection
  • Finding exact binding sites
  • Compare results of different experiments
  • Normalization
  • Statistical tests

33
Visualization
  • BED files are built to summarize mapping results
  • BED files can be easily visualized in the UCSC
    Genome Browser
  • http://genome.ucsc.edu
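
A minimal sketch of turning mapping results into BED text is shown below. BED is tab-separated, with a 0-based start and an exclusive end; the six columns used here are chrom, start, end, name, score, and strand. The input tuple layout and the function names are assumptions made for the example.

```python
def to_bed_line(chrom, start, read_len, name, strand, score=0):
    """Format one mapped read as a 6-column BED line (0-based start, exclusive end)."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    end = start + read_len
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


def build_bed(mappings, track_name="ChIP-Seq reads"):
    """Build BED text from (chrom, start, read_len, name, strand) tuples.

    The leading 'track' line is what a genome browser uses to label a custom track.
    """
    lines = [f'track name="{track_name}"']
    for chrom, start, read_len, name, strand in sorted(mappings, key=lambda m: (m[0], m[1])):
        lines.append(to_bed_line(chrom, start, read_len, name, strand))
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file and uploaded as a custom track.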

34
Visualization: Genome Browser
Robertson, G. et al. Nat. Methods 4, 651-657
(2007)
35
Visualization: Custom
300 kb region from mouse ES cells
Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)
36
Visualization
Huang, 2008 (unpublished)
37
Huang, 2008 (unpublished)
38
Bioinformatics Challenges
  • Rapid mapping of these short sequence reads to
    the reference genome
  • Visualize mapping results
  • Thousands of enriched regions
  • Peak analysis
  • Peak detection
  • Finding exact binding sites
  • Compare results of different experiments
  • Normalization
  • Statistical tests

39
Peak Analysis
  • Peak Detection
  • ChIP-Peak Analysis Module (Swiss Institute of
    Bioinformatics)
  • ChIPSeq Peak Finder (Wold Lab, Caltech)
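
To make the idea of peak detection concrete, here is a deliberately simple sketch: count read start positions in fixed windows and report windows that reach a threshold, merging adjacent ones. This is not the algorithm of either tool listed above. The window size, the threshold, and the function name are hypothetical, and real peak finders also account for strand, fragment length, and background.

```python
from collections import Counter


def call_enriched_windows(read_starts, window=200, min_reads=5):
    """Return merged (start, end) regions whose fixed-size windows hold >= min_reads reads."""
    counts = Counter(pos // window for pos in read_starts)
    enriched = sorted(w for w, n in counts.items() if n >= min_reads)
    regions = []
    for w in enriched:
        start, end = w * window, (w + 1) * window
        if regions and regions[-1][1] == start:
            # Adjacent enriched windows are merged into one region.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions
```

Read positions are assumed to come from a single chromosome; in practice the function would be run once per chromosome.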

42
Peak Analysis
  • Finding Exact Binding Site
  • Determining the exact binding sites from short
    reads generated from ChIP-Seq experiments
  • SISSRs (Site Identification from Short Sequence
    Reads) (Jothi 2008)
  • MACS (Model-based Analysis of ChIP-Seq) (Zhang et
    al, 2008)

43
Bioinformatics Challenges
  • Rapid mapping of these short sequence reads to
    the reference genome
  • Visualize mapping results
  • Thousands of enriched regions
  • Peak analysis
  • Peak detection
  • Finding exact binding sites
  • Compare results of different experiments
  • Normalization
  • Statistical tests

44
Compare Samples
Huang, 2008 (unpublished)
45
Compare Samples
  • Fold change
  • HPeak: An HMM-based algorithm for defining
    read-enriched regions from massively parallel
    sequencing data
  • Xu et al, 2008
  • Advanced statistics
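
A fold-change comparison between two samples needs normalization first, because the samples usually differ in sequencing depth. The sketch below scales each region's read count to reads per million mapped reads before taking the ratio. The function name, the per-million scaling, and the pseudocount are assumptions made for illustration, not a method taken from the slides.

```python
def fold_change(count_a, count_b, total_a, total_b, pseudocount=1.0):
    """Depth-normalized fold change of sample A over sample B for one region.

    count_*: reads in the region; total_*: total mapped reads in the sample.
    The pseudocount avoids division by zero when sample B has no reads in the region.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("total mapped reads must be positive")
    per_million_a = count_a * 1e6 / total_a
    per_million_b = count_b * 1e6 / total_b
    return (per_million_a + pseudocount) / (per_million_b + pseudocount)
```

With 200 reads out of 10 million in one sample and 50 reads out of 5 million in the other, the raw counts differ 4-fold but the normalized values differ only 2-fold, which is why normalization comes before any comparison.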

46
QUESTIONS?