10 Billion Piece Jigsaw Puzzles - PowerPoint PPT Presentation

Description:

10 Billion Piece Jigsaw Puzzles. John Cleary, Netvalue Ltd., Real Time Genomics. Need to be wary of false positives. Variation in numbers - more than expected by ...

1
10 Billion Piece Jigsaw Puzzles
  • John Cleary
  • Netvalue Ltd.
  • Real Time Genomics

2
(No Transcript)
3
100 billion
10 billion
billion
100 million
10 million
million
100 thousand
10 thousand
thousand
hundred
4
Genome Transcriptome Cancer
5
Genomes of
  • human
  • reference species: mouse, chimp, arabidopsis
  • agricultural species: cattle, sheep, pig, rice, wheat, grape
  • bacterial: disease, human ecosystem

6
Differences between
  • Individuals
  • Populations: disease and quantitative traits
  • Somatic and tumor genomes
  • Transcriptome of child and parents
  • Bacterial populations of individuals

7
Human Genome
3 billion Nucleotides
8
Shapes of the Jigsaw Pieces
Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100
9
Differences between genomes - SNPs
1 / 1,000 nt: 3,000,000 nt
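The total on this slide follows from the 3 billion nt genome size given earlier. A minimal sketch of that arithmetic; the function name is made up for illustration:

```python
GENOME_NT = 3_000_000_000  # human genome size from the "Human Genome" slide


def expected_differences(genome_nt: int, one_in: int) -> int:
    """Number of differing sites at a rate of one per `one_in` nucleotides."""
    return genome_nt // one_in


# SNP rate from this slide: 1 in 1,000 nt
print(expected_differences(GENOME_NT, 1_000))  # 3000000
```

The same function applies to any other per-nucleotide rate quoted in the talk.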
10
REF aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
[Figure: pileup of READ rows aligned under the REF row, with SIM and CALL rows marking variant positions.]
11
(No Transcript)
12
Differences between human genomes - MNPs
13
Differences between human genomes - indels
1 / 10,000 nt: 300,000
14
Differences between genomes - inserts
T T A G G A C C C A
Up to 1,000,000 nt; total 3,000,000 nt
15
Differences between genomes - structural variants
Tandem Repeat
Inversion
Copy
16
Solving the Jigsaw
  • Indexing
  • Alignment
  • SNP/MNP/Indel/SV calling

Mapping
17
Indexing
A C G T T A G T G A A G
A C G T T C G T G A A G
4.5 billion
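The slide does not say how the index is built. One common approach is a hash table from every k-letter word in the reference to its positions, so that a read can be placed by exact word hits even when it contains a mismatch. The sketch below assumes that approach. The toy reference, the choice of k = 5, and the function names are all made up for illustration; the read is the first sequence on the slide, and the reference embeds the second.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map each k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: dict, k: int) -> set:
    """Reference start positions implied by any exact k-mer hit in the read."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            hits.add(pos - offset)
    return hits


reference = "TTACGTTCGTGAAGCC"  # toy reference containing ACGTTCGTGAAG at position 2
read = "ACGTTAGTGAAG"           # differs from the reference at one position
index = build_index(reference, k=5)
print(candidate_positions(read, index, k=5))  # {2}
```

Words that span the mismatch find nothing, but the words on either side of it still vote for the same start position.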
18
Aligning
1.6 billion
19
Cutting Edge Run
  • Human genome (3 billion nt)
  • 1 billion reads of 100 nt: coverage of 30
  • Indexing and aligning in 27 minutes
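The figures on this slide can be checked with a few lines of arithmetic. A minimal sketch; the function names are illustrative. The raw ratio comes out slightly above the slide's round figure of 30.

```python
def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of reads covering each nucleotide of the genome."""
    return num_reads * read_len / genome_len


def reads_per_second(num_reads: int, minutes: float) -> float:
    """Throughput needed to process all reads in the given time."""
    return num_reads / (minutes * 60)


print(round(coverage(1_000_000_000, 100, 3_000_000_000), 1))  # 33.3
print(round(reads_per_second(1_000_000_000, 27)))
```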

20
i7 Quad Core
21
2 sockets x 4 cores x 2 hyperthreads = 16
48 GB RAM
10 computers
1 TB disk/genome: 500 GB, 200 GB, 200 GB, 0.3 GB
x thousands of genomes
22
Shapes of the Jigsaw Pieces
Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100
23
Paired End Reads
100 nt
100 - 1,000 nt
Index Align
Index Align
Match
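The "Match" step can be read as keeping only those placements of the two mates that sit a plausible distance apart. The sketch below assumes the 100 - 1,000 nt range is the gap between the end of the first mate and the start of the second; the slide does not state this, and the function and parameter names are hypothetical.

```python
def consistent_pairs(left_hits, right_hits, read_len=100, min_gap=100, max_gap=1000):
    """Pair up candidate positions whose gap between mates is in the expected range."""
    pairs = []
    for left in left_hits:
        for right in right_hits:
            gap = right - (left + read_len)  # distance from end of left mate to start of right mate
            if min_gap <= gap <= max_gap:
                pairs.append((left, right))
    return pairs


# Each mate is indexed and aligned on its own and may hit several places.
left_hits = [5_000, 90_000]
right_hits = [5_400, 250_000]
print(consistent_pairs(left_hits, right_hits))  # [(5000, 5400)]
```

A read that maps ambiguously on its own can often be placed uniquely once its mate is taken into account.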
24
Solving the Jigsaw without the picture
  • Indexing
  • Alignment

Assembly
25
Assembly
A C G T T C G T G A A G
T A G T G A A G A A T T
A C G T T ? G T G A A G A A T T
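The merge on this slide can be reproduced by sliding the second read along the first until the overlapping letters agree, then writing "?" wherever they still disagree. A minimal sketch, assuming equal-length reads and at most one disagreement in the overlap; the function names and thresholds are illustrative, not taken from the talk.

```python
def best_offset(a: str, b: str, min_overlap: int = 6, max_mismatches: int = 1):
    """Smallest offset into `a` (longest overlap) where `b` fits with few mismatches."""
    for offset in range(len(a) - min_overlap + 1):
        overlap = len(a) - offset
        mismatches = sum(x != y for x, y in zip(a[offset:], b[:overlap]))
        if mismatches <= max_mismatches:
            return offset
    return None


def merge_at(a: str, b: str, offset: int) -> str:
    """Merge `b` onto `a` starting at a[offset]; '?' marks disagreeing positions."""
    overlap = len(a) - offset
    consensus = "".join(x if x == y else "?" for x, y in zip(a[offset:], b[:overlap]))
    return a[:offset] + consensus + b[overlap:]


read1 = "ACGTTCGTGAAG"
read2 = "TAGTGAAGAATT"
offset = best_offset(read1, read2)
print(merge_at(read1, read2, offset))  # ACGTT?GTGAAGAATT
```

With only two reads the disputed position cannot be resolved; deeper coverage is what lets a consensus letter be chosen.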
26
SNP calling
15A 13C: AC heterozygous SNP
31A 42C: Throw it out
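The two sites on this slide can be told apart with a simple depth filter. The sketch below assumes the second site is rejected because its depth (73) is far above the coverage of 30 quoted for the run; the slide does not state the reason. The thresholds and the function name are hypothetical.

```python
def call_site(counts: dict, expected_coverage: float = 30,
              max_depth_factor: float = 2.0, min_allele_fraction: float = 0.25):
    """Return the called alleles at one site, or None if the site is thrown out."""
    depth = sum(counts.values())
    if depth == 0 or depth > max_depth_factor * expected_coverage:
        return None  # no data, or suspiciously deep (assumed false-positive signal)
    # Keep every base seen in at least min_allele_fraction of the reads.
    alleles = sorted(base for base, n in counts.items() if n / depth >= min_allele_fraction)
    return "".join(alleles)


print(call_site({"A": 15, "C": 13}))  # AC
print(call_site({"A": 31, "C": 42}))  # None
```

A real caller would weigh base qualities and mapping ambiguity as well; this only illustrates the counting argument.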
27
REF aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
[Figure: pileup of READ rows aligned under the REF row, with SIM and CALL rows marking variant positions.]
28
Comparing twins
3,000,000 SNPs. Do any of them differ between the twins?
15A 4C
3A 10C 3G
29
(No Transcript)
30
Gene
DNA
mRNA
protein
31
(No Transcript)
32
Cancer comparison
33
Copy Number Variants
  • Varying levels of extraction of reads across
    genome (use differences)
  • Locate boundaries (as accurately as possible)
  • Extract number of variants
  • Use SNPs
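The first three steps can be sketched with windowed read depth. The code below assumes read counts have already been tallied per fixed-size window for the sample and for a baseline with the normal two copies, and that every baseline window has non-zero depth. The data and names are made up, and the SNP-based step is not covered.

```python
def copy_number_segments(sample_depth, baseline_depth, normal_copies=2):
    """Return (start_window, end_window_exclusive, copies) for each run of equal copy number."""
    # Depth ratio against the baseline, scaled to the normal copy count and rounded.
    estimates = [round(normal_copies * s / b) for s, b in zip(sample_depth, baseline_depth)]
    segments = []
    start = 0
    for i in range(1, len(estimates) + 1):
        # Close the current segment at the end of the data or when the estimate changes.
        if i == len(estimates) or estimates[i] != estimates[start]:
            segments.append((start, i, estimates[start]))
            start = i
    return segments


sample = [30, 31, 29, 45, 46, 44, 30, 15, 14]
baseline = [30] * 9
print(copy_number_segments(sample, baseline))
# [(0, 3, 2), (3, 6, 3), (6, 7, 2), (7, 9, 1)]
```

Boundaries found this way are only as precise as the window size, which is why the slide stresses locating them as accurately as possible.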

34
(No Transcript)
35
(No Transcript)
36
Metagenomics, or what is living on you
  • Mapping reads back onto a database of known
    bacteria/viruses
  • Many are ambiguous
  • Many don't map at all
  • Estimate frequency of each species
  • Remove human contamination

37
TS1
  0.389  gi|29611500|ref|NC_004703.1   Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
  0.183  gi|187734516|ref|NC_010655.1  Akkermansia muciniphila ATCC BAA-835
  0.145  gi|150002608|ref|NC_009614.1  Bacteroides vulgatus ATCC 8482
  0.037  gi|119025018|ref|NC_008618.1  Bifidobacterium adolescentis ATCC 15703
TS4
  0.428  gi|29611500|ref|NC_004703.1   Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
  0.210  gi|150002608|ref|NC_009614.1  Bacteroides vulgatus ATCC 8482
  0.149  gi|60650141|ref|NC_006873.1   Bacteroides fragilis NCTC 9343 plasmid pBF9343
  0.037  gi|121999251|ref|NC_008790.1  Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
  0.036  gi|238922432|ref|NC_012781.1  Eubacterium rectale ATCC 33656
TS25
  0.752  gi|29611500|ref|NC_004703.1   Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
  0.073  gi|150002608|ref|NC_009614.1  Bacteroides vulgatus ATCC 8482
  0.041  gi|121999251|ref|NC_008790.1  Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
  0.020  gi|58036264|ref|NC_004307.2   Bifidobacterium longum NCC2705
  0.018  gi|189438863|ref|NC_010816.1  Bifidobacterium longum DJO10A
38
Metagenomics
  • Map reads to database
  • Estimate most likely frequencies: a hill climbing estimation problem
  • Can anything be done about unmapped reads?
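One standard way to climb the likelihood in this kind of problem is expectation-maximisation: each ambiguous read is shared among its candidate species in proportion to the current frequency estimates, and the frequencies are then re-estimated from those shares. The talk does not say which method was actually used, so the sketch below is an assumption. It also ignores genome length and mapping quality, and the input format is made up.

```python
def estimate_frequencies(read_hits, species, iterations=50):
    """EM estimate of species frequencies from reads that may map to several species."""
    freq = {s: 1.0 / len(species) for s in species}  # start from a uniform guess
    for _ in range(iterations):
        totals = {s: 0.0 for s in species}
        for hits in read_hits:
            weight = sum(freq[s] for s in hits)
            if weight == 0:
                continue  # unmapped read, or all candidates currently at zero
            for s in hits:
                totals[s] += freq[s] / weight  # fractional share of this read
        n = sum(totals.values())
        if n == 0:
            break  # nothing mapped; keep the previous estimate
        freq = {s: totals[s] / n for s in species}
    return freq


# 6 reads map only to A, 2 only to B, and 4 map to both.
reads = [["A"]] * 6 + [["B"]] * 2 + [["A", "B"]] * 4
freq = estimate_frequencies(reads, ["A", "B"])
print({s: round(f, 3) for s, f in freq.items()})  # {'A': 0.75, 'B': 0.25}
```

Unmapped reads contribute nothing here, which is exactly the gap the last bullet on the slide asks about.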

39
(No Transcript)
40
(No Transcript)
41
(No Transcript)