Title: 10 Billion Piece Jigsaw Puzzles
1. 10 Billion Piece Jigsaw Puzzles
- John Cleary
- Netvalue Ltd.
- Real Time Genomics
3. 100 billion
10 billion
billion
10 million
million
100 thousand
thousand
10 thousand
hundred
4. Genome Transcriptome Cancer
5. Genomes of
- human
- reference species: mouse, chimp, Arabidopsis
- agricultural species: cattle, sheep, pig, rice, wheat, grape
- bacterial: disease, human ecosystem
6. Differences between
- Individuals
- Populations: disease and quantitative traits
- Somatic and tumor genomes
- Transcriptome of child and parents
- Bacterial populations of individuals
7. Human Genome
3 billion nucleotides
8. Shapes of the Jigsaw Pieces
Company              Lengths (nt)
454                  15 - 700
Illumina             36 - 150
Complete Genomics    36
Ion Torrent          up to 200
Oxford Nanopore(?)   up to 50,000
Pacific Biosciences  100
9. Differences between genomes - SNPs
1 / 1,000 nt: 3,000,000 nt in total
10. (Figure: read pileup. Rows labelled REF, SIM, CALL and READ show short reads aligned against the reference sequence.)
REF aatgttttctcagaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttgcgtagttagtgttcgtgctgg
12. Differences between human genomes - MNPs
13. Differences between human genomes - indels
1 / 10,000 nt: 300,000 in total
14. Differences between genomes - inserts
T T A G G A C C C A
Up to 1,000,000 nt; total 3,000,000 nt
15. Differences between genomes - structural variants
Tandem Repeat
Inversion
Copy
16. Solving the Jigsaw
- Indexing
- Alignment
- SNP/MNP/Indel/SV calling
Mapping
17. Indexing
A C G T T A G T G A A G
A C G T T C G T G A A G
A C G T T A G T G A A G
A C G T T C G T G A A G
4.5 billion
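A minimal sketch of what indexing can look like, assuming a plain k-mer hash table (the deck does not say which index structure is actually used). The function name `build_kmer_index` and the choice `k=4` are illustrative only; the sequence is the 12 nt example from this slide.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


# Toy reference taken from the slide; a real genome has billions of positions.
reference = "ACGTTAGTGAAG"
index = build_kmer_index(reference, k=4)
print(index["GTGA"])  # start positions of the 4-mer GTGA
```

A lookup is then a single dictionary access per k-mer, which is why the cost of building the index is paid once up front.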
18. Aligning
1.6 billion
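One way to picture aligning is seed-and-verify: look up k-mers of the read in an index, then count mismatches at each candidate position. This is a sketch under that assumption, not the method used in the talk; it ignores indels and reverse strands, and `align_read`, `max_mismatches` and the toy reference are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def align_read(read, reference, index, k, max_mismatches=1):
    """Return (start, mismatches) for the best ungapped placement, or None."""
    best = None
    tried = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for hit in index.get(seed, []):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference) or start in tried:
                continue
            tried.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "TTGACGTTAGTGAAGCC"   # toy reference
read = "ACGTTCGTGAAG"             # read with one difference (C instead of A)
index = build_kmer_index(reference, 4)
placement = align_read(read, reference, index, 4)
print(placement)
```

The single mismatch that survives the alignment is exactly the kind of position that later feeds SNP calling.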
19. Cutting Edge Run
- Human genome (3 billion nt)
- 1 billion reads of 100 nt: coverage of 30
- Indexing and aligning in 27 minutes
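The arithmetic behind these figures can be checked in a few lines. Coverage is total sequenced nucleotides divided by genome length; the throughput figure simply spreads the reads over the quoted 27 minutes. The function name is illustrative.

```python
def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering each genome position."""
    return num_reads * read_length / genome_length


depth = coverage(1_000_000_000, 100, 3_000_000_000)
reads_per_second = 1_000_000_000 / (27 * 60)

print(f"coverage: {depth:.1f}x")
print(f"throughput: {reads_per_second:,.0f} reads per second")
```

The exact quotient is a little over 33, which the slide rounds to 30.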
20. i7 Quad Core
21. 2 sockets x 4 cores x 2 hyperthreads = 16
48 GB RAM
10 computers
1 TB disk/genome: 500 GB, 200 GB, 200 GB, 0.3 GB
x thousands of genomes
22. Shapes of the Jigsaw Pieces
Company              Lengths (nt)
454                  15 - 700
Illumina             36 - 150
Complete Genomics    36
Ion Torrent          up to 200
Oxford Nanopore(?)   up to 50,000
Pacific Biosciences  100
23. Paired End Reads
100 nt
100 - 1,000 nt
Index Align
Index Align
Match
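A sketch of the "Match" step, assuming each end has already been indexed and aligned on its own and that the 100 - 1,000 nt figure is the gap between the two 100 nt reads (the slide does not say whether it is the gap or the whole fragment). Strand orientation is ignored, and `match_pairs` and the hit positions are made up for illustration.

```python
def match_pairs(left_hits, right_hits, read_length=100, min_gap=100, max_gap=1000):
    """Keep (left, right) placements whose gap lies in the expected range."""
    pairs = []
    for left in left_hits:
        for right in right_hits:
            gap = right - (left + read_length)  # nt between end of left and start of right
            if min_gap <= gap <= max_gap:
                pairs.append((left, right))
    return pairs


# Each end maps to two candidate positions; only one combination is consistent.
left_hits = [5_000, 250_000]
right_hits = [5_400, 900_000]
pairs = match_pairs(left_hits, right_hits)
print(pairs)
```

Requiring a consistent pair is what lets an end that maps ambiguously on its own be placed with confidence.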
24. Solving the Jigsaw without the picture
Assembly
25. Assembly
A C G T T C G T G A A G
T A G T G A A G A A T T
A C G T T C G T G A A G
T A G T G A A G A A T T
A C G T T ? G T G A A G A A T T
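The two reads on this slide can be merged with a simple overlap search. This is a toy sketch, assuming a pairwise suffix/prefix overlap that tolerates one mismatch and marks disagreements with `?`; real assemblers work on graphs over many reads. The function names and thresholds are illustrative.

```python
def find_overlap(left, right, min_overlap=4, max_mismatches=1):
    """Smallest offset in `left` where `right` overlaps its suffix with few mismatches."""
    for offset in range(len(left) - min_overlap + 1):
        overlap = len(left) - offset
        if overlap > len(right):
            continue
        mismatches = sum(a != b for a, b in zip(left[offset:], right))
        if mismatches <= max_mismatches:
            return offset
    return None


def merge_reads(left, right, offset):
    """Merge two overlapping reads, writing '?' where they disagree."""
    overlap = len(left) - offset
    middle = "".join(a if a == b else "?" for a, b in zip(left[offset:], right[:overlap]))
    return left[:offset] + middle + right[overlap:]


left = "ACGTTCGTGAAG"
right = "TAGTGAAGAATT"
offset = find_overlap(left, right)
merged = merge_reads(left, right, offset)
print(offset, merged)
```

The `?` marks the position where the two reads disagree (C versus A), matching the merged sequence shown on the slide.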
26. SNP calling
15A 13C
AC heterozygous SNP
Throw it out
31A 42C
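A sketch of a counting rule that reproduces both examples on this slide. It assumes that "Throw it out" applies to the 31A 42C site because its depth (73) is far above the expected coverage of about 30, which often indicates reads piling up from repeated regions; the slide does not state the reason. The thresholds `max_depth_ratio` and `min_fraction` are hypothetical.

```python
def call_site(base_counts, expected_depth=30, max_depth_ratio=2.0, min_fraction=0.2):
    """Return the called alleles at one position, 'filtered', or 'no call'."""
    depth = sum(base_counts.values())
    if depth == 0:
        return "no call"
    if depth > expected_depth * max_depth_ratio:
        return "filtered"  # suspiciously deep: probably not a single genomic location
    alleles = sorted(b for b, n in base_counts.items() if n / depth >= min_fraction)
    return "".join(alleles)


first = call_site({"A": 15, "C": 13})
second = call_site({"A": 31, "C": 42})
print(first, second)
```

Production callers use probabilistic models with base and mapping qualities rather than fixed thresholds.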
27. (Figure: read pileup, as on slide 10. Rows labelled REF, SIM, CALL and READ show short reads aligned against the reference sequence.)
28. Comparing twins
3,000,000 SNPs. Do any of them differ between the twins?
15A 4C
3A 10C 3G
30. Gene
DNA
mRNA
protein
32. Cancer comparison
33. Copy Number Variants
- Varying levels of extraction of reads across genome (use differences)
- Locate boundaries (as accurately as possible)
- Extract number of variants
- Use SNPs
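The first two bullets can be sketched as a read-depth comparison between two samples. This assumes reads have already been counted in fixed-size windows and that a second sample (for example normal tissue) serves as the control. The window counts, the 0.4 threshold and the function names are all invented for illustration, and boundaries are only resolved to the nearest window.

```python
import math
from statistics import median


def log2_ratios(sample_counts, control_counts, pseudocount=1.0):
    """Per-window log2(sample/control), centred on the median ratio."""
    raw = [(s + pseudocount) / (c + pseudocount)
           for s, c in zip(sample_counts, control_counts)]
    centre = median(raw)
    return [math.log2(r / centre) for r in raw]


def copy_states(ratios, threshold=0.4):
    """+1 for gain, -1 for loss, 0 for no change."""
    return [1 if r > threshold else -1 if r < -threshold else 0 for r in ratios]


def boundaries(states):
    """Indices of windows where the copy state changes."""
    return [i for i in range(1, len(states)) if states[i] != states[i - 1]]


control = [100] * 8
sample = [100, 100, 200, 200, 200, 100, 100, 100]  # windows 2-4 have doubled depth
states = copy_states(log2_ratios(sample, control))
breaks = boundaries(states)
print(states, breaks)
```

Smaller windows or split reads would be needed to locate the boundaries more accurately than one window.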
36. Metagenomics, or what is living on you
- Mapping reads back onto a database of known bacteria/viruses
- Many are ambiguous
- Many don't map at all
- Estimate frequency of each species
- Remove human contamination
37. TS1
0.389  gi|29611500|ref|NC_004703.1|   Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.183  gi|187734516|ref|NC_010655.1|  Akkermansia muciniphila ATCC BAA-835
0.145  gi|150002608|ref|NC_009614.1|  Bacteroides vulgatus ATCC 8482
0.037  gi|119025018|ref|NC_008618.1|  Bifidobacterium adolescentis ATCC 15703
TS4
0.428  gi|29611500|ref|NC_004703.1|   Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.210  gi|150002608|ref|NC_009614.1|  Bacteroides vulgatus ATCC 8482
0.149  gi|60650141|ref|NC_006873.1|   Bacteroides fragilis NCTC 9343 plasmid pBF9343
0.037  gi|121999251|ref|NC_008790.1|  Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.036  gi|238922432|ref|NC_012781.1|  Eubacterium rectale ATCC 33656
TS25
0.752  gi|29611500|ref|NC_004703.1|   Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.073  gi|150002608|ref|NC_009614.1|  Bacteroides vulgatus ATCC 8482
0.041  gi|121999251|ref|NC_008790.1|  Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.020  gi|58036264|ref|NC_004307.2|   Bifidobacterium longum NCC2705
0.018  gi|189438863|ref|NC_010816.1|  Bifidobacterium longum DJO10A
38. Metagenomics
- Map reads to database
- Estimate most likely frequencies: a hill climbing estimation problem
- Can anything be done about unmapped reads?
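One common way to climb towards the most likely frequencies is expectation-maximisation (EM): ambiguous reads are shared between species in proportion to the current estimates, then the estimates are recomputed. This is a sketch of that general idea, not necessarily the estimator used in the talk. It ignores genome length and unmapped reads, and the species labels and read counts are invented.

```python
def estimate_frequencies(read_hits, species, iterations=50):
    """EM estimate of species frequencies from reads that may map to several species.

    read_hits: one set of species names per mapped read (assumes at least one read).
    """
    freqs = {s: 1.0 / len(species) for s in species}
    for _ in range(iterations):
        totals = {s: 0.0 for s in species}
        for hits in read_hits:
            weight = sum(freqs[s] for s in hits)
            if weight == 0:
                continue  # read maps only to species currently estimated absent
            for s in hits:
                totals[s] += freqs[s] / weight  # fractional assignment of the read
        norm = sum(totals.values())
        freqs = {s: totals[s] / norm for s in species}
    return freqs


species = ["species_a", "species_b"]
read_hits = (
    [{"species_a"}] * 6                   # reads unique to species_a
    + [{"species_b"}] * 2                 # reads unique to species_b
    + [{"species_a", "species_b"}] * 4    # ambiguous reads
)
freqs = estimate_frequencies(read_hits, species)
print({s: round(f, 3) for s, f in freqs.items()})
```

Each iteration cannot decrease the likelihood, which is the sense in which this is hill climbing; the ambiguous reads end up split 3:1, in line with the unique reads.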