Linkage Disequilibrium Based SNP Genotype Calling from Short Sequencing Reads

1
Linkage Disequilibrium Based SNP Genotype Calling from Short Sequencing Reads
Ion Mandoiu
Computer Science and Engineering Department, University of Connecticut
Joint work with S. Dinakar, J. Duitama, Y. Hernández, J. Kennedy, and Y. Wu
2
Ultra-High Throughput Sequencing
  • Recent massively parallel sequencing technologies
    deliver orders of magnitude higher throughput
    compared to classic Sanger sequencing

Platform                      Read length   Throughput
Roche/454 FLX Titanium        400bp         400Mb/10h run
ABI SOLiD 2.0                 25-35bp       3-4Gb/6 day run
Helicos HeliScope             25-55bp       >1Gb/day
Illumina Genome Analyzer II   35-50bp       1.5Gb/2.5 day run
3
Personal Genomes: The Future is Now!
4
Challenges for Genomic Medicine at Single-Base
Resolution
  • Medical sequencing focuses on genetic variation
    (SNPs, CNVs, genome rearrangements)
  • Requires accurate determination of both alleles
    at variable loci
  • This is limited by coverage depth due to random
    nature of shotgun sequencing
  • For the Venter and Watson genomes (both sequenced
    at 7.5x average coverage), comparison with SNP
    genotyping chips has shown only 75% accuracy for
    sequencing-based calls of heterozygous SNPs
    [Levy et al. 07, Wheeler et al. 08]
  • [Wendl & Wilson 08] predict that 21x coverage is
    required for sequencing of normal tissue samples,
    based on idealized theory that neglects any
    heuristic inputs
  • What heuristic inputs help?
  • How much can we gain from improved data analysis?
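To make the coverage limitation concrete, here is a minimal sketch under an idealized model that is an assumption of this example, not something stated on the slide: read depth at a site is Poisson-distributed with mean equal to the average coverage, each read samples one of the two alleles with probability 1/2, and sequencing errors are ignored. Under these assumptions the read counts of the two alleles are independent Poisson variables with mean coverage/2. The function names are hypothetical.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def prob_both_alleles_seen(coverage: float, min_reads: int = 1) -> float:
    """Probability that BOTH alleles of a heterozygous site are each covered
    by at least `min_reads` reads (idealized model, no sequencing errors)."""
    # Poisson thinning: each allele's read count is Poisson(coverage / 2),
    # and the two counts are independent.
    return poisson_tail(min_reads, coverage / 2.0) ** 2


if __name__ == "__main__":
    for cov in (2.0, 7.5, 21.0):
        print(f"{cov:5.1f}x  P(each allele seen >= 2 times) = "
              f"{prob_both_alleles_seen(cov, min_reads=2):.3f}")
```

The sketch only shows how quickly the chance of observing both alleles falls at low coverage; it does not reproduce the accuracy figures cited above, which also depend on sequencing errors and on the calling rule.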

5
Linkage Disequilibrium: Sources and Modeling
HMM model of haplotype frequencies
  • F_i = founder haplotype at locus i, H_i = observed
    allele at locus i
  • P(F_i), P(F_i | F_{i-1}) and P(H_i | F_i) estimated from
    a reference panel such as HapMap
  • For a given haplotype h with n SNPs, P(H = h | M) can
    be computed in O(nK^2) time using the forward algorithm,
    where K = number of founders
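The forward computation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes biallelic SNPs coded 0/1, and the parameter layout (`init`, `trans`, `emit`) is a hypothetical choice made for this example.

```python
from typing import Sequence


def haplotype_probability(
    h: Sequence[int],
    init: Sequence[float],
    trans: Sequence[Sequence[Sequence[float]]],
    emit: Sequence[Sequence[Sequence[float]]],
) -> float:
    """Return P(H = h | M) using the forward algorithm.

    h[i]           allele (0 or 1) observed at locus i
    init[f]        probability of founder f at the first locus
    trans[i][g][f] P(founder f at locus i+1 | founder g at locus i)
    emit[i][f][a]  P(allele a observed at locus i | founder f at locus i)
    """
    n, K = len(h), len(init)
    # alpha[f] = P(h[0..i], founder at locus i = f)
    alpha = [init[f] * emit[0][f][h[0]] for f in range(K)]
    for i in range(1, n):
        # K sums of K terms per locus: O(K^2) work, O(n K^2) overall.
        alpha = [
            emit[i][f][h[i]] * sum(alpha[g] * trans[i - 1][g][f] for g in range(K))
            for f in range(K)
        ]
    return sum(alpha)


if __name__ == "__main__":
    # Toy model with K = 2 founders and n = 3 SNPs (made-up numbers).
    init = [0.6, 0.4]
    trans = [[[0.9, 0.1], [0.2, 0.8]]] * 2
    emit = [[[0.95, 0.05], [0.1, 0.9]]] * 3
    print(haplotype_probability([0, 0, 1], init, trans, emit))
```

For long haplotypes the products underflow, so a practical version would rescale `alpha` at each locus or work in log space.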

6
Pipeline for LD-Based Genotype Calling
Read sequences
>gnl|ti|1779718824 name:EI1W3PE02ILQXT
TCAGTGAGGGTTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTGTTTTTGAGACAGAATTTTGCTCTT
GTCGCCCAGGCTGGTGTGCAGTGGTGCAACCTCAGCTCACTGCAACCTCTGCCTCCAGGTTCAAGCAATT
CTCTGCCTCAGCCTCCCAAGTAGCTGGGATTACAGGCGGGCGCCACCACGCCCAGCTAATTTTGTATTGT
TAGTAAAGATGGGGTTTCACTACGTTGGCTGAGCTGTTCTCGAACTCCTGACCTCAAATGAC
>gnl|ti|1779718825 name:EI1W3PE02GTXK0
TCAGAATACCTGTTGCCCATTTTTATATGTTCCTTGGAGAAATGTCAATTCAGAGCTTTTGCTCAGCTTT
TAATATGTTTATTTGTTTTGCTGCTGTTGAGTTGTACAATGTTGGGGAAAACAGTCGCACAACACCCGGC
AGGTACTTTGAGTCTGGGGGAGACAAAGGAGTTAGAAAGAGAGAGAATAAGCACTTAAAAGGCGGGTCCA
GGGGGCCCGAGCATCGGAGGGTTGCTCATGGCCCACAGTTGTCAGGCTCCACCTAATTAAATGGTTTACA
Reference genome sequence
>gi|88943037|ref|NT_113796.1|Hs1_111515 Homo sapiens chromosome 1 genomic contig, reference assembly
GAATTCTGTGAAAGCCTGTAGCTATAAAAAAATGTTGAGCCATAAATACCATCAGAAATAACAAAGGGAG
CTTTGAAGTATTCTGAGACTTGTAGGAAGGTGAAGTAAATATCTAATATAATTGTAACAAGTAGTGCTTG
GATTGTATGTTTTTGATTATTTTTTGTTAGGCTGTGATGGGCTCAAGTAATTGAAATTCCTGATGCAAGT
AATACAGATGGATTCAGGAGAGGTACTTCCAGGGGGTCAAGGGGAGAAATACCTGTTGGGGGTCAATGCC
CTCCTAATTCTGGAGTAGGGGCTAGGCTAGAATGGTAGAATGCTCAAAAGAATCCAGCGAAGAGGAATAT
TTCTGAGATAATAAATAGGACTGTCCCATATTGGAGGCCTTTTTGAACAGTTGTTGTATGGTGACCCTGA
AATGTACTTTCTCAGATACAGAACACCCTTGGTCAATTGAATACAGATCAATCACTTTAAGTAAGCTAAG
TCCTTACTAAATTGATGAGACTTAAACCCATGAAAACTTAACAGCTAAACTCCCTAGTCAACTGGTTTGA
ATCTACTTCTCCAGCAGCTGGGGGAAAAAAGGTGAGAGAAGCAGGATTGAAGCTGCTTCTTTGAATTTAC
Quality scores
>gnl|ti|1779718824 name:EI1W3PE02ILQXT
28 28 28
28 26 28 28 40 34 14 44 36 23 13 2 27 42 35 21
7 27 42 35 21 6 28 43 36 22 10 27 42 35 20 6 28
43 36 22 9 28 43 36 22 9 28 44 36 24 14 4 28
28 28 27 28 26 26 35 26 40 34 18 3 28 28 28 27
33 24 26 28 28 28 40 33 14 28 36 27 26 26 37 29
28 28 28 28 27 28 28 28 37 28 27 27 28 36 28
37 28 28 28 27 28 28 28 24 28 28 27 28 28 37 29
36 27 27 28 27 28 33 23 28 33 23 28 36 27 33 23
28 35 25 28 28 36 27 36 27 28 28 28 24 28 37 29
28 19 28 26 37 29 26 39 33 13 37 28 28 28 21 24
28 27 41 34 15 28 36 27 26 28 24 35 27 28 40 34 15
SNP genotype calls
SNP          Genotype   Confidence
rs12095710   T T        9.988139e-01
rs12127179   C T        9.986735e-01
rs11800791   G G        9.977713e-01
rs11578310   G G        9.980062e-01
rs1287622    G G        8.644588e-01
rs11804808   C C        9.977779e-01
rs17471528   A G        5.236099e-01
rs11804835   C C        9.977759e-01
rs11804836   C C        9.977925e-01
rs1287623    G G        9.646510e-01
rs13374307   G G        9.989084e-01
rs12122008   G G        5.121655e-01
rs17431341   A C        5.290652e-01
rs881635     G G        9.978737e-01
rs9700130    A A        9.989940e-01
rs11121600   A A        6.160199e-01
rs12121542   A A        5.555713e-01
rs11121605   T T        8.387705e-01
rs12563779   G G        9.982776e-01
rs11121607   C G        5.639239e-01
rs11121608   G T        5.452936e-01
rs12029742   G G        9.973527e-01
rs562118     C C        9.738776e-01
rs12133533   A C        9.956655e-01
rs11121648   G G        9.077355e-01
rs9662691    C C        9.988648e-01
rs11805141   C C        9.928786e-01
rs1287635    C C        6.113270e-01
HapMap genotypes
90 209342
16  F  0   0   2110001?0100210010011002122201210211?1221220212000
18  F  15  16  21100012010021001001100?100201?10111110111?0212000
15  M  0   0   21120010012001201001120010110101011111011110212000
7   M  0   0   2110001001000200122110001111011100111?121210222000
8   F  0   0   011202100120022012211200101101210211122111?0120000
12  F  9   10  21100010010002001221100010110111001112121210220000
9   M  0   0   011?001?012002201221120010?10121021112211110120000
11  M  7   8   21100210010002001221100012110111001112121210222000
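As an illustration of how mapped reads and their quality scores can feed a genotype call, here is a minimal single-SNP sketch. It is a generic textbook-style model and not the pipeline from the slide: it assumes the quality scores are Phred-scaled (error probability 10^(-Q/10)), that an erroneous base is equally likely to be any of the other three bases, and it ignores LD entirely (uniform prior by default). All function and parameter names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def phred_to_error(q: int) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10.0 ** (-q / 10.0)


def genotype_posteriors(
    reads: List[Tuple[str, int]],
    allele_a: str,
    allele_b: str,
    prior: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Posterior over the genotypes AA, AB, BB at one SNP.

    reads: (base, Phred quality) pairs for the reads covering the SNP.
    """
    genotypes = {
        allele_a + allele_a: (allele_a, allele_a),
        allele_a + allele_b: (allele_a, allele_b),
        allele_b + allele_b: (allele_b, allele_b),
    }
    if prior is None:  # assumption: uniform prior, i.e. no LD information
        prior = {g: 1.0 / 3.0 for g in genotypes}

    scores = {}
    for name, (g1, g2) in genotypes.items():
        likelihood = 1.0
        for base, q in reads:
            e = phred_to_error(q)
            # P(base | true allele): correct with prob 1 - e, else e / 3.
            p1 = 1.0 - e if base == g1 else e / 3.0
            p2 = 1.0 - e if base == g2 else e / 3.0
            # Each read comes from either chromosome with probability 1/2.
            likelihood *= 0.5 * p1 + 0.5 * p2
        scores[name] = prior[name] * likelihood

    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


if __name__ == "__main__":
    reads = [("T", 28), ("T", 36), ("C", 27), ("T", 14)]
    print(genotype_posteriors(reads, "T", "C"))
```

In an LD-based caller the uniform prior would be replaced by information derived from neighboring SNPs and the reference panel; the `prior` argument marks where that would plug in.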
7
Genotype Calling Accuracy vs. Coverage
[Figure: genotype calling accuracy vs. coverage; panels: Watson/454 reads, NA18507/Illumina reads]
8
Conclusions & Ongoing Work
  • Exploiting LD information yields significant
    improvements in genotype calling accuracy
    and/or cost reduction
  • The HMM-based posterior decoding algorithm matches
    the accuracy of the previously proposed binomial
    test using less than 1/4 of the reads
  • Ongoing work
    • Modeling ambiguities in read mapping
    • Haplotype inference
    • Extension to population sequencing data (removing
      the need for reference panels)
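The slides do not spell out the binomial test used as the baseline. For orientation only, here is one simple form such a test can take, written as a sketch under stated assumptions: reads at a SNP are counted per allele, a homozygous site produces the other allele only through errors at a fixed rate, and a homozygous hypothesis is rejected when the one-sided binomial tail falls below a threshold. The names `call_genotype`, `error_rate`, and `alpha`, and their default values, are hypothetical.

```python
import math
from typing import Optional


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def call_genotype(
    ref_count: int,
    alt_count: int,
    error_rate: float = 0.01,
    alpha: float = 0.05,
) -> Optional[str]:
    """Call 'hom_ref', 'het' or 'hom_alt' from allele read counts.

    Returns None when there are no reads or neither homozygous
    hypothesis can be rejected.
    """
    n = ref_count + alt_count
    if n == 0:
        return None
    # Under "homozygous reference", every alt read is a sequencing error.
    p_hom_ref = binomial_tail(n, alt_count, error_rate)
    # Under "homozygous alternate", every ref read is a sequencing error.
    p_hom_alt = binomial_tail(n, ref_count, error_rate)

    reject_ref = p_hom_ref < alpha
    reject_alt = p_hom_alt < alpha
    if reject_ref and reject_alt:
        return "het"
    if reject_alt:
        return "hom_ref"
    if reject_ref:
        return "hom_alt"
    return None


if __name__ == "__main__":
    for ref, alt in [(5, 0), (4, 4), (0, 6), (1, 0)]:
        print(ref, alt, call_genotype(ref, alt))
```

Note that with a single read the sketch still reports a homozygous call, which shows how a per-site test of this kind can miss heterozygous SNPs at low coverage.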

ACKNOWLEDGEMENTS: This work was supported in part
by NSF under awards IIS-0546457 and DBI-0543365
to IM and IIS-0803440 to YW. SD and YH performed
this research as part of the Summer REU program
"Bio-Grid Initiatives for Interdisciplinary
Research and Education" funded by NSF under
award CCF-0755373.