Assembly of the sheep genome and identification of SNPs using 454 and Illumina sequencing and design of the Illumina Ovine SNP50 BeadChip

Slides: 34
Provided by: dalrympleb
Transcript and Presenter's Notes

1
Assembly of the sheep genome and identification
of SNPs using 454 and Illumina sequencing and
design of the Illumina Ovine SNP50 BeadChip
Brian P. Dalrymple on behalf of the
International Sheep Genomics Consortium
2
Sheep SNP (genome sequencing) project
  • International Sheep Genomics Consortium Project
  • Data generation and analysis
  • AgResearch/University of Otago
  • Baylor College of Medicine
  • CSIRO Livestock Industries
  • Universities of Sydney and Melbourne
  • Funded
  • International Science Linkages
  • SheepGenomics
  • Ovita
  • Genesis Faraday
  • 6 animals to 0.5X or more each, 454 FLX
    sequencing
  • 60 animals, 20X reduced representational (1-5% of
    the genome) sequencing, Solexa
  • To identify sheep SNPs
  • To design a 60K Illumina SNP chip
  • Work towards a sheep genome assembly
  • Cost and time
  • AU$2 million
  • 18 months

3
Chromosome assembly jargon
  • Contigs
  • overlapping sequence reads forming a contiguous sequence
  • Scaffolds
  • a series of ordered and oriented contigs, but
    with gaps between contigs
  • Chromosome
  • a series of scaffolds ordered and oriented using
    links between sequence and a physical map

[Figure: sequence contigs, paired end reads, the assembled scaffold, and markers 1 and 2 on the chromosome]
4
Our sequencing and assembly strategy
  • Position 454 FLX reads on a related genome
    sequence
  • The cow is the closest organism with an assembled
    genome
  • The longer 454 FLX sequences (average 230 bases)
    can be more accurately positioned than the short
    Illumina GA I reads
  • but large numbers of slightly divergent repeat
    sequences in the two genomes play havoc with
    alignments; repeats (generally longer than the
    read length) cannot be assembled and would be
    missing

[Figure labels: SNPs; Illumina reduced representational sequence; sheep sequence; 454 FLX paired end reads; 454 FLX 3X whole genome shotgun; Sanger genome assembly of the cow]
5
Converting a cow into a sheep
6
The importance of BACs
  • BACs are large segments of a genome inserted in a
    small vector that can propagate the foreign DNA
    in a bacterium
  • low copy number
  • constructed for most large genomes prior to
    sequencing
  • typically 100-200 kb in length
  • Frequently end sequenced to provide tags from the
    two ends

  • tail-to-tail BAC: same organisation as in original
    genome
  • tail-to-head BAC: one end inverted relative to
    original genome
  • head-to-head BAC: both ends inverted relative to
    original genome
  • broken BAC: linkage relative to original genome
    lost
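
The four BAC categories above can be sketched as a small classifier over the two end hits. This is only an illustration: the `EndHit` record, the strand convention (inward-facing ends counted as tail-to-tail) and the use of 10 to 500 kb as an acceptance window are assumptions, not the consortium's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndHit:
    """Best hit of one BAC-end sequence on the reference (hypothetical record)."""
    chrom: str
    pos: int     # position of the hit on the reference
    strand: str  # '+' or '-'


def classify_bac(a: EndHit, b: EndHit,
                 min_span: int = 10_000, max_span: int = 500_000) -> str:
    """Classify a BAC from the placement of its two end sequences."""
    # Ends on different chromosomes: linkage relative to the original genome is lost.
    if a.chrom != b.chrom:
        return "broken"
    left, right = sorted((a, b), key=lambda h: h.pos)
    # Assumed plausibility window for the distance between the two ends.
    if not (min_span <= right.pos - left.pos <= max_span):
        return "broken"
    if left.strand == "+" and right.strand == "-":
        return "tail-to-tail"   # ends face inward: same organisation
    if left.strand == "-" and right.strand == "+":
        return "head-to-head"   # both ends inverted
    return "tail-to-head"       # same strand: one end inverted
```

For example, `classify_bac(EndHit("BTA1", 100_000, "+"), EndHit("BTA1", 250_000, "-"))` returns `"tail-to-tail"` under these assumptions.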
7
Turn bovine genome into virtual sheep genome
Sheep BAC-end sequences
  • BLAST v. hg17 (human matches)
  • BLAST v. CanFam2 (dog matches)
  • BLAST v. Btau3.1 addn seq (cow matches)
  • BLAST v. EquCab1 (horse matches)
Liftover match positions to Btau3.6x rebuilt
bovine assembly
CSIROv1.0 bovine assembly
All sequences mapped to bovine assembly
Construct BAC contigs
Set of sheep BAC contigs on bovine assembly
Sheep markers
Using sheep markers, BACs and BAC contigs to
reorder and reorientate segments of the cow genome
to their location in the sheep genome; number
chromosomes correctly, etc.
Virtual sheep genome build v2
8
Maximizing genome coverage by sheep BACs
[Figure: sheep BACs with coordinates lifted over to the cow genome; spans of 10 - 500 kb]
Map 350,000 BAC-ends to all four genomes using
very sensitive BLASTn parameters (a week per
genome on a 66 dual-processor cluster), then
convert to cow genome coordinates using conversion
coordinates precalculated at UCSC using BLASTZ.
9
Maximizing genome coverage by tail-to-tail BACs
Method                                             Number of BACs positioned
tail-to-tail, both ends significant BLAST score
  (traditional approach)                           170
As above and one end significant BLAST score       1680
tail-to-tail integrated analysis                   2450
Coverage in BAC-contigs built using tail-to-tail BACs: 0.4 fold, 0.4 fold, 6 fold
Example for HSA18/OAR23: high-sensitivity BLAST
and a position filter maximise coverage with
tail-to-tail BACs and hence coverage by contigs
10
The benefit of using the integration
species                      TT BACs on genome   BAC contigs on genome
dog                          56,482              2,146
horse                        60,229              1,470
cow                          76,251              1,411
human + cow + dog            79,996              1,299
human + cow + dog + horse    95,797              943
1400 markers available to anchor BAC contigs to
sheep map
11
Turning a cow into a sheep - low resolution
[Figure: low-resolution comparison of the sheep chromosomes (OAR1 to OAR26 and OARX) with the corresponding bovine chromosomes (BTA1 to BTA29 and BTAX)]
5 chromosome inversions, 4 chromosome fusions, 1
split chromosome
12
Assembling the sheep genome sequence
13
Ovine Draft genome assembly
  • 454/Roche genome sequencing on GS FLX
  • 3-fold coverage: 6 animals, each 0.5-fold
  • 42.6 million reads
  • Average length 226 bases
  • 9.6 Gb
  • 3.5 fold assuming a 2.7 Gb genome
  • Assembly contigs
  • Contig count 2,314,423
  • Mean contig length 512 bases
  • Median contig length 377 bases
  • Total assembled sequence 1.18 Gb
  • 43% of the genome
  • Repeats not assembled
  • 89% of bovine RefSeqs map uniquely to the assembly
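
The figures on this slide can be cross-checked with simple arithmetic. The sketch below only recomputes values stated above; the variable names are illustrative.

```python
reads = 42.6e6            # 454 reads
mean_read_len = 226       # bases
genome_size = 2.7e9       # assumed sheep genome size, as on the slide

total_bases = reads * mean_read_len       # about 9.6 Gb of raw sequence
fold_coverage = total_bases / genome_size  # between 3.5 and 3.6 fold

contig_count = 2_314_423
mean_contig_len = 512
assembled_bases = contig_count * mean_contig_len  # about 1.18 Gb

assembled_fraction = 1.18e9 / genome_size  # a little under 44% of the genome

print(f"raw sequence:  {total_bases / 1e9:.2f} Gb")
print(f"coverage:      {fold_coverage:.2f} fold")
print(f"assembled:     {assembled_bases / 1e9:.2f} Gb")
print(f"genome share:  {assembled_fraction:.1%}")
```

The contig count multiplied by the mean contig length reproduces the stated 1.18 Gb of assembled sequence, so the slide's numbers are internally consistent.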

14
Sequence assembly pipeline
454 sequence BCM
454 sequence AgResearch/Otago
John McEwan and colleagues at AgResearch
Sequence database
repeat mask
lower case masked 454 reads
MegaBLAST -D 3 -W 11 -s 55 -U T -F "m D"; retain
only top hit
Bovine genome
stored summarised MegaBLAST results
Using hits to bovine genome in bins of 1 Mb at a
time use Newbler to build sheep contigs
Sheep 454 contigs
Position sheep contigs on vsg2, replace cow
sequence with sheep sequence, remove remaining
cow sequence
Virtual sheep genome build v2
Sheep genome assembly v1.0
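
The binning step in this pipeline (group reads by the 1 Mb bovine interval of their top hit, then assemble each bin separately) can be sketched as follows. The tuple layout and the function name `bin_hits` are hypothetical; the real pipeline stored summarised MegaBLAST results and passed each bin to Newbler.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # 1 Mb bins on the bovine genome


def bin_hits(hits, bin_size=BIN_SIZE):
    """Group read IDs by (bovine chromosome, bin index) of their top hit.

    `hits` is an iterable of (read_id, chrom, start) tuples, with `start`
    taken as a 0-based coordinate (an assumption of this sketch).
    """
    bins = defaultdict(list)
    for read_id, chrom, start in hits:
        bins[(chrom, start // bin_size)].append(read_id)
    return dict(bins)


example_hits = [
    ("read1", "BTA1", 10),
    ("read2", "BTA1", 999_999),
    ("read3", "BTA1", 1_000_000),
    ("read4", "BTA2", 5),
]
binned = bin_hits(example_hits)
```

Each value in `binned` is the set of reads that would be handed to the assembler for one 1 Mb interval.
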
15
The sheep genome assembly
  • Is comprised of 454 sequence contigs organised
    into sheep chromosomes
  • 2.5 million contigs
  • average length 480 bases
  • ordered and oriented using vsheep v2 framework
  • cover 43% of the sheep genome
  • cover 80% of the unique fraction of the genome

15 sheep contigs 1 bovine contig
16
How close is the assembly to the real sheep?
  • The local order of sheep contigs is heavily
    dependent on the bovine genome assembly
  • Used a light pass of 454 sequencing paired end
    reads as a quality control check of the assembly
  • short paired ends 30 bases long
  • 89% in tail-to-tail arrangement
  • long paired ends 70-100 bases long
  • 98% in tail-to-tail arrangement
  • BAC-end sequences
  • 93% in tail-to-tail arrangement

17
Distributions of lengths of tail-to-tail
paired-end reads
[Figure panels: sonication aim 2.5 kb; sonication aim 3 kb]
18
Designing the SNP chip
19
Finding the SNPs
  • Pilot 1,536 SNP chip SNPs
  • Already have a set of 1,536 SNPs from a pilot
    project
  • Sanger resequencing of BAC-end sequences
  • 454 SNPs
  • 41 million reads
  • align 454 reads to assembled sheep genome
    sequence
  • require at least two reads (each from a different
    animal) for each allele to call a SNP
  • 265,000 high confidence SNPs called
  • Illumina SNPs
  • 48 million reads
  • filter based on starting CC, quality, remove
    singletons
  • align 28.5 million SOLEXA reads to assembled
    sheep genome sequence
  • 3.7 million unique sequences
  • require at least two reads of each allele with a
    quality score >27 at the putative variable base
  • 76,044 high confidence SNPs called
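
The two calling rules above (454: at least two reads per allele, each from a different animal; Illumina: at least two reads per allele with quality above 27 at the variable base) can be sketched as one function. The observation tuples and the name `call_snp` are illustrative assumptions; alignment and the other filters are not shown.

```python
from collections import defaultdict


def call_snp(observations, min_reads=2, min_quality=None, distinct_animals=False):
    """Decide whether one aligned position looks like a SNP.

    `observations` is an iterable of (allele, animal_id, base_quality).
    Returns (is_snp, supported_alleles).
    """
    read_counts = defaultdict(int)
    animals = defaultdict(set)
    for allele, animal, quality in observations:
        # Illumina-style rule: ignore bases at or below the quality cut-off.
        if min_quality is not None and quality <= min_quality:
            continue
        read_counts[allele] += 1
        animals[allele].add(animal)
    supported = sorted(
        allele for allele, n in read_counts.items()
        if n >= min_reads
        and (not distinct_animals or len(animals[allele]) >= min_reads)
    )
    # A SNP needs at least two alleles that each meet the support rule.
    return len(supported) >= 2, supported


# 454-style call: each allele seen in two different animals.
obs_454 = [("A", "s1", 30), ("A", "s2", 30), ("G", "s3", 30), ("G", "s4", 30)]
result_454 = call_snp(obs_454, distinct_animals=True)

# Illumina-style call: one T read fails the quality cut-off, so no SNP.
obs_illumina = [("C", "pool", 28), ("C", "pool", 35), ("T", "pool", 27), ("T", "pool", 40)]
result_illumina = call_snp(obs_illumina, min_quality=27)
```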

20
Illumina GA SNPs RRS
  • Reduced representational sequencing (RRS)
  • DNA pooled from 20 individuals/breeds
  • HaeIII genome digest, 3 size fractions
  • 75-90 bp
  • 100-120 bp
  • 130-155 bp
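
As an illustration of the digest and size selection, here is a minimal in silico sketch. HaeIII recognises GGCC and cuts between GG and CC, so internal fragments begin with CC, which is why the reads are later filtered on a starting CC. Treating the three size fractions as plain fragment lengths is an assumption of this sketch.

```python
SITE = "GGCC"
CUT_OFFSET = 2  # HaeIII cuts GG^CC, leaving blunt ends
SIZE_FRACTIONS = ((75, 90), (100, 120), (130, 155))  # bp, from the slide


def haeiii_fragments(seq):
    """Return the fragments produced by a complete HaeIII digest of `seq`."""
    seq = seq.upper()
    cuts = []
    i = seq.find(SITE)
    while i != -1:
        cuts.append(i + CUT_OFFSET)
        i = seq.find(SITE, i + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def in_size_fraction(fragment, fractions=SIZE_FRACTIONS):
    """True if the fragment length falls in one of the selected size ranges."""
    return any(lo <= len(fragment) <= hi for lo, hi in fractions)


toy_sequence = "AAGGCC" + "T" * 76 + "GGCCAA"
fragments = haeiii_fragments(toy_sequence)
selected = [f for f in fragments if in_size_fraction(f)]
```

In the toy sequence only the 80-base internal fragment, which starts with CC and ends with GG, survives the size selection.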

21
Illumina GA sequencing
  • 112 million reads from 3 Solexa runs (29 / 35 /
    48 million)
  • 33 bases / read
  • 3.7 Gb in total

Total reads        112,075,672
Non-CC count        22,965,875
N read count           830,817
Ave qual <25         8,308,027
Reads passed        84,675,240
Different seqs       9,048,701
Non-singletons       2,942,112
  • Total different sequences 0.3 Gb
  • Different sequences represented more than once
    2.9 million
  • Singletons likely to be sequencing errors
  • At 33 bases per read, just under 0.1 Gb
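
The gigabase figures on this slide follow directly from the read counts at 33 bases per read. A minimal check, with illustrative names:

```python
READ_LENGTH = 33  # bases per Illumina GA read

read_counts = {
    "total reads": 112_075_672,
    "different sequences": 9_048_701,
    "non-singletons": 2_942_112,
}

# Convert each count to gigabases of sequence.
gigabases = {name: n * READ_LENGTH / 1e9 for name, n in read_counts.items()}

for name, gb in gigabases.items():
    print(f"{name}: {gb:.3f} Gb")
```

This gives about 3.7 Gb in total, about 0.3 Gb of different sequences, and just under 0.1 Gb for sequences seen more than once.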

22
Calling Illumina GA SNPs
  • 76,044 SNPs called
  • Number depends on cut offs used

HaeIII fragments with sequences from both ends
23
Calling SNPs from Illumina reads
  • High sequencing error rate towards the end of the
    Illumina reads
  • with large datasets two identical sequences may
    actually result from sequencing errors if base is
    late in the sequence reads

24
SNP distribution
  • HaeIII sites are not randomly distributed in the
    genome assembly
  • perhaps not the best enzyme to have chosen!
  • SNPs from Illumina sequencing follow HaeIII site
    distribution
  • SNPs from 454 sequencing more evenly distributed
    across the genome
  • more like the distribution of sequence coverage

25
SNP validation
  • So how likely are the SNPs called from the
    sequences to be real SNPs v. sequencing errors?
  • 64 randomly selected Illumina SNPs
  • 112 randomly selected 454 SNPs
  • tested on 63 samples including those used for
    discovery and the International mapping flock
    using a Sequenom iPlex system.
  • Only two SNPs, one of each type, were not polymorphic
  • More stringently, >80% passed QC (>85% genotype
    calls, HW equilibrium test, MAF >0.05 in the 63
    selected animals)
  • Predicts >85% validation on the Illumina
    Infinium system.
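
The stringent QC described above can be sketched as a per-SNP filter. The call-rate and MAF cut-offs come from the slide; the Hardy-Weinberg threshold (chi-square 3.841, i.e. p = 0.05 with one degree of freedom) is an assumption, since the slide does not state one, and the genotype encoding is illustrative.

```python
def snp_passes_qc(genotypes, min_call_rate=0.85, min_maf=0.05, hwe_chi2_max=3.841):
    """Apply call-rate, MAF and Hardy-Weinberg checks to one SNP.

    `genotypes` holds one entry per animal: 'AA', 'AB', 'BB', or None for
    a failed call.
    """
    called = [g for g in genotypes if g is not None]
    if not genotypes or len(called) / len(genotypes) <= min_call_rate:
        return False
    n = len(called)
    n_aa, n_ab, n_bb = (called.count(g) for g in ("AA", "AB", "BB"))
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if min(p, q) <= min_maf:
        return False
    # Chi-square goodness of fit against Hardy-Weinberg expectations.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2 <= hwe_chi2_max


balanced = ["AA"] * 16 + ["AB"] * 31 + ["BB"] * 16   # 63 animals
monomorphic = ["AA"] * 63
```

With these defaults `snp_passes_qc(balanced)` is true and `snp_passes_qc(monomorphic)` is false.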

26
SNP chip design I
  • 60K high quality SNPs evenly distributed across
    the genome
  • quality of candidates
  • 1536 Sanger > Illumina > 454
  • probability that variation is a real SNP, not a
    sequencing artefact
  • Minor Allele Frequency
  • not very accurate for 454 SNPs, more accurate for
    Illumina and 1536 SNPs
  • bias against SNPs with low minor allele frequency
  • spacing
  • even v. favouring genes v. favouring favourite
    genome regions
  • use of chip real estate
  • Infinium I v. II (SNPs)
  • Infinium II uses one position on chip v two
    positions for Infinium I
  • Infinium I: A/T and G/C SNPs, 17% of all SNPs
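
The chip real estate point can be made concrete with a small helper: SNPs whose two alleles are complementary (A/T or C/G) need the Infinium I design and two chip positions, while all others fit Infinium II and one position. The function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def infinium_design(allele1, allele2):
    """Return the Infinium design type needed for a biallelic SNP."""
    a, b = allele1.upper(), allele2.upper()
    if a == b or a not in COMPLEMENT or b not in COMPLEMENT:
        raise ValueError(f"not a valid biallelic SNP: {allele1}/{allele2}")
    # Complementary alleles (A/T, C/G) require Infinium I.
    return "Infinium I" if COMPLEMENT[a] == b else "Infinium II"


def positions_needed(design):
    """Chip positions used by one assay: two for Infinium I, one for Infinium II."""
    return 2 if design == "Infinium I" else 1
```

For example, `infinium_design("A", "T")` returns `"Infinium I"` and `infinium_design("A", "G")` returns `"Infinium II"`.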

27
SNP chip design II
  • probe score
  • Probability that the assay will work on the chip
    (we are using the Illumina Infinium platform)
  • 50 base oligo-probe
  • Assay oligonucleotide should be unique in the
    genome
  • But the genome sequence is not complete
  • Assay oligonucleotide should not contain other
    SNPs
  • But we do not know all the SNPs

28
Not all SNPs can be converted to assays
  • 454 SNPs distribution of probe scores for
    Illumina Infinium assay

29
SNP chip design issues
  • >80% probability that observed variation is a
    useable SNP (average 92%)
  • >80% probability that assay works (average
    92%)
  • >90% Infinium II assays
  • 60K chip may actually have only 48K SNPs that
    return useful data.
  • Indeed, in use the filtered reliable dataset is
    49K SNPs
  • This is why it is marketed as the Ovine SNP50
    BeadChip

30
The 60K sheep SNP chip final design
  • Used all 60,800 available beads: 59,454 SNPs
  • Includes 138 validated SNPs being developed for a
    sheep parentage chip
  • Average spacing
  • still a few long gaps without any SNPs
  • Average assay design score 0.975

Source               Infinium I   Infinium II   Total    Percent of available SNPs
Sanger - validated           29           571      600   43
454                       1,049        39,125   40,174   14
Illumina                    268        18,401   18,669   25
mtDNA - validated             0            11       11
Total                     1,346        58,108   59,454
Percent of total           2.26         97.74
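
The table totals and the bead count can be checked with a few lines, assuming (as stated on the design slide) that an Infinium I assay takes two positions and an Infinium II assay takes one. The dictionary names are illustrative.

```python
infinium_i = {"Sanger": 29, "454": 1_049, "Illumina": 268, "mtDNA": 0}
infinium_ii = {"Sanger": 571, "454": 39_125, "Illumina": 18_401, "mtDNA": 11}

total_i = sum(infinium_i.values())     # Infinium I assays
total_ii = sum(infinium_ii.values())   # Infinium II assays
total_snps = total_i + total_ii

# Two bead types per Infinium I assay, one per Infinium II assay.
beads_used = 2 * total_i + total_ii

percent_i = 100 * total_i / total_snps
percent_ii = 100 * total_ii / total_snps

print(total_i, total_ii, total_snps, beads_used)
print(f"{percent_i:.2f}% Infinium I, {percent_ii:.2f}% Infinium II")
```

Under that assumption the 59,454 SNPs account for exactly the 60,800 available beads.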
31
Distribution of gaps between SNPs
32
How did we do?
  • Ovine SNP50 BeadChip is in commercial production
    by Illumina
  • 50k of the SNPs are assayable and return useable
    data
  • Performance is on a par with dog, horse and human
    chips designed at much greater cost with much
    more information
  • Equal performance of Sanger, Illumina and 454
    SNPs
  • Around one SNP in 1000 has been assigned to the
    wrong chromosome
  • Much better performance than bovine chip
  • How come, if we based it on assuming conserved
    synteny with the bovine genome?
  • Because the first thing we did was to rebuild
    the bovine genome based on conserved mammalian
    synteny
  • A judicious combination of new sequencing
    technologies, appropriate filters and comparative
    genomics approaches can produce very high quality
    results
  • Generally applicable to a wide range of organisms
    for production and ecological reasons

33
Genome and SNP chip sections of the International
Sheep Genomics Consortium
  • AgResearch NZ
  • John McEwan
  • Gemma Payne
  • Nessa O'Sullivan
  • Tracey Van Stijn
  • Theresa Wilson
  • Rudiger Brauning
  • Alan McCulloch
  • Russel Smithies
  • Benoit Auvray
  • University of Otago
  • Jo Stanton
  • Chrissie
  • Mark
  • Baylor College of Medicine
  • Richard Gibbs
  • Donna Muzny
  • Michael E. Holder
  • Lynne Nazareth
  • Rebecca L. Thornton
  • Christie Kovar
  • CSIRO Livestock Industries
  • Brian Dalrymple
  • James Kijas
  • David Townley
  • Abhirami Ratnakumar
  • Wes Barris
  • Sean McWilliam
  • Genesis Faraday
  • Chris Warkup
  • sheepGENOMICS
  • Rob Forage
  • Terry Longhurst
  • TIGR
  • Ewen Kirkness
  • Uni Melbourne
  • Jill Maddox
  • USDA
  • Tim Smith
  • UNE
  • Hutton Oddy
  • Uni Sydney
  • Frank Nicholas
  • Herman Raadsma
  • Utah State University
  • Noelle Cockett

isgcdata.agresearch.co.nz
www.sheephapmap.org
www.sheepgenomics.com
www.livestockgenomics.csiro.au