Title: Assembly of the sheep genome and identification of SNPs using 454 and Illumina sequencing and design of the Illumina Ovine SNP50 BeadChip
1Assembly of the sheep genome and identification of SNPs using 454 and Illumina sequencing and design of the Illumina Ovine SNP50 BeadChip
Brian P. Dalrymple on behalf of the
International Sheep Genomics Consortium
2Sheep SNP (genome sequencing) project
- International Sheep Genomics Consortium Project
- Data generation and analysis
- AgResearch/University of Otago
- Baylor College of Medicine
- CSIRO Livestock Industries
- Universities of Sydney and Melbourne
- Funded
- International Science Linkages
- SheepGenomics
- Ovita
- Genesis Faraday
- 6 animals to 0.5X or more each, 454 FLX sequencing
- 60 animals, 20X reduced representational (1-5% of genome) sequencing, Solexa
- To identify sheep SNPs
- To design a 60K Illumina SNP chip
- Work towards a sheep genome assembly
- Cost and time
- AU$2 million
- 18 months
3Chromosome assembly jargon
- Contigs
- overlapping sequence reads forming a contiguous sequence
- Scaffolds
- a series of ordered and oriented contigs, but with gaps between contigs
- Chromosome
- a series of scaffolds ordered and oriented using links between sequence and a physical map
[Figure labels: assembled scaffold; paired end reads; sequence contigs; marker 1; marker 2; chromosome]
4Our sequencing and assembly strategy
- Position 454 FLX reads on a related genome sequence
- The cow is the closest organism with an assembled genome
- The longer 454 FLX sequences (average 230 bases) can be more accurately positioned than the short Illumina GA I reads
- but large numbers of slightly divergent repeat sequences in the two genomes play havoc with alignments; repeats (generally longer than the read length) cannot be assembled and would be missing
[Figure labels: SNPs; Illumina - reduced representational sequence; sheep seq.; 454 FLX paired end reads; 454 FLX 3X whole genome shotgun; Sanger genome assembly of the cow]
5Converting a cow into a sheep
6The importance of BACs
- BACs are large segments of a genome inserted in a small vector that can propagate the foreign DNA in a bacterium
- low copy number
- constructed for most large genomes prior to sequencing
- typically 100-200 kb in length
- Frequently end sequenced to provide tags from the two ends
tail-to-tail BAC: same organisation as in original genome
tail-to-head BAC: one end inverted relative to original genome
head-to-head BAC: both ends inverted relative to original genome
broken BAC: linkage relative to original genome lost
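The four BAC categories above can be expressed as a small classifier over the two end-sequence hits. This is only a sketch under stated assumptions: the `EndHit` record, the function name and the strand convention (an intact clone has its left end on the '+' strand and its right end on the '-' strand, so the reads point at each other) are illustrative and not taken from the project's code. Treating ends on different chromosomes, or outside a 10 to 500 kb window, as "broken" is likewise an assumption.

```python
from typing import NamedTuple


class EndHit(NamedTuple):
    chrom: str    # reference chromosome the BAC-end read maps to
    pos: int      # mapped position of the read on that chromosome
    strand: str   # '+' or '-'


def classify_bac(a: EndHit, b: EndHit,
                 min_span: int = 10_000, max_span: int = 500_000) -> str:
    """Classify a BAC from the mapped positions of its two end reads."""
    # Ends on different chromosomes: linkage is lost (assumed rule).
    if a.chrom != b.chrom:
        return "broken"
    left, right = sorted((a, b), key=lambda hit: hit.pos)
    # Ends implausibly close or far apart: also treated as broken (assumed rule).
    if not (min_span <= right.pos - left.pos <= max_span):
        return "broken"
    if left.strand == "+" and right.strand == "-":
        return "tail-to-tail"   # reads point towards each other
    if left.strand == "-" and right.strand == "+":
        return "head-to-head"   # both ends inverted, reads point away
    return "tail-to-head"       # same strand, one end inverted


print(classify_bac(EndHit("chr1", 1_000, "+"), EndHit("chr1", 151_000, "-")))
```

Only tail-to-tail BACs would be carried forward as evidence that the local organisation of the reference matches the sheep genome.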
7Turn bovine genome into virtual sheep genome
Sheep BAC-end sequences
BLAST v. hg17
BLAST v. CanFam2
BLAST v. Btau3.1 addn seq
BLAST v. EquCab1
cow matches
horse matches
human matches
dog matches
liftover match positions to Btau3.6x rebuilt bovine assembly
CSIROv1.0 bovine assembly
All sequences mapped to bovine assembly
Construct BAC contigs
Set of sheep BAC contigs on bovine assembly
Sheep markers
Using sheep markers, BACs and BAC contigs to reorder and reorientate segments of cow genome to their location in the sheep genome, number chromosomes correctly, etc.
Virtual sheep genome build v2
8Maximizing genome coverage by sheep BACs
[Figure labels: Sheep BAC (x2); liftover coordinates to cow genome; 10 - 500 kb (x2)]
Map 350,000 BAC-ends to all four genomes using very sensitive BLASTn parameters (a week per genome on a 66 dual-processor cluster), then convert to cow genome coordinates using conversion coordinates precalculated at UCSC using Blastz.
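The coordinate conversion step can be pictured as a lookup in a table of aligned blocks. The sketch below is not the UCSC tool or its file format; `Block`, `lift` and the chromosome names are hypothetical, and it assumes ungapped blocks with 0-based, half-open source coordinates.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Block(NamedTuple):
    src_chrom: str   # chromosome in the genome the BAC-end matched
    src_start: int   # 0-based, inclusive
    src_end: int     # exclusive
    dst_chrom: str   # chromosome in the cow genome
    dst_start: int   # start of the block on the cow forward strand
    strand: str      # '+' same orientation, '-' inverted


def lift(blocks: Sequence[Block], chrom: str, pos: int) -> Optional[Tuple[str, int]]:
    """Convert one position to cow coordinates, or None if unaligned."""
    for b in blocks:
        if b.src_chrom == chrom and b.src_start <= pos < b.src_end:
            offset = pos - b.src_start
            if b.strand == "+":
                return b.dst_chrom, b.dst_start + offset
            # Inverted block: count the offset back from the block's far end.
            length = b.src_end - b.src_start
            return b.dst_chrom, b.dst_start + (length - 1 - offset)
    return None


# Hypothetical blocks for illustration only.
blocks = [
    Block("other_chrA", 100, 200, "cow_chrB", 1000, "+"),
    Block("other_chrA", 300, 400, "cow_chrB", 5000, "-"),
]
print(lift(blocks, "other_chrA", 150))
```

A linear scan is enough for a sketch; with hundreds of thousands of BAC-ends a sorted index per chromosome would be the more natural choice.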
9Maximizing genome coverage by tail-to-tail BACs
Method                                                                   Number of BACs positioned
tail-to-tail, both ends significant BLAST score (traditional approach)   170
As above and one end significant BLAST score                             1680
tail-to-tail integrated analysis                                         2450
Coverage in BAC-contigs built using tail-to-tail BACs: 0.4 fold, 0.4 fold, 6 fold
Examples for HSA18/OAR23: high sensitivity BLAST and position filter maximise coverage with tail-to-tail BACs and hence coverage by contigs
10The benefit of using the integration
Species                      TT BACs on genome   BAC contigs on genome
dog                          56,482              2,146
horse                        60,229              1,470
cow                          76,251              1,411
human + cow + dog            79,996              1,299
human + cow + dog + horse    95,797              943
1400 markers available to anchor BAC contigs to sheep map
11Turning a cow into a sheep - low resolution
[Figure: sheep chromosomes OAR1 to OAR26 and OARX, each labelled with the matching cattle chromosome(s) (BTA)]
5 chromosome inversions, 4 chromosome fusions, 1 split chromosome
12Assembling the sheep genome sequence
13Ovine Draft genome assembly
- 454/Roche genome sequencing on GS FLX
- 3-fold coverage: 6 animals, each 0.5-fold
- 42.6 million reads
- Average length 226 bases
- 9.6 Gb
- 3.5 fold assuming a 2.7 Gb genome
- Assembly contigs
- Contig count 2,314,423
- Mean contig length 512 bases
- Median contig length 377 bases
- Total assembled sequence 1.18 Gb
- 43% of the genome
- Repeats not assembled
- 89% of bovine RefSeqs map uniquely to the assembly
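The coverage figures on this slide follow from simple arithmetic on the read and contig counts. The short check below uses only the numbers quoted above; the function name and dictionary keys are illustrative.

```python
def coverage_summary(reads=42.6e6, mean_read_len=226, genome_size=2.7e9,
                     contigs=2_314_423, mean_contig_len=512):
    """Recompute raw coverage and assembled fraction from the slide's figures."""
    raw_bases = reads * mean_read_len        # total sequenced bases
    assembled = contigs * mean_contig_len    # total bases in contigs
    return {
        "raw_gb": raw_bases / 1e9,
        "fold_coverage": raw_bases / genome_size,
        "assembled_gb": assembled / 1e9,
        "genome_fraction": assembled / genome_size,
    }


for name, value in coverage_summary().items():
    print(f"{name}: {value:.2f}")
```

The results land within rounding of the quoted 9.6 Gb of raw sequence, roughly 3.5-fold coverage, 1.18 Gb assembled and about 43% of a 2.7 Gb genome.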
14Sequence assembly pipeline
454 sequence BCM
454 sequence AgResearch/Otago
John McEwan and colleagues at AgResearch
Sequence database
repeat mask
lower case masked 454 reads
MegaBLAST -D 3 -W 11 -s 55 -U T -F "m D", retain only top hit
Bovine genome
stored summarised MegaBLAST results
Using hits to the bovine genome, in bins of 1 Mb at a time, use Newbler to build sheep contigs
Sheep 454 contigs
Position sheep contigs on vsg2, replace cow sequence with sheep sequence, remove remaining cow sequence
Virtual sheep genome build v2
Sheep genome assembly v1.0
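The binning step in the pipeline above, where reads are grouped by the 1 Mb window of the bovine genome their top hit falls in before each group is assembled, might look like the sketch below. The input layout (read id, bovine chromosome, hit start) and the function name are assumptions for illustration; the actual stored MegaBLAST summaries and the Newbler invocation are not shown in the source.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # 1 Mb bins, as stated in the pipeline


def bin_reads(top_hits, bin_size=BIN_SIZE):
    """Group read ids by (bovine chromosome, bin index) of their top hit.

    top_hits: iterable of (read_id, chrom, start) with one entry per read,
    i.e. only the best hit has been retained.
    """
    bins = defaultdict(list)
    for read_id, chrom, start in top_hits:
        bins[(chrom, start // bin_size)].append(read_id)
    return dict(bins)


hits = [
    ("read1", "chr1", 10),
    ("read2", "chr1", 999_999),
    ("read3", "chr1", 1_000_000),
    ("read4", "chr2", 5),
]
for key, read_ids in sorted(bin_reads(hits).items()):
    print(key, read_ids)
```

Each bin's reads would then be handed to the assembler as an independent, much smaller problem.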
15The sheep genome assembly
- Is composed of 454 sequence contigs organised into sheep chromosomes
- 2.5 million contigs
- average length 480 bases
- ordered and oriented using vsheep v2 framework
- cover 43% of the sheep genome
- cover 80% of the unique fraction of the genome
[Figure labels: 15 sheep contigs; 1 bovine contig]
16How close is the assembly to the real sheep?
- The local order of sheep contigs is heavily dependent on the bovine genome assembly
- Used a light pass of 454 sequencing paired end reads as a quality control check of the assembly
- short paired ends 30 bases long
- 89% in tail-to-tail arrangement
- long paired ends 70-100 bases long
- 98% in tail-to-tail arrangement
- BAC-end sequences
- 93% in tail-to-tail arrangement
17Distributions of lengths of tail-to-tail paired-end reads
[Figure: length distributions; panel labels: sonication aim 2.5 kb, sonication aim 3 kb]
18Designing the SNP chip
19Finding the SNPs
- Pilot 1,536 SNP chip SNPs
- Already have a set of 1,536 SNPs from a pilot project
- Sanger resequencing of BAC-end sequences
- 454 SNPs
- 41 million reads
- align 454 reads to assembled sheep genome sequence
- require at least two reads (each from a different animal) for each allele to call a SNP
- 265,000 high confidence SNPs called
- Illumina SNPs
- 48 million reads
- filter based on starting CC, quality, remove singletons
- align 28.5 million Solexa reads to assembled sheep genome sequence
- 3.7 million unique sequences
- require at least two reads of each allele with a quality score >27 at the putative variable base
- 76,044 high confidence SNPs called
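The two calling rules above can be sketched as simple support counts at a single aligned position. This is an illustration under assumptions, not the consortium's caller: the function names and input layouts are hypothetical, and "each from a different animal" is read here as "the allele is seen in reads from at least two distinct animals".

```python
from collections import defaultdict


def call_snp_454(observations, min_animals=2):
    """observations: iterable of (animal_id, base) at one position.

    Returns the supported alleles as a sorted tuple, or None if fewer
    than two alleles are each seen in at least `min_animals` animals.
    """
    animals_by_base = defaultdict(set)
    for animal, base in observations:
        animals_by_base[base].add(animal)
    supported = sorted(b for b, a in animals_by_base.items() if len(a) >= min_animals)
    return tuple(supported) if len(supported) >= 2 else None


def call_snp_illumina(observations, min_reads=2, min_qual=27):
    """observations: iterable of (base, quality) at one position.

    Only bases with quality strictly above `min_qual` count as support.
    """
    counts = defaultdict(int)
    for base, qual in observations:
        if qual > min_qual:
            counts[base] += 1
    supported = sorted(b for b, n in counts.items() if n >= min_reads)
    return tuple(supported) if len(supported) >= 2 else None


print(call_snp_454([("s1", "A"), ("s2", "A"), ("s3", "G"), ("s4", "G")]))
print(call_snp_illumina([("C", 30), ("C", 35), ("T", 28), ("T", 27)]))
```

In the second example the T allele has only one read above the quality cut-off, so no SNP is called; this is the kind of case the threshold is meant to exclude.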
20Illumina GA SNPs RRS
- Reduced representational sequencing (RRS)
- DNA pooled from 20 individuals/breeds
- HaeIII genome digest, 3 size fractions
- 75-90 bp
- 100-120 bp
- 130-155 bp
21Illumina GA sequencing
- 112 million reads from 3 Solexa runs (29 / 35 / 48 million)
- 33 bases / read
- 3.7 Gb in total
Total reads       112,075,672
Non CC count       22,965,875
N read count          830,817
Ave qual <25        8,308,027
Reads passed       84,675,240
Different seqs      9,048,701
Non singletons      2,942,112
- Total different sequences 0.3 Gb
- Different sequences represented more than once 2.9 million
- Singletons likely to be sequencing errors
- At 33 bases per read, just under 0.1 Gb
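The read filtering summarised above (reads not starting with CC, reads containing N, low average quality, then collapsing to different sequences and dropping singletons) can be sketched as follows. HaeIII cuts GG^CC, so a read that begins at a genuine fragment end should start with CC. The function name, input layout and the order in which the filters are applied are assumptions for illustration.

```python
from collections import Counter


def filter_reads(reads, min_mean_qual=25, prefix="CC"):
    """reads: iterable of (sequence, list_of_base_qualities).

    Returns (passed sequences, counts per different sequence,
    sequences seen more than once).
    """
    passed = []
    for seq, quals in reads:
        if not seq.startswith(prefix):
            continue  # does not start at a HaeIII cut site
        if "N" in seq:
            continue  # contains an uncalled base
        if sum(quals) / len(quals) < min_mean_qual:
            continue  # average quality too low
        passed.append(seq)
    counts = Counter(passed)
    non_singletons = {seq: n for seq, n in counts.items() if n > 1}
    return passed, counts, non_singletons


reads = [
    ("CCAT", [30, 30, 30, 30]),
    ("CCAT", [30, 30, 30, 30]),
    ("CCGT", [30, 30, 30, 30]),
    ("GGAT", [30, 30, 30, 30]),   # fails the CC filter
    ("CCNT", [30, 30, 30, 30]),   # fails the N filter
    ("CCAA", [10, 10, 10, 10]),   # fails the quality filter
]
passed, counts, non_singletons = filter_reads(reads)
print(len(passed), len(counts), non_singletons)
```

Dropping singletons trades sensitivity for specificity: a sequence seen only once cannot be distinguished from a sequencing error.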
22Calling Illumina GA SNPs
- 76,044 SNPs called
- Number depends on cut offs used
HaeIII fragments with sequences from both ends
23Calling SNPs from Illumina reads
- High sequencing error rate towards the end of the Illumina reads
- with large datasets, two identical sequences may actually result from sequencing errors if the base is late in the read
24SNP distribution
- HaeIII sites are not randomly distributed in the genome assembly
- perhaps not the best enzyme to have chosen!
- SNPs from Illumina sequencing follow HaeIII site distribution
- SNPs from 454 sequencing more evenly distributed across the genome
- more like the distribution of sequence coverage
25SNP validation
- So how likely are the SNPs called from the sequences to be real SNPs v. sequencing errors?
- 64 randomly selected Illumina SNPs
- 112 randomly selected 454 SNPs
- tested on 63 samples, including those used for discovery and the International mapping flock, using a Sequenom iPlex system
- Only two SNPs, one of each type, were not polymorphic
- More stringently, >80% passed QC (>85% genotype calls, HW equilibrium test, MAF >0.05 in the 63 selected animals)
- Predicts >85% validation on the Illumina Infinium system
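The stringent QC criteria above (genotype call rate, Hardy-Weinberg equilibrium, minor allele frequency) can be combined into one per-SNP check. This is a sketch only: the function name is hypothetical, the HWE test used here is a plain one-degree-of-freedom chi-square, and the p-value cut-off of 0.001 is an assumption because the slide does not state one.

```python
import math


def snp_passes_qc(n_aa, n_ab, n_bb, n_samples,
                  min_call_rate=0.85, min_maf=0.05, min_hwe_p=0.001):
    """Apply call-rate, MAF and HWE filters to genotype counts for one SNP."""
    called = n_aa + n_ab + n_bb
    if called == 0 or called / n_samples <= min_call_rate:
        return False
    p = (2 * n_aa + n_ab) / (2 * called)   # frequency of allele A
    q = 1 - p
    if min(p, q) <= min_maf:
        return False
    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (called * p * p, 2 * called * p * q, called * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    # Upper-tail probability of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p_value >= min_hwe_p


print(snp_passes_qc(20, 30, 10, n_samples=63))   # balanced genotypes
print(snp_passes_qc(30, 0, 30, n_samples=63))    # no heterozygotes at all
```

With only 63 animals an exact HWE test would be a reasonable alternative to the chi-square approximation.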
26SNP chip design I
- 60K high quality SNPs evenly distributed across the genome
- quality of candidates
- 1536 Sanger > Illumina > 454
- probability that variation is a real SNP, not a sequencing artefact
- Minor Allele Frequency
- not very accurate for 454 SNPs, more accurate for Illumina and 1536 SNPs
- bias against SNPs with low minor allele frequency
- spacing
- even v. favouring genes v. favouring favourite genome regions
- use of chip real estate
- Infinium I v. II (SNPs)
- Infinium II uses one position on chip v. two positions for Infinium I
- Infinium I: AT and GC SNPs, 17% of all SNPs
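The chip real estate rule above depends only on the two alleles of a SNP: A/T and C/G SNPs need the Infinium I design and two positions, all others fit the Infinium II design and one position. A minimal sketch of that rule, with hypothetical function names:

```python
def infinium_design(allele_a: str, allele_b: str) -> str:
    """Return which Infinium design a biallelic SNP requires."""
    pair = frozenset((allele_a.upper(), allele_b.upper()))
    if len(pair) != 2 or not pair <= set("ACGT"):
        raise ValueError("expected two different bases from A, C, G, T")
    # A/T and C/G SNPs are the ones listed for Infinium I on the slide.
    if pair in (frozenset("AT"), frozenset("CG")):
        return "Infinium I"
    return "Infinium II"


def positions_needed(design: str) -> int:
    """Chip positions used by one assay of the given design."""
    return 2 if design == "Infinium I" else 1


for alleles in (("A", "G"), ("A", "T"), ("G", "C")):
    design = infinium_design(*alleles)
    print(alleles, design, positions_needed(design))
```

Because an Infinium I assay costs twice the space, favouring Infinium II candidates lets more SNPs fit on the same chip.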
27SNP chip design II
- probe score
- Probability that assay will work on the chip (we are using the Illumina Infinium platform)
- 50 base oligo-probe
- Assay oligonucleotide should be unique in the genome
- But it is not complete sequence
- Assay oligonucleotide should not contain other SNPs
- Do not know all SNPs possible
28Not all SNPs can be converted to assays
- 454 SNPs: distribution of probe scores for Illumina Infinium assay
29SNP chip design issues
- >80% probability that observed variation is a useable SNP (average 92%)
- >80% probability that assay works (average 92%)
- >90% Infinium II assays
- 60K chip may actually have only 48K SNPs that return useful data
- Indeed, in use the filtered reliable dataset is 49K SNPs
- This is why it is marketed as the Ovine SNP50 BeadChip
30The 60K sheep SNP chip final design
- Used all 60,800 available beads: 59,454 SNPs
- Includes 138 validated SNPs being developed for a sheep parentage chip
- Average spacing
- still a few long gaps without any SNPs
- Average assay design score 0.975
Source               Infinium I   Infinium II    Total   Percent of available SNPs
Sanger - validated           29           571      600   43
454                       1,049        39,125   40,174   14
Illumina                    268        18,401   18,669   25
mtDNA - validated             0            11       11
Total                     1,346        58,108   59,454
Percent of total           2.26         97.74
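The table totals and the 60,800 bead figure can be cross-checked with a few lines of arithmetic, assuming (as stated earlier in the talk) that an Infinium I assay takes two positions and an Infinium II assay takes one. The dictionary layout and function name are illustrative.

```python
# (Infinium I, Infinium II) assay counts per SNP source, from the table.
DESIGN_COUNTS = {
    "Sanger - validated": (29, 571),
    "454": (1_049, 39_125),
    "Illumina": (268, 18_401),
    "mtDNA - validated": (0, 11),
}


def bead_budget(counts):
    """Sum assays by design and compute the bead positions they occupy."""
    inf1 = sum(i for i, _ in counts.values())
    inf2 = sum(ii for _, ii in counts.values())
    snps = inf1 + inf2
    return {
        "infinium_i": inf1,
        "infinium_ii": inf2,
        "snps": snps,
        "beads": 2 * inf1 + inf2,           # Infinium I uses two positions
        "percent_infinium_i": 100 * inf1 / snps,
    }


for name, value in bead_budget(DESIGN_COUNTS).items():
    print(name, round(value, 2))
```

Under that assumption the 1,346 Infinium I and 58,108 Infinium II assays occupy exactly 60,800 positions, which agrees with the statement that all available beads were used.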
31Distribution of gaps between SNPs
32How did we do?
- Ovine SNP50 BeadChip is in commercial production by Illumina
- 50k of the SNPs are assayable and return useable data
- Performance is on a par with dog, horse and human chips designed at much greater cost with much more information
- Equal performance of Sanger, Illumina and 454 SNPs
- Around one SNP in 1000 has been assigned to the wrong chromosome
- Much better performance than bovine chip
- How come, if we based it on assuming conserved synteny with the bovine genome?
- Because the first thing we did was to rebuild the bovine genome based on conserved mammalian synteny
- A judicious combination of new sequencing technologies, appropriate filters and comparative genomics approaches can produce very high quality results
- Generally applicable to a wide range of organisms for production and ecological reasons
33Genome and SNP chip sections of the International Sheep Genomics Consortium
- AgResearch NZ
- John McEwan
- Gemma Payne
- Nessa O'Sullivan
- Tracey Van Stijn
- Theresa Wilson
- Rudiger Brauning
- Alan McCulloch
- Russel Smithies
- Benoit Auvray
- University of Otago
- Jo Stanton
- Chrissie
- Mark
- Baylor College of Medicine
- Richard Gibbs
- Donna Muzny
- Michael E. Holder
- Lynne Nazareth
- Rebecca L. Thornton
- Christie Kovar
- CSIRO Livestock Industries
- Brian Dalrymple
- James Kijas
- David Townley
- Abhirami Ratnakumar
- Wes Barris
- Sean McWilliam
- Genesis Faraday
- Chris Warkup
- sheepGENOMICS
- Rob Forage
- Terry Longhurst
- TIGR
- Ewen Kirkness
- Uni Melbourne
- Jill Maddox
- USDA
- Tim Smith
- UNE
- Hutton Oddy
- Uni Sydney
- Frank Nicholas
- Herman Raadsma
- Utah State University
- Noelle Cockett
isgcdata.agresearch.co.nz
www.sheephapmap.org
www.sheepgenomics.com
www.livestockgenomics.csiro.au