Assembly of the sheep genome and identification of SNPs using 454 and Illumina sequencing and design of the Illumina Ovine SNP50 BeadChip

Slides: 34

Provided by: dalrympleb

1
Assembly of the sheep genome and identification
of SNPs using 454 and Illumina sequencing and
design of the Illumina Ovine SNP50 BeadChip
Brian P. Dalrymple on behalf of the
International Sheep Genomics Consortium
2
Sheep SNP (genome sequencing) project

International Sheep Genomics Consortium Project
Data generation and analysis
AgResearch/University of Otago
Baylor College of Medicine
CSIRO Livestock Industries
Universities of Sydney and Melbourne
Funded by
International Science Linkages
SheepGenomics
Ovita
Genesis Faraday
6 animals to 0.5X or more each, 454 FLX sequencing
60 animals, 20X reduced representational (1-5% of genome) sequencing, Solexa
To identify sheep SNPs
To design a 60K Illumina SNP chip
Work towards a sheep genome assembly
Cost and time
AU$2 million
18 months

3
Chromosome assembly jargon

Contigs
overlapping sequence reads assembled into a contiguous sequence
Scaffolds
a series of ordered and oriented contigs, but with gaps between contigs
Chromosome
a series of scaffolds ordered and oriented using links between sequence and a physical map

[Figure labels: assembled scaffold; paired end reads; sequence contigs; marker 1; marker 2; chromosome]
4
Our sequencing and assembly strategy

Position 454 FLX reads on a related genome sequence
The cow is the closest organism with an assembled genome
The longer 454 FLX sequences (average 230 bases) can be more accurately positioned than the short Illumina GA I reads
but large numbers of slightly divergent repeat sequences in the two genomes play havoc with alignments; repeats (generally longer than the read length) cannot be assembled and would be missing

[Figure labels: SNPs; Illumina reduced representational sequence; sheep sequence; 454 FLX paired end reads; 454 FLX 3X whole genome shotgun; Sanger genome assembly of the cow]
5
Converting a cow into a sheep
6
The importance of BACs

BACs are large segments of a genome inserted in a small vector that can propagate the foreign DNA in a bacterium
low copy number
constructed for most large genomes prior to sequencing
typically 100-200 kb in length
Frequently end sequenced to provide tags from the two ends

tail-to-tail BAC: same organisation as in original genome
tail-to-head BAC: one end inverted relative to original genome
head-to-head BAC: both ends inverted relative to original genome
broken BAC: linkage relative to original genome lost
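
The four orientation classes above can be sketched as a small classifier. This is only an illustration under stated assumptions: the `EndHit` record, the convention that a conserved BAC has its left end on the + strand and its right end on the - strand, and the 500 kb span limit are all hypothetical choices made for the example and are not given in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndHit:
    """Hypothetical record for one BAC-end hit on the reference genome."""
    chrom: str
    pos: int      # leftmost reference coordinate of the hit
    strand: str   # "+" or "-"


def classify_bac(a: EndHit, b: EndHit, max_span: int = 500_000) -> str:
    """Classify a BAC from the placement of its two end sequences.

    Assumed convention: in the conserved arrangement the left end maps to
    the + strand and the right end to the - strand.
    """
    # Ends on different chromosomes or too far apart: linkage is lost.
    if a.chrom != b.chrom or abs(a.pos - b.pos) > max_span:
        return "broken"
    left, right = sorted((a, b), key=lambda h: h.pos)
    # Count how many ends are flipped relative to the assumed convention.
    inverted = (left.strand != "+") + (right.strand != "-")
    return ("tail-to-tail", "tail-to-head", "head-to-head")[inverted]


if __name__ == "__main__":
    print(classify_bac(EndHit("chr1", 1_000, "+"), EndHit("chr1", 151_000, "-")))
```

Only the tail-to-tail class keeps the original organisation, which is why it is the class used to build BAC contigs in the following slides.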
7
Turn bovine genome into virtual sheep genome
Sheep BAC-end sequences
BLAST v. hg17 (human matches)
BLAST v. CanFam2 (dog matches)
BLAST v. Btau3.1 plus additional sequence (cow matches)
BLAST v. EquCab1 (horse matches)
Liftover match positions to Btau3.6x rebuilt bovine assembly
CSIROv1.0 bovine assembly
All sequences mapped to bovine assembly
Construct BAC contigs
Set of sheep BAC contigs on bovine assembly
Sheep markers
Using sheep markers, BACs and BAC contigs to reorder and reorientate segments of cow genome to their location in the sheep genome, number chromosomes correctly, etc.
Virtual sheep genome build v2
8
Maximizing genome coverage by sheep BACs
[Figure labels: sheep BAC; liftover coordinates to cow genome; 10 - 500 kb]
Map 350,000 BAC-ends to all four genomes using very sensitive BLASTn parameters (a week per genome on a 66 dual-processor cluster), then convert to cow genome coordinates using conversion coordinates precalculated at UCSC using Blastz.
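
The coordinate conversion step can be pictured with a minimal sketch. The `Block` record and `lift` function below are hypothetical and assume that the precalculated conversions are available as sorted, non-overlapping, gap-free aligned blocks; real UCSC conversion files are richer than this, so treat it as an illustration of the idea only.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence, Tuple


class Block(NamedTuple):
    """Hypothetical gap-free aligned block between two genomes."""
    src_start: int    # 0-based, inclusive, on the source genome
    src_end: int      # exclusive
    dst_chrom: str    # target (cow) chromosome
    dst_start: int    # leftmost target coordinate of the block
    dst_strand: str   # "+" or "-"


def lift(pos: int, blocks: Sequence[Block]) -> Optional[Tuple[str, int]]:
    """Convert a source position to target coordinates, or None if unaligned.

    `blocks` must be sorted by src_start and must not overlap.
    """
    starts = [b.src_start for b in blocks]
    i = bisect_right(starts, pos) - 1   # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    if pos >= block.src_end:            # pos falls in a gap between blocks
        return None
    offset = pos - block.src_start
    if block.dst_strand == "+":
        return block.dst_chrom, block.dst_start + offset
    # Reverse-strand block: the offset counts back from the block's far end.
    length = block.src_end - block.src_start
    return block.dst_chrom, block.dst_start + length - 1 - offset


if __name__ == "__main__":
    demo = [Block(100, 200, "chr5", 1000, "+"), Block(300, 400, "chr5", 5000, "-")]
    print(lift(150, demo), lift(250, demo))
```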
9
Maximizing genome coverage by tail-to-tail BACs
Method | Number of BACs positioned | Coverage in BAC-contigs built using tail-to-tail BACs
tail-to-tail, both ends significant BLAST score (traditional approach) | 170 | 0.4 fold
As above and one end significant BLAST score | 1680 | 0.4 fold
tail-to-tail integrated analysis | 2450 | 6 fold

Examples for HSA18/OAR23: high sensitivity BLAST and position filter maximise coverage with tail-to-tail BACs and hence coverage by contigs
10
The benefit of using the integration
Species | TT BACs on genome | BAC contigs on genome
dog | 56,482 | 2,146
horse | 60,229 | 1,470
cow | 76,251 | 1,411
human + cow + dog | 79,996 | 1,299
human + cow + dog + horse | 95,797 | 943

1400 markers available to anchor BAC contigs to sheep map
11
Turning a cow into a sheep - low resolution
[Figure: sheep chromosomes OAR1-OAR26 and OARX shown against the corresponding cattle (BTA) chromosomes]
5 chromosome inversions, 4 chromosome fusions, 1 split chromosome
12
Assembling the sheep genome sequence
13
Ovine Draft genome assembly

454/Roche genome sequencing on GS FLX
3-fold coverage: 6 animals, each 0.5-fold
42.6 million reads
Average length 226 bases
9.6 Gb
3.5 fold assuming a 2.7 Gb genome
Assembly contigs
Contig count 2,314,423
Mean contig length 512 bases
Median contig length 377 bases
Total assembled sequence 1.18 Gb
43% of the genome
Repeats not assembled
89% of bovine RefSeqs map uniquely to the assembly
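
The headline figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted on the slide; the 2.7 Gb genome size is the slide's own assumption, and the variable names are illustrative.

```python
READS = 42.6e6            # 454 reads
MEAN_READ_LEN = 226       # bases
GENOME_SIZE = 2.7e9       # bases, assumed genome size from the slide
CONTIG_COUNT = 2_314_423
MEAN_CONTIG_LEN = 512     # bases


def fold_coverage(n_reads: float, mean_len: float, genome_size: float) -> float:
    """Raw sequence coverage: total bases divided by genome size."""
    return n_reads * mean_len / genome_size


total_bases = READS * MEAN_READ_LEN
fold = fold_coverage(READS, MEAN_READ_LEN, GENOME_SIZE)
assembled_bases = CONTIG_COUNT * MEAN_CONTIG_LEN
genome_fraction = assembled_bases / GENOME_SIZE

print(f"sequenced: {total_bases / 1e9:.2f} Gb, coverage: {fold:.2f} fold")
print(f"assembled: {assembled_bases / 1e9:.2f} Gb, "
      f"fraction of genome: {genome_fraction:.1%}")
```

Contig count times mean contig length reproduces the quoted 1.18 Gb of assembled sequence to within rounding, and the coverage and genome fraction come out close to the rounded values on the slide.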

14
Sequence assembly pipeline
454 sequence BCM
454 sequence AgResearch/Otago
John McEwan and colleagues at AgResearch
Sequence database
repeat mask
lower case masked 454 reads
MegaBLAST -D 3 -W 11 s 55 -U T -F m D retain
only top hit
Bovine genome
stored summarised MegaBLAST results
Using hits to the bovine genome in bins of 1 Mb at a time, use Newbler to build sheep contigs
Sheep 454 contigs
Position sheep contigs on vsg2, replace cow sequence with sheep sequence, remove remaining cow sequence
Virtual sheep genome build v2
Sheep genome assembly v1.0
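
The binning step in this pipeline can be sketched as follows. The input layout (a tuple of read id, bovine chromosome, hit start and score) and the function name are assumptions made for illustration; the actual stored MegaBLAST summaries are not described in the talk, and the assembly of each bin (done with Newbler in the talk) is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

BIN_SIZE = 1_000_000  # 1 Mb bins on the bovine genome

Hit = Tuple[str, str, int, float]  # (read_id, chrom, start, score), assumed layout


def bin_reads_by_top_hit(hits: Iterable[Hit]) -> Dict[Tuple[str, int], List[str]]:
    """Keep only the best-scoring hit per read, then group reads by 1 Mb bin."""
    best: Dict[str, Tuple[str, int, float]] = {}
    for read_id, chrom, start, score in hits:
        # Retain only the top hit for each read.
        if read_id not in best or score > best[read_id][2]:
            best[read_id] = (chrom, start, score)

    bins: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for read_id, (chrom, start, _score) in best.items():
        bins[(chrom, start // BIN_SIZE)].append(read_id)
    return dict(bins)


if __name__ == "__main__":
    demo = [
        ("r1", "chr1", 1_500_000, 80.0),
        ("r1", "chr2", 10, 60.0),          # weaker second hit, discarded
        ("r2", "chr1", 1_999_999, 70.0),
        ("r3", "chr1", 2_000_000, 90.0),
    ]
    for key, reads in bin_reads_by_top_hit(demo).items():
        print(key, reads)
```

Each bin's reads would then be handed to the assembler separately, which keeps every assembly job small.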
15
The sheep genome assembly

Comprises 454 sequence contigs organised into sheep chromosomes
2.5 million contigs
average length 480 bases
ordered and oriented using vsheep v2 framework
cover 43% of the sheep genome
cover 80% of the unique fraction of the genome

[Figure labels: 15 sheep contigs; 1 bovine contig]
16
How close is the assembly to the real sheep?

The local order of sheep contigs is heavily dependent on the bovine genome assembly
Used a light pass of 454 paired-end sequencing reads as a quality control check of the assembly
short paired ends, 30 bases long
89% in tail-to-tail arrangement
long paired ends, 70-100 bases long
98% in tail-to-tail arrangement
BAC-end sequences
93% in tail-to-tail arrangement

17
Distributions of lengths of tail-to-tail paired-end reads
[Figure panels: sonication aim 2.5 kb; sonication aim 3 kb]
18
Designing the SNP chip
19
Finding the SNPs

Pilot 1,536 SNP chip SNPs
Already have a set of 1,536 SNPs from a pilot project
Sanger resequencing of BAC-end sequences
454 SNPs
41 million reads
align 454 reads to assembled sheep genome sequence
require at least two reads (each from a different animal) for each allele to call a SNP
265,000 high confidence SNPs called
Illumina SNPs
48 million reads
filter based on starting CC, quality, remove singletons
align 28.5 million Solexa reads to assembled sheep genome sequence
3.7 million unique sequences
require at least two reads of each allele with a quality score >27 at the putative variable base
76,044 high confidence SNPs called
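
The two calling rules above can be written down as a short sketch. The function names and input layouts are hypothetical, and restricting calls to exactly two supported alleles is an added assumption for the example; the thresholds (two reads per allele, different animals for 454, quality above 27 for Illumina) are the ones stated on the slide.

```python
from collections import Counter, defaultdict
from typing import Iterable, Optional, Tuple


def call_snp_454(observations: Iterable[Tuple[str, str]],
                 min_animals: int = 2) -> Optional[Tuple[str, ...]]:
    """observations: (animal_id, base) pairs at one aligned position.

    An allele is supported when reads from at least two different animals
    carry it. Returns the two alleles, or None if no SNP is called.
    """
    animals_by_allele = defaultdict(set)
    for animal, base in observations:
        animals_by_allele[base].add(animal)
    supported = sorted(base for base, animals in animals_by_allele.items()
                       if len(animals) >= min_animals)
    return tuple(supported) if len(supported) == 2 else None


def call_snp_illumina(observations: Iterable[Tuple[str, int]],
                      min_reads: int = 2,
                      min_qual: int = 27) -> Optional[Tuple[str, ...]]:
    """observations: (base, quality) pairs at the putative variable base.

    Only bases with quality strictly above min_qual are counted.
    """
    counts = Counter(base for base, qual in observations if qual > min_qual)
    supported = sorted(base for base, n in counts.items() if n >= min_reads)
    return tuple(supported) if len(supported) == 2 else None


if __name__ == "__main__":
    print(call_snp_454([("a1", "A"), ("a2", "A"), ("a3", "G"), ("a4", "G")]))
    print(call_snp_illumina([("C", 30), ("C", 35), ("T", 28), ("T", 40)]))
```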

20
Illumina GA SNPs RRS

Reduced representational sequencing (RRS)
DNA pooled from 20 individuals/breeds
HaeIII genome digest, 3 size fractions
75-90 bp
100-120 bp
130-155 bp

21
Illumina GA sequencing

112 million reads from 3 Solexa runs (29 / 35 / 48 million)
33 bases / read
3.7 Gb in total

Total reads | Non-CC count | N read count | Ave qual <25 | Reads passed | Different seqs | Non-singletons
112,075,672 | 22,965,875 | 830,817 | 8,308,027 | 84,675,240 | 9,048,701 | 2,942,112

Total different sequences 0.3 Gb
Different sequences represented more than once: 2.9 million
Singletons likely to be sequencing errors
At 33 bases per read, just under 0.1 Gb
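
A minimal sketch of the read filter and the base-count arithmetic follows. HaeIII cuts GG^CC, so genuine fragment ends should begin with CC. The function names, the input layout (sequence plus per-base qualities), and the exact order and form of the filters are assumptions for illustration; only the thresholds and read counts come from the slide.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

READ_LEN = 33  # bases per read


def passes_filters(seq: str, quals: Sequence[int], min_avg_qual: float = 25) -> bool:
    """Keep reads that start with CC, contain no N, and have adequate quality."""
    return (seq.startswith("CC")
            and "N" not in seq
            and sum(quals) / len(quals) >= min_avg_qual)


def non_singletons(reads: Iterable[Tuple[str, Sequence[int]]]) -> Dict[str, int]:
    """Count passing reads per distinct sequence and drop those seen once."""
    counts = Counter(seq for seq, quals in reads if passes_filters(seq, quals))
    return {seq: n for seq, n in counts.items() if n > 1}


# Base totals implied by the read counts in the table.
total_gb = 112_075_672 * READ_LEN / 1e9
different_gb = 9_048_701 * READ_LEN / 1e9
non_singleton_gb = 2_942_112 * READ_LEN / 1e9

if __name__ == "__main__":
    print(f"{total_gb:.2f} Gb total, {different_gb:.2f} Gb different, "
          f"{non_singleton_gb:.3f} Gb non-singleton")
```

The three products agree with the 3.7 Gb, 0.3 Gb and "just under 0.1 Gb" figures quoted on the slide.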

22
Calling Illumina GA SNPs

76,044 SNPs called
Number depends on cut offs used

[Figure: HaeIII fragments with sequences from both ends]
23
Calling SNPs from Illumina reads

High sequencing error rate towards the end of the Illumina reads
With large datasets, two identical sequences may actually result from sequencing errors if the base is late in the sequence read

24
SNP distribution

HaeIII sites are not randomly distributed in the genome assembly
perhaps not the best enzyme to have chosen!
SNPs from Illumina sequencing follow HaeIII site distribution
SNPs from 454 sequencing more evenly distributed across the genome
more like the distribution of sequence coverage

25
SNP validation

So how likely are the SNPs called from the sequences to be real SNPs v. sequencing errors?
64 randomly selected Illumina SNPs
112 randomly selected 454 SNPs
tested on 63 samples, including those used for discovery and the International mapping flock, using a Sequenom iPlex system.
Only two SNPs, one of each type, were not polymorphic
More stringently, >80% passed QC (>85% genotype calls, HW equilibrium test, MAF >0.05 in the 63 selected animals)
Predicts >85% validation on the Illumina Infinium system.
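
The three QC criteria named above (genotype call rate, Hardy-Weinberg equilibrium, minor allele frequency) can be sketched for a single SNP. The genotype encoding ("AA", "AB", "BB", or None for a failed call) and the chi-square cutoff of 3.84 (one degree of freedom, p = 0.05) are assumptions for illustration; the talk does not say which HW test or threshold was used.

```python
from typing import Optional, Sequence


def snp_passes_qc(genotypes: Sequence[Optional[str]],
                  min_call_rate: float = 0.85,
                  min_maf: float = 0.05,
                  max_chi2: float = 3.84) -> bool:
    """Return True if one SNP passes call rate, MAF and HW equilibrium checks."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return False

    call_rate = n / len(genotypes)
    n_aa, n_ab, n_bb = (called.count(g) for g in ("AA", "AB", "BB"))
    p = (2 * n_aa + n_ab) / (2 * n)        # frequency of allele A
    maf = min(p, 1 - p)
    if call_rate <= min_call_rate or maf <= min_maf:
        return False

    # Chi-square goodness of fit against Hardy-Weinberg expectations.
    expected = (p * p * n, 2 * p * (1 - p) * n, (1 - p) ** 2 * n)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2 <= max_chi2


if __name__ == "__main__":
    sample = ["AA"] * 16 + ["AB"] * 31 + ["BB"] * 16   # 63 animals
    print(snp_passes_qc(sample))
```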

26
SNP chip design I

60K high quality SNPs evenly distributed across the genome
quality of candidates
1536 Sanger > Illumina > 454
probability that variation is a real SNP, not a sequencing artefact
Minor Allele Frequency
not very accurate for 454 SNPs, more accurate for Illumina and 1536 SNPs
bias against SNPs with low minor allele frequency
spacing
even v. favouring genes v. favouring favourite genome regions
use of chip real estate
Infinium I v. II (SNPs)
Infinium II uses one position on chip v. two positions for Infinium I
Infinium I: A/T and G/C SNPs, 17% of all SNPs
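
The chip real estate trade-off can be illustrated with a small sketch that assigns an assay type from a SNP's two alleles and counts the chip positions needed. The function names are hypothetical; the rule itself (A/T and G/C SNPs need Infinium I and two positions, all others use Infinium II and one position) is the one stated on the slide.

```python
from typing import Iterable, Tuple


def infinium_design(allele_a: str, allele_b: str) -> str:
    """Return the assay type required for a biallelic SNP."""
    alleles = frozenset((allele_a.upper(), allele_b.upper()))
    if len(alleles) != 2 or not alleles <= set("ACGT"):
        raise ValueError("expected two different bases from A, C, G, T")
    # A/T and C/G SNPs need the two-position Infinium I design.
    if alleles in (frozenset("AT"), frozenset("CG")):
        return "Infinium I"
    return "Infinium II"


def chip_positions(snps: Iterable[Tuple[str, str]]) -> int:
    """Total chip positions used: two per Infinium I SNP, one per Infinium II."""
    return sum(2 if infinium_design(a, b) == "Infinium I" else 1
               for a, b in snps)


if __name__ == "__main__":
    candidates = [("A", "T"), ("A", "G"), ("C", "T")]
    print([infinium_design(a, b) for a, b in candidates],
          chip_positions(candidates))
```

Preferring Infinium II candidates therefore lets more SNPs fit on the same number of positions.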

27
SNP chip design II

probe score
Probability that the assay will work on the chip (we are using the Illumina Infinium platform)
50 base oligo-probe
Assay oligonucleotide should be unique in the genome
But the genome sequence is not complete
Assay oligonucleotide should not contain other SNPs
But we do not know all possible SNPs

28
Not all SNPs can be converted to assays

[Figure: 454 SNPs, distribution of probe scores for Illumina Infinium assay]

29
SNP chip design issues

>80% probability that observed variation is a useable SNP (average 92%)
>80% probability that assay works (average 92%)
>90% Infinium II assays
60K chip may actually have only 48K SNPs that return useful data.
Indeed, in use the filtered reliable dataset is 49K SNPs
This is why it is marketed as the Ovine SNP50 BeadChip

30
The 60K sheep SNP chip final design

Used all 60,800 available beads: 59,454 SNPs
Includes 138 validated SNPs being developed for a sheep parentage chip
Average spacing
still a few long gaps without any SNPs
Average assay design score 0.975

Source | Infinium I | Infinium II | Total | Percent of available SNPs
Sanger - validated | 29 | 571 | 600 | 43
454 | 1,049 | 39,125 | 40,174 | 14
Illumina | 268 | 18,401 | 18,669 | 25
mtDNA - validated | 0 | 11 | 11 |
Total | 1,346 | 58,108 | 59,454 |
Percent of total | 2.26 | 97.74 | |
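
The table totals and the bead count can be checked against each other. The sketch below takes the per-source counts from the table and applies the earlier statement that an Infinium I assay uses two chip positions and an Infinium II assay one; treating "beads" and "positions" as the same unit is an assumption of the example.

```python
# (Infinium I, Infinium II) SNP counts per source, copied from the table.
design = {
    "Sanger - validated": (29, 571),
    "454": (1_049, 39_125),
    "Illumina": (268, 18_401),
    "mtDNA - validated": (0, 11),
}

infinium_1 = sum(i for i, _ in design.values())
infinium_2 = sum(ii for _, ii in design.values())
total_snps = infinium_1 + infinium_2

# Two positions per Infinium I assay, one per Infinium II assay.
beads_used = 2 * infinium_1 + infinium_2

pct_infinium_1 = 100 * infinium_1 / total_snps
pct_infinium_2 = 100 * infinium_2 / total_snps

print(f"Infinium I: {infinium_1}, Infinium II: {infinium_2}, SNPs: {total_snps}")
print(f"beads used: {beads_used}")
print(f"Infinium I: {pct_infinium_1:.2f}%, Infinium II: {pct_infinium_2:.2f}%")
```

Under that assumption the column sums match the Total row, and the bead count comes out at the 60,800 available beads quoted on the slide.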
31
Distribution of gaps between SNPs
32
How did we do?

Ovine SNP50 BeadChip is in commercial production by Illumina
50K of the SNPs are assayable and return useable data
Performance is on a par with dog, horse and human chips designed at much greater cost with much more information
Equal performance of Sanger, Illumina and 454 SNPs
Around one SNP in 1000 has been assigned to the wrong chromosome
Much better performance than bovine chip
How come, if we based it on assuming conserved synteny with the bovine genome?
Because the first thing we did was to rebuild the bovine genome based on conserved mammalian synteny
A judicious combination of new sequencing technologies, appropriate filters and comparative genomics approaches can produce very high quality results
Generally applicable to a wide range of organisms for production and ecological reasons