About This Presentation
Title:

Genome Sequence Informatics

Slides: 76
Provided by: niclasj5
Learn more at: http://teacher.bmc.uu.se
Transcript and Presenter's Notes

1
Genome Sequence Informatics: Comparative
Genome Sequence Analysis
  • Niclas Jareborg
  • AstraZeneca R&D Södertälje

2
Genome sequencing projects
  • Aim: Better understanding of biology
  • Bioinformatics
  • Manage data
  • Cut corners
  • Generate and test new hypotheses
  • Make the most of the data
  • comparative analysis

3
Where are the functional elements?
gttaaaattcagcaggcagaatgaaaataaatgtcaataattttttattt
taaaatattcatgttttactattttgatataatttttaaagaaaaaggc
a gaaaccactgcttattagaaggcagattttattgattttataccccta
ga cttgttgcatatcaaacctatgtaaaaacatctataaatcaaatcat
taa ttgcacctagtataataattctatatatggaggtaatgtttgattc
ttca ggagctttaataacttgaagcccgtttgattgctttaaaatgatt
tctca ttgtatttgtttatattgtatcattaagcaaaagtacagagtaa
gcaatt agtgtgattaattcctcttccataatacagtaaagcactgcct
ccataga ccaattctctgggatccctggaaaacatctggcatccagcaa
gtcttgac ccctctttagaaagccatggagaaactggaggcaattctgt
taattattt gccctctagaggcaattgggttaattaccctcccttccct
atccatgaca caatttctccagttacatgtagaatgctgttatgtgtct
cctgaccagac cccttatttcatagatgtggaaactgaggccatgaagg
atgaggtgactg ttcacaatccacatggctagttagtgtccagagcctg
gcctggacttctc tcttgttctggggccttgagttctctccctcttctt
tagtacatatggcc acaggtaacgtaatctgcgtaccacatttgcattt
ggagtgcatctgttt tgcattcatttaatcttgttgagatggtttgctt
gttgacctactcagtc agttatcttttcacctttgtgagttgagagctt
tgtgtattaaatctgta aaactttgcatcgtggaaagtgacataatctg
tagcagacccatgctgtt tttagatgcatcttcattgtggtagtgacag
tgattgagaaactttacat
4
Features in genome sequences
  • Genes
  • Exons, introns, promoters
  • RNA genes
  • CpG islands
  • Enhancers
  • Other functional elements
  • e.g. Replication origins, Nuclear matrix
    association
  • Repeats

5
How to find genomic features
  • Repeats, CpG islands, RNA genes
  • Bioinformatics programs
  • Genes
  • Homology to known sequences
  • Bioinformatics prediction programs
  • Transcription regulatory regions
  • Bioinformatics prediction programs

6
Finding genes by homology
  • Database searches: BLAST, BLAT, SSAHA
  • EST and cDNA sequences
  • Protein sequences
  • High accuracy, but misses unknown sequences
  • Caveat: junk EST sequences

7
GeneWise (Birney & Durbin)
Alignment of DNA to protein (or HMM), allowing for
splicing. Uses dynamic programming with extra
states for introns.
8
[Figure: GeneWise alignment of pkinase.hmm (protein kinase HMM) to genomic sequence HSU71B4, showing Intron 1 to Intron 8]
9
Gene prediction methods
  • ATGs
  • Stop codons
  • ORFs
  • Coding preference
  • Splice sites
  • profiles, statistical methods, neural networks
    etc.
  • High coverage, low accuracy
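
The first three signals in the list above (ATGs, stop codons, ORFs) can be illustrated with a minimal sketch. It scans only the forward strand, ignores splicing and coding preference, and is not how any of the prediction programs named in this talk work; the function name and the `min_codons` parameter are made up for illustration.

```python
STOP_CODONS = {"taa", "tag", "tga"}

def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs of ATG...stop ORFs on the forward strand.

    Coordinates are 0-based with an exclusive end; the stop codon is
    included in the ORF and counted towards min_codons.
    """
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "atg":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)

print(find_orfs("ccatgaaatgacc"))
```

On real genomic DNA a scan like this reports many spurious ORFs, which is one reason signal-based prediction has high coverage but low accuracy.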

10
Accuracy of gene-finding programs for the 1.4 Mb
genomic region BRCA2 on human chromosome 13q
Region includes 159 true exons

                        exact match     overlap exons   5' splice site  3' splice site
program             NE     N  acc  cov     N  acc  cov     N  acc  cov     N  acc  cov
fgenesh.masked     169   110 0.65 0.69   125 0.74 0.79   118 0.70 0.74   116 0.69 0.73
fgenesh            190   109 0.57 0.69   126 0.66 0.79   117 0.62 0.74   117 0.62 0.74
fgenes.masked      238   103 0.43 0.65   132 0.55 0.83   114 0.48 0.72   118 0.50 0.74
fgenes             281   104 0.37 0.65   136 0.48 0.86   116 0.41 0.73   120 0.43 0.75
genscan            292   105 0.36 0.66   129 0.44 0.81   116 0.40 0.73   115 0.39 0.72
fgeneh             381    68 0.18 0.43   101 0.27 0.64    79 0.21 0.50    87 0.23 0.55
mzef               623    95 0.15 0.60   122 0.20 0.77   106 0.17 0.67   107 0.17 0.67
fgeneshmgenescan   118    97 0.82 0.61   106 0.90 0.67   101 0.86 0.64   101 0.86 0.64
fgeneshmfgenes      89    83 0.93 0.52    86 0.97 0.54    86 0.97 0.54    83 0.93 0.52

acc - specificity (true predicted/all predicted)
cov - sensitivity (true predicted/true)
NE - number of predicted exons
Data provided by Tim Hubbard and Richard
Bruskiewich (Sanger Centre)
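
The acc and cov columns follow directly from the counts in the table. A small sketch, using the definitions given in the legend (the function name is only illustrative):

```python
def exon_accuracy(n_correct, n_predicted, n_true):
    """Return (acc, cov) rounded to two decimals.

    acc = specificity = correctly predicted / all predicted
    cov = sensitivity = correctly predicted / all true
    """
    acc = n_correct / n_predicted
    cov = n_correct / n_true
    return round(acc, 2), round(cov, 2)

# fgenesh.masked, exact match: 110 correct out of 169 predicted, 159 true exons
print(exon_accuracy(110, 169, 159))
# genscan, exact match: 105 correct out of 292 predicted
print(exon_accuracy(105, 292, 159))
```

The two programs find a similar number of true exons, but genscan predicts many more exons overall, so its specificity is much lower.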
11
Repetitive elements
  • 1/3 of the human genome
  • Transposable elements
  • LINEs (Long Interspersed Nuclear Elements), 6-8
    kb
  • SINEs (Short Interspersed Nuclear Elements, e.g.
    Alu), 100-400 bp
  • Retrovirus-like elements, 1.5-10 kb (LTRs
    300-1000 bp)
  • DNA transposons, 80 bp-3 kb
  • Tandem repeats
  • Simple repeats/Microsatellites (1-5bp)n, e.g.
    caacaacaa
  • Minisatellites (6-1000s bp)n
  • Low complexity regions

12
Repeat masking
  • Repeats disturb analysis
  • Homology searching
  • Gene prediction
  • Masking: replace repeat regions with N's, which
    are ignored by analysis programs
  • RepeatMasker (Smit & Green)
  • LINEs, SINEs, LTR transposons, DNA transposons,
    Simple repeats, Low complexity regions
  • trf (Benson)
  • Tandem repeats
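
The masking step itself is simple once the repeat coordinates are known. A minimal sketch, assuming the repeats arrive as 0-based, end-exclusive (start, end) intervals; this is an assumed input format for illustration, not the output format of RepeatMasker or trf:

```python
def mask_repeats(seq, repeats):
    """Replace every repeat interval in seq with N's, keeping the length unchanged."""
    masked = list(seq)
    for start, end in repeats:
        start = max(start, 0)
        end = min(end, len(seq))  # clamp so the sequence never grows
        if end > start:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)

print(mask_repeats("acgtacgtac", [(2, 5)]))  # acNNNcgtac
```

Keeping the length unchanged means that coordinates reported by downstream programs still refer to the original sequence.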

13
Predicting regulatory regions
  • Transcription Factor Binding Sites (TFBSs) have
    very low information content
  • Given a long enough sequence a binding site will
    be predicted
  • Combination of TFBSs
  • Even the best algorithms will overpredict
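
A rough calculation shows why short sites are found by chance. The sketch below assumes a simplified model that is not from the slides: the site is a fixed string of length k that must match exactly, all four bases are equally likely, and positions are independent. Real TFBS models (e.g. weight matrices) tolerate mismatches, so they match even more often.

```python
def expected_chance_hits(seq_length, site_length, both_strands=True):
    """Expected number of exact matches to a fixed site in random sequence."""
    positions = max(seq_length - site_length + 1, 0)
    per_strand = positions / 4 ** site_length
    return 2 * per_strand if both_strands else per_strand

# A 6 bp site in 10 kb of random sequence, both strands: about 4.9 chance hits
print(round(expected_chance_hits(10_000, 6), 1))
```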

14
CpG islands
  • Associated with transcribed genes
  • Housekeeping genes; 50% of other genes
  • Often in 5' ends of genes
  • >200 bp
  • GC content >50%
  • obs/exp CpG >0.6
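
The three criteria above (length over 200 bp, GC content over 50%, observed/expected CpG over 0.6) can be checked for a candidate region as follows. This is a sketch for a single region, using the common definition obs/exp = (number of CpG × length) / (number of C × number of G); a real CpG island finder also has to scan windows along the sequence, which is not shown.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")
    expected = c * g / n  # CpG count expected if C and G were independent
    obs_exp = observed / expected if expected else 0.0
    return (c + g) / n, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the three thresholds from the slide to one candidate region."""
    gc, obs_exp = cpg_stats(seq)
    return len(seq) > min_len and gc > min_gc and obs_exp > min_obs_exp

print(looks_like_cpg_island("CG" * 150))  # True
print(looks_like_cpg_island("AT" * 150))  # False
```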

15
Gene Ontology
"Biologists would rather share a toothbrush than
a gene name" - Michael Ashburner
  • Controlled vocabulary that can be applied to all
    organisms even as knowledge of gene and protein
    roles in cells is accumulating and changing.

16
Gene Ontology
  • Organizing principles
  • Molecular function
  • Biological process
  • Cellular component
  • Hierarchical structure

17
Genome resources
  • Genome sequence centered
  • Ensembl
  • http://www.ensembl.org
  • NCBI
  • http://www.ncbi.nlm.nih.gov
  • UCSC Human genome browser
  • http://genome.ucsc.edu
  • All based on NCBI assembly
  • Gene centered
  • SOURCE
  • http://source.stanford.edu
  • GeneLynx
  • http://www.genelynx.org
  • GeneCards
  • http://bioinformatics.weizmann.ac.il/cards/

18
Ensembl
19
Ensembl Map view
20
Ensembl Contig view
21
Ensembl Contig view
22
Ensembl Gene view
23
Ensembl Gene view
24
Ensembl Gene view
25
NCBI Genome resources
26
NCBI Map View
27
NCBI Locus Link
28
NCBI Sequence view
29
UCSC Genome browser
30
UCSC Genome browser
31
UCSC Genome browser
32
Gene-centered resources
  • Genomic resources
  • Transcripts
  • Protein sequences
  • Protein structure and domains
  • Protein function and disease links
  • Homologs
  • Functional/GO classifications
  • Physical clones
  • etc

33
Comparative Genomic Sequence Analysis
  • Aid in finding functional regions
  • Coding regions
  • Regulatory regions

34
Comparative Genomic Sequence Analysis
  • Compare corresponding genomic sequences from
    different species
  • Potential protein coding and/or regulatory
    regions can be identified by their conservation
  • Phylogenetic footprinting

35
Why it works
36
Synteny maps
  • Maps corresponding regions in different genomes
  • Large-scale relationships
  • Based on
  • genetics
  • sequence
  • Available for
  • Human vs.
  • Mouse
  • Rat
  • Dog
  • Chimp
  • etc
  • Mouse vs Rat

37
Ensembl synteny views
  • Protein sequence based

38
NCBI comparative maps
  • Based on genetics
  • Several genetic maps

39
Human/vertebrate sequence comparisons (80-450
Myrs)
  • Coding sequences generally well conserved
  • Non-coding regions show highly variable levels of
    conservation
  • Conservation of non-coding regions implies a
    functional role
  • promoters
  • other transcriptional regulators
  • replication origins
  • chromatin condensation
  • matrix association

40
Model organisms for vertebrate comparative
analysis
  • Not too evolutionarily close
  • Otherwise impossible to identify functional regions through
    conservation
  • Mouse: 3000 Mb, 80 Myrs
  • Genetics
  • Sequence finished
  • Chicken: 1200 Mb, 300 Myrs
  • Micro-chromosomes (75% of genes)
  • Prioritized for sequencing
  • Fugu (Puffer fish): 400 Mb, 450 Myrs
  • Small genome, shorter introns and intergenic
    regions
  • More or less the same gene content as higher
    vertebrates
  • Sequence finished

41
What are we comparing?
  • Homologue
  • common ancestor, may have similar function
  • Orthologue
  • the same sequence, generated by a speciation
    event, probably same function
  • Paralogue
  • similar sequence within species, generated by a
    gene duplication event, may have similar function

42
Globins (I)
43
Globins (II)
44
Finding conserved regions
  • Dot plot
  • Dotter
  • Similarity search programs
  • Blast
  • Alignment programs
  • DBA (Jareborg et al.)
  • blastz (Schwartz et al.)
  • Dialign (Morgenstern et al.)
  • WABA (Kent & Zahler)
  • Avid (Bray et al.)
  • others

45
Dotter (Sonnhammer & Durbin)
  • Graphical dot plot program for detailed
    comparison of two sequences
  • Features
  • dynamic greyscale ramp for stringency cut-off
  • alignment viewer
  • zooming.
  • Unix & Windows
  • http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html
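
The idea behind a dot plot can be sketched in a few lines. This version marks a dot wherever two sequences share an exact word of a given length; it is a deliberate simplification and not Dotter's method, which scores windows and lets the user adjust the stringency interactively. The function name and `word` parameter are illustrative.

```python
def dot_plot(seq_a, seq_b, word=3):
    """Return sorted (i, j) pairs where seq_a[i:i+word] == seq_b[j:j+word]."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    return sorted(
        (i, j)
        for i in range(len(seq_a) - word + 1)
        for j in index.get(seq_a[i:i + word], [])
    )

print(dot_plot("acgtac", "ttacgt", word=3))
```

Consecutive dots on the same diagonal, such as (0, 2) and (1, 3) in the example, indicate a shared stretch longer than one word.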

46
(No Transcript)
47
DBA (Jareborg, Birney & Durbin)
  • DNA Block Aligner
  • Finds co-linear blocks with high similarity
  • Does not try to align the sequences between these
    blocks
  • Divides blocks into four different categories
  • approx. 60-70%, 70-80%, 80-90%, 90-100% identity
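
To make the four categories concrete, the sketch below computes the percent identity of an already aligned, gap-free block and assigns it to one of the ranges listed above. It only illustrates the binning; DBA itself finds and classifies blocks with its own alignment model, which is not reproduced here.

```python
def percent_identity(block_a, block_b):
    """Percent identical positions in two aligned, equal-length, gap-free blocks."""
    if not block_a or len(block_a) != len(block_b):
        raise ValueError("blocks must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(block_a.lower(), block_b.lower()))
    return 100.0 * matches / len(block_a)

def identity_category(pid):
    """Return the identity range a block falls into, or None if below 60%."""
    for low, high in ((90, 100), (80, 90), (70, 80), (60, 70)):
        if pid >= low:
            return f"{low}-{high}%"
    return None

pid = percent_identity("acgtacgtac", "acgtacgttt")
print(pid, identity_category(pid))
```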

48
Comparison-based functional prediction
  • Gene prediction
  • Regulatory region predictions

49
Comparative gene prediction programs
  • Twinscan
  • Doublescan
  • SGP-1

http://genes.cs.wustl.edu/
http://www.sanger.ac.uk/Software/analysis/doublescan/
http://195.37.47.237/sgp-1
50
Regulatory region prediction
  • ConSite
  • Detection of TFBS conserved in corresponding
    genomic sequences from different species

www.phylofoot.org/consite
51
ConSite
52
Visualisation
  • Easier to grasp large data volumes
  • Programs
  • Dot plot (e.g. Dotter)
  • PIP
  • Alfresco
  • VISTA
  • Genome comparative resources
  • VISTA genome browser
  • UCSC
  • Ensembl

53
PIP - Percent Identity Plot
Oeltjen et al. (1997) Genome Research 7:315
54
Alfresco
Tool for comparative genome sequence analysis
  • Over-all control of comparative analysis
  • Display and summarize results from external
    analysis programs

Jareborg & Durbin. Genome Research 10:1148-1157
55
Alfresco Features
  • Interactive graphical interface
  • Uses external programs for analysis
  • Dotter - interactive dotplot program
  • Blastn alignments - finds conserved blocks
  • DBA - detects and aligns conserved blocks
  • Cpg - detects CpG islands
  • RepeatMasker - identifies repeats
  • Genscan - gene prediction
  • GeneWise - gene prediction using homologous
    protein sequence
  • est_genome - gene prediction using homologous RNA
    sequence

56
Alfresco
57
(No Transcript)
58
Vista Genome Browser
  • Human - Mouse - Rat comparisons
  • VISTA viewer
  • http://pipeline.lbl.gov/

59
VISTA genome browser
60
UCSC Genome browser
  • Human - Mouse
  • Twinscan predictions
  • Conservation profiles
  • Quantitative

61
Ensembl contig viewer
  • Human-Mouse match locations
  • Qualitative
  • Twinscan predictions
  • Move between Human and Mouse contig views

62
Comparative Analysis Examples
  • Interspecies non-coding regions conservation
  • Coding region predictions
  • Regulatory region predictions

63
Comparative Analysis of Noncoding Regions of 77
Mouse and Human Gene Pairs. Jareborg, Birney, and
Durbin (1999) Genome Research 9:815
  • How conserved are non-coding regions between
    mouse and human?
  • Measure of conservation?
  • identity
  • fraction conserved

64
A typical intron
65
mouse/human data set
  • Genomic sequences from the EMBL database
    containing 78 pairs of mouse-human orthologous
    genes
  • Features as defined in feature tables
  • Corresponding features aligned with DBA
  • Fraction covered by blocks >60% identical
  • Upstream regions: 36%
  • 5' UTRs: 49%
  • Introns: 23%
  • 3' UTRs: 56%
  • Sizes
  • 20 - 700 bp
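
The "fraction covered" measure can be sketched as the share of a feature's length that lies inside conserved blocks. The sketch assumes blocks are given as 0-based, end-exclusive (start, end) intervals on the feature and may overlap; this input format is an assumption for illustration, not DBA's output format.

```python
def fraction_covered(length, blocks):
    """Fraction of a feature of the given length covered by the union of blocks."""
    if length <= 0:
        raise ValueError("length must be positive")
    covered = 0
    last_end = 0
    for start, end in sorted(blocks):
        start = max(start, last_end)  # skip the part already counted
        end = min(end, length)
        if end > start:
            covered += end - start
            last_end = end
    return covered / length

# Two overlapping blocks and one separate block on a 100 bp feature
print(fraction_covered(100, [(0, 20), (10, 30), (50, 60)]))  # 0.4
```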

Jareborg, Birney & Durbin. Genome Research
9:815-824
66
(No Transcript)
67
Analysis example - coding region prediction: UTY
68
Analysis example - cont.
69
Analysis example - Regulatory regions
  • BTK - Bruton's tyrosine kinase
  • agammaglobulinemia
  • Expression
  • early stages of B-cell differentiation
  • myeloid cell lines
  • not in T cells

70
BTK region PIP
Oeltjen et al. (1997) Genome Research 7:315
71
Alfresco - BTK 5' end
72
Promoter constructs
2.5 kb conserved region in first intron
contributes to cell-lineage specific expression
myeloid
B-cell
T-cell
Oeltjen et al. (1997) Genome Research 7:315
73
Comparative Analysis: Issues for the future
  • Faster/better algorithms for aligning vertebrate
    genomes
  • Multiple alignments
  • Comparing several species can give clues to which
    regulatory sequences are of a basic nature, and
    which are lineage specific
  • Cataloguing of comparative data
  • Better visualisation
  • Whole syntenic region <-> nucleotide level
  • Multiple genome sequences

74
Future Issues - cont.
  • Genome evolution
  • macro scale
  • molecular evolutionary rates
  • repeats
  • Transcriptional regulatory regions
  • definition/modelling
  • identification of combinations of conserved TFBSs
    coupled with gene expression data
  • prediction

75
Fin