1
Benchmarking Orthology in Eukaryotes
  • 12-01-2004 Nijmegen
  • Tim Hulsen

2
Summary
  • (1) An introduction to orthology
  • (2) Orthology determination methods
  • (3) Benchmarking
      • co-expression
      • conservation of co-expression
      • SwissProt name
  • (4) Conclusions

3
An introduction to orthology
(from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html)
4
Orthology determination methods
  • Orthology databases/methods
  • COG/KOG
  • Inparanoid
  • OrthoMCL
  • Inclusiveness
  • one-to-one/one-to-many/many-to-many
  • organisms
  • Best bidirectional hit/Phylogenetic trees

5
Benchmarking orthology
  • Quality of orthology is difficult to test: no gold
    standard
  • Orthologs should have highly similar functions
  • Measuring conservation of function:
      • functional annotation
      • co-expression
      • domain structure

6
Benchmarked orthology determination methods
  • BBH: Best Bidirectional Hit
  • KOG: euKaryotic Orthologous Groups
  • INP: INPARANOID
  • MCL: OrthoMCL
  • Z1H: all pairs with Z > 100
  • COM: Comics Phylogenetic Tree Method
  • EQN: Equal SwissProt Names

7
Data set used
  • Protein World: all proteins in all available
    (SP+TrEMBL) proteomes compared to each other
  • Smith-Waterman with Z-value statistics
  • 100 randomized shuffles to test significance of
    SW score

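[Figure: distribution of SW scores (x-axis: SW score, y-axis: number of seqs) for the randomized sequences (rnd) and the original (ori). An original score 5 SD above the random mean gives Z = 5. Example: original MFTGQEYHSV; shuffle 1: GQHMSVFTEY; shuffle 2: YMSHQFTVGE; etc.]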
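
The shuffle-based Z-value on this slide can be sketched in a few lines. This is a minimal illustration, not the Protein World implementation: the scoring scheme (match 2, mismatch -1, linear gap -2) is an assumption made for brevity, as is the choice to shuffle only the second sequence. A real run would use a substitution matrix such as BLOSUM and affine gaps.

```python
import random
from statistics import mean, stdev


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (toy identity scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def z_value(query, target, n_shuffles=100, seed=0):
    """Z = (original score - mean of shuffled scores) / SD of shuffled scores."""
    rng = random.Random(seed)
    original = sw_score(query, target)
    residues = list(target)
    random_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps composition and length, destroys order
        random_scores.append(sw_score(query, "".join(residues)))
    sd = stdev(random_scores)
    if sd == 0:  # degenerate case: every shuffle scored the same
        return float("inf") if original > mean(random_scores) else 0.0
    return (original - mean(random_scores)) / sd


if __name__ == "__main__":
    print(z_value("MFTGQEYHSV", "MFTGQEYHSV"))
```

Because each shuffle preserves the amino acid composition and the length of the sequence, the random score distribution already reflects both, which is why the Z-value compensates for them.
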
8
Data set used
  • Z-value compensates for:
      • bias in amino acid composition
      • sequence length
  • Proteomes used:
      • Human: 28,508 proteins
      • Mouse: 20,877 proteins
      • → 595,161,516 pairs
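
The pair count is simply the product of the two proteome sizes, which is easy to check:

```python
human_proteins = 28_508
mouse_proteins = 20_877

# Every human protein is compared with every mouse protein.
pairs = human_proteins * mouse_proteins
print(f"{pairs:,}")  # 595,161,516
```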

9
BBH method
  • Easiest method: best bidirectional hit
  • Human protein (1) → SW → best hit in mouse (2)
  • Mouse protein (2) → SW → best hit in human (3)
  • If 3 equals 1, the human and mouse protein are
    considered to be orthologs
  • 12,817 human-mouse orthologous pairs (12,817
    human, 12,817 mouse proteins)
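
The three steps above can be sketched as follows. The input format (a dictionary from `(query, subject)` to a score) and the protein identifiers are hypothetical, chosen only for illustration. Ties are resolved by keeping the first hit encountered, which is an assumption rather than something the slides specify.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: {(query, subject): similarity score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(human_vs_mouse, mouse_vs_human):
    """Return (human, mouse) pairs that are each other's best hit."""
    h2m = best_hits(human_vs_mouse)
    m2h = best_hits(mouse_vs_human)
    return sorted((h, m) for h, m in h2m.items() if m2h.get(m) == h)


# Hypothetical scores: h1 and h2 both hit m1 best, but m1 prefers h1.
human_vs_mouse = {("h1", "m1"): 900, ("h1", "m2"): 300,
                  ("h2", "m1"): 700, ("h2", "m2"): 200}
mouse_vs_human = {("m1", "h1"): 900, ("m1", "h2"): 700,
                  ("m2", "h1"): 300, ("m2", "h2"): 200}
bbh = bidirectional_best_hits(human_vs_mouse, mouse_vs_human)
print(bbh)  # [('h1', 'm1')]
```

Each protein can appear in at most one pair, which is why the human and mouse protein counts on this slide equal the number of pairs.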

10
KOG method
  • KOG: euKaryotic Orthologous Groups
  • Eukaryotic version of COG, Clusters of
    Orthologous Groups
  • COG method:
      • All-vs-all seq. comparison (BLAST)
      • Detect and collapse obvious paralogs

[Figure: comparisons Sp1-Sp1, Sp2-Sp2, Sp1-Sp2, etc. for other species → determine BBHs. E(Hs-Hs) < E(BBH) → paralogs; E(Mm-Mm) < E(BBH) → paralogs.]
11
KOG method
  • Detect triangles of best hits
  • Merge triangles with a common side to form COGs
  • Case-by-case manual analysis, examination of
    large COGs (might be split up)
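
The triangle steps can be sketched as below. This is a simplified illustration under stated assumptions: best hits are treated as undirected edges, the requirement that the three proteins come from three different species is not checked, and the protein identifiers are hypothetical. It is not the COG pipeline itself, which also includes the manual curation step.

```python
from itertools import combinations


def find_triangles(best_hit_pairs):
    """Return every set of three proteins that are all linked by best hits."""
    neighbours = {}
    for a, b in best_hit_pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    triangles = set()
    for a, b in best_hit_pairs:
        for c in neighbours[a] & neighbours[b]:
            triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) using union-find."""
    triangles = list(triangles)
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_with_side = {}
    for idx, tri in enumerate(triangles):
        for side in combinations(sorted(tri), 2):
            if side in first_with_side:
                parent[find(idx)] = find(first_with_side[side])
            else:
                first_with_side[side] = idx

    groups = {}
    for idx, tri in enumerate(triangles):
        groups.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical best hits: two triangles share the side mm1-dm1.
edges = [("hs1", "mm1"), ("mm1", "dm1"), ("hs1", "dm1"),
         ("mm1", "ce1"), ("dm1", "ce1"), ("hs2", "mm2")]
print(merge_triangles(find_triangles(edges)))
```

The pair hs2-mm2 is not part of any triangle, so it ends up in no group. Triangles that share only a single protein are kept apart.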

12
KOG method
  • KOG method: mainly the same as the COG method, with special
    attention to eukaryotic multidomain structure
  • Group orthologies: many-to-many
  • Cognitor: assigns a KOG to each protein
  • (mouse not yet in KOG)
  • 810,697 human-mouse orthologous pairs (20,478
    human, 15,640 mouse proteins)

Tatusov et al., The COG database: an updated
version includes eukaryotes, BMC Bioinformatics.
2003 Sep 11;4(1):41
13
INP method
  • All-vs-all followed by a number of extra steps to
    add in-paralogs → many-to-many relations
    possible
  • 54,553 human-mouse orthologous pairs (19,504
    human, 17,030 mouse proteins)

Remm et al., Automatic clustering of orthologs
and in-paralogs from pairwise species
comparisons, J Mol Biol. 2001 Dec
14;314(5):1041-52
14
MCL method
  • All-vs-all BLASTP → determine orthologs +
    recent paralogs → use Markov clustering to
    determine ortholog groups
  • 7,322 human-mouse orthologous pairs (6,332 human,
    6,115 mouse proteins)

Li et al., OrthoMCL: identification of ortholog
groups for eukaryotic genomes, Genome Res. 2003
Sep;13(9):2178-89
15
Z1H method
  • All human-mouse pairs with Z > 100 in Protein
    World set are considered to be orthologs
  • 290,176 human-mouse orthologous pairs (19,055
    human, 16,149 mouse proteins)
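
A threshold method like this one amounts to a filter over the pairwise Z-values. The sketch below assumes a hypothetical dictionary from `(human, mouse)` identifiers to Z-values. It also counts the distinct proteins involved, which shows why the number of pairs can far exceed the number of proteins.

```python
def z1h_pairs(z_values, threshold=100):
    """Keep all pairs with Z strictly above the threshold.

    Returns the pairs plus the number of distinct human and mouse proteins.
    """
    pairs = sorted(pair for pair, z in z_values.items() if z > threshold)
    humans = {human for human, _ in pairs}
    mice = {mouse for _, mouse in pairs}
    return pairs, len(humans), len(mice)


# Hypothetical Z-values; h1 is paired with two mouse proteins.
z_values = {("h1", "m1"): 250.0, ("h1", "m2"): 120.0,
            ("h2", "m2"): 100.0, ("h3", "m3"): 15.0}
pairs, n_human, n_mouse = z1h_pairs(z_values)
print(pairs, n_human, n_mouse)  # [('h1', 'm1'), ('h1', 'm2')] 1 2
```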

16
COM method
  • Human proteome compared against all 9 eukaryotic
    proteomes in Protein World
  • Z > 20, RH > 0.5 QL
  • 24,263 groups
  • Tree scanning: Hs-Mm 85,848 pairs, Hs-Dm 55,934
    pairs, etc.
17
COM method
  • Example: BMP6 (Bone Morphogenetic Protein 6)
  • → 5 Hs-Mm orthologous relations defined

18
EQN method
  • Consider all Hs-Mm pairs with equal SwissProt
    names to be orthologous
  • e.g. ANDR_HUMAN ↔ ANDR_MOUSE
  • Used as benchmark later on
  • 5,214 Hs-Mm orthologous pairs (5,214 human, 5,214
    mouse proteins)
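
Name matching of this kind can be sketched as a set intersection on the part of the entry name before the species suffix. The function name and the entry `XYZ1_HUMAN` are hypothetical. The sketch assumes entry names follow the `NAME_SPECIES` pattern shown in the ANDR example.

```python
def equal_name_pairs(human_entries, mouse_entries):
    """Pair entries whose names are equal apart from the species suffix."""

    def stems(entries, suffix):
        return {e[: -len(suffix)] for e in entries if e.endswith(suffix)}

    shared = stems(human_entries, "_HUMAN") & stems(mouse_entries, "_MOUSE")
    return sorted((f"{stem}_HUMAN", f"{stem}_MOUSE") for stem in shared)


human = ["ANDR_HUMAN", "BMP6_HUMAN", "XYZ1_HUMAN"]  # XYZ1 is made up
mouse = ["ANDR_MOUSE", "BMP6_MOUSE"]
eqn = equal_name_pairs(human, mouse)
print(eqn)  # [('ANDR_HUMAN', 'ANDR_MOUSE'), ('BMP6_HUMAN', 'BMP6_MOUSE')]
```

Since each name stem occurs once per species, the result is strictly one-to-one, matching the equal pair and protein counts on this slide.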

19
Benchmarking through co-expression
  • Comparison of expression profiles of each
    orthologous gene pair
  • Using GeneLogic Expressor data set

organism  samples  fragments  tissue categories  SNOMED tissue categories
human     3269     44792      115                15
mouse     859      36701      25                 12
20
Expression tissue categories
HUMAN                         MOUSE
1  Blood vessel               1  Blood vessel
2  Cardiovascular system      2  Cardiovascular system
3  Digestive organs           3  Digestive organs
4  Digestive system           4  Digestive system
5  Endocrine gland            -
6  Female genital system      5  Female genital system
7  Hematopoietic system       6  Hematopoietic system
8  Integumentary system       7  Integumentary system
9  Male genital system        8  Male genital system
10 Musculoskeletal system     9  Musculoskeletal system
11 Nervous system             10 Nervous system
12 Product of conception      -
13 Respiratory system         11 Respiratory system
14 Topographic region         -
15 Urinary tract              12 Urinary tract
21
Co-expression calculation
  • Calculation of the correlation coefficient:
    $$ r = \frac{N\sum xy - (\sum x)(\sum y)}{\sqrt{\left(N\sum x^2 - (\sum x)^2\right)\left(N\sum y^2 - (\sum y)^2\right)}} $$
  • Measured over the 12 corresponding SNOMED tissue
    categories
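
The correlation coefficient above translates directly into code. In this sketch, `x` and `y` stand for the expression values of a human gene and its mouse ortholog over the 12 shared tissue categories. The example profiles are invented numbers, not GeneLogic data.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation coefficient, written to mirror the slide formula."""
    n = len(x)
    if n != len(y):
        raise ValueError("profiles must have the same length")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return (n * sum_xy - sum_x * sum_y) / denominator


# Invented 12-tissue profiles: the second is a linear function of the first.
human_profile = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
mouse_profile = [2 * v + 3 for v in human_profile]
print(correlation(human_profile, mouse_profile))
```

A perfectly linear relationship gives r = 1, and reversing one profile gives r = -1.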

22
Co-expression example 1
High correlation: 0.914167
23
Co-expression example 2
Low correlation: -0.935731
24
Benchmarking through co-expression
25
Benchmarking through conservation of co-expression
Human: Gene A, Gene B
Co-expression Cab (-1 < corr. < 1)
(Co-expression calculated over 115 tissues in
human, 25 in mouse)
All-vs-all: Human 40,678 chip fragments; Mouse
29,910 chip fragments
Mouse: Gene A, Gene B
Cab > Cab
→ Increases probability that A and B are involved
in the same process
26
Benchmarking through conservation of co-expression
  • Gene Ontology (GO) database: hierarchical system
    of function and location descriptions
  • Orthologs are in the same functional category when
    they are in the same 4th-level GO
    Biological Process class

27
Benchmarking through conservation of co-expression
28
Benchmarking through SwissProt name
  • How many of the predicted orthologous relations
    have equal SwissProt names? (EQN set in other
    benchmarks)
  • + reliable because checked by hand
  • - assumes only one-to-one relationships are
    possible
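
One way to compute this benchmark is as the fraction of predicted pairs whose entry names match apart from the species suffix. This is a sketch under assumptions: the slides do not state the exact denominator, so it simply uses all predicted pairs, and the example predictions are invented.

```python
def fraction_equal_names(predicted_pairs):
    """Fraction of predicted (human, mouse) pairs with equal SwissProt names."""
    if not predicted_pairs:
        return 0.0
    equal = sum(
        1
        for human, mouse in predicted_pairs
        if human.endswith("_HUMAN")
        and mouse.endswith("_MOUSE")
        and human[: -len("_HUMAN")] == mouse[: -len("_MOUSE")]
    )
    return equal / len(predicted_pairs)


# Invented many-to-many prediction: two of the four pairs have equal names.
predicted = [("ANDR_HUMAN", "ANDR_MOUSE"), ("BMP6_HUMAN", "BMP6_MOUSE"),
             ("BMP6_HUMAN", "BMP5_MOUSE"), ("BMP5_HUMAN", "BMP6_MOUSE")]
score = fraction_equal_names(predicted)
print(score)  # 0.5
```

The example shows the drawback listed above: a method that reports many-to-many relations is penalized for every extra pair, even when those pairs are biologically reasonable.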

29
Benchmarking through SwissProt name
(ALL: if all possible human-mouse pairs (or a
random fraction) were considered orthologs)
30
Conclusions
  • Hard to point out the best orthology
    determination method
  • In most cases: less = better, more = worse
  • The method that should be used depends on the research
    question: do you need few reliable orthologies or
    many less reliable orthologies?
  • Future directions: look at conservation of domain
    structure as a benchmark

31
Credits
  • Martijn Huynen
  • Peter Groenen
  • Comics Group
  • Gert Vriend
  • Rest of CMBI
  • Organon Bioinf. Group