Protein Clustering to Assemble Families of Homeomorphic Proteins - PowerPoint PPT Presentation

1 / 52
About This Presentation
Title:

Protein Clustering to Assemble Families of Homeomorphic Proteins

Description:

Pfam-A - a collection of domain families created using profile HMMs built on ... Sn=Tp(Tp Fn) Specificity = the percentage of predicted linker residues that ... – PowerPoint PPT presentation

Number of Views:120
Avg rating:3.0/5.0
Slides: 53
Provided by: wrp7
Category:

less

Transcript and Presenter's Notes

Title: Protein Clustering to Assemble Families of Homeomorphic Proteins


1
Protein Clustering to Assemble Families of
Homeomorphic Proteins
  • Chris Elsik
  • c-elsik_at_tamu.edu

2
Outline
  • Using evolution to infer protein function
  • The problem of automatic assembling of
    homeomorphic (identical domain organization)
    protein families
  • Agglomerative clustering
  • Delineating protein domain boundaries

3
Evolution Allows us to Infer Function
  • The most powerful method for inferring function
    of a gene or protein is by similarity searching a
    sequence database.
  • Our ability to characterize biological properties
    of a protein using sequence data alone stems from
    properties conserved through evolutionary time.
  • Homologous (evolutionarily related) proteins
    always share a common 3-dimensional folding
    structure.
  • They often contain common active sites or binding
    domains.
  • They frequently share common functions.
  • Predictions made using similar, but
    non-homologous proteins are much less reliable.

4
Orthologs
  • Homologs genes that are evolutionarily related
  • There are two kinds of homologs
  • Orthologs genes in different species that have
    diverged from a common gene in an ancestral
    species.
  • Paralogs genes that have diverged due to gene
    duplication.
  • Orthologs are more likely than paralogs to have
    conserved function.
  • Orthologs cannot be identified using BLAST or
    FASTA sequence comparison alone.
  • Reliable ortholog identification requires
    phylogenetic methods.

5
Example Gene Tree (with plant genes)
Rice-2b
paralogs
Rice-2a
Maize-2
paralogs
Wheat-2
Sorghum-2
Barley-1
Wheat-1
orthologs
Maize-1
Sorghum-1
Arabidopsis
6
Why shouldnt we depend on inferences based on
paralogs?
  • Paralogs emerge after a gene duplication.
  • Possible fates of duplicated genes
  • Loss of function for one of the duplicates - lack
    of selective pressure allows gene to mutate
    beyond recognition
  • Emergence of new functional paralogs - one
    duplicate aquires a new function, so selection
    favors its maintenance in the genome
  • Sub-functionalization - both duplicates are
    required to maintain the function of the original

7
How do people identify orthologs?
  • Best Hit
  • Problems - 1) What if the ortholog is not present
    in the database - there is always a best hit. 2)
    For multidomain proteins, the best hit may be a
    local domain hit.
  • Reciprocal Best Hit
  • Problem - this is usually based on E-value, but
    E-value is affected by the length of the match.
  • Predicted genes are often gene fragments (not the
    true length of the gene)
  • This method is a phenetic, not phylogenetic,
    approach. It does not take advantage of a model
    of protein evolution.
  • Synteny - infer orthology by comparing locations
    on chromosomes
  • Can be applied to closely related species, such
    as mammals, but not distantly related species
  • difficult to distinguish tandemly duplicated
    genes
  • Phylogenetics - build gene trees to distinguish
    orthologs and paralogs

8
1gtgtgtCG2830-PA - 648 aa vs fly_test.lseg
libraryThe best scores are
length bits E(18717) _id
alenCG7235-PC ( 576) 474.6 9.4e-134
0.590 581CG7235-PA ( 576) 474.6
9.4e-134 0.590 581CG7235-PB ( 576)
474.6 9.4e-134 0.590 581 CG12101-PB
( 573) 474.4 1.1e-133 0.620
597CG12101-PA ( 573) 474.4 1.1e-133
0.620 597CG2830-PA-short ( 240) 339.2
2.2e-93 1.000 240CG16954-PB ( 558)
298.3 1.1e-80 0.450 536 CG16954-PA
( 558) 298.3 1.1e-80 0.450 536
Reciprocal Best Hit incorrect gene length
A gene fragment was simulated by replacing a
protein in the library dataset with a truncated
sequence. The library was searched with the
full-length protein. The best match is a paralog.
The identical, but truncated, match has a less
significant E-value. We would not be able to
identify the ortholog using the reciprocal best
hit method.
9
Identifying Orthologs Using Phylogenetics
  • Challenge - we need to create gene families prior
    to creating phylogenetic gene trees
  • One approach is automated sequence clustering

10
Reasons for Clustering Protein Sequences
  • Classifying proteins - structural and functional
    annotation
  • Identifying errors in gene prediction
  • Selecting representative proteins
  • Creating models of protein domain families to aid
    identification of distantly related proteins
  • Studying protein family evolution

11
Protein Clustering - General Method
  • Perform a protein similarity search of a protein
    database against itself using FASTA or BLAST
  • Use E-value or score as a similarity measure for
    automated clustering
  • E-value is not the optimal linkage criterion due
    to the length effect

12
Problems in Clustering Proteins
  • Multidomain proteins can cause unrelated proteins
    to group together
  • This problem is amplified in large proteomes
  • Cluster validation is difficult -
  • We do not know the correct number of clusters
  • We do not have a good test set for large proteomes

13
Src Tyrosine Kinase
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGP
SAAFAPAAAEPKLFGGFNSS DTVTSPQRAGPLAGGVTTFVALYDYESRT
ETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV APSDSIQA
EEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAK
GLNVKHYKIRKL DSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTV
CPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGC FGEVWMGTWNGTTRVA
IKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSK
GSLL DFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANI
LVGENLVCKVADFGLARLIEDNEYT ARQGAKFPIKWTAPEAALYGRFTI
KSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPEC PES
LHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
14
Agglomerative Clustering Methods
  • Single linkage - similarity between closest
    neighbors meets a threshold (BLASTCLUST)
  • Complete linkage - similarity between furthest
    neighbors meets a threshold
  • Average linkage - average pairwise similarity
    meets a threshold
  • Fractional linkage - fraction of pairwise
    similarities meet a threshold

15
Single linkage
If A is related to B, and B is related to C,
then A is related to C.
This is true only if proteins are homologous over
their entire length. Multidomain proteins pull
unrelated proteins into the same cluster
A
B
C
16
Complete linkage
Each protein must be homologous to every other
protein in the cluster.
In the analysis of the human genome, Celera used
the criterion that the list of matches for each
protein must be equivalent.
Query A B
C D Matches
C D
A B C
A B C D
A B C
Clusters
A B
C D
Problem Does the domain structure of D differ
from A and B, or is it homologous, but too
distantly related to be detected by the BLAST
search? Complete linkage generates many small
clusters.
17
Fractional linkage
Proteins must have a fraction of matches in
common.
In analysis of the human genome, Celera used the
Lek metric with a cutoff of .75 Lek 2 x
(MatchesA ? MatchesB) / (MatchesA MatchesB)
LekAB 2 x 3 / (3 3) 1
LekAC 2 x 3 / (3 4) 0.85
Clusters
D
A B C
LekCD 2 x 2 / (4 2) 0.67
LekAD 2 x 1 / (32) 0.40
18
Average linkage
The average pairwise similarity between all
proteins must meet a threshold. In hierarchical
average linkage, the most closely related
proteins are grouped first. Then the threshold is
relaxed, allowing clusters to merge.
19
Which is method is best for clustering
full-length protein sequences?
  • Objectives
  • Group full-length proteins with similar domain
    organization
  • Minimize the problem caused by multidomain
    proteins
  • Minimize number of singletons

20
Comparing Protein Clustering Methods
  • Four methods - single, complete, average and
    fractional linkage
  • Thresholds range from E ? 10-10 (permissive) to E
    ? 10-200 (stringent)
  • Look at cluster number, number of singletons,
    largest cluster size, CDC score for each cluster
    set
  • Test set the Arabidopsis proteome (25,000
    proteins)

21
Comparison of cluster sets assembled using
linkage threshold E ? 10-10
Singletons
Largest Cluster Size
Method
Clusters (including Singletons)
SL
7702
4940
5852
9607
AL
385
5943
FL
12072
8248
428
60
CL
17639
14151
22
Comparing protein cluster sets CDC Score
  • Cluster Domain Consistency (CDC) score - a metric
    for testing similarity in domain organization
    among clustered proteins
  • CDC 0 proteins have no domains in common
  • CDC 1 proteins have identical domain
    organization
  • The proteins first must be searched against a
    domain database (Pfam) to identify domains

23
(No Transcript)
24
Pfam CDC Score
25
Linkage Criteria for Protein Clustering
  • The linkage criterion should discern full- and
    partial-length matches
  • If proteins are similar in domain organization,
    we expect them to match over their entire length

26
Linkage Criteria E-value
  • E-value may not be the best linkage criterion
    because it is length-dependent
  • Matches to longer proteins have lower E-values
    (more significant)
  • Length dependency makes it difficult to discern
    partial and full-length protein matches

27
Linkage Criteria Global Alignment Score
  • Partial matches will have low scores and
    full-length matches will have high scores
  • Disadvantage global alignments for each protein
    pair must be generated using the ALIGN program

28
Linkage Criteria Weighted Local Alignment Score
  • The local Smith-Waterman score for a pair of
    similar proteins is divided by the self-match
    score for the larger of the two proteins.
  • The weighted score is low for partial matches.
  • The advantage over global score is that global
    alignments are not required.

29
Linkage Criteria Comparison
  • Two clustering methods single and average
    linkage
  • Three linkage criteria E()-value, weighted local
    score, global score
  • Range of thresholds tested for each criterion
  • Look at cluster number, number of singletons,
    largest cluster size, CDC score for each cluster
    set

30
Cluster set comparison using most permissive
thresholds that resulted in biologically
meaningful family sizes (lt 600 members).
Method
Single Linkage
Average Linkage
Weighted Local Score
Weighted Local Score
Global Score
Linkage Parameter
Global Score
E()-value
E()-value
E() ? 10-10
E() ? 10-40
0.05
0.5
Threshold
1.1
0.25
Clusters (including singletons)
14570
9607
11469
8087
10073
10905
4643
11560
6195
5943
Singletons
7780
7429
250
538
Largest Cluster
385
225
171
563
31
Pfam CDC Score
32
Protein Clustering - Conclusions
  • Single linkage using E-value is not appropriate
    for clustering eukaryote proteomes.
  • Weighted local score and global score are better
    linkage criteria than E-value.
  • Single linkage using a weighted local score above
    0.40 performs as well as average linkage and
    generates a moderate number (15,746) of
    Arabidopsis clusters.
  • Average linkage generates the smallest cluster
    numbers and reasonable cluster sizes, and should
    be the useful for grouping distantly related
    proteins for function and structure prediction.

33
Which clustering method is best prior to
phylogenetic analysis of predicted proteins?
  • Complete linkage? - Too many singletons
  • Fractional linkage? - Also a large number of
    singletons
  • Average linkage?
  • Problem caused by truncated or overextended gene
    predictions. Average linkage would likely
    separate orthologs with incorrectly predicted
    protein lengths.
  • Single linkage with a strict alignment criterion?
  • Also has problem caused by truncated or
    overextended gene predictions
  • A strict alignment criterion separates orthologs
    due to incorrectly predicted protein lengths
  • Single linkage without a strict alignment
    criterion
  • Problem associated with multidomain proteins

34
SL-AL a compromise
  • We have developed the SL-AL algorithm, which uses
    a combination of single linkage and average
    linkage
  • First sequences are clustered using single
    linkage and a permissive alignment criterion
  • Percent identity for each protein pair in a
    single linkage cluster is recorded
  • Any cluster with lt a threshold percent identity
    is reclustered using average linkage

35
Advantages of SL-AL
  • The multidomain problem is reduced.
  • The algorithm is more tolerant to incorrect gene
    predictions, and less likely to separate
    incorrectly predicted orthologs into separate
    clusters. The cluster set can be used to identify
    gene prediction problems.
  • A minimum pairwise percent identity can be set,
    making multiple alignment possible.

36
Another Alternative for Protein Clustering
  • Overcome the multidomain problem by dissecting
    proteins into domains prior to clustering
  • Prodom uses protein alignments to identify domain
    families (129,000 domain families)
  • Problem - partial sequences in protein databases
    cause the overfragmenting of domains.
  • Our work toward a solution use an alternative
    approach to identify domain boundaries

37
Identifying protein linkers to delineate domain
boundaries
  • Protein structural domain - a unit of a protein
    that can independently fold into a stable
    tertiary structure
  • Evolutionary domain - a conserved modular unit
    that can be identified by aligning protein
    sequences
  • Often structural domains are also evolutionary
    domains (not always)
  • Linker - a peptide that connects two protein
    domains

38
Src Tyrosine Kinase
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGP
SAAFAPAAAEPKLFGGFNSS DTVTSPQRAGPLAGGVTTFVALYDYESRT
ETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYV APSDSIQA
EEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAK
GLNVKHYKIRKL DSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTV
CPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGC FGEVWMGTWNGTTRVA
IKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSK
GSLL DFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANI
LVGENLVCKVADFGLARLIEDNEYT ARQGAKFPIKWTAPEAALYGRFTI
KSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPEC PES
LHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
39
Amino acid propensity
  • Our goal is to predict protein linker regions
    using amino acid sequence alone, in the absence
    of known homologs
  • We take advantage of amino acid propensity - the
    preference of some amino acids for linkers or
    domains

40
Our Dataset
  • Pfam-A - a collection of domain families created
    using profile HMMs built on multiple alignments
    of homologous proteins. (Boeckmann et al. Nucleic
    Acids Research 2003)
  • Linker a sequence segment of 4 to 20 residues
    that connects two adjacent regions identified by
    Pfam as domains.
  • Non-linker regions sequence segments excluding
    linker regions.
  • Used only protein sequences whose entire length
    can be classified as linker or domain by our
    criteria, except we allowed up to 20 non-domain
    residues at the N- and C-termini.
  • Results - 11968 sequences with at least one
    linker region (14339 linker, 28726 corresponding
    domains and 824 unique domain regions).

41
Removing Redundancy in the Dataset
  • 11968 proteins were first grouped into
    homeomorphic families (identical domain
    organization).
  • All by all sequence comparison of the 11968
    sequences using FASTA
  • Single-linkage clustering using criteria of
    E-value10-6 and at least 80 alignment coverage.
  • Some of the resulting clusters contained
    sequences with different domain organizations,
    due to the transitive nature of single-linkage
    clustering, so we selected one sequence from each
    domain organization within each cluster.
  • Result - 802 sequences with at least one linker
    region. These 802 sequences contained 993 linkers
    and 1988 corresponding domain regions from 376
    unique Pfam-A domain families.
  • The average length of linkers and domains was
    11.24 and 141.38, respectively.

42
Linker Index
  • Linker index, yl, reflects the preference of
    amino acids in the linkers relative to the domain
    region from Suyama and Ohara, 2003,
    Bioinformatics.
  • yl - ln ( fllinker / fldomain )
  • Where fllinker relative frequency of the amino
    acid l in the linker (region in the data set.
  • yl will be negative if the relative frequency of
    amino acid l in the linker region is greater than
    its relative frequency in the domain region.
  • To calculate the smoothed linker index, average
    the linker index within each window size w and
    assign this averaged linker index value y to the
    center amino acid of the window by sliding from
    the N-terminus to the C-terminus of a protein
    sequence.
  • Window size, w 9, provided the greatest
    discrimination between linker and non-linker
    regions among the window sizes from 3 to 20.

43
Relative Amino Acid Frequency
Amino Linker Domain yl Acid
A 7.97 (7.94) 8.10 0.0166 C 0.89
(1.24) 1.50 0.5724 D 6.32
(5.28) 5.60 -0.1278 E 7.97
(6.89) 6.60 -0.1794 F 2.74
(4.34) 4.03 0.3561 G 7.74
(6.14) 7.37 -0.0442 H 1.91
(2.32) 2.27 0.1643 I 4.73
(5.13) 6.37 0.2758 K 6.97
(5.72) 5.81 -0.2134 L 7.51
(9.60) 9.54 0.2523 M 2.13
(2.15) 2.24 0.0197 N 4.22
(4.12) 4.08 -0.0786 P 6.63
(6.07) 4.30 -0.4188 Q 3.90
(4.05) 3.33 -0.1051 R 5.77
(5.79) 5.24 -0.0762 S 7.20
(5.55) 6.13 -0.1629 T 5.80
(5.66) 5.35 -0.0701 V 6.24
(6.64) 7.34 0.1782 W 0.81
(1.24) 1.32 0.3836 Y 2.46
(3.47) 3.38 0.2500

Numbers in parentheses are from the linker
database of George and Heringa, 2002, Protein
Engineering.
44
Hidden Markov Model to predict linker residues
  • Sequences are assumed to have a structure
    composed of regions that are homogeneous within a
    region but may differ between regions.
  • We assume protein sequence data is produced by a
    hidden Markov model and compositional variation
    is likely to reflect functional or structural
    differences between regions.
  • Each region is classified into one of a finite
    number of states (linker and non-linker) we wish
    to estimate the states given the observed protein
    sequence.
  • Instead of recognizing the protein sequence as a
    string of amino acids (categorical variables), we
    recognize the protein sequence as a string of
    linker index values (continuous variables).

45
Predicting linker state
  • Our objective is to identify linker index values
    that discriminate linker and non-linker regions.
  • Used Gibbs sampling to overcome the problem of
    missing data.
  • Calculate the probability state (linker or
    non-linker) for each residue i in a protein
    sequence.
  • Predict the state of an amino acid using the
    classification variable
  • CV 1 if the probability of state k is ? x.
  • CV 0 if the probability of state k is ? x.

46
Training and Testing
  • We applied our model to the protein sequence
    dataset constructed from Pfam-A.
  • 5-fold cross validation was applied to the
    dataset - we divided the dataset into the
    training dataset and the test dataset randomly
    with the ratio of 41.
  • We trained the model with the training dataset of
    642 sequences and tested the trained model with
    the test dataset of 160 sequences.
  • We ran Gibbs sampling with 40,000 iterations and
    10,000 burn-in to train the model.

47
Examples of good predictions
48
Examples of Over-prediction
49
Sensitivity and Selectivity
  • Sensitivity the percentage of actual linker
    residues that were predicted to be linker.
    SnTp(TpFn)
  • Specificity the percentage of predicted linker
    residues that were truly linker. SpTp/(TpFp)

50
Sensitivity and Selectivity plotted against
cut-off for classification variable
Sensitivity and selectivity are each 67 at their
intersection.
51
Conclusions
  • This method will be useful in defining linkers
    for evolutionary domains.
  • Our method appears to do well at distinguishing
    linkers from intra-domain loops.
  • We must test our approach using a structurally
    defined dataset to fully understand its ability
    to distinguish structural linkers from
    intra-domain loops.
  • Since Pfam-A identifies domains as evolutionarily
    conserved units, non-conserved intra-domain loops
    can cause structural domains to be annotated as
    multiple Pfam-A domains. Thus, some of our Pfam-A
    defined linkers may actually be loops in
    structural domains.
  • Conversely, two structural domains that are
    always found together may be defined by Pfam-A as
    a single evolutionary domain some of our false
    positives may actually be structural linkers.

52
Acknowledgements
  • William Pearson - University of Virginia
  • Bani Mallick - Texas AM University
  • Elsik Lab
  • Kyounghwa Bae
  • Justin Reese
  • Anand Venkatraman
  • Shreyas Murthi
  • Michael Dickens
  • Juan Anzola
Write a Comment
User Comments (0)
About PowerShow.com