1
ProClust
  • Improved clustering of protein sequences with an
    extended graph-based approach

Ying Jin, Jonathan Michael Nowacki Nov. 21, 2003
2
What is in this presentation
  • Papers
  • SCOP: a Structural Classification of Proteins
    database
  • Clustering protein sequences: structure
    prediction by transitive homology
  • Improved clustering of protein sequences with an
    extended graph-based approach

3
Part I
  • SCOP: a Structural Classification of Proteins
    database

4
The main idea
  • A database that provides a detailed and
    comprehensive description of all known protein
    structures

5
The Novelty
  • The distinction between evolutionary
    relationships and those that arise from the
    physics and chemistry of proteins
  • The classification of proteins in SCOP has been
    constructed by visual inspection and comparison
    of structures. This is believed to be better than
    purely automatic methods.

6
The Organizational Basics
  • By three traits:
  • Family: near evolutionary relationships
  • Based on one of two criteria that imply a
    common evolutionary origin: significant sequence
    similarity, and functional/structural similarity.
  • Super-family: far evolutionary relationships
  • Low sequence identity, but structures and, in
    many cases, functional features that suggest a
    common evolutionary origin is probable, e.g. the
    variable and constant domains of immunoglobulins.
  • Fold: geometrical relationships
  • Proteins share a fold if they have the same major
    secondary structures in the same arrangement and
    with the same topological connections.
  • Other classes: domain, PDB, literature reference

7
More on folds
  • All-alpha: essentially all alpha
  • All-beta: essentially all beta
  • Alpha/beta: mix of alpha and beta
  • Alpha+beta: helices and strands are segregated
  • Multi-domain: no known homologues

8
PDB at a Glance
  • The PDB structure entries, consisting of a
    collection of files having nondescript names,
    cannot be easily grasped in a biochemically
    meaningful context. Manually organizing the
    structures based on the descriptive information
    in the files is becoming less and less practical
    as the database expands. A chemically or
    biologically meaningful context can be provided
    by the user in the form of a search keyword (e.g.
    hemoglobin), but the range of available contexts
    cannot be predetermined from the database
    itself--users must know, in general, what they
    are looking for. Although searching is an
    extremely useful approach for locating specific
    PDB entries, the scope of the database is best
    ascertained by browsing a set of predetermined
    contexts. Useful contexts include molecular
    classes (e.g. "cytochrome"), secondary/tertiary
    structural classes (e.g. "globin fold"),
    functional classes (e.g. "binding protein"),
    species of origin, and experimental determination
    method. The descriptive information in the PDB
    files is distributed between a set of fields
    (e.g. "HEADER").

9
Other advantages of PDB
  • The PDB entry viewer links PDB entries to various
    graphical views, external databases and SCOP
    itself.
  • Links to:
  • images of structure
  • Interactive molecular views
  • Atomic co-ordinates
  • Data on functional conformational changes
  • Sequence data
  • Homologues
  • MEDLINE abstracts

10
Access Methods
  • Main URL
  • http://scop.mrc-lmb.cam.ac.uk/scop/index.html
  • Numerous mirrors
  • Europe
  • East Coast USA
  • Japan
  • Israel
  • Taiwan
  • China
  • Australia

11
The Root Down Method
12
Example Pic
13
Chime
14
Search Engine
15
3d Search
16
In Conclusion
  • SCOP is an easy way to access data and images.
  • SCOP has a powerful general-purpose interface to
    the PDB
  • Excellent overview of the diversity of protein
    structures which can aid researchers and students
    alike.

17
Part II
  • Clustering protein sequences: structure
    prediction by transitive homology

18
Main Idea of the Paper
  • A graph-based clustering approach using
    transitivity, handling multi-domain proteins, and
    cluster comparison algorithms.
  • Determined all pair-wise similarities for
    the sequences in the SwissProt database using the
    Smith-Waterman local alignment algorithm
  • Transformed the data into a directed graph
  • Vertices: protein sequences
  • Directed edges: from sequence A to B if
    score(A, B) / score(A, A) > T
  • Ran the clustering process using transitivity
  • SCOP was used as an evaluation data set

19
Motivation
  • Finding the three-dimensional structure of
    proteins is one of the fundamental problems in
    molecular biology.
  • X-ray diffraction analysis cannot keep up with
    the ever-increasing speed at which proteins are
    sequenced.
  • Desirable method: predict structure from the
    sequence data. The main idea:
  • sequence similarity => homology
  • => similar structure
  • => similar function
  • (Note: the same structure or function does not
    imply a common ancestor)

20
Motivation (cont.)
  • The relation of sequence similarity is obtained
    by pair-wise alignment.
  • The rule of thumb is 30% identity over aligned
    regions (T)
  • A widely accepted approach:
  • Score(A, B) > T implies structural similarity of
    sequences A and B
  • This is a sufficient, but not a necessary,
    condition
  • Example

21
  • Histogram of pair-wise alignment scores for
    all pairs from the same super-family in the SCOP1
    data set

22
  • Detecting those distant homologues, bringing
    light into the so-called twilight zone of low
    similarity.
  • What other criteria can be used to identify
    remote homologues?

23
Graph-based Approach
  • A graph-based clustering approach using the
    transitivity concept.
  • Transitivity
  • In mathematics: if A = B and B = C, then A = C
  • In biology: for any three sequences A, B and C,
    if A and B as well as B and C have a common
    ancestor, then A and C have a common ancestor

24
Use of Transitivity
  • The concept of transitivity can be used to detect
    remote homologues.

However:
  • It is not fully understood whether transitivity
    always holds and whether it can be extended
    ad infinitum.
  • The multi-domain problem
25
Multi-Domain Problem
If an undirected graph is used, the solid black
edges provide a path from 1 to 4. In the directed
case, the grey edges avoid this possible problem.
26
Algorithm (1)
  • Computing pair-wise similarities
  • A complete undirected graph G
  • For the edge between sequences P and Q, the
    weight of the edge is raw(P, Q)
  • raw(P, Q) is the raw Smith-Waterman local
    alignment score
  • As mentioned above, this approach suffers from
    the multi-domain problem: unwanted bridges
    connecting clearly unrelated proteins
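
The raw score used as the edge weight can be illustrated with a minimal sketch of Smith-Waterman local alignment. This is only an illustration under stated assumptions: it uses a toy match/mismatch scheme with a linear gap penalty rather than the BLOSUM80 matrix and separate gap opening and extension penalties described later in the talk, and the parameter names and default values are hypothetical.

```python
def raw(p: str, q: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Raw Smith-Waterman local alignment score of sequences p and q.

    Assumption: toy match/mismatch scoring with a linear gap penalty,
    not the substitution matrix and gap model used in the presentation.
    """
    best = 0
    prev = [0] * (len(q) + 1)              # previous row of the DP matrix
    for i in range(1, len(p) + 1):
        curr = [0] * (len(q) + 1)          # current row, column 0 stays 0
        for j in range(1, len(q) + 1):
            s = match if p[i - 1] == q[j - 1] else mismatch
            curr[j] = max(
                0,                         # start a new local alignment
                prev[j - 1] + s,           # align p[i-1] with q[j-1]
                prev[j] + gap,             # gap in q
                curr[j - 1] + gap,         # gap in p
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(raw("ACDEFG", "ACDEFG"))    # self-score grows with sequence length
print(raw("ACDE", "WWACDEWW"))    # a short sequence embedded in a longer one
```

With this toy scoring, the self-score of a sequence is the match score times its length, which mirrors the observation that raw(P, P) is roughly proportional to the length of P.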

27
Algorithm (2)
  • Directing the edges
  • Aim: to solve the multi-domain problem
  • If multi-domain proteins cause a problem, there
    has to be a difference in length between the
    sequences.

Figure: the undirected graph G and the directed graph Gd
Note: the raw self-similarity score raw(P, P) is
approximately proportional to the length of P
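
A minimal sketch of how directing the edges can suppress a multi-domain bridge. It assumes the directed weight is the ratio given earlier in the talk, w(P, Q) = raw(P, Q) / raw(P, P); the function name `direct_edges`, the raw scores, and the threshold are hypothetical values chosen for illustration.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def direct_edges(raw: Dict[Edge, float]) -> Dict[Edge, float]:
    """Turn symmetric raw scores into directed weights.

    Assumed normalisation: w(P, Q) = raw(P, Q) / raw(P, P).
    `raw` must contain the self-score (P, P) of every sequence.
    """
    w: Dict[Edge, float] = {}
    for (p, q), score in raw.items():
        if p == q:
            continue                        # self-scores only normalise
        w[(p, q)] = score / raw[(p, p)]
        w[(q, p)] = score / raw[(q, q)]
    return w


# Hypothetical scores: A and C are unrelated single-domain proteins,
# B is a longer two-domain protein sharing one domain with each.
raw_scores = {
    ("A", "A"): 100.0, ("B", "B"): 200.0, ("C", "C"): 100.0,
    ("A", "B"): 90.0, ("B", "C"): 90.0, ("A", "C"): 5.0,
}
w = direct_edges(raw_scores)
T = 0.6                                     # hypothetical threshold
kept = sorted(edge for edge, weight in w.items() if weight >= T)
print(kept)  # [('A', 'B'), ('C', 'B')]
```

Both short proteins point to the long one, but the long one points back to neither, so no directed path connects A and C. In the undirected graph the same scores would link A and C through B.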
28
Algorithm (3)
  • Clustering in a threshold graph
  • Remove all edges from Gd with w(P, Q) < T,
    resulting in the graph Gd(T)
  • Using SCCs as clusters
  • Definition 1 (SCC): In a directed graph G, a
    Strongly Connected Component (SCC) is a maximal
    set C of nodes of G, such that for every pair of
    nodes p and q in C there is a directed path in
    G from p to q and one from q to p.
  • Complexity: O(n + e), where n is the number of
    nodes and e the number of edges
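
A minimal sketch of computing SCCs in linear time. The talk does not say which SCC algorithm was used; this sketch uses Kosaraju's two-pass method with explicit stacks, and the node names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def strongly_connected_components(
    nodes: Iterable[str], edges: Iterable[Tuple[str, str]]
) -> List[Set[str]]:
    """Kosaraju's algorithm, iterative, O(n + e)."""
    graph: Dict[str, List[str]] = defaultdict(list)
    reverse: Dict[str, List[str]] = defaultdict(list)
    for p, q in edges:
        graph[p].append(q)
        reverse[q].append(p)

    # Pass 1: record nodes in order of DFS completion on the graph.
    visited: Set[str] = set()
    order: List[str] = []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:                           # all successors explored
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph in reverse completion order.
    assigned: Set[str] = set()
    components: List[Set[str]] = []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component = {start}
        todo = [start]
        while todo:
            node = todo.pop()
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


# P, Q, R form a directed cycle; S is only reachable from R.
clusters = strongly_connected_components(
    ["P", "Q", "R", "S"],
    [("P", "Q"), ("Q", "R"), ("R", "P"), ("R", "S")],
)
print(sorted(sorted(c) for c in clusters))  # [['P', 'Q', 'R'], ['S']]
```

Note that P and R end up in one cluster even though there is no edge from P to R: the path through Q is enough, which is how transitivity enters the clustering.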

29
An example of a SCC in SwissProt
  • The grey nodes are not part of the SCC, but are
    clearly related.
  • No edge is present between nodes P03480 and
    P03475; they are linked by transitivity.
  • Threshold: 32

30
Implementation and Evaluation
  • The algorithm is implemented in C
  • Own implementation of the Smith-Waterman local
    alignment algorithm for computing sequence
    similarity.
  • Substitution matrix: BLOSUM80
  • Gap opening penalty (gop): 90
  • Gap extension penalty (gep): 9

31
Data (1)
  • SwissProt (SP): all sequences with fewer than 40
    amino acids (a.a.) were excluded, resulting in a
    set of 86494 protein sequences
  • The total running time for the pair-wise
    Smith-Waterman alignment was on the order of
    14000 cpu-days
  • Evaluation data: the SCOP database
  • Three levels are used: family, super-family and
    fold

32
Data (2)
  • SCOP1: a set of 2692 sequences
  • Contains all non-identical sequences from SCOP
  • No sequences shorter than 40 a.a.
  • No sequences from classes 8, 9, 10
  • 65464 pairs of homologous sequences, i.e. pairs
    where both sequences are in the same super-family,
    and 3556622 pairs where the sequences are in
    distinct super-families.
  • SCOP1 SP: 85961 sequences
  • All sequences are from SCOP1 and SwissProt
  • SCOP2: 609 randomly chosen sequences from SCOP
  • Including sequences shorter than 40 a.a.
  • No sequences from classes 8, 9, 10.

33
Performance measure
  • Sensitivity: the proportion of true homologue
    pairs that are identified
  • Specificity: the proportion of correct pairs
    among the pairs predicted to be homologues (one
    minus the proportion of errors)

Sens = spec = 1 is the most desirable performance.
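
A minimal sketch of the two pair-based measures. It assumes sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP), which is the reading under which a value of 1 is best for both; the function name, the sequence identifiers, and the labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple


def pair_metrics(
    cluster_of: Dict[str, str], superfamily_of: Dict[str, str]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) over all sequence pairs.

    A pair is predicted homologous if both sequences share a cluster,
    and truly homologous if both share a super-family.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(cluster_of), 2):
        predicted = cluster_of[a] == cluster_of[b]
        homologous = superfamily_of[a] == superfamily_of[b]
        if predicted and homologous:
            tp += 1
        elif predicted:
            fp += 1
        elif homologous:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    # Convention assumed here: no predictions means no errors.
    specificity = tp / (tp + fp) if tp + fp else 1.0
    return sensitivity, specificity


superfamily_of = {"s1": "X", "s2": "X", "s3": "X", "s4": "Y", "s5": "Y"}
cluster_of = {"s1": "c1", "s2": "c1", "s3": "c2", "s4": "c2", "s5": "c3"}
print(pair_metrics(cluster_of, superfamily_of))  # (0.25, 0.5)
```

In this toy case, one of the four true homologue pairs is recovered, and one of the two predicted pairs is wrong.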
34
Discussion (1)
  • At threshold 32: sens 55.6%, spec 100%, TP due
    to intermediate linking 8%
  • The noise floor lifts off at threshold 23
  • Figure: sensitivity, specificity and the
    percentage of indirectly linked true positives
    versus clustering threshold for the SCOP1 data
    set

35
Discussion (2)
  • At threshold 32: sens 57.9%, spec 99.8%,
    indirect TP 11.6%
  • Absolute increase in sens: 2.3
  • Relative increase in sens: 4.1%
  • Absolute increase in indirect TP: 3.6
  • The noise floor is higher
  • Figure: sensitivity, specificity and the
    percentage of indirectly linked true positives
    versus clustering threshold for the SCOP1 SP
    data set
36
Discussion (3)
  • Figure labels: SCOP1, SCOP1 SP, number of SCOP
    super-families
  • Figure: total number of SCC clusters, and SCC
    clusters of sizes 1, 2-5, 6-10 etc., for varying
    thresholds from 25 to 50.

37
Discussion (4)
  • Comparison with the algorithm by Arvestad, which
    employs only pair-wise sequence comparisons:
    their approach uses a more involved scoring
    method, optimized substitution matrices, and gap
    penalties to achieve a substantial improvement
    over straightforward pair-wise sequence
    comparisons.
  • Result: 24% better sensitivity at virtually
    equal specificity.
  • Figure: sensitivity versus specificity for the
    SCOP2 data set on the fold, super-family and
    family level

38
Links
  • http://promoter.mi.uni-koeln.de/proclust/

39
Part III
  • Improved clustering of protein sequences with an
    extended graph-based approach

40
The goal
  • To detect structural homology through sequence
    similarity, by increasing sensitivity through
    transitive homology heuristics.

41
Some Alternatives
  • Alternative approaches using the concept of
    transitivity for large-scale analysis of protein
    sequences:
  • Iterated BLAST or FASTA searches for computing
    clusters, which are subsequently merged and
    processed further. Does not explicitly deal with
    multi-domain problems.
  • ProtoMap: a graph-based approach that uses a
    combination of BLAST, FASTA and Smith-Waterman
    E-values to create a hierarchy of clusters. Has
    problems with multi-domain proteins, which cause
    cluster splitting.
  • All-against-all BLAST search, ignoring all hits
    below a specified threshold, which yields a (0,1)
    similarity matrix. Extensive post-processing is
    required to symmetrize the matrix and to deal
    with multi-domain proteins.
  • Clusters of orthologous groups (COGs), built
    starting with proteins from seven different
    species. Tries to compensate for multi-domain
    proteins with an iterative merging process.

42
A new solution
  • The extended graph-based approach is designed to
    provide clustering as an aid in finding remote
    homologues. The multi-domain problem is directly
    addressed, although it is not fully solved.
    Sensitivity is increased without a significant
    loss of specificity.

43
Different Symmetries
  • Symmetric similarity
  • Does not distinguish between two proteins being
    globally similar and one protein being similar to
    an individual domain of a multi-domain protein.
    Can lead to incorrect links.
  • Asymmetric similarity
  • Can be employed to distinguish between global and
    non-global similarity.

44
Limiting factors
  • Large random similarities can cause
    super-clusters, which will connect large parts of
    the sequence space. This can be countered by
    using more stringent criteria.
  • Multi-domain proteins
  • Domains are the compact semi-independent
    structural units of proteins, which often appear
    highly conserved in a number of multi-domain
    proteins.

45
An example run
  • The dataset
  • SCOP v1.53
  • All sequences with less than 40 amino acids were
    removed
  • Filtered for low complexity regions using seg
    with the parameters of 12, 1.8, 2.0 x
  • Sequences containing masked amino acids as well
    as duplicate sequences were removed.
  • SPROT
  • Release 39
  • Processed analogously to SCOP

46
The Filtering by Significance
  • Extreme value distribution
  • The maximum scores of a large number of
    alignments between random sequences of equal
    length tend to follow an extreme value
    distribution. This is used to estimate the maximal
    scores observable with the Smith-Waterman
    algorithm for random sequences of given lengths.
  • Pruning consists of removing an edge (P, Q) from
    the graph if the significance of the score
    w(P, Q) is below the chosen significance
    threshold.
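
A minimal sketch of significance-based pruning, assuming the extreme value distribution is the Gumbel form with tail probability P(S >= x) = 1 - exp(-exp(-lam * (x - mu))). How the talk's authors fitted the parameters to sequence lengths is not stated, so `params_for`, the parameter values, and the cut-off `alpha` are all hypothetical.

```python
import math
from typing import Callable, Dict, Tuple

Edge = Tuple[str, str]


def gumbel_tail(score: float, mu: float, lam: float) -> float:
    """P(S >= score) for a Gumbel distribution with location mu, rate lam."""
    z = -lam * (score - mu)
    if z > 700.0:                 # exp(z) would overflow; the tail is 1
        return 1.0
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for tiny t.
    return -math.expm1(-math.exp(z))


def prune(
    scores: Dict[Edge, float],
    params_for: Callable[[str, str], Tuple[float, float]],
    alpha: float,
) -> Dict[Edge, float]:
    """Keep only edges whose score has tail probability <= alpha."""
    kept = {}
    for (p, q), score in scores.items():
        mu, lam = params_for(p, q)
        if gumbel_tail(score, mu, lam) <= alpha:
            kept[(p, q)] = score
    return kept


# Hypothetical: the same (mu, lam) for every pair of sequences.
scores = {("P", "Q"): 80.0, ("P", "R"): 30.0}
kept = prune(scores, lambda p, q: (30.0, 0.25), alpha=1e-3)
print(sorted(kept))  # [('P', 'Q')]
```

In practice the parameters would depend on the lengths of the two sequences, which is why they are supplied through a callable rather than as constants.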

47
The Algorithm
  • Compute a complete undirected graph
  • Replace each undirected edge with two directed
    edges
  • Proceed to threshold graph by removing all edges
    of weight less than the threshold
  • Compute all strongly connected components (SCCs)

48
Post Processing Merging Clusters
  • Clusters with at least 20 sequences were selected
  • A multiple alignment was built for each set of
    sequences with ClustalW.
  • Profiles were built and calibrated with the HMMER
    package using default parameters.
  • For each such cluster profile all sequences not
    contained in the cluster were scored using the
    profile and the E-value was recorded.
  • If a profile of one cluster resulted in an
    E-value below threshold against another cluster,
    those clusters are merged.
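
A minimal sketch of the final merging step only, taking the profile E-values as given. It assumes merging is transitive (if A merges with B and B with C, all three end up together), which the slide does not state; the function name, cluster names, E-values, and threshold are hypothetical, and the minimum cluster size of 20 is ignored in the toy data.

```python
from typing import Dict, List, Set, Tuple


def merge_clusters(
    clusters: Dict[str, Set[str]],
    evalues: Dict[Tuple[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Merge clusters whose profile hits another cluster below threshold.

    evalues maps (profile cluster, target cluster) to the best E-value
    of the profile against any sequence of the target cluster.
    """
    parent = {name: name for name in clusters}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (a, b), evalue in evalues.items():
        if evalue < threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a     # union the two groups

    merged: Dict[str, Set[str]] = {}
    for name, members in clusters.items():
        merged.setdefault(find(name), set()).update(members)
    return sorted(sorted(group) for group in merged.values())


clusters = {"c1": {"s1", "s2"}, "c2": {"s3"}, "c3": {"s4", "s5"}}
evalues = {("c1", "c2"): 1e-8, ("c2", "c3"): 0.5, ("c3", "c1"): 2.0}
print(merge_clusters(clusters, evalues, threshold=1e-3))
# [['s1', 's2', 's3'], ['s4', 's5']]
```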

49
Complexity
  • Using C software on a Compaq ES40 running Tru64
    Unix V5.1
  • Smith-Waterman computations needed 70 CPU days.
  • Clustering needed 30 seconds.
  • Cluster merging using HMMs needed 21 CPU days.

50
Psi-Blast Flexibility
51
Multi-Domain Problem
52
Path Length FP. vs. TP.
53
Abundance of multi-domain proteins
54
Multi-domain Problem
Present at 13.1 threshold but disappears if
threshold is raised above 15.4 since d1m1da1
vanishes
55
Larger Multi-Domain Problem
Threshold of 21.3 and no edges between P12715
and P33497
56
More Laddering of Proteins
  • False positives are caused by just the right
    increase in length of proteins. None of the edges
    are removed when moving to the threshold graph.

57
Extended Graph vs. PSI-BLAST
58
Conclusion
  • Sensitivity of 63.5% at 99.0% specificity
  • A relative improvement of 34% over PSI-BLAST's
    performance of 47.5% sensitivity at 99.0%
    specificity.
  • Performance is gained at the expense of a much
    larger computational effort.
  • Performance can be further improved by taking
    length and position of conserved regions into
    account.