Functional Analysis of Proteins and Proteomes

Slides: 161

Provided by: douglasl7

Learn more at: https://conferences.computer.org

1
Functional Analysis of Proteins and Proteomes

CSB2003 Tutorial
Steve Bennett, Ph.D.
steve_at_bennett.org

2
Introduction

Although genetic material contains all the
information required for cellular function, DNA
itself does not carry out much of the work in cells.
Rather, it is the products of those genes,
proteins and sometimes regulatory and catalytic
RNAs, that carry out the chemical and mechanical
work in biological systems.

3
Central Dogma in Biology

DNA
RNA (sequence, structure)
Protein (sequence, structure)

4
Introduction

With the completion of numerous genome projects,
resources and focus are shifting from genomes to
proteomes.
Once researchers have an accurate collection of
gene sequences, the next question is what these
genes do.

5
Introduction

Although there are numerous definitions of
function with respect to proteins, here we
define it as precisely what a particular gene
product, or protein, does in the cell.
Examples
Molecular motor (kinesin, myosin)
Zinc-finger transcription factor

6
Introduction

In this tutorial, I will first give a general
protein introduction, followed by historical and
current computational approaches for assigning
function to a protein.
Background
Overview of some selected algorithms and
approaches
Demos and software examples

7
Forming a Peptide Bond
Creates the Primary Structure, or protein sequence
8
Polypeptide Chain
The order of the R groups along the chain defines
the amino acid sequence of the peptide
9
Translation
10
Planar Peptide Bond
11
Alpha Helix Structure
12
Alpha Helix End-On
13
Alpha Helix Variable Pitch
14
Anti-Parallel Beta Sheets
15
Corrugated Beta Sheets
16
Beta Turn at End of Anti-Parallel Sheet
17
Protein Data Bank (PDB) Growth

19,225 released atomic coordinate entries
17,315 proteins, peptides, and viruses
1,892 nucleic acids, protein-nucleic acid
complexes

18
Structural Challenges

Compare all known structures to each other
Classify and organize all structures in a
biological way
Find common folding patterns and structural
motifs
Compute evolutionary distances between protein
structures
Study interactions between structures and other
molecules (Protein Docking)
Use known structures to predict structure from
sequence (Protein Threading)
Many more ...

19
Classification of Protein Structures

Class
Similar secondary structure content
All-α, all-β, α+β, α/β, etc.
Fold (Architecture)
Major structural similarity
SSEs in similar arrangement
globin-like fold, TIM barrel fold
Superfamily (Topology)
Probable common ancestry
globins phycocyanin
Family
Clear evolutionary relationship
Sequence similarity usually > 25%

20
Class
Fold / Architecture
Superfamily
21
Classes of Protein Structures

Mainly α
Mainly β
α/β
Parallel β sheets, β-α-β units

α+β
Anti-parallel β sheets, segregated α and β
regions
helices mostly on one side of sheet

22
Classes of Protein Structures

Others
Multi-domain, membrane and cell surface, small
proteins, peptides and fragments, designed
proteins

23
Folds / Architectures

α/β and α+β
Closed
Barrel
Roll, ...
Open
Sandwich
Clam, ...

Mainly α
Bundle
Non-Bundle
Mainly β
Single sheet
Roll
Barrel
Clam
Sandwich
Prism
4/6/7/8 Propeller
Solenoid

24
e.g., the TIM Barrel Fold
25
Growth in PDB Folds
Gold: Old Folds; White: New Folds
26
Databases of Folds

SCOP
Murzin AG, Brenner SE, Hubbard T, Chothia C
Structural Classification of Proteins
Manual assembly by inspection
All nodes are annotated (e.g., all-alpha,
alpha/beta)
Structural similarity search using 3dSearch
(Singh and Brutlag)
CATH
Dr. C.A. Orengo, Dr. A.D. Michie, Dr. S. Jones,
Dr. M.B. Swindells, Dr. G. Hutchinson, Dr. A.
Martin, Dr. D.T. Jones, Prof. J.M. Thornton
Class - Architecture - Topology - Homologous
Superfamily
Manual classification at Architecture level
Automated topology classification using the SSAP
algorithm
No structural similarity search

27
Databases of Folds

FSSP
L. L. Holm and C. Sander
Fully automated using the DALI algorithm (Holm
and Sander)
No internal node annotations
Structural similarity search using DALI
Pclass
A. Singh, X. Liu, J. Chang, D. Brutlag
Fully automated using the LOCK and 3dSearch
algorithms
All internal nodes automatically annotated with
common terms
JAVA based classification browser
Structural similarity search using 3dSearch

28
Protein Structure Prediction
Sequence of 984 amino acids
PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
G PENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAG
LKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNV
LPQGW KGSPAIFQSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQ
HRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLWMGYELHPDKWTVQPIVL
PEKDSWTVN DIQKLVGKLNWASQIYPGIKVRQLCKLLRGTKALTEVIPL
TEEAELELAEN REILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQE
PFKNLKTGKYARM RGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPI
QKETWETWWTEYWQA TWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYV
DGAANRETKLGKAGYVT NKGRQKVVPLTNTTNQKTELQAIYLALQDSGL
EVNIVTDSQYALGIIQAQP DKSESELVNQIIEQLIKKEKVYLAWVPAHK
GIGGNEQVDKLVSAGI PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKA
LVEICTEMEKEGKISKIG PENPYNTPVFAIKKKDSTKWRKLVDFRELNK
RTQDFWEVQLGIPHPAGLKK KKSVTVLDVGDAYFSVPLDEDFRKYTAFT
IPSINNETPGIRYQYNVLPQGW KGSPAIFQSSMTKILEPFKKQNPDIVI
YQYMDDLYVGSDLEIGQHRTKIEE LRQHLLRWGLTTPDKKHQKEPPFLW
MGYELHPDKWTVQPIVLPEKDSWTVN DIQKLVGKLNWASQIYPGIKVKQ
LCKLLRGTKALTEVIPLTEEAELELAEN REILKEPVHGVYYDPSKDLIA
EIQKQGQGQWTYQIYQEPFKNLKTGKYARM RGAHTNDVKQLTEAVQKIT
TESIVIWGKTPKFKLPIQKETWETWWTEYWQA TWIPEWEFVNTPPLVKL
WYQ
HIV reverse transcriptase
3D coordinates of 7404 atoms
29
Abstracting the problem
3D coords of C-alpha backbone
3D coords of all atoms
3D coords of secondary structure elements
C-alpha groups
30
Defining the secondary structure of a protein
sequence
Alpha helix and anti-parallel beta sheet
31
The Secondary Structure Prediction Problem

Given a protein sequence
NWVLSTAADMQGVVTDGMASGLDKD...
Predict a secondary structure sequence
LLEEEELLLLHHHHHHHHHHLHHHL...
3-state problem:
{A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}^n -> {L,H,E}^n

32
Amphipathic helix: End view
33
Amphipathic helix: backbone sidechains
34
Amphipathic helix: hydrophobic sidechains
35
Amphipathic helix: hydrophobic sidechains
36
Amphipathic helix: sidechain periodicity
Sequence NLAKMVVKTAEAILKD
37
Structural Correlations in Alpha-Helices
38
Structural Correlations in Beta-Strands
39
Functional Analysis of Proteins
40
Sequence methods

The earliest general approach for assigning
function to a protein sequence was to compare the
sequence of unknown function to a sequence (or
sequences) of known function.

Seqs of known function
?
47
Sequence methods

One such method for doing this is sequence
alignment in which two sequences are aligned to
determine how similar they are to one another.

48
Sequence methods

If an alignment is of sufficient quality, one
might assign the function of the known sequence
to be that of the unknown sequence as well.

Score(alignment) > threshold
Function: Zn finger transcription factor
(function assignment)
49
Sequence Alignment

We'll briefly discuss two different alignment
approaches, pairwise sequence alignment and
multiple sequence alignment, before moving on to
other topics.
Alignments are most often in one of two forms:
local or global.

50
Amino Acid Similarity

To discuss alignment methods, we first need to
discuss methods for determining if characters in
different sequences are similar.
Identity
Biochemical properties
PAM, BLOSUM matrices

51
PAM Matrices

Percent Accepted Mutation Matrices (Dayhoff)
Examine amino acid changes in groups of related
proteins with at least 85% sequence similarity.
The differing amino acids are assumed to be
accepted over evolutionary time.
Counts are normalized and used to estimate a
matrix representing all possible amino acid
changes.

52
PAM Matrices
53
BLOSUM Matrices
54
Dot Matrices

Dot matrices create an n x m matrix from the
two sequences to be compared.
A match is scored in the matrix by strict
character identity, chemical similarity of the
amino acids, or the use of a symbol comparison
matrix such as PAM or BLOSUM.
Mark the matrix location of each match with a
dot. Connected regions of similarity will
appear as diagonal lines.
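
The procedure above can be sketched in a few lines. This is a minimal illustration that scores matches by strict character identity only; the function names `dot_matrix` and `render` are made up for this example, and the two sequences are the ones shown in the dot matrix example on the following slide.

```python
def dot_matrix(seq_a, seq_b):
    """Return a len(seq_b) x len(seq_a) grid; True marks identical residues."""
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b, grid):
    """Draw the grid as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, grid):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "ADSCTFGVVLI", "AESCVVLV"
    print(render(a, b, dot_matrix(a, b)))
```

Swapping the `a == b` test for a lookup in a substitution matrix with a score cutoff would give the PAM or BLOSUM variant mentioned above.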

55
Dot Matrices

[Figure: dot matrix of ADSCTFGVVLI (columns) against AESCVVLV (rows)]

56
Dot Matrices

Drosophila SLIT

57
Dot Matrices SLIT vs. itself
58
Dot Matrices

59
Dot Matrices

Improving signal-to-noise in dot matrices
Sliding Window
Scoring a match at position i, j in the matrix
is not independent: downstream positions are
considered as well. This helps screen out
spurious matches in favor of meaningful local
regions of similarity.
Variable Stringency
Used with the sliding window method; denotes
how many characters in the window must match for
a hit to be declared at position i, j.
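
A minimal sketch of the sliding window with variable stringency follows. The function name `windowed_dots` and the default window and stringency values are assumptions chosen for illustration, and matches are again counted by identity only.

```python
def windowed_dots(seq_a, seq_b, window=3, stringency=2):
    """Return (i, j) pairs where at least `stringency` of the `window`
    residues starting at seq_a[i] and seq_b[j] are identical."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Isolated single-residue matches are filtered out; the diagonal run remains.
    print(windowed_dots("ADSCT", "AESCV", window=3, stringency=2))
```

Raising the stringency toward the window size removes more noise but also hides weaker regions of similarity.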

60
Dot Matrices

Advantages
Intuitive and Straightforward
Immediate visualization of similar subsequences
Limitations
Although related subsequences are easily seen, it
is unclear what the best alignment is between
them.
Difficult to assess the quality of different
alignments: there is no scoring system

61
Dynamic Programming

Considerable improvement to the basic dot matrix
approach: DP generates a provably-optimal
alignment between a pair of sequences.
Produces a score which can be evaluated for
statistical significance given the aligned
sequences and conditions.
Allows for the inclusion of gaps without the
extremely large number of computations required
in any direct computation.
Can be used for both local and global alignments.

62
Dynamic Programming

As observed for dot matrices, DP uses a scoring
system that favors identical and similar amino
acids, and penalizes dissimilar amino acids and
gaps.
Values for the scoring system are usually derived
from amino acid substitution tables such as PAM
or BLOSUM matrices. Each position in a potential
alignment is evaluated according to these
substitution tables.
The scores for all positions are then summed to
generate an overall log-odds score for the
alignment.

63
Dynamic Programming

Similar to the dot matrix algorithm, we construct
an n x m matrix consisting of the 2 sequences to
be aligned and a gap row, allowing each
sequence to begin with a gap if necessary.
Instead of marking a dot, we calculate a
running best score that depends on the scores of
the cells calculated previously. The matrix is
built left to right, top to bottom.

64
Dynamic Programming

Specifically, given two sequences, p and q
p = p1 p2 ... pi ... pn
q = q1 q2 ... qj ... qm
then the score at each position i in sequence p
and position j in sequence q (that is, the score
Sij in each matrix cell) is given by
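
The formula on this slide is not present in the transcript. As an assumption, the standard global-alignment (Needleman-Wunsch) recurrence with a linear gap penalty is given here; the slide may have used a different gap model. In it, s(p_i, q_j) is the substitution score from a table such as PAM or BLOSUM, and d > 0 is a per-residue gap penalty.

```latex
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(p_i, q_j) & \text{align } p_i \text{ with } q_j \\
S_{i-1,j} - d             & \text{align } p_i \text{ with a gap} \\
S_{i,j-1} - d             & \text{align } q_j \text{ with a gap}
\end{cases}
\qquad
S_{0,0} = 0,\quad S_{i,0} = -i\,d,\quad S_{0,j} = -j\,d
```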

65
Example VDFS and VET
72
Example VDFS and VET

VDFS
VE-T
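
The matrix fill and traceback for this example can be sketched as below. The scoring values (match +2, mismatch -1, gap -2) are placeholders assumed for illustration rather than the PAM or BLOSUM values the slides would have used, so the alignment it reports may differ from the one shown above where several alignments tie.

```python
def needleman_wunsch(p, q, match=2, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming.

    Returns (score, aligned_p, aligned_q). `gap` is a negative score
    added for every gap character.
    """
    n, m = len(p), len(q)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: p aligned to gaps
        S[i][0] = i * gap
    for j in range(1, m + 1):          # first row: q aligned to gaps
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if p[i - 1] == q[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # substitution
                          S[i - 1][j] + gap,       # gap in q
                          S[i][j - 1] + gap)       # gap in p

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if p[i - 1] == q[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                top.append(p[i - 1]); bottom.append(q[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(p[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(q[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    score, a, b = needleman_wunsch("VDFS", "VET")
    print(score)
    print(a)
    print(b)
```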

73
Dynamic Programming

Smith-Waterman: the more widely used implementation
for local alignments.
Software packages
BESTFIT (Smith-Waterman)
GAP (Needleman-Wunsch)
On the web: http://motif.stanford.edu/alion/
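
For comparison with global alignment, here is a minimal sketch of Smith-Waterman local scoring. The only change to the recurrence is that a cell score is never allowed to drop below zero, and the result is the best cell anywhere in the matrix. The scoring values are illustrative assumptions, and this is not the BESTFIT implementation.

```python
def smith_waterman_score(p, q, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between p and q."""
    best = 0
    prev = [0] * (len(q) + 1)
    for i in range(1, len(p) + 1):
        curr = [0] * (len(q) + 1)
        for j in range(1, len(q) + 1):
            sub = match if p[i - 1] == q[j - 1] else mismatch
            curr[j] = max(0,                   # start a new local alignment
                          prev[j - 1] + sub,   # extend along the diagonal
                          prev[j] + gap,       # gap in q
                          curr[j - 1] + gap)   # gap in p
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Only the shared core SCVV contributes to the local score.
    print(smith_waterman_score("AAASCVVGGG", "TTTSCVVPPP"))
```

Keeping only two rows at a time reduces memory to O(m) when just the score is needed; recovering the alignment itself requires the full matrix or a more elaborate scheme.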

74
Dynamic Programming

Advantages
Alignments are optimal
Quantitative score associated with the alignments
Limitations
Costly in time and space: special hardware is required
for database-sized searches (increasingly important
for modern bioinformatics applications).

75
Rapid Database Searching

Goal 1: Execute with lower demands in time and
space than dynamic programming.
Goal 2: Perform reasonably well using
heuristics, as compared to the DP optimal
solution.
Drawback: Resulting sequence alignments are not
guaranteed to be optimal.

76
Rapid Database Searching FASTA

Dynamic Programming approaches match single
characters at a time
FASTA matches groups of characters, called words
or k-tuples, which are managed in a table.

ADCGPH
ADCGPH
ADCGPH
ADCGPH
77
Rapid Database Searching FASTA
Database sequences: db1, db2, db3
Assume k = 3: 20^3 = 8000 possible 3-character words.
Scan each database sequence, recording the
position of each 3-tuple in a lookup table of
size 8000 keyed on the 3-tuple.
78
Rapid Database Searching FASTA
Database sequences db1, db2, and db3 each contain the 3-tuple AAC.
Assume k = 3: 20^3 = 8000 possible 3-character words.

Assume that the 3-tuple AAC occurs
at position 12 in db1
at position 52 in db2
at position 20 in db3

After scanning the database sequences and
building the lookup table, the table element
corresponding to AAC would look like
AAC -> db1:12, db2:52, db3:20
79
Rapid Database Searching FASTA
AAC -> db1:12, db2:52, db3:20
Suppose a query sequence, q, has the 3-tuple AAC
at position 60. The table returns the 3 database
sequences and the locations where the matching
tuple occurs in those sequences. Assuming this is
done for another 3-tuple, DFE, we might have

3-tuple   db1   q
AAC       12    60
DFE       56    104
80
Rapid Database Searching FASTA
AAC -> db1:12, db2:52, db3:20

3-tuple   db1   q
AAC       12    60
DFE       56    104

Next, FASTA computes the offsets between the locations
of matched tuples. Here, we see that for AAC and
DFE, the offset is identical, equal to 48. This
indicates that these 3-tuples are in-phase or
part of a larger locally-aligned region. FASTA
then rescores these local alignments using a
PAM250 matrix, and takes the 10 highest regions
of identity and performs a joining step in an
attempt to join the regions.
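
The lookup-table and offset steps can be sketched as follows. This covers only the k-tuple indexing and offset counting; the PAM250 rescoring and joining steps are not shown. The function names and the toy sequences are hypothetical, and positions are 0-based, which does not affect the offsets.

```python
from collections import Counter, defaultdict


def build_lookup(database, k=3):
    """Map each k-tuple to the (sequence name, position) pairs where it occurs."""
    table = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((name, pos))
    return table


def offset_counts(query, table, k=3):
    """For each database sequence, count shared k-tuples per offset
    (query position minus database position)."""
    counts = defaultdict(Counter)
    for qpos in range(len(query) - k + 1):
        for name, dpos in table.get(query[qpos:qpos + k], []):
            counts[name][qpos - dpos] += 1
    return counts


if __name__ == "__main__":
    table = build_lookup({"db1": "GGAACTTDFEGG"})
    counts = offset_counts("PPPPAACTTDFEPP", table)
    # Many tuples sharing one offset suggest a locally aligned region.
    print(counts["db1"].most_common(1))
```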
81
Rapid Database Searching BLAST

BLAST is a much faster algorithm than FASTA, and
has been shown to be just as sensitive.
As such, BLAST is considerably more widely used.
Similar to FASTA in that it uses words, but the
size is fixed at k = 3.

82
Rapid Database Searching BLAST

BLAST first extracts all overlapping 3-tuples
from a query sequence.
Then, the tuples in the query are evaluated
against the possible 8000 tuples using a BLOSUM
matrix. This determines if inexact matches
between query words and potential database words
are above a certain threshold score. Those tuples
remaining are assembled into a tree for rapid
database search.

ADCGPH
ADC
DCG
CGP
GPH
83
Rapid Database Searching BLAST

Suppose we observed the tuple SEI in the query
sequence.
Step 1. Score against all 8000 tuples, keeping
only those that are above our predetermined
scoring threshold.
SEI scored against SEI gives a score of 13 (S-S +
E-E + I-I in the BLOSUM matrix)
SEI scored against SDI gives a score of 10
SEI scored against SDG gives a score of only 2.
Hence, if our cutoff score were 9, we would keep
SEI and SDI, but not SDG when assembling the
search tree.
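
Step 1 can be sketched as below. The score table holds only the five substitution values needed to reproduce the numbers in this example (they agree with BLOSUM62); a real search would load the full 20 x 20 matrix and score all 8000 candidate words. The function names are hypothetical, and the threshold is applied as "greater than or equal to", which is an assumption.

```python
# Partial substitution table: just the pairs used in the SEI example.
SCORES = {
    ("S", "S"): 4, ("E", "E"): 5, ("I", "I"): 4,
    ("E", "D"): 2, ("I", "G"): -4,
}


def pair_score(a, b):
    """Look up a symmetric substitution score; KeyError if the pair is not in the table."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def overlapping_words(seq, k=3):
    """All overlapping k-tuples of a query, e.g. ADCGPH -> ADC, DCG, CGP, GPH."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_score(query_word, candidate):
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep the candidate words scoring at or above the threshold."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]


if __name__ == "__main__":
    print(overlapping_words("ADCGPH"))
    print(neighborhood("SEI", ["SEI", "SDI", "SDG"], threshold=9))
```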

84
Rapid Database Searching BLAST

Step 2. Once the possible matching tuples are
stored in the tree, database sequences are
searched for exact matches to these possible
words.
Matches are examined for regions that are on the
same diagonal and within some distance, A, of one
another. These regions serve as starting points
for a longer ungapped alignment between the
words. These joined regions are then extended in
each direction as long as the score is
increasing.
PSI-BLAST Iterative BLAST approach that includes
conservation information within a family of
proteins as opposed to just between two proteins.

85
Multiple Sequence Alignments

So far, we have only discussed pairwise
alignments since they are the most commonly used
and have optimal solutions.
Multiple sequence alignments are vitally
important to understanding true evolutionary
conservation between sequences in a family.

86
Multiple Sequence Alignments

Allows for the extraction of probes for new
members of a family (motifs / patterns).
Helps identify the functionally important amino
acids in a protein family. Amino acids not
required for function or structural integrity
will in general not be highly conserved within a
family
VTDIAYRCGFSDSNHFSTLFRREFNWSPRDI
VTEIAYRCGFGDSNHFSTLFRREFNWSPRDI
VFQISHRCGFGSNAYFCDVFKRKYNMTPSQF
VFQISHRCGFGSNAYFCDAFKRKYGMTPSQF

87
Representations for Similarity and Alignments

Short, simple representations for conserved
sequence information in MSAs make assigning new
proteins to the family considerably easier.
Short representations can suggest functional and
biological conclusions regarding why certain
amino acids are conserved at certain positions.
Such representations can identify function in a
protein that more global homology methods (such
as BLAST) might miss.

88
Profiles

A highly conserved local region in an MSA is
identified, then a profile (a type of PSSM) is
constructed to describe it.
20 x n matrix with each column describing the
scores, or probabilities of different amino acids
appearing at a given position.
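
A minimal sketch of building such a matrix from an ungapped block of aligned sequences is shown below. The log-odds form, the uniform background frequency of 1/20, and the pseudocount of 1 are all simplifying assumptions; real profile methods use observed background frequencies and more careful sequence weighting. The matrix is stored as a list of n columns, each mapping the 20 amino acids to a score.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0, background=1 / 20):
    """Return one {amino acid: log-odds score} dict per alignment column.

    `aligned` must be equal-length sequences without gap characters.
    """
    profile = []
    for column in zip(*aligned):
        counts = {aa: pseudocount for aa in ALPHABET}
        for aa in column:
            counts[aa] += 1
        total = sum(counts.values())
        profile.append({aa: math.log2((counts[aa] / total) / background)
                        for aa in ALPHABET})
    return profile


def score(profile, segment):
    """Sum the per-position scores of a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


if __name__ == "__main__":
    profile = build_profile(["YRCGF", "YRCGF", "HRCGF", "HRCGF"])
    print(round(score(profile, "YRCGF"), 2), round(score(profile, "YAAAA"), 2))
```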

89
PROSITE patterns

Patterns (motifs, signatures, fingerprints) are
short regular expression-like text strings that
describe a conserved region.
PROSITE is a manually curated database of
profiles and patterns.
Focuses on particular regions of family MSAs
shown in the literature to be biologically
important (usually catalytic sites, metal-binding
sites, reduced cysteines, or ligand binding
sites).

90
PROSITE patterns

Short conserved sequences from the MSA are then
extracted as a core and used to search
SWISS-PROT. If no additional sequences are found,
the core is designated as the actual signature.
If numerous false positives are picked up, then
the core is increased in size until good
discrimination is achieved, or until it is clear
that good discrimination won't be possible.
C-x(15)-A-x(3,4)-G-x(3)-C-x(2)-G-x(8,9)-P-x(7)-C
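
Because PROSITE patterns are regular expression-like, they translate almost mechanically into regular expressions. The converter below is a sketch covering only the elements used in this tutorial plus `{...}` exclusions: single residues, `x`, `[...]` sets, and `(n)` or `(n,m)` repeats. The function name `prosite_to_regex` is made up for this example, and anchors and other PROSITE syntax are deliberately not handled.

```python
import re

# One pattern element: a residue, x, [set], or {excluded set}, with an
# optional repeat count such as (3) or (3,4).
ELEMENT = re.compile(r"([A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any amino acid except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


if __name__ == "__main__":
    print(prosite_to_regex("C-x(15)-A-x(3,4)-G-x(3)-C-x(2)-G-x(8,9)-P-x(7)-C"))
```

The resulting string can be passed to `re.search` to scan a protein sequence for the signature.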

91
Blocks

Blocks are short, ungapped conserved regions in
multiple sequence alignments.

92
Blocks

They are created from one of two starting points:
Unaligned sequences from PROSITE families
An existing MSA.
Since PROSITE's manual curation limits its size,
the BLOCKS database currently includes families
from PRINTS-S and InterPro in addition to PROSITE.

93
BLOCKS Contains Many Protein Families (Henikoff &
Henikoff, 1999)
94
Properties of eMOTIFs
http://emotif.stanford.edu/

Discrete motifs that represent specific functions
Highly specific motifs for searching entire
proteomes
Maintain sensitivity with multiple motifs
Generate motifs automatically from protein
alignments
Resistant to sequence errors, misalignment, and
misclassification
Robust with respect to protein subclasses
Generates structural motifs and potential drug
targets
Biological generalization from known examples

95
eMOTIFs
http://emotif.stanford.edu/
fly..h...hst..krpfy.c
96
Generating Motifs from Aligned Protein Sequences
TEAESNMNDPVAEYQQY
TDARQDLYELEVDYANL
TEARENIAVLERDFEEV
TEAESNMNDLVSEYQQY
TEVRANMNDLVAEYQQY
SEAESNMNDLVSEYQQY
TEAREDLAALEKDYEEV
TEAREDLAALERDYIEV
SEAREDLAALEKDYEEV
AEAREDLAALEKDYIEV
SEAREDLAALEKDYEEV
SEAREDLAALERDYEEV
97
Generating Motifs from Aligned Protein Sequences
TEARENIAVLERDFEEV SDVESDNNDPVAEYIQL A A LYE V
ANY Q A S Q K
98
Generating Motifs from Aligned Protein Sequences
TEAREDLAALERDYEEV S K I A
100
Amino Acid Substitution Groups Based on Physical
Properties

Only permit groups of amino acids
sharing some chemical or physical property

Group

AG

ST

PAGST

QN

QNED

KR

VLI

VLIM

FYW

KRH

DE
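
The idea of restricting motif positions to these groups can be sketched as follows. For each alignment column, the code emits the residue itself if it is fully conserved, the smallest listed group that covers every observed residue, or a wildcard. This is a simplified illustration: the actual eMOTIF procedure enumerates many candidate motifs and trades sensitivity against specificity, which is not modelled here, and the function names are hypothetical.

```python
# Substitution groups as listed above.
GROUPS = ["AG", "ST", "PAGST", "QN", "QNED", "KR",
          "VLI", "VLIM", "FYW", "KRH", "DE"]


def column_symbol(residues, groups=GROUPS):
    """Return the motif symbol for one alignment column."""
    observed = set(residues)
    if len(observed) == 1:
        return next(iter(observed)).lower()       # fully conserved
    covering = [g for g in groups if observed <= set(g)]
    if not covering:
        return "."                                # no allowed group fits
    return "[" + min(covering, key=len).lower() + "]"


def motif_from_alignment(aligned):
    """Build a motif string from equal-length, ungapped aligned sequences."""
    return "".join(column_symbol(column) for column in zip(*aligned))


if __name__ == "__main__":
    print(motif_from_alignment(["AKD", "GRE", "AKD"]))
```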

101
Allowable Amino Acid Substitution Groups
fly..h...hst..krpfy.c
103
Discovery of eMOTIFs
http://emotif.stanford.edu/
104
Discovery of eMOTIFs
http://emotif.stanford.edu/
105

Each red dot is an eMOTIF
Most specific eMOTIFs along Pareto-optimal curve
High Sensitivity -> Low Specificity
High Specificity -> Low Sensitivity

109
Protein Function with eMOTIF Search
http://emotif.stanford.edu/
110
Protein Function with eMOTIF-Search
http://emotif.stanford.edu/
111
3MOTIFs & 3MATRICES
http://3motif.stanford.edu/
113

Searched for 3est
Visualization Features
Conservation strength shading
Relative and overall solvent accessibilities per residue, and for the eMOTIF as a whole
Accessibility shading
Multiple display and manipulation options
114
Visualization Features
3est - cgg.lilv...wvilmvstaahc
123

3motif Pipeline Construction Query
125
eMotifs and SCOP

eMotifs were observed to correlate strongly with
SCOP classification, even when global sequences
were not overly similar.
eMotifs that were found to hit proteins in
different SCOP locations were particularly
interesting.

131
eMATRIX: Position-Specific Scoring Matrices
132
An eMATRIX
http://ematrix.stanford.edu/
133
eMATRIX Scan
http://ematrix.stanford.edu/
134
eMATRIX Scan Results
http://ematrix.stanford.edu/ematrix-scan/
135
eMATRIX Search
http://ematrix.stanford.edu/
136
eMATRIX Search Results
http://ematrix.stanford.edu/
137
eMATRIX Maker
http://ematrix.stanford.edu/
138
3MATRIX
http://3matrix.stanford.edu/
140
ePROTEOME: A Functional Genomics Database
http://eproteome.stanford.edu/
141
BLOCKS Is Based On Several Protein Family
Databases
142
eBLOCKs - Discovering Protein Motifs
http://eblocks.stanford.edu/
Higher Specificity
A
B
C
Higher Sensitivity
143
Building eBLOCKs with PSI-BLAST

1) Compare the query to database with BLAST
2) Construct profile from significant
similarities
3) Compare the profile to database
4) Repeat steps 2 and 3 until convergence

144
Generating Multiple Overlapping eBLOCKs from
PSI-BLAST Results
[Figure: PSI-BLAST results pass through (1) clustering and grouping, then (2) aligning and trimming, yielding blocks G1B1, G1B2, G2B1, G2B2, G3B1, G3B2, G3B3]
145
Clusters Are Organized Into Groups with Varying
Specificity & Sensitivity
Higher Specificity
A
B
C
Higher Sensitivity
146
eBLOCKs Summary

SWISS-PROT
79,449 Sequences
Filtered Target Set
Homologous, putative, fragment, hypothetical,
probable, possible
57,266 Sequences
PSI-BLAST Searches
17,415
Final Number Of Groups
19,889
Final Number Of Blocks
81,413

147
eBLOCKs are More Comprehensive
148
Properties of 52,671 Novel eBLOCKs

New eBLOCKs are the same width as BLOCKS blocks
Average new eBLOCK 34 positions, others 37
New eBLOCKs have fewer sequences than BLOCKS
blocks
Average new eBLOCK has 18 sequences, BLOCKS 27
New eBLOCKs have similar information content
New eBLOCKs have 2.82 bits/position, BLOCKS 2.88
bits
One half of new eBLOCKs (26,254) are in known
families
One half of new eBLOCKs (26,471) are in 6,713 new
families

149
Example of New eBLOCK in a Known Family
14-3-3 Family of Proteins
68 Sequences, 72 Sequences
BL00796A, P29358G1B1 BL00796B,
P29358G1B2 BL00796C, P29358G1B4
(28, 33)
150
Catalytic Site of ATP Synthase
P-[SAP]-[LIV]-[DNH]-x(3)-S-x-S
PROSITE PS00152
eBLOCKs P19483G1B2
BLOCKS BL00152F
151
Protein Functional Analysis Using BLOCKS or
eBLOCKs
Motifs Significant at an Expectation of 10^-4
Red: eBLOCKs; Black: BLOCKS
152
Two Human Protein Sets

Ensembl (http://www.ensembl.org)
29,304 proteins (Feb 2002) from the human genome
project
Based on GenScan Models
Shorter, more fragmentary protein sequences
RefSeq (http://www.ncbi.nlm.nih.gov/)
21,724 -- curated (XP)
11,407 reviewed sequences (NP)
Based on full length cDNAs
Longer, more reliable protein sequences

153
eBLOCKs Assignments for RefSeq and ENSEMBL
Proteins
154
Web Access to eBLOCKs
http://eblocks.stanford.edu/
155
An Entry From eBLOCKs
http://eblocks.stanford.edu/
156
Another Entry from eBLOCKs
http://eblocks.stanford.edu/
157
A Sample Keyword Search
http://eblocks.stanford.edu/
158
Search A Sequence
http://eblocks.stanford.edu/
159
Software demos

Dot Matrices
http://bioinf.ibun.unal.edu.co/java/dotlet/Dotlet.html
Software Smith-Waterman alignments
http://motif.stanford.edu/alion/
Hardware Smith-Waterman alignments
http://decypher.stanford.edu
eMotif
http://motif.stanford.edu/emotif/
eMatrix
http://motif.stanford.edu/ematrix/
3motif / 3matrix
http://motif.stanford.edu/3motif/
http://motif.stanford.edu/3matrix/
eBlocks
http://eblocks.stanford.edu/
LOCK
http://dlb3.stanford.edu/lock/
SCOP / PDB
http://scop.berkeley.edu

160
Conclusion

Functional analysis is more important than ever
given the rate of growth of sequence databases.
Important for understanding biology: it gives
researchers a head start on how to examine
proteins experimentally.
Important in pharmaceuticals: it allows rapid
discovery of targets.
