Title: Bioinformatics Approaches to Identifying Candidate Effector Molecules of S. typhimurium
1Bioinformatics Approaches to Identifying
Candidate Effector Molecules of S. typhimurium
- Matthew Sylvester
- 12/1/03
2Endocytic Trafficking
[Figure labels: SPI-1; SPI-2; ?; Salmonella-containing vacuole; bacterial effector proteins (SseJ, SifA, SseXs, and several others); lysosome]
3Selection of S. typhimurium Proteins
- Salmonella effectors are secreted into the host cell via either the Salmonella pathogenicity island 1 (SPI1) or the SPI2 type three secretion system (TTSS)
- We chose only those proteins shown experimentally in the literature to be secreted through one or both of these systems (see PubMed at http://ncbi.nlm.nih.gov)
- The seventeen identified SPI1- and SPI2-associated effectors were considered as one group for subsequent analysis
- Because the N-terminal 150 amino acids have been shown to contain conserved sequences for several SPI2 effectors (Miao and Miller, 2000), we compared this region
4Alignment of SPI-2 Effector Proteins
Miao E, Miller S. A conserved amino acid sequence directing intracellular type III secretion by Salmonella typhimurium. PNAS (2000), 97(13), pp. 7539-7544.
Published alignment of known and putative SPI2
effectors identified by a BLAST (Basic Local
Alignment Search Tool) search and then aligned
using ClustalW. Note the presence of the
WEK(I/M)XXFF motif from approx. aa 31-38.
5BLAST
- Tries to find the most similar proteins
- Compares a query to sequences in a database; each comparison is given a score (higher scores indicate greater similarity)
- Scoring matrices (substitution-based) assign a score based on the probability of each residue substitution
- Gap penalties are negative scores
- The alignment score is the sum of the scores at each position
- The significance of the overall alignment is given as a p-value or an E-value
- E-value (expectation value): the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E-value, the more significant the score.
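The scoring rule above (sum of per-position substitution scores plus negative gap penalties) can be sketched in a few lines of Python. This is a minimal illustration, not how BLAST itself is implemented: the score table `TOY_SCORES`, the flat `GAP_PENALTY`, and the two aligned strings are hypothetical values chosen only for the example.

```python
# Toy substitution scores (hypothetical values for illustration only;
# real searches use a full 20x20 matrix such as BLOSUM62).
TOY_SCORES = {
    ("A", "A"): 4, ("S", "S"): 4, ("D", "D"): 6,
    ("A", "S"): 1, ("A", "D"): -2, ("S", "D"): 0,
}
GAP_PENALTY = -5  # assumed flat penalty per gap position


def pair_score(a, b):
    """Look up the score for residues a and b, in either order."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def alignment_score(query, subject):
    """Sum per-position scores for two pre-aligned, equal-length strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(query, subject):
        if a == "-" or b == "-":
            total += GAP_PENALTY      # gap positions contribute a negative score
        else:
            total += pair_score(a, b)
    return total


print(alignment_score("ASD-A", "AADSA"))  # 4 + 1 + 6 - 5 + 4 = 10
```

Real BLAST typically uses separate gap-opening and gap-extension penalties; the single flat penalty here is a simplification.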
9Building Substitution Matrices Part I
Blocks: local ungapped alignments with rows = protein segments and columns = amino acid positions

 1   A D E P Q D A
 2   A C E P D D A
 ..
10   S D E P Q D A

New sequence:   A D E P Q R A
- Count the number of matches and mismatches between the new sequence and every other sequence in the block.
- We have 9 AA matches and 1 AS mismatch in position 1.

Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. PNAS (1992), pp. 10915-10919.
10Building Substitution Matrices Part II
Next, sum the results of each column, store the results in a table, and add the new sequence to the group.
By successively adding new sequences, we get a table with all possible pairs.
If we have 9 As and 1 S in the first column, we get $1 + 2 + \dots + 8 = 36$ possible AA pairs, 9 AS or SA pairs, and 0 SS pairs.
If $w$ = width in amino acids and $s$ = number of sequences, we have $ws(s-1)/2$ total possible pairs. Here, we have $36 + 9 = 45$, or $1 \times 10 \times 9 / 2 = 45$.
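The pair counting described above can be checked with a short Python sketch. The function names are illustrative, and the column of 9 As and 1 S is the example from the slide.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(column):
    """Count unordered residue pairs in one block column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


def count_block_pairs(block):
    """Sum pair counts over every column of an ungapped block."""
    total = Counter()
    for column in zip(*block):
        total += count_column_pairs(column)
    return total


column = "A" * 9 + "S"   # 9 As and 1 S, as in the example
pairs = count_column_pairs(column)
print(pairs[("A", "A")], pairs[("A", "S")], pairs[("S", "S")])  # 36 9 0
print(sum(pairs.values()))  # 45, i.e. 1 * 10 * 9 / 2
```

`count_block_pairs` applies the same count to every column, so a block of width w and depth s contributes ws(s-1)/2 pairs in total.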
11Calculating the Lod (log-odds) Matrix
- Let $f_{ij}$ be the total number of amino acid pairs $i,j$ in the frequency table ($1 \le j \le i \le 20$)
- Then the observed proportion for each amino acid pairing is $q_{ij} = f_{ij} / \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}$
- We have $f_{AA} = 36$ and $f_{AS} = 9$, so $q_{AA} = 36/45$ and $q_{AS} = 9/45$
12Calculating the Lod Matrix II
- Now we need the expected probabilities of occurrence for each amino acid pair
- If we assume that the observed frequencies of each amino acid are the population frequencies, we have $p_i = q_{ii} + \sum_{j \ne i} q_{ij}/2$
- For our example, $p_A = 36/45 + (9/45)/2 = 0.9$ and $p_S = (9/45)/2 = 0.1$
- Then the expected probability of occurrence $e_{ij}$ is $p_i p_j$ for $i = j$ and $p_i p_j + p_j p_i = 2 p_i p_j$ for $i \ne j$
- We have expected probabilities of AA = $0.9 \times 0.9 = 0.81$, AS = $2 \times 0.9 \times 0.1 = 0.18$, SS = $0.1 \times 0.1 = 0.01$
13Calculating the Lod Matrix III
- Then we calculate the log-odds score in bits as $s_{ij} = \log_2(q_{ij}/e_{ij})$: if we see more pairs than expected, $s_{ij} > 0$; if we see as many as expected, $s_{ij} = 0$; and if we see fewer than expected, $s_{ij} < 0$
- Multiplying $s_{ij}$ by 2 and rounding to the nearest integer, we obtain our values for the block substitution matrix (BLOSUM), in half-bit units
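The three log-odds slides can be tied together with a small Python sketch that goes from pair counts to scores for the 9 A / 1 S example. The function name and dictionary layout are illustrative assumptions, and pairs that were never observed (such as SS here) are simply left out of the score table, which is one possible way to handle a zero count.

```python
import math


def log_odds(pair_counts):
    """Return (q, p, e, s) dicts for unordered pair counts {(i, j): f_ij}."""
    total = sum(pair_counts.values())
    q = {pair: f / total for pair, f in pair_counts.items()}

    # p_i = q_ii + sum over j != i of q_ij / 2
    p = {}
    for (i, j), q_ij in q.items():
        if i == j:
            p[i] = p.get(i, 0.0) + q_ij
        else:
            p[i] = p.get(i, 0.0) + q_ij / 2
            p[j] = p.get(j, 0.0) + q_ij / 2

    # e_ij = p_i * p_j for i == j, and 2 * p_i * p_j otherwise
    e = {(i, j): (p[i] * p[j] if i == j else 2 * p[i] * p[j]) for (i, j) in q}

    # s_ij in bits; pairs with q_ij == 0 are left undefined in this sketch
    s = {pair: math.log2(q[pair] / e[pair]) for pair in q if q[pair] > 0}
    return q, p, e, s


counts = {("A", "A"): 36, ("A", "S"): 9, ("S", "S"): 0}
q, p, e, s = log_odds(counts)
half_bits = {pair: round(2 * value) for pair, value in s.items()}
print(p, e, s, half_bits)
```

With only one column of data, the observed and expected proportions are nearly equal, so both scores round to 0 in half-bit units; real BLOSUM matrices are built from counts pooled over many blocks.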
14Clustering
- To prevent double-counting amino acid contributions from closely related proteins, sequences are clustered, and each cluster is counted as a single sequence when counting amino acids
- Thus, if two sequences are identical at > X% of their aligned positions, their contributions are averaged
- In our example, if we were to cluster 8 of our sequences with A in the first position, we would now have 2 As and 1 S
- These matrices are denoted BLOSUM X, such as BLOSUM 62
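The effect of clustering on the counts can be sketched in Python by weighting each sequence by one over the size of its cluster. The cluster assignment below is hypothetical and given by hand; deciding which sequences belong together (the X% identity test) is not shown.

```python
from collections import Counter, defaultdict


def weighted_column_counts(column, cluster_ids):
    """Count residues in a column, weighting each sequence by 1 / cluster size."""
    sizes = Counter(cluster_ids)
    counts = defaultdict(float)
    for residue, cid in zip(column, cluster_ids):
        counts[residue] += 1.0 / sizes[cid]
    return dict(counts)


column = "AAAAAAAAAS"            # 9 As and 1 S, as in the example
cluster_ids = [0] * 8 + [1, 2]   # assumed: the first 8 sequences form one cluster
print(weighted_column_counts(column, cluster_ids))  # {'A': 2.0, 'S': 1.0}
```

The eight clustered sequences together count as a single A, which gives the 2 As and 1 S quoted above.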
15Substitution Matrix (log-odds)
Based on observed frequencies of substitutions in related proteins: identical amino acids are given high positive scores, frequently observed substitutions get lower positive scores, and seldom observed substitutions get negative scores.
16Related Calculations
- Relative entropy
  - Measures the average information, in bits, that distinguishes an alignment from chance
- Expected score in bit units
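For reference, both quantities can be written with the symbols used in the log-odds slides ($q_{ij}$, $p_i$, $s_{ij}$). The forms below are the definitions as I understand them from the Henikoff and Henikoff (1992) paper cited earlier; treat them as a sketch and check them against the original before relying on them.

```latex
H = \sum_{i=1}^{20} \sum_{j=1}^{i} q_{ij}\, s_{ij}
\qquad\qquad
E = \sum_{i=1}^{20} \sum_{j=1}^{i} p_i\, p_j\, s_{ij}
```

$H$ weights each score by how often the pair is actually observed, while $E$ weights it by the product of the background frequencies.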
17Bioinformatics Approaches: Primary Structure
18Primary Sequence Search Methodology
- Hmmer search of aligned sequences
  - Hmmer uses hidden Markov models to make a profile probability matrix of amino acids from aligned sequences
  - The matrix is searched against the appropriate genome database
- TRVI search allowing for gaps and substitutions
  - A motif is developed by allowing for a flexible number of gaps wherever there are gaps in the alignment
  - Substitutions of amino acids with similar properties are allowed
  - The motif is searched against the appropriate genome database
- MEME/MAST search of unaligned sequences
  - Identifies a specified number of domains (probability matrices) across a subset of the input sequences
  - The domains are searched against the appropriate genome database
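The idea behind the gapped, substitution-tolerant motif search can be illustrated with a regular expression in Python. This is not the TRVI tool or its interface, only a sketch of the concept. The similarity groups in `SIMILAR`, the bracket notation `[IM]` for the slide's (I/M), the `max_gap` parameter, and the test strings are all assumptions made for the example.

```python
import re

# Hypothetical similarity groups; a real search would choose these deliberately.
SIMILAR = {
    "W": "WFY",   # aromatic
    "E": "ED",    # acidic
    "K": "KR",    # basic
    "F": "FWY",   # aromatic
}


def motif_to_regex(motif, max_gap=0):
    """Turn a motif such as 'WEK[IM]XXFF' into a compiled regular expression.

    'X' matches any residue, '[..]' is an explicit residue set, and any other
    letter matches itself or a residue from its similarity group. Up to
    max_gap extra residues are tolerated at each run of X positions.
    """
    parts = []
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch == "[":                      # explicit residue set, e.g. [IM]
            end = motif.index("]", i)
            parts.append(motif[i:end + 1])
            i = end + 1
        elif ch == "X":                    # run of wildcard positions
            run = 1
            while i + run < len(motif) and motif[i + run] == "X":
                run += 1
            parts.append(".{%d,%d}" % (run, run + max_gap))
            i += run
        else:                              # literal residue or its similarity group
            parts.append("[%s]" % SIMILAR.get(ch, ch))
            i += 1
    return re.compile("".join(parts))


strict = motif_to_regex("WEK[IM]XXFF")
flexible = motif_to_regex("WEK[IM]XXFF", max_gap=1)
print(strict.pattern)
print(bool(strict.search("MAAWEKIQSFFGG")))   # made-up sequence containing the motif
print(bool(flexible.search("WEKIQSTFF")))     # one extra residue in the XX run
```

Scanning a genome would amount to applying the compiled pattern to every predicted protein sequence and collecting the hits.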
19How Hmmer Works: Profile Hidden Markov Models for
Protein Sequence Analysis
20Hmmer Architecture
- Squares are match states (consensus positions),
diamonds are insertions, circles are deletions
and beginning/end. Arrows indicate state
transitions.
21Hidden Markov Model Background
From PMMB (Sandrine Dudoit). See also
http://www.ai.mit.edu/murphyk/Bayes/rabiner.pdf
22More Hidden Markov Model Background
23Still More Background
34Hmmer Intro
- Each M/D/I triplet is a node; the nodes are determined by the data and the multiple sequence alignment
- Each M state aligns with a single amino acid and carries a vector of 20 probabilities determined by the proportion of times that each amino acid appears at that position in the multiple sequence alignment
- Capable of handling gapped alignments
- At each node either the M state (amino acid aligned) or the D state is used; I states occur between nodes and can self-transition
- Arrows are transition probabilities, estimated from the residues in each column of the multiple sequence alignment
- S, N, C, T, and J are special states that are algorithm-dependent and controlled externally
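The vector of 20 probabilities carried by each M state can be sketched in Python as per-column residue proportions. This is a simplified illustration and not Hmmer's actual estimation procedure: the flat pseudocount is an assumption added so that unseen residues keep a non-zero probability, every column is treated as a match column, and the toy alignment is made up.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_emissions(alignment, pseudocount=1.0):
    """Per-column amino acid probabilities from an aligned set of sequences.

    Gap characters ('-') are ignored. The flat pseudocount is an assumption
    of this sketch, not necessarily what Hmmer uses.
    """
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


alignment = ["WEKI", "WEKM", "WDKI", "W-KI"]   # made-up toy alignment
profile = match_emissions(alignment)
print(profile[0]["W"], profile[1]["E"], profile[1]["D"])
```

Each dictionary in `profile` sums to 1 and plays the role of one match state's emission vector.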
35Intermediate Hmmer
- Want to calculate P(S|M), where the sum over the space of all sequences should be 1
- The rules of the HMM allow us to do this
- It is implied that insertion lengths follow a geometric distribution
- From a multiple sequence alignment seed, Hmmer makes a consensus sequence and searches databases against this consensus sequence
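The geometric distribution of insertion lengths follows directly from the insert state's self-transition, which a short Python sketch can make concrete. The self-transition probability used here is a hypothetical value.

```python
def insert_length_probability(k, self_transition):
    """P(exactly k residues are emitted by an insert state once it is entered).

    The state loops back to itself k - 1 times and then leaves, so the
    lengths follow a geometric distribution.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return self_transition ** (k - 1) * (1.0 - self_transition)


t = 0.3  # hypothetical I -> I transition probability
probs = [insert_length_probability(k, t) for k in range(1, 60)]
print(probs[:3])
print(sum(probs))  # approaches 1 as more lengths are included
```

Because the probabilities over all possible lengths sum to 1, the insert state does not break the requirement that P(S|M) sums to 1 over all sequences.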
36Hmmer Results
37ClustalW Alignment of SPI1 Effectors
38ClustalW Alignment of All Known Effectors
39Analysis of TRVI-Putative Cytoplasmic Proteins
- Literature search
  - YciE not found
  - YciF classified as a putative structural protein by Blattner et al.
- BLAST searches
  - STM0274 is almost exactly SciI (S. typhimurium); other homologies to ImpC and ImpD (Rhizobium leguminosarum) and conserved hypotheticals; no literature on SciI, ImpC, or ImpD
  - YciF has homologies to other putative structural proteins in Shigella and E. coli; also homologous to several conserved hypotheticals
  - YciE has homologies to YciE from E. coli and other putative cytoplasmic/structural proteins in other species (YciE and YciF do not hit each other)
  - STM3767 is homologous to a 4-hydroxy-2-oxoglutarate aldolase and several hypothetical proteins
  - STM4192 is homologous to a nucleoprotein/polynucleotide-associated enzyme, hypothetical protein YaiL from E. coli, and hypotheticals (YaiL not in literature)
40Analysis of TRVI-Microarray Proteins
- SseJ and YciE show up
- fruF is part of the phosphoenolpyruvate fructose phosphotransferase system
- STM1181 is a putative flagellar basal body component
41S. typhimurium MEME Motif Summary
42MEME MAST Analysis
- MEME search results using MAST and searched by domain
  - Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
  - Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
  - Domain 3: SseI, HepA/RapA, putative inner membrane protein (STM1698)
  - Domain 4: YfeC, putative periplasmic proteins (STM3783 and STM3605)
  - Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
  - Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
  - Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
  - Domain 8: DdlA (D-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502
- Domains 1, 3, and 5 look to be important for SPI2 secretion
- The other domains are important for small, related subsets of proteins
43MEME Including Putative Cytoplasmic Proteins
44S. typhimurium Search Results Summary
- Hmmer search of aligned sequences
  - Only the input sequences (2 theoretically secreted proteins) were returned. SPI1 and SPI2 effectors both have significant E-values from a combined matrix.
- TRVI search allowing for gaps and substitutions
  - 56 hits returned. Possibly interesting hits include SseI, 5 LysR family proteins, 5 putative cytoplasmic proteins, 1 putative periplasmic protein, 2 inner membrane proteins, and 3 flagellar proteins.
  - 4 proteins (FruF, SseJ, YciE, and a putative flagellar protein) were also identified in a DNA microarray screen under SPI2-inducing conditions with cholesterol.
- MEME search results using MAST and searched by domain
  - Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
  - Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
  - Domain 3: SseI, HepA/RapA, putative inner membrane protein (STM1698)
  - Domain 4: YfeC, putative periplasmic proteins (STM3783 and STM3605)
  - Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
  - Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
  - Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
  - Domain 8: DdlA (D-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502
45Primary Structure Conclusions
- The best lead may be YciE, a putative cytoplasmic protein found with two different search methods
- The methods did not give the same output
- Hypothetical proteins found in the literature, such as SipD and SptP (SPI1) and SpiC, SrfJ, and SseB, C, D (SPI2), were not found
- Not all proteins that are secreted via SPI2 necessarily have the WEK(I/M)XXFF motif
- There is no clear SPI1 motif
46Secondary Structure Prediction
- Psipred structure prediction server used
  - Predictions made by two feed-forward neural networks based on PSI-BLAST output
- N-terminal motif (MEME 3): random coil in all SPI2 proteins
- First SPI2 motif at aa 31-38 (MEME 1): examples are SseJ, SifA, SifB(F), SlrP(F), SseI, SspH1(F)
- Second SPI2 motif at aa 105-120 (no MEME): entirely random coil except for a small segment of SspH2
47Secondary Structure Prediction of SifA
48Alpha-helical Wheel (SifA,SifB)
WEK(I/M)XXFF is the conserved motif among SPI2 effectors, from aa 34-41 (positions 1, 2, 3, 4, 7). All show this profile except SseJ (position 7 is polar, but there is still a hydrophobic face).
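A helical wheel places successive residues about 100 degrees apart around the helix axis (an ideal alpha helix has 3.6 residues per turn). The Python sketch below computes those angles for the eight motif positions; the string "WEKIXXFF" is only a stand-in for the motif, with I chosen from (I/M) and X left as a placeholder, and the ideal-helix geometry is an assumption.

```python
DEGREES_PER_RESIDUE = 100.0  # ideal alpha helix: 3.6 residues per turn


def wheel_angles(segment):
    """Return (position, residue, angle in degrees) around the helix axis."""
    return [
        (i + 1, res, (i * DEGREES_PER_RESIDUE) % 360.0)
        for i, res in enumerate(segment)
    ]


for pos, res, angle in wheel_angles("WEKIXXFF"):
    print(pos, res, angle)
```

Under these assumptions the W at position 1, the I/M at position 4, and the F at position 8 fall at 0, 300, and 340 degrees, within a 60 degree arc, which is the kind of clustering a hydrophobic face implies.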
49SspH1 Secondary Structure
50SspH2 Alpha-Helical Wheel
51SseG Secondary Structure
52SseG Alpha-helical Wheel
53SopD Alpha-Helical Wheel
54Secondary Structure Conclusion
- A hydrophobic face on the alpha helix containing the conserved motif may be at least in part responsible for the translocation signal
- Other seemingly important domains do not have secondary structure (other than random coils)
- I have not looked at the SPI1 effectors or the putative cytoplasmic proteins in this regard
553D Structure Prediction and Comparison: Ab Initio
- Prediction based solely upon the primary amino acid sequence of the protein
- Rosetta (David Baker at U. of Washington) has done fairly well at CASP competitions
- Accuracy of predictions still in question
563D Prediction and Comparison: Homology Modeling
- BLAST the protein of interest against proteins in the Brookhaven Protein Data Bank (PDB)
- If there is significant homology (approx. 30%), then a model for the protein of interest can be determined based on the known structure(s) of the other protein(s)
- This model can be compared to other known or predicted models to determine similarity
- The main flaw is that if there is no sequence with significant homology that has been crystallized, this method cannot be used
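Whether a candidate template clears the homology cutoff can be checked with a simple percent-identity calculation over a pairwise alignment. The Python sketch below assumes the figure of about 30 refers to percent sequence identity and counts only positions where neither sequence has a gap; other definitions of the denominator exist, and the two aligned strings are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


THRESHOLD = 30.0  # assumed cutoff, taken from the approximate figure above

identity = percent_identity("MKT-LLVAG", "MRTALIVSG")  # made-up aligned pair
print(identity, identity >= THRESHOLD)
```

Here 5 of the 8 compared positions match, so the pair would pass the cutoff; a protein with no template above the threshold cannot be modeled this way.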
57Results of Swiss-Model Homology Search of all
Putative and Known Effectors
- Only full-length SspH1, SspH2, and SopE had enough homology to get structures
- Only SopE gave me a result when I submitted the first 150 amino acids
- The catalytic domain of SopE has been crystallized, but the first 77 amino acids are missing
- Only the leucine-rich repeat region of SspH1 and SspH2 could be modeled (amino acids 158 and higher)
58Tertiary Structure Examples
SspH1 homology-modeled to YopM. Homology starts at amino acid 158. Geno3D2 used.
Catalytic domain of SopE (starts at aa 77) and Cdc42
59Future Directions
- Do a similar primary structure analysis, expanded to include hypothetical proteins from the literature (19 such proteins)
- Study the different classes of proteins known to form the needle, form the translocon, and act as chaperones
- Do secondary structure analysis on the known SPI1 proteins and on the putative cytoplasmic proteins just identified
- Try the Rosetta program
60Acknowledgments
- Kasturi Haldar
- Team Salmonella
  - Drew "Big Daddy Salmonella" Catron
  - Everett Roark
- Team Malaria
  - Paul Cheresh
  - Carlos Lopez-Estrano
  - Sean Murphy
  - Thanos Lykidis
  - Luisa Hiller
  - Thomas Akompong
  - Travis Harrison
  - Parwez Nawabi
  - Souvik Bhattacharjee
- Team Bioinformatics
  - Dhugal Bedford