Bioinformatics Approaches to Identifying Candidate Effector Molecules of S. typhimurium

Slides: 61
Provided by: halda


1
Bioinformatics Approaches to Identifying
Candidate Effector Molecules of S. typhimurium
  • Matthew Sylvester
  • 12/1/03

2
Endocytic Trafficking
[Figure: endocytic trafficking diagram. Labels: SPI-1, SPI-2, ?, Salmonella-containing vacuole, lysosome, bacterial effector proteins (SseJ, SifA, SseXs, and several others)]
3
Selection of S. typhimurium Proteins
  • Salmonella effectors are secreted into the host
    cell via either the Salmonella pathogenicity
    island 1 (SPI1) or SPI2 type three secretion
    system (TTSS)
  • We chose only those proteins shown experimentally
    in the literature to go out through one or both
    of these systems
  • (see PubMed at http://ncbi.nlm.nih.gov)
  • The seventeen identified SPI1 and SPI2-associated
    effectors were considered as one group for
    subsequent analysis
  • As the N-terminal 150 amino acids have been shown
    to contain conserved sequences for several SPI2
    effectors, we compared this region (Miao and
    Miller, 2000)

4
Alignment of SPI-2 Effector Proteins
Miao E and Miller S. A conserved amino acid
sequence directing intracellular type III
secretion by Salmonella typhimurium. PNAS.
2000, 97(13), pp. 7539-7544.
Published alignment of known and putative SPI2
effectors identified by a BLAST (Basic Local
Alignment Search Tool) search and then aligned
using ClustalW. Note the presence of the
WEK(I/M)XXFF motif from approx. aa 31-38.
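A minimal sketch of how the WEK(I/M)XXFF motif could be scanned for in the N-terminal 150 amino acids of a protein. The function name, the regular expression rendering of the motif (X taken to mean any residue), and the test sequence are illustrative assumptions, not part of the original analysis.

```python
import re

# WEK(I/M)XXFF written as a regular expression; "." stands for X (any residue).
MOTIF = re.compile(r"WEK[IM]..FF")

def find_motif(sequence, n_terminal=150):
    """Return (1-based start position, matched text) for each motif hit
    within the first n_terminal residues."""
    region = sequence[:n_terminal]
    return [(m.start() + 1, m.group()) for m in MOTIF.finditer(region)]

# Hypothetical sequence, built only so that the motif starts at residue 31.
toy_sequence = "M" + "A" * 29 + "WEKIQAFF" + "G" * 20
hits = find_motif(toy_sequence)
print(hits)
```

An exact pattern like this misses diverged copies of the motif, which is one reason profile-based searches are also used.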
5
BLAST
  • Tries to find the most similar proteins
  • Compares a query to sequences in a database and
    each comparison is given a score (higher scores
    are more similar)
  • Scoring matrices (substitution-based) are used to
    assign a score based on the probability of each
    residue substitution
  • Gap penalties are negative scores
  • The alignment score is the sum of scores at each
    position
  • Significance of the overall alignment is given as a
    p-value or an E-value
  • E-value (expectation value): the number of
    different alignments with scores equivalent to or
    better than S that are expected to occur in a
    database search by chance. The lower the E-value,
    the more significant the score.
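The scoring rule described above (sum of per-position substitution scores, with negative gap penalties) can be sketched as follows. The substitution values, the flat gap penalty of -4, and the two aligned strings are illustrative assumptions; BLAST itself uses full matrices and separate gap-open and gap-extend penalties.

```python
# Toy substitution scores (assumed values for illustration only).
TOY_SCORES = {
    frozenset("W"): 11,
    frozenset("E"): 5,
    frozenset("K"): 5,
    frozenset("A"): 4,
    frozenset("S"): 4,
    frozenset("AS"): 1,
}

def alignment_score(row_a, row_b, scores=TOY_SCORES, gap=-4):
    """Sum substitution scores over aligned columns; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap          # gap penalties are negative scores
        else:
            total += scores[frozenset((a, b))]
    return total

# Hypothetical pairwise alignment with one gap.
score = alignment_score("WEK-A", "WEKSS")
print(score)
```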

9
Building Substitution Matrices Part I
Blocks: local ungapped alignments in which rows are
protein segments and columns are amino acid
positions
 1  A D E P Q D A
 2  A C E P D D A
 ..
10  S D E P Q D A
New sequence: A D E P Q R A
  - Count the number of matches and mismatches between
    the new sequence and every other sequence in the block.
  - We have 9 AA matches and 1 AS mismatch in pos. 1
Henikoff S, Henikoff JG. Amino acid substitution
matrices from protein blocks. PNAS (1992).
pp. 10915-10919.
10
Building Substitution Matrices Part II
Next, sum the results of each column, store the
results in a table, and add the new
sequence to the group
By successively adding new sequences, we get a
table with all possible pairs
If we have 9 As and 1 S in the first column,
we get $1 + 2 + \dots + 8 = 36$ possible AA pairs,
9 AS or SA pairs, and 0 SS pairs
If $w$ = width in amino acids and $s$ = number of sequences,
we have $ws(s-1)/2$ total possible
pairs. Here, we have $36 + 9 = 45$, or $1 \cdot 10 \cdot 9 / 2 = 45$
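The pair counting can be checked with a short sketch. The function name is hypothetical; the first example is the 9 A / 1 S column from the slide, and the second uses only the three block rows that are visible in the transcript.

```python
from collections import Counter
from itertools import combinations

def block_pair_counts(block):
    """Count unordered residue pairs, column by column, over a block
    given as a list of equal-length strings (one string per sequence)."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts

# First column of the example: 9 As and 1 S (w = 1, s = 10).
first_column = ["A"] * 9 + ["S"]
f = block_pair_counts(first_column)
print(f[("A", "A")], f[("A", "S")], sum(f.values()))

# The three visible rows only (w = 7, s = 3), so w*s*(s-1)/2 = 21 pairs.
visible_rows = ["ADEPQDA", "ACEPDDA", "SDEPQDA"]
total_pairs = sum(block_pair_counts(visible_rows).values())
print(total_pairs)
```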
11
Calculating the Lod (log-odds) Matrix
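  • Let $f_{ij}$ be the total number of amino acid pairs
    in the frequency table at position $i,j$
  • ($1 \le j \le i \le 20$)
  • Then the observed proportion for each amino acid
    pairing is
    $q_{ij} = f_{ij} / \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}$
  • We have $f_{AA} = 36$ and $f_{AS} = 9$, so $q_{AA} = 36/45$ and
    $q_{AS} = 9/45$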

12
Calculating the Lod Matrix II
  • Now we need the expected probabilities of
    occurrence for each amino acid pair
  • If we assume that the observed frequencies of
    each amino acid are the population frequencies,
    we have
    $p_i = q_{ii} + \sum_{j \ne i} q_{ij} / 2$
  • For our example, $p_A = 36/45 + (9/45)/2 = 0.9$ and
    $p_S = (9/45)/2 = 0.1$
  • Then the expected probability ($e_{ij}$) of occurrence
    is $p_i p_j$ for $i = j$ and $p_i p_j + p_j p_i = 2 p_i p_j$ for $i \ne j$
  • We have expected probability of AA $= 0.9 \times 0.9 = 0.81$,
    AS $= 2 \times 0.9 \times 0.1 = 0.18$, SS $= 0.1 \times 0.1 = 0.01$

13
Calculating the Lod Matrix III
  • Then we calculate the log-odds score in bits as
    $s_{ij} = \log_2(q_{ij}/e_{ij})$, so if we see more than
    expected, $s_{ij} > 0$; if we see as many as expected,
    $s_{ij} = 0$; and if we see less than expected, $s_{ij} < 0$
  • Multiplying $s_{ij}$ by 2 and rounding to the nearest
    integer, we obtain our values for the block
    substitution matrix (BLOSUM)
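The whole calculation for the single-column example (36 AA pairs, 9 AS pairs) can be sketched as below. The function and variable names are assumptions made for illustration; the formulas are the ones stated on the slides.

```python
from math import log2

def lod_scores(f):
    """From pair counts f[(i, j)], return observed proportions q,
    residue frequencies p, and log-odds scores s in bits."""
    total = sum(f.values())
    q = {pair: n / total for pair, n in f.items()}
    p = {}
    for (a, b), qij in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + qij
        else:                      # a mixed pair contributes half to each residue
            p[a] = p.get(a, 0.0) + qij / 2
            p[b] = p.get(b, 0.0) + qij / 2
    s = {}
    for (a, b), qij in q.items():
        e = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        s[(a, b)] = log2(qij / e)
    return q, p, s

f = {("A", "A"): 36, ("A", "S"): 9}   # counts from the example column
q, p, s = lod_scores(f)
half_bits = {pair: round(2 * bits) for pair, bits in s.items()}
print(p, s, half_bits)
```

With a single column the scores stay close to zero, and the SS pair, which was never observed, has no finite score; a real matrix is built from counts pooled over many blocks.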

14
Clustering
  • To prevent double-counting amino acid
    contributions from closely related proteins,
    sequences are clustered and counted as a single
    sequence in counting amino acids
  • Thus, if two sequences are identical at more than X% of
    their aligned positions, then contributions are
    averaged between the two
  • In our example, if we were to cluster 8 of our
    sequences with A in the first position, we now
    have 2 As and 1 S
  • These matrices will be denoted BLOSUM X, such as
    BLOSUM 62
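One way to sketch the effect of clustering on the counts: each sequence in a cluster of size n carries weight 1/n, and pairs inside the same cluster are not counted. The function name and the cluster labels are assumptions; the data reproduce the slide's example of clustering 8 of the A sequences.

```python
from collections import Counter, defaultdict
from itertools import combinations

def clustered_pair_counts(column, cluster_ids):
    """Weighted pair counts for one column, where every cluster counts
    as a single sequence."""
    sizes = Counter(cluster_ids)
    counts = defaultdict(float)
    members = list(zip(column, cluster_ids))
    for (a, ca), (b, cb) in combinations(members, 2):
        if ca == cb:
            continue               # no pairs within a cluster
        counts[tuple(sorted((a, b)))] += 1.0 / (sizes[ca] * sizes[cb])
    return dict(counts)

column = "AAAAAAAAAS"              # 9 As and 1 S
cluster_ids = [0] * 8 + [1, 2]     # 8 As clustered together, then A and S alone
weighted = clustered_pair_counts(column, cluster_ids)
print(weighted)
```

The result is the same as counting pairs among three sequences (A, A, S): one AA pair and two AS pairs.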

15
Substitution Matrix (log-odds)
Based on observed frequencies of substitutions in
related proteins: identical amino acids are given
high positive scores, frequently observed
substitutions get lower positive scores, and
seldom observed substitutions get negative scores.
16
Related Calculations
  • Relative entropy
  • measures the average information, in bits, that
    distinguishes an alignment from chance
  • Expected score in bit units
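The formulas for these two quantities did not survive in the transcript. A sketch under stated assumptions: relative entropy is taken as the sum of $q_{ij} s_{ij}$ over pairs, and the expected score as the sum of $e_{ij} s_{ij}$, with q, e, and s defined as on the earlier slides. The pair counts below are hypothetical, chosen so that every pair is observed at least once.

```python
from math import log2

def relative_entropy_and_expected_score(f):
    """Return (H, E) in bits from pair counts f[(i, j)].
    H weights each score by the observed proportion q_ij;
    E weights it by the expected probability e_ij (assumed convention)."""
    total = sum(f.values())
    q = {pair: n / total for pair, n in f.items()}
    p = {}
    for (a, b), qij in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + qij
        else:
            p[a] = p.get(a, 0.0) + qij / 2
            p[b] = p.get(b, 0.0) + qij / 2
    entropy = 0.0
    expected = 0.0
    for (a, b), qij in q.items():
        e = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        s = log2(qij / e)
        entropy += qij * s
        expected += e * s
    return entropy, expected

# Hypothetical counts (not from the slides).
counts = {("A", "A"): 30, ("A", "S"): 12, ("S", "S"): 3}
H, E = relative_entropy_and_expected_score(counts)
print(H, E)
```

Under this convention H is never negative and E is never positive, which is why a random alignment is expected to score below zero.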

17
Bioinformatics Approaches: Primary Structure
18
Primary Sequence Search Methodology
  • Hmmer search of aligned sequences
      • Hmmer uses hidden Markov models to make a profile
        probability matrix of amino acids from aligned
        sequences
      • The matrix is searched against the appropriate
        genome database
  • TRVI search allowing for gaps and substitutions
      • A motif is developed by allowing for a flexible
        number of gaps wherever there are gaps in the
        alignment
      • Substitutions of amino acids with similar
        properties are allowed
      • The motif is searched against the appropriate
        genome database
  • MEME/MAST search of unaligned sequences
      • Identifies a specified number of domains
        (probability matrices) across a subset of the
        input sequences
      • The domains are searched against the appropriate
        genome database

19
How Hmmer Works: Profile Hidden Markov Models for
Protein Sequence Analysis
  • http://hmmer.wustl.edu/

20
Hmmer Architecture
  • Squares are match states (consensus positions),
    diamonds are insertions, circles are deletions
    and beginning/end. Arrows indicate state
    transitions.

21
Hidden Markov Model Background
From PMMB (Sandrine Dudoit). See also
http://www.ai.mit.edu/murphyk/Bayes/rabiner.pdf
22
More Hidden Markov Model Background
23
Still More Background
34
Hmmer Intro
  • Each M/D/I set is a node; the nodes are determined by
    the data and the multiple sequence alignment
  • Each M state aligns with a single amino acid and
    carries a vector of 20 probabilities determined
    by the proportion of times that each amino acid has
    shown up in that position of the multiple sequence
    alignment
  • Capable of handling gapped alignments
  • At each node either the M (amino acid aligned) or
    D state is used; I states occur between nodes
    and can self-transition
  • Arrows are transition probabilities and are
    estimated from the residues in each column of the
    multiple sequence alignment
  • S, N, C, T, J are special states that are
    algorithm-dependent and controlled externally
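A minimal sketch of the per-column vector of 20 probabilities carried by each M state, computed as plain proportions from a multiple sequence alignment. The function name and the four-row alignment are hypothetical, and this is not how Hmmer itself is invoked; real profile builders also apply sequence weighting and priors, which are left out here.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def match_state_emissions(alignment):
    """For each alignment column, return {amino acid: proportion},
    ignoring gap characters. Columns with no residues yield None."""
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r in ALPHABET]
        counts = Counter(residues)
        n = len(residues)
        if n == 0:
            profile.append(None)
        else:
            profile.append({aa: counts[aa] / n for aa in ALPHABET})
    return profile

# Hypothetical aligned segments; '-' marks a gap.
alignment = ["WEKIF", "WEKMF", "WDKIF", "WEK-F"]
profile = match_state_emissions(alignment)
print(profile[1]["E"], profile[1]["D"], profile[3]["I"])
```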

35
Intermediate Hmmer
  • Want to calculate P(S|M), where the sum over the
    space of all sequences S should be 1
  • The rules of the HMM allow us to do this
  • This implies that insertion lengths follow a geometric
    distribution
  • From a multiple sequence alignment seed, Hmmer
    makes a consensus sequence and searches databases
    against this consensus sequence
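Why a self-transitioning insert state implies a geometric length distribution can be shown with a short sketch. The self-transition probability t = 0.3 is an arbitrary assumed value and the function name is hypothetical.

```python
def insert_length_probability(k, t):
    """Probability that a run of insertions has exactly k residues (k >= 1)
    when the insert state loops to itself with probability t:
    k - 1 self-transitions followed by one exit."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return t ** (k - 1) * (1 - t)

t = 0.3                                   # assumed self-transition probability
probabilities = [insert_length_probability(k, t) for k in range(1, 201)]
total = sum(probabilities)                # approaches 1 as more lengths are included
mean_length = sum(k * pk for k, pk in enumerate(probabilities, start=1))
print(total, mean_length)
```

The mean run length under this model is 1 / (1 - t), so long insertions become exponentially unlikely.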

36
Hmmer Results
37
ClustalW Alignment of SPI1 Effectors
38
ClustalW Alignment of All Known Effectors
39
Analysis of TRVI-Putative Cytoplasmic Proteins
  • Literature search
      • YciE not found
      • YciF classified as a putative structural protein
        by Blattner et al.
  • BLAST searches
      • STM0274 is almost exactly SciI (S. typhimurium);
        other homologies are to ImpC and ImpD (Rhizobium
        leguminosarum) and to conserved hypotheticals. No
        literature on SciI, ImpC, or ImpD
      • YciF has homologies to other putative structural
        proteins in Shigella and E. coli. Also homologous
        to several conserved hypotheticals
      • YciE has homologies to YciE from E. coli and other
        putative cytoplasmic/structural proteins in other
        species (YciE and YciF do not hit each other)
      • STM3767 is homologous to a 4-hydroxy-2-oxoglutarate
        aldolase and several hypothetical proteins
      • STM4192 is homologous to a nucleoprotein/polynucleotide-associated
        enzyme, hypothetical protein YaiL
        from E. coli, and hypotheticals (YaiL not in
        literature)

40
Analysis of TRVI-Microarray Proteins
  • SseJ and YciE show up
  • fruF is part of the phosphoenolpyruvate fructose
    phosphotransferase system
  • STM1181 is a putative flagella basal body part

41
S. typhimurium MEME Motif Summary
42
MEME MAST Analysis
  • MEME search results using MAST and searched by
    domain
      • Domain 1: SseI, SlrP, SopA (putative effector
        proteins), YebE
      • Domain 2: SseI, SlrP, YeeY, YeaH (putative
        cytoplasmic protein)
      • Domain 3: SseI, HepA/RapA, putative inner
        membrane protein (STM1698)
      • Domain 4: YfeC, putative periplasmic proteins
        (STM3783 and STM3605)
      • Domain 5: RffG, OmpR (regulatory protein),
        PrpA, SirC (invasion regulator)
      • Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein
        phosphatase), InvB (part of needle complex)
      • Domain 7: CitC (citrate carrier), YcfN, YjeQ,
        STM0611, STM2406
      • Domain 8: DdlA (d-alanine ligase), GlyS, PgtA
        (phosphoglycerate transporter), STM4502
  • Domains 1, 3, and 5 look to be important for SPI2
    secretion
  • The other domains are important for small,
    related subsets of proteins

43
MEME Including Putative Cytoplasmic Proteins
44
S. typhimurium Search Results Summary
  • Hmmer search of aligned sequences
      • Only the input sequences ( 2 theoretically
        secreted proteins) were returned. SPI1 and SPI2
        effectors both have significant E-values from a
        combined matrix.
  • TRVI search allowing for gaps and substitutions
      • 56 hits returned. Possibly interesting hits
        include SseI, 5 LysR family proteins, 5 putative
        cytoplasmic proteins, 1 putative periplasmic
        protein, 2 inner membrane proteins, and 3
        flagellar proteins. 4 proteins (FruF, SseJ,
        YciE, and a putative flagellar protein) were also
        identified in a DNA microarray screen under SPI2
        inducing conditions with cholesterol.
  • MEME search results using MAST and searched by
    domain
      • Domain 1: SseI, SlrP, SopA (putative effector
        proteins), YebE
      • Domain 2: SseI, SlrP, YeeY, YeaH (putative
        cytoplasmic protein)
      • Domain 3: SseI, HepA/RapA, putative inner
        membrane protein (STM1698)
      • Domain 4: YfeC, putative periplasmic proteins
        (STM3783 and STM3605)
      • Domain 5: RffG, OmpR (regulatory protein),
        PrpA, SirC (invasion regulator)
      • Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein
        phosphatase), InvB (part of needle
        complex)
      • Domain 7: CitC (citrate carrier), YcfN, YjeQ,
        STM0611, STM2406
      • Domain 8: DdlA (d-alanine ligase), GlyS, PgtA
        (phosphoglycerate transporter), STM4502

45
Primary Structure Conclusions
  • The best lead may be YciE, a putative cytoplasmic
    protein found with two different search methods
  • The methods did not give the same output
  • Hypothetical proteins found in the literature
    such as SipD, SptP (SPI1) and SpiC, SrfJ,
    SseB,C,D (SPI2) were not found
  • Not all proteins that go out via SPI2
    necessarily have the WEK(I/M)XXFF motif
  • There is not a clear SPI1 motif

46
Secondary Structure Prediction
  • PSIPRED structure prediction server used
  • Predictions made by two feed-forward neural
    networks based on PSI-BLAST output
  • N-terminal motif (MEME 3): random coil in all SPI2
    proteins
  • First SPI2 motif at aa 31-38 (MEME 1): examples
    are SseJ, SifA, SifB(F), SlrP(F), SseI,
    SspH1(F)
  • Second SPI2 motif at aa 105-120 (no
    MEME): entirely random coil except for a small
    segment of SspH2

47
Secondary Structure Prediction of SifA
48
Alpha-helical Wheel (SifA,SifB)
WEK(I/M)XXFF is the conserved motif among SPI2
effectors from aa 34-41 (positions
1,2,3,4,7). All show this profile but SseJ
(position 7 is polar, but there is still a hydrophobic face).
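A helical wheel can be sketched by placing successive residues 100° apart, which corresponds to the roughly 3.6 residues per turn of an ideal alpha helix. The function name is hypothetical, and the eight-residue segment is an invented instance of the motif with the X positions filled by alanine placeholders.

```python
def wheel_angles(segment, degrees_per_residue=100.0):
    """Return (residue, angle in degrees) for each position of a helical
    segment, measured around the helix axis from the first residue."""
    return [(residue, (index * degrees_per_residue) % 360.0)
            for index, residue in enumerate(segment)]

# Hypothetical instance of WEK(I/M)XXFF; the two As stand in for X.
angles = wheel_angles("WEKIAAFF")
for residue, angle in angles:
    print(residue, angle)
```

In this sketch W (0°), I (300°), and the final F (340°) fall within a 60° arc, which is the kind of clustering that shows up as a hydrophobic face on the wheel.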
49
SspH1 Secondary Structure
50
SspH2 Alpha-Helical Wheel
51
SseG Secondary Structure
52
SseG Alpha-helical Wheel
53
SopD Alpha-Helical Wheel
54
Secondary Structure Conclusion
  • A hydrophobic face on the alpha helix containing
    the conserved motif may be at least in part responsible
    for the translocation signal
  • Other seemingly important domains do not have
    secondary structure (other than random coils)
  • I have not looked at the SPI1 effectors nor the
    putative cytoplasmic proteins in this regard

55
3D Structure Prediction and Comparison: Ab initio
  • Prediction based solely upon the primary amino
    acid sequence of the protein
  • Rosetta (David Baker at U. of Washington) has done
    fairly well at CASP competitions
  • Accuracy of predictions still in question

56
3D Prediction and Comparison: Homology Modeling
  • BLAST protein of interest against proteins in the
    Brookhaven Protein Data Bank (PDB)
  • If there is significant homology (approx. 30%),
    then a model for the protein of interest can be
    determined based on the known structure(s) of the
    other protein(s)
  • This model can be compared to other known or
    predicted models to determine similarity
  • The main flaw is that if there is not a sequence
    with significant homology that has been
    crystallized, this method cannot be used
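A rough sketch of the kind of check behind the homology cutoff, assuming the figure of approx. 30 refers to percent sequence identity. The function name, the definition of identity used here (matches over columns where neither sequence has a gap), and the two aligned strings are all assumptions for illustration.

```python
def percent_identity(row_a, row_b):
    """Percent of identical residues over aligned columns that contain
    no gap in either sequence."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    columns = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

THRESHOLD = 30.0   # assumed reading of the slide's "approx. 30"

# Hypothetical aligned pair: target sequence and a template of known structure.
identity = percent_identity("WEKIAAFF", "WEKMAGFF")
print(identity, identity >= THRESHOLD)
```

Other definitions divide by the full alignment length or by the shorter sequence, so reported identities for the same pair can differ.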

57
Results of Swiss-Model Homology Search of all
Putative and Known Effectors
  • Only full-length SspH1, SspH2 and SopE had enough
    homology to get structures
  • Only SopE gave me a result when I submitted the
    first 150 amino acids
  • The catalytic domain of SopE has been
    crystallized, but the first 77 amino acids are
    missing
  • Only the Leucine-rich repeat region of SspH1 and
    SspH2 could be modeled (amino acids 158 and
    higher)

58
Tertiary Structure Examples
SspH1 homology-modeled to YopM. Homology starts
at amino acid 158. Geno3D2 used.
Catalytic domain of SopE (starts at aa 77) and
Cdc42
59
Future Directions
  • Do a similar primary structure analysis, but
    expand it to also include hypothetical proteins
    from the literature (19 such proteins)
  • Study the different classes of proteins known to
    form the needle, form the translocon and act as
    chaperones
  • Do secondary structure analysis on the known SPI1
    proteins and on the putative cytoplasmic proteins
    just identified
  • Try the Rosetta program

60
Acknowledgments
  • Kasturi Haldar
  • Team Salmonella
  • Drew Big Daddy Salmonella Catron
  • Everett Roark
  • Team Malaria
  • Paul Cheresh
  • Carlos Lopez-Estrano
  • Sean Murphy
  • Thanos Lykidis
  • Luisa Hiller
  • Thomas Akompong
  • Travis Harrison
  • Parwez Nawabi
  • Souvik Bhattacharjee
  • Team Bioinformatics
  • Dhugal Bedford