Title: A Systematic Search for Genes Encoding Proline Rich Proteins in Arabidopsis
1A Systematic Search for Genes Encoding Proline
Rich Proteins in Arabidopsis
- Aaron Newman
- CMPS290N Project
2Synopsis
- Background
- Objective
- Methods & Results
- Analysis of Putative PRPs
- Future work
3Proline Physicochemical Properties
- An Imino Acid
- Non-Polar
- Planar Cyclic Molecule
- Cis and Trans Forms
- Rotationally Hindered
- α-Helix Breaker
- Often Present in Structural Proteins
4A well-characterized example of a Proline-Rich Structural Protein
- Collagen, a fibrous protein, contains relatively long regions of tandem repeats of the motif Gly-X-Pro/Hyp.
- Collagen strands assume a triple-helical assemblage.
5Arabidopsis thaliana, a Model Plant
- Complete genome is sequenced
- An immense database awaits biological inquiry.
- Arabidopsis, like other plants, incorporates
Proline-Rich Proteins into cell wall matrices.
PRPs assist in structural support and may also
help defend against physical damage and pathogens.
7Project Objective
- Mine the Arabidopsis Protein Databank for hypothetical/unknown/unnamed proteins that satisfy the following broad criteria:
  - Sufficient sequence similarity to predefined PRPs. This requirement can be understood in terms of the presence of:
    - Known proline-rich motifs and/or proposed proline-rich motifs
    - Tandem or regular repeats of one or more of the above
  - Predicted N-terminal hydrophobic subsequence indicative of a secretion signal
- The genes underlying candidate PRPs will be flagged for further investigation.
9Methods & Results
- Given the set of known AtPRPs, arbitrarily select PRP4 to probe the Arabidopsis protein database using Protein-Protein BLAST.
- Use the sequences with the top five scores for the training set.
- Since PRP1 and PRP3 are not present in the training set, add them to enrich the sequence space for motif detection.
- Training Set
  - i) PRP4 GI7620015
  - ii) PRP2 GI7620011
  - iii) PRP1 GI25456291
  - iv) PRP3 GI25456294
  - v) Putative extensin protein GI24030361
  - vi) Extensin-like protein GI24030361
  - vii) At2g21140 GI30017301
11Methods & Results continued
- Use MEME to extract motifs from the training set, with the maximum number of motifs set to three.
- MEME output
  - i) PVPVYKPP
  - ii) IPKKPCPP
  - iii) SPPYYTPP
- Notice the similarity of sequence (ii) to the known PRP2 and PRP4 motif, KKPCPP, and the similarity of sequence (i) to the defined PRP4 motif, PPPKIEHPPPVPVYK.
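The overlap between the MEME output and the known motifs can be checked mechanically. The following is a minimal sketch, assuming the motif strings exactly as listed on the slide; the helper name `longest_common_substring` is illustrative and is not part of MEME or BLAST.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of residues shared by a and b."""
    best = ""
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > len(best):
                    best = a[i - curr[j]:i]
        prev = curr
    return best


# Motif strings copied from the slide
meme_motifs = {"i": "PVPVYKPP", "ii": "IPKKPCPP", "iii": "SPPYYTPP"}
known_motifs = {"PRP2/PRP4": "KKPCPP", "PRP4": "PPPKIEHPPPVPVYK"}

shared_ii = longest_common_substring(meme_motifs["ii"], known_motifs["PRP2/PRP4"])
shared_i = longest_common_substring(meme_motifs["i"], known_motifs["PRP4"])
print("motif (ii) vs KKPCPP:", shared_ii)
print("motif (i) vs PPPKIEHPPPVPVYK:", shared_i)
```

An exact shared substring is a stricter test than the visual similarity the slide refers to; a scored alignment would also capture near matches.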
12Methods & Results continued
- Use the three consensus sequences produced by MEME to query the Arabidopsis Protein Database using an instance of Protein-Protein BLAST that searches for short, nearly exact matches.
- Combine the output for all consensus sequences to produce one set of hits.
- For this set of hits, manually remove:
  - Sequences not annotated as hypothetical/unknown/unnamed
  - Each protein sequence that fails to exhibit 8 or more hits to at least one consensus sequence. (The threshold of 8 hits is arbitrarily chosen.)
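The hit-count filter was applied manually, but its logic can be sketched in code. This is a minimal sketch, not the procedure actually used: it assumes a "hit" is any window within one mismatch of a consensus sequence, as a stand-in for BLAST's short, nearly exact matching. The function names, the mismatch tolerance, and the toy sequences are all hypothetical.

```python
def count_hits(sequence: str, motif: str, max_mismatches: int = 1) -> int:
    """Count windows of sequence within max_mismatches of motif (overlaps allowed)."""
    k = len(motif)
    hits = 0
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        mismatches = sum(1 for x, y in zip(window, motif) if x != y)
        if mismatches <= max_mismatches:
            hits += 1
    return hits


def passes_hit_filter(sequence, motifs, min_hits=8, max_mismatches=1):
    """Keep a sequence if at least one motif reaches min_hits (8 on the slide)."""
    return any(count_hits(sequence, m, max_mismatches) >= min_hits for m in motifs)


# Consensus sequences from the MEME output
CONSENSUS = ["PVPVYKPP", "IPKKPCPP", "SPPYYTPP"]

# Toy sequences invented for illustration only; they are not Arabidopsis proteins
toy_repeat = "MKLA" + "PVPVYKPP" * 8 + "GG"
toy_sparse = "MKLA" + "PVPVYKPP" * 3 + "GG"

for name, seq in [("toy_repeat", toy_repeat), ("toy_sparse", toy_sparse)]:
    print(name, passes_hit_filter(seq, CONSENSUS))
```

Exposing `min_hits` as a parameter makes it easy to see how sensitive the candidate list is to the arbitrary threshold of 8.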
13Methods & Results continued
- The following list of sequences emerged:
  - gi|11994465|dbj|BAB02467.1| unnamed protein product [Arabidopsis thaliana]
  - gi|10178119|dbj|BAB11412.1| unnamed protein product [Arabidopsis thaliana]
  - gi|23308313|gb|AAN18126.1| At2g10940/F15K19.1 [Arabidopsis thaliana]
  - gi|8978351|dbj|BAA98204.1| unnamed protein product [Arabidopsis thaliana]
  - gi|24417264|gb|AAN60242.1| unknown [Arabidopsis thaliana]
  - gi|5430752|gb|AAD43152.1| Hypothetical Protein [Arabidopsis thaliana]
  - gi|20259488|gb|AAM13864.1| unknown protein [Arabidopsis thaliana]
  - gi|5306245|gb|AAD41978.1| unknown protein [Arabidopsis thaliana]
  - gi|5306260|gb|AAD41992.1| hypothetical protein [Arabidopsis thaliana]
14Methods & Results continued
- Independently query the new set of potential PRPs and the training set against the PFAM database.
15Methods & Results continued
- Explanation of conserved domains of several putative PRPs according to the PFAM database:
  - Extensin_2: Family of hydroxyproline-rich glycoproteins found in the plant extracellular matrix and characterized by repetitive motifs.
  - Root_cap: Conserved region within plant root cap proteins.
  - LRR_1: Leucine-rich repeat region, implicated in a diverse array of functions and found in a broad range of organisms.
  - Tryp_alpha_amyl: Family in plants composed of trypsin/alpha-amylase inhibitors, seed storage proteins and lipid transport proteins.
  - Proteasome: Broad family of proteins that function to degrade other proteins.
- Description of the conserved domain identified in defined PRPs:
  - DUF1210: Representative region of a family of Proline-Rich Proteins. Aside from serving as a PRP indicator, the significance of this region is not understood/published.
- Interestingly, AtPRP1 and AtPRP3 do not harbor DUF1210, nor do they contain any other discernible regions of conservation when compared to the PFAM database.
16Methods & Results continued
- To narrow down the set of putative PRPs further, a CLUSTALW multiple alignment was performed.
- The following sequences have the highest degree of relatedness, as judged by manual inspection of the alignment:
  - GI 5306260
  - GI 20259488
  - GI 5306245
  - GI 5430752
  - GI 23308313
  - GI 24417264
18Methods & Results continued
- Furthermore,
  - Sequences of the same color (groups A, B, and C in the table below) were observed to be very similar to one another. CLUSTALW alignments of each same-colored set, in isolation from the other potential PRP sequences, supported this categorization scheme.
- PFAM HMM hit and the label by which each sequence is denoted:
  - Sequence       PFAM HMM           Label
  - GI 5306260     -                  A
  - GI 20259488    LRR_1              B1
  - GI 5306245     LRR_1              B2
  - GI 5430752     LRR_1              B3
  - GI 23308313    Tryp_alpha_amyl    C1
  - GI 24417264    -                  C2
20Analysis-Secretion Signal
- Each defined pre-processed PRP has a signal peptide at the N-terminus. The signal peptide is:
  - 20-70 successive amino acids
  - hydrophobic
  - cleaved upon translocation of the nascent PRP into the ER
  - in general, the cellular version of a destination address that, in this case, allows a PRP to be transported out of the cell
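The hydrophobicity property listed above can be illustrated with a rough numeric check. This minimal sketch assumes the Kyte-Doolittle hydropathy scale, where positive values mean hydrophobic; it is a crude heuristic and not what SPScan or SignalP compute. The test peptide is the signal peptide predicted for sequence A elsewhere in this deck, the charged peptide is an invented contrast, and `mean_hydropathy` is an illustrative name.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(peptide: str) -> float:
    """Average Kyte-Doolittle hydropathy over all residues of the peptide."""
    residues = peptide.upper()
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


signal_a = "mrvplidflrflvlilslsgasvaad"   # predicted signal peptide of sequence A
charged_toy = "DEKRDEKR"                  # invented hydrophilic contrast

print("signal peptide A:", round(mean_hydropathy(signal_a), 2))
print("charged toy peptide:", round(mean_hydropathy(charged_toy), 2))
```

A positive mean is consistent with a hydrophobic N-terminus but does not by itself establish a secretion signal or a cleavage site.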
21Analysis-Secretion Signal
- Two signal peptide detection programs were applied to the known AtPRP set (PRP1, 2, 3, 4) and the refined potential AtPRP set:
  - SPScan in GCG (empirically derived scoring matrix)
  - SignalP 3.0 (HMMs and neural networks)
22Analysis-Secretion Signal
- Characterized PRPs
  - SPScan predicted the same signal peptides as those reported in the literature (perhaps the same program was used).
  - SignalP also yielded identical predictions, with the exception of PRP2, where the cleavage site is inferred to be two amino acids downstream from the cleavage site predicted by SPScan and the publication.
  - PRP1 and PRP3 exhibit a high degree of relatedness in regard to signal peptides and motifs.
  - Similarly, PRP2 and PRP4 show a high level of similarity in terms of signal peptides and motifs.
  - Thus, in Arabidopsis, there is variation within both PRP signal peptides and PRP motif composition (will be supported shortly).
- Fowler J., Characterization and Expression of Four Proline Rich Proteins in Arabidopsis
23Analysis-Secretion Signal
- Putative PRPs
  - By and large, SPScan and SignalP converged on the same secretory signal predictions.
  - Both programs rejected the presence of an N-terminal secretion signal in B2 when parameters were set to default values.
  - By reducing the maximum score threshold from 7 to 5.5 in SPScan, a secretion signal for B2 was predicted with a probability score of .9 (the closer to zero, the better).
- Predicted signal peptides:
  - Sequence  Start  Predicted Signal Peptide        End
  - A         1      mrvplidflrflvlilslsgasvaad      26
  - B1        1      mtrrtmekpfgcflllfcftisiffys     27
  - B2        1      mphiykqplgifqgfvptltdaev        24
  - B3        1      merpfgcffilllisytvvatf          22
  - C1        1      mdssklsslslclfliciiylpqhslacg   29
  - C2        1      mdssklsslslclfliciiylpqhslacg   29
24Analysis-Signal Comparison
- Predicted signal peptide A was observed to be similar to the PRP3 and PRP1 predicted signal peptides.
- CLUSTAL W (1.82) multiple sequence alignment
  - PRP3  MAITRSSLA--ICLILSLVTITTA 22
  - PRP1  MAITRASFA--ICILLSLATIATA 22
  - A     MRVPLIDFLRFLVLILSLSGASVA 24
- Predicted signal peptides B1, B2, B3 are mildly similar to the PRP4 and PRP2 signal peptides. B1 and B3 are fairly similar to one another.
- CLUSTAL W (1.82) multiple sequence alignment
  - B3    -----MERPFG---CFFILLLISYTVVA--- 20
  - B1    MTRRTMEKPFG---CFLLLFCFTISIFF--- 25
  - PRP4  --MRILPEPRGSVPCLLLLVSVLLSATLSLA 29
  - PRP2  --MRILPKSGGGALCLLFVF-ALCSVAHS-- 26
  - B2    -MPHIYKQPLG----IFQGFVPTLTDA---- 22
- Predicted signal peptides C1 and C2 are identical.
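The similarity judgments above were made by inspection. One way to put a number on them is a pairwise percent identity over the aligned rows. This is a minimal sketch: it assumes identity is counted only over columns where neither sequence has a gap, which is one of several conventions in use, and `percent_identity` is an illustrative helper rather than a CLUSTAL W feature.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Aligned rows copied from the first CLUSTAL W alignment on the slide
alignment = {
    "PRP3": "MAITRSSLA--ICLILSLVTITTA",
    "PRP1": "MAITRASFA--ICILLSLATIATA",
    "A": "MRVPLIDFLRFLVLILSLSGASVA",
}

for name in ("PRP1", "A"):
    value = percent_identity(alignment["PRP3"], alignment[name])
    print(f"PRP3 vs {name}: {value:.1f}% identity")

# C1 and C2 share the same predicted signal peptide
c_peptide = "mdssklsslslclfliciiylpqhslacg"
print("C1 vs C2:", percent_identity(c_peptide, c_peptide))
```

Percent identity ignores conservative substitutions, so it can understate the similarity that a scored alignment would show.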
25Analysis of Motifs
- Known PRP Motifs Proposed PRP Motifs
- Compared to Known PRPs Potential PRPs
- Cooper, J., A New Proline-rich Early Nodulin from Medicago truncatula
26Analysis sequence A
- Colored regions correspond to proposed motifs
with the exception of PPVHK which is a previously
described motif. This analysis is not thorough.
27Analysis - group B
- Comparison of group B sequences
28Analysis group B cont.
- Comparison of group B sequences
29Analysis group C
- Comparison of group C sequences
30Analysis Upshot A
- So far, A is the strongest contender for PRP designation because:
  - It contains many occurrences of the previously established PRP motif, PPVHK.
  - The majority of the hypothetical sequence is both proline rich and in tandem repeat configuration.
  - The predicted signal peptide is strikingly similar to the estimated signal peptides for PRP1 and PRP3.
31Analysis Upshot B
- The sequences of group B are potential PRPs because:
  - They contain multiple occurrences of the proposed PRP motifs PPVHS and PPVYS. These proposed motifs differ from the following known PRP motifs only in the last amino acid, serine in place of threonine. Serine and threonine are physicochemically similar.
    - Defined PRP motifs   Estimated PRP motifs
    - PPVHT                PPVHS
    - PPVYT                PPVYS
  - B1 and B3 have probable signal peptides.
  - B2, while similar in amino acid sequence to B1 and B3, is somewhat of an anomaly in regard to its improbable signal peptide.
- The potential presence of LRRs among the sequences of group B does not seem to diminish the probability that the sequences of B are PRPs.
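The relationship between the defined and estimated motifs can be stated precisely as a single-position substitution. A minimal sketch follows, using the motif pairs from the slide; `single_substitutions` is an illustrative helper name and positions are reported 1-based.

```python
def single_substitutions(known: str, proposed: str):
    """List (position, known_residue, proposed_residue) where two equal-length motifs differ."""
    if len(known) != len(proposed):
        raise ValueError("motifs must have equal length")
    return [
        (i + 1, k, p)
        for i, (k, p) in enumerate(zip(known, proposed))
        if k != p
    ]


# (defined motif, estimated motif) pairs from the slide
pairs = [("PPVHT", "PPVHS"), ("PPVYT", "PPVYS")]
differences = {
    proposed: single_substitutions(known, proposed) for known, proposed in pairs
}
for proposed, diffs in differences.items():
    print(proposed, diffs)
```

Both proposed motifs differ from their defined counterparts at position 5 only, threonine to serine, which is the substitution the slide argues is conservative.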
32Analysis Upshot C
- The sequences of group C are putative PRPs because:
  - They contain multiple tandem repeats of a proposed proline-rich decapeptide motif, whose amino acid composition is comparable to that of known PRP motifs.
  - Both have probable signal peptides.
- The probable presence of a trypsin/alpha-amylase inhibitor domain in C1 is unexpected, but it is far from clear that the presence of this domain raises sufficient doubt to remove C1 or group C from the set of potential PRPs.
34Future Work
- Continue using bioinformatics tools to increase/decrease confidence levels regarding A, B, and C.
- Another approach to detect PRPs:
  - Subtraction Method
    - Acquire the raw Arabidopsis genome.
    - Remove all inferred non-coding regions.
    - Convert estimated exons into amino acids.
    - Remove all inferred proteins without probable N-terminal signal peptides.
    - Search for tandem repeats with similar character composition to known PRP motifs.
    - Admit remaining sequences to the set of potential PRPs.
    - Perform analysis as previously described.
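The last filtering steps of the subtraction method can be sketched as follows, starting from already-translated protein sequences; the genome acquisition and exon translation steps are assumed to have been done upstream. Everything here is hypothetical scaffolding: `has_signal_peptide` is a crude hydrophobicity heuristic standing in for SPScan or SignalP, the thresholds are arbitrary, only exact tandem repeats are detected, and the two proteins are invented toys.

```python
import re

HYDROPHOBIC = set("AVLIMFWC")


def has_signal_peptide(protein: str, n_term: int = 25, min_fraction: float = 0.5) -> bool:
    """Crude stand-in for a signal peptide predictor: Met start plus hydrophobic-rich N-terminus."""
    head = protein[:n_term].upper()
    if len(head) < n_term or not head.startswith("M"):
        return False
    return sum(r in HYDROPHOBIC for r in head) / len(head) >= min_fraction


def proline_rich_tandem_repeats(protein, min_unit=4, max_unit=10, min_copies=3, min_pro=0.3):
    """Return (unit, copies) for exact tandem repeats whose repeat unit is proline rich."""
    found = []
    for k in range(min_unit, max_unit + 1):
        # A unit of length k followed by at least (min_copies - 1) exact copies of itself
        pattern = re.compile(r"(.{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(protein.upper()):
            unit = match.group(1)
            if unit.count("P") / k >= min_pro:
                found.append((unit, len(match.group(0)) // k))
    return found


def subtract(proteins: dict) -> dict:
    """Keep proteins that pass the signal peptide filter and carry proline-rich tandem repeats."""
    candidates = {}
    for name, seq in proteins.items():
        if not has_signal_peptide(seq):
            continue
        repeats = proline_rich_tandem_repeats(seq)
        if repeats:
            candidates[name] = repeats
    return candidates


# Invented toy proteins, for illustration only
toy_proteins = {
    "secreted_repeat": "M" + "LLLLVVLLAALLFFIILLAAVVLL" + "PPVHK" * 6 + "GG",
    "cytosolic_repeat": "M" + "DEKR" * 6 + "PPVHK" * 6 + "GG",
}
candidates = subtract(toy_proteins)
print(sorted(candidates))
```

Real PRP repeats are often imperfect, so a practical version would need an approximate repeat finder and a composition-based comparison against known PRP motifs rather than the exact regular-expression match used here.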