Protein Sequence Analysis - PowerPoint PPT Presentation

Provided by: davvbio
1
Protein Sequence Analysis
  • By
  • Rashmi Shrivastava
  • Lecturer
  • School of Biotechnology
  • Devi Ahilya Vishwavidyalaya
  • Indore

2
Introduction
  • The genomes of many organisms have been deciphered.
  • The next step is to identify key regions,
    especially protein-coding regions.
  • Assigning functions to individual proteins.
  • Predicting the molecular structures of the proteins.
  • Developing protein interaction networks.
  • Using the information obtained for structure-based
    drug design, discovering new drug targets,
    creating mutations to alter properties or create
    a desired property in proteins, and so
    on.

3
  • By themselves, the letters (the amino acid or
    genome sequence) have no meaning.
  • Our aim is to read them as language: sentences
    correspond to proteins, and words correspond to
    motifs (recognizable patterns and signatures).
  • There are two approaches to investigating the
    meaning of sequences:
  • Pattern recognition techniques: detect
    similarity between sequences.
  • Ab initio prediction methods: predict the
    structure, and thus the function.

4
Protein databases (the source of information)
  • Primary and secondary databases
  • Primary sequence databases:
  • Entrez Protein
  • PIR (developed at NBRF)
  • Swiss-Prot
  • TrEMBL

5
Secondary databases
  • Contain the results of analysis of primary databases
  • PROSITE/InterPro: protein families are characterized
    by the single most conserved motif (domain),
    identified by multiple sequence alignment
  • PRINTS: protein families are characterized by
    several conserved motifs that together form a
    fingerprint or signature for a particular family
  • BLOCKS and Pfam
  • Profiles: the variable regions between conserved
    motifs carry information about insertions and
    deletions, which helps detect distant sequence
    relationships
  • ENZYME and KEGG: functional classification

6
Structure classification databases
  • SCOP (Structural Classification of Proteins):
    classifies proteins in a hierarchy of family,
    superfamily, and fold
  • CATH (Class, Architecture, Topology,
    Homology): hierarchical domain classification of
    proteins
  • C: gross secondary structure content
  • A: arrangement of secondary structure elements
  • T: overall shape and connectivity
  • H: > 35% sequence identity
  • Protein Data Bank (PDB)

7
  • Sequence alignment

Pairwise
Multiple
8
Pairwise Sequence Alignment
9
Sequence alignment
Global sequence alignment
Local sequence alignment
10
Algorithm
  • Global sequence alignment:
  • Needleman-Wunsch
  • Local sequence alignment:
  • Smith-Waterman
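
The global alignment algorithm named above can be sketched as a small dynamic-programming routine. This is a minimal illustration only: the function name and the match/mismatch/gap values are assumptions chosen for the example, not parameters given in the slides, and a linear gap penalty is used for simplicity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (score, aligned_a, aligned_b).

    The scoring values are illustrative assumptions (linear gap penalty).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ATCAGAGTC", "TTCAGTC")
print(score)
print(top)
print(bottom)
```

Smith-Waterman local alignment uses the same recurrence with two changes: every cell is floored at 0, and the traceback starts from the highest-scoring cell and stops at the first 0.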

11
Identity and Similarity
In an alignment, the sequence already in the database is
called the subject, and the sequence being aligned against it
is called the query or probe. If an aligned probe residue is
the same as the subject residue, the pair is identical; if the
two residues differ but are of the same chemical nature (e.g.
glutamate and aspartate), the pair is similar.

VLSPADKTNVKAAWGKVGAHAGYEG
|||  .   |   | || |     |
VLSEGEWQLVLHVWAKVEADVAGHG

Total residues: 25
Identical residues (|): 9
Similar, not identical (.): 1
Gaps: 0
Percent similarity: 40.000 (| and ., identity + similarity)
Percent identity: 36.000 (| only)
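
The counts on this slide can be reproduced with a short script. This is a sketch under stated assumptions: the similarity groups below are an illustrative grouping by side-chain chemistry chosen for the example (the slide only names glutamate/aspartate), and the function names are hypothetical.

```python
# Illustrative similarity groups (an assumption, not taken from the slides).
SIMILAR_GROUPS = [set("DE"), set("KRH"), set("NQ"),
                  set("ST"), set("FWY"), set("ILVM")]


def are_similar(x, y):
    """True if x and y differ but fall in the same illustrative group."""
    return x != y and any(x in g and y in g for g in SIMILAR_GROUPS)


def alignment_stats(query, subject, gap="-"):
    """Count identical, similar, and gapped columns of two aligned rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    identical = similar = gaps = 0
    for q, s in zip(query, subject):
        if q == gap or s == gap:
            gaps += 1
        elif q == s:
            identical += 1
        elif are_similar(q, s):
            similar += 1
    total = len(query)
    return {
        "total": total,
        "identical": identical,
        "similar": similar,
        "gaps": gaps,
        "percent_identity": 100.0 * identical / total,
        "percent_similarity": 100.0 * (identical + similar) / total,
    }


stats = alignment_stats("VLSPADKTNVKAAWGKVGAHAGYEG",
                        "VLSEGEWQLVLHVWAKVEADVAGHG")
print(stats)
```

With this grouping the only similar pair in the example is D/E, so the script gives 36% identity and 40% similarity, matching the slide. A different grouping would change the similarity figure but not the identity figure.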
12
Alignment
  • ATCAGAGTC
  • TTC----AGTC

  • ATCAGAGTC
  • TTCAG----TC

  • ATCAGAGTC
  • TTCA----GTC
13
Aligning Sequences
Sequence 1: actaccagttcatttgatacttctcaaa
Sequence 2: taccattaccgtgttaactgaaaggacttaaagact
14
Gap Insertion
Without a gap:
V K L A W A A K G N E A A P A K A A V D H Y V A A
V K A W A A K G N E A E G L S A A P D J K V A A P
Total residues: 25, identical residues: 04, gaps: 00, percent identity: 16.00
With one gap inserted:
V K L A W A A K G N E A A P A K A A V D H Y V A A
V K _ A W A A K G N E A E G L S A A P D J K V A A
Total residues: 25, identical residues: 18, gaps: 01, percent identity: 72.00
15
Scoring System
The same protein can differ even between closely related
organisms. Some substitutions are more frequent than
others. Chemically similar amino acids
can replace one another without severely affecting the
protein's function and structure.
16
Matrices used to score alignments
  • Sparse (identity) matrices
  • Based on matching identical residues only
  • Problems faced:
  • 1. Diagnostic power is relatively poor, as all
    identical matches carry equal weighting.
  • 2. Alignments can be mathematically significant but
    biologically insignificant.

17
To solve this problem
Scoring matrices have been devised that weight
matches between non-identical residues
according to observed substitution rates across
large evolutionary distances. Alignments scored with these
matrices may appear mathematically insignificant
yet be biologically significant, especially when
aligning sequences of very low identity.
19
Percent Accepted Mutation (PAM, or Dayhoff) Matrices
  • Similar sequences organized into phylogenetic
    trees
  • Number of amino acid changes counted
  • Relative mutabilities evaluated
  • 20 x 20 amino acid substitution matrix calculated

20
  • PAM1: 1 accepted mutation event per 100 amino
    acids; PAM250: 250 accepted mutation events per 100
    amino acids
  • The PAM1 matrix can be multiplied by itself N times
    to give the transition matrix for sequences that
    have undergone N PAM units of change
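
The extrapolation from PAM1 to PAM250 by repeated multiplication can be illustrated on a toy example. The sketch below assumes a made-up 3-letter alphabet and a hypothetical PAM1-like probability matrix (rows are the original residue, columns the replacement, each row summing to 1); the numbers are not Dayhoff's data.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Return the n-th power of m (n >= 1) by repeated multiplication."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical mutation probabilities for a toy 3-letter alphabet.
# Each diagonal entry (0.99) is the chance a residue stays unchanged
# over one PAM unit; this is an assumption for illustration only.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal entries are much smaller than 0.99, which reflects the larger evolutionary distance. The published PAM scoring tables are log-odds scores derived from such probability matrices, a step not shown here.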

21
  • Derived from global alignments of closely related
    sequences.
  • Matrices for greater evolutionary distances are
    extrapolated from those for lesser ones.
  • The number after the matrix name (PAM40, PAM100) refers
    to the evolutionary distance; greater numbers mean
    greater distances.
  • Does not take into account the different evolutionary
    rates of conserved and non-conserved
    regions.

22
PAM 1
23
PAM 250
25
Scoring
A K W T N L K - - - - W A K V - A D V A G H - G
A K - T N V K A K L P W G K V G G H V A G E Y G
  • The score of the alignment in this system is:
  • the sum of the matrix values at the aligned pairs
    (A,A), (K,K), (T,T), ..., (K,K), (W,W), (A,G), ...
  • - (penalty for gap insertion/deletion) × (number of gaps)
  • - (penalty for gap extension) × (total length of
    all gaps)
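
The scoring rule on this slide can be written out as a short function. This is a sketch, not the presenter's implementation: the pair-scoring function and the penalty values are hypothetical stand-ins for a real substitution matrix and real gap parameters.

```python
def score_alignment(row_a, row_b, pair_score, gap_open, gap_extend, gap="-"):
    """Score two aligned rows: pair scores minus gap penalties.

    Follows the slide's formula: every gap pays gap_open once, and every
    gapped column additionally pays gap_extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    substitution = 0
    n_gaps = 0        # number of separate gaps (maximal runs)
    gap_length = 0    # total number of gapped columns
    prev = None       # which row was gapped in the previous column
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            cur = "a" if x == gap else "b"
            if cur != prev:
                n_gaps += 1
            gap_length += 1
            prev = cur
        else:
            substitution += pair_score(x, y)
            prev = None
    return substitution - gap_open * n_gaps - gap_extend * gap_length


def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 2 if x == y else -1


row_a = "AKWTNLK----WAKV-ADVAGH-G"
row_b = "AK-TNVKAKLPWGKVGGHVAGEYG"
print(score_alignment(row_a, row_b, toy_score, gap_open=5, gap_extend=1))
```

In practice `toy_score` would be replaced by a lookup into a PAM or BLOSUM table, so that a conservative pair such as (L,V) scores higher than an unrelated pair.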

27
  • Henikoff, S. and Henikoff, J.G. (1992)
  • Uses blocks of protein sequence fragments from
    different families (the BLOCKS database)
  • Amino acid pair frequencies are calculated by summing
    over all possible pairs in each block
  • Different evolutionary distances are incorporated
    into this scheme with a clustering procedure
    (sequences with identity above a particular threshold
    are placed in the same cluster)
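
The pair-counting step described above can be sketched as follows. The block here is a made-up toy example and the function name is hypothetical; the clustering step and the conversion of counts into log-odds scores are deliberately left out.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered residue pairs over every column of an ungapped block.

    `block` is a list of equal-length sequence fragments. Each column
    contributes one count for every pair of sequences in the block.
    """
    if len({len(seq) for seq in block}) > 1:
        raise ValueError("all sequences in a block must have the same length")
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


# Toy block of three fragments, two columns wide (illustrative only).
toy_block = ["AA", "AS", "SA"]
for pair, n in sorted(count_pairs(toy_block).items()):
    print(pair, n)
```

A block with s sequences and w columns yields w * s * (s - 1) / 2 pairs in total, which is the denominator when these counts are turned into observed pair frequencies.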

28
  • Target frequencies are identified directly
    instead of by extrapolation.
  • Sequences more than x% identical within the
    block where substitutions are being counted are
    grouped together and treated as a single sequence
  • BLOSUM50: > 50% identity
  • BLOSUM62: > 62% identity

29
BLOSUM
     A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V
  A  4
  B -2  6
  C  0 -3  9
  D -2  6 -3  6
  E -1  2 -4  2  5
  F -2 -3 -2 -3 -3  6
  G  0 -1 -3 -1 -2 -3  6
  H -2 -1 -3 -1  0 -1 -2  8
  I -1 -3 -1 -3 -3  0 -4 -3  4
  K -1 -1 -3 -1  1 -3 -2 -1 -3  5
  L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4
  M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5
  N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6
  P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7
  Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5
  R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5
  S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4
  T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5
  V  0 -3 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4

31
Rules of thumb
Lower PAM and higher BLOSUM matrices find short local
alignments of highly similar sequences. Higher
PAM and lower BLOSUM matrices find longer, weaker local
alignments.
32
PAM vs. BLOSUM
Given its basic assumptions and
construction, the PAM model is
designed to track the evolutionary origins of
proteins, whereas the BLOSUM model is designed to find
conserved domains of proteins.
34
Protein Structure
  • Primary structure: the linear sequence
    of amino acids in a protein molecule
  • Secondary structure: regions of local
    regularity within a protein fold (α-helices,
    β-strands, turns, etc.)
  • Supersecondary structure: the arrangement of
    α-helices and/or β-strands into discrete folding
    units (β-barrels, β-α-β units, Greek key motifs,
    etc.)
  • Tertiary structure: the overall fold of a protein
    sequence, formed by the packing of its secondary
    and/or supersecondary structure elements
  • Quaternary structure: the arrangement of separate
    protein chains in a protein molecule

35
From the primary sequence to protein properties
  • Predicting protein localization or secretory nature
    from the presence of a signal peptide or
    localization signal
  • Transmembrane helix prediction to identify
    membrane proteins
  • Calculation of physicochemical properties: pI,
    molecular weight (Mwt)
  • Identification of coiled-coil regions

36
Post translational modification prediction
www.expasy.org
39
Kyte-Doolittle hydrophobicity plot
  • Classifies the nature of amino acids: hydrophilic or hydrophobic
  • A sliding window of 9-20 amino acids is taken
  • A value greater than 0 means hydrophobic
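
The sliding-window calculation behind such a plot can be sketched as below. The dictionary holds the published Kyte-Doolittle hydropathy values; the function name, the default window of 9, and the example peptide are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return the mean hydropathy of every full window along the sequence."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KD_SCALE[residue] for residue in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical peptide: a charged stretch followed by a hydrophobic stretch.
peptide = "DEKRDEKRDE" + "LIVLLAVIGL"
for start, value in enumerate(hydropathy_profile(peptide, window=9), start=1):
    label = "hydrophobic" if value > 0 else "hydrophilic"
    print(start, round(value, 2), label)
```

Plotting the returned values against the window start position gives the hydropathy plot. Longer windows smooth the curve and are typically preferred when looking for membrane-spanning segments.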

40
From Sequence to Structure
  • Secondary structure prediction: GOR, PredictProtein,
    nnpredict
  • Domain prediction: SBASE, ProDom

41
Importance of protein secondary structure prediction
43
Basis of Secondary structure prediction
  • Conservation in the multiple sequence alignment
  • Hidden Markov Models and Neural networks
  • 70-80% accuracy is achieved.

44
Method used
45
Key features of secondary structure prediction
46
Chou-Fasman Algorithm
48
GOR
Multiple Sequence
49
Some sites
  • Predator