Sequence specific recognition of DNA by proteins. - PowerPoint PPT Presentation

Provided by: Pan92
Learn more at: http://www.seas.gwu.edu
Transcript and Presenter's Notes



1
Sequence specific recognition of DNA by proteins.
  • Nitrogen and oxygen exposed in the grooves can
    make hydrogen bonds with proteins.
  • Different Watson/Crick base pairs have different
    patterns of donors and acceptors
  • Figure legend (symbols lost in extraction): H-bond
    acceptor, hydrogen atom, H-bond donor, methyl group

[Figure: the C·G, G·C, T·A and A·T base pairs, showing the groups exposed in the major groove and the minor groove]
2
Differences between DNA and RNA
  • T is replaced by U
  • Extra OH group at the 2' position of the pentose
    sugar: the sugar is ribose, not deoxyribose
  • RNA usually does not form a double helix; it makes
    loops within one strand and often contains modified
    bases
  • The additional 2'-OH group of RNA can form
    hydrogen bonds, stabilizing tertiary structure

3
Illustration of RNA secondary structures.
From M.S. Andronescu
4
DNA/RNA thermodynamics.
  • Two major types of interactions
  • Base pairing (hydrogen bonds)
  • Base stacking of nearest neighbors (π-electron
    sharing of aromatic rings, hydrophobic)

5
RNA secondary structure prediction
  • Assumptions used in predictions:
  • - The most likely structure is the most stable one.
  • - The energy of each base pair depends only on
    the energy of the previous base pair.
  • - Energy parameters for different types of RNA
    secondary structures are derived from
    experiment.
  • - The structure is formed without knots.

6
Minimum energy method of RNA secondary structure
prediction.
  • Self-complementary regions can be found in a dot
    matrix
  • The energy of each base pair depends only on the
    energy of the previous base pair
  • Energy parameters for different types of RNA
    secondary structures are derived from
    experiment
  • The most energetically favorable conformations
    are predicted by a method similar to dynamic
    programming
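
The dynamic-programming idea can be sketched with a deliberately simplified stand-in for the energy model. The block below is not the minimum-energy method itself: it uses Nussinov-style base-pair maximization, in which every allowed pair counts the same, instead of experimentally derived energy parameters. The function name `max_base_pairs`, the `min_loop` parameter and the inclusion of G-U wobble pairs are assumptions made for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested base pairs in an RNA sequence.

    Simplification (assumption): each allowed pair scores 1, whereas real
    minimum-energy methods use experimentally derived energy parameters.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] = maximum number of pairs in the subsequence seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving >= min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0


# A short hairpin: three stacked pairs closing a three-base loop.
print(max_base_pairs("GGGAAAUCC"))
```

Because only nested pairs are considered, the recursion matches the assumption that the structure is formed without knots.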

7
Sequence covariation method.
  • Some positions from different species can covary
    because they are involved in pairing
  • fm(B1): frequencies of nucleotides in column m
  • fn(B2): frequencies of nucleotides in column n
  • fm,n(B1,B2): joint frequencies of two nucleotides
    in the two columns.
  • Seq 1
    ---G------C---
  • Seq 2
    ---G------C---
  • Seq 3
    ---A------T---
  • Seq 4
    ---T------A---
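
One common way to turn these frequencies into a covariation score is the mutual information between two alignment columns. The transcript does not preserve the slide's own formula, so the use of mutual information here is an assumption, and `mutual_information` is a hypothetical helper name.

```python
from collections import Counter
from math import log2


def mutual_information(col_m, col_n):
    """Mutual information (in bits) between two alignment columns.

    Uses the single-column frequencies fm(B1), fn(B2) and the joint
    frequencies fm,n(B1,B2) of the nucleotides in the two columns.
    """
    n = len(col_m)
    f_m = Counter(col_m)
    f_n = Counter(col_n)
    f_mn = Counter(zip(col_m, col_n))
    mi = 0.0
    for (b1, b2), count in f_mn.items():
        joint = count / n
        mi += joint * log2(joint / ((f_m[b1] / n) * (f_n[b2] / n)))
    return mi


# The two covarying columns of Seq 1-4 above: G/C, G/C, A/T, T/A.
print(mutual_information("GGAT", "CCTA"))
```

Columns that vary independently give a score near zero, while columns that change together, as paired positions do, give a high score.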

8
Gene prediction.
  • Gene: DNA sequence encoding protein, rRNA, tRNA
  • The gene concept is complicated:
  • Introns/exons
  • Alternative splicing
  • Genes-in-genes
  • Multisubunit proteins

9
Codon usage tables.
  • Each amino acid can be encoded by several
    codons.
  • Each organism has characteristic pattern of
    codon usage.
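
A codon usage table can be sketched as relative codon frequencies counted over a coding sequence. This is a minimal illustration: the helper name `codon_usage` and the toy sequence are assumptions, and real tables are compiled from many genes of an organism.

```python
from collections import Counter


def codon_usage(coding_seq):
    """Relative frequency of each codon in an in-frame coding sequence."""
    seq = coding_seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: count / total for codon, count in counts.items()}


# Toy sequence: alanine is encoded twice by GCT and once by GCC.
usage = codon_usage("ATGGCTGCTGCCTAA")
print(usage["GCT"], usage["GCC"])
```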

10
Problems arising in gene prediction.
  • Distinguishing pseudogenes (former genes that no
    longer function) from genes.
  • Exon/intron structure in eukaryotes; exon
    flanking regions are not very well conserved.
  • Exons can be joined in alternative combinations
    (alternative splicing).
  • Genes can overlap each other and occur on
    different strands of DNA.

11
Gene identification
  • Homology-based gene prediction
      - Similarity searches (e.g. BLAST, BLAT)
      - ESTs
  • Ab initio gene prediction
      - Prokaryotes: ORF identification
      - Eukaryotes: promoter prediction, polyA-signal
        prediction, splice site and start/stop-codon
        predictions

12
Ab initio gene prediction.
  • Predictions are based on the observation that
    the DNA sequence of a gene is not random:
  • A gene-coding sequence has start and stop
    codons.
  • Each species has a characteristic pattern of
    synonymous codon usage.
  • Non-coding ORFs are very short.
  • A gene would correspond to the longest ORF.
  • These methods look for the characteristic
    features of genes and score them high.

13
Example of ORFs.
Each sequence can be read in six possible frames,
three for each direction of transcription.
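
A minimal sketch of scanning the six reading frames for ORFs, assuming an uppercase DNA string, ATG as the only start codon and the three standard stop codons. The function names and the `min_codons` parameter are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=1):
    """Return (strand, frame, orf_sequence) for each ATG...stop ORF in six frames."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in stops:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


def longest_orf(seq):
    """The longest ORF over all six frames, or None if there is none."""
    return max(find_orfs(seq), key=lambda orf: len(orf[2]), default=None)


print(longest_orf("CCATGAAATTTTAGCC"))
```

Picking the longest ORF follows the heuristic on the previous slide; it is a starting point for prokaryotic sequences rather than a complete gene finder.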
14
Gene preference score: an important indicator of a
coding region.
  • Observation: the frequencies of codons and codon
    pairs in coding and non-coding regions are
    different.
  • Given a sequence of codons C
  • and assuming independence of the codons, one
    computes the probability of finding C in a coding
    region,
  • the probability of finding the sequence C in
    non-coding regions,
  • and, from these two, the gene preference score.
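
The slide's equations did not survive in the transcript. A standard form consistent with the bullets, offered here as an assumption, is a log-likelihood ratio: for codons C1 ... Cm, P(C) is the product of the coding-region frequencies f(Ci), P0(C) is the product of the non-coding frequencies f0(Ci), and the score is log(P(C) / P0(C)). The function name and the frequency values below are made up for illustration.

```python
from math import log


def gene_preference_score(codons, coding_freq, noncoding_freq):
    """Log-likelihood ratio log(P(C) / P0(C)) under codon independence.

    coding_freq[c] is the frequency of codon c in coding regions,
    noncoding_freq[c] its frequency in non-coding regions.
    """
    score = 0.0
    for codon in codons:
        # Summing logs is equivalent to taking the log of the two products.
        score += log(coding_freq[codon] / noncoding_freq[codon])
    return score


# Hypothetical frequencies, not taken from any real codon table.
coding = {"GCT": 0.04, "AAA": 0.02}
noncoding = {"GCT": 0.02, "AAA": 0.02}
print(gene_preference_score(["GCT", "AAA"], coding, noncoding))
```

A positive score means the codon sequence looks more like coding than non-coding DNA under these frequencies.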

15
Gene prediction accuracy.
  • True positives (TP): nucleotides which are
    correctly predicted to be within the gene.
  • Actual positives (AP): nucleotides which are
    located within the actual gene.
  • Predicted positives (PP): nucleotides which are
    predicted to be in the gene.
  • Sensitivity = TP / AP
  • Specificity = TP / PP
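
These definitions translate directly into set operations on nucleotide positions. A small sketch, with a hypothetical function name and made-up coordinates:

```python
def prediction_accuracy(actual_positions, predicted_positions):
    """Nucleotide-level sensitivity (TP / AP) and specificity (TP / PP)."""
    actual = set(actual_positions)
    predicted = set(predicted_positions)
    tp = len(actual & predicted)  # nucleotides correctly predicted in the gene
    sensitivity = tp / len(actual)
    specificity = tp / len(predicted)
    return sensitivity, specificity


# Made-up example: the real gene spans positions 100-199,
# the prediction covers positions 120-199.
print(prediction_accuracy(range(100, 200), range(120, 200)))
```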

16
The value of genome sequences lies in their
annotation
  • Annotation: characterizing genomic features
    using computational and experimental methods
  • Genes: levels of annotation
  • Gene prediction: where are the genes?
  • What do they encode?
  • What proteins/pathways are they involved in?

17
Human Genome project.
18
Analysis of gene order (synteny).
  • Genes with a related function are frequently
    clustered on the chromosome.
  • Example: E. coli genes responsible for the
    synthesis of Trp are clustered, and their order is
    conserved between different bacterial species.
  • Operon: a set of genes transcribed together,
    with the same direction of transcription

19
Structure and stability of globular proteins.
20
Native proteins are marginally stable.
  • Scale of interactions in proteins:
  • - Interactions weaker than kT ≈ 0.6 kcal/mol
    are neglected.
  • - ΔG ~ 5-20 kcal/mol
  • Potential energy: van der Waals,
    electrostatic
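
The energy scale can be checked with a short calculation. The sketch assumes T = 300 K, uses the gas constant R = 1.987e-3 kcal/(mol·K) so that the thermal energy is expressed per mole, and treats ΔG as the free energy of unfolding in a two-state model; the function names are hypothetical.

```python
from math import exp

R = 1.987e-3  # gas constant in kcal/(mol*K)


def thermal_energy(temperature=300.0):
    """Thermal energy per mole (RT, the molar equivalent of kT) in kcal/mol."""
    return R * temperature


def unfolded_fraction(delta_g, temperature=300.0):
    """Fraction of unfolded molecules for a two-state protein.

    delta_g is the free energy of unfolding in kcal/mol (assumed positive
    for a stable protein); K = [U]/[F] = exp(-delta_g / RT).
    """
    k = exp(-delta_g / thermal_energy(temperature))
    return k / (1.0 + k)


print(thermal_energy())        # close to 0.6 kcal/mol
print(unfolded_fraction(5.0))  # small even at the low end of the range
```

Stabilities of only a few multiples of kT are what "marginally stable" refers to.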

[Figure: free energy G versus reaction coordinate, with the unfolded (U) and folded (F) states separated by ΔG]
21
Hydrophobic effect.
  • Hydrophobic interaction: the tendency of
    nonpolar compounds to transfer from an
    aqueous solution to an organic phase.
  • The entropy of water molecules decreases when
    they make contact with a nonpolar surface (TΔS =
    -9.6 kcal/mol for cyclohexane).
  • The effect is entropic because the energy of
    hydrogen bonds is very high.
  • The hydrophobic effect is proportional to the
    buried surface area; the energy is
    20-25 cal/mol/Å².

22
Hierarchy of protein structure.
  • Amino acid sequence
  • Secondary structure
  • Tertiary structure
  • Quaternary structure

Picture from Branden & Tooze, Introduction to
Protein Structure
23
Protein secondary structure prediction.
  • Assumptions:
  • There should be a correlation between amino acid
    sequence and secondary structure (SS). A short
    amino acid sequence is more likely to form one
    type of SS than another.
  • Local interactions determine SS. The SS of a
    residue is determined by its neighbors (usually a
    sequence window of 13-17 residues is used).
  • Exceptions: short identical amino acid sequences
    can sometimes be found in different SS.
  • Accuracy: 65-75%; the highest accuracy is for the
    prediction of α-helices.

24
Methods of SS prediction.
  • Chou-Fasman method
  • GOR (Garnier, Osguthorpe and Robson)
  • Neural network method

25
PHD neural network program with multiple
sequence alignments.
  • Blast search of the input sequence is performed,
    similar sequences are collected.
  • Multiple alignment of similar sequences is used
    as an input to a neural network.
  • The sequence pattern is clearer in a multiple
    alignment than when a single sequence is used as
    input.

26
Protein structure prediction.
27
Fold recognition.
  • Unsolved problem: direct prediction of protein
    structure from physico-chemical principles.
  • Solved problem: recognizing which of the known
    folds is similar to the fold of an unknown protein.
  • Fold recognition is based on the following
    observations/assumptions:
  • The overall number of different protein folds is
    limited (1000-3000 folds)
  • The native protein structure is in its ground
    state (minimum energy)

28
Protein structure prediction.
  • Prediction of a protein's three-dimensional
    structure from its sequence. Different approaches:
  • Homology modeling (predicted structure has a very
    close homolog in the structure database).
  • Fold recognition (predicted structure has an
    existing fold).
  • Ab initio prediction (predicted structure has a
    new fold).

29
Steps of homology modeling.
  • Template recognition and initial alignment.
  • Backbone generation.
  • Loop modeling.
  • Side-chain modeling.
  • Model optimization.

30
1. Template recognition.
  • Recognition of similarity between the target and
    the template.
  • Target: protein with unknown structure.
  • Template: protein with known structure.
  • Main difficulty: deciding which template to
    pick when there are multiple candidate template
    structures.
  • The template structure can be found by searching
    for structures in PDB using sequence-sequence
    alignment methods.

31
Fold recognition.
  • Goal: to find the protein with known structure
    which best matches a given sequence.
  • Since the similarity between the target and its
    closest template is not high, sequence-sequence
    alignment methods fail.
  • Solution: threading, a sequence-structure
    alignment method.

32
Threading method for structure prediction.
  • Sequence-structure alignment, target sequence is
    compared to all structural templates from the
    database.
  • Requires:
  • Alignment method (dynamic programming, Monte
    Carlo, ...)
  • Scoring function, which yields a relative score
    for each alternative alignment

33
Scoring function for threading.
  • Contact-based scoring function depends on the
    amino acid types of two residues and distance
    between them.
  • Sequence-sequence alignment scoring function
    does not depend on the distance between two
    residues.
  • If distance between two non-adjacent residues
    in the template is less than 8 Å, these residues
    make a contact.
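
The contact definition can be sketched as a distance check over residue coordinates. The sketch assumes one representative 3D point per residue (for example the Cα atom; the slide does not say which atom is used) and treats residues at least two positions apart as non-adjacent. The function name `contact_pairs` and the coordinates are hypothetical.

```python
from math import dist


def contact_pairs(coords, cutoff=8.0, min_separation=2):
    """Index pairs (i, j) of non-adjacent residues closer than cutoff (in Å).

    coords holds one (x, y, z) point per residue of the template.
    """
    contacts = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if dist(coords[i], coords[j]) < cutoff:
                contacts.append((i, j))
    return contacts


# Four residues on a straight line, 3.8 Å apart (toy coordinates).
template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
print(contact_pairs(template))
```

A contact-based threading score would then sum a residue-type-dependent term over these pairs for the target residues placed at each template position.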

34
Threading model validation.
  • Correct bond length and bond angles
  • Correct placement of functionally important sites
  • Prediction of global topology, not partial
    alignment (minimum number of gaps)

>> 3.8 Angstroms
35
Classwork II: Homology modeling.
  • Go to NCBI Entrez, search for gi461699
  • Do a BLAST search against PDB
  • Repeat the same for gi60494508
  • Predict functionally important sites

36
Protein engineering and protein design.
Protein engineering: altering a protein sequence
to change protein function or structure.
Protein design: designing a de novo protein which
satisfies a given requirement.
37
Stability of mutants compared to wild-type
protein.
  • Measure of stability: the melting temperature
    (Tm), at which 50% of the enzyme is inactivated
    during reversible heat denaturation. For the
    wild type, Tm = 42 °C.
  • All mutants were more stable than the wild type.
  • The longer the loop between the Cys residues, the
    larger the effect (the more restricted the
    unfolded state is).
  • The more disulfide bonds were introduced, the
    more stable the mutant was.

From B. Mathews et al
38
Can structural scaffolds be reduced in size while
maintaining function?
  • Braisted & J.A. Wells used the Z-domain (58
    residues) of bacterial protein A
  • removed the third helix (truncated protein: 38
    residues)
  • mutated residues in the first and second helices
  • used phage display to select active forms
  • restored the binding of the truncated protein.

39
Designing an amino acid sequence that will fold
into a given structure.
  • Inverse protein folding problem: designing a
    sequence which will fold into a given structure.
    Much easier than the folding problem!
  • B. Dahiyat & S. Mayo designed a sequence of a
    zinc finger domain that does not require
    stabilization by Zn.
  • The wild-type protein domain is stabilized by Zn
    (bound to two Cys and two His); the mutant is
    stabilized by hydrophobic interactions.

40
Molecular basis of evolution.
  • Goal: to reconstruct the evolutionary history of
    all organisms in the form of phylogenetic trees.
  • Classical approach: phylogenetic trees were
    constructed based on comparative morphology
    and physiology.
  • Molecular phylogenetics: phylogenetic trees are
    constructed by comparing DNA/protein sequences
    between organisms.

41
Mechanisms of evolution.
  • By mutations of genes. Mutations spread through
    the population via genetic drift and/or natural
    selection.
  • By gene duplication and recombination.

42
Measures of evolutionary distance between amino
acid sequences.
  • 1. P-distance. Evolutionary distance is usually
    measured by the number of amino acid
    substitutions.

p = nd / n, where nd is the number of amino acid
differences between two sequences and n is the
number of aligned amino acids.
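
A minimal sketch of the p-distance for two aligned sequences. Skipping columns that contain a gap character ("-") is an assumption about what counts as an aligned amino acid; the function name and the toy sequences are hypothetical.

```python
def p_distance(seq1, seq2):
    """Proportion of differing amino acids between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where both sequences have a residue (assumption).
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    n = len(pairs)                      # number of aligned amino acids
    nd = sum(a != b for a, b in pairs)  # number of differences
    return nd / n


print(p_distance("MKVLA", "MKILG"))  # 2 differences over 5 aligned positions
```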
43
Poisson correction for evolutionary distance.
  • 2. PC-distance. Takes into account multiple
    substitutions and therefore is proportional to
    divergence time.
  • PC-distance can be expressed through the
    p-distance
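
The formula itself is missing from the transcript. The usual Poisson correction, assumed here, is d = -ln(1 - p), where p is the proportion of differing amino acids. The function name is hypothetical.

```python
from math import log


def poisson_corrected_distance(p):
    """Poisson-corrected distance d = -ln(1 - p) for a p-distance 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -log(1.0 - p)


# The correction is negligible for small p and grows as p increases.
for p in (0.05, 0.2, 0.5):
    print(p, poisson_corrected_distance(p))
```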

44
The concept of evolutionary trees.
  • Trees consist of nodes and branches; the topology
    is the branching pattern.
  • The length of each branch represents the number
    of substitutions occurring between two nodes. If
    rate of evolution is constant, branches will have
    the same length (molecular clock hypothesis).
  • The distance along the tree is calculated by
    summing up all intervening branch lengths.
  • Trees can be binary or bifurcating.
  • Trees can be rooted and unrooted. The root is
    placed by including a taxon which is known to
    branch off earlier than others.

45
Accuracies of phylogenetic trees.
  • Two types of errors:
  • Topological error
  • Branch length error
  • Bootstrap test:
  • Resample the alignment columns with
    replacement, recalculate the tree, and count how
    many times this topology occurs: this is the
    bootstrap confidence value. If it is close to
    100%, the topology/interior branch is reliable.
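
The resampling step of the bootstrap test can be sketched as follows. Tree building is left to a caller-supplied function: `topology_of` is a hypothetical callable that maps an alignment to some comparable description of the tree topology, and all other names are illustrative as well.

```python
import random


def bootstrap_alignment(alignment, rng=random):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(alignment[0])
    columns = rng.choices(range(length), k=length)
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, topology_of, replicates=100, seed=0):
    """Percentage of replicates that reproduce the original topology."""
    rng = random.Random(seed)
    reference = topology_of(alignment)
    hits = sum(
        topology_of(bootstrap_alignment(alignment, rng)) == reference
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


# One replicate of a toy alignment: same size, columns drawn from the original.
print(bootstrap_alignment(["AAGT", "AAGC", "TTGC"], random.Random(1)))
```

In practice the support is usually reported per interior branch rather than for the whole topology, which needs a tree comparison that is beyond this sketch.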

46
Estimation of evolutionary rates in hemoglobin
alpha-chains.
Estimate the evolutionary rate of divergence
between human and cow (the time of divergence
between these groups is 90 million years).
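
One way to set up this estimate is r = d / (2T): the distance d between two present-day sequences accumulated along both lineages, each of length T, since their divergence. The formula is the standard one and is assumed here; the distance value in the example is a placeholder, not the human/cow value from the slide.

```python
def evolutionary_rate(distance, divergence_time_years):
    """Substitutions per site per year: r = d / (2T).

    The factor 2 accounts for the two lineages that have both been
    evolving since the divergence time T.
    """
    return distance / (2.0 * divergence_time_years)


# Placeholder distance of 0.18 substitutions per site, T = 90 million years.
print(evolutionary_rate(0.18, 90e6))
```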
47
1. Distance methods. Calculating branch lengths
from distances.
[Figure: tree with branch lengths a, b and c]
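
For three sequences A, B and C joined at a single interior node, the branch lengths a, b and c follow from the pairwise distances because dAB = a + b, dAC = a + c and dBC = b + c. The sketch below solves these three equations; that the figure shows exactly this three-taxon case is an assumption, and the distances are made up.

```python
def three_taxon_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) of the tree joining A, B and C at one node."""
    a = (d_ab + d_ac - d_bc) / 2.0
    b = (d_ab + d_bc - d_ac) / 2.0
    c = (d_ac + d_bc - d_ab) / 2.0
    return a, b, c


# Made-up distances; the result satisfies a + b = 0.3, a + c = 0.5, b + c = 0.6.
print(three_taxon_branch_lengths(0.3, 0.5, 0.6))
```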
48
Neighbor-joining method.
  • NJ is based on the minimum evolution principle
    (the sum of branch lengths should be minimized).
  • Given the distance matrix between all sequences,
    NJ joins sequences in a tree so as to give
    estimates of the branch lengths.
  • It starts with the star tree and calculates the
    sum of its branch lengths.

[Figure: star tree with taxa A-E and branch lengths a-e]
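
The starting point of NJ can be illustrated by the branch-length sum of the star tree. In a star tree with N taxa every branch appears in N - 1 of the pairwise distances, so the sum of branch lengths is the sum of all pairwise distances divided by N - 1. This sketch covers only that first step, not the full joining procedure; the function name and the distances are hypothetical.

```python
def star_tree_length(dist_matrix):
    """Sum of branch lengths of the star tree: sum over i < j of d_ij / (N - 1)."""
    n = len(dist_matrix)
    total = sum(dist_matrix[i][j] for i in range(n) for j in range(i + 1, n))
    return total / (n - 1)


# Made-up symmetric distance matrix for three taxa.
d = [
    [0.0, 0.3, 0.5],
    [0.3, 0.0, 0.6],
    [0.5, 0.6, 0.0],
]
print(star_tree_length(d))
```

NJ then repeatedly joins the pair of taxa whose grouping reduces this sum the most.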
49
2.1 Maximum parsimony: definition of informative
sites.
  • Maximum parsimony tree: the tree that requires the
    smallest number of evolutionary changes to
    explain the differences between external nodes.
  • Informative site: a site which favors some trees
    over the others.
  • Site   1 2 3 4 5 6 7
  • Seq 1  A A G A C T G
  • Seq 2  A G C C C T G
  • Seq 3  A G A T T T C
  • Seq 4  A G A G T T C
  • A site is informative (for nucleotide sequences)
    if there are at least two different kinds of
    letters at the site, each of which is represented
    in at least two of the sequences.
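
The definition can be checked mechanically on the four sequences of this slide. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter


def informative_sites(alignment):
    """1-based positions of parsimony-informative sites.

    A site qualifies if at least two different letters each occur
    in at least two of the sequences.
    """
    sites = []
    for position, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        letters_seen_twice = sum(1 for count in counts.values() if count >= 2)
        if letters_seen_twice >= 2:
            sites.append(position)
    return sites


alignment = ["AAGACTG", "AGCCCTG", "AGATTTC", "AGAGTTC"]
print(informative_sites(alignment))  # sites 5 and 7
```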