Computing Patterns in Biology - PowerPoint PPT Presentation

1 / 56
About This Presentation
Title:

Computing Patterns in Biology

Description:

Hydrophobicity is a property of groups of amino acids - best examined as a graph ... pepinfo: plots protein secondary structure and hydrophobicity in parallel panels ... – PowerPoint PPT presentation

Number of Views:49
Avg rating:3.0/5.0
Slides: 57
Provided by: stuart67
Category:

less

Transcript and Presenter's Notes

Title: Computing Patterns in Biology


1
Computing Patterns in Biology
  • Stuart M. Brown
  • New York University School of Medicine

2
Why Compute Biological Patterns?
  • Because we can
  • (computer scientists love to find interesting
    problems)
  • patterns are beautiful
  • Its practical - helps with genecloning
    experiments, predict functions of new proteins
  • Systems biology - figure out circuits of
    regulation, predict outcome of changes, design
    new biological systems

3
Overview
  • Restriction sites
  • Finding genes in DNA sequences
  • Regulatory sites in DNA
  • Protein signals (transport and processing)
  • Protein functional Motifs
  • Protein families
  • Protein 3-D structure

4
Restriction Sites
  • Bacteria make restriction enzymes that cut DNA
    at specific sequences (4-8 base patterns)
  • Very simple to find these patterns - can even use
    the Find function of your web browser or word
    processor
  • Open any page of text and look for CAT
  • you now have a restriction site search program!

5
NEBcutter2
  • http//tools.neb.com/NEBcutter2/

6
Finding Genes in Genomic DNA
  • Translate (in all 6 reading frames) and look for
    similarity to known protein sequences
  • Look for long Open Reading Frames (ORFs) between
    start and stop codons (startATG, stopTAA,
    TAG, TGA)
  • Look for known gene markers
  • TAATAA box, intron splice sites, etc.
  • Statistical methods (codon preference)

7
GCCACATGTAGATAATTGAAACTGGATCCTCATCCCTCGCCTTGTACAA
AAATCAACTCCAGATGGATCTAAGATTTAAATCTAACACCTGAAACCATA
AAAATTCTAGGAGATAACACTGGCAAAGCTATTCTAGACATTGGCTTAGG
CAAAGAGTTCGTGACCAAGAACCCAAAAGCAAATGCAACAAAAACAAAAA
TAAATAGGTGGGACCTGATTAAACTGAAAAGCCTCTGCACAGCAAAAGAA
ATAATCAGCAGAGTAAACAGACAACCCACAGAATGAGAGAAAATATTTGC
AAACCATGCATCTGATGACAAAGGACTAATATCCAGAATCTACAAGGAAC
TCAAACAAATCAGCAAGAAAAAAATAACCCCATCAAAAAGTGGGCAAAGG
AATGAATAGACAATTCTCAAAATATACAAATGGCCAATAAACATACGAAA
AACTGTTCAACATCACTAATTATCAGGGAAATGCAAATTAAAACCACAAT
GAGATGCCACCTTACTCCTGCAAGAATGGCCATAATAAAAAAAAATCAAA
AAAGAATAAATGTTGGTGTGAATGTGGTGAAAAGAGAACACTTTGACACT
GCTGGTGGGAATGGAAACTAGTACAACCACTGTGGAAAACAGTACCGAGA
TTTCTTAAAGAACTACAAGTAGAACTACCATTTGATCCAGCAATCCCACT
ACTGGGTATCTACCCAGAGGAAAAGAAGTCATTATTTGAAAAAGACACTT
GTACATACATGTTTATAGCAGCACAATTTGCAATTGCAAAGATATGGAAC
CAGTCTAAATGCCCATCAACCAACAAATGGATAAAGAAAATATGGTATAT
ATACACCATGGAACACTACTCAGCCATAAAAAGGAACAAAATAATGGCAA
CTCACAGATGGAGTTGGAGACCACTATTCTAAGTGAAATAACTCAGGAAT
GGAAAACCAAATATTGTATGTTCTCACTTATAAGTGGGAGCTAAGCTATG
AGGACAAAAGGCATAAGAATTATACTATGGACTTTGGGGACTCGGGGGAA
AGGGTGGGAGGGGGATGAGGGACAAAAGACTACACATTGGGTGCAGTGTA
CACTGCTGAGGTGATGGGTGCACCAAAATCTCAGAAATTACCACTAAAGA
ACTTATCCATGTAACTAAAAACCACCTCTACCCAAATAATTTTGAAATAA
AAAATAAAAATATTTTAAAAAGAACTCTTTAAAATAAATAATGAAAAGCA
CCAACAGACTTATGAACAGGCAATAGAAAAAATGAGAAATAGAAAGGAAT
ACAAATAAAAGTACAGAAAAAAAATATGGCAAGTTATTCAACCAAACTGG
TAATTTGAAATCCAGATTGAAATAATGCAAAAAAAAGGCAATTTCTGGCA
CCATGGCAGACCAGGTACCTGGATGATCTGTTGCTGAAAACAACTGAAAA
TGCTGGTTAAAATATATTAACACATTCTTGAATACAGTCATGGCCAAAGG
AAGTCACATGACTAAGCCCACAGTCAAGGAGTGAGAAAGTATTCTCTACC
TACCATGAGGCCAGGGCAAGGGTGTGCACTTTTTTTTTTCTTCTGTTCAT
TGAATACAGTCACTGTGTATTTTACATACTTTCATTTAGTCTTATGACAA
TCCTATGAAACAAGTACTTTTAAAAAAATTGAGATAACAGTTGCATACCG
TGAAATTCATCCATTTAAAGTGAGCAATTCACAGGTGCAGCTAGCTCAGT
CAGCAGAGCATAAGACTCTTAAAGTGAACAATTCAGTGCTTTTTAGTATA
TTCACAGAGTTGTGCAACCATCACCACTATCTAATTGGTCTTAGTCTGTT
TGGGCTGCCATAACAAAATACCACAAACTGGATAGCTCATAAACAACAGG
CATTTATTGCTCACAGTTCTAGAGGCTGGAAGTGCAAGATTAAGATGCCA
GCAGATTCTGTGTCTGCTGAGGGCCTGTTCCTCATAGAAGGTGCCCTCTT
GCTGAATTCTCACATGGTGGAAGGGGGAAAACAAGCTTGCATTGCAAAGA
GGTGGGCCTCTTTAATCCCAAAGGCCCCACCTCTAAAAGGCCCCACTTCT
GAATACCATTACATTGAGAATTAAGTTTCAACATAGGAATTTGGGGGAAC
ACAAATATCCAGACTGTAGCATAATTCCAGAACGGATTCAT
8
Intron/Exon structure
  • Gene finding programs work well in bacteria
  • None of the gene prediction programs do an
    adequate job predicting intron/exon boundaries
  • The only reasonable gene models are based on
    alignment of cDNAs to genome sequence
  • Perhaps 50 of all human genes still do not have
    a correct coding sequence defined
  • (transcription start, intron splice sites)

9
Gene Finding on the Web
  • GRAIL Oak Ridge Natl. Lab, Oak Ridge, TN
  • http//compbio.ornl.gov/grailexp
  • ORFfinder NCBI
  • http//www.ncbi.nlm.nih.gov/gorf/gorf.html
  • DNA translation Univ. of Minnesota Med. School
  • http//alces.med.umn.edu/webtrans.html
  • GenLang
  • http//cbil.humgen.upenn.edu/sdong/genlang.html
  • BCM GeneFinder Baylor College of Medicine,
    Houston, TX
  • http//dot.imgen.bcm.tmc.edu9331/seq-search/gene-
    search.html
  • http//dot.imgen.bcm.tmc.edu9331/gene-finder/gf.h
    tml

10
Genomic Sequence
  • Once each gene is located on the chromosome, it
    becomes possible to get upstream genomic sequence
  • This is where transcription factor (TF) binding
    sites are located
  • promoters and enhancers
  • Search for known TF sites, and discover new ones
    (among co-regulated genes)

11
Phage CRO repressor bound to DNA Andrew Coulson
Roger Sayles with RasMol, Univ. of Edinburgh
1993
12
Many DNA Regulatory Sequences are Known
  • Databases of promoters, enhancers, etc.
  • TransFac the Transcription Factor database
  • 4342 entries w/ known protein binding and
    transcriptional regulatory functions
  • Maintained by Gesellschaft for Biotechnologische
    Forschung mbH (Braunschweig, Germany)
  • The Eukaryotic Promoter Database (EPD)
  • Bucher Trifonov. (1986) NAR 14 10009-26
  • 1314 entries taken directly from scientific
    literature
  • Maintained by ISREC (Lausanne, Switzerland) as a
    subset of the EMBL

13
TF Binding sites lack information
  • Most TF binding sites are determined by just a
    few base pairs (typically 6)
  • Sequence is variable (consensus)
  • This is not enough information for proteins to
    locate unique promoters for each gene
  • TF's bind cooperatively and combinatorially
  • the key is in the location in relation to each
    other and to the transcription units of genes
  • Can use information from alignment of related
    genes

14
Sequence Logos
15
Tools to find TF sites in DNA
  • GCG FINDPATTERNS
  • with database file TFSITES.DAT
  • Macintosh (Signal Scan), PC/UNIX (Promoter Scan)
  • Dr. Dan S. Prestridge, Univ. of Minnesota

16
Websites for Promoter finding
  • Promoter Scan NIH Bioinformatics (BIMAS)
  • http//bimas.dcrt.nih.gov/molbio/proscan/
  • Promoter Scan II Univ. of Minnesota Axyx
    Pharmaceuticals
  • http//biosci.cbs.umn.edu/software/proscan/promote
    rscan.htm
  • Signal Scan NIH Bioinformatics (BIMAS)
  • http//bimas.dcrt.nih.gov80/molbio/signal/index.h
    tml
  • Transcription Element Search (TESS) Center for
    Bioinformatics, Univ. of Pennsylvania
  • http//www.cbil.upenn.edu/tess/
  • Search TransFac at GBF with MatInspector,
    PatSearch, and FunSiteP
  • http//transfac.gbf-braunschweig.de/TRANSFAC/progr
    ams.html
  • TargetFinder Telethon Inst.of Genetics and
    Medicine, Milan, Italy
  • http//hercules.tigem.it/TargetFinder.html

17
Protein Sequence
18
Protein Sequence Analysis
  • Molecular properties (pH, mol. wt. isoelectric
    point, hydrophobicity)
  • Motifs (signal peptide, coiled-coil,
    trans-membrane, etc.)
  • Protein Families
  • Secondary Structure (helix vs. beta-sheet)
  • 3-D prediction, Threading

19
Chemical Properties of Proteins
  • Proteins are linear polymers of 20 amino acids
  • Chemical properties of the protein are determined
    by its amino acids
  • Molecular wt., pH, isoelectric point are simple
    calculations from amino acid composition
  • Hydrophobicity is a property of groups of amino
    acids - best examined as a graph

20
Hydrophobicity Plot
P53_HUMAN (P04637) human cellular tumor antigen
p53 Kyte-Doolittle hydrophilicty, window19
21
Web Sites for Simple Protein Analysis
  • Protein Hydrophobicity Server Bioinformatics
    Unit, Weizmann Institute of Science , Israel
  • http//bioinformatics.weizmann.ac.il/hydroph/
  • SAPS - statistical analysis of protein sequences
    composition, charge, hydrophobic and
    transmembrane segments, cysteine spacings,
    repeats and periodicity
  • http//www.isrec.isb-sib.ch/software/SAPS_form.htm
    l

22
EMBOSS Protein Analysis Toolkit
  • plotorf simple open reading frame finder
  • Garnier predicts 2ndary structure
  • Charge plot of protein charge
  • Octanol hydrophobicity plot
  • Pepwindow hydorpathy plot
  • pepinfo plots protein secondary structure and
    hydrophobicity in parallel panels
  • tmap predict transmembrane regions
  • Topo draws a map of transmembrane protein
  • Pepwheel shows protein sequence as helical wheel
  • Pepcoil predicts coiled-coil domains
  • Helixturnhelix predicts helix-turn-helix domains

23
Simple Motifs
  • Common structural motifs
  • Membrane spanning
  • Signal peptide
  • Coiled coil
  • Helix-turn-helix

24
Super-secondary Structure
  • Common structural motifs
  • Membrane spanning (GCG TransMem)
  • Signal peptide (GCG SPScan)
  • Coiled coil (GCG CoilScan)
  • Helix-turn-helix (GCG HTHScan)

25
Web servers that predict these structures
  • Predict Protein server EMBL Heidelberg
  • http//www.embl-heidelberg.de/predictprotein/
  • SOSUI Tokyo Univ. of Ag. Tech., Japan
  • http//www.tuat.ac.jp/mitaku/adv_sosui/submit.htm
    l
  • TMpred (transmembrane prediction) ISREC (Swiss
    Institute for Experimental Cancer Research)
  • http//www.isrec.isb-sib.ch/software/TMPRED_form.h
    tml
  • COILS (coiled coil prediction) ISREC
  • http//www.isrec.isb-sib.ch/software/COILS_form.ht
    ml
  • SignalP (signal peptides) Tech. Univ. of Denmark
  • http//www.cbs.dtu.dk/services/SignalP/

26
Protein Domains/Motifs
  • Proteins are built out of functional units know
    as domains (or motifs)
  • These domains have conserved sequences
  • Often much more similar than their respective
    proteins
  • Exon splicing theory (W. Gilbert)
  • Exons correspond to folding domains which in
    turn serve as functional units
  • Unrelated proteins may share a single similar
    exon (i.e.. ATPase or DNA binding function)

27
Motifs are built from Multiple Alignmennts
28
Protein Motif Databases
  • Known protein motifs have been collected in
    databases
  • Best database is PROSITE
  • The Dictionary of Protein Sites and Patterns
  • maintained by Amos Bairoch, at the Univ. of
    Geneva, Switzerland
  • contains a comprehensive list of documented
    protein domains constructed by expert molecular
    biologists
  • Alignments and patterns built by hand!

29
PROSITE is based on Patterns
  • Each domain is defined by a simple pattern
  • Patterns can have alternate amino acids in each
    position and defined spaces, but no gaps
  • Pattern searching is by exact matching, so any
    new variant will not be found (can allow
    mismatches, but this weakens the algorithm)

30
(No Transcript)
31
Tools for PROSITE searches
  • Free Mac program MacPattern
  • ftp//ftp.ebi.ac.uk/pub/software/mac/macpattern.hq
    x
  • Free PC program (DOS) PATMAT
  • ftp//ncbi.nlm.nih.gov/repository/blocks/patmat.do
    s
  • GCG provides the program MOTIFS
  • Also in virtually all commercial programs
    MacVector, OMIGA, LaserGene, etc.

32
Websites for PROSITE Searches
  • ScanProsite at ExPASy Univ. of Geneva
  • http//expasy.hcuge.ch/sprot/scnpsit1.html
  • Network Protein Sequence Analysis Institut de
    Biologie et Chimie des Protéines, Lyon, France
  • http//pbil.ibcp.fr/NPSA/npsa_prosite.html
  • PPSRCH EBI, Cambridge, UK
  • http//www2.ebi.ac.uk/ppsearch/

33
Profiles
  • Profiles are tables of amino acid frequencies at
    each position in a motif
  • They are built from multiple alignments
  • PROSITE entries also contain profiles built from
    an alignment of proteins that match the pattern
  • Profile searching is more sensitive than pattern
    searching - uses an alignment algorithm, allows
    gaps

34
(No Transcript)
35
EMBOSS ProfileSearch
  • EMBOSS has a set of profile analysis tools.
  • Start with a multiple alignment
  • fuzzpro protein pattern search
  • preg regular expression search of a protein
    sequence
  • prophecy create a profile
  • profit scans a database with your profile
  • prophet makes pairwise alignments between a
    single sequence and a profile
  • patmatmotifs scan a query protein with the
    PROSITE motif database

36
Websites for Profile searching
  • PROSITE ProfileScan ExPASy, Geneva
  • http//www.isrec.isb-sib.ch/software/PFSCAN_form.h
    tml
  • BLOCKS (builds profiles from PROSITE entries and
    adds all matching sequences in SwissProt) Fred
    Hutchinson Cancer Research Center, Seattle,
    Washington, USA
  • http//www.blocks.fhcrc.org/blocks_search.html
  • PRINTS (profiles built from automatic alignments
    of OWL non-redundant protein databases)
    http//www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTSc
    an/fps/PathForm.cgi

37
More Protein Motif Databases
  • PFAM (1344 protein family HMM profiles built by
    hand) Washington Univ., St. Louis
  • http//pfam.wustl.edu/hmmsearch.shtml
  • ProDom (profiles built from PSI-BLAST automatic
    multiple alignments of the SwissProt database)
    INRA, Toulouse, France
  • http//www.toulouse.inra.fr/prodom/doc/blast_form.
    html
  • This is my favorite protein database - nicely
    colored results

38
Sample ProDom Output
39
Hidden Markov Models
  • Hidden Markov Models (HMMs) are a more
    sophisticated form of profile analysis.
  • Rather than build a table of amino acid
    frequencies at each position, they model the
    transition from one amino acid to the next.
  • Pfam is built with HMMs.
  • EMBOSS HMM tools (HMMER)
  • ehmmBuild ehmmCalibrate
  • ehmmSearch ehmmPfam
  • ehmmAlign ehmmEmit
  • ehmmFetch ehmmIndex

40
Discovery of new Motifs
  • All of the tools discussed so far rely on a
    database of existing domains/motifs
  • How to discover new motifs
  • Start with a set of related proteins
  • Make a multiple alignment
  • Build a pattern or profile
  • You will need access to a fairly powerful UNIX
    computer to search databases with custom built
    profiles or HMMs.

41
Patterns in Unaligned Sequences
  • Sometimes sequences may share just a small common
    region
  • transcription factors
  • MEME San Diego Supercomputing Facility
  • http//www.sdsc.edu/MEME/meme/website/meme.html
  • EMBOSS also includes the MEME program

42
Protein 3-D Structure
43
Self-assembly
  • Proteins self-assemble in solution
  • All of the information necessary to determine the
    complex 3-D structure is in the amino acid
    sequences
  • Structure determines function
  • - lock key model of enzyme function
  • Know the sequence, know the function?
  • Nearly infinite complexity

44
Structure prediction
  • Protein Structure prediction is the Holy Grail
    of bioinformatics
  • Since structure function, then structure
    prediction should allow protein design, design of
    inhibitors, etc.
  • Huge amounts of genome data - what are the
    functions of all of these proteins?

45
3-D Structure
  • Cannot be accurately predicted from sequence
    alone (known as ab initio)
  • Levinthals paradox a 100 aa protein has 3200
    possible backbone configurations - many orders of
    magnitude beyond the capacity of the fastest
    computers
  • There are perhaps only a few hundred basic
    structures, but we dont yet have this vocabulary
    or the ability to recognize variants on a theme

46
Secondary Structure
  • Protein secondary structure takes one of three
    forms
  • Alpha helix
  • Beta pleated sheet
  • Turn
  • 2ndary structure is predicted within a small
    window
  • Many different algorithms, not highly accurate
  • Better predictions from a multiple alignment

47
Structure Prediction on the Web
  • Secondary Structural Content Prediction (SSCP)
    EMBL, Heidelberg
  • http//www.bork.embl-heidelberg.de/SSCP/sscp_seq.h
    tml
  • BCM Search Launcher Protein Secondary Structure
    Prediction Baylor College of Medicine
  • http//dot.imgen.bcm.tmc.edu9331/seq-search/struc
    -predict.html
  • PREDATOR EMBL, Heidelberg
  • http//www.embl-heidelberg.de/cgi/predator_serv.pl

48
Sample 2-D Structure Prediction
49
Threading Protein Structures
  • Best bet is to compare with similar sequences
    that have known structures gtgt Threading
  • Only works for proteins with gt25 sequence
    similarity to a protein with known structure
  • Some websites offer quick approximations
  • Will improve as more 3-D structures are described

50
Websites for 3-D structure prediction
  • UCLA-DOE Protein Fold Recognition
  • http//www.doe-mbi.ucla.edu/people/fischer/TEST/ge
    tsequence.html
  • SwissModel ExPASy, Univ. of Geneva
  • http//www.expasy.ch/swissmod/SWISS-MODEL.html
  • CPHmodels Technical Univ. of Denmark
  • http//www.cbs.dtu.dk/services/CPHmodels/

51
View Known Protein Structures
  • GenBank includes a database of protein 3-D
    structures and a free viewer Cn3D
  • GenBank database is derived from PDB (Protein
    Data Base)
  • primary repository of protein structure data
  • determined by X-ray crystallography and/or NMR
  • has its own data format and many free viewers
  • some are very sophisticated - can calculate
    intermolecular distances

52
Cn3D
  • Cn3D is a helper application that allows you to
    view three dimensional structures from NCBI's
    Entrez database.
  • Cn3D runs on Windows, MacOS, and Unix/Linux.
  • Cn3D simultaneously displaysstructure, sequence,
    and alignment, it also allows the user to set
    display styles for features of interest.
  • http//www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.s
    html

By being tightly coupled to the genomic and
literature databases, Cn3D is the ideal program
for viewing structures found in the NCBI
databases.
53
(No Transcript)
54
RasMol
  • RasMol is the simplest PDB viewer.
  • http//www.umass.edu/microbio/rasmol/
  • It can work together with a web browser to let
    you view the structure of any sequence found with
    Entrez that has a known 3-D structure.

55
Swiss PDB Viewer
56
Summary
  • Restriction sites are trivial to compute, but
    very useful
  • Genomic DNA has genes and other information gt
    transcription factors
  • Proteins have predictable 2ndary structures and
    functional domains, but generally cant predict
    new 3-D structures
  • Can visualize and compare known structures
Write a Comment
User Comments (0)
About PowerShow.com