1
Patterns in Biological Sequences
  • Stuart M. Brown
  • New York University School of Medicine

2
Overview
  • DNA Structure and Function
  • Regulatory Sites in DNA
  • Finding Genes in DNA Sequences
  • RNA Structure
  • Protein Structure and Function
  • Protein Motifs

3
Sequence Analysis on the Web
  • Can analyze sequence using a Unix mainframe, or
    with free tools on the Web
  • Web tools are often best
  • Available to everyone
  • Constantly upgraded
  • But not always available, and subject to random change
  • Local Unix tools are better if you need to do big
    jobs (lots of sequences, scripts, pipelines)

4
DNA Structure
  • Primary: the sequence itself
  • Secondary: double helix
  • Tertiary: supercoiled, bent, etc.
  • Quaternary: complexes with proteins
  • Histones
  • RNA Polymerase
  • DNA binding proteins (transcription factors)
  • Chromosome structure
  • centromeres and telomeres

5
Phage CRO repressor bound to DNA (image: Andrew Coulson and
Roger Sayle with RasMol, Univ. of Edinburgh, 1993)

6
DNA Information Content
  • Just a 4 letter alphabet (GATC)
  • Encodes proteins with 3 letter codons
  • Punctuation determines transcription starts and
    stops
  • Transcriptional regulation (promoters, enhancers,
    etc.)
  • Determines its own replication

7
Simple DNA Patterns
  • Restriction enzyme cut sites
  • 4, 6, or 8 bases long, inverted repeat
  • Repeats (direct or inverted)
  • Promoters (universal site recognized by RNA
    polymerase to start transcription)
  • Transcription Factors (unique to one gene or a
    group of co-regulated genes)
  • often just 8-12 bases long
  • generally located upstream from transcribed part
    of gene
  • enhancers can be located anywhere within 10,000
    bp of gene
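
The "inverted repeat" property of restriction sites means the site reads the same 5'->3' on both strands. A minimal Python sketch of that check, plus a plain substring search for cut sites, is given here. The function names and the test sequence are made up for illustration; only the EcoRI (GAATTC) and BamHI (GGATCC) recognition sites are real.

```python
# Hypothetical helper names; only the enzyme recognition sites are real.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_inverted_repeat(site):
    """True if the site reads the same 5'->3' on both strands."""
    return site.upper() == reverse_complement(site)


def find_sites(seq, site):
    """Return 0-based start positions of every (possibly overlapping) match."""
    seq, site = seq.upper(), site.upper()
    hits = []
    start = seq.find(site)
    while start != -1:
        hits.append(start)
        start = seq.find(site, start + 1)
    return hits


# Made-up sequence containing two EcoRI sites and one BamHI site.
dna = "ttGAATTCagctGGATCCaaGAATTCg"
for name, site in (("EcoRI", "GAATTC"), ("BamHI", "GGATCC")):
    print(name, site, is_inverted_repeat(site), find_sites(dna, site))
```

Exact string matching like this is enough for restriction sites; transcription factor sites are more variable and usually need patterns with ambiguity or scoring matrices.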

8
DNA Regulatory Sequences
  • Databases of promoters, enhancers, etc.
  • (DNA patterns)
  • TransFac: the Transcription Factor database
  • 4504 entries from 1078 eukaryotic genes
  • maintained by GBF (Germany)
  • http://transfac.gbf.de/TRANSFAC/
  • The Eukaryotic Promoter Database (EPD)
  • Bucher & Trifonov (1986) NAR 14: 10009-26
  • 1314 entries taken directly from scientific
    literature
  • maintained by ISREC (Switzerland)
  • http://www.epd.isb-sib.ch/

9
Tools to find patterns in DNA
  • Signal Scan, Promoter Scan - Mac, Windows, Unix
  • (Dr. Dan S. Prestridge, Univ. of Minnesota)
  • EMBOSS tools (Unix)
  • tfscan: scans DNA sequences for transcription
    factors
  • fuzznuc: nucleic acid pattern search
  • fuzzpro: protein pattern search
  • fuzztran: translate DNA -> protein, search for
    protein patterns
  • restrict: finds restriction enzyme cleavage sites
  • repeats (G. Benson) - tandem repeats
  • palindrome - inverted repeats
  • REPuter (whole genome repeat search) - Unix

10
TF Binding sites lack information
DE  IFI-6-16 (interferon-induced gene 6-16); G000176.
SQ  gGGAAAaTGAAACT
SF  -127
ST  -89
BF  T00428; ISGF-3; Quality: 6; Species: human, Homo sapiens.
  • Most TF binding sites are determined by just a
    few base pairs (typically 8-12)
  • This is not enough information for proteins to
    locate unique promoters for each gene in a 3
    billion base genome
  • TFs bind cooperatively and combinatorially
  • The key is in the location in relation to each
    other and to the transcription units of genes
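
The "not enough information" argument can be made concrete with a rough calculation. The sketch below estimates how many times a fully specified site of a given length would occur by chance in a 3 billion base genome. It assumes equal base frequencies and independent positions, which real genomes do not satisfy, so treat the numbers as order-of-magnitude only. The function names are hypothetical.

```python
import math

GENOME_SIZE = 3_000_000_000  # bases, the figure quoted on the slide


def expected_random_hits(site_length, genome_size=GENOME_SIZE, both_strands=True):
    """Expected chance matches to one exact site of the given length.

    Assumes each base is A, C, G or T with probability 1/4, independently.
    """
    p_match = 0.25 ** site_length
    positions = genome_size - site_length + 1
    strands = 2 if both_strands else 1
    return strands * positions * p_match


def min_unique_length(genome_size=GENOME_SIZE):
    """Shortest exact site carrying enough bits (2 per base) to address one position."""
    return math.ceil(math.log2(genome_size) / 2)


for length in (8, 10, 12):
    print(length, "bp site:", round(expected_random_hits(length)), "chance matches")
print("bases needed for a unique address:", min_unique_length())
```

Even a 12 base site is expected to turn up hundreds of times by chance under these assumptions, which is why the relative positions of several sites matter more than any single match.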

11
Websites for Promoter finding
  • Promoter Scan: NIH Bioinformatics (BIMAS)
  • http://bimas.dcrt.nih.gov/molbio/proscan/
  • Promoter Scan II: Univ. of Minnesota and Axyx
    Pharmaceuticals
  • http://biosci.cbs.umn.edu/software/proscan/promoterscan.htm
  • Signal Scan: NIH Bioinformatics (BIMAS)
  • http://bimas.dcrt.nih.gov:80/molbio/signal/index.html
  • Transcription Element Search (TESS): Center for
    Bioinformatics, Univ. of Pennsylvania
  • http://www.cbil.upenn.edu/tess/
  • Search TransFac at GBF with MatInspector,
    PatSearch, and FunSiteP
  • http://transfac.gbf-braunschweig.de/TRANSFAC/programs.html
  • TargetFinder: Telethon Inst. of Genetics and
    Medicine, Milan, Italy
  • http://hercules.tigem.it/TargetFinder.html

12
Finding Genes in Genomic DNA
  • Translate (in all 6 reading frames) and look for
    similarity to known protein sequences
  • Translate and look for long Open Reading Frames
    (ORFs) between start and stop codons
  • Look for known gene markers
  • TATA box, intron splice sites, etc.
  • Statistical methods (codon preference)
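
The ORF approach in the second bullet can be sketched in a few lines of Python. This is a simplified illustration, not how ORFfinder or any other named tool works: it scans all six reading frames for an ATG followed in frame by a stop codon, and ignores introns entirely. Function names and the `min_codons` threshold are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for ATG...stop ORFs in one frame of seq.

    end is exclusive and includes the stop codon; min_codons counts
    codons from the ATG up to, but not including, the stop.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, start, end, sequence) for ORFs in all six frames.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame, min_codons):
                results.append((strand, frame, start, end, seq[start:end]))
    return results


for orf in find_orfs("ccATGAAATTTGGGTAAcc"):
    print(orf)
```

On real genomic DNA a much larger `min_codons` (for example 100) would be needed to keep chance ORFs out of the results.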

13
Gene Finding on the Web
  • ORFfinder: NCBI
  • http://www.ncbi.nlm.nih.gov/gorf/gorf.html
  • GRAIL: Oak Ridge Natl. Lab, Oak Ridge, TN
  • http://compbio.ornl.gov/grailexp
  • DNA translation: Univ. of Minnesota Med. School
  • http://alces.med.umn.edu/webtrans.html
  • GenLang
  • http://cbil.humgen.upenn.edu/sdong/genlang.html
  • BCM GeneFinder: Baylor College of Medicine,
    Houston, TX
  • http://dot.imgen.bcm.tmc.edu:9331/seq-search/gene-search.html
  • http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html

14
Genomic Sequence
  • Once each gene is located on the chromosome, it
    becomes possible to get upstream genomic sequence
  • This is where the transcription factor binding
    sites are located
  • Search for known TF sites, and discover new ones
    (among co-regulated genes)

15
Intron/Exon structure
  • Gene finding programs work well in bacteria
  • None of these gene prediction programs do an
    adequate job predicting intron/exon boundaries
  • The only reasonable gene models are based on
    alignment of cDNAs to genome sequence
  • Perhaps 50% of all human genes still do not have
    a correct coding sequence defined

16
RNA Structure
  • Similar to DNA - base pairing
  • Smaller molecules, free to take on more complex
    shapes
  • tRNA, ribozymes, self-splicing introns

17
tRNA Structures
18
RNA Information Content
  • Primary structure (sequence) contains
  • Information for 3-D self-assembly
  • Genetic code for amino acids in protein
  • Translation start and stop signals
  • Intron splicing signals
  • Controls for RNA stability and transcription level

19
RNA Secondary Structure
  • Rules for base pairing and free energy
    minimization are known
  • Characteristic tRNA stem-loop structures
  • Michael Zuker created the computer program
    FoldRNA
  • UNIX/Mac/PC freeware, in commercial products, and
    on the Web
  • Can predict many RNA secondary structures, not
    necessarily the optimal or true structure
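
To give a feel for how base-pairing rules turn into a computation, here is a minimal sketch of the Nussinov base-pair maximization algorithm. This is deliberately not Zuker's free energy minimization: it only counts the largest number of nested pairs (A-U, G-C, G-U) and ignores stacking energies and loop penalties. The function name and the `min_loop` parameter are assumptions made for the example.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (Nussinov dynamic programming).

    min_loop is the minimum number of unpaired bases enclosed by any pair.
    """
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = most pairs that can form within rna[i..j] inclusive
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))  # a short made-up hairpin
```

Energy-based programs follow the same recursive idea but score each stack and loop with measured thermodynamic parameters, which is why they can report several near-optimal structures.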

20
Protein Sequence Analysis
  • Molecular properties (pH, mol. wt., isoelectric
    point, hydrophobicity)
  • Secondary Structure
  • Super-secondary (signal peptide, coiled-coil,
    trans-membrane, etc.)
  • 3-D prediction, Threading
  • Domains, motifs, etc.

21
Self-assembly
  • Proteins self-assemble in solution
  • All of the information necessary to determine the
    complex 3-D structure is in the amino acid
    sequences
  • Structure determines function
  • lock and key model of enzyme function
  • Know the sequence, know the function?
  • Nearly infinite complexity

22
Structure prediction
  • Protein Structure prediction is the Holy Grail
    of bioinformatics
  • Since structure determines function, structure
    prediction should allow protein design, design of
    inhibitors, etc.
  • Huge amounts of genome data - what are the
    functions of all of these proteins?

23
Chemical Properties of Proteins
  • Proteins are linear polymers of 20 amino acids
  • Chemical properties of the protein are determined
    by its amino acids
  • Molecular wt., pH, isoelectric point are simple
    calculations from amino acid composition
  • Hydrophobicity is a property of groups of amino
    acids - best examined as a graph
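
A hydrophobicity graph is usually a sliding-window average over the sequence. The sketch below uses the Kyte-Doolittle hydropathy scale with a default window of 19 residues, matching the plot on the next slide. The function name is hypothetical, and plotting is left out so the block has no dependencies.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Return the mean hydropathy of every full window along the protein.

    Element i of the result covers residues i .. i + window - 1.
    """
    protein = protein.upper()
    if window < 1 or window > len(protein):
        return []
    values = [KYTE_DOOLITTLE[aa] for aa in protein]
    running = sum(values[:window])
    profile = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]  # slide the window by one
        profile.append(running / window)
    return profile


# Made-up sequence: a hydrophobic stretch flanked by charged residues.
example = "KKDDEE" + "LIVLLAVIGLLAVLIVALL" + "RRKDEE"
profile = hydropathy_profile(example)
peak = max(range(len(profile)), key=profile.__getitem__)
print("most hydrophobic window starts at residue", peak + 1)
```

A window of about 19 residues is a common choice when looking for membrane-spanning helices; shorter windows give noisier plots that pick out surface-exposed regions.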

24
Hydrophobicity Plot
P53_HUMAN (P04637) human cellular tumor antigen
p53: Kyte-Doolittle hydrophilicity, window = 19

25
Web Sites for Simple Protein Analysis
  • Protein Hydrophobicity Server: Bioinformatics
    Unit, Weizmann Institute of Science, Israel
  • http://bioinformatics.weizmann.ac.il/hydroph/
  • SAPS - statistical analysis of protein sequences:
    composition, charge, hydrophobic and
    transmembrane segments, cysteine spacings,
    repeats and periodicity
  • http://www.isrec.isb-sib.ch/software/SAPS_form.html

26
Secondary Structure
  • Protein secondary structure takes one of three
    forms
  • Alpha helix
  • Beta pleated sheet
  • Turn
  • Secondary structure is predicted within a small
    window
  • Many different algorithms, not highly accurate
  • Better predictions from a multiple alignment

27
Structure Prediction on the Web
  • Secondary Structural Content Prediction (SSCP):
    EMBL, Heidelberg
  • http://www.bork.embl-heidelberg.de/SSCP/sscp_seq.html
  • BCM Search Launcher, Protein Secondary Structure
    Prediction: Baylor College of Medicine
  • http://dot.imgen.bcm.tmc.edu:9331/seq-search/struc-predict.html
  • PREDATOR: EMBL, Heidelberg
  • http://www.embl-heidelberg.de/cgi/predator_serv.pl
  • UCLA-DOE Protein Fold Recognition Server
  • http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html

28
Sample Structure Prediction
29
Super-secondary Structure
  • Common structural motifs
  • Membrane spanning (GCG TransMem)
  • Signal peptide (GCG SPScan)
  • Coiled coil (GCG CoilScan)
  • Helix-turn-helix (GCG HTHScan)

30
Web servers that predict these structures
  • Predict Protein server: EMBL Heidelberg
  • http://www.embl-heidelberg.de/predictprotein/
  • SOSUI: Tokyo Univ. of Ag. Tech., Japan
  • http://www.tuat.ac.jp/mitaku/adv_sosui/submit.html
  • TMpred (transmembrane prediction): ISREC (Swiss
    Institute for Experimental Cancer Research)
  • http://www.isrec.isb-sib.ch/software/TMPRED_form.html
  • COILS (coiled coil prediction): ISREC
  • http://www.isrec.isb-sib.ch/software/COILS_form.html
  • SignalP (signal peptides): Tech. Univ. of Denmark
  • http://www.cbs.dtu.dk/services/SignalP/

31
3-D Structure
  • Cannot be accurately predicted from sequence
    alone (known as ab initio)
  • Levinthal's paradox: a 100 aa protein has 3^200
    possible backbone configurations - many orders of
    magnitude beyond the capacity of the fastest
    computers
  • There are perhaps only a few hundred basic
    structures, but we don't yet have this vocabulary
    or the ability to recognize variants on a theme
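
The scale of Levinthal's paradox is easy to check with exact integer arithmetic. The sketch below reads the slide's figure as 3^200, that is, three states for each of roughly 200 backbone torsion angles in a 100 residue protein. The enumeration rate of 10^15 conformations per second is an arbitrary, generous assumption, and the function names are hypothetical.

```python
def count_conformations(torsion_angles=200, states_per_angle=3):
    """Number of backbone conformations if every angle is independent."""
    return states_per_angle ** torsion_angles


def years_to_enumerate(conformations, per_second=10**15):
    """Whole years needed to visit every conformation at a fixed rate."""
    seconds_per_year = 365 * 24 * 60 * 60
    return conformations // (per_second * seconds_per_year)


total = count_conformations()
years = years_to_enumerate(total)
# Report orders of magnitude via the number of decimal digits.
print(f"conformations: about 10^{len(str(total)) - 1}")
print(f"years to enumerate: about 10^{len(str(years)) - 1}")
```

Exhaustive search is therefore out of the question at any plausible computing speed, which is the motivation for comparing against known structures instead.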

32
Threading Protein Structures
  • Best bet is to compare with similar sequences
    that have known structures >> Threading
  • Only works for proteins with >25% sequence
    similarity to a protein with known structure
  • Current state of the art requires many days of
    computing on a dedicated workstation
  • Some websites offer quick approximations
  • Will improve as more 3-D structures are described
  • Another aspect of the Genome Project

33
Predicted Structure
34
Protein Data Base
  • There is a database of all known protein
    structures called the PDB.
  • These have been determined by X-ray
    crystallography and/or NMR.
  • Anyone can download and view these structures with
    a PDB viewer program.

35
RasMol
  • RasMol is the simplest PDB viewer.
  • http://www.umass.edu/microbio/rasmol/
  • It can work together with a web browser to let
    you view the structure of any sequence found with
    Entrez that has a known 3-D structure.

36
Websites for 3-D structure prediction
  • UCLA-DOE Protein Fold Recognition
  • http://www.doe-mbi.ucla.edu/people/fischer/TEST/getsequence.html
  • SwissModel: ExPASy, Univ. of Geneva
  • http://www.expasy.ch/swissmod/SWISS-MODEL.html
  • CPHmodels: Technical Univ. of Denmark
  • http://www.cbs.dtu.dk/services/CPHmodels/

37
Searching for Patterns in Proteins
38
Protein Domains/Motifs
  • Proteins are built out of functional units known
    as domains (or motifs)
  • These domains have conserved sequences
  • Often much more similar than their respective
    proteins
  • Exon splicing theory (W. Gilbert)
  • Exons correspond to folding domains which in
    turn serve as functional units
  • Unrelated proteins may share a single similar
    exon (e.g., ATPase or DNA binding function)

39
Protein Motif Databases
  • Known protein motifs have been collected in
    databases
  • Best database is PROSITE
  • The Dictionary of Protein Sites and Patterns
  • maintained by Amos Bairoch, at the Univ. of
    Geneva, Switzerland
  • contains a comprehensive list of documented
    protein domains constructed by expert molecular
    biologists.

40
PROSITE is based on Patterns
  • Each domain is defined by a simple pattern
  • Patterns can have alternate amino acids in each
    position and defined spaces, but no gaps
  • Pattern searching is by exact matching, so any
    new variant will not be found (can allow
    mismatches, but this weakens the algorithm)
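
Because PROSITE patterns are exact-match patterns, they translate almost directly into regular expressions. The sketch below converts a subset of the PROSITE syntax (`x`, `[..]` alternatives, `{..}` exclusions, `(n)` and `(n,m)` repeats, `<` and `>` anchors) and scans a protein for matches. The function names are hypothetical, and rarer syntax such as `>` inside square brackets is not handled. The example pattern is the PROSITE N-glycosylation site, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (common subset) into a Python regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")
        # Split "core(n)" or "core(n,m)" into the residue part and repeat counts.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            regex = "."                        # any amino acid
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any amino acid except these
        else:
            regex = core                       # single residue or [alternatives]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + regex + ("$" if at_end else ""))
    return "".join(parts)


def scan(pattern, protein):
    """Return (0-based start, matched text) for every match, overlaps included."""
    # A lookahead lets overlapping matches be reported.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(protein.upper())]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(scan("N-{P}-[ST]-{P}.", "MNGSAANPSA"))  # made-up sequence
```

The second candidate site in the example sequence (N-P-S-A) is rejected because of the proline, which illustrates how a single unexpected residue makes an exact pattern miss a variant.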

41
Tools for PROSITE searches
  • Free Mac program: MacPattern
  • ftp://ftp.ebi.ac.uk/pub/software/mac/macpattern.hqx
  • Free PC program (DOS): PATMAT
  • ftp://ncbi.nlm.nih.gov/repository/blocks/patmat.dos
  • GCG provides the program MOTIFS
  • Also in virtually all commercial programs:
    MacVector, OMIGA, LaserGene, etc.

42
Websites for PROSITE Searches
  • ScanProsite at ExPASy: Univ. of Geneva
  • http://expasy.hcuge.ch/sprot/scnpsit1.html
  • Network Protein Sequence Analysis: Institut de
    Biologie et Chimie des Protéines, Lyon, France
  • http://pbil.ibcp.fr/NPSA/npsa_prosite.html
  • PPSRCH: EBI, Cambridge, UK
  • http://www2.ebi.ac.uk/ppsearch/

43
Profiles
  • Profiles are tables of amino acid frequencies at
    each position in a motif
  • They are built from multiple alignments
  • PROSITE entries also contain profiles built from
    an alignment of proteins that match the pattern
  • Profile searching is more sensitive than pattern
    searching - uses an alignment algorithm, allows
    gaps
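
A profile can be sketched as a table of log-odds scores per alignment column. The example below builds such a table from an ungapped alignment block and slides it along a protein to find the best-scoring window. It is a simplification in two ways: real profile searches use an alignment algorithm that allows gaps, and the uniform background frequency (1/20) and pseudocount of 1 are assumptions chosen for brevity. All names and sequences are made up.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column.

    aligned must be equal-length sequences without gap characters.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, protein):
    """Return (score, 0-based start) of the best ungapped match, or None."""
    width = len(profile)
    scored = [
        (sum(profile[i].get(protein[s + i], 0.0) for i in range(width)), s)
        for s in range(len(protein) - width + 1)
    ]
    return max(scored) if scored else None


profile = build_profile(["ACDE", "ACDE", "ACDF"])
print(best_window(profile, "GGACDFGG"))
```

Unlike an exact pattern, the profile still gives a positive, if lower, score to a window with a residue that is rare or absent in the alignment, which is the source of the extra sensitivity.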

44
Websites for Profile searching
  • PROSITE ProfileScan: ExPASy, Geneva
  • http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
  • BLOCKS (builds profiles from PROSITE entries and
    adds all matching sequences in SwissProt): Fred
    Hutchinson Cancer Research Center, Seattle,
    Washington, USA
  • http://www.blocks.fhcrc.org/blocks_search.html
  • PRINTS (profiles built from automatic alignments
    of OWL non-redundant protein databases)
  • http://www.biochem.ucl.ac.uk/cgi-bin/fingerPRINTScan/fps/PathForm.cgi

45
More Protein Motif Databases
  • PFAM (1344 protein family HMM profiles built by
    hand): Washington Univ., St. Louis
  • http://pfam.wustl.edu/hmmsearch.shtml
  • ProDom (profiles built from PSI-BLAST automatic
    multiple alignments of the SwissProt database):
    INRA, Toulouse, France
  • http://www.toulouse.inra.fr/prodom/doc/blast_form.html
  • This is my favorite protein database - nicely
    colored results

46
Sample ProDom Output
47
Hidden Markov Models
  • Hidden Markov Models (HMMs) are a more
    sophisticated form of profile analysis.
  • In addition to amino acid frequencies at each
    position, they model the probabilities of
    insertions and deletions between positions.
  • Pfam is built with HMMs
  • HMMER software - free for UNIX

48
Discovery of new Motifs
  • All of the tools discussed so far rely on a
    database of existing domains/motifs
  • How to discover new motifs
  • Start with a set of related proteins
  • Make a multiple alignment
  • Build a pattern or profile
  • You will need access to a fairly powerful UNIX
    computer to search databases with custom built
    profiles or HMMs.
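
The last step, turning a multiple alignment into a pattern, can be sketched as follows. Each alignment column becomes a single residue, a bracketed set of alternatives, or `x` when the column is too variable. This is an illustration only: the `max_alternatives` cutoff and function name are assumptions, gapped columns are rejected rather than handled, and a pattern derived this way should be checked against a database for false positives.

```python
def alignment_to_pattern(aligned, max_alternatives=3):
    """Derive a PROSITE-style pattern from an ungapped alignment block."""
    elements = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if "-" in residues:
            raise ValueError("gapped columns are not handled in this sketch")
        if len(residues) == 1:
            elements.append(residues[0])                    # fully conserved
        elif len(residues) <= max_alternatives:
            elements.append("[" + "".join(residues) + "]")  # a few alternatives
        else:
            elements.append("x")                            # too variable
    return "-".join(elements)


# Made-up alignment of three short motif instances.
print(alignment_to_pattern(["NGSA", "NATA", "NGTL"]))
```

A pattern built from only a handful of sequences tends to be either too strict or too loose, so profiles or HMMs are usually the better choice once enough related proteins are available.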

49
Patterns in Unaligned Sequences
  • Sometimes sequences may share just a small common
    region
  • common signal peptide
  • new transcription factors
  • MEME: San Diego Supercomputing Facility
  • http://www.sdsc.edu/MEME/meme/website/meme.html
  • GCG also includes the MEME program

50
Summary
  • DNA has genes and other information
  • Transcription factors
  • RNA has predictable structures
  • Proteins have predictable secondary structures and
    functional domains, but we generally can't predict
    new 3-D structures