Introduction to Sequence Analysis Software

1
Introduction to Sequence Analysis Software
Data Libraries
  • BIOINFORMATICS I
  • Protein and DNA Sequence Analysis
  • Jaime E. Ramirez-Vick, PhD

2
Why Sequence Analysis
  • Sequence analysis is the process of applying
    computational methods to a biological sequence
    represented as a character string.
  • The goal is to use these computational methods to
    infer information about the structure, function,
    or evolutionary history of the sequence.
  • The stronger the evidence, the more confident we
    can be in the inference.
  • To get the strongest evidence the proper
    techniques must be employed.

3
The Goal
4
The Process
Homology Modeling
CURATED DATASET
5
The Project
  • Part I: Submit three candidate families for your
    course project.
  • Part II: Collect an initial set of sequences and
    generate a multiple sequence alignment.
  • Part III: Improve the quality of your alignment
    and identify additional family members.
  • Part IV: Add structural and/or evolutionary
    information, and give a final report.

6
Part I
Homology Modeling
CURATED DATASET
7
Part II
Homology Modeling
CURATED DATASET
Classification Libraries
Sequence Libraries
8
Part III
Homology Modeling
CURATED DATASET
Multiple Sequence Alignment
Sequence Libraries
Profile PSSM
Local Patterns
9
Part IV
Evolutionary Analysis
Homology Modeling
CURATED DATASET
Multiple Sequence Alignment
10
Structural Libraries
  • Structure libraries contain the actual
    three-dimensional coordinates of a macromolecule.
  • Used to:
  • Determine whether the three-dimensional structure
    of a molecule has been solved
  • Visualize the three-dimensional structure
  • Assist in homology modeling

11
Structural Libraries
  • Protein Data Bank (PDB)
  • Large Molecules (1000 atoms)
  • For more information see
  • http://www.psc.edu/general/software/packages/pdb/pdb.html
  • http://www.rcsb.org/pdb/
  • Cambridge Structural Database
  • Small Molecules (100 atoms)
  • For more information see
  • http://www.ccdc.cam.ac.uk/

12
Classification Libraries
  • Built from sets of related sequences and contain
    information about the residues that are essential
    to the structure/function of the group of related
    sequences
  • Used to
  • Generate a testable hypothesis that the query
    sequence belongs to the group.
  • Quickly identify a good group of sequences known
    to share a biological relationship.

13
Classification Libraries
  • Some Popular Classification Libraries
  • PROSITE: http://www.expasy.ch/prosite.html
  • PFAM: http://pfam.wustl.edu/
  • IPROCLASS: http://pir.georgetown.edu/iproclass/
  • BLOCKS: http://www.blocks.fhcrc.org/
  • PRINTS: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/PRINTS.html
  • Transcription Factor Database: http://transfac.gbf.de/TRANSFAC/
  • Restriction Enzyme Database: http://rebase.neb.com/
  • Search software is usually specific to the
    database

14
Classification Libraries - Representation
  • Consensus: the residue most common at each
    position in the alignment
  • Composite: set representation (e.g. a g,c acg t a)
  • Composition Matrix: table of how many of each
    residue are present at each position
  • Position Specific Scoring Matrix (PSSM or
    Profile): log-odds likelihood of each residue at
    each position
  • Hidden Markov Model: probabilistic state
    representation
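
The first four representations can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any of the libraries listed earlier: the toy DNA alignment, the uniform background frequency of 0.25, and the pseudocount of 1 are all assumptions chosen to keep the arithmetic easy to follow.

```python
import math
from collections import Counter

# Toy DNA alignment; the sequences are made up for illustration.
alignment = ["ACGTA", "ACGTA", "AGGTA", "ACCTA"]
alphabet = "ACGT"

def composition_matrix(seqs):
    """Count how often each residue occurs in each alignment column."""
    return [Counter(col) for col in zip(*seqs)]

def consensus(counts):
    """Most common residue in each column."""
    return "".join(c.most_common(1)[0][0] for c in counts)

def pssm(counts, n_seqs, background=0.25, pseudocount=1.0):
    """Log-odds score (base 2) of each residue at each position.

    The uniform background and the pseudocount are assumptions.
    """
    matrix = []
    for c in counts:
        total = n_seqs + pseudocount * len(alphabet)
        row = {r: math.log2(((c[r] + pseudocount) / total) / background)
               for r in alphabet}
        matrix.append(row)
    return matrix

counts = composition_matrix(alignment)
# Composite (set) representation: every residue seen in each column.
composite = ["".join(sorted(c)) for c in counts]
profile = pssm(counts, len(alignment))

print(consensus(counts))   # ACGTA
print(composite)           # ['A', 'CG', 'CG', 'T', 'A']
print(profile[0]["A"] > 0, profile[0]["C"] < 0)  # True True
```

Residues that are over-represented in a column relative to the background get a positive score and absent residues get a negative one, which is what lets a PSSM reward matches at the positions that matter most.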

15
Sequence Libraries
  • Compilations of known sequences with experimental
    information about those sequences.
  • Used to
  • Generate a testable hypothesis that the query
    sequence may be related to known sequences in the
    library.
  • Retrieve annotation information about sequences

16
Sequence Libraries
  • Nucleic Acids
  • GenBank: http://www.ncbi.nlm.nih.gov/
  • EMBL: http://www.ebi.ac.uk/
  • Protein
  • UniProt/UniRef: http://www.uniprot.org/
  • Other protein collections
  • PIR: http://nbrfa.georgetown.edu/
  • Swiss-Prot: http://www.ebi.ac.uk/
  • GenPept: http://www.ncbi.nlm.nih.gov/
  • PIR-NREF: http://nbrfa.georgetown.edu/
  • TrEMBL: http://www.ebi.ac.uk/
  • Older Libraries: PATCHX, OWL

17
Sequence Libraries - Searching
  • Searching Methods
  • Dynamic Programming
  • Global: Needleman-Wunsch, Sellers
  • Local: Smith-Waterman, Waterman-Eggert "Maxsegs"
  • Approximations
  • FASTA
  • BLAST
  • The user must understand what the searching method
    considers "similar"
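
The difference between global and local dynamic programming can be seen in a small sketch. It computes only the best score, not the alignment itself, and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration; real searches use substitution matrices such as PAM or BLOSUM and usually affine gap penalties.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming.

    local=False scores a global alignment (Needleman-Wunsch style);
    local=True scores the best local alignment (Smith-Waterman style).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

print(align_score("GATTACA", "GATTACA"))                        # 7
print(align_score("ACGT", "AGT"))                               # 1
print(align_score("CCCGATTACACCC", "TTGATTACATT", local=True))  # 7
```

The last call shows why the choice of method matters: the two sequences share only an embedded region, which a local method scores highly while a global method would penalize the unrelated flanks.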

18
Sequence Libraries - Similar?
Blood coagulation protein superfamily. From "The
Molecular Basis of Blood Coagulation" by Furie and
Furie, Cell, Vol. 53.
19
Sequence Libraries - Results
Box 3.6 from Introduction to Bioinformatics by
Attwood and Parry-Smith
20
Multiple Sequence Alignment
  • An MSA is an alignment of a group of related
    sequences across their entire lengths in a manner
    that highlights the conservation of the important
    residues in the sequences
  • Critical building block for many next steps such
    as finding distantly related sequences,
    determining the evolutionary history, and
    homology modeling
  • Not all sets of related sequences can be aligned
    across their entire lengths cleanly

21
Multiple Sequence Alignment
Blood coagulation protein superfamily. From "The
Molecular Basis of Blood Coagulation" by Furie and
Furie, Cell, Vol. 53.
22
Multiple Sequence Alignment
  • When aligning groups of related sequences it is
    important to note that residues in those
    sequences are either
  • Conserved (not mutated)
  • Unconstrained (when mutated can be almost any
    amino acid)
  • Constrained (when mutated must be one of a few
    amino acids)
  • Motifs are distinct units that consist of the
    conserved and constrained regions. Motifs
    generally tell us the residues that are essential
    to the structure/function of the sequence.
  • Typically, multiple sequence alignments contain
    motifs as well as unconstrained regions.
  • Aligning motifs in a multiple sequence alignment
    will improve the quality of the alignment
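
A rough way to label alignment columns with these three categories is to count the distinct residues in each column. The sketch below is only an illustration: the toy sequences are invented, and the cutoff of three residues for "constrained" is an arbitrary assumption, not a published threshold.

```python
def classify_columns(seqs, max_constrained=3):
    """Label each column by how many distinct residues it contains.

    max_constrained is a hypothetical cutoff chosen for illustration.
    Gap characters are ignored; a gap-only column is not handled specially.
    """
    labels = []
    for col in zip(*seqs):
        residues = {r.upper() for r in col if r != "-"}
        if len(residues) <= 1:
            labels.append("conserved")
        elif len(residues) <= max_constrained:
            labels.append("constrained")
        else:
            labels.append("unconstrained")
    return labels

# Invented four-column protein alignment.
toy = ["LKEI", "LREV", "LTEI", "LAEL", "LHEV"]
print(classify_columns(toy))
# ['conserved', 'unconstrained', 'conserved', 'constrained']
```

Runs of conserved and constrained columns are the candidate motifs; the unconstrained columns between them are where alignment programs have the most freedom to place gaps.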

23
Multiple Sequence Alignment
Secondary structure labels in the figure: helix, helix, sheet
Lrr 2e   kdafrdlhsLsl-LsLydNnI-----qsL
LRR A1   wteLlpllqqyEvvrLddCgLTeehCkdi
LRR A2   lqgLqsPtCkiqkLsLqnCsLTeaGCgvL
LRR A3   cegLldPqChLEkLqLeyCrLTaasCepL
LRR A4   gqgLadsaCqLEtLrLenCgLTpanCkdL
LRR A5   cpgLlsPasrLktLwLweCdiTasGCrdL
LRR A6   cesLlqPGCQLEsLwvksCsLTaacCqhv
LRR A7   cqaLsqPgttLrvLcLgdCeVTnsGCssL
LRR A8   lgsLeQPgCaLEqLvLydtywTeevedrL
LRR B1   gsaLranpsLtE-LcLrtNeLGDaGvhlv
LRR B2   pstLrslptLrE-LhLsdNpLGDaGlrlL
LRR B3   asvLratraLkE-LtvsnNdiGeaGarvL
LRR B4   cgivasqasLrE-LDLgsNgLGDaGiaeL
LRR B5   crvLqaketkKE-LsLagNkLGDeGarlL
LRR B6   slmLtqnkhLlE-LqLssNkLGDsGiqeL
LRR B7   aslLlanrsLRE-LdLsnNcvGDpGvlqL
24
Multiple Sequence Alignment
Figure 8.2 from Introduction to Bioinformatics by
Attwood and Parry-Smith
25
Multiple Sequence Alignment - Programs
  • MSA Using Progressive Pairwise Technique
  • Clustal
  • MSA Using Multidimensional Dynamic Programming
  • MSA
  • MSA Using Consistency Measures
  • T-Coffee
  • Probcons
  • MSA Editor
  • GeneDoc

26
Position Specific Scoring Matrix
  • A Position Specific Scoring Matrix (PSSM or
    Profile) is a way to abstract the information
    contained in a multiple sequence alignment.
  • Think of a PSSM as a custom PAM or BLOSUM scoring
    matrix that has been specially tuned to locate
    sequences exactly like those in the alignment.
  • Probabilities represented by Log Odds Technique

27
Position Specific Scoring Matrix
  • Used to
  • Help locate distantly related sequences
  • Help resolve sequences that are not considered
    statistically significant by a database search,
    but share enough important residues to infer that
    the sequences may have the same function and be
    distant members of the sequence family
  • Good MSA -> Good PSSM
  • Poor MSA -> Poor PSSM
  • A lot of Sequences -> Good PSSM
  • Few Sequences -> Good PSSM

28
PSSM Programs
  • MakePSSM
  • Used to create a PSSM from a multiple sequence
    alignment
  • PSSM can be created using different methods
    including Gribskov and Henikoff and with a
    variety of PAM and BLOSUM matrices.
  • ProfileSS
  • Used to search a sequence database with a profile.
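
Searching with a profile amounts to scoring each database sequence against the matrix and ranking the results. The sketch below is not how ProfileSS works internally; it is an ungapped sliding-window scan, and the three-column PSSM, the two database sequences, and the default score for unlisted residues are all invented for illustration.

```python
def scan(sequence, pssm, default=-4.0):
    """Score every ungapped window of len(pssm); return (best_score, start).

    Returns None if the sequence is shorter than the profile.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best is None or total > best[0]:
            best = (total, start)
    return best

# Hypothetical log-odds profile that favors the motif "ACG".
toy_pssm = [
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
]
database = {"seq1": "TTTACGTTT", "seq2": "TTTTTTTTT"}

# Rank database entries by their best window score, highest first.
hits = sorted(((scan(s, toy_pssm), name) for name, s in database.items()),
              reverse=True)
print(hits)  # [((6.0, 3), 'seq1'), ((-3.0, 0), 'seq2')]
```

A real profile search also allows gaps and attaches a statistical significance to each score, which this sketch leaves out.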

29
Hidden Markov Model
  • A Hidden Markov Model (HMM) is a way to abstract
    the information contained in a multiple sequence
    alignment.
  • Think of an HMM as a way to represent a multiple
    sequence alignment by deriving probabilities
    directly from the multiple sequence alignment
  • Probabilistic model: includes probabilities for
    insertions and deletions
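
To make "deriving probabilities directly from the alignment" concrete, the sketch below estimates per-column emission probabilities and crude insertion and deletion frequencies from a toy alignment. It is a simplification and not a full profile HMM: it does not estimate state-to-state transition probabilities, and the 50% gap rule for calling a column an insert, the pseudocount of 1, and the sequences themselves are assumptions.

```python
from collections import Counter

def estimate_profile(seqs, alphabet="ACGT", pseudocount=1.0, gap_threshold=0.5):
    """Estimate simplified profile parameters column by column."""
    n = len(seqs)
    states = []
    for col in zip(*seqs):
        gaps = col.count("-")
        if gaps / n > gap_threshold:
            # Mostly gaps: treat the column as an insertion relative to the model.
            states.append({"kind": "insert", "p_insert": (n - gaps) / n})
            continue
        counts = Counter(r for r in col if r != "-")
        total = (n - gaps) + pseudocount * len(alphabet)
        emissions = {r: (counts[r] + pseudocount) / total for r in alphabet}
        # Sequences with a gap in a match column have deleted that position.
        states.append({"kind": "match", "emissions": emissions,
                       "p_delete": gaps / n})
    return states

toy = ["AC-GT", "AC-GT", "A--GT", "ACAGA"]
profile = estimate_profile(toy)
print([s["kind"] for s in profile])
# ['match', 'match', 'insert', 'match', 'match']
print(profile[1]["p_delete"])  # 0.25
```

Because every probability comes from counting, a small or poorly aligned set of sequences gives unreliable estimates.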

30
Hidden Markov Model
  • Used to
  • Help locate distantly related sequences
  • Help resolve sequences that are not considered
    statistically significant by a database search,
    but share enough important residues to infer that
    the sequences may have the same function and be
    distant members of the sequence family
  • Good MSA -> Good HMM
  • Poor MSA -> Poor HMM
  • A lot of Sequences -> Good HMM
  • Few Sequences -> Poor HMM

31
Hidden Markov Model
  • HMMER Package
  • hmmalign - Align multiple sequences to a profile
    HMM.
  • hmmbuild - Build a profile HMM from a given
    multiple sequence alignment.
  • hmmcalibrate - Determine appropriate statistical
    significance parameters for a profile HMM prior
    to doing database searches.
  • hmmconvert - Convert HMMER profile HMMs to other
    formats
  • hmmemit - Generate sequences probabilistically
    from a profile HMM.
  • hmmfetch - Retrieve an HMM from an HMM database
  • hmmindex - Create a binary SSI index for an HMM
    database
  • hmmpfam - Search a profile HMM database with a
    sequence
  • hmmsearch - Search a sequence database with a
    profile HMM

32
Local Patterns
  • Local patterns are short motifs that exist in
    all (or a subset) of the related sequences.
  • Finding local patterns may help us to align
    biologically important sections in a multiple
    sequence alignment.
  • These local patterns can also be used to probe
    sequence data libraries for distant relatives
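
Probing sequences with a local pattern can be sketched by translating a PROSITE-style pattern into a regular expression. The converter below handles only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors outside brackets). The N-glycosylation pattern is a real PROSITE pattern, while the second pattern is an illustrative fragment and the test sequence is invented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> not P
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> repeat
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)

n_glyc = prosite_to_regex("N-{P}-[ST]-{P}.")
print(n_glyc)                          # N[^P][ST][^P]
print(prosite_to_regex("C-x(2,4)-C"))  # C.{2,4}C

sequence = "MKNGSAPNPSA"  # made-up protein fragment
# finditer reports non-overlapping matches only.
print([m.start() for m in re.finditer(n_glyc, sequence)])  # [2]
```

Short patterns like this match by chance quite often, so a hit is a hypothesis to test rather than proof of family membership.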

33
Local Patterns
Blood coagulation protein superfamily. From "The
Molecular Basis of Blood Coagulation" by Furie and
Furie, Cell, Vol. 53.
34
Local Patterns - Programs
  • MEME: a tool for discovering motifs in groups of
    sequences
  • oops - One Occurrence Per Sequence
  • zoops - Zero or One Occurrence Per Sequence
  • tcm - Multiple occurrences per sequence
  • MAST: searches a sequence database for
    sequences that contain MEME patterns

35
Homology Modeling
  • Predicts the three-dimensional structure of a
    given protein sequence (TARGET) based on an
    alignment to one or more known protein structures
    (TEMPLATES)
  • If similarity between the TARGET sequence and the
    TEMPLATE sequence is detected, structural
    similarity can be assumed.

36
Homology Modeling
Structural Superposition of Aldehyde
Dehydrogenase Family Members
37
Homology Modeling - Programs
  • Modeller: used for homology (or comparative)
    modeling of protein three-dimensional structures
  • MMTSB: Multiscale Modeling Tools for Structural
    Biology
  • VMD: molecular visualization program for
    displaying, animating, and analyzing large
    biomolecular systems using 3-D graphics and
    built-in scripting

38
Evolutionary Analysis
  • Inferring phylogenies: finding the tree that
    implies the correct evolutionary history of the
    sequences
  • Principal Methods
  • Parsimony analysis
  • Distance methods
  • Maximum Likelihood
  • Each approach has its own strengths/weaknesses
  • To assess the correctness of the tree we need
    to understand:
  • Overall signal-to-noise in the data
  • How the tree compares to alternative trees
  • How reliable the individual branches are
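
As an example of a distance method, the sketch below computes pairwise p-distances from an alignment and clusters the sequences by average linkage (UPGMA). UPGMA is just one simple distance method and assumes roughly constant evolutionary rates; the four toy sequences are invented, and the tree is returned as nested tuples without branch lengths.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of aligned, non-gap positions at which two sequences differ.

    Assumes the sequences share at least one non-gap position.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(seqs):
    """Cluster sequences by average linkage; return a nested-tuple tree."""
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def average_distance(c1, c2):
        total = sum(p_distance(seqs[m], seqs[n]) for m in c1[1] for n in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACTA", "D": "TCGAACTT"}
print(p_distance(toy["A"], toy["B"]))  # 0.125
print(upgma(toy))                      # (('A', 'B'), ('C', 'D'))
```

Parsimony and maximum likelihood methods search over tree topologies instead of clustering distances, which is one reason the approaches can disagree on the same data.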

39
Evolutionary Analysis
  • Refining Trees
  • Bootstrap analysis can give us estimates of
    variability
  • Incorporating information about duplication and
    loss
  • reconcile a gene tree with a species tree
  • identify gene duplications
  • root an unrooted tree by minimizing gene
    duplications and losses
  • refine rooted trees to minimize duplications and
    losses
  • Groups analysis
  • Discover what is unique about each subgroup in a
    tree
  • Help resolve which subgroup a sequence belongs to
    in a tree
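
The bootstrap idea can be sketched as resampling alignment columns with replacement and asking how often a grouping reappears. To stay short, the example below does not rebuild a full tree for each replicate; it only checks whether one sequence remains the strictly closest neighbor of another. The function names, the replicate count, the fixed random seed, and the toy sequences are all assumptions for illustration.

```python
import random

def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement, keeping rows intact."""
    length = len(next(iter(seqs.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(s[i] for i in picks) for name, s in seqs.items()}

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def support_for_pair(seqs, first, second, replicates=200, seed=0):
    """Fraction of replicates in which `second` is strictly closest to `first`."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(replicates):
        sample = bootstrap_alignment(seqs, rng)
        others = {n: hamming(sample[first], s)
                  for n, s in sample.items() if n != first}
        closest = min(others.values())
        # Ties do not count as support for the pairing.
        if others[second] == closest and list(others.values()).count(closest) == 1:
            wins += 1
    return wins / replicates

toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGAACTA", "D": "TCGAACTT"}
support = support_for_pair(toy, "A", "B")
print(0.0 <= support <= 1.0)  # True
```

In practice each replicate is passed through the same tree-building method as the original data, and the support for a branch is the fraction of replicate trees that contain it.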

40
Evolutionary Analysis - Programs
  • Phylip Package
  • Package contains many programs to help infer
    phylogenies, including programs for
    bootstrapping, maximum parsimony, distance
    methods, maximum likelihood methods, etc.
  • Notung
  • Enables the incorporation of information about
    duplication and loss into phylogenies
  • Subgroup Refinement
  • GEnt: calculates a group cross entropy

41
Course Project
Homology Modeling
CURATED DATASET
42
Part I
Homology Modeling
CURATED DATASET
43
Course Project Part I
  • Select Sequence Family
  • In the labs we will be working with the
    Haloalkane Dehalogenase superfamily
  • A hydrolase that acts on halide bonds in C-halide
    compounds.
  • Reaction: 1-haloalkane + H(2)O = a primary
    alcohol + halide.
  • The PIR Superfamily is PIRSF037173
  • The initial query sequence that we will be using
    for database searching is from Xanthobacter
    autotrophicus. The UniProt ID is DHLA_XANAU.

44
Part II
Homology Modeling
CURATED DATASET
Classification Libraries
Sequence Libraries
45
Course Project Part II
  • Collect an initial set of sequences, generate a
    multiple sequence alignment
  • Perform a database search with the query sequence
    across several databases with different
    algorithms
  • Perform a multiple sequence alignment with a
    variety of different alignment algorithms
  • Select best multiple sequence alignment

46
Part III
Homology Modeling
CURATED DATASET
Multiple Sequence Alignment
Sequence Libraries
Profile PSSM
Local Patterns
47
Course Project Part III
  • Improve the quality of your alignment, and
    identify additional family members
  • Search for local patterns in the group of
    sequences
  • Refine multiple sequence alignment based on local
    patterns
  • Convert alignment into HMM/PSSM and search the
    database for distantly related sequences.

48
Part IV
Evolutionary Analysis
Homology Modeling
CURATED DATASET
Multiple Sequence Alignment
49
Course Project Part IV
  • Add structural and evolutionary information
  • Build a phylogenetic tree
  • Refine the phylogenetic tree
  • Refine groups
  • Produce and visualize structure using homology
    modeling techniques
  • Produce final multiple sequence alignment