Marine Biological Laboratory - PowerPoint PPT Presentation

Marine Biological Laboratory Workshop on Molecular Evolution, Woods Hole, Massachusetts, July 25, 2006, 7 to 10 PM

1
Marine Biological Laboratory Workshop on
Molecular Evolution
Woods Hole, Massachusetts
July 25, 2006, 7 to 10 PM
2
Multiple Sequence Alignment and Analysis thru GCG's
SeqLab
  • Steven M. Thompson
  • Florida State University School of Computational
    Science (SCS)

More data yields stronger analyses if done
carefully! Mosaic ideas and evolutionary
importance.
3
But first a prelude: my definitions
  • Biocomputing and computational biology are
    synonymous and describe the use of computers and
    computational techniques to analyze any
    biological system, from molecules, through cells,
    tissues, organisms, and populations, to complete
    ecologies.
  • Bioinformatics describes using computational
    techniques to access, analyze, and interpret the
    biological information in any of the available
    online biological databases.
  • Sequence analysis is the study of molecular
    sequence data for the purpose of inferring the
    function, mechanism, interactions, evolution, and
    perhaps structure of biological molecules.
  • Genomics analyzes the context of genes or
    complete genomes (the total DNA content of an
    organism) within and across genomes.
  • Proteomics is a subdivision of genomics concerned
    with analyzing the complete protein complement,
    i.e. the proteome, of organisms, both within and
    between different organisms.

4
And a way to think about it: the reverse
biochemistry analogy
  • From a virtual DNA sequence to actual molecular
    physical characterization, not the other way
    round.
  • Using bioinformatics tools, you can infer all
    sorts of functional, evolutionary, and
    structural insights into a gene product without
    the need to isolate and purify massive amounts of
    protein! Eventually you can go on to clone and
    express the gene based on that analysis using PCR
    techniques.
  • The computer and molecular databases are an
    essential part of this process.

5
The exponential growth of molecular sequence
databases and CPU power

    Year  Base pairs  Sequences
    1982      680338        606
    1983     2274029       2427
    1984     3368765       4175
    1985     5204420       5700
    1986     9615371       9978
    1987    15514776      14584
    1988    23800000      20579
    1989    34762585      28791
    1990    49179285      39533
    1991    71947426      55627
    1992   101008486      78608
    1993   157152442     143492
    1994   217102462     215273
    1995   384939485     555694
    1996   651972984    1021211
    1997  1160300687    1765847
    1998  2008761784    2837897
    1999  3841163011    4864570

Doubling time: about 1 year!
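
The doubling-time claim can be checked against the table itself. The sketch below is only a rough check: it assumes steady exponential growth between two chosen years, and the helper name `doubling_time` is made up for illustration.

```python
import math

# (year, base pairs) pairs copied from the table on this slide.
GROWTH = [
    (1982, 680338), (1983, 2274029), (1984, 3368765), (1985, 5204420),
    (1986, 9615371), (1987, 15514776), (1988, 23800000), (1989, 34762585),
    (1990, 49179285), (1991, 71947426), (1992, 101008486), (1993, 157152442),
    (1994, 217102462), (1995, 384939485), (1996, 651972984),
    (1997, 1160300687), (1998, 2008761784), (1999, 3841163011),
]


def doubling_time(year0, size0, year1, size1):
    """Years per doubling, assuming exponential growth between the two points."""
    return (year1 - year0) * math.log(2) / math.log(size1 / size0)


# Whole span of the table versus only the final year.
overall = doubling_time(*GROWTH[0], *GROWTH[-1])
last_year = doubling_time(*GROWTH[-2], *GROWTH[-1])
print(f"1982-1999: {overall:.2f} years per doubling")
print(f"1998-1999: {last_year:.2f} years per doubling")
```

Averaged over the whole table the doubling time comes out somewhat longer than a year; the figure closest to one year comes from the most recent rows.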
6
Back to multiple sequence alignment:
applicability?
So what? Why even bother? Applications:
  • Probe/primer and motif/profile design.
  • Graphical illustrations.
  • Comparative homology inference.
  • Molecular evolutionary analysis.

OK, well, how do you do it?
7
Dynamic programming's complexity increases
exponentially with the number of sequences being
compared
  • N-dimensional matrix . . . .
  • complexity = [sequence length]^(number of sequences)
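
To get a feel for how fast that matrix grows, here is a minimal sketch that just counts cells. It assumes every sequence has the same length and that each dimension carries one extra row for the leading gap; the function name `dp_cells` and the length of 300 are illustrative choices, not anything from a real package.

```python
def dp_cells(sequence_length: int, number_of_sequences: int) -> int:
    """Cells in a full N-dimensional dynamic programming matrix.

    Each dimension has sequence_length + 1 positions (one extra for the
    leading gap row), so the total is that value raised to the power N.
    """
    return (sequence_length + 1) ** number_of_sequences


# A hypothetical protein family with sequences of 300 residues each.
for n in (2, 3, 5, 10):
    print(f"{n:>2} sequences: {dp_cells(300, n):.3e} cells")
```

Two sequences are trivial, three are already tens of millions of cells, and ten are far beyond anything that can be stored, which is why the heuristics on the following slides exist.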

8
Global heuristic solutions
See MSA (global within a bounding box)
and PIMA (local portions only) on the multiple
alignment page at the Baylor College of
Medicine's Search Launcher,
http://searchlauncher.bcm.tmc.edu/, but note the
severely limiting restrictions!
9
Multiple Sequence Dynamic Programming
Therefore, pairwise, progressive dynamic
programming restricts the solution to the
neighborhood of only two sequences at a
time. All sequences are compared, pairwise, and
then each is aligned to its most similar partner
or group of partners. Each group of partners is
then aligned to finish the complete multiple
sequence alignment.
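
The first two stages described here (compare everything pairwise, then decide who joins whom) can be sketched in a few lines. This is a deliberately simplified illustration, not PileUp's or ClustalW's actual procedure: it assumes a plain match/mismatch score with a linear gap penalty, it uses a greedy joining rule instead of a real guide tree, and it stops before building the alignment itself. The sequences and function names are made up.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score of two sequences."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def join_order(seqs):
    """Greedy joining order: the most similar pair first, then whichever
    remaining sequence scores best against any sequence already placed."""
    names = list(seqs)
    score = {
        frozenset(pair): nw_score(seqs[pair[0]], seqs[pair[1]])
        for pair in combinations(names, 2)
    }
    first = max(score, key=score.get)
    order = sorted(first)
    remaining = [n for n in names if n not in first]
    while remaining:
        best = max(
            remaining,
            key=lambda n: max(score[frozenset((n, m))] for m in order),
        )
        order.append(best)
        remaining.remove(best)
    return order


# Toy sequences, invented for this example.
seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "AGGAACGA", "d": "TTTTGGGG"}
print(join_order(seqs))
```

The cost is one two-dimensional matrix per pair of sequences, rather than one N-dimensional matrix for the whole dataset.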
10
Reliability and the Comparative Approach
  • Explicit homologous correspondence.
  • Manual adjustments should be encouraged, based
    on knowledge, especially of structural,
    regulatory, and functional sites.
  • Therefore, editors like SeqLab and
    the Ribosomal Database Project:
    http://rdp.cme.msu.edu/index.jsp

11
Structural and functional correspondence in the
Wisconsin Package's SeqLab
12
Work with proteins! If at all possible.
  • Twenty match symbols versus four, plus
    similarity! Way better signal to noise.
  • Also guarantees no indels are placed within
    codons. So translate, then align.
  • Nucleotide sequences will only reliably align if
    they are very similar to each other. And they
    will require extensive hand editing and careful
    consideration.
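
"Translate, then align" leaves one practical step: carrying the protein alignment back onto the coding DNA so that every gap covers a whole codon. A minimal sketch of that back-threading follows. It assumes the protein has already been aligned by some other program, that "-" is the gap character, and that the coding sequence is in frame with no stop codon; it does not check that the DNA actually translates to the protein. The name `thread_codons` is hypothetical.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Lay the codons of `cds` under an aligned protein, three gap
    characters per protein gap, so no indel ever splits a codon."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(cds) % 3 or residues != len(codons):
        raise ValueError("CDS must supply exactly one codon per residue")
    next_codon = iter(codons)
    return "".join(
        gap * 3 if ch == gap else next(next_codon) for ch in aligned_protein
    )


# Toy example: a three-residue peptide with one gap in its alignment row.
print(thread_codons("MK-L", "ATGAAACTG"))  # ATGAAA---CTG
```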

13
Beware of aligning apples and oranges and
grapefruit!
  • Paralogous versus orthologous
  • genomic versus cDNA
  • mature versus precursor.

14
Mask out uncertain areas
15
Complications
  • Order dependence.
  • Not that big of a deal.
  • Substitution matrices and gap penalties.
  • A very big deal!
  • Regional realignment becomes incredibly
    important, especially with sequences that have
    areas of high and low similarity (GCG PileUp
    -InSitu option).

16
Complications, cont.
  • Format hassles!
  • Specialized format conversion tools such as GCG's
    SeqConv program and PAUPSearch, and
    Don Gilbert's public domain ReadSeq program.

17
Still more complications
  • Indels and missing data symbols (i.e. gaps):
    designation discrepancy headaches
  • ., -, ~, ?, N, or X
  • . . . . . Help!
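
One way to tame the symbol mismatch is to normalize gap characters before handing an alignment to the next program. The sketch below rests on an assumption you should check against your own tools: that "." and "~" both mean a gap (as in GCG-style alignment files), while "?", "N", and "X" mean unknown or missing data and should be left alone. The names `GCG_GAPS` and `normalize_gaps` are invented for this example.

```python
# Assumed gap symbols; adjust for whatever program wrote your alignment.
GCG_GAPS = ".~"


def normalize_gaps(aligned: str, gap: str = "-") -> str:
    """Rewrite every assumed gap symbol as one gap character.

    Symbols for unknown residues (?, N, X) are deliberately untouched,
    because they are data, not gaps.
    """
    return aligned.translate(str.maketrans({ch: gap for ch in GCG_GAPS}))


print(normalize_gaps("~~AC..GT?N"))  # --AC--GT?N
```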

18
Web resources for pairwise, progressive multiple
alignment
  • http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
  • http://pbil.univ-lyon1.fr/alignment.html
  • http://www.ebi.ac.uk/clustalw/
  • http://searchlauncher.bcm.tmc.edu/

However, problems with
very large datasets and huge multiple alignments
make doing multiple sequence alignment on the Web
impractical after your dataset has reached a
certain size. You'll know it when you're there!
19
If large datasets become intractable for analysis
on the Web, what other resources are available?
  • Desktop software solutions: public domain
    programs are available, but . . . they are
    complicated to install, configure, and maintain.
    The user must be pretty computer savvy. So,
  • commercial software packages are available, e.g.
    MacVector, DS Gene, DNAsis, DNAStar, etc.,
  • but . . . license hassles, big expense per
    machine, and Internet and/or CD database access
    all complicate matters!

20
Therefore, UNIX server-based solutions
  • Public domain solutions also exist, but now a
    very cooperative systems manager needs to
    maintain everything for users, so,
  • commercial products, e.g. the Accelrys GCG
    Wisconsin Package and the SeqLab Graphical User
    Interface, simplify matters for administrators
    and users. One format, one look-and-feel.
  • One license fee for an entire institution and
    very fast, convenient database access on local
    server disks. Connections from any networked
    terminal or workstation anywhere!
  • Operating system: UNIX command line operation
    hassles; communications software (telnet, ssh,
    and terminal emulation); X graphics; file transfer
    (ftp and scp/sftp); and editors (vi, emacs,
    pico, or desktop word processing followed by file
    transfer: save as "text only!"). See my
    supplement pdf file.

21
The Genetics Computer Group
  • The Accelrys Wisconsin Package for Sequence
    Analysis
  • GCG began in 1982 in Oliver Smithies' Genetics
    Dept. lab at the University of Wisconsin,
    Madison. Starting in 1990 it became a
    private company, which was acquired by the Oxford
    Molecular Group, U.K., in 1997 and then by
    Pharmacopeia Inc., U.S.A., in 2000. In
    2004 Accelrys, San Diego, California, left
    Pharmacopeia to become an independent entity.
  • The suite contains around 150 programs designed
    to work in a toolbox fashion. Several simple
    programs used in succession can lead to very
    sophisticated results.
  • Also, internal compatibility: once you
    learn to use one program, all programs can be run
    similarly, and the output from many programs can
    be used as input for other programs.
  • Used all over the world at over 950 institutions,
    so learning it will likely be useful at other
    research institutions as well.

22
To answer the always perplexing GCG question,
"What sequence(s)? . . . ."
Specifying sequences, GCG style, in order of
increasing power and complexity:
  • The sequence is in a local GCG format single
    sequence file in your UNIX account. (GCG
    Reformat and SeqConv programs)
  • The sequence is in a local GCG database, in which
    case you point to it by using any of the GCG
    database logical names. A colon, ":", always
    sets the logical name apart from either an
    accession number or a proper identifier name or a
    wildcard expression, and they are case
    insensitive.
  • The sequence is in a GCG format multiple sequence
    file, either an MSF (multiple sequence format)
    file or an RSF (rich sequence format) file. To
    specify sequences contained in a GCG multiple
    sequence file, supply the file name followed by a
    pair of braces, "{}", containing the sequence
    specification, e.g. a wildcard, "{*}".
  • Finally, the most powerful method of specifying
    sequences is in a GCG list file. It is merely
    a list of other sequence specifications and can
    even contain other list files within it. The
    convention to use a GCG list file in a program is
    to precede it with an at sign, "@". Furthermore,
    you can supply attribute information within list
    files to specify something special about the
    sequence such as begin and end constraints.
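
The four ways of naming a sequence can be told apart from the text of the specification alone. The sketch below is a reader's illustration and not part of the Wisconsin Package: it assumes the conventions are a colon after a database logical name, braces after a multiple sequence file name, and a leading at sign for a list file. The function name `classify_spec` and the dictionary keys are invented.

```python
import re


def classify_spec(spec: str) -> dict:
    """Split a GCG-style sequence specification into its parts."""
    if spec.startswith("@"):
        # A list file: everything after the at sign is the file name.
        return {"kind": "list_file", "file": spec[1:]}
    in_braces = re.fullmatch(r"(?P<file>[^{}]+)\{(?P<entry>[^{}]*)\}", spec)
    if in_braces:
        # MSF or RSF file with a sequence specification inside braces.
        return {"kind": "multiple_sequence_file", **in_braces.groupdict()}
    if ":" in spec:
        # Database logical name, then accession, identifier, or wildcard.
        logical, entry = spec.split(":", 1)
        # Logical names are case insensitive, so fold them to one case.
        return {"kind": "database", "logical": logical.upper(), "entry": entry}
    return {"kind": "single_file", "file": spec}


for example in ("my-special.pep", "SwissProt:EfTu_Ecoli",
                "Ef1a-Tu.msf{*}", "@another.list"):
    print(example, "->", classify_spec(example))
```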

23
Clean GCG format single sequence file after
reformat (or the SeqConv program)

    !!NA_SEQUENCE 1.0
    This is a small example of GCG single sequence format. Always put some
    documentation on top, so in the future you can figure out what it is
    you're dealing with! The line with the two periods is converted to the
    checksum line.

    example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

           1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
          51  GATTTAATAG CATGCGATCC CATGGGA

SeqLab's Editor mode can also import native
GenBank format and ABI or LI-COR trace files!
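
If you ever need to pull the residues out of such a file without GCG at hand, the two-period divider does most of the work. This is a minimal sketch under two assumptions: the documentation above the checksum line contains no ".." of its own, and everything alphabetic after the divider is sequence. The function name `read_gcg_sequence` is made up, and the example text is the slide's file with the documentation shortened.

```python
GCG_TEXT = """!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""


def read_gcg_sequence(text: str) -> str:
    """Return the residues after the '..' divider, dropping the position
    numbers, spaces, and line breaks used for layout."""
    _header, divider, body = text.partition("..")
    if not divider:
        raise ValueError("no '..' divider found; not GCG format?")
    return "".join(ch for ch in body if ch.isalpha())


seq = read_gcg_sequence(GCG_TEXT)
print(len(seq), seq[:10])
```

The length recovered this way should agree with the Length field on the checksum line.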
24
Logical terms for the Wisconsin Package

Sequence databases, nucleic acids:
  • GENBANKPLUS   all of GenBank plus EST, HTC, and GSS subdivisions
  • GBP           all of GenBank plus EST, HTC, and GSS subdivisions
  • GENBANK       all of GenBank except EST, HTC, and GSS subdivisions
  • GB            all of GenBank except EST, HTC, and GSS subdivisions
  • BA            GenBank bacterial subdivision
  • BACTERIAL     GenBank bacterial subdivision
  • EST           GenBank EST (Expressed Sequence Tags) subdivision
  • GSS           GenBank GSS (Genome Survey Sequences) subdivision
  • HTC           GenBank High Throughput cDNA
  • HTG           GenBank High Throughput Genomic
  • IN            GenBank invertebrate subdivision
  • INVERTEBRATE  GenBank invertebrate subdivision
  • OM            GenBank other mammalian subdivision
  • OTHERMAMM     GenBank other mammalian subdivision
  • OV            GenBank other vertebrate subdivision
  • OTHERVERT     GenBank other vertebrate subdivision
  • PAT           GenBank patent subdivision
  • PATENT        GenBank patent subdivision

Sequence databases, amino acids:
  • GENPEPT        GenBank CDS translations
  • GP             GenBank CDS translations
  • UNIPROT or UNI all of Swiss-Prot and all of SPTrEMBL
  • SWISSPROTPLUS  all of Swiss-Prot and all of SPTrEMBL
  • SWP            all of Swiss-Prot and all of SPTrEMBL
  • UNISPROT       all of Swiss-Prot (fully annotated)
  • SWISSPROT      all of Swiss-Prot (fully annotated)
  • SWISS          all of Swiss-Prot (fully annotated)
  • SW             all of Swiss-Prot (fully annotated)
  • UNITREMBL      Swiss-Prot preliminary EMBL translations
  • SPTREMBL       Swiss-Prot preliminary EMBL translations
  • SPT            Swiss-Prot preliminary EMBL translations
  • P              all of PIR Protein
  • PIR            all of PIR Protein
  • PIR1           PIR fully annotated subdivision
  • PIR2           PIR preliminary subdivision
  • PIR3           PIR unverified subdivision
  • PIR4           PIR unencoded subdivision

These are easy: they make sense and you'll have
a vested interest.
25
GCG MSF and RSF format

    !!AA_MULTIPLE_ALIGNMENT 1.0

    small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

    Name: a49171  Len: 425  Check:  537  Weight: 1.00
    Name: e70827  Len: 577  Check:   21  Weight: 1.00
    Name: g83052  Len: 718  Check: 9535  Weight: 1.00
    Name: f70556  Len: 534  Check: 3494  Weight: 1.00
    Name: t17237  Len: 229  Check: 9552  Weight: 1.00
    Name: s65758  Len: 735  Check:  111  Weight: 1.00
    Name: a46241  Len: 274  Check: 3514  Weight: 1.00

    //

    //////////////////////////////////////////////////

    !!RICH_SEQUENCE 1.0
    ..
    {
    name  ef1a_giala
    descrip  PileUp of @/users1/thompson/.seqlab-mendel/pileup_28.list
    type  PROTEIN
    longname  /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
    sequence-ID  Q08046
    checksum  7342
    offset  23
    creation-date  07/11/2001 16:51:19
    strand  1
    comments  ////////////////////////////////////////////////////////////

This is SeqLab's native format
  • The trick is to not forget the braces and wild
    card, e.g. filename{*}, when specifying!
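
The MSF header is regular enough to read with one pattern. The sketch below assumes the usual layout of "Name:", "Len:", "Check:", and "Weight:" fields on one line per sequence, with the header ending at the "//" divider; the names `NAME_LINE` and `msf_entries` are invented, and the text is this slide's header.

```python
import re

MSF_HEADER = """!!AA_MULTIPLE_ALIGNMENT 1.0

 small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

 Name: a49171  Len: 425  Check:  537  Weight: 1.00
 Name: e70827  Len: 577  Check:   21  Weight: 1.00
 Name: g83052  Len: 718  Check: 9535  Weight: 1.00
 Name: f70556  Len: 534  Check: 3494  Weight: 1.00
 Name: t17237  Len: 229  Check: 9552  Weight: 1.00
 Name: s65758  Len: 735  Check:  111  Weight: 1.00
 Name: a46241  Len: 274  Check: 3514  Weight: 1.00

//
"""

NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def msf_entries(text: str):
    """Return (name, length, checksum, weight) for each header entry."""
    header = text.split("//", 1)[0]  # ignore the alignment rows themselves
    return [
        (name, int(length), int(check), float(weight))
        for name, length, check, weight in NAME_LINE.findall(header)
    ]


entries = msf_entries(MSF_HEADER)
print(len(entries), "sequences; longest is", max(e[1] for e in entries))
```

In this example the longest entry has the same length as the value after "MSF:" on the header line.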

26
The List File Format
remember the @ sign!

    !!SEQUENCE_LIST 1.0
    An example GCG list file of many elongation 1a
    and Tu factors follows. As with all GCG data
    files, two periods separate documentation from
    data. ..

    my-special.pep begin:24 end:134
    SwissProt:EfTu_Ecoli
    Ef1a-Tu.msf{*}
    /usr/accounts/test/another.rsf{ef1a_*}
    @another.list

The way SeqLab works!
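
A list file is simple enough to read by hand or by script. The following is a minimal sketch, not GCG's own parser: it assumes one specification per line after the two-period divider, with optional attributes written as name:value (for example begin:24). It does not follow nested list files, and the name `read_list_file` is hypothetical.

```python
LIST_TEXT = """!!SEQUENCE_LIST 1.0
An example GCG list file of many elongation 1a and Tu factors follows ..

my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
"""


def read_list_file(text: str):
    """Return (specification, attributes) for each line after '..'."""
    _doc, divider, body = text.partition("..")
    if not divider:
        raise ValueError("no '..' divider found")
    result = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        spec, *attrs = line.split()
        # Attributes such as begin:24 become {"begin": "24"}.
        result.append((spec, dict(a.split(":", 1) for a in attrs if ":" in a)))
    return result


entries = read_list_file(LIST_TEXT)
for spec, attrs in entries:
    print(spec, attrs)
```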
27
SeqLab: GCG's X-based GUI!
  • SeqLab is the merger of Steve Smith's Genetic
    Data Environment and GCG's Wisconsin Package
    Interface:
  • GDE + WPI = SeqLab
  • Requires an X-Windowing environment, either
    native on UNIX computers (including Linux; it is
    not installed by default on Mac OS X v.10
    systems, but see Apple's free X11 package or
    XDarwin), or emulated with X-server software on
    personal computers.

28
Conclusions
Gunnar von Heijne, in his old but quite readable
treatise, Sequence Analysis in Molecular Biology:
Treasure Trove or Trivial Pursuit (1987),
provides a very appropriate conclusion: "Think
about what you're doing; use your knowledge of
the molecular system involved to guide both your
interpretation of results and your direction of
inquiry; use as much information as possible; and
do not blindly accept everything the computer
offers you." He continues: ". . . if any lesson
is to be drawn . . . it surely is that to be able
to make a useful contribution one must first and
foremost be a biologist, and only second a
theoretician . . . . We have to develop better
algorithms, we have to find ways to cope with the
massive amounts of data, and above all we have to
become better biologists. But that's all it
takes."
FOR MORE INFO...
  • Explore my Web Home: http://bio.fsu.edu/stevet/cv.html.
  • Contact me (stevet@bio.fsu.edu) for specific
    long-distance bioinformatics assistance and
    collaboration.

29
AND FOR EVEN MORE INFO...
Many texts are now available in the field. To
honk-my-own-horn a bit, check out:
  • Current Protocols in Bioinformatics from John Wiley &
    Sons, Inc. (http://www.does.org/cp/bioinfo.html),
  • Horizon Scientific Press's Computational
    Genomics: Theory and Application
    (http://www.horizonpress.com/hsp/books/com.html), and
  • Humana Press's Introduction to Bioinformatics: A
    Theoretical And Practical Approach
    (http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0).

They all asked me to contribute chapters on
multiple sequence alignment and analysis using
GCG software.
30
On to a demonstration of some of SeqLab's
multiple sequence dataset capabilities: some of
my prebuilt alignments, and . . . Elongation
Factor 1α/Tu, how to do it.