Marine Biological Laboratory - PowerPoint PPT Presentation

Marine Biological Laboratory Workshop on Molecular Evolution, Woods Hole, Massachusetts, July 25, 2006, 7 to 10 PM

Provided by: Steve1990

1
Marine Biological Laboratory Workshop on
Molecular Evolution
Woods Hole, Massachusetts
July 25, 2006, 7 to 10 PM
2
Multiple Sequence Alignment and Analysis through
GCG's SeqLab

Steven M. Thompson
Florida State University School of Computational
Science (SCS)

More data yields stronger analyses if done
carefully! Mosaic ideas and evolutionary
importance.
3
But first a prelude: my definitions

Biocomputing and computational biology are
synonymous and describe the use of computers and
computational techniques to analyze any
biological system, from molecules, through cells,
tissues, organisms, and populations, to complete
ecologies.
Bioinformatics describes using computational
techniques to access, analyze, and interpret the
biological information in any of the available
online biological databases.
Sequence analysis is the study of molecular
sequence data for the purpose of inferring the
function, mechanism, interactions, evolution, and
perhaps structure of biological molecules.
Genomics analyzes the context of genes or
complete genomes (the total DNA content of an
organism) within and across genomes.
Proteomics is a subdivision of genomics concerned
with analyzing the complete protein complement,
i.e. the proteome, of organisms, both within and
between different organisms.

4
And a way to think about it: the reverse
biochemistry analogy

From a virtual DNA sequence to actual molecular
physical characterization, not the other way
round.
Using bioinformatics tools, you can infer all
sorts of functional, evolutionary, and
structural insights into a gene product, without
the need to isolate and purify massive amounts of
protein! Eventually you can go on to clone and
express the gene based on that analysis using PCR
techniques.
The computer and molecular databases are an
essential part of this process.

5
The exponential growth of molecular sequence
databases and cpu power

Year     BasePairs   Sequences
1982        680338         606
1983       2274029        2427
1984       3368765        4175
1985       5204420        5700
1986       9615371        9978
1987      15514776       14584
1988      23800000       20579
1989      34762585       28791
1990      49179285       39533
1991      71947426       55627
1992     101008486       78608
1993     157152442      143492
1994     217102462      215273
1995     384939485      555694
1996     651972984     1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570

Doubling time: about 1 year!
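
The doubling-time claim can be checked directly against the table. The following is a minimal sketch that assumes pure exponential growth between two chosen years; the helper name doubling_time is illustrative only, and the three base-pair counts are copied from the table.

```python
import math

# (year -> base pairs), copied from the first, second-to-last, and last table rows.
BASE_PAIRS = {1982: 680_338, 1998: 2_008_761_784, 1999: 3_841_163_011}

def doubling_time(year_a, year_b, table=BASE_PAIRS):
    """Years needed to double, assuming exponential growth between two years."""
    growth = table[year_b] / table[year_a]
    return (year_b - year_a) * math.log(2) / math.log(growth)

overall = doubling_time(1982, 1999)   # averaged over the whole table
recent = doubling_time(1998, 1999)    # last year only
print(f"1982-1999: {overall:.2f} years; 1998-1999: {recent:.2f} years")
```

Averaged over the whole table, the doubling time comes out somewhat longer than a year. The final year alone is close to one year, which is consistent with the slide's rounded figure.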
6
Back to multiple sequence alignment:
Applicability?
So what; why even bother? Applications:
Probe/primer and motif/profile design
Graphical illustrations
Comparative homology inference
Molecular evolutionary analysis
OK, well, how do you do it?
7
Dynamic programming's complexity increases
exponentially with the number of sequences being
compared

N-dimensional matrix . . . .
complexity = [sequence length]^(number of sequences)
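
To get a feel for that growth, here is a small sketch that counts the cells of a full N-dimensional dynamic programming matrix. The function name dp_cells and the "+1" per dimension (the usual extra row or column for a leading gap) are assumptions of this illustration, not something stated on the slide.

```python
def dp_cells(seq_length: int, n_sequences: int) -> int:
    """Cells in a full N-dimensional dynamic programming matrix.

    Assumes every sequence has the same length and that each dimension
    carries one extra position for the leading gap.
    """
    return (seq_length + 1) ** n_sequences

# Sequences of 300 residues: each extra sequence multiplies the work by ~300.
for n in (2, 3, 4, 5):
    print(n, dp_cells(300, n))
```

Two sequences of 300 residues need about ninety thousand cells, and five already need more than a trillion. This is why the exact approach is abandoned for real datasets.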

8
Global heuristic solutions
See MSA (global within bounding box)
and PIMA (local portions only) on the multiple
alignment page at the Baylor College of
Medicine's Search Launcher,
http://searchlauncher.bcm.tmc.edu/,
but severely limiting restrictions!
9
Multiple Sequence Dynamic Programming
Therefore, pairwise, progressive dynamic
programming restricts the solution to the
neighborhood of only two sequences at a
time. All sequences are compared, pairwise, and
then each is aligned to its most similar partner
or group of partners. Each group of partners is
then aligned to finish the complete multiple
sequence alignment.
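
The "compare all pairs, then join the most similar first" idea can be sketched as follows. This is a toy illustration, not PileUp's or Clustal's actual procedure: it uses a plain Needleman-Wunsch score with made-up match, mismatch, and gap values, and a greedy join order instead of a real guide tree. The names nw_score and join_order are hypothetical.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def join_order(seqs):
    """Order in which sequences would enter a progressive alignment:
    the most similar pair first, then whichever remaining sequence
    scores best against any sequence already in the group."""
    scores = {frozenset(p): nw_score(seqs[p[0]], seqs[p[1]])
              for p in combinations(seqs, 2)}
    first = max(scores, key=scores.get)
    order = sorted(first)
    remaining = set(seqs) - first
    while remaining:
        best = max(remaining,
                   key=lambda r: max(scores[frozenset((r, o))] for o in order))
        order.append(best)
        remaining.remove(best)
    return order

toy = {"a": "GATTACA", "b": "GATTACC", "c": "GCTTGCA", "d": "TTTTTTT"}
print(join_order(toy))
```

Only pairwise matrices are ever filled, so the cost grows with the number of pairs rather than exponentially with the number of sequences. The price is that early pairings are never revisited.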
10
Reliability and the Comparative Approach

Explicit homologous correspondence:
manual adjustments should be encouraged, based
on knowledge,
especially of structural, regulatory, and
functional sites.
Therefore, editors like SeqLab and
the Ribosomal Database Project:
http://rdp.cme.msu.edu/index.jsp

11
Structural and functional correspondence in the
Wisconsin Package's SeqLab
12
Work with proteins! If at all possible.

Twenty match symbols versus four, plus
similarity! Way better signal to noise.
Also guarantees no indels are placed within
codons. So translate, then align.
Nucleotide sequences will only reliably align if
they are very similar to each other. And they
will require extensive hand editing and careful
consideration.
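
"Translate, then align" can be sketched in a few lines. The example assumes an in-frame coding sequence and the standard genetic code. The helper names translate and thread_codons are illustrative, and the aligned protein string would come from whatever alignment program you use.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate an in-frame coding sequence; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - len(dna) % 3, 3))

def thread_codons(aligned_protein, dna, gap="-"):
    """Copy the gaps of an aligned protein back onto its coding DNA,
    three gap characters per protein gap, so codons stay intact."""
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(gap * 3 if aa == gap else next(codons)
                   for aa in aligned_protein)

print(translate("ATGGCCAAATGA"))            # MAK*
print(thread_codons("M-AK", "ATGGCCAAA"))   # ATG---GCCAAA
```

Because every protein gap becomes exactly three nucleotide gaps, no indel can land inside a codon.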

13
Beware of aligning apples and oranges and
grapefruit!

Paralogous versus orthologous
genomic versus cDNA
mature versus precursor.

14
Mask out uncertain areas
15
Complications

Order dependence.
Not that big of a deal.
Substitution matrices and gap penalties.
A very big deal!
Regional realignment becomes incredibly
important, especially with sequences that have
areas of high and low similarity (GCG PileUp
-InSitu option).

16
Complications cont.

Format hassles!
Specialized format conversion tools such as GCG's
SeqConv program and PAUPSearch, and
Don Gilbert's public domain ReadSeq program.

17
Still more complications

Indels and missing data symbols (i.e. gaps)
designation discrepancy headaches:
., -, ~, ?, N, or X
. . . . . Help!
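
One way to tame the symbol mix before moving an alignment between programs is to map every gap-like character onto a single one. The sketch below assumes that '.', '~', and '-' all mean "gap" and that '?', 'N', and 'X' mean "unknown residue" and should be left alone. The tilde is included as an assumption about GCG-style output, and normalize_gaps is a hypothetical helper, not part of any package.

```python
# Assumed convention for this sketch: '.', '~' and '-' all mean "gap";
# '?', 'N' and 'X' mean "unknown residue" and are left untouched.
GAP_CHARS = ".~-"

def normalize_gaps(aligned, gap="-"):
    """Replace every gap-like character in an aligned sequence with one symbol."""
    return aligned.translate(str.maketrans(GAP_CHARS, gap * len(GAP_CHARS)))

print(normalize_gaps("~~AC..GT-N?"))   # --AC--GT-N?
```

Check what the target program expects first: some treat '?' as missing data and '-' as a true indel, and 'N' is a real residue (asparagine) in protein data.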

18
Web resources for pairwise, progressive multiple
alignment
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
http://pbil.univ-lyon1.fr/alignment.html
http://www.ebi.ac.uk/clustalw/
http://searchlauncher.bcm.tmc.edu/
However, problems with
very large datasets and huge multiple alignments
make doing multiple sequence alignment on the Web
impractical after your dataset has reached a
certain size. You'll know it when you're there!
19
If large datasets become intractable for analysis
on the Web, what other resources are available?

Desktop software solutions: public domain
programs are available, but . . . complicated to
install, configure, and maintain. User must be
pretty computer savvy. So,
commercial software packages are available, e.g.
MacVector, DS Gene, DNAsis, DNAStar, etc.,
but . . . license hassles, big expense per
machine, and Internet and/or CD database access
all complicate matters!

20
Therefore, UNIX server-based solutions

Public domain solutions also exist, but now a
very cooperative systems manager needs to
maintain everything for users, so,
commercial products, e.g. the Accelrys GCG
Wisconsin Package and the SeqLab Graphical User
Interface, simplify matters for administrators
and users. One format, one look-and-feel.
One license fee for an entire institution and
very fast, convenient database access on local
server disks. Connections from any networked
terminal or workstation anywhere!
Operating system: UNIX command line operation
hassles; communications software: telnet, ssh,
and terminal emulation; X graphics; file transfer:
ftp and scp/sftp; and editors: vi, emacs,
pico (or desktop word processing followed by file
transfer; save as "text only!"). See my
supplement pdf file.

21
The Genetics Computer Group

The Accelrys Wisconsin Package for Sequence
Analysis
GCG began in 1982 in Oliver Smithies' Genetics
Dept. lab at the University of Wisconsin,
Madison. In 1990 it became a private company,
which was acquired by the Oxford Molecular
Group, U.K., in 1997 and then by
Pharmacopeia Inc., U.S.A., in 2000. In
2004 Accelrys, San Diego, California, left
Pharmacopeia to become an independent entity.
The suite contains around 150 programs designed
to work in a toolbox fashion. Several simple
programs used in succession can lead to very
sophisticated results.
Also internal compatibility, i.e. once you
learn to use one program, all programs can be run
similarly, and the output from many programs can
be used as input for other programs.
Used all over the world at over 950 institutions,
so learning it will likely be useful at other
research institutions as well.

22
To answer the always perplexing GCG question,
"What sequence(s)? . . . ."
Specifying sequences, GCG style, in order of
increasing power and complexity:

The sequence is in a local GCG format single
sequence file in your UNIX account. (GCG
Reformat and SeqConv programs)
The sequence is in a local GCG database, in which
case you point to it by using any of the GCG
database logical names. A colon, ":", always
sets the logical name apart from either an
accession number or a proper identifier name or a
wildcard expression, and they are case
insensitive.
The sequence is in a GCG format multiple sequence
file, either an MSF (multiple sequence format)
file or an RSF (rich sequence format) file. To
specify sequences contained in a GCG multiple
sequence file, supply the file name followed by a
pair of braces, "{}", containing the sequence
specification, e.g. a wildcard, {*}.
Finally, the most powerful method of specifying
sequences is in a GCG list file. It is merely
a list of other sequence specifications and can
even contain other list files within it. The
convention to use a GCG list file in a program is
to precede it with an at sign, "@". Furthermore,
you can supply attribute information within list
files to specify something special about the
sequence such as begin and end constraints.
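
The four styles can be told apart by their punctuation alone. The following sketch shows that idea; classify_spec is a hypothetical helper written for this transcript, not a GCG program, and it assumes the at sign, braces, and colon conventions described above.

```python
import re

def classify_spec(spec):
    """Classify a GCG-style sequence specification (illustrative only).

    Returns (kind, name, selector).
    """
    spec = spec.strip()
    if spec.startswith("@"):                       # list file
        return ("list_file", spec[1:], None)
    m = re.fullmatch(r"(.+)\{(.*)\}", spec)        # file{selection}
    if m:
        return ("multiple_sequence_file", m.group(1), m.group(2))
    if ":" in spec:                                # logical_name:entry
        logical, entry = spec.split(":", 1)
        # Logical names are case insensitive, so normalize them.
        return ("database", logical.upper(), entry)
    return ("single_sequence_file", spec, None)

for s in ("my-special.pep", "SwissProt:EfTu_Ecoli", "Ef1a-Tu.msf{*}", "@another.list"):
    print(s, "->", classify_spec(s))
```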

23
Clean GCG format single sequence file after
reformat (or the SeqConv program)
!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format. Always put some
documentation on top, so in the future you can figure out what it is
you're dealing with! The line with the two periods is converted to the
checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA

SeqLab's Editor mode can also Import native
GenBank format and ABI or LI-COR trace files!
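
Reading such a file back is mostly a matter of finding the two periods. The sketch below assumes the first line containing ".." ends the documentation and that everything after it is sequence plus position numbers. The function read_gcg_sequence is illustrative, and it deliberately keeps letters only, so gap characters would be dropped.

```python
def read_gcg_sequence(text):
    """Return (documentation, sequence) from GCG single-sequence text."""
    lines = text.splitlines()
    # The first line containing '..' ends the documentation.
    divider = next(i for i, line in enumerate(lines) if ".." in line)
    doc = "\n".join(lines[:divider + 1])
    body = "".join(lines[divider + 1:])
    # Drop position numbers and whitespace; keep letters only.
    sequence = "".join(ch for ch in body if ch.isalpha())
    return doc, sequence

example = """!!NA_SEQUENCE 1.0
A small example.
example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""
doc, seq = read_gcg_sequence(example)
print(len(seq))   # should agree with the Length field on the checksum line
```

This is also why the documentation itself must never contain two adjacent periods.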
24
Logical terms for the Wisconsin Package

Sequence databases, nucleic acids

GENBANKPLUS    all of GenBank plus EST, HTC, and GSS subdivisions
GBP            all of GenBank plus EST, HTC, and GSS subdivisions
GENBANK        all of GenBank except EST, HTC, and GSS subdivisions
GB             all of GenBank except EST, HTC, and GSS subdivisions
BA             GenBank bacterial subdivision
BACTERIAL      GenBank bacterial subdivision
EST            GenBank EST (Expressed Sequence Tags) subdivision
GSS            GenBank GSS (Genome Survey Sequences) subdivision
HTC            GenBank High Throughput cDNA
HTG            GenBank High Throughput Genomic
IN             GenBank invertebrate subdivision
INVERTEBRATE   GenBank invertebrate subdivision
OM             GenBank other mammalian subdivision
OTHERMAMM      GenBank other mammalian subdivision
OV             GenBank other vertebrate subdivision
OTHERVERT      GenBank other vertebrate subdivision
PAT            GenBank patent subdivision
PATENT         GenBank patent subdivision

Sequence databases, amino acids

GENPEPT        GenBank CDS translations
GP             GenBank CDS translations
UNIPROT or UNI all of Swiss-Prot and all of SPTrEMBL
SWISSPROTPLUS  all of Swiss-Prot and all of SPTrEMBL
SWP            all of Swiss-Prot and all of SPTrEMBL
UNISPROT       all of Swiss-Prot (fully annotated)
SWISSPROT      all of Swiss-Prot (fully annotated)
SWISS          all of Swiss-Prot (fully annotated)
SW             all of Swiss-Prot (fully annotated)
UNITREMBL      Swiss-Prot preliminary EMBL translations
SPTREMBL       Swiss-Prot preliminary EMBL translations
SPT            Swiss-Prot preliminary EMBL translations
P              all of PIR Protein
PIR            all of PIR Protein
PIR1           PIR fully annotated subdivision
PIR2           PIR preliminary subdivision
PIR3           PIR unverified subdivision
PIR4           PIR unencoded subdivision

These are easy: they make sense and you'll have
a vested interest.
25
GCG MSF and RSF format
!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

Name: a49171  Len: 425  Check:  537  Weight: 1.00
Name: e70827  Len: 577  Check:   21  Weight: 1.00
Name: g83052  Len: 718  Check: 9535  Weight: 1.00
Name: f70556  Len: 534  Check: 3494  Weight: 1.00
Name: t17237  Len: 229  Check: 9552  Weight: 1.00
Name: s65758  Len: 735  Check:  111  Weight: 1.00
Name: a46241  Len: 274  Check: 3514  Weight: 1.00

//

//////////////////////////////////////////////////

!!RICH_SEQUENCE 1.0
..
name  ef1a_giala
descrip  PileUp of @/users1/thompson/.seqlab-mendel/pileup_28.list
type  PROTEIN
longname  /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID  Q08046
checksum  7342
offset  23
creation-date  07/11/2001 16:51:19
strand  1
comments  ////////////////////////////////////////////////////////////

This is SeqLab's native format

The trick is to not forget the braces and wild
card, e.g. filename{*}, when specifying!
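
The per-sequence header lines of an MSF file are regular enough to pull apart with one pattern. This sketch assumes the usual "Name: ... Len: ... Check: ... Weight: ..." labelling with colons; msf_names is a hypothetical helper, not part of the Wisconsin Package.

```python
import re

# One header line per sequence; field labels are assumed to end with a colon.
NAME_LINE = re.compile(
    r"Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)")

def msf_names(header_text):
    """Return [(name, length, checksum, weight)] from an MSF header."""
    return [(name, int(length), int(check), float(weight))
            for name, length, check, weight in NAME_LINE.findall(header_text)]

header = """
 Name: a49171  Len: 425  Check:  537  Weight: 1.00
 Name: e70827  Len: 577  Check:   21  Weight: 1.00
"""
print(msf_names(header))
```

The names recovered this way are exactly what goes inside the braces when specifying individual sequences from the file.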

26
The List File Format
remember the @ sign!

!!SEQUENCE_LIST 1.0
An example GCG list file of many elongation 1a
and Tu factors follows. As with all GCG data
files, two periods separate documentation from
data. ..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
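
A list file is simple enough to read with a few lines of code. The sketch below assumes that everything up to the first ".." is documentation, that each following line holds one specification, and that attributes are written as name:value pairs after it. The function read_list_file and the sample text are illustrative only.

```python
def read_list_file(text):
    """Return [(specification, attributes)] from GCG list-file text."""
    # Everything up to and including the first '..' is documentation.
    _, _, data = text.partition("..")
    entries = []
    for line in data.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Attributes such as begin:24 follow the specification itself.
        attrs = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        entries.append((fields[0], attrs))
    return entries

sample = """!!SEQUENCE_LIST 1.0
An example list file. ..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
@another.list
"""
entries = read_list_file(sample)
for spec, attrs in entries:
    print(spec, attrs)
```

An entry that itself starts with an at sign is a nested list file and would be read the same way, recursively.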

The way SeqLab works!
27
SeqLab: GCG's X-based GUI!

SeqLab is the merger of Steve Smith's Genetic
Data Environment and GCG's Wisconsin Package
Interface:
GDE + WPI = SeqLab
Requires an X-Windowing environment, either
native on UNIX computers (including LINUX; not
installed by default on Mac OS X v.10 systems,
but see Apple's free X11 package or XDarwin),
or emulated with X-Server software on
personal computers.

28
Conclusions
Gunnar von Heijne in his old but quite readable
treatise, Sequence Analysis in Molecular Biology:
Treasure Trove or Trivial Pursuit (1987),
provides a very appropriate conclusion: "Think
about what you're doing; use your knowledge of
the molecular system involved to guide both your
interpretation of results and your direction of
inquiry; use as much information as possible; and
do not blindly accept everything the computer
offers you." He continues: ". . . if any lesson
is to be drawn . . . it surely is that to be able
to make a useful contribution one must first and
foremost be a biologist, and only second a
theoretician . . . . We have to develop better
algorithms, we have to find ways to cope with the
massive amounts of data, and above all we have to
become better biologists. But that's all it
takes."
FOR MORE INFO...

Explore my Web Home: http://bio.fsu.edu/stevet/cv.html.
Contact me (stevet@bio.fsu.edu) for specific
long-distance bioinformatics assistance and
collaboration.

29
AND FOR EVEN MORE INFO...
Many texts are now available in the field. To
honk-my-own-horn a bit, check out:
Current Protocols in Bioinformatics from John Wiley &
Sons, Inc. (http://www.does.org/cp/bioinfo.html)
Horizon Scientific Press: Computational
Genomics: Theory and Application
(http://www.horizonpress.com/hsp/books/com.html)
Humana Press: Introduction to Bioinformatics: A
Theoretical And Practical Approach
(http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0)
They all asked me to contribute chapters on
multiple sequence alignment and analysis using
GCG software.
30
On to a demonstration of some of SeqLab's
multiple sequence dataset capabilities, some of
my prebuilt alignments, and . . . Elongation
Factor 1α/Tu, how to do it.
