Part 12 Genome Analysis - PowerPoint PPT Presentation

About This Presentation

Title:

Part 12 Genome Analysis

Slides: 29

Provided by: chu8163

Transcript and Presenter's Notes

Title: Part 12 Genome Analysis

1
Part 12 Genome Analysis
2
Outline

Overview
Why do comparative genomic analysis?
Assumptions/Limitations
Genome Analysis and Annotation Standard Procedure
General Purpose Databases for Comparative Genomics
Organism Specific Databases
Genome Analysis Environments
Genome Sequence Alignment Programs
Genomic Comparison Visualization Tools

3
Some of the prokaryotic genomes
4
Some of the eukaryotic genomes

Aspergillus fumigatus - Farmer's lung - In progress
Dictyostelium discoideum - Soil amoeba - In progress
Entamoeba histolytica - Amoebic dysentery - In progress
Leishmania major - Leishmaniasis - In progress
Plasmodium falciparum - Malaria - In progress
Schistosoma mansoni - Bilharzia - In progress
Schizosaccharomyces pombe - Fission yeast - Complete
Theileria annulata - Veterinary - In progress
Toxoplasma gondii - Toxoplasmosis - In progress
Trypanosoma brucei - Sleeping sickness - In progress

5
Bioinformatics Flow Chart
1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene and protein expression data
7. Drug screening: ab initio drug design OR drug compound screening in database of molecules
8. Genetic variability
6
Genomic DNA
Shearing/Sonication
Subclone and Sequence
Shotgun reads
Assembly
Contigs
Finishing read
Finishing
Complete sequence
7
Genome Sequencing - Review
Strategy
Libraries
Sequencing
Assembly
Closure
Annotation
Release
8
Annotation of eukaryotic genomes
Genomic DNA
ab initio gene prediction
transcription
Unprocessed RNA
RNA processing
Mature mRNA
AAAAAAA
Gm3
Comparative gene prediction
translation
Nascent polypeptide
folding
Active enzyme
Functional identification
Reactant A
Product B
Function
9
Why do comparative genomics?

Many of the genes encoded in each genome from the genome projects had no known or predictable function
Analysis of protein sets from completely sequenced genomes
Uniform evolutionary conservation of proteins in microbial genomes: 70% of gene products from sequenced genomes have homologs in distant genomes (Koonin et al., 1997)
Function of many of these genes can be predicted by comparing different genomes of known functional annotation and transferring functional annotation of proteins from better-studied organisms to their orthologs in lesser-studied organisms
Cross-species comparison helps reveal conserved coding regions
No prior knowledge of the sequence motif is necessary
Complement to algorithmic analysis

10
Assumptions/Limitations

Homologous genes are relatively well preserved, while noncoding regions tend to show varying degrees of conservation. Conserved noncoding regions are believed to be important in regulating gene expression, maintaining the structural organization of the genome, and most likely other functions.
Cross-species comparative genomics is influenced by the evolutionary distance between the compared species.

11
Genome Analysis and Annotation General Procedure

Basic procedure to determine the functional and structural annotation of uncharacterized proteins:
Use a sequence similarity search program such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required, Smith-Waterman-based programs are preferred, at the cost of greater analysis time.
Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity.
Generate a secondary and, if possible, tertiary structure prediction.
Annotation:
Transfer function information from a well-characterized organism to a lesser-studied organism, and/or
Use phylogenetic patterns (or profiles), and/or
Use the phylogenetic pattern search tools (e.g. through COGs) to perform systematic formal logical operations (AND, OR, NOT) on gene sets: differential genome display (Huynen et al., 1997).

12
Genome Analysis and Annotation: One Possible Procedure

Basic procedure to determine the functional and structural annotation of uncharacterized proteins:
Use a sequence similarity search program such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required, Smith-Waterman-based programs are preferred, at the cost of greater analysis time.
Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity.
Generate a secondary and, if possible, tertiary structure prediction.
Transfer function information from a well-characterized organism to a lesser-studied organism, and/or use phylogenetic patterns (or profiles), and/or use the phylogenetic pattern search tools (e.g. through COGs) to perform systematic formal logical operations (AND, OR, NOT) on gene sets: differential genome display (Huynen et al., 1997).
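
The logical operations on gene sets mentioned above can be illustrated with plain set arithmetic. This is only a minimal sketch: the genome names, the COG-style identifiers, and the function pattern_query are hypothetical and are not taken from the COG tools themselves.

```python
# Hypothetical toy data: genome name -> set of orthologous-group IDs.
# Both the genome names and the group contents are invented for illustration.
genomes = {
    "pathogen_A": {"COG0001", "COG0002", "COG0003", "COG0007"},
    "pathogen_B": {"COG0001", "COG0003", "COG0005", "COG0007"},
    "free_living": {"COG0001", "COG0002", "COG0005"},
}


def pattern_query(genomes, present_in_all=(), present_in_any=(), absent_from=()):
    """Return the ortholog groups that match an AND / OR / NOT presence pattern."""
    # Start from every group seen in any genome, then narrow down.
    result = set().union(*genomes.values())
    for name in present_in_all:          # AND: must occur in each listed genome
        result &= genomes[name]
    if present_in_any:                   # OR: must occur in at least one listed genome
        result &= set().union(*(genomes[n] for n in present_in_any))
    for name in absent_from:             # NOT: must be missing from each listed genome
        result -= genomes[name]
    return result


# Groups shared by both pathogens but missing from the free-living relative.
shared_pathogen_only = pattern_query(
    genomes,
    present_in_all=["pathogen_A", "pathogen_B"],
    absent_from=["free_living"],
)
print(sorted(shared_pathogen_only))  # ['COG0003', 'COG0007']
```

Groups returned by such a query are candidates for lineage-specific functions, which is the idea behind differential genome display.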

13
Automated Genome Annotation

GeneQuiz: limited number of searches/day
MAGPIE: outside users cannot submit their own sequences
PEDANT: commercial version allows full capacity
SEALS: semi-automated

14
General Databases Useful for Comparative Genomics

LocusLink/RefSeq http://www.ncbi.nih.gov/LocusLink/
PEDANT - Protein Extraction, Description and ANalysis Tool http://pedant.gsf.de/
MIPS http://mips.gsf.de/
COGs - Clusters of Orthologous Groups (of proteins) http://www.ncbi.nih.gov/COG/
KEGG - Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/
MBGD - Microbial Genome Database http://mbgd.genome.ad.jp/
GOLD - Genomes OnLine Database http://wit.integratedgenomics.com/GOLD/
TOGA http://www.tigr.org/xxxxx

15
Problems with existing sequence alignment algorithms for genomic analysis

Most algorithms were developed for comparing single protein sequences or DNA sequences containing a single gene.
Most algorithms assign a score to every possible alignment (usually the sum of the similarity/identity values for each aligned residue minus a penalty for the introduction of gaps) and then find the optimal or near-optimal alignment under the chosen scoring scheme.
Unfortunately, most of these programs cannot accurately handle long alignments.
Linear-space Smith-Waterman variants are too computationally intensive: they either require specialized hardware (memory-limited) or are very time-consuming. The trade-off is higher speed vs. increased sensitivity.
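
The scoring scheme described above (per-residue similarity values minus gap penalties, maximized over all alignments) can be sketched as a small Smith-Waterman scorer. This is an illustrative sketch, not any particular program's implementation: the function name and the match/mismatch/gap values are assumptions, and a simple linear gap penalty is used.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty.

    Only two rows of the dynamic-programming matrix are kept, so memory is
    proportional to len(b), but the running time is still len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment here
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))  # 14: eight matches, one gap
```

Every cell of the matrix is still visited, so comparing two sequences of one million bases each means on the order of 10^12 cell updates, which is why such methods are impractical at genome scale without heuristics.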

16
Genome-size comparative alignment tools

ASSIRC - Accelerated Search for SImilarity Regions in Chromosomes ftp://ftp.biologie.ens.fr/pub/molbio/ (Vincens et al. 1998)
BLAT http://genome.ucsc.edu/cgi-bin/hgBlat?command=start (Kent xxx)
DIALIGN - DIagonal ALIGNment http://www.gsf.de/biodv/dialign.html (Morgenstern et al. 1998; Morgenstern 1999)
DBA - DNA Block Aligner http://www.sanger.ac.uk/Software/Wise2/dba.shtml (Jareborg et al. 1999)
GLASS - GLobal Alignment SyStem http://plover.lcs.mit.edu/ (Batzoglou et al. 2000)
LSH-ALL-PAIRS - Locality-Sensitive Hashing in ALL PAIRS Email: jbuhler_at_cs.washington.edu (Buhler 2001)
MegaBlast http://www.ncbi.nih.gov/blast/ (Zhang 2000)
MUMmer - Maximal Unique Match (mer) http://www.tigr.org/softlab/ (Delcher et al. 1999)
PIPMaker - Percent Identity Plot MAKER http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
SSAHA - Sequence Search and Alignment by Hashing Algorithm

17
SSAHA

Sequence Search and Alignment by Hashing Algorithm
Software tool for very fast matching and alignment of DNA sequences
Achieves fast search speed by converting sequence information into a hash table data structure, which can then be searched very rapidly for matches
http://www.sanger.ac.uk/Software/analysis/SSAHA/
Run from the Unix command line
Needs > 1 GB RAM (needs a lot of memory)
The SSAHA algorithm is best for applications requiring exact or almost exact matches between two sequences, e.g. SNP detection, fast sequence assembly, ordering and orientation of contigs
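
The hash-table idea can be sketched in a few lines of Python. This is a simplified illustration of k-mer hashing in general, not SSAHA's actual data structure or code: the function names, the toy sequences, and the choice to index every overlapping k-mer are all assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(subject, k):
    """Map every k-mer of the subject sequence to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def find_exact_hits(query, index, k):
    """Return (query_pos, subject_pos) pairs for each query k-mer found in the index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        # .get() avoids inserting empty entries into the defaultdict.
        for spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, spos))
    return hits


def best_diagonal(hits):
    """Return (offset, hit_count) for the most populated diagonal, or None.

    Hits sharing the same offset (subject_pos - query_pos) line up on one
    diagonal, which suggests where the query matches without gaps.
    """
    counts = defaultdict(int)
    for qpos, spos in hits:
        counts[spos - qpos] += 1
    return max(counts.items(), key=lambda kv: kv[1]) if counts else None


subject = "TTGACCGTAGGCTAACGT"   # toy "database" sequence
query = "CCGTAGGC"               # occurs in subject at position 4
index = build_kmer_index(subject, k=4)
hits = find_exact_hits(query, index, k=4)
print(best_diagonal(hits))  # (4, 5): five 4-mer hits, all at offset 4
```

Building the index costs memory up front, but each query k-mer is then resolved by a single dictionary lookup, which is where the speed comes from and why the memory requirement is high for whole genomes.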

18
Genome Analysis Environment

MAGPIE - Multipurpose Automated Genome Project Investigation Environment
PEDANT
SEALS

19
Problems with Visualizing Genomes

Alignment program output was often viewed as text files, which are difficult to interpret intuitively when comparing genomes.
Visualization tools are needed to handle the complexity and volume of data and to present the information to a biologist in a comprehensive and comprehensible manner for interpretation.
Genome alignment visualization tools need to provide:
Interpretable alignments
Gene predictions and database homologies from different sources
Interactive features: real-time capabilities, zooming, searching specific regions of homology
Representation of breaks in synteny
Display of multiple alignments
Display of contigs of unfinished genomes alongside finished genomes
Handling of various data formats
Software availability (no black box)

20
Genome Comparison Visualization Tools

ACT - Artemis Comparison Tool (displays parsed BLAST alignments; based on Artemis, an annotation tool) http://www.sanger.ac.uk/Software/ACT/
Alfresco (displays DBA alignments and ...) http://www.sanger.ac.uk/Software/Alfresco/ (Jareborg & Durbin 2000)
PipMaker (displays BlastZ alignments) http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
Enteric/Menteric/Maj (displays BlastZ alignments) http://glovin.cse.psu.edu/enterix/ (Florea et al. 2000; McClelland et al. 2000)
Intronerator (displays WABA alignments and ...) http://www.cse.ucsc.edu/kent/intronerator/ (Kent & Zahler 2000b)
VISTA - Visualization Tool for Alignment (displays GLASS alignments) http://www-gsd.lbl.gov/vista/
SynPlot (displays DIALIGN and GLASS alignments) http://www.sanger.ac.uk/Users/igrg/SynPlot/

21
Artemis Comparison Tool (ACT)

ACT is a DNA sequence comparison viewer based on Artemis
Can read complete EMBL and GenBank entries, or sequences in FASTA or raw format
Additional sequence features can be in EMBL, GenBank, or GFF format
ACT is free software and is distributed under the GNU General Public License
Java-based software
Latest release (2.0) better supports eukaryotic genome comparison
http://www.sanger.ac.uk/Software/ACT/

22
Salmonella typhi vs. E. coli SPI-2
Figure labels: GC, tRNA, phage/IS genes, pseudogenes; S. typhi; BLAST hits; E. coli
23
Salmonella typhi and Yersinia pestis type III
secretion systems
24
Salmonella typhi vs. E. coli - ACT
Figure labels: SPI-10, SPI-1, SPI-2, SPI-9, SPI-7 (Vi); S. typhi; DNA matches; E. coli
25
Neisseria meningitidis - A vs. B comparison - ACT
26
Extra Slides 1
27
ASSIRC

Accelerated Search for SImilarity Regions in Chromosomes
ASSIRC finds regions of similarity in pair-wise genomic sequence alignments.
The method involves three steps:
(i) identification of short exact chains of fixed size, called 'seeds', common to both sequences, using hashing functions
(ii) extension of these seeds into putative regions of similarity by a 'random walk' procedure (i.e. the four bases are associated ...)
(iii) final selection of regions of similarity by assessing alignments of the putative sequences.
The authors used simulations to estimate the proportion of regions of similarity not detected for particular region sizes, base identity proportions and seed sizes.
This approach can be tailored to the user's specifications.
The authors looked for regions of similarity between two yeast chromosomes (V and IX). The efficiency of the approach was compared to that of the conventional programs BLAST and FASTA by assessing the CPU time required and the regions of similarity found for the same data set.
http://www.biologie.ens.fr/perso/vincens/assirc.html
ftp://ftp.biologie.ens.fr/pub/molbio/assirc.tar.gz

28
BLAT

Only DNA sequences of 25,000 or fewer bases and protein or translated sequences of 5,000 or fewer letters will be processed. If multiple sequences are submitted at the same time, the total limit is 50,000 bases or 12,500 letters.
BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein BLAT on land vertebrates.
BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome in memory. The index consists of all non-overlapping 11-mers except for those heavily involved in repeats. The index takes up a bit less than a gigabyte of RAM. The genome itself is not kept in memory, allowing BLAT to deliver high performance on a reasonably priced Linux box. The index is used to find areas of probable homology, which are then loaded into memory for a detailed alignment. Protein BLAT works in a similar manner, except with 4-mers rather than 11-mers. The protein index takes a little more than 2 gigabytes.
BLAT was written by Jim Kent. Like most of Jim's software, interactive use on this web server is free to all. Sources and executables to run batch jobs on your own server are available free for academic, personal, and non-profit purposes. Non-exclusive commercial licenses are also available. Contact Jim for details.
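
The indexing strategy described above (non-overlapping k-mers of the genome, with repeat-heavy k-mers left out, looked up by every overlapping k-mer of the query) can be sketched as follows. This is a toy illustration and not BLAT's source code: the function names, the max_occurrences cutoff used as a stand-in for repeat filtering, and the 4-mer demo sequences are all assumptions.

```python
from collections import defaultdict


def build_tile_index(genome, k=11, max_occurrences=None):
    """Index the non-overlapping k-mers ("tiles") of the genome.

    Returns a dict mapping tile -> list of start positions. Tiles that occur
    more than max_occurrences times are dropped, a crude stand-in for
    excluding k-mers heavily involved in repeats.
    """
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):  # step of k: tiles never overlap
        index[genome[pos:pos + k]].append(pos)
    if max_occurrences is not None:
        return {t: p for t, p in index.items() if len(p) <= max_occurrences}
    return dict(index)


def candidate_offsets(query, index, k=11):
    """Look up every overlapping k-mer of the query in the tile index.

    Returns a dict mapping offset (genome_pos - query_pos) -> number of tile
    hits. Offsets with several hits mark areas of probable homology that
    would then be aligned in detail.
    """
    offsets = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            offsets[gpos - qpos] += 1
    return dict(offsets)


# Toy demo with 4-mers instead of 11-mers so the sequences stay short.
genome = "ACGTTGCAAGCTTAGGCCATTACG"
query = genome[6:18]  # "CAAGCTTAGGCC", taken from position 6
tile_index = build_tile_index(genome, k=4)
print(candidate_offsets(query, tile_index, k=4))  # {6: 2}
```

Because only every k-th position of the genome is stored, the index is roughly k times smaller than one holding all overlapping k-mers, while scanning the query at every position still lets a sufficiently long exact match land on a stored tile.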