Title: Comparative genomics: functional characterization of new genes and regulatory interactions using computer analysis
1. Comparative genomics: functional characterization of new genes and regulatory interactions using computer analysis
- Mikhail Gelfand
- Institute for Information Transmission Problems (The Kharkevich Institute), RAS
- Workshop at the Landau Institute of Theoretical Physics, RAS
- September 27-28, 2007, Moscow
2. The genome is deciphered!
3. Is it?
- To intercept a message does not mean to understand it
4. Fragment of a genome (0.1% of E. coli)
A typical bacterial genome: several million nucleotides, 600 through 9,000 genes (90% of the genome encodes proteins)
5. Propaganda
sequences in GenBank (genes)
articles in PubMed (experiments)
6. More propaganda
- Most genes will never be studied in experiment
- Even in E. coli: only 20-30 new genes per year (hundreds are still uncharacterized)
- Universally missing genes: not a single known gene even for 10 reactions of the central metabolism. No genes for > 40 reactions overall.
- Conserved hypothetical genes (5-15% of any bacterial genome): essential, but unknown function.
7. The local goal: to characterize the genes
- What?
- function (rather, role)
- When?
- regulation (conditions)
- gene expression
- lifetime (mRNA, protein)
- Where?
- Localization
- Cellular/membrane/secreted
- How?
- Mechanism of action
- Specificity, regulation (biochemistry)
8. Propaganda-2: complete genomes
2007: > 1200 bacterial genomes
9. The global goal
- to predict the organism's properties given its genome
- (plus some additional information, e.g. the initial state after cell division)
- and to understand the evolution of genomes/organisms
10. Haemophilus influenzae, 1995
11. Vibrio cholerae, 2000
12. The metabolic map: the bird's-eye view
13. Metabolic pathways: the eagle's view
14. A submap (metabolism of arginine and proline)
15. Approaches
- Similarity => homology (common origin)
- Homology => common function
- The Pearson Principle (after Karl Pearson): important features are conserved
- functional sites in proteins
- regulatory (protein-binding) sites in DNA
- not necessarily sequences
- structure of protein and RNA
- gene localization on chromosomes
- co-expression of genes
- Allows one to annotate 50-75% of genes in a bacterial genome
- Necessary first step, may be automated (to some extent)
16. ... but not so simple
- Similarity ≠ homology
- Low complexity regions, unstructured domains, transmembrane segments and other regions with non-standard amino acid composition
- The need for correct similarity measures
- Does homology always follow from the structural similarity?
- What is structural similarity? How can it be measured?
- Convergent evolution of structures? Independent emergence of folds?
- Homology ≠ same function
- What is the same function?
- Biochemical details and cellular role
17. The Fermi principle
- (after Enrico Fermi)
- Purely homology-based annotation: boring (nothing radically new)
- It turns out, one can predict something completely new
- Comparative genomics
18. Positional clustering
- Genes that are located in immediate proximity tend to be involved in the same metabolic pathway or functional subsystem
- caused by operon structure, but not only
- horizontal transfer of loci containing several functionally linked operons
- compartmentalisation of products in the cytoplasm
- very weak evidence
- stronger if observed in many unrelated genomes
- May be measured
- e.g. the STRING database/server (P. Bork, EMBL)
- and other sources
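One way such clustering might be measured can be sketched as follows. This is only an illustration, not the STRING scoring scheme: the toy genomes, their gene order, and the function name `neighbor_counts` are hypothetical, and each chromosome is simplified to a linear list of ortholog labels.

```python
from collections import Counter


def neighbor_counts(genomes, max_gap=1):
    """For every pair of gene families, count the genomes in which
    the two genes lie within `max_gap` positions of each other."""
    counts = Counter()
    for order in genomes.values():
        pairs = set()  # each pair is counted at most once per genome
        for i, a in enumerate(order):
            for b in order[i + 1 : i + 1 + max_gap]:
                if a != b:
                    pairs.add(frozenset((a, b)))
        counts.update(pairs)
    return counts


# Hypothetical gene orders (assumption: orthologs share one label).
genomes = {
    "genome_1": ["trpA", "trpB", "trpC", "xylA"],
    "genome_2": ["xylA", "gyrA", "trpB", "trpA"],
    "genome_3": ["trpB", "trpA", "recA", "trpC"],
}
counts = neighbor_counts(genomes)
trpA_trpB = counts[frozenset(("trpA", "trpB"))]
trpB_trpC = counts[frozenset(("trpB", "trpC"))]
```

In this toy data the trpA/trpB pair is adjacent in all three genomes, while trpB/trpC is adjacent in only one, which mirrors the point that a link is stronger when it recurs in many unrelated genomes.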
19. STRING: trpB positional clusters
20. Functionally dependent genes tend to cluster on chromosomes in many different organisms
Vertical axis: number of gene pairs with association score exceeding a threshold. Control: same graph, random re-labeling of vertices
21. More genomes (stronger links) => highly significant clustering
22. Fusions
- If two (or more) proteins form a single multidomain protein in some organism, they all are likely to be tightly functionally related
- Very useful for the analysis of eukaryotes
- Sometimes useful for the analysis of prokaryotes
23. STRING: trpB fusions
24. Phyletic patterns
- Functionally linked genes tend to occur together
- Enzymes with the same function (isozymes) have complementary phyletic profiles
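The co-occurrence idea can be sketched as a similarity score between phyletic profiles, here the Jaccard index over the sets of genomes in which each gene is present. The genome labels and gene names below are hypothetical, and servers such as STRING use their own scoring, so this is an illustration of the principle only.

```python
def jaccard(profile_a, profile_b):
    """Jaccard index of two phyletic profiles given as sets of genomes."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical presence data: gene -> genomes that contain it.
profiles = {
    "geneA": {"g1", "g2", "g3", "g5"},
    "geneB": {"g1", "g2", "g3", "g5"},  # always occurs together with geneA
    "geneC": {"g2", "g4", "g6"},        # mostly unrelated distribution
}
linked = jaccard(profiles["geneA"], profiles["geneB"])
unlinked = jaccard(profiles["geneA"], profiles["geneC"])
```

A score close to 1 suggests a functional link; a score close to 0 between two genes of the same function hints at complementary (isozyme-like) profiles.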
25. STRING: trpB co-occurrence (phyletic patterns)
26. Phyletic patterns in the Phe/Tyr pathway
shikimate kinase
27. Archaeal shikimate kinase
Chorismate biosynthesis pathway (E. coli)
28. Arithmetic of phyletic patterns
Shikimate dehydrogenase (EC 1.1.1.25), AroE, COG0169: aompkzyqvdrlbcefghsnuj-i--
5-enolpyruvylshikimate 3-phosphate synthase (EC 2.5.1.19), AroA, COG0128: aompkzyqvdrlbcefghsnuj-i--
Chorismate synthase (EC 4.2.3.5), AroC, COG0082: aompkzyqvdrlbcefghsnuj-i--
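The "arithmetic" can be sketched as position-wise operations on pattern strings, where a letter marks a genome containing the gene and a dash marks its absence. The pathway pattern below is taken from the lines above; the two kinase patterns are hypothetical, built only to illustrate how two complementary patterns can add up to the pathway pattern.

```python
def _combine(p, q, keep):
    """Apply `keep` position by position to two equal-length patterns."""
    if len(p) != len(q):
        raise ValueError("patterns must have the same length")
    return "".join(keep(a, b) for a, b in zip(p, q))


def pattern_or(p, q):
    """Genomes in which at least one of the two genes is present."""
    return _combine(p, q, lambda a, b: a if a != "-" else b)


def pattern_and(p, q):
    """Genomes in which both genes are present."""
    return _combine(p, q, lambda a, b: a if a != "-" and b != "-" else "-")


pathway = "aompkzyqvdrlbcefghsnuj-i--"  # pattern of AroE, AroA, AroC above

# Hypothetical patterns for two non-homologous kinases (assumption):
kinase_1 = pathway[:6] + "-" * (len(pathway) - 6)  # first six genomes only
kinase_2 = "-" * 6 + pathway[6:]                   # the remaining genomes

together = pattern_or(kinase_1, kinase_2)
overlap = pattern_and(kinase_1, kinase_2)
```

Here `together` reproduces the pathway pattern and `overlap` is empty, which is the signature expected of two enzymes that fill the same step in different genomes.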
29. Distribution of association scores: monotonic for subunits, bimodal for isozymes
30. Comparative analysis of regulation
- Phylogenetic footprinting: regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions
- Consistency filtering: regulons (sets of co-regulated genes) are conserved =>
- true sites occur upstream of orthologous genes
- false sites are scattered at random
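Consistency filtering can be sketched as a simple vote across genomes: a candidate site is retained only if sites are also found upstream of orthologous genes in enough other genomes. The input below is hypothetical (assumed output of some motif scan, already mapped to ortholog groups), as are the function name and the threshold.

```python
from collections import Counter


def consistent_regulon(hits_by_genome, min_genomes=2):
    """Return ortholog groups with candidate sites upstream
    in at least `min_genomes` genomes."""
    support = Counter()
    for groups in hits_by_genome.values():
        support.update(set(groups))  # one vote per genome
    return {group for group, n in support.items() if n >= min_genomes}


# Hypothetical scan results: genome -> ortholog groups with a candidate site.
hits = {
    "genome_1": {"ribD", "ypaA", "xyz1"},
    "genome_2": {"ribD", "ypaA", "abc7"},
    "genome_3": {"ribD", "qrs3"},
}
regulon = consistent_regulon(hits)
```

Groups hit in a single genome (`xyz1`, `abc7`, `qrs3` in this toy input) behave like randomly scattered false sites and are filtered out.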
31. Riboflavin (vitamin B2) biosynthesis pathway
32. 5' UTR regions of riboflavin genes from bacteria
33. Conserved secondary structure of the RFN-element
Capitals: invariant (absolutely conserved) positions. Lower case letters: strongly conserved positions. Dashes and stars: obligatory and facultative base pairs.
Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U; N = any nucleotide; X = any nucleotide or deletion.
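The degenerate codes above translate directly into a regular expression over the RNA alphabet. The sketch below is an illustration only: the consensus string `GGRYNXC` is hypothetical (it is not the RFN consensus), and the distinction between capital and lower-case positions is ignored.

```python
import re

# Degenerate codes as defined on the slide (RNA alphabet A, C, G, U).
CODES = {
    "R": "[AG]",
    "Y": "[CU]",
    "K": "[GU]",
    "B": "[CGU]",    # not A
    "V": "[ACG]",    # not U
    "N": "[ACGU]",   # any nucleotide
    "X": "[ACGU]?",  # any nucleotide or deletion
}


def consensus_to_regex(consensus):
    """Compile a degenerate consensus into a regular expression."""
    parts = (CODES.get(ch, re.escape(ch)) for ch in consensus.upper())
    return re.compile("".join(parts))


def matches(consensus, sequence):
    """True if the whole sequence fits the consensus."""
    return consensus_to_regex(consensus).fullmatch(sequence) is not None


CONSENSUS = "GGRYNXC"  # hypothetical example, not taken from the slides
```

For instance, `matches(CONSENSUS, "GGACUC")` holds with the X position deleted, while `matches(CONSENSUS, "GGCCUC")` fails because C is not allowed at the R position.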
34. RFN: the mechanism of regulation
- Transcription attenuation
35. Early observation: an uncharacterized gene (ypaA) with an upstream RFN element
36. Phylogenetic tree of RFN-elements (regulation of riboflavin biosynthesis)
Figure labels: no riboflavin biosynthesis (two branches); duplications
37. YpaA a.k.a. RibU: riboflavin transporter in Gram-positive bacteria
- 5 predicted transmembrane segments => a transporter
- Upstream RFN element (likely co-regulation with riboflavin genes) => transport of riboflavin or a precursor
- S. pyogenes, E. faecalis, Listeria sp.: ypaA, no riboflavin pathway => transport of riboflavin
- Prediction: YpaA is riboflavin transporter (Gelfand et al., 1999)
- Validation
- YpaA transports flavins (riboflavin, FMN, FAD): by genetic analysis (Kreneva et al., 2000), by direct measurement (Burgess et al., 2006; Vogl et al., 2007)
- ypaA is regulated by riboflavin: by microarray expression study (Lee et al., 2001)
- via attenuation of transcription (and to some extent inhibition of translation) (Winkler et al., 2003)
38. Conserved structures of riboswitches (circled: X-ray)
39. Mechanisms
glmS: ribozyme, cleaves its mRNA (the Breaker group)
THI-box in plants: inhibition of splicing (the Breaker and Hanamoto groups)
40. Characterized riboswitches (more are predicted)
Riboswitch | Regulated processes | Ligand | Distribution
RFN | Riboflavin biosynthesis and transport | FMN (flavin mononucleotide) | Bacillus/Clostridium group, proteobacteria, actinobacteria, other bacteria
THI | Biosynthesis and transport of thiamin and related compounds | TPP (thiamin pyrophosphate) | Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, other bacteria, archaea (thermoplasmas), plants, fungi
B12 | Biosynthesis of cobalamin, transport of cobalt, cobalamin-dependent enzymes | Coenzyme B12 (adenosyl-cobalamin) | Bacillus/Clostridium group, proteobacteria, actinobacteria, cyanobacteria, spirochaetes, other bacteria
S-box, SAM-II, SAM-III | Metabolism of methionine and cysteine | SAM (S-adenosylmethionine) | S-box: Bacillus/Clostridium group and some other bacteria; SAM-II (alpha-proteobacteria); SAM-III (Streptococci)
LYS | Lysine metabolism | lysine | Bacillus/Clostridium group, enterobacteria, other bacteria
G-box | Metabolism of purines | purines | Bacillus/Clostridium group and some other bacteria
glmS (ribozyme) | Synthesis of glucosamine-6-phosphate | glucosamine-6-phosphate | Bacillus/Clostridium group
gcvT (tandem) | Catabolism of glycine | glycine | Bacillus/Clostridium group
41. Properties of riboswitches
- Direct binding of ligands
- High conservation
- Including unpaired regions: tertiary interactions, ligand binding
- Same structure, different mechanisms: transcription, translation, splicing, (RNA cleavage)
- Distribution in all taxonomic groups
- diverse bacteria
- archaea: thermoplasmas
- eukaryotes: plants and fungi
- Correlation of the mechanism and taxonomy
- attenuation of transcription (anti-anti-terminator): Bacillus/Clostridium group
- attenuation of translation (anti-anti-sequestor of translation initiation): proteobacteria
- attenuation of translation (direct sequestor of translation initiation): actinobacteria
- Evolution: horizontal transfer, duplications, lineage-specific loss
- Sometimes very narrow distribution: evolution from scratch?
42. Conserved signal upstream of nrd genes
43. Identification of the candidate regulator by the analysis of phyletic patterns
- COG1327: the only COG with exactly the same phyletic pattern as the signal
- large scale: on the level of major taxa
- small scale: within major taxa
- absent in small parasites among alpha- and gamma-proteobacteria
- absent in Desulfovibrio spp. among delta-proteobacteria
- absent in Nostoc sp. among cyanobacteria
- absent in Oenococcus and Leuconostoc among Firmicutes
- present only in Treponema denticola among four spirochetes
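The search for a regulator whose phyletic pattern matches that of the signal can be sketched as a ranking of gene families by the number of genomes in which presence of the family and presence of the signal disagree. Only the identifier COG1327 comes from the slides; the genome labels, the other family names, and all presence data are hypothetical.

```python
def rank_by_pattern(signal_genomes, cog_profiles):
    """Sort gene families by mismatch count against the signal pattern.

    A mismatch is a genome that has the signal but not the family,
    or the family but not the signal (symmetric difference)."""
    signal = set(signal_genomes)
    scored = [
        (len(signal ^ set(genomes)), cog)
        for cog, genomes in cog_profiles.items()
    ]
    return sorted(scored)


# Hypothetical data: genomes carrying the conserved upstream signal ...
signal = {"g1", "g2", "g4"}
# ... and genomes carrying each candidate gene family.
cog_profiles = {
    "COG1327": {"g1", "g2", "g4"},
    "COG_hypothetical_1": {"g1", "g2", "g3", "g4"},
    "COG_hypothetical_2": {"g1"},
}
ranking = rank_by_pattern(signal, cog_profiles)
best_mismatches, best_cog = ranking[0]
```

A family with zero mismatches, like COG1327 in this toy input, is present exactly where the signal is present and absent exactly where it is absent, at both the large and the small taxonomic scale.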
44. COG1327: predicted transcriptional regulator, consists of a Zn-ribbon and ATP-cone domains
regulator of the riboflavin pathway (RibX)?
45. Additional evidence: co-localization
- nrdR is sometimes clustered with nrd genes or with replication genes dnaB, dnaI, polA
46. Additional evidence: co-regulated genes
- In some genomes, candidate NrdR-binding sites are found upstream of other replication-related genes
- dNTP salvage
- topoisomerase I, replication initiator dnaA, chromosome partitioning, DNA helicase II
47. Multiple sites (nrd genes): FNR, DnaA, NrdR
48. Mode of regulation
- Repressor (overlaps with promoters)
- Co-operative binding
- most sites occur in tandem (> 90% of cases)
- the distance between the copies (centers of palindromes) equals an integer number of DNA turns
- mainly (94%) 30-33 bp, in 84% 31-32 bp: 3 turns
- 21 bp (2 turns) in Vibrio spp.
- 41-42 bp (4 turns) in some Firmicutes
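The "integer number of turns" observation can be checked with simple arithmetic. The sketch below assumes a helical repeat of about 10.5 bp per turn for B-DNA, a textbook value that is not stated on the slide; the tolerance is an arbitrary illustrative choice.

```python
HELICAL_REPEAT_BP = 10.5  # assumed B-DNA helical repeat, bp per turn


def turns(spacing_bp):
    """Number of helical turns spanned by a center-to-center spacing."""
    return spacing_bp / HELICAL_REPEAT_BP


def same_face(spacing_bp, tolerance=0.15):
    """True if the spacing is within `tolerance` turns of an integer,
    i.e. both sites face roughly the same side of the helix."""
    t = turns(spacing_bp)
    return abs(t - round(t)) <= tolerance


# Spacings quoted above: 21 bp, 31-32 bp, 41-42 bp.
observed = {bp: round(turns(bp), 2) for bp in (21, 31, 32, 41, 42)}
```

With this assumption 21 bp gives 2.0 turns, 31-32 bp gives about 2.95-3.05 turns, and 41-42 bp about 3.9-4.0 turns, whereas a spacing such as 26 bp (about 2.5 turns) would place the two sites on opposite faces of the helix.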
49. Experimental validations
50. Acknowledgements
- Dmitry Rodionov (comparative genomics)
- Andrei Mironov (software)
- Alexei Vitreschak (riboswitches)
- Funding
- Howard Hughes Medical Institute
- Russian Foundation for Basic Research
- RAS, program "Molecular and Cellular Biology"
- INTAS