1
Finding Transcription Modules from large
gene-expression data sets
  • Ned Wingreen, Molecular Biology
  • Morten Kloster, Chao Tang, NEC Laboratories
    America

2
Outline
  • Introduction: transcription, regulation, gene
    chips, and transcription modules.
  • Iterative Signature Algorithm (ISA).
  • Advantages of Progressive Iterative Signature
    Algorithm (PISA).
  • PISA applied to yeast data.

3
Transcription regulation
http://doegenomestolife.org
4
Gene chips
DNA microarray
5
Gene-expression profile
$E_{gc}$, $g = 1, 2, \ldots, N_g$, $c = 1, 2, \ldots, N_c$

But the data are very noisy.
6
Transcription module
[Figure: conditions C1-C3 linked to genes G1-G7]
A transcription module: a set of conditions and a
set of genes connected by a transcription factor.
7
Signature of a transcription module
A gene can be in multiple transcription modules.
8
Iterative Signature Algorithm (ISA)
Barkai group (2002,2003)
[Figure: conditions c1, ..., cNC and genes g1, ..., gNG; thresholding on the gene vector and the condition vector defines a transcription module (TM)]
Thresholding on both genes and conditions reduces
noise.
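
The alternating thresholding described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Barkai group's implementation: the function names `isa_step` and `isa`, the single matrix `E` (rows are genes, columns are conditions), and the thresholds `t_c` and `t_g` expressed in standard deviations are all assumptions made for the example.

```python
import statistics

def isa_step(E, genes, t_c=1.0, t_g=1.0):
    """One ISA-style update: gene set -> condition set -> new gene set."""
    n_genes, n_conds = len(E), len(E[0])
    # Condition scores: average expression of the current genes in each condition.
    s_c = [sum(E[g][c] for g in genes) / len(genes) for c in range(n_conds)]
    mu_c, sd_c = statistics.mean(s_c), statistics.pstdev(s_c)
    # Threshold on conditions: keep those whose score stands out from the rest.
    conds = {c for c in range(n_conds) if abs(s_c[c] - mu_c) > t_c * sd_c}
    # Gene scores: expression weighted by the retained condition scores.
    s_g = [sum(s_c[c] * E[g][c] for c in conds) for g in range(n_genes)]
    mu_g, sd_g = statistics.mean(s_g), statistics.pstdev(s_g)
    # Threshold on genes: keep those scoring well above the average gene.
    new_genes = {g for g in range(n_genes) if s_g[g] - mu_g > t_g * sd_g}
    return new_genes, conds

def isa(E, seed_genes, max_iter=50, t_c=1.0, t_g=1.0):
    """Iterate from a non-empty seed gene set until the gene set stops changing."""
    genes = set(seed_genes)
    for _ in range(max_iter):
        new_genes, _ = isa_step(E, genes, t_c, t_g)
        if not new_genes or new_genes == genes:
            break
        genes = new_genes
    return genes, isa_step(E, genes, t_c, t_g)[1]

# Toy data (hypothetical): genes 0-2 respond in conditions 0-1, everything else is flat.
toy = [[2.0 if (g < 3 and c < 2) else 0.0 for c in range(6)] for g in range(8)]
print(isa(toy, {0, 5}))  # ({0, 1, 2}, {0, 1})
```

Starting from a seed that mixes one module gene with one unrelated gene, the iteration discards the unrelated gene and recovers the planted module, which is the noise-reducing effect the slide attributes to thresholding on both genes and conditions.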
9
Limitations of ISA
  • Lots of spurious modules (millions).
  • Weak modules may be absorbed by strong ones.
  • ISA does not make use of identified modules to
    find new ones.

[Figure: conditions c1, ..., cNc and genes g1, ..., gNg]
10
Progressive Iterative Signature Algorithm (PISA)
[Figure: conditions c1, ..., cNc and genes g1, ..., gNg]
11
Advantages of PISA over ISA
  • Removing found modules reveals hidden modules,
    and reduces noise for unrelated modules.
  • No positive feedback.
  • Improved thresholding for genes.
  • Combines coregulated and counter-regulated genes.

12
Example of PISA vs. ISA
[Figure: example network with A, B, transcription factors TF1 and TF2, and genes G1 and G2]
13
The gene-score threshold
Gene scores along the condition vector for some
module
  • Goal: fewer than one gene included in the module
    by mistake.
  • Require threshold that is insensitive to
    (unknown) module size.
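
To make the two requirements above concrete, here is a rough sketch of how such a threshold could be set. It rests on two assumptions that are not stated in the slides: that background gene scores are roughly Gaussian, and that a median/MAD scale estimate is an acceptable stand-in for a size-insensitive one. The function names `gene_score_threshold` and `robust_z` are hypothetical.

```python
from statistics import NormalDist, median

def gene_score_threshold(n_genes, expected_false=1.0):
    """Two-sided z-threshold at which fewer than `expected_false` of
    `n_genes` Gaussian background genes are expected to pass by chance."""
    tail = expected_false / (2.0 * n_genes)
    return NormalDist().inv_cdf(1.0 - tail)

def robust_z(scores):
    """Z-scores based on the median and the median absolute deviation (MAD).
    A handful of true module genes barely moves either statistic, so the
    scale does not depend strongly on the (unknown) module size.
    Assumes the MAD is non-zero."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    scale = 1.4826 * mad  # converts MAD to a standard deviation for Gaussian data
    return [(s - med) / scale for s in scores]

# For roughly 6000 genes the threshold lands a little under 4 standard deviations.
t = gene_score_threshold(6000)
z = robust_z([-1.0, 0.0, 1.0, 0.5, -0.5, 50.0])  # toy scores with one outlier
print(round(t, 2), [abs(v) > t for v in z])
```

Because real expression data have long tails, a Gaussian bound of this kind is likely optimistic; it only illustrates the arithmetic of the "fewer than one false gene" criterion.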

14
Eliminating false modules
For scrambled data, preliminary modules either
have few genes or few contributing conditions.
True positives
15
PISA applied to yeast data
  • Applied PISA to a dataset containing almost all
    available microarray data for S. cerevisiae:
    >6000 genes, 1000 conditions.
  • Found 140 different modules, including all
    good modules found by ISA.
  • Found some unknown modules.
  • Found many good small modules that ISA could
    not find / separate from the spurious modules.
  • 2600 genes in at least one module, 900 genes in
    more than one module.

16
Some modules found by PISA
17
Example: Zinc module
[Figure: ZAP1-regulated genes during zinc starvation (Lyons et al., PNAS 97, 7957-7962 (2000)) compared with the zinc module found by PISA. Genes shown: ZRT1, ZRT2, ZRT3, ZAP1, INO1, ADH4, YNL254C, YOL154W.]
18
Comparison with other databases
Gold standard: Gene Ontology (Genome Res. 11, 1425-1433 (2001))
Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))
19
Correlations
[Figure: correlations between modules (number of genes in parentheses); legend: correlated, anticorrelated]
  • rRNA processing (117)
  • Ribosomal proteins (126)
  • Histone (19)
  • Fatty acid syn (22)
  • Cell cycle G2/M (31)
  • Cell cycle M/G1 (35)
  • Cell cycle G1/S (66)
  • Mating genes for type a (15)
  • Mating type a signaling genes (6)
  • Mating (110)
  • Mating factors/receptors a/a difference (26)
  • Oxidative stress response (69)
  • De novo purine biosyn (32)
  • Lysine biosyn (11)
  • Biotin syn transport (6)
  • Arg biosyn (6)
  • aa biosyn (96)
  • aryl alcohol dehydrogenase (6)
  • proteolysis (27)
  • trehalose hexose metabolism/conversion (21)
  • COS genes (11)
  • heat shock (52)
  • repair of disulfide bonds (26)
20
Summary
  • Data from gene chips can be used to identify
    transcription modules (TMs).
  • Iterative approach (ISA) is promising.
  • PISA improves on ISA by taking out found TMs.
  • PISA also improves gene thresholding, avoids
    positive feedback, and improves signal to noise
    by grouping coregulated and counter-regulated
    genes.
  • PISA very effective for finding secondary
    modules.

http://cn.arxiv.org/abs/q-bio/0311017
21
Future Directions
  • Input to experiment
      • new modules and new genes in old modules.
      • what kinds of experiments give the most
        informative data?
  • Improve PISA
      • better pre/post-processing of data.
  • Apply PISA to other organisms.
  • Combine PISA with other data (experimental,
    bioinformatic) to systematically identify TMs,
    and reconstruct the transcription network.

22
De novo purine biosynthesis
Number of genes: 32
Average number of contributing conditions: 14.6
Consistency: 0.59
Best ISA overlap: 0.59 at tG = 5.0, frequency 16
23
Galactose induced genes
Number of genes: 23
Average number of contributing conditions: 18.1
Consistency: 0.55
Best ISA overlap: 0.74 at tG = 3.2, frequency 686
24
Hexose transporters
Number of genes: 10
Average number of contributing conditions: 33.7
Consistency: 0.59
Best ISA overlap: 0.6 at tG = 3.8, frequency 41
25
Peroxide shock
Number of genes: 69
Average number of contributing conditions: 23.9
Consistency: 0.50
Best ISA overlap: 0.34 at tG = 3.4, frequency (1)
26
Implementation of PISA
  • Normalization of gene-expression data
  • Iterative algorithm to find preliminary modules
    (modified ISA)
      • avoiding positive feedback
      • gene-score threshold
  • Orthogonalization
  • Finding consistent modules

27
Normalization of expression data
Gene-score matrix EG
  • normalizes total RNA levels
  • removes reference-condition bias
  • makes gene scores comparable
Condition-score matrix EC
  • makes condition scores comparable
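
The formulas on this slide did not survive in the transcript, so the following is only one plausible reading of the listed steps, written as a sketch. The order of operations, the use of root-mean-square scaling, and the names `rms` and `normalize` are assumptions; the input `E` is taken to be a matrix of log expression ratios with genes as rows and conditions as columns.

```python
import math

def rms(values):
    """Root-mean-square of a sequence; returns 1.0 for an all-zero input."""
    r = math.sqrt(sum(v * v for v in values) / len(values))
    return r if r > 0 else 1.0

def normalize(E):
    """Return (E_G, E_C) built from a raw matrix E[gene][condition]."""
    n_g, n_c = len(E), len(E[0])
    # Subtract each condition's mean over genes (total RNA level).
    col_mean = [sum(row[c] for row in E) / n_g for c in range(n_c)]
    X = [[row[c] - col_mean[c] for c in range(n_c)] for row in E]
    # Subtract each gene's mean over conditions (reference-condition bias).
    X = [[x - sum(row) / n_c for x in row] for row in X]
    # E_G: scale every gene (row) to unit rms so gene scores are comparable.
    E_G = []
    for row in X:
        scale = rms(row)
        E_G.append([x / scale for x in row])
    # E_C: scale every condition (column) of E_G to unit rms
    # so condition scores are comparable.
    col_scale = [rms([row[c] for row in E_G]) for c in range(n_c)]
    E_C = [[row[c] / col_scale[c] for c in range(n_c)] for row in E_G]
    return E_G, E_C

E_G, E_C = normalize([[1.0, 2.0, 3.0], [2.0, 0.0, 4.0], [0.0, 5.0, 1.0]])
```

After this step every row of `E_G` has zero mean and unit rms, and every column of `E_C` has unit rms.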
28
Iterative algorithm modified ISA (mISA)
Start with a random set of genes GI.
Produce condition-score vector sC.
Produce gene-score vector sG, using
leave-one-out scoring to avoid positive
feedback.
From sG, calculate gene vector mG for next
iteration.
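
One way to read "leave-one-out scoring" is that each gene is scored against a condition-score vector from which its own contribution has been removed, so a gene cannot keep itself in the module. The sketch below follows that reading; it is an interpretation rather than the authors' code, and the function names are hypothetical. `E_G` and `E_C` are the normalized matrices (genes as rows) and `m_G` is the current gene vector of weights.

```python
def condition_scores(E_C, m_G):
    """s_C[c] = sum over genes of m_G[g] * E_C[g][c]."""
    n_c = len(E_C[0])
    return [sum(m_G[g] * E_C[g][c] for g in range(len(E_C))) for c in range(n_c)]

def leave_one_out_gene_scores(E_G, E_C, m_G):
    """Score each gene against condition scores computed without that gene."""
    s_C = condition_scores(E_C, m_G)
    scores = []
    for g, row in enumerate(E_G):
        # Remove gene g's own contribution before scoring it.
        s_loo = [s - m_G[g] * e for s, e in zip(s_C, E_C[g])]
        scores.append(sum(x * s for x, s in zip(row, s_loo)))
    return scores

# Toy matrix (hypothetical): genes 0 and 1 share condition 0; gene 2 is a lone outlier.
E = [[1, 0], [1, 0], [0, 3]]
print(leave_one_out_gene_scores(E, E, [1, 1, 0]))  # [1, 1, 0]
print(leave_one_out_gene_scores(E, E, [0, 0, 1]))  # [0, 0, 0]
```

In the second call the outlier gene scores 0 rather than 9, because its large value in condition 1 is no longer allowed to vouch for itself. That is the positive feedback this step is meant to avoid.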
29
Orthogonalization
After finding each converged preliminary module
(sG, sC), remove component along sC from all
genes
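
Removing the component along sC amounts to projecting every gene's expression profile onto the subspace orthogonal to sC. A minimal sketch, assuming the expression matrix is stored as a list of gene rows and sC is a non-zero vector over conditions (the function name `remove_component` is hypothetical):

```python
import math

def remove_component(E, s_C):
    """Subtract from every gene row its projection onto the direction of s_C."""
    norm = math.sqrt(sum(s * s for s in s_C))
    u = [s / norm for s in s_C]  # unit vector along s_C
    out = []
    for row in E:
        proj = sum(x * ui for x, ui in zip(row, u))  # component of the row along u
        out.append([x - proj * ui for x, ui in zip(row, u)])
    return out

print(remove_component([[3.0, 4.0], [1.0, 0.0]], [1.0, 0.0]))  # [[0.0, 4.0], [0.0, 0.0]]
```

After this step no gene has any remaining signal along the found condition vector, so a later run cannot converge to the same module again and weaker modules that were masked by it become visible.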
30
Why does scrambled data yield large modules?
Long tails of expression data lead to
single-condition modules.
31
Finding consistent modules
  • Repeat PISA runs many times (30).
  • Tabulate preliminary modules.
  • A preliminary module contributes to a module if
      • the preliminary module contains > 50% of the
        genes in the module,
      • these genes constitute > 20% of the preliminary
        module.
  • A gene is included in a module if it appears in
    > 50% of the contributing modules, always with the
    same gene-score sign.
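
The membership rules above translate fairly directly into code. The sketch below performs a single refinement pass under some assumptions of its own: each preliminary module is represented as a dictionary mapping gene name to gene score, the current module is a set of gene names, and the names `contributes` and `refine_module` are hypothetical. How the procedure is seeded and iterated is not specified in the slides and is left out.

```python
def contributes(prelim, module):
    """A preliminary module contributes if it holds more than 50% of the module's
    genes and those shared genes make up more than 20% of the preliminary module."""
    shared = len(set(prelim) & module)
    return shared > 0.5 * len(module) and shared > 0.2 * len(prelim)

def refine_module(module, prelims):
    """Keep genes present in more than 50% of the contributing preliminary
    modules, always with the same gene-score sign."""
    contributing = [p for p in prelims if contributes(p, module)]
    genes = set()
    for g in set().union(*contributing):
        signs = [p[g] > 0 for p in contributing if g in p]
        if len(signs) > 0.5 * len(contributing) and len(set(signs)) == 1:
            genes.add(g)
    return genes

# Toy preliminary modules (hypothetical gene names and scores).
prelims = [
    {"a": 1.0, "b": 2.0, "c": 1.5},
    {"a": 0.8, "b": 1.1, "d": 0.5},
    {"a": 1.2, "b": -0.9, "c": 1.0},
    {"x": 1.0, "y": 2.0},
]
print(sorted(refine_module({"a", "b", "c"}, prelims)))  # ['a', 'c']
```

In the toy data, gene b is dropped because its score changes sign between runs, and gene d is dropped because it appears in only one of the three contributing preliminary modules.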

32
Comparison with other databases
Gene Ontology (Genome Res. 11, 1425-1433 (2001))
  • Ng: number of genes in organism
  • m: number of genes in module
  • c: number of genes in GO category
  • n: number of genes in both module and GO category
p value
Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002))
Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003))
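
The p-value formula itself is missing from the transcript. With the four quantities defined on this slide, the usual choice is the hypergeometric tail probability, i.e. the chance that a random set of m genes shares at least n genes with a category of size c. The sketch below assumes that this is the intended statistic; the function name `go_enrichment_p` is hypothetical.

```python
from math import comb

def go_enrichment_p(N_g, m, c, n):
    """P(overlap >= n) when m genes are drawn at random from N_g genes,
    of which c belong to the GO category (hypergeometric upper tail)."""
    total = comb(N_g, m)
    tail = sum(comb(c, k) * comb(N_g - c, m - k) for k in range(n, min(m, c) + 1))
    return tail / total

# Toy numbers (not from the presentation): 10 genes, module of 3, category of 4, overlap 2.
print(go_enrichment_p(10, 3, 4, 2))
```

Exact integer arithmetic via `math.comb` avoids the underflow that floating-point factorials would cause at genome scale (Ng of several thousand).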
33
Correlation of modules
[Figure: conditions c1, ..., cNc and genes g1, ..., gNg]