Title: Automated Causal Inference


Report on IHMC-CMU-Pitt Research: Full Report
NRA A2-37143: Automated Discovery Procedures
for Gene Expression and Regulation from
Microarray and Serial Analysis of Gene Expression
Data
NCC 2-1295: Multi-Domain Network Learning
Algorithms of Latent Variable Interpretation and
Discovering Genetic Regulation
April 2001 - April 2002 http//
Research Team
  • William Buckles (Ph.D, Professor, Tulane)
  • Tianjiao Chu (Ph.D Student, Logic, Methodology
    and Computation, CMU)
  • Greg Cooper (M.D., Ph.D, Associate Professor,
    School of Medicine, Pitt)
  • David Danks (Ph.D, Research Scientist, IHMC)
  • Clark Glymour (Ph.D, P.I., Senior Research
    Scientist and John Pace Scholar, IHMC; Alumni
    University Professor, CMU)
  • Dan Handley (M.S. Student, Logic, Methodology and
    Computation, CMU)
  • Subramani Mani (Ph.D Student, Biomedical
    Informatics, Pitt)
  • Rob O'Doherty (Ph.D, Assistant Professor, School
    of Medicine, Pitt)
  • Dave Peters (Ph.D, Human Genetics, Pitt)
  • Joseph Ramsey (Ph.D, Research Programmer, CMU)
  • Jaime Robins, (M.D. School of Public Health,
  • Raul Saavedra (Ph.D, Student, Computer Science,
  • Richard Scheines (Ph.D, Associate Professor, CMU)
  • Nicoleta Servan (Ph.D Student, Statistics, CMU)
  • Ricardo Silva (Ph.D student, Computer Science,
  • Peter Spirtes (Ph.D, Research Scientist, IHMC;
    Professor, CMU)
  • Larry Wasserman (Ph.D, Professor, CMU)
  • Frank Wimberly (Ph.D, Research Programmer, IHMC)
  • Changwon Yoo (Ph.D Student, Biomedical
    Informatics, Pitt)

Two Related Goals
  • Investigating the prospects for more rapid and
    accurate determination of genetic regulatory
    networks using recently developed technologies
    (microarrays and SAGE)
  • Investigating the prospects for determining the
    underlying components of measured phenomena, and
    the influences such components have on one another

Background on Genetics
  • Proteins do most of the work in the cell
  • Cell reproduction, metabolism, and responses to
    the environment are all controlled by proteins
  • Each gene is a machine for constructing
    (approximately) a single protein
  • The rate at which a gene constructs proteins is
    influenced by concentrations of regulator proteins

Gene Regulatory Networks
  • Some genes manufacture proteins which control the
    rate at which other genes manufacture proteins
    (either promoting or suppressing)
  • Hence some genes indirectly (via the proteins
    they create) regulate other genes, which in turn
    regulate the operation of the cell
  • The system by which genes regulate each other is
    called the genetic regulatory network, and can be
    represented by a directed graph (which is a
    special case of a Bayes network)

Measuring Gene Expression Levels
  • A gene's expression level is an approximate
    measure of the concentration of mRNA transcripts
    and a more indirect measure of the rate of
    synthesis of corresponding proteins.
  • Recently developed technologies (microarrays and
    Serial Analysis of Gene Expression, or SAGE)
    allow thousands of gene expression levels
    to be measured simultaneously
  • The kinds of measurement errors that these
    technologies introduce are not well understood
  • The best way to use these tools to discover gene
    regulatory networks is not known

Relevance to NASA
  • Gene expression in microgravity has been shown to
    differ significantly from expression on Earth
  • Understanding gene regulation in plants, animals
    and humans is likely to be important for
    long-term extraterrestrial habitation
  • Determining regulatory structure is at present
    laborious, slow and costly
  • Need for systematic study of the reliability and
    accuracy of scores of proposals for applying
    statistical/machine learning procedures to speed
    up the process

Background on Latent Structure Analysis
  • Measurements are often of effects of other
    scientifically interesting variables not directly
    measured
  • Number and identity of underlying causal or
    compositional variables may not be entirely
    known
  • Measured effects can influence other measured
    effects (e.g., through between-channel signal
    leakage in multi-channel
Background on Latent Structure Analysis
  • With no prior cluster information and with the
    possibility of measured-measured and
    latent-latent influences, none of the standard
    data analysis procedures (e.g., factor analysis,
    principal components, independent components)
    give reliable (i.e., asymptotically correct)
    information about all of
  • Number of latent variables
  • Clustering of measured
  • Causal or compositional relations among latent

Relevance to NASA
  • NASA collects vast quantities of observational
    data on the Earth, the solar system and the
    cosmos, much of it spectral
  • Need for automated, fast, reliable procedures for
    extracting relevant causal information from
    diverse datasets, and for procedures that
    integrate expert knowledge
  • Inadequacy of current methods (model specific,
    clustering algorithms) for this task
  • Principled procedures using Bayes network methods
    offer promising alternatives
  • They have succeeded in other spectral
  • (J. Ramsey, et al., Automated Identification of
    Carbonate Composition from Reflectance Spectra,
    Data Mining and Knowledge Discovery, in press.)

Structure of the Projects
  • Statistical Foundations
  • Multiple testing problem
  • Measurement error models
  • Search Algorithms
  • Different kinds of inputs
  • Different assumptions about background knowledge
  • Experiments
  • Microarray
  • SAGE
  • Testing
  • Application to known genetic regulatory networks
  • Application to simulated data

First Year Results Algorithms
  • Many algorithms for inferring causal networks
    that have been applied to inferring gene
    regulatory networks assume the input is
    associations between measured features of
  • But microarrays and SAGE measure average gene
    expression levels over many cells rather than for
    a single cell
  • What is the feasibility of inferring regulatory
    networks from associations between averages?
  • Feasibility for linear and local-linear
    regulatory functions
  • Impossibility for the mathematical form of the
    regulatory function of sea urchin Endo 16 gene,
    one of the best established.
  • T. Chu, C. Glymour, R. Scheines and P. Spirtes,
    A Statistical Problem for Inference to
    Regulatory Structure from Associations of Gene
    Expression Measurements with Microarrays,
    Bioinformatics, submitted.

First Year Results Statistics
  • Current methods for determining from SAGE
    measurements which genes are changing in response
    to experimental manipulations are incorrect
  • Correct method requires estimating additional
    experimental parameters, and leads to the
    conclusion that many fewer genes are changing
    than had been previously thought
  • T. Chu, Computation of Variance in SAGE
    Measurements of Gene Expression, Technical
    Report, Logic, Methodology and Computation, 2002.
  • Future plan: apply the new method to SAGE
    measurements of the response of genes to shear
    stress (data already gathered)

First Year Results Statistics
  • Standard techniques for testing whether a gene
    expression level has changed due to an
    experimental manipulation were not designed to be
    applied to test thousands of genes simultaneously
  • Recent developments (False Discovery Rate tests)
    do allow simultaneous testing of thousands of
  • Further improvements of the False Discovery Rate
    procedure have been made
  • C. Genovese, and L. Wasserman, Bayesian and
    Frequentist Multiple Testing, CMU Department of
    Statistics Technical Report 764, April, 2002.
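
As a point of reference for the False Discovery Rate idea above, here is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. It is not the improved procedure of the cited technical report; the function name and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at false discovery rate q.

    Standard Benjamini-Hochberg step-up rule (illustrative, not the
    report's refined procedure).
    """
    m = len(p_values)
    # Sort p-values ascending, remembering each one's original position.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis whose p-value is among the k smallest.
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for ten genes.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(example_p, q=0.05)
print(rejected)
```

With thousands of genes the same function applies unchanged; only the list of p-values grows.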

First Year Results Algorithms
  • Implementation and testing (on simulated data) of
    a correct (under explicit assumptions) algorithm
    for causal clustering and for determining latent
  • R. Silva, CMU Masters Thesis, Center for
    Automated Learning and Discovery
  • Extension to time series of learning algorithms
    for dynamical Bayes Nets
  • D. Danks, Constraint-Based Learning Algorithm
    for Dynamical Bayes Nets, Conference on
    Uncertainty in Artificial Intelligence,
  • Development and proof of correctness for an
    improved algorithm for inferring Bayes networks
    across distinct data sets with overlapping
    variable sets
  • D. Danks, Efficient Learning of Bayes Nets from
    Databases with Overlapping Variables, IHMC
    Technical Report, 2002.

First Year Results Algorithms
  • Development and testing of algorithms for
    maximizing information obtained from knockout
  • R. Silva, C. Glymour, D. Danks, Inferring
    Genetic Regulatory Structure from First and
    Second Moments, Technical Report, Logic,
    Methodology and Computation, 2002.
  • Development, implementation and testing of a
    genetic algorithm for linear Bayes networks
    (structural equation models)
  • S. Harwood and R. Scheines, Learning Linear
    Causal Structure Equation Models with Genetic
    Algorithms (2001) Tech Report CMU-PHIL-128,
    submitted to Conference on Knowledge Discovery
    and Data Mining.
  • S. Harwood and R. Scheines, Genetic Algorithm
    Search over Causal Models (2001) Tech Report
    CMU-PHIL-131, submitted to Conference on
    Uncertainty in Artificial Intelligence.
  • Development of an algorithm for regulatory
    structure from mixed observational and knockout

First Year Results Testing
  • Very few genetic regulatory networks are known,
    and even fewer details about the functional
    relationships among the genes are known
  • How can the accuracy of a causal discovery
    algorithm be tested?
  • Generate simulated data from made up gene
    regulatory networks, so that the generating
    mechanism is known

First Year Results Testing
  • Implementation of a flexible program for
    generating simulated microarray data that allows
    the user to conveniently specify many different
  • Functional relationships between cells
  • Measurement errors
  • Averaging over different numbers of cells
  • Gene regulatory network structures (including
    varying time lags)
  • J. Ramsey and R. Scheines, (2001) Simulating
    Genetic Regulatory Networks, Technical Report
  • Implementation of half a dozen algorithms
    proposed in the literature for inferring
    regulatory structure from expression associations
    in microarray measurements (more to be

First Year Results Experiments
  • Fat cells from mice are treated with
    troglitazone, which increases the efficiency of
    the biological actions of insulin in diabetes and
  • Which genes are activated?
  • Microarray chips used to make 47 measurements of
    gene expression level at 35 time points for 5355

First Year Results Experiments
  • Normalize data to remove chip-to-chip effects
  • Perform statistical tests to determine which
    genes are changing, adjusting for multiple tests

Comparing 20 genes that change most with 20 that
change least
Current Work Experiments
  • Remove outlying genes
  • Improve the test performed for whether a gene is
    changing over time
  • Introduce clustering methods for data
  • Use slower but more accurate measurement
    techniques (Northern Blots) to
  • Test the hypotheses about which genes change
    according to the microarray analysis
  • Learn about errors in measurement when using

Gene Research Plans, May 2002 - May 2003
  • Study statistical properties of multiple
    decisions and of conditional independence among
    averaged variables
  • Develop new algorithms for optimal information
    extraction and implement algorithms proposed in
    the literature
  • Implement simulator
  • Laboratory SAGE and microarray study of
    expression under varying surface flows and drug
    treatments
  • (Timeline marker: Where we
  • Test algorithms on real and simulated data
  • Analyze data
  • Make predictions
  • (Timeline marker: Where we will be)
  • Knockout experiments
  • Overall evaluation

Latent Structure Research Plans, 2002-2003
  • Improve efficiency
  • Test on large simulated data sets
  • Prove asymptotic correctness
  • Investigate non-linear generalizations

Supplementary Material Outline
  • Discovering the Structure of Genetic Regulatory
    Networks
  • Testing Algorithms Simulator
  • Analysis of Gene Expression Levels Averaged Over
    Many Cells
  • Analysis of SAGE Data
  • Latent Structure---Causal Clustering
  • Experiments
  • Experiment 1 Microarray analysis
  • Experiment 2 SAGE analysis

Discovering the Structure of Genetic Regulatory Networks
Simplified Gene Regulatory Network
[Figure: network diagram of genes G1-G6, their transcripts mRNA1-mRNA6, and their products protein1-protein6]
Still More Simplified
Two Strategies for Discovering Gene Regulatory Networks
  • (Difference) Enhance or suppress specific genes
    and measure the changes in expression levels of
    other genes. Infer effects of the manipulated gene
    from differences in expression levels of other
    genes versus unmanipulated controls
  • (Association) Use wild-type cells or cells with
    specific enhanced or suppressed levels of other
    genes. Infer effects from associations of
    expression levels of all genes

Measurement Techniques
  • Microarray techniques allow measurements of
    relative mRNA concentrations from multiple tissue
  • mRNA concentrations for thousands of genes can be
    measured simultaneously
  • Measurements can be taken in time sequence, every
    few minutes
  • Serial Analysis of Gene Expression (SAGE) allows
    estimation of concentrations of mRNA transcripts
    for essentially the entire genome; does not
    require prior knowledge of all genes

Difference Method
  • Several examples of partial identification of
    part of the regulatory network for several
  • Limitations
  • Laborious and expensive
  • Each experiment can only tell us which genes are
    regulated by a manipulated gene, nothing about
    the pathway of regulation
  • E.g., if gene A is suppressed and genes B and C
    change in consequence, the experiment does not
    distinguish among
  • A → B → C
  • A → C → B
  • C ← A → B
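
The ambiguity can be made concrete with a small sketch. Assuming, for illustration only, that suppressing a gene changes exactly its descendants in the regulatory graph, the three candidate pathways predict the same set of affected genes. The graph encoding and the function name are hypothetical.

```python
def descendants(graph, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Each candidate pathway as a parent -> children adjacency map.
candidates = {
    "A -> B -> C": {"A": ["B"], "B": ["C"]},
    "A -> C -> B": {"A": ["C"], "C": ["B"]},
    "C <- A -> B": {"A": ["B", "C"]},
}

# Genes predicted to change when A is suppressed, under each candidate.
affected = {name: descendants(g, "A") for name, g in candidates.items()}
for name, genes in affected.items():
    print(name, sorted(genes))
```

All three candidates yield the same prediction, so a single knockout of A cannot separate them; a further manipulation (for example, of B) would be needed.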

Difference Method - Fundamental Problems
  • How to make optimal multiple statistical
    decisions about expression differences
  • How to efficiently extract all information from
    an experiment
  • How to dynamically schedule experiments for
    maximal information

Association Method
  • An example or two of recovery of regulatory
    structure previously established by Difference
    methods. No novel discoveries so far.
  • Requires larger number of experimental
  • Depends on statistical methods for implicitly or
    explicitly estimating conditional probability
    relations among cellular expression levels

Testing Algorithms - Simulator
  • User specifies
  • Functional relationships between cells
  • Measurement errors
  • Averaging over different numbers of cells
  • Gene regulatory network structures (including
    varying time lags)
  • Type of experiment
  • This provides a known structure to test
    algorithms on, under a variety of assumptions
    about how genes are related

Simulating MicroArray Data
  • Tetrad 4 (

Network structure
Functional form
Specifying the Network Structure
Specifying the Parameters
Data Output
  • Cell-by-cell raw data
  • Aggregated measurements

Simulating MicroArray Data
  • Simulated correlation between genes 1 and 3,
    averaging over different numbers of cells (10,
    100, and 1,000 cells/dish) over 450 time steps

Analysis of Gene Expression Levels Averaged Over
Many Cells
Averaging and Association
  • Goal is to discover the structure of a regulatory
    network from associations among expression levels
    of each pair of genes, and their associations
    conditional on values of other genes
  • But we measure only concentrations (averages)
    formed from the mRNA of many cells
  • For many systems, conditional associations are
    altered by averaging

The Endo 16 Regulatory Function
  • Regulation of the Endo16 gene of the sea urchin
    (from C. Yuh, H. Bolouri, E. Davidson, Genomic
    Cis-Regulatory Logic: Experimental and
    Computational Analysis of a Sea Urchin Gene,
    Science, 1998, March 20; 279: 1896-1902)

The Endo16 Regulatory Function
The Endo 16 Regulatory Function, Slightly More
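If (CG1 P) (B(t) G(t)) > 0, then
  Θ(t) = 2 (1 (F E CD) Z) (1 CG2 CG3 CG4) (CG1 P) (B(t) G(t))
Else
  Θ(t) = 2 (1 (F E CD) Z) (1 CG2 CG3 CG4) Otx(t)
(The operator symbols between terms were lost in extraction; the slide's legend identifies which symbol is the Boolean sum.)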
Conditional Independence Is Not Invariant in a
Simplified Form of Endo 16 Regulation
  • X takes values in a discrete set, say {0, 1, 2, 3, 4}
  • Y = g(X), g nonlinear, say Y = X^2
  • Z = aYW, a real, W Boolean (values in {0, 1}),
    with a Bernoulli distribution
  • X → Y → Z ← W

Conditional Independence Is Not Invariant in a
Simplified Form of Endo 16 Regulation
  • X is independent of Z conditional on Y, but
  • ΣX is not independent of ΣZ conditional on ΣY,
    where the sum is over values in n = 4 or more
    identically and independently distributed units
  • For large n this result generalizes to all cases
    in which the range of X is finite (but not
    binary), g is polynomial, and W is as above
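
The claim can be checked exactly by enumeration in a small case. The sketch below assumes X uniform on {0, 1, 2, 3, 4}, Y = X^2, Z = aYW with a = 1, and W Bernoulli with parameter 1/2; these specific values and all function names are illustrative choices, not taken from the slides. It builds the exact joint distribution of the sums over n cells and tests whether ΣX is independent of ΣZ given ΣY.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def joint_of_sums(n, x_values=(0, 1, 2, 3, 4), p=Fraction(1, 2), a=1):
    """Exact joint distribution of (sum X, sum Z, sum Y) over n i.i.d. cells."""
    cell_states = []
    for x in x_values:
        for w, pw in ((0, 1 - p), (1, p)):
            y = x * x          # Y = g(X) with g(x) = x^2
            z = a * y * w      # Z = a * Y * W
            cell_states.append(((x, z, y), Fraction(1, len(x_values)) * pw))
    joint = defaultdict(Fraction)
    for combo in product(cell_states, repeat=n):
        prob = Fraction(1)
        sx = sz = sy = 0
        for (x, z, y), pr in combo:
            prob *= pr
            sx, sz, sy = sx + x, sz + z, sy + y
        joint[(sx, sz, sy)] += prob
    return dict(joint)


def cond_independent(joint):
    """True iff the first coordinate is independent of the second given the third."""
    p_c, p_ac, p_bc = defaultdict(Fraction), defaultdict(Fraction), defaultdict(Fraction)
    for (a, b, c), pr in joint.items():
        p_c[c] += pr
        p_ac[(a, c)] += pr
        p_bc[(b, c)] += pr
    for (a, c) in p_ac:
        for (b, c2) in p_bc:
            if c2 != c:
                continue
            # Independence given c: P(a, b, c) * P(c) == P(a, c) * P(b, c).
            if joint.get((a, b, c), 0) * p_c[c] != p_ac[(a, c)] * p_bc[(b, c)]:
                return False
    return True


single_cell = cond_independent(joint_of_sums(1))   # X vs Z given Y
four_cells = cond_independent(joint_of_sums(4))    # sums over 4 cells
print(single_cell, four_cells)
```

The failure at n = 4 can be seen by hand: ΣY = 4 arises both from cells (0, 0, 0, 2) and from cells (1, 1, 1, 1), which give different ΣX and different distributions for ΣZ.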

General Pessimistic Conclusion (not a Theorem)
  • Conditional probability relations that hold among
    regulator and regulated gene transcript
    concentrations at the cellular level will not be
    preserved in probability relations as measured in
    microarrays taken from multiple cell sources
  • They will be preserved for linear systems and
    locally linear systems (see Chu, et al.), but
    no regulatory systems are as yet known to have
    such a structure

Analysis of SAGE Data
Difference Strategy and SAGE
  • Estimating whether expression levels of genes
    change in different environments, or with other
    genes removed, requires a comparison of
    expression levels across samples
  • Decision must be made as to whether observed
    differences are or are not due to chance

SAGE and Variance
  • Decisions as to whether differences in expression
    levels are or are not due to chance depend on the
    estimate of the variance of the underlying
    probability distribution
  • Standardly, a multinomial model is used, which
    gives a very large variance, meaning decisions
    about the constancy of a gene's expression across
    environments cannot be reliably made

SAGE and Variance
  • One step in SAGE measurements is an amplification
    of the amount of mRNA measured through PCR
  • The multinomial model does not correctly
    represent the statistics of PCR
  • A correct estimate of variance requires an
    approximate estimate of the original total number
    of transcripts before PCR amplification
  • Relevant measurements can easily be made
  • Lead to a much lower estimate of variance of SAGE

Causal Clustering
The General Problem
  • Given data on a number of variables, find
    features of the underlying processes that
    generated the data
  • Example: spectral measurements of solar radiation
    intensities. Variables are intensities at each
    measured frequency

The Most Common Solution: Principal Components
Factor Analysis
  • Explains data by new theoretical variables that
    are linear functions of linear combinations of
    measured variables
  • Chooses theoretical variables to account for as
    much of the variance of measured variables as
    possible
  • Theoretical variables are not unique;
    appropriate transformations will do as well
  • Gives no clues to dependencies among real
    underlying factors; assumes they are independent
    of one another

General Problems with Clustering Algorithms
  • Tend to give misleading results if some of the
    measured variables influence other measured
    variables (e.g., through signal leakage between
  • Assume no correlations among the underlying
  • E.g., Independent Components algorithms

A New Approach General Considerations
  • For the time being, consider only linear models
  • Think graphically and let the algebra take care
    of itself
  • Be willing to make multiple hypothesis tests on
    the same data set
  • Insist on computational tractability, but be
  • Require asymptotic reliability under specifiable

Think Graphically
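  • A system represented by the equations
  • X_i = a_i T + e_i, a_i a real constant, e_i random,
    i = 1, ..., m,
  • with e_i independent of e_k for i not equal to k, is
    represented as
  • a graph with a single latent node T and a directed
    edge from T into each of X_1, X_2, ..., X_(m-1), X_m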
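
One consequence of this one-factor structure is easy to verify numerically: since cov(X_i, X_j) = a_i a_j Var(T) for i ≠ j, every tetrad difference among four distinct measured variables vanishes. The sketch below computes the implied covariance matrix and checks this; the loadings and variances are made-up values and the function names are illustrative.

```python
def one_factor_covariance(loadings, var_t, error_vars):
    """Covariance matrix implied by X_i = a_i * T + e_i with independent e_i."""
    m = len(loadings)
    return [
        [
            loadings[i] * loadings[j] * var_t + (error_vars[i] if i == j else 0.0)
            for j in range(m)
        ]
        for i in range(m)
    ]


def tetrad_differences(cov, i, j, k, l):
    """The two independent tetrad differences for four distinct variables."""
    return (
        cov[i][j] * cov[k][l] - cov[i][k] * cov[j][l],
        cov[i][j] * cov[k][l] - cov[i][l] * cov[j][k],
    )


# Hypothetical parameter values for four measured variables.
cov = one_factor_covariance(
    loadings=[0.5, 1.2, -0.7, 2.0], var_t=1.5, error_vars=[1.0, 0.3, 0.8, 2.0]
)
diffs = tetrad_differences(cov, 0, 1, 2, 3)
print(diffs)
```

Constraints of this kind are what allow a graphical search to test a one-factor cluster against data without first estimating the loadings.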

Causal Clustering
  • Assumptions (for some while)
  • Linear Systems
  • Recursive (acyclic graph)
  • Independent noises or error terms
  • Normal distributions of error variables
  • Independent, identically distributed cases
  • Faithfulness: vanishing partial correlations, if
    any, hold for all values of the linear

  • Values for variables X1, ..., Xn for a number of
  • Significance level (α level) to be used in
    hypothesis tests
  • Nothing else

  • Disjoint clusters of some of the observables
  • A set of directed acyclic graphs (DAGs) among
    theoretical variables, one variable for each
  • Each DAG determines a linear model
  • Just write each variable (node) in the graph as a
    linear function of its parent variables in the
    graph and add an error term for each equation

The True Graph
Purify Start of Round 1
Purify Round 1
  • For each measured variable X, do a test of the
    one factor model, with latent common cause T0,
    and with all measured variables except X, against
    the one factor model with all measured variables
    including X (Difference of chi squares)
  • If the model without X is not rejected, put X in
    set Hold For 1

Purify Steps into Round 1

Purify End of Round 1
Washdown, Round 1
  • Put all measured variables in Hold For 1 in a new
    cluster with a single common latent factor, T1
  • Correlate the new factor with the previous latent
    factor, T0
  • Empty Hold For 1

Washdown Round 1
Purify Round 2
  • Repeat the Purify procedure on all measured
    variables remaining in the first cluster. Put any
    rejected variables in Hold For 1
  • Apply the Purify procedure to all measured
    variables in the second cluster. Put any rejected
    variables in Hold for 2

Purify Round 2
Washdown, Round 2
  • Add variables in Hold For 1 to the remaining
    variables in cluster T1
  • Form a new cluster, with a new latent common
    cause T2 with the variables in Hold For 2
  • Correlate all of the latent variables
  • Empty Hold For 1 and Hold For 2

Washdown Round 2
Purify/Washdown Output (after 5 rounds)
Clean Up
  • Remove any clusters with fewer than 3 observed

Determining Latent Structure
  • For each pair of latent variables, Tj and Tk, and
    their measured effects, test the model in which
    there is a directed edge Tj → Tk against the
    model in which there is no directed edge
  • If the model with a directed edge is not
    rejected, keep an undirected edge Tj - Tk.
    If the model with a directed edge is rejected,
    remove the Tj - Tk undirected edge

MIMBuild Step 1: Testing Marginal Independencies
Testing T2, T3
χ² = 12.42, df = 9
χ² = 11.42, df = 8
Not significantly different (α = 0.05): keep edge
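
The comparison of the two fitted models reported on this slide is a chi-square difference test. As a sketch, the difference of the two statistics is referred to a chi-square distribution whose degrees of freedom equal the difference in df (here 1); for one degree of freedom the upper-tail probability has a closed form via the complementary error function. The variable names are illustrative.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability P(chi-square with 1 df > x)."""
    return math.erfc(math.sqrt(x / 2.0))


# The two model fits reported on the slide.
chi2_a, df_a = 12.42, 9
chi2_b, df_b = 11.42, 8

difference = chi2_a - chi2_b          # about 1.00
df_difference = df_a - df_b           # 1
p_value = chi2_sf_df1(difference)
significant = p_value < 0.05
print(round(p_value, 3), significant)
```

The resulting p-value is well above 0.05, which agrees with the slide's verdict that the two models are not significantly different.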
Testing for Conditional Independence
  • To test if Tj is independent of Tk conditional on
    Tm, form the complete graph among Tj, Tk and Tm
    (with measured variable effects) and test against
    the same model without the Tj - Tk edge
  • Similarly for conditioning on multiple variables

MIMBuild Step N: Testing Independencies
Conditioned on a Set of Size N
Other example: N = 3, testing T0, T4 conditional on
T1, T2, T3
Orienting Edges
  • If, for example, there is a structure T0 - T1 - T2
    but no T0 - T2 edge, and the T0 - T2 edge was
    removed without conditioning on T1, orient
    T0 - T1 - T2 as T0 → T1 ← T2 (as a collider)
  • Orient undirected edges adjacent to a collider
    node away from the collider
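
The collider rule can be sketched as follows. The code assumes the undirected skeleton and the conditioning set used to remove each edge have already been recorded by the independence tests; the data layout and function name are illustrative, not the MIMBuild implementation.

```python
from itertools import combinations


def orient_colliders(adjacent, sepset):
    """Orient unshielded triples a - m - b as a -> m <- b.

    adjacent: dict mapping each node to the set of its neighbours in the
              undirected skeleton.
    sepset:   dict mapping frozenset({a, b}) to the set of nodes that was
              conditioned on when the a - b edge was removed.
    Returns a set of directed edges as (tail, head) pairs.
    """
    directed = set()
    for middle, neighbours in adjacent.items():
        for a, b in combinations(sorted(neighbours), 2):
            if b in adjacent[a]:
                continue  # a and b are adjacent, so the triple is shielded
            if middle not in sepset.get(frozenset((a, b)), set()):
                directed.add((a, middle))
                directed.add((b, middle))
    return directed


# Skeleton T0 - T1 - T2, where T0 - T2 was removed without conditioning on T1.
skeleton = {"T0": {"T1"}, "T1": {"T0", "T2"}, "T2": {"T1"}}
edges = orient_colliders(skeleton, {frozenset(("T0", "T2")): set()})
print(sorted(edges))
```

Had the T0 - T2 edge been removed only after conditioning on T1, the function would return no orientations for this triple.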

T0 T1 T3 T2
Final Outcome
Purify/Washdown/MIMBuild output
True graph
General Idea
  • Measured variables are assigned to clusters by
    testing whether the one factor model fits the
    data better with them or without them
  • Every rejected variable is tested on each
    succeeding cluster until it fits
  • The latent structure is determined by the PC
    algorithm (Spirtes, et al. 1993) , known to be
    asymptotically correct under the Faithfulness
    assumption, and (in this case) under the
    assumption that there are no unmeasured causes of
    the latent cluster factors

  • Using another algorithm for latent structure, the
    FCI algorithm, procedure can be applied when
    there may be unmeasured common causes of cluster
    latent factors
  • Can be used with any distribution family for
    which there are good tests of conditional
    independence (not that there are many)
  • The algorithm can be easily integrated with prior
    substantive knowledge about the actual structure
  • For linear systems, can be generalized to latent
    structures with cyclic graphs (feedback systems)
  • Improved performance expected if Bayesian search
    algorithms supplement constraint based search, or
    with genetic algorithms

  • Only works for unmeasured causes having at least
    3 unconfounded measured variables
  • But if there is a known or suspected common cause
    of all measures (or any set of measures), it can
    be estimated and partialed out
  • Does not give orientations of all edges
  • Requires large sample sizes
  • Computationally intensive
  • No error probabilities are possible

Experiment 1 Microarray analysis
Background of the Experiment
  • Fat cells from mice are treated with
    troglitazone, which is a member of the family of
    drugs known as thiazolidinediones (TZDs)
  • TZDs are used in humans to increase the
    efficiency of the biological actions of insulin
    in diabetes and obesity
  • Decreased insulin sensitivity is a hallmark of
    both diabetes and obesity
  • The action is to activate the expression of
    specific genes
  • At the end of a particular incubation the cells
    were quickly frozen to stop all biological
    processes in the cell

cDNA Microarray Analysis of the 3T3-L1 Adipocyte
response to Troglitazone
  • 3T3-L1 pre-adipocytes cultured in vitro
  • 3T3-L1 pre-adipocytes differentiated into mature
    adipocytes by addition of insulin and
  • Mature adipocytes exposed to 10µM Troglitazone
    for durations of between 15 minutes and 24 hours
  • Cells harvested directly in Trizol reagent and
    total cellular RNA extracted by standard

cDNA Microarray Analysis of the 3T3-L1 Adipocyte
response to Troglitazone
  • First strand cDNA synthesized by Reverse
    Transcriptase in the presence of α-33P-dCTP
  • cDNA hybridized to Research Genetics GF400
    (mouse) Gene Filters using standard methods
  • Hybridized signal captured using Storm (Molecular
    Dynamics) phosphorimager and gene-specific signal
    intensity extracted using Pathways 4TM software
    (Research Genetics).

Data Scheme
  • 20 array chips with 47 measurements
  • 3 uses for each chip: 20 for the first
    hybridization, 20 for the second hybridization, 7
    for the third hybridization
  • 3 treatments: control without DMSO, control with
    DMSO, test sample (drug + DMSO)
  • 35 time points
  • 5355 genes
  • The data contain information about background,
    chromosomes, release plates, the coordinates of
    each spot on the plate, etc.

  • The data was logged because
  • it gives a better sense of the amount of
  • the amount of variance in a gene expression level
    was proportional to the gene expression level
  • Each chip was adjusted to have median zero in
    order to remove global chip-to-chip variations
  • Outliers were removed because very high and very
    low gene intensities are not reliably
Determine the Effect of the Drug Treatment on the
Gene Expression Level Over Time
  • Compare 20 genes with highest variability in
    use-1 data with 20 genes with lowest variability
  • Perform statistical tests of hypothesis that
    genes are not changing, adjusted for multiple
    testing problem

Are the Measurements for the Second Use reliable?
  • Chips are supposed to be re-usable
  • However, the second measurement on each chip
    resembles the first measurement on each chip more
    closely than it resembles measurements that
    occurred at the same time
  • Figure in next slide shows close resemblance
    between different measurements on same chip, but
    taken at different times

Are the Measurements for the Second Use Reliable?
  • Is it an experimental error?
  • Should we use the chips only once?
  • Is at least the use-1 data set reliable?
  • We are using other more reliable, but more
    expensive tests to evaluate these hypotheses

Future Plans
  • Remove outlying genes
  • Improve the test performed for data in use-1
  • Clustering methods for data in use-1
  • Check the data for use-2

Experiment 2 SAGE Analysis
Serial Analysis of Gene Expression (SAGE)
  • Analysis of the effect of laminar shear stress on
    gene expression in the vascular endothelium
  • Primary coronary artery endothelial cells (HCAEC)
    grown to confluency on glass microscope slides
  • Slides placed in parallel plate flow chamber and
    cells exposed to laminar shear stress for 0, 4,
    8, 12, 20 and 24 hours
  • Cells harvested directly into Trizol reagent
    (InVitrogen) and total RNA extracted
  • RNA used as substrate for construction of SAGE
    library and SAGE tags analyzed by automated DNA
  • SAGE tag data analyzed using SAGE2000 software
    and gene expression measurement recorded for all
    genes present

Preliminary Clustering Analysis of Genes
Regulated >2-fold
NB: Samples are clustered using the Pearson
correlation. Red, yellow and blue bars indicate
high, medium and low levels of gene expression
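
For reference, the Pearson correlation used as the similarity measure in this clustering can be computed as in the sketch below. The function name and the example expression profiles are illustrative; the clustering software used in the experiment is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov_xy / math.sqrt(var_x * var_y)


# Hypothetical expression profiles across four time points.
rising = [1.0, 2.0, 3.0, 4.0]
also_rising = [2.0, 4.0, 6.0, 8.0]
falling = [8.0, 6.0, 4.0, 2.0]
print(pearson(rising, also_rising), pearson(rising, falling))
```
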
[Figure: flow loop with flow chamber; parallel plate flow cell with flow in and flow out]