Eigensolvers for analysis of microarray gene expression data

1
Eigensolvers for analysis of microarray gene
expression data
  • Andrew Knyazev (speaker) and Donald McCuan
  • Image from
    http://www.biosci.utexas.edu/mgm/people/faculty/profiles/VIresearch.jpg
  • Supported by NSF DMS 0728941. In collaboration
    with CU MCD Biology.

2
Eigensolvers for DNA microarrays
  • Crash course on gene expression
  • Microarrays: a massively parallel experiment
  • Clustering: why?
  • Clustering: how?
  • Spectral clustering
  • Connection to image segmentation
  • Eigensolvers for spectral clustering

3
Crash course on gene expression 1/3
  • Genes in DNA code for proteins
  • Protein formation in a cell involves:
  • Transcription of DNA to mRNA (messenger RNA)
  • Translation of mRNA to a protein
  • When proteins are being formed from a gene, this
    is called gene expression

DNA sense strand:     ... ATA CGT ...
DNA antisense strand: ... TAT GCA ...
    (transcription)
mRNA:                 ... AUA CGU ...
    (translation)
Protein:              ... Ile Arg ...
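
The two steps on this slide can be traced in a few lines of code. This is a minimal sketch, not a bioinformatics tool: the codon table holds only the two codons shown on the slide (a full table has 64 entries), and the function names are illustrative.

```python
# Codon table restricted to the two codons that appear on the slide.
CODON_TO_AMINO_ACID = {"AUA": "Ile", "CGU": "Arg"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def antisense(sense):
    """Base-by-base complement of the sense strand, written as on the slide."""
    return "".join(COMPLEMENT[base] for base in sense)


def transcribe(sense):
    """The mRNA carries the sense-strand sequence with U in place of T."""
    return sense.replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases (one codon) at a time."""
    usable = len(mrna) - len(mrna) % 3
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TO_AMINO_ACID[codon] for codon in codons]


sense = "ATACGT"
template = antisense(sense)   # the strand the mRNA is copied from
mrna = transcribe(sense)
protein = translate(mrna)
print(template, mrna, protein)
```
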
4
Crash course on gene expression 2/3
Image courtesy of cnx.org
5
Crash course on gene expression 3/3
  • Gene expression in a cell depends on many
    factors, e.g., developmental stage, nutrition,
    environment, and diseases, so the level of gene
    expression may vary
  • Knowing how genes are expressed helps to
    understand cellular processes and diagnose
    diseases
  • Measurement of the concentration of proteins in a
    cell is complicated, so the concentration of mRNA
    is used instead, assuming that most mRNA created
    is actually translated to a protein
  • DNA Microarrays (e.g., Affymetrix GeneChip
    arrays) measure the level of mRNA in a sample

6
Microarrays: a massively parallel experiment 1/5
Affymetrix GeneChip DNA Microarrays
Image courtesy of Affymetrix
7
Microarrays: a massively parallel experiment 2/5
  • GeneChip oligonucleotide sequences are
    photo-lithographed on a quartz wafer in a pattern
    of 10-micrometer dots.
  • Oligonucleotide (oligo) probes are 25-nucleotide
    chains, complementary to mRNA, for selected
    parts of a gene.
  • GeneChips are manufactured to include all
    currently known and predicted genes of a
    particular organism, e.g., H. sapiens. The
    information about the physical locations of the
    oligo probes for each gene on the chip is
    contained in the .cdf file.
  • A sample of mRNA extracted from the cells of an
    organism is, after pre-processing, hybridized
    with the GeneChip, giving PM and MM values that
    characterize gene expression in the cells.

For every gene there are 11-20 different oligo
probes (depending on chip design), called perfect
matches (PM). In addition, for each PM there is a
mismatch oligo (MM) that differs from it in the
middle base.
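
The PM/MM relationship can be sketched as follows. The slide only says that the MM probe differs in the middle base; replacing that base with its complement is an assumption made here, and the 25-mer below is a made-up sequence, not a real probe.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm):
    """Return an MM probe: the PM probe with only its middle base changed.

    Assumption: the middle base is swapped for its complement.
    """
    mid = len(pm) // 2
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


pm = "ATGCGTACGTTAGCCTAGGATCCGA"   # hypothetical 25-mer, for illustration only
mm = mismatch_probe(pm)

# Positions (0-based) where the two probes differ.
differences = [i for i, (a, b) in enumerate(zip(pm, mm)) if a != b]
print(pm)
print(mm)
print(differences)
```
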
8
Microarrays: a massively parallel experiment 3/5
  • Labelled cRNA targets derived from the mRNA of an
    experimental sample are hybridized to oligo
    probes.
  • During hybridization, complementary nucleotides
    line up and bind together via hydrogen bonds, in
    the same way as two strands of DNA are bound
    together.
  • The chip is then scanned with a laser, giving the
    amount of each mRNA species represented.
  • Image courtesy of cnx.org

9
Microarrays: a massively parallel experiment 4/5
  • A pool of mRNA is extracted from the cells of an
    organism and converted to a biotin-labelled
    strand (cRNA) that binds to the oligo probes on
    the GeneChip during hybridization.
  • The higher the concentration of a particular mRNA
    in the testing pool, the greater the
    hybridization level of the PM probes and thus the
    amount of hybridized material on the processed
    GeneChip.
  • Then a fluorescent stain is applied that binds to
    the biotin, and the GeneChip is processed through
    a scanner that illuminates each dot of the
    GeneChip with a laser, causing the dots to
    fluoresce.
  • The image data of the scanned probe array is
    stored in a .dat file. The Affymetrix GCOS
    software processes the .dat file and generates a
    .cel file containing all numerical data of the
    GeneChip experiment, e.g., probe locations and PM
    and MM intensities. The processing involves
    computing a square grid locating the dots for
    the probes, normalizing intensities using
    internal controls, and detecting outliers.
  • More sophisticated .dat -> .cel algorithms,
    e.g., taking into account the cRNA saturation,
    are being developed elsewhere.

10
Microarrays: a massively parallel experiment 5/5
  • The PM and MM values are not normally used
    directly for high-level statistical analysis.
    Instead, they are first converted into gene
    expression values, which involves:
  • Detecting unreliable data by comparing PM and MM
  • Adjusting for background and noise
  • Calculating the single-array gene expression
    intensities, basically by averaging the adjusted
    PM values for each probe set
  • Alternatively, the Comparison Analysis
    (Experiment versus Baseline arrays) detects and
    quantifies changes in gene expression between
    two arrays, applying normalization of the data
    and using the Signal Log Ratio algorithms.
  • Either way, the absolute or comparison gene
    expression values are stored in a .chp file,
    which serves as the input for high-level
    statistical analysis. Typically, multiple
    GeneChip tests are performed, giving multiple
    .chp files with gene expression values.
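
The two summaries described on this slide can be illustrated with a deliberately simplified sketch. It is not the Affymetrix algorithm: subtracting MM from PM, the `floor` value, and the plain mean are assumptions chosen for readability, and the intensities are invented numbers.

```python
import math


def probe_set_signal(pm, mm, floor=1.0):
    """Average the adjusted PM values of one probe set.

    Assumed adjustment: PM - MM, clipped from below at `floor` so that
    probes where MM exceeds PM (unreliable data) cannot go negative.
    """
    adjusted = [max(p - m, floor) for p, m in zip(pm, mm)]
    return sum(adjusted) / len(adjusted)


def signal_log_ratio(experiment, baseline):
    """Base-2 log ratio of two expression values for the same gene."""
    return math.log2(experiment / baseline)


# Invented intensities for one gene (four probes) on two arrays.
pm_baseline, mm_baseline = [120.0, 150.0, 90.0, 199.0], [20.0, 50.0, 95.0, 100.0]
pm_experiment, mm_experiment = [320.0, 350.0, 390.0, 400.0], [20.0, 50.0, 90.0, 100.0]

baseline = probe_set_signal(pm_baseline, mm_baseline)
experiment = probe_set_signal(pm_experiment, mm_experiment)
ratio = signal_log_ratio(experiment, baseline)
print(baseline, experiment, ratio)
```

A log ratio of 2 means a four-fold increase relative to the baseline array.
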

11
Clustering: why?
  • Microarray experiments typically involve
    multiple microarrays, e.g.:
  • Studying a process over time, e.g., to measure
    the response to a drug or food.
  • Looking for differences between states, e.g.,
    normal cells versus cancer cells.
  • A typical goal is finding gene networks, i.e.,
    groups of genes that change expression
    inter-dependently across samples. Given a
    sufficiently large number of microarrays, we
    want to reverse engineer the regulatory network
    that controls gene expression. We need computer
    clustering on the microarray data to select a
    (ideally) small number of co-expressed genes of a
    gene network. Separate experiments using gene
    knockout on the selected genes can then be
    performed to confirm the discovered regulatory
    network biologically.

12
Clustering: how?
  • There is no good, widely accepted definition of
    clustering. The traditional graph-theoretical
    definition is combinatorial in nature and
    computationally infeasible. Heuristics rule!
  • Many clustering techniques and methods are known,
    e.g.:
  • Hierarchical clustering/partitioning
  • K-means (centroids)
  • Self-organizing maps (partitioning vectors)
  • Force-directed placement
  • Principal Components Analysis (PCA)
  • Spectral clustering/partitioning using Fiedler
    vectors
  • Some good and popular free open-source software
    is available, e.g., METIS and CLUTO (Karypis
    Lab).
  • We focus on PCA and spectral clustering.
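
As a rough illustration of the PCA option, the sketch below projects a genes-by-samples matrix onto its leading principal components using a singular value decomposition. The data are synthetic (two groups of genes with opposite trends plus small noise), and the function name and layout are assumptions made for this example.

```python
import numpy as np


def pca_scores(expression, n_components=2):
    """PCA of a genes x samples matrix via the SVD.

    Returns the per-gene scores on the leading components and the
    fraction of total variance each of those components explains.
    """
    centered = expression - expression.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]


# Synthetic data: 5 genes rising and 5 genes falling across 4 samples.
rng = np.random.default_rng(0)
trend = np.array([1.0, 2.0, 3.0, 4.0])
rising = trend + 0.05 * rng.standard_normal((5, 4))
falling = -trend + 0.05 * rng.standard_normal((5, 4))
expression = np.vstack([rising, falling])

scores, explained = pca_scores(expression)
groups = scores[:, 0] > 0   # sign of the first score separates the groups
print(groups, explained)
```
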

13
Spectral clustering
Images courtesy of Russell, Kettering U.
A 4-degree-of-freedom system has 4 modes of
vibration and 4 natural frequencies. We partition
it into 2 clusters using the second eigenvector.
  • $A$: adjacency matrix
  • $D$: degree matrix
  • Laplacian matrix: $L = D - A$
  • Fiedler eigenvectors: $Lx = \lambda x$
  • N-cut eigenvectors, $Lx = \lambda Dx$ (smallest
    $\lambda$), are the largest for PCA and Markov
    walks, $Ax = \mu Dx$ with $\mu = 1 - \lambda$.
    $D^{-1}A$ is row-stochastic and describes the
    walk probabilities.

Example courtesy of Blelloch, CMU
[Figure: example graph on vertices 1 to 5 and its
Laplacian matrix, whose rows sum to zero.]
www.cs.cas.cz/fiedler80/
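
The definitions on this slide can be checked numerically on a small graph. The graph below (two triangles joined by one edge) is a made-up example, not the one from the figure, and a dense eigensolver is used only because the graph is tiny.

```python
import numpy as np

# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

degrees = A.sum(axis=1)
D = np.diag(degrees)
L = D - A                              # graph Laplacian; each row sums to zero

# Fiedler vector: eigenvector of the second-smallest eigenvalue of L.
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]
clusters = fiedler > 0                 # the sign pattern gives the two clusters

# N-cut problem L x = lambda D x, solved in the equivalent symmetric form
# D^{-1/2} L D^{-1/2} y = lambda y.
d_inv_sqrt = 1.0 / np.sqrt(degrees)
L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
ncut_lambda = np.linalg.eigvalsh(L_sym)            # ascending order

# Markov walk matrix D^{-1} A; its eigenvalues are mu = 1 - lambda.
walk = A / degrees[:, None]
mu = np.sort(np.linalg.eigvals(walk).real)[::-1]   # descending order

print(clusters)
print(np.allclose(mu, 1.0 - ncut_lambda))
```

The smallest eigenvalues of the N-cut problem therefore correspond to the largest eigenvalues of the walk matrix, as stated in the last bullet.
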
14
Connection to image segmentation
  • Image pixels serve as graph vertices. Weighted
    graph edges are computed by comparing pixel
    colours.
  • Here is an example displaying 4 Fiedler vectors
    of an image

We generate a sparse Laplacian by comparing only
neighbouring pixels when computing the weights
for the edges. For microarrays, genes correspond
to vertices, but we have to compare all pairs of
genes, possibly getting a Laplacian with a large
fill-in.
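
The difference in fill-in can be seen on a toy example. In this sketch the Gaussian weight on intensity differences, the value of `sigma`, and the use of absolute Pearson correlation as a gene similarity are all assumptions. Dense arrays are used only to keep the example short; a real image would need a sparse matrix format.

```python
import numpy as np


def laplacian(weights):
    """Graph Laplacian L = D - W for a symmetric weight matrix W."""
    return np.diag(weights.sum(axis=1)) - weights


def pixel_weights(image, sigma=0.1):
    """Connect each pixel only to its horizontal and vertical neighbours."""
    rows, cols = image.shape
    W = np.zeros((rows * cols, rows * cols))
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    w = np.exp(-((image[r, c] - image[rr, cc]) / sigma) ** 2)
                    W[r * cols + c, rr * cols + cc] = w
                    W[rr * cols + cc, r * cols + c] = w
    return W


def gene_weights(expression):
    """Compare every pair of genes: weight = |Pearson correlation|."""
    W = np.abs(np.corrcoef(expression))
    np.fill_diagonal(W, 0.0)
    return W


# 8 x 8 image: dark left half, bright right half -> 64 vertices.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
L_pixels = laplacian(pixel_weights(image))

# 64 synthetic genes measured on 10 arrays -> also 64 vertices.
rng = np.random.default_rng(0)
L_genes = laplacian(gene_weights(rng.standard_normal((64, 10))))

print(np.count_nonzero(L_pixels), np.count_nonzero(L_genes), L_pixels.size)
```

With the same number of vertices, the pixel Laplacian has a few nonzeros per row, while the gene Laplacian is essentially full.
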
15
Eigensolvers for spectral clustering
  • Our BLOPEX-LOBPCG software has proved to be
    efficient for large-scale eigenproblems for
    Laplacians from PDEs and for image segmentation,
    using the multiscale preconditioning of hypre
  • The LOBPCG for massively parallel computers is
    available in our Block Locally Optimal
    Preconditioned Eigenvalue Xolvers (BLOPEX)
    package
  • BLOPEX is built into hypre,
    http://www.llnl.gov/CASC/hypre/, and is included
    as an external package in PETSc, see
    http://www-unix.mcs.anl.gov/petsc/
  • On 1024 CPUs of BlueGene/L we can compute the
    Fiedler vector of a 24 megapixel image in seconds
    (including the hypre algebraic multigrid setup).
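
For readers without access to BLOPEX, the same kind of computation can be tried at small scale with the LOBPCG implementation in SciPy (`scipy.sparse.linalg.lobpcg`). The sketch below is not the BLOPEX/hypre setup described above: it runs serially, uses no preconditioner, and the test graph (two rings joined by one weak edge) is invented for the example.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags
from scipy.sparse.linalg import lobpcg


def ring_edges(start, size, reach=3):
    """Ring of `size` vertices, each linked to `reach` neighbours per side."""
    return [(start + i, start + (i + k) % size, 1.0)
            for i in range(size) for k in range(1, reach + 1)]


size = 40
edges = ring_edges(0, size) + ring_edges(size, size)
edges.append((0, size, 0.01))            # weak bridge between the two rings
n = 2 * size

rows, cols, vals = zip(*edges)
W = coo_matrix((vals, (rows, cols)), shape=(n, n))
W = (W + W.T).tocsr()                    # symmetric weight matrix
degrees = np.asarray(W.sum(axis=1)).ravel()
L = (diags(degrees) - W).tocsr()         # sparse graph Laplacian

rng = np.random.default_rng(0)
X = rng.standard_normal((n, 1))          # random initial guess, block size 1
Y = np.ones((n, 1))                      # constant vector: null space of L

# Constraining against Y removes the trivial zero eigenvalue, so the
# smallest remaining eigenpair is the Fiedler pair.
eigenvalues, eigenvectors = lobpcg(L, X, Y=Y, largest=False,
                                   tol=1e-6, maxiter=500)
fiedler = eigenvectors[:, 0]
clusters = fiedler > 0
print(eigenvalues[0], clusters[:size].all() != clusters[size:].all())
```

For large problems, a preconditioner can be passed through the `M` argument. In the setup on this slide that role is played by hypre's algebraic multigrid.
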

16
Work in progress/future work
  • Our current test cases analyze:
  • Affymetrix GeneChip data from Marina Kniazeva et
    al., PLoS Biology, 2004.
  • Microarray data from Liang Zhang et al.,
    Molecular Cell, 2007.
  • Our future work will involve developing prototype
    spectral clustering software for microarrays in
    the Bioinformatics toolbox in MATLAB, writing a
    microarray analysis driver for our BLOPEX
    library, and testing on large-scale publicly
    available microarray data.