Bioinformatics: The Analysis of Microarray Data - PowerPoint PPT Presentation

1
Bioinformatics: The Analysis of Microarray Data
  • Robert Gentleman
  • Department of Biostatistics
  • Harvard University
  • DFCI

2
Bioconductor
  • a new project aimed at providing software
    resources for analysing and manipulating
    biological data
  • the project has multiple goals, including:
  • provide high-quality software to researchers
  • provide structure and examples that will enable
    rapid development of new methodology
  • explore new methods in both statistics and
    computing and make these available as rapidly as
    possible

3
Bioconductor
  • web site will be located at www.bioconductor.org
  • not active yet but it should be in the next
    couple of weeks
  • initial offerings will be several libraries of
    functions providing
  • infrastructure support in the form of data
    structures
  • annotation support in the form of a synthesis of
    different databases into a form that is useful
    for the analyses we want to carry out

4
Bioconductor
  • a collaborative project with participants from
  • DFCI and Harvard School of Public Health and FAS
  • UC Berkeley, Department of Biostatistics
  • University of Heidelberg
  • ETH, Zurich
  • Technical University, Vienna

5
Bioconductor
  • our current membership is mainly statisticians
    with a strong computing background
  • we would like to have a team of statisticians,
    computer scientists and biologists identifying
    both problems and potential solutions

6
Bioinformatics
Computer Science
Biology
Statistics
7
Bioinformatics
  • there are many challenges
  • large and complex data
  • complex models
  • computational requirements can be enormous
  • data are often a mix of numeric and non-numeric
    (we can deal with the former better than the
    latter)
  • perhaps the largest challenge is to develop tools
    that easily and accurately reveal the biology

8
Bioinformatics
  • the tools used to analyse the data are themselves
    complex and they need to be!
  • we are asking and answering very complex
    questions
  • the user interface should be simple and intuitive
  • we need a flexible system for the development of
    new tools

9
Bioinformatics Tools
  • some methods that have been successfully employed
    in similar situations
  • object-oriented programming
  • visualization
  • statistical modeling
  • parallel algorithms

10
Bioinformatics Tools
  • I contend that the basis for constructing
    bioinformatics tools should be a proper
    programming language

11
Bioinformatics Tools
  • the ideal development environment should have
    some properties
  • high quality graphics (preferably interactive)
  • seamless access to databases
  • good numerics (preferably with many math/stat
    algorithms as primitives)
  • a system for producing packages
  • an intuitive user interface
  • it doesn't exist!

12
Bioinformatics Tools
  • our approach is to start with a language that
    has most, but not all, of these properties
  • we will then work on extending that language to
    provide the missing pieces
  • the language: R, a language for statistical
    computing and graphics
  • www.r-project.org

13
Bioinformatics
  • in the remainder of this talk I will consider
    some basic tasks that we are interested in
    carrying out and show how our use of R has
    simplified the implementation
  • note that a number of other teams working on the
    development of tools for Bioinformatics have
    also adopted R; however, they typically have a
    less aggressive strategy

14
Specifics
  • we now turn our attention to the analysis of DNA
    microarray data (only as a specific example)
  • most of the important points here are
    transferable to the analysis of other types of
    data

15
Experimental Design
  • random errors: exact replication using the same
    reagents, same samples, same technicians, etc.
    will still yield variation
  • systematic variation
  • between technicians
  • between batches/reagents
  • don't want systematic components to align with
    experimental conditions (confounding)

16
Types of assays
  • The main types of gene expression assays
  • Serial analysis of gene expression (SAGE)
  • Short oligonucleotide arrays (Affymetrix)
  • Long oligonucleotide arrays (Agilent)
  • Fibre optic arrays (Illumina)
  • cDNA arrays (Brown/Botstein).

17
Microarray Data
  • data are typically obtained from three distinct
    sources
  • the experimental data that provides expression
    level data for a selected set of genes (or ESTs)
  • the sample level covariates, including
    experimental conditions
  • the biological metadata (GenBank, LocusLink,
    KEGG, and so on)

18
Applications of microarrays
  • Measuring transcript abundance (cDNA arrays)
  • Genotyping
  • Estimating DNA copy number (CGH)
  • Determining identity by descent (GMS)
  • Measuring mRNA decay rates
  • Identifying protein binding sites
  • Determining sub-cellular localization of gene
    products

19
Some Questions
  • Which genes have expression levels that are
    correlated with some external variable?
  • For a given pathway, which of the genes in our
    collection are most likely to be involved?
  • For a diffuse disease, which genes are associated
    with different outcomes?

20
Answering the questions
  • we need to obtain and then analyse the expression
    data
  • preprocessing of the image
  • normalization of the images
  • modeling to extract expression level data
  • gene-filtering
  • clustering
  • relating to biologic data

21
Steps in image analysis
1. Addressing. Estimate the location of spot
centers.
2. Segmentation. Classify pixels as foreground
(signal) or background.
3. Information extraction. For each spot on the
array and each dye, record
  • signal intensities
  • background intensities
  • quality measures

22-24
(No transcript: image slides)
25
Addressing
  • automatic addressing within the same batch of
    images
  • estimate translation of grids
  • estimate row and column positions (4-by-4 grids;
    foreground and background grids)
  • other problems: mis-registration, rotation, skew
    in the array
26
Segmentation
  • adaptive segmentation (seeded region growing,
    SRG)
  • fixed-circle segmentation
  • spots usually vary in size and shape
27
Image Analysis
  • we need a mechanism for storing and accessing the
    raw data (note databases are not really the
    answer)
  • we need tools to allow us to go back from the
    expression data to the set of spots that were
    used to compute that expression data

28
Image processing
  • Almost all steps in this process seem to lend
    themselves to a Bayesian analysis.
  • Many processing techniques use only the single
    slide of interest but there are usually other
    slides with similar structure that could be used.
  • In most cases there is structure that exists
    across slides, technicians, machines.

29
An Image Storage Solution
  • we have developed an HDF5 library
  • http://hdf.ncsa.uiuc.edu/
  • HDF5 is a widely used storage format for image
    data
  • our package allows users to access image data in
    R as if it were an ordinary array, while the
    image data remains in a disk file
  • we have used R's ability to link with other
    software to implement this quickly and
    effectively

30
Post-Processing
  • once the spot intensities have been obtained
    further processing is required to obtain
    expression level data (expression is often in
    terms of mRNA levels)
  • for Affymetrix arrays the spot intensities are
    for short oligos and must be processed to obtain
    gene level data
  • in all cases some form of normalization is
    required (basically an intensity alignment)
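The intensity alignment mentioned above can be sketched in base R. The matrix X, the median-centering, and the quantile scheme below are illustrative simulations, not the procedure of any particular package:

```r
# Hedged sketch: align array intensity distributions.
# 'X' is a hypothetical genes-x-arrays matrix of log intensities.
set.seed(1)
X <- matrix(rnorm(20, mean = rep(c(0, 1), each = 10)), nrow = 5)

# Median normalization: subtract each array's median intensity.
median_normalize <- function(X) sweep(X, 2, apply(X, 2, median))

# Quantile normalization: force every array to share one distribution.
quantile_normalize <- function(X) {
  ranks <- apply(X, 2, rank, ties.method = "first")
  means <- rowMeans(apply(X, 2, sort))
  apply(ranks, 2, function(r) means[r])
}

Xm <- median_normalize(X)
all(apply(Xm, 2, median) == 0)  # arrays now share median 0
```

After quantile normalization every array has exactly the same empirical distribution, which is a stronger alignment than median-centering alone.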

31
Expression Level Data
  • now, for each array we have obtained expression
    level data
  • the next step is to select those genes that have
    interesting expression levels.
  • "interesting" is interpreted in many different ways
  • high levels of expression in a subgroup of
    interest
  • lack of expression in a subgroup of interest
  • pattern of expression that correlates well with
    experimental conditions.

32
Data Structures
  • one of our goals is to introduce some standard
    data structures
  • in an object oriented setting these are called
    classes
  • a particular data set is referred to as an
    instance of the class
  • they allow us to model complex data in a natural
    way

33
Data Structures
  • if the class is well defined and describes the
    physical data well then using it is natural
  • methods or functions can be written to perform
    calculations on instances of the class
  • this makes it easier to both write the functions
    and to share methods across developers

34
Data Structures
  • one of the difficult tasks that confronts a data
    analyst in this field is to ensure that the data
    are correctly aligned
  • the expression data for each sample must be
    related to the correct phenotypic data and the
    gene annotation must be aligned correctly with
    the genes on the microarray
  • the class structure can help ensure proper
    alignment

35
Data Structures
  • these problems will become more severe as we
    obtain more data
  • for example, relating data from several
    different experiments with different sources
    (i.e. both cDNA arrays and short-oligo arrays)
  • starting to look at pathways or binding/promoter
    sites

36
The Evolution of Gene Selection
  • first: fold change, the ratio of expression
    levels between two groups
  • then: t-tests, so statistical variation comes
    into play
  • other statistical models: ANOVA, the Cox model,
    etc.
  • for large enough samples we can tailor the test
    to the distribution (which might be different in
    the two groups)
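The fold-change and t-test filters above fit in a few lines of R; the matrix X and the grouping factor below are simulated purely for illustration:

```r
# Hedged sketch: rank genes by fold change and by per-gene t-tests.
set.seed(2)
X   <- matrix(rnorm(200), nrow = 20)          # 20 genes, 10 arrays
grp <- factor(rep(c("tumor", "normal"), each = 5))

# Fold change on the log scale: difference of group means.
fold <- rowMeans(X[, grp == "tumor"]) - rowMeans(X[, grp == "normal"])

# Per-gene t statistics and p-values.
tstats <- apply(X, 1, function(x) t.test(x ~ grp)$statistic)
pvals  <- apply(X, 1, function(x) t.test(x ~ grp)$p.value)

order(pvals)[1:5]   # the five genes the t-test ranks most interesting
```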

37
Filters and Orders
  • what sorts of questions can we answer?
  • does a gene have a pattern of expression that
    correlates well with experimental conditions?
  • does a gene express at a reasonably high level in
    a reasonably large portion of my samples?
  • can we find genes that have a pattern of
    expression that is similar to that of some known
    gene(s)?

38
Filters
  • A filter is a mechanism for removing a gene from
    further consideration
  • we want to reduce the number of genes under
    consideration so that we can concentrate on those
    that are more interesting (it is a waste of
    resources to study genes that are not likely to
    be of interest)

39
Orders or Rankings
  • while there is a strong desire to have methods
    that determine which genes are important, such
    methods do not exist and cannot exist
  • we can rank genes according to some measure of
    how interesting they are
  • a test statistic, a p-value, etc.
  • such rankings can be used to select genes (those
    with high ranks) for further study

40
Non-specific filters
  • at least k (or a proportion p) of the samples
    must have expression values larger than some
    specified amount, A.
  • the gene should show sufficient variation to be
    interesting
  • either a gap of size A in the central portion of
    the data
  • or an interquartile range of at least B
  • genes that fail to pass can be eliminated early
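A hedged base-R sketch of these non-specific filters; the function names echo, but are not, the ones provided by the Bioconductor genefilter package:

```r
# k-over-A filter: at least k samples must exceed intensity A.
kOverA <- function(k, A) function(x) sum(x > A) >= k
# Variation filter: interquartile range of at least B.
iqrAtLeast <- function(B) function(x) IQR(x) >= B

set.seed(3)
X <- matrix(rexp(100, rate = 1/50), nrow = 10)  # 10 genes, 10 arrays

keep <- apply(X, 1, kOverA(k = 5, A = 20)) &
        apply(X, 1, iqrAtLeast(B = 10))
X.filtered <- X[keep, , drop = FALSE]   # genes that pass both filters
```

Because each filter is just a function of one gene's expression vector, filters compose with `&` and new ones slot in without changing the pipeline.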

41
Specific Filters
  • any filter based on a statistical comparison
  • t-test, ANOVA, Cox Model and so on
  • these are all readily available in R and hence in
    our gene filtering package

42
Multiple Testing
  • to assess the significance of the p-values one
    must make some adjustments for multiple testing
    (Dudoit et al.)
  • this is especially important if the expression
    data are coming from multiple sources
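Base R's p.adjust covers the standard adjustments; a small sketch with simulated p-values:

```r
# Hedged sketch: adjust per-gene p-values for multiple testing.
set.seed(4)
pvals <- runif(1000)^2          # hypothetical raw p-values, some small

p.holm <- p.adjust(pvals, method = "holm")  # family-wise error control
p.bh   <- p.adjust(pvals, method = "BH")    # false discovery rate control

sum(pvals < 0.05)    # naive calls
sum(p.holm < 0.05)   # calls surviving the family-wise correction
```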

43
An Example
  • two cell lines
  • observed at four time points
  • we want to select genes that have interesting
    patterns
  • non-linear mixed effects models seem appropriate
  • R provides the filter without any extra coding
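As a simplified, illustrative stand-in for that filter (the analysis described above used nonlinear mixed-effects models; here each gene gets an ordinary linear model with a time-by-cell-line interaction, and all data are simulated):

```r
# Hedged sketch: flag genes whose time course differs between cell lines.
set.seed(5)
time     <- rep(c(0, 2, 6, 24), times = 2)        # four time points
cellLine <- factor(rep(c("A", "B"), each = 4))    # two cell lines
X <- matrix(rnorm(20 * 8), nrow = 20)             # 20 genes, 8 arrays

# Per-gene p-value for the time-by-cell-line interaction.
interaction.p <- apply(X, 1, function(y) {
  fit <- lm(y ~ time * cellLine)
  anova(fit)["time:cellLine", "Pr(>F)"]
})
interesting <- which(interaction.p < 0.05)
```

The point of the slide stands: because model fitting is a one-line call in R, the filter itself needs no extra coding beyond the loop over genes.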

44
Annotation
  • obtained from multiple sources
  • aligned, using a combination of R, XML, and
    Postgres, into a coherent collection
  • we can produce uniform annotation for any set of
    genes
  • UniGene, LocusLink
  • chromosome, cytoband
  • the Gene Ontology (GO)

45
Annotation
  • in some cases knowing the chromosome or cytoband
    is useful
  • hyperdiploid samples, hypodiploid samples
  • examine gene amplification/deletion patterns
  • if multiple genes are involved we may be able to
    detect this
  • given information of this kind we want access to
    visualization tools
  • where are the top expressing genes located?

46
Annotation
  • Pathways
  • given pathway information, can we determine
    which genes express in a manner similar to known
    genes in the pathway?
  • need to link, via browser technology, with other
    sources; it is particularly useful if the
    information can be parsed and analysed by
    machine (e.g. XML)
  • again, R has support for exactly this type of
    processing

47
Clustering
  • Supervised
  • some clusters are determined a priori, the
    training set, and we assign new samples to those
    clusters (classification)
  • Unsupervised
  • data are grouped, using some method, to provide
    (potentially) new groupings or classifications
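Both modes can be sketched with base R tools; the nearest-centroid classifier below is an illustrative stand-in for supervised methods, and the data are simulated:

```r
# Two well-separated sample groups, 5 "genes" each.
set.seed(6)
X <- rbind(matrix(rnorm(50, 0), ncol = 5), matrix(rnorm(50, 3), ncol = 5))
labels <- rep(c("A", "B"), each = 10)

# Unsupervised: let the data propose groupings.
cl <- cutree(hclust(dist(X)), k = 2)

# Supervised: assign new samples to predefined classes.
centroids <- rbind(A = colMeans(X[labels == "A", ]),
                   B = colMeans(X[labels == "B", ]))
classify <- function(x)
  rownames(centroids)[which.min(colSums((t(centroids) - x)^2))]
classify(rnorm(5, 3))  # a new sample near the "B" cluster
```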

48
Some References
  • Classification, A. D. Gordon, Chapman and Hall
  • Finding Groups in Data, L. Kaufman and P. J.
    Rousseeuw, Wiley
  • Pattern Recognition and Neural Networks, B. D.
    Ripley, Cambridge University Press
  • The Elements of Statistical Learning, T. Hastie,
    R. Tibshirani, and J. Friedman, Springer

49
Clustering
  • Clustering is often used to answer the following
    questions.
  • Can I find a set of genes that helps me to
    correctly classify the samples into specific
    groups (often disease categories)?
  • If I can, which genes were used and how
    important are they?

50
Inference
  • Permutation tests
  • Cross-validation
  • How many clusters are there?
  • Gordon Ch 3 addresses some of these issues
  • Are my clusters significant?
  • it is more important to ask whether they provide
    a meaningful representation or reduction of the
    data

51
Variable Selection
  • how do we select genes to help in clustering?
  • this process has been studied in other areas of
    statistics; however, few of those solutions seem
    appropriate here

52
Model Selection
  • gene selection can be characterized as a model
    selection procedure
  • most statistical research in this area has been
    concentrated on the situation where the number of
    variables is much less than the number of
    observations
  • we need to develop model selection procedures for
    the situation where there are many more variables
    than observations

53
Selection Problems
  • may need to adjust for other variables
  • this makes the bootstrap and cross validation
    algorithms more complex
  • computing is already hard, and it is getting
    much more difficult; data structures play an
    essential role in simplifying the computations

54
Selection Problems
  • use bootstrap or cross-validation-type
    experiments to do variable selection
  • for example, select genes that show up as
    important predictors in many different subsets
  • to assess variable importance (when there are
    several variables), Breiman has proposed
    reclassifying using a new data set that has the
    covariate values permuted
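Breiman's permutation-importance idea can be sketched with a toy logistic model; the data and the glm stand-in classifier are illustrative:

```r
# Hedged sketch: permute one variable's values and measure the
# drop in classification accuracy.
set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)                       # x1 informative, x2 noise
y  <- factor(ifelse(x1 + rnorm(n) > 0, "pos", "neg"))
dat <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial, data = dat)
acc <- function(d) mean((predict(fit, d, type = "response") > 0.5) ==
                        (d$y == "pos"))

base.acc <- acc(dat)
perm <- dat; perm$x1 <- sample(perm$x1)   # destroy x1's association with y
acc.drop <- base.acc - acc(perm)          # large drop: x1 matters
```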

55
Cross-validation
  • leave-one-out CV is easy to implement but is
    probably not the best method for assessing or
    selecting a model
  • the problem is that leaving out one observation
    yields too small a change to the data
  • in general we need to consider perturbations on
    the order of about the square root of the number
    of observations

56
Cross-validation
  • make sure that you are cross-validating the
    experiment that you have carried out
  • in particular, if you are selecting genes, rather
    than working with known genes, you must
    cross-validate the gene selection process as well
  • most examples I have seen with low classification
    error rates do not cross-validate properly
    (model/gene selection was not validated)
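The point can be demonstrated on pure noise: selecting genes once on the full data gives an optimistic error estimate, while selecting inside each fold does not. Everything below is an illustrative sketch built around a nearest-centroid classifier:

```r
set.seed(8)
n <- 40; G <- 500
X <- matrix(rnorm(n * G), nrow = n)   # pure noise: no gene is informative
y <- factor(rep(c("A", "B"), each = n / 2))

cv.error <- function(select.inside) {
  folds <- split(sample(n), rep(1:5, length.out = n))
  if (!select.inside)  # WRONG: select genes once, using all the labels
    top <- order(apply(X, 2, function(g) t.test(g ~ y)$p.value))[1:10]
  errs <- sapply(folds, function(test) {
    train <- setdiff(1:n, test)
    if (select.inside)  # RIGHT: select using training labels only
      top <- order(apply(X[train, ], 2,
                         function(g) t.test(g ~ y[train])$p.value))[1:10]
    cen <- rbind(A = colMeans(X[train, top][y[train] == "A", , drop = FALSE]),
                 B = colMeans(X[train, top][y[train] == "B", , drop = FALSE]))
    pred <- apply(X[test, top, drop = FALSE], 1, function(x)
      rownames(cen)[which.min(rowSums(sweep(cen, 2, x)^2))])
    mean(pred != y[test])
  })
  mean(errs)
}

err.wrong  <- cv.error(FALSE)  # optimistic: selection leaked label information
err.honest <- cv.error(TRUE)   # honest: near the 0.5 chance level
```

On data with no signal at all, the improperly cross-validated estimate looks impressively low; only the honest loop reports the true chance-level error.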

57
Classification
  • Breiman's random forest ideas
  • notions of boosting: using several weak
    classifiers to obtain a good classifier
  • weighted averages of predictions; the weights
    might be different in different parts of the
    covariate (gene) space

58
Acknowledgements
  • Vincent Carey, Channing Laboratory
  • Jianhua Zhang
  • Sabina Chiaretti
  • The DFCI for funding
  • Dirk Iglehart
  • Cheng Li
  • Byron Ellis
  • Sandrine Dudoit