Transcript and Presenter's Notes

Title: Introduction to Microarray Data Analysis - II BMI 730


1
Introduction to Microarray Data Analysis - II BMI 730
  • Kun Huang
  • Department of Biomedical Informatics
  • Ohio State University

2
  • Review of Microarray
  • Elements of Statistics and Gene Discovery in
    Expression Data
  • Elements of Machine Learning and Clustering of
    Gene Expression Profiles

3
  • How does a two-channel microarray work?
  • The printing process introduces errors and larger variance
  • Comparative hybridization experiment

4
  • How does a microarray work?
  • Fabrication expense and the frequency of errors increase with probe length; therefore 25-mer oligonucleotide probes are employed.
  • Problem: cross-hybridization
  • Solution: introduce a mismatch probe that differs from the perfect-match probe at one (central) position. The difference between the two gives a more accurate reading.

5
  • How do we use microarray?
  • Inference
  • Clustering

6
  • Normalization
  • Which normalization algorithm to use
  • Inter-slide normalization
  • Not just for Affymetrix arrays

7
  • Review of Microarray
  • Elements of Statistics and Gene Discovery in
    Expression Data
  • Elements of Machine Learning and Clustering of
    Gene Expression Profiles

8
  • Hypothesis Testing
  • Two sets of samples sampled from two distributions (N = 2)

9
  • Hypothesis Testing
  • Two sets of samples sampled from two distributions (N = 2)
  • Hypothesis
  • m1 and m2 are the means of the two distributions.

Null hypothesis: H0: m1 = m2
Alternative hypothesis: H1: m1 ≠ m2
10
Student's t-test
11
Student's t-test
The p-value can be computed from the t-value and the degrees of freedom (related to the number of samples) to bound the probability of a type-I error (claiming an insignificant difference to be significant), assuming normal distributions.
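A minimal MATLAB sketch of the two-sample and paired t-tests (this assumes the Statistics and Machine Learning Toolbox; the data and variable names are made up for illustration):

% Hypothetical expression values for one gene in two groups
group1 = [5.1 4.8 5.6 5.0 5.3 4.9];
group2 = [6.2 6.0 5.8 6.5 6.1 6.3];
% Two-sample (unpaired) t-test: h = reject H0 at 0.05, p = p-value
[h, p, ci, stats] = ttest2(group1, group2);
fprintf('t = %.3f, df = %g, p = %.4f\n', stats.tstat, stats.df, p);
% Dependent (paired) t-test on matched samples
[hp, pp] = ttest(group1, group2);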
12
  • Student's t-test
  • Dependent (paired) t-test

13
  • Permutation (t-)test
  • The t-test relies on a parametric distribution assumption (normality). Permutation tests do not depend on such an assumption; examples include the permutation t-test and the Wilcoxon rank-sum test.
  • Perform the regular t-test to obtain the t-value t0. Then randomly permute the N1 + N2 samples and designate the first N1 as group 1, with the rest being group 2. Perform the t-test again and record the t-value t. Over all possible permutations, count how many t-values are larger than t0 and write down that number K0 (the permutation p-value is then estimated from K0 divided by the number of permutations).
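A minimal sketch of the permutation t-test described above (in practice a large random subset of permutations is used rather than all of them; the data are illustrative):

x = [5.1 4.8 5.6 5.0 5.3 4.9];        % group 1 (N1 samples)
y = [6.2 6.0 5.8 6.5 6.1 6.3];        % group 2 (N2 samples)
[~, ~, ~, s0] = ttest2(x, y);
t0 = s0.tstat;                         % observed t-value
pooled = [x y];  N1 = numel(x);  N = numel(pooled);
nperm = 10000;  K0 = 0;
for k = 1:nperm
    idx = randperm(N);                 % randomly relabel the N1 + N2 samples
    [~, ~, ~, s] = ttest2(pooled(idx(1:N1)), pooled(idx(N1+1:end)));
    K0 = K0 + (abs(s.tstat) >= abs(t0));   % two-sided count
end
p_perm = K0 / nperm;                   % permutation p-value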

14
  • Multiple Classes (N > 2)
  • F-test
  • The null hypothesis is that the distribution of gene expression is the same for all classes.
  • The alternative hypothesis is that at least one of the classes has a distribution that is different from the other classes.
  • Which class is different cannot be determined by the F-test (ANOVA); it can only be identified post hoc.
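A minimal one-way ANOVA (F-test) sketch in MATLAB (assuming the Statistics and Machine Learning Toolbox; the expression values and class labels are made up):

% Expression of one gene across three classes
expr   = [5.1 4.9 5.3 6.2 6.0 6.4 5.0 5.2 4.8];
labels = {'A','A','A','B','B','B','C','C','C'};
% One-way ANOVA: a small p suggests at least one class mean differs
[p, tbl, stats] = anova1(expr, labels, 'off');
% Post-hoc pairwise comparisons to see which class differs
c = multcompare(stats, 'Display', 'off');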

15
  • Example
  • GEO Dataset Subgroup Effect

16
  • Gene Discovery and Multiple T-tests
  • Controlling False Positives
  • p-value cutoff = 0.05 (probability of a false positive, i.e., a type-I error)
  • 22,000 probesets
  • Expected false discoveries: 22,000 × 0.05 = 1,100
  • Focus on the 1,100 genes in the second specimen. Expected false discoveries: 1,100 × 0.05 = 55

17
  • Gene Discovery and Multiple T-tests
  • Controlling False Positives
  • State the set of genes explicitly before the
    experiments
  • Problem: not always feasible, defeats the purpose of large-scale screening, and could miss important discoveries
  • Statistical tests to control the false positives

18
  • Gene Discovery and Multiple T-tests
  • Controlling False Positives
  • Statistical tests to control the false positives
  • Controlling for no false positives (very
    stringent, e.g. Bonferroni methods)
  • Controlling the number of false positives
  • Controlling the proportion of false positives
  • Note that in the screening stage, a false positive is better than a false negative, as the latter means missing a possibly important discovery.

19
  • Gene Discovery and Multiple T-tests
  • Controlling False Positives
  • Statistical tests to control the false positives
  • Controlling for no false positives (very
    stringent)
  • Bonferroni methods and multivariate permutation
    methods

Bonferroni inequality: P(E1 ∪ E2 ∪ … ∪ EK) ≤ P(E1) + P(E2) + … + P(EK)
(the area of the union is at most the sum of the areas)
20
Gene Discovery and Multiple T-tests
  • Bonferroni methods
  • Bonferroni adjustment
  • If Ei is the event of a false-positive discovery for gene i, then, conservatively speaking, at least one false positive is almost guaranteed for K > 19 (the union bound K × 0.05 reaches 1 at K = 20).
  • So change the p-value cutoff from p0 to p0/K. This is called the Bonferroni adjustment.
  • If K = 20 and p0 = 0.05, we call gene i significantly differentially expressed if pi < 0.0025.
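A minimal sketch of the Bonferroni adjustment (the p-value vector is purely illustrative):

p  = [0.0004 0.0019 0.0100 0.0250 0.0400];   % p-values for K tested genes
K  = numel(p);
p0 = 0.05;
sig_bonf = p < p0 / K;           % compare each p-value to the adjusted cutoff p0/K
p_adj = min(p * K, 1);           % equivalently, inflate the p-values and compare to p0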

21
Gene Discovery and Multiple T-tests
  • Bonferroni methods
  • Bonferroni adjustment
  • Too conservative. Excessive stringency leads to increased false negatives (type-II errors).
  • Has problems with meta-analysis.
  • Variation: the sequential Bonferroni test (Holm-Bonferroni test)
  • Sort the K p-values from small to large to get p1 ≤ p2 ≤ … ≤ pK.
  • Change the p-value cutoff for the ith p-value to p0/(K-i+1) (i.e., p1 ≤ p0/K, p2 ≤ p0/(K-1), …, pK ≤ p0).
  • If pj ≤ p0/(K-j+1) for all j ≤ i but p(i+1) > p0/(K-i), keep the alternative hypotheses from 1 to i but reject those from i+1 to K.
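A minimal sketch of the Holm-Bonferroni (sequential Bonferroni) procedure described above (the p-values are illustrative):

p  = [0.0004 0.0019 0.0100 0.0250 0.0400];   % illustrative p-values
p0 = 0.05;
[ps, order] = sort(p);                        % p1 <= p2 <= ... <= pK
K  = numel(ps);
cutoffs = p0 ./ (K - (1:K) + 1);              % p0/K, p0/(K-1), ..., p0
i  = find(ps > cutoffs, 1) - 1;               % last index before the first failed cutoff
if isempty(i), i = K; end                     % no failure: keep all alternatives
sig = false(1, K);
sig(order(1:i)) = true;                       % keep alternatives 1..i, reject i+1..K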

22
  • Gene Discovery and Multiple T-tests
  • Controlling False Positives
  • Statistical tests to control the false positives
  • Controlling the number of false positives
  • Simple approach: choose a p-value cutoff lower than the usual 0.05 but higher than the Bonferroni-adjusted cutoff
  • More sophisticated approach: a version of multivariate permutation.

23
  • Gene Discovery and Multiple T-tests
  • Controlling False Positives
  • Statistical tests to control the false positives
  • Controlling the proportion of false positives

Let g be the proportion (percentage) of false positives among the total discovered genes:
g = (number of false positives) / (total number of positives)
The p-value cutoff pD is the quantity we choose. There are other ways of estimating false positives; details can be found in Tusher et al., PNAS 98:5116-5121.
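As one common way of controlling the proportion of false positives, here is a minimal Benjamini-Hochberg FDR sketch (note this is not the SAM procedure of Tusher et al.; the p-values are illustrative):

p = [0.0004 0.0019 0.0100 0.0250 0.0400 0.2000];  % illustrative p-values
q = 0.05;                                          % target proportion of false positives
[ps, order] = sort(p);
K = numel(ps);
k = find(ps <= (1:K) / K * q, 1, 'last');          % largest i with p(i) <= i*q/K
sig = false(1, K);
if ~isempty(k), sig(order(1:k)) = true; end        % genes declared discovered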
24
  • Review of Microarray
  • Elements of Statistics and Gene Discovery in
    Expression Data
  • Elements of Machine Learning and Clustering of
    Gene Expression Profiles

25
  • Review of Microarray and Gene Discovery
  • Clustering and Classification
  • Preprocessing
  • Distance measures
  • Popular algorithms (not necessarily the best
    ones)
  • More sophisticated ones
  • Evaluation
  • Data mining

26
(No Transcript)
27
  • Clustering or classification?
  • Is training data available?
  • What domain specific knowledge can be applied?
  • What preprocessing of data is needed?
  • Log / data scale and numerical stability
  • Filtering / denoising
  • Nonlinear kernel
  • Feature selection (do I need to use all the
    data?)
  • Is the dimensionality of the data too high?

28
How do we process microarray data (clustering)?
  • Feature selection: genes, transformations of expression levels.
  • Genes discovered in the class comparison (t-test). Risk: missing genes.
  • Iterative approach: select genes under different p-value cutoffs, then keep the cutoff with good performance under cross-validation (see the sketch below).
  • Principal components (pros and cons).
  • Discriminant analysis (e.g., LDA).
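A minimal sketch of feature (gene) selection by t-test p-value cutoff (the data matrix, labels, and cutoff are hypothetical; cross-validation of the downstream classifier is omitted for brevity):

% expr: genes x samples matrix; grp: logical index of class-1 samples (hypothetical)
expr = randn(1000, 12);
grp  = [true(1, 6) false(1, 6)];
[~, pvals] = ttest2(expr(:, grp)', expr(:, ~grp)');   % one t-test per gene (column-wise)
cutoff = 0.01;                                         % candidate p-value cutoff
selected = find(pvals < cutoff);                       % genes kept as features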

29
  • Distance Measure (Metric?)
  • What do you mean by similar?
  • Euclidean
  • Uncentered correlation
  • Pearson correlation

30
  • Distance Metric
  • Euclidean

102123_at Lip1:  1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at Ap1s1: 4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700
dE(Lip1, Ap1s1) = 12883
31
  • Distance Metric
  • Pearson Correlation

Ranges from +1 to -1 (scatter-plot examples on the slide: r = 1 and r = -1).
32
  • Distance Metric
  • Pearson Correlation

102123_at Lip1:  1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at Ap1s1: 4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700
dP(Lip1, Ap1s1) = 0.904
33
  • Distance Metric
  • Uncentered Correlation

102123_at Lip1:  1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at Ap1s1: 4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700
du(Lip1, Ap1s1) = 0.835 (angle θ ≈ 33.4° between the two vectors)
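A minimal MATLAB sketch recomputing the three distances from the last few slides on the Lip1 and Ap1s1 probe values (rounding may make the results differ slightly from the numbers shown):

lip1  = [1596.0 2040.9 1277.0 4090.5 1357.6 1039.2 1387.3 3189.0 1321.3 2164.4 868.6 185.3 266.4 2527.8];
ap1s1 = [4144.4 3986.9 3083.1 6105.9 3245.8 4468.4 7295.0 5410.9 3162.1 4100.9 4603.2 6066.2 5505.8 5702.7];
dE = norm(lip1 - ap1s1);                               % Euclidean distance (about 12883)
R  = corrcoef(lip1, ap1s1);  dP = R(1, 2);             % Pearson correlation (mean-centered)
dU = dot(lip1, ap1s1) / (norm(lip1) * norm(ap1s1));    % uncentered correlation = cosine of the angle
theta = acosd(dU);                                     % angle between the two vectors in degrees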
34
  • Distance Metric
  • Difference between Pearson correlation and
    uncentered correlation

102123_at Lip1:  1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at Ap1s1: 4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700
Uncentered correlation: all values are treated as signal.
Pearson correlation: a baseline expression level is possible (the mean is subtracted).
35
  • Distance Metric
  • Difference between Euclidean and correlation

36
  • Distance Metric
  • A negative correlation may also mean genes are close in a signaling pathway and should not be missed (compare 1 - PCC vs. 1 - PCC² as the distance)
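A small sketch contrasting the two correlation-based distances just mentioned (using 1 - PCC² keeps strongly anti-correlated genes close; the vectors are made up):

x = [1 2 3 4 5];
y = [10 8 6 4 2];            % perfectly negatively correlated with x
R = corrcoef(x, y);  r = R(1, 2);
d1 = 1 - r;                  % 1 - PCC: anti-correlated genes look far apart (d1 = 2)
d2 = 1 - r^2;                % 1 - PCC^2: anti-correlated genes look close (d2 = 0)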

37
  • Review of Microarray and Gene Discovery
  • Clustering and Classification
  • Preprocessing
  • Distance measures
  • Popular algorithms (not necessarily the best
    ones)
  • More sophisticated ones
  • Evaluation
  • Data mining

38
How do we process microarray data (clustering)?
  • Unsupervised Learning Hierarchical Clustering

39
How do we process microarray data (clustering)?
  • Unsupervised Learning Hierarchical Clustering

Single linkage: The linking distance is the minimum distance between the two clusters.
40
How do we process microarray data (clustering)?
  • Unsupervised Learning Hierarchical Clustering

Complete linkage: The linking distance is the maximum distance between the two clusters.
41
How do we process microarray data (clustering)?
  • Unsupervised Learning Hierarchical Clustering

Average linkage/UPGMA: The linking distance is the average of all pairwise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Mean (UPGMA).
42
How do we process microarray data (clustering)?
  • Unsupervised Learning Hierarchical Clustering
  • Single linkage: prone to chaining and sensitive to noise
  • Complete linkage: tends to produce compact clusters
  • Average linkage: sensitive to the distance metric
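A minimal hierarchical clustering sketch in MATLAB tying the pieces above together (assuming the Statistics and Machine Learning Toolbox; exprdata is a hypothetical genes-by-samples matrix):

exprdata = randn(50, 14);               % hypothetical expression matrix: genes in rows
D = pdist(exprdata, 'correlation');     % pairwise distances (1 - Pearson correlation)
Z = linkage(D, 'average');              % average linkage (UPGMA); 'single' and 'complete' also work
dendrogram(Z);                          % draw the dendrogram
clusters = cluster(Z, 'maxclust', 4);   % cut the tree into 4 clusters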

43
  • Unsupervised Learning Hierarchical Clustering

44
  • Unsupervised Learning Hierarchical Clustering
  • Dendrograms
  • Distance: the height of each horizontal line represents the distance between the two groups it merges.
  • Order: open-source R uses the convention that the tighter clusters are on the left. Others have proposed ordering by expression values, loci on chromosomes, and other ranking criteria.

45
  • Unsupervised Learning - K-means
  • Vector quantization
  • K-D trees
  • Need to try different K, sensitive to
    initialization

46
  • Unsupervised Learning - K-means

[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20);
% K = 4 clusters; 'corr' is the distance metric
47
  • Unsupervised Learning - K-means
  • Number of class K needs to be specified
  • Does not always converge
  • Sensitive to initialization

48
  • Issues
  • Lack of consistency or representative features (5.3 TP53 0.8 PTEN doesn't make sense)
  • Data structure is missing
  • Not robust to outliers and noise

D'Haeseleer 2005, Nat. Biotechnol. 23(12):1499-501
49
  • Model-based clustering methods

(Han) http://www.cs.umd.edu/bhhan/research2.html
Pan et al., Genome Biology 2002, 3:research0009.1, doi:10.1186/gb-2002-3-2-research0009
50
  • Structure-based clustering methods

51
  • Supervised Learning
  • Support vector machines (SVM) and Kernels
  • Only (binary) classifier, no data model

52
  • Accuracy vs. generality
  • Overfitting
  • Model selection