# Introduction to Microarray Data Analysis - II (BMI 730)


1
Introduction to Microarray Data Analysis - II BMI 730
• Kun Huang
• Department of Biomedical Informatics
• Ohio State University

2
• Review of Microarray
• Elements of Statistics and Gene Discovery in
Expression Data
• Elements of Machine Learning and Clustering of
Gene Expression Profiles

3
• How does a two-channel microarray work?
• The printing process introduces errors and larger
variance
• Comparative hybridization experiment

4
• How does a microarray work?
• Fabrication expense and the frequency of errors
increase with probe length, so 25-mer
oligonucleotide probes are employed.
• Problem: cross-hybridization
• Solution: introduce a mismatched probe that
differs from the matched probe at one (central)
position. The difference between the two gives a
more accurate estimate of the specific signal.

5
• How do we use microarrays?
• Inference
• Clustering

6
• Normalization
• Which normalization algorithm to use
• Inter-slide normalization
• Not just for Affymetrix arrays

7
• Review of Microarray
• Elements of Statistics and Gene Discovery in
Expression Data
• Elements of Machine Learning and Clustering of
Gene Expression Profiles

8
• Hypothesis Testing
• Two sets of samples drawn from two distributions
(e.g., two normal distributions)

9
• Hypothesis Testing
• Two sets of samples drawn from two distributions
(e.g., two normal distributions)
• Hypothesis
• m1 and m2 are the means of the two distributions.

Null hypothesis H0: m1 = m2
Alternative hypothesis H1: m1 ≠ m2
10
Student's t-test
11
Student's t-test
The p-value can be computed from the t-value and
the number of degrees of freedom (related to the
number of samples) to bound the probability of a
type-I error (declaring an insignificant
difference to be significant), assuming normal
distributions.
12
• Student's t-test
• Dependent (paired) t-test
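A minimal Python sketch of the two tests above, on synthetic expression values (the group sizes, means, and variable names here are illustrative assumptions, not from the slides):

```python
# Illustrative two-sample and paired t-tests with scipy.stats
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(loc=5.0, scale=1.0, size=10)   # e.g., control samples
group2 = rng.normal(loc=6.5, scale=1.0, size=10)   # e.g., treated samples

# Independent two-sample t-test (H0: m1 = m2)
t_ind, p_ind = stats.ttest_ind(group1, group2)

# Dependent (paired) t-test: the same subjects measured twice
t_rel, p_rel = stats.ttest_rel(group1, group2)

print(t_ind, p_ind, t_rel, p_rel)
```

The paired version tests the mean of the per-subject differences rather than the difference of group means.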

13
• Permutation (t-)test
• The t-test relies on a parametric distribution
assumption (normality). Permutation tests do not
depend on such an assumption. Examples include the
permutation t-test and the Wilcoxon rank-sum test.
• Perform a regular t-test to obtain the t-value t0.
Then randomly permute the N1 + N2 samples and
designate the first N1 as group 1, with the rest
being group 2. Perform the t-test again and record
the t-value t. Over all possible permutations,
count how many t-values are larger than t0 and
write down that number K0.
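The procedure above can be sketched as follows; for tractability this uses random permutations rather than enumerating all of them, and all data and names are illustrative assumptions:

```python
# Sketch of a two-sided permutation t-test
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(5.0, 1.0, size=8)      # group 1 (N1 = 8)
g2 = rng.normal(6.0, 1.0, size=8)      # group 2 (N2 = 8)
t0, _ = stats.ttest_ind(g1, g2)        # observed t-value

pooled = np.concatenate([g1, g2])
n_perm, count = 2000, 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)     # shuffle group labels
    t, _ = stats.ttest_ind(perm[:len(g1)], perm[len(g1):])
    if abs(t) >= abs(t0):              # at least as extreme as observed
        count += 1
p_perm = count / n_perm                # permutation p-value
print(p_perm)
```

The fraction of permuted t-values at least as extreme as t0 estimates the p-value without assuming normality.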

14
• Multiple Classes (N > 2)
• F-test
• The null hypothesis is that the distribution of
gene expression is the same for all classes.
• The alternative hypothesis is that at least one
of the classes has a distribution that is
different from the other classes.
• Which class is different cannot be determined by
the F-test (ANOVA). It can only be identified post
hoc.
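A short sketch of the one-way ANOVA F-test on three synthetic classes (class names and values are assumptions for illustration):

```python
# One-way ANOVA F-test for N > 2 classes
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
class_a = rng.normal(5.0, 1.0, size=10)
class_b = rng.normal(5.1, 1.0, size=10)
class_c = rng.normal(8.0, 1.0, size=10)   # this class clearly differs

# H0: all three classes share the same mean expression
f_val, p_val = stats.f_oneway(class_a, class_b, class_c)
# A small p-value rejects H0, but the F-test alone does not say
# WHICH class differs; that requires post-hoc comparisons.
print(f_val, p_val)
```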

15
• Example
• GEO Dataset Subgroup Effect

16
• Gene Discovery and Multiple T-tests
• Controlling False Positives
• p-value cutoff 0.05 (probability of a false
positive - type-I error)
• 22,000 probesets
• Expected false discoveries: 22,000 × 0.05 = 1,100
• Focus on those 1,100 genes in a second specimen.
Expected false discoveries: 1,100 × 0.05 = 55

17
• Gene Discovery and Multiple T-tests
• Controlling False Positives
• State the set of genes explicitly before the
experiments
• Problem: not always feasible; defeats the purpose
of large-scale screening; could miss important
discoveries
• Statistical tests to control the false positives

18
• Gene Discovery and Multiple T-tests
• Controlling False Positives
• Statistical tests to control the false positives
• Controlling for no false positives (very
stringent, e.g., Bonferroni methods)
• Controlling the number of false positives
• Controlling the proportion of false positives
• Note that at the screening stage a false positive
is better than a false negative, since the latter
means missing a possibly important discovery.

19
• Gene Discovery and Multiple T-tests
• Controlling False Positives
• Statistical tests to control the false positives
• Controlling for no false positives (very
stringent)
• Bonferroni methods and multivariate permutation
methods

Bonferroni inequality: P(E1 ∪ … ∪ EK) ≤ P(E1) + … + P(EK)
(the area of a union is at most the sum of the areas)
20
Gene Discovery and Multiple T-tests
• Bonferroni methods
• If Ei is the event of a false positive discovery
for gene i then, conservatively speaking, a false
positive is almost guaranteed for K > 19 (since
20 × 0.05 = 1).
• So change the p-value cutoff from p0 to p0/K.
This is called the Bonferroni adjustment.
• If K = 20 and p0 = 0.05, we call gene i
significantly differentially expressed if
pi < 0.0025.
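The adjustment is a one-line comparison; a sketch with illustrative p-values (not real data):

```python
# Bonferroni adjustment: compare each p-value to p0 / K instead of p0
import numpy as np

p0, K = 0.05, 20
pvals = np.array([0.001, 0.004, 0.012, 0.030, 0.200])
cutoff = p0 / K                       # 0.05 / 20 = 0.0025
significant = pvals < cutoff          # only genes with p < 0.0025 survive
print(cutoff, significant)
```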

21
Gene Discovery and Multiple T-tests
• Bonferroni methods
• Too conservative: excessive stringency leads to
increased false negatives (type-II errors).
• Has problems with meta-analysis.
• Variation: the sequential Bonferroni test
(Holm-Bonferroni test)
• Sort the K p-values from smallest to largest:
p1 ≤ p2 ≤ … ≤ pK.
• Change the p-value cutoff for the ith p-value
to p0/(K − i + 1) (i.e., p1 is compared with p0/K,
p2 with p0/(K − 1), …, pK with p0).
• If pj ≤ p0/(K − j + 1) for all j ≤ i but
p(i+1) > p0/(K − i), reject the alternative
hypotheses for positions i+1 to K but keep those
for positions 1 to i.
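The step-down procedure above can be sketched directly (the p-values are illustrative):

```python
# Holm-Bonferroni step-down test: compare the i-th smallest p-value
# to p0/(K - i + 1) and stop at the first failure
import numpy as np

p0 = 0.05
pvals = np.array([0.300, 0.0001, 0.0200, 0.0080, 0.0120])
K = len(pvals)
order = np.argsort(pvals)              # indices of p(1) <= ... <= p(K)
reject = np.zeros(K, dtype=bool)
for rank, idx in enumerate(order):     # rank 0 holds the smallest p-value
    if pvals[idx] <= p0 / (K - rank):  # cutoff p0/(K - i + 1) with i = rank + 1
        reject[idx] = True             # declare this gene significant
    else:
        break                          # keep all remaining null hypotheses
print(reject)
```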

22
• Gene Discovery and Multiple T-tests
• Controlling False Positives
• Statistical tests to control the false positives
• Controlling the number of false positives
• Simple approach: choose a p-value cutoff lower
than the usual 0.05 (but higher than the
Bonferroni-adjusted value)
• More sophisticated: a version of multivariate
permutation.

23
• Gene Discovery and Multiple T-tests
• Controlling False Positives
• Statistical tests to control the false positives
• Controlling the proportion of false positives

Let g be the proportion (percentage) of false
positives among the total discovered genes:

g = (false positives) / (total positives)

The cutoff pD is the user's choice. There are
other ways of estimating false positives; details
can be found in Tusher et al., PNAS 98:5116-5121.
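A simplified sketch of estimating this proportion (all names and numbers are assumptions; note that Tusher et al.'s SAM estimates the false-positive count by permutation rather than by the K × p_cut shortcut used here):

```python
# Estimate the proportion g of false positives among the D genes
# declared significant at a given p-value cutoff
import numpy as np

rng = np.random.default_rng(3)
K = 22_000                             # number of probesets tested
pvals = rng.uniform(0, 1, K)           # null p-values, for illustration
p_cut = 0.001
D = int((pvals < p_cut).sum())         # total declared positive
expected_fp = K * p_cut                # expected false positives under H0
g = expected_fp / max(D, 1)            # estimated false-positive proportion
print(D, g)
```

Since every p-value here is drawn under the null, essentially all "discoveries" are false and g comes out near 1.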
24
• Review of Microarray
• Elements of Statistics and Gene Discovery in
Expression Data
• Elements of Machine Learning and Clustering of
Gene Expression Profiles

25
• Review of Microarray and Gene Discovery
• Clustering and Classification
• Preprocessing
• Distance measures
• Popular algorithms (not necessarily the best
ones)
• More sophisticated ones
• Evaluation
• Data mining

26
(No Transcript)
27
• Clustering or classification?
• Is training data available?
• What domain specific knowledge can be applied?
• What preprocessing of data is needed?
• Log / data scale and numerical stability
• Filtering / denoising
• Nonlinear kernel
• Feature selection (do I need to use all the
data?)
• Is the dimensionality of the data too high?

28
How do we process microarray data (clustering)?
• Feature selection: genes, transformations of
expression levels.
• Genes discovered in the class comparison
(t-test). Risk: missing genes.
• Iterative approach: select genes under
different p-value cutoffs, then select the one
with good performance using cross-validation.
• Principal components (pros and cons).
• Discriminant analysis (e.g., LDA).

29
• Distance Measure (Metric?)
• What do you mean by similar?
• Euclidean
• Uncentered correlation
• Pearson correlation

30
• Distance Metric
• Euclidean

102123_at  Lip1   1596.0  2040.9  1277.0  4090.5  1357.6  1039.2  1387.3  3189.0  1321.3  2164.4   868.6   185.3   266.4  2527.8
160552_at  Ap1s1  4144.4  3986.9  3083.1  6105.9  3245.8  4468.4  7295.0  5410.9  3162.1  4100.9  4603.2  6066.2  5505.8  5702.7
dE(Lip1, Ap1s1) = 12883
31
• Distance Metric
• Pearson Correlation

Ranges from −1 to 1.
r = 1: perfectly positively correlated
r = −1: perfectly negatively correlated
32
• Distance Metric
• Pearson Correlation

(Same Lip1 and Ap1s1 expression values as above.)
dP(Lip1, Ap1s1) = 0.904
33
• Distance Metric
• Uncentered Correlation

(Same Lip1 and Ap1s1 expression values as above.)
du(Lip1, Ap1s1) = 0.835 (= cos θ, the cosine of the
angle θ between the two expression vectors)
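The three measures from the preceding slides can be computed directly on the Lip1 / Ap1s1 vectors shown above:

```python
# Euclidean distance, Pearson correlation, and uncentered (cosine)
# correlation for the two expression profiles
import numpy as np

lip1 = np.array([1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3,
                 3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8])
ap1s1 = np.array([4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0,
                  5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7])

d_e = np.linalg.norm(lip1 - ap1s1)         # Euclidean distance
r_p = np.corrcoef(lip1, ap1s1)[0, 1]       # Pearson correlation (centered)
r_u = (lip1 @ ap1s1) / (np.linalg.norm(lip1) * np.linalg.norm(ap1s1))
                                           # uncentered correlation = cos(theta)
print(d_e, r_p, r_u)
```

The uncentered version skips mean subtraction, which is why the two correlations differ when the profiles sit on different baselines.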
34
• Distance Metric
• Difference between Pearson correlation and
uncentered correlation

(Same Lip1 and Ap1s1 expression values as above.)
Uncentered correlation: all values are treated as signal
Pearson correlation: a baseline expression level is possible
35
• Distance Metric
• Difference between Euclidean and correlation

36
• Distance Metric
• Negatively correlated genes may also be close in
a signaling pathway; using 1 − PCC as the distance
misses this, while 1 − PCC² captures it

37
• Review of Microarray and Gene Discovery
• Clustering and Classification
• Preprocessing
• Distance measures
• Popular algorithms (not necessarily the best
ones)
• More sophisticated ones
• Evaluation
• Data mining

38
How do we process microarray data (clustering)?
• Unsupervised Learning Hierarchical Clustering

39
How do we process microarray data (clustering)?
• Unsupervised Learning Hierarchical Clustering

Single linkage: the minimum distance between two clusters.
40
How do we process microarray data (clustering)?
• Unsupervised Learning Hierarchical Clustering

Complete linkage: the maximum distance between two clusters.
41
How do we process microarray data (clustering)?
• Unsupervised Learning Hierarchical Clustering

Average linkage: the average of all pair-wise
distances between members of the two clusters.
Since all genes and samples carry equal weight,
this linkage is the Unweighted Pair Group Method
with Arithmetic Mean (UPGMA).
42
How do we process microarray data (clustering)?
• Unsupervised Learning Hierarchical Clustering
• Single linkage: prone to chaining and sensitive
to noise
• Complete linkage: tends to produce compact
clusters
• Average linkage: sensitive to the distance metric
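The three linkage rules above can be compared on synthetic profiles (the data and cluster count here are illustrative assumptions):

```python
# Agglomerative clustering with single, complete, and average linkage
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# Two well-separated groups of expression profiles (rows = genes)
profiles = np.vstack([rng.normal(0.0, 0.3, (5, 4)),
                      rng.normal(3.0, 0.3, (5, 4))])

for method in ("single", "complete", "average"):   # min / max / UPGMA
    Z = linkage(profiles, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```

On clearly separated data all three linkages agree; their differences (chaining, compactness) show up on noisy or elongated clusters.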

43
• Unsupervised Learning Hierarchical Clustering

44
• Unsupervised Learning Hierarchical Clustering
• Dendrograms
• Distance the height each horizontal line
represents the distance between the two groups it
merges.
• Order Opensource R uses the convention that the
tighter clusters are on the left. Others
proposed to use expression values, loci on
chromosomes, and other ranking criteria.

45
• Unsupervised Learning - K-means
• Vector quantization
• K-D trees
• Need to try different K, sensitive to
initialization

46
• Unsupervised Learning - K-means

[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20)
% 4 is K (the number of clusters); 'corr' selects the distance metric
47
• Unsupervised Learning - K-means
• The number of classes, K, must be specified
• Does not always converge
• Sensitive to initialization
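A Python sketch of the restart strategy implied above (synthetic data; the cluster layout and seed count are assumptions), keeping the best of several initializations:

```python
# k-means with several restarts; the best run (lowest within-cluster
# sum of squares) is kept, mitigating sensitivity to initialization
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0, 0.4, (20, 3)),
                  rng.normal(4, 0.4, (20, 3)),
                  rng.normal(8, 0.4, (20, 3))])

best_labels, best_inertia = None, np.inf
for seed in range(10):                       # several restarts
    np.random.seed(seed)                     # vary the initialization
    centroids, labels = kmeans2(data, 3, minit="++")
    inertia = ((data - centroids[labels]) ** 2).sum()
    if inertia < best_inertia:
        best_labels, best_inertia = labels, inertia
print(best_inertia)
```

This mirrors the MATLAB call's 'rep', 20 option, which also reruns k-means and keeps the best solution.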

48
• Issues
• Lack of consistency or representative features (a
linear combination such as 5.3 × TP53 + 0.8 × PTEN
doesn't make biological sense)
• Data structure is missing
• Not robust to outliers and noise

D'Haeseleer 2005, Nat. Biotechnol. 23(12):1499-501
49
• Model-based clustering methods

(Han) http://www.cs.umd.edu/bhhan/research2.html
Pan et al., Genome Biology 2002, 3:research0009.1;
doi:10.1186/gb-2002-3-2-research0009
50
• Structure-based clustering methods

51
• Supervised Learning
• Support vector machines (SVM) and Kernels
• A (binary) classifier only; it provides no data model

52
• Accuracy vs. generality
• Overfitting
• Model selection