
Introduction to Microarray Data Analysis - II
BMI 730

- Kun Huang
- Department of Biomedical Informatics
- Ohio State University

- Review of Microarray
- Elements of Statistics and Gene Discovery in Expression Data
- Elements of Machine Learning and Clustering of Gene Expression Profiles

- How does a two-channel microarray work?
- The printing process introduces errors and larger variance
- Comparative hybridization experiment

- How does a microarray work?
- Fabrication expense and the frequency of errors increase with probe length, so short (25-mer) oligonucleotide probes are employed.
- Problem: cross-hybridization
- Solution: introduce a mismatched probe that differs from the perfect-match probe at one (central) position. The difference between the two readings gives a more accurate measurement.

- How do we use microarrays?
- Inference
- Clustering

- Normalization
- Which normalization algorithm to use
- Inter-slide normalization
- Not just for Affymetrix arrays

- Review of Microarray
- Elements of Statistics and Gene Discovery in Expression Data
- Elements of Machine Learning and Clustering of Gene Expression Profiles

- Hypothesis Testing
- Two sets of samples drawn from two distributions (N = 2)

- Hypothesis Testing
- Two sets of samples drawn from two distributions (N = 2)
- Hypothesis
- m1 and m2 are the means of the two distributions.

Null hypothesis H0: m1 = m2
Alternative hypothesis H1: m1 ≠ m2

Student's t-test:

t = (x̄1 − x̄2) / √(s1²/N1 + s2²/N2)

The p-value can be computed from the t-value and the degrees of freedom (related to the number of samples) to give a bound on the probability of a type-I error (claiming an insignificant difference to be significant), assuming normal distributions.

- Student's t-test
- Dependent (paired) t-test
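
A minimal MATLAB sketch of both tests, with hypothetical data (ttest2 and ttest are Statistics Toolbox functions):

  x1 = randn(1, 8) + 1;                 % hypothetical group 1, N1 = 8
  x2 = randn(1, 10);                    % hypothetical group 2, N2 = 10
  [h, p, ci, stats] = ttest2(x1, x2);   % two-sample t-test; stats.tstat is the t-value
  % Paired (dependent) t-test: two measurements on the same samples
  before = randn(1, 8);
  after  = before + 0.5 + 0.2 * randn(1, 8);
  [hp, pp] = ttest(before, after);      % tests whether the mean paired difference is zero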

- Permutation (t-)test
- The t-test relies on a parametric distribution assumption (normality). Permutation tests do not depend on such an assumption. Examples include the permutation t-test and the Wilcoxon rank-sum test.
- Perform the regular t-test to obtain the t-value t0. Then randomly permute the N1 + N2 samples and designate the first N1 as group 1, with the rest as group 2. Perform the t-test again and record the t-value t. Over all possible permutations, count how many t-values are larger than t0 and write down that number K0; the fraction K0 / (number of permutations) is the permutation p-value.
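
A minimal sketch of this procedure with hypothetical data. Enumerating all permutations is usually infeasible, so random permutations are used here (the slide describes the exhaustive version):

  x1 = randn(1, 8) + 1;  x2 = randn(1, 10);      % hypothetical groups
  N1 = numel(x1);  pooled = [x1 x2];  N = numel(pooled);
  [~, ~, ~, stats] = ttest2(x1, x2);
  t0 = stats.tstat;                               % observed t-value
  B = 10000;  K0 = 0;                             % number of random permutations
  for b = 1:B
      idx = randperm(N);                          % random relabeling of all samples
      [~, ~, ~, s] = ttest2(pooled(idx(1:N1)), pooled(idx(N1+1:end)));
      K0 = K0 + (s.tstat > t0);                   % count t-values larger than t0
  end
  pPerm = K0 / B;                                 % permutation p-value (use |t| for a two-sided test)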

- Multiple Classes (N > 2)
- F-test
- The null hypothesis is that the distribution of gene expression is the same for all classes.
- The alternative hypothesis is that at least one of the classes has a distribution that is different from the other classes.
- Which class is different cannot be determined by the F-test (ANOVA); it can only be identified post hoc. A sketch follows below.
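
A minimal sketch for one gene across N = 3 classes, with hypothetical data (anova1 is a Statistics Toolbox function):

  expr  = [randn(1, 5) + 1, randn(1, 5), randn(1, 5)];            % hypothetical expression values
  group = [repmat({'A'}, 1, 5), repmat({'B'}, 1, 5), repmat({'C'}, 1, 5)];
  p = anova1(expr, group, 'off');   % F-test p-value for H0: all class means are equal
  % A small p says at least one class differs, but not which one;
  % identifying the class requires a post-hoc comparison.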

- Example
- GEO Dataset Subgroup Effect

- Gene Discovery and Multiple T-tests
- Controlling False Positives
- p-value cutoff 0.05 (the probability of a false positive, i.e., a type-I error)
- 22,000 probesets
- Expected false discoveries: 22,000 × 0.05 = 1,100
- Focus on those 1,100 genes in a second specimen. Expected false discoveries: 1,100 × 0.05 = 55

- Gene Discovery and Multiple T-tests
- Controlling False Positives
- State the set of genes explicitly before the experiments
- Problem: not always feasible, defeats the purpose of large-scale screening, and could miss important discoveries
- Statistical tests to control the false positives

- Gene Discovery and Multiple T-tests
- Controlling False Positives
- Statistical tests to control the false positives:
- Controlling for no false positives (very stringent, e.g., Bonferroni methods)
- Controlling the number of false positives
- Controlling the proportion of false positives
- Note that in the screening stage, a false positive is better than a false negative, as the latter means missing a possibly important discovery.

- Gene Discovery and Multiple T-tests
- Controlling False Positives
- Statistical tests to control the false positives:
- Controlling for no false positives (very stringent)
- Bonferroni methods and multivariate permutation methods

Bonferroni inequality: P(E1 ∪ E2 ∪ … ∪ EK) ≤ P(E1) + P(E2) + … + P(EK)
(area of the union ≤ sum of the areas)

Gene Discovery and Multiple T-tests

- Bonferroni methods
- Bonferroni adjustment
- If Ei is the event of a false-positive discovery for gene i then, conservatively speaking, a false positive is almost guaranteed for K > 19 (the expected number of false positives, K × 0.05, reaches 1 at K = 20).
- So move the p-value cutoff from p0 to p0/K. This is called the Bonferroni adjustment.
- If K = 20 and p0 = 0.05, we call gene i significantly differentially expressed if pi < 0.0025.
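
A minimal sketch of the adjustment with hypothetical p-values:

  p0 = 0.05;  K = 20;
  p  = rand(1, K) * 0.1;        % placeholder p-values from K t-tests
  sig = p < p0 / K;             % Bonferroni cutoff: 0.05 / 20 = 0.0025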

Gene Discovery and Multiple T-tests

- Bonferroni methods
- Bonferroni adjustment
- Too conservative. Excessive stringency leads to increased false negatives (type-II errors).
- Has problems with meta-analysis.
- Variation: the sequential Bonferroni test (Holm-Bonferroni test), sketched below.
- Sort the K p-values from small to large to get p1 ≤ p2 ≤ … ≤ pK.
- Change the p-value cutoff for the ith p-value to p0/(K − i + 1), i.e., p1 ≤ p0/K, p2 ≤ p0/(K − 1), …, pK ≤ p0.
- If pj ≤ p0/(K − j + 1) for all j ≤ i but p(i+1) > p0/(K − i), reject the alternative hypotheses from i+1 to K, but keep the hypotheses from 1 to i.
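
A minimal sketch of the sequential test with hypothetical p-values:

  p0 = 0.05;
  p  = [0.0004 0.003 0.008 0.02 0.04 0.2];       % placeholder p-values, K = 6
  [ps, order] = sort(p);                          % p(1) <= p(2) <= ... <= p(K)
  K  = numel(ps);
  cutoffs = p0 ./ (K - (1:K) + 1);                % p0/K, p0/(K-1), ..., p0
  firstFail = find(ps > cutoffs, 1);              % first i where p(i) exceeds its cutoff
  if isempty(firstFail), nSig = K; else, nSig = firstFail - 1; end
  reject = false(1, K);
  reject(order(1:nSig)) = true;                   % null rejected for the nSig smallest p-values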

- Gene Discovery and Multiple T-tests
- Controlling False Positives
- Statistical tests to control the false positives:
- Controlling the number of false positives
- Simple approach: choose a p-value cutoff that is lower than the usual 0.05 but higher than the Bonferroni-adjusted cutoff
- More sophisticated way: a version of multivariate permutation.

- Gene Discovery and Multiple T-tests
- Controlling False Positives
- Statistical tests to control the false positives:
- Controlling the proportion of false positives

Let g be the proportion (percentage) of false positives among the total discovered genes:

g = (number of false positives) / (total number of positives)

The p-value cutoff is the design choice here. There are other ways of estimating the false positives; details can be found in Tusher et al., PNAS 2001, 98:5116-5121.
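
One standard way to control this proportion is the Benjamini-Hochberg FDR procedure, sketched below with hypothetical p-values (note this is not the permutation-based estimate of Tusher et al., which differs in detail):

  g  = 0.10;                                  % target proportion of false positives
  p  = sort(rand(1, 1000) * 0.5);             % hypothetical sorted p-values from K tests
  K  = numel(p);
  i  = find(p <= (1:K) / K * g, 1, 'last');   % largest i with p(i) <= (i/K) * g
  % declare the i smallest p-values significant; if i is empty, nothing is called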

- Review of Microarray
- Elements of Statistics and Gene Discovery in Expression Data
- Elements of Machine Learning and Clustering of Gene Expression Profiles

- Review of Microarray and Gene Discovery
- Clustering and Classification
- Preprocessing
- Distance measures
- Popular algorithms (not necessarily the best ones)
- More sophisticated ones
- Evaluation
- Data mining


- Clustering or classification?
- Is training data available?
- What domain-specific knowledge can be applied?
- What preprocessing of the data is needed?
- Log / data scale and numerical stability
- Filtering / denoising
- Nonlinear kernel
- Feature selection (do I need to use all the data?)
- Is the dimensionality of the data too high?

How do we process microarray data (clustering)?

- Feature selection: genes, transformations of expression levels.
- Genes discovered in the class comparison (t-test). Risk: missing genes.
- Iterative approach: select genes under different p-value cutoffs, then pick the set with good cross-validation performance.
- Principal components (pros and cons); see the sketch below.
- Discriminant analysis (e.g., LDA).
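
A minimal sketch of principal-component feature reduction, with placeholder data (pca is a Statistics Toolbox function):

  X = randn(30, 500);                           % hypothetical: 30 arrays x 500 genes
  [coeff, score, ~, ~, explained] = pca(X);     % components ordered by variance explained
  Xred = score(:, 1:10);                        % keep the top 10 components as features
  % Pro: few, decorrelated features. Con: each component mixes many genes,
  % which makes biological interpretation harder.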

- Distance Measure (Metric?)
- What do you mean by similar?
- Euclidean
- Uncentered correlation
- Pearson correlation

- Distance Metric
- Euclidean

102123_at (Lip1):   1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at (Ap1s1):  4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700

dE(Lip1, Ap1s1) = 12883

- Distance Metric
- Pearson Correlation

r = Σ(xi − x̄)(yi − ȳ) / (√Σ(xi − x̄)² · √Σ(yi − ȳ)²)

Ranges from −1 to 1.

[Scatter plots: r = 1 (perfect positive correlation); r = −1 (perfect negative correlation)]

- Distance Metric
- Pearson Correlation

102123_at (Lip1):   1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at (Ap1s1):  4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700

dP(Lip1, Ap1s1) = 0.904

- Distance Metric
- Uncentered Correlation

102123_at (Lip1):   1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at (Ap1s1):  4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700

du(Lip1, Ap1s1) = 0.835

The uncentered correlation is the cosine of the angle θ between the two expression vectors; here θ ≈ 33.4°.
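
A minimal sketch that reproduces the three numbers above from the same two probe vectors (norm, corrcoef, dot, and acosd are base MATLAB):

  lip1  = [1596.0 2040.9 1277.0 4090.5 1357.6 1039.2 1387.3 3189.0 1321.3 2164.4 868.6 185.3 266.4 2527.8];
  ap1s1 = [4144.4 3986.9 3083.1 6105.9 3245.8 4468.4 7295.0 5410.9 3162.1 4100.9 4603.2 6066.2 5505.8 5702.7];
  dE = norm(lip1 - ap1s1);                             % Euclidean distance, ~12883
  R  = corrcoef(lip1, ap1s1);  dP = R(1, 2);           % Pearson correlation, ~0.904
  du = dot(lip1, ap1s1) / (norm(lip1) * norm(ap1s1));  % uncentered correlation, ~0.835
  theta = acosd(du);                                   % angle between the vectors, ~33.4 degrees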

- Distance Metric
- Difference between Pearson correlation and uncentered correlation

102123_at (Lip1):   1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at (Ap1s1):  4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700

Uncentered correlation: everything is treated as signal.
Pearson correlation: a baseline expression level is possible (the means are subtracted out).
- Distance Metric
- Difference between Euclidean and correlation

- Distance Metric
- Missing negative correlations: a negative correlation may also mean the genes are close in a signaling pathway (compare the distance choices 1 − PCC and 1 − PCC²)

- Review of Microarray and Gene Discovery
- Clustering and Classification
- Preprocessing
- Distance measures
- Popular algorithms (not necessarily the best ones)
- More sophisticated ones
- Evaluation
- Data mining

How do we process microarray data (clustering)?

- Unsupervised Learning: Hierarchical Clustering

How do we process microarray data (clustering)?

- Unsupervised Learning: Hierarchical Clustering

Single linkage: the linking distance is the minimum distance between two clusters.

How do we process microarray data (clustering)?

- Unsupervised Learning: Hierarchical Clustering

Complete linkage: the linking distance is the maximum distance between two clusters.

How do we process microarray data (clustering)?

- Unsupervised Learning: Hierarchical Clustering

Average linkage / UPGMA: the linking distance is the average of all pairwise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Mean (UPGMA).

How do we process microarray data (clustering)?

- Unsupervised Learning: Hierarchical Clustering
- Single linkage: prone to chaining and sensitive to noise
- Complete linkage: tends to produce compact clusters
- Average linkage: sensitive to the distance metric
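
A minimal sketch using correlation distance and a placeholder data matrix (linkage, dendrogram, and cluster are Statistics Toolbox functions):

  X = rand(50, 14);                             % hypothetical: 50 genes x 14 arrays
  Z = linkage(X, 'average', 'correlation');     % UPGMA with distance 1 - Pearson correlation
  % 'single' or 'complete' in place of 'average' gives the other linkages
  dendrogram(Z);                                % draw the tree
  labels = cluster(Z, 'maxclust', 4);           % cut the tree into 4 clusters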

- Unsupervised Learning: Hierarchical Clustering

- Unsupervised Learning: Hierarchical Clustering

- Dendrograms
- Distance: the height of each horizontal line represents the distance between the two groups it merges.
- Order: open-source R uses the convention that the tighter clusters are on the left. Others have proposed ordering by expression values, chromosomal loci, and other ranking criteria.

- Unsupervised Learning - K-means
- Vector quantization
- K-D trees
- Need to try different K; sensitive to initialization

- Unsupervised Learning - K-means

[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20)
% 4 = K (number of clusters); 'dist', 'corr' = correlation metric
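
(The 'rep', 20 option repeats the clustering 20 times from different random initializations and keeps the best run, which mitigates the initialization sensitivity noted below.)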

- Unsupervised Learning - K-means
- The number of classes K needs to be specified
- Does not always converge
- Sensitive to initialization

- Issues
- Lack of consistency or representative features (a combination like 5.3 × TP53 + 0.8 × PTEN doesn't make sense)
- Data structure is missing
- Not robust to outliers and noise

D'Haeseleer 2005, Nat. Biotechnol. 23(12):1499-501

- Model-based clustering methods

(Han) http://www.cs.umd.edu/bhhan/research2.html

Pan et al., Genome Biology 2002, 3:research0009.1, doi:10.1186/gb-2002-3-2-research0009

- Structure-based clustering methods

- Supervised Learning
- Support vector machines (SVM) and kernels
- Only a (binary) classifier; no data model

- Accuracy vs. generality
- Overfitting
- Model selection
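
A minimal sketch of a binary SVM with cross-validation for model selection, using placeholder data (fitcsvm, crossval, and kfoldLoss are Statistics and Machine Learning Toolbox functions):

  X = [randn(20, 5) + 1; randn(20, 5) - 1];        % hypothetical: 40 samples x 5 genes
  y = [ones(20, 1); -ones(20, 1)];                 % binary class labels
  mdl = fitcsvm(X, y, 'KernelFunction', 'rbf');    % kernel SVM classifier
  cv  = crossval(mdl, 'KFold', 5);                 % 5-fold cross-validation
  err = kfoldLoss(cv);                             % estimated misclassification rate
  % Comparing err across kernels and parameters is one guard against overfitting.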