Introduction to microarray analysis II

1
Introduction to microarray analysis II
Ariel Chernomoretz, Bioinformatics Platform,
Centre de Recherche du CHUL
2
Relevant question
Signal calculation (Affymetrix GCOS):
1- Each intensity is corrected for background noise.
2- An ideal value for the MM (mismatch) probes is
calculated and subtracted from each PM (perfect
match) value.
3- The adjusted intensity is log-transformed to
stabilize the variance.
4- A weighted mean is calculated using Tukey's
biweight equation and expressed as an antilog.
5- Finally, the signal is scaled using a target
mean of 500.
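Steps 2 to 4 above can be sketched in a few lines. This is a minimal illustration, not the GCOS implementation: the tuning constants `c=5` and `eps=1e-4`, the `floor` guard against non-positive differences, and the function names are assumptions made for this example, and the ideal mismatch values are taken as given.

```python
import math
from statistics import median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust weighted mean of `values`."""
    m = median(values)
    mad = median(abs(v - m) for v in values)  # median absolute deviation
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Points far from the median get weight 0, nearby points weight near 1.
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0
        num += w * v
        den += w
    return num / den

def probe_set_signal(pm, ideal_mm, floor=1.0):
    """Log-transform PM minus ideal MM, average robustly, return the antilog."""
    logs = [math.log2(max(p - m, floor)) for p, m in zip(pm, ideal_mm)]
    return 2 ** tukey_biweight(logs)

# One outlying probe barely affects the robust mean.
robust = tukey_biweight([10, 10, 10, 10, 1000])
signal = probe_set_signal([110, 110, 110], [10, 10, 10])
```

The final scaling step (step 5) would multiply every signal on the array by a common factor and is left out here.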
Experimental design
Affymetrix files: .EXP .DAT .CEL .CHP .RPT .TXT
Experimental protocol
Biostatistician
Microarray data
Normalization
Algorithms for selecting significantly modulated
genes: 1- Fold change, fixed or variable depending
on intensity. 2- t-test or ANOVA if the experiment
has replicates.
LFCM, fixed FC, t-test, ANOVA
Significantly modulated genes
Sorting and grouping software: GO classifies
approx. 50% of the genes into 3 groups. Trees,
SOM, K-means and PCA perform different kinds of
sorting using the profile and intensity of each
gene. They classify 100% of the genes in the list
and use very powerful graphical interfaces.
  • PCA (Principal component analysis)
  • Neural networks
  • Support vector machines
  • Bayesian inference
  • Clustering
  • S.O.M.
  • K-Means

Unsupervised sorting
Supervised sorting
"Interesting" genes:
- Genes already characterized in the literature
(bibliographic validation).
- Known genes never before associated with the
particular conditions of our experiment.
- Unknown genes (ESTs) associated with known genes
(similar expression pattern).
- Unknown genes with a distinctive expression
pattern.
  • Pathways
  • Ontological classification
  • Venn diagrams
  • Literature search engines

Retained lists
Validation: every microarray experiment generates
false-positive and false-negative results. At
present there are no established criteria for the
minimum validation requirements of this type of
experiment.
Quantitative RT-PCR; inhibition by dsRNA (RNAi);
KO (yeast, Drosophila, C. elegans, zebrafish,
mouse)
Validation
New question
Answer
3
Curse of dimensionality
  • After RMA (or MAS 5.0, etc.) the data are in the
    form of a data matrix
  • Samples' point of view:
  • Which experimental conditions have similar
    effects across a set of genes?
  • 10 points in 10000-dim space
  • Genes' point of view:
  • Which genes behave similarly across experiments?
  • 10000 points in 10-dim space

Data matrix: 10 samples x 10000 genes
4
Curse of dimensionality
  • We normally filter out low quality or
    uninformative data
  • Low intensity data
  • Outliers
  • Genes that are not interesting for our study
  • Genes that do not change vs. genes that change:
    differential expression
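A filtering step of this kind might look like the sketch below. The thresholds `min_mean` and `min_var`, the toy expression values and the gene names are hypothetical, and the values are assumed to be on a log2 scale.

```python
from statistics import pvariance

def filter_genes(matrix, min_mean=7.0, min_var=0.1):
    """Keep genes that are bright enough and that vary across samples.

    `matrix` maps a gene name to its list of log2 intensities, one per sample.
    """
    kept = {}
    for gene, values in matrix.items():
        mean = sum(values) / len(values)
        if mean >= min_mean and pvariance(values) >= min_var:
            kept[gene] = values
    return kept

expr = {
    "geneA": [9.1, 9.0, 11.2, 11.4],    # bright and changing: kept
    "geneB": [4.0, 4.2, 3.9, 4.1],      # low intensity: dropped
    "geneC": [10.0, 10.0, 10.1, 10.0],  # bright but flat: dropped
}
informative = filter_genes(expr)
```

Outlier removal is not shown; it usually needs replicate information that this toy matrix does not carry.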

5
Differential Expression
  • Detect genes that are expressed at significantly
    different levels in one sample compared to another
  • Identify a list of genes that act as markers
    distinguishing different samples

6
Differential expression: Fold change
  • FC = Experiment / Control
  • In the beginning:
  • FC > 2 => upregulation
  • FC < 1/2 (a more than two-fold decrease) =>
    downregulation
  • Why 2? Intensity-dependent cutoff!
  • More elaborate intensity-dependent methods were
    developed.

Intensity-dependent variation
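A fixed fold-change rule is easiest to handle on the log2 scale, where a two-fold increase and a two-fold decrease become +1 and -1. The sketch below assumes strictly positive intensities; the function names and the example intensities are made up for illustration.

```python
import math

def log2_fold_change(experiment, control):
    """log2 of the ratio Experiment / Control."""
    return math.log2(experiment / control)

def call_regulation(experiment, control, cutoff=2.0):
    """Classify a gene with a fixed, symmetric fold-change cutoff."""
    lfc = log2_fold_change(experiment, control)
    threshold = math.log2(cutoff)
    if lfc > threshold:
        return "up"
    if lfc < -threshold:
        return "down"
    return "unchanged"

calls = [
    call_regulation(800, 200),  # four-fold increase
    call_regulation(100, 400),  # four-fold decrease
    call_regulation(150, 100),  # 1.5-fold increase, below the cutoff
]
```

An intensity-dependent method would replace the constant `cutoff` with a value that grows at low intensities, where ratios are noisier.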
7
Differential Expression: t-test
  • For each gene we want to know if the means of two
    groups are different or not
  • A kind of signal-to-noise calculation: we compare
    the distance between the means against the total
    variance
  • Calculate a p-value: how probable it is to see a
    difference at least this large if the true means
    are equal
  • Assumptions: normal distribution; a large number
    of replicates is necessary

group1
group2
8
Differential Expression: t-test
group1
group2
  • If t is higher than a certain threshold, the
    difference between X and Y can be said to be
    significant
  • The p-value tells us how probable it is to find
    a higher t value by chance if X and Y came from
    the same distribution
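The slide's formula for t is not in the transcript, so the sketch below uses Welch's form of the statistic (difference of means over its standard error) as an assumption. The p-value is estimated by permuting the group labels rather than from the t distribution, which is a swapped-in technique chosen because it needs only the standard library. The expression values are invented.

```python
import math
import random
from statistics import mean, variance

def welch_t(x, y):
    """t = difference of the group means / standard error of that difference."""
    diff = mean(x) - mean(y)
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Fraction of random relabelings with |t| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(welch_t(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(x)], pooled[len(x):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

group1 = [8.1, 7.9, 8.3, 8.0, 8.2]
group2 = [9.9, 10.2, 10.0, 10.1, 9.8]
t = welch_t(group1, group2)
p = permutation_p_value(group1, group2)
```

With only a handful of replicates the smallest achievable permutation p-value is limited by the number of distinct relabelings, which is one more reason replicates matter.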

9
Differential Expression: ANOVA
  • ANOVA tests whether different groups have the
    same mean (null hypothesis) by comparing two
    estimates of the variance σ²
  • MSE (mean square error): within-group variability
  • MSB (mean square between): inter-group
    variability
  • http://www.psych.utah.edu/stat/introstats/anovaflash.html

10
Differential Expression: ANOVA
  • The MSE is an estimate of σ² whether or not the
    null hypothesis is true.
  • MSB is only an estimate of σ² if the null
    hypothesis is true. If the null hypothesis is
    false then MSB estimates something larger than σ²
  • Therefore, if MSB is sufficiently larger than
    MSE, the null hypothesis can be rejected.
  • A p-value is calculated. A low p-value means it
    is unlikely that the groups share the same mean
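The two variance estimates can be computed directly for a single gene. This is a minimal one-way sketch with made-up measurements; turning the ratio F = MSB / MSE into a p-value needs the F distribution, which is not in the Python standard library and is left out.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (MSB, MSE, F) for a list of groups of measurements."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    # Between-group sum of squares: how far the group means sit from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scatter of each value around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    msb = ss_between / (k - 1)
    mse = ss_within / (n_total - k)
    return msb, mse, msb / mse

msb, mse, f = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

Here MSB = 13 and MSE = 1, so F = 13: the group means differ far more than the within-group scatter would explain.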

11
ANOVA
  • ANOVA
  • tests if different groups have the same mean
    (null hypothesis) by comparing two estimates of
    the variance σ²
  • MSB (mean square between): inter-group
    variability
  • MSE (mean square error): within-group variability
  • Tests if a factor is 'important', i.e. if it can
    explain the observed variability

12
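ANOVA
  • $\log(y_{ijkg}) = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \epsilon_{ijkg}$

$\mu$: overall mean
$A_i$: effect of array i
$D_j$: effect of dye j
$V_k$: effect of variate (treatment) k
$G_g$: effect of gene g
$(AG)_{ig}$: array-gene interaction ('spot' effect)
$(VG)_{kg}$: variate-gene interaction (differential expression!)
$\epsilon_{ijkg}$: random noise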
13
ANOVA
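  • Example: compare two conditions, A and B, looking
    for differentially expressed genes
  • Hybridize A (cy3 labelled) and B (cy5 labelled)
    on a single array. For a given gene g:

$\log(y_{111g}) = \mu + A_1 + D_1 + V_1 + G_g + (AG)_{1g} + (DG)_{1g} + (VG)_{1g} + \epsilon_{111g}$
$\log(y_{122g}) = \mu + A_1 + D_2 + V_2 + G_g + (AG)_{1g} + (DG)_{2g} + (VG)_{2g} + \epsilon_{122g}$
$\log(y_{111g}/y_{122g}) = (D_1 - D_2) + (V_1 - V_2) + [(DG)_{1g} - (DG)_{2g}] + [(VG)_{1g} - (VG)_{2g}] + \epsilon_g$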
14
ANOVA
A -> B
$\log(y_{111g}/y_{122g}) = (D_1 - D_2) + (V_1 - V_2) + [(DG)_{1g} - (DG)_{2g}] + [(VG)_{1g} - (VG)_{2g}] + \epsilon_g$
Dye-swap experiment: A <-> B (two slides)
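A short derivation shows why the swap helps. It assumes the same model holds on a second array (index 2) where A is labelled with cy5 and B with cy3; the array index and the noise term $\epsilon'_g$ are notation introduced here for illustration.

```latex
\begin{align*}
\log\frac{y_{111g}}{y_{122g}} &= (D_1 - D_2) + (V_1 - V_2)
  + [(DG)_{1g} - (DG)_{2g}] + [(VG)_{1g} - (VG)_{2g}] + \epsilon_g \\
\log\frac{y_{221g}}{y_{212g}} &= (D_2 - D_1) + (V_1 - V_2)
  + [(DG)_{2g} - (DG)_{1g}] + [(VG)_{1g} - (VG)_{2g}] + \epsilon'_g \\
\frac{1}{2}\left[\log\frac{y_{111g}}{y_{122g}} + \log\frac{y_{221g}}{y_{212g}}\right]
  &= (V_1 - V_2) + [(VG)_{1g} - (VG)_{2g}] + \frac{1}{2}(\epsilon_g + \epsilon'_g)
\end{align*}
```

Averaging the two log-ratios cancels the dye terms $D$ and $(DG)$, leaving the treatment effect and the variate-gene interaction of interest.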
15
Differential Expression: Other methods of gene
selection
  • Fisher criterion score
  • Entropy measure (information theory)
  • χ² measure
  • Information gain - Information gain ratio
  • Correlation-based feature selection
  • Principal Component Analysis (PCA)
  • Linear models, Bayesian estimates
  • Etc.

16
Differential Expression: Multiple hypothesis
testing
  • When testing tens of thousands of genes, each
    with a significance level p, we will have a large
    number of errors (false positives)
  • For p < 0.01, 250 genes out of 25000 will be found
    just by chance!
  • Methods to lower the number of predicted FP:
  • Bonferroni: use p' = p / num_tests
  • Benjamini-Hochberg
  • Holm
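The three corrections can be written as adjusted p-values, which are compared against the original significance level instead of lowering the cutoff. This is a plain-Python sketch; the example p-values are invented.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def holm(pvalues):
    """Step-down method: the i-th smallest p-value is multiplied by (m - i)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvalues[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

def benjamini_hochberg(pvalues):
    """Step-up method controlling the false discovery rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Expected false positives when 25000 unchanged genes are tested at p < 0.01.
expected_false_positives = 0.01 * 25000

pvals = [0.01, 0.04, 0.03, 0.005]
adj_bonf = bonferroni(pvals)
adj_holm = holm(pvals)
adj_bh = benjamini_hochberg(pvals)
```

Bonferroni and Holm control the chance of any false positive at all, while Benjamini-Hochberg controls the expected fraction of false positives in the selected list, so it is usually the least conservative of the three.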

=> long list of DE genes... What next?
17
Differential Expression: Multiple hypothesis
testing
  • The aforementioned methods provide corrected
    p-value cutoffs

Table columns: gene, p-value, unadjusted cutoff,
adjusted cutoff
18
Microarray Data Analysis
  • A long list of DE genes is not biological
    understanding.
  • What's next?
  • Select some genes for validation (e.g. by
    qRT-PCR)?
  • Do follow-up experiments on some genes?
  • Try to learn about all the genes on the
    list... (read 500 papers)?
  • Try to publish a huge table with all the results?
  • ...

19
Microarray Data Analysis
  • Look for patterns in your data
  • From one-gene to set-of-genes analysis
  • Genes in biological pathways
  • Genes associated with a particular location in
    the cell
  • Genes having a particular function or involved in
    particular processes
  • A priori selected genes

20
Pattern recognition
  • Find structure in the data that correlates with
    or explains some biological behavior
  • Which experimental conditions have similar
    effects across a set of genes? (disease markers,
    cancer subgroup discovery, etc.)
  • Which genes behave similarly across experiments?
    (gene networks, etc.)
  • Clustering: finding groups of genes (experiments)
    with similar expression profiles
  • Classification: finding models that separate two
    or more data classes.

21
Pattern recognition
  • Clustering: finding groups of genes (experiments)
    with similar expression profiles
  • Hierarchical clustering
  • K-means
  • SOM
  • Classification: finding models that separate two
    or more classes.
  • k-Nearest Neighbor (kNN)
  • artificial neural networks
  • hidden Markov models
  • Bayesian methods

22
Clustering
  • A cluster is a group of genes (experiments) with
    similar expression profiles
  • It is an unsupervised procedure. Once the notions
    of distance and neighborhood are given, no
    previous knowledge is used to find the grouping.
  • Several methods:
  • Hierarchical clustering (hierarchical method)
  • K-means (partitioning method)
  • More...

23
Clustering: What is similar?
24
Clustering: What is similar?
  • How can we quantify the notion of similarity?
  • Vector distance measurements
  • Euclidean distance
  • Manhattan distance
  • Pearson correlation
  • Spearman's rank correlation
  • Mutual information

28
Clustering: What is similar?
  • Vector distance measurements
  • Euclidean distance
  • Manhattan distance
  • Pearson correlation
  • Spearman's rank correlation: Pearson correlation
    of ranks
  • Mutual information: amount of information gained
    about X when Y is learned
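The first four measures fit in a few lines each. This is a plain-Python sketch on made-up profiles, assuming the profiles are not constant (a constant profile would make the correlation undefined); mutual information is omitted because it requires a choice of discretization.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

gene1 = [1, 2, 3, 4]
gene2 = [10, 20, 30, 40]  # same shape as gene1, ten times the magnitude
gene3 = [1, 4, 9, 16]     # monotone in gene1 but not linear
```

`gene1` and `gene2` are far apart by Euclidean or Manhattan distance yet perfectly correlated, which is why the choice of measure decides what "similar" means; `gene3` has Spearman correlation 1 with `gene1` but Pearson correlation below 1.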

29
Clustering: What is similar?
30
Clustering: Hierarchical clustering
Grouping of samples by similarity
Over-expressed genes / Under-expressed genes
Grouping of genes by similarity of expression
profile
Samples: 1 5 2 9 11 3 4 6 10 7 8
31
Clustering: Hierarchical clustering
32
Clustering: Hierarchical clustering
  • Distance matrix

High similarity
Low similarity
33
Clustering: Hierarchical clustering
  • Join A and F. Recalculate distance matrix.
  • Distance to a cluster:
  • Single linkage
  • Average linkage
  • Complete linkage
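The join-and-recalculate loop, with the three ways of measuring distance to a cluster, can be sketched as follows. It is a naive implementation for illustration; the point names and coordinates are hypothetical and chosen so that A and F are joined first.

```python
import math

# How the distance between two clusters is derived from point-to-point distances.
LINKAGE = {
    "single": min,                              # closest pair
    "complete": max,                            # farthest pair
    "average": lambda ds: sum(ds) / len(ds),    # mean over all pairs
}

def agglomerate(points, linkage="average"):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    combine = LINKAGE[linkage]
    clusters = [[name] for name in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ds = [math.dist(points[a], points[b])
                      for a in clusters[i] for b in clusters[j]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

points = {"A": (0, 0), "F": (0, 1), "B": (5, 0), "C": (5, 2), "D": (20, 0)}
single = agglomerate(points, "single")
complete = agglomerate(points, "complete")
```

The merge order is the same here for both linkages, but the height at which {A, F} joins {B, C} differs: 5 with single linkage and sqrt(29) with complete linkage. Those heights are what the tree is cut on.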

34
Clustering: Hierarchical clustering
35
Clustering: Hierarchical clustering
36
Clustering: Hierarchical clustering
37
Clustering: Hierarchical clustering
38
Clustering: Hierarchical clustering
  • The resulting figure is known as a hierarchical
    tree.
  • Different number of clusters depending on how
    deep we look.

39
Clustering: Hierarchical clustering
  • At each step, the order in which the two joined
    groups are drawn is arbitrary

40
Clustering: Hierarchical clustering
  • Pros
  • Useful to provide a view of the data structure
    and similarities.
  • It is simple.
  • It is colorful. It is part of most microarray
    studies.
  • Cons
  • Be cautious! Anything will cluster. Even random
    data!
  • Clusters are kind of arbitrary, depending on how
    the tree is cut.
  • What is a good cluster?

41
Clustering: Hierarchical clustering
  • What is a good cluster? Bootstrapping:
  • The data is resampled (some experiments taken out
    randomly and replaced by copies of other
    experiments)
  • The whole clustering is repeated.
  • Clusters that appear often are statistically
    more reliable than others.

42
Clustering: K-means
  • A specific number of clusters has to be provided
  • Goal: assign elements to clusters

43
Clustering: K-means
  • Start by guessing k centers

44
Clustering: K-means
  • Assign elements to these centers

45
Clustering: K-means
  • Move each center to the center of gravity of its
    assigned elements

46
Clustering: K-means
  • Reassign elements, and repeat until convergence

47
Clustering: K-means
  • K-means is iterative.
  • The outcome depends on initial guesses
  • The number of final clusters is an input of the
    algorithm
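The guess, assign, move, repeat loop can be sketched as below. Initial centers are drawn at random from the data, which is one common choice and an assumption here; the points are invented and `math.dist` requires Python 3.8 or later.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, groups) after iterating to convergence."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial guesses
    groups = []
    for _ in range(max_iter):
        # Assign every element to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the center of gravity of its elements.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    return centers, groups

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(points, k=2)
```

On data without such clean separation, different seeds can give different final clusters, so it is common to run the algorithm several times and keep the tightest solution.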

48
Clustering: K-means
  • To guess a good number of clusters we can use a
    figure of merit (FOM)
  • FOM quantifies how good the clusters are

49
Clustering: SOM
  • Self-organizing maps
  • Start with a given number of clusters
  • For each cluster create a node and give them
    initial positions

50
Clustering: SOM
  • Pick a random gene

51
Clustering: SOM
  • Move the nodes toward the selected gene

52
Clustering: SOM
  • Pick another gene and move the nodes again

53
Clustering: SOM
  • Keep iterating; at each iteration decrease node
    mobility

54
Clustering: SOM
  • Eventually the nodes will have stable positions.
    Each cluster is defined as the set of genes
    closest to a node
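A toy version of this procedure is sketched below. The nodes are arranged on a one-dimensional chain, and the learning rate, the Gaussian neighbourhood and the linear decay schedule are all assumptions chosen for brevity; real SOM packages typically use a two-dimensional grid.

```python
import math
import random

def train_som(genes, n_nodes, n_iter=500, seed=0):
    """Return node positions after repeatedly pulling nodes toward random genes."""
    rng = random.Random(seed)
    nodes = [list(rng.choice(genes)) for _ in range(n_nodes)]  # initial positions
    for t in range(n_iter):
        mobility = 1 - t / n_iter             # shrinks from 1 toward 0
        rate = 0.5 * mobility
        radius = (n_nodes / 2) * mobility
        gene = rng.choice(genes)              # pick a random gene
        winner = min(range(n_nodes), key=lambda i: math.dist(nodes[i], gene))
        for i, node in enumerate(nodes):
            # The winning node moves most; its neighbours on the chain move less.
            influence = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
            for d in range(len(node)):
                node[d] += rate * influence * (gene[d] - node[d])
    return nodes

def assign_clusters(genes, nodes):
    """Label each gene with the index of its closest node."""
    return [min(range(len(nodes)), key=lambda i: math.dist(nodes[i], g))
            for g in genes]

genes = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 4.9)]
nodes = train_som(genes, n_nodes=2)
labels = assign_clusters(genes, nodes)
```

As with K-means, the number of nodes is an input and the result depends on the random draws.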

55
Classification
  • Classification is the process of finding models
    that separate two or more data classes.
  • Given classes A and B, can we use them as a basis
    to decide if a new unknown sample is A or B?
  • Supervised classification means we are using a
    priori information to find different classes
  • The methods find the structure in the data that
    explains this information

56
Classification
  • First, the data is divided into a training and a
    test set

57
Classification
  • Learn a classifier with the training set

58
Classification
  • Apply the classifier to the test data
  • Compare predicted classes with known classes to
    assess performance of the classifier

59
Classification
  • Examples of classifiers:
  • Linear discriminants
  • K-nearest neighbours
  • Artificial neural networks
  • Decision Trees
  • Support Vector Machines
  • Bayesian Methods
  • Hidden Markov Models
  • etc

60
Classification: K-NN
  • Assign a test sample to the class most often
    found in the K nearest training samples
  • Rule of thumb: K = sqrt(N_training)
  • Normally, Euclidean distance is assumed
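A compact version of the classifier, together with the train/test evaluation described earlier, might look like this. The training profiles, labels and function names are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train, sample, k):
    """Majority vote among the k training samples closest to `sample`.

    `train` is a list of (profile, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def rule_of_thumb_k(n_training):
    """K close to the square root of the training-set size, at least 1."""
    return max(1, round(math.sqrt(n_training)))

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
test = [((0.5, 0.5), "A"), ((5.5, 5.5), "B")]

predictions = [knn_predict(train, profile, k=3) for profile, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
```

An odd K avoids tied votes when there are two classes; `rule_of_thumb_k(23)` gives 5.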

61
Classification: K-NN
  • Choose K = sqrt(23) ≈ 5

62
Example
  • Leukemia classification
  • Golub et al., Science 286:531-537 (1999)
  • Cancer classification applied to acute leukemias
  • Class discovery: recognizing previously undefined
    tumor subtypes
  • Class prediction: assignment of samples to
    already defined classes

63
Example
  • Leukemia classification
  • Golub et al., Science 286:531-537 (1999)
  • Cancer classification applied to acute leukemias
  • Class prediction: assignment of samples to
    already defined classes
  • Class discovery: recognizing previously undefined
    tumor subtypes

64
Example
  • Ramaswamy et al. (Nature Genetics, 2003): is the
    gene expression program of metastasis already
    present in primary tumors?
  • Used a weighted voting algorithm to find genes
    that separate primary tumours from metastases
  • Result: 128 genes that best distinguished primary
    from metastatic tumours

65
Example
  • Horizontal bar: red = recurrent, black =
    non-recurrent
  • Vertical bar: red = originally primary, black =
    originally metastasis
  • Hierarchical clustering in the space of the 128
    genes identifies two main clusters of primary
    tumors, highly correlated to the original
    primary-tumor vs. metastasis distinction

66
Gene Ontology Consortium
  • Goal of the GO consortium: provide a controlled
    vocabulary that can be applied to all organisms,
    even as knowledge of gene and protein roles is
    accumulating and changing.
  • GO provides three ontologies:
  • MOLECULAR FUNCTION (what?)
  • BIOLOGICAL PROCESS (why?)
  • CELLULAR COMPONENT (where?)

67
GO
  • GO is organized as a Directed Acyclic Graph
    structure

68
GO
  • Each GO node has zero or more ENTREZID
    annotations
  • Parent terms inherit annotations from children
  • Example: BP apoptosis node

69
GO
70
GO
71
Differential Expression lists: GO
  • Are there any GO terms that have a larger than
    expected subset of our selected genes in their
    annotation list?
  • If so, these GO terms will give us insight into
    the functional characteristics of the gene list
  • How large is 'larger'?

72
GO as an urn
  • The urn contains a ball for each gene in the
    universe
  • Paint the balls representing genes in our
    selected list white and paint the rest black.
  • Testing a GO term amounts to drawing the genes
    annotated to it from the urn and tallying white
    and black.
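The urn picture corresponds to a hypergeometric tail probability, which can be computed exactly with integer arithmetic. The sketch below is one-sided (over-representation only), and the toy numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(universe, selected, annotated, overlap):
    """P(at least `overlap` white balls) when `annotated` balls are drawn
    without replacement from an urn of `universe` balls, `selected` of them white.
    """
    total = comb(universe, annotated)
    upper = min(annotated, selected)
    tail = sum(
        comb(selected, k) * comb(universe - selected, annotated - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Toy urn: 10 genes, 3 selected (white), a GO term annotating 4 genes,
# 2 of which are in the selected list.
p = enrichment_p_value(universe=10, selected=3, annotated=4, overlap=2)
```

Here p = 70/210 = 1/3, so an overlap of 2 would not be surprising. This one-sided tail is the quantity a one-sided Fisher exact test reports for the corresponding 2x2 table.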

73
GO and microarray gene sets
  • Is a GO term specific for a set?

2x2 table (with row and column totals)
51 416 467
125 8588 8713
173 9004 9177
Table labels: in GO category, NOT in GO category;
in DE list, NOT in DE list
p-value: 8 x 10^-52
Fisher exact test or chi-square test
74
Gene Products interact...a lot!
75
Gene Networks
Gene networks describe relations among a group of
genes.
76
Relations and Interactions
  • Examples
  • A gene expressing a transcription factor that
    regulates the expression of a set of other genes
  • A gene expressing an enzyme which activates a set
    of proteins
  • Two proteins binding each other to produce a
    functional complex
  • A gene expressing an enzyme which catalyses the
    production of a metabolic compound, which in turn
    inhibits another enzyme
  • A gene participating in the same cellular process
    as a set of other genes

77
Interaction data
  • Protein-protein interactions
  • Protein-DNA interactions
  • Functional classifications
  • Metabolic pathways
  • Signalling pathways
  • Sequence and structure information
  • Other gene expression studies