1
Improved Bayesian segmentation with a novel
application in genome biology
Petri Pehkonen, Kuopio University; Garry Wong, Kuopio University; Petri Törönen, HY, Institute of Biotechnology
2
Outline
  • a little motivation
  • the heuristics used
  • the proposed Bayes model
  • also presents a modified Dirichlet prior
  • the proposed testing scheme with artificial data
  • discusses the use of prior information in the evaluation
  • a short analysis of real datasets

3
Biological problem setup
  • Input
  • Genes and their associations with biological features such as regulation, expression clusters, functions, etc.
  • Assumption
  • Neighbouring genes in the genome may share the same features
  • Aim
  • Find the chromosomal regions "over-related" to some biological feature or combination of features, i.e. look for non-random localization of features
  • I will focus on the gene expression data application

4
A comparison with some earlier work on expression data
  • Our aim is to analyze gene expression along the genome from a new perspective
  • standard: consider very local areas of constant expression levels
  • our view: how about looking at larger regions that have clearly more active genes (under certain conditions)?

Our perspective is related to the idea of active and passive regions of the genome
5
Further comparison with earlier work
  • Standard: an Up/Down/No-regulation classification or a real value from each experiment as the input vector for a gene
  • Our idea: one can also associate genes to clusters in varying clustering solutions
  • a multinomial variable/vector for each gene
  • By using a varying number of clusters one should obtain broader and narrower classes

This is related to the idea of combining weak coherent signals occurring in various measurements with clusters
6
Methodological problem setup
  • Gene membership in co-expression clusters
  • Genes can be partitioned into separate clusters according to expression similarity: first into 2 clusters, then 3, then 4, etc.
  • The aim is to find chromosomal regions where consecutive genes are in the same expression clusters across different clustering results (a code sketch of this encoding follows the table below)

More clusters capture specific expression similarity; fewer clusters capture broader expression similarity:

Gene order in chromosome   1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
6 expression clusters      0  0  1  5  2  3  6  5  0  3  3  3  4  0  0
5 expression clusters      0  0  5  4  5  2  1  2  0  4  4  4  4  0  0
4 expression clusters      0  0  3  3  4  3  3  3  0  2  2  1  2  0  0
3 expression clusters      0  0  3  3  3  3  3  3  0  1  1  1  1  1  0
2 expression clusters      0  0  1  1  1  1  1  1  1  1  1  1  1  0  0
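A minimal sketch of this encoding (not the authors' code; scikit-learn, the function name and the choice of k values are assumptions): cluster the same expression matrix with several values of k and collect one label column per clustering.

import numpy as np
from sklearn.cluster import KMeans

def multiscale_cluster_labels(expression, ks=(2, 3, 4, 5, 6)):
    # expression: (n_genes, n_conditions) matrix, rows ordered by chromosomal position
    labels = np.empty((expression.shape[0], len(ks)), dtype=int)
    for j, k in enumerate(ks):
        # one clustering per k; each gene gets one multinomial label per column
        labels[:, j] = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(expression)
    return labels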
7
Existing segmentation algorithms
  • Non-heuristic
  • Dynamic programming
  • Heuristic
  • Hierarchical
  • Top-down / bottom-up
  • Recursive / iterative
  • K-means-like solutions (EM methods)
  • Sliding window with adaptive window size (?)
  • etc.

8
Hierarchical vs. non-hierarchical heuristic methods
  • Non-hierarchical heuristic methods usually produce only a single solution
  • compare with k-means in clustering
  • these often require a parameter (the number of change-points)
  • they aim to create a (locally) optimal solution for that number of change-points
  • Hierarchical heuristic methods produce a large group of solutions with a varying number of change-points
  • a large group of solutions can be created in one run
  • the solutions can usually be optimized further

9
Recursive vs. Iterative hierarchical heuristics
  • Recursive hierarchical heuristics
  • slice until some stopping rule (e.g. a BIC penalty) is fulfilled
  • each segment is sliced independently of the rest of the data
  • it is hard to obtain solutions for a varying number of change-points
  • designed to stop at an optimum (which can be a local optimum)
  • Top-Down (?) hierarchical heuristics
  • slice until a stopping rule or a maximum number of segments is fulfilled
  • each new change-point is placed only after all segments are analyzed; the best change-point over all segments is selected
  • creates a chain of solutions with a varying number of segments
  • can be run past the (local) optimum to see if a better solution appears after a few bad splits

10
Our choice for heuristic search
  • Top-Down hierarchical segmentation
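A minimal sketch of the greedy top-down loop (hypothetical helper names, not the authors' implementation): every current segment is scanned, the single best change-point over all segments is committed, and the growing chain of nested solutions is recorded.

def top_down_segmentation(split_gain, n, max_changepoints):
    """split_gain(start, end, cut) -> score gain of cutting segment [start, end) at cut."""
    boundaries = [0, n]                  # segment borders, including both ends
    solutions = [list(boundaries)]       # the 0-change-point solution
    for _ in range(max_changepoints):
        best = None
        for start, end in zip(boundaries, boundaries[1:]):
            for cut in range(start + 1, end):
                gain = split_gain(start, end, cut)
                if best is None or gain > best[0]:
                    best = (gain, cut)
        if best is None:                 # every segment has length 1; nothing to split
            break
        boundaries.append(best[1])
        boundaries.sort()
        solutions.append(list(boundaries))
    return solutions                     # nested chain: 0, 1, 2, ... change-points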

11
How to place a new change-point
  • The position of a new change-point is usually selected using a statistical measure
  • Optimization of the log likelihood ratio (a ratio of ML-based solutions)
  • lighter to calculate
  • often referred to as the Jensen-Shannon divergence (a sketch follows this list)
  • Optimization of our Bayes factor
  • the Bayes model is discussed later
  • the natural choice (as this is what we want to optimize)
  • The Bayes factor would seem natural, but...
  • in testing we noticed that we started splitting only the smallest segments. Why?
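A minimal sketch of the ML split score for multinomial count data (a standard formulation, assumed rather than taken from the paper): the log likelihood ratio of modelling the two halves separately versus together, which equals the segment size times the weighted Jensen-Shannon divergence between the halves.

import numpy as np

def llr_split_score(left_counts, right_counts):
    # log-likelihood of counts under their own ML (empirical) distribution
    def loglik(counts):
        n = counts.sum()
        nz = counts[counts > 0]
        return float((nz * np.log(nz / n)).sum())
    merged = left_counts + right_counts
    # equals N * JS divergence with weights N_left/N and N_right/N
    return loglik(left_counts) + loglik(right_counts) - loglik(merged)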

12
Bias in bayesian score
  • The first figure represents random data (no preferred change-point position)
  • The second figure represents the behaviour of the log likelihood ratio model (the ML method)
  • The third figure represents the behaviour of the Bayes factor (BF)
  • The highest point of each profile is taken as the change-point
  • Notice the bias in BF that favours cutting near the ends
  • Still, all BFs are negative (against splitting)
  • This causes problems when we force the algorithm to go past the local optimum

=> We chose the ML score for the change-point search
13
What we have obtained so far
  • Top-Down hierarchical heuristic segmentation
  • an ML-based measure (JS divergence) is used to select the next change-point

14
Selecting optimal solution from hierarchy
  • A hierarchical segmentation of data of size n contains n different nested solutions
  • The solutions must be evaluated in order to find a proper one: not too general, not too complex
  • We need model selection

15
Model selection used for segmentation models
  • Two ideas occur in the most-used methods
  • evaluating "the fit" of the model (usually an ML score)
  • penalizing for the parameters used in the model
  • Segmentation model parameters: the class distributions within the segments and the positions of the change-points
  • We tried a few (ML-based) model selection methods
  • AIC
  • BIC
  • modified BIC (designed for segmentation tasks)
  • We were not happy with their performance, therefore...

16
Our model selection criterion
  • A Bayesian approach => takes into account uncertainty and a priori information about the parameters
  • The change-point model M includes two varying parameter groups
  • A: the class proportions within segments
  • B: the change-points (segment borders)
  • The posterior probability of the model M fitted to data D is obtained by integrating over the A and B parameter spaces, as written out below
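Presumably of the form (the slide's equation was an image; this reconstruction follows the verbal description above):

P(M \mid D) \;\propto\; P(M) \int_{A}\int_{B} P(D \mid M, A, B)\, P(A, B \mid M)\, \mathrm{d}A\, \mathrm{d}B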

17
Our approximations/assumptions: A
  • the clusters do not affect each other (independence)
  • the data dimensions do not affect each other (independence)
  • these two allow a simple multiplication
  • the segmentation does not directly affect the modelling of the data within a cluster
  • only the model and the prior A affect the likelihood of the data: P(D|M,A,B) = P(D|M,A)

18
Our model selection criterion
  • Therefore a multinomial model with a Dirichlet prior can be used to calculate the integrated likelihood
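This is the standard Dirichlet-multinomial marginal likelihood (a hedged reconstruction of the slide's image-only equation), with v indexing dimensions, c classes, n_vc the class counts within a segment, N_v their sum, and \alpha_{vc} the Dirichlet prior weights:

P(D \mid M, A) \;=\; \prod_{v} \frac{\Gamma\big(\sum_{c} \alpha_{vc}\big)}{\Gamma\big(N_{v} + \sum_{c} \alpha_{vc}\big)} \prod_{c} \frac{\Gamma\big(n_{vc} + \alpha_{vc}\big)}{\Gamma\big(\alpha_{vc}\big)}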

(the multiplication runs over the classes within one dimension, and then over the dimensions)
19
Further assumptions/approximations: B
  • We assume all the change-points to be exchangeable
  • the order in which change-points are found does not matter for the solution
  • We do not integrate over the parameter space B, but analyze only the MAP solution
  • we need a proper prior for B...

20
Our model evaluation criterion
  • We select a flat prior for simplicity
  • this makes the MAP solution equal to the ML solution
  • The prior of parameters B is 1 divided by the number of ways the current m change-point estimates can be positioned in data of size n, as written out below
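In formula form (a reconstruction from the description above, assuming the m change-points fall into the n - 1 gaps between consecutive data points):

P(B) \;=\; \binom{n-1}{m}^{-1}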

21
Our model evaluation criterion
  • The final form of our criterion is (without the log):

"Flat" MAP-estimate for parameters B.
Posterior probability of parameter group A
Multiplication goes over various clusters (c),
and various dimensions (v). Quite simple equation.
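A hedged reconstruction of the criterion from the caption above (the equation itself was an image), combining the flat change-point prior with one Dirichlet-multinomial term per segment c and dimension v, with j indexing the classes:

P(D, M) \;\approx\; \binom{n-1}{m}^{-1} \prod_{c} \prod_{v} \frac{\Gamma\big(\sum_{j} \alpha_{j}\big)}{\Gamma\big(N_{cv} + \sum_{j} \alpha_{j}\big)} \prod_{j} \frac{\Gamma\big(n_{cvj} + \alpha_{j}\big)}{\Gamma\big(\alpha_{j}\big)}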
22
What about the Dirichlet prior weights?
  • The multinomial model requires prior parameters
  • Standard Dirichlet prior weights:
  • I) all the prior weights the same (FLAT)
  • II) the prior probabilities equal the class probabilities in the whole dataset (CSP)
  • These require the definition of a prior sum (ps)
  • We used ps = 1 (CSP1, FLAT1) and ps = number of classes (CSP, FLAT) for both of the previous priors
  • Empirical Bayes (?) prior (EBP): prior II with ps = SQRT(Nc) (Carlin and others; scales according to the std)
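In formula form (a reconstruction; k = the number of classes, p_j = the probability of class j in the whole dataset, \alpha_j = the Dirichlet weight of class j):

\text{FLAT: } \alpha_j = \frac{ps}{k} \qquad \text{CSP: } \alpha_j = ps \cdot p_j \qquad \text{EBP: } \alpha_j = \sqrt{N_c}\; p_j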

23
Dirichlet prior weights
  • We considered EBP reasonable, but...
  • with small class proportions and small clusters EBP is problematic
  • the gamma function in the Dirichlet equation probably approaches infinity (as the prior weight approaches zero)
  • Modified EBP (MEBP) mutes this behaviour
  • instead of the EBP weights, it uses prior weights that approach 0 more slowly when the class proportion is small (see the reconstruction below)
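A plausible reconstruction of the change (the original formulas were images, so this is a guess, though one consistent with the next slide's note that the MEBP prior sum grows with a more even class distribution and with the number of classes):

\text{EBP: } \alpha_j = \sqrt{N}\, p_j \qquad\longrightarrow\qquad \text{MEBP: } \alpha_j = \sqrt{N\, p_j}

Since \sqrt{p_j} \gg p_j for small p_j, the MEBP weights approach zero more slowly.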

24
Dirichlet prior weights
  • Also, the ps in MEBP now depends on the class distribution (a more even distribution => a bigger ps); a larger number of classes also => a bigger ps. Both of these sound natural
  • The prior weight can also be linked to the Chi-square test

25
What we have obtained so far
  • Top-Down hierarchical heuristic segmentation
  • an ML-based measure (JS divergence) is used to select the next change-point
  • The results from the heuristic are analyzed using the proposed Bayes model
  • a flat prior: 1 over the number of potential segmentations with the same m
  • the MEBP prior for the multinomial data

26
Evaluation
  • Testing using artificial data
  • we can vary the number of clusters, the number of classes and the class distributions, and monitor the performance
  • Do the hierarchical segmentation
  • Select the best result with the various methods
  • The standard measure for evaluation: compare how well the obtained clusters correspond to the clusters used in the data generation
  • But is a good correlation/correspondence always what we want to see?

27
When correlation fails
  • many clusters/segments and few data points
  • consecutive small segments
  • similar neighbouring segments
  • one segment in the obtained segmentation (or in the data generation) => no correspondence
  • Problem: correlation does not account for Occam's razor

28
Our proposal
  • Base the evaluation on the similarity between the statistical model used to generate each data point (DGM) and the data model obtained from the clustering for that data point (DEM)
  • this resembles standard cross-validation
  • Use a probability distribution distance measure to monitor how similar they are
  • one can think of this as an infinite-size test data set
  • We only need to select the distance measure
  • an extra plus: with hierarchical results we can look at the optimal result and see whether a method overestimates or underestimates it

29
Probability distribution distance measures
  • Kullback-Leibler divergence (the most natural)
  • the inverse of the KL
  • Jensen-Shannon divergence
  • other measures were also tested...

(here X is the DGM and Y is the DEM, obtained from the segments)
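The standard definitions (the slide showed them as images; these are the usual forms):

D_{KL}(X \,\|\, Y) = \sum_i P_X(i) \log \frac{P_X(i)}{P_Y(i)} \qquad D_{KL}^{inv}(X \,\|\, Y) = D_{KL}(Y \,\|\, X)

D_{JS}(X, Y) = \tfrac{1}{2} D_{KL}(X \,\|\, M) + \tfrac{1}{2} D_{KL}(Y \,\|\, M), \quad M = \tfrac{1}{2}(X + Y)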
30
The Good, the Bad and ...
  • The DEM can have data points with P(X_i) = 0
  • these create an infinite score in D_KL
  • it underestimates the optimal model
  • D_KL_inv was considered to correct this, but...
  • P(X_i) = 0 now causes too many zero scores
  • x log(x) was defined as 0 when x -> 0
  • it heavily overestimates the model
  • D_JS was selected as a compromise between these two phenomena (see the sketch below)
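A minimal numpy sketch of the trade-off (an assumed helper, not the authors' code): D_KL blows up when the second argument has zeros, the x log(x) -> 0 convention silences zeros in the first argument, and D_JS stays finite because the mixture M is positive wherever either distribution is.

import numpy as np

def kl(p, q):
    nz = p > 0                      # x*log(x) -> 0 convention for p = 0
    with np.errstate(divide="ignore"):
        # becomes inf if q = 0 somewhere p > 0
        return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def js(p, q):
    m = 0.5 * (p + q)               # m > 0 wherever p or q is, so js stays finite
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)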

31
Do we want to use prior info?
  • In standard cross-validation a Bayes method uses prior information, while ML does not
  • Is this fair?
  • the same result with and without a prior gets a different score
  • the method with a prior usually gets better results
  • Our (my!) opinion: the evaluation should use the same amount of prior info for all the methods!
  • we would get the same score for the same result (independent of the method)
  • we would pick the model from the model group that usually performs better
  • Selecting the prior for the evaluation process is now an open question!

32
Defending note
  • The amount of prior only affects the results from one group of the analyzed artificial datasets (sparse signal / small clusters)
  • these are the datasets where Bayes methods behave differently
  • A revelation from the results: ML methods perform worse also in datasets where the prior has little effect
  • => the use of a prior mainly matters for the comparisons between our Bayes method's priors

33
Rules for selecting the prior for model evaluation
  • The obtained DEM should be as close to the DGM as possible (more correct => a smaller D_JS)
  • The prior used should be based on something other than our favourite MEBP
  • hoping we would not get good results with MEBP just because of using the same prior
  • Use as little prior as possible
  • we want the segment area to have as much effect as possible
  • Better ideas?

34
Comparison of model evaluation priors
  • Used small-cluster data with 10 and 30 classes (where the prior affects the results)
  • Used CSP (class prior = class probability × ps) with ps = 1, 2, c/4, c/2, 3c/4, c, 10c (c = the number of classes)
  • Looked at the obtained D_JS for various segmentation outcomes (from the hierarchical results) with 1–n clusters (n = max(5, k), k = the artificial data cluster number)
  • The analysis was done with artificial datasets

35
Comparison of model evaluation priors
  • ps = 1, 2, c/4, c/2, 3c/4, c, 10c

The approximate minimum is at ps = number of classes
36
Comparison of priors
  • We did not look for the minimum, but wanted a compromise between the minimum and a weak prior effect
  • We chose ps = c/2
  • The choice is quite arbitrary, but a quick analysis with neighbouring priors gave similar results

37
Proposed method: artificial data evaluation
  • Top-Down hierarchical heuristic segmentation, with ML used to select the next change-point
  • The results from the heuristic are analyzed using the proposed Bayes model
  • Evaluation of the results using the artificial data
  • estimate how well the obtained model predicts future data sets
  • compare the models with D_JS, which also uses prior information

38
More on evaluation
  • Three data types (each with a varying number of classes):
  • several (1–10) large segments (each 30–300 data points)
  • this should be easy to analyze
  • few (1–4) large segments (30–300 data points)
  • this should have a less reliable prior class distribution
  • several (1–10) small segments (15–60 data points)
  • the most difficult to analyze
  • the prior affects these results
  • The number of classes used in each: 2, 10, 30
  • data sparseness increases with an increasing number of classes
  • The data classes were made skewed

39
Evaluation
  • The data was segmented by Top-Down into 1–100 segments
  • Model selection methods were used to pick the optimal segmentation
  • ML methods: AIC, BIC, modified BIC
  • the Bayes method with the Dirichlet priors FLAT1, FLAT, CSP1, CSP, EBP, MEBP
  • Each test was replicated 100 times
  • D_JS was calculated between the DGM and the obtained DEM

40
Still evaluating
  • As mentioned, the smaller the JS distance between the DGM and the DEM, the better the model selection method
  • For simplicity, we subtracted the JS distances obtained with our own Bayesian method from the distances obtained with the other methods
  • We took the average of these differences over the 100 replicates, and summarized them as the Z-score sketched below
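A minimal sketch of this statistic (hypothetical names): paired D_JS differences over the 100 replicates, summarized as the Z-score shown in the tables below.

import numpy as np

def z_score(djs_other_method, djs_our_method):
    diff = np.asarray(djs_other_method) - np.asarray(djs_our_method)
    # mean(diff) / (std(diff) / sqrt(n)); positive values favour our method
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))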

41
Z-scores

Data     AIC    BIC    BIC2   CSP    EBP    CSP1   Flat   Flat1
i. 2     16.8    0.6   -0.6   -1.4   -2.2   -1.2   -1.8   -1.5
i. 10     1.6    7.0    3.8    1.6    1.0    3.5    1.6    3.2
i. 30     4.1   13.5   10.3    1.9    1.7    4.3    2.0    4.5
ii. 2     8.7    0.9    2.4    1.6   -1.3    2.2    1.6    2.5
ii. 10    0.4    7.4    2.5    4.0    0.8    3.0    0.5    2.0
ii. 30    1.5   15.4   14.7    8.2    2.2    7.4   -1.4    5.3
iii. 2    7.0    2.8    1.3   -1.3   -0.7    1.0    1.2    1.0
iii. 10   1.9   13.8    8.1    1.6    2.4    4.6    4.2    5.6
iii. 30  11.9   13.9   13.9    5.0    4.9    8.7    5.5    9.7
Average   0.60   0.84   0.63   0.24   0.10   0.37   0.15   0.36

Average differences

Data     AIC    BIC    BIC2   CSP    EBP    CSP1   Flat   Flat1
i. 2     5.65    0.06  -0.05  -0.09  -0.06  -0.07  -0.12  -0.09
i. 10    0.16    4.89   0.55   0.04   0.02   0.45   0.08   0.39
i. 30    0.78   58.12  17.48   0.24   0.08   5.74   0.23   4.28
ii. 2    1.42    0.01   0.08   0.03  -0.03   0.07   0.05   0.07
ii. 10   0.03    3.01   0.25   0.47   0.05   0.32   0.02   0.18
ii. 30   0.22   12.64  12.19   1.66   0.30   4.15  -0.15   3.47
iii. 2   1.13    0.27   0.11  -0.08  -0.03   0.09   0.11   0.09
iii. 10  0.15   13.61   3.67   0.19   0.21   1.90   0.65   1.47
iii. 30  5.82   13.88  13.88   0.59   0.51   8.93   1.72  10.70
Average  1.70   11.83   5.35   0.34   0.12   2.40   0.29   2.29
The upper box shows the Z-scores, Z = mean(diff) / (std(diff) / sqrt(100)); the lower box shows the average differences. Z-scores x > 3 (shaded in the original slide) indicate strong support in favour of our method; Z-scores x < 0 (underlined in the original) are results against our method.
Summary: AIC is bad on two classes (it overestimates); BIC (and Mod-BIC) are bad on 10 and 30 classes (they underestimate); Flat1 and CSP1 are weak on 10 and 30 classes (they overestimate).
42
Large segments: detailed view
  • The rows show the D results for datasets with 2, 10 and 30 classes
  • D is taken relative to the segmentation selected by the Bayes model with MEBP
  • positive results => BM with MEBP outperforms
  • negative results => the method in question outperforms BM with MEBP
  • 1st column: the mainly worse methods
  • 2nd column: the mainly better methods
  • These results did not depend on the D_JS prior

43
Large segments in small data
This is data where the prior information is less reliable (a smaller dataset).
The flat prior outperforms our prior in the 30-class dataset.
44
Small segments
This is the hardest data to model, and the data where the prior affects the evaluation significantly. Without a prior, the BIC methods give the best result (1 segment is considered best).
45
Summary from the artificial data
  • MEBP had an overall better result in 23/24 pairwise comparisons on the 30-class datasets (in 18/24 the Z-score > 3)
  • MEBP had a better overall result in all pairwise comparisons on the 10-class datasets (in 12/24 the Z-score > 3)
  • Our method was slightly outperformed by the other Bayes methods in dataset i with 2 classes; EBP also slightly outperforms it on every 2-class dataset
  • EBP might be better for smaller class numbers
  • MEBP underestimates the optimum here
  • The ML methods and the priors with ps = 1 (Flat1, CSP1) had the weakest performance

46
Analysis of real biological data
  • Yeast cell-cycle time-series gene expression data
  • Genes were clustered with k-means into 3, 4, 5 and 6 groups
  • The order of the genes in the chromosomes and the gene associations with the expression clusters were turned into multidimensional multinomial data
  • The aim was to locate regional similarities in gene expression during the yeast cell cycle

47
Anything in real data?

CHR   Rand. mean   Rand. std   log(P(M|D))   Goodness
1      -726.39      3.86        -711.47        3.87
2     -2783.24      5.17       -2759.31        4.62
3     -1134.89      6.65       -1103.91        4.66
4     -5331.72      8.80       -5160.64       19.44
5     -1899.52      3.62       -1889.82        2.68
6      -792.07      4.90        -752.02        8.17
7     -3548.24      6.34       -3523.82        3.85
8     -1982.86      2.46       -1969.82        5.31
9     -1502.43      6.71       -1492.22        1.52
10    -2589.06      3.36       -2543.79       13.48
11    -2185.09      9.37       -2167.20        1.91
12    -3693.34      4.60       -3658.42        7.58
13    -3176.61      5.06       -3166.51        2.00
14    -2641.54      6.02       -2612.29        4.86
15    -3719.47      6.80       -3693.68        3.79
16    -3157.52      3.77       -3150.92        1.75
  • Each chromosome was segmented
  • The segmentation score of each chromosome was compared to the scores from randomized data (100 randomizations)
  • Goodness = (x - mean(rand)) / std(rand)

48
Conclusions
  • Showed a Bayes model that overall outperforms the ML-based methods
  • Proposed a modified prior that performs better than the other tested priors on datasets with many classes
  • Proposed a way of testing the various methods
  • it avoids picking too detailed models
  • the use of a prior can be considered a drawback
  • Showed the preference for the ML score when segmenting data with very weak signals
  • Real data has localized signal

49
Future points
  • Improve the heuristic (optimize the results)
  • Use of fuzzy vs. hard cluster classifications
  • Various other potential applications (no certainty of their rationality yet...)
  • Should clusters be merged? (work done in HIIT, Mannila's group)
  • Consider sound ways of setting the prior for the D_JS calculation
  • the length of the gene, the density of genes?

50
Thank you!
Wake up!