Loading...

PPT – Computational Systems Biology of Cancer: PowerPoint presentation | free to download - id: 7c7978-NGIyN

The Adobe Flash plugin is needed to view this content

(II)

Computational Systems Biology of Cancer

Bud Mishra

- Professor of Computer Science, Mathematics and

Cell Biology - Courant Institute, NYU School of Medicine, Tata

Institute of Fundamental Research, and Mt. Sinai

School of Medicine

The New Synthesis

Genome Evolution

Selection

perturbed pathways

genetic instability

Is the Genomic View of Cancer Necessarily

Accurate ?

- If I said yes, that would then suggest that that

might be the only place where it might be done

which would not be accurate, necessarily

accurate. It might also not be inaccurate, but

I'm disinclined to mislead anyone. - US Secretary of Defense, Mr. Donald Rumsfeld,

Once again quoted completely out of context.

Cancer Initiation and Progression

Genomics (Mutations, Translocations,

Amplifications, Deletions) Epigenomics (Hyper

Hypo-Methylation) Transcriptomics (Alternate

Splicing, mRNA) Proteomics (Synthesis,

Post-Translational Modification,

Degradation) Signaling

Cancer Initiation and Progression

Proliferation, Motility, Immortality, Metastasis,

Signaling

Mishras Mystical 3Ms

- Rapid and accurate solutions
- Bioinformatic, statistical, systems, and

computational approaches. - Approaches that are scalable, agnostic to

technologies, and widely applicable - Promises, challenges and obstacles

Measure

Mine

Model

Measure

- What we can quantify and what we cannot

Microarray Analysis of Cancer Genome

- Representations are reproducible samplings of DNA

populations in which the resulting DNA has a new

format and reduced complexity. - We array probes derived from low complexity

representations of the normal genome - We measure differences in gene copy number

between samples ratiometrically - Since representations have a lower nucleotide

complexity than total genomic DNA, we obtain a

stronger specific hybridization signal relative

to non-specific and noise

Minimizing Cross Hybridization(Complexity

Reduction)

Copy Number Fluctuation

A1

B1

C1

A2

B2

C2

A3

B3

C3

Critical Innovations

- Data Normalization and Background Correction for

Affy-Chips - 10K, 100K, 500K (Affy) Generalized RMA
- Multi-Experiment-Based Probe-Characterization

(Kalman EM) - A novel genome segmenter algorithm
- Empirical Bayes Approach Maximum A Posteriori

(MAP) - Generative Model (Hierarchical, Heteroskedastic)
- Dynamic Programming Solution
- Cubic-Time Linear-time Approximation using

Beam-Search Heuristic - Single Molecule Technologies
- Optical and Nanotechnologies
- Sequencing SMASH
- Epigenomics
- Transcriptomics

Background Correction Normalization

Oligo Arrays SNP genotyping

- Given 500K human SNPs to be measured, select 10

25-mers that over lap each SNP location for

Allele A. - Select another 10 25-mers corresponding to SNP

Allele B. - Problem Cross Hybridization

Using SNP arrays to detect Genomic Aberrations

- Each SNP probeset measures absense/presence of

one of two Alleles. - If a region of DNA is deleted by cancer, one or

both alleles will be missing! - If a region of DNA is duplicated/amplified by

cancer, one or both alleles will be amplified. - Problem Oligo arrays are noisy.

90 humans, 1 SNP (A0.48)

Allele B

Allele A

90 humans, 1 SNP (A0.24)

Allele B

Allele A

90 humans, 1 SNP (A0.96)

Allele B

Allele A

Background Correction Normalization

- Consider a genomic location L and two similar

nucleotide sequences sL,x and sL,y starting at

that location in the two copies of a diploid

genomes - E.g., they may differ in one SNP.
- Let qx and qy be their respective copy numbers in

the whole genome and all copies are selected in

the reduced complexity representation. The gene

chip contains four probes px 2 sL,x py 2 sL,y

px, py 2 G. - After PCR amplification, we have some Kx qx

amount of DNA that is complementary to the probe

px, etc.K' (¼ Kx) amount of DNA that is

additionally approximately complementary to the

probe px.

Normalize using a Generalized RMA

- I U - mn
- a sn2 - fN(0,1)(a/b)/FN(0,1)(a/b)
- (1 b Bsn/FN(0,1)(a/b)-1
- bsn/Bsn )
- (1 FN(0,1)(a/b)/(b Bsn)-1,
- Where a U-mn -a sn2 b sn, and
- bsn å Ii,j U mn fN(0,1)(Ii,j U mn )
- Bsn å fN(0,1)(Ii,j U mn )

Background Correction Normalization

- If the probe has an affinity fx, then the

measured intensity is can be expressed as - Kx qx K fx noise
- qx K/Kx fx noise
- With Expm1 e s1, a multiplicative logNormal

noise, - m2 e s2 an additive Gaussian noise,
- and fx Kx fx an amplified affinity.
- A more general model
- Ix qx K/Kx fx em1e s1 m2 e s2

Mathematical Model

- In particular, we have four values of measured

intensities - Ix qx fx Nxe m1 e s1 m2 e s2
- Ix Nx e m1 e s1 m2 e s2
- Iy qy fy Ny e m1 e s1 m2 e s2
- Iy Ny e m1 e s1 m2 e s2

Bioinformatics Data modeling

- Good news For each 25-bp probe, the fluorescent

signal increases linearly with the amount of

complementary DNA in the sample (up to some limit

where it saturates). - Bad news The linear scaling and offset differ

for each 25-bp probe. Scaling varies by factors

of more than 10x. - Noise Due to PCR cross hybridization and

measurement noise.

Scaling Offset differ

- Scaling varies across probes
- Each 25-bp sequence has different thermodynamic

properties. - Scaling varies across samples
- The scanning laser for different samples may have

different levels. - The starting DNA concentrations may differ PCR

may amplify differently. - Offset varies across probes
- Different levels of Cross Hybridization with the

rest of the Genome. - Offset varies across samples
- Different sample genomes may differ slightly

(sample degradation impurities, etc.)

Linear Model Noise

Noise minimization

Final Data Model

MLE using gradients

Data Outliers

- Our data model fails for few data points (bad

probes) - Soln (1) Improve the model
- Soln (2) Discard the outliers
- Soln (3) Alternate model for the outliers

Weight the data approprately.

Outlier Model

Problem with MLE No unique maxima

Scaling of MLE estimate

Segmentation to reduce noise

- The true copy number (Allele AB) is normally 2

and does not vary across the genome, except at a

few locations (breakpoints). - Segmentation can be used to estimate the location

of breakpoints and then we can average all

estimated copy number values between each pair of

breakpoints to reduce noise.

Allelic Frequencies Cancer Normal

Allelic Frequencies Cancer Normal

Segmentation Break-Point Detection

Algorithmic Approaches

- Local Approach
- Change-point Detection
- (QSum, KS-Test, Permutation Test)
- Global Approach
- HMM models
- Wavelet Decomposition
- Bayesian Empirical Bayes Approach
- Generative Models
- (One- or Multi-level Hierarchical)
- Maximum A Posteriori

HMM

Model with a very high degree of freedom, but not

enough data points. Small Sample statistics a

Overfitting, Convergence to local maxima, etc.

HMM, finally

Model with a very high degree of freedom, but not

enough data points. Small Sample statistics a

Overfitting, Convergence to local maxima, etc.

3

1

2

HMM, last time

- Advantages
- Small Number of parameters. Can be optimized by

MAP estimator. (EM has difficulties). - Easy to model deviation from Markvian properties

(e.g., polymorphisms, power-law, Polyas urn like

process, local properties of chromosomes, etc.)

We will simply model the number of break-points

by a Poisson process, and lengths of the

aberrational segments by an exponential

process. Two parameter model pb pe

¹ 2

1-pe

pe

2

pb

1-pb

Generative Model

Breakpoints, Poisson, pb Segmental Length,

Exponential, pe Copy number, Empirical

Distribution Noise, Gaussian, m, s

Amplification, c4

Amplification, c3

Deletion, c0

Deletion, c1

A reasonable choice of priors yields good

segmentation.

A reasonable choice of priors yields good

segmentation.

A MAP (Maximum A Posteriori) Estimators

- Priors
- Deletion Amplification
- Data
- Priors Noise
- Goal Find the most plausible hypothesis of

regional changes and their associated copy

numbers - Generalizes HMMThe prior depends on two

parameters pe and pb.

- pe is the probability of a particular probe being

normal. - pb is the average number of intervals per unit

length.

Likelihood Function

- The likelihood function for first n probes
- L(h i1, m1, , ik, mk i)
- Exp(-pb n) (pb n)k
- (2 p s2)(-n/2)Õi1n Exp-(vi - mj)2/2s2
- pe(global)(1-pe)(local)
- Where ik n and i belongs to the jth interval.
- Maximum A Posteriori algorithm (implemented as a

Dynamic Programming Solution) optimizes L to get

the best segmentation - L(h i1, m1, , ik, mk i)

Dynamic Programming Algorithm

- Generalizes Viterbi and Extends.
- Uses the optimal parameters for the generative

model - Adds a new interval to the end
- h i1, m1, , ik, mk i h ik1, mk1 i h i1,

m1, , ik, mk, ik1, mk1 i - Incremental computation of the likelihood

function - Log L(h i1, m1, , ik, mk, ik1, mk1 i)
- Log L(h i1, m1, , ik, mki)
- new-res./2s2 Log(pbn) (ik1 ik) Log (2ps2)
- (ik1 ik) Iglobal Log pe Ilocal Log(1

pe)

Prior Selection F criterion

- For each break we have a T2 statistic and the

appropriate tail probability (p value) calculated

from the distribution of the statistic. In this

case, this is an F distribution. - The best (pe,pb) is the one that leads to the

maximum min p-value.

Segmentation Analysis

Comparison of chromosome 13 tumor using 4

different segmentation algorithm

vMAP

DNAcopy

CGH Explorer v.2.43

GLAD

13q13.1

13q31.3

Comparative Analysis BAC Array

Comparative Analysis Nimblegen

Comparative Analysis Affy 10K

Simulated Data

- Array CGH simulations and an ROC analysis
- Using the same scheme as Lai et al.
- Weil R. Lai, Mark D. Johnson, Raju Kucherlapati,

and Peter J. Park (2005), Comparative analysis

of algorithms for identifying amplifications and

deletions in array CGH data, Bioinformatics,

21(19) 3763-3770. - Segmented by Vmap and DNAcopy
- Vmap algorithm was tested at 11 segmentation

Pvalues of 0.1, 5 10-2, 10-2, 10-3, 10-4, ,

10-10. - DNAcopy algorithm was tested at 9 segmentation

alpha values of .9, .5, .1, 10-2, 10-3, 10-4, ,

10-7. - Analysis by Alex Pearlman et al. (2006)

VMAP

DNACopy

(No Transcript)

Prostate Tumor Gains and Losses Genome view of

19K BAC CGH

Log ratio

Segmentation of Multi-BAC Events On Chromosome 13

Proximal breakpoints were identical for T1 and

T3. Distal breakpoints overlapped for T1, T2,

and T3.

Tumor1 Tumor2 Tumor3

Further Improvement

- We employed a hierarchical Bayesian model in

which global false discovery rates can be

calculated using the different levels of the

model. - Noise processes are also estimated using the

appropriate global parameters.

Specific Features of the Model

- We build a model in which, given the region

segmentations, - we assume that the copy numbers Ij region

j, (1 j k) - in that regions are mutually independent
- Gaussian Xi,j N(qj, sj2), (1 i nj)
- random variables with mean qj and variance sj2.
- We further assume that each copy region mean

parameter qj is in one of a small number of

states 2 1,,S with respective probabilities,

p1, , pS of being in state s. qj is in state s

(with probability ps) if it has a Gaussian

distribution with state mean qs and state

variance ts2 . - States serve to characterize regions. The state

means and variances are the hyperparameters of

the model.

ImplementationDynamic Programming

- Given the hyperparameters, we segment regions

using a dynamic programming approach. This

consists in constructing probe regions as

follows - After the (j-1)st region has been constructed
- A) we choose the next two contiguous regions to

the right of those already constructed by

optimizing the corresponding log likelihood,

subject to the condition that the p-value of the

t-statistic distinguishing between these two

(aforementioned) regions is above a given

threshold. - B) Having chosen these (aforementioned) regions,

the probe regions already constructed, contiguous

to them, may also need to be altered.

Segmentation (ROMA,chr3)

SMASH

- Single Molecule Approaches to Sequencing by

Hybridization - Extensions to Optical Mapping

SMASH

- Genomic DNA is carefully extracted from small

number of cells of an organism (e.g., human) in

normal or diseased states. (Fig 1 shows a cancer

cell to be studied for its oncogeneomic

characterization.)

SMASH

- LNA probes of length 6 8 nucleotides are

hybridized to dsDNA (double-stranded genomic DNA)

in a test tube (Fig 2) and the modified DNA is

stretched on a 1 x 1 chip that has microfluidic

channels manufactured on its surface. These

surfaces have been chemically treated to create a

positive charge.

DNA samples are prepared for analysis with LNA

probes and restriction enzymes.

SMASH

- Since DNA is slightly negatively charged, it

adheres to the surface as it flows along these

channels and stretches out. Individual molecules

range in size from 0.3 3 million base pairs in

length. - Next, bright emitters are attached to the probes

on the surface and the molecules are imaged (Fig

3).

SMASH

- A restriction enzyme1 is added to break the DNA

at specific sites. Since DNA molecules are under

slight tension, the cut fragments of DNA relax

like entropic springs, leaving small visible gaps

corresponding to the positions of the restriction

site (Fig 4). - 1. A restriction enzyme is a highly specific

molecular scissor that recognizes short

nucleotide sequences and cuts the DNA at only

those recognition sites.

SMASH

- The DNA is then stained with a fluorogen (Fig 5)

and reimaged. The two images are combined to

create a composite image suggesting the locations

of a specific short word (e.g., probes) within

the context of a pattern of restriction sites.

SMASH

- The intensity of the light emitted by the dye at

one frequency provides a measure of the length of

the DNA fragments. - The intensity of the light emitted by the

bright-emitters on probes provides an intensity

profile for locations of the probes. - Images of each DNA molecule are then converted

into ideograms, where the restriction sites are

represented by a tall rectangle and probe sites

by small circles (Fig 6).

SMASH

- The steps above are repeated for all possible

probe compositions (modulo reverse

complementarity). - Sutta software then uses the data from all such

individual ideograms to create an assembly of the

haplotypic ordered restriction maps with

approximate probe locations superimposed on the

map.

SMASH

- Local clusters of overlapping words are combined

by Suttas PSBH (positional sequencing by

hybridization) algorithm to overlay the inferred

haplotypic sequence on top of the restriction map

(Fig 7).

Gapped Probes

- Mixing solid bases with wild-card bases
- E.g., xxxxxx (10-4-mers) or xxxxxx

(12-6-mers) - An wild-card base
- Universal In terms of its ability to form base

pairs with the other natural DNA/RNA bases. - Applications in primers and in probes for

hybridization - Examples
- The naturally occurring base hypoxanthine, as its

ribo- or 2'-deoxyribonucleoside - 2'-deoxyisoinosine
- 7-deaza-2'-deoxyinosine
- 2-aza-2'-deoxyinosine

Simulation Results

- Probe Map Assumptions
- For single DNA molecules
- Probe location Standard Deviation 240 bases
- Data coverage per probe map 50x
- Probe hybridization rate 30, and
- false positive rate of 10 probes per megabase,

uniformly distributed. - Analytically estimation of the average error rate

in the probe consensus map - Probe location SD 60 bases
- False Positive rate lt 2.4
- False Negative rate lt 2.0.

Simulation Results

UNGAPPED

GAPPED

Simulation Results

- Simulation based on non-random sequences from the

human genome 96 blocks of 1 Kb (from chromosome

1) concatenated together along with its in silico

restriction map. - Error summary for the gapped probe pattern
- xxx xxx
- Error count excluding repeats or near repeats
- 0.32bp / 10Kb
- There is no error due to incorrect

rearrangements. - There is no loss of information at haplotypic

level. - Assembly failed in 2 of 96 blocks of 1kb 2.1

failure rate (out of memory).

GENomic conTIG

- Gentig uses a purely Bayesian Approach.
- It models all the error processes in the prior.
- FAST It initially starts with a conservative but

fast pairwise overlap configuration, computed

efficiently using Geometric Hashing. - ACCURATE It iteratively combines pairs of maps

or map contigs, while optimizing the likelihood

score subject to a constraint imposed by a

false-positive constraint. - It has special heuristics to handle non-local

errors.

HAPTIG HAPlotypic conTIG Candida Albicans

FAST ACCURATE BAYESIAN ALGORITHM

- The left end of chromsome-1 of the common fungus

Candida Albicans (being sequenced by Stanford). - You can clearly see 3 polymorphisms
- (A) Fragment 2 is of size 41.19kb (top) vs

38.73kb (bottom). - (B) The 3rd fragment of size 7.76kb is missing

from the top haplotype. - (C)The large fragment in the middle is of size

61.78kb vs 59.66kb.

Lambda DNA with probes

10 mm

A

Fig. A Four AFM images of lambda DNA with PNA

probes hybridized to the distal recognition site,

located 6,900 bp or 2.28 microns from the end

(green arrow). Non-specifically bound probes

indicated by the red arrows. Z-scale is /- 1.5

nm.

E. coli

Figure 3. Two optical images of E coli K12

genomic DNA after restriction digestion with

6-cutter restriction enzyme Xho 1 and

hybridization with an 8-mer PNA probe. Bound

probes are indicated by blue arrows and

non-specifically bound probes by the red arrows.

Scale bar shown is 10 micron.

Discussions

- QA

Answer to Cancer

- If I know the answer I'll tell you the answer,

and if I don't, I'll just respond, cleverly. - US Secretary of Defense, Mr. Donald Rumsfeld.

To be continued

- Break