The Information Processing Mechanism of DNA and Efficient DNA Storage

1
The Information Processing Mechanism of DNA
and Efficient DNA Storage
  • Olgica Milenkovic
  • University of Colorado, Boulder
  • Joint work with B. Vasic

2
Outline
  • PART I: HOW DOES DNA ENSURE ITS DATA INTEGRITY?
  • Information Theory of Genetics: an emerging
    discipline
  • Error-Correction and Proofreading in genetic
    processes
  • What type of codes operate at the level of
    bio-chemical processes of the Central Dogma?
  • Spin Glasses, Kauffman's NK Model, Regulatory
    Network of Gene Interactions and Low-Density
    Parity-Check (LDPC) Codes
  • Cancer, dysfunctional proofreading and chaos
    theory
  • PART II: HOW DOES ONE STORE DNA? (DNA
    COMPRESSION)
  • Structure of DNA: Statistics and Modeling
  • DNA Compression
  • Genome Compression
  • New Distance Measures and One-Way Communication
  • PART III: NEW CODING PROBLEMS IN GENETICS

3
Information theory of genetics
  • 2003: 50th anniversary of the discovery that DNA
    has a double-helix structure!
  • (Crick, Watson, Franklin, Wilkins 1953)
  • 2003: Completion of the Human Genome Project (98%
    of human DNA sequenced)
  • Every day an average of 15 new sequences added to
    Swiss-Prot/GenBank
  • Vast amount of genetic data just starting to be
    analyzed!
  • DNA is a CODE, but very little is known about its:
  • exact information content
  • nature of redundancy
  • statistical properties
  • secondary structure
  • influence on disease development and control
  • underlying error-correcting mechanism

4
Information Theory of DNA
Helps in understanding:
  • the EVOLUTION of DNA
  • the FUNCTIONALITY of DNA
  • DISEASE DEVELOPMENT
IT community still not involved in this area!
Signal Processing Community is just getting involved:
Special Issue of Signal Processing Journal devoted to
Genetics, 2003.
5
The League of Extraordinary Gentlemen
6
I. How is information stored in a genetic
sequence? What are the atoms of information?
7
The DNA Polymer
[Figure: the SUGAR-PHOSPHATE BACKBONE. Deoxyribose
sugars, with carbons numbered 1 to 5, alternate with
phosphate (PO4) groups.]
8
The Bases
DOUBLE-HELIX
Purine Bases: Adenine (A), Guanine (G)
Pyrimidine Bases: Thymine (T), Cytosine (C)
9
The Pairing Rule
A and T paired through TWO hydrogen bonds
G and C paired through THREE hydrogen bonds
10
[RNA uses uracil] instead of DNA's thymine, i.e. U
replaces T. It is the RNA sequence of codons which
biologists usually refer to as the genetic code (see
Table 4 below).
The Genetic Code
First     Second Letter                                  Third
Letter    U          C          A           G            Letter
U         UUU phe    UCU ser    UAU tyr     UGU cys      U
U         UUC phe    UCC ser    UAC tyr     UGC cys      C
U         UUA leu    UCA ser    UAA stop    UGA stop     A
U         UUG leu    UCG ser    UAG stop    UGG trp      G
C         CUU leu    CCU pro    CAU his     CGU arg      U
C         CUC leu    CCC pro    CAC his     CGC arg      C
C         CUA leu    CCA pro    CAA gln     CGA arg      A
C         CUG leu    CCG pro    CAG gln     CGG arg      G
A         AUU ile    ACU thr    AAU asn     AGU ser      U
A         AUC ile    ACC thr    AAC asn     AGC ser      C
A         AUA ile    ACA thr    AAA lys     AGA arg      A
A         AUG met    ACG thr    AAG lys     AGG arg      G
G         GUU val    GCU ala    GAU asp     GGU gly      U
G         GUC val    GCC ala    GAC asp     GGC gly      C
G         GUA val    GCA ala    GAA glu     GGA gly      A
G         GUG val    GCG ala    GAG glu     GGG gly      G
Abbreviations
ala alanine        arg arginine       asn asparagine     asp aspartic acid   cys cysteine
gln glutamine      glu glutamic acid  gly glycine        his histidine       ile isoleucine
leu leucine        lys lysine         met methionine     phe phenylalanine   pro proline
ser serine         thr threonine      trp tryptophan     tyr tyrosine        val valine
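The table above can be read as a lookup from RNA triplets to amino acids. A minimal Python sketch of that lookup follows; the names `CODON_TABLE`, `transcribe` and `translate` are illustrative assumptions, not part of the presentation, and the sketch assumes the standard genetic code with codons ordered U, C, A, G in each position.

```python
from itertools import product

BASES = "UCAG"  # row/column order used by the table

# Amino acids in table order: first letter, then second, then third,
# each running through U, C, A, G (standard genetic code).
AMINO = (
    "phe phe leu leu ser ser ser ser tyr tyr stop stop cys cys stop trp "
    "leu leu leu leu pro pro pro pro his his gln gln arg arg arg arg "
    "ile ile ile met thr thr thr thr asn asn lys lys ser ser arg arg "
    "val val val val ala ala ala ala asp asp glu glu gly gly gly gly"
).split()

CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Coding-strand DNA to mRNA: U replaces T."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons left to right until a stop codon (or the end)."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "stop":
            break
        protein.append(amino)
    return protein


print(translate(transcribe("ATGTTTTGGTAA")))
```

The redundancy of the code is visible directly: 64 codons map onto 20 amino acids plus the stop signal.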
In summary all life as we know it contains DNA
and its close relative RNA. These polymers
11
Genes, Exons, Introns (Junk DNA)
  • Genes: sequences of base pairs coding for chains
    of amino acids
  • Consist of exons (coding) and introns
    (non-coding) regions
  • Length: anything between several tens and
    several millions of base pairs
  • EXAMPLE: Among the most complex identified genes is
    DYSTROPHIN
  • (2 million bps, more than 60 exons, codes for
    4000 amino acids)
  • Escherichia coli: around 4000 genes; humans:
    35000-40000 genes
  • Junk DNA: disrespectful name for introns
  • Significant fraction of DNA
  • Shown (last year) to be somewhat responsible
    for RNA coding
  • (Far from being junk, but function still
    not well understood)

12
The Central Dogma
DNA -> mRNA -> Proteins
Replication (DNA to DNA), Transcription (DNA to mRNA),
Translation (mRNA to Proteins)
A Communication Theory Perspective
Genetic Channel: DNA sequence -> mRNA, Proteins,
DNA sequence
What kind of errors are introduced by the Genetic
Channel?
13
Processing in the Genetic Channel: DNA REPLICATION
  • DNA within Chromosomes (tight packing)
  • DNA wrapped around HISTONES (proteins)
  • HISTONES are organized in NUCLEOSOMES
  • NUCLEOSOMES form CHROMATIN, folded into
    CHROMOSOMES

  • Untying the knots: Topoisomerases
  • Unwinding the helix: Helicases
  • Getting it all started: Primers
  • Doing the hard work: Polymerases
  • Sealing the segments: Ligases
  • Helping to keep two sides apart: SSB
    (single-strand binding proteins)
14
Replication: more details
Timing for replication:
  • E. coli: 40 min
  • Humans (parallel): < 2 hours
  • Can be prolonged for proofreading purposes
Rules:
  • Replication always proceeds in the 5' to 3'
    direction
  • Replication is semi-conservative
  • Replication is a parallel process for eukaryotes
Facts:
  • Polymerases can stitch together any combination
    of bases (polymerases are a little bit sloppy)
15
Errors
  • Combination of substitution, deletion, insertion
    (replication fork), shift, reversal, etc. errors
  • (Complete exon or intron deleted, or simple base
    pair deletions)
  • 1. Tautomeric shifts (transition/transversion):
    T-G, G-T, C-A, A-C
  • 2. Recombination between non-identical molecules
    (HETERODUPLEX mismatches)
  • 3. Spontaneous DEAMINATION (C to U, C to T, C-G
    to T-A), METHYLATION (CpG), rare
  • 4. APURINIC/APYRIMIDINIC SITES (due to
    HYDROLYSIS)
  • 5. CROSS-LINKS
  • 6. STRAND BREAKAGE, OXIDATIVE DAMAGE ERRORS
  • 7. LOSS OF 5000-10000 PURINE and 200-500
    PYRIMIDINE bases (20 hours) due to radiation
  • Replication errors: polymerases' misinsertion
    probability between 10^-3 and 10^-5

[Figure: examples of polymerase errors on paired
strands: Miscoding, Slippage, Slippage-Dislocation,
Miscoding-Realignment]
16
Bio-chemical mechanism responsible for error
correction?
  • Proofreading (Maroni, Molecular and Genetic
    Analysis of Human Traits)
  • Replication polymerases: error rate [value not
    captured in transcript]; human DNA with [value not
    captured in transcript] bps, total of 10^6 errors
  • Example
  • C to U conversion causes presence of
    deoxyuridine, detected by uracil-DNA GLYCOSYLASE
  • Glycosylase process acts like erasure channel
  • 1. Proofreading based on semi-conservative nature
    of replication
  • 2. Excision Repair Mechanisms: Arrays of
    Exonucleases
  • Show large degree of pre-correction binding
    activity; correction performed by EXCISION
  • Jumping occurs between different genes !!!
    (Lin, Lloyd, Roberts, Nucleases)
  • Reduce error levels by an additional several
    orders of magnitude
  • Mismatch-specific post-replication enzymes
  • Total number of errors per human DNA replication
    on average: JUST ONE
  • Replication and Repair have been optimized for
    balancing spontaneous mutational load
  • Permitting evolution without threatening fitness
    or survival

17
  • Characteristics of DNA ECC
  • Error-correction performed on different levels
  • Error correction performed in very short time
  • Extremely large number of very diverse errors
    corrected
  • Error correction tied to global structure of DNA
  • (not to consecutive base pairs)
  • Error correction also depends on DNA topology

18
Identify ECCs of DNA
  • Error-Correcting Codes in DNA: Forsdyke (1981),
    Wolny (1983), Eigen (1993), Liebovitch et al.
    (1996), Battail (1997), Rosen and Moore (2003),
    Mac Dónaill (2003)
  • Theories:
  • Non-coding regions are in-series error detecting
    sequences!
  • Ordering of coding/non-coding regions responsible
    for error-correction!
  • Complementary base pairing corresponds to
    error-detecting code!
  • Acceptor/Donor: hydrogen atom/lone electrons

1 represents donor, 0 acceptor. Additionally, add
0 or 1 for purine and pyrimidine.
Code: A 1010, G 0110, T 0101, C 1001
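The error-detecting claim for this 4-bit code can be checked mechanically. The sketch below, in Python, assumes the four codewords exactly as listed on the slide and assumes the last bit is the purine/pyrimidine flag (0 for A and G, 1 for T and C, which is consistent with the listed codewords); the helper names are illustrative only.

```python
from itertools import combinations

CODE = {"A": "1010", "G": "0110", "T": "0101", "C": "1001"}
WATSON_CRICK = {"A": "T", "T": "A", "G": "C", "C": "G"}


def parity(bits):
    """Number of ones modulo two."""
    return bits.count("1") % 2


def flip_all(bits):
    """Bitwise complement of a codeword."""
    return "".join("1" if b == "0" else "0" for b in bits)


def hamming(a, b):
    """Number of positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))


# Every codeword has even parity, so any single-bit error is detectable.
print({base: parity(word) for base, word in CODE.items()})

# Paired bases are bitwise complements (donor meets acceptor).
print(all(CODE[WATSON_CRICK[b]] == flip_all(CODE[b]) for b in CODE))

# Minimum distance between distinct codewords.
print(min(hamming(CODE[a], CODE[b]) for a, b in combinations(CODE, 2)))
```

A minimum distance of two means the code detects, but cannot correct, a single flipped bit.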
19
  • BEST ERROR-CORRECTING MECHANISM: Deinococcus
    radiodurans
  • Microbe with extreme radiation resistance
  • Able to survive radiation doses thousands of
    times higher than would kill most organisms,
    including humans
  • Surpasses the cockroach by orders of magnitude!
  • Why? Because of its remarkable DNA-repair
    mechanism!!!
  • D. radiodurans flawlessly regenerates its
    radiation-shattered genome in about 24 hours.

Conan the Bacterium (to conquer the
Red Planet!)

20
Something seemingly unrelated
Spin Glasses, the Ising Model, Hopfield Networks
or Boltzmann Machines
State x of a spin glass with N spins that may take
values in {-1, 1}
Energy of the state x: E; external field: h
The Hamiltonian and the Hamiltonian for the Ising
model [equations not captured in transcript]

Example: Water exists as a gas, liquid or solid,
but all microscopic elements are H2O
molecules. This is due to intermolecular
interactions depending on temperature, pressure,
etc.

[Figure: frustration]
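The slide's own Hamiltonian did not survive transcription. As a stand-in, the Python sketch below assumes the textbook Ising form E(x) = -sum_{i<j} J_ij x_i x_j - h sum_i x_i, with couplings J and external field h; the function names and the three-spin example are assumptions chosen to illustrate frustration, not the presenter's example.

```python
from itertools import product


def ising_energy(x, J, h=0.0):
    """E(x) = -sum_{i<j} J[i][j] x_i x_j - h * sum_i x_i, spins in {-1, +1}."""
    n = len(x)
    coupling = sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    return -coupling - h * sum(x)


def ground_states(J, h=0.0):
    """Exhaustive search over all 2^N spin states (small N only)."""
    states = list(product((-1, 1), repeat=len(J)))
    best = min(ising_energy(s, J, h) for s in states)
    return best, [s for s in states if ising_energy(s, J, h) == best]


# Three spins on a triangle.
ferro = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]            # every pair wants to align
antiferro = [[0, -1, -1], [-1, 0, -1], [-1, -1, 0]]  # every pair wants to differ

print(ground_states(ferro))      # all bonds can be satisfied at once
print(ground_states(antiferro))  # frustration: one bond is always violated
```

With antiferromagnetic couplings on a triangle no state satisfies all three bonds, so the minimum energy is higher and the ground state is highly degenerate.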
21
Something seemingly unrelated
Codes on graphs: the most powerful class of
error-correcting codes in information theory, including
Turbo, Low-Density Parity-Check (LDPC), and
Repeat-Accumulate (RA) codes.
Most important consequence of graphical
description: efficient iterative decoding.
  • Variable nodes communicate to check nodes their
    reliability
  • Check nodes decide which variables are unreliable
    and suppress their inputs
  • Number of edges in graph = density of H
  • Sparse = small complexity
[Figure: bipartite graph of Variables and Checks]
Detrimental for convergence of decoder: presence
of short cycles in the code graph.
Applications of LDPC codes: cryptography,
compression, distributed source coding for sensor
networks, error control coding in optical and
wireless communication and in magnetic and optical
storage.
22
Gallager's Decoding Algorithm A
Works for the Binary Symmetric Channel (BSC).
  • Each variable sends its channel value unless all
    incoming messages from checks say change
  • Each check sends an estimate of the bit based on
    the modulo two sum of the other bits participating
    in the check
Alternative view: Variables = Atoms, Binary
Values = Spins. Variables align or misalign
according to interaction patterns.
LDPC equivalent to diluted spin glasses.
Ground state search for above Hamiltonian =
maximum a posteriori decoding of codeword.
Average magnetization at a site = MAP decision for
individual variable.
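The two message rules above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the presenter's code: the parity checks are taken from the Fano plane (7 bits, 3 checks per bit, no cycles of length four), and the final decision rule (flip a bit only when all of its checks disagree with the received value) is an assumption made for this sketch.

```python
# Hypothetical example code: each tuple lists the bits in one parity check.
FANO_CHECKS = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5),
               (1, 4, 6), (2, 3, 6), (2, 4, 5)]


def syndrome_ok(checks, word):
    """True when every parity check sums to zero modulo two."""
    return all(sum(word[v] for v in members) % 2 == 0 for members in checks)


def gallager_a(checks, received, max_iterations=10):
    n = len(received)
    checks_of = [[c for c, m in enumerate(checks) if v in m] for v in range(n)]
    # Variable-to-check messages start as the received (channel) bits.
    v2c = {(v, c): received[v] for v in range(n) for c in checks_of[v]}
    decision = list(received)
    for _ in range(max_iterations):
        if syndrome_ok(checks, decision):
            break
        # Check to variable: modulo two sum of the other bits in the check.
        c2v = {(c, v): sum(v2c[(u, c)] for u in m if u != v) % 2
               for c, m in enumerate(checks) for v in m}
        # Variable to check: channel bit, unless all other checks say change.
        for v in range(n):
            for c in checks_of[v]:
                others = [c2v[(d, v)] for d in checks_of[v] if d != c]
                flip = bool(others) and all(msg != received[v] for msg in others)
                v2c[(v, c)] = 1 - received[v] if flip else received[v]
        # Tentative decision (assumed rule): flip only on unanimous disagreement.
        for v in range(n):
            incoming = [c2v[(c, v)] for c in checks_of[v]]
            flip = bool(incoming) and all(msg != received[v] for msg in incoming)
            decision[v] = 1 - received[v] if flip else received[v]
    return decision


codeword = [0, 0, 0, 1, 1, 1, 1]   # satisfies all seven checks
noisy = list(codeword)
noisy[3] ^= 1                      # one bit flipped by the channel
print(gallager_a(FANO_CHECKS, noisy))
```

Because any two bits of this example share exactly one check, a single flipped bit hears a unanimous "change" from its three checks, while every correct bit hears at most one wrong message.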
23
Something seemingly unrelated
  • The Regulatory Network of Gene Interactions (RNGI)
  • Kauffman (1960s), NK model: Evolution through
    Changing Interactions between Genes
  • Life exists at the Edge of Chaos!
  • BASED ON SPIN GLASSES!
  • RANDOM BOOLEAN FUNCTION MODEL
  • Evolution carried by genes, not base pairs, and
    the way genes interact!

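State transition table (time T to time T+1):
    T           T+1
G1 G2 G3    G1 G2 G3
0  0  0     0  0  1
0  0  1     0  0  1
0  1  0     1  0  1
0  1  1     0  0  0
1  0  0     1  0  1
1  0  1     0  1  0
1  1  0     0  0  1
1  1  1     0  1  1

Boolean functions of the individual genes:
G1 G2 -> G1    G1 G3 -> G2    G1 G2 -> G3
0  0     0     0  0     0     0  0     1
0  1     1     0  1     0     0  1     1
1  0     0     1  0     0     1  0     0
1  1     0     1  1     1     1  1     1

[Figure: interaction graph on genes G1, G2, G3]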
24
Chaos, Attractors, Connectivity
  • Boolean networks: dynamical systems
  • Characterized by network topology and
    choices of Boolean node functions
  • Attractors: point and periodic
  • Attractors characterized by number and period
    lengths

MOST IMPORTANT topological factor: CONNECTIVITY.
KEY: Sparse connectivity allows enough variability
for evolutionary processes and produces
self-organizing structures, but doesn't allow the
system to get trapped in chaotic behavior.
MOST IMPORTANT Boolean function factors: BIAS (number
of 1 outputs) and CANALIZATION (depends on number of
inputs determining output).
[Figure: kimatograph of the network over the states
111, 000, 011, 001, 101, 110, 010]
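Point and periodic attractors can be found by simply following the state transitions. The Python sketch below uses the eight-row table (time T to T+1) from the three-gene example on the preceding slide as its data; the function names are illustrative assumptions.

```python
# State (G1 G2 G3) at time T -> state at time T+1, from the 3-gene example.
TRANSITIONS = {
    "000": "001", "001": "001", "010": "101", "011": "000",
    "100": "101", "101": "010", "110": "001", "111": "011",
}


def attractor_of(start, transitions):
    """Follow the trajectory from `start` until a state repeats."""
    seen = []
    state = start
    while state not in seen:
        seen.append(state)
        state = transitions[state]
    cycle = seen[seen.index(state):]
    k = cycle.index(min(cycle))        # rotate to a canonical starting state
    return tuple(cycle[k:] + cycle[:k])


def basins(transitions):
    """Map each attractor to the list of states that flow into it."""
    result = {}
    for start in transitions:
        result.setdefault(attractor_of(start, transitions), []).append(start)
    return result


for attractor, basin in basins(TRANSITIONS).items():
    print(attractor, "period", len(attractor), "basin size", len(basin))
```

For this table the search finds one point attractor and one attractor of period two.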

25
The NK model and RNGI
N = number of genes; K = number of genes
co-interacting with one given gene.
K = 2: critical value (mainly frozen states with
islands of changing interaction).
Interaction between genes in the regulatory network
is very limited in scope: K ranges anywhere from 2-3
to 10-15. If we check carefully, K is logarithmic in
N, i.e. in the number of genes:
  • between 2 and 3 for Escherichia coli (around a
    thousand genes)
  • between 4 and 8 for higher metazoa (several
    thousand genes)
Can explain the process of cell differentiation: the
genetic material of each cell is the same, yet cells
are functionally and morphologically very different.
Each cell type CORRESPONDS TO ONE GIVEN ATTRACTOR of
the RNGI.
Counting attractors for networks with N = 40000
genes, K = 2 gives [value not captured in transcript]
cell types (correct number: 258).
26
KEY IDEA: an LDPC Code with a Given Decoding Algorithm
is a BOOLEAN NETWORK, SPIN GLASS, ...
  • Example: LDPC Code under Gallager's A Algorithm

In the Control Graph, edge (i,j) exists if the i-th
bit controls the j-th bit (i.e. if i and j are at
distance exactly two)
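Two bits are at distance exactly two in the code graph when they take part in a common check. A minimal Python sketch of building the control graph from a list of checks follows; the four-bit, two-check example and the function name are hypothetical, chosen only for illustration.

```python
def control_graph(checks, n):
    """Bit u controls bit v when some parity check contains both of them."""
    controls = {v: set() for v in range(n)}
    for members in checks:
        for v in members:
            controls[v].update(u for u in members if u != v)
    return controls


# Hypothetical toy code: check 0 involves bits 0, 1, 2; check 1 involves bits 2, 3.
toy_checks = [(0, 1, 2), (2, 3)]
for bit, controllers in control_graph(toy_checks, 4).items():
    print(bit, sorted(controllers))
```

The number of controllers of a bit plays the role of the connectivity K of the corresponding node in the Boolean network.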
[Figures: LDPC code (variables and checks); LDPC code
(the control graph on G1, G2, G3, G4)]
Boolean function determined by decoding algorithm.
For Gallager's A algorithm, takes form of
truncated/periodically repeated MORSE-THUE sequence.
Morse-Thue sequence:
Index   0  1  2   3   4    5    6    7
Binary  0  1  10  11  100  101  110  111
Value   0  1  1   0   1    0    0    1
Properties: Self-similar (fractal). Results in
unbiased Boolean functions.
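The sequence in the table is the parity of the number of ones in the binary expansion of the index. A short Python sketch, with an illustrative function name, that generates it and checks the two stated properties:

```python
def thue_morse(n):
    """First n terms: parity of the number of 1 bits in each index."""
    return [bin(i).count("1") % 2 for i in range(n)]


t = thue_morse(16)
print(t[:8])

# Self-similarity: even-indexed terms reproduce the sequence itself,
# odd-indexed terms reproduce its complement.
print(t[0::2] == t[:8], t[1::2] == [1 - b for b in t[:8]])

# Unbiased: exactly half of the first 2^k terms are ones.
print(sum(t), len(t))
```

Read as the output column of a truth table, a block of 2^k terms is the parity (modulo two sum) of k inputs, which is why the resulting Boolean function is unbiased.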
27
Use Boolean Network Analysis for LDPC Codes
  • No cycles of length four, code regular: uniform
    choice for Boolean function
  • Cycles of length four: Boolean functions vary,
    many more attractors
  • In no case are the functions canalizing

modulo two sums of variable nodes connected to
controls
Can use mean-field theorems to see when initial
perturbations in the codewords disappear in the
limit: use the Boolean derivative, sensitivity
analysis, iterative Jacobian and Lyapunov
exponent (as in Shmulevich et al.).
Jacobian F is a matrix with the Boolean partial
derivative of f_i with respect to x_j in entry (i,j).
28
Use Boolean Network Analysis for LDPC Codes
Iterated Jacobian and Lyapunov exponent
[equations not captured in transcript]
The influence of variable x_j on the Boolean
function f is defined as the
expectation of the partial derivative with
respect to the distribution of the variables.
Influence carries important information about
frozen states, error susceptibility, etc.:
iterative change of size of stable core.
Control of the chaotic phase in a Boolean
network by means of periodic pulses (with period
T) that freeze a fraction of nodes.
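The influence defined above can be computed by brute force for small functions. The Python sketch below assumes a uniform distribution on the inputs and takes the Boolean partial derivative to be 1 when flipping x_j changes the output; `gallager_f3` is an illustrative encoding of the three-check Gallager A update rule (flip A only when B, C and D all disagree with it), and all names are assumptions.

```python
from itertools import product


def influence(f, n, j):
    """Fraction of the 2^n inputs on which flipping x_j changes f(x)."""
    changed = 0
    for x in product((0, 1), repeat=n):
        y = list(x)
        y[j] ^= 1
        changed += f(x) != f(tuple(y))
    return changed / 2 ** n


def parity3(x):
    """Modulo two sum of three inputs."""
    return (x[0] + x[1] + x[2]) % 2


def and2(x):
    return x[0] & x[1]


def gallager_f3(x):
    """x = (A, B, C, D): keep A unless B, C and D all agree on the other value."""
    a, b, c, d = x
    return b if b == c == d else a


print([influence(parity3, 3, j) for j in range(3)])
print([influence(and2, 2, j) for j in range(2)])
print([influence(gallager_f3, 4, j) for j in range(4)])
```

Parity functions have maximal influence in every variable, which is one way to see why modulo two sums are never canalizing.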
29
LDPC Codes and Gallager's A Decoding Algorithm
Three checks:             Two checks:                  One check:
C1 = (B), C2 = (C),       C1 = (B C), C2 = (D)         C1 = (B C D)
C3 = (D)
A B C D  F3(A)         |  A B C D  C1 C2  F1(A)     |  A B C D  C1  F2(A)
0 0 0 0  0             |  0 0 0 0  0  0   0         |  0 0 0 0  0   0
0 0 0 1  0             |  0 0 0 1  0  1   0         |  0 0 0 1  1   1
0 0 1 0  0             |  0 0 1 0  1  0   0         |  0 0 1 0  1   1
0 0 1 1  0             |  0 0 1 1  1  1   1         |  0 0 1 1  0   0
0 1 0 0  0             |  0 1 0 0  1  0   0         |  0 1 0 0  1   1
0 1 0 1  0             |  0 1 0 1  1  1   1         |  0 1 0 1  0   0
0 1 1 0  0             |  0 1 1 0  0  0   0         |  0 1 1 0  0   0
0 1 1 1  1             |  0 1 1 1  0  1   0         |  0 1 1 1  1   1
1 0 0 0  0             |  1 0 0 0  0  0   0         |  1 0 0 0  0   0
1 0 0 1  1             |  1 0 0 1  0  1   1         |  1 0 0 1  1   1
1 0 1 0  1             |  1 0 1 0  1  0   1         |  1 0 1 0  1   1
1 0 1 1  1             |  1 0 1 1  1  1   1         |  1 0 1 1  0   0
1 1 0 0  1             |  1 1 0 0  1  0   1         |  1 1 0 0  1   1
1 1 0 1  1             |  1 1 0 1  1  1   1         |  1 1 0 1  0   0
1 1 1 0  1             |  1 1 1 0  0  0   0         |  1 1 1 0  0   0
1 1 1 1  1             |  1 1 1 1  0  1   1         |  1 1 1 1  1   1
30
New decoding methods for LDPC and other Block
Codes
  • Work in Progress
  • Decoders that don't operate on the frozen core
  • Decoders that periodically freeze some variables
    to avoid chaotic behavior
  • Iterative decoders that work for asymmetric
    channels and channels with insertion/deletion
    errors

32
Bold Conjecture: The ECC of DNA Replication
operates on multiple levels.
  • Carrier of information is gene, not base pair
  • The Global level involves Genes; Local levels may
    involve exons or base pairs in general
  • The Global Code is an LDPC Code!
  • Wigner observed that the same mathematical concepts
    turn up in entirely unexpected connections in the
    whole of science (no explanation as of yet)
  • LDPC related to statistical physics (spin glasses),
    to neural networks, to self-organizing systems,
    to ...
  • R. Sole and B. Goodwin, Signs of Life: How
    Complexity Pervades Biology
33
The Corresponding LDPC Code

      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 1    1  1  1  0  0  0  0  0  0  0  0  0  0  0  0
 2    1  0  0  0  0  0  0  0  0  0  0  0  1  0  0
 3    0  0  1  1  1  1  0  0  0  0  0  0  0  0  0
 4    0  0  0  0  1  0  0  0  1  0  0  0  0  0  0
 5    0  0  0  0  0  1  1  1  1  0  0  0  0  0  0
 6    0  0  0  0  0  0  1  0  0  0  0  0  0  0  1
 7    0  0  0  1  0  0  0  0  1  0  0  0  0  0  0
 8    0  0  0  0  0  0  0  0  1  1  0  0  0  0  0
 9    0  0  0  0  0  0  0  0  0  1  0  0  1  0  1
10    0  0  0  0  0  0  0  1  0  0  1  1  1  0  0
11    0  0  0  0  0  0  0  0  0  0  0  0  1  1  1
12    0  0  0  0  0  0  0  0  0  0  1  1  0  0  1

Table 1: Example of 15-node regulatory network in terms of gene controls

Gene   Controls       Control (after addition)
1      2,3,13         2,3,13
2      1,3            1,3
3      4,5,6          4,5,6,1,2
4      5,6            5,6,3
5      4,9,6          4,9,6
6      5,9,3,4,7,8    3,4,5,7,8,9
7      15,8,6         15,8,6
8      7,9            7,9
9      4,5,6          4,5,6
10     9,13,15        9,13,15
11     8,12,13        8,12,13
12     11,13,15       8,11,13,15
13     8,14,15        8,14,15,11,12
14     X              X
15     11,12          11,12,13,14,10,7

15-gene interaction example by Hashimoto
(Shmulevich, Anderson Cancer Center).
Need q-ary LDPC code corresponding to different
levels of interaction.
34
  • Cancer: genetic disorder of somatic cells
  • Human cancer: INDUCED and SPONTANEOUS
  • Accumulation of mutant (erroneous) genes that
    control cell cycle, maintain genomic stability,
    and mediate apoptosis
  • Causes of mutation: depurination and
    depyrimidination of DNA; proofreading and mismatch
    errors during DNA replication
  • Deamination of 5-methylcytosine to produce C to T
    base pair substitutions, and damage to DNA and
    its replication imposed by products of metabolism
    (notably oxidative damage caused by oxygen free
    radicals)
  • Defective DNA excision-repair; low levels of
    antioxidants, antioxidant enzymes, and
    nucleophiles that trap DNA-reactive
    electrophiles, and of enzymes that conjugate
    nucleophiles with DNA-damaging electrophiles

35
Cancer Research
  • To summarize: Various forms of cancer are tightly
    linked to malfunctioning of the proofreading (ECC)
    mechanism
  • Cancer cells correspond to a special type of
    attractor of the RNGI
  • (A cancer cell is just another
    configuration of RNGI)
  • (Shmulevich et al., Anderson Cancer
    Research Center)
  • This attractor has genes interacting in a
    way that results in uncontrolled cell
    division
  • Key observation: change in RNGI results in
    further weakening of the proofreading system, and
    vice versa

36
Example 1: Cancer cells cheat the proofreading
mechanism regulating reduction in length of
telomeres.
Aging: during each cell division, telomeres get
shorter and shorter. When they become too short,
errors in replication happen, leading to cancer
(a time bomb in our body). Cancer cells
cheat the proofreading mechanism and allow
telomeres to maintain constant length.
Finding the error-control mechanism: classifying
diseases accurately, curing diseases (including
cancer) by gene therapy, making telomere lengths
constant over a long time.
37
Example 2: Breast cancer gene BRCA1 (a tumor
suppressor) is tightly linked to error control of DNA
and cell division regulation.
38
How to obtain results practically? DNA
Microarrays!
[Figure taken from Shmulevich et al.]
39
II. How can one efficiently store
DNA sequences?

40
DNA Storage: Compression
  • GenBank/Swiss-Prot: storage of large number of
    DNA and protein sequences (17471 million
    sequences in GenBank, 2002)
  • Every day, an average of 15 new sequences added
    to database
  • DNA compression absolutely necessary to maintain
    banks
  • Fractal DNA structure to be exploited
  • Possible use of Tsallis entropy
  • Need novel compression algorithms
  • DNA sequences of related species differ in a very
    small percentage of base pairs: need
    cross-reference compression
  • Need meaningful definition of DNA distance
    (major paradigm shift from base-pair distance to
    chromosomal distance)

41
Statistical properties of DNA sequences
Bases within the human mitochondrion (length
approximately 17000) appear with the following
frequencies:
         A     T     G     C
         0.31  0.13  0.25  0.31
while within different regions of the human fetal
globin gene:
         A     T     G     C
Introns  0.27  0.29  0.27  0.17
Exons    0.24  0.22  0.28  0.25
Parts of genetic sequences can be modeled by
Markov chains of given order and transition
probabilities (order 2-7).
Regions of uniform distribution (isochores) can
stretch in length up to hundreds of kbp.
Repetitive patterns: tandem repeats (TR), random
repeats (RR), short interspersed repeat
sequences (SINEs, 9% of DNA), long interspersed
repeat sequences (LINEs).
BPs like CG have very small probability; the most
notorious triplet repeats, related to
Huntington's disease and Fragile-X mental
retardation, consist of these very unlikely CG
pairs: (CGG)m, (CCG)m, m = number of
repetitions.
Junk DNA seems to have long-range (fractal)
characteristics.
42
A fractal pattern arises from the so-called DNA
walk: a graphical representation of the DNA
sequence in which one moves up for C or T and
down for A or G. Can have two- or
three-dimensional random walks for further
differentiation of A, G, C, T.
[Figure: walk directions for C, A, T, G]
Fractal dimension of the DNA molecule: 0.85 for
higher species, 1 for lower. Use linguistic analysis
of human languages for exploring DNA "language"
(Zipf method).
http://library.thinkquest.org/26242/full/ap/ap13.html
DNAWalker: http://athena.bioc.uvic.ca/pbr/walk/
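The one-dimensional DNA walk described on this slide (up for C or T, down for A or G) is easy to sketch in Python. The function name and the short example sequence are illustrative assumptions; estimating a fractal dimension from the walk is not attempted here.

```python
def dna_walk(sequence):
    """Cumulative height: +1 for a pyrimidine (C, T), -1 for a purine (A, G)."""
    height = 0
    path = []
    for base in sequence.upper():
        if base in "CT":
            height += 1
        elif base in "AG":
            height -= 1
        else:
            raise ValueError(f"unexpected symbol: {base!r}")
        path.append(height)
    return path


print(dna_walk("CTAGGATC"))
```

Plotting the returned heights against position gives the graphical representation referred to above.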
43
DNA and Cantor Sets
Provata and Almirantis, 2003: Fractal Cantor
pattern in DNA
  • Exons: filled regions; Introns: empty regions
  • Random, fractal, Cantor-like set
  • Implication: atom (carrier of information) is the
    exon/intron pair
  • History-based random walk and DNA description in
    terms of urn models
  • Only introns in higher species have higher
    complexity than in lower species
  • Both coding and non-coding regions exhibit
    long-range correlation, with spectral density of
    introns [expression not captured in transcript]
44
Known algorithms
  • GenCompress (Chen, 97)
  • Biocompress (Grumbach/Tahi, 94)
  • Fact (Rivals, 00)
  • GenomeSequenceCompress (Sato et al., 00)
Use characteristics of DNA like repeats, reverse
complements. Compression rate is about 1.74 bits
per base (78% in compression ratio).
Two classes: statistical and grammar-based
compression algorithms:
Huffman, Lempel-Ziv, Arithmetic Coding,
Burrows-Wheeler, Kieffer's Grammar-Based
Schemes (with DNA-specific modifications).
No known algorithm specially suited for the fractal
nature of DNA, although 90% fractal!
FILE: Human Growth Hormone (HUMGHCSA)
Column headings and values, in the order transcribed:
COMPRESSION RATE (ACHIEVABLE)   2.00
GZIP                            2.065
ARITHM.                         2.052
VPS2A                           1.607
UNIX COMPRESS                   2.19
BIO-COMPRESS                    1.31
BWT                             1.608
GTAC                            1.1
45
Different Entropy Measures
  • Shannon Entropy
  • Renyi Entropy
  • Tsallis Entropy
  • TE (Tsallis entropy) is non-additive for two
    independent probabilistic systems A, B
  • Hausdorff Dimension
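The formulas for these measures were not captured in the transcript. The Python sketch below assumes the standard definitions (Shannon entropy in bits, Renyi entropy of order alpha in bits, Tsallis entropy of index q with unit constant) and applies them to the mitochondrial base frequencies quoted earlier in the talk. It also checks the usual statement of Tsallis non-additivity for independent systems, S_q(A,B) = S_q(A) + S_q(B) + (1-q) S_q(A) S_q(B); all function names are illustrative.

```python
import math


def shannon(p):
    """H(p) = -sum p_i log2 p_i, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def renyi(p, alpha):
    """H_alpha(p) = log2(sum p_i^alpha) / (1 - alpha), alpha != 1."""
    return math.log2(sum(x ** alpha for x in p)) / (1 - alpha)


def tsallis(p, q):
    """S_q(p) = (1 - sum p_i^q) / (q - 1), q != 1."""
    return (1 - sum(x ** q for x in p)) / (q - 1)


def joint(pa, pb):
    """Joint distribution of two independent systems."""
    return [a * b for a in pa for b in pb]


mito = [0.31, 0.13, 0.25, 0.31]   # A, T, G, C frequencies as quoted on the slide
uniform = [0.25] * 4

print(shannon(uniform), shannon(mito))   # 2 bits per base is the uniform limit
print(renyi(mito, 2.0), tsallis(mito, 2.0))

q = 2.0
lhs = tsallis(joint(mito, uniform), q)
rhs = (tsallis(mito, q) + tsallis(uniform, q)
       + (1 - q) * tsallis(mito, q) * tsallis(uniform, q))
print(abs(lhs - rhs) < 1e-12)
```

The gap between the Shannon entropy of the skewed frequencies and 2 bits per base is the saving available to a purely zeroth-order statistical compressor.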

46
Approach: Use Fractal Grammars
  • Inference of context-free grammars from fractal
    data sets
  • Syntactic generation of fractals
  • Theory of formal languages can be used to state the
    problem of "syntactic fractal pattern recognition"
  • Explore connections with wavelets (ideas by
    Jacques Blanc-Talon)
  • Example: Heighway dragon and Koch curve
  • Barthel, Brandau, Hermesmeier, Heising: Fractal
    Prediction, 1997
  • Zerotree wavelet coding using fractal prediction

47
How does one compress sets of related DNA
Sequences?
  • Distributed Source Coding Problem: Peculiar
    Correlation Patterns
  • Could explore Wavelet Based Compression
  • Distributed Source Coding with LDPC Codes

48
Genomic Distance and One-Way Communication
  • Major paradigm shift in genetic distance measure
  • From base-pair distance (involving deletion,
    insertion and substitution; Sankoff and Kruskal,
    Time Warps, String Edits, and Macromolecules) to
    Chromosomal Distance based on global arrangements
    of genes
  • Inversions are the primary mechanism of genome
    rearrangement!
  • REVERSAL DISTANCE
  • The smallest number of inversions necessary to
    transform one genome into another
  • Finding the minimum number of reversals needed to
    sort a permutation
  • Permutations are signed, indicating direction of
    transcription
  • Example: (1 3 2) -> (1 -2 -3) -> (1 2 -3) -> (1 2 3)
  • How does one perform one-way communication
    (SENDING INFORMATION TO A RECEIVER WHO POSSESSES
    CORRELATED INFORMATION) under the reversal
    distance measure?
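For very short genomes the reversal distance can be found by exhaustive search. The Python sketch below is a brute-force breadth-first search over signed permutations, assuming that a reversal both reverses a contiguous segment and flips the signs inside it; it is not one of the polynomial-time algorithms from the literature, and the function names are illustrative.

```python
from collections import deque


def reversals(perm):
    """All signed permutations reachable from `perm` by one reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i, n):
            segment = tuple(-g for g in reversed(perm[i:j + 1]))
            yield perm[:i] + segment + perm[j + 1:]


def reversal_distance(source, target=None):
    """Smallest number of reversals turning `source` into `target`
    (default target: the identity permutation with all signs positive)."""
    source = tuple(source)
    if target is None:
        target = tuple(sorted(abs(g) for g in source))
    else:
        target = tuple(target)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        perm, steps = queue.popleft()
        if perm == target:
            return steps
        for nxt in reversals(perm):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    raise ValueError("target is not a signed rearrangement of source")


print(reversal_distance((1, 3, 2)))
```

The search space has 2^n n! states, so this approach is only practical for a handful of genes; it confirms that the three-step example on the slide is optimal.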

49
The other way around: DNA compression methods
increase network efficiency by up to 10
times. Peribit's SR-50 compressor:
  • Uses molecular sequence reduction (MSR)
    algorithms similar to those used to match
    patterns in the study of DNA.
  • The algorithms identify and eliminate repetitions
    previously undetected in network traffic in wide
    area networks (WANs) to give compression ratios
    of between 1.2:1 for voice and video and 5:1 for
    SQL traffic.

50
III. Additional Coding Problems in Genetics
51
  • DNA Computing: Codes with Constant GC Content and
    invariant under Watson-Crick Inversion
  • Microarray Error Control Coding: Using design
    theory to reduce error rate of DNA array data
  • Use novel clustering algorithms for DNA Array Data
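The two constraints named for DNA computing codes can be checked with a few lines of Python. The sketch assumes that "Watson-Crick inversion" means taking the reverse complement of a word; the function names and the example word sets are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Complement every base, then read the word backwards."""
    return word.translate(COMPLEMENT)[::-1]


def gc_content(word):
    """Number of G and C symbols in the word."""
    return sum(base in "GC" for base in word)


def has_constant_gc(words):
    return len({gc_content(w) for w in words}) <= 1


def closed_under_reverse_complement(words):
    pool = set(words)
    return all(reverse_complement(w) in pool for w in pool)


good = ["AACG", "CGTT"]   # hypothetical codebook
bad = ["AACG", "ACGG"]

print(has_constant_gc(good), closed_under_reverse_complement(good))
print(has_constant_gc(bad), closed_under_reverse_complement(bad))
```

Constant GC content keeps the melting temperatures of the codewords comparable, which is the usual motivation for this constraint in DNA computing.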
52
Conclusion
  • Genetics is the most exciting source of new ideas
    for coding theory
  • The atom of information is a gene, not a base
    pair or a triple of base pairs
  • The error control code of the genome is to be
    found operating on the level of genes
  • Compression, phylogenetic tree construction, and
    comparison of species have to be performed on the
    level of genes first
  • Once the genes are compared, one can move to local
    base pair comparisons