# MATH 1020: Mathematics For Non-science Chapter 3.1: Information in a networked age - PowerPoint PPT Presentation

## MATH 1020: Mathematics For Non-science Chapter 3.1: Information in a networked age

Instructor: Dr. Ken Tsang, Room E409-R9, Email: kentsang_at_uic.edu.hk

Slides: 103
Transcript and Presenter's Notes


1
MATH 1020: Mathematics For Non-science
Chapter 3.1: Information in a networked age
Instructor: Dr. Ken Tsang, Room E409-R9, Email: kentsang_at_uic.edu.hk

2
Transmitting Information
• Binary codes
• Encoding with parity-check sums
• Data compression
• Cryptography
• Model the genetic code

3
The Challenges
• Mathematical Challenges in the Digital Revolution
• How to correct errors in data transmission
• How to electronically send and store information
economically
• How to ensure security of transmitted data
• How to improve Web search efficiency

4
Binary Codes
• A binary code is a system for encoding data made
up of 0s and 1s
• Examples
• Postnet (tall = 1, short = 0)
• UPC (universal product code; dark = 1, light = 0)
• Morse code (dash = 1, dot = 0)
• Braille (raised bump = 1, flat surface = 0)
• Yi-jing (yin = 0, yang = 1)

5
Binary Codes are Everywhere
• CD, MP3, and DVD players, digital TV, cell
phones, the Internet, GPS system, etc. all
represent data as strings of 0s and 1s rather
than digits 0-9 and letters A-Z
• Whenever information needs to be digitally
transmitted from one location to another, a
binary code is used

6
Transmission Problems
• What are some problems that can occur when data
is transmitted from one place to another?
• The two main problems are
• transmission errors: the message sent is not the
same as the message received
• security: someone other than the intended
recipient receives the message

7
Transmission Error Example
• Suppose you were looking at a newspaper ad for a
job, and you see the sentence "must have bive
years experience"
• We detect the error since we know that "bive" is
not a word
• Can we correct the error?
• Why is "five" a more likely correction than
"three"?
• Why is "five" a more likely correction than
"nine"?

8
Another Example
• Suppose NASA is directing one of the Mars rovers
by telling it which crater to investigate
• There are 16 possible signals that NASA could
send, and each signal represents a different
command
• NASA uses a 4-digit binary code to represent this
information

0000 0100 1000 1100
0001 0101 1001 1101
0010 0110 1010 1110
0011 0111 1011 1111
9
Lost in Transmission
• The problem with this method is that if there is
a single digit error, there is no way that the
rover could detect or correct the error
• If the message sent was 0100 but the rover
receives 1100, the rover will never know a
mistake has occurred
• This kind of error, called "noise", occurs all
the time

10
BASIC IDEA
• The details of techniques used to protect
information against noise in practice are
sometimes rather complicated, but basic
principles are easily understood.
• The key idea is that in order to protect a
message against a noise, we should encode the
message by adding some redundant information to
the message.
• In such a case, even if the message is corrupted
by a noise, there will be enough redundancy in
the encoded message to recover, or to decode the
message completely.

11
• To decrease the effects of noise, we add
redundancy to our messages.
• First method: repeat each digit multiple times,
for example five times (0 is sent as 00000 and 1
as 11111).
• The computer is then programmed to take any
five-digit message received and decode the result
by majority rule.

12
Majority Rule
• So, if we sent 00000, and the computer receives
any of the following, it will still be decoded as
0:
00000 11000
10000 10100
01000 10010
00010 10001
• Notice that for the computer to decode
incorrectly, at least three errors must be made.
13
Independent Errors
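• Using the five-time repeats, and assuming the
errors happen independently and each digit is
more likely to arrive correctly than not, three
or more errors are less likely than two or fewer.
• Decoding a received message as the code word
most likely to have been sent is called maximum
likelihood decoding.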
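To put a number on "less likely", here is a small sketch that computes the chance that majority-rule decoding fails. It assumes each transmitted digit is flipped independently with the same probability `p`; that parameter and the function name are assumptions for illustration and do not appear in the slides.

```python
from math import comb


def decoding_error_probability(p: float, n: int = 5) -> float:
    """Probability that more than half of n independent copies are flipped."""
    # Sum the binomial probabilities of k = 3, 4, 5 errors when n = 5.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )
```

With a per-digit error rate of 10%, the five-fold repetition code is decoded wrongly less than 1% of the time, which is the sense in which the redundancy protects the message.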

14
Why don't we use this?
• Repetition codes have the advantage of
simplicity, both for encoding and decoding
• But they are too inefficient!
• In a five-fold repetition code, 80% of all
transmitted information is redundant.
• Can we do better?
• Yes!

15
More Redundancy
• Another way to try to avoid errors is to send the
same message twice
• This would allow the rover to detect the error,
but not correct it (since it has no way of
knowing if the error occurs in the first copy of
the message or the second)

16
• Parity-check sums: sums of digits whose parities
determine the check digits.
• Even parity: even integers are said to have even
parity.
• Odd parity: odd integers are said to have odd
parity.
• Decoding: the process of translating received
data into code words.
• Example: say the parity-check sums detect an
error.
• The received message is compared to each of the
possible correct messages. The comparison uses
the distance between two strings of equal length,
that is, the number of positions in which the
strings differ.
• The one that differs in the fewest positions is
chosen to replace the message in error.
• In other words, the computer is programmed to
automatically correct the error or choose the
closest permissible message.

17
Error Correction
• Over the past 40 years, mathematicians and
engineers have developed sophisticated schemes to
build redundancy into binary strings to correct
errors in transmission!
• One example can be illustrated with Venn diagrams!

Claude Shannon (1916-2001) Father of Information
Theory
18
Computing the Check Digits
• The original message is four digits long
• We will call these digits I, II, III, and IV
• We will add three new digits, V, VI, and VII
• Draw three intersecting circles as shown here
• Digits V, VI, and VII should be chosen so that
each circle contains an even number of ones

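[Figure: Venn diagram of three intersecting circles; the seven regions are labeled I, II, III, IV (message digits) and V, VI, VII (check digits)]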
19
A Hamming (7,4) code
• An (n,k) Hamming code encodes a message k digits
long into a code word n digits long.
• The 16 possible messages:
• 0000 1010 0011 1111
• 0001 1100 1110
• 0010 1001 1101
• 0100 0110 1011
• 1000 0101 0111

20
Binary Linear Codes
• The error correcting scheme we just saw is a
special case of a Hamming code.
• These codes were first proposed in 1948 by
Richard Hamming (1915-1998), a mathematician
working at Bell Laboratories.
• Hamming was frustrated with losing a week's worth
of work due to an error that a computer could
detect, but not correct.

21
Appending Digits to the Message
• The message we want to send is 0100
• Digit V should be 1 so that the first circle has
two ones
• Digit VI should be 0 so that the second circle
has zero ones (zero is even!)
• Digit VII should be 1 so that the last circle has
two ones
• Our message is now 0100101

[Figure: the Venn diagram filled in with the digits of 0100101]
24
Encoding those messages
• Message → code word
• 0000 → 0000000    0110 → 0110010
• 0001 → 0001011    0101 → 0101110
• 0010 → 0010111    0011 → 0011100
• 0100 → 0100101    1110 → 1110100
• 1000 → 1000110    1101 → 1101000
• 1010 → 1010001    1011 → 1011010
• 1100 → 1100011    0111 → 0111001
• 1001 → 1001101    1111 → 1111111

26
Detecting and Correcting Errors
• Now watch what happens when there is a
single-digit error
• We transmit the message 0100101 and the rover
receives 0101101
• The rover can tell that the second and third
circles have odd numbers of ones, but the first
circle is correct
• So the error must be in the digit that is in the
second and third circles, but not the first:
that's digit IV
• Since we know digit IV is wrong, there is only
one way to fix it: change it from 1 to 0

[Figure: the Venn diagram filled in with the digits of the received message; digit IV is 1 instead of 0]
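The circle argument can be mechanized. The sketch below assumes the circle memberships that are consistent with the worked example 0100 → 0100101: circle 1 holds digits I, II, III, V; circle 2 holds I, III, IV, VI; circle 3 holds II, III, IV, VII. The names `CIRCLES` and `correct_single_error` are illustrative only.

```python
# Positions are 0-based: index 0 is digit I, index 6 is digit VII.
CIRCLES = [
    (0, 1, 2, 4),  # circle 1: digits I, II, III, V
    (0, 2, 3, 5),  # circle 2: digits I, III, IV, VI
    (1, 2, 3, 6),  # circle 3: digits II, III, IV, VII
]


def correct_single_error(received: str) -> str:
    """Fix at most one wrong digit in a 7-digit word using circle parities."""
    bits = [int(b) for b in received]
    # A circle is "odd" when it contains an odd number of ones.
    odd = [sum(bits[i] for i in circle) % 2 == 1 for circle in CIRCLES]
    if not any(odd):
        return received  # every circle is even: accept the word as is
    for position in range(7):
        # The wrong digit lies in exactly the odd circles and no others.
        if [position in circle for circle in CIRCLES] == odd:
            bits[position] ^= 1
            break
    return "".join(str(b) for b in bits)
```

Each of the seven digits sits in a different combination of circles, which is why a single error can always be located. If two or more digits are wrong, this procedure still returns a code word, but not necessarily the one that was sent.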
28
Try It!
• Encode the message 1110 using this method
• You have received the message 0011101. Find and
correct the error in this message.

29
Extending This Idea
• This method only allows us to encode 4-bit
messages (16 possible messages), which isn't even
enough to represent the alphabet!
• However, if we use more digits, we won't be able
to use the circle method to detect and correct
errors
• We'll have to come up with a different method
that allows for more digits

30
Parity Check Sums
• The circle method is a specific example of a
parity check sum
• The parity of a number is 1 if the number is
odd and 0 if the number is even
• For example, digit V is 0 if I + II + III is
even, and 1 if I + II + III is odd

31
Conventional Notation
• Instead of using Roman numerals, we'll use a1 to
represent the first digit of the message, a2 to
represent the second digit, and so on
• We'll use c1 to represent the first check digit,
c2 to represent the second, etc.

32
Old Rules in the New Notation
• Using this notation, our rules for our check
digits become
• c1 = 0 if a1 + a2 + a3 is even
• c1 = 1 if a1 + a2 + a3 is odd
• c2 = 0 if a1 + a3 + a4 is even
• c2 = 1 if a1 + a3 + a4 is odd
• c3 = 0 if a2 + a3 + a4 is even
• c3 = 1 if a2 + a3 + a4 is odd

[Figure: the Venn diagram relabeled with a1, a2, a3, a4 and c1, c2, c3]
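The three parity rules translate directly into code. The following sketch (function and variable names are illustrative) appends c1, c2, c3 to a 4-digit message and rebuilds the table of 16 code words.

```python
from itertools import product


def hamming74_encode(message: str) -> str:
    """Append the check digits c1, c2, c3 to a 4-digit binary message."""
    a1, a2, a3, a4 = (int(b) for b in message)
    c1 = (a1 + a2 + a3) % 2  # parity of a1 + a2 + a3
    c2 = (a1 + a3 + a4) % 2  # parity of a1 + a3 + a4
    c3 = (a2 + a3 + a4) % 2  # parity of a2 + a3 + a4
    return message + f"{c1}{c2}{c3}"


# All 16 messages mapped to their 7-digit code words.
CODEWORDS = {
    "".join(m): hamming74_encode("".join(m)) for m in product("01", repeat=4)
}
```

For instance, `hamming74_encode("0100")` gives 0100101, matching the worked example with the circles.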
33
An Alternative System
• If we want to have a system that has enough code
words for the entire alphabet, we need to have 5
message digits a1, a2, a3, a4, a5
• We will also need more check digits to help us
decode our message c1, c2, c3, c4

34
Rules for the New System
• We can't use the circles to determine the check
digits for our new system, so we use the parity
notation from before
• c1 is the parity of a1 + a2 + a3 + a4
• c2 is the parity of a2 + a3 + a4 + a5
• c3 is the parity of a1 + a2 + a4 + a5
• c4 is the parity of a1 + a2 + a3 + a5

35
Making the Code
• Using 5 digits in our message gives us 32
possible messages; we'll use the first 26 to
represent letters of the alphabet
• On the next slide you'll see the code itself,
each letter together with the 9-digit code
representing it

36
The Code
Letter Code Letter Code
A 000000000 N 011010101
B 000010111 O 011101100
C 000101110 P 011111011
D 000111001 Q 100001011
E 001001101 R 100011100
F 001011010 S 100100101
G 001100011 T 100110010
H 001110100 U 101000110
I 010001111 V 101010001
J 010011000 W 101101000
K 010100001 X 101111111
L 010110110 Y 110000100
M 011000010 Z 110010011
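The table can be regenerated from the four parity rules. This sketch assumes, as the table suggests, that the letters are assigned to the 5-digit messages in counting order (A = 00000, B = 00001, and so on); the names are illustrative.

```python
from string import ascii_uppercase


def encode_letter_bits(message: str) -> str:
    """Append the check digits c1..c4 to a 5-digit binary message."""
    a1, a2, a3, a4, a5 = (int(b) for b in message)
    c1 = (a1 + a2 + a3 + a4) % 2
    c2 = (a2 + a3 + a4 + a5) % 2
    c3 = (a1 + a2 + a4 + a5) % 2
    c4 = (a1 + a2 + a3 + a5) % 2
    return message + f"{c1}{c2}{c3}{c4}"


# Letter -> 9-digit code word, with A = 00000, B = 00001, ..., Z = 11001.
ALPHABET_CODE = {
    letter: encode_letter_bits(format(index, "05b"))
    for index, letter in enumerate(ascii_uppercase)
}
```

Looking up `ALPHABET_CODE["K"]` gives 010100001, the same entry as in the table.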
37
Using the Code
• Now that we have our code, using it is simple
• When we receive a message, we simply look it up
on the table
• But what happens when the message we receive
isn't on the list?
• Then we know an error has occurred, but how do we
fix it? We can't use the circle method anymore

38
Beyond Circles
• Using this new system, how do we decode messages?
• Simply compare the (incorrect) message with the
list of possible correct messages and pick the
closest one
• What should "closest" mean?
• The distance between the two messages is the
number of digits in which they differ

39
The Distance Between Messages
• What is the distance between 1100101 and 1010101?
• The messages differ in the 2nd and 3rd digits, so
the distance is 2
• What is the distance between 1110010 and 0001100?
• The messages differ in all but the 7th digit, so
the distance is 6

40
Hamming Distance
• Def.: The Hamming distance between two vectors of
a vector space is the number of components in
which they differ, denoted d(u,v).

41
Hamming Distance
• Ex. 1: The Hamming distance between
• v = 1 0 1 1 0 1 0
• u = 0 1 1 1 1 0 0
• is d(u, v) = 4
• Notice d(u,v) = d(v,u)
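As a quick check of the definition, the Hamming distance can be computed by comparing the two strings position by position. The function name is an illustrative choice.

```python
def hamming_distance(u: str, v: str) -> int:
    """Number of positions in which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(u, v))
```

Applied to the examples in the slides, `hamming_distance("1100101", "1010101")` is 2 and `hamming_distance("1011010", "0111100")` is 4. Swapping the arguments gives the same result, since the comparison is symmetric.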

43
Hamming weight of a Vector
• Def.: The Hamming weight of a vector is the number
of nonzero components of the vector, denoted
wt(u).

44
Hamming weight of a code
• Def.: The Hamming weight of a linear code is the
minimum weight of any nonzero vector in the code.

45
Hamming Weight
• The Hamming weights of
• v = 1 0 1 1 0 1 0
• u = 0 1 1 1 1 0 0
• w = 0 1 0 0 1 0 1
• are
• wt(v) = 4
• wt(u) = 4
• wt(w) = 3
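Both notions of weight are easy to compute. The sketch below counts nonzero components for a single vector and takes the minimum over the nonzero code words for a code; the list `HAMMING_CODEWORDS` simply copies the 16 code words from the encoding table earlier in these slides, and the function names are illustrative.

```python
def hamming_weight(u: str) -> int:
    """Number of nonzero components of a vector written as a digit string."""
    return sum(ch != "0" for ch in u)


def code_weight(codewords) -> int:
    """Minimum weight over the nonzero code words of a code."""
    return min(hamming_weight(w) for w in codewords if hamming_weight(w) > 0)


# The 16 code words of the (7,4) code from the encoding table.
HAMMING_CODEWORDS = [
    "0000000", "0001011", "0010111", "0100101",
    "1000110", "1010001", "1100011", "1001101",
    "0110010", "0101110", "0011100", "1110100",
    "1101000", "1011010", "0111001", "1111111",
]
```

For this list `code_weight` returns 3: no nonzero code word has fewer than three ones.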

46
Nearest-Neighbor Decoding
• The nearest neighbor decoding method decodes a
received message as the code word that agrees
with the message in the most positions

47
Trying it Out
• Suppose that, using our alphabet code, we receive
the message 010100011
• We can check and see that this message is not on
our list
• How far away is it from the messages on our list?

48
Distances From 010100011
Code Distance Code Distance
000000000 4 011010101 5
000010111 4 011101100 5
000101110 4 011111011 3
000111001 4 100001011 4
001001101 6 100011100 8
001011010 6 100100101 4
001100011 2 100110010 4
001110100 6 101000110 6
010001111 3 101010001 6
010011000 5 101101000 6
010100001 1 101111111 6
010110110 3 110000100 5
011000010 3 110010011 3
49
Fixing the Error
• Since 010100001 was closest to the message that
we received, we know that this is the most likely
actual transmission
• We can look this corrected message up in our
table and see that the transmitted message was
(probably) K
• This might still be incorrect, but other errors
can be corrected using context clues or check
digits
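Nearest-neighbor decoding for the alphabet code can be sketched as follows. To keep the example self-contained, the code table is rebuilt from the parity rules, assuming letters are assigned to 5-digit messages in counting order as in the table; all names here are illustrative.

```python
from string import ascii_uppercase

# Message positions (0-based) summed by the check digits c1, c2, c3, c4.
CHECKS = [(0, 1, 2, 3), (1, 2, 3, 4), (0, 1, 3, 4), (0, 1, 2, 4)]


def encode9(message: str) -> str:
    """Append four parity check digits to a 5-digit message."""
    a = [int(b) for b in message]
    return message + "".join(str(sum(a[i] for i in idx) % 2) for idx in CHECKS)


# Code word -> letter, with A = 00000, B = 00001, ..., Z = 11001.
CODE = {encode9(format(i, "05b")): letter for i, letter in enumerate(ascii_uppercase)}


def distance(u: str, v: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(u, v))


def nearest_neighbor_decode(received: str):
    """Return the closest code word and its letter (first one found on ties)."""
    best = min(CODE, key=lambda word: distance(word, received))
    return best, CODE[best]
```

For the received message 010100011 this returns the code word 010100001 and the letter K, in agreement with the distance table. When two code words are equally close, `min` keeps whichever comes first, so a tie is resolved arbitrarily rather than flagged.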

50
Distances From 1010 110
• The distances between message 1010 110 and all
possible code words

Code word Distance Code word Distance
0000 000 4 0110 010 3
0001 011 5 0101 110 4
0010 111 2 0011 100 3
0100 101 5 1110 100 2
1000 110 1 1101 000 5
1100 011 4 1011 010 2
1010 001 3 0111 001 6
1001 101 4 1111 111 3
51
Transmitting Information
• Binary codes
• Encoding with parity-check sums
• Data compression
• Cryptography
• Model the genetic code

52
Understanding Data Compression
• Some image formats compress their data
• GIF, JPEG, PNG
• Others, like BMP, do not compress their data
• Use data compression tools for those formats
• Data compression
• Coding of data from a larger to a smaller form
• Types
• Lossless compression and lossy compression

53
Data compression
• Data compression is important to storage systems
because it allows more bytes to be packed into a
given storage medium than when the data is
uncompressed.
• Some storage devices (notably tape) compress data
automatically as it is written, resulting in less
tape consumption and significantly faster backup
operations.
• Compression also reduces file transfer time,
saving time and communications bandwidth.

54
Compression
• There are two main categories
• Lossless
• Lossy
• Compression ratio

55
Compression factor
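• A good metric for compression is the compression
factor (or compression ratio), given by
$\text{compression factor} = \left(1 - \frac{\text{compressed size}}{\text{uncompressed size}}\right) \times 100\%$
• If we have a 100KB file that we compress to 40KB,
we have a compression factor of
$\left(1 - \frac{40}{100}\right) \times 100\% = 60\%$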

56
Information Theory
• Shannon, C.E. (1948). A mathematical theory of
communication. Bell System Technical Journal 27,
379-423, 623-656.
• Very precise definition of information as a
message made up of symbols from some finite
alphabet.
• Shannon's definition of information ignores the
meaning conveyed by the message

57
Information Theory cont.
• Information content is a quantifiable amount
• The information content of some message is
inversely related to the probability that that
message will be received from the set of all
possible messages.
• The message with the lowest probability of being
received contains the highest information content.

58
Information content
• Compression is achieved by removing data
redundancy while preserving information content.
• The information content of a group of bytes (a
message) is its entropy.
• Data with low entropy permit a larger compression
ratio than data with high entropy.
• Entropy, H, is a function of symbol frequency.
It is the weighted average of the number of bits
required to encode the symbols of a message. For
a single symbol x:
• $H = -P(x) \times \log_2 P(x)$

59
Entropy of a message
• The entropy of the entire message is the sum of
the individual symbol entropies:
• $H = \sum_i -P(x_i) \times \log_2 P(x_i)$,
where $x_i$ is the i-th symbol
• Information and entropy are measures of
unexpectedness. Entropy effectively limits the
strongest lossless compression possible.
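The entropy sum can be evaluated for any message by estimating each symbol's probability from its frequency in that message. This is a minimal sketch; estimating probabilities from the message itself is an assumption of the example, and the function name is illustrative.

```python
from collections import Counter
from math import log2


def entropy_bits_per_symbol(message: str) -> float:
    """Sum of -P(x) * log2 P(x) over the distinct symbols of the message."""
    counts = Counter(message)
    total = len(message)
    return sum(-(c / total) * log2(c / total) for c in counts.values())
```

A message that uses four symbols equally often, such as "abcd", comes out at 2 bits per symbol, while a message made of a single repeated symbol has entropy 0. Multiplying the result by the message length gives an estimate of the fewest bits a symbol-by-symbol lossless code could use.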
60
Entropy
• Entropy is a measure of information content: the
minimum number of bits required to store data
without any loss of information.
• Entropy is sometimes called a measure of
surprise, the uncertainty associated with the
message
• A highly predictable sequence contains little
actual information
• Example: 11011011011011011011011011 (what's
next?)
• A completely unpredictable sequence of n bits
contains n bits of information
• Example: 01000001110110011010010000 (what's
next?)
• Note that nothing says the information has to
have any meaning (whatever that is)
• A fair coin has an entropy of one bit. If the
coin is not fair, then the uncertainty is lower
and the entropy is also lower.

61
Entropy of a coin flip
Entropy H(X) of a coin flip, measured in bits,
graphed versus the fairness of the coin,
Pr(X = 1). Note that the maximum of the graph
depends on the distribution. Here, at most 1 bit
is required to communicate the outcome of a fair
coin flip, but the result of a fair die would
require at most log2(6) bits.
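The curve described in the caption is the entropy of a two-outcome distribution. A short sketch, with `p` standing for Pr(X = 1) and an illustrative function name:

```python
from math import log2


def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * log2(p) - (1 - p) * log2(1 - p)
```

The function peaks at exactly 1 bit for a fair coin (`p = 0.5`) and falls toward 0 as the coin becomes more biased. For a fair six-sided die the corresponding value is `log2(6)`, roughly 2.585 bits.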
63
Inefficiency of ASCII
• Realization: In many natural (English) files, we
are much more likely to see the letter "e" than
a rarely used character, yet both are encoded
using 7 bits!
• Solution: Use variable-length encoding! The
encoding for "e" should be shorter than the
encoding for the rare character.

64
ASCII (cont.)
• Here are the ASCII bit strings for the capital
letters in our alphabet

Letter ASCII Letter ASCII
A 0100 0001 N 0100 1110
B 0100 0010 O 0100 1111
C 0100 0011 P 0101 0000
D 0100 0100 Q 0101 0001
E 0100 0101 R 0101 0010
F 0100 0110 S 0101 0011
G 0100 0111 T 0101 0100
H 0100 1000 U 0101 0101
I 0100 1001 V 0101 0110
J 0100 1010 W 0101 0111
K 0100 1011 X 0101 1000
L 0100 1100 Y 0101 1001
M 0100 1101 Z 0101 1010
65
Variable Length Coding
• Assume we know the distribution of characters
(for example, "e" appears 1000 times while a rare
character appears 1 time)
• Each character will be encoded using a number of
bits that is inversely related to its frequency
• Need a prefix-free encoding: if e = 001, then we
cannot assign another character to be 0011. Since
the encoding is variable length, we need to know
when to stop.

66
Example Morse code
• Morse code is a method of transmitting textual
information as a series of on-off tones, lights,
or clicks that can be directly understood by a
skilled listener or observer without special
equipment.
• Each character is a sequence of dots and dashes,
with the shorter sequences assigned to the more
frequently used letters in English: the letter
'E' is represented by a single dot, and the
letter 'T' by a single dash.
• Invented in the early 1840s, it was extensively
used in the 1890s for early radio communication
before it was possible to transmit voice.

67
A U.S. Navy seaman sends Morse code signals in
2005.
Vibroplex semiautomatic key. The paddle, when
pressed to the right by the thumb, generates a
series of dits. When pressed to the left by the
knuckle of the index finger, the paddle generates
a dah.
68
International Morse Code
69
Relative Frequency of Letters in English Text
70
Encoding Trees
• Think of encoding as an (unbalanced) tree.
• Data is in leaf nodes only (prefix free).
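• e = 0, a = 10, b = 11
• How to decode 01110?

[Figure: encoding tree; branch 0 from the root leads to leaf e, and branch 1 leads to an internal node whose branches 0 and 1 lead to leaves a and b]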
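Decoding a prefix-free code amounts to reading bits until they match a code word, emitting that symbol, and starting over. A minimal sketch, with an illustrative function name:

```python
def decode_prefix_code(bits: str, code: dict) -> str:
    """Decode a bit string with a prefix-free code given as symbol -> bits."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        # Because no code word is a prefix of another, the first match is right.
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("leftover bits do not form a code word")
    return "".join(decoded)
```

With the code e = 0, a = 10, b = 11, the string 01110 splits as 0, 11, 10 and decodes to "eba".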
71
Cost of a Tree
• For each character ci let fi be its frequency in
the file.
• Given an encoding tree T, let di be the depth of
ci in the tree (number of bits needed to encode
the character).
• The length of the file after encoding it with the
coding scheme defined by T will be
$C(T) = \sum_i d_i f_i$

72
Example Huffman encoding
• A = 0, B = 100, C = 1010, D = 1011, R = 11
• With this code, ABRACADABRA is eleven letters in
23 bits
• A fixed-width encoding would require 3 bits for 5
different letters, or 33 bits for 11 letters
• Notice that the encoded bit string can be decoded!

73
Why it works
• In this example, A was the most common letter
• 5 A's: the code for A is 1 bit long
• 2 R's: the code for R is 2 bits long
• 2 B's: the code for B is 3 bits long
• 1 C: the code for C is 4 bits long
• 1 D: the code for D is 4 bits long
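The bit count follows from multiplying each letter's frequency by the length of its code and adding up, which is the tree cost applied to ABRACADABRA. A small sketch (the names `SLIDE_CODE` and `encoded_length` are illustrative):

```python
from collections import Counter

# The code from the example: A = 0, B = 100, C = 1010, D = 1011, R = 11.
SLIDE_CODE = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}


def encoded_length(text: str, code: dict) -> int:
    """Total bits needed: sum of (code length x frequency) over the letters."""
    frequencies = Counter(text)
    return sum(len(code[ch]) * n for ch, n in frequencies.items())
```

For ABRACADABRA this is 5·1 + 2·2 + 2·3 + 1·4 + 1·4 = 23 bits, compared with 3·11 = 33 bits for a fixed-width code.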

74
Creating a Huffman encoding
• For each encoding unit (letter, in this example),
associate a frequency (number of times it occurs)
• Use a percentage or a probability
• Create a binary tree whose children are the
encoding units with the smallest frequencies
• The frequency of the root is the sum of the
frequencies of the leaves
• Repeat this procedure until all the encoding
units are in the binary tree

75
Example, step I
• Assume that relative frequencies are
• A 40%
• B 20%
• C 10%
• D 10%
• R 20%
• (I chose simpler numbers than the real
frequencies)
• Smallest numbers are 10 and 10 (C and D), so
connect those

76
Example, step II
• C and D have already been used, and the new node
above them (call it CD) has value 20
• The smallest values are B, CD, and R, all of
which have value 20
• Connect any two of these; it doesn't matter which
two

77
Example, step III
• The smallest value is R (20), while A and BCD
both have value 40
• Connect R to either of the others

[Figure: the partial tree, with the labels "root" and "leaf"]
78
Example, step IV
• Connect the final two nodes

79
Example, step V
• Assign 0 to left branches, 1 to right branches
• Each encoding is a path from the root
• A = 0, B = 100, C = 1010, D = 1011, R = 11
• Each path terminates at a leaf
• Do you see why encoded strings are decodable?
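The construction in steps I to V can be sketched with a priority queue that always merges the two least frequent nodes. This is one common way to implement it, not necessarily the one used in the course; the function name and the tie-breaking counter are choices made for this sketch.

```python
import heapq
from itertools import count


def huffman_code(frequencies: dict) -> dict:
    """Build a prefix-free code by repeatedly merging the two rarest nodes."""
    if len(frequencies) == 1:
        return {symbol: "0" for symbol in frequencies}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {s: ""}) for s, f in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f_left, _, left = heapq.heappop(heap)
        f_right, _, right = heapq.heappop(heap)
        # Left branch gets 0, right branch gets 1, prepended to existing codes.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f_left + f_right, next(tiebreak), merged))
    return heap[0][2]
```

Because several nodes tie at 20 and at 40 in this example, the code produced may differ from A = 0, B = 100, C = 1010, D = 1011, R = 11. Any choice among tied nodes yields the same total cost, so the result is equally short on average.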

80
Unique prefix property
• A = 0, B = 100, C = 1010, D = 1011, R = 11
• No bit string is a prefix of any other bit string
• For example, if we added E = 01, then A (0) would
be a prefix of E
• Similarly, if we added F = 10, then it would be a
prefix of three other encodings (B = 100,
C = 1010, and D = 1011)
• The unique prefix property holds because, in a
binary tree, a leaf is not on a path to any other
node
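The unique prefix property can be checked mechanically. In the sketch below the code words are sorted first: if one word is a prefix of another, it is also a prefix of the word that immediately follows it in sorted order, so comparing neighbors is enough. The function name is illustrative.

```python
def is_prefix_free(code: dict) -> bool:
    """True if no code word is a prefix of (or equal to) another code word."""
    words = sorted(code.values())
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))
```

The five-letter code passes this check, while adding E = 01 or F = 10 makes it fail, as described above.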

81
Practical considerations
• It is not practical to create a Huffman encoding
for a single short string, such as ABRACADABRA
• To decode it, you would need the code table
• If you include the code table in the entire
message, the whole thing is bigger than just the
ASCII message
• Huffman encoding is practical if
• The encoded string is large relative to the code
table, OR
• We agree on the code table beforehand
• For example, it's easy to find a table of letter
frequencies for English (or any other
alphabet-based language)

82
Data compression
• Huffman encoding is a simple example of data
compression: representing data in fewer bits than
it would otherwise need
• A more sophisticated method is GIF (Graphics
Interchange Format) compression, for .gif files
• Another is JPEG (Joint Photographic Experts
Group), for .jpg files
• Unlike the others, JPEG is lossy: it loses
information
• Generally OK for photographs (if you don't
compress them too much) because decompression
adds fake data very similar to the original

83
JPEG Compression
• Photographic images incorporate a great deal of
information. However, much of that information
can be lost without objectionable deterioration
in image quality.
• With this in mind, JPEG allows user-selectable
image quality, but even at the best quality
levels, JPEG makes an image file smaller owing to
its multiple-step compression algorithm.
• It's important to remember that JPEG is lossy,
even at the highest quality setting. It should
be used only when the loss can be tolerated.

84
2. Run Length Encoding (RLE)
• RLE: When data contain strings of repeated
symbols (such as bits or characters), the strings
can be replaced by a special marker, followed by
the repeated symbol, followed by the number of
occurrences. In general, the number of
occurrences (length) is shown by a two-digit
number.
• If the special marker itself occurs in the data,
it is duplicated (as in character stuffing).
• RLE can be used in audio (silence is a run of 0s)
and video (run of a picture element having the
same brightness and color).

85
An Example of Run-Length Encoding
86
2. Run Length Encoding (RLE)
• Example
• is chosen as the special marker.
• Two-digit number is chosen for the repetition
count.
• Consider the following string of decimal digits
• 15000000000045678111111111111118
• Using RLE algorithm, the above digital string
would be encoded as
• 15010456781148
• The compression ratio would be
• $(1 - 16/32) \times 100\% = 50\%$
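The scheme can be sketched as follows. The marker character is not legible in this transcript, so the sketch uses "#" purely as a hypothetical stand-in, and it assumes the marker never occurs in the data (the duplication rule for that case is not implemented). Runs shorter than four symbols are left alone, since a marked run always costs four characters; that threshold is also a choice made for this example.

```python
from itertools import groupby

MARKER = "#"  # hypothetical marker; the symbol used in the slides is not shown


def rle_encode(data: str, min_run: int = 4) -> str:
    """Replace long runs by MARKER + symbol + two-digit count."""
    out = []
    for symbol, group in groupby(data):
        run = len(list(group))
        while run >= min_run:
            chunk = min(run, 99)  # a two-digit count holds at most 99
            out.append(f"{MARKER}{symbol}{chunk:02d}")
            run -= chunk
        out.append(symbol * run)  # short remainder is copied unchanged
    return "".join(out)


def rle_decode(encoded: str) -> str:
    """Expand MARKER + symbol + two-digit count back into the run."""
    out, i = [], 0
    while i < len(encoded):
        if encoded[i] == MARKER:
            out.append(encoded[i + 1] * int(encoded[i + 2:i + 4]))
            i += 4
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)
```

Under these assumptions the 32-digit string from the example is encoded in 16 characters (two marked runs of four characters each plus eight unmarked digits), which is where the ratio of 50% comes from.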

88
Transmitting Information
• Binary codes
• Encoding with parity-check sums
• Data compression
• Cryptography
• Model the genetic code

89
Model the genetic code
• The genome is the instruction manual for life,
an information system that specifies the
biological body.
• In its simplest form, it consists of a linear
sequence of four extremely small molecules,
called nucleotides.
• These nucleotides make up the steps of the
spiral-staircase structure of the DNA and are the
letters of the genetic code.

91
The structure of part of a DNA double helix
• DNA is a nucleic acid that contains the genetic
instructions used in the development and
functioning of all known living organisms.

92
A DNA double helix
The main role of DNA (deoxyribonucleic acid)
molecules is the long-term storage of
information.
93
Four bases found in DNA
The DNA double helix is stabilized by hydrogen
bonds between the bases attached to the two
strands. The four bases (nucleotides) found in
DNA are adenine (abbreviated A), cytosine (C),
guanine (G) and thymine (T). These four bases are
attached to the sugar/phosphate to form the
complete nucleotide
94
Escherichia coli genome
• >gb|U00096|U00096 Escherichia coli
• K-12 MG1655 complete genome
• AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAA
AAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT
TAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA
GCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCAT
TAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGG
GCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA

95
Hierarchies of symbols
English Computer Genetics
letter (26) bit (2) nucleotide (4)
word (1-28 letters) byte (8 bits) codon (3 nucleotides)
sentence line gene
book program genome

96
Information Theory
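[Figure: A typical communication system, after Shannon (1948): information source, message, transmitter, signal, noise source, message, destination. In the genetic analogy, the parents are the information source, DNA is the signal, mutation is the noise source, and the child is the destination.]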
97
DNA from an Information Theory Perspective
• The alphabet for DNA is {A, C, G, T}. Each DNA
strand is a sequence of symbols from this
alphabet.
• These sequences are replicated and translated in
processes reminiscent of Shannon's communication
model.
• There is redundancy in the genetic code that
enhances its error tolerance.

98
The Central Dogma of Molecular Biology
[Figure: the central dogma. DNA is copied by replication and transcribed into RNA (ribonucleic acid); RNA is translated into protein; reverse transcription goes from RNA back to DNA.]
99
What Information Theory Contributes to Genetic
Biology
• A useful model for how genetic information is
stored and transmitted in the cell
• A theoretical justification for the observed
redundancy of the genetic code

100
Data Compression in gene sequences
• As an illustration of data compression, let's use
the idea of gene sequences.
• Biologists are able to describe genes by
specifying sequences composed of the four letters
A, T, G, and C, which stand for the four
nucleotides adenine, thymine, guanine, and
cytosine, respectively.
• Suppose we wish to encode the sequence AAACAGTAAC.

101
Data Compression (cont.)
• One way is to use the (fixed-length) code A→00,
C→01, T→10, and G→11.
• Then AAACAGTAAC is encoded as
00000001001110000001.
• From experience, biologists know that the
frequency of occurrence from most frequent to
least frequent is A, C, T, G.
• Thus, it would be more efficient to choose the
following binary code: A→0, C→10, T→110, and
G→111.
• With this new code, AAACAGTAAC is encoded as
0001001111100010.
• Notice that this new binary code word has 16
digits versus 20 digits for the fixed-length
code, a decrease of 20%.
• This new code is an example of data compression!
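The two encodings can be compared directly. In this sketch the dictionaries hold the fixed-length and variable-length codes given above; the names `FIXED`, `VARIABLE`, and `encode` are illustrative.

```python
FIXED = {"A": "00", "C": "01", "T": "10", "G": "11"}
VARIABLE = {"A": "0", "C": "10", "T": "110", "G": "111"}


def encode(sequence: str, code: dict) -> str:
    """Concatenate the code word of every base in the sequence."""
    return "".join(code[base] for base in sequence)


fixed_bits = encode("AAACAGTAAC", FIXED)
variable_bits = encode("AAACAGTAAC", VARIABLE)
# Relative saving of the variable-length code, as a percentage.
saving = (1 - len(variable_bits) / len(fixed_bits)) * 100
```

The variable-length code wins here because A, the most frequent base in the sequence, gets the shortest code word. A sequence dominated by G or T would come out longer than with the fixed-length code.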

102
Data Compression (cont.)
• Suppose we wish to decode a sequence encoded with
the new data compression scheme, such as
0001001111100010.
• Looking at groups of three digits at a time, we
can decode this message!
• Since 0 only occurs at the end of a code word,
and the code words that end in 0 are 0, 10, and
110, we can put a mark after every 0, as this
will be the end of a code word.
• The only time a sequence of 111 occurs is for the
code word 111, so we can put a mark after every
triple of 1s.
• Thus, we have 0,0,0,10,0,111,110,0,0,10, which
is AAACAGTAAC.
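The marking rule (end a code word after every 0, and after every run of three 1s) can be written as a single pass over the digits. The function names are illustrative, and the sketch assumes the code A→0, C→10, T→110, G→111 from the previous slide.

```python
LETTERS = {"0": "A", "10": "C", "110": "T", "111": "G"}


def split_code_words(bits: str) -> list:
    """Cut the bit string after every 0 and after every triple of 1s."""
    words, current = [], ""
    for bit in bits:
        current += bit
        if bit == "0" or current == "111":
            words.append(current)
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code word")
    return words


def decode_gene(bits: str) -> str:
    """Translate each code word back to its letter."""
    return "".join(LETTERS[word] for word in split_code_words(bits))
```

Applied to 0001001111100010, the split gives 0, 0, 0, 10, 0, 111, 110, 0, 0, 10, which translates back to AAACAGTAAC.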
