1
Chapter 5 Huffman One Better: Arithmetic Coding
The Data Compression Book
2
5.1 Difficulties
Huffman coding has been proven to be the best
fixed-length coding method available. Huffman codes,
however, have to be an integral number of bits long,
and this can sometimes be a problem. If a
statistical method could assign a 90 percent
probability to a given character, the optimal
code size would be about 0.15 bits. The Huffman
coding system would probably assign a 1-bit code to
the symbol, which is roughly six times longer than
necessary. The conventional solution to this
problem is to group symbols together into packets
and apply Huffman coding to the packets. But this
weakness still prevents Huffman coding from being a
universal compressor.
3
5.2 Arithmetic Coding: A Step Forward
  • Arithmetic coding bypasses the idea of replacing
    each input symbol with a specific code. Instead, it
    replaces a stream of input symbols with a single
    floating-point output number. More bits are
    needed in the output number for longer, more
    complex messages.
  • Consider the message BILL GATES, whose symbols are
    given a probability distribution and a matching
    sub-range of the interval from 0 to 1. The encoding
    steps are shown on the next slide.

4
(Figure: the interval from 0 to 1 is repeatedly subdivided as each symbol of
BILL GATES is encoded; the final interval is 0.2572167752 to 0.2572167756.)

New symbol    Low value       High value
B             0.2             0.3
I             0.25            0.26
L             0.256           0.258
L             0.2572          0.2576
SPACE         0.25720         0.25724
G             0.257216        0.257220
A             0.2572164       0.2572168
T             0.25721676      0.2572168
E             0.257216772     0.257216776
S             0.2572167752    0.2572167756
5
Arithmetic Coding: A Step Forward-2
  • Each character is assigned the portion of the 0
    to 1 range that corresponds to its probability of
    appearance.
  • The encoding process is simply one of narrowing
    the range of possible numbers with every new
    symbol. The new range is proportional to the
    predefined probability attached to that symbol.
    low = 0.0;
    high = 1.0;
    while ( ( c = getc( input ) ) != EOF ) {
        range = high - low;
        high  = low + range * high_range( c );
        low   = low + range * low_range( c );
    }
    output( low );

So the final low value, .2572167752, will
uniquely encode the message BILL GATES using
our present coding scheme.
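
To make this loop concrete, here is a minimal, self-contained C sketch of the
floating-point version. The symbol ranges hard-coded below are assumptions
consistent with the BILL GATES table on the previous slide (L gets 0.6-0.8,
every other symbol a slice of width 0.1); a real coder uses the integer scheme
of section 5.2.1, because doubles run out of precision on long messages.

    #include <stdio.h>
    #include <string.h>

    /* Assumed symbol ranges for the BILL GATES example, matching the table */
    /* on the previous slide: SPACE A B E G I L S T, with L twice as wide.  */
    static const char   symbols[] = " ABEGILST";
    static const double lows[]  = { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9 };
    static const double highs[] = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0 };

    static double low_range( int c )  { return lows[ strchr( symbols, c ) - symbols ]; }
    static double high_range( int c ) { return highs[ strchr( symbols, c ) - symbols ]; }

    int main( void )
    {
        const char *message = "BILL GATES";
        double low = 0.0, high = 1.0, range;
        int i;

        for ( i = 0 ; message[ i ] != '\0' ; i++ ) {
            range = high - low;
            high  = low + range * high_range( message[ i ] );
            low   = low + range * low_range( message[ i ] );
            printf( "%c  low = %.10f  high = %.10f\n", message[ i ], low, high );
        }
        printf( "final low value: %.10f\n", low );  /* approximately 0.2572167752 */
        return 0;
    }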
6
Arithmetic Coding: A Step Forward-3
  • Decoding is the inverse procedure, in which the
    range is expanded in proportion to the
    probability of each symbol as it is extracted.
  • The algorithm for decoding the incoming number is
    shown next:

    number = input_code();
    for ( ; ; ) {
        symbol = find_symbol_straddling_this_range( number );
        putc( symbol );
        range  = high_range( symbol ) - low_range( symbol );
        number = number - low_range( symbol );
        number = number / range;
    }
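
A matching C sketch of this decoder, again using the assumed BILL GATES
ranges from the encoder sketch. Here find_symbol_straddling_this_range() is
just a linear search, and the loop runs for a known message length of ten
symbols since this toy version carries no end-of-stream symbol. The decoded
value is taken from the interior of the final interval; its exact low endpoint
sits on a range boundary and could be nudged the wrong way by double rounding.

    #include <stdio.h>

    /* Same assumed ranges as the encoder sketch on the previous slide.     */
    static const char   symbols[] = " ABEGILST";
    static const double lows[]  = { 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9 };
    static const double highs[] = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0 };

    /* Return the index of the symbol whose range straddles the number.     */
    static int find_symbol_straddling_this_range( double number )
    {
        int i;
        for ( i = 0 ; i < 9 ; i++ )
            if ( number >= lows[ i ] && number < highs[ i ] )
                return i;
        return -1;
    }

    int main( void )
    {
        double number = 0.2572167754;   /* any value inside the final interval */
        double range;
        int i, s;

        for ( i = 0 ; i < 10 ; i++ ) {  /* ten symbols: length known in this toy */
            s = find_symbol_straddling_this_range( number );
            putchar( symbols[ s ] );
            range  = highs[ s ] - lows[ s ];
            number = number - lows[ s ];
            number = number / range;
        }
        putchar( '\n' );                /* prints BILL GATES */
        return 0;
    }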

7
5.2.1 Practical Matters
What is required is an incremental transmission
scheme in which fixed-size integer state
variables receive new bits at the low end and
shift them out at the high end, forming a single
number that can be as long as necessary,
conceivably millions or billions of bits.
The BILL GATES example can be worked in a
five-decimal-digit register (decimal digits are used
in this example for clarity). The update rule becomes
high = low + range * high_range( symbol ), with the
registers initialized to HIGH = 99999 and LOW = 00000.
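
As a hand-worked sketch of the first two steps (using the same assumed
BILL GATES ranges as before, and the update rule just given):

    Encode B (range 0.2-0.3):  range = 99999 - 00000 + 1 = 100000
                               high  = 00000 + 100000 * 0.3 - 1 = 29999
                               low   = 00000 + 100000 * 0.2     = 20000
    The most significant digits of high and low now match, so the 2 is
    shifted out to the output and the registers return to HIGH = 99999,
    LOW = 00000.

    Encode I (range 0.5-0.6):  high = 59999, low = 50000; the 5 is shifted out.

The output stream so far is 25, matching the start of the floating-point
result .2572167752.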
8
5.2.2 A Complication
This scheme works well for incrementally encoding
a message, but there is a potential loss of precision:
after some iterations, high could be 70000 and low
69999, and the most significant digits may never
converge. Is the coder permanently stuck at an impasse?
The action to take is to delete the second digit from
high and low and shift the remaining digits left to
fill the space; the most significant digit stays in
place. After every recalculation, if the most
significant digits do not match, check for underflow
digits again. If underflow digits are present, shift
them out and increment a counter. When the most
significant digits do finally converge to a single
value, output that value. Then output the underflow
digits previously discarded.
9
5.2.3 Decoding
Instead of using just two numbers, high and low,
the decoder has to use three numbers. The first
two, high and low, correspond exactly to the high
and low values maintained by the encoder. The
third number, code, contains the current bits
being read in from the input bit stream. The code
value always falls between the high and low
values. As they come closer and closer to it, new
shift operations will take place, and high and
low will move back away from code. The high and
low values in the decoder will be updated after
every symbol, just as they were in the encoder,
and they should have exactly the same values.
10
5.2.4 Where's the Beef?
An example: encode the stream AAAAAAA, where the
probability of A is known to be .9, so there is a 90
percent chance that any incoming character will
be the letter A. The encoding process narrows the
range with each symbol, and the number .45 will make
this message uniquely decode to AAAAAAA. Those two
decimal digits take slightly less than seven bits to
specify, which means we have encoded eight symbols
(the seven As plus an end-of-stream symbol) in less
than eight bits! An optimal Huffman-coded message
would have taken a minimum of nine bits.
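
A quick way to check this claim (a sketch that assumes A occupies the range
0.0-0.9 and the end-of-stream symbol occupies 0.9-1.0): each A shrinks the
width of the current interval by a factor of 0.9, so after seven As the
interval is [0, 0.9^7) = [0, 0.4782969). Encoding the end-of-stream symbol
then leaves [0.43046721, 0.4782969), and .45 falls inside that interval, so it
decodes back to the original message.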
11
5.3 The Code
The code supplied with this chapter in ARITH.C is
a simple module that performs arithmetic
compression and decompression using a simple
order 0 model. It works exactly like the
non-adaptive Huffman coding program in Chapter 3.
It first makes a single pass over the data,
counting the symbols. The data is then scaled
down to make the counts fit into a single,
unsigned character. The scaled counts are saved
to the output file for the decompressor to get at
later, then the arithmetic coding table is built.
Finally, the compressor passes through the data,
compressing each symbol as it appears. When
done, the end-of-stream character is sent out,
the arithmetic coder is flushed, and the program
exits.
12
5.3.1 The Compression Program-1
  • The compressor code breaks down neatly into three
    sections. The first two lines initialize the
    model and the encoder. The while loop consists of
    two lines, which together with the line following
    the loop perform the compression, and the last
    three lines shut things down.
    build_model( input, output->file );
    initialize_arithmetic_encoder();
    while ( ( c = getc( input ) ) != EOF ) {
        convert_int_to_symbol( c, &s );
        encode_symbol( output, &s );
    }
    convert_int_to_symbol( END_OF_STREAM, &s );
    encode_symbol( output, &s );
    flush_arithmetic_encoder( output );
    OutputBits( output, 0L, 16 );

13
5.3.1 The Compression Program-2
The build_model() routine counts all the
characters, scales the counts down to fit in
unsigned characters, builds the range table used
by the coder, and writes the counts to the output
file. The initialize_arithmetic_encoder()
routine sets up the high and low integer
variables. The encoding loop calls two different
routines to encode each symbol.
convert_int_to_symbol() takes the character read
in from the file and looks up the range for the
given symbol. The range is then stored in the
symbol object, which has the structure shown here:

    typedef struct {
        unsigned short int low_count;
        unsigned short int high_count;
        unsigned short int scale;
    } SYMBOL;

Once the symbol object has been defined,
it can be passed to the encoder.
14
5.3.1 The Compression Program-3
When we reach the end of the input file, we
encode and send the end-of-stream symbol. To
finish, we call a routine to flush the arithmetic
encoder, which takes care of any remaining underflow
bits. Finally, we output an extra sixteen bits, which
guarantee that the decoder, reading sixteen bits ahead
into its code register, will not run out of input.
15
5.3.2 The Expansion Program
  • The main part of the expansion program follows
    the same pattern.
    input_counts( input->file );
    initialize_arithmetic_decoder( input );
    for ( ; ; ) {
        get_symbol_scale( &s );
        count = get_current_count( &s );
        c = convert_symbol_to_int( count, &s );
        if ( c == END_OF_STREAM )
            break;
        remove_symbol_from_stream( input, &s );
        putc( (char) c, output );
    }
  • The decoding loop: first, get the scale for the
    current model to pass back to the arithmetic
    decoder. The decoder then converts its current
    input code into a count in the routine
    get_current_count(), and that count is used to
    determine which symbol is the correct one to decode.

16
5.3.3 Initializing the Model-1
The model needs three pieces of information for
each symbol: the low end and the high end of its
range, and the scale of the entire alphabet's range
(the scale is the same for all symbols in the
alphabet). Since the top of a given symbol's
range is the bottom of the next, we only need to
keep track of N + 1 numbers for N symbols in the
alphabet. For symbol x in the array, the low
count can be found at totals[ x ], the high count
at totals[ x + 1 ], and the scale of the range at
totals[ N ], N being the number of symbols in the
alphabet. In this program, the array is named
totals, and it has 258 elements. The number of
symbols in the alphabet is 257, the normal 256
plus one for the end-of-stream symbol.
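
As an illustration of this layout, here is a minimal sketch of the lookup it
implies. get_range() is a hypothetical helper written for this example; the
array name totals and the 257-symbol alphabet follow the text above, and the
same lookups appear later inside convert_int_to_symbol().

    #define END_OF_STREAM 256                 /* the 257th symbol of the alphabet */

    unsigned short int totals[ 258 ];         /* cumulative counts, N + 1 entries */

    /* Hypothetical helper: fetch the coding range for symbol c.                  */
    void get_range( int c, unsigned short int *low, unsigned short int *high,
                    unsigned short int *scale )
    {
        *low   = totals[ c ];                 /* bottom of c's slice               */
        *high  = totals[ c + 1 ];             /* top of c's slice                  */
        *scale = totals[ END_OF_STREAM + 1 ]; /* total of all counts               */
    }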
17
5.3.3 Initializing the Model-2
  • One additional constraint is placed on these
    counts: the number of bits used to hold them.
    Since we use 16-bit registers for our high and low
    values, the highest cumulative count in the totals
    array is limited to no more than 14 bits, or 16,384.
    The individual counts are first scaled down so that
    they all fit in a single byte.
  • Code from build_model():

    count_bytes( input, counts );
    scale_counts( counts, scaled_counts );
    output_counts( output, scaled_counts );
    build_totals( scaled_counts );

  • The UpdateModel() routine has to see if the root node
    has reached the maximum allowable count. If it
    has, the Huffman tree is rebuilt.
  • count_bytes() is the same as in the static Huffman
    coding program.
  • scale_counts() matches the Huffman coding version in
    its first part, which scales the counts to fit in an
    array of unsigned characters (a sketch of that pass
    follows below).
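
For reference, a sketch of what that first scaling pass looks like. It is
patterned after the Chapter 3 routine the slide refers to; the names counts
and scaled_counts follow the code above, but the exact body in ARITH.C may
differ in detail.

    void scale_counts( unsigned long counts[], unsigned char scaled_counts[] )
    {
        int i;
        unsigned long max_count;
        unsigned long scale;

        /* Find the largest raw count, then pick a divisor that forces every */
        /* scaled count to fit in a single unsigned character.               */
        max_count = 0;
        for ( i = 0 ; i < 256 ; i++ )
            if ( counts[ i ] > max_count )
                max_count = counts[ i ];
        scale = max_count / 256 + 1;
        for ( i = 0 ; i < 256 ; i++ ) {
            scaled_counts[ i ] = (unsigned char) ( counts[ i ] / scale );
            /* A symbol that appeared in the file must never scale down to 0. */
            if ( scaled_counts[ i ] == 0 && counts[ i ] != 0 )
                scaled_counts[ i ] = 1;
        }
    }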

18
5.3.3 Initializing the Model-3
  • The second part of scale_counts() restricts the
    total of the counts to less than 16,384, or fourteen
    bits, with an additional count reserved for the
    end-of-stream symbol:

    total = 1;
    for ( i = 0 ; i < 256 ; i++ )
        total += scaled_counts[ i ];
    if ( total > ( 32767 - 256 ) )
        scale = 4;
    else if ( total > 16383 )
        scale = 2;
    else
        return;
    for ( i = 0 ; i < 256 ; i++ )
        scaled_counts[ i ] /= scale;

  • The last step in building the model is to set up
    the cumulative totals array in totals:

    totals[ 0 ] = 0;
    for ( i = 0 ; i < END_OF_STREAM ; i++ )
        totals[ i + 1 ] = totals[ i ] + scaled_counts[ i ];
    totals[ END_OF_STREAM + 1 ] = totals[ END_OF_STREAM ] + 1;
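
As a toy illustration of what this produces (a hypothetical four-symbol
alphabet rather than the real 257-symbol one):

    #include <stdio.h>

    int main( void )
    {
        /* Hypothetical scaled counts: symbol 0 appears twice, symbol 1 once, */
        /* symbol 2 three times, symbol 3 once.                               */
        unsigned char      scaled_counts[ 4 ] = { 2, 1, 3, 1 };
        unsigned short int totals[ 5 ];
        int i;

        totals[ 0 ] = 0;
        for ( i = 0 ; i < 4 ; i++ )
            totals[ i + 1 ] = totals[ i ] + scaled_counts[ i ];

        for ( i = 0 ; i <= 4 ; i++ )      /* prints 0 2 3 6 7                  */
            printf( "%d ", totals[ i ] );
        printf( "\n" );                   /* symbol i owns totals[i]..totals[i+1] */
        return 0;
    }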

19
5.3.4 Reading the Model
  • For expansion, the code needs to build the same
    model array in totals that was used in the
    compression routine. The program reads in the
    scaled_counts array stored in the compressed
    file, just as in Chapter 3.
  • After the scaled_counts array has been read in,
    the same routine used by the compression code can
    be invoked to build the totals array. Calling
    build_totals() in both the compression and
    expansion routines helps ensure that we are
    working with the same array.

20
5.3.5 Initializing the Encoder
  • Before compression can begin, we have to
    initialize the variables that constitute the
    arithmetic encoder. Three 16-bit variables define
    the arithmetic encoder: low, high, and
    underflow_bits.

    low = 0;
    high = 0xffff;
    underflow_bits = 0;

21
5.3.6 The Encoding Process-1
  • The actual encoding process:

    while ( ( c = getc( input ) ) != EOF ) {
        convert_int_to_symbol( c, &s );
        encode_symbol( output, &s );
    }
    convert_int_to_symbol( END_OF_STREAM, &s );
    encode_symbol( output, &s );
  • This consists of looping through the entire file,
    reading in a character, determining its range
    variables, then encoding it. After the file has
    been scanned, the final step is to encode the
    end-of-stream symbol.

22
5.3.6 The Encoding Process-2
  • Two routines encode a symbol. The
    convert_int_to_symbol() routine looks up the
    modeling information for the symbol and retrieves
    the numbers needed to perform the arithmetic
    coding.
    s->scale      = totals[ END_OF_STREAM + 1 ];
    s->low_count  = totals[ c ];
    s->high_count = totals[ c + 1 ];
  • The symbol is then encoded in encode_symbol(), which
    has two distinct steps.
  • The first is to adjust the high and low variables
    based on the symbol data passed to the encoder:

    range = (long) ( high - low ) + 1;
    high  = low + (unsigned short int)
            (( range * s->high_count ) / s->scale - 1 );
    low   = low + (unsigned short int)
            (( range * s->low_count ) / s->scale );

23
5.3.6 The Encoding Process-3
  • Shift out any bits that are available for shifting:

    for ( ; ; ) {
        if ( ( high & 0x8000 ) == ( low & 0x8000 ) ) {
            OutputBit( stream, high & 0x8000 );
            while ( underflow_bits > 0 ) {
                OutputBit( stream, ~high & 0x8000 );
                underflow_bits--;
            }
        }
        else if ( ( low & 0x4000 ) && !( high & 0x4000 ) ) {
            underflow_bits += 1;
            low  &= 0x3fff;
            high |= 0x4000;
        }
        else
            return;
        low <<= 1;
        high <<= 1;
        high |= 1;
    }

24
5.3.7 Flushing the Encoder
After encoding, it is necessary to flush the
arithmetic encoder. The code for this is in the
flush_arithmetic_encoder() routine. It outputs
two bits and any additional underflow bits added
along the way.
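
The slide does not reproduce the routine, but a sketch consistent with that
description might look like the following. BIT_FILE, OutputBit(), low, and
underflow_bits are the names used elsewhere in the book's code; the exact
body in ARITH.C may differ in detail.

    void flush_arithmetic_encoder( BIT_FILE *stream )
    {
        /* Output the next-to-top bit of low, then enough opposite bits to  */
        /* flush any pending underflow: two bits plus the accumulated        */
        /* underflow bits, as described above.                               */
        OutputBit( stream, low & 0x4000 );
        underflow_bits++;
        while ( underflow_bits-- > 0 )
            OutputBit( stream, ~low & 0x4000 );
    }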
25
5.3.8 The Decoding Process-1
  • Before arithmetic decoding can start, we need to
    initialize the arithmetic decoder variables. The
    decoder maintains high and low variables, along with
    a code variable that holds the current bits read in
    from the input bit stream.
  • In initialize_arithmetic_decoder():

    code = 0;
    for ( i = 0 ; i < 16 ; i++ ) {
        code <<= 1;
        code += InputBit( stream );
    }
    low  = 0;
    high = 0xffff;

26
5.3.8 The Decoding Process-2
  • This implementation of the arithmetic decoding
    process requires four separate steps to decode
    each character.
  • The first is to get the current scale for the
    symbol.
  • The second is to get the count for the current
    arithmetic code:

    range = (long) ( high - low ) + 1;
    count = (short int)
            ((((long) ( code - low ) + 1 ) * s->scale - 1 ) / range );
    return( count );

  • Determining which symbol goes with which count:

    for ( c = END_OF_STREAM ; count < totals[ c ] ; c-- )
        ;
    s->high_count = totals[ c + 1 ];
    s->low_count  = totals[ c ];
    return( c );

  • This takes the high and low counts and stores them
    in the symbol variable; remove_symbol_from_stream()
    then removes the symbol from the input stream.

27
5.3.8 The Decoding Process-3
    range = (long) ( high - low ) + 1;
    high  = low + (unsigned short int)
            (( range * s->high_count ) / s->scale - 1 );
    low   = low + (unsigned short int)
            (( range * s->low_count ) / s->scale );
    for ( ; ; ) {
        if ( ( high & 0x8000 ) == ( low & 0x8000 ) ) {
            /* most significant bits match: nothing extra to do, just shift */
        }
        else if ( ( low & 0x4000 ) == 0x4000 && ( high & 0x4000 ) == 0 ) {
            code ^= 0x4000;
            low  &= 0x3fff;
            high |= 0x4000;
        }
        else
            return;
        low  <<= 1;
        high <<= 1;
        high |= 1;
        code <<= 1;
        code += InputBit( stream );
    }

28
5.4 Summary
Arithmetic coding seems more complicated than
Huffman coding, but the size of the program
required to implement it is not significantly
different. Runtime performance, however, is
significantly slower than Huffman coding because
of the computational burden imposed on the encoder
and decoder. If squeezing the last bit of
compression capability out of the coder is
important, arithmetic coding will always do as
good a job as, or better than, Huffman coding. But
careful optimization is needed to get its performance
up to acceptable levels.
29
5.5 The Code
The complete source for this chapter is in the file ARITH.C (arith.c).