A Highly Efficient XML Compression Scheme for the Web

1
A Highly Efficient XML Compression Scheme for the Web
Przemyslaw Skibinski (1), Jakub Swacha (2), Szymon Grabowski (3)
(1) Uniwersytet Wroclawski, Instytut Informatyki, ul. Joliot-Curie 15, 50-383 Wroclaw, Poland. E-mail: inikep_at_ii.uni.wroc.pl
(2) Uniwersytet Szczecinski, Instytut Informatyki w Zarzadzaniu, ul. Mickiewicza 64, 71-101 Szczecin, Poland. E-mail: jakubs_at_sus.univ.szczecin.pl
(3) Politechnika Lódzka, Katedra Informatyki Stosowanej, al. Politechniki 11, 90-924 Lódz, Poland. E-mail: sgrabow_at_kis.p.lodz.pl
<conf_data>
  <conf_name>SOFSEM</conf_name>
  <conf_location><town>Nový Smokovec</town><country>Slovakia</country></conf_location>
  <conf_date><month>January</month><year>2008</year></conf_date>
</conf_data>
2
What's wrong with XML
XML is textual: good for many reasons. But also verbose... (NEED FOR COMPRESSION!)
XML databases can be large: Protein Sequence Database (annotated): 683 MB; DBLP Computer Science: 127 MB. (Lots of information stored, but also a verbose representation.)
More and more XML documents are exchanged through the Web (the advent of the Open XML format in MS Office 2007 can only accelerate this trend).
3
XML compression goals
What shall we do, use general-purpose compression (e.g. zip, bzip2, ppmd)?
Far from optimal (known since 1999, when the first XML compressors appeared, e.g. XMill).
Compression ratio could be improved. Speed can be improved (maybe not easily with zip though...).
Compression ratio and (de)compression speed are typically contradictory criteria: what should we choose?
WE CARE FOR THE TOTAL TRANSFER TIME (OVER A NET).
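The transfer-time criterion can be made concrete with a small calculation. The sketch below assumes that total time is the transmission time of the compressed file plus the decompression time; the function name, the file sizes, and the timings are hypothetical and are not measurements from the experiments.

```python
def total_transfer_time(compressed_bytes, bandwidth_bps, decompress_seconds):
    """Seconds needed to send a compressed file over a link and decompress it."""
    transmission_seconds = compressed_bytes * 8 / bandwidth_bps
    return transmission_seconds + decompress_seconds


# Hypothetical coders: one compresses better but decodes slowly,
# the other compresses worse but decodes quickly.
strong = dict(compressed_bytes=1_000_000, decompress_seconds=4.0)
fast = dict(compressed_bytes=1_500_000, decompress_seconds=0.5)

for kbps in (128, 384, 1000, 10000):
    bps = kbps * 1000
    t_strong = total_transfer_time(bandwidth_bps=bps, **strong)
    t_fast = total_transfer_time(bandwidth_bps=bps, **fast)
    print(f"{kbps:>6} kbps: strong={t_strong:.1f}s fast={t_fast:.1f}s")
```

With these made-up numbers the stronger coder wins on slow links and the faster coder wins once bandwidth is high, which is the trade-off the slide refers to.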
4
Specialized XML compression
  • XMill (Liefke & Suciu, 1999, 2000): separate streams for element and attribute names, actual content (text), and XML document structure. Significant gains, esp. with gzip as the back-end compressor.
  • XMLPPM (Cheney, 2001): switching between different PPM models. Novel idea: injecting a symbol from the previous model into the current context (so both the traditional and the element-related contexts matter).
  • SCMPPM (Adiego et al., 2004): XMLPPM taken to the extreme, with a separate model for each element path. Beats XMLPPM on large files, but also needs lots of memory.
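The stream-separation idea attributed to XMill can be sketched as follows. This is only an illustration of splitting a document into names, structure, and content; the function name and the token format are assumptions, whitespace-only text is dropped, and XMill's actual container layout is not reproduced here.

```python
import xml.etree.ElementTree as ET


def split_streams(xml_text):
    """Split an XML document into three lists: names, structure, content."""
    names = []      # distinct element/attribute names, in first-seen order
    structure = []  # name indices for start tags and attributes, "/" for end tags
    content = []    # text nodes and attribute values

    def name_id(name):
        if name not in names:
            names.append(name)
        return names.index(name)

    def walk(elem):
        structure.append(name_id(elem.tag))
        for attr, value in elem.attrib.items():
            structure.append(name_id("@" + attr))
            content.append(value)
        if elem.text and elem.text.strip():
            content.append(elem.text.strip())
        for child in elem:
            walk(child)
            if child.tail and child.tail.strip():
                content.append(child.tail.strip())
        structure.append("/")  # a single closing flag replaces the end tag

    walk(ET.fromstring(xml_text))
    return names, structure, content


names, structure, content = split_streams(
    "<conf><name>SOFSEM</name><year>2008</year></conf>"
)
print(names, structure, content)
```

Each list is more homogeneous than the interleaved original, which is what lets a back-end compressor such as gzip do better on the separated streams.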
5
Redundancy in XML databases
  • Every end tag must match the corresponding start tag → each end tag may be replaced with merely a closing flag.
  • Some words appear with high frequency → build a dictionary. Not only over tag / attribute names, but also over the textual content.
  • Physical layout often regular → encode trailing spaces in lines almost to zero. A similar thing often works for End-of-Line chars.
  • Decimal system is verbose → compact integers (use e.g. base 256).
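The last point, compact integers, can be illustrated with a short sketch. It assumes a plain big-endian base-256 representation; the function names are hypothetical, and a real encoder would also need a flag or length byte so the decoder knows where the number ends.

```python
def int_to_base256(n):
    """Encode a non-negative integer in as few bytes as possible (big-endian)."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, "big")


def base256_to_int(data):
    """Inverse of int_to_base256."""
    return int.from_bytes(data, "big")


decimal_text = "1234567890"
packed = int_to_base256(int(decimal_text))
print(len(decimal_text), "decimal characters ->", len(packed), "bytes")
assert base256_to_int(packed) == int(decimal_text)
```

Note that leading zeros in the decimal text are lost by this conversion, so a transform would have to leave such strings untouched or record them separately.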
6
Our web-compression-oriented transform, bird's-eye view
  • Design assumption: dedicated for PPM compressors (e.g. PPMd).
  • Semi-dynamic dictionary: use a byte coding for words that appear at least fmin = 64 times in the document. (The dictionary is front-compressed and stored in the archive.)
  • The notion of word comprises also start XML tags, URL prefixes ( http://domain/ ), emails, data, " and "> patterns, runs of spaces.
  • Integers and some other patterns are encoded densely.
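Front compression of a word list can be sketched as below. The slides do not say in which order the dictionary is stored, so this example assumes a lexicographically sorted list; the function names and the (shared prefix length, suffix) tuple format are illustrative only.

```python
def front_encode(sorted_words):
    """Replace each word by (length of prefix shared with previous word, suffix)."""
    encoded = []
    previous = ""
    for word in sorted_words:
        shared = 0
        limit = min(len(word), len(previous))
        while shared < limit and word[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded


def front_decode(encoded):
    """Rebuild the word list from (shared prefix length, suffix) pairs."""
    words = []
    previous = ""
    for shared, suffix in encoded:
        word = previous[:shared] + suffix
        words.append(word)
        previous = word
    return words


words = ["compress", "compression", "compressor", "context"]
packed = front_encode(words)
print(packed)
assert front_decode(packed) == words
```

Neighbouring dictionary entries often share long prefixes, so storing only the suffixes keeps the dictionary overhead in the archive small.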
7
Dictionary coding
  • 1st pass: gather the words of at least lmin = 2 characters, with at least fmin occurrences, and sort them according to frequency.
  • Variable-length coding is used: from 1 up to 4 bytes.
  • The codeword alphabet: the 127-255 range, most of the 0-31 range, plus a few more chars.
  • Non-intersecting value ranges for different codeword bytes, of size w, x, y, z. Namely: w 1-byters, x*w 2-byters, y*x*w 3-byters, z*y*x*w 4-byters.
  • The parameters w, x, y, z are selected according to the size of the created dictionary, with the principle of maximizing the number of short codewords.
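The first pass and the codeword-length arithmetic can be sketched as follows. The tokenizer (runs of ASCII letters), the function names, and the example parameter values are assumptions made for illustration; the actual transform also treats tags, URL prefixes, and other patterns as words.

```python
import re
from collections import Counter


def build_dictionary(text, fmin, lmin=2):
    """Collect words with length >= lmin and count >= fmin, most frequent first."""
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    frequent = [(w, c) for w, c in counts.items() if len(w) >= lmin and c >= fmin]
    frequent.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in frequent]


def codeword_length(rank, w, x, y, z):
    """Codeword length in bytes for the word at 0-based frequency rank."""
    capacities = [w, x * w, y * x * w, z * y * x * w]
    for length, capacity in enumerate(capacities, start=1):
        if rank < capacity:
            return length
        rank -= capacity
    raise ValueError("dictionary too large for these parameters")


print(build_dictionary("aa bb aa cc aa bb b b b", fmin=2))
print([codeword_length(r, w=10, x=5, y=2, z=1) for r in (0, 9, 10, 59, 60)])
```

Because the words are ranked by frequency, the most common ones receive the 1-byte codewords, and a larger w leaves fewer byte values for the longer codewords, which is why the parameters depend on the dictionary size.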
8
Pattern encoding
Some patterns (integers, dates in a specified format, IPs) occur frequently, and can be encoded densely in binary.
Original idea: XMill. In XWP: automatic detection (no need for DTD or human assistance).
XWP handles:
  • integers from 1900...2155 (years): 2 bytes (incl. a flag),
  • other integers: from 2 to 5 bytes (up to 2^32),
  • IP addresses: 5 bytes,
  • dates (e.g., 1980-02-31, 01-MAR-1920): 2 or 3 bytes, differential encoding,
  • times (e.g., 11:30pm, 23:20, 23:30:59): 3 or 4 bytes,
  • page ranges: 4 bytes,
  • floats x.x (0.0...24.9) and .xx: 2 bytes.
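Two of the listed patterns can be sketched to show where the byte counts come from. The flag values and function names below are hypothetical; the slides give only the sizes (2 bytes for a year including a flag, 5 bytes for an IP address).

```python
YEAR_FLAG = 0x01  # hypothetical flag byte, not taken from the slides
IP_FLAG = 0x02    # hypothetical flag byte, not taken from the slides


def encode_year(text):
    """Encode a year in 1900..2155 as a flag byte plus one offset byte."""
    year = int(text)
    if not 1900 <= year <= 2155:
        raise ValueError("year outside the 1900..2155 range")
    return bytes([YEAR_FLAG, year - 1900])  # the offset fits in 0..255


def encode_ip(text):
    """Encode a dotted IPv4 address as a flag byte plus four value bytes."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError("not a dotted IPv4 address")
    return bytes([IP_FLAG, *parts])


print(encode_year("2008"), encode_ip("192.168.0.1"))
```

The 1900...2155 range spans exactly 256 values, so the year offset fits in a single byte next to the flag.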

9
Encoding of time and range patterns
Numbers from 1...12 followed by am or pm are interpreted as times, and encoded on 3 bytes: the time pattern flag, the hour (in 24-h convention), and the minutes.
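The 3-byte time encoding might look like the sketch below. It assumes times written as h:mm followed by am or pm, a hypothetical flag value, and the usual convention that 12am maps to hour 0 and 12pm to hour 12; none of these details are spelled out on the slide.

```python
import re

TIME_FLAG = 0x03  # hypothetical flag byte, not taken from the slides
_TIME_12H = re.compile(r"(\d{1,2}):(\d{2})(am|pm)")


def encode_time_12h(text):
    """Encode a 12-hour time such as '11:30pm' as flag, 24-h hour, minutes."""
    match = _TIME_12H.fullmatch(text)
    if match is None:
        raise ValueError("not a 12-hour time pattern")
    hour12, minute, suffix = int(match.group(1)), int(match.group(2)), match.group(3)
    if not (1 <= hour12 <= 12 and 0 <= minute <= 59):
        raise ValueError("hour or minute out of range")
    hour24 = hour12 % 12 + (12 if suffix == "pm" else 0)
    return bytes([TIME_FLAG, hour24, minute])


print(list(encode_time_12h("11:30pm")))
```

A dedicated flag for the am/pm form lets the decoder restore the original 12-hour notation from the stored 24-hour value.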
10
PPMVC (PPM with variable-length contexts), Skibinski & Grabowski, 2004
Main weakness of most PPM algorithms is poor handling of long matching sequences (as opposed to LZ77 algs, which excel in it).
Using high orders (16): memory-hungry, quite slow, and it's hard to overcome the so-called zero-frequency problem.
A possible solution: coupling PPM with LZ matching. Original idea: PPM* (Cleary et al., 1995). Another implementation: PPMZ (Bloom, 1998).
11
PPMVC, contd.
In PPMVC, each max-order context holds a pointer to the reference context (the previous occurrence of the context) and the minimum left match length.
The left match length (LML): the length of the common part of the active context and the reference context. LML is always at least as large as the maximum PPM order.
The right match length (RML): the length of the matching sequence between the symbols to encode and the symbols following the reference context.
If the left match between the current position and the previous max-order context occurrence is at least minLML, then the RML (0 or more) is sent to the output. If not, plain PPM coding (Shkarin's PPMd) is used.
In practice it is better to quantize RML, e.g. round down to a multiple of 8.
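The LML and RML definitions can be sketched over a byte buffer as follows. Here `pos` is the current coding position and `ref` is the position just after the previous occurrence of the context; the function names, the 255-symbol cap, and the returned tuples are assumptions for illustration, not the PPMVC implementation.

```python
def left_match_length(data, pos, ref, limit=255):
    """Count equal symbols walking backwards from pos and from ref."""
    n = 0
    while n < limit and ref - n > 0 and data[pos - 1 - n] == data[ref - 1 - n]:
        n += 1
    return n


def right_match_length(data, pos, ref, limit=255):
    """Count equal symbols walking forwards from pos and from ref."""
    n = 0
    while n < limit and pos + n < len(data) and data[pos + n] == data[ref + n]:
        n += 1
    return n


def quantize(rml, step=8):
    """Round the right match length down to a multiple of step."""
    return rml - rml % step


def choose_coding(data, pos, ref, min_lml):
    """Return ('match', quantized RML) if the left match is long enough, else ('ppm', None)."""
    if left_match_length(data, pos, ref) >= min_lml:
        return ("match", quantize(right_match_length(data, pos, ref)))
    return ("ppm", None)


data = b"the quick brown fox jumps. the quick brown fox jumps!"
ref = data.index(b"brown")
pos = data.index(b"brown", ref + 1)
print(left_match_length(data, pos, ref), right_match_length(data, pos, ref))
print(choose_coding(data, pos, ref, min_lml=8))
```

Quantizing the RML shrinks the alphabet of match lengths the coder has to model; the symbols past the rounded-down length are then left to ordinary PPM coding.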
12
Fast PAQ
A relatively fast compressor from the PAQ
(Mahoney, 2002-2007) family.
  • PAQ features:
    • working on bit level,
    • mixing predictions from various models run in parallel (PPM-like models, string matching model, word model, tabular data model, etc.),
    • mixing predictions with several neural networks,
    • adaptive probability maps (APM) mechanism to update the models considering previous experience and the current context,
    • extremely high compression, extremely slow.
  • FastPAQ features:
    • models irrelevant for XML removed,
    • APM stages simplified,
    • much faster than PAQ8 for a reasonable loss in compression.
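Bit-level mixing of model predictions can be sketched with a single logistic mixer, in the style documented for later PAQ versions. This is a simplified floating-point illustration with made-up weights and learning rate; real PAQ-family coders use integer arithmetic, many more inputs, and weight sets selected by context, and FastPAQ's exact mixer is not described on the slides.

```python
import math


def stretch(p):
    """Map a probability in (0, 1) to the logistic domain."""
    return math.log(p / (1.0 - p))


def squash(x):
    """Inverse of stretch: map a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def mix(predictions, weights):
    """Combine per-model probabilities that the next bit is 1."""
    return squash(sum(w * stretch(p) for w, p in zip(weights, predictions)))


def update(weights, predictions, p_mixed, bit, learning_rate=0.02):
    """Move weights toward the models that predicted the coded bit better."""
    error = bit - p_mixed
    return [w + learning_rate * error * stretch(p)
            for w, p in zip(weights, predictions)]


weights = [0.5, 0.5]
predictions = [0.9, 0.2]  # two hypothetical models disagree about the next bit
p = mix(predictions, weights)
weights = update(weights, predictions, p, bit=1)
print(round(p, 3), [round(w, 4) for w in weights])
```

After a 1 bit is coded, the weight of the model that predicted 1 grows and the other shrinks, which is how the mixer learns which models to trust.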

13
Experiments: databases
14
Enwikinews, excerpt
15
Swissprot, excerpt
16
DBLP, excerpt
17
Experiments: methodology etc.
The test machine: Intel Core 2 Duo E6600 2.40 GHz, 1 GB RAM, two Seagate 250 GB SATA drives in RAID mode 1, Windows XP (64-bit).
Implementation: C++ (Visual C++ 6.0).
XML-WRT v3.1 with sources: http://www.ii.uni.wroc.pl/inikep/.
Back-end compressors used: gzip 1.2.4, Pavlov's LZMA (used in 7-zip), Shkarin's PPMd, PPMVC, FastPAQ.
18
Experimental results
19
Decompression and transmission times
XWRT3+PPMVC: best choice for transmission speeds up to 384 Kbps. For 1 Mbps, it succumbs only to XWRT2 (our prev. scheme) + LZMA. Still, XWRT3 decompression is streamlined.
20
Conclusions
XWRT3 (XWP transform + PPMVC) seems to be the best choice for transmitting XML documents over slow / moderate-speed networks.
For high-bandwidth networks XWP+PPMVC may be slower in retrieving a document than XWRT2+LZMA, but both the transform (XWP) and the coder (PPMVC) components are streamlined, i.e. immediate display of the (beginning of the) document is possible.
Best XML compression ratios presented so far: with PPMVC, outperforming SCMPPM by 9% on avg; with FastPAQ (alas, impractical), an extra 9% avg gain.
21
Experiments: databases