Guadalupe Canahuate1, Hakan Ferhatosmanoglu1, Ali Pinar2 1The Ohio State University 2 Lawrence Berke - PowerPoint PPT Presentation

Guadalupe Canahuate¹, Hakan Ferhatosmanoglu¹, Ali Pinar²
¹The Ohio State University, ²Lawrence Berkeley Laboratory
IMPROVING BITMAP COMPRESSION BY DATA
REORGANIZATION: AN INTEGRATED FRAMEWORK
We measure the effectiveness of our method by the
improvement factor (IF), which we compute as the
ratio of the compressed bitmap table size of the
original data to the compressed bitmap table size
of the reordered data, i.e.,
IF = (compressed size of original data) /
     (compressed size of reordered data).
Thus, an improvement factor of 5 means that the
compressed reordered data takes one fifth of the
space of the compressed original.
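As a small illustration of the ratio above, the following sketch computes the improvement factor from two compressed sizes. The function name and the use of byte counts are assumptions made for this example, not something taken from the poster.

```python
def improvement_factor(original_compressed_size: int,
                       reordered_compressed_size: int) -> float:
    """Ratio of the compressed size before reordering to the size after.

    Both arguments are assumed to be sizes in the same unit (e.g. bytes).
    """
    if reordered_compressed_size <= 0:
        raise ValueError("reordered_compressed_size must be positive")
    return original_compressed_size / reordered_compressed_size


# A table that compresses to 500 units originally and 100 units after
# reordering has an improvement factor of 5.
print(improvement_factor(500, 100))
```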
  • Massive volumes of data are being produced by
    observations and simulations in many scientific
    applications.
  • Most of these scientific databases are read-only,
    i.e., large volumes of data are stored once and
    never updated.
  • Bitmap indexing has been widely used for
    scientific data and query processing. Data is
    represented as a two-dimensional table of 0s and
    1s, and compression is used for effective storage.
  • Most compression schemes are based on run-length
    encoding, which compresses the data and also
    enables fast bitwise logical operations over the
    compressed bitmaps, translating to faster query
    processing.
  • We study how to reorder tuples in the database to
    achieve higher compression rates. Our techniques
    are used as a preprocessing step before
    compression, solely to improve performance,
    without affecting the algorithms used for
    compression and querying.
  • We state this tuple reordering problem as a
    combinatorial optimization problem and propose
    heuristics that give effective solutions to this
    NP-complete problem.

The goal is to align the data to produce longer
uniform segments from which the run-length
compression can benefit.
Gray code encoding favors the first few columns.
Pick the next column to maximize the overall
compression ratio.
The HEP dataset comes from high-energy physics
experiments, histogram comes from an image
database, and stock is time-series data of stock
price movements.
Column Selection Criteria
The recursive algorithm:
1) Sort all rows on the first column, thus
   dividing the rows into two parts: the first part
   has 0 in the first column and the second part
   has 1.
2) Apply the Gray code ordering algorithm to the
   first part, beginning from the next column.
3) Apply the REVERSE Gray code ordering algorithm
   to the second part, beginning from the next
   column.
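The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes rows are tuples of 0/1 values and always moves on to the next column in the original column order, whereas the framework described here chooses the next column with a selection criterion. The name `gray_order` is hypothetical.

```python
def gray_order(rows, col=0, reverse=False):
    """Return the rows sorted in (reverse) Gray code order from column `col`."""
    # Nothing left to order: a single row, or no more columns to split on.
    if len(rows) <= 1 or col >= len(rows[0]):
        return list(rows)

    # Step 1: split on the current column.
    zeros = [r for r in rows if r[col] == 0]
    ones = [r for r in rows if r[col] == 1]

    # In forward order the 0-part comes first; in reverse order the 1-part does.
    first, second = (ones, zeros) if reverse else (zeros, ones)

    # Steps 2 and 3: Gray order on the first part, reverse Gray order on the
    # second part, both starting from the next column.
    return (gray_order(first, col + 1, reverse=False)
            + gray_order(second, col + 1, reverse=True))


rows = [(1, 0, 0), (0, 1, 1), (0, 0, 0), (1, 1, 1),
        (0, 1, 0), (1, 0, 1), (0, 0, 1), (1, 1, 0)]
for row in gray_order(rows):
    print(row)
```

Applied to all eight 3-bit rows, the sketch yields an order in which consecutive rows differ in exactly one bit, which is what produces longer runs within each column.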
Column Compressibility
Runs in the column
The longest run in a column
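The two run-based quantities listed above can be computed for a single column as in the sketch below. How these quantities are combined into a column selection criterion is not specified here, so the helper (the hypothetical `run_stats`) only reports the raw numbers.

```python
from itertools import groupby


def run_stats(column):
    """Return (number of runs, length of the longest run) for a 0/1 column."""
    # groupby yields one group per maximal block of equal consecutive values.
    lengths = [sum(1 for _ in group) for _, group in groupby(column)]
    return len(lengths), max(lengths, default=0)


print(run_stats([0, 0, 1, 1, 1, 0]))
```

Fewer runs and longer runs both indicate a column that run-length encoding can compress well.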
  • Bitmap compression can be greatly improved by
    data reorganization.
  • We proposed an integrated framework that exploits
    the idea of Gray codes. Our algorithm requires no
    extra storage and shows a 3-15x improvement in
    compression over already compressed bitmaps.
  • Our technique performs well for both equality
    encoded and range encoded bitmaps and can be
    applied in general to any binary matrix.
  • Bitmaps: each binary row in the bitmap represents
    one tuple in the database. Data is quantized and
    encoded according to the category to which its
    attribute value belongs. Equality encoding (EE)
    and range encoding (RE) bitmaps are two types of
    bitmaps, suited to point queries and range
    queries, respectively.


This work was supported by the U.S. Department of
Energy (DOE) under contract DE-AC03-76SF00098 and
the DOE Award No. DE-FG02-03ER25573. We thank Tao
Tao from University of Illinois at
Urbana-Champaign and Yong Su from The Ohio State
University for their involvement in some aspects
of this work.
For EE, we put 1 in the bit for the value's
category and 0 in the others. For RE, we put 1 in
the bit for the value's category and in all the
later categories.
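The two encodings can be illustrated for a single attribute value as follows. This sketch follows the description above literally and assumes categories are numbered from 0; the function names are hypothetical.

```python
def equality_encode(category, num_categories):
    """EE: 1 in the value's category, 0 everywhere else."""
    return [1 if c == category else 0 for c in range(num_categories)]


def range_encode(category, num_categories):
    """RE: 1 in the value's category and in every later category."""
    return [1 if c >= category else 0 for c in range(num_categories)]


# A value that falls in category 1 out of 4 categories:
print(equality_encode(1, 4))  # [0, 1, 0, 0]
print(range_encode(1, 4))     # [0, 1, 1, 1]
```

Concatenating these bit vectors over all attributes gives one binary row of the bitmap table per tuple.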
  • Word-Aligned Hybrid (WAH) code supports query
    execution over the compressed bitmaps. WAH
    partitions the data into fill words and literal
    words. The most significant bit indicates the
    type of word.
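A simplified WAH-style encoder for one bitmap column is sketched below. The text above only states that the most significant bit marks the word type; the rest of the layout used here is an assumption for illustration: 32-bit words, 31-bit groups, a literal word with the top bit 0, and a fill word with the top bit 1, the next bit holding the fill value and the low 30 bits counting groups. Padding the last partial group with zeros is also a simplification.

```python
WORD_BITS = 32                         # assumed word size
GROUP_BITS = WORD_BITS - 1             # payload bits per literal word
MAX_FILL = (1 << (WORD_BITS - 2)) - 1  # largest group count in a fill word


def wah_encode(bits):
    """Encode a list of 0/1 bits into a list of WAH-style integer words."""
    words = []
    for start in range(0, len(bits), GROUP_BITS):
        group = list(bits[start:start + GROUP_BITS])
        group += [0] * (GROUP_BITS - len(group))  # pad the final group
        value = int("".join(map(str, group)), 2)

        if value == 0 or value == (1 << GROUP_BITS) - 1:
            # Uniform group: emit or extend a fill word.
            fill_bit = 1 if value else 0
            header = (1 << (WORD_BITS - 1)) | (fill_bit << (WORD_BITS - 2))
            same_fill = bool(words) and (
                (words[-1] >> (WORD_BITS - 2)) == (header >> (WORD_BITS - 2)))
            if same_fill and (words[-1] & MAX_FILL) < MAX_FILL:
                words[-1] += 1             # one more group in the same fill
            else:
                words.append(header | 1)   # new fill covering one group
        else:
            words.append(value)            # literal word, top bit is 0
    return words


# 62 zeros (two uniform groups) followed by a mixed group.
print([hex(w) for w in wah_encode([0] * 62 + [1] + [0] * 30)])
```

Longer uniform segments collapse into fewer fill words, which is why reordering rows to lengthen runs reduces the compressed size.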

[1] A. Pinar, T. Tao, and H. Ferhatosmanoglu,
    Compressing bitmap indices by data
    reorganization, in Proceedings of the 21st
    International Conference on Data Engineering.
    IEEE Computer Society, 2005.
[2] D. S. Richards, Data compression and Gray-code
    sorting, Inf. Process. Lett., vol. 22, no. 4,
    pp. 201-205, 1986.
[3] T. Johnson, Performance measurements of
    compressed bitmap indices, in VLDB '99:
    Proceedings of the 25th International Conference
    on Very Large Data Bases. Morgan Kaufmann
    Publishers Inc., 1999, pp. 278-289.
[4] C.-Y. Chan and Y. E. Ioannidis, An efficient
    bitmap encoding scheme for selection queries,
    SIGMOD Rec., 28(2):215-226, 1999.
  • Gray codes: an encoding of numbers such that
    adjacent numbers differ in only 1 digit. For
    instance, (000, 001, 011, 010, 110, 111, 101,
    100) is a binary Gray code.
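The example sequence above is the binary reflected Gray code, which can be generated with the standard formula i XOR (i >> 1). The function name below is chosen for this illustration.

```python
def binary_gray_code(num_bits):
    """Return the binary reflected Gray code on num_bits bits as bit strings."""
    return [format(i ^ (i >> 1), f"0{num_bits}b") for i in range(1 << num_bits)]


print(binary_gray_code(3))
# ['000', '001', '011', '010', '110', '111', '101', '100']
```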

Scalability
Performance
The ordering algorithm is linear in the size of
the dataset and the nextCol function cost.
  • Bitmap indexing has been the most popular
    approach for scientific databases in various
    domains, such as:
  • Biology: genomic and proteomic technologies.
  • High-energy physics: simulations are continuously
    run, and notable events are stored with all the
    details.
  • Climate modeling: sensor data.
  • Astrophysics: telescopes devoted to
    observations.
  • In addition, bitmap indexing is used in many
    major commercial database systems, such as Oracle,
    Informix, and DB2, among others.

The result using Ci is much better than the best
outcome found from 10,000 random permutation
functions. The probability of randomly getting a
better permutation function is only 4.3 x 10^-9.
The improvement factor is up to 9.6 when using the
original order of the columns.