Column-Stores%20vs.%20Row-stores - PowerPoint PPT Presentation

About This Presentation
Title:

Column-Stores%20vs.%20Row-stores

Description:

... atores.store_id AND stores.country = 'Canada' GROUP BY store_name ... Available : Index maps values to tuple ids. Required : Tuple ids to attribute values ... – PowerPoint PPT presentation

Number of Views:156
Avg rating:3.0/5.0
Slides: 29
Provided by: Harikr
Category:

less

Transcript and Presenter's Notes

Title: Column-Stores%20vs.%20Row-stores


1
Column-Stores vs. Row-stores
  • Harikrishna Bikmal
  • 05d05019

2
Row vs. Column Stores
3
Column Stores
  • Really good for read-mostly data-warehouses
  • Lots of column scans and aggregations
  • Writes tend to be in batch
  • Top 3 in TPC-H rankings are column stores(Exasol,
    ParAccel and Kickfire)
  • Yahoos world largest data warehouse is a column
    store

4
Paper Looks At Key Question
  • How much of the buzz around column-stores just
    marketing hype?
  • Do you really need to buy Sybase IQ or Vertica?
  • How far will your current row-store take you?
  • Can you get column-store performance from a
    row-store?
  • Can you simulate a column-store in a row-store?

5
Benchmark
  • Star Schema Benchmark (SSMB)
  • Fact table 17 columns, 60,000,000 rows
  • Table LineOrder
  • 4 dimension tables largest one 80,000 rows
  • Tables Customer, Supplier, Part, Date
  • Contains 13 queries divided into four categories
    or flights
  • Used System-x row-store database and for column
    store they used C-store

6
Simulate Column-Store Inside Row-Store
7
Average Query Time
8
Vertical Partitioning
  • Tuple overheads
  • Traditional row store takes 4GB (compressed)
  • Vertical partitioning takes 0.7 to 1.1 GB per
    column(compressed)
  • Column joins
  • Expensive hash joins
  • Were unable to use merge join

9
Vertical Partitioning
  • Horizontal Partitioning
  • Fact table is vertically partitioned using year
    attribute of date table (dimension table)
  • Most queries in SSMB have predicate on year
    attribute
  • Vertically partitioned tables do not contain date
    foreign key (except one table)

10
Index only plans
  • Common type of query
  • SELECT store_name, SUM(revenue)FROM Facts,
    Stores WHERE fact.store_id atores.store_id AND
    stores.country Canada GROUP BY store_name
  • Contains predicates on fact table using dimension
    tables and need to extract SELECT attributes for
    these tuple ids

11
Index only plans
  • Available Index maps values to tuple ids
  • Required Tuple ids to attribute values
  • Tuple construction is slow !
  • Expensive hash joins of the columns of the
    fact table before filtering using dimension tables

12
So..
  • All indexes approach is a poor way to simulate a
    column-store
  • Problems with vertical partitioning are NOT
    fundamental
  • Store tuple header in a separate partition
  • Allow virtual TIDs
  • Allow HP using a foreign key on a different VP
  • So can row-stores simulate column-stores?

13
Optimal set of Materialized Views
14
Column-Store Experiments
  • Start with column-store (C-Store)
  • Remove column-store-specific performance
    optimizations
  • End with column-store with a row-oriented query
    executer

15
Column Store
  • Column Store specific optimizations
  • Compression
  • Late Materialization
  • Block Processing
  • In visible Join

16
Column Store
  • Compression
  • Higher data value locality in column-stores
  • Better ratio - reduced I/O
  • Can use schemes like
  • run-length encoding
  • Easy to operate on directly for improved
    performance

17
Column Store
  • Late Materialization
  • Select and Aggregate operations render the
    construction of some tuples unnecessary
  • Delays decompression
  • Better cache performance
  • Block Iteration
  • Attributes extraction can be done in blocks
    instead of iterating on rows

18
Column Store
  • Invisible Join
  • Typical query

19
Invisible Join
  • Typical query
  • Contains selection predicates on fact table using
    one or more dimension tables, need to extract
    SELECT attributes for these tuple ids,
    aggregation using other dimension tables
  • Joins between fact table and dimension tables for
    each selection predicate, aggregate grouping

20
Invisible Join
  • Previous approaches either used early
    materialization or the ones with late
    materialization had many out of order attribute
    extraction from the dimension tables

21
Invisible Join
  • Step 1

22
Invisible Join
  • Step 2

23
Invisible Join
  • Step 3

24
Invisible Join
  • Between-Predicate Rewriting
  • Use of range predicates instead of hash lookup in
    step 1 of invisible join
  • If the set of keys obtained for join in step 1
    results in a complete range of keys
  • Use of dictionary table in case of predicates on
    sorted fields
  • Invisible join gave up to 75 improvement in some
    queries

25
Break down of column store advantages
t block processing (T) C compression (c) I
Invisible join (i) L late Materialization(l)
26
Conclusions
  • Attempt to emulate the physical layout of a
    column-store in a row-store
  • Broke down the reasons why a column-store is able
    to process column-oriented data so efficiently
  • Introduced Invisible join

27
Questions
28
Thank You
Write a Comment
User Comments (0)
About PowerShow.com