Scaling to Infinity: Partitioning in Data Warehouses in Oracle

Transcript and Presenter's Notes
1
North Carolina Oracle Users Group
Scaling to Infinity: Partitioning Data Warehouses in Oracle
Tim Gorman, Evergreen Database Technologies, Inc.
http://www.EvDBT.com
2
Speaker Qualifications
  • Tim Gorman (chief, cook, and bottle-washer at
    EvDBT.com)
  • Director of RMOUG Training Days 2007 conference
  • Info online at http://www.RMOUG.org
  • Co-author (with Gary Dodge):
  • Oracle8 Data Warehousing (John Wiley & Sons, 1998)
  • Essential Oracle8i Data Warehousing (John Wiley &
    Sons, 2000)
  • Co-author (with the Oak Table Network --
    http://www.OakTable.net):
  • Oracle Insights: Tales of the Oak Table (Apress,
    2004)
  • Oracle data warehousing DBA since 1994
  • Technical manager at Oracle Consulting
  • Independent consultant since 1998

3
Agenda
  • I've seen the good, the bad, and the ugly
  • In the end, successful data warehouses are a
    combination of all three
  • But in general, I see three major errors that
    result in doom
  • Ignore basic requirements for DW and design what
    is familiar
  • Fail to portray data changes over time
  • Fail to utilize partitioning from the beginning

4
Ignoring the requirements
  • Repeat after me -- reporting and analysis
    applications do not enforce business rules!
  • Reporting and analysis applications are
    responsible for presenting data in the format
    that works best for end-users and their
    query/analysis tools
  • Very often, what end-users seem to want is a
    simple spreadsheet, hundreds of columns wide
  • GIVE THEM WHAT THEY WANT!
  • Conceal from them what it takes to provide what
    they want
  • Do NOT build a data model to enforce referential
    integrity and/or business rules

5
Ignoring the requirements
  • Third-normal form
  • Eliminate repeating groups
  • Every attribute is atomic and scalar
  • Eliminate functional dependencies on composite
    key components
  • Every attribute is functionally dependent on the
    whole key
  • Eliminate functional dependencies on non-key
    components
  • Every fact/attribute in the entity should rely on
    the whole key
  • 4th, 5th, and 6th normal forms have been defined
  • But most entities that are in 3NF are also 4th,
    5th, and 6th NF
  • Intended for use in process-oriented operational
    systems
  • enforce data integrity according to business
    rules
  • using referential-integrity constraint mechanisms
    in application code as well as databases

6
Ignoring the requirements
  • Data presented in a simplistic dimensional model
    versus the 3rd-normal-form (3NF)
    entity-relationship model used by most
    operational systems
  • Ralph Kimball discusses in The Data Warehouse
    Toolkit
  • John Wiley & Sons, ISBN 0471153370
  • provide immediate, on-demand, and
    high-performance access to corporate or
    organizational subject data
  • comprised of fact tables containing varying
    levels of summarized data and dimension tables
    representing important subject areas
  • very simple representation of data
  • It is a spreadsheet with one degree of
    normalization for flexibility
  • also known as a star schema because diagrams
    generally represent the fact table as a hub and
    dimensions as spokes

7
Ignoring the requirements
[Diagram: a transactional/operational entity-relationship
model (Customers, Suppliers, Orders, Order Lines, Products)
contrasted with a dimensional model -- Order Facts as the
hub, with Products Dim, Suppliers Dim, Customers Dim, and
Time Dim as the spokes. A sample star query follows.]
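
To make the contrast concrete, a minimal sketch of the kind of
query the dimensional model serves; the table names come from
the diagram above, while the key and measure column names are
assumptions for illustration only:

    SELECT t.calendar_month, p.product_name,
           SUM(f.order_amount) total_amount
    FROM   order_facts f, time_dim t, products_dim p
    WHERE  f.time_key    = t.time_key      -- fact joins each spoke
    AND    f.product_key = p.product_key   -- on its dimension key
    GROUP BY t.calendar_month, p.product_name;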
8
Ignoring the requirements
  • Fact tables
  • More volatile
  • Contain columns for
  • Dimension keys
  • Measures
  • In a spreadsheet or tabular report
  • Dimension keys don't appear at all
  • Measures appear in the cells of the report
  • Dimension tables
  • Usually more static
  • Although the dimension for people is usually
    quite volatile
  • Contain columns for
  • Dimension keys
  • Attributes
  • In a spreadsheet or tabular report
  • Dimension keys don't appear at all
  • Attributes appear as column headers or row
    headers

9
Time-variant data, who cares?
  • Two major types of queries from business
    intelligence applications to data warehouse
    databases
  • Point in time
  • What is the present situation? What do the
    numbers look like now?
  • Situational awareness applications, also known
    as dashboards or executive information
    systems
  • Usually uses the present point in time, but could
    also use any specific point in time in the past
  • Trend analysis
  • How do things look now versus 3 months ago? A
    year ago?
  • How have things changed day-by-day over the past
    quarter? Week-by-week over the past year?
    Month-by-month over the past 7 years?

10
Time-variant data, who cares?
  • Consider this
  • Dimension tables are usually designed to be
    point-in-time or type-1
  • People, items, products, etc.
  • Locations, time, etc.
  • Fact tables are almost always designed to be
    time-variant or type-2
  • Transactions
  • What happens when you join transactions from
    years ago with dimensional attributes from the
    present?
  • For example, when analyzing purchases by
    location, does it make sense to summarize all
    transactions by a person's present location?
  • Or should it reflect the person's location at the
    time of the transaction?

11
Time-variant data, who cares?
  • Every data warehouse has at least one
    slowly-changing dimension (SCD)
  • Usually involving people (i.e. accounts,
    customers, employees, parties, etc.)
  • Static dimensions do not need to be time-variant
  • Identifying static dimensions: if a change is
    made to the dimension, should it be reflected
    across all time?
  • SCDs should be represented as type-2
  • type-1 views of SCDs can be created as needed
    (a view sketch follows this list)
  • type-1 views of fact tables can also be
    created, if necessary, to support point-in-time
    tactical reporting
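
A minimal sketch of such a type-1 view, built over the
PERSON_DIM table shown on the next slide; the ROW_NUMBER
approach mirrors the rebuild query used later on slide 28:

    CREATE OR REPLACE VIEW curr_person_dim AS
    SELECT person_key, eff_dt, last_name, first_name,
           address_1, address_2, city
    FROM  (SELECT p.*,
                  ROW_NUMBER() OVER (PARTITION BY person_key
                                     ORDER BY eff_dt DESC) rn
           FROM   person_dim p)
    WHERE  rn = 1;   -- keep only the most recent row per person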

12
Time-variant data, who cares?
Type-2 Dimension (time-variant): PERSON_DIM
  PK  Person_key
  PK  Eff_dt
      Last_name, First_name, Address_1, Address_2, City

Type-1 Dimension (point-in-time view): CURR_PERSON_DIM
  PK  Person_key
      Eff_dt (non-key)
      Last_name, First_name, Address_1, Address_2, City
13
Time-variant data, who cares?
[Diagram: TXN_FACT carries both Person_key and Person_eff_dt
as foreign keys, so each fact row can join either to
PERSON_DIM (type-2, on both columns) for attributes as of
transaction time, or to CURR_PERSON_DIM (type-1, on
Person_key alone) for present attributes]
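
A sketch of the two joins this structure supports, echoing the
purchases-by-location example from slide 10; the CITY column and
aggregation are illustrative assumptions:

    -- Trend analysis: dimension attributes as of transaction time
    SELECT p.city, COUNT(*) txn_count
    FROM   txn_fact f, person_dim p
    WHERE  f.person_key    = p.person_key
    AND    f.person_eff_dt = p.eff_dt
    GROUP BY p.city;

    -- Point-in-time: the person's present city instead
    SELECT c.city, COUNT(*) txn_count
    FROM   txn_fact f, curr_person_dim c
    WHERE  f.person_key = c.person_key
    GROUP BY c.city;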

14
Time-variant data, who cares?
  • Slowly-changing dimensions should always be
    type-2
  • With type-1 views constructed using the
    just-loaded type-2 data
  • So, with this in mind
  • Why do people so often treat time-variant tables
    as an after-thought?
  • Why do extraction-transformation-loading (ETL)
    processes so often focus on MERGE logic (if
    row doesnt exist then INSERT else UPDATE) on
    the current point-in-time tables, and then insert
    change data as an after-thought
  • a.k.a. type-1 or point-in-time data
  • Instead of
  • inserting change data into the time-variant
    type-2 table from which point-in-time type-1
    views (as materialized views?) can be built for
    any point-in-time?
  • Think about it
  • If users should be using type-2 data for SCDs,
    who usually utilizes the type-1 views of the
    SCDs? What are they good for?

15
Four characteristics of a DW
  • Non-volatile, time-variant, subject-oriented,
    integrated
  • Bill Inmon, Building the Data Warehouse, 3rd Ed.
    (Wiley, 2002)
  • Think about what these mean
  • Consider the converse of these characteristics:
  • Volatile? Static-image? Process-oriented?
    Application-specific?
  • Time-variant, non-volatile database implies
  • Insert, index, and analyze each row of data only
    once
  • From an implementation perspective, this is vital
    to remember! And often ignored completely!!!
  • Consider an extreme situation:
  • Analytical database for quantum research in
    physics
  • 50 Tbytes of data to load every day

16
The Virtuous Cycle
  • Insert-only processing enables
  • Tables and indexes partitioned by time
  • Optionally sub-partitioned by other key values
    (a DDL sketch follows this list)
  • Partitioned tables/indexes enable
  • Partition pruning during queries
  • Direct-path loads using EXCHANGE PARTITION
  • Time-variant tables/indexes and tablespaces
  • Purging using DROP or TRUNCATE partition instead
    of DELETE
  • Partition pruning enables
  • Infinite scalability for queries, regardless of
    how large the database becomes
  • Direct-path (a.k.a. append) loads enable
  • Ability to load more data, faster, more
    efficiently
  • Table compression
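
A minimal DDL sketch of such a table, using the TXN, TXN_DATE,
and ACCT_KEY names from the EXCHANGE PARTITION example later in
the deck; the AMOUNT column and subpartition count are
assumptions:

    CREATE TABLE txn
    ( txn_date  DATE    NOT NULL   -- range partition key (load cycle)
    , acct_key  NUMBER  NOT NULL   -- hash subpartition key (pruning)
    , amount    NUMBER             -- hypothetical measure column
    )
    PARTITION BY RANGE (txn_date)
    SUBPARTITION BY HASH (acct_key) SUBPARTITIONS 8
    ( PARTITION p20040224
        VALUES LESS THAN (TO_DATE('2004-02-25','YYYY-MM-DD'))
    , PARTITION p20040225
        VALUES LESS THAN (TO_DATE('2004-02-26','YYYY-MM-DD'))
    );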

17
The Virtuous Cycle
  • Time-variant partitioned tables/indexes enable
  • Time-variant tablespaces
  • Time-variant tablespaces enable
  • READ ONLY tablespaces for older, less-volatile
    data
  • READ ONLY tablespaces enable
  • Near-line storage (i.e. NAS, SAMFS/HFS, etc.)
  • Right-sizing of storage to the need, classified
    by IOPS
  • Backup efficiencies
  • READ WRITE tablespaces scheduled for backup every
    day or week
  • READ ONLY tablespaces scheduled for backup every
    quarter or year (a sketch follows this list)
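
A sketch of the mechanics, with a hypothetical tablespace name;
SKIP READONLY is standard RMAN syntax:

    -- Once a period's tablespace holds only closed data:
    ALTER TABLESPACE txn_2004q1 READ ONLY;

    -- RMAN can then omit it from the routine backup cycle:
    -- RMAN> BACKUP DATABASE SKIP READONLY;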

18
The Virtuous Cycle
  • Using EXCHANGE PARTITION for loads enables
  • Elimination of ETL load window and 24x7
    availability for queries
  • Direct-path loads
  • Bitmap indexes and bitmap-join indexes
  • Bitmap indexes enable
  • Star transformations on star (dimensional)
    schemas (an enabling sketch follows this list)
  • Star transformations enable
  • Bitmap-join indexes
  • SUCCESS!
  • optimal query-execution plan for dimensional data
    models!
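
A sketch of the enabling pieces, assuming the TXN fact table
from earlier; the index name is hypothetical:

    -- Star transformation must be switched on:
    ALTER SESSION SET star_transformation_enabled = TRUE;

    -- and each dimension-key column on the fact table needs a
    -- bitmap index (LOCAL, so it survives partition operations):
    CREATE BITMAP INDEX txn_acct_key_bx ON txn (acct_key)
      LOCAL NOLOGGING COMPUTE STATISTICS;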

19
The Death Spiral
  • Volatile data presented in a static-image
    according to process-oriented concepts leads to
  • ETL using conventional-path INSERT, UPDATE, and
    DELETE operations (including MERGE and
    multi-table INSERT)
  • Conventional-path operations cause trouble with
  • Bitmap indexes and bitmap-join indexes
  • Forcing frequent complete rebuilds until they get
    too big
  • Contention in Shared Pool, Buffer Cache, global
    structures
  • Mixing of queries and loads simultaneously on
    table and indexes
  • Periodic rebuilds/reorgs of tables if deletions
    occur
  • Full redo logging and undo transaction tracking
  • ETL will dominate the workload in the database
  • Queries will consist mainly of dumps or
    extracts to downstream systems
  • Query performance will be abysmal and worsening

20
The Death Spiral
  • Without partitioning
  • Query performance worsens as tables/indexes grow
    larger
  • Loads must be performed into live tables
  • Users must be locked out during load cycle
  • In-flight queries must be killed during load
    cycle
  • Bitmap indexes must be dropped/rebuilt during
    load cycle
  • Entire tables must be re-analyzed during load
    cycle
  • Entire database must be backed up frequently
  • Data cannot be right-sized to storage options
    according to IOPS
  • Everything just gets harder and harder to do
  • and that stupid Oracle software is to blame
  • BRING ON TERADATA! OR ...

21
Exchange Partition
  • The basic technique of bulk-loading new data into
    a temporary load table, which is then indexed,
    analyzed, and then published all at once to
    end-users using the EXCHANGE PARTITION operation,
    should be the default load technique for all
    large tables in a data warehouse
  • fact tables
  • slowly-changing or quickly-changing dimensions
  • Assumptions for this example
  • Composite partitioned fact table named TXN
  • Range partitioned on DATE column TXN_DATE
  • Hash partitioned on NUMBER column ACCT_KEY
  • Data to be loaded into partition P20040225 on TXN

22
Exchange Partition
[Diagram: composite-partitioned table TXN with range
partitions for 22-Feb-2004, 23-Feb-2004, 24-Feb-2004, and an
empty 25-Feb-2004 partition. Steps: 1. create temp table
(hash-partitioned TXN_TEMP); 2. bulk loads; 3. index
creates; 4. table/column stats; 5. EXCHANGE PARTITION swaps
TXN_TEMP with the empty 25-Feb-2004 partition]
23
Exchange Partition
  • Create temporary table TXN_TEMP as a
    hash-partitioned table
  • Perform parallel, direct-path load of new data
    into TXN_TEMP
  • Create indexes on the temporary hash-partitioned
    table TXN_TEMP corresponding to the local indexes
    on TXN
  • using PARALLEL, NOLOGGING, and COMPUTE STATISTICS
    options
  • Gather CBO statistics on table TXN_TEMP
  • Only table and column stats -- keep the index
    stats already computed at creation!
  • alter table TXN
    exchange partition P20040225 with table TXN_TEMP
    including indexes without validation
    update global indexes;

24
Exchange Partition
  • It is a good idea to encapsulate this logic
    inside PL/SQL packaged or stored procedures

    SQL> execute exchpart.prepare('TXN_FACT','TMP_', -
      2  '25-FEB-2004')
    SQL> alter session enable parallel dml;
    SQL> insert /*+ append parallel(n,4) */
      2  into tmp_txn_fact n
      3  select /*+ full(x) parallel(x,4) */ *
      4  from stage_txn_fact x
      5  where load_date >= '25-FEB-2004'
      6  and load_date < '26-FEB-2004';
    SQL> commit;
    SQL> execute exchpart.finish('TXN_FACT','TMP_')

  • DDL for exchpart.sql posted at
    http://www.EvDBT.com/tools.htm

25
Exchange Partition
  • Loading time-variant fact and dimension tables is
    not the only load activity in most data
    warehouses
  • Often, some tables contain current or
    point-in-time data
  • Example: a type-1 dimension snowflaked from a
    type-2 dimension
  • This is often an excellent situation for
    materialized views
  • But, as is often the case, the refresh mechanisms
    built in with materialized views might not be the
    most efficient (a sketch follows this list)
  • With each load cycle, the current images need to
    be updated
  • Instead of performing transactional MERGE (i.e.
    update-or-insert) logic directly on the table
  • Rebuild the table into a temporary table, then
    swap it in using EXCHANGE PARTITION
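
A sketch of the materialized-view option mentioned above,
reusing the ROW_NUMBER pattern from slide 28; a COMPLETE
refresh rewrites the whole table each cycle, which is part of
why the manual rebuild-and-exchange on the next slides can be
more efficient:

    CREATE MATERIALIZED VIEW curr_account_dim_mv
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    AS
    SELECT acctkey, effdt   -- plus the remaining attribute columns
    FROM  (SELECT a.acctkey, a.effdt,
                  ROW_NUMBER() OVER (PARTITION BY acctkey
                                     ORDER BY effdt DESC) ranking
           FROM   account_dim a)
    WHERE  ranking = 1;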

26
Exchange Partition
[Diagram: a merge/build operation reads the
composite-partitioned table ACCOUNT_DIM (partitions for
23-Feb-2004, 24-Feb-2004, 25-Feb-2004) together with the
composite-partitioned table CURR_ACCOUNT_DIM and writes the
result into a hash-partitioned temporary table]
27
Exchange Partition
[Diagram: EXCHANGE PARTITION swaps the previous cycle's
current-image data (composite-partitioned table
CURR_ACCOUNT_DIM, single partition named PZERO) with the new
current-image data (hash-partitioned table
CURR_ACCT_DIM_TEMP)]
28
Exchange Partition
    INSERT /*+ append parallel(t,8) */ INTO tmp_curr_account_dim t
    SELECT /*+ full(x) parallel(x,8) */
           0 partkey, acctkey, effdt, ...(and so on for all columns)
    FROM  (SELECT acctkey, effdt, ...(and so on for all columns),
                  ROW_NUMBER() OVER (PARTITION BY acctkey
                                     ORDER BY effdt DESC) ranking
           FROM  (SELECT acctkey, effdt, ...(and so on for all columns)
                  FROM   curr_account_dim
                  UNION ALL
                  SELECT acctkey, effdt, ...(and so on for all columns)
                  FROM   account_dim PARTITION (P20040225)
                 ) x
          )
    WHERE  ranking = 1;

29
Exchange Partition
    ALTER TABLE curr_account_dim
      EXCHANGE PARTITION pzero
      WITH TABLE tmp_curr_account_dim
      INCLUDING INDEXES
      WITHOUT VALIDATION
      UPDATE GLOBAL INDEXES;

30
Choosing partition keys
  • The most important decision when partitioning is
  • Choosing the partition key columns
  • All benefits of partitioning hinge upon this
    choice!!!
  • Which columns to partition upon?
  • If the table contains time-variant data
  • Choose the RANGE partition key DATE column to
    optimize
  • ETL according to load cycles
  • End-user access through partition pruning
  • Choose the HASH or LIST sub-partition key column
    to optimize
  • End-user access through partition pruning
  • If the table does NOT contain time-variant data
  • Choose the RANGE, HASH, or LIST partition key
    column to optimize
  • End-user access through partition pruning
    (a pruning example follows this list)
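
For example, with TXN range-partitioned on TXN_DATE as sketched
earlier, a predicate on the partition key lets the optimizer
touch a single partition no matter how large the table grows
(the AMOUNT column is a hypothetical measure):

    -- Prunes to partition P20040225 only:
    SELECT acct_key, SUM(amount) daily_total
    FROM   txn
    WHERE  txn_date >= TO_DATE('2004-02-25','YYYY-MM-DD')
    AND    txn_date <  TO_DATE('2004-02-26','YYYY-MM-DD')
    GROUP BY acct_key;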

31
Choosing partition keys
  • When choosing columns to optimize ETL
  • Choose a column which distinguishes different
    load cycles
  • Should be a DATE column
  • When choosing columns to optimize end-user access
  • Gather hard facts about usage -- don't guess!
  • Oracle STATSPACK and Oracle10g AWR
  • Data dictionary table SYS.COL_USAGE$
  • Populated automatically by cost-based optimizer
    in Oracle9i and above (a sample query follows
    this list)
  • DDL script dba_column_usage.sql can be
    downloaded from http://www.EvDBT.com/tools.htm
  • Ambeo Usage Tracker (http://www.ambeo.com)
  • Teleran iSight (http://www.teleran.com)
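
A sketch of a query against that dictionary table (connect as
SYS, or use the posted script instead; the predicate-count
columns are those of COL_USAGE$ in 9i/10g):

    SELECT o.name  table_name,
           c.name  column_name,
           u.equality_preds, u.range_preds, u.like_preds
    FROM   sys.col_usage$ u, sys.obj$ o, sys.col$ c
    WHERE  o.obj#    = u.obj#
    AND    c.obj#    = u.obj#
    AND    c.intcol# = u.intcol#
    ORDER  BY o.name, c.name;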

32
Choosing partition keys
  • Example: fact table for credit-card processing
  • Fact table is time-variant
  • Use range partitioning on DATE datatype to
    optimize ETL and queries
  • Use hash- or list-subpartitioning to optimize
    queries
  • Fact table has four DATE columns
  • TXN_DT (date on which transaction occurred)
  • POST_DT (date on which transaction was posted by
    merchant)
  • PAID_DT (date on which transaction was paid to
    merchant)
  • LOAD_DT (date on which transaction was loaded to
    DW)

33
Choosing partition keys
  • Which should be chosen? And why?
  • LOAD_DT
  • Optimizes ETL perfectly, but does not benefit
    queries in any way
  • Data is loaded by LOAD_DT
  • End-users don't query on LOAD_DT
  • TXN_DT, POST_DT, and PAID_DT
  • Each benefits a different set of end-user queries
  • Presents some problems for ETL processing
  • Each date loads mostly into the latest partition,
    then a little into each partition for the
    previous 2-4 days
  • This situation can be handled by iterating
    through the five steps of the basic EXCHANGE
    PARTITION algorithm (a PL/SQL sketch follows
    this list)
  • Where each iteration processes a different
    LOAD_DT value
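
A hedged PL/SQL sketch of that iteration, reusing the
hypothetical exchpart package from slide 24; it assumes
TXN_FACT is range-partitioned on a day-granularity TXN_DT, and
each pass rebuilds one touched partition by combining its
already-published rows with the newly staged ones:

    BEGIN
      FOR d IN (SELECT DISTINCT txn_dt
                FROM   stage_txn_fact
                ORDER  BY txn_dt)
      LOOP
        exchpart.prepare('TXN_FACT', 'TMP_', d.txn_dt);
        INSERT /*+ append */ INTO tmp_txn_fact
          SELECT * FROM stage_txn_fact
          WHERE  txn_dt = d.txn_dt      -- new rows for this date
          UNION ALL
          SELECT * FROM txn_fact
          WHERE  txn_dt = d.txn_dt;     -- rows already published
        COMMIT;
        exchpart.finish('TXN_FACT', 'TMP_');
      END LOOP;
    END;
    /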

34
Summary recommendations
  • Use dimensional data models for the
    presentation to end-users
  • Don't freelance and confuse the end-users
  • Understand the purpose of facts and dimensions
  • Base the database design on time-variant data
    structures
  • Don't join type-2 fact data to type-1
    dimension data
  • Load type-2 data first, then rebuild type-1
    data from that
  • Use partitioning
  • Enable the virtuous cycle of Oracle features
    that cascade from using partitioning
    intelligently

35
Questions?
36
Thank You!
  • Rocky Mountain Oracle Users Group (www.rmoug.org)
  • Training Days 2008, Denver CO
  • Tue-Thu 12-14 Feb 2008
  • Tue 12-Feb: 4-hour university sessions
  • Wed-Thu 13-14 Feb: main conference
  • Thu-Sun 15-18 Feb 2008
  • Informal ad-hoc ski weekend for attendees who
    wish to partake!!!
  • Tim's contact info
  • Web: http://www.EvDBT.com
  • Email: tim@evdbt.com