Lecture 1: Data Warehousing - PowerPoint PPT Presentation

About This Presentation
Title:

Lecture 1: Data Warehousing

Description:

Lecture 1: Data Warehousing Based on the s by Jeffrey D. Ullman and Hector Garcia-Molina at Stanford University * ... – PowerPoint PPT presentation

Number of Views:207
Avg rating:3.0/5.0
Slides: 94
Provided by: Janet240
Learn more at: https://www.cs.kent.edu
Category:

less

Transcript and Presenter's Notes

Title: Lecture 1: Data Warehousing


1
Lecture 1 Data Warehousing
  • Based on the slides by Jeffrey D. Ullman and
  • Hector Garcia-Molina at Stanford University

2
Overview
  • Traditional database systems are tuned to many,
    small, simple queries.
  • Some new applications use fewer, more
    time-consuming, complex queries.
  • New architectures have been developed to handle
    complex analytic queries efficiently.

3
The Data Warehouse
  • The most common form of data integration.
  • Copy sources into a single DB (warehouse) and try
    to keep it up-to-date.
  • Usual method periodic reconstruction of the
    warehouse, perhaps overnight.
  • Frequently essential for analytic queries.

4
OLTP
  • Most database operations involve On-Line
    Transaction Processing (OTLP).
  • Short, simple, frequent queries and/or
    modifications, each involving a small number of
    tuples.
  • Examples Answering queries from a Web interface,
    sales at cash registers, selling airline tickets.

5
OLAP
  • Of increasing importance are On-Line Application
    Processing (OLAP) queries.
  • Few, but complex queries --- may run for hours.
  • Queries do not depend on having an absolutely
    up-to-date database.

6
OLAP Examples
  1. Amazon analyzes purchases by its customers to
    come up with an individual screen with products
    of likely interest to the customer.
  2. Analysts at Wal-Mart look for items with
    increasing sales in some region.

7
Warehouse Architecture
Metadata
8
Why a Warehouse?
  • Two Approaches
  • Query-Driven (Lazy)
  • Warehouse (Eager)

9
Data Warehouse
  • Databases at store branches handle OLTP.
  • Local store databases copied to a central
    warehouse overnight.
  • Analysts use the warehouse for OLAP.

10
Query-Driven Approach
11
Advantages of Warehousing
  • High query performance
  • Queries not visible outside warehouse
  • Local processing at sources unaffected
  • Can operate when sources unavailable
  • Can query data not stored in a DBMS
  • Extra information at warehouse
  • Modify, summarize (store aggregates)
  • Add historical information

12
Advantages of Query-Driven
  • No need to copy data
  • less storage
  • no need to purchase data
  • More up-to-date data
  • Query needs can be unknown
  • Only query interface needed at sources
  • May be less draining on sources

13
OLTP vs. OLAP
  • OLTP On Line Transaction Processing
  • Describes processing at operational sites
  • OLAP On Line Analytical Processing
  • Describes processing at warehouse

14
OLTP vs. OLAP
OLTP
OLAP
  • Mostly updates
  • Many small transactions
  • Mb-Tb of data
  • Raw data
  • Clerical users
  • Up-to-date data
  • Consistency, recoverability critical
  • Mostly reads
  • Queries long, complex
  • Gb-Tb of data
  • Summarized, consolidated data
  • Decision-makers, analysts as users

15
Star Schemas
  • A star schema is a common organization for data
    at a warehouse. It consists of
  • Fact table a very large accumulation of facts
    such as sales.
  • Often insert-only.
  • Dimension tables smaller, generally static
    information about the entities involved in the
    facts.

16
Example Star Schema
  • Suppose we want to record in a warehouse
    information about every beer sale the bar, the
    brand of beer, the drinker who bought the beer,
    the day, the time, and the price charged.
  • The fact table is a relation
  • Sales(bar, beer, drinker, day, time, price)

17
Example, Continued
  • The dimension tables include information about
    the bar, beer, and drinker dimensions
  • Bars(bar, addr, license)
  • Beers(beer, manf)
  • Drinkers(drinker, addr, phone)

18
Visualization Star Schema
Dimension Table (Drinkers)
Dimension Table (Bars)
Dimension Attrs.
Dependent Attrs.
Fact Table - Sales
Dimension Table (Beers)
Dimension Table (etc.)
19
Dimensions and Dependent Attributes
  • Two classes of fact-table attributes
  • Dimension attributes the key of a dimension
    table.
  • Dependent attributes a value determined by the
    dimension attributes of the tuple.

20
Example Dependent Attribute
  • price is the dependent attribute of our example
    Sales relation.
  • It is determined by the combination of dimension
    attributes bar, beer, drinker, and the time
    (combination of day and time-of-day attributes).

21
Approaches to Building Warehouses
  1. ROLAP relational OLAP Tune a relational
    DBMS to support star schemas.
  2. MOLAP multidimensional OLAP Use a
    specialized DBMS with a model such as the data
    cube.

22
ROLAP Techniques
  1. Bitmap indexes For each key value of a
    dimension table (e.g., each beer for relation
    Beers) create a bit-vector telling which tuples
    of the fact table have that value.
  2. Materialized views Store the answers to several
    useful queries (views) in the warehouse itself.

23
Typical OLAP Queries
  • Often, OLAP queries begin with a star join the
    natural join of the fact table with all or most
    of the dimension tables.
  • Example
  • SELECT
  • FROM Sales, Bars, Beers, Drinkers
  • WHERE Sales.bar Bars.bar AND
  • Sales.beer Beers.beer AND
  • Sales.drinker Drinkers.drinker

24
Typical OLAP Queries --- (2)
  • The typical OLAP query will
  • Start with a star join.
  • Select for interesting tuples, based on dimension
    data.
  • Group by one or more dimensions.
  • Aggregate certain attributes of the result.

25
Example OLAP Query
  • For each bar in Palo Alto, find the total sale of
    each beer manufactured by Anheuser-Busch.
  • Filter addr Palo Alto and manf
    Anheuser-Busch.
  • Grouping by bar and beer.
  • Aggregation Sum of price.

26
Example In SQL
  • SELECT bar, beer, SUM(price)
  • FROM Sales NATURAL JOIN Bars
  • NATURAL JOIN Beers
  • WHERE addr Palo Alto AND
  • manf Anheuser-Busch
  • GROUP BY bar, beer

27
Using Materialized Views
  • A direct execution of this query from Sales and
    the dimension tables could take too long.
  • If we create a materialized view that contains
    enough information, we may be able to answer our
    query much faster.

28
Example Materialized View
  • Which views could help with our query?
  • Key issues
  • It must join Sales, Bars, and Beers, at least.
  • It must group by at least bar and beer.
  • It must not select out Palo-Alto bars or
    Anheuser-Busch beers.
  • It must not project out addr or manf.

29
Example --- Continued
  • Here is a materialized view that could help
  • CREATE VIEW BABMS(bar, addr,
  • beer, manf, sales) AS
  • SELECT bar, addr, beer, manf,
  • SUM(price) sales
  • FROM Sales NATURAL JOIN Bars
  • NATURAL JOIN Beers
  • GROUP BY bar, addr, beer, manf

30
Example --- Concluded
  • Heres our query using the materialized view
    BABMS
  • SELECT bar, beer, sales
  • FROM BABMS
  • WHERE addr Palo Alto AND
  • manf Anheuser-Busch

31
MOLAP and Data Cubes
  • Keys of dimension tables are the dimensions of a
    hypercube.
  • Example for the Sales data, the four dimensions
    are bar, beer, drinker, and time.
  • Dependent attributes (e.g., price) appear at the
    points of the cube.

32
Visualization - Data Cubes
beer
price
bar
drinker
33
Marginals
  • The data cube also includes aggregation
    (typically SUM) along the margins of the cube.
  • The marginals include aggregations over one
    dimension, two dimensions,

34
Visualization - Data Cube w/ Aggregation
beer
SUM over all Drinkers
price
bar
drinker
35
Example Marginals
  • Our 4-dimensional Sales cube includes the sum of
    price over each bar, each beer, each drinker, and
    each time unit (perhaps days).
  • It would also have the sum of price over all
    bar-beer pairs, all bar-drinker-day triples,

36
Structure of the Cube
  • Think of each dimension as having an additional
    value .
  • A point with one or more s in its coordinates
    aggregates over the dimensions with the s.
  • Example Sales(Joes Bar, Bud, , ) holds
    the sum over all drinkers and all time of the Bud
    consumed at Joes.

37
Drill-Down
  • Drill-down de-aggregate break an aggregate
    into its constituents.
  • Example having determined that Joes Bar sells
    very few Anheuser-Busch beers, break down his
    sales by particular A.-B. beer.

38
Roll-Up
  • Roll-up aggregate along one or more dimensions.
  • Example given a table of how much Bud each
    drinker consumes at each bar, roll it up into a
    table giving total amount of Bud consumed for
    each drinker.

39
Roll Up and Drill Down
of Anheuser-Busch by drinker/bar
of A-B / drinker
Jim Bob Mary
Joes Bar 45 33 30
Nut- House 50 36 42
Blue Chalk 38 31 40
Jim Bob Mary
133 100 112
Roll upby Bar
Drill downby Beer
of A-B Beers / drinker
Jim Bob Mary
Bud 40 29 40
Mlob 45 31 37
Bud Light 48 40 35
40
Materialized Data-Cube Views
  • Data cubes invite materialized views that are
    aggregations in one or more dimensions.
  • Dimensions may not be completely aggregated ---
    an option is to group by an attribute of the
    dimension table.

41
Example
  • A materialized view for our Sales data cube
    might
  • Aggregate by drinker completely.
  • Not aggregate at all by beer.
  • Aggregate by time according to the week.
  • Aggregate according to the city of the bar.

42
Example
  • A materialized view for our Sales data cube
    might
  • Aggregate by drinker completely.
  • Not aggregate at all by beer.
  • Aggregate by time according to the week.
  • Aggregate according to the city of the bar.

43
Warehouse Models Operators
  • Data Models
  • relations
  • stars snowflakes
  • cubes
  • Operators
  • slice dice
  • roll-up, drill down
  • pivoting
  • other

44
Star
45
Star Schema
46
Terms
  • Fact table
  • Dimension tables
  • Measures

47
Dimension Hierarchies
sType
store
city
region
è snowflake schema è constellations
48
Cube
Fact table view
Multi-dimensional cube
dimensions 2
49
3-D Cube
Multi-dimensional cube
Fact table view
dimensions 3
50
ROLAP vs. MOLAP
  • ROLAPRelational On-Line Analytical Processing
  • MOLAPMulti-Dimensional On-Line Analytical
    Processing

51
Aggregates
  • Add up amounts for day 1
  • In SQL SELECT sum(amt) FROM SALE
  • WHERE date 1

81
52
Aggregates
  • Add up amounts by day
  • In SQL SELECT date, sum(amt) FROM SALE
  • GROUP BY date

53
Another Example
  • Add up amounts by day, product
  • In SQL SELECT date, sum(amt) FROM SALE
  • GROUP BY date, prodId

rollup
drill-down
54
Aggregates
  • Operators sum, count, max, min, median,
    ave
  • Having clause
  • Using dimension hierarchy
  • average by region (within store)
  • maximum by month (within date)

55
Cube Aggregation
Example computing sums
day 2
. . .
day 1
129
56
Cube Operators
day 2
. . .
day 1
sale(c1,,)
129
sale(c2,p2,)
sale(,,)
57
Extended Cube

day 2
sale(,p2,)
day 1
58
Aggregation Using Hierarchies
customer
region
country
(customer c1 in Region A customers c2, c3 in
Region B)
59
Pivoting
Fact table view
Multi-dimensional cube
60
Query Analysis Tools
  • Query Building
  • Report Writers (comparisons, growth, graphs,)
  • Spreadsheet Systems
  • Web Interfaces
  • Data Mining

61
Other Operations
  • Time functions
  • e.g., time average
  • Computed Attributes
  • e.g., commission sales rate
  • Text Queries
  • e.g., find documents with words X AND B
  • e.g., rank documents by frequency of
    words X, Y, Z

62
Implementing a Warehouse
  • Monitoring Sending data from sources
  • Integrating Loading, cleansing,...
  • Processing Query processing, indexing, ...
  • Managing Metadata, Design, ...

63
Monitoring
  • Source Types relational, flat file, IMS, VSAM,
    IDMS, WWW, news-wire,
  • Incremental vs. Refresh

64
Monitoring Techniques
  • Periodic snapshots
  • Database triggers
  • Log shipping
  • Data shipping (replication service)
  • Transaction shipping
  • Polling (queries to source)
  • Screen scraping
  • Application level monitoring

è Advantages Disadvantages!!
65
Monitoring Issues
  • Frequency
  • periodic daily, weekly,
  • triggered on big change, lots of changes, ...
  • Data transformation
  • convert data to uniform format
  • remove add fields (e.g., add date to get
    history)
  • Standards (e.g., ODBC)
  • Gateways

66
Integration
  • Data Cleaning
  • Data Loading
  • Derived Data

67
Data Cleaning
  • Migration (e.g., yen ð dollars)
  • Scrubbing use domain-specific knowledge (e.g.,
    social security numbers)
  • Fusion (e.g., mail list, customer merging)
  • Auditing discover rules relationships(like
    data mining)

68
Loading Data
  • Incremental vs. refresh
  • Off-line vs. on-line
  • Frequency of loading
  • At night, 1x a week/month, continuously
  • Parallel/Partitioned load

69
Derived Data
  • Derived Warehouse Data
  • indexes
  • aggregates
  • materialized views (next slide)
  • When to update derived data?
  • Incremental vs. refresh

70
Materialized Views
  • Define new warehouse relations using SQL
    expressions

71
Processing
  • ROLAP servers vs. MOLAP servers
  • Index Structures
  • What to Materialize?
  • Algorithms

72
ROLAP Server
  • Relational OLAP Server

tools
Special indices, tuning Schema is denormalized
73
MOLAP Server
  • Multi-Dimensional OLAP Server

M.D. tools
multi-dimensional server
could also sit on relational DBMS
74
Index Structures
  • Traditional Access Methods
  • B-trees, hash tables, R-trees, grids,
  • Popular in Warehouses
  • inverted lists
  • bit map indexes
  • join indexes
  • text indexes

75
Inverted Lists
. . .
data records
inverted lists
age index
76
Using Inverted Lists
  • Query
  • Get people with age 20 and name fred
  • List for age 20 r4, r18, r34, r35
  • List for name fred r18, r52
  • Answer is intersection r18

77
Bit Maps
. . .
age index
data records
bit maps
78
Using Bit Maps
  • Query
  • Get people with age 20 and name fred
  • List for age 20 1101100000
  • List for name fred 0100000001
  • Answer is intersection 010000000000
  • Good if domain cardinality small
  • Bit vectors can be compressed

79
Join
  • Combine SALE, PRODUCT relations
  • In SQL SELECT FROM SALE, PRODUCT

80
What to Materialize?
  • Store in warehouse results useful for common
    queries
  • Example

total sales
day 2
. . .
day 1
129
materialize
81
Materialization Factors
  • Type/frequency of queries
  • Query response time
  • Storage cost
  • Update cost

82
Cube Aggregates Lattice
129
all
city
product
date
city, product
city, date
product, date
use greedy algorithm to decide what to materialize
city, product, date
83
Dimension Hierarchies
all
state
city
84
Dimension Hierarchies
all
product
city
date
product, date
city, product
city, date
state
city, product, date
state, date
state, product
state, product, date
not all arcs shown...
85
Interesting Hierarchy
all
years
weeks
quarters
conceptual dimension table
months
days
86
Managing
  • Metadata
  • Warehouse Design
  • Tools

87
Metadata
  • Administrative
  • definition of sources, tools, ...
  • schemas, dimension hierarchies,
  • rules for extraction, cleaning,
  • refresh, purging policies
  • user profiles, access control, ...

88
Metadata
  • Business
  • business terms definition
  • data ownership, charging
  • Operational
  • data lineage
  • data currency (e.g., active, archived, purged)
  • use stats, error reports, audit trails

89
Design
  • What data is needed?
  • Where does it come from?
  • How to clean data?
  • How to represent in warehouse (schema)?
  • What to summarize?
  • What to materialize?
  • What to index?

90
Tools
  • Development
  • design edit schemas, views, scripts, rules,
    queries, reports
  • Planning Analysis
  • what-if scenarios (schema changes, refresh
    rates), capacity planning
  • Warehouse Management
  • performance monitoring, usage patterns, exception
    reporting
  • System Network Management
  • measure traffic (sources, warehouse, clients)
  • Workflow Management
  • reliable scripts for cleaning analyzing data

91
Current State of Industry
  • Extraction and integration done off-line
  • Usually in large, time-consuming, batches
  • Everything copied at warehouse
  • Not selective about what is stored
  • Query benefit vs storage update cost
  • Query optimization aimed at OLTP
  • High throughput instead of fast response
  • Process whole query before displaying anything

92
Future Directions
  • Better performance
  • Larger warehouses
  • Easier to use
  • What are companies research labs working on?

93
Conclusions
  • Massive amounts of data and complexity of queries
    will push limits of current warehouses
  • Need better systems
  • easier to use
  • provide quality information
Write a Comment
User Comments (0)
About PowerShow.com