Data Analysis - PowerPoint PPT Presentation

1 / 34
About This Presentation
Title:

Data Analysis

Description:

Short, simple, frequent queries and/or modifications, each ... selling airline tickets. OLAP ... Computing Baskets1 is cheap, since it doesn't involve a join. ... – PowerPoint PPT presentation

Number of Views:347
Avg rating:3.0/5.0
Slides: 35
Provided by: aaa23
Category:

less

Transcript and Presenter's Notes

Title: Data Analysis


1
Data Analysis
2
Overview
  • Traditional database systems are tuned to many,
    small, simple queries.
  • Some applications use fewer, more time-consuming,
    analytic queries.

3
OLTP
  • Most database operations involve On-Line
    Transaction Processing (OTLP).
  • Short, simple, frequent queries and/or
    modifications, each involving a small number of
    tuples.
  • Examples
  • Answering queries from a Web interface,
  • sales at cash registers,
  • selling airline tickets.

4
OLAP
  • On-Line Analytical Processing (OLAP, or
    analytic) queries are, typically
  • Few, but complex queries --- may run for hours.
  • Queries don't depend on having an absolutely
    up-to-date database.
  • Example
  • Partition car sales by years, and look at only
    the last two years (2007 and 2008). Aggregate
    price.

5
Common Architecture
  • Databases at store branches handle OLTP.
  • Local store databases copied to a central data
    warehouse overnight.
  • Analysts use the data warehouse for OLAP.

6
Star Schemas
  • A star schema is a common organization for data
    at a warehouse. It consists of
  • Fact table a very large accumulation of facts
    such as sales.
  • Dimension tables smaller, generally static
    information about the entities involved in the
    facts.

7
Example Star Schema
  • Suppose we want to record in a warehouse
    information about every car sale, e.g.
  • serial number,
  • dealer who sold the car,
  • date of the sale, and
  • price charged.
  • The fact table is thus
  • Sales(serialNo, dealer, day, price)

8
Example -- Continued
  • The dimension tables include information about
    the autos, dealers, and days dimensions
  • Autos(serialNo, model, color)
  • Dealers(name, city, province)
  • Days(day, week, month, year)

Days dimension table probably not stored.
9
Visualization Star Schema
Dimension Table (Dealer)
Dimension Table (Autos)
Dimension Attrs.
Dependent Attrs., e.g. price
Fact Table - Sales
Dimension Table (Day)
Dimension Table (etc.)
10
Data Cubes
  • Keys of dimension tables are the dimensions of a
    hypercube.
  • Example for the Sales data, the three dimensions
    are serialNo, date, and dealer.
  • Dependent attributes (e.g., price) appear at the
    points of the cube.

11
Visualization -- Data Cubes
car
price
dealer
date
12
Slicing and Dicing
  • Slice
  • focus on particular partitions along (one or
    more) dimension i.e., focus on a particular slice
    of the cube
  • WHERE clause in SQL
  • Dice
  • partition the cube into smaller subcubes and
    aggregated the points in each
  • GROUP BY clause in SQL

13
Slicing and Dicing in SQL
  • SELECT grouping-attributes and aggregations
  • FROM fact table joined with (zero or more)
    dimension tables
  • WHERE certain attributes are compared with
    constants / slicing /
  • GROUP BY grouping-attributes / dicing /

14
Slicing and Dicing Example
  • Suppose a particular car model, say Gobi, isn't
    selling as well as anticipated. How to analyze?
  • Maybe its the color
  • Slice for Gobi. Dice for color.
  • SELECT color, SUM(price)
  • FROM Sales NATURAL JOIN Autos
  • WHERE model 'Gobi'
  • GROUP BY color

15
Slicing and Dicing Example
  • Suppose the previous query doesn't tell us much
    each color produces about the same revenue.
  • Since the query does not dice for time, we only
    see the total over all time for each color.
  • We may thus issue a revised query that also
    partitions time by month.
  • SELECT color, month, SUM(price)
  • FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
  • WHERE model 'Gobi'
  • GROUP BY color, month

16
Slicing and Dicing Example
  • We might discover that red Gobis havent sold
    well recently.
  • Does this problem exists at all dealers, or only
    some dealers have had low sales of red Gobis?
  • Lets dice on dealer dimension.
  • SELECT dealer, month, SUM(price)
  • FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
  • WHERE model 'Gobi' AND color 'red
  • GROUP BY month, dealer

17
Slicing and Dicing Example
  • At this point, we find that the sales per month
    for red Gobis are so small that we cannot observe
    any trends easily.
  • We decide it was a mistake to dice by month. A
    better partition by years, and look at only the
    last two years (2007 and 2008).
  • SELECT dealer, year, SUM(price)
  • FROM (Sales NATURAL JOIN Autos) NATURAL JOIN Days
  • WHERE model 'Gobi' AND color 'red' AND
  • (year 2007 OR year 2008)
  • GROUP BY year, dealer

18
Drill-Down and Roll-Up
  • Previous examples illustrate two common patterns
    in sequences of queries that slice-and-dice the
    data cube.
  • Drill-down is the process of partitioning more
    finely and/or focusing on specific values in
    certain dimensions.
  • Each of the example steps except the last is an
    instance of drill-down.
  • Roll-up is the process of partitioning more
    coarsely.
  • The last step, where we grouped by years instead
    of months to eliminate the effect of randomness
    in the data, is an example of roll-up.

19
Marginals --- Data Cube w/Aggregation
20
Cube operator
  • CUBE(F) is an augmented table for fact table F
  • tuples (or points) added in CUBE(F)
  • have a value, denoted to each dimension
  • represents aggregation along that dimension

21
Example
  • Sales(serialNo, date, dealer, price)
  • Sales(model, color, date, dealer, val, cnt)

serialNo dimension not well suited for the cube
as summing the price over all dates, or over all
dealers, but keeping the serial number fixed has
no effect.
We replace the serialNo by model and color to
which the serial number connects. Also, we will
have as independent attributes, val for sum of
prices and cnt for number of cars sold.
22
Example CUBE(Sale)
  • ('Gobi', 'red', '2001-05-21', 'Friendly Fred',
    45000, 2)
  • On May 21, 2001, dealer Friendly Fred sold two
    red Gobis for a total of 45,000.
  • This tuple is in both Sales and CUBE(Sales).
  • ('Gobi', , '2001-05-21', 'Friendly Fred',
    152000, 7)
  • On May 21, 2001, Friendly Fred sold seven Gobis
    of all colors, for a total price of 152,000.
  • Note that this tuple is in CUBE(Sales) but not in
    Sales.

23
Example CUBE(Sale)
  • ('Gobi', , '2001-05-21', , 2348000, 100)
  • On May 21, 2001, there were 100 Gobis sold by all
    the dealers, and the total price of those Gobis
    was 2,348,000.
  • ('Gobi', , , , 1339800000, 58000)
  • Over all time, dealers, and colors, 58,000 Gobis
    have been sold for a total price of
    1,339,800,000.
  • (, , , , 3521727000, 198000)
  • Total sales of all models in all colors, over all
    time at all dealers is 198,000 cars for a total
    price of 3,521,727,000.

24
CUBE Helps Answering Queries
  • SELECT color, AVG(price)
  • FROM Sales
  • WHERE model 'Gobi'
  • GROUP BY color
  • is answered by looking for all tuples of
    CUBE(Sales) with the form
  • ('Gobi', c, , , v, n)
  • where c is any specific color. v and n will be
    the sum and number of sales of Gobis in that
    color. The average price, is v/n. The answer to
    query is the set of (c v/n) pairs obtained from
    all ('Gobi', c, , , v, n) tuples.

25
CUBE in SQL
  • SELECT model, color, date, dealer, SUM(val) AS v,
    SUM(cnt) AS n
  • FROM Sales
  • GROUP BY model, color, date, dealer
  • WITH CUBE
  • Suppose we materialize this into SalesCube.
  • Then the previous query is rewritten into
  • SELECT color, SUM(v)/SUM(n)
  • FROM SalesCube
  • WHERE model 'Gobi' AND
  • date IS NULL AND
  • dealer IS NULL

26
Data Mining
  • Data mining is a popular term for summarization
    of big data sets in useful ways.
  • Examples
  • Clustering Web pages by topic.
  • Finding characteristics of fraudulent credit-card
    use.
  • Finding products that are usually bought together.

27
Market-Basket Data
  • An important form of mining from relational data
    involves market baskets sets of items that
    are purchased together as a customer leaves a
    store.
  • Summary of basket data is frequent itemsets
    sets of items that often appear together in
    baskets.

28
Example Market Baskets
  • If people often buy hamburger and ketchup
    together, the store can
  • Put hamburger and ketchup near each other and put
    potato chips between.
  • Run a sale on hamburger and raise the price of
    ketchup.

29
Finding Frequent Pairs
  • The simplest case is when we only want to find
    frequent pairs of items.
  • Assume data is in a relation Baskets(basket,
    item).
  • The support threshold s is the minimum number
    of baskets in which a pair appears before we are
    interested.

30
Frequent Pairs in SQL
  • SELECT b1.item, b2.item
  • FROM Baskets b1, Baskets b2
  • WHERE b1.basket b2.basket
  • AND b1.item
  • GROUP BY b1.item, b2.item
  • HAVING COUNT() s

31
A-Priori Trick (1)
  • Straightforward implementation involves a join of
    a huge Baskets table with itself.
  • The A-Priori algorithm speeds the query by
    recognizing that a pair of items i, j cannot
    have support s unless both i and j do.

32
A-Priori Trick (2)
  • Use a materialized view to hold only information
    about frequent items.
  • INSERT INTO Baskets1(basket, item)
  • SELECT FROM Baskets
  • WHERE item IN (
  • SELECT item FROM Baskets
  • GROUP BY item
  • HAVING COUNT() s
  • )

33
A-Priori Algorithm
  • Materialize the view Baskets1.
  • Run the obvious query, but on Baskets1 instead of
    Baskets.
  • Notes
  • Computing Baskets1 is cheap, since it doesnt
    involve a join.
  • Baskets1 probably has many fewer tuples than
    Baskets.
  • Running time shrinks with the square of the
    number of tuples involved in the join.

34
Course Plug
  • Fall 2009 CSC 485E / CSC 571 Advanced Databases
  • Spring 2010 SENG 474 Data Mining
Write a Comment
User Comments (0)
About PowerShow.com