OLAP on - PowerPoint PPT Presentation

About This Presentation
Title:

OLAP on

Description:

OLAP on Sequence Data Presenter : Chun Kit Chui (Kit), The University of Hong Kong ckchui_at_cs.hku.hk Supervisors : Eric Lo, Hong Kong Polytechnic University – PowerPoint PPT presentation

Slides: 73
Provided by: iCsHkuH

Transcript and Presenter's Notes

Title: OLAP on


1
OLAP on
Sequence Data
Presenter
Chun Kit Chui (Kit), The University of Hong Kong
ckchui_at_cs.hku.hk
Supervisors
Eric Lo, Hong Kong Polytechnic University
Ben Kao, The University of Hong Kong
2
OLAP on
Sequence Data
Problem Motivation
Sequence Data Cuboid
Experimental evaluations
3
OLAP on
Sequence Data
  • Many kinds of real-life data exhibit logical
    ordering among their data items and are thus
    sequential in nature.

Web server access logs
Stock market data
U.S. OIL FUND ETF
MEXCO ENERGY CORP
4
Web server access logs (Web retailer selling
sportswear products)
Time             Member-ID  URL                      Product  Product type     Brand
2008-1-01 00:01  688        /product.html?pid=12800  12800    Nike shoes       Nike

2008-1-01 00:02  688        /product.html?pid=13250  13250    Adidas shoes     Adidas
2008-1-01 00:10  14230      /product.html?pid=324    324      Puma shoes       Puma

2008-1-01 02:45  688        /product.html?pid=12800  12800    Nike shoes       Nike

2008-1-01 03:49  688        /product.html?pid=329    329      Adidas T-shirts  Adidas

2008-1-01 03:45  14230      /checkout.xhtml          Nil      Nil              Nil
The product dimension is associated with a
concept hierarchy in which the finest level of
abstraction is product ID, followed by product
type, and brand.
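The concept hierarchy can be illustrated with a small sketch. The lookup table below is built only from the rows shown in the log; the names `PRODUCTS`, `LEVELS`, `at_level` and `roll_up` are hypothetical and are not part of the presented system.

```python
# Hypothetical product dimension table, built from the rows shown in the log:
# product ID -> (product type, brand).
PRODUCTS = {
    12800: ("Nike shoes", "Nike"),
    13250: ("Adidas shoes", "Adidas"),
    324: ("Puma shoes", "Puma"),
    329: ("Adidas T-shirts", "Adidas"),
}

# Levels of the concept hierarchy, from finest to coarsest.
LEVELS = ("product_id", "product_type", "brand")


def at_level(product_id, level):
    """Map a product ID to its value at the requested abstraction level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if level == "product_id":
        return product_id
    product_type, brand = PRODUCTS[product_id]
    return product_type if level == "product_type" else brand


def roll_up(sequence, level):
    """View a sequence of product IDs at a coarser abstraction level."""
    return [at_level(pid, level) for pid in sequence]


# Product pages viewed by member 688, in time order.
clicks_688 = [12800, 13250, 12800, 329]
print(roll_up(clicks_688, "product_type"))
print(roll_up(clicks_688, "brand"))
```

Viewing the same clicks at the brand level merges "Adidas shoes" and "Adidas T-shirts" into one value, which is why the chosen level changes which patterns a sequence contains.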
5
From the access logs we can trace back the
browsing sequences of all members.
Web server access logs
Browsing Sequence
6
Manager
I would like to know the number of members who
did comparison shopping, and their distribution
over all pairs of product web pages, within
2008 Quarter 1.
7
Browsing Sequence
The query refers to a particular kind of pattern
in the browsing sequences. The comparison-shopping
semantics can be expressed by the pattern
template <X, Y, X>.
<X, Y, X>                                   Members
<Nike Shoes, Adidas Shoes, Nike Shoes>      ?
<Nike Shoes, Puma Shoes, Nike Shoes>        5,432
<Nike Shoes, Nike Shoes, Nike Shoes>        13,200

<Adidas Shoes, Nike Shoes, Adidas Shoes>    1,020
<Adidas Shoes, Puma Shoes, Adidas Shoes>    4,331
8
<Nike Shoes, Adidas Shoes, Nike Shoes> is one of
the instantiations of the pattern template. Since
the browsing sequence of member 688 contains the
pattern, the sequence contributes a count of 1 to
the cell.
<X, Y, X>                                   Members
<Nike Shoes, Adidas Shoes, Nike Shoes>      1
<Nike Shoes, Puma Shoes, Nike Shoes>        5,432
<Nike Shoes, Nike Shoes, Nike Shoes>        13,200

<Adidas Shoes, Nike Shoes, Adidas Shoes>    1,020
<Adidas Shoes, Puma Shoes, Adidas Shoes>    4,331
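How a single browsing sequence is matched against a pattern template can be sketched as follows. This is a minimal illustration, assuming that a pattern must occur as a contiguous run of page views (substring semantics); the function name `instantiations` is hypothetical.

```python
def instantiations(template, sequence):
    """Return the set of instantiated patterns of `template` found in `sequence`.

    Assumption of this sketch: a pattern must occur as a contiguous run
    (substring semantics).
    """
    m = len(template)
    found = set()
    for start in range(len(sequence) - m + 1):
        window = tuple(sequence[start:start + m])
        binding = {}
        # A symbol must bind to the same value at every position it appears;
        # different symbols may bind to equal values (e.g. <Nike, Nike, Nike>).
        if all(binding.setdefault(sym, val) == val
               for sym, val in zip(template, window)):
            found.add(window)
    return found


# Member 688's browsing sequence at the product-type level, read off the log.
member_688 = ["Nike shoes", "Adidas shoes", "Nike shoes", "Adidas T-shirts"]
cells = instantiations(("X", "Y", "X"), member_688)
print(cells)
```

Each instantiation returned for a sequence identifies one cell to which that sequence contributes a count of 1, no matter how often the pattern repeats inside the sequence.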
9
The number of members is aggregated and a
tabulated view of the sequence data is
returned.
<X, Y, X>                                   Members
<Nike Shoes, Adidas Shoes, Nike Shoes>      200,000
<Nike Shoes, Puma Shoes, Nike Shoes>        5,432
<Nike Shoes, Nike Shoes, Nike Shoes>        13,200

<Adidas Shoes, Nike Shoes, Adidas Shoes>    1,020
<Adidas Shoes, Puma Shoes, Adidas Shoes>    4,331
10
Sequence OLAP system
  • Support pattern-based grouping and aggregation.

Manager
Query
Result
11
Follow-up Query
Manager
So many members browsed <Nike Shoes, Adidas Shoes,
Nike Shoes> that I would like to investigate
further whether those members went on to browse
one more product and, if so, which product.
Sequence OLAP system
  • Support pattern-based grouping and aggregation.
  • Obtain query results in real time (OLAP feature).

The new query can be expressed by appending a
pattern symbol Z to form a new pattern template
<X, Y, X, Z>. The result shows the statistics of
one more browsing step after the comparison
shopping between Nike Shoes and Adidas Shoes.
Result
<X, Y, X, Z>, X = Nike Shoes, Y = Adidas Shoes, Z = Any   Members
<Nike Shoes, Adidas Shoes, Nike Shoes, Nike Shoes>        15,000
<Nike Shoes, Adidas Shoes, Nike Shoes, Adidas T-shirts>   180,000
<Nike Shoes, Adidas Shoes, Nike Shoes, Puma Shoes>        9,000
12
This cell suggests that many members browsed an
Adidas T-shirts page after comparison shopping
between Nike shoes and Adidas shoes.
<Nike Shoes, Adidas Shoes, Nike Shoes, Adidas T-shirts>   180,000
13
Research Objective
<X, Y>                  Members
<Nike, Adidas>          1,315,000
<Nike, Puma>            6,480,000
<Nike, Nike>            3,189,000

<X, Y, X>               Members
<Nike, Adidas, Nike>    315,000
<Nike, Puma, Nike>      2,180,000
<Nike, Nike, Nike>      189,000
  • To design and implement an OLAP system that is
    able to
  • support pattern-based grouping and aggregation.
  • obtain query results in real time.
  • Especially optimized for interactive/iterative
    queries.
  • provide OLAP operations to ease explorative
    analysis of sequence data.

14
RFID Logs
  • Radio-frequency identification (RFID) is an
    automatic identification method, relying on
    storing and remotely retrieving data using
    devices called RFID tags.
  • The smart card system for public transportation
  • Octopus card in Hong Kong, SmarTrip in
    Washington DC, etc.
  • Electronic money
  • Travel histories of passengers are logged in a
    database.
  • Generates massive amounts of sequence data.

15
RFID Logs
Event Database
Time             Card-ID  Location               Action     Amount
2008-6-09 00:01  688      Vancouver Airport      in         0
2008-6-09 02:25  688      Waterfront             out        -5

2008-6-09 00:02  9876     Aberdeen               in         0

2008-6-14 02:23  688      Waterfront Machine 10  Add value  100
2008-6-14 02:25  688      Waterfront             in         0

2008-6-14 02:45  9876     Marine Drive           out        -4
2008-6-14 18:49  688      Vancouver Airport      out        -5
  • Electronic money
  • Payment can be done easily by waving the card
    over the card reader.

16
The number of round-trip passengers and their
distribution over all origin-destination station
pairs within 2008 Quarter 4.
17
Query
Result
<X, Y, Y, X>                                                      Users
<Vancouver Airport, Waterfront, Waterfront, Vancouver Airport>    12,032
<Vancouver Airport, Aberdeen, Aberdeen, Vancouver Airport>        982

<Aberdeen, Marine Drive, Marine Drive, Aberdeen>                  822
<Aberdeen, Templeton, Templeton, Aberdeen>                        1,020
18
Sequence Data Cuboid
A logical view of sequence data at a particular
degree of summarization.
19
Preliminary
  • Sequence Cuboid (S-Cuboid)
  • a logical view of sequence data at a particular
    degree of summarization.
  • sequences can be characterized by
  • attribute values (e.g. time)
  • the subsequence/substring patterns they possess
    (e.g. <X, Y, X>, <X, Y, Y, X>).

An S-Cuboid
<X, Y, Y, X>                                  Users
<Airport, Waterfront, Waterfront, Airport>    2
<Airport, Aberdeen, Aberdeen, Airport>        9

20
Phase 1. Sequence Formation
An event dataset
Event  Time             Card-ID  Location               Action     Amount
e1     2008-6-09 00:01  688      Vancouver Airport      in         0
e2     2008-6-09 02:25  688      Waterfront             out        -5

e101   2008-6-09 22:23  688      Waterfront Machine 10  Add value  100
e102   2008-6-09 22:25  688      Waterfront             in         0

e180   2008-6-09 23:49  688      Vancouver Airport      out        -5

Event Selection
An event selection step selects a set of
relevant records and attributes.
Event  Time             Card-ID  Location               Action     Amount
e1     2008-6-09 00:01  688      Vancouver Airport      in         0
e2     2008-6-09 02:25  688      Waterfront             out        -5

e102   2008-6-09 22:25  688      Waterfront             in         0

e180   2008-6-09 23:49  688      Vancouver Airport      out        -5
21
Phase 1. Sequence Formation
Sequence Formation
A sequence formation step forms sequences from
the event dataset.
Sequences can be formed per day and for each
individual user. So after this step, we have a
number of daily travel sequences for each user.
E.g. S1 is Eric's trip on Monday.
User: Individual, Time: Day
Seq ID  Sequence of events
S1      <e1, e2, e102, e180>
S2      <e3, e7, e8, e12, e19, e232, e234, e235>
S3      <e4, e5, e9, e13, e14, e290, e292, e352>
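Phase 1 (event selection followed by sequence formation) can be sketched in a few lines. This is only an illustration under stated assumptions: the tuple layout, the selection rule (keep "in"/"out" events) and the function name `form_sequences` are hypothetical; the events are those of card 688 shown in the event dataset.

```python
from collections import defaultdict
from datetime import date, datetime

# (event ID, time, card ID, location, action, amount), taken from the slide.
EVENTS = [
    ("e1", "2008-06-09 00:01", 688, "Vancouver Airport", "in", 0),
    ("e2", "2008-06-09 02:25", 688, "Waterfront", "out", -5),
    ("e101", "2008-06-09 22:23", 688, "Waterfront Machine 10", "Add value", 100),
    ("e102", "2008-06-09 22:25", 688, "Waterfront", "in", 0),
    ("e180", "2008-06-09 23:49", 688, "Vancouver Airport", "out", -5),
]


def form_sequences(events):
    """Select travel events, then form one sequence per (card, day)."""
    groups = defaultdict(list)
    for event_id, ts, card_id, _location, action, _amount in events:
        # Event selection (assumed rule): keep only entry and exit events.
        if action not in ("in", "out"):
            continue
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        # Sequence formation: user at "individual" level, time at "day" level.
        groups[(card_id, when.date())].append((when, event_id))
    # Order the events of each group by time and keep the event IDs.
    return {key: [eid for _, eid in sorted(members)]
            for key, members in groups.items()}


sequences = form_sequences(EVENTS)
print(sequences)
```

Changing the grouping key (for example, week instead of day) would produce a different set of sequences from the same event dataset.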

22
Phase 2. S-Cuboid construction
Monday
23
Phase 2. S-Cuboid construction
Sequence Grouping
A sequence grouping step groups the sequences
that share the same dimension values into a
sequence group. E.g. travel sequences are grouped
according to their fare groups.
Monday
24
Phase 2. S-Cuboid construction
Pattern Grouping
Pattern <X, Y, Y, X>
The pattern grouping step further groups the
sequences according to the patterns they
possess.

25
Phase 2. S-Cuboid construction
Pattern Grouping
Pattern <X, Y, Y, X>
Each cell represents an instantiated pattern, e.g.
<Vancouver Airport, Waterfront, Waterfront,
Vancouver Airport>. We assign a sequence to a cell
if the sequence contains the instantiated
pattern.
Figure: the cell with X = Vancouver Airport and
Y = Waterfront holds sequences S1 and S3.
26
Phase 2. S-Cuboid construction
Aggregated Value
Finally, an aggregation function is applied to
the sequences in each cuboid cell.
Figure: the cell with X = Vancouver Airport and
Y = Waterfront holds S1 and S3, so Count = 2.
27
Phase 2. S-Cuboid construction
4D S-Cuboid
<X, Y, Y, X>                                                      Users
<Vancouver Airport, Waterfront, Waterfront, Vancouver Airport>    2
<Vancouver Airport, Aberdeen, Aberdeen, Vancouver Airport>        9
29
Sequence Cuboid query language
The number of round-trip passengers and their
distributions over all origin-destination station
pairs within 2007 Quarter 4.

30
Properties of S-cuboids
  • Infinite number of S-cuboids
  • The number of pattern dimensions is infinite
  • Pattern Template (X, Y, Y, X, A, B, ...)
  • Non-summarizable

Notice that modifying the pattern template
essentially changes the cuboid specification and
thus generates a new cuboid.
31
Properties of S-cuboids
  • Non-summarizable

Seq ID  Sequence of events
S1      <Airport, Waterfront, Airport, Waterfront, Marine Drive>
S2      <Waterfront, Marine Drive>

The S-Cuboid with pattern template <X, Y, Z>
<X, Y, Z>                              Count
<Airport, Waterfront, Airport>         1
<Waterfront, Airport, Waterfront>      1
<Airport, Waterfront, Marine Drive>    1
32
Properties of S-cuboids
Can we compute the S-Cuboid with pattern <X, Y>
(coarser aggregation) solely from the S-Cuboid
with pattern <X, Y, Z> (finer aggregation)?
<X, Y>                   Count
<Airport, Waterfront>    ?
33
Properties of S-cuboids
Notice that there are two finer patterns that
contain the pattern <Airport, Waterfront>. The
problem is that we don't know whether the counts
of the two patterns were generated from the same
sequence or from two different sequences.
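The non-summarizable property can be checked with a small experiment. The sketch below assumes substring semantics and COUNT as the aggregate; `cuboid` is a hypothetical helper, `dataset_a` is the S1/S2 example from the slide, and `dataset_b` is an invented dataset chosen to yield the same <X, Y, Z> cuboid.

```python
from collections import Counter


def cuboid(template, sequences):
    """COUNT S-cuboid: number of sequences containing each instantiated pattern."""
    m = len(template)
    counts = Counter()
    for seq in sequences:
        cells = set()
        for start in range(len(seq) - m + 1):
            window = tuple(seq[start:start + m])
            binding = {}
            # The same symbol must bind to the same value across the window.
            if all(binding.setdefault(s, v) == v for s, v in zip(template, window)):
                cells.add(window)
        counts.update(cells)  # a sequence counts at most once per cell
    return counts


A, W, M = "Airport", "Waterfront", "Marine Drive"
dataset_a = [[A, W, A, W, M], [W, M]]   # S1 and S2 from the slide
dataset_b = [[A, W, A, W], [A, W, M]]   # invented for this illustration

# Both datasets produce exactly the same finer cuboid ...
same_finer = cuboid(("X", "Y", "Z"), dataset_a) == cuboid(("X", "Y", "Z"), dataset_b)
# ... yet the coarser cell <Airport, Waterfront> differs.
count_a = cuboid(("X", "Y"), dataset_a)[(A, W)]
count_b = cuboid(("X", "Y"), dataset_b)[(A, W)]
print(same_finer, count_a, count_b)
```

Because two datasets with identical finer cuboids can have different coarser counts, no function of the finer cuboid alone can recover the coarser one.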
34
Properties of S-cuboids
  • Infinite number of S-cuboids
  • The number of pattern dimensions is infinite
  • Pattern Template (X, Y, Y, X, A, B, ...)
  • Non-summarizable
  • Coarser aggregates cannot be computed solely from
    the corresponding finer aggregates.

35
Properties of S-cuboids
  • Infinite number of S-cuboids
  • The number of pattern dimensions is infinite
  • Pattern Template (X, Y, Y, X, A, B, ...)
  • Full materialization is impossible!
  • Non-summarizable
  • Coarser aggregates cannot be computed solely from
    the corresponding finer aggregates.
  • Partial materialization is infeasible!

36
Properties of S-cuboids
  • Research direction
  • Precompute some other auxiliary data structures
    so that queries can be computed online using the
    pre-built data structures

37
Auxiliary Data Structures
38
Counter-Based approach
  • Each cell in an S-cuboid is associated with a
    counter.
  • To determine the counter values, the entire set
    of sequences is scanned.
  • For each sequence s, we determine the cells whose
    associated patterns are contained in s and
    increment each such counter by 1.
  • Basic and simple
  • But processing iterative queries requires
    counting from scratch.
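The counter-based approach described above can be sketched as a single scan over the sequences. This is a minimal illustration, assuming substring semantics; `counter_based` and the `scanned` tally are hypothetical names used to show that a follow-up query repeats the full scan.

```python
from collections import Counter


def counter_based(template, sequences):
    """Scan every sequence once and increment one counter per matching cell.

    Returns the cell counters and the number of sequences scanned.
    """
    m = len(template)
    counters = Counter()
    scanned = 0
    for seq in sequences:
        scanned += 1
        cells = set()
        for start in range(len(seq) - m + 1):
            window = tuple(seq[start:start + m])
            binding = {}
            if all(binding.setdefault(s, v) == v for s, v in zip(template, window)):
                cells.add(window)
        for cell in cells:
            counters[cell] += 1   # each sequence adds at most 1 to a cell
    return counters, scanned


data = [
    ["Airport", "Waterfront", "Airport", "Waterfront", "Marine Drive"],
    ["Waterfront", "Marine Drive"],
]
q1, scans_q1 = counter_based(("X", "Y"), data)
# An APPEND-style follow-up query starts again from the raw sequences.
q2, scans_q2 = counter_based(("X", "Y", "Z"), data)
print(dict(q1), scans_q1, scans_q2)
```

Nothing computed for the first query is reused by the second, which is the cost the slides refer to as counting from scratch.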

39
Inverted Indices
  • Inverted-Index Approach
  • Create a set of inverted indices by
    pre-processing the data offline.
  • During query processing, the relevant inverted
    indices are joined in real-time.
  • A by-product of answering a query is the creation
    of new inverted indices.
  • Such indices can be used to assist the
    processing of iterative S-OLAP operations.
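One way to picture the inverted-index idea is sketched below. It is an assumption-laden simplification, not the presented system's data structure: the index maps each contiguous pattern to the IDs of the sequences containing it, and `build_pair_index`, `contains` and `extend_index` are hypothetical names.

```python
from collections import defaultdict


def build_pair_index(sequences):
    """Offline step: map each adjacent pair (a, b) to the IDs of sequences containing it."""
    index = defaultdict(set)
    for sid, seq in sequences.items():
        for a, b in zip(seq, seq[1:]):
            index[(a, b)].add(sid)
    return index


def contains(seq, pattern):
    """True if `pattern` occurs in `seq` as a contiguous run."""
    m = len(pattern)
    return any(tuple(seq[i:i + m]) == pattern for i in range(len(seq) - m + 1))


def extend_index(index_m, pair_index, sequences):
    """Online step: join length-m lists with pair lists to index length-(m+1) patterns."""
    longer = defaultdict(set)
    for pattern, sids in index_m.items():
        for (a, b), pair_sids in pair_index.items():
            if a != pattern[-1]:
                continue
            candidate = pattern + (b,)
            # Only sequences on both lists can contain the longer pattern;
            # each candidate is then verified against the sequence itself.
            for sid in sids & pair_sids:
                if contains(sequences[sid], candidate):
                    longer[candidate].add(sid)
    return longer


seqs = {
    1: ["Airport", "Waterfront", "Airport", "Waterfront", "Marine Drive"],
    2: ["Waterfront", "Marine Drive"],
}
pairs = build_pair_index(seqs)
triples = extend_index(pairs, pairs, seqs)       # answers a <X, Y, Z> query
quads = extend_index(triples, pairs, seqs)       # reuses the by-product index
print({p: len(s) for p, s in triples.items()})
```

The `triples` index built while answering the first query is reused as input for the next APPEND, so the follow-up query only touches sequences already known to contain the shorter pattern.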

40
Experiments
  • A prototype S-OLAP system was implemented using
    C.
  • Real Data
  • KDD Cup 2000
  • Clickstream data from a web retailer selling
    legwear and legcare products.
  • 50,524 sequences.
  • KDD Cup 2000 Question 1
  • Look for page-click patterns
  • We answer this question in an explorative way via
    three iterative queries.

41
Experiments
Qa: Look for the statistics of all two-step
navigations at the page category level.
The corresponding pattern template to capture the
two-step navigation semantics is <X, Y>.
Cuboid Qa (44 × 44 cells)
<X, Y>, X and Y at page category level    User sessions
<Main page, Product Catalog>              6,524

<Product Catalog, Legwear Product>        2,201

<Main page, Promotion ad>                 852

<Product Catalog, Legcare Product>        150

42
Experiments
Qb: Many visitors browse from the product catalog
to a legwear product page. Which products exactly
do they browse?
SLICE on the <Product Catalog, Legwear Product>
cell, and DRILL DOWN on the pattern symbol Y.
Cuboid Qb (1 × 279 cells)
<X, Y> (sliced), X at page category level, Y at page level    User sessions
<Product Catalog, Null>                                       181
<Product Catalog, PID - 34839>                                172
<Product Catalog, PID - 34897>                                163

The most popular product that visitors
browse from the catalog page is product 34839
(DKNY skin legwear collection product).
43
Experiments
Qc: APPEND(Z)
Cuboid Qc (1 × 279 × 279 cells)
<X, Y, Z> (sliced), X at page category level, Y and Z at page level    User sessions

<Product Catalog, PID - 34839, PID - 34839>                            17
<Product Catalog, PID - 34839, PID - 34897>                            14

The runtime of the inverted-index approach (II) is
higher than that of the counter-based approach (CB)
in Qa because we include the index precomputation
time in Qa.

44
Experiments
For the iterative queries, II has the advantage
of processing only the sequences that possess the
pattern <Product Catalog, Legwear Product>.

45
Conclusion
  • We propose a novel online analytical processing
    system for sequence data analysis (The S-OLAP
    system).
  • We defined the sequence data cuboid
    (S-Cuboid) and the steps to construct
    S-Cuboids from the raw event dataset.
  • We identified two properties of S-Cuboids
  • Infinite number of S-Cuboids
  • Non-summarizable
  • Illustrated the usability of the proposed S-OLAP
    system through a prototype system that works on
    real data.

46
Ongoing/ Future works
  • Performance
  • Auxiliary data structures
  • Incremental update
  • Maintenance of the data structures
  • Review on the entire OLAP research history
  • Iceberg query

47
The End
Thank you!
48
S-OLAP Specific
Operations
APPEND, PREPEND
DE-HEAD, DE-TAIL
P-ROLL-UP, P-DRILL-DOWN
SLICE, DICE on Pattern and Global dimensions
49
S-OLAP specific operations
  • Navigate between cuboids with ease
  • APPEND (X,Y,Y) → (X,Y,Y,X)
  • DE-TAIL (X,Y,Y,X) → (X,Y,Y)
  • PREPEND (X,Y,Y,X) → (Z,X,Y,Y,X)
  • DE-HEAD (Q,Y,Y,X) → (Y,Y,X)
  • PATTERN-ROLL-UP (X,Y,Y,X) → (X,Y,Y,X)
  • PATTERN-DRILL-DOWN (X,Y,Y,X) → (x,Y,Y,x)
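The navigation operations above can be viewed as small transformations of a cuboid specification. The sketch below assumes a template is a tuple of symbols and that each symbol carries an abstraction level; the function names and the two-level `HIERARCHY` (station, branch) are hypothetical.

```python
# Assumed abstraction levels for a location dimension, finest first.
HIERARCHY = ("station", "branch")


def append(template, symbol):
    """APPEND: add a pattern symbol at the end of the template."""
    return template + (symbol,)


def prepend(template, symbol):
    """PREPEND: add a pattern symbol at the front of the template."""
    return (symbol,) + template


def de_tail(template):
    """DE-TAIL: drop the last pattern symbol."""
    return template[:-1]


def de_head(template):
    """DE-HEAD: drop the first pattern symbol."""
    return template[1:]


def p_drill_down(levels, symbol):
    """PATTERN-DRILL-DOWN: move one symbol to the next finer level."""
    i = HIERARCHY.index(levels[symbol])
    if i == 0:
        raise ValueError(f"{symbol} is already at the finest level")
    return {**levels, symbol: HIERARCHY[i - 1]}


def p_roll_up(levels, symbol):
    """PATTERN-ROLL-UP: move one symbol to the next coarser level."""
    i = HIERARCHY.index(levels[symbol])
    if i == len(HIERARCHY) - 1:
        raise ValueError(f"{symbol} is already at the coarsest level")
    return {**levels, symbol: HIERARCHY[i + 1]}


template = ("X", "Y", "Y")
levels = {"X": "branch", "Y": "branch"}
print(append(template, "X"))          # ('X', 'Y', 'Y', 'X')
print(p_drill_down(levels, "Y"))      # {'X': 'branch', 'Y': 'station'}
```

Each operation returns a new specification rather than modifying the old one, which mirrors the remark that changing the pattern template yields a new cuboid.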

50
S-OLAP specific operations
Vancouver Section
Airport Branch
Richmond Section
51
S-OLAP specific operations
S-Cuboid 1 (3 × 3 cells)
<X, Y>, X and Y at branch level        Passengers
<Airport branch, Vancouver section>    120,000
<Airport branch, Richmond section>     8,000
52
S-OLAP specific operations
S-Cuboid 2 (1 × 10 cells)
<X, Y>, X at branch level, Y at station level; X = Airport branch, Y = Vancouver section    Passengers
<Airport branch, Waterfront>               100,000
<Airport branch, Vancouver City Center>    8,300
<Airport branch, Olympic Village>          4,030
<Airport branch, Marine Drive>             2,430
53
S-OLAP specific operations
S-Cuboid 3 (1 × 10 × 10 cells)
<X, Y, Y>, X at branch level, Y at station level; X = Airport branch, Y = Vancouver section    Passengers
<Airport branch, Waterfront, Waterfront>              90,000
<Airport branch, City Center, City Center>            8,300
<Airport branch, Olympic Village, Olympic Village>    4,030
<Airport branch, Marine Drive, Marine Drive>          2,430
55
System Architecture
Skip
56
System Architecture
The raw data of an S-OLAP system is a set of
events that are deposited in an Event Dataset.
57
System Architecture
The job of the Sequence Query Engine is to
compose sets of event sequences out of the event
dataset (Phase 1 in S-Cuboid construction).
Sequence Query Engine
Event Dataset
Sequence Cache
58
System Architecture
The User Interface provides certain user-friendly
components to help a user specify an S-cuboid.
Queries
Sequence Query Engine
User Interface
Event Dataset
Sequence Cache
59
System Architecture
Given an S-cuboid query, the S-OLAP Engine
consults a Cuboid Repository to see whether such an
S-cuboid has been previously computed and stored.
Diagram: Queries, Sequence Query Engine, Sequence OLAP Engine, User Interface, Event Dataset, Sequence Cache, Results
60
System Architecture
The S-OLAP Engine computes the S-cuboid with the
help of certain Auxiliary Data Structures.
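The repository lookup described on these slides can be sketched as a simple cache in front of the cuboid computation. This is only an illustration: the class name CuboidRepository, the method get_or_compute, and the choice of key (pattern template plus abstraction levels) are assumptions, not the system's actual interface.

```python
from typing import Callable, Dict, Tuple

# Hypothetical types: a query is identified by its pattern template and the
# abstraction level of each symbol; a cuboid maps cell keys to counts.
CuboidKey = Tuple[Tuple[str, ...], Tuple[str, ...]]
Cuboid = Dict[Tuple[str, ...], int]


class CuboidRepository:
    """Stores previously computed S-cuboids, keyed by their specification."""

    def __init__(self) -> None:
        self._store: Dict[CuboidKey, Cuboid] = {}
        self.misses = 0  # how many cuboids had to be computed from scratch

    def get_or_compute(self, key: CuboidKey,
                       compute: Callable[[], Cuboid]) -> Cuboid:
        # Only call the (expensive) computation when the cuboid is not stored.
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute()
        return self._store[key]


repo = CuboidRepository()
key = (("X", "Y"), ("station", "station"))
first = repo.get_or_compute(key, lambda: {("Airport", "Waterfront"): 2})
second = repo.get_or_compute(key, lambda: {})  # served from the repository
```

The second call never runs its compute function, which is the saving the Cuboid Repository is meant to provide.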
62
Experiments on synthetic data
  • D: number of sequences
  • L: average length of a generated sequence
  • I: number of possible pattern symbols (e.g.,
    X, Y, Z are pattern symbols)
  • θ: the skew factor of the Zipf distribution
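A generator driven by these four parameters might look like the sketch below. The slides do not say how sequence lengths are drawn or how the Zipf weights are defined, so two assumptions are made here: lengths are uniform on [1, 2L - 1] (which gives mean L), and the symbol of rank r gets weight 1 / r^θ. The function and parameter names are illustrative.

```python
import random
from typing import List


def generate_sequences(num_sequences: int, avg_len: int, num_symbols: int,
                       theta: float, seed: int = 0) -> List[List[int]]:
    """Generate sequences over symbols 0..num_symbols-1 with Zipf-like skew.

    Assumptions (not from the slides): lengths are uniform on
    [1, 2 * avg_len - 1], and symbol k (rank k + 1) has weight
    1 / (k + 1) ** theta.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** theta) for rank in range(1, num_symbols + 1)]
    symbols = list(range(num_symbols))
    sequences = []
    for _ in range(num_sequences):
        length = rng.randint(1, 2 * avg_len - 1)
        sequences.append(rng.choices(symbols, weights=weights, k=length))
    return sequences


data = generate_sequences(num_sequences=1000, avg_len=20,
                          num_symbols=100, theta=0.9)
```

With theta = 0 every symbol is equally likely; larger values concentrate the events on the low-ranked symbols.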

63
Experiments on synthetic data
  • Study the scalability of the Counter-Based approach
    and the Inverted-Index approach under a series of
    APPEND operations
  • QA1 (X,Y) → SLICE APPEND → QA2 (X,Y,Z) → SLICE
    APPEND → QA3 (X,Y,Z,A) → SLICE APPEND → QA4
    (X,Y,Z,A,B) → SLICE APPEND → QA5 (X,Y,Z,A,B,C)
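To make the query chain concrete, here is a naive counting scan over a growing pattern template. It is a sketch under stated assumptions, not the Counter-Based or Inverted-Index implementation that was benchmarked: patterns are matched as contiguous substrings, different symbols may bind to the same value, and each sequence is counted at most once per cell. The SLICE step of the chain is not shown.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def count_cuboid(sequences: List[Sequence[str]],
                 template: Tuple[str, ...]) -> Counter:
    """Count, per cell, the sequences that contain a matching substring."""
    symbols = list(dict.fromkeys(template))  # distinct symbols, first-seen order
    m = len(template)
    counts: Counter = Counter()
    for seq in sequences:
        cells = set()
        for start in range(len(seq) - m + 1):
            binding: Dict[str, str] = {}
            ok = True
            for sym, value in zip(template, seq[start:start + m]):
                # A repeated symbol must see the same value every time.
                if binding.setdefault(sym, value) != value:
                    ok = False
                    break
            if ok:
                cells.add(tuple(binding[s] for s in symbols))
        counts.update(cells)  # each sequence contributes once per cell
    return counts


seqs = [["a", "b", "c"], ["a", "b", "a"], ["b", "c"]]
qa1 = count_cuboid(seqs, ("X", "Y"))
qa2 = count_cuboid(seqs, ("X", "Y", "Z"))  # the template after APPEND Z
```

Each APPEND lengthens the template by one symbol, so a scan like this has to revisit every sequence for every query in the chain.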

64
Experiments on synthetic data
65
Related Work
  • Sequence databases
      • PREDATOR (Seshadri, Livny, and Ramakrishnan,
        SIGMOD 94, VLDB 96)
      • DEVise (Ramakrishnan et al., SSDBM 98)
      • TS-SQL (Sadri et al., PODS 01)
  • OLAP
      • Data-cube operator (Gray et al. 95),
        iceberg-cube, star-schema, etc.
  • OLAP on unconventional data
      • RFID-cube (Gonzalez, Han, and Li, VLDB 06)
      • Stream-cube (Chen et al., VLDB 02)
      • XML-cube (Wiwatwattana et al., ICDE 07)

66
Preliminary
  • Event
      • a tuple inside a fact table
  • Dimension (associated with a concept hierarchy)
      • Time: time → day → week
      • Location: station → branch
      • Card-id: individual → fare-group
        (student/regular/senior)
  • Measure
      • Count, Amount
  • If there is a logical ordering among a set of
    events, the events can form a sequence

An event dataset:
Event | Time | Card-ID | Location | Action | Amount
e1 | 2008-6-09 0001 | 688 | Vancouver Airport | in | 0
e2 | 2008-6-09 0225 | 688 | Waterfront | out | -5

e101 | 2008-6-09 2223 | 688 | Waterfront Machine 10 | Add value | 100
e102 | 2008-6-09 2225 | 688 | Waterfront | in | 0

e180 | 2008-6-09 2349 | 688 | Vancouver Airport | out | -5
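As a rough illustration of how events with a logical ordering form a sequence, the sketch below groups a few rows of the event dataset by card and orders each group by time. Grouping by Card-ID and ordering by Time are assumptions made for this example, and the names Event and form_sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    event_id: str
    time: str      # date and HHMM time, as printed on the slide
    card_id: str
    location: str
    action: str
    amount: int


# A subset of the rows from the slide, deliberately out of order.
events = [
    Event("e102", "2008-6-09 2225", "688", "Waterfront", "in", 0),
    Event("e1", "2008-6-09 0001", "688", "Vancouver Airport", "in", 0),
    Event("e180", "2008-6-09 2349", "688", "Vancouver Airport", "out", -5),
    Event("e2", "2008-6-09 0225", "688", "Waterfront", "out", -5),
]


def form_sequences(rows: List[Event]) -> Dict[str, List[Event]]:
    """Group events by card and order each group by time."""
    groups: Dict[str, List[Event]] = defaultdict(list)
    for row in rows:
        groups[row.card_id].append(row)
    # String comparison is enough here because all rows share one date and
    # the times are zero-padded; real data would need parsed timestamps.
    return {card: sorted(group, key=lambda e: e.time)
            for card, group in groups.items()}


sequences = form_sequences(events)
locations = [e.location for e in sequences["688"]]
```

For card 688 the resulting location sequence is Vancouver Airport, Waterfront, Waterfront, Vancouver Airport, which has the round-trip shape (X,Y,Y,X).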

69
Research Aspects
  • Sequence OLAP concepts
  • Sequence cuboid (S-cuboid)
  • Sequence data cube (S-cube)
  • S-OLAP-specific operations
  • (1) APPEND, (2) DE-TAIL, (3) PREPEND, (4)
    DE-HEAD, (5) PATTERN-ROLL-UP and (6)
    PATTERN-DRILL-DOWN
  • Implementation
  • Compute sequence cuboids efficiently
  • Support the six S-OLAP-specific operations
    efficiently
  • Experimental evaluation

70
S-OLAP specific operations
  • Navigate between cuboids with ease
  • OLAP operations for Global Dimensions
      • SLICE, DICE, ROLL-UP, DRILL-DOWN, etc.
  • S-OLAP operations for Pattern Dimensions
      • E.g., from round-trip (X,Y,Y,X)
      • APPEND X → (X,Y,Y,X,X)
      • APPEND Z → (X,Y,Y,X,X,Z)

S-Cuboid 1
<X, Y, Y, X> | Count


S-Cuboid 2
<X, Y, Y, X, X> | Count


S-Cuboid 3
<X, Y, Y, X, X, Z> | Count
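Viewed purely as edits to the pattern template, the APPEND steps above can be sketched with tuples. The behaviour of PREPEND, DE-TAIL and DE-HEAD is inferred here from their names and from the APPEND example, so treat those three as assumptions; PATTERN-ROLL-UP and PATTERN-DRILL-DOWN change abstraction levels rather than the template and are left out.

```python
from typing import Tuple

Template = Tuple[str, ...]


def append(template: Template, symbol: str) -> Template:
    """Add a symbol at the end of the pattern template."""
    return template + (symbol,)


def prepend(template: Template, symbol: str) -> Template:
    """Add a symbol at the front (assumed mirror image of APPEND)."""
    return (symbol,) + template


def de_tail(template: Template) -> Template:
    """Drop the last symbol (assumed inverse of APPEND)."""
    return template[:-1]


def de_head(template: Template) -> Template:
    """Drop the first symbol (assumed inverse of PREPEND)."""
    return template[1:]


round_trip: Template = ("X", "Y", "Y", "X")
step1 = append(round_trip, "X")   # the template of S-Cuboid 2
step2 = append(step1, "Z")        # the template of S-Cuboid 3
```

Each operation only yields a new template; the cuboid for that template still has to be computed or fetched.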


71
Sequence Data Cube
Infinite number of S-Cuboids
Non-summarizable
72
Sequence Data Cube
  • Given
      • A set of global dimensions
      • A set of pattern dimensions
      • A set of concept hierarchies that is associated
        with the dimensions
  • We can define an S-cuboid for each of the
    possible subsets of the given dimensions and
    abstraction levels.
  • The set of S-cuboids forms a lattice
  • → Sequence Data Cube (S-cube)
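For a fixed pattern template, the lattice can be enumerated by choosing one abstraction level per dimension. The sketch below uses the concept hierarchies from the Preliminary slide and adds an "ALL" level to stand for a dimension that is left out; that level, the function names, and treating all dimensions uniformly are assumptions. It does not model the unbounded set of pattern templates that makes the number of S-cuboids infinite.

```python
from itertools import product
from typing import Dict, List, Tuple

Cuboid = Tuple[Tuple[str, str], ...]  # one (dimension, level) pair per dimension

# Concept hierarchies, finest level first. "ALL" is an added top level
# meaning the dimension is aggregated away.
hierarchies: Dict[str, List[str]] = {
    "time": ["time", "day", "week", "ALL"],
    "location": ["station", "branch", "ALL"],
    "card-id": ["individual", "fare-group", "ALL"],
}


def enumerate_cuboids(h: Dict[str, List[str]]) -> List[Cuboid]:
    """List every combination of one abstraction level per dimension."""
    dims = list(h)
    return [tuple(zip(dims, levels))
            for levels in product(*(h[d] for d in dims))]


def is_coarser_or_equal(a: Cuboid, b: Cuboid,
                        h: Dict[str, List[str]]) -> bool:
    """True if cuboid a is at least as coarse as b in every dimension."""
    return all(h[d].index(la) >= h[d].index(lb)
               for (d, la), (_, lb) in zip(a, b))


cuboids = enumerate_cuboids(hierarchies)
finest, coarsest = cuboids[0], cuboids[-1]
```

The partial order given by is_coarser_or_equal is what makes the set a lattice: the finest cuboid sits at the bottom and the all-"ALL" cuboid at the top.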