Processing AdHoc Joins on Mobile Devices - PowerPoint PPT Presentation

1 / 61
About This Presentation
Title:

Processing AdHoc Joins on Mobile Devices

Description:

Mobile Devices and Databases ... Mobile users may wish to combine information from more ... Send a selection query Q to a server needs T(BSQL Bkey) bytes ... – PowerPoint PPT presentation

Number of Views:76
Avg rating:3.0/5.0
Slides: 62
Provided by: Eri6
Category:

less

Transcript and Presenter's Notes

Title: Processing AdHoc Joins on Mobile Devices


1
Processing Ad-Hoc Joins on Mobile Devices
  • HKU CSIS DB Seminar
  • 10 Oct 2003
  • Speaker Eric Lo

2
Mobile Devices and Databases
  • Cellular phones and Personal Data Assistants
    (PDAs) are capable to ask information from remote
    database(s) anywhere and anytime
  • The connection channel is wireless
  • E.g., WAP, IEEE 802.11 (also WiFi), GPRS, 3G

3
Example
HK Stock Exchange
SELECT Stock_Price FROM DB WHERE Stock_Code 8
8 - PCCW 1156am HKD 2.5
1155am What is the stock price of PCCW now?
4
There are no free lunchOption 1 Charged by
airtime
HK Stock Exchange
SELECT Stock_Price FROM DB WHERE Stock_Code 8
8 - PCCW 1156am HKD 2.5
1155am What is the stock price of PCCW now?
1
2.8
4.6
10.2
5
Option 2 Charged by amount of data transferred
  • Network traffic and QoS of wireless data
    networking are strongly dependent on factors
    like
  • Network workloads
  • Availability of network stations
  • Charged by amount of data transferred
  • Minimizing the transfer cost ? dollar!

6
Query more than one data source
  • Mobile users may wish to combine information from
    more than one remote databases
  • E.g., A vegetarian visits Hong Kong and looks for
    some restaurants recommended by both HK tourist
    office and HK vegetarian community

7
Example relations
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
Join Query
8
Motivations
  • Evaluating join queries on mobile devices
  • Considerations
  • Mobile device has limited memory
  • Minimizing the transfer cost (dollar )
  • Databases are non-collaborative
  • Query results are small in sizes compare to input
    relations (ad-hoc)

9
Download all relations?
  • Download both relations (HK tourist office and HK
    vegetarian directory) onto the mobile device and
    evaluate the join on the device locally
  • Wont be able to hold the large amount of data
    from the remote databases (for most mobile
    devices)
  • The transfer cost is very high though the result
    size is very small

10
Outline
  • Introduction and motivation
  • A simple late-projection strategy
  • Block-merge join
  • Ship-data-as-queries join
  • RAMJ Recursive and Adaptive Mobile Join
  • Experiment result
  • Conclusions and future work

11
A simple late-projection strategy
  • Traditional distributed query processing
    techniques like semi-join involves
  • Shipping of join columns and (whole) tuples
  • Across the trusted distributed nodes directly
  • In high selective join, most tuples fails to
    contribute to the final result
  • Semi-join d/l the non-key attributes which may
    not be included in the result
  • Download and join the distinct values of join
    keys only (Do not download the non-key
    attributes)
  • Only tuples belong to join result entails
    downloading the rest of non-key attributes

12
A late projection strategy
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
13
Step 1
  • Download ? Name R1

SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
14
Step 2
  • Download ? Name R2

SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
15
Step 3
  • Evaluate T ? Name (R1) ? Name (R2) locally


T
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
16
Step 4
  • Evaluate ? Name,Address (?NameT (R1))

T
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
17
Step 5
  • Evaluate ? Name,Cost (?NameT (R2))

T
SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
18
Step 6
  • Join the two resultsets

SELECT R1.Name, R1.Address, R2.Cost FROM R1,
R2 WHERE R1.Name R2.Name
19
Block-merge join (BMJ)
  • Late-projection still insufficient if the whole
    join column cannot fit into the memory of mobile
    devices
  • Applying sort-merge join, with the sorting part
    on the servers
  • 1 block of ordered join keys are downloaded from
    each server and join them locally, until one
    block is exhausted
  • Each block must
  • Cover same data range
  • Sorted in same order (e.g., both in ascending
    order)
  • Small enough to be resided in memory
  • Each block can be downloaded by using ROWNUM or
    LIMIT SQL statements

20
Ship-data-as-queries join (SDAQ)
  • If R1
  • Download the join column of R1 to the mobile
    deviceSELECT Name FROM R1
  • Send the join keys to R2 in form of SQL selection
    queries (e.g., if two results returned in step
    1) SELECT Name FROM R2WHERE Name in (Beta
    Food,Ceta Food))
  • The result from R2 are the joined keys

Very few results returned
21
Can we do even better?
  • Block-merge join (BMJ) can handle the limited
    memory problem, BUT download all join keys
    essentially
  • Ship-data-as-queries (SDAQ) can do better ONLY if
    the sizes of two relations differ much
  • Pay small overhead ?
  • Build histograms that capture the data
    distribution of target relations
  • Join space pruning
  • Bucket-base joining

22
Pruning the data space
23
Constructing Histogram
  • Problem
  • Mobile devices are not able to receive those
    histograms (they are some internal data
    structures in remote databases)
  • Solution
  • Constructing some queries that build histogram
    through SQL

24
String Histogram
  • Using the SUBSTRING function
  • SELECT SUBSTRING(Name,1,1) AS Bucket,COUNT(Name)
    As CountFROM R1GROUP BY SUBSTRING(Name,1,1)HAVI
    NG COUNT(Name) 0

25
Numeric Histogram
  • Using ROUND function
  • SELECT ROUND(Cost/(D/G)) AS Bucket,COUNT(ROUND(Co
    st/(D/G)) As CountFROM R2GROUP BY
    ROUND(Cost/(D/G)HAVING COUNT(ROUND(Cost/(D/G))
    0
  • G is granularity that specifies the number of
    bucket
  • D is the numeric domain

26
A Bucket-base Approach
  • So far we know that
  • SDAQ is good when input relations have large size
    difference
  • But, BMJ is better when two input relations have
    similar size (Why?)
  • Histogram helpful to prune join space
  • Which method is better?
  • The histogram can do more!
  • The histogram already partitioned the data space
    in form of buckets
  • Depends on the data distribution of each bucket,
    assign the best action to them adaptively

27
Direct Join (BMJ)
28
Ship-join-keys (SDAQ)
29
Recursive Partitioning
30
Recursive Partitioning
  • Further breaks a partition into more refined ones
    and further request histogram for it
  • Hoping some sub-buckets are being pruned in
    future
  • Or hoping cheap ship-join-keys join can be
    applied on some future sub-buckets

31
Recursive Partitioning
  • SELECT SUBSTRING(Name,1,2) AS Bucket,COUNT(Name)
    AS CountFROM R2WHERE SUBSTRING(Name,1,1)AGR
    OUP BY SUBSTRING(Name,1,2)HAVING COUNT(Name) 0

32
Which action is the best for each bucket? The
cost model!
  • The largest amount of data that can be
    transferred in one packet is called MTU (Maximum
    Transfer Unit)
  • The largest segment of TCP data that can be
    transmitted is called MSS(Maximum Segment Size)
  • MTU MSS BH (BH is the size of headers)
  • To transfer B bytes data, the actual number of
    bytes to be transferred is

33
Cost Model
  • Assume CR1 and CR2 be the cost of accessing R1
    and R2, respectively
  • Send a selection query Q to a server needs T(BSQL
    Bkey) bytes
  • Bkey 4 bytes for numeric attributes
  • Bkey 2L bytes for string attributes in length L

34
Cost Model
  • Under these settings, we have to determine the
    cost of
  • Direct Join C1
  • Ship-join-key Join C2
  • Recursive Partitioning C3
  • Execute the minimal cost action for each bucket
    adaptively

35
C1 Direct Join
  • ai,ßi be the i-th histogram bucket summarizing
    the same data region
  • Send a selection query to R1 CR1 T(BSQL Bkey)
  • Receiving CR1 T(aiBkey) bytes.
  • Send a selection query to R2 CR2 T(BSQL Bkey)
  • Receiving CR2 T(ßiBkey) bytes.
  • C1(ai,ßi) (CR1 CR2 )T(BSQL Bkey)
    CR1 T(aiBkey) CR2 T(ßiBkey)

36
C2 Ship-join-keys
  • ai
  • Send a selection query to the smaller relation R2
    that holds ai CR2 T(BSQL Bkey)
  • Receiving CR2 T(aiBkey) bytes from R2
  • Send a selection query to larger relation R1 to
    check existence of ai keys CR1 T(BSQL
    aiBkey)
  • Receiving at most ai keys from R1 CR1
    T(aiBkey)
  • C2(ai,ßi) CR1 (T(BSQL Bkey) T(aiBkey))
    CR2 T(T(BSQL 2aiBkey))

37
C3 Recursive Partitioning
  • Have to estimate the cost of
  • Ask for finer histograms for that bucket from R1
  • Ask for finer histograms for that bucket from R2
  • For each pair of (future/virtual) sub-buckets,
    each of them may execute direct-join,
    ship-join-key or recursive partitioning again

38
Recursive Partitioning C3
  • Ask for finer histograms from R1
  • Ch(G,R1) CR1(T(G(Bkey4))T(BSQLBkey))
  • Ask for finer histograms from R2
  • Ch(G,R2) CR2(T(G(Bkey4))T(BSQLBkey))
  • For each pair of (future) sub-buckets, each
    sub-bucket pair may recursively follow
    direct-join, ship-join-key or recursive
    partitioning again

39
Recursive Partitioning C3
  • C3(ai,ßi) Ch (G,R1) Ch (G,R2) CRP (ai,ßi)

?
40
Recursive Partitioning Optimistic Estimation
  • Optimistically assume that buckets in next level
    are all being pruned
  • It will hold if the data distribution in the two
    datasets is very different
  • Since all future (next-level) sub-buckets are
    being pruned, they would NOT have any actions
  • Therefore
  • C3(ai,ßi) Ch (G,R1) Ch (G,R2) CRP (ai,ßi)

41
Recursive Partitioning Linear Interpolation
Estimation
  • More accurate. Higher computational cost
  • Exploit the histogram in current level to
    estimate the distribution of next level

We have histograms in this level
We DONT have histograms in this level
42
Linear Interpolation Estimation
  • Select adjacent buckets as interpolation points
  • Preserve the current trend
  • Resistance to fluctuated distribution

b3
b3
b2
b2
b4
b4
b5
b5
b1
b1
43
One problem left
Level 1
b3
b3
b2
b2
b4
b4
b5
b5
b1
b1
Level 2
  • Level 1 The cost of RP on b2 is?
  • Estimate Level 2 by Linear Interpolation
  • b2,1, b2,2, , b2,5 are found
  • b2,1, b2,2, , b2,5 are found,
    determine which action is the most
    cost-efficient (C1,C2 or C3) for each sub-bucket?
  • C1, C2 of b2,1, b2,2, , b2,5
    can be determine
  • C3 of b2,1, b2,2, , b2,5 ?
    Started from step 1 again

b2,1
2
3
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
b2,1,1, b2,1,2, b2,1,3, b2,1,4, b2,1,5
44
If you dont understand
  • Cost 3 of one level depends on next level
  • Fortunately the cost of CRP (ai,ßi) is bounded by
    the following inequality

C1
ai,ßi
C2
C1
C3
C2
C1
C3
C2
C3
45
Recursive Partitioning Linear Interpolation
Estimation
  • C3(ai,ßi) Ch (G,R1) Ch (G,R2) CRP (ai,ßi)
  • In optimistic estimation, we omit the last item
    optimistically
  • CRP (ai,ßi) is bounded by the inequality
  • Summing up everything

46
RAMJ Algorithm
  • Recursive and Adaptive Mobile Join

47
Real Data Experiment
  • Real Data Set 1
  • DBLP
  • Join relations Conference (235K tuples) and
    Journal (125K tuples) in order to find the set
    of publications that have the same conference and
    journal title
  • 3836 publications have same title in both
    conference and journal paper
  • SELECT R1.Title
  • FROM Conference R1, Journal R2
  • WHERE R1.Title R2.Title

48
Real Data Experiment
  • Real Data Set 2
  • Restaurants Data Set
  • Crawled from www.restaurantrow.com
  • Join relation Steak (4573 tuples) and
    Vegetarian (2098 tuples) in order to find the
    set of restaurants that offer both steak and
    vegetarian dishes (163 joined)
  • SELECT R1.Name
  • FROM Steak R1, Vegetarian R2
  • WHERE R1.Name R2.Name

49
Real Data Experiment Result
50
Synthetic Data Experiment
  • Generate 3 relations with different
    distributions
  • Gaussian
  • Negative Exponential
  • Zipf (skewness ? 1)
  • Default
  • 10,000 tuples
  • Domain 100,000
  • G 20

51
Synthetic Data Experiment Result
52
The impact of data skew
53
The impact of memory size
54
Conclusions and Future Work
  • Identify the requirements and limitation on
    evaluating ad-hoc join on mobile devices
  • A recursive and adaptive algorithm RAMJ
  • Extension to multi-way joins and multi-attributes
  • Extension to Top-K-Join
  • Existing approaches on Top-K problem ONLY works
    on collaborative database

55
Q A
  • ?

56
Approach 2 Mediator
  • User queries are free-form
  • i.e., User KY may issue a join query that
    involves DB x and DB y, whereas user BY may issue
    a join query that involves DB e and DB f
  • Mediator cannot answer those queries without
    prior preparation like data integration, schema
    matching
  • Mediator services may charge the users as well

57
Approach 2Distributed query processing?
  • Existing distributed database work on trusted
    environment only
  • DB1.Name DB2.CustomerName
  • Semi-join
  • Site DB1 Evaluate J ? Name DB1 J All Names
  • Send J from DB1 to DB2
  • Site DB2 Evaluate K ? CustomerName J( DB2 )
  • Find all CustomerName Name
    in DB1
  • Send K from DB2 to DB1

58
Approach 2Distributed query processing?
  • Not work on our problem!
  • DB1 and DB2 are non-collaborative
  • Would not accept data structures as input
    (e.g., a join column in semi-join or a
    hash-table in bloom-join
  • Accept SQL only
  • Semi-join is worked by some modifications
  • Send J and K through the mobile device
  • ? High transfer cost

59
References
  • Processing Ad-hoc joins on mobile devices,
    submitted to EDBT 04
  • P.A. Bernstein and N. Goodman. Power of natural
    semijoin. SAIM Journal of Computing, 1981
  • P.A. Bernstein, N. Goodman, et. al. Query
    processing in a system for distributed databases
    (sdd-1). ACM TODS, 1981
  • N. Mamoulis, P. Kalnis, et. al. Optimization of
    spatial joins on mobile devices. SSTD, 2003

60
Other Join Types
  • Equi-join with selection constraints
  • E.g., We are interested in restaurants appear in
    both datasets and the expense is less than 20
  • SELECT R1.Name, R1.Address, R2.Cost
  • FROM R1, R2
  • WHERE R1.Name R2.Name
  • AND R2.Cost
  • Add this condition in histogram construction
  • SELECT SUBSTRING(Name,1,1) AS Bucket,
  • COUNT(Name) As Count
  • FROM R1
  • WHERE R2.Cost
  • GROUP BY SUBSTRING(Name,1,1)
  • HAVING COUNT(Name) 0

61
Iceberg Semi-join
  • Find all restaurants in R1 which are recommended
    by at least 10 users in a discussion group R2
  • Properties
  • Equi-join between R1 and R2
  • Results are comes from R1 only
  • Condition is applied on R2 only (10 users)
  • SELECT SUBSTRING(Name,1,1) AS Bucket,
  • COUNT(Name) As Count
  • FROM R2
  • GROUP BY SUBSTRING(Name,1,1)
  • HAVING COUNT(Name) t
Write a Comment
User Comments (0)
About PowerShow.com