1
Cloud-based Data Management: Challenges and
Opportunities
Jiaheng Lu, Renmin University of China
2009-08-25
2
Research experience and interests
  • PhD, National University of Singapore
  • XML query processing and XML keyword search
  • Postdoc, University of California, Irvine
  • Approximate string processing
  • Data integration and data cleaning
  • Renmin University of China
  • Cloud data management
  • XML data management

3
Outline
  • Motivation: cloud data management
  • Database future and challenges
  • Large-scale data management and transaction
    processing
  • Cloud-based data indexing and query optimization
  • Recent research work
  • An efficient multi-dimensional index for
    cloud data management
  • CIKM Workshop CloudDB 2009

4
Motivation: Internet Chatter
5
BLOG Wisdom
  • If you want vast, on-demand scalability, you
    need a non-relational database, since
    scalability requirements
  • can change very quickly and
  • can grow very rapidly,
  • which is difficult to manage with a single
    in-house RDBMS server.
  • RDBMSs scale well
  • when limited to a single node, but
  • scaling across multiple server nodes brings
    overwhelming complexity.

6
Current State
  • Most enterprise solutions are based on RDBMS
    technology.
  • Significant operational challenges:
  • Provisioning for peak demand
  • Resource under-utilization
  • Capacity planning: too many variables
  • Storage management: a massive challenge
  • System upgrades: extremely time-consuming

7
Internet Search Data Analytics: A Case Study
  • Data analytics
  • Parsed Web logs ingested into an RDBMS store.
  • Hourly and daily summarization for custom
    reporting.
  • Operational nightmare
  • Keeping the live reporting system on at all costs
    and at all times.
  • Timely completion of hourly summarization.
  • Constant tension between ad-hoc and reporting
    workloads.
  • Data-driven feedback to live products.
  • Temporal depth of detailed data.

8
Internet Search Data Analytics: A Case Study
  • Various solutions explored:
  • Data warehousing appliance for fast
    summarization.
  • Parallel RDBMS technology for fast ad-hoc
    queries.
  • Business intelligence products (data cubes) for
    fast and intuitive reporting and analysis.
  • None of the solutions was completely satisfactory.
  • Plans to migrate low-level data to a file-based
    system to overcome database scalability
    bottlenecks.

9
Paradigm Shift in Computing
10
The Web is replacing the desktop
11
What is Cloud Computing?
  • Old idea: Software as a Service (SaaS)
  • Definition: delivering applications over the
    Internet
  • Recently: Hardware, Infrastructure, and Platform
    as a Service
  • Poorly defined, so we avoid all "X as a Service"
    terms
  • Utility computing: pay-as-you-go computing
  • Illusion of infinite resources
  • No up-front cost
  • Fine-grained billing (e.g., hourly)

12
Why Now?
  • Experience with very large datacenters
  • Unprecedented economies of scale
  • Other factors
  • Pervasive broadband internet
  • Pay-as-you-go billing model

13
Cloud Computing Spectrum
  • Instruction Set VM (Amazon EC2, 3Tera)
  • Framework VM
  • Google AppEngine, Force.com

14
Cloud Killer Apps
  • Mobile and web applications
  • Extensions of desktop software
  • Matlab, Mathematica
  • Batch processing/MapReduce

15
Economics of Cloud Users
  • Pay by use instead of provisioning for peak

16
Economics of Cloud Users
  • Risk of over-provisioning: underutilization

17
Economics of Cloud Users
  • Heavy penalty for under-provisioning

18
Economics of Cloud Providers
  • 5-7x economies of scale [Hamilton 2008]
  • Extra benefits:
  • Amazon: utilize off-peak capacity
  • Microsoft: sell .NET tools
  • Google: reuse existing infrastructure

19
Engineering Definition
  • Providing services on virtual machines allocated
    on top of a large physical machine pool.

20
Business Definition
  • A method to address scalability and availability
    concerns for large scale applications.

21
Data Management in the Cloud?
22
Cloud Computing: Implications for DBMSs
  • Where do databases fit in this paradigm?
  • Generational reality:
  • Animoto.com
  • Started with 50 servers on Amazon EC2
  • Growth of 25,000 users/hour
  • Needed to scale to 3,500 servers in 2 days
  • Many similar stories
  • RightScale
  • Joyent

23
Clouded Data?
  • Reality #1
  • Unlimited-processing assumption
  • Interactive page views
  • by targeting a large number of SQL queries against
    MySQL
  • while still expecting sub-millisecond object
    retrieval
  • Reality #2
  • Why can't the database tier be replicated in the
    same way as the Web server and app server tiers?
  • These are the major challenges for data
    management in the cloud.

24
The Vision
  • R&D challenges at the macro level:
  • Where and how does the DBMS fit into this model?
  • R&D challenges at the micro level:
  • Specific technology components that must be
    developed to enable the migration of enterprise
    data into the cloud.

25
Data and Networks: Attempt #1
  • Distributed databases (1980s)
  • Idealized view: unified access to distributed
    data
  • Prohibitively expensive global synchronization
  • Remained a laboratory prototype
  • Associated technology widely in use: 2PC

26
Data and Networks: Attempt #2
27
Data and Networks: Pragmatics
28
Database on S3 [SIGMOD'08]
  • Amazon's Simple Storage Service (S3):
  • Updates may not preserve initiation order
  • No forced writes
  • Eventual-consistency guarantee
  • Proposed solution (see the sketch below):
  • Pending update queue
  • Checkpoint protocol to ensure consistent ordering
  • ACID: only atomicity and durability
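
A minimal Python sketch of the pending-update-queue idea; this
is an illustration of the concept, not the paper's actual
protocol, and every name below is a hypothetical assumption.

class PendingUpdateQueue:
    """Buffers updates so a checkpoint can apply them to the
    store in one consistent order."""

    def __init__(self):
        self._pending = []      # updates held back from the store
        self._next_seq = 0      # monotonic sequence number

    def add(self, key, value):
        # Stamping a sequence number at enqueue time fixes a total
        # order even if the underlying store would reorder writes.
        self._pending.append((self._next_seq, key, value))
        self._next_seq += 1

    def checkpoint(self, store):
        # Apply buffered updates in sequence order, then clear.
        # This gives atomicity and durability per checkpoint, but
        # no isolation -- matching "ACID: only atomicity and
        # durability" above.
        for _, key, value in sorted(self._pending):
            store[key] = value
        self._pending.clear()

q, store = PendingUpdateQueue(), {}
q.add("user:1", "alice")
q.add("user:2", "bob")
q.checkpoint(store)     # both updates become visible together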

29
Unbundling Transactions in the Cloud
  • Research results:
  • CIDR'09 proposal to unbundle transaction
    management for cloud infrastructures
  • Attempts to refit the DBMS engine to cloud
    storage and computing

30
Analytical Processing
31
Architectural and System Impacts
  • Current state:
  • MapReduce paradigm for data analysis
  • What is missing:
  • Auxiliary structures and indexes for associative
    access to data (i.e., attribute-based access)
  • Caveat: inherent inconsistency and approximation
  • Future projection:
  • Eventual merger of databases (ODSs) and data
    warehouses as we have learned to use and
    implement them.

32
Underlying Principles [CIDR 2009]
  • Business data may not always reflect the state of
    the world or the business
  • Inherent lack of perfect information
  • Secondary data need not be updated with primary
    data
  • Inherent latency
  • Transactions/Events may temporarily violate
    integrity constraints
  • Referential integrity may need to be compromised

33
Data Security and Privacy
  • Data privacy remains a show-stopper in the
    context of database outsourcing.
  • Encryption-based solutions are too expensive and
    are projected to remain so for the foreseeable
    future: Private Information Retrieval [Sion 2008]
  • Other approaches:
  • Information-theoretic approaches that use data
    partitioning for security [Emekci 2007]
  • Hardware-based solutions for information security

34
Self-Management and Self-Tuning in Cloud-Based
Data Management
  • Self-management and self-tuning
  • Query optimization across thousands of nodes

35
Remarks
  • Data Management for Cloud Computing poses a
    fundamental challenge to database researchers
  • Scalability
  • Reliability
  • Data Consistency
  • Radically different approaches and solutions are
    warranted to overcome this challenge
  • Need to understand the nature of new applications

36
References
  • Life Beyond Distributed Transactions: An
    Apostate's Opinion, P. Helland, CIDR'07
  • Building a Database on S3, M. Brantner,
    D. Florescu, D. Graf, D. Kossmann, T. Kraska,
    SIGMOD'08
  • Unbundling Transaction Services in the Cloud,
    D. Lomet, A. Fekete, G. Weikum, M. Zwilling,
    CIDR'09
  • Principles of Inconsistency, S. Finkelstein,
    R. Brendle, D. Jacobs, CIDR'09
  • VLDB Database School (China) 2009:
    http://www.sei.ecnu.edu.cn/vldbschool2009/VLDBSchool2009English.htm

37
An Efficient Multi-Dimensional Index for Cloud
Data Management
  • CIKM Workshop CloudDB'09

38
Outline
  • INTRODUCTION
  • MULTI-DIMENSIONAL INDEX WITH KD-TREE AND R-TREE
  • Extended node partition
  • Node partition
  • Cost estimation strategy
  • EVALUATION

39
Cloud Computing
Google File System
Yahoo PNUTS
40
Distributed Cloud Databases
  • BigTable
  • HBase

How can we query on attributes other than the
primary key?

41
Distributed Index Single Dimension?
  • S. Wu and K.-L. Wu, An indexing framework for
    efficient retrieval on the cloud, IEEE Data Eng.
    Bull., vol. 32, pp.7582, 2009.
  • H. chih Yang and D. S. Parker, Traverse
    Simplified indexing on large map-reduce-merge
    clusters, in Proceedings of DASFAA 2009,
    Brisbane, Australia, April 2009, pp. 308322.
  • M. K. Aguilera, W. Golab, and M. A. Shah, A
    practical scalable distributed b-tree, in
    Proceedings of VLDB08, Auckland, New Zealand,
    August 2008, pp. 598609.

42
Outline
  • INTRODUCTION
  • MULTI-DIMENSIONAL INDEX WITH KD-TREE AND R-TREE
  • Extended node partition
  • Node partition
  • Cost estimation strategy
  • EVALUATION

43
Framework of Request Processing in the Cloud
44
R-Tree
An R-tree is a tree data structure similar to a
B-tree, but used for spatial access methods.
45
KD-Tree
A kd-tree (short for k-dimensional tree) is a
space-partitioning data structure for organizing
points in a k-dimensional space.
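
For illustration, a minimal kd-tree construction sketch in
Python; this is the standard textbook median-split algorithm,
not code from the paper.

def build_kdtree(points, depth=0):
    """points: a list of equal-length tuples; returns a nested
    dict (or None for an empty subtree)."""
    if not points:
        return None
    axis = depth % len(points[0])   # cycle through the k dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2          # median split keeps the tree balanced
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])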
46
R-Tree + KD-Tree: RKD-Tree
[Figure: a master node stores one multi-dimensional
range cube per slave (e.g. range [6800, 9000] x
[3400, 8900]) and routes queries to the five slave
nodes whose cubes cover them.]
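
A hypothetical sketch of the master's routing role in the
figure; the routing-table entries and slave names below are
assumptions, since the figure's range values are garbled in
the transcript.

ROUTING_TABLE = [
    # ((x_low, x_high), (y_low, y_high), slave) -- illustrative values
    ((6800, 9000), (3400, 8900), "slave-1"),
    ((0, 2000),    (500, 1200),  "slave-2"),
    ((800, 3500),  (300, 1300),  "slave-3"),
]

def route_point_query(x, y):
    """Return the slaves whose range cube contains point (x, y)."""
    return [slave
            for (xlo, xhi), (ylo, yhi), slave in ROUTING_TABLE
            if xlo <= x <= xhi and ylo <= y <= yhi]

print(route_point_query(1000, 800))   # -> ['slave-2', 'slave-3']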
47
Outline
  • INTRODUCTION
  • MULTI-DIMENSIONAL INDEX WITH KD-TREE AND R-TREE
  • Extended node partition
  • Node partition
  • Cost estimation strategy
  • EVALUATION

48
Node partition for data summary
  • Random cutting: pick several random values of the
    attribute and cut at those points. This method may
    perform very well, but it can also perform poorly.
  • Equal cutting: cut the attribute into several
    equal-width intervals. This method is relatively
    stable, since no extreme case can occur.
  • Clustering-based cutting: cluster the values of
    the attribute and cut between the clusters. This
    method predictably yields better partitions, but
    its time cost is also noticeably higher: a
    clustering algorithm typically takes O(n log n)
    time or more. A sketch of all three strategies
    follows.
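
A minimal Python sketch of the three cutting strategies on a
single numeric attribute; the function names and the gap-based
stand-in for clustering are my own assumptions.

import random

def random_cutting(values, k):
    # Pick k random cut points from the observed values.
    return sorted(random.sample(list(values), k))

def equal_cutting(values, k):
    # Cut the attribute's range into k + 1 equal-width intervals.
    lo, hi = min(values), max(values)
    width = (hi - lo) / (k + 1)
    return [lo + width * (i + 1) for i in range(k)]

def clustering_based_cutting(values, k):
    # Cheap stand-in for clustering: sort, then cut at the k
    # largest gaps between consecutive values (O(n log n)).
    vs = sorted(values)
    gaps = sorted(range(len(vs) - 1), key=lambda i: vs[i + 1] - vs[i])
    return [(vs[i] + vs[i + 1]) / 2 for i in sorted(gaps[-k:])]

data = [1, 2, 2, 3, 10, 11, 12, 30, 31, 33]
print(equal_cutting(data, 2))              # equal-width cuts
print(clustering_based_cutting(data, 2))   # cuts in the two largest gaps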

49
Node partition
Random cutting
Equal cutting
Clustering-based cutting
51
Dynamic maintenance of Indexes
  • Update of the node cube
  • Why? The data distribution in the node cube has
    changed greatly, leaving the cube sparse or very
    uneven.
  • How? Reorganize the node partition.
  • When? A two-phase approach:
  • After each update, compute the minimal ΔT before
    the next update.
  • When ΔT expires, check whether an update is
    needed.

52
Dynamic maintenance of Indexes
  • Basic idea: benefit > cost
  • The volume of a node cube is defined as the number
    of record combinations that can be made out of the
    cube; it equals the product of the lengths of all
    the cube's intervals. We denote the volume of a
    cube by v.
  • For the cube {[1, 11], [2, 5]}, the volume is
    (11 - 1) x (5 - 2) = 30, as computed below.
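
A one-function sketch of the volume computation; representing
a cube as a list of (low, high) intervals is an assumption.

from functools import reduce

def cube_volume(cube):
    # Volume = product of the lengths of all intervals.
    return reduce(lambda acc, iv: acc * (iv[1] - iv[0]), cube, 1)

print(cube_volume([(1, 11), (2, 5)]))   # (11 - 1) * (5 - 2) = 30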

53
Dynamic maintenance of Indexes
  • Assumption
  • The number of queries forwarded to each slave
    node is proportional to the total volume of all
    the node cubes on that slave node.

54
Dynamic maintenance of Indexes
  • benefit = (Δv / v) x nq x ΔT
  • Δv: decrement of volume after the update
  • nq: number of queries this node must process per
    unit time before the update
  • cost = mt / qt
  • mt: time cost of the last update
  • qt: time needed to process one query
  • benefit > cost  =>  ΔT > (mt x v) / (qt x Δv x nq),
    as sketched below
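
A sketch of the resulting bound on ΔT; the variable names
follow the slide's definitions, and treating nq as a
per-unit-time query rate is an assumption made so the
inequality solves for ΔT as shown above.

def min_update_interval(mt, qt, v, dv, nq):
    """Smallest ΔT for which benefit = (Δv/v) * nq * ΔT exceeds
    cost = mt/qt.  mt: time cost of the last update; qt: time per
    query; v: current cube volume; dv: volume decrement (Δv) the
    update would bring; nq: queries per unit time."""
    return (mt * v) / (qt * dv * nq)

# Illustrative numbers (not from the paper): a 60 s reorganization,
# 5 ms queries, a 20% volume reduction, 50 queries/s.
print(min_update_interval(mt=60.0, qt=0.005, v=1.0, dv=0.2, nq=50))  # 1200 s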

55
Dynamic maintenance of Indexes
  • After ΔT expires, check whether an update is
    needed. This check involves the following:
  • Record update frequency
  • Expected benefit ratio
  • Performance requirements
  • We leave this as future work.

56
Experimental Setup
  • 6 machines
  • 1 master
  • 5 slaves, 100-1000 nodes
  • Each machine had a 2.33 GHz Intel Core 2 Quad CPU,
    4 GB of main memory, and a 320 GB disk.
  • Machines ran the Ubuntu 9.04 Server OS.

57
Point Query Experiment Results
58
Range Query Experiment Results
Result cover rate: one ten-thousandth
59
Conclusions
  • In this paper we presented a series of approaches
    to building an efficient multi-dimensional index
    on a cloud platform.
  • We used a combination of R-tree and KD-tree as
    the index structure.
  • We developed a node partition technique to
    reduce query processing cost on the cloud
    platform.
  • To maintain the efficiency of the index, we
    proposed a cost-estimation-based approach to
    index updates.

60
Future work
  • Better node partition algorithms
  • Improve the estimation-based approach
  • Consider multiple replicas of data

61
Thank you!
62
Backup (1)
Result cover rate: one thousandth (1, 2)
63
Backup (2)
Result cover rate: one thousandth (4, 5)