1
Cloud-based Data Management: Challenges and
Opportunities
Jiaheng Lu, Renmin University of China
2009-08-25
2
Research experience and interests
  • PhD, National University of Singapore
  • XML query processing and XML keyword search
  • Postdoc, University of California, Irvine
  • Approximate string processing
  • Data integration and data cleaning
  • Renmin University of China
  • Cloud data management
  • XML data management

3
Outline
  • Motivation: cloud data management
  • Database future and challenges
  • Large-scale data management and transaction
    processing
  • Cloud-based data indexing and query optimization
  • Recent research work
  • An efficient multi-dimensional index for
    cloud data management
  • CIKM Workshop CloudDB 2009

4
Motivation: Internet Chatter
5
BLOG Wisdom
  • If you want vast, on-demand scalability, you
    need a non-relational database, since
    scalability requirements
  • can change very quickly and
  • can grow very rapidly,
  • which is difficult to manage with a single
    in-house RDBMS server.
  • RDBMSs scale well
  • when limited to a single node, but
  • scaling across multiple server nodes brings
    overwhelming complexity.

6
Current State
  • Most enterprise solutions are based on RDBMS
    technology.
  • Significant operational challenges:
  • Provisioning for peak demand
  • Resource under-utilization
  • Capacity planning: too many variables
  • Storage management: a massive challenge
  • System upgrades: extremely time-consuming

7
Internet Search Data Analytics: A Case Study
  • Data analytics
  • Parsed Web logs ingested into an RDBMS store.
  • Hourly and daily summarization for custom
    reporting.
  • Operational nightmare
  • Keeping the live reporting system on at all costs
    and at all times.
  • Timely completion of hourly summarization.
  • Constant tension between ad-hoc and reporting
    workloads.
  • Data-driven feedback to live products.
  • Temporal depth of detailed data.

8
Internet Search Data Analytics: A Case Study
  • Various solutions explored:
  • Data warehousing appliance for fast
    summarization.
  • Parallel RDBMS technology for fast ad-hoc
    queries.
  • Business intelligence products (data cubes) for
    fast and intuitive reporting and analysis.
  • None of the solutions was completely satisfactory.
  • Plans to migrate low-level data to a file-based
    system to overcome database scalability
    bottlenecks.

9
Paradigm Shift in Computing
10
The Web is replacing the desktop
11
What is Cloud Computing?
  • Old idea: Software as a Service (SaaS)
  • Definition: delivering applications over the
    Internet
  • Recently: Hardware, Infrastructure, and Platform
    as a Service
  • Poorly defined, so we avoid all "X as a Service"
    terms
  • Utility computing: pay-as-you-go computing
  • Illusion of infinite resources
  • No up-front cost
  • Fine-grained billing (e.g., hourly)

12
Why Now?
  • Experience with very large datacenters
  • Unprecedented economies of scale
  • Other factors
  • Pervasive broadband internet
  • Pay-as-you-go billing model

13
Cloud Computing Spectrum
  • Instruction Set VM (Amazon EC2, 3Tera)
  • Framework VM
  • Google AppEngine, Force.com

14
Cloud Killer Apps
  • Mobile and web applications
  • Extensions of desktop software
  • Matlab, Mathematica
  • Batch processing/MapReduce

15
Economics of Cloud Users
  • Pay by use instead of provisioning for peak

16
Economics of Cloud Users
  • Risk of over-provisioning: underutilization

17
Economics of Cloud Users
  • Heavy penalty for under-provisioning

18
Economics of Cloud Providers
  • 5-7x economies of scale [Hamilton 2008]
  • Extra benefits:
  • Amazon: utilize off-peak capacity
  • Microsoft: sell .NET tools
  • Google: reuse existing infrastructure

19
Engineering Definition
  • Providing services on virtual machines allocated
    on top of a large physical machine pool.

20
Business Definition
  • A method to address scalability and availability
    concerns for large scale applications.

21
Data Management in the Cloud?
22
Cloud Computing: Implications for DBMSs
  • Where do databases fit in this paradigm?
  • Generational reality:
  • Animoto.com
  • Started with 50 servers on Amazon EC2
  • Growth of 25,000 users/hour
  • Needed to scale to 3,500 servers in 2 days
  • Many similar stories
  • RightScale
  • Joyent

23
Clouded Data?
  • Reality #1
  • Unlimited-processing assumption
  • Interactive page views
  • by targeting a large number of SQL queries against
    MySQL
  • while still expecting sub-millisecond object
    retrieval
  • Reality #2
  • Why can't the database tier be replicated in the
    same way as the Web server and app server tiers?
  • These are the major challenges for data
    management in the cloud.

24
The Vision
  • R&D challenges at the macro level:
  • Where and how does the DBMS fit into this model?
  • R&D challenges at the micro level:
  • Specific technology components that must be
    developed to enable the migration of enterprise
    data into the cloud.

25
Data and Networks: Attempt #1
  • Distributed databases (1980s)
  • Idealized view: unified access to distributed
    data
  • Prohibitively expensive global synchronization
  • Remained a laboratory prototype
  • Associated technology widely in use: 2PC

26
Data and Networks: Attempt #2
27
Data and Networks: Pragmatics
28
Database on S3 [SIGMOD'08]
  • Amazon's Simple Storage Service (S3):
  • Updates may not preserve initiation order
  • No forced writes
  • Eventual-consistency guarantee
  • Proposed solution (see the sketch below):
  • Pending update queue
  • Checkpoint protocol to ensure consistent ordering
  • ACID: only atomicity and durability
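
A minimal Python sketch of the pending-update-queue idea; this
is an illustration of the concept, not the paper's actual
protocol, and every name below is a hypothetical assumption.

class PendingUpdateQueue:
    """Buffers updates so a checkpoint can apply them to the
    store in one consistent order."""

    def __init__(self):
        self._pending = []      # updates held back from the store
        self._next_seq = 0      # monotonic sequence number

    def add(self, key, value):
        # Stamping a sequence number at enqueue time fixes a total
        # order even if the underlying store would reorder writes.
        self._pending.append((self._next_seq, key, value))
        self._next_seq += 1

    def checkpoint(self, store):
        # Apply buffered updates in sequence order, then clear.
        # This gives atomicity and durability per checkpoint, but
        # no isolation -- matching "ACID: only atomicity and
        # durability" above.
        for _, key, value in sorted(self._pending):
            store[key] = value
        self._pending.clear()

q, store = PendingUpdateQueue(), {}
q.add("user:1", "alice")
q.add("user:2", "bob")
q.checkpoint(store)     # both updates become visible together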

29
Unbundling Transactions in the Cloud
  • Research results:
  • CIDR'09 proposal to unbundle transaction
    management for cloud infrastructures
  • Attempts to refit the DBMS engine to cloud
    storage and computing

30
Analytical Processing
31
Architectural and System Impacts
  • Current state:
  • MapReduce paradigm for data analysis
  • What is missing:
  • Auxiliary structures and indexes for associative
    access to data (i.e., attribute-based access)
  • Caveat: inherent inconsistency and approximation
  • Future projection:
  • Eventual merger of databases (ODSs) and data
    warehouses as we have learned to use and
    implement them.

32
Underlying Principles [CIDR 2009]
  • Business data may not always reflect the state of
    the world or the business
  • Inherent lack of perfect information
  • Secondary data need not be updated with primary
    data
  • Inherent latency
  • Transactions/Events may temporarily violate
    integrity constraints
  • Referential integrity may need to be compromised

33
Data Security and Privacy
  • Data privacy remains a show-stopper in the
    context of database outsourcing.
  • Encryption-based solutions are too expensive and
    are projected to remain so for the foreseeable
    future: Private Information Retrieval [Sion 2008]
  • Other approaches:
  • Information-theoretic approaches that use data
    partitioning for security [Emekci 2007]
  • Hardware-based solutions for information security

34
Self-Management and Self-Tuning in Cloud-Based
Data Management
  • Self-management and self-tuning
  • Query optimization across thousands of nodes

35
Remarks
  • Data Management for Cloud Computing poses a
    fundamental challenge to database researchers
  • Scalability
  • Reliability
  • Data Consistency
  • Radically different approaches and solutions are
    warranted to overcome this challenge
  • Need to understand the nature of new applications

36
References
  • Life Beyond Distributed Transactions: An
    Apostate's Opinion, P. Helland, CIDR'07
  • Building a Database on S3, M. Brantner,
    D. Florescu, D. Graf, D. Kossmann, T. Kraska,
    SIGMOD'08
  • Unbundling Transaction Services in the Cloud,
    D. Lomet, A. Fekete, G. Weikum, M. Zwilling,
    CIDR'09
  • Principles of Inconsistency, S. Finkelstein,
    R. Brendle, D. Jacobs, CIDR'09
  • VLDB Database School (China) 2009:
    http://www.sei.ecnu.edu.cn/vldbschool2009/VLDBSchool2009English.htm

37
An Efficient Multi-Dimensional Index for Cloud
Data Management
  • CIKM Workshop CloudDB'09

38
Outline
  • INTRODUCTION
  • MULTI-DIMENSIONAL INDEX WITH KD-TREE AND R-TREE
  • Extended node partition
  • Node partition
  • Cost estimation strategy
  • EVALUATION

39
Cloud Computing
Google File System
Yahoo PNUTS
40
Distributed Cloud Databases
  • BigTable
  • HBase

How can we query on attributes other than the
primary key?

41
Distributed Index Single Dimension?
  • S. Wu and K.-L. Wu, An indexing framework for
    efficient retrieval on the cloud, IEEE Data Eng.
    Bull., vol. 32, pp.7582, 2009.
  • H. chih Yang and D. S. Parker, Traverse
    Simplified indexing on large map-reduce-merge
    clusters, in Proceedings of DASFAA 2009,
    Brisbane, Australia, April 2009, pp. 308322.
  • M. K. Aguilera, W. Golab, and M. A. Shah, A
    practical scalable distributed b-tree, in
    Proceedings of VLDB08, Auckland, New Zealand,
    August 2008, pp. 598609.

42
Outline
  • INTRODUCTION
  • MULTI-DIMENSIONAL INDEX WITH KD-TREE AND R-TREE
  • Extended node partition
  • Node partition
  • Cost estimation strategy
  • EVALUATION

43
Framework of Request Processing in the Cloud
44
R-Tree
An R-tree is a tree data structure similar to a
B-tree, but used for spatial access methods.
45
KD-Tree
A kd-tree (short for k-dimensional tree) is a
space-partitioning data structure for organizing
points in a k-dimensional space.
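
For illustration, a minimal kd-tree construction sketch in
Python; this is the standard textbook median-split algorithm,
not code from the paper.

def build_kdtree(points, depth=0):
    """points: a list of equal-length tuples; returns a nested
    dict (or None for an empty subtree)."""
    if not points:
        return None
    axis = depth % len(points[0])   # cycle through the k dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2          # median split keeps the tree balanced
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])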
46
R-Tree + KD-Tree: RKD-Tree
[Figure: a master node stores one multi-dimensional
range cube per slave (e.g. range [6800, 9000] x
[3400, 8900]) and routes queries to the five slave
nodes whose cubes cover them.]
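
A hypothetical sketch of the master's routing role in the
figure; the routing-table entries and slave names below are
assumptions, since the figure's range values are garbled in
the transcript.

ROUTING_TABLE = [
    # ((x_low, x_high), (y_low, y_high), slave) -- illustrative values
    ((6800, 9000), (3400, 8900), "slave-1"),
    ((0, 2000),    (500, 1200),  "slave-2"),
    ((800, 3500),  (300, 1300),  "slave-3"),
]

def route_point_query(x, y):
    """Return the slaves whose range cube contains point (x, y)."""
    return [slave
            for (xlo, xhi), (ylo, yhi), slave in ROUTING_TABLE
            if xlo <= x <= xhi and ylo <= y <= yhi]

print(route_point_query(1000, 800))   # -> ['slave-2', 'slave-3']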
47
Outline
  • INTRODUCTION
  • MULTI-DIMENSIONAL INDEX WITH KD-TREE AND R-TREE
  • Extended node partition
  • Node partition
  • Cost estimation strategy
  • EVALUATION

48
Node partition for data summary
  • Random cutting: pick several random values of the
    attribute and cut at those points. This method may
    perform very well, but it can also perform poorly.
  • Equal cutting: cut the attribute into several
    equal-width intervals. This method is relatively
    stable, since no extreme case can occur.
  • Clustering-based cutting: cluster the values of
    the attribute and cut between the clusters. This
    method predictably yields better partitions, but
    its time cost is also noticeably higher: a
    clustering algorithm typically takes O(n log n)
    time or more. A sketch of all three strategies
    follows.
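
A minimal Python sketch of the three cutting strategies on a
single numeric attribute; the function names and the gap-based
stand-in for clustering are my own assumptions.

import random

def random_cutting(values, k):
    # Pick k random cut points from the observed values.
    return sorted(random.sample(list(values), k))

def equal_cutting(values, k):
    # Cut the attribute's range into k + 1 equal-width intervals.
    lo, hi = min(values), max(values)
    width = (hi - lo) / (k + 1)
    return [lo + width * (i + 1) for i in range(k)]

def clustering_based_cutting(values, k):
    # Cheap stand-in for clustering: sort, then cut at the k
    # largest gaps between consecutive values (O(n log n)).
    vs = sorted(values)
    gaps = sorted(range(len(vs) - 1), key=lambda i: vs[i + 1] - vs[i])
    return [(vs[i] + vs[i + 1]) / 2 for i in sorted(gaps[-k:])]

data = [1, 2, 2, 3, 10, 11, 12, 30, 31, 33]
print(equal_cutting(data, 2))              # equal-width cuts
print(clustering_based_cutting(data, 2))   # cuts in the two largest gaps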

49
Node partition
Random cutting
Equal cutting
Clustering-based cutting
51
Dynamic maintenance of Indexes
  • Update of the node cube
  • Why? The data distribution in the node cube has
    changed greatly, leaving the cube sparse or very
    uneven.
  • How? Reorganize the node partition.
  • When? A two-phase approach:
  • After each update, compute the minimal ΔT before
    the next update.
  • When ΔT expires, check whether an update is
    needed.

52
Dynamic maintenance of Indexes
  • Basic idea: benefit > cost
  • The volume of a node cube is defined as the number
    of record combinations that can be made out of the
    cube; it equals the product of the lengths of all
    the cube's intervals. We denote the volume of a
    cube by v.
  • For the cube {[1, 11], [2, 5]}, the volume is
    (11 - 1) x (5 - 2) = 30, as computed below.
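
A one-function sketch of the volume computation; representing
a cube as a list of (low, high) intervals is an assumption.

from functools import reduce

def cube_volume(cube):
    # Volume = product of the lengths of all intervals.
    return reduce(lambda acc, iv: acc * (iv[1] - iv[0]), cube, 1)

print(cube_volume([(1, 11), (2, 5)]))   # (11 - 1) * (5 - 2) = 30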

53
Dynamic maintenance of Indexes
  • Assumption
  • The number of queries forwarded to each slave
    node is proportional to the total volume of all
    the node cubes on that slave node.

54
Dynamic maintenance of Indexes
  • benefit = (Δv / v) x nq x ΔT
  • Δv: decrement of volume after the update
  • nq: number of queries this node must process per
    unit time before the update
  • cost = mt / qt
  • mt: time cost of the last update
  • qt: time needed to process one query
  • benefit > cost  =>  ΔT > (mt x v) / (qt x Δv x nq),
    as sketched below
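
A sketch of the resulting bound on ΔT; the variable names
follow the slide's definitions, and treating nq as a
per-unit-time query rate is an assumption made so the
inequality solves for ΔT as shown above.

def min_update_interval(mt, qt, v, dv, nq):
    """Smallest ΔT for which benefit = (Δv/v) * nq * ΔT exceeds
    cost = mt/qt.  mt: time cost of the last update; qt: time per
    query; v: current cube volume; dv: volume decrement (Δv) the
    update would bring; nq: queries per unit time."""
    return (mt * v) / (qt * dv * nq)

# Illustrative numbers (not from the paper): a 60 s reorganization,
# 5 ms queries, a 20% volume reduction, 50 queries/s.
print(min_update_interval(mt=60.0, qt=0.005, v=1.0, dv=0.2, nq=50))  # 1200 s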

55
Dynamic maintenance of Indexes
  • After ΔT expires, check whether an update is
    needed. This check involves the following:
  • Record update frequency
  • Expected benefit ratio
  • Performance requirements
  • We leave this as future work.

56
Experimental Setup
  • 6 machines
  • 1 master
  • 5 slaves, 100-1000 nodes
  • Each machine had a 2.33 GHz Intel Core 2 Quad CPU,
    4 GB of main memory, and a 320 GB disk.
  • Machines ran the Ubuntu 9.04 Server OS.

57
Point Query Experiment Results
58
Range Query Experiment Results
Result cover rate: one ten-thousandth
59
Conclusions
  • In this paper we presented a series of approaches
    to building an efficient multi-dimensional index
    on a cloud platform.
  • We used a combination of R-tree and KD-tree as
    the index structure.
  • We developed a node partition technique to
    reduce query processing cost on the cloud
    platform.
  • To maintain the efficiency of the index, we
    proposed a cost-estimation-based approach to
    index updates.

60
Future work
  • Better node partition algorithms
  • Improve the estimation-based approach
  • Consider multiple replicas of data

61
Thank you!
62
Backup (1)
Result cover rate: one thousandth (1, 2)
63
Backup (2)
Result cover rate: one thousandth (4, 5)