1. A Scalable Information Management Middleware for Large Distributed Systems
- Praveen Yalagandula, HP Labs, Palo Alto
- Mike Dahlin, The University of Texas at Austin
2. Trends
- Large wide-area networked systems
- Enterprise networks
  - IBM: 170 countries, > 330,000 employees
- Computational Grids
  - NCSA TeraGrid: 10 partners and growing, 100-1000 nodes per site
- Sensor networks
  - Navy Automated Maintenance Environment
  - About 300 ships in the US Navy; 200,000 sensors in a destroyer (3eti.com)
9. Research Vision
Wide-area Distributed Operating System
- Goals
  - Ease building applications
  - Utilize resources efficiently
- Components: data management, security, monitoring, scheduling, ...
10. Information Management
- Most large-scale distributed applications
  - Monitor, query, and react to changes in the system
  - Examples: job scheduling, system administration and management, service location, sensor monitoring and control, file location service, multicast service, naming and request routing
- A general information management middleware
  - Eases design and development
  - Avoids repetition of the same task by different applications
  - Provides a framework to explore tradeoffs
  - Optimizes system performance
11. Contributions: SDIMS
Scalable Distributed Information Management System
- Meets key requirements
  - Scalability: scale with both nodes and the information to be managed
  - Flexibility: enable applications to control the aggregation
  - Autonomy: enable administrators to control the flow of information
  - Robustness: handle failures gracefully
12. SDIMS in Brief
- Scalability
  - Hierarchical aggregation
  - Multiple aggregation trees
- Flexibility
  - Separate mechanism from policy
  - API for applications to choose a policy
  - A self-tuning aggregation mechanism
- Autonomy
  - Preserve organizational structure in all aggregation trees
- Robustness
  - Default lazy re-aggregation upon failures
  - On-demand fast re-aggregation
13. Outline
- SDIMS: a general information management middleware
- Aggregation abstraction
- SDIMS design
  - Scalability with machines and attributes
  - Flexibility to accommodate various applications
  - Autonomy to respect administrative structure
  - Robustness to failures
- Experimental results
- SDIMS in other projects
- Conclusions and future research directions
15. Attributes
- Information at machines
  - Machine status information
  - File information
  - Multicast subscription information
16. Aggregation Function
- Defined for an attribute
  - Given values for a set of nodes, computes an aggregate value
- Example: total number of users logged into the system
  - Attribute: numUsers
  - Aggregation function: summation
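As a minimal sketch (the function name is hypothetical; the deck defines no code), the numUsers aggregation function is just summation over the child values, ignoring nodes that report no value:

```python
# Hypothetical sketch: an aggregation function maps the set of child
# values for one attribute to a single aggregate value.
def sum_users(child_values):
    """Aggregation function for the numUsers attribute: summation."""
    return sum(v for v in child_values if v is not None)

# Values reported by four machines, one of which has no value.
print(sum_users([3, 5, None, 2]))  # 10
```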
17. Aggregation Trees
- Aggregation tree
  - Physical machines are leaves
  - Each virtual node represents a logical group of machines
    - Administrative domains
    - Groups within domains
- Aggregation function f for attribute A
  - Computes the aggregated value A_i for a level-i subtree
  - A_0 = locally stored value at the physical node, or NULL
  - A_i = f(A_{i-1}^0, A_{i-1}^1, ..., A_{i-1}^k) for a virtual node with children 0 through k
  - Each virtual node is simulated by some machine
[Figure: a two-level tree over machines a, b, c, d; the level-1 virtual nodes hold A_1 = f(a,b) and f(c,d), and the root holds A_2 = f(f(a,b), f(c,d))]
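The bottom-up rule above can be sketched as a small recursion (the dict-based tree layout is an assumption for illustration, not the SDIMS data structure):

```python
# Sketch: compute aggregates bottom-up over an aggregation tree.
# Leaves carry the locally stored value A_0; each virtual node
# applies f to its children's aggregates.
def aggregate(node, f):
    if "value" in node:             # physical machine: A_0
        return node["value"]
    # virtual node: A_i = f(A_{i-1}^0, ..., A_{i-1}^k)
    return f([aggregate(c, f) for c in node["children"]])

# The slide's example tree over machines a, b, c, d.
tree = {"children": [
    {"children": [{"value": "a"}, {"value": "b"}]},
    {"children": [{"value": "c"}, {"value": "d"}]},
]}
print(aggregate(tree, lambda xs: "f(%s)" % ",".join(xs)))
# f(f(a,b),f(c,d))
```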
18. Example Queries
- Job scheduling system
  - Find the least loaded machine
  - Find a (nearby) machine with load < 0.5
- File location system
  - Locate a (nearby) machine with file Foo
19. Example: Machine Loads
- Attribute: minLoad
- Value at a machine M with load L is the tuple (M, L)
- Aggregation function: MIN_LOAD(set of tuples)
- Queries
  - Tell me the least loaded machine.
  - Tell me a (nearby) machine with load < 0.5.
[Figure: aggregation tree for minLoad; leaves report (A, 0.3), (B, 0.6), (C, 0.1), (D, 0.7); each internal node keeps the minimum of its children; the root holds (C, 0.1)]
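A sketch of the MIN_LOAD function from this slide: given the (machine, load) tuples from the children, keep the tuple with the smallest load. (The Python form is illustrative; the deck does not give an implementation.)

```python
# Sketch of MIN_LOAD: each node keeps the (machine, load) tuple
# with the smallest load among its children's values.
def min_load(tuples):
    return min(tuples, key=lambda t: t[1])

# Leaf values from the slide's example tree.
print(min_load([("A", 0.3), ("C", 0.1), ("B", 0.6), ("D", 0.7)]))
# ('C', 0.1)
```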
22. Example: File Location
- Attribute: fileFoo
- Value at a machine with id machineId
  - machineId if file Foo exists on the machine
  - null otherwise
- Aggregation function: SELECT_ONE(set of machine ids)
- Query
  - Tell me a (nearby) machine with file Foo.
[Figure: aggregation tree for fileFoo; leaves report B, C, or null; each internal node selects one non-null id; the root holds B]
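A sketch of SELECT_ONE: return any non-null machine id, or None when no child stores the file. Taking the first non-null id is an arbitrary choice made here for illustration; the deck does not specify how the id is picked.

```python
# Sketch of SELECT_ONE: pick some non-null machine id from the
# children's values, or None if the file is nowhere below this node.
def select_one(ids):
    for machine_id in ids:
        if machine_id is not None:
            return machine_id
    return None

print(select_one([None, "C", "B"]))  # C
print(select_one([None, None]))      # None
```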
25. Scalability
- To be a basic building block, SDIMS should support
  - A large number of machines (> 10^4)
    - Enterprise and global-scale services
  - Applications with a large number of attributes (> 10^6)
    - File location system: each file is an attribute ⇒ a large number of attributes
26. Scalability Challenge
- Single tree for aggregation
  - Astrolabe, SOMO, Ganglia, etc.
  - Limited scalability with attributes
  - Example: file location
- Our approach: automatically build multiple trees for aggregation
  - Aggregate different attributes along different trees
28. Building Aggregation Trees
- Leverage Distributed Hash Tables (DHTs)
  - A DHT can be viewed as multiple aggregation trees
- Distributed Hash Tables
  - Support hash-table interfaces
    - put(key, value): inserts value for key
    - get(key): returns the values associated with key
  - Buckets for keys are distributed among machines
  - Several algorithms with different properties
    - PRR, Pastry, Tapestry, CAN, Chord, SkipNet, etc.
    - Load balancing, robustness, etc.
29. DHT Overview
- Machine IDs and keys: long bit vectors
- Owner of a key: the machine with the ID closest to the key
- Bit correction for routing
- Each machine keeps O(log n) neighbors
[Figure: a get(11111) request is routed across nodes 00110, 10010, 10111, 11000, 11101, correcting leading bits at each hop toward the key's owner]
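Bit-correction routing can be sketched as follows. This is a generic illustration, not any specific DHT: at each hop the request moves to a node that matches the key in at least one more leading bit, until no node is closer.

```python
# Sketch of bit-correction routing over 5-bit IDs (illustrative,
# not a real DHT implementation).
def shared_prefix(a, b):
    """Number of leading bits a and b have in common."""
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route(key, start, nodes):
    """Return the hop sequence from start toward the key's owner."""
    path, current = [start], start
    while True:
        p = shared_prefix(current, key)
        # nodes that correct at least one more leading bit
        better = [m for m in nodes if shared_prefix(m, key) > p]
        if not better:
            return path          # no closer node: current owns the key
        # fix the next bit first: hop with the fewest corrected bits
        current = min(better, key=lambda m: shared_prefix(m, key))
        path.append(current)

nodes = ["00110", "01100", "10111", "11000", "11101", "10010"]
print(route("11111", "00110", nodes))
# ['00110', '10111', '11000', '11101']
```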
30. DHT Trees as Aggregation Trees
- The union of routing paths toward a key forms an aggregation tree
- Mapping from virtual nodes to real machines
[Figure: for key 11111, machines 000-111 route through virtual nodes 1xx and 11x up to the root 111; each virtual node is simulated by a real machine]
32. DHT Trees as Aggregation Trees
- Different keys yield different trees (e.g., keys 11111 and 00010)
- Aggregate different attributes along different trees
  - hash(minLoad) = 00010 ⇒ aggregate minLoad along the tree for key 00010
34. Scalability
- Challenge
  - Scale with both machines and attributes
- Our approach
  - Build multiple aggregation trees
    - Leverage well-studied DHT algorithms: load balancing, self-organization, locality
  - Aggregate different attributes along different trees
    - Aggregate attribute A along the tree for key hash(A)
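The attribute-to-tree mapping can be sketched in a few lines. The hash function and key width here are assumptions for illustration (the deck does not name them); the point is only that the tree key is a deterministic hash of the attribute, so different attributes spread over different trees and different root machines.

```python
# Sketch: derive the aggregation-tree key for an attribute by
# hashing its name (hash choice and key width are illustrative).
import hashlib

def tree_key(attribute, bits=8):
    digest = hashlib.sha1(attribute.encode()).hexdigest()
    # SHA-1 gives 160 bits; keep the leading `bits` bits as the key.
    return bin(int(digest, 16))[2:].zfill(160)[:bits]

# Each attribute is aggregated along the tree rooted at the owner
# of its key, so load spreads across machines.
print(tree_key("minLoad"), tree_key("fileFoo"))
```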
36. Flexibility Challenge
- When to aggregate: on reads, or on writes?
- Attributes have different read-write ratios
  - reads >> writes (e.g., Total Mem): best policy is to aggregate on writes (Astrolabe, Ganglia)
  - writes >> reads (e.g., CPU Load): best policy is to aggregate on reads (Sophia, MDS-2)
  - in between (e.g., File Location): partial aggregation on writes (DHT-based systems)
- Single framework: separate mechanism from policy
  - Allow applications to choose any policy
  - Provide a self-tuning mechanism
38. API Exposed to Applications
- Install an aggregation function for an attribute
  - The function is propagated to all nodes
  - Arguments up and down specify an aggregation policy
- Update the value of a particular attribute
  - Aggregation is performed according to the chosen policy
- Probe for an aggregated value at some level
  - If required, aggregation is done on demand
  - Two modes: one-shot and continuous
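Hypothetical usage of the three-call API, as a single-node in-memory stand-in (the real SDIMS interface is not Python; only the call names and arguments follow the slides, everything else here is assumed for illustration):

```python
# Minimal in-memory stand-in for one SDIMS node, illustrating the
# Install / Update / Probe call pattern from the slides.
class SDIMSStub:
    def __init__(self):
        self.funcs, self.values = {}, {}

    def install(self, attr_type, function, up, down):
        # a real node would propagate the function to all nodes
        self.funcs[attr_type] = (function, up, down)

    def update(self, attr_type, attr_name, value):
        # a real node would aggregate per the (up, down) policy
        self.values[(attr_type, attr_name)] = value

    def probe(self, attr_type, attr_name, level=0, mode="one-shot"):
        # here there is only one node, so aggregate over one value
        function, _up, _down = self.funcs[attr_type]
        return function([self.values.get((attr_type, attr_name))])

node = SDIMSStub()
node.install("fileLocation",
             lambda ids: next((i for i in ids if i is not None), None),
             up="all", down=0)
node.update("fileLocation", "fileFoo", "B")
print(node.probe("fileLocation", "fileFoo"))  # B
```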
39. Flexibility: Policy Settings
- Update-Local: up = 0, down = 0
- Update-Up: up = all, down = 0
- Update-All: up = all, down = all
43. Self-Tuning Aggregation
- Some applications can forecast their read-write rates
- What about the others?
  - Cannot, or do not want to, specify a policy
  - Spatial heterogeneity
  - Temporal heterogeneity
- Shruti: dynamically tunes aggregation
  - Keeps track of read and write patterns
44. Shruti: Dynamic Adaptation
- Example: with policy Update-Up (up = all, down = 0), node R repeatedly probes for a value aggregated at node A
- After enough probes, A grants R a lease; any updates are then forwarded to R until the lease is relinquished
46. Shruti in Brief
- On each node
  - Tracks updates and probes, both local and from neighbors
  - Sets and removes leases
- Grants a lease to a neighbor A
  - When it gets k probes from A while no updates happen
- Relinquishes the lease from a neighbor A
  - When it gets m updates from A while no probes happen
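The per-neighbor lease rule above can be sketched as a small counter machine. The bookkeeping is a hypothetical simplification (k and m are the tuning knobs from the slide; resetting the opposite counter models "while no updates/probes happen"):

```python
# Sketch of Shruti's per-neighbor lease rule: grant a lease after
# k consecutive probes with no intervening update; relinquish after
# m consecutive updates with no intervening probe.
class LeaseTracker:
    def __init__(self, k=3, m=2):
        self.k, self.m = k, m
        self.probes = self.updates = 0
        self.leased = False

    def on_probe(self):
        self.updates = 0          # an intervening probe resets updates
        self.probes += 1
        if self.probes >= self.k:
            self.leased = True    # grant: forward future updates

    def on_update(self):
        self.probes = 0           # an intervening update resets probes
        self.updates += 1
        if self.updates >= self.m:
            self.leased = False   # relinquish the lease

t = LeaseTracker(k=2, m=2)
t.on_probe(); t.on_probe()
print(t.leased)  # True
t.on_update(); t.on_update()
print(t.leased)  # False
```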
47. Flexibility
- Challenge
  - Support applications with different read-write behavior
- Our approach
  - Separate mechanism from policy
  - Let applications specify an aggregation policy
    - Up and down knobs in the Install interface
  - Provide a lease-based self-tuning aggregation strategy
49. Administrative Autonomy
- Systems span multiple administrative domains
- Allow a domain administrator to control information flow
  - Prevent an external observer from observing the information
  - Prevent external failures from affecting internal operations
- Challenge: DHT trees might not conform to administrative boundaries
50. Administrative Autonomy
- Our approach: Autonomous DHTs, with two properties
  - Path locality: routes between nodes in a domain stay within the domain
  - Path convergence: routes from nodes in a domain toward an external key converge at a single node before leaving the domain
51. Autonomy Example
- Path locality
- Path convergence
[Figure: machines in cs.utexas.edu, ece.utexas.edu, and phy.utexas.edu aggregate through levels L0-L3; intra-domain routes stay inside each domain, and each domain's paths converge at one node before leaving it]
52. Autonomy Challenge
- DHT trees might not conform to administrative boundaries
  - Example: in the DHT tree for key 111, paths from machines in the same domain can leave the domain before converging
- Autonomous DHT enforces two properties
  - Path locality
  - Path convergence
[Figure: DHT tree for key 111 over nodes 000-111 with levels L0-L3; three nodes belong to domain1, but their paths meet only outside the domain]
53. Robustness
- Large-scale system ⇒ failures are common
- Handle failures gracefully
- Enable applications to trade off
  - Cost of adaptation
  - Response latency
  - Consistency
- Techniques
  - Tree repair: leverage DHT self-organizing properties
  - Aggregated-information repair
    - Default lazy re-aggregation on failures
    - On-demand fast re-aggregation
55. Evaluation
- SDIMS prototype
  - Built using the FreePastry DHT framework (Rice Univ.)
  - Three layers: Aggregation Management, Tree Topology Management, Autonomous DHT
- Methodology
  - Simulation: scalability and flexibility
  - Micro-benchmarks on real networks: PlanetLab and the CS department network
56. Simulation Results: Scalability
- Small multicast sessions of size 8
- Node stress: amount of incoming and outgoing information at a machine
[Figure: maximum node stress vs. number of sessions for Astrolabe (AS) and SDIMS at 256, 4096, and 65536 machines]
- Orders of magnitude lower maximum node stress ⇒ better load balance
60. Simulation Results: Flexibility
- Simulation with 4096 nodes
- Attributes with different up and down strategies: Update-Local, Update-Up, Update-All, (up = 5, down = 0), (up = all, down = 5)
[Figure: message cost vs. read-to-write ratio for each strategy]
- When reads dominate writes, Update-All is best; when writes dominate reads, Update-Local is best
63. Dynamic Adaptation
- Simulation with 512 nodes
[Figure: average message count vs. read-to-write ratio for Update-None, Update-Up, Update-All, (up = all, down = 3), (up = 3, down = 0), and Shruti]
64. Prototype Results
- CS department network: 180 machines
- PlanetLab: 70 machines
[Figure: latency (ms) for Update-All, Update-Up, and Update-Local on the department network (0-800 ms scale) and on PlanetLab (0-3500 ms scale)]
66. SDIMS in Other Projects
- PRACTI, a replication toolkit (Dahlin et al.)
- Grid services (TACC)
  - Resource scheduling
  - Data management
- INSIGHT network monitoring (Jain and Zhang)
- File location service (IBM)
- Scalable Sensing Service (HP Labs)
67. PRACTI: A Replication Toolkit
69. PRACTI Design
[Figure: applications call read()/write()/delete() on the Core; invals and updates flow from/to other nodes; the Controller informs and manages the Core]
- Core: mechanism
- Controller: policy
  - Notified of key events: read miss, update arrival, invalidation arrival, ...
  - Directs communication across cores
70. SDIMS Controller in PRACTI
- On a read miss: locate a replica
  - Similar to the file location system example
  - But handles flash crowds: a dissemination tree among requesting clients
- For writes: spanning trees among replicas
  - Multicast tree for spreading invalidations
  - Different trees for different objects
71. PRACTI Grid Benchmark
- Three phases
  - Read input and programs from the home server
  - Compute (some pairwise reads)
  - Send results back to the server
- Performance improvement: 21% reduction in total time
[Figure: home server and a grid at a school]
72. PRACTI Experience
- Aggregation abstraction and API generality
  - Construct multicast trees for pushing invalidations
  - Locate a replica on a local read miss
  - Construct a tree in the case of flash crowds
- Performance benefits
  - Grid micro-benchmark: 21% improvement over manual tree construction
- Ease of implementation: less than two weeks
73. Conclusions
- Research vision
  - Ease the design and development of distributed services
- SDIMS: an information management middleware
  - Scalability with both machines and attributes
    - An order of magnitude lower maximum node stress
  - Flexibility in aggregation strategies
    - Support for a wide range of applications
  - Autonomy
  - Robustness to failures
74. Future Directions
- Core SDIMS research
  - Composite queries
  - Resilience to temporary reconfigurations
  - Probe functions
- Other components of a wide-area distributed OS
  - Scheduling
  - Data management
  - Monitoring
75. For More Information
- http://www.cs.utexas.edu/users/ypraveen/sdims
76. SkipNet and Autonomy
- Constrained load balancing in SkipNet
  - Supports only single-level administrative domains
- One solution: maintain separate rings in different domains
  - Does not form trees, because of revisits
[Figure: rings for phy.utexas.edu, ece.utexas.edu, and cs.utexas.edu]
77. Load Balance
- Let
  - f = fraction of attributes a node is interested in
  - N = number of nodes in the system
- In a DHT, a node will have O(log N) indegree w.h.p.
78. Related Work
- Other aggregation systems
  - Astrolabe, SOMO, DASIS, IrisNet: single tree
  - Cone: the aggregation tree changes with new updates
  - Ganglia, TAG, Sophia, and IBM Tivoli Monitoring System
- Database abstraction on DHTs
  - PIER and Gribble et al. (2001)
  - Support the join operation, which can be leveraged for answering composite queries
79. Load Balance
- How many attributes does a node handle?
  - O(log N) levels, few children at each level, and each node is interested in few attributes
  - Level 0: d attributes
  - Level 1: roughly 2 × d/2 = d
  - Level 2: roughly c² · d/4
  - ...
  - Total: d(1 + c/2 + c²/4 + ...) = O(d log N)
80. PRACTI Approach
- Bayou-style log exchange, but allowing partial replication
- Two key ideas
  - Separate invalidations from updates ⇒ partial replication of data
  - Imprecise invalidations (a summary of a set of invals) ⇒ partial replication of metadata
81. PRACTI
- For reads: locate a replica on a read miss
- For writes: construct a spanning tree among replicas
  - To propagate invalidations
  - To propagate updates
82. SDIMS: Not Yet Another DHT System
- Typical DHT applications use the put and get hash-table interfaces
- SDIMS exposes aggregation as a general abstraction
83. Autonomy
- Metrics: increase in path length; path-convergence violations
  - No violations in the Autonomous DHT
[Figure: Pastry vs. ADHT at branching factors (nodes per domain) bf = 4, 16, 64]
- As bf increases, tree height decreases and Pastry's violations decrease
85. Robustness
- PlanetLab with 67 nodes
- Aggregation function: summation; strategy: Update-Up
- Each node updates the attribute with value 10
86. Sparse Attributes
- Attributes of interest to only a few nodes
  - Example: a file Foo in the file location application
- Key for scalability
- Challenge: the aggregation abstraction defines one function per attribute
- Dilemma
  - A separate aggregation function for each attribute: unnecessary storage and communication overheads
  - A vector of values with one aggregation function: defeats the DHT advantage
89. Novel Aggregation Abstraction
- Separate attribute type from attribute name
  - Attribute = (attribute type, attribute name)
  - Example: type = fileLocation, name = fileFoo
- Define the aggregation function once per type
[Figure: example machine with Name = macA, IP addr = 1.1.1.1]
90. Example: File Location
- Query: Tell me two machines with file Foo.
- Attribute: fileFoo
- Value at a machine with id machineId
  - machineId if file Foo exists on the machine
  - null otherwise
- Aggregation function: SELECT_TWO(set of machine ids)
[Figure: aggregation tree for fileFoo; leaves report B, C, or null; the root holds {B, C}]
91. A Key Component
- Most large-scale distributed applications monitor, query, and react to changes in the system
  - Examples: system administration and management, service placement and location, sensor monitoring and control, distributed denial-of-service attack detection, file location service, multicast tree construction, naming and request routing
- A fundamental building block: information collection and management
92. CS Department Micro-Benchmark Experiment
93. API Exposed to Applications
- Install(attrType, function, up, down)
- Update(attrType, attrName, value)
- Probe(attrType, attrName, level, mode)
[Figure: applications invoke this API at an SDIMS leaf node (level 0)]