1. A Scalable Information Management Middleware for Large Distributed Systems
- Praveen Yalagandula, HP Labs, Palo Alto
- Mike Dahlin, The University of Texas at Austin
2. Trends
- Large wide-area networked systems
- Enterprise networks
  - IBM: 170 countries, > 330,000 employees
- Computational Grids
  - NCSA TeraGrid: 10 partners and growing, 100-1000 nodes per site
- Sensor networks
  - Navy Automated Maintenance Environment
  - About 300 ships in the US Navy; 200,000 sensors in a destroyer (3eti.com)
9. Research Vision
Wide-area Distributed Operating System
- Goals
  - Ease building applications
  - Utilize resources efficiently
- Components: data management, security, monitoring, scheduling, ...
10. Information Management
- Most large-scale distributed applications
  - Monitor, query, and react to changes in the system
  - Examples: job scheduling, system administration and management, service location, sensor monitoring and control, file location service, multicast service, naming and request routing
- A general information management middleware
  - Eases design and development
  - Avoids repetition of the same task by different applications
  - Provides a framework to explore tradeoffs
  - Optimizes system performance
11. Contributions: SDIMS
Scalable Distributed Information Management System
- Meets key requirements
  - Scalability: scale with both nodes and the information to be managed
  - Flexibility: enable applications to control the aggregation
  - Autonomy: enable administrators to control the flow of information
  - Robustness: handle failures gracefully
12. SDIMS in Brief
- Scalability
  - Hierarchical aggregation
  - Multiple aggregation trees
- Flexibility
  - Separate mechanism from policy
  - API for applications to choose a policy
  - A self-tuning aggregation mechanism
- Autonomy
  - Preserve organizational structure in all aggregation trees
- Robustness
  - Default lazy re-aggregation upon failures
  - On-demand fast re-aggregation
13. Outline
- SDIMS: a general information management middleware
- Aggregation abstraction
- SDIMS design
  - Scalability with machines and attributes
  - Flexibility to accommodate various applications
  - Autonomy to respect administrative structure
  - Robustness to failures
- Experimental results
- SDIMS in other projects
- Conclusions and future research directions
15. Attributes
- Information at machines
  - Machine status information
  - File information
  - Multicast subscription information
16. Aggregation Function
- Defined for an attribute
  - Given values for a set of nodes, computes an aggregate value
- Example: total number of users logged into the system
  - Attribute: numUsers
  - Aggregation function: summation
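As a minimal sketch (the function name is hypothetical; the deck defines no code), the numUsers aggregation function is just summation over the child values, ignoring nodes that report no value:

```python
# Hypothetical sketch: an aggregation function maps the set of child
# values for one attribute to a single aggregate value.
def sum_users(child_values):
    """Aggregation function for the numUsers attribute: summation."""
    return sum(v for v in child_values if v is not None)

# Values reported by four machines, one of which has no value.
print(sum_users([3, 5, None, 2]))  # 10
```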
17. Aggregation Trees
- Aggregation tree
  - Physical machines are leaves
  - Each virtual node represents a logical group of machines
    - Administrative domains
    - Groups within domains
- Aggregation function f for attribute A
  - Computes the aggregated value A_i for a level-i subtree
  - A_0 = locally stored value at the physical node, or NULL
  - A_i = f(A_{i-1}^0, A_{i-1}^1, ..., A_{i-1}^k) for a virtual node with children 0 through k
  - Each virtual node is simulated by some machine
[Figure: a two-level tree over machines a, b, c, d; the level-1 virtual nodes hold A_1 = f(a,b) and f(c,d), and the root holds A_2 = f(f(a,b), f(c,d))]
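The bottom-up rule above can be sketched as a small recursion (the dict-based tree layout is an assumption for illustration, not the SDIMS data structure):

```python
# Sketch: compute aggregates bottom-up over an aggregation tree.
# Leaves carry the locally stored value A_0; each virtual node
# applies f to its children's aggregates.
def aggregate(node, f):
    if "value" in node:             # physical machine: A_0
        return node["value"]
    # virtual node: A_i = f(A_{i-1}^0, ..., A_{i-1}^k)
    return f([aggregate(c, f) for c in node["children"]])

# The slide's example tree over machines a, b, c, d.
tree = {"children": [
    {"children": [{"value": "a"}, {"value": "b"}]},
    {"children": [{"value": "c"}, {"value": "d"}]},
]}
print(aggregate(tree, lambda xs: "f(%s)" % ",".join(xs)))
# f(f(a,b),f(c,d))
```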
18. Example Queries
- Job scheduling system
  - Find the least loaded machine
  - Find a (nearby) machine with load < 0.5
- File location system
  - Locate a (nearby) machine with file Foo
19. Example: Machine Loads
- Attribute: minLoad
- Value at a machine M with load L is the tuple (M, L)
- Aggregation function: MIN_LOAD(set of tuples)
- Queries
  - Tell me the least loaded machine.
  - Tell me a (nearby) machine with load < 0.5.
[Figure: aggregation tree for minLoad; leaves report (A, 0.3), (B, 0.6), (C, 0.1), (D, 0.7); each internal node keeps the minimum of its children; the root holds (C, 0.1)]
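A sketch of the MIN_LOAD function from this slide: given the (machine, load) tuples from the children, keep the tuple with the smallest load. (The Python form is illustrative; the deck does not give an implementation.)

```python
# Sketch of MIN_LOAD: each node keeps the (machine, load) tuple
# with the smallest load among its children's values.
def min_load(tuples):
    return min(tuples, key=lambda t: t[1])

# Leaf values from the slide's example tree.
print(min_load([("A", 0.3), ("C", 0.1), ("B", 0.6), ("D", 0.7)]))
# ('C', 0.1)
```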
22. Example: File Location
- Attribute: fileFoo
- Value at a machine with id machineId
  - machineId if file Foo exists on the machine
  - null otherwise
- Aggregation function: SELECT_ONE(set of machine ids)
- Query
  - Tell me a (nearby) machine with file Foo.
[Figure: aggregation tree for fileFoo; leaves report B, C, or null; each internal node selects one non-null id; the root holds B]
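A sketch of SELECT_ONE: return any non-null machine id, or None when no child stores the file. Taking the first non-null id is an arbitrary choice made here for illustration; the deck does not specify how the id is picked.

```python
# Sketch of SELECT_ONE: pick some non-null machine id from the
# children's values, or None if the file is nowhere below this node.
def select_one(ids):
    for machine_id in ids:
        if machine_id is not None:
            return machine_id
    return None

print(select_one([None, "C", "B"]))  # C
print(select_one([None, None]))      # None
```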
25. Scalability
- To be a basic building block, SDIMS should support
  - A large number of machines (> 10^4)
    - Enterprise and global-scale services
  - Applications with a large number of attributes (> 10^6)
    - File location system: each file is an attribute ⇒ a large number of attributes
26. Scalability Challenge
- Single tree for aggregation
  - Astrolabe, SOMO, Ganglia, etc.
  - Limited scalability with attributes
  - Example: file location
- Our approach: automatically build multiple trees for aggregation
  - Aggregate different attributes along different trees
28. Building Aggregation Trees
- Leverage Distributed Hash Tables (DHTs)
  - A DHT can be viewed as multiple aggregation trees
- Distributed Hash Tables
  - Support hash-table interfaces
    - put(key, value): inserts value for key
    - get(key): returns the values associated with key
  - Buckets for keys are distributed among machines
  - Several algorithms with different properties
    - PRR, Pastry, Tapestry, CAN, Chord, SkipNet, etc.
    - Load balancing, robustness, etc.
29. DHT Overview
- Machine IDs and keys: long bit vectors
- Owner of a key: the machine with the ID closest to the key
- Bit correction for routing
- Each machine keeps O(log n) neighbors
[Figure: a get(11111) request is routed across nodes 00110, 10010, 10111, 11000, 11101, correcting leading bits at each hop toward the key's owner]
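Bit-correction routing can be sketched as follows. This is a generic illustration, not any specific DHT: at each hop the request moves to a node that matches the key in at least one more leading bit, until no node is closer.

```python
# Sketch of bit-correction routing over 5-bit IDs (illustrative,
# not a real DHT implementation).
def shared_prefix(a, b):
    """Number of leading bits a and b have in common."""
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route(key, start, nodes):
    """Return the hop sequence from start toward the key's owner."""
    path, current = [start], start
    while True:
        p = shared_prefix(current, key)
        # nodes that correct at least one more leading bit
        better = [m for m in nodes if shared_prefix(m, key) > p]
        if not better:
            return path          # no closer node: current owns the key
        # fix the next bit first: hop with the fewest corrected bits
        current = min(better, key=lambda m: shared_prefix(m, key))
        path.append(current)

nodes = ["00110", "01100", "10111", "11000", "11101", "10010"]
print(route("11111", "00110", nodes))
# ['00110', '10111', '11000', '11101']
```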
30. DHT Trees as Aggregation Trees
- The union of routing paths toward a key forms an aggregation tree
- Mapping from virtual nodes to real machines
[Figure: for key 11111, machines 000-111 route through virtual nodes 1xx and 11x up to the root 111; each virtual node is simulated by a real machine]
32. DHT Trees as Aggregation Trees
- Different keys yield different trees (e.g., keys 11111 and 00010)
- Aggregate different attributes along different trees
  - hash(minLoad) = 00010 ⇒ aggregate minLoad along the tree for key 00010
34. Scalability
- Challenge
  - Scale with both machines and attributes
- Our approach
  - Build multiple aggregation trees
    - Leverage well-studied DHT algorithms: load balancing, self-organization, locality
  - Aggregate different attributes along different trees
    - Aggregate attribute A along the tree for key hash(A)
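The attribute-to-tree mapping can be sketched in a few lines. The hash function and key width here are assumptions for illustration (the deck does not name them); the point is only that the tree key is a deterministic hash of the attribute, so different attributes spread over different trees and different root machines.

```python
# Sketch: derive the aggregation-tree key for an attribute by
# hashing its name (hash choice and key width are illustrative).
import hashlib

def tree_key(attribute, bits=8):
    digest = hashlib.sha1(attribute.encode()).hexdigest()
    # SHA-1 gives 160 bits; keep the leading `bits` bits as the key.
    return bin(int(digest, 16))[2:].zfill(160)[:bits]

# Each attribute is aggregated along the tree rooted at the owner
# of its key, so load spreads across machines.
print(tree_key("minLoad"), tree_key("fileFoo"))
```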
36. Flexibility Challenge
- When to aggregate: on reads, or on writes?
- Attributes have different read-write ratios
  - reads >> writes (e.g., Total Mem): best policy is to aggregate on writes (Astrolabe, Ganglia)
  - writes >> reads (e.g., CPU Load): best policy is to aggregate on reads (Sophia, MDS-2)
  - in between (e.g., File Location): partial aggregation on writes (DHT-based systems)
- Single framework: separate mechanism from policy
  - Allow applications to choose any policy
  - Provide a self-tuning mechanism
38. API Exposed to Applications
- Install an aggregation function for an attribute
  - The function is propagated to all nodes
  - Arguments up and down specify an aggregation policy
- Update the value of a particular attribute
  - Aggregation is performed according to the chosen policy
- Probe for an aggregated value at some level
  - If required, aggregation is done on demand
  - Two modes: one-shot and continuous
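Hypothetical usage of the three-call API, as a single-node in-memory stand-in (the real SDIMS interface is not Python; only the call names and arguments follow the slides, everything else here is assumed for illustration):

```python
# Minimal in-memory stand-in for one SDIMS node, illustrating the
# Install / Update / Probe call pattern from the slides.
class SDIMSStub:
    def __init__(self):
        self.funcs, self.values = {}, {}

    def install(self, attr_type, function, up, down):
        # a real node would propagate the function to all nodes
        self.funcs[attr_type] = (function, up, down)

    def update(self, attr_type, attr_name, value):
        # a real node would aggregate per the (up, down) policy
        self.values[(attr_type, attr_name)] = value

    def probe(self, attr_type, attr_name, level=0, mode="one-shot"):
        # here there is only one node, so aggregate over one value
        function, _up, _down = self.funcs[attr_type]
        return function([self.values.get((attr_type, attr_name))])

node = SDIMSStub()
node.install("fileLocation",
             lambda ids: next((i for i in ids if i is not None), None),
             up="all", down=0)
node.update("fileLocation", "fileFoo", "B")
print(node.probe("fileLocation", "fileFoo"))  # B
```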
39. Flexibility: Policy Settings
- Update-Local: up = 0, down = 0
- Update-Up: up = all, down = 0
- Update-All: up = all, down = all
43. Self-Tuning Aggregation
- Some applications can forecast their read-write rates
- What about the others?
  - Cannot, or do not want to, specify a policy
  - Spatial heterogeneity
  - Temporal heterogeneity
- Shruti: dynamically tunes aggregation
  - Keeps track of read and write patterns
44. Shruti: Dynamic Adaptation
- Example: with policy Update-Up (up = all, down = 0), node R repeatedly probes for a value aggregated at node A
- After enough probes, A grants R a lease; any updates are then forwarded to R until the lease is relinquished
46. Shruti in Brief
- On each node
  - Tracks updates and probes, both local and from neighbors
  - Sets and removes leases
- Grants a lease to a neighbor A
  - When it gets k probes from A while no updates happen
- Relinquishes the lease from a neighbor A
  - When it gets m updates from A while no probes happen
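The per-neighbor lease rule above can be sketched as a small counter machine. The bookkeeping is a hypothetical simplification (k and m are the tuning knobs from the slide; resetting the opposite counter models "while no updates/probes happen"):

```python
# Sketch of Shruti's per-neighbor lease rule: grant a lease after
# k consecutive probes with no intervening update; relinquish after
# m consecutive updates with no intervening probe.
class LeaseTracker:
    def __init__(self, k=3, m=2):
        self.k, self.m = k, m
        self.probes = self.updates = 0
        self.leased = False

    def on_probe(self):
        self.updates = 0          # an intervening probe resets updates
        self.probes += 1
        if self.probes >= self.k:
            self.leased = True    # grant: forward future updates

    def on_update(self):
        self.probes = 0           # an intervening update resets probes
        self.updates += 1
        if self.updates >= self.m:
            self.leased = False   # relinquish the lease

t = LeaseTracker(k=2, m=2)
t.on_probe(); t.on_probe()
print(t.leased)  # True
t.on_update(); t.on_update()
print(t.leased)  # False
```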
47. Flexibility
- Challenge
  - Support applications with different read-write behavior
- Our approach
  - Separate mechanism from policy
  - Let applications specify an aggregation policy
    - Up and down knobs in the Install interface
  - Provide a lease-based self-tuning aggregation strategy
49. Administrative Autonomy
- Systems span multiple administrative domains
- Allow a domain administrator to control information flow
  - Prevent an external observer from observing the information
  - Prevent external failures from affecting internal operations
- Challenge: DHT trees might not conform to administrative boundaries
50. Administrative Autonomy
- Our approach: Autonomous DHTs, with two properties
  - Path locality: routes between nodes in a domain stay within the domain
  - Path convergence: routes from nodes in a domain toward an external key converge at a single node before leaving the domain
51. Autonomy Example
- Path locality
- Path convergence
[Figure: machines in cs.utexas.edu, ece.utexas.edu, and phy.utexas.edu aggregate through levels L0-L3; intra-domain routes stay inside each domain, and each domain's paths converge at one node before leaving it]
52. Autonomy Challenge
- DHT trees might not conform to administrative boundaries
  - Example: in the DHT tree for key 111, paths from machines in the same domain can leave the domain before converging
- Autonomous DHT enforces two properties
  - Path locality
  - Path convergence
[Figure: DHT tree for key 111 over nodes 000-111 with levels L0-L3; three nodes belong to domain1, but their paths meet only outside the domain]
53. Robustness
- Large-scale system ⇒ failures are common
- Handle failures gracefully
- Enable applications to trade off
  - Cost of adaptation
  - Response latency
  - Consistency
- Techniques
  - Tree repair: leverage DHT self-organizing properties
  - Aggregated-information repair
    - Default lazy re-aggregation on failures
    - On-demand fast re-aggregation
55. Evaluation
- SDIMS prototype
  - Built using the FreePastry DHT framework (Rice Univ.)
  - Three layers: Aggregation Management, Tree Topology Management, Autonomous DHT
- Methodology
  - Simulation: scalability and flexibility
  - Micro-benchmarks on real networks: PlanetLab and the CS department network
56. Simulation Results: Scalability
- Small multicast sessions of size 8
- Node stress: amount of incoming and outgoing information at a machine
[Figure: maximum node stress vs. number of sessions for Astrolabe (AS) and SDIMS at 256, 4096, and 65536 machines]
- Orders of magnitude lower maximum node stress ⇒ better load balance
60. Simulation Results: Flexibility
- Simulation with 4096 nodes
- Attributes with different up and down strategies: Update-Local, Update-Up, Update-All, (up = 5, down = 0), (up = all, down = 5)
[Figure: message cost vs. read-to-write ratio for each strategy]
- When reads dominate writes, Update-All is best; when writes dominate reads, Update-Local is best
63. Dynamic Adaptation
- Simulation with 512 nodes
[Figure: average message count vs. read-to-write ratio for Update-None, Update-Up, Update-All, (up = all, down = 3), (up = 3, down = 0), and Shruti]
64. Prototype Results
- CS department network: 180 machines
- PlanetLab: 70 machines
[Figure: latency (ms) for Update-All, Update-Up, and Update-Local on the department network (0-800 ms scale) and on PlanetLab (0-3500 ms scale)]
66. SDIMS in Other Projects
- PRACTI, a replication toolkit (Dahlin et al.)
- Grid services (TACC)
  - Resource scheduling
  - Data management
- INSIGHT network monitoring (Jain and Zhang)
- File location service (IBM)
- Scalable Sensing Service (HP Labs)
67. PRACTI: A Replication Toolkit
69. PRACTI Design
[Figure: applications call read()/write()/delete() on the Core; invals and updates flow from/to other nodes; the Controller informs and manages the Core]
- Core: mechanism
- Controller: policy
  - Notified of key events: read miss, update arrival, invalidation arrival, ...
  - Directs communication across cores
70. SDIMS Controller in PRACTI
- On a read miss: locate a replica
  - Similar to the file location system example
  - But handles flash crowds: a dissemination tree among requesting clients
- For writes: spanning trees among replicas
  - Multicast tree for spreading invalidations
  - Different trees for different objects
71. PRACTI Grid Benchmark
- Three phases
  - Read input and programs from the home server
  - Compute (some pairwise reads)
  - Send results back to the server
- Performance improvement: 21% reduction in total time
[Figure: home server and a grid at a school]
72. PRACTI Experience
- Aggregation abstraction and API generality
  - Construct multicast trees for pushing invalidations
  - Locate a replica on a local read miss
  - Construct a tree in the case of flash crowds
- Performance benefits
  - Grid micro-benchmark: 21% improvement over manual tree construction
- Ease of implementation: less than two weeks
73. Conclusions
- Research vision
  - Ease the design and development of distributed services
- SDIMS: an information management middleware
  - Scalability with both machines and attributes
    - An order of magnitude lower maximum node stress
  - Flexibility in aggregation strategies
    - Support for a wide range of applications
  - Autonomy
  - Robustness to failures
74. Future Directions
- Core SDIMS research
  - Composite queries
  - Resilience to temporary reconfigurations
  - Probe functions
- Other components of a wide-area distributed OS
  - Scheduling
  - Data management
  - Monitoring
75. For More Information
- http://www.cs.utexas.edu/users/ypraveen/sdims
76. SkipNet and Autonomy
- Constrained load balancing in SkipNet
  - Supports only single-level administrative domains
- One solution: maintain separate rings in different domains
  - Does not form trees, because of revisits
[Figure: rings for phy.utexas.edu, ece.utexas.edu, and cs.utexas.edu]
77. Load Balance
- Let
  - f = fraction of attributes a node is interested in
  - N = number of nodes in the system
- In a DHT, a node will have O(log N) indegree w.h.p.
78. Related Work
- Other aggregation systems
  - Astrolabe, SOMO, DASIS, IrisNet: single tree
  - Cone: the aggregation tree changes with new updates
  - Ganglia, TAG, Sophia, and IBM Tivoli Monitoring System
- Database abstraction on DHTs
  - PIER and Gribble et al. (2001)
  - Support the join operation, which can be leveraged for answering composite queries
79. Load Balance
- How many attributes does a node handle?
  - O(log N) levels, few children at each level, and each node is interested in few attributes
  - Level 0: d attributes
  - Level 1: roughly 2 × d/2 = d
  - Level 2: roughly c² · d/4
  - ...
  - Total: d(1 + c/2 + c²/4 + ...) = O(d log N)
80. PRACTI Approach
- Bayou-style log exchange, but allowing partial replication
- Two key ideas
  - Separate invalidations from updates ⇒ partial replication of data
  - Imprecise invalidations (a summary of a set of invals) ⇒ partial replication of metadata
81. PRACTI
- For reads: locate a replica on a read miss
- For writes: construct a spanning tree among replicas
  - To propagate invalidations
  - To propagate updates
82. SDIMS: Not Yet Another DHT System
- Typical DHT applications use the put and get hash-table interfaces
- SDIMS exposes aggregation as a general abstraction
83. Autonomy
- Metrics: increase in path length; path-convergence violations
  - No violations in the Autonomous DHT
[Figure: Pastry vs. ADHT at branching factors (nodes per domain) bf = 4, 16, 64]
- As bf increases, tree height decreases and Pastry's violations decrease
85. Robustness
- PlanetLab with 67 nodes
- Aggregation function: summation; strategy: Update-Up
- Each node updates the attribute with value 10
86. Sparse Attributes
- Attributes of interest to only a few nodes
  - Example: a file Foo in the file location application
- Key for scalability
- Challenge: the aggregation abstraction defines one function per attribute
- Dilemma
  - A separate aggregation function for each attribute: unnecessary storage and communication overheads
  - A vector of values with one aggregation function: defeats the DHT advantage
89. Novel Aggregation Abstraction
- Separate attribute type from attribute name
  - Attribute = (attribute type, attribute name)
  - Example: type = fileLocation, name = fileFoo
- Define the aggregation function once per type
[Figure: example machine with Name = macA, IP addr = 1.1.1.1]
90. Example: File Location
- Query: Tell me two machines with file Foo.
- Attribute: fileFoo
- Value at a machine with id machineId
  - machineId if file Foo exists on the machine
  - null otherwise
- Aggregation function: SELECT_TWO(set of machine ids)
[Figure: aggregation tree for fileFoo; leaves report B, C, or null; the root holds {B, C}]
91. A Key Component
- Most large-scale distributed applications monitor, query, and react to changes in the system
  - Examples: system administration and management, service placement and location, sensor monitoring and control, distributed denial-of-service attack detection, file location service, multicast tree construction, naming and request routing
- A fundamental building block: information collection and management
92. CS Department Micro-Benchmark Experiment
93. API Exposed to Applications
- Install(attrType, function, up, down)
- Update(attrType, attrName, value)
- Probe(attrType, attrName, level, mode)
[Figure: applications invoke this API at an SDIMS leaf node (level 0)]