Title: Correlation-Aware Object Placement for Multi-Object Operations
1. Correlation-Aware Object Placement for Multi-Object Operations
- Ming Zhong, Kai Shen, Joel Seiferas
- University of Rochester
2. Problem Overview
- Multi-object operations in data-intensive applications
  - multi-word searches in full-text keyword search engines
  - aggregation queries in distributed databases
- Operations involving multiple distributed objects incur communication and synchronization overhead.
- Correlation between two objects
  - probability that they are requested together in an operation
- Correlation-aware object placement
  - intuitive to place highly correlated objects together
  - goal: reduce communication overhead subject to a per-machine capacity constraint (load balance)
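The correlation definition above can be estimated from an operation trace. The following is a minimal sketch, assuming a trace is simply a list of operations, each a list of object names; the function name `pair_correlations` and the sample trace are hypothetical and not taken from the slides.

```python
from collections import Counter
from itertools import combinations

def pair_correlations(operations):
    """Estimate r(i, j) as the fraction of operations requesting both i and j."""
    counts = Counter()
    for op in operations:
        # Each unordered pair of distinct objects in one operation counts once.
        for pair in combinations(sorted(set(op)), 2):
            counts[pair] += 1
    total = len(operations)
    return {pair: n / total for pair, n in counts.items()}

# Hypothetical multi-word search trace (illustrative values only).
trace = [
    ["car", "rental"],
    ["car", "rental", "cheap"],
    ["cheap", "flights"],
    ["car", "insurance"],
]
r = pair_correlations(trace)
```

Here `r[("car", "rental")]` is 0.5 because two of the four operations request both keywords; pairs that never co-occur are simply absent from the result.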
3. Realistic Object Correlation Patterns
- Skewed correlations → sufficient benefit
- Stable correlations → low adjustment overhead
- Illustration of skewness and stability in real keyword search engine traces (at Ask.com)
The most correlated pair is 177 times more correlated than the 1000th most correlated pair.
Only 1.2% of pairs had correlations that changed by at least a factor of two after one month.
4. Problem Context
- Many large applications are data-intensive and distributed.
- Multi-object operations are increasingly common
  - Yu et al. (2006) studied availability of multi-object operations
  - we study comm/sync cost of multi-object operations in distributed systems
- Data skewness in large real-world data sets
  - e.g., web object popularity follows a Zipf distribution
  - we identify and utilize similar skewness of object correlations in multi-object operations
5. Analytical Problem Formulation
- Input parameters
  - r(i,j): correlation between objects i and j
  - w(i,j): cost between objects i and j when they are placed apart
  - s(i): capacity usage of object i
  - c(k): capacity constraint at machine k
- Object placement variables
  - x(i,k): 1 if object i is placed at machine k; 0 otherwise
  - z(i,j): 0 if objects i and j are placed together; 1 otherwise
- Optimization target → Minimize $\sum_{i,j} r(i,j)\, w(i,j)\, z(i,j)$
- Capacity constraint at each machine k → $\sum_{i} x(i,k)\, s(i) \le c(k)$
- Problem can be reduced to minimum n-way cut (n is the number of machines)
  - NP-hard
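The objective and the capacity constraint can be evaluated directly for any integral placement. This is a small sketch under stated assumptions: the names `placement_cost` and `within_capacity`, the dictionary layout, and the example values are all hypothetical, and each unordered pair (i, j) is counted once in the sum.

```python
def placement_cost(placement, r, w):
    """Sum r(i,j) * w(i,j) over pairs placed on different machines (z(i,j) = 1)."""
    return sum(
        r[(i, j)] * w[(i, j)]
        for (i, j) in r
        if placement[i] != placement[j]
    )

def within_capacity(placement, s, c):
    """Check sum_i x(i,k) * s(i) <= c(k) for every machine k."""
    load = [0.0] * len(c)
    for obj, machine in placement.items():
        load[machine] += s[obj]
    return all(load[k] <= c[k] for k in range(len(c)))

# Illustrative inputs: three objects, two machines.
r = {("a", "b"): 0.5, ("a", "c"): 0.1, ("b", "c"): 0.2}
w = {pair: 1.0 for pair in r}
s = {"a": 2, "b": 2, "c": 3}
c = [4, 4]

together = {"a": 0, "b": 0, "c": 1}   # most correlated pair co-located
apart = {"a": 0, "b": 1, "c": 1}      # most correlated pair split

cost_together = placement_cost(together, r, w)
cost_apart = placement_cost(apart, r, w)
```

In this example, co-locating the most correlated pair gives the lower cost and respects capacity, while the alternative both costs more and overloads machine 1.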
6. Simplified Variant: Linear Programming
- Optimization target → Minimize $\sum_{i,j} r(i,j)\, w(i,j)\, z(i,j)$
- Constraints: for each machine k → $\sum_{i} x(i,k)\, s(i) \le c(k)$
- Relax to fractional object placement
  - x(i,k): proportion of object i placed at machine k
  - $0.0 \le x(i,k) \le 1.0$; $\sum_{k} x(i,k) = 1.0$
  - → Problem becomes a linear program, solvable in polynomial time
- Fractional solution values x(i,k) must be rounded to integers
  - rounding is very coarse-grained
  - naïve rounding may dramatically inflate the minimization target (cost of multi-object operations)
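For reference, the relaxed program can be written out in full. The slides do not state how z(i,j) is tied to the fractional x(i,k); the last line below is one common linearization, given here as an assumption rather than as the authors' formulation.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{i,j} r(i,j)\, w(i,j)\, z(i,j) \\
\text{subject to}\quad
  & \sum_{i} x(i,k)\, s(i) \le c(k) && \text{for each machine } k \\
  & \sum_{k} x(i,k) = 1 && \text{for each object } i \\
  & 0 \le x(i,k) \le 1 && \text{for each } i, k \\
  & z(i,j) = \tfrac{1}{2} \sum_{k} \lvert x(i,k) - x(j,k) \rvert && \text{(assumed coupling)}
\end{aligned}
```

Under this assumed coupling, an integral placement yields z(i,j) = 0 when i and j share a machine and z(i,j) = 1 otherwise, matching the 0/1 definition of z(i,j). Each absolute value can be replaced by an auxiliary variable bounded below by both signed differences, which keeps the program linear.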
7. Probabilistic Rounding
- Loop until all objects are placed
  - pick a random machine k
  - pick a random probability r in [0,1]
  - check every un-placed object i; place i at k if r ≤ x(i,k)
- Probabilistic results
  - object i is placed at machine k with probability x(i,k)
  - expected cost (after the rounding) is at most twice the cost of the original fractional solution
  - expected capacity need (after the rounding) at each machine does not change
- Strength: a polynomial-time 2-approximation is a strong result compared to relevant NP-hard problems.
- Weakness: probabilistic expectation is not a guarantee.
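The rounding loop can be sketched in a few lines. This is an illustrative version, not the authors' code: the function name `probabilistic_rounding`, the list-of-lists layout for x, and the example matrix are assumptions. It uses a strict comparison because Python's `random()` draws from [0, 1), so an object with x(i,k) = 0 is never placed at machine k.

```python
import random

def probabilistic_rounding(x, rng=None):
    """Round fractional x[i][k] (each row summing to 1) to one machine per object."""
    rng = rng or random.Random()
    num_machines = len(x[0])
    placement = [None] * len(x)
    unplaced = set(range(len(x)))
    while unplaced:
        k = rng.randrange(num_machines)   # pick a random machine
        r = rng.random()                  # pick a random threshold in [0, 1)
        # The same (k, r) is applied to every un-placed object, so objects
        # with similar fractional placements tend to land together.
        for i in list(unplaced):
            if r < x[i][k]:
                placement[i] = k
                unplaced.discard(i)
    return placement

# Illustrative fractional solution: 3 objects, 2 machines.
x = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
placement = probabilistic_rounding(x, random.Random(0))
```

Objects 0 and 1 always end up on machines 0 and 1 respectively, and object 2 lands on either machine with equal probability. Sharing one threshold across all un-placed objects in a round is what keeps correlated objects with similar fractional placements on the same machine.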
8. Important-Object Partial Optimization
- Overhead
  - number of variables/constraints in linear programming is at least O(objects × machines), which can be too large!
- Important-object partial optimization
  - derive optimal placement only for a few important objects (those incurring most cost in multi-object operations)
Example: dominance of important keywords in multi-word search trace at Ask.com.
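One way to pick the optimization scope is to rank objects by how much pairwise cost they participate in. The slides do not give the exact ranking rule, so the sketch below is an assumption: the name `most_important_objects` and the scoring (the sum of r(i,j) · w(i,j) over all pairs containing the object) are hypothetical.

```python
from collections import defaultdict

def most_important_objects(r, w, m):
    """Return the m objects with the largest total r(i,j) * w(i,j) over their pairs."""
    score = defaultdict(float)
    for (i, j), corr in r.items():
        cost = corr * w[(i, j)]
        score[i] += cost
        score[j] += cost
    return sorted(score, key=score.get, reverse=True)[:m]

# Illustrative inputs (not from the Ask.com trace).
r = {("a", "b"): 0.5, ("c", "d"): 0.4, ("b", "c"): 0.3, ("a", "d"): 0.1}
w = {pair: 1.0 for pair in r}
important = most_important_objects(r, w, 2)
```

Only the selected objects would then enter the linear program; the remaining objects could be placed by a cheaper method such as random placement.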
9. Trace-driven Performance Evaluation
- Trace-driven keyword index placement to minimize the communication cost of multi-keyword searches
- Compare three data placement approaches on the overhead of multi-object operations
  - Linear programming with probabilistic rounding
  - Random placement
  - Greedy placement: place most correlated object pairs together subject to the per-machine capacity constraint
- Load balance: per-machine capacity constraint is twice the average per-machine load
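The greedy baseline can be sketched as follows. The slides describe it only in one line, so the details here are one plausible reading and are assumptions: pairs are visited in decreasing correlation, a pair goes to the least-loaded machine that fits both objects, and leftover objects go to the least-loaded machine with room. The name `greedy_placement` and the example inputs are hypothetical.

```python
def greedy_placement(r, s, c):
    """Co-locate the most correlated pairs first, subject to capacities c[k]."""
    load = [0.0] * len(c)
    placement = {}

    def place(obj, k):
        placement[obj] = k
        load[k] += s[obj]

    def least_loaded_fitting(size):
        fitting = [k for k in range(len(c)) if load[k] + size <= c[k]]
        return min(fitting, key=lambda k: load[k]) if fitting else None

    for (i, j), _ in sorted(r.items(), key=lambda kv: kv[1], reverse=True):
        if i in placement and j in placement:
            continue
        if i not in placement and j not in placement:
            k = least_loaded_fitting(s[i] + s[j])
            if k is not None:
                place(i, k)
                place(j, k)
        else:
            placed, other = (i, j) if i in placement else (j, i)
            k = placement[placed]
            if load[k] + s[other] <= c[k]:
                place(other, k)

    # Objects that could not join a partner are placed wherever they fit.
    for obj in s:
        if obj not in placement:
            k = least_loaded_fitting(s[obj])
            if k is None:
                raise ValueError(f"no machine has room for {obj!r}")
            place(obj, k)
    return placement

# Illustrative inputs; capacity is twice the average per-machine load.
r = {("a", "b"): 0.5, ("c", "d"): 0.4, ("b", "c"): 0.3, ("a", "d"): 0.1}
s = {"a": 1, "b": 1, "c": 1, "d": 1}
num_machines = 2
c = [2 * sum(s.values()) / num_machines] * num_machines
greedy = greedy_placement(r, s, c)
```

With these inputs the two most correlated pairs are each kept together, on separate machines.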
10. Result: Comm. Overhead Reduction
- Overhead reduction compared to random placement with varying optimization scope (number of most important keywords subject to optimization)
11. Result: Varying System Sizes
- Cost reduction over random placement is between 73% and 86%
- Small variations of the cost reduction at different system sizes
12. Conclusion
- Results
  - analytical aspect: polynomial-time linear programming and probabilistic rounding to produce a 2-approximation solution
  - systems aspect: important-object partial optimization to control overhead; evaluation using real application traces
- Big picture: skewed, stable data distributions motivate per-object adaptation in distributed system management
  - adapt co-placement of correlated data objects for efficient multi-object operations [ICDCS'08]
  - adapt object replication degrees for high availability [EuroSys'08]
  - adapt Bloom filter object hash numbers for low false-positive rate [PODC'08]