Title: AutoLoop: Automated Action Selection in the Observe-Analyze-Act Loop for Storage Systems
1. AutoLoop: Automated Action Selection in the Observe-Analyze-Act Loop for Storage Systems
Li Yin (U.C. Berkeley), Sandeep Uttamchandani (IBM), John Palmer (IBM), Randy Katz (U.C. Berkeley), Gul Agha (UIUC)
Presented by David Pease (IBM Almaden Research)
2. Jim Gray's Turing Award speech, "What Next? A Dozen IT Research Goals", 1999
- Build a system
  - used by millions of people each day
  - administered and managed by a ½-time person
  - On hardware fault, order replacement part
  - On overload, adjust automatically
Observe system state, analyze behavior, activate corrective actions
- Self-management is a necessity and a key value differentiator today
  - Autonomic Computing (IBM)
  - Self-* Systems (CMU)
  - ROC (UC Berkeley/Stanford)
  - AutoAdmin (Microsoft)
- Impact on Total Cost of Ownership
  - Scarce skilled administrators
  - Growing number of system protocols, users, application requirements
  - One admin per 1-10 TB of storage [Gartner00]
- Manifestations of the Observe-Analyze-Act Loop
  - Databases (Harvard, Margo Seltzer)
  - Networks (UC Berkeley, Randy Katz)
  - Storage systems (CMU, Greg Ganger)
3. The Observe-Analyze-Act Loop in Storage Systems (I)
[Figure: applications (e-mail, data warehousing, web server, ...) with SLO goals, placed on storage resources through storage virtualization (mapping application data to storage resources)]
4. The Observe-Analyze-Act Loop in Storage Systems (II)
[Figure: the Observe, Analyze, Act loop surrounded by the sources of change listed below; embedded plots over time are labeled Request size and IOPS, for the SPC OLTP and Harvard Campus workloads]
- Adding heterogeneous hardware: DAS, NAS, iSCSI
- Workload access variations
- Failures
  - Hardware failures
  - Software bugs
  - Operator errors
- Load surges
- Indeterminate goals
  - Imprecise information
  - Changes in number of users, business models, over-provisioning thresholds, performance requirements, etc.
5. Problem Statement: Automation of the Observe-Analyze-Act Loop
[Figure: applications (e-mail, data warehousing, web server, ...) placed on storage resources through storage virtualization (mapping application data to storage resources). Labeled inputs: workload access characteristics, application priorities/utility functions, SLO goals (latency-bound, throughput-bound), component properties]
- Short-term actions ("painkillers"): low cost
  - Throttling
  - Parameter tuning
- Long-term actions ("vitamins"): high cost
  - Migration, replication
  - Modifying the allocation of resources to data
- Permanent changes ("surgery")
  - Addition of new hardware
6. Outline
- Motivation
- Bird's-eye view of automated storage management
- AutoLoop: automated action selection
  - Action parameter selection
  - Action selection
- Conclusion
7. The Management Loop
- Describe aspects of system behavior: component models, workload models, action models
- Specify system goals: minimize the number of workloads violating SLOs
- Decide what to do (the system should invoke throttling), how to do it (workload 1 throttled to 500 IOPS), and when to do it (it needs to start now)
- Observe system behavior and trigger the analyze engine
- Current state
- Execute action(s)
8. Goal
- Automate the Observe-Analyze-Act loop
- Focus on the key functionality of the analyze part: automated action selection
- Decide between short-term and long-term actions
9. The Analyze Engine
Step 1: Generate possible corrective options
Step 2: Compare and pre-filter corrective options
Step 3: Select the action and schedule it
Option = <Action, Action Type, Invocation values>
10. The Management Loop
- Describe system behavior: component models, workload models, action models
- Specify system goals: minimize the number of workloads violating SLOs
- Decide what to do (the system should invoke throttling), how to do it (workload 1 throttled to 500 IOPS), and when to do it (it needs to start now)
- Observe system behavior and trigger the analyze engine
- Current state
- Execute action(s)
11. Knowledge Base: Component Models
- Objective: predict the service time for a given load at a component (for example, a storage controller)
  - $\mathit{Service\_time}_{controller} = L(R_1, \ldots, R_n)$
  - where $R_i$ is the load characteristics from workload $i$, including read/write ratio, request size, request rate, and random/sequential ratio
- An example of a component model
  - Single block-level workload, request size 10 KB, read/write ratio 0.8, random access
  - Hardware configuration: FAStT900 storage controller, 8 disks, RAID 0
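One way to picture a component model is as a calibrated curve that maps offered load to service time. The sketch below is only an illustration under stated assumptions: the calibration points are made up (they are not FAStT900 measurements), the function name `service_time_ms` is hypothetical, and the model covers a single workload profile, whereas L(R_1, ..., R_n) on the slide takes several workloads as input.

```python
from bisect import bisect_left

# Hypothetical calibration points for one fixed workload profile:
# (request rate in IOPS, service time in ms). The values are invented
# for illustration and are not measurements of any real controller.
SAMPLES = [(0, 2.0), (500, 3.0), (1000, 5.0), (1500, 9.0)]


def service_time_ms(request_rate):
    """Piecewise-linear estimate of service time for a single workload."""
    rates = [rate for rate, _ in SAMPLES]
    # Clamp to the calibrated range instead of extrapolating.
    if request_rate <= rates[0]:
        return SAMPLES[0][1]
    if request_rate >= rates[-1]:
        return SAMPLES[-1][1]
    hi = bisect_left(rates, request_rate)
    (r0, s0), (r1, s1) = SAMPLES[hi - 1], SAMPLES[hi]
    # Linear interpolation between the two neighboring calibration points.
    return s0 + (s1 - s0) * (request_rate - r0) / (r1 - r0)


print(service_time_ms(750))
```

A table-driven model like this is cheap to evaluate, which matters when the analyze engine queries it many times while searching for action parameters.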
12. Knowledge Base: Workload Models
- Objective: predict the load on component $i$ as a function of workload $j$'s features
  - $\mathit{Component\_load}_{i,j} = W_{i,j}(\text{workload } j \text{ characteristics})$
[Figure: example loads labeled 1000 reqs/s, 10 MB/s, 500 IOPS]
13. Knowledge Base: Action Models
- Objective: predict the effect of a corrective action on component load and workloads
- Example
  - Workload with 20 KB request size, 0.642 read/write ratio, and 0.026 sequential access ratio
  - $\mathit{Request\_rate}_j = A_j(\text{token issue rate for workload } j)$
[Figure: workload request rate versus token issue rate]
14. Knowledge Base: Interpolation Functions
- Objective: describe workload/system trends and patterns
- Example
  - Trend: workload increases 5% every month
  - Pattern: the request rate peaks at 2-4 pm every day
  - Figure shows the load pattern on www.ibm.com over one week [Chase et al., SOSP 2001]
- Time series analysis to predict patterns and trends
  - Example: ARIMA (AutoRegressive Integrated Moving Average) algorithm
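To make the idea of trend prediction concrete, here is a minimal sketch that fits a straight-line trend by ordinary least squares and extrapolates it. This is a deliberately simpler stand-in for ARIMA, not the method named on the slide; the function names and the sample series are assumptions chosen for illustration.

```python
def fit_linear_trend(series):
    """Ordinary least-squares fit of y = a + b*t for t = 0..n-1."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast(series, steps_ahead):
    """Extrapolate the fitted trend steps_ahead samples past the last one."""
    intercept, slope = fit_linear_trend(series)
    return intercept + slope * (len(series) - 1 + steps_ahead)


# Hypothetical monthly request rates (IOPS) showing steady growth.
monthly_load = [100, 110, 120, 130]
print(forecast(monthly_load, 2))
```

A linear fit captures only the trend. Recurring patterns such as a daily peak need a seasonal component, which is one reason a model family like ARIMA is a more natural fit in practice.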
15. The Management Loop
- Describe system behavior: component models, workload models, action models
- Specify system goals: minimize the number of workloads violating SLOs
- Decide what to do (the system should invoke throttling), how to do it (workload 1 throttled to 500 IOPS), and when to do it (it needs to start now)
- Observe system behavior and trigger the analyze engine
- Current state
- Execute action(s)
16. The Analyze Engine
Step 1: Generate possible corrective options
Step 2: Compare and pre-filter corrective options
Step 3: Select the action and schedule it
17. Step 1: Generate Possible Corrective Actions
- Use throttling and migration as examples of short-term and long-term actions
- Throttling
  - Token issue rate for each workload [Chameleon, USENIX 2005]
- Migration
  - Dataset
  - Migration target
  - Migration speed
18. Corrective Action Generation: Intuitions
- Formulated as constrained optimization problems
- Two parts
  - Performance prediction for given parameters
    - Input: invocation parameters
    - Output: expected performance
  - Constrained optimization techniques to select optimal parameters
19. Part 1: Performance Prediction
- Chain all models together to predict the action result
- Example: throttling
[Figure: model chain for throttling, connecting the action model, workloads 1 to n, and the component model]
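The chaining idea on this slide can be sketched as function composition: the action model turns token issue rates into request rates, the workload model turns request rates into component load, and the component model turns total load into latency. Everything below is an assumed toy version: the function bodies, the `base_ms` and `capacity` parameters, and the sample numbers are illustrative and do not come from the presentation.

```python
def action_model(token_rate, demand):
    # Throttling caps a workload's request rate at its token issue rate.
    return min(token_rate, demand)


def workload_model(request_rate, fraction_on_component):
    # Load that one workload places on one component (assumed linear).
    return request_rate * fraction_on_component


def component_model(total_load, base_ms=2.0, capacity=2000.0):
    # Toy latency curve: latency grows as load approaches capacity.
    if total_load >= capacity:
        return float("inf")
    return base_ms / (1.0 - total_load / capacity)


def predict_latency(token_rates, demands, fractions):
    """Chain action -> workload -> component models for one component."""
    total_load = sum(
        workload_model(action_model(t, d), f)
        for t, d, f in zip(token_rates, demands, fractions)
    )
    return component_model(total_load)


# Two workloads: the first is throttled to 500 IOPS, the second is not limited.
print(predict_latency([500, 800], [700, 500], [1.0, 1.0]))
```

Because each stage is a plain function, a different action (for example migration) could be evaluated by swapping only the first stage while reusing the workload and component models.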
20. Part 2: Optimal Invocation Parameter Selection (Throttling)
- Formulated as a constrained optimization problem
- Formulation
  - Variable: token issue rate for each workload ($t_i$)
  - Objective function
    - Minimize the weighted throughput distance to SLOs
    - Example: [equation not recovered]
  - Constraints
    - Workloads should meet their SLO latency goals
    - Each component's latency is based on the model chaining
    - Latency is summed along the path
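As a rough illustration of the formulation above, the sketch below searches a coarse grid of token issue rates, rejects settings that break a latency bound, and keeps the setting with the smallest weighted shortfall from each workload's throughput SLO. It is a brute-force stand-in for a real constrained optimizer; the latency curve, the single shared component, the grid step, and all names are assumptions rather than the authors' formulation.

```python
from itertools import product


def latency_ms(total_load, base_ms=2.0, capacity=2000.0):
    # Toy component model: latency grows as load approaches capacity.
    if total_load >= capacity:
        return float("inf")
    return base_ms / (1.0 - total_load / capacity)


def choose_token_rates(slo_rates, weights, slo_latency_ms, step=100):
    """Grid search for token rates t_i on one shared component.

    Objective: minimize sum_i w_i * (slo_rate_i - t_i).
    Constraint: predicted latency must not exceed slo_latency_ms.
    """
    best, best_cost = None, float("inf")
    grids = [range(0, int(rate) + 1, step) for rate in slo_rates]
    for rates in product(*grids):
        if latency_ms(sum(rates)) > slo_latency_ms:
            continue  # violates the latency constraint
        cost = sum(w * (slo - t) for w, slo, t in zip(weights, slo_rates, rates))
        if cost < best_cost:
            best, best_cost = rates, cost
    return best, best_cost


# Two workloads that each want 1000 IOPS; the first has twice the weight.
print(choose_token_rates([1000, 1000], [2, 1], slo_latency_ms=6.0))
```

In this toy setting the higher-weight workload keeps its full rate and the lower-weight one absorbs the throttling, which matches the intent of a weighted objective. Exhaustive search scales poorly with the number of workloads, so it serves only to show the shape of the problem.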
21. Part 2: Invocation Parameter Selection (Migration)
- Step 1: Dataset selection
  - Filter out high-cost (size) and low-benefit (load) datasets
  - Remove infeasible datasets
[Figure: candidate datasets on the source]
  - Size 10 MB, load 500 IOPS
  - Size 20 MB, load 100 IOPS
  - Size 500 MB, load 1400 IOPS
  - Size 500 MB, load 500 IOPS
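The dataset filter can be sketched with the four datasets from the slide. The thresholds (`max_size_mb`, `min_load_iops`), the dataset names A to D, and the ranking by load per megabyte are assumptions added for illustration; the slide states only that high-cost, low-benefit, and infeasible datasets are filtered out.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_mb: float    # migration cost
    load_iops: float  # migration benefit (load moved off the source)


def select_candidates(datasets, max_size_mb, min_load_iops):
    """Drop datasets that are too large or carry too little load,
    then rank the rest by load moved per megabyte copied."""
    keep = [
        d for d in datasets
        if d.size_mb <= max_size_mb and d.load_iops >= min_load_iops
    ]
    return sorted(keep, key=lambda d: d.load_iops / d.size_mb, reverse=True)


# Sizes and loads as shown on the slide; the names are hypothetical.
datasets = [
    Dataset("A", 10, 500),
    Dataset("B", 20, 100),
    Dataset("C", 500, 1400),
    Dataset("D", 500, 500),
]
for d in select_candidates(datasets, max_size_mb=600, min_load_iops=200):
    print(d.name, d.load_iops / d.size_mb)
```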
22. Part 2: Invocation Parameter Selection (Migration, cont.)
- Step 2: Target selection
  - Variable: $S_i \in \{0, 1\}$ represents whether component $i$ is selected as the target
  - Objective function: minimize load variance on the source and the target
  - Constraints
    - Remaining workloads running on the target should meet their SLOs
    - The migrated dataset should meet its SLO
- Step 3: Migration speed determination
  - Chameleon algorithm: migration is treated as another workload
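Target selection as described above can be sketched as a small search: for each candidate target, check feasibility, compute the load variance across source and target after the move, and keep the minimum. The SLO constraints are replaced here by a simple per-component load cap, which is an assumption, as are the function name, the component names, and the sample loads.

```python
from statistics import pvariance


def pick_target(source_load, dataset_load, targets):
    """Choose one migration target.

    targets maps a component name to (current_load, max_load), where
    max_load is a stand-in for "all SLOs on the target are still met".
    """
    best, best_var = None, float("inf")
    for name, (load, max_load) in targets.items():
        new_load = load + dataset_load
        if new_load > max_load:
            continue  # infeasible: the target would be overloaded
        # Variance of the loads on source and target after the migration.
        var = pvariance([source_load - dataset_load, new_load])
        if var < best_var:
            best, best_var = name, var
    return best


targets = {"c1": (200, 1000), "c2": (900, 1200), "c3": (600, 2000)}
print(pick_target(source_load=1500, dataset_load=500, targets=targets))
```

Selecting exactly one target corresponds to setting a single S_i to 1 in the formulation on the slide.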
23. The Analyze Engine
Step 1: Generate possible corrective options
Step 2: Compare and pre-filter corrective options
Step 3: Select the action and schedule it
24. Step 2: Compare and Pre-filter Corrective Options
- Based on skyline analysis
  - Two-dimensional (cost, benefit) skyline graphs
  - Associate each candidate with (cost, benefit) values
  - Divide graphs into intervals according to benefit values
  - For each interval, eliminate all options not on the skyline
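The pre-filtering steps above can be sketched as follows: bucket the options by benefit interval, then within each bucket drop every option that is dominated, meaning some other option costs no more and benefits no less, with at least one strict inequality. The option names, the cost and benefit values, and the interval width are made up for illustration.

```python
def skyline(options):
    """Return names of options not dominated within this group.

    Each option is a (name, cost, benefit) tuple. Lower cost and
    higher benefit are better.
    """
    kept = []
    for name, cost, benefit in options:
        dominated = any(
            c <= cost and b >= benefit and (c < cost or b > benefit)
            for _, c, b in options
        )
        if not dominated:
            kept.append(name)
    return kept


def prefilter(options, interval_width):
    """Split options into benefit intervals and keep each interval's skyline."""
    buckets = {}
    for option in options:
        buckets.setdefault(int(option[2] // interval_width), []).append(option)
    kept = []
    for key in sorted(buckets):
        kept.extend(skyline(buckets[key]))
    return kept


options = [
    ("throttle_a", 1, 10),
    ("throttle_b", 2, 12),
    ("throttle_c", 3, 11),   # dominated by throttle_b
    ("migrate_a", 50, 60),
    ("migrate_b", 80, 55),   # dominated by migrate_a
]
print(prefilter(options, interval_width=50))
```

Bucketing by benefit first keeps cheap, low-benefit options from being compared against expensive, high-benefit ones, so both kinds survive into the final selection step.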
25. Cost/Benefit Definitions
- Short-term actions
  - Cost: number of workloads operating above or close to their SLO latency
  - Benefit: weighted sum of workloads' throughput efficiency
26. Cost/Benefit Definitions
- Long-term actions
  - Cost: size of data movement
  - Benefit: headroom
    - Step 1: $\mathit{MaxLoad}_j$ is the maximum allowed load for component $j$ without violating any workload's SLO latency
    - Step 2: $\mathit{AdditionLoad}_j$ is the additional load component $j$ can accommodate without violating any workload's SLO latency
    - Step 3: Headroom is defined as [equation not recovered]
27. The Analyze Engine
Step 1: Generate possible corrective options
Step 2: Compare and pre-filter corrective options
Step 3: Select the action and schedule it
28. Step 3: Action Selection and Scheduling
- Event types
  - Reactive trigger
    - Analyze engine is triggered after a system exception has happened
  - Proactive trigger
    - Analyze engine is triggered before a system exception happens
  - Opportunity window
    - System is lightly loaded
29. Step 3: Action Selection and Scheduling Flow Chart
[Figure: flow chart starting at "Trigger from observe module", with yes/no decision branches for reactive trigger, opportunity window, and proactive trigger]
30. Take-away Points
- Automated system management is a necessity
- Actions differ in their operational semantics: cost, benefit, lead time for invocation
- AutoLoop selects corrective strategies along the entire spectrum of available actions
- Currently building a prototype for deployment in real-world systems
31. Questions?
32. Related Publications
- Chameleon: a self-evolving, fully-adaptive resource arbitrator for storage systems. USENIX 2005.
- MonitorMining: A Gray-box approach for knowledge-base creation. IM 2005.
- AutoLoop: Automated Action Selection in the Observe-Analyze-Act Loop for Storage Systems. POLICY 2005.
- Polus: Growing Storage QoS Management beyond a 4-Year Old Kid. FAST 2004.
- DecisionQoS: an adaptive, self-evolving QoS arbitration module for storage systems. POLICY 2004.
- EoS: An Approach of Using Behavior Implications for Policy-based Self-management. DSOM 2003.
- Contact information
  - yinli AT eecs.berkeley.edu, sandeepu AT us.ibm.com