Title: RADical New Challenges for Machine Learning: A Systems View
1. RADical New Challenges for Machine Learning (A Systems View)
- Armando Fox, UC Berkeley RAD Lab
- Autonomic Computing Workshop, ECML 2006
2. Everything is Online
- Shrinkwrap SW moving to online service model
  - Already there: data-intensive (e.g., search), communication (email, IM, etc.), e-commerce, B2B
  - Hosted enterprise components: Oracle → Oracle Online; desktop CRM → Salesforce, etc.
  - Desktop: Word → Writely
  - Platform: Windows Vista → Windows Live
  - Individual as well as enterprise use
- Metrics for online services are different
  - TCO dominated by operations, esp. human costs
  - Availability subject to QoS/SLA constraint
  - (Later: power consumption)
- Problem: maintain availability, QoS/SLA, etc. during rapid, agile growth
3. Coping with Rapid Growth
[Figure: YouTube.com daily traffic ranking over the past year, log scale. 12-month growth: rank 100,000 → 15. Source: Alexa.com]
4. SML Opportunity
- Need near-100% availability while growing
  - Agility: dynamically improve/modify service offerings
  - Dynamic: add/change hardware & software, scale network, add users...
  - Result: constant new (mis)behaviors
- Lots of data available but few analytical models or good tools → SML opportunity
- This talk: what about this application domain is interesting/noteworthy for SML?
  - What new SML research does it suggest?
5. Dynamic Environment
- E.g., Amazon.com monitoring subsystem
  - 140 code pushes/month
  - 700 doc changes/month
- "You build it, you run it": everything is a service
- Tons of data, little preprocessing
  - Overwhelms ability of operators to localize problems
  - Result: 6-20 resolvers on 15-minute pager call
6. Outline: 3+1 Challenges for SML & Internet Systems
- Ground truth is elusive, no brute-force path to get it
  - SML: alternative evaluations, strength of result
- False positives may be OK
  - SML: lightweight online techniques
- Help the operator
  - SML: formalize operator knowledge as model input
- Power management is new challenge area
  - SML: models for power-centric dynamic configuration
8. Why Is Ground Truth Elusive?
- Incomplete forensic data
  - Imperfect choices about what to collect, how long to keep
- Undetected partial failures
  - May affect only a subset of users or for a short time
- Simulation not good enough
  - Like thermodynamics: effects in the large can be simulated with some accuracy, but not finer-grained ones
  - Dynamic external conditions in the real world always elicit some previously unseen behavior/pathology
- Incorrect expert diagnosis → wrong labels
9. Can't we just measure ground truth?
- Global or end-to-end indicators?
  - Click-through rate of searches?
  - Purchase rate at large-volume e-commerce site?
  - These are symptoms, but symptoms ≠ localizing & resolving the problem
- Low-level system metrics?
  - 10s of metrics per machine, 100s of machines per cluster
  - But which metrics are relevant to a particular problem?
- Human operators label problem causes?
  - This is part of the motivation of the work!
  - Not like labeling images, text corpora, etc.
10. Ground Truth Example
- Find concise signature (like feature vector) that captures essential system state over short interval
  - Goal: similar states ⇔ similar system behavior
  - Representation should be indexable, so recurring conditions can be detected readily, annotated, etc.
- Idea: use TAN (tree-augmented naive Bayes) to classify whether system met Service-Level Objective during 5-minute intervals
  - Good classification on real data → signatures meaningful
  - As it turns out, don't use raw metric values as signatures, but attribution weights (based on Brier score) of how much each metric contributed to classification decision
- Evaluation: can similarity-based retrieval (precision, recall) identify recurring problem once models have been trained?
  - Synthetic expts: generate workload, induce known failures
  - Problem: real data incompletely/incorrectly labeled
(from Zhang, Goldszmidt, Cohen, Fox et al., DSN 2004, SOSP 2005)
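The retrieval-style evaluation on this slide can be sketched roughly as follows. This is a minimal illustration, not the method from the cited papers: it assumes signatures are plain numeric vectors, uses cosine similarity as the similarity measure, and the labels (`db_contention`, `app_leak`, `net_congestion`) and values are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, library, k):
    """Return the labels of the k stored signatures most similar to the query."""
    ranked = sorted(library, key=lambda item: cosine(query, item[0]), reverse=True)
    return [label for _, label in ranked[:k]]

def precision_recall(retrieved, true_label, total_relevant):
    """Precision and recall of a retrieved label list for one known problem."""
    hits = sum(1 for label in retrieved if label == true_label)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall

# Hypothetical library of (signature, operator label) pairs.
library = [
    ([0.9, 0.1, 0.0], "db_contention"),
    ([0.8, 0.2, 0.1], "db_contention"),
    ([0.0, 0.1, 0.9], "app_leak"),
    ([0.1, 0.0, 0.8], "app_leak"),
    ([0.1, 0.9, 0.1], "net_congestion"),
]
query = [0.85, 0.15, 0.05]  # signature of a new 5-minute interval
top = retrieve(query, library, k=2)
p, r = precision_recall(top, "db_contention", total_relevant=2)
```

With incompletely or incorrectly labeled real data, the `true_label` argument is exactly the part that cannot be trusted, which is the problem the slide points out.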
11. What can be done?
- Original dataset was partially labeled
  - Can always tell: did system meet SLO?
  - Can sometimes tell: in violation, what was underlying problem/pathology?
- Later learned that operators had misdiagnosed (mislabeled) one of the problems
  - 80 pages of messages exchanged among operators to troubleshoot
- Idea 1: co-clustering in time as proxy for labels
  - Time not used in original clustering/classification, but real systems tend to undergo phase transitions
  - So, evaluate both stability of clustering and whether clusters are localized in time
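One simple way to check whether clusters are localized in time is sketched below. It is only an assumed proxy, not the measure used in the work: it takes the cluster label assigned to each consecutive interval and reports the fraction of adjacent intervals that fall in the same cluster.

```python
def temporal_locality(labels):
    """Fraction of adjacent interval pairs assigned to the same cluster.

    labels: cluster ids in time order, one per measurement interval.
    Values near 1.0 mean clusters form contiguous runs in time;
    values near 0.0 mean cluster membership is scattered across time.
    """
    if len(labels) < 2:
        return 1.0
    same = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return same / (len(labels) - 1)

# Hypothetical cluster assignments for ten consecutive intervals.
localized = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]   # phase-like behavior
scattered = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]   # no temporal structure

locality_localized = temporal_locality(localized)
locality_scattered = temporal_locality(scattered)
```

Because time was not an input to the clustering, a high score here is independent evidence that the clusters reflect real system phases rather than artifacts.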
12. Co-clustering with time
13. Results
- Tens of seconds to induce models, hundreds of ms to evaluate incoming data against models
  - System behavior and workload change on a scale of 10s of seconds
- Good retrieval accuracy on real labeled data from 1 system
- Poor accuracy when using dataset A to identify same problems on similar system B... why?
  - Juxtaposition of "good" and "bad" system behaviors differs on the two systems
  - Currently investigating mixture/sampling of "good" behaviors to build baseline
  - How to evaluate new approach w/o labeled data?
14. Alternative evaluation
- Consistency: try different baselines, see if they lead to similar clustering
- Weaker result: "When model is built according to this algorithm, it achieves precision/recall no worse/better than when baseline model is computed from a single interval"
- Strong result: "...achieves the following precision/recall in identifying labeled data"
  - Can measure this on synthetic data only
  - Can demonstrate when we obtain more labeled real data
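The consistency check (do different baselines lead to similar clusterings?) needs some agreement measure between two clusterings of the same intervals. A minimal sketch, assuming the Rand index as that measure (the slides do not name one):

```python
from itertools import combinations

def rand_index(a, b):
    """Rand index between two clusterings of the same items.

    a, b: cluster ids per item, in the same item order.
    Counts the fraction of item pairs on which the clusterings agree,
    i.e. both put the pair together or both put it apart.
    Cluster ids themselves do not need to match.
    """
    pairs = list(combinations(range(len(a)), 2))
    if not pairs:
        return 1.0
    agree = sum(1 for i, j in pairs if (a[i] == a[j]) == (b[i] == b[j]))
    return agree / len(pairs)

# Hypothetical clusterings obtained from two different baselines.
same_up_to_renaming = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
mostly_different = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A score close to 1.0 across baselines supports the weaker, label-free result; it says nothing about whether the clusters match the true problems.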
16. False Positives May Be OK
- Localize ≠ diagnose, and resolve ≠ fix: limit diagnosis to what can be resolved
- Systems contribution: fine-grained generic recovery strategies
- Insight: if other strategies are >> expensive by comparison, cost of trying simple strategies is relatively very low → false positives can be tolerated
17. False positives example: path-based analysis
- Instrument J2EE app server to capture path of each request through application modules (EJBs)
  - Path = parse tree generated by PCFG
  - At runtime, compute anomaly score based on parse trees
  - Done entirely in middleware, no change to app source
- Detect 16% more failures than existing generic techniques (107/122 or 88% vs. 88/122 or 72%)
- False positives around 20%! 2 orders of magnitude too high??
  - Algorithmic: simple scalar path scoring, but nonlinear classification boundary
  - Semantic: rare-but-legitimate system behavior (mis)classified as anomalous
  - Ballpark for systems: eBay decision tree diagnosis, 24% FP
(E. Kiciman and A. Fox, IEEE Trans. Neural Networks, Spring 2005)
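To make the idea of scoring a request path concrete, here is a deliberately simplified sketch. It does not implement a PCFG: it assumes a path is a flat sequence of components and learns caller-to-callee transition frequencies from normal traffic, then scores a new path by the mean negative log-probability of its transitions. Component names, the smoothing constant `alpha`, and `vocab_size` are all assumptions for the example.

```python
import math
from collections import Counter, defaultdict

def learn_transitions(paths):
    """Count component-to-component transitions seen in normal paths."""
    counts = defaultdict(Counter)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return counts

def anomaly_score(path, counts, alpha=1.0, vocab_size=10):
    """Mean negative log-probability of a path's transitions.

    Higher scores mean the path uses transitions that were rare or
    unseen in training. Additive smoothing (alpha) keeps unseen
    transitions from producing log(0).
    """
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    total = 0.0
    for src, dst in steps:
        row = counts.get(src, Counter())
        prob = (row[dst] + alpha) / (sum(row.values()) + alpha * vocab_size)
        total += -math.log(prob)
    return total / len(steps)

# Hypothetical training set of normal request paths.
normal = [["Web", "Cart", "DB"]] * 50 + [["Web", "Catalog", "DB"]] * 50
counts = learn_transitions(normal)

usual = anomaly_score(["Web", "Cart", "DB"], counts)
odd = anomaly_score(["Web", "DB", "Cart"], counts)
```

A rare-but-legitimate path would also score high under this scheme, which is the "semantic" source of false positives noted on the slide.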
18. Solution: reduce cost of acting on false positives
- Microreboot: localized recovery of individual app components (Candea, Fox et al., OSDI 2004)
  - Supported by middleware changes → app-transparent
  - 10s-100s of ms to recover from common problems (local memory leaks, corrupted data structures...)
  - Harmless if superfluous
  - Similar idea: Rx (Zhou et al., SOSP 2005)
  - Next-cheapest technique takes an order of magnitude longer → always safe to try
- Experiments: full reboot vs. microreboot, FP rate 60%
  - Improvement in availability due to path-based diagnosis: 1 nine (compared to traditional detection)
  - Full reboot needs 0% FP for same availability
- Another interpretation: given fixed FP rate & availability, µRB allows >50 sec additional time to detect/localize over full RB
- Systems contribution: cheap recovery allows higher time-to-detect and higher FP rate
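The tradeoff on this slide can be illustrated with a back-of-the-envelope model. It is an assumed simplification, not the experimental setup: every alarm triggers one recovery action, a fraction `fp_rate` of alarms are false, and each recovery (needed or not) costs `recover_s` seconds. The detection and recovery times below are illustrative values, not measurements.

```python
def downtime_per_failure(detect_s, recover_s, fp_rate):
    """Expected seconds lost per real failure, including wasted recoveries.

    fp_rate is the fraction of alarms that are false, so each real
    failure is accompanied on average by fp_rate / (1 - fp_rate)
    false alarms, each of which also pays the recovery cost.
    """
    if not 0.0 <= fp_rate < 1.0:
        raise ValueError("fp_rate must be in [0, 1)")
    false_alarms_per_failure = fp_rate / (1.0 - fp_rate)
    return detect_s + recover_s * (1.0 + false_alarms_per_failure)

# Illustrative numbers: 10 s to detect, 0.5 s microreboot, 20 s full reboot.
micro = downtime_per_failure(detect_s=10.0, recover_s=0.5, fp_rate=0.6)
full = downtime_per_failure(detect_s=10.0, recover_s=20.0, fp_rate=0.6)

# Extra detection time a microreboot could spend and still match full reboot.
slack_s = full - micro
```

Under these assumptions the cheap recovery leaves tens of seconds of slack that can be spent on slower detection or on tolerating more false alarms.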
20. Helping the Operator: Waterfall vs. Process
- Waterfall: static handoff model, N groups, static system
- DADO: developers are the operators, dynamic system
  - E.g., Amazon: highly developed "resolver" culture
21. Autonomic vs. RADical?
- IBM Autonomic Computing Manifesto
  - Self-configuring/upgrading
  - Self-optimizing (performance)
  - Self-healing (from failures)
  - Self-protecting (from attacks)
- 2 approaches
  - 1) Automate entire process (get rid of sysadmins from waterfall model)
  - 2) Increase productivity of Developer/Operators
    - Help expert operators make sense of mounds of data
    - Improve efficiency of novice operators
    - Improve all operators' confidence in automation
22. Helping the operator
[Figure: top 40 pages (98% of traffic) vs. time in 5-minute intervals. Bodik et al., ICAC 2005]
23. 2nd Example: Account Page
[Figure: anomaly score and anomaly threshold]
24. Ebates Results
- Landing looping: warn 2 days before; localized
- Account page: warn 16 hours; 1 hour to localize
- Broken signup page: warn 7 days; localized
- Detected a failure they didn't tell us about (??)
- Detected 3 other significant anomalies
  - CTO: "These might have been important, but we didn't know about them. Definitely useful if detected in real-time."
- Visualization helps detection, but anomaly scores help localization
  - Ask model what pages contributed most to the decision of "anomalous"
- warning 3
  detection time: Sun Nov 16 19:27:00 PST 2003
  start: Sun Nov 16 19:24:00 PST 2003
  end: Sun Nov 16 21:05:00 PST 2003
  anomaly score: 7.05
  Most anomalous pages:
    /landing.jsp 19.55
    /landing_merchant.jsp 19.50
    /mall_ctrl.jsp 3.69
    /malltop.go 2.63
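"Asking the model which pages contributed most" can be sketched with a per-page deviation score. The example below is an assumption for illustration only and is not claimed to be how the scores in the warning above were computed: it compares the current interval's hit counts with a historical traffic mix and ranks pages by their chi-square-style term. All counts, and the page names other than /landing.jsp, are hypothetical.

```python
def page_contributions(baseline_counts, current_counts):
    """Per-page deviation of current traffic from the baseline traffic mix.

    For each page, computes (observed - expected)^2 / expected, where
    expected is the hit count the page would get if the current interval
    followed the baseline mix. Add-one smoothing avoids zero expectations.
    """
    pages = sorted(set(baseline_counts) | set(current_counts))
    base_total = sum(baseline_counts.values())
    cur_total = sum(current_counts.values())
    contrib = {}
    for page in pages:
        share = (baseline_counts.get(page, 0) + 1) / (base_total + len(pages))
        expected = cur_total * share
        observed = current_counts.get(page, 0)
        contrib[page] = (observed - expected) ** 2 / expected
    return contrib

def most_anomalous(contrib, n=3):
    """Pages with the largest contributions, most anomalous first."""
    return sorted(contrib, key=contrib.get, reverse=True)[:n]

# Hypothetical hit counts: long-run baseline vs. one 5-minute interval.
baseline = {"/landing.jsp": 500, "/account.jsp": 300, "/signup.jsp": 200}
current = {"/landing.jsp": 90, "/account.jsp": 5, "/signup.jsp": 5}

contrib = page_contributions(baseline, current)
ranking = most_anomalous(contrib, n=3)
```

The sum of the per-page terms could serve as the interval's overall anomaly score, while the ranking gives the operator a place to start looking.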
25. Operator Knowledge & Model Drift
- Which learning approach is correct?
  - Try multiple, let operator disambiguate
- Expert operators: how to capture & incorporate knowledge?
  - Operators interrogate interpretable/auditable models, e.g., decision trees, Bayesian networks, etc.
  - Mine operator actions to understand & eventually automate how operators troubleshoot
- Novice operators: improve learning curve
  - Constant turnover of operators in DADO model
  - Less-experienced operators can do harder tasks
  - E.g., a way of reducing false positives in Ebates example
  - In the limit, novice operators may be automated
26. Path-based analysis
- Idea: rather than time series, look at paths
  - Each path gets unique ID
  - Features collected along path
- Do data mining on paths
  - Which paths are similar (w.r.t. selected features)? (clustering, etc.): workload characterization
  - Which paths share resources? (scheduling)
  - Do paths co-cluster with respect to both some external observable and some subset of features? (failure localization)
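The failure-localization question can be illustrated with a very small example. This sketch assumes the external observable is a per-path success/failure flag and the feature is the set of components a path touched; it simply ranks components by the failure rate of the paths passing through them. Component names and data are hypothetical.

```python
from collections import defaultdict

def suspect_components(paths):
    """Rank components by the failure rate of the paths that touch them.

    paths: list of (components_touched, failed) pairs.
    Returns (component, failure_rate) pairs, highest rate first;
    ties are broken alphabetically so the order is deterministic.
    """
    touched = defaultdict(int)
    failed = defaultdict(int)
    for components, is_failed in paths:
        for component in set(components):
            touched[component] += 1
            if is_failed:
                failed[component] += 1
    rates = {c: failed[c] / touched[c] for c in touched}
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical observed paths with an end-to-end failure flag.
paths = [
    (["Web", "Cart", "DB"], False),
    (["Web", "Cart", "DB"], False),
    (["Web", "Search", "DB"], True),
    (["Web", "Search"], True),
    (["Web", "Catalog", "DB"], False),
]
ranking = suspect_components(paths)
```

A decision tree over the same path features is a more expressive version of this idea, since it can also capture combinations of components.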
27. Other Path-Based Results
- Reynolds et al. 2006, "Pip: Detecting the Unexpected in Distributed Systems"
  - Found 18 bugs in 4 distributed systems by logging paths and automatically checking expected vs. unexpected behavior
- Chen et al. 2003, "Path-based Failure and Evolution Management"
  - Using decision trees (from SML) on collected paths to detect and diagnose failures and estimate their impact, from eBay and TellMe traces
  - Identified 93% of root causes with 23% false positives
- Aguilera, Mogul, et al. 2003, "Performance debugging of black-box distributed systems"
  - Reconstruction of transactions' causal paths for performance debugging
28. Path-Based Nirvana
- Imagine a world where path information is always passed along, so that user requests can always be tracked throughout the system
  - Analogous to data lineage in database systems
- Across apps, OS, different computers on LAN
  - Unique request ID
  - Components touched
  - Time of day
  - Parent of this request
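The four items listed above map naturally onto a small record type. The sketch below is one possible shape for such a record; the class and field names (`PathRecord`, `request_id`, `parent_id`, `started_at`, `components`) are invented for illustration and do not come from any existing framework.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PathRecord:
    """Per-request path information carried along with the request."""
    request_id: str                      # unique request ID
    parent_id: Optional[str] = None      # parent of this request, if any
    started_at: float = field(default_factory=time.time)  # time of day
    components: List[str] = field(default_factory=list)   # components touched

    def touch(self, component: str) -> None:
        """Record that the request passed through a component."""
        self.components.append(component)

    def spawn(self) -> "PathRecord":
        """Create a record for a sub-request caused by this request."""
        return PathRecord(request_id=uuid.uuid4().hex, parent_id=self.request_id)

# Example: a user request that triggers one sub-request.
root = PathRecord(request_id=uuid.uuid4().hex)
root.touch("Apache")
root.touch("App Server")
child = root.spawn()
child.touch("Database")
```

Keeping the parent link makes it possible to rebuild the full causal tree of a user request from individually logged records.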
29. Trace: The 1% Solution
- Trace goal: make path-based analysis have low overhead so it can be ubiquitous
  - "Baseline" path info collection with 1% overhead
  - Selectively enable more local detail for specific requests
- Trace: an end-to-end path recording framework
  - Capture timestamp & a unique requestID across all system components
  - "Top level" log contains path traces
  - Local logs contain additional detail, correlated to path ID
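How a top-level log and local logs might be correlated by request ID is sketched below. The log formats are assumptions made for the example (tuples of timestamp, request ID, component for the top level; request ID and message for local entries), not the framework's actual format.

```python
from collections import defaultdict

def correlate(top_level, local_logs):
    """Join top-level path traces with local log detail by request ID.

    top_level: list of (timestamp, request_id, component) entries.
    local_logs: dict mapping component name to a list of
                (request_id, message) entries from that component's log.
    Returns a dict: request_id -> {"path": [...], "detail": [...]}.
    """
    detail = defaultdict(list)
    for component, entries in local_logs.items():
        for request_id, message in entries:
            detail[request_id].append((component, message))

    paths = defaultdict(list)
    for _timestamp, request_id, component in sorted(top_level):
        paths[request_id].append(component)  # components in time order

    return {
        rid: {"path": comps, "detail": detail.get(rid, [])}
        for rid, comps in paths.items()
    }

# Hypothetical log contents for two requests.
top_level = [
    (1.0, "r1", "Apache"),
    (1.2, "r1", "App Server"),
    (1.5, "r1", "Database"),
    (2.0, "r2", "Apache"),
]
local_logs = {"Database": [("r1", "slow query: 300 ms")]}

result = correlate(top_level, local_logs)
```

Only the cheap top-level entries are needed for every request; the local detail is looked up on demand for the requests that turn out to be interesting.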
30. Trace Example
- 3-tiered system on separate computers
- Every time a request crosses a machine or component, trace it in the top level of the log (logically central but physically distributed)
- Modify existing local logs with ID
  - Use wrapper or plug-in
[Figure: Apache, App Server, and Database tiers, each with its own local log; steps numbered 1-3]
31. Trace: Initial Plans
[Figure: Apache, App Server, and Database tiers, each with its own local log]
- Annotation: put ID into IP header to pass between computers on LAN
33. Challenge on Horizon: Datacenter Power Management
- Early 90s: make CPUs faster at any cost
- Late 90s: increase density at any cost
- Today: improve power efficiency at any cost
  - Cooling: datacenters not full, but at capacity due to heat removal
  - Google, MS building datacenters next to Columbia River
  - Datacenters have longer depreciation times than PCs
- Nonlinear cost model for power
  - 1 W of power consumption requires 1 W of A/C
  - Heavy penalties for "spiking" (penalize peak, not average)
  - Heavy incentives for agile conservation
- National & international implications
  - 5% of US electricity consumption is IT
  - 20% improvement → 1% national reduction in US
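The arithmetic behind the last two bullets and the cooling rule of thumb can be checked in a few lines. The function names and the 300 W server are illustrative assumptions; the 5%, 20%, and 1 W-per-1 W figures are the ones on the slide.

```python
def national_reduction(it_share, it_improvement):
    """Fraction of national consumption saved by an IT efficiency gain."""
    return it_share * it_improvement

def facility_watts(it_watts, cooling_watts_per_it_watt=1.0):
    """Total draw when each watt of IT load needs extra watts of A/C."""
    return it_watts * (1.0 + cooling_watts_per_it_watt)

# 5% of US electricity is IT; a 20% improvement in IT saves 1% nationally.
saved_fraction = national_reduction(it_share=0.05, it_improvement=0.20)

# A hypothetical 300 W server costs 600 W at the facility level.
total_w = facility_watts(300.0)
```

The cooling multiplier is why a watt saved at the server counts double at the datacenter meter.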
34. Effect on datacenters
- Before: overprovision
  - How many machines for Xmas? → Lots
  - Need 1,000 machines to deploy new service? Buy 'em
  - Out of rack space? Build datacenter (US$1B)
  - Every service on its own machine: better failure isolation and performance management
- Now: VMs, server consolidation, migration
  - Turn off/slow down unused hardware on demand
  - Eliminate hot spots through resource redistribution
  - "Seamless" migration involves nontrivial latency
  - Performance: nonlinear resource-demand interactions
- Interesting domain for techniques like RL?
35. Noteworthy for SML
- Typical problem: interactive response time < X while minimizing total power
  - Or minimizing variance in power
- False positives OK?
  - Don't sacrifice performance for power, but vice versa sometimes OK
  - Online learning (e.g., RL) practicable?
- Analytical models expensive, impractical
  - Thermodynamic models → hours to compute
  - Predict power used by <service, workload> → ???
  - Use agile (& imperfect) SML models instead?
- Potential for serious real-world impact
  - 1.3 lbs (0.6 kg) CO2 emission per kWh (US mean)
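The "typical problem" can be made concrete with a toy model. Everything here is an assumption for illustration: each server is treated as an independent M/M/1 queue with load split evenly, power is proportional to the number of active servers, and the request rates and 250 W per server are made-up values. Only the 1.3 lbs CO2 per kWh figure comes from the slide. A learned model would replace the analytical response-time formula.

```python
LBS_CO2_PER_KWH = 1.3  # US mean, from the slide

def min_servers(arrival_rate, service_rate, max_response_s, max_servers=1000):
    """Fewest active servers that keep mean response time below the target.

    Assumes load is split evenly and each server behaves as an M/M/1
    queue, so mean response time is 1 / (service_rate - per_server_load).
    Returns None if no configuration up to max_servers meets the target.
    """
    for n in range(1, max_servers + 1):
        per_server = arrival_rate / n
        if per_server < service_rate:
            if 1.0 / (service_rate - per_server) < max_response_s:
                return n
    return None

def daily_co2_lbs(servers, watts_per_server):
    """Pounds of CO2 per day for a given number of active servers."""
    kwh_per_day = servers * watts_per_server / 1000.0 * 24.0
    return kwh_per_day * LBS_CO2_PER_KWH

# Hypothetical workload: 900 req/s, 100 req/s per server, 50 ms target.
needed = min_servers(arrival_rate=900.0, service_rate=100.0, max_response_s=0.05)
co2 = daily_co2_lbs(needed, watts_per_server=250.0)
```

Minimizing the server count under the response-time constraint is the "performance first, then power" ordering the slide asks for; erring toward one server too many is the cheap kind of mistake.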
36. Conclusions
- Lack of ground truth complicates evaluation, no brute-force way to get it
  - Can other techniques & operator knowledge compensate?
- False positives may be OK
  - Ask systems collaborators what tradeoffs may help
  - Corollary: approximate model now > accurate model later
  - Models are being built & discarded all the time
  - A truly autonomic argument: spend more horsepower on monitoring & analysis than on service itself
- Humans will be in the loop for some time, so help them
  - Use SML to raise level of abstraction of data presented
  - Interpretable models & visualization → raise level of abstraction for operators & increase confidence in automation
  - Visualization also makes false positives cheaper, by reducing operator's processing time to recognize a false positive!
- Power is the next big challenge