RADical New Challenges for Machine Learning (A Systems View)

1
RADical New Challenges for Machine Learning (A Systems View)

Armando Fox
Autonomic Computing Workshop, ECML 2006
UC Berkeley RAD Lab

2
Everything is Online

Shrinkwrap SW moving to online service model
Already there: data-intensive (e.g. search), communication (email, IM, etc.), e-commerce, B2B
Hosted enterprise components: Oracle -> Oracle Online, desktop CRM -> Salesforce, etc.
Desktop: Word -> Writely
Platform: Windows Vista -> Windows Live
Individual as well as enterprise use
Metrics for online services are different
TCO dominated by operations, esp. human costs
Availability subject to QoS/SLA constraint
(Later: power consumption)
Problem: maintain availability, QoS/SLA, etc. during rapid, agile growth

3
Coping with Rapid Growth

[Figure: YouTube.com daily traffic ranking over the past year, log scale. 12-month growth: rank 100,000 -> 15. Source: Alexa.com]

4
SML Opportunity

Need near-100% availability while growing
Agility: dynamically improve/modify service offerings
Dynamic: add/change hardware & software, scale network, add users...
Result: constant new (mis)behaviors
Lots of data available but few analytical models or good tools -> opportunity for SML (statistical machine learning)
This talk: what about this application domain is interesting/noteworthy for SML?
What new SML research does it suggest?

5
Dynamic Environment

e.g. Amazon.com monitoring subsystem
140 code pushes/month
700 doc changes/month
"You build it, you run it": everything is a service
Tons of data, little preprocessing
Overwhelms ability of operators to localize problems
Result: 6-20 resolvers on 15-minute pager call

6
Outline: 3+1 Challenges for SML & Internet Systems

Ground truth is elusive, no brute-force path to get it
SML: alternative evaluations, strength of result
False positives may be OK
SML: lightweight online techniques
Help the operator
SML: formalize operator knowledge as model input
Power management is new challenge area
SML: models for power-centric dynamic configuration

8
Why Is Ground Truth Elusive?

Incomplete forensic data
Imperfect choices about what to collect, how long to keep
Undetected partial failures
May affect only a subset of users or last only a short time
Simulation not good enough
Like thermodynamics: effects in the large can be simulated with some accuracy, but not finer-grained ones
Dynamic external conditions in the real world always elicit some previously unseen behavior/pathology
Incorrect expert diagnosis -> wrong labels

9
Can't we just measure ground truth?

Global or end-to-end indicators?
Click-through rate of searches?
Purchase rate at large-volume e-commerce site?
Symptoms only: not the same as localizing & resolving the problem
Low-level system metrics?
10s of metrics per machine, 100s of machines/cluster
But which metrics are relevant to a particular problem?
Human operators label problem causes?
This is part of the motivation of the work!
Not like labeling images, text corpora, etc.

10
Ground Truth Example

Find concise "signature" (like a feature vector) that captures essential system state over a short interval
Goal: similar states <=> similar system behavior
Representation should be indexable, so recurring conditions can be detected readily, annotated, etc.
Idea: use TAN (tree-augmented naive Bayes) to classify whether system met its Service-Level Objective during 5-minute intervals
Good classification on real data -> signatures meaningful
As it turns out, don't use raw metric values as signatures, but attribution weights (based on Brier score) of how much each metric contributed to the classification decision
Evaluation: can similarity-based retrieval (precision, recall) identify a recurring problem once models have been trained?
Synthetic expts: generate workload, induce known failures
Problem: real data incompletely/incorrectly labeled

From Zhang, Goldszmidt, Cohen, Fox et al., DSN '04, SOSP '05
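
The retrieval evaluation described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each signature is a fixed-length vector of per-metric attribution weights, and it uses cosine similarity as the distance (the slides do not name one). The function names, the example index, and the labels are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length signature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, index, k):
    """Return the labels of the k stored signatures most similar to query."""
    ranked = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)
    return [label for _, label in ranked[:k]]

def precision_recall(retrieved, true_label, total_relevant):
    """Precision and recall of a retrieved label list for one known problem."""
    hits = sum(1 for label in retrieved if label == true_label)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall

# Hypothetical annotated signatures: (attribution weights, operator label).
INDEX = [
    ([0.9, 0.1, 0.0], "db_overload"),
    ([0.8, 0.2, 0.1], "db_overload"),
    ([0.0, 0.1, 0.9], "app_error"),
    ([0.1, 0.0, 0.8], "app_error"),
]

if __name__ == "__main__":
    found = retrieve([1.0, 0.1, 0.0], INDEX, k=2)
    print(found, precision_recall(found, "db_overload", total_relevant=2))
```

Note that precision and recall here are only as trustworthy as the labels in the index, which is exactly the problem the slide raises.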
11
What can be done?

Original dataset was partially labeled
Can always tell: did system meet SLO?
Can sometimes tell: if in violation, what was the underlying problem/pathology?
Later learned that operators had misdiagnosed (mislabeled) one of the problems
80 pages of messages exchanged among operators to troubleshoot
Idea 1: co-clustering in time as proxy for labels
Time not used in original clustering/classification, but real systems tend to phase-transition
So, evaluate both stability of clustering and whether clusters are localized in time
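
One simple way to quantify "localized in time" is sketched below. It is an assumption of this sketch, not something stated in the slides, that the measure is the fraction of each cluster's intervals that fall into that cluster's longest contiguous run; the function name is hypothetical. A value near 1.0 means the cluster occupies one contiguous stretch of time, which is what a phase transition would look like.

```python
from itertools import groupby

def time_localization(labels):
    """Map each cluster to the share of its intervals in its longest run.

    labels is the cluster assignment of consecutive intervals, in time order.
    """
    totals = {}
    longest = {}
    for label, run in groupby(labels):
        length = len(list(run))
        totals[label] = totals.get(label, 0) + length
        longest[label] = max(longest.get(label, 0), length)
    return {label: longest[label] / totals[label] for label in totals}

if __name__ == "__main__":
    # Cluster "a" recurs after cluster "b", so it is less localized.
    print(time_localization(["a", "a", "a", "b", "b", "a"]))
```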

12
Co-clustering with time
13
Results

Tens of seconds to induce models, hundreds of ms to evaluate incoming data against models
System behavior, workload changes on scale of 10s of sec
Good retrieval accuracy on real labeled data from 1 system
Poor accuracy when using dataset A to identify same problems on similar system B... why?
Juxtaposition of good and bad system behaviors differs between the two systems
Currently investigating mixture/sampling of good behaviors to build baseline
How to evaluate new approach w/o labeled data?

14
Alternative evaluation

Consistency: try different baselines, see if they lead to similar clustering
Weaker result: "When model is built according to this algo., achieves precision/recall no worse/better than when baseline model computed from single interval"
Strong result: "...achieves the following precision/recall in identifying labeled data"
Can measure this on synthetic data only
Can demonstrate when we obtain more labeled real data
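
The consistency check (do different baselines lead to similar clusterings?) needs some agreement measure between two clusterings of the same intervals. The slides do not say which one was used; the sketch below assumes the plain Rand index, which is label-free and therefore usable without ground truth. The function name is hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of interval pairs on which two clusterings agree.

    A pair agrees if both clusterings put it in the same cluster,
    or both put it in different clusters. Cluster names need not match.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("clusterings must cover the same intervals")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        1
        for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agree / len(pairs)

if __name__ == "__main__":
    # Same partition under renamed clusters: perfect agreement.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A value of 1.0 means the two baselines induce identical partitions; the plain Rand index is not corrected for chance, so values for unrelated clusterings are typically well above 0.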

16
False Positives May Be OK

Localize != diagnose, and Resolve != fix: limit diagnosis to what can be resolved
Systems contribution: fine-grained generic recovery strategies
Insight: if other strategies are >> expensive by comparison, cost of trying simple strategies is relatively very low -> false positives can be tolerated

17
False positives example: path-based analysis

Instrument J2EE app server to capture path of each request thru application modules (EJBs)
Path treated as a parse tree generated by a PCFG (probabilistic context-free grammar)
At runtime, compute anomaly score based on parse trees
Done entirely in middleware, no change to app source
Detects 16% more failures than existing generic techniques (107/122 or 88% vs. 88/122 or 72%)
False positives around 20%! 2 orders of mag. too high??
Algorithmic: simple scalar path scoring, but nonlinear classification boundary
Semantic: rare-but-legitimate system behavior (mis)classified as anomalous
Ballpark for systems: eBay decision tree diagnosis, 24% FP

E. Kiciman and A. Fox, IEEE Trans. Neural Networks, Spring 2005
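
To make the idea of scoring a request path against learned structure concrete, here is a deliberately simplified sketch. It does not implement the PCFG from the cited paper: it assumes a path is a flat sequence of component names and learns only caller-to-callee transition probabilities, scoring a path by its average negative log-probability. The component names, the `floor` probability for unseen transitions, and all function names are hypothetical.

```python
import math
from collections import defaultdict

START, END = "<start>", "<end>"

def learn_transitions(paths):
    """Count component-to-component transitions over observed request paths."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        steps = [START] + list(path) + [END]
        for src, dst in zip(steps, steps[1:]):
            counts[src][dst] += 1
    return counts

def anomaly_score(path, counts, floor=1e-3):
    """Average negative log-probability of the path's transitions.

    Transitions never seen in training get probability `floor`.
    """
    steps = [START] + list(path) + [END]
    total = 0.0
    for src, dst in zip(steps, steps[1:]):
        seen = sum(counts[src].values()) if src in counts else 0
        prob = counts[src][dst] / seen if seen and dst in counts[src] else 0.0
        total += -math.log(max(prob, floor))
    return total / (len(steps) - 1)

if __name__ == "__main__":
    normal = [["Cart", "Inventory", "Billing"]] * 9 + [["Cart", "Billing"]]
    model = learn_transitions(normal)
    print(anomaly_score(["Cart", "Inventory", "Billing"], model))
    print(anomaly_score(["Billing", "Cart"], model))
```

The rare-but-legitimate path `["Cart", "Billing"]` also scores higher than the common one, which illustrates the "semantic" source of false positives named on the slide.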
18
Solution: reduce cost of acting on false positives

Microreboot: localized recovery of individual app components (Candea, Fox et al., OSDI '04)
Supported by middleware changes -> app-transparent
10s-100s of ms to recover from common problems (local memory leaks, corrupted data structures...)
Harmless if superfluous
Similar idea: Rx (Zhou et al., SOSP 2005)
Next-cheapest technique takes an order of magnitude longer -> always safe to try
Experiments: full reboot vs. microreboot, FP rate 60%
Improvement in availability due to path-based diagnosis: 1 nine (compared to traditional detection)
Full reboot needs 0% FP for same availability
Another interpretation: given fixed FP rate & availability, uRB allows >50 sec additional time to detect/localize over full RB
Systems contribution: cheap recovery allows higher time-to-detect and higher FP rate
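
The argument that cheap recovery makes false positives tolerable can be checked with back-of-the-envelope arithmetic. The sketch below is an assumed cost model, not the experiment from the cited work: every alarm triggers a first recovery action, and only true positives that the first action fails to fix escalate to a more expensive one. The 0.5 s and 20 s costs and the 90% fix probability are hypothetical placeholders; only the 60% false-positive rate echoes the slide.

```python
def downtime_per_alarm(fp_rate, first_action_cost, fix_prob=1.0, escalation_cost=0.0):
    """Expected recovery time (seconds) spent per alarm.

    fp_rate: fraction of alarms that are false positives.
    first_action_cost: time taken by the action tried on every alarm.
    fix_prob: chance that the first action cures a real failure.
    escalation_cost: time of the fallback action when it does not.
    """
    true_positive_rate = 1.0 - fp_rate
    escalation_rate = true_positive_rate * (1.0 - fix_prob)
    return first_action_cost + escalation_rate * escalation_cost

if __name__ == "__main__":
    # Policy A: microreboot first, fall back to a full reboot.
    micro_first = downtime_per_alarm(0.6, 0.5, fix_prob=0.9, escalation_cost=20.0)
    # Policy B: full reboot on every alarm.
    full_only = downtime_per_alarm(0.6, 20.0)
    print(micro_first, full_only)
```

Under these assumed numbers the microreboot-first policy spends roughly 1.3 s per alarm against 20 s for reboot-on-every-alarm, so most of the penalty for a false positive disappears.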

20
Helping the Operator: Waterfall vs. Process

Waterfall: static handoff model, N groups, static system
DADO: developers are the operators, dynamic system
e.g. Amazon: highly developed resolver culture

21
Autonomic vs. RADical?

IBM Autonomic Computing Manifesto
Self-configuring/upgrading
Self-optimizing (performance)
Self-healing (from failures)
Self-protecting (from attacks)
2 approaches
1) Automate entire process (get rid of sysadmins from waterfall model)
2) Increase productivity of Developer/Operators
Help expert operators make sense of mounds of data
Improve efficiency of novice operators
Improve all operators' confidence in automation

22
Helping the operator

[Figure: traffic to the top 40 pages (98% of traffic) over time (5-minute intervals). Bodik et al., ICAC 2005]

23
2nd Example: Account Page

[Figure: anomaly score plotted against anomaly threshold]

24
Ebates Results

Landing looping: warn 2 days before, localized
Account page: warn 16 hours, 1 hour localize
Broken signup page: warn 7 days, localized
Detected a failure they didn't tell us about (??)
Detected 3 other significant anomalies
CTO: "These might have been important, but we didn't know about them. Definitely useful if detected in real-time."
Visualization helps detection, but anomaly scores help localization
Ask model what pages contributed most to the decision of "anomalous":

warning 3
detection time: Sun Nov 16 19:27:00 PST 2003
start: Sun Nov 16 19:24:00 PST 2003
end: Sun Nov 16 21:05:00 PST 2003
anomaly score: 7.05
Most anomalous pages:
/landing.jsp 19.55
/landing_merchant.jsp 19.50
/mall_ctrl.jsp 3.69
/malltop.go 2.63
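
A report like the one above ranks pages by how much each contributed to the "anomalous" decision. One plausible way to produce such a ranking is sketched below; it is an assumption of this sketch that the per-page score is a chi-square-style term comparing observed hits with the hits expected from baseline traffic shares, and the scores in the report above were not necessarily computed this way. The hit counts and the function name are hypothetical.

```python
def page_contributions(baseline_hits, current_hits):
    """Rank pages by their chi-square-style deviation from baseline traffic.

    Both arguments map page path -> hit count. Returns (page, score) pairs,
    most anomalous first.
    """
    pages = set(baseline_hits) | set(current_hits)
    base_total = sum(baseline_hits.values())
    cur_total = sum(current_hits.values())
    if cur_total == 0:
        return []
    scores = {}
    for page in pages:
        # Add-one smoothing so pages unseen in the baseline still get
        # a nonzero expected count.
        share = (baseline_hits.get(page, 0) + 1) / (base_total + len(pages))
        expected = share * cur_total
        observed = current_hits.get(page, 0)
        scores[page] = (observed - expected) ** 2 / expected
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    baseline = {"/landing.jsp": 500, "/mall_ctrl.jsp": 300, "/malltop.go": 200}
    current = {"/landing.jsp": 900, "/mall_ctrl.jsp": 60, "/malltop.go": 40}
    for page, score in page_contributions(baseline, current):
        print(page, round(score, 2))
```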

25
Operator Knowledge & Model Drift

Which learning approach is correct?
Try multiple, let operator disambiguate
Expert operators: how to capture & incorporate knowledge?
Operators interrogate interpretable/auditable models, e.g. decision trees, Bayesian networks, etc.
Mine operator actions to understand & eventually automate how operators troubleshoot
Novice operators: improve learning curve
Constant turnover of operators in DADO model
Less-experienced operators can do harder tasks
e.g. a way of reducing false positives in Ebates example
In the limit, novice operators may be automated
26
Path based analysis

Idea: rather than timeseries, look at paths
Each path gets unique ID
Features collected along path
Do data mining on paths
Which paths are similar (w.r.t. selected features)? (clustering, etc.) - workload characterization
Which paths share resources? (scheduling)
Do paths co-cluster with respect to both some external observable and some subset of features? (failure localization)

27
Other Path Based Results

Reynolds et al. 2006: Pip: Detecting the Unexpected in Distributed Systems
Found 18 bugs in 4 distributed systems by logging paths and automatically checking expected vs. unexpected behavior
Chen et al. 2003: Path-based Failure and Evolution Management
Using decision trees (from SML) on collected paths to detect and diagnose failures and estimate their impact, from eBay and TellMe traces
Identified 93% of root causes with 23% false positives
Aguilera, Mogul, et al. 2003: Performance debugging of black-box distributed systems
Reconstruction of transactions' causal paths for performance debugging

28
Path-Based Nirvana

Imagine a world where path information is always passed along, so that we can always track user requests throughout the system
Analogous to data lineage in database systems
Across apps, OS, different computers on LAN,
Unique request ID
Components touched
Time of day
Parent of this request

29
Trace: The 1% Solution

Trace goal: make path-based analysis low-overhead so it can be ubiquitous
Baseline path info collection with 1% overhead
Selectively enable more local detail for specific requests
Trace: an end-to-end path recording framework
Capture timestamp & a unique requestID across all system components
Top-level log contains path traces
Local logs contain additional detail, correlated to path ID
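
The two-level logging scheme described above (a top-level path log plus per-component local logs keyed by request ID) might look roughly like the following. This is a single-process sketch under stated assumptions, not the Trace framework itself: the `Tracer` class and its method names are hypothetical, and a real deployment would have to carry the request ID between machines rather than share one in-memory object.

```python
import time
import uuid

class Tracer:
    """Toy two-level request tracer: cheap path log plus opt-in local detail."""

    def __init__(self):
        self.top_level = []     # (request_id, component, timestamp) per crossing
        self.local_logs = {}    # component -> [(request_id, message), ...]
        self.detailed = set()   # request IDs with extra local detail enabled

    def new_request(self, detailed=False):
        """Assign a unique ID; optionally enable local detail for this request."""
        request_id = uuid.uuid4().hex
        if detailed:
            self.detailed.add(request_id)
        return request_id

    def enter(self, request_id, component):
        """Record that the request crossed into a component (always on)."""
        self.top_level.append((request_id, component, time.time()))

    def detail(self, request_id, component, message):
        """Record local detail only for requests selected for it."""
        if request_id in self.detailed:
            self.local_logs.setdefault(component, []).append((request_id, message))

    def path(self, request_id):
        """Reconstruct the ordered list of components a request touched."""
        return [comp for rid, comp, _ in self.top_level if rid == request_id]

if __name__ == "__main__":
    tracer = Tracer()
    rid = tracer.new_request(detailed=True)
    for component in ("apache", "app_server", "database"):
        tracer.enter(rid, component)
        tracer.detail(rid, component, "handled")
    print(tracer.path(rid), tracer.local_logs)
```

Keeping the always-on part to one small record per component crossing is what would make a low fixed overhead plausible, while the per-request `detailed` switch corresponds to "selectively enable more local detail".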

30
Trace Example: 3-tiered system on separate computers

Every time a request crosses a machine or component: trace in top level of log (logically central but physically distributed)
Modify existing local logs with ID
Use wrapper or plug-in

[Figure: three tiers (1. Apache, 2. App Server, 3. Database), each with its own local log]
31
Trace: Initial Plans

[Figure: Apache, App Server, Database, each with its own local log]
Annotation: put ID into IP header to pass between computers on LAN

33
Challenge on Horizon: Datacenter Power Management

Early 90s: make CPUs faster at any cost
Late 90s: increase density at any cost
Today: improve power efficiency at any cost
Cooling: datacenters not full, but at capacity due to heat removal
Google, MS building datacenters next to Columbia River
Datacenters have longer depreciation times than PCs
Nonlinear cost model for power
1W of power consumption requires 1W of A/C
Heavy penalties for spiking (penalize peak, not average)
Heavy incentives for agile conservation
National & international implications
5% of US electricity consumption is IT
20% improvement -> 1% national reduction in US
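
The two numeric claims on this slide follow from simple arithmetic, spelled out below. The 5% IT share, the 20% improvement, and the 1W-of-A/C-per-1W rule are taken from the slide; the function names and the 1,000 W example load are hypothetical.

```python
def national_reduction(it_share, efficiency_gain):
    """Fraction of national electricity saved if IT gets more efficient."""
    return it_share * efficiency_gain

def facility_power(it_watts, cooling_watts_per_watt=1.0):
    """Total draw when every watt of IT load needs extra watts of cooling."""
    return it_watts * (1.0 + cooling_watts_per_watt)

if __name__ == "__main__":
    # 5% of consumption, improved by 20%, is about 1% of the national total.
    print(national_reduction(0.05, 0.20))
    # A hypothetical 1,000 W of servers costs 2,000 W at the meter.
    print(facility_power(1000.0))
```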

34
Effect on datacenters

Before: overprovision
How many machines for Xmas? -> Lots
Need 1,000 machines to deploy new service? Buy 'em
Out of rack space? Build datacenter (US$1B)
Every service on its own machine: better failure isolation and performance management
Now: VMs, server consolidation, migration
Turn off/slow down unused hardware on demand
Eliminate hot spots through resource redistribution
Seamless migration involves nontrivial latency
Performance: nonlinear resource-demand interactions
Interesting domain for techniques like RL (reinforcement learning)?
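
As a point of comparison for learned policies, server consolidation can be approximated by a static bin-packing heuristic. The sketch below uses first-fit decreasing, which is a swapped-in textbook technique and not RL; it ignores exactly the migration latency and nonlinear resource interactions that the slide identifies as the hard part. Loads are expressed as integer percentages of one host's capacity, and the function name and example loads are hypothetical.

```python
def consolidate(loads, capacity=100):
    """Pack VM loads onto as few hosts as first-fit decreasing manages.

    loads: per-VM demand as a percentage of one host.
    Returns a list of hosts, each a list of the loads placed on it.
    """
    hosts = []
    for load in sorted(loads, reverse=True):
        if load > capacity:
            raise ValueError("a single VM exceeds host capacity")
        for host in hosts:
            if sum(host) + load <= capacity:
                host.append(load)
                break
        else:
            # No existing host has room, so power on another one.
            hosts.append([load])
    return hosts

if __name__ == "__main__":
    placement = consolidate([50, 20, 40, 30, 10, 30])
    print(len(placement), placement)
```

In this example six lightly loaded services fit on two hosts, so under the one-service-per-machine practice described above, four machines could be turned off.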

35
Noteworthy for SML

Typical problem: interactive response time < X while minimizing total power
Or minimizing variance in power
False positives OK?
Don't sacrifice performance for power, but vice versa sometimes OK
Online learning (e.g. RL) practicable?
Analytical models expensive, impractical
Thermodynamic models -> hours to compute
Predict power used by <service, workload> -> ???
Use agile (& imperfect) SML models instead?
Potential for serious real-world impact
1.3 lbs (0.6 kg) CO2 emission per kWh (US mean)

36
Conclusions

Lack of ground truth complicates evaluation, no brute-force way to get it
Can other techniques & operator knowledge compensate?
False positives may be OK
Ask systems collaborators what tradeoffs may help
Corollary: approximate model now > accurate model later
Models are being built & discarded all the time
A truly autonomic argument: spend more horsepower on monitoring & analysis than on the service itself
Humans will be in the loop for some time, so help them
Use SML to raise level of abstraction of data presented
Interpretable models & visualization -> raise level of abstraction for operators & increase confidence in automation
Visualization also makes false positives cheaper, by reducing operators' processing time to recognize a false positive!
Power is the next big challenge
