RADical New Challenges for Machine Learning: A Systems View - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
RADical New Challenges for Machine Learning (A Systems View)
  • Armando Fox, Autonomic Computing Workshop, ECML 2006
  • UC Berkeley RAD Lab

2
Everything is Online
  • Shrinkwrap SW moving to online service model
  • Already there: data-intensive (e.g. search), communication (email, IM, etc.), e-commerce, B2B
  • Hosted enterprise components: Oracle → Oracle Online; desktop CRM → Salesforce, etc.
  • Desktop: Word → Writely
  • Platform: Windows Vista → Windows Live
  • Individual as well as enterprise use
  • Metrics for online services are different
  • TCO dominated by operations, esp. human costs
  • Availability subject to QoS/SLA constraints
  • (Later: power consumption)
  • Problem: maintain availability, QoS/SLA, etc. during rapid, agile growth

3
Coping with Rapid Growth
(Chart: YouTube.com daily traffic ranking over the past year, log scale; 12-month growth from rank ~100,000 to 15. Source: Alexa.com)
4
SML Opportunity
  • Need near-100% availability while growing
  • Agility: dynamically improve/modify service offerings
  • Dynamically add/change hardware & software, scale the network, add users...
  • Result: constant new (mis)behaviors
  • Lots of data available, but few analytical models or good tools → SML opportunity
  • This talk: what about this application domain is interesting/noteworthy for SML?
  • What new SML research does it suggest?

5
Dynamic Environment
  • e.g. Amazon.com monitoring subsystem
  • 140 code pushes/month
  • 700 doc changes/month
  • "You build it, you run it": everything is a service
  • Tons of data, little preprocessing
  • Overwhelms operators' ability to localize problems
  • Result: 6-20 resolvers on 15-minute pager call

6
Outline: 3+1 Challenges for SML + Internet Systems
  • Ground truth is elusive; no brute-force path to get it
  • SML: alternative evaluations, strength of result
  • False positives may be OK
  • SML: lightweight online techniques
  • Help the operator
  • SML: formalize operator knowledge as model input
  • Power management is a new challenge area
  • SML: models for power-centric dynamic configuration

7
Outline: 3+1 Challenges for SML + Internet Systems
  • Ground truth is elusive; no brute-force path to get it
  • SML: alternative evaluations, strength of result
  • False positives may be OK
  • SML: lightweight online techniques
  • Help the operator
  • SML: formalize operator knowledge as model input
  • Power management is a new challenge area
  • SML: models for power-centric dynamic configuration

8
Why Is Ground Truth Elusive?
  • Incomplete forensic data
  • Imperfect choices about what to collect and how long to keep it
  • Undetected partial failures
  • May affect only a subset of users, or only for a short time
  • Simulation is not good enough
  • Like thermodynamics: effects in the large can be simulated with some accuracy, but not finer-grained ones
  • Dynamic external conditions in the real world always elicit some previously-unseen behavior/pathology
  • Incorrect expert diagnosis → wrong labels

9
Can't we just measure ground truth?
  • Global or end-to-end indicators?
  • Click-through rate of searches?
  • Purchase rate at a large-volume e-commerce site?
  • These are symptoms, but don't localize & resolve the problem
  • Low-level system metrics?
  • 10s of metrics per machine, 100s of machines/cluster
  • But which metrics are relevant to a particular problem?
  • Have human operators label problem causes?
  • This is part of the motivation for the work!
  • Not like labeling images, text corpora, etc.

10
Ground Truth Example
  • Find a concise signature (like a feature vector) that captures essential system state over a short interval
  • Goal: similar states ⟺ similar system behavior
  • Representation should be indexable, so recurring conditions can be detected readily, annotated, etc.
  • Idea: use a TAN (tree-augmented naive Bayes) classifier to decide whether the system met its Service-Level Objective during 5-minute intervals
  • Good classification on real data → signatures are meaningful
  • As it turns out, don't use raw metric values as signatures, but attribution weights (based on Brier score) of how much each metric contributed to the classification decision
  • Evaluation: can similarity-based retrieval (precision, recall) identify a recurring problem once models have been trained?
  • Synthetic expts: generate workload, induce known failures
  • Problem: real data is incompletely/incorrectly labeled

from Zhang, Goldszmidt, Cohen, Fox et al., DSN '04 and SOSP '05
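The signature idea from slide 10 can be sketched in miniature. The published work used TAN models and Brier-score attribution; the toy below is a stand-in under assumptions, substituting a Gaussian naive Bayes model and using each metric's log-likelihood ratio as its attribution weight. The weight vector, not the raw metrics, becomes the indexable signature:

```python
import math

def fit_gaussians(rows):
    """Per-metric (mean, std) over the intervals of one class (SLO met / violated)."""
    n, d = len(rows), len(rows[0])
    stats = []
    for j in range(d):
        col = [r[j] for r in rows]
        mu = sum(col) / n
        var = sum((x - mu) ** 2 for x in col) / n or 1e-9  # guard constant metrics
        stats.append((mu, math.sqrt(var)))
    return stats

def loglik(x, mu, sd):
    """Gaussian log-density of one metric reading."""
    return -math.log(sd * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sd ** 2)

def signature(sample, ok_stats, bad_stats):
    """Attribution weight per metric: how strongly it votes 'SLO violation'."""
    return [loglik(x, *bad_stats[j]) - loglik(x, *ok_stats[j])
            for j, x in enumerate(sample)]
```

Two intervals with similar signatures implicate the same metrics, so recurring conditions can be retrieved by nearest-neighbor search over signatures, which is the retrieval evaluation the slide describes.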
11
What can be done?
  • Original dataset was partially labeled
  • Can always tell: did the system meet its SLO?
  • Can sometimes tell: if in violation, what was the underlying problem/pathology?
  • Later learned that operators had misdiagnosed (mislabeled) one of the problems
  • 80 pages of messages exchanged among operators to troubleshoot
  • Idea 1: co-clustering in time as a proxy for labels
  • Time was not used in the original clustering/classification, but real systems tend to phase-transition
  • So, evaluate both stability of the clustering and whether clusters are localized in time
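Checking whether clusters are localized in time can be done with a simple spread statistic. A minimal sketch (the metric and its name are my own, not from the talk): for each cluster, measure the time span its members cover as a fraction of the whole trace, then average.

```python
def temporal_locality(timestamps, labels):
    """Mean cluster time-spread as a fraction of the total span.
    Near 0: clusters occupy tight time windows (phase-transition-like);
    near 1: cluster members are scattered across the whole trace."""
    span = max(timestamps) - min(timestamps) or 1
    spreads = []
    for c in set(labels):
        ts = [t for t, l in zip(timestamps, labels) if l == c]
        spreads.append((max(ts) - min(ts)) / span)
    return sum(spreads) / len(spreads)
```

A clustering whose members are contiguous in time scores lower than an interleaved one, which is the property being used as a label proxy.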

12
Co-clustering with time
13
Results
  • Tens of seconds to induce models; hundreds of ms to evaluate incoming data against them
  • System behavior and workload change on a scale of 10s of seconds
  • Good retrieval accuracy on real labeled data from one system
  • Poor accuracy when using dataset A to identify the same problems on similar system B... why?
  • Juxtaposition of "good" and "bad" system behaviors differs between the two systems
  • Currently investigating mixture/sampling of "good" behaviors to build a baseline
  • How to evaluate the new approach without labeled data?

14
Alternative evaluation
  • Consistency: try different baselines, see if they lead to similar clusterings
  • Weaker result: "When the model is built according to this algorithm, it achieves precision/recall no worse/better than when the baseline model is computed from a single interval"
  • Strong result: "...achieves the following precision/recall in identifying labeled data"
  • Can measure this on synthetic data only
  • Can demonstrate it once more labeled real data is obtained
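The consistency check above needs a way to compare two clusterings. The talk doesn't name one; a standard choice is the Rand index, the fraction of point pairs on which two clusterings agree (same cluster in both, or different in both):

```python
from itertools import combinations

def rand_index(a, b):
    """Agreement between two clusterings of the same points, in [0, 1].
    Invariant to cluster relabeling, so baselines that produce the same
    partition under different labels still score 1.0."""
    pairs = list(combinations(range(len(a)), 2))
    same = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return same / len(pairs)
```

A high index across models built from different baselines is the "consistency" evidence the slide proposes when labels are unavailable.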

15
Outline: 3+1 Challenges for SML + Internet Systems
  • Ground truth is elusive; no brute-force path to get it
  • SML: alternative evaluations, strength of result
  • False positives may be OK
  • SML: lightweight online techniques
  • Help the operator
  • SML: formalize operator knowledge as model input
  • Power management is a new challenge area
  • SML: models for power-centric dynamic configuration

16
False Positives May Be OK
  • Localize ≠ diagnose, and resolve ≠ fix: limit diagnosis to what can be resolved
  • Systems contribution: fine-grained, generic recovery strategies
  • Insight: if other strategies are far more expensive by comparison, the cost of trying simple strategies is relatively very low → false positives can be tolerated

17
False positives example: path-based analysis
  • Instrument a J2EE app server to capture the path of each request through application modules (EJBs)
  • path = parse tree generated by a PCFG (probabilistic context-free grammar)
  • At runtime, compute an anomaly score based on parse trees
  • Done entirely in middleware; no change to app source
  • Detects 16% more failures than existing generic techniques (107/122, or 88%, vs. 88/122, or 72%)
  • False positives around 20%! 2 orders of magnitude too high??
  • Algorithmic: simple scalar path scoring, but nonlinear classification boundary
  • Semantic: rare-but-legitimate system behavior (mis)classified as anomalous
  • Ballpark for systems: eBay decision-tree diagnosis, ~24% FP

E. Kiciman and A. Fox, IEEE Trans. Neural
Networks, Spring 2005
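The runtime scoring on slide 17 can be approximated in a few lines. This is a hedged simplification, not the published method: instead of a full PCFG over parse trees, it scores a path by the smoothed probability of its component-to-component transitions learned from observed traffic (the component names are invented):

```python
import math
from collections import Counter, defaultdict

def train(paths):
    """Count component transitions over observed request paths."""
    counts = defaultdict(Counter)
    for p in paths:
        for a, b in zip(['<start>'] + p, p + ['<end>']):
            counts[a][b] += 1
    return counts

def anomaly_score(path, counts, alpha=0.5):
    """Mean negative log-probability of the path's transitions.
    Additive smoothing (alpha) keeps unseen transitions finite but costly."""
    score = 0.0
    steps = list(zip(['<start>'] + path, path + ['<end>']))
    for a, b in steps:
        total = sum(counts[a].values())
        vocab = len(counts[a]) + 1
        p = (counts[a][b] + alpha) / (total + alpha * vocab)
        score += -math.log(p)
    return score / len(steps)
```

Rare-but-legitimate paths get high scores too, which is exactly the "semantic" source of false positives the slide identifies.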
18
Solution: reduce the cost of acting on false positives
  • Microreboot: localized recovery of individual app components (Candea, Fox et al., OSDI '04)
  • Supported by middleware changes → app-transparent
  • 10s-100s of ms to recover from common problems (local memory leaks, corrupted data structures...)
  • Harmless if superfluous
  • Similar idea: Rx (Zhou et al., SOSP 2005)
  • Next-cheapest technique takes an order of magnitude longer → always safe to try
  • Experiments: full reboot vs. microreboot, FP rate 60%
  • Improvement in availability due to path-based diagnosis: 1 "nine" (compared to traditional detection)
  • Full reboot needs 0% FP for the same availability
  • Another interpretation: given fixed FP rate & availability, microreboot allows >50 sec additional time to detect/localize compared to a full reboot
  • Systems contribution: cheap recovery allows higher time-to-detect and a higher FP rate
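The "always safe to try" claim is just expected-cost arithmetic, sketched below with assumed numbers: 0.1 s microreboot, 30 s full reboot, and a made-up 20% escalation rate (the fraction of real faults a microreboot fails to cure) are illustrative, not figures from the papers:

```python
def downtime_full_reboot(alarms, full=30.0):
    """Policy A: every alarm triggers a full reboot (seconds of downtime)."""
    return alarms * full

def downtime_micro_first(alarms, fp_rate, micro=0.1, full=30.0, escalate=0.2):
    """Policy B: always microreboot first; pay the full reboot only for
    the real faults the microreboot cannot cure. False positives cost
    just the (tiny) microreboot."""
    return alarms * (micro + (1 - fp_rate) * escalate * full)
```

With 100 alarms at a 60% false-positive rate, policy A costs 3000 s of recovery time while policy B costs 250 s, which is why a detector with a "too high" FP rate can still improve availability once recovery is cheap.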

19
Outline: 3+1 Challenges for SML + Internet Systems
  • Ground truth is elusive; no brute-force path to get it
  • SML: alternative evaluations, strength of result
  • False positives may be OK
  • SML: lightweight online techniques
  • Help the operator
  • SML: formalize operator knowledge as model input
  • Power management is a new challenge area
  • SML: models for power-centric dynamic configuration

20
Helping the Operator: Waterfall vs. Process
  • Waterfall: static handoff model, N groups, static system
  • DADO: developers are the operators, dynamic system
  • e.g. Amazon: highly developed "resolver" culture

21
Autonomic vs. RADical?
  • IBM Autonomic Computing Manifesto:
  • Self-configuring/upgrading
  • Self-optimizing (performance)
  • Self-healing (from failures)
  • Self-protecting (from attacks)
  • 2 approaches:
  • 1) Automate the entire process (get rid of sysadmins from the waterfall model)
  • 2) Increase productivity of developer/operators
  • Help expert operators make sense of mounds of data
  • Improve efficiency of novice operators
  • Improve all operators' confidence in automation

22
Helping the operator
(Chart: top 40 pages, 98% of traffic, over time in 5-minute intervals)
Bodik et al., ICAC 2005
23
2nd Example: Account Page
(Chart: anomaly score vs. anomaly threshold over time)
24
Ebates Results
  • Landing looping: warned 2 days before; localized
  • Account page: warned 16 hours before; 1 hour to localize
  • Broken signup page: warned 7 days before; localized
  • Detected a failure they didn't tell us about (!)
  • Detected 3 other significant anomalies
  • CTO: "These might have been important, but we didn't know about them. Definitely useful if detected in real time."
  • Visualization helps detection, but anomaly scores help localization
  • Ask the model what pages contributed most to the decision of "anomalous"
  • warning 3: detection time Sun Nov 16 19:27:00 PST 2003; start Sun Nov 16 19:24:00 PST 2003; end Sun Nov 16 21:05:00 PST 2003; anomaly score 7.05. Most anomalous pages: /landing.jsp 19.55, /landing_merchant.jsp 19.50, /mall_ctrl.jsp 3.69, /malltop.go 2.63
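"Asking the model which pages contributed most" can be sketched with a per-page deviation score. This is a stand-in for the actual model in the talk; the page names and baseline fractions below are invented for illustration:

```python
def page_contributions(expected, observed):
    """Per-page contribution to an anomaly decision: squared deviation of the
    observed hit fraction from its baseline fraction, scaled by the baseline
    (a chi-square-style term). Returns pages ranked most-anomalous first."""
    total = sum(observed.values()) or 1
    scores = {}
    for page, base in expected.items():
        frac = observed.get(page, 0) / total
        scores[page] = (frac - base) ** 2 / base
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Ranking pages this way yields exactly the kind of "most anomalous pages" list shown in the warning excerpt above, which is what turns a detection into a localization hint.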

25
Operator Knowledge & Model Drift
  • Which learning approach is correct?
  • Try multiple; let the operator disambiguate
  • Expert operators: how to capture & incorporate their knowledge?
  • Operators interrogate interpretable/auditable models, e.g. decision trees, Bayesian networks, etc.
  • Mine operator actions to understand (and eventually automate) how operators troubleshoot
  • Novice operators: improve the learning curve
  • Constant turnover of operators in the DADO model
  • Less-experienced operators can do harder tasks
  • e.g. a way of reducing false positives in the Ebates example
  • In the limit, novice operators may be automated

26
Path-based analysis
  • Idea: rather than timeseries, look at paths
  • Each path gets a unique ID
  • Features collected along the path
  • Do data mining on paths:
  • Which paths are similar (w.r.t. selected features)? (clustering, etc.) → workload characterization
  • Which paths share resources? (scheduling)
  • Do paths co-cluster with respect to both some external observable and some subset of features? (failure localization)

27
Other Path-Based Results
  • Reynolds et al. 2006, PIP: "Detecting the Unexpected in Distributed Systems"
  • Found 18 bugs in 4 distributed systems by logging paths and automatically checking expected vs. unexpected behavior
  • Chen et al. 2003, "Path-based Failure and Evolution Management"
  • Used decision trees (from SML) on collected paths to detect and diagnose failures and estimate their impact from eBay and TellMe traces
  • Identified 93% of root causes with 23% false positives
  • Aguilera, Mogul, et al. 2003, "Performance debugging of black-box distributed systems"
  • Reconstruction of transactions' causal paths for performance debugging

28
Path-Based Nirvana
  • Imagine a world where path information is always passed along, so that user requests can always be tracked throughout the system
  • Analogous to data lineage in database systems
  • Across apps, OS, different computers on the LAN, ...
  • Unique request ID
  • Components touched
  • Time of day
  • Parent of this request
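A minimal record carrying the four fields listed above might look like the sketch below (field and function names are my own, not from the talk). The parent pointer is what lets causal request trees be rebuilt from a flat event log:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class PathEvent:
    request_id: str                   # unique request ID
    component: str                    # component touched
    timestamp: float                  # time of day
    parent_id: Optional[str] = None   # parent of this request (None at the root)

def causal_children(events: List[PathEvent]) -> Dict[Optional[str], List[str]]:
    """Index child requests under their causal parent, so a root request's
    whole fan-out across the system can be walked top-down."""
    children: Dict[Optional[str], List[str]] = {}
    for e in events:
        children.setdefault(e.parent_id, []).append(e.request_id)
    return children
```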

29
Trace: The 1% Solution
  • Trace goal: make path-based analysis low-overhead so it can be ubiquitous
  • Baseline path-info collection with ~1% overhead
  • Selectively enable more local detail for specific requests
  • Trace: an end-to-end path-recording framework
  • Capture & timestamp a unique requestID across all system components
  • Top-level log contains path traces
  • Local logs contain additional detail, correlated to the path ID
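The top-level-log / local-log split can be sketched in-process with a context-propagated request ID. This is an assumption-laden miniature of the framework the slide describes, not the Trace implementation; in the real system the logs live on different machines:

```python
import contextvars
import time
import uuid

# Current request's ID, propagated implicitly across the components it touches.
REQUEST_ID = contextvars.ContextVar("request_id", default=None)
TOP_LEVEL_LOG = []   # logically central path log: (timestamp, requestID, component)
LOCAL_LOGS = {}      # per-component detail, correlated by requestID

def start_request():
    """Mint a unique requestID at the edge of the system."""
    rid = uuid.uuid4().hex
    REQUEST_ID.set(rid)
    return rid

def trace(component, detail=""):
    """Record a path step in the top-level log; stash extra detail locally."""
    rid = REQUEST_ID.get()
    TOP_LEVEL_LOG.append((time.time(), rid, component))
    LOCAL_LOGS.setdefault(component, []).append((rid, detail))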

30
Trace Example
  • 3-tiered system on separate computers: Apache → App Server → Database, each with its own local log (diagram)
  • Every time a request crosses a machine or component, trace it in the top level of the log (logically central but physically distributed); modify existing local logs with the ID
  • Use a wrapper or plug-in
31
Trace: Initial Plans
(Diagram: Apache → App Server → Database, each with a local log)
  • Annotation: put the ID into the IP header to pass between computers on the LAN
32
Outline: 3+1 Challenges for SML + Internet Systems
  • Ground truth is elusive; no brute-force path to get it
  • SML: alternative evaluations, strength of result
  • False positives may be OK
  • SML: lightweight online techniques
  • Help the operator
  • SML: formalize operator knowledge as model input
  • Power management is a new challenge area
  • SML: models for power-centric dynamic configuration

33
Challenge on Horizon: Datacenter Power Management
  • Early '90s: make CPUs faster at any cost
  • Late '90s: increase density at any cost
  • Today: improve power efficiency at any cost
  • Cooling: datacenters not full, but at capacity due to heat removal
  • Google, MS building datacenters next to the Columbia River
  • Datacenters have longer depreciation times than PCs
  • Nonlinear cost model for power
  • 1W of power consumption requires 1W of A/C
  • Heavy penalties for spiking (peak is penalized, not average)
  • Heavy incentives for agile conservation
  • National & international implications
  • 5% of US electricity consumption is IT
  • 20% improvement → 1% national reduction in US consumption

34
Effect on datacenters
  • Before: overprovision
  • How many machines for Xmas? → Lots
  • Need 1,000 machines to deploy a new service? Buy 'em
  • Out of rack space? Build a datacenter (US$1B)
  • Every service on its own machine: better failure isolation and performance management
  • Now: VMs, server consolidation, migration
  • Turn off / slow down unused hardware on demand
  • Eliminate hot spots through resource redistribution
  • Seamless migration involves nontrivial latency
  • Performance: nonlinear resource-demand interactions
  • Interesting domain for techniques like RL?
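The server-consolidation step above is, at its simplest, bin packing. A hedged sketch using first-fit-decreasing (a real placement policy would also account for the migration latency and nonlinear resource interactions the slide warns about):

```python
def consolidate(demands, capacity=1.0):
    """First-fit-decreasing: pack service demands (CPU fractions) onto as
    few machines as possible, so the remaining machines can be powered down
    or slowed. Returns the list of machines, each a list of placed demands."""
    machines = []
    for d in sorted(demands, reverse=True):
        for m in machines:
            if sum(m) + d <= capacity:
                m.append(d)
                break
        else:
            machines.append([d])  # no machine fits: power one on
    return machines
```

Six services that would each have had a dedicated machine under the "before" model pack onto two, and the other four machines become candidates for shutdown.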

35
Noteworthy for SML
  • Typical problem: interactive response time < X while minimizing total power
  • Or minimizing variance in power
  • False positives OK?
  • Don't sacrifice performance for power, but vice versa is sometimes OK
  • Is online learning (e.g. RL) practicable?
  • Analytical models expensive, impractical
  • Thermodynamic models → hours to compute
  • Predict power used by <service, workload> → ???
  • Use agile (& imperfect) SML models instead?
  • Potential for serious real-world impact
  • 1.3 lbs (≈0.6 kg) CO2 emission per kWh (US mean)
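An "agile but imperfect" learned power model could be as cheap as a least-squares line fit of measured power against utilization, paired with the emissions factor on this slide. The sample readings below are invented; only the 1.3 lbs/kWh figure comes from the talk:

```python
def fit_power_model(util, watts):
    """Least-squares fit of power ≈ idle + slope * utilization.
    Returns (idle watts, watts per unit of load); seconds to fit,
    versus hours for a thermodynamic model."""
    n = len(util)
    mu, mw = sum(util) / n, sum(watts) / n
    slope = (sum((u - mu) * (w - mw) for u, w in zip(util, watts))
             / sum((u - mu) ** 2 for u in util))
    return mw - slope * mu, slope

def co2_lbs(watts, hours, lbs_per_kwh=1.3):
    """CO2 emitted (lbs) at the US-mean factor from the slide."""
    return watts / 1000 * hours * lbs_per_kwh
```

Such a model is wrong in the ways the talk anticipates (no thermal coupling between machines), but it is cheap enough to refit continuously as the workload drifts, which is the trade the slide is arguing for.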

36
Conclusions
  • Lack of ground truth complicates evaluation; there is no brute-force way to get it
  • Can other techniques & operator knowledge compensate?
  • False positives may be OK
  • Ask systems collaborators what tradeoffs may help
  • Corollary: an approximate model now > an accurate model later
  • Models are being built & discarded all the time
  • A truly autonomic argument: spend more horsepower on monitoring & analysis than on the service itself
  • Humans will be in the loop for some time, so help them
  • Use SML to raise the level of abstraction of the data presented
  • Interpretable models & visualization → raise the level of abstraction for operators & increase confidence in automation
  • Visualization also makes false positives cheaper, by reducing operators' processing time to recognize a false positive!
  • Power is the next big challenge