On Picking the Right Statistical Model for Proactive Problem Prediction

Transcript and Presenter's Notes
1
On Picking the Right (Statistical) Model for
Proactive Problem Prediction
  • Songyun Duan, Pradeep Gunda, and Shivnath Babu

Duke University
2
Motivation
  • Systems are becoming hard to manage
  • Increasing size and complexity
  • Workloads change over time
  • 24 x 7
  • …which is a problem because…
  • Up to 80% of IT budgets spent on maintenance
    [McKinsey]

3
Time Distribution for Database Mgmt.
4
Motivation
  • Systems are becoming hard to manage
  • Increasing size and complexity
  • Workloads change over time
  • 24 x 7
  • …which is a problem because…
  • Up to 80% of IT budgets spent on maintenance
    [McKinsey]
  • System downtime can be extremely costly, e.g.,
  • Brokerage → $6.5 Mil/h, Credit card auth. → $2.6
    Mil/h
  • Computer rage

5
Computer Rage
  • People are becoming increasingly
  • Dependent on computers at work, at school, and in
    their personal lives
  • "Married" to their computers
  • So, when computers malfunction, people
  • Become frustrated and angry
  • Feel betrayed
  • Lash out against computers

6
(No Transcript)
7
(No Transcript)
8
Down Side of Computer Rage
  • Destruction of personal, business, or govt.
    property (estimated at millions of dollars annually)
  • Potential injury to self or others
  • 47-53% of time is lost (I. Ceaparu et al.,
    2004)
  • Uptime Vs. Goodtime

9
Autonomic Systems
  • CS-wide push towards Autonomic (Self-Managing)
    Systems
  • At Duke
  • QueS project: Ability to query a complex system
  • Fa project: Enabling an autonomic system to pick
    the right statistical model for problem
    prediction and diagnosis [Today's talk]
  • NIMO project: Learning performance and
    availability models automatically for complex
    apps running on clusters and grids

10
Context for the Fa Project
  • Three-tier Internet service
  • Our current focus is on the database tier
  • Service-Level Objectives
  • Ex: bound on response time
  • Violations of these objectives are costly

[Diagram: clients connect over a WAN to the web server, application servers, and database servers]
11
Predicting Violations
  • Need accurate and early predictions
  • Enables autonomic systems to take remedial
    actions proactively
  • Accuracy: Low false positives and false negatives
  • Lead time: How soon before the actual problem is
    the prediction given?
  • Also useful: confidence in the prediction

12
An Example
SLO Bound = 5 secs
Lead Time = 1 min
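To make the example concrete, here is a minimal pandas sketch of how the labels behind such a graph can be produced; the trace values and column names are invented, and only the 5-second bound and 1-minute lead time come from the slide.

```python
import pandas as pd

# Invented per-minute monitoring trace; only the 5-sec bound and the
# 1-minute lead time come from the slide.
df = pd.DataFrame(
    {"avg_response_time": [1.2, 2.0, 4.8, 6.1, 7.3, 3.0]},
    index=pd.date_range("2006-01-01 10:00", periods=6, freq="min"),
)

SLO_BOUND = 5.0   # seconds
LEAD_TIME = 1     # predict violations one minute ahead

# SLO_VIO = 1 whenever the response time exceeds the bound.
df["SLO_VIO"] = (df["avg_response_time"] > SLO_BOUND).astype(int)

# "Shifted data": pair each minute's features with the label LEAD_TIME
# minutes in the future, so a model trained on it predicts ahead of time.
df["SLO_VIO_in_1min"] = df["SLO_VIO"].shift(-LEAD_TIME)
print(df.dropna())
```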
13
Problem Statement
14
Statistical Models for Prediction
[Diagram: the space of models and parameters. Model families: CaRTs, Bayes Nets (BN) (general BN, TAN BN, naive BN, dynamic BN), time-series techniques, and pattern mining. Choices and knobs: score-based vs. CI-based learning, BN inference, discretization, feature selection, search strategy, score function, training-set size, feature transformation, and learning time]
15
Fa Project
  • How can an autonomic system pick the right model
    and the right values of parameters?

16
What Do We Output?
17
Individual Graphs
18
Composite Accuracy Vs. Lead-Time Graph
19
Recap
  • Given: a set of time series representing past
    system behavior
  • Output: a composite Accuracy Vs. Lead-time graph,
    as sketched below

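One way to realize this recap end-to-end is sketched below: for each lead time, train every candidate model on labels shifted that far ahead and keep the best held-out accuracy. The scikit-learn models (GaussianNB, DecisionTreeClassifier) are stand-ins for Fa's own model builders, and the data is synthetic (shaped like the 3660-sample, 42-feature traces used later in the talk).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def composite_curve(X, y, lead_times, models):
    """Best test accuracy over all candidate models, per lead time."""
    curve = {}
    for L in lead_times:
        X_L = X[:-L] if L else X    # features now ...
        y_L = y[L:]                 # ... labels L steps ahead
        split = int(0.9 * len(y_L))  # chronological train/test split
        best = 0.0
        for m in models:
            m.fit(X_L[:split], y_L[:split])
            best = max(best, accuracy_score(y_L[split:], m.predict(X_L[split:])))
        curve[L] = best
    return curve

# Synthetic data shaped like the experiments (3660 samples, 42 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(3660, 42))
y = (X[:, 0] + rng.normal(size=3660) > 1).astype(int)
print(composite_curve(X, y, [0, 1, 5],
                      [GaussianNB(), DecisionTreeClassifier(max_depth=5)]))
```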
20
Sanity Check
  • Fully automated, hands-off approach
  • Statisticians will say the problem is hopeless
  • There is a naive solution
  • We want to be fast
  • We are not aiming for a fully-automated solution
    that works 100% of the time

21
Our Approach
  • How can a self-managing system pick the right
    model and the right values of parameters?
  • Which model best predicts a specific perf.
    problem?
  • How do model parameters influence prediction?
  • How much data is needed to build an accurate
    model?
  • Approach: characterizing relationships

22
Fa Testbed
[Diagram: the testbed reports avg. response time, SLO violations, etc., along with OS-level data]
24
Roadmap
  • Motivation
  • Overall goal of the Fa project
  • Our current approach
  • Closer look at Fa's components
  • Models we are working with
  • Preliminary experimental results

25
Injecting Performance Problems
[Diagram: the workload generator injects problems; MySQL status variables from the MySQL server and OS-level data from the Linux OS are merged (response time, SLO_VIO, etc.) into training and test data; the model builder learns a model, which makes predictions that are scored for prediction accuracy]
  • Scripted workload generator
  • Clients submit queries/updates to MySQL server
  • Workload controlled to create performance problems

26
Data Collection
[Same diagram as the previous slide]
  • Data collected at three levels (see the sketch below)
  • OS -- Ex: CPU utilization (via sar)
  • Database -- Ex: index accesses (via mysqladmin)
  • Application -- Ex: avg. response time (via the
    MySQL query log)

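A rough sketch of one collection tick, shelling out to the tools named on the slide (sar from sysstat, and mysqladmin extended-status); the parsing assumes default output formats, and MySQL credentials are assumed to be configured elsewhere.

```python
import subprocess

def collect_os_sample():
    """One CPU-utilization sample from sar; output format varies across
    sysstat versions, so callers should parse defensively."""
    return subprocess.run(["sar", "-u", "1", "1"],
                          capture_output=True, text=True, check=True).stdout

def collect_db_sample():
    """Snapshot of MySQL status counters (index accesses, etc.) via
    mysqladmin; assumes credentials are configured, e.g., in ~/.my.cnf."""
    out = subprocess.run(["mysqladmin", "extended-status"],
                         capture_output=True, text=True, check=True).stdout
    metrics = {}
    for line in out.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 4 and parts[1] and parts[1] != "Variable_name":
            metrics[parts[1]] = parts[2]
    return metrics
```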
27
Data Integration
[Diagram: OS-level, application-level, and DB-level data merged into a single table]
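A minimal sketch of the merge with pandas, assuming the three feeds are aligned on their one-minute collection timestamps; all values are invented.

```python
import pandas as pd

ts = pd.date_range("2006-01-01 10:00", periods=3, freq="min")
os_df  = pd.DataFrame({"CPU_UTIL": [0.35, 0.80, 0.95]}, index=ts)        # from sar
db_df  = pd.DataFrame({"IDX_ACC": [120, 450, 610]}, index=ts)            # from mysqladmin
app_df = pd.DataFrame({"avg_response_time": [1.2, 4.9, 6.3]}, index=ts)  # from the query log

# Align the three levels on their shared timestamps into one table.
merged = os_df.join([db_df, app_df], how="inner")
merged["SLO_VIO"] = (merged["avg_response_time"] > 4.5).astype(int)
print(merged)
```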
28
Studying Different Statistical Models
[Diagram: training data, shifted by the lead time, feeds the model builder (here a Bayesian network learner: structure learning + parameter estimation); the learned model makes predictions on test data via BN inference, yielding prediction accuracy]
  • Models tried so far
  • Bayesian networks
  • Regression trees
  • Multivariate auto-regression

29
Primer on Bayesian Networks (BN)
[Diagram: an example BN over the variables IDX_ACC, CPU_UTIL, NUM_IO, and SLO_VIO]
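Here is a toy version of such a network in pgmpy (class names as of recent releases; older ones use BayesianModel). Only the variable names come from the slide; the edges and every probability are invented for illustration.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD

# Illustrative structure; the real learned edges may differ.
bn = BayesianNetwork([("IDX_ACC", "NUM_IO"),
                      ("NUM_IO", "SLO_VIO"),
                      ("CPU_UTIL", "SLO_VIO")])

bn.add_cpds(
    TabularCPD("IDX_ACC", 2, [[0.7], [0.3]]),
    TabularCPD("CPU_UTIL", 2, [[0.6], [0.4]]),
    # P(NUM_IO | IDX_ACC): heavy index access tends to mean more I/O.
    TabularCPD("NUM_IO", 2, [[0.9, 0.2], [0.1, 0.8]],
               evidence=["IDX_ACC"], evidence_card=[2]),
    # P(SLO_VIO | NUM_IO, CPU_UTIL): violations likely when both are high.
    TabularCPD("SLO_VIO", 2,
               [[0.95, 0.60, 0.50, 0.10],
                [0.05, 0.40, 0.50, 0.90]],
               evidence=["NUM_IO", "CPU_UTIL"], evidence_card=[2, 2]),
)
assert bn.check_model()
```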
30
Learning Bayesian Networks From Data
[Same pipeline diagram as slide 28]
  • Structure learning (with Banjo)
  • Heuristic search strategy: simulated annealing
  • Parameter estimation
  • Maximum likelihood estimation (both steps sketched
    below)

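A self-contained sketch of both steps on synthetic data, as noted above. Banjo is a Java tool, so pgmpy's hill-climbing search with a BIC score is substituted for its simulated-annealing search here; the maximum-likelihood parameter estimation matches the slide.

```python
import numpy as np
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore, MaximumLikelihoodEstimator
from pgmpy.models import BayesianNetwork

# Synthetic discretized monitoring data with planted dependencies.
rng = np.random.default_rng(0)
n = 1000
idx_acc = rng.integers(0, 2, n)
cpu     = rng.integers(0, 2, n)
num_io  = idx_acc | rng.integers(0, 2, n)                 # driven by IDX_ACC
slo     = ((num_io & cpu) | (rng.random(n) < 0.05)).astype(int)
data = pd.DataFrame({"IDX_ACC": idx_acc, "CPU_UTIL": cpu,
                     "NUM_IO": num_io, "SLO_VIO": slo})

# Structure learning (hill climbing + BIC, standing in for Banjo's
# simulated annealing), then maximum-likelihood parameter estimation.
dag = HillClimbSearch(data).estimate(scoring_method=BicScore(data))
bn = BayesianNetwork(dag.edges())
bn.add_nodes_from(data.columns)
bn.fit(data, estimator=MaximumLikelihoodEstimator)
print(sorted(dag.edges()))
```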
31
BN Inference for Prediction (with BNJ)
[Same pipeline diagram as slide 28]
  • Compute Prob(SLO_VIO = 0) given values of the
    other variables (sketched below)
  • Exact inference Vs. approximate inference
  • Prob(SLO_VIO = 0) Vs. Prob(SLO_VIO = 1)

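A sketch of that query using pgmpy's exact variable-elimination engine in place of BNJ; the tiny network and all probabilities are invented so the block runs on its own.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Tiny stand-in network; all probabilities are invented.
bn = BayesianNetwork([("CPU_UTIL", "SLO_VIO"), ("NUM_IO", "SLO_VIO")])
bn.add_cpds(
    TabularCPD("CPU_UTIL", 2, [[0.6], [0.4]]),
    TabularCPD("NUM_IO", 2, [[0.7], [0.3]]),
    TabularCPD("SLO_VIO", 2,
               [[0.95, 0.60, 0.50, 0.10],   # row 0: P(SLO_VIO = 0 | ...)
                [0.05, 0.40, 0.50, 0.90]],  # row 1: P(SLO_VIO = 1 | ...)
               evidence=["CPU_UTIL", "NUM_IO"], evidence_card=[2, 2]),
)

# Exact inference: P(SLO_VIO) given observed values of the other variables;
# predict a violation when P(SLO_VIO = 1) > P(SLO_VIO = 0).
posterior = VariableElimination(bn).query(
    ["SLO_VIO"], evidence={"CPU_UTIL": 1, "NUM_IO": 1})
print(posterior)
```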
32
Regression Tree (RT) Models
  • Use is similar to Bayesian networks
  • From the open-source package Dtree (a sketch with
    scikit-learn follows below)

[Diagram: same pipeline with an RT learner (RT induction + RT pruning); the learned RT predicts on test data, yielding prediction accuracy]
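A sketch of the induce/prune/predict loop using scikit-learn's CART implementation in place of Dtree (cost-complexity pruning via ccp_alpha stands in for the pruning step); the data is synthetic, shaped like the experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the merged monitoring table (42 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(3294, 42))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=3294)   # e.g., response time

split = int(0.9 * len(y))                                # chronological split
rt = DecisionTreeRegressor(max_depth=8, ccp_alpha=0.01)  # induction + pruning
rt.fit(X[:split], y[:split])

pred = rt.predict(X[split:])
rmse = float(np.sqrt(np.mean((pred - y[split:]) ** 2)))
print(f"held-out RMSE: {rmse:.3f}")
```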
33
Multivariate Auto-Regression (MR)
  • Time-series forecasting method
  • From the R statistical toolkit

[Slide shows the MR model equation: X_t = A_1 X_{t-1} + ... + A_p X_{t-p} + e_t]
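The talk fits MR models with R; the sketch below uses statsmodels' VAR class in Python instead, on an invented three-metric series, to show the fit-then-forecast pattern.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Invented multivariate series: two system metrics plus response time.
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "CPU_UTIL":  rng.normal(size=n).cumsum(),
    "NUM_IO":    rng.normal(size=n).cumsum(),
    "RESP_TIME": rng.normal(size=n).cumsum(),
})

# Fit X_t = A_1 X_{t-1} + ... + A_p X_{t-p} + e_t with up to 2 lags.
results = VAR(data).fit(maxlags=2)

# Forecast 5 steps ahead from the last k_ar observations.
fc = results.forecast(data.values[-results.k_ar:], steps=5)
print(fc[:, 2])   # forecasted RESP_TIME; compare against the SLO bound
```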
34
Roadmap
  • Motivation
  • Overall goal of the Fa project
  • Our current approach
  • Closer look at Fas components
  • Models we are working with
  • Preliminary experimental results

35
Experimental Settings
  • MySQL server and clients run on Xen virtual
    machines with a 996.8 MHz processor and 188
    MB of memory
  • Model learning and prediction run on a 3.6 GHz
    processor with 1 GB of memory
  • Data collected for 3 days at one-minute intervals
    (3660 valid observations in total)

36
Workload Generator
  • Two clients submit parametric queries
  • SELECT avg(b) FROM Table
    WHERE a > 1 and a < 2
  • SLO bound: 4.5 seconds

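One client of the scripted generator might look like the sketch below, written with MySQL Connector/Python. The connection details, table name, and parameter ranges are all made up; only the query shape and the 4.5-second bound come from the slide.

```python
import random
import time
import mysql.connector  # MySQL Connector/Python

SLO_BOUND = 4.5  # seconds, from the slide

# Hypothetical connection details and table name.
cnx = mysql.connector.connect(host="localhost", user="fa",
                              password="fa", database="fa_bench")
cursor = cnx.cursor()

for _ in range(1000):
    lo = random.uniform(0, 100)
    hi = lo + random.uniform(1, 50)   # vary the range to vary the load
    start = time.time()
    cursor.execute("SELECT AVG(b) FROM t WHERE a > %s AND a < %s", (lo, hi))
    cursor.fetchall()
    elapsed = time.time() - start
    print(f"response time {elapsed:.2f}s  SLO_VIO={int(elapsed > SLO_BOUND)}")
    time.sleep(random.expovariate(1.0))  # exponential think time
```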
37
Evaluation
  • Prediction accuracy of SLO violations
  • False positives and false negatives
  • Boolean accuracy
  • Root mean square error (numeric prediction of
    SLO violations)
  • Lead time
  • Dependence of model on parameters
  • Ex: training-data size, training time

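These metrics are straightforward to compute; a minimal sketch with invented sample values follows.

```python
import numpy as np

def evaluate(y_true, y_pred, y_true_num=None, y_pred_num=None):
    """Boolean accuracy, false positives/negatives, and (optionally) RMSE
    for a numeric prediction of the SLO metric."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    out = {
        "accuracy": float(np.mean(y_pred == y_true)),
        "false_positives": int(np.sum((y_pred == 1) & (y_true == 0))),
        "false_negatives": int(np.sum((y_pred == 0) & (y_true == 1))),
    }
    if y_pred_num is not None:
        err = np.asarray(y_pred_num) - np.asarray(y_true_num)
        out["rmse"] = float(np.sqrt(np.mean(err ** 2)))
    return out

print(evaluate([0, 1, 1, 0], [0, 1, 0, 1],
               y_true_num=[3.0, 6.2, 5.5, 2.1],
               y_pred_num=[2.5, 5.9, 4.0, 3.0]))
```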
38
Accuracy Vs. Lead-Time
Training data: 3294 samples; Features: 42
39
How Training Data Size Affects Prediction
  • Total data: 3660 samples;
    Lead time: 0 minutes; Features: 42

40
Subset of Features Used for Prediction
Lead time: 1 min; Training data: 3294 samples
41
Current Steps in Fa
  • Understanding our current observations
  • More types of performance problems
  • Ex: different workload patterns, overload,
    resource exhaustion, parameter misconfiguration,
    and aging
  • More types of statistical models
  • Ex: pattern mining, dynamic Bayesian networks,
    support vector machines, neural networks, and
    boosting

42
Related Work
  • Prediction techniques
  • Forecasting short-term performance (Ex: HP Labs)
  • Predicting critical events (Ex: IBM)
  • Autonomic computing projects
  • Ex: IBM; Self-* (CMU)
  • Self-tuning in commercial systems, Ex: database
    systems like DB2, Oracle, SQL Server

43
Summary
  • Fa project: Enabling an autonomic system to pick
    the best statistical model for proactive problem
    prediction
  • Current approach: Inject problems into a system,
    collect data, and test, study, and characterize
    models
  • Models tried so far: Bayesian networks,
    regression trees, multivariate auto-regression
  • Next steps: Characterization, more types of
    performance problems, more statistical models

44
Thank you!
shivnath@cs.duke.edu
45
Sample Bayesian Network Learned
46
Fa in Future
  • A toolkit to learn system behavior offline
    using a model-based approach, and to generate
    rules for online performance-problem prediction
  • Pros (+) and cons (-) of existing approaches to
    predicting performance problems

47
NIMO Project
  • NIMO: NonInvasive Modeling for Optimization
  • Automatically builds models to predict
    performance of scientific applications on
    heterogeneous resources (e.g., in Grids)
  • Noninvasive → no change to application sources

48
[Diagram: a scientific workflow goes through NIMO's scheduler (1. Enumeration, 2. Costing, 3. Selection) to produce the best plan]
49
Application Profile
  • Execution time = M(Compute, Network, Storage)
  • Execution time = M(cpu_speed, memory_size, ...,
    network_latency, ..., disk_seek_time, ...)

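As the simplest possible instance of such a profile, the sketch below fits a linear model from resource attributes to execution time. NIMO learns richer models; every attribute value and timing here is invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical profiling runs: resource attributes -> execution time.
# Columns: cpu_speed (GHz), memory_size (GB), network_latency (ms),
# disk_seek_time (ms). All values are invented.
X = np.array([[1.0, 0.5, 10.0, 9.0],
              [2.0, 1.0,  5.0, 9.0],
              [3.0, 2.0,  5.0, 4.5],
              [3.6, 1.0,  1.0, 4.5]])
y = np.array([420.0, 250.0, 160.0, 130.0])   # execution times in seconds

# A linear model is the simplest M(compute, network, storage).
M = LinearRegression().fit(X, y)
print(M.predict([[2.5, 1.0, 5.0, 6.0]]))     # predicted time on new resources
```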
50
Learning Application Profiles
[Graph: accuracy of the best model so far vs. learning time]
  • Need accelerated learning

51
Active and Accelerated Learning
M(C,N,S)
52
Preliminary Results
53
Querying Systems as Data
  • What are the probable causes of average response
    time rising to > 5 seconds?

Root-cause query
54
Querying Systems as Data
  • Traces, logs
  • System activity data
  • Data from active probing
  • Workload
  • System configuration data (e.g., buffer size,
    indexes)
  • Source code
  • Models
  • Analytic performance models
  • Machine-learning models
  • Rules from system experts
  • Simulators

55
Querying Systems with QueS (30,000 ft)