Title: On Picking the Right Statistical Model for Proactive Problem Prediction
1On Picking the Right (Statistical) Model for
Proactive Problem Prediction
- Songyun Duan, Pradeep Gunda, and Shivnath Babu
Duke University
2Motivation
- Systems are becoming hard to manage
- Increasing size and complexity
- Workloads change over time
- 24 x 7
- which is a problem because
- Up to 80% of IT budgets spent on maintenance (McKinsey)
3Time Distribution for Database Mgmt.
4Motivation
- Systems are becoming hard to manage
- Increasing size and complexity
- Workloads change over time
- 24 x 7
- which is a problem because
- Up to 80% of IT budgets spent on maintenance (McKinsey)
- System downtime can be extremely costly, e.g.,
- Brokerage: $6.5M/hour; credit card authorization: $2.6M/hour
- Computer rage
5Computer Rage
- People are becoming increasingly
- Dependent on computers at work, school, personal lives
- "Married" to their computers
- So, when computers malfunction, people
- Become frustrated and angry
- Feel betrayed
- Lash out against computers
8Down Side of Computer Rage
- Destruction of personal, business, or govt. property (estimated at millions of dollars annually)
- Potential injury to self or others
- 47-53% of time is lost (I. Ceaparu et al. 2004)
- Uptime vs. Goodtime
9Autonomic Systems
- CS-wide push towards Autonomic (Self-Managing) Systems
- At Duke
- QueS project: Ability to query a complex system
- Fa project: Enabling an autonomic system to pick the right statistical model for problem prediction and diagnosis (today's talk)
- NIMO project: Learning performance and availability models automatically for complex apps running on clusters and grids
10Context for the Fa Project
- Three-tier Internet service: clients connect over a WAN to a web server, application servers, and database servers
- Our current focus is on the database tier
- Service-Level Objectives (SLOs)
- Ex: bound on response time
- Violations of these objectives are costly
11Predicting Violations
- Need accurate and early predictions
- Enables autonomic systems to take remedial actions proactively
- Accuracy: low false positives and false negatives
- Lead time: how long before the actual problem the prediction is given
- Also useful: confidence in the prediction
12An Example
- SLO bound: 5 secs
- Lead time: 1 min
13Problem Statement
14Statistical Models for Prediction
- Models: CaRTs, Bayes nets (BN), time-series techniques, pattern mining
- BN variants: general BN, TAN BN, naive BN, dynamic BN
- BN learning: score-based learning, CI-based learning, BN inference
- Parameters: discretization, feature selection, search strategy, score function, training-set size, feature transformation, learning time
15Fa Project
- How can an autonomic system pick the right model
and the right values of parameters?
16What Do We Output?
17Individual Graphs
18Composite Accuracy Vs. Lead-Time Graph
19Recap
- Given a set of time series representing past system behavior
- Output: composite Accuracy vs. Lead-Time graph
20Sanity Check
- Fully automated, hands-off approach
- Statisticians will say the problem is hopeless
- There is a naive solution
- We want to be fast
- We are not aiming for a fully-automated solution that works 100% of the time
21Our Approach
- How can a self-managing system pick the right model and the right values of parameters?
- Which model best predicts a specific perf. problem?
- How do model parameters influence prediction?
- How much data is needed to build an accurate model?
- Approach: characterizing relationships
22Fa Testbed
Avg. response time, SLO violations, etc.
OS-level data
24Roadmap
- Motivation
- Overall goal of the Fa project
- Our current approach
- Closer look at Fas components
- Models we are working with
- Preliminary experimental results
25Injecting Performance Problems
[Diagram: a scripted workload generator injects problems; the MySQL server on Linux emits OS-level data and MySQL status variables (response time, SLO_VIO, etc.); the merged data is split into training and test sets; the model builder learns a model, whose predictions are scored for accuracy]
- Scripted workload generator
- Clients submit queries/updates to MySQL server
- Workload controlled to create performance problems
26Data Collection
- Data collected at three levels
- OS: Ex CPU utilization (SAR)
- Database: Ex index accesses (MySQLadmin)
- Application: Ex avg response time (MySQL Query Log)
27Data Integration
OS-level data
Application-level data
DB-level data
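The three data streams are sampled independently, so integration amounts to joining them on timestamp. A hedged sketch of that join (pure Python, not the actual Fa merge code; the metric names are illustrative):

```python
def merge_by_timestamp(*sources):
    """Join several {timestamp: {metric: value}} dicts on timestamp,
    keeping only timestamps present in every source (inner join)."""
    common = set(sources[0])
    for s in sources[1:]:
        common &= set(s)
    merged = {}
    for t in sorted(common):
        row = {}
        for s in sources:
            row.update(s[t])   # combine metrics from all levels into one row
        merged[t] = row
    return merged

# Toy one-minute samples from each level:
os_data  = {1: {"cpu_util": 0.7},  2: {"cpu_util": 0.9}}
db_data  = {1: {"idx_acc": 120},   2: {"idx_acc": 300}, 3: {"idx_acc": 80}}
app_data = {1: {"resp_time": 1.2}, 2: {"resp_time": 6.1}}
rows = merge_by_timestamp(os_data, db_data, app_data)
```

An inner join is the simple choice here: a timestamp missing from any level (like t=3 above) yields an incomplete feature vector and is dropped.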
28Studying Different Statistical Models
[Diagram: training data, shifted by the lead time, feeds the Bayesian Network Learner (structure learning, then parameter estimation); the learned Bayesian network model makes predictions on test data via BN inference, which are scored for prediction accuracy]
- Models tried so far
- Bayesian networks
- Regression trees
- Multivariate regression
29Primer on Bayesian Networks (BN)
[Example network: nodes IDX_ACC, CPU_UTIL, NUM_IO, and SLO_VIO, with edges encoding their dependencies]
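To make the primer concrete: given a network structure and its conditional probability tables, predicting SLO_VIO is a probabilistic query answered by summing out the unobserved variables. A minimal sketch with made-up CPTs (the numbers and the two-parent structure are assumptions for illustration, not learned from the testbed):

```python
# Hypothetical CPTs for a tiny network: CPU_UTIL -> SLO_VIO <- NUM_IO
# (all variables binary).
p_io = {0: 0.6, 1: 0.4}               # P(NUM_IO)
p_vio = {                             # P(SLO_VIO = 1 | CPU_UTIL, NUM_IO)
    (0, 0): 0.01, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.80,
}

def prob_violation(cpu):
    """Exact inference by enumeration: P(SLO_VIO = 1 | CPU_UTIL = cpu),
    summing out the unobserved NUM_IO."""
    return sum(p_io[io] * p_vio[(cpu, io)] for io in (0, 1))

high = prob_violation(1)   # 0.6*0.20 + 0.4*0.80 = 0.44
low  = prob_violation(0)   # 0.6*0.01 + 0.4*0.10 = 0.046
```

Enumeration is exact but exponential in the number of summed-out variables, which is why the later slides distinguish exact from approximate inference.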
30Learning Bayesian Networks From Data
- Structure learning (Banjo)
- Heuristic search strategy: simulated annealing
- Parameter estimation: maximum likelihood estimation
31BN Inference for Prediction (bnj)
- Compute Prob(SLO_VIO = 0) given values of the other variables
- Exact inference vs. approximate inference
- Prob(SLO_VIO = 0) vs. Prob(SLO_VIO = 1)
32Regression Tree (RT) Models
- Use is similar to Bayesian networks
- From open source package Dtree
[Diagram: training data, shifted by the lead time, feeds the RT learner (RT induction, then RT pruning); the learned regression tree makes predictions on test data, which are scored for prediction accuracy]
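Prediction with a learned regression tree is a walk from the root to a leaf, testing one metric threshold per internal node. A hand-written stand-in (the node layout, thresholds, and leaf values are assumptions for the sketch, not Dtree's format):

```python
# Illustrative tree: internal nodes split on a metric threshold,
# leaves hold the predicted average response time in seconds.
tree = {
    "split": ("cpu_util", 0.8),
    "left":  {"leaf": 1.5},                 # cpu_util <= 0.8
    "right": {                              # cpu_util > 0.8
        "split": ("num_io", 200),
        "left":  {"leaf": 3.0},
        "right": {"leaf": 7.2},             # would violate a 5-second SLO
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, following the split tests."""
    while "leaf" not in node:
        feature, threshold = node["split"]
        node = node["left"] if sample[feature] <= threshold else node["right"]
    return node["leaf"]

resp = predict(tree, {"cpu_util": 0.9, "num_io": 500})   # 7.2
```

Because each leaf predicts a numeric response time, the same tree supports both the numeric (RMSE) and boolean (compare against the SLO bound) evaluations described later.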
33Multivariate Auto-Regression (MR)
- Time-series forecasting method
- From the R statistical toolkit
x_t = A_1 x_(t-1) + ... + A_p x_(t-p) + e_t
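The univariate special case makes the idea concrete: fit the coefficients of x_t = a * x_(t-1) + b by least squares on consecutive pairs, then forecast one step ahead. A deliberately stripped-down sketch (the R toolkit fits the full multivariate version; this toy fit and data are illustrative):

```python
def fit_ar1(series):
    """Least-squares fit of x_t = a * x_{t-1} + b (univariate AR(1))."""
    xs, ys = series[:-1], series[1:]          # (previous, next) pairs
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

# A series that exactly follows x_t = 0.5 * x_{t-1} + 1:
series = [4.0, 3.0, 2.5, 2.25, 2.125]
a, b = fit_ar1(series)
forecast = a * series[-1] + b     # one-step-ahead prediction: 2.0625
```

Forecasting k steps ahead just iterates the recurrence k times, which is how a time-series model trades accuracy for lead time.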
34Roadmap
- Motivation
- Overall goal of the Fa project
- Our current approach
- Closer look at Fas components
- Models we are working with
- Preliminary experimental results
35Experimental Settings
- MySQL server and clients run on virtual machines (Xen) with a 996.8 MHz processor and 188 MB memory
- Model learning and prediction are conducted on a 3.6 GHz processor with 1 GB memory
- Data collected for 3 days at one-minute intervals (3660 valid observations in total)
36Workload Generator
- Two clients submit parametric queries
- SELECT avg(b) FROM Table WHERE a > ?1 AND a < ?2
- SLO bound: 4.5 seconds
37Evaluation
- Prediction accuracy of SLO violations
- False positives and false negatives
- Boolean accuracy
- Root mean square error (numeric prediction of SLO violations)
- Lead time
- Dependence of model on parameters
- Ex: training-data size, training time
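The boolean and numeric criteria above can be sketched in a few lines (the function name and toy predictions are illustrative; Fa's actual evaluation harness is not shown in the talk):

```python
import math

def evaluate(predicted, actual):
    """Boolean accuracy, false-positive/negative counts, and RMSE for
    per-interval SLO-violation predictions (1 = violation, 0 = normal)."""
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    accuracy = 1 - (fp + fn) / len(actual)
    rmse = math.sqrt(sum((p - a) ** 2
                         for p, a in zip(predicted, actual)) / len(actual))
    return {"accuracy": accuracy, "false_pos": fp, "false_neg": fn,
            "rmse": rmse}

scores = evaluate(predicted=[0, 1, 1, 0, 1], actual=[0, 1, 0, 1, 1])
# One false positive and one false negative out of five: accuracy 0.6
```

Separating false positives from false negatives matters here because their costs differ: a false positive triggers an unnecessary remedial action, while a false negative lets an SLO violation happen unwarned.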
38Accuracy Vs. Lead-Time
Training data: 3294 samples; features: 42
39How Training Data Size Affects Prediction
- Total data: 3660 samples; lead time: 0 minutes; features: 42
40Subset of Features Used for Prediction
Lead time: 1 min; training data: 3294 samples
41Current Steps in Fa
- Understanding our current observations
- More types of performance problems
- Ex: different workload patterns, overload, resource exhaustion, parameter misconfiguration, and aging
- More types of statistical models
- Ex: pattern mining, dynamic Bayesian networks, support vector machines, neural networks, and boosting
42Related Work
- Prediction techniques
- Forecasting short-term performance (Ex: HP Labs)
- Predicting critical events (Ex: IBM)
- Autonomic computing projects
- Ex: IBM, Self-* (CMU)
- Self-tuning in commercial systems, Ex: database systems like DB2, Oracle, SQL Server
43Summary
- Fa project: Enabling an autonomic system to pick the best statistical model for proactive problem prediction
- Current approach: Inject problems into a system; collect data; test, study, and characterize models
- Models tried so far: Bayesian networks, regression trees, multivariate auto-regression
- Next steps: Characterization, more types of performance problems, more statistical models
44Thank you!
shivnath@cs.duke.edu
45Sample Bayesian Network Learned
46Fa in Future
- A toolkit to learn system behavior offline using a model-based approach and generate rules for online performance-problem prediction
- Pros (+) and cons (-) of existing approaches to predicting performance problems
47NIMO Project
- NIMO: Non-Invasive Modeling for Optimization
- Automatically builds models to predict the performance of scientific applications on heterogeneous resources (e.g., in Grids)
- Noninvasive -> no change to application sources
48Scientific workflow
[Diagram: NIMO's scheduler turns the workflow into a best plan via 1. enumeration, 2. costing, 3. selection]
49Application Profile
- Execution time = M(Compute, Network, Storage)
- Execution time = M(cpu_speed, memory_size, ..., network_latency, ..., disk_seek_time, ...)
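One simple instantiation of such a profile is an additive model: execution time as the sum of a compute phase, a network phase, and a storage phase. All coefficients and workload counts below are made up for the sketch; NIMO learns the real function M from data:

```python
def execution_time(cpu_speed_ghz, network_latency_ms, disk_seek_ms,
                   instructions=1e9, messages=100, seeks=50):
    """Illustrative additive profile M(Compute, Network, Storage):
    total time = compute phase + network phase + storage phase.
    Workload counts (instructions, messages, seeks) are hypothetical."""
    compute = instructions / (cpu_speed_ghz * 1e9)    # seconds of CPU work
    network = messages * network_latency_ms / 1000.0  # message round-trips
    storage = seeks * disk_seek_ms / 1000.0           # disk seek cost
    return compute + network + storage

t = execution_time(cpu_speed_ghz=2.0, network_latency_ms=10.0,
                   disk_seek_ms=8.0)
# 0.5 s compute + 1.0 s network + 0.4 s storage = 1.9 s
```

Even this toy form shows why a profile is useful for scheduling: plugging in the resource attributes of each candidate assignment costs the candidates without running the application.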
50Learning Application Profiles
[Graph: accuracy of the best model so far vs. learning time]
- Need accelerated learning
51Active and Accelerated Learning
M(C,N,S)
52Preliminary Results
53Querying Systems as Data
- What are the probable causes of average response time rising to > 5 seconds?
- Root-cause query
54Querying Systems as Data
- Traces, logs
- System activity data
- Data from active probing
- Workload
- System configuration data (e.g., buffer size, indexes)
- Source code
- Models
- Analytic performance models
- Machine-learning models
- Rules from system experts
- Simulators
55Querying Systems with QueS (30,000 ft)