Title: On Picking the Right Statistical Model for Proactive Problem Prediction
1On Picking the Right (Statistical) Model for
Proactive Problem Prediction
- Songyun Duan, Pradeep Gunda, and Shivnath Babu
Duke University
2Motivation
- Systems are becoming hard to manage
- Increasing size and complexity
- Workloads change over time
- 24 x 7
- which is a problem because
- Up to 80% of IT budgets spent on maintenance (McKinsey)
3Time Distribution for Database Mgmt.
4Motivation
- Systems are becoming hard to manage
- Increasing size and complexity
- Workloads change over time
- 24 x 7
- which is a problem because
- Up to 80% of IT budgets spent on maintenance (McKinsey)
- System downtime can be extremely costly, e.g.,
- Brokerage: $6.5M/hour; credit card authorization: $2.6M/hour
- Computer rage
5Computer Rage
- People are becoming increasingly
- Dependent on computers at work, school, personal lives
- "Married" to their computers
- So, when computers malfunction, people
- Become frustrated and angry
- Feel betrayed
- Lash out against computers
8Down Side of Computer Rage
- Destruction of personal, business, or govt. property (estimated at millions of dollars annually)
- Potential injury to self or others
- 47-53% of time is lost (I. Ceaparu et al. 2004)
- Uptime vs. Goodtime
9Autonomic Systems
- CS-wide push towards Autonomic (Self-Managing) Systems
- At Duke
- QueS project: Ability to query a complex system
- Fa project: Enabling an autonomic system to pick the right statistical model for problem prediction and diagnosis (today's talk)
- NIMO project: Learning performance and availability models automatically for complex apps running on clusters and grids
10Context for the Fa Project
- Three-tier Internet service: clients connect over a WAN to a web server, application servers, and database servers
- Our current focus is on the database tier
- Service-Level Objectives (SLOs)
- Ex: bound on response time
- Violations of these objectives are costly
11Predicting Violations
- Need accurate and early predictions
- Enables autonomic systems to take remedial actions proactively
- Accuracy: low false positives and false negatives
- Lead time: how long before the actual problem the prediction is given
- Also useful: confidence in the prediction
12An Example
- SLO bound: 5 secs
- Lead time: 1 min
13Problem Statement
14Statistical Models for Prediction
- Models: CaRTs, Bayes nets (BN), time-series techniques, pattern mining
- BN variants: general BN, TAN BN, naive BN, dynamic BN
- BN learning: score-based learning, CI-based learning, BN inference
- Parameters: discretization, feature selection, search strategy, score function, training-set size, feature transformation, learning time
15Fa Project
- How can an autonomic system pick the right model
and the right values of parameters?
16What Do We Output?
17Individual Graphs
18Composite Accuracy Vs. Lead-Time Graph
19Recap
- Given a set of time series representing past system behavior
- Output: composite Accuracy vs. Lead-Time graph
20Sanity Check
- Fully automated, hands-off approach
- Statisticians will say the problem is hopeless
- There is a naive solution
- We want to be fast
- We are not aiming for a fully-automated solution that works 100% of the time
21Our Approach
- How can a self-managing system pick the right model and the right values of parameters?
- Which model best predicts a specific perf. problem?
- How do model parameters influence prediction?
- How much data is needed to build an accurate model?
- Approach: characterizing relationships
22Fa Testbed
Avg. response time, SLO violations, etc.
OS-level data
24Roadmap
- Motivation
- Overall goal of the Fa project
- Our current approach
- Closer look at Fas components
- Models we are working with
- Preliminary experimental results
25Injecting Performance Problems
[Diagram: a scripted workload generator injects problems; the MySQL server on Linux emits OS-level data and MySQL status variables (response time, SLO_VIO, etc.); the merged data is split into training and test sets; the model builder learns a model, whose predictions are scored for accuracy]
- Scripted workload generator
- Clients submit queries/updates to MySQL server
- Workload controlled to create performance problems
26Data Collection
- Data collected at three levels
- OS: Ex CPU utilization (SAR)
- Database: Ex index accesses (MySQLadmin)
- Application: Ex avg response time (MySQL Query Log)
27Data Integration
OS-level data
Application-level data
DB-level data
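The three data streams are sampled independently, so integration amounts to joining them on timestamp. A hedged sketch of that join (pure Python, not the actual Fa merge code; the metric names are illustrative):

```python
def merge_by_timestamp(*sources):
    """Join several {timestamp: {metric: value}} dicts on timestamp,
    keeping only timestamps present in every source (inner join)."""
    common = set(sources[0])
    for s in sources[1:]:
        common &= set(s)
    merged = {}
    for t in sorted(common):
        row = {}
        for s in sources:
            row.update(s[t])   # combine metrics from all levels into one row
        merged[t] = row
    return merged

# Toy one-minute samples from each level:
os_data  = {1: {"cpu_util": 0.7},  2: {"cpu_util": 0.9}}
db_data  = {1: {"idx_acc": 120},   2: {"idx_acc": 300}, 3: {"idx_acc": 80}}
app_data = {1: {"resp_time": 1.2}, 2: {"resp_time": 6.1}}
rows = merge_by_timestamp(os_data, db_data, app_data)
```

An inner join is the simple choice here: a timestamp missing from any level (like t=3 above) yields an incomplete feature vector and is dropped.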
28Studying Different Statistical Models
[Diagram: training data, shifted by the lead time, feeds the Bayesian Network Learner (structure learning, then parameter estimation); the learned Bayesian network model makes predictions on test data via BN inference, which are scored for prediction accuracy]
- Models tried so far
- Bayesian networks
- Regression trees
- Multivariate regression
29Primer on Bayesian Networks (BN)
[Example network: nodes IDX_ACC, CPU_UTIL, NUM_IO, and SLO_VIO, with edges encoding their dependencies]
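To make the primer concrete: given a network structure and its conditional probability tables, predicting SLO_VIO is a probabilistic query answered by summing out the unobserved variables. A minimal sketch with made-up CPTs (the numbers and the two-parent structure are assumptions for illustration, not learned from the testbed):

```python
# Hypothetical CPTs for a tiny network: CPU_UTIL -> SLO_VIO <- NUM_IO
# (all variables binary).
p_io = {0: 0.6, 1: 0.4}               # P(NUM_IO)
p_vio = {                             # P(SLO_VIO = 1 | CPU_UTIL, NUM_IO)
    (0, 0): 0.01, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.80,
}

def prob_violation(cpu):
    """Exact inference by enumeration: P(SLO_VIO = 1 | CPU_UTIL = cpu),
    summing out the unobserved NUM_IO."""
    return sum(p_io[io] * p_vio[(cpu, io)] for io in (0, 1))

high = prob_violation(1)   # 0.6*0.20 + 0.4*0.80 = 0.44
low  = prob_violation(0)   # 0.6*0.01 + 0.4*0.10 = 0.046
```

Enumeration is exact but exponential in the number of summed-out variables, which is why the later slides distinguish exact from approximate inference.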
30Learning Bayesian Networks From Data
- Structure learning (Banjo)
- Heuristic search strategy: simulated annealing
- Parameter estimation: maximum likelihood estimation
31BN Inference for Prediction (bnj)
- Compute Prob(SLO_VIO = 0) given values of the other variables
- Exact inference vs. approximate inference
- Prob(SLO_VIO = 0) vs. Prob(SLO_VIO = 1)
32Regression Tree (RT) Models
- Use is similar to Bayesian networks
- From open source package Dtree
[Diagram: training data, shifted by the lead time, feeds the RT learner (RT induction, then RT pruning); the learned regression tree makes predictions on test data, which are scored for prediction accuracy]
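Prediction with a learned regression tree is a walk from the root to a leaf, testing one metric threshold per internal node. A hand-written stand-in (the node layout, thresholds, and leaf values are assumptions for the sketch, not Dtree's format):

```python
# Illustrative tree: internal nodes split on a metric threshold,
# leaves hold the predicted average response time in seconds.
tree = {
    "split": ("cpu_util", 0.8),
    "left":  {"leaf": 1.5},                 # cpu_util <= 0.8
    "right": {                              # cpu_util > 0.8
        "split": ("num_io", 200),
        "left":  {"leaf": 3.0},
        "right": {"leaf": 7.2},             # would violate a 5-second SLO
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, following the split tests."""
    while "leaf" not in node:
        feature, threshold = node["split"]
        node = node["left"] if sample[feature] <= threshold else node["right"]
    return node["leaf"]

resp = predict(tree, {"cpu_util": 0.9, "num_io": 500})   # 7.2
```

Because each leaf predicts a numeric response time, the same tree supports both the numeric (RMSE) and boolean (compare against the SLO bound) evaluations described later.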
33Multivariate Auto-Regression (MR)
- Time-series forecasting method
- From the R statistical toolkit
x_t = A_1 x_(t-1) + ... + A_p x_(t-p) + e_t
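The univariate special case makes the idea concrete: fit the coefficients of x_t = a * x_(t-1) + b by least squares on consecutive pairs, then forecast one step ahead. A deliberately stripped-down sketch (the R toolkit fits the full multivariate version; this toy fit and data are illustrative):

```python
def fit_ar1(series):
    """Least-squares fit of x_t = a * x_{t-1} + b (univariate AR(1))."""
    xs, ys = series[:-1], series[1:]          # (previous, next) pairs
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

# A series that exactly follows x_t = 0.5 * x_{t-1} + 1:
series = [4.0, 3.0, 2.5, 2.25, 2.125]
a, b = fit_ar1(series)
forecast = a * series[-1] + b     # one-step-ahead prediction: 2.0625
```

Forecasting k steps ahead just iterates the recurrence k times, which is how a time-series model trades accuracy for lead time.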
34Roadmap
- Motivation
- Overall goal of the Fa project
- Our current approach
- Closer look at Fas components
- Models we are working with
- Preliminary experimental results
35Experimental Settings
- MySQL server and clients run on virtual machines (Xen) with a 996.8 MHz processor and 188 MB memory
- Model learning and prediction are conducted on a 3.6 GHz processor with 1 GB memory
- Data collected for 3 days at one-minute intervals (3660 valid observations in total)
36Workload Generator
- Two clients submit parametric queries
- SELECT avg(b) FROM Table WHERE a > ?1 AND a < ?2
- SLO bound: 4.5 seconds
37Evaluation
- Prediction accuracy of SLO violations
- False positives and false negatives
- Boolean accuracy
- Root mean square error (numeric prediction of SLO violations)
- Lead time
- Dependence of model on parameters
- Ex: training-data size, training time
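The boolean and numeric criteria above can be sketched in a few lines (the function name and toy predictions are illustrative; Fa's actual evaluation harness is not shown in the talk):

```python
import math

def evaluate(predicted, actual):
    """Boolean accuracy, false-positive/negative counts, and RMSE for
    per-interval SLO-violation predictions (1 = violation, 0 = normal)."""
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    accuracy = 1 - (fp + fn) / len(actual)
    rmse = math.sqrt(sum((p - a) ** 2
                         for p, a in zip(predicted, actual)) / len(actual))
    return {"accuracy": accuracy, "false_pos": fp, "false_neg": fn,
            "rmse": rmse}

scores = evaluate(predicted=[0, 1, 1, 0, 1], actual=[0, 1, 0, 1, 1])
# One false positive and one false negative out of five: accuracy 0.6
```

Separating false positives from false negatives matters here because their costs differ: a false positive triggers an unnecessary remedial action, while a false negative lets an SLO violation happen unwarned.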
38Accuracy Vs. Lead-Time
Training data: 3294 samples; features: 42
39How Training Data Size Affects Prediction
- Total data: 3660 samples; lead time: 0 minutes; features: 42
40Subset of Features Used for Prediction
Lead time: 1 min; training data: 3294 samples
41Current Steps in Fa
- Understanding our current observations
- More types of performance problems
- Ex: different workload patterns, overload, resource exhaustion, parameter misconfiguration, and aging
- More types of statistical models
- Ex: pattern mining, dynamic Bayesian networks, support vector machines, neural networks, and boosting
42Related Work
- Prediction techniques
- Forecasting short-term performance (Ex: HP Labs)
- Predicting critical events (Ex: IBM)
- Autonomic computing projects
- Ex: IBM, Self-* (CMU)
- Self-tuning in commercial systems, Ex: database systems like DB2, Oracle, SQL Server
43Summary
- Fa project: Enabling an autonomic system to pick the best statistical model for proactive problem prediction
- Current approach: Inject problems into a system; collect data; test, study, and characterize models
- Models tried so far: Bayesian networks, regression trees, multivariate auto-regression
- Next steps: Characterization, more types of performance problems, more statistical models
44Thank you!
shivnath@cs.duke.edu
45Sample Bayesian Network Learned
46Fa in Future
- A toolkit to learn system behavior offline using a model-based approach and generate rules for online performance-problem prediction
- Pros (+) and cons (-) of existing approaches to predicting performance problems
47NIMO Project
- NIMO: Non-Invasive Modeling for Optimization
- Automatically builds models to predict the performance of scientific applications on heterogeneous resources (e.g., in Grids)
- Noninvasive -> no change to application sources
48Scientific workflow
[Diagram: NIMO's scheduler turns the workflow into a best plan via 1. enumeration, 2. costing, 3. selection]
49Application Profile
- Execution time = M(Compute, Network, Storage)
- Execution time = M(cpu_speed, memory_size, ..., network_latency, ..., disk_seek_time, ...)
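One simple instantiation of such a profile is an additive model: execution time as the sum of a compute phase, a network phase, and a storage phase. All coefficients and workload counts below are made up for the sketch; NIMO learns the real function M from data:

```python
def execution_time(cpu_speed_ghz, network_latency_ms, disk_seek_ms,
                   instructions=1e9, messages=100, seeks=50):
    """Illustrative additive profile M(Compute, Network, Storage):
    total time = compute phase + network phase + storage phase.
    Workload counts (instructions, messages, seeks) are hypothetical."""
    compute = instructions / (cpu_speed_ghz * 1e9)    # seconds of CPU work
    network = messages * network_latency_ms / 1000.0  # message round-trips
    storage = seeks * disk_seek_ms / 1000.0           # disk seek cost
    return compute + network + storage

t = execution_time(cpu_speed_ghz=2.0, network_latency_ms=10.0,
                   disk_seek_ms=8.0)
# 0.5 s compute + 1.0 s network + 0.4 s storage = 1.9 s
```

Even this toy form shows why a profile is useful for scheduling: plugging in the resource attributes of each candidate assignment costs the candidates without running the application.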
50Learning Application Profiles
[Graph: accuracy of the best model so far vs. learning time]
- Need accelerated learning
51Active and Accelerated Learning
M(C,N,S)
52Preliminary Results
53Querying Systems as Data
- What are the probable causes of average response time rising to > 5 seconds?
- Root-cause query
54Querying Systems as Data
- Traces, logs
- System activity data
- Data from active probing
- Workload
- System configuration data (e.g., buffer size, indexes)
- Source code
- Models
- Analytic performance models
- Machine-learning models
- Rules from system experts
- Simulators
55Querying Systems with QueS (30,000 ft)