Title: Using HTTP Access Logs To Detect Application-Level Failures In Internet Services
1. Using HTTP Access Logs To Detect Application-Level Failures In Internet Services
- Peter Bodík, Greg Friedman, Lukas Biewald,
- Helen Levine, George Candea, Kayur Patel,
- Gilman Tolle, Jonathan Hui, Armando Fox,
- Michael I. Jordan, David Patterson
- UC Berkeley, Stanford University, Ebates.com
2. Introduction
- problem
  - hard to detect/localize failures in Internet services
  - operators don't trust the statistical learning algorithms!
  - hard to verify
- goal
  - build a real-time visualization tool for operators
  - quickly detect changes/anomalies
  - localize the cause of the change
  - based on user access patterns
- evaluated on real-world failure data
3. Outline
- modeling user access patterns
- demo of the visualization tool
- examples of failures from Ebates.com
- results
- why evaluation is hard
4. Analyzing web site access patterns
- rapid changes in access patterns indicate problems
  - broken links
  - users hitting Reload multiple times
- access patterns captured in HTTP logs
- detect anomalies
  - compare current traffic to the historic/normal traffic
  - model access frequencies to the top 40 pages
  - captures 98% of traffic to Ebates.com
  - don't need any instrumentation
- visualize access patterns
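The feature extraction above can be sketched as a short stdlib-only Python snippet: count requests per page in a time interval and convert them to relative frequencies over the tracked top pages. The regex, field layout, and function name are illustrative assumptions (a simplified Common Log Format); the real Ebates.com logs are not public.

```python
import re
from collections import Counter

# Pull the request path out of a Common-Log-Format-style line.
# (Hypothetical layout; only the "<METHOD> <path>" part is used.)
LOG_RE = re.compile(r'"(?:GET|POST) (?P<path>\S+)')

def page_frequencies(log_lines, top_pages):
    """Relative access frequency of each tracked page within
    one time interval of the access log."""
    counts = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if m:
            # Strip query strings so /mall.go?x=1 counts as /mall.go
            counts[m.group('path').split('?')[0]] += 1
    total = sum(counts.values()) or 1
    return [counts[p] / total for p in top_pages]

lines = [
    '10.0.0.1 - - [16/Nov/2003:19:27:00 -0800] "GET /landing.jsp HTTP/1.1" 200 512',
    '10.0.0.2 - - [16/Nov/2003:19:27:01 -0800] "GET /mall_ctrl.jsp HTTP/1.1" 200 743',
    '10.0.0.3 - - [16/Nov/2003:19:27:02 -0800] "GET /landing.jsp?ref=x HTTP/1.1" 200 512',
]
print(page_frequencies(lines, ['/landing.jsp', '/mall_ctrl.jsp']))
```

One frequency vector per interval is all the downstream detectors need, which is why no server-side instrumentation is required.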
5. Anomaly detection
- Chi-square test
  - compare page frequencies using Chi-square test
  - anomaly score = significance of the test
- Naive Bayes approach
  - assume page frequencies independent
  - page frequency modeled as a Gaussian
  - anomaly score = 1 - Prob(current interval is normal)
  - can estimate anomaly score for each page
- learn normal access pattern from the past
  - skip anomalous intervals
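Both detectors above can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the paper's exact formulation: the Pearson statistic stands in for the full significance computation, and the two-sided Gaussian tail is an assumed choice for "Prob(current interval is normal)"; function names are hypothetical.

```python
import math

def chi_square_stat(observed, expected):
    """Pearson chi-square statistic comparing current page counts
    to expected (historical) counts; larger = more anomalous."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def gaussian_page_scores(freqs, means, stds):
    """Naive Bayes-style scoring: each page's frequency is modeled
    as an independent Gaussian learned from normal traffic.
    Returns (per-page anomaly scores, 1 - Prob(interval is normal))."""
    p_normal = 1.0
    page_scores = []
    for f, mu, sigma in zip(freqs, means, stds):
        z = abs(f - mu) / sigma
        p = 1 - math.erf(z / math.sqrt(2))  # two-sided Gaussian tail prob.
        page_scores.append(1 - p)           # per-page anomaly score
        p_normal *= p                       # independence assumption
    return page_scores, 1 - p_normal

# A page whose frequency jumps from its normal 10% to 16% (6 sigma)
# dominates both the per-page and the overall anomaly score.
scores, overall = gaussian_page_scores(
    [0.30, 0.16],   # current interval frequencies
    [0.30, 0.10],   # historical means
    [0.02, 0.01])   # historical standard deviations
print(scores, overall)
```

The per-page scores are what makes localization possible: the chi-square test only says "this interval is anomalous", while the Gaussian model also says which pages are responsible.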
6. Reporting anomalies

warning 3
detection time: Sun Nov 16 19:27:00 PST 2003
start: Sun Nov 16 19:24:00 PST 2003
end: Sun Nov 16 21:05:00 PST 2003
anomaly score: 7.05

Most anomalous pages:
/landing.jsp           19.55
/landing_merchant.jsp  19.50
/mall_ctrl.jsp          3.69
/malltop.go             2.63
/mall.go                2.18
How long did it take you to read this?
7. Visualization
- uses the same features as anomaly detection
- anomalies are obvious
- exploits human pattern-matching
8. Failures at Ebates.com
- used real-world failures
- let the CTO of Ebates.com evaluate our results
- analyzed 6 failures
  - account page problem
  - broken sign-up page
  - 2 redirection problems
  - database overload
  - runaway DB query
9. Nov 2003 account page problem (1)
10. Nov 2003 account page problem (2)
[traffic visualizations at 9am and 1pm]
11. Oct 2003 broken signup page (1)
12. Oct 2003 broken signup page (2)
13. Oct 2003 broken signup page (3)
14. Summary of results
- November 2003 account page problem
  - 1st warning 16 hours earlier
  - 2nd warning 1 hour earlier, correctly localized the bad page!
- October 2003 broken signup page
  - noticed the problem 7 days earlier, correctly localized!
- July 2001 landing looping problem
  - warning 2 days earlier, correctly localized
- detected a failure they didn't tell us about
- detected three other significant anomalies
  - feedback: "these might have been important, but we didn't know about them. definitely useful if detected in real-time."
15. Evaluation is hard
- ideally, evaluate in terms of
  - accuracy, time to detect, false positive/negative rate
- need to reconstruct ground truth
  - need to know about all failures
  - when failure introduced/detected/resolved
  - root cause of failure
- operators at Ebates.com VERY helpful
  - analyzed system logs, email archives, chat logs
  - still don't know the full ground truth
- report advance warning time
16. Generalizing the approach
- can be applied to an arbitrary Web application
- analyzed traffic to the top 40 pages
  - captures 98% of traffic
  - easily extensible to more pages
- how much data do we need?
  - used logs from 3 web servers
  - same results for the account page problem with just 5% of data
  - about 12,000 requests a day
17. Open-source failure repository
18. Conclusion
- changes in user access patterns indicate problems
- detected 4 out of 6 problems earlier than Ebates.com
- synergy between visualization and automatic detection
  - builds the trust relationship
  - good visualization → cheaper false positives
  - don't handle false positives automatically; let the operator use his/her experience
- hard to evaluate detection and localization algorithms