Title: Using HTTP Access Logs To Detect Application-Level Failures In Internet Services
1. Using HTTP Access Logs To Detect Application-Level Failures In Internet Services
- Peter Bodík, Greg Friedman, Lukas Biewald,
- Helen Levine, George Candea, Kayur Patel,
- Gilman Tolle, Jonathan Hui, Armando Fox,
- Michael I. Jordan, David Patterson
- UC Berkeley, Stanford University, Ebates.com
2. Introduction
- problem
  - hard to detect/localize failures in Internet services
  - operators don't trust the statistical learning algorithms!
  - hard to verify
- goal
  - build a real-time visualization tool for operators
  - quickly detect changes/anomalies
  - localize the cause of the change
  - based on user access patterns
- evaluated on real-world failure data
3. Outline
- modeling user access patterns
- demo of the visualization tool
- examples of failures from Ebates.com
- results
- why evaluation is hard
4. Analyzing web site access patterns
- rapid changes in access patterns indicate problems
  - broken links
  - users hitting Reload multiple times
- access patterns captured in HTTP logs
- detect anomalies
  - compare current traffic to the historic/normal traffic
  - model access frequencies to the top 40 pages
  - captures 98% of traffic to Ebates.com
  - don't need any instrumentation
- visualize access patterns
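The feature extraction above can be sketched as a short stdlib-only Python snippet: count requests per page in a time interval and convert them to relative frequencies over the tracked top pages. The regex, field layout, and function name are illustrative assumptions (a simplified Common Log Format); the real Ebates.com logs are not public.

```python
import re
from collections import Counter

# Pull the request path out of a Common-Log-Format-style line.
# (Hypothetical layout; only the "<METHOD> <path>" part is used.)
LOG_RE = re.compile(r'"(?:GET|POST) (?P<path>\S+)')

def page_frequencies(log_lines, top_pages):
    """Relative access frequency of each tracked page within
    one time interval of the access log."""
    counts = Counter()
    for line in log_lines:
        m = LOG_RE.search(line)
        if m:
            # Strip query strings so /mall.go?x=1 counts as /mall.go
            counts[m.group('path').split('?')[0]] += 1
    total = sum(counts.values()) or 1
    return [counts[p] / total for p in top_pages]

lines = [
    '10.0.0.1 - - [16/Nov/2003:19:27:00 -0800] "GET /landing.jsp HTTP/1.1" 200 512',
    '10.0.0.2 - - [16/Nov/2003:19:27:01 -0800] "GET /mall_ctrl.jsp HTTP/1.1" 200 743',
    '10.0.0.3 - - [16/Nov/2003:19:27:02 -0800] "GET /landing.jsp?ref=x HTTP/1.1" 200 512',
]
print(page_frequencies(lines, ['/landing.jsp', '/mall_ctrl.jsp']))
```

One frequency vector per interval is all the downstream detectors need, which is why no server-side instrumentation is required.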
5. Anomaly detection
- Chi-square test
  - compare page frequencies using Chi-square test
  - anomaly score = significance of the test
- Naive Bayes approach
  - assume page frequencies independent
  - page frequency modeled as a Gaussian
  - anomaly score = 1 - Prob(current interval is normal)
  - can estimate anomaly score for each page
- learn normal access pattern from the past
  - skip anomalous intervals
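Both detectors above can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the paper's exact formulation: the Pearson statistic stands in for the full significance computation, and the two-sided Gaussian tail is an assumed choice for "Prob(current interval is normal)"; function names are hypothetical.

```python
import math

def chi_square_stat(observed, expected):
    """Pearson chi-square statistic comparing current page counts
    to expected (historical) counts; larger = more anomalous."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def gaussian_page_scores(freqs, means, stds):
    """Naive Bayes-style scoring: each page's frequency is modeled
    as an independent Gaussian learned from normal traffic.
    Returns (per-page anomaly scores, 1 - Prob(interval is normal))."""
    p_normal = 1.0
    page_scores = []
    for f, mu, sigma in zip(freqs, means, stds):
        z = abs(f - mu) / sigma
        p = 1 - math.erf(z / math.sqrt(2))  # two-sided Gaussian tail prob.
        page_scores.append(1 - p)           # per-page anomaly score
        p_normal *= p                       # independence assumption
    return page_scores, 1 - p_normal

# A page whose frequency jumps from its normal 10% to 16% (6 sigma)
# dominates both the per-page and the overall anomaly score.
scores, overall = gaussian_page_scores(
    [0.30, 0.16],   # current interval frequencies
    [0.30, 0.10],   # historical means
    [0.02, 0.01])   # historical standard deviations
print(scores, overall)
```

The per-page scores are what makes localization possible: the chi-square test only says "this interval is anomalous", while the Gaussian model also says which pages are responsible.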
6. Reporting anomalies

warning 3
detection time: Sun Nov 16 19:27:00 PST 2003
start: Sun Nov 16 19:24:00 PST 2003
end: Sun Nov 16 21:05:00 PST 2003
anomaly score: 7.05

Most anomalous pages:
/landing.jsp           19.55
/landing_merchant.jsp  19.50
/mall_ctrl.jsp          3.69
/malltop.go             2.63
/mall.go                2.18
How long did it take you to read this?
7. Visualization
- uses the same features as anomaly detection
- anomalies are obvious
- exploits human pattern-matching
8. Failures at Ebates.com
- used real-world failures
- let the CTO of Ebates.com evaluate our results
- analyzed 6 failures
  - account page problem
  - broken sign-up page
  - 2 redirection problems
  - database overload
  - runaway DB query
9. Nov 2003 account page problem (1)
10. Nov 2003 account page problem (2)
[traffic visualizations at 9am and 1pm]
11. Oct 2003 broken signup page (1)
12. Oct 2003 broken signup page (2)
13. Oct 2003 broken signup page (3)
14. Summary of results
- November 2003 account page problem
  - 1st warning 16 hours earlier
  - 2nd warning 1 hour earlier, correctly localized the bad page!
- October 2003 broken signup page
  - noticed the problem 7 days earlier, correctly localized!
- July 2001 landing looping problem
  - warning 2 days earlier, correctly localized
- detected a failure they didn't tell us about
- detected three other significant anomalies
  - feedback: "these might have been important, but we didn't know about them. definitely useful if detected in real-time."
15. Evaluation is hard
- ideally, evaluate in terms of
  - accuracy, time to detect, false positive/negative rate
- need to reconstruct ground truth
  - need to know about all failures
  - when failure introduced/detected/resolved
  - root cause of failure
- operators at Ebates.com VERY helpful
  - analyzed system logs, email archives, chat logs
  - still don't know the full ground truth
- report advance warning time
16. Generalizing the approach
- can be applied to an arbitrary Web application
- analyzed traffic to the top 40 pages
  - captures 98% of traffic
  - easily extensible to more pages
- how much data do we need?
  - used logs from 3 web servers
  - same results for the account page problem with just 5% of data
  - about 12,000 requests a day
17. Open-source failure repository
18. Conclusion
- changes in user access patterns indicate problems
- detected 4 out of 6 problems earlier than Ebates.com
- synergy between visualization and automatic detection
  - builds the trust relationship
  - good visualization → cheaper false positives
  - don't handle false positives automatically; let the operator use his/her experience
- hard to evaluate detection and localization algorithms