Using HTTP Access Logs To Detect Application-Level Failures In Internet Services

Slides: 19
Provided by: peter262

Transcript and Presenter's Notes


1
Using HTTP Access Logs To Detect
Application-Level Failures In Internet Services
  • Peter Bodík, Greg Friedman, Lukas Biewald,
  • Helen Levine, George Candea, Kayur Patel,
  • Gilman Tolle, Jonathan Hui, Armando Fox,
  • Michael I. Jordan, David Patterson
  • UC Berkeley, Stanford University, Ebates.com

2
Introduction
  • problem
  • hard to detect/localize failures in Internet
    services
  • operators don't trust the statistical learning
    algorithms!
  • hard to verify
  • goal
  • build a real-time visualization tool for
    operators
  • quickly detect changes/anomalies
  • localize the cause of the change
  • based on user access patterns
  • evaluated on real-world failure data

3
Outline
  1. modeling user access patterns
  2. demo of the visualization tool
  3. examples of failures from Ebates.com
  4. results
  5. why evaluation is hard

4
Analyzing web site access patterns
  • rapid changes in access patterns indicate
    problems
  • broken links
  • users hitting Reload multiple times
  • access patterns captured in HTTP logs
  • detect anomalies
  • compare current traffic to the historic/normal
    traffic
  • model access frequencies to the top 40 pages
  • capture 98% of traffic to Ebates.com
  • don't need any instrumentation
  • visualize access patterns
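
A minimal sketch of how the page access frequencies on this slide could be extracted from HTTP logs. It assumes Common Log Format request lines; the regular expression, the function names, and the sample log lines are illustrative assumptions, not part of the original tool.

```python
import re
from collections import Counter

# Assumption: the request field looks like "GET /mall.go?x=1 HTTP/1.1",
# as in Common Log Format.
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def page_counts(log_lines):
    """Count hits per page, ignoring query strings."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            page = match.group(1).split("?", 1)[0]
            counts[page] += 1
    return counts


def top_pages(counts, n=40):
    """Return the n most-requested pages and the share of traffic they cover."""
    top = [page for page, _ in counts.most_common(n)]
    total = sum(counts.values())
    covered = sum(counts[page] for page in top)
    return top, (covered / total if total else 0.0)


def frequency_vector(counts, pages):
    """Hit counts for a fixed page list, so intervals can be compared."""
    return [counts.get(page, 0) for page in pages]


# Hypothetical log lines, for illustration only.
sample = [
    '1.2.3.4 - - [16/Nov/2003:19:24:01 -0800] "GET /landing.jsp?id=7 HTTP/1.1" 200 512',
    '1.2.3.5 - - [16/Nov/2003:19:24:02 -0800] "GET /mall.go HTTP/1.1" 200 2048',
    '1.2.3.4 - - [16/Nov/2003:19:24:03 -0800] "GET /landing.jsp HTTP/1.1" 200 512',
]
counts = page_counts(sample)
pages, coverage = top_pages(counts, n=40)
```

Computing one frequency vector per time interval over a fixed page list gives the feature that both the anomaly detection and the visualization can share.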

5
Anomaly detection
  • Chi-square test
  • compare page frequencies using Chi-square test
  • anomaly score = significance of the test
  • Naive Bayes approach
  • assume page frequencies independent
  • page frequency modeled as a Gaussian
  • anomaly score = 1 - Prob(current interval is
    normal)
  • can estimate anomaly score for each page
  • learn normal access pattern from the past
  • skip anomalous intervals
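
The two scoring ideas above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the chi-square part returns only the Pearson statistic (turning it into a significance level needs a chi-square distribution function, for example `scipy.stats.chi2.sf`), and the per-page Gaussian score is taken here to be the probability mass lying closer to the mean than the observed count. The slides do not specify how the per-page values are combined into the interval score.

```python
import math


def chi_square_statistic(current, historic):
    """Pearson chi-square statistic of current counts vs. historic proportions."""
    total_current = sum(current)
    total_historic = sum(historic)
    stat = 0.0
    for cur, hist in zip(current, historic):
        expected = total_current * hist / total_historic
        if expected > 0:
            stat += (cur - expected) ** 2 / expected
    return stat


def fit_gaussians(history, anomalous=frozenset()):
    """Per-page (mean, std) from past intervals, skipping flagged intervals."""
    rows = [row for i, row in enumerate(history) if i not in anomalous]
    n = len(rows)
    params = []
    for column in zip(*rows):
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        # Floor the std so a constant page does not cause division by zero.
        params.append((mean, max(math.sqrt(variance), 1e-9)))
    return params


def page_scores(current, params):
    """Per-page score in [0, 1]: P(|Z| < |z|) for the observed z-score."""
    return [
        math.erf(abs(x - mean) / (std * math.sqrt(2)))
        for x, (mean, std) in zip(current, params)
    ]


def interval_log_likelihood(current, params):
    """Sum of per-page Gaussian log densities (pages assumed independent)."""
    return sum(
        -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)
        for x, (mean, std) in zip(current, params)
    )


# Hypothetical counts for two pages over four past intervals;
# interval 3 is treated as already flagged anomalous and is skipped.
history = [[100, 10], [110, 12], [90, 8], [500, 9]]
params = fit_gaussians(history, anomalous={3})
scores = page_scores([100, 30], params)
```

Scoring each page separately is what makes localization possible: the pages with the highest scores are the ones to report to the operator.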

6
Reporting anomalies
warning 3
detection time: Sun Nov 16 19:27:00 PST 2003
start: Sun Nov 16 19:24:00 PST 2003
end: Sun Nov 16 21:05:00 PST 2003
anomaly score: 7.05
Most anomalous pages:
/landing.jsp 19.55
/landing_merchant.jsp 19.50
/mall_ctrl.jsp 3.69
/malltop.go 2.63
/mall.go 2.18
How long did it take you to read this?
7
Visualization
  • uses the same features as anomaly detection
  • anomalies are obvious
  • exploits human pattern-matching

8
Failures at Ebates.com
  • used real-world failures
  • let the CTO of Ebates.com evaluate our results
  • analyzed 6 failures
  • account page problem
  • broken sign-up page
  • 2 redirection problems
  • database overload
  • runaway DB query

9
Nov 2003 account page problem (1)
10
Nov 2003 account page problem (2)
9am
1pm
11
Oct 2003 broken signup page (1)
12
Oct 2003 broken signup page (2)
13
Oct 2003 broken signup page (3)
14
Summary of results
  • November 2003 account page problem
  • 1st warning: 16 hours earlier
  • 2nd warning: 1 hour earlier; correctly localized
    the bad page!
  • October 2003 broken signup page
  • noticed the problem 7 days earlier; correctly
    localized!
  • July 2001 landing looping problem
  • warning 2 days earlier; correctly localized
  • detected a failure they didn't tell us about
  • detected three other significant anomalies
  • feedback: "these might have been important, but
    we didn't know about them. definitely useful if
    detected in real-time."

15
Evaluation is hard
  • ideally, evaluate in terms of
  • accuracy, time to detect, false positive/negative
    rate
  • need to reconstruct ground truth
  • need to know about all failures
  • when failure introduced/detected/resolved
  • root cause of failure
  • operators at Ebates.com VERY helpful
  • analyzed system logs, email archives, chat logs
  • still don't know
  • report advance warning time
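
As a small illustration of the advance warning time metric: it is the gap between the tool's first warning and the moment the operators noticed the problem. The timestamps and the function name below are hypothetical placeholders, not values from the Ebates.com data.

```python
from datetime import datetime


def advance_warning_hours(first_warning, operators_detected):
    """Hours by which the first warning preceded detection by the operators."""
    return (operators_detected - first_warning).total_seconds() / 3600


# Hypothetical timestamps, for illustration only.
first_warning = datetime(2000, 1, 1, 8, 0)
operators_detected = datetime(2000, 1, 2, 0, 0)
lead_hours = advance_warning_hours(first_warning, operators_detected)
```

Unlike accuracy or false positive rate, this metric only needs the operators' detection time for the failures that are known, which is why it remains usable when the full ground truth cannot be reconstructed.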

16
Generalizing the approach
  • can be applied to arbitrary Web application
  • analyzed traffic to the top 40 pages
  • captures 98% of traffic
  • easily extensible to more pages
  • how much data do we need?
  • used logs from 3 web servers
  • same results for the account page problem with
    just 5% of data
  • about 12,000 requests a day

17
Open-source failure repository
18
Conclusion
  • changes in user access patterns indicate problems
  • detected 4 out of 6 problems earlier than
    Ebates.com
  • synergy between visualization and automatic
    detection
  • builds the trust relationship
  • good visualization ⇒ cheaper false positives
  • don't handle false positives automatically, let
    the operator use his/her experience
  • hard to evaluate detection and localization
    algorithms