Root Cause Analysis of Failures in Large-Scale Computing Environments

1
Root Cause Analysis of Failures in Large-Scale
Computing Environments
  • Naoya Maruyama (Tokyo Tech)
  • Alexander V. Mirgorodskiy (UW-Madison)
  • Barton P. Miller (UW-Madison)
  • Satoshi Matsuoka (Tokyo Tech / NII)

2
Background
  • Diagnosing faults in large-scale systems is hard.
  • Use of diverse software/hardware components
  • Often observed only on a production system
  • Non-deterministic behaviors caused by different
    orders of operations
  • Despite these complexities, traditional
    approaches are insufficient.
  • An interactive debugger does not scale with
    the number of processes/hosts
  • printf debugging is too ad hoc to be used in
    production systems

3
Assumption and Objectives
  • Systematic, well-focused fault diagnosis for
    large-scale computing environments
  • Automate processes as much as possible
  • Low false positive/negative rates
  • We assume
  • SPMD-style distributed systems
  • A failure is observed by other components and
    notified to the diagnosis engine

4
Our Idea
  • Narrowing down diagnosis steps by identifying
    behavioral differences between correct and
    incorrect executions, e.g.,
  • A function was called only when the program
    crashed.
  • A certain packet was never delivered to its
    destination when the program hung.
  • How?
  • Collects execution data of programs at run time
  • Identifies anomalies in the data
  • Normal behavior is treated as correct behavior
  • Anomalous behavior is treated as incorrect behavior
  • Compares the normal and anomalous behaviors

5
Current Achievements
  • Prototype implementation for distributed systems
    of SPMD style
  • Demonstration of systematic diagnosis of faults
    in a real production system

6
Overview of the Diagnosis Steps
  • Data Collection
  • Monitors and traces the execution data of a
    target system
  • Data Analysis
  • Identifies anomalies inside the trace
  • Reports the results to the analyst for further
    investigation

7
Data Collection
  • Collects function call traces
  • Captures control-flow behaviors
  • Can be extended to incorporate other types of
    behaviors, such as memory management operations,
    concurrency, and communication
  • How to collect the data?
  • Use spTracer, a lightweight dynamic
    instrumentation technique [Mirgorodskiy et al.,
    04]
  • Injects a tracing agent into a process of
    interest
  • The agent inserts trace statements at all
    function call sites
  • The statements generate log records with
    timestamps
  • Keep the trace in a shared-memory segment to
    retain the data even if the process crashes.
  • Manage the trace in a circular buffer
  • Keeps only the most recent trace of fixed length
  • Avoids unlimited growth of trace size
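
The circular-buffer idea above can be sketched in a few lines of Python. This is only an illustration of the "keep the most recent N records" policy: the class name `TraceBuffer` and its methods are hypothetical, and the shared-memory segment that lets the trace survive a crash is not modeled here.

```python
from collections import deque


class TraceBuffer:
    """Fixed-capacity trace log: once full, the oldest record is dropped."""

    def __init__(self, capacity):
        # deque(maxlen=...) discards items from the opposite end when full.
        self._records = deque(maxlen=capacity)

    def append(self, event, func_addr, timestamp):
        self._records.append((event, func_addr, timestamp))

    def snapshot(self):
        """Return the retained records, oldest first."""
        return list(self._records)


if __name__ == "__main__":
    buf = TraceBuffer(capacity=3)
    for ts in range(5):
        buf.append("ENTER", 0x819967C, ts)
    # Only the three most recent records remain.
    print(buf.snapshot())
```

The trade-off is that older history is lost, so the buffer length bounds how far back the analysis can look.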

8
Textual Representation of Call Traces
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746163258
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746163936
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746164571
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746165197
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746165828
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746166395
LEAVE func_addr 0x80de590 pid 5095 tid 4 timestamp 12131002746166938
ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746167573
LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746179202
ENTER func_addr 0x80de750 pid 5095 tid 4 timestamp 12131002746180027
ENTER func_addr 0x811b070 pid 5095 tid 4 timestamp 12131002746180691
ENTER func_addr 0x8138710 pid 5095 tid 4 timestamp 12131002746181359
LEAVE func_addr 0x8138710 pid 5095 tid 4 timestamp 12131002746185934
The timestamp field is the system cycle counter.
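
A minimal sketch of how records in this textual form could be parsed and turned into per-function cycle counts. The field layout is taken from the records shown on this slide; the function names `parse_trace` and `inclusive_cycles` are hypothetical, and the stack-matching rule for unpaired LEAVE records is an assumption, not spTracer's own behavior.

```python
import re
from collections import defaultdict

# One record: event, function address, pid, tid, cycle-counter timestamp.
RECORD = re.compile(
    r"(ENTER|LEAVE)\s+func_addr\s+(0x[0-9a-fA-F]+)\s+pid\s+(\d+)"
    r"\s+tid\s+(\d+)\s+timestamp\s+(\d+)"
)


def parse_trace(text):
    """Yield (event, func_addr, pid, tid, timestamp) tuples from trace text."""
    for m in RECORD.finditer(text):
        event, addr, pid, tid, ts = m.groups()
        yield event, int(addr, 16), int(pid), int(tid), int(ts)


def inclusive_cycles(records):
    """Sum cycles between matching ENTER/LEAVE pairs, per function address.

    Time spent in callees is included. A LEAVE without a matching ENTER
    (the buffer may begin in the middle of a call) is skipped.
    """
    stacks = defaultdict(list)  # (pid, tid) -> stack of (func_addr, enter_ts)
    totals = defaultdict(int)
    for event, addr, pid, tid, ts in records:
        stack = stacks[(pid, tid)]
        if event == "ENTER":
            stack.append((addr, ts))
        elif stack and stack[-1][0] == addr:
            _, start = stack.pop()
            totals[addr] += ts - start
    return dict(totals)


if __name__ == "__main__":
    sample = (
        "ENTER func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746163258\n"
        "LEAVE func_addr 0x819967c pid 5095 tid 4 timestamp 12131002746163936\n"
    )
    print(inclusive_cycles(parse_trace(sample)))
```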
9
Visualizing Traces with Jumpshot
  • Traces can be exported to the SLOG format to
    visualize with Jumpshot

Multiple rows represent multiple threads/processes
Each rectangle means a function invocation
Nested rectangles mean nested function calls.
10
Data Analysis
  • Two-step analysis
  • Finding the most anomalous process (trace)
  • Finding the most anomalous function
  • Presents two techniques
  • Identifying fail-stop anomalies
  • Identifying non-fail-stop anomalies

11
Data Analysis: Identifying Fail-Stop Anomalies
  • Finds the process that stopped generating trace
    records first
  • If it ended substantially earlier than the others ->
    a fail-stop anomaly
  • If traces ended at similar times -> look for
    non-fail-stop anomalies

[Figure: trace end time for each trace, in the fail-stop case and in the non-fail-stop case]
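
The fail-stop check can be sketched as follows. The slide does not say how "substantially earlier" is decided, so the threshold rule below (the gap to the second-earliest trace compared with the spread of the remaining traces, scaled by a hypothetical `gap_factor`) is an assumption made only for illustration. It also assumes the last timestamps of different traces are comparable.

```python
def find_fail_stop(last_timestamps, gap_factor=3.0):
    """Return the id of the trace that ended first, if it ended much earlier.

    last_timestamps maps a trace id to the timestamp of its last record.
    Returns None when the traces ended at similar times, in which case the
    non-fail-stop analysis would apply instead.
    """
    if len(last_timestamps) < 3:
        return None  # too few traces to judge what "similar" means
    ordered = sorted(last_timestamps.items(), key=lambda kv: kv[1])
    first_id, first_ts = ordered[0]
    second_ts = ordered[1][1]
    spread = ordered[-1][1] - second_ts  # spread of all the other traces
    gap = second_ts - first_ts           # how far ahead the first one stopped
    if gap > gap_factor * max(spread, 1):
        return first_id
    return None


if __name__ == "__main__":
    ends = {"n1": 1000, "n2": 1005, "n3": 1010, "n4": 200}
    print(find_fail_stop(ends))
```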
12
Data Analysis: Identifying Non-Fail-Stop Anomalies
  • Apply a distance-based anomaly detection
    technique
  • Define a distance metric between each pair of
    traces
  • Define a trace suspect score
  • Report traces with highest suspect scores

13
Defining the Distance Metric
  • Say there are only two functions, func_A and
    func_B, and three traces, trace_X, trace_Y, trace_Z

[Figure: normalized time spent in each function; axes func_A and func_B; points trace_X, trace_Y, and trace_Z; tick values 0, 0.4, 0.5, 0.6, 1.0]
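
A minimal sketch of a distance between traces, assuming each trace is summarized as a vector of normalized time per function and that the distance is the Manhattan (L1) distance between those vectors. The choice of metric and the coordinates used for trace_X, trace_Y, and trace_Z are assumptions for illustration; the coordinates are guessed from the tick values in the figure.

```python
def profile(time_per_function):
    """Normalize raw per-function times so the components sum to 1."""
    total = sum(time_per_function.values())
    if total <= 0:
        raise ValueError("trace has no recorded time")
    return {fn: t / total for fn, t in time_per_function.items()}


def distance(p, q):
    """Manhattan distance between two profiles (an assumed metric)."""
    functions = set(p) | set(q)
    return sum(abs(p.get(fn, 0.0) - q.get(fn, 0.0)) for fn in functions)


if __name__ == "__main__":
    # Hypothetical coordinates, not data from the study.
    trace_X = {"func_A": 0.5, "func_B": 0.5}
    trace_Y = {"func_A": 0.6, "func_B": 0.4}
    trace_Z = {"func_A": 0.0, "func_B": 1.0}
    print(distance(trace_X, trace_Y))  # small: similar behavior
    print(distance(trace_X, trace_Z))  # large: trace_Z behaves differently
```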
14
Defining the Suspect Score
[Figure: points g and h with their suspect scores s(g) and s(h)]
  • Common behavior = normal
  • Suspect score s(h) = distance to the nearest
    neighbor
  • Report the process with the highest s to the analyst
  • h is in the big mass, s(h) is low, h is normal
  • g is a single outlier, s(g) is high, g is an
    anomaly
  • What if there is more than one anomaly?

15
Defining the Suspect Score
[Figure: computing the score s_k(g) using k = 2; points g and h]
  • Suspect score s_k(h) = distance to the k-th
    neighbor
  • Exclude the (k-1) closest neighbors
  • Sensitivity study: k = NumProcesses/4 works well
  • Represents distance to the big mass
  • h is in the big mass, k-th neighbor is close,
    s_k(h) is low
  • g is an outlier, k-th neighbor is far, s_k(g) is
    high
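
The k-th-neighbor score can be sketched as below. The profiles are made-up two-function vectors, and the Manhattan distance is an assumed metric. The example shows why k = 1 misses a pair of similar anomalies while k = 2 exposes them.

```python
def manhattan(p, q):
    """L1 distance between two equal-length profile vectors (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def suspect_scores(profiles, k):
    """Map each trace id to the distance to its k-th nearest other trace.

    Requires 1 <= k <= len(profiles) - 1.
    """
    scores = {}
    for tid, p in profiles.items():
        dists = sorted(
            manhattan(p, q) for other, q in profiles.items() if other != tid
        )
        scores[tid] = dists[k - 1]
    return scores


if __name__ == "__main__":
    # Hypothetical profiles: four traces in the big mass, two similar outliers.
    profiles = {
        "h1": (0.50, 0.50), "h2": (0.52, 0.48),
        "h3": (0.48, 0.52), "h4": (0.50, 0.50),
        "g1": (0.90, 0.10), "g2": (0.88, 0.12),
    }
    for k in (1, 2):
        scores = suspect_scores(profiles, k)
        worst = max(scores, key=scores.get)
        print(k, worst, round(scores[worst], 2))
```

With k = 1 the two outliers vouch for each other and score low; with k = 2 each outlier's score becomes its distance to the big mass.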

16
Defining the Suspect Score
[Figure: points g and h, with score s_k(g)]
  • Anomalous means unusual, but unusual does not
    always mean anomalous!
  • E.g., MPI master is different from all workers
  • Would be reported as an anomaly (false positive)
  • Distinguish false positives from true anomalies
  • With knowledge of system internals: manual
    effort
  • With previous execution history: can be
    automated

17
Defining the Suspect Score
[Figure: points g and h, and a known-normal point n]
  • Add traces from a known-normal previous run
  • One-class classification
  • Suspect score s_k(h) = distance to the k-th trial
    neighbor or to the 1st known-normal neighbor,
    whichever is closer
  • Distance to the big mass or known-normal behavior
  • h is in the big mass, k-th neighbor is close,
    s_k(h) is low
  • g is an outlier, normal node n is close, s_k(g) is
    low
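
A sketch of the score with known-normal reference traces, under the reading that the score is the smaller of the two distances (to the k-th trial neighbor and to the nearest known-normal trace). The profiles and the Manhattan metric are assumptions for illustration; the "master" trace stands in for the MPI master example.

```python
def manhattan(p, q):
    """L1 distance between two equal-length profile vectors (assumed metric)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def suspect_scores_with_reference(trial, normal, k):
    """Score each trial trace against the trial run and known-normal traces.

    trial:  dict of trace id -> profile from the run under diagnosis
    normal: list of profiles from a previous known-normal run
    """
    scores = {}
    for tid, p in trial.items():
        to_trial = sorted(
            manhattan(p, q) for other, q in trial.items() if other != tid
        )
        kth_trial = to_trial[k - 1]
        nearest_normal = min(
            (manhattan(p, q) for q in normal), default=float("inf")
        )
        scores[tid] = min(kth_trial, nearest_normal)
    return scores


if __name__ == "__main__":
    # Hypothetical run: three similar workers and one master that differs.
    trial = {
        "w1": (0.50, 0.50), "w2": (0.52, 0.48), "w3": (0.48, 0.52),
        "master": (0.90, 0.10),
    }
    normal = [(0.89, 0.11)]  # a master profile from a known-normal run
    print(suspect_scores_with_reference(trial, [], k=1)["master"])
    print(suspect_scores_with_reference(trial, normal, k=1)["master"])
```

Without reference traces the master looks like an outlier; with a matching known-normal trace its score drops, which removes the false positive.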

18
Finding Anomalous Function
  • Fail-stop problems
  • Failure is in the last function invoked
  • Non-fail-stop problems
  • Find why process h was marked as an anomaly
  • Function with the highest contribution to s(h)
  • s(h) = d(h, g), where g is the chosen neighbor
  • anomFn = arg max_i d_i, where d_i is the contribution
    of function i to d(h, g)
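
For the non-fail-stop case, picking the function with the largest contribution can be sketched as below. It assumes the distance is a sum of per-function absolute differences, so each function's contribution is one component of the difference between the two profiles. The function names and values are hypothetical.

```python
def most_anomalous_function(h, g):
    """Return (function, contribution) for the largest component of |h - g|.

    h is the profile of the anomalous trace and g the profile of the
    neighbor that determined its suspect score.
    """
    functions = set(h) | set(g)
    deltas = {fn: abs(h.get(fn, 0.0) - g.get(fn, 0.0)) for fn in functions}
    fn = max(deltas, key=deltas.get)
    return fn, deltas[fn]


if __name__ == "__main__":
    # Made-up profiles, for illustration only.
    h = {"func_A": 0.20, "func_B": 0.10, "func_C": 0.70}
    g = {"func_A": 0.45, "func_B": 0.45, "func_C": 0.10}
    print(most_anomalous_function(h, g))
```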
19
Experimental Study: SCore on TitechGrid
  • TitechGrid
  • 129-node PC cluster at Tokyo Institute of
    Technology
  • Serving as a production system for over a year
  • SCore v5.4 is operated in the multi-user mode
  • The scored daemons are connected in a ring with
    the sc_watch process
  • Each process in the ring sends patrol messages
    to the next daemon
  • If sc_watch receives no patrol messages for 10
    minutes, it kills and restarts all the daemons
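
The watchdog rule in the last bullet can be illustrated with a small sketch. This is not SCore code: the class `Watchdog` and its methods are hypothetical, time is passed in explicitly as seconds, and the actual kill-and-restart action is replaced by a counter.

```python
PATROL_TIMEOUT_SECONDS = 10 * 60  # the 10-minute limit from the slide


class Watchdog:
    """Tracks the last patrol message and flags a restart after a timeout."""

    def __init__(self, now):
        self.last_patrol = now
        self.restarts = 0

    def on_patrol_message(self, now):
        self.last_patrol = now

    def check(self, now):
        """Return True if the timeout expired and a restart was triggered."""
        if now - self.last_patrol >= PATROL_TIMEOUT_SECONDS:
            self.restarts += 1  # stand-in for killing and restarting daemons
            self.last_patrol = now
            return True
        return False


if __name__ == "__main__":
    w = Watchdog(now=0)
    w.on_patrol_message(now=100)
    print(w.check(now=600))  # 500 s since the last patrol message
    print(w.check(now=700))  # 600 s since the last patrol message
```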

20
Applying the Diagnosis System to SCore
  • Injects the tracing agents into all scored daemons
  • Instruments sc_watch to save in-memory traces
    when the daemons are being killed
  • Identify the anomalous trace
  • Identify the anomalous function/call path

21
Finding the Host
[Figure: suspect score for each host]
  • Host n129 is unusual: different from the others
  • Host n129 is anomalous: not present in previous
    known-normal runs
  • Host n129 is a new anomaly: not present in
    previous known-faulty runs

22
Finding the Cause
[Figure: call graph showing output_job_status, score_write_short, score_write, and __libc_write]
  • Call chain with the highest contribution to the
    suspect score (output_job_status ->
    score_write_short -> score_write -> __libc_write)
  • Tries to output a log message to the scbcast
    process
  • Writes to the scbcast process kept blocking for
    10 minutes
  • scbcast stopped reading data from its socket
    (bug!)
  • scored did not handle it well (spun in an
    infinite loop) (bug!)

23
Related Work
  • Magpie [Barham, et al., 04]
  • Diagnoses distributed systems that serve HTTP
    requests
  • Components include web front-ends, backend
    databases, etc.
  • For errors on incoming HTTP requests, locates
    where the errors happened by looking at the request
    log on each component
  • Applies cluster analysis to the request logs
  • PinPoint [Chen, et al., 02]
  • Modified JBoss to record some of the events
    associated with a request
  • Such events include send/recv and exception handling
  • Finds correlations between request failures and
    the events using cluster analysis

Both techniques are well suited to collections of
small request paths, but not to daemon-type
processes like SCore.
24
Conclusion
  • Summary
  • Proposed systematic diagnosis methods for
    distributed systems
  • Demonstrated efficacy of the methods on a real
    production system
  • Next Steps
  • Application study with a wider range of systems
  • Richer execution data for smarter analysis

25
Discussion
  • Omitted

26
References
  • Omitted; refer to the paper

27
Related Work
  • Similar approaches to similar problems
  • Different approaches to similar problems
  • Similar approaches to different problems

28
Overview of the Diagnosis Steps
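[Figure: flowchart of the diagnosis steps]
  • Step 1: Tracing with spTracer, producing the trace set
  • Step 2: Finding the anomalous trace
  • Fail-stop: earliest last timestamp
  • Non-fail-stop: has reference traces?
  • Yes: supervised trace ranking based on suspect score
  • No: unsupervised trace ranking based on suspect score
  • Step 3: Finding the anomalous function in the anomalous trace
  • Fail-stop: last trace entry
  • Non-fail-stop: max component of the delta vector
  • Results go to the analyst, with trace visualization
    with Jumpshot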
29
Overview of the Diagnosis Steps
[Figure: processes (Proc) produce execution data as logs (Log); one process crashes]