Connecting the Dots: Using Runtime Paths for Macro Analysis

Learn more at: http://roc.cs.berkeley.edu



1
Connecting the Dots: Using Runtime Paths for
Macro Analysis

Mike Chen
mikechen_at_cs.berkeley.edu
http://pinpoint.stanford.edu

2
Motivation

Divide and conquer, layering, and replication are
fundamental design principles
e.g. Internet systems, P2P systems, and sensor
networks
Execution context is dispersed throughout the
system
=> difficult to monitor and debug
Lots of existing low-level tools that help with
debugging individual components, but not a
collection of them
Much of the system is in how the components are
put together
Observation: a widening gap between the systems
we are building and the tools we have

3
Current Approach
[Diagram: Apache, Java Bean, and Database components]
4
Current Approach

Micro analysis tools, like code-level debuggers
(e.g. gdb) and application logs, offer details
of each individual component
Scenario:
A user reports request A1 failed
You try the same request, A2, but it works fine
What to do next?

[Diagram: Apache, Java Bean, and Database components]
5
Macro Analysis

Macro analysis exploits non-local context to
improve reliability and performance
Performance examples: Scout, ILP, Magpie
Statistical view is essential for large, complex
systems
Analogy: micro analysis allows you to understand
the details of an individual honeybee; macro
analysis is needed to understand how the bees
interact to keep the beehive functioning

6
Observation

Systems have a single system-wide execution path
associated with each request
E.g. request/response, one-way messages
Scout, SEDA, Ninja use paths to specify how to
service requests
Our philosophy:
Use only dynamic, observed behavior
Application-independent techniques

7
Our Approach

Use runtime paths to connect the dots!
dynamically captures the interactions and
dependencies between components
look across many requests to get the overall
system behavior
more robust to noise
Components are only partially known (gray
boxes)

[Diagram: Apache, Java Bean, and Database components]
8
Our Approach

Applicable to a wide range of systems.

9
Open Challenges in Systems Today

Deducing system structure
manual approach is error-prone
static analysis doesn't consider resources
Detecting application-level failures
often don't exhibit lower-level symptoms
Diagnosing failures
failures may manifest far from the actual faults
multi-component faults
Goal: reduce time to detection, recovery,
diagnosis, and repair

10
Talk Outline

Motivation
Model and architecture
Applying macro analysis
Future directions

11
Runtime Paths

Instrument code to dynamically trace requests
through a system at the component level
record call path + the runtime properties
e.g. components, latency, success/failure, and
resources used to service each request
Use statistical analysis to detect and diagnose
problems
e.g. data mining, machine learning, etc.
Runtime analysis tells you how the system is
actually being used, not how it may be used
Complements existing micro analysis tools

12
Architecture

Tracer
Tags each request with a unique ID, and carries
it with the request throughout the system
Report observations (component name + resource +
performance properties) for each component
Aggregator + Repository
Reconstructs paths and stores them
Declarative Query Engine
Supports statistical queries on paths
Data mining and machine learning routines
Visualization
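
The aggregator step can be sketched roughly as follows. This is a minimal illustration, not the Pinpoint implementation: the Observation record and its field names are hypothetical, and each request is assumed to report one observation per component with a sequence number.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One report from the tracer (field names are illustrative assumptions)."""
    request_id: str
    seq_num: int
    component: str
    host: str
    latency_ms: float
    success: bool


def reconstruct_paths(observations):
    """Group observations by request ID and order each group by sequence number."""
    by_request = defaultdict(list)
    for obs in observations:
        by_request[obs.request_id].append(obs)
    return {
        request_id: sorted(group, key=lambda o: o.seq_num)
        for request_id, group in by_request.items()
    }
```

Observations may arrive out of order from different components, which is why the sketch sorts by sequence number rather than arrival time.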

[Diagram: Tracer, Aggregator, Path Repository, Query Engine, Visualization, Developers/Operators]
13
Request Tracing

Challenge: maintaining an ID with each request
throughout the system
Tracing is platform-specific but can be
application-generic and reusable across
applications
2 classes of techniques
Intra-thread tracing
Use per-thread context to store request ID (e.g.
ThreadLocal in Java)
ID is preserved if the same thread is used to
service the request
Inter-thread tracing
For extensible protocols like HTTP, inject new
headers that will be preserved (e.g. REQ_ID: xx)
Modify RPC to pass request ID under the covers
Piggyback onto messages
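
A minimal sketch of both techniques, written in Python rather than Java: threading.local plays the role of Java's ThreadLocal, and the REQ_ID header is injected into a plain dict of headers. The helper names are hypothetical and the sketch assumes one request per thread at a time.

```python
import threading
import uuid

# Per-thread storage for the current request ID (intra-thread tracing).
_context = threading.local()


def start_request(request_id=None):
    """Tag the request handled on this thread with a unique ID."""
    _context.request_id = request_id or uuid.uuid4().hex
    return _context.request_id


def current_request_id():
    """Return the ID stored for this thread, or None if no request is active."""
    return getattr(_context, "request_id", None)


def outgoing_headers(headers=None):
    """Copy the headers and inject REQ_ID so the next hop can continue the trace."""
    tagged = dict(headers or {})
    request_id = current_request_id()
    if request_id is not None:
        tagged["REQ_ID"] = request_id
    return tagged


def handle_incoming(headers):
    """On the receiving side, reuse the caller's ID if present, else create one."""
    return start_request(headers.get("REQ_ID"))
```

The ID survives only while the same thread services the request, so every hand-off to another thread, process, or machine needs an explicit carrier such as the injected header.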

14
Talk Outline

Motivation
Model and architecture
Applying macro analysis
Inferring system structure
Detecting application-level failures
Diagnosing failures
Future directions

15
Inferring System Structure

Key idea: paths directly capture application
structure
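
One way to read this idea, as a sketch: treat each observed path as an ordered list of component names and count the caller-to-callee edges across requests. Real paths may branch like call trees; the linear-path assumption and the component names in the usage note are illustrative only.

```python
from collections import Counter


def infer_structure(paths):
    """Count how often each caller -> callee edge appears across observed paths."""
    edges = Counter()
    for path in paths:
        for caller, callee in zip(path, path[1:]):
            edges[(caller, callee)] += 1
    return edges
```

For two requests such as ["Apache", "CartBean", "Database"] and ["Apache", "CatalogBean", "Database"], the result contains four distinct edges, each seen once.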

[Figure: 2 requests]
16
Indirect Coupling of Requests

Key idea: paths associate requests with internal
state
Trace requests from web server to database
Parse client-side SQL queries to get sharing of
db tables
Straightforward to extend to more fine-grained
state (e.g. rows)
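
A rough sketch of the table-sharing step, assuming the traced SQL text is available per request type. The regular expression is deliberately naive (it only looks at the identifier after FROM, JOIN, INTO, or UPDATE) and stands in for a real SQL parser; the function names are hypothetical.

```python
import re
from collections import defaultdict

# Naive matcher: the table name directly after FROM, JOIN, INTO, or UPDATE.
_TABLE_RE = re.compile(
    r"\b(?:from|join|into|update)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE
)


def tables_in(sql):
    """Return the lower-cased table names mentioned in one SQL statement."""
    return {name.lower() for name in _TABLE_RE.findall(sql)}


def shared_tables(traced_queries):
    """Map each table to the request types touching it, keeping only shared tables.

    traced_queries: iterable of (request_type, sql_text) pairs taken from paths.
    """
    users = defaultdict(set)
    for request_type, sql in traced_queries:
        for table in tables_in(sql):
            users[table].add(request_type)
    return {table: types for table, types in users.items() if len(types) > 1}
```

Two request types that never call each other can still be coupled through a table one writes and the other reads, which is the kind of dependency this mapping exposes.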

[Figure: database tables and request types]
17
Failure Detection and Diagnosis

Detecting application-level failures
Key idea: paths change under failures => detect
failures via path changes.
Diagnosing failures
Key idea: bad paths touch root cause(s). Find
common features.

18
Future Directions

Key idea: violations of macro invariants are signs
of buggy implementation or intrusion
Message paths in P2P and sensor networks
a general mechanism to provide visibility into
the collective behavior of multiple nodes
micro or static approaches by themselves don't
work well in dynamic, distributed settings
e.g. algorithms have upper bounds on the # of
hops
Although hop count violation can be detected
locally, paths help identify nodes that route
messages incorrectly
e.g. detecting nodes that are slow or corrupt msgs

19
Conclusion

Macro analysis fills the need when monitoring and
debugging systems where local context is of
insufficient use
Runtime path-based approach dynamically traces
request paths and statistically infers macro
properties
A shared analysis framework that is reusable
across many systems
Simplifies the construction of effective tools
for other systems and the integration with
recovery techniques like RR
http://pinpoint.stanford.edu
Paper includes a commercial example from Tellme!
(thanks to Anthony Accardi and Mark Verber)

20
Backup Slides
22
Current Approach

Micro analysis tools, like code-level debuggers
(e.g. gdb) and application logs, offer details
of each individual component

[Diagram: Apache, Java Bean, and Database components annotated with per-component variable values (e.g. X 1, Y 2) as seen in gdb]
23
Related Work

Commercial request tracing systems
Announced in 2002, a few months after Pinpoint
was developed
PerformaSure and AppAssure focus on performance
problems.
IntegriTea captures and plays back failure
conditions.
Focus on individual requests rather than overall
behavior, and on recreating the failure
condition.
Extensive work in event/alarm correlation, mostly
in the context of network management (i.e. IP)
Don't directly capture relationships between
events
Rely on human knowledge or use machine learning
to suppress alarms.
Distributed debuggers
PDT, P2D2, TotalView, PRISM, pdbx
Aggregate views from multiple components, but do
not capture relationships and interactions between
components
Comparative debuggers: Wizard, GUARD
Dependency models
Most are statically generated and are likely to
be inconsistent.
Brown et al. take an active, black-box approach,
but it is invasive. Candea et al. dynamically trace
failure propagation.

24
1. Detecting Failures using Anomaly Detection

Key idea: paths change under failures => detect
failures via path changes
Anomalies
Unusual paths
Changes in distribution
Changes in latency/response time
Examples
Error paths are shorter.
User behavior changes under failures
Retry a few times, then give up
Implement as long running queries (i.e. diff)
Challenges
detecting application-level failures
comparing sets of paths
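
One simple way to "diff" two sets of paths, as a sketch: compare how often each path shape (the tuple of components visited) occurs in a baseline window versus the current window. The threshold value and function names are assumptions for illustration, not part of the original system.

```python
from collections import Counter


def shape_distribution(paths):
    """Normalized frequency of each path shape (tuple of component names)."""
    counts = Counter(tuple(path) for path in paths)
    total = sum(counts.values())
    return {shape: n / total for shape, n in counts.items()}


def diff_paths(baseline, current, threshold=0.2):
    """Return (unseen, shifted) path shapes.

    unseen:  shapes in the current window that never occurred in the baseline.
    shifted: baseline shapes whose share of traffic moved by more than threshold.
    """
    base = shape_distribution(baseline)
    cur = shape_distribution(current)
    unseen = {shape for shape in cur if shape not in base}
    shifted = {
        shape for shape in base
        if abs(cur.get(shape, 0.0) - base[shape]) > threshold
    }
    return unseen, shifted
```

A sudden rise in short paths, for example requests that stop at the first component, would show up here both as an unseen shape and as a drop in the share of the normal shape.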

25
2. Root-cause Analysis

Key idea: all bad paths touch the root cause; find
common features
Challenge: a small set of known bad paths and a
large set of maybes
Ideally want to correlate and rank all
combinations of feature sets
E.g. association rule mining
May get false alarms because the root cause may
not be one of the features
Automatic generation of dynamic functional and
state dependency graphs
Helps developers and operators understand
inter-component dependency and inter-request
dependency
Input to recovery algorithms that use dependency
graphs
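
As a much simpler stand-in for association rule mining, the sketch below ranks single components by the fraction of requests touching them that failed. It only considers individual components, not combinations of features, and the input format is an assumption.

```python
from collections import Counter


def rank_components(requests):
    """Rank components by the failure rate of the requests that used them.

    requests: iterable of (components_used, failed) pairs.
    Returns a list of (component, failure_rate), highest rate first.
    """
    used = Counter()
    failed_with = Counter()
    for components, failed in requests:
        for component in set(components):
            used[component] += 1
            if failed:
                failed_with[component] += 1
    scores = {c: failed_with[c] / used[c] for c in used}
    # Sort by descending failure rate, then by name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A component that appears in every request, such as the front-end web server, gets a middling score here, while a component used only by failing requests rises to the top.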

26
3. Verifying Macro Invariants

Key idea: violations of high-level invariants are
signs of intrusion or bugs
Example: Peer auditing
Problem: a small number of faulty or malicious
nodes can bring down the system
Corruption should be statistically visible in
your behavior
look for nodes that delay or corrupt messages or
route messages incorrectly
Apply root-cause analysis to locate the
misbehaving peers
Some distributed auditing is necessary
Example: P2P implementation verification
Problem: are messages delivered as specified by
the algorithms?
Detect extra hops, loops, and verify that the
paths are correct
Can implement as a query
select length from paths where (length > log2(N))
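
The same check can be sketched outside a query engine. Here a path is assumed to be the ordered list of node names a message visited, and "length" is interpreted as the hop count (nodes minus one); both are assumptions, as are the function names.

```python
import math


def hop_bound_violations(paths, n_nodes):
    """Return (path, hops) pairs whose hop count exceeds log2(N)."""
    bound = math.log2(n_nodes)
    violations = []
    for path in paths:
        hops = len(path) - 1
        if hops > bound:
            violations.append((path, hops))
    return violations


def has_loop(path):
    """A message that visits the same node twice has looped."""
    return len(set(path)) < len(path)
```

Because the full path is recorded, a violation points at the specific nodes involved rather than just signalling that some message took too long.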

27
4. Detecting Single Point of Failure

Key idea: paths converge on a single point of
failure
Useful for finding out what to replicate to
improve availability
P2P example
Many P2P systems rely on overlay networks, which
typically are networks built on top of the IP
infrastructure.
It's common for several overlay links to fail
together if they depend on a shared physical IP
link that failed
Implement as a query
intersect edge.IP_links from paths
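
One possible reading of this query, as a sketch: each path is a list of overlay edges, each edge carries the set of physical IP links it depends on, and the result is the set of IP links that every path relies on. The data layout is an assumption made for illustration.

```python
def shared_ip_links(paths):
    """Return the IP links common to all paths.

    paths: list of paths; each path is a list of overlay edges;
           each overlay edge is a set of physical IP link names.
    """
    # Collapse each path into the set of IP links used anywhere along it.
    per_path = [set().union(*path) for path in paths]
    if not per_path:
        return set()
    return set.intersection(*per_path)
```

A non-empty result names physical links whose failure would break every observed path, which makes them candidates for added redundancy.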

[Diagram: paths through nodes A, B, C, D, E, F, and G]
28
5. Monitoring of Sensor Networks

An emerging area with primitive tools
Key idea: use paths to reconstruct topology and
membership
Example
Membership
select unique node from paths
Network topology
for directed information dissemination
Challenge: limited bandwidth
Can record a (random) subset of the nodes for
each path, then statistically reconstruct the
paths
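
A small sketch of the sampling idea, limited to membership: each message records only a random subset of the nodes it passed through, and the union over many messages recovers the node set. Reconstructing full paths statistically is not attempted here, and the function names and sample size are assumptions.

```python
import random


def record_sample(path, k, rng):
    """Record only k randomly chosen nodes of a path to save bandwidth."""
    return rng.sample(path, min(k, len(path)))


def membership(recorded_paths):
    """Python equivalent of: select unique node from paths."""
    return {node for path in recorded_paths for node in path}
```

With enough messages, every node on a frequently used route is likely to be sampled at least once, so membership converges even though no single message carries its whole path.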

29
Macro Analysis

Look across many requests to get the overall
system behavior
more robust to noise

Macro Analysis
Request 1 Request 2 Request 3 Request 4
Component A X X X
Component B X
Component C X X
30
Properties of Network Systems

Web services, P2P systems, and sensor networks
can have tens of thousands of nodes each running
many application components
Continuous adaptation provides high availability,
but also makes it difficult to reproduce and
debug errors
Constant evolution of software and hardware

31
Motivation

Difficult to understand and debug network systems
e.g. Clustered Internet systems, P2P systems and
sensor networks
Composed of many components
Systems are becoming larger, more dynamic, and
more distributed
Workload is unpredictable and impractical to
simulate
Unit testing is necessary but insufficient.
Components break when used together under real
workload
Don't have tools that capture the interactions
between components and the overall behavior
Existing debugging tools and application-level
logs only do micro analysis

32
Macro vs Micro Analysis
Resolution
Macro analysis: component. Complements micro analysis tools.
Micro analysis: line or variable.
Overhead
Macro analysis: low. Can use it in actual deployment.
Micro analysis: high. Typically not used in deployment other than application logs.
33
What's a dynamic path?

A dynamic path is the (control flow + runtime
properties) of a request
Think of it as a stack trace across
process/machine boundaries with runtime
properties
Dynamically constructed by tracing requests
through a system
Runtime properties
Resources (e.g. host, version)
Performance properties (e.g. latency)
Arguments (e.g. URL, args, SQL statement)
Success/failure

[Diagram: a request's path through components A to F. Example observation for component A: RequestID 1, Seq Num 1, Name A, Host xx, Latency 10ms, Success true, ..]
34
Related Work

Micro debugging tools
RootCause provides extensible logging of method
calls and arguments.
Diduce looks for inconsistencies in variable
usage.
Complements macro analysis tools.
Languages for monitoring
InfoSpect looks for inconsistencies in system
state using a logic language
Network flow-based monitoring
RTFM and Cisco NetFlow classify and record
network flows
Statistical and data mining languages
S, DMQL, WebML

35
Visualization Techniques

Tainted paths: mark all flows that have a certain
property (e.g. failed or slow) with a distinct
color and overlay them on the graph
Detecting performance bottlenecks: look for
replicated nodes that have different colors
Detecting anomalies: look for missing edges and
unknown paths

36
Pinpoint Framework
[Diagram: Components; Requests; Communications Layer (Tracing, Internal F/D); Detected Faults; Logs]
37
Experimental Setup

Demo app: J2EE Pet Store
e-commerce site w/30 components
Load generator
replay trace of browsing
Approx. TPC-W WIPSo load (50% ordering)
Fault injection parameters
Trigger faults based on combinations of used
components
Inject exceptions, infinite loops, null calls
55 tests with single-component faults and
interaction faults
5-min runs of a single client (J2EE server
limitation)

38
Application Observations

large number of tightly coupled components that
are always used together

# of components used in a dynamic web page
request
median 14, min 6, max 23

39
Metrics

Precision = C/P
Recall = C/A
Accuracy: whether all actual faults are correctly
identified (recall = 100%)
boolean measure
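
A worked sketch of these metrics. The slide does not define C, P, and A; the code assumes C is the number of correctly identified faulty components, P the number predicted, and A the number of actual faults. The handling of empty sets is also an assumption.

```python
def evaluate(predicted, actual):
    """Return (precision, recall, accurate) for one fault-injection test.

    Assumed definitions: C = |predicted & actual|, P = |predicted|, A = |actual|.
    """
    predicted, actual = set(predicted), set(actual)
    correct = predicted & actual
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(actual) if actual else 1.0
    accurate = recall == 1.0  # boolean: every actual fault was identified
    return precision, recall, accurate
```

For example, predicting {A, B, C} when the actual faults are {A, B} gives precision 2/3 and recall 1, so the result counts as accurate even though one prediction was a false alarm.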

40
4 Analysis Techniques

Pinpoint: clusters of components that
statistically correlate with failures
Detection: components where Java exceptions were
detected
union across all failed requests
similar to what an event monitoring system
outputs
Intersection: intersection of components used in
failed requests
Union: union of all components used in failed
requests
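
The Intersection and Union baselines reduce to set operations over the failed requests, sketched below. The input format is an assumption; the Detection baseline needs per-component exception reports and the Pinpoint technique needs clustering, so neither is shown.

```python
def failed_component_sets(requests):
    """Return (intersection, union) of components used by failed requests.

    requests: iterable of (components_used, failed) pairs.
    """
    failed = [set(components) for components, was_failed in requests if was_failed]
    if not failed:
        return set(), set()
    return set.intersection(*failed), set.union(*failed)
```

Intersection tends to be precise but can miss faults when different failures involve different components, while Union catches everything at the cost of many false alarms.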

41
Results

Pinpoint has high accuracy with relatively high
precision

42
Pinpoint Prototype Limitations

Assumptions
client requests provide good coverage over
components and combinations
requests are autonomous (don't corrupt state and
cause later requests to fail)
Currently can't detect the following:
faults that only degrade performance
faults due to pathological inputs
Single-node only

43
Current Status

Simple graph visualization

44
Proposed Research

3 classes of large network systems
Clustered Internet systems
Tiered architecture, high bandwidth, many
replicas
Peer-to-peer (P2P) systems, including sensor
networks
Widely distributed nodes, dynamic membership
Sensor networks
Limited storage, processing, and bandwidth.

45
P2P Systems Tracing

Trace messages by piggybacking the current node
name to the messages
Tracing overhead
Assume 32 bits per node name and a very
conservative log2(N) hops for each msg
Data overhead is 40 for a 1500-byte message in a
10^6-node system

46
P2P Systems Implementation Verification

Current debugging techniques: lots of printf()s
on each node and manually correlate the paths
taken by messages
How do you know the messages are delivered as
specified by the algorithms?
Use message paths to check for routing invariants
detect extra hops, loops, and verify that the
paths are correct
Can implement as a query
select length from paths where (length > log2(N))