Using Queries for Distributed Monitoring and Forensics

1
Using Queries for Distributed Monitoring and
Forensics
  • Atul Singh, Rice University
  • Peter Druschel, Max Planck Institute for Software
    Systems
  • Timothy Roscoe, Intel Research Berkeley
  • Petros Maniatis, Intel Research Berkeley

2
Building and monitoring a system
  • Building a distributed system is a complex
    undertaking
  • Select properties and algorithms
  • Implement, deploy
  • Switch to monitoring the system
  • Testing, debugging, profiling, tuning
  • Monitoring is hard, error-prone
  • Distributed state
  • Partial faults
  • Complex interactions
  • Asynchronous
  • External factors

3
Monitoring is hard!
  • Current state of the art
  • Manual insertion of printf
  • Bringing logs to one place
  • Parsing/processing of logs
  • Scripts (Perl/Python)
  • Queries (Astrolabe)
  • Offline by nature
  • Expose internal state
  • Ad-hoc, error-prone
  • Probe exposed state
  • Correlate events
  • Bridge the semantic gap

4
Declarative systems: building systems via queries
  • Declarative specification via queries
  • Execution by a distributed query processor
  • P2 [SOSP'05]: a prototype declarative system
  • Concise specifications
  • Enables rapid prototyping
  • We present a monitoring framework for P2
  • Flexible introspection
  • Retains semantics of application
  • Online execution tracing

[Figure labels: Probe the state; Expose internals]
5
Overview
  • Introduction
  • P2 Background
  • Monitoring framework
  • Example applications/Performance
  • Conclusions

6
Example: route operation in P2
  • Rule form: action :- event, precondition.
  • route(B,K) :- route(A,K),
    nextHop(A,D,B),
    D K.

[Figure: rule strand (Join route.A = nextHop.A, Select D K, Project) over application state nextHop]
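The rule strand above can be read as a small dataflow: join route with nextHop on the shared node A, filter on D and K, then project the result to a new route tuple. The Python sketch below is only an illustration of that reading, not P2's implementation. The comparison between D and K does not survive in this transcript, so the sketch takes it as a caller-supplied predicate; the equality used in the usage lines is purely an assumption, as are all function and variable names.

```python
from typing import Callable, Iterable, List, Tuple

Route = Tuple[str, int]          # route(Node, K)
NextHop = Tuple[str, int, str]   # nextHop(Node, D, Next)


def route_strand(
    routes: Iterable[Route],
    next_hops: Iterable[NextHop],
    select: Callable[[int, int], bool],
) -> List[Route]:
    """Join -> Select -> Project, mirroring the three strand stages."""
    hops = list(next_hops)
    derived = []
    for a, k in routes:
        for hop_a, d, b in hops:
            if a != hop_a:          # Join: route.A = nextHop.A
                continue
            if not select(d, k):    # Select: predicate over D and K
                continue
            derived.append((b, k))  # Project: emit route(B, K)
    return derived


routes = [("A", 7)]
next_hops = [("A", 7, "B"), ("A", 9, "C")]
# Equality is an assumption made only for this example.
result = route_strand(routes, next_hops, lambda d, k: d == k)
print(result)
```

Passing the selection as a function keeps the sketch honest about the one part of the rule that cannot be read from the slide.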
7
Overview
  • Introduction
  • Background
  • Monitoring framework
  • Example applications/Performance
  • Conclusions

8
Introspection and Logging
  • Introspection at three levels
  • Application state level
  • Rule level
  • Dataflow level
  • Systematic instrumentation
  • System is built using smaller, re-usable
    components
  • Systematic insertion of logging statements
  • Logging data is in the form of tuples
  • Retains semantics of application logic
  • No need for translation

9
Tracing rule executions
  • We want to step through the execution
  • Each step corresponds to a rule
  • Do it in an online fashion
  • For rule level tracing
  • Need to trace tuples
  • Match output tuple to input
  • Track tuples as they go over wire

[Figure: rules r0 and r1 on Node A and Node B, with tuples w, x, y, z]
10
(1) Tracing rule executions
  • Matching input and output tuples of a rule
  • Tap elements at the beginning and end of a rule
  • Execution tracer tracks rule executions
  • Execution records are stored as tuples in the
    exec table

[Figure: taps on a rule's input (x) and output (y) feed the Execution Tracer, which writes to exec]
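The tap-and-record idea on this slide can be sketched in a few lines of Python. This is an assumed, simplified model rather than the P2 tracer: the wrapper name traced, the record layout (rule name, input tuple ID, output tuple ID), and the use of a plain list as the exec table are all hypothetical.

```python
import itertools
from typing import Callable, Iterator, List, Optional, Tuple

ExecRecord = Tuple[str, int, int]  # (rule name, input tuple ID, output tuple ID)


def traced(
    rule_name: str,
    rule_fn: Callable[[tuple], Optional[tuple]],
    exec_table: List[ExecRecord],
    ids: Iterator[int],
):
    """Wrap a rule with an input tap and an output tap."""
    def run(input_id: int, input_tuple: tuple):
        # Input tap: input_id identifies the tuple entering the rule.
        output_tuple = rule_fn(input_tuple)
        if output_tuple is None:
            return None  # The rule did not fire, so nothing is recorded.
        # Output tap: give the result a fresh local ID and log the pairing.
        output_id = next(ids)
        exec_table.append((rule_name, input_id, output_id))
        return output_id, output_tuple
    return run


ids = itertools.count(1)   # source of locally unique tuple IDs
exec_table: List[ExecRecord] = []

r0 = traced("r0", lambda t: ("y", t[1] + 1), exec_table, ids)
x_id = next(ids)
y_id, y = r0(x_id, ("x", 41))
print(exec_table)
```

Because the records are ordinary tuples, they can be queried with the same machinery as application state, which is the point the slide makes.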
11
(2) Tracing tuples across wire
  • Each tuple has a locally unique ID
  • Tuple ID is sent along with the tuple
  • Upon receipt, a new tuple is created with a
    different ID
  • Hooks in the network in/out handling subsystem
  • A record is created
  • Tuple's local ID
  • Tuple's remote ID
  • Node from which it came

[Figure: tuple x leaves A through Network Out and arrives at B through Network In as y; B records it in tupleTable]
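A minimal Python sketch of the ID hand-off described above, under stated assumptions: the Node class, the message dictionary, and the record layout (local ID, remote ID, source node) are hypothetical stand-ins for the network in/out hooks, not P2 code.

```python
import itertools


class Node:
    def __init__(self, name):
        self.name = name
        self._ids = itertools.count(1)   # locally unique tuple IDs
        self.tuple_table = []            # (local_id, remote_id, source_node)

    def new_id(self):
        return next(self._ids)

    def send(self, local_id, payload):
        # Network-out hook: ship the sender's ID along with the tuple.
        return {"src": self.name, "src_id": local_id, "payload": payload}

    def receive(self, message):
        # Network-in hook: the arriving tuple gets a new local ID, and the
        # mapping back to the sender's ID is recorded.
        local_id = self.new_id()
        self.tuple_table.append((local_id, message["src_id"], message["src"]))
        return local_id, message["payload"]


a, b = Node("A"), Node("B")
b.new_id()
b.new_id()                      # B has already created two local tuples
x_id = a.new_id()
y_id, payload = b.receive(a.send(x_id, ("x", 42)))
print(b.tuple_table)
```

The same tuple carries ID 1 on A and ID 3 on B; only the tupleTable record ties the two together.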
12
Putting it all together
[Figure: Node A and Node B with rules r0 and r1 and tuples w, x, y, z; each node keeps its own tupleTable and exec tables]
  • Of course, in reality it's more complicated
  • Aborted rule executions
  • Pipelined rule executions
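With both tables in place, a backward trace alternates between them: exec explains which rule produced a tuple locally, and tupleTable explains which remote tuple it came from. The sketch below assumes one particular scenario (r0 on A turns w into x, x travels to B as y, r1 on B turns y into z) and hypothetical record layouts; it ignores the aborted and pipelined executions mentioned above.

```python
# exec records: (rule, input tuple ID, output tuple ID), per node.
exec_tables = {
    "A": [("r0", 1, 2)],   # on A, r0 consumed tuple 1 (w) and produced 2 (x)
    "B": [("r1", 5, 6)],   # on B, r1 consumed tuple 5 (y) and produced 6 (z)
}
# tupleTable records: (local ID, remote ID, source node), per node.
tuple_tables = {
    "A": [],
    "B": [(5, 2, "A")],    # B's tuple 5 arrived from A, where it had ID 2
}


def trace_back(node, tuple_id):
    """Walk from a tuple back to its origin, one rule execution per step."""
    steps = []
    while True:
        produced = next(
            (r for r in exec_tables[node] if r[2] == tuple_id), None
        )
        if produced is not None:
            rule, input_id, _ = produced
            steps.append((node, rule, input_id, tuple_id))
            tuple_id = input_id
            continue
        arrived = next(
            (r for r in tuple_tables[node] if r[0] == tuple_id), None
        )
        if arrived is not None:
            _, remote_id, source = arrived
            node, tuple_id = source, remote_id   # hop across the wire
            continue
        return steps   # no producer and no sender: this is the origin


history = trace_back("B", 6)
print(history)
```

Each entry in history corresponds to one rule execution, which matches the slide's goal of stepping through the execution rule by rule.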

13
Overview
  • Introduction
  • Background
  • Monitoring framework
  • Example applications/Performance
  • Conclusions

14
Example applications (I)
  • Distributed watchpoints: trigger an event if a
    condition is true
  • Possibly trace back/forward
  • Oscillation of faulty/stale information (route
    flaps)
  • Gossiping for stabilization or updates
  • Inconsistent routing in DHTs [Pastry, Chord, ...]
  • Each node is responsible for a unique region
  • Route using distinct paths and check [Bamboo,
    Secure Routing]
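As an illustration of a watchpoint for route flaps, the Python sketch below fires when the next hop for a destination changes more than a set number of times within a time window. It is a single-node simplification with hypothetical names and thresholds; in the talk's setting the condition would be expressed as a query and evaluated across nodes.

```python
from collections import defaultdict, deque


class FlapWatchpoint:
    def __init__(self, max_changes, window):
        self.max_changes = max_changes       # changes tolerated per window
        self.window = window                 # window length, same unit as `now`
        self._last_hop = {}                  # destination -> last next hop seen
        self._changes = defaultdict(deque)   # destination -> change timestamps
        self.triggered = []                  # (time, destination) events

    def observe(self, now, destination, next_hop):
        """Feed one routing update; return True if the watchpoint fires."""
        previous = self._last_hop.get(destination)
        self._last_hop[destination] = next_hop
        if previous is None or previous == next_hop:
            return False
        changes = self._changes[destination]
        changes.append(now)
        # Forget changes that fell out of the window.
        while changes and now - changes[0] > self.window:
            changes.popleft()
        if len(changes) > self.max_changes:
            self.triggered.append((now, destination))
            return True
        return False


wp = FlapWatchpoint(max_changes=2, window=10.0)
updates = [(0, "k", "B"), (1, "k", "C"), (2, "k", "B"), (3, "k", "C")]
fired = [wp.observe(t, dest, hop) for t, dest, hop in updates]
print(fired, wp.triggered)
```

Once the watchpoint fires, the recorded event is the natural starting point for tracing back or forward.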

15
Example applications (II)
  • Online execution profiling
  • How much time is spent in each rule?
  • Where are the bottlenecks?
  • Which rule is costliest? Which operation?
  • Consistent snapshots [Chandy-Lamport]
  • Snapshot of the routing state
  • Queries on the snapshot itself
  • What is the degree distribution?
  • How many node-disjoint paths?
  • No more than 16 rules for any of the above

[Figure: rules r1, r2, r3]
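Two of the questions above reduce to simple aggregations, sketched here in Python. Both are illustrations under assumptions: the profiling function assumes each execution record carries start and finish timestamps, and the snapshot is assumed to be a mapping from each node to its set of neighbors. Neither layout is taken from the talk.

```python
from collections import Counter, defaultdict


def time_per_rule(exec_records):
    """Total time spent in each rule, from (rule, started, finished) records."""
    totals = defaultdict(float)
    for rule, started, finished in exec_records:
        totals[rule] += finished - started
    return dict(totals)


def degree_distribution(snapshot):
    """Map each out-degree to the number of nodes that have it."""
    return dict(Counter(len(neighbors) for neighbors in snapshot.values()))


records = [("r1", 0.0, 0.5), ("r2", 0.5, 0.75), ("r1", 1.0, 1.25)]
totals = time_per_rule(records)
bottleneck = max(totals, key=totals.get)

snapshot = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
degrees = degree_distribution(snapshot)
print(totals, bottleneck, degrees)
```

Running the degree query against a consistent snapshot rather than live state avoids counting links that changed while the query was in flight.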
16
Performance
  • 21 node Chord overlay in P2
  • Monitored node on separate, unloaded machine
  • Overhead of introspection
  • CPU (0.98 → 1.3), Memory (8 MB → 13 MB)
  • Consistent distributed snapshot
  • Other results in the paper

[Figure: two plots, CPU utilization and transmitted packets (×1000), each against rate (1/sec)]
17
Related Work
  • Management using database techniques Hy
  • Performance debugging [Magpie, Causeway, ...]
  • Configuration debugging for BGP, OSes
    [Time-travel, ...]
  • Distributed debuggers [WiDS, Pip, Replay
    Debugging]
  • Deep embedded monitoring [IBM WebSphere,
    Adaptations]

18
Conclusions
  • Declarative development of systems
  • Integrated approach to building and monitoring
  • Automatic execution tracing
  • Online, in-place monitoring
  • Step towards autonomic distributed systems
  • Fault-finding tasks evolve with the system
  • Interesting future directions
  • User interface
  • Trade-off between monitoring accuracy and
    overhead
  • Questions? Thank You

19
Request to EuroSys
  • Please schedule my next talk on the first day
  • Move the submission deadline away from NSDI
  • (last year, NSDI submission (19th Oct), EuroSys
    (20th))

20
Questions?
  • Thank You!