Crash Detection - PowerPoint PPT Presentation

Slides: 33
Provided by: Syste98

Transcript and Presenter's Notes

Title: Crash Detection


1
Middleware for Active Reduction Operations in Distributed Systems
By Nitin Bahadur and Gokul Nadathur
Department of Computer Sciences, University of Wisconsin-Madison
Spring 2000
2
Talk Outline
  • Motivation and Goals
  • General Architecture of the middleware
  • Components of the middleware
  • Providing reliability - handling of node failures
  • Applications developed using the middleware
  • Performance
  • Conclusions and possible extensions

3
Motivation and Goals
  • Middleware for applications that follow the
    master-worker paradigm
  • Scalable framework for communication and for
    computing client responses (reduction)
  • Unicast does not scale, so use multicast
  • Introduce reduction operations dynamically in
    clients
  • A general framework for communication among
    clients

4
The Big Picture...
[Figure: a Master App and three Client Apps, each paired with an ARTL instance.]
5
ART - Library Architecture
[Figure: layered diagram. Labels, in order: Application (with application-specific callbacks), Application API, Reduction functions, Framework for processing messages, ARTL-specific message, Event Handler, Outgoing message, ARTL Communication Layer, Incoming Packet, Network.]
ARTL messages:
  1. Query from master
  2. Response from downstream nodes
6
ART - Library Architecture
[Figure repeated from slide 5: the same library architecture diagram and ARTL message list.]
7
Communication Subsystem
  • Connection Setup
  • Connect nodes as a Binomial tree
  • Send and receive ARTL and application messages
  • Detect node failure and act accordingly
  • Integrate restarted node in current tree structure
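
One common way to number the nodes of a binomial tree is sketched below. It is only an illustration: the slides do not say how ARTL assigns positions, so the rank numbering (master at rank 0) and the helper names `binomial_parent` and `binomial_children` are assumptions.

```python
from typing import List, Optional


def binomial_parent(rank: int) -> Optional[int]:
    """Parent of `rank` in a binomial tree rooted at rank 0 (assumed to be the master)."""
    if rank == 0:
        return None
    # Clearing the highest set bit yields the node that forwards to `rank`.
    return rank - (1 << (rank.bit_length() - 1))


def binomial_children(rank: int, size: int) -> List[int]:
    """Children of `rank` in a binomial tree over ranks 0..size-1."""
    children = []
    step = 1 << rank.bit_length()  # smallest power of two greater than rank
    while rank + step < size:
        children.append(rank + step)
        step <<= 1
    return children
```

With eight nodes, rank 0 gets children 1, 2 and 4, and rank 1 gets children 3 and 5, so each node would open TCP connections only to its parent and its own children.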

8
Why use Binomial Tree
[Figure: two diagrams, each with a Master App and three Client Apps joined by numbered edges.]
Binomial tree: query propagation time = 2
Unicast mechanism: query propagation time = 3
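
The two propagation times can be reproduced with a small model. This is a sketch under stated assumptions: each send takes one time unit and a node sends to one peer at a time. The function names are hypothetical.

```python
def unicast_rounds(clients: int) -> int:
    """The master contacts every client itself, one per time unit."""
    return clients


def binomial_rounds(clients: int) -> int:
    """Every informed node forwards the query, so the informed set doubles each round."""
    informed, rounds = 1, 0  # only the master knows the query at the start
    while informed < clients + 1:
        informed *= 2
        rounds += 1
    return rounds
```

For three clients this gives 3 time units for unicast and 2 for the binomial tree. The gap widens as nodes are added: 63 clients need 63 unicast rounds but 6 binomial rounds.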
9
Reduction
[Figure: responses travel up the tree, with reduction performed at nodes 5 and 3.]
Example reduction operations: Min(), Max()
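
A minimal sketch of reduction along a tree follows. The tree shape, the node numbers and the function `reduce_up` are hypothetical and do not reproduce the figure on the slide. Each interior node combines its own value with its children's partial results before answering its parent.

```python
from typing import Callable, Dict, List


def reduce_up(node: int,
              children: Dict[int, List[int]],
              local: Dict[int, float],
              op: Callable[[float, float], float]) -> float:
    """Combine the value at `node` with the reduced results of its subtrees."""
    partial = local[node]
    for child in children.get(node, []):
        partial = op(partial, reduce_up(child, children, local, op))
    return partial


# Hypothetical 6-node tree rooted at node 0, with made-up response values.
tree = {0: [1, 2, 4], 1: [3, 5]}
values = {0: 0.9, 1: 0.4, 2: 1.7, 3: 0.2, 4: 1.1, 5: 0.6}
lowest = reduce_up(0, tree, values, min)
highest = reduce_up(0, tree, values, max)
```

In this example node 1 sends a single reduced value to the root instead of forwarding the responses of nodes 3 and 5 separately.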
10
Tree connection setup
11
Tree Setup - Phase I
TCP connection setup
12
Tree Setup - Phase II
TCP connection setup
13
Tree Setup - Phase III
TCP connection setup
14
Inter node communication
[Figure: message layout with an ARTL Header and Data.]
  • Unicast and multicast data transmission
  • ARTL receives application messages for which no
    receive has been posted
  • these are passed to a callback function registered
    by the application
  • ARTL receives data on behalf of the application when
    the application explicitly posts a receive
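
The two delivery paths can be sketched as a small dispatcher. Everything here is hypothetical (the class `Dispatcher`, the `tag` used to match messages, the method names). It only illustrates the rule that a posted receive takes precedence and unmatched messages fall through to the registered callback.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict


class Dispatcher:
    def __init__(self, default_callback: Callable[[str, Any], None]) -> None:
        # tag -> handlers posted by the application, served in FIFO order
        self._posted: Dict[str, Deque[Callable[[Any], None]]] = {}
        self._default = default_callback

    def post_receive(self, tag: str, handler: Callable[[Any], None]) -> None:
        """The application explicitly asks for the next message with this tag."""
        self._posted.setdefault(tag, deque()).append(handler)

    def deliver(self, tag: str, payload: Any) -> None:
        """Called by the communication layer for each incoming application message."""
        waiting = self._posted.get(tag)
        if waiting:
            waiting.popleft()(payload)      # a receive was posted: hand the data over
        else:
            self._default(tag, payload)     # no receive posted: use the callback
```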

15
ART - Library Architecture
[Figure: layered diagram. Labels, in order: Application (with application-specific callbacks), Application API, Reduction functions, Framework for processing messages, ARTL encapsulated message, Event Handler, Outgoing message, ARTL Communication Layer, Incoming Packet, Network.]
ARTL messages:
  1. Query from master
  2. Response from downstream nodes
16
Reduction Functions
  • Implemented as Shared objects
  • Sent to client during Setup phase
  • Each reduction function is associated with a
    particular response it reduces
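
The association between a response and its reduction function can be sketched as a registry. This sketch uses plain Python callables in place of shared objects, and the class `ReductionRegistry` and the response-type strings are hypothetical.

```python
from functools import reduce
from typing import Any, Callable, Dict, Iterable


class ReductionRegistry:
    def __init__(self) -> None:
        self._by_type: Dict[str, Callable[[Any, Any], Any]] = {}

    def register(self, response_type: str, fn: Callable[[Any, Any], Any]) -> None:
        """Associate a reduction function with the response type it reduces."""
        self._by_type[response_type] = fn

    def reduce(self, response_type: str, responses: Iterable[Any]) -> Any:
        """Reduce responses of one type with the function registered for it."""
        fn = self._by_type[response_type]   # KeyError if nothing is registered
        return reduce(fn, responses)


registry = ReductionRegistry()
registry.register("min_load", min)
registry.register("max_load", max)
```

In the system described on the slide, registration would happen during the setup phase, when the shared objects arrive at the client.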

17
Event Handler
[Figure: the Event Handler, with its Thread Pool, between the Network and the Application.]
18
Multithreaded Architecture
  • No prior knowledge of the behavior of the reduction
    function
  • Exploit concurrency: multiple processors per node
  • Static pool of threads: creating and destroying
    threads is expensive (Firefly RPC)
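
A fixed-size pool can be sketched with Python's standard `concurrent.futures.ThreadPoolExecutor`. The class `EventHandler`, the pool size of 4 and the event format are assumptions for illustration. The point is that the pool is created once and reused for every event, so a slow reduction function occupies one worker without blocking the others.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Event = Tuple[Callable[[Sequence[float]], float], Sequence[float]]


class EventHandler:
    def __init__(self, pool_size: int = 4) -> None:
        # The pool is created once; its worker threads are reused across events.
        self._pool = ThreadPoolExecutor(max_workers=pool_size)

    def process(self, events: List[Event]) -> List[float]:
        """Run each event's reduction function on a pool thread; keep input order."""
        futures = [self._pool.submit(fn, data) for fn, data in events]
        return [f.result() for f in futures]

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```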

19
Crash Reconfiguration
20
Crash Reconfiguration
Crash Reconfiguration at depth 1
21
Crash Reconfiguration
Crash Reconfiguration at depth 2
22
Crash Reconfiguration
Crash Reconfiguration at depth 1
23
Crash Reconfiguration
Crash Reconfiguration at depth 1
24
Crash Detection
  • Break in TCP connection with parent/child
  • a signal is received at the other end of
    connection
  • Use of periodic refresh messages to inform parent
    that child is up and running
  • useful in WAN environments
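
The refresh-message check can be sketched as a timeout monitor kept by the parent. The class `RefreshMonitor` and the timeout value are hypothetical. Time is passed in explicitly so the logic can be followed without a real clock.

```python
from typing import Dict, List


class RefreshMonitor:
    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self._last_seen: Dict[str, float] = {}

    def refresh(self, child: str, now: float) -> None:
        """Record a refresh message received from `child` at time `now`."""
        self._last_seen[child] = now

    def suspected(self, now: float) -> List[str]:
        """Children whose last refresh is older than the timeout."""
        return sorted(c for c, t in self._last_seen.items()
                      if now - t > self.timeout)
```

A parent would call `refresh` whenever a message arrives and poll `suspected` periodically. A broken TCP connection can be reported at once, without waiting for the timeout.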

25
Crash Handling
  • The parent of the failed node informs the master
  • All nodes are informed of the node failure
  • Master recomputes the tree
  • If a leaf node fails, no reconfiguration is needed
  • If an intermediate node fails, some reconfiguration
    is required
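
The slides do not spell out how the master recomputes the tree. The sketch below assumes one simple policy, reattaching the orphaned children to the failed node's own parent, purely to illustrate the difference between a leaf failure and an intermediate-node failure. The function `handle_crash` and the parent-map representation are hypothetical.

```python
from typing import Dict, Optional


def handle_crash(parent: Dict[int, Optional[int]], failed: int) -> Dict[int, Optional[int]]:
    """Return a new node -> parent map with `failed` removed.

    Assumed policy: children of the failed node are adopted by its parent.
    """
    grandparent = parent[failed]
    if grandparent is None:
        raise ValueError("failure of the master is not handled in this sketch")
    return {node: (grandparent if p == failed else p)
            for node, p in parent.items() if node != failed}


# Hypothetical 8-node binomial tree: node -> parent, node 0 is the master.
tree = {0: None, 1: 0, 2: 0, 4: 0, 3: 1, 5: 1, 6: 2, 7: 3}
after_leaf = handle_crash(tree, 5)          # leaf: remaining links are unchanged
after_interior = handle_crash(tree, 1)      # interior: nodes 3 and 5 move to node 0
```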

26
Node Restart
  • Restarted node contacts master to tell it about
    restart
  • Master sends it current state of network and the
    shared object(s)
  • All nodes are informed of a node restart
  • Master recomputes the tree and informs the new node's
    parent about its new child
  • Parent and child establish connections

27
SysMon - A System monitor
  • Monitors the load average from /proc
  • Displays Min, Max and average loads
  • Per-node load is also displayed
  • ARTL reduction operations: Min, Max and Average
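
Unlike Min and Max, an average cannot be reduced by averaging partial averages when subtrees differ in size. A common approach, sketched here as an assumption about how such a reduction could be written (it is not taken from SysMon), is to carry a (sum, count) pair up the tree and divide only at the master. The function names and load values are made up.

```python
from typing import Tuple

Partial = Tuple[float, int]  # (sum of loads, number of nodes)


def leaf(load: float) -> Partial:
    return (load, 1)


def combine(a: Partial, b: Partial) -> Partial:
    """Associative merge of two partial results."""
    return (a[0] + b[0], a[1] + b[1])


def finish(p: Partial) -> float:
    return p[0] / p[1]


# One subtree holds three nodes, the other a single node.
left = combine(combine(leaf(1.0), leaf(2.0)), leaf(3.0))
right = leaf(6.0)
correct = finish(combine(left, right))            # (1+2+3+6) / 4
naive = (finish(left) + finish(right)) / 2        # average of averages
```

Here `correct` is 3.0 while `naive` is 4.0, which is why the pair is kept until the final step.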
28
SysMon - A System monitor
Node failures are detected and SysMon pops up an
alert
29
File Transfer Application
  • Transfers a file from master to all clients
  • File can be executed at clients (if required)
  • execution can be instantaneous on receiving file
  • execution can be delayed until all nodes have
    received the file

30
File Transfer Performance
31
Total Startup Time vs Number of Nodes
Client processes started using ssh on different
machines
32
Conclusions and Extensions
  • A middleware for dynamic operations
  • Support for crash detection, recovery and dynamic
    processes
  • Demonstrated near optimal speedup using real
    applications
  • Making response function dynamic - active
    services
  • Differential scheduling in thread scheduler for
    QoS
  • Making dynamic code secure