Parallel Job Deployment and Monitoring in a Hierarchy of Mobile Agents

Munehiro Fukuda, Computing & Software Systems, University of Washington, Bothell

1
Parallel Job Deployment and Monitoring in a
Hierarchy of Mobile Agents
  • Munehiro Fukuda
  • Computing & Software Systems, University of
    Washington, Bothell
  • Funded by

2
Outline
  1. Introduction
  2. Execution Model
  3. System Design
  4. Performance Evaluation
  5. Related Work
  6. Conclusions

3
1. Introduction
  • Problems in Grid Computing
  • Background of Mobile Agents
  • Objective
  • Project Overview

4
Quiet Laboratories
  • UW1-320 and UW1-302 at 3pm on a weekday
  • No more computing resources needed?

5
Demands for Computing Resources in Teaching
  • I'm in the 320 lab testing my program, and for
    some reason, whenever I attempt to use 15 hosts,
    it asks me for passwords for hosts 31 - 18 and
    then freezes and does nothing.
  • I noticed that uw1-320-20 is being bogged down by
    zombie processes that someone left going on it.
  • I went around looking on some of the other
    computers, and it's all over the place: user A has
    had almost 40 processes running on uw1-320-30 since
    April 23rd, user B has about 10 on host 29, and
    there's a ton more on almost every host.
  • I got tired of manually running a bunch of ssh
    commands to run rmi on many different machines.
  • I have narrowed the problem down to three
    machines: 16, 20, and 30. First of all,
    uw1-320-20 is dead. It drops all incoming ssh
    connections. The other two, uw1-320-16 and
    uw1-320-30, both have a mysterious problem that I
    don't know how to solve.

6
Demands for Computing Resources in Research
  • http://setiathome.berkeley.edu/
  • http://boinc.bakerlab.org/rosetta/rah_about.php
  • These are an effective way to collect numerous
    computing resources from all over the world.
  • But here is a question:
  • Why don't they use idle machines on their
    campuses first?

7
Grid-Computing Brokers
  • Desktops
    • Buyers: a desktop user
    • Sellers: hardware components
    • Brokers: Windows, Linux
  • Clusters
    • Buyers: multiple users (e.g., CSS434 students)
    • Sellers: cluster computing nodes
    • Brokers: PBS, LSF
  • Grid computing
    • Buyers: SETI@home, Rosetta@home, etc.
    • Brokers: Globus, Condor, Legion (Avaki),
      NetSolve, Ninf, Entropia, etc.
  • Okay, no need to implement any more?

8
Problems in Grid Computing
  • Targeting large business models
  • A central entry point
  • A lot of installation work
  • http://www.globus.org/toolkit/docs/4.0/
  • Little system faults
  • Too gigantic

9
Our Target
  • Targeting a group of computer users
  • No central entry point
  • No central managers
  • No programming model restrictions
  • Easy installation work
  • Easy participation but necessity of fault
    tolerance

10
Background of Mobile Agents
  • An execution model previously highlighted as a
    prospective infrastructure for distributed
    systems.
  • Static job deployment and result collection: no
    more than an alternative approach to centralized
    grid middleware implementation.
  • Our goal: let mobile agents do unique tasks in
    grid computing.

[Figure labels: user, Internet, central manager, servers, cycles; protocols FTP, HTTP, RPC]
11
Objective
  • Let mobile agents do unique tasks in grid
    computing
    • Runtime job migration
      • Moving a program from a faulty/busy site to an
        active/idle site
      • Seeking fault tolerance and better load
        balancing
    • Negotiation
      • Negotiating with other agents about computing
        resources
      • Seeking better load balancing
    • Inherent parallelism
      • Deploying and monitoring jobs in parallel
    • Decentralized job management
  • Focus on a group of independent computers
    • Turned on and off independently
    • Not controlled by a scheduler such as PBS or LSF
    • Not managed by a central server

12
Project Overview
  • Funded by NSF Middleware Initiative
  • Sponsored by University of Washington
  • In collaboration with Ehime University
  • With a team of UWB undergraduates

13
2. Execution Model
  • System Overview
  • Execution Layer
  • Programming Environment

14
System Overview
[Figure: system overview. Components shown: user processes (users A and B) linked by TCP communication; snapshot methods; GridTcp; user program wrapper; sentinel, commander, resource, and bookkeeper agents]
15
Execution Layer
[Figure: execution layers, top to bottom]
  • Java user applications
  • mpiJava API
  • GridTcp
  • Java socket
  • User program wrapper
  • Commander, resource, sentinel, and bookkeeper agents
  • UWAgents mobile agent execution platform
  • Operating systems
16
MPI Java Programming
    public class MyApplication {
      public GridIpEntry ipEntry[];            // used by the GridTcp socket library
      public int funcId;                       // used by the user program wrapper
      public GridTcp tcp;                      // the GridTcp error-recoverable socket
      public int nprocess;                     // number of processors
      public int myRank;                       // processor id (or MPI rank)

      public int func_0( String args[] ) {     // constructor
        MPJ.Init( args, ipEntry, tcp );        // invoke mpiJava-A
        .....                                  // more statements to be inserted
        return 1;                              // calls func_1( )
      }

      public int func_1( ) {                   // called from func_0
        if ( MPJ.COMM_WORLD.Rank( ) == 0 )
          MPJ.COMM_WORLD.Send( ... );
        else
          MPJ.COMM_WORLD.Recv( ... );
        .....                                  // more statements to be inserted
        return 2;                              // calls func_2( )
      }
    }

17
3. System Design
  • Mobile Agents
  • Job Coordination
    • Distribution
    • Resource allocation and monitoring
    • Resumption and migration
  • Programming Support
    • Language preprocessing
    • Communication check-pointing
  • Inter-Cluster Job Deployment (Current Research
    Topic)
    • Over-gateway agent migration
    • Over-gateway communication
    • Job distribution

18
UWAgents Concept of Agent Domain
19
Job Distribution
[Figure: agent tree built at job submission. Legend: id = agent id, rank = MPI rank. Labels: job submission, spawn, XML query, eXist]
  • Commander id 0
  • Resource id 1
  • Sensors id 4 and id 5
  • Sentinel id 2 rank 0
  • Bookkeeper id 3 rank 0
  • Sentinels id 8 to 11, ranks 1 to 4
  • Bookkeepers id 12 to 15, ranks 1 to 4
  • Sentinels id 32 to 34, ranks 5 to 7
  • Bookkeepers id 48 to 50, ranks 5 to 7
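The ids in the job-distribution figure are consistent with a simple numbering rule: the children of agent i receive ids 4i to 4i+3 (the commander, id 0, starts at 1), and MPI ranks are handed out level by level inside the sentinel or bookkeeper subtree. The rule is inferred from the numbers on the slide, not stated there, so the Java sketch below is only an illustration for a full tree. AgentIdSketch, FAN_OUT, childIds, and rank are hypothetical names, not part of UWAgents.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that reproduces the id/rank pattern seen in the figure. */
final class AgentIdSketch {
    // Assumed maximum number of children per agent, inferred from the ids shown.
    static final int FAN_OUT = 4;

    /** Child ids of an agent: parentId * FAN_OUT + k. Id 0 would map to itself, so the root starts at k = 1. */
    static List<Integer> childIds(int parentId) {
        List<Integer> ids = new ArrayList<>();
        for (int k = (parentId == 0 ? 1 : 0); k < FAN_OUT; k++) {
            ids.add(parentId * FAN_OUT + k);
        }
        return ids;
    }

    /**
     * MPI rank of an agent inside the subtree rooted at rootId
     * (2 for sentinels, 3 for bookkeepers in the figure), counting level by level.
     */
    static int rank(int agentId, int rootId) {
        int levelStart = rootId; // smallest id on the current level
        int levelSize = 1;       // number of agents on the current level
        int ranksBefore = 0;     // agents on all shallower levels
        while (agentId > levelStart + levelSize - 1) {
            ranksBefore += levelSize;
            levelStart *= FAN_OUT;
            levelSize *= FAN_OUT;
        }
        if (agentId < levelStart) {
            throw new IllegalArgumentException("id " + agentId + " is not in the subtree of " + rootId);
        }
        return ranksBefore + (agentId - levelStart);
    }

    public static void main(String[] args) {
        System.out.println("children of commander 0: " + childIds(0));
        System.out.println("sentinel id 33 has rank " + rank(33, 2));
        System.out.println("bookkeeper id 50 has rank " + rank(50, 3));
    }
}
```

Under these assumptions an agent can derive its position in the tree, and its MPI rank, from its id alone, without asking a central manager.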
20
Resource Allocation and Monitoring
[Figure: resource allocation. Labels: job submission, Commander id 0, Resource id 1, an XML query, eXist, our own XML DB, a list of available nodes, spawn]
  • Nodes requested: total nodes x multiplier
  • Query fields: CPU architecture, OS, memory, disk, total nodes, multiplier
  • Case 1: total nodes 2, multiplier 1.5
  • Case 2: total nodes 2, multiplier 3
  • Agents shown in both cases: Sentinel id 2 rank 0, Sentinel id 8 rank 1, Bookkeeper id 2 rank 0, Bookkeeper id 12 rank 5
  • Three boxes labeled "Future use"
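The "total nodes x multiplier" request can be worked through for the two cases on the slide: 2 x 1.5 gives 3 nodes and 2 x 3 gives 6 nodes. The Java sketch below is a minimal illustration. NodeRequestSketch and nodesToRequest are hypothetical names, and rounding up for fractional products is an assumption, since the slide does not say how they are handled.

```java
/** Hypothetical helper: how many nodes a job request asks the resource agent for. */
final class NodeRequestSketch {
    /** Returns totalNodes x multiplier, rounded up (the rounding rule is assumed). */
    static int nodesToRequest(int totalNodes, double multiplier) {
        if (totalNodes < 0 || multiplier < 1.0) {
            throw new IllegalArgumentException("need totalNodes >= 0 and multiplier >= 1");
        }
        return (int) Math.ceil(totalNodes * multiplier);
    }

    public static void main(String[] args) {
        System.out.println("Case 1: " + nodesToRequest(2, 1.5) + " nodes");
        System.out.println("Case 2: " + nodesToRequest(2, 3.0) + " nodes");
    }
}
```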
21
Job Resumption by a Parent Sentinel
[Figure: Sentinel id 2 rank 0 and Sentinels id 8 to 11 (ranks 1 to 4) joined by MPI connections; Bookkeeper id 15 rank 4]
22
Job Resumption by a Child Sentinel
[Figure: Commander id 0, Resource id 1, Sentinel id 2 rank 0 (marked "New"), Bookkeeper id 3 rank 0, Sentinel id 8 rank 1, Bookkeeper id 12 rank 1]
23
User Program Wrapper
Source Code

    statement_1; statement_2; statement_3;
    check_point( );
    statement_4; statement_5; statement_6;
    check_point( );
    statement_7; statement_8; statement_9;
    check_point( );

Preprocessed

    func_0( ) {
      statement_1; statement_2; statement_3;
      return 1;
    }
    func_1( ) {
      statement_4; statement_5; statement_6;
      return 2;
    }
    func_2( ) {
      statement_7; statement_8; statement_9;
      return -2;
    }

User Program Wrapper

    int fid = 1;
    while ( fid != -2 ) {
      switch( func_id ) {
        case 0: fid = func_0( ); break;
        case 1: fid = func_1( ); break;
        case 2: fid = func_2( ); break;
      }
      func_id = fid;    // next function to run (added so the loop advances)
      check_point( );   // save this object, including func_id, into a file
    }

[Figure label: Cryptography]
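The wrapper loop on this slide can be sketched as runnable Java: the job object carries the index of the next function, the wrapper dispatches on it, and a snapshot is written after every function returns so that a restarted wrapper picks up where the last snapshot left off. This is a minimal sketch under assumptions, not the AgentTeamwork implementation. StagedJob, WrapperSketch, the sum field, and the maxStages parameter (used here to imitate a crash) are hypothetical, plain Java serialization stands in for the snapshot mechanism, and the cryptography step is omitted.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical user program already split into func_N() stages. */
class StagedJob implements Serializable {
    private static final long serialVersionUID = 1L;
    int funcId = 0; // next stage to run; saved with every snapshot
    int sum = 0;    // example application state

    int func_0() { sum += 1;   return 1;  }
    int func_1() { sum += 10;  return 2;  }
    int func_2() { sum += 100; return -2; } // -2 ends the job
}

final class WrapperSketch {
    /** Saves the whole job object, including funcId, into a file. */
    static void checkPoint(StagedJob job, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(job);
        }
    }

    /** Reloads the most recent snapshot. */
    static StagedJob restore(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (StagedJob) in.readObject();
        }
    }

    /** Runs stages from job.funcId until a stage returns -2 or maxStages stages have run. */
    static void run(StagedJob job, File snapshot, int maxStages) throws IOException {
        for (int done = 0; job.funcId != -2 && done < maxStages; done++) {
            switch (job.funcId) {
                case 0: job.funcId = job.func_0(); break;
                case 1: job.funcId = job.func_1(); break;
                case 2: job.funcId = job.func_2(); break;
                default: throw new IllegalStateException("unknown funcId " + job.funcId);
            }
            checkPoint(job, snapshot); // snapshot after every stage
        }
    }
}
```

A resumed run would call WrapperSketch.restore on the snapshot file and pass the result back to WrapperSketch.run, which continues from the saved funcId instead of starting again at func_0.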
24
Preprocessor and Drawbacks

Source Code

    statement_1; statement_2; statement_3;
    check_point( );
    while ( ... ) {
      statement_4;
      if ( ... ) {
        statement_5;
        check_point( );
        statement_6;
      } else
        statement_7;
      statement_8;
    }
    check_point( );

Preprocessed Code

    int func_0( ) {
      statement_1; statement_2; statement_3;
      return 1;
    }
    int func_1( ) {              // before check_point( ) in if-clause
      while ( ... ) {
        statement_4;
        if ( ... ) {
          statement_5;
          return 2;
        } else
          statement_7;
        statement_8;
      }
    }
    int func_2( ) {              // after check_point( ) in if-clause
      statement_6;
      statement_8;
      while ( ... ) {
        statement_4;
        if ( ... ) {
          statement_5;
          return 2;
        } else
          statement_7;
        statement_8;
      }
    }

  • Recursion is not supported
  • Source line numbers reported on errors are useless
  • Explicit snapshot points are still needed

25
GridTcp Check-Pointed Connection
[Figure: user programs inside user program wrappers on n1.uwb.edu and n2.uwb.edu, connected over TCP, with outgoing, incoming, and backup queues under snapshot maintenance; n3.uwb.edu is also shown. Rank-to-IP table: rank 1 = n1.uwb.edu, rank 2 = n2.uwb.edu]
  • Outgoing packets saved in a backup queue
  • All packets serialized in a backup file at every
    checkpoint
  • Upon a migration
    • Packets de-serialized from a backup file
    • Backup packets restored in outgoing queue
    • IP table updated
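The backup-and-restore steps listed for a GridTcp connection can be sketched in Java as follows. This is an illustrative stand-in, not the real GridTcp API. BackupConnectionSketch and its methods are hypothetical, packets are plain strings, no socket is opened, and the policy for trimming the backup queue is left out because the slide does not describe it.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.HashMap;

/** Hypothetical model of one check-pointed connection endpoint. */
class BackupConnectionSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    final HashMap<Integer, String> ipTable = new HashMap<>(); // rank -> host name
    final ArrayDeque<String> outgoing = new ArrayDeque<>();   // packets waiting to be sent
    final ArrayDeque<String> backup = new ArrayDeque<>();     // copies kept for recovery

    /** Queues a packet for sending and keeps a copy in the backup queue. */
    void send(String packet) {
        outgoing.addLast(packet);
        backup.addLast(packet);
    }

    /** Pretends to put the next packet on the wire; returns null if nothing is queued. */
    String transmit() {
        return outgoing.pollFirst();
    }

    /** Serializes the IP table and all queued packets into a backup file. */
    void checkPoint(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        }
    }

    /** After a migration: reload the backup file, re-queue the backup packets, update the IP table. */
    static BackupConnectionSketch resumeAfterMigration(File file, int movedRank, String newHost)
            throws IOException, ClassNotFoundException {
        BackupConnectionSketch c;
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            c = (BackupConnectionSketch) in.readObject();
        }
        c.outgoing.clear();
        c.outgoing.addAll(c.backup);      // backup packets restored in the outgoing queue
        c.ipTable.put(movedRank, newHost); // the moved rank now lives on a new host
        return c;
    }
}
```

In this sketch a packet that was already transmitted before the snapshot is sent again after the migration, so a real receiver would need some way to discard duplicates; that part is outside what the slide describes.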
26
Inter-Cluster Job Deployment: Current Research Topic
[Figure: Commander id 0 and Sentinel id 2 on the Internet side; gateway medusa.uwb.edu; Sentinels id 8 and id 9 in a private domain (10.0.0.3, 10.0.0.4, 10.0.0.7); desktops uw1-320-00.uwb.edu and uw1-320-01.uwb.edu. Caption: "How?"]
  • Over-gateway agent deployment
  • Over-gateway TCP communication
  • Over-gateway agent tree creation
27
UWAgents Over Gateway Migration
[Figure: agent id 0 uses spawnChild( ) and talk( ); agent id 1 uses hop( ) to move across the Internet through medusa.uwb.edu into a private domain (mnode0, mnode1, mnode4); uw1-320-00.uwb.edu and uw1-320-01.uwb.edu are also shown]
  • Parent and children keep track of a route to each
    other's current position.
  • A daemon maintains where a gateway is.
28
GridTcp Over-Gateway Connection
[Figure: Commander id 0, Sentinel id 2 rank 0, Sentinel id 8 rank 1, and Sentinel id 9 rank 2 connected across the Internet through medusa.uwb.edu into a private domain (mnode0, mnode1, mnode4); uw1-320-00.uwb.edu and uw1-320-01.uwb.edu are also shown]
29
Over-Gateway Agent Tree Creation: Possible Solutions
[Figure: agent tree divided into Partition 1 and Partition 2 across Clusters 0, 1, and 2]
  • Commander id 0
  • Resource id 1
  • Sentinel id 2 rank 0
  • Bookkeeper id 3 rank 0
  • Sentinels id 8 to 11, ranks 1 to 4
  • Sentinels id 32 to 35, ranks 5 to 8
  • Sentinels id 46 and id 47, ranks 19 and 20
30
Over-Gateway Agent Tree Creation: Final Solution
[Figure: agent tree spanning desktop computers, cluster gateway 0, cluster gateways 1, 2, and 3, and Clusters 0 to 3]
  • Commander id 0
  • Resource id 1
  • Sentinel id 2
  • Bookkeeper id 3 rank 0
  • Sentinel id 8 rank -8
  • Sentinel id 9 rank X
  • Sentinel id 32 rank 0
  • Sentinels id 33, 34, and 35 with ranks -33, -34, and -35
  • Sentinels id 36 to 39 with ranks X1 to X4
  • Sentinels id 128 to 131 with ranks 1 to 4
  • Sentinel id 132 rank 6
  • Sentinel id 512 rank 5
  • Sentinels id 528 to 531 with ranks 7 to 10
31
4. Performance Evaluation
  • Evaluation Environment
    • An 8-node Myrinet-2000 cluster: 2.8GHz
      Pentium4-Xeon w/ 512MB
    • A 24-node Giga-Ethernet cluster: 3.4GHz
      Pentium4-Xeon w/ 512MB
  • Computation Granularity
  • Java Grande MPJ Benchmark
  • Process Resumption Overhead
  • File Transfer

32
Computational Granularity 1
Master-slave computation
[Figure: one master communicating with five slaves]
33
Computational Granularity 2
Heartbeat communication
[Figure: five processes exchanging heartbeat communication]
34
Computational Granularity 3
All to all broadcast
[Figure: five processes communicating all to all]
35
Performance Evaluation - Series
Master-slave computation
36
Performance Evaluation - RayTracer
All-reduce communication, but little data to send
37
Performance Evaluation - MolDyn
All to all broadcast
38
Overhead of Job Resumption
39
File Transfer
[Figure: file transfer down the agent tree. Panels: "AgentTeamwork vs NFS" and "Pipelined Transfer in AgentTeamwork"]
  • Commander id 0
  • Resource id 1
  • Sentinel id 2 rank 0
  • Bookkeeper id 3 rank 0
  • Sentinels id 8 to 11, ranks 1 to 4
  • Sentinels id 32 to 35, ranks 5 to 8
  • Sentinels id 46 and id 47, ranks 19 and 20
40
5. Related Work
  • From the viewpoints of
    • System Architecture
    • Fault Tolerance
    • Job Deployment and Monitoring

41
System Architecture
Systems | Architectural basis
Globus | A toolkit
Condor | Process migration
Ninf, NetSolve | RPC
Legion (Avaki) | OO
Catalina, J-SEAL2, AgentTeamwork | Mobile agents
  • Difference from Catalina/J-SEAL2
    • They are not fully implemented.
    • They are based on a master-slave model.

42
Fault Tolerance
Systems | Libraries | Data recovery | Communication recovery
Legion (Avaki) | FT-MPI | Variables passed to MPI_FT_save( ) | Links recovered
Condor | MW Library | All master data | Master-worker communication
Dome | Dome_env | Objects declared as dXXX<type> | N/A
AgentTeamwork | GridTcp | All serializable class data | All in-transit messages
43
Job Deployment and Monitoring
Systems | Co-Allocation Module | Deployment Scheme
Globus | DUROC | Master slave
Condor | Grid Manager | Master slave
Legion | Scheduler and Enactor | Master slave
AgentTeamwork | Sentinel agents | Hierarchical
44
6. Conclusions
  • Project Summary
  • Next Two Years

45
Project summary
  • Applications
    • Computation granularity: 40,000 doubles x 10,000
      floating-point operations
    • Message transfer: any types except all-to-all
      communication
    • Entire application size: 3 times larger than
      computation granularity
  • Current status
    • UWAgent: completed
    • Agent behavioral design: basic job
      deployment/resumption implemented
    • User program wrapper: completed, including
      security features
    • GridTcp/mpiJava: in testing
    • Preprocessor: almost completed

46
Next Two Years
  • Application support
    • Fault tolerance in file transfer
    • GUI improvement
  • Agent algorithms
    • Over-gateway application deployment
    • Dynamic resource allocation and monitoring
    • Priority-based agent migration
  • Performance evaluation
  • Dissemination

47
Can AgentTeamwork Become Their Competitor?
Nimrod
48
Questions?
49
MPJ.Send and Recv Performance
50
Mobile Agents
Mobile agents | Naming | Cascading termination | Job scheduling | Security
IBM Aglets | AgletFinder traces all agents | Needs to retract one by one | Schedules jobs with Baglets | Java byte-code verification
Voyager | RPC-based system-unique agent IDs | Needs to be implemented at a user level | Launches an independent user process | CORBA security service
DAgent | Unpredictable agent IDs | Needs to be implemented at a user level | Launches an independent user process | A currency-based model
Ara (Obsolete) | Unpredictable agent IDs | Calls ara_kill to kill all agents | Launches an independent user process | An allowance model
UWAgent | Agent domain | Waits for all descendants' termination | Schedules jobs with Java thread functions | Agent-to-agent security w/ agent domain