Title: The Check-Pointed and Error-Recoverable MPI Java of AgentTeamwork Grid Computing Middleware
1. The Check-Pointed and Error-Recoverable MPI Java of AgentTeamwork Grid Computing Middleware
- Munehiro Fukuda and Zhiji Huang
- Computing Software Systems, University of Washington, Bothell
- Funded by
2. Background
- Target applications in most grid-computing systems
  - Communication takes place at the beginning and the end of each sub task.
  - A crashed sub task will simply be repeated.
  - Example: master-worker and parameter-sweep models
- Fault tolerance
  - FT-MPI or MPI in Legion/Avaki
    - The system will not recover lost messages.
    - Users must specify variables to save and add a function called from MPI.Init( ) upon a job resumption.
  - Condor MW
    - Messages between the master and each slave will be saved.
    - No inter-slave communication will be check-pointed.
  - Rock/Rack
    - Socket buffers will be saved at the application level.
    - A process must be mobile-aware to keep track of its communication counterpart.
3. Objective
- More programming models
  - Not restricted to master-slave or parameter-sweep models
  - Targeting heartbeat, pipeline, and collective-communication-oriented applications
- Process resumption in mid-execution
  - Resuming a process from the last checkpoint
  - Allowing process migration for performance improvement
- Error-recovery support from sockets to MPI
  - Facilitating a check-pointed, error-recoverable Java socket
  - Implementing the mpiJava API with our fault-tolerant socket
4. System Overview
(Diagram: AgentTeamwork system overview, including the bookkeeper agent)
5. Execution Layer
Software layers, from top to bottom:
- Java user applications
- mpiJava API
- mpiJava-A / mpiJava-S
- GridTcp / Java socket
- User program wrapper
- Commander, resource, sentinel, and bookkeeper agents
- UWAgents mobile agent execution platform
- Operating systems
6. Programming Interface

    public class MyApplication {
        public GridIpEntry ipEntry[];  // used by the GridTcp socket library
        public int funcId;             // used by the user program wrapper
        public GridTcp tcp;            // the GridTcp error-recoverable socket
        public int nprocess;           // #processors
        public int myRank;             // processor id (or MPI rank)

        public int func_0( String[] args ) {   // constructor
            MPJ.Init( args, ipEntry, tcp );    // invoke mpiJava-A
            .....                              // more statements to be inserted
            return 1;                          // calls func_1( )
        }

        public int func_1( ) {                 // called from func_0
            if ( MPJ.COMM_WORLD.Rank( ) == 0 )
                MPJ.COMM_WORLD.Send( ... );
            else
                MPJ.COMM_WORLD.Recv( ... );
            .....                              // more statements to be inserted
            return 2;                          // calls func_2( )
        }
    }
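- For concreteness, a minimal two-rank example written against this interface is sketched below. It assumes the GridIpEntry, GridTcp, and MPJ classes shown on this slide, and mpiJava-style Send/Recv signatures (buffer, offset, count, datatype, rank, tag); treat it as an illustration rather than the definitive API.

    import java.io.Serializable;

    // Illustrative sketch only: GridIpEntry, GridTcp, and MPJ come from the
    // AgentTeamwork / mpiJava-A libraries; their exact signatures are assumed here.
    public class PingApplication implements Serializable {  // serializable so the wrapper can snapshot it
        public GridIpEntry ipEntry[];  // used by the GridTcp socket library
        public int funcId;             // used by the user program wrapper
        public GridTcp tcp;            // the GridTcp error-recoverable socket
        public int nprocess;           // #processors
        public int myRank;             // processor id (or MPI rank)

        public int func_0( String[] args ) {    // "constructor"
            MPJ.Init( args, ipEntry, tcp );     // invoke mpiJava-A
            return 1;                           // continue with func_1( )
        }

        public int func_1( ) {
            int[] token = new int[1];
            if ( MPJ.COMM_WORLD.Rank( ) == 0 ) {
                token[0] = 42;
                MPJ.COMM_WORLD.Send( token, 0, 1, MPJ.INT, 1, 0 );  // to rank 1, tag 0
            } else {
                MPJ.COMM_WORLD.Recv( token, 0, 1, MPJ.INT, 0, 0 );  // from rank 0, tag 0
            }
            return 2;                           // continue with func_2( )
        }

        public int func_2( ) {
            MPJ.Finalize( );
            return -2;                          // -2 tells the wrapper to stop
        }
    }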
7. MPJ Package
- MPJ: Init( ), Rank( ), Size( ), and Finalize( )
- Communicator: all communication functions, such as Send( ), Recv( ), Gather( ), Reduce( ), etc.
- JavaComm: mpiJava-S uses Java sockets and server sockets.
- GridComm: mpiJava-A uses GridTcp sockets.
- DataType: MPJ.INT, MPJ.LONG, etc.
- Communication implementation:
  - An InputStream for each rank
  - An OutputStream for each rank
  - Uses a permanent 64KB buffer for serialization
  - Emulates collective communication by sending the same data to each OutputStream, which degrades performance
- MPJMessage: getStatus( ), getMessage( ), etc.
- Op: Operate( )
- etc.: other utilities
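- A hedged usage sketch of the package: the call below assumes MPJ.COMM_WORLD.Bcast follows the mpiJava-style signature (buffer, offset, count, datatype, root), with MPJ.DOUBLE taken from the DataType class above.

    // Illustrative broadcast of a double array from rank 0 to all ranks.
    // Internally, mpiJava-A emulates the collective by serializing the same
    // data into every rank's OutputStream, hence the performance penalty
    // noted above.
    public class BcastExample {
        public int func_1( ) {
            double[] data = new double[1024];
            if ( MPJ.COMM_WORLD.Rank( ) == 0 ) {
                for ( int i = 0; i < data.length; i++ )
                    data[i] = i * 0.5;          // rank 0 fills the buffer
            }
            MPJ.COMM_WORLD.Bcast( data, 0, data.length, MPJ.DOUBLE, 0 );
            return 2;
        }
    }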
8. GridTcp Check-Pointed Connection
(Diagram: user program wrappers on n1.uwb.edu and n2.uwb.edu, each keeping a rank/IP table, with outgoing, backup, and incoming queues layered over TCP for snapshot maintenance; n3.uwb.edu appears as a further node.)

    rank  ip
    1     n1.uwb.edu
    2     n2.uwb.edu

- Outgoing packets are saved in a backup queue.
- All packets are serialized into a backup file at every checkpoint.
- Upon a migration:
  - Packets are de-serialized from the backup file.
  - Backup packets are restored into the outgoing queue.
  - The IP table is updated.
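- A hypothetical sketch of this snapshot maintenance in plain Java (class and method names are illustrative, not the actual GridTcp API):

    import java.io.*;
    import java.util.ArrayDeque;

    // Sketch: outgoing packets are copied into a backup queue, the queue is
    // serialized to a backup file at each checkpoint, and restored into the
    // outgoing queue after a migration so transmission can resume.
    class SnapshotQueue {
        private final ArrayDeque<byte[]> outgoing = new ArrayDeque<>();
        private final ArrayDeque<byte[]> backup   = new ArrayDeque<>();

        void send( byte[] packet ) {
            outgoing.add( packet );   // queued for transmission over TCP
            backup.add( packet );     // kept for recovery
        }

        // serialize every backed-up packet into a backup file at a checkpoint
        void checkpoint( File backupFile ) throws IOException {
            try ( ObjectOutputStream out =
                      new ObjectOutputStream( new FileOutputStream( backupFile ) ) ) {
                out.writeObject( backup.toArray( new byte[0][] ) );
            }
        }

        // upon migration: de-serialize the backup file and restore the packets
        // into the outgoing queue so they can be retransmitted from the new node
        void restore( File backupFile ) throws IOException, ClassNotFoundException {
            try ( ObjectInputStream in =
                      new ObjectInputStream( new FileInputStream( backupFile ) ) ) {
                byte[][] packets = (byte[][]) in.readObject();
                for ( byte[] p : packets )
                    outgoing.add( p );
            }
        }
    }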
9. GridTcp Over-Gateway Connection
(Diagram: a user program wrapper runs on each of mnode0 (rank 0), medusa.uwb.edu (rank 1), uw1-320.uwb.edu (rank 2), and uw1-320-00 (rank 3); each wrapper keeps a RIP-like routing table, reconstructed below.)

    mnode0 (rank 0)              medusa.uwb.edu (rank 1)
    rank  dest        gateway    rank  dest        gateway
    0     mnode0      -          0     mnode0      -
    1     medusa      -          1     medusa      -
    2     uw1-320     medusa     2     uw1-320     -
    3     uw1-320-00  medusa     3     uw1-320-00  uw1-320

    uw1-320.uwb.edu (rank 2)     uw1-320-00 (rank 3)
    rank  dest        gateway    rank  dest        gateway
    0     mnode0      medusa     0     mnode0      uw1-320
    1     medusa      -          1     medusa      uw1-320
    2     uw1-320     -          2     uw1-320     -
    3     uw1-320-00  -          3     uw1-320-00  -

- RIP-like connection
- Restriction: each node name must be unique.
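- A sketch of how such a RIP-like next-hop lookup could be coded (RoutingTable and its methods are hypothetical names, not part of GridTcp):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative next-hop lookup: if a destination's gateway entry is "-",
    // the node is contacted directly; otherwise the connection is relayed
    // through the listed gateway node.
    class RoutingTable {
        private final Map<String, String> gateway = new HashMap<>();  // dest -> gateway

        void add( String dest, String via ) {
            gateway.put( dest, via );
        }

        // returns the node to open a TCP connection to in order to reach dest
        String nextHop( String dest ) {
            String via = gateway.get( dest );
            return ( via == null || via.equals( "-" ) ) ? dest : via;
        }
    }

    class RoutingDemo {
        public static void main( String[] args ) {
            RoutingTable mnode0 = new RoutingTable();   // table of mnode0 (rank 0)
            mnode0.add( "mnode0", "-" );
            mnode0.add( "medusa", "-" );
            mnode0.add( "uw1-320", "medusa" );
            mnode0.add( "uw1-320-00", "medusa" );
            // a message from rank 0 to rank 3 is first relayed to the medusa gateway
            System.out.println( mnode0.nextHop( "uw1-320-00" ) );   // prints "medusa"
        }
    }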
10. User Program Wrapper

Source Code:

    statement_1; statement_2; statement_3;
    check_point( );
    statement_4; statement_5; statement_6;
    check_point( );
    statement_7; statement_8; statement_9;
    check_point( );

Preprocessed into the user program wrapper:

    int fid = 1;
    while ( fid != -2 ) {
        switch ( func_id ) {
        case 0: fid = func_0( ); break;
        case 1: fid = func_1( ); break;
        case 2: fid = func_2( ); break;
        }
        check_point( );   // save this object (including func_id) into a file
    }

    func_0( ) { statement_1; statement_2; statement_3; return 1; }
    func_1( ) { statement_4; statement_5; statement_6; return 2; }
    func_2( ) { statement_7; statement_8; statement_9; return -2; }
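- What check_point( ) and a later resumption might look like with plain Java serialization (file and method names are illustrative; GridTcp additionally snapshots its message queues, as on slide 8):

    import java.io.*;

    // Sketch: the wrapper serializes the whole user-application object, whose
    // funcId field records which func_n to run next, into a snapshot file; a
    // resumed or migrated process de-serializes it and re-enters the driver loop.
    class WrapperSketch {
        static void check_point( Serializable app, File snapshot ) throws IOException {
            try ( ObjectOutputStream out =
                      new ObjectOutputStream( new FileOutputStream( snapshot ) ) ) {
                out.writeObject( app );   // captures all fields, including funcId
            }
        }

        static Object resume( File snapshot ) throws IOException, ClassNotFoundException {
            try ( ObjectInputStream in =
                      new ObjectInputStream( new FileInputStream( snapshot ) ) ) {
                return in.readObject();   // the driver loop restarts at funcId
            }
        }
    }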
11. Preprocessor and Drawbacks

Source Code:

    statement_1; statement_2; statement_3;
    check_point( );
    while ( ... ) {
        statement_4;
        if ( ... ) {
            statement_5;
            check_point( );
            statement_6;
        } else {
            statement_7;
        }
        statement_8;
    }
    check_point( );

Preprocessed Code:

    int func_0( ) {                    // up to the first check_point( )
        statement_1; statement_2; statement_3;
        return 1;
    }

    int func_1( ) {                    // before check_point( ) in the if-clause
        while ( ... ) {
            statement_4;
            if ( ... ) {
                statement_5;
                return 2;
            } else {
                statement_7;
            }
            statement_8;
        }
    }

    int func_2( ) {                    // after check_point( ) in the if-clause
        statement_6;
        statement_8;
        while ( ... ) {
            statement_4;
            if ( ... ) {
                statement_5;
                return 2;
            } else {
                statement_7;
            }
            statement_8;
        }
    }

Drawbacks:
- No recursion is supported.
- Source line numbers reported upon errors become useless.
- Explicit snapshot points are still needed.
12. MPI Job Coordination
(Diagram: MPI job coordination over UWPlace, the UWAgent execution platform)
13. MPJ.Send and Recv Performance
(Performance graph)
14. MPJ.Bcast Performance (Doubles)
(Performance graph)
15. Conclusions
- Raw bandwidth
  - mpiJava-S reaches about 95-100% of the maximum Java socket performance.
  - mpiJava-A (with check-pointing and error recovery) incurs 20-60% overhead, but still overtakes mpiJava for larger data segments.
- Serialization
  - When dealing with primitives or objects that need serialization, a 25-50% overhead is incurred.
- Memory issues related to mpiJava-A
  - Due to snapshots created at every func_n call.
- Next work
  - Performance and memory-usage improvement
  - Preprocessor implementation