Title: Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting February 24-25, 2003
1 Scalable Systems Software Center: Resource Management and Accounting Working Group Face-to-Face Meeting, February 24-25, 2003
2 Resource Management and Accounting Working Group
- Working group scope
- Progress over last quarter
- Next steps
- Topics for group consideration
3 Working Group Scope
- The Resource Management Working Group is involved in the areas of resource management, scheduling, and accounting.
- This working group will focus on the following software components:
  - Queue Manager
  - Scheduler
  - Allocation Manager (and accounting)
  - Meta Scheduler
- Other critical resource management components are being developed in the Process Management and Monitoring Working Group:
  - Process Manager
  - Cluster Monitor
4 Proposed Component Architecture
[Diagram: Infrastructure Services, Meta Scheduler, Discovery Service, Allocation Manager, Local Scheduler, Information Service, Queue Manager, Node Monitor, Event Manager, Security System, Node Manager, Process Manager. Color key by working group: Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure.]
5 Resource Management Prototype Demonstration
This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.
[Diagram: numbered message flow among the Job Submission Client, Queue Manager, Local Scheduler, Allocation Manager, Node Monitor, Process Manager, and Discovery Service. Color key by working group: Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure.]
- 0 Service-Lookup
- 1 Submit-Job
- 2 Query-Job
- 3 Query-Node
- 4 Create-Reservation
- 5 Run-Job
- 6 Exec-Process
- 7 Query-Job
- 8 Delete-Job
- 9 Withdraw-Allocation
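The numbered message flow of the demo can be sketched in a few lines of Python. The step numbers and message names come from the slide; everything else (the function names, and the choice to stop after the second Query-Job when the job stays within its limit) is an assumption made for illustration, not the behavior of the actual components.

```python
# Step numbers and message names are taken from the slide.
# The control flow around them is an illustrative assumption.
DEMO_STEPS = [
    (0, "Service-Lookup"),
    (1, "Submit-Job"),
    (2, "Query-Job"),
    (3, "Query-Node"),
    (4, "Create-Reservation"),
    (5, "Run-Job"),
    (6, "Exec-Process"),
    (7, "Query-Job"),
    (8, "Delete-Job"),
    (9, "Withdraw-Allocation"),
]


def exceeded_wallclock(elapsed_s: float, limit_s: float) -> bool:
    """Return True when a job has run longer than its wallclock limit."""
    return elapsed_s > limit_s


def demo_trace(elapsed_s: float, limit_s: float) -> list:
    """Return the message names issued in order for one demo run.

    Assumption: steps 8 and 9 are only issued when the limit is exceeded.
    The slide only covers the over-limit case.
    """
    trace = [name for step, name in DEMO_STEPS if step <= 7]
    if exceeded_wallclock(elapsed_s, limit_s):
        trace += [name for step, name in DEMO_STEPS if step >= 8]
    return trace


if __name__ == "__main__":
    for name in demo_trace(elapsed_s=90, limit_s=60):
        print(name)
```

Calling `demo_trace(90, 60)` walks through all ten messages, ending with Delete-Job and Withdraw-Allocation, which mirrors the over-limit scenario shown in the demo.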
6 General Progress
- Released v1.0 Initial SSS Resource Management Suite
  - OpenPBS-SSS 2.3.15-1
  - Maui Scheduler 3.2.6
  - QBank 2.10.4 (accounting system)
- Website created and software available for download (intended for friendly beta testers)
- SSSRMAP protocol (using HTTP) validated in Maui Scheduler, Queue Manager, PBS front-end, and Gold Allocation Manager (complex query support validated and utility demonstrated across a variety of usage scenarios)
- Scalability testing performed on all components
7 Scheduler Progress
- Scheduler implemented interfaces for the system monitor, the event manager, and the service directory, as well as a scheduling extension interface (allowing scheduling plug-ins to enable scheduling algorithms and capabilities)
- Enhanced native support for LoadLeveler, PBS, SGE, LSF, and BProc based systems
- Significantly enhanced web-based scheduler documentation; added scheduler command man pages for select commands
- SSS Requirements document completed
8 Scheduler Progress
- Security improvements
  - Support for DES, HMAC, MD5, and external-source secret-key based algorithms has been implemented for client/server authentication
  - Improved buffer overflow protection has been added to critical scheduler interfaces
  - A generalized secret key management facility has been implemented for secure multi-party communication
- Scalability improvements
  - Decreasing memory consumption by over 80%
  - Enabling support for up to 8,000 nodes
  - Enabling support for up to 32,000 processors
  - Enabling support for up to 2,000 simultaneous active jobs
  - Enabling support for jobs requesting up to 16,000 hosts
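To illustrate the general idea behind secret-key based client/server authentication mentioned under the security improvements, here is a minimal sketch using Python's standard `hmac` module. It is not the scheduler's implementation: the function names, the message contents, and the choice of SHA-256 as the digest are assumptions made for the example.

```python
import hashlib
import hmac


def sign(secret: bytes, message: bytes) -> str:
    """Compute a hex HMAC tag over the message with a shared secret key."""
    # Assumption: SHA-256 is used here only for illustration.
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, message: bytes, signature: str) -> bool:
    """Check a received tag using a constant-time comparison."""
    return hmac.compare_digest(sign(secret, message), signature)


if __name__ == "__main__":
    key = b"shared-secret"            # hypothetical key known to both sides
    request = b"Query-Job id=42"      # hypothetical client request
    tag = sign(key, request)          # client attaches this tag
    print(verify(key, request, tag))  # server recomputes and compares
```

The server accepts a request only if the tag it recomputes with its own copy of the key matches the tag sent by the client, so a tampered message or a wrong key fails verification.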
9 Scheduler Progress
- Fault Tolerance
  - Migration of all Resource Manager calls to a threaded Resource Manager interface (enabling the scheduler to survive interface hangs and crashes)
  - Incorporation of Resource Manager and Allocation Manager diagnostics and failure tracking statistics
  - Implementation of improved data checking and handling routines to detect and correct corrupt Resource Manager data
- Dynamic job support interfaces have been designed
- Limited support for generic resources has been enabled (e.g., software licenses, network bandwidth, global disk caches)
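The fault-tolerance item about a threaded Resource Manager interface can be illustrated with a small sketch: the caller runs a possibly hanging call in a separate thread and gives up after a timeout instead of blocking forever. This is a generic pattern and an assumption about the approach, not the scheduler's code; `call_with_timeout` and the sample functions are hypothetical names.

```python
import threading
import time


def call_with_timeout(func, timeout_s, *args):
    """Run func(*args) in a daemon thread and wait at most timeout_s seconds.

    Returns (True, value) on success, or (False, reason) on timeout or error.
    """
    result = {}

    def worker():
        try:
            result["value"] = func(*args)
        except Exception as exc:  # record failures instead of crashing the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return False, "timeout"  # the call is still hung; the caller moves on
    if "error" in result:
        return False, repr(result["error"])
    return True, result["value"]


def query_nodes():
    """Hypothetical Resource Manager call that answers quickly."""
    return ["node01", "node02"]


def hung_query():
    """Hypothetical Resource Manager call that hangs for a while."""
    time.sleep(0.5)
    return []


if __name__ == "__main__":
    print(call_with_timeout(query_nodes, 0.2))
    print(call_with_timeout(hung_query, 0.05))
```

Because the worker is a daemon thread, a call that never returns does not keep the calling process alive; the caller can record the failure and retry later.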
10 Queue Manager Progress
- Both the Ames Queue Manager and the PNNL PBS front-end have implemented and validated the SSSRMAP HTTP interface
- Replaced third-party XML parser with SSS-created routines
- Created Resource Management Suite Software website
- PNNL created and tested patches for PBS scalability improvements and packaged them as RPMs (and a tarball patch) for beta distribution
- Requirements document completed
- Updated Process-Manager interface for new XML schema
- Ames Queue-Manager has implemented a nearly complete PBS-like command line interface
11 Accounting and Allocation Manager Progress
- QBank
  - A test harness was installed, test suites created, significant testing performed, and bugs fixed
  - Security was strengthened (new qauth uses libcrypto and keeps the key in a separate file, for greater stability and so that binary versions can be distributed)
  - The install process for QBank was streamlined and made non-interactive
  - Packaged in RPMs and tarballs for Linux and released in v1.0 SSS Resource Management System
  - Documentation was significantly improved, including the creation of a user guide, a deployment guide, man pages, and updated online documentation
12 Accounting and Allocation Manager Progress
- Gold
  - Time-travel implemented
  - Initial support for object-joined queries
  - Implemented Reservations
  - Implemented Balance Checking
- Scalability Testing
  - Component-level testing was done to measure the time needed to perform barrages of common accounting and allocation operations (charges, reservations, balance checks, etc.)
  - Simulations were performed with the Maui Scheduler to test transaction times with the allocation manager interface
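The accounting operations named above (balance checks, reservations, charges) can be illustrated with a toy in-memory account. This is a sketch under stated assumptions and does not reflect Gold's or QBank's actual data model or interfaces; the `Account` class and its method names are hypothetical.

```python
class Account:
    """Toy allocation account: a balance plus per-job reservations (holds)."""

    def __init__(self, balance: float):
        self.balance = balance
        self.reserved = {}  # job id -> amount currently held

    def available(self) -> float:
        """Balance that is not tied up in reservations."""
        return self.balance - sum(self.reserved.values())

    def check_balance(self, amount: float) -> bool:
        """Return True if the account can cover the requested amount."""
        return self.available() >= amount

    def reserve(self, job_id: str, amount: float) -> None:
        """Place a hold before the job starts; fail if funds are insufficient."""
        if not self.check_balance(amount):
            raise ValueError("insufficient balance for reservation")
        self.reserved[job_id] = amount

    def charge(self, job_id: str, amount: float) -> None:
        """Drop the job's hold and debit what was actually used."""
        self.reserved.pop(job_id, None)
        self.balance -= amount


if __name__ == "__main__":
    acct = Account(100.0)
    acct.reserve("job1", 60.0)   # hold for the requested wallclock time
    print(acct.available())      # funds left for other jobs
    acct.charge("job1", 45.0)    # job ended early; charge actual usage
    print(acct.balance)
```

Reserving before a job runs and charging afterwards keeps concurrent jobs from overspending the same allocation, which is one common motivation for pairing the two operations.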
13 Meta-Scheduler Progress
- SSS Requirements document completed
- Support has been added for Globus 2.0 and 2.2 based job staging
- The initial information service interface has been designed
- Security has been enhanced by adding Globus credential caching and enabling generalized secret session key management
- Support has been added for retrying resources
- Additional functionality includes the basic data management interface and an initial file staging capability
14 Next Work
- Release v2 SSS Resource Management and Accounting interface specification
- Implement and test SSSRMAP security authentication
- Try to get more components under a testing framework
- Portability enhancements (AIX, Tru64, possibly Cray)
15 Next Work
- Local Scheduler
  - Test interaction with checkpoint/restart mechanisms when interfaces are ready
  - Virtual partitioning through resource limit enforcement and tracking
  - Quality of service support for completion time guarantees
  - Security integration
  - Progress on graphical interfaces
16 Next Work
- Queue manager
  - Implement persistence via database (replacing flat files)
  - Add Epilogue/Prologue support and job submission verification script
  - Interface with Node Monitor
  - Full PBS qsub compatibility (nearly complete)
  - Implement full input/output handling (need to define PM interfaces, if any)
  - Add interface with Node Manager to support job-dependent node OS image installation
17 Next Work
- Accounting and Allocation manager
  - Quotations (Gold)
  - Flexible charging (Gold)
  - Continuing effort to open-source the new and old Allocation Managers
  - SSSRMAP XML Security integration (Gold)
  - Support for operations on returned fields (sort, sum, max, unique, group by, etc.)
  - Begin portability testing for Gold and QBank
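As an illustration of what operations on returned fields (sort, sum, max, unique, group by) mean for query results, the sketch below applies them to a small list of records in plain Python. The field names and sample rows are invented for the example and do not come from Gold or QBank.

```python
from collections import defaultdict

# Hypothetical query result: one row per charge.
ROWS = [
    {"user": "alice", "project": "chem", "charge": 120},
    {"user": "bob", "project": "bio", "charge": 80},
    {"user": "alice", "project": "bio", "charge": 40},
]


def sort_by(rows, field):
    """Return the rows ordered by one field."""
    return sorted(rows, key=lambda row: row[field])


def total(rows, field):
    """Sum one numeric field over all rows."""
    return sum(row[field] for row in rows)


def maximum(rows, field):
    """Largest value of one field."""
    return max(row[field] for row in rows)


def unique(rows, field):
    """Distinct values of one field, in sorted order."""
    return sorted({row[field] for row in rows})


def group_sum(rows, key_field, value_field):
    """Group rows by key_field and sum value_field within each group."""
    sums = defaultdict(int)
    for row in rows:
        sums[row[key_field]] += row[value_field]
    return dict(sums)


if __name__ == "__main__":
    print(total(ROWS, "charge"))
    print(group_sum(ROWS, "user", "charge"))
```

Performing these operations inside the allocation manager, rather than in each client, would let clients request summaries directly instead of post-processing raw rows.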
18 Issues requiring inter-group discussion