Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting

Slides: 23
Provided by: scottmj
Learn more at: https://www.csm.ornl.gov
1
Scalable Systems Software Center
Resource Management and Accounting Working Group
Face-to-Face Meeting
Jan 25-26, 2005
Washington D.C.
2
Resource Management and Accounting Working Group
  • Working group scope
  • Progress since last face-to-face
  • Future Work

3
Working Group Scope
  • The Resource Management Working Group is involved
    in the areas of resource management, scheduling
    and accounting.
  • This working group will focus on the following
    software components:
  • Queue Manager
  • Scheduler
  • Accounting and Allocation Manager
  • Meta Scheduler
  • Other critical resource management components are
    being developed in the Process Management and
    Monitoring Working Group:
  • Process Manager
  • Cluster Monitor

4
Resource Management Component Architecture
[Diagram of the component architecture, showing: Grid Scheduler,
Infrastructure Services, Allocation Manager, Cluster Scheduler,
Discovery Service, Queue Manager, Node Monitor, Event Manager,
Security System, Node Manager, Process Manager]
5
Resource Management Prototype Demonstration
This demo runs a simple end-to-end test in which a
submitted job runs past its wallclock limit.
[Diagram: message flow among the Job Submission Client,
Queue Manager, Cluster Scheduler, Allocation Manager,
Node Monitor, Process Manager, and Discovery Service.
Messages, in step order:]
  • 0 Service-Lookup
  • 1 Submit-Job
  • 2 Query-Job
  • 3 Query-Node
  • 4 Create-Reservation
  • 5 Run-Job
  • 6 Exec-Process
  • 7 Query-Job
  • 8 Delete-Job
  • 9 Withdraw-Allocation
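
The numbered messages on this slide can be captured as data and replayed in step order. The following is a minimal Python sketch: the names `DEMO_MESSAGES`, `ordered_messages`, and `over_wallclock` are hypothetical and not part of any SSS component, and the wallclock check only illustrates the condition the demo exercises.

```python
# Step numbers and message names as they appear on the demo slide.
DEMO_MESSAGES = {
    0: "Service-Lookup",
    1: "Submit-Job",
    2: "Query-Job",
    3: "Query-Node",
    4: "Create-Reservation",
    5: "Run-Job",
    6: "Exec-Process",
    7: "Query-Job",
    8: "Delete-Job",
    9: "Withdraw-Allocation",
}


def ordered_messages(messages=DEMO_MESSAGES):
    """Return the message names sorted by their step number."""
    return [messages[step] for step in sorted(messages)]


def over_wallclock(elapsed_seconds, limit_seconds):
    """Hypothetical check: has the job run longer than its wallclock limit?"""
    return elapsed_seconds > limit_seconds


if __name__ == "__main__":
    for step, name in enumerate(ordered_messages()):
        print(step, name)
    # A job with a 120 s limit that has already run for 130 s is over its limit.
    print(over_wallclock(elapsed_seconds=130, limit_seconds=120))
```

Which component sends each message to which is shown only in the original diagram, so the sketch deliberately records the order and nothing more.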
6
General Progress
  • Protocol has stabilized: very little change in
    the SSSRMAP Wire Protocol or Message Format
  • Scott - wrote a good deal of the SSSRMAP Message
    Format SDK (Python classes)
  • All that is left is Data integration into Request
    and Response
  • Craig - initial efforts on SSSRMAP Wire Level
    integration into ssslib

7
General Progress
  • SC2004 release of RMWG components
  • System tested and bundled w/ SSS-OSCAR 1.0
  • Bamboo Queue Manager v1.0.0
  • Maui Scheduler v3.2.6p10
  • Gold Accounting and Allocation Manager v2.0.b1.1
  • Warehouse System Monitor v0.7.0

8
General Progress
  • Starting to see evidence of adoption and value
    added by the SSS components
  • Bamboo Queue Manager
  • built-in support for checkpoint/restart
  • PBS or LoadLeveler job submission syntax
  • interfaces with ANL process manager
  • has been in production use on the Ames cluster
    for over a year now

9
General Progress
  • Adoption and value add (continued)
  • Gold Allocation Manager
  • very successful in ensuring that the right work
    gets done
  • very successful in establishing a project cycle
    and managing capacity
  • Gold is in production use on multiple PNNL
    systems including the 11.8TF Linux Cluster
  • Dozens of sites have downloaded it
  • About three other sites are currently evaluating
    Gold (discussions have also begun with DoD HPCMP
    sites)

10
General Progress
  • Adoption and value add (continued)
  • Maui Scheduler
  • implemented support for checkpoint/restart
  • sites are using the new resource utilization
    tracking and enforcement capabilities to
    advantage
  • because of SSS-directed work in enhanced
    prioritization, throttling policies and quality
    of service, sites are better able to dial in
    their preferences for:
  • improved fairness
  • higher system utilization
  • improved response time
  • targeted cycle delivery

11
General Progress
  • Maui Scheduler (continued)
  • Maui has been installed on over 2,500 clusters
    and was downloaded over 100,000 times last year
  • Maui is running on more supercomputers than any
    other scheduler in the world
  • In early 2003 it was found to be running on the
    following share of the Top 500 list:
  • 15 out of the top 20
  • 75 out of the top 100

12
Queue Manager Progress
  • v1.0 (and v1.0.1) release of Bamboo made
    available
  • Full support for SSSRMAP v3 message format
  • Submission clients support PBS in addition to
    LoadLeveler style job scripts
  • Checkpoint/restart manager interfaces tested and
    debugged.
  • Job output now correct for suspended jobs.
  • SSS suite was updated on the cluster at Ames in
    November with the full SC code release.

13
Accounting and Allocation Manager Progress
  • Released Gold beta at SC2004
  • Included in SSS-OSCAR 1.0 distribution
  • Beta version of Gold in production on PNNL's
    11.8TF Linux cluster
  • Full-featured Web-based Graphical User Interface
  • Performance testing and tuning carried out
  • Improved robustness (timeout select in
    non-blocking read/write loops prevents client and
    server communication hangs)
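
The robustness item above, a timeout on `select` inside a non-blocking read loop, can be illustrated with a minimal Python sketch. This is not Gold's code (which is written in Perl); the function name `recv_with_timeout` and its parameters are assumptions made for the example.

```python
import select
import socket


def recv_with_timeout(sock, nbytes, timeout_seconds):
    """Read up to nbytes from a non-blocking socket.

    Raises TimeoutError if the peer stays silent for timeout_seconds,
    so a stalled connection cannot hang the caller forever.
    """
    data = b""
    while len(data) < nbytes:
        # Wait until the socket is readable, but never longer than the timeout.
        readable, _, _ = select.select([sock], [], [], timeout_seconds)
        if not readable:
            raise TimeoutError("no data received within the timeout")
        try:
            chunk = sock.recv(nbytes - len(data))
        except BlockingIOError:
            continue  # spurious wakeup; go back to select()
        if not chunk:
            break  # peer closed the connection
        data += chunk
    return data


if __name__ == "__main__":
    client, server = socket.socketpair()
    client.setblocking(False)
    server.sendall(b"hello")
    print(recv_with_timeout(client, 5, timeout_seconds=1.0))
    try:
        # Nothing more was sent, so this read gives up instead of hanging.
        recv_with_timeout(client, 1, timeout_seconds=0.1)
    except TimeoutError as exc:
        print("timed out:", exc)
```

Without the timeout argument, `select` would block indefinitely on a silent peer, which is the kind of client/server hang the slide describes.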

14
Accounting and Allocation Manager Progress
  • Ported Gold to Tier1 and Tier2 OSs
  • Added support for SQLite embedded database
  • Added support for encryption/decryption (in Perl)
  • Support for variable decimal precision currency
  • New reservation design improves handling of
    charges that span allocation boundaries
  • Created a project usage report
  • New User Guide chapters on Allocations,
    Installation, Roles, gold shell, Passwords

15
Cluster Scheduler Progress
  • Peer Diagnostics - added service health checks
  • SSS Interface - added support for numerous job
    attributes
  • Packaging - Enhanced packaging for pre-req
    auto-detection
  • Security - added interface buffer overflow
    prevention
  • Allocation Manager Interface - extended support
    for allocation debit/reservation attributes
  • Added end-to-end support for Bamboo/Berkeley
    Checkpoint Manager based suspend/resume
  • General - numerous stability and usability
    enhancements

16
Grid Scheduler Progress
  • Cluster Service API - rewrote Cluster Service
    interface to use SSS job object and message layer
    communication protocols
  • Usability - added node monitoring, job
    monitoring, statistics, and job management client
    commands
  • Submission - significantly enhanced job
    submission client and Globus job staging
    infrastructure
  • Data Staging - improved performance and
    reliability of gridFTP, GASS, and SCP based data
    staging
  • Grid Fairness - added initial support for grid
    level usage policies, fairshare, and priority
  • General - enhanced multi-cluster job
    co-allocation, improved packaging, documentation,
    and internal diagnostics of Globus, network, job,
    and resource failures.

17
MCOM Progress (common library used by the cluster
scheduler and grid scheduler)
  • XML - added failure logging and exception
    handling for corrupt XML
  • Compression - added inline socket data
    compression
  • Encryption - added initial key based data
    encryption (not full SSS standard)
  • General - made general improvements in socket
    communication, XML processing, SSS job
    processing, and node resource monitoring
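
The XML and compression items above can be pictured together: compress a message before it goes on the socket, and on receipt log and reject anything that fails to decompress or parse. The sketch below is in Python using the standard `zlib` and `xml.etree.ElementTree` modules; it is not MCOM code, and the function names and the `<job>` element are hypothetical.

```python
import logging
import xml.etree.ElementTree as ET
import zlib

log = logging.getLogger("mcom_sketch")


def pack(xml_text):
    """Compress an XML message for sending over a socket."""
    return zlib.compress(xml_text.encode("utf-8"))


def unpack(payload):
    """Decompress and parse a received message.

    Returns the root Element, or None (after logging the failure) when the
    payload is corrupt at the compression, encoding, or XML level.
    """
    try:
        text = zlib.decompress(payload).decode("utf-8")
        return ET.fromstring(text)
    except (zlib.error, UnicodeDecodeError, ET.ParseError) as exc:
        log.error("dropping corrupt message: %s", exc)
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    good = unpack(pack("<job id='1'/>"))
    print(good.tag, good.get("id"))
    print(unpack(pack("<job")))         # truncated XML is rejected
    print(unpack(b"not compressed"))    # bad compressed data is rejected
```

Returning `None` instead of raising keeps one bad message from taking down the receive loop, at the cost of the caller having to check the result.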

18
Future Work
  • General release of all components
  • Including new Silver Meta-scheduler
  • Increase deployment base
  • Portability testing for new components
  • Tier 1: Linux RedHat (9.0)
  • Tier 2: Linux SuSE, AIX, Tru-64
  • Tier 3: OS-X, Unicos
  • Tier 4: HP-UX, IRIX, Solaris
  • Fault Tolerance supporting 25 cluster loss

19
Future Work
  • Queue manager
  • Add job group support (mainly for submission)
  • Add Task Group support/ multi-requirement job
    support to submission clients
  • Add Job Submission filter
  • Finish final missing portions of PBS style job
    language support.

20
Future Work
  • Accounting and Allocation manager
  • General release to be made available by mid-year
  • Production deployment of Gold on additional sites
  • Port Gold to other OSs (Tiers 3 and 4) and
    databases
  • Complete and test design for distributed
    accounting and multi-organizational involvement
    in job startup
  • Add support for multi-site authentication/authorization
    (each site having its own symmetric key)
  • Improvements in the web-based GUI
  • Documentation to include object customization

21
Future Work
  • Cluster Scheduler
  • Peer Diagnostics - add auto-recovery to failed
    service interfaces
  • Resource Utilization - complete development of
    all resource utilization objectives
  • Resource Limits - complete development of all
    resource limits objectives
  • Checkpoint Restart - optimize resource management
    for suspended jobs

22
Future Work
  • Grid Scheduler
  • Reliability - complete Globus failure diagnostics
    and auto-recovery
  • Data Staging - complete Globus/Non-Globus data
    staging failure auto-recovery
  • Optimization - add network co-allocation
    reservation
  • Fairness - complete Priority, Fairshare, and
    Usage Limit based policy enforcement
  • Statistics - add credential, job, and cluster
    based usage statistics
  • General - mature client commands to provide
    status reporting in more intuitive manner