Parallel Session B2 - CPU and Resource Allocation

Slides: 7
Provided by: fna68

1
Parallel Session B2 - CPU and Resource Allocation
  • Panelists
  • Charles Young (BaBar)
  • David Bigagli
  • Seed Questions
  • Batch queuing system in use?
  • Turnaround guarantees?
  • Pre-allocation of resources?

2
Batch System
  • Vendor or home-brewed?
  • Maintenance and support issues.
  • If vendor, licensing and cost issues.
  • Already a significant fraction of H/W cost.
  • Charged per node? Per unit of computing power? Or some other way?
  • Management concerns.
  • Can one really manage 10K nodes?
  • Split into separate management domains?

Slide from Charles Young (BaBar)
3
LSF - Talk given by David Bigagli (Platform Computing)
  • What is LSF
  • Developer's view of LSF
  • Architecture
  • Scalability
  • Dealing with resources
  • Load Information Manager (LIM)
  • Batch
  • Lively discussion

4
Discussion: which batch systems are you using, and why?
  • LSF (30 ?)
  • LSF has worked for us and continues to work.
  • PBS (30)
  • PBS is free
  • Collaborators want us to use PBS (because PBS is free)
  • Ability to modify source
  • Nobody is using PBS Pro
  • Condor (20)
  • Condor costs nothing
  • Cycle stealing allows us to get computing done

5
Discussion: which batch systems are you using, and why? (continued)
  • BQS
  • Homegrown at IN2P3
  • Used at a small number of external sites
  • Have had it for seven years
  • Everyone likes it
  • FBS
  • Homegrown at FNAL
  • Used at some external sites
  • LSF is expensive
  • FBS is designed to be used on farms
  • Lightweight and flexible

6
Other issues
  • Mosix
  • Only CERN has looked at it
  • Appears to be difficult to take down individual
    machines in the cluster
  • How to deal with abusers?
  • Turn them over to the user community
  • How do people schedule downtime?
  • Train people that jobs longer than 24 hours are at risk.
  • CERN posts a future shutdown time for the job starter (internal)
  • BQS has this feature built in.
  • Condor has eventd for draining. Labs reboot and
    have maintenance windows.
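
The "posted shutdown time" idea above can be sketched in a few lines. This is only an illustration, not CERN's or BQS's actual mechanism: the function names `may_start` and `at_risk`, and the assumption that each job declares a requested run time, are hypothetical. The job starter refuses to launch a job that could not finish before the posted shutdown, and jobs longer than 24 hours are flagged as at risk.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold taken from the slide: jobs longer than 24 hours are at risk.
AT_RISK_LIMIT = timedelta(hours=24)


def may_start(now: datetime,
              requested_runtime: timedelta,
              shutdown_at: Optional[datetime]) -> bool:
    """Return True if the job could finish before the posted shutdown time.

    shutdown_at is None when no shutdown has been posted.
    """
    if shutdown_at is None:
        return True
    return now + requested_runtime <= shutdown_at


def at_risk(requested_runtime: timedelta) -> bool:
    """Return True if the job exceeds the 24-hour guideline."""
    return requested_runtime > AT_RISK_LIMIT


# Example: a shutdown is posted for 08:00 the next morning.
now = datetime(2002, 1, 1, 12, 0)
shutdown = datetime(2002, 1, 2, 8, 0)

short_job_ok = may_start(now, timedelta(hours=6), shutdown)   # fits before shutdown
long_job_ok = may_start(now, timedelta(hours=30), shutdown)   # would cross shutdown
```

With this check in the job starter, the queue drains naturally as the shutdown approaches: short jobs keep running, while long jobs wait until after the maintenance window.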