Title: Workload Management
 1Workload Management 
- Status of current activity 
- GridPP 13, Durham, 6th July 2005 
2Activity
- Scalability testing 
- Analysis of current middleware performance 
- SGE integration 
- GridCC
3Scalability Testing
People involved: Janusz Martyniak, Luke Dickens, 
Barry MacEvoy, Steve McGough, David Colling 
 4Scalability Testing
Why
-  From EDG we knew that it was easy to build a 
 system capable of running 5 jobs concurrently.
-  Not so easy to build one capable of running 500 
 jobs or 5000 jobs concurrently.
-  The plan was to perform testing to find software 
 bottlenecks and hot spots
-  Feed the results back to the developers in a 
 virtuous circle
5Scalability Testing
- The methodology  
- Original plan was to build a testbed across 2 
 sites (Imperial HEP and LeSC). This was
 deliverable X.Y
- Take an engineering approach, i.e. submit tests 
 to the testbed and monitor how the different
 components respond.
- The metrics tested were to evolve in complexity as 
 stability grew.
6Scalability Testing
- What happened 
- Decided to join the JRA1 testbed instead of 
 forming our own. This gave us better access to
 the developers and much support on other parts of
 the system that we were not directly testing but
 which are needed to run the tests, e.g. VOMS and
 R-GMA. It also meant we contributed to the wider
 community. This decision has been praised by Bob
 Jones and Frederic Hemmer.
- Still decided to run two sites (as per 
 deliverable) as this gave a better testing
 environment for scalability tests
7Scalability Testing
- What happened  
- We were delayed by the late release of the WMS in 
 EGEE
- However, we have had two sites in JRA1 testing since 
 immediately after the Athens meeting. The two
 sites are maintained by JM and LD and
 consist of:
- Machines: 1 WMS 
-  2 CEs (1) 
-  2 WNs (1) 
-  Install: apt 
-  Config: Site 
-  Version: R1.1 (QF78) 
- Machines: 1 WMS 
-  1 CE 
-  2 WNs 
-  1 R-GMA server 
-  1 IO server 
-  1 UI 
-  Install: manual 
-  Config: Site (mostly) 
-  Version: R1.1 
Site 2
Site 1 
 8Scalability Testing
- To add to these sites 
- SEs 
- VOMS 
-  Second R-GMA server (to complete split)
9Scalability Testing
- Actual testing 
- Only really started writing scalability tests a 
 couple of weeks ago
- Have defined some basic metrics 
- Time to submit as a function of number of jobs 
 for serial submission
- Time to submit for parallel submission 
- Failure rates as function of active jobs 
- etc 
- Use LB database and system monitoring on WMS node 
 to reconstruct what is going on
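The serial-submission metric listed above could be measured with a small timing harness. The following is only a minimal sketch under stated assumptions: `submit_job` is a hypothetical callable standing in for whatever submission command the testbed actually uses, and the function name is illustrative rather than part of any middleware.

```python
import time
from typing import Callable, List


def time_serial_submission(submit_job: Callable[[int], None],
                           n_jobs: int) -> List[float]:
    """Submit n_jobs one after another and record cumulative elapsed time.

    submit_job is a hypothetical stand-in for the real submission call.
    Element i of the result is the time (s) taken to submit the first
    i + 1 jobs, i.e. time to submit as a function of number of jobs.
    """
    elapsed = []
    start = time.perf_counter()
    for job_index in range(n_jobs):
        submit_job(job_index)
        elapsed.append(time.perf_counter() - start)
    return elapsed


if __name__ == "__main__":
    # Dummy submitter so the sketch runs without a testbed.
    times = time_serial_submission(lambda i: time.sleep(0.001), 10)
    print(f"10 jobs submitted in {times[-1]:.3f} s")
```

Plotting the returned list against job index gives the submission-time curve; the same harness could be wrapped in a thread pool to approximate the parallel-submission case.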
10Scalability Testing
- So, 100 simple jobs submitted sequentially 
- Results are preliminary 
- Example of what we are trying to do 
- Bypassed known problems, especially cross matching 
- Summary
11Scalability Testing
Summary
- 28 Success 
- 53 Proxy expired (12 hours after the jobs were submitted!) 
- 3 Aborted due to reaching retry count 
- 16 Ready state 
In this sample the greatest source of failure is 
CondorC 
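The summary counts above can be turned into failure-rate fractions with a few lines of code. This is a minimal sketch: the counts are the ones quoted in the summary, while the dictionary labels and the function name are illustrative choices.

```python
from collections import Counter

# Final states of the 100 sequentially submitted jobs, as quoted above.
final_states = Counter({
    "Success": 28,
    "Proxy expired": 53,
    "Aborted (retry count reached)": 3,
    "Ready": 16,
})


def state_fractions(counts: Counter) -> dict:
    """Return the fraction of all jobs that ended in each state."""
    total = sum(counts.values())
    return {state: n / total for state, n in counts.items()}


if __name__ == "__main__":
    for state, frac in state_fractions(final_states).items():
        print(f"{state:32s} {frac:6.1%}")
```

Repeating the tally for runs with different numbers of active jobs would give the "failure rates as function of active jobs" metric.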
 12Scalability testing
100 jobs submitted sequentially
All registered in 3 minutes 
 13Scalability Testing
Long tail of retries
Greatest number < 5000 s (Excel binning) 
 14Scalability Testing
100 jobs submitted sequentially
Can plot for individual or groups of processes
Still activity 1 hour later
5 Minutes 
 15Scalability Testing
- Future Plans 
- Automate testing scripts 
- Output directed to web-pages 
- Expand metrics as appropriate 
16Performance of middleware
- We have access to the job data through the LB 
 databases, so why not have a look?
- People involved: Gidon Moont and David Colling
17Performance of middleware
Long tail 
 18Performance of middleware
Plot axes: Number of entries, Efficiency, RunTime (s) 
 19Performance of middleware
- Future plans 
- Keep monitoring this across different releases 
- Low level activity 
- Feedback into JRA2
20SGE Porting
- People involved: David McBride, Mona Aggarwal and 
 Owen Maroney
21SGE Porting
LCG Integration with Sun Grid Engine (SGE)
- Wish to add LCG as an additional entry point for 
 our existing SGE cluster
- Problem: the LCG installation assumes the use of PBS 
 as the cluster management system.
- Solution: replace PBS-specific components with 
 SGE-specific components.
22SGE Porting
PBS-specific components in LCG (that need 
replacing)
- Globus JobManager 
- Already have an alternative Globus 
 JobManager for Sun Grid Engine to replace the lcgpbs
 version.
- Implemented in Perl, well understood. 
- Supports 5.x, 6.x revisions of SGE. 
- Currently installed, about to enter the first run 
 of testing as part of an LCG CE installation.
23SGE Porting
PBS-specific components in LCG (that need 
replacing)
- Information Reporter 
- Have developed a first-pass SGE 
 information reporter.
- Again, developed in Perl, small, relatively 
 straightforward. (Existing PBS code wasn't very
 clear, but GLUE Schema is public.)
- Installed on site CE, about to enter first run of 
 validation and iterative improvement.
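To illustrate what an information reporter of this kind has to emit, here is a minimal sketch that formats queue state as LDIF. It is not the reporter described above (that one is written in Perl). The attribute names follow our reading of the GLUE 1.x CE state attributes and should be checked against the schema version in use; the CE identifier in the usage example is hypothetical, and how the job counts are obtained from SGE is left out.

```python
def glue_ce_state_ldif(ce_unique_id: str, running: int, waiting: int,
                       free_cpus: int) -> str:
    """Format CE state as an LDIF fragment (assumed GLUE 1.x attributes)."""
    lines = [
        f"dn: GlueCEUniqueID={ce_unique_id},mds-vo-name=local,o=grid",
        f"GlueCEStateRunningJobs: {running}",
        f"GlueCEStateWaitingJobs: {waiting}",
        f"GlueCEStateTotalJobs: {running + waiting}",
        f"GlueCEStateFreeCPUs: {free_cpus}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical CE identifier and job counts, for illustration only.
    print(glue_ce_state_ldif("ce.example.org:2119/jobmanager-sge-default",
                             running=5, waiting=2, free_cpus=10))
```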
24SGE Porting
PBS-specific components in LCG (that need 
replacing)
- Accounting (APEL) 
- APEL accounting uses PBS event logs. 
- SGE does have advanced accounting records, but they are 
 not stored in the same format as PBS!
- Existing Java-based tooling seems large and 
 complex for what should be a fairly
 straightforward task; it is not obvious where changes
 could/should be made.
- Refactored version exists in gLite, but would 
 still require new implementation of SGE-specific
 backend.
- Using updated gLite revision on site may well 
 work, but would introduce manageability issues at
 upgrade-time.
- Currently wondering whether APEL can simply be 
 replaced with a small Perl script(!) Currently
 looking for documentation on the APEL/R-GMA
 reporting interface.
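As a rough indication of how small the SGE-specific part of such a script might be, the sketch below reads SGE accounting records into dictionaries. It is written in Python purely for illustration (the slides suggest Perl), and it rests on assumptions: the colon-separated layout and the order of the leading fields are taken from our reading of the SGE accounting(5) format and should be verified against the installed SGE version. Publishing the records through the APEL/R-GMA interface is not covered.

```python
from typing import Dict, Iterable, Iterator

# Leading fields of an SGE accounting record, in the order assumed
# from accounting(5); verify against the installed SGE version.
LEADING_FIELDS = (
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
)


def parse_accounting(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per accounting record, keeping the leading fields."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        values = line.split(":")
        if len(values) < len(LEADING_FIELDS):
            continue  # malformed or truncated record
        yield dict(zip(LEADING_FIELDS, values))


if __name__ == "__main__":
    # Synthetic record for illustration only, not real accounting data.
    sample = ["all.q:node01:hep:alice:test.sh:42:sge:0:"
              "1120600000:1120600010:1120600070:0:0:60"]
    for record in parse_accounting(sample):
        print(record["owner"], record["job_number"], record["ru_wallclock"])
```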
25SGE Porting
- Community of Interest formed 
- Code available from 
- http://www.lesc.ic.ac.uk/projects/SGE-LCG.html 
- Mailing list 
- coi-sge-lcg_at_imperial.ac.uk 
26GridCC
- People involved 
-  Marko Krznaric, Janusz Martyniak, Luke 
 Dickens, John Darlington, Steve McGough, David
 McBride and David Colling
-  Tiziana  Costas 
27GridCC
- There was a lot about GridCC at GridPP12, so this is a brief update 
- Discussions between GridCC and EGEE (Bob Jones 
 and Frederic Hemmer)
- Agreed to collaborate (e.g. use EGEE CVS); GridCC 
 relies on EGEE
- First release September this year 
- Review October this year
28GridCC
Bits in red from UK WMS activity 
 29Summary
- Activity in 4 areas 
- testing, 
- analysis, 
- SGE port, 
- GridCC