Title: VO and Application Centric Approaches
1 VO- and Application-Centric Approaches to Service Level Agreement
- Marian Bubak, Jakub Moscicki, Marcin Radecki, and Tomasz Szepieniec
- Cyfronet AGH, Krakow, PL
- CERN / IT
 
2 Contents
- VO-centric approach to SLA
  - Motivation
  - Basic requirements
  - SLA metrics
  - SLA execution
  - Bazaar tool
- QoS from a user perspective
  - User-level vs system-level techniques
  - Tools: Ganga/DIANE
  - Examples of QoS metrics
  - Case study: Lattice QCD 2008
- Summary
 
3 VO-centric Approach to SLA
 
4 Motivation
- Large number of VOs/users and resources
- Dynamic management is a must
- Remote interactions
- Limitation of automation
- Policy managers want to decide about their resources
- Start with a human-in-the-loop SLA process
 
5 Aim: SLA-based Resource Allocation

6 What is needed
- Definition of meaningful and measurable SLA metrics
- Communication patterns
- (Re-)negotiation
- Configuration validation
- Tracking demands/policy changes
- Complexity management and process traceability
- SLA execution monitoring (including feedback from users)
- So, we should
  - define the SLA process
  - and build a collaboration tool
 
7 EGEE Grid and Bazaar
- Starting point
  - No standard QoS metrics
  - No procedures to express requirements
  - Resources become available in the infrastructure even if not agreed with the VO
- Resource Allocation in Central Europe ROC (Bazaar Project)
  - A procedure for tracking requests and responses to them
  - Registration and monitoring of SLAs between VOs and Resource Providers
  - Collaboration tool for tracking the process
 
8 Central European Region in EGEE
- 8 countries, 
 - 25 sites, 
 - 8000 cores, 
 - 850 TB storage 
 - 30 VOs 
 
9 SLA Metrics
- Common language for users and providers
  - Users: "I need to use x CPUs"
  - Providers prefer to speak about aggregated wall-clock time in a specific period, without a guarantee that resources will be available at (any) defined time
- Expressive enough to satisfy users' important requests
  - Aggregated time, parallel use, waiting time (queues), condition of environment
- Configurable
  - providers need the technical possibility to configure the resources according to the SLA (the fabric layer needs to support those requirements)
- Measurable at execution time
 
10 Examples of SLA Metrics
- Computational Resources
  - Guaranteed number of job slots in the Local Batch System
    - CPUs or cores?
  - Total wall-clock time to be used in a specified time period (in hours)
    - weekly, monthly
  - Access period (range of dates)
  - Maximum wall-clock and CPU time of a single job (hours)
  - Maximum waiting time from job submission until the job starts running (in minutes)
  - Average power of a single core (benchmark results like SPECint)
  - Capacity available for temporary use by a job (GB)
  - Memory available per core/CPU (GB)
  - Maximum latency between nodes in the cluster (ms)
 
11 Examples of SLA Metrics
- Grid Storage Resources
  - Storage quota guaranteed (GB)
  - Maximum latency in accessing files (optional, in ms)
  - Minimum bandwidth in accessing files (optional, Gb/s)
  - Storage quota for temporary use (optional, GB)
  - Time limit for temporary use of storage (optional, hours)
  - Period of using storage (dates from-to)
- General Resource QoS
  - Minimum resource availability (optional, in %)
  - Minimum resource reliability (optional, in %)
  - Maximum time to acknowledge a trouble ticket (days)
  - Maximum time to resolve a trouble ticket (days)
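One way to make such metrics "measurable" is to hold the agreed values in a small record and compare them with what monitoring reports for a period. The sketch below is only an illustration: the class names, field names, and the `violations` helper are assumptions made for this example and are not part of Bazaar. It also shows how a user's "x CPUs" request could be translated into the aggregated wall-clock hours that providers prefer.

```python
from dataclasses import dataclass
from typing import List


def cpus_to_wallclock_hours(cpus: int, days: int) -> float:
    """Translate a user's "x CPUs" request into aggregated wall-clock hours."""
    return float(cpus * days * 24)


@dataclass
class ComputeSLA:
    """Hypothetical record holding a few of the computational metrics."""
    job_slots: int              # guaranteed job slots in the batch system
    wallclock_hours: float      # total wall-clock time per period (hours)
    max_wait_minutes: float     # max. wait from submission to running
    memory_gb_per_core: float   # memory available per core/CPU (GB)


@dataclass
class Observed:
    """Values a monitoring system might report for one period (assumed)."""
    available_wallclock_hours: float
    longest_wait_minutes: float
    memory_gb_per_core: float


def violations(sla: ComputeSLA, seen: Observed) -> List[str]:
    """Return the names of the SLA metrics that were not met in the period."""
    broken = []
    if seen.available_wallclock_hours < sla.wallclock_hours:
        broken.append("wallclock_hours")
    if seen.longest_wait_minutes > sla.max_wait_minutes:
        broken.append("max_wait_minutes")
    if seen.memory_gb_per_core < sla.memory_gb_per_core:
        broken.append("memory_gb_per_core")
    return broken


# A weekly SLA equivalent to 10 CPUs used around the clock.
sla = ComputeSLA(job_slots=10,
                 wallclock_hours=cpus_to_wallclock_hours(10, 7),
                 max_wait_minutes=30.0,
                 memory_gb_per_core=2.0)
report = violations(sla, Observed(available_wallclock_hours=1500.0,
                                  longest_wait_minutes=12.0,
                                  memory_gb_per_core=2.0))
print(report)
```

Storage and general QoS metrics (quota, availability, ticket response times) could be added as further fields in the same way.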
 
12 SLA Execution Stages in Bazaar
- The process is initiated by a VO through a call for resources
- Next, resource providers define their proposals for the SLA

13 State Transition Details
- Each state transition must be confirmed by both sides
- Proper configuration is controlled by a separate set of states

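The rule that each state transition must be confirmed by both sides can be sketched as a small state machine. The state names and the `SLAProcess` class below are hypothetical, chosen only for illustration; the actual Bazaar states are not listed in these slides.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical state names; the real Bazaar states are not given in the slides.
TRANSITIONS = {
    "CALL_OPEN": "PROPOSED",
    "PROPOSED": "NEGOTIATED",
    "NEGOTIATED": "ACTIVE",
    "ACTIVE": "COMPLETED",
}
PARTIES = frozenset({"vo", "provider"})


@dataclass
class SLAProcess:
    state: str = "CALL_OPEN"
    confirmations: Set[str] = field(default_factory=set)

    def confirm(self, party: str) -> str:
        """Record one side's confirmation; advance only when both agree."""
        if party not in PARTIES:
            raise ValueError(f"unknown party: {party}")
        if self.state not in TRANSITIONS:
            raise RuntimeError(f"{self.state} is a final state")
        self.confirmations.add(party)
        if self.confirmations == PARTIES:
            self.state = TRANSITIONS[self.state]
            self.confirmations = set()  # next transition needs fresh consent
        return self.state


process = SLAProcess()
process.confirm("vo")         # still CALL_OPEN: the provider has not confirmed
process.confirm("provider")   # both sides agreed, the state advances
print(process.state)
```

A second, independent machine of the same shape could track the configuration states mentioned above.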
14 Bazaar Functionality
- Call management - the user can create, edit, and manage calls.
- SLA management including negotiation - site managers can create a contract as a response to a call. Both partners can negotiate contract conditions and track contract changes.
- Notification management - the system notifies a user via e-mail and the user interface about actions like resource reconfiguration etc.
- Feedback - VO managers can assess a site's configuration, and both partners can provide a general assessment of the collaboration when the contract has been completed.
- Accounting and statistics - users can generate reports with resource usage statistics. In the next prototype, the tool shall enable obtaining data from EGEE accounting tools.

15 Bazaar in operation
- Bazaar: a tool supporting resource allocation, including SLA negotiation
  - Integrated with the EGEE Operation Portal (CIC Portal)
  - No cost of entry: data obtained from GOCDB and CIC-Portal VO-cards
  - Introduced into operations in the Central European Region
- Main features of Bazaar
  - Clear view of VOs' demands for resources
  - Management of calls and SLAs between VOs and RCs
  - SLA negotiation support
  - E-mail notifications
  - Tracking of SLA changes
 
16 SLA in PL-Grid
- PL-Grid Project
  - Grid operations center in Poland
  - 3 different infrastructures: EGI-compliant (currently gLite-based), DEISA, cloud-like research grid
- SLA Management in PL-Grid
  - We take ideas from the Bazaar Project as a starting point
  - Develop an SLA-centric model including
    - Impact on resources available at the technical level
    - Notifications on missing resources
    - Improvement of SLA monitoring and accounting
    - Integration with the computational grants system
 
17 PL-Grid Operation Tools Architecture

18 Conclusions
- A human in the negotiation loop seems to be unavoidable
- SLAs should support VO and resource managers
- Complexity management should be supported by Web 2.0 tools (collaboration tools with traceable processes)

19 QoS on the Grid with User-Level Scheduling
 
20 Some Grid applications
- Data Analysis
  - extraction of (statistical) parameters from data using an event loop
  - ATLAS experiment at LHC
- Monte Carlo simulation
  - creation of statistical objects (e.g. histograms) or building images by generating a large number of independent events
  - Geant4 simulations for radiotherapy in medical physics
- Parameter sweep
  - running a large number of independent jobs in various configurations
  - Geant4 regression tests
- High-throughput activities
  - autonomous computing over long periods of time
  - Avian Flu Drug Search (bio-informatics)
  - Lattice QCD (theoretical physics)
- High-performance, short-deadline activities
  - short-deadline performance peak
  - ITU frequency analysis for RRC06
 
21 QoS for scientific applications
- In the Grid the basic interaction of a user is sending jobs
  - efficient job/workload management plays a central role
  - efficient scheduling often requires application-specific knowledge
    - which may be difficult at the system level
- The system provides an appropriate QoS if it responds in an acceptable way to the user and is capable of automatically maintaining the processing goals defined by the user (measured by metrics)
- Some QoS metrics (measures of user-defined goals)
  - turnaround time
    - typically minimize the total execution time of the job
  - reliability / failure rate
  - response latency: time to obtain initial results
    - feedback from the execution
    - filling histograms with events -> significance of individual partial results decreases with time
  - prioritization/scheduling of the tasks
  - predictability/stability of the execution
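Three of these metrics (turnaround time, failure rate, response latency) can be computed directly from per-task timestamps. The following is a minimal sketch; the `TaskRecord` layout and the function names are assumptions for illustration and do not correspond to any particular Grid monitoring tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskRecord:
    submitted: float            # submission time, seconds since run start
    finished: Optional[float]   # completion time, or None if the task was lost


def turnaround(records: List[TaskRecord]) -> float:
    """Time from the first submission to the last completed task."""
    done = [r.finished for r in records if r.finished is not None]
    return max(done) - min(r.submitted for r in records)


def failure_rate(records: List[TaskRecord]) -> float:
    """Fraction of tasks that never completed."""
    lost = sum(1 for r in records if r.finished is None)
    return lost / len(records)


def response_latency(records: List[TaskRecord]) -> float:
    """Time from the first submission to the first available result."""
    done = [r.finished for r in records if r.finished is not None]
    return min(done) - min(r.submitted for r in records)


# Made-up sample data: four tasks, one of which was lost.
records = [
    TaskRecord(0.0, 40.0),
    TaskRecord(0.0, 100.0),
    TaskRecord(5.0, None),
    TaskRecord(5.0, 65.0),
]
print(turnaround(records), failure_rate(records), response_latency(records))
```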
 
22 Mechanisms for better QoS
- In general QoS is NOT implemented on the Grid
- Techniques for performance-related metrics
  - dedication of resources (wasteful)
  - advanced reservations
    - difficult for some users who do not plan ahead: interactive work
  - better scheduling: fast/slow queues (site configuration)
  - preemption: suspend lower-priority job
  - migration: suspend and migrate elsewhere
  - better brokering: forecasting using monitoring systems (e.g. NWS)
- Techniques for failure-related metrics
  - metascheduling (JDL retry count, Condor)
- Techniques for application-specific metrics
  - metascheduling (not generally implemented, e.g. out of scope of DAGs)

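The retry-count idea behind metascheduling for failure-related metrics can be illustrated in a few lines: a failed job is resubmitted until it succeeds or the retry budget is used up. This is a generic sketch, not the gLite or Condor implementation; `run_with_retries` and the `submit` callable are hypothetical names.

```python
from typing import Callable, Tuple


def run_with_retries(submit: Callable[[], bool],
                     retry_count: int) -> Tuple[bool, int]:
    """Submit a job, resubmitting on failure at most retry_count times.

    Returns (succeeded, number_of_attempts).
    """
    attempts = 0
    for _ in range(retry_count + 1):   # first attempt plus the retries
        attempts += 1
        if submit():
            return True, attempts
    return False, attempts


# Stand-in for a Grid job that fails twice before succeeding.
outcomes = iter([False, False, True])
result = run_with_retries(lambda: next(outcomes), retry_count=3)
print(result)
```

Blind resubmission of this kind helps against transient failures but knows nothing about the application, which is why application-specific metrics need a different mechanism.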
23 QoS Implementation Choices
- QoS implementation
  - site service modifications
    - faster queues, scheduler modifications, e.g. virtualization schemes with MAUI
  - middleware modification
    - checkpointing/migration, special services (e.g. GARA), Virtual Machines
  - system-level modifications (unix kernel modules, special I/O)
  - user-level overlay schedulers (pilot jobs, agents,...)
- Boundary conditions in a large Grid (e.g. EGEE)
  - acceptance/deployment of middleware changes very slow due to organizational constraints
  - resource providers' constraints (site changes)
    - many sites cannot freely change their software (serving also non-grid users)
    - sysadmins do not like sudo-like programs
  - interfacing legacy applications
 
24 User-level overlay
- Overlays are the only option if we talk about using the existing Grid infrastructure at large scale
- LCG and EGEE Grid 
 - the largest Grid infrastructure to date 
 - over 250 sites 
 - over 80K WNs 
 - over 15 PB of storage
 
25 User-level tools
- DIANE helps smaller scientific communities use distributed (Grid) resources more efficiently
  - reduce the application execution time
  - reduce the manual work overhead by providing fully automatic execution and failure management
  - efficiently integrate local and Grid resources
  - part of EGEE Respect suite
  - http://cern.ch/diane
- Ganga: Job Management Interface
  - Submission gateway to many distributed systems
  - Easy job management and application configuration
  - http://cern.ch/ganga
 
26 User-level Overlay
- User-level overlay
  - each user uses a (temporary) overlay which is created for the duration of the computations
- (drawing courtesy of ThIS collaboration)

27 Master/Worker backbone
- Master/Worker processing of tasks
  - RunMaster executes on a local host
  - WorkerAgents execute as Grid jobs
- TaskScheduler is a software component (Python module) which may be arbitrarily customized or replaced
- application plugins
  - ApplicationWorker
  - ApplicationManager
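The master/worker pattern with a replaceable scheduler can be sketched locally with threads standing in for worker agents. This is not the DIANE API: `FifoScheduler`, `worker_agent`, and `run_master` are hypothetical names, and in a real overlay the workers would run as Grid jobs and talk to the master over the network.

```python
import queue
import threading
from typing import Callable, Dict, List, Optional


class FifoScheduler:
    """Hypothetical scheduler plugin: hands out tasks in submission order."""

    def __init__(self, tasks: List[int]) -> None:
        self._tasks: "queue.Queue[int]" = queue.Queue()
        for task in tasks:
            self._tasks.put(task)

    def next_task(self) -> Optional[int]:
        """Return the next task, or None when no work is left."""
        try:
            return self._tasks.get_nowait()
        except queue.Empty:
            return None


def worker_agent(scheduler: FifoScheduler, do_work: Callable[[int], int],
                 results: Dict[int, int], lock: threading.Lock) -> None:
    """Pull tasks from the master until the scheduler runs dry."""
    while True:
        task = scheduler.next_task()
        if task is None:
            return
        value = do_work(task)       # application-specific part
        with lock:
            results[task] = value


def run_master(scheduler: FifoScheduler, do_work: Callable[[int], int],
               n_workers: int) -> Dict[int, int]:
    """Start the workers, wait for them, and collect the results."""
    results: Dict[int, int] = {}
    lock = threading.Lock()
    workers = [threading.Thread(target=worker_agent,
                                args=(scheduler, do_work, results, lock))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


results = run_master(FifoScheduler(list(range(10))), lambda x: x * x, 3)
print(sorted(results.items()))
```

Because workers pull tasks rather than having them assigned up front, a slow or failed worker only delays the tasks it currently holds; swapping `FifoScheduler` for another class with the same `next_task` method changes the policy without touching the workers.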
 
28 Flexible architecture
- 3 functional parts
  - Submitter: selection and acquisition of the resources
  - M/W: scheduling and execution control
  - Directory Service: late binding of resources
- System is easily customized by plugins
 
29 Examples of QoS Metrics
- Selected examples of QoS metrics for different applications

30 QoS Metric: predictability of execution
- Comparison of G4 Production on LCG: DIANE and direct submission
- 6 sites / 173 CPUs / 100 VO-shared, 70 VO-dedicated
- 207 tasks, direct: 1 task = 1 job, DIANE workers
 
31 QoS metric: reliability
- Summary of ITU RRC06 runs
  - 200K jobs in less than 6 hours
  - worst-case reliability: 0.0003 of jobs lost
 
run  jobs  task  turnaround  CPUh  WN   comment
1    243K  26K   6.40h       425h  190  lost <10 tasks (3e-04)
2    237K  23K   6.30h       332h  125  lost 1 task (4e-05)
3    224K  40K   3.05h       192h  210  OK
4    218K  39K   1.05h       151h  320  OK

- ITU RRC-06 (15 May - 16 June 2006)
  - 120 countries (1200 delegates) negotiated the new digital frequency plan
  - a part of a new international agreement
  - introduction of digital broadcasting
    - UHF (470-862 MHz)
    - VHF (174-230 MHz)
  - preceded by RRC-04 and other international meetings

32 QoS Metric: low latency on the Grid
- RRC06 ITU job
  - 116 LCG workers
  - 3470 tasks
  - 130 CPU h
- large span of task length
  - not a priori known!
 
33 QoS Metric: stability of execution

34 Case study: high-throughput Lattice QCD simulation
- application-aware scheduler: prioritize tasks based on the simulation parameters
- active resource selection via Submitter (WorkerFactory)
  - dynamically select resources based on their fitness for the application

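An application-aware scheduler of this kind can be sketched as a priority queue whose ordering is supplied by the application. The class name, the `beta` parameter, and the priority rule below are invented for illustration; the slides do not say which simulation parameters the LQCD scheduler actually used.

```python
import heapq
import itertools
from typing import Callable, Dict, List, Optional, Tuple

Task = Dict[str, float]


class PriorityScheduler:
    """Hands out the task with the highest application-defined priority."""

    def __init__(self, priority: Callable[[Task], float]) -> None:
        self._priority = priority
        self._heap: List[Tuple[float, int, Task]] = []
        self._counter = itertools.count()  # tie-breaker, keeps insertion order

    def add(self, task: Task) -> None:
        # heapq is a min-heap, so the priority is negated.
        entry = (-self._priority(task), next(self._counter), task)
        heapq.heappush(self._heap, entry)

    def next_task(self) -> Optional[Task]:
        """Return the highest-priority task, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


# Hypothetical rule: tasks whose "beta" is closest to 5.0 run first.
scheduler = PriorityScheduler(lambda task: -abs(task["beta"] - 5.0))
for beta in (4.0, 5.1, 6.5):
    scheduler.add({"beta": beta})

order = [scheduler.next_task()["beta"] for _ in range(3)]
print(order)
```

The same pattern could rank worker nodes by a fitness score when selecting resources, with the score function again supplied by the application.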
35 Lattice QCD 2008 @ Grid
- Study the behaviour of the critical point of quark-gluon plasma
  - The scientific results obtained by the LQCD project were published in the paper P. de Forcrand et al., "The chiral critical point of Nf = 3 QCD at finite density to the order (µ/T)^4", available at http://arxiv.org/pdf/0808.1096
- Monte Carlo simulation of a discrete space-time lattice
  - needs a lot of CPU
  - relatively small data (GBs)
 
36 LQCD execution history
- ongoing since May 2008
  - several phases (application and system upgrades, power cuts, etc.)
- routine production since September 2008
  - runs unattended for months
  - operated by a single, not-a-Grid-expert user
- large-scale
  - 1000 running jobs at any time
  - 700 CPU-years since May 2008
  - 18 TB of data
 
37 Routine LQCD production
- 700 CPU years since May 2008 
 - 18 TB of data transferred 
 - 800 simultaneous workers
 
38 Summary
- User-level overlay is a technique enhancing the QoS parameters for scientific applications in the EGEE Grid
- Pros & cons
  - Existing infrastructure may be used as is
  - Application-specific optimizations (impossible at the system level)
  - Hard QoS not possible (infrastructure unreliable)
  - Fair-share implemented by the underlying infrastructure and respected by the overlay (if used appropriately)
- Used successfully for diverse applications
- Overlays are a complementary approach to SLAs
- More on tools
  - http://cern.ch/diane
  - http://cern.ch/ganga