Computing and Brokering
Transcript and Presenter's Notes

1
Computing and Brokering
  • Grid Middleware 5
  • David Groep, lecture series 2005-2006

2
Outline
  • Classes of computing services
  • MPP and shared-memory (SHMEM) systems
  • Clusters with high-speed interconnect
  • Conveniently parallel jobs
  • Through the hourglass: basic functionalities
  • Representing computing services
  • resource availability, RunTimeEnvironment
  • Software installation and ESIA
  • Jobs as resources, or ?
  • Brokering
  • brokering models: central view, per-user broker,
    neighbourhood (P2P) brokering
  • job farming and DAGs: Condor-G, gLite WMS,
    Nimrod-G, DAGMan
  • resource selection: ERT, freeCPUs, ... prediction
    techniques and challenges
  • co-locating jobs and data, input/output
    sandboxes, LogicalFiles
  • Specialties
  • Supporting interactivity

3
Computing Service
  • resource variability and the hourglass model

4
The Famous Hourglass Model
5
Types of systems
  • Very different models and pricing; suitability
    depends on application
  • shared memory MPP systems
  • vector systems
  • cluster computing with high-speed interconnect
  • can perform like MPP, except for the single
    memory image
  • e.g. Myrinet, Infiniband
  • coarse-grained compute clusters
  • conveniently parallel applications without IPC
  • can be built of commodity components
  • specialty systems
  • visualisation, systems with dedicated
    co-processors, ...

6
Quick, cheap, or both: how to run an app?
  • Task: how to run your application
  • the fastest, or
  • the most cost-effective (this argument usually
    wins)
  • Two choices to speed up an application
  • Use the fastest processor available
  • but this gives only a small factor over modest
    (PC) processors
  • Use many processors, doing many tasks in parallel
  • and since quite fast processors are inexpensive
    we can think of using very many processors in
    parallel
  • but the problem must first be decomposed

fast, cheap, good: pick any two
7
High Performance or High Throughput?
  • Key question: max. granularity of decomposition
  • Have you got one big problem or a bunch of little
    ones?
  • To what extent can the problem be decomposed
    into sort-of-independent parts (grains) that
    can all be processed in parallel?
  • Granularity
  • fine-grained parallelism: the independent bits
    are small, need to exchange information,
    synchronize often
  • coarse-grained: the problem can be decomposed
    into large chunks that can be processed
    independently
  • Practical limits on the degree of parallelism
  • how many grains can be processed in parallel?
  • degree of parallelism v. grain size
  • grain size limited by the efficiency of the
    system at synchronising grains

8
High Performance v. High Throughput?
  • fine-grained problems need a high performance
    system
  • that enables rapid synchronization between the
    bits that can be processed in parallel
  • and runs the bits that are difficult to
    parallelize as fast as possible
  • coarse-grained problems can use a high throughput
    system
  • that maximizes the number of parts processed per
    minute
  • High Throughput Systems use a large number of
    inexpensive processors, inexpensively
    interconnected
  • High Performance Systems use a smaller number of
    more expensive processors expensively
    interconnected

9
High Performance v. High Throughput?
  • There is nothing fundamental here; it is just
    a question of financial trade-offs, like
  • how much more expensive is a fast computer than
    a bunch of slower ones?
  • how much is it worth to get the answer more
    quickly?
  • how much investment is necessary to improve the
    degree of parallelization of the algorithm?
  • But the target is moving -
  • Since the cost chasm first opened between fast
    and slower computers 12-15 years ago an enormous
    effort has gone into finding parallelism in big
    problems
  • Inexorably decreasing computer costs and
    de-regulation of the wide area network
    infrastructure have opened the door to ever
    larger computing facilities (clusters → fabrics →
    (inter)national grids), demanding ever-greater
    degrees of parallelism

10
But the fact is
the food chain has been reversed, and
supercomputer vendors are struggling to make a
living.
Graphic: Networks of Workstations, Berkeley; IEEE
Micro, Feb. 1995; Thomas E. Anderson, David E.
Culler, David A. Patterson
11
Using these systems
  • As clusters and capability systems are both
    expensive (i.e. not on your desktop), they are
    resources that need to be scheduled
  • interface for scheduled access is a batch queue
  • job submit, cancel, status, suspend
  • sometimes checkpoint-restart in OS, e.g. on SGI
    IRIX
  • allocate processors (and amount of memory,
    these may be linked!) as part of the job request
  • systems usually also have smaller interactive
    partition
  • not intended for running production jobs

12
Cluster batch system model
13
Some batch systems
  • Batch systems and schedulers
  • Torque (OpenPBS, PBS Pro)
  • Sun Grid Engine (that's not a Grid!)
  • Condor
  • LoadLeveler
  • Load Share Facility (LSF)
  • Dedicated schedulers: MAUI
  • can drive scheduling for Torque/PBS, SGE, LSF, ...
  • support advanced scheduling features, like
    reservation, fair-shares, accounts/banking,
    QoS
  • head node or UI system can usually be used for
    test jobs

14
Torque/PBS job description
  • PBS batch job script:
  • #PBS -l walltime=36:00:00
  • #PBS -l cput=30:00:00
  • #PBS -l vmem=1gb
  • #PBS -q qlong
  • Executing user job:
  • UTCDATE=`date -u +'%Y%m%d%H%M%SZ'`
  • echo "Execution started on $UTCDATE"
  • echo ""
  • printenv
  • date
  • sleep 3
  • date
  • id
  • hostname

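A minimal usage sketch for such a script with the standard Torque/PBS client
commands; the file name test.pbs and the job id shown are only illustrations.

    qsub test.pbs                  # returns a job identifier such as 823302.tbn20.nikhef.nl
    qstat -u $USER                 # list your own jobs and their state (Q = queued, R = running)
    qdel 823302.tbn20.nikhef.nl    # cancel the job again if needed
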
15
PBS queue
  • bosui tmp 1010> qstat -an1 | head -10
  • tbn20.nikhef.nl:

                                                              Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname   SessID NDS TSK Memory Time  S Time
    -------------------- -------- -------- --------- ------ --- --- ------ ----- - -----
    823302.tbn20.nikhef. biome034 qlong    STDIN      20253   1  --     -- 60:00 R 20:58   node15-11
    824289.tbn20.nikhef. biome034 qlong    STDIN       6775   1  --     -- 60:00 R 15:25   node15-5
    824372.tbn20.nikhef. biome034 qlong    STDIN      10495   1  --     -- 60:00 R 15:10   node16-21
    824373.tbn20.nikhef. biome034 qlong    STDIN       3422   1  --     -- 60:00 R 14:40   node16-32
    ...
    827388.tbn20.nikhef. lhcb031  qlong    STDIN         --   1  --     -- 60:00 Q --      --
    827389.tbn20.nikhef. lhcb031  qlong    STDIN         --   1  --     -- 60:00 Q --      --
    827390.tbn20.nikhef. lhcb002  qlong    STDIN         --   1  --     -- 60:00 Q --      --
16
Example: Condor clusters of idle workstations
The Condor Project, Miron Livny et al., University
of Wisconsin, Madison. See http://www.cs.wisc.edu/condor/
17
Condor example
  • Write a submit file:
  • Executable = dowork
  • Input = dowork.in
  • Output = dowork.out
  • Arguments = 1 alpha beta
  • Universe = vanilla
  • Log = dowork.log
  • Queue
  • Give it to Condor:
  • condor_submit <submit-file>
  • Watch it run: condor_q

Files on shared fs
(in a cluster at least; for other options see later)
From Alan Roy, I/O Access in Condor and Grid, UW
Madison. See http://www.cs.wisc.edu/condor/
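A minimal sketch of running the submit description above, assuming it was saved
as dowork.sub (the file name is an assumption):

    condor_submit dowork.sub    # hand the job to the local schedd
    condor_q                    # watch it move through the queue
    condor_wait dowork.log      # block until the job logged to dowork.log completes
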
18
Matching jobs to resources
  • For homogeneous clusters: mainly policy-based
  • FIFO
  • credential-based policy
  • fair-share
  • queue wait time
  • banks / accounts
  • QoS specific
  • For heterogeneous clusters (like Condor pools)
  • matchmaking based on resource and job
    characteristics
  • see later, in grid matchmaking

19
Example scheduling policies - MAUI
  • RMTYPE[0]              PBS
  • RMHOST[0]              tbn20.nikhef.nl
  • ...
  • NODEACCESSPOLICY       SHARED
  • NODEAVAILABILITYPOLICY DEDICATED:PROCS
  • NODELOADPOLICY         ADJUSTPROCS
  • FEATUREPROCSPEEDHEADER xps
  • BACKFILLPOLICY         ON
  • BACKFILLTYPE           FIRSTFIT
  • NODEALLOCATIONPOLICY   FASTEST
  • FSPOLICY               DEDICATEDPES
  • FSDEPTH                24
  • FSINTERVAL             24:00:00
  • FSDECAY                0.99
  • GROUPCFG[users]        FSTARGET=1 PRIORITY=10   MAXPROC=50
  • GROUPCFG[dteam]        FSTARGET=2 PRIORITY=5000 MAXPROC=32

MAUI is an open source product from Cluster
Resources, Inc. See http://www.supercluster.org/
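For illustration, a few of the stock MAUI client commands that show such a
configuration in action (output naturally depends on the local setup; the job
id is hypothetical):

    showq              # running, idle and blocked jobs as MAUI sees them
    checkjob 823302    # the scheduler's view of one particular job
    diagnose -f        # current fair-share usage against the FSTARGET values above
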
20
Grid Interface to Computing
21
Grid Interfaces to the compute services
  • Need common interface for job management
  • for test jobs in interactive mode: fork
  • like the interactive partition in clusters and
    supers
  • batch system interface
  • executable
  • arguments
  • processors
  • memory
  • environment
  • stdin/out/err
  • Note
  • batch system usually doesn't manage local file
    space
  • assumes executable is just there, because of
    shared FS or JIT copying of the files to the
    worker node in job prologue
  • local file space management needs to be exposed
    as part of the grid service and then implemented
    separately

22
Expectations?
  • What can a user expect from a compute service?
  • Different user scenarios are all valid
  • paratrooper mode: come in, take all your
    equipment (files, executable &c) with you, do
    your thing and go away
  • you're supposed to clean up, but the system will
    likely do that for you if you forget. In all
    cases, garbage left behind is likely to be
    removed
  • two-stage: prepare and run
  • extra services to pre-install environment and
    later request it
  • see later on such Community Software Area
    services
  • don't think, just do it
  • blindly assume the grid is like your local system
  • expect all software to be there
  • expect your results to be retained indefinitely
  • realism of this scenario is quite low for
    production grids, as it does not scale to
    larger numbers of users

23
Basic Operations
  • Direct run/submit
  • useless unless you have an environment already
    set up
  • Cancel
  • Signal
  • Suspend
  • Resume
  • List jobs/status
  • Purge (remove garbage)
  • retrieve output first
  • Other useful functions
  • Assess submission (eligibility, ERT)
  • Register & Start (needed if you have sandboxes)

24
A job submission diagram for a single CE
  • Example
  • explicit interactions

diagram from DJRA1.1 EGEE Middleware Architecture
25
WS-GRAM Job management using WS-RF
  • same functionality, modelled with jobs represented
    as resources
  • for the input sandbox, leverages an existing (GT4)
    data movement service
  • exploits re-usable components

26
GT4 WS GRAM Architecture
(architecture diagram; components: service host(s) and compute element(s), the
client issuing Delegate and Transfer requests, the GT4 Java Container hosting
the GRAM services, Delegation and RFT File Transfer services, the SEG emitting
job events, sudo with a GRAM adapter for local job control, the local scheduler
and job functions on the compute element, the user job, and GridFTP (FTP
control and data) to remote storage element(s))
diagram from Carl Kesselman, ISI, ISOC/GFNL
masterclass 2006
27
GT2 GRAM
  • Informational / historical
  • so don't blame the current Globus for this

(single job submission flow chart)
28
GRAM GT2 Protocol
  • RSL over http-g
  • target to a single specific resource
  • http-g is like https
  • modified protocol (by one byte) to specify
    delegation
  • no longer interoperable with standard https
  • delegation implicit in job submission
  • RSL: Resource Specification Language
  • Used in the GRAM protocol to describe the job
  • required some (detailed) knowledge about target
    system

29
GT2 RSL
  • (executable"/bin/echo")
  • (arguments"12345")
  • (stdoutx-gass-cache//(GLOBUS_GRAM_JOB_CONTACT)
    stdout anExtraTag)
  • (stderrx-gass-cache//(GLOBUS_GRAM_JOB_CONTACT)
    stderr anExtraTag)
  • (queueqshort)

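For illustration, such an RSL fragment would be handed to the GT2 client tools
roughly as sketched below; the gatekeeper contact string and the RSL file name
are assumptions, not part of the original slides.

    globus-job-run tbn20.nikhef.nl/jobmanager-pbs /bin/echo 12345   # one-shot submission
    globusrun -r tbn20.nikhef.nl/jobmanager-pbs -f myjob.rsl        # submit the RSL file above
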
30
GT2 Job Manager interface
  • One job manager per running or queued job
  • provides a control interface: cancel, suspend,
    status
  • GASS: Grid Access to Secondary Storage
  • stdin, stdout, stderr
  • selected input/output files
  • listens on a specific TCP port on the Gatekeeper
    host
  • Some issues
  • protocol does not provide two-phase commit
  • no way to know if the job really made it
  • too many open ports
  • one process for each queued job, i.e. too many
    processes
  • Workaround
  • don't submit a job, but instead a grid-manager
    process

31
Performance ?
  • Time to submit a basic GRAM job
  • Pre-WS GRAM: < 1 second
  • WS GRAM (in Java): 2 seconds
  • so GT2-style GRAM did have one significant
    advantage
  • Concurrent jobs
  • Pre-WS GRAM: 300 jobs
  • WS GRAM: 32,000 jobs

32
Scaling scheduling
  • load on the CE head node per VO cannot be
    controlled with a single common job manager
  • with many VOs
  • might need to resolve inter-VO resource
    contention
  • different VOs may want different policies
  • make the CE pluggable
  • and provide a common CE interface, irrespective
    of the site-specific job submission mechanism
  • as long as the site supports a fork JM

33
gLite job submission model
site
one grid CEMON per VO or user
34
Unicore CE
  • A different design and concept
  • eats JSDL (a GGF standard) as the job description
  • describes job requirements in detail
  • security model cannot support dynamic VOs yet
  • grid-wide coordinated UID space
  • (or shared group accounts for all grid users)
  • no VO management tools (DEISA added a directory
    for that)
  • intra-site communication not secured
  • one big plus: job management uses only 1 port for
    all communications (including file transfer), and
    is thus firewall-friendly

35
Unicore CE Architecture
Graphic from Dave Snelling, Fujitsu Labs Europe,
Unicore Technology, Grid School July 2003
36
Unicore programming model
  • Abstract Job Object
  • Collection of classes representing Grid functions
  • Encoded as Java objects (XML encoding possible)
  • Where to build AJOs
  • Pallas client GUI - the user's view
  • Client plugins - Grid deployer
  • Arcon client tool kit - hard core
  • What can't the AJO do
  • Application level Meta-computing
  • ???

from Dave Snelling, Fujitsu Labs Europe,
Unicore Technology, Grid School July 2003
37
Interfacing to the local system
  • Incarnation Data Base
  • Maps abstract representation to concrete jobs
  • Includes resource description
  • Prototype auto-generation from MDS
  • Target System Interface
  • Perl interface to host platform
  • Very small system specific module for easy
    porting
  • Current: NQS (several versions), PBS,
    LoadLeveler, UNICOS, Linux, Solaris, MacOSX,
    PlayStation-2
  • Condor: under development (probably done by now)

from Dave Snelling, Fujitsu Labs Europe,
Unicore Technology, Grid School July 2003
38
Resource Representation
  • CE attributes
  • obtaining metrics
  • GLUE CE

39
Describing a CE
  • Balance between completeness and timeliness
  • Some useful metrics almost impossible to obtain
  • "when will this job of mine be finished if I
    submit now?" cannot be answered!
  • depends on system load
  • need to predict runtime for already running and
    queued jobs
  • simultaneous submission in a non-FIFO scheduling
    model (e.g. fair share, priorities, pre-emption
    c)

40
GlueCE a resource description viewpoint
From the GLUE Information Model version 1.2, see
document for details
41
Through the Glue Schema Cluster Info
  • Performance info: SI2k, SF2k
  • Max wall time, CPU time (seconds)
  • together these determine if a job completes in
    time
  • but clusters are not homogeneous
  • solve at the local end (scale max CPU and wall
    time on each node to the system speed).
    CAVEAT: when doing cross-cluster, grid-wide
    scheduling, this can make you choose the wrong
    resource entirely!
  • solve (i.e. multiply) at the broker end, but now
    you need a way to determine on which subcluster
    your job will run. Oops.

42
Cluster Info: total, free and max JobSlots
  • FreeJobSlots is the wrong metric to use for
    scheduling (a good cluster is always 100% full)
  • these metrics may be VO, user and job dependent
  • if a cluster has free CPUs, that does not mean
    that you can use them
  • even if there are thousands of waiting jobs, you
    might get to the front of the queue because of
    your priority or fair-share

43
Cluster info: ERT and WRT
  • Estimated/worst response time
  • when will my job start to run if I submit now?
  • Impossible to pre-determine in case of
    simultaneous submissions
  • The best one can do is estimate
  • Possible approaches
  • simulation: good but very, very slow (Predicting
    Job Start Times on Clusters, Hui Li et al., 2004)
  • historical comparisons
  • template approach: need to discover the proper
    template
  • look for similar system states in the past
  • learning approach: adapt the estimation
    algorithm to the actual load and learn the best
    approach
  • see the many other papers by Hui, bundled on
    Blackboard!

44
Brokering
45
Brokering models
  • All current grid broker systems use global
    brokering
  • consider all known resources when matching
    requests
  • brokering takes longer as the system grows
  • Models
  • Bubble-to-the-top-information-system based
  • current Condor-G, gLite WMS
  • Ask the world for bids
  • Unicore Broker

46
Some grid brokers
  • Condor-G
  • uses Condor schedd (matchmaker) to match
    resources
  • a Condor submitter has a number of backends to
    talk to different CEs (GT2, GT4-GRAM, Condor
    (flocking))
  • supports DAG workflows
  • schedd is close to the user
  • gLite WMS
  • separation between broker (based on Condor-G) and
    the UI
  • additional Logging and Bookkeeping (generic, but
    actually only used for the WMS)
  • does job-data co-location scheduling

47
Grid brokers (contd.)
  • Nimrod-G
  • parameter sweep engine
  • cycles through static list of resources
  • automatically inspects the job output and uses
    that to drive automatic job submission
  • minimisation methods like simulated annealing
    built in
  • Unicore broker
  • based on a pricing model
  • asks for bids from resources
  • no large information system full of useless
    resources is needed; instead, bids are asked
    from all resources for every job
  • this shifts, but does nothing to resolve, the
    info-system explosion

48
Alternative brokering
  • Alternatives could be P2P-style brokering
  • look in the neighbourhood for reasonable
    matches; if none are found, give the task to a
    peer super-scheduler
  • scheduler only considers close resources (has
    no global knowledge)
  • job submission pattern may or may not follow
    brokering pattern
  • if it does, it needs recursive delegation for job
    submission, which opens the door for worms and
    trojans
  • trust is not very transitive (this is not a
    problem when sharing public files, such as in the
    popular P2P file-sharing applications)

49
Broker detailed example: gLite WMS
  • Job services in the gLite architecture
  • Computing Element (just discussed)
  • Workload Management System (brokering, submission
    control)
  • Accounting (for EGEE this comes in two flavours:
    site or user)
  • Job Provenance (to be done)
  • Package management (to be done)
  • continuous matchmaking solution
  • persistent list of pending jobs, waiting for
    matching resources
  • WMS task akin to what the resources did in
    Unicore

50
Architecture Overview
Resource Broker Node (Workload Manager, WM)
Job status
Storage Element
51
WMS's Architecture
52
WMS's Architecture
Job management requests (submission,
cancellation) expressed via a Job
Description Language (JDL)
53
WMS's Architecture
Keeps submission requests Requests are kept
for a while if no matching resources available
54
WMS's Architecture
Repository of resource information available to
matchmaker Updated via notifications and/or
active polling on sources
55
WMS's Architecture
Finds an appropriate CE for each submission
request, taking into account job requests and
preferences, Grid status, utilization policies
on resources
56
WMS's Architecture
Performs the actual job submission and
monitoring
57
The Information Supermarket
  • ISM represents one of the most notable
    improvements in the WM as inherited from the EU
    DataGrid (EDG) project
  • decoupling between the collection of information
    concerning resources and its use
  • allows flexible application of different policies
  • The ISM basically consists of a repository of
    resource information that is available in read
    only mode to the matchmaking engine
  • the update is the result of
  • the arrival of notifications
  • active polling of resources
  • some arbitrary combination of both
  • can be configured so that certain notifications
    can trigger the matchmaking engine
  • improve the modularity of the software
  • support the implementation of lazy scheduling
    policies

58
The Task Queue
  • The Task Queue represents the second most notable
    improvement in the WM internal design
  • possibility to keep a submission request for a
    while if no resources are immediately available
    that match the job requirements
  • technique used by the AliEn and Condor systems
  • Non-matching requests
  • will be retried either periodically
  • eager scheduling approach
  • or as soon as notifications of available
    resources appear in the ISM
  • lazy scheduling approach

59
Job Logging & Bookkeeping
  • LB tracks jobs in terms of events
  • important points of job life
  • submission, finding a matching CE, starting
    execution, etc.
  • gathered from various WMS components
  • The events are passed to a physically close
    component of the LB infrastructure
  • locallogger
  • avoid network problems
  • stores them in a local disk file and takes over
    the responsibility to deliver them further
  • The destination of an event is one of bookkeeping
    servers
  • assigned statically to a job upon its submission
  • processes the incoming events to give a higher
    level view on the job states
  • Submitted, Running, Done
  • various recorded attributes
  • JDL, destination CE name, job exit code
  • Retrieval of both job states and raw events is
    available via legacy (EDG) and WS querying
    interfaces
  • user may also register for receiving
    notifications on particular job state changes

60
Job Submission Services
  • WMS components handling the job during its
    lifetime and performing the submission
  • Job Adapter
  • is responsible for
  • making the final touches to the JDL expression
    for a job, before it is passed to CondorC for the
    actual submission
  • creating the job wrapper script that creates the
    appropriate execution environment in the CE
    worker node
  • transfer of the input and of the output sandboxes
  • CondorC
  • responsible for
  • performing the actual job management operations
  • job submission, job removal
  • DAGMan
  • meta-scheduler
  • purpose is to navigate the graph
  • determine which nodes are free of dependencies
  • follow the execution of the corresponding jobs.
  • instance is spawned by CondorC for each handled
    DAG
  • Log Monitor
  • is responsible for

61
Job Preparation
  • Information to be specified when a job has to be
    submitted
  • Job characteristics
  • Job requirements and preferences on the computing
    resources
  • Also including software dependencies
  • Job data requirements
  • Information specified using a Job Description
    Language (JDL)
  • Based upon Condor's CLASSified ADvertisement
    language (ClassAd)
  • Fully extensible language
  • A ClassAd
  • Constructed with the classad construction
    operator
  • It is a sequence of attributes separated by
    semi-colons.
  • An attribute is a pair (key, value), where value
    can be a Boolean, an Integer, a list of strings, ...
  • <attribute> = <value>;

62
ClassAds matchmaking
  • Brokering based on advertisements by both jobs
    and resources

63
ClassAds matchmaking
  • Allow customers to provide requirements and
    preferences on the resources
  • Allow resources to impose constraints on the
    customers they wish to service.
  • Separation between matchmaking and claiming.
  • The matchmaker is stateless and thus can scale to
    very large systems without complex failure
    recovery.

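In a Condor pool this two-sided matchmaking can be inspected from the command
line; a sketch, where the constraint expression and the job id 42.0 are just
examples:

    condor_status -constraint 'Memory > 1024 && Arch == "INTEL"'   # resources matching a job-style requirement
    condor_q -analyze 42.0                                         # explain why a queued job does (not) match
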
64
Job Description Language (JDL)
  • The supported attributes are grouped into two
    categories
  • Job Attributes
  • Define the job itself
  • Resources
  • Taken into account by the Workload Manager for
    carrying out the matchmaking algorithm (to choose
    the best resource where to submit the job)
  • Computing Resource
  • Used to build expressions of Requirements and/or
    Rank attributes by the user
  • Have to be prefixed with other.
  • Data and Storage resources
  • Input data to process, Storage Element (SE) where
    to store output data, protocols spoken by
    application when accessing SEs

65
JDL Relevant Attributes (1)
  • JobType
  • Normal (simple, sequential job), DAG,
    Interactive, MPICH, Checkpointable
  • Executable (mandatory)
  • The command name
  • Arguments (optional)
  • Job command line arguments
  • StdInput, StdOutput, StdError (optional)
  • Standard input/output/error of the job
  • Environment
  • List of environment settings
  • InputSandbox (optional)
  • List of files on the UI's local disk needed by
    the job for running
  • The listed files will be staged automatically to
    the remote resource
  • OutputSandbox (optional)
  • List of files, generated by the job, which have
    to be retrieved

66
JDL Relevant Attributes (2)
  • Requirements
  • Job requirements on computing resources
  • Specified using attributes of resources published
    in the Information Service
  • If not specified, the default value defined in the
    UI configuration file is considered
  • Default: other.GlueCEStateStatus == "Production"
    (the resource has to be able to accept jobs and
    dispatch them on WNs)
  • Rank
  • Expresses preference (how to rank resources that
    have already met the Requirements expression)
  • Specified using attributes of resources published
    in the Information Service
  • If not specified, the default value defined in the
    UI configuration file is considered
  • Default: -other.GlueCEStateEstimatedResponseTime
    (the lowest estimated traversal time)
  • Default: other.GlueCEStateFreeCPUs (the highest
    number of free CPUs) for parallel jobs (see later)

67
JDL Relevant Attributes (3)
  • InputData
  • Refers to data used as input by the job; these
    data are published in the Replica Catalog and
    stored in the Storage Elements
  • LFNs and/or GUIDs
  • InputSandbox
  • Executable, files etc. that are sent to the job
  • DataAccessProtocol (mandatory if InputData has
    been specified)
  • The protocol or the list of protocols which the
    application is able to speak for accessing
    InputData on a given Storage Element
  • OutputSE
  • The Uniform Resource Identifier of the output
    Storage Element
  • The RB uses it to choose a Computing Element that
    is compatible with the job and is close to the
    Storage Element

Details in Data Management lecture
68
Example of JDL File
  • JobType = "Normal";
  • Executable = "gridTest";
  • StdError = "stderr.log";
  • StdOutput = "stdout.log";
  • InputSandbox = {"/home/mydir/test/gridTest"};
  • OutputSandbox = {"stderr.log", "stdout.log"};
  • InputData = {"lfn:/glite/myvo/mylfn"};
  • DataAccessProtocol = "gridftp";
  • Requirements = other.GlueHostOperatingSystemName
    == "LINUX" && other.GlueCEStateFreeCPUs > 4;
  • Rank = other.GlueCEPolicyMaxCPUTime;

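A usage sketch for the JDL above, assuming it is saved as gridTest.jdl and that
a VOMS proxy is created first (file and VO names are illustrative; the -o/-i
options are the ones listed on a later slide):

    voms-proxy-init --voms myvo                 # obtain VO credentials before contacting the WMS
    glite-job-submit -o jobids gridTest.jdl     # submit; the returned job identifier is stored in 'jobids'
    glite-job-status -i jobids                  # poll the job state (Submitted, Waiting, Ready, ...)
    glite-job-output -i jobids                  # retrieve the output sandbox once the job is Done
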
69
Jobs State Machine (1/9)
  • Submitted: the job has been entered by the user
    at the User Interface but not yet transferred to
    the Network Server for processing

70
Jobs State Machine (2/9)
  • Waiting: job accepted by the NS and waiting for
    Workload Manager processing, or being processed by
    WM Helper modules.

71
Jobs State Machine (3/9)
  • Ready: job processed by the WM and its Helper
    modules (CE found) but not yet transferred to the
    CE (local batch system queue) via JC and CondorC.
    This state does not exist for a DAG, as it is not
    subjected to matchmaking (its nodes are) but is
    passed directly to DAGMan.

72
Jobs State Machine (4/9)
Scheduled: job waiting in the queue on the CE.
This state also does not exist for a DAG, as it
is not directly sent to a CE (its nodes are).
73
Jobs State Machine (5/9)
Running: the job is running. For a DAG this means
that DAGMan has started processing it.
74
Jobs State Machine (6/9)
Done: job exited or is considered to be in a
terminal state by CondorC (e.g., submission to the CE
has failed in an unrecoverable way).
75
Jobs State Machine (7/9)
Aborted: job processing was aborted by the WMS
(waiting in the WM queue or on the CE for too long,
over-use of quotas, expiration of user
credentials).
76
Jobs State Machine (8/9)
Cancelled: the job has been successfully cancelled on
user request.
77
Jobs State Machine (9/9)
Cleared: the output sandbox was transferred to the
user or removed due to a timeout.
78
Directed Acyclic Graphs (DAGs)
  • A DAG represents a set of jobs
  • Nodes = Jobs; Edges = Dependencies

(diagram: example DAG with nodes NodeA, NodeB, NodeC, NodeD and NodeE)
79
DAG JDL Structure
  • Type = "dag";
  • VirtualOrganisation = "yourVO";
  • Max_Nodes_Running = <int > 0>;
  • MyProxyServer = ...;
  • Requirements = ...;
  • Rank = ...;
  • InputSandbox = ...;    (more later!)
  • OutputSandbox = ...;
  • Nodes = [ nodeX = ...; ];    (more later!)
  • Dependencies = ...;    (more later!)

(Type, VirtualOrganisation, Nodes and Dependencies are mandatory; the
other attributes are optional.)
80
Attribute Nodes
  • The Nodes attribute is the core of the DAG
    description

Nodes = [
  nodefilename1 = [...];
  nodefilename2 = [...];
  ...
  dependencies = ...;
];

nodefilename1 = [ file = "foo.jdl"; ];
nodefilename2 = [ file = "/home/vardizzo/test.jdl";
                  retry = 2; ];

nodefilename1 = [ description = [ JobType = "Normal";
                                  Executable = "abc.exe";
                                  Arguments = "1 2 3";
                                  OutputSandbox = {...};
                                  InputSandbox = {...}; ... ];
                  retry = 2; ];

81
Attribute Dependencies
  • It is a list of lists representing the
    dependencies between the nodes of the DAG.

Nodes = [
  nodefilename1 = [...];
  nodefilename2 = [...];
  ...
  dependencies = ...;
];

dependencies = { { nodefilename1, nodefilename2 } };

MANDATORY: YES!

dependencies = {
  { nodefilename1, nodefilename2 },
  { { nodefilename1, nodefilename2 }, nodefilename3 },
  { { nodefilename1, nodefilename2, nodefilename3 }, nodefilename4 }
};
82
InputSandbox Inheritance
  • All nodes inherit the value of the attributes
    from the one specified for the DAG.

NodeA = [ description = [ JobType = "Normal";
                          Executable = "abc.exe";
                          OutputSandbox = { "myout.txt" };
                          InputSandbox = { "/home/vardizzo/myfile.txt",
                                           root.InputSandbox }; ]; ];

Type = "dag";
VirtualOrganisation = "yourVO";
Max_Nodes_Running = <int > 0>;
MyProxyServer = ...;
Requirements = ...;
Rank = ...;
InputSandbox = { ... };
Nodes = [ nodefilename = [...]; ...; dependencies = ...; ];
  • Nodes without any InputSandbox values have to
    contain an empty list in their description:
  • InputSandbox = {};

83
Interactive Jobs
  • It is a job whose standard streams are forwarded
    to the submitting client.
  • The DISPLAY environment variable has to be set
    correctly, because an X window may be opened.

(diagram: listener process; X window or standard non-GUI streams)
84
Interactive Jobs
  • Specified by setting JobType = "Interactive" in
    the JDL
  • When an interactive job is executed, a window for
    the stdin, stdout, stderr streams is opened
  • Possibility to send the stdin to the job
  • Possibility to have the stderr and stdout of the
    job while it is running
  • Possibility to start a window for the standard
    streams of a previously submitted interactive job
    with the command glite-job-attach

85
Interactive Jobs JDL Structure
  • Type = "job";                 (mandatory)
  • JobType = "interactive";      (mandatory)
  • Executable = ...;             (mandatory)
  • Arguments = ...;              (optional)
  • ListenerPort = <int > 0>;     (optional)
  • OutputSandbox = ...;          (optional)
  • Requirements = ...;           (mandatory)
  • Rank = ...;                   (mandatory)

gLite command: glite-job-attach [options] <jobID>
86
gLite Commands
  • JDL submission:
  • glite-job-submit -o guidfile jobCheck.jdl
  • JDL status:
  • glite-job-status -i guidfile
  • JDL output:
  • glite-job-output -i guidfile
  • Get latest job state:
  • glite-job-get-chkpt -o statefile -i guidfile
  • Submit a JDL from a state:
  • glite-job-submit -chkpt statefile -o guidfile
    jobCheck.jdl
  • See also the options by typing --help after the
    commands.

87
Economy based brokering
  • Unicore

88
Unicore Broker
  • Distributed brokering
  • Sites Know the State of their Resources Best
  • Sites Can Conceal their Resource Configuration
  • Different VOs Need Different Selection Algorithms
  • Preferred site sets will vary
  • Different applications have different performance
    characteristics
  • Uses an economic model
  • cost-based evaluation, like in the real world
  • broker developed by University of Manchester, UK

Unicore is an open source product coordinated by
the Unicore Forum, see www.unicore.org
89
Unicore Broker
graphic from Brokering in Unicore, John Brooke
and Donal Fellows, UoM, Unicore Summit October
2005
90
Job description ontology
graphic from Brokering in Unicore, John Brooke
and Donal Fellows, UoM, Unicore Summit October
2005
91
Unicore Broker hierarchy
graphic from Brokering in Unicore, John Brooke
and Donal Fellows, UoM, Unicore Summit October
2005
92
Unicore Broker in the system
(diagram components: Unicore Client and Alternative Client, Unicore Gateway,
Network Job Supervisor, Resource Broker, Resource Database, User Database,
Ext. Auth Service; backend systems NQS, Condor, GT; multiple firewall
layouts possible)
UoM Broker Architecture, from Dave Snelling,
Fujitsu Labs Europe, Unicore Technology, Grid
School July 2003
93
Unicore Broker
UoM Broker Architecture, from Dave Snelling,
Fujitsu Labs Europe, Unicore Technology, Grid
School July 2003
94
VO Schedulers
  • Pilot jobs and overlay networks

95
Towards a multi-scheduler world
  • expressing scheduling policies (priorities and
    usage shares) for multiple complex VOs in a
    single scheduler is proving difficult
  • the resource owner does not want to know about the
    VO's internal structure, but assigns the VO just a
    single share
  • the VO wants to set fine-grained intra-VO shares
  • local schedulers (such as MAUI) are not geared
    towards non-admin defined policies; there is no
    grid-aware scheduler
  • possible solutions
  • develop an interface to manage the local
    scheduling policies
  • stack the schedulers, i.e. introduce a per-VO
    scheduler

96
traditional job submission models
  • There are three traditional deployment models
  • direct per-user job submission to a gatekeeper
    running with root privileges (GT2 GK, today's
    model)
  • a non-privileged dedicated CE or scheduler,
    accepting authenticated user jobs and submitting
    to the batch system
  • on-demand CE, submitted by VO or user to a
    front-end system, that then receives user jobs
    and submits these to the batch system
  • in order not to have complex schedulers run as
    root, a sudo-like component, glexec, is introduced

97
What is glexec?
  • glexec
  • a thin layer to change unix credentials, based on
    grid identity and attribute information
  • you can think of it as
  • a replacement for the gatekeeper
  • a griddy version of Apache's suexec(8)
  • a program wrapper around LCAS, LCMAPS or GUMS

98
What glexec does
  • Input
  • a certificate chain, possibly with VOMS
    extensions
  • a user program name and arguments to run
  • Action
  • check authorization (LCAS, GUMS)
  • user credentials, proper VOMS attributes,
    executable name
  • acquire local credentials
  • local (uid, gid) pair, possibly across a cluster
  • enforce the local credential on the process
  • Result
  • user program is run with the mapped credentials

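A sketch of how a pilot framework might invoke glexec for a payload user; the
environment variable name, paths and file names are assumptions based on later
gLite documentation and may differ in detail from the version described here.

    # assumption: the pilot has written the payload owner's delegated proxy to this file
    export GLEXEC_CLIENT_CERT=/tmp/payload-proxy.pem
    # glexec authorizes the request and runs the payload under the mapped uid/gid
    /opt/glite/sbin/glexec /path/to/user-payload.sh
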
99
Jobs submission today (GT2 GK)
  • Deployment model without glexec (mode GT2GK)
  • jobs are submitted with an identity (hopefully
    the original user's one) to the site Gatekeeper
    running as root
  • one job manager is run for each user on the head
    node
  • with the user's (uid, gid) as set by the gatekeeper

100
Glexec in a one-per-site mode
  • Deployment model with a CE service
  • running in a non-privileged account or
  • with a CE run (maybe one per VO) on a single
    front-end per site
  • examples
  • CREAM
  • GT4 WS-GRAM

101
glexec with an on-demand CE
  • Deployment model with on-demand CEs (mode
    on-demand CEs)
  • The user or the VO start their own scheduler on a
    front-end system
  • All these on-demand schedulers are
    resource-limited by a site-managed master
    scheduler (via a GT2GK or Condor)
  • the on-demand schedulers eat jobs for their VO or
    user
  • and set the proper identity before the job gets
    submitted to the site batch system

102
glexec with on-demand CE
  • Deployment model with on-demand CEs (mode
    on-demand for VOs with native interface)

103
Traditional model summary
  • In all three models, the submission of the user
    job to the batch system is done with the original
    job owner's mapped (uid, gid) identity
  • grid-to-local identity mapping is done only on
    the front-end system (CE)
  • batch system accounting provides per-user records
  • inspection of Unix processes on worker nodes is
    per-user

104
Pilot jobs
  • A pilot job is basically just
  • a small script which downloads a real job
  • from a repository once it starts executing, hence
  • it is not committed to any particular task, or
    perhaps even a particular user, until that point.
  • If there are no tasks waiting, the pilot job exits
    immediately.
  • In principle, if the time limits on the queue are
    long enough, a single pilot job could run more
    than one real job, although I'm not sure if
    anyone is actually doing that at the moment
    (a minimal sketch follows below).

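A minimal sketch of such a pilot script, with a purely hypothetical task-queue
URL, variable names and payload format, just to make the flow concrete:

    #!/bin/sh
    # ask the (hypothetical) VO task queue for a payload suited to this worker node
    TASK=`curl -s "https://taskqueue.example.org/fetch?site=$SITE_NAME"`
    if [ -z "$TASK" ]; then
        exit 0                  # nothing waiting: the pilot exits immediately
    fi
    echo "$TASK" > payload.sh
    sh payload.sh               # run the real job; loop while the queue time limit allows
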
105
From the VO side
  • Background: some large VOs develop and prefer to
    use their own scheduling and job management
    framework
  • late binding of jobs to job slots
  • first establishing an overlay network
  • subsequent scheduling and starting of jobs is
    faster
  • hides the differences between the various grid
    flavours
  • implements VO priorities
  • full use of allocated slots, up to max wall clock
    time
  • but these VOs will need their own scheduler
  • some of them do have it already,
  • but others don't and most never will, so the use
    of pilots should not be the only (or even the
    default) way of doing things

106
Situation today
  • VO-type pilot jobs are submitted as if they were
    regular user jobs
  • run with the identity of one or a few individuals
    from a VO
  • obtain jobs from any user (within the VO) and run
    that payload on the WN allocated
  • site sees only a single identity, not the true
    owner of the workload
  • no effective mechanisms today can deny this use
    model
  • note that this does not apply to the regular
    per-user pilot jobs

107
Issues
  • Issues that drove the original glexec-on-WN
    scenario
  • VO supplied pilot jobs must observe and honour
  • the same policies the site uses for normal job
    execution
  • preferably
  • without requiring alternate mechanisms to
    describe the policies
  • be continuously in synch with the site policies
  • again, per-user pilot jobs satisfy these rules
    by design

108
Pieces of a solution
  • Three pieces that go together
  • glexec on the worker-node deployment
  • mechanism for pilot job to submit themselves and
    their payload to site policy control
  • give incontrovertible evidence of who is running
    on which node at any one time
  • needed at selected sites for regulatory
    compliance
  • ability to nail individual culprits
  • by requiring the VO to present a valid delegation
    from each user
  • VO should want this
  • to keep user jobs from interfering with each
    other
  • honouring site ban lists for individuals may help
    in not banning the entire VO in case of an
    incident

109
Pieces of the solution
  • glexec on the worker-node deployment
  • a way to keep the pilot job submitters to their
    word
  • system-level auditing of the pilot jobs, to see
    they are not doing the user job by themselves or
    evading the controls
  • relies on advanced auditing features of the OS
    (from EAL3)
  • but auditing data on the WN is useful for
    incident investigations only
  • internal accounting should be done by the VO
  • the regular site accounting mechanisms are via
    the batch system, and will see the pilot job
    identity
  • the site can easily show from those logs the
    usage by the pilot job (for which wall-clock-time
    accounting should be used)
  • making a site do accounting based on glexec jobs
    is non-standard, requires effort, may be
    intrusive, and messes up normal accounting
  • a VO capable of writing their own submission
    framework ought to be able to write their own
    accounting system as well

110
glexec on WN deployment model
  • VO submits a pilot job to the batch system
  • the VO pilot job submitter is responsible for
    the pilot behaviour
  • this might be a specific role in the VO, or a
    locally registered badged user at each site
  • Pilot job is subject to normal site policies for
    jobs
  • Pilot job obtains the true user job, and
    presents the user credentials and the job
    (executable name) to the site (glexec) to
    request a decision

111
VO pilot job on the node
  • On success, the site will set the uid/gid of the
    new user's job
  • On failure glexec will return with an error, and
    pilot job can terminate or obtain other job

Note: a proper uid change by the Gatekeeper or
Condor-C/BLAHP on the head node should remain the default
112
What is needed in this model?
  • Agreement on the three ingredients
  • deployment of glexec on the WN to do setuid
  • detailed auditing on the head node and the WNs
  • site accounting done at the VO (i.e. pilot job)
    level
  • glexec
  • needs feature enhancements compared to single-CE
    version
  • see status of glexec on the next slide
  • Inspection of the audit logs
  • detect abuse patterns in the system-call auditing
    logs
  • Grid job logging capabilities
  • glexec will log (uid, user/system/real time
    usage) via syslog
  • credential mapping framework (LCMAPS) will log
    mapping (also via syslog)
  • centralisation of glexec mappings, e.g. via
    JobRepository

113
Notes and alternatives
  • glexec, like any site-managed ingress point,
    trusts the submitter not to have mixed up the
    user credentials and the jobs
  • we trust the RB today to do this correctly, and
    RBs are unknown quantities to the receiving site
  • a longer-term solution is to have the job request
    signed by the submitting user
  • since the description is modified by
    intermediaries (brokers), the signature can only
    apply to the original content, and the site would
    have to evaluate whether the job received matches
    the signed JDL
  • or use an inheritance model for the job
    description, and treat the job like you would,
    e.g., a CIM entity

114
Summary
  • Realize that some VOs are doing pilot jobs
    today
  • there is no effective enforcement against this
  • some sites may just not care yet, whilst others
    have hard requirements on auditability and
    regulatory compliance
  • The glexec-on-WN model gives the VOs tools to
    comply with site requirements
  • at least it makes things better than they are today
  • but you, as a site, will miss that warm and fuzzy
    feeling of trust
  • glexec-on-WN is always replaceable by the
    null operation for sites that don't care or
    don't want it
  • but realize this is for just one of the glexec
    deployment models