Condor COD (Computing On Demand), Condor Week, 5/5/2003

1
Condor COD (Computing On Demand)
Condor Week, 5/5/2003
2
What problem are we trying to solve?
  • Some people want to run interactive, yet
    compute-intensive applications
  • Jobs that take lots of compute power over a
    relatively short period of time
  • They want to use batch computing resources, but
    need them right away
  • Ideally, when they're not in use, resources would
    go back to the batch system

3
Some example applications
  • A distributed build/compilation of a large
    software system
  • A very complex spreadsheet that takes a lot of
    cycles when you press recalculate
  • High-energy physics (HEP) analysis jobs
  • Visualization tools for data-mining, rendering
    graphics, etc.

4
Example application for COD
  • Diagram: User's Workstation (controller
    application); Compute Farm (on-demand workers,
    idle nodes)
5
What's the Condor solution?
  • Condor COD: Computing on Demand
  • Use Condor to manage the batch resources when
    they're not in use by the interactive jobs
  • Allow the interactive jobs to come in with high
    priority and run instead of the batch job on any
    given resource

6
Why did we have to change Condor for that?
  • Doesn't Condor already notice when an interactive
    job starts on a CPU?
  • Doesn't Condor already provide checkpointing when
    that happens?
  • Can't I configure Condor to run whatever jobs I
    want with a higher priority on my own machines?

7
Well, yes... But that's not good enough
  • Not all jobs can be checkpointed, and even those
    that can be take some time to checkpoint
  • We want this to be instantaneous, not waiting for
    the batch system to schedule tasks
  • You can configure Condor to run higher priority
    jobs, but the other jobs are kicked off the
    machine

8
What's new about COD?
  • Checkpoint to swap space
  • When a high-priority COD job appears, the
    lower-priority batch job is suspended
  • The COD job can run right away, while the batch
    job is suspended
  • Batch jobs (even those that can't checkpoint) can
    resume instantly once there are no more active
    COD jobs

9
But wait, there's more...
  • The condor_startd can now manage multiple
    claims on each resource
  • If any COD claim becomes active, the regular
    Condor claim is automatically suspended
  • Without an active COD, regular claim resumes
  • There is a new command-line tool to request,
    activate, suspend, resume and release these
    claims
  • There's even a C++ object to do all of that, if
    you really want it
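
The suspend/resume rule on this slide can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Resource` class, its field names, and the "Unclaimed" label are hypothetical and are not part of Condor or the condor_startd. The model captures just one idea from the bullets above: the regular claim is suspended while any COD claim is active, and resumes otherwise.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Toy model of one compute resource (hypothetical, not Condor code)."""
    has_batch_claim: bool = False
    # Maps a COD claim ID to its state: "Idle", "Running" or "Suspended".
    cod_claims: dict = field(default_factory=dict)

    def cod_active(self) -> bool:
        """True if at least one COD claim is currently running."""
        return any(state == "Running" for state in self.cod_claims.values())

    def batch_state(self) -> str:
        """State of the regular claim under the suspend/resume rule."""
        if not self.has_batch_claim:
            return "Unclaimed"
        return "Suspended" if self.cod_active() else "Running"


node = Resource(has_batch_claim=True)
node.cod_claims["COD1"] = "Running"      # a COD claim becomes active
state_during_burst = node.batch_state()  # regular claim is suspended
node.cod_claims["COD1"] = "Suspended"    # no active COD claim remains
state_after_burst = node.batch_state()   # regular claim resumes
```

Deriving the batch state from the COD claims, rather than storing it, keeps the toy model from ever showing a running batch job next to an active COD claim.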

10
COD claim-management commands
  • Request: authorizes the user and returns a unique
    claim ID for future commands
  • Activate: spawns an application on a given COD
    claim, with various options to define the
    application, job ID, etc.
  • Activating a COD claim suspends any regular
    Condor job
  • You can have multiple COD claims on a single
    resource, and they can all be running
    simultaneously

11
COD commands (cont'd)
  • Suspend: the given COD claim is suspended
  • If there are no more active COD claims, a regular
    Condor batch job can now run
  • Resume: the given COD claim is resumed, suspending
    the Condor batch job (if any)
  • Deactivate: kill the application but hold onto
    the COD claim
  • Release: get rid of the COD claim itself
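
The claim-management commands can be read as transitions of a small state machine. The sketch below is illustrative only: the state names Idle, Running and Suspended are taken from the condor_status output shown later in this talk, but the transition table itself is an assumption (for example, that a suspended claim may be deactivated), and `next_state` and `run` are hypothetical helpers, not Condor functions.

```python
# (current state, command) -> next state; assumed for illustration.
TRANSITIONS = {
    ("Idle", "activate"): "Running",
    ("Running", "suspend"): "Suspended",
    ("Suspended", "resume"): "Running",
    ("Running", "deactivate"): "Idle",
    ("Suspended", "deactivate"): "Idle",
}


def next_state(state, command):
    """Return the claim state after a command; None means the claim is gone."""
    if command == "release":
        # Assumed to be allowed from any state: the release example in
        # this talk reports a claim that was "Running" when released.
        return None
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"cannot {command} a claim that is {state}") from None


def run(commands, state="Idle"):
    """Apply commands in order to a freshly requested (Idle) claim."""
    for command in commands:
        state = next_state(state, command)
    return state
```

For instance, `run(["activate", "suspend", "resume"])` ends in the Running state, which mirrors the request, activate, suspend, resume sequence used in the walkthrough that follows.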

12
COD command protocol
  • All commands use ClassAds
  • Allows for a flexible protocol
  • Excellent error propagation
  • Can use existing ClassAd technology
  • Similar to existing Condor protocol
  • Separation of claiming from activation, so you
    can have hot-spares, etc.

13
How does all of that solve the problem?
  • The interactive COD application starts up, and
    goes out to claim some compute nodes
  • Once the helper applications are in place and
    ready, these COD claims are suspended, allowing
    batch jobs to run
  • When the interactive application has work, it can
    instantly suspend the batch jobs and resume the
    COD applications to perform the computations

14
Step 1: Initial state
  • Diagram: User's Workstation; Compute Farm (idle
    nodes)
15
Step 2: Application spawned
  • Diagram: User's Workstation (controller
    application spawned); Compute Farm (idle nodes)
16
Step 3: Compute node setup
  • Diagram: User's Workstation; Compute Farm
    (on-demand workers, idle nodes)
  • On-screen text: Claiming and initializing 4
    compute nodes for rendering. Got reply from
    c1.cluster.org, c6.cluster.org, c14.cluster.org,
    c17.cluster.org. SUCCESS!
  • COD commands shown: request, activate
17
Step 3: Commands used
    condor_cod_request -name c1.cluster.org \
        -classad c1.out
    Successfully sent CA_REQUEST_CLAIM to startd at
        <128.105.143.14:55642>
    Result ClassAd written to c1.out
    ID of new claim is <128.105.143.14:55642>#1051656208#2

    condor_cod_activate -keyword fractgen \
        -id <128.105.143.14:55642>#1051656208#2
    Successfully sent CA_ACTIVATE_CLAIM to startd at
        <128.105.143.14:55642>
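
A controller application that drives these two commands needs to pick the claim ID out of the request output and pass it on to the activate command. The following is a minimal sketch, not the speaker's code. It assumes the option names are `-keyword` and `-id` (the leading dashes appear to have been lost in this transcript), that the claim ID reads `<128.105.143.14:55642>#1051656208#2` once the stripped punctuation is restored, and that the ID is printed on a line beginning "ID of new claim is". `parse_claim_id` and `activate_argv` are hypothetical helper names.

```python
import re


def parse_claim_id(output: str) -> str:
    """Extract the claim ID from the text printed by the request command."""
    for line in output.splitlines():
        # Tolerate an optional colon and optional double quotes around the ID.
        match = re.match(r'\s*ID of new claim is:?\s*"?([^"\s]+)"?\s*$', line)
        if match:
            return match.group(1)
    raise ValueError("no claim ID found in output")


def activate_argv(claim_id: str, keyword: str) -> list:
    """Build an argument list for the activate command (assumed flags)."""
    return ["condor_cod_activate", "-keyword", keyword, "-id", claim_id]


# Sample text modelled on the transcript, with punctuation reconstructed.
sample = (
    "Successfully sent CA_REQUEST_CLAIM to startd at <128.105.143.14:55642>\n"
    "Result ClassAd written to c1.out\n"
    "ID of new claim is <128.105.143.14:55642>#1051656208#2\n"
)
claim_id = parse_claim_id(sample)
argv = activate_argv(claim_id, "fractgen")
```

Building an argument list rather than a shell string means the `<`, `>` and `#` characters in the claim ID are never interpreted by a shell when the list is handed to something like `subprocess.run(argv)`.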

18
Step 4: Checkpoint to swap
  • Diagram: User's Workstation; Compute Farm
    (suspended worker)
  • On-screen text: SELECT FRACTAL TYPE: <Mandelbrot>
    (more user input)
  • COD command shown: suspend
19
Step 4: Commands used
    condor_cod_suspend \
        -id <128.105.143.14:55642>#1051656208#2
    Successfully sent CA_SUSPEND_CLAIM to startd at
        <128.105.143.14:55642>
  • Rendering application on each COD node is
    suspended while interactive tool waits for input
  • The resources are now available for batch Condor
    jobs

20
Step 5: Batch jobs can run
  • Diagram: User's Workstation; Compute Farm; Batch
    queue
  • On-screen text: SPECIFY PARAMETERS:
    max_iterations 400000; TL -0.65865, -0.56254;
    BR -0.45865, -0.71254 (more user input)
21
Step 6: Computation burst
  • Diagram: User's Workstation; Compute Farm
    (interactive workers, on-demand workers, idle
    nodes, suspended batch job)
  • On-screen text: CLICK <RENDER> TO VIEW YOUR
    FRACTAL
  • Other labels: RENDER
  • COD command shown: resume
22
Step 6: Commands used
    condor_cod_resume \
        -id <128.105.143.14:55642>#1051656208#2
    Successfully sent CA_RESUME_CLAIM to startd at
        <128.105.143.14:55642>
  • Batch Condor jobs on COD nodes are suspended
  • All COD rendering applications are resumed on
    each node

23
Step 7: Results produced
  • Diagram: User's Workstation; Compute Farm
    (interactive workers, on-demand workers, idle
    nodes, suspended batch job)
  • Other labels: Data, Display
24
Step 8: User input while batch work resumes
  • Diagram: User's Workstation; Compute Farm
    (suspended worker, idle nodes)
  • On-screen text: ZOOM BOX COORDINATES:
    TL -0.60301, -0.61087; BR -0.58037, -0.62785
  • COD command shown: suspend
25
Step 9: Computation burst 2
  • Diagram: User's Workstation; Compute Farm
    (interactive workers, on-demand workers, idle
    nodes, suspended batch job)
  • Other labels: Data, Display, RENDER
  • COD command shown: resume
26
Step 10: Clean-up
  • Diagram: User's Workstation; Compute Farm (idle
    nodes)
  • On-screen text: REALLY QUIT? Y/N. Releasing
    compute nodes: 4 nodes terminated successfully!
  • COD command shown: release
27
Step 10: Commands used
    condor_cod_release \
        -id <128.105.143.14:55642>#1051656208#2
    Successfully sent CA_RELEASE_CLAIM to startd at
        <128.105.143.14:55642>
    State of claim when it was released: "Running"
  • The jobs are cleaned up, claims released, and
    resources returned to batch system

28
Other changes for COD
  • The condor_starter has been modified so that it
    can run jobs without communicating with a
    condor_shadow
  • All the great job control features of the starter
    without a shadow
  • Starter can write its own UserLog
  • Other useful features for COD

29
condor_status -cod
  • New -cod option to condor_status to view COD
    claims in a Condor pool

    Name        ID   ClaimState TimeInState RemoteUser JobId Keyword
    astro.cs.wi COD1 Idle        0+00:00:04 wright
    chopin.cs.w COD1 Running     0+00:02:05 wright     3.0   fractgen
    chopin.cs.w COD2 Suspended   0+00:10:21 wright     4.0   fractgen

                 Total  Idle  Running  Suspended  Vacating  Killing
    INTEL/LINUX      3     1        1          1         0        0
          Total      3     1        1          1         0        0
30
What else could I use all these new features for?
  • Short-running system administration tasks that
    need quick access but don't want to disturb the
    jobs in your batch system
  • A Grid Shell
  • A condor_starter that doesn't need a
    condor_shadow is a powerful job management
    environment that can monitor a job running under
    a hostile batch system on the grid

31
Future work
  • More ways to tell COD about your application
  • For now, you define important attributes in your
    condor_config file and pre-stage the executables
  • Ability to transfer files to and from a COD job
    at a remote machine
  • We've already got the functionality in Condor, so
    why rely on a shared filesystem or pre-staging?

32
More future work
  • Accounting for COD jobs
  • Working with some real-world applications and
    integrating these new COD features
  • Would the real users please stand up?
  • Better Grid Shell support
  • This is really a separate-yet-related area of
    work

33
How do you use COD?
  • Upgrade to Condor version 6.5.3 or greater: COD
    is already included
  • There will be a new section in the Condor manual
    (coming soon)
  • If you need more help, ask the ever-helpful
    condor-admin@cs.wisc.edu
  • Find me at the BoF on Wednesday, 9am to Noon
    (room TBA)