Integrating GPUs into Condor - PowerPoint PPT Presentation



1
Integrating GPUs into Condor
  • Timothy Blattner
  • Marquette University
  • Milwaukee, WI
  • April 22, 2009

2
Outline
  • Background and Vision
  • Graphics Cards
  • Condor Approach
  • Problems
  • Conclusions and Future Work

3
Graphics cards
  • Powerful: NVIDIA Tesla C1060
  • 240 massively parallel processing cores
  • 4 GB GDDR3
  • CUDA capable
  • 933 gigaflops peak (single precision)
  • $1,300
  • Cheap: NVIDIA 9800 GT
  • 112 massively parallel processing cores
  • 512 MB GDDR3
  • CUDA capable
  • $120

4
Vision and Focus
  • Pool of computers containing graphics cards,
    managed by Condor
  • Provide users the ability to utilize graphics
    cards identified by Condor

[Figure: Central Manager with three nodes labeled "?"]

5
Opportunities
  • Resources may already be there
  • Majority of machines have graphics cards in them
  • GPU resources sit idle while Condor runs on the
    CPU
  • Similar work
  • GPUGRID.net
  • Distributed computing project using NVIDIA
    graphics cards for full-atom molecular
    simulations of proteins
  • Uses GPU-enabled BOINC client

6
Prototype Implementation
  • Linux only
  • Script queries operating system and graphics card
  • Hawkeye Cron job manager runs script
  • Script outputs graphics card information into
    ClassAd format
  • Binary for NVIDIA cards for more specific
    information
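
The slides do not say how the script finds the card. As a rough sketch only, one plausible approach on Linux is to scan `lspci` output for NVIDIA display controllers. The function name `find_nvidia_cards` and the sample text are hypothetical, and real `lspci` output varies between systems.

```python
import re

# Hypothetical sample of `lspci` output; real output varies by system.
SAMPLE_LSPCI = """\
00:1f.2 SATA controller: Intel Corporation 82801 SATA Controller
01:00.0 VGA compatible controller: nVidia Corporation G92 [GeForce 9800 GT] (rev a2)
"""

def find_nvidia_cards(lspci_text):
    """Return the device descriptions of NVIDIA display controllers."""
    cards = []
    for line in lspci_text.splitlines():
        # Each line is assumed to look like "<slot> <class>: <vendor and device>".
        match = re.match(r"^\S+\s+([^:]+):\s+(.*)$", line)
        if not match:
            continue
        device_class, description = match.groups()
        is_display = "VGA" in device_class or "3D controller" in device_class
        # Vendor spelling differs between lspci versions, so compare in lowercase.
        if is_display and "nvidia" in description.lower():
            cards.append(description)
    return cards

print(find_nvidia_cards(SAMPLE_LSPCI))
```

On a live machine the text could come from `subprocess.run(["lspci"], capture_output=True, text=True).stdout` instead of the sample string.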

7
Graphics Card Architecture
8
Graphics card APIs
  • Favor general purpose computations
  • CUDA (NVIDIA)
  • Brook (ATI)
  • OpenCL (Khronos Group)

9
CUDA Programming Model
  • Kernels are functions run on the device (GPU)
  • Host (CPU) code invokes kernels and determines
    the number of threads and the thread block
    structure for organizing them
  • Kernel invocations are asynchronous
  • Control returns to the CPU immediately
  • CUDA provides synchronization primitives
  • Some CUDA calls (e.g. memory allocation) are
    synchronous
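
The host-side choice of thread count and block structure comes down to a ceiling division. The sketch below covers the one-dimensional case only and assumes one thread per data element. The default of 512 threads per block is the limit reported for the Quadro FX 3700 elsewhere in these slides, and the name `launch_config` is hypothetical.

```python
def launch_config(n_elements, threads_per_block=512):
    """Return (blocks, threads_per_block) covering n_elements with one thread each."""
    if n_elements <= 0 or threads_per_block <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: the last block may be only partly used.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

blocks, threads = launch_config(100_000)
print(blocks, threads, blocks * threads)  # 196 512 100352
```

Because 196 blocks of 512 threads launch 100,352 threads for 100,000 elements, a kernel launched this way needs a bounds check so the surplus threads do nothing.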

10
Hawkeye Cron Job Manager
  • Provides mechanism for collecting, storing, and
    using information about computers
  • Periodically executes specified program(s)
  • Program outputs in form of ClassAd
  • Outputs are added to machine's ClassAd

11
Hawkeye Implementation
  • Added to local configuration file
  • Runs script every minute
  • Condor user must be granted graphics card
    privileges in order to query the card

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST), UPDATEGPU
STARTD_CRON_UPDATEGPU_EXECUTABLE = gpu.sh
STARTD_CRON_UPDATEGPU_PERIOD = 1m
STARTD_CRON_UPDATEGPU_MODE = Periodic
STARTD_CRON_UPDATEGPU_KILL = True
12
Script Output
HasGpu = True
NGpu = 1
Gpu0 = "Quadro FX 3700"
Gpu0CudaCapable = True
Gpu0_Major = 1
Gpu0_Minor = 1
Gpu0Mem = 536150016
Gpu0Procs = 14
Gpu0Cores = 112
Gpu0ShareMem = 16384
Gpu0ThreadsPerBlock = 512
Gpu0ClockRate = 1.24
HasCuda = True
-
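
Output of this shape could be produced by a small formatting helper. The following is a minimal sketch, not the project's actual script. It assumes the `Name = value` line form used for Hawkeye output, ends with the `-` line shown on the slide, and assumes string values contain no double quotes. The name `to_classad_lines` is hypothetical.

```python
def to_classad_lines(attrs):
    """Format a dict as 'Name = value' lines, ending with a '-' line."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):   # test bool first: bool is a subclass of int
            text = "True" if value else "False"
        elif isinstance(value, str):  # strings are quoted; no escaping is attempted
            text = f'"{value}"'
        else:                         # numbers are written as-is
            text = str(value)
        lines.append(f"{name} = {text}")
    lines.append("-")
    return lines

gpu_info = {"HasGpu": True, "NGpu": 1, "Gpu0": "Quadro FX 3700", "Gpu0Cores": 112}
print("\n".join(to_classad_lines(gpu_info)))
```

Keeping the attributes in a dict makes it straightforward to add per-card entries (Gpu1, Gpu2, and so on) by generating the names in a loop.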
13
Job Submission
  • Users can submit jobs with GPU requirements into
    Condor
  • Portable across Linux Distros

Universe     = vanilla
Executable   = tests/CudaJob
Initialdir   = gpuJobs
Requirements = (HasGpu == true) && (Gpu0CudaCapable == true)
Log          = gpu_test.log
Error        = gpu_test.stderr
Output       = gpu_test.stdout
Queue

condor_submit gpu_job.submit
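
The submit file requires both HasGpu and Gpu0CudaCapable to be true on the target machine. The sketch below mimics that check over plain dictionaries to show which machines would qualify. It is only an illustration, not Condor's matchmaker, and the machine names and the function `satisfies_gpu_requirements` are hypothetical.

```python
def satisfies_gpu_requirements(machine_ad):
    """Mirror the job's Requirements: HasGpu and Gpu0CudaCapable must both be true."""
    # A missing attribute is treated as "no match"; this is a simplification.
    return (machine_ad.get("HasGpu") is True
            and machine_ad.get("Gpu0CudaCapable") is True)

# Hypothetical machine ads, reduced to the two attributes the job cares about.
pool = {
    "node1": {"HasGpu": True, "Gpu0CudaCapable": True},
    "node2": {"HasGpu": True, "Gpu0CudaCapable": False},
    "node3": {},
}
matches = [name for name, ad in pool.items() if satisfies_gpu_requirements(ad)]
print(matches)  # ['node1']
```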
14
Access Control
  • /dev/nvidiactl and /dev/nvidia* devices need
    read/write access for the submitting/running user
  • Could be:
  • Nobody, open access
  • Controlled by Unix group, containing limited
    users
  • Integrated more directly with Condor user
    control, slot users
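
Whichever policy is chosen, it can be verified from the account that will run the job. This is a minimal sketch that reports device nodes the current user cannot both read and write. The path `/dev/nvidia0` is an assumed name for the first card, and `inaccessible_devices` is a hypothetical helper.

```python
import os

# /dev/nvidiactl is named on the slide; /dev/nvidia0 is an assumed per-card node.
DEFAULT_DEVICES = ("/dev/nvidiactl", "/dev/nvidia0")

def inaccessible_devices(paths=DEFAULT_DEVICES):
    """Return the paths the current user cannot both read and write."""
    # os.access also returns False for paths that do not exist.
    return [p for p in paths if not os.access(p, os.R_OK | os.W_OK)]

blocked = inaccessible_devices()
if blocked:
    print("No read/write access to:", ", ".join(blocked))
else:
    print("GPU device nodes are accessible")
```

Running this as the Condor slot user before advertising the GPU would catch a permission problem before a job is matched to the machine.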

15
Problems
  • Preemption
  • Jobs running in GPU kernel cannot be interrupted
    reliably by Unix signals
  • Watchdog timer
  • After 5 seconds, job is killed
  • A solution: use the general-purpose graphics
    card as a secondary display
  • Memory Security
  • Malicious users, interrupting a job between GPU
    kernel calls, have the opportunity to overwrite
    or copy GPU memory

16
Summary
  • Condor-based approach for advertising GPU
    resources
  • Linux-based prototype implementation
  • Can access available GPUs
  • Works best on dedicated machines, with no need
    for preemption
  • Current Limitations
  • Doesn't report GPU usage
  • Lack of preemption
  • Limited OS and video card support

17
Future Work
  • Create benchmark and testing suite
  • Handle preemption
  • Investigate how watchdog works
  • GPU usage reporting
  • Integrate memory protection
  • Support more Operating Systems
  • Windows and Mac OS X
  • Support alternative architectures and APIs
  • Brook and OpenCL

18
  • Questions?
  • Contact
  • timothy.blattner_at_marquette.edu
  • craig.struble_at_marquette.edu
  • https://sourceforge.net/projects/condorgpu/