1
High-Throughput Computing on Commodity Systems.
2
The Good News
  • Raw computing power is everywhere - on desktops,
    shelves, racks, and in your pockets. It is
  • Cheap
  • Plentiful
  • Mass-Produced

3
The Bad News
  • GFLOPS per year = GFLOPS per second ×
    ~30,000,000 seconds/year
  • Peak speed sets only the ceiling; the seconds you
    actually harvest set the throughput.
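
To make the scale concrete (a back-of-the-envelope example; the
utilization figure is illustrative, not from the original slides):

    1 GFLOPS peak × 30,000,000 s/year = 30,000,000 GFLOP/year (the ceiling)
    the same machine at 10% average utilization: only 3,000,000 GFLOP/year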

4
A variation on a chestnut
  • What is a benchmark?

5
Answer
  • The throughput which your system is guaranteed
    never to exceed!

6
Why?
  • A community of commodity computers can be
    difficult to manage
  • Dynamic - state and availability change over
    time
  • Evolving - new hardware and software is
    continuously acquired and installed
  • Heterogeneous - hardware and software vary
    across machines
  • Distributed ownership - each machine has a
    different owner with different requirements and
    preferences.

7
Why?
  • Even traditionally static systems (such as
    professionally managed clusters) suffer the same
    problems when viewed at a yearly scale
  • Power failures
  • Hardware failures
  • Software upgrades
  • Load imbalance
  • Network imbalance

8
How do we measure computer performance?
  • High-Performance Computing
  • Achieve max GFLOPS per second under ideal
    circumstances.
  • High-Throughput Computing
  • Achieve max GFLOPS per month or year under
    whatever conditions prevail.

9
High-Throughput Computing
  • Focuses on maximizing
  • simulations run before the paper deadline
  • crystal lattices per week
  • reconstructions per week
  • video frames rendered per year
  • without babysitting from the user.
  • Cannot depend on ideal circumstances.

10
High-Throughput Computing
  • Is achieved by
  • Expanding the set of available CPUs.
  • Silently adapting to inevitable changes.
  • Robust software
  • Is only marginally affected by
  • MB, MHz, MIPS, FLOPS
  • Robust hardware

11
Solution: Condor
  • Condor is software for creating a high-throughput
    computing environment on a community of
    workstations, ranging from commodity PCs to
    supercomputers.

12
Who are we?
13
The Condor Project (Established '85)
  • Distributed systems CS research performed by a
    team that faces
  • software engineering challenges in a
    UNIX/Linux/NT environment,
  • active interaction with users and collaborators,
  • daily maintenance and support challenges of a
    distributed production environment,
  • and educating and training students.
  • Funding - NSF, NASA, DoE, DoD, IBM, INTEL,
    Microsoft, and the UW Graduate School.

14
Users and collaborators
  • Scientists - Biochemistry, high energy physics,
    computer sciences, genetics,
  • Engineers - Hardware design, software building
    and testing, animation, ...
  • Educators - Hardware design tools, distributed
    systems, networking, ...

15
National Grid Efforts
  • National Technology Grid - NCSA Alliance
    (NSF-PACI)
  • Information Power Grid - IPG (NASA)
  • Particle Physics Data Grid - PPDG (DoE)
  • Grid Physics Network - GriPhyN (NSF-ITR)

16
Condor CPUs on the UW Campus
17
Some Numbers: UW-CS Pool
  • 6/98-6/00: 4,000,000 hours (≈ 450 years)
  • Real Users: 1,700,000 hours (≈ 260 years)
    • CS-Optimization: 610,000 hours
    • CS-Architecture: 350,000 hours
    • Physics: 245,000 hours
    • Statistics: 80,000 hours
    • Engine Research Center: 38,000 hours
    • Math: 90,000 hours
    • Civil Engineering: 27,000 hours
    • Business: 970 hours
  • External Users: 165,000 hours (≈ 19 years)
    • MIT: 76,000 hours
    • Cornell: 38,000 hours
    • UCSD: 38,000 hours
    • CalTech: 18,000 hours

18
Start slow, but think BIG
19
Start slow, but think big!
  • One Personal Condor - 1 machine on your desktop
  • Condor Pool - 100 machines in your department
  • Condor-G - 1000 machines in the GRID
20
Start slow, but think big!
  • Personal Condor
  • Manage just your machine with Condor. Fault
    tolerance, policy control, logging. Sleep
    soundly at night.
  • Condor Pool
  • Take advantage of your friends and colleagues:
    share cycles, gain 100x throughput.
  • Condor-G
  • Jobs from your pool migrate to other
    computational facilities around the world. Gain
    1000x throughput. (Record-breaking results!)

21
Key Condor User Services
  • Local control - jobs are stored and managed
    locally by a personal scheduler.
  • Priority scheduling - execution order controlled
    by a priority ranking assigned by the user (see
    the command sketch after this list).
  • Job preemption - re-linked jobs can be
    checkpointed, suspended, held, and resumed.
  • Local execution environment preserved - re-linked
    jobs can have their I/O redirected to the
    submission site.
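
From the command line, these services map onto a handful of tools (a
minimal sketch; job id 42.0 is an illustrative cluster.proc, not from
the slides):

    % condor_prio -p 10 42.0   # raise the user-assigned priority of job 42.0
    % condor_hold 42.0         # take the job out of the run rotation, keep it queued
    % condor_release 42.0      # let it run again
    % condor_q                 # inspect the locally managed queue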

22
More Condor User Services
  • Powerful and flexible means for selecting the
    execution site (requirements and preferences).
  • Logging of job activities.
  • Management of large numbers (10K+) of jobs per
    user.
  • Support for jobs with dependencies - DAGMan
    (Directed Acyclic Graph Manager); see the sketch
    after this list.
  • Support for dynamic master-worker (MW)
    applications (PVM- and file-based).
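
A DAGMan input file names each node's submit file and the dependencies
between them (a minimal sketch; the file names are illustrative):

    # diamond.dag - B and C run after A; D runs after both
    JOB A a.submit
    JOB B b.submit
    JOB C c.submit
    JOB D d.submit
    PARENT A CHILD B C
    PARENT B C CHILD D

The whole graph is submitted with condor_submit_dag diamond.dag;
DAGMan itself runs as a Condor job.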

23
How does it work?
24
Basic HTC Mechanisms
  • Matchmaking - enables requests for services and
    offers to provide services to find each other
    (ClassAds); see the query sketch after this list.
  • Fault tolerance - checkpointing enables
    preemptive-resume scheduling (go ahead and use it
    as long as it is available!).
  • Remote execution - enables transparent access to
    resources from any machine in the world.
  • Asynchronicity - enables management of dynamic
    (opportunistic) resources.
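
Both sides of a match can be inspected from the command line (a sketch
using Condor's standard query tools; the machine name is illustrative):

    % condor_status             # summary of machine ads known to the matchmaker
    % condor_status -l vulture  # the full ClassAd of one machine
    % condor_q -l               # the full ClassAds of your own queued jobs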

25
Every Community needs a Matchmaker!
26
Why? Because ...
  • ... someone has to bring together community
    members who have requests for goods and services
    with members who offer them.
  • Both sides are looking for each other
  • Both sides have constraints
  • Both sides have preferences

27
ClassAd - Properties
  • Type = "Machine"
  • Activity = "Idle"
  • KbdIdle = '00:22:31'
  • Disk = 2.1G             // 2.1 Gigs
  • Memory = 64M            // 64 Megs
  • State = "Unclaimed"
  • LoadAverage = 0.042969
  • Arch = "INTEL"
  • OpSys = "SOLARIS251"
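
On the other side of the match, a job advertises its own requirements
and preferences; the matchmaker pairs the two ads only when both ads'
Requirements evaluate to true, then uses Rank to order acceptable
candidates. A minimal sketch (attribute values are illustrative, not
from the original slides):

    Type = "Job"
    Owner = "raman"
    Requirements = (Arch == "INTEL") && (OpSys == "SOLARIS251")
                   && (LoadAverage < 0.3)
    Rank = Memory    // prefer machines with more memory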

28
ClassAd - Policy
  • RsrchGrp = { "raman", "miron", "solomon" }
  • Friends = { "dilbert", "wally" }
  • Untrusted = { "rival", "riffraff", "TPHB" }
  • Tier = member(RsrchGrp, other.Owner) ? 2
        : ( member(Friends, other.Owner) ? 1 : 0 )
  • Requirements = !member(Untrusted, other.Owner)
        && ( Tier == 2 ? True
           : Tier == 1 ? ( LoadAvg < 0.3 && KbdIdle > '00:15' )
           : ( DayTime() < '08:00' || DayTime() > '18:00' ) )

29
Advantages of Matchmaking
  • Hybrid (Centralized/Distributed) resource
    allocation algorithm
  • End-to-end verification
  • Bilateral specialization
  • Weak consistency requirements
  • Authentication
  • Fault tolerance
  • Incremental system evolution

30
Fault-Tolerance
  • Condor can checkpoint a program by writing its
    image to disk.
  • If a machine should fail, the program may resume
    from the last checkpoint.
  • If a job must vacate a machine, it may resume
    from where it left off.
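
Checkpointing applies to re-linked jobs submitted to Condor's standard
universe. A minimal sketch of a submit description file (program and
file names are illustrative):

    # sim.submit - checkpointable jobs via the standard universe
    universe   = standard
    executable = sim
    output     = sim.$(Process).out
    error      = sim.$(Process).err
    log        = sim.log
    queue 100

    % condor_submit sim.submit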

31
Remote Execution
  • Condor might run your jobs on machines spread
    around the world; not all of them will have your
    files.
  • Condor provides an adapter - a library which
    converts your job's I/O operations into remote
    I/O back to your home machine.
  • No matter where your job runs, it sees the same
    environment.
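
The re-linking happens when the job is built (a one-line sketch;
compiler and file names are illustrative):

    % condor_compile gcc -o sim sim.c   # relink with Condor's remote I/O
                                        # (and checkpoint) library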

32
Asynchronicity
  • A fact of life in a system of 1000s of machines.
  • Power on/off
  • Lunch breaks
  • Jobs start and finish
  • Condor never depends on a fixed configuration -
    it works with what is available.

33
Does it work?
34
An example - NUG28
  • We are pleased to announce the exact solution of
    the nug28 quadratic assignment problem (QAP).
    This problem was derived from the well known
    nug30 problem using the distance matrix from a 4
    by 7 grid, and the flow matrix from nug30 with
    the last 2 facilities deleted. This is to our
    knowledge the largest instance from the nugxx
    series ever provably solved to optimality.
  • The problem was solved using the branch-and-bound
    algorithm described in the paper "Solving
    quadratic assignment problems using convex
    quadratic programming relaxations," N.W. Brixius
    and K.M. Anstreicher. The computation was
    performed on a pool of workstations using the
    Condor high-throughput computing system in a
    total wall time of approximately 4 days, 8 hours.
    During this time the number of active worker
    machines averaged approximately 200. Machines
    from UW, UNM, and INFN all participated in the
    computation.

35
NUG30 Personal Condor
  • For the run we will be flocking to
  • -- the main Condor pool at Wisconsin (600
    processors)
  • -- the Condor pool at Georgia Tech (190 Linux
    boxes)
  • -- the Condor pool at UNM (40 processors)
  • -- the Condor pool at Columbia (16 processors)
  • -- the Condor pool at Northwestern (12
    processors)
  • -- the Condor pool at NCSA (65 processors)
  • -- the Condor pool at INFN (200 processors)
  • We will be using glide_in to access the Origin
    2000 (through LSF) at NCSA.
  • We will use "hobble_in" to access the Chiba City
    Linux cluster and the Origin 2000 here at
    Argonne.

36
It works!!!
  • Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT)
  • From: Jeff Linderoth <linderot@mcs.anl.gov>
  • To: Miron Livny <miron@cs.wisc.edu>
  • Subject: Re: Priority
  • This has been a great day for metacomputing!
    Everything is going wonderfully. We've had over
    900 machines (currently around 890), and all the
    pieces are working great.
  • Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
  • From: Jeff Linderoth <linderot@mcs.anl.gov>
  • Still rolling along. Over three billion nodes in
    about 1 day!

37
Up to a Point
  • Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT)
  • From: Jeff Linderoth <linderot@mcs.anl.gov>
  • Hi Gang,
  • The glory days of metacomputing are over. Our job
    just crashed. I watched it happen right before my
    very eyes. It was what I was afraid of -- they
    just shut down denali, and losing all of those
    machines at once caused other connections to time
    out -- and the snowball effect had bad
    repercussions for the Schedd.

38
Back in Business
  • Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT)
  • From: Jeff Linderoth <linderot@mcs.anl.gov>
  • Hi Gang,
  • We are back up and running. And, yes, it took me
    all afternoon to get it going again. There was a
    (brand new) bug in the QAP "read checkpoint"
    information that was making the master coredump.
    (Only with optimization level -O4). I was nearly
    reduced to tears, but with some supportive words
    from Jean-Pierre, I made it through.

39
The First 600K seconds
40
We made it!!!
  • Sender: goux@dantec.ece.nwu.edu
  • Subject: Re: Let the festivities begin.
  • Hi dear Condor Team,
  • you all have been amazing. NUG30 required 10.9
    years of Condor Time. In just seven days!
  • More stats tomorrow!!! We are off celebrating!
  • condor rules!
  • cheers,
  • JP.

41
Do not be picky, be agile!!!