1
High-Throughput Computing on Commodity Systems.
2
The Good News
  • Raw computing power is everywhere - on desktops,
    shelves, racks, and in your pockets. It is
  • Cheap
  • Plentiful
  • Mass-Produced

3
The Bad News
  • GFLOPS per year = GFLOPS per second ×
    ~30,000,000 seconds/year
  • Peak speed sets only the ceiling; the seconds you
    actually harvest set the throughput.
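
To make the scale concrete (a back-of-the-envelope example; the
utilization figure is illustrative, not from the original slides):

    1 GFLOPS peak × 30,000,000 s/year = 30,000,000 GFLOP/year (the ceiling)
    the same machine at 10% average utilization: only 3,000,000 GFLOP/year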

4
A variation on a chestnut
  • What is a benchmark?

5
Answer
  • The throughput which your system is guaranteed
    never to exceed!

6
Why?
  • A community of commodity computers can be
    difficult to manage
  • Dynamic - state and availability change over
    time
  • Evolving - new hardware and software is
    continuously acquired and installed
  • Heterogeneous - hardware and software vary
    across machines
  • Distributed ownership - each machine has a
    different owner with different requirements and
    preferences.

7
Why?
  • Even traditionally static systems (such as
    professionally managed clusters) suffer the same
    problems when viewed at a yearly scale
  • Power failures
  • Hardware failures
  • Software upgrades
  • Load imbalance
  • Network imbalance

8
How do we measure computer performance?
  • High-Performance Computing
  • Achieve max GFLOPS per second under ideal
    circumstances.
  • High-Throughput Computing
  • Achieve max GFLOPS per month or year under
    whatever conditions prevail.

9
High-Throughput Computing
  • Focuses on maximizing
  • simulations run before the paper deadline
  • crystal lattices per week
  • reconstructions per week
  • video frames rendered per year
  • without babysitting from the user.
  • Cannot depend on ideal circumstances.

10
High-Throughput Computing
  • Is achieved by
  • Expanding the set of available CPUs.
  • Silently adapting to inevitable changes.
  • Robust software
  • Is only marginally affected by
  • MB, MHz, MIPS, FLOPS
  • Robust hardware

11
Solution: Condor
  • Condor is software for creating a high-throughput
    computing environment on a community of
    workstations, ranging from commodity PCs to
    supercomputers.

12
Who are we?
13
The Condor Project (Established '85)
  • Distributed systems CS research performed by a
    team that faces
  • software engineering challenges in a
    UNIX/Linux/NT environment,
  • active interaction with users and collaborators,
  • daily maintenance and support challenges of a
    distributed production environment,
  • and educating and training students.
  • Funding - NSF, NASA, DoE, DoD, IBM, INTEL,
    Microsoft, and the UW Graduate School.

14
Users and collaborators
  • Scientists - Biochemistry, high energy physics,
    computer sciences, genetics,
  • Engineers - Hardware design, software building
    and testing, animation, ...
  • Educators - Hardware design tools, distributed
    systems, networking, ...

15
National Grid Efforts
  • National Technology Grid - NCSA Alliance
    (NSF-PACI)
  • Information Power Grid - IPG (NASA)
  • Particle Physics Data Grid - PPDG (DoE)
  • Grid Physics Network - GriPhyN (NSF-ITR)

16
Condor CPUs on the UW Campus
17
Some Numbers: UW-CS Pool
  • 6/98-6/00: 4,000,000 hours (≈ 450 years)
  • Real Users: 1,700,000 hours (≈ 260 years)
    • CS-Optimization: 610,000 hours
    • CS-Architecture: 350,000 hours
    • Physics: 245,000 hours
    • Statistics: 80,000 hours
    • Engine Research Center: 38,000 hours
    • Math: 90,000 hours
    • Civil Engineering: 27,000 hours
    • Business: 970 hours
  • External Users: 165,000 hours (≈ 19 years)
    • MIT: 76,000 hours
    • Cornell: 38,000 hours
    • UCSD: 38,000 hours
    • CalTech: 18,000 hours

18
Start slow, but think BIG
19
Start slow, but think big!
  • One Personal Condor - 1 machine on your desktop
  • Condor Pool - 100 machines in your department
  • Condor-G - 1000 machines in the GRID
20
Start slow, but think big!
  • Personal Condor
  • Manage just your machine with Condor. Fault
    tolerance, policy control, logging. Sleep
    soundly at night.
  • Condor Pool
  • Take advantage of your friends and colleagues:
    share cycles, gain 100x throughput.
  • Condor-G
  • Jobs from your pool migrate to other
    computational facilities around the world. Gain
    1000x throughput. (Record-breaking results!)

21
Key Condor User Services
  • Local control - jobs are stored and managed
    locally by a personal scheduler.
  • Priority scheduling - execution order controlled
    by a priority ranking assigned by the user (see
    the command sketch after this list).
  • Job preemption - re-linked jobs can be
    checkpointed, suspended, held, and resumed.
  • Local execution environment preserved - re-linked
    jobs can have their I/O redirected to the
    submission site.
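
From the command line, these services map onto a handful of tools (a
minimal sketch; job id 42.0 is an illustrative cluster.proc, not from
the slides):

    % condor_prio -p 10 42.0   # raise the user-assigned priority of job 42.0
    % condor_hold 42.0         # take the job out of the run rotation, keep it queued
    % condor_release 42.0      # let it run again
    % condor_q                 # inspect the locally managed queue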

22
More Condor User Services
  • Powerful and flexible means for selecting the
    execution site (requirements and preferences).
  • Logging of job activities.
  • Management of large numbers (10K+) of jobs per
    user.
  • Support for jobs with dependencies - DAGMan
    (Directed Acyclic Graph Manager); see the sketch
    after this list.
  • Support for dynamic master-worker (MW)
    applications (PVM- and file-based).
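
A DAGMan input file names each node's submit file and the dependencies
between them (a minimal sketch; the file names are illustrative):

    # diamond.dag - B and C run after A; D runs after both
    JOB A a.submit
    JOB B b.submit
    JOB C c.submit
    JOB D d.submit
    PARENT A CHILD B C
    PARENT B C CHILD D

The whole graph is submitted with condor_submit_dag diamond.dag;
DAGMan itself runs as a Condor job.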

23
How does it work?
24
Basic HTC Mechanisms
  • Matchmaking - enables requests for services and
    offers to provide services to find each other
    (ClassAds); see the query sketch after this list.
  • Fault tolerance - checkpointing enables
    preemptive-resume scheduling (go ahead and use it
    as long as it is available!).
  • Remote execution - enables transparent access to
    resources from any machine in the world.
  • Asynchronicity - enables management of dynamic
    (opportunistic) resources.
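
Both sides of a match can be inspected from the command line (a sketch
using Condor's standard query tools; the machine name is illustrative):

    % condor_status             # summary of machine ads known to the matchmaker
    % condor_status -l vulture  # the full ClassAd of one machine
    % condor_q -l               # the full ClassAds of your own queued jobs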

25
Every Community needs a Matchmaker!
26
Why? Because ...
  • ... someone has to bring together community
    members who have requests for goods and services
    with members who offer them.
  • Both sides are looking for each other
  • Both sides have constraints
  • Both sides have preferences

27
ClassAd - Properties
  • Type = "Machine"
  • Activity = "Idle"
  • KbdIdle = '00:22:31'
  • Disk = 2.1G             // 2.1 Gigs
  • Memory = 64M            // 64 Megs
  • State = "Unclaimed"
  • LoadAverage = 0.042969
  • Arch = "INTEL"
  • OpSys = "SOLARIS251"
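
On the other side of the match, a job advertises its own requirements
and preferences; the matchmaker pairs the two ads only when both ads'
Requirements evaluate to true, then uses Rank to order acceptable
candidates. A minimal sketch (attribute values are illustrative, not
from the original slides):

    Type = "Job"
    Owner = "raman"
    Requirements = (Arch == "INTEL") && (OpSys == "SOLARIS251")
                   && (LoadAverage < 0.3)
    Rank = Memory    // prefer machines with more memory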

28
ClassAd - Policy
  • RsrchGrp = { "raman", "miron", "solomon" }
  • Friends = { "dilbert", "wally" }
  • Untrusted = { "rival", "riffraff", "TPHB" }
  • Tier = member(RsrchGrp, other.Owner) ? 2
        : ( member(Friends, other.Owner) ? 1 : 0 )
  • Requirements = !member(Untrusted, other.Owner)
        && ( Tier == 2 ? True
           : Tier == 1 ? ( LoadAvg < 0.3 && KbdIdle > '00:15' )
           : ( DayTime() < '08:00' || DayTime() > '18:00' ) )

29
Advantages of Matchmaking
  • Hybrid (Centralized/Distributed) resource
    allocation algorithm
  • End-to-end verification
  • Bilateral specialization
  • Weak consistency requirements
  • Authentication
  • Fault tolerance
  • Incremental system evolution

30
Fault-Tolerance
  • Condor can checkpoint a program by writing its
    image to disk.
  • If a machine should fail, the program may resume
    from the last checkpoint.
  • If a job must vacate a machine, it may resume
    from where it left off.
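
Checkpointing applies to re-linked jobs submitted to Condor's standard
universe. A minimal sketch of a submit description file (program and
file names are illustrative):

    # sim.submit - checkpointable jobs via the standard universe
    universe   = standard
    executable = sim
    output     = sim.$(Process).out
    error      = sim.$(Process).err
    log        = sim.log
    queue 100

    % condor_submit sim.submit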

31
Remote Execution
  • Condor might run your jobs on machines spread
    around the world; not all of them will have your
    files.
  • Condor provides an adapter - a library which
    converts your job's I/O operations into remote
    I/O back to your home machine.
  • No matter where your job runs, it sees the same
    environment.
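
The re-linking happens when the job is built (a one-line sketch;
compiler and file names are illustrative):

    % condor_compile gcc -o sim sim.c   # relink with Condor's remote I/O
                                        # (and checkpoint) library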

32
Asynchronicity
  • A fact of life in a system of 1000s of machines.
  • Power on/off
  • Lunch breaks
  • Jobs start and finish
  • Condor never depends on a fixed configuration -
    it works with what is available.

33
Does it work?
34
An example - NUG28
  • We are pleased to announce the exact solution of
    the nug28 quadratic assignment problem (QAP).
    This problem was derived from the well known
    nug30 problem using the distance matrix from a 4
    by 7 grid, and the flow matrix from nug30 with
    the last 2 facilities deleted. This is to our
    knowledge the largest instance from the nugxx
    series ever provably solved to optimality.
  • The problem was solved using the branch-and-bound
    algorithm described in the paper "Solving
    quadratic assignment problems using convex
    quadratic programming relaxations," N.W. Brixius
    and K.M. Anstreicher. The computation was
    performed on a pool of workstations using the
    Condor high-throughput computing system in a
    total wall time of approximately 4 days, 8 hours.
    During this time the number of active worker
    machines averaged approximately 200. Machines
    from UW, UNM, and INFN all participated in the
    computation.

35
NUG30 Personal Condor
  • For the run we will be flocking to
  • -- the main Condor pool at Wisconsin (600
    processors)
  • -- the Condor pool at Georgia Tech (190 Linux
    boxes)
  • -- the Condor pool at UNM (40 processors)
  • -- the Condor pool at Columbia (16 processors)
  • -- the Condor pool at Northwestern (12
    processors)
  • -- the Condor pool at NCSA (65 processors)
  • -- the Condor pool at INFN (200 processors)
  • We will be using glide_in to access the Origin
    2000 (through LSF) at NCSA.
  • We will use "hobble_in" to access the Chiba City
    Linux cluster and the Origin 2000 here at
    Argonne.

36
It works!!!
  • Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT)
  • From: Jeff Linderoth <linderot@mcs.anl.gov>
  • To: Miron Livny <miron@cs.wisc.edu>
  • Subject: Re: Priority
  • This has been a great day for metacomputing!
    Everything is going wonderfully. We've had over
    900 machines (currently around 890), and all the
    pieces are working great.
  • Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT)
  • From: Jeff Linderoth <linderot@mcs.anl.gov>
  • Still rolling along. Over three billion nodes in
    about 1 day!

37
Up to a Point
  • Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT)
  • From: Jeff Linderoth <linderot@mcs.anl.gov>
  • Hi Gang,
  • The glory days of metacomputing are over. Our job
    just crashed. I watched it happen right before my
    very eyes. It was what I was afraid of -- they
    just shut down denali, and losing all of those
    machines at once caused other connections to time
    out -- and the snowball effect had bad
    repercussions for the Schedd.

38
Back in Business
  • Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT)
  • From: Jeff Linderoth <linderot@mcs.anl.gov>
  • Hi Gang,
  • We are back up and running. And, yes, it took me
    all afternoon to get it going again. There was a
    (brand new) bug in the QAP "read checkpoint"
    information that was making the master coredump.
    (Only with optimization level -O4). I was nearly
    reduced to tears, but with some supportive words
    from Jean-Pierre, I made it through.

39
The First 600K seconds
40
We made it!!!
  • Sender: goux@dantec.ece.nwu.edu
  • Subject: Re: Let the festivities begin.
  • Hi dear Condor Team,
  • you all have been amazing. NUG30 required 10.9
    years of Condor Time. In just seven days!
  • More stats tomorrow!!! We are off celebrating!
  • condor rules!
  • cheers,
  • JP.

41
Do not be picky, be agile!!!