What - PowerPoint PPT Presentation

Slides: 52
Provided by: Tann9

Transcript and Presenter's Notes

Title: What


1
What's new in Condor? What's coming up?
Condor Week 2008
2
Release Situation
  • Stable Series
  • Current: Condor v7.0.1 (Feb 27th 2008)
  • Last Year: Condor v6.8.4 (Feb 5th 2007)
  • Development Series
  • Current: Condor v7.1.0 (April 1st 2008)
  • Last Year: Condor v6.9.2 (April 10th 2007)
  • v6.9 Series: 14 months

4
Special Condor Week Edition
6
How many cores in one new UW Condor cluster rack?
7
New Ports
  • RHEL 5 on x86 and x86_64, with standard universe and glibc 2.5
  • PlayStation 3
  • HP-UX 11i on Itanium (almost done)
  • Cross testing on x86-like platforms
  • Debian clipped port
  • Out with the old:
  • Red Hat Linux 7.x systems on the x86 processor.
  • Digital Unix systems on the Alpha processor.
  • Yellow Dog Linux 3.0 systems on the PPC processor.
  • MacOS 10.3 systems on the PPC processor.

8
Big v7.0 Goodies
  • Scalability Improvements
  • GCB Improvements
  • Privilege Separation
  • New Quill
  • Virtual Machine Universe

9
Scalability
16
Condor's Privilege Separation
  • Apply principle of least privilege to Condor
  • No more root / super-user privilege required
  • Currently completed on execute side
  • Use glexec or Condor's own sudo
  • Can still run the old way if you want

17
Quill Take Two in v7.x
  • Shared databases
  • More than just the JobAd, e.g.:
  • Startd Machine ClassAds
  • Negotiator matches
  • Run Job User Log information
  • More than just PostgreSQL DBMS
  • All the details:
  • http://www.cs.wisc.edu/condor/quill_overview_07-18-2007.pdf

18
Disk
19
Virtual Machine Universe
  • Submit a Job that consists of a virtual machine
    image
  • Condor schedules, manages, and monitors VM job
  • Works w/ VMware Server and Xen
  • Matchmaking
  • Checkpoint/Restart/Migration
  • Data Movement
  • Plug: BoF Session 1:30pm tomorrow

20
What else? GCB Improvements!
23
  • Improved Scalability: Only use the broker if required!
  • Local Host Optimizations
  • Bypass GCB if two daemons are talking on the same host
  • Local Network Optimizations
  • Two hosts on the same private net bypass the broker
  • Every network is assigned a unique network name
  • Daemons advertise (a) publicly accessible IP (b) real IP (c) network name.
  • Names match? Use the real IP. Otherwise, use the public IP.
  • Improved Robustness
  • Broker dies -> master finds another broker and restarts.
  • When the master starts up, it pings a list of brokers and randomly chooses from those that respond.
  • Bug fixes
  • Improved Logging: the logs are now helpful and sane.
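
The address-selection and broker-selection rules on this slide can be sketched in a few lines of Python. This is only an illustration of the logic as stated, not GCB's actual code: the `DaemonAddress` class, its field names, and the `is_alive` callable (standing in for the ping) are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DaemonAddress:
    """What a daemon advertises (class and field names are illustrative)."""
    public_ip: str      # publicly accessible IP, reached via the broker
    real_ip: str        # the daemon's IP on its own network
    network_name: str   # unique name assigned to that network


def choose_target_ip(me: DaemonAddress, peer: DaemonAddress) -> str:
    """Names match: use the peer's real IP. Otherwise use its public IP."""
    if me.network_name == peer.network_name:
        return peer.real_ip
    return peer.public_ip


def choose_broker(brokers, is_alive, rng=random):
    """Pick a random broker among those that respond.

    `is_alive` is a caller-supplied callable standing in for the ping.
    Returns None when no broker responds.
    """
    responders = [b for b in brokers if is_alive(b)]
    return rng.choice(responders) if responders else None
```

With this shape, two daemons on the same private network never touch the broker, which is the scalability point the slide makes.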

24
Process Tracking Guarantee
  • Iron-clad tracking of process groups
  • Even if running as the job submitter
  • Uses supplementary group ids
  • Linux only
  • Also as a standalone-daemon for OSG
  • USE_GID_PROCESS_TRACKING = True
  • MIN_TRACKING_GID = 750
  • MAX_TRACKING_GID = 757
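
One way to picture the GID range above: each running job is handed its own unused supplementary group ID from the configured range, so every process carrying that ID can be attributed to the job. The sketch below assumes that one-GID-per-job model; the `TrackingGidPool` class is hypothetical and does not reflect how Condor implements the feature internally.

```python
class TrackingGidPool:
    """Hands out one unused supplementary group ID per job (illustrative only)."""

    def __init__(self, min_gid: int, max_gid: int):
        if min_gid > max_gid:
            raise ValueError("min_gid must not exceed max_gid")
        self._free = list(range(min_gid, max_gid + 1))
        self._in_use = {}  # job id -> gid

    def acquire(self, job_id: str) -> int:
        """Return the job's tracking GID, allocating one if needed."""
        if job_id in self._in_use:
            return self._in_use[job_id]
        if not self._free:
            raise RuntimeError("no tracking GID available")
        gid = self._free.pop(0)
        self._in_use[job_id] = gid
        return gid

    def release(self, job_id: str) -> None:
        """Return the job's GID to the pool once the job is gone."""
        self._free.append(self._in_use.pop(job_id))
```

Under this assumption, the range 750 to 757 would cover eight concurrently tracked jobs on one machine.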

25
Better Collector Authorization
  • New authorization levels to allow different rules for submission vs. execution
  • ADVERTISE_STARTD, ADVERTISE_SCHEDD
  • New config setting COLLECTOR_REQUIREMENTS: the expression must evaluate to true for the Collector to accept the ad.

26
  • Well-known ports for the trusted daemons
  • Use the ports below if launching the condor_master as root; otherwise, pick 3 ports above 1024.
    MASTER_PORT = 890
    SCHEDD_PORT = 891
    STARTD_PORT = 892
    MASTER_ARGS = -p $(MASTER_PORT)
    SCHEDD_ARGS = -p $(SCHEDD_PORT)
    STARTD_ARGS = -p $(STARTD_PORT)
    COLLECTOR_REQUIREMENTS = \
      ( MyType =?= "Machine" && \
        regexp( "<[0-9.]+:$(STARTD_PORT)>", MyAddress ) ) || \
      ( MyType =?= "Scheduler" && \
        regexp( "<[0-9.]+:$(SCHEDD_PORT)>", MyAddress ) ) || \
      ( MyType =?= "DaemonMaster" && \
        regexp( "<[0-9.]+:$(MASTER_PORT)>", MyAddress ) ) || \
      ( MyType =!= "Machine" && MyType =!= "Scheduler" && \
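
To make the intent of the expression easier to follow, here is a rough Python equivalent of the check: an ad from a trusted daemon type is accepted only if its address carries the matching well-known port. The function name `accepts_ad` is invented for this sketch, and because the slide's expression is cut off in its last clause, the handling of other ad types (accepted here) is an assumption.

```python
import re

# Ports from the slide, keyed by the ClassAd MyType of each trusted daemon.
TRUSTED_PORTS = {"DaemonMaster": 890, "Scheduler": 891, "Machine": 892}

# Addresses look like "<ip:port>"; this sketch matches the whole string.
_ADDRESS = re.compile(r"^<[0-9.]+:(\d+)>$")


def accepts_ad(my_type: str, my_address: str) -> bool:
    """Return True if an ad with this MyType and MyAddress would be accepted."""
    expected_port = TRUSTED_PORTS.get(my_type)
    if expected_port is None:
        # Assumption: ad types other than the three above are not port-restricted.
        return True
    match = _ADDRESS.match(my_address)
    return match is not None and int(match.group(1)) == expected_port
```

Ports below 1024 can only be bound by root, so an ad arriving from one of them indicates the daemon was started by a privileged condor_master.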

27
Handy New Attributes
  • In your machine ad:
  • TotalTimeBackfillBusy, TotalTimeBackfillIdle, TotalTimeBackfillKilling
  • TotalTimeClaimedBusy, TotalTimeClaimedIdle
  • TotalTimeClaimedRetiring, TotalTimeClaimedSuspended
  • TotalTimeMatchedIdle, TotalTimeOwnerIdle
  • TotalTimePreemptingKilling, TotalTimePreemptingVacating, TotalTimeUnclaimedBenchmarking, TotalTimeUnclaimedIdle
  • In your job ad:
  • NumJobStarts
  • NumJobReconnects
  • NumShadowExceptions
  • NumShadowStarts

28
And last but not least
  • Leases added to COD.
  • Simple best-fit algorithm added to dedicated
    scheduler.
  • Can reference resource usage and quota
    information in preemption policy.
  • condor_config_val -dump -v
  • Chirp improvements
  • Jobs can write messages into the user log
  • Can use proc 0 ClassAd as a scratch pad
  • Condor shutdown via expressions
  • External Awareness

29
and finally
  • File Transfer I/O Throttling
  • MAX_CONCURRENT_DOWNLOADS and MAX_CONCURRENT_UPLOADS
  • More types of jobs can survive across a shutdown/crash of the submit machine
  • Such as jobs that stream stdout/err.
  • User job log changes:
  • Can have a centralized job log file.
  • Get values of any job ad attribute in log.
  • Cron-like job scheduling (Crondor?)
  • Job Router shipped (Dan's talk)
  • License Change
  • Source code publicly released on web
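
The file transfer throttling item at the top of this slide boils down to capping how many transfers run at once. A minimal sketch of that idea, using a counting semaphore, is shown below. The `TransferThrottle` class and `run_all` helper are hypothetical names for this example, and the sketch says nothing about how the schedd actually enforces MAX_CONCURRENT_DOWNLOADS or MAX_CONCURRENT_UPLOADS.

```python
import threading


class TransferThrottle:
    """Allows at most `max_concurrent` transfers to run at once (illustrative)."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest concurrency observed so far

    def run(self, transfer, *args):
        """Block until a slot is free, then run `transfer(*args)`."""
        with self._slots:
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return transfer(*args)
            finally:
                with self._lock:
                    self._active -= 1


def run_all(throttle, transfer, items):
    """Start one thread per item; the throttle decides how many proceed."""
    threads = [
        threading.Thread(target=throttle.run, args=(transfer, item))
        for item in items
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

One throttle instance per direction (one for uploads, one for downloads) would mirror the two separate settings named on the slide.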

30
and finally
and before shipping the new stable release
We squashed LOTS of bugs!
32
Shiny new bug free Condor v7.0.x stable series!
33
Enough already, Todd. Tell me about what is
cooking with v7.1.x and beyond.
34
Terms of License: Any and all dates in these
slides are relative from a date hereby
unspecified in the event of a likely situation
involving a frequent condition. Viewing, use,
reproduction, display, modification and
redistribution of these slides, with or without
modification, in source and binary forms, is
permitted only after a deposit by said user into
PayPal accounts registered to Todd Tannenbaum.
35
Generalizing the Startd/Starter Architecture
  • Making the startd more generic with the underlying system.
  • How about running without a starter, running w/o a schedd/shadow, pulling jobs, running starterless jobs that it does not fork/exec, ...
  • Lightweight Jobs
  • Examples:
  • Work Fetch (see Derek's talk)
  • Blue Heron Project (see Tom, Amanda, and Greg's talk)

36
Some Love for Windows
  • Jobs can write to the registry
  • Condor allocates HKEY_CURRENT_USER.
  • Problems w/ the Batch Login approach sessions on Windows Server 2003 fixed (by not using them)
  • Interoperability with Samba (as a PDC) has been improved
  • Arch ClassAd attribute now reflects the wide range of architectures available to the Windows world; it no longer simply returns INTEL

37
Green Computing
  • The startd can place a machine into a low power state (Standby, Hibernate, Soft-Off, etc.).
  • HIBERNATE, HIBERNATE_CHECK_INTERVAL
  • If all slots return non-zero, then the machine is powered down; otherwise it continues running.
  • Machine ClassAd contains all information required for a client to wake it up
  • Condor can wake it up; there is also a standalone tool.
  • This was NOT as easy as it should be.
  • Machines in Offline State
  • Lots of other uses
  • Wake-up on Matchmaking Pressure
  • Future Work ?
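
The power-down rule on this slide (every slot must return non-zero) can be written down as a small predicate. This is a sketch of the stated rule only: the function names and the `evaluate_hibernate` callable, which stands in for evaluating the HIBERNATE expression on one slot, are assumptions, and treating a machine with no slots as "keep running" is a choice made here, not something the slide specifies.

```python
def should_power_down(slot_results):
    """True only if there is at least one slot and every slot returned non-zero."""
    results = list(slot_results)
    return len(results) > 0 and all(value != 0 for value in results)


def check_machine(slots, evaluate_hibernate):
    """One periodic check: evaluate every slot, then apply the rule.

    `evaluate_hibernate` is a caller-supplied callable (hypothetical) that
    returns the numeric result of the hibernate expression for one slot.
    """
    return should_power_down(evaluate_hibernate(slot) for slot in slots)
```

A single busy slot returning zero is enough to keep the whole machine awake, which is why the rule is expressed with `all`.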

38
Plugins
  • Think Firefox
  • Callouts from Condor daemons on appropriate
    events
  • Plugin could re-implement or modify action
    (different than a client API)
  • Will only build as needed, as refactoring happens to add features
  • Miron: "I don't want your plugs, I want new features!"
  • Examples: Collector, Accountant, File Transfers, Scheduling Algorithms, ...

39
Scheduling in Condor Today
(Diagram: two central managers (CM) and five schedds)
  • Distributed Ownership
  • Settings reflect 3 separate viewpoints: Pool manager, Resource Owner, Job Submitter

40
But some sites want to use Condor like this
(Diagram: a single schedd)
  • Just one submission point (schedd)
  • All resources owned by one entity
  • We can do better for these sites.
  • Policy configurations are complicated.
  • Some useful policies are not present because they are hard to do in a wide-area distributed system.
  • Today the dedicated scheduler only supports FIFO and a naive Best Fit algorithm.

41
So what to do?
(Diagram: a single schedd)
  • Give the schedd more scheduling options.
  • Examples: why can't the schedd do priority preemption without the matchmaker's help? Or move jobs from slow to fast claimed resources?
  • Pluggable scheduler routines.

42
DAGMan Improvements
  • Automatic running of rescue DAGs (useful for
    nested DAGs)
  • Significantly improved speed of DAG recovery mode
  • Assignment of node categories and category
    throttles
  • Added generic node priorities: Depth First Traversal algorithm
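
Category throttles and node priorities interact in a simple way: among the nodes that are ready, higher-priority nodes go first, and a node is held back if its category is already at its limit. The sketch below illustrates that interaction under those assumptions. The function name, the tuple layout, and the dictionaries are invented for the example and are not DAGMan's internal data structures.

```python
def pick_nodes_to_submit(ready_nodes, running_per_category, category_limits):
    """Choose which ready nodes to submit now.

    ready_nodes: list of (name, category, priority) tuples.
    running_per_category: category -> number of nodes currently running.
    category_limits: category -> maximum allowed at once; a category that is
    missing from this mapping is treated as unthrottled.
    """
    running = dict(running_per_category)  # copy so the caller's dict is untouched
    chosen = []
    # Highest priority first; sorted() is stable, so ties keep input order.
    for name, category, _priority in sorted(ready_nodes, key=lambda n: -n[2]):
        limit = category_limits.get(category)
        if limit is not None and running.get(category, 0) >= limit:
            continue  # category is at its throttle, leave the node waiting
        running[category] = running.get(category, 0) + 1
        chosen.append(name)
    return chosen
```

Giving deeper nodes higher priority is one way a priority mechanism like this yields depth-first behavior.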

43
DAGMan Depth First Example
44
Category Example
Run < 2
Run < 5
45
DAGMan Future Work
  • DAG Splicing
  • Allowing custom attributes in node ClassAds
  • Fixing condor_hold semantics
  • Configurable job start rate
  • Node iteration

46
DAGMan Future Work
  • Scalability
  • Current potential: about 1 million nodes
  • Future: up to 10 million nodes
  • Submit files which generate more than one cluster

47
EC2 / VM Universe Next Steps: Impregnate Condor
into the Image
  • When? On Demand. How?
  • Job Router, GlideIn Factory, ...
  • File Transfer To/From S3 (Plugin!)
  • Options to handle Amazon's looming threat: NAT only
  • Overlay Network?
  • GCB
  • OpenVPN
  • Communicate by way of S3?

48
Negotiation Performance
  • v6.8 -> automatic significant attributes, match caching
  • v7.1.0 -> resource request ads
  • Simple explanation: Resource request ad = a count plus all significant attributes.
  • Inserted into a schedd submitter ad.
  • "Give me 400 resources like this, and 200 resources like that, etc."
  • The matchmaking algorithm remains the same; only how it learns about jobs changes.
  • Disabled by default.
  • Possibilities, possibilities
  • More robust against unresponsive schedds
  • No startd Rank preemption?
  • Others?
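
The "count plus all significant attributes" idea can be illustrated by grouping jobs that agree on every significant attribute and emitting one request per group. This is a sketch of the concept only: the function name, the `Count` key, and the attribute names used in the usage note are hypothetical, and real ClassAds are not plain dictionaries.

```python
from collections import Counter


def build_resource_requests(jobs, significant_attrs):
    """Collapse jobs into one request per distinct set of significant values.

    jobs: list of dicts (stand-ins for job ads); values must be hashable.
    significant_attrs: attribute names that matter for matchmaking.
    Returns a list of dicts, each holding a Count plus the shared attributes.
    """
    counts = Counter(
        tuple((attr, job.get(attr)) for attr in significant_attrs)
        for job in jobs
    )
    return [{"Count": n, **dict(key)} for key, n in counts.items()]
```

For example, 400 jobs asking for 512 MB and 200 asking for 1024 MB under the same owner would collapse into two requests, so the matchmaker sees two ads instead of 600 jobs.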

50
And
  • The End of the NFS Locking issue
  • Avoid redundant copies of the same executable in
    the Condor spool
  • Maybe more?
  • The Stamping of a Passport
  • End-to-End Security (see Ian's talk)
  • A web site design from this decade.

51
Thank you for being such an awesome audience and
an awesome user community!!!

Jason Stowe, enjoying free bacon at a local pub.
Only in Wisconsin.