Predicting Queue Waiting Time For Individual TeraGrid Jobs

1
Predicting Queue Waiting Time For Individual
TeraGrid Jobs
  • Rich Wolski, Dan Nurmi, John Brevik, Graziano
    Obertelli, Ryan Garver
  • Computer Science Department
  • University of California, Santa Barbara

2
Problem: Predicting Delay in Batch Queues
  • Time in queue is experienced as application delay
  • Sounds like an easy problem, but:
  • Distribution of load from users is a matter of
    some debate
  • Scheduling policy is partially hidden
  • Sites need to change the policies dynamically and
    without warning
  • Job execution times are difficult to predict
  • Much research in this area over the past 20
    years, but few solutions
  • Current commercial systems provide high-variance
    estimates
  • On-line simulation based on max requested time
  • Expected-value predictions
  • Most sites simply disable these features

3
Hard Problem
4
For Scheduling: It's All About the Big Q
  • Predictions of the form:
  • "What is the maximum time my job will wait with
    X% certainty?"
  • "What is the minimum time my job will wait with
    X% certainty?"
  • Requires two estimates if certainty is to be
    quantified
  • Estimate the (1-X) quantile for the distribution
    of availability => Qx
  • Estimate the upper or lower X% confidence bound
    on the statistic Qx => Q(x,b)
  • If the estimates are unbiased, and the
    distribution is stationary, future availability
    duration will be larger than Q(x,b) X% of the
    time, guaranteed

5
Quantiles versus Moments
  • Quantiles permit quantifiable predictions for
    individual jobs
  • "Expectation" in relation to the mean is a
    misnomer => useful for throughput
  • Example: 100 jobs, heavy tail, 6 orders of
    magnitude variation, random order
  • 95 jobs wait 10 seconds
  • 1 job waits 1000 seconds
  • 1 job waits 10000 seconds
  • 1 job waits 100000 seconds
  • 1 job waits 1000000 seconds
  • 1 job waits 10000000 seconds
  • Mean wait time: 111120 seconds
  • The expected value
  • 0.95 quantile: 10 seconds
  • 95% chance job will wait 10 seconds or less
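
The arithmetic in this example can be checked directly. The following is a minimal sketch, assuming the 100 wait times listed on the slide; the helper name empirical_quantile is illustrative and not part of any tool mentioned in the talk. The exact mean is 111119.5 seconds, which the slide rounds to 111120.

```python
import math

# The 100 wait times from the slide: 95 short waits plus a heavy tail.
waits = [10] * 95 + [10**3, 10**4, 10**5, 10**6, 10**7]

def empirical_quantile(samples, q):
    """Smallest sample value v such that at least a fraction q of samples are <= v."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based rank into the sorted data
    return ordered[rank - 1]

mean_wait = sum(waits) / len(waits)   # dominated by the single largest wait
q95 = empirical_quantile(waits, 0.95)  # what 95 of the 100 jobs actually see

print(f"mean wait: {mean_wait} s, 0.95 quantile: {q95} s")
```

The mean is four orders of magnitude larger than the wait experienced by 95 of the 100 jobs, which is why the slide treats it as a throughput measure rather than a prediction for an individual job.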

6
BMBP: A New Predictive Methodology
  • New quantile estimator based on the
    Binomial distribution
  • Requires carefully engineered numerical system to
    deal with large-scale combinatorics
  • New changepoint detector
  • Binomial method in a time series context is
    difficult
  • Need a system for determining:
  • Stationary regions in the data
  • Minimum statistically meaningful history in each
    region
  • New clustering methodology
  • More accurate estimates are possible if
    predictions are made from jobs with similar
    characteristics
  • Takes dynamic policy changes into account more
    effectively
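
One way to picture a binomial quantile bound is the textbook order-statistic argument: if the history is i.i.d. and stationary, the number of observed waits that fall below the true q quantile follows Bin(n, q), so the k-th smallest observation is an upper confidence bound on that quantile with probability P(Bin(n, q) <= k-1). The sketch below is only that textbook construction, not the BMBP implementation, which the slides do not spell out; all function names are hypothetical. The binomial terms are computed in log space as one simple way to avoid overflow in the combinatorics.

```python
from math import exp, lgamma, log

def log_binom_pmf(n, j, q):
    """log P(Bin(n, q) = j), computed in log space to avoid overflow for large n."""
    log_choose = lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
    return log_choose + j * log(q) + (n - j) * log(1 - q)

def upper_bound_rank(n, q, confidence):
    """Smallest 1-based rank k whose order statistic bounds the q quantile from
    above with at least the given confidence, or None if n is too small."""
    cdf = 0.0
    for k in range(1, n + 1):
        # After this step, cdf = P(Bin(n, q) <= k - 1).
        cdf += exp(log_binom_pmf(n, k - 1, q))
        if cdf >= confidence:
            return k
    return None

def quantile_upper_bound(waits, q=0.95, confidence=0.95):
    """Upper confidence bound on the q quantile of wait time (assumes i.i.d. history)."""
    k = upper_bound_rank(len(waits), q, confidence)
    return None if k is None else sorted(waits)[k - 1]

print(upper_bound_rank(58, 0.95, 0.95))   # too little history for a bound
print(upper_bound_rank(59, 0.95, 0.95))   # the largest observation is the bound
```

Under these assumptions, a 95% bound on the 0.95 quantile needs at least 59 observations, which illustrates why a minimum statistically meaningful history has to be determined for each stationary region.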

7
Ten Years of Supercomputing
8
In Action
9
In San Diego
10
Predicting Things Upside Down
  • Deadline scheduling: "My job needs to start in the
    next X seconds for the results to be meaningful."
  • Amitava Mujumdar, Tharaka Devaditha, Adam
    Birnbaum (SDSC)
  • Need to run a 4 minute image reconstruction that
    completes in the next 8 minutes
  • Given a:
  • Machine
  • Queue
  • Processor count
  • Run time
  • Deadline
  • What is the probability that a job will meet the
    deadline?
  • http://nws.cs.ucsb.edu/batchq/invbqueue.php
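
The inverted question can be sketched with a plain empirical estimate: a job meets its deadline if its queue wait is no longer than the deadline minus the run time, so the fraction of comparable historical jobs that waited no longer than that slack approximates the probability. This is an illustrative assumption, not the method behind the web page above; the function name and the sample history are made up for the example.

```python
def probability_of_meeting_deadline(historical_waits, run_time, deadline):
    """Empirical estimate of P(wait <= deadline - run_time) from past wait times.

    historical_waits: wait times (seconds) of comparable past jobs, i.e. same
    machine, queue, and similar processor count and requested run time.
    """
    slack = deadline - run_time  # longest wait that still meets the deadline
    if slack < 0 or not historical_waits:
        return 0.0
    met = sum(1 for wait in historical_waits if wait <= slack)
    return met / len(historical_waits)

# Hypothetical history, in seconds; not data from the talk.
history = [30, 60, 120, 200, 300, 600, 900, 1800]

# The slide's scenario: a 4 minute run that must complete within 8 minutes.
p = probability_of_meeting_deadline(history, run_time=240, deadline=480)
print(f"estimated probability of meeting the deadline: {p}")
```

A production estimate would also attach a confidence bound to this fraction, in the same spirit as the quantile bounds described earlier in the talk.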

11
Making the Deadline
12
In Texas
13
A Day in Urbana
14
A Day in Austin
15
How Well Does it Work with an Application?
[Figure: EMAN processing diagram. Labels: Electron Micrograph, Particles, Preliminary 3D Model, Refine, Final 3D Model]
EMAN has been developed at Baylor College of
Medicine by the research group of Wah Chiu and Steven
Ludtke (wah,sludtke_at_bcm.tmc.edu)
16
VGrADS EMAN Batch Scheduler
  • EMAN emulator
  • Run the EMAN scheduler to determine a job launch
    sequence
  • Launch the jobs by submitting them to the queues
    specified by the scheduler
  • When an EMAN job acquires the processors, exit
    and sleep the emulator for the predicted
    execution time
  • Saves system allocation time
  • Record the overall makespan
  • Experiment
  • Chicago TeraGrid, SDSC TeraGrid, NCSA TeraGrid
    and CNSI Dell at UCSB
  • 57 separate runs
  • Results: mean observed and mean predicted
    makespans are not significantly different at
    alpha = 0.05

17
95% Upper Bound on Median
18
EMAN Turnaround Improvement
19
BMBP versus Weibull and Log-normal
  • Correctness
  • Log-normal fails to achieve the 95% correctness
    target on about half of the historical traces
  • Weibull and BMBP achieve the same correctness
    rate
  • Each gets 51 / 55 traces
  • Small sample sizes hurt both
  • Accuracy
  • Measure the tightness of the bounds in terms of
    the RMS over-prediction error
  • RMS for Weibull is about 1.6 times that for BMBP
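
The two metrics on this slide can be sketched as follows. The slide does not give exact definitions, so this assumes that correctness is the fraction of jobs whose actual wait did not exceed the predicted upper bound, and that the RMS over-prediction error is taken over the jobs where the bound held. The function names and the sample numbers are hypothetical.

```python
import math

def correctness(bounds, actual_waits):
    """Fraction of jobs whose actual wait was at or below the predicted bound."""
    hits = sum(1 for bound, actual in zip(bounds, actual_waits) if actual <= bound)
    return hits / len(actual_waits)

def rms_over_prediction(bounds, actual_waits):
    """RMS of (bound - actual) over jobs where the bound held; smaller is tighter."""
    errors = [bound - actual
              for bound, actual in zip(bounds, actual_waits)
              if actual <= bound]
    if not errors:
        return 0.0
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Made-up predictions and outcomes, in seconds.
bounds = [100, 100, 100, 100]
actual = [40, 100, 70, 130]

print(correctness(bounds, actual))          # share of bounds that held
print(rms_over_prediction(bounds, actual))  # how loose the correct bounds were
```

Under these definitions a method is judged first on whether it reaches the correctness target, and only then on how tight its bounds are, which matches the order of the bullets above.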

20
Clustering
  • RMS ratio of BMBP with Clustering to without
  • Both achieve 95% correctness
  • Measures additional tightness improvement
    through clustering

21
The Software
  • Requires no special privileges
  • Predictions are better and burn-in shorter if
    scheduler logs are available => retrofit the log
    history
  • Version 1 -- available now
  • NWS sensors run at each site
  • Prediction software runs at UCSB
  • Command-line tools and web page connect to UCSB
  • Stable, but does not support clustering
  • Version 2 -- alpha version
  • Supports automatic clustering
  • Prediction software can be run locally or at UCSB
  • Command-line tools locally or at UCSB
  • Web support at UCSB only
  • No packaging
  • Version 3 -- end of the year

22
Batch Queue Prediction for Grid Systems
  • A good point-valued prediction remains elusive
  • "Expectation" sounds attractive but is really a
    misnomer
  • Grid users certainly can use bounds instead
  • Early job completion is okay, typically
  • Bounds give a good intuitive feel for which queue
    will be quickest
  • Deployment and integration underway
  • CDF FermiLab working (barely)
  • Condor integration
  • UCLA Grid tools
  • Automatic schedulers are coming
  • EMAN doesn't use ranges -- it should
  • VGrADS is developing new schedulers (workflow)
  • NEESGrid and ISI are in development (workflow)
  • LEAD integration is underway (workflow)
  • Large-scale sensor network simulation

23
What's Next?
  • Open questions
  • Does the availability of predictions affect load?
  • Rolling out production tools now and we will be
    monitoring
  • Job cancellation does not affect results
  • If it does, will allocations be stable?
  • Grid economies
  • Reservations must be integrated
  • Virtual resource reservations (VGrADS)
  • Conditional prediction and resubmission
  • Virtual Cluster??
  • Thanks
  • NSF SCI, VGrADS, SDSC, TACC, NCSA, Argonne
  • rich_at_cs.ucsb.edu