Pick up the Pieces - PowerPoint PPT Presentation

Provided by: lronhu
Transcript and Presenter's Notes

Title: Pick up the Pieces


1
  • Pick up the Pieces
  • Average White Band

2
  • Modeling Resource Availability in Federated,
    Globally Distributed Computing Environments
  • Rich Wolski
  • Dan Nurmi
  • University of California, Santa Barbara
  • John Brevik
  • Wheaton College

3
The Network is the Computer
  • Goal: Conglomerate the computational power
    available from a collection of internetworked
    computers to support application execution
  • Parallel computing: performance through
    concurrency
  • Distributed computing: functionality through
    resource sharing
  • Can the best of both worlds be combined?

4
The Computational Grid
  • Vision: Application programs plug into a
    software environment to draw computational
    power from a dynamically changing pool of
    resources (Foster, Kesselman, et al., 1998).
  • Electrical Power Grid analogy
  • Power generation facilities ↔ computers
  • Household appliances ↔ application programs
  • Scale to national and international levels
  • Federated System
  • Grid users (both power producers and application
    consumers) must be able to join and leave the
    Grid at will.
  • Local control supersedes global control
  • Can we build it?

5
Many Challenges, No Waiting
  • Heterogeneity
  • Machines, networks, software, administrative
    policies all vary
  • Dynamism
  • Loads, performance, and availability change with
    time
  • Programmability
  • Complex and dynamically changing system
  • Security
  • Maintenance

6
Explorations
  • Performance: How can programs extract high
    performance levels given that the resource pool
    is heterogeneous and dynamically changing?
  • The Network Weather Service
  • On-line performance monitoring and prediction

7
Fortune Telling
  • Grid resource performance varies dynamically
  • Machines, networks and storage systems are shared
    by competing applications
  • Federation
  • Either the system or the application itself must
    tolerate performance variation
  • Dynamic scheduling
  • Scheduling requires a prediction of future
    performance levels
  • What performance level will be deliverable?

8
The Network Weather Service
  • A distributed, robust, and adaptive system that
  • monitors the performance that is available from
    distributed resources
  • forecasts future performance levels using fast
    statistical techniques
  • delivers forecasts dynamically, on the fly
  • Portable and extensible
  • Measures, forecasts, and reports performance at
    the application level, and end-to-end
  • Designed to make and deliver forecasts on-line to
    application schedulers and resource allocators
  • Compatible with Globus, Legion, Condor, NetSolve,
    NINF, etc.

9
Logical Architecture
  • [Diagram] Components shown: network, network
    sensor, machine, CPU sensor, memory sensor,
    Name Service, Sensor Control, proxy caches

10
Skepticism
  • Is it really possible to predict future
    performance levels?
  • Self-similarity
  • Non-stationarity
  • With what accuracy?
  • For how long into the future?
  • NWS: On-line, semi-non-parametric time series
    techniques
  • Use running tabulation of forecast error to
    choose between competing forecasters
  • Bandwidth, latency, CPU load, available memory,
    battery power
  • Is it possible to predict resource availability
    and failure?
  • Durations do not fit the time series model well
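
The "running tabulation of forecast error" idea can be sketched in a few lines of Python. This is a minimal illustration, not the NWS implementation: the three forecasters, their names, and the helper `adaptive_forecast` are assumptions chosen for the example, and the history is replayed from scratch rather than updated incrementally as an on-line system would do.

```python
from statistics import mean, median

# Three simple competing forecasters (hypothetical choices for this sketch).
def last_value(history):
    return history[-1]

def running_mean(history):
    return mean(history)

def sliding_median(history, window=5):
    return median(history[-window:])

FORECASTERS = {
    "last_value": last_value,
    "running_mean": running_mean,
    "sliding_median": sliding_median,
}

def adaptive_forecast(series):
    """Replay `series`, tally each forecaster's absolute error, and
    forecast the next value with the forecaster that has the lowest tally.

    Returns (best_name, forecast, errors)."""
    errors = {name: 0.0 for name in FORECASTERS}
    for i in range(1, len(series)):
        history = series[:i]
        for name, forecaster in FORECASTERS.items():
            # Each forecaster predicts series[i] from the values before it.
            errors[name] += abs(forecaster(history) - series[i])
    best = min(errors, key=errors.get)
    return best, FORECASTERS[best](series), errors
```

For a series with a level shift, such as `[1, 1, 1, 1, 9, 9, 9, 9, 9, 9]`, the last-value forecaster accumulates the least error and is the one chosen to forecast the next measurement.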

11
For Example
12
Sample Based Techniques
  • Each measurement is modeled as a sample from a
    random variable
  • Time invariant
  • IID (independent, identically distributed)
  • Stationary (IID forever)
  • Well studied in the literature
  • Exponential distributions
  • Compose well
  • Memoryless
  • Popular in database, fault-tolerance, and P2P
    communities
  • Pareto distributions
  • Potentially related to self-similarity
  • Heavy-tailed, implying non-predictability
  • Popular in networking, Internet, and Dist. System
    communities
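
A short Python sketch can make "memoryless" concrete by comparing the conditional survival probability P(X > t + s | X > t) for the two families. The function names and the parameter values in the usage note are illustrative assumptions, not values from the talk.

```python
import math

def exp_survival(x, rate):
    """P(X > x) for an exponential distribution with the given rate."""
    return math.exp(-rate * x)

def pareto_survival(x, alpha, x_min):
    """P(X > x) for a Pareto distribution with shape alpha and minimum x_min."""
    return 1.0 if x < x_min else (x_min / x) ** alpha

def conditional_survival(survival, t, s):
    """P(X > t + s | X > t) = S(t + s) / S(t) for a survival function S."""
    return survival(t + s) / survival(t)
```

With a rate of 0.01 and s = 100, `conditional_survival(lambda x: exp_survival(x, 0.01), t, 100)` is exp(-1), about 0.368, for every t: the time already survived carries no information. For the Pareto, the same quantity grows with t, so a machine that has been up longer is more likely to stay up.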

13
Why not Weibull?
  • Proposed originally by Waloddi Weibull in 1939
  • PDF: $f(x) = \frac{a}{b}\left(\frac{x-c}{b}\right)^{a-1} e^{-\left(\frac{x-c}{b}\right)^{a}}$
  • a is the shape parameter, a > 0
  • b is the scale parameter, b > 0
  • c is the location parameter, in (-inf, inf)
  • Used extensively in reliability engineering
  • Modeling lifetime distributions
  • Modeling extreme values in bounded cases
  • Not memoryless
  • $F(t+s \mid X > t) \neq F(s)$
  • Maximum Likelihood Estimation (MLE) of parameters
    is hard
  • Requires solution to non-linear system of
    equations or optimization problem
  • Sensitive to numerical stability of numerical
    algorithms
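
As a rough illustration of why the MLE needs a numerical solver, the sketch below fits a two-parameter Weibull, assuming the location c is fixed at 0 and every sample is positive. Here a is the exponent in the PDF (the shape) and b is the divisor (the scale). The function name `weibull_mle`, the bisection bounds, and the tolerance are assumptions for this example; it is not the software built for the project.

```python
import math

def weibull_mle(samples, lo=1e-3, hi=100.0, tol=1e-10):
    """MLE of shape a and scale b for a Weibull with location c = 0.

    Assumes all samples are > 0, are not all equal, and that the shape
    lies between lo and hi."""
    n = len(samples)
    logs = [math.log(x) for x in samples]
    mean_log = sum(logs) / n
    x_max = max(samples)

    def score(a):
        # Likelihood equation for the shape. Weights are scaled by x_max
        # so that x**a cannot overflow for large a.
        weights = [(x / x_max) ** a for x in samples]
        weighted_log = sum(w * lx for w, lx in zip(weights, logs))
        return weighted_log / sum(weights) - 1.0 / a - mean_log

    # score(a) is increasing in a, so bisection converges to its single root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    # Given the shape, the scale has a closed form.
    b = x_max * (sum((x / x_max) ** a for x in samples) / n) ** (1.0 / a)
    return a, b
```

Scaling by the largest sample is one simple guard against the numerical-stability problems mentioned above; a production fitter would also need to handle the location parameter and censored observations.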

14
UCSB Student Computing Labs
  • Approximately 85 machines running Red Hat Linux
    located in three separate buildings
  • Open to all Computer Science graduate and
    undergraduate students
  • Only graduate students have building keys
  • Power switch is not protected
  • Anyone with physical access to the machine can
    reboot it by power cycling it
  • Students routinely clean off competing users or
    intrusive processes to gain better performance
    response
  • NWS deployed and monitoring duration between
    restarts
  • Can we model the time-to-reboot?

15
Project
  • Goal
  • Predict availability of user machines
  • Motivation
  • Scheduling of distributed programs
  • User satisfaction
  • Project
  • Build a system for predicting machine reboots at
    UCSB
  • Evaluate the effectiveness of the system using
    synthetic and real user workloads

16
Approach
  • Measure availability as lifetime
  • Student lab at UCSB
  • Develop new NWS availability sensors
  • Test using data from fault-tolerance community
    for checkpointing research
  • Predicting optimal checkpoint
  • Develop robust software for MLE parameter
    estimation
  • Fit Exponential, Pareto, and Weibull
    distributions
  • Compare the fits
  • Visually
  • Goodness of fit tests
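
Unlike the Weibull, the exponential and Pareto fits have closed-form maximum likelihood estimates. The sketch below shows them together with the log-likelihoods that could be used for a numerical comparison alongside the visual one. The function names are assumptions for this example, and the Pareto fit assumes the samples are positive and not all equal.

```python
import math

def fit_exponential(samples):
    """MLE of the exponential rate: 1 / sample mean."""
    return len(samples) / sum(samples)

def fit_pareto(samples):
    """MLE of a Pareto: x_min is the sample minimum, then alpha in closed form."""
    x_min = min(samples)
    alpha = len(samples) / sum(math.log(x / x_min) for x in samples)
    return alpha, x_min

def exponential_loglik(samples, rate):
    """Log-likelihood under the density rate * exp(-rate * x)."""
    return sum(math.log(rate) - rate * x for x in samples)

def pareto_loglik(samples, alpha, x_min):
    """Log-likelihood under the density alpha * x_min**alpha / x**(alpha + 1)."""
    return sum(
        math.log(alpha) + alpha * math.log(x_min) - (alpha + 1) * math.log(x)
        for x in samples
    )
```

A higher log-likelihood on the same data indicates a better fit among the candidates, though it does not by itself say whether any candidate fits well; that is the role of the goodness-of-fit tests.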

17
Milestones
  • Week 1
  • Project Planning
  • Week 2
  • Begin parameter fitting software development
  • Complete exponential, start Pareto and Weibull
  • Begin NWS Availability sensor development
  • Week 3
  • Complete sensor development and deploy at UCSB
  • Complete parameter fitting software
  • Week 4
  • Test parameter fitting software with community
    data set
  • Begin goodness-of-fit software development

18
Milestones (continued)
  • Week 5
  • Complete goodness-of-fit software development
  • Test goodness-of-fit system using community data
    set
  • Week 6
  • Test full model fitting suite with preliminary
    UCSB data
  • Begin NWS integration
  • Week 7
  • Continue NWS integration
  • Week 8
  • Complete NWS integration
  • Begin experimental evaluation using live UCSB
    data
  • Week 9
  • Generate preliminary UCSB results
  • Week 10
  • Complete experimental evaluation and prepare
    report

19
UCSB Availability Data
20
UCSB Empirical CDF
21
MLE Weibull Fit to UCSB Data
22
Comparing Fits at UCSB
23
Goodness of Fit
  • Kolmogorov-Smirnov (K-S) Goodness-of-Fit Test
  • P-values averaged over 1000 subsamples, each size
    100
  • Weibull 0.36
  • Exponential 2 × 10^-5
  • Pareto 5 × 10^-4
  • Anderson-Darling (A-D) Goodness-of-Fit Test
  • P-values averaged over 1000 subsamples, each size
    100
  • Weibull 0.07
  • Exponential 0
  • Pareto 0
  • At a 0.05 significance level, reject null
    hypothesis for both Exponential and Pareto.
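
The subsampling procedure (1000 subsamples of size 100, p-values averaged) can be sketched as follows for the K-S test. This is an assumption-laden illustration, not the authors' code: the p-value uses the standard asymptotic Kolmogorov series with a common small-sample correction, the hypothesized CDF is passed in already fitted, and the names `ks_statistic`, `ks_pvalue`, and `mean_subsample_pvalue` are made up for the example.

```python
import math
import random

def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the hypothesized CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def ks_pvalue(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))

def mean_subsample_pvalue(samples, cdf, subsamples=1000, size=100, seed=0):
    """Average the K-S p-value over random subsamples drawn without replacement."""
    rng = random.Random(seed)
    pvalues = [
        ks_pvalue(ks_statistic(rng.sample(samples, size), cdf), size)
        for _ in range(subsamples)
    ]
    return sum(pvalues) / subsamples
```

For example, `mean_subsample_pvalue(durations, lambda x: 1 - math.exp(-rate * x))` would score a fitted exponential, where `durations` and `rate` stand for the measured lifetimes and the fitted parameter. Averaging over fixed-size subsamples keeps the test from rejecting every model simply because the full data set is large.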

24
Stationarity (K-S of 0.29)
25
Similarity of Fits
26
Without Censor Repair (K-S of 0.13)
27
Condor
  • Cycle harvesting system (M. Livny, U. Wisconsin)
  • Workstations in a pool run the (trusted) Condor
    daemons
  • Each owner agrees to contribute a machine by
    installing and running Condor
  • Condor users submit job-control scripts to a
    batch queue
  • When a machine becomes idle, Condor schedules a
    waiting job
  • Machine owners specify what "idle" and "busy"
    mean
  • When a machine running a Condor job becomes
    busy:
  • Job is checkpointed and requeued (standard
    universe)
  • Job is terminated (vanilla universe)
  • NWS sensor uses vanilla universe and records
    process lifetime
  • Unknown and constantly changing number of
    workstations in UWisc Condor Pool (> 500)
  • 210 machines used by Condor for NWS sensor

28
Condor Weibull Fit
29
Comparing Condor Fits
30
Condor Goodness
  • Kolmogorov-Smirnov (K-S) Goodness-of-Fit Test
  • P-values averaged over 1000 subsamples, each size
    100
  • Weibull 0.07
  • Exponential 0
  • Pareto 0
  • Anderson-Darling (A-D) Goodness-of-Fit Test
  • P-values averaged over 1000 subsamples, each size
    100
  • Weibull 0
  • Exponential 0
  • Pareto 0
  • At a 0.05 significance level, reject null
    hypothesis for both Exponential and Pareto under
    K-S, but all three under A-D.

31
Long, Muir, Golding Internet Survey (1995)
  • 1170 Hosts across the Internet in 1995
  • Use response to rpc.statd (NFS daemon) as
    heartbeat
  • Long, Muir, Golding (UCSC, HP Labs) investigated
    exponentials as models for
  • Availability time
  • Downtime
  • Plank and Elwasif (UTK, 1998) and Plank and
    Thomason (UTK, 2000) use data and exponentials as
    basis for checkpoint interval determination
  • All researchers conclude that data is not well
    modeled by exponentials
  • No plausible distribution determined

32
Weibull Again
33
If the Weibull Fits
  • Kolmogorov-Smirnov (K-S) Goodness-of-Fit Test
  • P-values averaged over 1000 subsamples, each size
    100
  • Weibull 0.41
  • Exponential 0.001
  • Pareto 0.0005
  • Anderson-Darling (A-D) Goodness-of-Fit Test
  • P-values averaged over 1000 subsamples, each size
    100
  • Weibull 0.13
  • Exponential 0
  • Pareto 0
  • At a 0.05 significance level, reject null
    hypothesis for both Exponential and Pareto.

34
Wear It
  • Three different availability surveys under three
    different sets of circumstances
  • UCSB Student Labs
  • Adversarial chaos
  • U. Wisc Condor Pool
  • Background cycle harvesting
  • Internet host survey
  • Convolution of host and network availability
    circa 1995
  • In all three cases an MLE-fit Weibull is, by far,
    the best model
  • Visual and GOF evidence
  • Uncharacteristically, the assumptions for the
    model seem to hold
  • Stationarity and Independence

35
What Does it All Mean?
  • If a continuous, closed form distribution is
    needed to model machine availability in federated
    distributed systems, a Weibull is probably the
    best choice
  • Empirical evidence from different scenarios makes
    bias unlikely
  • Weibulls were invented to model lifetimes
  • Who should care?
  • Grid simulators
  • Availability is critical
  • P2P systems
  • Oceanstore, CAN, TAPESTRY, etc. all assume
    exponential distributions in their proofs
  • Replication systems
  • It does not mean that Weibulls are best for
    predicting availability
  • Next time

36
Thanks
  • Miron Livny and the Condor group at the
    University of Wisconsin
  • Darrell Long (UCSC) and James Plank (UTK)
  • UCSB Facilities Staff
  • NSF and DOE
  • nurmi_at_cs.ucsb.edu, jbrevik_at_wheatonma.edu,
    rich_at_cs.ucsb.edu