Title: Latency as a Performability Metric for Internet Services
1Latency as a Performability Metric for Internet
Services
- Pete Broadwell
- pbwell_at_cs.berkeley.edu
2Outline
- Performability background/review
- Latency-related concepts
- Project status
- Initial test results
- Current issues
3Motivation
- A goal of the ROC project: develop metrics to evaluate new recovery techniques
- Problem: the basic concept of availability assumes a system is either "up" or "down" at a given time
- "Nines" only describe the fraction of uptime over a certain interval
4Why Is Availability Insufficient?
- Availability doesn't describe the durations or frequencies of individual outages
- Both can strongly influence user perception of the service, as well as revenue
- Availability doesn't capture a system's capacity to support degraded service
- degraded performance during failures
- reduced data quality during high load (Web)
5What is performability?
- Combination of performance and dependability measures
- Classical definition: a probabilistic (model-based) measure of a system's "ability to perform" in the presence of faults [1]
- Concept from the traditional fault-tolerant systems community, ca. 1978
- Has since been applied to other areas, but still not in widespread use
[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
6Performability Example
Discrete-time Markov chain (DTMC) model of a RAID-5 disk array [1]
[1] Hannu H. Kari, Ph.D. Thesis, Helsinki University of Technology, 1997
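As a rough illustration of how a DTMC model can yield a performability figure, the sketch below uses a three-state chain (normal, degraded, failed) with a throughput "reward" attached to each state. The states, transition probabilities, and reward values are invented for illustration and are not taken from Kari's thesis.

```python
# Hypothetical three-state DTMC for a disk array; all numbers are
# illustrative assumptions, not values from the cited thesis.
STATES = ["normal", "degraded", "failed"]

# P[i][j] = probability of moving from state i to state j in one time step.
P = [
    [0.990, 0.010, 0.000],  # normal: a disk occasionally fails
    [0.200, 0.795, 0.005],  # degraded: rebuild completes or a second disk fails
    [0.100, 0.000, 0.900],  # failed: the array is eventually restored
]

# Reward = throughput (I/O operations/sec) delivered in each state (assumed).
REWARD = [1000.0, 600.0, 0.0]


def steady_state(matrix, iterations=10_000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist


def expected_reward(dist, reward):
    """Performability expressed as the long-run expected reward rate."""
    return sum(p * r for p, r in zip(dist, reward))


pi = steady_state(P)
performability = expected_reward(pi, REWARD)
availability = 1.0 - pi[STATES.index("failed")]
```

Under these assumed numbers, availability only reports the fraction of time spent outside the failed state, while the expected reward also accounts for the reduced throughput delivered in the degraded state.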
7Visualizing Performability
[Figure: throughput (I/O operations/sec) plotted against time]
8Metrics for Web Services
- Throughput: requests/sec
- Latency: render time, time to first byte
- Data quality
- harvest (response completeness)
- yield (% of queries answered) [1]
[1] E. Brewer, Lessons from Giant-Scale Internet Services, 2001
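The two data-quality metrics can be sketched as simple ratios. The function names and the example counts below are hypothetical; only the definitions (harvest as response completeness, yield as the fraction of queries answered) come from the slide.

```python
def harvest(data_returned, data_complete):
    """Fraction of the complete data that is reflected in a response."""
    if data_complete <= 0:
        raise ValueError("data_complete must be positive")
    return data_returned / data_complete


def yield_fraction(queries_answered, queries_offered):
    """Fraction of offered queries that were answered."""
    if queries_offered <= 0:
        raise ValueError("queries_offered must be positive")
    return queries_answered / queries_offered


# Hypothetical example: one of four data partitions is unavailable,
# and 950 of 1000 offered queries receive an answer.
h = harvest(data_returned=3, data_complete=4)
y = yield_fraction(queries_answered=950, queries_offered=1000)
```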
9Applications of Metrics
- Modeling the expected failure-related performance of a system, prior to deployment
- Benchmarking the performance of an existing system during various recovery phases
- Comparing the reliability gains offered by different recovery strategies
10Related Projects
- HP: Automating Data Dependability
- uses "time to data access" as one objective for storage systems
- Rutgers: PRESS/Mendosus
- evaluated throughput of the PRESS server during injected failures
- IBM: Autonomic Storage
- Numerous ROC projects
11Arguments for Using Latency as a Metric
- Originally, performability metrics were meant to capture end-user experience [1]
- Latency better describes the experience of an end user of a web site
- response time > 8 sec leads to site abandonment and lost income [2]
- Throughput describes the raw processing ability of a service
- best used to quantify expenses
[1] J. F. Meyer, Performability Evaluation: Where It Is and What Lies Ahead, 1994
[2] Zona Research and Keynote Systems, The Need for Speed II, 2001
12Current Progress
- Using the Mendosus fault injection system on a 4-node PRESS web server (both from Rutgers)
- Running latency-based performability tests on the cluster
- Inject faults during load test
- Record page-load times before, during and after faults
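One way to organize the recorded page-load times is to bucket them by test phase. This is a minimal sketch, assuming each sample is a (timestamp, latency) pair and that the fault and repair times are known; the function names and sample values are made up and do not reflect the actual test harness.

```python
from statistics import mean


def split_by_phase(samples, fault_time, repair_time):
    """Group (timestamp, latency) samples into before/during/after phases."""
    phases = {"before": [], "during": [], "after": []}
    for timestamp, latency in samples:
        if timestamp < fault_time:
            phases["before"].append(latency)
        elif timestamp < repair_time:
            phases["during"].append(latency)
        else:
            phases["after"].append(latency)
    return phases


def summarize(phases):
    """Mean page-load time per phase (None if a phase has no samples)."""
    return {name: (mean(vals) if vals else None) for name, vals in phases.items()}


# Made-up samples: (seconds since test start, page-load time in seconds).
samples = [(0, 0.14), (10, 0.15), (20, 0.90), (30, 1.10), (40, 0.16), (50, 0.14)]
summary = summarize(split_by_phase(samples, fault_time=15, repair_time=35))
```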
13Test Setup
[Diagram: test clients, emulated switch, PRESS web server, Mendosus; label: Page]
- Normal version: cooperative caching
- HA version: cooperative caching + heartbeat monitoring
14Effect of Component Failure on Performability
Metrics
[Figure: performability metrics (throughput and latency) plotted against time, with FAILURE and REPAIR events marked]
15Observations
- Below saturation, throughput depends more on load than latency does
- Above saturation, latency depends more on load
[Figure: time axis with annotated operating points: throughput 3/s, latency 0.14 s; throughput 6/s, latency 0.14 s; throughput 7/s, latency 0.4 s]
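The annotated values can be compared as relative changes between consecutive points. The sketch below assumes the three points are listed in order of increasing load, which the slide does not state explicitly.

```python
# Operating points as annotated on the slide: (throughput in req/s, latency in s).
# Assumption: they are ordered by increasing offered load.
points = [(3, 0.14), (6, 0.14), (7, 0.40)]


def relative_change(old, new):
    """Fractional change from old to new (1.0 means a 100% increase)."""
    return (new - old) / old


# One (throughput change, latency change) pair per consecutive pair of points.
changes = [
    (relative_change(t0, t1), relative_change(l0, l1))
    for (t0, l0), (t1, l1) in zip(points, points[1:])
]
```

Between the first two points throughput doubles while latency stays flat; between the last two, throughput rises by about a sixth while latency nearly triples.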
16How to Represent Latency?
- Average response time over a given time period
- Make a distinction between render time and time to first byte?
- Deviation from baseline latency
- Impose a greater penalty for deviations toward longer wait times?
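The candidate representations above could be sketched as follows. The asymmetric weighting (the hypothetical `slow_weight` parameter) is only one possible way to penalize longer waits more heavily, and the sample window is invented.

```python
from statistics import mean


def average_latency(latencies):
    """Average response time over one measurement window."""
    return mean(latencies)


def penalized_deviation(latencies, baseline, slow_weight=2.0):
    """Mean absolute deviation from baseline, weighting slow responses more.

    slow_weight is an assumed tuning knob: deviations above the baseline
    are multiplied by it, deviations below the baseline count once.
    """
    total = 0.0
    for latency in latencies:
        delta = latency - baseline
        total += slow_weight * delta if delta > 0 else -delta
    return total / len(latencies)


# Made-up window of response times in seconds, with a 0.14 s baseline.
window = [0.10, 0.14, 0.14, 0.40]
avg = average_latency(window)
score = penalized_deviation(window, baseline=0.14)
```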
17Response Time with Load Shedding Policy
[Figure: response time (sec) plotted against time, with the abandonment threshold (8 s) and the load-shedding threshold marked, along with FAILURE and REPAIR events]
18Load Shedding Issues
- Load shedding means returning 0 data quality: a different kind of performability metric
- To combine load shedding and latency, define a "demerit" system
- Such systems quickly lose generality, however
- "Server too busy" msg: 3 demerits
- 8 sec response time: 1 demerit/sec
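A demerit system along these lines might look like the sketch below. The 3-demerit and 1-demerit/sec values come from the slide; charging only the time beyond the 8-second threshold is an assumption, as are the function name and the sample requests.

```python
ABANDONMENT_THRESHOLD_S = 8.0  # abandonment threshold from the slides
BUSY_DEMERITS = 3.0            # charge for a "Server too busy" message
DEMERITS_PER_SECOND = 1.0      # rate applied to slow responses


def demerits(response_time_s=None, shed=False):
    """Score one request; shed=True means a 'Server too busy' reply."""
    if shed:
        return BUSY_DEMERITS
    # Assumption: only the time beyond the threshold is charged.
    overtime = max(0.0, response_time_s - ABANDONMENT_THRESHOLD_S)
    return DEMERITS_PER_SECOND * overtime


# Made-up requests: one fast, one slow, one shed.
requests = [
    {"response_time_s": 0.2},
    {"response_time_s": 10.5},
    {"shed": True},
]
total = sum(demerits(**r) for r in requests)
```

The constants are specific to one service and one user population, which illustrates why such scoring schemes lose generality quickly.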
19Further Work
- Collect more experimental results!
- Compare throughput and latency-based results of normal and high-availability versions of PRESS
- Evaluate usefulness of "demerit" systems to describe the user experience (latency and data quality)