Synthesizing Representative I/O Workloads for TPC-H

1
Synthesizing Representative I/O Workloads for
TPC-H
  • J. Zhang, A. Sivasubramaniam,
  • H. Franke, N. Gautam, Y. Zhang, S. Nagar
  • Pennsylvania State University
  • IBM T.J. Watson
  • Rutgers University

2
Outline
  • Motivation
  • Related Work
  • Methodology
  • Arrival Time
  • Access Pattern
  • Request Sizes
  • Accuracy of synthetic traces
  • Concluding Remarks

3
Motivation
  • I/O subsystems are critical for commercial
    services and in production environments.
  • Real applications are essential for system design
    and evaluation.
  • TPC-H is a decision-support workload for business
    enterprises.

4
Disadvantages of Traces
  • Not easily obtainable
  • Can be very large
  • Difficult to get statistical confidence
  • Very difficult to change workload behavior
  • Do not isolate the influence of one parameter
  • On the other hand, a deeper understanding of the
    workload can:
  • Help generate a synthetic workload
  • Help in system design itself

5
What do we need to synthesize?
  • Inter-arrival times (temporal behavior) of disk
    block requests.
  • Access pattern (spatial behavior) of blocks being
    referenced
  • Size (volume) of each I/O request.

6
Related work
  • Scientific Application I/O behavior
  • Time-series models for arrivals
  • Sequentiality/Markov models for access pattern
  • Commercial/production workloads
  • Self-similar arrival patterns
  • Sequentiality in TPC-H/TPC-D
  • No prior complete synthesis of all three
    attributes for TPC-H

7
Our TPC-H Workload
  • Trace Collection Platform
  • IBM Netfinity 8-way SMP with 2.5GB memory and 15
    disks
  • Linux 2.4.17
  • DB2 UDB EE V7.2
  • TPC-H Configuration
  • Power Run of 22 queries
  • Partitioning tables across the disks
  • 30 GB dataset

8
Validation
Original I/O traces
Identify characteristics
Generate synthetic traces
DiskSim 2.0
Metrics
  • RMS: root-mean-square error of differences
    between two CDF curves
  • nRMS = RMS/m, where m is the average response time
    for the original trace
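
The two metrics above can be sketched in a few lines of Python. The slide does not say whether the difference between the CDF curves is measured along the time axis or the probability axis. Because nRMS divides by a mean response time, this sketch assumes the time axis, i.e. it compares quantiles of the two response-time samples at evenly spaced probability levels. The function names and the `levels` parameter are illustrative, not taken from the presentation.

```python
from math import sqrt

def quantile(sorted_data, p):
    """Nearest-rank quantile of an already sorted sample, 0 <= p < 1."""
    idx = min(len(sorted_data) - 1, int(p * len(sorted_data)))
    return sorted_data[idx]

def rms_between_cdfs(original, synthetic, levels=100):
    """RMS of quantile differences between two response-time samples.

    Assumption: the CDF curves are compared along the time axis.
    """
    a, b = sorted(original), sorted(synthetic)
    probs = [(i + 0.5) / levels for i in range(levels)]
    squared = [(quantile(a, p) - quantile(b, p)) ** 2 for p in probs]
    return sqrt(sum(squared) / levels)

def nrms(original, synthetic, levels=100):
    """nRMS = RMS / m, with m the mean response time of the original trace."""
    m = sum(original) / len(original)
    return rms_between_cdfs(original, synthetic, levels) / m
```

Two identical samples give an RMS of zero; a synthetic sample shifted by a constant gives an RMS equal to that constant.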

9
Overall Methodology
  • Arrival pattern characteristics
  • Investigate correlations
  • Time series
  • Self-similar
  • iid distributions
  • Access pattern characteristics
  • Sequentiality/pseudo sequentiality/randomness
  • Size characteristics
  • Investigating correlations between time, space
    and volume to get final synthesis

10
Arrival pattern
  • Statistical analysis
  • Auto-correlation function (ACF) plots
  • Show the correlation between the current
    inter-arrival time and the one x steps away
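
As a rough illustration of what an ACF plot summarizes, the sample autocorrelation of an inter-arrival series can be computed as below. This is a minimal sketch using the standard biased estimator; the function names are illustrative, and the series is assumed to have non-zero variance.

```python
def inter_arrival_times(arrivals):
    """Differences between successive arrival timestamps."""
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator).

    Assumes the series is not constant, otherwise the variance is zero.
    """
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[i] - mean) * (series[i + k] - mean) for i in range(n - k)) / var
        for k in range(max_lag + 1)
    ]
```

Values close to zero at every lag greater than zero are what would support treating the inter-arrival times as uncorrelated.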

11
  • Correlations seem very weak (< 0.15 for 12
    queries, and < 0.30 for the rest)
  • Errors with time series models
    (AR/MA/ARIMA/ARFIMA) are high
  • No evidence of self-similarity either
  • Perhaps iid (independent and identically
    distributed) is not a bad assumption.

12
  • Fitting distributions
  • Tried hyper-exponential/normal/Pareto
  • Used Maximum Likelihood Estimator (normal/Pareto)
    and Expectation Maximization (hyper-exponential)
    to estimate distribution parameters
  • Used the Kolmogorov-Smirnov (K-S) test to measure
    goodness-of-fit
  • Maximum distance between fitted distribution and
    original CDF was ensured to be less than 0.1
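
The fitting step can be illustrated with the two closed-form cases. The sketch below computes maximum likelihood fits for the normal and Pareto distributions and the K-S distance between a sample and a fitted CDF. The hyper-exponential fit via Expectation Maximization is omitted. Function names are illustrative; the Pareto fit assumes strictly positive data that are not all equal.

```python
from math import log, sqrt
from statistics import NormalDist

def fit_normal_mle(data):
    """MLE for a normal distribution; returns the fitted CDF."""
    n = len(data)
    mu = sum(data) / n
    sigma = sqrt(sum((x - mu) ** 2 for x in data) / n)
    return NormalDist(mu, sigma).cdf

def fit_pareto_mle(data):
    """MLE for a Pareto distribution; returns the fitted CDF.

    Assumes all values are positive and not all identical.
    """
    xm = min(data)
    alpha = len(data) / sum(log(x / xm) for x in data)
    return lambda x: 1 - (xm / x) ** alpha if x >= xm else 0.0

def ks_distance(data, cdf):
    """Largest vertical gap between the empirical CDF and a fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    return max(
        max(abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
        for i, x in enumerate(xs)
    )
```

A fit would then be accepted under the slide's criterion when `ks_distance(data, fitted_cdf) < 0.1`.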

13
Comparing CDF of fitted distribution and data

14
Access Pattern (Location & Size)
  • Most studies use sequentiality to describe TPC-H
  • However, this is not always the case.

[Figure: three panels, each plotting request location against arrival time]
  • Category 1: Q10, Q4, Q14
  • Category 2: Q12, Q1, Q3, Q5, Q7, Q8, Q15, Q18, Q19, Q21
  • Category 3: Q20, Q9, Q17

15
Category 1: Intermingling sequential streams
  • Consider the following
  • Run: a strictly sequential set of I/O requests
  • Stream: a pseudo-sequential set of I/O requests
    that could be interrupted by another stream.
  • i.e. a stream could have several runs that are
    interrupted by runs of other streams.
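
To make the definition of a run concrete, the sketch below splits a trace into runs. It assumes a trace is a list of `(start_sector, size_in_sectors)` pairs in arrival order, and that "strictly sequential" means each request starts exactly where the previous one ended. Both the trace layout and the function name are assumptions made for illustration.

```python
def split_into_runs(trace):
    """Group consecutive requests into runs.

    trace: list of (start_sector, size_in_sectors) in arrival order.
    A request extends the current run only if it starts exactly where
    the previous request ended (assumed meaning of strictly sequential).
    """
    runs = []
    for start, size in trace:
        if runs and runs[-1][-1][0] + runs[-1][-1][1] == start:
            runs[-1].append((start, size))
        else:
            runs.append([(start, size)])
    return runs

example = [(0, 8), (8, 8), (100, 8), (16, 8), (24, 8)]
runs = split_into_runs(example)
```

In `example`, the first and third runs would belong to the same stream: the request at sector 16 continues where the first run stopped, after being interrupted by the single request at sector 100.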

16
Run and Stream
[Figure: an example run of 5 requests, a pseudo-sequential stream of 4 requests, and an example trace]

17
Secondary Attributes
  • Run Length: number of requests in a run
  • Run Start Location: start sector of a run
  • Stream Length: number of requests in a stream
  • Inter-stream Jump Distance: spatial separation
    between start of run and previous request
  • Intra-stream Jump Distance: spatial separation
    between successive requests within a stream
  • Number of active streams (at any instant)
  • Interference Distance: number of requests between
    2 successive requests in a stream
  • Derive empirical distributions for these from the
    trace
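
Deriving an empirical distribution for any of these attributes, and drawing from it, might look like the following. This is a generic sketch, not the authors' tooling. The helper names are illustrative, and the observed values (for example, the run lengths measured in a trace) are assumed to be available as a list.

```python
import random
from collections import Counter

def empirical_distribution(values):
    """Map each observed value to its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def sample(dist, k, rng=random):
    """Draw k values according to an empirical distribution."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=k)

# Hypothetical run lengths observed in a trace.
run_length_dist = empirical_distribution([2, 2, 3, 5])
```

Passing a seeded `random.Random` instance as `rng` makes the synthetic values reproducible.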

18
Location Synthesis - Q10 (time and size from trace)
  • LocIID: locations are i.i.d.
  • LocRUN: incorporate run length distribution and
    run start location distribution.
  • LocSTREAM: combine all stream and run statistics.
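
A LocRUN-style generator could be sketched as follows. Request sizes come from the trace, as on the slide; a run start and a run length are drawn from the observed values, and each request in a run starts where the previous one ended. The function name and arguments are illustrative, and drawing uniformly from the lists of observed values stands in for sampling the empirical distributions. Run lengths are assumed to be at least 1.

```python
import random

def loc_run(sizes, run_lengths, run_starts, rng=random):
    """Assign a start location to every request, run by run.

    sizes:       request sizes taken from the trace, in order
    run_lengths: observed run lengths (each >= 1)
    run_starts:  observed run start sectors
    """
    locations = []
    while len(locations) < len(sizes):
        length = rng.choice(run_lengths)
        loc = rng.choice(run_starts)
        for _ in range(length):
            if len(locations) == len(sizes):
                break
            locations.append(loc)
            # The next request in the run begins where this one ends.
            loc += sizes[len(locations) - 1]
    return locations
```

A LocIID variant under the same assumptions would reduce to drawing every location independently from the observed locations.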

19
Request Size
  • Requests are one of:
  • 64, 128, 192, 256, 320, 384, 448, 512 blocks
  • But attributes (location, size, time) are not
    independent!

20
Correlations between size and location
[Figure: fraction of requests]

21
Correlations between size and time

22
Correlations between location and time

23
Final Synthesis Methodology (Category 1)
  • Location: use LocSTREAM to generate start
    locations. Two kinds of requests: a run start
    request or a request within a run.
  • Time: use Pr(inter-arrival time | run start
    requests) and Pr(inter-arrival time | within a
    run requests) to generate times.
  • Size:
  • For run start requests, use Pr(size |
    inter-arrival times of run start requests) to
    generate sizes.
  • For within a run requests, use Pr(size | within a
    run requests) to generate sizes.
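
The conditional sampling described above might be sketched like this. The sketch assumes each trace record has already been labelled as a run start (`"start"`) or a within-run request (`"within"`), and it bins inter-arrival times with a fixed `bin_width` so that sizes can be conditioned on them. The labels, the binning, and the function names are all assumptions made for illustration; the slides do not say how the conditioning was implemented.

```python
import random
from collections import defaultdict

def build_conditionals(records, bin_width):
    """records: (kind, inter_arrival, size), kind is 'start' or 'within'."""
    times = defaultdict(list)        # Pr(inter-arrival time | kind)
    start_sizes = defaultdict(list)  # Pr(size | inter-arrival bin), run starts
    within_sizes = []                # Pr(size | within-run request)
    for kind, gap, size in records:
        times[kind].append(gap)
        if kind == "start":
            start_sizes[int(gap // bin_width)].append(size)
        else:
            within_sizes.append(size)
    return times, start_sizes, within_sizes

def synthesize(kinds, times, start_sizes, within_sizes, bin_width, rng=random):
    """Generate (inter_arrival, size) for a sequence of request kinds."""
    out = []
    for kind in kinds:
        gap = rng.choice(times[kind])
        if kind == "start":
            # Size depends on the inter-arrival time just drawn.
            size = rng.choice(start_sizes[int(gap // bin_width)])
        else:
            size = rng.choice(within_sizes)
        out.append((gap, size))
    return out
```

Every request kind passed to `synthesize` must have been observed at least once in `records`, otherwise there is nothing to draw from.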

24
  • Can be easily adapted for Category 2 (strictly
    sequential) and Category 3 (random) queries.
  • Validation: compare the response time
    characteristics of the synthesized and real traces.

25
Validation of CDF of response times (Category 1)

26
Validation of CDF of response times (Category 2)

27
Validation of CDF of response times (Category 3)

28
Storage Requirements
[Figure: two panels with axes storage fraction (x0.001) and nRMS]

29
Contributions
  • A synthesis methodology to capture
  • Inter-mingling streams of requests
  • Exploiting correlations between request
    attributes
  • An application of this methodology to TPC-H
  • Along the way (for TPC-H),
  • iid can capture arrival time characteristics
  • Strict sequentiality is not always the case

30
Backup slides

31
Validating arrival time synthesis

32
LocSTREAM
  • Use Pr(stream length) to generate stream lengths.
  • Use Pr(run length | stream length) to generate
    run lengths for each stream length.
  • Generate start location for each run
  • Use Pr(inter-stream jump dist.) to generate
    the start location of the first run in the
    stream.
  • Use Pr(intra-stream jump distance | this
    stream) to generate the start locations of the
    other runs in this stream.
  • Use Pr(interference distance) to interleave all
    streams.
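
The per-stream part of these steps can be sketched as below; interleaving the streams by interference distance is left out. The sketch simplifies in several ways that are assumptions rather than details from the slides: all requests share one fixed `size`, the intra-stream jump is measured from the end of the previous run, run lengths are at least 1, and the first run's start location (`first_start`) is supplied by the caller, who would derive it from the previous request and an inter-stream jump distance. All names are illustrative.

```python
import random

def generate_stream(stream_len, run_lens_for, first_start, intra_jumps, size,
                    rng=random):
    """Return start locations for one stream of `stream_len` requests.

    run_lens_for: dict mapping a stream length to observed run lengths,
                  standing in for Pr(run length | stream length)
    intra_jumps:  observed intra-stream jump distances for this stream
    """
    locations, loc, remaining = [], first_start, stream_len
    while remaining > 0:
        run_len = min(rng.choice(run_lens_for[stream_len]), remaining)
        for _ in range(run_len):
            locations.append(loc)
            loc += size  # strictly sequential within a run
        remaining -= run_len
        if remaining > 0:
            loc += rng.choice(intra_jumps)  # jump to the next run's start
    return locations
```

With a single observed run length of 2 and a single jump of 16 sectors, a 5-request stream starting at sector 1000 with 8-sector requests comes out as three runs: two of length 2 and a final one of length 1.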