Title: On the Interaction Between Commercial Workloads and Memory Systems in High-Performance Servers
1. On the Interaction Between Commercial Workloads and Memory Systems in High-Performance Servers
Per Stenström, Department of Computer Engineering, Chalmers, Göteborg, Sweden
http://www.ce.chalmers.se/pers
- Fredrik Dahlgren, Magnus Karlsson, and Jim Nilsson
- in collaboration with Sun Microsystems and Ericsson Research
2. Motivation
- Database applications dominate (32%)
- Yet, the major focus is on scientific/engineering applications (16%)
3. Project Objective
- Design principles for high-performance memory systems for emerging applications
- Systems considered
  - high-performance compute nodes
  - SMP and DSM systems built out of them
- Applications considered
  - Decision support and on-line transaction processing
  - Emerging applications
    - Computer graphics
    - video/sound coding/decoding
    - handwriting recognition
    - ...
4. Outline
- Experimental platform
- Memory system issues studied
  - Working set size in DSS workloads
  - Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  - Coherence issues in OLTP workloads
- Concluding remarks
5. Experimental Platform
- Platform enables
  - Analysis of commercial workloads
  - Analysis of OS effects
  - Tracking architectural events to the OS or application level
6. Outline
- Experimental platform
- Memory system issues studied
  - Working set size in DSS workloads
  - Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  - Coherence issues in OLTP workloads
- Concluding remarks
7. Decision-Support Systems (DSS)
- Compile a list of matching entries in several database relations
Will moderately sized caches suffice for huge databases?
8. Our Findings
- MWS: footprint of instructions and private data to access a single tuple
  - typically small (< 1 Mbyte) and not affected by database size
- DWS: footprint of database data (tuples) accessed across consecutive invocations of the same scan node
  - typically small impact (0.1%) on overall miss rate
9. Methodological Approach
- Challenges
  - Not feasible to simulate huge databases
  - Need source code: we used PostgreSQL and MySQL
- Approach
  - Analytical model using
    - parameters that describe the query
    - parameters measured on downscaled query executions
    - system parameters
10. Footprints and Reuse Characteristics in DSS
- MWS: instructions, private data, and metadata
  - can be measured on a downscaled simulation
- DWS: all tuples accessed at lower levels
  - can be computed based on query composition and probability of match
11. Analytical Model: An Overview
- Goal: predict miss rate versus cache size for fully associative caches with an LRU replacement policy for single-processor systems
- Number of cold misses
  - size of footprint / block size
  - MWS is measured
  - DWS_i is computed from parameters describing the query (size of relations, probability of matching a search criterion, index versus sequential scan, etc.)
- Number of capacity misses for tuple access at level i
  - $CM_0 \left(1 - \frac{C - C_0}{MWS - C_0}\right)$ if $C_0 <$ cache size $< MWS$
  - size of tuple / block size if $MWS <$ cache size $< MWS + DWS_i$
- Number of accesses per tuple: measured
- Total number of misses and accesses: computed
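The piecewise capacity-miss rule above can be written as a small function. This is only a sketch of the slide's formula as read here: all parameter names are hypothetical, `cache` is taken to be the cache size C, and the behaviour outside the two stated ranges (cache no larger than C_0, or at least MWS + DWS_i) is an assumption, not something the slides state.

```python
def cold_misses(footprint_bytes, block_bytes):
    """Cold misses for a footprint: size of footprint / block size."""
    return footprint_bytes / block_bytes


def capacity_misses_per_tuple(cache, c0, cm0, mws, dws_i, tuple_bytes, block_bytes):
    """Capacity misses per tuple access at level i (all sizes in bytes)."""
    if cache <= c0:
        # Assumption: below C_0 the miss count stays at CM_0.
        return cm0
    if cache < mws:
        # Linear decay from CM_0 at C_0 down to zero at MWS.
        return cm0 * (1 - (cache - c0) / (mws - c0))
    if cache < mws + dws_i:
        # MWS fits, DWS_i does not: each tuple is re-fetched block by block.
        return tuple_bytes / block_bytes
    # Assumption: once MWS + DWS_i fits, no capacity misses remain.
    return 0.0


if __name__ == "__main__":
    for cache in (8, 520, 2048, 8192):
        print(cache, capacity_misses_per_tuple(cache, 16, 100.0, 1024, 4096, 128, 32))
```

With these made-up values the function steps through the four regimes: a plateau at CM_0, a linear decay, a flat tuple-size term, and zero once both footprints fit.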
12. Model Validation
- Goal
  - Prediction accuracy for queries with different compositions
    - Q3, Q6, and Q10 from TPC-D
  - Prediction accuracy when scaling up the database
    - parameters measured at 5 Mbytes used to predict 200-Mbyte databases
  - Robustness across database engines
    - Two engines: PostgreSQL and MySQL
13. Model Predictions: Miss Rates for Huge Databases
- Miss rate due to instructions, private data, and metadata decays rapidly (128 Kbytes)
- Miss rate component for database data is small
- What's in the tail?
14. Outline
- Experimental platform
- Memory system issues studied
  - Working set size in DSS workloads
  - Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  - Coherence issues in OLTP workloads
- Concluding remarks
15. Cache Issues for Linked Data Structures
- Traversal of lists may exhibit poor temporal locality
- Results in chains of data-dependent loads, called pointer-chasing
- Pointer-chasing shows up in many interesting applications
  - 35% of the misses in OLTP (TPC-B)
  - 32% of the misses in an expert system
  - 21% of the misses in Raytrace
16. SW Prefetch Techniques to Attack Pointer-Chasing
- Greedy Prefetching (G)
  - Shortcoming: computation per node < latency
- Jump Pointer Prefetching (J)
  - Shortcoming: short list or traversal not known a priori
- Prefetch Arrays (P.(S/H))
  - Generalization of G and J that addresses the above shortcomings
  - Trade memory space and bandwidth for more latency tolerance
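To make the trade-offs among the three techniques concrete, here is a toy timing model of a linked-list traversal. It is a sketch under simplifying assumptions that are not from the slides: every node misses in the cache, prefetches cost no instructions and never contend for bandwidth, and a prefetch array is modelled as a per-list array that covers the first `distance` nodes, with jump pointers covering the rest. The function name and parameters are hypothetical.

```python
def traversal_time(n, latency, work, scheme, distance=4):
    """Cycles to traverse n list nodes under a simple prefetch model.

    scheme: "none", "greedy" (prefetch the next node), "jump" (prefetch the
    node `distance` hops ahead), or "prefetch_array" (jump pointers plus an
    array that prefetches the first `distance` nodes when traversal starts).
    """
    ready = {}  # node index -> time at which its prefetched data arrives
    t = 0
    if scheme == "prefetch_array":
        for i in range(min(distance, n)):
            ready[i] = t + latency  # issued in parallel before the traversal
    for i in range(n):
        if i in ready:
            t = max(t, ready[i])  # wait only for the remaining latency
        else:
            t += latency          # demand miss: full memory latency
        # A node's pointers are usable only after the node itself has arrived.
        if scheme == "greedy":
            target = i + 1
        elif scheme in ("jump", "prefetch_array"):
            target = i + distance
        else:
            target = None
        if target is not None and target < n:
            ready.setdefault(target, t + latency)
        t += work                 # computation performed on this node
    return t


if __name__ == "__main__":
    for scheme in ("none", "greedy", "jump", "prefetch_array"):
        print(scheme, traversal_time(8, 100, 10, scheme))
```

For 8 nodes, a latency of 100 and 10 cycles of work per node, the model gives 880, 810, 540 and 240 cycles respectively. Greedy prefetching hides only the small amount of computation per node, jump pointers still pay full latency on the first `distance` nodes, and on a list shorter than `distance` they give no benefit at all, which matches the shortcomings listed above.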
17. Results: Hash Tables and Lists in Olden
- Prefetch arrays do better because
  - MST has short lists and little computation per node
  - They prefetch data for the first nodes in HEALTH, unlike jump pointer prefetching
18. Results: Tree Traversals in OLTP and Olden
- Hardware-based prefetch arrays do better because
  - Traversal path is not known in DB.tree (depth-first search)
  - Data for the first nodes is prefetched in Tree.add
19. Other Results in Brief
- Impact of longer memory latencies
  - Robust for lists
  - For trees, prefetch arrays may cause severe cache pollution
- Impact of memory bandwidth
  - Performance improvements are sustained for bandwidths of typical high-end servers (2.4 Gbytes/s)
  - Prefetch arrays may suffer for trees: severe contention was observed on low-bandwidth systems (640 Mbytes/s)
- Node insertion and deletion for jump pointers and prefetch arrays
  - Results in instruction overhead (-). However,
  - insertion/deletion is sped up by prefetching (+)
20. Outline
- Experimental platform
- Memory system issues studied
  - Working set size in DSS workloads
  - Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  - Coherence issues in OLTP workloads
- Concluding remarks
21. Coherence Issues in OLTP
- Favorite protocol: write-invalidate
- Ownership overhead: invalidations cause write stalls and invalidation traffic
22. Ownership Overhead in OLTP
- Simulation setup
  - CC-NUMA with 4 nodes
  - MySQL, TPC-B, 600 MB database
- 40% of all ownership transactions stem from load/store sequences
23. Techniques to Attack Ownership Overhead
- Dynamic detection of migratory sharing
  - detects two load/store sequences by different processors
  - only a subset of all load/store sequences (40% in OLTP)
- Static detection of load/store sequences
  - compiler algorithms that tag a load followed by a store and bring an exclusive block into the cache
  - poses problems in TPC-B
24. New Protocol Extension
- Criterion
  - on a load miss from processor i followed by a global store from i, tag the block as Load/Store
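The tagging criterion can be illustrated with a toy write-invalidate simulation that counts global stores (ownership transactions). The slide gives only the criterion. That a tagged block is then handed out exclusively on later load misses is an assumption made here, borrowed from the static technique on the previous slide. The class and function names are hypothetical, and de-tagging is not modelled.

```python
class Block:
    """Directory state for one memory block in this toy model."""
    def __init__(self):
        self.sharers = set()          # processors holding a shared copy
        self.exclusive = None         # processor holding the exclusive copy
        self.load_store = False       # tagged as Load/Store
        self.last_load_misser = None  # processor of the most recent load miss


def count_ownership_requests(trace, use_extension):
    """Count global stores for a trace of (proc, "load" | "store", addr)."""
    blocks = {}
    requests = 0
    for proc, op, addr in trace:
        b = blocks.setdefault(addr, Block())
        if op == "load":
            if proc == b.exclusive or proc in b.sharers:
                continue  # load hit
            if use_extension and b.load_store:
                # Assumption: a tagged block is supplied in exclusive state.
                b.sharers = set()
                b.exclusive = proc
            else:
                if b.exclusive is not None:
                    b.sharers.add(b.exclusive)  # previous owner is downgraded
                    b.exclusive = None
                b.sharers.add(proc)
            b.last_load_misser = proc
        else:
            if b.exclusive == proc:
                continue  # already exclusive: the store completes locally
            requests += 1  # global store: ownership request plus invalidations
            if b.last_load_misser == proc:
                b.load_store = True  # criterion: load miss then global store
            b.sharers = set()
            b.exclusive = proc
    return requests


if __name__ == "__main__":
    # Four processors in turn load and then store the same block.
    trace = [(p, op, 0) for p in range(4) for op in ("load", "store")]
    print(count_ownership_requests(trace, use_extension=False))
    print(count_ownership_requests(trace, use_extension=True))
```

On this trace the baseline issues one ownership request per processor (4 in total). With the extension only the first load/store sequence pays for one, because every later load miss already returns an exclusive copy.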
25. Concluding Remarks
- Focus on DSS and OLTP has revealed challenges not exposed by traditional applications
  - Pointer-chasing
  - Load/store optimizations
- Application scaling is not fully understood
  - Our work on combining simulation with analytical modeling shows some promise