On the Interaction Between Commercial Workloads and Memory Systems in High-Performance Servers

About This Presentation

Description:

On the Interaction Between Commercial Workloads and Memory ... Operating system (Linux) Application. CPU. Sparc V8. Memory. Interrupt. TTY. SCSI. Ethernet ...

Slides: 26

Provided by: ulla87

Learn more at: https://research.ac.upc.edu

Transcript and Presenter's Notes

1
On the Interaction Between Commercial Workloads
and Memory Systems in High-Performance Servers
Per Stenström
Department of Computer Engineering, Chalmers, Göteborg, Sweden
http://www.ce.chalmers.se/pers

Fredrik Dahlgren, Magnus Karlsson, and Jim Nilsson
in collaboration with
Sun Microsystems and Ericsson Research

2
Motivation

Database applications dominate (32%)
Yet, the major focus is on scientific/engineering applications (16%)

3
Project Objective

Design principles for high-performance memory systems for emerging applications
Systems considered:
  high-performance compute nodes
  SMP and DSM systems built out of them
Applications considered:
  decision support and on-line transaction processing
  emerging applications:
    computer graphics
    video/sound coding/decoding
    handwriting recognition
    ...

4
Outline

Experimental platform
Memory system issues studied:
  Working set size in DSS workloads
  Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  Coherence issues in OLTP workloads
Concluding remarks

5
Experimental Platform

Platform enables:
  Analysis of commercial workloads
  Analysis of OS effects
  Tracking architectural events to the OS or application level

6
Outline

Experimental platform
Memory system issues studied:
  Working set size in DSS workloads
  Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  Coherence issues in OLTP workloads
Concluding remarks

7
Decision-Support Systems (DSS)

Compile a list of matching entries in several database relations

Will moderately sized caches suffice for huge databases?

8
Our Findings

MWS: footprint of instructions and private data needed to access a single tuple
  typically small (< 1 Mbyte) and not affected by database size
DWS: footprint of database data (tuples) accessed across consecutive invocations of the same scan node
  typically small impact (0.1%) on overall miss rate

9
Methodological Approach

Challenges:
  Not feasible to simulate huge databases
  Need source code: we used PostgreSQL and MySQL
Approach:
  Analytical model using
    parameters that describe the query
    parameters measured on downscaled query executions
    system parameters

10
Footprints and Reuse Characteristics in DSS

MWS: instructions, private data, and metadata
  can be measured on a downscaled simulation
DWS: all tuples accessed at lower levels
  can be computed based on query composition and probability of a match

11
Analytical Model: An Overview

Goal: predict miss rate versus cache size for fully associative caches with an LRU replacement policy, for single-processor systems

Number of cold misses:
  size of footprint / block size
  MWS is measured
  DWSi is computed from parameters describing the query (size of relations, probability of matching a search criterion, index versus sequential scan, etc.)
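Number of capacity misses for tuple access at level i, where C is the cache size:
\[
\begin{cases}
CM_0 \left( 1 - \dfrac{C - C_0}{MWS - C_0} \right) & \text{if } C_0 < C < MWS \\[1ex]
\dfrac{\text{size of tuple}}{\text{block size}} & \text{if } MWS < C < MWS + DWS_i
\end{cases}
\]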
Number of accesses per tuple: measured
Total number of misses and accesses: computed
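
The piecewise expression for capacity misses can be evaluated directly. The following is a minimal sketch in C, not the authors' implementation: the function and parameter names are hypothetical, CM0 is assumed to be the miss count observed at the smallest cache size C0, and the interval boundaries (which the slide leaves open) are treated as half-open. Cache sizes outside the two stated ranges are not covered by the slide, so the sketch returns -1 for them.

```c
#include <assert.h>

/* Cold misses: each block of the footprint is fetched once. */
double cold_misses(double footprint_bytes, double block_bytes)
{
    assert(block_bytes > 0.0);
    return footprint_bytes / block_bytes;
}

/*
 * Capacity misses for a tuple access at level i.
 *   C0 <= C < MWS          : CM0 * (1 - (C - C0) / (MWS - C0))
 *   MWS <= C < MWS + DWSi  : size of tuple / block size
 * Returns -1.0 outside these ranges (not specified on the slide).
 */
double capacity_misses(double c, double c0, double cm0,
                       double mws, double dws_i,
                       double tuple_bytes, double block_bytes)
{
    assert(c0 < mws && block_bytes > 0.0);
    if (c >= c0 && c < mws)
        return cm0 * (1.0 - (c - c0) / (mws - c0));
    if (c >= mws && c < mws + dws_i)
        return tuple_bytes / block_bytes;
    return -1.0; /* assumption: sentinel for "outside the model" */
}
```

In the first range the miss count falls linearly from CM0 to zero as the cache grows to hold the whole MWS; in the second range only the tuple itself has to be fetched.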

12
Model Validation

Goal:
  Prediction accuracy for queries with different compositions
    Q3, Q6, and Q10 from TPC-D
  Prediction accuracy when scaling up the database
    parameters measured at 5 Mbytes used to predict for 200-Mbyte databases
  Robustness across database engines
    two engines: PostgreSQL and MySQL

13
Model Predictions: Miss Rates for Huge Databases

Miss rate due to instructions, private data, and metadata decays rapidly (128 Kbytes)
Miss rate component for database data is small
What's in the tail?

14
Outline

Experimental platform
Memory system issues studied:
  Working set size in DSS workloads
  Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  Coherence issues in OLTP workloads
Concluding remarks

15
Cache Issues for Linked Data Structures

Traversal of lists may exhibit poor temporal locality
Results in chains of data-dependent loads, called pointer-chasing

Pointer-chasing shows up in many interesting applications:
  35% of the misses in OLTP (TPC-B)
  32% of the misses in an expert system
  21% of the misses in Raytrace
16
SW Prefetch Techniques to Attack Pointer-Chasing

Greedy Prefetching (G)
  - falls short when computation per node < latency

Jump Pointer Prefetching (J)
  - falls short for short lists or when the traversal is not known a priori

Prefetch Arrays (P.(S/H))
  Generalization of G and J that addresses the above shortcomings
  - trades memory space and bandwidth for more latency tolerance
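
The three techniques can be sketched for a singly linked list as follows. This is an illustrative software-only sketch, not the authors' code: all names and the prefetch distance DIST are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin standing in for whatever prefetch instruction the target provides.

```c
#include <assert.h>
#include <stddef.h>

#define DIST 3 /* hypothetical prefetch distance, in nodes */

struct pnode {
    struct pnode *next;
    struct pnode *jump; /* J: node DIST hops ahead, or NULL */
    long key;
};

struct plist {
    struct pnode *head;
    struct pnode *first[DIST]; /* P: pointers to the first DIST nodes */
};

/* One pass that fills in the jump pointers and the prefetch array. */
void build_prefetch_info(struct plist *l)
{
    assert(l != NULL);
    struct pnode *lag = l->head; /* trails cur by DIST nodes */
    size_t i = 0;
    for (size_t k = 0; k < DIST; k++)
        l->first[k] = NULL;
    for (struct pnode *cur = l->head; cur != NULL; cur = cur->next, i++) {
        cur->jump = NULL;
        if (i < DIST) {
            l->first[i] = cur;
        } else {
            lag->jump = cur;
            lag = lag->next;
        }
    }
}

/* G: prefetch only the immediate successor. This hides little latency
 * when the work per node is shorter than a memory access. */
long sum_greedy(const struct pnode *head)
{
    long sum = 0;
    for (const struct pnode *n = head; n != NULL; n = n->next) {
        __builtin_prefetch(n->next);
        sum += n->key;
    }
    return sum;
}

/* P on top of J: the prefetch array covers the first DIST nodes,
 * which jump pointers alone cannot reach; jump pointers cover the rest. */
long sum_prefetch_array(const struct plist *l)
{
    long sum = 0;
    for (size_t k = 0; k < DIST; k++)
        __builtin_prefetch(l->first[k]);
    for (const struct pnode *n = l->head; n != NULL; n = n->next) {
        __builtin_prefetch(n->jump);
        sum += n->key;
    }
    return sum;
}
```

The extra pointers per node and per list are the memory-space cost mentioned on the slide, and the additional prefetches issued at the start of each traversal are the bandwidth cost.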

17
Results: Hash Tables and Lists in Olden

Prefetch arrays do better because:
  MST has short lists and little computation per node
  they prefetch data for the first nodes in HEALTH, unlike jump-pointer prefetching

18
Results: Tree Traversals in OLTP and Olden

Hardware-based prefetch arrays do better because:
  the traversal path is not known in DB.tree (depth-first search)
  data for the first nodes is prefetched in Tree.add

19
Other Results in Brief

Impact of longer memory latencies:
  Robust for lists
  For trees, prefetch arrays may cause severe cache pollution
Impact of memory bandwidth:
  Performance improvements are sustained for bandwidths of typical high-end servers (2.4 Gbytes/s)
  Prefetch arrays may suffer for trees: severe contention was observed on low-bandwidth systems (640 Mbytes/s)
Node insertion and deletion for jump pointers and prefetch arrays:
  Results in instruction overhead (-). However, insertion/deletion is sped up by prefetching (+)

20
Outline

Experimental platform
Memory system issues studied:
  Working set size in DSS workloads
  Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  Coherence issues in OLTP workloads
Concluding remarks

21
Coherence Issues in OLTP

Favorite protocol: write-invalidate
Ownership overhead: invalidations cause write stalls and invalidation traffic

22
Ownership Overhead in OLTP

Simulation setup:
  CC-NUMA with 4 nodes
  MySQL, TPC-B, 600 MB database

40% of all ownership transactions stem from load/store sequences
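
A load/store sequence is simply a read of a memory block followed by a write to the same block. The C sketch below shows one in the style of a TPC-B balance update; the struct and function names are hypothetical and do not come from MySQL. Under a write-invalidate protocol, and assuming another node holds a copy of the block, the load typically brings the block in without write permission, and the store then needs a second, separate ownership transaction.

```c
#include <assert.h>
#include <stddef.h>

struct account {
    long balance;
};

/* Read-modify-write of one record: a load/store sequence. */
long deposit(struct account *a, long delta)
{
    assert(a != NULL);
    long old = a->balance;    /* load: read miss fetches the block */
    a->balance = old + delta; /* store: may need an ownership request */
    return a->balance;
}
```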

23
Techniques to Attack Ownership Overhead

Dynamic detection of migratory sharing:
  detects two load/store sequences by different processors
  covers only a subset of all load/store sequences (40% in OLTP)
Static detection of load/store sequences:
  compiler algorithms that tag a load followed by a store and bring the block into the cache in exclusive state
  poses problems in TPC-B
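
The dynamic approach can be pictured as a small per-block state machine. The sketch below is an assumption-laden illustration in C, not the protocol from the talk: the state fields, function names, and the reset rule for stores that are not preceded by a load from the same processor are all hypothetical. A block is marked migratory once two consecutive load/store sequences by different processors have been seen; from then on, a load is answered with an exclusive copy so the following store needs no ownership transaction.

```c
#include <assert.h>
#include <stdbool.h>

struct block_state {
    int pending_reader; /* processor whose load awaits a store, or -1 */
    int last_seq_proc;  /* processor of the last load/store sequence, or -1 */
    bool migratory;
};

void block_init(struct block_state *b)
{
    assert(b != 0);
    b->pending_reader = -1;
    b->last_seq_proc = -1;
    b->migratory = false;
}

/* Returns true if the load should be granted an exclusive copy. */
bool on_load(struct block_state *b, int proc)
{
    b->pending_reader = proc;
    return b->migratory;
}

void on_store(struct block_state *b, int proc)
{
    if (b->pending_reader == proc) {
        /* A load/store sequence by proc just completed. */
        if (b->last_seq_proc != -1 && b->last_seq_proc != proc)
            b->migratory = true;
        b->last_seq_proc = proc;
    } else {
        /* Assumption: any other write pattern resets the detection. */
        b->migratory = false;
        b->last_seq_proc = -1;
    }
    b->pending_reader = -1;
}
```

Because detection needs two sequences by different processors, load/store sequences that do not follow this pattern are missed, which is consistent with the slide's remark that only a subset is covered.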

24
New Protocol Extension