1
On the Interaction Between Commercial Workloads
and Memory Systems in High-Performance Servers
Per Stenström
Department of Computer Engineering, Chalmers, Göteborg, Sweden
http://www.ce.chalmers.se/pers
  • Fredrik Dahlgren, Magnus Karlsson, and Jim
    Nilsson
  • in collaboration with
  • Sun Microsystems and Ericsson Research

2
Motivation
  • Database applications dominate (32%)
  • Yet, the major research focus is on scientific/engineering
    applications (16%)

3
Project Objective
  • Design principles for high-performance memory
    systems for emerging applications
  • Systems considered:
    • high-performance compute nodes
    • SMP and DSM systems built out of them
  • Applications considered:
    • decision support and on-line transaction processing
  • Emerging applications:
    • computer graphics
    • video/sound coding/decoding
    • handwriting recognition
    • ...

4
Outline
  • Experimental platform
  • Memory system issues studied
  • Working set size in DSS workloads
  • Prefetch approaches for pointer-intensive
    workloads (such as in OLTP)
  • Coherence issues in OLTP workloads
  • Concluding remarks

5
Experimental Platform
  • The platform enables:
    • analysis of commercial workloads
    • analysis of OS effects
    • tracking architectural events to the OS or application
      level

6
Outline
  • Experimental platform
  • Memory system issues studied
  • Working set size in DSS workloads
  • Prefetch approaches for pointer-intensive
    workloads (such as in OLTP)
  • Coherence issues in OLTP workloads
  • Concluding remarks

7
Decision-Support Systems (DSS)
  • Compile a list of matching entries in several
    database relations

Will moderately sized caches suffice for huge
databases?
8
Our Findings
  • MWS: the footprint of instructions and private data
    needed to access a single tuple
    • typically small (< 1 Mbyte) and not affected by
      database size
  • DWS: the footprint of database data (tuples) accessed
    across consecutive invocations of the same scan node
    • typically a small impact (0.1%) on the overall miss
      rate

9
Methodological Approach
  • Challenges:
    • not feasible to simulate huge databases
    • source code needed: we used PostgreSQL and MySQL
  • Approach: an analytical model using
    • parameters that describe the query
    • parameters measured on downscaled query executions
    • system parameters

10
Footprints and Reuse Characteristics in DSS
  • MWS: instructions, private data, and metadata
    • can be measured on a downscaled simulation
  • DWS: all tuples accessed at lower levels
    • can be computed from the query composition and the
      probability of a match

11
Analytical Model: An Overview
  • Goal: predict the miss rate versus cache size for
    fully associative caches with an LRU replacement policy
    in single-processor systems
  • Number of cold misses: footprint size / block size
    • MWS is measured
    • DWS_i is computed from parameters describing the query
      (size of relations, probability of matching a search
      criterion, index versus sequential scan, etc.)
  • Number of capacity misses for tuple accesses at level i:
    • CM_0 × (1 − (C − C_0) / (MWS − C_0))   if C_0 < cache size C < MWS
    • (tuple size) / (block size)            if MWS < cache size C < MWS + DWS_i
  • Number of accesses per tuple is measured
  • Total number of misses and accesses are then computed
    (see the sketch below)
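A minimal sketch of how the capacity-miss term above could be
evaluated, assuming the reconstruction of the two cases is faithful;
all identifiers (cm0, mws, dws_i, ...) are illustrative placeholders,
not names from the original model:

```c
#include <stdio.h>

/* Illustrative evaluation of the capacity-miss term above.
 * Sizes are in bytes; all parameter names are placeholders. */
double capacity_misses(double cm0,        /* capacity misses observed at cache size c0 */
                       double c,          /* cache size being evaluated                */
                       double c0,         /* smallest cache size covered by the model  */
                       double mws,        /* MWS footprint (instr., private, metadata) */
                       double dws_i,      /* DWS footprint at level i (database data)  */
                       double tuple_size,
                       double block_size)
{
    if (c0 < c && c < mws)
        /* misses decay linearly as the cache grows toward the MWS */
        return cm0 * (1.0 - (c - c0) / (mws - c0));
    if (mws < c && c < mws + dws_i)
        /* MWS fits in the cache: only the tuple itself still misses */
        return tuple_size / block_size;
    return 0.0; /* cache holds MWS + DWS_i: no capacity misses at this level */
}

int main(void)
{
    /* toy numbers: 256-Kbyte cache, 1-Mbyte MWS, 64-byte blocks */
    double m = capacity_misses(1000.0, 256e3, 64e3, 1e6, 8e6, 512.0, 64.0);
    printf("predicted capacity misses per tuple access: %.1f\n", m);
    return 0;
}
```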

12
Model Validation
  • Goal:
    • prediction accuracy for queries with different
      compositions
      • Q3, Q6, and Q10 from TPC-D
    • prediction accuracy when scaling up the database
      • parameters measured on a 5-Mbyte database used to
        predict 200-Mbyte databases
    • robustness across database engines
      • two engines: PostgreSQL and MySQL

13
Model Predictions: Miss Rates for Huge Databases
  • The miss-rate components for instructions, private data,
    and metadata decay rapidly (negligible beyond 128 Kbytes)
  • The miss-rate component for database data is small
  • What's in the tail?

14
Outline
  • Experimental platform
  • Memory system issues studied
  • Working set size in DSS workloads
  • Prefetch approaches for pointer-intensive
    workloads (such as in OLTP)
  • Coherence issues in OLTP workloads
  • Concluding remarks

15
Cache Issues for Linked Data Structures
  • Traversal of lists may exhibit poor temporal locality
  • It results in chains of data-dependent loads,
    called pointer-chasing (see the sketch below)
  • Pointer-chasing shows up in many interesting applications:
    • 35% of the misses in OLTP (TPC-B)
    • 32% of the misses in an expert system
    • 21% of the misses in Raytrace
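As a concrete, hypothetical illustration of the pattern (not code
from the study): the address of each node is produced by the load of
the previous node, so misses serialize and the full memory latency is
exposed for every node.

```c
#include <stddef.h>

/* Pointer-chasing: each load of n->next depends on the previous
 * one, so cache misses cannot be overlapped with one another. */
struct node {
    int          key;
    struct node *next;
};

int list_contains(const struct node *head, int key)
{
    for (const struct node *n = head; n != NULL; n = n->next)
        if (n->key == key)      /* data-dependent load chain */
            return 1;
    return 0;
}
```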

16
SW Prefetch Techniques to Attack Pointer-Chasing
  • Greedy Prefetching (G)
    • falls short when computation per node < latency
  • Jump-Pointer Prefetching (J)
    • falls short for short lists, or when the traversal is
      not known a priori
  • Prefetch Arrays (P.S/P.H)
    • a generalization of G and J that addresses the above
      shortcomings (see the sketch below)
    • trades memory space and bandwidth for more latency
      tolerance
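A hedged sketch of the three techniques on a list traversal, using
the GCC/Clang __builtin_prefetch intrinsic; the field layout, the
prefetch-array size k = 4, and all names are assumptions made for
illustration, not details from the slides:

```c
#include <stddef.h>

struct pnode {
    int           key;
    struct pnode *next;
    struct pnode *jump;    /* jump pointer: k nodes ahead in the list */
};

struct plist {
    struct pnode *head;
    struct pnode *pf[4];   /* prefetch array: the first k nodes */
};

void traverse(struct plist *l)
{
    /* prefetch array: covers the first nodes, which neither greedy
     * nor jump-pointer prefetching can overlap with anything */
    for (int i = 0; i < 4; i++)
        __builtin_prefetch(l->pf[i]);

    for (struct pnode *n = l->head; n != NULL; n = n->next) {
        __builtin_prefetch(n->next);  /* greedy: one node ahead  */
        __builtin_prefetch(n->jump);  /* jump pointer: k ahead   */
        /* ... per-node computation overlaps the prefetches ... */
    }
}
```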

17
Results Hash Tables and Lists in Olden
  • Prefetch arrays do better because:
    • MST has short lists and little computation per node
    • they prefetch data for the first nodes in HEALTH,
      unlike jump-pointer prefetching

18
Results Tree Traversals in OLTP and Olden
  • Hardware-based prefetch arrays do better because:
    • the traversal path is not known in DB.tree (depth-first
      search)
    • data for the first nodes is prefetched in Tree.add

19
Other Results in Brief
  • Impact of longer memory latencies:
    • robust for lists
    • for trees, prefetch arrays may cause severe cache
      pollution
  • Impact of memory bandwidth:
    • performance improvements are sustained at the bandwidths
      of typical high-end servers (2.4 Gbytes/s)
    • prefetch arrays may suffer for trees; severe contention
      was observed on low-bandwidth systems (640 Mbytes/s)
  • Node insertion and deletion with jump pointers and
    prefetch arrays:
    • results in instruction overhead (−); however,
    • insertion/deletion is itself sped up by prefetching (+)

20
Outline
  • Experimental platform
  • Memory system issues studied
  • Working set size in DSS workloads
  • Prefetch approaches for pointer-intensive
    workloads (such as in OLTP)
  • Coherence issues in OLTP workloads
  • Concluding remarks

21
Coherence Issues in OLTP
  • Favorite protocol: write-invalidate
  • Ownership overhead: invalidations cause write stalls and
    invalidation traffic (see the sketch below)
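A hypothetical read-modify-write fragment showing where the overhead
comes from: under write-invalidate, the load fetches the block in
shared state, and the store then needs a second coherence transaction
to obtain ownership.

```c
/* Load/store sequence typical of OLTP record updates: the load
 * misses and the block arrives in shared state; the store then
 * issues a separate ownership (upgrade) request, stalling the
 * writer and invalidating the other cached copies. */
struct account { long balance; };

void debit(struct account *a, long amount)
{
    long b = a->balance;      /* load miss: block fetched shared      */
    a->balance = b - amount;  /* store: second transaction, ownership */
}
```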

22
Ownership Overhead in OLTP
  • Simulation setup:
    • CC-NUMA with 4 nodes
    • MySQL, TPC-B, 600-Mbyte database
  • 40% of all ownership transactions stem from load/store
    sequences

23
Techniques to Attack Ownership Overhead
  • Dynamic detection of migratory sharing:
    • requires load/store sequences by two different
      processors to detect a block
    • captures only a subset of all load/store sequences
      (40% in OLTP)
  • Static detection of load/store sequences:
    • compiler algorithms that tag a load followed by a store
      and bring the block into the cache in exclusive state
    • poses problems in TPC-B

24
New Protocol Extension
  • Criterion: on a load miss from processor i followed by a
    global store from i, tag the block as Load/Store
    (see the sketch below)
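A minimal sketch of how a directory might apply this criterion,
assuming it records the last processor whose load missed on each
block; all structures and names here are illustrative:

```c
#include <stdbool.h>

enum block_tag { NORMAL, LOAD_STORE };

struct dir_entry {
    int            last_load_miss_cpu;  /* last processor whose load missed */
    enum block_tag tag;
};

/* On a load miss, a tagged block is granted in exclusive state, so
 * the expected store needs no separate ownership transaction. */
bool grant_exclusive_on_load_miss(struct dir_entry *e, int cpu)
{
    e->last_load_miss_cpu = cpu;
    return e->tag == LOAD_STORE;
}

/* Detection: a global store from the same processor whose load
 * missed last tags the block as Load/Store. */
void on_global_store(struct dir_entry *e, int cpu)
{
    if (e->last_load_miss_cpu == cpu)
        e->tag = LOAD_STORE;
}
```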

25
Concluding Remarks
  • Focusing on DSS and OLTP has revealed challenges not
    exposed by traditional applications:
    • pointer-chasing
    • load/store optimizations
  • Application scaling is not fully understood
  • Our work on combining simulation with analytical modeling
    shows some promise