Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors
1
Integrating Memory Compression and Decompression
with Coherence Protocols in DSM Multiprocessors
  • Lakshmana R Vittanala (Intel)
  • Mainak Chaudhuri (IIT Kanpur)

2
Talk in Two Slides (1/2)
  • Memory footprint of data-intensive workloads is
    ever-increasing
  • We explore compression to reduce memory pressure
    in a medium-scale DSM multiprocessor
  • Dirty blocks evicted from the last-level cache are
    sent to the home node
  • Compress in the home memory controller
  • A last-level cache miss request from a node is
    sent to the home node
  • Decompress in the home memory controller

3
Talk in Two Slides (2/2)
  • No modifications to the processor
  • Cache hierarchy sees decompressed blocks
  • All changes are confined to the directory-based
    cache coherence protocol
  • Leverage spare core(s) to execute
    compression-enabled protocols in software
  • Extend directory structure for compression
    book-keeping
  • Use hybrid of two compression algorithms
  • On 16 nodes, for seven scientific computing
    workloads: 73% storage saving on average, with at
    most a 15% increase in execution time

4
Contributions
  • Two major contributions
  • First attempt to look at compression/decompression
    as directory protocol extensions in mid-range
    servers
  • First proposal to execute a compression-enabled
    directory protocol in software on spare core(s)
    of a multi-core die
  • Makes the solution attractive in many-core systems

5
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

6
Programmable Protocol Core
  • Past studies have considered off-die programmable
    protocol processors (Sun S3.mp, Sequent STiNG,
    Stanford FLASH, Piranha)
  • These offer flexibility in the choice of coherence
    protocol compared to hardwired FSMs, but suffer
    from performance loss
  • With on-die integration of the memory controller
    and the availability of a large number of on-die
    cores, programmable protocol cores may become an
    attractive design
  • Recent studies show almost no performance loss
    [IEEE TPDS, Aug '07]

7
Programmable Protocol Core
  • In our simulated system, each node contains:
  • One complex out-of-order issue core, which runs
    the application thread
  • One or two simple in-order, static dual-issue
    programmable protocol cores, which run the
    directory-based cache coherence protocol in
    software
  • An on-die integrated memory controller, network
    interface, and router
  • The compression/decompression algorithms are
    integrated into the directory protocol software

8
Programmable Protocol Core
[Node diagram: an OOO core (IL1, DL1, L2) and an in-order protocol core/protocol processor (IL1, DL1, AT, PT), with an on-die memory controller attached to SDRAM, a network interface, and a router]
9
Anatomy of a Protocol Handler
  • On arrival of a coherence transaction at the
    memory controller of a node, a protocol handler
    is scheduled on the protocol core of that node
  • Calculates the directory address if at the home
    node (a simple hash function on the transaction
    address)
  • Reads the 64-bit directory entry if at the home
    node
  • Carries out simple integer arithmetic operations
    to figure out the coherence actions
  • May send messages to remote nodes
  • May initiate transactions to the local OOO core
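As a rough sketch of such a handler at the home node (every identifier below is hypothetical; the actual handler code is not shown in the talk):

    #include <stdint.h>

    /* Illustrative environment hooks; names are made up. */
    extern uint64_t *dir_entry_addr(uint64_t addr);  /* hash of txn address */
    extern void send_msg(int node, int type, uint64_t addr, int requester);
    extern void send_data(int node, uint64_t addr);
    extern int  owner_of(uint64_t entry);

    enum { STATE_M = 2, MSG_GET_INTERVENTION = 7 };  /* arbitrary codes */

    /* Skeleton of a home-node GET handler running on the protocol core. */
    void handle_get(uint64_t txn_addr, int requester)
    {
        uint64_t *dirp  = dir_entry_addr(txn_addr);
        uint64_t  entry = *dirp;                         /* 64-bit entry read */
        unsigned  state = (unsigned)(entry >> 60) & 0xF; /* assumed field */

        if (state == STATE_M) {
            /* Block dirty at a remote owner: forward an intervention. */
            send_msg(owner_of(entry), MSG_GET_INTERVENTION, txn_addr, requester);
        } else {
            /* Clean at home: reply with data, add requester as a sharer. */
            send_data(requester, txn_addr);
            *dirp = entry | (1ULL << requester);  /* sharer bits, low 16 */
        }
    }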

10
Baseline Directory Protocol
  • Invalidation-based three-state (MSI) bitvector
    protocol
  • Derived from the SGI Origin MESI protocol and
    improved to better handle early and late
    intervention races

  • 64-bit directory entry datapath: 4 bits of state
    (L, M, and two busy states), 44 unused bits, and a
    16-bit sharer vector
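A minimal accessor sketch for this entry; the bit positions (state in the top 4 bits, sharers in the low 16) are assumptions for illustration, not the documented layout:

    #include <stdint.h>

    /* Assumed layout: [63:60] state, [59:16] unused (44 bits, later
       reused for compression meta-data), [15:0] sharer bitvector. */
    #define DIR_STATE(e)    ((unsigned)((e) >> 60) & 0xFu)
    #define DIR_UNUSED(e)   (((e) >> 16) & 0xFFFFFFFFFFFULL)  /* 44 bits */
    #define DIR_SHARERS(e)  ((uint16_t)((e) & 0xFFFFu))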
11
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

12
Directory Protocol Extensions
  • Compression support
  • All handlers that update memory blocks need to be
    extended with the compression algorithm
  • Two major categories: writeback handlers and GET
    intervention response handlers
  • The latter involve a state demotion from M to S and
    hence require an update of the memory block at home
  • GETX interventions do not require a memory update
    as they involve an ownership hand-off only
  • Decompression support
  • All handlers that access memory in response to
    last-level cache miss requests

13
Directory Protocol Extensions
  • Compression support (writeback cases)

[Diagram: a writeback (WB) from the writing-back processor (SP) travels via its protocol processor (SPP) to the home protocol processor (HPP), which compresses the block into DRAM and returns WB_ACK]
14
Directory Protocol Extensions
  • Compression support (writeback cases)

[Diagram: a writeback (WB) from the home processor (HP) goes to the home protocol processor (HPP), which compresses the block into DRAM]
15
Directory Protocol Extensions
  • Compression support (intervention cases)

[Diagram: a GET from the requesting processor (RP) travels via its protocol processor (RPP) to the home protocol processor (HPP), which forwards an intervention GET to the dirty processor (DP); DP sends a PUT to the requester and a sharing writeback (SWB) to HPP, which compresses the block into DRAM]
16
Directory Protocol Extensions
  • Compression support (intervention cases)

[Diagram: a GET from the requesting processor (RP) arrives via RPP at the home protocol processor (HPP); the block is dirty in the home processor (HP), so HPP intervenes at HP, receives the uncompressed block, compresses it into DRAM, and sends a PUT to the requester]
17
Directory Protocol Extensions
  • Compression support (intervention cases)

[Diagram: a GET from the home processor (HP) reaches the home protocol processor (HPP), which forwards an intervention GET to the dirty processor (DP); DP returns the uncompressed block in a PUT, and HPP compresses it into DRAM while delivering the PUT to HP]
18
Directory Protocol Extensions
  • Decompression support

[Diagram: a GET/GETX from the requesting processor (RP) travels via RPP to the home protocol processor (HPP), which reads the block from DRAM, decompresses it, and returns a PUT/PUTX]
19
Directory Protocol Extensions
  • Decompression support

[Diagram: a GET/GETX from the home processor (HP) goes to the home protocol processor (HPP), which reads the block from DRAM, decompresses it, and returns a PUT/PUTX]
20
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

21
Compression Algorithms
  • Consider one 64-bit chunk of a 128-byte cache
    block at a time
  • Algorithm I (per chunk):
    Original pattern           Stored         Encoding
    All 64 bits zero           Nothing        00
    MS 4 bytes zero            LS 4 bytes     01
    MS 4 bytes = LS 4 bytes    LS 4 bytes     10
    No pattern                 All 64 bits    11
  • Algorithm II differs in encoding 10: LS 4 bytes
    zero; the compressed block stores the MS 4 bytes.
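A sketch of the per-chunk classification for Algorithm I (the reading of encoding 10 as "MS half equals LS half" is inferred from the flattened slide table; all names are illustrative):

    #include <stdint.h>
    #include <string.h>

    /* Encode one 64-bit chunk under Algorithm I, appending the stored
       bytes to *out and returning the 2-bit encoding. */
    static unsigned encode_chunk_alg1(uint64_t w, uint8_t **out)
    {
        uint32_t ms = (uint32_t)(w >> 32);    /* most-significant 4 bytes  */
        uint32_t ls = (uint32_t)w;            /* least-significant 4 bytes */

        if (w == 0)
            return 0x0;                       /* 00: all zero, store nothing */
        if (ms == 0) {
            memcpy(*out, &ls, 4); *out += 4;  /* 01: store LS 4 bytes */
            return 0x1;
        }
        if (ms == ls) {
            memcpy(*out, &ls, 4); *out += 4;  /* 10: halves equal, store LS */
            return 0x2;                       /* (Alg II: 10 = LS zero, store MS) */
        }
        memcpy(*out, &w, 8); *out += 8;       /* 11: store all 64 bits */
        return 0x3;
    }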

22
Compression Algorithms
  • Ideally want to compute compressed size by both
    the algorithms for each of the 16 double-words in
    a cache block and pick the best
  • Overhead is too high
  • Trade-off 1
  • Speculate based on the first 64 bits
  • If MS 32 bits = LS 32 bits ≠ 0, use Algorithm I
    (covers two cases of Algorithm I)
  • If MS 32 bits = 0 or LS 32 bits = 0, use Algorithm
    II (covers three cases of Algorithm II)
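In code, the speculation might look like the following; the exact comparisons are reconstructed from flattened slide text, so treat them as assumptions:

    /* Pick an algorithm by inspecting only the first 64-bit chunk. */
    static int choose_algorithm(const uint64_t *block)
    {
        uint32_t ms = (uint32_t)(block[0] >> 32);
        uint32_t ls = (uint32_t)block[0];

        if (ms == ls && ms != 0)
            return 1;   /* repeated halves: Algorithm I's extra case */
        return 2;       /* zero-half patterns: Algorithm II */
    }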

23
Compression Algorithms
  • Trade-off 2
  • If the compression ratio is low, it is better to
    avoid the decompression overhead
  • Decompression is fully on the critical path
  • After compressing every 64 bits, compare the
    running compressed size against a threshold
    maxCsz (best value: 48 bytes)
  • Abort compression and store the entire block
    uncompressed as soon as the threshold is crossed
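A sketch of the abort logic, reusing the illustrative encode_chunk_alg1 from above; maxCsz = 48 bytes is the best threshold reported on the slide:

    #define MAX_CSZ 48   /* bytes; best threshold per the slide */

    /* Compress a 128-byte (16 double-word) block. Returns the compressed
       size in bytes, or -1 to store the block uncompressed. */
    static int compress_block(const uint64_t *block, uint8_t *out,
                              uint32_t *header)
    {
        uint8_t *p = out;
        *header = 0;
        for (int i = 0; i < 16; i++) {
            unsigned enc = encode_chunk_alg1(block[i], &p);
            *header |= enc << (30 - 2 * i);  /* chunk 0 in top header bits */
            if (p - out > MAX_CSZ)
                return -1;       /* abort: compression ratio too low */
        }
        return (int)(p - out);
    }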

24
Compression Algorithms
  • Meta-data
  • Required for decompression
  • Most meta-data are stored in the unused 44 bits
    of the directory entry
  • Cache controller generates uncompressed block
    address so directory address computation remains
    unchanged
  • 32 bits to locate the compressed block
  • Compressed block size is a multiple of 4 bytes,
    but we extend it to next 8-byte boundary to have
    a cushion for future use
  • 32 bits allow us to address 32 GB of compressed
    memory

25
Compression Algorithms
  • Meta-data
  • Two bits identify the compression algorithm:
    Algorithm I, Algorithm II, uncompressed, all zero
  • All-zero blocks do not store anything in memory
  • For each 64-bit chunk, one of four encodings must
    be recorded
  • Maintained in a 32-bit header (two bits for each
    of the 16 double words)
  • Optimization to speed up relocation: store the
    size of the compressed block in the directory entry
  • Requires four bits (16 double words maximum)
  • Total: 70 bits of meta-data per compressed block
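An accounting sketch of those 70 bits; the placement is an assumption (the 38 directory-resident bits fit in the 44 spare bits, while the 32-bit header could live elsewhere, e.g. with the compressed block):

    #include <stdint.h>

    struct comp_meta {
        uint32_t block_ptr; /* 32 bits: compressed address in 8-byte units */
        uint8_t  algo;      /*  2 bits: Alg I / Alg II / uncompressed / all zero */
        uint8_t  size_dw;   /*  4 bits: compressed size in double words */
        uint32_t header;    /* 32 bits: 2-bit encoding per 64-bit chunk */
    };                      /* 32 + 2 + 4 + 32 = 70 meta-data bits */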

26
Decompression Example
  • Directory entry information
  • 32-bit address: 0x4fd1276a
  • Actual address: 0x4fd1276a << 3
  • Compression state: 01
  • Algorithm II was used
  • Compressed size: 0101
  • Actual size: 40 bytes (not used in decompression)
  • Header information
  • 32-bit header: 00 11 10 00 00 01
  • Upper 64 bits used encoding 00 of Algorithm II
  • Next 64 bits used encoding 11 of Algorithm II
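Working the slide's numbers in code (helper names are made up; placing the first chunk in the header's most-significant bits is an assumption consistent with the example):

    #include <stdint.h>

    /* The 32-bit pointer is in units of 8 bytes, hence the shift. */
    uint64_t compressed_addr(uint32_t block_ptr)   /* e.g. 0x4fd1276a */
    {
        return (uint64_t)block_ptr << 3;
    }

    /* 2-bit encoding of chunk i (0 = first 64 bits of the block). */
    unsigned chunk_encoding(uint32_t header, int i)
    {
        return (header >> (30 - 2 * i)) & 0x3;  /* chunk 0 -> 00, chunk 1 -> 11 */
    }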

27
Performance Optimization
  • Protocol thread occupancy is critical
  • Two protocol cores
  • Out-of-order NI scheduling to improve protocol
    core utilization
  • Cached message buffer (filled with the writeback
    payload)
  • 16 uncached loads/stores to the message buffer are
    needed during compression if it is not cached
  • Caching requires invalidating the buffer contents
    at the end of compression (coherence issue)
  • Flushing dirty contents occupies the datapath, so
    we allow only cached loads
  • Compression ratio remains unaffected

28
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

29
Storage Saving
[Bar chart: storage saving (%) across Barnes, FFT, FFTW, LU, Ocean, Radix, and Water; savings range from 16% to 73%, with visible values including 73, 66, 21, and 16]
30
Slowdown
[Bar chart: normalized execution time (y-axis 1.00 to 1.60) for five configurations (1PP, 2PP, 2PP + OOO NI, 2PP + OOO NI + CLS, 2PP + OOO NI + CL) across Barnes, FFT, FFTW, LU, Ocean, Radix, and Water; per-benchmark slowdowns of the best design are roughly 2%, 5%, 7%, 1%, 11%, 15%, and 8%]
31
Memory Stall Cycles
32
Protocol Core Occupancy
  • Dynamic instruction count and handler occupancy
                 w/o compression        w/ compression
    Barnes        29.1 M  (7.5 ns)      215.5 M (31.9 ns)
    FFT           82.7 M  (6.7 ns)      185.6 M (16.7 ns)
    FFTW         177.8 M (10.5 ns)      417.6 M (22.7 ns)
    LU            11.4 M  (6.3 ns)       29.2 M (14.8 ns)
    Ocean        376.6 M  (6.7 ns)     1553.5 M (24.1 ns)
    Radix         24.7 M  (8.1 ns)       87.0 M (36.9 ns)
    Water         62.4 M  (5.5 ns)      137.3 M  (8.8 ns)
  • Occupancy is still hidden under the fastest
    memory access latency (40 ns)

33
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

34
Related Work
  • Dictionary-based
  • IBM MXT
  • X-Match
  • X-RL
  • Not well-suited to cache-block granularity
  • Frequent pattern-based
  • Applied to on-chip cache blocks
  • Zero-aware compression
  • Applied to memory blocks
  • See paper for more details

35
Summary
  • Explored memory compression and decompression as
    coherence protocol extensions in DSM
    multiprocessors
  • The compression-enabled handlers run on simple
    core(s) of a multi-core node
  • The protocol core occupancy increases
    significantly, but still can be hidden under
    memory access latency
  • On seven scientific computing workloads, our best
    design saves 16% to 73% of memory while slowing
    down execution by at most 15%

36
Integrating Memory Compression and Decompression
with Coherence Protocols in DSM Multiprocessors
THANK YOU!
  • Lakshmana R Vittanala (Intel)
  • Mainak Chaudhuri (IIT Kanpur)