Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors
1
Integrating Memory Compression and Decompression
with Coherence Protocols in DSM Multiprocessors
  • Lakshmana R Vittanala (Intel)
  • Mainak Chaudhuri (IIT Kanpur)

2
Talk in Two Slides (1/2)
  • Memory footprint of data-intensive workloads is
    ever-increasing
  • We explore compression to reduce memory pressure
    in a medium-scale DSM multiprocessor
  • Dirty blocks evicted from the last-level cache are
    sent to the home node
  • Compress in the home memory controller
  • A last-level cache miss request from a node is
    sent to the home node
  • Decompress in the home memory controller

3
Talk in Two Slides (2/2)
  • No modifications to the processor
  • Cache hierarchy sees decompressed blocks
  • All changes are confined to the directory-based
    cache coherence protocol
  • Leverage spare core(s) to execute
    compression-enabled protocols in software
  • Extend directory structure for compression
    book-keeping
  • Use hybrid of two compression algorithms
  • On 16 nodes, for seven scientific computing
    workloads: 73% storage saving on average, with at
    most a 15% increase in execution time

4
Contributions
  • Two major contributions
  • First attempt to look at compression/decompression
    as directory protocol extensions in mid-range
    servers
  • First proposal to execute a compression-enabled
    directory protocol in software on spare core(s)
    of a multi-core die
  • Makes the solution attractive in many-core systems

5
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

6
Programmable Protocol Core
  • Past studies have considered off-die programmable
    protocol processors (Sun S3.mp, Sequent STiNG,
    Stanford FLASH, Piranha)
  • These offer flexibility in the choice of coherence
    protocol compared to hardwired FSMs, but suffer
    from performance loss
  • With on-die integration of the memory controller
    and the availability of a large number of on-die
    cores, programmable protocol cores may become an
    attractive design
  • Recent studies show almost no performance loss
    [IEEE TPDS, Aug '07]

7
Programmable Protocol Core
  • In our simulated system, each node contains:
  • One complex out-of-order issue core, which runs
    the application thread
  • One or two simple in-order, static dual-issue
    programmable protocol cores, which run the
    directory-based cache coherence protocol in
    software
  • An on-die integrated memory controller, network
    interface, and router
  • The compression/decompression algorithms are
    integrated into the directory protocol software

8
Programmable Protocol Core
[Node diagram: an OOO core (IL1, DL1, L2) and an in-order protocol core/protocol processor (IL1, DL1, AT, PT), with an on-die memory controller attached to SDRAM, a network interface, and a router]
9
Anatomy of a Protocol Handler
  • On arrival of a coherence transaction at the
    memory controller of a node, a protocol handler
    is scheduled on the protocol core of that node
  • Calculates the directory address if at the home
    node (a simple hash function on the transaction
    address)
  • Reads the 64-bit directory entry if at the home
    node
  • Carries out simple integer arithmetic operations
    to figure out the coherence actions
  • May send messages to remote nodes
  • May initiate transactions to the local OOO core
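As a rough sketch of such a handler at the home node (every identifier below is hypothetical; the actual handler code is not shown in the talk):

    #include <stdint.h>

    /* Illustrative environment hooks; names are made up. */
    extern uint64_t *dir_entry_addr(uint64_t addr);  /* hash of txn address */
    extern void send_msg(int node, int type, uint64_t addr, int requester);
    extern void send_data(int node, uint64_t addr);
    extern int  owner_of(uint64_t entry);

    enum { STATE_M = 2, MSG_GET_INTERVENTION = 7 };  /* arbitrary codes */

    /* Skeleton of a home-node GET handler running on the protocol core. */
    void handle_get(uint64_t txn_addr, int requester)
    {
        uint64_t *dirp  = dir_entry_addr(txn_addr);
        uint64_t  entry = *dirp;                         /* 64-bit entry read */
        unsigned  state = (unsigned)(entry >> 60) & 0xF; /* assumed field */

        if (state == STATE_M) {
            /* Block dirty at a remote owner: forward an intervention. */
            send_msg(owner_of(entry), MSG_GET_INTERVENTION, txn_addr, requester);
        } else {
            /* Clean at home: reply with data, add requester as a sharer. */
            send_data(requester, txn_addr);
            *dirp = entry | (1ULL << requester);  /* sharer bits, low 16 */
        }
    }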

10
Baseline Directory Protocol
  • Invalidation-based three-state (MSI) bitvector
    protocol
  • Derived from the SGI Origin MESI protocol and
    improved to better handle early and late
    intervention races

  • 64-bit directory entry datapath: 4 bits of state
    (L, M, and two busy states), 44 unused bits, and a
    16-bit sharer vector
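A minimal accessor sketch for this entry; the bit positions (state in the top 4 bits, sharers in the low 16) are assumptions for illustration, not the documented layout:

    #include <stdint.h>

    /* Assumed layout: [63:60] state, [59:16] unused (44 bits, later
       reused for compression meta-data), [15:0] sharer bitvector. */
    #define DIR_STATE(e)    ((unsigned)((e) >> 60) & 0xFu)
    #define DIR_UNUSED(e)   (((e) >> 16) & 0xFFFFFFFFFFFULL)  /* 44 bits */
    #define DIR_SHARERS(e)  ((uint16_t)((e) & 0xFFFFu))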
11
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

12
Directory Protocol Extensions
  • Compression support
  • All handlers that update memory blocks need to be
    extended with the compression algorithm
  • Two major categories: writeback handlers and GET
    intervention response handlers
  • The latter involve a state demotion from M to S and
    hence require an update of the memory block at home
  • GETX interventions do not require a memory update
    as they involve an ownership hand-off only
  • Decompression support
  • All handlers that access memory in response to
    last-level cache miss requests

13
Directory Protocol Extensions
  • Compression support (writeback cases)

[Diagram: a writeback (WB) from the writing-back processor (SP) travels via its protocol processor (SPP) to the home protocol processor (HPP), which compresses the block into DRAM and returns WB_ACK]
14
Directory Protocol Extensions
  • Compression support (writeback cases)

[Diagram: a writeback (WB) from the home processor (HP) goes to the home protocol processor (HPP), which compresses the block into DRAM]
15
Directory Protocol Extensions
  • Compression support (intervention cases)

[Diagram: a GET from the requesting processor (RP) travels via its protocol processor (RPP) to the home protocol processor (HPP), which forwards an intervention GET to the dirty processor (DP); DP sends a PUT to the requester and a sharing writeback (SWB) to HPP, which compresses the block into DRAM]
16
Directory Protocol Extensions
  • Compression support (intervention cases)

[Diagram: a GET from the requesting processor (RP) arrives via RPP at the home protocol processor (HPP); the block is dirty in the home processor (HP), so HPP intervenes at HP, receives the uncompressed block, compresses it into DRAM, and sends a PUT to the requester]
17
Directory Protocol Extensions
  • Compression support (intervention cases)

[Diagram: a GET from the home processor (HP) reaches the home protocol processor (HPP), which forwards an intervention GET to the dirty processor (DP); DP returns the uncompressed block in a PUT, and HPP compresses it into DRAM while delivering the PUT to HP]
18
Directory Protocol Extensions
  • Decompression support

[Diagram: a GET/GETX from the requesting processor (RP) travels via RPP to the home protocol processor (HPP), which reads the block from DRAM, decompresses it, and returns a PUT/PUTX]
19
Directory Protocol Extensions
  • Decompression support

[Diagram: a GET/GETX from the home processor (HP) goes to the home protocol processor (HPP), which reads the block from DRAM, decompresses it, and returns a PUT/PUTX]
20
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

21
Compression Algorithms
  • Consider one 64-bit chunk of a 128-byte cache
    block at a time
  • Algorithm I (per chunk):
    Original pattern           Stored         Encoding
    All 64 bits zero           Nothing        00
    MS 4 bytes zero            LS 4 bytes     01
    MS 4 bytes = LS 4 bytes    LS 4 bytes     10
    No pattern                 All 64 bits    11
  • Algorithm II differs in encoding 10: LS 4 bytes
    zero; the compressed block stores the MS 4 bytes.
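A sketch of the per-chunk classification for Algorithm I (the reading of encoding 10 as "MS half equals LS half" is inferred from the flattened slide table; all names are illustrative):

    #include <stdint.h>
    #include <string.h>

    /* Encode one 64-bit chunk under Algorithm I, appending the stored
       bytes to *out and returning the 2-bit encoding. */
    static unsigned encode_chunk_alg1(uint64_t w, uint8_t **out)
    {
        uint32_t ms = (uint32_t)(w >> 32);    /* most-significant 4 bytes  */
        uint32_t ls = (uint32_t)w;            /* least-significant 4 bytes */

        if (w == 0)
            return 0x0;                       /* 00: all zero, store nothing */
        if (ms == 0) {
            memcpy(*out, &ls, 4); *out += 4;  /* 01: store LS 4 bytes */
            return 0x1;
        }
        if (ms == ls) {
            memcpy(*out, &ls, 4); *out += 4;  /* 10: halves equal, store LS */
            return 0x2;                       /* (Alg II: 10 = LS zero, store MS) */
        }
        memcpy(*out, &w, 8); *out += 8;       /* 11: store all 64 bits */
        return 0x3;
    }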

22
Compression Algorithms
  • Ideally want to compute compressed size by both
    the algorithms for each of the 16 double-words in
    a cache block and pick the best
  • Overhead is too high
  • Trade-off 1
  • Speculate based on the first 64 bits
  • If MS 32 bits = LS 32 bits ≠ 0, use Algorithm I
    (covers two cases of Algorithm I)
  • If MS 32 bits = 0 or LS 32 bits = 0, use Algorithm
    II (covers three cases of Algorithm II)
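In code, the speculation might look like the following; the exact comparisons are reconstructed from flattened slide text, so treat them as assumptions:

    /* Pick an algorithm by inspecting only the first 64-bit chunk. */
    static int choose_algorithm(const uint64_t *block)
    {
        uint32_t ms = (uint32_t)(block[0] >> 32);
        uint32_t ls = (uint32_t)block[0];

        if (ms == ls && ms != 0)
            return 1;   /* repeated halves: Algorithm I's extra case */
        return 2;       /* zero-half patterns: Algorithm II */
    }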

23
Compression Algorithms
  • Trade-off 2
  • If the compression ratio is low, it is better to
    avoid the decompression overhead
  • Decompression is fully on the critical path
  • After compressing every 64 bits, compare the
    running compressed size against a threshold
    maxCsz (best value: 48 bytes)
  • Abort compression and store the entire block
    uncompressed as soon as the threshold is crossed
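A sketch of the abort logic, reusing the illustrative encode_chunk_alg1 from above; maxCsz = 48 bytes is the best threshold reported on the slide:

    #define MAX_CSZ 48   /* bytes; best threshold per the slide */

    /* Compress a 128-byte (16 double-word) block. Returns the compressed
       size in bytes, or -1 to store the block uncompressed. */
    static int compress_block(const uint64_t *block, uint8_t *out,
                              uint32_t *header)
    {
        uint8_t *p = out;
        *header = 0;
        for (int i = 0; i < 16; i++) {
            unsigned enc = encode_chunk_alg1(block[i], &p);
            *header |= enc << (30 - 2 * i);  /* chunk 0 in top header bits */
            if (p - out > MAX_CSZ)
                return -1;       /* abort: compression ratio too low */
        }
        return (int)(p - out);
    }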

24
Compression Algorithms
  • Meta-data
  • Required for decompression
  • Most meta-data are stored in the unused 44 bits
    of the directory entry
  • Cache controller generates uncompressed block
    address so directory address computation remains
    unchanged
  • 32 bits to locate the compressed block
  • Compressed block size is a multiple of 4 bytes,
    but we extend it to next 8-byte boundary to have
    a cushion for future use
  • 32 bits allow us to address 32 GB of compressed
    memory

25
Compression Algorithms
  • Meta-data
  • Two bits identify the compression algorithm:
    Algorithm I, Algorithm II, uncompressed, all zero
  • All-zero blocks do not store anything in memory
  • For each 64-bit chunk, one of four encodings must
    be recorded
  • Maintained in a 32-bit header (two bits for each
    of the 16 double words)
  • Optimization to speed up relocation: store the
    size of the compressed block in the directory entry
  • Requires four bits (16 double words maximum)
  • Total: 70 bits of meta-data per compressed block
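An accounting sketch of those 70 bits; the placement is an assumption (the 38 directory-resident bits fit in the 44 spare bits, while the 32-bit header could live elsewhere, e.g. with the compressed block):

    #include <stdint.h>

    struct comp_meta {
        uint32_t block_ptr; /* 32 bits: compressed address in 8-byte units */
        uint8_t  algo;      /*  2 bits: Alg I / Alg II / uncompressed / all zero */
        uint8_t  size_dw;   /*  4 bits: compressed size in double words */
        uint32_t header;    /* 32 bits: 2-bit encoding per 64-bit chunk */
    };                      /* 32 + 2 + 4 + 32 = 70 meta-data bits */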

26
Decompression Example
  • Directory entry information
  • 32-bit address: 0x4fd1276a
  • Actual address: 0x4fd1276a << 3
  • Compression state: 01
  • Algorithm II was used
  • Compressed size: 0101
  • Actual size: 40 bytes (not used in decompression)
  • Header information
  • 32-bit header: 00 11 10 00 00 01
  • Upper 64 bits used encoding 00 of Algorithm II
  • Next 64 bits used encoding 11 of Algorithm II
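Working the slide's numbers in code (helper names are made up; placing the first chunk in the header's most-significant bits is an assumption consistent with the example):

    #include <stdint.h>

    /* The 32-bit pointer is in units of 8 bytes, hence the shift. */
    uint64_t compressed_addr(uint32_t block_ptr)   /* e.g. 0x4fd1276a */
    {
        return (uint64_t)block_ptr << 3;
    }

    /* 2-bit encoding of chunk i (0 = first 64 bits of the block). */
    unsigned chunk_encoding(uint32_t header, int i)
    {
        return (header >> (30 - 2 * i)) & 0x3;  /* chunk 0 -> 00, chunk 1 -> 11 */
    }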

27
Performance Optimization
  • Protocol thread occupancy is critical
  • Two protocol cores
  • Out-of-order NI scheduling to improve protocol
    core utilization
  • Cached message buffer (filled with the writeback
    payload)
  • 16 uncached loads/stores to the message buffer are
    needed during compression if it is not cached
  • Caching requires invalidating the buffer contents
    at the end of compression (coherence issue)
  • Flushing dirty contents occupies the datapath, so
    we allow only cached loads
  • Compression ratio remains unaffected

28
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

29
Storage Saving
[Bar chart: storage saving (%) across Barnes, FFT, FFTW, LU, Ocean, Radix, and Water; savings range from 16% to 73%, with visible values including 73, 66, 21, and 16]
30
Slowdown
[Bar chart: normalized execution time (y-axis 1.00 to 1.60) for five configurations (1PP, 2PP, 2PP + OOO NI, 2PP + OOO NI + CLS, 2PP + OOO NI + CL) across Barnes, FFT, FFTW, LU, Ocean, Radix, and Water; per-benchmark slowdowns of the best design are roughly 2%, 5%, 7%, 1%, 11%, 15%, and 8%]
31
Memory Stall Cycles
32
Protocol Core Occupancy
  • Dynamic instruction count and handler occupancy
                 w/o compression        w/ compression
    Barnes        29.1 M  (7.5 ns)      215.5 M (31.9 ns)
    FFT           82.7 M  (6.7 ns)      185.6 M (16.7 ns)
    FFTW         177.8 M (10.5 ns)      417.6 M (22.7 ns)
    LU            11.4 M  (6.3 ns)       29.2 M (14.8 ns)
    Ocean        376.6 M  (6.7 ns)     1553.5 M (24.1 ns)
    Radix         24.7 M  (8.1 ns)       87.0 M (36.9 ns)
    Water         62.4 M  (5.5 ns)      137.3 M  (8.8 ns)
  • Occupancy is still hidden under the fastest
    memory access latency (40 ns)

33
Sketch
  • Background: Programmable Protocol Core
  • Directory Protocol Extensions
  • Compression/Decompression Algorithms
  • Simulation Results
  • Related Work and Summary

34
Related Work
  • Dictionary-based
  • IBM MXT
  • X-Match
  • X-RL
  • Not well-suited to cache-block granularity
  • Frequent pattern-based
  • Applied to on-chip cache blocks
  • Zero-aware compression
  • Applied to memory blocks
  • See paper for more details

35
Summary
  • Explored memory compression and decompression as
    coherence protocol extensions in DSM
    multiprocessors
  • The compression-enabled handlers run on simple
    core(s) of a multi-core node
  • The protocol core occupancy increases
    significantly, but still can be hidden under
    memory access latency
  • On seven scientific computing workloads, our best
    design saves 16% to 73% of memory while slowing
    down execution by at most 15%

36
Integrating Memory Compression and Decompression
with Coherence Protocols in DSM Multiprocessors
THANK YOU!
  • Lakshmana R Vittanala (Intel)
  • Mainak Chaudhuri (IIT Kanpur)