Title: LimitLESS Directories: A Scalable Cache Coherence Scheme
Slide 1: LimitLESS Directories: A Scalable Cache Coherence Scheme
David Chaiken, John Kubiatowicz, Anant Agarwal
Massachusetts Institute of Technology
Presented by Lalitha Raju
Slide 2: Cache Coherence
- Caches enhance the performance of multiprocessors by reducing network traffic and average memory access latency.
- Side effect: the cache coherence problem.
- Reason: in a multiprocessor environment, multiple processors may be reading and modifying the same memory blocks within their own caches.
- Result: a globally inconsistent view of memory.
Slide 3: Common Solutions
- Snoopy coherence
  - When any change is made to a data location, a broadcast is sent across the bus so that every cache in the system can either invalidate or update its local copy of the location.
  - Result: achieves a globally consistent view of memory.
  - Disadvantages:
    - Scalability: for large-scale multiprocessors, the bandwidth-reduction advantage of caching is negated by the broadcasts.
    - Cost: expensive for large-scale multiprocessor environments.
Slide 4: Common Solutions
- Directory-based coherence
  - These message-based protocols allocate a section of the system's memory, called a directory, to store the locations and state of the cached copies of each data block.
  - The basic concept is that a processor must ask for permission to load an entry from primary memory into its cache.
  - When an entry is changed, the directory must be notified, either before the change is initiated or when it is complete.
  - When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
Slide 5: Common Solutions
- Solutions involving directory-based cache coherence:
  - Full-map directory protocol
  - Limited directory protocol
Slide 6: Common Solutions
- Full-map directory protocol
  - Each block of memory has an associated directory entry that contains one bit for each cache in the system:
    [ State | 1 | 2 | 3 | ... | N ]
    e.g. State = Read-Only, with the bits marked X for the caches that hold a copy.
  - Advantages: no broadcasts necessary; unlimited caching.
  - Disadvantage: the directory size increases as the number of processors increases (high memory overhead).
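The full-map entry layout above can be sketched in a few lines of Python. This is a minimal illustration, not the Alewife hardware: the class and method names (`FullMapEntry`, `add_reader`, `sharers`) are hypothetical, but the structure matches the slide's description of one state field plus one presence bit per cache.

```python
# Hypothetical sketch of a full-map directory entry: one state field
# plus one presence bit per cache in the system.
class FullMapEntry:
    def __init__(self, num_caches):
        self.state = "Read-Only"             # or "Read-Write"
        self.present = [False] * num_caches  # one bit per cache

    def add_reader(self, cache_id):
        """Record that a cache holds a copy; no broadcast is needed."""
        self.present[cache_id] = True

    def sharers(self):
        """Caches that must be invalidated before a write can proceed."""
        return [i for i, bit in enumerate(self.present) if bit]

# The memory overhead is the slide's disadvantage: with N memory blocks
# and N caches, the directory needs on the order of N*N presence bits.
entry = FullMapEntry(num_caches=8)
entry.add_reader(1)
entry.add_reader(3)
```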
Slide 7: Common Solutions
- Observation based on previous studies:
  - Most parallel algorithms tend to avoid widespread sharing of variables.
  - There exists a worker set of processors (typically a small number) that concurrently share a given variable.
Slide 8: Common Solutions
- Limited directory protocol
  - Allows only a limited number of simultaneously cached copies of any individual data block. The entry holds the state plus a small, fixed number of node-ID pointers:
    [ State | Node ID | Node ID | Node ID | Node ID ]
    e.g. State = Read-Only, pointing at nodes 5, 18, 22 and 17.
  - Advantages: performance is comparable to that of a full-map scheme where there is limited sharing of data between processors; lower memory overhead.
  - Disadvantage: thrashing of directory pointers if software is not properly optimized.
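The pointer-thrashing disadvantage can be made concrete with a small sketch. The names (`LimitedEntry`, `add_reader`) and the eviction policy (evict the oldest pointer) are assumptions for illustration only; the point is that a fifth sharer forces an invalidation even though no write occurred.

```python
# Hypothetical sketch of a limited directory entry with a fixed number
# of node-ID pointers, as in the entry layout shown on the slide.
class LimitedEntry:
    def __init__(self, num_pointers):
        self.state = "Read-Only"
        self.pointers = []            # node IDs of caches holding copies
        self.capacity = num_pointers

    def add_reader(self, node_id, invalidate):
        if node_id in self.pointers:
            return
        if len(self.pointers) == self.capacity:
            # Entry is full: evict a pointer (oldest, for illustration)
            # and invalidate that cache -- this is the "thrashing" when
            # more than `capacity` processors share a block.
            evicted = self.pointers.pop(0)
            invalidate(evicted)
        self.pointers.append(node_id)

invalidated = []
entry = LimitedEntry(num_pointers=4)
for node in [5, 18, 22, 17, 9]:       # a fifth sharer overflows the entry
    entry.add_reader(node, invalidated.append)
```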
Slide 9: LimitLESS
- LimitLESS: Limited directory Locally Extended through Software Support.
- The LimitLESS scheme attempts to combine the full-map and limited directory ideas in order to achieve a robust yet affordable and scalable cache coherence solution.
- The main idea: handle the common case in hardware and the exceptional case in software.
- A limited directory implemented in hardware keeps track of a fixed number of cached copies of each memory block. When the capacity of a directory entry is exceeded, the directory interrupts the local processor and a full-map directory is emulated in software.
Slide 10: LimitLESS Cache Coherence Protocol
(State diagram: the directory for a block moves among the states Read-Only [P = {k1, ..., kn}; handled in software when n > p hardware pointers], Read-Write [P = {i}], Read Transaction [P = {i}], and Write Transaction [P = {i}]. The numbered arcs correspond to the transitions in the table on the next slide.)

Slide 11: Protocol Transition Table

| # | Input Message | Precondition | Directory Entry Change | Output Message(s) |
|---|---------------|--------------|------------------------|-------------------|
| 1 | i → RREQ | -- | P = P ∪ {i} | RDATA → i |
| 2 | i → WREQ | P = {i} | -- | WDATA → i |
|   | i → WREQ | P = {} | P = {i} | WDATA → i |
| 3 | i → WREQ | P = {k1, ..., kn} ∧ i ∉ P | P = {i}, AckCtr = n | ∀kj: INV → kj |
|   | i → WREQ | P = {k1, ..., kn} ∧ i ∈ P | P = {i}, AckCtr = n - 1 | ∀kj ≠ i: INV → kj |
| 4 | j → WREQ | P = {i} | P = {j}, AckCtr = 1 | INV → i |
| 5 | j → RREQ | P = {i} | P = {j}, AckCtr = 1 | INV → i |
| 6 | i → REPM | P = {i} | P = {} | -- |
| 7 | j → RREQ | -- | -- | BUSY → j |
|   | j → WREQ | -- | -- | BUSY → j |
|   | j → ACKC | AckCtr ≠ 1 | AckCtr = AckCtr - 1 | -- |
|   | j → REPM | -- | -- | -- |
| 8 | j → ACKC | AckCtr = 1, P = {i} | AckCtr = 0 | WDATA → i |
|   | j → UPDATE | P = {i} | AckCtr = 0 | WDATA → i |
| 9 | j → RREQ | -- | -- | BUSY → j |
|   | j → WREQ | -- | -- | BUSY → j |
|   | j → REPM | -- | -- | -- |
| 10 | j → UPDATE | P = {i} | AckCtr = 0 | RDATA → i |
|    | j → ACKC | P = {i} | AckCtr = 0 | RDATA → i |
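A few rows of the transition table can be exercised with a small state-machine sketch. This is a simulation for illustration, not the hardware protocol engine; the class and attribute names (`Directory`, `P`, `ack_ctr`, `out`) are hypothetical. It covers transitions 1, 2, 3 and the ACKC handling of transitions 7 and 8.

```python
# Hypothetical sketch of the directory state machine for one memory
# block: state, pointer set P, acknowledgment counter, output messages.
class Directory:
    def __init__(self):
        self.state, self.P, self.ack_ctr = "Read-Only", set(), 0
        self.out = []                      # (message type, destination)

    def handle(self, msg, i):
        if self.state == "Read-Only":
            if msg == "RREQ":              # transition 1: P = P U {i}
                self.P.add(i)
                self.out.append(("RDATA", i))
            elif msg == "WREQ":
                others = self.P - {i}
                if not others:             # transition 2: no other sharers
                    self.P = {i}
                    self.state = "Read-Write"
                    self.out.append(("WDATA", i))
                else:                      # transition 3: invalidate sharers
                    self.ack_ctr = len(others)
                    for k in others:
                        self.out.append(("INV", k))
                    self.P = {i}
                    self.state = "Write Transaction"
        elif self.state == "Write Transaction" and msg == "ACKC":
            self.ack_ctr -= 1              # transition 7: count down ACKs
            if self.ack_ctr == 0:          # transition 8: last ACK arrives
                self.state = "Read-Write"
                self.out.append(("WDATA", next(iter(self.P))))

d = Directory()
d.handle("RREQ", 1); d.handle("RREQ", 2)   # two readers share the block
d.handle("WREQ", 3)                        # a writer triggers invalidations
d.handle("ACKC", 1); d.handle("ACKC", 2)   # both sharers acknowledge
```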
Slide 12: Example
- Suppose that on a multiprocessor system, variable X is homed on P0 and is held in Read-Only state by P1 and P2. If P3 executes the statement X = 3, what sequence of messages is generated between the four processors?
  1. P3 → WREQ → P0; P = {P3}
  2. P0 → INV → P1, P0 → INV → P2; P = {P3}, X is in the Write Transaction state
  3. P1 → ACK → P0, P2 → ACK → P0; P = {P3}, X is in the Write Transaction state
  4. P0 → WDATA → P3; P = {P3}, X is in the Read-Write state
(Figure: the WREQ, INV, ACK and WDATA messages exchanged among P0, P1, P2, and P3.)
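The four steps above can be written out as an explicit message trace. The `send` helper is hypothetical; the trace just makes the message count visible: a write to a block with n read-only sharers costs 2 + 2n messages (request and data, plus one invalidation and one acknowledgment per sharer).

```python
# Message trace for the example: X homed on P0, shared by P1 and P2,
# written by P3. Each entry is (source, message, destination).
trace = []
def send(src, msg, dst):
    trace.append((src, msg, dst))

send("P3", "WREQ", "P0")                             # 1. write request
send("P0", "INV", "P1"); send("P0", "INV", "P2")     # 2. invalidations
send("P1", "ACK", "P0"); send("P2", "ACK", "P0")     # 3. acknowledgments
send("P0", "WDATA", "P3")                            # 4. data to writer

num_sharers = 2
assert len(trace) == 2 + 2 * num_sharers             # six messages total
```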
Slide 13: Architectural Features
- Alewife is a large-scale multiprocessor with distributed shared memory and a cost-effective mesh network for communication.
- An Alewife node consists of a 33 MHz Sparcle processor, 64K bytes of direct-mapped cache, 4M bytes of globally shared main memory, and a floating-point coprocessor.
Slide 14: Architectural Features
- Rapid trap handling (five to ten cycles): a rapid context-switching processor with a finely tuned software trap architecture.
- Processor-to-network interface: in order to emulate the protocol functions normally performed by the hardware directory, the processor must be able to send and receive messages from the interconnection network.
Slide 15: Architectural Features
- Interprocessor-Interrupt (IPI)
  - Provides the processor with a superset of the network functionality needed by the cache coherence hardware.
  - Used to send cache coherence packets and preemptive messages to remote processors.
- Network packet structure
  - Protocol opcode: cache coherence traffic.
  - Interrupt opcode: interprocessor messages.
- Transmission of IPI packets: enqueues requests on the IPI output queue.
- Reception of IPI packets: places packets in the IPI input queue.
Slide 16: Architectural Features
(Figure: the IPI packet format — Source Processor | Packet Length | Opcode | Operand 1 | Operand 2 | ... | Operand m-1 | Data Word 1 | Data Word 2 | ... | Data Word n-1.)
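The packet format on this slide can be sketched as a simple word-list builder. The function name and the assumption that the length field counts the words after the header are mine, not taken from the Alewife documentation; the field order follows the figure.

```python
# Hypothetical sketch of building an IPI packet as a flat list of words,
# following the field order in the figure: header, operands, data words.
def make_ipi_packet(source, opcode, operands, data_words):
    # Assumption: the length field counts the operand and data words
    # that follow the fixed header fields.
    length = len(operands) + len(data_words)
    return [source, length, opcode, *operands, *data_words]

# A packet from processor 4 with one operand and two data words.
pkt = make_ipi_packet(source=4, opcode=0x2,
                      operands=[0x100], data_words=[7, 8])
```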
Slide 17: Architectural Features
- LimitLESS trap handler
  - First-time overflow:
    - The trap code allocates a full-map bit vector in local memory.
    - Empties all hardware pointers and sets the corresponding bits in the vector.
    - The directory mode is set to Trap-On-Write before the trap returns.
  - Additional overflows:
    - Empty all hardware pointers and set the corresponding bits in the vector.
  - Termination (on WREQ or local write fault):
    - Empty all hardware pointers.
    - Record the identity of the requester in the directory.
    - Set the AckCtr to the number of bits in the vector that are set.
    - Place the directory in Normal Mode, Write Transaction state.
    - Invalidate all caches whose bit is set in the vector.
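The trap-handler bookkeeping above can be sketched as follows. This is a software simulation of the data structures, not the Sparcle trap code; all names (`LimitLESSEntry`, `overflow_trap`, `terminate`) are hypothetical, and the hardware pointer array is modeled as a plain list.

```python
# Hypothetical sketch of the LimitLESS trap handler's bookkeeping:
# a few hardware pointers, extended on overflow by a software bit vector.
class LimitLESSEntry:
    def __init__(self, hw_pointers, num_caches):
        self.hw = hw_pointers[:]      # hardware pointer array
        self.vector = None            # software full-map vector (overflow)
        self.num_caches = num_caches
        self.mode = "Normal"
        self.state = "Read-Only"

    def overflow_trap(self):
        if self.vector is None:       # first-time overflow: allocate vector
            self.vector = [False] * self.num_caches
            self.mode = "Trap-On-Write"
        for node in self.hw:          # empty pointers into the vector
            self.vector[node] = True
        self.hw.clear()

    def terminate(self, writer, send_inv):
        """On WREQ: invalidate vector members, return the ACK count due."""
        self.hw.clear()
        self.hw.append(writer)        # record the requester's identity
        ack_ctr = sum(self.vector)    # number of set bits in the vector
        for node, bit in enumerate(self.vector):
            if bit:
                send_inv(node)
        self.mode, self.state = "Normal", "Write Transaction"
        return ack_ctr

entry = LimitLESSEntry(hw_pointers=[1, 2, 3, 4], num_caches=16)
entry.overflow_trap()                 # a fifth sharer overflowed the entry
invs = []
acks_expected = entry.terminate(writer=9, send_inv=invs.append)
```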
Slide 18: Performance
- The performance of limited, LimitLESS, and full-map directories is compared.
- Performance results are obtained from complete Alewife machine simulations.
- Results indicate that the LimitLESS directory performs well even when software is not properly optimized.
Slide 19: Conclusion
- This paper proposed a new scheme for cache coherence, called LimitLESS, which is being implemented in the Alewife machine.
- The hardware requirements include rapid trap handling and a flexible processor interface to the network.
- Preliminary simulation results indicate that the LimitLESS scheme approaches the performance of a full-map directory protocol with the memory efficiency of a limited directory protocol.
- Furthermore, the LimitLESS scheme provides a migration path toward a future in which cache coherence is handled entirely in software.