LimitLESS Directories: A Scalable Cache Coherence Scheme
1
LimitLESS Directories: A Scalable Cache Coherence Scheme
David Chaiken, John Kubiatowicz, Anant Agarwal
Massachusetts Institute of Technology
Presented by Lalitha Raju

2
Cache Coherence
  • Caches enhance the performance of multiprocessors by reducing network
    traffic and average memory access latency.
  • Side effect: the cache coherence problem.
  • Reason: in a multiprocessor environment, multiple processors may be
    reading and modifying the same memory blocks within their own caches.
  • Result: a globally inconsistent view of memory.

3
Common Solutions
  • Snoopy Coherence
  • When any change is made to a data location, a broadcast is sent across
    the bus so that all the caches in the system can either invalidate or
    update their local copy of the location.
  • Result: achieves a globally consistent view of memory.
  • Disadvantages
  •   Scalability: for large-scale multiprocessors, the advantage of
      bandwidth reduction is negated by the broadcast traffic.
  •   Cost: expensive for large-scale multiprocessor environments.

4
Common Solutions
  • Directory-Based Coherence
  •   These message-based protocols allocate a section of the system's
      memory, called a directory, to store the locations and state of the
      cached copies of each data block.
  •   The basic concept is that a processor must ask for permission to load
      an entry from primary memory into its cache.
  •   When an entry is changed, the directory must be notified either
      before the change is initiated or when it is complete.
  •   When an entry is changed, the directory either updates or invalidates
      the other caches holding that entry.

5
Common Solutions
  • Solutions involving directory-based cache coherence
  •   Full-map directory protocol
  •   Limited directory protocol

6
Common Solutions
  • Full-Map Directory Protocol
  • Each block of memory has an associated directory entry which contains a
    presence bit for each cache in the system (see the sketch below).
  • Directory entry layout: State | 1 | 2 | 3 | ... | N
  • Advantages: no broadcasts necessary; unlimited caching.
  • Disadvantage: directory size grows with the number of processors
    (high memory overhead).

  [Figure: a full-map directory entry in the Read-Only state with two
  presence bits (X) set.]
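As a rough illustration (not from the paper; the machine size, types, and
names below are assumptions), a full-map entry can be pictured as a state
field plus one presence bit per cache, so the per-block overhead grows
linearly with the number of caches:

    #include <stdint.h>

    #define NUM_CACHES 64                 /* assumed machine size */

    enum dir_state { READ_ONLY, READ_WRITE };

    /* One full-map entry per memory block: a presence bit for every cache. */
    struct fullmap_entry {
        enum dir_state state;
        uint64_t       presence;          /* bit i set => cache i holds a copy */
    };

    /* Record that cache i has loaded a read-only copy of the block. */
    static void fullmap_add_reader(struct fullmap_entry *e, int i) {
        e->presence |= (uint64_t)1 << i;
        e->state = READ_ONLY;
    }

    /* On a write, every cache whose presence bit is set must be invalidated. */
    static int fullmap_sharer_count(const struct fullmap_entry *e) {
        return __builtin_popcountll(e->presence);   /* GCC/Clang builtin */
    }

With N caches the presence field needs N bits per memory block, which is the
memory overhead the slide refers to.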
7
Common Solutions
  • Observation based on previous studies
  • Most parallel algorithms tend to avoid widespread sharing of variables.
  • There exists a worker set of processors (typically small) that
    concurrently shares a given variable.

8
Common Solutions
  • Limited Directory Protocol
  • Allows only a limited number of simultaneously cached copies of any
    individual data block (see the sketch below).
  • Directory entry layout: State | Node ID | Node ID | Node ID | Node ID
  • Advantages: performance is comparable to that of a full-map scheme when
    there is limited sharing of data between processors; lower memory
    overhead.
  • Disadvantage: thrashing of directory pointers if software is not
    properly optimized.

  [Figure: a limited directory entry in the Read-Only state with pointers
  to nodes 5, 18, 22, and 17.]
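By contrast, a limited directory entry stores a small, fixed number of
explicit node IDs. The sketch below assumes four hardware pointers, matching
the picture above; the types and names are illustrative, not Alewife's:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_POINTERS 4                /* fixed hardware pointer count */

    enum dir_state { READ_ONLY, READ_WRITE };

    struct limited_entry {
        enum dir_state state;
        uint16_t node[NUM_POINTERS];      /* explicit node IDs of the sharers */
        int      count;                   /* pointers currently in use */
    };

    /* Try to record a new reader; returns false on pointer overflow, in which
     * case a pure limited protocol must invalidate an existing copy (the
     * source of pointer thrashing under wide sharing). */
    static bool limited_add_reader(struct limited_entry *e, uint16_t node_id) {
        for (int i = 0; i < e->count; i++)
            if (e->node[i] == node_id)
                return true;              /* already recorded */
        if (e->count == NUM_POINTERS)
            return false;                 /* overflow: an eviction is needed */
        e->node[e->count++] = node_id;
        e->state = READ_ONLY;
        return true;
    }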
9
LimitLESS
  • LimitLESS: Limited directory Locally Extended through Software Support.
  • The LimitLESS scheme attempts to combine the full-map and limited
    directory ideas in order to achieve a robust yet affordable and scalable
    cache coherence solution.
  • The main idea: handle the common case in hardware and the exceptional
    case in software.
  • Limited directories implemented in hardware keep track of a fixed number
    of cached copies of each memory block. When the capacity of a directory
    entry is exceeded, the directory interrupts the local processor and a
    full-map directory is emulated in software (see the sketch below).
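Reusing the limited_entry sketch from the previous slide, the LimitLESS split
can be caricatured as follows: the hardware pointer array handles the common
case, and a pointer overflow interrupts the local processor, whose trap
handler maintains a full-map bit vector in local memory. All names here are
invented for illustration, not Alewife's actual interface:

    #include <stdint.h>

    struct limitless_entry {
        struct limited_entry hw;     /* hardware pointer array (common case) */
        uint64_t *sw_bitvector;      /* allocated by the trap handler on overflow */
    };

    /* Stand-in for raising the LimitLESS trap on the local (home) processor. */
    extern void interrupt_local_processor(struct limitless_entry *e,
                                          uint16_t node_id);

    static void limitless_add_reader(struct limitless_entry *e, uint16_t node_id) {
        if (e->sw_bitvector) {
            /* Entry already overflowed: software emulates a full map. */
            e->sw_bitvector[node_id / 64] |= (uint64_t)1 << (node_id % 64);
        } else if (!limited_add_reader(&e->hw, node_id)) {
            /* Exceptional case: hardware pointers exhausted, trap to software. */
            interrupt_local_processor(e, node_id);
        }
    }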

10
LimitLESS Cache Coherence Protocol
11
[Figure: directory state transition diagram with states Read-Only,
Read-Write, Read Transaction, and Write Transaction; the numbered arcs
correspond to the transitions in the table below.]

Annotated transitions (P = the set of pointers in the directory entry,
AckCtr = acknowledgment counter; "i -> MSG" means node i sends message MSG
to the directory). A C sketch of a few of these transitions follows the
table.

Transition  Input Message  Precondition                Entry Change             Output Message(s)
1           i -> RREQ      --                          P = P U {i}              RDATA -> i
2           i -> WREQ      P = {i}                     --                       WDATA -> i
            i -> WREQ      P = {}                      P = {i}                  WDATA -> i
3           i -> WREQ      P = {k1,..,kn}, i not in P  P = {i}, AckCtr = n      INV -> kj (all kj)
            i -> WREQ      P = {k1,..,kn}, i in P      P = {i}, AckCtr = n - 1  INV -> kj (all kj != i)
4           j -> WREQ      P = {i}                     P = {j}, AckCtr = 1      INV -> i
5           j -> RREQ      P = {i}                     P = {j}, AckCtr = 1      INV -> i
6           i -> REPM      P = {i}                     P = {}                   --
7           j -> RREQ      --                          --                       BUSY -> j
            j -> WREQ      --                          --                       BUSY -> j
            j -> ACKC      AckCtr != 1                 AckCtr = AckCtr - 1      --
            j -> REPM      --                          --                       --
8           j -> ACKC      AckCtr = 1, P = {i}         AckCtr = 0               WDATA -> i
            j -> UPDATE    P = {i}                     AckCtr = 0               WDATA -> i
9           j -> RREQ      --                          --                       BUSY -> j
            j -> WREQ      --                          --                       BUSY -> j
            j -> REPM      --                          --                       --
10          j -> UPDATE    P = {i}                     AckCtr = 0               RDATA -> i
            j -> ACKC      P = {i}                     AckCtr = 0               RDATA -> i

Transitions 1-3 are taken from the Read-Only state, 4-6 from the Read-Write
state, 7-8 from the Write Transaction state, and 9-10 from the Read
Transaction state. When the number of sharers n exceeds the number of
hardware pointers p (n > p), the LimitLESS software extension described
later takes over.
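To make the table concrete, here is a condensed C sketch of how a home-node
directory controller might implement a few of these transitions (1-3 for
Read-Only and 7-8 for Write Transaction). The message names follow the
table; the bit-vector representation of P, the send() helper, and the node
limit are assumptions made for the sketch, not the paper's hardware:

    #include <stdint.h>

    enum dir_state { READ_ONLY, READ_WRITE, READ_TRANS, WRITE_TRANS };
    enum msg_type  { RREQ, WREQ, REPM, ACKC, UPDATE, RDATA, WDATA, INV, BUSY };

    #define MAX_NODES 64

    struct dir_entry {
        enum dir_state state;
        uint64_t       pointers;          /* sharer set P, one bit per node */
        int            ack_ctr;           /* AckCtr from the table */
    };

    extern void send(enum msg_type m, int dest);    /* network send (assumed) */

    /* Handle one incoming message "i -> msg" for this directory entry. */
    static void directory_handle(struct dir_entry *e, enum msg_type msg, int i) {
        switch (e->state) {
        case READ_ONLY:
            if (msg == RREQ) {                           /* transition 1 */
                e->pointers |= (uint64_t)1 << i;
                send(RDATA, i);
            } else if (msg == WREQ) {                    /* transitions 2 and 3 */
                int n = __builtin_popcountll(e->pointers);
                if (e->pointers == 0 || e->pointers == ((uint64_t)1 << i)) {
                    e->pointers = (uint64_t)1 << i;      /* requester is sole sharer */
                    e->state = READ_WRITE;
                    send(WDATA, i);
                } else {
                    for (int k = 0; k < MAX_NODES; k++)  /* invalidate other sharers */
                        if (((e->pointers >> k) & 1) && k != i)
                            send(INV, k);
                    e->ack_ctr  = ((e->pointers >> i) & 1) ? n - 1 : n;
                    e->pointers = (uint64_t)1 << i;
                    e->state    = WRITE_TRANS;
                }
            }
            break;
        case WRITE_TRANS:
            if (msg == ACKC) {                           /* transitions 7 and 8 */
                if (--e->ack_ctr == 0) {
                    int writer = __builtin_ctzll(e->pointers);  /* sole bit in P */
                    e->state = READ_WRITE;
                    send(WDATA, writer);                 /* deliver data to the writer */
                }
            } else if (msg == RREQ || msg == WREQ) {
                send(BUSY, i);                           /* transaction in progress */
            }
            break;
        default:
            break;                                       /* remaining cases omitted */
        }
    }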
12
Example
  • Suppose that on a multiprocessor system, variable X is homed on P0 and
    is held in the read-only state by P1 and P2. If P3 executes the
    statement X = 3, what sequence of messages is generated between the
    four processors?

  1. P3 sends WREQ to P0. P = {P3}.
  2. P0 sends INV to P1 and INV to P2. P = {P3}; X is in the Write
     Transaction state.
  3. P1 sends ACK to P0 and P2 sends ACK to P0. P = {P3}; X is in the Write
     Transaction state.
  4. P0 sends WDATA to P3. P = {P3}; X is in the Read-Write state.

  [Figure: processors P0-P3 with the WREQ, INV, ACK, and WDATA messages
  drawn as arrows.]
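If one wanted to replay this example against the directory_handle() sketch
given after the transition table (purely illustrative, and relying on that
earlier sketch for its types), the driver would look roughly like this:

    /* X is homed on P0; P1 and P2 hold read-only copies; P3 writes. */
    struct dir_entry x = { READ_ONLY, (1u << 1) | (1u << 2), 0 };

    directory_handle(&x, WREQ, 3);   /* steps 1-2: P = {P3}, INV -> P1, INV -> P2 */
    directory_handle(&x, ACKC, 1);   /* step 3: first acknowledgment */
    directory_handle(&x, ACKC, 2);   /* steps 3-4: last ack, WDATA -> P3, Read-Write */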
13
Architectural Features
  • Alewife is a large-scale multiprocessor with distributed shared memory
    and a cost-effective mesh network for communication.
  • An Alewife node consists of a 33 MHz SPARCLE processor, 64 Kbytes of
    direct-mapped cache, 4 Mbytes of globally shared main memory, and a
    floating-point coprocessor.
14
Architectural Features
  • Rapid trap handling (five to ten cycles): a rapid context-switching
    processor with a finely tuned software trap architecture.
  • Processor-to-network interface: in order to emulate the protocol
    functions normally performed by the hardware directory, the processor
    must be able to send and receive messages from the interconnection
    network.

15
Architectural Features
  • Interprocessor-Interrupt (IPI)
  •   Provides the processor with a superset of the network functionality
      needed by the cache coherence hardware.
  •   Used to send cache coherence packets and preemptive messages to
      remote processors.
  • Network packet structure
  •   Protocol opcode: cache coherence traffic.
  •   Interrupt opcode: interprocessor messages.
  • Transmission of IPI packets: requests are enqueued on the IPI output
    queue.
  • Reception of IPI packets: packets are placed in the IPI input queue.

  IPI packet format: Source Processor | Packet Length | Opcode |
  Operand 1 | Operand 2 | ... | Operand m-1 | Data Word 1 | Data Word 2 |
  ... | Data Word n-1
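As a hypothetical C rendering of this layout (the field widths are guesses;
the slide only fixes the field order), the packet is a small header followed
by a variable number of operands and data words:

    #include <stdint.h>

    /* Field order follows the slide; widths are illustrative only. */
    struct ipi_packet {
        uint16_t source;          /* source processor */
        uint16_t length;          /* packet length in words */
        uint16_t opcode;          /* protocol opcode or interrupt opcode */
        uint32_t words[];         /* operands 1..m-1, then data words 1..n-1 */
    };

Sending such a packet amounts to enqueueing it on the IPI output queue;
receiving one places it on the IPI input queue, where either the coherence
hardware or a software handler consumes it, depending on the opcode.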
16
Architectural Features

  [Figure: the IPI packet format in detail, as shown on the previous slide.]
17
Architectural Features
  • LimitLESS Trap Handler (see the sketch below)
  • First-time overflow
  •   The trap code allocates a full-map bit vector in local memory.
  •   Empty all hardware pointers and set the corresponding bits in the
      vector.
  •   The directory mode is set to Trap-On-Write before the trap returns.
  • Additional overflows
  •   Empty all hardware pointers and set the corresponding bits in the
      vector.
  • Termination (on a WREQ or a local write fault)
  •   Empty all hardware pointers.
  •   Record the identity of the requester in the directory.
  •   Set the AckCtr to the number of bits in the vector that are set.
  •   Place the directory in Normal Mode, Write Transaction state.
  •   Invalidate all caches whose bit is set in the vector.
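The handler logic above can be written out as C-style pseudocode. Every
helper below is an invented stand-in (the real handler manipulates Alewife's
directory and network hardware directly), and the node and pointer counts
are assumptions:

    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_POINTERS 4
    #define NUM_NODES    64

    /* Invented stand-ins for hardware directory operations. */
    extern int  read_and_empty_hw_pointer(int slot);   /* node ID, or -1 if empty */
    extern void set_directory_mode_trap_on_write(void);
    extern void set_directory_normal_write_transaction(int requester, int ack_ctr);
    extern void send_invalidate(int node);

    static uint64_t *sw_vector;     /* full-map bit vector kept in local memory */

    static void spill_hw_pointers(void) {
        for (int s = 0; s < NUM_POINTERS; s++) {        /* empty all hw pointers */
            int node = read_and_empty_hw_pointer(s);
            if (node >= 0)
                sw_vector[node / 64] |= (uint64_t)1 << (node % 64);
        }
    }

    /* First-time and additional overflows. */
    void limitless_overflow_trap(void) {
        if (!sw_vector)                                 /* first overflow only */
            sw_vector = calloc((NUM_NODES + 63) / 64, sizeof(uint64_t));
        spill_hw_pointers();
        set_directory_mode_trap_on_write();
    }

    /* Termination: a WREQ (or local write fault) arrives while in trap mode.
     * Called only after at least one overflow, so sw_vector is allocated. */
    void limitless_termination_trap(int requester) {
        int ack_ctr = 0;
        spill_hw_pointers();
        for (int node = 0; node < NUM_NODES; node++)
            if ((sw_vector[node / 64] >> (node % 64)) & 1) {
                send_invalidate(node);                  /* invalidate every set bit */
                ack_ctr++;
            }
        /* Record the requester, set AckCtr, and return the directory to
         * Normal Mode in the Write Transaction state. */
        set_directory_normal_write_transaction(requester, ack_ctr);
    }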

18
Performance
  • The performance of limited, LimitLESS, and full-map directories is
    compared.
  • Performance results are obtained from complete Alewife machine
    simulations.
  • Results indicate that the LimitLESS directory performs well even when
    software is not properly optimized.

19
Conclusion
  • This paper proposes a new scheme for cache coherence, called LimitLESS,
    which is being implemented in the Alewife machine.
  • Hardware requirements include rapid trap handling and a flexible
    processor interface to the network.
  • Preliminary simulation results indicate that the LimitLESS scheme
    approaches the performance of a full-map directory protocol with the
    memory efficiency of a limited directory protocol.
  • Furthermore, the LimitLESS scheme provides a migration path toward a
    future in which cache coherence is handled entirely in software.