LimitLESS Directories: A Scalable Cache Coherence Scheme

1
LimitLESS Directories: A Scalable Cache Coherence Scheme
  • By David Chaiken, John Kubiatowicz, and Anant Agarwal

Presented by Sampath Rudravaram
2
Cache Coherence
  • The gap between the computing power of
    microprocessors and that of the largest
    supercomputers is shrinking, while the
    price/performance advantage of microprocessors is
    increasing.
  • Caches enhance the performance of
    multiprocessors by reducing network traffic and
    average memory access time.
  • Cache coherence problems arise because multiple
    processors may be reading and modifying the same
    memory block within their own caches.
  • Common solutions
  • Snoopy coherence
  • Directory-based coherence (the approach covered here)
  • Compiler-directed coherence

3
Directory (Full-map)
  • Message-based protocols allocate a section of the
    system's memory, called the directory.
  • Each block of memory has an associated directory
    entry which contains a bit for each cache in the
    system.
  • That bit indicates whether or not the associated
    cache contains a copy of memory block
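
The full-map entry described on this slide can be sketched as a small bit vector. This is only an illustration: the class name `FullMapEntry` and its methods are hypothetical and do not come from the presentation.

```python
class FullMapEntry:
    """Directory entry for one memory block: one presence bit per cache."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.bits = 0  # bit i set => cache i holds a copy of the block

    def _check(self, cache_id):
        if not 0 <= cache_id < self.num_caches:
            raise ValueError(f"cache id {cache_id} out of range")

    def add_sharer(self, cache_id):
        self._check(cache_id)
        self.bits |= 1 << cache_id

    def remove_sharer(self, cache_id):
        self._check(cache_id)
        self.bits &= ~(1 << cache_id)

    def has_copy(self, cache_id):
        self._check(cache_id)
        return bool((self.bits >> cache_id) & 1)

    def sharers(self):
        """List the caches whose presence bit is set, in ascending order."""
        return [i for i in range(self.num_caches) if (self.bits >> i) & 1]
```

Because the entry has a bit for every cache in the system, it can record any number of sharers without losing information.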

4
Directory based Coherence
  • The basic concept is that a processor must ask
    for permission to load an entry from the primary
    memory to its cache.
  • When an entry is changed the directory must be
    notified either before the change is initiated or
    when it is complete.
  • When an entry is changed the directory either
    updates or invalidates the other caches with that
    entry.
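
A minimal sketch of the invalidation variant of these steps, assuming a directory that simply keeps a set of sharers per block. The names `Directory`, `read_request`, and `write_request` are hypothetical, and a real protocol would also track states and acknowledgments.

```python
class Directory:
    """Tracks which caches hold each block and invalidates copies on writes."""

    def __init__(self):
        self.sharers = {}  # block address -> set of cache ids holding a copy

    def read_request(self, cache_id, block):
        # A cache asks for permission before loading the block; the
        # directory records it as a sharer.
        self.sharers.setdefault(block, set()).add(cache_id)

    def write_request(self, cache_id, block):
        # Before the write proceeds, every other cached copy is invalidated.
        others = self.sharers.get(block, set()) - {cache_id}
        self.sharers[block] = {cache_id}
        return sorted(others)  # caches that must receive an invalidation
```

For example, if caches 0, 1, and 2 have read a block and cache 1 then writes it, the directory returns caches 0 and 2 as invalidation targets and keeps only cache 1 as a sharer.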

5
Directory based Coherence
Full-map directory entry: State | 1 | 2 | 3 | ... | N
  • Advantages:
  • -> No broadcast is necessary
  • Disadvantages:
  • -> Coherence traffic is high because all requests go to the
    directory
  • -> Large memory requirement (size grows as Θ(N^2))
6
Directory based Coherence
Limited directory entry: State | Node ID | Node ID | Node ID | Node ID
  • Advantages:
  • -> Its performance is comparable to that of a
    full-map scheme when there is limited
    sharing of data between processors
  • -> Cheaper to implement
  • Disadvantages:
  • -> The protocol is susceptible to thrashing when
    the number of processors sharing data exceeds the
    number of pointers in the directory entry
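
The thrashing case can be illustrated with a small sketch. It assumes that a full entry makes room by evicting its oldest pointer; that replacement policy, the class name `LimitedEntry`, and the `evictions` counter are assumptions for illustration only.

```python
class LimitedEntry:
    """Directory entry with a fixed number of node-ID pointers."""

    def __init__(self, num_pointers):
        self.num_pointers = num_pointers
        self.pointers = []   # node IDs of caches holding a copy
        self.evictions = 0   # how many cached copies were forced out

    def add_sharer(self, node_id):
        """Record a reader; return the node evicted to make room, if any."""
        if node_id in self.pointers:
            return None
        evicted = None
        if len(self.pointers) == self.num_pointers:
            # Assumed policy: invalidate the oldest sharer to free a pointer.
            evicted = self.pointers.pop(0)
            self.evictions += 1
        self.pointers.append(node_id)
        return evicted
```

With four pointers and five nodes reading the same block in turn, every read after the first four evicts a sharer that is about to read again, which is the thrashing behavior described above. With only four readers, no evictions occur.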

7
LimitLESS (Limited directory Locally Extended
through Software Support)
  • The LimitLESS scheme attempts to combine the
    full-map and limited directory ideas in order to
    achieve a robust yet affordable and scalable
    cache coherence solution.
  • The main idea behind this method is to handle
    the common case in hardware and the exceptional
    case in software.
  • Limited directories implemented in hardware
    keep track of a fixed number of cached copies of
    each memory block. When the capacity of a directory
    entry is exceeded, the directory interrupts the
    local processor and a full-map directory is
    emulated in software.

8
(Figures: protocol messages for hardware coherence; directory states;
annotation of the state transition diagram)
9
Architectural Features LimitLESS
  • Alewife is a large-scale multiprocessor with
    distributed shared memory and a cost-effective
    mesh network for communication.
  • An Alewife node consists of a 33 MHz SPARCLE
    processor, 64K bytes of direct-mapped cache, 4M
    bytes of globally shared main memory, and a
    floating-point coprocessor.

11
A 16-node Alewife machine
A 128-node Alewife Chassis
12
Architectural Features LimitLESS
  • The processor must be capable of rapid trap handling
    (five to ten cycles), supported by:
  • A rapid context-switching processor
  • A finely tuned software trap architecture
  • The processor needs complete access to
    coherence-related controller state.
  • The directory controller must be able to
    invoke processor trap handlers when necessary.
  • An interface to the network that allows the
    processor to launch and to intercept coherence
    protocol packets:
  • IPI (Interprocessor Interrupt)
(Figure: processor-controller interface, with condition bits, trap lines,
data bus, and address bus between the processor and the controller)
13
Architectural Features LimitLESS
  • IPI provides a superset of the network functionality
  • -> Used to send and receive cache protocol
    packets
  • -> Used to send preemptive messages to remote
    processors
  • Network packet structure
  • Protocol opcode
  • -> For cache coherence traffic
  • Interrupt opcode
  • -> For interprocessor messages
  • Transmission of IPI packets
  • -> Enqueue the request on the IPI output queue
  • Reception of IPI packets
  • -> Place the packet in the IPI input queue
  • IPI input traps are synchronous.

Packet layout: Source processor | Packet length | Opcode |
Operand 1 | Operand 2 | ... | Operand m-1 |
Data word 1 | Data word 2 | ... | Data word n-1
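
The packet layout and the two IPI queues can be sketched as plain data structures. Everything named here (`Packet`, `IpiInterface`, the `kind` field, and the way `length` is counted) is a hypothetical illustration, not the Alewife hardware format.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List

PROTOCOL = "protocol"    # opcode class for cache coherence traffic
INTERRUPT = "interrupt"  # opcode class for interprocessor messages


@dataclass
class Packet:
    source: int                      # source processor
    kind: str                        # PROTOCOL or INTERRUPT
    opcode: str
    operands: List[int] = field(default_factory=list)
    data: List[int] = field(default_factory=list)

    @property
    def length(self):
        # Assumption: length counts one header word plus operands and data.
        return 1 + len(self.operands) + len(self.data)


class IpiInterface:
    """Output and input queues between the processor and the network."""

    def __init__(self):
        self.output_queue = deque()
        self.input_queue = deque()

    def send(self, packet):
        # Transmission: enqueue the request on the IPI output queue.
        self.output_queue.append(packet)

    def receive(self, packet):
        # Reception: place the packet in the IPI input queue.
        self.input_queue.append(packet)
```

A write request for a block could then be built as `Packet(source=3, kind=PROTOCOL, opcode="WREQ", operands=[0x40])` and handed to `send`.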
14
Queue based diagram of the Alewife controller
15
Meta States Trap Handler
  • Meta States
  • Trap Handler
  • First-time overflow
  • - The trap code allocates a full-map
    bit vector in local memory.
  • - Empty all hardware pointers and set the
    corresponding bits in the vector
  • - Directory mode is set to Trap-On-Write
    before the trap returns
  • Additional overflow
  • - Empty all hardware pointers and set the
    corresponding bits in the vector
  • Termination (on WREQ or local write fault)
  • - Empty all hardware pointers
  • - Record the identity of the requester in the
    directory
  • - Set the AckCtr to the number of bits in the
    vector that are set
  • - Place the directory in Normal mode, Write
    Transaction state.
  • - Invalidate all caches with the bit set in the
    vector
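
The three trap-handler cases above can be sketched as follows. This is a simplified model under stated assumptions: the software bit vector is represented as a Python set, the hardware pointers are folded into it on termination so that no sharer is missed, and the class and attribute names are hypothetical.

```python
NORMAL = "Normal"
TRAP_ON_WRITE = "Trap-On-Write"
WRITE_TRANSACTION = "Write Transaction"


class LimitLessEntry:
    """Limited hardware pointers, extended by a software full-map vector."""

    def __init__(self, num_pointers):
        self.num_pointers = num_pointers
        self.pointers = []     # hardware pointers (node IDs)
        self.mode = NORMAL     # meta state
        self.state = None
        self.vector = None     # software full-map vector (a set of node IDs)
        self.ack_ctr = 0
        self.owner = None

    def read_request(self, node):
        if node in self.pointers:
            return
        if len(self.pointers) == self.num_pointers:
            self._overflow_trap()
        self.pointers.append(node)

    def _overflow_trap(self):
        if self.vector is None:
            self.vector = set()            # first-time overflow: allocate
        self.vector.update(self.pointers)  # set the corresponding bits
        self.pointers.clear()              # empty all hardware pointers
        self.mode = TRAP_ON_WRITE

    def write_request(self, node):
        """Return the node IDs that must be sent invalidations."""
        if self.mode == TRAP_ON_WRITE:
            # Termination trap. Assumption: remaining pointers are folded
            # into the vector before it is read out and released.
            self.vector.update(self.pointers)
            targets = sorted(self.vector)
            self.vector = None
            self.mode = NORMAL
        else:
            targets = sorted(self.pointers)  # hardware path, simplified
        self.pointers.clear()
        self.owner = node                    # record the requester
        self.ack_ctr = len(targets)          # invalidations still to be acked
        self.state = WRITE_TRANSACTION
        return targets
```

With two hardware pointers, five readers trigger one first-time overflow and one additional overflow; a later write by a sixth node returns all five readers as invalidation targets and puts the entry back in Normal mode. How the requester's own bit is treated when it is already a sharer is left out of this sketch.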

16
PERFORMANCE MEASUREMENT
  • Comparison of the performance of
    limited, LimitLESS, and full-map directories.
  • Evaluated in terms of the total number of cycles
    needed to execute an application on a
    64-processor Alewife machine.

17
Measurement Technique
ASIM, the Alewife System Simulator
18
Performance Results
-> Four-pointer limited protocol, full-map
protocol, and LimitLESS scheme with Ts = 50
-> 64-node Alewife machine with 64K-byte caches and
a 2D mesh network
19
Performance Results (contd..)
-> Result when the variable in Weather is not
optimised.
20
Performance Results (contd..)
-> Result when the variable in Weather is
optimised.
21
Performance Results (Contd..)
-> Result with an emulation latency of 50 for the
LimitLESS protocol.
22
Conclusion
  • This paper proposed a new scheme for cache
    coherence, called LimitLESS, which is being
    implemented in the Alewife machine.
  • Hardware requirements include rapid trap handling
    and a flexible processor interface to the
    network.
  • Preliminary simulation results indicate that the
    LimitLESS scheme approaches the performance of a
    full-map directory protocol with the memory
    efficiency of a limited directory protocol.
  • Furthermore, the LimitLESS scheme provides a
    migration path toward a future in which cache
    coherence is handled entirely in software.