Title: On-Chip COMA Shared Memory Systems for Many-Core Processors
1. On-Chip COMA Shared Memory Systems for Many-Core Processors
- Li Zhang
- Computer System Architecture Group
- University of Amsterdam
2. Overview
- Introduction
- Microthreading model
- On-chip memory system
  - Network structure
  - Cache and directory structure
- Location consistency model
- On-chip cache coherence protocol
  - Cache protocol
  - Directory protocol
- Design specifics
- Results and issues
- Future work
3. Introduction: Microgrid CMP
- The Microthreaded architecture
  - Designed for many-core processors
  - Simple in-order pipelined processing cores
  - Extended ISA to support thread creation and management (see the sketch below)
  - Microthreads as the smallest unit of work for execution
  - Dynamically scheduled to processing resources (places)
  - Context switch during long-latency operations
  - Synchronization at the register level
- Maintains scalability to a large number of processors, in terms of both power and performance
- Communication over three independent networks
  - Resource network: resource management (allocate and release resources)
  - Processor network: thread management (creation, synchronization, etc.)
  - Memory network: memory access and coherence
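To make the family create/sync pattern concrete, here is a minimal C++ sketch of the thread-management extensions listed above. On the real Microgrid these are ISA instructions, not library calls; the names create_family and sync_family are stand-ins invented for illustration, and the sequential loop merely stands in for hardware scheduling.

```cpp
#include <cstdio>

// Stand-ins for the Microgrid's family-management ISA extensions.
using FamilyId = int;

FamilyId create_family(int start, int limit, void (*thread)(int)) {
    // Hardware would allocate a place and schedule one microthread per
    // index; this stub just runs the bodies sequentially for illustration.
    for (int i = start; i < limit; ++i) thread(i);
    return 0;
}

void sync_family(FamilyId) {
    // On hardware: block until every thread in the family has completed.
}

void body(int i) { std::printf("microthread %d\n", i); }

int main() {
    // Create a family of microthreads over the index range [0, 8), then
    // barrier on its completion; creation and sync are the memory ordering
    // points referred to on slide 11.
    FamilyId f = create_family(0, 8, body);
    sync_family(f);
    return 0;
}
```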
4. Introduction: Microgrid CMP (cont.)
5. Introduction: Microgrid CMP (cont.)
6. On-chip COMA Shared Memory System - Network
7. On-chip COMA Shared Memory System - Structures
8. On-chip COMA Memory System
- Set-associative caches, with pipelined logic serving multiple requests
- Bit mapping of valid segments to facilitate content updating
- Shared local bus, without snooping
  - Each cache is associated with a number of local processors
- Ring network shifts requests along by one node every cycle (see the sketch below)
  - A network node always prioritizes network requests
- Each cache has request queues to buffer, in an appropriate order, requests that cannot be processed immediately
- The directory has request queues (linked lists) that can be associated with each cache line to limit the requests going through the chip interface
- A remote request should receive a reply within one round trip of the ring
- All buffers/queues comply with the Location Consistency model
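A minimal sketch of the ring arbitration described above, assuming a unidirectional ring where each node forwards at most one request per cycle; RingNode and Request are hypothetical names, and the cache lookup itself is elided.

```cpp
#include <deque>
#include <optional>

// Hypothetical request token circulating on the ring.
struct Request {
    unsigned long address;   // target cache line address
    int           origin;    // node that injected the request
};

// One node on the unidirectional ring. Each cycle it forwards at most one
// request to its successor; an incoming network request always wins over
// a locally buffered one, matching the priority rule on this slide.
class RingNode {
public:
    std::deque<Request> localQueue;          // requests from local processors

    // Called once per cycle with the request arriving from the predecessor
    // (if any); returns the request to place on the outgoing link.
    std::optional<Request> cycle(std::optional<Request> incoming) {
        if (incoming) {
            // Network traffic has priority: handle or forward it first.
            return serviceOrForward(*incoming);
        }
        if (!localQueue.empty()) {
            // The slot is free, so a buffered local request may be injected.
            Request r = localQueue.front();
            localQueue.pop_front();
            return r;
        }
        return std::nullopt;                 // empty slot travels onward
    }

private:
    std::optional<Request> serviceOrForward(Request r) {
        // A real node would consult its cache here; in this sketch every
        // request simply continues around the ring.
        return r;
    }
};
```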
9. Memory Consistency Models
- Memory consistency
  - Defines the behavior and correctness of a program
  - Enforces a partial order on memory accesses
- Consistency models
  - Strict Consistency
  - Sequential Consistency
  - Processor Consistency
  - Release Consistency
  - Location Consistency
10. Location Consistency (LC)
- Definition
  - A multiprocessor system is location consistent if, for any execution of a program on the system:
  - 1) the operations of the execution are the same as those of some location-consistent execution of the program, and
  - 2) for each read operation R (with target location L) executed on the multiprocessor, the result returned by R belongs to the value set V specified by the state of the memory location L, as maintained by the abstract interpreter in the corresponding location-consistent execution
- Properties (illustrated in the sketch below)
  - Removes ordering constraints between independent addresses
  - Gives the compiler more freedom to reorder the program
  - Relaxes the constraints on the memory system implementation
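To make the value-set definition concrete, below is a simplified C++ sketch of an LC-style abstract interpreter for a single location: unsynchronized writes add values to the location's set, a read may legally return any member, and a synchronizing write collapses the set. This follows the published LC semantics only in broad strokes and is not the presentation's own formalism.

```cpp
#include <set>
#include <cassert>

// Simplified LC abstract interpreter for ONE memory location.
// State: the set of values a read is currently allowed to return.
class LCLocation {
    std::set<int> valueSet;      // value set V for this location
public:
    explicit LCLocation(int init) { valueSet.insert(init); }

    // An unordered write ADDS its value: earlier values remain legal
    // results for concurrent readers.
    void write(int v) { valueSet.insert(v); }

    // A synchronization that orders a write before subsequent reads
    // collapses the set to that single value.
    void sync(int v) {
        valueSet.clear();
        valueSet.insert(v);
    }

    // A read is LC-correct iff it returns some member of the set.
    bool readAllowed(int v) const { return valueSet.count(v) != 0; }
};

int main() {
    LCLocation x(0);
    x.write(1);                  // processor P1 writes 1, unsynchronized
    x.write(2);                  // processor P2 writes 2, unsynchronized
    assert(x.readAllowed(0));    // all three values are legal results
    assert(x.readAllowed(1));
    assert(x.readAllowed(2));
    x.sync(2);                   // synchronization orders the write of 2
    assert(!x.readAllowed(1));   // now only 2 may be returned
    assert(x.readAllowed(2));
    return 0;
}
```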
11. Location Consistency on the Microgrid
- Family creation and synchronization serve as barriers and are controlled by the processors
- The order they impose is guaranteed within the memory system
- Memory accesses to different addresses from the same thread may complete out of order
- Memory accesses to the same address without an imposed order may complete out of order
12. Cache Coherence Protocol - Cache
13. Cache Coherence Protocol - Directory
- Each directory has both an above and a below interface
- The root directory interacts directly with the memory controller
14. Simplified Protocol Verification (with Vu Duong)
- Scale the block down to a single value
  - This removes the RE and ER transactions and two cache-block states
- Proof with the support of line counters in the directory
15. Policies to Obey on Normal Cache Lines
- RE, ER and IV transactions (the update transactions) always carry updates
- Cache lines in a locking state should be updated partially by passing update transactions (see the merge sketch below)
- A cache line, whether in a normal or a locking state, can give a reply without overwriting the update information in the request
- Concurrent updates from different caches can be carried out together, but always with the same winner
  - The winner keeps the line in the normal locking state; the loser keeps the line in the invalidated locking state
- A write-back request received by a line in the WritePendingI state is regarded as an ownership transfer; the request is then not forwarded to the backing store
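A hedged sketch of how such partial updates could be merged, using the per-segment valid bits mentioned on slide 8: an update transaction overwrites only the segments its mask marks as written, so a locally dirty segment is never lost. The segment count and field names are assumptions for illustration.

```cpp
#include <array>
#include <cstdint>

constexpr int SEGMENTS = 8;                  // segments per cache line (assumed)

// Hypothetical cache line with one valid bit per segment, following the
// bit-mapping scheme on slide 8.
struct CacheLine {
    std::array<uint32_t, SEGMENTS> data{};
    uint32_t validMask = 0;                  // bit i set => segment i is valid
};

// An update transaction carries the written segments plus a mask naming
// the segments it actually updates.
struct UpdateTxn {
    std::array<uint32_t, SEGMENTS> data{};
    uint32_t writeMask = 0;
};

// Apply an update to a line in a locking state: only the segments named by
// the transaction's mask are overwritten.
void applyPartialUpdate(CacheLine& line, const UpdateTxn& txn) {
    for (int i = 0; i < SEGMENTS; ++i) {
        if (txn.writeMask & (1u << i)) {
            line.data[i] = txn.data[i];
            line.validMask |= (1u << i);     // segment becomes valid
        }
    }
}
```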
16. Buffering Directory Requests under LC
- While its data is being loaded, a directory line is locked in the reserved state
- Subsequent requests for the line are buffered in a linked list (sketched below)
  - Incoming requests from the network have to be appended at the tail, without processing
  - Previously buffered requests have to be re-appended at the head
- An active-line queue keeps track of the lines that can be served
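A minimal sketch of this buffering discipline, assuming a simple double-ended queue per directory line in place of the hardware linked list; all names are illustrative and the protocol action itself is elided.

```cpp
#include <deque>
#include <queue>

struct Request { unsigned long address; int origin; };

// Hypothetical directory line that buffers requests while its data is
// being loaded, per the LC buffering rules on this slide.
struct DirLine {
    bool reserved = false;               // locked while data is in flight
    std::deque<Request> pending;         // FIFO of deferred requests
};

struct Directory {
    std::queue<DirLine*> activeLines;    // lines whose requests may be served

    // A request arriving from the network while the line is reserved is
    // appended at the TAIL without being processed.
    void onNetworkRequest(DirLine& line, const Request& r) {
        if (line.reserved) line.pending.push_back(r);
        else               serve(line, r);
    }

    // A previously buffered request that must be retried rejoins at the
    // HEAD, preserving the order among deferred requests.
    void replay(DirLine& line, const Request& r) {
        line.pending.push_front(r);
    }

    // When the loaded data arrives, the line is unlocked and queued so its
    // buffered requests can be drained in order.
    void onDataArrival(DirLine& line) {
        line.reserved = false;
        activeLines.push(&line);
    }

    void serve(DirLine&, const Request&) { /* protocol action elided */ }
};
```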
17. Victim Buffer and Network Bypassing Buffer
- Victim buffer (sketched below)
  - Holds evicted blocks
  - On a local write, the data is removed
  - On a network invalidation, the data is removed
  - For data consistency, it can only serve local requests, not remote requests
- Skipping buffer
  - Holds addresses that missed on network requests
  - Avoids going through the whole network-side access logic
  - A hit in this buffer passes the request directly to the next node
  - An entry is removed on a local request
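A sketch of the victim-buffer rules above, with hypothetical names; the essential point is that lookups succeed only for local requests, while local writes and network invalidations both drop the entry.

```cpp
#include <unordered_map>
#include <cstdint>
#include <optional>

// Hypothetical victim buffer holding recently evicted blocks.
class VictimBuffer {
    std::unordered_map<uint64_t, uint64_t> blocks;   // address -> data
public:
    void insertEvicted(uint64_t addr, uint64_t data) { blocks[addr] = data; }

    // Only LOCAL requests may hit here; serving remote (network) requests
    // from evicted copies could violate consistency.
    std::optional<uint64_t> lookup(uint64_t addr, bool isLocalRequest) {
        if (!isLocalRequest) return std::nullopt;
        auto it = blocks.find(addr);
        if (it == blocks.end()) return std::nullopt;
        return it->second;
    }

    // Both a local write and a network invalidation remove the entry.
    void onLocalWrite(uint64_t addr)          { blocks.erase(addr); }
    void onNetworkInvalidation(uint64_t addr) { blocks.erase(addr); }
};
```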
18. Shared Local Bus
- Under LC, local bus snooping is unnecessary
  - Current data can still be read from the D-cache even while another write request is on its way
  - Invalidations are broadcast only when necessary
- Two policies
  - Eager: always keeps L2 and L1 caches consistent, and only backward-invalidates when a valid line is invalidated, evicted or written back
  - Lazy: broadcasts invalidations from the network without keeping the caches consistent; evicted or written-back lines are not broadcast
- Backward invalidation (BI) buffer for the lazy policy (sketched below)
  - Avoids broadcasting backward invalidations to the processors
  - Buffers the most recent BI addresses
  - A read reply invalidates the matching entry in the buffer
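The sketch below illustrates the lazy policy's BI buffer under stated assumptions: a network invalidation is recorded rather than broadcast, and a read reply clears the matching entry. The capacity and the rule that an overflowing entry must finally be broadcast are assumptions not on the slide.

```cpp
#include <deque>
#include <optional>
#include <algorithm>
#include <cstdint>
#include <cstddef>

// Hypothetical backward-invalidation (BI) buffer for the lazy policy:
// instead of broadcasting every network invalidation to the L1 caches,
// the L2 records the most recent invalidated addresses here.
class BIBuffer {
    std::deque<uint64_t> addrs;                   // most recent BI addresses
    static constexpr std::size_t capacity = 16;   // assumed size
public:
    // Buffer a network invalidation instead of broadcasting it; if the
    // buffer overflows, the oldest address is returned so the caller can
    // perform the real broadcast (an assumed overflow policy).
    std::optional<uint64_t> onNetworkInvalidation(uint64_t addr) {
        addrs.push_back(addr);
        if (addrs.size() > capacity) {
            uint64_t evicted = addrs.front();
            addrs.pop_front();
            return evicted;
        }
        return std::nullopt;
    }

    // A read reply delivers fresh data for the line, so the pending
    // backward invalidation for that address becomes unnecessary.
    void onReadReply(uint64_t addr) {
        addrs.erase(std::remove(addrs.begin(), addrs.end(), addr),
                    addrs.end());
    }
};
```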
19. Comparing the Implemented Consistency Model
- Comparison with sequential consistency (see the litmus test below)
  - Sequential consistency requires a single global order for all writes
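A classic store-buffering litmus test makes the comparison concrete: under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible, while a model that, like LC, imposes no order between independent addresses permits it. The C++ snippet below mimics this with relaxed atomics; it is a generic illustration, not taken from the presentation.

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

// Store-buffering litmus test on two independent locations (both 0).
// Relaxed atomics let the compiler/hardware reorder the accesses, much as
// LC allows for accesses to independent addresses.
std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    // Under sequential consistency, some interleaving of the four
    // operations must explain the result, so r1 == 0 && r2 == 0 cannot
    // occur; under LC (and relaxed ordering) it is a permitted outcome.
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```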
20. Results: Average Memory Access Ratio
- On average, 6.77% of total requests went to memory
21. Results: FFT 8
22. Conclusion and Future Work
- Generate results and analyze the performance
  - A few bugs remain to be solved
- Realistic area and speed estimation
  - Estimation with CACTI
- Optimization of the model based on the performance analysis and the estimations
- Token coherence implementation on the two-level ring network
  - Sacrifices a minor amount of performance
  - Saves verification effort on the protocol
  - Techniques used for concurrent line modification can be applied to reduce latency
23. Questions?