Title: On-Chip COMA Shared Memory Systems for Many-Core Processors
1. On-Chip COMA Shared Memory Systems for Many-Core Processors
- Li Zhang
- Computer System Architecture Group
- University of Amsterdam
2. Overview
- Introduction
- Microthreading model
- On-chip memory system
  - Network structure
  - Cache and directory structure
- Location consistency model
- On-chip cache coherence protocol
  - Cache protocol
  - Directory protocol
- Design specifics
- Results and issues
- Future work
3. Introduction: Microgrid CMP
- The Microthreaded architecture
  - Designed for many-core processors
  - Simple in-order pipelined processing cores
  - Extended ISA to support thread creation and management (see the sketch below)
  - Microthreads as the smallest unit of work for execution
  - Dynamically scheduled to processing resources (places)
  - Context switch during long-latency operations
  - Synchronization at the register level
- Maintains scalability to a large number of processors, in terms of both power and performance
- Communication over three independent networks
  - Resource network: resource management (allocate and release resources)
  - Processor network: thread management (creation, synchronization, etc.)
  - Memory network: memory access and coherence
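To make the family create/sync pattern concrete, here is a minimal C++ sketch of the thread-management extensions listed above. On the real Microgrid these are ISA instructions, not library calls; the names create_family and sync_family are stand-ins invented for illustration, and the sequential loop merely stands in for hardware scheduling.

```cpp
#include <cstdio>

// Stand-ins for the Microgrid's family-management ISA extensions.
using FamilyId = int;

FamilyId create_family(int start, int limit, void (*thread)(int)) {
    // Hardware would allocate a place and schedule one microthread per
    // index; this stub just runs the bodies sequentially for illustration.
    for (int i = start; i < limit; ++i) thread(i);
    return 0;
}

void sync_family(FamilyId) {
    // On hardware: block until every thread in the family has completed.
}

void body(int i) { std::printf("microthread %d\n", i); }

int main() {
    // Create a family of microthreads over the index range [0, 8), then
    // barrier on its completion; creation and sync are the memory ordering
    // points referred to on slide 11.
    FamilyId f = create_family(0, 8, body);
    sync_family(f);
    return 0;
}
```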
4. Introduction: Microgrid CMP (cont.)
5. Introduction: Microgrid CMP (cont.)
6. On-chip COMA Shared Memory System - Network
7. On-chip COMA Shared Memory System - Structures
8. On-chip COMA Memory System
- Set-associative caches, with pipelined logic serving multiple requests
- Bit mapping of valid segments to facilitate content updating
- Shared local bus, without snooping
  - Each cache is associated with a number of local processors
- Ring network shifts requests along by one node every cycle (see the sketch below)
  - A network node always prioritizes network requests
- Each cache has request queues to buffer, in an appropriate order, requests that cannot be processed immediately
- The directory has request queues (linked lists) that can be associated with each cache line to limit the requests going through the chip interface
- A remote request should receive a reply within one round trip of the ring
- All buffers/queues comply with the Location Consistency model
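A minimal sketch of the ring arbitration described above, assuming a unidirectional ring where each node forwards at most one request per cycle; RingNode and Request are hypothetical names, and the cache lookup itself is elided.

```cpp
#include <deque>
#include <optional>

// Hypothetical request token circulating on the ring.
struct Request {
    unsigned long address;   // target cache line address
    int           origin;    // node that injected the request
};

// One node on the unidirectional ring. Each cycle it forwards at most one
// request to its successor; an incoming network request always wins over
// a locally buffered one, matching the priority rule on this slide.
class RingNode {
public:
    std::deque<Request> localQueue;          // requests from local processors

    // Called once per cycle with the request arriving from the predecessor
    // (if any); returns the request to place on the outgoing link.
    std::optional<Request> cycle(std::optional<Request> incoming) {
        if (incoming) {
            // Network traffic has priority: handle or forward it first.
            return serviceOrForward(*incoming);
        }
        if (!localQueue.empty()) {
            // The slot is free, so a buffered local request may be injected.
            Request r = localQueue.front();
            localQueue.pop_front();
            return r;
        }
        return std::nullopt;                 // empty slot travels onward
    }

private:
    std::optional<Request> serviceOrForward(Request r) {
        // A real node would consult its cache here; in this sketch every
        // request simply continues around the ring.
        return r;
    }
};
```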
9. Memory Consistency Models
- Memory consistency
  - Defines the behavior and correctness of a program
  - Enforces a partial order on memory accesses
- Consistency models
  - Strict Consistency
  - Sequential Consistency
  - Processor Consistency
  - Release Consistency
  - Location Consistency
10. Location Consistency (LC)
- Definition
  - A multiprocessor system is location consistent if, for any execution of a program on the system:
  - 1) the operations of the execution are the same as those of some location-consistent execution of the program, and
  - 2) for each read operation R (with target location L) executed on the multiprocessor, the result returned by R belongs to the value set V specified by the state of the memory location L, as maintained by the abstract interpreter in the corresponding location-consistent execution
- Properties (illustrated in the sketch below)
  - Removes ordering constraints between independent addresses
  - Gives the compiler more freedom to reorder the program
  - Relaxes the constraints on the memory system implementation
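To make the value-set definition concrete, below is a simplified C++ sketch of an LC-style abstract interpreter for a single location: unsynchronized writes add values to the location's set, a read may legally return any member, and a synchronizing write collapses the set. This follows the published LC semantics only in broad strokes and is not the presentation's own formalism.

```cpp
#include <set>
#include <cassert>

// Simplified LC abstract interpreter for ONE memory location.
// State: the set of values a read is currently allowed to return.
class LCLocation {
    std::set<int> valueSet;      // value set V for this location
public:
    explicit LCLocation(int init) { valueSet.insert(init); }

    // An unordered write ADDS its value: earlier values remain legal
    // results for concurrent readers.
    void write(int v) { valueSet.insert(v); }

    // A synchronization that orders a write before subsequent reads
    // collapses the set to that single value.
    void sync(int v) {
        valueSet.clear();
        valueSet.insert(v);
    }

    // A read is LC-correct iff it returns some member of the set.
    bool readAllowed(int v) const { return valueSet.count(v) != 0; }
};

int main() {
    LCLocation x(0);
    x.write(1);                  // processor P1 writes 1, unsynchronized
    x.write(2);                  // processor P2 writes 2, unsynchronized
    assert(x.readAllowed(0));    // all three values are legal results
    assert(x.readAllowed(1));
    assert(x.readAllowed(2));
    x.sync(2);                   // synchronization orders the write of 2
    assert(!x.readAllowed(1));   // now only 2 may be returned
    assert(x.readAllowed(2));
    return 0;
}
```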
11. Location Consistency on the Microgrid
- Family creation and synchronization serve as barriers and are controlled by the processors
- The order they impose is guaranteed within the memory system
- Memory accesses to different addresses from the same thread may complete out of order
- Memory accesses to the same address without an imposed order may complete out of order
12. Cache Coherence Protocol - Cache
13. Cache Coherence Protocol - Directory
- Each directory has both an above and a below interface
- The root directory interacts directly with the memory controller
14. Simplified Protocol Verification (with Vu Duong)
- Scale the block down to a single value
  - This removes the RE and ER transactions and two cache-block states
- Proof with the support of line counters in the directory
15. Policies to Obey on Normal Cache Lines
- RE, ER and IV transactions (the update transactions) always carry updates
- Cache lines in a locking state should be updated partially by passing update transactions (see the merge sketch below)
- A cache line, whether in a normal or a locking state, can give a reply without overwriting the update information in the request
- Concurrent updates from different caches can be carried out together, but always with the same winner
  - The winner keeps the line in the normal locking state; the loser keeps the line in the invalidated locking state
- A write-back request received by a line in the WritePendingI state is regarded as an ownership transfer; the request is then not forwarded to the backing store
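A hedged sketch of how such partial updates could be merged, using the per-segment valid bits mentioned on slide 8: an update transaction overwrites only the segments its mask marks as written, so a locally dirty segment is never lost. The segment count and field names are assumptions for illustration.

```cpp
#include <array>
#include <cstdint>

constexpr int SEGMENTS = 8;                  // segments per cache line (assumed)

// Hypothetical cache line with one valid bit per segment, following the
// bit-mapping scheme on slide 8.
struct CacheLine {
    std::array<uint32_t, SEGMENTS> data{};
    uint32_t validMask = 0;                  // bit i set => segment i is valid
};

// An update transaction carries the written segments plus a mask naming
// the segments it actually updates.
struct UpdateTxn {
    std::array<uint32_t, SEGMENTS> data{};
    uint32_t writeMask = 0;
};

// Apply an update to a line in a locking state: only the segments named by
// the transaction's mask are overwritten.
void applyPartialUpdate(CacheLine& line, const UpdateTxn& txn) {
    for (int i = 0; i < SEGMENTS; ++i) {
        if (txn.writeMask & (1u << i)) {
            line.data[i] = txn.data[i];
            line.validMask |= (1u << i);     // segment becomes valid
        }
    }
}
```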
16. Buffering Directory Requests under LC
- While its data is being loaded, a directory line is locked in the reserved state
- Subsequent requests for the line are buffered in a linked list (sketched below)
  - Incoming requests from the network have to be appended at the tail, without processing
  - Previously buffered requests have to be re-appended at the head
- An active-line queue keeps track of the lines that can be served
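A minimal sketch of this buffering discipline, assuming a simple double-ended queue per directory line in place of the hardware linked list; all names are illustrative and the protocol action itself is elided.

```cpp
#include <deque>
#include <queue>

struct Request { unsigned long address; int origin; };

// Hypothetical directory line that buffers requests while its data is
// being loaded, per the LC buffering rules on this slide.
struct DirLine {
    bool reserved = false;               // locked while data is in flight
    std::deque<Request> pending;         // FIFO of deferred requests
};

struct Directory {
    std::queue<DirLine*> activeLines;    // lines whose requests may be served

    // A request arriving from the network while the line is reserved is
    // appended at the TAIL without being processed.
    void onNetworkRequest(DirLine& line, const Request& r) {
        if (line.reserved) line.pending.push_back(r);
        else               serve(line, r);
    }

    // A previously buffered request that must be retried rejoins at the
    // HEAD, preserving the order among deferred requests.
    void replay(DirLine& line, const Request& r) {
        line.pending.push_front(r);
    }

    // When the loaded data arrives, the line is unlocked and queued so its
    // buffered requests can be drained in order.
    void onDataArrival(DirLine& line) {
        line.reserved = false;
        activeLines.push(&line);
    }

    void serve(DirLine&, const Request&) { /* protocol action elided */ }
};
```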
17. Victim Buffer and Network Bypassing Buffer
- Victim buffer (sketched below)
  - Holds evicted blocks
  - On a local write, the data is removed
  - On a network invalidation, the data is removed
  - For data consistency, it can only serve local requests, not remote requests
- Skipping buffer
  - Holds addresses that missed on network requests
  - Avoids going through the whole network-side access logic
  - A hit in this buffer passes the request directly to the next node
  - An entry is removed on a local request
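A sketch of the victim-buffer rules above, with hypothetical names; the essential point is that lookups succeed only for local requests, while local writes and network invalidations both drop the entry.

```cpp
#include <unordered_map>
#include <cstdint>
#include <optional>

// Hypothetical victim buffer holding recently evicted blocks.
class VictimBuffer {
    std::unordered_map<uint64_t, uint64_t> blocks;   // address -> data
public:
    void insertEvicted(uint64_t addr, uint64_t data) { blocks[addr] = data; }

    // Only LOCAL requests may hit here; serving remote (network) requests
    // from evicted copies could violate consistency.
    std::optional<uint64_t> lookup(uint64_t addr, bool isLocalRequest) {
        if (!isLocalRequest) return std::nullopt;
        auto it = blocks.find(addr);
        if (it == blocks.end()) return std::nullopt;
        return it->second;
    }

    // Both a local write and a network invalidation remove the entry.
    void onLocalWrite(uint64_t addr)          { blocks.erase(addr); }
    void onNetworkInvalidation(uint64_t addr) { blocks.erase(addr); }
};
```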
18. Shared Local Bus
- Under LC, local bus snooping is unnecessary
  - Current data can still be read from the D-cache even while another write request is on its way
  - Invalidations are broadcast only when necessary
- Two policies
  - Eager: always keeps L2 and L1 caches consistent, and only backward-invalidates when a valid line is invalidated, evicted or written back
  - Lazy: broadcasts invalidations from the network without keeping the caches consistent; evicted or written-back lines are not broadcast
- Backward invalidation (BI) buffer for the lazy policy (sketched below)
  - Avoids broadcasting backward invalidations to the processors
  - Buffers the most recent BI addresses
  - A read reply invalidates the matching entry in the buffer
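The sketch below illustrates the lazy policy's BI buffer under stated assumptions: a network invalidation is recorded rather than broadcast, and a read reply clears the matching entry. The capacity and the rule that an overflowing entry must finally be broadcast are assumptions not on the slide.

```cpp
#include <deque>
#include <optional>
#include <algorithm>
#include <cstdint>
#include <cstddef>

// Hypothetical backward-invalidation (BI) buffer for the lazy policy:
// instead of broadcasting every network invalidation to the L1 caches,
// the L2 records the most recent invalidated addresses here.
class BIBuffer {
    std::deque<uint64_t> addrs;                   // most recent BI addresses
    static constexpr std::size_t capacity = 16;   // assumed size
public:
    // Buffer a network invalidation instead of broadcasting it; if the
    // buffer overflows, the oldest address is returned so the caller can
    // perform the real broadcast (an assumed overflow policy).
    std::optional<uint64_t> onNetworkInvalidation(uint64_t addr) {
        addrs.push_back(addr);
        if (addrs.size() > capacity) {
            uint64_t evicted = addrs.front();
            addrs.pop_front();
            return evicted;
        }
        return std::nullopt;
    }

    // A read reply delivers fresh data for the line, so the pending
    // backward invalidation for that address becomes unnecessary.
    void onReadReply(uint64_t addr) {
        addrs.erase(std::remove(addrs.begin(), addrs.end(), addr),
                    addrs.end());
    }
};
```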
19. Comparing the Implemented Consistency Model
- Comparison with sequential consistency (see the litmus test below)
  - Sequential consistency requires a single global order for all writes
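A classic store-buffering litmus test makes the comparison concrete: under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible, while a model that, like LC, imposes no order between independent addresses permits it. The C++ snippet below mimics this with relaxed atomics; it is a generic illustration, not taken from the presentation.

```cpp
#include <atomic>
#include <thread>
#include <cstdio>

// Store-buffering litmus test on two independent locations (both 0).
// Relaxed atomics let the compiler/hardware reorder the accesses, much as
// LC allows for accesses to independent addresses.
std::atomic<int> x{0}, y{0};
int r1, r2;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    // Under sequential consistency, some interleaving of the four
    // operations must explain the result, so r1 == 0 && r2 == 0 cannot
    // occur; under LC (and relaxed ordering) it is a permitted outcome.
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```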
20. Results: Average Memory Access Ratio
- On average, 6.77% of total requests went to memory
21. Results: FFT 8
22. Conclusion and Future Work
- Generate results and analyze the performance
  - A few bugs remain to be solved
- Realistic area and speed estimation
  - Estimation with CACTI
- Optimization of the model based on the performance analysis and the estimations
- Token coherence implementation on the two-level ring network
  - Sacrifices a minor amount of performance
  - Saves verification effort on the protocol
  - Techniques used for concurrent line modification can be applied to reduce latency
23. Questions?