1
On-Chip COMA Shared Memory Systems for Many-Core
Processors
  • Li Zhang,
  • Computer System Architecture Group
  • University of Amsterdam

2
Overview
  • Introduction
  • Microthreading model
  • On-chip memory system
  • Network structure
  • Cache and directory structure
  • Location consistency model
  • On-chip cache coherence protocol
  • Cache protocol
  • Directory protocol
  • Design specifics
  • Results and issues
  • Future work

3
Introduction - Microgrid CMP
  • The Microthreaded architecture
  • Designed for many-core processors
  • Simple in-order pipelined processing cores
  • Extended ISA to support thread creation and
    management
  • Microthreads as the smallest unit of work for
    execution
  • Dynamically scheduled to processing resources
    (places)
  • Context switch during long-latency operations
  • Synchronization at the register level
  • Maintains scalability for a large number of
    processors, in terms of power and performance
  • Communication on three independent networks
  • Resource network: resource management - allocate
    and release resources
  • Processor network: thread management - create,
    synchronization, etc.
  • Memory network: memory access and coherence

4
Introduction - Microgrid CMP (cont.)
5
Introduction - Microgrid CMP (cont.)
6
On-chip COMA Shared Memory System - Network
7
On-chip COMA Shared Memory System - Structures
8
On-chip COMA Memory System
  • Set-associative cache, with pipelined logic
    serving multiple requests
  • Bit mapping on valid segments to facilitate
    content updating
  • Shared local bus, without snooping
  • Each cache is associated with a number of local
    processors
  • The ring network shifts network requests every
    cycle (see the sketch after this list)
  • A network node always prioritizes the network
    request
  • Each cache has request queues to buffer, in an
    appropriate order, requests that cannot be
    processed immediately
  • The directory has request queues (linked lists)
    that can be associated with each cacheline to
    limit the requests going through the chip
    interface
  • A remote request should get a reply within one
    circuit of the ring
  • All buffers/queues comply with the Location
    Consistency model
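
The arbitration rule above ("a network node always prioritizes the network request") can be illustrated with a minimal C++ sketch. All names here (Request, RingNode, etc.) are hypothetical; only the priority-and-shift behaviour described on this slide is modelled, and the actual cache lookup is left as a placeholder.

```cpp
#include <cstdint>
#include <optional>
#include <queue>

// Hypothetical request descriptor; the fields are illustrative only.
struct Request {
    uint64_t address;
    bool     fromNetwork;   // true if the request arrived over the ring
};

// One node on the ring: an L2 cache plus its network interface.
// The real protocol is far more involved; this only shows the rule
// "the network request always wins" from the slide.
class RingNode {
public:
    // Called once per cycle with the request shifted in from the previous
    // node (if any). Returns the request shifted out to the next node.
    std::optional<Request> cycle(std::optional<Request> incoming) {
        if (incoming) {
            // A network request is always served or forwarded first; a
            // locally issued request has to wait in the queue.
            return process(*incoming);
        }
        if (!localQueue.empty()) {
            // Only an idle ring slot may be used for a local request.
            Request r = localQueue.front();
            localQueue.pop();
            return process(r);
        }
        return std::nullopt;   // empty slot keeps circulating
    }

    void issueLocal(Request r) { localQueue.push(r); }

private:
    std::optional<Request> process(Request r) {
        // Placeholder: a real cache would look up its tags here and either
        // answer the request or forward it to the next node on the ring.
        return r;
    }

    std::queue<Request> localQueue;   // requests that cannot go out yet
};
```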

9
Memory Consistency Models
  • Memory Consistency
  • Defines the legal behavior of memory operations
    in a program
  • Enforces a partial order on memory accesses
  • Consistency Models
  • Strict Consistency
  • Sequential Consistency
  • Processor Consistency
  • Release Consistency
  • Location Consistency

10
Location Consistency (LC)
  • Definition
  • A multiprocessor system is location consistent
    if for any execution of a program on the system,
  • 1) the operations of the execution are the same
    as those for some location consistent execution
    of the program, and
  • 2) for each read operation R (with target
    location L) that is executed on the
    multiprocessor, the result returned by R belongs
    to the value-set, V, specified by the state of
    the memory location L as maintained by the
    abstract interpreter in the corresponding
    location consistent execution.
  • Property
  • Removes the order constraints between independent
    addresses (illustrated by the sketch after this
    list)
  • Gives the compiler more freedom to reorder the
    program
  • Relaxes the constraints on the memory system
    implementation
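
To make the first property concrete, the fragment below uses C++ relaxed atomics purely as an executable stand-in for a memory system that, like Location Consistency, imposes no order between operations on independent locations; it is not the Microgrid implementation. The classic flag/data hand-off may then observe the flag while still reading the old data, unless an explicit synchronizing operation imposes the order.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Illustrative only: relaxed atomics model "no order between
// independent locations", the property stated on the slide.
std::atomic<int> data{0};
std::atomic<int> flag{0};

void producer() {
    data.store(42, std::memory_order_relaxed);   // location A
    flag.store(1, std::memory_order_relaxed);    // location B (independent)
}

void consumer() {
    while (flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
    // Without a synchronizing operation between the two locations,
    // this read may legally return 0 (the old value) as well as 42.
    std::printf("data = %d\n", data.load(std::memory_order_relaxed));
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```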

11
Location Consistency on the Microgrid
  • Family creation and synchronization serve as
    barriers and are controlled by the processors
  • The imposed order will be guaranteed within the
    memory system
  • Memory accesses on different addresses from the
    same thread may finish out of order
  • Memory accesses (on the same address) without an
    imposed order may finish out of order

12
Cache Coherence Protocol - Cache
13
Cache Coherence Protocol - Directory
  • The directory includes above and below interfaces
  • The root directory interacts directly with the
    memory controller

14
Simplified Protocol Verification with Vu Duong
  • Scale the block down to a single value
  • This removes RE, ER and two cache block states
  • Proved with the support of line counters in the
    directory

15
Policies to Obey on Normal Cachelines
  • RE, ER and IV transactions (update transactions)
    always carry updates
  • Cachelines in a locking state should be partially
    updated by the update transactions
  • A cacheline, whether in a normal or a locking
    state, can give a reply without overwriting the
    request's update information
  • Concurrent updates on different caches can be
    carried out together but always have the same
    winner (see the sketch after this list)
  • The winner keeps the line in a normal locking
    state; the loser keeps the line in an invalidated
    locking state
  • Writeback requests received by a WritePendingI
    line are regarded as an ownership transfer; the
    request will then not be forwarded to the backing
    store
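
The slides do not say how the single winner of concurrent updates is chosen; the sketch below assumes, purely for illustration, that the first update transaction to reach the directory wins, and it records the resulting cache-line states using the slide's terminology (normal locking for the winner, invalidated locking for the losers).

```cpp
#include <cstdint>
#include <unordered_map>

// Slide terminology: after concurrent updates to the same line, exactly one
// cache ends up as the winner (normal locking) and the rest as losers
// (invalidated locking). The winner-selection rule here is an assumption:
// the first update to reach the directory wins.
enum class LineState { NormalLocking, InvalidatedLocking };

class Directory {
public:
    // cacheId identifies the cache whose update transaction just arrived.
    // Returns the state the requesting cache must keep the line in.
    LineState resolveUpdate(uint64_t lineAddr, int cacheId) {
        bool firstArrival = winner.try_emplace(lineAddr, cacheId).second;
        return firstArrival ? LineState::NormalLocking
                            : LineState::InvalidatedLocking;
    }

    // Called once the winning update has been acknowledged everywhere,
    // so a later, unrelated update can win again.
    void release(uint64_t lineAddr) { winner.erase(lineAddr); }

private:
    std::unordered_map<uint64_t, int> winner;   // line -> winning cache
};
```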

16
Buffering Directory Requests under LC
  • During data loading, a directory line is locked
    into the reserved state
  • Subsequent requests for the line are buffered in
    a linked list (see the sketch after this list)
  • Incoming requests from the network have to be
    appended at the tail without processing
  • Previously buffered requests have to be
    reinserted at the head
  • An active-line queue keeps track of the lines
    that can be served
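
A minimal sketch of the buffering rule above, with hypothetical names and containers: while a line is reserved, new network requests are appended at the tail of its list, requests that were already buffered but still cannot complete go back in at the head, and an active-line queue records which lines now have servable work.

```cpp
#include <cstdint>
#include <deque>
#include <queue>
#include <string>
#include <unordered_map>

struct Request { uint64_t lineAddr; std::string payload; };  // illustrative

// Per-line buffering while a directory line is in the reserved state
// (its data is still being loaded, so requests cannot be served yet).
class DirectoryLineBuffers {
public:
    // New request arriving from the ring: append at the tail, unprocessed.
    void bufferIncoming(const Request& r) { pending[r.lineAddr].push_back(r); }

    // A previously buffered request that still cannot complete goes back
    // in at the head, so the original order is preserved.
    void requeueAtHead(const Request& r) { pending[r.lineAddr].push_front(r); }

    // When the line leaves the reserved state, mark it active so its
    // buffered requests get drained.
    void markActive(uint64_t lineAddr) { activeLines.push(lineAddr); }

    // Serve one buffered request of the next active line, if any.
    bool serveOne(Request& out) {
        while (!activeLines.empty()) {
            uint64_t line = activeLines.front();
            auto it = pending.find(line);
            if (it == pending.end() || it->second.empty()) {
                activeLines.pop();          // nothing left for this line
                continue;
            }
            out = it->second.front();
            it->second.pop_front();
            return true;
        }
        return false;
    }

private:
    std::unordered_map<uint64_t, std::deque<Request>> pending;  // per-line list
    std::queue<uint64_t> activeLines;                           // lines ready to serve
};
```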

17
Victim Buffer and Network Bypassing Buffer
  • Victim buffer
  • Holds evicted blocks
  • On local writes the data is removed
  • On network invalidation the data is removed
  • For data consistency, it can only be used for
    local requests, not remote requests
  • Skipping buffer (the network-bypassing buffer)
  • Holds addresses that missed on network requests
  • Avoids going through the whole network-side
    accessing logic
  • A hit in this buffer passes the request directly
    to the next node
  • The item is removed on local requests (both
    buffers are sketched after this list)
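
The two buffers' lookup and invalidation rules listed above can be sketched as follows; the class names, container choices and interfaces are assumptions, not the actual design.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Line = std::vector<uint8_t>;   // cache-line data, illustrative

// Victim buffer: holds evicted blocks, usable only for local requests.
class VictimBuffer {
public:
    void insertEvicted(uint64_t addr, Line data) { lines[addr] = std::move(data); }

    // Only local requests may hit here; remote (network) requests must not,
    // otherwise consistency could be violated.
    const Line* lookupLocal(uint64_t addr) const {
        auto it = lines.find(addr);
        return it == lines.end() ? nullptr : &it->second;
    }

    // A local write or a network invalidation removes the block.
    void invalidate(uint64_t addr) { lines.erase(addr); }

private:
    std::unordered_map<uint64_t, Line> lines;
};

// Skipping (network-bypassing) buffer: remembers addresses that recently
// missed on network requests, so a repeated miss is forwarded straight to
// the next ring node without re-running the network-side lookup.
class SkippingBuffer {
public:
    void recordNetworkMiss(uint64_t addr) { missed.insert(addr); }

    bool canBypass(uint64_t addr) const { return missed.count(addr) != 0; }

    // A local request for the address removes the entry, since the line
    // may now be allocated locally.
    void onLocalRequest(uint64_t addr) { missed.erase(addr); }

private:
    std::unordered_set<uint64_t> missed;
};
```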

18
Shared Local Bus
  • Under LC, local bus snooping is unnecessary
  • Current data can still be read from the D-cache
    even while another write request is on its way
  • Invalidations are only broadcast when necessary
  • Two policies
  • Eager: always keeps consistency between the L2
    and L1 caches, and only backward-invalidates when
    a valid line is invalidated, evicted or written
    back
  • Lazy: broadcasts invalidations from the network
    without keeping that consistency; evicted or
    written-back lines will not be broadcast
  • Backward invalidation buffer for the Lazy policy
    (sketched after this list)
  • Avoids broadcasting backward invalidations to the
    processors
  • Buffers the most recent IB addresses
  • A read reply invalidates the item in the buffer
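
A sketch of the Lazy policy's backward-invalidation buffer as described above: network invalidations are recorded instead of being broadcast to the L1 caches, and a read reply clears the matching entry. The fixed capacity and the names are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>

// Backward-invalidation buffer used by the Lazy policy: instead of
// broadcasting every network invalidation to the processors' L1 caches,
// the L2 cache remembers the most recent invalidated addresses here.
class BackwardInvalidationBuffer {
public:
    explicit BackwardInvalidationBuffer(std::size_t capacity) : cap(capacity) {}

    // Network invalidation: record the address instead of broadcasting.
    void recordInvalidation(uint64_t addr) {
        if (entries.size() == cap) entries.pop_front();   // keep only the newest
        entries.push_back(addr);
    }

    // An L1 read that hits in this buffer must be treated as stale,
    // i.e. refetched through the L2/COMA system.
    bool isStale(uint64_t addr) const {
        return std::find(entries.begin(), entries.end(), addr) != entries.end();
    }

    // A read reply delivers fresh data, so the buffered entry is dropped.
    void onReadReply(uint64_t addr) {
        entries.erase(std::remove(entries.begin(), entries.end(), addr),
                      entries.end());
    }

private:
    std::size_t cap;
    std::deque<uint64_t> entries;   // most recent invalidated addresses
};
```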

19
Comparing the Implemented Consistency Model
  • Comparison with Sequential Consistency
  • Sequential consistency requires a certain order
    for all writes

20
Results - Average Memory Access Ratio
  • On average, 6.77% of total requests went to memory

21
Results - FFT 8
22
Conclusion and Future Work
  • Generate results and analyze the performance
  • A few bugs to solve
  • Realistic area and speed estimation
  • Estimation with CACTI
  • Optimization of the model based on the
    performance analysis and the estimations
  • Token coherence implementation on two-level ring
    networks
  • Sacrifices a small amount of performance
  • Saves verification effort on the protocol
  • Techniques used for concurrent line modification
    can be applied to reduce latency

23
Questions?