Microgrids of SVP cores: Implementation and results

1
Microgrids of SVP cores: Implementation and results

Mike Lankamp
M.Lankamp_at_uva.nl

September 3rd, 2008
2
The Big Picture
3
Processor Components

Register File
Large: around 1024 registers
I-structures: thread-blocking reads
Memory Caches
Small caches because of latency-tolerant architecture
Pipeline
Threads and Families
Network
Ring network for shareds, creates and globals
Global delegation network

4
Register File

Large register file: context for each thread
Threads suspend on an empty register by writing their ID into the register
Asynchronous writebacks of long-latency operations
Security: make sure the writeback comes from the correct source
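
The suspend-on-empty behaviour above can be sketched in a few lines of Python. This is only an illustration: the state names (`EMPTY`, `WAITING`, `FULL`), the `RegisterFile` class, and the single-waiter restriction are assumptions made for the sketch, not details taken from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical state names; the slides do not list the actual register states.
EMPTY, WAITING, FULL = "empty", "waiting", "full"


@dataclass
class Register:
    state: str = EMPTY
    # Holds the data when FULL, or the suspended thread's ID when WAITING.
    value: Optional[int] = None


class RegisterFile:
    def __init__(self, size: int = 1024):
        self.regs: List[Register] = [Register() for _ in range(size)]
        self.ready: List[int] = []  # thread IDs woken up by writebacks

    def read(self, index: int, thread_id: int) -> Optional[int]:
        """Return the value, or suspend the reader if the register is not full."""
        reg = self.regs[index]
        if reg.state == FULL:
            return reg.value
        # Not full: write the reader's thread ID into the register.
        # This sketch assumes at most one waiting thread per register.
        reg.state = WAITING
        reg.value = thread_id
        return None

    def write(self, index: int, value: int) -> None:
        """Asynchronous writeback: fill the register and wake any waiting thread."""
        reg = self.regs[index]
        if reg.state == WAITING:
            self.ready.append(reg.value)
        reg.state = FULL
        reg.value = value
```

A read that returns `None` stands for "thread suspended"; the later `write` moves that thread's ID onto `ready`, after which the same read succeeds.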

5
Virtual Register Mapping
6
Physical Register Mapping
7
Register State Transitions
8
Thread Management

Threads are managed on two exclusive linked lists
State list: Ready, Waiting
Membership list: Empty, Member
Various thread states

9
Pipeline

In-order six-stage pipeline
Write back the thread ID to the register when a read finds it empty
This suspends the thread on a data miss
A subsequent write to the register wakes the thread up
Dispatch long-latency operations instead of waiting
Memory reads
Floating-point operations
Results are written back asynchronously
Thread switch on data miss
Use SWCH in the program to avoid bubbles by switching before the miss is known

10
Families and Threads

Family Table (32 entries)
Initial PC
Parent info
Dependencies
Thread Table (256 entries)
Current PC
Register context info
I-Cache info

11
Instruction Cache

Ordinary, though small, cache
Has fields (head, tail) for a linked list of threads
i.e., threads waiting on the cache line
When the cache line is loaded, all threads are activated in 1 cycle
Reference counted
When a thread requests the line, the count is increased
When a thread suspends, the count is decreased
When the count is zero, the line can be reused
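
A minimal sketch of the two mechanisms above, assuming the waiting list is threaded through a per-thread `next` link, as the (head, tail) fields suggest. `LinkedQueue`, `ICacheLine`, and the method names are hypothetical. Splicing the whole waiting list onto a ready list in one step is one plausible way to activate every waiter in constant time; the slides do not say how the hardware does it.

```python
class LinkedQueue:
    """A (head, tail) pair over a shared array of per-thread 'next' links."""

    def __init__(self, next_links):
        self.next = next_links
        self.head = None
        self.tail = None

    def push(self, tid):
        self.next[tid] = None
        if self.head is None:
            self.head = tid
        else:
            self.next[self.tail] = tid
        self.tail = tid

    def splice(self, other):
        """Append every entry of `other` in O(1) and leave `other` empty."""
        if other.head is None:
            return
        if self.head is None:
            self.head = other.head
        else:
            self.next[self.tail] = other.head
        self.tail = other.tail
        other.head = other.tail = None

    def items(self):
        out, cur = [], self.head
        while cur is not None:
            out.append(cur)
            cur = self.next[cur]
        return out


class ICacheLine:
    def __init__(self, next_links):
        self.loaded = False
        self.refcount = 0
        self.waiting = LinkedQueue(next_links)

    def request(self, tid):
        """A thread requests the line; returns True if it can run right away."""
        self.refcount += 1
        if not self.loaded:
            self.waiting.push(tid)
        return self.loaded

    def release(self):
        """Called when a thread that was using the line suspends."""
        self.refcount -= 1

    def can_reuse(self):
        return self.refcount == 0

    def on_loaded(self, ready):
        """Line data arrived: hand all waiting threads to the ready list at once."""
        self.loaded = True
        ready.splice(self.waiting)
```

Because the list links live in the thread table rather than in the cache, a line only needs two small fields no matter how many threads wait on it.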

12
Data Cache

Ordinary, though small, cache
Has fields (head, tail) for a linked list of registers
i.e., register reads waiting on the cache line
Registers contain the read information (offset, size)
When the cache line is loaded, register reads are serviced one by one
Special attention required for read/write consistency
e.g., WAR, RAW, WAW, etc. on cache misses
How about requests straddling cache-line boundaries?
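
The pending-read list can be sketched as follows. The names `PendingRead`, `DCacheLine`, `queue_read`, and `service_line` are hypothetical, a dictionary stands in for the register file, and little-endian byte order is assumed. Consistency hazards and straddling requests are not handled here.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class PendingRead:
    """What a waiting register holds instead of data."""
    offset: int                 # byte offset within the cache line
    size: int                   # number of bytes to read
    next: Optional[int] = None  # index of the next waiting register


@dataclass
class DCacheLine:
    head: Optional[int] = None
    tail: Optional[int] = None


Registers = Dict[int, Union[int, PendingRead]]


def queue_read(line: DCacheLine, regs: Registers, reg: int, offset: int, size: int) -> None:
    """Record a read that missed: the register itself stores the request."""
    regs[reg] = PendingRead(offset, size)
    if line.head is None:
        line.head = reg
    else:
        regs[line.tail].next = reg
    line.tail = reg


def service_line(line: DCacheLine, regs: Registers, data: bytes) -> None:
    """The line arrived: walk the list and fill the registers one by one."""
    cur = line.head
    while cur is not None:
        pending = regs[cur]
        chunk = data[pending.offset:pending.offset + pending.size]
        regs[cur] = int.from_bytes(chunk, "little")  # assumed byte order
        cur = pending.next
    line.head = line.tail = None
```

Each register is overwritten with its value only after its `next` link has been read, so the walk stays valid while the list is being consumed.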

13
Network

Transfer shareds between neighboring processors
Manage the create-token
Broadcast creates (with globals)
Manage various neighbor-to-neighbor notifications, e.g.
Thread termination
Family termination
Synchronisation

14
Memory

Processor doesn't care about memory
It just sends tagged requests and receives responses
Memory must guarantee that the tag sent with a request arrives with the response
Tag is the index of the cache line where the data is to be placed
We plan on using COMA memory
Fits the computing model
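
The tag contract above can be illustrated with a toy model. `TaggedMemory`, `fill`, and the 64-byte line size are assumptions for the sketch. It shows only that responses may complete in any order and still land in the right cache line, because the tag travels with the data.

```python
class TaggedMemory:
    """Toy memory that echoes each request's tag back with the response."""

    def __init__(self, backing: bytes, line_size: int = 64):
        self.backing = backing
        self.line_size = line_size
        self.in_flight = []  # outstanding (tag, address) pairs

    def request(self, tag: int, address: int) -> None:
        self.in_flight.append((tag, address))

    def respond(self, position: int = 0):
        """Complete one outstanding request; `position` models out-of-order completion."""
        tag, address = self.in_flight.pop(position)
        base = address - address % self.line_size
        return tag, self.backing[base:base + self.line_size]


def fill(cache_lines, response) -> None:
    """The processor side: the tag is the index of the cache line to fill."""
    tag, data = response
    cache_lines[tag] = data
```

The processor keeps no table of outstanding addresses in this model; the tag alone says where the data belongs.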

15
Microgrid Simulator

Emulates a configured cluster of microthreaded processors
Based on the Alpha ISA
Accepts flat and ELF Microthread Alpha binaries
Produced by custom binutils-2.18
Cycle-accurate
Returns total execution time and several statistics
Allows for stepping through the program
Includes examining all processor components at every cycle

16
Results

Sine
Fast Fourier Transform
Ideal memory
Banked memory
Randomized banked memory

17
Sine

9-iteration Taylor expansion
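
For reference, a plain sequential version of such a kernel might look like the sketch below. It assumes "9-iteration" means nine terms of the series (up to x^17/17!); the function name is hypothetical, and the slides do not show how the benchmark maps the loop onto threads.

```python
import math


def taylor_sin(x: float, iterations: int = 9) -> float:
    """Approximate sin(x) with the first `iterations` terms of its Taylor series."""
    term = x   # current term, starting with x^1 / 1!
    total = x
    for k in range(1, iterations):
        # Each term follows from the previous one: multiply by -x^2 / ((2k)(2k+1)).
        term *= -x * x / ((2 * k) * (2 * k + 1))
        total += term
    return total
```

Each term depends on the previous one, so the loop carries a dependency from iteration to iteration.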

18
FFT (Ideal)

Double-precision complex FFT
Performance for different N (log2 of input size)

19
FFT (Ideal)

Double-precision complex FFT
Speedup for up to 256 processors

20
FFT (Ideal)

4-processor Itanium-2 SMP
Theoretical maximum performance of 24 GFLOPS at 1.5 GHz
Maximum performance of 5.2 GFLOPS (22% of max)
28 microthreaded processors
Theoretical maximum performance of 36 GFLOPS at 1.5 GHz
Maximum performance of 14.4 GFLOPS (40% of max)
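
The figures in parentheses read as achieved performance as a fraction of theoretical peak. A quick check of that arithmetic, using only the numbers on the slide (the helper name is hypothetical):

```python
def fraction_of_peak(achieved_gflops: float, peak_gflops: float) -> float:
    """Achieved performance as a fraction of the theoretical maximum."""
    return achieved_gflops / peak_gflops


itanium = fraction_of_peak(5.2, 24.0)     # 4-processor Itanium-2 SMP
microgrid = fraction_of_peak(14.4, 36.0)  # microthreaded processors

print(f"Itanium-2 SMP: {itanium:.0%} of peak")
print(f"Microgrid:     {microgrid:.0%} of peak")
print(f"Achieved GFLOPS ratio: {14.4 / 5.2:.2f}x")
```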

21
FFT (Banked MP)

Conflicts on memory banks apparent

22
FFT (Random Banked M2P)

Pseudo-random address-to-bank-mapping
Conflicts on memory banks reduced
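
The effect can be illustrated with a small experiment. The slides do not say which mapping function was used, so the XOR-folding below is only a simple stand-in for a pseudo-random address-to-bank mapping, and the bank count and access pattern are made up for the example.

```python
from collections import Counter

N_BANKS = 8    # assumed number of banks (a power of two)
BANK_BITS = 3  # log2(N_BANKS)


def modulo_bank(line: int) -> int:
    """Plain banked memory: the low bits of the line number select the bank."""
    return line % N_BANKS


def xor_bank(line: int) -> int:
    """Stand-in for a pseudo-random mapping: fold higher bits into the bank index."""
    return (line ^ (line >> BANK_BITS)) % N_BANKS


def worst_bank_load(lines, mapping) -> int:
    """Number of accesses that land on the busiest bank."""
    return max(Counter(mapping(line) for line in lines).values())


# A power-of-two stride, the kind of access pattern an FFT can produce.
strided = [i * N_BANKS for i in range(16)]
```

With the plain mapping all 16 strided accesses hit bank 0, while the folded mapping spreads them evenly at two per bank.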
