Microgrids of SVP cores: Implementation and results

1
Microgrids of SVP cores: Implementation and results
  • Mike Lankamp
  • M.Lankamp_at_uva.nl

September 3rd, 2008
2
The Big Picture
3
Processor Components
  • Register File
      • Large: around 1024 registers
      • I-structures: thread-blocking reads
  • Memory Caches
      • Small caches, because of the latency-tolerant
        architecture
  • Pipeline
  • Threads and Families
  • Network
      • Ring network for shareds, creates and globals
      • Global delegation network

4
Register File
  • Large register file: a context for each thread
  • A thread suspends on an empty register by writing
    its ID into that register
  • Asynchronous writebacks of long-latency
    operations
  • Security: make sure the writeback comes from the
    correct source
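
The suspend-on-empty behaviour described above can be sketched as a small model. This is a minimal sketch, not the simulator's code: the state names `EMPTY`, `WAITING` and `FULL`, the class names, and the one-waiter-per-register restriction are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical state names; the slides do not list the actual states.
EMPTY, WAITING, FULL = "empty", "waiting", "full"

@dataclass
class Register:
    state: str = EMPTY
    # Holds the data when FULL, or the suspended thread's ID when WAITING.
    value: Optional[int] = None

class RegisterFile:
    def __init__(self, size=1024):
        self.regs = [Register() for _ in range(size)]

    def read(self, index, thread_id):
        """Return the value if the register is full. Otherwise store
        thread_id in the register (suspending that thread) and return None."""
        reg = self.regs[index]
        if reg.state == FULL:
            return reg.value
        if reg.state == WAITING:
            # Simplification of this sketch: only one waiter per register.
            raise RuntimeError("register already has a waiting thread")
        reg.state, reg.value = WAITING, thread_id
        return None

    def write(self, index, value):
        """Fill the register and return the ID of the thread to wake, if any."""
        reg = self.regs[index]
        woken = reg.value if reg.state == WAITING else None
        reg.state, reg.value = FULL, value
        return woken

rf = RegisterFile(size=8)
first_read = rf.read(3, thread_id=7)   # register is empty: thread 7 suspends
woken = rf.write(3, 42)                # asynchronous writeback wakes thread 7
second_read = rf.read(3, thread_id=7)  # now the read succeeds
```

Storing the thread ID in the register itself means no separate wait queue is needed: the writeback finds the thread to wake in the very location it is about to overwrite.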

5
Virtual Register Mapping
6
Physical Register Mapping
7
Register State Transitions
8
Thread Management
  • Threads are managed on two exclusive linked lists
      • State list: Ready, Waiting
      • Membership list: Empty, Member
  • Various thread states
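
One way to keep threads on linked lists without extra storage is to link thread-table entries through a per-entry `next` field. The sketch below assumes that layout; `ThreadList`, `next_links` and the table size are hypothetical names and values. It models a single link field, which is enough for one list family; separate state and membership lists would each need their own field.

```python
class ThreadList:
    """Singly linked list of thread-table indices with head and tail,
    linked through a `next` array shared by all lists of one family."""

    def __init__(self, next_links):
        self.next = next_links
        self.head = None
        self.tail = None

    def push(self, tid):
        """Append a thread at the tail in constant time."""
        self.next[tid] = None
        if self.head is None:
            self.head = tid
        else:
            self.next[self.tail] = tid
        self.tail = tid

    def pop(self):
        """Remove and return the thread at the head, or None if empty."""
        tid = self.head
        if tid is None:
            return None
        self.head = self.next[tid]
        if self.head is None:
            self.tail = None
        return tid

NUM_THREADS = 4                       # illustrative size only
next_links = [None] * NUM_THREADS     # one link field per thread-table entry
empty = ThreadList(next_links)
ready = ThreadList(next_links)
for tid in range(NUM_THREADS):
    empty.push(tid)                   # initially every entry is unallocated

allocated = empty.pop()               # allocate an entry for a new thread
ready.push(allocated)                 # and make it runnable
```

Because an entry sits on exactly one of the lists at a time, both lists can share the same `next` array.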

9
Pipeline
  • In-order six-stage pipeline
  • Write the thread ID back to a register when it is
    read while empty
      • This suspends the thread on a data miss
      • A subsequent write to the register wakes the
        thread up
  • Dispatch long-latency operations instead of
    waiting
      • Memory reads
      • Floating-point operations
      • Results are written back asynchronously
  • Thread switch on data miss
      • Use SWCH in the program to avoid bubbles by
        switching before the miss is known
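
The effect of dispatching long-latency operations and switching threads can be illustrated with a toy cycle count. This is a deliberately simplified sketch, not the six-stage pipeline: it assumes a single-issue core, a fixed load latency, threads that alternate one LOAD with one dependent USE, and a switch that happens at the load itself (the SWCH case), so no bubble is charged. The function name `run` and its parameters are hypothetical.

```python
from collections import deque

def run(num_threads, iterations, latency):
    """Cycles needed when each thread runs `iterations` LOAD/USE pairs
    (iterations >= 1) and every load completes after `latency` cycles."""
    ready = deque(range(num_threads))
    phase = ["load"] * num_threads
    remaining = [iterations] * num_threads
    wake_at = {}   # suspended thread -> cycle at which its data arrives
    cycle = 0
    while ready or wake_at:
        # Asynchronous writeback: data arriving this cycle wakes its thread.
        for tid in sorted(t for t, c in wake_at.items() if c <= cycle):
            del wake_at[tid]
            ready.append(tid)
        if ready:
            tid = ready.popleft()
            if phase[tid] == "load":
                # Dispatch the load and switch to another thread at once.
                wake_at[tid] = cycle + latency
                phase[tid] = "use"
            else:
                # The USE consumes the loaded value.
                remaining[tid] -= 1
                if remaining[tid] > 0:
                    phase[tid] = "load"
                    ready.append(tid)
        cycle += 1   # an idle cycle if no thread was ready
    return cycle

single = run(num_threads=1, iterations=4, latency=10)  # 44 cycles: 4 * (10 + 1)
many = run(num_threads=8, iterations=4, latency=10)    # 8x the work
```

With one thread the core idles for most of every load. With eight threads the same per-thread work overlaps, so the total is far below eight times the single-thread time.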

10
Families and Threads
  • Family Table (32 entries)
      • Initial PC
      • Parent info
      • Dependencies
  • Thread Table (256 entries)
      • Current PC
      • Register context info
      • I-Cache info

11
Instruction Cache
  • Ordinary, though small, cache
  • Has fields (head, tail) for a linked list of
    threads
      • i.e., threads waiting on the cache line
      • When the cache line is loaded, all threads are
        activated in 1 cycle
  • Reference counted
      • When a thread requests the line, the count is
        increased
      • When a thread suspends, the count is decreased
      • When the count is zero, the line can be reused
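
The reference counting and the per-line waiting list can be sketched as follows. This is an illustrative model under assumptions: a fully associative lookup, a Python list standing in for the (head, tail) linked list, and the hypothetical return values `"hit"`, `"miss"` and `"stall"`.

```python
class ICacheLine:
    def __init__(self):
        self.tag = None
        self.loaded = False
        self.refcount = 0
        self.waiting = []   # stands in for the (head, tail) list of thread IDs

class ICache:
    def __init__(self, num_lines):
        self.lines = [ICacheLine() for _ in range(num_lines)]

    def _find(self, tag):
        for line in self.lines:
            if line.tag == tag:
                return line
        return None

    def _allocate(self, tag):
        # Only a line that no thread references may be reused.
        for line in self.lines:
            if line.refcount == 0:
                line.tag, line.loaded, line.waiting = tag, False, []
                return line
        return None

    def request(self, tag, thread_id):
        """A thread asks for a line; the reference count goes up."""
        line = self._find(tag)
        if line is None:
            line = self._allocate(tag)
            if line is None:
                return "stall"          # every line is still referenced
        line.refcount += 1
        if line.loaded:
            return "hit"
        line.waiting.append(thread_id)  # thread waits on the cache line
        return "miss"

    def fill(self, tag):
        """The line arrived: hand back all waiting threads at once."""
        line = self._find(tag)
        line.loaded = True
        woken, line.waiting = line.waiting, []
        return woken

    def release(self, tag):
        """A thread suspended; the reference count goes down."""
        self._find(tag).refcount -= 1

cache = ICache(num_lines=1)
r1 = cache.request(0x100, thread_id=1)
r2 = cache.request(0x100, thread_id=2)
r3 = cache.request(0x200, thread_id=3)   # no free line yet
woken = cache.fill(0x100)
cache.release(0x100)
cache.release(0x100)
r4 = cache.request(0x200, thread_id=3)   # count reached zero, line reused
```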

12
Data Cache
  • Ordinary, though small, cache
  • Has fields (head, tail) for a linked list of
    registers
      • i.e., register reads waiting on the cache line
      • Registers contain the read information (offset,
        size)
      • When the cache line is loaded, register reads
        are serviced one by one
  • Special attention required for read/write
    consistency
      • e.g., WAR, RAW, WAW, etc. on cache misses
  • What about requests straddling cache-line
    boundaries?
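
Servicing the pending register reads when a line arrives, and detecting a request that straddles two lines, might look like the sketch below. The 64-byte line size, the little-endian byte order, and the names `straddles` and `service` are assumptions; the pending reads are shown as `(register, offset, size)` tuples rather than being stored in the registers themselves.

```python
LINE_SIZE = 64  # assumed line size in bytes; the slides do not state it

def straddles(address, size, line_size=LINE_SIZE):
    """True if the access [address, address + size) touches two cache lines."""
    return address // line_size != (address + size - 1) // line_size

def service(line_data, pending, regs):
    """Complete the waiting register reads one by one from the loaded line.
    Each pending entry is (register index, offset in line, size in bytes)."""
    for reg, offset, size in pending:
        regs[reg] = int.from_bytes(line_data[offset:offset + size], "little")

line_data = bytes(range(LINE_SIZE))     # dummy line contents 0x00..0x3f
regs = {}
service(line_data, [(5, 0, 4), (6, 8, 2)], regs)
```

A straddling request cannot be completed from a single line fill, which is why the slide flags it as an open question; `straddles(60, 8)` is such a case, while `straddles(56, 8)` fits in one line.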

13
Network
  • Transfer shareds between neighboring processors
  • Manage the create-token
  • Broadcast creates (with globals)
  • Manage various neighbor-to-neighbor
    notifications, e.g.
      • Thread termination
      • Family termination
      • Synchronisation

14
Memory
  • The processor doesn't care about memory
      • It just sends tagged requests and receives
        responses
  • Memory must guarantee that the tag sent with a
    request arrives back with the response
      • The tag is the index of the cache line in which
        to place the data
  • We plan to use COMA memory
      • Fits the computing model
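
The tagged request/response contract can be sketched with a memory model that answers in arbitrary order. `TaggedMemory`, its methods, and the seeded random reordering are hypothetical; the only property taken from the slide is that each response carries the tag of its request, and that the tag is the index of the destination cache line.

```python
import random

class TaggedMemory:
    """Toy memory that may answer outstanding requests in any order but
    always returns the tag it received with the request."""

    def __init__(self, backing, seed=0):
        self.backing = backing          # address -> data
        self.in_flight = []             # outstanding (tag, address) pairs
        self.rng = random.Random(seed)  # seeded so runs are repeatable

    def request(self, tag, address):
        self.in_flight.append((tag, address))

    def respond(self):
        """Complete one outstanding request, chosen arbitrarily."""
        i = self.rng.randrange(len(self.in_flight))
        tag, address = self.in_flight.pop(i)
        return tag, self.backing[address]

memory = TaggedMemory({0x40: "A", 0x80: "B", 0xC0: "C"})
cache_lines = [None] * 4

# The processor tags each request with the cache line that awaits the data.
memory.request(tag=2, address=0x40)
memory.request(tag=0, address=0x80)
memory.request(tag=3, address=0xC0)

for _ in range(3):
    tag, data = memory.respond()
    cache_lines[tag] = data             # the tag alone says where data goes
```

Whatever order the responses arrive in, each one lands in the right line, so the processor needs no knowledge of how the memory system is organised.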

15
Microgrid Simulator
  • Emulates a configured cluster of microthreaded
    processors
  • Based on the Alpha ISA
  • Accepts flat and ELF Microthread Alpha binaries
  • Produced by custom binutils-2.18
  • Cycle-accurate
  • Returns total execution time and several
    statistics
  • Allows for stepping through the program
  • Includes examining all processor components at
    every cycle

16
Results
  • Sine
  • Fast Fourier Transform
      • Ideal memory
      • Banked memory
      • Randomized banked memory

17
Sine
  • 9-iteration Taylor expansion
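
For reference, a 9-term Taylor (Maclaurin) expansion of sine can be written as below. This sketch assumes that "9-iteration" means nine series terms, each derived from the previous one; the benchmark's actual code is not shown in the slides, and `taylor_sin` is a hypothetical name.

```python
import math

def taylor_sin(x, terms=9):
    """Approximate sin(x) with the first `terms` terms of its Taylor series
    around 0: x - x^3/3! + x^5/5! - ..."""
    term = x
    total = x
    for n in range(1, terms):
        # Each term follows from the previous one:
        # t_n = -t_(n-1) * x^2 / ((2n) * (2n + 1))
        term *= -x * x / ((2 * n) * (2 * n + 1))
        total += term
    return total

approx = taylor_sin(1.0)
exact = math.sin(1.0)
```

Each iteration depends on the term computed by the previous one, which makes the loop a natural fit for a family of threads that pass a value along through shared registers.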

18
FFT (Ideal)
  • Double-precision complex FFT
  • Performance for different N (log2 of input size)

19
FFT (Ideal)
  • Double-precision complex FFT
  • Speedup for up to 256 processors

20
FFT (Ideal)
  • 4-processor Itanium-2 SMP
      • Theoretical maximum performance of 24 GFLOPS at
        1.5 GHz
      • Maximum performance of 5.2 GFLOPS (22% of max)
  • 28 microthreaded processors
      • Theoretical maximum performance of 36 GFLOPS at
        1.5 GHz
      • Maximum performance of 14.4 GFLOPS (40% of max)
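
The percentages follow directly from the quoted figures, taking efficiency as achieved performance divided by theoretical peak:

$$
\eta_{\text{Itanium-2}} = \frac{5.2}{24} \approx 0.217 \;(\approx 22\%),
\qquad
\eta_{\text{microthreaded}} = \frac{14.4}{36} = 0.40 \;(40\%)
$$

In absolute terms, the microthreaded configuration sustains about $14.4 / 5.2 \approx 2.8$ times the GFLOPS of the Itanium-2 system on this benchmark.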

21
FFT (Banked MP)
  • Conflicts on memory banks apparent

22
FFT (Random Banked M2P)
  • Pseudo-random address-to-bank-mapping
  • Conflicts on memory banks reduced
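
Why a scrambled address-to-bank mapping helps can be shown with a strided access pattern of the kind an FFT produces. The slides do not say which mapping function was used; the XOR-folding hash below is a stand-in chosen because it is easy to verify by hand, and all names and the 8-bank configuration are hypothetical.

```python
from collections import Counter

def modulo_bank(line, bank_bits):
    """Plain banking: the low address bits select the bank."""
    return line & ((1 << bank_bits) - 1)

def xor_fold_bank(line, bank_bits):
    """Scrambled banking: XOR all bank_bits-wide slices of the address."""
    mask = (1 << bank_bits) - 1
    bank = 0
    while line:
        bank ^= line & mask
        line >>= bank_bits
    return bank

def worst_bank_load(mapping, lines, bank_bits):
    """Number of accesses that hit the busiest bank."""
    return max(Counter(mapping(line, bank_bits) for line in lines).values())

BANK_BITS = 3                                   # 8 banks, illustrative
strided = [i << BANK_BITS for i in range(8)]    # stride equal to bank count

plain = worst_bank_load(modulo_bank, strided, BANK_BITS)
hashed = worst_bank_load(xor_fold_bank, strided, BANK_BITS)
```

With plain banking all eight strided accesses fall on bank 0, so they serialise. With the folded mapping the higher address bits take part in the bank choice and the same accesses land on eight different banks.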