Multithreaded ASC

Provided by: Batc151
Learn more at: https://www.cs.kent.edu

1
Multithreaded ASC
  • Kevin Schaffer and Robert A. Walker
  • ASC Processor Group
  • Computer Science Department
  • Kent State University

2
Organization of an ASC Computer
3
Broadcast/Reduction Bottleneck
  • Time to perform a broadcast or reduction
    increases as the number of PEs increases
  • Even for a moderate number of PEs, this time can
    dominate the machine cycle time
  • Pipelining reduces the cycle time but increases
    the latency
  • Additional latency causes pipeline hazards
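
As a rough illustration of the bottleneck on this slide, the sketch below assumes the broadcast and reduction networks are built as trees, so the number of levels a signal crosses grows with the logarithm of the number of PEs. The tree shape, the fan-out parameter, and the names `tree_levels` and `pipelined_stages` are assumptions made for illustration; the slides only state that the time grows with the number of PEs.

```python
def tree_levels(num_pes: int, fan: int = 2) -> int:
    """Levels in a fan-ary tree with num_pes leaves (assumed network shape)."""
    if num_pes < 1 or fan < 2:
        raise ValueError("need num_pes >= 1 and fan >= 2")
    levels = 0
    reach = 1  # number of PEs reachable with the levels counted so far
    while reach < num_pes:
        reach *= fan
        levels += 1
    return levels


def pipelined_stages(num_pes: int, fan: int = 2, levels_per_stage: int = 1) -> int:
    """Pipeline stages needed if each stage covers levels_per_stage tree levels."""
    if levels_per_stage < 1:
        raise ValueError("levels_per_stage must be >= 1")
    levels = tree_levels(num_pes, fan)
    return -(-levels // levels_per_stage)  # ceiling division


for n in (16, 256, 4096):
    print(n, "PEs:", tree_levels(n), "levels,", pipelined_stages(n), "stages")
```

Under these assumptions, putting fewer tree levels in each pipeline stage shortens the cycle time but adds stages, which is the added latency the slide refers to.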

4
Instruction Types
  • Scalar instructions
      • Execute entirely within the control unit
  • Broadcast/Parallel instructions
      • Execute within the PE array
      • Use the broadcast network to transfer instruction
        and data
  • Reduction instructions
      • Execute within the PE array
      • Use the broadcast network to transfer instruction
        and data
      • Use the reduction network to combine data from PEs

5
Scalar Pipeline
  • Instruction Fetch (IF)
  • Instruction Decode (ID)
  • Execute (EX)
  • Memory Access (M)
  • Write Back (W)

6
Hazards in a Scalar Pipeline
7
Unified SIMD Pipeline
  • Broadcast (B1...Bn)
  • Reduction (R1...Rn)
  • Number of stages is variable
  • All instructions go through every stage

8
Diversified SIMD Pipeline
  • Separate paths for each instruction type so
    instructions only go through stages that they use
  • Stalls less often than a unified pipeline
    organization
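
The difference between the unified and diversified organizations can be sketched by listing the stages each instruction type passes through. The scalar stages come from the Scalar Pipeline slide; where the broadcast and reduction stages sit relative to the others is an assumption made for this example, as are the function names.

```python
def broadcast_stages(n: int) -> list[str]:
    return [f"B{i}" for i in range(1, n + 1)]


def reduction_stages(n: int) -> list[str]:
    return [f"R{i}" for i in range(1, n + 1)]


def unified_path(kind: str, b: int, r: int) -> list[str]:
    """Unified pipeline: every instruction goes through every stage."""
    # Assumed ordering: broadcast before execute, reduction before write back.
    return ["IF", "ID"] + broadcast_stages(b) + ["EX", "M"] + reduction_stages(r) + ["W"]


def diversified_path(kind: str, b: int, r: int) -> list[str]:
    """Diversified pipeline: an instruction visits only the stages it uses."""
    if kind == "scalar":
        return ["IF", "ID", "EX", "M", "W"]
    if kind == "parallel":
        return ["IF", "ID"] + broadcast_stages(b) + ["EX", "M", "W"]
    if kind == "reduction":
        return unified_path(kind, b, r)
    raise ValueError(f"unknown instruction kind: {kind}")


for kind in ("scalar", "parallel", "reduction"):
    print(kind, len(unified_path(kind, 2, 3)), len(diversified_path(kind, 2, 3)))
```

With 2 broadcast and 3 reduction stages, a scalar instruction occupies 10 stages in the unified sketch but only 5 in the diversified one, so a dependent scalar instruction waits for fewer cycles.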

9
Hazards
10
Multithreading
  • Pipelining alone cannot eliminate hazards caused
    by broadcast and reduction latencies
  • Solution: use instructions from multiple threads
    to keep the pipeline full
  • Instructions from different threads are
    independent so they cannot generate stalls due to
    data dependencies
  • As long as there are a sufficient number of
    threads, it is possible to fill any number of
    stall cycles
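
The claim that enough threads can fill any number of stall cycles can be checked with a small simulation. This is a simplified model, not the presenters' simulator: it assumes every instruction in a thread depends on the previous one in the same thread, that a thread must wait `stall_cycles` cycles after issuing before it can issue again, and that at most one instruction issues per cycle using round-robin selection. The name `utilization` is hypothetical.

```python
def utilization(num_threads: int, stall_cycles: int, cycles: int = 1000) -> float:
    """Fraction of cycles in which some thread issues an instruction."""
    ready_at = [0] * num_threads  # first cycle at which each thread may issue
    issued = 0
    next_thread = 0  # round-robin starting point
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                # Thread t issues now and stalls for stall_cycles afterwards.
                ready_at[t] = cycle + 1 + stall_cycles
                issued += 1
                next_thread = (t + 1) % num_threads
                break
    return issued / cycles


for threads in (1, 2, 4, 8):
    print(threads, "threads:", utilization(threads, stall_cycles=3))
```

In this model the pipeline stays full once the number of threads reaches `stall_cycles + 1`; with fewer threads, utilization falls proportionally.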

11
Types of Multithreading
  • Coarse-grain multithreading switches to a new
    thread when the current thread encounters a high
    latency operation
  • Fine-grain multithreading switches to a new
    thread every clock cycle
  • Simultaneous multithreading can issue
    instructions from multiple threads in the same
    clock cycle
  • For a SIMD processor, fine-grain or simultaneous
    multithreading is necessary as pipeline stalls
    are relatively short and occur frequently
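
A back-of-the-envelope model suggests why coarse-grain switching fits poorly with short, frequent stalls. It assumes that a coarse-grain switch costs a fixed number of cycles (`switch_penalty`, a hypothetical parameter not given in the slides) and that the control unit takes the cheaper of waiting out the stall or switching. Fine-grain multithreading with enough threads is modeled as losing no cycles.

```python
def lost_cycles_coarse(stall_cycles: int, switch_penalty: int) -> int:
    """Cycles lost per stall: either wait out the stall or pay the switch cost."""
    return min(stall_cycles, switch_penalty)


def efficiency(work_cycles: int, lost_cycles: int) -> float:
    """Useful fraction of time when each run of work is followed by lost cycles."""
    return work_cycles / (work_cycles + lost_cycles)


# Short stall (3 cycles) every 4 useful cycles, assumed 5-cycle switch penalty.
short_stall = lost_cycles_coarse(stall_cycles=3, switch_penalty=5)
print("coarse-grain:", efficiency(4, short_stall))
print("fine-grain:  ", efficiency(4, 0))
```

When the stall is shorter than the switch penalty, switching threads recovers nothing, so the coarse-grain scheme behaves like a single-threaded pipeline in this model.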

12
Multithreaded Control Unit
13
Reduction Hazard with a Single Thread
14
Reduction Hazard with Multiple Threads
15
Execution Time vs. Latency
16
Throughput vs. Latency
17
Multithreaded ASC vs. MASC
  • A multithreaded ASC computer can execute at most
    one instruction in a cycle
  • A MASC computer with j instruction streams can
    execute up to j instructions in a cycle
  • In multithreaded ASC, each thread can access every
    PE
  • In MASC each instruction stream can only access
    its partition of PEs
  • A multithreaded MASC computer could combine the
    advantages of both
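
The comparison on this slide can be summarized in a small model. The `Machine` class and its fields are hypothetical, and the equal-size partitions used for MASC are an assumption made only to produce a concrete number; the slides say only that each instruction stream is limited to its own partition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    num_pes: int
    instruction_streams: int  # j on the slide; 1 for a (multithreaded) ASC

    def peak_instructions_per_cycle(self) -> int:
        # One instruction per instruction stream per cycle at most.
        return self.instruction_streams

    def pes_per_stream(self) -> int:
        # Assumes PEs are split evenly among streams (illustrative only).
        return self.num_pes // self.instruction_streams


mt_asc = Machine("Multithreaded ASC", num_pes=16, instruction_streams=1)
masc = Machine("MASC", num_pes=16, instruction_streams=4)

for m in (mt_asc, masc):
    print(m.name, m.peak_instructions_per_cycle(), m.pes_per_stream())
```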

18
ASC
19
Multithreaded ASC
20
MASC
21
Multithreaded MASC
22
Multithreaded ASC Processor
  • In order to validate simulation results and
    estimate hardware costs, a prototype processor
    was developed
  • Targeted for an Altera Cyclone II (EP2C35) FPGA
  • Using an FPGA makes it possible to get detailed
    measurements of speed and hardware cost

23
Additional Enhancements
  • Flags (logical values) are a first-class data
    type with their own set of registers and
    instructions
  • Extra reduction operators
      • Count Responders
      • Sum
  • Hardware semaphores for thread synchronization
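
The two extra reduction operators can be sketched in software as pairwise (tree-style) reductions over per-PE values. This models the arithmetic only, not the hardware; the helper names are hypothetical, and restricting Sum to responding PEs is an assumption, since the slide does not say whether Sum is masked.

```python
from operator import add


def tree_reduce(values, combine, identity):
    """Combine values pairwise, level by level, as a reduction tree would."""
    level = list(values)
    if not level:
        return identity
    while len(level) > 1:
        if len(level) % 2:
            level.append(identity)  # pad odd levels with the identity element
        level = [combine(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def count_responders(flags):
    """Number of PEs whose responder flag is set."""
    return tree_reduce([1 if f else 0 for f in flags], add, 0)


def masked_sum(values, flags):
    """Sum of the values held by responding PEs (masking is an assumption)."""
    return tree_reduce([v if f else 0 for v, f in zip(values, flags)], add, 0)


flags = [True, False, True, True]
print(count_responders(flags), masked_sum([10, 20, 30, 40], flags))
```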

24
Synthesis Results
  • Targeted for an Altera Cyclone II FPGA (EP2C35)
  • 16 x 16-bit PEs
  • 16 hardware threads
  • Clock speed: 75 MHz