The Vector-Thread Architecture (Presentation Transcript)

1
The Vector-Thread Architecture
  • Ronny Krashinsky,
  • Chris Batten, Krste Asanovic
  • Computer Architecture Group
  • MIT Laboratory for Computer Science
  • ronny@mit.edu
  • www.cag.lcs.mit.edu/scale
  • Boston Area Architecture Workshop (BARC)
  • January 30th, 2003

2
Introduction
  • Architectures are all about exploiting the
    parallelism inherent in applications
  • Performance
  • Energy
  • The Vector-Thread Architecture is a new approach
    that can flexibly take advantage of the many forms
    of parallelism available in different
    applications: instruction, loop, data, and thread
  • The key goal of the vector-thread architecture is
    efficiency: high performance with low power
    consumption and small area
  • A clean, compiler-friendly programming model is
    key to realizing these goals

3
Instruction Parallelism
  • Independent instructions can execute concurrently
  • Super-scalar architectures dynamically schedule
    instructions in hardware to enable out-of-order
    and parallel execution
  • Software statically schedules parallel
    instructions on a VLIW machine

[Figure: a super-scalar machine tracks instruction dependencies
dynamically in hardware, while a VLIW machine relies on a static
software schedule]
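As an illustration (a sketch, not from the slides; the function and
variable names are invented), the C fragment below shows the kind of
independent instructions both approaches exploit:

    /* a and b have no data dependence, so a super-scalar or VLIW
       machine can issue them in the same cycle; c must wait for
       both results. */
    int f(int x, int y, int w, int z) {
        int a = x + y;
        int b = w * z;
        int c = a + b;   /* serialized on a and b */
        return c;
    }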
4
Loop Parallelism
  • Operations from disjoint iterations of a loop can
    execute in parallel
  • VLIW architectures use software pipelining to
    statically schedule instructions from different
    loop iterations to execute concurrently

[Figure: software pipeline on a VLIW machine, overlapping the load,
add, and store operations of iterations 0 through 4 so that operations
from several iterations execute concurrently]
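The figure's loop can be sketched in C as follows (the array names and
the add operand are assumptions; the slide specifies only a load, an
add, and a store per iteration):

    /* Each iteration is a load, an add, and a store. A software-
       pipelined VLIW schedule overlaps the load of iteration i+2,
       the add of iteration i+1, and the store of iteration i. */
    void loop(const int *A, int *B, int n) {
        for (int i = 0; i < n; i++) {
            int t = A[i];   /* load  */
            t = t + 1;      /* add   */
            B[i] = t;       /* store */
        }
    }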
5
Data Parallelism
  • A single operation can be applied in parallel
    across a set of data
  • In vector architectures, one instruction
    identifies a set of independent operations which
    can execute in parallel
  • Control overhead can be amortized

[Figure: a single vector instruction identifying a set of independent
element operations that execute in parallel]
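For example (a sketch reusing the C = A + B loop that appears on slide
8), every element operation below is independent, so one vector
instruction can identify all N adds and pay the control overhead once:

    /* Data-parallel loop: the N element-wise adds are independent
       across i, so a vector architecture can execute them in
       parallel under a single instruction. */
    void vadd(const int *A, const int *B, int *C, int n) {
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }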
6
Thread Parallelism
  • Separate threads of control can execute
    concurrently
  • Multiprocessor architectures allow different
    threads to execute at the same time on different
    processors
  • Multithreaded architectures execute multiple
    threads at the same time to better utilize a
    single set of processing resources

[Figure: threads running on separate processors of a multiprocessor,
and multiple threads sharing one set of resources on an SMT processor]
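As a minimal sketch using POSIX threads (an illustration, not from the
slides), two threads of control execute concurrently, on different
processors of a multiprocessor or on the shared resources of one SMT
core:

    #include <pthread.h>

    /* Each thread runs its own control flow over its own data. */
    static void *worker(void *arg) {
        int *sum = (int *)arg;
        for (int i = 0; i < 1000; i++)
            *sum += i;
        return NULL;
    }

    int main(void) {
        int s0 = 0, s1 = 0;
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, &s0);
        pthread_create(&t1, NULL, worker, &s1);
        pthread_join(t0, NULL);   /* wait for both threads */
        pthread_join(t1, NULL);
        return s0 == s1 ? 0 : 1;
    }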
7
Vector-Thread Architecture Overview
  • Data parallelism: start with a vector architecture
  • Thread parallelism: give the execution units local
    control
  • Loop parallelism: allow fine-grain dataflow
    communication between execution units
  • Instruction parallelism: add wide issue

8
Vector Architecture
[Figure: programming model in which a control thread issues vector
instructions to a set of virtual processors VP0, VP1, ..., VP(N-1)]
  • A control thread interacts with a set of virtual
    processors (VPs)
  • VPs contain registers and execution units
  • VPs execute instructions under slave control
  • Each iteration of a vectorizable loop is mapped to
    its own VP (with stripmining)

Using VPs for Vectorizable Loops

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

[Figure: each iteration i is mapped to VP(i); the control thread issues
vector-execute commands for load A, load B, add, and store, and each VP
performs the corresponding loadA, loadB, add, and store operations for
its iteration]
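A sketch of the stripmining mentioned above: when N exceeds the number
of VPs, the loop is processed in vector-length-sized strips (VLMAX here
is illustrative, not an actual SCALE parameter):

    /* Stripmining: process the N iterations in strips of at most
       VLMAX elements, mapping one element of each strip to each VP. */
    #define VLMAX 16
    void vadd_stripmined(const int *A, const int *B, int *C, int n) {
        for (int i = 0; i < n; i += VLMAX) {
            int vl = (n - i < VLMAX) ? (n - i) : VLMAX; /* strip length */
            for (int j = 0; j < vl; j++)   /* one strip's vector work */
                C[i + j] = A[i + j] + B[i + j];
        }
    }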
9
Vector Microarchitecture
[Figure: microarchitecture with four lanes; virtual processors VP0
through VP15 are striped across Lane 0 through Lane 3 and receive
commands from the control processor]
Execution on Vector Processor
  • Lanes contain register files and execution units;
    VPs map to lanes and share these physical resources
  • Operations execute in parallel across lanes and
    sequentially for each VP mapped to a lane, so
    control overhead is amortized to save energy

[Figure: execution timeline on Lanes 0 through 3; each vector-execute
command (load A, load B, add, store) expands into one operation per VP,
executing in parallel across the four lanes and sequentially down the
VPs within each lane]
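The striping in the figure corresponds to the simple index mapping
below (inferred from the figure's placement of VP0 through VP15; the
helper names are invented):

    /* 16 VPs striped round-robin across 4 lanes: VP0..VP3 occupy
       lanes 0..3, VP4..VP7 wrap around to lanes 0..3, and so on. */
    int lane_of(int vp) { return vp % 4; }  /* physical lane        */
    int slot_of(int vp) { return vp / 4; }  /* position within lane */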
10
Vector-Thread Architecture
[Figure: programming model with a control thread providing slave
control, micro-threaded control at each VP, and cross-VP communication
links from each VP to its successor among VP0, VP1, ..., VP(N-1)]
  • Vector of virtual processors (similar to a
    traditional vector architecture)
  • VPs are decoupled; local instruction queues break
    the rigid synchronization of vector architectures
  • Under slave control, the control thread sends
    instructions to all VPs
  • Under micro-threaded control, each VP fetches its
    own instructions
  • Cross-VP communication allows each VP to send
    data to its successor

11
Using VPs for Do-Across Loops
for (i = 0; i < N; i++) { x = x + A[i]; C[i] = x; }

[Figure: each iteration i is mapped to VP(i); the atomic instruction
block (AIB) for one iteration contains load, recv, add, send, and
store operations, and the cross-VP send/recv chain carries x from
each VP to the next]
  • VPs execute atomic instruction blocks (AIBs)
  • Each iteration of a data-dependent loop is mapped
    to its own VP
  • Cross-VP send and recv operations communicate
    do-across results from one VP to the next VP
    (the next iteration in time)
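A C-like pseudocode rendering of one iteration's AIB from the figure;
cross_vp_recv and cross_vp_send are hypothetical stand-ins for the
slide's cross-VP recv and send operations, not a real SCALE API:

    int  cross_vp_recv(void);   /* hypothetical: receive from prev VP */
    void cross_vp_send(int x);  /* hypothetical: send to next VP      */

    /* The AIB executed by VP(i) for iteration i. */
    void aib_iteration(const int *A, int *C, int i) {
        int a = A[i];             /* load                */
        int x = cross_vp_recv();  /* recv x from VP(i-1) */
        x = x + a;                /* add: x = x + A[i]   */
        cross_vp_send(x);         /* send x to VP(i+1)   */
        C[i] = x;                 /* store: C[i] = x     */
    }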

12
Vector-Thread Microarchitecture
[Figure: microarchitecture with VP0 through VP15 striped across four
lanes; each lane has a small instruction cache with an instruction-fill
path, receives execute directives, and connects to its neighbors
through the do-across network]
  • VPs are striped across lanes as in a traditional
    vector machine
  • Lanes have a small instruction cache (e.g., 32
    instructions) and decoupled execution
  • Execute directives point to atomic instruction
    blocks and indicate which VP(s) the AIB should be
    executed for; they are generated by a control-thread
    vector-execute command or by a VP fetch instruction
  • The do-across network includes dataflow handshake
    signals; the receiver stalls until data is ready

13
Do-Across Execution
[Figure: do-across execution timeline on Lanes 0 through 3; each
iteration's AIB (load, recv, add, send, store) executes on its lane,
and the send/recv chain links consecutive iterations across lanes]
  • Dataflow execution resolves do-across
    dependencies dynamically
  • Independent instructions execute in parallel, so
    performance adapts to the software critical path
  • Instruction fetch overhead is amortized across loop
    iterations

14
Micro-Threading VPs
[Figure: VP0, VP1, ..., VP(N-1), each fetching its own thread of
instructions]
  • VPs also have the ability to fetch their own
    instructions, enabling each VP to execute its own
    thread of control
  • The control thread can send a vector fetch
    instruction to all VPs (i.e., a vector fork), which
    allows efficient thread startup
  • The control thread can stall until the micro-threads
    finish (stop fetching instructions)
  • Micro-threading enables data-dependent control flow
    within a loop iteration (an alternative to
    predication)
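A sketch (the function and array names are assumed) of the
data-dependent control flow that micro-threading enables: each VP
fetches down whichever branch path its own data takes, instead of
predicating both paths:

    /* Each iteration maps to a VP; a micro-threaded VP branches on
       its own element, an alternative to executing both paths under
       predication. */
    void clamp_negatives(const int *A, int *C, int n) {
        for (int i = 0; i < n; i++) {
            if (A[i] < 0)
                C[i] = 0;      /* one VP's thread takes this path */
            else
                C[i] = A[i];   /* another takes this path         */
        }
    }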

15
Loop Parallelism and Architectures
Loops are ubiquitous and contain ample parallelism
across iterations.
  • Super-scalar: must track dependencies between all
    instructions in a loop body (and correctly predict
    branches) before executing instructions in the
    subsequent iteration, and must do this repeatedly
    for every loop iteration
  • VLIW: software pipelining exposes parallelism, but
    requires static scheduling, which is difficult and
    inadequate with dynamic latencies and dependencies
  • Vector: efficient, but limited to do-all loops with
    no do-across
  • Vector-thread: software efficiently exposes
    parallelism, and dynamic dataflow automatically
    adapts to the critical path; uses simple in-order
    execution units, and amortizes instruction fetch
    overhead across loop iterations
16
Using the Vector-Thread Architecture
[Figure: multi-paradigm support; the same control thread and virtual
processors provide vector execution for do-all loops, vector-threading
for do-across loops, and micro-threading for threads and ILP, combining
performance with energy efficiency]
  • The Vector-Thread Architecture seeks to
    efficiently exploit the available parallelism in
    any given application
  • Using the same set of resources, it can flexibly
    transition from pure data-parallel operation, to
    parallel loop execution with do-across
    dependencies, to fine-grain multi-threading

17
SCALE-0 Overview
[Figure: SCALE-0 tile block diagram; a control processor and a vector
thread unit share a 32KB configurable L1 instruction/data cache
(4x128b banks) through a cache memory management unit (CMMU) with an
outstanding transaction table, plus a network interface (256b and 128b
links); each clustered virtual processor comprises Cluster 0 (memory,
IALU), Cluster 1 (IALU), Cluster 2 (FP-ADD), and Cluster 3 (FP-MUL),
each with a local register file, connected by inter-cluster
communication and next-VP/prev-VP links]