The Vector-Thread Architecture (Presentation Transcript)

1
The Vector-Thread Architecture
  • Ronny Krashinsky,
  • Chris Batten, Krste Asanovic
  • Computer Architecture Group
  • MIT Laboratory for Computer Science
  • ronny@mit.edu
  • www.cag.lcs.mit.edu/scale
  • Boston Area Architecture Workshop (BARC)
  • January 30th, 2003

2
Introduction
  • Architectures are all about exploiting the
    parallelism inherent in applications
  • Performance
  • Energy
  • The Vector-Thread Architecture is a new approach
    that can flexibly take advantage of the many forms
    of parallelism available in different
    applications: instruction, loop, data, and thread
  • The key goal of the vector-thread architecture is
    efficiency: high performance with low power
    consumption and small area
  • A clean, compiler-friendly programming model is
    key to realizing these goals

3
Instruction Parallelism
  • Independent instructions can execute concurrently
  • Super-scalar architectures dynamically schedule
    instructions in hardware to enable out-of-order
    and parallel execution
  • Software statically schedules parallel
    instructions on a VLIW machine

[Figure: a super-scalar machine tracks instruction dependencies
dynamically in hardware, while a VLIW machine relies on a static
software schedule]
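As an illustration (a sketch, not from the slides; the function and
variable names are invented), the C fragment below shows the kind of
independent instructions both approaches exploit:

    /* a and b have no data dependence, so a super-scalar or VLIW
       machine can issue them in the same cycle; c must wait for
       both results. */
    int f(int x, int y, int w, int z) {
        int a = x + y;
        int b = w * z;
        int c = a + b;   /* serialized on a and b */
        return c;
    }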
4
Loop Parallelism
  • Operations from disjoint iterations of a loop can
    execute in parallel
  • VLIW architectures use software pipelining to
    statically schedule instructions from different
    loop iterations to execute concurrently

[Figure: software pipeline on a VLIW machine, overlapping the load,
add, and store operations of iterations 0 through 4 so that operations
from several iterations execute concurrently]
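The figure's loop can be sketched in C as follows (the array names and
the add operand are assumptions; the slide specifies only a load, an
add, and a store per iteration):

    /* Each iteration is a load, an add, and a store. A software-
       pipelined VLIW schedule overlaps the load of iteration i+2,
       the add of iteration i+1, and the store of iteration i. */
    void loop(const int *A, int *B, int n) {
        for (int i = 0; i < n; i++) {
            int t = A[i];   /* load  */
            t = t + 1;      /* add   */
            B[i] = t;       /* store */
        }
    }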
5
Data Parallelism
  • A single operation can be applied in parallel
    across a set of data
  • In vector architectures, one instruction
    identifies a set of independent operations which
    can execute in parallel
  • Control overhead can be amortized

[Figure: a single vector instruction identifying a set of independent
element operations that execute in parallel]
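For example (a sketch reusing the C = A + B loop that appears on slide
8), every element operation below is independent, so one vector
instruction can identify all N adds and pay the control overhead once:

    /* Data-parallel loop: the N element-wise adds are independent
       across i, so a vector architecture can execute them in
       parallel under a single instruction. */
    void vadd(const int *A, const int *B, int *C, int n) {
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }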
6
Thread Parallelism
  • Separate threads of control can execute
    concurrently
  • Multiprocessor architectures allow different
    threads to execute at the same time on different
    processors
  • Multithreaded architectures execute multiple
    threads at the same time to better utilize a
    single set of processing resources

[Figure: threads running on separate processors of a multiprocessor,
and multiple threads sharing one set of resources on an SMT processor]
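As a minimal sketch using POSIX threads (an illustration, not from the
slides), two threads of control execute concurrently, on different
processors of a multiprocessor or on the shared resources of one SMT
core:

    #include <pthread.h>

    /* Each thread runs its own control flow over its own data. */
    static void *worker(void *arg) {
        int *sum = (int *)arg;
        for (int i = 0; i < 1000; i++)
            *sum += i;
        return NULL;
    }

    int main(void) {
        int s0 = 0, s1 = 0;
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, &s0);
        pthread_create(&t1, NULL, worker, &s1);
        pthread_join(t0, NULL);   /* wait for both threads */
        pthread_join(t1, NULL);
        return s0 == s1 ? 0 : 1;
    }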
7
Vector-Thread Architecture Overview
  • Data parallelism: start with a vector architecture
  • Thread parallelism: give the execution units local
    control
  • Loop parallelism: allow fine-grain dataflow
    communication between execution units
  • Instruction parallelism: add wide issue

8
Vector Architecture
[Figure: programming model in which a control thread issues vector
instructions to a set of virtual processors VP0, VP1, ..., VP(N-1)]
  • A control thread interacts with a set of virtual
    processors (VPs)
  • VPs contain registers and execution units
  • VPs execute instructions under slave control
  • Each iteration of a vectorizable loop is mapped to
    its own VP (with stripmining)

Using VPs for Vectorizable Loops

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

[Figure: each iteration i is mapped to VP(i); the control thread issues
vector-execute commands for load A, load B, add, and store, and each VP
performs the corresponding loadA, loadB, add, and store operations for
its iteration]
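A sketch of the stripmining mentioned above: when N exceeds the number
of VPs, the loop is processed in vector-length-sized strips (VLMAX here
is illustrative, not an actual SCALE parameter):

    /* Stripmining: process the N iterations in strips of at most
       VLMAX elements, mapping one element of each strip to each VP. */
    #define VLMAX 16
    void vadd_stripmined(const int *A, const int *B, int *C, int n) {
        for (int i = 0; i < n; i += VLMAX) {
            int vl = (n - i < VLMAX) ? (n - i) : VLMAX; /* strip length */
            for (int j = 0; j < vl; j++)   /* one strip's vector work */
                C[i + j] = A[i + j] + B[i + j];
        }
    }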
9
Vector Microarchitecture
[Figure: microarchitecture with four lanes; virtual processors VP0
through VP15 are striped across Lane 0 through Lane 3 and receive
commands from the control processor]
Execution on Vector Processor
  • Lanes contain register files and execution units;
    VPs map to lanes and share these physical resources
  • Operations execute in parallel across lanes and
    sequentially for each VP mapped to a lane, so
    control overhead is amortized to save energy

[Figure: execution timeline on Lanes 0 through 3; each vector-execute
command (load A, load B, add, store) expands into one operation per VP,
executing in parallel across the four lanes and sequentially down the
VPs within each lane]
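The striping in the figure corresponds to the simple index mapping
below (inferred from the figure's placement of VP0 through VP15; the
helper names are invented):

    /* 16 VPs striped round-robin across 4 lanes: VP0..VP3 occupy
       lanes 0..3, VP4..VP7 wrap around to lanes 0..3, and so on. */
    int lane_of(int vp) { return vp % 4; }  /* physical lane        */
    int slot_of(int vp) { return vp / 4; }  /* position within lane */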
10
Vector-Thread Architecture
[Figure: programming model with a control thread providing slave
control, micro-threaded control at each VP, and cross-VP communication
links from each VP to its successor among VP0, VP1, ..., VP(N-1)]
  • Vector of virtual processors (similar to a
    traditional vector architecture)
  • VPs are decoupled; local instruction queues break
    the rigid synchronization of vector architectures
  • Under slave control, the control thread sends
    instructions to all VPs
  • Under micro-threaded control, each VP fetches its
    own instructions
  • Cross-VP communication allows each VP to send
    data to its successor

11
Using VPs for Do-Across Loops
for (i = 0; i < N; i++) { x = x + A[i]; C[i] = x; }

[Figure: each iteration i is mapped to VP(i); the atomic instruction
block (AIB) for one iteration contains load, recv, add, send, and
store operations, and the cross-VP send/recv chain carries x from
each VP to the next]
  • VPs execute atomic instruction blocks (AIBs)
  • Each iteration of a data-dependent loop is mapped
    to its own VP
  • Cross-VP send and recv operations communicate
    do-across results from one VP to the next VP
    (the next iteration in time)
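A C-like pseudocode rendering of one iteration's AIB from the figure;
cross_vp_recv and cross_vp_send are hypothetical stand-ins for the
slide's cross-VP recv and send operations, not a real SCALE API:

    int  cross_vp_recv(void);   /* hypothetical: receive from prev VP */
    void cross_vp_send(int x);  /* hypothetical: send to next VP      */

    /* The AIB executed by VP(i) for iteration i. */
    void aib_iteration(const int *A, int *C, int i) {
        int a = A[i];             /* load                */
        int x = cross_vp_recv();  /* recv x from VP(i-1) */
        x = x + a;                /* add: x = x + A[i]   */
        cross_vp_send(x);         /* send x to VP(i+1)   */
        C[i] = x;                 /* store: C[i] = x     */
    }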

12
Vector-Thread Microarchitecture
[Figure: microarchitecture with VP0 through VP15 striped across four
lanes; each lane has a small instruction cache with an instruction-fill
path, receives execute directives, and connects to its neighbors
through the do-across network]
  • VPs are striped across lanes as in a traditional
    vector machine
  • Lanes have a small instruction cache (e.g., 32
    instructions) and decoupled execution
  • Execute directives point to atomic instruction
    blocks and indicate which VP(s) the AIB should be
    executed for; they are generated by a control-thread
    vector-execute command or by a VP fetch instruction
  • The do-across network includes dataflow handshake
    signals; the receiver stalls until data is ready

13
Do-Across Execution
[Figure: do-across execution timeline on Lanes 0 through 3; each
iteration's AIB (load, recv, add, send, store) executes on its lane,
and the send/recv chain links consecutive iterations across lanes]
  • Dataflow execution resolves do-across
    dependencies dynamically
  • Independent instructions execute in parallel, so
    performance adapts to the software critical path
  • Instruction fetch overhead is amortized across loop
    iterations

14
Micro-Threading VPs
[Figure: VP0, VP1, ..., VP(N-1), each fetching its own thread of
instructions]
  • VPs also have the ability to fetch their own
    instructions, enabling each VP to execute its own
    thread of control
  • The control thread can send a vector fetch
    instruction to all VPs (i.e., a vector fork), which
    allows efficient thread startup
  • The control thread can stall until the micro-threads
    finish (stop fetching instructions)
  • Micro-threading enables data-dependent control flow
    within a loop iteration (an alternative to
    predication)
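A sketch (the function and array names are assumed) of the
data-dependent control flow that micro-threading enables: each VP
fetches down whichever branch path its own data takes, instead of
predicating both paths:

    /* Each iteration maps to a VP; a micro-threaded VP branches on
       its own element, an alternative to executing both paths under
       predication. */
    void clamp_negatives(const int *A, int *C, int n) {
        for (int i = 0; i < n; i++) {
            if (A[i] < 0)
                C[i] = 0;      /* one VP's thread takes this path */
            else
                C[i] = A[i];   /* another takes this path         */
        }
    }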

15
Loop Parallelism and Architectures
Loops are ubiquitous and contain ample parallelism
across iterations.
  • Super-scalar: must track dependencies between all
    instructions in a loop body (and correctly predict
    branches) before executing instructions in the
    subsequent iteration, and must do this repeatedly
    for every loop iteration
  • VLIW: software pipelining exposes parallelism, but
    requires static scheduling, which is difficult and
    inadequate with dynamic latencies and dependencies
  • Vector: efficient, but limited to do-all loops with
    no do-across
  • Vector-thread: software efficiently exposes
    parallelism, and dynamic dataflow automatically
    adapts to the critical path; uses simple in-order
    execution units, and amortizes instruction fetch
    overhead across loop iterations
16
Using the Vector-Thread Architecture
[Figure: multi-paradigm support; the same control thread and virtual
processors provide vector execution for do-all loops, vector-threading
for do-across loops, and micro-threading for threads and ILP, combining
performance with energy efficiency]
  • The Vector-Thread Architecture seeks to
    efficiently exploit the available parallelism in
    any given application
  • Using the same set of resources, it can flexibly
    transition from pure data-parallel operation, to
    parallel loop execution with do-across
    dependencies, to fine-grain multi-threading

17
SCALE-0 Overview
[Figure: SCALE-0 tile block diagram; a control processor and a vector
thread unit share a 32KB configurable L1 instruction/data cache
(4x128b banks) through a cache memory management unit (CMMU) with an
outstanding transaction table, plus a network interface (256b and 128b
links); each clustered virtual processor comprises Cluster 0 (memory,
IALU), Cluster 1 (IALU), Cluster 2 (FP-ADD), and Cluster 3 (FP-MUL),
each with a local register file, connected by inter-cluster
communication and next-VP/prev-VP links]