1
Fine-Grain Performance Scaling of Soft Vector
Processors
  • Peter Yiannacouras
  • Jonathan Rose
  • Gregory J. Steffan
  • ESWEEK CASES 2009, Grenoble, France
  • Oct 13, 2009

2
FPGA Systems and Soft Processors
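Digital system computation can be implemented three ways:
  • Custom hardware (HDL + CAD): months of design effort; faster, smaller, less power
  • Soft processor (software + compiler): weeks of design effort; easier, configurable; used in 25% of designs (source: Altera, 2009)
  • Hard processor: costs board space, latency, power; needs a specialized device at increased cost
  • Soft processors must COMPETE with custom hardware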
3
Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    c[i] = a[i] + b[i];

// Vectorized code
set    vl, 16
vload  vr0, a
vload  vr1, b
vadd   vr2, vr0, vr1
vstore vr2, c

1 Vector Lane
vadd element operations: vr2[i] = vr0[i] + vr1[i], for i = 15, 14, ..., 0
Each vector instruction holds many units of independent operations
4
Vector Processing Primer
// C code
for (i = 0; i < 16; i++)
    c[i] = a[i] + b[i];

// Vectorized code
set    vl, 16
vload  vr0, a
vload  vr1, b
vadd   vr2, vr0, vr1
vstore vr2, c

16 Vector Lanes
vadd element operations: vr2[i] = vr0[i] + vr1[i], for i = 15, 14, ..., 0
Each vector instruction holds many units of independent operations
  • Previous Work (on Soft Vector Processors), CASES'08
    • Scalability
    • Flexibility
    • Portability
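The difference between 1 lane and 16 lanes can be sketched in plain C. This is an idealized illustration, not VESPA's implementation: the function names are hypothetical, one outer-loop iteration stands in for one cycle, and pipeline latency and memory stalls are ignored.

```c
#include <assert.h>

/* Idealized cycle count for one vector instruction: `lanes` element
 * operations complete per cycle, so vl elements need ceil(vl / lanes). */
static unsigned vector_cycles(unsigned vl, unsigned lanes)
{
    assert(lanes > 0);
    return (vl + lanes - 1) / lanes;
}

/* Scalar emulation of the vadd from the slide. Each outer iteration models
 * one cycle; the inner loop models the lanes working in parallel on
 * independent elements. Returns the number of modeled cycles. */
static unsigned vadd_emulated(const int *a, const int *b, int *c,
                              unsigned vl, unsigned lanes)
{
    unsigned cycles = 0;
    assert(lanes > 0);
    for (unsigned base = 0; base < vl; base += lanes) {
        for (unsigned lane = 0; lane < lanes && base + lane < vl; lane++) {
            c[base + lane] = a[base + lane] + b[base + lane];
        }
        cycles++;
    }
    return cycles;
}
```

Under this model the 16-element vadd takes 16 cycles on 1 lane and 1 cycle on 16 lanes, which is why adding lanes scales performance without changing the vectorized code.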
5
VESPA Architecture Design (Vector Extended Soft Processor Architecture)
[Block diagram; legend: pipe stage, logic, storage]
  • Scalar Pipeline (3-stage): Icache, Decode, RF, ALU, MUX, WB, Dcache
  • Vector Control Pipeline (3-stage): Decode, VC RF, VC WB, VS RF, VS WB, Logic
  • Vector Pipeline (6-stage): Decode, Replicate, Hazard check, VR RF, VR WB
  • Lane 1: ALU, Mem Unit
  • Lane 2: ALU, Mem, Mul
  • 32-bit Lanes
  • Shared Dcache
  • Supports integer and fixed-point operations (VIRAM)
6
In This Work
  • Evaluate for real using modern hardware
  • Scale to 32 lanes (previous work did 16 lanes)
  • Add more fine-grain architectural parameters
  • Scale more finely
  • Augment with parameterized vector chaining
    support
  • Customize to functional unit demand
  • Augment with heterogeneous lanes
  • Explore a large design space

7
Evaluation Infrastructure
[Flow diagram]
  • SOFTWARE: EEMBC Benchmarks (GCC Compiler) and vectorized assembly subroutines (GNU as), linked with ld into a Binary
  • HARDWARE: Verilog, full hardware design of VESPA soft vector processor
  • Instruction Set Simulation: verification
  • RTL Simulation: verification
  • FPGA CAD Software: area, power, clock frequency
  • Stratix III 340 with DDR2: cycles
8
VESPA Scalability
[Chart: speedup as the number of lanes scales; annotations: 19x and 11x]
(Area=1) (Area=1.3) (Area=1.9) (Area=3.2) (Area=6.3) (Area=12.3)
9
Vector Lane Design Space
8% of largest FPGA
(Equivalent ALMs)
10
In This Work
  • Evaluate for real using modern hardware
  • Scale to 32 lanes (previous work did 16 lanes)
  • Add more fine-grain architectural parameters
  • Scale more finely
  • Augment with parameterized vector chaining
    support
  • Customize to functional unit demand
  • Augment with heterogeneous lanes
  • Explore a large design space

11
Vector Chaining
  • Simultaneous execution of independent element
    operations within dependent instructions

Dependent Instructions
    vadd vr10, vr1, vr2      (element operations 0 to 7)
    vmul vr20, vr10, vr11    (element operations 0 to 7; dependency on vr10)
Independent Element Operations: element i of the vmul depends only on element i of the vadd
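A simplified timing model shows why chaining helps for the vadd/vmul pair above. This is a sketch under stated assumptions, not VESPA's measured timing: each instruction processes `lanes` elements per cycle, the adder and multiplier are separate units, the register file can feed both, and the consumer runs exactly one cycle behind the producer. All function names are hypothetical.

```c
#include <assert.h>

/* Number of element groups one instruction is split into. */
static unsigned element_groups(unsigned vl, unsigned lanes)
{
    assert(lanes > 0);
    return (vl + lanes - 1) / lanes;
}

/* Without chaining, the dependent vmul waits for the entire vadd. */
static unsigned cycles_no_chaining(unsigned vl, unsigned lanes)
{
    return 2 * element_groups(vl, lanes);
}

/* With chaining, vmul group g starts one cycle after vadd group g,
 * so the two instructions overlap except for a one-cycle offset. */
static unsigned cycles_chained(unsigned vl, unsigned lanes)
{
    unsigned n = element_groups(vl, lanes);
    return n == 0 ? 0 : n + 1;
}
```

For the 8 element operations drawn on the slide and a single lane, the model gives 16 cycles without chaining and 9 with it; the gain shrinks as the number of lanes approaches the vector length.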
12
Vector Chaining in VESPA (Lanes=4)
  • No Vector Chaining (B=1): unified vector register file; single instruction execution; vadd and vmul occupy the lanes one after the other in time
  • With Vector Chaining (B=2): vector register file split into Bank 0 and Bank 1; multiple instruction execution; vadd and vmul overlap in time
  • Functional units shown: Mem, Mul
13
ALU Replication (Lanes=4)
  • With Vector Chaining, B=2, APB=false: vector register file in Bank 0 and Bank 1; vadd and vsub both need the ALU, so single instruction execution
  • With Vector Chaining, B=2, APB=true: vector register file in Bank 0 and Bank 1; ALU replicated per bank, so vadd and vsub overlap in time (multiple instruction execution)
  • Functional units shown: Mem, Mul
14
Vector Chaining Speedup (on an 8-lane VESPA)
[Chart: Cycle Speedup vs No Chaining; benchmark groups labeled: More banks; More banks, More ALUs; More ALUs; Don't care]
15
In This Work
  • Evaluate for real using modern hardware
  • Scale to 32 lanes (previous work did 16 lanes)
  • Add more fine-grain architectural parameters
  • Scale more finely
  • Augment with parameterized vector chaining
    support
  • Customize to functional unit demand
  • Augment with heterogeneous lanes
  • Explore a large design space

16
Heterogeneous Lanes
4 Lanes (L=4), 2 Multiplier Lanes (X=2)
vmul
  • Lane 1: Mul
  • Lane 2: Mul
  • Lane 3: Mul
  • Lane 4: Mul
17
Heterogeneous Lanes
4 Lanes (L=4), 2 Multiplier Lanes (X=2)
vmul: STALL!
  • Lane 1: Mul
  • Lane 2: Mul
  • Lane 3: (no multiplier)
  • Lane 4: (no multiplier)
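The cost of the stall can be estimated with a small model. It assumes, as a simplification not confirmed by the slides, that a vmul is serviced only by the X lanes that have multipliers while every other operation uses all L lanes. The function names are hypothetical.

```c
#include <assert.h>

static unsigned ceil_div(unsigned a, unsigned b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Cycles for an ALU operation that every lane can execute. */
static unsigned alu_op_cycles(unsigned vl, unsigned lanes)
{
    return ceil_div(vl, lanes);
}

/* Cycles for a multiply when only `mul_lanes` of the lanes have a
 * multiplier; the remaining lanes stall while the multiplies drain. */
static unsigned mul_op_cycles(unsigned vl, unsigned lanes, unsigned mul_lanes)
{
    assert(mul_lanes >= 1 && mul_lanes <= lanes);
    return ceil_div(vl, mul_lanes);
}
```

With L=4 and X=2, a 16-element vmul takes 8 modeled cycles instead of 4, while a vadd is unaffected. An application with few multiplies therefore loses little when multipliers are removed.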
18
Impact of Heterogeneous Lanes (on a 32-lane VESPA)
[Chart; benchmark groups labeled: Free, Moderate, Expensive]
19
In This Work
  • Evaluate for real using modern hardware
  • Scale to 32 lanes (previous work did 16 lanes)
  • Add more fine-grain architectural parameters
  • Scale more finely
  • Augment with parameterized vector chaining
    support
  • Customize to functional unit demand
  • Augment with heterogeneous lanes
  • Explore a large design space

20
Design Space Exploration using VESPA
Architectural Parameters

Description                   Symbol   Values
Compute Architecture
  Number of Lanes             L        1, 2, 4, 8, ...
  Memory Crossbar Lanes       M        1, 2, ..., L
  Multiplier Lanes            X        1, 2, ..., L
  Banks for Vector Chaining   B        1, 2, 4
  ALU Replicate Per Bank      APB      on/off
Instruction Set Architecture
  Maximum Vector Length       MVL      2, 4, 8, ...
  Width of Lanes (in bits)    W        1-32
  Instruction Enable (each)   -        on/off
Memory Architecture
  Data Cache Capacity         DD       any
  Data Cache Line Size        DW       any
  Data Prefetch Size          DPK      < DD
  Vector Data Prefetch Size   DPV      < DD/MVL
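The constraints in the parameter table can be written down as a configuration record with a validity check. This is a sketch, not part of the VESPA release: the struct and function names are hypothetical, the "1, 2, 4, 8, ..." rows are assumed to mean powers of two, units for the cache parameters are left unspecified as in the table, and the per-instruction enables are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vespa_config {
    unsigned L;    /* number of lanes */
    unsigned M;    /* memory crossbar lanes, 1..L */
    unsigned X;    /* multiplier lanes, 1..L */
    unsigned B;    /* banks for vector chaining: 1, 2, or 4 */
    bool     APB;  /* ALU replicated per bank: on/off */
    unsigned MVL;  /* maximum vector length: 2, 4, 8, ... */
    unsigned W;    /* lane width in bits, 1..32 */
    unsigned DD;   /* data cache capacity (any) */
    unsigned DW;   /* data cache line size (any) */
    unsigned DPK;  /* data prefetch size, must be < DD */
    unsigned DPV;  /* vector data prefetch size, must be < DD/MVL */
};

static bool is_pow2(unsigned v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Returns true when the configuration satisfies the table's constraints. */
static bool vespa_config_valid(const struct vespa_config *c)
{
    assert(c != NULL);
    if (!is_pow2(c->L)) return false;               /* assumed power of two */
    if (c->M < 1 || c->M > c->L) return false;
    if (c->X < 1 || c->X > c->L) return false;
    if (c->B != 1 && c->B != 2 && c->B != 4) return false;
    if (c->MVL < 2 || !is_pow2(c->MVL)) return false;
    if (c->W < 1 || c->W > 32) return false;
    if (c->DPK >= c->DD) return false;
    if (c->DPV >= c->DD / c->MVL) return false;
    return true;
}
```

A check like this is useful when sweeping many configurations, since invalid combinations (for example more multiplier lanes than lanes) can be rejected before any hardware is generated.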
21
VESPA Design Space (768 architectural configurations)
[Scatter plot: Normalized Wall Clock Time vs. Normalized Coprocessor Area (area axis ticks 1, 2, 4, 8, 16, 32, 64); annotations: 28x range, 18x range, 4x, 4x]
22
Summary
  • Evaluated VESPA on modern FPGA hardware
    • Scale up to 32 lanes with 11x average speedup
  • Augmented VESPA with fine-tunable parameters
    • Vector Chaining (by banking the register file)
      • 22-35% better average performance than without
      • Chaining configuration impact very application-dependent
    • Heterogeneous Lanes: lanes w/o multipliers
      • Multipliers saved, costs performance (sometimes free)
  • Explored a vast architectural design space
    • 18x range in performance, 28x range in area

23
Thank You!
  • VESPA release: http://www.eecg.utoronto.ca/VESPA

24
VESPA Parameters
Description                   Symbol   Values
Compute Architecture
  Number of Lanes             L        1, 2, 4, 8, ...
  Memory Crossbar Lanes       M        1, 2, ..., L
  Multiplier Lanes            X        1, 2, ..., L
  Banks for Vector Chaining   B        1, 2, 4
  ALU Replicate Per Bank      APB      on/off
Instruction Set Architecture
  Maximum Vector Length       MVL      2, 4, 8, ...
  Width of Lanes (in bits)    W        1-32
  Instruction Enable (each)   -        on/off
Memory Architecture
  Data Cache Capacity         DD       any
  Data Cache Line Size        DW       any
  Data Prefetch Size          DPK      < DD
  Vector Data Prefetch Size   DPV      < DD/MVL
25
VESPA Scalability
[Chart: speedup as the number of lanes scales; annotations: 27x and 15x]
(Area=1) (Area=1.3) (Area=1.9) (Area=3.2) (Area=6.3) (Area=12.3)
26
Proposed Soft Vector Processor System Design Flow
[Flow diagram]
  • Inputs: User Code plus Vectorized Software Routines (Portable), from www.fpgavendor.com
  • System: Soft Proc with Vector Lanes 1 to 4 (Portable, Flexible, Scalable), Memory Interface, Peripherals, Custom HW
  • Is the soft processor the bottleneck? yes, increase lanes
27
Vector Memory Unit
[Block diagram, Memory Lanes=4]
  • Memory Request Queue: base, plus per-lane stride0 ... strideL and index0 ... indexL, each pair selected through a MUX
  • Read Crossbar: Dcache to rddata0 ... rddataL
  • Write Crossbar: wrdata0 ... wrdataL to the Memory Write Queue and Dcache
  • L = Lanes - 1
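The base, stride, and index inputs in the diagram suggest the two usual vector addressing modes. The sketch below shows the standard address arithmetic for strided and indexed accesses; it is an assumption about what the per-lane MUX selects, with hypothetical function names and byte-granular 32-bit addresses chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Strided access: element i is at base + i * stride (stride in bytes).
 * Unsigned arithmetic wraps modulo 2^32, so negative strides also work. */
static void gen_strided(uint32_t base, int32_t stride, size_t vl,
                        uint32_t *addr)
{
    assert(addr != NULL || vl == 0);
    for (size_t i = 0; i < vl; i++) {
        addr[i] = base + (uint32_t)stride * (uint32_t)i;
    }
}

/* Indexed access: element i is at base + index[i], where index[] holds
 * per-element byte offsets (as would come from a vector register). */
static void gen_indexed(uint32_t base, const uint32_t *index, size_t vl,
                        uint32_t *addr)
{
    assert((addr != NULL && index != NULL) || vl == 0);
    for (size_t i = 0; i < vl; i++) {
        addr[i] = base + index[i];
    }
}
```

Addresses that fall in the same cache line can be served together, which is one reason the number of memory crossbar lanes (M) and the cache line size interact.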


28
Overall Memory System Performance
[Chart for 16 lanes; data labels: 67, 48, 31, 4, 15; cache sizes shown: (4KB), (16KB)]