Title: Polymorphic Processors: How to Expose Arbitrary Hardware Functionality to Programmers
1Polymorphic Processors: How to Expose Arbitrary Hardware Functionality to Programmers
Stamatis Vassiliadis
Computer Engineering, EEMCS, TU Delft
http://ce.et.tudelft.nl
Member of HiPEAC
2PZE and Amdahl's law
program
Max speedup: 2.0
Excluding start-up: reduced 5 cycles to 3, speedup 1.6, 83% efficiency
Why polymorphic? We can ride Amdahl's curve more easily and faster.
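For reference, the figures on this slide can be read against the usual statement of Amdahl's law. The symbols f (fraction of run time that is accelerated), s (speedup of that fraction) and S (overall speedup) are notation introduced here for illustration; they do not appear on the slide.

```latex
S(f, s) = \frac{1}{(1 - f) + \dfrac{f}{s}}, \qquad
S_{\max} = \lim_{s \to \infty} S(f, s) = \frac{1}{1 - f}, \qquad
\text{efficiency} = \frac{S}{S_{\max}}
```

Under this reading, a maximum speedup of 2.0 corresponds to f = 0.5. If "5 cycles to 3" is taken as a speedup of 5/3 ≈ 1.67 (an assumption about how the slide's numbers were obtained), then 1.67 / 2.0 ≈ 0.83, which is consistent with the 83% efficiency quoted.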
3Motivating example
Paeth coding
4Motivating example
5Example Paeth Prediction (PNG)
Altivec code
- li r5, 0
- ... 6 instructions in total
- loop:
- lvx vr03, r1 (load c's)
- lvx vr04, r2 (load a's)
- vsldoi vr05, vr01, vr03, 1 (load b's)
- vmrghb vr07, vr03, vr00 (unpack)
- vmrglb vr08, vr03, vr00 (unpack)
- ... 6 instructions in total
- Compute
- vadduhs vr15, vr09, vr11 (a+b)
- vadduhs vr16, vr10, vr12
- vsubshs vr15, vr15, vr07
- vsubshs vr16, vr16, vr08
- ... 76 instructions in total
- Pack
- vpkshus vr28, vr28, vr29 (pack)
- Store
- Altivec iteration: 95 instructions per 16 pixels
CSI code
- 1 instruction for all iterations (20 setup instructions)
- CSI instruction design ( EUROMICRO 99 )
- latency: 5 cycles
- throughput: 16 pixels/1 cycle
- area: 24 32-bit adders
- cycle: 1 ALU operation
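For readers unfamiliar with the kernel being vectorized, here is a scalar sketch of the Paeth predictor as defined in the PNG specification (a = left, b = above, c = upper-left). The function names and the row helper are hypothetical, and the helper assumes one byte per pixel with out-of-row neighbours treated as 0; it is not the code from the slides.

```c
#include <stdlib.h>  /* abs, size_t */

/* Paeth predictor from the PNG specification:
   pick whichever of a (left), b (above), c (upper-left)
   is closest to the initial estimate p = a + b - c. */
unsigned char paeth_predict(unsigned char a, unsigned char b, unsigned char c)
{
    int p  = (int)a + (int)b - (int)c;
    int pa = abs(p - (int)a);
    int pb = abs(p - (int)b);
    int pc = abs(p - (int)c);

    if (pa <= pb && pa <= pc) return a;  /* ties prefer a, then b, then c */
    if (pb <= pc) return b;
    return c;
}

/* Hypothetical helper: compute the prediction for every pixel of one row.
   Assumes one byte per pixel; for the first pixel the left and upper-left
   neighbours are taken as 0. */
void paeth_predict_row(const unsigned char *row, const unsigned char *prev_row,
                       unsigned char *pred, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned char a = i ? row[i - 1] : 0;
        unsigned char b = prev_row[i];
        unsigned char c = i ? prev_row[i - 1] : 0;
        pred[i] = paeth_predict(a, b, c);
    }
}
```

The three absolute differences and the two comparisons per pixel are what the Altivec version has to express with unpack, add, subtract and pack instructions across 16 pixels at a time.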
6Results: Instruction count and execution time reduction
Benchmark: Paeth kernel, 132-element vectors (132 pixels in a row)
7Research Questions
Motivating example: obvious observations
- There is NO way I can do this on fixed hardware.
- I can do this if the hardware changes functionality at my wishes.
EASIER SAID THAN DONE! I have to answer the following:
- How can I identify the code for hardware implementation?
- How can I implement arbitrary code?
- Is the hardwired code substituted by new instructions?
- How can I substitute this code with SW/HW descriptions, say at the source level?
- How can I automatically generate the transformed program?
8Outline
[Figure labels: DATA, Program P, GPP, RH, ?, A, MEM, FPGA, RESULTS]
What to do
- Show hardware feasibility of ? in FPGA
- Map ? into reconfigurable hardware (RH)
- Eliminate the identified code
- Add code to have equivalent behavior
Tools, Microarchitecture, Architecture, Programming Paradigm, Compiler
MOLEN
- Introduce reconfigurable microcode (ρμ-code)
- Specific code in hardware left to the programmer/hardware designer
- One time 8 new instructions for any ISA
- Co-processor paradigm (e.g. vector)
- New register file for parameter passing
- Sequential consistency
- Split-join parallelism
- Function-like code
9Tool Chain
New program where hardware/software descriptions co-exist
[Figure labels: Human Directives, C2C, Critical ? (YES/NO), AUTO, LIB, hand coded]
10The MOLEN πISA
- Divide RC into two logical phases: SET <address> and EXECUTE <address>
- Function independent: no new op-codes
- Implementation and ISA independent
- Reconfigurable design (two instructions)
- Parameter passing: two new instructions, register file
- Execute on reconfigurable: one instruction, arbitrary number of parameters passed
- Speeding up reconfiguration and execution: two instructions for prefetching
- Parallel execution: split via a Molen instruction and join via a GPP instruction or one special instruction
- Modularity by implementing at least the minimal MOLEN instruction set and by reconfiguring to it
- Total: 8 new instructions
( SAMOS 03 )
11Instruction Set Partitioning
8 instructions grouped in 6 instruction categories
- SET <address>: partial SET (P-SET), complete SET (C-SET)
- EXECUTE <address>
- MOVTX and MOVFX
- SET PREFETCH <address>
- EXECUTE PREFETCH <address>
- BREAK
Instruction set variants: Minimal, Preferred, Complete
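As a quick consistency check on the counts above, the instruction list can be written down as a small C table. This is only an illustrative sketch: the type and identifier names are hypothetical, and placing MOVTX and MOVFX in one shared category is an assumption made here so that the eight instructions fall into six categories as the slide states.

```c
#include <stddef.h>

/* Six categories; the grouping of MOVTX/MOVFX into one "move" category
   is an assumption, not something the slide spells out. */
enum molen_category {
    CAT_SET, CAT_EXECUTE, CAT_MOVE,
    CAT_SET_PREFETCH, CAT_EXECUTE_PREFETCH, CAT_BREAK,
    CAT_COUNT
};

struct molen_insn {
    const char *mnemonic;
    enum molen_category category;
};

/* The eight instructions named on the slide. */
static const struct molen_insn molen_isa[] = {
    {"p-set",            CAT_SET},
    {"c-set",            CAT_SET},
    {"execute",          CAT_EXECUTE},
    {"movtx",            CAT_MOVE},
    {"movfx",            CAT_MOVE},
    {"set prefetch",     CAT_SET_PREFETCH},
    {"execute prefetch", CAT_EXECUTE_PREFETCH},
    {"break",            CAT_BREAK},
};

/* Number of instructions in the table. */
size_t molen_insn_count(void)
{
    return sizeof molen_isa / sizeof molen_isa[0];
}

/* Number of distinct categories actually used by the table. */
size_t molen_category_count(void)
{
    int seen[CAT_COUNT] = {0};
    size_t n = 0;
    for (size_t i = 0; i < molen_insn_count(); i++) {
        if (!seen[molen_isa[i].category]) {
            seen[molen_isa[i].category] = 1;
            n++;
        }
    }
    return n;
}
```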
12Sequence Control Example
C code
#pragma call_fpga op1
int f(int x, int y);
#pragma call_fpga op2
int g(int x);
int h(int a, int b, int c) {
    int m, n;
    ...
    m = f(a, b);
    n = g(c);
}
Generated code
h:  mov a -> r1
    movtx r1 -> XR2
    mov b -> r2
    movtx r2 -> XR3
    mov c -> r3
    movtx r3 -> XR4
    set address_set_op1
    set address_set_op2
    ldc 2 -> r4
    movtx r4 -> XR0
    ldc 4 -> r5
    movtx r5 -> XR1
    execute address_ex_op1
    execute address_ex_op2
    movfx XR2 -> r6
    mov r6 -> m
    movfx XR4 -> r7
    mov r7 -> n
13Reconfigurable Microcode Storage
- Fixed on-chip storage for frequently used microcode
- Pageable on-chip storage for less frequently used microcode
( IEEE MICRO 03 )
14The ρμ-code unit
Determine next microinstruction
MIR
15More on Architectural support
Instruction format (figure labels: instruction word, OPC, address, α)
An example microprogram
- located in memory starting at address α
- address α points to the first microinstruction
16The MOLEN ρμ-coded processor (FPL 01)
17The Molen Prototype
Molen prototype implemented on Virtex II Pro
Molen machine organization
18The Prototype Features
A VHDL model has been synthesized for Virtex II Pro technology
- 64 KBytes data and 64 KBytes instruction (on-chip) memories
- 64-bit data memory bus
- 64-bit instruction memory bus
- 64-bit microcode word length
- 32 MBytes memory segment for microprograms
- 8Kx64-bit ρ-control store using Dual Port Block RAMs (BRAM)
- 512x32-bit XREGs implemented in BRAMs
- Three clock domains
- PPC clock: 250 MHz
- MEM clock: 83 MHz
- User clock: external
Utilization of FPGA resources (no CCU)
Device xc2vp20-5   Reconf. Processor   Arbiter   Total incl. XREGs   Available resources   %
slices             71                  84        156                 10304                 1
flip-flops         84                  69        147                 20608                 1
LUT4               171                 150       322                 20608                 1
BRAM               4                   N.A.      5                   112                   3
Max. Freq. (MHz)   130                 143       130                 N.A.                  N.A.
( FCCM 04 )
19Compiling for the Molen
C application
FCCM
Compiler
File_n.c
MAIN.c
SUIF frontend
MOLEN extension
Machine SUIF backend framework
alpha backend
x86 backend
20The Molen Compiler
- IBM PowerPC 405 GPP in Virtex II Pro
- Register file extension (XRs)
- ISA extension
( FPL 03-04 )
21Code for a function
- Example
- C code: res = alpha(param1, param2)
HW
Send parameters
    movtx XR1 <- param1
    movtx XR2 <- param2
HW reconfiguration
    set <address_alpha_set>
HW execution
    exec <address_alpha_exec>
Return result
    movfx res <- XR3
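The four steps above (send parameters, reconfigure, execute, return result) can be mimicked in plain C to make the calling convention concrete. This is a software stand-in only, under several assumptions: the helper names are hypothetical, a function pointer takes the place of the microcode address that SET and EXECUTE receive, and the body chosen for alpha (an addition) is invented purely so that the example produces a checkable value.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_XREGS 512            /* the prototype lists 512 x 32-bit XREGs */

static int32_t xreg[NUM_XREGS];  /* exchange registers shared by GPP and RH */

typedef void (*rh_operation)(void);  /* stand-in for a configured operation */
static rh_operation configured = NULL;

/* movtx: copy a GPP value into an exchange register */
static void movtx(int xr, int32_t value) { xreg[xr] = value; }

/* movfx: copy an exchange register back to the GPP */
static int32_t movfx(int xr) { return xreg[xr]; }

/* set: model of configuring the reconfigurable hardware for one operation */
static void set_op(rh_operation op) { configured = op; }

/* execute: run whatever is currently configured */
static void execute_op(void) { if (configured != NULL) configured(); }

/* Invented body for alpha: reads XR1 and XR2, writes the result to XR3 */
static void alpha_hw(void) { xreg[3] = xreg[1] + xreg[2]; }

/* The sequence a compiler would emit for: res = alpha(param1, param2) */
int32_t call_alpha(int32_t param1, int32_t param2)
{
    movtx(1, param1);   /* send parameters */
    movtx(2, param2);
    set_op(alpha_hw);   /* HW reconfiguration */
    execute_op();       /* HW execution */
    return movfx(3);    /* return result */
}
```

The point of the sketch is that the caller never names alpha's logic directly; it only moves values through the exchange registers and triggers the two phases.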
22Sequence Control Example
Code generation
C code
#pragma call_fpga op1
int f(int a, int b) {
    int c, i;
    c = 0;
    for (i = 0; i < b; i++)
        c = c + a << i + i;
    c = c >> b;
    return c;
}
void main() {
    int x, z;
    z = 5;
    x = f(z, 7);
}
23The Experiment (hand tuned HW)
Step 1. Obtain MPEG-2 profiling data on a PowerPC system (values in % of execution time)
                                MPEG-2 encoder                                MPEG-2 decoder
sequence    frames@resolution   SAD (16x16)   DCT (8x8)   IDCT (8x8)   Total   IDCT (8x8)
carphone    96@176x144          51.1          12.5        1.3          64.9    50.4
claire      168@360x288         53.8          11.8        1.0          66.6    37.6
container   300@352x288         56.2          10.7        1.0          67.9    40.4
tennis      112@352x240         60.0          9.5         0.8          70.3    40.5
Step 2. Measure the kernel speedups on the prototype
Step 3. Overall speedup per kernel
24Real vs. Theoretical Speedups
Step 4. Application speedup
            MPEG-2 Encoder                      MPEG-2 Decoder
Speedup     Prototype   Theory   % of Smax     Prototype   Theory   % of Smax
carphone    2.64        2.85     93            1.94        2.02     96
claire      2.80        2.99     94            1.56        1.60     98
container   2.96        3.12     95            1.63        1.68     97
tennis      3.18        3.37     94            1.65        1.68     98
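The theory figures in this table match what Amdahl's law gives when the profiled kernels from Step 1 are assumed to take no time at all, and the percentage is the prototype speedup divided by that bound. A minimal sketch of both calculations follows; the function names are hypothetical, and the link to the Step 1 profile is an inference from the numbers rather than something the slides state.

```c
/* Upper bound on application speedup when the kernels that account for
   kernel_percent of the run time are made infinitely fast.
   Precondition: 0 <= kernel_percent < 100. */
double max_speedup(double kernel_percent)
{
    return 1.0 / (1.0 - kernel_percent / 100.0);
}

/* How much of the theoretical bound the measured speedup attains, in %. */
double attained_percent(double measured, double theoretical)
{
    return 100.0 * measured / theoretical;
}
```

For carphone, the encoder kernels total 64.9%, so max_speedup(64.9) is about 2.85, and attained_percent(2.64, 2.85) is about 93, in line with the first table row.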
25mpeg2enc Instruction Counts
137 million
46 million
26M-JPEG (HW Automatically Generated)
- M-JPEG multimedia benchmark
- DCT hardware implementation
- Molen prototype
( FPL 04 )
27Performance
MJPEG
SW DCT (%): 66
Execution cycles
- SW DCT: 1,242,017
- HW DCT: 4,125
- HW DCT conv: 102,589
Prototype speedup: 2.5x
Theoretical speedup: 2.96x
Efficiency: 84%
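The reported figures can be cross-checked with a short calculation. The sketch below rests on an assumption about how the slide's numbers combine: "HW DCT conv" is treated as overhead cycles added to the hardware DCT cycles, and the 66% share is used as the accelerated fraction. The function names are hypothetical.

```c
/* Kernel speedup: software cycles over hardware cycles plus any
   overhead cycles (assumed here to be the "HW DCT conv" figure). */
double kernel_speedup(double sw_cycles, double hw_cycles, double overhead_cycles)
{
    return sw_cycles / (hw_cycles + overhead_cycles);
}

/* Amdahl's law: overall speedup when a fraction of the run time
   (0 <= fraction <= 1) is sped up by a factor k > 0. */
double amdahl_speedup(double fraction, double k)
{
    return 1.0 / ((1.0 - fraction) + fraction / k);
}
```

With the cycle counts above, kernel_speedup(1242017, 4125, 102589) is roughly 11.6, and amdahl_speedup(0.66, 11.6) comes out at roughly 2.5, in line with the prototype speedup. The quoted efficiency likewise agrees with 2.5 / 2.96 ≈ 0.84.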
28Conclusions
- We have shown a new
- microarchitecture
- processor architecture
- programming paradigm
- compilation
- We have shown that it is easier and faster to ride Amdahl's curve with polymorphic processors!
29Contact information
Computer Engineering Laboratory: http://ce.et.tudelft.nl
MOLEN homepage: http://ce.et.tudelft.nl/MOLEN
Personal homepage: http://ce.et.tudelft.nl/stamatis
Overview paper: The Molen Polymorphic Processor, IEEE Transactions on Computers, Nov. 04