Title: A case for 16-bit floating point data: FPGA image and media processing
1. A case for 16-bit floating point data: FPGA image and media processing
- Daniel Etiemble and Lionel Lacassagne
- University Paris Sud, Orsay (France)
- de_at_lri.fr
2. Summary
- Graphics and media applications
- integer versus FP computations
- Accuracy
- Execution speed
- Compilation issues
- A niche for the 16-bit floating-point format (F16 or half)
- Methodology and benchmarks
- Hardware support
- Customization of SIMD 16-bit FP operators on an FPGA soft core (Altera NIOS II CPU)
- The SIMD 16-bit FP instructions
- Results
- Conclusion
3. Integer or FP computations?
- Both formats are used in graphics and media processing
- Example: the Apple vImage library has four image types with four pixel types
- Unsigned char (0-255) or float (0.0-1.0) for color or alpha values
- Set of 4 unsigned chars or floats for Alpha, Red, Green, Blue
- Trade-offs
- Precision and dynamic range
- Memory occupation and cache footprint
- Hardware cost (embedded applications)
- Chip area
- Power dissipation
4. Integer or FP computations? (2)
- General trend to replace FP computations by fixed-point computations
- Intel GPP library: "Using Fixed-Point Instead of Floating-Point for Better 3D Performance" (G. Kolli), Intel Optimizing Center, http://www.devx.com/Intel/article/16478
- "Techniques for automatic floating-point to fixed-point conversions for DSP code generation" (Menard et al.)
5. The Menard et al. approach
[Diagram: the LASTI (Lannion, France) precision methodology. An FP algorithm is converted into a correct fixed-point algorithm. SW design (DSP): optimize the mapping of the algorithm on a fixed architecture; minimize execution time and code size. HW design (ASIC/FPGA): optimize the data-path width; minimize chip area. In both cases, maximize precision.]
6. Integer or FP computations? (3)
- Opposite option: customized FP formats
- Lightweight FP arithmetic (Fang et al.) to avoid conversions
- With the IDCT, FP numbers with a 5-bit exponent and an 8-bit mantissa are sufficient to get a PSNR similar to 32-bit FP numbers
- To compare with the half format
7. Integer or FP computations? (4)
- How to help a compiler to vectorize?
- Integers: different input and output formats
- N bits + N bits -> N+1 bits
- N bits x N bits -> 2N bits
- FP numbers: same input and output formats
- Example: a Deriche filter on a size x size points image
#define byte unsigned char
byte **X, **Y;
int32 b0, a1, a2;

for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
        Y[i][j] = (byte)((b0*X[i][j] + a1*Y[i-1][j] + a2*Y[i-2][j]) >> 8);

for (i = size-1; i > 0; i--)
    for (j = 0; j < size; j++)
        Y[i][j] = (byte)((b0*X[i][j] + a1*Y[i+1][j] + a2*Y[i+2][j]) >> 8);
Compiler vectorization is impossible. With 8-bit coefficients, this benchmark can be manually vectorized; the vectorization is possible only if the programmer has a detailed knowledge of the parameters used. The float version is easily vectorized by the compiler.
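For contrast, a minimal sketch of the float version that compilers vectorize easily (flat row-major arrays and the function name are illustrative, not the authors' code):

```c
#include <stddef.h>

/* Hypothetical float variant of the forward Deriche pass: input and
   output share a single format, so the inner loop over j is a plain
   streaming computation with no narrowing cast and no shift. */
void deriche_forward_f32(const float *X, float *Y, size_t size,
                         float b0, float a1, float a2)
{
    for (size_t i = 2; i < size; i++)
        for (size_t j = 0; j < size; j++)
            Y[i * size + j] = b0 * X[i * size + j]
                            + a1 * Y[(i - 1) * size + j]
                            + a2 * Y[(i - 2) * size + j];
}
```

Because every operand is a float, the compiler needs no knowledge of the coefficient ranges to vectorize the j loop.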
8. Cases for 16-bit FP formats
- Computations whose data range exceeds the 16-bit integer range without needing the 32-bit FP (float) range
- Graphics and media applications
- Not for GPUs (F16 is already used in NVidia GPUs)
- For embedded applications
- Advantages of the 16-bit FP format
- Reduced memory occupation (cache footprint) versus 32-bit integer or FP formats
- CPUs without SIMD extensions (low-end embedded CPUs)
- 2x wider SIMD instructions compared to float SIMD
- CPUs with SIMD extensions (high-end embedded CPUs)
- Huge advantage of SIMD float operations versus SIMD integer operations, both for compiler and manual vectorization
9. Example: Points of Interest
10. Points of interest (PoI) in images
[Diagram: the Harris algorithm pipeline. Image (byte) -> 3 x 3 gradient (Sobel) -> Ix, Iy (short) -> products IxIx, IxIy, IyIy -> 3 x 3 Gauss filters -> Sxx, Sxy, Syy (int) -> response (Sxx*Syy - Sxy^2) - 0.05*(Sxx + Syy)^2 (int) -> threshold -> F(I) (byte).]
- Integer computation mixes char, short and int, and prevents an efficient use of SIMD parallelism
- F16 computations would profit from SIMD parallelism with a uniform 16-bit format
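The Harris response above can be computed per pixel; a one-line sketch (float stands in here for whichever integer or F16 format the pipeline uses):

```c
/* Harris corner response from the smoothed gradient products:
   R = (Sxx*Syy - Sxy^2) - 0.05*(Sxx + Syy)^2. */
float harris_response(float sxx, float sxy, float syy)
{
    float det   = sxx * syy - sxy * sxy;   /* determinant of the 2x2 moment matrix */
    float trace = sxx + syy;               /* its trace */
    return det - 0.05f * trace * trace;
}
```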
11. 16-bit floating-point formats
- Some have been defined in DSPs, but rarely used
- Example: TMS320C32
- Internal FP type (immediate operands): 1 sign bit, 4-bit exponent field and 11-bit fraction
- External FP type (storage purposes): 1 sign bit, 8-bit exponent field and 7-bit fraction
- Half format
12. Half format
- 16-bit counterpart of the IEEE 754 single and double precision formats
- Introduced by ILM for the OpenEXR format
- Defined in Cg (NVidia)
- Motivation
- 16-bit integer based formats typically represent color component values from 0 (black) to 1 (white), but don't account for over-range values (e.g. a chrome highlight) that can be captured by film negative or other HDR devices. Conversely, 32-bit floating-point TIFF is often overkill for visual effects work: it provides more than sufficient precision and dynamic range for VFX images, but comes at the cost of storage, both on disk and in memory.
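As a minimal sketch of the layout (1 sign bit, 5-bit exponent with bias 15, 10-bit fraction; the helper names are illustrative):

```c
#include <stdint.h>

/* Field extraction for the half format.
   Value of a normal half: (-1)^sign * 2^(exponent-15) * (1 + fraction/1024). */
static inline int half_sign(uint16_t h)     { return h >> 15; }
static inline int half_exponent(uint16_t h) { return (h >> 10) & 0x1F; }
static inline int half_fraction(uint16_t h) { return h & 0x3FF; }
```

For example, the encoding 0x3C00 (sign 0, exponent 15, fraction 0) represents 1.0.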
13. Validation of the F16 approach
- Accuracy
- Results presented at ODES-3 (2005) and CAMP05 (2005)
- Next slides
- Performance with general-purpose CPUs (Pentium 4 and PowerPC G4/G5)
- Results presented at ODES-3 (2005) and CAMP05 (2005)
- Performance with FPGAs (this presentation)
- Execution time
- Hardware cost (and power dissipation)
- Other embedded hardware (to be done)
- SoC
- Customizable CPUs (e.g. the Tensilica approach)
Another time?
14. Accuracy
- Comparison of F16 computation results with F32 computation results
- Specificities of FP formats
- Rounding?
- Denormals?
- NaN?
15. Impact of F16 accuracy and dynamic range
- Simulation of the half format with the float format on actual benchmarks or applications
- Impact of reduced accuracy and range on the results
- F32-computed and F16-computed images are compared with PSNR measures
- Four different functions (ftd, frd, ftn, frn) to simulate F16
- Fraction truncation or rounding
- With or without denormals
- For any benchmark, manual insertion of one function (ftd / frd / ftn / frn)
- Function call before any use of a float value
- Function call after any operation producing a float value
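A sketch of what one such helper might look like (ftd: fraction truncation, denormals flushed; this illustrates the idea and is not the authors' code, and it ignores exponent overflow):

```c
#include <stdint.h>
#include <string.h>

/* Simulate half precision inside a float: truncate the 23-bit float
   fraction to the 10 bits a half keeps, and flush magnitudes below
   the smallest normal half (2^-14) to zero. The rounding/denormal
   variants (frd, ftn, frn) change exactly these two steps. */
float ftd(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u &= 0xFFFFE000u;                 /* clear the 13 low fraction bits */
    memcpy(&x, &u, sizeof u);
    if (x > -6.103515625e-5f && x < 6.103515625e-5f)
        x = 0.0f;                     /* no denormals: flush to zero */
    return x;
}
```

Inserting such a call after every float operation makes a plain F32 build behave (range and precision wise) like an F16 one.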
16. Impact of F16 accuracy and dynamic range
- Benchmark 1: zooming (A. Montanvert, Grenoble)
- Spline technique for x1, x2 and x4 zooms
- Benchmark 2: JPEG (Mediabench)
- 4 different DCT/IDCT functions
- Integer / fast integer / F32 / F16
- Benchmark 3: wavelet transform (L. Lacassagne, Orsay)
- SPIHT (Set Partitioning in Hierarchical Trees)
17. Accuracy (1): zooming benchmark
- Denormals are useless
- No significant difference between truncation and rounding for the mantissa
- Minimum hardware (no denormals, truncation) is OK
18. Accuracy (2): JPEG (Mediabench)
[Charts: 512 x 512 and 256 x 256 images. Difference (dB) between the final compressed image and the uncompressed original image.]
19. Accuracy (3): wavelet transform
512 x 512 or 1024 x 1024 images
20. Accuracy (4): wavelet transforms
Images 256 x 256
21. Benchmarks
- Convolution operators
- Horizontal-vertical version of Deriche filter
- Deriche gradient
- Image stabilization
- Points of Interest
- Achard
- Harris
- Optical flow
- FDCT (JPEG 6-a)
for (i = 0; i < size-1; i++)
    for (j = 0; j < size; j++)
        Y[i][j] = (byte)((b0*X[i][j] + a1*Y[i-1][j] + a2*Y[i-2][j]) >> 8);

for (i = size-1; i > 0; i--)
    for (j = 0; j < size; j++)
        Y[i][j] = (byte)((b0*X[i][j] + a1*Y[i+1][j] + a2*Y[i+2][j]) >> 8);

Deriche filter, horizontal-vertical version
22. HW and SW support
- Altera NIOS development kit (Cyclone edition)
- EP1C20F400C7 FPGA device
- NIOS II/f CPU (50-MHz)
- Altera IDE
- GCC tool chain (-O3 option)
- High_res_timer (number of clock cycles for the execution time)
- VHDL description of all the F16 operators
- Arithmetic operators
- Data handling operators
- Quartus II design software
- NIOS II/f
- Fixed features
- 32-bit RISC CPU
- Branch prediction
- Dynamic branch predictor
- Barrel shifter
- Customized instructions
- Parameterized features
- HW integer multiplication and division
- 4 KB instruction cache
- 2 KB data cache
23. Customization of SIMD F16 instructions
Customized operator groups: data manipulation; ADD/SUB, MUL, DIV.

With a 32-bit CPU, it makes sense to implement F16 instructions as SIMD 2 x 16-bit instructions.
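A sketch of the packing convention this implies (two half values per 32-bit register word; the helper names are illustrative, not the NIOS custom-instruction mnemonics):

```c
#include <stdint.h>

/* Two 16-bit half values packed into one 32-bit word, lane 0 in the
   low 16 bits: the layout assumed for SIMD 2 x 16-bit instructions. */
static inline uint32_t pack2h(uint16_t lane0_bits, uint16_t lane1_bits)
{
    return (uint32_t)lane0_bits | ((uint32_t)lane1_bits << 16);
}
static inline uint16_t lane0(uint32_t w) { return (uint16_t)(w & 0xFFFFu); }
static inline uint16_t lane1(uint32_t w) { return (uint16_t)(w >> 16); }
```

A 2-wide F16 add/mul then operates on both lanes of such a word in one custom instruction.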
24. SIMD F16 instructions
- Data conversions: 1 cycle
- Bytes to/from F16
- Shorts to/from F16
- Conversions and shifts: 1 cycle
- Accesses to (i, i-1) or (i+2, i+1) and conversions
- Arithmetic instructions
- ADD/SUB: 2 cycles (4 for F32)
- MULF: 2 cycles (3 for F32)
- DIVF: 5 cycles
- DP2: 1 cycle
25. Execution time: basic vector operations
[Charts: execution times of copy, vector add and mul, and vector-scalar add and mul, for F32, I32 and F16, with the instruction latencies.]
26. Execution time: basic vector operations
- Speedup: SIMD F16 versus scalar I32 or F32
- Smaller cache footprint for F16 compared to I32/F32
- F16 latencies are smaller than F32 latencies
27. Benchmark speedups
- Speedup greater than 2.5 versus F32
- Speedup from 1.3 to 3 versus I32
- Depends on the add/mul ratio and the amount of data manipulation
- Even scalar F16 can be faster than I32 (1.3 speedup for the JPEG DCT)
28. Hardware cost
[Charts: hardware cost of the F32 and F16 operators.]
29. Concluding remarks
- Intermediate-level graphics benchmarks generally need more than the I16 (short) or I32 (int) dynamic ranges, without needing the F32 (float) dynamic range
- On our benchmarks, graphical results are not significantly different when using F16 instead of F32
- A limited set of SIMD F16 instructions has been customized for the NIOS II CPU
- The hardware cost is limited and compatible with today's FPGA technologies
- The speedups range from 1.3 to 3 (generally 1.5) versus I32 and are greater than 2.5 versus F32
- Similar results have been found for general-purpose CPUs (Pentium 4, PowerPC)
- Tests should be extended to other embedded approaches
- SoCs
- Customizable CPUs (Tensilica approach)
30. References
- OpenEXR, http://www.openexr.org/details.html
- W.R. Mark, R.S. Glanville, K. Akeley and M.J. Kilgard, "Cg: A system for programming graphics hardware in a C-like language".
- NVIDIA, Cg User's manual, http://developer.nvidia.com/view.asp?IO=cg_toolkit
- Apple, Introduction to vImage, http://developer.apple.com/documentation/Performance/Conceptual/vImage/
- G. Kolli, "Using Fixed-Point Instead of Floating Point for Better 3D Performance", Intel Optimizing Center, http://www.devx.com/Intel/article/16478
- D. Menard, D. Chillet, F. Charot and O. Sentieys, "Automatic Floating-point to Fixed-point Conversion for DSP Code Generation", in International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES 2002)
- F. Fang, Tsuhan Chen and Rob A. Rutenbar, "Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform", in EURASIP Journal on Signal Processing, Special Issue on Applied Implementation of DSP and Communication Systems
- R. Deriche, "Using Canny's criteria to derive a recursively implemented optimal edge detector", The International Journal of Computer Vision, 1(2):167-187, May 1987.
- A. Kumar, "SSE2 Optimization: OpenGL Data Stream Case Study", Intel application notes, http://www.intel.com/cd/ids/developer/asmo-na/eng/segments/games/resources/graphics/19224.htm
- Sample code for the benchmarks is available at http://www.lri.fr/~de/F16/codetsi
- Multi-Chip Projects, Design Kits, http://cmp.imag.fr/ManChap4.html
- J. Detrey and F. de Dinechin, "A VHDL Library of Parametrisable Floating-Point and LNS Operators for FPGA", http://www.ens-lyon.fr/~jdetrey/FPLibrary
31. Back slides
- F16 SIMD instructions on General Purpose CPUs
32. Microarchitectural assumptions for the Pentium 4 and PowerPC G5
- The new F16 instructions are compatible with the present implementation of the SIMD ISA extensions
- 128-bit SIMD registers
- Same number of SIMD registers
- Most SIMD 16-bit integer instructions can be used for F16 data
- Transfers
- Logical instructions
- Pack/unpack, shuffle and permutation instructions
- New instructions
- F16 arithmetic ones: add, sub, mul, div, sqrt
- Conversion instructions
- 16-bit integer to/from 16-bit FP
- 8-bit integer to/from 16-bit FP
33. Some P4 instruction examples
Latencies and throughput values are similar to the corresponding ones of the P4 FP instructions.

Instruction | Latency
ADDF16      | 4
MULF16      | 6
CBL2F16     | 4
CBH2F16     | 4
CF162BL     | 4
CF162BH     | 4

[Diagram: byte-to-half conversion instructions. CBL2F16 and CBH2F16 convert the low or high 8 bytes of an XMM register into 8 half values in an XMM register.]

With smaller latencies: ADDF16 2, MULF16 4, CONV 2.
34. Measures
- Hardware simulator
- IA-32
- 2.4 GHz Pentium 4 with 768 MB, running Windows 2000
- Intel C++ 8 compiler with the QxW option (maximize speed)
- Execution time measured with the RDTSC instruction
- PowerPC
- 1.6 GHz PowerPC G5 with 768 MB DDR400, running Mac OS X.3
- Xcode programming environment including gcc 3.3
- Measures
- Average values of at least 10 executions (excluding abnormal ones)
35. SIMD execution time (1): Deriche benchmarks
- SIMD integer results are incorrect (insufficient dynamic range)
- F16 results are close to the incorrect SIMD integer results
- F16 results are significantly better than the 32-bit FP results
36. SIMD execution time (2): scan benchmarks
Cumulative sum and sum of squares of the preceding pixel values: execution time according to the input-output formats.

- Copy corresponds to the lower bound on execution time (memory-bound)
- Byte-short for the sum scan, and byte-short and byte-integer for the sum-of-squares scan, give incorrect results (insufficient dynamic range)
- Same results as for the Deriche benchmarks
- F16 results are close to the incorrect SIMD integer results
- F16 results show a significant speed-up compared to float-float for both scans, and compared to byte-float and float-float for the sum-of-squares scan
37. SIMD execution time (3): OpenGL data stream case
- Compute for each triangle the min and max values of the vertex coordinates
- Most of the computation time is spent in the AoS to SoA conversion
- Results
- Altivec is far better, but the relative F16/F32 speed-up is similar
38. Overall comparison (1/2/3)
[Charts: speed-up of the F16 version versus the float version (left), and F16 versus the incorrect 16-bit integer version (right).]
39. SIMD execution time (4): wavelet transform
[Charts: Pentium 4, F32/F16 speed-up of the horizontal, vertical and overall wavelet transform versus image size.]
40. SIMD execution time (4): wavelet transform
[Charts: PowerPC, F32/F16 execution time of the horizontal, vertical and overall wavelet transform versus image size.]
41. Chip area: rough evaluation
- Same approach as used by Talla et al. for the Mediabreeze architecture
- VHDL models of FP operators
- J. Detrey and F. de Dinechin (ENS Lyon)
- Non-pipelined and pipelined versions
- Adder: close path and far path for the exponent values
- Divider: radix-4 SRT algorithm
- SQRT: radix-2 SRT algorithm
- Cell-based library
- ST 0.18 µm HCMOS8D technology
- Cadence 4.4.3 synthesis tool (before placement and routing)
- Limitations
- Full-custom VLSI vs. VHDL cell-based library
- The actual implementation in the P4 (G5) data path is not considered
42. 16-bit and 64-bit operators
The two-path approach is too costly for a 16-bit FP adder; a straightforward approach would be sufficient.
43. Chip area evaluation
Chip area (mm2)

The 16-bit FP FU chip area is about 5.5% of that of the 64-bit FP FU. Eight such units would be 11% of the four corresponding 64-bit ones.