Title: A case for 16-bit floating point data: FPGA image and media processing
1. A case for 16-bit floating point data: FPGA image and media processing
- Daniel Etiemble and Lionel Lacassagne
- University Paris Sud, Orsay (France)
- de_at_lri.fr
2. Summary
- Graphics and media applications
- integer versus FP computations
- Accuracy
- Execution speed
- Compilation issues
- A niche for the 16-bit floating-point format (F16 or half)
- Methodology and benchmarks
- Hardware support
- Customization of SIMD 16-bit FP operators on an FPGA soft core (Altera NIOS II CPU)
- The SIMD 16-bit FP instructions
- Results
- Conclusion
3. Integer or FP computations?
- Both formats are used in graphics and media processing
- Example: the Apple vImage library has four image types with four pixel types
- Unsigned char (0-255) or float (0.0-1.0) for color or alpha values
- Set of 4 unsigned chars or floats for Alpha, Red, Green, Blue
- Trade-offs
- Precision and dynamic range
- Memory occupation and cache footprint
- Hardware cost (embedded applications)
- Chip area
- Power dissipation
4. Integer or FP computations? (2)
- General trend to replace FP computations by fixed-point computations
- Intel GPP library: "Using Fixed-Point Instead of Floating-Point for Better 3D Performance" (G. Kolli), Intel Optimizing Center, http://www.devx.com/Intel/article/16478
- "Techniques for automatic floating-point to fixed-point conversions for DSP code generation" (Menard et al.)
5. The Menard et al. approach
[Diagram: the LASTI (Lannion, France) precision methodology. An FP algorithm is converted into a correct fixed-point algorithm. SW design (DSP): optimize the mapping of the algorithm on a fixed architecture; minimize execution time and code size. HW design (ASIC/FPGA): optimize the data-path width; minimize chip area. In both cases, maximize precision.]
6. Integer or FP computations? (3)
- Opposite option: customized FP formats
- Lightweight FP arithmetic (Fang et al.) to avoid conversions
- With the IDCT, FP numbers with a 5-bit exponent and an 8-bit mantissa are sufficient to get a PSNR similar to 32-bit FP numbers
- To compare with the half format
7. Integer or FP computations? (4)
- How to help a compiler to vectorize?
- Integers: different input and output formats
- N bits + N bits -> N+1 bits
- N bits x N bits -> 2N bits
- FP numbers: same input and output formats
- Example: a Deriche filter on a size x size points image
#define byte unsigned char
byte **X, **Y;
int32 b0, a1, a2;

for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
        Y[i][j] = (byte)((b0*X[i][j] + a1*Y[i-1][j] + a2*Y[i-2][j]) >> 8);

for (i = size-1; i > 0; i--)
    for (j = 0; j < size; j++)
        Y[i][j] = (byte)((b0*X[i][j] + a1*Y[i+1][j] + a2*Y[i+2][j]) >> 8);
Compiler vectorization is impossible. With 8-bit coefficients, this benchmark can be manually vectorized; the vectorization is possible only if the programmer has a detailed knowledge of the parameters used. The float version is easily vectorized by the compiler.
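For contrast, a minimal sketch of the float version that compilers vectorize easily (flat row-major arrays and the function name are illustrative, not the authors' code):

```c
#include <stddef.h>

/* Hypothetical float variant of the forward Deriche pass: input and
   output share a single format, so the inner loop over j is a plain
   streaming computation with no narrowing cast and no shift. */
void deriche_forward_f32(const float *X, float *Y, size_t size,
                         float b0, float a1, float a2)
{
    for (size_t i = 2; i < size; i++)
        for (size_t j = 0; j < size; j++)
            Y[i * size + j] = b0 * X[i * size + j]
                            + a1 * Y[(i - 1) * size + j]
                            + a2 * Y[(i - 2) * size + j];
}
```

Because every operand is a float, the compiler needs no knowledge of the coefficient ranges to vectorize the j loop.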
8. Cases for 16-bit FP formats
- Computations whose data range exceeds the 16-bit integer range without needing the 32-bit FP (float) range
- Graphics and media applications
- Not for GPUs (F16 is already used in NVidia GPUs)
- For embedded applications
- Advantages of the 16-bit FP format
- Reduced memory occupation (cache footprint) versus 32-bit integer or FP formats
- CPUs without SIMD extensions (low-end embedded CPUs)
- 2x wider SIMD instructions compared to float SIMD
- CPUs with SIMD extensions (high-end embedded CPUs)
- Huge advantage of SIMD float operations versus SIMD integer operations, both for compiler and manual vectorization
9. Example: Points of Interest
10. Points of interest (PoI) in images
[Diagram: the Harris algorithm pipeline. Image (byte) -> 3 x 3 gradient (Sobel) -> Ix, Iy (short) -> products IxIx, IxIy, IyIy -> 3 x 3 Gauss filters -> Sxx, Sxy, Syy (int) -> response (Sxx*Syy - Sxy^2) - 0.05*(Sxx + Syy)^2 (int) -> threshold -> F(I) (byte).]
- Integer computation mixes char, short and int, and prevents an efficient use of SIMD parallelism
- F16 computations would profit from SIMD parallelism with a uniform 16-bit format
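The Harris response above can be computed per pixel; a one-line sketch (float stands in here for whichever integer or F16 format the pipeline uses):

```c
/* Harris corner response from the smoothed gradient products:
   R = (Sxx*Syy - Sxy^2) - 0.05*(Sxx + Syy)^2. */
float harris_response(float sxx, float sxy, float syy)
{
    float det   = sxx * syy - sxy * sxy;   /* determinant of the 2x2 moment matrix */
    float trace = sxx + syy;               /* its trace */
    return det - 0.05f * trace * trace;
}
```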
11. 16-bit floating-point formats
- Some have been defined in DSPs, but rarely used
- Example: TMS320C32
- Internal FP type (immediate operands): 1 sign bit, 4-bit exponent field and 11-bit fraction
- External FP type (storage purposes): 1 sign bit, 8-bit exponent field and 7-bit fraction
- Half format
12. Half format
- 16-bit counterpart of the IEEE 754 single and double precision formats
- Introduced by ILM for the OpenEXR format
- Defined in Cg (NVidia)
- Motivation
- 16-bit integer based formats typically represent color component values from 0 (black) to 1 (white), but don't account for over-range values (e.g. a chrome highlight) that can be captured by film negative or other HDR devices. Conversely, 32-bit floating-point TIFF is often overkill for visual effects work: it provides more than sufficient precision and dynamic range for VFX images, but comes at the cost of storage, both on disk and in memory.
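As a minimal sketch of the layout (1 sign bit, 5-bit exponent with bias 15, 10-bit fraction; the helper names are illustrative):

```c
#include <stdint.h>

/* Field extraction for the half format.
   Value of a normal half: (-1)^sign * 2^(exponent-15) * (1 + fraction/1024). */
static inline int half_sign(uint16_t h)     { return h >> 15; }
static inline int half_exponent(uint16_t h) { return (h >> 10) & 0x1F; }
static inline int half_fraction(uint16_t h) { return h & 0x3FF; }
```

For example, the encoding 0x3C00 (sign 0, exponent 15, fraction 0) represents 1.0.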
13. Validation of the F16 approach
- Accuracy
- Results presented at ODES-3 (2005) and CAMP05 (2005)
- Next slides
- Performance with general-purpose CPUs (Pentium 4 and PowerPC G4/G5)
- Results presented at ODES-3 (2005) and CAMP05 (2005)
- Performance with FPGAs (this presentation)
- Execution time
- Hardware cost (and power dissipation)
- Other embedded hardware (to be done)
- SoC
- Customizable CPUs (e.g. the Tensilica approach)
Another time?
14. Accuracy
- Comparison of F16 computation results with F32 computation results
- Specificities of FP formats
- Rounding?
- Denormals?
- NaN?
15. Impact of F16 accuracy and dynamic range
- Simulation of the half format with the float format on actual benchmarks or applications
- Impact of reduced accuracy and range on the results
- F32-computed and F16-computed images are compared with PSNR measures
- Four different functions (ftd, frd, ftn, frn) to simulate F16
- Fraction truncation or rounding
- With or without denormals
- For any benchmark, manual insertion of one function (ftd / frd / ftn / frn)
- Function call before any use of a float value
- Function call after any operation producing a float value
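A sketch of what one such helper might look like (ftd: fraction truncation, denormals flushed; this illustrates the idea and is not the authors' code, and it ignores exponent overflow):

```c
#include <stdint.h>
#include <string.h>

/* Simulate half precision inside a float: truncate the 23-bit float
   fraction to the 10 bits a half keeps, and flush magnitudes below
   the smallest normal half (2^-14) to zero. The rounding/denormal
   variants (frd, ftn, frn) change exactly these two steps. */
float ftd(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u &= 0xFFFFE000u;                 /* clear the 13 low fraction bits */
    memcpy(&x, &u, sizeof u);
    if (x > -6.103515625e-5f && x < 6.103515625e-5f)
        x = 0.0f;                     /* no denormals: flush to zero */
    return x;
}
```

Inserting such a call after every float operation makes a plain F32 build behave (range and precision wise) like an F16 one.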
16. Impact of F16 accuracy and dynamic range
- Benchmark 1: zooming (A. Montanvert, Grenoble)
- Spline technique for x1, x2 and x4 zooms
- Benchmark 2: JPEG (Mediabench)
- 4 different DCT/IDCT functions
- Integer / fast integer / F32 / F16
- Benchmark 3: wavelet transform (L. Lacassagne, Orsay)
- SPIHT (Set Partitioning in Hierarchical Trees)
17. Accuracy (1): zooming benchmark
- Denormals are useless
- No significant difference between truncation and rounding for the mantissa
- Minimum hardware (no denormals, truncation) is OK
18. Accuracy (2): JPEG (Mediabench)
[Charts: 512 x 512 and 256 x 256 images. Difference (dB) between the final compressed image and the uncompressed original image.]
19. Accuracy (3): wavelet transform
512 x 512 or 1024 x 1024 images
20. Accuracy (4): wavelet transforms
Images 256 x 256
21. Benchmarks
- Convolution operators
- Horizontal-vertical version of Deriche filter
- Deriche gradient
- Image stabilization
- Points of Interest
- Achard
- Harris
- Optical flow
- FDCT (JPEG 6-a)
for (i = 0; i < size-1; i++)
    for (j = 0; j < size; j++)
        Y[i][j] = (byte)((b0*X[i][j] + a1*Y[i-1][j] + a2*Y[i-2][j]) >> 8);

for (i = size-1; i > 0; i--)
    for (j = 0; j < size; j++)
        Y[i][j] = (byte)((b0*X[i][j] + a1*Y[i+1][j] + a2*Y[i+2][j]) >> 8);

Deriche filter, horizontal-vertical version
22. HW and SW support
- Altera NIOS development kit (Cyclone edition)
- EP1C20F400C7 FPGA device
- NIOS II/f CPU (50-MHz)
- Altera IDE
- GCC tool chain (-O3 option)
- High_res_timer (number of clock cycles for the execution time)
- VHDL description of all the F16 operators
- Arithmetic operators
- Data handling operators
- Quartus II design software
- NIOS II/f
- Fixed features
- 32-bit RISC CPU
- Branch prediction
- Dynamic branch predictor
- Barrel shifter
- Customized instructions
- Parameterized features
- HW integer multiplication and division
- 4 KB instruction cache
- 2 KB data cache
23. Customization of SIMD F16 instructions
Customized operator groups: data manipulation; ADD/SUB, MUL, DIV.

With a 32-bit CPU, it makes sense to implement F16 instructions as SIMD 2 x 16-bit instructions.
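A sketch of the packing convention this implies (two half values per 32-bit register word; the helper names are illustrative, not the NIOS custom-instruction mnemonics):

```c
#include <stdint.h>

/* Two 16-bit half values packed into one 32-bit word, lane 0 in the
   low 16 bits: the layout assumed for SIMD 2 x 16-bit instructions. */
static inline uint32_t pack2h(uint16_t lane0_bits, uint16_t lane1_bits)
{
    return (uint32_t)lane0_bits | ((uint32_t)lane1_bits << 16);
}
static inline uint16_t lane0(uint32_t w) { return (uint16_t)(w & 0xFFFFu); }
static inline uint16_t lane1(uint32_t w) { return (uint16_t)(w >> 16); }
```

A 2-wide F16 add/mul then operates on both lanes of such a word in one custom instruction.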
24. SIMD F16 instructions
- Data conversions: 1 cycle
- Bytes to/from F16
- Shorts to/from F16
- Conversions and shifts: 1 cycle
- Accesses to (i, i-1) or (i+2, i+1) and conversions
- Arithmetic instructions
- ADD/SUB: 2 cycles (4 for F32)
- MULF: 2 cycles (3 for F32)
- DIVF: 5 cycles
- DP2: 1 cycle
25. Execution time: basic vector operations
[Charts: execution times of copy, vector add and mul, and vector-scalar add and mul, for F32, I32 and F16, with the instruction latencies.]
26. Execution time: basic vector operations
- Speedup: SIMD F16 versus scalar I32 or F32
- Smaller cache footprint for F16 compared to I32/F32
- F16 latencies are smaller than F32 latencies
27. Benchmark speedups
- Speedup greater than 2.5 versus F32
- Speedup from 1.3 to 3 versus I32
- Depends on the add/mul ratio and the amount of data manipulation
- Even scalar F16 can be faster than I32 (1.3 speedup for the JPEG DCT)
28. Hardware cost
[Charts: hardware cost of the F32 and F16 operators.]
29. Concluding remarks
- Intermediate-level graphics benchmarks generally need more than the I16 (short) or I32 (int) dynamic ranges, without needing the F32 (float) dynamic range
- On our benchmarks, graphical results are not significantly different when using F16 instead of F32
- A limited set of SIMD F16 instructions has been customized for the NIOS II CPU
- The hardware cost is limited and compatible with today's FPGA technologies
- The speedups range from 1.3 to 3 (generally 1.5) versus I32 and are greater than 2.5 versus F32
- Similar results have been found for general-purpose CPUs (Pentium 4, PowerPC)
- Tests should be extended to other embedded approaches
- SoCs
- Customizable CPUs (Tensilica approach)
30. References
- OpenEXR, http://www.openexr.org/details.html
- W.R. Mark, R.S. Glanville, K. Akeley and M.J. Kilgard, "Cg: A system for programming graphics hardware in a C-like language".
- NVIDIA, Cg User's manual, http://developer.nvidia.com/view.asp?IO=cg_toolkit
- Apple, Introduction to vImage, http://developer.apple.com/documentation/Performance/Conceptual/vImage/
- G. Kolli, "Using Fixed-Point Instead of Floating Point for Better 3D Performance", Intel Optimizing Center, http://www.devx.com/Intel/article/16478
- D. Menard, D. Chillet, F. Charot and O. Sentieys, "Automatic Floating-point to Fixed-point Conversion for DSP Code Generation", in International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES 2002)
- F. Fang, Tsuhan Chen and Rob A. Rutenbar, "Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform", in EURASIP Journal on Signal Processing, Special Issue on Applied Implementation of DSP and Communication Systems
- R. Deriche, "Using Canny's criteria to derive a recursively implemented optimal edge detector", The International Journal of Computer Vision, 1(2):167-187, May 1987.
- A. Kumar, "SSE2 Optimization: OpenGL Data Stream Case Study", Intel application notes, http://www.intel.com/cd/ids/developer/asmo-na/eng/segments/games/resources/graphics/19224.htm
- Sample code for the benchmarks is available at http://www.lri.fr/~de/F16/codetsi
- Multi-Chip Projects, Design Kits, http://cmp.imag.fr/ManChap4.html
- J. Detrey and F. de Dinechin, "A VHDL Library of Parametrisable Floating-Point and LNS Operators for FPGA", http://www.ens-lyon.fr/~jdetrey/FPLibrary
31. Back slides
- F16 SIMD instructions on General Purpose CPUs
32. Microarchitectural assumptions for the Pentium 4 and PowerPC G5
- The new F16 instructions are compatible with the present implementation of the SIMD ISA extensions
- 128-bit SIMD registers
- Same number of SIMD registers
- Most SIMD 16-bit integer instructions can be used for F16 data
- Transfers
- Logical instructions
- Pack/unpack, shuffle and permutation instructions
- New instructions
- F16 arithmetic ones: add, sub, mul, div, sqrt
- Conversion instructions
- 16-bit integer to/from 16-bit FP
- 8-bit integer to/from 16-bit FP
33. Some P4 instruction examples
Latencies and throughput values are similar to the corresponding ones of the P4 FP instructions.

Instruction | Latency
ADDF16      | 4
MULF16      | 6
CBL2F16     | 4
CBH2F16     | 4
CF162BL     | 4
CF162BH     | 4

[Diagram: byte-to-half conversion instructions. CBL2F16 and CBH2F16 convert the low or high 8 bytes of an XMM register into 8 half values in an XMM register.]

With smaller latencies: ADDF16 2, MULF16 4, CONV 2.
34. Measures
- Hardware simulator
- IA-32
- 2.4 GHz Pentium 4 with 768 MB, running Windows 2000
- Intel C++ 8 compiler with the QxW option (maximize speed)
- Execution time measured with the RDTSC instruction
- PowerPC
- 1.6 GHz PowerPC G5 with 768 MB DDR400, running Mac OS X.3
- Xcode programming environment including gcc 3.3
- Measures
- Average values of at least 10 executions (excluding abnormal ones)
35. SIMD execution time (1): Deriche benchmarks
- SIMD integer results are incorrect (insufficient dynamic range)
- F16 results are close to the incorrect SIMD integer results
- F16 results are significantly better than the 32-bit FP results
36. SIMD execution time (2): scan benchmarks
Cumulative sum and sum of squares of the preceding pixel values: execution time according to the input-output formats.

- Copy corresponds to the lower bound on execution time (memory-bound)
- Byte-short for the sum scan, and byte-short and byte-integer for the sum-of-squares scan, give incorrect results (insufficient dynamic range)
- Same results as for the Deriche benchmarks
- F16 results are close to the incorrect SIMD integer results
- F16 results show a significant speed-up compared to float-float for both scans, and compared to byte-float and float-float for the sum-of-squares scan
37. SIMD execution time (3): OpenGL data stream case
- Compute for each triangle the min and max values of the vertex coordinates
- Most of the computation time is spent in the AoS to SoA conversion
- Results
- Altivec is far better, but the relative F16/F32 speed-up is similar
38. Overall comparison (1/2/3)
[Charts: speed-up of the F16 version versus the float version (left), and F16 versus the incorrect 16-bit integer version (right).]
39. SIMD execution time (4): wavelet transform
[Charts: Pentium 4, F32/F16 speed-up of the horizontal, vertical and overall wavelet transform versus image size.]
40. SIMD execution time (4): wavelet transform
[Charts: PowerPC, F32/F16 execution time of the horizontal, vertical and overall wavelet transform versus image size.]
41. Chip area: rough evaluation
- Same approach as used by Talla et al. for the Mediabreeze architecture
- VHDL models of FP operators
- J. Detrey and F. de Dinechin (ENS Lyon)
- Non-pipelined and pipelined versions
- Adder: close path and far path for the exponent values
- Divider: radix-4 SRT algorithm
- SQRT: radix-2 SRT algorithm
- Cell-based library
- ST 0.18 µm HCMOS8D technology
- Cadence 4.4.3 synthesis tool (before placement and routing)
- Limitations
- Full-custom VLSI vs. VHDL cell-based library
- The actual implementation in the P4 (G5) data path is not considered
42. 16-bit and 64-bit operators
The two-path approach is too costly for a 16-bit FP adder; a straightforward approach would be sufficient.
43. Chip area evaluation
Chip area (mm2)

The 16-bit FP FU chip area is about 5.5% of that of the 64-bit FP FU. Eight such units would be 11% of the four corresponding 64-bit ones.