A case for 16-bit floating point data: FPGA image and media processing

Transcript and Presenter's Notes
1
A case for 16-bit floating point data: FPGA image and media processing
  • Daniel Etiemble and Lionel Lacassagne
  • University Paris Sud, Orsay (France)
  • de@lri.fr

2
Summary
  • Graphics and media applications
  • Integer versus FP computations
  • Accuracy
  • Execution speed
  • Compilation issues
  • A niche for 16-bit floating point format (F16 or
    half)
  • Methodology and benchmarks
  • Hardware support
  • Customization of SIMD 16-bit FP operators on an FPGA soft core (Altera NIOS II CPU)
  • The SIMD 16-bit FP instructions
  • Results
  • Conclusion

3
Integer or FP computations?
  • Both formats are used in graphics and media
    processing
  • Example: the Apple vImage library has four image types built on four pixel types (see the sketch below)
  • Unsigned char (0-255) or float (0.0-1.0) for color or alpha values
  • Sets of 4 unsigned chars or floats for Alpha, Red, Green, Blue
  • Trade-offs
  • Precision and dynamic range
  • Memory occupation and cache footprint
  • Hardware cost (embedded applications)
  • Chip area
  • Power dissipation
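
A minimal C sketch of these trade-offs, with hypothetical type names that mirror vImage's four pixel layouts (this is not the actual vImage API): an 8-bit pixel costs 1 byte against 4 for a float, at the price of range and precision.

/* Hypothetical pixel types mirroring vImage's four layouts. */
typedef unsigned char Pixel8;        /* planar, 0..255               */
typedef float         PixelF;        /* planar, 0.0..1.0             */
typedef unsigned char Pixel8888[4];  /* interleaved ARGB, 4 x 8 bits */
typedef float         PixelFFFF[4];  /* interleaved ARGB, 4 floats   */

/* Same color value in both scales: 255 maps to 1.0f. */
static inline PixelF pixel8_to_f(Pixel8 p) { return p / 255.0f; }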

4
Integer or FP computations? (2)
  • General trend to replace FP computations by
    fixed-point computations
  • Intel GPP library: "Using Fixed-Point Instead of Floating-Point for Better 3D Performance" (G. Kolli)
  • Intel Optimizing Center, http://www.devx.com/Intel/article/16478
  • Techniques for automatic floating-point to fixed-point conversion for DSP code generation (Menard et al.)

5
Menard et al. approach (LASTI, Lannion, France)
[Figure: methodology. A floating-point algorithm is mapped onto fixed-point hardware to obtain a correct algorithm, under a precision constraint.]
  • SW design (DSP): optimize the mapping of the algorithm on a fixed architecture, to minimize execution time and code size
  • HW design (ASIC/FPGA): optimize the data-path width, to minimize chip area
  • In both cases, maximize precision
6
Integer or FP computations? (3)
  • The opposite option: customized FP formats
  • Lightweight FP arithmetic (Fang et al.) to avoid conversions
  • For the IDCT, FP numbers with a 5-bit exponent and an 8-bit mantissa are sufficient to get a PSNR similar to that of 32-bit FP numbers
  • To be compared with the half format

7
Integer or FP computations? (4)
  • How to help a compiler to vectorize?
  • Integers: different input and output formats
  • N bits + N bits → N+1 bits
  • N bits × N bits → 2N bits
  • FP numbers: same input and output format
  • Example: a Deriche filter on a size × size image (code below)

#define byte unsigned char
byte X[size][size], Y[size][size];
int32 b0, a1, a2;

for (i = 0; i < size; i++)
  for (j = 0; j < size; j++)
    Y[i][j] = (byte)((b0 * X[i][j] + a1 * Y[i-1][j] + a2 * Y[i-2][j]) >> 8);
for (i = size - 1; i > 0; i--)
  for (j = 0; j < size; j++)
    Y[i][j] = (byte)((b0 * X[i][j] + a1 * Y[i+1][j] + a2 * Y[i+2][j]) >> 8);
Compiler vectorization is impossible here. With 8-bit coefficients the benchmark can be vectorized manually, but only when the programmer has detailed knowledge of the parameters involved. The float version, in contrast, is easily vectorized by the compiler (see the sketch below).
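
A minimal sketch of such a float version (causal pass only; the anti-causal pass is symmetric), assuming the coefficients are pre-scaled to floats; with one uniform type the inner loop has no dependences and the compiler can vectorize it directly:

/* Causal pass of a float Deriche filter; boundary rows are elided.
   All operands share one 32-bit float format, so the j-loop is
   trivially vectorizable. */
void deriche_f32(int size, float X[size][size], float Y[size][size],
                 float b0, float a1, float a2)
{
    for (int i = 2; i < size; i++)
        for (int j = 0; j < size; j++)
            Y[i][j] = b0 * X[i][j] + a1 * Y[i-1][j] + a2 * Y[i-2][j];
}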
8
Cases for 16-bit FP formats
  • Computations whose data range exceeds the 16-bit integer range without needing the 32-bit FP (float) range
  • Graphics and media applications
  • Not for GPUs (F16 is already used in NVidia GPUs)
  • For embedded applications
  • Advantages of the 16-bit FP format
  • Reduced memory occupation (cache footprint) versus 32-bit integer or FP formats
  • CPUs without SIMD extensions (low-end embedded CPUs)
  • 2x wider SIMD instructions compared to float SIMD
  • CPUs with SIMD extensions (high-end embedded CPUs)
  • Huge advantage of SIMD float operations versus SIMD integer operations, both for compiler and manual vectorization

9
Example: Points of Interest
10
Points of interest (PoI) in images
[Figure: Harris algorithm pipeline. Image (byte) → 3 x 3 gradient (Sobel) → Ix, Iy (short) → products IxIx, IxIy, IyIy → 3 x 3 Gauss filters → Sxx, Sxy, Syy (int) → coarsity FI = (Sxx*Syy - Sxy^2) - 0.05*(Sxx + Syy)^2 (int) → threshold → PoI (byte)]
  • Integer computation mixes char, short and int, and prevents an efficient use of SIMD parallelism
  • F16 computations would profit from SIMD parallelism with a uniform 16-bit format (see the sketch below)
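
A minimal scalar sketch of the coarsity step from the figure above, at the int stage; the Sobel and Gauss stages are omitted and the names are illustrative:

/* Coarsity of one pixel from the Gauss-smoothed gradient products
   Sxx, Sxy, Syy (3 x 3 Gauss filters over IxIx, IxIy, IyIy). */
int coarsity(int Sxx, int Sxy, int Syy)
{
    int det   = Sxx * Syy - Sxy * Sxy;   /* det(M)          */
    int trace = Sxx + Syy;               /* trace(M)        */
    return det - (trace * trace) / 20;   /* k = 0.05 = 1/20 */
}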

11
16-bit Floating-Point formats
  • Some have been defined in DSPs but rarely used
  • Example: TMS320C32
  • Internal FP type (immediate operand)
  • 1 sign bit, 4-bit exponent field and 11-bit
    fraction
  • External FP type (storage purposes)
  • 1 sign bit, 8-bit exponent field and 7-bit
    fraction
  • Half format

12
Half format
  • A 16-bit counterpart of the IEEE 754 single- and double-precision formats (see the bit-layout sketch below)
  • Introduced by ILM for the OpenEXR format
  • Defined in Cg (NVidia)
  • Motivation
  • 16-bit integer based formats typically represent color component values from 0 (black) to 1 (white), but don't account for over-range values (e.g. a chrome highlight) that can be captured by film negative or other HDR devices. Conversely, 32-bit floating-point TIFF is often overkill for visual effects work: it provides more than sufficient precision and dynamic range for VFX images, but at the cost of storage, both on disk and in memory
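
As a concrete picture of the format, a minimal C sketch decoding a normalized half (1 sign bit, 5-bit exponent with bias 15, 10-bit fraction); denormals, infinities and NaNs are omitted for brevity:

#include <math.h>

/* Decode a normalized IEEE 754 half: 1 sign, 5 exponent, 10 fraction bits. */
float half_to_float(unsigned short h)
{
    int sign = (h >> 15) & 0x1;
    int exp  = (h >> 10) & 0x1F;          /* biased exponent (bias 15) */
    int frac =  h        & 0x3FF;         /* 10-bit fraction           */
    float val = (1.0f + frac / 1024.0f)   /* implicit leading 1        */
              * ldexpf(1.0f, exp - 15);   /* times 2^(exp - bias)      */
    return sign ? -val : val;
}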

13
Validation of the F16 approach
  • Accuracy
  • Results presented in ODES-3 (2005) and CAMP05
    (2005)
  • Next slides.
  • Performance with general-purpose CPUs (Pentium 4 and PowerPC G4/G5)
  • Results presented in ODES-3 (2005) and CAMP05
    (2005)
  • Performance with FPGAs (this presentation)
  • Execution time
  • Hardware cost (and power dissipation)
  • Other embedded hardware (to be done)
  • SoC
  • Customizable CPUs (e.g. the Tensilica approach)

Another time?
14
Accuracy
  • Comparison of F16 computation results with F32
    computation results
  • Specific features of FP formats
  • Rounding?
  • Denormals?
  • NaN?

15
Impact of F16 accuracy and dynamic range
  • Simulation of the half format with the float format on actual benchmarks and applications
  • Impact of the reduced accuracy and range on the results
  • F32-computed and F16-computed images are compared with PSNR measures
  • Four different functions (ftd, frd, ftn, frn) to simulate F16
  • Fraction truncation or rounding
  • With or without denormals
  • For each benchmark, manual insertion of one of the functions (ftd / frd / ftn / frn); see the sketch below
  • Function call before any use of a float value
  • Function call after any operation producing a float value
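
A minimal sketch of one such function, assuming ftn stands for fraction truncation without denormals; it truncates a float's fraction to 10 bits and flushes the half denormal range to zero (overflow beyond the half range is not handled here):

#include <string.h>

/* Truncate a float to F16 precision: keep 10 fraction bits and
   flush values below the smallest normalized half (2^-14) to zero. */
float ftn(float x)
{
    unsigned int u;
    memcpy(&u, &x, sizeof u);        /* reinterpret the bits          */
    u &= 0xFFFFE000u;                /* drop the 13 low fraction bits */
    memcpy(&x, &u, sizeof x);
    if (x > -6.1035156e-5f && x < 6.1035156e-5f)
        x = 0.0f;                    /* no denormals                  */
    return x;
}
/* Usage: y = ftn(ftn(a) * ftn(b)); inserted around every float op. */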

16
Impact of F16 accuracy and dynamic range
  • Benchmark 1: zooming (A. Montanvert, Grenoble)
  • Spline technique for x1, x2 and x4 zooms
  • Benchmark 2: JPEG (Mediabench)
  • 4 different DCT/IDCT functions
  • Integer / fast integer / F32 / F16
  • Benchmark 3: wavelet transform (L. Lacassagne, Orsay)
  • SPIHT (Set Partitioning in Hierarchical Trees)

17
Accuracy (1): Zooming benchmark
  • Denormals are useless
  • No significant difference between truncation and
    rounding for mantissa
  • Minimum hardware (no denormals, truncation) is OK

18
Accuracy (2): JPEG (Mediabench)
[Chart: PSNR difference (dB) between the compressed final image and the uncompressed original image, for 512 x 512 and 256 x 256 images.]
19
Accuracy (3): Wavelet transform
[Chart: results for 512 x 512 and 1024 x 1024 images.]
20
Accuracy (4): Wavelet transforms
[Chart: results for 256 x 256 images.]
21
Benchmarks
  • Convolution operators
  • Horizontal-vertical version of Deriche filter
  • Deriche gradient
  • Image stabilization
  • Points of Interest
  • Achard
  • Harris
  • Optical flow
  • FDCT (JPEG 6-a)

for (i = 0; i < size - 1; i++)
  for (j = 0; j < size; j++)
    Y[i][j] = (byte)((b0 * X[i][j] + a1 * Y[i-1][j] + a2 * Y[i-2][j]) >> 8);
for (i = size - 1; i > 0; i--)
  for (j = 0; j < size; j++)
    Y[i][j] = (byte)((b0 * X[i][j] + a1 * Y[i+1][j] + a2 * Y[i+2][j]) >> 8);

Deriche filter, horizontal-vertical version
22
HW and SW support
  • Altera NIOS development kit (Cyclone edition)
  • EP1C20F400C7 FPGA device
  • NIOS II/f CPU (50 MHz)
  • Altera IDE
  • GCC tool chain (-O3 option)
  • High_res_timer (number of clock cycles for execution time)
  • VHDL description of all the F16 operators
  • Arithmetic operators
  • Data handling operators
  • Quartus II design software
  • NIOS II/f
  • Fixed features
  • 32-bit RISC CPU
  • Branch prediction
  • Dynamic branch predictor
  • Barrel shifter
  • Customized instructions
  • Parameterized features
  • HW integer multiplication and division
  • 4 KB instruction cache
  • 2 KB data cache

23
Customization of SIMD F16 instructions
Data manipulation
ADD/SUB, MUL, DIV
With a 32-bit CPU, it makes sense to implement F16 instructions as SIMD 2 x 16-bit instructions (see the sketch below).
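
A sketch of what this packing means in plain C, with hypothetical helper names: two halfs travel in one 32-bit word, and each custom instruction operates on both lanes at once:

typedef unsigned short f16;     /* one half, as raw bits        */
typedef unsigned int   f16x2;   /* two halfs packed in 32 bits  */

/* Hypothetical helpers: pack/unpack two F16 lanes in one word. */
static inline f16x2 pack2(f16 lo, f16 hi) { return (f16x2)lo | ((f16x2)hi << 16); }
static inline f16 lane0(f16x2 v) { return (f16)(v & 0xFFFFu); }
static inline f16 lane1(f16x2 v) { return (f16)(v >> 16); }

/* A SIMD ADDF16 computes lane0(a)+lane0(b) and lane1(a)+lane1(b)
   in one 2-cycle custom instruction instead of two scalar FP adds. */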
24
SIMD F16 instructions
  • Data conversions: 1 cycle
  • Bytes to/from F16
  • Shorts to/from F16
  • Conversions and shifts: 1 cycle
  • Accesses to (i, i-1) or (i+2, i+1) and conversions
  • Arithmetic instructions (see the usage sketch below)
  • ADD/SUB: 2 cycles (4 for F32)
  • MULF: 2 cycles (3 for F32)
  • DIVF: 5 cycles
  • DP2: 1 cycle
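
A usage sketch, assuming the operators are reached through the standard Nios II custom-instruction intrinsics; the opcode selectors and wrapper names below are placeholders for whatever values the design assigns:

/* Hypothetical opcode selectors for the customized F16 operators. */
#define ADDF16_N 0   /* packed 2 x F16 add, 2 cycles */
#define MULF16_N 1   /* packed 2 x F16 mul, 2 cycles */

typedef unsigned int f16x2;   /* two halfs in one 32-bit word */

static inline f16x2 addf16(f16x2 a, f16x2 b)
{ return (f16x2)__builtin_custom_inii(ADDF16_N, (int)a, (int)b); }
static inline f16x2 mulf16(f16x2 a, f16x2 b)
{ return (f16x2)__builtin_custom_inii(MULF16_N, (int)a, (int)b); }

/* Vector add: two pixels per iteration. */
void vadd(f16x2 *y, const f16x2 *a, const f16x2 *b, int n2)
{
    int i;
    for (i = 0; i < n2; i++)
        y[i] = addf16(a[i], b[i]);
}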

25
Execution time: basic vector operations
[Chart: execution times of vector copy, vector add/mul and vector-scalar add/mul, for F32, I32 and F16, with the instruction latencies.]
26
Execution time: basic vector operations
  • Speedup: SIMD F16 versus scalar I32 or F32
  • Smaller cache footprint for F16 compared to
    I32/F32
  • F16 latencies are smaller than F32 latencies

27
Benchmark speedups
  • Speedup greater than 2.5 versus F32
  • Speedup from 1.3 to 3 versus I32
  • Depends on the add/mul ratio and on the amount of data manipulation
  • Even scalar F16 can be faster than I32 (1.3 speedup for the JPEG DCT)

28
Hardware cost
[Chart: hardware cost of the F32 and F16 operator sets.]
29
Concluding remarks
  • Intermediate-level graphics benchmarks generally need more than the I16 (short) or I32 (int) dynamic ranges without needing the F32 (float) dynamic range
  • On our benchmarks, graphical results are not significantly different when using F16 instead of F32
  • A limited set of SIMD F16 instructions has been customized for the NIOS II CPU
  • The hardware cost is limited and compatible with today's FPGA technologies
  • The speedups range from 1.3 to 3 (generally 1.5) versus I32 and are greater than 2.5 versus F32
  • Similar results have been found for general-purpose CPUs (Pentium 4, PowerPC)
  • Tests should be extended to other embedded
    approaches
  • SoCs
  • Customizable CPUs (Tensilica approach)

30
References
  • OpenEXR, http://www.openexr.org/details.html
  • W.R. Mark, R.S. Glanville, K. Akeley and M.J. Kilgard, "Cg: A system for programming graphics hardware in a C-like language"
  • NVIDIA, Cg User's manual, http://developer.nvidia.com/view.asp?IO=cg_toolkit
  • Apple, Introduction to vImage, http://developer.apple.com/documentation/Performance/Conceptual/vImage/
  • G. Kolli, "Using Fixed-Point Instead of Floating-Point for Better 3D Performance", Intel Optimizing Center, http://www.devx.com/Intel/article/16478
  • D. Menard, D. Chillet, F. Charot and O. Sentieys, "Automatic Floating-point to Fixed-point Conversion for DSP Code Generation", in International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES 2002)
  • F. Fang, Tsuhan Chen and Rob A. Rutenbar, "Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform", in EURASIP Journal on Signal Processing, Special Issue on Applied Implementation of DSP and Communication Systems
  • R. Deriche, "Using Canny's criteria to derive a recursively implemented optimal edge detector", The International Journal of Computer Vision, 1(2):167-187, May 1987
  • A. Kumar, "SSE2 Optimization: OpenGL Data Stream Case Study", Intel application notes, http://www.intel.com/cd/ids/developer/asmo-na/eng/segments/games/resources/graphics/19224.htm
  • Sample code for the benchmarks is available at http://www.lri.fr/de/F16/codetsi
  • Multi-Chip Projects, Design Kits, http://cmp.imag.fr/ManChap4.html
  • J. Detrey and F. De Dinechin, "A VHDL Library of Parametrisable Floating Point and LNS Operators for FPGA", http://www.ens-lyon.fr/jdetrey/FPLibrary

31
Back slides
  • F16 SIMD instructions on general-purpose CPUs

32
Microarchitectural assumptions for Pentium 4 and PowerPC G5
  • The new F16 instructions are compatible with the present implementation of the SIMD ISA extensions
  • 128-bit SIMD registers
  • Same number of SIMD registers
  • Most SIMD 16-bit integer instructions can be used
    for F16 data
  • Transfers
  • Logical instructions
  • Pack/unpack, Shuffle, Permutation instructions
  • New instructions
  • F16 arithmetic: add, sub, mul, div, sqrt
  • Conversion instructions
  • 16-bit integer to/from 16-bit FP
  • 8-bit integer to/from 16-bit FP

33
Some P4 instruction examples
Latencies and throughputs are similar to those of the corresponding P4 FP instructions:

Instruction  Latency (cycles)
ADDF16       4
MULF16       6
CBL2F16      4
CBH2F16      4
CF162BL      4
CF162BH      4

[Figure: byte-to-half conversion instructions. CBL2F16/CBH2F16 convert the low/high 8 bytes of an XMM register to 8 F16 values.]
With smaller latencies: ADDF16 2, MULF16 4, conversions 2.
34
Measures
  • Hardware simulator
  • IA-32
  • 2.4 GHz Pentium 4 with 768 MB, running Windows 2000
  • Intel C++ 8 compiler with the QxW option (maximize speed)
  • Execution time measured with the RDTSC instruction
  • PowerPC
  • 1.6 GHz PowerPC G5 with 768 MB DDR400, running Mac OS X 10.3
  • Xcode programming environment including gcc 3.3
  • Measures
  • Average values of at least 10 executions (excluding abnormal ones)

35
SIMD execution time (1): Deriche benchmarks
[Chart: execution times of the Deriche benchmarks for the integer, F32 and F16 SIMD versions.]
  • SIMD integer results are incorrect (insufficient dynamic range)
  • F16 execution times are close to those of the (incorrect) SIMD integer version
  • F16 execution times are significantly better than the 32-bit FP ones

36
SIMD execution time (2): Scan benchmarks
[Chart: cumulative sum and cumulative sum of squares of the preceding pixel values; execution time according to the input/output formats.]
  • Copy corresponds to the lower bound on execution time (memory-bound)
  • Byte-short for the sum scan, and byte-short and byte-integer for the sum-of-squares scan, give incorrect results (insufficient dynamic range)
  • Same results as for the Deriche benchmarks
  • F16 execution times are close to those of the incorrect SIMD integer versions
  • F16 shows a significant speedup compared to float-float for both scans, and compared to byte-float and float-float for the sum-of-squares scan (see the sketch below)
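
For reference, a minimal scalar sketch of the two scans with byte input and int accumulators; the input/output formats are exactly what the chart varies, and the names are illustrative:

/* Prefix sums over one row: cumulative sum and cumulative sum of
   squares. The benchmark varies the input/output types (byte, short,
   int, float, F16); int accumulators are shown as one instance. */
void scan_row(const unsigned char *x, int *sum, int *sum2, int n)
{
    int s = 0, s2 = 0;
    for (int j = 0; j < n; j++) {
        s  += x[j];
        s2 += x[j] * x[j];
        sum[j]  = s;
        sum2[j] = s2;
    }
}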

37
SIMD execution time (3): OpenGL data stream case
  • Compute, for each triangle, the min and max values of the vertex coordinates
  • Most of the computation time is spent in the AoS-to-SoA conversion (see the sketch below)
  • Results
  • Altivec is far better, but the relative F16/F32 speedup is similar
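
A plain-C sketch of the AoS-to-SoA step that dominates the cost; the Vertex layout is an assumption for illustration:

/* Vertices stored as an array of structures (AoS)... */
typedef struct { float x, y, z; } Vertex;

/* ...rearranged into a structure of arrays (SoA) so SIMD code can
   process all x (then y, then z) coordinates with contiguous loads. */
void aos_to_soa(const Vertex *v, float *xs, float *ys, float *zs, int n)
{
    for (int i = 0; i < n; i++) {
        xs[i] = v[i].x;
        ys[i] = v[i].y;
        zs[i] = v[i].z;
    }
}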

38
Overall comparison (1/2/3)
[Chart: left, speedup of the F16 version versus the float version; right, F16 versus the incorrect 16-bit integer version.]
39
SIMD execution time (4): Wavelet transform
[Chart: Pentium 4, F32/F16 speedup for the horizontal, vertical and overall passes, as a function of image size.]
40
SIMD execution time (4): Wavelet transform (continued)
[Chart: PowerPC, F32/F16 execution-time ratio for the horizontal, vertical and overall passes, as a function of image size.]
41
Chip area: rough evaluation
  • Same approach as used by Talla et al. for the Mediabreeze architecture
  • VHDL models of FP operators
  • J. Detrey and F. De Dinechin (ENS Lyon)
  • Non-pipelined and pipelined versions
  • Adder: close path and far path for exponent values
  • Divider: radix-4 SRT algorithm
  • SQRT: radix-2 SRT algorithm
  • Cell-based library
  • ST 0.18 µm HCMOS8D technology
  • Cadence 4.4.3 synthesis tool (before placement and routing)
  • Limitations
  • Full-custom VLSI versus a VHDL cell-based library
  • The actual implementation in the P4 (G5) data path is not considered

42
16-bit and 64-bit operators
The two-path approach is too costly for a 16-bit FP adder; a straightforward approach would be sufficient.
43
Chip area evaluation
[Chart: chip area (mm²).]
The 16-bit FP FU chip area is about 5.5% of that of the 64-bit FP FU. Eight such units would take 11% of the area of the four corresponding 64-bit ones.