Floating-Point Correctness Analysis at the Binary Level

Floating Point Analysis Using Dyninst. Mike Lam, University of Maryland, College Park. Jeff Hollingsworth, Advisor.

1
Floating Point Analysis Using Dyninst
Mike Lam, University of Maryland, College Park
Jeff Hollingsworth, Advisor
2
Background
  • Floating point represents real numbers as
    (-1)^sgn × f × 2^exp
  • Sign bit (sgn)
  • Exponent (exp)
  • Significand f (mantissa or fraction)
  • Finite precision
  • Single precision: 24 bits (about 7 decimal digits)
  • Double precision: 53 bits (about 16 decimal digits)

[Figure: bit layouts. IEEE Single (32 bits): sign bit, exponent (8 bits), significand (23 bits). IEEE Double (64 bits): sign bit, exponent (11 bits), significand (52 bits).]
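As a small illustration of the layout described on this slide, the sketch below splits a value into its sign, exponent, and fraction fields using Python's standard struct module. The function names are hypothetical and not part of the presentation. The stored fraction is 23 or 52 bits wide; the implicit leading 1 of a normal number accounts for the 24 and 53 bits of precision quoted above.

```python
import struct

def decode_double(x):
    """Split a double into (sign bit, biased exponent, 52-bit fraction field)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def decode_single(x):
    """Split a single into (sign bit, biased exponent, 23-bit fraction field)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & ((1 << 23) - 1)

def value_from_fields(sign, exponent, fraction, frac_bits, bias):
    """Rebuild a normal number as (-1)^sign * 1.fraction * 2^(exponent - bias)."""
    significand = 1 + fraction / (1 << frac_bits)
    return (-1) ** sign * significand * 2.0 ** (exponent - bias)

sign, exponent, fraction = decode_double(-6.5)
rebuilt = value_from_fields(sign, exponent, fraction, frac_bits=52, bias=1023)
```

The exponent bias is 1023 for doubles and 127 for singles; zeros, subnormals, infinities, and NaNs are not handled by value_from_fields.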
3
Motivation
  • Finite precision causes round-off error
  • Compromises certain calculations
  • Hard to detect and diagnose
  • Increasingly important as HPC scales
  • Computation on streaming processors is faster in
    single precision
  • Data movement in double precision is a bottleneck
  • Need to balance speed (singles) and accuracy
    (doubles)
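A minimal sketch of the round-off effect, assuming single precision can be emulated by rounding every intermediate result to 32 bits with the struct module. The helper names are hypothetical.

```python
import struct

def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step, count, single):
    """Add `step` to a running total `count` times in the chosen precision."""
    if single:
        step = to_single(step)
    total = 0.0
    for _ in range(count):
        total = total + step
        if single:
            total = to_single(total)  # emulate a single-precision add
    return total

# Summing 0.1 one thousand times should give 100 in exact arithmetic.
single_error = abs(accumulate(0.1, 1000, single=True) - 100.0)
double_error = abs(accumulate(0.1, 1000, single=False) - 100.0)
```

Both sums are inexact because 0.1 has no finite binary representation, but the single-precision error is many orders of magnitude larger.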

4
Our Goal
Automated analysis techniques to inform
developers about floating point behavior and make
recommendations regarding the use of floating
point arithmetic.
5
Framework
  • CRAFT: Configurable Runtime Analysis for
    Floating-point Tuning
  • Static binary instrumentation
  • Read configuration settings
  • Replace floating-point instructions with new code
  • Rewrite modified binary
  • Dynamic analysis
  • Run modified program on representative data set
  • Produce results and recommendations

6
Previous Work
  • Cancellation detection
  • Reports loss of precision due to subtraction
  • Paper appeared in WHIST'11
  • Range tracking
  • Reports min/max values
  • Replacement
  • Implements mixed-precision configurations
  • Paper to appear in ICS'13
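The first two analyses listed above can be sketched in a few lines. The sketch assumes a common heuristic for cancellation, namely comparing the binary exponent of the result with the larger operand exponent; the slides do not state the exact metric the tool uses. All names here are hypothetical.

```python
import math

def cancelled_bits(a, b):
    """Estimate how many leading binary digits are lost when computing a - b."""
    result = a - b
    if a == 0.0 or b == 0.0:
        return 0                      # nothing to cancel against
    if result == 0.0:
        return 53                     # assumption: report all double-precision bits
    exp_a = math.frexp(a)[1]
    exp_b = math.frexp(b)[1]
    exp_r = math.frexp(result)[1]
    return max(0, max(exp_a, exp_b) - exp_r)

class RangeTracker:
    """Record the minimum and maximum value observed at one program point."""
    def __init__(self):
        self.lo = math.inf
        self.hi = -math.inf

    def observe(self, value):
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)

tracker = RangeTracker()
for v in (3.0, -1.5, 42.0):
    tracker.observe(v)
```

Subtracting two nearly equal values such as 1.0000001 and 1.0 reports a large count, while 3.0 - 1.0 reports none.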

7
Mixed Precision
  • Use double precision where necessary
  • Use single precision everywhere else
  • Can be difficult to implement

 1  LU ← PA
 2  solve Ly = Pb
 3  solve Ux_0 = y
 4  for k = 1, 2, ... do
 5    r_k ← b - Ax_{k-1}
 6    solve Ly = Pr_k
 7    solve Uz_k = y
 8    x_k ← x_{k-1} + z_k
 9    check for convergence
10  end for
Mixed-precision linear solver algorithm
Red text indicates steps performed in
double-precision (all other steps are
single-precision)
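The solver above can be sketched in pure Python by emulating single precision with 32-bit rounding. The slide colors did not survive in this transcript, so the sketch assumes the usual formulation of this algorithm, in which the residual (step 5) and the update (step 8) are done in double precision and the factorization and triangular solves are done in single precision. The function names, the tolerance, and the iteration cap are illustrative choices.

```python
import struct

def f32(x):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_single(A):
    """Step 1: LU = PA with partial pivoting, every operation rounded to single."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):      # assumes a nonsingular matrix
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, perm

def lu_solve_single(LU, perm, b):
    """Steps 2-3 and 6-7: solve Ly = Pb, then Ux = y, in single precision."""
    n = len(LU)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    x = y[:]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] = f32(x[i] - f32(LU[i][j] * x[j]))
        x[i] = f32(x[i] / LU[i][i])
    return x

def refine(A, b, max_iters=10, tol=1e-15):
    """Iterative refinement: double-precision residual and update."""
    n = len(A)
    LU, perm = lu_factor_single(A)
    x = lu_solve_single(LU, perm, b)
    for _ in range(max_iters):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        z = lu_solve_single(LU, perm, r)
        x = [x[i] + z[i] for i in range(n)]
        if max(abs(v) for v in z) < tol:   # step 9: check for convergence
            break
    return x
```

The expensive factorization runs once in single precision; each refinement pass reuses it and only needs a matrix-vector product in double precision.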
8
Configuration
9
Implementation
  • In-place replacement
  • Narrowed focus: doubles → singles
  • In-place downcast conversion
  • Flag in the high bits to indicate replacement

[Figure: downcast conversion. A 64-bit Double is converted in place to a Replaced Double whose high 32 bits hold the flag 0x7FF4DEAD (labeled "Non-signalling NaN") and whose low 32 bits hold the Single value.]
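The in-place encoding can be sketched on 64-bit integers, assuming the flag value 0x7FF4DEAD from the slide sits in the high 32 bits and the single-precision bit pattern in the low 32 bits. The helper names are hypothetical, and slots are kept as integers so that no NaN payload passes through a Python float.

```python
import struct

FLAG = 0x7FF4DEAD  # high 32 bits marking a replaced double (value from the slide)

def double_bits(x):
    """Return the 64-bit pattern of a double as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def downcast_in_place(slot):
    """Turn a slot holding a double into a flagged replaced double."""
    value = struct.unpack("<d", struct.pack("<Q", slot))[0]
    # struct raises OverflowError if the value does not fit in a single.
    low = struct.unpack("<I", struct.pack("<f", value))[0]
    return (FLAG << 32) | low

def is_replaced(slot):
    """Check the flag in the high 32 bits."""
    return (slot >> 32) == FLAG

def read_single(slot):
    """Return the single-precision value, downcasting first if unflagged."""
    if not is_replaced(slot):
        slot = downcast_in_place(slot)
    return struct.unpack("<f", struct.pack("<I", slot & 0xFFFFFFFF))[0]

original = double_bits(0.1)
replaced = downcast_in_place(original)
```

Because the flag occupies the exponent field with all bits set, an unconverted read of a replaced slot as a double yields a NaN rather than a plausible number.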
10
Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
1  movsd  0x601e38(%rax, %rbx, 8) → %xmm0
2  mulsd  -0x78(%rsp) → %xmm0
3  addsd  -0x4f02(%rip) → %xmm0
4  movsd  %xmm0 → 0x601e38(%rax, %rbx, 8)
11
Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
1  movsd  0x601e38(%rax, %rbx, 8) → %xmm0
   check/replace -0x78(%rsp) and %xmm0
2  mulss  -0x78(%rsp) → %xmm0
   check/replace -0x4f02(%rip) and %xmm0
3  addss  -0x20dd43(%rip) → %xmm0
4  movsd  %xmm0 → 0x601e38(%rax, %rbx, 8)
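The instrumented sequence can be modeled in Python to show what the check/replace steps accomplish. This is a simulation under assumptions, not the generated machine code: 64-bit slots are integers, the flag is the 0x7FF4DEAD value from the Implementation slide, and every function name is hypothetical.

```python
import struct

FLAG = 0x7FF4DEAD  # assumed flag in the high 32 bits of a replaced double

def from_double(x):
    """64-bit slot holding an ordinary double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_single(x):
    """64-bit slot holding a flagged single in its low 32 bits."""
    return (FLAG << 32) | struct.unpack("<I", struct.pack("<f", x))[0]

def low_single(slot):
    """Read the low 32 bits of a slot as a single."""
    return struct.unpack("<f", struct.pack("<I", slot & 0xFFFFFFFF))[0]

def check_replace(slot):
    """Leave flagged slots alone; downcast unflagged doubles in place."""
    if slot >> 32 == FLAG:
        return slot
    return from_single(struct.unpack("<d", struct.pack("<Q", slot))[0])

def instrumented_update(gvec_ij, lvec3, gvar):
    xmm0 = gvec_ij                                   # 1: movsd load
    lvec3, xmm0 = check_replace(lvec3), check_replace(xmm0)
    xmm0 = from_single(low_single(xmm0) * low_single(lvec3))  # 2: mulss
    gvar, xmm0 = check_replace(gvar), check_replace(xmm0)
    xmm0 = from_single(low_single(xmm0) + low_single(gvar))   # 3: addss
    return xmm0                                      # 4: movsd store

result = instrumented_update(from_double(1.5), from_double(2.0), from_double(0.25))
mixed = instrumented_update(from_single(1.5), from_double(2.0), from_single(0.25))
```

The checks make the single-precision instructions safe regardless of whether each operand was written by replaced or unreplaced code.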
12
Block Editing (PatchAPI)
[Figure: block editing around the original instruction in a block. Labels: block splits; initialization; check/replace; double → single conversion; cleanup.]
13
Automated Search
  • Manual mixed-precision analysis
  • Hard to use without intuition regarding potential
    replacements
  • Automatic mixed-precision analysis
  • Try lots of configurations (empirical
    auto-tuning)
  • Test with user-defined verification routine and
    data set
  • Exploit program control structure: replace larger
    structures (modules, functions) first
  • If coarse-grained replacements fail, try
    finer-grained subcomponent replacements
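The coarse-to-fine strategy described above can be sketched as a breadth-first search over a tree of program structures. The tree representation, the names, and the toy verification routine are all hypothetical; in practice `passes` would run the instrumented binary on the user's data set and call the user-defined verification routine.

```python
from collections import deque

def leaves(node, children):
    """Collect the instruction-level leaves below a structure."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= leaves(kid, children)
    return found

def search(root, children, passes):
    """Try whole structures first; descend into children only on failure."""
    accepted = set()
    tested = 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        candidate = leaves(node, children)
        tested += 1
        if passes(candidate):
            accepted |= candidate       # replace this whole structure
        else:
            queue.extend(children.get(node, []))
    return accepted, tested

# Toy program: one module with two functions of two instructions each.
children = {"app": ["solver", "io"], "solver": ["s1", "s2"], "io": ["i1", "i2"]}
sensitive = {"s2"}                      # pretend s2 must stay in double precision

def passes(replaced):
    return not (replaced & sensitive)

accepted, tested = search("app", children, passes)
```

Configurations that pass individually are not guaranteed to pass when combined, so a final run with the union of accepted replacements is a sensible last check.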

14
System Overview
15
NAS Results
Benchmark     Candidates  Configurations  Instructions Replaced
(name.CLASS)              Tested          % Static   % Dynamic
bt.W          6,647       3,854           76.2       85.7
bt.A          6,682       3,832           75.9       81.6
cg.W          940         270             93.7       6.4
cg.A          934         229             94.7       5.3
ep.W          397         112             93.7       30.7
ep.A          397         113             93.1       23.9
ft.W          422         72              84.4       0.3
ft.A          422         73              93.6       0.2
lu.W          5,957       3,769           73.7       65.5
lu.A          5,929       2,814           80.4       69.4
mg.W          1,351       458             84.4       28.0
mg.A          1,351       456             84.1       24.4
sp.W          4,772       5,729           36.9       45.8
sp.A          4,821       5,044           51.9       43.0
16
AMGmk Results
  • Algebraic MultiGrid microkernel
  • Multigrid method is highly adaptive
  • Good candidate for replacement
  • Automatic search
  • Complete conversion (100% replacement)
  • Manually-rewritten version
  • Speedup: 175 sec to 95 sec (1.8X)
  • Conventional x86_64 hardware

17
SuperLU Results
  • Package for LU decomposition and linear solves
  • Reports final error residual
  • Both single- and double-precision versions
  • Verified manual conversion via automatic search
  • Used error from provided single-precision version
    as threshold
  • Final config matched single-precision profile
    (99.9% replacement)

Threshold  Instructions Replaced  Final Error
           % Static   % Dynamic
1.0e-03    99.1       99.9        1.59e-04
1.0e-04    94.1       87.3        4.42e-05
7.5e-05    91.3       52.5        4.40e-05
5.0e-05    87.9       45.2        3.00e-05
2.5e-05    80.3       26.6        1.69e-05
1.0e-05    75.4       1.6         7.15e-07
1.0e-06    72.6       1.6         4.7e7-07
18
Retrospective
  • Twofold original motivation
  • Faster computation (raw FLOPs)
  • Decreased storage footprint and memory bandwidth
  • Domains vary in sensitivity to these parameters
  • Computation-centric analysis
  • Less insight for memory-constrained domains
  • Sometimes difficult to translate
    instruction-level recommendations to source
    code-level transformations
  • Data-centric analysis
  • Focus on data motion, which is closer to source
    code-level structures

19
Current Project
  • Memory-based replacement
  • Perform all computation in double precision
  • Save storage space by storing single-precision
    values in some cases
  • Implementation
  • Register-based computation remains
    double-precision
  • Replace movement instructions (movsd)
  • Memory to register: check and upcast
  • Register to memory: downcast if configured
  • Searching for replaceable writes instead of
    computes
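A minimal model of memory-based replacement, assuming stored singles reuse the flagged 64-bit encoding from the Implementation slide (the slides do not say whether the same flag is used here). The Memory class and its methods are hypothetical; the `downcast` argument stands in for whether a given write instruction is configured for replacement.

```python
import struct

FLAG = 0x7FF4DEAD  # assumed marker for a slot that holds a single

class Memory:
    """Toy memory of 64-bit slots; registers always hold doubles."""
    def __init__(self):
        self.cells = {}

    def store(self, address, value, downcast):
        """Register to memory: downcast only if this write is configured."""
        if downcast:
            low = struct.unpack("<I", struct.pack("<f", value))[0]
            self.cells[address] = (FLAG << 32) | low
        else:
            self.cells[address] = struct.unpack("<Q", struct.pack("<d", value))[0]

    def load(self, address):
        """Memory to register: check the flag and upcast to double if set."""
        slot = self.cells[address]
        if slot >> 32 == FLAG:
            return struct.unpack("<f", struct.pack("<I", slot & 0xFFFFFFFF))[0]
        return struct.unpack("<d", struct.pack("<Q", slot))[0]

mem = Memory()
mem.store(0, 0.1, downcast=True)    # this write is replaced
mem.store(8, 0.1, downcast=False)   # this write stays double
total = mem.load(0) + mem.load(8)   # arithmetic itself stays in double precision
```

Only the stored value loses precision; every arithmetic operation still sees double-precision operands in registers.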

20
Preliminary Results
Benchmark     Candidates  Writes Replaced
(name.CLASS)              % Static   % Dynamic
cg.W          284         95.4       77.5
ep.W          226         96.0       28.4
ft.W          452         94.2       45.0
lu.W          1,782       68.3       81.3
mg.W          558         96.2       86.4
sp.W          1,607       80.7       84.7
All benchmarks were single core versions compiled
by the Intel Fortran compiler with optimization
enabled. Tests were performed on an Intel
workstation with 48GB of RAM running 64-bit Linux.
28
Future Work
  • Case studies
  • Search convergence study

29
Conclusion
Automated binary instrumentation techniques can
be used to implement mixed-precision
configurations for floating point code, and
memory-based replacement provides actionable
results.
30
Thank you!
sf.net/p/crafthpc