Title: Floating-Point Correctness Analysis at the Binary Level
1. Floating Point Analysis Using Dyninst
Mike Lam, University of Maryland, College Park
Jeff Hollingsworth, Advisor
2. Background
- Floating point represents real numbers as (-1)^sgn x f x 2^exp
  - Sign bit (sgn)
  - Exponent (exp)
  - Significand f (mantissa or fraction)
- Finite precision
  - Single precision: 24-bit significand (~7 decimal digits)
  - Double precision: 53-bit significand (~16 decimal digits)
[Figure: IEEE 754 bit layouts. Single precision: 1 sign bit, 8-bit exponent, 23-bit significand. Double precision: 1 sign bit, 11-bit exponent, 52-bit significand.]
3. Motivation
- Finite precision causes round-off error
  - Compromises certain calculations
  - Hard to detect and diagnose
- Increasingly important as HPC scales
  - Computation on streaming processors is faster in single precision
  - Data movement in double precision is a bottleneck
  - Need to balance speed (singles) and accuracy (doubles)
4. Our Goal
Automated analysis techniques to inform developers about floating-point behavior and make recommendations regarding the use of floating-point arithmetic.
5. Framework
- CRAFT: Configurable Runtime Analysis for Floating-point Tuning
- Static binary instrumentation
  - Read configuration settings
  - Replace floating-point instructions with new code
  - Rewrite the modified binary
- Dynamic analysis
  - Run the modified program on a representative data set
  - Produce results and recommendations
6. Previous Work
- Cancellation detection
  - Reports loss of precision due to subtraction
  - Paper appeared in WHIST '11
- Range tracking
  - Reports min/max values
- Replacement
  - Implements mixed-precision configurations
  - Paper to appear in ICS '13
7. Mixed Precision
- Use double precision where necessary
- Use single precision everywhere else
- Can be difficult to implement

Mixed-precision linear solver algorithm (steps 5 and 8, shown in red on the original slide, are performed in double precision; all other steps are single precision):

1:  LU <- PA
2:  solve Ly = Pb
3:  solve Ux0 = y
4:  for k = 1, 2, ... do
5:    rk <- b - A*xk-1
6:    solve Ly = P*rk
7:    solve U*zk = y
8:    xk <- xk-1 + zk
9:    check for convergence
10: end for
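The interplay of the two precisions in the algorithm above can be sketched in C for a toy 2x2 system. Everything here (the Cramer's-rule "solver", the function names) is an illustrative stand-in, not CRAFT code: the solves run in single precision, while the residual and the solution update stay in double.

```c
#include <math.h>

/* Toy mixed-precision iterative refinement for a 2x2 system A x = b.
   The solves run in single precision (the cheap part); the residual
   r_k = b - A x_{k-1} and update x_k = x_{k-1} + z_k stay in double. */

static void solve2_single(const float A[2][2], const float b[2], float x[2]) {
    /* Cramer's rule stands in for the L/U triangular solves. */
    float det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    x[0] = (b[0] * A[1][1] - b[1] * A[0][1]) / det;
    x[1] = (A[0][0] * b[1] - A[1][0] * b[0]) / det;
}

void mixed_refine(const double A[2][2], const double b[2], double x[2]) {
    float Af[2][2] = {{(float)A[0][0], (float)A[0][1]},
                      {(float)A[1][0], (float)A[1][1]}};
    float bf[2] = {(float)b[0], (float)b[1]};
    float xf[2];
    solve2_single(Af, bf, xf);              /* x_0 from the single solve */
    x[0] = xf[0];
    x[1] = xf[1];

    for (int k = 1; k <= 20; k++) {
        double r[2];                        /* r_k = b - A x_{k-1} (double) */
        for (int i = 0; i < 2; i++)
            r[i] = b[i] - (A[i][0] * x[0] + A[i][1] * x[1]);
        if (fabs(r[0]) + fabs(r[1]) < 1e-14)
            break;                          /* check for convergence */
        float rf[2] = {(float)r[0], (float)r[1]};
        float zf[2];
        solve2_single(Af, rf, zf);          /* correction z_k in single */
        x[0] += zf[0];                      /* x_k = x_{k-1} + z_k */
        x[1] += zf[1];
    }
}
```

Each pass shrinks the error by roughly the single-precision unit round-off, so a handful of cheap single-precision solves recovers a double-precision answer.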
8. Configuration
9. Implementation
- In-place replacement
  - Narrowed focus: doubles -> singles
  - In-place downcast conversion
  - Flag in the high bits to indicate replacement

[Figure: a 64-bit double is downcast in place. The replaced double carries the flag 0x7FF4DEAD (a non-signalling NaN pattern) in its high 32 bits and the single-precision value in its low 32 bits.]
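A minimal C sketch of this bit-level trick, assuming the 0x7FF4DEAD flag shown above; the helper names are illustrative, not CRAFT's API:

```c
#include <stdint.h>
#include <string.h>

/* A replaced double keeps the flag 0x7FF4DEAD in its high 32 bits
   (a NaN bit pattern, so it cannot collide with a normal double)
   and the single-precision value in its low 32 bits. */

#define REPLACED_FLAG 0x7FF4DEADu

/* Downcast a double in place, tagging the slot as replaced. */
void replace_double(double *slot) {
    float f = (float)*slot;
    uint32_t fbits;
    memcpy(&fbits, &f, sizeof fbits);
    uint64_t bits = ((uint64_t)REPLACED_FLAG << 32) | fbits;
    memcpy(slot, &bits, sizeof bits);
}

/* Has this slot been replaced? */
int is_replaced(const double *slot) {
    uint64_t bits;
    memcpy(&bits, slot, sizeof bits);
    return (uint32_t)(bits >> 32) == REPLACED_FLAG;
}

/* Read a slot, widening the stored single if it was replaced. */
double read_slot(const double *slot) {
    uint64_t bits;
    memcpy(&bits, slot, sizeof bits);
    if ((uint32_t)(bits >> 32) == REPLACED_FLAG) {
        uint32_t fbits = (uint32_t)bits;
        float f;
        memcpy(&f, &fbits, sizeof f);
        return (double)f;
    }
    return *slot;
}
```

Because the flag lives in the NaN space, instrumented code can check any 64-bit operand at runtime and decide whether to operate on it as a double or as a boxed single.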
10. Example
Source: gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1: movsd 0x601e38(rax,rbx,8) -> xmm0
2: mulsd -0x78(rsp) -> xmm0
3: addsd -0x4f02(rip) -> xmm0
4: movsd xmm0 -> 0x601e38(rax,rbx,8)
11. Example
Source: gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1: movsd 0x601e38(rax,rbx,8) -> xmm0
   check/replace -0x78(rsp) and xmm0
2: mulss -0x78(rsp) -> xmm0
   check/replace -0x4f02(rip) and xmm0
3: addss -0x20dd43(rip) -> xmm0
4: movsd xmm0 -> 0x601e38(rax,rbx,8)
12. Block Editing (PatchAPI)
[Figure: block-editing diagram. The block containing the original instruction splits, and the rewriter inserts initialization, a double-to-single conversion, the check/replace code, and cleanup.]
13. Automated Search
- Manual mixed-precision analysis
  - Hard to use without intuition regarding potential replacements
- Automatic mixed-precision analysis
  - Try lots of configurations (empirical auto-tuning)
  - Test with a user-defined verification routine and data set
  - Exploit program control structure: replace larger structures (modules, functions) first
  - If coarse-grained replacements fail, try finer-grained subcomponent replacements
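The coarse-to-fine descent described above can be sketched as a recursion over program structure. The Component type and its verification flag are hypothetical stand-ins for a real trial run of a candidate configuration:

```c
#include <stddef.h>

/* Try to replace an entire component (module, function, block) first;
   only when that configuration fails verification, descend into its
   subcomponents and retry at a finer granularity. */

typedef struct Component {
    const char *name;
    struct Component *children;
    size_t nchildren;
    int passes_verification;  /* outcome of the trial run with this
                                 component replaced wholesale */
} Component;

/* Returns how many (sub)components ended up replaced. */
size_t search(const Component *c) {
    if (c->passes_verification)
        return 1;                             /* coarse replacement held */
    size_t replaced = 0;
    for (size_t i = 0; i < c->nchildren; i++)
        replaced += search(&c->children[i]);  /* finer-grained retry */
    return replaced;
}
```

Starting coarse prunes the configuration space quickly: one passing trial at the module level removes all of that module's descendants from further testing.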
14. System Overview
15. NAS Results

Benchmark (name.CLASS) | Candidates | Configurations Tested | Static Instructions Replaced (%) | Dynamic Instructions Replaced (%)
bt.W | 6,647 | 3,854 | 76.2 | 85.7
bt.A | 6,682 | 3,832 | 75.9 | 81.6
cg.W | 940 | 270 | 93.7 | 6.4
cg.A | 934 | 229 | 94.7 | 5.3
ep.W | 397 | 112 | 93.7 | 30.7
ep.A | 397 | 113 | 93.1 | 23.9
ft.W | 422 | 72 | 84.4 | 0.3
ft.A | 422 | 73 | 93.6 | 0.2
lu.W | 5,957 | 3,769 | 73.7 | 65.5
lu.A | 5,929 | 2,814 | 80.4 | 69.4
mg.W | 1,351 | 458 | 84.4 | 28.0
mg.A | 1,351 | 456 | 84.1 | 24.4
sp.W | 4,772 | 5,729 | 36.9 | 45.8
sp.A | 4,821 | 5,044 | 51.9 | 43.0
16. AMGmk Results
- Algebraic MultiGrid microkernel
- Multigrid method is highly adaptive
  - Good candidate for replacement
- Automatic search
  - Complete conversion (100% replacement)
- Manually-rewritten version
  - Speedup: 175 sec to 95 sec (1.8X)
  - Conventional x86_64 hardware
17. SuperLU Results
- Package for LU decomposition and linear solves
- Reports final error residual
- Both single- and double-precision versions
- Verified manual conversion via automatic search
  - Used the error from the provided single-precision version as threshold
  - Final config matched the single-precision profile (99.9% replacement)

Threshold | Static Instructions Replaced (%) | Dynamic Instructions Replaced (%) | Final Error
1.0e-03 | 99.1 | 99.9 | 1.59e-04
1.0e-04 | 94.1 | 87.3 | 4.42e-05
7.5e-05 | 91.3 | 52.5 | 4.40e-05
5.0e-05 | 87.9 | 45.2 | 3.00e-05
2.5e-05 | 80.3 | 26.6 | 1.69e-05
1.0e-05 | 75.4 | 1.6 | 7.15e-07
1.0e-06 | 72.6 | 1.6 | 4.7e-07
18. Retrospective
- Twofold original motivation
  - Faster computation (raw FLOPs)
  - Decreased storage footprint and memory bandwidth
- Domains vary in sensitivity to these parameters
- Computation-centric analysis
  - Less insight for memory-constrained domains
  - Sometimes difficult to translate instruction-level recommendations to source-code-level transformations
- Data-centric analysis
  - Focus on data motion, which is closer to source-code-level structures
19. Current Project
- Memory-based replacement
  - Perform all computation in double precision
  - Save storage space by storing single-precision values in some cases
- Implementation
  - Register-based computation remains double-precision
  - Replace movement instructions (movsd)
  - Memory to register: check and upcast
  - Register to memory: downcast if configured
- Searching for replaceable writes instead of computes
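The memory-based scheme can be sketched in a few lines of C: arithmetic stays in double precision, but a configured slot stores only a single-precision value. Here store_narrow models the register-to-memory downcast and load_narrow the memory-to-register upcast; the type and names are illustrative, not CRAFT's.

```c
/* Computation stays in double precision; only the memory image of a
   configured value is narrowed, halving its storage footprint. */

typedef struct { float f; } narrow_slot;  /* hypothetical replaced storage */

void store_narrow(narrow_slot *slot, double v) {
    slot->f = (float)v;          /* downcast on the store path */
}

double load_narrow(const narrow_slot *slot) {
    return (double)slot->f;      /* upcast on the load path */
}
```

Compared with the in-place NaN-flag scheme, this trades per-value rounding only at stores for halved memory traffic on the replaced data.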
20. Preliminary Results

Benchmark (name.CLASS) | Candidates | Static Writes Replaced (%) | Dynamic Writes Replaced (%)
cg.W | 284 | 95.4 | 77.5
ep.W | 226 | 96.0 | 28.4
ft.W | 452 | 94.2 | 45.0
lu.W | 1,782 | 68.3 | 81.3
mg.W | 558 | 96.2 | 86.4
sp.W | 1,607 | 80.7 | 84.7

All benchmarks were single-core versions compiled by the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48 GB of RAM running 64-bit Linux.
28. Future Work
- Case studies
- Search convergence study
29. Conclusion
Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating-point code, and memory-based replacement provides actionable results.
30. Thank you!
sf.net/p/crafthpc