Title: Floating-Point Correctness Analysis at the Binary Level
1. Floating Point Analysis Using Dyninst
Mike Lam, University of Maryland, College Park
Jeff Hollingsworth, Advisor
2. Background
- Floating point represents real numbers as (-1)^sgn x f x 2^exp
  - Sign bit (sgn)
  - Exponent (exp)
  - Significand f (mantissa or fraction)
- Finite precision
  - Single precision: 24-bit significand (~7 decimal digits)
  - Double precision: 53-bit significand (~16 decimal digits)
[Figure: IEEE 754 bit layouts. Single precision: 1 sign bit, 8-bit exponent, 23-bit significand. Double precision: 1 sign bit, 11-bit exponent, 52-bit significand.]
3. Motivation
- Finite precision causes round-off error
  - Compromises certain calculations
  - Hard to detect and diagnose
- Increasingly important as HPC scales
  - Computation on streaming processors is faster in single precision
  - Data movement in double precision is a bottleneck
  - Need to balance speed (singles) and accuracy (doubles)
4. Our Goal
Automated analysis techniques to inform developers about floating-point behavior and make recommendations regarding the use of floating-point arithmetic.
5. Framework
- CRAFT: Configurable Runtime Analysis for Floating-point Tuning
- Static binary instrumentation
  - Read configuration settings
  - Replace floating-point instructions with new code
  - Rewrite the modified binary
- Dynamic analysis
  - Run the modified program on a representative data set
  - Produce results and recommendations
6. Previous Work
- Cancellation detection
  - Reports loss of precision due to subtraction
  - Paper appeared in WHIST '11
- Range tracking
  - Reports min/max values
- Replacement
  - Implements mixed-precision configurations
  - Paper to appear in ICS '13
7. Mixed Precision
- Use double precision where necessary
- Use single precision everywhere else
- Can be difficult to implement

Mixed-precision linear solver algorithm (steps 5 and 8, shown in red on the original slide, are performed in double precision; all other steps are single precision):

1:  LU <- PA
2:  solve Ly = Pb
3:  solve Ux0 = y
4:  for k = 1, 2, ... do
5:    rk <- b - A*xk-1
6:    solve Ly = P*rk
7:    solve U*zk = y
8:    xk <- xk-1 + zk
9:    check for convergence
10: end for
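The interplay of the two precisions in the algorithm above can be sketched in C for a toy 2x2 system. Everything here (the Cramer's-rule "solver", the function names) is an illustrative stand-in, not CRAFT code: the solves run in single precision, while the residual and the solution update stay in double.

```c
#include <math.h>

/* Toy mixed-precision iterative refinement for a 2x2 system A x = b.
   The solves run in single precision (the cheap part); the residual
   r_k = b - A x_{k-1} and update x_k = x_{k-1} + z_k stay in double. */

static void solve2_single(const float A[2][2], const float b[2], float x[2]) {
    /* Cramer's rule stands in for the L/U triangular solves. */
    float det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    x[0] = (b[0] * A[1][1] - b[1] * A[0][1]) / det;
    x[1] = (A[0][0] * b[1] - A[1][0] * b[0]) / det;
}

void mixed_refine(const double A[2][2], const double b[2], double x[2]) {
    float Af[2][2] = {{(float)A[0][0], (float)A[0][1]},
                      {(float)A[1][0], (float)A[1][1]}};
    float bf[2] = {(float)b[0], (float)b[1]};
    float xf[2];
    solve2_single(Af, bf, xf);              /* x_0 from the single solve */
    x[0] = xf[0];
    x[1] = xf[1];

    for (int k = 1; k <= 20; k++) {
        double r[2];                        /* r_k = b - A x_{k-1} (double) */
        for (int i = 0; i < 2; i++)
            r[i] = b[i] - (A[i][0] * x[0] + A[i][1] * x[1]);
        if (fabs(r[0]) + fabs(r[1]) < 1e-14)
            break;                          /* check for convergence */
        float rf[2] = {(float)r[0], (float)r[1]};
        float zf[2];
        solve2_single(Af, rf, zf);          /* correction z_k in single */
        x[0] += zf[0];                      /* x_k = x_{k-1} + z_k */
        x[1] += zf[1];
    }
}
```

Each pass shrinks the error by roughly the single-precision unit round-off, so a handful of cheap single-precision solves recovers a double-precision answer.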
8. Configuration
9. Implementation
- In-place replacement
  - Narrowed focus: doubles -> singles
  - In-place downcast conversion
  - Flag in the high bits to indicate replacement

[Figure: a 64-bit double is downcast in place. The replaced double carries the flag 0x7FF4DEAD (a non-signalling NaN pattern) in its high 32 bits and the single-precision value in its low 32 bits.]
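A minimal C sketch of this bit-level trick, assuming the 0x7FF4DEAD flag shown above; the helper names are illustrative, not CRAFT's API:

```c
#include <stdint.h>
#include <string.h>

/* A replaced double keeps the flag 0x7FF4DEAD in its high 32 bits
   (a NaN bit pattern, so it cannot collide with a normal double)
   and the single-precision value in its low 32 bits. */

#define REPLACED_FLAG 0x7FF4DEADu

/* Downcast a double in place, tagging the slot as replaced. */
void replace_double(double *slot) {
    float f = (float)*slot;
    uint32_t fbits;
    memcpy(&fbits, &f, sizeof fbits);
    uint64_t bits = ((uint64_t)REPLACED_FLAG << 32) | fbits;
    memcpy(slot, &bits, sizeof bits);
}

/* Has this slot been replaced? */
int is_replaced(const double *slot) {
    uint64_t bits;
    memcpy(&bits, slot, sizeof bits);
    return (uint32_t)(bits >> 32) == REPLACED_FLAG;
}

/* Read a slot, widening the stored single if it was replaced. */
double read_slot(const double *slot) {
    uint64_t bits;
    memcpy(&bits, slot, sizeof bits);
    if ((uint32_t)(bits >> 32) == REPLACED_FLAG) {
        uint32_t fbits = (uint32_t)bits;
        float f;
        memcpy(&f, &fbits, sizeof f);
        return (double)f;
    }
    return *slot;
}
```

Because the flag lives in the NaN space, instrumented code can check any 64-bit operand at runtime and decide whether to operate on it as a double or as a boxed single.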
10. Example
Source: gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1: movsd 0x601e38(rax,rbx,8) -> xmm0
2: mulsd -0x78(rsp) -> xmm0
3: addsd -0x4f02(rip) -> xmm0
4: movsd xmm0 -> 0x601e38(rax,rbx,8)
11. Example
Source: gvec[i,j] = gvec[i,j] * lvec[3] + gvar

1: movsd 0x601e38(rax,rbx,8) -> xmm0
   check/replace -0x78(rsp) and xmm0
2: mulss -0x78(rsp) -> xmm0
   check/replace -0x4f02(rip) and xmm0
3: addss -0x20dd43(rip) -> xmm0
4: movsd xmm0 -> 0x601e38(rax,rbx,8)
12. Block Editing (PatchAPI)
[Figure: block-editing diagram. The block containing the original instruction splits, and the rewriter inserts initialization, a double-to-single conversion, the check/replace code, and cleanup.]
13. Automated Search
- Manual mixed-precision analysis
  - Hard to use without intuition regarding potential replacements
- Automatic mixed-precision analysis
  - Try lots of configurations (empirical auto-tuning)
  - Test with a user-defined verification routine and data set
  - Exploit program control structure: replace larger structures (modules, functions) first
  - If coarse-grained replacements fail, try finer-grained subcomponent replacements
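The coarse-to-fine descent described above can be sketched as a recursion over program structure. The Component type and its verification flag are hypothetical stand-ins for a real trial run of a candidate configuration:

```c
#include <stddef.h>

/* Try to replace an entire component (module, function, block) first;
   only when that configuration fails verification, descend into its
   subcomponents and retry at a finer granularity. */

typedef struct Component {
    const char *name;
    struct Component *children;
    size_t nchildren;
    int passes_verification;  /* outcome of the trial run with this
                                 component replaced wholesale */
} Component;

/* Returns how many (sub)components ended up replaced. */
size_t search(const Component *c) {
    if (c->passes_verification)
        return 1;                             /* coarse replacement held */
    size_t replaced = 0;
    for (size_t i = 0; i < c->nchildren; i++)
        replaced += search(&c->children[i]);  /* finer-grained retry */
    return replaced;
}
```

Starting coarse prunes the configuration space quickly: one passing trial at the module level removes all of that module's descendants from further testing.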
14. System Overview
15. NAS Results

Benchmark (name.CLASS) | Candidates | Configurations Tested | Static Instructions Replaced (%) | Dynamic Instructions Replaced (%)
bt.W | 6,647 | 3,854 | 76.2 | 85.7
bt.A | 6,682 | 3,832 | 75.9 | 81.6
cg.W | 940 | 270 | 93.7 | 6.4
cg.A | 934 | 229 | 94.7 | 5.3
ep.W | 397 | 112 | 93.7 | 30.7
ep.A | 397 | 113 | 93.1 | 23.9
ft.W | 422 | 72 | 84.4 | 0.3
ft.A | 422 | 73 | 93.6 | 0.2
lu.W | 5,957 | 3,769 | 73.7 | 65.5
lu.A | 5,929 | 2,814 | 80.4 | 69.4
mg.W | 1,351 | 458 | 84.4 | 28.0
mg.A | 1,351 | 456 | 84.1 | 24.4
sp.W | 4,772 | 5,729 | 36.9 | 45.8
sp.A | 4,821 | 5,044 | 51.9 | 43.0
16. AMGmk Results
- Algebraic MultiGrid microkernel
- Multigrid method is highly adaptive
  - Good candidate for replacement
- Automatic search
  - Complete conversion (100% replacement)
- Manually-rewritten version
  - Speedup: 175 sec to 95 sec (1.8X)
  - Conventional x86_64 hardware
17. SuperLU Results
- Package for LU decomposition and linear solves
- Reports final error residual
- Both single- and double-precision versions
- Verified manual conversion via automatic search
  - Used the error from the provided single-precision version as threshold
  - Final config matched the single-precision profile (99.9% replacement)

Threshold | Static Instructions Replaced (%) | Dynamic Instructions Replaced (%) | Final Error
1.0e-03 | 99.1 | 99.9 | 1.59e-04
1.0e-04 | 94.1 | 87.3 | 4.42e-05
7.5e-05 | 91.3 | 52.5 | 4.40e-05
5.0e-05 | 87.9 | 45.2 | 3.00e-05
2.5e-05 | 80.3 | 26.6 | 1.69e-05
1.0e-05 | 75.4 | 1.6 | 7.15e-07
1.0e-06 | 72.6 | 1.6 | 4.7e-07
18. Retrospective
- Twofold original motivation
  - Faster computation (raw FLOPs)
  - Decreased storage footprint and memory bandwidth
- Domains vary in sensitivity to these parameters
- Computation-centric analysis
  - Less insight for memory-constrained domains
  - Sometimes difficult to translate instruction-level recommendations to source-code-level transformations
- Data-centric analysis
  - Focus on data motion, which is closer to source-code-level structures
19. Current Project
- Memory-based replacement
  - Perform all computation in double precision
  - Save storage space by storing single-precision values in some cases
- Implementation
  - Register-based computation remains double-precision
  - Replace movement instructions (movsd)
  - Memory to register: check and upcast
  - Register to memory: downcast if configured
- Searching for replaceable writes instead of computes
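The memory-based scheme can be sketched in a few lines of C: arithmetic stays in double precision, but a configured slot stores only a single-precision value. Here store_narrow models the register-to-memory downcast and load_narrow the memory-to-register upcast; the type and names are illustrative, not CRAFT's.

```c
/* Computation stays in double precision; only the memory image of a
   configured value is narrowed, halving its storage footprint. */

typedef struct { float f; } narrow_slot;  /* hypothetical replaced storage */

void store_narrow(narrow_slot *slot, double v) {
    slot->f = (float)v;          /* downcast on the store path */
}

double load_narrow(const narrow_slot *slot) {
    return (double)slot->f;      /* upcast on the load path */
}
```

Compared with the in-place NaN-flag scheme, this trades per-value rounding only at stores for halved memory traffic on the replaced data.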
20. Preliminary Results

Benchmark (name.CLASS) | Candidates | Static Writes Replaced (%) | Dynamic Writes Replaced (%)
cg.W | 284 | 95.4 | 77.5
ep.W | 226 | 96.0 | 28.4
ft.W | 452 | 94.2 | 45.0
lu.W | 1,782 | 68.3 | 81.3
mg.W | 558 | 96.2 | 86.4
sp.W | 1,607 | 80.7 | 84.7

All benchmarks were single-core versions compiled by the Intel Fortran compiler with optimization enabled. Tests were performed on an Intel workstation with 48 GB of RAM running 64-bit Linux.
28. Future Work
- Case studies
- Search convergence study
29. Conclusion
Automated binary instrumentation techniques can be used to implement mixed-precision configurations for floating-point code, and memory-based replacement provides actionable results.
30. Thank you!
sf.net/p/crafthpc