Title: Incorporating Domain-Specific Information into the Compilation Process
1Incorporating Domain-Specific Information into
the Compilation Process
- Samuel Z. Guyer
- Supervisor: Calvin Lin
- April 14, 2003
2Motivation
- Two different views of software
- Compiler's view
- Abstractions: numbers, pointers, loops
- Operators: +, -, *, ->
- Programmer's view
- Abstractions: files, matrices, locks, graphics
- Operators: read, factor, lock, draw
- This discrepancy is a problem...
3Find the error part 1
- Example
- Error: case outside of switch statement
- Part of the language definition
- Error reported at compile time
- Compiler indicates the location and nature of the error
    switch (var_83) {
      case 0: func_24(); break;
      case 1: func_29(); break;
    }
      case 2: func_78();    /* error */
4Find the error part 2
- Example
- Improper call to libfunc_38
- Syntax is correct: no compiler message
- Fails at run-time
- Problem: what does libfunc_38 do?
- This is how compilers view reusables
    struct __sue_23 *var_72;
    char var_81[100];
    var_72 = libfunc_84(__str_14, __str_65);
    libfunc_44(var_72);
    libfunc_38(var_81, 100, 1, var_72);    /* error */
5Find the error part 3
- Example
- Improper call to fread() after fclose()
- The names reveal the mistake
- No traditional compiler reports this error
- Run-time system: how does the code fail?
- Code review: rarely this easy to spot
    FILE *my_file;
    char buffer[100];
    my_file = fopen("my_data", "r");
    fclose(my_file);
    fread(buffer, 100, 1, my_file);    /* error */
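To make the fread-after-fclose mistake concrete, here is a minimal sketch of the kind of library-level state a checker could track for one file handle. It is not Broadway's implementation; the event and state names (`file_event`, `file_state`, `first_read_on_closed`) are hypothetical, and the sketch only handles straight-line code.

```c
#include <stddef.h>

/* Hypothetical event kinds for one file handle (illustration only). */
enum file_event { EV_OPEN, EV_CLOSE, EV_READ };

/* Abstract state of the handle, as a checker might track it. */
enum file_state { ST_CLOSED, ST_OPEN };

/*
 * Walks a straight-line sequence of events and returns the index of the
 * first read that happens while the handle is closed, or -1 if none.
 */
int first_read_on_closed(const enum file_event *events, size_t n)
{
    enum file_state state = ST_CLOSED;   /* no handle before the first open */
    for (size_t i = 0; i < n; i++) {
        switch (events[i]) {
        case EV_OPEN:
            state = ST_OPEN;
            break;
        case EV_CLOSE:
            state = ST_CLOSED;
            break;
        case EV_READ:
            if (state == ST_CLOSED)
                return (int)i;           /* read on a closed handle */
            break;
        }
    }
    return -1;
}
```

For the slide's sequence (open, close, read) the function flags the third event; swapping the close and the read makes the sequence pass.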
6Problem
- Compilers are unaware of library semantics
- Library calls have no special meaning
- The compiler cannot provide any assistance
- Burden is on the programmer
- Use library routines correctly
- Use library routines efficiently and effectively
- These are difficult manual tasks
- Tedious and error-prone
- Can require considerable expertise
7Solution
- A library-level compiler
- Compiler support for software libraries
- Treat library routines more like built-in operators
- Compile at the library interface level
- Check programs for library-level errors
- Improve performance with library-level optimizations
- Key: libraries represent domains
- Capture domain-specific semantics and expertise
- Encode in a form that the compiler can use
8The Broadway Compiler
- Broadway: source-to-source C compiler
- Domain-independent compiler mechanisms
- Annotations: lightweight specification language
- Domain-specific analyses and transformations
- Many libraries, one compiler
9Benefits
- Improves capabilities of the compiler
- Adds many new error checks and optimizations
- Qualitatively different
- Works with existing systems
- Domain-specific compilation without recoding
- For us: more thorough and convincing validation
- Improve productivity
- Less time spent on manual tasks
- All users benefit from one set of annotations
10Outline
- Motivation
- The Broadway Compiler
- Recent work on scalable program analysis
- Problem: error checking demands powerful analysis
- Solution: client-driven analysis algorithm
- Example: detecting security vulnerabilities
- Contributions
- Related work
- Conclusions and future work
11Security vulnerabilities
- How does remote hacking work?
- Most are not direct attacks (e.g., cracking passwords)
- Idea: trick a program into unintended behavior
- Automated vulnerability detection
- How do we define "intended"?
- Difficult to formalize and check application logic
- Libraries control all critical system services
- Communication, file access, process control
- Analyze routines to approximate vulnerability
12Remote access vulnerability
- Example
- Vulnerability: executes any remote command
- What if this program runs as root?
- Clearly domain-specific: sockets, processes, etc.
- Requirement: data from an Internet socket should not specify a program to execute
- Why is detecting this vulnerability hard?
    int sock;
    char buffer[100];
    sock = socket(AF_INET, SOCK_STREAM, 0);
    read(sock, buffer, 100);
    execl(buffer);    /* vulnerability */
13Challenge 1 Pointers
- Example
- Still contains a vulnerability
- Only one buffer
- Variables buffer and ref are aliases
- We need an accurate model of memory
    int sock;
    char buffer[100];
    char *ref = buffer;
    sock = socket(AF_INET, SOCK_STREAM, 0);
    read(sock, buffer, 100);
    execl(ref);    /* vulnerability */
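The aliasing point can be illustrated with a small sketch in which taint is attached to memory objects rather than to variable names. This is an assumed, simplified model and not the analyzer's actual memory model; `memory_model` and the `model_*` functions are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical abstract memory objects and pointer variables. */
enum object   { OBJ_BUFFER, NUM_OBJECTS };
enum variable { VAR_BUFFER, VAR_REF, NUM_VARIABLES };

struct memory_model {
    enum object points_to[NUM_VARIABLES]; /* target object of each variable */
    bool tainted[NUM_OBJECTS];            /* does the object hold remote data? */
};

/* Sets up the slide's situation: buffer and ref both name the same object. */
void model_init(struct memory_model *m)
{
    m->points_to[VAR_BUFFER] = OBJ_BUFFER;
    m->points_to[VAR_REF]    = OBJ_BUFFER;   /* char *ref = buffer; */
    m->tainted[OBJ_BUFFER]   = false;
}

/* Models read(sock, var, ...): the object that var points to becomes tainted. */
void model_read_from_socket(struct memory_model *m, enum variable var)
{
    m->tainted[m->points_to[var]] = true;
}

/* Models the check at execl(var): is the object behind var tainted? */
bool model_is_tainted(const struct memory_model *m, enum variable var)
{
    return m->tainted[m->points_to[var]];
}
```

Because both variables resolve to the same object, tainting through `VAR_BUFFER` is visible through `VAR_REF`; a checker that tracked taint per variable name would miss it.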
14Challenge 2 Scope
- Call graph
- Objects flow throughout program
- No scoping constraints
- Objects referenced through pointers
- We need whole-program analysis
[Figure: call graph in which sock = socket(AF_INET, SOCK_STREAM, 0), read(sock, buffer, 100), and execl(ref) are called from different procedures]
15Challenge 3 Precision
- Static analysis is always an approximation
- Precision: level of detail or sensitivity
- Multiple calls to a procedure
- Context-sensitive: analyze each call separately
- Context-insensitive: merge information from all calls
- Multiple assignments to a variable
- Flow-sensitive: record each value separately
- Flow-insensitive: merge values from all assignments
- Lower precision reduces the cost of analysis
- Exponential -> polynomial -> linear
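The difference between flow-sensitive and flow-insensitive treatment of one variable can be sketched as follows. This is a toy model for straight-line code only, under the assumption of a two-point taint domain; the names `taint`, `flow_insensitive`, and `flow_sensitive` are hypothetical.

```c
#include <stddef.h>

enum taint { CLEAN = 0, TAINTED = 1 };

/* Join of two facts: a value is tainted if either input may be tainted. */
enum taint taint_join(enum taint a, enum taint b)
{
    return (a == TAINTED || b == TAINTED) ? TAINTED : CLEAN;
}

/* Flow-insensitive: one fact per variable, merged over all n assignments. */
enum taint flow_insensitive(const enum taint *assigned, size_t n)
{
    enum taint result = CLEAN;
    for (size_t i = 0; i < n; i++)
        result = taint_join(result, assigned[i]);
    return result;
}

/*
 * Flow-sensitive (straight-line code only): the fact at a use is the value
 * of the most recent assignment. `use` is the number of assignments that
 * precede the use and must be between 1 and n.
 */
enum taint flow_sensitive(const enum taint *assigned, size_t n, size_t use)
{
    if (use == 0 || use > n)
        return TAINTED;            /* conservative answer for bad input */
    return assigned[use - 1];
}
```

If a variable is first assigned tainted data and then overwritten with clean data, the flow-sensitive answer at a later use is clean, while the flow-insensitive answer stays tainted.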
16Insufficient precision
- Example
- Context-insensitivity
- Information merged at call
- Analyzer reports 2 possible errors
- Only 1 real error
- Imprecision leads to false positives
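A toy model of this effect, assuming a single helper procedure that returns its argument and is called from several call sites. The function names are hypothetical and the model ignores everything except whether each call passes tainted data.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical setup: arg_tainted[i] says whether call site i passes
 * tainted data to the helper. Each function returns how many call sites
 * the analysis would flag as possible errors.
 */

/* Context-insensitive: all arguments are merged into one summary. */
size_t flagged_context_insensitive(const bool *arg_tainted, size_t n)
{
    bool merged = false;
    for (size_t i = 0; i < n; i++)
        merged = merged || arg_tainted[i];
    return merged ? n : 0;   /* every caller sees the merged result */
}

/* Context-sensitive: each call is analyzed with its own argument. */
size_t flagged_context_sensitive(const bool *arg_tainted, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (arg_tainted[i])
            count++;
    return count;
}
```

With two call sites of which only one passes tainted data, the context-insensitive version flags both (one false positive) and the context-sensitive version flags one.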
17Cost versus precision
- Problem: a tradeoff
- Precise analysis: prohibitively expensive
- Cheap analysis: too many false positives
- Idea: mixed-precision analysis
- Focus effort on the parts of the program that matter
- Don't waste time over-analyzing the rest
- Key: let the error detection problem drive precision
- Client-driven program analysis
18Client-Driven Algorithm
- Client: error detection analysis problem
- Algorithm
- Start with fast, cheap analysis; monitor imprecision
- Determine extra precision; reanalyze
[Diagram: Pointer Analyzer, Client Analysis, Memory Model]
19Algorithm components
- Monitor
- Runs alongside main analysis
- Records imprecision
- Adaptor
- Start at the locations of reported errors
- Trace back to the cause and diagnose
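The monitor/adaptor loop can be sketched on a toy model. This is a heavily simplified illustration, not the published algorithm: procedures are reduced to lists of call sites, the only available fix is making a procedure context-sensitive, and the monitor checks every procedure instead of tracing back from reported errors. All names are hypothetical.

```c
#include <stdbool.h>

#define MAX_SITES 4

/* Hypothetical model of a procedure: its call sites and its precision. */
struct procedure {
    int  num_sites;                 /* number of call sites, <= MAX_SITES */
    bool site_tainted[MAX_SITES];   /* does call site i pass tainted data? */
    bool context_sensitive;         /* current precision setting */
};

int count_tainted(const struct procedure *p)
{
    int count = 0;
    for (int i = 0; i < p->num_sites; i++)
        if (p->site_tainted[i])
            count++;
    return count;
}

/* Client analysis: number of call sites reported as possible errors. */
int report_errors(const struct procedure *procs, int n)
{
    int errors = 0;
    for (int i = 0; i < n; i++) {
        int tainted = count_tainted(&procs[i]);
        if (procs[i].context_sensitive)
            errors += tainted;                 /* each call kept separate */
        else if (tainted > 0)
            errors += procs[i].num_sites;      /* merged: all sites flagged */
    }
    return errors;
}

/* Monitor: merging lost information only if clean and tainted calls mixed. */
bool is_polluting(const struct procedure *p)
{
    int tainted = count_tainted(p);
    return !p->context_sensitive && tainted > 0 && tainted < p->num_sites;
}

/*
 * Driver: analyze cheaply, let the adaptor raise precision where the
 * monitor saw pollution, and reanalyze until nothing changes.
 * Returns the final error count; *num_refined counts refined procedures.
 */
int client_driven(struct procedure *procs, int n, int *num_refined)
{
    *num_refined = 0;
    for (;;) {
        int errors = report_errors(procs, n);
        bool changed = false;
        for (int i = 0; i < n; i++) {
            if (is_polluting(&procs[i])) {
                procs[i].context_sensitive = true;   /* adaptor's fix */
                (*num_refined)++;
                changed = true;
            }
        }
        if (!changed)
            return errors;
    }
}
```

In this model, only procedures whose call sites mix clean and tainted data are refined; procedures that are uniformly clean or uniformly tainted stay at the cheap setting, which is the intuition behind adding precision only where the client needs it.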
20Sources of imprecision
21In action...
- Monitor analysis
- Polluting assignments
- Diagnose and apply fix
- In this case: one procedure context-sensitive
- Reanalyze
22Methodology
- Compare with commonly used fixed-precision algorithms
- Metrics
- Accuracy: number of errors reported
- Includes false positives: fewer is better
- Performance: only when accuracy is the same
CS-FS (full): context-sensitive, flow-sensitive
CS-FI (slow): context-sensitive, flow-insensitive
CI-FS (medium): context-insensitive, flow-sensitive
CI-FI (fast): context-insensitive, flow-insensitive
23Programs
- 18 real C programs
- Unmodified source: all the issues of production code
- Many are system tools: run in privileged mode
- Representative examples
Name       Description    Priv  Lines of code  Procedures  CFG nodes
muh        IRC proxy      yes   5K (25K)       84          5,191
blackhole  E-mail filter  yes   12K (244K)     71          21,370
wu-ftpd    FTP daemon     yes   22K (66K)      205         23,107
named      DNS server     yes   26K (84K)      210         25,452
nn         News reader    no    36K (116K)     494         46,336
24Error detection problems
- File access: files must be open when accessed
- Remote access vulnerability: data from an Internet socket should not specify a program to execute
- Format string vulnerability (FSV): format string may not contain untrusted data
- Remote FSV: check if FSV is remotely exploitable
- FTP behavior: can this program be tricked into reading and transmitting arbitrary files?
25Results
[Chart: normalized performance for the remote access vulnerability problem; axis marks at 10X and 1000X]
26Overall results
- 90 test cases: 18 programs, 5 problems
- Test case: 1 program and 1 error detection problem
- Compare algorithms: client-driven vs. fixed precision
- As accurate as any other algorithm: 87 out of 90
- Runs faster than best fixed algorithm: 64 out of 87
- Performance not an issue: 19 of 23
- Both most accurate and fastest: 29 out of 64
27Why does it work?
Procedures made context-sensitive, by error detection problem
Name Total procedures RA File FSV RFSV FTP
muh 84 6
apache 313 8 2 2 10
blackhole 71 2 5
wu-ftpd 205 4 4 17
named 210 1 2 1 4
cfengine 421 4 1 3 31
nn 494 2 1 1 30
- Validates our hypothesis
- Different errors have different precision requirements
- Amount of extra precision is small
28Outline
- Motivation
- The Broadway Compiler
- Recent work on scalable program analysis
- Problem: error checking demands powerful analysis
- Solution: client-driven analysis algorithm
- Example: detecting security vulnerabilities
- Contributions
- Related work
- Conclusions and future work
29Central contribution
- Library-level compilation
- Opportunity: library interfaces make domains explicit in existing programming practice
- Key: a separate language for codifying domain-specific knowledge
- Result: our compiler can automate previously manual error checks and performance improvements
                          Old way                   Broadway
Knowledge representation  Informal                  Codified
Applying knowledge        Manual                    Compiler
Results                   Difficult, unpredictable  Easy, automatic, reliable
30Specific contributions
- Broadway compiler implementation
- Working system (43K C-Breeze, 23K pointers, 30K Broadway)
- Client-driven pointer analysis algorithm [SAS03]
- Precise and scalable whole-program analysis
- Library-level error checking experiments [CSTR01]
- No false positives for format string vulnerability
- Library-level optimization experiments [LCPC00]
- Solid improvements for PLAPACK programs
- Annotation language [DSL99]
- Balance expressive power and ease of use
31Related work
- Configurable compilers
- Power versus usability: who is the user?
- Active libraries
- Previous work focusing on specific domains
- Few complete, working systems
- Error detection
- Partial program verification: paucity of results
- Scalable pointer analysis
- Many studies of cost/precision tradeoff
- Few mixed-precision approaches
32Future work
- Language
- More analysis capabilities
- Optimization
- We have only scratched the surface
- Error checking
- Resource leaks
- Path sensitivity: conditional transfer functions
- Scalable analysis
- Start with cheaper analysis: unification-based
- Refine to more expensive analysis: shape analysis
34Annotations (I)
- Dependence and pointer information
- Describe pointer structures
- Indicate which objects are accessed and modified
    procedure fopen(pathname, mode)
    {
      on_entry { pathname --> path_string
                 mode --> mode_string }
      access { path_string, mode_string }
      on_exit { return --> new file_stream }
    }
35Annotations (II)
- Library-specific properties
- Dataflow lattices
    property State : { Open, Closed } initially Open
    property Kind : { File, Socket { Local, Remote } }
[Figure: lattice diagrams for State (Open, Closed) and Kind (File, Socket, with Local and Remote below Socket)]
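One common way to represent a small property such as State in an analyzer is as a set of possible values with set union as the merge operation. The sketch below assumes that powerset representation; it is an illustration and not necessarily how Broadway stores its lattices, and the `state_*` names are hypothetical.

```c
#include <stdbool.h>

/* Possible states as bit flags, so a set of states fits in one unsigned. */
enum { STATE_OPEN = 1u << 0, STATE_CLOSED = 1u << 1 };
typedef unsigned state_set;

/* Merge point (e.g. after an if/else): the handle may be in either state. */
state_set state_join(state_set a, state_set b)
{
    return a | b;
}

/* True if the handle might be in the given state on some path. */
bool state_could_be(state_set s, state_set state)
{
    return (s & state) != 0;
}

/* True if the handle is in the given state on every path. */
bool state_is_exactly(state_set s, state_set state)
{
    return s == state;
}
```

If one branch closes a file and the other leaves it open, the joined fact answers "could be Closed" with true, which is the condition an error report for a later read would test.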
36Annotations (III)
- Library routine effects
- Dataflow transfer functions
    procedure socket(domain, type, protocol)
    {
      analyze Kind {
        if (domain == AF_UNIX) IOHandle <- Local
        if (domain == AF_INET) IOHandle <- Remote
      }
      analyze State { IOHandle <- Open }
      on_exit { return --> new IOHandle }
    }
37Annotations (IV)
- Reports and transformations
    procedure execl(path, args)
    {
      on_entry { path --> path_string }
      report if (Kind : path_string could-be Remote)
        "Error at " callsite ": remote access"
    }

    procedure slow_routine(first, second)
    {
      when (condition)
        replace-with { quick_check(first)
                       fast_routine(first, second) }
    }
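The socket transfer function and the execl report condition from the annotation slides can be mimicked in plain C. This is a sketch under several assumptions: the domain constants are hypothetical stand-ins for AF_UNIX and AF_INET (to avoid system headers), Kind is stored as a set of possibilities, and the rule that read() passes the handle's Kind on to the buffer is assumed here because the slides do not show an annotation for read().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the socket domains. */
enum domain { DOMAIN_UNIX, DOMAIN_INET, DOMAIN_UNKNOWN };

/* Kind facts as bit flags; a value may carry more than one possibility. */
enum { KIND_LOCAL = 1u << 0, KIND_REMOTE = 1u << 1 };
typedef unsigned kind_set;

/* Transfer function for socket(): derive the Kind of the new handle. */
kind_set socket_kind(enum domain d)
{
    switch (d) {
    case DOMAIN_UNIX:
        return KIND_LOCAL;
    case DOMAIN_INET:
        return KIND_REMOTE;
    default:
        return KIND_LOCAL | KIND_REMOTE;   /* domain not known statically */
    }
}

/* Assumed transfer function for read(): the buffer inherits the Kind. */
kind_set read_kind(kind_set handle)
{
    return handle;
}

/* Report condition at execl(): could the path string be Remote? */
bool execl_is_suspicious(kind_set path)
{
    return (path & KIND_REMOTE) != 0;
}
```

Chaining the three functions reproduces the remote access example: a handle created with the Internet domain yields a buffer that could be Remote, so the call to execl is reported; a handle created with the local domain is not.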
38Why does it work?
- Validates our hypothesis
- Different clients have different precision requirements
- Amount of extra precision is small
Columns: procedures made context-sensitive (first RA to FTP group), then variables made flow-sensitive (second RA to FTP group)
Name RA File FSV RFSV FTP RA File FSV RFSV FTP
muh 6 0.1 0.07 0.31
apache 8 2 2 10 0.89 0.18 0.91 1.07 0.83
blackhole 2 5 0.24 0.04 0.32
wu-ftpd 4 4 17 0.63 0.09 0.51 0.53 0.23
named 1 2 1 4 0.14 0.01 0.23 0.20 0.42
cfengine 4 1 3 31 0.43 0.04 0.46 0.48 0.03
nn 2 1 1 30 1.82 0.17 1.99 2.03 0.97
39Time
40Validation
- Optimization experiments
- Cons: one library, three applications
- Pros: complex library, consistent results
- Error checking experiments
- Cons: quibble about different errors
- Pros: we set the standard for experiments
- Overall
- Same system designed for optimizations is among the best for detecting errors and security vulnerabilities
41Type Theory
- Equivalent to dataflow analysis (heresy?)
- Different in practice
- Dataflow: flow-sensitive problems, iterative analysis
- Types: flow-insensitive problems, constraint solver
- Commonality
- No magic bullet: same cost for the same precision
- Extracting the store model is a primary concern
Remember Phil Wadler's talk?
42Generators
- Direct support for domain-specific programming
- Language extensions or new language
- Generate implementation from specification
- Our ideas are complementary
- Provides a way to analyze component compositions
- Unifies common algorithms
- Redundancy elimination
- Dependence-based optimizations
43Is it correct?
- Three separate questions
- Are Sam Guyer's experiments correct?
- Yes, to the best of our knowledge
- Checked PLAPACK results
- Checked detected errors against known errors
- Is our compiler implemented correctly?
- Flip answer: whose is?
- Better answer: testing suites
- How do we validate a set of annotations?
44Annotation correctness
- Not addressed in my dissertation, but...
- Theoretical approach
- Does the library implement the domain?
- Formally verify annotations against implementation
- Practical approach
- Annotation debugger: interactive
- Automated assistance in early stages of development
- Middle approach
- Basic consistency checks
45Error Checking vs Optimization
Error checking
- Optimistic
- False positives allowed
- It can even be unsound
- Tend to be "may" analyses
- Correctness is absolute
- Black and white
- Certify programs bug-free
- Cost tolerant
- Explore costly analysis
Optimization
- Pessimistic
- Must preserve semantics
- Soundness mandatory
- Tend to be "must" analyses
- Performance is relative
- Spectrum of results
- No guarantees
- Cost sensitive
- Compile-time is a factor
46Complexity
- Pointer analysis
- Address taken: linear
- Steensgaard: almost linear (log log n factor)
- Andersen: polynomial (cubic)
- Shape analysis: double exponential
- Dataflow analysis
- Intraprocedural: polynomial (height of lattice)
- Context-sensitivity: exponential (call graph)
- Rarely see worst-case
47Optimization
- Overall strategy
- Exploit layers and modularity
- Customize lower-level layers in context
- Compiler strategy: top-down layer processing
- Preserve high-level semantics as long as possible
- Systematically dissolve layer boundaries
- Annotation strategy
- General-purpose specialization
- Idiomatic code substitutions
48PLAPACK Optimizations
- PLAPACK matrices are distributed
- Optimizations exploit special cases
- Example: matrix multiply
49Results
50Find the error part 3
- State-of-the-art compiler
    struct __sue_23 *var_72;
    struct __sue_25 *new_f = (struct __sue_25 *) malloc(sizeof (struct __sue_25));
    _IO_no_init(&new_f->fp.file, 1, 0, ((void *) 0), ((void *) 0));
    (&new_f->fp)->vtable = &_IO_file_jumps;
    _IO_file_init(&new_f->fp);
    if (_IO_file_fopen((struct __sue_23 *) new_f, filename, mode, is32) != ((void *) 0)) {
        var_72 = &new_f->fp.file;
        if ((var_72->_flags2 & 1) && (var_72->_flags & 8)) {
            if (var_72->_mode <= 0)
                ((struct __sue_23 *) var_72)->vtable = &_IO_file_jumps_maybe_mmap;
            else
                ((struct __sue_23 *) var_72)->vtable = &_IO_wfile_jumps_maybe_mmap;
            var_72->_wide_data->_wide_vtable = &_IO_wfile_jumps_maybe_mmap;
        }
    }
    if (var_72->_flags & 8192U)
        _IO_un_link((struct __sue_23 *) var_72);
    if (var_72->_flags & 8192U)
        status = _IO_file_close_it(var_72);
    else
        status = var_72->_flags & 32U ? -1 : 0;
    ((*(struct _IO_jump_t **) ((void *) &(((struct __sue_23 *) (var_72))->vtable)
        + (var_72)->_vtable_offset))->__finish)(var_72, 0);
    if (var_72->_mode <= 0)
        if (((var_72)->_IO_save_base != ((void *) 0)))
            _IO_free_backup_area(var_72);
    if (var_72 != ((struct __sue_23 *) (&_IO_2_1_stdin_))
        && var_72 != ((struct __sue_23 *) (&_IO_2_1_stdout_))
        && var_72 != ((struct __sue_23 *) (&_IO_2_1_stderr_))) {
        var_72->_flags = 0;
        free(var_72);
    }
    bytes_read = _IO_sgetn(var_72, (char *) var_81, bytes_requested);
51End backup slides