Transcript and Presenter's Notes

Title: Incorporating Domain-Specific Information into the Compilation Process


1
Incorporating Domain-Specific Information into
the Compilation Process
  • Samuel Z. Guyer
  • Supervisor: Calvin Lin
  • April 14, 2003

2
Motivation
  • Two different views of software
  • Compiler's view
  • Abstractions: numbers, pointers, loops
  • Operators: +, -, *, ->, [ ]
  • Programmer's view
  • Abstractions: files, matrices, locks, graphics
  • Operators: read, factor, lock, draw
  • This discrepancy is a problem...

3
Find the error, part 1
  • Example
  • Error: case outside of switch statement
  • Part of the language definition
  • Error reported at compile time
  • Compiler indicates the location and nature of
    the error

switch (var_83) {
  case 0: func_24(); break;
  case 1: func_29(); break;
}
case 2: func_78();
!
4
Find the error, part 2
  • Example
  • Improper call to libfunc_38
  • Syntax is correct: no compiler message
  • Fails at run-time
  • Problem: what does libfunc_38 do?
  • This is how compilers view reusable code

struct __sue_23 *var_72;
char var_81[100];
var_72 = libfunc_84(__str_14, __str_65);
libfunc_44(var_72);
libfunc_38(var_81, 100, 1, var_72);
!
5
Find the error, part 3
  • Example
  • Improper call to fread() after fclose()
  • The names reveal the mistake
  • No traditional compiler reports this error
  • Run-time system: how does the code fail?
  • Code review: rarely this easy to spot

FILE *my_file;
char buffer[100];
my_file = fopen("my_data", "r");
fclose(my_file);
fread(buffer, 100, 1, my_file);
!
6
Problem
  • Compilers are unaware of library semantics
  • Library calls have no special meaning
  • The compiler cannot provide any assistance
  • Burden is on the programmer
  • Use library routines correctly
  • Use library routines efficiently and effectively
  • These are difficult manual tasks
  • Tedious and error-prone
  • Can require considerable expertise

7
Solution
  • A library-level compiler
  • Compiler support for software libraries
  • Treat library routines more like built-in
    operators
  • Compile at the library interface level
  • Check programs for library-level errors
  • Improve performance with library-level
    optimizations
  • Key: libraries represent domains
  • Capture domain-specific semantics and expertise
  • Encode in a form that the compiler can use

8
The Broadway Compiler
  • Broadway: a source-to-source C compiler
  • Domain-independent compiler mechanisms
  • Annotations: a lightweight specification language
  • Domain-specific analyses and transformations
  • Many libraries, one compiler

9
Benefits
  • Improves capabilities of the compiler
  • Adds many new error checks and optimizations
  • Qualitatively different
  • Works with existing systems
  • Domain-specific compilation without recoding
  • For us: more thorough and convincing validation
  • Improve productivity
  • Less time spent on manual tasks
  • All users benefit from one set of annotations

10
Outline
  • Motivation
  • The Broadway Compiler
  • Recent work on scalable program analysis
  • Problem: Error checking demands powerful analysis
  • Solution: Client-driven analysis algorithm
  • Example: Detecting security vulnerabilities
  • Contributions
  • Related work
  • Conclusions and future work

11
Security vulnerabilities
  • How does remote hacking work?
  • Most are not direct attacks (e.g., cracking
    passwords)
  • Idea: trick a program into unintended behavior
  • Automated vulnerability detection
  • How do we define intended?
  • Difficult to formalize and check application
    logic
  • Libraries control all critical system
    services
  • Communication, file access, process control
  • Analyze library routine calls to approximate vulnerabilities

12
Remote access vulnerability
  • Example
  • Vulnerability: executes any remote command
  • What if this program runs as root?
  • Clearly domain-specific: sockets, processes, etc.
  • Requirement
  • Why is detecting this vulnerability hard?

int sock;
char buffer[100];
sock = socket(AF_INET, SOCK_STREAM, 0);
read(sock, buffer, 100);
execl(buffer);
!
Data from an Internet socket should not specify a
program to execute
13
Challenge 1: Pointers
  • Example
  • Still contains a vulnerability
  • Only one buffer
  • Variables buffer and ref are aliases
  • We need an accurate model of memory

int sock;
char buffer[100];
char *ref = buffer;
sock = socket(AF_INET, SOCK_STREAM, 0);
read(sock, buffer, 100);
execl(ref);
!
14
Challenge 2: Scope
  • Call graph
  • Objects flow throughout program
  • No scoping constraints
  • Objects referenced through pointers
  • We need whole-program analysis

[Call-graph figure: the socket descriptor and the buffer flow from socket() through read() to execl(), crossing procedure boundaries]
15
Challenge 3: Precision
  • Static analysis is always an approximation
  • Precision: level of detail, or "sensitivity"
  • Multiple calls to a procedure
  • Context-sensitive: analyze each call separately
  • Context-insensitive: merge information from all
    calls
  • Multiple assignments to a variable (see the
    sketch after this list)
  • Flow-sensitive: record each value separately
  • Flow-insensitive: merge values from all
    assignments
  • Lower precision reduces the cost of analysis
  • Exponential → polynomial → linear

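A minimal sketch of the flow-sensitivity point above (hypothetical code, not from the slides): the state of f changes along the control flow, so only an analysis that tracks program order separates the safe call from the erroneous one.

#include <stdio.h>

/* A flow-sensitive analysis records that f is Open at the first
 * fread() and Closed at the second; a flow-insensitive analysis
 * merges the two states and must flag both calls as possible errors. */
void example(void)
{
    char buf[16];
    FILE *f = fopen("data.txt", "r");
    if (f == NULL)
        return;
    fread(buf, 1, sizeof buf, f);   /* f is Open here: fine          */
    fclose(f);
    fread(buf, 1, sizeof buf, f);   /* f is Closed here: real error  */
}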
16
Insufficient precision
  • Example (see the sketch below)
  • Context-insensitivity
  • Information merged at the call
  • Analyzer reports 2 possible errors
  • Only 1 real error
  • Imprecision leads to false positives

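A minimal sketch of this situation (hypothetical code, not the slide's example): one wrapper procedure is called with both tainted and untainted data, so a context-insensitive analysis merges the two calls and reports two possible errors where only one is real.

#include <unistd.h>

/* One context-insensitive summary of run_command() covers both call sites below. */
static void run_command(const char *cmd)
{
    execl("/bin/sh", "sh", "-c", cmd, (char *) 0);   /* the sink */
}

void handle_request(int sock)
{
    char remote[100];
    read(sock, remote, sizeof remote - 1);   /* tainted: from the network */
    remote[99] = '\0';
    run_command(remote);       /* the real vulnerability               */
    run_command("ls /tmp");    /* safe, but also reported when merged  */
}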
17
Cost versus precision
  • Problem: a tradeoff
  • Precise analysis: prohibitively expensive
  • Cheap analysis: too many false positives
  • Idea: mixed-precision analysis
  • Focus effort on the parts of the program that
    matter
  • Don't waste time over-analyzing the rest
  • Key: let the error detection problem drive the
    precision
  • Client-driven program analysis

18
Client-Driven Algorithm
  • Client: the error detection analysis problem
  • Algorithm
  • Start with a fast, cheap analysis; monitor
    imprecision
  • Determine the extra precision needed; reanalyze

[Diagram: the client analysis and the pointer analyzer cooperate through a shared memory model]
19
Algorithm components
  • Monitor
  • Runs alongside main analysis
  • Records imprecision
  • Adaptor
  • Start at the locations of reported errors
  • Trace back to the cause and diagnose (a sketch of
    the overall loop follows)

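A minimal sketch of the client-driven loop, using hypothetical types and helper names rather than Broadway's actual API: run the cheapest analysis, let the monitor record imprecision that reaches a reported error, let the adaptor add precision only there, and reanalyze.

#include <stdbool.h>

typedef struct Program Program;

typedef struct {
    int polluting_sites;   /* imprecision that reached a reported error   */
    int refinements;       /* procedures/variables marked for more detail */
} Diagnosis;

/* Placeholder analyses: a real prototype would plug in the pointer
 * analyzer, the client checker, and the imprecision monitor here. */
static Diagnosis analyze_and_monitor(Program *p) { (void) p; return (Diagnosis){0, 0}; }
static bool apply_extra_precision(Program *p, const Diagnosis *d) { (void) p; return d->refinements > 0; }

void client_driven_analysis(Program *p)
{
    for (;;) {
        Diagnosis d = analyze_and_monitor(p);   /* cheap pass + monitoring */
        if (d.polluting_sites == 0)
            break;   /* remaining reports are not caused by imprecision */
        if (!apply_extra_precision(p, &d))
            break;   /* the adaptor found nothing useful to refine      */
        /* otherwise loop: reanalyze with the extra precision */
    }
}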
20
Sources of imprecision
  • Polluting assignments (see the sketch below)

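A minimal sketch of a polluting assignment (hypothetical code, not from the slides): with a context-insensitive summary of close_it(), the formal parameter f receives values from both call sites, so the two file objects are merged and closing one makes the other look possibly closed.

#include <stdio.h>

static void close_it(FILE *f)      /* f merges {log, data} across contexts */
{
    fclose(f);
}

void example(void)
{
    char buf[32];
    FILE *log  = fopen("app.log", "a");
    FILE *data = fopen("data.bin", "rb");
    if (log == NULL || data == NULL)
        return;
    close_it(log);                     /* only the log file is closed...        */
    fread(buf, 1, sizeof buf, data);   /* ...but data now looks possibly Closed */
    fclose(data);
}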
21
In action...
  • Monitor analysis
  • Polluting assignments
  • Diagnose and apply fix
  • In this case: one procedure made context-sensitive
  • Reanalyze

!
22
Methodology
  • Compare with commonly-used fixed-precision analyses
  • Metrics
  • Accuracy: number of errors reported
  • Includes false positives; fewer is better
  • Performance: compared only when accuracy is the same

CS-FS (Full):   context-sensitive, flow-sensitive
CS-FI (Slow):   context-sensitive, flow-insensitive
CI-FS (Medium): context-insensitive, flow-sensitive
CI-FI (Fast):   context-insensitive, flow-insensitive
23
Programs
  • 18 real C programs
  • Unmodified source: all the issues of production
    code
  • Many are system tools that run in privileged mode
  • Representative examples

Name       Description     Priv  Lines of code  Procedures  CFG nodes
muh        IRC proxy       yes   5K (25K)       84          5,191
blackhole  E-mail filter   yes   12K (244K)     71          21,370
wu-ftpd    FTP daemon      yes   22K (66K)      205         23,107
named      DNS server      yes   26K (84K)      210         25,452
nn         News reader     no    36K (116K)     494         46,336
24
Error detection problems
  1. File access: files must be open when accessed
  2. Remote access vulnerability: data from an Internet socket
     should not specify a program to execute
  3. Format string vulnerability (FSV): a format string may not
     contain untrusted data (see the sketch after this list)
  4. Remote FSV: check whether an FSV is remotely exploitable
  5. FTP behavior: can this program be tricked into reading and
     transmitting arbitrary files?
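A minimal sketch of problem 3 (hypothetical code, not from the slides): untrusted data reaches the format argument of printf.

#include <stdio.h>

void log_message(const char *user_input)
{
    printf(user_input);          /* vulnerable: the format string is untrusted */
    printf("%s", user_input);    /* safe: fixed format string                  */
}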
25
Results
Remote access vulnerability
[Chart: normalized performance of the analyses, on a scale from 10X to 1000X]
26
Overall results
  • 90 test cases: 18 programs, 5 problems
  • Test case: 1 program and 1 error detection
    problem
  • Compare algorithms: client-driven vs. fixed
    precision
    precision

As accurate as any other algorithm:        87 out of 90
Runs faster than the best fixed algorithm: 64 out of 87
Performance not an issue:                  19 of 23
Both most accurate and fastest:            29 out of 64
27
Why does it work?
Name, total procedures, and the number of procedures made
context-sensitive for each client (RA, File, FSV, RFSV, FTP);
blank cells are omitted in this transcript:
muh 84 6
apache 313 8 2 2 10
blackhole 71 2 5
wu-ftpd 205 4 4 17
named 210 1 2 1 4
cfengine 421 4 1 3 31
nn 494 2 1 1 30
  • Validates our hypothesis
  • Different errors have different precision
    requirements
  • Amount of extra precision is small

28
Outline
  • Motivation
  • The Broadway Compiler
  • Recent work on scalable program analysis
  • Problem: Error checking demands powerful analysis
  • Solution: Client-driven analysis algorithm
  • Example: Detecting security vulnerabilities
  • Contributions
  • Related work
  • Conclusions and future work

29
Central contribution
  • Library-level compilation
  • Opportunity: library interfaces make domains
    explicit in existing programming practice
  • Key: a separate language for codifying
    domain-specific knowledge
  • Result: our compiler can automate previously
    manual error checks and performance improvements

           Knowledge representation  Applying knowledge  Results
Old way    Informal                  Manual              Difficult, unpredictable
Broadway   Codified                  Compiler            Easy, automatic, reliable
30
Specific contributions
  • Broadway compiler implementation
  • Working system (43K C-Breeze, 23K pointers, 30K
    Broadway)
  • Client-driven pointer analysis algorithm [SAS '03]
  • Precise and scalable whole-program analysis
  • Library-level error checking experiments
    [CS TR '01]
  • No false positives for the format string
    vulnerability
  • Library-level optimization experiments [LCPC '00]
  • Solid improvements for PLAPACK programs
  • Annotation language [DSL '99]
  • Balance expressive power and ease of use

31
Related work
  • Configurable compilers
  • Power versus usability: who is the user?
  • Active libraries
  • Previous work focusing on specific domains
  • Few complete, working systems
  • Error detection
  • Partial program verification: paucity of results
  • Scalable pointer analysis
  • Many studies of cost/precision tradeoff
  • Few mixed-precision approaches

32
Future work
  • Language
  • More analysis capabilities
  • Optimization
  • We have only scratched the surface
  • Error checking
  • Resource leaks
  • Path sensitivity: conditional transfer functions
  • Scalable analysis
  • Start with a cheaper analysis: unification-based
  • Refine to a more expensive analysis: shape analysis

33
(No Transcript)
34
Annotations (I)
  • Dependence and pointer information
  • Describe pointer structures
  • Indicate which objects are accessed and modified

procedure fopen(pathname, mode)
{
  on_entry { pathname --> path_string
             mode     --> mode_string }
  access   { path_string, mode_string }
  on_exit  { return --> new file_stream }
}
35
Annotations (II)
  • Library-specific properties
  • Dataflow lattices

property State : { Open, Closed }  initially Open
property Kind  : { File, Socket { Local, Remote } }

[Lattice diagrams for State (Open, Closed) and Kind (File; Socket with Local and Remote)]
36
Annotations (III)
  • Library routine effects
  • Dataflow transfer functions

procedure socket(domain, type, protocol)
{
  analyze Kind {
    if (domain == AF_UNIX)  IOHandle <- Local
    if (domain == AF_INET)  IOHandle <- Remote
  }
  analyze State { IOHandle <- Open }
  on_exit { return --> new IOHandle }
}
37
Annotations (IV)
  • Reports and transformations

procedure execl(path, args)
{
  on_entry { path --> path_string }
  report if (Kind : path_string could-be Remote)
    "Error at " ++ callsite ++ ": remote access";
}

procedure slow_routine(first, second)
{
  when (condition) replace-with {
    quick_check(first);
    fast_routine(first, second);
  }
}
38
Why does it work?
  • Validates our hypothesis
  • Different clients have different precision
    requirements
  • Amount of extra precision is small

Name, then for each client (RA, File, FSV, RFSV, FTP): the number of
procedures made context-sensitive, followed by the percentage of
variables made flow-sensitive (blank cells omitted in this transcript):
muh 6 0.1 0.07 0.31
apache 8 2 2 10 0.89 0.18 0.91 1.07 0.83
blackhole 2 5 0.24 0.04 0.32
wu-ftpd 4 4 17 0.63 0.09 0.51 0.53 0.23
named 1 2 1 4 0.14 0.01 0.23 0.20 0.42
cfengine 4 1 3 31 0.43 0.04 0.46 0.48 0.03
nn 2 1 1 30 1.82 0.17 1.99 2.03 0.97
39
Time
40
Validation
  • Optimization experiments
  • Cons: one library, three applications
  • Pros: complex library, consistent results
  • Error checking experiments
  • Cons: quibbles about different errors
  • Pros: we set the standard for experiments
  • Overall
  • The same system, designed for optimizations, is
    among the best at detecting errors and security
    vulnerabilities

41
Type Theory
  • Equivalent to dataflow analysis (heresy?)
  • Different in practice
  • Dataflow: flow-sensitive problems, iterative
    analysis
  • Types: flow-insensitive problems, constraint
    solver
  • Commonality
  • No magic bullet: same cost for the same precision
  • Extracting the store model is a primary concern

Remember Phil Wadler's talk?
42
Generators
  • Direct support for domain-specific programming
  • Language extensions or new language
  • Generate implementation from specification
  • Our ideas are complementary
  • Provides a way to analyze component compositions
  • Unifies common algorithms
  • Redundancy elimination
  • Dependence-based optimizations

43
Is it correct?
  • Three separate questions
  • Are Sam Guyer's experiments correct?
  • Yes, to the best of our knowledge
  • Checked PLAPACK results
  • Checked detected errors against known errors
  • Is our compiler implemented correctly?
  • Flip answer: whose is?
  • Better answer: testing suites
  • How do we validate a set of annotations?

44
Annotation correctness
  • Not addressed in my dissertation, but...
  • Theoretical approach
  • Does the library implement the domain?
  • Formally verify annotations against
    implementation
  • Practical approach
  • Annotation debugger: interactive
  • Automated assistance in early stages of
    development
  • Middle approach
  • Basic consistency checks

45
Error Checking vs Optimization
  • Error checking
  • Optimistic
  • False positives allowed
  • It can even be unsound
  • Tend to be "may" analyses
  • Correctness is absolute
  • Black and white
  • Certify programs bug-free
  • Cost tolerant
  • Explore costly analyses
  • Optimization
  • Pessimistic
  • Must preserve semantics
  • Soundness mandatory
  • Tend to be "must" analyses
  • Performance is relative
  • Spectrum of results
  • No guarantees
  • Cost sensitive
  • Compile time is a factor

46
Complexity
  • Pointer analysis
  • Address-taken: linear
  • Steensgaard: almost linear (log log n factor)
  • Andersen: polynomial (cubic); see the sketch
    after this list for where the two differ
  • Shape analysis: double exponential
  • Dataflow analysis
  • Intraprocedural: polynomial (height of the lattice)
  • Context-sensitivity: exponential (call graph)
  • Rarely see the worst case

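A minimal sketch (hypothetical code) of where the two pointer analyses diverge: Andersen's subset-based analysis keeps p -> {a} and q -> {b} separate, while Steensgaard's unification-based analysis merges the nodes for a and b once r is assigned from both pointers.

int a, b;
int *p = &a;
int *q = &b;
int *r;

void assign_both(void)
{
    r = p;   /* Steensgaard: unify r's target with p's            */
    r = q;   /* ...and with q's, so a and b land in one node      */
             /* Andersen: r -> {a, b}, but p -> {a} and q -> {b}  */
}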
47
Optimization
  • Overall strategy
  • Exploit layers and modularity
  • Customize lower-level layers in context
  • Compiler strategy: top-down layer processing
  • Preserve high-level semantics as long as possible
  • Systematically dissolve layer boundaries
  • Annotation strategy
  • General-purpose specialization
  • Idiomatic code substitutions

48
PLAPACK Optimizations
  • PLAPACK matrices are distributed
  • Optimizations exploit special cases
  • Example: matrix multiply

49
Results
50
Find the error, part 3 (revisited)
  • State-of-the-art compiler

struct __sue_23 *var_72;
struct __sue_25 *new_f =
    (struct __sue_25 *) malloc(sizeof(struct __sue_25));
_IO_no_init(&new_f->fp.file, 1, 0, ((void *) 0), ((void *) 0));
(&new_f->fp)->vtable = &_IO_file_jumps;
_IO_file_init(&new_f->fp);
if (_IO_file_fopen((struct __sue_23 *) new_f, filename, mode, is32)
    != ((void *) 0))
  var_72 = &new_f->fp.file;
if ((var_72->_flags2 & 1) && (var_72->_flags & 8)) {
  if (var_72->_mode <= 0)
    ((struct __sue_23 *) var_72)->vtable = &_IO_file_jumps_maybe_mmap;
  else
    ((struct __sue_23 *) var_72)->vtable = &_IO_wfile_jumps_maybe_mmap;
  var_72->_wide_data->_wide_vtable = &_IO_wfile_jumps_maybe_mmap;
}
if (var_72->_flags & 8192U)
  _IO_un_link((struct __sue_23 *) var_72);
if (var_72->_flags & 8192U)
  status = _IO_file_close_it(var_72);
else
  status = var_72->_flags & 32U ? -1 : 0;
(((struct _IO_jump_t *) ((void *) (((struct __sue_23 *) (var_72))->vtable
    + (var_72)->_vtable_offset)))->__finish)(var_72, 0);
if (var_72->_mode <= 0) {
  if (((var_72)->_IO_save_base != ((void *) 0)))
    _IO_free_backup_area(var_72);
}
if (var_72 != ((struct __sue_23 *) (&_IO_2_1_stdin_))
    && var_72 != ((struct __sue_23 *) (&_IO_2_1_stdout_))
    && var_72 != ((struct __sue_23 *) (&_IO_2_1_stderr_)))
  var_72->_flags = 0;
free(var_72);
bytes_read = _IO_sgetn(var_72, (char *) var_81, bytes_requested);
51
End backup slides