1
There's Plenty of Room at the Bottom: Analyzing
and Verifying Machine Code
  • T. Reps (1,2), J. Lim (1), A. Thakur (1),
  • G. Balakrishnan (3), and A. Lal (4)
  • (1) Univ. of Wisconsin   (3) NEC Laboratories
    America
  • (2) GrammaTech, Inc.   (4) Microsoft Research
    India

Joint work with A. Burton, E. Driscoll, M. Elder,
T. Andersen (UW), T. Teitelbaum, D. Melski, D.
Gopan, S. Yong, T. Johnson, A. Loginov
(GrammaTech)
2
Why Machine Code?
  • Windows
  • Login process keeps a user's password in the heap
    after a successful login
  • To minimize data lifetime
      • clear buffer
      • call free()
  • But . . .
  • the compiler might optimize away the
    buffer-clearing code (useless-code elimination)

Source:              memset(buffer, '\0', len); free(buffer);
After optimization:  free(buffer);
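One common workaround for the problem on this slide can be sketched as follows. This is a minimal sketch, not the approach taken in the talk: the helper names scrub, all_zero, and handle_secret are hypothetical, and the volatile-pointer idiom is only one of several options (platform routines such as SecureZeroMemory on Windows or explicit_bzero on glibc/BSD serve the same purpose).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the talk): zero a buffer through a
   volatile-qualified pointer, so each store is an access the optimizer
   is expected to keep rather than eliminate as useless code. */
void scrub(void *buf, size_t len) {
    assert(buf != NULL || len == 0);
    volatile unsigned char *p = (volatile unsigned char *)buf;
    while (len-- > 0) {
        *p++ = 0;
    }
}

/* Returns 1 if every byte of buf is zero, 0 otherwise. */
int all_zero(const void *buf, size_t len) {
    const unsigned char *p = (const unsigned char *)buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) return 0;
    }
    return 1;
}

/* Hypothetical login-style routine: copy a secret to the heap, use it,
   scrub it, and only then free it. Returns 1 if the buffer was observed
   to be all zero just before free(), 0 otherwise (including on
   allocation failure). */
int handle_secret(const char *secret) {
    size_t len = strlen(secret) + 1;
    char *buffer = malloc(len);
    if (buffer == NULL) return 0;
    memcpy(buffer, secret, len);
    /* ... authenticate using buffer ... */
    scrub(buffer, len);
    int cleared = all_zero(buffer, len);
    free(buffer);
    return cleared;
}
```

Whether the clearing stores survive in a given build can only be confirmed by inspecting the generated machine code, which is the point of the slide.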
3
WYSINWYX: What You See Is Not What You eXecute
  • Computers do not execute source-code programs
  • They execute machine-code programs that are
    generated from source code
  • An issue for any verification/analysis method
  • theorem proving
  • model checking
  • abstract interpretation

4
Goals of the Talk
  • WYSINWYX . . . Gulp!
  • Why is analyzing machine code important?
  • What makes machine code challenging?
  • what analysis techniques don't work right out of
    the box
  • Why can it be advantageous to analyze machine
    code?
  • A peek at what we have been able to accomplish
    starting from stripped executables

5
UW Machine-Code Analysis Projects
McVeto talk: Saturday, July 17 at 15:30
  • CodeSurfer/x86: machine-code slicing [CC'04,
    TOPLAS'10]
      • tracks dependences across memory updates and
        accesses
  • DDA/x86: device-driver verification [TACAS'08]
  • TSL: analysis generator [CC'08]
      • concrete semantics + abstract domain → abstract
        semantics
  • McVeto: machine-code verification [CAV'10]
      • unrestricted machine-code programs
      • including self-modifying code and instruction
        aliasing
  • FFE/x86: file-format inference [WCRE'06]
  • Library summarization [CAV'07]
  • BCE: botnet command extractor [TR-1668]

6
SLAM Error Trace
DDA/x86 Error Trace
7
Tutorial on x86 (Intel Syntax)
  C statement     x86 instruction       Registers
  p = q;          mov ecx, edx          p in ecx, q in edx
  p = *q;         mov ecx, [edx]        p in ecx, q in edx
  *p = q;         mov [ecx], edx        p in ecx, q in edx
  p = &a[2];      lea ecx, [esp+8]      p in ecx, a[2] at esp+8

Stack pointer: esp   Frame pointer: ebp
9
Demo
CodeSurfer/C CodeSurfer/x86
11
A Surprise: Many People Spend Their Lives
Inspecting Machine Code
  • Thousands of users of IDA Pro disassembler
  • Hex-Rays SA Liège, Belgium
  • Computer Emergency Response Teams
  • every major country has one
  • Anti-malware companies
  • Three-letter agencies
  • Malware writers
  • . . .

12
Machine Code can be a Better Platform for Finding
Security Vulnerabilities
  • Many exploits utilize platform-specific quirks
  • non-obvious and unexpected
  • compiler artifacts (choices made by the compiler)
  • memory layout
  • padding between fields of a struct
  • which variables are adjacent?
  • register usage
  • execution order
  • optimizations performed
  • compiler bugs

13
Example of a Compiler Artifact
  int callee(int a, int b) {
    int local;
    if (local == 5) return 1;
    else return 2;
  }
  int main() {
    int c = 5;
    int d = 7;
    int v = callee(c,d);
    // What is the value of v here?
    return 0;
  }
Answer: 1 (for the Microsoft compiler)
14
Example of a Compiler Artifact
  Standard prolog        Prolog for 1 local
  push ebp               push ebp
  mov ebp, esp           mov ebp, esp
  sub esp, 4             push ecx
15
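  • Standard prolog
      • push ebp
      • mov ebp, esp
      • sub esp, 4

[Figure: stack contents after the standard prolog, with labels 5, ecx, ebp, and ??? (the uninitialized slot reserved for local)]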
16
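  • Standard prolog
      • push ebp
      • mov ebp, esp
      • sub esp, 4
  • Prolog for 1 local
      • push ebp
      • mov ebp, esp
      • push ecx

[Figure: stack contents for both prologs. With sub esp, 4 the slot for local is uninitialized (???); with push ecx the slot receives the current value of ecx, which is 5.]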
17
Example of a Compiler Artifact
  Standard prolog        Prolog for 1 local
  push ebp               push ebp
  mov ebp, esp           mov ebp, esp
  sub esp, 4             push ecx

Call site in main:
  mov [ebp - 8], 5
  mov [ebp - 0Ch], 7
  mov eax, [ebp - 0Ch]
  push eax
  mov ecx, [ebp - 8]
  push ecx
  call _callee
  . . .
Answer: 1 (for the Microsoft compiler)
18
Analysis of Indirect Calls
  • Case Study: Nimda virus
  • Use of telltale system routines is obfuscated
      • indirect use of LoadLibrary() and
        GetProcAddress()
      • indirection through memory
  • Detailed modeling of Dynamic Linked Libraries
    (DLLs)
      • runtime linking
      • aliasing, forwarding
  • Ability to resolve indirect calls

  NIMDA            Resolved / Total   Notes
  Indirect calls   366 / 373          209 via import table
  LoadLibrary      6 / 8              5 indirect
  GetProcAddress   45 / 46            45 indirect
20
State-Space Exploration
  • Bug detection
  • Policy adherence
  • Malicious-code detection

22
Locate Bugs by Finding Problematic Paths
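[Figure: control-flow graph from enter to exit. The node if ((v = malloc()) == 0) has T and F successors; a path on which free(v) is reached a second time is flagged: double free!]
path ≈ possible sequence of runtime events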
23
State Machine for Allocation Bugs
  • States: v.unknown, v.null, v.notNull, v.freed
  • (v = malloc(_)) == 0: true branch → v.null,
    false branch → v.notNull
  • v = malloc(_) → v.unknown
  • v.notNull
      • free(v) → v.freed
      • *v → no state change
  • v.freed
      • free(v) → double free!
      • *v → use after free!
  • v.null
      • free(v) → free of NULL!
  • v.unknown
      • free(v) → possible free of NULL!

24
Locate Bugs by Finding Problematic Paths
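[Figure: the same control-flow graph annotated with allocation states. v.notNull holds on the F branch of if ((v = malloc()) == 0); v.freed holds after free(v); where free(v) can be reached in state v.freed, the path is reported: double free!]
path in CFG ≈ possible sequence of runtime events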
25
State-Space Exploration in Machine Code
Initial PC = 0; target PC = 62 (PC = program counter)
55 8B EC 83 EC 14 83 7D F4 0F 7D 1F C7 45 F0 01
00 00 00 83 7D F8 08 7E 09 C7 45 EC 05 00 00 00
EB 07 C7 45 EC 06 00 00 00 EB 07 C7 45 F0 00 00
00 00 83 7D F0 01 75 16 83 7D F4 0F 7C 07 3D 00
2A 00 00 EB 07 C7 45 FC 08 00 00 00 EB 07 C7 45
FC 09 00 00 00 33 C0 8B E5 5D C3
26
  • 00 push ebp
  • 01 mov ebp, esp
  • 03 sub esp, 14h
  • 06 cmp [ebp+var_C], 0Fh
  • 0A jge short loc_2B
  • 0C mov [ebp+var_10], 1
  • 13 cmp [ebp+var_8], 8
  • 17 jle short loc_22
  • 19 mov [ebp+var_14], 5
  • 20 jmp short loc_29
  • 22 mov [ebp+var_14], 6
  • 29 jmp short loc_32
  • 2B mov [ebp+var_10], 0
  • 32 cmp [ebp+var_10], 1
  • 36 jnz short loc_4E
  • 38 cmp [ebp+var_C], 0Fh
  • 3C jl short loc_45
  • 3E cmp eax, 2A00h
  • 43 jmp short loc_4C
27
(same disassembly listing and bytes as slide 26)
PC = 17: 2-byte instruction
  17 jle short loc_22
PC = 18: 2-byte instruction
  18 or edi, eax
  1A inc ebp
  1B in al, dx
  1C add eax, 0EB000000h
  21 pop es
  22 mov dword ptr [ebp-14h], 6
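The instruction-aliasing effect on this slide can be reproduced with a tiny length decoder. The sketch below is an assumption-laden illustration, not a real x86 decoder: it handles only the opcodes and ModRM forms that occur in the first 48 bytes shown above, and the names CODE, modrm_len, insn_len, and sweep are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* First 48 bytes of the code shown on the slide. */
const unsigned char CODE[] = {
    0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x14, 0x83, 0x7D,
    0xF4, 0x0F, 0x7D, 0x1F, 0xC7, 0x45, 0xF0, 0x01,
    0x00, 0x00, 0x00, 0x83, 0x7D, 0xF8, 0x08, 0x7E,
    0x09, 0xC7, 0x45, 0xEC, 0x05, 0x00, 0x00, 0x00,
    0xEB, 0x07, 0xC7, 0x45, 0xEC, 0x06, 0x00, 0x00,
    0x00, 0xEB, 0x07, 0xC7, 0x45, 0xF0, 0x00, 0x00
};

/* Bytes used by ModRM plus displacement; only two forms are supported.
   Returns -1 for anything else. */
int modrm_len(unsigned char modrm) {
    unsigned mod = modrm >> 6, rm = modrm & 7;
    if (mod == 3) return 1;             /* register operand        */
    if (mod == 1 && rm != 4) return 2;  /* [reg + disp8], no SIB   */
    return -1;
}

/* Length of the instruction starting at c[pc], or -1 if unsupported. */
int insn_len(const unsigned char *c, size_t n, size_t pc) {
    int m, extra;
    if (pc >= n) return -1;
    switch (c[pc]) {
    case 0x55: case 0x5D: case 0xC3:    /* push ebp, pop ebp, ret  */
    case 0x45: case 0xEC: case 0x07:    /* inc ebp, in al,dx, pop es */
        return 1;
    case 0x7C: case 0x7D: case 0x7E: case 0x75: case 0xEB:
        return 2;                       /* short jumps: opcode + rel8 */
    case 0x05: case 0x3D:
        return 5;                       /* add/cmp eax, imm32      */
    case 0x09: case 0x33: case 0x8B: extra = 0; break; /* op r/m, r */
    case 0x83: extra = 1; break;        /* op r/m32, imm8          */
    case 0xC7: extra = 4; break;        /* mov r/m32, imm32        */
    default: return -1;
    }
    if (pc + 1 >= n) return -1;
    m = modrm_len(c[pc + 1]);
    return (m < 0) ? -1 : 1 + m + extra;
}

/* Linear sweep from start: record instruction start offsets below stop.
   Returns the number of offsets written to out (at most cap). */
size_t sweep(size_t start, size_t stop, size_t *out, size_t cap) {
    size_t k = 0, pc = start;
    while (pc < stop && k < cap) {
        int len = insn_len(CODE, sizeof CODE, pc);
        if (len < 0) break;
        out[k++] = pc;
        pc += (size_t)len;
    }
    return k;
}
```

Sweeping from offset 17h visits 17, 19, 20, 22 (hex), while sweeping from 18h visits 18, 1A, 1B, 1C, 21, 22: two different instruction streams over the same bytes that resynchronize at 22h.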
28
(same disassembly listing as slide 26), compiled from:
  int main() {
    int x, y, z, p, tmp1;
    if (x < 15) {
      tmp1 = 1;
      if (y > 8) z = 5;
      else z = 6;
    }
    else tmp1 = 0;
    if (tmp1 == 1) {
      if (x >= 15) UNREACHABLE();
      else p = 8;
    }
    else p = 9;
    return 0;
  }
29
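More Challenges: Need Simultaneous Numeric and
Pointer Analysis
  Source:         int a; . . . a = 2;
  Compiled code:  sub esp, 40                 ; for all locals
                  mov dword ptr [ebp-12], 2
[Figure: stack frame showing the saved return address, the saved frame pointer (pointed to by ebp), and the slot for a. Computing the address ebp-12 is labeled "arithmetic operation"; the write through that address is labeled "dereference operation".]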
30
Intervals Not Appropriate for Sets of Addresses
  • Checking for non-aligned accesses
      • pointer forging? e.g., 4-byte fetch from
        [1020,1028]
  • need to keep stride (congruence) information
      • e.g., 4-byte fetch from 4[1020,1028]
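The difference between a plain interval and a strided interval can be sketched as follows. This is a minimal illustration, assuming the notation s[lo,hi] denotes the set {lo, lo+s, lo+2s, ...} bounded by hi; the type StridedInterval and both function names are hypothetical, and wraparound is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* s[lo,hi] = { lo, lo+s, lo+2s, ... } with all members <= hi.
   A plain interval is the special case stride == 1.
   stride == 0 denotes the singleton { lo }. Assumes lo <= hi. */
typedef struct { uint32_t stride, lo, hi; } StridedInterval;

/* 1 if address a is a member of si, 0 otherwise. */
int si_contains(StridedInterval si, uint32_t a) {
    if (a < si.lo || a > si.hi) return 0;
    if (si.stride == 0) return a == si.lo;
    return (a - si.lo) % si.stride == 0;
}

/* 1 if every address in si is a multiple of align, 0 otherwise. */
int si_all_aligned(StridedInterval si, uint32_t align) {
    assert(align > 0);
    if (si.lo % align != 0) return 0;
    /* Only lo is a member: nothing further to check. */
    if (si.stride == 0 || si.hi - si.lo < si.stride) return 1;
    /* Otherwise lo + stride is a member, so the stride must be aligned. */
    return si.stride % align == 0;
}
```

With these definitions, 4[1020,1028] contains only 1020, 1024, and 1028, so a 4-byte fetch is always aligned, whereas the plain interval [1020,1028] also admits 1021, 1022, and so on, and the analysis would have to report a possible non-aligned access.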


31
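Arithmetic on Integers vs. Ints
  • Machine arithmetic is 32-bit two's complement
    (int)
  • Static analysis based on infinite-precision
    integer arithmetic is unsound

  x = 2^32 - 1;
  if (trigger) x = x + 1;
  if (x == 0)
    ⟨Do something malicious⟩
Infinite-precision arithmetic: 2^32 ≠ 0, so the last branch is judged "Unreachable!"
32-bit arithmetic: 2^32 wraps around to 0, so the branch is reachable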
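The discrepancy can be checked directly. In this sketch, 64-bit arithmetic stands in for infinite-precision integers (an assumption that suffices for this one increment), unsigned 32-bit arithmetic models the machine's wraparound, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit machine arithmetic: wraps modulo 2^32. */
uint32_t machine_inc(uint32_t x) { return x + 1u; }

/* Stand-in for unbounded integers: wide enough not to wrap here. */
uint64_t ideal_inc(uint64_t x) { return x + 1u; }

/* 1 if the guarded branch (x == 0) is reached on the 32-bit machine. */
int branch_taken_on_machine(int trigger) {
    uint32_t x = UINT32_MAX;            /* 2^32 - 1 */
    if (trigger) x = machine_inc(x);
    return x == 0;
}

/* 1 if the branch is reached under the idealized integer model. */
int branch_taken_in_ideal_model(int trigger) {
    uint64_t x = UINT32_MAX;            /* 2^32 - 1 */
    if (trigger) x = ideal_inc(x);
    return x == 0;
}
```

An analysis that reasons with the idealized model concludes that the branch is dead, while the machine executes it whenever trigger is set.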
32
Design Space for Machine-Code-Analysis Tools
  • What is available initially?
  • source code plus the executable
  • no source code, but the executable includes
    symbol-table/debugging information (unstripped)
  • no source code and no symbol-table/debugging
    information (stripped)
  • What properties are checked?
  • What is expected of the analyzer after the first
    anomalous action is detected?

33
Two Points in the Design Space
  • CodeSurfer/x86, DDA/x86
  • Stripped executables
  • IR recovery + property checking (DDA/x86)
  • Account only for behaviors expected from a
    standard compilation model
  • Report evidence of possible deviations from such
    behaviors
  • McVeto/x86 PPC32
  • Stripped executables
  • Check reachability properties
  • Account for deviant behaviors
  • self-modifying code
  • instruction aliasing
  • report definite instances of stack corruption

34
Invariant Checking
36
A Surprise: Analyzing Executables can be Less
Complicated than Analyzing Source
  • Some source-level issues go away!
      • use of multiple source languages
      • in-lined assembly code
  • avoids build problems (i.e., for the application
    to be analyzed)
  • analyze the actual library code, not hand-written
    stubs
      • D. Gopan and T. Reps, "Low-level library analysis
        and summarization," CAV '07

37
A Surprise: Analyzing Executables can be More
Precise than Analyzing Source
  Source:      f( p(x), q(y), r(z) )
  Executable:  r(z); q(y); p(x); f
To be as precise, a source-code-analysis tool
would have to duplicate all choices made by the
compiler and optimizer
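The point about argument evaluation can be observed with a small experiment. In C the order in which the arguments of a call are evaluated is unspecified, so the sketch below records whatever order the compiler actually chose; the names record, call_order, n_calls, and run_example are hypothetical, and p, q, r, and f simply mirror the slide.

```c
#include <assert.h>
#include <string.h>

char call_order[4];   /* 'p', 'q', 'r' in the order actually executed */
int n_calls = 0;

/* Log which function ran, then pass the value through. */
int record(char tag, int v) {
    assert(n_calls < 3);
    call_order[n_calls++] = tag;
    call_order[n_calls] = '\0';
    return v;
}

int p(int x) { return record('p', x); }
int q(int y) { return record('q', y); }
int r(int z) { return record('r', z); }
int f(int a, int b, int c) { return a + b + c; }

/* The source fixes the result but not the order of the three calls. */
int run_example(void) {
    n_calls = 0;
    call_order[0] = '\0';
    return f(p(1), q(2), r(3));
}
```

After run_example(), call_order holds one concrete order, for example "rqp" or "pqr" depending on the compiler and its options. The executable contains exactly that order, whereas a source-level tool has to consider every order the language allows.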
38
Platform-Specific ⇒ Fewer Behaviors
39
Join: Over-Approximation to Union
  X ∈ [5,10]
  X ∈ [18,22]
  X ∈ [5,10] ⊔ [18,22]
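For the interval domain, the join is the smallest interval that covers both operands, which is why it over-approximates the union. A minimal sketch, with hypothetical names Interval, interval_join, and interval_contains:

```c
#include <assert.h>

typedef struct { int lo, hi; } Interval;   /* closed interval [lo,hi] */

/* Join: the smallest interval containing both a and b. */
Interval interval_join(Interval a, Interval b) {
    Interval j;
    assert(a.lo <= a.hi && b.lo <= b.hi);
    j.lo = (a.lo < b.lo) ? a.lo : b.lo;
    j.hi = (a.hi > b.hi) ? a.hi : b.hi;
    return j;
}

/* 1 if x lies in a, 0 otherwise. */
int interval_contains(Interval a, int x) {
    return a.lo <= x && x <= a.hi;
}
```

Joining [5,10] and [18,22] gives [5,22], which also admits values such as 14 that neither branch can produce. Every behavior that is possible on only one platform and gets merged in this way costs precision, which motivates the platform-specific view of the surrounding slides.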
43
Static Analysis of Executables: Prior State of
the Art (2001)
  • Relies on debugging information
      • Atom, EEL, Vulcan
  • Able to track only data movements via registers
      • EEL, Cifuentes, Debbabi, Debray
  • Poor treatment of memory operations
      • Overly conservative treatment ⇒ many false
        positives
      • Non-conservative treatment ⇒ many false negatives
  • Limited usefulness for security analysis

CodeSurfer/x86's analyses: able to track the
effects of memory operations
44
What Sets Our Work Apart?
  • Create an analyzer to identify indirect calls
  • Create an analyzer to identify strings
  • Create an analyzer to check stack height
  • versus our approach
  • Create an analyzer that, for
  • each instruction I in the executable and its
    libraries
  • each calling context of I
  • each register and variable V in scope
  • statically determines an over-approximation to
  • the set of values that V may contain when I
    executes

But what are the variables if we start with a
stripped executable?
47
State-Space Exploration
  • Memory-safety violations!
  • Access outside of activation record
  • Access outside of malloced block
  • Call/jump to data
  • Use of code as data

48
IR Recovery in CodeSurfer/x86: Scope of our
Ambitions
  • Aim to handle programs that conform to a
    standard compilation model
  • procedures
  • activation records
  • global data region
  • heap-allocated structs/objects (malloc/new)
  • virtual functions
  • dynamically linked libraries
  • Limitations
  • single-threaded applications
  • interrupts and signal handlers

49
IR Recovery in CodeSurfer/x86: Scope of our
Ambitions
  • Aim to handle programs that conform to a
    standard compilation model
  • procedures
  • activation records
  • global data region
  • heap-allocated structs/objects (malloc/new)
  • virtual functions
  • dynamically linked libraries
  • Report indications of non-conformance
  • violations of stack protocol
  • return address modified within procedure
  • Memory-safety violations
  • Access outside of activation record
  • Access outside of malloced block
  • Call/jump to data
  • Use of code as data

51
The IR-Recovery Problem
  • Given a stripped executable E, identify the
  • procedures
  • variables
  • data objects
  • types
  • libraries
  • that it uses, and
  • for each instruction I in E and its libraries
  • for each interprocedural calling context of I
  • for each machine register and variable V in scope
  • statically determine an over-approximation to
  • the set of values that V may contain when I
    executes

Bootstrapped analysis: obtain better and better
estimates using multiple rounds of analysis
52
Quality of Variables and Structs Recovered:
Comparison with Debugging Information
Structure of data allocated on stack (locals)
Structure of data allocated on heap
53
IR Recovery Phases
Stripped executable → IDA Pro disassembler
  • Initial IR
      • instruction ASTs
      • initial call graph and CFGs

→ Memory-access and variable-recovery analyzer
  • Fleshed-out IR
      • improved call graph and CFGs
      • a-locs (≈ inferred variables and structs)
      • possible values held by each a-loc at each CFG
        node
      • used, killed, may-killed a-locs at each CFG node

→ CodeSurfer dependence-graph builder
  • program dependence graphs
  • forward and backward slicing, chopping
  • GUI for dependence-graph navigation

54
What Does an Analyst Want to Know?
  • What are the program's variables?
  • What are the program's parameters?
  • Where could this indirect jump go?
  • What function could be called at this indirect
    call site?
  • What could this dereference operation
    access/affect?
  • What kind of object is allocated at this
    allocation site?
  • What could the value held in V eventually affect?
  • What could have affected the value of V?

55
Example
  int arrVal=0, *pArray2;
  int main() {
    int i, a[10], *p;
    /* Initialize pointers */
    pArray2=&a[2];
    p=&a[0];
    /* Initialize Array */
    for(i=0; i<10; ++i) {
      *p=arrVal;
      p++;
    }
    /* Return a[2] */
    return *pArray2;
  }

  ; ebx ⇔ variable i
  ; ecx ⇔ variable p
        sub esp, 40        ; adjust stack
        lea edx, [esp+8]   ;
        mov [8], edx       ; pArray2=&a[2]
        lea ecx, [esp]     ; p=&a[0]
        mov edx, [4]       ;
  loc_9:
        mov [ecx], edx     ; *p=arrVal
        add ecx, 4         ; p++
        inc ebx            ; i++
        cmp ebx, 10        ; i<10?
        jl short loc_9     ;
        mov edi, [8]       ;
        mov eax, [edi]     ; return *pArray2
        add esp, 40
        retn
57
Example
(same assembly listing as slide 55)
[Figure: memory layout for the example, on an address axis that ends at ffffffffh. Data local to main (activation record): a (40 bytes). Global data: pArray2 (4 bytes) at address 8h; arrVal (4 bytes) at address 4h.]
58
Example
(same assembly listing as slide 55)
[Figure: the same memory layout with no debugging information. Only the activation record of main and the global data region (with address 4h marked) are known; variable names and sizes are not available.]
59
Memory-Regions
  • Memory-region: a sequence of similar runtime
    locations
      • AR-region: locations that belong to an activation
        record
      • Malloc-region: locations that are allocated at a
        malloc site
      • Global-region: locations that correspond to
        global data

[Figure: a runtime stack containing several activation records of F and G plus GLOBAL DATA, mapped to one AR-region for F, one AR-region for G, and one global region]
60
Example Memory-Regions
(same assembly listing as slide 55)
[Figure: memory-regions for the example. Global region: offsets (GL,4) through (GL,12). Region for main: offsets (main, -40) through (main, 0).]
61
Memory-Layout Inference (1st Cut)
  • Data-layout clues (IDA Pro's approach)
      • some variables held in registers
      • global variables → absolute addresses
      • local variables → offsets in stack frame
  • A-locs
      • locations between consecutive addresses
      • locations between consecutive offsets
      • registers
  • Drawbacks: no information for
      • indirect accesses using a non-stack-frame
        register
      • dynamically-allocated objects
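The "locations between consecutive offsets" rule can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the statically known offsets are given sorted and distinct, the end of the region is supplied by the caller, and the names ALoc and alocs_from_offsets are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int start; int size; } ALoc;

/* offsets:    statically known offsets in one region, sorted ascending
               and distinct (e.g., the esp/ebp-relative offsets that
               appear in the code).
   region_end: exclusive upper bound of the region.
   out:        receives one a-loc per offset (capacity >= n).
   Each a-loc runs from its offset up to the next known offset. */
size_t alocs_from_offsets(const int *offsets, size_t n,
                          int region_end, ALoc *out) {
    for (size_t i = 0; i < n; i++) {
        int next = (i + 1 < n) ? offsets[i + 1] : region_end;
        assert(next > offsets[i]);
        out[i].start = offsets[i];
        out[i].size = next - offsets[i];
    }
    return n;
}
```

For the running example, the code mentions esp and esp+8, that is, offsets -40 and -32 in the region for main, so this rule yields an 8-byte a-loc at -40 and a 32-byte a-loc at -32. The array a[10] is therefore split at a[2], which is the kind of imprecision the later cuts address.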

62
Example A-locs
(same assembly listing as slide 55)
[Figure: a-loc boundaries for the example. Global region: the directly used addresses 4 and 8 give boundaries at (GL,4), (GL,8), and (GL,12). Region for main: esp corresponds to (main, -40) and esp+8 to (main, -32); the region ends at (main, 0).]
63
Example A-locs
(same assembly listing as slide 55)
[Figure: the resulting a-locs. Global region: mem_4 from (GL,4) to (GL,8) and mem_8 from (GL,8) to (GL,12). Region for main: main_40 from (main, -40) to (main, -32) and main_32 from (main, -32) to (main, 0).]
64
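Value-Set: An Abstraction of a Set of Values/Addresses
[Figure: a concrete state (a stack with several activation records of F and G plus GLOBAL DATA) is abstracted into memory-regions (AR of F, AR of G, GLOBAL DATA). A value-set is a tuple with one component per memory-region: (SI_Global, SI_G, SI_F).]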
65
Example Value-Set Analysis
(same assembly listing as slide 55)
[Figure: the example's a-locs, mem_4 and mem_8 in the global region and main_40 and main_32 in the region for main, as the locations over which value-set analysis computes its results]
66
Memory-Layout Inference (2nd Cut)
  • IDA Pro only provides information about
      • variables accessed via globals
      • variables accessed via ebp or esp
  • Problem: indirect accesses that use a
    non-stack-frame register, e.g., mov [eax], 10
  • Idea: use VSA results
      • VSA indicates what addresses eax can hold

67
Identifying Variables: VSA + ASI
  • Aggregate Structure Identification (ASI)
      • Ramalingam et al., POPL 1999
      • Partition aggregates according to the program's
        memory-access patterns
      • Original motivation: Y2K
  • VSA provides ASI with the count, size, and stride
    of an array access
  • ASI identifies structs and arrays

68
Aggregate Structure Identification
(same assembly listing as slide 55)
[Figure: the 40-byte activation record of main. VSA result used at mov eax, [edi]: edi ↦ (⊥, -32).]
69
Aggregate Structure Identification
(same assembly listing as slide 55)
[Figure: ASI tree over the 40-byte region, with 4-byte elements grouped with multiplicities 1, 7, and 2. VSA result used at mov [ecx], edx: ecx ↦ (⊥, 4[-40,-4]).]
70
Aggregate Structure Identification
(same assembly listing as slide 55)
[Figure: ASI tree over the 40-byte region, with 4-byte elements grouped with multiplicities 1, 7, and 2]
ASI: two arrays + one scalar
71
Aggregate Structure Identification
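[Figure: the region for main, from (main, -40) to (main, 0), with the earlier a-locs main_40 and main_32 shown next to the ASI tree of 4-byte elements grouped with multiplicities 1, 7, and 2]
Earlier: one 8-byte a-loc + one 32-byte a-loc
ASI: two arrays + one scalar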
72
Aggregate Structure Identification
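High-level type: struct { int a[2]; int b; int c[7]; };
[Figure: ASI tree over the 40-byte region, with 4-byte elements grouped with multiplicities 1, 7, and 2]
ASI: two arrays + one scalar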
73
Memory-Layout Inference (3rd Cut)
  • What about dynamically-allocated objects?
      • G. Balakrishnan and T. Reps,
        "Recency-abstraction for heap-allocated
        storage," SAS '06

74
Dynamically-Allocated Storage
[Figure: a pointer p, a MallocBlock region, and a VirtualTable, shown in two configurations]
75
Most Analyses Unsound!
[Figure: the same diagram of p, MallocBlock, and VirtualTable]
76
Dynamically-Allocated Storage
  • Use two regions per malloc-site
  • Most-Recently-Allocated Block (MRAB)
  • Non-Most-Recently-Allocated Block (NMRAB)
  • MRAB
  • Never a summary region
  • At most one concrete heap block
  • NMRAB
  • Generally a summary region
  • Can represent more than one concrete block
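The bookkeeping behind the two regions can be sketched as follows. This is a simplified illustration, not the algorithm of the cited paper: it tracks only how many concrete blocks each region may stand for (zero, one, or many), it conservatively treats NMRAB as a summary region at all times, and every name in it is hypothetical.

```c
#include <assert.h>

/* How many concrete heap blocks a region may represent. */
typedef enum { C_ZERO = 0, C_ONE = 1, C_MANY = 2 } Count;
typedef enum { REGION_MRAB, REGION_NMRAB } Region;

/* Abstract state for one malloc site. */
typedef struct { Count mrab; Count nmrab; } SiteState;

/* Saturating addition on counts. */
Count count_add(Count a, Count b) {
    int sum = (int)a + (int)b;
    return (sum >= 2) ? C_MANY : (Count)sum;
}

/* A new allocation at this site: the block that used to be the most
   recent one is folded into NMRAB, and MRAB now stands for exactly
   the newly allocated block. */
void on_malloc(SiteState *s) {
    assert(s != 0);
    s->nmrab = count_add(s->nmrab, s->mrab);
    s->mrab = C_ONE;
}

/* A strong (overwriting) update is sound only for a region that
   represents a single concrete block. Assumption of this sketch:
   NMRAB is always treated as a summary region. */
int strong_update_ok(const SiteState *s, Region r) {
    if (r == REGION_MRAB) return s->mrab == C_ONE;
    return 0;
}
```

Because MRAB never summarizes more than one block, a write such as the constructor's initialization of the virtual-table field can overwrite the old abstract value instead of being merged with it.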

77
Dynamically-Allocated Storage
[Figure: MRAB and NMRAB regions for one malloc site, each block with fields f and g, and a virtual table (VT)]
78
Dynamically-Allocated Storage
[Figure: MRAB, NMRAB, and VT]
79
Dynamically-Allocated Storage
[Figure: MRAB, NMRAB, and VT, with pointer p and abstract values v and w]
80
Dynamically-Allocated Storage
[Figure: MRAB, NMRAB, and VT, with pointer p and abstract values v and w]
Strong Update!
81
Dynamically-Allocated Storage
  class parent {
    int z;
  public:
    parent() : z(0) {}
    virtual void foo();
    virtual void bar();
  };
  class child : public parent {
    char c;
  public:
    child() : c(0) {}
    virtual void foo();
    virtual void bar();
  };
  void virtualFnsDemo() {
    parent *p;
    int x;
    scanf("%d", &x);
    if (x) {
      p = new parent;
      p->foo();
    }
    else {
      p = new child;
      p->foo();
    }
    p->bar();
    delete p;
    return;
  }

Constructor initializes the virtual-table field:
  call ??2@YAPAXI@Z               ; operator new(uint)
  add esp, 4
  test eax, eax
  jz short loc_F5
  mov esi, eax
  mov dword ptr [eax+4], 0
  mov dword ptr [eax], offset ??_7parent@@6B@   ; parent::vftable
  mov ecx, esi
  mov edx, [esi]
  call dword ptr [edx]            ; [edx] → parent::foo
82
Dynamically-Allocated Storage
(same C++ source as slide 81)
  mov esi, eax
  . . .
  mov ecx, esi
  mov edx, [esi]
  call dword ptr [edx+4]          ; [edx+4] → {child::bar, parent::bar}
83
Dynamically-Allocated Storage
  struct List { int a; struct List* next; };
  void mallocInALoop() {
    int i;
    List* head = 0;
    for (i = 0; i < 5; ++i) {
      List* elem = (List*)malloc(sizeof(List));
      elem->a = i;
      elem->next = head;
      head = elem;
    }
    return;
  }
85
Why Iterate?
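  • Multi-level data structures
      struct A *a;
      a->b_ptr->c_ptr = 17;
      struct A { int . . .; char . . .; struct B *b_ptr; . . . };
      struct B { int . . .; char . . .; struct C *c_ptr; };

[Figure: a points to an A object, whose b_ptr points to a B object, whose c_ptr points to a C object that receives 17]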
86
Why Iterate?
  • Multi-level data structures
      struct A *a;
      a->b_ptr->c_ptr = 17;
      mov eax, [ebp - 20]
      mov ebx, [eax + 8]
      mov ecx, [ebx + 12]
      mov [ecx + 8], 17

[Figure: a points to A, A to B, and B to C, with 17 stored in C]
87
Why Iterate?
[Figure: the same code. ebp and a are known, but what the pointers into A, B, and C refer to is still unknown (??).]
88
Why Iterate?
[Figure: the same code. eax is resolved with respect to the frame pointer; ebx and ecx are offsets with respect to values that are not yet known (w.r.t. ???), so the B and C objects and the location of 17 cannot be identified in a single round.]
90
Device-Driver Analysis with DDA/x86
  • Device driver
      • (roughly) a library that exports an API for making
        I/O requests
  • Programming conventions are complicated
  • 85% of crashes in Windows are due to driver bugs
      • Swift et al. 2005
  • DDA/x86 prototype
      • extension to CodeSurfer/x86
      • Balakrishnan and Reps, "Analyzing stripped
        device-driver executables," TACAS 2008

91
PendedCompletedRequested Rule
A driver's dispatch routine should not return
STATUS_PENDING on an I/O Request Packet (IRP) if
it has called IoCompleteRequest on the IRP,
unless it has also called IoMarkIrpPending.
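The rule can be read as a small checker over the sequence of relevant events on one path through a dispatch routine. The sketch below is an assumption-based illustration: the event encoding, the names IrpEvent and violates_pcr, and the single-IRP view are hypothetical and are not how DDA/x86 represents the property.

```c
#include <assert.h>
#include <stddef.h>

/* Events of interest on one path through a dispatch routine,
   all concerning the same IRP (a simplifying assumption). */
typedef enum {
    IRP_COMPLETE_REQUEST,   /* IoCompleteRequest was called  */
    IRP_MARK_PENDING,       /* IoMarkIrpPending was called   */
    IRP_RETURN_PENDING,     /* routine returns STATUS_PENDING */
    IRP_RETURN_OTHER        /* routine returns anything else */
} IrpEvent;

/* Returns 1 if the path violates the rule: STATUS_PENDING is returned
   after IoCompleteRequest without IoMarkIrpPending having been called. */
int violates_pcr(const IrpEvent *path, size_t n) {
    int completed = 0, marked = 0;
    for (size_t i = 0; i < n; i++) {
        switch (path[i]) {
        case IRP_COMPLETE_REQUEST: completed = 1; break;
        case IRP_MARK_PENDING:     marked = 1;    break;
        case IRP_RETURN_PENDING:   return completed && !marked;
        case IRP_RETURN_OTHER:     return 0;
        }
    }
    return 0;   /* path did not reach a return */
}
```

A path that completes the request and then returns STATUS_PENDING is flagged, while the same path with an additional IoMarkIrpPending call is accepted.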
93
State-Space Exploration
[Figure: analysis pipeline. Inputs: stripped executable and property specification. The memory-access and property analyzer uses the CFG, call graph, and memory-access info, and produces either an error report or OK.]
94
SLAM Error Trace
DDA/x86 Error Trace
95
Results For PendedCompletedRequested Rule
[Table not recovered. Legend for the three configurations compared: (1) a-locs from semi-naïve algorithm; (2) with GMOD-based merge function; (3) with cross-product automaton]
96
Device Extension Structure for moufiltr driver
Declaration in C source
Structure in executable
97
Device Extension Structure for moufiltr driver
Declaration in C source
Structure identified by DDA/x86
99
Transformer Specification Language (TSL)
  • Machine-code analyses for other instruction sets
  • PowerPC, ARM, MIPS, . . .
  • Easy creation of new machine-code analyses
  • Provide a single unified definition of an
    instruction set's concrete semantics
  • from one well-defined formalism, generate all
    analyzers
  • guarantees that all analyzers work off of the
    same semantics
  • Automatically instantiate an appropriate variant
    of VSA, ASI, symbolic execution, etc.
  • More generally, a platform for creating
  • multiple analysis components
  • for multiple tools
  • analyzing multiple languages

100
TSL Design Principles
Client Analyzer
N Analysis Components
Analysis1
Analysis2
AnalysisN

TSL Compiler

M Instruction-Set Specifications
101
TSL Design Principles
Stays the same
Client Analyzer
N Analysis Components
Analysis1
Analysis2
AnalysisN

TSL Compiler

M Instruction-Set Specifications
102
TSL Design Principles
Easily add an additional analysis in a
language-independent way
Client Analyzer
N Analysis Components
Analysis1
Analysis2
AnalysisN
AnalysisN+1

TSL Compiler

M Instruction-Set Specifications
103
TSL Leverage
Redefine 40 TSL operations for each analysis
Client Analyzer
N Analysis Components
Analysis1
Analysis2
AnalysisN
AnalysisN+1

Conventional approach: redefine >600 x86
instructions for each analysis
TSL Compiler

M Instruction-Set Specifications
104
TSL Two-Dimensional Generator
  concrete semantics + abstract domain1 → abstract semantics1
  concrete semantics + abstract domain2 → abstract semantics2

  concrete semantics_L1 + abstract domain → abstract semantics_L1
  concrete semantics_L2 + abstract domain → abstract semantics_L2

  Tool generator = abstract-semantics generator + tool driver
  concrete semantics_L1 + abstract domain(s) + tool driver → Tool/L1
  concrete semantics_L2 + abstract domain(s) + tool driver → Tool/L2
105
TSL Basics
  • Meta-language for specifying an instruction set's
    concrete semantics
      • Pure functional language (≈ first-order ML)
      • instruction: <. . . an appropriate tree data
        type . . .>
      • state: (INT32→INT8, reg→INT32, flag→Bool)
      • Write an interpreter:
        state interpInstr(instruction I, state S) { . . . }
  • Interface for implementing analyses
      • supply an abstract interpretation for each TSL
        operator
      • e.g., mapAccess, mapUpdate, . . .
      • reinterpreting interpInstr under the abstract
        semantics generates an abstract interpInstr
      • for an instruction at address m, the analysis
        transformer is
      • (abstract) interpInstr(decodeInstrAt(m), _ )
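The idea of writing interpInstr once and reinterpreting its operators can be sketched in C with a table of function pointers. This is a loose analogy under stated assumptions, not TSL itself: the two-instruction language, the shared Val representation, and all names (Ops, Instr, State, interp_instr, CONCRETE, INTERVAL) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One value representation shared by both semantics: an interval.
   Concrete values are represented with lo == hi. */
typedef struct { int64_t lo, hi; } Val;

/* The operators an analysis must (re)define. */
typedef struct {
    Val (*constant)(int32_t c);
    Val (*add)(Val a, Val b);
} Ops;

typedef enum { I_MOV_IMM, I_ADD_IMM } Opcode;
typedef struct { Opcode op; int reg; int32_t imm; } Instr;
typedef struct { Val reg[2]; } State;

/* The interpreter is written once, against the operator table. */
State interp_instr(const Ops *ops, Instr i, State s) {
    assert(i.reg >= 0 && i.reg < 2);
    switch (i.op) {
    case I_MOV_IMM:
        s.reg[i.reg] = ops->constant(i.imm);
        break;
    case I_ADD_IMM:
        s.reg[i.reg] = ops->add(s.reg[i.reg], ops->constant(i.imm));
        break;
    }
    return s;
}

/* Used by both semantics: a constant denotes itself. */
Val val_const(int32_t c) { Val v = { c, c }; return v; }

/* Concrete semantics: 32-bit two's-complement addition with wraparound. */
Val conc_add(Val a, Val b) {
    uint32_t r = (uint32_t)a.lo + (uint32_t)b.lo;   /* modulo 2^32 */
    int64_t sr = (r <= (uint32_t)INT32_MAX) ? (int64_t)r
                                            : (int64_t)r - 4294967296LL;
    Val v = { sr, sr };
    return v;
}

/* Abstract semantics: interval addition; if the result may leave the
   32-bit range, give up precision and return the full range. */
Val itv_add(Val a, Val b) {
    Val v = { a.lo + b.lo, a.hi + b.hi };
    if (v.lo < INT32_MIN || v.hi > INT32_MAX) {
        v.lo = INT32_MIN;
        v.hi = INT32_MAX;
    }
    return v;
}

const Ops CONCRETE = { val_const, conc_add };
const Ops INTERVAL = { val_const, itv_add };
```

Running interp_instr with CONCRETE emulates the instruction, and running the same function with INTERVAL yields an abstract transformer, so adding another analysis means supplying another operator table rather than revisiting every instruction.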

106
TSL Leverage
Client Analyzer
N Analysis Components

M Instruction-Set Specifications
107
TSL Leverage
Client Analyzer
N Analysis Components

TSL Compiler
  • Greatly reduced effort ? enables
  • building more ambitious tools
  • Separation of concerns provides
  • greater confidence in correctness

M Instruction-Set Specifications
108
Static-Analysis Components
  • Analysis of memory contents (numeric values +
    addresses)
  • Variable-identification analysis
  • Affine-relation analysis (affine equalities)
  • Global-modification analysis (GMOD)
  • Live-flag analysis
  • Reaching-flag analysis
  • Available-register analysis
Symbolic-Analysis Components
  • Symbolic execution (quantifier-free bit-vector
    arithmetic)
  • Precondition generation
Dynamic-Analysis Components
  • Emulate the specified processor: interpret each
    operator concretely
109
Case Study
Instruction-set specifier's work
  • x86 (3200 lines of TSL)
  • 10-20 man-days to write the TSL specification
  • TSL generates about 27,000 lines of C
  • PowerPC32 (1600 lines of TSL)
  • 4 man-days to write the TSL specification
Analysis developer's work
  • 166 basetype-operators
  • This number covers four kinds of operand sizes for
    each basic operation: Add8, Add16, Add32, Add64
  • 166/4 ≈ 40 operations
  • 2 map-operators (access/update) for each map type
110
Leverage Provided by TSL
Hand-written
TSL-based
CodeSurfer/SWYXx86
15 days to write the x86 spec
1 man-month to implement all analyses
111
Leverage Provided by TSL
Hand-written
TSL-based
Affine Relation Analysis
542 instruction instances
112
Leverage Provided by TSL
Hand-written
TSL-based
Affine Relation Analysis
Equivalent in 324 cases out of 542
TSL-generated transformers were more precise in
218 cases
Better!
113
UW Machine-Code Analysis Projects
  • CodeSurfer/x86: machine-code slicing [CC04,
    TOPLAS10]
  • tracks dependences across memory updates and
    accesses
  • DDA/x86: device-driver verification [TACAS08]
  • TSL: analysis generator [CC08]
  • concrete semantics × abstract domain → abstract
    semantics
  • McVeto: machine-code verification [CAV10]
  • unrestricted machine-code programs
  • including self-modifying code and instruction
    aliasing
  • FFE/x86: file-format inference [WCRE06]
  • Library summarization [CAV07]
  • BCE: botnet command extractor [TR-1668]

114
Two Points in the Design Space
  • CodeSurfer/x86, DDA/x86
  • Stripped executables
  • IR recovery + property checking (DDA/x86)
  • Account only for behaviors expected from a
    standard compilation model
  • Report evidence of possible deviations from such
    behaviors
  • McVeto/x86, PPC32
  • Stripped executables
  • Check reachability properties
  • Account for deviant behaviors
  • self-modifying code
  • instruction aliasing
  • report definite instances of stack corruption

115
But How to Obtain Leverage on the Problem?
  • Learn the right abstraction
  • Learning a source-code abstraction
  • Synergy/Dash/Smash: use execution traces and
    directed test generation to drive the process of
    learning an abstraction
  • when directed test generation returns
    unsatisfiable, apply splitting-based refinement
    of the current abstraction
  • Relevant approach for machine-code verification
    because we can execute machine code faithfully
  • McVeto insight: use many forms of learning

116
McVeto Insight: Use Many Forms of Learning
Trace-based refinement
Speculative trace refinement
McVeto talk: Saturday, July 17 at 15:30
Aliasing-condition learning
Implicit summaries
Splitting-based refinement
117
Goals of the Talk
  • WYSINWYX . . . Gulp!
  • Why is analyzing machine code important?
  • What makes machine code challenging?
  • what analysis techniques don't work right out of
    the box
  • Why can it be advantageous to analyze machine
    code?
  • A peek at what we have been able to accomplish
    starting from stripped executables

118
UW Machine-Code Analysis Achievements
  • All start with stripped executables
  • we explored two main points in the design space of
    machine-code-analysis tools
  • CodeSurfer/x86 [CC04, TOPLAS10]
  • first machine-code slicer capable of tracking
    dependences across memory updates and accesses
  • novel algorithms to
  • recover (proxies for) variables
  • discover the variables' possible values
  • determine the effects of memory operations
  • DDA/x86 [TACAS08]
  • first verification tool for stripped device
    drivers
  • TSL: analysis generator [CC08]
  • concrete semantics × abstract domain → abstract
    semantics
  • McVeto: machine-code verification [CAV10]
  • first automatic verification tool for
    machine-code programs, including self-modifying
    code and instruction aliasing
  • FFE/x86: file-format inference [WCRE06]
  • Summarization of library operations [CAV07]
  • BCE: botnet command extractor [TR-1668]

119
Related Work
  • Platforms: ATOM, EEL, Phoenix, Vulcan, Pin, IDA
    Pro, Paradyn
  • Improved creation of basic IRs: B. Miller,
    Lakhotia
  • Dataflow analysis: Debray, Backes, Regehr, A.
    King, De Sutter et al., August
  • Software engineering/understanding: Cifuentes,
    Amme
  • Verification: Bergeron, Debbabi, Kroening,
    StEAM, Kinder, Schlich, Bardin, Muehlberg, Chaki
  • Security: Godefroid, Song, Vigna, Kruegel,
    Martignoni, Lee, Giffin, Saidi, Bardin
  • Types: Mycroft
  • Analysis of cache behavior: Wilhelm, AbsInt
  • Proof-carrying code: Necula, Lee
  • Relating source code to compiled code: Kozen,
    Rival, Pnueli
  • Low-level models of the semantics of high-level
    code: Grossman, Chambers, Miné, Ramalingam

120
There's Plenty of Room at the Bottom!
See paper in CAV Proceedings
McVeto talk: Saturday, July 17 at 15:30
Questions?
122
Related Work
  • Cifuentes & Fraboulet, "Intraprocedural static
    slicing of binary executables," ICSM 97
  • Debray et al., "Alias analysis of executable
    code," POPL 98
  • Amme et al., "Data dependence analysis of
    assembly code," PACT 98
  • Mycroft, "Type-based decompilation," ESOP 99
  • Linn et al., "Stack analysis of x86 executables,"
    Unpublished
  • De Sutter et al., "On the static analysis of
    indirect control transfers in binaries," PDPTA 00
  • Backes, "Programmanalyse des XRTL
    Zwischencodes," Ph.D. thesis, Univ. des
    Saarlandes, 2004
  • Guo et al., "Practical and accurate low-level
    pointer analysis," CGO 05
  • Christodorescu et al., "String analysis for x86
    binaries," PASTE 05
  • Regehr et al., "Eliminating stack overflow by
    abstract interpretation," TECS 05
  • Zhang et al., "Parameter and return-value
    analysis of binary executables," COMPSAC 07
  • Saïdi, "Logical foundation for static analysis:
    Application to binary static analysis for
    security," Unpublished

123
For More Information
  • Balakrishnan & Reps, "Analyzing memory accesses
    in x86 executables," CC 04
  • Reps, Balakrishnan, Lim, Teitelbaum, "A
    next-generation platform for analyzing
    executables," APLAS 05
  • Lal, Reps, Balakrishnan, "Extended weighted
    pushdown systems," CAV 05
  • Reps, Balakrishnan, Lim, "Intermediate-representation
    recovery from low-level code," PEPM 06
  • Balakrishnan & Reps, "Recency-abstraction for
    heap-allocated storage," SAS 06
  • Balakrishnan & Reps, "DIVINE: DIscovering
    Variables IN Executables," VMCAI 07
  • Gopan & Reps, "Low-level library analysis and
    summarization," CAV 07
  • Balakrishnan, "WYSINWYX: What You See Is Not
    What You eXecute," Ph.D. thesis and TR-1603,
    CS Dept., Univ. of Wisconsin, Aug. 2007
  • Balakrishnan & Reps, "Analyzing stripped
    device-driver executables," TACAS 08
  • Lim & Reps, "A system for generating static
    analyzers for machine instructions," CC 08

124
Example Value-Set Analysis
ebx ⇔ variable i
ecx ⇔ variable p
sub esp, 40         ; adjust stack
lea edx, [esp+8]
mov [8], edx        ; pArray2 = &a[2]
lea ecx, [esp]      ; p = &a[0]
mov edx, [4]
loc_9:
mov [ecx], edx      ; *p = arrVal
add ecx, 4          ; p++
inc ebx             ; i++
cmp ebx, 10         ; i < 10?
jl short loc_9
mov edi, [8]
mov eax, [edi]      ; return *pArray2
add esp, 40
retn
[Figure: memory-region diagram. Global Region: a-locs mem_4 and mem_8, with boundaries at (GL,4), (GL,8), and (GL,12). Region for main: a-locs main_40 and main_32, with boundaries at (main,-40), (main,-32), and (main,0).]
Corrupts the stack?
125
Affine-Relation Analysis
  • Value-sets are an independent-attribute domain
  • no relations on values of different a-locs
  • Imprecise results, e.g.,
  • upper bound for ebx: cmp ebx, 10 (i < 10?)
  • but no upper bound for ecx at loc_9
  • Improved by discovering affine relations
  • identifies a loop's induction variables

. . .
loc_9:
mov [ecx], edx      ; *p = arrVal
add ecx, 4          ; p++
inc ebx             ; i++
cmp ebx, 10         ; i < 10?
jl short loc_9
. . .
126
Affine-Relation Analysis
  • Obtain affine relations via static analysis
  • Use affine relations to improve precision
  • e.g., at loc_9
  • ecx = esp + (4 × ebx), ebx = (1[0,9], ⊥), esp = (⊥, -40)
  • ⇒ ecx = (⊥, -40) + 4 × (1[0,9])
  • ⇒ ecx = (⊥, 4[-40,-4])
  • ⇒ upper bound for ecx at loc_9
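
The arithmetic behind this bound can be reproduced with a small strided-interval sketch. It assumes the usual reading of s[lo,hi] as the set {lo, lo+s, ..., hi}, takes ebx = 1[0,9] and the stack offset esp = -40 within main's region, and ignores 32-bit wraparound. The class `StridedInterval` and its methods are hypothetical names for illustration, not the value-set implementation used in the tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StridedInterval:
    """The set {lo, lo + stride, ..., hi}; wraparound is ignored here."""

    stride: int
    lo: int
    hi: int

    def scale(self, k):
        # Multiply every element by a positive constant k.
        if k <= 0:
            raise ValueError("this sketch only handles k > 0")
        return StridedInterval(self.stride * k, self.lo * k, self.hi * k)

    def shift(self, c):
        # Add a constant c to every element.
        return StridedInterval(self.stride, self.lo + c, self.hi + c)

    def __str__(self):
        return f"{self.stride}[{self.lo},{self.hi}]"


ebx = StridedInterval(1, 0, 9)  # loop counter i at loc_9
esp_offset = -40                # esp as an offset in main's region

# Apply the affine relation ecx = esp + 4 * ebx.
ecx = ebx.scale(4).shift(esp_offset)
print(ecx)  # 4[-40,-4]
```

The upper bound -4 for ecx comes entirely from the bound on ebx, which value-set analysis alone cannot transfer because it tracks each a-loc independently.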

. . .
loc_9:
mov [ecx], edx      ; *p = arrVal
add ecx, 4          ; p++
inc ebx             ; i++
cmp ebx, 10         ; i < 10?
jl short loc_9
. . .
127
Affine-Relation Analysis
  • Affine relation: $a_0 + \sum_{i=1}^{n} a_i x_i = 0$
  • $x_1, x_2, \ldots, x_n$: a-locs
  • $a_0, a_1, \ldots, a_n$: int constants
  • more general than
  • constant propagation
  • induction-variable analysis
  • Idea: determine affine relations on a-locs
  • propagate loop-bound info to other a-locs
  • ARA for modular arithmetic [MOS05]
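
As a concrete instance of the definition, the relation ecx = esp + 4·ebx from the loop example can be written with a0 = 0 and coefficients 1, 4, and -1 for esp, ebx, and ecx. The sketch below checks that relation on the states seen at loc_9 during a simulated run. It assumes ebx starts at 0 and picks an arbitrary concrete esp; `holds` and `loop_states` are hypothetical helper names. Checking on a trace only illustrates the relation; ARA establishes it statically for all runs.

```python
def holds(a0, coeffs, state):
    """True if a0 + sum(a_i * x_i) == 0 in the given state."""
    return a0 + sum(a * state[x] for x, a in coeffs.items()) == 0


def loop_states(esp):
    """Yield the register state each time control reaches loc_9."""
    ecx, ebx = esp, 0  # lea ecx, [esp]; ebx assumed to start at 0
    while True:
        yield {"esp": esp, "ecx": ecx, "ebx": ebx}
        ecx += 4       # add ecx, 4
        ebx += 1       # inc ebx
        if not ebx < 10:  # cmp ebx, 10 / jl short loc_9
            break


# 0 + 1*esp + 4*ebx - 1*ecx = 0, i.e. ecx = esp + 4*ebx
RELATION = (0, {"esp": 1, "ebx": 4, "ecx": -1})

states = list(loop_states(esp=0x1000))  # 0x1000 is an arbitrary value
all_hold = all(holds(*RELATION, s) for s in states)
```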

128
Example Value-Set Analysis
ebx ⇔ variable i
ecx ⇔ variable p
sub esp, 40         ; adjust stack
lea edx, [esp+8]
mov [8], edx        ; pArray2 = &a[2]
lea ecx, [esp]      ; p = &a[0]
mov edx, [4]
loc_9:
mov [ecx], edx      ; *p = arrVal
add ecx, 4          ; p++
inc ebx             ; i++
cmp ebx, 10         ; i < 10?
jl short loc_9
mov edi, [8]
mov eax, [edi]      ; return *pArray2
add esp, 40
retn
[Figure: memory-region diagram. Global Region: a-locs mem_4 and mem_8, with boundaries at (GL,4), (GL,8), and (GL,12). Region for main: a-locs main_40 and main_32, with boundaries at (main,-40), (main,-32), and (main,0).]
130
Questions
131
Sidestepping Undecidability
Overapproximate the reachable states
Reachable States
Bad States
False positive!
Universe of States
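
A toy example shows how an overapproximation can raise a false positive. The state space, the bad state, and the choice of a plain interval as the abstraction are all assumptions made for illustration.

```python
def interval_hull(states):
    """Overapproximate a set of integers by the interval [min, max]."""
    return set(range(min(states), max(states) + 1))


reachable = set(range(0, 11, 2))  # x only takes even values 0..10
bad = {5}                         # the property is violated when x == 5

over = interval_hull(reachable)   # 0..10, including odd values

truly_unsafe = bool(reachable & bad)  # does a real run hit a bad state?
alarm = bool(over & bad)              # does the analysis report one?
false_positive = alarm and not truly_unsafe
```

Because every reachable state lies inside the overapproximation, the absence of an alarm would prove safety. An alarm, as here, may only reflect the abstraction's imprecision.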
132
Outline
  • Value-set analysis
  • Better identification of variables

133
Outline
  • Value-set analysis
  • Better identification of variables
