1
Extracting File Formats from Executables
  • Junghee Lim, Thomas Reps and Ben Liblit
  • University of Wisconsin-Madison
  • 13th Working Conference on Reverse Engineering
  • Oct. 26, 2006
  • http://www.cs.wisc.edu/junghee/WCRE2006.ppt

2
Data Format (File Format)
  • Goal: automatically extract a specification of a
    program's output format
  • E.g., something similar to the file-format
    specification for gzip
  • FFE (File Format Extractor)
  • Input: an executable without source code or
    documentation
  • Output: a representation of the output data
    format
  • (e.g., a regular expression)

3
Gzip specification vs. our structure
4
Usage Scenarios
  • Reuse components of a tool chain
  • COTS (Commercial Off-The-Shelf) products
  • Detect malware
  • Recover output format (network-communication
    pattern) from captured malware
  • Detect variants in the wild by detecting network
    traffic with that pattern
  • Characterize what a program computes/creates
  • Find inconsistencies between specifications and
    implementations

5
Programming Styles
  • e.g., gzip, compress95, png2ico
  • e.g., tar, cpio

6
What are the Steps?
  • Disassemble executable
  • Recover
  • Interprocedural CFG
  • Variables (and their sizes)
  • Possible values of variables
  • Construct Hierarchical Finite-State Machine
    (HFSM)
  • Annotate HFSM with size/value information
  • Construct regular expression
  • Perform in-line expansion
  • Validation
  • Regular exp. → flex spec. → recognizer
  • Examples → recognizer → success/failure

13
Example code
14
The disassembled code for our example
401120 sub_401120 proc near    ; type
401120     push  ebp
401121     mov   ebp, esp
401123     sub   esp, 0Ch
401126     mov   eax, [ebp-4]
401129     mov   [ebp-8], eax
40112C     cmp   [ebp-8], 0
401130     jz    short loc_40113A
401132     cmp   [ebp-8], 1
401136     jz    short loc_401147
401138     jmp   short loc_401152
40113A loc_40113A:
40113A     mov   eax, [ebp-4]
40113D     mov   [esp], eax
401140     call  sub_401050
401145     jmp   short loc_401152
401147 loc_401147:
401147     mov   eax, [ebp-4]
40114A     mov   [esp], eax
40114D     call  sub_401050
401152 loc_401152:
401152     leave
401153     retn
401154 sub_401154 proc near    ; chksum
401154     push  ebp
401155     mov   ebp, esp
401157     sub   esp, 8
40115A     mov   eax, [ebp-4]
40115D     mov   [esp], eax
401160     call  sub_401075
401165     leave
401166     retn
401167 sub_401167 proc near    ; fill_data
401167     push  ebp
401168     mov   ebp, esp
40116A     sub   esp, 8
40116D loc_40116D:
40116D     cmp   [ebp-1], 0
401171     jz    short loc_401181
401173     movsx eax, [ebp-1]
401177     mov   [esp], eax
40117A     call  sub_401050
40117F     jmp   short loc_40116D
401181 loc_401181:
401181     leave
401182     retn
401183 sub_401183 proc near    ; main
401183     push  ebp
401184     mov   ebp, esp
401186     sub   esp, 28h
401189     and   esp, 0FFFFFFF0h
40118C     mov   eax, 0
401191     add   eax, 0Fh
401194     add   eax, 0Fh
401197     shr   eax, 4
40119A     shl   eax, 4
40119D     mov   [ebp-14h], eax
4011A0     mov   eax, [ebp-14h]
4011A3     call  sub_401200
4011A8     call  __main
4011AD     mov   eax, [ebp-10h]
4011B0     mov   [esp], eax
4011B3     call  sub_401075
4011B8     mov   eax, [ebp-0Ch]
4011BB     mov   [esp], eax
4011BE     call  sub_401075
4011C3     mov   [esp+4], 4
4011CB     mov   eax, [ebp-8]
4011CE     mov   [esp], eax
4011D1     call  sub_4010E4
4011D6     call  sub_401120
4011DB     call  sub_401167
4011E0     mov   eax, [ebp-4]
4011E3     mov   [esp], eax
4011E6     call  sub_401075
4011EB     call  sub_401154
4011F0     mov   eax, 0
4011F5     leave
4011F6     retn
Output functions:
  • sub_401050 (put_byte): void put_byte(char c)
  • sub_401075 (put_long): void put_long(int n)
  • sub_4010E4 (writes): void writes(char *str,
    int size)

Output operations: 401140, 40114D, 401160,
40117A, 4011B3, 4011BE, 4011D1, 4011E6
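
The relationship between output functions and output operations can be sketched in a few lines: an output operation is a call site whose callee is one of the output functions. The call list below is transcribed by hand from the disassembly on this slide; the helper name output_operations is only illustrative and is not part of FFE.

```python
# Call instructions transcribed from the listing: (call-site address, callee).
CALLS = [
    (0x401140, "sub_401050"),
    (0x40114D, "sub_401050"),
    (0x401160, "sub_401075"),
    (0x40117A, "sub_401050"),
    (0x4011A3, "sub_401200"),
    (0x4011A8, "__main"),
    (0x4011B3, "sub_401075"),
    (0x4011BE, "sub_401075"),
    (0x4011D1, "sub_4010E4"),
    (0x4011D6, "sub_401120"),
    (0x4011DB, "sub_401167"),
    (0x4011E6, "sub_401075"),
    (0x4011EB, "sub_401154"),
]

# Output functions named on the slide.
OUTPUT_FUNCTIONS = {
    "sub_401050": "put_byte",
    "sub_401075": "put_long",
    "sub_4010E4": "writes",
}


def output_operations(calls, output_functions):
    """Return the call sites whose callee is an output function."""
    return [addr for addr, callee in calls if callee in output_functions]


ops = output_operations(CALLS, OUTPUT_FUNCTIONS)
print(", ".join(f"{addr:X}" for addr in ops))
```

Calls to sub_401120, sub_401154 and sub_401167 are filtered out because those procedures are ordinary application code that only reaches the output functions indirectly.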
15
HFSM for our example
16
HFSM for gzip
- 12 FSMs - 64 nodes - 36 call-sites
[Figure: HFSM for gzip. Nodes are labeled with procedure entries (e.g., 403d20_ENTRY, 4051b4_ENTRY, 404f0e_ENTRY, 40510c_ENTRY), addresses (e.g., 403d62, 403e50), and call sites (e.g., call 4056df, call 4051b4, call 404f0e).]
17
A fragment of the call graph of gzip
18
HFSM for gzip
- 12 FSMs - 64 nodes - 36 call-sites
[Figure: the same HFSM for gzip as on slide 16.]
19
Regular Expression for gzip
If the HFSM is too complicated and there is no
recursion, in-line expand it to create a regular
expression
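
In-line expansion can be illustrated on the small example program from the earlier slides. This is only a sketch: the HFSM below is a hand-written reading of the disassembly (type either calls put_byte on one of two branches or does nothing, fill_data loops over put_byte, chksum calls put_long), and the node encoding and the function name expand are hypothetical rather than FFE's actual representation.

```python
# Hypothetical HFSM: one machine per procedure, built from sequences ("seq"),
# alternatives ("alt"), loops ("star"), output operations ("out"),
# the empty path ("eps"), and calls to other machines ("call").
HFSM = {
    "main": ("seq", [("out", "put_long"), ("out", "put_long"), ("out", "writes"),
                     ("call", "type"), ("call", "fill_data"),
                     ("out", "put_long"), ("call", "chksum")]),
    "type": ("alt", [("out", "put_byte"), ("out", "put_byte"), ("eps",)]),
    "fill_data": ("star", ("out", "put_byte")),
    "chksum": ("out", "put_long"),
}


def expand(node, hfsm, active=()):
    """In-line expand a node into a regular expression over output functions."""
    kind = node[0]
    if kind == "out":
        return node[1]
    if kind == "eps":
        return "ε"
    if kind == "call":
        callee = node[1]
        if callee in active:  # expansion only terminates without recursion
            raise ValueError(f"recursive call to {callee}: cannot in-line")
        return expand(hfsm[callee], hfsm, active + (callee,))
    if kind == "seq":
        return " ".join(expand(child, hfsm, active) for child in node[1])
    if kind == "alt":
        return "(" + " | ".join(expand(child, hfsm, active) for child in node[1]) + ")"
    if kind == "star":
        return "(" + expand(node[1], hfsm, active) + ")*"
    raise ValueError(f"unknown node kind: {kind}")


regex = expand(("call", "main"), HFSM)
print(regex)
```

The active tuple tracks the chain of machines currently being expanded, which is how the sketch enforces the "no recursion" precondition stated on the slide.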
20
Augmenting an HFSM with VSA and ASI information
Organization of CodeSurfer/x86
[Figure: organization of CodeSurfer/x86. Labels: Executable, IDA Pro, disassemble executable, build CFGs, Connector, VSA, ASI, CodeSurfer back-end.]
VSA (Value Set Analysis): A combined
numeric-analysis and pointer-analysis algorithm
that determines an over-approximation of the set
of numeric values and addresses that each
abstract memory location holds at each program
point. (G. Balakrishnan and T. Reps, Analyzing
memory accesses in x86 executables, CC'04)
ASI (Aggregate Structure Identification): A
unification-based, flow-insensitive algorithm to
identify a program's arrays and structs. (G.
Ramalingam et al., Aggregate structure
identification and its application to program
analysis, POPL'99) (G. Balakrishnan and T. Reps,
Recovery of variables and heap structure in x86
executables, TR-1533, Comp. Sci. Dept.,
UW-Madison, 2005)
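
What "over-approximation of the set of values" means can be shown with a toy abstract domain. This sketch is not the VSA algorithm: it uses plain finite sets with an arbitrary size cap (MAX_SET, chosen here purely for illustration) as a stand-in for the richer domain the real analysis uses, and it falls back to Top ("could be anything") once precision is lost.

```python
TOP = "Top"    # abstract value meaning "any value is possible"
MAX_SET = 4    # hypothetical precision limit, chosen for illustration


def join(a, b):
    """Merge the abstract values that reach a program point along two paths."""
    if a is TOP or b is TOP:
        return TOP
    merged = a | b
    # Giving up and returning Top is always sound: it still covers every
    # concrete value, it is merely less precise.
    return merged if len(merged) <= MAX_SET else TOP


then_branch = frozenset({0})
else_branch = frozenset({1})
small = join(then_branch, else_branch)
large = join(then_branch, frozenset(range(10)))
print(sorted(small))
print(large)
```

A variable that is 0 on one path and 1 on the other gets the value set {0, 1} after the merge; a variable with too many possible values is reported as Top, which is how Top shows up in the size/value annotations later in the talk.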
21
Value Set Analysis (VSA)
22
Value Set Analysis (VSA)
Output function:
void put_long(int n) {
    put_short(n & 0xffff);
    put_short((ulong)n >> 16);
}
Output operation:
push 12345678h
call put_long
[Figure: the stack at esp holds the bytes 12h, 34h, 56h, 78h; annotation: size = 4]
23
Value Set Analysis (VSA)
Output function:
void writes(char *c, uint len) {
    for (int i = 0; i < len; i++) {
        outbuf[outcnt++] = (uchar)(c[i]);
        if (outcnt == OUTBUFSIZE)
            flush_outbuf();
    }
}
Output operation:
mov ebx, 1000
...
push 4
push ebx
call writes
[Figure: memory between addresses 1000 and 1004 holds the bytes a, b, c, d; the stack is shown below it]
24
Value Set Analysis (VSA)
[Same example as slide 23. The figure adds the pushed length 4 on the stack and the annotation size = 4.]
25
Value Set Analysis (VSA)
[Same example as slide 23. The figure adds the pushed address 1000 on the stack and the lookup LookupVSA((esp-48)) = abcd.]
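
The lookup in this example can be mimicked with a small sketch. It assumes an abstract state in which the bytes a, b, c, d sit at addresses 1000 to 1003 (one reading of the figure) and in which the two pushed arguments are known exactly; the dictionary layout and the name lookup_bytes are hypothetical and do not reflect the CodeSurfer/x86 API.

```python
# Assumed abstract state just before "call writes".
memory = {1000: "a", 1001: "b", 1002: "c", 1003: "d"}

# push 4; push ebx (ebx = 1000): the callee sees the pointer first, then the length.
stack_args = [1000, 4]


def lookup_bytes(memory, start, length):
    """Return the known bytes in [start, start + length), or None if any is unknown."""
    cells = [memory.get(addr) for addr in range(start, start + length)]
    return None if None in cells else "".join(cells)


pointer, length = stack_args
value = lookup_bytes(memory, pointer, length)
print(f"size {length} value {value}")
```

When any byte in the range is unknown, the sketch returns None, which plays the role of Top in the annotations.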
26
Before After
27
Aggregate Structure Identification (ASI)
...
14 call sendto
28
Experiments
  • gzip: GNU data-compression program
  • png2ico: converts PNG files to Windows
    icon-resource files
  • ping: sends ICMP ECHO_REQUEST packets to a host
    to see if the host is reachable via the network

29
gzip
30
png2ico (1)
  • Usage scenario
  • Find inconsistencies between specifications and
    implementations

31
png2ico (2)
size 2 value 0
size 2 value 1
size 2 value Top

size 1 value Top
size 1 value Top
size 1 value Top
size 1 value 0
size 2 value 0
size 4 value Top
size 4 value Top
size 2 value Top

size 2 value 1
size 2 value Top
size 4 value Top
size 4 value 40
size 4 value Top
size 4 value 0
size 4 value 0
size 4 value Top
size 4 value 0
size 4 value 0
size 4 value 0




size 4 value Top
size Top value Top
size 1 value 0
size Top value Top
33
png2ico (3)
  • We found an inconsistency between the file-format
    specification for Windows icons and the converter
    png2ico
  • png2ico → regular exp. → flex spec. → recognizer
  • Windows icon files → recognizer → failure!
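
The recognizer idea can be sketched with Python's re module standing in for the flex-generated recognizer. The three fields below are the first three size/value annotations from the png2ico slide; the little-endian byte order and the helper names field_regex and annotations_to_regex are assumptions made for this illustration.

```python
import re

TOP = None  # stands for the slides' "Top": size or value unknown


def field_regex(size, value):
    """Byte-level regular expression for one annotated output operation."""
    if size is TOP:
        return b"(?s:.*)"              # unknown length: any bytes
    if value is TOP:
        return b"(?s:.{%d})" % size    # known length, unknown content
    # Assumption: multi-byte values are written in little-endian order.
    return re.escape(value.to_bytes(size, "little"))


def annotations_to_regex(fields):
    """Concatenate the per-field expressions into one compiled recognizer."""
    return re.compile(b"".join(field_regex(size, value) for size, value in fields))


# size 2 value 0, size 2 value 1, size 2 value Top
header = annotations_to_regex([(2, 0), (2, 1), (2, TOP)])

accepted = header.fullmatch(b"\x00\x00\x01\x00\x03\x00") is not None
rejected = header.fullmatch(b"\x00\x00\x02\x00\x03\x00") is None
print(accepted, rejected)
```

A sample whose second field is 2 instead of 1 is rejected, which is the kind of success/failure signal the validation step relies on.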

34
png2ico (4)
35
ping (1)
The HFSM gives a hint about the behavior of ping.
36
ping (2)
typedef struct icmp {
    uint8  icmp_type;      /* type of message, see below */
    uint8  icmp_code;      /* type sub code */
    uint16 icmp_checksum;  /* ones complement cksum of struct */
#define icmp_cksum icmp_checksum
    union {
        uint8 ih_pptr;             /* ICMP_PARAMPROB */
        struct in_addr ih_gwaddr;  /* ICMP_REDIRECT */
        struct ih_idseq {
            uint16 icd_id;
            uint16 icd_seq;
        } ih_idseq;
        int ih_void;
        /* ICMP_UNREACH_NEEDFRAG Path MTU Discovery (RFC1191) */
        struct ih_pmtu {
            uint16 ipm_void;
            uint16 ipm_nextmtu;
        } ih_pmtu;
        struct ih_rtradv {
            uint8  irt_num_addrs;
            uint8  irt_wpa;
            uint16 irt_lifetime;
        } ih_rtradv;
    } icmp_hun;
#define icmp_pptr icmp_hun.ih_pptr
    ...
    union {
        struct id_ts {
            uint32 its_otime;
            uint32 its_rtime;
            uint32 its_ttime;
        } id_ts;
        struct id_ip {
            struct ip idi_ip;
            /* options and then 64 bits of data */
        } id_ip;
        struct icmp_ra_addr id_radv;
        uint32 id_mask;
        char id_data[1];
    } icmp_dun;
#define icmp_otime icmp_dun.id_ts.its_otime
    ...
} icmp_t;
size 1 value Top
size 1 value Top
size 2 value Top
size 2 value Top
size 2 value Top
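
One way to read the five annotations is that they line up with icmp_type, icmp_code, icmp_checksum, icd_id and icd_seq, i.e., the header of an ICMP echo message. That mapping is an interpretation of the slide, not something it states; the sketch below only checks that the sizes agree, using Python's struct module.

```python
import struct

# Network byte order, no padding: two 1-byte fields followed by three 2-byte fields.
ECHO_HEADER = struct.Struct("!BBHHH")

# Sizes taken from the annotations on this slide.
annotated_sizes = [1, 1, 2, 2, 2]

# Example header: type 8 is ICMP echo request; the other values are made up.
packet = ECHO_HEADER.pack(8, 0, 0, 0x1234, 1)

print(ECHO_HEADER.size, sum(annotated_sizes), len(packet))
```

All values are annotated as Top, which fits fields such as the checksum, identifier and sequence number that vary from packet to packet.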
37
Conclusion
  • A technique for extracting an over-approximation
    of a program's output data format, including
  • a way to extract a preliminary structure for the
    output data format
  • a way to elaborate the structure by annotating it
    with information about possible output values and
    sizes

38
Over-Approximation?
  • Yes, modulo . . .
  • All operations must append to the output
  • No tracking of file-pointer rewind, seek, . . .
  • Multiple different formats in a program
  • Signals and exceptions ignored
  • In principle, could use the same technique used
    in the MOPS tool

39
Possible Future Work
  • Automatic detection of output functions
  • Other operation sequences → other formats
  • Input operations
  • Network-communication operations
  • Adoption of a learning technique for refining
    output formats

40
Thank you! Clarifications?
42
Identifying Output Operations
  • The IDA Pro disassembler identifies library
    output procedures
  • Typically, inspect the call graph to choose which
    application procedures should be considered
    output wrappers