K. Gondow (Titech, Japan) - PowerPoint PPT Presentation

Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C. K. Gondow (Titech, Japan), T. Suzuki (Elmic System Inc, Japan)

Slides: 40
Provided by: sdeCsTit2
Transcript and Presenter's Notes

1
Binary-Level Lightweight Data Integration to
Develop Program Understanding Tools for Embedded
Software in C
  • K. Gondow (Titech, Japan)
  • T. Suzuki (Elmic System Inc, Japan)
  • H. Kawashima (JAIST, Japan)

2
Overview
  • Problems
    • Imprecision in C tools.
    • High development cost of C tools.
  • Our solution
    • Binary-level lightweight data integration.
    • As a testbed, DWARF2 is used to develop:
      • dxref, rxref: cross-referencers
      • bscg: a call-graph extractor

3
Imprecision in C tools (1/3)
  • e.g., GNU GLOBAL cannot distinguish a variable
    'foo' from a label 'foo'.
  • Users must pick the right one from a candidate list.
  • This is because GNU GLOBAL only partially analyzes
    source code in order to run very fast.

int main (void)
{
    int foo;
    foo: goto foo;
}

[Figure: clicking 'foo' brings up the candidate list]
foo 3 test.c int foo;
foo 4 test.c foo: goto foo;

4
Imprecision in C tools (2/3)
  • e.g., Murphy's study
    • "An Empirical Study of Static Call Graph
      Extractors", by Murphy, et al., ICSE, 1996.
    • It reports that "call graphs extracted by several
      broadly distributed tools vary significantly enough
      to surprise many experienced software engineers."

5
Imprecision in C tools (3/3)
  • Quantitative results for Mosaic, quoted from
    Murphy's paper.

[Table: cflow ∩ Field, cflow − Field, Field − cflow]

6
Why imprecision? (1/2)
  • Reason 1: many tools only partially parse source
    code, resulting in incomplete analysis.
    • e.g., GNU GLOBAL, cxref, LXR, cscope, cflow...
  • At first glance, full parsing seems to solve this
    problem, but...

7
Why imprecision? (2/2)
  • Reason 2: C source code is difficult to fully
    analyze because of:
    • Compiler-specific extensions.
      • e.g., asm for inline assembly code
    • Ambiguous behaviors in the C standards.
      • undefined, unspecified, implementation-defined.
      • e.g., padding in a structure.

8
Compiler-specific extensions
  • Essential in C and embedded software.
    • e.g., asm is used to obtain H/W error code.
    • e.g., long long is used in C89's <stdio.h>
  • Make it hard to analyze source code.
    • Different compilers give them different semantics.

void page_fault_handler (uint32_t error)
{
    uint32_t cr2;
    asm volatile ("movl %%cr2,%0" : "=r"(cr2)); /* IA-32 control register 2 */
    ...
}

9
Ambiguous behaviors in C (1/2)
  • Intentional and essential to keep C compilers
    fast and simple.
  • e.g., padding in a structure is an
    implementation-defined behavior.
  • This makes pointer-analysis hard.
  • "Pointer analysis for programs with structures
    and casts", by Suan Hsi Yong, et al, PLDI'99.

10
Ambiguous behaviors in C (2/2)
struct S { char c; int *ip; } *p;
struct T { char c; int i; } t;
t.i = 0x1234;
p = (struct S *)&t;
printf ("%p\n", p->ip);
  • Different padding on different platforms.
  • To obtain precise dataflow, tools need to know
    the padding values of the compiler.
  • But it is hard...

[Figure: layouts of struct S and struct T (members c, padding, ip, i) on Solaris8 (32bit) and Solaris8 (64bit).]

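
The platform dependence of padding can be observed directly with offsetof. The following is a minimal sketch, not part of the original slides: the helper names padding_in_S, padding_in_T, same_member_offset and print_layout are hypothetical. It reports how many padding bytes the current compiler inserts after c in each structure. The values are implementation-defined; typically a 32-bit ABI places both ip and i at the same offset, while a 64-bit ABI aligns the pointer ip more strictly than the int i.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* The two structures from the slide. */
struct S { char c; int *ip; };
struct T { char c; int i; };

/* Padding bytes the compiler inserted between c and ip in struct S. */
size_t padding_in_S(void)
{
    return offsetof(struct S, ip) - sizeof(char);
}

/* Padding bytes the compiler inserted between c and i in struct T. */
size_t padding_in_T(void)
{
    return offsetof(struct T, i) - sizeof(char);
}

/* Nonzero if ip and i start at the same offset on this platform,
   i.e. if p->ip (after the cast) starts exactly where t.i starts. */
int same_member_offset(void)
{
    return offsetof(struct S, ip) == offsetof(struct T, i);
}

/* Print the layout chosen by the current compiler (hypothetical helper). */
void print_layout(FILE *out)
{
    assert(out != NULL);
    fprintf(out, "struct S: ip at offset %zu (%zu padding byte(s))\n",
            offsetof(struct S, ip), padding_in_S());
    fprintf(out, "struct T: i  at offset %zu (%zu padding byte(s))\n",
            offsetof(struct T, i), padding_in_T());
    fprintf(out, "same offset: %s\n", same_member_offset() ? "yes" : "no");
}
```

Calling print_layout(stdout) on two different platforms shows the values a source-level tool would have to know in order to track the dataflow from t.i to p->ip precisely.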
11
Possible solutions
  • To modify compilers (e.g. GCC) to emit their
    analyzed internal data.
  • Seemingly high development cost.
  • Many compilers to be modified.
  • To use binary information in executables emitted
    by compilers.
  • Relatively easy, although it lacks some
    information, e.g., statements.

12
Our solution and result
  • Our solution
    • Uses DWARF2 debugging information as binary
      information.
  • Preliminary experiment
    • Good results for our cross-referencers and
      call-graph extractor.
    • Better precision, although
      • false negatives increased somewhat.
      • quantitative results have not yet been obtained.

13
Demonstration
  • Using DWARF2, we implemented
    • two cross-referencers
      • dxref: uses only DWARF2
        • Sample output: dxref
      • rxref: hybrid of dxref and GNU GLOBAL
        • Sample output: dxref
    • a static call-graph extractor
      • bscg: uses DWARF2 and a disassembler.
        • Sample outputs: fact, dxref, bash, bash

14
DWARF2-XML
[Figure: C code is compiled into a binary (ELF/DWARF2) containing text, data, symbol info., relocation info., and debug info. Data integration extracts this into the common format DWARF2-XML, which is used by the dxref and rxref cross-referencers and the bscg call-graph extractor.]
15
How bscg works
  (1) extract call instructions by disassembling text.
      1234 call 5678
  (2) convert addresses to symbols using DWARF2.
      main call fact
  (3) trim call graphs according to options.
  (4) output graph topology in DOT of Graphviz.
      digraph G { main -> fact; fact -> fact; }

[Figure: rendered call graph with nodes main and fact]
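
Steps (2) and (4) can be sketched in a few lines of C. This is an illustrative sketch under stated assumptions, not bscg's actual implementation: the types func_range and call_insn and the functions addr_to_symbol and emit_dot are hypothetical names, and the address ranges are assumed to have already been read from DW_AT_low_pc and DW_AT_high_pc, where DWARF2 defines high_pc as the first address past the function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Address range of one function, as given by DW_AT_low_pc/DW_AT_high_pc. */
struct func_range {
    unsigned long low_pc;   /* first address of the function */
    unsigned long high_pc;  /* first address past the function */
    const char *name;       /* DW_AT_name */
};

/* One disassembled call instruction: where it is and where it jumps. */
struct call_insn {
    unsigned long site;
    unsigned long target;
};

/* Step (2): find the function whose range [low_pc, high_pc) contains addr.
   Returns NULL when no known function covers the address. */
const char *addr_to_symbol(const struct func_range *tab, size_t n,
                           unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= tab[i].low_pc && addr < tab[i].high_pc)
            return tab[i].name;
    return NULL;
}

/* Step (4): write the resolved call edges as a Graphviz DOT digraph.
   Calls with an unresolved caller or callee are skipped.
   Returns the number of edges written. */
size_t emit_dot(FILE *out, const struct func_range *tab, size_t n,
                const struct call_insn *calls, size_t m)
{
    size_t edges = 0;
    assert(out != NULL);
    fprintf(out, "digraph G {\n");
    for (size_t j = 0; j < m; j++) {
        const char *caller = addr_to_symbol(tab, n, calls[j].site);
        const char *callee = addr_to_symbol(tab, n, calls[j].target);
        if (caller == NULL || callee == NULL)
            continue;
        fprintf(out, "  %s -> %s;\n", caller, callee);
        edges++;
    }
    fprintf(out, "}\n");
    return edges;
}
```

With a table that maps the made-up range containing 0x1234 to main and the range starting at 0x5678 to fact, the call "1234 call 5678" becomes the edge main -> fact.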
16
Advantages of bscg
  • Advantages of binary-level DI (explained later).
    • e.g., high applicability and few false positives.
  • Can identify inlined functions.
  • Can extract a call from asm ("call fact")
  • Can exclude
    • library functions, e.g., printf
    • system calls, e.g., open, fork
    • functions in runtime systems, e.g., _start, _fini
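
Excluding such functions amounts to filtering the edge list against a set of names. The sketch below is one possible way to do this, assuming edges are already resolved to symbol names; the slides do not describe how bscg implements it, and the names edge, is_excluded and trim_edges are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One call-graph edge between two named functions. */
struct edge {
    const char *caller;
    const char *callee;
};

/* Nonzero if name appears in the exclusion list. */
int is_excluded(const char *name, const char *const *excluded,
                size_t n_excluded)
{
    assert(name != NULL);
    for (size_t i = 0; i < n_excluded; i++)
        if (strcmp(name, excluded[i]) == 0)
            return 1;
    return 0;
}

/* Remove, in place, every edge whose caller or callee is excluded.
   Kept edges stay in their original order. Returns the new edge count. */
size_t trim_edges(struct edge *edges, size_t n,
                  const char *const *excluded, size_t n_excluded)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_excluded(edges[i].caller, excluded, n_excluded) ||
            is_excluded(edges[i].callee, excluded, n_excluded))
            continue;
        edges[kept++] = edges[i];
    }
    return kept;
}
```

For example, with the exclusion list { "printf", "open", "fork", "_start", "_fini" }, the edges main -> printf and _start -> main are dropped while main -> fact and fact -> fact are kept.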

17
Disadvantages of bscg
  • No support for macro calls, signals, function
    pointers, optimization.
    • gprof-callgraph.pl can handle function pointers,
      since it uses dynamic information.
    • Source-level tools (e.g., cflow) do not suffer from
      the optimization problem.

18
So, is bscg good?
  • Yes! (not the best, of course)
  • Not easy to compare.

19
What is binary-level DI?
  • Provides common formats by extracting information
    from binary code.

[Figure: source code (.c) is compiled into binary code (a.out). Source DI analyzes the source code and binary DI analyzes the binary code; both provide common formats (e.g., DWARF2-XML) for tools.]
20
Why binary-level DI?
  • Many advantages
  • High applicability
  • Few false-positives.
  • More true-positives for low-level info.
  • Low development cost
  • Can improve C tools' precision.

21
What is lightweight DI?
  • Allows several common formats.
  • To be practical! Hard to perfectly integrate.

[Figure: heavyweight DI versus lightweight DI, with DWARF2-XML as one of the common formats.]
22
Summary
  • Imprecision in C tools.
  • Our solution
    • Binary-level lightweight data integration.
    • As a testbed, DWARF2 is used to develop:
      • dxref, rxref: cross-referencers
      • bscg: a call-graph extractor

23
Future work
  • Apply our technique to other tools
    • e.g., memory profilers, slicers, test coverage
      tools, ...
  • Develop new binary formats suitable for lower
    CASE tools.
    • tool-information carrying code.
    • cf. proof-carrying code, model-carrying code,
      schedule-carrying code.

25
Taxonomy of cross referencers.
  • Source-level
    • Partial-parsing: GNU GLOBAL, LXR, ...
    • Full-parsing: Sapid, ACML
  • Binary-level
    • Symbol tables: Visual Studio .NET(?)
    • Debug info.: dxref
  • Hybrid: rxref

26
What is DWARF2?
  • A binary format for debugging information.
  • Primary target languages
    • C, C++, Fortran, Modula2, Pascal.
  • Includes
  • types, nested blocks, line numbers,
    function/object names, addresses, stack frame
    information, ...

27
DWARF2-XML
  • Our common format in XML for DWARF2.
  • A testbed of binary-level lightweight DI.
  • Makes it easier to process DWARF2.
  • cf. libdwarf
  • About 15 times larger than DWARF2.

28
DWARF2-XML example
Source: int i; ...

<section name=".debug_info">
  <tag name="DW_TAG_lexical_block" offset="id27">
    <attribute name="DW_AT_low_pc" value="67328"/>
    <attribute name="DW_AT_high_pc" value="67356"/>
    ...
    <tag name="DW_TAG_variable" offset="id27">
      <attribute name="DW_AT_name" value="i"/>
      <attribute name="DW_AT_type" value_ref="id161">
      <attribute name="DW_AT_location">
        <description>DW_OP_fbreg -24</description></></></></>
  ...
  <tag name="DW_TAG_base_type" offset="id161">
    <attribute name="DW_AT_name" value="int"/>
    <attribute name="DW_AT_byte_size" value="4"/>
    <attribute name="DW_AT_encoding" value="5">
      <description>signed</description></></></>

Callouts:
  • address range: DW_AT_low_pc and DW_AT_high_pc
  • variable name: DW_AT_name
  • ID/IDREF link: value_ref="id161" refers to offset="id161"
  • offset to base ptr.: DW_OP_fbreg -24

29
DWARF2-XML file sizes
  • About 15 times larger than DWARF2.
  • Size increase is almost cancelled by gzip.
  • Consumes much memory when using DOM.
  • e.g., we cannot build DOM tree for gdb in our
    environment.
  • Tradeoff between memory consumption and low
    development cost.

            source   a.out    .debug_   DWARF2-XML   compressed by gzip
x_debug.c   27KB     77KB     50KB      1.1MB        58KB
readelf.c   315KB    575KB    137KB     2.1MB        128KB
bash        1.2MB    2.9MB    705KB     16.3MB       815KB
gdb         12MB     21.5MB   14.4MB    276MB        14MB

gdb's LOC is about 400,000.
30
Execution speed
  • bscg is slower than the other tools, but acceptable
    for practical use.
    • 12000 lines in 8.8 sec.
    • but performance is poor in the case of bash-2.03.
  • bscg has a scalability problem due to the heavy
    overhead of the DOM library.

31
Why XML?
  • Highly readable, portable, interoperable.
  • plain-text and self-descriptiveness.
  • Powerful enough to describe complex structures
    and relations in programs.
  • Nested tags and ID/IDREF links.
  • DTD for checking XML documents.
  • Flexibility to process semi-structured documents.
  • Easy to query/display/modify.
  • XML parsers, DOM/SAX, XPath.
  • An XPath expression is much shorter than tedious
    tree-traversal code.

32
Drawbacks in API integration
e.g., libdwarf
  • Insufficient abstraction.
    • The many varied data structures and access
      patterns make it hard to encapsulate them well
      in a fixed API.
    • e.g., libdwarf's API is poor for traversing a wide
      range of the data tree (only dwarf_siblingof and
      dwarf_child are provided).
  • High cost to implement the API in many languages.
  • High cost to learn how to use the API.

33
false/true positive/negative
  • false positives
    • tool's incorrect output.
  • true positives
    • tool's correct output.
  • false negatives
    • tool's incorrect silence.
    • the tool should have produced output, but did not.
  • true negatives
    • tool's correct silence.
    • the tool should not have produced output, and did not.
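
The four cases form a two-by-two table over "did the tool report it?" and "does it really exist?". A minimal sketch of that table as a C function follows; the names outcome, classify and outcome_name are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum outcome {
    TRUE_POSITIVE,   /* reported, and really present */
    FALSE_POSITIVE,  /* reported, but not present */
    FALSE_NEGATIVE,  /* not reported, although present */
    TRUE_NEGATIVE    /* not reported, and not present */
};

/* Classify one item (e.g., one call edge) from what the tool reported
   and from the ground truth. */
enum outcome classify(bool reported, bool exists)
{
    if (reported)
        return exists ? TRUE_POSITIVE : FALSE_POSITIVE;
    return exists ? FALSE_NEGATIVE : TRUE_NEGATIVE;
}

/* Human-readable name of an outcome. */
const char *outcome_name(enum outcome o)
{
    switch (o) {
    case TRUE_POSITIVE:  return "true positive";
    case FALSE_POSITIVE: return "false positive";
    case FALSE_NEGATIVE: return "false negative";
    case TRUE_NEGATIVE:  return "true negative";
    }
    assert(!"unknown outcome");
    return "unknown";
}
```

For instance, a call edge that exists at run time but is missing from a tool's output is classify(false, true), a false negative.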

34
bscg's graph trimming options
35
Why lightweight DI?
  • To be practical! Hard to perfectly integrate.
  • Supported by the fact that most technologies have
    given up on perfect integration/definition.
    • e.g., undefined behaviors in C.
    • e.g., GNU BFD provides an API that integrates
      different binary formats.
      • useful, but not perfect.
      • cannot convert ELF/DWARF2 into Windows PE.

36
Why is function pointer analysis difficult in C?
  • Pointer arithmetic and casting.
    • e.g., (int (*)())(base + offset)
  • Dynamic library
    • e.g., handle = dlopen (libname, RTLD_LAZY);
      func = dlsym (handle, funcname); f ()
  • Inline assembly code
    • e.g., asm ("call foo")

37
CASE tools development cost
  • Generally very high.
    • individual parsers and analyzers.
    • internal data is less interoperable and portable.
  • IBM Eclipse
    • $40,000,000 (?)

38
E.g., function pointer
  • Cflow
    • apply calls f (false positive)
  • gprof-callgraph.pl
    • apply calls add5 (true positive)
  • Other tools (bscg)
    • apply calls ? (false negative)

int add5 (int x) { return x + 5; }
int apply (int (*f)(int), int x) { return f (x); }
int main (void) { return apply (add5, 10); }

39
Our homepage
  • http://www.sde.cs.titech.ac.jp/gondow/dwarf2-xml/
  • DTD for DWARF2-XML
  • Source code of readelf, dxref, rxref, bscg
  • Some sample outputs