A Practical Introduction to TALx86

1
A Practical Introduction to TALx86
  • CS342
  • February 15, 2007
  • Spencer Burdette

2
References
  • Greg Morrisett, Karl Crary, Neal Glew, Dan
    Grossman, Richard Samuels, Frederick Smith, David
    Walker, Stephanie Weirich, and Steve Zdancewic.
    TALx86: A realistic typed assembly language. In
    the 1999 ACM SIGPLAN Workshop on Compiler Support
    for System Software, pages 25-35, Atlanta, GA,
    USA, May 1999.
  • Kedar N. Swadi and Andrew W. Appel. Typed Machine
    Language and its Semantics.
  • Dan Grossman and Greg Morrisett. Scalable
    Certification for Typed Assembly Language. In
    the 2000 ACM SIGPLAN Workshop on Types in
    Compilation, Montreal, Canada, September 2000.
  • Numerous Wikipedia articles:
    http://www.wikipedia.org
  • The Cornell TALx86 homepage:
    http://www.cs.cornell.edu/talc/

3
Prerequisites
  • Type theory notation and operations
  • Various notations that formalize programming
    semantics
  • Polymorphic lambda calculus
  • First-order predicate calculus
  • Functional programming languages that lend
    themselves to theorem proving (ML, Caml, OCaml)
  • High-level language polymorphism constructs
    (ad hoc, subtyping, parametric, etc.)
  • Assembler operations and optimizations (x86,
    control-flow graphs, tail-call elimination,
    constant folding, etc.)

4
Type Safety
  • A broader interpretation than mere data type
    safety (e.g. casting a long int to a char).
  • A type error is defined as an attempt to perform
    an operation on some value that is not
    appropriate to its type. Examples:
  • Setting the program counter to an address
    found in a local buffer (buffer overflow).
  • Performing an arithmetic operation on
    uninitialized data.
  • Dereferencing a NULL pointer.

5
Type Safety (cont'd)
  • Type safety is closely linked with memory safety.
  • Allowing an arbitrary integer to be used as a
    pointer violates the principles of memory safety.
  • Bounds checking of arrays is required for type
    safety.
  • For the sake of simplicity, assume that most (if
    not all) types of programmatic flaws that have
    been discussed thus far in our class can be
    attributed to type violations.
  • Type safety ultimately aims to provide strong
    guarantees about the runtime behavior of a
    system.
  • Useful for proof-carrying or certified code
    deployment.

6
Type Safety Classifications (1 of 4)
  • Static
  • Early compile time (semantic analysis): the
    compiler reports mismatched types in assignments
    and other expressions, which the programmer
    remedies with explicit casts.
  • Late or post compile time (optimization time):
    the compiler (or a related tool) constructs data
    and control flow graphs and infers the types of
    values from a series of unifications and
    reductions (e.g. the Hindley-Milner type
    inference algorithm).

7
Type Safety Classifications (2 of 4)
  • Dynamic: the run-time environment assigns or
    infers the types of variables during execution.
  • Hybrid: polymorphic behavior of the language
    requires annotation of certain variables that
    cannot be resolved to a type during the static
    phase.

8
Type Safety Classifications (3 of 4)
  • Nominative vs. Structural
  • Nominative types must share the precise name to
    be compatible.
  • Structural types must describe values that
    share the same structure to be compatible.
  • Weak vs. Strong
  • Experts agree that the weakly vs. strongly typed
    distinction is a grey area.
  • Indications that a programming language is weakly
    typed:
  • The compiler inserts implicit type conversions on
    behalf of the programmer.
  • The language allows the programmer access to the
    underlying bit patterns of data types, thus
    allowing them to bypass type checking.
  • Data types can be cast or used directly for
    memory access.
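
As a rough illustration of the "underlying bit patterns" indication above, the sketch below uses Python's standard struct module to read the bytes of a 32-bit float back as an unsigned integer. The helper name float_bits is hypothetical and exists only for this example; it is not taken from the presentation.

```python
import struct


def float_bits(x: float) -> int:
    """Reinterpret the IEEE 754 single-precision encoding of x as an unsigned 32-bit int."""
    raw = struct.pack("<f", x)          # 4 bytes, little-endian float
    return struct.unpack("<I", raw)[0]  # the same 4 bytes read back as an unsigned int


print(hex(float_bits(1.0)))  # 0x3f800000
```

The value never passes through a numeric conversion; only the interpretation of the bytes changes, which is the kind of access that lets a program sidestep type checking.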

9
Type Safety Classifications (4 of 4)
  • The series of type classifications results in a
    matrix:

Language    Static/Dynamic  Weak/Strong  Nominative/Structural  Safe?
assembly    none            strong       structural             no
C           static          weak         nominative             no
Java        hybrid          strong       nominative             yes
JavaScript  dynamic         weak         nominative             yes
Lisp        dynamic         strong       structural             yes
ML          static          strong       structural             yes

10
What is TALx86?
  • Typed Assembly Language for a subset of the Intel
    x86 ISA.
  • Consists of a RISC-like assembly language and
    operational semantics for a simple abstract
    machine.
  • A formal type system that captures the possible
    register, stack, and memory states of a program
    as well as their transitions.
  • Rigorous proofs (well beyond the scope of this
    presentation, thankfully) have demonstrated that
    TAL enforces certain safety guarantees.

11
Without the Jargon
  • TALx86 is a low-level target language, analogous
    to Java bytecode, that is intended to support a
    variety of statically typed source languages,
    whether weakly or strongly typed.
  • Like any good intermediate language, TALx86 has
    been designed to support common assembly-level
    optimizations.
  • The TALx86 project provides tools for the
    assembly, disassembly, and linking of TAL
    binaries.

12
Advantages over JVML
  • Semantic errors have been uncovered in the JVML
    verifier.
  • It has been suggested that if type-soundness
    theorems had been applied to the JVML during the
    design phase, more bugs would have been
    prevented.
  • It is difficult to compile high-level languages
    other than Java to the JVML, since the
    instructions and types are specifically tailored
    to Java.
  • JIT compilation is used to accelerate
    performance; however, an error in the JIT
    compiler can introduce a security hole, since JIT
    translation occurs after the verification step.

13
Diagram of Process
14
Explanation of Process
  • Client receives packaged .tal files, similar to a
    .jar.
  • Without access to or knowledge of the program
    source code or compiler, the TALx86 type verifier
    and link checker can be run.
  • To prepare the code for execution, trusted
    modules are linked in for run time support, and
    memory management and array access and update
    macros are expanded.
  • A somewhat optimized, type-safe native machine
    code binary is produced.

15
Type Annotation Classes
  1. Import and export interface information.
  2. Type constructor declarations, for new types and
     type abbreviations.
  3. Typing preconditions on code labels. Registers
     must have specific types before control may enter
     the associated code.
  4. Types on data labels, to specify the type of a
     static data item.
  5. Typing coercions on instruction operands.
  6. Macro instructions, used to encapsulate small
     instruction sequences.

16
Crux of TALx86
  • The most important feature of the type checker is
    the third annotation class, the typing
    preconditions on code labels. General form of
    the annotations:
  • ∀α1:κ1 ... αn:κn.{r1:τ1, ..., rn:τn}
  • Registers r1 through rn must contain values of
    types τ1 through τn before control is passed to
    the corresponding label.
  • The bound type variables α1 ... αn allow the
    types of registers to be polymorphic, by treating
    them as abstract types.
  • Each bound type variable is labeled with a kind,
    κ, so that only appropriate types are used to
    instantiate the bound type variables.
17
Reference code snippet
  • /* Calculate sum of first n natural numbers. */
  • int i = n+1;
  • int s = 0;
  • while (--i > 0)
  •     s += i;
  • Translated into TALx86 as (assumes n initially
    resides in ecx):
  •       mov eax, ecx   ; i = n
  •       inc eax        ; ++i
  •       mov ebx, 0     ; s = 0
  •       jmp test
  • body: {eax: B4, ebx: B4}
  •       add ebx, eax   ; s += i
  • test: {eax: B4, ebx: B4}
  •       dec eax        ; --i
  •       cmp eax, 0     ; i > 0
  •       jg body
18
An Actual Function
  • int sum(int n) {
  •     int i = n+1;
  •     int s = 0;
  •     while (--i > 0)
  •         s += i;
  •     return s;
  • }
  • Assume the caller places the return address in
    ebp and expects the return value in eax:
  • sum:  {ecx: B4, ebp: {eax: B4}}
  •       mov eax, ecx   ; i = n
  •       inc eax        ; ++i
  •       mov ebx, 0     ; s = 0
  •       jmp test
  • body: {eax: B4, ebx: B4, ebp: {eax: B4}}
  •       add ebx, eax   ; s += i
  • test: {eax: B4, ebx: B4, ebp: {eax: B4}}
  •       dec eax        ; --i

19
Problems with Previous Snippet
  • Recall from the previous slide that the function
    sum() required the return address in ebp, the
    function parameter in ecx, and the return value
    in eax. This is a very ad hoc approach, far from
    the standard calling convention.
  • The standard C calling convention typically
    places arguments, the return address, the old
    base pointer, local variables, and possibly a
    return value on the stack.
  • Values are not often passed among functions
    through registers.
  • TALx86 is constructed so as to not be bound to a
    specific calling convention.

20
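Stack Layout for a General Purpose Calling
Convention
  • Example code:
  • Caller(int p) { char a; Callee(a, p); }
  • Callee(char a, int p) { int i; // do stuff }
  • Stack contents, starting from the top of the
    stack:
  • Local variables of Callee
  • Old Base Pointer
  • Return Address
  • Parameters for Callee
  • Local variables of Caller
  • Old Base Pointer
  • Return Address
  • Parameters for Caller
  • Registers labeled in the diagram: esp (top of the
    stack) and ebp.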
21
TALx86 Stack Abstraction and Stack Datatypes
  • Stack is defined as a list of types.
  • σ is a stack type.
  • se represents an empty stack.
  • τ::σ is a type that describes stacks where the
    topmost element is of type τ and the rest of the
    stack is described by σ.
  • Example:
  • {eax: B4}::B4::B4::se
  • A stack type with three elements: a return address
    expecting a B4 in eax, followed by two B4 values.
  • esp can be bound to a stack type via the sptr
    type constructor (e.g. esp: sptr σ).
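
The list-like structure of stack types can be sketched in Python as nested pairs ending in the empty stack. The names SE, cons, show and depth are hypothetical helpers for this illustration, and the return-address code type is kept as an opaque string.

```python
SE = "se"  # the empty stack type


def cons(t, s):
    """Build the stack type t::s (element type t on top of stack type s)."""
    return (t, s)


def show(s):
    """Render a stack type in the t1::t2::...::se notation."""
    parts = []
    while s != SE:
        head, s = s
        parts.append(head)
    parts.append(SE)
    return "::".join(parts)


def depth(s):
    """Number of elements described by a stack type."""
    n = 0
    while s != SE:
        _, s = s
        n += 1
    return n


ret_addr = "{eax: B4}"  # code type of the return address, opaque in this sketch
example = cons(ret_addr, cons("B4", cons("B4", SE)))
print(show(example))   # {eax: B4}::B4::B4::se
print(depth(example))  # 3
```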

22
Revisiting our Reference Code
  • int sum(int n) { /* Function body */ }
  • The above code originally yielded the following
    TALx86 label:
  • sum: {ecx: B4, ebp: {eax: B4}}
  • This represents a non-standard calling
    convention.
  • Using TALx86 stack abstractions and stack types,
    the sum() label can be rewritten as:
  • sum: {esp: sptr {eax: B4}::B4::se}
  • Read as: esp must contain a stack pointer whose
    topmost element is a pointer to code requiring a
    B4 in eax (i.e. the return address), followed by
    a B4.
  • Spot a problem? The rewritten label does not
    allow for arbitrary stack depths.

23
Supporting Common Calling Conventions (Take 3)
  • TALx86 supports stack polymorphism in order to
    abstract portions of the stack.
  • Rewrite as:
  • sum: {esp: sptr {eax: B4, esp: sptr B4::ρ}::B4::ρ}
  • In order to enter the code associated with the
    sum label:
  • esp must be a stack pointer whose topmost element
    is a pointer to code (the return address).
  • The code pointed to must require:
  • eax to hold a B4 (the return value).
  • esp to hold a stack pointer whose topmost element
    is a B4 (the slot that held the argument n).
  • Following that B4, there can be some other
    stuff, ρ.
  • Following the return address, sum's stack must
    contain a B4 (the input value n), and some other
    stuff, ρ.
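
One way to picture stack polymorphism is as pattern matching: the label fixes the top of the stack and a stack variable absorbs whatever lies beneath. The Python sketch below assumes stacks are lists with the top element first; match_stack, stack_after_return and the placeholder element names "retaddr" and "ptr" are hypothetical.

```python
def match_stack(pattern, tail_var, stack):
    """Match a concrete stack type against pattern followed by a stack variable.

    pattern:  element types that must appear on top of the stack, top first
    tail_var: name of the stack variable that absorbs the rest
    stack:    concrete stack type as a list, top of stack first
    Returns {tail_var: rest_of_stack} on success, or None on mismatch.
    """
    if len(stack) < len(pattern):
        return None
    if stack[:len(pattern)] != pattern:
        return None
    return {tail_var: stack[len(pattern):]}


def stack_after_return(stack_at_entry):
    """Stack shape that sum's return address expects: B4::rho, with the same rho as on entry."""
    binding = match_stack(["retaddr", "B4"], "rho", stack_at_entry)
    if binding is None:
        return None
    return ["B4"] + binding["rho"]


print(stack_after_return(["retaddr", "B4"]))               # ['B4']
print(stack_after_return(["retaddr", "B4", "B4", "ptr"]))  # ['B4', 'B4', 'ptr']
```

Because the variable is bound once and reused, the caller's portion of the stack is guaranteed to come back unchanged regardless of how deep it is.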

24
Dynamic Memory
  • TALx86 provides an assembly-level macro for heap
    allocation of data, malloc.
  • malloc allocates memory and places a pointer to
    the newly allocated space in eax.
  • Proper initialization of dynamic memory is
    critical to type safety, so a variance is added
    to each field.
  • e.g. B4^u, B4^r, B4^w, B4^rw (uninitialized,
    readable, writable, readable and writable)
  • Field variances are tracked.
  • Uninitialized data can be written, but not read.

25
Dynamic Memory (cont'd)
  • Simple Example
  • malloc 4, <B4>
  • mov [eax+0], 3
  • After the first instruction, the verifier assigns
    eax the type B4^u, a pointer to an
    uninitialized B4.
  • After the second instruction, the type of eax
    becomes B4^rw, a pointer to a readable and
    writable B4.
  • Naturally, these types can be used as constraints
    for basic block labels.
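
The variance bookkeeping in this example can be imitated with a small Python model in which each field is a (type, variance) pair. This is a sketch of the idea only; the function names malloc, write_field and read_field are illustrative stand-ins and do not reflect how the real verifier is implemented.

```python
def malloc(field_types):
    """Model of the malloc macro: every field starts uninitialized ('u')."""
    return [(t, "u") for t in field_types]


def write_field(obj, index, value_type):
    """Type of the object after a store, or None if the value has the wrong type."""
    t, _ = obj[index]
    if t != value_type:
        return None
    updated = list(obj)
    updated[index] = (t, "rw")   # initialized: now readable and writable
    return updated


def read_field(obj, index):
    """Type of the loaded value, or None if the field is not readable."""
    t, variance = obj[index]
    return t if "r" in variance else None


eax = malloc(["B4"])             # malloc 4, <B4>   -> field is B4^u
before = read_field(eax, 0)      # rejected: uninitialized data cannot be read
eax = write_field(eax, 0, "B4")  # mov [eax+0], 3   -> field becomes B4^rw
after = read_field(eax, 0)       # accepted: yields a B4
print(before, after)             # None B4
```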

26
Arrays
  • Array sizes and indices cannot always be
    determined statically; however, a type-safe
    language must guarantee that any index lies
    between 0 (inclusive) and the physical size of
    the array (exclusive).
  • TALx86 introduces macros to handle array
    subscripting and updating: asub and aupd,
    respectively.
  • Two type constructors are also provided:
  • S(s), where s is some constant or an abstract
    value, α.
  • array(s, τ^v), where:
  • s is a type expression representing the size of
    the array.
  • τ is the type of each element.
  • v is the variance (one of u, r, w, or rw).

27
Array Example
  • Simple Example
  • Increment one element (index 2) of a 5-integer
    array.
  • lab: {eax: array(5, B4^rw), ebx: S(5)}
  •      mov ecx, 2
  •      ; put eax[ecx] into edx, array size in ebx,
    B4 size is 4
  •      asub edx, eax, 4, ecx, ebx
  •      inc edx
  •      ; put edx into eax[ecx], array size in ebx,
    B4 size is 4
  •      aupd eax, 4, ecx, edx, ebx
  • Clearly this only works for arrays of size 5.
    Here comes the abstract notation:
  • ∀s:Sint.{eax: array(s, B4^rw), ebx: S(s)}
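
The run-time behavior expected of the two array macros can be approximated in Python as checked subscript and update operations. The functions below are hypothetical models named after the macros; their argument order is simplified (the element size of 4 is dropped) and does not match the TALx86 operand order.

```python
def in_bounds(index, size):
    """True when 0 <= index < size."""
    return 0 <= index < size


def asub(array, index, size):
    """Checked subscript: size plays the role of the S(s) value held in ebx."""
    if size != len(array) or not in_bounds(index, size):
        raise IndexError(f"index {index} out of bounds for size {size}")
    return array[index]


def aupd(array, index, value, size):
    """Checked update of array[index] in place."""
    if size != len(array) or not in_bounds(index, size):
        raise IndexError(f"index {index} out of bounds for size {size}")
    array[index] = value


eax = [10, 20, 30, 40, 50]   # array(5, B4^rw)
ebx = 5                      # S(5)
ecx = 2                      # mov ecx, 2
edx = asub(eax, ecx, ebx)    # asub edx, eax, 4, ecx, ebx
edx += 1                     # inc edx
aupd(eax, ecx, edx, ebx)     # aupd eax, 4, ecx, edx, ebx
print(eax)                   # [10, 20, 31, 40, 50]
```

The explicit lower-bound test matters in Python in particular, since a negative index would otherwise silently wrap around to the end of the list.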

28
Link Verification
  • Link Verifier
  • Ensures that xxx.tal files make valid assumptions
    about the files and types they share with other
    modules, using the corresponding xxx_i.tali and
    xxx_e.tali files.
  • In addition to verifying that there are no
    missing or multiple symbol definitions, the
    TALx86 linker checks that the files agree on the
    types of shared values.
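
The three checks named above (missing definitions, multiple definitions, and disagreement on types) can be sketched over a simple dictionary model of modules. The function link_check, the module layout and the type strings are all assumptions made for this illustration; the real link verifier works on the interface files, not on Python dictionaries.

```python
def link_check(modules):
    """Check a set of modules for link errors.

    modules: dict name -> {"exports": {symbol: type}, "imports": {symbol: type}}
    Returns a sorted list of error strings; an empty list means the check passes.
    """
    errors = []
    definitions = {}   # symbol -> (defining module, type)
    for name, m in modules.items():
        for sym, ty in m["exports"].items():
            if sym in definitions:
                errors.append(f"multiple definitions of {sym}")
            else:
                definitions[sym] = (name, ty)
    for name, m in modules.items():
        for sym, ty in m["imports"].items():
            if sym not in definitions:
                errors.append(f"{name}: missing definition of {sym}")
            elif definitions[sym][1] != ty:
                errors.append(f"{name}: type mismatch for {sym}")
    return sorted(errors)


good = {
    "main": {"exports": {"main": "code"}, "imports": {"sum": "int -> int"}},
    "lib": {"exports": {"sum": "int -> int"}, "imports": {}},
}
bad = {
    "main": {"exports": {"main": "code"}, "imports": {"sum": "char -> int"}},
    "lib": {"exports": {"sum": "int -> int"}, "imports": {}},
}
print(link_check(good))  # []
print(link_check(bad))   # ['main: type mismatch for sum']
```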

29
Type Verification
  • Type Verification
  • Using the given annotations, performs type
    verification using a type inference algorithm.
  • The algorithm succeeds if a set of values can be
    generated that preserve the label preconditions
    while progressing through each of the possible
    states.
  • If the algorithm gets stuck, the code is not
    type safe.
  • Optimization methods are performed (constant
    folding, common subexpression elimination,
    tail-call elimination, etc.) to collapse code
    size and minimize runtime overhead.
  • Anything that is proven to be known at assembly
    time is flattened. Run-time checks are preserved
    elsewhere.
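
The "succeeds or gets stuck" behavior can be illustrated with a toy forward checker over basic blocks. It is a deliberately small sketch, not the TALx86 verifier: it knows only the handful of instructions used in the sum loop, treats every integer literal as a B4, and all names (check_block, labels, entry, body, test) are hypothetical.

```python
B4 = "B4"


def check_block(instrs, regs, labels, fallthrough=None):
    """Type-check one basic block; regs maps register -> type on entry."""
    regs = dict(regs)

    def typed(operand):
        # Integer literals are B4; registers are looked up (None if untyped).
        return B4 if isinstance(operand, int) else regs.get(operand)

    def meets(label):
        # Current register types must satisfy the target label's precondition.
        return all(regs.get(r) == t for r, t in labels[label].items())

    for op, *args in instrs:
        if op == "mov":
            dst, src = args
            if typed(src) is None:
                return False                  # stuck: source has no type
            regs[dst] = typed(src)
        elif op in ("inc", "dec"):
            if typed(args[0]) != B4:
                return False
        elif op in ("add", "cmp"):
            if typed(args[0]) != B4 or typed(args[1]) != B4:
                return False
        elif op == "jmp":
            return meets(args[0])             # unconditional jump ends the block
        elif op == "jg":
            if not meets(args[0]):
                return False                  # conditional: checking continues
        else:
            return False                      # unknown instruction: stuck
    return fallthrough is None or meets(fallthrough)


labels = {"body": {"eax": B4, "ebx": B4}, "test": {"eax": B4, "ebx": B4}}
entry = [("mov", "eax", "ecx"), ("inc", "eax"), ("mov", "ebx", 0), ("jmp", "test")]
body = [("add", "ebx", "eax")]
test = [("dec", "eax"), ("cmp", "eax", 0), ("jg", "body")]

ok = (check_block(entry, {"ecx": B4}, labels)
      and check_block(body, labels["body"], labels, fallthrough="test")
      and check_block(test, labels["test"], labels))
print(ok)  # True
```

Each block is checked independently against its own precondition, which is why the label annotations are enough to verify the loop without tracing it at run time.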

30
Benefits
  • Language design has been supported by rigorously
    proven type-soundness theorems.
  • Target language is intended to be more generic
    than JVML, thus allowing a wider range of source
    languages to compile toward it.
  • Ideally, assembly-level optimizations will yield
    higher performance than JVML.

31
Limitations
  • Although the researchers claim that a type-unsafe
    source language (such as C) can be compiled to
    TALx86, the restrictions of the target language
    contradict this claim.
  • TALx86 specifically forbids pointer arithmetic,
    the address operator, and pointer casts, since
    compiling these features safely would impose a
    significant performance penalty. (As we've
    witnessed in previous discussions!)
  • Memory management type variance (u, r, w, rw)
    tracking does not support aliasing.
  • No floating point handling.
  • Essentially, only a subset of higher-level source
    language capabilities is provided, and it is
    only mapped to a subset of the x86 ISA.

32
Discussion
  • THANK YOU