Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

Transcript and Presenter's Notes

1
Context Threading: A flexible and efficient
dispatch technique for virtual machine
interpreters
  • Marc Berndl
  • Benjamin Vitale
  • Mathew Zaleski
  • Angela Demke Brown

Research supported by IBM CAS, NSERC, CITO
2
Interpreter performance
  • Why not just JIT?
  • High performance JVMs still interpret
  • People use interpreted languages that don't yet
    have JITs
  • They still want performance!
  • 30-40% of execution time is due to branch
    misprediction
  • Our technique eliminates 95% of branch
    mispredictions

3
Overview
  • Motivation
  • Background: The Context Problem
  • Existing Solutions
  • Our Approach
  • Inlining
  • Results

4
A Tale of Two Machines
Virtual Machine Interpreter
  • Virtual Program
  • load
  • Bytecode Bodies
  • Execution Cycle
Real Machine CPU
  • Pipeline
  • Execution Cycle
  • Predictors: Target Address (Indirect), Return Address, Wayness (Conditional)
5
Interpreter
Diagram labels: fetch, load parameters, bytecode bodies, internal representation, execution cycle
6
Running Java Example
Java Source
  void foo() {
    int i = 1;
    do { i += i; } while (i < 64);
  }
Java Bytecode
   0  iconst_1
   1  istore_1
   2  iload_1
   3  iload_1
   4  iadd
   5  istore_1
   6  iload_1
   7  bipush 64
   9  if_icmplt 2
  12  return
7
Switched Interpreter
Virtual Program
  iload_1
  iload_1
  iadd
  istore_1
  iload_1
  bipush 64
  if_icmplt 2
Internal Representation
  <iload_1>
  <iload_1>
  <iadd>
  <istore_1>
  <iload_1>
  <bipush>
  64
  <if_icmplt>
  -7
Switched Body Implementation
  while (1) {
    opcode = *vpc++;
    switch (opcode) {
      case iload_1: .. break;
      case iadd: .. break;
    }
  }
  • Simple, portable, and extremely slow
8
Switched Interpreter
  while (1) {
    opcode = *vpc++;
    switch (opcode) {
      // and many more..
      case iload_1: .. break;
      case iadd: .. break;
    }
  }
  • Slow: burdened by switch and loop overhead
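
Presenter's note: the switched loop above can be fleshed out as follows. This is only a minimal sketch under stated assumptions: the opcode numbering, the one-slot operand encoding, and the names `run_switched` and `foo_program` are hypothetical and are not taken from SableVM or any real JVM. The branch operand is stored relative to the branch opcode (-7), as in the internal representation shown earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbering for this sketch (not real JVM encodings). */
enum { OP_ICONST_1, OP_ISTORE_1, OP_ILOAD_1, OP_IADD,
       OP_BIPUSH, OP_IF_ICMPLT, OP_RETURN };

/* foo(): i = 1; do { i += i; } while (i < 64);
   Operands occupy one slot each; -7 is relative to the if_icmplt opcode. */
static const int8_t foo_program[] = {
    OP_ICONST_1, OP_ISTORE_1,
    OP_ILOAD_1, OP_ILOAD_1, OP_IADD, OP_ISTORE_1,
    OP_ILOAD_1, OP_BIPUSH, 64, OP_IF_ICMPLT, -7,
    OP_RETURN
};

/* Interpret `code` with switch dispatch; returns local variable 1. */
int run_switched(const int8_t *code)
{
    int stack[8];
    int sp = 0;
    int local1 = 0;
    const int8_t *vpc = code;

    while (1) {
        int opcode = *vpc++;            /* fetch */
        switch (opcode) {               /* one shared indirect branch */
        case OP_ICONST_1: stack[sp++] = 1;            break;
        case OP_ISTORE_1: local1 = stack[--sp];       break;
        case OP_ILOAD_1:  stack[sp++] = local1;       break;
        case OP_IADD:     sp--; stack[sp - 1] += stack[sp]; break;
        case OP_BIPUSH:   stack[sp++] = *vpc++;       break;
        case OP_IF_ICMPLT: {
            int offset = *vpc++;        /* vpc is now 2 slots past the opcode */
            int b = stack[--sp];
            int a = stack[--sp];
            if (a < b)
                vpc += offset - 2;      /* offset is relative to the opcode */
            break;
        }
        case OP_RETURN:
            return local1;
        default:
            assert(!"unknown opcode");
            return -1;
        }
    }
}
```

Calling `run_switched(foo_program)` doubles `i` from 1 until it reaches 64. Every virtual instruction pays for the loop back-edge, the bounds check the compiler emits for the switch, and the shared indirect jump.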
9
Threading Dispatch
   0  iconst_1
   1  istore_1
   2  iload_1
   3  iload_1
   4  iadd
   5  istore_1
   6  iload_1
   7  bipush 64
   9  if_icmplt 2
  12  return
Execution of the virtual program "threads" through
the bodies (as in needle and thread)
  • No switch overhead. Dispatch is a data-driven indirect branch.

10
Context Problem
   0  iconst_1
   1  istore_1
   2  iload_1
   3  iload_1
   4  iadd
   5  istore_1
   6  iload_1
   7  bipush 64
   9  if_icmplt 2
  12  return
indirect branch predictor (micro-arch)
  • Data-driven indirect branches are hard to predict

11
Threading Dispatch
Execution of the virtual program "threads" through
the bodies (as in needle and thread)
  • No switch overhead. Dispatch is a data-driven indirect branch.

12
Direct Threaded Interpreter
Virtual Program
  iload_1
  iload_1
  iadd
  istore_1
  iload_1
  bipush 64
  if_icmplt 2
DTT - Direct Threading Table (each opcode entry holds the address of its body)
  iload_1
  iload_1
  iadd
  istore_1
  iload_1
  bipush
  64
  if_icmplt
  -7
C implementation of each body
  • Target of computed goto is data-driven
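
Presenter's note: a direct-threaded version of the same example might look like the sketch below. It relies on the "labels as values" extension (`&&label`, `goto *ptr`) supported by GCC and Clang, so it is not standard C. The opcode numbering, the `Slot` union, the 64-slot table limit, and the names `run_direct_threaded` and `foo_program` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbering for this sketch (not real JVM encodings). */
enum { OP_ICONST_1, OP_ISTORE_1, OP_ILOAD_1, OP_IADD,
       OP_BIPUSH, OP_IF_ICMPLT, OP_RETURN, OP_COUNT };

static const int8_t foo_program[] = {
    OP_ICONST_1, OP_ISTORE_1,
    OP_ILOAD_1, OP_ILOAD_1, OP_IADD, OP_ISTORE_1,
    OP_ILOAD_1, OP_BIPUSH, 64, OP_IF_ICMPLT, -7,
    OP_RETURN
};

/* One DTT slot: the address of a body, or an inline operand. */
typedef union { void *body; intptr_t operand; } Slot;

int run_direct_threaded(const int8_t *code, int len)
{
    /* Addresses of the bodies below (GCC/Clang labels-as-values). */
    static void *const body_of[OP_COUNT] = {
        [OP_ICONST_1] = &&do_iconst_1, [OP_ISTORE_1]  = &&do_istore_1,
        [OP_ILOAD_1]  = &&do_iload_1,  [OP_IADD]      = &&do_iadd,
        [OP_BIPUSH]   = &&do_bipush,   [OP_IF_ICMPLT] = &&do_if_icmplt,
        [OP_RETURN]   = &&do_return
    };
    Slot dtt[64];
    int stack[8];
    int sp = 0;
    int local1 = 0;

    /* Load time: translate opcodes into body addresses, copy operands. */
    assert(len <= 64);
    for (int i = 0; i < len; i++) {
        int op = code[i];
        dtt[i].body = body_of[op];
        if (op == OP_BIPUSH || op == OP_IF_ICMPLT) {
            i++;
            dtt[i].operand = code[i];
        }
    }

    Slot *vpc = dtt;
#define DISPATCH() goto *(vpc++)->body   /* one indirect branch per body */
    DISPATCH();

do_iconst_1:  stack[sp++] = 1;                      DISPATCH();
do_istore_1:  local1 = stack[--sp];                 DISPATCH();
do_iload_1:   stack[sp++] = local1;                 DISPATCH();
do_iadd:      sp--; stack[sp - 1] += stack[sp];     DISPATCH();
do_bipush:    stack[sp++] = (int)(vpc++)->operand;  DISPATCH();
do_if_icmplt: {
        intptr_t offset = (vpc++)->operand;  /* vpc is 2 slots past the opcode */
        int b = stack[--sp];
        int a = stack[--sp];
        if (a < b)
            vpc += offset - 2;               /* offset is relative to the opcode */
    }
    DISPATCH();
do_return:
    return local1;
#undef DISPATCH
}
```

`run_direct_threaded(foo_program, (int)sizeof foo_program)` yields the same result as a switched interpreter would, but each body now ends in its own `goto *`, whose target is whatever address the DTT holds next.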
13
Context Problem
indirect branch predictor (micro-arch)
virtual program (DTT)
direct threaded bytecode bodies (native code)
  iload:  ..  goto *vpc++
  iadd:   ..  goto *vpc++
  istore: ..  goto *vpc++
  • The PC of the dispatch branch is insufficient context for
    prediction
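
Presenter's note: the context problem can be illustrated with a toy model. The sketch below assumes a simple "predict the last target seen for this key" table, which is a simplification and not a model of any real processor's predictor. It replays the loop body of the running example, assumes the back-edge is always taken, and counts mispredictions under three hypothetical choices of prediction key. All names are invented for this sketch.

```c
#include <assert.h>

enum Opcode { ILOAD_1, IADD, ISTORE_1, BIPUSH, IF_ICMPLT, NUM_OPCODES };

/* What the predictor uses to tell dispatch branches apart. */
enum Context {
    SHARED_SITE,    /* switch dispatch: one indirect branch for everything */
    PER_BODY_SITE,  /* direct threading: one dispatch branch per body      */
    PER_POSITION    /* idealized: one key per virtual instruction          */
};

enum { LOOP_LEN = 7 };
static const enum Opcode loop_body[LOOP_LEN] = {
    ILOAD_1, ILOAD_1, IADD, ISTORE_1, ILOAD_1, BIPUSH, IF_ICMPLT
};

/* Count mispredicted dispatches over `iterations` trips around the loop. */
int count_mispredictions(enum Context ctx, int iterations)
{
    int last_target[LOOP_LEN];      /* large enough for every key used */
    int misses = 0;

    for (int k = 0; k < LOOP_LEN; k++)
        last_target[k] = -1;        /* cold: nothing predicted yet */

    for (int it = 0; it < iterations; it++) {
        for (int p = 0; p < LOOP_LEN; p++) {
            int target = loop_body[(p + 1) % LOOP_LEN];
            int key = (ctx == SHARED_SITE)   ? 0
                    : (ctx == PER_BODY_SITE) ? (int)loop_body[p]
                    : p;
            assert(key >= 0 && key < LOOP_LEN);
            if (last_target[key] != target)
                misses++;
            last_target[key] = target;
        }
    }
    return misses;
}
```

For six trips (42 dispatches) this model counts 37 mispredictions with a shared site, 22 with per-body sites, and 7 (cold misses only) when every virtual instruction has its own key. The per-body case stays high because the single iload_1 body is followed by iload_1, iadd, and bipush in turn, so its dispatch branch keeps changing target.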
14
Context Problem
DTT - Direct Threading Table
Indirect Branch Predictors
15
Existing Solutions
Super Instruction
Replicate
  goto *pc
  goto *pc
Piumarta & Riccardi: Bodies Replicated
Ertl & Gregg: Bodies and Dispatch Replicated
  • Limited to relocatable virtual instructions
16
Overview
  • Motivation
  • Background: The Context Problem
  • Existing Solutions
  • Our Approach
  • Inlining
  • Results

17
Key Observation
  • Virtual and native control flow are similar
    • Linear or straight-line code
    • Conditional branches
    • Calls and Returns
    • Indirect branches
  • Hardware has predictors for each type
  • Direct threading uses an indirect branch for everything!
  • Solution: Leverage hardware predictors

18
Key Observation
  • Virtual and native control flow have the same branch
    types
    • Linear (not really a branch)
    • Conditional
    • Calls and Returns
    • Indirect
  • Hardware has predictors for each type
  • Solution: Leverage hardware predictors

19
Essence of our Solution
Virtual Program
  iload_1
  iload_1
  iadd
  istore_1
  iload_1
  bipush 64
  if_icmplt 2
CTT - Context Threading Table (generated code)
  call iload_1
  call iload_1
  call iadd
  call istore_1
  call iload_1
  ..
Bytecode bodies (ret terminated)
Return Branch Predictor Stack
  • Package bodies as subroutines and call them
20
Subroutine Threading
CTT: load-time generated code
  call iload_1
  call iload_1
  call iadd
  call istore_1
  call iload_1
  call bipush
  call if_icmplt
Bytecode bodies (ret terminated)
  • Virtual branch instructions are dispatched as before
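
Presenter's note: portable C cannot emit native call instructions at load time, so the sketch below substitutes a different, weaker technique: a table of function pointers invoked from a loop. It is not subroutine threading, because the loop still dispatches through an indirect call. It is only meant to show what packaging bodies as ret-terminated subroutines looks like. The `VM` and `Entry` types, the absolute branch target, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct VM VM;
typedef void (*Body)(VM *);

/* One table entry: the body to call plus an optional inline operand. */
typedef struct { Body body; int operand; } Entry;

struct VM {
    const Entry *table;
    size_t vpc;          /* index of the next entry to call */
    int stack[8];
    int sp;
    int local1;
    int running;
};

/* Each body is an ordinary subroutine, so it ends with a native return. */
static void op_iconst_1(VM *vm) { vm->stack[vm->sp++] = 1; }
static void op_istore_1(VM *vm) { vm->local1 = vm->stack[--vm->sp]; }
static void op_iload_1(VM *vm)  { vm->stack[vm->sp++] = vm->local1; }
static void op_iadd(VM *vm)
{
    int b = vm->stack[--vm->sp];
    vm->stack[vm->sp - 1] += b;
}
static void op_bipush(VM *vm)
{
    /* vpc was already advanced, so this entry is at vpc - 1 */
    vm->stack[vm->sp++] = vm->table[vm->vpc - 1].operand;
}
static void op_if_icmplt(VM *vm)
{
    int b = vm->stack[--vm->sp];
    int a = vm->stack[--vm->sp];
    if (a < b)
        vm->vpc = (size_t)vm->table[vm->vpc - 1].operand; /* absolute index */
}
static void op_return(VM *vm) { vm->running = 0; }

/* foo(): i = 1; do { i += i; } while (i < 64); branch target is entry 2. */
static const Entry foo_table[] = {
    { op_iconst_1, 0 }, { op_istore_1, 0 },
    { op_iload_1, 0 },  { op_iload_1, 0 }, { op_iadd, 0 }, { op_istore_1, 0 },
    { op_iload_1, 0 },  { op_bipush, 64 }, { op_if_icmplt, 2 },
    { op_return, 0 }
};

int run_call_threaded(const Entry *table)
{
    VM vm = { .table = table, .running = 1 };
    while (vm.running) {
        Body body = vm.table[vm.vpc++].body;
        body(&vm);               /* call the body; it returns here */
    }
    assert(vm.sp == 0);
    return vm.local1;
}
```

In the generated-code version described on the slide, each table entry would instead be a direct native `call` with its own address, which is what lets the hardware return predictor pair every body's return with the call that invoked it.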
21
Virtual Branches

Context Threading Table
  call if_icmplt
  call iload_1
  call
DTT
  target
Virtual Branch body
  • Context problem remains for virtual branches
22
The Context Threading Table
  • A sequence of generated call instructions
  • Good alignment of virtual and hardware control
    flow for straight-line code.
  • Can virtual branches go into the CTT?

23
Specialized Branch Inlining
Branch Inlined Into the CTT
  call iload_1
  call
  target
DTT
  target
Conditional Branch Predictor now mobilized
  • Inlining conditional branches provides context
24
Tiny Inlining
  • Context Threading is a dispatch technique
  • But, we inline branches
  • Some non-branching bodies are very small
  • Why not inline those?
  • Inline all tiny linear bodies into the CTT

25
What can go in the CTT?
  • Calls to bodies
  • Inlined bodies
  • Mixed-Mode virtual machine?
  • Performance?

26
Overview
  • Motivation
  • Background: The Context Problem
  • Existing Solutions
  • Our Approach
  • Inlining
  • Results

27
Experimental Setup
  • Two virtual machines on two hardware
    architectures
  • VMs: Java/SableVM, OCaml interpreter
  • Compare against direct-threaded SableVM
  • SableVM distro uses selective inlining
  • Architectures: Pentium 4 (P4), PowerPC (PPC)
  • Branch mispredictions
  • Execution time
  • Is our technique effective and general?

28
Mispredicted Taken Branches
Normalized to Direct Threading
SableVM/Java, Pentium 4
  • 95% of mispredictions eliminated on average
29
Execution time
Pentium 4
Normalized to Direct Threading
  • 27% average reduction in execution time
30
Execution Time (geomean)
Normalized to Direct Threading
  • Our technique is effective and general
31
Conclusions
  • Context Problem: branch mispredictions due to
    mismatch between native and virtual control flow
  • Solution: Generate control flow code into the
    Context Threading Table
  • Results
    • Eliminate 95% of branch mispredictions
    • Reduce execution time by 30-40%
  • Recent (post-CGO 2005) work follows

32
What about Scripting Languages?
  • Recently ported context threading to Tcl.
  • 10x more cycles executed per bytecode dispatched.
  • Much lower dispatch overhead.
  • Speedup due to subroutine threading: approx. 5%.
  • Tcl conference 2005

Cycles per virtual instruction