Title: Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters
1. Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters
- Marc Berndl
- Benjamin Vitale
- Mathew Zaleski
- Angela Demke Brown
Research supported by IBM CAS, NSERC, CITO
2. Interpreter performance
- Why not just JIT?
- High-performance JVMs still interpret
- People use interpreted languages that don't yet have JITs
- They still want performance!
- 30-40% of execution time is due to branch misprediction
- Our technique eliminates 95% of branch mispredictions
3. Overview
- Motivation
- Background: The Context Problem
- Existing Solutions
- Our Approach
- Inlining
- Results
4. A Tale of Two Machines
[Diagram: the virtual machine interpreter loads the virtual program and runs its execution cycle over the bytecode bodies; the real machine CPU runs its own execution cycle through the pipeline, with predictors for target address (indirect), return address, and wayness (conditional).]
5. Interpreter
[Diagram: the interpreter's execution cycle. Labels: fetch, load parms, bytecode bodies, internal representation.]
6. Running Java Example
Java Bytecode:
     0: iconst_0
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return
Java Source:
    void foo() {
        int i = 1;
        do {
            i += i;
        } while (i < 64);
    }
7. Switched Interpreter
Virtual Program:
    iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2
Internal Representation:
    <iload_1> <iload_1> <iadd> <istore_1> <iload_1> <bipush> 64 <if_icmplt> -7
Switched Body Implementation:
    while (1) {
        opcode = *vpc++;
        switch (opcode) {
        case iload_1: .. break;
        case iadd: .. break;
        }
    }
→ Simple, portable and extremely slow
8. Switched Interpreter
    while (1) {
        opcode = *vpc++;
        switch (opcode) {
        case iload_1: .. break;
        case iadd: .. break;
        // and many more..
        }
    }
→ Slow: burdened by switch and loop overhead
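The switched loop above can be filled in for the running example. The following is a minimal C sketch, not the authors' code: the opcode numbering, the `int8_t` encoding, the stack layout and the name `run_switched` are all assumptions made for illustration. Only the loop portion of `foo()` (from the first `iload_1` onward) is encoded, and the caller supplies the starting value of local 1.

```c
#include <assert.h>
#include <stdint.h>

/* Opcode numbering is hypothetical; only the names come from the slides. */
enum { OP_ILOAD_1, OP_IADD, OP_ISTORE_1, OP_BIPUSH, OP_IF_ICMPLT, OP_RETURN };

/* Internal representation of the loop, with operands stored inline. */
static const int8_t loop_program[] = {
    OP_ILOAD_1, OP_ILOAD_1, OP_IADD, OP_ISTORE_1,
    OP_ILOAD_1, OP_BIPUSH, 64,
    OP_IF_ICMPLT, -7,          /* offset from the opcode slot, back to slot 0 */
    OP_RETURN
};

/* Interprets `program` with local 1 starting at `initial`;
   returns the final value of local 1, or -1 on an unknown opcode. */
int run_switched(const int8_t *program, int initial)
{
    int stack[8];
    int sp = 0;                      /* index of the next free stack slot */
    int local1 = initial;
    const int8_t *vpc = program;     /* virtual program counter */

    while (1) {
        const int8_t *insn = vpc;    /* slot of the current opcode */
        int opcode = *vpc++;
        assert(sp >= 0 && sp < 8);
        switch (opcode) {            /* every dispatch goes through this one switch */
        case OP_ILOAD_1:  stack[sp++] = local1; break;
        case OP_IADD:     sp--; stack[sp - 1] += stack[sp]; break;
        case OP_ISTORE_1: local1 = stack[--sp]; break;
        case OP_BIPUSH:   stack[sp++] = *vpc++; break;
        case OP_IF_ICMPLT: {
            int offset = *vpc++;
            int rhs = stack[--sp];
            int lhs = stack[--sp];
            if (lhs < rhs)
                vpc = insn + offset; /* taken: jump relative to the opcode slot */
            break;
        }
        case OP_RETURN:   return local1;
        default:          return -1;
        }
    }
}
```

Called as `run_switched(loop_program, 1)`, the sketch doubles local 1 until it is no longer below 64 and returns 64.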
9. Threading Dispatch
     0: iconst_0
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return
Execution of the virtual program threads through the bodies (as in needle and thread).
- No switch overhead. Data-driven indirect branch.
10. Context Problem
     0: iconst_0
     1: istore_1
     2: iload_1
     3: iload_1
     4: iadd
     5: istore_1
     6: iload_1
     7: bipush 64
     9: if_icmplt 2
    12: return
[Diagram: dispatch of the bytecode above goes through the indirect branch predictor (micro-arch)]
- Data-driven indirect branches are hard to predict
11. Threading Dispatch
Execution of the virtual program threads through the bodies (as in needle and thread).
- No switch overhead. Data-driven indirect branch.
12. Direct Threaded Interpreter
Virtual Program:
    iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2
DTT - Direct Threading Table:
    iload_1
    iload_1
    iadd
    istore_1
    iload_1
    bipush
    64
    if_icmplt
    -7
[Diagram: each DTT entry points to the C implementation of its body]
→ Target of computed goto is data-driven
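A direct threaded version of the same loop can be sketched with the GNU C "labels as values" extension (`&&label` and `goto *expr`), which GCC and Clang accept but ISO C does not define. This is an illustrative sketch under assumptions: the label names, the trick of storing operands in the table as casted integers, and the function name `run_direct_threaded` are not from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Runs the loop of foo() with local 1 starting at `initial`. */
int run_direct_threaded(int initial)
{
    /* DTT: each slot holds a body address or an inline operand. */
    void *dtt[] = {
        &&do_iload_1, &&do_iload_1, &&do_iadd, &&do_istore_1,
        &&do_iload_1, &&do_bipush, (void *)(intptr_t)64,
        &&do_if_icmplt, (void *)(intptr_t)(-7),
        &&do_return
    };
    int stack[8];
    int sp = 0;
    int local1 = initial;
    void **vpc = dtt;

    goto **vpc++;                    /* dispatch the first instruction */

do_iload_1:
    assert(sp < 8);
    stack[sp++] = local1;
    goto **vpc++;                    /* each body ends with its own dispatch */
do_iadd:
    sp--;
    stack[sp - 1] += stack[sp];
    goto **vpc++;
do_istore_1:
    local1 = stack[--sp];
    goto **vpc++;
do_bipush:
    assert(sp < 8);
    stack[sp++] = (int)(intptr_t)*vpc++;
    goto **vpc++;
do_if_icmplt: {
    intptr_t offset = (intptr_t)*vpc++;
    int rhs = stack[--sp];
    int lhs = stack[--sp];
    if (lhs < rhs)
        vpc += offset - 2;           /* vpc is 2 slots past the opcode slot */
    goto **vpc++;
}
do_return:
    return local1;
}
```

There is no switch and no loop: the only control transfer between bodies is the `goto **vpc++` at the end of each one, and its target is whatever the table holds at that point.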
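13. Context Problem
[Diagram: the virtual program (DTT), indexed by vpc, points into the direct threaded bytecode bodies (native code); the dispatch branch at the end of each body feeds the indirect branch predictor (micro-arch)]
    iload:  .. goto *vpc++
    iadd:   .. goto *vpc++
    istore: .. goto *vpc++
→ The pc of the dispatch branch is insufficient context for prediction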
14. Context Problem
[Diagram: the DTT (Direct Threading Table) and the indirect branch predictors]
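The context problem can be made concrete with a toy model. The sketch below assumes a very simple predictor that remembers, per branch site, the last target that site jumped to; real hardware predictors differ, and the model ignores the loop exit. It counts mispredicted dispatches for the loop of the running example under three hypothetical ways of identifying the branch site: one shared dispatch branch (switch), one branch per body (direct threading), and one per position in the virtual program (the context the hardware is missing). All names are illustrative.

```c
#include <assert.h>

enum { ILOAD_1, IADD, ISTORE_1, BIPUSH, IF_ICMPLT };

/* Bodies executed by one trip around the loop of the running example. */
static const int loop_trace[] = {
    ILOAD_1, ILOAD_1, IADD, ISTORE_1, ILOAD_1, BIPUSH, IF_ICMPLT
};
enum { LOOP_LEN = 7 };

enum key_mode {
    KEY_SHARED,     /* switch: a single dispatch branch for everything */
    KEY_PER_BODY,   /* direct threading: one dispatch branch per body */
    KEY_PER_VPC     /* idealized: one entry per virtual program position */
};

/* Counts dispatches whose target differs from the last target seen
   at the same branch site (a last-target predictor, cold at start). */
int count_mispredictions(enum key_mode mode, int iterations)
{
    int last_target[LOOP_LEN];
    int misses = 0;

    assert(iterations >= 0);
    for (int i = 0; i < LOOP_LEN; i++)
        last_target[i] = -1;         /* empty entry */

    for (int it = 0; it < iterations; it++) {
        for (int p = 0; p < LOOP_LEN; p++) {
            int target = loop_trace[(p + 1) % LOOP_LEN];
            int site = (mode == KEY_SHARED)   ? 0
                     : (mode == KEY_PER_BODY) ? loop_trace[p]
                     : p;
            if (last_target[site] != target)
                misses++;
            last_target[site] = target;
        }
    }
    return misses;
}
```

For 6 iterations (42 dispatches) this model counts 37 misses with a shared branch, 22 with one branch per body, and 7 (the cold misses only) when the virtual position is the key. The per-body case keeps missing because the branch at the end of `iload_1` alternates between three different successors.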
15. Existing Solutions
- Super Instruction
- Replicate
[Diagram: replicated bodies, each ending in its own "goto *pc"]
- Piumarta & Riccardi: bodies replicated
- Ertl & Gregg: bodies and dispatch replicated
→ Limited to relocatable virtual instructions
16. Overview
- Motivation
- Background: The Context Problem
- Existing Solutions
- Our Approach
- Inlining
- Results
17. Key Observation
- Virtual and native control flow are similar
  - Linear or straight-line code
  - Conditional branches
  - Calls and Returns
  - Indirect branches
- Hardware has predictors for each type
- Direct threading uses an indirect branch for everything!
- Solution: leverage hardware predictors
18. Key Observation
- Virtual and native control flow have the same branch types
  - Linear (not really a branch)
  - Conditional
  - Calls and Returns
  - Indirect
- Hardware has predictors for each type
- Solution: leverage hardware predictors
19. Essence of our Solution
Virtual Program:
    iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2
CTT - Context Threading Table (generated code):
    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1
    ..
[Diagram: each call enters a bytecode body (ret terminated); the returns are predicted by the return branch predictor stack]
→ Package bodies as subroutines and call them
20. Subroutine Threading
CTT (load-time generated code):
    call iload_1
    call iload_1
    call iadd
    call istore_1
    call iload_1
    call bipush
    call if_icmplt
[Diagram: each call enters a bytecode body (ret terminated)]
→ Virtual branch instructions as before
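Portable C cannot generate native call instructions at load time, so the following is only a hand-written analogue of what the CTT for the running example amounts to: each body is an ordinary function (and so ends in a native return), and the straight-line part of the virtual program becomes a sequence of direct calls. Two simplifications are assumptions of this sketch rather than part of the technique: `bipush` receives its operand as an argument instead of reading it from the DTT, and the virtual branch is expressed as a C loop rather than dispatched "as before" through an indirect branch. The state layout and function names are hypothetical.

```c
#include <assert.h>

/* VM state shared by the bodies (a hypothetical layout). */
static int stack[8];
static int sp;
static int local1;

/* Bytecode bodies packaged as subroutines. */
static void iload_1(void)   { assert(sp < 8); stack[sp++] = local1; }
static void iadd(void)      { sp--; stack[sp - 1] += stack[sp]; }
static void istore_1(void)  { local1 = stack[--sp]; }
static void bipush(int v)   { assert(sp < 8); stack[sp++] = v; }

/* Pops two values and reports whether the virtual branch is taken. */
static int if_icmplt(void)
{
    int rhs = stack[--sp];
    int lhs = stack[--sp];
    return lhs < rhs;
}

/* Stand-in for the generated call sequence of the loop of foo(). */
int run_call_sequence(int initial)
{
    sp = 0;
    local1 = initial;
    do {
        iload_1();
        iload_1();
        iadd();
        istore_1();
        iload_1();
        bipush(64);
    } while (if_icmplt());
    return local1;
}
```

An optimizing compiler is free to inline these small functions, so the sketch shows the shape of the call sequence rather than guaranteeing that call and return instructions are actually emitted.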
21. Virtual Branches
[Diagram: Context Threading Table entries (call if_icmplt, call iload_1, call ..), the virtual branch body, and its target held in the DTT]
→ Context problem remains for virtual branches
22. The Context Threading Table
- A sequence of generated call instructions
- Good alignment of virtual and hardware control flow for straight-line code.
- Can virtual branches go into the CTT?
23. Specialized Branch Inlining
[Diagram: the branch is inlined into the CTT alongside entries such as "call iload_1"; its target comes from the DTT; the conditional branch predictor is now mobilized]
→ Inlining conditional branches provides context
24. Tiny Inlining
- Context Threading is a dispatch technique
- But, we inline branches
- Some non-branching bodies are very small
- Why not inline those?
- Inline all tiny linear bodies into the CTT
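The load-time decision described above (call a body, inline a tiny body, or inline a branch) can be sketched as a pass over the virtual program. The code below does not generate machine code; it only writes a textual listing of what would go into each CTT slot. Which bodies count as "tiny", the listing syntax, and all names are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum kind { KIND_CALL, KIND_TINY, KIND_BRANCH };

struct body_info {
    const char *name;
    enum kind kind;
    int operands;       /* inline operand slots following the opcode */
};

enum { ILOAD_1, IADD, ISTORE_1, BIPUSH, IF_ICMPLT };

/* Classification of bodies; the choice of "tiny" bodies is hypothetical. */
static const struct body_info bodies[] = {
    [ILOAD_1]   = { "iload_1",   KIND_TINY,   0 },
    [IADD]      = { "iadd",      KIND_CALL,   0 },
    [ISTORE_1]  = { "istore_1",  KIND_TINY,   0 },
    [BIPUSH]    = { "bipush",    KIND_CALL,   1 },
    [IF_ICMPLT] = { "if_icmplt", KIND_BRANCH, 1 },
};

/* Loop of the running example, operands inline. */
static const int example_program[] = {
    ILOAD_1, ILOAD_1, IADD, ISTORE_1, ILOAD_1, BIPUSH, 64, IF_ICMPLT, -7
};

/* Writes one listing line per virtual instruction into `out`.
   Returns the number of lines, or -1 if `out` is too small. */
int emit_ctt_listing(const int *program, int length, char *out, size_t cap)
{
    size_t used = 0;
    int lines = 0;

    assert(out != NULL && cap > 0);
    out[0] = '\0';
    for (int slot = 0; slot < length; ) {
        const struct body_info *b = &bodies[program[slot]];
        int n;
        switch (b->kind) {
        case KIND_TINY:      /* tiny linear body: copy it into the CTT */
            n = snprintf(out + used, cap - used, "inline %s\n", b->name);
            break;
        case KIND_BRANCH:    /* branch: inline it with a resolved target */
            n = snprintf(out + used, cap - used, "inline %s -> slot %d\n",
                         b->name, slot + program[slot + 1]);
            break;
        default:             /* everything else: call the body */
            n = snprintf(out + used, cap - used, "call %s\n", b->name);
            break;
        }
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
        slot += 1 + b->operands;
        lines++;
    }
    return lines;
}
```

For `example_program` the listing has seven lines: the `iload_1` and `istore_1` bodies are inlined, `iadd` and `bipush` are called, and the final `if_icmplt` is inlined with its target resolved to slot 0.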
25. What can go in the CTT?
- Calls to bodies
- Inlined bodies
- Mixed-Mode virtual machine?
- Performance?
26. Overview
- Motivation
- Background: The Context Problem
- Existing Solutions
- Our Approach
- Inlining
- Results
27. Experimental Setup
- Two virtual machines on two hardware architectures.
- VM: Java/SableVM, OCaml interpreter
- Compare against direct threaded SableVM
- SableVM distro uses selective inlining
- Arch: P4, PPC
- Branch Misprediction
- Execution Time
- Is our technique effective and general?
28. Mispredicted Taken Branches
[Chart: mispredicted taken branches, normalized to direct threading; SableVM/Java on Pentium 4]
→ 95% of mispredictions eliminated on average
29. Execution time
[Chart: execution time on Pentium 4, normalized to direct threading]
→ 27% average reduction in execution time
30. Execution Time (geomean)
[Chart: geometric mean of execution time, normalized to direct threading]
→ Our technique is effective and general
31. Conclusions
- Context Problem: branch mispredictions due to mismatch between native and virtual control flow
- Solution: generate control flow code into the Context Threading Table
- Results
  - Eliminate 95% of branch mispredictions
  - Reduce execution time by 30-40%
- Recent, post-CGO 2005, work follows
32. What about Scripting Languages?
- Recently ported context threading to TCL.
- 10x cycles executed per bytecode dispatched.
- Much lower dispatch overhead.
- Speedup due to subroutine threading, approx. 5%.
- TCL conference 2005
[Chart: cycles per virtual instruction]