Context Threading: A flexible and efficient dispatch technique for virtual machine interpreters

Learn more at: http://www.cs.toronto.edu

1
Context Threading: A flexible and efficient
dispatch technique for virtual machine
interpreters

Marc Berndl
Benjamin Vitale
Mathew Zaleski
Angela Demke Brown

Research supported by IBM CAS, NSERC, CITO
2
Interpreter performance

Why not just JIT?
High performance JVMs still interpret
People use interpreted languages that don't yet
have JITs
They still want performance!
30-40% of execution time is due to branch
misprediction
Our technique eliminates 95% of branch
mispredictions

3
Overview

Motivation
Background: The Context Problem
Existing Solutions
Our Approach
Inlining
Results

4
A Tale of Two Machines
Virtual Machine (Interpreter): Virtual Program, load, Bytecode Bodies, Execution Cycle
Real Machine (CPU): Pipeline, Execution Cycle
Predictors: Target Address (Indirect), Return Address, Wayness (Conditional)
5
Interpreter
Diagram labels: fetch, Load Parms, Bytecode bodies, Internal Representation, Execution Cycle
6
Running Java Example
Java Source
void foo() { int i = 1; do { i += i; } while (i < 64); }
Java Bytecode
0: iconst_1
1: istore_1
2: iload_1
3: iload_1
4: iadd
5: istore_1
6: iload_1
7: bipush 64
9: if_icmplt 2
12: return
7
Switched Interpreter
Virtual Program
iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2
Internal Representation
<iload_1> <iload_1> <iadd> <istore_1> <iload_1> <bipush> 64 <if_icmplt> -7
Switched Body Implementation
while (1) { opcode = *vpc++; switch (opcode) {
case iload_1: .. break;
case iadd: .. break;
} }
Simple, portable and extremely slow
8
Switched Interpreter
while (1) { opcode = *vPC++; switch (opcode) {
case iload_1: .. break;
case iadd: .. break;
// and many more..
} }
Slow: burdened by switch and loop overhead
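
The switched interpreter outlined above can be sketched in C as follows. This is a minimal illustration, not the presenters' code: the opcode numbering, the one-slot-per-element program layout, and the names `run_switched` and `foo_program` are assumptions made for the example, and only the seven instructions of the running example are handled. The program follows the Java source (i starts at 1), and the branch operand is stored as an offset relative to the branch opcode, as in the internal representation shown on the slide.

```c
#include <assert.h>

/* Hypothetical opcode numbering for this sketch (not the real JVM encoding). */
enum { OP_ICONST_1, OP_ISTORE_1, OP_ILOAD_1, OP_IADD,
       OP_BIPUSH, OP_IF_ICMPLT, OP_RETURN };

/* Runs a virtual program with switch dispatch and returns local variable 1. */
int run_switched(const int *program)
{
    const int *vpc = program;      /* virtual program counter */
    int stack[16];
    int sp = 0;                    /* index of the next free stack slot */
    int local1 = 0;

    while (1) {
        int opcode = *vpc++;       /* fetch */
        /* Sketch-level check before a possible push; a real VM would
           verify the bytecode up front instead. */
        assert(sp >= 0 && sp < 16);
        switch (opcode) {          /* dispatch: one shared indirect branch */
        case OP_ICONST_1: stack[sp++] = 1;                  break;
        case OP_ISTORE_1: local1 = stack[--sp];             break;
        case OP_ILOAD_1:  stack[sp++] = local1;             break;
        case OP_IADD:     sp--; stack[sp - 1] += stack[sp]; break;
        case OP_BIPUSH:   stack[sp++] = *vpc++;             break;
        case OP_IF_ICMPLT: {
            int offset = *vpc++;   /* relative to the opcode's own slot */
            sp -= 2;
            if (stack[sp] < stack[sp + 1])
                vpc += offset - 2; /* undo the two increments, then jump */
            break;
        }
        case OP_RETURN:   return local1;
        default:          return -1;   /* unknown opcode */
        }
    }
}

/* The running example: i = 1; do { i += i; } while (i < 64); */
const int foo_program[] = {
    OP_ICONST_1, OP_ISTORE_1,
    OP_ILOAD_1, OP_ILOAD_1, OP_IADD, OP_ISTORE_1,
    OP_ILOAD_1, OP_BIPUSH, 64, OP_IF_ICMPLT, -7,
    OP_RETURN
};
```

Every virtual instruction pays for the loop back-edge and the switch's range check and table jump, and all of them share the single indirect branch that the compiler generates for the switch.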
9
Threading Dispatch
0: iconst_1, 1: istore_1, 2: iload_1, 3: iload_1, 4: iadd,
5: istore_1, 6: iload_1, 7: bipush 64, 9: if_icmplt 2, 12: return
Execution of the virtual program threads through
the bodies (as in needle and thread)

No switch overhead. Data driven indirect branch.

10
Context Problem
0: iconst_1, 1: istore_1, 2: iload_1, 3: iload_1, 4: iadd,
5: istore_1, 6: iload_1, 7: bipush 64, 9: if_icmplt 2, 12: return
Indirect branch predictor (micro-arch)

Data-driven indirect branches are hard to predict

11
Threading Dispatch
Execution of the virtual program threads through
the bodies (as in needle and thread)

No switch overhead. Data driven indirect branch.

12
Direct Threaded Interpreter
Virtual Program
iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2
DTT - Direct Threading Table
iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt -7
(each entry points to the C implementation of its body)
Target of computed goto is data-driven
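
A direct threaded version of the same loop might look like the sketch below. It relies on the GNU C "labels as values" extension (`&&label` and `goto *ptr`), so it assumes GCC or Clang. The opcode numbering, the `DISPATCH` macro and the names `run_direct_threaded` and `dt_foo_program` are illustrative assumptions, not taken from the slides. At load time the opcodes are translated into a DTT of body addresses, with operands left in place as data.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbering for this sketch. */
enum { DT_ICONST_1, DT_ISTORE_1, DT_ILOAD_1, DT_IADD,
       DT_BIPUSH, DT_IF_ICMPLT, DT_RETURN, DT_COUNT };

/* Translates `program` into a direct threading table, then runs it. */
int run_direct_threaded(const int *program, int length)
{
    /* Address of each body, indexed by opcode (GNU C labels as values). */
    static void *const body[DT_COUNT] = {
        &&do_iconst_1, &&do_istore_1, &&do_iload_1, &&do_iadd,
        &&do_bipush, &&do_if_icmplt, &&do_return
    };
    intptr_t dtt[64];              /* the DTT: body addresses and operands */
    int stack[16];
    int sp = 0, local1 = 0;

    assert(length > 0 && length <= 64);
    for (int i = 0; i < length; i++) {      /* load time translation */
        int op = program[i];
        dtt[i] = (intptr_t)body[op];
        if (op == DT_BIPUSH || op == DT_IF_ICMPLT) {
            i++;
            dtt[i] = program[i];            /* operand stays as data */
        }
    }

    const intptr_t *vpc = dtt;
/* Each body ends with its own copy of this data-driven indirect branch. */
#define DISPATCH() goto *(void *)*vpc++
    DISPATCH();

do_iconst_1:  stack[sp++] = 1;                  DISPATCH();
do_istore_1:  local1 = stack[--sp];             DISPATCH();
do_iload_1:   stack[sp++] = local1;             DISPATCH();
do_iadd:      sp--; stack[sp - 1] += stack[sp]; DISPATCH();
do_bipush:    stack[sp++] = (int)*vpc++;        DISPATCH();
do_if_icmplt: {
        intptr_t offset = *vpc++;  /* relative to the opcode's own slot */
        sp -= 2;
        if (stack[sp] < stack[sp + 1])
            vpc += offset - 2;
        DISPATCH();
    }
do_return:    return local1;
#undef DISPATCH
}

/* The running example: i = 1; do { i += i; } while (i < 64); */
const int dt_foo_program[] = {
    DT_ICONST_1, DT_ISTORE_1,
    DT_ILOAD_1, DT_ILOAD_1, DT_IADD, DT_ISTORE_1,
    DT_ILOAD_1, DT_BIPUSH, 64, DT_IF_ICMPLT, -7,
    DT_RETURN
};
```

The switch and the loop are gone, but every body still ends in an indirect branch whose target comes from the DTT rather than from anything the hardware can see at that branch.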
13
Context Problem
Indirect branch predictor (micro-arch)
Direct threaded bytecode bodies (native code)
iload: .. goto *vpc++
iadd: .. goto *vpc++
istore: .. goto *vpc++
Virtual program (DTT), indexed by vpc
The pc of the dispatch branch is insufficient context for
prediction
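
To make the context problem concrete, the sketch below simulates a deliberately simple indirect branch predictor: a table that remembers, for each branch site, the target taken last time. Real predictors are more elaborate; the last-target model, the body numbering and the names `count_mispredictions` and `loop_trace` are assumptions for illustration only. The trace is three iterations of the loop body from the running example, with the virtual branch always looping back.

```c
#include <assert.h>

/* Bodies executed by the loop in the running example (hypothetical ids). */
enum { B_ILOAD_1, B_IADD, B_ISTORE_1, B_BIPUSH, B_IF_ICMPLT, B_COUNT };

/* Counts mispredictions of a last-target predictor over a trace of bodies.
   shared_dispatch != 0: one dispatch branch for everything (switch).
   shared_dispatch == 0: one dispatch branch per body (direct threading). */
int count_mispredictions(const int *trace, int n, int shared_dispatch)
{
    int last_target[B_COUNT];      /* predictor state, one entry per site */
    int misses = 0;

    assert(n >= 0);
    for (int s = 0; s < B_COUNT; s++)
        last_target[s] = -1;       /* cold: nothing predicted yet */

    for (int k = 0; k + 1 < n; k++) {
        int site = shared_dispatch ? 0 : trace[k];  /* branch that executes */
        int target = trace[k + 1];                  /* body it jumps to */
        if (last_target[site] != target)
            misses++;
        last_target[site] = target;
    }
    return misses;
}

/* Three iterations of: iload_1 iload_1 iadd istore_1 iload_1 bipush if_icmplt */
const int loop_trace[] = {
    B_ILOAD_1, B_ILOAD_1, B_IADD, B_ISTORE_1, B_ILOAD_1, B_BIPUSH, B_IF_ICMPLT,
    B_ILOAD_1, B_ILOAD_1, B_IADD, B_ISTORE_1, B_ILOAD_1, B_BIPUSH, B_IF_ICMPLT,
    B_ILOAD_1, B_ILOAD_1, B_IADD, B_ISTORE_1, B_ILOAD_1, B_BIPUSH, B_IF_ICMPLT
};
```

Under this model the shared branch misses 18 of the 20 dispatches in the trace, and per-body dispatch misses 13 of 20. The remaining misses come almost entirely from `iload_1`: its single dispatch branch is followed by `iload_1`, `iadd` and `bipush` in turn, so the branch address alone never identifies the next target.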
14
Context Problem
DTT - Direct Threading Table
Indirect Branch Predictors
15
Existing Solutions
Super Instruction
Replicate
goto *pc
goto *pc
Piumarta & Riccardi: Bodies Replicated
Ertl & Gregg: Bodies and Dispatch Replicated
Limited to relocatable virtual instructions
16
Overview

Motivation
Background: The Context Problem
Existing Solutions
Our Approach
Inlining
Results

17
Key Observation

Virtual and native control flow similar
Linear or straight-line code
Conditional branches
Calls and Returns
Indirect branches
Hardware has predictors for each type
Direct threading uses an indirect branch for everything!
Solution: Leverage hardware predictors

18
Key Observation

Virtual and native control flow have same branch
types
Linear (not really a branch)
Conditional
Calls and Returns
Indirect
Hardware has predictors for each type
Solution: Leverage hardware predictors

19
Essence of our Solution
CTT - Context Threading Table (generated code)
call iload_1
call iload_1
call iadd
call istore_1
call iload_1
..
Virtual Program
iload_1 iload_1 iadd istore_1 iload_1 bipush 64 if_icmplt 2
Bytecode bodies (ret terminated)
Return Branch Predictor Stack
Package bodies as subroutines and call them
20
Subroutine Threading
Bytecode bodies (ret terminated)
call iload_1
call iload_1
call iadd
call istore_1
call iload_1
call bipush
call if_icmplt
CTT: load-time generated code
Virtual branch instructions as before
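
Generating the CTT at load time amounts to emitting one native call per virtual instruction. The sketch below shows what that could look like for x86, where `call rel32` is the byte 0xE8 followed by a 32-bit little-endian displacement measured from the end of the instruction. The function names, the `body_addr` table and the addresses passed in are assumptions for illustration; operands, virtual branches and making the buffer executable are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes one x86 `call rel32` at code[pos]; returns the next free position.
   code_addr: address the buffer will run from; target: address of the body. */
size_t emit_call(uint8_t *code, size_t pos, uint64_t code_addr, uint64_t target)
{
    uint64_t next = code_addr + pos + 5;   /* address just after the call */
    uint64_t diff = target - next;         /* wraps around for backward calls */

    /* The displacement must fit in a signed 32-bit field. */
    assert(diff <= 0x7FFFFFFFULL || diff >= 0xFFFFFFFF80000000ULL);

    code[pos] = 0xE8;                      /* opcode of call rel32 */
    for (int i = 0; i < 4; i++)            /* little-endian displacement */
        code[pos + 1 + i] = (uint8_t)(diff >> (8 * i));
    return pos + 5;
}

/* Load time: one call per virtual instruction. body_addr[op] is the assumed
   address of the ret-terminated body for opcode op. Returns bytes emitted. */
size_t emit_ctt(uint8_t *code, uint64_t code_addr,
                const int *opcodes, int n, const uint64_t *body_addr)
{
    size_t pos = 0;
    for (int i = 0; i < n; i++)
        pos = emit_call(code, pos, code_addr, body_addr[opcodes[i]]);
    return pos;
}
```

Because each virtual instruction now has its own call site and each body returns with `ret`, the hardware return address stack can predict where execution continues after every body.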
21
Virtual Branches

Context Threading Table
call if_icmplt
call iload_1
call
DTT
target
Virtual Branch body
Context problem remains for virtual branches
22
The Context Threading Table

A sequence of generated call instructions
Good alignment of virtual and hardware control
flow for straight-line code.
Can virtual branches go into the CTT?

23
Specialized Branch Inlining

Branch Inlined Into the CTT
call iload_1
call
DTT
target
Conditional Branch Predictor now mobilized
Inlining conditional branches provides context
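
Once a virtual conditional branch is a native conditional branch in the CTT, its taken/not-taken history is visible to the hardware. As a rough illustration, the sketch below runs the outcomes of `if_icmplt` from the running example through a textbook 2-bit saturating counter. This is an assumed, simplified predictor model, not a description of the Pentium 4 or PowerPC hardware, and the names are hypothetical.

```c
#include <assert.h>

/* Counts mispredictions of a single 2-bit saturating counter.
   outcomes[i] is 1 if the branch was taken and 0 otherwise.
   Counter states are 0..3; the branch is predicted taken in states 2 and 3. */
int two_bit_mispredictions(const int *outcomes, int n, int initial_state)
{
    int counter = initial_state;
    int misses = 0;

    assert(counter >= 0 && counter <= 3);
    for (int i = 0; i < n; i++) {
        int predicted_taken = counter >= 2;
        if (predicted_taken != outcomes[i])
            misses++;
        if (outcomes[i]) {
            if (counter < 3) counter++;    /* strengthen "taken" */
        } else {
            if (counter > 0) counter--;    /* strengthen "not taken" */
        }
    }
    return misses;
}

/* if_icmplt in the running example: i = 2, 4, 8, 16, 32 loop back,
   then i = 64 falls through. */
const int loop_branch_outcomes[] = { 1, 1, 1, 1, 1, 0 };
```

Starting from the weakly not-taken state (1), the counter mispredicts only the first iteration and the loop exit.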
24
Tiny Inlining

Context Threading is a dispatch technique
But, we inline branches
Some non-branching bodies are very small
Why not inline those?
Inline all tiny linear bodies into the CTT
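
The inlining decision can be sketched as a small code generator step: copy a tiny, relocatable body straight into the CTT, and fall back to a call otherwise. The `Body` structure, the `tiny_limit` threshold and the function name are hypothetical, and the sketch assumes x86 bodies that end in a single `ret` (0xC3); the slides do not say how "tiny" is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one bytecode body's native code. */
typedef struct {
    const uint8_t *code;   /* machine code, ending in ret (0xC3) */
    size_t size;           /* size in bytes, including the ret */
    uint64_t addr;         /* address the body executes from */
    int relocatable;       /* nonzero if the bytes may be copied elsewhere */
} Body;

/* Emits one virtual instruction into the CTT at ctt[pos]; returns new pos. */
size_t emit_instruction(uint8_t *ctt, size_t pos, uint64_t ctt_addr,
                        const Body *b, size_t tiny_limit)
{
    assert(b->size >= 1 && b->code[b->size - 1] == 0xC3);
    size_t payload = b->size - 1;               /* body without its ret */

    if (b->relocatable && payload <= tiny_limit) {
        memcpy(ctt + pos, b->code, payload);    /* inline: no call, no ret */
        return pos + payload;
    }

    /* Otherwise emit `call rel32`: 0xE8 plus a little-endian displacement
       measured from the end of the 5-byte call instruction. */
    uint32_t rel = (uint32_t)(b->addr - (ctt_addr + pos + 5));
    ctt[pos] = 0xE8;
    for (int i = 0; i < 4; i++)
        ctt[pos + 1 + i] = (uint8_t)(rel >> (8 * i));
    return pos + 5;
}
```

An inlined body costs no call and no return at all, at the price of a larger CTT, which is why a size threshold of some kind is needed.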

25
What can go in the CTT?

Calls to bodies
Inlined bodies
Mixed-Mode virtual machine?
Performance?

26
Overview

Motivation
Background: The Context Problem
Existing Solutions
Our Approach
Inlining
Results

27
Experimental Setup

Two Virtual Machines on two hardware
architectures.
VM: Java/SableVM, OCaml interpreter
Compare against direct threaded SableVM
SableVM distro uses selective inlining
Arch: P4, PPC
Branch Misprediction
Execution Time
Is our technique effective and general?

28
Mispredicted Taken Branches
Normalized to Direct Threading
SableVM/Java, Pentium 4
95% of mispredictions eliminated on average
29
Execution time
Pentium 4
Normalized to Direct Threading
27% average reduction in execution time
30
Execution Time (geomean)
Normalized to Direct Threading
Our technique is effective and general
31
Conclusions

Context Problem: branch mispredictions due to
mismatch between native and virtual control flow
Solution: Generate control flow code into the
Context Threading Table
Results
Eliminate 95% of branch mispredictions
Reduce execution time by 30-40%
Recent (post-CGO 2005) work follows

32
What about Scripting Languages?