1
Frameworks for domain-specific optimization at run-time
  • Paul Kelly (Imperial College London)
  • Joint work with Kwok Cheung Yeung and Milos Puzovic

September 2005
2
Where we're coming from
  • I lead the Software Performance Optimisation
    group within Computer Systems
  • Stuff I'd love to talk about another time:
  • Scalable interactive fluid-flow visualisation
  • FPGA and GPU accelerators
  • Bounds-checking for C, links with unchecked code
  • Is Morton-order layout for 2D arrays competitive?
  • Efficient algorithms for scalable pointer alias
    analysis
  • Domain-specific optimisation frameworks
  • Instant-access cycle stealing
  • Proxying in CC-NUMA cache-coherence protocols:
    adaptive randomisation and combining

[Map labels: V&A, Science Museum, Dept of Computing, Albert Hall, Hyde Park]
3
Mission statement
  • Extend optimising compiler technology to
    challenging contexts beyond scope of conventional
    compilers
  • Component-based software: cross-component
    optimisation
  • For example, in distributed systems:
  • Optimisation across network boundaries
  • Between different security domains
  • Maintaining proper semantics in the event of
    failures
  • Emergent mission (mission creep):
  • Design a domain-specific optimisation
    plug-in architecture for compiler/VM

4
Abstraction
  • Most performance improvement opportunities come
    from adapting components to their context
  • Most performance improvement measures break
    abstraction boundaries
  • So the goal of performance programming tool
    support is to get performance without making a mess
    of your code
  • Optimisations are cross-cutting

5
  • Slogan: optimisations are features
  • and features can be separately-deployable,
    separately-marketable components, or aspects
  • How can this be made to work?

6
Open compilers
  • Idea: implement optimisation features as compiler
    passes
  • Need to design an open plug-in architecture for
    inserting new optimization passes
  • Some interesting issues in how to design an
    extensible intermediate representation
  • Feature composition raises research issues:
  • Interference: can we verify that feature A
    doesn't interfere with feature B?
  • Phase ordering problem: which should come first?
  • Can feature B benefit from feature A's program
    analysis?

7
Open virtual machines
  • How about an open optimizing VM?
  • Fresh issues
  • Dynamic installation of optimisation features?
  • Open access to instrumentation/profiling
  • Exploit opportunity to use dynamic information as
    well as static analysis

8
Open virtual machines
  • This talk has three parts
  • Motivating example
  • A framework for deploying optimisations as
    separately-deployable features, or components
  • Support for optimisations that integrate static
    analysis with dynamic information

9
Project strategy
  • Implement aggregation optimisation for .Net
    Remoting
  • Do it with lower overheads than our Java version
  • To do it, build general-purpose tools for:
  • Domain-specific optimisation
  • Run-time optimisation
  • Results so far:
  • Reflective dataflow analysis framework
  • Prototype "optimisations as aspects" framework
  • Plugin architecture for domain-specific
    optimisation features (DSOFs)
  • Elementary Remoting aggregation works, with
    excellent performance

10
Aggregating remote calls
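void m(RemoteObject r, int a) {
    int x = r.f(a);
    int y = r.g(a, x);
    int z = r.h(a, y);
    System.Console.WriteLine(z);
}
[Figure: message sequence diagrams, with arrows labelled a, x, a,x, y, a,y, a,z and z.
Without aggregation: six messages.
With aggregation: two messages, no need to copy x and y back.]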
  • Aggregation:
  • A sequence of calls to the same server can be
    executed in a single message exchange
  • Reduce number of messages
  • Also reduce amount of data transferred:
  • Common parameters
  • Results passed from one call to another
  • Non-trivial correctness issues: see
    Yoshida and Ahern, OOPSLA'05
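
The message counts on this slide can be checked with a small sketch. It is not the Veneer or .Net implementation: the class name AggregationSketch, the message counter and the bodies of f, g and h are invented for illustration, and the "server" is an in-process stand-in rather than a real remote object.

```java
/** Illustrative only: counts messages for the slide's m(r, a) example. */
final class AggregationSketch {
    /** Messages exchanged so far (requests plus replies). */
    static int messages = 0;

    // Stand-ins for the remote methods r.f, r.g and r.h; the bodies are invented.
    static int f(int a) { return a + 1; }
    static int g(int a, int x) { return a * x; }
    static int h(int a, int y) { return y - a; }

    /** One request and one reply per call: six messages in total. */
    static int unaggregated(int a) {
        messages += 2;
        int x = f(a);       // send a, receive x
        messages += 2;
        int y = g(a, x);    // send a,x, receive y
        messages += 2;
        int z = h(a, y);    // send a,y, receive z
        return z;
    }

    /** The whole sequence travels in one request; only z comes back. */
    static int aggregated(int a) {
        messages += 1;      // request: a plus the list of calls to run
        int x = f(a);       // x and y never leave the server
        int y = g(a, x);
        int z = h(a, y);
        messages += 1;      // reply: z
        return z;
    }
}
```

Both versions compute the same z; only the number of exchanges and the data copied back differ.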

11
Real-world benchmarks
  • Simple example: Multi-User Dungeon (from
    Flanagan's Java Examples in a Nutshell)
  • Look method:
  • String mudname = p.getServer().getMudName();
  • String placename = p.getPlaceName();
  • String description = p.getDescription();
  • Vector things = p.getThings();
  • Vector names = p.getNames();
  • Vector exits = p.getExits();
  • Seven aggregated calls

Time taken to execute "look"   Ethernet   ADSL
Without call aggregation       5.4ms      759.6ms
With call aggregation          5.8ms      164.9ms
Speedup                        0.93       4.61

Client: Athlon XP 1800. Servers: Pentium III
500MHz, 650MHz and dual 700MHz. Linux, Sun JDK
1.4.1_01 (Hotspot). Network: Ethernet 10.03 MB/s,
ping 0.1ms; DSL 10.7KB/s, ping 98ms. Mean of 3
trials of 1000 iterations each.
12
Call aggregation: our first implementation
The Veneer virtual JVM intercepts class loading
and fragments each method. An interpretive
executor inspects the local fragment following each
remote call.
int m()
while (p < N)
q = x1.m1(p);
p = 0;
p = x2.m2(p);
System.out.println(p);
return p;
Fragment W: p < N
Fragment X: q = x1.m1(p) (possibly remote)
Fragment Y: p = 0
Fragment Z: p = x2.m2(p); println(p); return p (possibly remote)
  • Each fragment carries use/def and liveness info
  • Y can be executed before X, but p must be copied
  • Z cannot be delayed because p is printed

X Y Z B2
Defs q p p
Uses x1,p,q x2,p p
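
The reordering rules in the bullets above can be sketched as a check on use/def sets. This is a minimal illustration, not Veneer's code: the class FragmentDeps, the Verdict names and the helper vars are hypothetical, and the use/def sets in the usage note are read off the statements of each fragment rather than taken from the table.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: classifies the dependence between a delayed fragment and a later one. */
final class FragmentDeps {
    enum Verdict { FREE, COPY_NEEDED, FORCES }

    /** A code fragment with the variables it defines and uses. */
    static final class Fragment {
        final String name;
        final Set<String> defs;
        final Set<String> uses;

        Fragment(String name, Set<String> defs, Set<String> uses) {
            this.name = name;
            this.defs = defs;
            this.uses = uses;
        }
    }

    static Set<String> vars(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    static boolean overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return !common.isEmpty();
    }

    /** What happens if `later` runs while `delayed` is still pending? */
    static Verdict check(Fragment delayed, Fragment later) {
        if (overlap(delayed.defs, later.uses)) {
            return Verdict.FORCES;        // later needs a value the delayed fragment produces
        }
        if (overlap(delayed.uses, later.defs) || overlap(delayed.defs, later.defs)) {
            // later overwrites a variable the delayed fragment reads or writes;
            // both cases are lumped together here as "copy or rename first"
            return Verdict.COPY_NEEDED;
        }
        return Verdict.FREE;
    }
}
```

With X = (defs q; uses x1, p) and Y = (defs p; no uses), check(X, Y) gives COPY_NEEDED, matching "Y can be executed before X, but p must be copied". With Z = (defs p; uses x2, p) and the println as a fragment that uses p, check(Z, println) gives FORCES.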
13
Call aggregation: our first implementation
  • At this point, executor has collected a sequence
    of delayed remote calls (fragments X and Z)
  • But execution is now forced by need to print
  • Now, we can inspect delayed fragments and
    construct optimised execution plan

Fragment W: p < N; porig = p
Fragment Y is executed first; fragments X and Z
are delayed
Fragment Y: p = 0
Fragment X: q = x1.m1(porig)
Fragment Z: p = x2.m2(p); println(p); return p
  • If x1 and x2 are on the same server, send an aggregate
    call
  • If x1 and x2 are on different servers, send the
    execution plan to x2's server, telling it to
    invoke x1.m1(porig) on x1's server

14
Aggregation with conditional forcing
  • Runtime optimisation is justified for optimising
    heavyweight operations
  • In this example, aggregation is valid if x > y
  • If we intercept the fork we can find out whether
    aggregation will be valid
  • Original Veneer implementation intercepts all
    control-flow forks in methods containing
    aggregation opportunities
  • We need a better analysis, that pays overheads
    only when a benefit might occur

15
Deferred DFA: motivating example
  • Identifies lossy, predictable control-flow
    forks
  • Rescues data-flow information thrown away by
    conservative analysis, by deferring the meet operation
  • Generates data-flow summary functions for the regions
    between forks
  • Uses predicted control flow to stitch together
    the summary function for the actual path, using the
    work-list algorithm

[Figure label: outcome known at run-time]
Deferred data-flow analysis. Shamik Sharma,
Anurag Acharya and Joel Saltz, UC Santa Barbara
technical report TRCS98-38.
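
A minimal sketch of the idea, assuming a simple gen/kill "may" analysis: each region gets a summary function, a conservative analysis meets the two branch summaries at the fork, and a deferred analysis composes only the summary of the branch that is actually taken. The class DeferredDfaSketch, its method names and the facts d1 and d2 are hypothetical, and plain composition along a straight path stands in for the work-list algorithm named on the slide.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: gen/kill summaries for a may-analysis, with and without a deferred meet. */
final class DeferredDfaSketch {
    static Set<String> facts(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /** Summary function of a region: out = gen + (in - kill). */
    static final class Summary {
        final Set<String> gen;
        final Set<String> kill;

        Summary(Set<String> gen, Set<String> kill) {
            this.gen = gen;
            this.kill = kill;
        }

        Set<String> apply(Set<String> in) {
            Set<String> out = new HashSet<>(in);
            out.removeAll(kill);
            out.addAll(gen);
            return out;
        }

        /** Summary of running this region and then `next`. */
        Summary then(Summary next) {
            Set<String> g = new HashSet<>(gen);
            g.removeAll(next.kill);
            g.addAll(next.gen);
            Set<String> k = new HashSet<>(kill);
            k.addAll(next.kill);
            return new Summary(g, k);
        }

        /** Conservative meet: either branch may run, so keep anything either could produce. */
        static Summary meet(Summary a, Summary b) {
            Set<String> g = new HashSet<>(a.gen);
            g.addAll(b.gen);
            Set<String> k = new HashSet<>(a.kill);
            k.retainAll(b.kill);
            return new Summary(g, k);
        }
    }
}
```

If the entry region generates d1, the then-branch kills d1 and generates d2, and the else-branch does nothing, the conservative result after the fork is {d1, d2}, while the deferred result is {d2} or {d1} depending on the branch taken.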
16
DSOFs
  • Domain-specific optimisation features
  • Need a framework to plug the components into
  • What does the framework need to achieve?
  • Cross-cutting
  • Separately-deployable
  • Query language to select target sites
  • Static access to dataflow/dependence information
  • Dynamic access to dataflow/dependence information
  • Let's start with AOP

17
RMI aggregation DSOF, based on Loom aspect weaver
public class RemoteCallDSOF : Loom.Aspect
{
    private OpDomains opDomains = new OpDomains();
    private DelayedCalls delayedCalls = new DelayedCalls();
    private Set delayedCallsDef = new HashedSet();

    public RemoteCallDSOF(DDFAAnalysis analysis)
    {
        this.opDomains = analysis.getOpDomains();
    }

    [Loom.ConnectionPoint.IncludeAll]
    [Loom.Call(Invoke.Instead)]
    public object AnyMethod(object[] args)
    {
        OpDomain thisOpDomain = opDomains.getOpDomain(Context.MethodName);
        OpNode opNode = thisOpDomain.OpNode;
        Set opNodeDef = opNode.getDefs();
        Set opDomainUse = thisOpDomain.getUses();
        // A dependence forces the delayed calls, then this call runs directly
        if ((((Set) (opNodeDef & opDomainUse)).Count > 0) ||
            (((Set) (opDomainUse & delayedCallsDef)).Count > 0))
        {
            delayedCalls.Execute();
            object ret = Context.Invoke(args);
            return ret;
        }
        else
        {
            // No dependence: delay this call too
            delayedCalls.Add(Context.MethodName, args);
            if (!opDomains.hasNext())
            {
                object ret = delayedCalls.Execute();
                return ret;
            }
            return null;
        }
    }
}
  • Static part of pointcut
  • Dynamic part of pointcut, refers to dataflow
    properties of control flow that can be predicted
    from this point
  • getOpDomain() function stitches together summary
    functions
  • thisOpDomain.getUses() function returns all
    variables that are used within the op-domain
  • opNode.getDefs() function returns all variables
    that are defined by op-node

18
RMI Optimisation using a souped-up aspect weaver
public aspect OptimiseRMICall {
    public pointcut LikelyRMICall();
    public pointcut StaticDelayableRMI();

    void around(): LikelyRMICall() && StaticDelayableRMI() {
        if (DynamicDelayableRMI())
            DelayedCalls.add(thisJoinPoint.ProceedClosure());
    }

    void around(): LikelyRMICall() && StaticDelayableRMI() {
        if (!DynamicDelayableRMI()) {
            DelayedCalls.execute();
            proceed();
        }
    }

    void around(): LikelyRMICall() && !StaticDelayableRMI() {
        DelayedCalls.execute();
        proceed();
    }
}
Artist's impression of RMI aggregation DSOF,
based on AspectJ
19
RMI Optimisation using a souped-up aspect weaver
public pointcut LikelyRMICall():
    call(void *(..) throws RemoteException);

public static bool DynamicDelayableRMI() {
    return thisOpDomain.getUses().intersects(DelayedCalls.getDefs());
}
20
RMI Optimisation using a souped-up aspect weaver
public pointcut StaticDelayableRMI():
    thisOpDomainStaticPart.getUses().intersects(DelayedCalls.getDefs());
21
Remote call aggregation benchmark
public Double[] vectorAddition(DDFAAnalysis analysis, int size)
{
    Double[] v1 = new Double[size];
    Double[] v2 = new Double[size];
    ArrayAdder adder = new ArrayAdder();
    Double[] ret1 = adder.Add(v1, v2);
    Double[] ret2 = adder.Add(ret1, v2);
    Double[] ret3 = adder.Add(ret2, v2);
    Double[] ret4 = adder.Add(ret3, v2);
    return ret4;
}
  • Includes four consecutive calls to the same remote
    object
  • There is a data dependency between the calls

22
Remote call aggregation benchmark
ILMethod method = CodeDatabase.GetMethod(new Function(Example.adder));
DDFAAnalysis analysis = new DDFAAnalysis();
analysis.Apply(method);

public Double[] vectorAddition(DDFAAnalysis analysis, int size)
{
    Double[] v1 = new Double[size];
    Double[] v2 = new Double[size];
    RemoteCallDSOF opt = new RemoteCallDSOF(analysis);
    IAdder adder = (IAdder) Loom.Weaver.CreateInstance(
        typeof(ArrayAdder), null, opt);
    Double[] ret1 = adder.Add(v1, v2);
    Double[] ret2 = adder.Add(ret1, v2);
    Double[] ret3 = adder.Add(ret2, v2);
    Double[] ret4 = adder.Add(ret3, v2);
    return ret4;
}
  • We deploy DSOF using Loom aspect weaver
  • When adder is created, DSOF is interposed
  • Slightly clunky

23
Remote call aggregation benchmark
  • Aspect intercepts control flow at potential
    remote call sites
  • Accesses results of static dataflow analysis
  • Uses values of variables to determine whether
    future control flow will allow aggregation

24
Performance results
[Chart: modem, ping time 156.2ms (client 1.2GHz Pentium 4,
server 2.6GHz Pentium 4, .Net V1.1)]
[Chart: loopback device (3GHz Pentium 4, .Net V1.1)]
  • Very preliminary results
  • Vector addition benchmark
  • Substantial speedup even on fast loopback
    connection
  • By avoiding interpretive mechanism, overheads are
    smaller than in our Java implementation

25
Observations: VM
  • No change to VM
  • Not needed for our work so far
  • Though a more powerful dynamic interposition
    mechanism (i.e. aspect weaver) would be good
  • More ambitiously:
  • Access the VM's dataflow analysis?
  • Access and control the VM's instrumentation
  • Via a dynamic aspect weaver?

26
Observations: AOP
  • What is the function of the aspect weaver here?
  • Type-safe binary rewriting
  • Pointcut language goes some way towards providing
    open access to intermediate representation
  • We have built a reflective dataflow analysis
    library to extend this somewhat

27
Observations: DSI
  • Our scheme for aggregating Remote calls is an
    example of a Domain-Specific Interpreter
    pattern:
  • Delay execution of calls
  • Execution of delayed calls is eventually forced
    by a dependence
  • Inspect the list of delayed calls, plan efficient
    execution
  • This idea is useful for optimising many APIs
  • Example: parallelising VTK (Beckmann, Kelly et al.,
    LCPC'05)
  • Example: fusing MPI collective communications
    (Field, Kelly, Hansen, Euro-Par'02)
  • Example: data alignment in parallel linear
    algebra (Beckmann and Kelly, LCR'98)
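
The three steps of the pattern (delay, force on a dependence, plan from the list) can be sketched in a few lines. This is an assumption-laden illustration rather than the Adon or Veneer code: the class DelayedCallsSketch, the Handle placeholder and the batches counter are hypothetical, vector addition is borrowed from the benchmark slide, and the "plan" is simply to run every delayed call in one batch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: delays vector additions and runs them as one batch when a result is read. */
final class DelayedCallsSketch {
    /** Placeholder for a vector that may not have been computed yet. */
    static final class Handle {
        double[] value;
    }

    private final List<Runnable> pending = new ArrayList<>();
    int batches = 0;   // stands in for the number of message exchanges

    Handle constant(double[] v) {
        Handle h = new Handle();
        h.value = v;
        return h;
    }

    /** Delay the call: record it and hand back a placeholder. */
    Handle add(Handle a, Handle b) {
        Handle result = new Handle();
        pending.add(() -> {
            double[] sum = new double[a.value.length];
            for (int i = 0; i < sum.length; i++) {
                sum[i] = a.value[i] + b.value[i];
            }
            result.value = sum;
        });
        return result;
    }

    /** Reading a result is the dependence that forces the delayed calls. */
    double[] read(Handle h) {
        if (!pending.isEmpty()) {
            batches++;                      // one exchange for the whole list
            for (Runnable call : pending) {
                call.run();                 // trivial plan: run in recorded order
            }
            pending.clear();
        }
        return h.value;
    }
}
```

Four chained add calls followed by one read leave batches at 1. A fuller version would inspect the pending list and run only the calls the requested handle depends on.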

28
Observations: other DSOFs
  • We're interested in API-specific optimisations
  • Anti-pattern rewriting
  • Commonly heavyweight, so some runtime overhead
    can be justified
  • But not all optimisations fit the Domain-Specific
    Interpreter pattern
  • E.g. the SELECT * antipattern:
  • Find all the uses of the result set
  • Find all the columns that might actually be used
  • Rewrite the query to select just the columns
    needed
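
The final rewriting step might look like the sketch below. It assumes the set of used columns has already been found by the dataflow analysis described in the bullets, which is supplied by hand here; the class SelectStarRewrite and the method narrow are hypothetical names, and only queries that start with the exact text "SELECT * " are handled.

```java
import java.util.Collection;
import java.util.TreeSet;

/** Illustrative only: narrows a "SELECT *" query to the columns the caller says are used. */
final class SelectStarRewrite {
    private static final String PREFIX = "SELECT * ";

    static String narrow(String query, Collection<String> usedColumns) {
        if (!query.startsWith(PREFIX) || usedColumns.isEmpty()) {
            return query;   // not the pattern handled here, or nothing known about uses
        }
        // TreeSet removes duplicates and gives a stable column order.
        String columns = String.join(", ", new TreeSet<>(usedColumns));
        return "SELECT " + columns + " " + query.substring(PREFIX.length());
    }
}
```

For example, narrowing "SELECT * FROM users WHERE id = 7" with used columns name and email yields "SELECT email, name FROM users WHERE id = 7".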

29
Conclusions and future directions
  • Implementation incomplete
  • Needs to be embedded in aspect language
  • Can deferred dataflow analysis work
    interprocedurally?
  • How would we derive where lp-fork (lossy, predictable
    fork) aspects have to be deployed in order to produce
    the dataflow data needed by a selected aspect?
  • Apply optimisation statically where possible
  • Represent optimisation more abstractly?
  • Composition metaprogramming
  • Optimisation encapsulated as aspect
  • Operates on code that composes functions from
    some API
  • Exploits component metadata

30
Software products
  • Our Adon (Adaptive Optimisation for .Net)
    library is available at
  • http://www.doc.ic.ac.uk/~phjk/Software/Adon/
  • Adon can be used interactively using the Adon
    Browser
  • Or programmatically, for example to apply partial
    evaluation to specialize a method from your
    program

31
Programming with Adon: specialization
// Get the representation for the method Example.Power
ILMethod method = CodeDatabase.GetMethod(Example.Power);

// Create a specialising transformation, specialising the second
// parameter of the transformed method to the integer value 3
SpecialisingTransformation transform = new SpecialisingTransformation();
transform.Specialise(method.Parameters[1], 3);

// Apply the transformation to Example.Power
transform.Apply(method);

// Generate the modified method
MethodInfo dynamicMethod = method.Generate();

// Invoke the new method
Console.Out.WriteLine(dynamicMethod.Invoke(null, new object[] { 2 }));
  • Allows us to extract and mess with any method of
    the running application's code
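
To show what specialising the second parameter to 3 amounts to, here is a sketch that does not use Adon at all. It assumes Example.Power is an integer power function taking (base, exponent), which the slide does not show; the class SpecialiseSketch and both methods are hypothetical. Adon transforms bytecode, whereas this sketch only composes closures, so it illustrates the effect of partial evaluation rather than the mechanism.

```java
import java.util.function.IntUnaryOperator;

/** Illustrative only: specialises power(x, n) for a known n by unrolling the loop ahead of time. */
final class SpecialiseSketch {
    /** General version: the exponent is only known at call time. */
    static int power(int x, int n) {
        int result = 1;
        for (int i = 0; i < n; i++) {
            result *= x;
        }
        return result;
    }

    /** Specialised version: the loop over n runs once, here, and leaves a chain of multiplications. */
    static IntUnaryOperator specialisePower(int n) {
        IntUnaryOperator body = x -> 1;
        for (int i = 0; i < n; i++) {
            IntUnaryOperator previous = body;
            body = x -> previous.applyAsInt(x) * x;
        }
        return body;
    }
}
```

Calling specialisePower(3) once gives a function of x alone that computes the same value as power(x, 3) with no loop test on n at call time.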

32
The Adon Browser
  • Example: let's mess with Bubblesort

33
The Adon Browser
  • Browser GUI interfaces to Adon library
  • Browse and analyse your app's bytecode

35
The Adon Browser
  • Apply selected analyses

38
The Adon Browser
  • Apply selected transformations
