Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs

1
Lock Coarsening: Eliminating Lock Overhead in
Automatically Parallelized Object-Based Programs
  • Pedro C. Diniz
  • Martin C. Rinard
  • University of California, Santa Barbara
  • Santa Barbara, California 93106
  • {martin,pedro}@cs.ucsb.edu
  • http://www.cs.ucsb.edu/~{martin,pedro}

2
Goal
  • Eliminate Synchronization Overhead in
    Parallel Object-Based Programs
  • Basic Idea
      • Interprocedural Synchronization Analysis
      • Automatically Eliminate Synchronization
        Constructs
  • Context: Parallelizing Compiler for C++
      • Irregular Computations
      • Dynamic Data Structures
      • Commutativity Analysis

3
Structure of Talk
  • Commutativity Analysis
      • Model of Computation
      • Example
      • Basic Approach
  • Synchronization Optimization Techniques
      • Data Lock Coarsening
      • Computation Lock Coarsening
  • Experimental Results
  • Future Work
      • Self-Tuning Code

4
Model of Computation
[Figure: objects and operations. An executing operation takes an object from its initial object state to a new object state and produces a set of invoked operations.]
5
Graph Traversal Example
  class graph {
    int val, sum;
    graph *left, *right;
  };
  void graph::traverse(int v) {
    sum += v;
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }

Goal: Execute left and right traverse operations
in parallel
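
As a minimal sketch of the serial starting point, the class above can be made compilable as follows. The public members, the constructor, and the helper diamond_shared_sum() are assumptions added for illustration; they are not on the slide. The helper builds a small graph in which one node is reachable along two paths, which is the case that separates a general graph from a tree.

```cpp
#include <cassert>
#include <cstddef>

// Serial graph class, following the slide. The constructor and the
// public access are assumptions added so the sketch compiles.
class graph {
public:
    int val, sum;
    graph *left, *right;
    explicit graph(int v) : val(v), sum(0), left(NULL), right(NULL) {}
    void traverse(int v);
};

void graph::traverse(int v) {
    sum += v;                                  // update this node
    if (left != NULL) left->traverse(val);     // pass this node's val down
    if (right != NULL) right->traverse(val);
}

// Hypothetical helper: a diamond-shaped graph whose bottom node is shared.
//        root(1)
//       /       \
//     a(2)      b(3)
//       \       /
//       shared(4)
int diamond_shared_sum() {
    graph root(1), a(2), b(3), shared(4);
    root.left = &a;
    root.right = &b;
    a.left = &shared;
    b.right = &shared;
    root.traverse(0);
    return shared.sum;   // receives a.val and b.val, in either order
}
```

The shared node is updated twice, once from each parent, and the final sum does not depend on which update happens first.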
6
Parallel Traversal
7
Commuting Operations in Parallel Traversal
8
Commutativity Analysis
  • Compiler Chooses a Computation to Parallelize
      • In Example: Entire graph::traverse Computation
  • Compiler Computes Extent of the Computation
      • Representation of All Operations in Computation
      • Current Representation: Set of Methods
      • In Example: { graph::traverse }
  • Do All Pairs of Operations in Extent Commute?
      • No - Generate Serial Code
      • Yes - Generate Parallel Code
      • In Example: All Pairs Commute

9
Code Generation In Example
Class Declaration
  class graph {
    lock mutex;
    int val, sum;
    graph *left, *right;
  };
Driver Version
  void graph::traverse(int v) {
    parallel_traverse(v);
    wait();
  }
10
Parallel Version In Example
  void graph::parallel_traverse(int v) {
    mutex.acquire();
    sum += v;
    mutex.release();
    if (left != NULL) spawn(left->parallel_traverse(val));
    if (right != NULL) spawn(right->parallel_traverse(val));
  }
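
The generated code relies on the compiler's own lock, spawn, and wait constructs. A rough approximation in standard C++ is sketched below, assuming std::mutex in place of lock and one std::thread per spawned operation. Joining the child threads inside parallel_traverse stands in for the single wait() in the driver, which is a simplification and not the runtime system the slides describe. The helper parallel_diamond_shared_sum() is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>

class graph {
public:
    std::mutex mutex;          // one lock per object
    int val, sum;
    graph *left, *right;
    explicit graph(int v) : val(v), sum(0), left(NULL), right(NULL) {}
    void traverse(int v);
    void parallel_traverse(int v);
};

void graph::parallel_traverse(int v) {
    mutex.lock();              // the update of sum executes atomically
    sum += v;
    mutex.unlock();
    // val, left and right are only read during the traversal.
    std::thread l, r;
    if (left != NULL) l = std::thread(&graph::parallel_traverse, left, val);
    if (right != NULL) r = std::thread(&graph::parallel_traverse, right, val);
    // Assumption: join here instead of a global wait() in the driver.
    if (l.joinable()) l.join();
    if (r.joinable()) r.join();
}

// Driver version: returns only after every spawned operation finished.
void graph::traverse(int v) { parallel_traverse(v); }

// Hypothetical helper: same diamond-shaped graph as a serial test would use.
int parallel_diamond_shared_sum() {
    graph root(1), a(2), b(3), shared(4);
    root.left = &a;
    root.right = &b;
    a.left = &shared;
    b.right = &shared;
    root.traverse(0);
    return shared.sum;
}
```

One thread per operation is far too expensive for a real workload; it is used here only to keep the sketch short.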
11
  • Commutativity Testing

12
Commutativity Testing Conditions
  • Do Two Operations A and B Commute?
  • Compiler Considers Two Execution Orders
      • A;B - A executes before B
      • B;A - B executes before A
  • Compiler Must Check Two Conditions
      • Instance Variables: New values of instance
        variables are same in both execution orders
      • Invoked Operations: A and B together directly
        invoke same set of operations in both execution
        orders
13
Commutativity Testing Algorithm
  • Symbolic Execution
  • Compiler Executes Operations
  • Computes with Expressions not Values
  • Compiler Symbolically Executes Operations
  • In Both Execution Orders
  • Expressions for New Values of Instance Variables
  • Expressions for Multiset of Invoked Operations
  • Compiler Simplifies, Compares Corresponding
    Expressions
  • If All Equal - Operations Commute
  • If Not All Equal - Operations May Not Commute

14
Commutativity Testing In Example
  • Two Operations
      • r->traverse(v1) and r->traverse(v2)
  • In Order r->traverse(v1); r->traverse(v2)
      • Instance Variables: New sum = (sum+v1)+v2
      • Invoked Operations:
        if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val)),
        if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val))
  • In Order r->traverse(v2); r->traverse(v1)
      • Instance Variables: New sum = (sum+v2)+v1
      • Invoked Operations:
        if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val)),
        if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val))
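
The comparison of the two execution orders can be illustrated with a toy symbolic executor. This is a sketch under strong assumptions, not the compiler's algorithm: the symbolic value of sum is restricted to a sum of symbols and stored as a multiset, so that (sum+v1)+v2 and (sum+v2)+v1 normalize to the same value, and invoked operations are recorded as strings. All names here (SymState, sym_traverse, sym_assign, commute) are hypothetical; sym_assign models an overwriting operation that is not on the slides and is included only as a counterexample.

```cpp
#include <cassert>
#include <set>
#include <string>

// Symbolic state of the receiver after executing some operations.
struct SymState {
    std::multiset<std::string> sum;      // addends of the new value of sum
    std::multiset<std::string> invoked;  // multiset of invoked operations
};

// Symbolic execution of graph::traverse(arg): sum += arg, then two
// conditional invocations. val is never written, so it stays a symbol.
void sym_traverse(SymState &s, const std::string &arg) {
    s.sum.insert(arg);
    s.invoked.insert("if(left!=NULL,left->traverse(val))");
    s.invoked.insert("if(right!=NULL,right->traverse(val))");
}

// Hypothetical operation "sum = arg", which overwrites instead of adding.
void sym_assign(SymState &s, const std::string &arg) {
    s.sum.clear();
    s.sum.insert(arg);
}

typedef void (*SymOp)(SymState &, const std::string &);

// Executes A;B and B;A from the same initial state and compares the
// resulting expressions. false means "may not commute".
bool commute(SymOp a, const std::string &argA,
             SymOp b, const std::string &argB) {
    SymState ab, ba;
    ab.sum.insert("sum");
    ba.sum.insert("sum");
    a(ab, argA); b(ab, argB);   // order A;B
    b(ba, argB); a(ba, argA);   // order B;A
    return ab.sum == ba.sum && ab.invoked == ba.invoked;
}
```

With this encoding, commute(sym_traverse, "v1", sym_traverse, "v2") holds, while pairing sym_assign with sym_traverse does not, because the final value of sum depends on the order.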
15
Compiler Structure
  • Computation Selection: Entire Computation of Each Method
  • Extent Computation: Traverse Call Graph to Extract Extent
  • Commutativity Testing: All Pairs of Operations In Extent
      • All Operations Commute: Generate Parallel Code
      • Operations May Not Commute: Generate Serial Code
16
Traditional Approach
  • Data Dependence Analysis
  • Analyzes Reads and Writes
  • Independent Pieces of Code Execute in Parallel
  • Demonstrated Success for Array-Based Programs

17
Data Dependence Analysis in Example
  • For Data Dependence Analysis To Succeed in
    Example
  • left and right traverse Must Be Independent
  • left and right Subgraphs Must Be Disjoint
  • Graph Must Be a Tree
  • Depends on Global Topology of Data Structure
  • Analyze Code that Builds Data Structure
  • Extract and Propagate Topology Information
  • Fails For Graphs

18
Properties of Commutativity Analysis
  • Oblivious to Data Structure Topology
  • Wide Range of Computations
  • Irregular Computations with Dynamic Data
    Structures
  • Lists, Trees and Graphs
  • Updates to Central Data Structure
  • General Reductions
  • Key Issue in Code Generation
  • Operations Must Execute Atomically
  • Compiler Automatically Inserts Locking Constructs

19
Synchronization Optimizations
20
Default Code Generation Strategy
  • Each Object Has its Own Mutual Exclusion Lock
  • Each Operation Acquires and Releases Lock

  class graph {
    lock mutex;
    int val, sum;
    graph *left, *right;
  };
  void graph::parallel_traverse(int v) {
    mutex.acquire();
    sum += v;
    mutex.release();
    if (left != NULL) spawn(left->parallel_traverse(val));
    if (right != NULL) spawn(right->parallel_traverse(val));
  }

21
Data Lock Coarsening Transformation
  • Give Multiple Objects the Same Lock
      • Current Policy: Nested Objects Use the Lock in
        Enclosing Object
  • Find Sequences of Operations
      • Access Different Objects
      • Acquire and Release Same Lock
  • Transformed Code
      • Acquires Lock Once At Beginning of Sequence
      • Releases Lock Once At End of Sequence
  • Original Code
      • Each Operation Acquires and Releases Lock

22
Data Lock Coarsening Example
Original Code
  class vector {
    lock mutex;
    double val[NDIM];
  };
  void vector::add(double *v) {
    mutex.acquire();
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
    mutex.release();
  }
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b, v);
    phi -= p;
    mutex.release();
    acc.add(v);
  }
Transformed Code
  class vector {
    double val[NDIM];
  };
  void vector::add(double *v) {
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
  }
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b, v);
    phi -= p;
    acc.add(v);
    mutex.release();
  }
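
To make the saving concrete, the sketch below runs both versions with a counting lock and reports how many acquires one gravsub call executes. The lock wrapper around std::mutex, the stub computeInter, the two namespaces, and the helper functions are assumptions introduced for this illustration; the slides do not define any of them.

```cpp
#include <cassert>
#include <mutex>

const int NDIM = 3;

// Hypothetical lock that counts how often it is acquired.
class lock {
    std::mutex m;
public:
    int acquires = 0;
    void acquire() { m.lock(); ++acquires; }
    void release() { m.unlock(); }
};

// Hypothetical stand-in for computeInter: fills v, returns a potential.
template <class Body>
double computeInter(Body *b, double *v) {
    (void)b;
    for (int i = 0; i < NDIM; i++) v[i] = 1.0;
    return 0.5;
}

namespace original {      // every object carries its own lock
struct vector {
    lock mutex;
    double val[NDIM] = {0};
    void add(double *v) {
        mutex.acquire();
        for (int i = 0; i < NDIM; i++) val[i] += v[i];
        mutex.release();
    }
};
struct body {
    lock mutex;
    double phi = 0;
    vector acc;
    void gravsub(body *b) {
        double p, v[NDIM];
        mutex.acquire();
        p = computeInter(b, v);
        phi -= p;
        mutex.release();
        acc.add(v);           // second acquire/release pair
    }
};
}

namespace coarsened {     // nested vector uses the enclosing body's lock
struct vector {
    double val[NDIM] = {0};
    void add(double *v) {
        for (int i = 0; i < NDIM; i++) val[i] += v[i];
    }
};
struct body {
    lock mutex;
    double phi = 0;
    vector acc;
    void gravsub(body *b) {
        double p, v[NDIM];
        mutex.acquire();
        p = computeInter(b, v);
        phi -= p;
        acc.add(v);           // protected by the body's lock
        mutex.release();
    }
};
}

int original_acquires() {
    original::body a, b;
    a.gravsub(&b);
    return a.mutex.acquires + a.acc.mutex.acquires;
}

int coarsened_acquires() {
    coarsened::body a, b;
    a.gravsub(&b);
    return a.mutex.acquires;
}
```

In this sketch the original version executes two acquire/release pairs per gravsub call and the transformed version executes one.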
23
Data Lock Coarsening Tradeoff
  • Advantage
  • Reduces Number of Executed Acquires and Releases
  • Reduces Acquire and Release Overhead
  • Disadvantage: May Cause False Exclusion
      • Multiple Parallel Operations Access Different
        Objects
      • But Operations Attempt to Acquire Same Lock
      • Result: Operations Execute Serially

24
False Exclusion
Original
  Processor 0: L.acquire(); A->op(); L.release();
  Processor 1: M.acquire(); B->op(); M.release();
After Data Lock Coarsening
  Processor 0: L.acquire(); A->op(); L.release();
  Processor 1: L.acquire(); . . [False Exclusion] . . B->op(); L.release();
[Axis: Time]
25
Computation Lock Coarsening Transformation
  • Finds Sequences of Operations
  • Acquire and Release Same Lock
  • Transformed Code
  • Acquires Lock Once at Beginning of Sequence
  • Releases Lock Once at End of Sequence
  • Original Code
  • Acquires and Releases Lock Once for Each
    Operation
  • Result
      • Replaces Multiple Mutual Exclusion Regions With
        One Large Mutual Exclusion Region

26
Computation Lock Coarsening Example
Original Code
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b, v);
    phi -= p;
    acc.add(v);
    mutex.release();
  }
  void body::loopsub(body *b) {
    int i;
    for (i = 0; i < N; i++)
      this->gravsub(b+i);
  }
Optimized Code
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    p = computeInter(b, v);
    phi -= p;
    acc.add(v);
  }
  void body::loopsub(body *b) {
    int i;
    mutex.acquire();
    for (i = 0; i < N; i++)
      this->gravsub(b+i);
    mutex.release();
  }
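
The effect of hoisting the acquire and release out of the loop can be checked with a counting lock. The sketch below is an illustration under assumptions: the lock wrapper, the stub computeInter, the constant N = 8, the namespaces, and the helper functions are all hypothetical, and the loop argument is written as b + i.

```cpp
#include <cassert>
#include <mutex>

const int NDIM = 3;
const int N = 8;          // assumed number of interacting bodies

// Hypothetical lock that counts how often it is acquired.
class lock {
    std::mutex m;
public:
    int acquires = 0;
    void acquire() { m.lock(); ++acquires; }
    void release() { m.unlock(); }
};

struct vector {
    double val[NDIM] = {0};
    void add(const double *v) {
        for (int i = 0; i < NDIM; i++) val[i] += v[i];
    }
};

// Hypothetical stand-in for computeInter.
template <class Body>
double computeInter(Body *b, double *v) {
    (void)b;
    for (int i = 0; i < NDIM; i++) v[i] = 1.0;
    return 0.5;
}

namespace original {      // one mutual exclusion region per gravsub
struct body {
    lock mutex;
    double phi = 0;
    vector acc;
    void gravsub(body *b) {
        double p, v[NDIM];
        mutex.acquire();
        p = computeInter(b, v);
        phi -= p;
        acc.add(v);
        mutex.release();
    }
    void loopsub(body *b) {
        for (int i = 0; i < N; i++) this->gravsub(b + i);
    }
};
}

namespace coarsened {     // one large region around the whole loop
struct body {
    lock mutex;
    double phi = 0;
    vector acc;
    void gravsub(body *b) {
        double p, v[NDIM];
        p = computeInter(b, v);   // now runs while the lock is held
        phi -= p;
        acc.add(v);
    }
    void loopsub(body *b) {
        mutex.acquire();
        for (int i = 0; i < N; i++) this->gravsub(b + i);
        mutex.release();
    }
};
}

int original_acquires() {
    original::body self, others[N];
    self.loopsub(others);
    return self.mutex.acquires;
}

int coarsened_acquires() {
    coarsened::body self, others[N];
    self.loopsub(others);
    return self.mutex.acquires;
}
```

The coarsened loop performs one acquire instead of N, at the price of holding the lock across every computeInter call, which is the source of the false contention discussed next.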

27
Computation Lock Coarsening Tradeoff
  • Advantage
  • Reduces Number of Executed Acquires and Releases
  • Reduces Acquire and Release Overhead
  • Disadvantage: May Introduce False Contention
      • Multiple Processors Attempt to Acquire Same Lock
      • Processor Holding the Lock is Executing Code that
        was Originally in No Mutual Exclusion Region

28
False Contention
Original
  Processor 0: L.acquire(); A->op(); L.release(); [Local Computation] L.acquire(); A->op(); L.release();
  Processor 1: L.acquire(); A->op(); L.release();
After Computation Lock Coarsening
  Processor 0: L.acquire(); A->op(); [Local Computation] A->op(); L.release();
  Processor 1: L.acquire(); . . . . . [False Contention] A->op(); L.release();
29
Managing Tradeoff: Lock Coarsening Policies
  • To Manage Tradeoff, Compiler Must Successfully
      • Reduce Lock Overhead by Increasing Lock
        Granularity
      • Avoid Excessive False Exclusion and False
        Contention
  • Original Policy
      • Use Original Lock Algorithm
  • Bounded Policy
      • Apply Transformation Unless Transformed Code
          • Holds Lock During a Recursive Call, or
          • Holds Lock During a Loop that Invokes Operations
  • Aggressive Policy
      • Always Apply Transformation
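
The three policies amount to a small decision rule, sketched below. The Candidate struct and the function shouldCoarsen are hypothetical names for this illustration, and reading the Original policy as "never apply the transformation" is an assumption based on the slide's wording.

```cpp
#include <cassert>

enum Policy { ORIGINAL, BOUNDED, AGGRESSIVE };

// Hypothetical summary of what the transformed code would do at one site.
struct Candidate {
    bool holdsLockDuringRecursiveCall;
    bool holdsLockDuringLoopInvokingOps;
};

// Decides whether a lock coarsening transformation is applied.
bool shouldCoarsen(Policy policy, const Candidate &c) {
    switch (policy) {
    case ORIGINAL:
        return false;   // keep the default lock placement (assumed reading)
    case BOUNDED:
        return !c.holdsLockDuringRecursiveCall &&
               !c.holdsLockDuringLoopInvokingOps;
    case AGGRESSIVE:
        return true;    // always apply
    }
    return false;
}
```

For a site where the transformed code would hold the lock during a loop that invokes operations, the Bounded policy declines and the Aggressive policy applies the transformation.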

30
  • Experimental Results

31
Methodology
  • Built Prototype Compiler
  • Integrated Lock Coarsening Transformations into
    Prototype
  • Acquired Two Complete Applications
  • Barnes-Hut N-Body Solver
  • Water Code
  • Automatically Parallelized Applications
  • Generated A Version of Each Application for Each
    Policy
  • Original
  • Bounded
  • Aggressive
  • Ran Applications on Stanford DASH Machine

32
Applications
  • Barnes-Hut
      • O(N lg N) N-Body Solver
      • Space Subdivision Tree
      • 1500 Lines of C++ Code
  • Water
      • Simulates Liquid Water
      • O(N^2) Algorithm
      • 1850 Lines of C++ Code

33
Lock Overhead
  • Percentage of Time that the Single Processor
    Execution Spends Acquiring and Releasing Mutual
    Exclusion Locks

[Figure: bar charts of Percentage Lock Overhead (axis 0 to 60) for the Original, Bounded, and Aggressive policies. Panels: Water (512 Molecules) and Barnes-Hut (16K Particles).]
34
Contention Overhead for Barnes-Hut
  • Percentage of Time that Processors Spend Waiting
    to Acquire Locks Held by Other Processors

[Figure: Contention Percentage (axis 0 to 100) versus Processors (0 to 16) for Barnes-Hut. Panels: Original, Bounded, Aggressive.]
35
Contention Overhead for Water
  • Percentage of Time that Processors Spend Waiting
    to Acquire Locks Held by Other Processors

[Figure: Contention Percentage (axis 0 to 100) versus Processors (0 to 16) for Water. Panels: Original, Bounded, Aggressive.]
36
Speedup
[Figure: Speedup (axis 0 to 16) versus Processors (0 to 16), with curves for Ideal, Aggressive, Bounded, and Original. Panels: Barnes-Hut (16K Particles) and Water (512 Molecules).]
37
Recent Work: Choosing Best Policy
  • Best Policy May Depend On
      • Topology of Data Structures
      • Dynamic Schedule Of Computation
  • Information Required to Choose Best Policy
    Unavailable at Compile Time
  • Complications
      • Different Phases May Have Different Best Policy
      • In Same Phase, Best Policy May Change Over Time

38
Solution: Generate Self-Tuning Code
  • Sampling Phase Measures Performance of Different
    Policies
  • Production Phase Uses Best Policy From Sampling
    Phase
  • Periodically Resample to Discover Changes in Best
    Policy
  • Guaranteed Performance Bounds

[Figure: Overhead versus Time for the Original, Bounded, and Aggressive policies, with the timeline divided into Sampling Phases and a Production Phase.]
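
The alternation of sampling and production phases can be sketched as a simple control loop. Everything below is a hypothetical illustration: the overhead callback stands in for real measurements, demoOverhead is an invented workload whose best policy changes at time 100, and the sketch does not model the guaranteed performance bounds mentioned above.

```cpp
#include <cassert>
#include <vector>

enum Policy { ORIGINAL = 0, BOUNDED = 1, AGGRESSIVE = 2, NUM_POLICIES = 3 };

// Hypothetical measurement hook: overhead of a policy at a given time.
typedef double (*OverheadFn)(Policy, int);

// Sampling phase: try every policy and keep the one with least overhead.
Policy samplePolicies(OverheadFn overhead, int now) {
    Policy best = ORIGINAL;
    double bestCost = overhead(ORIGINAL, now);
    for (int p = 1; p < NUM_POLICIES; p++) {
        double cost = overhead(static_cast<Policy>(p), now);
        if (cost < bestCost) {
            bestCost = cost;
            best = static_cast<Policy>(p);
        }
    }
    return best;
}

// Alternates sampling and production phases; returns the policy chosen
// for each production phase. The production work itself is elided.
std::vector<Policy> selfTune(OverheadFn overhead, int totalSteps,
                             int productionLength) {
    std::vector<Policy> chosen;
    for (int t = 0; t < totalSteps; t += productionLength) {
        chosen.push_back(samplePolicies(overhead, t));  // resample
        // Production phase: run with chosen.back() until the next sample.
    }
    return chosen;
}

// Invented workload: Aggressive is best early, Bounded is best later.
double demoOverhead(Policy p, int t) {
    if (t < 100) return p == AGGRESSIVE ? 1.0 : (p == BOUNDED ? 2.0 : 3.0);
    return p == BOUNDED ? 1.0 : (p == AGGRESSIVE ? 5.0 : 3.0);
}
```

With selfTune(demoOverhead, 200, 100), the first production phase uses the Aggressive policy and the second, after resampling, switches to Bounded.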
39
Conclusion
  • Synchronization Optimizations
  • Data Lock Coarsening
  • Computation Lock Coarsening
  • Integrated into Prototype Parallelizing Compiler
  • Object-Based Programs with Dynamic Data
    Structures
  • Commutativity Analysis
  • Experimental Results
  • Optimizations Have a Significant Performance
    Impact
  • With Optimizations, Applications Perform Well