Title: Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs
1. Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs
- Pedro C. Diniz
- Martin C. Rinard
- University of California, Santa Barbara
- Santa Barbara, California 93106
- {martin,pedro}@cs.ucsb.edu
- http://www.cs.ucsb.edu/{martin,pedro}
2. Goal
- Eliminate Synchronization Overhead in Parallel Object-Based Programs
- Basic Idea
  - Interprocedural Synchronization Analysis
  - Automatically Eliminate Synchronization Constructs
- Context: Parallelizing Compiler for C++
  - Irregular Computations
  - Dynamic Data Structures
  - Commutativity Analysis
3. Structure of Talk
- Commutativity Analysis
  - Model of Computation
  - Example
  - Basic Approach
- Synchronization Optimization Techniques
  - Data Lock Coarsening
  - Computation Lock Coarsening
- Experimental Results
- Future Work
  - Self-Tuning Code
4. Model of Computation
[Figure: operations execute on objects; an executing operation reads the initial object state, produces a new object state, and invokes further operations]
5. Graph Traversal Example

  class graph {
      int val, sum;
      graph *left, *right;
      void traverse(int v);
  };

  void graph::traverse(int v) {
      sum += v;
      if (left != NULL) left->traverse(val);
      if (right != NULL) right->traverse(val);
  }

Goal: Execute left and right traverse operations in parallel
6. Parallel Traversal
[Figure]
7. Commuting Operations in Parallel Traversal
[Figure]
8. Commutativity Analysis
- Compiler Chooses a Computation to Parallelize
  - In Example: Entire graph::traverse Computation
- Compiler Computes Extent of the Computation
  - Representation of All Operations in Computation
  - Current Representation: Set of Methods
  - In Example: graph::traverse
- Do All Pairs of Operations in Extent Commute?
  - No: Generate Serial Code
  - Yes: Generate Parallel Code
  - In Example: All Pairs Commute
9. Code Generation in Example: Driver Version

Class Declaration:
  class graph {
      lock mutex;
      int val, sum;
      graph *left, *right;
  };

Driver:
  void graph::traverse(int v) {
      parallel_traverse(v);
      wait();
  }
10. Parallel Version in Example

  void graph::parallel_traverse(int v) {
      mutex.acquire();
      sum += v;
      mutex.release();
      if (left != NULL) spawn(left->parallel_traverse(val));
      if (right != NULL) spawn(right->parallel_traverse(val));
  }
12. Commutativity Testing Conditions
- Do Two Operations A and B Commute?
- Compiler Considers Two Execution Orders
  - A;B - A executes before B
  - B;A - B executes before A
- Compiler Must Check Two Conditions
  - Instance Variables: New values of instance variables are the same in both execution orders
  - Invoked Operations: A and B together directly invoke the same set of operations in both execution orders
13. Commutativity Testing Algorithm
- Symbolic Execution
  - Compiler Executes Operations
  - Computes with Expressions, not Values
- Compiler Symbolically Executes Operations in Both Execution Orders
  - Expressions for New Values of Instance Variables
  - Expressions for Multiset of Invoked Operations
- Compiler Simplifies, Compares Corresponding Expressions
  - If All Equal: Operations Commute
  - If Not All Equal: Operations May Not Commute
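The simplify-and-compare step can be illustrated with a toy representation (mine, not the compiler's): model repeated `sum += term` updates as the multiset of added terms, so both execution orders reduce to the same canonical form.

```cpp
#include <set>
#include <string>

// Toy symbolic state for a commutative sum: the multiset of terms that
// have been added to the instance variable "sum".
using SymbolicSum = std::multiset<std::string>;

// Symbolically execute one "sum += term" update.
SymbolicSum applyAdd(SymbolicSum state, const std::string& term) {
    state.insert(term);
    return state;
}
```

Comparing the canonical forms produced by the two orders is then a plain equality test on multisets.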
14. Commutativity Testing in Example
- Two Operations: r->traverse(v1) and r->traverse(v2)
- In Order r->traverse(v1); r->traverse(v2)
  - Instance Variables: New sum = (sum + v1) + v2
  - Invoked Operations: if(right != NULL, right->traverse(val)), if(left != NULL, left->traverse(val)), if(right != NULL, right->traverse(val)), if(left != NULL, left->traverse(val))
- In Order r->traverse(v2); r->traverse(v1)
  - Instance Variables: New sum = (sum + v2) + v1
  - Invoked Operations: if(right != NULL, right->traverse(val)), if(left != NULL, left->traverse(val)), if(right != NULL, right->traverse(val)), if(left != NULL, left->traverse(val))
15. Compiler Structure
- Computation Selection: Entire Computation of Each Method
- Extent Computation: Traverse Call Graph to Extract Extent
- Commutativity Testing: All Pairs of Operations in Extent
  - All Operations Commute: Generate Parallel Code
  - Operations May Not Commute: Generate Serial Code
16. Traditional Approach
- Data Dependence Analysis
  - Analyzes Reads and Writes
  - Independent Pieces of Code Execute in Parallel
- Demonstrated Success for Array-Based Programs
17. Data Dependence Analysis in Example
- For Data Dependence Analysis to Succeed in Example
  - left and right traverse Must Be Independent
  - left and right Subgraphs Must Be Disjoint
  - Graph Must Be a Tree
- Depends on Global Topology of Data Structure
  - Analyze Code that Builds Data Structure
  - Extract and Propagate Topology Information
- Fails for Graphs
18. Properties of Commutativity Analysis
- Oblivious to Data Structure Topology
- Wide Range of Computations
  - Irregular Computations with Dynamic Data Structures
  - Lists, Trees and Graphs
  - Updates to Central Data Structure
  - General Reductions
- Key Issue in Code Generation
  - Operations Must Execute Atomically
  - Compiler Automatically Inserts Locking Constructs
19. Synchronization Optimizations
20. Default Code Generation Strategy
- Each Object Has its Own Mutual Exclusion Lock
- Each Operation Acquires and Releases Lock

  class graph {
      lock mutex;
      int val, sum;
      graph *left, *right;
  };

  void graph::parallel_traverse(int v) {
      mutex.acquire();
      sum += v;
      mutex.release();
      if (left != NULL) spawn(left->parallel_traverse(val));
      if (right != NULL) spawn(right->parallel_traverse(val));
  }
21. Data Lock Coarsening Transformation
- Give Multiple Objects the Same Lock
  - Current Policy: Nested Objects Use the Lock in Enclosing Object
- Find Sequences of Operations
  - Access Different Objects
  - Acquire and Release Same Lock
- Original Code
  - Each Operation Acquires and Releases Lock
- Transformed Code
  - Acquires Lock Once at Beginning of Sequence
  - Releases Lock Once at End of Sequence
22. Data Lock Coarsening Example

Original Code:
  class vector {
      lock mutex;
      double val[NDIM];
      void add(double *v);
  };
  void vector::add(double *v) {
      mutex.acquire();
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
      mutex.release();
  }

  class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b);
  };
  void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      mutex.release();
      acc.add(v);
  }

Transformed Code:
  class vector {
      double val[NDIM];
      void add(double *v);
  };
  void vector::add(double *v) {
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
  }

  class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b);
  };
  void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
  }
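The transformed form can be mimicked in standard C++ by having the nested object rely on the enclosing object's lock; a minimal sketch with `std::mutex`, where the `computeInter` interaction is stubbed out as caller-supplied inputs and `NDIM` is chosen arbitrarily:

```cpp
#include <mutex>

const int NDIM = 3;  // arbitrary dimension for the sketch

// Transformed form: Vector has no lock of its own; callers invoke add()
// while already holding the enclosing Body's lock.
struct Vector {
    double val[NDIM] = {0, 0, 0};
    void add(const double* v) {                  // no acquire/release here
        for (int i = 0; i < NDIM; i++) val[i] += v[i];
    }
};

struct Body {
    std::mutex mtx;                              // the one coarsened lock
    double phi = 0;
    Vector acc;

    // Interaction result (p, v) passed in instead of calling computeInter.
    void gravsub(const double* v, double p) {
        std::lock_guard<std::mutex> g(mtx);      // acquired once for the sequence
        phi -= p;
        acc.add(v);                              // runs under the enclosing lock
    }
};
```

The design choice mirrors the slide: one acquire/release pair now covers both the phi update and the nested vector update.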
23. Data Lock Coarsening Tradeoff
- Advantage
  - Reduces Number of Executed Acquires and Releases
  - Reduces Acquire and Release Overhead
- Disadvantage: May Cause False Exclusion
  - Multiple Parallel Operations Access Different Objects
  - But Operations Attempt to Acquire Same Lock
  - Result: Operations Execute Serially
24. False Exclusion

Original:
  Processor 0: L.acquire() A->op() L.release()
  Processor 1: M.acquire() B->op() M.release()

After Data Lock Coarsening:
  Processor 0: L.acquire() A->op() L.release()
  Processor 1: L.acquire() ...false exclusion... B->op() L.release()

(Time runs left to right)
25. Computation Lock Coarsening Transformation
- Finds Sequences of Operations that Acquire and Release Same Lock
- Original Code
  - Acquires and Releases Lock Once for Each Operation
- Transformed Code
  - Acquires Lock Once at Beginning of Sequence
  - Releases Lock Once at End of Sequence
- Result
  - Replaces Multiple Mutual Exclusion Regions with One Large Mutual Exclusion Region
26. Computation Lock Coarsening Example

Original Code:
  class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b);
      void loopsub(body *b);
  };
  void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
  }
  void body::loopsub(body *b) {
      int i;
      for (i = 0; i < N; i++)
          this->gravsub(&b[i]);
  }

Optimized Code:
  void body::gravsub(body *b) {
      double p, v[NDIM];
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
  }
  void body::loopsub(body *b) {
      int i;
      mutex.acquire();
      for (i = 0; i < N; i++)
          this->gravsub(&b[i]);
      mutex.release();
  }
27. Computation Lock Coarsening Tradeoff
- Advantage
  - Reduces Number of Executed Acquires and Releases
  - Reduces Acquire and Release Overhead
- Disadvantage: May Introduce False Contention
  - Multiple Processors Attempt to Acquire Same Lock
  - Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region
28. False Contention

Original:
  Processor 0: L.acquire() A->op() L.release() ...local computation... L.acquire() A->op() L.release()
  Processor 1: L.acquire() A->op() L.release()

After Computation Lock Coarsening:
  Processor 0: L.acquire() A->op() ...local computation... A->op() L.release()
  Processor 1: L.acquire() ...false contention... A->op() L.release()

(Time runs left to right)
29. Managing Tradeoff: Lock Coarsening Policies
- To Manage Tradeoff, Compiler Must Successfully
  - Reduce Lock Overhead by Increasing Lock Granularity
  - Avoid Excessive False Exclusion and False Contention
- Original Policy
  - Use Original Lock Algorithm
- Bounded Policy
  - Apply Transformation Unless Transformed Code
    - Holds Lock During a Recursive Call, or
    - Holds Lock During a Loop that Invokes Operations
- Aggressive Policy
  - Always Apply Transformation
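The three policies can be summarized as a decision function; a hypothetical sketch (the `Region` fields are my names for the two conditions the bounded policy checks, not the compiler's internal representation):

```cpp
// Which lock coarsening policy is in effect.
enum class Policy { Original, Bounded, Aggressive };

// Properties of the mutual exclusion region the transformation would create.
struct Region {
    bool lockHeldAcrossRecursiveCall;
    bool lockHeldAcrossLoopInvokingOperations;
};

// Decide whether to apply the coarsening transformation to a region.
bool applyCoarsening(Policy p, const Region& r) {
    switch (p) {
    case Policy::Original:   return false;  // keep the original lock algorithm
    case Policy::Aggressive: return true;   // always transform
    case Policy::Bounded:                   // transform unless the lock would be
        return !r.lockHeldAcrossRecursiveCall &&            // held across a recursive call
               !r.lockHeldAcrossLoopInvokingOperations;     // or an operation-invoking loop
    }
    return false;
}
```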
31. Methodology
- Built Prototype Compiler
- Integrated Lock Coarsening Transformations into Prototype
- Acquired Two Complete Applications
  - Barnes-Hut N-Body Solver
  - Water Code
- Automatically Parallelized Applications
- Generated a Version of Each Application for Each Policy
  - Original
  - Bounded
  - Aggressive
- Ran Applications on Stanford DASH Machine
32. Applications
- Barnes-Hut
  - O(N lg N) N-Body Solver
  - Space Subdivision Tree
  - 1500 Lines of C++ Code
- Water
  - Simulates Liquid Water
  - O(N^2) Algorithm
  - 1850 Lines of C++ Code
33. Lock Overhead
- Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks
[Bar charts: Percentage Lock Overhead (0-60%) for the Original, Bounded, and Aggressive versions of Barnes-Hut (16K Particles) and Water (512 Molecules)]
34. Contention Overhead for Barnes-Hut
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors
[Graphs: Contention Percentage (0-100%) vs. Processors (0-16) for the Original, Bounded, and Aggressive versions]
35. Contention Overhead for Water
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors
[Graphs: Contention Percentage (0-100%) vs. Processors (0-16) for the Original, Bounded, and Aggressive versions]
36. Speedup
[Graphs: Speedup (0-16) vs. Processors (0-16) for the Ideal, Aggressive, Bounded, and Original versions of Barnes-Hut (16K Particles) and Water (512 Molecules)]
37. Recent Work: Choosing Best Policy
- Best Policy May Depend On
  - Topology of Data Structures
  - Dynamic Schedule of Computation
- Information Required to Choose Best Policy Unavailable at Compile Time
- Complications
  - Different Phases May Have Different Best Policy
  - In Same Phase, Best Policy May Change Over Time
38. Solution: Generate Self-Tuning Code
- Sampling Phase Measures Performance of Different Policies
- Production Phase Uses Best Policy from Sampling Phase
- Periodically Resample to Discover Changes in Best Policy
- Guaranteed Performance Bounds
[Figure: overhead over time for the Original, Bounded, and Aggressive policies; sampling phases alternate with production phases]
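The core of the sampling phase can be sketched as follows; this is my simplification, with each policy version reporting a caller-supplied cost so the sketch stays deterministic (a real system would time actual executions):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Sampling phase: run each policy's code version once under a cost
// measure and return the index of the cheapest one, which the production
// phase then uses until the next resampling point.
std::size_t pickBestPolicy(const std::vector<std::function<double()>>& versions) {
    std::vector<double> cost;
    for (const auto& run : versions)
        cost.push_back(run());                   // measure this policy's version
    return static_cast<std::size_t>(
        std::min_element(cost.begin(), cost.end()) - cost.begin());
}
```

Re-running this selection periodically is what lets the code adapt when the best policy changes across or within phases.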
39. Conclusion
- Synchronization Optimizations
  - Data Lock Coarsening
  - Computation Lock Coarsening
- Integrated into Prototype Parallelizing Compiler
  - Object-Based Programs with Dynamic Data Structures
  - Commutativity Analysis
- Experimental Results
  - Optimizations Have a Significant Performance Impact
  - With Optimizations, Applications Perform Well