Title: Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs
1. Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs
- Pedro C. Diniz
- Martin C. Rinard
- University of California, Santa Barbara
- Santa Barbara, California 93106
- {martin,pedro}@cs.ucsb.edu
- http://www.cs.ucsb.edu/{martin,pedro}
2. Goal
- Eliminate Synchronization Overhead in Parallel Object-Based Programs
- Basic Idea
  - Interprocedural Synchronization Analysis
  - Automatically Eliminate Synchronization Constructs
- Context: Parallelizing Compiler for C++
  - Irregular Computations
  - Dynamic Data Structures
  - Commutativity Analysis
3. Structure of Talk
- Commutativity Analysis
  - Model of Computation
  - Example
  - Basic Approach
- Synchronization Optimization Techniques
  - Data Lock Coarsening
  - Computation Lock Coarsening
- Experimental Results
- Future Work
  - Self-Tuning Code
4. Model of Computation
[Figure: operations execute on objects; an executing operation reads the initial object state, produces a new object state, and invokes further operations]
5. Graph Traversal Example

  class graph {
      int val, sum;
      graph *left, *right;
      void traverse(int v);
  };

  void graph::traverse(int v) {
      sum += v;
      if (left != NULL) left->traverse(val);
      if (right != NULL) right->traverse(val);
  }

Goal: Execute left and right traverse operations in parallel
6. Parallel Traversal
[Figure]
7. Commuting Operations in Parallel Traversal
[Figure]
8. Commutativity Analysis
- Compiler Chooses a Computation to Parallelize
  - In Example: Entire graph::traverse Computation
- Compiler Computes Extent of the Computation
  - Representation of All Operations in Computation
  - Current Representation: Set of Methods
  - In Example: graph::traverse
- Do All Pairs of Operations in Extent Commute?
  - No: Generate Serial Code
  - Yes: Generate Parallel Code
  - In Example: All Pairs Commute
9. Code Generation in Example: Driver Version

Class Declaration:
  class graph {
      lock mutex;
      int val, sum;
      graph *left, *right;
  };

Driver:
  void graph::traverse(int v) {
      parallel_traverse(v);
      wait();
  }
10. Parallel Version in Example

  void graph::parallel_traverse(int v) {
      mutex.acquire();
      sum += v;
      mutex.release();
      if (left != NULL) spawn(left->parallel_traverse(val));
      if (right != NULL) spawn(right->parallel_traverse(val));
  }
12. Commutativity Testing Conditions
- Do Two Operations A and B Commute?
- Compiler Considers Two Execution Orders
  - A;B - A executes before B
  - B;A - B executes before A
- Compiler Must Check Two Conditions
  - Instance Variables: New values of instance variables are the same in both execution orders
  - Invoked Operations: A and B together directly invoke the same set of operations in both execution orders
13. Commutativity Testing Algorithm
- Symbolic Execution
  - Compiler Executes Operations
  - Computes with Expressions, not Values
- Compiler Symbolically Executes Operations in Both Execution Orders
  - Expressions for New Values of Instance Variables
  - Expressions for Multiset of Invoked Operations
- Compiler Simplifies, Compares Corresponding Expressions
  - If All Equal: Operations Commute
  - If Not All Equal: Operations May Not Commute
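The simplify-and-compare step can be illustrated with a toy representation (mine, not the compiler's): model repeated `sum += term` updates as the multiset of added terms, so both execution orders reduce to the same canonical form.

```cpp
#include <set>
#include <string>

// Toy symbolic state for a commutative sum: the multiset of terms that
// have been added to the instance variable "sum".
using SymbolicSum = std::multiset<std::string>;

// Symbolically execute one "sum += term" update.
SymbolicSum applyAdd(SymbolicSum state, const std::string& term) {
    state.insert(term);
    return state;
}
```

Comparing the canonical forms produced by the two orders is then a plain equality test on multisets.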
14. Commutativity Testing in Example
- Two Operations: r->traverse(v1) and r->traverse(v2)
- In Order r->traverse(v1); r->traverse(v2)
  - Instance Variables: New sum = (sum + v1) + v2
  - Invoked Operations: if(right != NULL, right->traverse(val)), if(left != NULL, left->traverse(val)), if(right != NULL, right->traverse(val)), if(left != NULL, left->traverse(val))
- In Order r->traverse(v2); r->traverse(v1)
  - Instance Variables: New sum = (sum + v2) + v1
  - Invoked Operations: if(right != NULL, right->traverse(val)), if(left != NULL, left->traverse(val)), if(right != NULL, right->traverse(val)), if(left != NULL, left->traverse(val))
15. Compiler Structure
- Computation Selection: Entire Computation of Each Method
- Extent Computation: Traverse Call Graph to Extract Extent
- Commutativity Testing: All Pairs of Operations in Extent
  - All Operations Commute: Generate Parallel Code
  - Operations May Not Commute: Generate Serial Code
16. Traditional Approach
- Data Dependence Analysis
  - Analyzes Reads and Writes
  - Independent Pieces of Code Execute in Parallel
- Demonstrated Success for Array-Based Programs
17. Data Dependence Analysis in Example
- For Data Dependence Analysis to Succeed in Example
  - left and right traverse Must Be Independent
  - left and right Subgraphs Must Be Disjoint
  - Graph Must Be a Tree
- Depends on Global Topology of Data Structure
  - Analyze Code that Builds Data Structure
  - Extract and Propagate Topology Information
- Fails for Graphs
18. Properties of Commutativity Analysis
- Oblivious to Data Structure Topology
- Wide Range of Computations
  - Irregular Computations with Dynamic Data Structures
  - Lists, Trees and Graphs
  - Updates to Central Data Structure
  - General Reductions
- Key Issue in Code Generation
  - Operations Must Execute Atomically
  - Compiler Automatically Inserts Locking Constructs
19. Synchronization Optimizations
20. Default Code Generation Strategy
- Each Object Has its Own Mutual Exclusion Lock
- Each Operation Acquires and Releases Lock

  class graph {
      lock mutex;
      int val, sum;
      graph *left, *right;
  };

  void graph::parallel_traverse(int v) {
      mutex.acquire();
      sum += v;
      mutex.release();
      if (left != NULL) spawn(left->parallel_traverse(val));
      if (right != NULL) spawn(right->parallel_traverse(val));
  }
21. Data Lock Coarsening Transformation
- Give Multiple Objects the Same Lock
  - Current Policy: Nested Objects Use the Lock in Enclosing Object
- Find Sequences of Operations
  - Access Different Objects
  - Acquire and Release Same Lock
- Original Code
  - Each Operation Acquires and Releases Lock
- Transformed Code
  - Acquires Lock Once at Beginning of Sequence
  - Releases Lock Once at End of Sequence
22. Data Lock Coarsening Example

Original Code:
  class vector {
      lock mutex;
      double val[NDIM];
      void add(double *v);
  };
  void vector::add(double *v) {
      mutex.acquire();
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
      mutex.release();
  }

  class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b);
  };
  void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      mutex.release();
      acc.add(v);
  }

Transformed Code:
  class vector {
      double val[NDIM];
      void add(double *v);
  };
  void vector::add(double *v) {
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
  }

  class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b);
  };
  void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
  }
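The transformed form can be mimicked in standard C++ by having the nested object rely on the enclosing object's lock; a minimal sketch with `std::mutex`, where the `computeInter` interaction is stubbed out as caller-supplied inputs and `NDIM` is chosen arbitrarily:

```cpp
#include <mutex>

const int NDIM = 3;  // arbitrary dimension for the sketch

// Transformed form: Vector has no lock of its own; callers invoke add()
// while already holding the enclosing Body's lock.
struct Vector {
    double val[NDIM] = {0, 0, 0};
    void add(const double* v) {                  // no acquire/release here
        for (int i = 0; i < NDIM; i++) val[i] += v[i];
    }
};

struct Body {
    std::mutex mtx;                              // the one coarsened lock
    double phi = 0;
    Vector acc;

    // Interaction result (p, v) passed in instead of calling computeInter.
    void gravsub(const double* v, double p) {
        std::lock_guard<std::mutex> g(mtx);      // acquired once for the sequence
        phi -= p;
        acc.add(v);                              // runs under the enclosing lock
    }
};
```

The design choice mirrors the slide: one acquire/release pair now covers both the phi update and the nested vector update.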
23. Data Lock Coarsening Tradeoff
- Advantage
  - Reduces Number of Executed Acquires and Releases
  - Reduces Acquire and Release Overhead
- Disadvantage: May Cause False Exclusion
  - Multiple Parallel Operations Access Different Objects
  - But Operations Attempt to Acquire Same Lock
  - Result: Operations Execute Serially
24. False Exclusion

Original:
  Processor 0: L.acquire() A->op() L.release()
  Processor 1: M.acquire() B->op() M.release()

After Data Lock Coarsening:
  Processor 0: L.acquire() A->op() L.release()
  Processor 1: L.acquire() ...false exclusion... B->op() L.release()

(Time runs left to right)
25. Computation Lock Coarsening Transformation
- Finds Sequences of Operations that Acquire and Release Same Lock
- Original Code
  - Acquires and Releases Lock Once for Each Operation
- Transformed Code
  - Acquires Lock Once at Beginning of Sequence
  - Releases Lock Once at End of Sequence
- Result
  - Replaces Multiple Mutual Exclusion Regions with One Large Mutual Exclusion Region
26. Computation Lock Coarsening Example

Original Code:
  class body {
      lock mutex;
      double phi;
      vector acc;
      void gravsub(body *b);
      void loopsub(body *b);
  };
  void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
  }
  void body::loopsub(body *b) {
      int i;
      for (i = 0; i < N; i++)
          this->gravsub(&b[i]);
  }

Optimized Code:
  void body::gravsub(body *b) {
      double p, v[NDIM];
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
  }
  void body::loopsub(body *b) {
      int i;
      mutex.acquire();
      for (i = 0; i < N; i++)
          this->gravsub(&b[i]);
      mutex.release();
  }
27. Computation Lock Coarsening Tradeoff
- Advantage
  - Reduces Number of Executed Acquires and Releases
  - Reduces Acquire and Release Overhead
- Disadvantage: May Introduce False Contention
  - Multiple Processors Attempt to Acquire Same Lock
  - Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region
28. False Contention

Original:
  Processor 0: L.acquire() A->op() L.release() ...local computation... L.acquire() A->op() L.release()
  Processor 1: L.acquire() A->op() L.release()

After Computation Lock Coarsening:
  Processor 0: L.acquire() A->op() ...local computation... A->op() L.release()
  Processor 1: L.acquire() ...false contention... A->op() L.release()

(Time runs left to right)
29. Managing Tradeoff: Lock Coarsening Policies
- To Manage Tradeoff, Compiler Must Successfully
  - Reduce Lock Overhead by Increasing Lock Granularity
  - Avoid Excessive False Exclusion and False Contention
- Original Policy
  - Use Original Lock Algorithm
- Bounded Policy
  - Apply Transformation Unless Transformed Code
    - Holds Lock During a Recursive Call, or
    - Holds Lock During a Loop that Invokes Operations
- Aggressive Policy
  - Always Apply Transformation
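The three policies can be summarized as a decision function; a hypothetical sketch (the `Region` fields are my names for the two conditions the bounded policy checks, not the compiler's internal representation):

```cpp
// Which lock coarsening policy is in effect.
enum class Policy { Original, Bounded, Aggressive };

// Properties of the mutual exclusion region the transformation would create.
struct Region {
    bool lockHeldAcrossRecursiveCall;
    bool lockHeldAcrossLoopInvokingOperations;
};

// Decide whether to apply the coarsening transformation to a region.
bool applyCoarsening(Policy p, const Region& r) {
    switch (p) {
    case Policy::Original:   return false;  // keep the original lock algorithm
    case Policy::Aggressive: return true;   // always transform
    case Policy::Bounded:                   // transform unless the lock would be
        return !r.lockHeldAcrossRecursiveCall &&            // held across a recursive call
               !r.lockHeldAcrossLoopInvokingOperations;     // or an operation-invoking loop
    }
    return false;
}
```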
31. Methodology
- Built Prototype Compiler
- Integrated Lock Coarsening Transformations into Prototype
- Acquired Two Complete Applications
  - Barnes-Hut N-Body Solver
  - Water Code
- Automatically Parallelized Applications
- Generated a Version of Each Application for Each Policy
  - Original
  - Bounded
  - Aggressive
- Ran Applications on Stanford DASH Machine
32. Applications
- Barnes-Hut
  - O(N lg N) N-Body Solver
  - Space Subdivision Tree
  - 1500 Lines of C++ Code
- Water
  - Simulates Liquid Water
  - O(N^2) Algorithm
  - 1850 Lines of C++ Code
33. Lock Overhead
- Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks
[Bar charts: Percentage Lock Overhead (0-60%) for the Original, Bounded, and Aggressive versions of Barnes-Hut (16K Particles) and Water (512 Molecules)]
34. Contention Overhead for Barnes-Hut
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors
[Graphs: Contention Percentage (0-100%) vs. Processors (0-16) for the Original, Bounded, and Aggressive versions]
35. Contention Overhead for Water
- Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors
[Graphs: Contention Percentage (0-100%) vs. Processors (0-16) for the Original, Bounded, and Aggressive versions]
36. Speedup
[Graphs: Speedup (0-16) vs. Processors (0-16) for the Ideal, Aggressive, Bounded, and Original versions of Barnes-Hut (16K Particles) and Water (512 Molecules)]
37. Recent Work: Choosing Best Policy
- Best Policy May Depend On
  - Topology of Data Structures
  - Dynamic Schedule of Computation
- Information Required to Choose Best Policy Unavailable at Compile Time
- Complications
  - Different Phases May Have Different Best Policy
  - In Same Phase, Best Policy May Change Over Time
38. Solution: Generate Self-Tuning Code
- Sampling Phase Measures Performance of Different Policies
- Production Phase Uses Best Policy from Sampling Phase
- Periodically Resample to Discover Changes in Best Policy
- Guaranteed Performance Bounds
[Figure: overhead over time for the Original, Bounded, and Aggressive policies; sampling phases alternate with production phases]
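The core of the sampling phase can be sketched as follows; this is my simplification, with each policy version reporting a caller-supplied cost so the sketch stays deterministic (a real system would time actual executions):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Sampling phase: run each policy's code version once under a cost
// measure and return the index of the cheapest one, which the production
// phase then uses until the next resampling point.
std::size_t pickBestPolicy(const std::vector<std::function<double()>>& versions) {
    std::vector<double> cost;
    for (const auto& run : versions)
        cost.push_back(run());                   // measure this policy's version
    return static_cast<std::size_t>(
        std::min_element(cost.begin(), cost.end()) - cost.begin());
}
```

Re-running this selection periodically is what lets the code adapt when the best policy changes across or within phases.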
39. Conclusion
- Synchronization Optimizations
  - Data Lock Coarsening
  - Computation Lock Coarsening
- Integrated into Prototype Parallelizing Compiler
  - Object-Based Programs with Dynamic Data Structures
  - Commutativity Analysis
- Experimental Results
  - Optimizations Have a Significant Performance Impact
  - With Optimizations, Applications Perform Well