Title: Effective Fine-Grain Synchronization For Automatically Parallelized Programs Using Optimistic Synchronization Primitives
1. Effective Fine-Grain Synchronization For Automatically Parallelized Programs Using Optimistic Synchronization Primitives
- Martin Rinard
- University of California, Santa Barbara
2. Problem
- Key Issue: Efficiently Implementing Atomic Operations On Objects
  - Mutual Exclusion Locks Versus Optimistic Synchronization Primitives
- Context
  - Parallelizing Compiler For Irregular, Object-Based Programs With Linked Data Structures
  - Commutativity Analysis
3. Talk Outline
- Histogram Example
- Advantages and Limitations of Optimistic Synchronization
- Synchronization Selection Algorithm
- Experimental Results
4. Histogram Example

  class histogram {
    private:
      int counts[N];
    public:
      void update(int i) {
        counts[i]++;
      }
  };

  parallel for (i = 0; i < iterations; i++) {
    int c = f(i);
    h->update(c);
  }

[Figure: histogram object with bucket counts 3, 7, 4, 1, 2, 0, 5, 8]
5. Cloud Of Parallel Histogram Updates

[Figure: iterations 0 through 8 concurrently updating the buckets of a single histogram object]

Updates Must Execute Atomically
6. One Lock Per Object

  class histogram {
    private:
      int counts[N];
      lock mutex;
    public:
      void update(int i) {
        mutex.acquire();
        counts[i]++;
        mutex.release();
      }
  };

Problem: False Exclusion
7. One Lock Per Item

  class histogram {
    private:
      int counts[N];
      lock mutex[N];
    public:
      void update(int i) {
        mutex[i].acquire();
        counts[i]++;
        mutex[i].release();
      }
  };

Problem: Memory Consumption
8. Optimistic Synchronization
- Load Old Value
- Compute New Value Into Local Storage
- Commit Point
  - No Write Between Load and Commit: Commit Succeeds, Write New Value
  - Write Between Load and Commit: Commit Fails, Retry Update
9. Parallel Updates With Optimistic Synchronization

[Figure: each parallel update loads the old value, computes the new value into local storage, and writes the new value when its commit succeeds]
10. Optimistic Synchronization In Modern Processors
- Load Linked (LL): Used To Load Old Value
- Store Conditional (SC): Used To Commit New Value
- Atomic Increment Using Optimistic Synchronization Primitives:

    retry: LL    $2, 0($4)      # load old value
           addiu $3, $2, 1      # compute new value into local storage
           SC    $3, 0($4)      # attempt to store new value
           beq   $3, $0, retry  # retry if failure
11. Optimistically Synchronized Histogram

  class histogram {
    private:
      int counts[N];
    public:
      void update(int i) {
        int new_count;
        do {
          new_count = LL(counts[i]);
          new_count++;
        } while (!SC(new_count, counts[i]));
      }
  };
12. Aspects of Optimistic Synchronization
- Advantages
  - Slightly More Efficient Than Locked Updates
  - No Memory Overhead
  - No Data Cache Overhead
  - Potentially Fewer Memory Consistency Requirements
- Advantages In Other Contexts
  - No Deadlock, No Priority Inversions, No Lock Convoys
- Limitations
  - Existing Primitives Support Only Single Word Updates
  - Each Update Must Be Synchronized Individually
  - Lack of Fairness
13. Synchronization In Automatically Parallelized Programs

  Serial Program
    (Assumption: Operations Execute Atomically)
        |
        | Commutativity Analysis
        v
  Unsynchronized Parallel Program
    (Requirement: Correctly Synchronize Atomic Operations)
        |
        | Synchronization Selection
        | (Goal: Choose An Efficient Synchronization Mechanism For Each Operation)
        v
  Synchronized Parallel Program
14. Atomicity Issues In Generated Code

  Serial Program
    (Assumption: Operations Execute Atomically)
        |
        | Commutativity Analysis
        v
  Unsynchronized Parallel Program
    (Requirement: Correctly Synchronize Atomic Operations)
        |
        | Synchronization Selection
        | (Goal: Choose An Efficient Synchronization Mechanism For Each Operation)
        v
  Synchronized Parallel Program
15. Use Optimistic Synchronization Whenever Possible
16. Model Of Computation
- Objects With Instance Variables

    class histogram {
      private:
        int counts[N];
    };

- Operations Update Objects By Modifying Instance Variables

    void histogram::update(int i) {
      counts[i]++;
    }

[Figure: h->update(1) increments counts[1] of a histogram object]
17. Commutativity Analysis
- Compiler Computes Extent Of Computation
  - Representation of All Operations in Computation
  - In Example: histogram::update
- Do All Pairs Of Operations Commute?
  - No: Generate Serial Code
  - Yes: Automatically Generate Parallel Code
- In Example: h->update(i) and h->update(j) commute for all i, j
18. Synchronization Requirements
- Traditional Parallelizing Compilers
  - Parallelize Loops With Independent Iterations
  - Barrier Synchronization
- Commutativity Analysis
  - Parallel Operations May Update Same Object
  - For Generated Code To Execute Correctly, Operations Must Execute Atomically
  - Code Generation Algorithm Must Insert Synchronization
19. Default Synchronization Algorithm

  class histogram {
    private:
      int counts[N];
      lock mutex;  // one lock per object
    public:
      void update(int i) {
        mutex.acquire();
        counts[i]++;
        mutex.release();
      }
  };

Operations Acquire and Release Lock
20. Synchronization Constraints
- A Read/Compute/Write Update To A Single Instance Variable Can Use Optimistic Synchronization:

    counts[i] = counts[i] + 1;

- Updates That Involve Multiple Interdependent Instance Variables Must Use Lock Synchronization:

    temp = counts[i];
    counts[i] = counts[j];
    counts[j] = temp;
21. Synchronization Selection Constraints
- Can Use Optimistic Synchronization Only For Single Word Updates That
  - Read An Instance Variable
  - Compute A New Value That Depends On No Other Updated Instance Variable
  - Write The New Value Back Into The Instance Variable
- All Updates To The Same Instance Variable Must Use The Same Synchronization Mechanism
22. Synchronization Selection Algorithm
- Operates At Granularity Of Instance Variables
- Compiler Scans All Updates To Each Instance Variable
  - If All Updates Can Use Optimistic Synchronization, Instance Variable Is Marked Optimistically Synchronized
  - If At Least One Update Must Use Lock Synchronization, Instance Variable Is Marked Lock Synchronized
- If A Class Has A Lock Synchronized Variable, Class Is Marked Lock Synchronized
23. Synchronization Selection In Example

  class histogram {
    private:
      int counts[N];
    public:
      void update(int i) {
        counts[i]++;
      }
  };

- counts Is Marked As An Optimistically Synchronized Instance Variable
- histogram Is NOT Marked As A Lock Synchronized Class
24. Code Generation Algorithm
- All Lock Synchronized Classes Augmented With Locks
- Operations That Update Lock Synchronized Variables Acquire and Release the Lock in the Object
- Operations That Update Optimistically Synchronized Variables Use Optimistic Synchronization Primitives
25. Optimistically Synchronized Histogram

  class histogram {
    private:
      int counts[N];
    public:
      void update(int i) {
        int new_count;
        do {
          new_count = LL(counts[i]);
          new_count++;
        } while (!SC(new_count, counts[i]));
      }
  };
26. Experimental Results
27. Methodology
- Implemented Parallelizing Compiler
- Implemented Synchronization Selection Algorithm
- Parallelized Three Complete Scientific Applications: Barnes-Hut, String, Water
- Produced Four Versions
  - Optimistic (All Updates Optimistically Synchronized)
  - Item Lock (Produced By Hand)
  - Object Lock
  - Coarse Lock
- Used Inline Intrinsic Locks With Exponential Backoff
- Measured Performance On SGI Challenge XL
28. Time For One Update

[Figure: time for one cached update and one uncached update on the Challenge XL]
29. Synchronization Frequency

[Figure: microseconds per synchronization for the Optimistic/Item Lock, Object Lock, and Coarse Lock versions of Barnes-Hut, String, and Water]
30. Memory Consumption For Barnes-Hut

[Figure: total memory used to store objects (MBytes) for the Optimistic, Item Lock, Object Lock, and Coarse Lock versions]
31. Memory Consumption For String

[Figure: total memory used to store objects (MBytes) for the Optimistic, Item Lock, and Object Lock versions]
32. Memory Consumption For Water

[Figure: total memory used to store objects (MBytes) for the Optimistic, Item Lock, Object Lock, and Coarse Lock versions]
33. Speedups For Barnes-Hut

[Figure: speedup versus number of processors for the Optimistic, Item Lock, Object Lock, and Coarse Lock versions]
34. Speedups For String

[Figure: speedup versus number of processors (up to 24) for the Optimistic, Item Lock, and Object Lock versions]
35. Speedups For Water

[Figure: speedup versus number of processors (up to 24) for the Optimistic, Item Lock, Object Lock, and Coarse Lock versions]
36. Acknowledgements
- Pedro Diniz: Parallelizing Compiler
- Silicon Graphics: Challenge XL Multiprocessor
- Rohit Chandra, T.K. Lakshman, Robert Kennedy, Alex Poulos: Technical Assistance With SGI Hardware and Software
37. Bottom Line
- Optimistic Synchronization Offers
  - No Memory Overhead
  - No Data Cache Overhead
  - Reasonably Small Execution Time Overhead
  - Good Performance On All Applications
- Good Choice For Parallelizing Compiler
  - Minimal Impact On Parallel Program
  - Simple, Robust, Works Well In Range Of Situations
- Major Drawback
  - Current Primitives Support Only Single Word Updates
- Use Optimistic Synchronization Whenever Applicable
38. Future
- The Efficient Implementation Of Atomic Operations On Objects Will Become A Crucial Issue For Mainstream Software
  - Small-Scale Shared-Memory Multiprocessors
  - Multithreaded Applications and Libraries
  - Popularity of Object-Oriented Programming
  - Specific Example: Java Standard Library
- Optimistic Synchronization Primitives Will Play An Important Role