Title: Communication Optimizations for Parallel Computing Using Data Access Information
- Martin Rinard
- Department of Computer Science
- University of California, Santa Barbara
- martin@cs.ucsb.edu
- http://www.cs.ucsb.edu/martin
2. Motivation
- Communication Overhead Can Substantially Degrade the Performance of Parallel Computations
3. Communication Optimizations
- Broadcast
- Concurrent Fetch
- Latency Hiding
4. Applying Optimizations
- Programmer
  - By Hand
  - Programming Burden
  - Portability Problems
- Language Implementation
  - Automatically
  - Reduces Programming Burden
  - No Portability Problems - Each Implementation Optimized for Current Hardware Platform
5. Key Questions
- How does the implementation get the information it needs to apply the communication optimizations?
- What communication optimization algorithms does the implementation use?
- How well do the optimized computations perform?
6. Goal of Talk
- Present Experience Automatically Applying Communication Optimizations in Jade
7. Talk Outline
- Jade Language
- Message Passing Implementation
- Communication Optimization Algorithms
- Experimental Results on iPSC/860
- Shared Memory Implementation
- Communication Optimization Algorithms
- Experimental Results on Stanford DASH
- Conclusion
8. Jade
- Portable, Implicitly Parallel Language
- Data Access Information
  - Programmer starts with serial program
  - Uses Jade constructs to provide information about how parts of program access data
- Jade Implementation Uses Data Access Information to Automatically
  - Extract Concurrency
  - Synchronize Computation
  - Apply Communication Optimizations
9. Jade Concepts
- Shared Objects
- Tasks
- Access Specifications
[Figure: a task created with a "withonly ... do () { ... }" construct; the shared object references it will read (rd) and write (wr) form the task's access specification]
10-21. Jade Example
[Animation frames: a sequence of "withonly ... do () ..." tasks, each declaring rd and wr accesses to shared objects; successive frames show tasks becoming enabled and executing as the conflicting accesses of earlier tasks complete]
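The enabling rule the example illustrates can be sketched in Python. This is a hypothetical model, not the Jade runtime: each task carries an access specification (object name mapped to "rd" or "wr"), and a task is enabled only when no earlier pending task declares a conflicting access to any object it touches.

```python
# Hypothetical sketch of Jade's enabling rule (not the actual runtime).

def conflicts(a, b):
    """Two accesses to the same object conflict unless both are reads."""
    return not (a == "rd" and b == "rd")

def enabled_tasks(tasks):
    """tasks: list of pending tasks in serial program order, each a dict
    mapping object name -> access mode. Returns indices of tasks that may
    run now while preserving the serial semantics."""
    runnable = []
    for i, spec in enumerate(tasks):
        blocked = any(
            obj in tasks[j] and conflicts(tasks[j][obj], mode)
            for j in range(i)
            for obj, mode in spec.items()
        )
        if not blocked:
            runnable.append(i)
    return runnable

# Three pending tasks sharing objects "A" and "B":
tasks = [{"A": "wr"}, {"A": "rd", "B": "wr"}, {"A": "rd"}]
print(enabled_tasks(tasks))  # [0]: tasks 1 and 2 wait for the write to A
```

Once task 0 completes and is removed from the pending list, tasks 1 and 2 only read "A", so both become enabled and can run concurrently.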
22. Result
- At Each Point in the Execution
  - A Collection of Enabled Tasks
  - Each Task Has an Access Specification
- Jade Implementation Exploits the Information in Access Specifications to Apply Communication Optimizations
23. Message Passing Implementation
- Model of Computation for Implementation
- Implementation Overview
- Communication Optimizations
- Experimental Results for iPSC/860
24. Model of Computation
- Each Processor Has a Private Memory
- Processors Communicate by Sending Messages through Network
[Figure: processors, each with a private memory, connected by a network]
25. Implementation Overview
- Distributes Objects Across Memories
26. Implementation Overview
- Assigns Enabled Tasks to Idle Processors
27. Implementation Overview
- Transfers Objects to Accessing Processor
- Replicates Objects that Task will Read
28. Implementation Overview
- Transfers Objects to Accessing Processor
- Migrates Objects that Task will Write
29. Implementation Overview
- When all Remote Objects Arrive, Task Executes
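The transfer rule in the overview (replicate objects a task will read, migrate objects it will write) can be sketched as follows. The names here are assumptions for illustration, not the actual implementation:

```python
# Minimal sketch of replicate-on-read / migrate-on-write (assumed names,
# not the actual Jade message passing implementation).

class ObjectState:
    def __init__(self, owner):
        self.owner = owner      # last processor to write the object
        self.copies = {owner}   # processors holding a current copy

    def fetch_for(self, proc, mode):
        """Make the object available on proc for a 'rd' or 'wr' access.
        Returns the number of object transfer messages needed."""
        messages = 0
        if proc not in self.copies:
            messages += 1       # transfer a copy from a processor that has one
            self.copies.add(proc)
        if mode == "wr":
            self.copies = {proc}  # migration: all other copies are invalidated
            self.owner = proc
        return messages

obj = ObjectState(owner=0)
print(obj.fetch_for(1, "rd"))  # 1: replicate on processor 1
print(obj.fetch_for(2, "rd"))  # 1: replicate on processor 2
print(obj.fetch_for(2, "wr"))  # 0: already local; migration drops other copies
print(sorted(obj.copies))      # [2]
```

Reads leave multiple current copies in place (enabling concurrent readers); a write collapses the copy set to the writing processor, which becomes the new owner.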
30. Communication Optimizations
- Adaptive Broadcast
  - Goal: Parallelize Communication
  - Mechanism: Broadcast Each New Version of Widely Accessed Objects
- Replication
  - Goal: Enable Tasks to Concurrently Read Same Data
  - Mechanism: Replicate Data on Reading Processors
- Latency Hiding
  - Goal: Overlap Computation and Communication
  - Mechanism: Assign Multiple Enabled Tasks to Same Processor
- Concurrent Fetch
  - Goal: Parallelize Communication
  - Mechanism: Concurrently Transfer Remote Objects that Task will Access
- Locality
  - Goal: Eliminate Communication
  - Mechanism: Execute Tasks on Processors that have Locally Available Copies of Accessed Objects
31. Application-Based Evaluation
- Water: Evaluates forces and potentials in a system of liquid water molecules
- String: Computes a velocity model of the geology between two oil wells
- Ocean: Simulates the role of eddy and boundary currents in influencing large-scale ocean movements
- Panel Cholesky: Sparse Cholesky factorization algorithm
32. Impact of Communication Optimizations

Optimization        Water   String   Ocean   Panel Cholesky
Adaptive Broadcast    *       -        -           -
Replication           R       R        R           R
Latency Hiding        -       -        -           -
Concurrent Fetch      -       -        -           -

* = Significant Impact, - = Negligible Impact, R = Required To Expose Concurrency
33. Optimization Impact
- Adaptive Broadcast: Significant performance improvement for Water. Negligible impact for String. No impact for Ocean and Panel Cholesky.
- Replication: Crucial. Without replication all applications execute serially.
- Latency Hiding: No impact for Water, String and Ocean - no excess concurrency. Negligible impact for Panel Cholesky.
- Concurrent Fetch: None. Almost all tasks access at most one remote object.
34. Locality Optimization
- Integrated into Online Scheduler
- Scheduler
  - Maintains Pool of Enabled Tasks
  - Maintains Pool of Idle Processors
  - Balances Load by Assigning Enabled Tasks to Idle Processors
- Locality Algorithm Affects the Assignment
35. Locality Concepts
- Each Object has an Owner: the last processor to write the object. The owner has a current copy of the object.
- Each Task has a Locality Object: currently the first object in its access specification.
- The Locality Object Determines the Target Processor: the owner of the locality object.
- Goal: Execute each task on its target processor.
36. When Task Becomes Enabled
- Scheduler Checks Pool of Idle Processors
- If Target Processor is Idle: Target Processor Gets Task
- If Some Other Processor is Idle: Other Processor Gets Task
- If No Processor is Idle: Task is Held in Pool of Enabled Tasks
37. When Processor Becomes Idle
- Scheduler Checks Pool of Enabled Tasks
- If Processor is Target of an Enabled Task: Processor Gets That Task
- If Other Enabled Tasks Exist: Processor Gets One of Those Tasks
- If No Enabled Tasks Exist: Processor Stays Idle
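The two scheduling rules (task becomes enabled, processor becomes idle) can be sketched together. This is a toy single-threaded model with assumed data structures, not the actual implementation:

```python
# Toy sketch of the locality scheduler's two rules (assumed structures,
# not the actual message passing implementation).

enabled_pool = []   # enabled tasks waiting for a processor
idle_pool = set()   # processors with no work

def task_enabled(task):
    """task: dict with a 'target' processor. Returns the assigned
    processor, or None if the task is held in the enabled pool."""
    if task["target"] in idle_pool:
        idle_pool.discard(task["target"])   # target processor gets the task
        return task["target"]
    if idle_pool:
        return idle_pool.pop()              # some other idle processor gets it
    enabled_pool.append(task)               # no processor idle: hold the task
    return None

def processor_idle(proc):
    """Returns a task for proc, preferring one whose target is proc."""
    for i, task in enumerate(enabled_pool):
        if task["target"] == proc:
            return enabled_pool.pop(i)      # proc is the target of this task
    if enabled_pool:
        return enabled_pool.pop(0)          # take some other enabled task
    idle_pool.add(proc)                     # no enabled tasks: stay idle
    return None

idle_pool.add(1)
print(task_enabled({"id": "a", "target": 1}))  # 1: target processor is idle
print(task_enabled({"id": "b", "target": 1}))  # None: no idle processor
print(processor_idle(2)["id"])                 # 'b': some enabled task exists
```

Load balance takes priority over locality: an enabled task never waits for its target processor if any other processor is idle.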
38. Implementation Versions
- Locality: Implementation uses Locality Algorithm
- No Locality: First Come, First Served Assignment of Enabled Tasks to Idle Processors
- Task Placement (Ocean and Panel Cholesky): Programmer assigns tasks to processors
39. Task Locality Percentage
- Measures how well the scheduler places tasks on their target processors:
  (Number of Tasks Executed on Target Processor / Total Number of Executed Tasks) x 100
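The metric above, as a small sketch (hypothetical helper name):

```python
# Task locality percentage: the fraction of tasks that ran on their
# target processor, expressed as a percentage.

def task_locality_percentage(executions):
    """executions: list of (executing_processor, target_processor) pairs."""
    on_target = sum(1 for ran_on, target in executions if ran_on == target)
    return 100.0 * on_target / len(executions)

print(task_locality_percentage([(0, 0), (1, 0), (2, 2), (3, 3)]))  # 75.0
```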
40. Percentage of Tasks Executed on Target Processor on iPSC/860
[Graph: task locality percentage (0-100%) versus number of processors (0-32)]
41. Communication to Computation Ratio
- Measures the Effect of the Locality Algorithm on the Communication
- Rationale: Task times include no communication overhead and do not vary significantly between versions
42. Communication to Useful Computation Ratio on iPSC/860 (Mbytes/Second/Processor)
[Graphs: four panels (water, string, panel cholesky, ocean) plotting the ratio versus number of processors (0-32) for the no locality, locality, and task placement versions]
43. Speedup on iPSC/860
44. Shared Memory Implementation
- Model of Computation
- Locality Optimization
- Locality Performance Results
45. Model of Computation
- Single Shared Memory Composed of Memory Modules
- Each Memory Module Associated with a Processor
- Each Object Allocated in a Memory Module
- Processors Communicate by Reading and Writing Objects in the Shared Memory
[Figure: processors attached to memory modules that together form the shared memory; objects reside in memory modules]
46. Locality Algorithm
- Integrated into Online Scheduler
- Scheduler Runs Distributed Task Queue
- Each Processor Has a Queue of Enabled Tasks
- Idle Processors Search Task Queues
- Locality Algorithm Affects Task Queue Algorithm
47. Locality Concepts
- Each Object has an Owner: the processor associated with the memory module that holds the object. Accesses to the object from this processor are satisfied from the local memory module.
- Each Task has a Locality Object: currently the first object in its access specification.
- The Locality Object Determines the Target Processor: the owner of the locality object.
- Goal: Execute each task on its target processor.
48. When Processor Becomes Idle
- If Its Task Queue is not Empty: Execute First Task in Its Task Queue
- Otherwise: Cyclically Search the Other Task Queues
  - If a Remote Task Queue is not Empty: Execute the Last Task in That Queue
49. When Task Becomes Enabled
- Locality Algorithm Inserts Task into the Task Queue at the Owner of Its Locality Object
- Tasks with the Same Locality Object are Adjacent in the Queue
- Goals
  - Enhance memory locality by executing each task on the owner of its locality object.
  - Enhance cache locality by executing tasks with the same locality object consecutively on the same processor.
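The distributed task queue described on this slide and the previous one can be sketched as follows (assumed structures, not the actual shared memory implementation): tasks are enqueued at the owner of their locality object, adjacent to tasks with the same locality object; an idle processor takes from the front of its own queue and otherwise cyclically searches the others, taking from the back.

```python
# Sketch of the locality-aware distributed task queue (assumed names,
# not the actual Jade shared memory implementation).
from collections import deque

class DistributedTaskQueue:
    def __init__(self, num_procs):
        self.queues = [deque() for _ in range(num_procs)]

    def task_enabled(self, task, owner):
        """Insert at the owner of the task's locality object, keeping
        tasks with the same locality object adjacent (cache locality)."""
        q = self.queues[owner]
        for i, t in enumerate(q):
            if t["locality_object"] == task["locality_object"]:
                q.insert(i + 1, task)
                return
        q.append(task)

    def processor_idle(self, proc):
        if self.queues[proc]:
            return self.queues[proc].popleft()  # first task in own queue
        n = len(self.queues)
        for k in range(1, n):                   # cyclically search the others
            q = self.queues[(proc + k) % n]
            if q:
                return q.pop()                  # last task in remote queue
        return None

dtq = DistributedTaskQueue(2)
dtq.task_enabled({"id": "a", "locality_object": "X"}, owner=0)
dtq.task_enabled({"id": "b", "locality_object": "Y"}, owner=0)
dtq.task_enabled({"id": "c", "locality_object": "X"}, owner=0)
print(dtq.processor_idle(0)["id"])  # 'a': front of its own queue
print(dtq.processor_idle(1)["id"])  # 'b': taken from the back of queue 0
```

Taking from the back when stealing leaves the front of the queue, and any run of tasks sharing a locality object there, for the owning processor.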
50. Evaluation
- Same Set of Applications
- Water
- String
- Ocean
- Panel Cholesky
- Same Locality Versions
- Locality
- No Locality (Single Task Queue)
- Explicit Task Placement (Ocean and Panel Cholesky)
51. Percentage of Tasks Executed on Target Processor on DASH
[Graph: task locality percentage (0-100%) versus number of processors (0-32)]
52. Task Execution Time
- Measures the Effect of the Locality Algorithm on the Communication
- Metric: Sum of Task Execution Times
- Rationale: All communication is performed on demand as tasks access data, so differences in communication show up as differences in task execution times.
53. Task Execution Time on DASH
[Graphs: panels for water, string, panel cholesky, and ocean plotting the sum of task execution times versus number of processors (0-32) for the no locality, locality, and task placement versions]
54. Speedup on DASH
[Graphs: four panels plotting speedup (0-32) versus number of processors (0-32)]
55. Related Work
- Shared Memory
  - COOL - Chandra, Gupta, Hennessy
  - Fowler, Kontothanassis
- Message Passing
  - Tempest - Falsafi, Lebeck, Reinhardt, Schoinas, Hill, Larus, Rogers, Wood
  - Munin - Carter, Bennett, Zwaenepoel
  - SAM - Scales, Lam
  - Olden - Carlisle, Rogers
  - Prelude - Hsieh, Wang, Weihl
56. Conclusion
- Access Specifications Enable Communication Optimizations
- Implemented Optimizations for Jade
  - Message Passing Implementation
  - Shared Memory Implementation
- Experimental, Application-Based Evaluation
  - Replication Required to Expose Concurrency
  - Locality Significant for Ocean and Panel Cholesky
  - Broadcast Significant for Water
  - Other Optimizations Have Little or No Impact