Communication Optimizations for Parallel Computing Using Data Access Information
1
Communication Optimizations for Parallel
Computing Using Data Access Information
  • Martin Rinard
  • Department of Computer Science
  • University of California, Santa Barbara
  • martin@cs.ucsb.edu
  • http://www.cs.ucsb.edu/martin

2
Motivation
  • Communication Overhead Can Substantially Degrade
    the Performance of Parallel Computations

3
Communication Optimizations
  • Replication
  • Locality
  • Broadcast
  • Concurrent Fetch
  • Latency Hiding
4
Applying Optimizations
  • Programmer (By Hand)
    • Programming Burden
    • Portability Problems
  • Language Implementation (Automatically)
    • Reduces Programming Burden
    • No Portability Problems - Each Implementation Optimized for the Current Hardware Platform

5
Key Questions
  • How does the implementation get the information
    it needs to apply the communication
    optimizations?
  • What communication optimization algorithms does
    the implementation use?
  • How well do the optimized computations perform?

6
Goal of Talk
  • Present Experience Automatically Applying
    Communication Optimizations in Jade

7
Talk Outline
  • Jade Language
  • Message Passing Implementation
    • Communication Optimization Algorithms
    • Experimental Results on iPSC/860
  • Shared Memory Implementation
    • Communication Optimization Algorithms
    • Experimental Results on Stanford DASH
  • Conclusion

8
Jade
  • Portable, Implicitly Parallel Language
  • Data Access Information
    • Programmer starts with a serial program
    • Uses Jade constructs to provide information about how parts of the program access data
  • Jade Implementation Uses the Data Access Information to Automatically
    • Extract Concurrency
    • Synchronize the Computation
    • Apply Communication Optimizations

9
Jade Concepts
  • Shared Objects
  • Tasks
  • Access Specifications

[Figure: anatomy of a task. A task is written

    withonly { access specification } do () { computation that reads and writes }

where the access specification is built from rd and wr declarations on
shared object references, and the do body is the task's computation.]
10-21
Jade Example
[Figure: an animation sequence of twelve frames stepping through an
example program. Each frame shows tasks created with withonly ... do ()
constructs, the rd and wr declarations in their access specifications,
and the shared objects they access; successive frames show tasks being
created, enabled as earlier tasks complete, and executed.]
22
Result
  • At Each Point in the Execution
    • A Collection of Enabled Tasks
    • Each Task Has an Access Specification
  • Jade Implementation
    • Exploits Information in Access Specifications to Apply Communication Optimizations

23
Message Passing Implementation
  • Model of Computation for Implementation
  • Implementation Overview
  • Communication Optimizations
  • Experimental Results for iPSC/860

24
Model of Computation
  • Each Processor Has a Private Memory
  • Processors Communicate by Sending Messages
    through Network

[Figure: processors, each with a private memory, connected by a network.]
25
Implementation Overview
  • Distributes Objects Across Memories

26
Implementation Overview
  • Assigns Enabled Tasks to Idle Processors

27
Implementation Overview
  • Transfers Objects to Accessing Processor
  • Replicates Objects that Task will Read

28
Implementation Overview
  • Transfers Objects to Accessing Processor
  • Migrates Objects that Task will Write

29
Implementation Overview
  • When all Remote Objects Arrive
  • Task Executes
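
The transfer rule on slides 27-29 condensed into a C sketch; every type
and helper here (AccessSpec, owner_of_obj, replicate, migrate, execute)
is a hypothetical stand-in, and the real implementation issues the
fetches concurrently (the concurrent fetch optimization of slide 30)
rather than in a serial loop.

    typedef enum { RD, WR } Mode;

    typedef struct { Mode mode; int obj; } Access;
    typedef struct { Access *acc; int n; } AccessSpec;

    extern int  owner_of_obj(int obj);
    extern void replicate(int obj, int proc);  /* copy object; owner keeps its copy */
    extern void migrate(int obj, int proc);    /* move object; proc becomes owner   */
    extern void execute(const AccessSpec *s);

    /* Runs on the processor the task was assigned to; the task executes
     * once all remote objects have arrived (slide 29). */
    void fetch_and_run(const AccessSpec *spec, int proc)
    {
        for (int i = 0; i < spec->n; i++) {
            Access *a = &spec->acc[i];
            if (owner_of_obj(a->obj) == proc)
                continue;                  /* already locally available */
            if (a->mode == RD)
                replicate(a->obj, proc);   /* readers share replicated copies */
            else
                migrate(a->obj, proc);     /* the writer takes the object */
        }
        execute(spec);
    }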

30
Optimization        Mechanism                                        Goal
------------------  -----------------------------------------------  ----------------------------
Adaptive Broadcast  Broadcast each new version of widely accessed    Parallelize communication
                    objects
Replication         Replicate data on reading processors             Enable tasks to concurrently
                                                                     read the same data
Latency Hiding      Assign multiple enabled tasks to the same        Overlap computation and
                    processor                                        communication
Concurrent Fetch    Concurrently transfer remote objects that the    Parallelize communication
                    task will access
Locality            Execute tasks on processors that have locally    Eliminate communication
                    available copies of the accessed objects
31
Application-Based Evaluation
  • Water - Evaluates forces and potentials in a system of liquid water molecules
  • String - Computes a velocity model of the geology between two oil wells
  • Ocean - Simulates the role of eddy and boundary currents in influencing large-scale ocean movements
  • Panel Cholesky - Sparse Cholesky factorization algorithm

32
Impact of Communication Optimizations
                    Panel Cholesky   String   Water   Ocean
Adaptive Broadcast        -             -       +       -
Replication               *             *       *       *
Latency Hiding            -             -       -       -
Concurrent Fetch          -             -       -       -

+ Significant Impact    - Negligible Impact    * Required to Expose Concurrency
33
Optimization        Impact
------------------  ----------------------------------------------------------
Adaptive Broadcast  Significant performance improvement for Water. Negligible
                    impact for String. No impact for Ocean and Panel Cholesky.
Replication         Crucial. Without replication all applications execute
                    serially.
Latency Hiding      No impact for Water, String, and Ocean - no excess
                    concurrency. Negligible impact for Panel Cholesky.
Concurrent Fetch    None. Almost all tasks access at most one remote object.
34
Locality Optimization
  • Integrated into Online Scheduler
  • Scheduler
    • Maintains Pool of Enabled Tasks
    • Maintains Pool of Idle Processors
    • Balances Load by Assigning Enabled Tasks to Idle Processors
  • Locality Algorithm Affects the Assignment

35
Locality Concepts
  • Each Object has an Owner - the last processor to write the object. The owner has a current copy of the object.
  • Each Task has a Locality Object - currently the first object in its access specification.
  • The Locality Object Determines the Target Processor - the owner of the locality object.
  • Goal: Execute each task on its target processor.
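
A minimal C sketch of these concepts; the struct layouts and names are
illustrative, not the actual Jade runtime representation.

    typedef struct {
        int owner;                 /* last processor to write the object; */
        /* ... object data ... */  /* the owner holds a current copy      */
    } SharedObject;

    typedef struct {
        SharedObject *access[8];   /* objects named in the access specification; */
        int n;                     /* access[0] is the locality object           */
    } AccessSpec;

    /* Ownership rule: the owner is the last writer. */
    void note_write(SharedObject *o, int writer) { o->owner = writer; }

    /* The target processor is the owner of the locality object. */
    int target_processor(const AccessSpec *s) { return s->access[0]->owner; }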

36
When Task Becomes Enabled
  • Scheduler Checks the Pool of Idle Processors
  • If the Target Processor is Idle: the Target Processor Gets the Task
  • If Some Other Processor is Idle: That Processor Gets the Task
  • If No Processor is Idle: the Task is Held in the Pool of Enabled Tasks

37
When Processor Becomes Idle
  • Scheduler Checks the Pool of Enabled Tasks
  • If the Processor is the Target of an Enabled Task: the Processor Gets That Task
  • If Other Enabled Tasks Exist: the Processor Gets One of Those Tasks
  • If No Enabled Tasks Exist: the Processor Stays Idle (see the sketch below)
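
A minimal C sketch of this assignment policy (slides 36 and 37); the
pool representations and run_on() are hypothetical simplifications, not
the actual Jade scheduler code.

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_TASKS 1024
    #define NPROCS    32

    typedef struct { int target_proc; } Task;  /* owner of the locality object */

    static Task  *enabled[MAX_TASKS];          /* pool of enabled tasks   */
    static size_t n_enabled;
    static bool   idle[NPROCS];                /* pool of idle processors */

    extern void run_on(Task *t, int proc);     /* dispatch; clears idle[proc] */

    /* Slide 36: a task becomes enabled. */
    void task_enabled(Task *t)
    {
        if (idle[t->target_proc]) {            /* target processor is idle */
            run_on(t, t->target_proc);
            return;
        }
        for (int p = 0; p < NPROCS; p++)       /* any other idle processor */
            if (idle[p]) { run_on(t, p); return; }
        enabled[n_enabled++] = t;              /* no processor idle: hold task */
    }

    /* Slide 37: a processor becomes idle. */
    void processor_idle(int proc)
    {
        for (size_t i = 0; i < n_enabled; i++)
            if (enabled[i]->target_proc == proc) {  /* prefer a task targeting this processor */
                Task *t = enabled[i];
                enabled[i] = enabled[--n_enabled];
                run_on(t, proc);
                return;
            }
        if (n_enabled > 0) {                        /* otherwise take any enabled task */
            run_on(enabled[--n_enabled], proc);
            return;
        }
        idle[proc] = true;                          /* no enabled tasks: stay idle */
    }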

38
Implementation Versions
  • Locality - Implementation uses the Locality Algorithm
  • No Locality - First Come, First Served Assignment of Enabled Tasks to Idle Processors
  • Task Placement (Ocean and Panel Cholesky) - Programmer assigns tasks to processors

39
Task Locality Percentage
Measures how well the scheduler places tasks on their target processors:

    Task Locality Percentage = 100 x (Number of Tasks Executed on Target Processor)
                                     / (Total Number of Executed Tasks)
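
The same metric as a one-line C helper; the function and parameter
names are illustrative.

    /* Percentage of tasks that ran on their target processor. */
    double task_locality_percentage(long on_target, long total)
    {
        return total ? 100.0 * (double)on_target / (double)total : 0.0;
    }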
40
Percentage of Tasks Executed on Target Processor
on iPSC/860
[Graph: percentage of tasks executed on the target processor (0-100)
versus number of processors (0-32) on the iPSC/860.]
41
Communication to Computation Ratio
  • Measures the Effect of the Locality Algorithm on
    the Communication

Rationale: Task times include no communication overhead and do not
vary significantly between versions.
42
Communication to Useful Computation Ratio on iPSC/860 (MBytes/Second/Processor)
[Graphs: one panel per application (Water, String, Panel Cholesky,
Ocean) plotting the communication to useful computation ratio versus
number of processors (0-32), comparing the no locality, locality, and
task placement versions. The y-axis spans roughly 0-0.0025 for Water
and String and 0-3 for Panel Cholesky and Ocean.]
43
Speedup on iPSC/860
44
Shared Memory Implementation
  • Model of Computation
  • Locality Optimization
  • Locality Performance Results

45
Model of Computation
  • Single Shared Memory
  • Composed of Memory Modules
  • Each Memory Module Associated with a Processor
  • Each Object Allocated in a Memory Module
  • Processors Communicate by Reading and Writing
    Objects in the Shared Memory

[Figure: a shared memory composed of memory modules, one per processor,
with objects allocated in the modules.]
46
Locality Algorithm
  • Integrated into Online Scheduler
  • Scheduler Runs a Distributed Task Queue
    • Each Processor Has a Queue of Enabled Tasks
    • Idle Processors Search Task Queues
  • Locality Algorithm Affects the Task Queue Algorithm

47
Locality Concepts
  • Each Object has an Owner - the processor associated with the memory module that holds the object. Accesses to the object from this processor are satisfied from the local memory module.
  • Each Task has a Locality Object - currently the first object in its access specification.
  • The Locality Object Determines the Target Processor - the owner of the locality object.
  • Goal: Execute each task on its target processor.

48
When Processor Becomes Idle
  • If Its Task Queue is not Empty: Execute the First Task in Its Queue
  • Otherwise, Cyclically Search the Other Task Queues
  • If a Remote Task Queue is not Empty: Execute the Last Task in That Queue

49
When Task Becomes Enabled
  • The Locality Algorithm Inserts the Task into the Task Queue at the Owner of Its Locality Object
  • Tasks with the Same Locality Object are Adjacent in the Queue
  • Goals (see the sketch below)
    • Enhance memory locality by executing each task on the owner of its locality object.
    • Enhance cache locality by executing tasks with the same locality object consecutively on the same processor.
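
A minimal C sketch of this queue policy (slides 48 and 49). The
circular buffers and owner_of() are hypothetical, locking is omitted,
and the logic that keeps tasks with the same locality object adjacent
is elided.

    #include <stddef.h>

    #define NPROCS 32
    #define QCAP   256

    typedef struct {
        void *locality_obj;       /* first object in the access specification */
    } Task;

    typedef struct {
        Task *slot[QCAP];         /* circular buffer of enabled tasks: */
        int   head, count;        /* dequeue at head, steal at tail    */
    } Queue;

    static Queue q[NPROCS];       /* one task queue per processor */

    extern int owner_of(void *obj);   /* processor whose memory module holds obj */

    /* Slide 49: an enabled task is queued at the owner of its locality object. */
    void task_enabled(Task *t)
    {
        Queue *dst = &q[owner_of(t->locality_obj)];
        dst->slot[(dst->head + dst->count++) % QCAP] = t;
    }

    /* Slide 48: an idle processor takes the first task in its own queue,
     * then cyclically searches the other queues, taking from the back. */
    Task *next_task(int self)
    {
        Queue *own = &q[self];
        if (own->count > 0) {
            Task *t = own->slot[own->head];
            own->head = (own->head + 1) % QCAP;
            own->count--;
            return t;
        }
        for (int i = 1; i < NPROCS; i++) {
            Queue *r = &q[(self + i) % NPROCS];
            if (r->count > 0)
                return r->slot[(r->head + --r->count) % QCAP];
        }
        return NULL;              /* no enabled tasks anywhere: stay idle */
    }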

50
Evaluation
  • Same Set of Applications
    • Water
    • String
    • Ocean
    • Panel Cholesky
  • Same Locality Versions
    • Locality
    • No Locality (Single Task Queue)
    • Explicit Task Placement (Ocean and Panel Cholesky)

51
Percentage of Tasks Executed on Target Processor
on DASH
[Graph: percentage of tasks executed on the target processor (0-100)
versus number of processors (0-32) on DASH.]
52
Task Execution Time
Measures the Effect of the Locality Algorithm on the Communication:
the Sum of Task Execution Times
Rationale: All communication is performed on demand as tasks access
data, so differences in communication show up as differences in task
execution times.
53
Task Execution Time on DASH
[Graphs: task execution time versus number of processors (0-32) on
DASH, one panel per application (Water, String, Panel Cholesky, Ocean),
comparing the no locality, locality, and task placement versions.]
54
Speedup on DASH
[Graphs: speedup (0-32) versus number of processors (0-32) on DASH,
one panel per application.]
55
Related Work
  • Shared Memory
    • COOL - Chandra, Gupta, Hennessy
    • Fowler, Kontothanassis
  • Message Passing
    • Tempest - Falsafi, Lebeck, Reinhardt, Schoinas, Hill, Larus, Rogers, Wood
    • Munin - Carter, Bennett, Zwaenepoel
    • SAM - Scales, Lam
    • Olden - Carlisle, Rogers
    • Prelude - Hsieh, Wang, Weihl

56
Conclusion
  • Access Specifications Enable Communication Optimizations
  • Implemented Optimizations for Jade
    • Message Passing Implementation
    • Shared Memory Implementation
  • Experimental, Application-Based Evaluation
    • Replication Required to Expose Concurrency
    • Locality Significant for Ocean and Panel Cholesky
    • Broadcast Significant for Water
    • Other Optimizations Have Little or No Impact