Title: pArray as an Efficient Static Parallel Container in STAPL (Standard Template Adaptive Parallel Library)
1pArray as an Efficient Static Parallel Container in STAPL (Standard Template Adaptive Parallel Library)
- Presenter: Olga Tkachyshyn
- Grad Student Advisors: Ping An, Gabriel Tanase
- Faculty Advisor: Nancy Amato
2Presentation Plan
- Motivation
- STAPL Overview
- pContainer Design
- pArray
- Prefix-sum using pArray and pVector
- Performance results
- Conclusion and Future Work
3Motivation
- The time it takes to complete a task is limited by the speed of the worker
  - Alternative: several workers sharing the task
- Similarly, the time it takes to solve a problem on a computer is limited by the speed of the processor
  - Alternative: parallel processing, the concurrent use of multiple processors to process data
4Parallel/Distributed Architecture
- Multiple processors are connected together
- A processor can have its own memory or share the memory with another processor
[Figure: three processors (Processor 0-2), three caches (Cache 0-2), and two memories (Memory 0-1)]
5Motivation
- Powerful parallel computers can solve hard-to-compute problems
  - Computational physics
  - Protein folding
- Parallel programming is challenging due to communication and synchronization issues
- Parallel libraries reduce the complexity of parallel programming
6STAPL Introduction
- The Parasol Lab in the Computer Science Department at TAMU is developing a Standard Template Adaptive Parallel Library (STAPL)
- STAPL is designed as a platform-independent parallel library
- STAPL provides a collection of parallel containers (generic distributed data structures) that are efficient and easy to use
7STAPL Overview
- STAPL is a C++ parallel library designed as a superset of the Standard Template Library (STL)
- STAPL simplifies parallel programming by letting the user ignore distributed machine details such as
  - data partitioning
  - distribution
  - communication
8STAPL Main Components
- pContainer
  - Generic distributed data structure
  - STAPL requires an efficient array data structure for numerically intensive applications
- pRange
  - Presents an abstract view of a scoped data space, which allows random access to a partition or subrange of the data in a pContainer
- pAlgorithms
  - Parallel algorithms which provide basic functionality, bound with the pContainer by pRange
9pContainer Basic Design
- pContainers are data structures that allow users to store and use distributed data as if it were stored in a single memory
- All pContainers have similar functionality
- pContainers have three basic components
  - Base pContainer
  - Base Distribution Manager
  - Base Sequential Container Interface
10Base Distribution Manager
- Base Distribution Manager is responsible for locating elements (finding the memory containing the element)
- Each pContainer element has a unique global identifier (GID)

Processor 0        Processor 1
GID   0 1 2        GID   3 4 5
Data  7 3 6        Data  4 1 2

- Local element: Processor 0 needs the element with GID 2 (stored in its own memory)
- Remote element: Processor 0 needs the element with GID 4 (stored in the memory of Processor 1)
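The local/remote distinction in the example above can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes a plain block distribution (three consecutive GIDs per processor, as in the example), and the names `Location`, `locate`, and `is_local` are hypothetical, not part of STAPL.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; this is not the STAPL interface.
struct Location {
    int processor;       // processor whose memory holds the element
    std::size_t offset;  // position inside that processor's local storage
};

// Assumes a block distribution: GIDs 0..block-1 live on processor 0,
// the next `block` GIDs on processor 1, and so on.
inline Location locate(std::size_t gid, std::size_t block) {
    assert(block > 0);
    return Location{static_cast<int>(gid / block), gid % block};
}

// An element is local when its owner is the processor asking for it.
inline bool is_local(std::size_t gid, std::size_t block, int self) {
    return locate(gid, block).processor == self;
}
```

With a block size of 3, `is_local(2, 3, 0)` is true (the local case), while GID 4 resolves to processor 1 at offset 1 (the remote case), so Processor 0 would have to request it.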
11Base Sequential Container
- A pContainer is composed of sequential containers
- Base Sequential Container Interface/Part provides a uniform interface to easily build pContainers from different sequential containers
12Base pContainer
- Generic methods to construct a pContainer
- Methods to add, access, modify elements
- Methods to efficiently locate elements
14pArray Introduction
- An array
  - a data structure with fixed (unchangeable) size

Index  0  1  2  3  4  5  6  7  8  9
Value 13 98 56 45  0 45 77 38 23 52

- Elements can be accessed randomly using their index
  - array[5] = 45
- Arrays are useful for numerically intensive applications
- In C++ there is no fixed-size array container
- The C++ vector allows insertion and deletion of elements in the middle and thus is hard to optimize
- We have designed a pArray for STAPL for this purpose
15pArray Basic Design
- pArray is derived from the base classes of the pContainer
- Three major components
  - Array Part
  - Array Distribution
  - pArray
16Array Distribution
- Responsible for locating local and remote elements
- Two ways this can be done
  - Duplicated Distribution Information
    - Each processor has information about where all the elements are
  - Decentralized Distribution Information
    - Each processor is responsible for keeping track of the location of an evenly divided share of the elements
17Duplicated Distribution Information
- Array Distribution information is stored in a
  - vector< pair< pair<Start_Index, Size>, Processor ID > >
- Each processor has a copy of the Distribution vector
- Lookup process
  - Look in the Distribution vector
  - Check if the GID is in the range

Processor 0: GIDs 0 1 6 7
Processor 1: GIDs 2 3 4 5
Distribution Vector, (Start_Index, Size) PID, identical on both processors:
(0, 2) 0    (2, 4) 1    (6, 2) 0
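The lookup described above amounts to a range check against each entry of the distribution vector. The following is a minimal sketch, not STAPL code: the type aliases mirror the `((Start_Index, Size), Processor ID)` layout from the slide, while `find_owner` and `example_distribution` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// ((Start_Index, Size), Processor ID), following the layout on the slide.
using Segment = std::pair<std::pair<std::size_t, std::size_t>, int>;
using DistributionVector = std::vector<Segment>;

// Hypothetical helper: returns the processor that owns `gid`,
// or -1 if no segment of the distribution contains it.
inline int find_owner(const DistributionVector& dist, std::size_t gid) {
    for (const Segment& seg : dist) {
        const std::size_t start = seg.first.first;
        const std::size_t size = seg.first.second;
        if (gid >= start && gid - start < size) {
            return seg.second;  // gid falls inside [start, start + size)
        }
    }
    return -1;
}

// The example distribution from the slide: (0,2)->0, (2,4)->1, (6,2)->0.
inline DistributionVector example_distribution() {
    return DistributionVector{
        Segment{{0, 2}, 0}, Segment{{2, 4}, 1}, Segment{{6, 2}, 0}};
}
```

Because every processor holds the whole vector, the lookup never leaves the local memory, but the linear scan grows with the number of segments.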
18Decentralized Distribution Information
- Evenly divide the array into segments
- Each processor is responsible for knowing the location of one segment

Example: Processor 0 needs the element with GID 5

Processor 0: GIDs 0 1 6 7
Processor 1: GIDs 2 3 4 5

Location Map on Processor 0        Location Map on Processor 1
GID   0 1 2 3                      GID   4 5 6 7
Proc  0 0 1 1                      Proc  1 1 0 0

The algorithm
- Lookup GID: is the element local?
  - yes: done
  - no: is its location in the Location Cache?
    - yes: use the cached location
    - no: get the location information from the Map Owner, then cache it locally
- MapOwner = GID * nprocs / n, here 5 * 2 / 8 = 1
- Location Cache entries shown: GID 3 -> Proc 1; after the lookup, GID 5 -> Proc 1
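The lookup flow above can be simulated in a single process. In this sketch the "remote" request to the map owner is an ordinary function call, and a counter records how often it would have crossed processors. All type and function names are hypothetical, and the location caches start empty (an assumption made to keep the example small).

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Single-process simulation; names are illustrative, not STAPL's.
struct Processor {
    std::unordered_set<std::size_t> local_gids;           // GIDs stored here
    std::unordered_map<std::size_t, int> location_map;    // owned map segment: GID -> owner
    std::unordered_map<std::size_t, int> location_cache;  // remembered lookups
};

struct Machine {
    std::vector<Processor> procs;
    std::size_t n = 0;        // total number of elements
    int remote_requests = 0;  // map-owner queries that left the asking processor

    // MapOwner = GID * nprocs / n (integer division), as on the slide.
    int map_owner(std::size_t gid) const {
        assert(n > 0);
        return static_cast<int>(gid * procs.size() / n);
    }

    int find_owner(int self, std::size_t gid) {
        Processor& me = procs[static_cast<std::size_t>(self)];
        if (me.local_gids.count(gid) != 0) return self;          // is local?
        const auto hit = me.location_cache.find(gid);
        if (hit != me.location_cache.end()) return hit->second;  // is in cache?
        const int owner_of_map = map_owner(gid);                 // ask the map owner
        if (owner_of_map != self) ++remote_requests;
        const int owner =
            procs[static_cast<std::size_t>(owner_of_map)].location_map.at(gid);
        me.location_cache[gid] = owner;                          // cache locally
        return owner;
    }
};

// The two-processor, eight-element layout from the example.
inline Machine example_machine() {
    Machine m;
    m.n = 8;
    m.procs.resize(2);
    m.procs[0].local_gids = {0, 1, 6, 7};
    m.procs[1].local_gids = {2, 3, 4, 5};
    m.procs[0].location_map = {{0, 0}, {1, 0}, {2, 1}, {3, 1}};
    m.procs[1].location_map = {{4, 1}, {5, 1}, {6, 0}, {7, 0}};
    return m;
}
```

Calling `find_owner(0, 5)` twice on the example machine yields processor 1 both times, but only the first call counts as a remote request; the second is answered from the cache.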
19Duplicated Distribution Information vs. Decentralized Distribution Information
Duplicated Distribution Information
- PROs: Each processor has information about the location of each element, so there is no need to request information remotely
- CONs: Distribution information is duplicated, which is potentially space consuming; if the distribution info is large, the search is slow
Decentralized Distribution Information
- PROs: Location information is distributed and not duplicated, which saves space
- CONs: May need to look up the location information of an element remotely, which is slower
20Array Part
- A wrapper over the sequential STL container valarray
- Has all of the functionality of the valarray
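To make the wrapper idea concrete, here is a minimal sketch of a part that stores one contiguous run of GIDs in a `std::valarray`. The slides do not show the real Array Part interface, so the class layout and every member name below are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <valarray>

// Hypothetical sketch; the actual Array Part interface is not shown in the slides.
template <typename Data>
class ArrayPart {
public:
    // Holds `size` value-initialized elements whose GIDs start at `first_gid`.
    ArrayPart(std::size_t first_gid, std::size_t size)
        : first_gid_(first_gid), values_(size) {}

    std::size_t size() const { return values_.size(); }

    bool contains(std::size_t gid) const {
        return gid >= first_gid_ && gid - first_gid_ < values_.size();
    }

    Data get(std::size_t gid) const {
        assert(contains(gid));
        return values_[gid - first_gid_];
    }

    void set(std::size_t gid, const Data& value) {
        assert(contains(gid));
        values_[gid - first_gid_] = value;
    }

    // Forwards to std::valarray::sum(); only meaningful for a non-empty part.
    Data sum() const { return values_.sum(); }

    // Direct access, so the rest of the valarray functionality stays available.
    std::valarray<Data>& values() { return values_; }

private:
    std::size_t first_gid_;
    std::valarray<Data> values_;
};
```

For instance, an `ArrayPart<int>` built with `first_gid = 3` and `size = 3` could hold Processor 1's elements from the earlier GID example (values 4, 1, 2), and `sum()` would then return 7.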
21pArray Class
- BasePContainer instantiates ArrayPart and ArrayDistribution
- The pArray class is derived from the Base pContainer to implement the functionality specific to the pArray
22pArray Class
class pArray
    //constructors
    pArray()                                     //default constructor
    pArray(int size)                             //specific constructor with default distribution
    pArray(int size, ArrayDistribution distr)    //constructor with specified distribution

    //element access methods
    Data GetElement(GID)                         //returns the element with the specified GID
    void SetElement(GID, Data)                   //sets the specified location to the given value

    //operators and array specific methods
    Data operator[]                              //index array access operator
    pArray operator+(Data scalar)                //adds a scalar to the pArray
    pArray operator+(pArray array)               //adds two pArrays of the same size term by term (undefined otherwise)
                                                 //returns an array with the same distribution as the calling array
    pArray operator*(Data scalar)                //multiplies the pArray by a scalar
    pArray operator*(pArray array)               //multiplies two pArrays of the same size term by term (undefined otherwise)
                                                 //returns an array with the same distribution as the calling array
    Data accumulate()                            //sums up all the values stored in the pArray
    Data dotproduct(pArray array)                //dot product of two pArrays of the same size (undefined otherwise)
    long double euclideannorm()                  //euclidean norm of a pArray
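The numeric methods listed above have straightforward sequential meanings. The sketch below spells them out on a plain `std::valarray<double>` as a stand-in; it is not the pArray implementation and involves no distribution. The function names are borrowed from the slide only for readability, and the `sketch` namespace is an assumption added to avoid clashing with standard library names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <valarray>

namespace sketch {

// Sum of all stored values (0 for an empty array).
inline double accumulate(const std::valarray<double>& a) {
    return a.size() == 0 ? 0.0 : a.sum();
}

// Dot product of two arrays of the same size.
inline double dotproduct(const std::valarray<double>& a,
                         const std::valarray<double>& b) {
    assert(a.size() == b.size());
    if (a.size() == 0) return 0.0;
    const std::valarray<double> products = a * b;  // term-by-term product
    return products.sum();
}

// Euclidean norm: square root of the dot product of the array with itself.
inline long double euclideannorm(const std::valarray<double>& a) {
    return std::sqrt(static_cast<long double>(dotproduct(a, a)));
}

}  // namespace sketch
```

In a distributed setting each processor would presumably apply such an operation to its own part and the partial results would then be combined; that combination step is not shown here.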
24Prefix Sums
- One of the most basic parallel algorithms
- Used in other parallel algorithms like sorts
- Prefix sums of a sequence $S = \{x_1, x_2, \ldots, x_n\}$ of n elements are the n partial sums defined by
  - $P_i = x_1 + x_2 + \cdots + x_i, \quad 1 \le i \le n$
- Sequential algorithm
    S[n]  //original array
    P[n]  //prefix sums
    P[0] = S[0]
    for (int i = 1; i < n; i++)
        P[i] = P[i-1] + S[i]

Index           0  1  2  3  4
Original Array  2  4  3  5  1
Prefix Sums     2  6  9 14 15
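The sequential algorithm above translates directly into C++. This is a minimal sketch on a `std::vector<int>`; the function name `prefix_sums` is chosen here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sums: p[i] = s[0] + s[1] + ... + s[i].
inline std::vector<int> prefix_sums(const std::vector<int>& s) {
    std::vector<int> p(s.size());
    if (s.empty()) return p;
    p[0] = s[0];
    for (std::size_t i = 1; i < s.size(); ++i) {
        p[i] = p[i - 1] + s[i];  // reuse the previous partial sum
    }
    return p;
}
```

Applied to the original array 2 4 3 5 1 from the table, it produces 2 6 9 14 15. The standard library's `std::partial_sum` from `<numeric>` computes the same result.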
25Parallel Prefix Sums
Step 1: Each processor sums up its part

Processor 0        Processor 1
Data      1 2 2    Data      1 0 3
Part Sum  5        Part Sum  4

Step 2: Processor 0 receives all part sums, calculates the starting sum for each processor, and sends the corresponding starting sum to each processor

Processor 0        Processor 1
Starting Sum  0    Starting Sum  5

Step 3: Each processor calculates its prefix sums, beginning from its starting sum

Processor 0          Processor 1
Data        1 2 2    Data        1 0 3
Prefix Sum  1 3 5    Prefix Sum  6 6 9
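The three steps above can be traced with a small simulation. In this sketch each inner vector stands for one processor's part, and the per-processor work runs in an ordinary loop rather than in parallel; the function name and the use of `std::vector` are assumptions made for illustration, not the STAPL pAlgorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Part = std::vector<int>;  // the elements held by one processor

// Sequential simulation of the three-step scheme; parts[p] belongs to processor p.
inline std::vector<Part> parallel_prefix_sums(const std::vector<Part>& parts) {
    const std::size_t nprocs = parts.size();

    // Step 1: each processor sums up its part.
    std::vector<int> part_sum(nprocs, 0);
    for (std::size_t p = 0; p < nprocs; ++p) {
        part_sum[p] = std::accumulate(parts[p].begin(), parts[p].end(), 0);
    }

    // Step 2: processor 0 turns the part sums into starting sums
    // (the total of all parts that come before each processor).
    std::vector<int> starting_sum(nprocs, 0);
    for (std::size_t p = 1; p < nprocs; ++p) {
        starting_sum[p] = starting_sum[p - 1] + part_sum[p - 1];
    }

    // Step 3: each processor computes its prefix sums from its starting sum.
    std::vector<Part> result(nprocs);
    for (std::size_t p = 0; p < nprocs; ++p) {
        int running = starting_sum[p];
        for (int x : parts[p]) {
            running += x;
            result[p].push_back(running);
        }
    }
    return result;
}
```

With the parts 1 2 2 and 1 0 3 from the slide, the result is 1 3 5 on Processor 0 and 6 6 9 on Processor 1. Only Step 2 requires communication; Steps 1 and 3 touch local data only.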
27Performance Results
- Scalability is the ability of a program to exhibit good speed-up as the number of processors used is increased
- Scalability = (time running on 1 processor) / (parallel running time)
[Figure: Running Prefix Sums for a pArray of 1,000,000 elements on 1 to 6 processors]
28Performance Results
- pVector is a data structure similar to pArray, but with a dynamic size (elements can be added and deleted at runtime)
- Running Prefix Sums on 1,000,000 elements using pArray and pVector
- pArray is faster due to less overhead
29Conclusions
- pArray is a useful pContainer
- pArray shows good scalability
- pArray is faster than pVector in parallel Prefix Sums
- Parallel Prefix Sums is an efficient pAlgorithm
30Future Work
- Array re-distribution
- Optimize Prefix Sums
- More pAlgorithms
31References
- [1] "STAPL: An Adaptive, Generic Parallel C++ Library", Ping An, Alin Jula, Silvius Rus, Steven Saunders, Tim Smith, Gabriel Tanase, Nathan Thomas, Nancy Amato and Lawrence Rauchwerger, 14th Workshop on Languages and Compilers for Parallel Computing (LCPC), Cumberland Falls, KY, August 2001.
- [2] "Efficient Parallel Containers with Shared Object View", Ping An, Alin Jula, Gabriel Tanase, Paul Thomas, Nancy Amato and Lawrence Rauchwerger
32Thank you
- To my mentors
  - Nancy Amato
  - Ping An
  - Gabriel Tanase