pArray as an Efficient Static Parallel Container in STAPL (Standard Template Adaptive Parallel Library)

- Presenter: Olga Tkachyshyn
- Grad Student Advisors: Ping An, Gabriel Tanase
- Faculty Advisor: Nancy Amato

Presentation Plan

- Motivation
- STAPL Overview
- pContainer Design
- pArray
- Prefix-sum using pArray and pVector
- Performance results
- Conclusion and Future Work

Motivation

- The time it takes to complete a task is limited by the speed of the worker
  - Alternative: more workers
- Similarly, the time it takes to solve a problem on a computer is limited by the speed of the processor
  - Alternative: parallel processing, or the concurrent use of multiple processors to process data

Parallel/Distributed Architecture

- Multiple processors are connected together
- A processor can have its own memory or share the memory with another processor

[Figure: Processor 0, Processor 1, and Processor 2, with Cache 0, Cache 1, and Cache 2, and with Memory 0 and Memory 1]

Motivation

- Powerful parallel computers can solve hard-to-compute problems
  - Computational physics
  - Protein folding
- Parallel programming is challenging due to communication and synchronization issues
- Parallel libraries reduce the complexity of parallel programming

STAPL Introduction

- The Parasol Lab in the Computer Science Department at TAMU is developing a Standard Template Adaptive Parallel Library (STAPL)
- STAPL is designed as a platform-independent parallel library
- STAPL provides a collection of parallel containers (generic distributed data structures) that are efficient and easy to use

STAPL Overview

- STAPL is a C++ parallel library designed as a superset of the Standard Template Library (STL)
- STAPL simplifies parallel programming by letting the user ignore distributed machine details such as
  - data partitioning
  - distribution
  - communication

STAPL Main Components

- pContainer
  - Generic distributed data structure
  - STAPL requires an efficient array data structure for numerically intensive applications
- pRange
  - Presents an abstract view of a scoped data space, which allows random access to a partition or subrange of the data in a pContainer
- pAlgorithms
  - Parallel algorithms which provide basic functionality, bound with the pContainer by pRange

pContainer Basic Design

- pContainers are data structures that allow users to store and use distributed data as if it were stored in a single memory
- All pContainers have similar functionality
- pContainers have three basic components
  - Base pContainer
  - Base Distribution Manager
  - Base Sequential Container Interface

Base Distribution Manager

- Base Distribution Manager is responsible for locating elements (finding the memory containing the element)
- Each pContainer element has a unique global identifier (GID)
- Local element: Processor 0 needs the element with GID 2
- Remote element: Processor 0 needs the element with GID 4

Processor 0        Processor 1
GID   0 1 2        GID   3 4 5
Data  7 3 6        Data  4 1 2
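
The local/remote distinction in this example can be sketched as follows. This is only an illustration, not STAPL's implementation: `BlockLayout`, `owner_of`, and `is_local` are hypothetical names, and the sketch assumes equal-sized contiguous blocks as in the figure (GIDs 0 to 2 on Processor 0, GIDs 3 to 5 on Processor 1).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of STAPL: a layout in which every
// processor owns one contiguous block of `block_size` GIDs.
struct BlockLayout {
    std::size_t block_size;  // elements per processor (3 in the figure)
    std::size_t nprocs;      // number of processors (2 in the figure)

    // Processor that stores the element with the given GID.
    std::size_t owner_of(std::size_t gid) const {
        assert(gid < block_size * nprocs && "GID out of range");
        return gid / block_size;
    }

    // True when `gid` lives in the memory of processor `self`.
    bool is_local(std::size_t gid, std::size_t self) const {
        return owner_of(gid) == self;
    }
};
```

With `BlockLayout layout{3, 2}`, `layout.is_local(2, 0)` holds (the local case), while `layout.is_local(4, 0)` does not, because GID 4 is owned by Processor 1 (the remote case).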

Base Sequential Container

- pContainer is composed of sequential containers
- Base Sequential Container Interface/Part provides a uniform interface to easily build pContainers from different sequential containers

Base pContainer

- Generic methods to construct a pContainer
- Methods to add, access, modify elements
- Methods to efficiently locate elements

pArray Introduction

- An array: a data structure with fixed (unchangeable) size

Index   0  1  2  3  4  5  6  7  8  9
Value  13 98 56 45  0 45 77 38 23 52

- Elements can be accessed randomly using their index: array[5] = 45
- Arrays are useful for numerically intensive applications
- In C++ there is no fixed-size array
- C++ vector allows insertion and deletion of elements in the middle and thus is hard to optimize
- We have designed a pArray for STAPL for this purpose

pArray Basic Design

- pArray is derived from the base classes of the pContainer
- Three major components
  - Array Part
  - Array Distribution
  - pArray

Array Distribution

- Responsible for locating local and remote elements
- Two ways this can be done
  - Duplicated Distribution Information
    - Each processor has information about where all the elements are
  - Decentralized Distribution Information
    - Each processor is responsible for keeping track of the location of an evenly divided share of the elements

Duplicated Distribution Information

- Array Distribution information is stored in a vector of pair< pair<Start_Index, Size>, Processor ID >
- Each processor has a copy of the Distribution Vector
- Lookup process: look in the Distribution Vector and check if the GID is in the range

Processor 0                       Processor 1
Data: GID 0 1 6 7                 Data: GID 2 3 4 5
Distribution Vector               Distribution Vector
(Start_Index, Size): PID          (Start_Index, Size): PID
(0, 2): 0  (2, 4): 1  (6, 2): 0   (0, 2): 0  (2, 4): 1  (6, 2): 0
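
The lookup over a duplicated distribution vector might look like the sketch below. It is a minimal illustration under assumptions, not STAPL's code: `DistributionEntry`, `find_owner`, and `figure_distribution` are hypothetical names, and a linear scan is assumed because the slide only says to check whether the GID falls in a range.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One entry of the distribution vector: ((start index, size), processor ID).
using DistributionEntry = std::pair<std::pair<int, int>, int>;

// Hypothetical lookup, not STAPL's API: scan the locally stored copy of the
// distribution vector and return the ID of the processor whose range
// contains `gid`, or -1 if no range contains it.
int find_owner(const std::vector<DistributionEntry>& distribution, int gid) {
    assert(gid >= 0 && "GIDs are non-negative in this sketch");
    for (const DistributionEntry& entry : distribution) {
        const int start = entry.first.first;
        const int size = entry.first.second;
        if (gid >= start && gid < start + size) {
            return entry.second;
        }
    }
    return -1;
}

// The distribution vector from the figure: GIDs 0-1 on processor 0,
// GIDs 2-5 on processor 1, GIDs 6-7 on processor 0.
std::vector<DistributionEntry> figure_distribution() {
    return {{{0, 2}, 0}, {{2, 4}, 1}, {{6, 2}, 0}};
}
```

Because every processor holds the same vector, `find_owner(figure_distribution(), 5)` can be answered on any processor without communication. The scan is linear in the number of ranges, which matches the slide's remark elsewhere that a large distribution vector makes the search slow.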

Decentralized Distribution Information

- Evenly divide the array into segments
- Each processor is responsible for knowing the location of one segment

Example: Processor 0 needs the element with GID 5

The algorithm
- Look up the GID
- Is it local? If yes, access the data
- If no, is it in the Location Cache? If yes, use the cached location
- If no, get the location information from the Map Owner and cache it locally
- MapOwner = GID * nprocs / n; for GID 5: 5 * 2 / 8 = 1

Processor 0                     Processor 1
Data: GID 0 1 6 7               Data: GID 2 3 4 5
Location Map                    Location Map
GID   0 1 2 3                   GID   4 5 6 7
Proc  0 0 1 1                   Proc  1 1 0 0

Location Cache entries shown: GID 3 on Proc 1, GID 5 on Proc 1
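
The decentralized lookup can be sketched as a single-process simulation. Everything here is an assumption made for illustration, not STAPL code: the `Processor` struct, `map_owner`, `locate`, and `figure_state` are hypothetical names, and the remote request to the map owner is modeled as a plain read of another processor's map.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical model of one processor's bookkeeping.
struct Processor {
    std::unordered_set<int> local_gids;           // GIDs stored here
    std::unordered_map<int, int> location_map;    // segment owned: GID -> proc
    std::unordered_map<int, int> location_cache;  // remotely resolved GIDs
};

// Map owner as given on the slide: GID * nprocs / n (integer division).
int map_owner(int gid, int nprocs, int n) {
    assert(gid >= 0 && gid < n);
    return gid * nprocs / n;
}

// Resolve where `gid` is stored, as seen from processor `self`:
// local data first, then the location cache, then the map owner.
int locate(std::vector<Processor>& procs, int self, int gid, int n) {
    Processor& me = procs[self];
    if (me.local_gids.count(gid) != 0) {
        return self;  // the element is local
    }
    auto cached = me.location_cache.find(gid);
    if (cached != me.location_cache.end()) {
        return cached->second;  // location already known
    }
    // In a real system this would be a remote request to the map owner.
    const int owner = map_owner(gid, static_cast<int>(procs.size()), n);
    const int location = procs[owner].location_map.at(gid);
    me.location_cache[gid] = location;  // cache locally for next time
    return location;
}

// The two-processor, eight-element state from the figure.
std::vector<Processor> figure_state() {
    std::vector<Processor> procs(2);
    procs[0].local_gids = {0, 1, 6, 7};
    procs[1].local_gids = {2, 3, 4, 5};
    procs[0].location_map = {{0, 0}, {1, 0}, {2, 1}, {3, 1}};
    procs[1].location_map = {{4, 1}, {5, 1}, {6, 0}, {7, 0}};
    return procs;
}
```

For the slide's example, `locate(procs, 0, 5, 8)` finds that GID 5 is neither local nor cached on Processor 0, asks map owner 5 * 2 / 8 = 1, and caches the answer (Processor 1) so that a second lookup needs no remote request.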

Duplicated Distribution Information vs. Decentralized Distribution Information

Duplicated Distribution Information
- PROs: Each processor has information about the location of each element, so there is no need to request information remotely
- CONs: Distribution information is duplicated, which is potentially space consuming; if the distribution info is large, the search is slow

Decentralized Distribution Information
- PROs: Location information is distributed and not duplicated, which saves space
- CONs: May need to look up the location information of an element remotely, which is slower

Array Part

- A wrapper over the sequential STL container valarray
- Has all of the functionality of valarray
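
As a rough idea of what the wrapped container offers, the sketch below uses `std::valarray` directly for the kind of numeric operations an array part would forward to. The free functions `sum_elements`, `dot_product`, and `euclidean_norm` are hypothetical helpers written for this illustration; they are not taken from STAPL.

```cpp
#include <cassert>
#include <cmath>
#include <valarray>

// Sum of all elements, using valarray's built-in sum().
double sum_elements(const std::valarray<double>& a) {
    return a.sum();
}

// Dot product of two valarrays of the same size.
double dot_product(const std::valarray<double>& a,
                   const std::valarray<double>& b) {
    assert(a.size() == b.size());
    const std::valarray<double> product = a * b;  // element-wise multiply
    return product.sum();
}

// Euclidean norm: square root of the sum of squares.
double euclidean_norm(const std::valarray<double>& a) {
    const std::valarray<double> squares = a * a;
    return std::sqrt(squares.sum());
}
```

`std::valarray` also supports scalar arithmetic such as `a + 1.0` and `a * 2.0`, which apply the operation to every element.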

pArray Class

- BasePContainer instantiates ArrayPart and ArrayDistribution
- pArray class is derived from the Base pContainer to implement the functionality specific to the pArray

pArray Class

class pArray {
public:
    // constructors
    pArray();                                   // default constructor
    pArray(int size);                           // specific constructor with default distribution
    pArray(int size, ArrayDistribution distr);  // constructor with specified distribution

    // element access methods
    Data GetElement(GID);                       // returns the element with the specified GID
    void SetElement(GID, Data);                 // sets the specified location to the given value

    // operators and array-specific methods
    Data operator[](GID);                       // index array access operator
    pArray operator+(Data scalar);              // adds a scalar to the pArray
    // adds two pArrays of the same size term by term (undefined otherwise);
    // returns an array with the same distribution as the calling array
    pArray operator+(pArray array);
    pArray operator*(Data scalar);              // multiplies the pArray by a scalar
    // multiplies two pArrays of the same size term by term (undefined otherwise);
    // returns an array with the same distribution as the calling array
    pArray operator*(pArray array);
    Data accumulate();                          // sums up all the values stored in the pArray
    Data dotproduct(pArray array);              // dot product of two pArrays of the same size (undefined otherwise)
    long double euclideannorm();                // Euclidean norm of a pArray
};

Prefix Sums

- One of the most basic parallel algorithms
- Used in other parallel algorithms like sorts

- Prefix sums of a sequence S = {x_1, x_2, ..., x_n} of n elements are the n partial sums defined by
  P_i = x_1 + x_2 + ... + x_i, 1 ≤ i ≤ n

- Sequential Algorithm
  S[n]  // original array
  P[n]  // prefix sums
  P[0] = S[0];
  for (int i = 1; i < n; i++)
      P[i] = P[i-1] + S[i];

Index           0  1  2  3  4
Original Array  2  4  3  5  1
Prefix Sums     2  6  9 14 15
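
For comparison, the same sequential computation can be written with the C++ standard library. This sketch swaps the hand-written loop for `std::partial_sum`; the wrapper name `prefix_sums` and the check function are hypothetical, and the check simply replays the table above.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Inclusive prefix sums: result[i] = s[0] + s[1] + ... + s[i].
std::vector<int> prefix_sums(const std::vector<int>& s) {
    std::vector<int> p(s.size());
    std::partial_sum(s.begin(), s.end(), p.begin());
    return p;
}

// Replays the table: original array 2 4 3 5 1 gives 2 6 9 14 15.
void check_table_example() {
    const std::vector<int> expected = {2, 6, 9, 14, 15};
    assert(prefix_sums({2, 4, 3, 5, 1}) == expected);
}
```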

Parallel Prefix Sums

Step 1: Each processor sums up its part

Processor 0          Processor 1
Data      1 2 2      Data      1 0 3
Part Sum  5          Part Sum  4

Step 2: Processor 0 receives all part sums, calculates starting sums for each processor, and sends the corresponding starting sums to all processors

Processor 0          Processor 1
Data          1 2 2  Data          1 0 3
Starting Sum  0      Starting Sum  5

Step 3: Each processor calculates its prefix sums

Processor 0          Processor 1
Data        1 2 2    Data        1 0 3
Prefix Sum  1 3 5    Prefix Sum  6 6 9
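
The three steps can be traced with a small single-process simulation. This is a sketch under assumptions, not the STAPL pAlgorithm: `parallel_prefix_sums` is a hypothetical name, `parts[p]` stands for the data held by processor p, and the loops over p stand in for work that would run concurrently, with the sums exchanged by messages.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical simulation: parts[p] is the data held by processor p.
std::vector<std::vector<int>> parallel_prefix_sums(
        const std::vector<std::vector<int>>& parts) {
    const std::size_t nprocs = parts.size();

    // Step 1: each processor sums up its part.
    std::vector<int> part_sum(nprocs);
    for (std::size_t p = 0; p < nprocs; ++p) {
        part_sum[p] = std::accumulate(parts[p].begin(), parts[p].end(), 0);
    }

    // Step 2: processor 0 turns the part sums into starting sums,
    // i.e. the total of all parts that come before each processor.
    std::vector<int> starting_sum(nprocs, 0);
    for (std::size_t p = 1; p < nprocs; ++p) {
        starting_sum[p] = starting_sum[p - 1] + part_sum[p - 1];
    }

    // Step 3: each processor computes its prefix sums from its starting sum.
    std::vector<std::vector<int>> result(nprocs);
    for (std::size_t p = 0; p < nprocs; ++p) {
        int running = starting_sum[p];
        for (int value : parts[p]) {
            running += value;
            result[p].push_back(running);
        }
    }
    return result;
}

// Replays the slide's example: {1 2 2} and {1 0 3} give {1 3 5} and {6 6 9}.
void check_two_processor_example() {
    const std::vector<std::vector<int>> expected = {{1, 3, 5}, {6, 6, 9}};
    assert(parallel_prefix_sums({{1, 2, 2}, {1, 0, 3}}) == expected);
}
```

Only the part sums and starting sums cross processor boundaries, one value per processor in each direction, which is why the bulk of the work in steps 1 and 3 can proceed independently.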

Performance Results

- Scalability is the ability of a program to exhibit good speed-up as the number of processors used is increased
- Scalability = (running time on 1 processor) / (parallel running time)

[Figure: Running Prefix Sums for a pArray of 1,000,000 elements on 1 to 6 processors]

Performance Results

- pVector is a data structure similar to pArray, but with a dynamic size (new elements can be added and deleted at runtime)
- Running Prefix Sums on 1,000,000 elements using pArray and pVector
- pArray is faster due to less overhead

Conclusions

- pArray is a useful pContainer
- pArray shows good scalability
- pArray is faster than pVector in parallel Prefix Sums
- Parallel Prefix Sums is an efficient pAlgorithm

Future Work

- Array re-distribution
- Optimize Prefix Sums
- More pAlgorithms

Thank you

- To my mentors
- Nancy Amato
- Ping An
- Gabriel Tanase