Transcript and Presenter's Notes

Title: J90 vs. T3E Program Design

1
J90 vs. T3E Program Design
  • Jonathan Carter
  • High Performance Computing Department

2
Overview
  • J90 and T3E architectures
  • Parallel Programming Models
  • Models available on each platform
  • Designing Parallel Programs
  • Examples
  • Some results

3
J90 Architecture
  • Shared memory

4
J90 Architecture
  • 100 MHz, 200 MFlop vector processor
  • 20-28 processors in one machine
  • Shared memory of 1 Gword (8 byte word)
  • SV1 upgrade promises 1 GFlop per processor
  • Shared filesystems

5
T3E Architecture
  • Distributed memory

6
T3E Architecture
  • Processing elements (PEs) are composed of CPU and
    memory connected by a fast 3D torus

7
T3E Architecture
  • 450 MHz, 900 MFlop EV5 superscalar processor
  • 644 PEs, 512 PE maximum job size
  • Distributed memory 32 Mwords (8 byte words) per
    PE
  • Shared filesystems

8
Comparison
  • J90
  • Shared memory
  • Dynamically allocated CPUs
  • Time-shared CPU and memory
  • T3E
  • Distributed memory
  • Statically allocated CPUs
  • Dedicated CPU and memory

9
Programming Environment
  • Similar compilers and libraries
  • Different tools and data representations
  • Subtly different programming models

10
Parallel Programming Models I
  • Message Passing - set of processes each with
    local data, each process has a unique name and
    interacts with other processes by sending and
    receiving messages
  • Flexible Model - process creation and
    termination, multiple different programs execute
  • Single program multiple data (SPMD) - processes
    fixed at startup, copies of a single program
    execute
  • Message Passing Interface (MPI), Parallel Virtual
    Machine (PVM)

11
Parallel Programming Models II
  • Shared Memory - similar to message passing,
    except that one-sided memory operations (puts and
    gets) are allowed
  • Low latency, high bandwidth and less forced
    synchronization
  • SGI/Cray shared memory library (shmem)
  • Fortran Co-arrays extension (F--)
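
A rough sketch of the one-sided style using the SHMEM library. The routine
names used here (start_pes, shmem_my_pe, shmem_n_pes, shmem_get64,
shmem_barrier_all) follow the generic SHMEM interface and are assumptions;
exact spellings vary between systems.

      program shmem_sketch
      integer shmem_my_pe, shmem_n_pes
      real*8 a(10), b(10)
! remotely accessible data must be symmetric (e.g. SAVEd)
      save a
      call start_pes(0)
      mype = shmem_my_pe()
      npes = shmem_n_pes()
      a = dble(mype)
! make sure a is initialized on every PE before anyone reads it
      call shmem_barrier_all()
! one-sided get: read a(1:10) from the next PE, no matching send
      call shmem_get64(b, a, 10, mod(mype+1, npes))
      call shmem_barrier_all()
      end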

12
Parallel Programming Models III
  • Data Parallelism - exploits the fact that often
    the same operation is applied to all elements of
    a data structure. For example, adding a scalar to
    all the elements of a real 1D array can be done
    in parallel.
  • High Performance Fortran (HPF) provides a data
    parallel framework. Focus is on indicating data
    distribution and indicating what operations can
    be done in parallel.
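
As a sketch of that scalar-add example in the HPF style (the array name
and size are illustrative only):

      real a(1000)
! spread the array in equal blocks across the processors
!HPF$ DISTRIBUTE a(BLOCK)
! the same add is applied to every element, in parallel
      a = a + 2.5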

13
Parallel Programming Models IV
  • Thread-based parallelism - a set of threads of
    control are spawned by a master process. Threads
    can access global data, but can also have private
    data. Only on shared-memory machines. Fine
    grained parallelism possible.
  • Automatic parallelizing compilers
  • Proprietary compiler directives and OpenMP
  • POSIX threads
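
A small sketch of the directive style with OpenMP (subroutine and variable
names are illustrative; a fuller OpenMP example appears on slide 38):

      subroutine scale(n, a, b)
      dimension a(n), b(n)
      s = 2.5
! the loop iterations are shared out among a team of threads;
! t is private (one copy per thread), a, b and s are shared
!$omp parallel do private(i, t) shared(a, b, s, n)
      do i = 1, n
        t = s*a(i)
        b(i) = t
      end do
!$omp end parallel do
      return
      end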

14
Message Passing Interface - MPI
  • A library of routines to write parallel programs
    using message passing
  • Standard supported by most vendors
  • MPI is as large or as small as you need - a
    handful of routines is enough for many programs
  • Simple one to one send and receive (cooperative
    communication)
  • Broadcasts and reductions
  • MPI I/O is a parallel I/O standard (not on J90s)
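
A minimal sketch of cooperative send and receive (variable names and the
tag value 99 are arbitrary); the examples later in the talk build
broadcasts, reductions, and sendrecv on top of this:

      program ptop
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      real x
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      if (mype.eq.1) then
! PE 1 sends one real to PE 0 (tag 99 is arbitrary)
        x = 3.14
        call mpi_send(x, 1, MPI_REAL, 0, 99, MPI_COMM_WORLD, ierr)
      else if (mype.eq.0) then
! PE 0 must post a matching receive - both sides cooperate
        call mpi_recv(x, 1, MPI_REAL, 1, 99, MPI_COMM_WORLD,
     &                status, ierr)
        write(*,*) 'PE 0 received ', x
      end if
      call mpi_finalize(ierr)
      end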

15
High Performance Fortran - HPF
  • HPF is a data-parallel language. Compiler
    directives act to distribute data and indicate
    loops that may be executed in parallel. Some
    functions are automatically parallelized, other
    constructs need directives or rearranging.
  • High level, no explicit communication required
  • Portable, many compilers exist
  • Somewhat restrictive, not all algorithms can be
    specified
  • Performance may not be that great. Definitely
    can't just compile an old Fortran code and hope
    for the best.

16
HPF - Data Distribution
  • Consider a 4 processor case

!HPF$ DISTRIBUTE A(BLOCK)
      dimension a(20)

!HPF$ DISTRIBUTE B(BLOCK,*)
      dimension b(8,20)

(Diagram: with BLOCK, a(20) is split into four contiguous chunks of 5
elements, one per processor P1-P4; the first dimension of b(8,20) is
split into four chunks of 2 rows each.)
17
HPF - Data Distribution
  • Consider a 4 processor case

!HPF$ DISTRIBUTE A(CYCLIC)
      dimension a(20)

!HPF$ DISTRIBUTE B(*,CYCLIC)
      dimension b(8,20)

(Diagram: with CYCLIC, elements are dealt out round-robin - a(1) to P1,
a(2) to P2, ..., a(5) back to P1; the columns of b(8,20) are dealt out
to P1-P4 in the same round-robin fashion.)
18
Tasking Directives
  • Most vendors provide compiler directives to
    indicate where a region of code may be executed
    in parallel.
  • OpenMP is a standard for Fortran 90 and C, which
    should lead to portable programs.
  • High level, no explicit communication
  • Threads can join and leave as program progresses,
    relaxed approach

19
J90 Programming models
  • Automatic parallelizing compilers (f90, cc, CC)
  • Cray and OpenMP compiler directives
  • Message Passing Interface (MPI)
  • Parallel Virtual Machine (PVM)
  • Shared memory library (shmem)

20
T3E Programming Models
  • Message Passing Interface (MPI)
  • High Performance Fortran (HPF)
  • Parallel Virtual Machine (PVM)
  • Shared memory library (shmem)
  • Fortran Co-arrays (F--)
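
A rough sketch of the co-array style: each image declares a co-dimension
and can read another image's copy directly. The intrinsic names shown
(this_image, num_images, sync_all) follow the later Co-Array Fortran
definition and are assumptions here; the T3E F-- spellings may differ.

      program caf_sketch
! a exists once per image (the [*] co-dimension); b is local
      real a(10)[*], b(10)
      me = this_image()
      np = num_images()
      a = real(me)
! wait until every image has filled its copy of a
      call sync_all()
! one-sided read of the next image's copy, wrapping around
      b = a(:)[mod(me, np) + 1]
      call sync_all()
      end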

21
Designing Parallel Algorithms
Problem
  • Partitioning - decompose the computation and the
    data into tasks.
  • Communication - determine communication required
    to coordinate tasks
  • Agglomeration - evaluate both computation and
    communication with respect to performance and
    implementation costs; combine tasks if necessary
  • Mapping - Assign tasks to processors either
    statically or dynamically.

(Diagram: partition → communicate → agglomerate → map)
22
Partitioning
  • Two complementary ways to think about
    partitioning
  • Domain decomposition - seek to divide the data
    into roughly equal portions per task
  • Functional decomposition - seek to divide the
    computation into disjoint functions per task

23
Communication
  • Types of communication
  • Local - needs access to data from one or very few
    processes
  • Global - needs access to data from all or most
    processes
  • Static - an unchanging pattern
  • Dynamic - a pattern changing with time
  • Regular pattern
  • Irregular pattern

24
Agglomeration
  • Consider
  • Is it more efficient or easier to combine certain
    tasks?
  • Is it more efficient or easier to replicate data
    or computation?
  • Issues
  • Granularity - computation vs. communication
  • Flexibility - don't limit the number of tasks or
    scalability
  • Code reuse - seek to reuse old code or algorithms

25
Mapping
  • Map tasks to physical processors. For the J90 and
    T3E this is relatively simple. Both systems are
    homogeneous.
  • simple domain decomposition - fixed number of
    equal sized tasks, which are agglomerated to form
    a reasonable number of larger tasks which each
    map to a process
  • complex domain decomposition - need a load
    balancing algorithm
  • functional decomposition - task-scheduling
    algorithm

26
Example
  • Calculate the energy of a system of particles
    interacting via a Coulomb potential.

      real coord(3,n), charge(n)

      energy = 0.0
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &            (coord(2,i)-coord(2,j))**2 +
     &            (coord(3,i)-coord(3,j))**2)
          energy = energy + charge(i)*charge(j)*rdist
        end do
      end do
27
MPI Example 1
  • Functional decomposition
  • each task will compute the same number of
    interactions
  • accomplish this by dividing up the outer loop
  • replicate data to make communication simple
  • this approach will not scale

28
MPI - Example 1
      include 'mpif.h'
      parameter(n=50000)
      dimension coord(3,n), charge(n)

      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, npes, ierr)

      call initdata(n, coord, charge, mype)
      e = energy(mype, npes, n, coord, charge)

      etotal = 0.0
      call mpi_reduce(e, etotal, 1, MPI_REAL, MPI_SUM, 0,
     &                MPI_COMM_WORLD, ierr)
      if (mype.eq.0) write(*,*) etotal
      call mpi_finalize(ierr)
      end
29
MPI - Example 1
      subroutine initdata(n, coord, charge, mype)
      include 'mpif.h'
      dimension coord(3,n), charge(n)

      if (mype.eq.0) then
        GENERATE coords, charge
      end if
! broadcast data to slaves
      call mpi_bcast(coord, 3*n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      call mpi_bcast(charge, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      return
      end
30
MPI - Example 1
      real function energy(mype, npes, n, coord, charge)
      dimension coord(3,n), charge(n)
! rows 1..i hold about i*i/2 interactions, so equal shares per PE put
! the row boundaries at roughly sqrt(mype*inter)
      inter = n*(n-1)/npes
      nstart = nint(sqrt(real(mype*inter))) + 1
      nfinish = nint(sqrt(real((mype+1)*inter)))
      if (mype.eq.npes-1) nfinish = n
      total = 0.0
      do i = nstart, nfinish
        do j = 1, i-1
          ....
          total = total + charge(i)*charge(j)*rdist
        end do
      end do
      energy = total
      return
      end
31
MPI - Example 2
  • Domain decomposition
  • each task takes a chunk of particles
  • in turn, receives particle data from another
    process and computes all interactions between own
    data and received data
  • repeat until all interactions are done

32
MPI - Example 2
(Diagram: 100 particles are split into blocks of 20 across Proc 0-4. At
step 1 each process holds its own block; at each subsequent step the
blocks are rotated around the ring of processes, and each process
computes the interactions between its own block and the block it has
just received.)
33
      subroutine initdata(n, coord, charge, mype, npes,
     &                    npepmax, nmax, nmin)
      include 'mpif.h'
      dimension coord(3,n), charge(n)
      integer status(MPI_STATUS_SIZE)

      itag = 0
      isender = 0
      if (mype.eq.0) then
        do ipe = 1, npes-1
          GENERATE coord, charge for PE ipe
          call mpi_send(coord, nj*3, MPI_REAL, ipe, itag,
     &                  MPI_COMM_WORLD, ierror)
          call mpi_send(charge, nj, MPI_REAL, ipe, itag,
     &                  MPI_COMM_WORLD, ierror)
        end do
        GENERATE coord, charge for self
      else
! receive particles
        call mpi_recv(coord, 3*n, MPI_REAL, isender, itag,
     &                MPI_COMM_WORLD, status, ierror)
        call mpi_recv(charge, n, MPI_REAL, isender, itag,
     &                MPI_COMM_WORLD, status, ierror)
      endif
      return
      end
34
      niter = npes/2
      do iter = 1, niter
! PE to send to and receive from
        if (ipsend.eq.npes-1) then
          ipsend = 0
        else
          ipsend = ipsend + 1
        end if
        if (iprecv.eq.0) then
          iprecv = npes-1
        else
          iprecv = iprecv - 1
        end if
! send and receive particles
        call mpi_sendrecv(coordi, 3*n, MPI_REAL, ipsend, itag,
     &                    coordj, 3*n, MPI_REAL, iprecv, itag,
     &                    MPI_COMM_WORLD, status, ierror)
        call mpi_sendrecv(chargei, n, MPI_REAL, ipsend, itag,
     &                    chargej, n, MPI_REAL, iprecv, itag,
     &                    MPI_COMM_WORLD, status, ierror)
! accumulate energy
        e = e + energy2(n, coordi, chargei, n, coordj, chargej)
      end do
35
HPF Example
      parameter(n=50000)
      dimension coord(3,n), charge(n), ep(n)
!HPF$ DISTRIBUTE coord(*,BLOCK)
!HPF$ ALIGN charge(:) WITH coord(*,:)
!HPF$ ALIGN ep(:) WITH coord(*,:)

      call initdata(n, coord, charge)
      e = energy(n, coord, charge, ep)
      write(*,*) e
      stop
      end
36
HPF Example
      real function energy(n, coord, charge, ep)
      implicit real(a-h,o-z)
      dimension coord(3,n), charge(n), ep(n)
!HPF$ DISTRIBUTE coord(*,BLOCK)
!HPF$ ALIGN charge(:) WITH coord(*,:)
!HPF$ ALIGN ep(:) WITH coord(*,:)

!HPF$ INDEPENDENT, NEW(rdist, j)
      do i = 1, n
        ep(i) = 0.0
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &            (coord(2,i)-coord(2,j))**2 +
     &            (coord(3,i)-coord(3,j))**2)
          ep(i) = ep(i) + charge(i)*charge(j)*rdist
        end do
      end do
      energy = sum(ep)
      return
      end
37
Cray Specific Directives
      subroutine energy(n, coord, a)
      implicit real(a-h,o-z)
      dimension coord(3,n), a(n)

      total = 0.0
cmic$ parallel autoscope, shared(total), private(t,i,j,rdist)
      t = 0.0
cmic$ do parallel
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &            (coord(2,i)-coord(2,j))**2 +
     &            (coord(3,i)-coord(3,j))**2)
          t = t + a(i)*a(j)*rdist
        end do
      end do
cmic$ guard
      total = total + t
cmic$ end guard
cmic$ end parallel
      write(*,*) ' energy ', total
      return
      end
38
OpenMP Directives
      function energy(n, coord, charge)
      dimension coord(3,n), charge(n)

      total = 0.0
!$omp parallel private(rdist)
!$omp do schedule(dynamic,64) reduction(+:total)
      do i = 1, n
        do j = 1, i-1
          rdist = 1.0/sqrt((coord(1,i)-coord(1,j))**2 +
     &            (coord(2,i)-coord(2,j))**2 +
     &            (coord(3,i)-coord(3,j))**2)
          total = total + charge(i)*charge(j)*rdist
        end do
      end do
!$omp end parallel
      energy = total
      return
      end
39
Coulomb Interaction - T3E
40
Coulomb Interaction - T3E
41
Coulomb Interaction - J90
42
Coulomb Interaction - J90
43
Strategies
  • Simple programs with low software development
    cost
  • Automatic parallelizing compiler or compiler
    directives on J90
  • High Performance Fortran on T3E (probably not
    optimal performance)
  • Complex programs
  • Compiler directives on J90
  • Redesign with MPI for both J90 and T3E

44
Further Information
  • General Parallel Programming
  • Designing and Building Parallel Programs, by Ian
    Foster. Addison-Wesley ISBN 0-201-57594-9
  • http://www-unix.mcs.anl.gov/dbpp/
  • MPI
  • Using MPI, by Gropp, Lusk and Skjellum. MIT Press
    ISBN 0-262-57104-8
  • MPI - The Complete Reference, Vol 1, by Snir,
    Otto, Huss-Lederman, Walker, and Dongarra. MIT
    Press ISBN 0-262-69216-3
  • MPI - The Complete Reference, Vol 2, by Gropp,
    Huss-Lederman, Lumsdaine, Lusk, Nitzberg, Saphir
    and Snir. MIT Press ISBN 0-262-69216-3
  • http://www-unix.mcs.anl.gov/mpi/
  • HPF
  • The High Performance Fortran Handbook, by
    Koelbel, Loveman, Schreiber, Steele, Jr., and
    Zosel. MIT Press ISBN 0-262-61094-9
  • http://www.crpc.rice.edu/HPFF/home.html
  • Cray Tasking Directives
  • CF90 Commands and Directives Reference Manual
    SR-3901
  • Cray C/C++ Reference Manual SR-2179
  • http://www.cray.com/products/software/publications/
  • OpenMP
  • CF90 Commands and Directives Reference Manual
    SR-3901
  • http://www.openmp.org/
  • http://www.cray.com/products/software/publications/