Programming the IBM Power3 SP - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Programming the IBM Power3 SP
  • Eric Aubanel
  • Advanced Computational Research Laboratory
  • Faculty of Computer Science, UNB

2
Advanced Computational Research Laboratory
  • High Performance Computational Problem-Solving
    and Visualization Environment
  • Computational experiments in multiple
    disciplines: CS, science, and engineering
  • 16-Processor IBM SP3
  • Member of C3.ca Association, Inc.
    (http://www.c3.ca)

3
Advanced Computational Research Laboratory
  • www.cs.unb.ca/acrl
  • Virendra Bhavsar, Director
  • Eric Aubanel, Research Associate, Scientific
    Computing Support
  • Sean Seeley, System Administrator

4
(No Transcript)
5
(No Transcript)
6
Programming the IBM Power3 SP
  • History and future of POWER chip
  • Uni-processor optimization
  • Description of ACRL's IBM SP
  • Parallel Processing
  • MPI
  • OpenMP
  • Hybrid MPI/OpenMP
  • MPI-I/O (one slide)

7
POWER chip 1990 to 2003
  • 1990
  • Performance Optimized with Enhanced RISC
  • Reduced Instruction Set Computer
  • Superscalar, with a combined floating-point
    multiply-add (FMA) unit, which allowed a peak
    MFLOPS rate of 2 x the clock rate in MHz
  • Initially 25 MHz (50 MFLOPS) and 64 KB data cache

8
POWER chip 1990 to 2003
  • 1991: SP1
  • IBM's first SP (Scalable POWERparallel)
  • Rack of standalone POWER processors (62.5 MHz)
    connected by internal switch network
  • Parallel Environment system software

9
POWER chip 1990 to 2003
  • 1993: POWER2
  • 2 FMAs
  • Increased data cache size
  • 66.5 MHz (254 MFLOPS)
  • Improved instruction set (incl. hardware square
    root)
  • SP2: POWER2 plus a higher-bandwidth switch for
    larger systems

10
POWER chip 1990 to 2003
  • 1993: PowerPC
  • SMP support
  • 1996: P2SC
  • POWER2 Super Chip: clock speeds up to 160 MHz

11
POWER chip 1990 to 2003
  • Feb. 1999: POWER3
  • Combined P2SC and PowerPC
  • 64-bit architecture
  • Initially 2-way SMP, 200 MHz
  • Cache improvements, including L2 cache of 1-16 MB
  • Instruction and data prefetch

12
POWER3 chip, Feb. 2000
  • Winterhawk II - 375 MHz
  • 4-way SMP
  • 2 MULT/ADD units - 1500 MFLOPS per processor
  • 64 KB Level 1 cache - 5 ns / 3.2 GB/s
  • 8 MB Level 2 cache - 45 ns / 6.4 GB/s
  • 1.6 GB/s memory bandwidth
  • 6 GFLOPS/node
  • Nighthawk II - 375 MHz
  • 16-way SMP
  • 2 MULT/ADD units - 1500 MFLOPS per processor
  • 64 KB Level 1 cache - 5 ns / 3.2 GB/s
  • 8 MB Level 2 cache - 45 ns / 6.4 GB/s
  • 14 GB/s memory bandwidth
  • 24 GFLOPS/node

13
The Clustered SMP
ACRL's SP: four 4-way SMPs
Each node has its own copy of the O/S. Processors
on the same node are closer than processors on
different nodes.
14
Power3 Architecture
15
Power4 - 32 way
  • Logical UMA
  • SP High Node
  • L3 cache shared between all processors on node -
    32 MB
  • Up to 32 GB main memory
  • Each processor 1.1 GHz
  • 140 Gflops total peak

16
Going to NUMA
NUMA: up to 256 processors - 1.1 TFLOPS
17
Programming the IBM Power3 SP
  • History and future of POWER chip
  • Uni-processor optimization
  • Description of ACRL's IBM SP
  • Parallel Processing
  • MPI
  • OpenMP
  • Hybrid MPI/OpenMP
  • MPI-I/O (one slide)

18
Uni-processor Optimization
  • Compiler options
  • start with -O3 -qstrict, then -O3, then -O3
    -qarch=pwr3
  • Cache re-use (see the loop-order sketch after
    this list)
  • Take advantage of superscalar architecture
  • give enough operations per load/store
  • Use ESSL - optimization already maximally
    exploited
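
A minimal sketch of the cache re-use point, assuming a simple array
update (array names and sizes here are illustrative): Fortran stores
arrays column by column, so the inner loop should run over the first
index to walk through memory with stride one.

  program loop_order
    implicit none
    integer, parameter :: n = 1000
    real*8 a(n,n), b(n,n)
    integer i, j
    b = 1.0d0
    ! Good order: the inner loop runs over the first index, so memory
    ! is traversed with stride one and each cache line is fully used.
    do j = 1, n
       do i = 1, n
          a(i,j) = 2.0d0 * b(i,j)
       end do
    end do
    ! Swapping the two loops strides through memory by n elements and
    ! touches a new cache line on almost every access.
    print *, a(1,1)
  end program loop_order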

19
Memory Access Times
20
Cache
L2 cache: 4-way set-associative, 8 MB total
L1 cache: 128-way set-associative, 64 KB
21
How to Monitor Performance?
  • IBM's hardware performance monitor, HPMCOUNT
  • Uses hardware counters on chip
  • Cache and TLB misses, floating-point ops,
    loads/stores, ...
  • Beta version
  • Available soon on ACRL's SP

22
HPMCOUNT sample output
  • real*8 a(256,256),b(256,256),c(256,256)
  • common a,b,c
  • do j=1,256
  •   do i=1,256
  •     a(i,j) = b(i,j) + c(i,j)
  •   end do
  • end do
  • end
  • PM_TLB_MISS (TLB misses)
    66543
  • Average number of loads per TLB miss
    5.916
  • Total loads and stores
    0.525 M
  • Instructions per load/store
    2.749
  • Cycles per instruction
    2.378
  • Instructions per cycle
    0.420
  • Total floating point operations
    0.066 M
  • Hardware floating point rate
    2.749 Mflop/sec

23
HPMCOUNT sample output
  • real*8 a(257,256),b(257,256),c(257,256)
  • common a,b,c
  • do j=1,256
  •   do i=1,257
  •     a(i,j) = b(i,j) + c(i,j)
  •   end do
  • end do
  • end
  • PM_TLB_MISS (TLB misses)
    1634
  • Average number of loads per TLB miss
    241.876
  • Total loads and stores
    0.527 M
  • Instructions per load/store
    2.749
  • Cycles per instruction
    1.271
  • Instructions per cycle
    0.787
  • Total floating point operations
    0.066 M
  • Hardware floating point rate
    3.525 Mflop/sec
  • (Padding the leading dimension from 256 to 257
    staggers the three arrays in memory: they are no
    longer separated by an exact power of two, so
    corresponding elements stop competing for the
    same cache and TLB sets.)

24
ESSL
  • Linear algebra, Fourier related transforms,
    sorting, interpolation, quadrature, random
    numbers
  • Fast!
  • 560x560 real*8 matrix multiply
  • Hand coding: 19 MFLOPS
  • dgemm: 1.2 GFLOPS (see the call sketch after
    this list)
  • Parallel (threaded and distributed) versions
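
A minimal sketch of the DGEMM call behind the 1.2 GFLOPS figure,
assuming the standard BLAS calling sequence that ESSL provides (the
matrix contents are illustrative):

  program matmul_essl
    implicit none
    integer, parameter :: n = 560
    real*8 a(n,n), b(n,n), c(n,n)
    a = 1.0d0
    b = 2.0d0
    c = 0.0d0
    ! C = 1.0*A*B + 0.0*C, computed by ESSL's DGEMM
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
    print *, c(1,1)
  end program matmul_essl

Link against the ESSL library when compiling (e.g. with -lessl).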

25
Programming the IBM Power3 SP
  • History and future of POWER chip
  • Uni-processor optimization
  • Description of ACRL's IBM SP
  • Parallel Processing
  • MPI
  • OpenMP
  • Hybrid MPI/OpenMP
  • MPI-I/O (one slide)

26
ACRL's IBM SP
  • 4 Winterhawk II nodes
  • 16 processors
  • Each node has
  • 1 GB RAM
  • 9 GB (mirrored) disk on each node
  • Switch adapter
  • High Performance Switch
  • Gigabit Ethernet (1 node)
  • Control workstation
  • SSA disk tower with six 18.2 GB disks

27

28
IBM Power3 SP Switch
  • Bidirectional multistage interconnection networks
    (MIN)
  • 300 MB/sec bi-directional
  • 1.2 µsec latency

29
General Parallel File System
[Diagram: nodes 1-4 connected by the SP Switch,
sharing the General Parallel File System]
30
ACRL Software
  • Operating system: AIX 4.3.3
  • Compilers
  • IBM XL Fortran 7.1 (HPF not yet installed)
  • VisualAge C++ for AIX, Version 5.0.1.0
  • VisualAge C++ Professional for AIX, Version
    5.0.0.0
  • IBM VisualAge for Java - not yet installed
  • Job scheduler: LoadLeveler 2.2
  • Parallel Programming Tools
  • IBM Parallel Environment 3.1: MPI, MPI-2
    parallel I/O
  • Numerical libraries: ESSL (v. 3.2) and Parallel
    ESSL (v. 2.2)
  • Visualization: OpenDX (not yet installed)
  • E-Commerce software (not yet installed)

31
Programming the IBM Power3 SP
  • History and future of POWER chip
  • Uni-processor optimization
  • Description of ACRL's IBM SP
  • Parallel Processing
  • MPI
  • OpenMP
  • Hybrid MPI/OpenMP
  • MPI-I/O (one slide)

32
Why Parallel Computing?
  • Solve large problems in reasonable time
  • Many algorithms are inherently parallel
  • image processing, Monte Carlo
  • Simulations (e.g. CFD)
  • High performance computers have parallel
    architectures
  • Commercial off-the-shelf (COTS) components
  • Beowulf clusters
  • SMP nodes
  • Improvements in network technology

33
NRL Layered Ocean Model on the Naval Research
Laboratory's IBM Winterhawk II SP
34
Parallel Computational Models
  • Data Parallelism
  • Parallel program looks like serial program
  • parallelism in the data
  • Vector processors
  • HPF

35
Parallel Computational Models
[Diagram: one process calls Send, the other calls
Receive]
  • Message Passing (MPI)
  • Processes have only local memory but can
    communicate with other processes by sending and
    receiving messages
  • Data transfer between processes requires
    operations to be performed by both processes
  • Communication network is not part of the
    computational model (hypercube, torus, ...)

36
Parallel Computational Models
  • Shared Memory (threads)
  • P(osix)threads
  • OpenMP higher level standard

37
Parallel Computational Models
[Diagram: one process Puts data into, or Gets data
from, another process's memory]
  • Remote Memory Operations
  • One-sided communication
  • MPI-2, IBM's LAPI
  • One process can access the memory of another
    without the other's participation, but does so
    explicitly, not the same way it accesses local
    memory (see the MPI_PUT sketch after this list)
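
A minimal Fortran sketch of one-sided communication with MPI-2,
assuming an MPI library with one-sided support (buffer sizes and
values are illustrative). Process 0 puts data directly into a window
exposed by process 1; process 1 only takes part in the fence
synchronization. Run with at least two processes.

  program put_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100
    real*8 local(n), window_buf(n)
    integer my_id, ierr, win
    integer(kind=MPI_ADDRESS_KIND) winsize, disp

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)

    ! Every process exposes window_buf for remote access
    winsize = n * 8
    call MPI_WIN_CREATE(window_buf, winsize, 8, MPI_INFO_NULL, &
                        MPI_COMM_WORLD, win, ierr)

    local = my_id
    call MPI_WIN_FENCE(0, win, ierr)
    if (my_id == 0) then
       ! Write local() into process 1's window; process 1 does nothing
       disp = 0
       call MPI_PUT(local, n, MPI_DOUBLE_PRECISION, 1, disp, n, &
                    MPI_DOUBLE_PRECISION, win, ierr)
    end if
    call MPI_WIN_FENCE(0, win, ierr)

    call MPI_WIN_FREE(win, ierr)
    call MPI_FINALIZE(ierr)
  end program put_sketch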

38
Parallel Computational Models
  • Combined Message Passing and Threads
  • Driven by clusters of SMPs
  • Leads to software complexity!

39
Programming the IBM Power3 SP
  • History and future of POWER chip
  • Uni-processor optimization
  • Description of ACRL's IBM SP
  • Parallel Processing
  • MPI
  • OpenMP
  • Hybrid MPI/OpenMP
  • MPI-I/O (one slide)

40
Message Passing Interface
  • MPI 1.0 standard in 1994
  • MPI 1.1 in 1995 - IBM support
  • MPI 2.0 in 1997
  • Includes 1.1 but adds new features
  • MPI-IO
  • One-sided communication
  • Dynamic processes

41
Advantages of MPI
  • Universality
  • Expressivity
  • Well suited to formulating a parallel algorithm
  • Ease of debugging
  • Memory is local
  • Performance
  • Explicit association of data with process allows
    good use of cache

42
MPI Functionality
  • Several modes of point-to-point message passing
  • blocking (e.g. MPI_SEND)
  • non-blocking (e.g. MPI_ISEND)
  • synchronous (e.g. MPI_SSEND)
  • buffered (e.g. MPI_BSEND)
  • Collective communication and synchronization
  • e.g. MPI_REDUCE, MPI_BARRIER
  • User-defined datatypes
  • Logically distinct communicator spaces
  • Application-level or virtual topologies

43
Simple MPI Example
[Diagram: two processes, My_Id = 0 and My_Id = 1;
process 0 prints "This is from MPI process number
0" and process 1 prints "This is from MPI
processes other than 0"]
44
Simple MPI Example
  • Program Trivial
  • implicit none
  • include "mpif.h" ! MPI header file
  • integer My_Id, Numb_of_Procs, Ierr
  • call MPI_INIT ( ierr )
  • call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr
    )
  • call MPI_COMM_SIZE ( MPI_COMM_WORLD,
    Numb_of_Procs, ierr )
  • print *, ' My_id, numb_of_procs ', My_Id,
    Numb_of_Procs
  • if ( My_Id .eq. 0 ) then
  • print *, ' This is from MPI process number
    ',My_Id
  • else
  • print *, ' This is from MPI processes other than
    0 ', My_Id
  • end if
  • call MPI_FINALIZE ( ierr ) ! bad things happen if
    you forget ierr
  • stop
  • end

45
MPI Example with send/recv
[Diagram: processes 0 and 1 each Send to and
Receive from the other]
46
MPI Example with send/recv
  • Program Simple
  • implicit none
  • Include "mpif.h"
  • Integer My_Id, Other_Id, Nx, Ierr
  • Parameter ( Nx = 100 )
  • Real A ( Nx ), B ( Nx )
  • call MPI_INIT ( Ierr )
  • call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr
    )
  • Other_Id = Mod ( My_Id + 1, 2 )
  • A = My_Id
  • call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id,
    MPI_COMM_WORLD, Ierr )
  • call MPI_RECV ( B, Nx, MPI_REAL, Other_Id,
    Other_Id, MPI_COMM_WORLD, Ierr )
  • call MPI_FINALIZE ( Ierr )
  • stop
  • end

47
What Will Happen?
  • /* Processor 0 */
  • ...
  • MPI_Send(sendbuf,
  • bufsize,
  • MPI_CHAR,
  • partner,
  • tag,
  • MPI_COMM_WORLD);
  • printf("Posting receive now ...\n");
  • MPI_Recv(recvbuf,
  • bufsize,
  • MPI_CHAR,
  • partner,
  • tag,
  • MPI_COMM_WORLD,
  • &status);
  • /* Processor 1 */
  • ...
  • MPI_Send(sendbuf,
  • bufsize,
  • MPI_CHAR,
  • partner,
  • tag,
  • MPI_COMM_WORLD);
  • printf("Posting receive now ...\n");
  • MPI_Recv(recvbuf,
  • bufsize,
  • MPI_CHAR,
  • partner,
  • tag,
  • MPI_COMM_WORLD,
  • &status);
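
The exchange above completes only if both messages are small enough
for the library to buffer them (below the eager limit discussed on the
next slide); for larger messages both MPI_Send calls block waiting for
a matching receive and the program deadlocks. A minimal Fortran sketch
of a safe exchange using MPI_SENDRECV (names and the message size are
illustrative; run with two processes):

  program safe_exchange
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100000
    real*8 sendbuf(n), recvbuf(n)
    integer my_id, partner, ierr
    integer status(MPI_STATUS_SIZE)

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    partner = mod(my_id + 1, 2)
    sendbuf = my_id

    ! Combined send/receive cannot deadlock, whatever the message size
    call MPI_SENDRECV(sendbuf, n, MPI_DOUBLE_PRECISION, partner, 0, &
                      recvbuf, n, MPI_DOUBLE_PRECISION, partner, 0, &
                      MPI_COMM_WORLD, status, ierr)

    call MPI_FINALIZE(ierr)
  end program safe_exchange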

48
MPI Message Passing Modes
Ready mode:       Ready protocol
Standard mode:    Eager protocol if message < eager limit,
                  Rendezvous protocol if message > eager limit
Synchronous mode: Rendezvous protocol
Buffered mode:    Buffered protocol
Default eager limit on the SP is 4 KB (can be up to
64 KB)
49
MPI Performance Visualization
  • ParaGraph
  • Developed at the University of Illinois
  • Graphical display system for visualizing
    behaviour and performance of MPI programs

50
(No Transcript)
51
(No Transcript)
52
Message Passing on SMP
[Diagram: on one node, the sender calls MPI_SEND
and the receiver calls MPI_RECEIVE; the data moves
from the send buffer to the receive buffer either
through the memory crossbar or through the switch]
export MP_SHARED_MEMORY=yes|no
53
Shared Memory MPI
  • MP_SHARED_MEMORY=<yes|no>
  • Latency (µsec) / Bandwidth (MB/sec)
  • between 2 nodes: 24 / 133
  • same node, MP_SHARED_MEMORY=no: 30 / 80
  • same node, MP_SHARED_MEMORY=yes: 10 / 270

54
Message Passing off Node
MPI across all the processors: many more messages
going through the switch fabric
55
Programming the IBM Power3 SP
  • History and future of POWER chip
  • Uni-processor optimization
  • Description of ACRL's IBM SP
  • Parallel Processing
  • MPI
  • OpenMP
  • Hybrid MPI/OpenMP
  • MPI-I/O (one slide)

56
OpenMP
  • In 1997, a group of hardware and software vendors
    announced their support for OpenMP, a new API for
    multi-platform shared-memory programming (SMP) on
    UNIX and Microsoft Windows NT platforms.
  • www.openmp.org
  • OpenMP parallelism is specified through compiler
    directives which are embedded in C/C++ or Fortran
    source code. IBM does not yet support OpenMP for
    C/C++.

57
OpenMP
  • All processors can access all the memory in the
    parallel system
  • Parallel execution is achieved by generating
    threads which execute in parallel
  • Overhead for SMP parallelization is large
    (100-200 µsec) - the parallel work construct
    must be large enough to overcome the overhead

58
OpenMP
  • 1. All OpenMP programs begin as a single process:
    the master thread
  • 2. FORK: the master thread creates a team of
    parallel threads
  • 3. Statements in the parallel region are executed
    in parallel among the various team threads (see
    the sketch after this list)
  • 4. JOIN: the threads synchronize and terminate,
    leaving only the master thread
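
A minimal Fortran sketch of the fork/join model above, assuming the
compiler's SMP option is used (e.g. xlf_r with -qsmp=omp); the printed
messages are illustrative.

  program fork_join
    implicit none
    integer omp_get_thread_num, omp_get_num_threads

    print *, 'serial part: only the master thread runs here'

  !$OMP PARALLEL
    ! FORK: a team of threads executes this region in parallel
    print *, 'hello from thread', omp_get_thread_num(), &
             'of', omp_get_num_threads()
  !$OMP END PARALLEL
    ! JOIN: the team has synchronized and only the master continues

    print *, 'serial part again'
  end program fork_join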

59
OpenMP
  • How is OpenMP typically used?
  • OpenMP is usually used to parallelize loops
  • Find your most time consuming loops.
  • Split them up between threads.
  • Better scaling can be obtained using OpenMP
    parallel regions, but can be tricky!

60
OpenMP Loop Parallelization
  • !$OMP PARALLEL DO
  • do i=0,ilong
  •   do k=1,kshort
  •     ...
  •   end do
  • end do
  • #pragma omp parallel for
  • for(i=0; i < ilong; i++)
  •   for(k=1; k < kshort; k++)
  •     ...

61
Variable Scoping
  • Most difficult part of Shared Memory
    Parallelization
  • What memory is shared
  • What memory is private - each thread has its own
    copy
  • Compare MPI: all variables are private
  • Variables are shared by default, except
  • loop indices
  • scalars that are set and then used in the loop
    (see the sketch after this list)
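
A minimal Fortran sketch of these scoping rules (array and variable
names are illustrative): the loop index is private automatically, the
scalar tmp is set and then used inside the loop so it must be made
PRIVATE, and the arrays stay shared.

  program scoping
    implicit none
    integer, parameter :: n = 1000
    real*8 a(n), b(n), tmp
    integer i

    b = 3.0d0
  !$OMP PARALLEL DO PRIVATE(tmp)
    do i = 1, n              ! i is private automatically
       tmp = 2.0d0 * b(i)    ! tmp: each thread needs its own copy
       a(i) = tmp * tmp      ! a and b are shared (the default)
    end do
  !$OMP END PARALLEL DO
    print *, a(1)
  end program scoping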

62
How Does Sharing Work?
Shared x, initially 0
  • THREAD 1
  • increment(x)
  • x = x + 1
  • THREAD 1
  • 10 LOAD A, (x address)
  • 20 ADD A, 1
  • 30 STORE A, (x address)

  • THREAD 2
  • increment(x)

  • x = x + 1
  • THREAD 2
  • 10 LOAD A, (x address)
  • 20 ADD A, 1
  • 30 STORE A, (x address)

Result could be 1 or 2. Synchronization is needed,
as in the sketch below.
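
One way to get the intended result is to make the update indivisible.
A minimal Fortran sketch using an OpenMP ATOMIC directive (REDUCTION
or a CRITICAL section would also work; the variable name follows the
slide):

  program atomic_update
    implicit none
    integer x
    x = 0
  !$OMP PARALLEL
    ! ATOMIC makes the load/add/store sequence indivisible, so no
    ! increment is lost when the threads update x concurrently
  !$OMP ATOMIC
    x = x + 1
  !$OMP END PARALLEL
    print *, 'x =', x   ! equals the number of threads in the team
  end program atomic_update
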
63
False Sharing
[Diagram: a cache line holding words 7..0 of a
block, with its address tag, as stored in the
cache]
Say A(1:5) starts on a cache line; then some of
A(6:10) will also be on that first cache line, so
it won't be accessible until the first thread has
finished with the line (a remedy is sketched
below).
!$OMP PARALLEL DO
do I=1,20
   A(I) = ...
end do
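
A minimal Fortran sketch of one common remedy, assuming a 128-byte
cache line (16 real*8 words) on the POWER3; the loop and the padding
factor are illustrative. Each thread's running value gets its own
cache line, so updates by different threads never share a line.

  program padded
    implicit none
    integer, parameter :: pad = 16        ! one 128-byte line of real*8
    integer, parameter :: maxthreads = 16
    real*8 s(pad, maxthreads)
    integer omp_get_thread_num
    integer it, i

    s = 0.0d0
  !$OMP PARALLEL PRIVATE(it, i)
    it = omp_get_thread_num() + 1
    do i = 1, 1000000
       s(1, it) = s(1, it) + 1.0d0   ! each thread stays on its own line
    end do
  !$OMP END PARALLEL
    print *, 'total =', sum(s(1, :))
  end program padded
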
64
Programming the IBM Power3 SP
  • History and future of POWER chip
  • Uni-processor optimization
  • Description of ACRL's IBM SP
  • Parallel Processing
  • MPI
  • OpenMP
  • Hybrid MPI/OpenMP
  • MPI-I/O (one slide)

65
Why Hybrid MPI-OpenMP?
  • To optimize performance on mixed-mode hardware
    like the SP
  • MPI is used for inter-node communication, and
    OpenMP for intra-node parallelism
  • threads have lower latency
  • threads can alleviate network contention of a
    pure MPI implementation

66
Hybrid MPI-OpenMP?
  • Unless you are forced against your will, for the
    hybrid model to be worthwhile
  • There has to be obvious parallelism to exploit
  • The code has to be easy to program and maintain
  • easy to write bad OpenMP code
  • It has to promise to perform at least as well as
    the equivalent all-MPI program
  • Experience has shown that converting working MPI
    code to a hybrid model rarely results in better
    performance
  • especially true with applications having a single
    level of parallelism

67
Hybrid Scenario
  • Thread the computational portions of the code
    that exist between MPI calls (see the sketch
    after this list)
  • MPI calls are single-threaded and therefore
    use only a single CPU
  • Assumes
  • the application has two natural levels of
    parallelism
  • or that, in breaking up an MPI code with one
    level of parallelism, there is little or no
    communication between the resulting threads
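
A minimal Fortran sketch of this scenario (names and sizes are
illustrative): the loop between the MPI calls is threaded with OpenMP,
while all MPI calls are made outside the parallel region by a single
thread per process.

  program hybrid
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1000000
    real*8 a(n), local_sum, global_sum
    integer my_id, i, ierr

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    a = my_id + 1.0d0
    local_sum = 0.0d0

    ! Threaded compute phase between MPI calls: OpenMP splits the loop
    ! over the CPUs of the node
  !$OMP PARALLEL DO REDUCTION(+:local_sum)
    do i = 1, n
       local_sum = local_sum + a(i)
    end do
  !$OMP END PARALLEL DO

    ! MPI phase: only the master thread of each process communicates
    call MPI_REDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                    MPI_SUM, 0, MPI_COMM_WORLD, ierr)
    if (my_id == 0) print *, 'global sum =', global_sum

    call MPI_FINALIZE(ierr)
  end program hybrid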

68
Programming the IBM Power3 SP
  • History and future of POWER chip
  • Uni-processor optimization
  • Description of ACRL's IBM SP
  • Parallel Processing
  • MPI
  • OpenMP
  • Hybrid MPI/OpenMP
  • MPI-I/O (one slide)

69
MPI-IO
[Diagram: data moving between the memory of the
processes and a shared file]
  • Part of MPI-2
  • Resulted from work at IBM Research exploring the
    analogy between I/O and message passing (see the
    sketch after this list)
  • See Using MPI-2, by Gropp et al. (MIT Press)
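
A minimal Fortran sketch of the idea (file name, sizes and offsets are
illustrative): each process writes its own block of a shared file at
an explicit offset, much as it would address a message to a
destination.

  program each_writes
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1000
    real*8 buf(n)
    integer my_id, fh, ierr
    integer status(MPI_STATUS_SIZE)
    integer(kind=MPI_OFFSET_KIND) offset

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
    buf = my_id

    ! All processes open the same file; each writes n real*8 values
    ! at an offset determined by its rank
    call MPI_FILE_OPEN(MPI_COMM_WORLD, 'data.out', &
                       MPI_MODE_WRONLY + MPI_MODE_CREATE, &
                       MPI_INFO_NULL, fh, ierr)
    offset = my_id * n * 8
    call MPI_FILE_WRITE_AT(fh, offset, buf, n, MPI_DOUBLE_PRECISION, &
                           status, ierr)
    call MPI_FILE_CLOSE(fh, ierr)

    call MPI_FINALIZE(ierr)
  end program each_writes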

70
Conclusion
  • Don't forget uni-processor optimization
  • If you choose one parallel programming API,
    choose MPI
  • Mixed MPI-OpenMP may be appropriate in certain
    cases
  • More work needed here
  • Remote memory access model may be the answer