ORNL Power 4 Workshop July 24 and 25 PowerPoint PPT Presentation

Title: ORNL Power 4 Workshop July 24 and 25


1
ORNL Power 4 Workshop, July 24 and 25
  • Mark R. Fahey
  • faheymr@ornl.gov

2
Sponsored by
  • Center for Computational Sciences (CCS)
  • http://www.ccs.ornl.gov
  • Joint Institute for Computational Science (JICS)
  • http://www.jics.utk.edu

3
Topics
  • Introduction
  • Announcements
  • Agenda
  • Speakers

4
Introduction
  • Goals
  • Discuss issues relating to the Power 4 processor
    and the 27 Power 4 node system at ORNL
  • Learn the systems limitations, weaknesses
  • Be made aware of all the available tools
  • Become a more efficient user of the Power 4 system

5
Announcements
  • Bathrooms
  • Parking
  • Cafeteria
  • Dinner
  • Talk with Trey

6
Agenda, July 24
  • Introduction/Overview 9:00 am
  • Site basics 9:15 am
  • Break 10:15 am
  • Hardware Configuration 10:30 am
  • Software 11:00 am
  • Lunch
  • LoadLeveler 1:00 pm
  • Tools introduction 2:00 pm
  • Break 3:00 pm
  • Special look at HPM and PCT 3:15 pm
  • Odds and Ends (maybe Thursday)

7
Agenda, July 25
  • Benchmarks 8:30 am
  • Linpack
  • Early Evaluation Results
  • Break 10:15 am
  • Performance issues 10:30 am
  • Lunch
  • Advanced LoadLeveler 1:00 pm
  • MPI Programming 1:45 pm
  • Break 2:45 pm
  • Short Case Studies 3:00 pm
  • Libraries 3:45 pm

8
Speakers/Contributors
  • Rebecca Fahey
  • Christian Halloy
  • Trey White
  • Kwai Wong
  • Pat Worley

9
Overview of the Center for Computational Sciences
  • Cheetah Training Workshop
  • July 24, 2002
  • Trey White

10
CCS Overview
  • CCS systems
  • Distributed Computing Environment (DCE)
  • Secure shell, secure copy (ssh, scp)
  • Distributed File Service (DFS)
  • High-Performance Storage System (HPSS)

11
CCS systems
  • Bearcat - Basic login services
  • Supercomputers
  • Eagle (184-node IBM SP)
  • Falcon (64-node AlphaServer SC40)
  • Colt (16-node AlphaServer SC40)
  • Cheetah (27 IBM p690s)
  • DCE servers
  • DFS servers and disks
  • HPSS servers, disks, and tape robots

12
Distributed Computing Environment (DCE)
  • Directory services
  • Central user database
  • No /etc/passwd files to maintain
  • User authentication
  • Kerberos V
  • User authorization
  • DCE groups - systems, DFS, HPSS
  • Access control lists (ACLs)

13
DCE credentials
  • Kerberos V, more or less
  • Give your password, get a temporary license
  • License to access DFS, HPSS
  • Temporary? 30 days
  • For long-running jobs that resubmit
  • LoadLeveler passes on the license/credentials

14
DCE credentials
  • No credentials
  • cheetah0033 klist
  • No DCE identity available: No currently
    established network identity for this context
    exists (dce / sec)
  • Kerberos Ticket Information:
  • klist: No credentials cache file found (dce / krb)
    (ticket cache /opt/dcelocal/var/security/creds/dcecred_6d2c1b47)
  • Credentials
  • cheetah0033 klist
  • DCE Identity Information:
  • Global Principal: /.../dce.ccs.ornl.gov/trey
  • Cell: 30f86b8a-0293-11d1-8bbe-02608ce8cceb /.../dce.ccs.ornl.gov
  • Principal: 00009b9a-c5c7-21d2-ac00-02608ce8cceb trey
  • Group: 0000000a-4f44-21d1-a701-02608ce8cceb staff
  • Local Groups:
  • 0000012f-4c96-21d1-a701-02608ce8cceb ccs

15
How to lose credentials
  • rsh without the right magic
  • ssh using public/private key
  • Reset/delete KRB5CCNAME environment variable
  • Crash dceunixd

16
DCE Control Program
cheetah0033 dcecp
dcecp help
The general format of all dcecp object operations is as follows:
    dcecp argument options
In addition to all of the standard tcl commands, dcecp supports many
commands to administer DCE objects. A dcecp object or task represent
a DCE entity. All of the following dcecp objects and tasks require a verb:
    account    cdscache       emsconsumer  link          rpcgroup
    acl        cdsclient      emsevent     log           rpcprofile
    attrlist   cell           emsfilter    name          secval
    aud        cellalias      emslog       object        server
    audevents  clearinghouse  endpoint     organization  user
    audfilter  clock          group        principal     utc
    audtrail   directory      host         registry      uuid
    cds        dts            hostdata     rpcentry      xattrschema
    cdsalias   ems            keytab
Miscellaneous commands perform specific functions. These commands take no verb:
    echo errtext login logout quit resolve shell
To list all dcecp objects:
    dcecp help -verbose
To list all verbs an object supports:
    dcecp help
To list all options for an object operation:
    dcecp help
For verbose information on a dcecp object:
    dcecp help -verbose

17
DCE Control Program
dcecp account help
    catalog     Returns the names of all accounts in the registry.
    create      Creates an account in the registry.
    delete      Deletes an account from the registry.
    generate    Generates a random password for an account in the registry.
    modify      Modifies an account in the registry.
    show        Returns the attributes of an account.
    help        Prints a summary of command-line options.
    operations  Returns a list of the valid operations for this command.
dcecp account help modify
    -acctvalid       Is the account valid and can it be logged into.
    -change          Specify attributes to change in an attribute list format.
    -client          Can the account principal be a client.
    -description     A general description of the account.
    -dupkey          Can tkts to the account be obtained via its TGT session key.
    -expdate         When the account expires.
    -forwardabletkt  Allow use of forwardable tickets by or for this principal.
    -goodsince       The time indicating when the account was good since.
    -group           The account's primary group name.
    -home            The filesystem directory the principal uses at login.
    -maxtktlife      The maximum ticket life for the account.
    -maxtktrenew     The maximum ticket renewal time.

18
DCE Control Program
dcecp account show elmo
    acctvalid       yes
    client          yes
    created         /.../dce.ccs.ornl.gov/00000027-40ef-21d2-a800-02608ce8cceb 1999-02-16-124244.000-0500I-----
    description     Elmo Monster,ORNL-CCS,8652412103
    dupkey          no
    expdate         none
    forwardabletkt  yes
    goodsince       1952-04-14-102743.000-0500I-----
    group           staff
    home            /dfs/home/elmo
    lastchange      /.../dce.ccs.ornl.gov/elmo 2002-05-23-152855.000-0400I-----
    organization    staff
    postdatedtkt    no
    proxiabletkt    no
    pwdvalid        yes
    renewabletkt    yes
    server          yes
    shell           /bin/ksh
    stdtgtauth      yes
    usertouser      no

19
Changing your default shell
  • ksh
  • dcecp account modify user -shell /usr/bin/ksh
  • tcsh
  • dcecp account modify user -shell /usr/local/bin/tcsh

20
Changing your password
  • Don't use dcecp
  • Plain old passwd works
  • Propagates in a few minutes

21
Secure shell
  • Encrypts your whole session, including passwords
  • ssh user@cheetah.ccs.ornl.gov
  • Doesn't work?
  • ssh -1 user@cheetah.ccs.ornl.gov
  • ssh -2 user@cheetah.ccs.ornl.gov
  • ssh -v user@cheetah.ccs.ornl.gov
  • E-mail consult@ccs.ornl.gov

22
Backspace problems
  • Do you see ^? when you press backspace/delete?
  • stty erase
  • Put this in .cshrc/.profile: stty erase ^?
    (literally)
  • ^H is also common

23
Secure copy
  • Built on secure shell
  • Options like cp
  • scp -p -r user@host.gov.dat .
  • Doesn't work?
  • scp -oProtocol=1
  • Output in .cshrc/.profile?

24
scp and initialization output
  • .cshrc/.profile must produce no output
  • ksh (.profile)
  • TTY=`/usr/bin/tty`
  • if [ $? = 0 ]; then
    /usr/bin/echo "interactive stuff goes here"
  • fi
  • csh (.cshrc)
  • ( /usr/bin/tty ) > /dev/null
  • if ( $status == 0 ) then
    /usr/bin/echo "interactive stuff goes here"
  • endif
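
An equivalent guard can be sketched with the POSIX test `[ -t 0 ]`, which checks whether standard input is a terminal. This is an alternative to calling /usr/bin/tty, not the slide's own code, and `is_interactive` is a hypothetical helper name:

```shell
# Hypothetical helper: succeeds only when standard input is a terminal.
is_interactive() {
    [ -t 0 ]
}

# Anything that prints must sit inside the guard, so that scp
# (which runs the startup file without a terminal) sees no output.
if is_interactive; then
    echo "interactive stuff goes here"
fi
```

When the startup file is read by scp or a batch job, standard input is not a terminal, so the guarded block is skipped and nothing is printed.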

25
Secure X-Windows
  • Automatic tunneling of X-Windows through ssh
  • Encrypted
  • No xhost needed
  • DISPLAY points to localhost
  • Doesn't work?
  • ssh -X user@cheetah.ccs.ornl.gov

26
Insecure X-Windows
  • On Cheetah
  • export DISPLAY=host.gov:0.0
  • setenv DISPLAY host.gov:0.0
  • On your system
  • xhost cheetah0033.ccs.ornl.gov
  • Doesn't work?
  • ssh from Cheetah to your system
  • See hostname in "who am i"
  • Don't type your password in the X session!

27
Public/private key
  • Gets on Cheetah, but with no DCE credentials
  • ssh/scp from Cheetah
  • Keep private key in private DFS area

28
Distributed File Service (DFS)
  • Globally accessible
  • DCE/Kerberos security
  • Tape backups 3 times a week
  • Client and server caching
  • Transparent movement of user filesets among
    servers
  • Access control lists (ACLs)
  • Read-only replicas
  • NFS-like performance

29
DFS home directories
  • 500MB by default
  • public - Readable by others
  • private - Readable by only you
  • yesterday - 4AM daily read-only snapshot
  • Easy backup recovery!
  • bin - Executables
  • @sys directories (expanded by DFS, not shell)
  • Not in default path
  • www - http://www.ccs.ornl.gov/user

30
DFS quota
  • df doesn't work
  • fts
  • man fts
  • man fts_lsquota
  • cheetah0033 fts lsquota user
  • Fileset Name Quota Used Used Aggregate
  • us.user 500000 493813 9815493157/25574417(LFS)
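
As a rough check on a listing like the one above, the share of the quota in use can be computed from the Quota and Used columns. A minimal sketch in POSIX shell, where `quota_percent` is a hypothetical helper and the figures are the ones from the example listing:

```shell
# quota_percent USED QUOTA
# Prints the integer percentage of the quota that is in use.
quota_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Values taken from the Used and Quota columns of the example.
quota_percent 493813 500000
```

Integer division truncates, so a fileset that is just under its quota still reports less than 100.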

31
DFS ACLs
  • cheetah0033 dcecp -c acl show user
  • mask_obj r-x---
  • user_obj rwxcid
  • user cell_admin rwxcid effective r-x---
  • group_obj r-x---
  • group dfs_admin rwxcid effective r-x---
  • group backup_admin rwxcid effective r-x---
  • other_obj r-x---
  • any_other r-x---
  • Confusing interaction with Unix permissions
  • user_obj - user permissions
  • mask_obj - group permissions
  • DCE group permissions OR this with group_objs
  • other_obj - others' permissions
  • any_other - permissions for no DCE credentials

32
Why can't I write to DFS?
  • No credentials?
  • klist
  • Exceeded quota?
  • fts lsquota path
  • Wrong ACLs?
  • dcecp -c acl show path
  • DFS is broken?
  • E-mail consult@ccs.ornl.gov

33
High-Performance Storage System (HPSS)
  • Archival storage
  • Tapes with a large disk cache
  • TBs of space, no quota (yet?)
  • DCE security
  • Hierarchical Storage Interface (HSI)
  • Works without passwords in batch scripts
  • Maintenance Wednesday mornings!

34
Hierarchical Storage Interface (HSI)
  • Interactive: hsi
  • Command line: hsi "command command"
  • Scripted: hsi "in script"
  • Some people have .hsirc in DFS
  • Want to change it? Ask for help.

35
HSI
  • Two copies/separate tapes by default
  • put/get - like FTP
  • ls, cd, mv, chmod, chgrp
  • Only move if new: cput/cget
  • Command summary: hsi help
  • http://www.sdsc.edu/Storage/hsi/

36
htar
  • Coming soon!
  • Store to/extract from tar file in HPSS
  • Uses HPSS interface efficiently
  • HPSS file is tar compatible
  • With one extra file at the end
  • Index file kept in HPSS disk cache

37
htar examples
  • Create an archive of local files in HPSS:
    htar -cvf archive.tar localdir
  • Get contents of an archive:
    htar -tvf archive.tar
  • Listing will include a file like this:
    /tmp/HTAR_CF_CHK_26452_1016481493
  • Retrieve files from an archive:
    htar -xvf archive.tar file1 file2

38
More info
  • http://www.ccs.ornl.gov/
  • Click "CCS Computers"
  • Click "Cheetah"
  • http://www.ccs.ornl.gov/Cheetah.html
  • consult@ccs.ornl.gov

39
Hardware Configuration
The Center For
Computational Sciences
  • Rebecca Fahey
  • User and Applications Support

40
Contents
  • Hardware Overview
  • File Systems
  • Types of Nodes
  • Recommendations for Jobs
  • Node Design
  • Cache
  • Operations
  • Additional Information

41
Hardware Overview
  • 27 IBM Power4 nodes
  • Thirty-two 1.3 GHz Power4 processors per physical
    node (864 processors in total)
  • Over 4.5 Teraflops in the compute partition

42
Overview Key Features
  • Server on a chip -- IBM's POWER4 microprocessor
    is the first "server on a chip." It contains two
    1.3 gigahertz processors, a high-bandwidth system
    switch, a large memory cache and I/O. Four chips
    combine to form an MCM. Four MCMs make up a
    node.
  • Virtualization -- It can be operated with 32-way
    nodes, or the 32-way nodes can be divided into as
    many as 16 "virtual" servers. Cheetah has 8
    nodes, each divided into four 8-way LPARs.

43
File Systems Home directories
  • Location /dfs/home/
  • Backups are performed daily and a copy is placed in
    /dfs/home//yesterday.
  • Provides a small amount of permanent storage for
    files that are used frequently
  • Ex. personal scripts, executables, libraries
  • Running jobs out of your home directory is not
    recommended because the access time is slow
  • For more info, refer to the site specific info
    presented earlier

44
File Systems GPFS
  • Location /tmp/gpfs750a/
  • SYSTEM_USERDIR and SCRATCH point to this
    directory
  • Global file system -- Accessible by all compute
    nodes
  • Recommended place to run jobs because it provides
    faster access
  • Your directory is not backed up and the files are
    purged periodically
  • At job conclusion, copy files to be preserved
    into HPSS. For info on HPSS, see the site-wide
    info presented earlier

45
File Systems Node local
  • 160 GB of disk space local to each node
  • Accessible only through the NODE_JOBDIR
    environment variable during a batch job
  • All files are automatically purged at the
    conclusion of the job
  • Intended for temporary files

46
Types of Nodes
  • Login node (1)
  • The login node is a 32-way node. 16 are reserved
    for login use. The other 16 are available to
    batch jobs.
  • 32-way compute nodes (18) (Pool 32)
  • 11 nodes have 32 GB of memory
  • 5 nodes have 64 GB of memory
  • 2 nodes have 128 GB with 101 GB currently usable
  • 8-way LPARs (32) (Pool 8)
  • Each has 8 processors with 8 GB of memory
  • Note: You do not automatically have access to the
    memory on a node. For more than 1 segment, you
    must request resources
    (# @ resources = ConsumableMemory(1 gb)).

47
Types of Nodes 32-way Nodes
  • 32-way Nodes
  • Note: The Colony switch will be replaced with a
    Federation switch when the switch is available

[Diagram labels: 32-way node (x2), Dual Plane Colony]

48
Types of Nodes LPARs
  • 32-way Nodes, 8-way LPARs
  • Note: The Colony switch will be replaced with a
    Federation switch when the switch is available

[Diagram labels: 32-way node (x4), 8-way LPAR (x8), Dual Plane Colony (x5)]

49
Recommendations all jobs
  • Set MP_SHARED_MEMORY=yes to keep intra-node
    messages from going out to the switch.
  • Use both planes of the Colony for your jobs by
    setting
  • # @ network.MPI = csss,shared,US
  • in your batch script.
  • Set # @ node_usage = shared and use the # @
    resources line to reserve node resources

50
Recommendations Small Jobs
  • Run jobs nodes
  • To request a 32-processor node, include the
    following in your batch script
  • # @ requirements = (Pool == 32)
  • Reserve resources for your job with the # @
    resources line in your batch script. For info,
    see the LoadLeveler section.

51
Recommendations Large Jobs
  • Run jobs that do most of their communication
    within the node and minimal communication between
    nodes on 32-way nodes
  • Ex. MPI/OpenMP code running 1-8 MPI tasks, with
    each task spawning threads
  • Run all heavy communication codes (MPI or LAPI)
    using more than 32 processors on LPARs (Pool
    8) or run them using IP
    (# @ network.MPI = csss,shared,IP) on the 32-way nodes

52
Node Design
1.3 GHz processor 32 KB Level 1 cache
Level 3 cache 512 MB total 128 MB ea.
Level 2 cache 1440 KB total 480 KB ea.
Four MCMs comprise one node of a regatta system.
53
Cache L1
  • L1 cache
  • 32 KB of data and 64 KB of instruction cache for
    each processor
  • Fetches 128 byte lines
  • Uses FIFO replacement policy rather than touch.
    Thus, blocking for cache reuse is not advisable.
  • Uses eight pre-fetching streams. Use the
    -qhot, -qcache=auto, -qarch=pwr4, and -qtune=pwr4
    compiler options to perform loop
    optimizations that improve cache use.

54
Cache L2
  • L2 cache
  • Total of 1440 KB shared between the two
    processors on a chip
  • Uses least recently touched replacement policy
  • For applications with memory requirements
    1GB/process, placement of processes on
    processors may impact cache performance.
  • If you are blocking for cache size, you should
    block for L2 cache. To leave it to the compiler,
    use -qarch=pwr4, -qtune=pwr4, -qcache=auto, and
    -qhot. Implied with -O4

55
Cache L3
  • L3 cache
  • Total of 512 MB split into 4 chunks of 128 MB
  • Shared between the eight processors on an MCM
  • Access times vary depending on location of the
    data to be retrieved

56
Cache Performance
57
Operations
  • Each processor has
  • 2 floating-point units
  • 1 madd per cycle with a 6 cycle latency
  • 72 registers serve both units. -qunroll may
    improve register use. Implied with -O2 and
    above
  • 2 integer units
  • 1 add per cycle with a 2 cycle latency
  • Instructions may execute out of order and may
    execute speculatively, but are tracked (200)
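
The peak floating-point rate implied by these figures can be worked out by hand. The sketch below assumes that one madd counts as two floating-point operations (a multiply and an add) and uses the node and processor counts from the hardware overview. The variable names are illustrative, and the total covers all 27 nodes, so it need not equal the figure quoted for the compute partition alone:

```shell
nodes=27            # IBM p690 nodes
cpus_per_node=32    # processors per physical node
clock_mhz=1300      # 1.3 GHz
fpus=2              # floating-point units per processor
flops_per_madd=2    # assumption: multiply + add = 2 operations

total_cpus=$(( nodes * cpus_per_node ))
mflops_per_cpu=$(( clock_mhz * fpus * flops_per_madd ))
total_mflops=$(( total_cpus * mflops_per_cpu ))

echo "$total_cpus processors"
echo "$mflops_per_cpu MFlops peak per processor"
echo "$total_mflops MFlops peak in total"
```

Under these assumptions each processor peaks at 5.2 GFlops and the whole system at about 4.49 TFlops, in the same range as the roughly 4.5 teraflops quoted in the hardware overview.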

58
Additional Information
  • A good source for additional information on
    Regatta hardware is
  • http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg247041.pdf

59
Cheetah Software
  • Cheetah Training Workshop
  • July 24, 2002
  • Trey White

60
Cheetah software
  • Fortran
  • C/C++
  • IBM libraries
  • Other libraries
  • Other tools

61
IBM Fortran compiler
  • One compiler, different default options
  • All accept F77, F90, F95 syntax
  • Limited Power4-specific code generation
  • Re-entrant (thread-safe) versions: _r
  • Assume fixed form, -qsave: xlf_r
  • Assume free form, -qnosave: xlf90_r
  • Assume free form, F95 compliance: xlf95_r
  • IBM MPI paths and libraries: mp
  • GUI with option menus: xxlf

62
Fortran optimization options
  • Optimized: -g -O4 -qnoipa -qmaxmem=-1
  • (-O4 = -O3 -qarch=auto -qtune=auto -qcache=auto
    -qhot -qipa)
  • Results not bitwise identical? -qstrict
  • Optimizer is breaking your code? -g -O
    -qmaxmem=-1
  • Feeling lucky? -g -O5 -qmaxmem=-1

63
More Fortran options
  • Feeling really lucky? -qsmp
  • OpenMP (requires _r): -qsmp=omp
  • No array statements with overlap?
    -qalias=noaryovrlp
  • Get stack trace on signal: -qsigtrap
  • Dump core with floating-point exceptions:
    -qflttrap=overflow:zerodivide:enable
  • Turn on profiling: -pg
  • Use .f90 suffix: -qsuffix=f=f90

64
Fortran memory options
  • Default, 256MB limit
  • 2GB heap limit, 256MB stack limit:
    -bmaxdata:0x80000000
  • Big heap, big stack: -q64
  • 64-bit MPI now supported with _r
  • No vectorized intrinsics (yet)
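
The hexadecimal value passed to -bmaxdata is a byte count. The sketch below converts it and expresses it in 256 MB units, assuming (as on 32-bit AIX) that the heap is allocated in 256 MB segments. The variable names are illustrative only:

```shell
segment_bytes=$(( 256 * 1024 * 1024 ))   # one 256 MB segment
maxdata=$(( 0x80000000 ))                # value given to -bmaxdata

echo "maxdata   = $maxdata bytes"
echo "gigabytes = $(( maxdata / 1024 / 1024 / 1024 ))"
echo "segments  = $(( maxdata / segment_bytes ))"
```

The same arithmetic gives the value for a smaller heap: four segments, for example, would be 0x40000000.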

65
Fortran data options
  • Static variables are implicitly save (xlf_r
    default): -qsave
  • Static variables are implicitly automatic
    (xlf90_r default): -qnosave
  • 8B REAL and 16B DOUBLE PRECISION: -qrealsize=8
  • 8B REAL and 8B REAL*4: -qautodbl=dbl4

66
Fortran environment variables
  • Old-fashioned namelists:
    export XLFRTEOPTS=namelist=old
    setenv XLFRTEOPTS namelist=old
  • OpenMP standard variables
  • OMP_NUM_THREADS
  • OMP_SCHEDULE (DYNAMIC, GUIDED, STATIC)

67
IBM C/C++ compiler
  • Part of VisualAge C++ product
  • Beta version, with limited Power4-specific code
    generation
  • Re-entrant (thread-safe) versions: _r
  • Strict ANSI C: xlc_r
  • More lenient: cc_r
  • Errors become warnings
  • ANSI C++ (but no STL?): xlC_r
  • IBM MPI paths and libraries: mpcc_r, mpCC_r
  • but there is no CC_r (?)

68
C/C++ optimization options
  • Optimized: -g -O4 -qnoipa -qmaxmem=-1
  • (-O4 = -O3 -qarch=auto -qtune=auto -qcache=auto
    -qipa)
  • Results not bitwise identical? -qstrict
  • Optimizer is breaking your code? -g -O
    -qmaxmem=-1
  • Feeling lucky? (IPA is C only) -g -O5 -qmaxmem=-1

69
More C/C++ options
  • Feeling really lucky (C only)? -qsmp
  • OpenMP (requires _r, C only): -qsmp=omp
  • Dump core with floating-point exceptions:
    -qflttrap=overflow:zerodivide:enable
  • -qsigtrap?
  • Turn on profiling: -pg

70
C/C++ memory options
  • Default, 256MB limit
  • 2GB heap limit, 256MB stack limit:
    -bmaxdata:0x80000000
  • Big heap, big stack: -q64
  • MPI requires _r

71
gcc/g++?
  • Available (32 bit)
  • Efficient?
  • Supported?
  • Can use IBM MPI

72
IBM libraries
  • 32/64 bits chosen automatically
  • Extended Scientific Subroutine Library (ESSL)
  • PESSL
  • MASS

73
ESSL
  • BLAS, linear algebra (not quite LAPACK), FFTs,
    sorting, quadrature, random
  • Used mostly for BLAS
  • Tuned for Power4
  • Sequential -lessl_r
  • Threaded parallel -lesslsmp
  • Call from sequential code

74
PESSL
  • Distributed-memory parallel
  • Tuned for Power4
  • BLACS, PBLAS, ScaLAPACK, FFTs
  • Not newest ScaLAPACK
  • Not thread safe (but thread tolerant)
  • Distributed-memory only -lpessl
  • And SMP parallel -lpesslsmp

75
MASS
  • -L/usr/local/lib
  • Faster intrinsics, less accuracy, thread safe
    -lmass
  • Vectorized math functions
  • IBM-specific calls
  • Power3/Power4 -lmassv
  • Power4-specific -lmassvp4

76
Other libraries
  • -I/usr/local/include -L/usr/local/lib
  • -L/usr/local/lib64 for some
  • Working on having combined 32 and 64 bit
  • LAPACK, FFTW, more PACKs
  • MPI BLACS, PBLAS, ScaLAPACK
  • -lnetcdf (NCAR-like interface)
  • 4B REAL: -L/usr/local/lib32/r4i4
  • 8B REAL: -L/usr/local/lib32/r8i4

77
Other tools
  • /usr/local/bin in default path
  • gmake, gtar
  • TCL/Tk, Perl, Python
  • Debuggers, performance analyzers

78
Cheetah software
  • Questions?
  • Suggestions?

79
LoadLeveler on Cheetah
  • Cheetah Training Workshop
  • July 24, 2002
  • Trey White

80
LoadLeveler on Cheetah
  • Command files
  • What not to do
  • Scheduling
  • Classes
  • Machine status
  • Controlling jobs
  • Job status

81
MPI command file
  • # @ shell = /bin/ksh
  • # @ job_type = parallel
  • # @ network.MPI = csss,shared,US
  • # @ output = $(host).$(jobid).out
  • # @ error = $(host).$(jobid).err
  • # @ wall_clock_limit = 3000
  • # @ tasks_per_node = 32
  • # @ node = 2
  • # @ queue
  • pwd
  • echo $LOADL_PROCESSOR_LIST
  • export MP_SHARED_MEMORY=yes
  • poe a.out

82
OpenMP command file
  • # @ shell = /bin/ksh
  • # @ job_type = serial
  • # @ output = $(host).$(jobid).out
  • # @ error = $(host).$(jobid).err
  • # @ wall_clock_limit = 3000
  • # @ resources = ConsumableCpus(8)
  • # @ queue
  • pwd
  • echo $LOADL_PROCESSOR_LIST
  • export OMP_NUM_THREADS=8
  • a.out

83
Hybrid MPI/OpenMP command file
  • # @ shell = /bin/ksh
  • # @ job_type = parallel
  • # @ network.MPI = csss,shared,US
  • # @ output = $(host).$(jobid).out
  • # @ error = $(host).$(jobid).err
  • # @ wall_clock_limit = 3000
  • # @ tasks_per_node = 4
  • # @ node = 2
  • # @ resources = ConsumableCpus(8)
  • # @ queue
  • pwd
  • echo $LOADL_PROCESSOR_LIST
  • export MP_SHARED_MEMORY=yes
  • export OMP_NUM_THREADS=8
  • poe a.out
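
It can help to check how many CPUs a hybrid request like this one actually asks for. The sketch below multiplies out the values from the command file above. The variable names are illustrative, and the 32-CPU node size assumes the job runs on the 32-way nodes:

```shell
nodes=2             # node
tasks_per_node=4    # tasks_per_node
cpus_per_task=8     # ConsumableCpus(8), matched by OMP_NUM_THREADS=8
node_cpus=32        # assumption: job runs on 32-way nodes

total_tasks=$(( nodes * tasks_per_node ))
cpus_per_node=$(( tasks_per_node * cpus_per_task ))
total_cpus=$(( nodes * cpus_per_node ))

echo "$total_tasks MPI tasks, $cpus_per_node CPUs per node, $total_cpus CPUs in total"
if [ "$cpus_per_node" -gt "$node_cpus" ]; then
    echo "request exceeds the $node_cpus CPUs on a node"
fi
```

With these values the job fills both nodes exactly: 4 tasks of 8 threads each use all 32 CPUs on a node.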

84
Picking node size
  • 32-processor nodes only
  • # @ tasks_per_node = 32
  • # @ tasks_per_node = 8
    # @ resources = ConsumableCpus(4)
  • # @ requirements = (Pool == 32)
  • 8-processor nodes only
  • # @ requirements = (Pool == 8)
  • Any available processors
  • # @ total_tasks = n
    # @ blocking = unlimited

85
Memory requirements
  • # @ resources = ConsumableMemory(N gb)
  • Units of kb, mb, gb, w
  • Nodes of 8, 32, 64, 128(96) GB
  • Enforced by IBM WorkLoad Manager (WLM)
  • Default is ConsumableMemory(256 mb)
  • Example large-memory OpenMP job:
    # @ resources = ConsumableCpus(32) ConsumableMemory(64 gb)
    export OMP_NUM_THREADS=32

86
What not to do
  • X # @ node_usage = not_shared
  • Use tasks_per_node, ConsumableCpus
  • Unless required by complex topologies
  • X # @ network.MPI = csss,not_shared,US
  • Take the whole node or share the switch
  • X Use multiple resources lines
  • # @ resources = ConsumableCpus(4) (ignored!)
    # @ resources = ConsumableMemory(2 gb)
  • # @ resources = ConsumableCpus(4)
    ConsumableMemory(2 gb)
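
A command file can be checked for this mistake before it is submitted. The sketch below counts the resources lines, assuming directives are written in the usual `# @ keyword = value` form. `count_resource_lines` is a hypothetical helper and the temporary file stands in for a real command file:

```shell
# count_resource_lines FILE
# Prints how many "# @ resources = ..." directives FILE contains.
count_resource_lines() {
    grep -c '^[[:space:]]*#[[:space:]]*@[[:space:]]*resources[[:space:]]*=' "$1" || true
}

# Stand-in command file that repeats the mistake from the slide.
cmdfile=$(mktemp)
cat > "$cmdfile" <<'EOF'
# @ resources = ConsumableCpus(4)
# @ resources = ConsumableMemory(2 gb)
# @ queue
EOF

n=$(count_resource_lines "$cmdfile")
if [ "$n" -gt 1 ]; then
    echo "warning: $n resources lines found; combine them into one"
fi
rm -f "$cmdfile"
```

The `|| true` keeps the helper from failing when grep finds no match, since grep then prints 0 but returns a non-zero status.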

87
What not to do
  • X End command file without newline
  • Last command will be ignored
  • csh only
  • How to tell? tail
  • cheetah0033 tail csh.ll
  • # @ queue
  • pwd
  • echo $LOADL_PROCESSOR_LIST
  • setenv MP_SHARED_MEMORY yes
  • poe a.outcheetah0033
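
Instead of inspecting the tail output by eye, the last byte of the file can be tested directly. A minimal sketch, in which `ends_with_newline` is a hypothetical helper and the temporary file stands in for a command file such as csh.ll:

```shell
# ends_with_newline FILE
# Succeeds when the last byte of FILE is a newline (or FILE is empty).
# Command substitution strips a trailing newline, so the captured
# text is empty exactly in that case.
ends_with_newline() {
    [ -z "$(tail -c 1 "$1")" ]
}

tmp=$(mktemp)
printf 'poe a.out' > "$tmp"      # deliberately no trailing newline
if ends_with_newline "$tmp"; then
    echo "ok: file ends with a newline"
else
    echo "warning: last line has no newline; appending one"
    echo >> "$tmp"
fi
rm -f "$tmp"
```

Appending an empty `echo` is enough to terminate the last line so that it is no longer ignored.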

88
What not to do
  • X Forget ConsumableCpus with OpenMP
  • Default value is 1
  • WorkLoad Manager limits CPUs to 1 per task
  • Threads per CPU = OMP_NUM_THREADS
  • X Forget OMP_NUM_THREADS with OpenMP
  • Default value is node CPU count
  • Threads per CPU = tasks_per_node
  • X poe processor options in batch jobs
  • poe a.out -procs 16 -nodes 2
  • Ignored!

89
LoadLeveler scheduling
  • FIFO with backfill
  • Oldest job scheduled next
  • All jobs have time limits
  • Short jobs inserted into space-time holes
  • Three waiting jobs may age at a time (I state)
  • Additional jobs don't age (NQ state)
  • No limit on running jobs
  • Some classes have priority
  • New jobs are already old

90
LoadLeveler classes
  • Default is batch
  • 12-hour time limit
  • No processor limit
  • Interactive (poe) default is interactive
  • 2-hour time limit
  • No processor
  • Half of login node is interactive only
  • Submit short jobs to interactive:
    # @ class = interactive

91
llclass
  • cheetah0033 llclass
    Name         MaxJobCPU   MaxProcCPU  Free   Max    Description
                 d+hh:mm:ss  d+hh:mm:ss  Slots  Slots
    -----------  ----------  ----------  -----  -----  ---------------------
    interactive  undefined   undefined     190    688  Interactive POE jobs
    batch        undefined   undefined     190    688  Batch jobs
    No_Class     undefined   undefined       0      0
    royalty      undefined   undefined     190    688  Uppercrust jobs
    testing      undefined   undefined      32     32  testing
    sys          undefined   undefined     286    784  System administration
    -----------------------------------------------------------------------
  • "Maximum Slots" value of the class "No_Class" is
    constrained by the MAX_STARTERS limit(s).
  • "Free Slots" values of the classes "No_Class",
    "interactive", "batch", "climate_prod", "sys" are
    constrained by the MAX_STARTERS limit(s).

92
llclass -l
  • cheetah0033 llclass -l batch
  • Class batch
  • Name batch
  • Priority 0
  • Exclude_Users
  • Include_Users
  • cheetah0033 llclass -l | egrep
    "Name|Wall_clock_limit"
  • Name interactive
  • Wall_clock_limit 02:05:00, undefined (7500
    seconds, undefined)
  • Name batch
  • Wall_clock_limit 12:00:00, undefined (43200
    seconds, undefined)
  • Name No_Class
  • Wall_clock_limit undefined, undefined
  • Name royalty
  • Wall_clock_limit 1+00:00:00, undefined
    (86400 seconds, undefined)
  • Name testing
  • Wall_clock_limit 1+00:00:00, undefined
    (86400 seconds, undefined)
  • Name sys
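
The seconds shown in parentheses follow directly from the clock notation. The sketch below converts a limit written as hh:mm:ss or d+hh:mm:ss into seconds, assuming the colon-separated form LoadLeveler prints. `to_seconds` is a hypothetical helper, not a LoadLeveler command:

```shell
# to_seconds LIMIT
# Converts hh:mm:ss or d+hh:mm:ss into a number of seconds.
to_seconds() {
    echo "$1" | awk -F'[+:]' '
        NF == 4 { s = (($1 * 24 + $2) * 60 + $3) * 60 + $4; print s; next }
        NF == 3 { s = ($1 * 60 + $2) * 60 + $3; print s; next }
        { exit 1 }'
}

to_seconds 02:05:00      # interactive class
to_seconds 12:00:00      # batch class
to_seconds 1+00:00:00    # royalty and testing classes
```

The three calls print 7500, 43200, and 86400, matching the values in the llclass listing.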

93
System status llstatus
  • cheetah0033 llstatus
    Name                      Schedd  InQ  Act  Startd  Run  LdAvg  Idle  Arch    OpSys
    cheetah0001.ccs.ornl.gov  Avail     0    0  Run      30  30.03  9999  RS6000  AIX51
    cheetah0017.ccs.ornl.gov  Avail     0    0  Run      30  31.08  9999  RS6000  AIX51
    cheetah0033.ccs.ornl.gov  Avail     8    5  Idle      0   0.07    12  RS6000  AIX51
    cheetah0049.ccs.ornl.gov  Avail     0    0  Busy     32  33.41  9999  RS6000  AIX51
    ...
    RS6000/AIX51    58 machines  8 jobs  498 running
    Total Machines  58 machines  8 jobs  498 running
  • The Central Manager is defined on
    manx.ccs.ornl.gov
  • The BACKFILL scheduler is in use

94
System resources llstatus -R
  • cheetah0033 llstatus -R
    Machine                   Consumable Resource(Available, Total)
    ------------------------  -------------------------------------
    cheetah0001.ccs.ornl.gov  ConsumableCpus(24,32) ConsumableMemory(62.000 gb,64.000 gb)
    cheetah0017.ccs.ornl.gov  ConsumableCpus(32,32) ConsumableMemory(64.000 gb,64.000 gb)
    cheetah0033.ccs.ornl.gov  ConsumableCpus(16,16) ConsumableMemory(16.000 gb,16.000 gb)
    cheetah0049.ccs.ornl.gov  ConsumableCpus(0,32) ConsumableMemory(24.000 gb,32.000 gb)
    ...
    cheetah1076.ccs.ornl.gov  ConsumableCpus(8,8) ConsumableMemory(8.000 gb,8.000 gb)
    cheetah1089.ccs.ornl.gov  ConsumableCpus(0,8) ConsumableMemory(6.000 gb,8.000 gb)
    ...
    cheetah1601.ccs.ornl.gov
    cheetah1617.ccs.ornl.gov
    manx.ccs.ornl.gov
  • LoadL_startd daemons of machines with ""
    appended to their names are down.

95
Controlling jobs
  • Submit a job: llsubmit script
  • Stop a job: llcancel jobname
  • Deletes queued jobs
  • Halts running jobs

96
Job status llq
  • cheetah0033 llq
    Id                  Owner   Submitted   ST  PRI  Class     Running On
    ------------------  ------  ----------  --  ---  --------  -----------
    cheetah0033.7871.0  ernie   7/23 09:24  R   50   royalty   cheetah0049
    cheetah0033.7869.0  snuffy  7/23 08:44  R   50   batch     cheetah1060
    cheetah0033.7874.0  zoe     7/23 09:52  R   50   batch     cheetah0097
    cheetah0033.7910.0  bob     7/23 21:02  R   50   batch     cheetah0273
    cheetah0033.7911.0  gordon  7/23 21:06  R   50   batch     cheetah1122
    cheetah0033.7886.0  oscar   7/23 13:44  I   50   No_Class
    cheetah0033.7896.0  bob     7/23 19:38  I   50   batch

97
Job statuses
  • Running R
  • Starting ST
  • Waiting to run, aging I (Idle)
  • Waiting to run, not aging NQ (not queued)
  • Held, not aging H
  • Won't run until released
  • Remove pending RP

98
Job status llqn -a
  • cheetah0033 llqn -a
    Job Id                           Owner   Class     SysPrio  S  Date          Node
    -------------------------------  ------  --------  -------  -  ------------  ----
    cheetah0033.ccs.ornl.gov.7869.0  snuffy  batch      -51751  R  Jul 23 18:33     1
    cheetah0033.ccs.ornl.gov.7874.0  zoe     batch      -55811  R  Jul 23 18:33     1
    cheetah0033.ccs.ornl.gov.7910.0  bob     batch      -96036  R  Jul 23 21:51     8
    cheetah0033.ccs.ornl.gov.7911.0  gordon  batch      -96242  R  Jul 23 21:06    24
    cheetah0033.ccs.ornl.gov.7871.0  ernie   royalty     32289  R  Jul 23 18:33     1
    cheetah0033.ccs.ornl.gov.7886.0  oscar   No_Class   -69737  I  Jul 23 13:44     1
    cheetah0033.ccs.ornl.gov.7896.0  bob     batch      -90991  I  Jul 23 19:38    16

99
Job map qmap (coming soon)
  • Tue Jul 23 22:06:08 EDT 2002
  • Columns: Job, JobID, Username, Queue, JobName, N,
    CPUs, Req'd Time, Elap Time, St
  • A cheetah0033.7911.0 gordon batch 7911 24 192 02:00 0:59:54 R
  • B cheetah0033.7910.0 bob batch cheetah_0 8 64 01:00 0:14:10 R
  • C cheetah0033.7896.0 bob batch cheetah_0 0 01:00 -------- I
  • D cheetah0033.7886.0 oscar No_Class 7886 0 06:00 -------- I
  • E cheetah0033.7874.0 zoe batch cpld.p4 1 31 12:00 3:32:59 R
  • F cheetah0033.7871.0 ernie royalty B07.03 1 32 12:00 3:32:59 R
  • G cheetah0033.7869.0 snuffy batch 7869 1 3 12:00 3:33:11 R

100
Job map qmap
    Node                      8              16              24              32
    ---------------------------------------------------------------------------
    cheetah0001 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
    cheetah0017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    cheetah0033 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    cheetah0049 F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
    cheetah0065 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
    cheetah0081 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
    cheetah0097 E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E .
    cheetah0113 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
    cheetah0129 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    cheetah0145
    cheetah0161 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    cheetah0177 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    cheetah0193 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
    cheetah0209 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    cheetah0225 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
    cheetah0241 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
    cheetah0257 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

101
Job map qmap
                LPAR 1          LPAR 2          LPAR 3          LPAR 4
    ---------------------------------------------------------------------------
    cheetah1025 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
    cheetah1041 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
    cheetah1057 . . . . . . . . A A A A A A A A . . . . . . . . G G G . . . . .
    cheetah1073
    cheetah1089 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
    cheetah1105 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
    cheetah1121 A A A A A A A A A A A A A A A A . . . . . . . . A A A A A A A A
    cheetah1137 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
    ---------------------------------------------------------------------------
    Nodes           32-way  8-way  All    Key
    In use              10     25   35    . available CPU
    Idle                 8      3   11    - not_shared
    Down/drained         1      4    5    - down/drained
    Absent               0      0    0    @ absent
    Partially used       1      1    2

102
Why isn't my job running?
  • Resources are clearing for a big job
  • Impossible run conditions
  • More nodes than Cheetah has
  • Too many tasks per node
  • Excessive consumable resources
  • System components are down
  • LoadLeveler has a bug
  • llq -s job | more

103
Why isn't my job running?
  • llq -s job | sed -n '/SUMMARY/,/ANALYSIS/p'
  • Insufficient resources
  • Not enough appropriate nodes right now
  • Not enough appropriate nodes ever
  • Check reports for individual nodes
  • Dynamical constraints
  • Switch is down
  • LoadLeveler needs a kick

104
Insufficient resources
  • This LoadLeveler cluster does not have sufficient
    resources at the present time to run this job
    step.
  • Node is running a job: The state of the
    LoadL_startd daemon on this machine is "Busy".
  • Node doesn't meet requirements: The requirements
    expression of the job step evaluates to FALSE.
  • Node is down: The state of the LoadL_startd daemon
    on this machine is "Down".
  • Node doesn't run the class: class serf is not
    supported by this machine.

105
Why is my job held (H state)?
  • Your job tried to start but failed
  • Why?
  • DFS is down on a remote node
  • GPFS is down on a remote node
  • Your DFS space is full: fts lsquota

106
What nodes am I using?
  • Coming soon: qmap
  • llq -l
  • cheetah0033 llq -l cheetah0033.7911.0 | grep "gov"
  • Allocated Hosts : cheetah1122.ccs.ornl.gov::csss(1,MPI,US,12M),csss(2,MPI,US,12M),csss(3,MPI,US,12M),csss(4,MPI,US,12M),csss(5,MPI,US,12M),csss(6,MPI,US,12M),csss(7,MPI,US,12M),csss(8,MPI,US,12M)
  • cheetah1043.ccs.ornl.gov::csss(1,MPI,US,12M),csss(2,MPI,US,12M),csss(3,MPI,US,12M),csss(4,MPI,US,12M),csss(5,MPI,US,12M),csss(6,MPI,US,12M),csss(7,MPI,US,12M),csss(8,MPI,US,12M)
  • cheetah1089.ccs.ornl.gov::csss(1,MPI,US,12M),csss(2,MPI,US,12M),csss(3,MPI,US,12M),csss(4,MPI,US,12M),csss(5,MPI,US,12M),csss(6,MPI,US,12M),csss(7,MPI,US,12M),csss(8,MPI,US,12M)

107
Questions?
  • http://www.ccs.ornl.gov/Cheetah/LL.html
  • consult@ccs.ornl.gov

108
  • Tools on the IBM SP4 Cheetah

Christian Halloy and Kwai Wong
Joint Institute for Computational Science (JICS)
University of Tennessee / Oak Ridge National Laboratory
halloy@jics.utk.edu - wong@jics.utk.edu
http://www.jics.utk.edu
109
Tools
  • Scientific Libraries
  • System Information
  • Profilers
  • Debuggers

117
xmperf on 12 Power2 thin-nodes
122
Doing your own timings/profiling
  • Functions that can be used within your code to
    time/profile your program:
  • getrusage(): fills a structure with lots of
    information: user time, system time, memory
    footprint, context switches
  • gettimeofday(): elapsed time
  • rtc(): low-overhead elapsed-time timer for
    Fortran (xlf90)
  • MPI_Wtime(): MPI wall-clock timer

123
Examples
  • 3dmon
  • xmperf
  • mpcc -pg mpi-nn1.c -o mpi-nn1
  • poe mpi-nn1 -procs 2 -nodes 1 -rmpool 1, or use
    llsubmit exe.cmd
  • xprofiler mpi-nn1 -s gmon.out.0 gmon.out.1
  • The -s option combines the gmon.out.* files into one
    single gmon.sum file
  • mv gmon.sum gmon.out
  • gprof > gp.report
  • xprofiler mpi-nn1 gmon.out.1

124
Debuggers
  • xldb: IBM's (sequential) debugger; supports
    Fortran, C, and C++
  • pdbx: a command-line parallel debugger, an
    extension of dbx to parallel applications using a
    line-oriented interface and subcommands; supports
    most of the familiar dbx subcommands, as well as
    some additional pdbx commands
  • POE options and variables can be used
  • pdbx executable -procs N -nodes M -rmpool 1
  • TotalView
  • tv executable -procs N -nodes M -rmpool 1

125
Parallel Debugger pdbx
  • pdbx supports most of the familiar dbx
    subcommands, as well as some additional pdbx
    commands.
  • Located in /usr/bin/pdbx; type man pdbx for more
    information
  • Be sure to specify the -g flag when compiling
    the program: mpcc -g -o hello hello.c
  • POE options and environment variables can be
    used: pdbx hello -procs 4 -hostfile host.list
  • Useful commands:
  • cont: continue execution
  • where: find out where the program has halted
  • stop at line_number: create a breakpoint at
    line_number
  • stop in routine: create a breakpoint at the
    beginning of routine
  • print var_x var_y var_z: once halted, print out
    the values of variables

126
Parallel Debugger pdbx (contd)
  • Figuring out a deadlock situation:
  • control-C: after the code hangs, to stop code
    execution
  • halt: halts the debugger
  • where: find out where the program has stopped
  • Handling multiple tasks (or processes):
  • on 2: work only on task 2
  • on all: work on all tasks (default?)
  • group add even_group 0 2 4: defines a group
    called even_group
  • on even_group: work only on tasks within group
    even_group
  • Attaching a POE process to pdbx:
  • ps -ef | grep poe: get the PID of your poe process
  • pdbx -a PID
  • Other useful commands:
  • on 0 l 174,190: list program statements 174 to 190
  • status: list all breakpoints
  • delete 3: deletes breakpoint 3
  • return: continues up to the next return (end
    of subroutine)

127
The TotalView Debugger
  • TotalView is a multi-platform debugger
    (http://www.etnus.com) for serial/parallel,
    single/multi-threaded applications.
  • Lots of features and easy to use:
    /usr/local/com/toolworks/totalview/bin/totalview
  • To launch a session with your code, use
    /usr/local/bin/tv. Example:
  • kinit -f (enter your password)
  • tv a.out -procs 8 -nodes 2 -euilib us -rmpool 1
  • For more information on TotalView:
  • man totalview
  • man tv
  • http://www.jics.utk.edu/SP_ornl/totalview.html
  • http://www.ccs.ornl.gov/eagle/interactive.html

129
Totalview
130
Some Basics
  • State codes:
  • B: stopped at a breakpoint
  • E: stopped because of an error
  • H: in a hold state
  • I: idle
  • K: thread is executing within the kernel
  • M: mixed; some threads in a process are running
    and some are not
  • R: running
  • S: sleeping
  • T: thread is stopped
  • W: at a watchpoint
  • Z: process in zombie state
  • Left-click on a boxed line number: set a
    breakpoint
  • Right-click: dive to investigate
  • Group: executes the command on all processes

132
Attach Processes and Core Files
  • Start a program
  • llsubmit exe.cmd
  • Notice that the program is having problems
  • Check cheetah0033.xxxxx.out or cheetah0033.xxxxx.err
  • Start TotalView
  • Attach process: use
  • Root Window -> File -> New Program
  • Enter executable name: poe
  • Enter PID: use llq to get the running node, then ssh
    cheetahxxxx ps aux | grep poe
  • Enter remote host ID: cheetahxxxx
  • Detach process
  • Process Window -> Process -> Detach
  • Terminate process
  • llcancel cheetah0033.xxxxx.0
  • -----------------------------------------------------------------------------
  • Viewing core files: use
  • Root Window -> File -> New Program
  • Enter executable name (NOT poe)
  • Enter corefile name, coredir.1/xxxxx.xxxxxxx

133
Examples
  • kinit -f
  • mpcc -g deadlock.c -o deadlock
  • poe deadlock -procs 2 -nodes 1 -rmpool 1
  • CTRL-C
  • tv deadlock -procs 2 -nodes 1 -rmpool 1
  • -> Yes, -> Go, -> Halt, -> P
  • Find problem, reset dest, Go, Exit
  • -----------------------------------------------------------------------------
  • pdbx deadlock -procs 2 -nodes 1 -rmpool 1
  • cont, CTRL-C, halt, where
  • -----------------------------------------------------------------------------
  • poe deadlock -procs -nodes 1 -rmpool
  • tv
  • Attach process
  • View core files

134
Hardware Performance Monitor Toolkit and
Alternate Profiling
  • Mark R. Fahey

135
HPM
  • Hardware Performance Monitor toolkit
  • Unsupported suite from IBM
  • Developed for application performance measurement
    on Power 3 and Power 4 systems
  • Requires the PMAPI kernel extensions
  • 3 tools
  • hpmcount
  • libhpm
  • hpmviz

136
hpmcount
  • A simple stand-alone utility that provides
    summary utilization data for the entire run
  • Provides
  • Wall-clock time
  • Hardware-performance counter statistics
  • Utilization information
  • Supports both serial and parallel applications
    written in Fortran, C, and C++
  • Usage
  • poe hpmcount -h -o filename -s

137
hpmcount example
  • Example of hpmcount default usage
  • hpmcount matmul
  • adding counter 5 event 12 Cycles
  • adding counter 0 event 1 Instructions completed
  • adding counter 7 event 0 TLB misses
  • adding counter 2 event 9 Stores completed
  • adding counter 3 event 5 Loads completed
  • adding counter 4 event 5 FPU 0 instructions
  • adding counter 1 event 35 FPU 1 instructions
  • adding counter 6 event 9 FMAs executed
  • hpmcount (V 2.3.1) summary
  • Total execution time (wall clock time): 17.890108
    seconds

138
hpmcount example (cont.)
  • PM_CYC (Cycles): 6221755432
  • PM_INST_CMPL (Instructions completed): 16525717994
  • PM_TLB_MISS (TLB misses): 4193504
  • PM_ST_CMPL (Stores completed): 103816592
  • PM_LD_CMPL (Loads completed): 6805124281
  • PM_FPU0_CMPL (FPU 0 instructions): 4783352231
  • PM_FPU1_CMPL (FPU 1 instructions): 3253672191
  • PM_EXEC_FMA (FMAs executed): 8024880769
  • Utilization rate: 92.743 %
  • Avg number of loads per TLB miss: 1622.778
  • Load and store operations: 6908.941 M
  • Instructions per load/store: 2.392
  • MIPS: 923.735
  • Instructions per cycle: 2.656
  • HW Float points instructions per Cycle: 1.292
  • Floating point instructions + FMAs: 16061.905 M
  • Float point instructions + FMA rate: 897.809
    Mflip/s (approx. Mflop rate)
  • FMA percentage: 99.924 %

139
hpmcount
  • Various counter groups to choose from
  • 56 for info on loads, stores, L1 misses, TLB
    misses
  • 58 or 5 for info on L2, L3, and memory access
  • 60 (default) or 53 for basic floating-point op
    counts
  • List of groups in /usr/local/ibm/HPM_V2_4/doc/power4.ref
  • Example group
  • group 60: pm_hpmcount2, Hpmcount group for
    computation intensity
  • 84,v,g,PM_FPU_FDIV,FPU executed FDIV instruction
  • 83,v,g,PM_FPU_FMA,FPU executed multiply-add instruction
  • 22,v,g,PM_FPU0_FIN,FPU0 produced a result
  • 27,v,g,PM_FPU1_FIN,FPU1 produced a result
  • 82,v,g,PM_CYC,Processor cycles
  • 84,v,g,PM_FPU_STF,FPU executed store instruction
  • 78,c,g,PM_INST_CMPL,Instructions completed
  • 78,v,g,PM_LSU_LDF,LSU executed Floating Point load instruction

140
hpmcount
  • To look at net computational load imbalance, do
    something like
  • grep PM_FPU0_FIN outputfile | sort -n +1
  • Example
  • PM_FPU0_FIN (FPU0 produced a result): 4134125976812
  • PM_FPU0_FIN (FPU0 produced a result): 4134162960199
  • PM_FPU0_FIN (FPU0 produced a result): 4134172723274
  • PM_FPU0_FIN (FPU0 produced a result): 4134174409595
  • PM_FPU0_FIN (FPU0 produced a result): 4134186274629
  • PM_FPU0_FIN (FPU0 produced a result): 4134192087552
  • PM_FPU0_FIN (FPU0 produced a result): 4134204264318
  • PM_FPU0_FIN (FPU0 produced a result): 4134229750408
  • Note the even balance. (From LSMS, Gordon Bell
    winner)

141
libhpm.a
  • An interface for obtaining utilization statistics
    for certain regions of code
  • Stores data in two files
  • Plain text file that looks like hpmcount output
  • Another for use by hpmviz
  • If a large number of code regions are
    instrumented, then the visualization tool becomes
    very useful

142
libhpm.a Fortran Usage
  • declaration:
        #include "f_hpm.h"
  • use:
        call f_hpminit( taskID, "my program" )
        call f_hpmstart( 1, "Do Loop" )
        do
          call do_work()
          call f_hpmstart( 5, "computing meaning of life" )
          call do_more_work()
          call f_hpmstop( 5 )
        end do
        call f_hpmstop( 1 )
        call f_hpmterminate( taskID )

143
libhpm.a C and C++ Usage
  • declaration:
        #include "libhpm.h"
  • use:
        hpmInit( taskID, "my program" );
        hpmStart( 1, "outer call" );
        do_work();
        hpmStart( 2, "computing meaning of life" );
        do_more_work();
        hpmStop( 2 );
        hpmStop( 1 );
        hpmTerminate( taskID );

144
libhpm.a Thread Usage
  • !$OMP PARALLEL
  • !$OMP&PRIVATE (instID)
  •   instID = 30+omp_get_thread_num()
  •   call f_hpmtstart( instID, "computing meaning of life" )
  • !$OMP DO
  •   do ...
  •     do_work()
  •   end do
  •   call f_hpmtstop( instID )
  • !$OMP END PARALLEL
  • Note that instID should be a variable or number,
    not an expression

145
libhpm.a compiling and linking
  • HPM_DIR
  • HPM_INC = -I$(HPM_DIR)/include
  • HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
  • FFLAGS = -qsuffix=cpp=f
  • my.x: my.f
        $(FF) $(HPM_INC) $(FFLAGS) my.f $(HPM_LIB) -o my.x
  • The flag -qsuffix=cpp=f is only required for
    the compilation of Fortran programs with
    extension .f.

146
hpmviz
  • hpmviz is a graphical interface for visualization
    of the performance files (.viz) generated by
    libhpm
  • Usage
  • hpmviz

147
hpmviz screenshots
Only the f90 matmul function was instrumented
148
hpmviz more screenshots
149
hpmviz more screenshots
150
Derived Metrics
  • The HPM toolkit computes derived metrics,
    depending on the hardware events that are
    selected to be counted
  • The following is a list of selected derived
    metrics
  • Total time in user mode
  • User time cycles / processor frequency
  • Instructions per cycle
  • Instructions completed / Cycles
  • MIPS
  • Instructions completed / (1000000 * Wall clock time)

151
Derived metrics (cont.)
  • L1 cache hit rate
  • 100 * (1 - ((Load misses in L1 + Store misses in L1) /
    Total loads and stores))
  • L2 cache hit rate
  • 100 * (1 - ((Load misses in L2 + Store misses in L2) /
    Total L1 misses))
  • Memory bandwidth
  • Memory traffic / Wall clock time
  • Floating point plus FMA rate
  • (FPU 0 + FPU 1 + FMAs) / (1000000 * Wall clock time)

152
Alternate profiling
  • An IBMer has made his own MPI profiling
    libraries
  • libmpitrace.a: wrappers for low-overhead (1
    microsec per call) MPI elapsed-time measurements
  • libmpihpm.a: trace wrappers + Power 4 HPM
    counter data
  • libmpiprof.a: trace wrappers + elapsed-time
    call-graph for MPI routines (5 microsec per
    call)
  • I plan to eventually get them in /usr/local/lib
  • Not thread safe, so use in single-threaded apps
    or when only one thread makes MPI calls

153
libmpitrace.a
  • To use
  • just link with the library
  • Run the application normally
  • Creates mpi_profile.
  • To reduce number of output files
  • Set TRACE_SOME to yes or 1

154
libmpihpm.a
  • To use
  • Link with the library and -lpmapi
  • Choose a Power 4 counter group
  • export HPM_GROUP=5 (for example)
  • Run the code
  • Creates mpi_profile_group