Title: ORNL Power 4 Workshop July 24 and 25
1ORNL Power 4 Workshop, July 24 and 25
- Mark R. Fahey
- faheymr@ornl.gov
2Sponsored by
- Center for Computational Sciences (CCS)
- http://www.ccs.ornl.gov
- Joint Institute for Computational Science (JICS)
- http://www.jics.utk.edu
3Topics
- Introduction
- Announcements
- Agenda
- Speakers
4Introduction
- Goals
- Discuss issues relating to the Power 4 processor
and the 27 Power 4 node system at ORNL
- Learn the system's limitations and weaknesses
- Be made aware of all the available tools
- Become a more efficient user of the Power 4 system
5Announcements
- Bathrooms
- Parking
- Cafeteria
- Dinner
- Talk with Trey
6Agenda July 24
- Introduction/Overview 9:00 am
- Site basics 9:15 am
- Break 10:15 am
- Hardware Configuration 10:30 am
- Software 11:00 am
- Lunch
- LoadLeveler 1:00 pm
- Tools introduction 2:00 pm
- Break 3:00 pm
- Special look at HPM and PCT 3:15 pm
- Odds and Ends (maybe Thursday)
7Agenda July 25
- Benchmarks 8:30 am
- Linpack
- Early Evaluation Results
- Break 10:15 am
- Performance issues 10:30 am
- Lunch
- Advanced LoadLeveler 1:00 pm
- MPI Programming 1:45 pm
- Break 2:45 pm
- Short Case Studies 3:00 pm
- Libraries 3:45 pm
8Speakers/Contributors
- Rebecca Fahey
- Christian Halloy
- Trey White
- Kwai Wong
- Pat Worley
9Overview of the Center for Computational Sciences
- Cheetah Training Workshop
- July 24, 2002
- Trey White
10CCS Overview
- CCS systems
- Distributed Computing Environment (DCE)
- Secure shell, secure copy (ssh, scp)
- Distributed File Service (DFS)
- High-Performance Storage System (HPSS)
11CCS systems
- Bearcat - Basic login services
- Supercomputers
- Eagle (184-node IBM SP)
- Falcon (64-node AlphaServer SC40)
- Colt (16-node AlphaServer SC40)
- Cheetah (27 IBM p690s)
- DCE servers
- DFS servers and disks
- HPSS servers, disks, and tape robots
12Distributed Computing Environment (DCE)
- Directory services
- Central user database
- No /etc/passwd files to maintain
- User authentication
- Kerberos V
- User authorization
- DCE groups - systems, DFS, HPSS
- Access control lists (ACLs)
13DCE credentials
- Kerberos V, more or less
- Give your password, get a temporary license
- License to access DFS, HPSS
- Temporary? 30 days
- For long-running jobs that resubmit
- LoadLeveler passes on the license/credentials
14DCE credentials
- No credentials
- cheetah0033 klist
- No DCE identity available: No currently established network identity for this context exists (dce / sec)
- Kerberos Ticket Information:
- klist: No credentials cache file found (dce / krb) (ticket cache /opt/dcelocal/var/security/creds/dcecred_6d2c1b47)
- Credentials
- cheetah0033 klist
- DCE Identity Information:
- Global Principal: /.../dce.ccs.ornl.gov/trey
- Cell: 30f86b8a-0293-11d1-8bbe-02608ce8cceb /.../dce.ccs.ornl.gov
- Principal: 00009b9a-c5c7-21d2-ac00-02608ce8cceb trey
- Group: 0000000a-4f44-21d1-a701-02608ce8cceb staff
- Local Groups:
- 0000012f-4c96-21d1-a701-02608ce8cceb ccs
15How to lose credentials
- rsh without the right magic
- ssh using public/private key
- Reset/delete KRB5CCNAME environment variable
- Crash dceunixd
16DCE Control Program
- cheetah0033 dcecp
- dcecp> help
- The general format of all dcecp object operations is as follows:
- dcecp> <object> <verb> [<argument>] [<options>]
- In addition to all of the standard tcl commands, dcecp supports many commands to administer DCE objects. A dcecp object or task represents a DCE entity. All of the following dcecp objects and tasks require a verb:
- account cdscache emsconsumer link rpcgroup
- acl cdsclient emsevent log rpcprofile
- attrlist cell emsfilter name secval
- aud cellalias emslog object server
- audevents clearinghouse endpoint organization user
- audfilter clock group principal utc
- audtrail directory host registry uuid
- cds dts hostdata rpcentry xattrschema
- cdsalias ems keytab
- Miscellaneous commands perform specific functions. These commands take no verb:
- echo errtext login logout quit resolve shell
- To list all dcecp objects: dcecp> help -verbose
- To list all verbs an object supports: dcecp> <object> help
- To list all options for an object operation: dcecp> <object> help <verb>
- For verbose information on a dcecp object: dcecp> <object> help -verbose
17DCE Control Program
- dcecp> account help
- catalog: Returns the names of all accounts in the registry.
- create: Creates an account in the registry.
- delete: Deletes an account from the registry.
- generate: Generates a random password for an account in the registry.
- modify: Modifies an account in the registry.
- show: Returns the attributes of an account.
- help: Prints a summary of command-line options.
- operations: Returns a list of the valid operations for this command.
- dcecp> account help modify
- -acctvalid: Is the account valid and can it be logged into.
- -change: Specify attributes to change in an attribute list format.
- -client: Can the account principal be a client.
- -description: A general description of the account.
- -dupkey: Can tkts to the account be obtained via its TGT session key.
- -expdate: When the account expires.
- -forwardabletkt: Allow use of forwardable tickets by or for this principal.
- -goodsince: The time indicating when the account was good since.
- -group: The account's primary group name.
- -home: The filesystem directory the principal uses at login.
- -maxtktlife: The maximum ticket life for the account.
- -maxtktrenew: The maximum ticket renewal time.
18DCE Control Program
- dcecp> account show elmo
- acctvalid yes
- client yes
- created /.../dce.ccs.ornl.gov/00000027-40ef-21d2-a800-02608ce8cceb 1999-02-16-12:42:44.000-05:00I-----
- description Elmo Monster,ORNL-CCS,8652412103
- dupkey no
- expdate none
- forwardabletkt yes
- goodsince 1952-04-14-10:27:43.000-05:00I-----
- group staff
- home /dfs/home/elmo
- lastchange /.../dce.ccs.ornl.gov/elmo 2002-05-23-15:28:55.000-04:00I-----
- organization staff
- postdatedtkt no
- proxiabletkt no
- pwdvalid yes
- renewabletkt yes
- server yes
- shell /bin/ksh
- stdtgtauth yes
- usertouser no
19Changing your default shell
- ksh
- dcecp account modify user -shell /usr/bin/ksh
- tcsh
- dcecp account modify user -shell /usr/local/bin/tcsh
20Changing your password
- Don't use dcecp
- Plain old passwd works
- Propagates in a few minutes
21Secure shell
- Encrypts your whole session, including passwords
- ssh user@cheetah.ccs.ornl.gov
- Doesn't work?
- ssh -1 user@cheetah.ccs.ornl.gov
- ssh -2 user@cheetah.ccs.ornl.gov
- ssh -v user@cheetah.ccs.ornl.gov
- E-mail consult@ccs.ornl.gov
22Backspace problems
- Do you see ^? when you press backspace/delete?
- stty erase ^?
- Put this in .cshrc/.profile: stty erase ^? (literally)
- ^H is also common
23Secure copy
- Built on secure shell
- Options like cp
- scp -p -r user@host.gov:*.dat .
- Doesn't work?
- scp -oProtocol=1
- Output in .cshrc/.profile?
24scp and initialization output
- .cshrc/.profile must produce no output
- ksh (.profile)
- TTY=`/usr/bin/tty`
- if [ $? = 0 ]; then
- /usr/bin/echo "interactive stuff goes here"
- fi
- csh (.cshrc)
- ( /usr/bin/tty ) > /dev/null
- if ( $status == 0 ) then
- /usr/bin/echo "interactive stuff goes here"
- endif
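A more compact variant of the same guard, offered as a sketch rather than the site's recommended method: test whether standard input is a terminal with the POSIX `[ -t 0 ]` test instead of calling /usr/bin/tty. The function name is_interactive is made up for illustration.

```sh
# Hypothetical helper: succeed only when standard input is a terminal.
is_interactive() {
    [ -t 0 ]
}

# Keep all output inside the guard so scp sees a silent startup file.
if is_interactive; then
    echo "interactive stuff goes here"
fi
```

When the file is read by a non-interactive scp session, standard input is not a terminal, so nothing is printed.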
25Secure X-Windows
- Automatic tunneling of X-Windows through ssh
- Encrypted
- No xhost needed
- DISPLAY points to localhost
- Doesn't work?
- ssh -X user_at_cheetah.ccs.ornl.gov
26Insecure X-Windows
- On Cheetah
- export DISPLAY=host.gov:0.0
- setenv DISPLAY host.gov:0.0
- On your system
- xhost cheetah0033.ccs.ornl.gov
- Doesn't work?
- ssh from Cheetah to your system
- See hostname in "who am i"
- Don't type your password in the X session!
27Public/private key
- Gets on Cheetah, but with no DCE credentials
- ssh/scp from Cheetah
- Keep private key in private DFS area
28Distributed File Service (DFS)
- Globally accessible
- DCE/Kerberos security
- Tape backups 3 times a week
- Client and server caching
- Transparent movement of user filesets among
servers
- Access control lists (ACLs)
- Read-only replicas
- NFS-like performance
29DFS home directories
- 500MB by default
- public - Readable by others
- private - Readable by only you
- yesterday - 4AM daily read-only snapshot
- Easy backup recovery!
- bin - Executables
- @sys directories (expanded by DFS, not shell)
- Not in default path
- www - http://www.ccs.ornl.gov/user
30DFS quota
- df doesn't work
- fts
- man fts
- man fts_lsquota
- cheetah0033 fts lsquota user
- Fileset Name, Quota, Used, % Used, Aggregate
- us.user 500000 493813 98% 15493157/25574417 (LFS)
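The quota figures above can be turned into a percentage with a one-line calculation. This is a minimal sketch; the helper name quota_pct is hypothetical, and it assumes the Quota and Used columns are in the same units.

```sh
# Hypothetical helper: percentage of a fileset quota in use.
# Arguments: quota and usage, in the same units as the Quota and Used columns.
quota_pct() {
    awk -v quota="$1" -v used="$2" 'BEGIN { printf "%.1f\n", 100 * used / quota }'
}

# Values from the listing above: 493813 used out of a 500000 quota.
quota_pct 500000 493813
```

A result this close to 100 is a likely reason for failed writes to DFS.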
31DFS ACLs
- cheetah0033 dcecp -c acl show user
- mask_obj r-x---
- user_obj rwxcid
- user cell_admin rwxcid effective r-x---
- group_obj r-x---
- group dfs_admin rwxcid effective r-x---
- group backup_admin rwxcid effective r-x---
- other_obj r-x---
- any_other r-x---
- Confusing interaction with Unix permissions
- user_obj - user permissions
- mask_obj - group permissions
- DCE group permissions are ORed with group_obj's permissions
- other_obj - others' permissions
- any_other - permissions for no DCE credentials
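In the listing above, entries such as cell_admin are granted rwxcid but show an effective value of r-x---, which matches mask_obj. A small sketch of that masking, assuming the effective permission is simply the entry with every bit cleared that the mask does not allow (the helper name effective_perms is hypothetical):

```sh
# Hypothetical helper: apply a mask_obj string to an ACL entry string.
# A permission letter survives only where the mask has a letter in the same position.
effective_perms() {
    entry=$1
    mask=$2
    out=""
    while [ -n "$entry" ]; do
        e=${entry%"${entry#?}"}    # first character of the entry
        m=${mask%"${mask#?}"}      # first character of the mask
        if [ -z "$m" ] || [ "$m" = "-" ]; then
            out="${out}-"
        else
            out="${out}${e}"
        fi
        entry=${entry#?}
        mask=${mask#?}
    done
    printf '%s\n' "$out"
}

# cell_admin entry and mask_obj from the listing above.
effective_perms rwxcid r-x---
```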
32Why can't I write to DFS?
- No credentials?
- klist
- Exceeded quota?
- fts lsquota path
- Wrong ACLs?
- dcecp -c acl show path
- DFS is broken?
- E-mail consult@ccs.ornl.gov
33High-Performance Storage System (HPSS)
- Archival storage
- Tapes with a large disk cache
- TBs of space, no quota (yet?)
- DCE security
- Hierarchical Storage Interface (HSI)
- Works without passwords in batch scripts
- Maintenance Wednesday mornings!
34Hierarchical Storage Interface (HSI)
- Interactive: hsi
- Command line: hsi "command; command"
- Scripted: hsi "in script"
- Some people have .hsirc in DFS
- Want to change it? Ask for help.
35HSI
- Two copies/separate tapes by default
- put/get - like FTP
- ls, cd, mv, chmod, chgrp
- Only move if new: cput/cget
- Command summary: hsi help
- http://www.sdsc.edu/Storage/hsi/
36htar
- Coming soon!
- Store to/extract from tar file in HPSS
- Uses HPSS interface efficiently
- HPSS file is tar compatible
- With one extra file at the end
- Index file kept in HPSS disk cache
37htar examples
- Create an archive of local files in HPSS
- htar -cvf archive.tar localdir
- Get contents of an archive
- htar -tvf archive.tar
- Listing will include a file like this: /tmp/HTAR_CF_CHK_26452_1016481493
- Retrieve files from an archive
- htar -xvf archive.tar file1 file2
38More info
- http://www.ccs.ornl.gov/
- Click "CCS Computers"
- Click "Cheetah"
- http://www.ccs.ornl.gov/Cheetah.html
- consult@ccs.ornl.gov
39Hardware Configuration
The Center For
Computational Sciences
- Rebecca Fahey
- User and Applications Support
40Contents
- Hardware Overview
- File Systems
- Types of Nodes
- Recommendations for Jobs
- Node Design
- Cache
- Operations
- Additional Information
41Hardware Overview
- 27 IBM Power4 nodes
- Each physical node has 32 1.3 GHz Power4 processors (864 processors total)
- Over 4.5 Teraflops in the compute partition
42Overview Key Features
- Server on a chip -- IBM's POWER4 microprocessor is the first "server on a chip." It contains two 1.3 gigahertz processors, a high-bandwidth system switch, a large memory cache and I/O. Four chips combine to form an MCM. Four MCMs make up a node.
- Virtualization -- It can be operated with 32-way nodes or the 32-way nodes can be divided into as many as 16 "virtual" servers. Cheetah has 8 nodes, each divided into four 8-way LPARs.
43File Systems Home directories
- Location /dfs/home/
- Backups are performed daily and a copy is placed in /dfs/home/<user>/yesterday.
- Provides a small amount of permanent storage for files that are used frequently
- Ex. Personal scripts, executables, libraries
- Running jobs out of your home directory is not recommended because the access time is slow
- For more info, refer to the site-specific info presented earlier
44File Systems GPFS
- Location /tmp/gpfs750a/
- SYSTEM_USERDIR and SCRATCH point to this
directory
- Global file system -- Accessible by all compute
nodes
- Recommended place to run jobs because it provides
faster access
- Your directory is not backed up and the files are purged periodically
- At job conclusion, copy files to be preserved into HPSS. For info on HPSS, see the site-wide info presented earlier
45File Systems Node local
- 160 GB of disk space local to each node
- Accessible only through the NODE_JOBDIR
environment variable during a batch job
- All files are automatically purged at the
conclusion of the job
- Intended for temporary files
46Types of Nodes
- Login node (1)
- The login node is a 32-way node. 16 processors are reserved for login use. The other 16 are available to batch jobs.
- 32-way compute nodes (18) (Pool 32)
- 11 nodes have 32 GB of memory
- 5 nodes have 64 GB of memory
- 2 nodes have 128 GB with 101 GB currently usable
- 8-way LPARs (32) (Pool 8)
- Each has 8 processors with 8 GB of memory
- Note: You do not automatically have access to the memory on a node. For more than 1 segment, you must request resources (# @ resources = ConsumableMemory(1 gb)).
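As a quick consistency check, the node counts on this slide can be added up and compared with the 27 nodes and 864 processors quoted in the hardware overview. This sketch assumes each 8-way LPAR is one quarter of a 32-way node, as the key-features slide describes; the variable names are illustrative only.

```sh
# Processor counts per node type, taken from the slide above.
login_cpus=$((1 * 32))       # 1 login node, 32-way
pool32_cpus=$((18 * 32))     # 18 compute nodes, 32-way
pool8_cpus=$((32 * 8))       # 32 LPARs, 8-way

total_cpus=$((login_cpus + pool32_cpus + pool8_cpus))
total_nodes=$((1 + 18 + 32 / 4))   # four LPARs per physical node

echo "$total_nodes nodes, $total_cpus processors"
```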
47Types of Nodes 32-way Nodes
- 32-way Nodes
- Note: The Colony switch will be replaced with a Federation switch when the switch is available
- [Figure: 32-way nodes connected by the Dual Plane Colony switch]
48Types of Nodes LPARs
- 32-way Nodes and 8-way LPARs
- Note: The Colony switch will be replaced with a Federation switch when the switch is available
- [Figure: 32-way nodes and 8-way LPARs connected by the Dual Plane Colony switch]
49Recommendations all jobs
- Set MP_SHARED_MEMORY=yes to keep intra-node messages from going out to the switch.
- Use both planes of the Colony switch for your jobs by setting
- # @ network.MPI = csss,shared,US
- in your batch script.
- Set # @ node_usage = shared and use the # @ resources line to reserve node resources
50Recommendations Small Jobs
- Run small jobs on 32-way nodes
- To request a 32-processor node, include the following in your batch script
- # @ requirements = (Pool == 32)
- Reserve resources for your job with the # @ resources line in your batch script. For info, see the LoadLeveler section.
51Recommendations Large Jobs
- Run jobs that do most of their communication
within the node and minimal communication between
nodes on 32-way nodes
- Ex. MPI/OpenMP code running 1-8 MPI tasks, with each task spawning threads
- Run all heavy communication codes (MPI or LAPI) using more than 32 processors on LPARs (Pool 8) or run them using IP (# @ network.MPI = csss,shared,IP) on the 32-way nodes
52Node Design
- 1.3 GHz processor, 32 KB Level 1 cache
- Level 2 cache: 1440 KB total, 480 KB each
- Level 3 cache: 512 MB total, 128 MB each
- Four MCMs comprise one node of a Regatta system.
53Cache L1
- L1 cache
- 32 KB of data and 64 KB of instruction cache for
each processor
- Fetches 128 byte lines
- Uses FIFO replacement policy rather than touch.
Thus, blocking for cache reuse is not advisable.
- Uses eight pre-fetching streams. Use the -qhot, -qcache=auto, -qarch=pwr4, and -qtune=pwr4 compiler options to perform loop optimizations that improve cache use.
54Cache L2
- L2 cache
- Total of 1440 KB shared between the two
processors on a chip
- Uses a least-recently-touched replacement policy
- For applications with memory requirements
1GB/process, placement of processes on
processors may impact cache performance.
- If you are blocking for cache size, you should block for L2 cache. To leave it to the compiler, use -qarch=pwr4, -qtune=pwr4, -qcache=auto, and -qhot. Implied with -O4
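For hand blocking, a rough block size can be estimated from the L2 capacity. The sketch below is only a back-of-the-envelope calculation under stated assumptions: three square blocks of 8-byte values must fit at once (as in a blocked matrix multiply), and the whole 1440 KB is available to one processor. Since the L2 is shared by two processors, a smaller figure may be more realistic. The helper name block_dim is hypothetical.

```sh
# Hypothetical helper: largest block dimension b such that `nblocks` square
# b x b blocks of `elem_bytes`-byte elements fit in `cache_kb` KB of cache.
block_dim() {
    awk -v kb="$1" -v eb="$2" -v nb="$3" \
        'BEGIN { print int(sqrt(kb * 1024 / (nb * eb))) }'
}

# 1440 KB of L2, 8-byte elements, three blocks resident at once.
block_dim 1440 8 3
```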
55Cache L3
- L3 cache
- Total of 512 MB split into 4 chunks of 128 MB
- Shared between the eight processors on an MCM
- Access times vary depending on location of the
data to be retrieved
56Cache Performance
57Operations
- Each processor has
- 2 floating-point units
- 1 madd per cycle with a 6 cycle latency
- 72 registers serve both units. -qunroll may improve register use. Implied with -O2 and above
- 2 integer units
- 1 add per cycle with a 2 cycle latency
- Instructions may execute out of order and may
execute speculatively, but are tracked (200)
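The floating-point figures above give a simple peak-rate estimate. This is a sketch that assumes each multiply-add counts as two floating-point operations and that both units issue one madd every cycle; the helper name peak_gflops is made up for illustration.

```sh
# Hypothetical helper: peak GFlops = clock (GHz) x FP units x 2 flops per madd x processors.
peak_gflops() {
    awk -v ghz="$1" -v fpus="$2" -v procs="$3" \
        'BEGIN { printf "%.1f\n", ghz * fpus * 2 * procs }'
}

peak_gflops 1.3 2 1      # one 1.3 GHz processor
peak_gflops 1.3 2 864    # all 864 processors
```

Under these assumptions the whole machine comes out at about 4.49 TFlops, close to the roughly 4.5 Teraflops figure quoted in the hardware overview.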
58Additional Information
- A good source for additional information on
Regatta hardware is
- http://www.redbooks.ibm.com/pubs/pdfs/redbooks/sg247041.pdf
59Cheetah Software
- Cheetah Training Workshop
- July 24, 2002
- Trey White
60Cheetah software
- Fortran
- C/C++
- IBM libraries
- Other libraries
- Other tools
61IBM Fortran compiler
- One compiler, different default options
- All accept F77, F90, F95 syntax
- Limited Power4-specific code generation
- Re-entrant (thread-safe) versions: _r
- Assume fixed form, -qsave: xlf_r
- Assume free form, -qnosave: xlf90_r
- Assume free form, F95 compliance: xlf95_r
- IBM MPI paths and libraries: mp* (for example, mpxlf_r)
- GUI with option menus: xxlf
62Fortran optimization options
- Optimized: -g -O4 -qnoipa -qmaxmem=-1
- (-O4 = -O3 -qarch=auto -qtune=auto -qcache=auto -qhot -qipa)
- Results not bitwise identical? -qstrict
- Optimizer is breaking your code? -g -O -qmaxmem=-1
- Feeling lucky? -g -O5 -qmaxmem=-1
63More Fortran options
- Feeling really lucky? -qsmp
- OpenMP (requires _r): -qsmp=omp
- No array-statements with overlap? -qalias=noaryovrlp
- Get stack trace on signal: -qsigtrap
- Dump core with floating-point exceptions: -qflttrap=overflow:zerodivide:enable
- Turn on profiling: -pg
- Use .f90 suffix: -qsuffix=f=f90
64Fortran memory options
- Default, 256MB limit
- 2GB heap limit, 256MB stack limit: -bmaxdata:0x80000000
- Big heap, big stack -q64
- 64-bit MPI now supported with _r
- No vectorized intrinsics (yet)
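The hexadecimal -bmaxdata value is easier to read once converted. A minimal sketch, assuming the shell's arithmetic accepts hex constants (POSIX shells do) and that the 256MB default corresponds to one memory segment; the helper names are hypothetical.

```sh
# Hypothetical helpers: interpret a -bmaxdata value given in bytes (hex or decimal).
maxdata_gb() {
    echo $(( $1 / 1073741824 ))       # bytes -> whole GB
}
maxdata_segments() {
    echo $(( $1 / 268435456 ))        # bytes -> 256MB segments
}

maxdata_gb 0x80000000
maxdata_segments 0x80000000
```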
65Fortran data options
- Static variables are implicitly save (xlf_r default): -qsave
- Static variables are implicitly automatic (xlf90_r/xlf95_r default): -qnosave
- 8B REAL and 16B DOUBLE PRECISION: -qrealsize=8
- 8B REAL and 8B REAL*4: -qautodbl=dbl4
66Fortran environment variables
- Old-fashioned namelists
- export XLFRTEOPTS=namelist=old
- setenv XLFRTEOPTS namelist=old
- OpenMP standard variables
- OMP_NUM_THREADS
- OMP_SCHEDULE (DYNAMIC, GUIDED, STATIC)
67IBM C/C++ compiler
- Part of VisualAge C++ product
- Beta version, with limited Power4-specific code generation
- Re-entrant (thread-safe) versions: _r
- Strict ANSI C: xlc_r
- More lenient: cc_r
- Errors become warnings
- ANSI C++ (but no STL?): xlC_r
- IBM MPI paths and libraries: mpcc_r, mpCC_r
- but there is no CC_r (?)
68C/C++ optimization options
- Optimized: -g -O4 -qnoipa -qmaxmem=-1
- (-O4 = -O3 -qarch=auto -qtune=auto -qcache=auto -qipa)
- Results not bitwise identical? -qstrict
- Optimizer is breaking your code? -g -O -qmaxmem=-1
- Feeling lucky? (IPA is C only) -g -O5 -qmaxmem=-1
69More C/C++ options
- Feeling really lucky (C only)? -qsmp
- OpenMP (requires _r, C only): -qsmp=omp
- Dump core with floating-point exceptions: -qflttrap=overflow:zerodivide:enable
- -qsigtrap?
- Turn on profiling -pg
70C/C++ memory options
- Default, 256MB limit
- 2GB heap limit, 256MB stack limit: -bmaxdata:0x80000000
- Big heap, big stack -q64
- MPI requires _r
71gcc/g++?
- Available (32 bit)
- Efficient?
- Supported?
- Can use IBM MPI
72IBM libraries
- 32/64 bits chosen automatically
- Extended Scientific Subroutine Library (ESSL)
- PESSL
- MASS
73ESSL
- BLAS, linear algebra (not quite LAPACK), FFTs,
sorting, quadrature, random
- Used mostly for BLAS
- Tuned for Power4
- Sequential -lessl_r
- Threaded parallel -lesslsmp
- Call from sequential code
74PESSL
- Distributed-memory parallel
- Tuned for Power4
- BLACS, PBLAS, ScaLAPACK, FFTs
- Not newest ScaLAPACK
- Not thread safe (but thread tolerant)
- Distributed-memory only -lpessl
- And SMP parallel -lpesslsmp
75MASS
- -L/usr/local/lib
- Faster intrinsics, less accuracy, thread safe
-lmass
- Vectorized math functions
- IBM-specific calls
- Power3/Power4 -lmassv
- Power4-specific -lmassvp4
76Other libraries
- -I/usr/local/include -L/usr/local/lib
- -L/usr/local/lib64 for some
- Working on having combined 32 and 64 bit
- LAPACK, FFTW, more PACKs
- MPI BLACS, PBLAS, ScaLAPACK
- -lnetcdf (NCAR-like interface)
- 4B REAL: -L/usr/local/lib32/r4i4
- 8B REAL: -L/usr/local/lib32/r8i4
77Other tools
- /usr/local/bin in default path
- gmake, gtar
- TCL/Tk, Perl, Python
- Debuggers, performance analyzers
78Cheetah software
79LoadLeveler on Cheetah
- Cheetah Training Workshop
- July 24, 2002
- Trey White
80LoadLeveler on Cheetah
- Command files
- What not to do
- Scheduling
- Classes
- Machine status
- Controlling jobs
- Job status
81MPI command file
- # @ shell = /bin/ksh
- # @ job_type = parallel
- # @ network.MPI = csss,shared,US
- # @ output = $(host).$(jobid).out
- # @ error = $(host).$(jobid).err
- # @ wall_clock_limit = 30:00
- # @ tasks_per_node = 32
- # @ node = 2
- # @ queue
- pwd
- echo $LOADL_PROCESSOR_LIST
- export MP_SHARED_MEMORY=yes
- poe a.out
82OpenMP command file
- # @ shell = /bin/ksh
- # @ job_type = serial
- # @ output = $(host).$(jobid).out
- # @ error = $(host).$(jobid).err
- # @ wall_clock_limit = 30:00
- # @ resources = ConsumableCpus(8)
- # @ queue
- pwd
- echo $LOADL_PROCESSOR_LIST
- export OMP_NUM_THREADS=8
- a.out
83Hybrid MPI/OpenMP command file
- # @ shell = /bin/ksh
- # @ job_type = parallel
- # @ network.MPI = csss,shared,US
- # @ output = $(host).$(jobid).out
- # @ error = $(host).$(jobid).err
- # @ wall_clock_limit = 30:00
- # @ tasks_per_node = 4
- # @ node = 2
- # @ resources = ConsumableCpus(8)
- # @ queue
- pwd
- echo $LOADL_PROCESSOR_LIST
- export MP_SHARED_MEMORY=yes
- export OMP_NUM_THREADS=8
- poe a.out
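Before submitting a hybrid job it can help to check that the per-node reservation fits the node type. The sketch below assumes ConsumableCpus is counted per task, so a node must supply tasks_per_node times ConsumableCpus processors; the helper names are hypothetical and this is not a LoadLeveler command.

```sh
# Hypothetical helper: CPUs a job step reserves on each node.
# Arguments: tasks_per_node, ConsumableCpus per task.
cpus_per_node() {
    echo $(( $1 * $2 ))
}

# Succeed when the reservation fits on a node with $3 processors.
fits_on_node() {
    [ "$(cpus_per_node "$1" "$2")" -le "$3" ]
}

# Hybrid example above: 4 tasks per node, 8 CPUs per task.
cpus_per_node 4 8
fits_on_node 4 8 32 && echo "fits a 32-way node"
fits_on_node 4 8 8  || echo "too big for an 8-way LPAR"
```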
84Picking node size
- 32-processor nodes only
- # @ tasks_per_node = 32
- # @ tasks_per_node = 8
- # @ resources = ConsumableCpus(4)
- # @ requirements = (Pool == 32)
- 8-processor nodes only
- # @ requirements = (Pool == 8)
- Any available processors
- # @ total_tasks = n
- # @ blocking = unlimited
85Memory requirements
- # @ resources = ConsumableMemory(N gb)
- Units of kb, mb, gb, w
- Nodes of 8, 32, 64, 128(96) GB
- Enforced by IBM WorkLoad Manager (WLM)
- Default is ConsumableMemory(256 mb)
- Example large-memory OpenMP job
- # @ resources = ConsumableCpus(32) ConsumableMemory(64 gb)
- export OMP_NUM_THREADS=32
86What not to do
- X # @ node_usage = not_shared
- Use tasks_per_node, ConsumableCpus
- Unless required by complex topologies
- X # @ network.MPI = csss,not_shared,US
- Take the whole node or share the switch
- X Use multiple resource lines
- # @ resources = ConsumableCpus(4) (ignored!)
- # @ resources = ConsumableMemory(2 gb)
- # @ resources = ConsumableCpus(4) ConsumableMemory(2 gb)
87What not to do
- X End command file without newline
- Last command will be ignored
- csh only
- How to tell? tail
- cheetah0033 tail csh.ll
- # @ queue
- pwd
- echo $LOADL_PROCESSOR_LIST
- setenv MP_SHARED_MEMORY yes
- poe a.outcheetah0033
88What not to do
- X Forget ConsumableCpus with OpenMP
- Default value is 1
- WorkLoad Manager limits CPUs to 1 per task
- Threads per CPU = OMP_NUM_THREADS
- X Forget OMP_NUM_THREADS with OpenMP
- Default value is node CPU count
- Threads per CPU = tasks_per_node
- X poe processor options in batch jobs
- poe a.out -procs 16 -nodes 2
- Ignored!
89LoadLeveler scheduling
- FIFO with backfill
- Oldest job scheduled next
- All jobs have time limits
- Short jobs inserted into space-time holes
- Three waiting jobs may age at a time (I state)
- Additional jobs don't age (NQ state)
- No limit on running jobs
- Some classes have priority
- New jobs are already old
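The "space-time hole" idea can be illustrated with a toy test. This is not LoadLeveler's actual algorithm, only a sketch of the condition backfill relies on: a waiting job may start early if enough CPUs are free and its wall-clock limit ends before the oldest waiting job is due to start. All names and numbers here are made up.

```sh
# Toy check (hypothetical): can a job be slotted into a hole in the schedule?
# Arguments: job CPUs, job wall-clock limit (s), free CPUs, hole length (s).
can_backfill() {
    job_cpus=$1
    job_limit_s=$2
    free_cpus=$3
    hole_s=$4
    [ "$job_cpus" -le "$free_cpus" ] && [ "$job_limit_s" -le "$hole_s" ]
}

can_backfill 16 1800 32 3600 && echo "short job fits the hole"
can_backfill 16 7200 32 3600 || echo "job would delay the older job"
```

This is why accurate, short wall_clock_limit values tend to get small jobs running sooner.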
90LoadLeveler classes
- Default is batch
- 12-hour time limit
- No processor limit
- Interactive (poe) default is interactive
- 2-hour time limit
- No processor limit
- Half of login node is interactive only
- Submit short jobs to interactive
- # @ class = interactive
91llclass
- cheetah0033 llclass
- Name, MaxJobCPU (d+hh:mm:ss), MaxProcCPU (d+hh:mm:ss), Free Slots, Max Slots, Description
- interactive undefined undefined 190 688 Interactive POE jobs
- batch undefined undefined 190 688 Batch jobs
- No_Class undefined undefined 0 0
- royalty undefined undefined 190 688 Uppercrust jobs
- testing undefined undefined 32 32 testing
- sys undefined undefined 286 784 System administration
- "Maximum Slots" value of the class "No_Class" is constrained by the MAX_STARTERS limit(s).
- "Free Slots" values of the classes "No_Class", "interactive", "batch", "climate_prod", "sys" are constrained by the MAX_STARTERS limit(s).
92llclass -l
- cheetah0033 llclass -l batch
- Class batch
- Name: batch
- Priority: 0
- Exclude_Users:
- Include_Users:
- cheetah0033 llclass -l | egrep "Name|Wall_clock_limit"
- Name: interactive
- Wall_clock_limit: 02:05:00, undefined (7500 seconds, undefined)
- Name: batch
- Wall_clock_limit: 12:00:00, undefined (43200 seconds, undefined)
- Name: No_Class
- Wall_clock_limit: undefined, undefined
- Name: royalty
- Wall_clock_limit: 1+00:00:00, undefined (86400 seconds, undefined)
- Name: testing
- Wall_clock_limit: 1+00:00:00, undefined (86400 seconds, undefined)
- Name sys
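The second counts that llclass -l reports (7500, 43200, 86400) follow directly from the clock-style limits. A small sketch of the conversion, assuming limits are written as [days+][[hours:]minutes:]seconds; the helper name wall_clock_seconds is hypothetical.

```sh
# Hypothetical helper: convert a [d+][[hh:]mm:]ss limit to seconds.
wall_clock_seconds() {
    printf '%s\n' "$1" | awk -F'[+:]' '{
        n = NF
        s = $n
        m = (n >= 2) ? $(n - 1) : 0
        h = (n >= 3) ? $(n - 2) : 0
        d = (n >= 4) ? $(n - 3) : 0
        print ((d * 24 + h) * 60 + m) * 60 + s
    }'
}

wall_clock_seconds 02:05:00      # interactive class
wall_clock_seconds 12:00:00      # batch class
wall_clock_seconds 1+00:00:00    # one day
```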
93System status llstatus
- cheetah0033 llstatus
- Name, Schedd, InQ, Act, Startd, Run, LdAvg, Idle, Arch, OpSys
- cheetah0001.ccs.ornl.gov Avail 0 0 Run 30 30.03 9999 RS6000 AIX51
- cheetah0017.ccs.ornl.gov Avail 0 0 Run 30 31.08 9999 RS6000 AIX51
- cheetah0033.ccs.ornl.gov Avail 8 5 Idle 0 0.07 12 RS6000 AIX51
- cheetah0049.ccs.ornl.gov Avail 0 0 Busy 32 33.41 9999 RS6000 AIX51
- ...
- RS6000/AIX51: 58 machines, 8 jobs, 498 running
- Total Machines: 58 machines, 8 jobs, 498 running
- The Central Manager is defined on manx.ccs.ornl.gov
- The BACKFILL scheduler is in use
94System resources llstatus -R
- cheetah0033 llstatus -R
- Machine, Consumable Resource(Available, Total)
- cheetah0001.ccs.ornl.gov ConsumableCpus(24,32) ConsumableMemory(62.000 gb,64.000 gb)
- cheetah0017.ccs.ornl.gov ConsumableCpus(32,32) ConsumableMemory(64.000 gb,64.000 gb)
- cheetah0033.ccs.ornl.gov ConsumableCpus(16,16) ConsumableMemory(16.000 gb,16.000 gb)
- cheetah0049.ccs.ornl.gov ConsumableCpus(0,32) ConsumableMemory(24.000 gb,32.000 gb)
- ...
- cheetah1076.ccs.ornl.gov ConsumableCpus(8,8) ConsumableMemory(8.000 gb,8.000 gb)
- cheetah1089.ccs.ornl.gov ConsumableCpus(0,8) ConsumableMemory(6.000 gb,8.000 gb)
- ...
- cheetah1601.ccs.ornl.gov
- cheetah1617.ccs.ornl.gov
- manx.ccs.ornl.gov
- LoadL_startd daemons of machines with "" appended to their names are down.
95Controlling jobs
- Submit a job: llsubmit script
- Stopping a job: llcancel jobname
- Deletes queued jobs
- Halts running jobs
96Job status llq
- cheetah0033 llq
- Id, Owner, Submitted, ST, PRI, Class, Running On
- cheetah0033.7871.0 ernie 7/23 09:24 R 50 royalty cheetah0049
- cheetah0033.7869.0 snuffy 7/23 08:44 R 50 batch cheetah1060
- cheetah0033.7874.0 zoe 7/23 09:52 R 50 batch cheetah0097
- cheetah0033.7910.0 bob 7/23 21:02 R 50 batch cheetah0273
- cheetah0033.7911.0 gordon 7/23 21:06 R 50 batch cheetah1122
- cheetah0033.7886.0 oscar 7/23 13:44 I 50 No_Class
- cheetah0033.7896.0 bob 7/23 19:38 I 50 batch
97Job statuses
- Running R
- Starting ST
- Waiting to run, aging I (Idle)
- Waiting to run, not aging NQ (not queued)
- Held, not aging H
- Won't run until released
- Remove pending RP
98Job status llqn -a
- cheetah0033 llqn -a
- Job Id, Owner, Class, SysPrio, S, Date, Node
- cheetah0033.ccs.ornl.gov.7869.0 snuffy batch -51751 R Jul 23 18:33 1
- cheetah0033.ccs.ornl.gov.7874.0 zoe batch -55811 R Jul 23 18:33 1
- cheetah0033.ccs.ornl.gov.7910.0 bob batch -96036 R Jul 23 21:51 8
- cheetah0033.ccs.ornl.gov.7911.0 gordon batch -96242 R Jul 23 21:06 24
- cheetah0033.ccs.ornl.gov.7871.0 ernie royalty 32289 R Jul 23 18:33 1
- cheetah0033.ccs.ornl.gov.7886.0 oscar No_Class -69737 I Jul 23 13:44 1
- cheetah0033.ccs.ornl.gov.7896.0 bob batch -90991 I Jul 23 19:38 16
99Job map qmap (coming soon)
- Tue Jul 23 22:06:08 EDT 2002
- Job, JobID, Username, Queue, JobName, N, CPUs, Req'd Time, Elap Time, St
- A cheetah0033.7911.0 gordon batch 7911 24 192 02:00 0:59:54 R
- B cheetah0033.7910.0 bob batch cheetah_0 8 64 01:00 0:14:10 R
- C cheetah0033.7896.0 bob batch cheetah_0 0 01:00 -------- I
- D cheetah0033.7886.0 oscar No_Class 7886 0 06:00 -------- I
- E cheetah0033.7874.0 zoe batch cpld.p4 1 31 12:00 3:32:59 R
- F cheetah0033.7871.0 ernie royalty B07.03 1 32 12:00 3:32:59 R
- G cheetah0033.7869.0 snuffy batch 7869 1 3 12:00 3:33:11 R
100Job map qmap
- Node (32 CPU columns, marked at 8, 16, 24, 32)
- cheetah0001 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
- cheetah0017 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
- cheetah0033 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
- cheetah0049 F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
- cheetah0065 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
- cheetah0081 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
- cheetah0097 E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E .
- cheetah0113 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
- cheetah0129 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
- cheetah0145
- cheetah0161 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
- cheetah0177 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
- cheetah0193 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
- cheetah0209 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
- cheetah0225 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
- cheetah0241 B B B B B B B B - - - - - - - - - - - - - - - - - - - - - - - -
- cheetah0257 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
101Job map qmap
- Node (8 CPU columns each for LPAR 1, LPAR 2, LPAR 3, LPAR 4)
- cheetah1025 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
- cheetah1041 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
- cheetah1057 . . . . . . . . A A A A A A A A . . . . . . . . G G G . . . . .
- cheetah1073
- cheetah1089 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
- cheetah1105 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
- cheetah1121 A A A A A A A A A A A A A A A A . . . . . . . . A A A A A A A A
- cheetah1137 A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
- Nodes: 32-way, 8-way, All
- In use: 10, 25, 35
- Idle: 8, 3, 11
- Down/drained: 1, 4, 5
- Absent: 0, 0, 0
- Partially used: 1, 1, 2
- Key: "." available CPU, "-" not_shared, "@" absent (the symbol for down/drained did not survive in this transcript)
102Why isn't my job running?
- Resources are clearing for a big job
- Impossible run conditions
- More nodes than Cheetah has
- Too many tasks per node
- Excessive consumable resources
- System components are down
- LoadLeveler has a bug
- llq -s job | more
103Why isn't my job running?
- llq -s job | sed -n '/SUMMARY/,/ANALYSIS/p'
- Insufficient resources
- Not enough appropriate nodes right now
- Not enough appropriate nodes ever
- Check reports for individual nodes
- Dynamical constraints
- Switch is down
- LoadLeveler needs a kick
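The sed filter in the first bullet prints only the lines from the first match of one pattern through the next match of another. The sketch below shows that behavior on made-up text; the sample lines are a stand-in and do not reproduce real llq -s output, and the helper name section_between is hypothetical.

```sh
# Hypothetical helper: print the lines from the first one matching $1
# through the next one matching $2, reading standard input.
section_between() {
    sed -n "/$1/,/$2/p"
}

# Made-up stand-in for a long report; only the middle three lines are printed.
printf '%s\n' 'header' 'SUMMARY' 'not enough nodes' 'ANALYSIS' 'details' |
    section_between SUMMARY ANALYSIS
```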
104Insufficient resources
- This LoadLeveler cluster does not have sufficient resources at the present time to run this job step.
- Node is running a job: The state of the LoadL_startd daemon on this machine is "Busy".
- Node doesn't meet requirements: The requirements expression of the job step evaluates to FALSE.
- Node is down: The state of the LoadL_startd daemon on this machine is "Down".
- Node doesn't run the class: class serf is not supported by this machine.
105Why is my job held (H state)?
- Your job tried to start but failed
- Why?
- DFS is down on a remote node
- GPFS is down on a remote node
- Your DFS space is full: fts lsquota
106What nodes am I using?
- Coming soon: qmap
- llq -l
- cheetah0033 llq -l cheetah0033.7911.0 | grep "gov"
- Allocated Hosts: cheetah1122.ccs.ornl.gov::csss(1,MPI,US,12M),csss(2,MPI,US,12M),csss(3,MPI,US,12M),csss(4,MPI,US,12M),csss(5,MPI,US,12M),csss(6,MPI,US,12M),csss(7,MPI,US,12M),csss(8,MPI,US,12M)
- cheetah1043.ccs.ornl.gov::csss(1,MPI,US,12M),csss(2,MPI,US,12M),csss(3,MPI,US,12M),csss(4,MPI,US,12M),csss(5,MPI,US,12M),csss(6,MPI,US,12M),csss(7,MPI,US,12M),csss(8,MPI,US,12M)
- cheetah1089.ccs.ornl.gov::csss(1,MPI,US,12M),csss(2,MPI,US,12M),csss(3,MPI,US,12M),csss(4,MPI,US,12M),csss(5,MPI,US,12M),csss(6,MPI,US,12M),csss(7,MPI,US,12M),csss(8,MPI,US,12M)
107Questions?
- http://www.ccs.ornl.gov/Cheetah/LL.html
- consult_at_ccs.ornl.gov
108Tools on the IBM SP4 Cheetah
- Christian Halloy and Kwai Wong
- Joint Institute for Computational Science (JICS), University of Tennessee / Oak Ridge National Laboratory
- halloy_at_jics.utk.edu, wong_at_jics.utk.edu
- http://www.jics.utk.edu
109Tools
- Scientific Libraries
- System Information
- Profilers
- Debuggers
117xmperf on 12 Power2 thin-nodes
122Doing your own timings/profiling
- Functions that can be used within your code to
time/profile your program
- getrusage(): fills a structure with lots of information: user time, system time, memory footprint, context switches
- gettimeofday(): elapsed time
- rtc(): low-overhead elapsed-time timer for Fortran (-xlf90)
- MPI_Wtime(): MPI wall-clock timer
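As an illustration of the first two calls, here is a minimal C sketch that wraps gettimeofday() and getrusage() into helpers returning seconds. The helper names (wall_seconds, user_cpu_seconds, busy_work) are made up for this example and are not part of any Cheetah library; busy_work() only stands in for the region of code you would actually time.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Wall-clock time in seconds since the epoch, from gettimeofday(). */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
}

/* User-mode CPU time in seconds used so far by this process,
 * taken from the ru_utime field that getrusage() fills in. */
double user_cpu_seconds(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return (double)ru.ru_utime.tv_sec + 1.0e-6 * (double)ru.ru_utime.tv_usec;
}

/* Hypothetical workload: a partial harmonic sum, used only so that
 * there is something to time. */
double busy_work(long n)
{
    volatile double sum = 0.0;
    long i;
    for (i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}
```

Typical use is to call wall_seconds() and user_cpu_seconds() before and after the region of interest and subtract. A large gap between elapsed and user time usually points to waiting (I/O, communication, or a busy node) rather than computation.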
123Examples
- 3dmon
- xmperf
- mpcc -pg mpi-nn1.c -o mpi-nn1
- poe mpi-nn1 -procs 2 -nodes 1 -rmpool 1 (or use llsubmit exe.cmd)
- xprofiler mpi-nn1 -s gmon.out.0 gmon.out.1
- The -s option combines the gmon.out.* files into a single gmon.sum file
- mv gmon.sum gmon.out
- gprof > gp.report
- xprofiler mpi-nn1 gmon.out.1
124Debuggers
- xldb: IBM's (sequential) debugger; supports Fortran, C, C++
- pdbx: a command-line parallel debugger, an extension of dbx to parallel applications with a line-oriented interface; supports most of the familiar dbx subcommands, as well as some additional pdbx commands
- POE options and variables can be used
- pdbx executable -procs -nodes -rmpool 1
- TotalView
- tv executable -procs -nodes -rmpool 1
125Parallel Debugger pdbx
- pdbx supports most of the familiar dbx subcommands, as well as some additional pdbx commands.
- Located in /usr/bin/pdbx; type man pdbx for more information
- Be sure to specify the -g flag when compiling the program: mpcc -g -o hello hello.c
- POE options and environment variables can be used: pdbx hello -procs 4 -hostfile host.list
- Useful commands
- cont: continue execution
- where: find out where the program has halted
- stop at line_number: create a breakpoint at line_number
- stop at routine: create a breakpoint at the beginning of routine
- print var_x var_y var_z: once halted, print out the values of variables
126Parallel Debugger pdbx (contd)
- Figuring out a deadlock situation
- control-C: after the code hangs, to stop code execution
- halt: halts the debugger
- where: find out where the program has stopped
- Handling multiple tasks (or processes)
- on 2: work only on task 2
- on all: work on all tasks (default?)
- group add even_group 0 2 4: defines a group called even_group
- on even_group: work only on tasks within group even_group
- Attaching a POE process to pdbx
- ps -ef | grep poe: get the PID of your poe process
- pdbx -a PID
- Other useful commands
- on 0 l 174,190: list program statements 174-190
- status: list all breakpoints
- delete 3: deletes breakpoint 3
- return: continues up to the next return (end of subroutine)
127The TotalView Debugger
- TotalView is a multi-platform debugger (http://www.etnus.com) for serial/parallel, single/multi-threaded applications.
- Lots of features and easy to use: /usr/local/com/toolworks/totalview/bin/totalview
- To launch a session with your code, use /usr/local/bin/tv
- Example
- kinit -f (enter your password)
- tv a.out -procs 8 -nodes 2 -euilib us -rmpool 1
- For more information on TotalView
- man totalview
- man tv
- http://www.jics.utk.edu/SP_ornl/totalview.html
- http://www.ccs.ornl.gov/eagle/interactive.html
129Totalview
130Some Basics
- State Code
- B: stopped at a breakpoint
- E: stopped because of an error
- H: in a hold state
- I: idle
- K: thread is executing within the kernel
- M: mixed, some threads in a process are running and some are not
- R: running
- S: sleeping
- T: thread is stopped
- W: at a watchpoint
- Z: process in zombie state
- Left-click in a boxed line: set a breakpoint
- Right-click: dive to investigate
- Group: executes the command on all processes
132Attach Processes and Core Files
- Start a program
- llsubmit exe.cmd
- Notice that the program is having problems
- Check cheetah0033.xxxxx.out or cheetah0033.xxxxx.err
- Start TotalView
- Attach process:
- Root Window > File > New Program
- Enter executable name: poe
- Enter PID: use llq to find the node the job is running on, then ssh cheetahxxxx ps aux | grep poe
- Enter remote host ID: cheetahxxxx
- Detach process
- Process Window > Process > Detach
- Terminate process
- llcancel cheetah0033.xxxxx.0
- --------------------------------------------------------------------------
- Viewing core files:
- Root Window > File > New Program
- Enter executable name (NOT poe)
- Enter core file name: coredir.1/xxxxx.xxxxxxx
133Examples
- kinit -f
- mpcc -g deadlock.c -o deadlock
- poe deadlock -procs 2 -nodes 1 -rmpool 1
- CTRL-C
- tv deadlock -procs 2 -nodes 1 -rmpool 1
- ---> Yes, ---> Go, ---> Halt, ---> P
- Find problem, reset dest, Go, Exit
- --------------------------------------------------------------------------
- pdbx deadlock -procs 2 -nodes 1 -rmpool 1
- cont, CTRL-C, halt, where
- --------------------------------------------------------------------------
- poe deadlock -procs -nodes 1 -rmpool
- tv
- Attach process
- View core files
134Hardware Performance Monitor Toolkit and
Alternate Profiling
135HPM
- High Performance Monitoring toolkit
- Unsupported suite from IBM
- Developed for application performance measurement
on Power 3 and Power 4 systems
- Requires the PMAPI kernel extensions
- 3 tools
- hpmcount
- libhpm
- hpmviz
136hpmcount
- A simple stand-alone utility that provides
summary utilization data for the entire run
- Provides
- Wall-clock time
- Hardware-performance counter statistics
- Utilization information
- Supports both serial and parallel applications written in Fortran, C, and C++
- Usage
- poe hpmcount -h -o filename -s
137hpmcount example
- Example of hpmcount default usage
- hpmcount matmul
- adding counter 5 event 12 Cycles
- adding counter 0 event 1 Instructions completed
- adding counter 7 event 0 TLB misses
- adding counter 2 event 9 Stores completed
- adding counter 3 event 5 Loads completed
- adding counter 4 event 5 FPU 0 instructions
- adding counter 1 event 35 FPU 1 instructions
- adding counter 6 event 9 FMAs executed
- hpmcount (V 2.3.1) summary
- Total execution time (wall clock time): 17.890108 seconds
138hpmcount example (cont.)
- PM_CYC (Cycles): 6221755432
- PM_INST_CMPL (Instructions completed): 16525717994
- PM_TLB_MISS (TLB misses): 4193504
- PM_ST_CMPL (Stores completed): 103816592
- PM_LD_CMPL (Loads completed): 6805124281
- PM_FPU0_CMPL (FPU 0 instructions): 4783352231
- PM_FPU1_CMPL (FPU 1 instructions): 3253672191
- PM_EXEC_FMA (FMAs executed): 8024880769
- Utilization rate: 92.743 %
- Avg number of loads per TLB miss: 1622.778
- Load and store operations: 6908.941 M
- Instructions per load/store: 2.392
- MIPS: 923.735
- Instructions per cycle: 2.656
- HW Float points instructions per Cycle: 1.292
- Floating point instructions + FMAs: 16061.905 M
- Float point instructions + FMA rate: 897.809 Mflip/s (approx. Mflop rate)
- FMA percentage: 99.924 %
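The summary figures can be reproduced from the raw counters. The following C sketch is not hpmcount's own code; the struct and function names are made up for illustration, and the FMA-percentage formula (2 x FMAs over FPU 0 + FPU 1 + FMAs) is an inference that happens to reproduce the 99.924 shown above.

```c
#include <assert.h>

/* Raw counter values and wall-clock time, as printed by hpmcount. */
typedef struct {
    double wall_seconds;
    double cycles, instructions, tlb_misses;
    double stores, loads, fpu0, fpu1, fmas;
} hpm_counts;

/* Values from the matmul run shown in the example. */
static const hpm_counts matmul_run = {
    17.890108,
    6221755432.0, 16525717994.0, 4193504.0,
    103816592.0, 6805124281.0, 4783352231.0, 3253672191.0, 8024880769.0
};

double instr_per_cycle(const hpm_counts *c)
{
    assert(c->cycles > 0.0);
    return c->instructions / c->cycles;
}

double mips(const hpm_counts *c)
{
    return c->instructions / (1.0e6 * c->wall_seconds);
}

double loads_per_tlb_miss(const hpm_counts *c)
{
    return c->loads / c->tlb_misses;
}

double instr_per_load_store(const hpm_counts *c)
{
    return c->instructions / (c->loads + c->stores);
}

/* Floating-point instructions plus FMAs, in millions per second. */
double mflips(const hpm_counts *c)
{
    return (c->fpu0 + c->fpu1 + c->fmas) / (1.0e6 * c->wall_seconds);
}

/* Assumed formula: each FMA counts as two floating-point operations. */
double fma_percentage(const hpm_counts *c)
{
    return 100.0 * 2.0 * c->fmas / (c->fpu0 + c->fpu1 + c->fmas);
}
```

Calling these on matmul_run gives values that agree with the printed summary to the three decimals shown, which is a quick way to check that you are reading the counters the way the tool does.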
139hpmcount
- Various counter groups to choose from
- 56 for info on loads, stores, L1 misses, TLB
misses
- 58 or 5 for info on L2, L3, and memory access
- 60 (default) or 53 for basic floating-point op
counts
- List of groups in /usr/local/ibm/HPM_V2_4/doc/power4.ref
- Example group
- group 60 pm_hpmcount2, Hpmcount group for computation intensity
- 84,v,g,PM_FPU_FDIV,FPU executed FDIV instruction
- 83,v,g,PM_FPU_FMA,FPU executed multiply-add instruction
- 22,v,g,PM_FPU0_FIN,FPU0 produced a result
- 27,v,g,PM_FPU1_FIN,FPU1 produced a result
- 82,v,g,PM_CYC,Processor cycles
- 84,v,g,PM_FPU_STF,FPU executed store instruction
- 78,c,g,PM_INST_CMPL,Instructions completed
- 78,v,g,PM_LSU_LDF,LSU executed Floating Point load instruction
140hpmcount
- To look at net computational load imbalance, do something like
- grep PM_FPU0_FIN outputfile | sort -n +1
- Example
- PM_FPU0_FIN (FPU0 produced a result): 4134125976812
- PM_FPU0_FIN (FPU0 produced a result): 4134162960199
- PM_FPU0_FIN (FPU0 produced a result): 4134172723274
- PM_FPU0_FIN (FPU0 produced a result): 4134174409595
- PM_FPU0_FIN (FPU0 produced a result): 4134186274629
- PM_FPU0_FIN (FPU0 produced a result): 4134192087552
- PM_FPU0_FIN (FPU0 produced a result): 4134204264318
- PM_FPU0_FIN (FPU0 produced a result): 4134229750408
- Note the even balance. (From LSMS, Gordon Bell winner)
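To put a number on "even balance", one option is the spread between the busiest and the least busy task relative to the mean. The C sketch below is only an illustration: relative_spread is a made-up helper name, and this particular metric is one reasonable choice, not something the HPM toolkit reports.

```c
#include <assert.h>
#include <stddef.h>

/* PM_FPU0_FIN counts for the eight tasks listed in the example. */
static const double fpu0_fin[8] = {
    4134125976812.0, 4134162960199.0, 4134172723274.0, 4134174409595.0,
    4134186274629.0, 4134192087552.0, 4134204264318.0, 4134229750408.0
};

/* (max - min) / mean of the per-task counts; 0 means perfect balance. */
double relative_spread(const double *counts, size_t n)
{
    double min, max, sum;
    size_t i;

    assert(n > 0);
    min = max = sum = counts[0];
    for (i = 1; i < n; i++) {
        if (counts[i] < min) min = counts[i];
        if (counts[i] > max) max = counts[i];
        sum += counts[i];
    }
    return (max - min) / (sum / (double)n);
}
```

For the eight counts above the spread is a few parts in 100,000, so the floating-point work is essentially identical across tasks.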
141libhpm.a
- An interface for obtaining utilization statistics
for certain regions of code
- Stores data in two files
- Plain text file that looks like hpmcount output
- Another for use by hpmviz
- If a large number of code regions are
instrumented, then the visualization tool becomes
very useful
142libhpm.a Fortran Usage
- declaration: #include "f_hpm.h"
- use:
-   call f_hpminit( taskID, "my program" )
-   call f_hpmstart( 1, "Do Loop" )
-   do
-     call do_work()
-     call f_hpmstart( 5, "computing meaning of life" )
-     call do_more_work()
-     call f_hpmstop( 5 )
-   end do
-   call f_hpmstop( 1 )
-   call f_hpmterminate( taskID )
143libhpm.a C and C++ Usage
- declaration: #include "libhpm.h"
- use:
-   hpmInit( taskID, "my program" );
-   hpmStart( 1, "outer call" );
-   do_work();
-   hpmStart( 2, "computing meaning of life" );
-   do_more_work();
-   hpmStop( 2 );
-   hpmStop( 1 );
-   hpmTerminate( taskID );
144libhpm.a Thread Usage
- !$OMP PARALLEL
- !$OMP& PRIVATE (instID)
-   instID = 30+omp_get_thread_num()
-   call f_hpmtstart( instID, "computing meaning of life" )
- !$OMP DO
-   do ...
-     do_work()
-   end do
-   call f_hpmtstop( instID )
- !$OMP END PARALLEL
- Note that instID should be a variable or number, not an expression
145libhpm.a compiling and linking
- HPM_DIR = (path to the HPM installation)
- HPM_INC = -I$(HPM_DIR)/include
- HPM_LIB = -L$(HPM_DIR)/lib -lhpm_r -lpmapi -lm
- FFLAGS = -qsuffix=cpp=f
- my.x: my.f
-   $(FF) $(HPM_INC) $(FFLAGS) my.f $(HPM_LIB) -o my.x
- The flag -qsuffix=cpp=f is only required for the compilation of Fortran programs with extension .f.
146hpmviz
- hpmviz is a graphical interface for visualization
of the performance files (.viz) generated by
libhpm
- Usage
- hpmviz
147hpmviz screenshots
Only the f90 matmul function was instrumented
148hpmviz more screenshots
149hpmviz more screenshots
150Derived Metrics
- The HPM toolkit computes derived metrics,
depending on the hardware events that are
selected to be counted
- The following is a list of selected derived metrics
- Total time in user mode
- User time cycles/processor frequency
- Instructions per cycle
- Instructions completed/cycles
- MIPS
- Instructions completed / (1000000 * Wall clock time)
151Derived metrics (cont.)
- L1 cache hit rate
- 100 * (1 - ((Load misses in L1 + Store misses in L1) / Total Loads and Stores))
- L2 cache hit rate
- 100 * (1 - ((Load misses in L2 + Store misses in L2) / Total L1 misses))
- Memory bandwidth
- Memory traffic / Wall clock time
- Floating point plus FMA rate
- (FPU 0 + FPU 1 + FMAs) / (1000000 * Wall clock time)
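The hit-rate and rate formulas share the same shape, so they can be sketched as two small C helpers. The function names are made up for this example and the inputs in the usage note are illustrative numbers, not measurements from Cheetah.

```c
#include <assert.h>

/* 100 * (1 - (load misses + store misses) / total references).
 * For L1, total_refs is total loads and stores;
 * for L2, total_refs is total L1 misses. */
double cache_hit_rate(double load_misses, double store_misses,
                      double total_refs)
{
    assert(total_refs > 0.0);
    return 100.0 * (1.0 - (load_misses + store_misses) / total_refs);
}

/* events / (1000000 * wall clock time): millions of events per second.
 * Works for MIPS and for the floating point plus FMA rate alike. */
double millions_per_second(double events, double wall_seconds)
{
    assert(wall_seconds > 0.0);
    return events / (1.0e6 * wall_seconds);
}
```

For example, 30 load misses and 20 store misses out of 1000 references give a 95 percent hit rate, and 2.0e9 events in 4 seconds give a rate of 500 million per second.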
152Alternate profiling
- An IBMer has made his own MPI profiling libraries
- libmpitrace.a: wrappers for low-overhead (1 microsec per call) MPI elapsed-time measurements
- libmpihpm.a: trace wrappers + Power 4 HPM counter data
- libmpiprof.a: trace wrappers + elapsed-time call graph for MPI routines (5 microsec per call)
- I plan to eventually get them in /usr/local/lib
- Not thread safe, so use in single-threaded apps
or when only one thread makes MPI calls
153libmpitrace.a
- To use
- just link with the library
- Run the application normally
- Creates mpi_profile.
- To reduce number of output files
- Set TRACE_SOME to yes or 1
154libmpihpm.a
- To use
- Link with the library and -lpmapi
- Choose a Power 4 counter group
- export HPM_GROUP=5 (for example)
- Run the code
- Creates mpi_profile_group