Title: Computing Workshop for Users of NCAR's SCD machines
1. Computing Workshop for Users of NCAR's SCD machines
Christiane Jablonowski (cjablono@ucar.edu), NCAR ASP/SCD, 31 January 2006
ML: Mesa Lab, Chapman Room (video conference facilities); FL: EOL Atrium; CG1 3150
2. Overview
- Current machine architectures at NCAR (SCD)
- Some basics on parallel computing
- Batch queuing systems at NCAR
- GAU resources: how to obtain a GAU account
- Insights into GAU charges
- The Mass Storage System
- How to monitor the GAUs
- Some practical tips on benchmarks, debugging tools, restarts
- ???
3. Computer architectures
- SCD's machines are UNIX-based parallel computing architectures
- Two types:
  - Hybrid (shared and distributed memory) machines like bluesky (IBM Power4), bluevista (IBM Power5), lightning (AMD Opteron system running Linux)
  - Shared memory system like tempest (SGI, 128 CPUs), predominantly used for post-processing jobs
4. Parallel Programming
- Parallel machines require parallel programming techniques in the user application
- MPI (Message Passing Interface) for distributed memory systems; can also be used on shared memory systems
- OpenMP for shared memory systems
- Hybrid (MPI + OpenMP) programming technique common on the IBMs at NCAR
- Pure MPI parallelization is often the fastest option; the computational domain is split into pieces that communicate over the network (via messages)
- OpenMP: parallelization of (mostly) loops via compiler directives
- Parallelization provided in CAM/CCSM/WRF
5. Most common hybrid hardware architectures
- Combined shared and distributed memory architecture
- Shared-memory symmetric multiprocessor (SMP) nodes; processors on a node have direct access to memory
- Nodes are connected via the network (distributed memory)
6. MPI example
Processors communicate via messages
7. MPI Example
- Initialize and finalize MPI in your program via function/subroutine calls to the MPI library. Examples include MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize
- Example from the previous page in C notation (unoptimized)
- Important to note: such an operation (computing a global sum) is very common, therefore MPI provides a highly optimized function, a so-called reduction operation, MPI_Reduce(), that can replace the example above
8. Example domain decompositions for MPI
Each color represents a processor
9. OpenMP Example
- Parallel loops via compiler directives (here in Fortran notation)
- Before the program is called, set:
  setenv OMP_NUM_THREADS proc
- Add compiler directives in your code:
  !$OMP PARALLEL DO
  DO i = 1, n
    a(i) = b(i) + c(i)
  END DO
  !$OMP END PARALLEL DO
- (Fork-join diagram: master thread forks a team of threads, which join back into the master thread)
- Assume n = 1000 and proc = 4: the loop will be split into 4 threads that run in parallel with loop indices 1-250, 251-500, 501-750, 751-1000
10. SCD's machines
- Bluesky (web page)
  - Oldest machine at NCAR (2002)
  - Lots of user experience at NCAR, easy access to help
  - CAM/CCSM/WRF are set up for this architecture (Makefiles)
  - Batch queuing system: LoadLeveler; short interactive runs possible
  - Batch queues are listed under http://www.cisl.ucar.edu/computers/bluesky/queue.charge.html
  - Lots of additional software available, e.g. math libraries, graphics packages, Totalview debugger
11. SCD's machines
- Bluevista (web page)
  - Newest machine on the floor (Jan. 2006)
  - CAM/CCSM/WRF are (probably) set up for this architecture
  - Batch queuing system: LSF (Load Sharing Facility)
  - Queue names different from bluesky: premium, regular, economy, standby, debug, share
    http://www.cisl.ucar.edu/computers/bluevista/queue.charge.html
  - Some additional software available, e.g. math libraries, Totalview debugger
12. SCD's machines
- Lightning (web page)
  - Linux cluster
  - Compilers different from the IBMs: Portland Group or Pathscale
  - Batch queuing system: LSF
  - Same queue names as bluevista
  - Some support software
- Tempest (web page)
  - For data post-processing, with yet another batch queuing system: NQS
  - Lots of support software
  - Interactive use possible
13. Example of a LoadLeveler job script
Parallel job with 32 MPI processes, com_reg32 queue (32-way node):

# @ class            = com_reg32
# @ node             = 1
# @ tasks_per_node   = 32
# @ output           = out.$(jobid)
# @ error            = out.$(jobid)
# @ job_type         = parallel
# @ wall_clock_limit = 00:20:00
# @ network.MPI      = csss,not_shared,us
# @ node_usage       = not_shared
# @ account_no       = 54042108
# @ ja_report        = yes
# @ queue
setenv OMP_NUM_THREADS 1

- com_reg32: regular queue, 32 MPI processes per 32-way node
- Submit the job via: llsubmit job_script
14. Example of a LoadLeveler job script
Hybrid parallel job with 8 MPI processes and 4 OpenMP threads:

# @ class            = com_ec32
# @ node             = 1
# @ tasks_per_node   = 8
# @ output           = out.$(jobid)
# @ error            = out.$(jobid)
# @ job_type         = parallel
# @ wall_clock_limit = 00:20:00
# @ network.MPI      = csss,not_shared,us
# @ node_usage       = not_shared
# @ account_no       = 54042108
# @ ja_report        = yes
# @ queue
setenv OMP_NUM_THREADS 4

- com_ec32: economy queue, 8 MPI processes per 32-way node
- Submit the job via: llsubmit job_script
15. Example of an LSF job script (lightning)
Parallel job with 8 MPI processes (on 4 2-way nodes):

#!/bin/csh
#BSUB -a 'mpich_gm'          # select mpich_gm on lightning
#BSUB -P 54042108            # ASP project/account number
#BSUB -q regular             # regular queue
#BSUB -W 0:30                # wallclock limit 30 min
#BSUB -x                     # exclusive use (not shared)
#BSUB -n 8                   # 8 MPI processes (total)
#BSUB -R "span[ptile=2]"     # 2 MPI processes per node
#BSUB -o fvcore_amr.out.%J
#BSUB -e fvcore_amr.err.%J
#BSUB -J test0.path          # name of the job (listed in the SCD Portal)

mpirun.lsf -v ./dycore

Submit the job via: bsub < job_script
16. Example of an LSF job script (bluevista)
Parallel job with 8 MPI processes (on 1 8-way node):

#!/bin/csh
#BSUB -a poe                 # select poe on bluevista
#BSUB -P 54042108            # ASP project/account number
#BSUB -q economy             # economy queue
#BSUB -W 0:30                # wallclock limit 30 min
#BSUB -x                     # exclusive use (not shared)
#BSUB -n 8                   # 8 MPI processes (total)
#BSUB -R "span[ptile=8]"     # allows up to 8 MPI processes on a node
#BSUB -o fvcore_amr.out.%J
#BSUB -e fvcore_amr.err.%J
#BSUB -J test0.path          # name of the job (listed in the SCD Portal)

mpirun.lsf -v ./dycore

Submit the job via: bsub < job_script
17. More information on SCD's machines
- Web page: SCD's Support and Consulting services
- SCD's customer support: sometimes you even get help on the weekends or in the evenings
- Email: consult1@ucar.edu
- Phone: 303 497 1278
- Walk-in support at the Mesa Lab
- Check out SCD's Daily Bulletin (scheduled machine downtimes, etc.)
- Subscribe to the hpcstatus mailing list (short e-mails about machine status, system updates)
18. GAU resources
- ASP has a monthly allocation of 3850 GAUs (General Accounting Units)
- A GAU is a measure of compute time on the supercomputers maintained by NCAR's Scientific Computing Division (SCD): http://www.cisl.ucar.edu/
- Access to these machines requires:
  - an SCD login account (dbs@ucar.edu or 303-497-1225)
  - a GAU account (for ASP contact Maura, otherwise contact your division / apply for a university account)
  - an ssh environment
  - and a crypto card (for secure access)
- SCD contacts: Dick Valent, Mike Page (here today), Juli Rew, Siddhartha Gosh, Ginger Caldwell (GAUs)
19. GAU resources
- GAUs: a "use it or lose it" strategy
- In ASP: we share the resource among the ASP postdocs and graduate fellows
- Distribution is flexible and will be discussed occasionally, e.g. monthly, either via meetings or e-mail discussions: email asp-gau-users@asp.ucar.edu
- GAUs are also charged for:
  - storing files in the Mass Storage System (MSS)
  - file transfers from the MSS to other machines
20. ASP GAU account
- ASP GAU account number: 54042108 (also the project number)
- Needs to be specified in the batch job scripts
- The ASP account number is not your default account number
- Therefore everybody needs a second (default) GAU account:
  - divisional GAU account
  - so-called University account (small request form for 1500 GAUs: http://www.cisl.ucar.edu/resources/compServ.html); these GAUs do not expire every month, one-time allocation
- The second GAU account should be used for the accumulating MSS charges
  - automatic when using the CAM / CCSM MSS option
21. GAU charges on SCD's supercomputers
- You are charged GAUs for how much time you use a processor (on bluesky, bluevista, lightning, tempest)
- On bluesky, there are actually two formulas:
  - Shared-node usage:
    GAUs charged = CPU hours used × computer factor × class charging factor
  - Dedicated-node usage:
    GAUs charged = wallclock hours used × number of nodes used × number of processors in that node × computer factor × class charging factor

Slides on GAU charges modified from an earlier presentation by George Bryan, NCAR MMM
22. Number of nodes used and number of processors in that node
- Self-explanatory (?)
- Bluesky:
  - 76 8-way (processor) nodes
  - 25 32-way (processor) nodes
- Bluevista:
  - 78 8-way (processor) nodes
- Lightning:
  - 128 2-way (processor) nodes
23. CPU hours used and wallclock hours used
- A measure of how long you used a processor
- NOTE: this includes all time you were allocated the use of a processor, whether you actually used it or not
- Example: you used two 8-processor nodes on bluesky. The job started at 1:00 PM and finished at 2:30 PM. You are charged for 1.5 hrs
24. Computer factor
- A measure of how powerful a computer is:
  - Bluesky: 0.24
  - Bluevista: 0.5
  - Lightning: 0.34
- This levels the playing field
25. Class charging factor
- Tied to the queuing system: how quickly do you want your results, and how much are you willing to pay for it?
- Current setting on all SCD supercomputers:
  - Premium: 1.5 (highest priority, fastest turnaround)
  - Regular: 1.0
  - Economy: 0.5
  - Standby: 0.1 (lowest priority, slow turnaround)
26. Example
- Recall dedicated-node usage on bluesky:
  GAUs charged = wallclock hours used × number of nodes used × number of processors in that node × computer factor × class charging factor
- 1.5 hours using two 8-processor nodes
- Bluesky, regular queue:
  GAUs used = 1.5 × 2 × 8 × 0.24 × 1.0 = 5.76 GAUs
- In the premium queue, this would be 8.64 GAUs
- In the standby queue, this would be 0.576 GAUs
27. Recommendations: queuing systems
- Check the queue before you submit any job
- If the queue is not busy, try using the standby or economy queues
- The queues tend to be emptier evenings, weekends, and holidays
- A job will start sooner when you specify a wallclock limit in the job script (the scheduler tries to squeeze in short jobs)
- The fewer processors you request, the sooner you start
- Use the premium queue sparingly:
  - Short debug jobs (there is also a special debug queue on lightning)
  - When that conference paper is due
28. Recommendations: number of processors vs. run times
- If you use more processors, you might wait longer in the queue, but usually the actual runtime of your job is reduced
- Caveat: it usually costs more GAUs
- Example: you run the same job, but
  - using 8 processors, the job ran in 24 hours
  - using 64 processors, the job ran in 4 hours
  - the 1st example used 46 GAUs
  - the 2nd example used 61 GAUs
29. The Mass Storage System
- MSS: Mass Storage System (disks and cartridges) for your big data sets
- The MSS is connected to the SCD machines, sometimes also to divisional computers
- MSS users have directories like mss/LOGIN_NAME/
- Quick online reference (MSS commands): http://www.cisl.ucar.edu/docs/mss/mss-commandlist.html
- You are charged GAUs for using the MSS
- The GAU equation for the MSS is more complicated ....
30. MSS Charges
- GAUs charged = 0.0837 × R + 0.0012 × A + N × (0.1195 × W + 0.2050 × S)
- where:
  - R = gigabytes read
  - W = gigabytes created or written
  - A = number of disk drive or tape cartridge accesses
  - S = data stored, in gigabyte-years
  - N = number of copies of the file: 1 if economy reliability is selected, 2 if standard reliability is selected
31. Recommendations: the MSS
- MSS charges seem small, but they add up!
- Examples (FY04 MSS usage):
  - ACD: 24,000 of 60,000 GAUs
  - CGD: 94,500 of 181,000 GAUs
  - HAO: 22,000 of 122,000 GAUs
  - MMM: 34,000 of 139,000 GAUs
  - RAP: 32,000 of 35,000 GAUs
32. Recommendations: the MSS
- Recommendation for ASP users:
  - use an account in your home division or your so-called university account (1500 GAUs for postdocs, you need to apply) for MSS charges
  - leave ASP GAUs for supercomputing
33. GAU usage strategy: 30-day and 90-day averages
- The allocation actually works through 30-day and 90-day averages
- Limits: 120% for 30-day use, 105% for 90-day use
- It is helpful to spread usage out evenly
- How to check GAU usage:
  - Type charges on the command line of a supercomputer
  - Check the daily summary output (next page)
  - SCD Portal: look for the link on SCD's main page http://www.cisl.ucar.edu/
34. Web page: http://www.cisl.ucar.edu/dbsg/dbs/ASP/
Example daily summary output:

  ASP 30 Day Percent:  57.0      ASP 90 Day Percent:  48.3
  30 Day Allocation:   3850      90 Day Allocation:   11550
  30 Day Use:          2193      90 Day Use:          5575
  90 DAY ST: 01-NOV-05    30 DAY ST: 31-DEC-05    LAST DAY: 29-JAN-06

  ASP GAUs Used by Day:
  01-NOV-05      9.36
  03-NOV-05       .03
  04-NOV-05    141.45
  22-JAN-06       .04
  23-JAN-06     44.29
  24-JAN-06    170.83
  25-JAN-06    120.30
  26-JAN-06     91.67
  27-JAN-06     41.97
  28-JAN-06     15.59
  29-JAN-06     16.95
35. What happens when we use too many GAUs?
- Your jobs will be thrown into a very low priority: the dreaded hold queue
- It will be hard to get work done
- But jobs will still run
- ASP users: you can use more than 3850 GAUs / month
- Experience says it's better to use too many than not enough
36. What happens when we use too many/too few GAUs?
- Too many:
  - Recommendation: when the 30- and 90-day averages are running high, use the economy or standby queue ... conserve GAUs
  - But don't worry about going over
- Too few:
  - ASP's allocation will be cut in the long run if the 3850 GAUs per month allocation is not used
37. How to catch up when behind
- Be wasteful:
  - Use the premium queue
  - Use more processors than you need
- Have fun:
  - Try something you always wanted to do, but never had the resources
38. How to conserve GAUs
- Be frugal:
  - Use the economy and standby queues
  - Use fewer processors
  - Use divisional GAUs (if possible) or your university GAU account
39. How to share and monitor GAUs in ASP
- Communicate!
- Occasionally, we (ASP postdocs) use the e-mail list asp-gau-users@asp.ucar.edu
  - to announce a busy GAU period
- Keep watching the ASP GAU usage on the webpage http://www.cisl.ucar.edu/dbsg/dbs/ASP/ or in the SCD Portal
- Look for the SCD Portal link on the SCD page http://www.cisl.ucar.edu/
40. SCD Portal
- Online tool that helps you monitor the GAU charges and the current machine status (e.g. batch queues); the display can be customized
- Information on the machine status requires a setup command on roy.scd.ucar.edu via crypto-card access: just enter scdportalkey hostname (e.g. lightning) after logging on with the crypto card
- At this time (Jan/31/2006) the GAU charges on bluevista are not itemized; this will be included in the next release in Spring 2006
41. Other IBM resources
- Sources of information on the IBM machine bluesky (from the command line); batchview also works on bluevista and lightning:
  - batchview: overview of jobs with their rankings
  - llq: list of all submitted jobs, no ranking
  - spinfo: queue limits, memory quotas on the home file system and the temporary file system /ptmp
- Useful IBM LoadLeveler keywords in the script:
  - @ account_no = 54042108  -> ASP account
  - @ ja_report = yes        -> job report (see example on the next page)
- Useful LoadLeveler commands: llsubmit script_file, llcancel job_id
42. Example IBM Job Report
- If selected, one email per job is sent to you at midnight. Output on the IBM machines, here blackforest (meanwhile decommissioned):

  Job Accounting - Summary Report
  Operating System:               blackforest AIX51
  User Name (ID):                 cjablono (7568)
  Group Name (ID):                ncar (100)
  Account Name:                   54042108
  Job Name:                       bf0913en.26921
  Job Sequence Number:            bf0913en.26921
  Job Starts:                     12/20/04 17:56:33
  Job Ends:                       12/20/04 23:26:34
  Elapsed Time (Wall-Clock CPU):  633632 s
  Number of Nodes (not_shared):   8
  Number of CPUs:                 32
  Number of Steps:                1
43. IBM Job Report (continued)

  Charge Components:
  Wall-clock Time:                      5:30:01
  Wall-clock CPU hours:                 176.00889 hrs
  Multiplier for com_ec Queue:          0.50
  Charge before Computer Factor:        88.00444 GAUs
  Multiplier for computer blackforest:  0.10
  Charged against Allocation:           8.80044 GAUs
  Project GAUs Allocated:               5000.00 GAUs
  Project GAUs Used, as of 12/16/04:    1889.20 GAUs
  Division GAUs 30-Day Average:         103.3
  Division GAUs 90-Day Average:         58.6
44. How to increase efficiency
- Get a feel for the GAUs: for long jobs, benchmark the application on the target machine
- Run a short but relevant test problem and measure the run time (wall-clock time) via MPI commands (function MPI_WTIME) or UNIX timing commands like time or timex (output formats are shell-dependent)
- Vary the number of processors to assess the scaling
- If the application scales poorly, avoid using a large number of processors (a waste of GAUs); instead use a smaller number with numerous restarts
- Make sure your job fits into the queue (finishes before the max. time is up)
- Use compiler options, especially the optimization options
- In case of programming problems the Totalview debugger can save you days, weeks or even months; on the IBMs, compile your program with the compiler options -g -qfullpath -d
45. Restarts
- Restart files are important for long simulations
- Queue limits are up to 6 wallclock hours (hard limit, the job fails afterwards); then a restart becomes necessary
- Get information on the queue limits (SCD web page) and select the job's integration time accordingly
- Restarts are built into CAM/CCSM/WRF and must only be activated
- Restarts for other user applications must probably be programmed
46. Questions?