Title: High Performance Computing and Trends, Enhancing Performance, Measurement Tools
1. High Performance Computing and Trends, Enhancing Performance, Measurement Tools
Facultad de Ciencias, Universidad de Los Andes, La Hechicera, Mérida, Venezuela. December 3-5, 2001
- Jack Dongarra
- Innovative Computing Laboratory
- University of Tennessee
- http://www.cs.utk.edu/~dongarra/
2. Overview
- High Performance Computing
- ATLAS
- PAPI
- NetSolve
- Grid Experiments
3. Technology Trends: Microprocessor Capacity
- 2X transistors/chip every 1.5 years, called Moore's Law
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
- Microprocessors have become smaller, denser, and more powerful.
4. Moore's Law
5. TOP500 (H. Meuer, H. Simon, E. Strohmaier, J. Dongarra)
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark (Ax = b, dense problem)
- Updated twice a year: at SCxy in the States in November, and at the meeting in Mannheim, Germany in June
- All data available from www.top500.org
(Chart axis labels: TPP performance, rate, size.)
6. In 1980 a computation that took 1 full year to complete can now be done in 10 hours!
7. In 1980 a computation that took 1 full year to complete can now be done in 16 minutes!
8. In 1980 a computation that took 1 full year to complete can today be done in 27 seconds!
9. Top 10 Machines (November 2001)
10. Performance Development
(Chart annotations: top system at 7.2 Tflop/s, entry level at 94 Gflop/s, Schwab at #24; 394 systems > 100 Gflop/s; growth faster than Moore's law.)
11. Performance Development
- My Laptop
- Entry: 1 Tflop/s in 2005 and 1 Pflop/s in 2010
12. Petaflop (10^15 flop/s) Computers Within the Next Decade
- Five basic design points:
- Conventional technologies
  - 4.8 GHz processor, 8000 nodes, each with 16 processors
- Processing-in-memory (PIM) designs
  - Reduce the memory access bottleneck
- Superconducting processor technologies
  - Digital superconductor technology, Rapid Single-Flux-Quantum (RSFQ) logic, hybrid technology multi-threaded (HTMT)
- Special-purpose hardware designs
  - Specific applications, e.g. the GRAPE Project in Japan for gravitational force computations
- Schemes utilizing the aggregate computing power of processors distributed on the web
  - SETI@home: ~26 Tflop/s
13. Petaflops (10^15 flop/s) Computer Today?
- 1 GHz processors (O(10^9) ops/s each)
- 1 million PCs
- $1B ($1K each)
- 100 MWatts
- 5 acres
- 1 million Windows licenses!!
- A PC failure every second
14. Architectures
- Constellation: # processors/node >= # nodes (p/n >= n)
15. Chip Technology
16. Manufacturer
- IBM 32%, HP 30%, SGI 8%, Cray 8%, Sun 6%, Fujitsu 4%, NEC 3%, Hitachi 3%
17. Cumulative Performance, Nov 2001
- Total of all 500 systems: 134.9 TF/s
18. High-Performance Computing Directions: Beowulf-class PC Clusters
Definition:
- COTS PC nodes
  - Pentium, Alpha, PowerPC, SMP
- COTS LAN/SAN interconnect
  - Ethernet, Myrinet, Giganet, ATM
- Open source Unix
  - Linux, BSD
- Message passing computing
  - MPI, PVM
  - HPF
Advantages:
- Best price-performance
- Low entry-level cost
- Just-in-place configuration
- Vendor invulnerable
- Scalable
- Rapid technology tracking
Enabled by PC hardware, networks, and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message passing libraries. However, much more of a contact sport.
19.
- Peak performance
- Interconnection
- http://clusters.top500.org
- Benchmark results to follow in the coming months
20. Where Does the Performance Go? or: Why Should I Care About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
(Chart, 1980-2000, performance on a log scale: µProc improves 60%/yr (2X/1.5yr, "Moore's Law"); DRAM improves 9%/yr (2X/10 yrs); the gap between CPU and DRAM widens over time.)
21. Optimizing Computation and Memory Use
- Computational optimizations
  - Theoretical peak = (# FPUs) x (flops/cycle) x MHz
  - PIII: (1 FPU)(1 flop/cycle)(850 MHz) = 850 MFLOP/s
  - Athlon: (2 FPUs)(1 flop/cycle)(600 MHz) = 1200 MFLOP/s
  - Power3: (2 FPUs)(2 flops/cycle)(375 MHz) = 1500 MFLOP/s
- Operations like:
  - alpha = x^T y: 2 operands (16 bytes) needed for 2 flops; at 850 Mflop/s this requires 1700 MW/s of bandwidth
  - y = alpha*x + y: 3 operands (24 bytes) needed for 2 flops; at 850 Mflop/s this requires 2550 MW/s of bandwidth
- Memory optimization
  - Theoretical peak = (bus width) x (bus speed)
  - PIII: (32 bits)(133 MHz) = 532 MB/s = 66.5 MW/s
  - Athlon: (64 bits)(133 MHz) = 1064 MB/s = 133 MW/s
  - Power3: (128 bits)(100 MHz) = 1600 MB/s = 200 MW/s
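The slide's arithmetic can be reproduced in a few lines. A minimal C sketch (the formulas and figures come straight from the bullets above; MW/s here counts millions of 8-byte words per second, and the bandwidth demand is computed the way the slide computes it, operands per 2-flop operation times the Mflop rate):

```c
#include <assert.h>

/* Theoretical compute peak in MFLOP/s: (# FPUs) x (flops/cycle) x MHz. */
double peak_mflops(int fpus, int flops_per_cycle, double mhz) {
    return fpus * flops_per_cycle * mhz;
}

/* Theoretical memory peak: (bus bits) x (bus MHz) / 8 gives MB/s;
 * dividing by 8 again gives millions of 64-bit words per second (MW/s). */
double bus_mwords(int bus_bits, double bus_mhz) {
    return bus_bits * bus_mhz / 8.0 / 8.0;
}

/* Bandwidth demand as the slide computes it:
 * (operands per 2-flop operation) x (MFLOP rate), in MW/s. */
double needed_mwords(int operands, double mflops) {
    return operands * mflops;
}
```

Comparing `needed_mwords(2, 850)` (the dot product, 1700 MW/s) against `bus_mwords(32, 133)` (the PIII bus, 66.5 MW/s) makes the slide's point: the memory system delivers a tiny fraction of what the FPU could consume.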
22. Memory Hierarchy
- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
(Figure: the hierarchy from the processor outward, with approximate speed and size at each level.)

Level                           Speed (ns)                      Size (bytes)
Registers                       1s                              100s
On-chip cache                   10s                             Ks
Level 2 and 3 cache (SRAM)      100s                            Ms
Main memory (DRAM)              100s                            Ms
Distributed / remote cluster memory   100,000s (0.1s of ms)     Gs
Secondary storage (disk)        10,000,000s (10s of ms)         Gs
Tertiary storage (disk/tape)    10,000,000,000s (10s of sec)    Ts
23. Self-Adapting Numerical Software (SANS)
- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Operations like the BLAS require many man-hours per platform
  - Software lags far behind hardware introduction
  - Only done if the financial incentive is there
- Hardware, compilers, and software have a large design space with many parameters
  - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
  - Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors
- Need for quick/dynamic deployment of optimized routines
- ATLAS: Automatically Tuned Linear Algebra Software
24. Software Generation Strategy - BLAS
- Parameter study of the hardware
- Generate multiple versions of code, with different values of key performance parameters
- Run and measure the performance of the various versions
- Pick the best and generate the library
- The Level 1 cache multiply optimizes for:
  - TLB access
  - L1 cache reuse
  - FP unit usage
  - Memory fetch
  - Register reuse
  - Loop overhead minimization
- Takes about 20 minutes to run
- A new model of high-performance programming where critical code is machine generated using parameter optimization
- Designed for RISC architectures
  - Superscalar
  - Needs a reasonable C compiler
- Today ATLAS is in use by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, ...
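The generate-and-measure loop can be illustrated with a toy version. This sketch (names hypothetical, not ATLAS code) times two hand-written dot-product kernels, one rolled and one unrolled by four, and picks the faster, which is the ATLAS strategy of "run, measure, pick best" in miniature:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Variant 0: straightforward loop. */
double dot_rolled(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Variant 1: same computation, unrolled by 4 (one tuning parameter). */
double dot_unrolled4(const double *x, const double *y, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]   * y[i];
        s1 += x[i+1] * y[i+1];
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
    }
    for (; i < n; i++)
        s0 += x[i] * y[i];
    return s0 + s1 + s2 + s3;
}

typedef double (*dot_fn)(const double *, const double *, size_t);

/* Time each candidate over `reps` runs; return the index of the fastest. */
int pick_best(dot_fn *cands, int ncands, const double *x, const double *y,
              size_t n, int reps, volatile double *sink) {
    int best = 0;
    double best_t = 0.0;
    for (int c = 0; c < ncands; c++) {
        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            *sink = cands[c](x, y, n);
        double t = (double)(clock() - t0);
        if (c == 0 || t < best_t) { best_t = t; best = c; }
    }
    return best;
}
```

ATLAS does this over a much larger space (blocking, fetch, register use), but the control flow is the same: every candidate is correct, so only measured speed decides.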
25. ATLAS (DGEMM n = 500), 64-bit floating point results
- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
26. Recursive Approach for Other Level 3 BLAS
Recursive TRMM
- Recur down to the L1 cache block size
- Need a kernel at the bottom of the recursion
- Use a GEMM-based kernel for portability
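The recursion pattern is easiest to see on plain matrix multiply (the slide's TRMM adds triangular bookkeeping, omitted here). In this hypothetical C sketch the matrices are split into quadrants until a block is small enough for the L1-sized base case, where a simple GEMM kernel finishes the work; `NB` stands in for the L1 block size and n is assumed to be a power of two:

```c
#include <assert.h>

#define NB 16  /* stand-in for the L1 cache block size */

/* Base-case GEMM kernel: C += A*B on an n x n block, row strides ld*. */
static void gemm_kernel(int n, const double *A, int lda,
                        const double *B, int ldb, double *C, int ldc) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = C[i*ldc + j];
            for (int k = 0; k < n; k++)
                s += A[i*lda + k] * B[k*ldb + j];
            C[i*ldc + j] = s;
        }
}

/* Recur on quadrants until the block reaches NB, then call the kernel. */
void gemm_rec(int n, const double *A, int lda,
              const double *B, int ldb, double *C, int ldc) {
    if (n <= NB) { gemm_kernel(n, A, lda, B, ldb, C, ldc); return; }
    int h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h*lda, *A22 = A + h*lda + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ldb, *B22 = B + h*ldb + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ldc, *C22 = C + h*ldc + h;
    gemm_rec(h, A11, lda, B11, ldb, C11, ldc);  /* C11 += A11*B11 */
    gemm_rec(h, A12, lda, B21, ldb, C11, ldc);  /* C11 += A12*B21 */
    gemm_rec(h, A11, lda, B12, ldb, C12, ldc);  /* C12 += A11*B12 */
    gemm_rec(h, A12, lda, B22, ldb, C12, ldc);  /* C12 += A12*B22 */
    gemm_rec(h, A21, lda, B11, ldb, C21, ldc);  /* C21 += A21*B11 */
    gemm_rec(h, A22, lda, B21, ldb, C21, ldc);  /* C21 += A22*B21 */
    gemm_rec(h, A21, lda, B12, ldb, C22, ldc);  /* C22 += A21*B12 */
    gemm_rec(h, A22, lda, B22, ldb, C22, ldc);  /* C22 += A22*B22 */
}
```

The appeal on the slide is exactly this structure: only the small base-case kernel needs tuning, and the recursion gives cache locality at every level for free.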
27. Intel PIII 933 MHz: MKL 5.0 vs ATLAS 3.2.0 using Windows 2000
- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
28. ATLAS Matrix Multiply (64-bit floating point results)
(Chart series: Intel IA-64; Intel P4, 64-bit fl. pt.; AMD Athlon.)
29. Pentium 4 - SSE2
- 1.5 GHz, 400 MHz system bus, 16 KB L1 / 256 KB L2 cache, theoretical peak of 1.5 Gflop/s, high power consumption
- Streaming SIMD Extensions 2 (SSE2)
  - consists of 144 new instructions
  - includes SIMD IEEE double precision floating point (vector instructions)
    - Peak for 64-bit floating point: 2X
    - Peak for 32-bit floating point: 4X
  - SIMD 128-bit integer
  - new cache and memory management instructions
- Intel's compiler supports these instructions today
30. ATLAS Matrix Multiply: Intel Pentium 4 using SSE2
(Chart series: P4 32-bit fl. pt. using SSE2; P4 64-bit fl. pt. using SSE2; P4 64-bit fl. pt.)
- $250/processor, <$1000 for the system, so <$0.50/Mflops!!
31. (No Transcript)
32. Machine-Assisted Application Development and Adaptation
- Communication libraries
  - Optimize for the specifics of one's configuration.
- Algorithm layout and implementation
  - Look at the different ways to express the implementation.
33. Work in Progress: ATLAS-like Approach Applied to Broadcast (PII 8-way cluster with 100 Mb/s switched network)
34. Reformulating/Rearranging/Reuse
- Example: the reduction to narrow band form for the SVD
- Fetch each entry of A once
- Restructure and combine operations
- Results in a speedup of > 30%
35. Conjugate Gradient Variants by Dynamic Selection at Run Time
- Variants combine inner products to reduce the communication bottleneck, at the expense of more scalar ops.
- Same number of iterations; no advantage on a sequential processor.
- With a large number of processors and a high-latency network it may be advantageous.
- Improvements can range from 15% to 50% depending on size.
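The combining trick can be shown without MPI. In a parallel CG each loop below would end in a global reduction, so fusing the two inner products replaces two latency-bound allreduce calls with one combined reduction of the pair (the extra scalar work the slide mentions comes from restructured recurrences, not shown). A minimal sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Two separate inner products: in parallel CG, two reductions. */
double dot(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Fused version: one pass over the data, and in parallel a single
 * combined reduction of the pair (rr, rz) instead of two. */
void fused_dots(const double *r, const double *z, size_t n,
                double *rr, double *rz) {
    double a = 0.0, b = 0.0;
    for (size_t i = 0; i < n; i++) {
        a += r[i] * r[i];
        b += r[i] * z[i];
    }
    *rr = a;
    *rz = b;
}
```

On one processor this saves almost nothing, matching the slide; on many processors with a slow network, halving the reduction count is where the 15-50% comes from.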
36. Conjugate Gradient Variants by Dynamic Selection at Run Time
(Same points as slide 35.)
37. Related Tuning Projects
- PHiPAC
  - Portable High Performance ANSI C; http://www.icsi.berkeley.edu/~bilmes/phipac; initial automatic GEMM generation project
- FFTW: Fastest Fourier Transform in the West
  - http://www.fftw.org
- UHFFT
  - tuning parallel FFT algorithms
  - http://rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
- SPIRAL
  - Signal Processing Algorithms Implementation Research for Adaptable Libraries; maps DSP algorithms to architectures
  - http://www.ece.cmu.edu/~spiral/
- Sparsity
  - Sparse-matrix-vector and sparse-matrix-matrix multiplication; http://www.cs.berkeley.edu/~ejim/publication/; tunes code to the sparsity structure of the matrix (more later in this tutorial)
- University of Tennessee
38. Tools for Performance Evaluation
- Timing and performance evaluation has been an art
  - Resolution of the clock
  - Issues about cache effects
  - Different systems
  - Can be cumbersome and inefficient with traditional tools
- The situation is about to change
  - Today's processors have internal counters
39. Performance Counters
- Almost all high performance processors include hardware performance counters.
- Some are easy to access, others are not available to users.
- On most platforms the APIs, if they exist, are not appropriate for the end user or are not well documented.
- Existing performance counter APIs:
  - Compaq Alpha EV6/67
  - SGI MIPS R10000
  - IBM Power series
  - Cray T3E
  - Sun Solaris
  - Pentium Linux and Windows
  - IA-64
  - HP PA-RISC
  - Hitachi
  - Fujitsu
  - NEC
40. Performance Data That May Be Available
- Pipeline stalls due to memory subsystem
- Pipeline stalls due to resource conflicts
- I/D cache misses for different levels
- Cache invalidations
- TLB misses
- TLB invalidations
- Cycle count
- Floating point instruction count
- Integer instruction count
- Instruction count
- Load/store count
- Branch taken / not taken count
- Branch mispredictions
41. Overview of PAPI
- Performance Application Programming Interface
- The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
42. Implementation
- Counters exist as a small set of registers that count events.
- PAPI provides three interfaces to the underlying counter hardware:
  - The low level interface manages hardware events in user defined groups called EventSets.
  - The high level interface simply provides the ability to start, stop, and read the counters for a specified list of events.
  - Graphical tools to visualize the information.
43. Low Level API
- Increased efficiency and functionality over the high level PAPI interface
- There are about 40 functions
- Obtains information about the executable and the hardware
- Thread safe
44. High Level API
- Meant for application programmers wanting coarse-grained measurements
- Calls the low level API
- Not thread safe at the moment
- Only allows PAPI preset events
45. High Level Functions
- PAPI_flops()
- PAPI_num_counters()
  - Number of counters in the system
- PAPI_start_counters() / PAPI_stop_counters()
  - Enable counting of events and describe what to count
- PAPI_read_counters()
  - Returns event counts
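For flavor, the PAPI_flops() call sequence is shown in the comment below. Since it needs libpapi and the hardware counters, the runnable part of this sketch is a stand-in that counts the flops by hand and times the kernel with clock(), producing the same kind of Mflop/s figure the high level API reports:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* With PAPI installed one would bracket the kernel like this
 * (requires <papi.h> and linking with -lpapi):
 *
 *   float rtime, ptime, mflops;
 *   long long flpops;
 *   PAPI_flops(&rtime, &ptime, &flpops, &mflops);  // start/reset
 *   ... kernel ...
 *   PAPI_flops(&rtime, &ptime, &flpops, &mflops);  // counts since start
 *
 * The stand-in below counts the flops itself. */
double daxpy_mflops(double alpha, const double *x, double *y,
                    size_t n, long long *flpops) {
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];   /* 2 flops per iteration */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    *flpops = 2LL * (long long)n;
    return secs > 0.0 ? (double)*flpops / secs / 1e6 : 0.0;
}
```

The advantage of the real counters over this stand-in is exactly the slide's point: the hardware counts every floating point instruction the processor actually retired, with no hand bookkeeping and no clock-resolution problems.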
46. Perfometer Features
- Platform independent visualization of PAPI metrics
- Flexible interface
- Quick interpretation of complex results
- Small footprint (compiled code size < 15 KB)
- Color coding to highlight selected procedures
- Trace file generation or real time viewing
47. PAPI Implementation
48. Graphical Tools: Perfometer Usage
- The application is instrumented with PAPI
  - call perfometer()
- Will be layered over the best existing vendor-specific APIs for these platforms
- When the application is started, the call to perfometer sets up a signal handler and a timer to collect and send the information to a Java applet containing the graphical view
- Sections of code that are of interest can be designated with specific colors
  - Using a call to set_perfometer(color)
49. Perfometer
- call perfometer(red)
50. Early Users of PAPI
- DEEP/PAPI (Pacific Sierra): http://www.psrv.com/deep_papi_top.html
- TAU (Allen Malony, U of Oregon): http://www.cs.uoregon.edu/research/paracomp/tau/
- SvPablo (Dan Reed, U of Illinois): http://vibes.cs.uiuc.edu/Software/SvPablo/svPablo.htm
- Cactus (Ed Seidel, Max Planck/U of Illinois): http://www.aei-potsdam.mpg.de
- Vprof (Curtis Janssen, Sandia Livermore Lab): http://aros.ca.sandia.gov/~cljanss/perf/vprof/
- Cluster Tools (Al Geist, ORNL)
- DynaProf (Phil Mucci, UTK): http://www.cs.utk.edu/~mucci/dynaprof/
51. Next Version of the Perfometer Implementation
(Diagram: multiple applications feed a server, which drives the GUI.)
52. PAPI's Parallel Interface
53. PAPI - Supported Processors
- Intel Pentium, II, III, 4; Itanium
  - Linux 2.4, 2.2, 2.0 with the perf kernel patch
- IBM Power 3, 604, 604e
  - For AIX 4.3 with pmtoolkit (available in 4.3.4) (laderose@us.ibm.com)
- Sun UltraSPARC I, II, III
  - Solaris 2.8
- MIPS R10K, R12K
- AMD Athlon
  - Linux 2.4 with the perf kernel patch
- Cray T3E, SV1, SV2
- Windows 2K and XP
- To download the software see http://icl.cs.utk.edu/papi/
54. (No Transcript)
55. Innovative Computing Laboratory, University of Tennessee
- Numerical Linear Algebra
- Heterogeneous Distributed Computing
- Software Repositories
- Performance Evaluation
- Software and ideas have found their way into many areas of Computational Science
- Around 40 people at the moment...
  - 15 researchers: research assoc / post-doc / research prof
  - 15 students: graduate and undergraduate
  - 8 support staff: secretary, systems, artist
  - 1 long term visitor
- Many opportunities within the group at Tennessee
56. SETI@home
- Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
- Uses data collected with the Arecibo Radio Telescope in Puerto Rico.
- When a participant's computer is idle or being wasted, the software downloads a 300 kilobyte chunk of data for analysis.
- The results of this analysis are sent back to the SETI team, combined with those of thousands of other participants.
- Largest distributed computation project in existence
  - ~400,000 machines
  - Averaging 27 Tflop/s
- Today many companies are trying this for profit.
57. Distributed and Parallel Systems
(Spectrum, from distributed/heterogeneous to massively parallel/homogeneous: SETI@home (27 Tflop/s), Entropia, Grid based computing, network of workstations, Beowulf cluster, parallel distributed memory, clusters with special interconnect, ASCI Tflops (7 Tflop/s).)
Distributed systems:
- Gather (unused) resources; steal cycles
- System SW manages resources; system SW adds value
- 10-20% overhead is OK
- Resources drive applications
- Time to completion is not critical
- Time-shared
Parallel systems:
- Bounded set of resources
- Apps grow to consume all cycles
- Application manages resources; system SW gets in the way
- 5% overhead is the maximum
- Apps drive the purchase of equipment
- Real-time constraints
- Space-shared
58. Grids are Hot
- IPG (NAS-NASA): http://nas.nasa.gov/~wej/home/IPG
- Globus: http://www.globus.org/
- Legion: http://www.cs.virginia.edu/~grimshaw/
- AppLeS: http://www-cse.ucsd.edu/groups/hpcl/apples
- NetSolve: http://www.cs.utk.edu/netsolve/
- NINF: http://phase.etl.go.jp/ninf/
- Condor: http://www.cs.wisc.edu/condor/
- CUMULVS: http://www.epm.ornl.gov/cs/cumulvs.html
- WebFlow: http://www.npac.syr.edu/users/gcf/
59. The Grid
- Treats CPU cycles and software like commodities.
- "Napster on steroids."
- Enables the coordinated use of geographically distributed resources in the absence of central control and existing trust relationships.
- Computing power is produced much like utilities such as power and water are produced for consumers.
- Users will have access to power on demand.
- "When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances" (Gilder Technology Report, June 2000)
60. The Grid
61. The Grid Architecture Picture
(Diagram, top to bottom:)
- User portals: problem solving environments, application science portals
- Grid service layers: access, info, co-scheduling, fault tolerance, authentication, events, naming, files
- Resource layer: computers, databases, online instruments, software
- High speed networks and routers
62. Globus Grid Services
- The Globus toolkit provides a range of basic Grid services
  - Security, information, fault detection, communication, resource management, ...
- These services are simple and orthogonal
  - Can be used independently; mix and match
  - Programming model independent
- For each there are well-defined APIs
- Standards are used extensively
  - E.g., LDAP, GSS-API, X.509, ...
- You don't program in Globus; it's a set of tools, like Unix
63. NetSolve: Network Enabled Server
- NetSolve is an example of a grid based hardware/software server.
- Ease of use is paramount.
- Based on an RPC model, but with:
  - resource discovery, dynamic problem solving capabilities, load balancing, fault tolerance, asynchronicity, security, ...
- Other examples are NEOS from Argonne and NINF from Japan.
- The goal is to use resources for a single application, not to tie together geographically distributed resources.
64. NetSolve: The Big Picture
(Diagram: a client (Matlab, Mathematica, C, Fortran, Java, Excel) issues a request such as Op(C, A, B); the agent(s), backed by a schedule database, direct it to one of the servers S1-S4.)
No knowledge of the grid required; RPC like.
65. Basic Usage Scenarios
- Grid based numerical library routines
  - The user doesn't have to have the software library on their machine: LAPACK, SuperLU, ScaLAPACK, PETSc, AZTEC, ARPACK
- Task farming applications
  - Pleasantly parallel execution
  - e.g. parameter studies
- Remote application execution
  - Complete applications, with the user specifying input parameters and receiving output
- "Blue collar" grid based computing
  - Does not require deep knowledge of network programming
  - Level of expressiveness right for many users
  - User can set things up; no superuser access required
  - In use today, up to 200 servers in 9 countries
  - Can plug into Globus, Condor, NINF, ...
66. NetSolve Agent
- Name server for the NetSolve system.
- Information service
  - client users and administrators can query the hardware and software services available.
- Resource scheduler
  - maintains both static and dynamic information regarding the NetSolve server components, to use for the allocation of resources.
67. NetSolve Agent
- Resource scheduling (cont'd)
  - CPU performance (LINPACK)
  - Network bandwidth, latency
  - Server workload
  - Problem size / algorithm complexity
- Calculates a "time to compute" for each appropriate server.
- Notifies the client of the most appropriate server.
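That selection step can be sketched as a small cost model. The structure and formula here are illustrative only (the field names are hypothetical, not NetSolve's), but they combine the slide's four inputs, CPU performance, network bandwidth/latency, workload, and problem size, into an estimated time to compute per server:

```c
#include <assert.h>

/* Per-server state the agent tracks (illustrative fields). */
typedef struct {
    double latency_s;      /* network latency to the server, seconds  */
    double bandwidth_Bps;  /* network bandwidth, bytes/second         */
    double mflops;         /* CPU performance (LINPACK-style rating)  */
    double load;           /* current workload, 0.0 (idle) .. 1.0     */
} server_t;

/* Estimated "time to compute": ship the data, then run at the fraction
 * of the server's speed its current load leaves free. */
double time_to_compute(const server_t *s, double bytes, double mflop_work) {
    return s->latency_s
         + bytes / s->bandwidth_Bps
         + mflop_work / (s->mflops * (1.0 - s->load));
}

/* Agent decision: notify the client of the server with the smallest
 * estimated time. */
int pick_server(const server_t *servers, int ns,
                double bytes, double mflop_work) {
    int best = 0;
    for (int i = 1; i < ns; i++)
        if (time_to_compute(&servers[i], bytes, mflop_work)
            < time_to_compute(&servers[best], bytes, mflop_work))
            best = i;
    return best;
}
```

Note how the winner depends on the problem: a small job favors a nearby idle server, while a huge one can justify shipping data to a distant but much faster machine.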
68. NetSolve Client
- Function based interface.
- The client program embeds calls from NetSolve's API to access additional resources.
- Interface available to C, Fortran, Matlab, Mathematica, and Java.
- Opaque networking interactions.
- NetSolve can be invoked using a variety of methods: blocking, non-blocking, task farms, ...
69. NetSolve Client
- Intuitive and easy to use.
- Matlab matrix multiply, e.g.:
  - A = matmul(B, C)
  - becomes A = netsolve(matmul, B, C)
- Possible parallelism is hidden.
70. NetSolve Client
- Client makes a request to the agent.
- Agent returns a list of servers.
- Client tries each one in turn until one executes successfully or the list is exhausted.
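The failover loop itself is tiny. A hypothetical sketch (the callback type stands in for an actual remote NetSolve call):

```c
#include <assert.h>

/* One remote attempt: returns 0 on success, nonzero on failure. */
typedef int (*remote_call_fn)(void);

/* Try each server from the agent's list in order; return the index of
 * the first that succeeds, or -1 if the list is exhausted. */
int try_servers(const remote_call_fn *servers, int nservers) {
    for (int i = 0; i < nservers; i++)
        if (servers[i]() == 0)
            return i;
    return -1;
}
```

This is where NetSolve's fault tolerance lives from the client's point of view: a dead or overloaded server just means moving on to the next candidate.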
71. NPACI Alpha Project - MCell: 3-D Monte-Carlo Simulation of Neurotransmitter Release Between Cells
- UCSD (F. Berman, H. Casanova, M. Ellisman), Salk Institute (T. Bartol), CMU (J. Stiles), UTK (J. Dongarra, R. Wolski)
- Study how neurotransmitters diffuse and activate receptors in synapses
- (Figure legend: blue = unbound, red = singly bound, green = doubly bound closed, yellow = doubly bound open)
72. MCell: 3-D Monte-Carlo Simulation of Neurotransmitter Release Between Cells
- Developed at the Salk Institute and CMU
- In the past, manually run on available workstations
- Transparent parallelism, load balancing, fault-tolerance
- Fits the farming semantics of, and need for, NetSolve
- Collaboration with the AppLeS project for scheduling tasks
(Diagram: a list of seeds feeds AppLeS, which dispatches MCell scripts to the NetSolve servers.)
73. IPARS
- Integrated Parallel Accurate Reservoir Simulator
- Mary Wheeler's group, UT-Austin
- Reservoir and environmental simulation
  - models black oil, waterflood, compositional
  - 3D transient flow of multiple phases
- Integrates existing simulators
  - Framework simplifies development
  - Provides solvers, handling for wells, table lookup
  - Provides pre/postprocessor, visualization
- Full IPARS access without installation
- IPARS interfaces: C, Fortran, Matlab, Mathematica, and Web
74. NetSolve and SCIRun
SCIRun torso defibrillator application (Chris Johnson, U of Utah)
75. NetSolve: A Plug into the Grid
(Diagram: C and Fortran clients call NetSolve; NetSolve's grid middleware provides resource discovery, fault tolerance, system management, and resource scheduling through proxies for Globus, NetSolve, Ninf, and Condor.)
76. NetSolve: A Plug into the Grid
(Same diagram, adding the grid back-ends behind the proxies: Globus, NetSolve servers, Ninf servers, and Condor.)
77. NetSolve: A Plug into the Grid
(Same diagram, adding the PSE front-ends: Matlab, Mathematica, custom clients, and SCIRun, which reach NetSolve through remote procedure call alongside C and Fortran.)
78. University of Tennessee Deployment: Scalable Intracampus Research Grid, SInRG (NSF Infrastructure Award)
- SInRG equipment: 8 Grid Service Clusters deployed, 6 more to come
- Federated ownership: CS, Chem. Eng., Medical School, Computational Ecology, El. Eng.
- Real applications, middleware development, logistical networking
The Knoxville campus has two DS-3 commodity Internet connections and one DS-3 Internet2/Abilene connection. An OC-3 ATM link routes IP traffic between the Knoxville campus, the National Transportation Research Center, and Oak Ridge National Laboratory. UT participates in several national networking initiatives including Internet2 (I2), Abilene, the federal Next Generation Internet (NGI) initiative, the Southern Universities Research Association (SURA) Regional Information Infrastructure (RII), and Southern Crossroads (SoX). The UT campus consists of a meshed ATM OC-12 being migrated over to switched Gigabit by early 2002.
79. SInRG's Vision
- SInRG provides a testbed for:
  - CS grid middleware
  - Computational Science applications
- Many hosts, co-existing in a loose confederation tied together with high-speed links.
- Users have the illusion of a very powerful computer on the desk.
- Spectrum of users
80. GrADS - Three Research and Technology Thrusts
- GrADS: Grid Application Development Software
  - NSF Next Generation Software (NGS) effort
- GrADS PIs: Berman, Chien, Cooper, Dongarra, Foster, Gannon, Johnsson, Kennedy, Kesselman, Mellor-Crummey, Reed, Torczon, Wolski
- GrADSoft
  - Software infrastructure for programming and running on the Grid
  - Reconfigurable object programs
  - Performance contracts
  - Core Grid technologies: Globus, NetSolve, NWS, Autopilot, AppLeS, Portals, Cactus
- MacroGrid
  - Persistent multi-institution Grid testbed
- MicroGrid
  - Portable Grid emulator
81. Grid-Aware Numerical Libraries
- Using ScaLAPACK and PETSc on the Grid: Early Experiences
82. Grid-Aware Numerical Libraries
- Using ScaLAPACK and PETSc on the Grid: Early Experiences
In some sense ScaLAPACK is not an ideal application for the Grid, but it expanded our understanding of how the various GrADS components fit together. The key is managing dynamism.
83. ScaLAPACK
- ScaLAPACK is a portable distributed memory numerical library
- Complete numerical library for dense matrix computations
- Designed for distributed parallel computing (MPPs and clusters) using MPI
- One of the first math software packages to do this
- Numerical software that will work on a heterogeneous platform
- Funding from DOE, NSF, and DARPA
- In use today by IBM, HP-Convex, Fujitsu, NEC, Sun, SGI, Cray, NAG, IMSL, ...
  - They tailor performance and provide support
84. ScaLAPACK Grid Enabled
- Implement a version of a ScaLAPACK library routine that runs on the Grid.
- Make use of the resources at the user's disposal
- Provide the best time to solution
- Proceed without the user's involvement
- Make as few changes as possible to the numerical software.
- The assumption is that the user is already Grid enabled and runs a program that contacts the execution environment to determine where the execution should take place.
85. To Use ScaLAPACK a User Must:
- Download the package and auxiliary packages (like PBLAS, BLAS, BLACS, MPI) to the machines.
- Write an SPMD program which:
  - Sets up the logical 2-D process grid
  - Places the data on the logical process grid
  - Calls the numerical library routine in an SPMD fashion
  - Collects the solution after the library routine finishes
- The user must allocate the processors and decide the number of processes the application will run on
- The user must start the application
  - mpirun -np N user_app
  - Note: the number of processors is fixed by the user before the run, even if the problem size changes dynamically
- Upon completion, return the processors to the pool of resources
86. GrADS Numerical Library
- Want to relieve the user of some of the tasks:
- Make decisions on which machines to use based on the user's problem and the state of the system
  - Determine the machines that can be used
  - Optimize for the best time to solution
  - Distribute the data on the processors and collect the results
  - Start the SPMD library routine on all the platforms
  - Check to see if the computation is proceeding as planned
    - If not, perhaps migrate the application
87. GrADS Library Sequence
(Diagram: User -> Library Routine.)
- Crafted code makes things work correctly and together.
- Assumptions: the Autopilot Manager has been started and Globus is there.
88. Resource Selector
(Diagram: User -> Library Routine -> Resource Selector.)
- Uses Globus MDS and Rich Wolski's NWS to build an array of values for the machines that are available to the user.
  - 2 matrices (bandwidth, latency), 2 arrays (CPU, memory available)
  - Matrix information is clique based
- On return from the RS, the crafted code filters the information to use only machines that have the necessary software and are really eligible to be used.
89. Arrays of Values Generated by the Resource Selector
- Clique based
  - 2 @ UT, UCSD, UIUC
  - Part of the MacroGrid
  - Full at the cluster level, plus the connections (clique leaders)
- Bandwidth and latency matrices; linear arrays for CPU and memory
- The matrices of values are filled out to generate a complete, dense matrix of values.
- At this point we have a workable coarse grid:
  - We know what is available, the connections, and the power of the machines.
90. ScaLAPACK Performance Model
- Total number of floating-point operations per processor
- Total number of data items communicated per processor
- Total number of messages
- Time per floating point operation
- Time per data item communicated
- Time per message
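These six quantities combine into a single predicted execution time. A sketch of the model (the LU flop count is the standard leading-order term, shown only as an illustration, not the exact ScaLAPACK model):

```c
#include <assert.h>

/* T = (#flops)*tf + (#words)*tv + (#messages)*tm, all per processor:
 * tf = time per flop, tv = time per word, tm = time per message. */
double model_time(double flops, double words, double msgs,
                  double tf, double tv, double tm) {
    return flops * tf + words * tv + msgs * tm;
}

/* Leading-order per-processor flop count for LU factorization of an
 * n x n matrix over p processors (illustrative). */
double lu_flops_per_proc(double n, double p) {
    return 2.0 * n * n * n / (3.0 * p);
}
```

The point of the model is that tf, tv, and tm differ wildly between a cluster node and a wide-area link, so the same counts can give very different predicted times on different machine subsets.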
91. Performance Model
(Diagram: User -> Library Routine -> Resource Selector -> Performance Model.)
- The Performance Model uses the information generated by the RS to decide on the fine grid.
- Pick a machine that is closest to every other machine in the collection.
- If there is not enough memory, add machines until the problem can be solved.
- The cost model is run on this set.
- The process adds a machine to the group and reruns the cost model.
- If better, iterate the last step; if not, stop.
92. Resource Selector / Performance Modeler
- Refines the coarse grid by determining the process set that will provide the best time to solution.
- This is based on dynamic information from the grid and the routine's performance model.
- The PM does a simulation of the actual application using the information from the RS.
  - It literally runs the program without doing the computation or data movement.
- There is no backtracking in the optimizer.
  - This is an area for enhancement and experimentation.
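The no-backtracking search can be sketched as a greedy loop over a toy cost model: each added machine contributes its speed but also fixed communication overhead, and the optimizer keeps adding machines only while the predicted time improves. The model and all numbers here are illustrative, not the GrADS code:

```c
#include <assert.h>

/* Toy prediction: work shared by the chosen machines' combined speed,
 * plus per-machine communication overhead. */
double predicted_time(double work, double comm_per_proc,
                      const double *speed, int p) {
    double total = 0.0;
    for (int i = 0; i < p; i++)
        total += speed[i];
    return work / total + comm_per_proc * p;
}

/* Greedy, no backtracking: machines are offered in the resource
 * selector's order; stop at the first one that does not help. */
int greedy_select(double work, double comm_per_proc,
                  const double *speed, int navail) {
    int p = 1;
    double best = predicted_time(work, comm_per_proc, speed, 1);
    while (p < navail) {
        double t = predicted_time(work, comm_per_proc, speed, p + 1);
        if (t >= best)
            break;   /* adding this machine makes things worse */
        best = t;
        p++;
    }
    return p;
}
```

The weakness the slide admits is visible here: once the loop stops, a better set reachable only by first accepting a temporarily worse machine is never found.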
93. Contract Development
(Diagram: User -> Library Routine -> Resource Selector -> Performance Model -> Contract Development.)
- A contract between the application and the Grid system
- The CD should validate the fine grid.
- Should iterate between the CD and PM phases to get a workable fine grid.
94. Application Launcher
(Diagram: User -> Library Routine -> Resource Selector -> Performance Model -> Contract Development -> App Launcher.)
mpirun -machinefile fine_grid grid_linear_solve (via the Globus RSL)
95. Experimental Hardware / Software Grid: MacroGrid Testbed
- Autopilot version 2.3
- Globus version 1.1.3
- NWS version 2.0.pre2
- MPICH-G version 1.1.2
- ScaLAPACK version 1.6
- ATLAS/BLAS version 3.0.2
- BLACS version 1.1
- PAPI version 1.1.5
- GrADS crafted code
Independent components being put together and interacting.
96. Performance Model Validation
- Speed = 60% of peak
- Latency in msec
- Bandwidth in Mb/s
- This is for a refined grid
97.
- N = 600, NB = 40, 2 torc procs: ratio 46.12
- N = 1500, NB = 40, 4 torc procs: ratio 15.03
- N = 5000, NB = 40, 6 torc procs: ratio 2.25
- N = 8000, NB = 40, 8 torc procs: ratio 1.52
- N = 10,000, NB = 40, 8 torc procs: ratio 1.29
98.
- OPUS
- OPUS, CYPHER
- OPUS, TORC, CYPHER
- 2 OPUS, 4 TORC, 6 CYPHER
- 8 OPUS, 4 TORC, 4 CYPHER
- 8 OPUS, 2 TORC, 6 CYPHER
- 6 OPUS, 5 CYPHER
- 8 OPUS, 6 CYPHER
- 8 OPUS
- 5 OPUS
- 8 OPUS
99. Largest Problem Solved
- Matrix of size 30,000
  - 7.2 GB for the data
- 32 processors to choose from at UIUC and UT
  - Not all machines have 512 MB; some have as little as 128 MB
- The PM chose 17 machines in 2 clusters from UT
- The computation took 84 minutes
  - 3.6 Gflop/s total
  - 210 Mflop/s per processor
- ScaLAPACK on a cluster of 17 processors would get about 50% of peak
  - Processors are 500 MHz, i.e. 500 Mflop/s peak
- This grid computation was about 20% less than ScaLAPACK
- Compiler analogy
100. Contracts, Checkpointing, Migration
- We are using the University of Illinois' Autopilot to monitor the progress of the execution.
- The application software has the ability to perform a checkpoint and can be restarted.
  - We manually inserted the checkpointing code.
- If the application is not progressing as the contract specifies, we want to take some corrective action:
  - Go back and figure out where the application can be run optimally.
  - Restart the process from the last checkpoint, perhaps rearranging the data to fit the new set of processors.
101. General Library Interface
- We have a start on a general interface for numerical libraries.
- It can be a simple operation to plug in other numerical routines/libraries.
- Developing migration mechanisms for contract violations.
- Today a library writer needs to supply:
  - the numerical routine
  - a performance model
- The rest of the framework can remain the same.
102. Futures for Numerical Algorithms and Software
- Numerical software will be adaptive, exploratory, and intelligent
  - Polyalgorithms and other techniques
- Determinism in numerical computing will be gone.
  - After all, it's not reasonable to ask for exactness in numerical computations.
  - Auditability of the computation; reproducibility at a cost
- The importance of floating point arithmetic will be undiminished.
  - 16, 32, 64, 128 bits and beyond
  - Interval arithmetic
  - Reproducibility, fault tolerance, and auditability
- Adaptivity is key so applications can effectively use the resources.
103Contributors to These Ideas
- Top500
- Erich Strohmaier, LBL, NERSC
- Hans Meuer, Mannheim U
- Horst Simon, LBL, NERSC
- ATLAS
- Antoine Petitet, Sun France
- Clint Whaley, UTK
- Parallel Computing, Vol 27,
No 1-2, pp 3-25, 2001 - PAPI
- Shirley Browne, UTK
- Kevin London, UTK
- Phil Mucci, UTK
- Keith Seymour, UTK
- NetSolve
- Dorian Arnold, UWisc
- Henri Casanova, UCSD
- Michelle Miller, UTK
- Sathish Vadhiyar, UTK
- For additional information see
- http://icl.cs.utk.edu/top500/
- http://icl.cs.utk.edu/atlas/
- http://icl.cs.utk.edu/papi/
- http://icl.cs.utk.edu/scalapack/
- http://icl.cs.utk.edu/netsolve/
- www.cs.utk.edu/dongarra/
Many opportunities within the group at Tennessee
1056 Variations of Matrix Multiply
(Slides 105-113 repeat this title over performance
plots comparing the C and Fortran loop orderings.)
However, this is only part of the story.
114SUN Ultra 2 200 MHz (L1 16 KB, L2 1 MB)
115Cache Blocking
- We want blocks to fit into cache. On a parallel
computer with p processors we have p caches, so the
data may fit into the combined caches even when it
does not fit into one; this can lead to superlinear
speedup. Consider matrix-matrix multiply. - An alternate form is ...
      do k = 1, n
        do j = 1, n
          do i = 1, n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do
116Cache Blocking
      do ii = 1, n, nblk
        do jj = 1, n, nblk
          do kk = 1, n, nblk
            do k = kk, kk+nblk-1
              do j = jj, jj+nblk-1
                do i = ii, ii+nblk-1
                  c(i,j) = c(i,j) + a(i,k)*b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do
117Assignment
- Implement, in Fortran or C, the six different
ways to perform matrix multiplication by
interchanging the loops. (Use 64-bit arithmetic.)
Make each implementation a subroutine, like - subroutine ijk (a,m,n,lda,b,k,ldb,c,ldc)
- subroutine ikj ( a,m,n,lda,b,k,ldb,c,ldc)
- ...
- Construct a driver program that generates random
matrices and calls each matrix multiply routine
with square matrices of orders 50, 100, 150, 200,
250, and 300, timing the calls and computing the
Mflop/s rate. - Include in your timing routine a call to the
following system-supplied routine
- call dgemm( 'No', 'No', n, n, n, 1.0d0, a, lda, b, ldb,
- 1.0d0, c, ldc )
- Write up a description of the timing and describe
why the routines perform as they do. - Download ATLAS from http://www.netlib.org/atlas/ and
build the ATLAS version of DGEMM and time it.
118EISPACK and LINPACK
- EISPACK
- Designed for the algebraic eigenvalue problem,
Ax = λx and Ax = λBx. - Based on the work of J. Wilkinson and colleagues in the 70s.
- Fortran 77 software based on translation of
ALGOL. - LINPACK
- Designed for solving systems of equations, Ax = b.
- Fortran 77 software using the Level 1 BLAS.
119History of Block Partitioned Algorithms
- Early algorithms involved use of small main
memory using tapes as secondary storage. - Recent work centers on use of vector registers,
level 1 and 2 cache, main memory, and out of
core memory.
120Blocked Partitioned Algorithms
- LU Factorization
- Cholesky factorization
- Symmetric indefinite factorization
- Matrix inversion
- QR, QL, RQ, LQ factorizations
- Form Q or Q^T C
- Orthogonal reduction to
- (upper) Hessenberg form
- symmetric tridiagonal form
- bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue
problems
121LAPACK
- Linear Algebra library in Fortran 77
- Solution of systems of equations
- Solution of eigenvalue problems
- Combine algorithms from LINPACK and EISPACK into
a single package - Efficient on a wide range of computers
- RISC, Vector, SMPs
- User interface similar to LINPACK
- Single, Double, Complex, Double Complex
- Built on the Level 1, 2, and 3 BLAS
122LAPACK
- Most of the parallelism in the BLAS.
- Advantages of using the BLAS for parallelism
- Clarity
- Modularity
- Performance
- Portability
123Derivation of Blocked Algorithms: Cholesky
Factorization A = U^T U
Writing A = U^T U with U upper triangular and equating
coefficients of the jth column, we obtain
a_j = U11^T u_j  and  a_jj = u_j^T u_j + u_jj^2.
Hence, if U11 has already been computed, we can
compute u_j and u_jj from the equations
U11^T u_j = a_j  and  u_jj = sqrt( a_jj - u_j^T u_j ).
124LINPACK Implementation
- Here is the body of the LINPACK routine SPOFA,
which implements the method:
      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF( JM1.LT.1 ) GO TO 20
         DO 10 K = 1, JM1
            T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
            T = T / A( K, K )
            A( K, J ) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A( J, J ) - S
C     ...EXIT
         IF( S.LE.0.0E0 ) GO TO 40
         A( J, J ) = SQRT( S )
   30 CONTINUE
125LAPACK Implementation
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1,
                     A, LDA, A( 1, J ), 1 )
         S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A( J, J ) = SQRT( S )
   10 CONTINUE
- This change by itself is sufficient to
significantly improve the performance on a number
of machines. - From 238 to 312 Mflop/s for a matrix of order 500
on a Pentium 4 at 1.7 GHz. - However, peak is 1,700 Mflop/s,
which suggests further work is needed.
126Derivation of Blocked Algorithms
Equating coefficients of the second block of columns,
we obtain
A12 = U11^T U12.
Hence, if U11 has already been computed, we can
compute U12 as the solution of the triangular system
U11^T U12 = A12
by a call to the Level 3 BLAS routine STRSM.
127LAPACK Blocked Algorithms
      DO 10 J = 1, N, NB
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit',
                     J-1, JB, ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE,
                     A( 1, J ), LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
- On the Pentium 4, the Level 3 BLAS squeeze a lot
more out of one processor.
128LAPACK Contents
- Combines algorithms from LINPACK and EISPACK into
a single package. User interface similar to
LINPACK. - Built on the Level 1, 2 and 3 BLAS, for high
performance (manufacturers optimize BLAS) - LAPACK does not provide routines for structured
problems or general sparse matrices (i.e., sparse
storage formats such as compressed-row, -column,
-diagonal, skyline, ...).
129LAPACK Ongoing Work
- Add functionality
- updating/downdating, divide and conquer least
squares, bidiagonal bisection, bidiagonal inverse
iteration, band SVD, Jacobi methods, ... - Move to new generation of high performance
machines - IBM SPs, CRAY T3E, SGI Origin, clusters of
workstations - New challenges
- New languages: Fortran 90, HPF, ...
- Many flavors of message passing (CMMD, MPL, NX, ...)
made a standard necessary (PVM, MPI): BLACS
- Highly varying ratio of computation to communication
- Many ways to lay out data
- Fastest parallel algorithm is sometimes less stable
numerically.
130Gaussian Elimination
(figure: elimination zeroes the entries below the
diagonal, one column at a time)
Standard way: subtract a multiple of one row from the
rows below it.
131Gaussian Elimination via a Recursive Algorithm
F. Gustavson and S. Toledo
LU Algorithm:
1. Split the matrix into two rectangles (m x n/2);
   if only 1 column, scale it by the reciprocal of
   the pivot and return.
2. Apply the LU algorithm to the left part.
3. Apply the transformations to the right part
   (triangular solve A12 = L^-1 A12 and
   matrix multiplication A22 = A22 - A21*A12).
4. Apply the LU algorithm to the right part.
Most of the work is in the matrix multiply: matrices
of size n/2, n/4, n/8, ...
132Recursive Factorizations
- Just as accurate as conventional method
- Same number of operations
- Automatic variable blocking
- Level 1 and 3 BLAS only !
- Extreme clarity and simplicity of expression
- Highly efficient
- The recursive formulation is just a rearrangement
of the point-wise LINPACK algorithm - The standard error analysis applies (assuming the
matrix operations are computed the conventional
way). - OK for LU, LL^T, QR
- Open question for two-sided algorithms, e.g.,
eigenvalue reduction
133Recursive LU Performance
(plot: Recursive LU vs. LAPACK LU on uniprocessor and
dual-processor systems)
134Challenges in Developing Distributed Memory
Libraries
- How to integrate software?
- Until recently no standards
- Many parallel languages
- Various parallel programming models
- Assumptions about the parallel environment
- granularity
- topology
- overlapping of communication/computation
- development tools
- Where is the data
- Who owns it?
- Optimal data distribution
- Who determines data layout
- Determined by user?
- Determined by library developer?
- Allow dynamic data dist.
- Load balancing
135ScaLAPACK
- Library of software dealing with dense and banded
routines
- MIMD Computers and Networks of Workstations
- Clusters of SMPs
136Programming Style
- SPMD Fortran 77 with object based design
- Built on various modules
- PBLAS Interprocessor communication
- BLACS
- PVM, MPI
- Provides right level of notation.
- BLAS
- LAPACK software expertise/quality
- Software approach
- Numerical methods
137Overall Structure of Software
- Object based - Array descriptor
- Contains information required to establish
mapping between a global array entry and its
corresponding process and memory location. - Provides a flexible framework to easily specify
additional data distributions or matrix types. - Currently dense, banded, out-of-core
- Using the concept of context
138PBLAS
- Similar to the BLAS in functionality and naming.
- Built on the BLAS and BLACS
- Provide global view of matrix
- CALL DGEXXX ( M, N, A( IA, JA ), LDA,... )
- CALL PDGEXXX( M, N, A, IA, JA, DESCA,... )
139ScaLAPACK Structure
Global
Local
140Choosing a Data Distribution
- Main issues are
- Load balancing
- Use of the Level 3 BLAS
141Possible Data Layouts
- 1D block and cyclic column distributions
- 1D block-cycle column and 2D block-cyclic
distribution - 2D block-cyclic used in ScaLAPACK for dense
matrices
142Distribution and Storage
- Matrix is block-partitioned; the map assigns blocks
to processes - Distributed with a 2-D block-cyclic scheme
- Example: a 5x5 matrix partitioned in 2x2 blocks,
mapped onto a 2x2 process grid (process
point of view) - Routines available to distribute/redistribute
data.
143Parallelism in ScaLAPACK
- Level 3 BLAS block operations
- All the reduction routines
- Pipelining
- QR Algorithm, Triangular Solvers, classic
factorizations - Redundant computations
- Condition estimators
- Static work assignment
- Bisection
- Task parallelism
- Sign function eigenvalue computations
- Divide and Conquer
- Tridiagonal and band solvers, symmetric
eigenvalue problem and Sign function - Cyclic reduction
- Reduced system in the band solver
- Data parallelism
- Sign function
145References
- http://www.netlib.org
- http://www.netlib.org/lapack
- http://www.netlib.org/scalapack
- http://www.netlib.org/lapack/lawns
- http://www.netlib.org/atlas
- http://www.netlib.org/papi/
- http://www.netlib.org/netsolve/
- http://www.netlib.org/lapack90
- http://www.nhse.org
- http://www.netlib.org/utk/people/JackDongarra/la-sw.html
- lapack@cs.utk.edu
- scalapack@cs.utk.edu
146Motivation for Grid Computing
In the past Isolation
- Many science and
engineering problems today require that widely
dispersed resources be operated as systems. - Networking, distributed computing, and parallel
computation research have matured to make it
possible for distributed systems to support
high-performance applications, but... - Resources are dispersed
- Connectivity is variable
- Dedicated access may not be possible
Today Collaboration
147Performance Distribution Nov 2001
148Bandwidth Won't Be A Problem Soon -- Bisection
Bandwidth (BB) Across the US
- 1971 - BB 112 Kb/s
- 1986 - BB 1 Mb/s
- 2001 - BB 200 Gb/s
- Today in the lab, 4000 channels on a single fiber,
each channel 10 Gb/s - 12 strands of fiber can carry 4000 x 10 Gb/s, or 40
Tb/s - 5 backbone networks across the US, each with 2 sets
of 12 strands, could provide 2.4 Pb/s
- When the network is as fast as the computer's
internal links, the machine disintegrates across
the net into a set of special-purpose appliances - Gilder Technology Report, June 2000
- Internet traffic doubling every 9 months
- A factor of 100 in 5 years
- BB will grow by a factor of 12,000.