Title: High Performance Computing and Trends, Enhancing Performance, Measurement Tools
1. High Performance Computing and Trends, Enhancing Performance, Measurement Tools
Facultad de Ciencias, Universidad de Los Andes, La Hechicera, Mérida, Venezuela. December 3-5, 2001
- Jack Dongarra
- Innovative Computing Laboratory
- University of Tennessee
- http://www.cs.utk.edu/~dongarra/
2. Overview
- High Performance Computing
- ATLAS
- PAPI
- NetSolve
- Grid Experiments
3. Technology Trends: Microprocessor Capacity
- 2X transistors/chip every 1.5 years, called Moore's Law
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
- Microprocessors have become smaller, denser, and more powerful.
4. Moore's Law
5. TOP500 (H. Meuer, H. Simon, E. Strohmaier, J. Dongarra)
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark (Ax = b, dense problem)
- Updated twice a year: at SCxy in the States in November, and at the meeting in Mannheim, Germany in June
- All data available from www.top500.org
(Chart axis labels: TPP performance, rate, size.)
6. In 1980 a computation that took 1 full year to complete can now be done in 10 hours!
7. In 1980 a computation that took 1 full year to complete can now be done in 16 minutes!
8. In 1980 a computation that took 1 full year to complete can today be done in 27 seconds!
9. Top 10 Machines (November 2001)
10. Performance Development
(Chart annotations: top system at 7.2 Tflop/s, entry level at 94 Gflop/s, Schwab at #24; 394 systems > 100 Gflop/s; growth faster than Moore's law.)
11. Performance Development
- My Laptop
- Entry: 1 Tflop/s in 2005 and 1 Pflop/s in 2010
12. Petaflop (10^15 flop/s) Computers Within the Next Decade
- Five basic design points:
- Conventional technologies
  - 4.8 GHz processor, 8000 nodes, each with 16 processors
- Processing-in-memory (PIM) designs
  - Reduce the memory access bottleneck
- Superconducting processor technologies
  - Digital superconductor technology, Rapid Single-Flux-Quantum (RSFQ) logic, hybrid technology multi-threaded (HTMT)
- Special-purpose hardware designs
  - Specific applications, e.g. the GRAPE Project in Japan for gravitational force computations
- Schemes utilizing the aggregate computing power of processors distributed on the web
  - SETI@home: ~26 Tflop/s
13. Petaflops (10^15 flop/s) Computer Today?
- 1 GHz processors (O(10^9) ops/s each)
- 1 million PCs
- $1B ($1K each)
- 100 MWatts
- 5 acres
- 1 million Windows licenses!!
- A PC failure every second
14. Architectures
- Constellation: # processors/node >= # nodes (p/n >= n)
15. Chip Technology
16. Manufacturer
- IBM 32%, HP 30%, SGI 8%, Cray 8%, Sun 6%, Fujitsu 4%, NEC 3%, Hitachi 3%
17. Cumulative Performance, Nov 2001
- Total of all 500 systems: 134.9 TF/s
18. High-Performance Computing Directions: Beowulf-class PC Clusters
Definition:
- COTS PC nodes
  - Pentium, Alpha, PowerPC, SMP
- COTS LAN/SAN interconnect
  - Ethernet, Myrinet, Giganet, ATM
- Open source Unix
  - Linux, BSD
- Message passing computing
  - MPI, PVM
  - HPF
Advantages:
- Best price-performance
- Low entry-level cost
- Just-in-place configuration
- Vendor invulnerable
- Scalable
- Rapid technology tracking
Enabled by PC hardware, networks, and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message passing libraries. However, much more of a contact sport.
19.
- Peak performance
- Interconnection
- http://clusters.top500.org
- Benchmark results to follow in the coming months
20. Where Does the Performance Go? or: Why Should I Care About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
(Chart, 1980-2000, performance on a log scale: µProc improves 60%/yr (2X/1.5yr, "Moore's Law"); DRAM improves 9%/yr (2X/10 yrs); the gap between CPU and DRAM widens over time.)
21. Optimizing Computation and Memory Use
- Computational optimizations
  - Theoretical peak = (# FPUs) x (flops/cycle) x MHz
  - PIII: (1 FPU)(1 flop/cycle)(850 MHz) = 850 MFLOP/s
  - Athlon: (2 FPUs)(1 flop/cycle)(600 MHz) = 1200 MFLOP/s
  - Power3: (2 FPUs)(2 flops/cycle)(375 MHz) = 1500 MFLOP/s
- Operations like:
  - alpha = x^T y: 2 operands (16 bytes) needed for 2 flops; at 850 Mflop/s this requires 1700 MW/s of bandwidth
  - y = alpha*x + y: 3 operands (24 bytes) needed for 2 flops; at 850 Mflop/s this requires 2550 MW/s of bandwidth
- Memory optimization
  - Theoretical peak = (bus width) x (bus speed)
  - PIII: (32 bits)(133 MHz) = 532 MB/s = 66.5 MW/s
  - Athlon: (64 bits)(133 MHz) = 1064 MB/s = 133 MW/s
  - Power3: (128 bits)(100 MHz) = 1600 MB/s = 200 MW/s
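The slide's arithmetic can be reproduced in a few lines. A minimal C sketch (the formulas and figures come straight from the bullets above; MW/s here counts millions of 8-byte words per second, and the bandwidth demand is computed the way the slide computes it, operands per 2-flop operation times the Mflop rate):

```c
#include <assert.h>

/* Theoretical compute peak in MFLOP/s: (# FPUs) x (flops/cycle) x MHz. */
double peak_mflops(int fpus, int flops_per_cycle, double mhz) {
    return fpus * flops_per_cycle * mhz;
}

/* Theoretical memory peak: (bus bits) x (bus MHz) / 8 gives MB/s;
 * dividing by 8 again gives millions of 64-bit words per second (MW/s). */
double bus_mwords(int bus_bits, double bus_mhz) {
    return bus_bits * bus_mhz / 8.0 / 8.0;
}

/* Bandwidth demand as the slide computes it:
 * (operands per 2-flop operation) x (MFLOP rate), in MW/s. */
double needed_mwords(int operands, double mflops) {
    return operands * mflops;
}
```

Comparing `needed_mwords(2, 850)` (the dot product, 1700 MW/s) against `bus_mwords(32, 133)` (the PIII bus, 66.5 MW/s) makes the slide's point: the memory system delivers a tiny fraction of what the FPU could consume.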
22. Memory Hierarchy
- By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.
(Figure: the hierarchy from the processor outward, with approximate speed and size at each level.)

Level                           Speed (ns)                      Size (bytes)
Registers                       1s                              100s
On-chip cache                   10s                             Ks
Level 2 and 3 cache (SRAM)      100s                            Ms
Main memory (DRAM)              100s                            Ms
Distributed / remote cluster memory   100,000s (0.1s of ms)     Gs
Secondary storage (disk)        10,000,000s (10s of ms)         Gs
Tertiary storage (disk/tape)    10,000,000,000s (10s of sec)    Ts
23. Self-Adapting Numerical Software (SANS)
- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Operations like the BLAS require many man-hours per platform
  - Software lags far behind hardware introduction
  - Only done if the financial incentive is there
- Hardware, compilers, and software have a large design space with many parameters
  - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
  - Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors
- Need for quick/dynamic deployment of optimized routines
- ATLAS: Automatically Tuned Linear Algebra Software
24. Software Generation Strategy - BLAS
- Parameter study of the hardware
- Generate multiple versions of code, with different values of key performance parameters
- Run and measure the performance of the various versions
- Pick the best and generate the library
- The Level 1 cache multiply optimizes for:
  - TLB access
  - L1 cache reuse
  - FP unit usage
  - Memory fetch
  - Register reuse
  - Loop overhead minimization
- Takes about 20 minutes to run
- A new model of high-performance programming where critical code is machine generated using parameter optimization
- Designed for RISC architectures
  - Superscalar
  - Needs a reasonable C compiler
- Today ATLAS is in use by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, ...
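The generate-and-measure loop can be illustrated with a toy version. This sketch (names hypothetical, not ATLAS code) times two hand-written dot-product kernels, one rolled and one unrolled by four, and picks the faster, which is the ATLAS strategy of "run, measure, pick best" in miniature:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Variant 0: straightforward loop. */
double dot_rolled(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Variant 1: same computation, unrolled by 4 (one tuning parameter). */
double dot_unrolled4(const double *x, const double *y, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]   * y[i];
        s1 += x[i+1] * y[i+1];
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
    }
    for (; i < n; i++)
        s0 += x[i] * y[i];
    return s0 + s1 + s2 + s3;
}

typedef double (*dot_fn)(const double *, const double *, size_t);

/* Time each candidate over `reps` runs; return the index of the fastest. */
int pick_best(dot_fn *cands, int ncands, const double *x, const double *y,
              size_t n, int reps, volatile double *sink) {
    int best = 0;
    double best_t = 0.0;
    for (int c = 0; c < ncands; c++) {
        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            *sink = cands[c](x, y, n);
        double t = (double)(clock() - t0);
        if (c == 0 || t < best_t) { best_t = t; best = c; }
    }
    return best;
}
```

ATLAS does this over a much larger space (blocking, fetch, register use), but the control flow is the same: every candidate is correct, so only measured speed decides.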
25. ATLAS (DGEMM n = 500), 64-bit floating point results
- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
26. Recursive Approach for Other Level 3 BLAS
Recursive TRMM
- Recur down to the L1 cache block size
- Need a kernel at the bottom of the recursion
- Use a GEMM-based kernel for portability
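The recursion pattern is easiest to see on plain matrix multiply (the slide's TRMM adds triangular bookkeeping, omitted here). In this hypothetical C sketch the matrices are split into quadrants until a block is small enough for the L1-sized base case, where a simple GEMM kernel finishes the work; `NB` stands in for the L1 block size and n is assumed to be a power of two:

```c
#include <assert.h>

#define NB 16  /* stand-in for the L1 cache block size */

/* Base-case GEMM kernel: C += A*B on an n x n block, row strides ld*. */
static void gemm_kernel(int n, const double *A, int lda,
                        const double *B, int ldb, double *C, int ldc) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = C[i*ldc + j];
            for (int k = 0; k < n; k++)
                s += A[i*lda + k] * B[k*ldb + j];
            C[i*ldc + j] = s;
        }
}

/* Recur on quadrants until the block reaches NB, then call the kernel. */
void gemm_rec(int n, const double *A, int lda,
              const double *B, int ldb, double *C, int ldc) {
    if (n <= NB) { gemm_kernel(n, A, lda, B, ldb, C, ldc); return; }
    int h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h*lda, *A22 = A + h*lda + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ldb, *B22 = B + h*ldb + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ldc, *C22 = C + h*ldc + h;
    gemm_rec(h, A11, lda, B11, ldb, C11, ldc);  /* C11 += A11*B11 */
    gemm_rec(h, A12, lda, B21, ldb, C11, ldc);  /* C11 += A12*B21 */
    gemm_rec(h, A11, lda, B12, ldb, C12, ldc);  /* C12 += A11*B12 */
    gemm_rec(h, A12, lda, B22, ldb, C12, ldc);  /* C12 += A12*B22 */
    gemm_rec(h, A21, lda, B11, ldb, C21, ldc);  /* C21 += A21*B11 */
    gemm_rec(h, A22, lda, B21, ldb, C21, ldc);  /* C21 += A22*B21 */
    gemm_rec(h, A21, lda, B12, ldb, C22, ldc);  /* C22 += A21*B12 */
    gemm_rec(h, A22, lda, B22, ldb, C22, ldc);  /* C22 += A22*B22 */
}
```

The appeal on the slide is exactly this structure: only the small base-case kernel needs tuning, and the recursion gives cache locality at every level for free.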
27. Intel PIII 933 MHz: MKL 5.0 vs ATLAS 3.2.0 using Windows 2000
- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
28. ATLAS Matrix Multiply (64-bit floating point results)
(Chart series: Intel IA-64; Intel P4, 64-bit fl. pt.; AMD Athlon.)
29. Pentium 4 - SSE2
- 1.5 GHz, 400 MHz system bus, 16 KB L1 / 256 KB L2 cache, theoretical peak of 1.5 Gflop/s, high power consumption
- Streaming SIMD Extensions 2 (SSE2)
  - consists of 144 new instructions
  - includes SIMD IEEE double precision floating point (vector instructions)
    - Peak for 64-bit floating point: 2X
    - Peak for 32-bit floating point: 4X
  - SIMD 128-bit integer
  - new cache and memory management instructions
- Intel's compiler supports these instructions today
30. ATLAS Matrix Multiply: Intel Pentium 4 using SSE2
(Chart series: P4 32-bit fl. pt. using SSE2; P4 64-bit fl. pt. using SSE2; P4 64-bit fl. pt.)
- $250/processor, <$1000 for the system, so <$0.50/Mflops!!
31. (No Transcript)
32. Machine-Assisted Application Development and Adaptation
- Communication libraries
  - Optimize for the specifics of one's configuration.
- Algorithm layout and implementation
  - Look at the different ways to express the implementation.
33. Work in Progress: ATLAS-like Approach Applied to Broadcast (PII 8-way cluster with 100 Mb/s switched network)
34. Reformulating/Rearranging/Reuse
- Example: the reduction to narrow band form for the SVD
- Fetch each entry of A once
- Restructure and combine operations
- Results in a speedup of > 30%
35. Conjugate Gradient Variants by Dynamic Selection at Run Time
- Variants combine inner products to reduce the communication bottleneck, at the expense of more scalar ops.
- Same number of iterations; no advantage on a sequential processor.
- With a large number of processors and a high-latency network it may be advantageous.
- Improvements can range from 15% to 50% depending on size.
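The combining trick can be shown without MPI. In a parallel CG each loop below would end in a global reduction, so fusing the two inner products replaces two latency-bound allreduce calls with one combined reduction of the pair (the extra scalar work the slide mentions comes from restructured recurrences, not shown). A minimal sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Two separate inner products: in parallel CG, two reductions. */
double dot(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Fused version: one pass over the data, and in parallel a single
 * combined reduction of the pair (rr, rz) instead of two. */
void fused_dots(const double *r, const double *z, size_t n,
                double *rr, double *rz) {
    double a = 0.0, b = 0.0;
    for (size_t i = 0; i < n; i++) {
        a += r[i] * r[i];
        b += r[i] * z[i];
    }
    *rr = a;
    *rz = b;
}
```

On one processor this saves almost nothing, matching the slide; on many processors with a slow network, halving the reduction count is where the 15-50% comes from.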
36. Conjugate Gradient Variants by Dynamic Selection at Run Time
(Same points as slide 35.)
37. Related Tuning Projects
- PHiPAC
  - Portable High Performance ANSI C; http://www.icsi.berkeley.edu/~bilmes/phipac; initial automatic GEMM generation project
- FFTW: Fastest Fourier Transform in the West
  - http://www.fftw.org
- UHFFT
  - tuning parallel FFT algorithms
  - http://rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
- SPIRAL
  - Signal Processing Algorithms Implementation Research for Adaptable Libraries; maps DSP algorithms to architectures
  - http://www.ece.cmu.edu/~spiral/
- Sparsity
  - Sparse-matrix-vector and sparse-matrix-matrix multiplication; http://www.cs.berkeley.edu/~ejim/publication/; tunes code to the sparsity structure of the matrix (more later in this tutorial)
- University of Tennessee
38. Tools for Performance Evaluation
- Timing and performance evaluation has been an art
  - Resolution of the clock
  - Issues about cache effects
  - Different systems
  - Can be cumbersome and inefficient with traditional tools
- The situation is about to change
  - Today's processors have internal counters
39. Performance Counters
- Almost all high performance processors include hardware performance counters.
- Some are easy to access, others are not available to users.
- On most platforms the APIs, if they exist, are not appropriate for the end user or are not well documented.
- Existing performance counter APIs:
  - Compaq Alpha EV6/67
  - SGI MIPS R10000
  - IBM Power series
  - Cray T3E
  - Sun Solaris
  - Pentium Linux and Windows
  - IA-64
  - HP PA-RISC
  - Hitachi
  - Fujitsu
  - NEC
40. Performance Data That May Be Available
- Pipeline stalls due to memory subsystem
- Pipeline stalls due to resource conflicts
- I/D cache misses for different levels
- Cache invalidations
- TLB misses
- TLB invalidations
- Cycle count
- Floating point instruction count
- Integer instruction count
- Instruction count
- Load/store count
- Branch taken / not taken count
- Branch mispredictions
41. Overview of PAPI
- Performance Application Programming Interface
- The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
42. Implementation
- Counters exist as a small set of registers that count events.
- PAPI provides three interfaces to the underlying counter hardware:
  - The low level interface manages hardware events in user defined groups called EventSets.
  - The high level interface simply provides the ability to start, stop, and read the counters for a specified list of events.
  - Graphical tools to visualize the information.
43. Low Level API
- Increased efficiency and functionality over the high level PAPI interface
- There are about 40 functions
- Obtains information about the executable and the hardware
- Thread safe
44. High Level API
- Meant for application programmers wanting coarse-grained measurements
- Calls the low level API
- Not thread safe at the moment
- Only allows PAPI preset events
45. High Level Functions
- PAPI_flops()
- PAPI_num_counters()
  - Number of counters in the system
- PAPI_start_counters() / PAPI_stop_counters()
  - Enable counting of events and describe what to count
- PAPI_read_counters()
  - Returns event counts
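For flavor, the PAPI_flops() call sequence is shown in the comment below. Since it needs libpapi and the hardware counters, the runnable part of this sketch is a stand-in that counts the flops by hand and times the kernel with clock(), producing the same kind of Mflop/s figure the high level API reports:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* With PAPI installed one would bracket the kernel like this
 * (requires <papi.h> and linking with -lpapi):
 *
 *   float rtime, ptime, mflops;
 *   long long flpops;
 *   PAPI_flops(&rtime, &ptime, &flpops, &mflops);  // start/reset
 *   ... kernel ...
 *   PAPI_flops(&rtime, &ptime, &flpops, &mflops);  // counts since start
 *
 * The stand-in below counts the flops itself. */
double daxpy_mflops(double alpha, const double *x, double *y,
                    size_t n, long long *flpops) {
    clock_t t0 = clock();
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];   /* 2 flops per iteration */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    *flpops = 2LL * (long long)n;
    return secs > 0.0 ? (double)*flpops / secs / 1e6 : 0.0;
}
```

The advantage of the real counters over this stand-in is exactly the slide's point: the hardware counts every floating point instruction the processor actually retired, with no hand bookkeeping and no clock-resolution problems.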
46. Perfometer Features
- Platform independent visualization of PAPI metrics
- Flexible interface
- Quick interpretation of complex results
- Small footprint (compiled code size < 15 KB)
- Color coding to highlight selected procedures
- Trace file generation or real time viewing
47. PAPI Implementation
48. Graphical Tools: Perfometer Usage
- The application is instrumented with PAPI
  - call perfometer()
- Will be layered over the best existing vendor-specific APIs for these platforms
- When the application is started, the call to perfometer sets up a signal handler and a timer to collect and send the information to a Java applet containing the graphical view
- Sections of code that are of interest can be designated with specific colors
  - Using a call to set_perfometer(color)
49. Perfometer
- call perfometer(red)
50. Early Users of PAPI
- DEEP/PAPI (Pacific Sierra): http://www.psrv.com/deep_papi_top.html
- TAU (Allen Malony, U of Oregon): http://www.cs.uoregon.edu/research/paracomp/tau/
- SvPablo (Dan Reed, U of Illinois): http://vibes.cs.uiuc.edu/Software/SvPablo/svPablo.htm
- Cactus (Ed Seidel, Max Planck/U of Illinois): http://www.aei-potsdam.mpg.de
- Vprof (Curtis Janssen, Sandia Livermore Lab): http://aros.ca.sandia.gov/~cljanss/perf/vprof/
- Cluster Tools (Al Geist, ORNL)
- DynaProf (Phil Mucci, UTK): http://www.cs.utk.edu/~mucci/dynaprof/
51. Next Version of the Perfometer Implementation
(Diagram: multiple applications feed a server, which drives the GUI.)
52. PAPI's Parallel Interface
53. PAPI - Supported Processors
- Intel Pentium, II, III, 4; Itanium
  - Linux 2.4, 2.2, 2.0 with the perf kernel patch
- IBM Power 3, 604, 604e
  - For AIX 4.3 with pmtoolkit (available in 4.3.4) (laderose@us.ibm.com)
- Sun UltraSPARC I, II, III
  - Solaris 2.8
- MIPS R10K, R12K
- AMD Athlon
  - Linux 2.4 with the perf kernel patch
- Cray T3E, SV1, SV2
- Windows 2K and XP
- To download the software see http://icl.cs.utk.edu/papi/
54. (No Transcript)
55. Innovative Computing Laboratory, University of Tennessee
- Numerical Linear Algebra
- Heterogeneous Distributed Computing
- Software Repositories
- Performance Evaluation
- Software and ideas have found their way into many areas of Computational Science
- Around 40 people at the moment...
  - 15 researchers: research assoc / post-doc / research prof
  - 15 students: graduate and undergraduate
  - 8 support staff: secretary, systems, artist
  - 1 long term visitor
- Many opportunities within the group at Tennessee
56. SETI@home
- Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
- Uses data collected with the Arecibo Radio Telescope in Puerto Rico.
- When a participant's computer is idle or being wasted, the software downloads a 300 kilobyte chunk of data for analysis.
- The results of this analysis are sent back to the SETI team, combined with those of thousands of other participants.
- Largest distributed computation project in existence
  - ~400,000 machines
  - Averaging 27 Tflop/s
- Today many companies are trying this for profit.
57. Distributed and Parallel Systems
(Spectrum, from distributed/heterogeneous to massively parallel/homogeneous: SETI@home (27 Tflop/s), Entropia, Grid based computing, network of workstations, Beowulf cluster, parallel distributed memory, clusters with special interconnect, ASCI Tflops (7 Tflop/s).)
Distributed systems:
- Gather (unused) resources; steal cycles
- System SW manages resources; system SW adds value
- 10-20% overhead is OK
- Resources drive applications
- Time to completion is not critical
- Time-shared
Parallel systems:
- Bounded set of resources
- Apps grow to consume all cycles
- Application manages resources; system SW gets in the way
- 5% overhead is the maximum
- Apps drive the purchase of equipment
- Real-time constraints
- Space-shared
58. Grids are Hot
- IPG (NAS-NASA): http://nas.nasa.gov/~wej/home/IPG
- Globus: http://www.globus.org/
- Legion: http://www.cs.virginia.edu/~grimshaw/
- AppLeS: http://www-cse.ucsd.edu/groups/hpcl/apples
- NetSolve: http://www.cs.utk.edu/netsolve/
- NINF: http://phase.etl.go.jp/ninf/
- Condor: http://www.cs.wisc.edu/condor/
- CUMULVS: http://www.epm.ornl.gov/cs/cumulvs.html
- WebFlow: http://www.npac.syr.edu/users/gcf/
59. The Grid
- Treats CPU cycles and software like commodities.
- "Napster on steroids."
- Enables the coordinated use of geographically distributed resources in the absence of central control and existing trust relationships.
- Computing power is produced much like utilities such as power and water are produced for consumers.
- Users will have access to power on demand.
- "When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances" (Gilder Technology Report, June 2000)
60. The Grid
61. The Grid Architecture Picture
(Diagram, top to bottom:)
- User portals: problem solving environments, application science portals
- Grid service layers: access, info, co-scheduling, fault tolerance, authentication, events, naming, files
- Resource layer: computers, databases, online instruments, software
- High speed networks and routers
62. Globus Grid Services
- The Globus toolkit provides a range of basic Grid services
  - Security, information, fault detection, communication, resource management, ...
- These services are simple and orthogonal
  - Can be used independently; mix and match
  - Programming model independent
- For each there are well-defined APIs
- Standards are used extensively
  - E.g., LDAP, GSS-API, X.509, ...
- You don't program in Globus; it's a set of tools, like Unix
63. NetSolve: Network Enabled Server
- NetSolve is an example of a grid based hardware/software server.
- Ease of use is paramount.
- Based on an RPC model, but with:
  - resource discovery, dynamic problem solving capabilities, load balancing, fault tolerance, asynchronicity, security, ...
- Other examples are NEOS from Argonne and NINF from Japan.
- The goal is to use resources for a single application, not to tie together geographically distributed resources.
64. NetSolve: The Big Picture
(Diagram: a client (Matlab, Mathematica, C, Fortran, Java, Excel) issues a request such as Op(C, A, B); the agent(s), backed by a schedule database, direct it to one of the servers S1-S4.)
No knowledge of the grid required; RPC like.
65. Basic Usage Scenarios
- Grid based numerical library routines
  - The user doesn't have to have the software library on their machine: LAPACK, SuperLU, ScaLAPACK, PETSc, AZTEC, ARPACK
- Task farming applications
  - Pleasantly parallel execution
  - e.g. parameter studies
- Remote application execution
  - Complete applications, with the user specifying input parameters and receiving output
- "Blue collar" grid based computing
  - Does not require deep knowledge of network programming
  - Level of expressiveness right for many users
  - User can set things up; no superuser access required
  - In use today, up to 200 servers in 9 countries
  - Can plug into Globus, Condor, NINF, ...
66. NetSolve Agent
- Name server for the NetSolve system.
- Information service
  - client users and administrators can query the hardware and software services available.
- Resource scheduler
  - maintains both static and dynamic information regarding the NetSolve server components, to use for the allocation of resources.
67. NetSolve Agent
- Resource scheduling (cont'd)
  - CPU performance (LINPACK)
  - Network bandwidth, latency
  - Server workload
  - Problem size / algorithm complexity
- Calculates a "time to compute" for each appropriate server.
- Notifies the client of the most appropriate server.
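That selection step can be sketched as a small cost model. The structure and formula here are illustrative only (the field names are hypothetical, not NetSolve's), but they combine the slide's four inputs, CPU performance, network bandwidth/latency, workload, and problem size, into an estimated time to compute per server:

```c
#include <assert.h>

/* Per-server state the agent tracks (illustrative fields). */
typedef struct {
    double latency_s;      /* network latency to the server, seconds  */
    double bandwidth_Bps;  /* network bandwidth, bytes/second         */
    double mflops;         /* CPU performance (LINPACK-style rating)  */
    double load;           /* current workload, 0.0 (idle) .. 1.0     */
} server_t;

/* Estimated "time to compute": ship the data, then run at the fraction
 * of the server's speed its current load leaves free. */
double time_to_compute(const server_t *s, double bytes, double mflop_work) {
    return s->latency_s
         + bytes / s->bandwidth_Bps
         + mflop_work / (s->mflops * (1.0 - s->load));
}

/* Agent decision: notify the client of the server with the smallest
 * estimated time. */
int pick_server(const server_t *servers, int ns,
                double bytes, double mflop_work) {
    int best = 0;
    for (int i = 1; i < ns; i++)
        if (time_to_compute(&servers[i], bytes, mflop_work)
            < time_to_compute(&servers[best], bytes, mflop_work))
            best = i;
    return best;
}
```

Note how the winner depends on the problem: a small job favors a nearby idle server, while a huge one can justify shipping data to a distant but much faster machine.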
68. NetSolve Client
- Function based interface.
- The client program embeds calls from NetSolve's API to access additional resources.
- Interface available to C, Fortran, Matlab, Mathematica, and Java.
- Opaque networking interactions.
- NetSolve can be invoked using a variety of methods: blocking, non-blocking, task farms, ...
69. NetSolve Client
- Intuitive and easy to use.
- Matlab matrix multiply, e.g.:
  - A = matmul(B, C)
  - becomes A = netsolve(matmul, B, C)
- Possible parallelism is hidden.
70. NetSolve Client
- Client makes a request to the agent.
- Agent returns a list of servers.
- Client tries each one in turn until one executes successfully or the list is exhausted.
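The failover loop itself is tiny. A hypothetical sketch (the callback type stands in for an actual remote NetSolve call):

```c
#include <assert.h>

/* One remote attempt: returns 0 on success, nonzero on failure. */
typedef int (*remote_call_fn)(void);

/* Try each server from the agent's list in order; return the index of
 * the first that succeeds, or -1 if the list is exhausted. */
int try_servers(const remote_call_fn *servers, int nservers) {
    for (int i = 0; i < nservers; i++)
        if (servers[i]() == 0)
            return i;
    return -1;
}
```

This is where NetSolve's fault tolerance lives from the client's point of view: a dead or overloaded server just means moving on to the next candidate.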
71. NPACI Alpha Project - MCell: 3-D Monte-Carlo Simulation of Neurotransmitter Release Between Cells
- UCSD (F. Berman, H. Casanova, M. Ellisman), Salk Institute (T. Bartol), CMU (J. Stiles), UTK (J. Dongarra, R. Wolski)
- Study how neurotransmitters diffuse and activate receptors in synapses
- (Figure legend: blue = unbound, red = singly bound, green = doubly bound closed, yellow = doubly bound open)
72. MCell: 3-D Monte-Carlo Simulation of Neurotransmitter Release Between Cells
- Developed at the Salk Institute and CMU
- In the past, manually run on available workstations
- Transparent parallelism, load balancing, fault-tolerance
- Fits the farming semantics of, and need for, NetSolve
- Collaboration with the AppLeS project for scheduling tasks
(Diagram: a list of seeds feeds AppLeS, which dispatches MCell scripts to the NetSolve servers.)
73. IPARS
- Integrated Parallel Accurate Reservoir Simulator
- Mary Wheeler's group, UT-Austin
- Reservoir and environmental simulation
  - models black oil, waterflood, compositional
  - 3D transient flow of multiple phases
- Integrates existing simulators
  - Framework simplifies development
  - Provides solvers, handling for wells, table lookup
  - Provides pre/postprocessor, visualization
- Full IPARS access without installation
- IPARS interfaces: C, Fortran, Matlab, Mathematica, and Web
74. NetSolve and SCIRun
SCIRun torso defibrillator application (Chris Johnson, U of Utah)
75. NetSolve: A Plug into the Grid
(Diagram: C and Fortran clients call NetSolve; NetSolve's grid middleware provides resource discovery, fault tolerance, system management, and resource scheduling through proxies for Globus, NetSolve, Ninf, and Condor.)
76. NetSolve: A Plug into the Grid
(Same diagram, adding the grid back-ends behind the proxies: Globus, NetSolve servers, Ninf servers, and Condor.)
77. NetSolve: A Plug into the Grid
(Same diagram, adding the PSE front-ends: Matlab, Mathematica, custom clients, and SCIRun, which reach NetSolve through remote procedure call alongside C and Fortran.)
78. University of Tennessee Deployment: Scalable Intracampus Research Grid, SInRG (NSF Infrastructure Award)
- SInRG equipment: 8 Grid Service Clusters deployed, 6 more to come
- Federated ownership: CS, Chem. Eng., Medical School, Computational Ecology, El. Eng.
- Real applications, middleware development, logistical networking
The Knoxville campus has two DS-3 commodity Internet connections and one DS-3 Internet2/Abilene connection. An OC-3 ATM link routes IP traffic between the Knoxville campus, the National Transportation Research Center, and Oak Ridge National Laboratory. UT participates in several national networking initiatives including Internet2 (I2), Abilene, the federal Next Generation Internet (NGI) initiative, the Southern Universities Research Association (SURA) Regional Information Infrastructure (RII), and Southern Crossroads (SoX). The UT campus consists of a meshed ATM OC-12 being migrated over to switched Gigabit by early 2002.
79. SInRG's Vision
- SInRG provides a testbed for:
  - CS grid middleware
  - Computational Science applications
- Many hosts, co-existing in a loose confederation tied together with high-speed links.
- Users have the illusion of a very powerful computer on the desk.
- Spectrum of users
80. GrADS - Three Research and Technology Thrusts
- GrADS: Grid Application Development Software
  - NSF Next Generation Software (NGS) effort
- GrADS PIs: Berman, Chien, Cooper, Dongarra, Foster, Gannon, Johnsson, Kennedy, Kesselman, Mellor-Crummey, Reed, Torczon, Wolski
- GrADSoft
  - Software infrastructure for programming and running on the Grid
  - Reconfigurable object programs
  - Performance contracts
  - Core Grid technologies: Globus, NetSolve, NWS, Autopilot, AppLeS, Portals, Cactus
- MacroGrid
  - Persistent multi-institution Grid testbed
- MicroGrid
  - Portable Grid emulator
81. Grid-Aware Numerical Libraries
- Using ScaLAPACK and PETSc on the Grid: Early Experiences
82. Grid-Aware Numerical Libraries
- Using ScaLAPACK and PETSc on the Grid: Early Experiences
In some sense ScaLAPACK is not an ideal application for the Grid, but it expanded our understanding of how the various GrADS components fit together. The key is managing dynamism.
83. ScaLAPACK
- ScaLAPACK is a portable distributed memory numerical library
- Complete numerical library for dense matrix computations
- Designed for distributed parallel computing (MPPs and clusters) using MPI
- One of the first math software packages to do this
- Numerical software that will work on a heterogeneous platform
- Funding from DOE, NSF, and DARPA
- In use today by IBM, HP-Convex, Fujitsu, NEC, Sun, SGI, Cray, NAG, IMSL, ...
  - They tailor performance and provide support
84. ScaLAPACK Grid Enabled
- Implement a version of a ScaLAPACK library routine that runs on the Grid.
- Make use of the resources at the user's disposal
- Provide the best time to solution
- Proceed without the user's involvement
- Make as few changes as possible to the numerical software.
- The assumption is that the user is already Grid enabled and runs a program that contacts the execution environment to determine where the execution should take place.
85. To Use ScaLAPACK a User Must:
- Download the package and auxiliary packages (like PBLAS, BLAS, BLACS, MPI) to the machines.
- Write an SPMD program which:
  - Sets up the logical 2-D process grid
  - Places the data on the logical process grid
  - Calls the numerical library routine in an SPMD fashion
  - Collects the solution after the library routine finishes
- The user must allocate the processors and decide the number of processes the application will run on
- The user must start the application
  - mpirun -np N user_app
  - Note: the number of processors is fixed by the user before the run, even if the problem size changes dynamically
- Upon completion, return the processors to the pool of resources
86. GrADS Numerical Library
- Want to relieve the user of some of the tasks:
- Make decisions on which machines to use based on the user's problem and the state of the system
  - Determine the machines that can be used
  - Optimize for the best time to solution
  - Distribute the data on the processors and collect the results
  - Start the SPMD library routine on all the platforms
  - Check to see if the computation is proceeding as planned
    - If not, perhaps migrate the application
87. GrADS Library Sequence
(Diagram: User -> Library Routine.)
- Crafted code makes things work correctly and together.
- Assumptions: the Autopilot Manager has been started and Globus is there.
88. Resource Selector
(Diagram: User -> Library Routine -> Resource Selector.)
- Uses Globus MDS and Rich Wolski's NWS to build an array of values for the machines that are available to the user.
  - 2 matrices (bandwidth, latency), 2 arrays (CPU, memory available)
  - Matrix information is clique based
- On return from the RS, the crafted code filters the information to use only machines that have the necessary software and are really eligible to be used.
89. Arrays of Values Generated by the Resource Selector
- Clique based
  - 2 @ UT, UCSD, UIUC
  - Part of the MacroGrid
  - Full at the cluster level, plus the connections (clique leaders)
- Bandwidth and latency matrices; linear arrays for CPU and memory
- The matrices of values are filled out to generate a complete, dense matrix of values.
- At this point we have a workable coarse grid:
  - We know what is available, the connections, and the power of the machines.
90. ScaLAPACK Performance Model
- Total number of floating-point operations per processor
- Total number of data items communicated per processor
- Total number of messages
- Time per floating point operation
- Time per data item communicated
- Time per message
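These six quantities combine into a single predicted execution time. A sketch of the model (the LU flop count is the standard leading-order term, shown only as an illustration, not the exact ScaLAPACK model):

```c
#include <assert.h>

/* T = (#flops)*tf + (#words)*tv + (#messages)*tm, all per processor:
 * tf = time per flop, tv = time per word, tm = time per message. */
double model_time(double flops, double words, double msgs,
                  double tf, double tv, double tm) {
    return flops * tf + words * tv + msgs * tm;
}

/* Leading-order per-processor flop count for LU factorization of an
 * n x n matrix over p processors (illustrative). */
double lu_flops_per_proc(double n, double p) {
    return 2.0 * n * n * n / (3.0 * p);
}
```

The point of the model is that tf, tv, and tm differ wildly between a cluster node and a wide-area link, so the same counts can give very different predicted times on different machine subsets.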
91. Performance Model
(Diagram: User -> Library Routine -> Resource Selector -> Performance Model.)
- The Performance Model uses the information generated by the RS to decide on the fine grid.
- Pick a machine that is closest to every other machine in the collection.
- If there is not enough memory, add machines until the problem can be solved.
- The cost model is run on this set.
- The process adds a machine to the group and reruns the cost model.
- If better, iterate the last step; if not, stop.
92. Resource Selector / Performance Modeler
- Refines the coarse grid by determining the process set that will provide the best time to solution.
- This is based on dynamic information from the grid and the routine's performance model.
- The PM does a simulation of the actual application using the information from the RS.
  - It literally runs the program without doing the computation or data movement.
- There is no backtracking in the optimizer.
  - This is an area for enhancement and experimentation.
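The no-backtracking search can be sketched as a greedy loop over a toy cost model: each added machine contributes its speed but also fixed communication overhead, and the optimizer keeps adding machines only while the predicted time improves. The model and all numbers here are illustrative, not the GrADS code:

```c
#include <assert.h>

/* Toy prediction: work shared by the chosen machines' combined speed,
 * plus per-machine communication overhead. */
double predicted_time(double work, double comm_per_proc,
                      const double *speed, int p) {
    double total = 0.0;
    for (int i = 0; i < p; i++)
        total += speed[i];
    return work / total + comm_per_proc * p;
}

/* Greedy, no backtracking: machines are offered in the resource
 * selector's order; stop at the first one that does not help. */
int greedy_select(double work, double comm_per_proc,
                  const double *speed, int navail) {
    int p = 1;
    double best = predicted_time(work, comm_per_proc, speed, 1);
    while (p < navail) {
        double t = predicted_time(work, comm_per_proc, speed, p + 1);
        if (t >= best)
            break;   /* adding this machine makes things worse */
        best = t;
        p++;
    }
    return p;
}
```

The weakness the slide admits is visible here: once the loop stops, a better set reachable only by first accepting a temporarily worse machine is never found.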
93. Contract Development
(Diagram: User -> Library Routine -> Resource Selector -> Performance Model -> Contract Development.)
- A contract between the application and the Grid system
- The CD should validate the fine grid.
- Should iterate between the CD and PM phases to get a workable fine grid.
94. Application Launcher
(Diagram: User -> Library Routine -> Resource Selector -> Performance Model -> Contract Development -> App Launcher.)
mpirun -machinefile fine_grid grid_linear_solve (via the Globus RSL)
95. Experimental Hardware / Software Grid: MacroGrid Testbed
- Autopilot version 2.3
- Globus version 1.1.3
- NWS version 2.0.pre2
- MPICH-G version 1.1.2
- ScaLAPACK version 1.6
- ATLAS/BLAS version 3.0.2
- BLACS version 1.1
- PAPI version 1.1.5
- GrADS crafted code
Independent components being put together and interacting.
96. Performance Model Validation
- Speed = 60% of peak
- Latency in msec
- Bandwidth in Mb/s
- This is for a refined grid
97.
- N = 600, NB = 40, 2 torc procs: ratio 46.12
- N = 1500, NB = 40, 4 torc procs: ratio 15.03
- N = 5000, NB = 40, 6 torc procs: ratio 2.25
- N = 8000, NB = 40, 8 torc procs: ratio 1.52
- N = 10,000, NB = 40, 8 torc procs: ratio 1.29
98.
- OPUS
- OPUS, CYPHER
- OPUS, TORC, CYPHER
- 2 OPUS, 4 TORC, 6 CYPHER
- 8 OPUS, 4 TORC, 4 CYPHER
- 8 OPUS, 2 TORC, 6 CYPHER
- 6 OPUS, 5 CYPHER
- 8 OPUS, 6 CYPHER
- 8 OPUS
- 5 OPUS
- 8 OPUS
99. Largest Problem Solved
- Matrix of size 30,000
  - 7.2 GB for the data
- 32 processors to choose from at UIUC and UT
  - Not all machines have 512 MB; some have as little as 128 MB
- The PM chose 17 machines in 2 clusters from UT
- The computation took 84 minutes
  - 3.6 Gflop/s total
  - 210 Mflop/s per processor
- ScaLAPACK on a cluster of 17 processors would get about 50% of peak
  - Processors are 500 MHz, i.e. 500 Mflop/s peak
- This grid computation was about 20% less than ScaLAPACK
- Compiler analogy
100. Contracts, Checkpointing, Migration
- We are using the University of Illinois' Autopilot to monitor the progress of the execution.
- The application software has the ability to perform a checkpoint and can be restarted.
  - We manually inserted the checkpointing code.
- If the application is not progressing as the contract specifies, we want to take some corrective action:
  - Go back and figure out where the application can be run optimally.
  - Restart the process from the last checkpoint, perhaps rearranging the data to fit the new set of processors.
101. General Library Interface
- We have a start on a general interface for numerical libraries.
- It can be a simple operation to plug in other numerical routines/libraries.
- Developing migration mechanisms for contract violations.
- Today a library writer needs to supply:
  - the numerical routine
  - a performance model
- The rest of the framework can remain the same.
102. Futures for Numerical Algorithms and Software
- Numerical software will be adaptive, exploratory, and intelligent
  - Polyalgorithms and other techniques
- Determinism in numerical computing will be gone.
  - After all, it's not reasonable to ask for exactness in numerical computations.
  - Auditability of the computation; reproducibility at a cost
- The importance of floating point arithmetic will be undiminished.
  - 16, 32, 64, 128 bits and beyond
  - Interval arithmetic
  - Reproducibility, fault tolerance, and auditability
- Adaptivity is key so applications can effectively use the resources.
103Contributors to These Ideas
- Top500
- Erich Strohmaier, LBL, NERSC
- Hans Meuer, Mannheim U
- Horst Simon, LBL, NERSC
- ATLAS
- Antoine Petitet, Sun France
- Clint Whaley, UTK
- Parallel Computing, Vol 27,
No 1-2, pp 3-25, 2001 - PAPI
- Shirley Browne, UTK
- Kevin London, UTK
- Phil Mucci, UTK
- Keith Seymour, UTK
- NetSolve
- Dorian Arnold, UWisc
- Henri Casanova, UCSD
- Michelle Miller, UTK
- Sathish Vadhiyar, UTK
- For additional information see
- http://icl.cs.utk.edu/top500/
- http://icl.cs.utk.edu/atlas/
- http://icl.cs.utk.edu/papi/
- http://icl.cs.utk.edu/scalapack/
- http://icl.cs.utk.edu/netsolve/
- www.cs.utk.edu/dongarra/
Many opportunities within the group at Tennessee
1056 Variations of Matrix Multiply
(Slides 105-113 repeat this title over performance
plots comparing the C and Fortran loop orderings.)
However, this is only part of the story.
114SUN Ultra 2 200 MHz (L1 16 KB, L2 1 MB)
115Cache Blocking
- We want blocks to fit into cache. On a parallel
computer with p processors we have p caches, so the
data may fit into the combined caches even when it
does not fit into one; this can lead to superlinear
speedup. Consider matrix-matrix multiply. - An alternate form is ...
      do k = 1, n
        do j = 1, n
          do i = 1, n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          end do
        end do
      end do
116Cache Blocking
      do ii = 1, n, nblk
        do jj = 1, n, nblk
          do kk = 1, n, nblk
            do k = kk, kk+nblk-1
              do j = jj, jj+nblk-1
                do i = ii, ii+nblk-1
                  c(i,j) = c(i,j) + a(i,k)*b(k,j)
                end do
              end do
            end do
          end do
        end do
      end do
117Assignment
- Implement, in Fortran or C, the six different
ways to perform matrix multiplication by
interchanging the loops. (Use 64-bit arithmetic.)
Make each implementation a subroutine, like - subroutine ijk (a,m,n,lda,b,k,ldb,c,ldc)
- subroutine ikj ( a,m,n,lda,b,k,ldb,c,ldc)
- ...
- Construct a driver program that generates random
matrices and calls each matrix multiply routine
with square matrices of orders 50, 100, 150, 200,
250, and 300, timing the calls and computing the
Mflop/s rate. - Include in your timing routine a call to the
following system-supplied routine
- call dgemm( 'No', 'No', n, n, n, 1.0d0, a, lda, b, ldb,
- 1.0d0, c, ldc )
- Write up a description of the timing and describe
why the routines perform as they do. - Download ATLAS from http://www.netlib.org/atlas/ and
build the ATLAS version of DGEMM and time it.
118EISPACK and LINPACK
- EISPACK
- Designed for the algebraic eigenvalue problem,
Ax = λx and Ax = λBx. - Based on the work of J. Wilkinson and colleagues in the 70s.
- Fortran 77 software based on translation of
ALGOL. - LINPACK
- Designed for solving systems of equations, Ax = b.
- Fortran 77 software using the Level 1 BLAS.
119History of Block Partitioned Algorithms
- Early algorithms involved use of small main
memory using tapes as secondary storage. - Recent work centers on use of vector registers,
level 1 and 2 cache, main memory, and out of
core memory.
120Blocked Partitioned Algorithms
- LU Factorization
- Cholesky factorization
- Symmetric indefinite factorization
- Matrix inversion
- QR, QL, RQ, LQ factorizations
- Form Q or Q^T C
- Orthogonal reduction to
- (upper) Hessenberg form
- symmetric tridiagonal form
- bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue
problems
121LAPACK
- Linear Algebra library in Fortran 77
- Solution of systems of equations
- Solution of eigenvalue problems
- Combine algorithms from LINPACK and EISPACK into
a single package - Efficient on a wide range of computers
- RISC, Vector, SMPs
- User interface similar to LINPACK
- Single, Double, Complex, Double Complex
- Built on the Level 1, 2, and 3 BLAS
122LAPACK
- Most of the parallelism in the BLAS.
- Advantages of using the BLAS for parallelism
- Clarity
- Modularity
- Performance
- Portability
123Derivation of Blocked Algorithms: Cholesky
Factorization A = U^T U
Writing A = U^T U with U upper triangular and equating
coefficients of the jth column, we obtain
a_j = U11^T u_j  and  a_jj = u_j^T u_j + u_jj^2.
Hence, if U11 has already been computed, we can
compute u_j and u_jj from the equations
U11^T u_j = a_j  and  u_jj = sqrt( a_jj - u_j^T u_j ).
124LINPACK Implementation
- Here is the body of the LINPACK routine SPOFA,
which implements the method:
      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF( JM1.LT.1 ) GO TO 20
         DO 10 K = 1, JM1
            T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
            T = T / A( K, K )
            A( K, J ) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A( J, J ) - S
C     ...EXIT
         IF( S.LE.0.0E0 ) GO TO 40
         A( J, J ) = SQRT( S )
   30 CONTINUE
125LAPACK Implementation
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1,
                     A, LDA, A( 1, J ), 1 )
         S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A( J, J ) = SQRT( S )
   10 CONTINUE
- This change by itself is sufficient to
significantly improve the performance on a number
of machines. - From 238 to 312 Mflop/s for a matrix of order 500
on a Pentium 4 at 1.7 GHz. - However, peak is 1,700 Mflop/s,
which suggests further work is needed.
126Derivation of Blocked Algorithms
Equating coefficients of the second block of columns,
we obtain
A12 = U11^T U12.
Hence, if U11 has already been computed, we can
compute U12 as the solution of the triangular system
U11^T U12 = A12
by a call to the Level 3 BLAS routine STRSM.
127LAPACK Blocked Algorithms
      DO 10 J = 1, N, NB
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit',
                     J-1, JB, ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE,
                     A( 1, J ), LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
- On the Pentium 4, the Level 3 BLAS squeeze a lot
more out of one processor.
128LAPACK Contents
- Combines algorithms from LINPACK and EISPACK into
a single package. User interface similar to
LINPACK. - Built on the Level 1, 2 and 3 BLAS, for high
performance (manufacturers optimize BLAS) - LAPACK does not provide routines for structured
problems or general sparse matrices (i.e., sparse
storage formats such as compressed-row, -column,
-diagonal, skyline, ...).
129LAPACK Ongoing Work
- Add functionality
- updating/downdating, divide and conquer least
squares, bidiagonal bisection, bidiagonal inverse
iteration, band SVD, Jacobi methods, ... - Move to new generation of high performance
machines - IBM SPs, CRAY T3E, SGI Origin, clusters of
workstations - New challenges
- New languages: Fortran 90, HPF, ...
- Many flavors of message passing (CMMD, MPL, NX, ...)
made a standard necessary (PVM, MPI): BLACS
- Highly varying ratio of computation to communication
- Many ways to lay out data
- Fastest parallel algorithm is sometimes less stable
numerically.
130Gaussian Elimination
(figure: elimination zeroes the entries below the
diagonal, one column at a time)
Standard way: subtract a multiple of one row from the
rows below it.
131Gaussian Elimination via a Recursive Algorithm
F. Gustavson and S. Toledo
LU Algorithm:
1. Split the matrix into two rectangles (m x n/2);
   if only 1 column, scale it by the reciprocal of
   the pivot and return.
2. Apply the LU algorithm to the left part.
3. Apply the transformations to the right part
   (triangular solve A12 = L^-1 A12 and
   matrix multiplication A22 = A22 - A21*A12).
4. Apply the LU algorithm to the right part.
Most of the work is in the matrix multiply: matrices
of size n/2, n/4, n/8, ...
132Recursive Factorizations
- Just as accurate as conventional method
- Same number of operations
- Automatic variable blocking
- Level 1 and 3 BLAS only !
- Extreme clarity and simplicity of expression
- Highly efficient
- The recursive formulation is just a rearrangement
of the point-wise LINPACK algorithm - The standard error analysis applies (assuming the
matrix operations are computed the conventional
way). - OK for LU, LL^T, QR
- Open question for two-sided algorithms, e.g.,
eigenvalue reduction
133Recursive LU Performance
(plot: Recursive LU vs. LAPACK LU on uniprocessor and
dual-processor systems)
134Challenges in Developing Distributed Memory
Libraries
- How to integrate software?
- Until recently no standards
- Many parallel languages
- Various parallel programming models
- Assumptions about the parallel environment
- granularity
- topology
- overlapping of communication/computation
- development tools
- Where is the data
- Who owns it?
- Optimal data distribution
- Who determines data layout
- Determined by user?
- Determined by library developer?
- Allow dynamic data dist.
- Load balancing
135ScaLAPACK
- Library of software dealing with dense and banded
routines
- MIMD Computers and Networks of Workstations
- Clusters of SMPs
136Programming Style
- SPMD Fortran 77 with object based design
- Built on various modules
- PBLAS Interprocessor communication
- BLACS
- PVM, MPI
- Provides right level of notation.
- BLAS
- LAPACK software expertise/quality
- Software approach
- Numerical methods
137Overall Structure of Software
- Object based - Array descriptor
- Contains information required to establish
mapping between a global array entry and its
corresponding process and memory location. - Provides a flexible framework to easily specify
additional data distributions or matrix types. - Currently dense, banded, out-of-core
- Using the concept of context
138PBLAS
- Similar to the BLAS in functionality and naming.
- Built on the BLAS and BLACS
- Provide global view of matrix
- CALL DGEXXX ( M, N, A( IA, JA ), LDA,... )
- CALL PDGEXXX( M, N, A, IA, JA, DESCA,... )
139ScaLAPACK Structure
Global
Local
140Choosing a Data Distribution
- Main issues are
- Load balancing
- Use of the Level 3 BLAS
141Possible Data Layouts
- 1D block and cyclic column distributions
- 1D block-cycle column and 2D block-cyclic
distribution - 2D block-cyclic used in ScaLAPACK for dense
matrices
142Distribution and Storage
- Matrix is block-partitioned; the map assigns blocks
to processes - Distributed with a 2-D block-cyclic scheme
- Example: a 5x5 matrix partitioned in 2x2 blocks,
mapped onto a 2x2 process grid (process
point of view) - Routines available to distribute/redistribute
data.
143Parallelism in ScaLAPACK
- Level 3 BLAS block operations
- All the reduction routines
- Pipelining
- QR Algorithm, Triangular Solvers, classic
factorizations - Redundant computations
- Condition estimators
- Static work assignment
- Bisection
- Task parallelism
- Sign function eigenvalue computations
- Divide and Conquer
- Tridiagonal and band solvers, symmetric
eigenvalue problem and Sign function - Cyclic reduction
- Reduced system in the band solver
- Data parallelism
- Sign function
145References
- http://www.netlib.org
- http://www.netlib.org/lapack
- http://www.netlib.org/scalapack
- http://www.netlib.org/lapack/lawns
- http://www.netlib.org/atlas
- http://www.netlib.org/papi/
- http://www.netlib.org/netsolve/
- http://www.netlib.org/lapack90
- http://www.nhse.org
- http://www.netlib.org/utk/people/JackDongarra/la-sw.html
- lapack@cs.utk.edu
- scalapack@cs.utk.edu
146Motivation for Grid Computing
In the past Isolation
- Many science and
engineering problems today require that widely
dispersed resources be operated as systems. - Networking, distributed computing, and parallel
computation research have matured to make it
possible for distributed systems to support
high-performance applications, but... - Resources are dispersed
- Connectivity is variable
- Dedicated access may not be possible
Today Collaboration
147Performance Distribution Nov 2001
148Bandwidth Won't Be A Problem Soon -- Bisection
Bandwidth (BB) Across the US
- 1971 - BB 112 Kb/s
- 1986 - BB 1 Mb/s
- 2001 - BB 200 Gb/s
- Today in the lab, 4000 channels on a single fiber,
each channel 10 Gb/s - 12 strands of fiber can carry 4000 x 10 Gb/s, or 40
Tb/s - 5 backbone networks across the US, each with 2 sets
of 12 strands, could provide 2.4 Pb/s
- When the network is as fast as the computer's
internal links, the machine disintegrates across
the net into a set of special-purpose appliances - Gilder Technology Report, June 2000
- Internet traffic doubling every 9 months
- A factor of 100 in 5 years
- BB will grow by a factor of 12,000.