
High Performance Computing and Trends, Enhancing Performance, Measurement Tools

Facultad de Ciencias, Universidad de Los Andes, La Hechicera, Mérida, Venezuela. December 3-5, 2001

- Jack Dongarra
- Innovative Computing Laboratory
- University of Tennessee
- http://www.cs.utk.edu/~dongarra/

Overview

- High Performance Computing
- ATLAS
- PAPI
- NetSolve
- Grid Experiments

Technology Trends: Microprocessor Capacity

[Chart: transistors per chip over time — 2X transistors/chip every 1.5 years, called "Moore's Law".]

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful.

- H. Meuer, H. Simon, E. Strohmaier, JD — listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark (Ax = b, dense problem)
- Updated twice a year: SCxy in the States in November, meeting in Mannheim, Germany in June
- All data available from www.top500.org

[Chart: TPP (Linpack) performance — rate vs. problem size.]

- In 1980 a computation that took 1 full year to complete can now be done in 10 hours!
- In 1980 a computation that took 1 full year to complete can now be done in 16 minutes!
- In 1980 a computation that took 1 full year to complete can today be done in 27 seconds!

Top 10 Machines (November 2001)

Performance Development

[Top500 performance chart: #1 grew from 60 Gflop/s (1993) to 7.2 Tflop/s, #500 from 400 Mflop/s to 94 Gflop/s (Schwab at #24); 394 systems above 100 Gflop/s; growth faster than Moore's law.]

Performance Development

[Extrapolation chart, with "My Laptop" marked: entry at 1 Tflop/s by 2005 and 1 Pflop/s by 2010.]

Petaflop (10^15 flop/s) Computers Within the Next Decade

- Five basic design points:
- Conventional technologies
- 4.8 GHz processor, 8000 nodes, each w/ 16 processors
- Processing-in-memory (PIM) designs
- Reduce memory access bottleneck
- Superconducting processor technologies
- Digital superconductor technology, Rapid Single-Flux-Quantum (RSFQ) logic, hybrid technology multi-threaded (HTMT)
- Special-purpose hardware designs
- Specific applications, e.g. GRAPE Project in Japan for gravitational force computations
- Schemes utilizing the aggregate computing power of processors distributed on the web
- SETI@home: 26 Tflop/s

Petaflops (10^15 flop/s) Computer Today?

- 1 GHz processor (O(10^9) ops/s)
- 1 Million PCs
- $1B ($1K each)
- 100 MWatts
- 5 acres
- 1 Million Windows licenses!!
- PC failure every second

Architectures

[Chart: Top500 architecture classes over time. A "constellation" is a system with # of processors per node ≥ # of nodes.]

Chip Technology

Manufacturer

- IBM 32%, HP 30%, SGI 8%, Cray 8%, SUN 6%, Fujitsu 4%, NEC 3%, Hitachi 3%

Cumulative Performance, Nov 2001

- Sum over all 500 systems: 134.9 TF/s

High-Performance Computing Directions: Beowulf-class PC Clusters

Definition:

- COTS PC Nodes
- Pentium, Alpha, PowerPC, SMP
- COTS LAN/SAN Interconnect
- Ethernet, Myrinet, Giganet, ATM
- Open Source Unix
- Linux, BSD
- Message Passing Computing
- MPI, PVM
- HPF

Advantages:

- Best price-performance
- Low entry-level cost
- Just-in-place configuration
- Vendor invulnerable
- Scalable
- Rapid technology tracking

Enabled by PC hardware, networks and operating system achieving capabilities of scientific workstations at a fraction of the cost, and by the availability of industry standard message passing libraries. However, much more of a contact sport.

- Peak performance
- Interconnection
- http://clusters.top500.org
- Benchmark results to follow in the coming months

Where Does the Performance Go? or: Why Should I Care About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Chart, 1980-2000, log scale: CPU performance ("Moore's Law") improves ~60%/yr (2X/1.5yr), while DRAM improves ~9%/yr (2X/10yrs) — the processor-memory gap widens every year.]

Optimizing Computation and Memory Use

- Computational optimizations
- Theoretical peak = (# fpus) × (flops/cycle) × MHz
- PIII: (1 fpu) × (1 flop/cycle) × (850 MHz) = 850 MFLOP/s
- Athlon: (2 fpus) × (1 flop/cycle) × (600 MHz) = 1200 MFLOP/s
- Power3: (2 fpus) × (2 flops/cycle) × (375 MHz) = 1500 MFLOP/s
- Operations like:
- α = x^T y: 2 operands (16 bytes) needed for 2 flops; at 850 Mflop/s this will require 1700 MW/s bandwidth
- y = α x + y: 3 operands (24 bytes) needed for 2 flops; at 850 Mflop/s this will require 2550 MW/s bandwidth
- Memory optimization
- Theoretical peak = (bus width) × (bus speed)
- PIII: (32 bits) × (133 MHz) = 532 MB/s = 66.5 MW/s
- Athlon: (64 bits) × (133 MHz) = 1064 MB/s = 133 MW/s
- Power3: (128 bits) × (100 MHz) = 1600 MB/s = 200 MW/s
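To make the flops-to-bandwidth gap concrete, here is the AXPY operation from the list above written out as plain C (an illustrative sketch, not vendor BLAS code). Each element costs 2 flops but 3 memory operands — load x[i], load y[i], store y[i] — so the memory bus, not the floating-point unit, limits its speed on all three processors above.

    /* y = alpha*x + y : 2 flops but 24 bytes of memory traffic per element */
    void daxpy(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }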

Memory Hierarchy

- By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.

[Figure: memory hierarchy pyramid, fastest/smallest to slowest/largest — registers (~1 ns, 100s of bytes); on-chip cache (~10s of ns, KBs); level 2 and 3 cache, SRAM (~100s of ns, MBs); main memory, DRAM; remote cluster memory / distributed memory (~100,000 ns, i.e. ~0.1 ms); secondary storage, disk (~10,000,000 ns, i.e. 10s of ms, GBs); tertiary storage, disk/tape (10s of sec, TBs). Processor control and datapath sit at the apex.]

Self-Adapting Numerical Software (SANS)

- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Operations like the BLAS require many man-hours per platform
- Software lags far behind hardware introduction
- Only done if financial incentive is there
- Hardware, compilers, and software have a large design space with many parameters
- Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
- Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors
- Need for quick/dynamic deployment of optimized routines
- ATLAS — Automatically Tuned Linear Algebra Software

Software Generation Strategy - BLAS

- Parameter study of the hardware
- Generate multiple versions of code, with different values of key performance parameters
- Run and measure the performance of the various versions
- Pick the best and generate the library (see the sketch after this list)
- Level 1 cache multiply optimizes for:
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
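A minimal sketch of this generate-and-measure loop, assuming a simple blocked multiply and a handful of candidate blocking factors (the real ATLAS generator searches a far larger parameter space and emits specialized source for each case):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 512

    /* straightforward blocked matrix multiply with blocking factor nb */
    static void mm_blocked(int n, int nb, const double *a, const double *b, double *c)
    {
        for (int ii = 0; ii < n; ii += nb)
            for (int kk = 0; kk < n; kk += nb)
                for (int jj = 0; jj < n; jj += nb)
                    for (int i = ii; i < ii + nb && i < n; i++)
                        for (int k = kk; k < kk + nb && k < n; k++) {
                            double aik = a[i*n + k];
                            for (int j = jj; j < jj + nb && j < n; j++)
                                c[i*n + j] += aik * b[k*n + j];
                        }
    }

    int main(void)
    {
        double *a = malloc(N*N*sizeof *a), *b = malloc(N*N*sizeof *b),
               *c = malloc(N*N*sizeof *c);
        int cand[] = {16, 32, 40, 64, 80}, best = 0;
        double best_t = 1e30;
        for (int i = 0; i < N*N; i++) { a[i] = 1.0; b[i] = 2.0; }
        /* time each candidate version and keep the fastest */
        for (int v = 0; v < 5; v++) {
            for (int i = 0; i < N*N; i++) c[i] = 0.0;
            clock_t t0 = clock();
            mm_blocked(N, cand[v], a, b, c);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("NB = %2d: %.3f s\n", cand[v], t);
            if (t < best_t) { best_t = t; best = cand[v]; }
        }
        printf("picked NB = %d for the generated library\n", best);
        free(a); free(b); free(c);
        return 0;
    }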

- Takes 20 minutes to run.
- New model of high performance programming where critical code is machine generated using parameter optimization.
- Designed for RISC architectures
- Super scalar
- Need reasonable C compiler
- Today ATLAS is in use by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, ...
ATLAS (DGEMM, n = 500), 64-bit floating point results

- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.

Recursive Approach for Other Level 3 BLAS

Recursive TRMM

- Recur down to L1 cache block size
- Need kernel at bottom of recursion
- Use gemm-based kernel for portability

Intel PIII 933 MHz: MKL 5.0 vs ATLAS 3.2.0 using Windows 2000

- ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.

ATLAS Matrix Multiply (64-bit floating point results)

Intel IA-64

Intel P4 64-bit fl pt

AMD Athlon

Pentium 4 - SSE2

- 1.5 GHz, 400 MHz system bus, 16K L1 / 256K L2 cache, theoretical peak of 1.5 Gflop/s, high power consumption
- Streaming SIMD Extensions 2 (SSE2)
- consists of 144 new instructions
- includes SIMD IEEE double precision floating point (vector instructions)
- Peak for 64-bit floating point: 2X
- Peak for 32-bit floating point: 4X
- SIMD 128-bit integer
- new cache and memory management instructions
- Intel's compiler supports these instructions today

ATLAS Matrix Multiply: Intel Pentium 4 using SSE2

- P4 32-bit fl pt using SSE2
- P4 64-bit fl pt using SSE2
- P4 64-bit fl pt

$250/processor, < $1000 for the system → < $0.50/Mflops!!


Machine-Assisted Application Development and Adaptation

- Communication libraries
- Optimize for the specifics of one's configuration
- Algorithm layout and implementation
- Look at the different ways to express the implementation

Work in Progress: ATLAS-like Approach Applied to Broadcast (PII 8-way cluster with 100 Mb/s switched network)

Reformulating/Rearranging/Reuse

- Example is the reduction to narrow band form for the SVD
- Fetch each entry of A once
- Restructure and combine operations
- Results in a speedup of > 30%

Conjugate Gradient Variants by Dynamic Selection at Run Time

- Variants combine inner products to reduce the communication bottleneck at the expense of more scalar ops
- Same number of iterations, no advantage on a sequential processor
- With a large number of processors and a high-latency network it may be advantageous
- Improvements can range from 15% to 50% depending on size


Related Tuning Projects

- PHiPAC
- Portable High Performance ANSI C; the initial automatic GEMM generation project
- http://www.icsi.berkeley.edu/~bilmes/phipac
- FFTW: Fastest Fourier Transform in the West
- http://www.fftw.org
- UHFFT
- tuning parallel FFT algorithms
- http://rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
- SPIRAL
- Signal Processing Algorithms Implementation Research for Adaptable Libraries; maps DSP algorithms to architectures
- http://www.ece.cmu.edu/~spiral/
- Sparsity
- Sparse-matrix-vector and sparse-matrix-matrix multiplication; tunes code to the sparsity structure of the matrix (more later in this tutorial)
- http://www.cs.berkeley.edu/~ejim/publication/
- University of Tennessee

Tools for Performance Evaluation

- Timing and performance evaluation has been an art
- Resolution of the clock
- Issues about cache effects
- Different systems
- Can be cumbersome and inefficient with traditional tools
- Situation about to change
- Today's processors have internal counters

Performance Counters

- Almost all high performance processors include hardware performance counters.
- Some are easy to access, others not available to users.
- On most platforms the APIs, if they exist, are not appropriate for the end user or not well documented.
- Existing performance counter APIs:
- Compaq Alpha EV6/EV67
- SGI MIPS R10000
- IBM Power Series
- CRAY T3E
- Sun Solaris
- Pentium Linux and Windows
- IA-64
- HP-PA RISC
- Hitachi
- Fujitsu
- NEC

Performance Data That May Be Available

- Pipeline stalls due to memory subsystem
- Pipeline stalls due to resource conflicts
- I/D cache misses for different levels
- Cache invalidations
- TLB misses
- TLB invalidations

- Cycle count
- Floating point instruction count
- Integer instruction count
- Instruction count
- Load/store count
- Branch taken / not taken count
- Branch mispredictions

Overview of PAPI

- Performance Application Programming Interface
- The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.

Implementation

- Counters exist as a small set of registers that count events.
- PAPI provides three interfaces to the underlying counter hardware:
- The low level interface manages hardware events in user defined groups called EventSets.
- The high level interface simply provides the ability to start, stop and read the counters for a specified list of events.
- Graphical tools to visualize information.

Low Level API

- Increased efficiency and functionality over the high level PAPI interface
- There are about 40 functions
- Obtain information about the executable and the hardware
- Thread safe
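A minimal sketch of the low-level EventSet workflow in C; the calls follow the documented PAPI low-level API, though exact signatures have varied slightly across releases, and error handling is abbreviated:

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int es = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_TOT_CYC);   /* total cycles */
        PAPI_add_event(es, PAPI_FP_OPS);    /* floating point operations */

        PAPI_start(es);
        /* ... code section to be measured ... */
        PAPI_stop(es, counts);

        printf("cycles = %lld, fp ops = %lld\n", counts[0], counts[1]);
        return 0;
    }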

High Level API

- Meant for application programmers wanting coarse-grained measurements
- Calls the lower level API
- Not thread safe at the moment
- Only allows PAPI preset events

High Level Functions

- PAPI_flops()
- PAPI_num_counters()
- Number of counters in the system
- PAPI_start_counters()
- PAPI_stop_counters()
- Enable counting of events and describe what to count
- PAPI_read_counters()
- Returns event counts
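For example, the classic high-level interface can be used in a couple of lines; this sketch assumes the signatures from the early PAPI releases, where PAPI_flops() starts counting on its first call and reports totals on later calls:

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        float rtime, ptime, mflops;
        long long flpops;

        PAPI_flops(&rtime, &ptime, &flpops, &mflops);  /* first call: start counting */
        /* ... numerical kernel ... */
        PAPI_flops(&rtime, &ptime, &flpops, &mflops);  /* later call: read totals */
        printf("%lld flops, %.1f Mflop/s\n", flpops, mflops);
        return 0;
    }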

Perfometer Features

- Platform independent visualization of PAPI metrics
- Flexible interface
- Quick interpretation of complex results
- Small footprint (compiled code size < 15k)
- Color coding to highlight selected procedures
- Trace file generation or real time viewing

PAPI Implementation

Graphical Tools: Perfometer Usage

- Application is instrumented with PAPI
- call perfometer()
- Will be layered over the best existing vendor-specific APIs for these platforms
- Application is started; at the call to perfometer, a signal handler and a timer are set to collect and send the information to a Java applet containing the graphical view
- Sections of code that are of interest can be designated with specific colors
- Using a call to set_perfometer(color)

Perfometer

Call Perfometer(red)

Early Users of PAPI

- DEEP/PAPI (Pacific Sierra): http://www.psrv.com/deep_papi_top.html
- TAU (Allen Malony, U of Oregon): http://www.cs.uoregon.edu/research/paracomp/tau/
- SvPablo (Dan Reed, U of Illinois): http://vibes.cs.uiuc.edu/Software/SvPablo/svPablo.htm
- Cactus (Ed Seidel, Max Planck/U of Illinois): http://www.aei-potsdam.mpg.de
- Vprof (Curtis Janssen, Sandia Livermore Lab): http://aros.ca.sandia.gov/~cljanss/perf/vprof/
- Cluster Tools (Al Geist, ORNL)
- DynaProf (Phil Mucci, UTK): http://www.cs.utk.edu/~mucci/dynaprof/

Next Version of Perfometer Implementation

[Diagram: multiple applications send data to a server, which feeds the GUI — PAPI's parallel interface.]

PAPI - Supported Processors

- Intel Pentium, II, III, 4, Itanium
- Linux 2.4, 2.2, 2.0 and perf kernel patch
- IBM Power 3, 604, 604e
- For AIX 4.3 and pmtoolkit (available in 4.3.4)
- (laderose@us.ibm.com)
- Sun UltraSparc I, II, III
- Solaris 2.8
- MIPS R10K, R12K
- AMD Athlon
- Linux 2.4 and perf kernel patch
- Cray T3E, SV1, SV2
- Windows 2K and XP
- To download software see
- http://icl.cs.utk.edu/papi/


Innovative Computing Laboratory, University of Tennessee

- Numerical Linear Algebra
- Heterogeneous Distributed Computing
- Software Repositories
- Performance Evaluation
- Software and ideas have found their way into many areas of Computational Science
- Around 40 people at the moment...
- 15 Researchers: Research Assoc/Post-Doc/Research Prof
- 15 Students: Graduate and Undergraduate
- 8 Support staff: Secretary, Systems, Artist
- 1 Long term visitor
- Many opportunities within the group at Tennessee

SETI@home

- Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
- Uses data collected with the Arecibo Radio Telescope, in Puerto Rico
- When their computer is idle or being wasted, this software will download a 300 kilobyte chunk of data for analysis.
- The results of this analysis are sent back to the SETI team, combined with those of thousands of other participants.
- Largest distributed computation project in existence
- 400,000 machines
- Averaging 27 Tflop/s
- Today many companies are trying this for profit.

Distributed and Parallel Systems

[Spectrum from tightly to loosely coupled: massively parallel systems, homogeneous (e.g. ASCI Tflops, 7 Tflop/s) — clusters w/ special interconnect — parallel distributed memory — Beowulf clusters — networks of workstations — grid based computing — distributed systems, heterogeneous (e.g. SETI@home, 27 Tflop/s).]

Grid based Computing (e.g. Entropia):

- Gather (unused) resources
- Steal cycles
- System SW manages resources
- System SW adds value
- 10-20% overhead is OK
- Resources drive applications
- Time to completion is not critical
- Time-shared

By contrast, tightly coupled parallel computing:

- Bounded set of resources
- Apps grow to consume all cycles
- Application manages resources
- System SW gets in the way
- 5% overhead is maximum
- Apps drive purchase of equipment
- Real-time constraints
- Space-shared

Grids are Hot

- IPG NAS-NASA: http://nas.nasa.gov/~wej/home/IPG
- Globus: http://www.globus.org/
- Legion: http://www.cs.virginia.edu/~grimshaw/
- AppLeS: http://www-cse.ucsd.edu/groups/hpcl/apples
- NetSolve: http://www.cs.utk.edu/netsolve/
- NINF: http://phase.etl.go.jp/ninf/
- Condor: http://www.cs.wisc.edu/condor/
- CUMULVS: http://www.epm.ornl.gov/cs/cumulvs.html
- WebFlow: http://www.npac.syr.edu/users/gcf/

The Grid

- To treat CPU cycles and software like commodities.
- "Napster on steroids."
- Enable the coordinated use of geographically distributed resources in the absence of central control and existing trust relationships.
- Computing power is produced much like utilities such as power and water are produced for consumers.
- Users will have access to power on demand.
- "When the Network is as fast as the computer's internal links, the machine disintegrates across the Net into a set of special purpose appliances" — Gilder Technology Report, June 2000

The Grid Architecture Picture

[Layered diagram — top: user portals, problem solving environments, application science portals; service layers: grid access, info, co-scheduling, fault tolerance, authentication, events, naming, files; resource layer: computers, databases, online instruments, software, high speed networks and routers.]

Globus Grid Services

- The Globus toolkit provides a range of basic Grid services
- Security, information, fault detection, communication, resource management, ...
- These services are simple and orthogonal
- Can be used independently, mix and match
- Programming model independent
- For each there are well-defined APIs
- Standards are used extensively
- E.g., LDAP, GSS-API, X.509, ...
- You don't program "in Globus"; it's a set of tools, like Unix

NetSolve: Network Enabled Server

- NetSolve is an example of a grid based hardware/software server.
- Ease-of-use paramount
- Based on an RPC model, but with resource discovery, dynamic problem solving capabilities, load balancing, fault tolerance, asynchronicity, security, ...
- Other examples are NEOS from Argonne and NINF from Japan.
- The goal is to use resources, not to tie together geographically distributed resources for a single application.

NetSolve: The Big Picture

[Diagram: a client (Matlab, Mathematica, C, Fortran, Java, Excel) sends a request Op(C, A, B) to the agent(s), which consult a schedule database and hand the request to one of the servers S1-S4. No knowledge of the grid required; RPC-like.]

Basic Usage Scenarios

- Grid based numerical library routines
- User doesn't have to have the software library on their machine: LAPACK, SuperLU, ScaLAPACK, PETSc, AZTEC, ARPACK
- Task farming applications
- Pleasantly parallel execution
- e.g. parameter studies
- Remote application execution
- Complete applications with user specifying input parameters and receiving output
- "Blue Collar" Grid Based Computing
- Does not require deep knowledge of network programming
- Level of expressiveness right for many users
- User can set things up, no su required
- In use today, up to 200 servers in 9 countries
- Can plug into Globus, Condor, NINF, ...

NetSolve Agent

- Name server for the NetSolve system
- Information Service
- client users and administrators can query the hardware and software services available
- Resource scheduler
- maintains both static and dynamic information regarding the NetSolve server components to use for the allocation of resources

NetSolve Agent

- Resource Scheduling (cont'd)
- CPU performance (LINPACK)
- Network bandwidth, latency
- Server workload
- Problem size / algorithm complexity
- Calculates a "Time to Compute" for each appropriate server
- Notifies client of the most appropriate server

NetSolve Client

- Function based interface
- Client program embeds a call from NetSolve's API to access additional resources
- Interface available to C, Fortran, Matlab, Mathematica, and Java
- Opaque networking interactions
- NetSolve can be invoked using a variety of methods: blocking, non-blocking, task farms, ...

NetSolve Client

- Intuitive and easy to use.
- Matlab matrix multiply, e.g.:
- A = matmul(B, C) becomes
- A = netsolve('matmul', B, C)
- Possible parallelisms hidden.

NetSolve Client

- Client makes request to agent.
- Agent returns list of servers.
- Client tries each one in turn until one executes successfully or the list is exhausted.

NPACI Alpha Project - MCell: 3-D Monte-Carlo Simulation of Neuro-Transmitter Release Between Cells

- UCSD (F. Berman, H. Casanova, M. Ellisman), Salk Institute (T. Bartol), CMU (J. Stiles), UTK (Dongarra, R. Wolski)
- Study how neurotransmitters diffuse and activate receptors in synapses
- [Legend: blue = unbound, red = singly bound, green = doubly bound closed, yellow = doubly bound open]

MCell: 3-D Monte-Carlo Simulation of Neuro-Transmitter Release Between Cells

- Developed at Salk Institute, CMU
- In the past, manually run on available workstations
- Transparent parallelism, load balancing, fault-tolerance
- Fits the farming semantics and the need for NetSolve
- Collaboration with AppLeS Project for scheduling tasks

[Diagram: AppLeS takes a list of seeds and farms MCell scripts out across the NetSolve servers.]

- Integrated Parallel Accurate Reservoir Simulator (IPARS)
- Mary Wheeler's group, UT-Austin
- Reservoir and environmental simulation
- models black oil, waterflood, compositional flows
- 3D transient flow of multiple phases
- Integrates existing simulators
- Framework simplifies development
- Provides solvers, handling for wells, table lookup
- Provides pre/postprocessor, visualization
- Full IPARS access without installation
- IPARS interfaces: C, FORTRAN, Matlab, Mathematica, and Web

NetSolve and SCIRun

SCIRun torso defibrillator application — Chris Johnson, U of Utah


NetSolve: A Plug into the Grid

[Layered diagram — PSE front-ends (Matlab, Mathematica, Custom, SCIRun) make remote procedure calls (C, Fortran) into the NetSolve grid middleware (resource discovery, fault tolerance, system management, resource scheduling), which dispatches through proxies (Globus proxy, NetSolve proxy, Ninf proxy, Condor proxy) to the grid back-ends (Globus, NetSolve servers, Ninf servers, Condor).]

University of Tennessee Deployment: Scalable Intracampus Research Grid — SInRG (NSF Infrastructure Award)

- SInRG equipment: 8 Grid Service Clusters deployed, 6 more to come
- Federated ownership: CS, Chem Eng., Medical School, Computational Ecology, El. Eng.
- Real applications, middleware development, logistical networking

The Knoxville campus has two DS-3 commodity Internet connections and one DS-3 Internet2/Abilene connection. An OC-3 ATM link routes IP traffic between the Knoxville campus, the National Transportation Research Center, and Oak Ridge National Laboratory. UT participates in several national networking initiatives including Internet2 (I2), Abilene, the federal Next Generation Internet (NGI) initiative, Southern Universities Research Association (SURA) Regional Information Infrastructure (RII), and Southern Crossroads (SoX). The UT campus consists of a meshed ATM OC-12 being migrated over to switched Gigabit by early 2002.

SInRG's Vision

- SInRG provides a testbed for
- CS grid middleware
- Computational Science applications
- Many hosts, co-existing in a loose confederation tied together with high-speed links
- Users have the illusion of a very powerful computer on the desk
- Spectrum of users

GrADS - Three Research and Technology Thrusts

- GrADS: Grid Application Development Software
- NSF Next Generation Software (NGS) effort
- Effort within the GrADS Project
- GrADS PIs: Berman, Chien, Cooper, Dongarra, Foster, Gannon, Johnsson, Kennedy, Kesselman, Mellor-Crummey, Reed, Torczon, Wolski
- GrADSoft
- Software infrastructure for programming and running on the Grid
- Reconfigurable object programs
- Performance contracts
- Core Grid technologies
- Globus, NetSolve, NWS, Autopilot, AppLeS, Portals, Cactus
- MacroGrid
- Persistent multi-institution Grid testbed
- MicroGrid
- Portable Grid emulator

Grid-Aware Numerical Libraries

- Using ScaLAPACK and PETSc on the Grid

Early Experiences

In some sense ScaLAPACK is not an ideal application for the Grid, but it expanded our understanding of how the various GrADS components fit together. Key is managing dynamism.

ScaLAPACK

- ScaLAPACK is a portable distributed memory numerical library
- Complete numerical library for dense matrix computations
- Designed for distributed parallel computing (MPPs, clusters) using MPI
- One of the first math software packages to do this
- Numerical software that will work on a heterogeneous platform
- Funding from DOE, NSF, and DARPA
- In use today by IBM, HP-Convex, Fujitsu, NEC, Sun, SGI, Cray, NAG, IMSL, ...
- Tailor performance, provide support

ScaLAPACK Grid Enabled

- Implement a version of a ScaLAPACK library routine that runs on the Grid
- Make use of resources at the user's disposal
- Provide the best time to solution
- Proceed without the user's involvement
- Make as few changes as possible to the numerical software
- Assumption is that the user is already Grid enabled and runs a program that contacts the execution environment to determine where the execution should take place

To Use ScaLAPACK a User Must:

- Download the package and auxiliary packages (like PBLAS, BLAS, BLACS, MPI) to the machines
- Write a SPMD program which
- Sets up the logical 2-D process grid
- Places the data on the logical process grid
- Calls the numerical library routine in a SPMD fashion
- Collects the solution after the library routine finishes
- The user must allocate the processors and decide the number of processes the application will run on
- The user must start the application
- mpirun -np N user_app
- Note the number of processors is fixed by the user before the run, even if the problem size changes dynamically
- Upon completion, return the processors to the pool of resources

GrADS Numerical Library

- Want to relieve the user of some of the tasks
- Make decisions on which machines to use based on the user's problem and the state of the system
- Determine the machines that can be used
- Optimize for the best time to solution
- Distribute the data on the processors and collect the results
- Start the SPMD library routine on all the platforms
- Check to see if the computation is proceeding as planned
- If not, perhaps migrate the application

GrADS Library Sequence

[Diagram: User → Library Routine.]

- Has crafted code to make things work correctly and together.
- Assumptions: the Autopilot Manager has been started and Globus is there.

Resource Selector

[Diagram: User → Library Routine → Resource Selector.]

- Uses Globus MDS and Rich Wolski's NWS to build an array of values for the machines that are available to the user
- 2 matrices (bandwidth, latency), 2 arrays (cpu, memory available)
- Matrix information is clique based
- On return from the RS, the crafted code filters the information to use only machines that have the necessary software and are really eligible to be used

Arrays of Values Generated by Resource Selector

- Clique based
- 2 @ UT, UCSD, UIUC
- Part of the MacroGrid
- Full at the cluster level, plus the connections (clique leaders)
- [Figure: the bandwidth and latency matrices]
- Linear arrays for CPU and memory
- The matrices of values are filled out to generate a complete, dense matrix of values
- At this point we have a workable coarse grid
- Know what is available, the connections, and the power of the machines

ScaLAPACK Performance Model

- Total number of floating-point operations per processor
- Total number of data items communicated per processor
- Total number of messages
- Time per floating point operation
- Time per data item communicated
- Time per message
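Combining these terms gives a ScaLAPACK-style execution time model; this is a sketch of the standard form from the ScaLAPACK literature, where C_f, C_v, C_m are routine-dependent constants and NB is the block size:

\[
T(N,p) \;=\; C_f \frac{N^3}{p}\, t_f \;+\; C_v \frac{N^2}{\sqrt{p}}\, t_v \;+\; C_m \frac{N}{NB}\, t_m
\]

Here t_f is the time per floating point operation, t_v the time per data item communicated, and t_m the time per message, on p processors for a problem of order N.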

Performance Model

[Diagram: User → Library Routine → Resource Selector → Performance Model.]

- The Performance Model uses the information generated in the RS to decide on the fine grid
- Pick a machine that is closest to every other machine in the collection
- If not enough memory, add machines until the problem can be solved
- The cost model is run on this set
- The process adds a machine to the group and reruns the cost model
- If better, iterate the last step; if not, stop

Performance Model

Resource Selector / Performance Modeler

- Refines the coarse grid by determining the process set that will provide the best time to solution
- This is based on dynamic information from the grid and the routine's performance model
- The PM does a simulation of the actual application using the information from the RS
- It literally runs the program without doing the computation or data movement
- There is no backtracking in the optimizer
- This is an area for enhancement and experimentation

Contract Development

[Diagram: User → Library Routine → Resource Selector → Performance Model → Contract Development.]

- Contract between the application and the Grid system
- CD should validate the fine grid
- Should iterate between the CD and PM phases to get a workable fine grid

Application Launcher

[Diagram: User → Library Routine → Resource Selector → Performance Model → Contract Development → App Launcher.]

mpirun -machinefile/-globusrsl fine_grid grid_linear_solve

Experimental Hardware / Software Grid: MacroGrid Testbed

- Autopilot version 2.3
- Globus version 1.1.3
- NWS version 2.0.pre2
- MPICH-G version 1.1.2
- ScaLAPACK version 1.6
- ATLAS/BLAS version 3.0.2
- BLACS version 1.1
- PAPI version 1.1.5
- GrADS crafted code

Independent components being put together and interacting.

Performance Model Validation

[Chart: predicted vs. measured execution time on a refined grid. Speed ≈ 60% of peak; latency in msec; bandwidth in Mb/s.]

- N=600, NB=40, 2 TORC procs: ratio 46.12
- N=1500, NB=40, 4 TORC procs: ratio 15.03
- N=5000, NB=40, 6 TORC procs: ratio 2.25
- N=8000, NB=40, 8 TORC procs: ratio 1.52
- N=10,000, NB=40, 8 TORC procs: ratio 1.29

[Machine sets drawn from the OPUS, TORC, and CYPHER clusters — e.g. 5 OPUS; 8 OPUS; 6 OPUS + 5 CYPHER; 8 OPUS + 6 CYPHER; 2 OPUS + 4 TORC + 6 CYPHER; 8 OPUS + 4 TORC + 4 CYPHER; 8 OPUS + 2 TORC + 6 CYPHER.]

Largest Problem Solved

- Matrix of size 30,000
- 7.2 GB for the data
- 32 processors to choose from, at UIUC and UT
- Not all machines have 512 MB; some have as little as 128 MB
- PM chose 17 machines in 2 clusters from UT
- Computation took 84 minutes
- 3.6 Gflop/s total
- 210 Mflop/s per processor
- ScaLAPACK on a cluster of 17 processors would get about 50% of peak
- Processors are 500 MHz, or 500 Mflop/s peak
- For this grid computation: 20% less than ScaLAPACK

Compiler analogy

Contracts, Checkpointing, Migration

- We are using University of Illinois' Autopilot to monitor the progress of the execution.
- The application software has the ability to perform a checkpoint and can be restarted.
- We manually inserted the checkpointing code.
- If the application is not progressing as the contract specifies, we want to take some corrective action:
- Go back and figure out where the application can be run optimally.
- Restart the process from the last checkpoint, perhaps rearranging the data to fit the new set of processors.

General Library Interface

- We have a start on a general interface for numerical libraries.
- It can be a simple operation to plug in other numerical routines/libraries.
- Developing migration mechanisms for contract violations.
- Today a library writer needs to supply:
- Numerical routine
- Performance model
- The rest of the framework can remain the same.

Futures for Numerical Algorithms and Software

- Numerical software will be adaptive, exploratory, and intelligent
- Polyalgorithms and other techniques
- Determinism in numerical computing will be gone.
- After all, it's not reasonable to ask for exactness in numerical computations.
- Auditability of the computation, reproducibility at a cost
- Importance of floating point arithmetic will be undiminished.
- 16, 32, 64, 128 bits and beyond
- Interval arithmetic
- Reproducibility, fault tolerance, and auditability
- Adaptivity is key, so applications can effectively use the resources.

Contributors to These Ideas

- Top500
- Erich Strohmaier, LBL, NERSC
- Hans Meuer, Mannheim U
- Horst Simon, LBL, NERSC
- ATLAS
- Antoine Petitet, Sun France
- Clint Whaley, UTK
- Parallel Computing, Vol 27, No 1-2, pp 3-25, 2001
- PAPI
- Shirley Browne, UTK
- Kevin London, UTK
- Phil Mucci, UTK
- Keith Seymour, UTK
- NetSolve
- Dorian Arnold, UWisc
- Henri Casanova, UCSD
- Michelle Miller, UTK
- Sathish Vadhiyar, UTK
- For additional information see:
- http://icl.cs.utk.edu/top500/
- http://icl.cs.utk.edu/atlas/
- http://icl.cs.utk.edu/papi/
- http://icl.cs.utk.edu/scalapack/
- http://icl.cs.utk.edu/netsolve/
- www.cs.utk.edu/~dongarra/

Many opportunities within the group at Tennessee


6 Variations of Matrix Multiply (C and Fortran)

However, that is only part of the story:

[Chart: Mflop/s on a SUN Ultra 2, 200 MHz (L1 16 KB, L2 1 MB), for the six loop orderings — ijk, ikj, jik, jki, kij, kji — versus dgemm.]
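As an illustration of what two of the six orderings look like — a sketch in C, which is row-major; in Fortran's column-major storage the locality argument flips:

    /* ijk: inner loop walks b down a column, stride n -- poor locality */
    void mm_ijk(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += a[i*n + k] * b[k*n + j];
                c[i*n + j] += s;
            }
    }

    /* ikj: inner loop streams through contiguous rows of b and c */
    void mm_ikj(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double aik = a[i*n + k];
                for (int j = 0; j < n; j++)
                    c[i*n + j] += aik * b[k*n + j];
            }
    }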

Cache Blocking

- We want blocks to fit into cache. On parallel computers we have p × cache, so that data may fit into cache on p processors but not on one. This leads to superlinear speedup! Consider matrix-matrix multiply.
- An alternate form is ...

      do k = 1, n
         do j = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do

Cache Blocking

      do ii = 1, n, nblk
         do jj = 1, n, nblk
            do kk = 1, n, nblk
               do k = kk, kk+nblk-1
                  do j = jj, jj+nblk-1
                     do i = ii, ii+nblk-1
                        c(i,j) = c(i,j) + a(i,k)*b(k,j)
                     end do
                  end do
               end do
            end do
         end do
      end do

Assignment

- Implement, in Fortran or C, the six different ways to perform matrix multiplication by interchanging the loops. (Use 64-bit arithmetic.) Make each implementation a subroutine, like:
- subroutine ijk ( a, m, n, lda, b, k, ldb, c, ldc )
- subroutine ikj ( a, m, n, lda, b, k, ldb, c, ldc )
- ...
- Construct a driver program that generates random matrices and calls each matrix multiply routine with square matrices of orders 50, 100, 150, 200, 250, and 300, timing the calls and computing the Mflop/s rate (see the sketch after this list).
- Include in your timing routine a call to the following system supplied routine:
- call dgemm( 'N', 'N', n, n, n, 1.0d0, a, lda, b, ldb, 1.0d0, c, ldc )
- Write up a description of the timing and describe why the routines perform as they do.
- Download ATLAS from http://www.netlib.org/atlas/, build the ATLAS version of DGEMM, and time it.
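One possible shape for the driver — a sketch in C using the mm_ijk variant shown earlier; an n × n multiply costs 2n^3 flops:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    void mm_ijk(int n, const double *a, const double *b, double *c); /* one of the six */

    int main(void)
    {
        for (int n = 50; n <= 300; n += 50) {
            double *a = malloc(n*n*sizeof *a), *b = malloc(n*n*sizeof *b),
                   *c = calloc(n*n, sizeof *c);
            for (int i = 0; i < n*n; i++) {
                a[i] = rand() / (double)RAND_MAX;
                b[i] = rand() / (double)RAND_MAX;
            }
            clock_t t0 = clock();
            mm_ijk(n, a, b, c);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("n = %3d: %.3f s, %.1f Mflop/s\n", n, t, 2.0*n*n*n / t / 1e6);
            free(a); free(b); free(c);
        }
        return 0;
    }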

EISPACK and LINPACK

- EISPACK
- Designed for the algebraic eigenvalue problem, Ax = λx and Ax = λBx
- Work of J. Wilkinson and colleagues in the 70's
- Fortran 77 software based on translation of ALGOL
- LINPACK
- Designed for solving systems of equations, Ax = b
- Fortran 77 software using the Level 1 BLAS

History of Block Partitioned Algorithms

- Early algorithms involved use of small main memory, using tapes as secondary storage.
- Recent work centers on use of vector registers, level 1 and 2 cache, main memory, and "out of core" memory.

Blocked Partitioned Algorithms

- LU factorization
- Cholesky factorization
- Symmetric indefinite factorization
- Matrix inversion
- QR, QL, RQ, LQ factorizations
- Form Q or Q^T C
- Orthogonal reduction to:
- (upper) Hessenberg form
- symmetric tridiagonal form
- bidiagonal form
- Block QR iteration for nonsymmetric eigenvalue problems

LAPACK

- Linear Algebra library in Fortran 77
- Solution of systems of equations
- Solution of eigenvalue problems
- Combines algorithms from LINPACK and EISPACK into a single package
- Efficient on a wide range of computers
- RISC, Vector, SMPs
- User interface similar to LINPACK
- Single, Double, Complex, Double Complex
- Built on the Level 1, 2, and 3 BLAS

LAPACK

- Most of the parallelism is in the BLAS.
- Advantages of using the BLAS for parallelism:
- Clarity
- Modularity
- Performance
- Portability

Derivation of Blocked Algorithms: Cholesky Factorization A = U^T U

Equating coefficients of the jth column, we obtain the equations below. Hence, if U11 has already been computed, we can compute u_j and u_jj from them.
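The equations on this slide were figures; the standard derivation they presumably showed is:

\[
\begin{pmatrix} A_{11} & a_j \\ a_j^T & a_{jj} \end{pmatrix}
=
\begin{pmatrix} U_{11}^T & 0 \\ u_j^T & u_{jj} \end{pmatrix}
\begin{pmatrix} U_{11} & u_j \\ 0 & u_{jj} \end{pmatrix}
\;\Longrightarrow\;
U_{11}^T u_j = a_j, \qquad u_{jj} = \sqrt{a_{jj} - u_j^T u_j}.
\]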

LINPACK Implementation

- Here is the body of the LINPACK routine SPOFA, which implements the method:

         DO 30 J = 1, N
            INFO = J
            S = 0.0E0
            JM1 = J - 1
            IF( JM1.LT.1 ) GO TO 20
            DO 10 K = 1, JM1
               T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
               T = T / A( K, K )
               A( K, J ) = T
               S = S + T*T
   10       CONTINUE
   20       CONTINUE
            S = A( J, J ) - S
C        ...EXIT
            IF( S.LE.0.0E0 ) GO TO 40
            A( J, J ) = SQRT( S )
   30    CONTINUE

LAPACK Implementation

         DO 10 J = 1, N
            CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1, A, LDA, A( 1, J ), 1 )
            S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
            IF( S.LE.ZERO ) GO TO 20
            A( J, J ) = SQRT( S )
   10    CONTINUE

- This change by itself is sufficient to significantly improve the performance on a number of machines.
- From 238 to 312 Mflop/s for a matrix of order 500 on a Pentium 4 at 1.7 GHz.
- However the peak is 1,700 Mflop/s.
- This suggests further work is needed.

Derivation of Blocked Algorithms

Equating coefficients of the second block of columns, we obtain the equations below. Hence, if U11 has already been computed, we can compute U12 as the solution of the following equations by a call to the Level 3 BLAS routine STRSM:
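Again the equations were figures; in the standard blocked derivation they read:

\[
A_{12} = U_{11}^T U_{12} \;\Longrightarrow\; U_{12} = U_{11}^{-T} A_{12} \quad \text{(STRSM)},
\qquad
U_{22}^T U_{22} = A_{22} - U_{12}^T U_{12} \quad \text{(SSYRK, then factor the block)}.
\]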

LAPACK Blocked Algorithms

         DO 10 J = 1, N, NB
            JB = MIN( NB, N-J+1 )
            CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit', J-1, JB, ONE, A, LDA, A( 1, J ), LDA )
            CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ), LDA, ONE, A( J, J ), LDA )
            CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
            IF( INFO.NE.0 ) GO TO 20
   10    CONTINUE

- On the Pentium 4, the Level 3 BLAS squeeze a lot more out of 1 proc

LAPACK Contents

- Combines algorithms from LINPACK and EISPACK into a single package. User interface similar to LINPACK.
- Built on the Level 1, 2 and 3 BLAS for high performance (manufacturers optimize BLAS)
- LAPACK does not provide routines for structured problems or general sparse matrices (i.e. sparse storage formats such as compressed-row, -column, -diagonal, skyline ...).

LAPACK Ongoing Work

- Add functionality
- updating/downdating, divide and conquer least squares, bidiagonal bisection, bidiagonal inverse iteration, band SVD, Jacobi methods, ...
- Move to new generation of high performance machines
- IBM SPs, CRAY T3E, SGI Origin, clusters of workstations
- New challenges
- New languages: FORTRAN 90, HP FORTRAN, ...
- Many flavors of message passing (CMMD, MPL, NX, ...), need a standard (PVM, MPI): BLACS
- Highly varying ratio of computation to communication
- Many ways to lay out data
- Fastest parallel algorithm sometimes less stable numerically

Gaussian Elimination

[Figure: at each step, the entries below the pivot in the current column are zeroed.]

Standard way: subtract a multiple of a row.

Gaussian Elimination via a Recursive Algorithm

F. Gustavson and S. Toledo

LU Algorithm:

1. Split the matrix into two rectangles (m x n/2); if only 1 column, scale by the reciprocal of the pivot and return
2. Apply the LU algorithm to the left part
3. Apply the transformations to the right part (triangular solve A12 = L^-1 A12 and matrix multiplication A22 = A22 - A21*A12)
4. Apply the LU algorithm to the right part

Most of the work is in the matrix multiply: matrices of size n/2, n/4, n/8, ...
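A compact sketch of this recursion in C (column-major storage; pivoting omitted for clarity — the Gustavson/Toledo formulation of course pivots):

    /* Recursive LU on the m x n panel A (column-major, leading dim lda):
       overwrites A with unit-lower L and upper U. No pivoting here. */
    void rlu(int m, int n, double *A, int lda)
    {
        if (n == 1) {                        /* 1: one column: scale by 1/pivot */
            for (int i = 1; i < m; i++)
                A[i] /= A[0];
            return;
        }
        int n1 = n / 2, n2 = n - n1;
        rlu(m, n1, A, lda);                  /* 2: factor the left part */

        /* 3a: triangular solve A12 = L11^{-1} A12 (L11 unit lower) */
        for (int j = 0; j < n2; j++) {
            double *a12 = A + (size_t)(n1 + j) * lda;
            for (int k = 0; k < n1; k++)
                for (int i = k + 1; i < n1; i++)
                    a12[i] -= A[(size_t)k * lda + i] * a12[k];
        }
        /* 3b: update A22 = A22 - A21 * A12 */
        for (int j = 0; j < n2; j++)
            for (int k = 0; k < n1; k++) {
                double ukj = A[(size_t)(n1 + j) * lda + k];
                for (int i = n1; i < m; i++)
                    A[(size_t)(n1 + j) * lda + i] -= A[(size_t)k * lda + i] * ukj;
            }
        rlu(m - n1, n2, A + (size_t)n1 * lda + n1, lda);  /* 4: factor right part */
    }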

Recursive Factorizations

- Just as accurate as the conventional method
- Same number of operations
- Automatic variable blocking
- Level 1 and 3 BLAS only!
- Extreme clarity and simplicity of expression
- Highly efficient
- The recursive formulation is just a rearrangement of the point-wise LINPACK algorithm
- The standard error analysis applies (assuming the matrix operations are computed the conventional way)
- OK for LU, LL^T, QR
- Open question on 2-sided algorithms, e.g. eigenvalue reduction

[Chart: recursive LU vs. LAPACK LU, on a uniprocessor and on a dual-processor system.]

Challenges in Developing Distributed Memory Libraries

- How to integrate software?
- Until recently no standards
- Many parallel languages
- Various parallel programming models
- Assumptions about the parallel environment
- granularity
- topology
- overlapping of communication/computation
- development tools
- Where is the data?
- Who owns it?
- Opt. data distribution
- Who determines data layout?
- Determined by user?
- Determined by library developer?
- Allow dynamic data distribution?
- Load balancing

ScaLAPACK

- Library of software dealing with dense and banded routines
- Distributed memory - message passing
- MIMD computers and networks of workstations
- Clusters of SMPs

Programming Style

- SPMD Fortran 77 with object based design
- Built on various modules
- PBLAS
- Interprocessor communication: BLACS (PVM, MPI)
- Provides the right level of notation
- BLAS
- LAPACK software expertise/quality
- Software approach
- Numerical methods

Overall Structure of Software

- Object based - array descriptor
- Contains the information required to establish the mapping between a global array entry and its corresponding process and memory location
- Provides a flexible framework to easily specify additional data distributions or matrix types
- Currently dense, banded, and out-of-core
- Using the concept of context

PBLAS

- Similar to the BLAS in functionality and naming
- Built on the BLAS and BLACS
- Provide a global view of the matrix
- CALL DGEXXX ( M, N, A( IA, JA ), LDA, ... )
- CALL PDGEXXX( M, N, A, IA, JA, DESCA, ... )

ScaLAPACK Structure

[Diagram: global layer (ScaLAPACK, PBLAS) built on a local layer (LAPACK, BLAS, BLACS, message passing).]
Choosing a Data Distribution

- Main issues are:
- Load balancing
- Use of the Level 3 BLAS

Possible Data Layouts

- 1D block and cyclic column distributions
- 1D block-cyclic column and 2D block-cyclic distribution
- 2D block-cyclic used in ScaLAPACK for dense matrices

Distribution and Storage

- Matrix is block-partitioned; the distribution maps blocks
- Distributed 2-D block-cyclic scheme
- [Figure: 5x5 matrix partitioned in 2x2 blocks, mapped onto a 2x2 process grid — matrix point of view vs. process point of view]
- Routines available to distribute/redistribute data (see the sketch below)
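For concreteness, the 2-D block-cyclic map can be written down in a few lines; this is a sketch of the standard mapping with 0-based indices, where MB is the row block size and P the number of process rows (the same formulas apply to columns with NB and Q):

    /* process row that owns global row I */
    int owner_row(int I, int MB, int P)
    {
        return (I / MB) % P;
    }

    /* local row index of global row I on its owning process */
    int local_row(int I, int MB, int P)
    {
        return (I / (MB * P)) * MB + I % MB;
    }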

Parallelism in ScaLAPACK

- Level 3 BLAS block operations
- All the reduction routines
- Pipelining
- QR algorithm, triangular solvers, classic factorizations
- Redundant computations
- Condition estimators
- Static work assignment
- Bisection
- Task parallelism
- Sign function eigenvalue computations
- Divide and conquer
- Tridiagonal and band solvers, symmetric eigenvalue problem and sign function
- Cyclic reduction
- Reduced system in the band solver
- Data parallelism
- Sign function


References

- http://www.netlib.org
- http://www.netlib.org/lapack
- http://www.netlib.org/scalapack
- http://www.netlib.org/lapack/lawns
- http://www.netlib.org/atlas
- http://www.netlib.org/papi/
- http://www.netlib.org/netsolve/
- http://www.netlib.org/lapack90
- http://www.nhse.org
- http://www.netlib.org/utk/people/JackDongarra/la-sw.html
- lapack@cs.utk.edu
- scalapack@cs.utk.edu

Motivation for Grid Computing

In the past: Isolation

- Many science and engineering problems today require that widely dispersed resources be operated as systems.
- Networking, distributed computing, and parallel computation research have matured enough to make it possible for distributed systems to support high-performance applications, but...
- Resources are dispersed
- Connectivity is variable
- Dedicated access may not be possible

Today: Collaboration

Performance Distribution, Nov 2001

[Chart: Top500 performance distribution — positions 1, 3, 4, 6; half-life annotations.]

Bandwidth Won't Be A Problem Soon -- Bisection Bandwidth (BB) Across the US

- 1971 - BB = 112 Kb/s
- 1986 - BB = 1 Mb/s
- 2001 - BB = 200 Gb/s
- Today in the lab, 4000 channels on a single fiber, each channel 10 Gb/s
- 12 strands of fiber can carry 4000 × 10 Gb/s, or 40 Tb/s
- 5 backbone networks across the US, each w/ 2 sets of 12 strands, can provide 2.4 Pb/s
- "When the Network is as fast as the computer's internal links, the machine disintegrates across the Net into a set of special purpose appliances" — Gilder Technology Report, June 2000
- Internet doubling every 9 months
- Factor of 100 in 5 years
- BB will grow by a factor of 12,000