High Performance Computing and Trends, Enhancing Performance, Measurement Tools, - PowerPoint PPT Presentation



1
High Performance Computing and Trends, Enhancing Performance, Measurement Tools
Facultad de Ciencias, Universidad de Los Andes, La Hechicera, Mérida, Venezuela. December 3-5, 2001
  • Jack Dongarra
  • Innovative Computing Laboratory
  • University of Tennessee
  • http://www.cs.utk.edu/~dongarra/

2
Overview
  • High Performance Computing
  • ATLAS
  • PAPI
  • NetSolve
  • Grid Experiments

3
Technology Trends: Microprocessor Capacity
Moore's Law: 2X transistors/chip every 1.5 years.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful.
4
Moore's Law
5
H. Meuer, H. Simon, E. Strohmaier, J. Dongarra
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark (Ax=b, dense problem)
- Updated twice a year: at SCxy in the States in November, and at the meeting in Mannheim, Germany in June
- All data available from www.top500.org
6
  • In 1980 a computation that took 1 full year to
    complete
  • can now be done in 10 hours!

7
  • In 1980 a computation that took 1 full year to
    complete
  • can now be done in 16 minutes!

8
  • In 1980 a computation that took 1 full year to
    complete
  • can today be done in 27 seconds!
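A quick sanity check of the speedup factors these three slides imply (back-of-the-envelope arithmetic, not taken from the slides themselves):

```python
# Implied speedups for a computation that took one full year in 1980
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 s

for label, seconds in [("10 hours", 10 * 3600),
                       ("16 minutes", 16 * 60),
                       ("27 seconds", 27)]:
    speedup = SECONDS_PER_YEAR / seconds
    print(f"1 year -> {label}: ~{speedup:,.0f}x faster")
```

So the three slides correspond to roughly 876x, 32,850x, and 1.17-million-fold speedups.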

9
Top 10 Machines (November 2001)
10
Performance Development
[Top500 performance chart: No. 1 at 7.2 Tflop/s, entry level at 94 Gflop/s (Schwab at No. 24); performance roughly doubles each year; 394 systems above 100 Gflop/s, i.e. growing faster than Moore's law]
11
Performance Development
[Chart with projection: "My Laptop" marked for scale; Top500 entry level projected to reach 1 Tflop/s in 2005 and 1 Pflop/s in 2010]
12
Petaflop (10^15 flop/s) Computers Within the Next Decade
  • Five basic design points
  • Conventional technologies
  • 4.8 GHz processor, 8000 nodes, each w/16
    processors
  • Processing-in-memory (PIM) designs
  • Reduce memory access bottleneck
  • Superconducting processor technologies
  • Digital superconductor technology, Rapid
    Single-Flux-Quantum (RSFQ) logic hybrid
    technology multi-threaded (HTMT)
  • Special-purpose hardware designs
  • Specific applications e.g. GRAPE Project in Japan
    for gravitational force computations
  • Schemes utilizing the aggregate computing power
    of processors distributed on the web
  • SETI@home: 26 Tflop/s

13
Petaflop (10^15 flop/s) Computer Today?
  • 1 GHz processors (O(10^9) ops/s each)
  • 1 million PCs
  • $1B ($1K each)
  • 100 MW of power
  • 5 acres
  • 1 million Windows licenses!!
  • A PC failure every second
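The bullet arithmetic above can be checked directly (a sketch; the per-PC power draw of ~100 W is inferred from the 100 MW total, so treat it as an assumption):

```python
pcs = 1_000_000          # 1 million PCs
per_pc_ops = 1e9         # 1 GHz processor, ~1e9 ops/s each
per_pc_cost = 1_000      # $1K per PC
per_pc_watts = 100       # assumed; consistent with the 100 MW total

total_ops = pcs * per_pc_ops           # 1e15 ops/s = 1 Pflop/s
total_cost = pcs * per_pc_cost         # $1B
total_mw = pcs * per_pc_watts / 1e6    # 100 MW
print(total_ops, total_cost, total_mw)
```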

14
Architectures
[Top500 architecture chart] Constellation: # of processors per node ≥ # of nodes
15
Chip Technology
16
Manufacturer
IBM 32%, HP 30%, SGI 8%, Cray 8%, Sun 6%, Fujitsu 4%, NEC 3%, Hitachi 3%
17
Cumulative Performance, Nov 2001
Sum over all 500 systems: 134.9 TF/s
18
High-Performance Computing Directions: Beowulf-class PC Clusters
Definition:
  • COTS PC nodes
  • Pentium, Alpha, PowerPC, SMP
  • COTS LAN/SAN interconnect
  • Ethernet, Myrinet, Giganet, ATM
  • Open source Unix
  • Linux, BSD
  • Message-passing computing
  • MPI, PVM
  • HPF
Advantages:
  • Best price-performance
  • Low entry-level cost
  • Just-in-place configuration
  • Vendor invulnerable
  • Scalable
  • Rapid technology tracking

Enabled by PC hardware, networks, and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, together with the availability of industry-standard message-passing libraries. However, much more of a contact sport.
19
  • Peak performance
  • Interconnection
  • http://clusters.top500.org
  • Benchmark results to follow in the coming months

20
Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Chart, 1980-2000, performance on a log scale: processor performance improves ~60%/yr (2X/1.5yr, Moore's Law) while DRAM improves only ~9%/yr (2X/10 yrs), so the gap widens steadily over time]
21
Optimizing Computation and Memory Use
  • Computational optimizations
  • Theoretical peak = (# FPUs) × (flops/cycle) × MHz
  • PIII: (1 FPU)(1 flop/cycle)(850 MHz) = 850 MFLOP/s
  • Athlon: (2 FPUs)(1 flop/cycle)(600 MHz) = 1200 MFLOP/s
  • Power3: (2 FPUs)(2 flops/cycle)(375 MHz) = 1500 MFLOP/s
  • Operations like:
  • α = xᵀy: 2 operands (16 bytes) needed for 2 flops; at 850 Mflop/s this requires 1700 MW/s of bandwidth
  • y = αx + y: 3 operands (24 bytes) needed for 2 flops; at 850 Mflop/s this requires 2550 MW/s of bandwidth
  • Memory optimization
  • Theoretical peak = (bus width) × (bus speed)
  • PIII: (32 bits)(133 MHz) = 532 MB/s = 66.5 MW/s
  • Athlon: (64 bits)(133 MHz) = 1064 MB/s = 133 MW/s
  • Power3: (128 bits)(100 MHz) = 1600 MB/s = 200 MW/s
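The peak-rate arithmetic on this slide can be written out directly (a sketch; MW/s here means millions of 8-byte words per second):

```python
def peak_mflops(fpus, flops_per_cycle, mhz):
    # Theoretical compute peak = (# FPUs) x (flops/cycle) x MHz
    return fpus * flops_per_cycle * mhz

def peak_mwords(bus_bits, bus_mhz):
    # Theoretical memory peak = (bus width) x (bus speed), in 8-byte words
    mb_per_s = bus_bits / 8 * bus_mhz   # e.g. PIII: 532 MB/s
    return mb_per_s / 8                 # e.g. PIII: 66.5 MW/s

piii_flops = peak_mflops(1, 1, 850)     # 850 MFLOP/s
piii_words = peak_mwords(32, 133)       # 66.5 MW/s
ratio = piii_flops / piii_words         # ~12.8 flops per word the bus can deliver
```

The ratio is the point of the slide: the FPU wants 1-2 words per flop, but the bus can only deliver a word every ~13 flops.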

22
Memory Hierarchy
  • By taking advantage of the principle of locality:
  • Present the user with as much memory as is available in the cheapest technology.
  • Provide access at the speed offered by the fastest technology.

[Hierarchy diagram, fastest/smallest to slowest/largest]
Registers: ~1 ns, 100s of bytes
On-chip cache: ~10 ns, KBs
Level 2 and 3 cache (SRAM): ~100 ns, KBs-MBs
Main memory (DRAM): ~100 ns, MBs
Distributed memory / remote cluster memory: ~100,000 ns (0.1 ms), GBs
Secondary storage (disk): ~10,000,000 ns (10 ms), GBs
Tertiary storage (disk/tape): ~10,000,000,000 ns (10 s), TBs
23
Self-Adapting Numerical Software (SANS)
  • Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
  • Operations like the BLAS require many man-hours per platform
  • Software lags far behind hardware introduction
  • Only done if the financial incentive is there
  • Hardware, compilers, and software have a large design space with many parameters
  • Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
  • Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors
  • Need for quick/dynamic deployment of optimized routines
  • ATLAS: Automatically Tuned Linear Algebra Software

24
Software Generation Strategy - BLAS
  • Parameter study of the hardware
  • Generate multiple versions of code, with different values of key performance parameters
  • Run and measure the performance of the various versions
  • Pick the best and generate the library
  • Level 1 cache multiply optimizes for:
  • TLB access
  • L1 cache reuse
  • FP unit usage
  • Memory fetch
  • Register reuse
  • Loop overhead minimization
  • Takes 20 minutes to run.
  • New model of high performance programming where critical code is machine generated using parameter optimization.
  • Designed for RISC, superscalar architectures
  • Needs a reasonable C compiler
  • Today ATLAS is in use by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, ...
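The generate-measure-pick loop can be sketched in miniature: time one candidate blocking factor after another for a blocked matrix multiply and keep the fastest. This is a toy illustration of the strategy, not ATLAS itself (ATLAS generates and compiles C kernels and explores many more parameters):

```python
import random
import time

def blocked_matmul(A, B, n, nb):
    # One code variant: blocked i-k-j multiply with blocking factor nb
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += a * B[k][j]
    return C

def autotune(n=48, candidates=(4, 8, 16, 48)):
    # Run and measure each variant, then pick the best blocking factor
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    timings = {}
    for nb in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, n, nb)
        timings[nb] = time.perf_counter() - t0
    return min(timings, key=timings.get)
```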

25
ATLAS (DGEMM n = 500), 64-bit floating point results
  • ATLAS is faster than all other portable BLAS
    implementations and it is comparable with
    machine-specific libraries provided by the vendor.

26
Recursive Approach for Other Level 3 BLAS
Recursive TRMM
  • Recur down to L1 cache block size
  • Need kernel at bottom of recursion
  • Use gemm-based kernel for portability
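The recursive idea (recur down to an L1-sized block, then call a kernel at the bottom) looks like the following for a GEMM-style update; TRMM follows the same pattern with triangular sub-blocks. This sketch assumes the matrix dimension is the block size times a power of two:

```python
def l1_kernel(A, B, C, i0, j0, k0, nb):
    # Stand-in for a tuned L1-cache kernel: straight triple loop on one block
    for i in range(i0, i0 + nb):
        for j in range(j0, j0 + nb):
            s = C[i][j]
            for k in range(k0, k0 + nb):
                s += A[i][k] * B[k][j]
            C[i][j] = s

def rec_mm(A, B, C, i0, j0, k0, n, nb):
    # Recur by halving until the block fits in L1, then call the kernel
    if n <= nb:
        l1_kernel(A, B, C, i0, j0, k0, n)
        return
    h = n // 2
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                rec_mm(A, B, C, i0 + di, j0 + dj, k0 + dk, h, nb)
```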

27
Intel PIII 933 MHz: MKL 5.0 vs ATLAS 3.2.0 under Windows 2000
  • ATLAS is faster than all other portable BLAS
    implementations and it is comparable with
    machine-specific libraries provided by the vendor.

28
ATLAS Matrix Multiply (64-bit floating point
results)
Intel IA-64
Intel P4 64-bit fl pt
AMD Athlon
29
Pentium 4 - SSE2
  • 1.5 GHz, 400 MHz system bus, 16 KB L1 and 256 KB L2 cache, theoretical peak of 1.5 Gflop/s, high power consumption
  • Streaming SIMD Extensions 2 (SSE2)
  • consists of 144 new instructions
  • includes SIMD IEEE double precision floating point (vector instructions)
  • Peak for 64-bit floating point: 2X
  • Peak for 32-bit floating point: 4X
  • SIMD 128-bit integer
  • new cache and memory management instructions
  • Intel's compiler supports these instructions today

30
ATLAS Matrix Multiply: Intel Pentium 4 using SSE2
P4 32-bit fl pt using SSE2
P4 64-bit fl pt using SSE2
P4 64-bit fl pt
$250/processor, <$1000 for the system => $0.50/Mflop/s !!
31
(No Transcript)
32
Machine-Assisted Application Development and
Adaptation
  • Communication libraries
  • Optimize for the specifics of one's configuration.
  • Algorithm layout and implementation
  • Look at the different ways to express
    implementation

33
Work in Progress: ATLAS-like Approach Applied to Broadcast (PII 8-way cluster with 100 Mb/s switched network)
34
Reformulating/Rearranging/Reuse
  • Example: the reduction to narrow band form for the SVD
  • Fetch each entry of A once
  • Restructure and combine operations
  • Results in a speedup of > 30%

35
Conjugate Gradient Variants by Dynamic Selection at Run Time
  • Variants combine inner products to reduce the communication bottleneck at the expense of more scalar ops.
  • Same number of iterations, so no advantage on a sequential processor
  • With a large number of processors and a high-latency network, they may be advantageous.
  • Improvements can range from 15% to 50% depending on size.

36
Conjugate Gradient Variants by Dynamic Selection at Run Time
(Same text as the previous slide; the accompanying chart was not transcribed.)

37
Related Tuning Projects
  • PHiPAC
  • Portable High Performance ANSI C; http://www.icsi.berkeley.edu/~bilmes/phipac; initial automatic GEMM generation project
  • FFTW: Fastest Fourier Transform in the West
  • http://www.fftw.org
  • UHFFT
  • tuning parallel FFT algorithms
  • http://rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
  • SPIRAL
  • Signal Processing Algorithms Implementation Research for Adaptable Libraries; maps DSP algorithms to architectures
  • http://www.ece.cmu.edu/~spiral/
  • Sparsity
  • Sparse matrix-vector and sparse matrix-matrix multiplication; http://www.cs.berkeley.edu/~ejim/publication/; tunes code to the sparsity structure of the matrix (more later in this tutorial)
  • University of Tennessee: ATLAS

38
Tools for Performance Evaluation
  • Timing and performance evaluation has been an art
  • Resolution of the clock
  • Issues with cache effects
  • Different systems
  • Can be cumbersome and inefficient with traditional tools
  • The situation is about to change
  • Today's processors have internal counters

39
Performance Counters
  • Almost all high performance processors include hardware performance counters.
  • Some are easy to access; others are not available to users.
  • On most platforms the APIs, if they exist, are not appropriate for the end user or not well documented.
  • Existing performance counter APIs
  • Compaq Alpha EV6 / EV67
  • SGI MIPS R10000
  • IBM Power Series
  • CRAY T3E
  • Sun Solaris
  • Pentium Linux and Windows
  • IA-64
  • HP-PA RISC
  • Hitachi
  • Fujitsu
  • NEC

40
Performance Data That May Be Available
  • Pipeline stalls due to memory subsystem
  • Pipeline stalls due to resource conflicts
  • I/D cache misses for different levels
  • Cache invalidations
  • TLB misses
  • TLB invalidations
  • Cycle count
  • Floating point instruction count
  • Integer instruction count
  • Instruction count
  • Load/store count
  • Branch taken / not taken count
  • Branch mispredictions

41
Overview of PAPI
  • Performance Application Programming Interface
  • The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.

42
Implementation
  • Counters exist as a small set of registers that
    count events.
  • PAPI provides three interfaces to the underlying
    counter hardware
  • The low-level interface manages hardware events in user-defined groups called EventSets.
  • The high level interface simply provides the
    ability to start, stop and read the counters for
    a specified list of events.
  • Graphical tools to visualize information.

43
Low Level API
  • Increased efficiency and functionality over the high-level PAPI interface
  • There are about 40 functions
  • Obtain information about the executable and the hardware.
  • Thread safe

44
High Level API
  • Meant for application programmers wanting
    coarse-grained measurements
  • Calls the lower level API
  • Not thread safe at the moment
  • Only allows PAPI Presets events

45
High Level Functions
  • PAPI_flops()
  • PAPI_num_counters()
  • Number of counters in the system
  • PAPI_start_counters()
  • PAPI_stop_counters()
  • Enable counting of events and describe what to count
  • PAPI_read_counters()
  • Returns event counts
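The start/read/stop shape of the high-level API can be mimicked with a toy counter. The real PAPI calls are C functions that read hardware registers; in the stand-in below, elapsed wall-clock time substitutes for an event count, purely to show the calling pattern:

```python
import time

class ToyCounters:
    """Software stand-in showing PAPI's high-level start/read/stop shape."""
    def __init__(self):
        self._t0 = None

    def start_counters(self):
        # PAPI_start_counters() would program the hardware event registers
        self._t0 = time.perf_counter()

    def read_counters(self):
        # PAPI_read_counters() returns event counts; elapsed time stands in
        return time.perf_counter() - self._t0

    def stop_counters(self):
        value = self.read_counters()
        self._t0 = None
        return value

counters = ToyCounters()
counters.start_counters()
total = sum(i * i for i in range(100_000))  # region of interest
elapsed = counters.stop_counters()
```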

46
Perfometer Features
  • Platform-independent visualization of PAPI metrics
  • Flexible interface
  • Quick interpretation of complex results
  • Small footprint (compiled code size < 15 KB)
  • Color coding to highlight selected procedures
  • Trace-file generation or real-time viewing

47
PAPI Implementation
48
Graphical Tools: Perfometer Usage
  • Application is instrumented with PAPI
  • call perfometer()
  • Layered over the best existing vendor-specific APIs for these platforms
  • When the application is started, the call to perfometer sets up a signal handler and a timer to collect and send the information to a Java applet containing the graphical view.
  • Sections of code that are of interest can be designated with specific colors
  • Using a call to set_perfometer(color)

49
Perfometer
call perfometer('red')
50
Early Users of PAPI
  • DEEP/PAPI (Pacific Sierra): http://www.psrv.com/deep_papi_top.html
  • TAU (Allen Malony, U of Oregon): http://www.cs.uoregon.edu/research/paracomp/tau/
  • SvPablo (Dan Reed, U of Illinois): http://vibes.cs.uiuc.edu/Software/SvPablo/svPablo.htm
  • Cactus (Ed Seidel, Max Planck/U of Illinois): http://www.aei-potsdam.mpg.de
  • Vprof (Curtis Janssen, Sandia Livermore Lab): http://aros.ca.sandia.gov/~cljanss/perf/vprof/
  • Cluster Tools (Al Geist, ORNL)
  • DynaProf (Phil Mucci, UTK): http://www.cs.utk.edu/~mucci/dynaprof/

51
Next Version of Perfometer: Implementation
[Diagram: multiple applications feed a server, which drives the GUI]
52
PAPI's Parallel Interface
53
PAPI - Supported Processors
  • Intel Pentium, II, III, 4, Itanium
  • Linux 2.4, 2.2, 2.0 with the perf kernel patch
  • IBM Power 3, 604, 604e
  • For AIX 4.3 with pmtoolkit (available in AIX 4.3.4) (laderose@us.ibm.com)
  • Sun UltraSparc I, II, III
  • Solaris 2.8
  • MIPS R10K, R12K
  • AMD Athlon
  • Linux 2.4 with the perf kernel patch
  • Cray T3E, SV1, SV2
  • Windows 2K and XP
  • To download the software see
  • http://icl.cs.utk.edu/papi/

54
(No Transcript)
55
Innovative Computing Laboratory, University of Tennessee
  • Numerical Linear Algebra
  • Heterogeneous Distributed Computing
  • Software Repositories
  • Performance Evaluation
  • Software and ideas have found their way into many areas of Computational Science
  • Around 40 people at the moment...
  • 15 researchers: research associates, post-docs, research professors
  • 15 students: graduate and undergraduate
  • 8 support staff: secretary, systems, artist
  • 1 long-term visitor
  • Many opportunities within the group at Tennessee

56
SETI@home
  • Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
  • Uses data collected with the Arecibo radio telescope in Puerto Rico
  • When a participant's computer is idle or being wasted, the software downloads a 300-kilobyte chunk of data for analysis.
  • The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
  • Largest distributed computation project in existence
  • 400,000 machines
  • Averaging 27 Tflop/s
  • Today many companies are trying this for profit.

57
Distributed and Parallel Systems
[Spectrum, from distributed heterogeneous systems to massively parallel homogeneous systems: SETI@home (27 Tflop/s), Entropia, grid-based computing, network of workstations, Beowulf clusters, parallel distributed-memory machines, clusters with special interconnect, ASCI Tflops (7 Tflop/s)]
Distributed (cycle scavenging):
  • Gather (unused) resources
  • Steal cycles
  • System SW manages resources
  • System SW adds value
  • 10-20% overhead is OK
  • Resources drive applications
  • Time to completion is not critical
  • Time-shared
Massively parallel:
  • Bounded set of resources
  • Apps grow to consume all cycles
  • Application manages resources
  • System SW gets in the way
  • 5% overhead is maximum
  • Apps drive purchase of equipment
  • Real-time constraints
  • Space-shared

58
Grids are Hot
  • IPG NAS-NASA: http://nas.nasa.gov/wej/home/IPG
  • Globus: http://www.globus.org/
  • Legion: http://www.cs.virginia.edu/~grimshaw/
  • AppLeS: http://www-cse.ucsd.edu/groups/hpcl/apples
  • NetSolve: http://www.cs.utk.edu/netsolve/
  • NINF: http://phase.etl.go.jp/ninf/
  • Condor: http://www.cs.wisc.edu/condor/
  • CUMULVS: http://www.epm.ornl.gov/cs/cumulvs.html
  • WebFlow: http://www.npac.syr.edu/users/gcf/
59
The Grid
  • To treat CPU cycles and software like commodities.
  • Napster on steroids.
  • Enable the coordinated use of geographically distributed resources, in the absence of central control and existing trust relationships.
  • Computing power is produced much like utilities such as power and water are produced for consumers.
  • Users will have access to power on demand.
  • "When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special-purpose appliances." (Gilder Technology Report, June 2000)

60
The Grid
61
The Grid Architecture Picture
[Layer diagram, top to bottom]
User portals; problem solving environments; application science portals
Service layers: grid access info, co-scheduling, fault tolerance, authentication, events, naming and files
Resource layer: computers, databases, online instruments, software
High-speed networks and routers
62
Globus Grid Services
  • The Globus toolkit provides a range of basic Grid
    services
  • Security, information, fault detection,
    communication, resource management, ...
  • These services are simple and orthogonal
  • Can be used independently, mix and match
  • Programming model independent
  • For each there are well-defined APIs
  • Standards are used extensively
  • E.g., LDAP, GSS-API, X.509, ...
  • You don't program in Globus; it's a set of tools, like Unix

63
NetSolve: Network-Enabled Server
  • NetSolve is an example of a grid-based hardware/software server.
  • Ease of use is paramount
  • Based on an RPC model, but with resource discovery, dynamic problem-solving capabilities, load balancing, fault tolerance, asynchronicity, security, ...
  • Other examples are NEOS from Argonne and NINF from Japan.
  • Goal: use remote resources, not tie together geographically distributed resources for a single application.

64
NetSolve: The Big Picture
[Diagram: clients (Matlab, Mathematica, C, Fortran, Java, Excel) send a request Op(C, A, B) to the agent(s), which consult a schedule database and dispatch the work to servers S1-S4]
No knowledge of the grid required; RPC-like.
65
Basic Usage Scenarios
  • Grid-based numerical library routines
  • User doesn't have to have the software library on their machine: LAPACK, SuperLU, ScaLAPACK, PETSc, AZTEC, ARPACK
  • Task farming applications
  • Pleasantly parallel execution
  • e.g. parameter studies
  • Remote application execution
  • Complete applications, with the user specifying input parameters and receiving output
  • "Blue Collar" Grid-Based Computing
  • Does not require deep knowledge of network programming
  • Level of expressiveness right for many users
  • User can set things up; no su required
  • In use today, up to 200 servers in 9 countries
  • Can plug into Globus, Condor, NINF, ...

66
NetSolve Agent
  • Name server for the NetSolve system.
  • Information service
  • client users and administrators can query the hardware and software services available.
  • Resource scheduler
  • maintains both static and dynamic information regarding the NetSolve server components to use for the allocation of resources

67
NetSolve Agent
  • Resource Scheduling (cont'd)
  • CPU performance (LINPACK).
  • Network bandwidth, latency.
  • Server workload.
  • Problem size / algorithm complexity.
  • Calculates a "Time to Compute" for each appropriate server.
  • Notifies the client of the most appropriate server.
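The "Time to Compute" estimate presumably combines the factors listed above. A hedged sketch of one plausible formula follows; the actual NetSolve formula is not given on the slide, and every parameter name here is illustrative:

```python
def time_to_compute(problem_flops, server_mflops, workload,
                    bytes_moved, bandwidth_MBps, latency_s):
    # Assumed shape of the agent's estimate: a compute term plus a network term.
    # Effective speed = LINPACK rating scaled by the fraction of free CPU.
    compute = problem_flops / (server_mflops * 1e6 * (1.0 - workload))
    network = latency_s + bytes_moved / (bandwidth_MBps * 1e6)
    return compute + network

# The agent would then pick the server with the smallest estimate
servers = {
    "S1": time_to_compute(2e9, 500, 0.10, 8e6, 10, 0.005),  # nearby, lightly loaded
    "S2": time_to_compute(2e9, 900, 0.50, 8e6, 1, 0.050),   # faster but busy, slow link
}
best = min(servers, key=servers.get)
```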

68
NetSolve Client
  • Function-based interface.
  • The client program embeds calls from NetSolve's API to access additional resources.
  • Interface available for C, Fortran, Matlab, Mathematica, and Java.
  • Opaque networking interactions.
  • NetSolve can be invoked using a variety of methods: blocking, non-blocking, task farms, ...

69
NetSolve Client
  • Intuitive and easy to use.
  • Matlab matrix multiply, e.g.:
  • A = matmul(B, C)
  • becomes
  • A = netsolve('matmul', B, C)
  • Possible parallelisms hidden.

70
NetSolve Client
  • Client makes a request to the agent.
  • Agent returns a list of servers.
  • Client tries each one in turn until one executes successfully or the list is exhausted.

71
NPACI Alpha Project - MCell: 3-D Monte-Carlo Simulation of Neurotransmitter Release Between Cells
  • UCSD (F. Berman, H. Casanova, M. Ellisman), Salk Institute (T. Bartol), CMU (J. Stiles), UTK (J. Dongarra, R. Wolski)
  • Study how neurotransmitters diffuse and activate receptors in synapses
  • blue = unbound, red = singly bound, green = doubly bound closed, yellow = doubly bound open

72
MCell: 3-D Monte-Carlo Simulation of Neurotransmitter Release Between Cells
  • Developed at the Salk Institute and CMU
  • In the past, manually run on available workstations
  • Transparent parallelism, load balancing, fault-tolerance
  • Fits the task-farming semantics and needs of NetSolve
  • Collaboration with the AppLeS project for scheduling tasks

[Diagram: a list of seeds feeds AppLeS, which dispatches MCell scripts to the NetSolve servers]
73
  • IPARS: Integrated Parallel Accurate Reservoir Simulator.
  • Mary Wheeler's group, UT-Austin
  • Reservoir and environmental simulation.
  • models black oil, waterflood, compositional
  • 3D transient flow of multiple phases
  • Integrates existing simulators.
  • Framework simplifies development
  • Provides solvers, handling for wells, table lookup.
  • Provides pre/postprocessor, visualization.
  • Full IPARS access without installation.
  • IPARS interfaces
  • C, FORTRAN, Matlab, Mathematica, and Web.

74
NetSolve and SCIRun
SCIRun torso defibrillator application (Chris Johnson, U of Utah)
75
NetSolve: A Plug into the Grid
[Diagram: C and Fortran clients call NetSolve; grid middleware provides resource discovery, fault tolerance, system management, and resource scheduling; proxies connect to Globus, NetSolve, Ninf, and Condor]
76
NetSolve: A Plug into the Grid
[Same diagram, extended with grid back-ends: Globus, NetSolve servers, Ninf servers, Condor]
77
NetSolve: A Plug into the Grid
[Same diagram, extended with PSE front-ends (Matlab, Mathematica, custom, SCIRun) connected by remote procedure call]
78
University of Tennessee Deployment
Scalable Intracampus Research Grid: SInRG (NSF Infrastructure Award)
  • SInRG equipment: 8 Grid Service Clusters deployed, 6 more to come
  • Federated ownership: CS, Chem. Eng., Medical School, Computational Ecology, El. Eng.
  • Real applications, middleware development, logistical networking

The Knoxville campus has two DS-3 commodity Internet connections and one DS-3 Internet2/Abilene connection. An OC-3 ATM link routes IP traffic between the Knoxville campus, the National Transportation Research Center, and Oak Ridge National Laboratory. UT participates in several national networking initiatives including Internet2 (I2), Abilene, the federal Next Generation Internet (NGI) initiative, the Southern Universities Research Association (SURA) Regional Information Infrastructure (RII), and Southern Crossroads (SoX). The UT campus network consists of a meshed ATM OC-12 being migrated to switched Gigabit by early 2002.
79
SInRG's Vision
  • SInRG provides a testbed for
  • CS grid middleware
  • Computational Science applications
  • Many hosts, co-existing in a loose confederation tied together with high-speed links.
  • Users have the illusion of a very powerful computer on the desk.
  • Spectrum of users

80
GrADS - Three Research and Technology Thrusts
  • GrADS: Grid Application Development Software
  • NSF Next Generation Software (NGS) effort
  • Effort within the GrADS Project
  • GrADS PIs: Berman, Chien, Cooper, Dongarra, Foster, Gannon, Johnsson, Kennedy, Kesselman, Mellor-Crummey, Reed, Torczon, Wolski
  • GrADSoft
  • Software infrastructure for programming and
    running on the Grid
  • Reconfigurable object programs
  • Performance contracts
  • Core Grid technologies
  • Globus, NetSolve, NWS, Autopilot, AppLeS,
    Portals, Cactus
  • MacroGrid
  • Persistent multi-institution Grid testbed
  • MicroGrid
  • Portable Grid emulator

81
Grid-Aware Numerical Libraries
  • Using ScaLAPACK and PETSc on the Grid
    Early Experiences

82
Grid-Aware Numerical Libraries
  • Using ScaLAPACK and PETSc on the Grid
    Early Experiences

In some sense ScaLAPACK is not an ideal application for the Grid, but it expanded our understanding of how the various GrADS components fit together. The key is managing dynamism.
83
ScaLAPACK
  • ScaLAPACK is a portable distributed memory numerical library
  • Complete numerical library for dense matrix computations
  • Designed for distributed parallel computing (MPPs and clusters) using MPI
  • One of the first math software packages to do this
  • Numerical software that will work on a heterogeneous platform
  • Funding from DOE, NSF, and DARPA
  • In use today by IBM, HP-Convex, Fujitsu, NEC, Sun, SGI, Cray, NAG, IMSL, ...
  • They tailor performance and provide support

84
ScaLAPACK Grid Enabled
  • Implement a version of a ScaLAPACK library routine that runs on the Grid.
  • Make use of the resources at the user's disposal
  • Provide the best time to solution
  • Proceed without the user's involvement
  • Make as few changes as possible to the numerical software.
  • The assumption is that the user is already Grid enabled and runs a program that contacts the execution environment to determine where the execution should take place.

85
To Use ScaLAPACK a User Must
  • Download the package and auxiliary packages (like PBLAS, BLAS, BLACS, MPI) to the machines.
  • Write an SPMD program which
  • Sets up the logical 2-D process grid
  • Places the data on the logical process grid
  • Calls the numerical library routine in an SPMD fashion
  • Collects the solution after the library routine finishes
  • The user must allocate the processors and decide the number of processes the application will run on
  • The user must start the application
  • mpirun -np N user_app
  • Note: the number of processors is fixed by the user before the run, even if the problem size changes dynamically
  • Upon completion, return the processors to the pool of resources

86
GrADS Numerical Library
  • Want to relieve the user of some of the tasks
  • Make decisions on which machines to use, based on the user's problem and the state of the system
  • Determine the machines that can be used
  • Optimize for the best time to solution
  • Distribute the data on the processors and collect the results
  • Start the SPMD library routine on all the platforms
  • Check that the computation is proceeding as planned
  • If not, perhaps migrate the application

87
GrADS Library Sequence
User
Library Routine
  • Has crafted code to make things work correctly and together.

Assumptions: the Autopilot Manager has been started and Globus is there.
88
Resource Selector
Resource Selector
User
Library Routine
  • Uses Globus MDS and Rich Wolski's NWS to build an array of values for the machines that are available to the user.
  • 2 matrices (bandwidth, latency) and 2 arrays (CPU, memory available)
  • Matrix information is clique based
  • On return from the RS, the crafted code filters the information to use only machines that have the necessary software and are really eligible to be used.

89
Arrays of Values Generated by the Resource Selector
  • Clique based
  • 2 @ UT, UCSD, UIUC
  • Part of the MacroGrid
  • Full at the cluster level, plus the connections (clique leaders)
  • Bandwidth and latency information looks like this.
  • Linear arrays for CPU and memory
  • The matrix of values is filled out to generate a complete, dense matrix of values.
  • At this point we have a workable coarse grid.
  • We know what is available, the connections, and the power of the machines

90
ScaLAPACK Performance Model
  • Total number of floating-point operations per
    processor
  • Total number of data items communicated per
    processor
  • Total number of messages
  • Time per floating point operation
  • Time per data item communicated
  • Time per message
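These six quantities combine into a single predicted time, T = F·t_f + V·t_v + M·t_m. A sketch with LU-factorization counts roughly in the spirit of the ScaLAPACK model follows; the communication counts below are approximate and the machine parameters are purely illustrative:

```python
import math

def predicted_time(F, V, M, t_f, t_v, t_m):
    # F flops, V data items communicated, M messages (all per processor)
    return F * t_f + V * t_v + M * t_m

def lu_counts(n, p, nb):
    # Approximate per-processor counts for LU on p processors, block size nb
    F = (2.0 / 3.0) * n**3 / p                          # flop count is standard
    V = n**2 * (3 + 0.25 * math.log2(p)) / math.sqrt(p)  # rough volume estimate
    M = (n / nb) * (6 + math.log2(p))                    # rough message count
    return F, V, M

F, V, M = lu_counts(n=10_000, p=16, nb=40)
t = predicted_time(F, V, M, t_f=2e-9, t_v=1e-7, t_m=1e-4)  # illustrative params
```

For a problem this large the flop term dominates; as latency grows or n shrinks, the message term takes over, which is exactly what the fine-grid selection has to weigh.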

91
Performance Model
Resource Selector
User
Library Routine
  • The Performance Model uses the information generated by the RS to decide on the fine grid.
  • Picks a machine that is closest to every other machine in the collection.
  • If there is not enough memory, it adds machines until it can solve the problem.
  • The cost model is run on this set.
  • The process adds a machine to the group and reruns the cost model.
  • If better, iterate the last step; if not, stop.
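The selection loop described above (seed with the best-connected machine, add one candidate at a time, keep going while the modeled time improves, stop otherwise) can be sketched as follows; the memory-feasibility step is omitted for brevity and the toy cost model is invented for illustration:

```python
def select_fine_grid(candidates, seed, model_time):
    # candidates: machines ordered by the resource selector
    # model_time(subset) -> predicted time to solution (smaller is better)
    chosen = [seed]
    best = model_time(chosen)
    for m in candidates:
        if m == seed:
            continue
        t = model_time(chosen + [m])
        if t < best:
            chosen.append(m)
            best = t
        else:
            break  # no backtracking, as in the GrADS optimizer
    return chosen, best

# Toy cost model: adding machines helps until communication cost dominates
toy = lambda s: 100.0 / len(s) + 4.0 * len(s)
grid, t = select_fine_grid(["a", "b", "c", "d", "e", "f"], "a", toy)
```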

Performance Model
92
Resource Selector / Performance Modeler
  • Refines the coarse grid by determining the process set that will provide the best time to solution.
  • This is based on dynamic information from the grid and the routine's performance model.
  • The PM does a simulation of the actual application using the information from the RS.
  • It literally runs the program without doing the computation or data movement.
  • There is no backtracking in the optimizer.
  • This is an area for enhancement and experimentation.

93
Contract Development
Resource Selector
User
Library Routine
  • Contract between the application and the Grid
    System
  • CD should validate the fine grid.
  • Should iterate between the CD and PM phases to
    get a workable fine grid.

Performance Model
Contract Development
94
Application Launcher
Resource Selector
User
Library Routine
Performance Model
App Launcher
Contract Development
mpirun machinefile globusrsl fine_grid
grid_linear_solve
95
Experimental Hardware / Software Grid
MacroGrid Testbed
  • Autopilot version 2.3
  • Globus version 1.1.3
  • NWS version 2.0.pre2
  • MPICH-G version 1.1.2
  • ScaLAPACK version 1.6
  • ATLAS/BLAS version 3.0.2
  • BLACS version 1.1
  • PAPI version 1.1.5
  • GrADS Crafted code

Independent components being put together and
interacting
96
Performance Model Validation
(Table: model inputs for a refined grid; speed taken as 60% of peak,
latency in msec, bandwidth in Mb/s)
97
N=600, NB=40, 2 torc procs.  Ratio 46.12
N=1500, NB=40, 4 torc procs.  Ratio 15.03
N=5000, NB=40, 6 torc procs.  Ratio 2.25
N=8000, NB=40, 8 torc procs.  Ratio 1.52
N=10,000, NB=40, 8 torc procs.  Ratio 1.29
98
OPUS
OPUS, CYPHER
OPUS, TORC, CYPHER
2 OPUS, 4 TORC, 6 CYPHER
8 OPUS, 4 TORC, 4 CYPHER
8 OPUS, 2 TORC, 6 CYPHER
6 OPUS, 5 CYPHER
8 OPUS, 6 CYPHER
8 OPUS
5 OPUS
8 OPUS
99
Largest Problem Solved
  • Matrix of size 30,000
  • 7.2 GB for the data
  • 32 processors to choose from UIUC and UT
  • Not all machines have 512 MB; some have as
    little as 128 MB
  • PM chose 17 machines in 2 clusters from UT
  • Computation took 84 minutes
  • 3.6 Gflop/s total
  • 210 Mflop/s per processor
  • ScaLAPACK on a cluster of 17 processors would get
    about 50% of peak
  • Processors are 500 MHz, or 500 Mflop/s peak
  • For this grid computation, 20% less than ScaLAPACK

Compiler analogy
100
Contracts, Checkpointing, Migration
  • We are using University of Illinois Autopilot to
    monitor the progress of the execution.
  • The application software has the ability to
    perform a checkpoint and can be restarted.
  • We manually inserted the checkpointing code.
  • If the application is not progressing as the
    contract specifies we want to take some
    corrective action.
  • Go back and figure out where the application can
    be run optimally.
  • Restart the process from the last checkpoint,
    perhaps rearranging the data to fit the new set
    of processors.

101
General Library Interface
  • We have a start on a general interface for
    numerical libraries.
  • It can be a simple operation to plug in other
    numerical routines/libraries.
  • Developing migration mechanisms for contract
    violations.
  • Today a library writer needs to supply
  • Numerical Routine
  • Performance Model
  • The rest of the framework can remain the same.

102
Futures for Numerical Algorithms and Software
  • Numerical software will be adaptive, exploratory,
    and intelligent
  • Polyalgorithms and other techniques
  • Determinism in numerical computing will be gone.
  • After all, it's not reasonable to ask for
    exactness in numerical computations.
  • Auditability of the computation, reproducibility
    at a cost
  • Importance of floating point arithmetic will be
    undiminished.
  • 16, 32, 64, 128 bits and beyond.
  • Interval arithmetic
  • Reproducibility, fault tolerance, and
    auditability
  • Adaptivity is a key so applications can
    effectively use the resources.

103
Contributors to These Ideas
  • Top500
  • Erich Strohmaier, LBL, NERSC
  • Hans Meuer, Mannheim U
  • Horst Simon, LBL, NERSC
  • ATLAS
  • Antoine Petitet, Sun France
  • Clint Whaley, UTK
  • Parallel Computing, Vol 27,
    No 1-2, pp 3-25, 2001
  • PAPI
  • Shirley Browne, UTK
  • Kevin London, UTK
  • Phil Mucci, UTK
  • Keith Seymour, UTK
  • NetSolve
  • Dorian Arnold, UWisc
  • Henri Casanova, UCSD
  • Michelle Miller, UTK
  • Sathish Vadhiyar, UTK
  • For additional information see
  • http://icl.cs.utk.edu/top500/
  • http://icl.cs.utk.edu/atlas/
  • http://icl.cs.utk.edu/papi/
  • http://icl.cs.utk.edu/scalapack/
  • http://icl.cs.utk.edu/netsolve/
  • www.cs.utk.edu/~dongarra/

Many opportunities within the group at Tennessee
104
(No Transcript)
105
6 Variations of Matrix Multiply
106
6 Variations of Matrix Multiply
107
6 Variations of Matrix Multiply
108
6 Variations of Matrix Multiply
109
6 Variations of Matrix Multiply
110
6 Variations of Matrix Multiply
111
6 Variations of Matrix Multiply
112
6 Variations of Matrix Multiply
C
Fortran
113
6 Variations of Matrix Multiply
C
Fortran
However, this is only part of the story
114
SUN Ultra 2, 200 MHz (L1 16 KB, L2 1 MB)
  • jik
  • kji
  • ikj
  • ijk
  • jki
  • kij
  • dgemm

115
Cache Blocking
  • We want blocks to fit into cache. On parallel
    computers we have p x the cache, so data may fit
    into the aggregate cache of p processors but not
    into the cache of one. This can lead to
    superlinear speedup! Consider matrix-matrix
    multiply.
  • An alternate form is ...

do k = 1, n
   do j = 1, n
      do i = 1, n
         c(i,j) = c(i,j) + a(i,k)*b(k,j)
      end do
   end do
end do
116
Cache Blocking
do ii = 1, n, nblk
   do jj = 1, n, nblk
      do kk = 1, n, nblk
         do k = kk, kk+nblk-1
            do j = jj, jj+nblk-1
               do i = ii, ii+nblk-1
                  c(i,j) = c(i,j) + a(i,k)*b(k,j)
               end do
            end do
         end do
      end do
   end do
end do
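The same blocking written in C (row-major). N and NB here are small assumed values for the sketch; in practice the block size is tuned so the three blocks fit in cache:

```c
#include <assert.h>
#include <string.h>

#define N 6
#define NB 2   /* assumed block size; N must be a multiple of NB here */

void matmul_blocked(const double a[N][N], const double b[N][N],
                    double c[N][N]) {
    memset(c, 0, sizeof(double) * N * N);
    for (int ii = 0; ii < N; ii += NB)
      for (int jj = 0; jj < N; jj += NB)
        for (int kk = 0; kk < N; kk += NB)
          /* multiply one NB x NB block; the operands stay cache-resident */
          for (int i = ii; i < ii + NB; i++)
            for (int j = jj; j < jj + NB; j++)
              for (int k = kk; k < kk + NB; k++)
                c[i][j] += a[i][k] * b[k][j];
}
```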
117
Assignment
  • Implement, in Fortran or C, the six different
    ways to perform matrix multiplication by
    interchanging the loops. (Use 64-bit arithmetic.)
    Make each implementation a subroutine, like
  • subroutine ijk( a, m, n, lda, b, k, ldb, c, ldc )
  • subroutine ikj( a, m, n, lda, b, k, ldb, c, ldc )
  • ...
  • Construct a driver program that generates random
    matrices and calls each matrix multiply routine
    with square matrices of orders 50, 100, 150, 200,
    250, and 300, timing the calls and computing the
    Mflop/s rate.
  • Include in your timing routine a call to the
    following system supplied routines
  • call dgemm( 'No', 'No', n, n, n, 1.0d0, a, lda,
    b, ldb, 1.0d0, c, ldc )
  • Write up a description of the timings and explain
    why the routines perform as they do.
  • Download ATLAS from http://www.netlib.org/atlas/,
    build the ATLAS version of DGEMM, and time it.
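For reference, two of the six orderings written in C (the names and N are ours, chosen for the sketch). In C's row-major storage ikj streams through rows of b and c with unit stride, the cache-friendly order; in Fortran's column-major storage jki plays that role:

```c
#include <assert.h>

#define N 4

/* ijk: inner loop forms a dot product; strides down a column of b */
void mm_ijk(const double a[N][N], const double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* ikj: inner loop is a saxpy across rows of b and c (unit stride in C) */
void mm_ikj(const double a[N][N], const double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```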

118
EISPACK and LINPACK
  • EISPACK
  • Designed for the algebraic eigenvalue problems
    Ax = λx and Ax = λBx.
  • Based on the work of J. Wilkinson and colleagues
    in the 1970s.
  • Fortran 77 software based on a translation from
    ALGOL.
  • LINPACK
  • Designed for solving systems of equations, Ax = b.
  • Fortran 77 software using the Level 1 BLAS.

119
History of Block Partitioned Algorithms
  • Early algorithms were designed around small main
    memories, using tapes as secondary storage.
  • Recent work centers on the use of vector
    registers, level 1 and 2 cache, main memory, and
    out-of-core storage.

120
Blocked Partitioned Algorithms
  • LU Factorization
  • Cholesky factorization
  • Symmetric indefinite factorization
  • Matrix inversion
  • QR, QL, RQ, LQ factorizations
  • Form Q or QᵀC
  • Orthogonal reduction to
  • (upper) Hessenberg form
  • symmetric tridiagonal form
  • bidiagonal form
  • Block QR iteration for nonsymmetric eigenvalue
    problems

121
LAPACK
  • Linear Algebra library in Fortran 77
  • Solution of systems of equations
  • Solution of eigenvalue problems
  • Combine algorithms from LINPACK and EISPACK into
    a single package
  • Efficient on a wide range of computers
  • RISC, Vector, SMPs
  • User interface similar to LINPACK
  • Single, Double, Complex, Double Complex
  • Built on the Level 1, 2, and 3 BLAS

122
LAPACK
  • Most of the parallelism in the BLAS.
  • Advantages of using the BLAS for parallelism
  • Clarity
  • Modularity
  • Performance
  • Portability

123
Derivation of Blocked Algorithms: Cholesky
Factorization A = UᵀU

Equating coefficients of the jth column, we obtain
Hence, if U11 has already been computed, we can
compute uj and ujj from the equations
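The equations on this slide were images; the following is a standard reconstruction of the unblocked derivation, consistent with the LINPACK code that follows:

```latex
% Partition the leading j columns of A = U^T U as
\begin{pmatrix} A_{11} & a_j \\ a_j^T & \alpha_{jj} \end{pmatrix}
=
\begin{pmatrix} U_{11}^T & 0 \\ u_j^T & u_{jj} \end{pmatrix}
\begin{pmatrix} U_{11} & u_j \\ 0 & u_{jj} \end{pmatrix}.
% Equating coefficients of the j-th column gives
a_j = U_{11}^T u_j, \qquad
\alpha_{jj} = u_j^T u_j + u_{jj}^2,
% so, with U_{11} already computed, solve a triangular system
% and take a square root:
U_{11}^T u_j = a_j, \qquad
u_{jj} = \sqrt{\alpha_{jj} - u_j^T u_j}.
```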
124
LINPACK Implementation
  • Here is the body of the LINPACK routine SPOFA
    which implements the method
      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF( JM1.LT.1 ) GO TO 20
         DO 10 K = 1, JM1
            T = A( K, J ) - SDOT( K-1, A( 1, K ), 1, A( 1, J ), 1 )
            T = T / A( K, K )
            A( K, J ) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A( J, J ) - S
C     ...EXIT
         IF( S.LE.0.0E0 ) GO TO 40
         A( J, J ) = SQRT( S )
   30 CONTINUE

125
LAPACK Implementation
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-Unit', J-1, A, LDA,
     $               A( 1, J ), 1 )
         S = A( J, J ) - SDOT( J-1, A( 1, J ), 1, A( 1, J ), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A( J, J ) = SQRT( S )
   10 CONTINUE
  • This change by itself is sufficient to
    significantly improve the performance on a number
    of machines.
  • From 238 to 312 Mflop/s for a matrix of order 500
    on a Pentium 4-1.7 GHz.
  • However, the peak is 1,700 Mflop/s, suggesting
    further work is needed.

126
Derivation of Blocked Algorithms

Equating coefficients of the second block of
columns, we obtain
Hence, if U11 has already been computed, we can
compute U12 as the solution of the following
equations by a call to the Level 3 BLAS routine
STRSM
127
LAPACK Blocked Algorithms
      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-Unit',
     $               J-1, JB, ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE,
     $               A( 1, J ), LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
  • On the Pentium 4, the Level 3 BLAS squeeze a lot
    more out of 1 processor

128
LAPACK Contents
  • Combines algorithms from LINPACK and EISPACK into
    a single package. User interface similar to
    LINPACK.
  • Built on the Level 1, 2 and 3 BLAS, for high
    performance (manufacturers optimize BLAS)
  • LAPACK does not provide routines for structured
    problems or general sparse matrices (i.e., sparse
    storage formats such as compressed-row, -column,
    -diagonal, skyline, ...).

129
LAPACK Ongoing Work
  • Add functionality
  • updating/downdating, divide-and-conquer least
    squares, bidiagonal bisection, bidiagonal inverse
    iteration, band SVD, Jacobi methods, ...
  • Move to new generation of high performance
    machines
  • IBM SPs, CRAY T3E, SGI Origin, clusters of
    workstations
  • New challenges
  • New languages: Fortran 90, High Performance
    Fortran, ...
  • Many flavors of message passing (CMMD, MPL,
    NX, ...); need a standard (PVM, MPI): the BLACS
  • Highly varying ratio of computation to
    communication speed
  • Many ways to lay out data
  • Fastest parallel algorithm sometimes less stable
    numerically.

130
Gaussian Elimination
(diagram: elimination introduces zeros below the diagonal,
one column at a time)
Standard way: subtract a multiple of one row from another
131
Gaussian Elimination via a Recursive Algorithm
F. Gustavson and S. Toledo
LU Algorithm:
 1. Split the matrix into two rectangles (m x n/2);
    if only 1 column, scale it by the reciprocal of
    the pivot and return
 2. Apply the LU Algorithm to the left part
 3. Apply the transformations to the right part
    (triangular solve A12 = L⁻¹A12 and
    matrix multiplication A22 = A22 - A21*A12)
 4. Apply the LU Algorithm to the right part
Most of the work is in the matrix multiply:
matrices of size n/2, n/4, n/8, ...
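Those four steps can be sketched compactly in C. This is a simplification for illustration: column-major storage, no pivoting (the real algorithm pivots), and the function and macro names are ours:

```c
#include <assert.h>

#define IDX(i, j, lda) ((i) + (j) * (lda))  /* column-major addressing */

/* Factor the m x n panel A (leading dimension lda) in place as L*U,
   L unit lower triangular, recursively splitting the columns in half. */
void rlu(double *A, int m, int n, int lda) {
    if (n == 1) {                        /* base: scale by 1/pivot */
        for (int i = 1; i < m; i++)
            A[IDX(i, 0, lda)] /= A[IDX(0, 0, lda)];
        return;
    }
    int n1 = n / 2;
    rlu(A, m, n1, lda);                  /* step 2: factor left part */
    double *A12 = A + IDX(0, n1, lda);
    /* step 3a: A12 <- L11^{-1} A12 (unit lower triangular solve) */
    for (int j = 0; j < n - n1; j++)
        for (int i = 1; i < n1; i++)
            for (int k = 0; k < i; k++)
                A12[IDX(i, j, lda)] -= A[IDX(i, k, lda)] * A12[IDX(k, j, lda)];
    /* step 3b: A22 <- A22 - A21*A12 (the bulk of the work) */
    double *A21 = A + IDX(n1, 0, lda);
    double *A22 = A + IDX(n1, n1, lda);
    for (int j = 0; j < n - n1; j++)
        for (int i = 0; i < m - n1; i++)
            for (int k = 0; k < n1; k++)
                A22[IDX(i, j, lda)] -= A21[IDX(i, k, lda)] * A12[IDX(k, j, lda)];
    rlu(A22, m - n1, n - n1, lda);       /* step 4: factor right part */
}
```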
132
Recursive Factorizations
  • Just as accurate as conventional method
  • Same number of operations
  • Automatic variable blocking
  • Level 1 and 3 BLAS only !
  • Extreme clarity and simplicity of expression
  • Highly efficient
  • The recursive formulation is just a rearrangement
    of the point-wise LINPACK algorithm
  • The standard error analysis applies (assuming the
    matrix operations are computed the conventional
    way).
  • OK for LU, LLᵀ, QR
  • Open question for two-sided algorithms, e.g.
    eigenvalue reduction

133
  • Recursive LU

Dual-processor
LAPACK
Recursive LU
Uniprocessor
LAPACK
134
Challenges in Developing Distributed Memory
Libraries
  • How to integrate software?
  • Until recently no standards
  • Many parallel languages
  • Various parallel programming models
  • Assumptions about the parallel environment
  • granularity
  • topology
  • overlapping of communication/computation
  • development tools
  • Where is the data
  • Who owns it?
  • Opt data distribution
  • Who determines data layout
  • Determined by user?
  • Determined by library developer?
  • Allow dynamic data dist.
  • Load balancing

135
ScaLAPACK
  • Library of software dealing with dense and banded
    routines
  • Distributed Memory - Message Passing
  • MIMD Computers and Networks of Workstations
  • Clusters of SMPs

136
Programming Style
  • SPMD Fortran 77 with object based design
  • Built on various modules
  • PBLAS (parallel BLAS)
  • BLACS (interprocessor communication)
  • PVM, MPI
  • Provides right level of notation.
  • BLAS
  • LAPACK software expertise/quality
  • Software approach
  • Numerical methods

137
Overall Structure of Software
  • Object based - Array descriptor
  • Contains information required to establish
    mapping between a global array entry and its
    corresponding process and memory location.
  • Provides a flexible framework to easily specify
    additional data distributions or matrix types.
  • Currently dense, banded, out-of-core
  • Using the concept of context

138
PBLAS
  • Similar to the BLAS in functionality and naming.
  • Built on the BLAS and BLACS
  • Provide global view of matrix
  • CALL DGEXXX ( M, N, A( IA, JA ), LDA,... )
  • CALL PDGEXXX( M, N, A, IA, JA, DESCA,... )

139
ScaLAPACK Structure
Global
Local
140
Choosing a Data Distribution
  • Main issues are
  • Load balancing
  • Use of the Level 3 BLAS

141
Possible Data Layouts
  • 1D block and cyclic column distributions
  • 1D block-cycle column and 2D block-cyclic
    distribution
  • 2D block-cyclic used in ScaLAPACK for dense
    matrices

142
Distribution and Storage
  • Matrix is block-partitioned maps blocks
  • Distributed 2-D block-cyclic scheme
  • 5x5 matrix partitioned in 2x2 blocks, mapped
    onto a 2x2 process grid (the process point of
    view)
  • Routines available to distribute/redistribute
    data.

143
Parallelism in ScaLAPACK
  • Level 3 BLAS block operations
  • All the reduction routines
  • Pipelining
  • QR Algorithm, Triangular Solvers, classic
    factorizations
  • Redundant computations
  • Condition estimators
  • Static work assignment
  • Bisection
  • Task parallelism
  • Sign function eigenvalue computations
  • Divide and Conquer
  • Tridiagonal and band solvers, symmetric
    eigenvalue problem and Sign function
  • Cyclic reduction
  • Reduced system in the band solver
  • Data parallelism
  • Sign function

144
(No Transcript)
145
References
  • http://www.netlib.org
  • http://www.netlib.org/lapack
  • http://www.netlib.org/scalapack
  • http://www.netlib.org/lapack/lawns
  • http://www.netlib.org/atlas
  • http://www.netlib.org/papi/
  • http://www.netlib.org/netsolve/
  • http://www.netlib.org/lapack90
  • http://www.nhse.org
  • http://www.netlib.org/utk/people/JackDongarra/la-sw.html
  • lapack@cs.utk.edu
  • scalapack@cs.utk.edu

146
Motivation for Grid Computing
In the past Isolation
  • Many science and
    engineering problems today require that widely
    dispersed resources be operated as systems.
  • Networking, distributed computing, and parallel
    computation research have matured to make it
    possible for distributed systems to support
    high-performance applications, but...
  • Resources are dispersed
  • Connectivity is variable
  • Dedicated access may not be possible

Today Collaboration
147
Performance Distribution Nov 2001
148
Bandwidth Wont Be A Problem Soon -- Bisection
Bandwidth (BB) Across the US
  • 1971: BB = 112 Kb/s
  • 1986: BB = 1 Mb/s
  • 2001: BB = 200 Gb/s
  • Today in the lab, 4000 channels on a single
    fiber, each channel at 10 Gb/s
  • 12 strands of fiber can carry 4000 x 10 Gb/s,
    or 40 Tb/s
  • 5 backbone networks across the US, each with 2
    sets of 12 strands, can provide 2.4 Pb/s
  • When the network is as fast as the computer's
    internal links, the machine disintegrates across
    the net into a set of special-purpose appliances
  • Gilder Technology Report, June 2000
  • Internet doubling every 9 months
  • Factor of 100 in 5 years
  • BB will grow by a factor of 12,000.