1
Benchmarks for Parallel Systems
  • Sources/Credits
  • Performance of Various Computers Using Standard Linear Equations Software, Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004. http://www.netlib.org/benchmark/performance.ps
  • http://www.top500.org (Top500; slides courtesy of Jack Dongarra)
  • LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
  • The LINPACK Benchmark: Past, Present, and Future, Jack Dongarra, Piotr Luszczek, and Antoine Petitet
  • NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/

2
LINPACK (Dongarra, 1979)
  • Dense system of linear equations
  • Initially used as a user's guide for the LINPACK package
  • LINPACK, 1979
  • Three variants: the N=100 benchmark, the N=1000 benchmark, and the Highly Parallel Computing benchmark (High Performance Linpack, HPL)

3
LINPACK benchmark
  • Implemented on top of BLAS 1
  • Two main routines: DGEFA (Gaussian elimination, O(n³)) and DGESL (solve Ax = b, O(n²))
  • Major operation (97% of the work): DAXPY, y = y + a·x (sketched after this list)
  • Called O(n²) times; hence about 2n³/3 + 2n² flops (approx.)
  • 64-bit floating point arithmetic
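
A minimal C sketch of the DAXPY kernel named above, assuming unit strides and a simplified signature (the real BLAS routine also takes increment arguments):

    /* Minimal sketch of the BLAS 1 DAXPY operation that dominates
       LINPACK: y = y + a*x on 64-bit (double) data.  Unit strides and
       a simplified signature are assumed. */
    #include <stddef.h>

    void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];   /* two flops per element: multiply and add */
    }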

4
LINPACK
  • N=100: a 100×100 system of equations. No changes to the code are allowed; the user only supplies a timing routine called SECOND (optimization is left to the compiler)
  • N=1000: a 1000×1000 system; the user may implement any code as long as it delivers the required accuracy. The driver program always credits 2n³/3 + 2n² operations when computing the rate (illustrated after this list)
  • Highly Parallel Computing benchmark: any software may be used and the matrix size can be chosen. Used for the Top500 list
  • Based on 64-bit floating point arithmetic
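
A small illustration (not the benchmark's actual driver) of how a measured solve time is converted into a rate using the fixed 2n³/3 + 2n² operation count; the timing value in main is a made-up example:

    /* Illustrative only: convert a measured solve time into MFLOP/s,
       always crediting 2n^3/3 + 2n^2 operations regardless of the
       algorithm actually used, as the N=1000 driver does. */
    #include <stdio.h>

    double linpack_mflops(double n, double seconds)
    {
        double flops = 2.0 * n * n * n / 3.0 + 2.0 * n * n;
        return flops / seconds / 1.0e6;
    }

    int main(void)
    {
        /* e.g. a hypothetical 1000x1000 solve that took 0.5 seconds */
        printf("%.1f MFLOP/s\n", linpack_mflops(1000.0, 0.5));
        return 0;
    }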

5
HPL Algorithm
  • 2-D block-cyclic data distribution (see the sketch after this list)
  • Right-looking LU factorization
  • Panel factorization: various options
  • - Crout, left-looking, or right-looking recursive variants based on matrix multiply
  • - number of sub-panels
  • - recursive stopping criteria
  • - pivot search and broadcast by binary exchange
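
A rough sketch of the 2-D block-cyclic mapping, assuming a P × Q process grid and block size NB; the function names and figures are illustrative, not HPL's actual API:

    /* Sketch of a 2-D block-cyclic distribution: global entry (i, j)
       belongs to block (i/NB, j/NB), and that block is owned by
       process (block_row mod P, block_col mod Q) of a P x Q grid. */
    #include <stdio.h>

    int owner_row(int i, int nb, int p) { return (i / nb) % p; }
    int owner_col(int j, int nb, int q) { return (j / nb) % q; }

    int main(void)
    {
        int nb = 64, p = 2, q = 4;   /* block size and process grid (assumed) */
        int i = 300, j = 900;        /* a global matrix entry */
        printf("A(%d,%d) is owned by process (%d,%d)\n",
               i, j, owner_row(i, nb, p), owner_col(j, nb, q));
        return 0;
    }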

6
HPL algorithm
  • Panel broadcast: various options
  • Update of trailing matrix
  • - look-ahead pipeline
  • Validity check
  • - scaled residual should be O(1) (see the sketch below)
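
A sketch of the kind of scaled residual such a validity check computes; the exact norms and threshold are set by the benchmark, so treat the formula below as an illustration only:

    /* Scaled residual of the form
       ||Ax - b||_inf / (eps * (||A||_inf * ||x||_inf + ||b||_inf) * n),
       which should stay O(1) for a correct solve.  A is stored row-major. */
    #include <float.h>
    #include <math.h>

    double scaled_residual(int n, const double *A, const double *x,
                           const double *b)
    {
        double rnorm = 0.0, anorm = 0.0, xnorm = 0.0, bnorm = 0.0;
        for (int i = 0; i < n; i++) {
            double ri = -b[i], rowsum = 0.0;
            for (int j = 0; j < n; j++) {
                ri += A[i * n + j] * x[j];        /* (Ax - b)_i */
                rowsum += fabs(A[i * n + j]);
            }
            if (fabs(ri)   > rnorm) rnorm = fabs(ri);
            if (rowsum     > anorm) anorm = rowsum;   /* max row sum = inf norm */
            if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
            if (fabs(b[i]) > bnorm) bnorm = fabs(b[i]);
        }
        return rnorm / (DBL_EPSILON * (anorm * xnorm + bnorm) * n);
    }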

7
Top500 (www.top500.org)
  • Top500 started in 1993
  • Published twice a year, in June and November
  • Top500 reports Nmax, Rmax, N1/2, and Rpeak for each system (see the example after this list)
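
A worked illustration of Rpeak as simple arithmetic on nominal hardware figures (Rmax, by contrast, is measured by the HPL run, so Rmax <= Rpeak); the numbers are the Earth Simulator figures quoted later in this deck:

    /* Rpeak from nominal hardware figures: number of processors times
       peak GFLOP/s per processor.  5,120 CPUs at 8 GFLOPS each give
       about 41 TFLOPS. */
    #include <stdio.h>

    int main(void)
    {
        double procs = 5120.0, gflops_per_proc = 8.0;
        printf("Rpeak = %.2f TFlop/s\n", procs * gflops_per_proc / 1000.0);
        return 0;
    }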

8
TOP500 list: data shown
  • Manufacturer: manufacturer or vendor
  • Computer: type indicated by manufacturer or vendor
  • Installation Site: customer
  • Location: location and country
  • Year: year of installation / last major update
  • Installation Type: Academic, Research, Industry, Vendor, Classified, Government
  • Installation Area: e.g. Research/Energy, Industry/Finance
  • Processors: number of processors
  • Rmax: maximal LINPACK performance achieved
  • Rpeak: theoretical peak performance
  • Nmax: problem size for achieving Rmax
  • N1/2: problem size for achieving half of Rmax
  • Nworld: position within the TOP500 ranking

9
28th List: Top 3
10
28th List: India
11
(No Transcript)
12
(No Transcript)
13
Countries
14
Manufacturer
15
Processor Family
16
System Processor Count
17
NAS Parallel Benchmarks - NPB
  • Also used for evaluation of supercomputers
  • A set of 8 programs derived from CFD
  • 5 kernels, 3 pseudo-applications
  • NPB 1: original benchmarks
  • NPB 2: NAS's MPI implementation. NPB 2.4 Class D has more work and more I/O
  • NPB 3: based on OpenMP, HPF, Java
  • GridNPB 3: for computational grids
  • NPB 3 multi-zone: for hybrid parallelism

18
Kernel Benchmarks
  • EP: embarrassingly parallel
  • MG: multigrid; regular communication
  • CG: conjugate gradient; irregular long-distance communication
  • FT: a 3-D PDE solved using FFTs; rigorous test of long-distance communication
  • IS: large integer sort
  • Detailed rules regarding
  • - brief statement of the problem
  • - algorithm to be used
  • - validation of results
  • - where to insert timing calls
  • - method for generating random numbers (see the sketch after this list)
  • - submission of results
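
As an example of how tightly the rules pin things down, the random numbers must come from a prescribed linear congruential generator. The sketch below assumes the commonly cited NPB parameters (multiplier 5^13, modulus 2^46) and is not the official implementation:

    /* Linear congruential generator x_{k+1} = a * x_k mod 2^46 with
       a = 5^13 (constants assumed from the NPB documents).  Both
       operands are split into 23-bit halves so the multiplication
       never overflows 64-bit integers. */
    #include <stdint.h>

    #define NPB_MOD   ((uint64_t)1 << 46)
    #define NPB_MULT  1220703125ULL      /* 5^13 */
    #define MASK23    ((1ULL << 23) - 1)

    /* advance the seed and return a uniform value in (0, 1) */
    double npb_rand(uint64_t *x)
    {
        uint64_t a_hi = NPB_MULT >> 23, a_lo = NPB_MULT & MASK23;
        uint64_t x_hi = *x >> 23,       x_lo = *x & MASK23;
        uint64_t t = (x_hi * a_lo + x_lo * a_hi) & MASK23;
        *x = ((t << 23) + x_lo * a_lo) & (NPB_MOD - 1);
        return (double)*x / (double)NPB_MOD;
    }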

19
Pseudo-applications / Synthetic CFD codes
  • Benchmark 1: perform a few iterations of the approximate factorization algorithm (SP)
  • Benchmark 2: perform a few iterations of the diagonal form of the approximate factorization algorithm (BT)
  • Benchmark 3: perform a few iterations of SSOR (LU)

20
Class A and Class B
(Table comparing problem sizes for the Sample Code, Class A, and Class B not transcribed)
21
Thank You !
22
LINPACK
  • 100×100: inner-loop optimization
  • 1000×1000: three-loop / whole-program optimization
  • Scalable parallel program: largest problem that can fit in memory
  • Template of Linpack code (skeleton after this list):
  • Generate
  • Solve
  • Check
  • Time
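
A self-contained sketch of that generate / solve / check / time template; the simple matrix generator and unpivoted elimination below merely stand in for the real LINPACK routines (DGEFA/DGESL):

    /* Sketch of the generate / solve / check / time structure; only
       the structure matters, not the stand-in numerical routines. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <time.h>

    #define N 1000

    static void generate(int n, double *A, double *b)          /* Generate */
    {
        srand(42);                           /* fixed seed: reproducible */
        for (int i = 0; i < n; i++) {
            b[i] = (double)rand() / RAND_MAX;
            for (int j = 0; j < n; j++)
                A[i*n+j] = (double)rand() / RAND_MAX + (i == j ? n : 0.0);
        }
    }

    static void solve(int n, double *A, double *b, double *x)  /* Solve */
    {
        for (int k = 0; k < n; k++)          /* unpivoted elimination */
            for (int i = k + 1; i < n; i++) {
                double m = A[i*n+k] / A[k*n+k];
                for (int j = k; j < n; j++) A[i*n+j] -= m * A[k*n+j];
                b[i] -= m * b[k];
            }
        for (int i = n - 1; i >= 0; i--) {   /* back substitution */
            double s = b[i];
            for (int j = i + 1; j < n; j++) s -= A[i*n+j] * x[j];
            x[i] = s / A[i*n+i];
        }
    }

    int main(void)
    {
        static double A[N*N], b[N], x[N];
        generate(N, A, b);

        clock_t t0 = clock();                /* Time only the solve */
        solve(N, A, b, x);
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        generate(N, A, b);                   /* solve overwrote A and b */
        double r = 0.0;                      /* Check: max |(Ax - b)_i| */
        for (int i = 0; i < N; i++) {
            double ri = -b[i];
            for (int j = 0; j < N; j++) ri += A[i*N+j] * x[j];
            if (fabs(ri) > r) r = fabs(ri);
        }
        double flops = 2.0*N*(double)N*N/3.0 + 2.0*(double)N*N;
        printf("residual %.2e, %.1f MFLOP/s\n", r, flops / sec / 1.0e6);
        return 0;
    }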

23
HPL (Implementation of HPLinpack Benchmark)
24
22nd List: The TOP10
25
Trends / Manufacturers
26
(No Transcript)
27
#1: Earth Simulator
  • Located in Yokohama
  • 5,120 NEC SX-5 CPUs (640 nodes, 8-way each)
  • 8 GFLOPS per CPU (41 TFLOPS total)
  • 640 × 640 crossbar switch between the nodes
  • 16 GB/s inter-node bandwidth
  • Super-UX, a Unix-based OS
  • Three-level parallel system
  • Occupies a 4-storey building

28
#2 and #3
  • #2: ASCI Q at LANL, 13.88 TFlop/s
  • Based on 1,024 HP AlphaServer ES45 nodes, each a 4-way SMP
  • Quadrics interconnect
  • #3: based on Apple G5 systems at Virginia Tech
  • 1,100 2-way systems, 10.28 TFlop/s

29
Highlights from Top10
  • The third system ever to exceed the 10 TFlop/s mark is Virginia Tech's X, measured at 10.28 TFlop/s. This cluster is built with the Apple G5 as its building block.
  • #6 is the first system in the TOP500 based on AMD's Opteron chip. It was installed by Linux Networx at the Los Alamos National Laboratory and also uses a Myrinet interconnect.
  • The number of cluster systems in the TOP10 has grown impressively to seven.
  • With the exception of the leading Earth
    Simulator, all other TOP10 systems are installed
    in the U.S.
  • 208 systems are now labeled as clusters, up from
    149. This makes clustered systems the most common
    architecture in the TOP500.
  • IBM still leads the list in total installed performance, increasing its share to 35.4 percent, up from 31.8 percent one year ago and 34.9 percent six months ago. HP is second in installed performance with 22.7 percent and NEC is third with 8.7 percent.
  • With respect to the number of systems, Hewlett-Packard topped IBM again by a small margin: HP has 165 systems installed (up from 159) and IBM has 159 (up one system). SGI is again third with 41 systems, down from 54.

30
NPB 1.0 (March 1994)
  • Defines Class A and Class B versions
  • "Paper and pencil" algorithmic specifications
  • Generic benchmarks, as compared with the MPI-based Linpack (HPL)
  • General rules for implementations: Fortran 90 or C, 64-bit arithmetic, etc.
  • Sample implementations provided

31
NPB 2.0 (1995)
  • MPI and Fortran 77 implementations
  • 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT)
  • Class C: bigger problem sizes
  • Benchmark rules based on the amount of change to the source code: 0%, 5%, or >5%

32
NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003)
  • EP and IS added
  • FT rewritten
  • NPB 2.4: Class D and the rationale for Class D sizes
  • 2.4 I/O: a new benchmark problem based on BT (BTIO) to test output capabilities
  • An MPI implementation of the same (using MPI-IO), with different options such as collective buffering or not