1
Benchmarks for Parallel Systems
  • Sources/Credits
  • Performance of Various Computers Using Standard Linear Equations Software, Jack Dongarra, University of Tennessee, Knoxville TN 37996, Computer Science Technical Report CS-89-85, April 8, 2004. http://www.netlib.org/benchmark/performance.ps
  • http://www.top500.org (Top500; slides courtesy of Jack Dongarra)
  • LINPACK FAQ: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
  • The LINPACK Benchmark: Past, Present, and Future, Jack Dongarra, Piotr Luszczek, and Antoine Petitet
  • NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/

2
LINPACK (Dongarra, 1979)
  • Dense system of linear equations
  • Initially used as a user's guide for the LINPACK package
  • LINPACK, 1979
  • Three variants: the N=100 benchmark, the N=1000 benchmark, and the Highly Parallel Computing benchmark (High Performance Linpack, HPL)

3
LINPACK benchmark
  • Implemented on top of BLAS 1
  • Two main routines: DGEFA (Gaussian elimination, O(n³)) and DGESL (solve Ax = b, O(n²))
  • Major operation (97% of the work): DAXPY, y = y + a·x (sketched after this list)
  • Called O(n²) times; hence about 2n³/3 + 2n² flops (approx.)
  • 64-bit floating point arithmetic
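
A minimal C sketch of the DAXPY kernel named above, assuming unit strides and a simplified signature (the real BLAS routine also takes increment arguments):

    /* Minimal sketch of the BLAS 1 DAXPY operation that dominates
       LINPACK: y = y + a*x on 64-bit (double) data.  Unit strides and
       a simplified signature are assumed. */
    #include <stddef.h>

    void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] += a * x[i];   /* two flops per element: multiply and add */
    }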

4
LINPACK
  • N=100: a 100×100 system of equations. No changes to the code are allowed; the user only supplies a timing routine called SECOND (optimization is left to the compiler)
  • N=1000: a 1000×1000 system; the user may implement any code as long as it delivers the required accuracy. The driver program always credits 2n³/3 + 2n² operations when computing the rate (illustrated after this list)
  • Highly Parallel Computing benchmark: any software may be used and the matrix size can be chosen. Used for the Top500 list
  • Based on 64-bit floating point arithmetic
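
A small illustration (not the benchmark's actual driver) of how a measured solve time is converted into a rate using the fixed 2n³/3 + 2n² operation count; the timing value in main is a made-up example:

    /* Illustrative only: convert a measured solve time into MFLOP/s,
       always crediting 2n^3/3 + 2n^2 operations regardless of the
       algorithm actually used, as the N=1000 driver does. */
    #include <stdio.h>

    double linpack_mflops(double n, double seconds)
    {
        double flops = 2.0 * n * n * n / 3.0 + 2.0 * n * n;
        return flops / seconds / 1.0e6;
    }

    int main(void)
    {
        /* e.g. a hypothetical 1000x1000 solve that took 0.5 seconds */
        printf("%.1f MFLOP/s\n", linpack_mflops(1000.0, 0.5));
        return 0;
    }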

5
HPL Algorithm
  • 2-D block-cyclic data distribution (see the sketch after this list)
  • Right-looking LU factorization
  • Panel factorization: various options
  • - Crout, left-looking, or right-looking recursive variants based on matrix multiply
  • - number of sub-panels
  • - recursive stopping criteria
  • - pivot search and broadcast by binary exchange
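
A rough sketch of the 2-D block-cyclic mapping, assuming a P × Q process grid and block size NB; the function names and figures are illustrative, not HPL's actual API:

    /* Sketch of a 2-D block-cyclic distribution: global entry (i, j)
       belongs to block (i/NB, j/NB), and that block is owned by
       process (block_row mod P, block_col mod Q) of a P x Q grid. */
    #include <stdio.h>

    int owner_row(int i, int nb, int p) { return (i / nb) % p; }
    int owner_col(int j, int nb, int q) { return (j / nb) % q; }

    int main(void)
    {
        int nb = 64, p = 2, q = 4;   /* block size and process grid (assumed) */
        int i = 300, j = 900;        /* a global matrix entry */
        printf("A(%d,%d) is owned by process (%d,%d)\n",
               i, j, owner_row(i, nb, p), owner_col(j, nb, q));
        return 0;
    }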

6
HPL algorithm
  • Panel broadcast: various options
  • Update of trailing matrix
  • - look-ahead pipeline
  • Validity check
  • - scaled residual should be O(1) (see the sketch below)
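
A sketch of the kind of scaled residual such a validity check computes; the exact norms and threshold are set by the benchmark, so treat the formula below as an illustration only:

    /* Scaled residual of the form
       ||Ax - b||_inf / (eps * (||A||_inf * ||x||_inf + ||b||_inf) * n),
       which should stay O(1) for a correct solve.  A is stored row-major. */
    #include <float.h>
    #include <math.h>

    double scaled_residual(int n, const double *A, const double *x,
                           const double *b)
    {
        double rnorm = 0.0, anorm = 0.0, xnorm = 0.0, bnorm = 0.0;
        for (int i = 0; i < n; i++) {
            double ri = -b[i], rowsum = 0.0;
            for (int j = 0; j < n; j++) {
                ri += A[i * n + j] * x[j];        /* (Ax - b)_i */
                rowsum += fabs(A[i * n + j]);
            }
            if (fabs(ri)   > rnorm) rnorm = fabs(ri);
            if (rowsum     > anorm) anorm = rowsum;   /* max row sum = inf norm */
            if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
            if (fabs(b[i]) > bnorm) bnorm = fabs(b[i]);
        }
        return rnorm / (DBL_EPSILON * (anorm * xnorm + bnorm) * n);
    }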

7
Top500 (www.top500.org)
  • Top500 started in 1993
  • Published twice a year, in June and November
  • Top500 reports Nmax, Rmax, N1/2, and Rpeak for each system (see the example after this list)
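
A worked illustration of Rpeak as simple arithmetic on nominal hardware figures (Rmax, by contrast, is measured by the HPL run, so Rmax <= Rpeak); the numbers are the Earth Simulator figures quoted later in this deck:

    /* Rpeak from nominal hardware figures: number of processors times
       peak GFLOP/s per processor.  5,120 CPUs at 8 GFLOPS each give
       about 41 TFLOPS. */
    #include <stdio.h>

    int main(void)
    {
        double procs = 5120.0, gflops_per_proc = 8.0;
        printf("Rpeak = %.2f TFlop/s\n", procs * gflops_per_proc / 1000.0);
        return 0;
    }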

8
TOP500 list: data shown
  • Manufacturer: manufacturer or vendor
  • Computer: type indicated by manufacturer or vendor
  • Installation Site: customer
  • Location: location and country
  • Year: year of installation / last major update
  • Installation Type: Academic, Research, Industry, Vendor, Classified, Government
  • Installation Area: e.g. Research/Energy, Industry/Finance
  • Processors: number of processors
  • Rmax: maximal LINPACK performance achieved
  • Rpeak: theoretical peak performance
  • Nmax: problem size for achieving Rmax
  • N1/2: problem size for achieving half of Rmax
  • Nworld: position within the TOP500 ranking

9
28th List: Top 3
10
28th List: India
11
(No Transcript)
12
(No Transcript)
13
Countries
14
Manufacturer
15
Processor Family
16
System Processor Count
17
NAS Parallel Benchmarks - NPB
  • Also used for evaluation of supercomputers
  • A set of 8 programs derived from CFD
  • 5 kernels, 3 pseudo-applications
  • NPB 1: original benchmarks
  • NPB 2: NAS's MPI implementation. NPB 2.4 Class D has more work and more I/O
  • NPB 3: based on OpenMP, HPF, Java
  • GridNPB 3: for computational grids
  • NPB 3 multi-zone: for hybrid parallelism

18
Kernel Benchmarks
  • EP: embarrassingly parallel
  • MG: multigrid; regular communication
  • CG: conjugate gradient; irregular long-distance communication
  • FT: a 3-D PDE solved using FFTs; rigorous test of long-distance communication
  • IS: large integer sort
  • Detailed rules regarding
  • - brief statement of the problem
  • - algorithm to be used
  • - validation of results
  • - where to insert timing calls
  • - method for generating random numbers (see the sketch after this list)
  • - submission of results
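
As an example of how tightly the rules pin things down, the random numbers must come from a prescribed linear congruential generator. The sketch below assumes the commonly cited NPB parameters (multiplier 5^13, modulus 2^46) and is not the official implementation:

    /* Linear congruential generator x_{k+1} = a * x_k mod 2^46 with
       a = 5^13 (constants assumed from the NPB documents).  Both
       operands are split into 23-bit halves so the multiplication
       never overflows 64-bit integers. */
    #include <stdint.h>

    #define NPB_MOD   ((uint64_t)1 << 46)
    #define NPB_MULT  1220703125ULL      /* 5^13 */
    #define MASK23    ((1ULL << 23) - 1)

    /* advance the seed and return a uniform value in (0, 1) */
    double npb_rand(uint64_t *x)
    {
        uint64_t a_hi = NPB_MULT >> 23, a_lo = NPB_MULT & MASK23;
        uint64_t x_hi = *x >> 23,       x_lo = *x & MASK23;
        uint64_t t = (x_hi * a_lo + x_lo * a_hi) & MASK23;
        *x = ((t << 23) + x_lo * a_lo) & (NPB_MOD - 1);
        return (double)*x / (double)NPB_MOD;
    }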

19
Pseudo-applications / Synthetic CFD codes
  • Benchmark 1: perform a few iterations of the approximate factorization algorithm (SP)
  • Benchmark 2: perform a few iterations of the diagonal form of the approximate factorization algorithm (BT)
  • Benchmark 3: perform a few iterations of SSOR (LU)

20
Class A and Class B
(Table comparing problem sizes for the Sample Code, Class A, and Class B not transcribed)
21
Thank You !
22
LINPACK
  • 100×100: inner-loop optimization
  • 1000×1000: three-loop / whole-program optimization
  • Scalable parallel program: largest problem that can fit in memory
  • Template of Linpack code (skeleton after this list):
  • Generate
  • Solve
  • Check
  • Time
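
A self-contained sketch of that generate / solve / check / time template; the simple matrix generator and unpivoted elimination below merely stand in for the real LINPACK routines (DGEFA/DGESL):

    /* Sketch of the generate / solve / check / time structure; only
       the structure matters, not the stand-in numerical routines. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <time.h>

    #define N 1000

    static void generate(int n, double *A, double *b)          /* Generate */
    {
        srand(42);                           /* fixed seed: reproducible */
        for (int i = 0; i < n; i++) {
            b[i] = (double)rand() / RAND_MAX;
            for (int j = 0; j < n; j++)
                A[i*n+j] = (double)rand() / RAND_MAX + (i == j ? n : 0.0);
        }
    }

    static void solve(int n, double *A, double *b, double *x)  /* Solve */
    {
        for (int k = 0; k < n; k++)          /* unpivoted elimination */
            for (int i = k + 1; i < n; i++) {
                double m = A[i*n+k] / A[k*n+k];
                for (int j = k; j < n; j++) A[i*n+j] -= m * A[k*n+j];
                b[i] -= m * b[k];
            }
        for (int i = n - 1; i >= 0; i--) {   /* back substitution */
            double s = b[i];
            for (int j = i + 1; j < n; j++) s -= A[i*n+j] * x[j];
            x[i] = s / A[i*n+i];
        }
    }

    int main(void)
    {
        static double A[N*N], b[N], x[N];
        generate(N, A, b);

        clock_t t0 = clock();                /* Time only the solve */
        solve(N, A, b, x);
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        generate(N, A, b);                   /* solve overwrote A and b */
        double r = 0.0;                      /* Check: max |(Ax - b)_i| */
        for (int i = 0; i < N; i++) {
            double ri = -b[i];
            for (int j = 0; j < N; j++) ri += A[i*N+j] * x[j];
            if (fabs(ri) > r) r = fabs(ri);
        }
        double flops = 2.0*N*(double)N*N/3.0 + 2.0*(double)N*N;
        printf("residual %.2e, %.1f MFLOP/s\n", r, flops / sec / 1.0e6);
        return 0;
    }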

23
HPL (Implementation of HPLinpack Benchmark)
24
22nd List: The TOP10
25
Trends / Manufacturers
26
(No Transcript)
27
#1: Earth Simulator
  • Located in Yokohama
  • 5,120 NEC SX-5 CPUs (640 nodes, 8-way each)
  • 8 GFLOPS per CPU (41 TFLOPS total)
  • 640 × 640 crossbar switch between the nodes
  • 16 GB/s inter-node bandwidth
  • Super-UX, a Unix-based OS
  • Three-level parallel system
  • Occupies a 4-storey building

28
#2 and #3
  • #2: ASCI Q at LANL, 13.88 TFlop/s
  • Based on 1,024 HP AlphaServer ES45 nodes, each a 4-way SMP
  • Quadrics interconnect
  • #3: based on Apple G5 systems at Virginia Tech
  • 1,100 2-way systems, 10.28 TFlop/s

29
Highlights from Top10
  • The third system ever to exceed the 10 TFlop/s mark is Virginia Tech's X, measured at 10.28 TFlop/s. This cluster is built with the Apple G5 as its building block.
  • #6 is the first system in the TOP500 based on AMD's Opteron chip. It was installed by Linux Networx at the Los Alamos National Laboratory and also uses a Myrinet interconnect.
  • The number of cluster systems in the TOP10 has grown impressively to seven.
  • With the exception of the leading Earth
    Simulator, all other TOP10 systems are installed
    in the U.S.
  • 208 systems are now labeled as clusters, up from
    149. This makes clustered systems the most common
    architecture in the TOP500.
  • IBM still leads the list in total installed performance, increasing its share to 35.4 percent, up from 31.8 percent one year ago and 34.9 percent six months ago. HP is second in installed performance with 22.7 percent and NEC is third with 8.7 percent.
  • With respect to the number of systems, Hewlett-Packard topped IBM again by a small margin: HP has 165 systems installed (up from 159) and IBM has 159 (up one system). SGI is again third with 41 systems, down from 54.

30
NPB 1.0 (March 1994)
  • Defines Class A and Class B versions
  • "Paper and pencil" algorithmic specifications
  • Generic benchmarks, as compared with the MPI-based Linpack (HPL)
  • General rules for implementations: Fortran 90 or C, 64-bit arithmetic, etc.
  • Sample implementations provided

31
NPB 2.0 (1995)
  • MPI and Fortran 77 implementations
  • 2 parallel kernels (MG, FT) and 3 simulated applications (LU, SP, BT)
  • Class C: bigger problem sizes
  • Benchmark rules based on the amount of change to the source code: 0%, 5%, or >5%

32
NPB 2.2 (1996), 2.4 (2002), 2.4 I/O (Jan 2003)
  • EP and IS added
  • FT rewritten
  • NPB 2.4: Class D and the rationale for Class D sizes
  • 2.4 I/O: a new benchmark problem based on BT (BTIO) to test output capabilities
  • An MPI implementation of the same (using MPI-IO), with different options such as collective buffering or not