An Overview of High Performance Computing and Challenges for the Future

Transcript and Presenter's Notes
1
An Overview of High Performance Computing and
Challenges for the Future
  • Jack Dongarra
  • INNOVATIVE COMPUTING LABORATORY
  • University of Tennessee
  • Oak Ridge National Laboratory
  • University of Manchester

2
Outline
  • Top500 Results
  • Four Important Concepts that Will Affect Math
    Software
  • Effective Use of Many-Core
  • Exploiting Mixed Precision in Our Numerical
    Computations
  • Self Adapting / Auto Tuning of Software
  • Fault Tolerant Algorithms

3
H. Meuer, H. Simon, E. Strohmaier, JD
  • Listing of the 500 most powerful computers in the world
  • Yardstick: Rmax from LINPACK MPP benchmark (Ax = b, dense problem)
  • Updated twice a year: SCxy in the States in November, meeting in Germany in June
  • All data available from www.top500.org
4
Performance Development
My Laptop
5
29th List: The TOP10

Rank | Manufacturer | Computer | Rmax (TF/s) | Installation Site | Country | Year | #Proc
1 | IBM | BlueGene/L (eServer Blue Gene) | 280.6 | DOE/NNSA/LLNL | USA | 2005 | 131,072
2 | Cray | Jaguar (Cray XT3/XT4) | 101.7 | DOE/ORNL | USA | 2007 | 23,016
3 | Sandia/Cray | Red Storm (Cray XT3) | 101.4 | DOE/NNSA/Sandia | USA | 2006 | 26,544
4 | IBM | BGW (eServer Blue Gene) | 91.29 | IBM Thomas Watson | USA | 2005 | 40,960
5 | IBM | New York Blue (eServer Blue Gene) | 82.16 | Stony Brook/BNL | USA | 2007 | 36,864
6 | IBM | ASC Purple (eServer pSeries p575) | 75.76 | DOE/NNSA/LLNL | USA | 2005 | 12,208
7 | IBM | BlueGene/L (eServer Blue Gene) | 73.03 | Rensselaer Polytechnic Institute/CCNI | USA | 2007 | 32,768
8 | Dell | Abe (PowerEdge 1955, InfiniBand) | 62.68 | NCSA | USA | 2007 | 9,600
9 | IBM | MareNostrum (JS21 Cluster, Myrinet) | 62.63 | Barcelona Supercomputing Center | Spain | 2006 | 12,240
10 | SGI | HLRB-II (SGI Altix 4700) | 56.52 | LRZ | Germany | 2007 | 9,728
6
Performance Projection
7
Cores per System - June 2007
8
14 systems > 50 Tflop/s
88 systems > 10 Tflop/s
326 systems > 5 Tflop/s
9
Chips Used in Each of the 500 Systems
Intel 58%, AMD 21%, IBM 17% (together 96% of the 500 systems)
10
Interconnects / Systems
Systems by interconnect family: GigE (206), InfiniBand (128), Myrinet (46)
11
Countries / Systems
Rank | Site | Manufacturer | Computer | Procs | Rmax | Segment | Interconnect Family
66 | CINECA | IBM | eServer 326 (Opteron Dual) | 5120 | 12608 | Academic | InfiniBand
132 | SCS S.r.l. | HP | Cluster Platform 3000 (Xeon) | 1024 | 7987.2 | Research | InfiniBand
271 | Telecom Italia | HP | SuperDome 875 MHz | 3072 | 5591 | Industry | Myrinet
295 | Telecom Italia | HP | Cluster Platform 3000 (Xeon) | 740 | 5239 | Industry | GigE
305 | Esprinet | HP | Cluster Platform 3000 (Xeon) | 664 | 5179 | Industry | GigE
12
Power is an Industry Wide Problem
  • Google facilities
  • leveraging hydroelectric power
  • old aluminum plants
  • >500,000 servers worldwide

"Hiding in Plain Sight, Google Seeks More Power," by John Markoff, NYT, June 14, 2006
New Google plant in The Dalles, Oregon (from NYT, June 14, 2006)
13
Gflop/s per kWatt in the Top 20
14
IBM BlueGene/L: #1, 131,072 Cores
Total of 33 systems in the Top500
1.6 MWatts (1600 homes) 43,000 ops/s/person
BlueGene/L Compute ASIC
Full system total of 131,072 processors
Fastest Computer: BG/L, 700 MHz, 131K processors, 64 racks. Peak 367 Tflop/s; Linpack 281 Tflop/s (77% of peak).
The compute node ASICs include all networking and processor functionality. Each compute ASIC includes two 32-bit superscalar PowerPC 440 embedded cores (note that L1 cache coherence is not maintained between these cores).
(Linpack run: about 13K seconds, roughly 3.6 hours, with n = 1.8M)
15
Increasing the number of gates into a tight knot and decreasing the cycle time of the processor
We have seen an increasing number of gates on a chip and increasing clock speed. Heat is becoming an unmanageable problem (Intel processors > 100 Watts). We will not see the dramatic increases in clock speeds in the future; however, the number of gates on a chip will continue to increase.
16
Power Cost of Frequency
  • Power ∝ Voltage^2 x Frequency (V^2 F)
  • Frequency ∝ Voltage
  • Power ∝ Frequency^3

17
Power Cost of Frequency
  • Power ∝ Voltage^2 x Frequency (V^2 F)
  • Frequency ∝ Voltage
  • Power ∝ Frequency^3 (a quick numeric illustration follows below)

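A quick numeric illustration of the cubic relationship above (a minimal sketch; the specific frequency reductions are illustrative, not from the slide):

```python
# Dynamic power scales roughly as V^2 * f, and V scales roughly with f,
# so power scales roughly as f^3, as stated above.
def relative_power(freq_scale):
    """Power relative to the baseline when frequency is scaled by freq_scale."""
    return freq_scale ** 3

# Dropping the clock by 15% cuts dynamic power by roughly 39%, which is why
# many slower cores can beat one fast core on performance per watt.
for scale in (1.0, 0.85, 0.5):
    print(f"frequency x{scale:.2f} -> power x{relative_power(scale):.2f}")
```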
18
What's Next?
Mixed Large and Small Cores
All Large Cores
Many Small Cores
All Small Cores
Different Classes of Chips: Home, Games/Graphics, Business, Scientific
Many Floating-Point Cores
19
Novel Opportunities in Multicores
  • Don't have to contend with uniprocessors
  • Not your same old multiprocessor problem
  • How does going from Multiprocessors to Multicores
    impact programs?
  • What changed?
  • Where is the Impact?
  • Communication Bandwidth
  • Communication Latency

20
Communication Bandwidth
  • How much data can be communicated between two
    cores?
  • What changed?
  • Number of Wires
  • Clock rate
  • Multiplexing
  • Impact on programming model?
  • Massive data exchange is possible
  • Data movement is not the bottleneck => processor
    affinity not that important

10,000X: from 32 Gigabits/s to 300 Terabits/s
21
Communication Latency
  • How long does it take for a round trip
    communication?
  • What changed?
  • Length of wire
  • Pipeline stages
  • Impact on programming model?
  • Ultra-fast synchronization
  • Can run real-time apps on multiple cores

50X: from 200 cycles to 4 cycles (a simple cost-model sketch follows below)
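To make the bandwidth and latency changes on the last two slides concrete, here is a minimal transfer-cost sketch using time = latency + bytes/bandwidth (the 3.2 GHz clock used to convert bandwidth into bytes per cycle and the message sizes are assumptions; the headline rates and latencies are the slides' numbers):

```python
# Simple transfer-time model: time = latency + bytes / bandwidth.
def transfer_cycles(n_bytes, latency_cycles, bits_per_sec, clock_hz=3.2e9):
    bytes_per_cycle = bits_per_sec / 8 / clock_hz
    return latency_cycles + n_bytes / bytes_per_cycle

for n_bytes in (8, 64, 4096):  # a scalar, a cache line, a small tile
    off_chip = transfer_cycles(n_bytes, 200, 32e9)     # multiprocessor: 200 cycles, 32 Gbit/s
    on_chip = transfer_cycles(n_bytes, 4, 300e12)      # multicore: 4 cycles, 300 Tbit/s
    print(f"{n_bytes:5d} B: off-chip {off_chip:8.1f} cycles, on-chip {on_chip:6.2f} cycles")
```

Even for tiny messages the on-chip case stays within a handful of cycles, which is what makes ultra-fast synchronization between cores plausible.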
22
80 Core
  • Intel's 80-core chip
  • 1 Tflop/s
  • 62 Watts
  • 1.2 TB/s internal BW

23
24
NSF Track 1 NCSA/UIUC
  • $200M
  • 10 Pflop/s (peak arithmetic sketched below)
  • 40K 8-core 4 GHz IBM Power7 chips
  • 1.2 PB memory
  • 5 PB/s global bandwidth
  • interconnect BW of 0.55 PB/s
  • 18 PB disk at 1.8 TB/s I/O bandwidth
  • For use by a few people

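A back-of-the-envelope check of the 10 Pflop/s figure above (a sketch; the 8 flops/cycle/core rate assumed for Power7 is my assumption, not stated on the slide):

```python
chips = 40_000
cores_per_chip = 8
clock_hz = 4e9
flops_per_cycle_per_core = 8      # assumed for IBM Power7 (not stated on the slide)

peak = chips * cores_per_chip * clock_hz * flops_per_cycle_per_core
print(f"peak = {peak / 1e15:.2f} Pflop/s")   # ~10.24 Pflop/s, consistent with ~10 Pflop/s
```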
25
NSF UTK/JICS Track 2 proposal
  • $65M over 5 years for a 1 Pflop/s system
  • $30M over 5 years for equipment
  • 36 cabinets of a Cray XT5
  • (AMD 8-core/chip, 12 sockets/board, 3 GHz, 4 flops/cycle/core)
  • $35M over 5 years for operations
  • Power cost: $1.1M/year
  • Cray maintenance: $1M/year
  • To be used by the NSF community (1000s of users)
  • Joins UCSD, PSC, TACC

26
Last Year's Track 2 Award to U of Texas
27
Major Changes to Software
  • Must rethink the design of our software
  • Another disruptive technology
  • Similar to what happened with cluster computing
    and message passing
  • Rethink and rewrite the applications, algorithms,
    and software
  • Numerical libraries, for example, will change
  • For example, both LAPACK and ScaLAPACK will
    undergo major changes to accommodate this

28
Major Changes to Software
  • Must rethink the design of our software
  • Another disruptive technology
  • Similar to what happened with cluster computing
    and message passing
  • Rethink and rewrite the applications, algorithms,
    and software
  • Numerical libraries, for example, will change
  • For example, both LAPACK and ScaLAPACK will
    undergo major changes to accommodate this

29
A New Generation of Software
Algorithms follow hardware evolution in time:
LINPACK (80s) | Vector operations | Rely on Level-1 BLAS operations
LAPACK (90s) | Blocking, cache friendly | Rely on Level-3 BLAS operations
PLASMA (00s) | New algorithms (many-core friendly) | Rely on a DAG/scheduler, block data layout, and some extra kernels
These new algorithms:
  • have very low granularity and scale very well (multicore, petascale computing, ...)
  • remove many of the dependencies among tasks (multicore, distributed computing)
  • avoid latency (distributed computing, out-of-core)
  • rely on fast kernels
These new algorithms need new kernels and rely on efficient scheduling algorithms.
30
A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
Algorithms follow hardware evolution in time:
LINPACK (80s) | Vector operations | Rely on Level-1 BLAS operations
LAPACK (90s) | Blocking, cache friendly | Rely on Level-3 BLAS operations
PLASMA (00s) | New algorithms (many-core friendly) | Rely on a DAG/scheduler, block data layout, and some extra kernels
These new algorithms:
  • have very low granularity and scale very well (multicore, petascale computing, ...)
  • remove many of the dependencies among tasks (multicore, distributed computing)
  • avoid latency (distributed computing, out-of-core)
  • rely on fast kernels
These new algorithms need new kernels and rely on efficient scheduling algorithms (a task-DAG sketch follows below).
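To illustrate the kind of DAG-driven, tile-based execution PLASMA relies on, here is a toy task graph for a tiled Cholesky factorization (illustrative only: the potrf/trsm/syrk/gemm task names follow the usual BLAS/LAPACK naming, and the dependency tracking is a simple last-writer scheme, not PLASMA's runtime):

```python
from collections import defaultdict

def cholesky_dag(nt):
    """Task DAG for a right-looking tiled Cholesky on an nt x nt grid of tiles."""
    deps = defaultdict(set)      # task name -> names of tasks it must wait for
    last_writer = {}             # tile coordinate -> last task that wrote it

    def add_task(name, reads, writes):
        for tile in reads + writes:
            if tile in last_writer:
                deps[name].add(last_writer[tile])
        for tile in writes:
            last_writer[tile] = name
        deps.setdefault(name, set())

    for k in range(nt):
        add_task(f"potrf({k})", reads=[], writes=[(k, k)])
        for i in range(k + 1, nt):
            add_task(f"trsm({i},{k})", reads=[(k, k)], writes=[(i, k)])
        for i in range(k + 1, nt):
            add_task(f"syrk({i},{k})", reads=[(i, k)], writes=[(i, i)])
            for j in range(k + 1, i):
                add_task(f"gemm({i},{j},{k})", reads=[(i, k), (j, k)], writes=[(i, j)])
    return deps

# Tasks with no unmet dependencies can run concurrently; a runtime (PLASMA's
# scheduler, or any topological-order dispatcher) drains the DAG as tiles complete.
for task, waits_on in cholesky_dag(3).items():
    print(task, "<-", sorted(waits_on))
```

Because each task touches only a few tiles, independent tasks from different steps can overlap instead of synchronizing after every BLAS call.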
31
Steps in the LAPACK LU
(Factor a panel)
(Backward swap)
(Forward swap)
(Triangular solve)
(Matrix multiply)
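The five steps listed above are the building blocks of LAPACK's blocked, right-looking LU (getrf). Below is a minimal numpy sketch of that structure (not LAPACK's code; for brevity the row swaps are applied to the whole row immediately, standing in for the separate left/right dlaswp steps):

```python
import numpy as np

def blocked_lu(A, nb=2):
    """Right-looking blocked LU with partial pivoting, mirroring the steps above:
    factor a panel (dgetf2), apply row swaps (dlaswp), triangular solve for the
    U block row (dtrsm), and update the trailing submatrix (dgemm)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # (Factor a panel) with partial pivoting; swaps are applied to the full
        # row right away, which replaces the separate backward/forward swaps.
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(A[j:, j])))
            if p != j:
                A[[j, p], :] = A[[p, j], :]
                piv[[j, p]] = piv[[p, j]]
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + kb] -= np.outer(A[j + 1:, j], A[j, j + 1:k + kb])
        if k + kb < n:
            # (Triangular solve) L11 * U12 = A12
            L11 = np.tril(A[k:k + kb, k:k + kb], -1) + np.eye(kb)
            A[k:k + kb, k + kb:] = np.linalg.solve(L11, A[k:k + kb, k + kb:])
            # (Matrix multiply) A22 -= L21 * U12
            A[k + kb:, k + kb:] -= A[k + kb:, k:k + kb] @ A[k:k + kb, k + kb:]
    return A, piv

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
LU, piv = blocked_lu(M, nb=2)
L = np.tril(LU, -1) + np.eye(6)
U = np.triu(LU)
assert np.allclose(L @ U, M[piv])      # P*A = L*U
```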
32
LU Timing Profile (4 processor system)
Threads: no lookahead
1D decomposition, SGI Origin
Time for each component
DGETF2
DLASWP(L)
DLASWP(R)
DTRSM
DGEMM
Bulk Sync Phases
33
Adaptive Lookahead - Dynamic
Reorganizing algorithms to use this approach
Event Driven Multithreading
34
Fork-Join vs. Dynamic Execution
Fork-Join parallel BLAS
Time
Experiments on Intel's quad-core Clovertown with 2 sockets (8 threads total)
35
Fork-Join vs. Dynamic Execution
Fork-Join parallel BLAS
Time
DAG-based dynamic scheduling
Time saved
Experiments on Intel's quad-core Clovertown with 2 sockets (8 threads total)
36
With the Hype on Cell PS3We Became Interested
  • The PlayStation 3's CPU is based on the Cell processor
  • Each Cell contains a PowerPC processor and 8 SPEs (an SPE is a processing unit: SPE = SPU + DMA engine)
  • An SPE is a self-contained vector processor which acts independently from the others
  • 4-way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHz
  • 204.8 Gflop/s peak!
  • The catch is that this is for 32 bit floating point (Single Precision, SP)
  • And 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
  • Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues (see the arithmetic sketch below)

SPE 25 Gflop/s peak
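The peak figures in the bullets above follow from simple arithmetic (a sketch; the implied per-SPE double precision rate of about 1.8 Gflop/s is just 14.6 divided by 8):

```python
spes = 8
sp_per_spe = 25.6                  # Gflop/s per SPE: 4-wide SIMD x 2 flops (FMA) x 3.2 GHz
sp_peak = spes * sp_per_spe        # 8 x 25.6 = 204.8 Gflop/s
dp_peak = sp_peak / 14             # /2 for double precision width, /7 for latency issues
print(f"SP peak {sp_peak:.1f} Gflop/s, DP peak {dp_peak:.1f} Gflop/s")  # 204.8 and ~14.6
```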
37
Performance of Single Precision on Conventional
Processors
  • Realized we have a similar situation on our commodity processors
  • That is, SP is 2X as fast as DP on many systems
  • The Intel Pentium and AMD Opteron have SSE2
  • 2 flops/cycle DP
  • 4 flops/cycle SP
  • IBM PowerPC has AltiVec
  • 8 flops/cycle SP
  • 4 flops/cycle DP
  • No DP on AltiVec

Processor | Size | SGEMM/DGEMM | Size | SGEMV/DGEMV
AMD Opteron 246 | 3000 | 2.00 | 5000 | 1.70
UltraSparc-IIe | 3000 | 1.64 | 5000 | 1.66
Intel PIII Coppermine | 3000 | 2.03 | 5000 | 2.09
PowerPC 970 | 3000 | 2.04 | 5000 | 1.44
Intel Woodcrest | 3000 | 1.81 | 5000 | 2.18
Intel XEON | 3000 | 2.04 | 5000 | 1.82
Intel Centrino Duo | 3000 | 2.71 | 5000 | 2.21
  • Single precision is faster because
  • Higher parallelism in SSE/vector units
  • Reduced data motion
  • Higher locality in cache

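One way to see where ratios like those in the table come from is to time the same matrix multiply in both precisions through numpy's BLAS-backed matmul (a sketch; the measured ratio depends entirely on the local BLAS and hardware, so a ~2X result is typical but not guaranteed):

```python
import time
import numpy as np

def gemm_gflops(n, dtype):
    """Gflop/s of an n x n matrix multiply in the given precision."""
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    t0 = time.perf_counter()
    c = a @ b                      # BLAS *GEMM under the hood; result only timed
    t1 = time.perf_counter()
    return 2.0 * n ** 3 / (t1 - t0) / 1e9

n = 3000
sgemm = gemm_gflops(n, np.float32)
dgemm = gemm_gflops(n, np.float64)
print(f"SGEMM {sgemm:.1f} Gflop/s, DGEMM {dgemm:.1f} Gflop/s, ratio {sgemm / dgemm:.2f}")
```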
38
32 or 64 bit Floating Point Precision?
  • A long time ago, 32 bit floating point was used
  • Still used in scientific apps, but limited
  • Most apps use 64 bit floating point
  • Accumulation of round-off error
  • A 10 TFlop/s computer running for 4 hours performs > 1 Exaflop (10^18) ops
  • Ill-conditioned problems (see the sketch below)
  • IEEE SP exponent bits too few (8 bits, max ~10^38)
  • Critical sections need higher precision
  • Sometimes need extended precision (128 bit fl pt)
  • However, some apps can get by with 32 bit fl pt in some parts
  • Mixed precision a possibility
  • Approximate in lower precision and then refine or improve the solution to high precision

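As a small illustration of the ill-conditioning point above, solving the same deliberately ill-conditioned system in 32-bit and 64-bit shows how many digits single precision loses (a sketch; the condition number of about 10^6 is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.logspace(0, 6, n)) @ Q.T      # symmetric, condition number ~1e6
x_true = np.ones(n)
b = A @ x_true

x32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))   # all work in SP
x64 = np.linalg.solve(A, b)                                          # all work in DP

err = lambda x: np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(f"32-bit relative error: {err(x32):.1e}")   # roughly cond(A) * eps_single ~ 1e-1
print(f"64-bit relative error: {err(x64):.1e}")   # roughly cond(A) * eps_double ~ 1e-10
```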
39
Idea Goes Something Like This
  • Exploit 32 bit floating point as much as possible
  • Especially for the bulk of the computation
  • Correct or update the solution with selective use of 64 bit floating point to provide a refined result
  • Intuitively:
  • Compute a 32 bit result,
  • Calculate a correction to the 32 bit result using selected higher precision, and
  • Perform the update of the 32 bit result with the correction using high precision

40
Mixed-Precision Iterative Refinement
  • Iterative refinement for dense systems, Ax = b,
    can work this way.

L U = lu(A)              SINGLE   O(n^3)
x = L\(U\b)              SINGLE   O(n^2)
r = b - Ax               DOUBLE   O(n^2)
WHILE || r || not small enough
    z = L\(U\r)          SINGLE   O(n^2)
    x = x + z            DOUBLE   O(n^1)
    r = b - Ax           DOUBLE   O(n^2)
END
41
Mixed-Precision Iterative Refinement
  • Iterative refinement for dense systems, Ax = b,
    can work this way.
  • Wilkinson, Moler, Stewart, Higham provide error
    bound for SP fl pt results when using DP fl pt.
  • It can be shown that using this approach we can
    compute the solution to 64-bit floating point
    precision.
  • Requires extra storage, total is 1.5 times
    normal
  • O(n3) work is done in lower precision
  • O(n2) work is done in high precision
  • Problems if the matrix is ill-conditioned in SP (condition number around 10^8)

L U = lu(A)              SINGLE   O(n^3)
x = L\(U\b)              SINGLE   O(n^2)
r = b - Ax               DOUBLE   O(n^2)
WHILE || r || not small enough
    z = L\(U\r)          SINGLE   O(n^2)
    x = x + z            DOUBLE   O(n^1)
    r = b - Ax           DOUBLE   O(n^2)
END
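A runnable sketch of the refinement loop above, using scipy's LU routines as a stand-in for SGETRF/SGETRS (the structure mirrors the pseudocode: the O(n^3) factorization and the triangular solves run in single precision, only the O(n^2) residual and the O(n) update run in double):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    """Solve Ax = b: factor and triangular-solve in single, refine in double."""
    lu, piv = lu_factor(A.astype(np.float32))             # LU = lu(A)   SINGLE  O(n^3)
    x = lu_solve((lu, piv), b.astype(np.float32))         # x from L, U  SINGLE  O(n^2)
    x = x.astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                     # r = b - Ax   DOUBLE  O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))     # z from L, U  SINGLE  O(n^2)
        x = x + z.astype(np.float64)                      # x = x + z    DOUBLE  O(n)
    return x

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)           # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```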
42
Results for Mixed Precision Iterative Refinement
for Dense Ax = b
  • Single precision is faster than DP because
  • Higher parallelism within vector units
  • 4 ops/cycle (usually) instead of 2 ops/cycle
  • Reduced data motion
  • 32 bit data instead of 64 bit data
  • Higher locality in cache
  • More data items in cache

43
Results for Mixed Precision Iterative Refinement
for Dense Ax = b
Architecture (BLAS-MPI) | # Procs | n | DP Solve / SP Solve | DP Solve / Iter Ref | # Iter
AMD Opteron (Goto, OpenMPI, MX) | 32 | 22627 | 1.85 | 1.79 | 6
AMD Opteron (Goto, OpenMPI, MX) | 64 | 32000 | 1.90 | 1.83 | 6
  • Single precision is faster than DP because
  • Higher parallelism within vector units
  • 4 ops/cycle (usually) instead of 2 ops/cycle
  • Reduced data motion
  • 32 bit data instead of 64 bit data
  • Higher locality in cache
  • More data items in cache

44
What about the Cell?
  • PowerPC at 3.2 GHz
  • DGEMM at 5 Gflop/s
  • AltiVec peak at 25.6 Gflop/s
  • Achieved 10 Gflop/s SGEMM
  • 8 SPUs
  • 204.8 Gflop/s peak!
  • The catch is that this is for 32 bit floating point (Single Precision, SP)
  • And 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!
  • Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues

45
Moving Data Around on the Cell
256 KB local store per SPE
25.6 GB/s injection bandwidth
Worst case, memory-bound operations (no reuse of data): 3 data movements (2 in, 1 out) with 2 ops (SAXPY). For the Cell this would be about 4.6 Gflop/s (25.6 GB/s x 2 ops / 12 B); see the sketch below.
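The bandwidth-bound estimate above is straightforward arithmetic; spelled out with 4-byte single precision elements it gives roughly 4.3 Gflop/s, in the same ballpark as the 4.6 Gflop/s quoted on the slide:

```python
bandwidth = 25.6e9         # bytes/s injection bandwidth into an SPE (from the slide)
bytes_per_element = 3 * 4  # SAXPY moves 3 single precision words per element (2 in, 1 out)
flops_per_element = 2      # one multiply plus one add

bound = bandwidth / bytes_per_element * flops_per_element
print(f"memory-bound SAXPY limit ~ {bound / 1e9:.1f} Gflop/s")   # ~4.3 Gflop/s
```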
46
IBM Cell 3.2 GHz, Ax = b
8 SGEMM (Embarrassingly Parallel)
.30 secs
3.9 secs
47
IBM Cell 3.2 GHz, Ax = b
8 SGEMM (Embarrassingly Parallel)
.30 secs
.47 secs
3.9 secs
48
Cholesky on the Cell, Ax = b, A = A^T, x^T A x > 0
Single precision performance
Mixed precision performance using iterative refinement, achieving 64 bit accuracy
For the SPEs: standard C code and C-language SIMD extensions (intrinsics)
49
Cholesky - Using 2 Cell Chips
50
Intriguing Potential
  • Exploit lower precision as much as possible
  • Payoff in performance
  • Faster floating point
  • Less data to move
  • Automatically switch between SP and DP to match the desired accuracy (a control-logic sketch follows below)
  • Compute the solution in SP and then a correction to the solution in DP
  • Potential for GPUs, FPGAs, special purpose processors
  • What about 16 bit floating point?
  • Use as little as you can get away with and improve the accuracy
  • Applies to sparse direct and iterative linear systems, and to eigenvalue and optimization problems where Newton's method is used

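One way to realize the "automatically switch between SP and DP" bullet is to try the fast mixed precision path first and fall back to a full double precision solve when refinement stalls. The sketch below shows only the control logic; the stall test and iteration limit are illustrative choices, not taken from the slides:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def adaptive_solve(A, b, tol=1e-12, max_iter=30):
    """Try SP factorization plus DP refinement; fall back to DP if it stalls."""
    lu, piv = lu_factor(A.astype(np.float32))
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, "mixed precision"
        x = x + lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
    # Refinement did not converge (e.g. matrix too ill-conditioned for SP):
    # redo the whole solve in double precision.
    return np.linalg.solve(A, b), "full double precision"

# usage: x, path_taken = adaptive_solve(A, b)
```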
51
IBM/Mercury Cell Blade
  • From IBM or Mercury
  • 2 Cell chips
  • Each w/ 8 SPEs
  • 512 MB/Cell
  • $8K - $17K
  • Some SW

52
Sony Playstation 3 Cluster PS3-T
  • From IBM or Mercury
  • 2 Cell chips
  • Each w/ 8 SPEs
  • 512 MB/Cell
  • $8K - $17K
  • Some SW
  • From WALMART: PS3
  • 1 Cell chip
  • w/ 6 SPEs
  • 256 MB/PS3
  • $600
  • Download SW
  • Dual boot

53
Cell Hardware Overview
STI Cell (figure): PowerPC core plus 8 SPEs at 25.6 Gflop/s each; 200 GB/s between SPEs; 25 GB/s to memory
3.2 GHz; 25 GB/s injection bandwidth; 200 GB/s between SPEs
32 bit peak perf: 8 x 25.6 Gflop/s = 204.8 Gflop/s
64 bit peak perf: 8 x 1.8 Gflop/s = 14.6 Gflop/s
512 MiB memory
54
PS3 Hardware Overview
PS3 Cell (figure): PowerPC core plus 8 SPEs at 25.6 Gflop/s each; one SPE disabled/broken (yield issues) and one reserved by the GameOS hypervisor, leaving 6 usable; 200 GB/s between SPEs; 25 GB/s to memory
3.2 GHz; 25 GB/s injection bandwidth; 200 GB/s between SPEs
32 bit peak perf: 6 x 25.6 Gflop/s = 153.6 Gflop/s
64 bit peak perf: 6 x 1.8 Gflop/s = 10.8 Gflop/s
1 Gb/s NIC; 256 MiB memory
55
PlayStation 3 LU Codes
6 SGEMM (Embarrassingly Parallel)
56
PlayStation 3 LU Codes
6 SGEMM (Embarrassingly Parallel)
57
Cholesky on the PS3, Ax = b, A = A^T, x^T A x > 0
58
HPC in the Living Room
59
Matrix Multiply on a 4-Node PlayStation 3 Cluster
  • What's bad:
  • Gigabit network card: 1 Gb/s is too little for such computational power (150 Gflop/s per node)
  • Linux can only run on top of GameOS (hypervisor)
  • Extremely high network access latencies (120 usec)
  • Low bandwidth (600 Mb/s)
  • Only 256 MB local memory
  • Only 6 SPEs
  • What's good:
  • Very cheap: about $4 per Gflop/s (with 32 bit fl pt theoretical peak)
  • Fast local computations between SPEs
  • Perfect overlap between communications and computations is possible (Open-MPI running)
  • PPE does communication via MPI
  • SPEs do computation via SGEMMs

Timeline: computation (gold) 8 ms, communication (blue) 20 ms; a balance sketch follows below
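The "what's bad" bullets above boil down to a balance problem: a node can multiply tiles far faster than the gigabit link can deliver them. A rough sketch of that comparison, using the slide's quoted rates and an illustrative 1024 x 1024 single precision tile:

```python
tile = 1024                            # NB x NB single precision tile (illustrative size)
tile_bytes = tile * tile * 4

compute_rate = 150e9                   # ~150 Gflop/s of SGEMM per node (from the slide)
net_rate = 600e6 / 8                   # ~600 Mb/s effective bandwidth -> bytes/s (from the slide)

t_compute = 2 * tile ** 3 / compute_rate    # one tile-tile multiply-accumulate
t_comm = tile_bytes / net_rate              # shipping one tile over the network
print(f"compute one tile: {t_compute * 1e3:.1f} ms, send one tile: {t_comm * 1e3:.1f} ms")
# Communication dominates, which is why overlapping it with computation
# (PPE handles MPI while the SPEs run SGEMM) is essential.
```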
60
User's Guide for SC on PS3
  • SCOP3: A Rough Guide to Scientific Computing on the PlayStation 3
  • See webpage for
    details

61
Conclusions
  • For the last decade or more, the research
    investment strategy has been overwhelmingly
    biased in favor of hardware.
  • This strategy needs to be rebalanced - barriers
    to progress are increasingly on the software
    side.
  • Moreover, the return on investment is more
    favorable to software.
  • Hardware has a half-life measured in years, while
    software has a half-life measured in decades.
  • High Performance Ecosystem out of balance
  • Hardware, OS, Compilers, Software, Algorithms,
    Applications
  • No Moore's Law for software, algorithms and applications

62
Collaborators / Support
  • Alfredo Buttari, UTK
  • Julien Langou, UColorado
  • Julie Langou, UTK
  • Piotr Luszczek, MathWorks
  • Jakub Kurzak, UTK
  • Stan Tomov, UTK