Jack Dongarra - PowerPoint PPT Presentation

1
Some of the Software Challenges for Numerical
Libraries on ManyCore Systems
  • Jack Dongarra
  • INNOVATIVE COMPUTING LABORATORY
  • University of Tennessee
  • Oak Ridge National Laboratory

2
Major Changes to Math Software
  • Scalar
  • Fortran code in EISPACK (1974)
  • Vector
  • Level 1 BLAS use in LINPACK (1979)
  • SMP
  • Level 3 BLAS use in LAPACK (1992)
  • Distributed Memory
  • Message Passing w/MPI in ScaLAPACK (1995)
  • Many-Core
  • Event driven multi-threading in PLASMA
  • Parallel Linear Algebra Software for Multicore
    Architectures

3
ManyCore - Parallelism for the Masses
  • We are looking at the following concepts in
    designing the next library implementation
  • Dynamic Data Driven Execution
  • Self Adapting
  • Block Data Layout
  • Mixed Precision in the Algorithm
  • Exploit Hybrid Architectures
  • Fault Tolerant Methods
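
The slide names "Mixed Precision in the Algorithm" without giving details. One well-known form of the idea, assumed here for illustration, is iterative refinement: factor and solve in single precision, then compute residuals and accumulate corrections in double precision. The sketch below is pure Python; the helper names (`f32`, `lu_f32`, `solve_f32`, `refine`) are hypothetical, and single precision is emulated by rounding through the `struct` module rather than by real hardware arithmetic.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_f32(a):
    """LU with partial pivoting where every result is rounded to single.

    Returns (lu, perm) with row i of the permuted matrix taken from
    original row perm[i]. No handling of singular matrices (sketch only).
    """
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm


def solve_f32(lu, perm, b):
    """Forward and back substitution in emulated single precision."""
    n = len(lu)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):                      # L y = P b (unit lower triangular)
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):            # U x = y
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def refine(a, b, iters=5):
    """Single-precision solve followed by double-precision refinement."""
    n = len(a)
    lu, perm = lu_f32(a)                    # the expensive step, done once
    x = solve_f32(lu, perm, b)
    for _ in range(iters):
        # Residual and update are kept in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = solve_f32(lu, perm, r)
        x = [x[i] + d[i] for i in range(n)]
    return x
```

The design point is that the O(n^3) factorization runs in the faster, lower precision, while only the O(n^2) residual and update steps use double precision. The refinement is expected to converge only when the matrix is reasonably well conditioned.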

4
Parallelism in LAPACK / ScaLAPACK
[Figure: software stacks. Shared Memory: LAPACK on top of BLAS (ATLAS, Specialized BLAS), parallelism through threads. Distributed Memory: ScaLAPACK on top of PBLAS and BLACS, parallelism through MPI.]
Two well known open source software efforts for
dense matrix problems.
5
Steps in the LAPACK LU
  • Factor a panel
  • Backward swap
  • Forward swap
  • Triangular solve
  • Matrix multiply (most of the work done here)
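
The five steps above can be sketched as a right-looking blocked LU with partial pivoting. This is a pure-Python illustration, not LAPACK's code: the function name `blocked_lu` and the block-size parameter `nb` are hypothetical, and plain loops stand in for the BLAS/LAPACK kernels (DGETF2, DLASWP, DTRSM, DGEMM) named in the timing profile on the next slide.

```python
def blocked_lu(a, nb):
    """In-place blocked LU with partial pivoting on a list of rows.

    On return, `a` holds L (unit lower, below the diagonal) and U (upper).
    Returns piv, where row k was interchanged with row piv[k], in order.
    """
    n = len(a)
    piv = list(range(n))
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        end = j + jb

        # 1. Factor a panel (role of DGETF2): columns j..end-1, rows j..n-1.
        for k in range(j, end):
            p = max(range(k, n), key=lambda i: abs(a[i][k]))
            if a[p][k] == 0.0:
                raise ZeroDivisionError("matrix is singular")
            piv[k] = p
            if p != k:  # interchange inside the panel only
                for c in range(j, end):
                    a[k][c], a[p][c] = a[p][c], a[k][c]
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
                for c in range(k + 1, end):
                    a[i][c] -= a[i][k] * a[k][c]

        # 2 and 3. Backward and forward swaps (role of DLASWP): apply the
        # panel's row interchanges to the columns left and right of it.
        outside = list(range(0, j)) + list(range(end, n))
        for k in range(j, end):
            p = piv[k]
            if p != k:
                for c in outside:
                    a[k][c], a[p][c] = a[p][c], a[k][c]

        # 4. Triangular solve (role of DTRSM): U12 = inv(L11) * A12.
        for k in range(j, end):
            for i in range(k + 1, end):
                for c in range(end, n):
                    a[i][c] -= a[i][k] * a[k][c]

        # 5. Matrix multiply (role of DGEMM): A22 -= L21 * U12.
        for i in range(end, n):
            for k in range(j, end):
                lik = a[i][k]
                for c in range(end, n):
                    a[i][c] -= lik * a[k][c]
    return piv
```

Step 5 touches the whole trailing submatrix on every iteration, which is consistent with the slide's note that most of the work is done in the matrix multiply.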
6
LU Timing Profile (4 Core System)
[Figure: time for each component (DGETF2, DLASWP(L), DLASWP(R), DTRSM, DGEMM); threads, no lookahead, 1D decomposition; execution proceeds in bulk synchronous phases]
7
Adaptive Lookahead - Dynamic
Reorganizing algorithms to use this approach
Event Driven Multithreading Out of Order Execution
8
Fork-Join vs. Dynamic Execution
Fork-Join parallel BLAS
Time
Experiments on Intel's Quad Core Clovertown
with 2 Sockets w/ 8 Threads
9
Fork-Join vs. Dynamic Execution
Fork-Join parallel BLAS
Time
DAG-based dynamic scheduling
Time saved
Experiments on Intel's Quad Core Clovertown
with 2 Sockets w/ 8 Threads
10
Cholesky Factorization DAG-based Dependency Tracking
[Figure: DAG of tasks for a tiled Cholesky factorization; nodes are labeled with tile indices 11 through 44]
  • Dependencies expressed by the DAG
  • are enforced on a tile basis
  • fine-grained parallelization
  • flexible scheduling

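
Tile-level dependency tracking of this kind can be sketched by listing the kernels of a tiled Cholesky factorization in sequential order and recording, for each tile, which task wrote it last. The decomposition into POTRF, TRSM, SYRK and GEMM kernels is the usual one for tile Cholesky and is an assumption here, since the slide shows only the DAG; the function names are hypothetical, and tiles are indexed from 0 rather than from 1 as in the figure.

```python
def tile_cholesky_dag(nt):
    """Return (tasks, deps) for a Cholesky factorization on nt x nt tiles.

    A task is a tuple (kernel, i, j, k): kernel updates tile (i, j) at step k.
    deps maps each task to the set of tasks it must wait for.
    """
    plan = []                                # (task, tiles read, tile written)
    for k in range(nt):
        plan.append((("POTRF", k, k, k), [], (k, k)))
        for i in range(k + 1, nt):
            plan.append((("TRSM", i, k, k), [(k, k)], (i, k)))
        for i in range(k + 1, nt):
            plan.append((("SYRK", i, i, k), [(i, k)], (i, i)))
            for j in range(k + 1, i):
                plan.append((("GEMM", i, j, k), [(i, k), (j, k)], (i, j)))

    last_writer = {}                         # tile -> task that last wrote it
    tasks, deps = [], {}
    for task, reads, write in plan:
        # Every kernel also reads the tile it overwrites (read-modify-write),
        # so it depends on the last writer of each tile it touches.
        touched = reads + [write]
        deps[task] = {last_writer[t] for t in touched if t in last_writer}
        last_writer[write] = task
        tasks.append(task)
    return tasks, deps


def dag_levels(tasks, deps):
    """Level of a task = 1 + the deepest level among its predecessors."""
    level = {}
    for t in tasks:                          # tasks are in a valid topological order
        level[t] = 1 + max((level[d] for d in deps[t]), default=0)
    return level


tasks, deps = tile_cholesky_dag(4)
level = dag_levels(tasks, deps)
# Tasks that share a level have no path between them and may run concurrently.
```

Only read-after-write dependencies are tracked. That is enough for this algorithm because a tile is never overwritten after a later kernel has read it as an input; a general-purpose scheduler would also have to handle write-after-read hazards.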
11
Cholesky on the IBM Cell
  • Pipelining
  • Between loop iterations.
  • Double Buffering
  • Within BLAS,
  • Between BLAS,
  • Between loop iterations.
  • Result
  • Minimum load imbalance,
  • Minimum dependency stalls,
  • Minimum memory stalls
  • (no waiting for data).

Achieves 174 Gflop/s, 85% of peak in SP.
12
How to Deal with Complexity?
  • Adaptivity is the key for applications to
    effectively use available resources whose
    complexity is exponentially increasing
  • Goal
  • Automatically bridge the gap between the
    application and computers that are rapidly
    changing and getting more and more complex

13
Examples of Automatic Performance Tuning
Proceedings of the IEEE, Vol. 93, No. 2, Feb. 2005:
Issue on Program Generation, Optimization, and
Platform Adaptation
  • Dense BLAS
  • Sequential
  • ATLAS (UTK) PHiPAC (UCB)
  • Fast Fourier Transform (FFT) variations
  • FFTW (MIT)
  • Sequential and Parallel
  • www.fftw.org
  • Digital Signal Processing
  • SPIRAL www.spiral.net (CMU)
  • MPI Collectives (UCB, UTK)
  • More projects, conferences, government reports, ...

14
Generic Code Optimization
  • Can ATLAS-like techniques be applied to arbitrary
    code?
  • What do we mean by ATLAS-like techniques?
  • Blocking
  • Loop unrolling
  • Data prefetch
  • Functional unit scheduling
  • etc.
  • Referred to as empirical optimization
  • Generate many variations
  • Pick the best implementation by measuring the
    performance
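
The "generate many variations, pick the best by measuring" loop can be sketched in a few lines. This is only an illustration of the empirical approach, not how ATLAS is implemented: the single tuning parameter (block size), the function names and the matrix size are assumptions, and in pure Python the effect of blocking on run time may be small or invisible.

```python
import time


def blocked_matmul(a, b, n, bs):
    """C = A * B for n x n lists of rows, computed block by block."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    ci, ai = c[i], a[i]
                    for k in range(kk, min(kk + bs, n)):
                        aik, bk = ai[k], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            ci[j] += aik * bk[j]
    return c


def autotune(candidates, n=48, repeats=3):
    """Time every candidate block size and return (best, timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for bs in candidates:                    # one "variation" per block size
        best = float("inf")
        for _ in range(repeats):             # keep the fastest of a few runs
            t0 = time.perf_counter()
            blocked_matmul(a, b, n, bs)
            best = min(best, time.perf_counter() - t0)
        timings[bs] = best
    winner = min(timings, key=timings.get)   # pick the best by measurement
    return winner, timings


best_block, timings = autotune([4, 8, 16, 48])
```

The selection step relies on measurement alone, with no model of the machine. That is what makes the approach portable across platforms, at the cost of a search at install time or run time.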

15
Applying Self Adapting Software
  • Numerical and Non-numerical applications
  • BLAS-like ops / message passing collectives
  • Static or dynamic determination of the code to be used
  • Perform at make time / every time invoked
  • Independent or dependent on data presented
  • Same on each data set / depends on properties of
    data

16
Multi, Many, ..., Many-More
  • Parallelism for the masses
  • Multi, Many, Many-More Core
    are here and coming fast
  • Use Dynamic DAG based scheduling
  • Minimize sync - Non-blocking communication
  • Maximize locality - Block data layout
  • Autotuners should take on a larger, or at least
    complementary, role to compilers in translating
    parallel programs.
  • What's needed is a long-term, balanced investment
    in hardware, software, algorithms and
    applications in the HPC Ecosystem.

17
Collaborators / Support
  • Alfredo Buttari
  • Julien Langou
  • Julie Langou
  • Piotr Luszczek
  • Jakub Kurzak
  • Stan Tomov

19
Summary of Current Unmet Needs
  • Performance / Portability
  • Memory bandwidth/Latency
  • Fault tolerance
  • Adaptability: some degree of autonomy to
    self-optimize, test, or monitor.
  • Able to change mode of operation: static or
    dynamic
  • Better programming models
  • Global shared address space
  • Visible locality
  • Maybe coming soon (incremental, yet offering real
    benefits)
  • Global Address Space (GAS) languages (UPC,
    Co-Array Fortran, Titanium, Chapel, X10,
    Fortress)
  • Minor extensions to existing languages
  • More convenient than MPI
  • Have performance transparency via explicit remote
    memory references
  • What's needed is a long-term, balanced investment
    in hardware, software, algorithms and
    applications in the HPC Ecosystem.