Transcript and Presenter's Notes

Title: Multigrid in the Chiral limit


1
Multi-grid in the Chiral limit
  • Richard C. Brower
  • Extreme Scale Computing Workshop
  • Quantum Universe @ Stanford, Dec. 9-11, 2008


2
Multi-grid
  • Motivation
  • Eigenvectors vs Inexact Deflation
  • Multi-grid preconditioning
  • Applications
  • Analysis
  • Disconnected Diagrams
  • All to All
  • HMC with Chronological Inverter.

3
Slow convergence is due to vectors in the near-null space
  • Laplace solver for A x = 0 starting with a random x
  • Result is (algebraically) smooth

7
Motivation
  • Algorithms for lighter mass fermions and larger lattices
  • The Dirac solver D ψ = b becomes increasingly singular
  • Split the vector space into the near-null space S (D S ≈ 0) and its complement S⊥
  • Basic idea (as always) is Schur decomposition!
  • (e ↔ near null, o ↔ complement)

[Schur factorization of D in the (e, o) block basis and the implied form of D^{-1}; the equations were images on the original slide. A reconstruction follows below.]
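The equations on this slide did not survive the transcript. What follows is a hedged reconstruction in LaTeX of the standard 2x2 block Schur factorization in the (e, o) = (near-null, complement) basis; the exact form shown on the slide is an assumption.

  \[
  D = \begin{pmatrix} D_{ee} & D_{eo} \\ D_{oe} & D_{oo} \end{pmatrix}
    = \begin{pmatrix} 1 & D_{eo} D_{oo}^{-1} \\ 0 & 1 \end{pmatrix}
      \begin{pmatrix} S & 0 \\ 0 & D_{oo} \end{pmatrix}
      \begin{pmatrix} 1 & 0 \\ D_{oo}^{-1} D_{oe} & 1 \end{pmatrix},
  \qquad
  S = D_{ee} - D_{eo} D_{oo}^{-1} D_{oe},
  \]
  which implies
  \[
  D^{-1} = \begin{pmatrix} 1 & 0 \\ -D_{oo}^{-1} D_{oe} & 1 \end{pmatrix}
           \begin{pmatrix} S^{-1} & 0 \\ 0 & D_{oo}^{-1} \end{pmatrix}
           \begin{pmatrix} 1 & -D_{eo} D_{oo}^{-1} \\ 0 & 1 \end{pmatrix},
  \]
  so an (approximate) solve of the small Schur block S, which carries the near-null content, is what removes the ill-conditioning of D.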
8
3 Approaches to splitting
  • Deflation: N_v exact eigenvector projection
  • Inexact deflation plus Schwarz (Luscher)
  • Multi-grid preconditioning
  • 2 & 3 use the same splitting S and S⊥

9
Choosing the Restrictor (R = P†) and Prolongator (P)?
  • Relax from random vectors to find near-null vectors (see the sketch below)
  • Cut them up on sublattice blocks (No. of blocks N_B = 2 L^4/4^4)

S = Range(P), dim(S) = N_v N_B = 2 N_v L^4/4^4
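A minimal numerical sketch of this construction, not the production code: a 1-D Laplacian stands in for the Dirac operator, near-null vectors come from relaxing on A x = 0 from random starts (as on the earlier slides), and they are chopped into blocks and orthonormalized block-by-block to form P. The function name build_prolongator and all parameter values are illustrative assumptions.

  import numpy as np

  def build_prolongator(null_vecs, n_blocks):
      # null_vecs: (N_v, N_fine) near-null vectors. Each vector is restricted to
      # every block and orthonormalized there, giving N_v * N_B columns of P
      # with mutually disjoint support.
      n_v, n_fine = null_vecs.shape
      blk = n_fine // n_blocks
      cols = []
      for b in range(n_blocks):
          sl = slice(b * blk, (b + 1) * blk)
          chunk = np.zeros((n_fine, n_v))
          q, _ = np.linalg.qr(null_vecs[:, sl].T)   # orthonormalize within the block
          chunk[sl, :] = q
          cols.append(chunk)
      return np.hstack(cols)

  # Toy setup: near-null vectors from damped relaxation on A x = 0
  n, n_vec, n_blocks, omega = 64, 4, 8, 0.25
  A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Laplacian
  rng = np.random.default_rng(0)
  null_vecs = []
  for _ in range(n_vec):
      x = rng.standard_normal(n)
      for _ in range(200):
          x -= omega * (A @ x)            # what survives relaxation is the near-null space
      null_vecs.append(x / np.linalg.norm(x))
  P = build_prolongator(np.array(null_vecs), n_blocks)

  assert np.allclose(P.T @ P, np.eye(P.shape[1]))    # P†P = 1, so Ker(P) = 0
  A_c = P.T @ A @ P                                  # coarse operator

The real QCD construction also splits each block by chirality (the factor of 2 in N_B = 2 L^4/4^4), which this toy omits.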
10
P: a non-square matrix mapping the coarse lattice into the fine lattice
  • Image(P) = S, Image(P†) = coarse space, Ker(P†) = S⊥
  • But P†P = 1, so Ker(P) = 0
[Figure: block structure of P — the columns are near-null vectors ψ_1, ..., ψ_8, each restricted to a single block of the fine lattice and zero outside it.]
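The figure on this slide showed the shape of P; the following LaTeX sketch of that layout (one representative near-null column per block) is an assumption about the exact picture:

  \[
  P \;=\;
  \begin{pmatrix}
  \psi_1 & 0      & \cdots & 0      \\
  0      & \psi_2 & \cdots & 0      \\
  \vdots & \vdots & \ddots & \vdots \\
  0      & 0      & \cdots & \psi_8
  \end{pmatrix},
  \qquad
  P^{\dagger} P = 1 \;\Rightarrow\; \mathrm{Ker}(P) = 0,
  \]
  where each ψ_i is a normalized near-null vector restricted to block i, so different columns have disjoint support and the columns of P span S = Image(P).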
11
Multigrid Cycle (simplified)
  • Smooth:   x ← (1 - A) x + b
  •           r ← (1 - A) r
  • Project:  A_c = P† A P and r_c = P† r
  • Solve:    A_c e_c = r_c
  •           e = P A_c^{-1} P† r
  • Update:   x ← x + e
  •           r = b - D(x + e)
  •             = [1 - D P (P† D P)^{-1} P†] r   (oblique projector)

Note: since P† r = 0 after the correction, we get exact deflation in S
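A runnable sketch of this simplified cycle, with dense numpy stand-ins, a single coarse level and an exact coarse solve. The damping factor omega is an addition for the toy operator (the slide's x ← (1 - A) x + b smoother corresponds to omega = 1; the production smoother is Minimum Residual, per the next slide).

  import numpy as np

  def two_level_cycle(A, P, b, x, n_smooth=2, omega=0.25):
      # pre-smooth: x <- x + omega (b - A x)
      for _ in range(n_smooth):
          x = x + omega * (b - A @ x)
      r = b - A @ x                              # fine residual
      A_c = P.conj().T @ A @ P                   # project: A_c = P† A P
      r_c = P.conj().T @ r                       # r_c = P† r
      e_c = np.linalg.solve(A_c, r_c)            # solve A_c e_c = r_c
      x = x + P @ e_c                            # update: x <- x + P A_c^{-1} P† r
      # new residual is [1 - A P (P† A P)^{-1} P†] r, so P† r = 0: exact deflation in S
      for _ in range(n_smooth):                  # post-smooth
          x = x + omega * (b - A @ x)
      return x

With the toy A and P from the sketch two slides back, repeated calls x = two_level_cycle(A, P, b, x) drive the residual of A x = b down; in practice the whole cycle is wrapped as the preconditioner inside CG, as the next slide notes.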

12
  • The real algorithm has lots of tuning!
  • Multigrid is recursive to multiple levels.
  • Near-null vectors are augmented recursively using MG itself.
  • Pre- and post-smoothing are done by Minimum Residual.
  • The entire cycle is used as a preconditioner in CG.
  • γ5 is preserved: [γ5, P] = 0
  • Current benchmarks for Wilson-Dirac:
  • V = 16^3 × 32, β = 6.0, m_crit = -0.8049,
  • Coarse lattice: block 4^4 × N_c × 2, N_v = 20.
  • 3-level V(2,2) MG cycle.
  • 1 CG application per 6 Dirac applications
  • Note: N_v scales as O(1), but deflation needs N_v ~ O(V)

13
Multigrid QCD PetaApps project
Brannick, Brower, Clark, McCormick,Manteuffel,Osbo
rn and Rebbi, The removal of critical slowing
down Lattice 2008 proceedings
see Oct 10-10 workshop (http//super.bu.edu/browe
r/PetaAppsOct10-11)
14
SA/AMG† timings for QCD
† Adaptive Smoothed Aggregation Algebraic Multigrid
15
16^3 × 64 asymmetric lattice
m_sea = -0.4125.
16
MG Disconnected Diagrams
  • Stochastic estimators need to solve
  • D x = η for random sources η
  • many times!
  • MG speeds up D^{-1}, of course, but
  • also gives a large recursive variance reduction!
  • Tr[O D^{-1}] = Tr[O (D^{-1} - P (P† D P)^{-1} P†)]
  •              + Tr[(P† O P) (P† D P)^{-1}]
  • Smaller operator and matrix inverse on the coarse levels
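A hedged numerical sketch of this variance reduction with dense numpy stand-ins (mg_deflated_trace and the Z2 noise choice are illustrative, not the production estimator): the coarse term Tr[(P† O P)(P† D P)^{-1}] is computed exactly on the small coarse matrix, and only the remainder is estimated stochastically.

  import numpy as np

  def mg_deflated_trace(D, P, O, n_noise=100, rng=None):
      # Tr[O D^-1] = Tr[O (D^-1 - P (P†DP)^-1 P†)]   (stochastic, reduced variance)
      #            + Tr[(P† O P) (P† D P)^-1]        (exact, small coarse matrix)
      rng = np.random.default_rng() if rng is None else rng
      n = D.shape[0]
      D_c = P.conj().T @ D @ P
      exact = np.trace(P.conj().T @ O @ P @ np.linalg.inv(D_c))
      est = 0.0
      for _ in range(n_noise):
          eta = rng.choice([-1.0, 1.0], size=n)             # Z2 noise source
          x = np.linalg.solve(D, eta)                       # D^-1 eta
          x_c = P @ np.linalg.solve(D_c, P.conj().T @ eta)  # P (P†DP)^-1 P† eta
          est += eta @ (O @ (x - x_c))                      # Hutchinson estimate of remainder
      return exact + est / n_noise

Taking D = A, O = np.eye(n) and the P from the earlier sketch gives an estimate that can be checked against np.trace(np.linalg.inv(A)).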

17
Summary
  • Wilson-Dirac Multi-grid works well
  • Better when tuned further & in the chiral limit.
  • Both Domain Wall & Staggered are being developed.
  • Application to multiple RHS: Analysis and Disconnected Diagrams.
  • Both MG and Luscher's deflation can be applied to RHMC with chronological methods
  • Now the preconditioner is a by-product of HMC
  • Exaflops may require concurrent RHS to spread the preconditioner workload over many processors

18
Nvidia GPU architecture
19
Consumer Chip GTX 280 ⇒ Tesla C1060
20
Tesla Quad S1070 1U System ~ $8K
21
Barros, Babich, Brower, Clark and Rebbi, "Blasting Through Lattice Calculations using CUDA", Lattice 2008 proceedings
22
C870 code is using 60% of the memory bandwidth (Why?).
23
24
CUDA code for Wilson Dirac CG inverter
http://www.scala-lang.org/
25
GPU multi-core straw man
  • $10K S1070 Nvidia† with 10^3 cores
  • ⇒ 500 QCD-Gigaflop/s (sustained)
  • ⇒ an O(10^6)-core QCD-Petaflop/s machine (sustained)
  • But 32^4 to 128^4 ⇒ O(10^6 to 10^8) lattice sites
  • ⇒ 1 thread per site ⇒ O(1 to 2^8) threads/core! (arithmetic check below)
  • † a QCD Petaflop/s costs ~ $20 Million
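A consistency check of these estimates (my arithmetic, assuming one $10K S1070 per 10^3 cores and one thread per lattice site):

  \[
  \frac{1\ \mathrm{QCD\text{-}PFlop/s}}{500\ \mathrm{GFlop/s}} \approx 2000\ \text{units}
  \;\Rightarrow\; 2000 \times 10^{3} \approx 2\times10^{6}\ \text{cores},
  \qquad
  2000 \times \$10\mathrm{K} \approx \$20\ \text{Million},
  \]
  \[
  \frac{32^{4}}{10^{6}} \approx 1
  \quad\text{to}\quad
  \frac{128^{4}}{10^{6}} \approx 2.7\times10^{2} \approx 2^{8}
  \ \text{threads per core.}
  \]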

26
Future Nvidia software plans
  • Need to find out why we are only saturating 60% of the memory bandwidth
  • Further reduce memory traffic:
  • 8 real numbers per SU(3) matrix (2/3 of the 12 used now; see the sketch after this list)
  • share spinors in 4^3 blocks (5/9 of that used now)
  • Generalize to clover Wilson & Domain Wall operators (slightly better flops/memory ratio).
  • DMA between GPUs on the Quad system and the network for clusters
  • Start to design a SciDAC API for many-core technologies.
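A sketch of the standard two-row ("12-real") gauge-link compression that this kind of traffic reduction builds on; the slide's 8-real scheme uses a more involved minimal parameterization, and this code is an illustration, not the presenters' CUDA kernel.

  import numpy as np

  def reconstruct_su3(row1, row2):
      # For U in SU(3) the third row equals conj(row1 x row2), so a kernel can
      # load only 12 reals per link instead of 18 and rebuild the rest in registers.
      return np.vstack([row1, row2, np.conj(np.cross(row1, row2))])

  # check against a random SU(3) matrix
  rng = np.random.default_rng(1)
  u, _ = np.linalg.qr(rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3)))
  u = u / np.linalg.det(u) ** (1.0 / 3.0)    # unitary with det = 1, i.e. u in SU(3)
  assert np.allclose(reconstruct_su3(u[0], u[1]), u)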

27
New scaling problem to solve!
  • QCD-Petaflop/s ⇒ O(10^6) cores
  • The Multigrid algorithm does data compression ⇒ fewer cores, not more!
  • Possible solutions?
  • Change Hardware: Multigrid accelerators?
  • Change Software†: Multiple copies of Multigrid?
  • Change Algorithms: Schwarz domains, deflation, ...?
  • † help from Paul F. Fischer?

28
3 Approaches to splitting
  • Deflation: exact near-null eigenvector projection
  • Inexact deflation plus Schwarz iterations (Luscher)
  • Multi-grid preconditioning (TOPS/PetaApps)
  • 2 & 3 have exactly the same splitting subspace S

[Figure: P maps the coarse lattice onto S = Image(P) in the fine lattice; Ker(P†) = S⊥ and Image(P†) is the coarse space.]
29
P: non-square matrix
  • Im(P) = S, Ker(P†) = S⊥, Im(P†) = coarse space
  • But P†P = 1, so Ker(P) = 0
[Figure: the columns of P are the near-null vectors ψ_1, ..., ψ_8, each supported on a single block.]
30
Two Generations: Consumer vs HPC GPUs
  • Consumer cards ⇒ High Performance (HPC) GPUs
  • I. 8800 GTX ⇒ Tesla C870 (16 multiprocessors with 8 cores each)
  • II. GTX 280 ⇒ Tesla C1060 (30 multiprocessors with 8 cores each)