1
Cell processor implementation of a MILC lattice
QCD application
  • Guochun Shi, Volodymyr Kindratenko, Steven
    Gottlieb

2
Presentation outline
  • Introduction
  • Our view of MILC applications
  • Introduction to Cell Broadband Engine
  • Implementation in Cell/B.E.
  • PPE performance and STREAM benchmark
  • Profile on the CPU and kernels to be ported
  • Different approaches
  • Performance
  • Conclusion

3
Introduction
  • Our target
  • MIMD Lattice Computation (MILC) Collaboration
    code: dynamical clover fermions
    (clover_dynamical) using the hybrid molecular
    dynamics R algorithm
  • Our view of the MILC applications
  • A sequence of communication and computation blocks

4
Introduction
  • Cell/B.E. processor
  • One Power Processor Element (PPE) and eight
    Synergistic Processing Elements (SPEs); each SPE
    has 256 KB of local store
  • 3.2 GHz processor
  • 25.6 GB/s processor-to-memory bandwidth
  • > 200 GB/s EIB sustained aggregate bandwidth
  • Theoretical peak performance: 204.8 GFLOPS (SP)
    and 14.63 GFLOPS (DP) (SP figure derived below)
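
As a quick sanity check on the single-precision figure (using the standard SPE datapath: each SPE can issue one 4-wide single-precision fused multiply-add per cycle):

  8 SPEs x 3.2 GHz x 4 SIMD lanes x 2 flops (multiply-add) = 204.8 GFLOPS (SP)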

5
Presentation outline
  • Introduction
  • Our view of MILC applications
  • Introduction to Cell Broadband Engine
  • Implementation in Cell/B.E.
  • PPE performance and STREAM benchmark
  • Profile on the CPU and kernels to be ported
  • Different approaches
  • Performance
  • Conclusion

6
Performance in PPE
  • Step 1: try to run it on the PPE
  • On the PPE it runs approximately 2-3x slower than
    on a modern CPU
  • MILC is bandwidth-bound
  • This agrees with what we see with the STREAM
    benchmark

7
Execution profile and kernels to be ported
  • 10 of these subroutines are responsible for > 90%
    of the overall runtime
  • All kernels together account for 98.8% of the
    runtime

8
Kernel memory access pattern
One sample kernel from the udadu_mu_nu() routine:

#define FORSOMEPARITY(i,s,choice) \
    for( i = ((choice) == ODD ? even_sites_on_node : 0), \
         s = &(lattice[i]); \
         i < ((choice) == EVEN ? even_sites_on_node : sites_on_node); \
         i++, s++ )

FORSOMEPARITY(i,s,parity) {
    mult_adj_mat_wilson_vec( &(s->link[nu]),
        (wilson_vector *)F_PT(s,rsrc), &rtemp );
    mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]),
        &rtemp, &(s->tmp) );
    mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]),
        &(s->tmp), &rtemp );
    mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
    mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
    su3_projector_w( &rtemp, (wilson_vector *)F_PT(s,lsrc),
        (su3_matrix *)F_PT(s,mat) );
}
  • Kernel code must be SIMDized
  • Performance is determined by how fast you can DMA
    the data in and out, not by the SIMDized code
  • In each iteration, only small elements are
    accessed
  • struct site (one lattice site): 1832 bytes
  • su3_matrix: 72 bytes
  • wilson_vector: 96 bytes
  • Challenge: how do we get data into the SPUs as
    fast as possible?
  • Cell/B.E. has the best DMA performance when data
    is aligned to 128 bytes and the size is a multiple
    of 128 bytes (see the DMA sketch below)
  • The data layout in MILC meets neither requirement

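A minimal sketch of the SPE-side DMA pattern this alignment requirement calls for; the buffer, sizes, and function name are illustrative, while mfc_get and the tag calls are the standard spu_mfcio.h interface:

#include <spu_mfcio.h>

/* Local-store buffer: 128-byte aligned, size a multiple of 128 bytes. */
static volatile char buf[4096] __attribute__((aligned(128)));

/* Pull 'size' bytes (a multiple of 128) from the 128-byte-aligned
   effective address 'ea' into local store and wait for completion. */
static void dma_in(unsigned long long ea, unsigned int size, unsigned int tag)
{
    mfc_get(buf, ea, size, tag, 0, 0);   /* main memory -> local store */
    mfc_write_tag_mask(1 << tag);        /* select this tag group      */
    mfc_read_tag_status_all();           /* block until DMA completes  */
}
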
9
Approach I: packing and unpacking
  • Good performance in DMA operations
  • Packing and unpacking are expensive on the PPE
    (sketched below)
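
A minimal sketch of the packing step, assuming MILC's lattice, sites_on_node, and su3_matrix declarations; the helper and buffer names are hypothetical. The PPE gathers one field from every site into a contiguous, 128-byte-aligned buffer that the SPEs can then fetch with large aligned DMAs (unpacking is the reverse copy):

#include <stdlib.h>

/* Gather link[mu] of every site into one contiguous, aligned buffer. */
su3_matrix *pack_links(int mu)
{
    su3_matrix *buf;
    posix_memalign((void **)&buf, 128,
                   sites_on_node * sizeof(su3_matrix));
    for (int i = 0; i < sites_on_node; i++)
        buf[i] = lattice[i].link[mu];   /* copy out of struct site   */
    return buf;                         /* SPEs DMA from this buffer */
}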

10
Approach II: indirect memory access
  • Replace elements in struct site with pointers
  • The pointers point into contiguous memory regions
    (see the sketch below)
  • PPE overhead due to the indirect memory accesses
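
A minimal sketch of the indirection, with illustrative field and array names: the bulky fields move out of struct site into flat, 128-byte-aligned arrays that the SPEs can stream directly, and the site keeps only pointers, at the cost of an extra dereference on every PPE-side access:

typedef struct {
    /* ... coordinates, parity, and the other small per-site fields ... */
    su3_matrix    *link;   /* points into the contiguous link array    */
    wilson_vector *tmp;    /* points into a contiguous workspace array */
} site;

/* The actual data lives in flat, 128-byte-aligned arrays: */
su3_matrix    *all_links;  /* sites_on_node * 4 matrices, contiguous */
wilson_vector *all_tmp;    /* sites_on_node vectors, contiguous      */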

11
Approach III: padding and small memory DMAs
  • Pad the elements to an appropriate size
  • Pad struct site to an appropriate size
  • Gained good bandwidth performance at the cost of
    padding overhead (padded types sketched below)
  • su3_matrix: from a 3x3 complex matrix to a 4x4
    complex matrix
  • 72 bytes → 128 bytes
  • Bandwidth efficiency lost: 44%
  • wilson_vector: from 4x3 complex to 4x4 complex
  • 96 bytes → 128 bytes
  • Bandwidth efficiency lost: 23%
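
A minimal sketch of the padded element types, assuming an 8-byte single-precision complex; the type names are illustrative, the sizes are the ones above. Each element becomes exactly 128 bytes and 128-byte aligned, so it moves with a single efficient DMA:

typedef struct { float real, imag; } complex;        /* 8 bytes */

/* 3x3 complex (72 bytes) stored as 4x4 complex (128 bytes). */
typedef struct {
    complex e[4][4];              /* only e[0..2][0..2] is meaningful */
} padded_su3_matrix __attribute__((aligned(128)));

/* 4x3 complex (96 bytes) stored as 4x4 complex (128 bytes). */
typedef struct {
    complex d[4][4];              /* only d[spin][0..2] is meaningful */
} padded_wilson_vector __attribute__((aligned(128)));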

12
struct site padding
  • 128-byte-stride access performs differently for
    different stride sizes
  • This is due to the 16 banks in main memory
  • Odd multiples of 128 bytes always reach peak
    bandwidth
  • We chose to pad struct site to 2688 (21 x 128)
    bytes (sketched below)
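
A minimal sketch of the site padding; the pad size is simply 2688 minus the original 1832 bytes, and the real struct of course carries MILC's per-site fields rather than a placeholder comment:

typedef struct {
    /* ... the original MILC per-site fields: 1832 bytes ... */
    char pad[856];   /* 1832 + 856 = 2688 = 21 * 128 bytes */
} site __attribute__((aligned(128)));  /* odd multiple of 128: consecutive
                                          sites fall into different banks */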

13
Presentation outline
  • Introduction
  • Our view of MILC applications
  • Introduction to Cell Broadband Engine
  • Implementation in Cell/B.E.
  • PPE performance and STREAM benchmark
  • Profile on the CPU and kernels to be ported
  • Different approaches
  • Performance
  • Conclusion

14
Kernel performance
  • GFLOPS are low for all kernels
  • Bandwidth is around 80% of peak for most kernels
  • Kernel speedups compared to the CPU are between
    10x and 20x for most kernels
  • The set_memory_to_zero kernel has a 40x speedup,
    and su3mat_copy() speeds up by more than 15x

15
Application performance
8x8x16x16 lattice
  • Single-Cell application performance speedup
  • 8-10x, compared to a single Xeon core
  • Cell blade application performance speedup
  • 1.5-4.1x, compared to a 2-socket, 8-core Xeon
  • Profile on the Xeon
  • 98.8% parallel code, 1.2% serial code
  • 67-38% kernel SPU time, 33-62% PPU time of the
    overall runtime on the Cell
  • → The PPE is standing in the way of further
    improvement

16x16x16x16 lattice
16
Application performance on two blades
  • For comparison, we ran two Intel Xeon blades and
    Cell/B.E. blades through Gigabit Ethernet
  • More data needed for Cell blades connected
    through Infiniband

17
Application performance: a fair comparison
  • The PPE alone is slower than the Xeon
  • The PPE + 1 SPE is 2x faster than the Xeon
  • A Cell blade is 1.5-4.1x faster than an 8-core
    Xeon blade

18
Conclusion
  • We achieved reasonably good performance
  • 4.5-5.0 GFLOPS on one Cell processor for the whole
    application
  • We maintained the MPI framework
  • Without the assumption that the code runs on one
    Cell processor, certain optimizations cannot be
    done, e.g. loop fusion
  • The current site-centric data layout forces us to
    take the padding approach
  • 23-44% bandwidth efficiency lost
  • Fix: a field-centric data layout is desired
  • The PPE slows down the serial part, which is a
    problem for further improvement
  • Fix: IBM putting a full-featured Power core in the
    Cell/B.E.
  • The PPE may pose problems in scaling to multiple
    Cell blades
  • A test over InfiniBand is needed