Title: Cell processor implementation of a MILC lattice QCD application
1. Cell processor implementation of a MILC lattice QCD application
- Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb
2. Presentation outline
- Introduction
  - Our view of MILC applications
  - Introduction to the Cell Broadband Engine
- Implementation on the Cell/B.E.
  - PPE performance and STREAM benchmark
  - Profile on the CPU and kernels to be ported
  - Different approaches
  - Performance
- Conclusion
3. Introduction
- Our target
  - MIMD Lattice Computation (MILC) Collaboration code: dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
- Our view of the MILC applications
  - A sequence of communication and computation blocks
4. Introduction
- Cell/B.E. processor
  - One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
  - 3.2 GHz processor
  - 25.6 GB/s processor-to-memory bandwidth
  - > 200 GB/s sustained aggregate EIB bandwidth
  - Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)
5. Presentation outline
- Introduction
  - Our view of MILC applications
  - Introduction to the Cell Broadband Engine
- Implementation on the Cell/B.E.
  - PPE performance and STREAM benchmark
  - Profile on the CPU and kernels to be ported
  - Different approaches
  - Performance
- Conclusion
6. Performance on the PPE
- Step 1: try to run the code on the PPE
  - On the PPE it runs approximately 2-3x slower than on a modern CPU
- MILC is bandwidth-bound
  - This agrees with what we see with the STREAM benchmark (a minimal triad sketch follows)
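For context, below is a minimal, illustrative triad loop in the spirit of STREAM (not the official benchmark; the array size and timing details are assumptions). It shows how a bandwidth-bound loop like MILC's kernels is timed to estimate sustained memory bandwidth in GB/s:

    /* Minimal STREAM-style triad sketch: times a = b + scalar*c over arrays
     * much larger than cache and reports the implied memory bandwidth. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1L << 24)                      /* ~16M doubles per array */

    int main(void) {
        double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b),
               *c = malloc(N * sizeof *c), scalar = 3.0;
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];      /* triad: 2 loads + 1 store */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double bytes = 3.0 * N * sizeof(double);
        printf("triad: %.2f GB/s (check %.1f)\n", bytes / sec / 1e9, a[N / 2]);
        free(a); free(b); free(c);
        return 0;
    }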
7. Execution profile and kernels to be ported
- 10 of these subroutines are responsible for >90% of the overall runtime
- All kernels to be ported account for 98.8% of the runtime
8. Kernel memory access pattern
One sample kernel from the udadu_mu_nu() routine:

    #define FORSOMEPARITY(i,s,choice) \
        for( i=((choice)==ODD ? even_sites_on_node : 0 ), s= &(lattice[i]); \
             i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
             i++,s++)

    FORSOMEPARITY(i,s,parity) {
        mult_adj_mat_wilson_vec( &(s->link[nu]),
            (wilson_vector *)F_PT(s,rsrc), &rtemp );
        mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]),
            &rtemp, &(s->tmp) );
        mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]),
            &(s->tmp), &rtemp );
        mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
        mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
        su3_projector_w( &rtemp,
            (wilson_vector *)F_PT(s,lsrc),
            (su3_matrix *)F_PT(s,mat) );
    }
- Kernel code must be SIMDized
- Performance is determined by how fast you can DMA the data in and out, not by the SIMDized code
- In each iteration, only small elements are accessed
  - Lattice site (struct site): 1832 bytes
  - su3_matrix: 72 bytes
  - wilson_vector: 96 bytes
- Challenge: how to get data into the SPEs as fast as possible?
  - Cell/B.E. has the best DMA performance when data is aligned to 128 bytes and the size is a multiple of 128 bytes (see the DMA sketch below)
  - The data layout in MILC meets neither requirement
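To make the alignment requirement concrete, here is a minimal SPE-side sketch using the Cell SDK MFC intrinsics from spu_mfcio.h; the buffer size, tag number, and function name are illustrative assumptions, not MILC code:

    /* Sketch: DMA a 128-byte-aligned, 128-byte-multiple block from main
     * memory into SPE local store and wait for the transfer to complete. */
    #include <spu_mfcio.h>

    #define CHUNK_BYTES 4096                 /* multiple of 128 */
    #define TAG 3

    static char ls_buf[CHUNK_BYTES] __attribute__((aligned(128)));

    void fetch_chunk(unsigned long long ea)  /* ea: 128-byte-aligned effective address */
    {
        mfc_get(ls_buf, ea, CHUNK_BYTES, TAG, 0, 0);   /* asynchronous get */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                     /* block until done */
    }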
9. Approach I: packing and unpacking
- Good performance in the DMA operations
- Packing and unpacking are expensive on the PPE (see the sketch below)
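A minimal sketch of what Approach I looks like on the PPE side, assuming simplified type and field names (not the exact MILC definitions): the needed fields are gathered from the scattered struct-site array into one contiguous, 128-byte-aligned buffer that the SPEs can stream, and results are scattered back afterwards.

    #include <stdlib.h>

    typedef struct { float real, imag; } complex_f;
    typedef struct { complex_f e[3][3]; } su3_matrix;   /* 72 bytes */

    typedef struct site {
        su3_matrix link[4];
        /* ... many other per-site fields ... */
    } site;

    /* Gather link[mu] of every site into a packed, aligned buffer (PPE side). */
    su3_matrix *pack_links(const site *lattice, int nsites, int mu)
    {
        su3_matrix *buf;
        if (posix_memalign((void **)&buf, 128, nsites * sizeof(su3_matrix)) != 0)
            return NULL;
        for (int i = 0; i < nsites; i++)
            buf[i] = lattice[i].link[mu];   /* this copy is the PPE packing cost */
        return buf;
    }

    /* Scatter the results back into the site structures after the SPEs finish. */
    void unpack_links(site *lattice, int nsites, int mu, const su3_matrix *buf)
    {
        for (int i = 0; i < nsites; i++)
            lattice[i].link[mu] = buf[i];
    }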
10. Approach II: indirect memory access
- Replace elements in struct site with pointers
- The pointers point into contiguous memory regions (see the sketch below)
- PPE overhead due to the indirect memory access
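A minimal sketch of Approach II, again with illustrative (non-MILC) names: the bulky fields move out of struct site into separate contiguous arrays, and each site keeps only pointers into them, so the SPEs can DMA long contiguous runs while the PPE pays an extra indirection on every access.

    typedef struct { float real, imag; } complex_f;
    typedef struct { complex_f e[3][3]; } su3_matrix;
    typedef struct { complex_f d[4][3]; } wilson_vector;

    typedef struct site {
        su3_matrix    *link;   /* points into the contiguous link storage   */
        wilson_vector *psi;    /* points into the contiguous vector storage */
        /* ... remaining small per-site fields ... */
    } site;

    /* Field-major storage: all links of all sites live in one aligned block. */
    static su3_matrix    *link_store;   /* link_store[4*i + mu] = link mu of site i */
    static wilson_vector *psi_store;

    static void wire_up_pointers(site *lattice, int nsites)
    {
        for (int i = 0; i < nsites; i++) {
            lattice[i].link = &link_store[4 * i];
            lattice[i].psi  = &psi_store[i];
        }
    }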
11. Approach III: padding and small memory DMAs
- Pad the elements to an appropriate size
- Pad struct site to an appropriate size
- Good bandwidth performance gained, at the cost of padding overhead (see the sketch below)
  - su3_matrix: from a 3x3 to a 4x4 complex matrix
    - 72 bytes -> 128 bytes
    - Bandwidth efficiency lost: 44%
  - wilson_vector: from 4x3 to 4x4 complex
    - 96 bytes -> 128 bytes
    - Bandwidth efficiency lost: 23%
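A minimal sketch of the padded element types (illustrative definitions, not the exact code of the Cell port): each element is rounded up to 128 bytes so that every transfer is aligned and a multiple of 128 bytes; the extra columns are never touched by the arithmetic.

    typedef struct { float real, imag; } complex_f;     /* 8 bytes */

    /* 3x3 payload padded to 4x4: 16 * 8 = 128 bytes */
    typedef struct __attribute__((aligned(128))) {
        complex_f e[4][4];
    } su3_matrix_padded;

    /* 4x3 payload padded to 4x4: 16 * 8 = 128 bytes */
    typedef struct __attribute__((aligned(128))) {
        complex_f d[4][4];
    } wilson_vector_padded;

    /* compile-time checks that both padded types are exactly 128 bytes */
    typedef char su3_size_check[(sizeof(su3_matrix_padded) == 128) ? 1 : -1];
    typedef char wv_size_check[(sizeof(wilson_vector_padded) == 128) ? 1 : -1];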
12. Padding struct site
- 128-byte-stride access has different performance for different stride sizes
  - This is due to the 16 banks in main memory
  - Odd multiples of 128 bytes always reach the peak
- We therefore choose to pad struct site to 2688 (21 x 128) bytes (see the sketch below)
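A sketch of the struct-site padding choice (the field contents are a stand-in; only the size arithmetic is the point): 2688 = 21 x 128 is an odd multiple of 128 bytes, so successive sites map onto different subsets of the 16 memory banks as the SPEs stride through the lattice.

    #define SITE_TARGET_BYTES (21 * 128)              /* 2688 bytes */

    typedef struct __attribute__((aligned(128))) site {
        /* ... the real MILC fields plus per-element padding go here ... */
        char fields[2560];                            /* stand-in payload */
        char pad[SITE_TARGET_BYTES - 2560];           /* round up to 2688 */
    } site;

    /* compile-time check that the padded site is exactly 21 * 128 bytes */
    typedef char site_size_check[(sizeof(site) == SITE_TARGET_BYTES) ? 1 : -1];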
13. Presentation outline
- Introduction
  - Our view of MILC applications
  - Introduction to the Cell Broadband Engine
- Implementation on the Cell/B.E.
  - PPE performance and STREAM benchmark
  - Profile on the CPU and kernels to be ported
  - Different approaches
  - Performance
- Conclusion
14. Kernel performance
- GFLOPS are low for all kernels
- Bandwidth is around 80% of peak for most kernels
- Kernel speedups compared to the CPU are between 10x and 20x for most kernels
- The set_memory_to_zero kernel has a 40x speedup; the su3mat_copy() speedup is >15x
15. Application performance
(8x8x16x16 lattice)
- Single-Cell application performance speedup
  - 8-10x, compared to a single Xeon core
- Cell blade application performance speedup
  - 1.5-4.1x, compared to a two-socket, 8-core Xeon
- Profile on the Xeon
  - 98.8% parallel code, 1.2% serial code
- On the Cell, kernel (SPU) time is 67-38% of the overall runtime, PPU time is 33-62%
  - The PPE is standing in the way of further improvement
(16x16x16x16 lattice)
16. Application performance on two blades
- For comparison, we ran two Intel Xeon blades and two Cell/B.E. blades connected through Gigabit Ethernet
- More data is needed for Cell blades connected through InfiniBand
17. Application performance: a fair comparison
- The PPE alone is slower than the Xeon
- The PPE + 1 SPE is 2x faster than the Xeon
- A Cell blade is 1.5-4.1x faster than an 8-core Xeon blade
18. Conclusion
- We achieved reasonably good performance
  - 4.5-5.0 GFLOPS on one Cell processor for the whole application
- We maintained the MPI framework
  - Without the assumption that the code runs on one Cell processor, certain optimizations cannot be done, e.g. loop fusion
- The current site-centric data layout forces us to take the padding approach
  - 23-44% of bandwidth efficiency is lost
  - Fix: a field-centric data layout is desired
- The PPE slows down the serial part, which is a problem for further improvement
  - Fix: IBM putting a full-featured Power core in the Cell/B.E.
- The PPE may pose problems in scaling to multiple Cell blades
  - A test over InfiniBand is needed