Title: Cell processor implementation of a MILC lattice QCD application
1. Cell processor implementation of a MILC lattice QCD application
- Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb
2. Presentation outline
- Introduction
  - Our view of MILC applications
  - Introduction to the Cell Broadband Engine
- Implementation on the Cell/B.E.
  - PPE performance and STREAM benchmark
  - Profile on the CPU and kernels to be ported
  - Different approaches
  - Performance
- Conclusion
3. Introduction
- Our target
  - MIMD Lattice Computation (MILC) Collaboration code: dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
- Our view of the MILC applications
  - A sequence of communication and computation blocks
4. Introduction
- Cell/B.E. processor
  - One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
  - 3.2 GHz processor
  - 25.6 GB/s processor-to-memory bandwidth
  - > 200 GB/s sustained aggregate EIB bandwidth
  - Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)
5. Presentation outline
- Introduction
  - Our view of MILC applications
  - Introduction to the Cell Broadband Engine
- Implementation on the Cell/B.E.
  - PPE performance and STREAM benchmark
  - Profile on the CPU and kernels to be ported
  - Different approaches
  - Performance
- Conclusion
6. Performance on the PPE
- Step 1: try to run the code on the PPE
  - On the PPE it runs approximately 2-3x slower than on a modern CPU
- MILC is bandwidth-bound
  - This agrees with what we see with the STREAM benchmark (a minimal triad sketch follows)
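For context, below is a minimal, illustrative triad loop in the spirit of STREAM (not the official benchmark; the array size and timing details are assumptions). It shows how a bandwidth-bound loop like MILC's kernels is timed to estimate sustained memory bandwidth in GB/s:

    /* Minimal STREAM-style triad sketch: times a = b + scalar*c over arrays
     * much larger than cache and reports the implied memory bandwidth. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1L << 24)                      /* ~16M doubles per array */

    int main(void) {
        double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b),
               *c = malloc(N * sizeof *c), scalar = 3.0;
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];      /* triad: 2 loads + 1 store */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double bytes = 3.0 * N * sizeof(double);
        printf("triad: %.2f GB/s (check %.1f)\n", bytes / sec / 1e9, a[N / 2]);
        free(a); free(b); free(c);
        return 0;
    }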
7. Execution profile and kernels to be ported
- 10 of these subroutines are responsible for >90% of the overall runtime
- All kernels to be ported account for 98.8% of the runtime
8. Kernel memory access pattern
One sample kernel from the udadu_mu_nu() routine:

    #define FORSOMEPARITY(i,s,choice) \
        for( i=((choice)==ODD ? even_sites_on_node : 0 ), s= &(lattice[i]); \
             i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
             i++,s++)

    FORSOMEPARITY(i,s,parity) {
        mult_adj_mat_wilson_vec( &(s->link[nu]),
            (wilson_vector *)F_PT(s,rsrc), &rtemp );
        mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]),
            &rtemp, &(s->tmp) );
        mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]),
            &(s->tmp), &rtemp );
        mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
        mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
        su3_projector_w( &rtemp,
            (wilson_vector *)F_PT(s,lsrc),
            (su3_matrix *)F_PT(s,mat) );
    }
- Kernel code must be SIMDized
- Performance is determined by how fast you can DMA the data in and out, not by the SIMDized code
- In each iteration, only small elements are accessed
  - Lattice site (struct site): 1832 bytes
  - su3_matrix: 72 bytes
  - wilson_vector: 96 bytes
- Challenge: how to get data into the SPEs as fast as possible?
  - Cell/B.E. has the best DMA performance when data is aligned to 128 bytes and the size is a multiple of 128 bytes (see the DMA sketch below)
  - The data layout in MILC meets neither requirement
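To make the alignment requirement concrete, here is a minimal SPE-side sketch using the Cell SDK MFC intrinsics from spu_mfcio.h; the buffer size, tag number, and function name are illustrative assumptions, not MILC code:

    /* Sketch: DMA a 128-byte-aligned, 128-byte-multiple block from main
     * memory into SPE local store and wait for the transfer to complete. */
    #include <spu_mfcio.h>

    #define CHUNK_BYTES 4096                 /* multiple of 128 */
    #define TAG 3

    static char ls_buf[CHUNK_BYTES] __attribute__((aligned(128)));

    void fetch_chunk(unsigned long long ea)  /* ea: 128-byte-aligned effective address */
    {
        mfc_get(ls_buf, ea, CHUNK_BYTES, TAG, 0, 0);   /* asynchronous get */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();                     /* block until done */
    }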
9. Approach I: packing and unpacking
- Good performance in the DMA operations
- Packing and unpacking are expensive on the PPE (see the sketch below)
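A minimal sketch of what Approach I looks like on the PPE side, assuming simplified type and field names (not the exact MILC definitions): the needed fields are gathered from the scattered struct-site array into one contiguous, 128-byte-aligned buffer that the SPEs can stream, and results are scattered back afterwards.

    #include <stdlib.h>

    typedef struct { float real, imag; } complex_f;
    typedef struct { complex_f e[3][3]; } su3_matrix;   /* 72 bytes */

    typedef struct site {
        su3_matrix link[4];
        /* ... many other per-site fields ... */
    } site;

    /* Gather link[mu] of every site into a packed, aligned buffer (PPE side). */
    su3_matrix *pack_links(const site *lattice, int nsites, int mu)
    {
        su3_matrix *buf;
        if (posix_memalign((void **)&buf, 128, nsites * sizeof(su3_matrix)) != 0)
            return NULL;
        for (int i = 0; i < nsites; i++)
            buf[i] = lattice[i].link[mu];   /* this copy is the PPE packing cost */
        return buf;
    }

    /* Scatter the results back into the site structures after the SPEs finish. */
    void unpack_links(site *lattice, int nsites, int mu, const su3_matrix *buf)
    {
        for (int i = 0; i < nsites; i++)
            lattice[i].link[mu] = buf[i];
    }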
10. Approach II: indirect memory access
- Replace elements in struct site with pointers
- The pointers point into contiguous memory regions (see the sketch below)
- PPE overhead due to the indirect memory access
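A minimal sketch of Approach II, again with illustrative (non-MILC) names: the bulky fields move out of struct site into separate contiguous arrays, and each site keeps only pointers into them, so the SPEs can DMA long contiguous runs while the PPE pays an extra indirection on every access.

    typedef struct { float real, imag; } complex_f;
    typedef struct { complex_f e[3][3]; } su3_matrix;
    typedef struct { complex_f d[4][3]; } wilson_vector;

    typedef struct site {
        su3_matrix    *link;   /* points into the contiguous link storage   */
        wilson_vector *psi;    /* points into the contiguous vector storage */
        /* ... remaining small per-site fields ... */
    } site;

    /* Field-major storage: all links of all sites live in one aligned block. */
    static su3_matrix    *link_store;   /* link_store[4*i + mu] = link mu of site i */
    static wilson_vector *psi_store;

    static void wire_up_pointers(site *lattice, int nsites)
    {
        for (int i = 0; i < nsites; i++) {
            lattice[i].link = &link_store[4 * i];
            lattice[i].psi  = &psi_store[i];
        }
    }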
11. Approach III: padding and small memory DMAs
- Pad the elements to an appropriate size
- Pad struct site to an appropriate size
- Good bandwidth performance gained, at the cost of padding overhead (see the sketch below)
  - su3_matrix: from a 3x3 to a 4x4 complex matrix
    - 72 bytes -> 128 bytes
    - Bandwidth efficiency lost: 44%
  - wilson_vector: from 4x3 to 4x4 complex
    - 96 bytes -> 128 bytes
    - Bandwidth efficiency lost: 23%
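A minimal sketch of the padded element types (illustrative definitions, not the exact code of the Cell port): each element is rounded up to 128 bytes so that every transfer is aligned and a multiple of 128 bytes; the extra columns are never touched by the arithmetic.

    typedef struct { float real, imag; } complex_f;     /* 8 bytes */

    /* 3x3 payload padded to 4x4: 16 * 8 = 128 bytes */
    typedef struct __attribute__((aligned(128))) {
        complex_f e[4][4];
    } su3_matrix_padded;

    /* 4x3 payload padded to 4x4: 16 * 8 = 128 bytes */
    typedef struct __attribute__((aligned(128))) {
        complex_f d[4][4];
    } wilson_vector_padded;

    /* compile-time checks that both padded types are exactly 128 bytes */
    typedef char su3_size_check[(sizeof(su3_matrix_padded) == 128) ? 1 : -1];
    typedef char wv_size_check[(sizeof(wilson_vector_padded) == 128) ? 1 : -1];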
12. Padding struct site
- 128-byte-stride access has different performance for different stride sizes
  - This is due to the 16 banks in main memory
  - Odd multiples of 128 bytes always reach the peak
- We therefore choose to pad struct site to 2688 (21 x 128) bytes (see the sketch below)
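A sketch of the struct-site padding choice (the field contents are a stand-in; only the size arithmetic is the point): 2688 = 21 x 128 is an odd multiple of 128 bytes, so successive sites map onto different subsets of the 16 memory banks as the SPEs stride through the lattice.

    #define SITE_TARGET_BYTES (21 * 128)              /* 2688 bytes */

    typedef struct __attribute__((aligned(128))) site {
        /* ... the real MILC fields plus per-element padding go here ... */
        char fields[2560];                            /* stand-in payload */
        char pad[SITE_TARGET_BYTES - 2560];           /* round up to 2688 */
    } site;

    /* compile-time check that the padded site is exactly 21 * 128 bytes */
    typedef char site_size_check[(sizeof(site) == SITE_TARGET_BYTES) ? 1 : -1];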
13. Presentation outline
- Introduction
  - Our view of MILC applications
  - Introduction to the Cell Broadband Engine
- Implementation on the Cell/B.E.
  - PPE performance and STREAM benchmark
  - Profile on the CPU and kernels to be ported
  - Different approaches
  - Performance
- Conclusion
14. Kernel performance
- GFLOPS are low for all kernels
- Bandwidth is around 80% of peak for most kernels
- Kernel speedups compared to the CPU are between 10x and 20x for most kernels
- The set_memory_to_zero kernel has a 40x speedup; the su3mat_copy() speedup is >15x
15. Application performance
(8x8x16x16 lattice)
- Single-Cell application performance speedup
  - 8-10x, compared to a single Xeon core
- Cell blade application performance speedup
  - 1.5-4.1x, compared to a two-socket, 8-core Xeon
- Profile on the Xeon
  - 98.8% parallel code, 1.2% serial code
- On the Cell, kernel (SPU) time is 67-38% of the overall runtime, PPU time is 33-62%
  - The PPE is standing in the way of further improvement
(16x16x16x16 lattice)
16. Application performance on two blades
- For comparison, we ran two Intel Xeon blades and two Cell/B.E. blades connected through Gigabit Ethernet
- More data is needed for Cell blades connected through InfiniBand
17. Application performance: a fair comparison
- The PPE alone is slower than the Xeon
- The PPE + 1 SPE is 2x faster than the Xeon
- A Cell blade is 1.5-4.1x faster than an 8-core Xeon blade
18. Conclusion
- We achieved reasonably good performance
  - 4.5-5.0 GFLOPS on one Cell processor for the whole application
- We maintained the MPI framework
  - Without the assumption that the code runs on one Cell processor, certain optimizations cannot be done, e.g. loop fusion
- The current site-centric data layout forces us to take the padding approach
  - 23-44% of bandwidth efficiency is lost
  - Fix: a field-centric data layout is desired
- The PPE slows down the serial part, which is a problem for further improvement
  - Fix: IBM putting a full-featured Power core in the Cell/B.E.
- The PPE may pose problems in scaling to multiple Cell blades
  - A test over InfiniBand is needed