1
The Potential of the Cell Processor for Scientific
Computing
Sam Williams (samw@cs.berkeley.edu)
Computational Research Division, Future Technologies
Group
July 12, 2006
2
Outline
  • Cell Architecture
  • Programming Cell
  • Benchmarks & Performance
  • Stencils on Structured Grids
  • Sparse Matrix-Vector Multiplication
  • Matrix-Matrix Multiplication
  • FFTs
  • Summary

3
Cell Architecture
4
Cell Chip Architecture
  • 9 Core SMP
  • One core is a conventional cache based PPC
  • The other 8 are local memory based SIMD
    processors (SPEs)
  • Cores connected via 4 rings (4 x 128b @ 1.6GHz)
  • 25.6GB/s memory bandwidth (128b @ 1.6GHz) to XDR
  • I/O channels can be used to build multichip SMPs

5
Cell Chip Characteristics
  • Core frequency up to 3.2GHz
  • Limited access to 2.1GHz hardware
  • Faster parts (2.4GHz & 3.2GHz) became available
    after the paper
  • 221mm² chip
  • Much smaller than Itanium2 (1/2 to 1/3)
  • About the size of a Power5
  • Opteron/Pentium before shrink
  • 500W blades (2 chips + DRAM + network)

6
SPE Architecture
  • 128b SIMD
  • Dual issue (FP/ALU + load/store/permute/etc.),
    but not in DP
  • Statically Scheduled, in-order
  • 4 FMA SP datapaths, 6 cycle latency
  • 1 FMA DP datapath, 13 cycle latency (including 7
    stall cycles)
  • Software managed BTB (branch hints)
  • Local memory based architecture (two steps to
    access DRAM)
  • DMA commands are queued in the MFC
  • 128b @ 1.6GHz to local store (256KB)
  • Aligned loads from local store to RF (128 x 128b)
  • 3W @ 3.2GHz
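
As an illustration of the 4-wide SP FMA datapath above, here is a minimal
sketch (not from the presentation) of a SIMDized AXPY loop. It assumes the
Cell SDK's spu_intrinsics.h; saxpy_spu and its arguments are illustrative.

    #include <spu_intrinsics.h>

    /* y[i] += a * x[i] for n a multiple of 4; pointers assumed 16B aligned.
       Each spu_madd performs 4 single-precision fused multiply-adds on
       128b registers. */
    void saxpy_spu(int n, float a, const vector float *x, vector float *y)
    {
        vector float va = spu_splats(a);        /* broadcast a into all 4 lanes */
        for (int i = 0; i < n / 4; i++)
            y[i] = spu_madd(va, x[i], y[i]);    /* fused multiply-add           */
    }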

7
Programming Cell
8
Estimation, Simulation and Exploration
  • Perform analysis before writing any code
  • Conceptualize the algorithm with
  • Double buffering & long DMAs on an in-order machine
  • Use static timing analysis & memory traffic
    modeling
  • Model Performance
  • For regular data structures, spreadsheet modeling
    works
  • SpMV requires more advanced modeling
  • Full System Simulator
  • Cell: how severely does DP throughput of 1 SIMD
    instruction every 7 or 8 cycles impair
    performance?
  • Cell+: 1 SIMD instruction every 2 cycles
  • Allows dual issuing of DP instructions with
    loads/stores/permutes/branches
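
A minimal sketch of the spreadsheet-style modeling mentioned above: for a
double-buffered kernel, the modeled time is roughly the maximum of the compute
time and the DMA time. The parameters here are illustrative placeholders,
except the 25.6GB/s bandwidth from slide 4.

    #include <stdio.h>

    int main(void)
    {
        double flops       = 7.0  * 256 * 256 * 256;  /* e.g. one stencil sweep      */
        double bytes       = 16.0 * 256 * 256 * 256;  /* one read + one write stream */
        double peak_gflops = 14.6;                    /* assumed DP peak over 8 SPEs */
        double peak_gbs    = 25.6;                    /* XDR bandwidth (slide 4)     */

        double t_compute = flops / (peak_gflops * 1e9);
        double t_memory  = bytes / (peak_gbs * 1e9);
        double t_model   = (t_compute > t_memory) ? t_compute : t_memory;

        printf("compute %.2f ms, memory %.2f ms, model %.2f ms (%.1f GFLOP/s)\n",
               t_compute * 1e3, t_memory * 1e3, t_model * 1e3,
               flops / t_model / 1e9);
        return 0;
    }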

9
Cell Programming
  • Modified SPMD (Single Program Multiple Data)
  • Dual Program Multiple Data
  • Hierarchical SPMD
  • Kind of clunky approach compared to MPI or
    pthreads
  • Power core is used to
  • Load/initialize data structures
  • Spawn SPE threads
  • Parallelize data structures
  • Pass pointers to SPEs
  • Synchronize SPE threads
  • Communicate with other processors
  • Perform other I/O operations
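
A minimal sketch, assuming the libspe2 API, of the Power-core role listed
above: load data, pass each SPE a pointer to its partition, and run the SPE
program. spe_kernel is a hypothetical embedded SPE binary handle; error
handling and the usual one-pthread-per-SPE wrapper around spe_context_run()
are omitted.

    #include <libspe2.h>

    extern spe_program_handle_t spe_kernel;   /* hypothetical embedded SPE program */

    void run_on_spe(void *partition_ptr)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        spe_program_load(ctx, &spe_kernel);

        unsigned int entry = SPE_DEFAULT_ENTRY;
        /* argp carries the global pointer to this SPE's share of the data */
        spe_context_run(ctx, &entry, 0, partition_ptr, NULL, NULL);

        spe_context_destroy(ctx);
    }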

10
SPE Programming
  • Software controlled Memory
  • Use pointers from PPC to construct global
    addresses
  • Programmer handles transfers from global to local
  • Compiler handles transfers from local to RF
  • Most applicable when address stream is long and
    independent
  • DMAs
  • Granularity is 1, 2, 4, 8, or 16×N bytes
  • Issued with very low level intrinsics (no error
    checking)
  • GETL (stanza gather/scatter): distributed in
    global memory, packed in local store
  • Double buffering
  • SPU and MFC operate in parallel
  • Operate on current data set while transferring
    the next/last
  • Time is max of computation time and communication
    time
  • Prologue and epilogue penalties
  • Although more verbose, intrinsics are an
    effective way of delivering peak performance
    (see the DMA sketch after this slide)
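
The DMA sketch referenced above, assuming the Cell SDK's spu_mfcio.h: queue a
GET in the MFC, then block on its tag group. fetch_block and buf are
illustrative; sizes must follow the granularity rule above, and buffers should
be 16B (ideally 128B) aligned.

    #include <spu_mfcio.h>

    static volatile double buf[1024] __attribute__((aligned(128)));

    void fetch_block(unsigned long long global_addr)
    {
        const unsigned int tag = 0;
        mfc_get(buf, global_addr, sizeof(buf), tag, 0, 0);  /* queue GET in the MFC  */
        mfc_write_tag_mask(1 << tag);                       /* select the tag group  */
        mfc_read_tag_status_all();                          /* block until complete  */
    }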

11
  • Benchmark Kernels
  • Stencil Computations on Structured Grids
  • Sparse Matrix-Vector Multiplication
  • Matrix-Matrix Multiplication
  • 1D FFTs
  • 2D FFTs

12
Processors Used
Note: Cell performance does not include the
Power core
13
Stencil Operations on Structured Grids
14
Stencil Operations
  • Simple Example - The Heat Equation
  • dT/dt = k∇²T
  • Parabolic PDE on 3D discretized scalar domain
  • Jacobi Style (read from current grid, write to
    next grid)
  • 7 FLOPs per point, typically double precision
  • Next[x,y,z] = Alpha*Current[x,y,z]
    + Current[x-1,y,z] + Current[x+1,y,z]
    + Current[x,y-1,z] + Current[x,y+1,z]
    + Current[x,y,z-1] + Current[x,y,z+1]
    (a plain-C sketch follows this slide)
  • Doesn't exploit the FMA pipeline well
  • Basically 6 streams presented to the memory
    subsystem
  • Explicit ghost zones bound grid
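
The plain-C sketch referenced above (not the presentation's code): one
Jacobi-style sweep of the 7-point heat-equation stencil, 7 flops per interior
point, reading Current and writing Next.

    /* grid is nx x ny x nz with one-cell ghost zones on each face */
    void stencil7(int nx, int ny, int nz, double alpha,
                  const double *current, double *next)
    {
    #define IDX(x, y, z) (((size_t)(z) * ny + (y)) * nx + (x))
        for (int z = 1; z < nz - 1; z++)
            for (int y = 1; y < ny - 1; y++)
                for (int x = 1; x < nx - 1; x++)
                    next[IDX(x, y, z)] =
                        alpha * current[IDX(x, y, z)]
                      + current[IDX(x - 1, y, z)] + current[IDX(x + 1, y, z)]
                      + current[IDX(x, y - 1, z)] + current[IDX(x, y + 1, z)]
                      + current[IDX(x, y, z - 1)] + current[IDX(x, y, z + 1)];
    #undef IDX
    }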

15
Optimization - Planes
  • Naïve approach (cacheless vector machine) is to
    load 5 streams and store one.
  • This is 7 flops per 48 bytes
  • memory limits performance to 3.7 GFLOP/s
  • A better approach is to make each DMA the size of
    a plane
  • cache the 3 most recent planes (z-1, z, z+1)
  • there are only two streams (one load, one store)
  • memory now limits performance to 11.2 GFLOP/s
  • Still must compute on each plane after it is
    loaded
  • e.g. forall Current_local[x,y] update
    Next_local[x,y]
  • Note computation can severely limit performance
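
For reference, both bounds above follow directly from the 25.6GB/s memory
bandwidth on slide 4:

    naive (5 load + 1 store streams, 48 bytes/point):
        25.6 GB/s / 48 B x 7 flops ≈ 3.7 GFLOP/s
    plane-cached (1 load + 1 store stream, 16 bytes/point):
        25.6 GB/s / 16 B x 7 flops = 11.2 GFLOP/s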

16
Optimization - Double Buffering
  • Add an input stream buffer and an output stream
    buffer (keep 6 planes in local store)
  • The two phases (transfer & compute) are now
    overlapped (sketched after this slide)
  • Thus the faster of the DMA transfer time and the
    computation time is hidden
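
The double-buffering sketch referenced above. dma_get_plane, dma_put_plane,
dma_wait, and compute_plane are hypothetical wrappers (real code would use the
MFC tag-group intrinsics); only the buffering structure is the point, and the
three resident stencil planes are folded into compute_plane for brevity.

    #define PLANE_DOUBLES (64 * 64)            /* illustrative cache-block plane size */

    void dma_get_plane(double *dst, int z, int tag);        /* hypothetical wrappers  */
    void dma_put_plane(const double *src, int z, int tag);
    void dma_wait(int tag);
    void compute_plane(const double *in, double *out);

    void sweep_planes(int nplanes)
    {
        static double in[2][PLANE_DOUBLES], out[2][PLANE_DOUBLES];

        dma_get_plane(in[0], 0, 0);                 /* prologue: fetch plane 0        */
        for (int z = 0; z < nplanes; z++) {
            int cur = z % 2, nxt = (z + 1) % 2;
            if (z + 1 < nplanes)
                dma_get_plane(in[nxt], z + 1, nxt); /* prefetch the next plane        */
            dma_wait(cur);                          /* input ready, prior output from
                                                       this buffer already drained    */
            compute_plane(in[cur], out[cur]);       /* overlaps with the prefetch     */
            dma_put_plane(out[cur], z, cur);        /* write back the current plane   */
        }
        dma_wait(0); dma_wait(1);                   /* epilogue: drain outstanding DMAs */
    }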

17
Optimization - Cache Blocking
  • Domains can be quite large (1GB)
  • A single plane, let alone 6, might not fit in the
    local store
  • Partition domain (and thus planes) into cache
    blocks so that 6 can fit in the local store.
  • Has the added benefit that cache blocks are
    independent and thus can be parallelized across
    multiple SPEs
  • Memory efficiency can be diminished as an
    intra-grid ghost zone is implicitly created.

18
Optimization - Register Blocking
  • Instead of computing on pencils, compute on
    ribbons (4x2)
  • Hides functional unit & local store latency
  • Minimizes local store memory traffic
  • Minimizes loop overhead
  • May not be beneficial / noticeable for cache
    based machines

19
Optimization - Time Skewing
  • If the application allows it, perform multiple
    time steps within the local store
  • Only appropriate for memory bound implementations
    (single precision, or improved double precision)
  • Improves computational intensity
  • Simple approach
  • Overlapping trapezoids in time-space plot
  • Can be inefficient due to duplicated work
  • If performing two steps, local store must now
    hold 9 planes (thus smaller cache blocks)
  • If performing n steps, the local store must hold
    3(n+1) planes

20
Stencil Performance
  • Notes
  • The performance model, when accounting for limited
    dual issue, matches well with the full system
    simulator and hardware
  • Double precision runs don't exploit time skewing
  • In single precision, time skewing = 4 steps
  • Problem size was the best of 128³ and 256³

21
Sparse Matrix-Vector Multiplication
22
Sparse Matrix Vector Multiplication
  • Most of the matrix entries are zero, thus the
    nonzeros are sparsely distributed
  • Can be used for unstructured grid problems
  • Issues
  • Like DGEMM, can exploit an FMA well
  • Very low computational intensity
  • (1 FMA for every 12 bytes)
  • Non FP instructions can dominate
  • Can be very irregular
  • Row lengths can be unique and very short

23
Compressed Sparse Row
  • Compressed Sparse Row (CSR) is the standard
    format
  • Array of nonzero values
  • Array of corresponding column for each nonzero
    value
  • Array of row starts containing the index (in the
    above arrays) of the first nonzero in each row
    (see the reference kernel after this slide)
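
The reference kernel referenced above (plain C, y = A*x); the Cell version
partitions rows across SPEs and DMAs nonzeros and cache-blocked vectors into
the local store, but the arrays are exactly the three described on this slide.

    void spmv_csr(int nrows,
                  const int    *row_start,  /* nrows+1 entries                */
                  const int    *col,        /* column index of each nonzero   */
                  const double *val,        /* value of each nonzero          */
                  const double *x, double *y)
    {
        for (int r = 0; r < nrows; r++) {
            double sum = 0.0;
            for (int k = row_start[r]; k < row_start[r + 1]; k++)
                sum += val[k] * x[col[k]]; /* one FMA per 12B (8B value + 4B index) */
            y[r] = sum;
        }
    }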

24
Optimization - Double Buffer Nonzeros
  • Computation and Communication are approximately
    equally expensive
  • While operating on the current set of nonzeros,
    load the next (1K nonzero buffers)
  • Needs careful code to stop and restart a row
    between buffers
  • Can nearly double performance

25
Optimization - SIMDization
  • BCSR
  • Nonzeros are grouped into r x c blocks
  • O(nnz) explicit zeros are added
  • Choose r x c so that it meshes well with 128b
    registers
  • Performance can suffer, especially in DP, as
    computing on zeros is very wasteful
  • Can hide latency and amortize loop overhead
  • Only used in initial performance model
  • Row Padding
  • Pad rows to the nearest multiple of 128b
  • Might require O(N) explicit zeros
  • Loop overhead still present
  • Generally works better in double precision

26
Optimization - Cache Blocked Vectors
  • Doubleword DMA gathers from DRAM can be expensive
  • Cache block source and destination vectors
  • Finite LS, so what's the best aspect ratio?
  • DMA large blocks into local store
  • Gather operations into local store
  • ISA vs. memory subsystem inefficiencies
  • Exploits temporal and spatial locality within
    the SPE
  • In effect, the sparse matrix is explicitly blocked
    into submatrices, and we can skip, or otherwise
    optimize, empty submatrices
  • Indices are now relative to the cache block
    (half words), reducing memory traffic by 16%

27
Optimization - Load Balancing
  • Potentially irregular problem
  • Load imbalance can severely hurt performance
  • Partitioning by rows is clearly insufficient
  • Partitioning by nonzeros is inefficient when the
    average number of nonzeros per row per cache
    block is small
  • Define a cost function of the number of row starts
    and the number of nonzeros (sketched after this
    slide)
  • Determine the parameters via static timing
    analysis or tuning.
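
The cost-function sketch referenced above (plain C, not the presentation's
code): split consecutive rows among SPEs so each gets roughly an equal share
of a cost charging both row starts and nonzeros. The weights row_cost and
nnz_cost are the parameters to be set by static timing analysis or tuning;
the values here are placeholders.

    void partition_rows(int nrows, const int *row_start,
                        int nspes, int *first_row /* nspes+1 entries */)
    {
        const double row_cost = 1.0, nnz_cost = 1.0;   /* placeholder weights */
        double total = row_cost * nrows + nnz_cost * row_start[nrows];

        first_row[0] = 0;
        int r = 0;
        for (int s = 1; s < nspes; s++) {
            double target = total * s / nspes;
            /* advance r until the cumulative cost of rows [0, r) reaches target */
            while (r < nrows &&
                   row_cost * r + nnz_cost * row_start[r] < target)
                r++;
            first_row[s] = r;
        }
        first_row[nspes] = nrows;                      /* last SPE takes the rest */
    }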

28
Benchmark Matrices
  • 4 nonsymmetric SPARSITY matrices
  • 6 symmetric SPARSITY matrices
  • 7pt Heat equation matrix

29
Other Approaches
  • BeBop / OSKI on the Itanium2 & Opteron
  • uses BCSR
  • auto tunes for optimal r x c blocking
  • Cell implementation is similar
  • Cray's routines on the X1E
  • Report the best of CSRP, Segmented Scan, and
    Jagged Diagonal

30
Double Precision SpMV Performance
  • 16 SPE version needs broadcast optimization
  • Frequency helps (mildly computationally bound)
  • Cell+ doesn't help much more
    (limited by non-FP instruction bandwidth)

31
Dense Matrix-Matrix Multiplication (performance
model only)
32
Dense Matrix-Matrix Multiplication
  • Blocking
  • Explicit (BDL) or implicit blocking (gather
    stanzas)
  • Hybrid method would be to convert and store to
    DRAM on the fly
  • Choose a block size so that kernel is
    computationally bound
  • ≥ 64² in single precision
  • much easier in double precision (14x
    computational time, 2x transfer time)
  • Parallelization
  • Partition A & C among SPEs
  • Future work: Cannon's algorithm
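
A minimal plain-C sketch of the blocking above: C += A*B computed one b x b
block at a time so a block triple fits in a 256KB local store. On Cell the
blocks would be DMAed in and out (or stored in block-data-layout) and
partitioned among SPEs; b is assumed to divide n.

    void gemm_blocked(int n, int b, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += b)
            for (int jj = 0; jj < n; jj += b)
                for (int kk = 0; kk < n; kk += b)
                    /* multiply one b x b block of A by one b x b block of B */
                    for (int i = ii; i < ii + b; i++)
                        for (int k = kk; k < kk + b; k++) {
                            double a = A[i * n + k];
                            for (int j = jj; j < jj + b; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }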

33
GEMM Performance
34
Fourier Transforms (performance model only)
35
1D Fast Fourier Transforms
  • Naïve Algorithm
  • Load roots of unity
  • Load data (cyclic)
  • Local work, on-chip transpose, local work
  • i.e. SPEs cooperate on a single FFT
  • No overlap of communication or computation

36
2D Fast Fourier Transforms
  • Each SPE performs 2 × (N/8) FFTs
  • Double buffer rows
  • overlap communication and computation
  • 2 incoming, 2 outgoing
  • Straightforward algorithm (N² 2D FFT)
  • N simultaneous FFTs, transpose,
  • N simultaneous FFTs, transpose.
  • Long DMAs necessitate transposes
  • transposes represent about 50% of total SP
    execution time
  • SP Simultaneous FFTs run at 75 GFLOP/s
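
A minimal plain-C sketch (not the presentation's code) of the straightforward
algorithm above: FFT every row, transpose, FFT every row, transpose. fft_1d is
an ordinary recursive radix-2 FFT for power-of-two n; on Cell each SPE would
take n/8 of the rows in each phase.

    #include <complex.h>
    #include <math.h>
    #include <string.h>

    static void fft_1d(double complex *x, int n)   /* in place, n a power of two */
    {
        if (n < 2) return;
        double complex odd[n / 2];
        for (int i = 0; i < n / 2; i++) odd[i] = x[2 * i + 1];
        for (int i = 0; i < n / 2; i++) x[i]   = x[2 * i];
        memcpy(x + n / 2, odd, sizeof(odd));
        fft_1d(x, n / 2);
        fft_1d(x + n / 2, n / 2);
        const double pi = acos(-1.0);
        for (int k = 0; k < n / 2; k++) {          /* combine with roots of unity */
            double complex t = cexp(-2.0 * I * pi * k / n) * x[k + n / 2];
            x[k + n / 2] = x[k] - t;
            x[k]         = x[k] + t;
        }
    }

    static void transpose(double complex *a, int n)     /* square, in place */
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double complex t = a[i * n + j];
                a[i * n + j] = a[j * n + i];
                a[j * n + i] = t;
            }
    }

    void fft_2d(double complex *a, int n)
    {
        for (int r = 0; r < n; r++) fft_1d(a + r * n, n);
        transpose(a, n);
        for (int r = 0; r < n; r++) fft_1d(a + r * n, n);
        transpose(a, n);
    }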

37
FFT Performance
38
Conclusions
39
Conclusions
  • Far more predictable than conventional OOO
    machines
  • Even in double precision, Cell obtains much better
    performance on a surprising variety of codes.
  • Cell can eliminate unneeded memory traffic, hide
    memory latency, and thus achieves a much higher
    percentage of memory bandwidth.
  • Instruction set can be very inefficient for
    poorly SIMDizable or misaligned codes.
  • Loop overheads can heavily dominate performance.
  • Programming Model is clunky

40
Acknowledgments
  • This work (paper in CF06, poster in EDGE06) is a
    collaboration with the following FTG members
  • John Shalf, Lenny Oliker, Shoaib Kamil, Parry
    Husbands, and Kathy Yelick
  • Additional thanks to
  • Joe Gebis and David Patterson
  • X1E FFT numbers provided by
  • Bracy Elton, and Adrian Tate
  • Cell access provided by
  • Mike Perrone (IBM Watson)
  • Otto Wohlmuth (IBM Germany)
  • Nehal Desai (Los Alamos National Lab)

41
Questions?