STI Cell Broadband Engine - PowerPoint PPT Presentation

1
STI Cell Broadband Engine
  • Brian J. Alliet
  • Jonathon W. Donaldson

2
Agenda
  • History of development
  • Technical overview of architecture
  • Detailed technical discussion of components
  • Design choices
  • Other processors like the Cell
  • Programming for the Cell

3
History of Development
  • Ken Kutaragi
  • Partnership between Sony, Toshiba, IBM (STI)
  • Provide a high-performance, low-cost chip to the
    average home user
  • Utilizes the AIM alliance's vector instructions
    (SPE, PPE) and PowerPC ISA (PPE)

4
Overall Goals for Cell
  • High performance in multimedia apps
  • Support for both real-time and general-purpose OS
  • Power consumption (Hofstee)
  • Low Cost
  • Available 2005?
  • Avoid latency issues

5
The Cell itself
  • PowerPC-based main core (PPE)
  • Multiple SPEs
  • On die memory controller
  • Inter-core transport bus
  • High speed IO

6
(No Transcript)
7
Cell Implementation
  • Cell is an architecture
  • Preliminary PS3 Implementation
  • 1 PPE
  • 8 SPEs (1 disabled for yield increase, leaving 7
    usable)
  • 221 mm² die size on a 90 nm process
  • Clocked at 3-4 GHz
  • 256 GFLOPS single precision at 4 GHz
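
The 256 GFLOPS figure can be reproduced with simple arithmetic. The sketch below assumes that all eight SPEs on the die are counted and that each sustains the 8 single-precision operations per cycle quoted on the SPE Execution slide. The function name is illustrative only, and the result is a theoretical peak, not a measured number.

```python
def peak_gflops(num_spes: int, ops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak = cores x operations per cycle x clock in GHz."""
    return num_spes * ops_per_cycle * clock_ghz


# Assumption: 8 SPEs, each doing 8 single-precision operations per cycle.
print(peak_gflops(8, 8, 4.0))   # 256.0
# With one SPE disabled, the same arithmetic gives a lower peak.
print(peak_gflops(7, 8, 4.0))   # 224.0
```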

8
Why a Cell Architecture
  • Follows a trend in computing architecture
  • Natural extension of dual and multi-core
  • Extremely low hardware overhead
  • Software controllable
  • Specialized hardware more useful for multimedia

9
Possible Uses
  • PlayStation 3 (obviously)
  • Blade servers (IBM)
  • Amazing single precision FP performance
  • Scientific applications
  • Toshiba HDTV products

10
Power Processing Element
  • PowerPC instruction set with AltiVec
  • Used for general purpose computing and
    controlling SPEs
  • Simultaneous Multithreading
  • Separate 32 KB L1 Caches and unified 512 KB L2
    Cache

11
PPE (cont.)
  • Slow but power efficient PowerPC instruction set
    implementation
  • Two-issue, in-order instruction fetch
  • Conspicuous lack of instruction window
  • Compare to conventional PowerPC implementations
    (G5)
  • Performance depends on SPE utilization

12
Synergistic Processing Element (SPE)
  • Specialized hardware
  • Meant to be used in parallel
  • (7 on PS3 implementation)
  • On-chip memory (256 KB)
  • No branch prediction
  • In-order execution
  • Dual issue

13
SPE Architecture
  • 0.99 µm² on 90 nm process
  • 128 registers (128 bits wide)
  • Instructions assume 4 x 32-bit operands
  • Variant of VMX instruction set
  • Modified for 128 registers
  • On chip memory is NOT a cache

14
SPE Execution
  • Dual issue, in-order
  • Seven execution units
  • Vector logic
  • 8 single precision operations per cycle
  • Significant performance hit for double precision

15
SPE Execution Diagram
16
SPE Local Storage Area
  • NOT a cache
  • 256 KB, 4 x 64 KB ECC single-port SRAM
  • Completely private to each SPE
  • Directly addressable by software
  • Can be used as a cache, but only with software
    controls
  • No tag bits, or any extra hardware
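
Since the local store has no tag bits, any cache behaviour has to be built from ordinary program data. The following is a minimal sketch of a direct-mapped software cache, written in Python purely for illustration. The class name, line size, and line count are assumptions, and the slice copy stands in for a DMA transfer from main memory.

```python
LINE_SIZE = 128   # bytes per cache line (illustrative choice)
NUM_LINES = 8     # deliberately tiny; a real local store could hold far more


class SoftwareCache:
    """Direct-mapped cache whose tags are plain program data."""

    def __init__(self, main_memory: bytes):
        self.main_memory = main_memory
        self.tags = [None] * NUM_LINES   # maintained by software, not hardware
        self.lines = [b""] * NUM_LINES   # stands in for local-store space
        self.hits = 0
        self.misses = 0

    def read_byte(self, address: int) -> int:
        line_number = address // LINE_SIZE
        slot = line_number % NUM_LINES
        if self.tags[slot] != line_number:   # software tag check
            self.misses += 1
            start = line_number * LINE_SIZE
            # Stand-in for a DMA transfer from main memory into local store.
            self.lines[slot] = self.main_memory[start:start + LINE_SIZE]
            self.tags[slot] = line_number
        else:
            self.hits += 1
        return self.lines[slot][address % LINE_SIZE]


memory = bytes(range(256)) * 16   # 4096 bytes of sample data
cache = SoftwareCache(memory)
values = [cache.read_byte(a) for a in (0, 1, 127, 128, 0, 1024, 0)]
print(values, cache.hits, cache.misses)
```

Addresses 0 and 1024 map to the same slot here, so reading them alternately evicts the line each time. Every tag check is an explicit instruction in the program, which is the cost of having no hardware tag logic.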

17
SPE LS Scheduling
  • Software controlled DMA
  • DMA to and from main memory
  • Scheduling a HUGE problem
  • Done primarily in software
  • IBM predicts 80-90% usage ideally
  • Request queue handles 16 simultaneous requests
  • Up to 16 KB per transfer
  • Priority: DMA, L/S, Fetch
  • Fetch / execute parallelism
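
The transfer limits and the fetch / execute overlap above can be sketched as a small model. This is an illustration under stated assumptions, not the actual DMA interface: the function names are hypothetical, and the timing values are arbitrary units chosen for the example.

```python
MAX_TRANSFER = 16 * 1024   # bytes per DMA request (from the slide)
QUEUE_DEPTH = 16           # simultaneous requests (from the slide)


def split_transfer(total_bytes: int) -> list[tuple[int, int]]:
    """Break one large transfer into (offset, size) requests of at most 16 KB."""
    return [(offset, min(MAX_TRANSFER, total_bytes - offset))
            for offset in range(0, total_bytes, MAX_TRANSFER)]


def double_buffered_time(chunks: int, dma_time: float, compute_time: float) -> float:
    """Total time when the fetch of chunk i+1 overlaps the compute of chunk i."""
    if chunks == 0:
        return 0.0
    # The first fetch cannot be hidden, and the last compute has no fetch to hide.
    return dma_time + (chunks - 1) * max(dma_time, compute_time) + compute_time


requests = split_transfer(100 * 1024)    # a 100 KB block
fits_in_queue = len(requests) <= QUEUE_DEPTH

# Arbitrary time units: 2 per DMA transfer, 18 per compute step.
serial = len(requests) * (2.0 + 18.0)
overlapped = double_buffered_time(len(requests), dma_time=2.0, compute_time=18.0)
print(len(requests), requests[-1], fits_in_queue, serial, overlapped)
```

When compute time per chunk exceeds transfer time, almost all of the DMA latency is hidden behind computation. That is why software scheduling of these transfers matters so much for SPE utilization.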

18
SPE Control Logic
  • Very little in comparison
  • Represents shift in focus
  • Complete lack of branch prediction
  • Software branch prediction
  • Loop unrolling
  • 18-cycle penalty
  • Software controlled DMA
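
A toy cost model can show why software branch hints and loop unrolling matter with an 18-cycle penalty. This is a rough sketch, not a measurement. It assumes that every taken loop-back branch pays the full penalty unless a software hint is present, and that with a hint only the final loop exit pays it. The function name and the cycle counts for the loop body are made up for illustration.

```python
BRANCH_PENALTY = 18   # cycles, from the slide


def loop_cycles(iterations: int, body_cycles: int, unroll: int = 1,
                hinted: bool = False) -> int:
    """Estimate cycles for a loop under the assumptions stated above."""
    if iterations == 0:
        return 0
    trips = -(-iterations // unroll)    # ceiling division: loop-back tests left
    work = iterations * body_cycles     # useful work is unchanged by unrolling
    penalised = 1 if hinted else trips - 1
    return work + penalised * BRANCH_PENALTY


plain = loop_cycles(1000, 4)
unrolled = loop_cycles(1000, 4, unroll=4)
hinted = loop_cycles(1000, 4, hinted=True)
print(plain, unrolled, hinted)
```

Under these assumptions, unrolling by four removes three quarters of the penalised branches, and a correct hint removes nearly all of them.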

19
SPE Pipeline
  • Little ILP, and thus little control logic
  • Dual issue
  • Simple commit unit (no reorder buffer or other
    complexities)
  • Same execution unit for FP/int

20
SPE Summary
  • Essentially small vector computer
  • Based on Altivec/VMX ISA
  • Extensions for DMA and LS management
  • Extended for a 128 x 128-bit register file
  • Uniquely suited for real time applications
  • Extremely fast for certain FP operations
  • Offloads a large amount of work onto the
    compiler / software

21
Element Interconnect Bus
  • 4 concentric rings connecting all Cell elements
  • 128-bit wide interconnects

22
EIB (cont.)
  • Designed to minimize coupling noise
  • Rings of data traveling in alternating directions
  • Buffers and repeaters at each SPE boundary
  • Architecture can be scaled up with increased bus
    latency

23
EIB (cont.)
  • Total bandwidth at 200 GB/s
  • EIB controller located physically in center of
    chip between SPEs
  • Controller reserves channels for each individual
    data transfer request
  • Implementation allows for SPE extension
    horizontally

24
Memory Interface
  • Rambus XDR memory to keep Cell at full
    utilization
  • 3.2 Gbps data rate per pin on each device
    connected to the XDR interface
  • Cell uses dual channel XDR with four devices and
    16-bit wide buses to achieve 25.6 GB/s total
    memory bandwidth
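
The total memory bandwidth follows from pin rate times bus width times device count. The sketch below assumes that 3.2 Gbps is the data rate per pin. Under that assumption, four 16-bit devices give 25.6 GB/s. The function name is illustrative.

```python
def xdr_bandwidth_gbytes(gbits_per_pin: float, bus_width_bits: int,
                         devices: int) -> float:
    """Aggregate bandwidth in GB/s for identical XDR devices."""
    per_device = gbits_per_pin * bus_width_bits / 8   # Gbit/s to GB/s
    return per_device * devices


total = xdr_bandwidth_gbytes(3.2, 16, 4)
print(round(total, 1))   # 25.6
```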

25
Input / Output Bus
  • Rambus FlexIO Bus
  • IO interface consists of 12 unidirectional byte
    lanes
  • Each lane supports 6.4 GB/s bandwidth
  • 7 outbound lanes and 5 inbound lanes
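
Taking the per-lane figure above at face value, the aggregate I/O bandwidth in each direction is a simple product. This is a sketch of that arithmetic with an illustrative function name. The totals are theoretical peaks derived only from the numbers on this slide.

```python
LANE_GBYTES = 6.4   # GB/s per byte lane (from the slide)


def flexio_bandwidth(lanes: int) -> float:
    """Peak bandwidth in GB/s for a number of unidirectional byte lanes."""
    return round(lanes * LANE_GBYTES, 1)


outbound = flexio_bandwidth(7)
inbound = flexio_bandwidth(5)
combined = flexio_bandwidth(12)
print(outbound, inbound, combined)   # 44.8 32.0 76.8
```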

26
Design Choices
  • In-order execution
  • Abandoning ILP
  • ILP: 10-20% increase per generation
  • Reducing control logic
  • Real time responsiveness
  • Cache Design
  • Software configuration on SPE
  • Standard L2 cache on PPE

27
Cell Programming Issues
  • No Cell compiler in existence to manage
    utilization of SPEs at compile time
  • SPEs do not natively support context switching.
    Must be OS managed.
  • SPEs are vector processors. Not efficient for
    general-purpose computation.
  • PPEs and SPEs use different instruction sets.

28
The IBM Octopiler
29
Cell Programming (cont.)
  • Functional Offload Model
  • Simplest model for Cell programming
  • Optimize existing libraries for SPE computation
  • Requires no rebuild of main application logic
    which runs on PPE

30
Cell Programming (cont.)
  • Device Extension Model
  • Take advantage of SPE DMA
  • Use SPEs as interfaces to external devices

31
Cell Programming (cont.)
  • Computational Acceleration Model
  • Traditional super-computing methods using Cell
  • Shared memory or message passing paradigm for
    accelerating inherently parallel math operations
  • Can replace compute-intensive math libraries
    without rewriting applications

32
Cell Programming (cont.)
  • Streaming model
  • Use Cell processor as one large programmable
    pipeline
  • Partition algorithms into logically sensible
    steps. Execute each separately, in serial, on
    separate processors.
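
The streaming model can be sketched as a chain of stages, each consuming the previous stage's output. The example below uses Python generators to model the dataflow only. It runs in a single thread, whereas on Cell each stage would run on its own SPE. The stage functions and helper names are made up for illustration.

```python
from typing import Callable, Iterable, Iterator, Sequence


def stage(func: Callable[[float], float],
          inputs: Iterable[float]) -> Iterator[float]:
    """One pipeline step; conceptually the kernel assigned to one processor."""
    for item in inputs:
        yield func(item)


def build_pipeline(steps: Sequence[Callable[[float], float]],
                   source: Iterable[float]) -> Iterator[float]:
    """Connect the steps in order so data flows from one stage to the next."""
    stream: Iterable[float] = source
    for func in steps:
        stream = stage(func, stream)
    return iter(stream)


# Three hypothetical steps: scale, offset, square.
steps = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: x * x]
result = list(build_pipeline(steps, [1.0, 2.0, 3.0]))
print(result)   # [9.0, 25.0, 49.0]
```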

33
Cell Programming (cont.)
  • Asymmetric Thread Runtime Model
  • Abstract Cell architecture away from programmer.
  • Use the OS to schedule a different thread on
    each processor.

34
Sample Performance
  • Demonstration physics engine for real-time game
  • http://www.research.ibm.com/cell/whitepapers/cell_online_game.pdf
  • 18:2 compute-to-DMA ratio on SPEs
  • For the right tasks, Cell architecture can be
    extremely efficient.

35
  • Supposed to act like cells in a biological system
  • Core: 512 KB cache
  • Different versions with different numbers of SPEs
  • PPE core architecture unlike ANY other Power
    architecture in existence (i.e. clock comparisons
    are meaningless) but same instruction set

36
  • SPE
  • 256 KB local store
  • Access only to LS
  • 16 simultaneous transfers
  • 128-bit by 128 entry register file

37
References
  • IBM Cell Developer Works (SDK/Tutorials)
  • http://www-128.ibm.com/developerworks/power/cell/