Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor - PowerPoint PPT Presentation

About This Presentation
Title:

Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor

Description:

Overview of the architecture, circuit design, and physical implementation of a first-generation Cell processor. IEEE Journal of Solid-State Circuits, vol. 41, no. 1, January 2006.

Slides: 77

Transcript and Presenter's Notes



1
Overview of the Architecture, Circuit Design,
andPhysical Implementation of a
First-Generation Cell Processor
  • IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 41,
    NO. 1, JANUARY 2006

2
First Consumer Product
  • Play Station 3!

3
Introduction
  • Developed through a partnership of
  • Sony Computer Entertainment.
  • Toshiba.
  • IBM.
  • Aim
  • Highly tuned for media processing.
  • Designed for the expected demands of larger and
    more complex data handling.

4
What is Cell?
  • Cell is an architecture for high-performance
    distributed computing.
  • It comprises hardware and software cells.
  • It supports a wide range of single- or
    multiple-processor and memory configurations.

5
Supercomputer in daily life
  • Parallelism with high frequency.
  • Real-time response.
  • Supports multiple operating systems.
  • 10 simultaneous threads.
  • 128 outstanding memory requests.
  • Optimally addresses many different system and
    application requirements.

6
Architecture Overview
  • 8 SPEs with Local Storage (LS).
  • PPE with its L2 cache.
  • Internal element interconnect bus (EIB).
  • Memory Interface Controller (MIC).
  • Bus Interface Controller (BIC).
  • Power Management Unit (PMU).
  • Thermal Management Unit (TMU).
  • Pervasive Unit.

7
High Level Diagram
8
Die Photograph
9
Synergistic Processing Elements (SPE) (1/2)
  • Share system memory with the PPE through DMA.
  • Data and instructions live in a private real
    address space backed by a 256-KB LS.
  • According to IBM, a single SPE can perform as
    well as a top-end (single-core) desktop CPU given
    the right task.

10
Synergistic Processing Elements (SPE) (2/2)
  • Access main storage by issuing DMA commands to
    the associated MFC block (asynchronous transfer).
  • Fully pipelined, 128-bit-wide, dual-issue SIMD.
  • SPEs in a Cell can be chained together to act as
    a stream processor.

11
Power Processor Element (PPE) (1/2)
  • 32-KB instruction and data caches.
  • 64-bit Power Architecture with a 512-KB L2 cache.

12
Power Processor Element (PPE) (2/2)
  • Through MMIO control registers, it can initiate
    DMA for an SPE.
  • Hypervisor extension.
  • Moderate pipeline length.

13
Element Interconnect Bus(EIB)
  • Can transfer up to 96 bytes per cycle.
  • Four 16-byte-wide rings
  • Two rings going clockwise.
  • Two rings going counterclockwise.
  • Separate address and command network.
  • 12 on/off ramps.
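A back-of-envelope check of the 96-bytes-per-cycle figure, under two assumptions that the slide itself does not state: that each ring can carry up to three concurrent non-overlapping transfers, and that the EIB runs at half the core clock.

```python
# Hedged arithmetic behind the EIB's "96 bytes per (core) cycle" figure.
rings = 4
bytes_per_ring = 16             # each ring is 16 bytes wide
transfers_per_ring = 3          # assumed concurrency limit per ring
bytes_per_bus_cycle = rings * bytes_per_ring * transfers_per_ring
bytes_per_core_cycle = bytes_per_bus_cycle // 2  # bus at half core frequency
print(bytes_per_bus_cycle, bytes_per_core_cycle)  # 192 96
```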

14
(No Transcript)
15
Memory Interface Controller (MIC)
  • Two 36-bit-wide XDR memory banks.
  • Can also support just a single bank.
  • Speed-matching SRAM and two clocks.

16
Power Reduction
  • Power Management Unit.
  • The PMU gives software controls to reduce chip
    power.
  • Under OS control, it can throttle, pause, or
    stop single or multiple units.

17
Thermal Monitoring
  • Thermal sensors and a Thermal Management Unit.
  • One sensor located at a relatively
    constant-temperature location, for external
    cooling.
  • 10 digital thermal sensors (DTS) at various
    critical locations.

18
Optimum Point (1/3)
  • Triple constraint: power, performance, area.
  • Gate oxide thickness
  • Thinner oxide
  • Higher performance.
  • Higher gate tunneling too.
  • Reliability concerns.

19
Optimum Point (2/3)
  • Channel Length
  • Short channel length
  • Improved performance.
  • Increased leakage current too.
  • Supply Voltage
  • Higher voltage
  • Improved performance.
  • Higher AC/DC power.

20
Optimum Point (3/3)
  • Wire Levels
  • Few levels
  • Increased chip area.
  • Many levels
  • More cost.

21
Final Technology Parameters
22
Chip Integration
  • 241M transistors.
  • 8912 discrete floorplanned blocks.
  • Custom-tailored nets.
  • 20 separate power domains.

23
POWER-CONSCIOUS DESIGN OF THE CELL PROCESSOR'S SPE
  • Osamu Takahashi
  • IBM Systems and
  • Technology Group
  • Scott Cottier
  • Sang H. Dhong
  • Brian Flachs
  • Joel Silberman
  • IBM T.J. Watson
  • Research Center

24
The CELL Processor - Properties
  • Mostly CMOS static gates.
  • Dynamic gates used for time-critical paths.
  • Tight coupling of
  • ISA
  • uArchitecture
  • Physical implementation
  • achieves a compact and power-efficient design.

25
APPLICATIONS
  • To name a few (the list goes on)
  • Image processing for high definition TV
  • Image processing for medical usages
  • High performance computing
  • Gaming
  • Flexible enough to be a GP uP that supports HLL
    programming.

26
Cell processor - Architecture
  • 64-bit power core
  • Eight Synergistic Processor Elements (SPEs)
  • L2 Cache
  • Interconnection bus
  • I/O Controller
  • Rambus Flex I/O

27
Architecture contd.
  • The SPE has two clock domains
  • one with an 11 FO4 cycle time.
  • the other with a 22 FO4 cycle time.
  • The high-frequency domain is implemented using
    custom design.
  • The SPE contains
  • 256 KB of dedicated local store memory.
  • A 128-bit, 128-entry general-purpose register
    file with six read ports and two write ports.

28
SPE
  • The SMF operates at half the SPE's frequency.
  • The SPE operates at up to 5.6 GHz at a 1.4-V
    supply and 56 °C.
  • The SPE's measured power consumption is in the
    range of 1 W to 11 W, depending on
  • Operating clock frequency.
  • Temperature.
  • Workload.

29
Triple design constraints
  • Cell contains eight copies of the SPE.
  • Optimization of the SPE's power and area is
    critical to the overall chip design.
  • Conscious effort to reduce SPE area and power
    while meeting the 11 FO4 cycle time performance
    objectives.
  • Optimized design to balance three constraints of
  • Power.
  • Area.
  • Performance.
  • Tradeoffs to achieve the overall best results
  • Some techniques used
  • latch selection.
  • fine-grained clock-gating scheme.
  • multiclock-domain design.
  • use of dual-threshold voltage.
  • Selective use of dynamic circuits.

30
Latch selection
  • Logic has an 8-9 FO4 time budget.
  • The rest of the cycle is used by latches.
  • Several latches with various insertion delays
    are used.

31
Transmission Gate Latch
  • The SPE's main workhorse latch.
  • Comes in two varieties
  • Scannable.
  • Non-scannable.
  • Each has several power levels.
  • Used almost throughout the SPE.

32
Pulsed Clock Latch
  • Non-scannable.
  • Small insertion delay.
  • Small area.
  • Relatively low power consumption.
  • Used in most timing- and power-critical areas.

33
Dynamic multiplexer latch
  • Scannable.
  • Multiplexing widths from 4-10.
  • Small insertion delay.
  • Used in
  • Time-critical areas.
  • Areas requiring multiplexing.
  • Typically used in dataflow operand latches.

34
Dynamic PLA Latch
  • Scannable latch.
  • Used to generate control signals (clock-gating
    signals).
  • The last two latch types use slightly higher
    power to complete complex tasks in critical time.
  • An example of a tradeoff among the triple
    constraints.

35
Fine-grained clock gating
  • An effective method of reducing power - used
    extensively in the CELL.
  • Use of a local clock buffer (LCB)
  • Supplies the clock to a bank of latches.
  • If the enable signal fires, the LCB buffers the
    global clock and sends it to the bank of latches.
  • The SPE activates only the necessary pipeline
    stages.
  • Registers are normally turned off.
  • Functional blocks were simulated and verified.
  • 50% active power reduction using this design
    process.
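A toy model of the scheme just described: a local clock buffer drives its bank of latches only when its enable fires, so inactive pipeline stages draw no clock power. The numbers are illustrative, not measured values from the paper.

```python
# Fine-grained clock gating, modeled crudely: each entry in `banks` is an
# LCB enable; a bank consumes clock power only when its enable fired.
def clock_power(banks, per_bank_power=1.0):
    """banks: list of booleans, True if that stage's enable fired this cycle."""
    return sum(per_bank_power for enabled in banks if enabled)

all_on = clock_power([True] * 8)   # ungated: every stage clocks every cycle
gated = clock_power([True, True, False, False, True, False, False, True])
print(all_on, gated)  # 8.0 4.0 -> a 50% active-power reduction in this toy case
```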

36
Multiple clock frequency domains
  • High frequency increases performance.
  • Has some penalties
  • Higher clock power.
  • Higher percentage of clock insertion delays.
  • Shorter distance that a signal can travel.
  • SPE has some units whose performance does not
    solely depend on frequency.
  • SMF operates at half the frequency.

37
Multiple clock frequency domains
  • 11 FO4 blocks
  • Register file.
  • Fixed point unit.
  • Floating point unit.
  • Data forwarding.
  • Load/Store.
  • 22 FO4 blocks
  • Direct memory access unit.
  • Bus control.
  • Distribution of one clock to both domains.
  • SMF activated every second clock cycle.

38
Multiple clock frequency domains
  • Avoids physical implementation difficulties.
  • Helps escape
  • Latch insertion delay.
  • Travel distance penalties.
  • Advantages
  • Large percentage of clock dedicated to logic.
  • Most of SMF paths become non-critical.
  • Smaller transistors can be used.
  • SMF optimized for both area and power without
    sacrificing performance.

39
Dual-threshold-voltage devices
  • Leakage is a significant portion of power
    consumption in deep-submicron technology.
  • It cannot be solved by clock gating or two clock
    domains.
  • Use high-threshold-voltage transistors.
  • Penalty: slower switching time.
  • Used in paths with enough timing slack.
  • Non-critical paths in the SMF (made non-critical
    by the two clock domains) were converted to these
    devices.

40
Selective use of dynamic circuits
  • Advantages of static circuits over dynamic
  • Design ease.
  • Low switching factor.
  • Tool compatibility.
  • Technology independence.
  • Advantages of dynamic circuits over their static
    counterparts
  • Faster speed due to low capacitance at dynamic
    nodes.
  • Larger gains because of inverters after logic.
  • Microarchitectural efficiency: fewer stages.
  • Smaller area.

41
Selective use of dynamic circuits
  • Dynamic logic requires a clock, hence higher
    power consumption.
  • It also requires both true and complementary
    signals.
  • Static implementations tend to hit the speed
    wall earlier.
  • Design approach
  • Implement logic circuits in static CMOS as much
    as possible.
  • Use alternatives when static did not meet the
    speed requirements.

42
(No Transcript)

43
Selective use of dynamic circuits
  • Dynamic circuits have static interfaces.
  • They occupy 19 percent of the non-SRAM area.
  • They include the following macros
  • Dataflow forwarding.
  • Multiport register file.
  • Floating-point unit.
  • Dynamic PLA.
  • Multiplexer latch.
  • Instruction line buffer.

44
SPE hardware measurements
  • Tested with complicated 3D picture rendering.
  • The fastest operation ran at 5.6 GHz with a
    1.4-V supply at 56 °C.
  • The global clock mesh's measured power is 1.3 W
    per SPE at a 1.2-V supply and 2.0-GHz clock
    frequency.
  • The Cell architecture is compatible with the
    64-bit Power Architecture, so applications can
    build on the Power investments.
  • It can be considered a non-homogeneous coherent
    chip multiprocessor.
  • Its high design frequency has been achieved
    through a highly optimized implementation.
  • Its streaming DMA architecture helps enhance the
    memory effectiveness of the processor.
  • Refer to the shmoo plot for power analysis.

45
SPE shmoo plot
46
Applications of the CELL Processor and Its
Potential for Scientific Computing
47
(No Transcript)
48
THE POWER!
  • Folding@home broke the Guinness world record for
    the world's most powerful distributed network,
    with computing power of > 1 PFLOPS (a thousand
    trillion floating-point operations per second).
  • Blue Gene is 500 TFLOPS.

49
WHY THE POWER?
  • Cell combines the considerable floating-point
    resources required for demanding numerical
    algorithms with a power-efficient
    software-controlled memory hierarchy.
  • It contains a powerful 64-bit dual-threaded IBM
    PowerPC core and eight proprietary 'Synergistic
    Processing Elements' (SPEs) - eight more highly
    specialized mini-computers on the same die.
  • Cell's peak double-precision performance is very
    impressive relative to its commodity peers
    (14.6 Gflop/s @ 3.2 GHz).

50
OVERVIEW
  • Quantitative performance comparison of the Cell
    to the AMD Opteron (superscalar), Intel Itanium 2
    (VLIW), and Cray X1E (vector).
  • Minor architectural changes (CELL+) to improve
    DP performance.
  • Complexity of mapping scientific algorithms onto
    the CELL.
  • A few interesting applications.

51
ARCHITECTURE
  • Each SPE contains 4 SP 6-cycle pipelined FMA
    (fused multiply-add) datapaths and 1 DP 9-cycle
    pipelined FMA datapath (plus 4 cycles for data
    movement).
  • 7-cycle in-order execution pipeline and
    forwarding network.
  • A 6-cycle stall is inserted after each DP
    instruction.
  • 1 DP instruction issued every 7 cycles.
  • DP performance is 1/14 of peak SP performance.
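The 1/14 ratio, and the 14.6 Gflop/s figure quoted earlier, both follow directly from these issue rates. A sketch of the arithmetic (clock frequency and SPE count taken from the surrounding slides):

```python
# 4 SP FMA datapaths give 8 flops per cycle; one 2-way DP FMA (4 flops)
# issues only once every 7 cycles because of the 6-cycle stall.
sp_flops_per_cycle = 4 * 2          # 4 FMAs x 2 flops each
dp_flops_per_cycle = (2 * 2) / 7    # one 2-way FMA x 2 flops, per 7 cycles
ratio = dp_flops_per_cycle / sp_flops_per_cycle
print(ratio)                        # 1/14 ~= 0.0714

# The same numbers reproduce the peak DP figure at 3.2 GHz over 8 SPEs:
clock_hz, spes = 3.2e9, 8
peak_dp_gflops = spes * dp_flops_per_cycle * clock_hz / 1e9
print(peak_dp_gflops)               # ~14.6
```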

52
Programming
  • Modified SPMD (Single Program, Multiple Data)
  • Effectively Dual Program, Multiple Data.
  • Each SPE has its own local memory from which to
    fetch code and read/write data.
  • All loads and stores are local.
  • Explicit DMA operations move data between main
    memory and local memory.
  • Software-controlled memory.
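The software-controlled memory model above can be sketched in plain Python: the SPE never loads from main memory directly, it issues explicit DMA "get"s into local store and computes only on local data. The names here (`dma_get`, `process_stream`) are illustrative, not the real MFC/libspe API.

```python
# Simulation of the explicit-DMA programming model: move a chunk into
# local store, then do all loads/stores against that local copy.
def dma_get(main_memory, offset, size):
    # Stands in for an asynchronous MFC get; here it is just a slice copy.
    return main_memory[offset:offset + size]

def process_stream(main_memory, chunk=4):
    total = 0
    for off in range(0, len(main_memory), chunk):
        local_store = dma_get(main_memory, off, chunk)  # explicit transfer
        total += sum(local_store)                       # all work is local
    return total

print(process_stream(list(range(16))))  # 120
```

A real SPE kernel would also double-buffer (start the next DMA while computing on the current chunk) to hide transfer latency.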

53
Programming Models
  • Very challenging to program.
  • Explicit parallelism between the SPEs and PPC.
  • Quad-word ISA.
  • Unlike MPI, communication intrinsics are
    low-level, hence faster.
  • Three basic models
  • Task parallel: separate tasks assigned to each
    SPE.
  • Pipeline parallel: large blocks of data
    transferred between SPEs.
  • Data parallel: same code, distinct data (the
    paper uses this).

54
Benchmark Kernels
  • Stencil Computations on Structured Grids
  • Sparse Matrix-Vector Multiplication
  • Matrix-Matrix Multiplication
  • 1D FFTs
  • 2D FFTs

55
CELL+
  • The authors of this paper proposed minor
    architectural changes to the CELL processor.
  • DP wasn't a major focus for the gaming world.
  • A redesign would increase complexity and power
    consumption.
  • DP instructions are fetched every 2 cycles,
    keeping everything else the same.

56
The Processors Used
57
Benchmark 1 GEMM
  • Dense matrix-matrix multiplication: high
    computational intensity and regular memory
    access.
  • Expect to reach close to peak on most platforms.
  • Explored two blocking formats: column-major and
    block data layout.
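A minimal blocked matrix multiply shows the access pattern behind this benchmark: work proceeds on sub-blocks small enough to fit a 256-KB local store, so each element is moved once per block rather than once per multiply-add. This is a generic sketch, not the paper's tuned kernel.

```python
# Blocked GEMM: the three outer loops walk blocks, the three inner loops
# do the dense multiply-accumulate within a block.
def blocked_gemm(A, B, n, bs):
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):  # block loops
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, n)):
                        for k in range(k0, min(k0 + bs, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(blocked_gemm(A, B, 2, 1))  # [[19.0, 22.0], [43.0, 50.0]]
```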

58
Benchmark 1 GEMM
59
BENCHMARK 2 Sparse Matrix Vector Multiply
  • Seems like a poor choice at first glance due to
    low computational intensity and irregular data
    accesses.
  • But low local-store latency, task parallelism,
    8 SPE load/store units, and DMA prove otherwise.
  • Most of the matrix entries are zero, so the
    nonzeros are sparsely distributed and can be
    streamed via DMA.
  • Like DGEMM, it can exploit an FMA well.
  • Very low computational intensity (1 FMA for
    every 12 bytes).
  • Non-FP instructions can dominate.
  • Row lengths can be unique and in multiples of 4.
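The kernel behind this benchmark is compressed-sparse-row (CSR) SpMV. With an 8-byte value and a 4-byte column index per nonzero, each stored entry costs 12 bytes for one FMA, which is where the intensity figure above comes from. A plain-Python sketch:

```python
# CSR sparse matrix-vector multiply: one FMA per stored nonzero.
def spmv_csr(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # the FMA
        y.append(acc)
    return y

# The 3x3 matrix [[4,0,1],[0,2,0],[3,0,5]] in CSR form:
vals, cols, ptr = [4, 1, 2, 3, 5], [0, 2, 1, 0, 2], [0, 2, 3, 5]
print(spmv_csr(vals, cols, ptr, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```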

60
SpMV - Results
61
Stencil Based Computations
  • Stencil computation codes represent a wide array
    of scientific applications.
  • Each point in a multidimensional grid is updated
    from a subset of its neighbours.
  • Finite-difference operations are used to solve
    complex numerical systems.
  • Here, simple heat equations and a 3D hyperbolic
    PDE are examined.
  • Relatively low computational intensity results
    in a low fraction of peak on superscalars.
  • Memory-bandwidth bound due to low computational
    intensity.
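The simplest instance of this pattern is a 1D explicit heat-equation step: each point is updated from its two neighbours, so only a handful of flops are done per byte moved, which is why the kernel is bandwidth-bound. A sketch (the paper's benchmarks are 3D; this is the 1D analogue):

```python
# One explicit finite-difference step of the 1D heat equation,
# u_new[i] = u[i] + alpha * (u[i-1] - 2*u[i] + u[i+1]).
def heat_step(u, alpha=0.25):
    v = u[:]  # boundary points stay fixed
    for i in range(1, len(u) - 1):
        v[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
    return v

u = [0.0, 0.0, 4.0, 0.0, 0.0]
print(heat_step(u))  # [0.0, 1.0, 2.0, 1.0, 0.0] - the spike diffuses outward
```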

62
Stencils - Results
63
1D Fast Fourier Transforms
  • The fast Fourier transform (FFT) is of great
    importance to a wide variety of applications.
  • One of the main techniques for solving PDEs.
  • Relatively low computational intensity with a
    non-trivial volume of data movement.
  • 1D FFT: a naïve algorithm cooperatively executed
    across the SPEs.
  • Load roots of unity, load data (cyclic).
  • 3 stages: local work, on-chip transpose, local
    work.
  • No double buffering (i.e., no overlap of
    communication and computation).
  • 2D FFT: 1D FFTs are each run on a single SPE.
  • Each SPE performs 2 (N/8) FFTs.
  • Double buffered (2 incoming and 2 outgoing).
  • Straightforward algorithm (N^2 2D FFT):
    N simultaneous FFTs, then a transpose.
  • Transposes represent about 50% of SP execution
    time, but only 20% of DP.
  • Cell performance compared with highly optimized
    FFTW and vendor libraries.
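The numerical recurrence underneath all of these variants is the radix-2 Cooley-Tukey butterfly. The sketch below is the plain sequential form; the paper's version splits the same butterfly work across SPEs with an on-chip transpose between the local-work stages.

```python
# Minimal recursive radix-2 Cooley-Tukey FFT (input length a power of two).
import cmath

def fft(x):
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

y = fft([1, 1, 1, 1, 0, 0, 0, 0])
print(round(abs(y[0]), 6))  # 4.0 - the DC term is the sum of the inputs
```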

64
1D FFT Results
65
A Few Conclusions
  • Far more predictable than conventional machines.
  • Even in double precision, it obtains much better
    performance on a surprising variety of codes.
  • Cell can eliminate unneeded memory traffic and
    hide memory latency, and thus achieves a much
    higher percentage of memory bandwidth.
  • The instruction set can be very inefficient for
    poorly SIMD-izable or misaligned codes.
  • Loop overheads can heavily dominate performance.
  • The programming model is clunky.

66
Real World Applications
67
FOLDING_at_HOME
  • Folding@home is a distributed computing project
    at Stanford University.
  • Connects > 1 million CPUs.
  • Mainly used to study protein folding and
    misfolding.
  • The PS3's Cell Broadband Engine increased the
    total computation power dramatically, up to
    1 PFLOPS.
  • One work unit takes 8 hours: run the PS3
    overnight, then it sends results back.
  • 250 K CPUs active in 2008.

68
(No Transcript)
69
Other Real-Life Scientific Applications
  • Ray tracing.
  • Modeling of the human brain.
  • Solving complex equations to predict gravity
    waves generated by super-sized black holes.
  • Assisting an autonomous vehicle.

70
Axion Racing Entry Into Darpa Urban Challenge
  • A series of events designed to test autonomous
    vehicles, developing technology that keeps
    people off the battlefield. Axion Racing used a
    PS3 running Yellow Dog Linux as part of its
    on-board image recognition system.
  • Spirit, the name of Axion Racing's vehicle, was
    the first of its kind to drive itself to the
    14,110-foot summit of Colorado's Pikes Peak.
  • It uses stereo vision (2 cameras) to determine
    object distance: running the images through the
    software produces something called a disparity
    map. The further away an object is, the smaller
    its disparity; the opposite holds for near
    objects.
  • Spirit uses the Cell to park and reverse.
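The distance-from-disparity relation implied above: for camera focal length f and stereo baseline B, depth is Z = f * B / disparity, so distant objects yield small disparities. The values below are illustrative, not Spirit's actual camera parameters.

```python
# Standard rectified-stereo depth relation: Z = f * B / d.
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    return focal_px * baseline_m / disparity_px

near = depth_from_disparity(800, 0.5, 40)  # large disparity -> close object
far = depth_from_disparity(800, 0.5, 4)    # small disparity -> far object
print(near, far)  # 10.0 100.0 (meters)
```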

71
SPIRIT
  • Along with the stereo cameras, Spirit uses a
    laser range finder, an infrared camera, two
    NAVCOM Starfire GPS units, and an inertial
    navigation system (to correct for GPS errors and
    signal losses).

72
Ray Tracing
  • A very computationally intense algorithm that
    models the paths taken by light rays as they
    interact with optical surfaces.
  • Also used in modeling radio waves, radiation
    effects, and other engineering areas. The
    algorithm needs to be heavily modified to run on
    the Cell.

73
Ray Tracing
  • This video shows a progression of ray-traced
    shaders executing on a cluster of IBM QS20 Cell
    blades.
  • Over 300,000 triangles render at over 60 frames
    per second (depending on the shader) at 1080p
    resolution using 14 Cell processors.
  • Because of the scalable nature of the ray
    tracer, it can also render interactive frames on
    a single Linux PlayStation 3 using only 6 SPEs.

74

75
Conclusion
  • Overall, a single PS3 performs better than the
    highest-end desktops available and compares to as
    many as 25 nodes of an IBM Blue Gene
    supercomputer. There is still tremendous scope
    for extracting more performance through further
    optimization.
  • It's a commodity processor, hence cheap and
    usable in large quantities.
  • The most difficult part is writing and compiling
    code!

76
  • QUESTIONS????