On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus
Transcript and Presenter's Notes

1
On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus
  • Thomas William Ainsworth
  • Timothy Mark Pinkston
  • University of Southern California

2
Outline
  • Motivation and Related Work
  • Background: EIB Requirements
  • Latency Analysis
  • Throughput Analysis
  • Software Considerations

3
Motivation
  • With the rise of multi-core computing, the design
    of on-chip interconnection networks has become an
    increasingly important component of computer
    architecture
  • In designing NoCs for specific or general-purpose
    functions, it is important to understand the
    impact and limitations of various design choices
    on the performance that can be achieved
  • The EIB is an interesting OCN to study as it
    provides higher raw network bandwidth and
    interconnects more end nodes than most mainstream
    commercial multi-core processors

4
Related Work on Characterizing NoC Performance
  • Academic NoCs
  • TRIPS
  • Doug Burger, Steve Keckler, and the TRIPS Project
    Team, Design and Implementation of the TRIPS
    EDGE Architecture, ISCA-32 Tutorial, pp 1-239,
    June 2005.
  • RAW
  • Michael Bedford Taylor, Walter Lee, Saman P.
    Amarasinghe, and Anant Agarwal, Scalar Operand
    Networks, IEEE Transactions on Parallel and
    Distributed Systems, Volume 16, Issue 2, February
    2005.
  • Commercial NoCs
  • Cell BE EIB
  • Thomas Chen, Ram Raghavan, Jason Dale, and Eiji
    Iwata, Cell Broadband Engine Architecture and
    its first implementation: A performance view, 29
    Nov 2005, http://www-128.ibm.com/developerworks/power/library/pa-cellperf/
  • Fabrizio Petrini, Gordon Fossum, Ana Varbanescu,
    Michael Perrone, Michael D. Kistler, and Juan
    Fernandez Peinador, Multicore Surprises: Lessons
    Learned from Optimizing Sweep3D on the Cell
    Broadband Engine, IPDPS, 2007.

5
Background: Cell BE Interconnect Requirements
  • Cell BE On-Chip Network Requirements (tallied in the sketch below)
  • Interconnect 12 elements
  • 1 PPE with 51.2 GB/s aggregate bandwidth
  • 8 SPEs, each with 51.2 GB/s aggregate bandwidth
  • 1 MIC with 25.6 GB/s of memory bandwidth
  • 2 IOIFs with 35 GB/s (out) and 25 GB/s (in) of I/O bandwidth
  • Provide coherent and non-coherent data transfer
  • Support two transfer modes
  • DMA between SPEs
  • MMIO/DMA between PPE and system memory
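A rough tally of these peak figures (not from the slides; it simply sums the numbers listed above) shows why the EIB must supply several hundred GB/s of raw bandwidth:

    # Sum of the per-element peak bandwidth demands listed on this slide (GB/s).
    demands = {
        "PPE":  51.2,          # aggregate read + write
        "SPEs": 8 * 51.2,      # eight SPEs at 51.2 GB/s each
        "MIC":  25.6,          # memory interface controller
        "IOIF": 35.0 + 25.0,   # two I/O interfaces, outbound + inbound
    }
    print(round(sum(demands.values()), 1), "GB/s")   # 546.4 GB/s of aggregate element bandwidth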

6
Background: Element Interconnect Bus (EIB)
(Figure: EIB block diagram. 1.6 GHz bus clock frequency; four 16B data rings, 2 per direction.)
7
Characterization Approach
  • Approach: characterize network latency and
    throughput based on best- and worst-case network
    conditions
  • Network Latency
  • Four phases make up the end-to-end latency
  • Latency = Sending Phase + Command Phase + Data
    Phase + Receiving Phase
  • (i.e., sending latency + transport latency +
    receiving latency)
  • Sending Phase: responsible for EIB transaction
    initiation
  • Includes all processor/DMA controller activities
    prior to EIB access
  • Command Phase: sets up inter-element (end-to-end)
    communication
  • Informs the target element about the impending
    transaction so it can prepare for it, performs
    coherency checking, etc.
  • Data Phase: handles ring arbitration and data
    transport
  • The data ring segment to the destination must be
    free before access is granted
  • Receiving Phase: directs the received data to its
    final location
  • Local Store, memory, or I/O

8
Best-case Network Latency
  • SPE1 → SPE6 non-coherent DMA transaction
  • Best case for the longest EIB transfer

9
Best-case Network Latency - Sending Phase
  • Pipeline Latency: 23 CPU clock cycles
  • DMA Issue: 10 CPU cycles
  • Write SPE local store address
  • Write effective address high
  • Write effective address low
  • Write DMA size
  • Write DMA command
  • DMA Controller Processing: 20 CPU clock cycles

Sending Phase (SP) = 53 CPU cycles = 26.5 Bus cycles (summed in the sketch below)
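The sending-phase total follows directly from the cycle counts above; a minimal sketch (assuming, as the 53-to-26.5 conversion implies, that the core clock runs at twice the 1.6 GHz bus clock):

    # Best-case Sending Phase, from the per-step counts on this slide.
    pipeline  = 23            # CPU cycles: issue pipeline latency
    dma_issue = 10            # CPU cycles: five MMIO writes that queue the DMA command
    dma_ctrl  = 20            # CPU cycles: DMA controller (MFC) processing
    sending_cpu = pipeline + dma_issue + dma_ctrl   # 53 CPU cycles
    sending_bus = sending_cpu / 2                   # 26.5 bus cycles (bus clock = CPU clock / 2)
    print(sending_cpu, "CPU cycles =", sending_bus, "bus cycles")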
10
Best-case Network Latency - Command Phase
11
Best-case Network Latency - Command Phase
  • Non-coherent Command Issue: 3 Bus cycles

12
Best-case Network Latency - Command Phase
  • Command Reflection to
  • AC1 elements: 3 Bus cycles
  • AC2 elements: 4 Bus cycles
  • AC3 elements: 5 Bus cycles

13
Best-case Network Latency - Command Phase
  • Snoop Response: 13 Bus cycles

14
Best-case Network Latency - Command Phase
  • Non-coherent Snoop Combining: 3 Bus cycles

15
Best-case Network Latency - Command Phase
  • Final Snoop Response to
  • AC1 elements: 3 Bus cycles
  • AC2 elements: 4 Bus cycles
  • AC3 elements: 5 Bus cycles

16
Best-case Network Latency - Command Phase
  • Command Phase = Command Issue + Command Reflection
    + Snoop Response + Combined Snoop Response
    + Final Snoop Response

Non-coherent Command Phase = 31 Bus cycles
Additional latency for coherent transactions: 12 Bus cycles
17
Best-case Network Latency - Data Phase
18
Best-case Network Latency - Data Phase
  • Data Request to Arbiter: 2 Bus cycles
  • Data Arbitration: 2 Bus cycles
  • Data Bus Grant: 2 Bus cycles

Time of Flight: 6 Bus cycles
Transmission Time: 8 Bus cycles
Data Phase (DP) = 20 Bus cycles (summed below)
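A quick check that the data-phase steps above add up (a sketch; the 8-cycle transmission time is consistent with a 128 B transfer at 16 B per bus cycle):

    # Best-case Data Phase, in 1.6 GHz bus cycles.
    data_phase = (2 +   # data request to the arbiter
                  2 +   # data arbitration
                  2 +   # data bus grant
                  6 +   # time of flight on the ring
                  8)    # transmission time: 128 B at 16 B per bus cycle
    print(data_phase)   # 20 bus cycles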
19
Best-case Network Latency - Receiving Phase
Move data from BIU to MFC: 1 Bus cycle
Move data from MFC to LS: 1 Bus cycle
Receiving Phase (RP) = 2 Bus cycles
20
Best-case Total Latency
  • For non-coherent, SPE-to-SPE transfers

Sending Phase (SP) = 53 CPU cycles = 26.5 Bus cycles
Non-coherent Command Phase = 31 Bus cycles
Data Phase (DP) = 20 Bus cycles
Receiving Phase (RP) = 2 Bus cycles
Non-coherent Network Latency = 79.5 Bus cycles
21
Best-case Total Latency
  • For memory-coherent transfers (tallied in the sketch below)

Sending Phase (SP) = 53 CPU cycles = 26.5 Bus cycles
Non-coherent Command Phase = 31 Bus cycles
Additional latency for coherent transactions = 12 Bus cycles
Data Phase (DP) = 20 Bus cycles
Receiving Phase (RP) = 2 Bus cycles
Coherent Network Latency = 91.5 Bus cycles
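Putting the best-case phases together (a sketch using the per-phase totals above; 1 bus cycle = 1/1.6 GHz = 0.625 ns):

    # Best-case end-to-end latency, in 1.6 GHz bus cycles.
    sending, command_nc, data, receiving = 26.5, 31, 20, 2
    coherent_extra = 12

    non_coherent = sending + command_nc + data + receiving   # 79.5 bus cycles
    coherent = non_coherent + coherent_extra                 # 91.5 bus cycles

    ns = 1 / 1.6   # nanoseconds per bus cycle
    print(non_coherent, "cycles, about", round(non_coherent * ns, 1), "ns")   # ~49.7 ns
    print(coherent, "cycles, about", round(coherent * ns, 1), "ns")           # ~57.2 ns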
22
Worst-case Network Latency Estimates
  • Worst-case latency for a coherent transfer to
    memory controller
  • Assumptions
  • Maximum number of command requests ahead of
    transfer
  • Maximum number of data transactions with same
    destination as transfer
  • Only includes EIB worst-case, not PPE/SPE worst
    case
  • No cache/LS misses
  • No pipeline stalls

23
Worst-case Network Latency - Command Phase
  • Command Phase
  • There are a total of 240 request slots available
    across all other elements
  • 64 MIC slots + 16 element slots × 11 elements = 240 slots
  • Since each request occupies two slots (one at the
    sender and one at the receiver), at most 120
    outstanding transactions can be in the queue
  • One command per transaction
  • Round-robin arbitration
  • Command rate of 1 command every two bus cycles
  • 120 commands × 2 Bus cycles = 240 Bus cycles (see the sketch below)

Worst-case Command Phase = 240 Bus cycles
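The same count as a sketch:

    # Worst-case Command Phase backlog, from the slot counts on this slide.
    slots = 64 + 16 * 11        # 64 MIC slots + 16 slots on each of 11 other elements = 240
    outstanding = slots // 2    # each transaction holds a slot at sender and receiver -> 120
    worst_command = outstanding * 2   # command bus: 1 command every 2 bus cycles
    print(worst_command)        # 240 bus cycles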
24
Worst-case Network Latency - Data Phase
  • Data Phase
  • If all elements are trying to write to the MIC,
    there is a maximum of 64 transfers queued ahead
  • In this worst case, the MIC can only accept 1
    write every 8 bus cycles
  • 64 transfers × 8 bus cycles = 512 bus cycles

Worst-case Data Phase = 512 Bus cycles
25
Worst-case Total Latency
  • Worst-case coherent transfer to the memory controller (tallied below)

Sending Phase (SP) = 53 CPU cycles = 26.5 Bus cycles
Worst-case Command Phase = 240 Bus cycles
Worst-case Data Phase = 512 Bus cycles
Receiving Phase (RP) = 2 Bus cycles
Worst-case Coherent Network Latency = 780.5 Bus cycles
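A sketch of the worst-case estimate, combining the bounds from the last three slides:

    # Worst-case coherent latency estimate, in 1.6 GHz bus cycles.
    sending    = 26.5        # same sending phase as the best case
    command_wc = 120 * 2     # 240 cycles: 120 queued commands, one issued per 2 cycles
    data_wc    = 64 * 8      # 512 cycles: 64 queued writes to the MIC, one per 8 cycles
    receiving  = 2
    total = sending + command_wc + data_wc + receiving
    print(total, "bus cycles, about", round(total / 1.6, 1), "ns")   # 780.5 cycles ~ 487.8 ns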
26
Throughput
  • End-to-End Throughput Analysis
  • Network Injection/Reception Bandwidth = 307.2 GB/s
  • 12 elements
  • 32B (read and write) of data per bus cycle
  • ½ × 32B × 1.6 GHz × 12 nodes = 307.2 GB/s
  • Network Bisection Bandwidth
  • Unidirectional ring data width: 16B
  • 4 rings, two in each direction
  • 3 concurrent transfers per ring
  • 16B × 1.6 GHz × 4 rings × 3 transfers per ring
    = 307.2 GB/s (recomputed in the sketch below)

(Figure: end-to-end model showing injection bandwidth, bisection bandwidth, and reception bandwidth.)
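The same peak figures as a sketch (numbers from this slide; 16 B per element per bus cycle in each direction):

    # Peak bandwidth figures from this slide, in GB/s.
    bus_ghz, elements, ring_width_B = 1.6, 12, 16
    injection = ring_width_B * bus_ghz * elements        # one direction per element: 307.2
    reception = injection                                # symmetric: 307.2
    ring_aggregate = ring_width_B * bus_ghz * 4 * 3      # 4 rings x 3 concurrent transfers: 307.2
    print(round(injection, 1), round(reception, 1), round(ring_aggregate, 1))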
27
Cell BE EIB Throughput
  • Cell BE EIB Sustainable Effective Bandwidth
  • Command Phase Limitations (Non-coherent Transfers)
  • Max. effective bandwidth = 204.8 GB/s for
    non-coherent transfers (see the sketch below)
  • Command bus is limited to 1 request per bus cycle
  • Each request can transfer 128B
  • Command bus allows the issuing of up to one
    transaction per cycle
  • Each ring can issue one new transaction every 3
    cycles (grey cycles in the timing diagram
    indicate a ring is unavailable)
  • Command bus cannot sustain more than 8
    concurrent transactions at any given time

Sustainable Effective Bandwidth (non-coherent) = 204.8 GB/s
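The non-coherent ceiling follows from one 128 B transfer starting per bus cycle:

    # Command-phase limit for non-coherent transfers.
    bytes_per_transfer = 128
    commands_per_cycle = 1            # command bus: at most one request per bus cycle
    bus_ghz = 1.6
    print(bytes_per_transfer * commands_per_cycle * bus_ghz)   # 204.8 GB/s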
28
Cell BE EIB Throughput
  • Cell BE EIB Sustainable Effective Bandwidth
  • Command Phase Limitations (Coherent Transfers)
  • Max. effective bandwidth = 102.4 GB/s for
    coherent transfers (see below)
  • Command bus is limited to 1 coherent request per
    2 bus cycles
  • Each request transfers 128B

Sustainable Effective Network Bandwidth (coherent) = 102.4 GB/s
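Halving the command issue rate halves the ceiling:

    # Command-phase limit for coherent transfers: one 128 B request every 2 bus cycles.
    print(128 * 1.6 / 2)   # 102.4 GB/s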
29
Cell BE EIB Throughput (Best-Case)
  • Cell BE EIB Calculated vs. Measured Effective Bandwidth
  • Contention-free traffic pattern
  • Network injection bandwidth: 25.6 GB/s per element × 12 elements = 307.2 GB/s
  • Network reception bandwidth: 25.6 GB/s per element × 12 elements = 307.2 GB/s
  • Aggregate link bandwidth: 1,228.8 GB/s (4 rings, each with 12 links)
  • Peak BW_Network = 25.6 GB/s × 3 transfers per ring × 4 rings = 307.2 GB/s
  • Command bus bandwidth / BW_Bisection = 8 links = 204.8 GB/s, with γ = 1
  • ρ can, at best, reach 100% since there is no ring interference
  • BW_Network ≤ 204.8 / 1 GB/s; measured: 197 GB/s (see the sketch below)
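A sketch of how these quantities combine into the deck's effective-bandwidth bound (the exact form of the bound is an assumption here, following the min-of-limits style of reference [34]; ρ is the achievable ring utilization and γ the fraction of traffic crossing the bisection):

    # Effective-bandwidth bound for the EIB, figures in GB/s (from this slide and the next).
    def effective_bw_bound(rho, gamma=1.0,
                           injection=307.2, reception=307.2, bisection=204.8):
        return rho * min(injection, reception, bisection / gamma)

    print(effective_bw_bound(rho=1.0))   # best case:  204.8 GB/s bound, 197 GB/s measured
    print(effective_bw_bound(rho=0.5))   # worst case: 102.4 GB/s bound,  78 GB/s measured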
30
Cell BE EIB Throughput (Worst-Case)
  • Cell BE EIB Calculated vs. Measured Effective Bandwidth
  • Contention due to the traffic pattern
  • Network injection bandwidth: 25.6 GB/s per element × 12 nodes = 307.2 GB/s
  • Network reception bandwidth: 25.6 GB/s per element × 12 nodes = 307.2 GB/s
  • Aggregate link bandwidth: 1,228.8 GB/s (4 rings, each with 12 links)
  • Peak BW_Network = 25.6 GB/s × 3 transfers per ring × 4 rings = 307.2 GB/s
  • Command bus bandwidth / BW_Bisection = 8 links = 204.8 GB/s, with γ = 1
  • ρ is limited, at best, to only 50% due to ring interference
  • BW_Network ≤ 204.8 / 1 GB/s; measured: 78 GB/s
31
Software Considerations
  • The network should not be considered in isolation
    from the rest of the system, nor the system in
    isolation from the network
  • Due to the wide range of EIB performance, Cell
    BE software developers must consider the EIB's
    strengths and limitations
  • Software should plan transactions in order to
    take full advantage of the EIB
  • Traffic patterns
  • Latency can be mitigated with bandwidth (a rough
    sizing sketch follows this slide)
  • Software-managed cache, double-buffering,
    pre-fetching, etc.
  • SPE tasks should be carefully assigned
  • Context switches take at least 5 microseconds
    (about 7700 bus cycles) assuming no bus
    contention, no waiting for memory, and no
    outstanding EIB transactions
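How much data a transfer needs to keep in flight to hide the best-case latency behind bandwidth (a sketch under the assumption that one element is trying to keep its 25.6 GB/s port busy, e.g. via double-buffered DMA):

    # Bandwidth-delay product for one element port.
    latency_s = 79.5 / 1.6e9          # best-case non-coherent latency: 79.5 bus cycles ~ 50 ns
    port_bw   = 25.6e9                # bytes/s per element
    print(round(latency_s * port_bw), "bytes")   # ~1272 B: DMAs of ~1.3 KB or more hide the latency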

32
Closing Remarks
  • The design of interconnection networks is
    end-to-end
  • Injection links/interface → network fabric →
    reception links/interface
  • Topology, routing, arbitration, switching, flow
    control, and microarchitecture are among the key
    aspects in realizing high-performance designs
  • The Cell BE's EIB requires a delicate balance
  • Capable of providing high bandwidth
  • Should not overlook EIB control and end-to-end
    effects
  • Performance characteristics are highly dependent
    on how the network is used
  • Software teams cannot rely on the compiler to
    handle Cell for them (at least not yet)
  • Manual and semi-automated workload management is
    necessary
  • Future Work
  • This analysis is only a first step
  • Further analysis using simulation/hardware-based
    evaluation will give better insight and
    understanding into the EIB

33
Questions
34
Network Characterization Summary
35
Acronym List
  • CBE - Cell Broadband Engine
  • PPE - POWER Processing Element
  • SPE - Synergistic Processing Element
  • ISA - Instruction Set Architecture
  • DMA - Direct Memory Access
  • MMIO - Memory-Mapped Input/Output
  • EIB - Element Interconnect Bus
  • MFC - Memory Flow Controller
  • LS - Local Store
  • SIMD - Single Instruction Multiple Data
  • NoC - Network on Chip
  • OCN - On-chip Network
  • SO - Sending Overhead
  • TF - Time of Flight
  • TT - Transmission Time
  • RO - Receiver Overhead

36
References
  • 1 William J. Dally, Interconnect limited VLSI
    architecture, in Proceedings of International
    Interconnect Technology Conference, pp 15-17, May
    1999.
  • 2 William J. Dally and Brian Towles, Route
    Packets, Not Wires: On-Chip Interconnection
    Networks, in Proc. of the Design Automation
    Conf., pp 684-689, ACM, June 2001.
  • 3 Luca Benini and Giovanni De Micheli,
    Networks on Chips: A New SoC Paradigm, IEEE
    Computer, Volume 35, Issue 1, pp 70-80, January
    2002.
  • 4 Ahmed Hemani, Axel Jantsch, Shashi Kumar,
    Adam Postula, Johnny Oberg, Mikael Millberg, and
    Dan Lindqvist, Network on a Chip: An Architecture
    for Billion Transistor Era, in Proc. of the IEEE
    NorChip Conference, Nov. 2000.
  • 5 Marco Sgroi, M. Sheets, A. Mihal, Kurt
    Keutzer, Sharad Malik, Jan M. Rabaey, and Alberto
    L. Sangiovanni-Vincentelli, Addressing the
    system-on-a-chip interconnect woes through
    communication-based design, in Proceedings of
    Design Automation Conference, pp 667-672, June
    2001.
  • 6 IBM CoreConnect, Technical Report,
    www.chips.ibm.com/products/powerpc/cores, 2000.
  • 7 Kanishka Lahiri, Anand Raghunathan, and
    Ganesh Lakshminarayana, LotteryBus: A New
    High-Performance Communication Architecture for
    System-on-Chip Designs, in Proc. of the Design
    Automation Conf., ACM, June 2001.
  • 8 David Flynn, AMBA: Enabling Reusable On-Chip
    Designs, IEEE Micro, pp 20-27, July/August 1997.
  • 9 Rakesh Kumar, Victor Zyuban, and Dean M.
    Tullsen, Interconnections in Multi-Core
    Architectures: Understanding Mechanisms,
    Overheads, and Scaling, in Proceedings of the
    International Symposium on Computer Architecture,
    IEEE Computer Society, June 2005.
  • 10 Jian Liang, Sriram Swaminathan, and Russell
    Tessier, aSoC: A Scalable, Single-chip
    Communications Architecture, in Intl Conference
    on Parallel Architectures and Compilation
    Techniques, pp 37-46, October 2000.
  • 11 Pierre Guerrier and Alain Greiner, A
    Generic Architecture for On-Chip Packet-Switched
    Interconnections, in Proceedings of the Design
    Automation and Test in Europe, pp 250-256, March
    2000.
  • 12 Adrijean Adriahantenaina, Hervé Charlery,
    Alain Greiner, Laurent Mortiez, and Cesar Albenes
    Zeferino, SPIN: A Scalable, Packet Switched,
    On-Chip Micro-Network, in Proceedings of the
    Design, Automation and Test in Europe Conference,
    March 2003.
  • 13 Shashi Kumar, Axel Jantsch, Juha-Pekka
    Soininen, Martti Forsell, Mikael Millberg, Johnny
    Oberg, Kari Tiensyrja, and Ahmed Hemani, A
    Network on Chip Architecture and Design
    Methodology, in Proceedings of the IEEE Computer
    Society Annual Symposium on VLSI, pp 105-112,
    2002.
  • 14 Jacob Chang, Srivaths Ravi, and Anand
    Raghunathan, FLEXBAR: A Crossbar Switching
    Fabric with Improved Performance and
    Utilization, in Proceedings of the IEEE CICC, pp
    405-408, 2002.
  • 15 A. Brinkmann, J. C. Niemann, I. Hehemann, D.
    Langen, M. Porrmann, and U. Ruckert, On-chip
    Interconnects for Next Generation
    Systems-on-Chips, in Proc. of the 15th Annual
    IEEE Intl ASIC/SOC Conf., pp 211-215, Sept 2002.
  • 16 Timothy Mark Pinkston and Jeonghee Shin,
    Trends Toward On-Chip Networked Microsystems,
    in International Journal of High Performance
    Computing and Networking, Volume 3, Number 1, pp
    3-18, December 2005.
  • 17 Joan-Manuel Parcerisa, Julio Sahuquillo,
    Antonio Gonzalez, and Jose Duato, Efficient
    Interconnects for Clustered Microarchitectures,
    in Proceedings of 2002 International Conference
    on Parallel Architectures and Compilation
    Techniques, September 2002.
  • 18 Aneesh Aggarwal and Manoj Franklin,
    Hierarchical Interconnects for On-chip
    Clustering, in Proc. of Intl Parallel and
    Distributed Processing Symposium, April 2002.
  • 19 Shubhendu S. Mukherjee, Peter Bannon, Steven
    Lang, Aaron Spink, and David Webb, The Alpha
    21364 Network Architecture, in Proceedings of
    Hot Interconnects 9, pp 113-117, August 2001.

37
References (Continued)
  • 23 Doug Burger, Steve Keckler, and the TRIPS
    Project Team, Design and Implementation of the
    TRIPS EDGE Architecture, ISCA-32 Tutorial, pp
    1-239, June 2005.
  • 24 Ramadass Nagarajan, Karthikeyan
    Sankaralingam, Doug Burger, and Stephen W.
    Keckler, A design space evaluation of grid
    processor architectures, in Proceedings of the
    34th Annual International Symposium on
    Microarchitecture, pages 40-51, December 2001.
  • 25 H. Peter Hofstee, Power Efficient Processor
    Architecture and the Cell Processor, in
    Proceedings of the Eleventh Intl Symposium on
    High-Performance Computer Architecture, IEEE
    Computer Society, February 2005.
  • 26 Kevin Krewell, Sun's Niagara Pours on the
    Cores, Microprocessor Report, Vol. 18, No. 9, pp
    1-3, Sept., 2004.
  • 27 James M. Baker Jr., Brian Gold, Mark
    Bucciero, Sidney Bennett, Rajneesh Mahajan,
    Priyadarshini Ramachandran, and Jignesh Shah,
    SCMP: A Single-Chip Message-Passing Parallel
    Computer, in the Journal of Supercomputing,
    Volume 30, pp 133-149, 2004.
  • 28 Terry Tao Ye, Luca Benini, and Giovanni De
    Micheli, Packetized On-Chip Interconnect
    Communication Analysis for MPSoC, in Proceedings
    of the Design, Automation and Test in Europe
    Conference, pp 344-349, March 2003.
  • 29 Peter N. Glaskowsky, IBM raises curtain on
    Power5, Microprocessor Report, Volume 17, Issue
    10, pp 13-14, October 2003.
  • 30 Joel M. Tendler, Steve Dodson, Steve Fields,
    Hung Le, and Balaram Sinharoy, POWER4 system
    microarchitecture, IBM Journal of Research and
    Development, Volume 46, Issue 1, pp 5-26, January
    2002.
  • 31 Seon Wook Kim, Chong-Liang Ooi, Il Park,
    Rudolf Eigenmann, Babak Falsafi, and T. N.
    Vijaykumar, Multiplex: unifying conventional and
    speculative thread-level parallelism on a chip
    multiprocessor, in Proc. of Intl Conference on
    Supercomputing, pp 368-380, June 2001.
  • 32 J. A. Kahle, M. N. Day, H. P. Hofstee, C. R.
    Johns, T. R. Maeurer, and D. Shippy,
    Introduction to the Cell multiprocessor, IBM
    Journal of Research and Development, Volume 49,
    Number 4/5, 2005.
  • 33 Thomas Chen, Ram Raghavan, Jason Dale, and
    Eiji Iwata, Cell Broadband Engine Architecture
    and its first implementation: A performance view,
    29 Nov 2005,
    http://www-128.ibm.com/developerworks/power/library/pa-cellperf/.
  • 34 Timothy M. Pinkston and Jose Duato,
    Appendix E: Interconnection Networks, in
    Computer Architecture: A Quantitative Approach,
    4th Edition, by John L. Hennessy and David A.
    Patterson, pp 1-114, Elsevier Publishers, 2007.
  • 35 Mike Kistler, Michael Perrone, and Fabrizio
    Petrini, Cell Microprocessor Communication
    Network Built for Speed, IBM Austin Research
    Lab, White Paper.
  • 36 Mike Kistler, (Private Communication), IBM,
    Austin, TX, June 2006.
  • 37 Scott Clark, (Private Communication), IBM,
    Rochester, MN, December 2005.
  • 38 Scott Clark, (Private Communication), IBM,
    Rochester, MN, June/July 2006.
  • 39 David Krolak, Just like being there: Papers
    from the Fall Processor Forum 2005: Unleashing
    the Cell Broadband Engine Processor - The Element
    Interconnect Bus, November 29, 2005,
    http://www-128.ibm.com/developerworks/power/library/pa-fpfeib/.
  • 40 Michael B. Taylor. Scalar Operand Networks
    for Tiled Microprocessors, invited talk at the
    Workshop on On- and Off-chip Interconnection
    Networks, Stanford University, December 2006.
  • 41 DeveloperWorks, IBM, Meet the experts: The
    Mambo team on the IBM Full-System Simulator for
    the Cell Broadband Engine processor, 22 Nov
    2005,
    http://www-128.ibm.com/developerworks/library/pa-expert7/.