Title: On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus
1. On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus
- Thomas William Ainsworth
- Timothy Mark Pinkston
- University of Southern California
2. Outline
- Motivation and Related Work
- Background: EIB Requirements
- Latency Analysis
- Throughput Analysis
- Software Considerations
3. Motivation
- With the rise of multi-core computing, the design of on-chip interconnection networks has become an increasingly important component of computer architecture
- In designing NoCs for specific or general-purpose functions, it is important to understand the impact and limitations of various design choices in achieving this goal
- The EIB is an interesting OCN to study, as it provides higher raw network bandwidth and interconnects more end nodes than most mainstream commercial multi-core processors
4. Related Work on Characterizing NoC Performance
- Academic NoCs
  - TRIPS
    - Doug Burger, Steve Keckler, and the TRIPS Project Team, "Design and Implementation of the TRIPS EDGE Architecture," ISCA-32 Tutorial, pp. 1-239, June 2005.
  - RAW
    - Michael Bedford Taylor, Walter Lee, Saman P. Amarasinghe, and Anant Agarwal, "Scalar Operand Networks," IEEE Transactions on Parallel and Distributed Systems, Volume 16, Issue 2, February 2005.
- Commercial NoCs
  - Cell BE EIB
    - Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata, "Cell Broadband Engine Architecture and its first implementation: A performance view," 29 Nov 2005, http://www-128.ibm.com/developerworks/power/library/pa-cellperf/
    - Fabrizio Petrini, Gordon Fossum, Ana Varbanescu, Michael Perrone, Michael D. Kistler, and Juan Fernandez Peinador, "Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine," IPDPS, 2007.
5. Background: Cell BE Interconnect Requirements
- Cell BE On-Chip Network Requirements
  - Interconnect 12 elements
    - 1 PPE with 51.2 GB/s aggregate bandwidth
    - 8 SPEs, each with 51.2 GB/s aggregate bandwidth
    - MIC with 25.6 GB/s of memory bandwidth
    - 2 IOIFs with 35 GB/s (out) and 25 GB/s (in) of I/O bandwidth
  - Provide coherent and non-coherent data transfer
  - Support two transfer modes
    - DMA between SPEs
    - MMIO/DMA between PPE and system memory
6. Background: Element Interconnect Bus (EIB)
- 1.6 GHz bus clock frequency
- Four 16B data rings (2 per direction)
7. Characterization Approach
- Approach: Characterize network latency and throughput based on best- and worst-case network conditions
- Network Latency
  - Four phases make up the end-to-end latency
  - Latency = Sending Phase + Command Phase + Data Phase + Receiving Phase = Sending latency + Transport latency + Receiving latency
  - Sending Phase: responsible for EIB transaction initiation
    - Includes all processor/DMA controller activities prior to EIB access
  - Command Phase: sets up inter-element (end-to-end) communication
    - Informs the target element about the impending transaction (allowing the element to set up for it), performs coherency checking, etc.
  - Data Phase: handles ring arbitration and data transport
    - The data ring segment to the destination must be free before access is granted
  - Receiving Phase: directs the received data to its final location
    - Local Store, memory, or I/O
8. Best-case Network Latency
- SPE1 → SPE6 non-coherent DMA transaction
- Best case for the longest EIB transfer
9. Best-case Network Latency: Sending Phase
- Pipeline Latency: 23 CPU clock cycles
- DMA Issue: 10 CPU cycles
  - Write SPE local store address
  - Write effective address high
  - Write effective address low
  - Write DMA size
  - Write DMA command
- DMA Controller Processing: 20 CPU clock cycles
- Sending Phase (SP): 53 CPU cycles = 26.5 Bus cycles
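As a sanity check, the sending-phase arithmetic can be reproduced in a short Python sketch (an illustration, not part of the original analysis; it uses the slide's figures and the fact that the EIB bus clock runs at half the CPU clock):

```python
# Sending-phase components, in CPU clock cycles (figures from the slide).
pipeline_latency = 23  # SPE pipeline latency before the DMA command
dma_issue = 10         # five MMIO writes: LS address, EA high/low, size, command
dma_controller = 20    # DMA controller processing

sending_cpu_cycles = pipeline_latency + dma_issue + dma_controller  # 53 CPU cycles

# The 1.6 GHz bus clock is half the CPU clock, so convert by dividing by 2.
sending_bus_cycles = sending_cpu_cycles / 2  # 26.5 bus cycles
```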
10. Best-case Network Latency: Command Phase
11. Best-case Network Latency: Command Phase
- Non-coherent Command Issue: 3 Bus cycles
12. Best-case Network Latency: Command Phase
- Command Reflection to
  - AC1 elements: 3 Bus cycles
  - AC2 elements: 4 Bus cycles
  - AC3 elements: 5 Bus cycles
13. Best-case Network Latency: Command Phase
- Snoop Response: 13 Bus cycles
14. Best-case Network Latency: Command Phase
- Non-coherent Snoop Combining: 3 Bus cycles
15. Best-case Network Latency: Command Phase
- Final Snoop Response to
  - AC1 elements: 3 Bus cycles
  - AC2 elements: 4 Bus cycles
  - AC3 elements: 5 Bus cycles
16. Best-case Network Latency: Command Phase
- Command Issue + Command Reflection + Snoop Response + Combined Snoop Response + Final Snoop Response
- Non-coherent Command Phase: 31 Bus cycles
- Additional latency for coherent transactions: 12 Bus cycles
17. Best-case Network Latency: Data Phase
18. Best-case Network Latency: Data Phase
- Data Request to Arbiter: 2 Bus cycles
- Data Arbitration: 2 Bus cycles
- Data Bus Grant: 2 Bus cycles
- Time of Flight: 6 Bus cycles
- Transmission Time: 8 Bus cycles
- Data Phase (DP): 20 Bus cycles
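The data-phase total can be checked the same way; this sketch is illustrative only, with the component latencies taken from the slide:

```python
# Data-phase components, in bus cycles (figures from the slide).
data_request_to_arbiter = 2
data_arbitration = 2
data_bus_grant = 2
time_of_flight = 6        # ring traversal to the destination
transmission_time = 8     # a 128B transfer at 16B per bus cycle on one ring

data_phase = (data_request_to_arbiter + data_arbitration + data_bus_grant
              + time_of_flight + transmission_time)  # 20 bus cycles
```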
19. Best-case Network Latency: Receiving Phase
- Move data from BIU to MFC: 1 Bus cycle
- Move data from MFC to LS: 1 Bus cycle
- Receiving Phase (RP): 2 Bus cycles
20. Best-case Total Latency
- For non-coherent, SPE-to-SPE transfers
  - Sending Phase (SP): 53 CPU cycles = 26.5 Bus cycles
  - Non-coherent Command Phase: 31 Bus cycles
  - Data Phase (DP): 20 Bus cycles
  - Receiving Phase (RP): 2 Bus cycles
  - Non-coherent Network Latency: 79.5 Bus cycles
21. Best-case Total Latency
- For memory-coherent transfers
  - Sending Phase (SP): 53 CPU cycles = 26.5 Bus cycles
  - Non-coherent Command Phase: 31 Bus cycles
  - Additional latency for coherent transactions: 12 Bus cycles
  - Data Phase (DP): 20 Bus cycles
  - Receiving Phase (RP): 2 Bus cycles
  - Coherent Network Latency: 91.5 Bus cycles
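Both best-case totals follow directly from summing the per-phase latencies; a minimal sketch (illustrative, using the slides' figures):

```python
sending = 26.5        # Sending Phase (SP): 53 CPU cycles = 26.5 bus cycles
command = 31          # non-coherent Command Phase
coherent_extra = 12   # additional Command Phase latency for coherent transfers
data = 20             # Data Phase (DP)
receiving = 2         # Receiving Phase (RP)

non_coherent_latency = sending + command + data + receiving  # 79.5 bus cycles
coherent_latency = non_coherent_latency + coherent_extra     # 91.5 bus cycles
```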
22. Worst-case Network Latency Estimates
- Worst-case latency for a coherent transfer to the memory controller
- Assumptions
  - Maximum number of command requests ahead of the transfer
  - Maximum number of data transactions with the same destination as the transfer
  - Only includes the EIB worst case, not the PPE/SPE worst case
  - No cache/LS misses
  - No pipeline stalls
23. Worst-case Network Latency: Command Phase
- Command Phase
  - There are a total of 240 request slots available across all other elements
    - 64 MIC slots + 16 element slots x 11 elements
  - Since each request uses two slots (one each for the sender and the receiver), a total of 120 outstanding transactions could be in the queue
  - One command per transaction
  - Round-robin arbitration
  - Command rate of 1 per two clock cycles
  - 120 commands x 2 Bus cycles = 240 Bus cycles
- Worst-case Command Phase: 240 Bus cycles
24. Worst-case Network Latency: Data Phase
- Data Phase
  - If all elements are trying to write to the MIC, there is a maximum of 64 transfers
  - In this worst case, the MIC can only handle 1 write every 8 bus cycles
  - 64 transfers x 8 bus cycles = 512 bus cycles
- Worst-case Data Phase: 512 Bus cycles
25. Worst-case Total Latency
- Sending Phase (SP): 53 CPU cycles = 26.5 Bus cycles
- Worst-case Command Phase: 240 Bus cycles
- Worst-case Data Phase: 512 Bus cycles
- Receiving Phase (RP): 2 Bus cycles
- Worst-case Coherent Network Latency: 780.5 Bus cycles
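The worst-case figures can likewise be rebuilt from the stated assumptions (an arithmetic check, not a measurement):

```python
# Worst-case Command Phase: every other element's request queue is full.
request_slots = 64 + 16 * 11      # 64 MIC slots + 16 slots x 11 elements = 240
outstanding = request_slots // 2  # each request occupies 2 slots -> 120 commands
command_phase = outstanding * 2   # 1 command per 2 bus cycles -> 240 bus cycles

# Worst-case Data Phase: 64 queued transfers, 1 MIC write per 8 bus cycles.
data_phase = 64 * 8               # 512 bus cycles

sending, receiving = 26.5, 2      # best-case sending/receiving, per the slide
worst_case_latency = sending + command_phase + data_phase + receiving  # 780.5
```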
26. Throughput
- End-to-End Throughput Analysis
- Network Injection/Reception Bandwidth: 307.2 GB/s
  - 12 elements
  - 32B (read and write) data per bus cycle
  - ½ x 32B x 1.6 GHz x 12 nodes = 307.2 GB/s
- Network Bisection Bandwidth: 307.2 GB/s
  - Unidirectional ring data width: 16B
  - 4 rings, two in each direction
  - 3 concurrent transfers per ring
  - 16B x 1.6 GHz x 4 rings x 3 transfers per ring = 307.2 GB/s
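The two peak-bandwidth figures reduce to straightforward products (an illustrative sketch; GB/s is computed as bytes per bus cycle times the 1.6 GHz bus clock):

```python
BUS_GHZ = 1.6  # EIB bus clock in GHz; bytes/cycle x GHz = GB/s

# Injection/reception: each element moves 32B per bus cycle (read + write
# combined), so half of the 32B counts toward each direction.
injection_bw = 0.5 * 32 * BUS_GHZ * 12   # 12 elements -> 307.2 GB/s

# Bisection: 4 rings x 16B wide, with up to 3 concurrent transfers per ring.
bisection_bw = 16 * BUS_GHZ * 4 * 3      # 307.2 GB/s
```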
27. Cell BE EIB Throughput
- Cell BE EIB Sustainable Effective Bandwidth
- Command Phase Limitations (Non-coherent Transfers)
  - Max. effective bandwidth: 204.8 GB/s (non-coherent transfers)
  - Command bus is limited to 1 request per bus cycle
  - Each request can transfer 128B
  - Command bus allows the issuing of up to one transaction per cycle
  - Each ring can issue one new transaction every 3 cycles (grey cycles indicate a ring is unavailable)
  - Command bus cannot sustain more than 8 concurrent transactions at any given time
- Sustainable Effective Bandwidth (non-coherent): 204.8 GB/s
28. Cell BE EIB Throughput
- Cell BE EIB Sustainable Effective Bandwidth
- Command Phase Limitations (Coherent Transfers)
  - Max. effective bandwidth: 102.4 GB/s (coherent transfers)
  - Command bus is limited to 1 coherent request per 2 bus cycles
  - Each request transfers 128B
- Sustainable Effective Network Bandwidth (coherent): 102.4 GB/s
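The two command-phase bandwidth ceilings amount to a one-line calculation each; this sketch simply restates the slides' arithmetic:

```python
BUS_GHZ = 1.6            # EIB bus clock
REQUEST_BYTES = 128      # maximum payload per command

# Non-coherent: the command bus issues at most 1 request per bus cycle.
non_coherent_bw = REQUEST_BYTES * BUS_GHZ   # 204.8 GB/s

# Coherent: limited to 1 request every 2 bus cycles.
coherent_bw = REQUEST_BYTES * BUS_GHZ / 2   # 102.4 GB/s
```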
29. Cell BE EIB Throughput (Best-Case)
- Cell BE EIB Calculated vs. Measured Effective Bandwidth
- Contention-free traffic pattern
- Network injection: 25.6 GB/s per element x 12 elements = 307.2 GB/s
- Network reception: 25.6 GB/s per element x 12 elements = 307.2 GB/s
- Aggregate bandwidth (4 rings, each with 12 links): 1,228.8 GB/s
- Peak BW_Network: 25.6 GB/s x 3 transfers per ring x 4 rings = 307.2 GB/s
- BW_Bisection: 8 links = 204.8 GB/s
- Command bus bandwidth: 204.8 GB/s
- BW_Network ≤ 204.8/γ GB/s, with γ = 1
- ρ can, at best, reach 100% since there is no ring interference
- 197 GB/s (measured)
30. Cell BE EIB Throughput (Worst-Case)
- Cell BE EIB Calculated vs. Measured Effective Bandwidth
- Contention due to traffic pattern
- Network injection: 25.6 GB/s per element x 12 nodes = 307.2 GB/s
- Network reception: 25.6 GB/s per element x 12 nodes = 307.2 GB/s
- Aggregate bandwidth (4 rings, each with 12 links): 1,228.8 GB/s
- Peak BW_Network: 25.6 GB/s x 3 transfers per ring x 4 rings = 307.2 GB/s
- BW_Bisection: 8 links = 204.8 GB/s
- Command bus bandwidth: 204.8 GB/s
- BW_Network ≤ 204.8/γ GB/s, with γ = 1
- ρ limited, at best, to only 50% due to ring interference
- 78 GB/s (measured)
31. Software Considerations
- The network should not be considered in isolation from the rest of the system, nor the system in isolation from the network
- Given the EIB's wide range of performance, Cell BE software developers must consider the EIB's strengths and limitations
- Software should plan transactions to take full advantage of the EIB
  - Traffic patterns
- Latency can be mitigated through bandwidth
  - Software cache, double-buffering, pre-fetching, etc.
- SPE tasks should be carefully assigned
  - Context switches take at least 5 microseconds (7700 bus cycles), assuming no bus contention, no waiting for memory, and no outstanding EIB transactions
32. Closing Remarks
- The design of interconnection networks is end-to-end
  - Injection links/interface → network fabric → reception links/interface
  - Topology, routing, arbitration, switching, flow control, and microarchitecture are among the key aspects in realizing high-performance designs
- The Cell BE's EIB requires a delicate balance
  - Capable of providing high bandwidth
  - Should not overlook EIB control and end-to-end effects
- Performance characteristics are highly dependent on how the network is used
  - Software teams cannot rely on the compiler to handle Cell for them (at least not yet)
  - Manual and semi-automated workload management is necessary
- Future Work
  - This analysis is only a first step
  - Further analysis using simulation- and hardware-based evaluation will give better insight into and understanding of the EIB
33. Questions
34. Network Characterization Summary
35. Acronym List
- CBE - Cell Broadband Engine
- PPE - POWER Processing Element
- SPE - Synergistic Processing Element
- ISA - Instruction Set Architecture
- DMA - Direct Memory Access
- MMIO - Memory-Mapped Input/Output
- EIB - Element Interconnect Bus
- MFC - Memory Flow Controller
- MIC - Memory Interface Controller
- IOIF - Input/Output Interface
- LS - Local Store
- SIMD - Single Instruction, Multiple Data
- NoC - Network on Chip
- OCN - On-Chip Network
- SO - Sending Overhead
- TF - Time of Flight
- TT - Transmission Time
- RO - Receiver Overhead
36. References
- [1] William J. Dally, "Interconnect limited VLSI architecture," in Proceedings of the International Interconnect Technology Conference, pp. 15-17, May 1999.
- [2] William J. Dally and Brian Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," in Proc. of the Design Automation Conf., pp. 684-689, ACM, June 2001.
- [3] Luca Benini and Giovanni De Micheli, "Networks on Chip: A New SoC Paradigm," IEEE Computer, Volume 35, Issue 1, pp. 70-80, January 2002.
- [4] Ahmed Hemani, Axel Jantsch, Shashi Kumar, Adam Postula, Johnny Oberg, Mikael Millberg, and Dan Lidqvist, "Network on a Chip: An Architecture for the Billion Transistor Era," in Proc. of the IEEE NorChip Conference, Nov. 2000.
- [5] Marco Sgroi, M. Sheets, A. Mihal, Kurt Keutzer, Sharad Malik, Jan M. Rabaey, and Alberto L. Sangiovanni-Vincentelli, "Addressing the system-on-a-chip interconnect woes through communication-based design," in Proceedings of the Design Automation Conference, pp. 667-672, June 2001.
- [6] IBM CoreConnect, Technical Report, www.chips.ibm.com/products/powerpc/cores, 2000.
- [7] Kanishka Lahiri, Anand Raghunathan, and Ganesh Lakshminarayana, "LotteryBus: A New High-Performance Communication Architecture for System-on-Chip Designs," in Proc. of the Design Automation Conf., ACM, June 2001.
- [8] David Flynn, "AMBA: Enabling Reusable On-Chip Designs," IEEE Micro, pp. 20-27, July/August 1997.
- [9] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen, "Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads, and Scaling," in Proceedings of the International Symposium on Computer Architecture, IEEE Computer Society, June 2005.
- [10] Jian Liang, Sriram Swaminathan, and Russell Tessier, "aSoC: A Scalable, Single-chip Communications Architecture," in Intl Conference on Parallel Architectures and Compilation Techniques, pp. 37-46, October 2000.
- [11] Pierre Guerrier and Alain Greiner, "A Generic Architecture for On-Chip Packet-Switched Interconnections," in Proceedings of the Design Automation and Test in Europe, pp. 250-256, March 2000.
- [12] Adrijean Adriahantenaina, Herve Charlery, Alain Greiner, Laurent Mortiez, and Cesar Albenes Zeferino, "SPIN: A Scalable, Packet Switched, On-Chip Micro-Network," in Proceedings of the Design, Automation and Test in Europe Conference, March 2003.
- [13] Shashi Kumar, Axel Jantsch, Juha-Pekka Soininen, Martti Forsell, Mikael Millberg, Johnny Oberg, Kari Tiensyrja, and Ahmed Hemani, "A Network on Chip Architecture and Design Methodology," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pp. 105-112, 2002.
- [14] Jacob Chang, Srivaths Ravi, and Anand Raghunathan, "FLEXBAR: A Crossbar Switching Fabric with Improved Performance and Utilization," in Proceedings of the IEEE CICC, pp. 405-408, 2002.
- [15] A. Brinkmann, J. C. Niemann, I. Hehemann, D. Langen, M. Porrmann, and U. Ruckert, "On-chip Interconnects for Next Generation Systems-on-Chips," in Proc. of the 15th Annual IEEE Intl ASIC/SOC Conf., pp. 211-215, Sept. 2002.
- [16] Timothy Mark Pinkston and Jeonghee Shin, "Trends Toward On-Chip Networked Microsystems," in International Journal of High Performance Computing and Networking, Volume 3, Number 1, pp. 3-18, December 2005.
- [17] Joan-Manuel Parcerisa, Julio Sahuquillo, Antonio Gonzalez, and Jose Duato, "Efficient Interconnects for Clustered Microarchitectures," in Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, September 2002.
- [18] Aneesh Aggarwal and Manoj Franklin, "Hierarchical Interconnects for On-chip Clustering," in Proc. of the Intl Parallel and Distributed Processing Symposium, April 2002.
- [19] Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, and David Webb, "The Alpha 21364 Network Architecture," in Proceedings of Hot Interconnects 9, pp. 113-117, August 2001.
37. References (Continued)
- [23] Doug Burger, Steve Keckler, and the TRIPS Project Team, "Design and Implementation of the TRIPS EDGE Architecture," ISCA-32 Tutorial, pp. 1-239, June 2005.
- [24] Ramadass Nagarajan, Karthikeyan Sankaralingam, Doug Burger, and Stephen W. Keckler, "A design space evaluation of grid processor architectures," in Proceedings of the 34th Annual International Symposium on Microarchitecture, pp. 40-51, December 2001.
- [25] H. Peter Hofstee, "Power Efficient Processor Architecture and the Cell Processor," in Proceedings of the Eleventh Intl Symposium on High-Performance Computer Architecture, IEEE Computer Society, February 2005.
- [26] Kevin Krewell, "Sun's Niagara Pours on the Cores," Microprocessor Report, Vol. 18, No. 9, pp. 1-3, Sept. 2004.
- [27] James M. Baker Jr., Brian Gold, Mark Bucciero, Sidney Bennett, Rajneesh Mahajan, Priyadarshini Ramachandran, and Jignesh Shah, "SCMP: A Single-Chip Message-Passing Parallel Computer," in the Journal of Supercomputing, Volume 30, pp. 133-149, 2004.
- [28] Terry Tao Ye, Luca Benini, and Giovanni De Micheli, "Packetized On-Chip Interconnect Communication Analysis for MPSoC," in Proceedings of the Design, Automation and Test in Europe Conference, pp. 344-349, March 2003.
- [29] Peter N. Glaskowsky, "IBM raises curtain on Power5," Microprocessor Report, Volume 17, Issue 10, pp. 13-14, October 2003.
- [30] Joel M. Tendler, Steve Dodson, Steve Fields, Hung Le, and Balaram Sinharoy, "POWER4 system microarchitecture," IBM Journal of Research and Development, Volume 46, Issue 1, pp. 5-26, January 2002.
- [31] Seon Wook Kim, Chong-Liang Ooi, Il Park, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar, "Multiplex: unifying conventional and speculative thread-level parallelism on a chip multiprocessor," in Proc. of the Intl Conference on Supercomputing, pp. 368-380, June 2001.
- [32] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, "Introduction to the Cell multiprocessor," IBM Journal of Research and Development, Volume 49, Number 4/5, 2005.
- [33] Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata, "Cell Broadband Engine Architecture and its first implementation: A performance view," 29 Nov 2005, http://www-128.ibm.com/developerworks/power/library/pa-cellperf/.
- [34] Timothy M. Pinkston and Jose Duato, "Appendix E: Interconnection Networks," in Computer Architecture: A Quantitative Approach, 4th Edition, by John L. Hennessy and David A. Patterson, pp. 1-114, Elsevier Publishers, 2007.
- [35] Mike Kistler, Michael Perrone, and Fabrizio Petrini, "Cell Microprocessor Communication Network: Built for Speed," IBM Austin Research Lab, White Paper.
- [36] Mike Kistler (Private Communication), IBM, Austin, TX, June 2006.
- [37] Scott Clark (Private Communication), IBM, Rochester, MN, December 2005.
- [38] Scott Clark (Private Communication), IBM, Rochester, MN, June/July 2006.
- [39] David Krolak, "Just like being there: Papers from the Fall Processor Forum 2005: Unleashing the Cell Broadband Engine Processor: The Element Interconnect Bus," November 29, 2005, http://www-128.ibm.com/developerworks/power/library/pa-fpfeib/.
- [40] Michael B. Taylor, "Scalar Operand Networks for Tiled Microprocessors," invited talk at the Workshop on On- and Off-chip Interconnection Networks, Stanford University, December 2006.
- [41] DeveloperWorks, IBM, "Meet the experts: The Mambo team on the IBM Full-System Simulator for the Cell Broadband Engine processor," 22 Nov 2005, http://www-128.ibm.com/developerworks/library/pa-expert7/.