Title: On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus
1. On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus
- Thomas William Ainsworth
- Timothy Mark Pinkston
- University of Southern California
2. Outline
- Motivation and Related Work
- Background: EIB Requirements
- Latency Analysis
- Throughput Analysis
- Software Considerations
3. Motivation
- With the rise of multi-core computing, the design of on-chip interconnection networks has become an increasingly important component of computer architecture
- In designing NoCs for specific or general-purpose functions, it is important to understand the impact and limitations of various design choices in achieving this goal
- The EIB is an interesting OCN to study, as it provides higher raw network bandwidth and interconnects more end nodes than most mainstream commercial multi-core processors
4. Related Work on Characterizing NoC Performance
- Academic NoCs
  - TRIPS
    - Doug Burger, Steve Keckler, and the TRIPS Project Team, "Design and Implementation of the TRIPS EDGE Architecture," ISCA-32 Tutorial, pp. 1-239, June 2005.
  - RAW
    - Michael Bedford Taylor, Walter Lee, Saman P. Amarasinghe, and Anant Agarwal, "Scalar Operand Networks," IEEE Transactions on Parallel and Distributed Systems, Volume 16, Issue 2, February 2005.
- Commercial NoCs
  - Cell BE EIB
    - Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata, "Cell Broadband Engine Architecture and its first implementation: A performance view," 29 Nov 2005, http://www-128.ibm.com/developerworks/power/library/pa-cellperf/
    - Fabrizio Petrini, Gordon Fossum, Ana Varbanescu, Michael Perrone, Michael D. Kistler, and Juan Fernandez Peinador, "Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine," IPDPS, 2007.
5. Background: Cell BE Interconnect Requirements
- Cell BE On-Chip Network Requirements
  - Interconnect 12 elements
    - 1 PPE with 51.2 GB/s aggregate bandwidth
    - 8 SPEs, each with 51.2 GB/s aggregate bandwidth
    - MIC with 25.6 GB/s of memory bandwidth
    - 2 IOIFs with 35 GB/s (out) and 25 GB/s (in) of I/O bandwidth
  - Provide coherent and non-coherent data transfer
  - Support two transfer modes
    - DMA between SPEs
    - MMIO/DMA between PPE and system memory
6. Background: Element Interconnect Bus (EIB)
- 1.6 GHz bus clock frequency
- Four 16B data rings (2 per direction)
7. Characterization Approach
- Approach: Characterize network latency and throughput based on best- and worst-case network conditions
- Network Latency
  - Four phases make up the end-to-end latency
  - Latency = Sending Phase + Command Phase + Data Phase + Receiving Phase = Sending latency + Transport latency + Receiving latency
  - Sending Phase: responsible for EIB transaction initiation
    - Includes all processor/DMA controller activities prior to EIB access
  - Command Phase: sets up inter-element (end-to-end) communication
    - Informs the target element about the impending transaction (allowing the element to set up for it), performs coherency checking, etc.
  - Data Phase: handles ring arbitration and data transport
    - The data ring segment to the destination must be free before access is granted
  - Receiving Phase: directs the received data to its final location
    - Local Store, memory, or I/O
8. Best-case Network Latency
- SPE1 → SPE6 non-coherent DMA transaction
- Best case for the longest EIB transfer
9. Best-case Network Latency: Sending Phase
- Pipeline Latency: 23 CPU clock cycles
- DMA Issue: 10 CPU cycles
  - Write SPE local store address
  - Write effective address high
  - Write effective address low
  - Write DMA size
  - Write DMA command
- DMA Controller Processing: 20 CPU clock cycles
- Sending Phase (SP): 53 CPU cycles = 26.5 Bus cycles
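As a sanity check, the sending-phase arithmetic can be reproduced in a short Python sketch (an illustration, not part of the original analysis; it uses the slide's figures and the fact that the EIB bus clock runs at half the CPU clock):

```python
# Sending-phase components, in CPU clock cycles (figures from the slide).
pipeline_latency = 23  # SPE pipeline latency before the DMA command
dma_issue = 10         # five MMIO writes: LS address, EA high/low, size, command
dma_controller = 20    # DMA controller processing

sending_cpu_cycles = pipeline_latency + dma_issue + dma_controller  # 53 CPU cycles

# The 1.6 GHz bus clock is half the CPU clock, so convert by dividing by 2.
sending_bus_cycles = sending_cpu_cycles / 2  # 26.5 bus cycles
```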
10. Best-case Network Latency: Command Phase
11. Best-case Network Latency: Command Phase
- Non-coherent Command Issue: 3 Bus cycles
12. Best-case Network Latency: Command Phase
- Command Reflection to
  - AC1 elements: 3 Bus cycles
  - AC2 elements: 4 Bus cycles
  - AC3 elements: 5 Bus cycles
13. Best-case Network Latency: Command Phase
- Snoop Response: 13 Bus cycles
14. Best-case Network Latency: Command Phase
- Non-coherent Snoop Combining: 3 Bus cycles
15. Best-case Network Latency: Command Phase
- Final Snoop Response to
  - AC1 elements: 3 Bus cycles
  - AC2 elements: 4 Bus cycles
  - AC3 elements: 5 Bus cycles
16. Best-case Network Latency: Command Phase
- Command Issue + Command Reflection + Snoop Response + Combined Snoop Response + Final Snoop Response
- Non-coherent Command Phase: 31 Bus cycles
- Additional latency for coherent transactions: 12 Bus cycles
17. Best-case Network Latency: Data Phase
18. Best-case Network Latency: Data Phase
- Data Request to Arbiter: 2 Bus cycles
- Data Arbitration: 2 Bus cycles
- Data Bus Grant: 2 Bus cycles
- Time of Flight: 6 Bus cycles
- Transmission Time: 8 Bus cycles
- Data Phase (DP): 20 Bus cycles
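The data-phase total can be checked the same way; this sketch is illustrative only, with the component latencies taken from the slide:

```python
# Data-phase components, in bus cycles (figures from the slide).
data_request_to_arbiter = 2
data_arbitration = 2
data_bus_grant = 2
time_of_flight = 6        # ring traversal to the destination
transmission_time = 8     # a 128B transfer at 16B per bus cycle on one ring

data_phase = (data_request_to_arbiter + data_arbitration + data_bus_grant
              + time_of_flight + transmission_time)  # 20 bus cycles
```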
19. Best-case Network Latency: Receiving Phase
- Move data from BIU to MFC: 1 Bus cycle
- Move data from MFC to LS: 1 Bus cycle
- Receiving Phase (RP): 2 Bus cycles
20. Best-case Total Latency
- For non-coherent, SPE-to-SPE transfers
  - Sending Phase (SP): 53 CPU cycles = 26.5 Bus cycles
  - Non-coherent Command Phase: 31 Bus cycles
  - Data Phase (DP): 20 Bus cycles
  - Receiving Phase (RP): 2 Bus cycles
  - Non-coherent Network Latency: 79.5 Bus cycles
21. Best-case Total Latency
- For memory-coherent transfers
  - Sending Phase (SP): 53 CPU cycles = 26.5 Bus cycles
  - Non-coherent Command Phase: 31 Bus cycles
  - Additional latency for coherent transactions: 12 Bus cycles
  - Data Phase (DP): 20 Bus cycles
  - Receiving Phase (RP): 2 Bus cycles
  - Coherent Network Latency: 91.5 Bus cycles
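Both best-case totals follow directly from summing the per-phase latencies; a minimal sketch (illustrative, using the slides' figures):

```python
sending = 26.5        # Sending Phase (SP): 53 CPU cycles = 26.5 bus cycles
command = 31          # non-coherent Command Phase
coherent_extra = 12   # additional Command Phase latency for coherent transfers
data = 20             # Data Phase (DP)
receiving = 2         # Receiving Phase (RP)

non_coherent_latency = sending + command + data + receiving  # 79.5 bus cycles
coherent_latency = non_coherent_latency + coherent_extra     # 91.5 bus cycles
```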
22. Worst-case Network Latency Estimates
- Worst-case latency for a coherent transfer to the memory controller
- Assumptions
  - Maximum number of command requests ahead of the transfer
  - Maximum number of data transactions with the same destination as the transfer
  - Only includes the EIB worst case, not the PPE/SPE worst case
  - No cache/LS misses
  - No pipeline stalls
23. Worst-case Network Latency: Command Phase
- Command Phase
  - There are a total of 240 request slots available across all other elements
    - 64 MIC slots + 16 element slots x 11 elements
  - Since each request uses two slots (one each for the sender and the receiver), a total of 120 outstanding transactions could be in the queue
  - One command per transaction
  - Round-robin arbitration
  - Command rate of 1 per two clock cycles
  - 120 commands x 2 Bus cycles = 240 Bus cycles
- Worst-case Command Phase: 240 Bus cycles
24. Worst-case Network Latency: Data Phase
- Data Phase
  - If all elements are trying to write to the MIC, there is a maximum of 64 transfers
  - In this worst case, the MIC can only handle 1 write every 8 bus cycles
  - 64 transfers x 8 bus cycles = 512 bus cycles
- Worst-case Data Phase: 512 Bus cycles
25. Worst-case Total Latency
- Sending Phase (SP): 53 CPU cycles = 26.5 Bus cycles
- Worst-case Command Phase: 240 Bus cycles
- Worst-case Data Phase: 512 Bus cycles
- Receiving Phase (RP): 2 Bus cycles
- Worst-case Coherent Network Latency: 780.5 Bus cycles
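The worst-case figures can likewise be rebuilt from the stated assumptions (an arithmetic check, not a measurement):

```python
# Worst-case Command Phase: every other element's request queue is full.
request_slots = 64 + 16 * 11      # 64 MIC slots + 16 slots x 11 elements = 240
outstanding = request_slots // 2  # each request occupies 2 slots -> 120 commands
command_phase = outstanding * 2   # 1 command per 2 bus cycles -> 240 bus cycles

# Worst-case Data Phase: 64 queued transfers, 1 MIC write per 8 bus cycles.
data_phase = 64 * 8               # 512 bus cycles

sending, receiving = 26.5, 2      # best-case sending/receiving, per the slide
worst_case_latency = sending + command_phase + data_phase + receiving  # 780.5
```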
26. Throughput
- End-to-End Throughput Analysis
- Network Injection/Reception Bandwidth: 307.2 GB/s
  - 12 elements
  - 32B (read and write) data per bus cycle
  - ½ x 32B x 1.6 GHz x 12 nodes = 307.2 GB/s
- Network Bisection Bandwidth: 307.2 GB/s
  - Unidirectional ring data width: 16B
  - 4 rings, two in each direction
  - 3 concurrent transfers per ring
  - 16B x 1.6 GHz x 4 rings x 3 transfers per ring = 307.2 GB/s
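The two peak-bandwidth figures reduce to straightforward products (an illustrative sketch; GB/s is computed as bytes per bus cycle times the 1.6 GHz bus clock):

```python
BUS_GHZ = 1.6  # EIB bus clock in GHz; bytes/cycle x GHz = GB/s

# Injection/reception: each element moves 32B per bus cycle (read + write
# combined), so half of the 32B counts toward each direction.
injection_bw = 0.5 * 32 * BUS_GHZ * 12   # 12 elements -> 307.2 GB/s

# Bisection: 4 rings x 16B wide, with up to 3 concurrent transfers per ring.
bisection_bw = 16 * BUS_GHZ * 4 * 3      # 307.2 GB/s
```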
27. Cell BE EIB Throughput
- Cell BE EIB Sustainable Effective Bandwidth
- Command Phase Limitations (Non-coherent Transfers)
  - Max. effective bandwidth: 204.8 GB/s (non-coherent transfers)
  - Command bus is limited to 1 request per bus cycle
  - Each request can transfer 128B
  - Command bus allows the issuing of up to one transaction per cycle
  - Each ring can issue one new transaction every 3 cycles (grey cycles indicate a ring is unavailable)
  - Command bus cannot sustain more than 8 concurrent transactions at any given time
- Sustainable Effective Bandwidth (non-coherent): 204.8 GB/s
28. Cell BE EIB Throughput
- Cell BE EIB Sustainable Effective Bandwidth
- Command Phase Limitations (Coherent Transfers)
  - Max. effective bandwidth: 102.4 GB/s (coherent transfers)
  - Command bus is limited to 1 coherent request per 2 bus cycles
  - Each request transfers 128B
- Sustainable Effective Network Bandwidth (coherent): 102.4 GB/s
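The two command-phase bandwidth ceilings amount to a one-line calculation each; this sketch simply restates the slides' arithmetic:

```python
BUS_GHZ = 1.6            # EIB bus clock
REQUEST_BYTES = 128      # maximum payload per command

# Non-coherent: the command bus issues at most 1 request per bus cycle.
non_coherent_bw = REQUEST_BYTES * BUS_GHZ   # 204.8 GB/s

# Coherent: limited to 1 request every 2 bus cycles.
coherent_bw = REQUEST_BYTES * BUS_GHZ / 2   # 102.4 GB/s
```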
29. Cell BE EIB Throughput (Best-Case)
- Cell BE EIB Calculated vs. Measured Effective Bandwidth
- Contention-free traffic pattern
- Network injection: 25.6 GB/s per element x 12 elements = 307.2 GB/s
- Network reception: 25.6 GB/s per element x 12 elements = 307.2 GB/s
- Aggregate bandwidth (4 rings, each with 12 links): 1,228.8 GB/s
- Peak BW_Network: 25.6 GB/s x 3 transfers per ring x 4 rings = 307.2 GB/s
- BW_Bisection: 8 links = 204.8 GB/s
- Command bus bandwidth: 204.8 GB/s
- BW_Network ≤ 204.8/γ GB/s, with γ = 1
- ρ can, at best, reach 100% since there is no ring interference
- 197 GB/s (measured)
30. Cell BE EIB Throughput (Worst-Case)
- Cell BE EIB Calculated vs. Measured Effective Bandwidth
- Contention due to traffic pattern
- Network injection: 25.6 GB/s per element x 12 nodes = 307.2 GB/s
- Network reception: 25.6 GB/s per element x 12 nodes = 307.2 GB/s
- Aggregate bandwidth (4 rings, each with 12 links): 1,228.8 GB/s
- Peak BW_Network: 25.6 GB/s x 3 transfers per ring x 4 rings = 307.2 GB/s
- BW_Bisection: 8 links = 204.8 GB/s
- Command bus bandwidth: 204.8 GB/s
- BW_Network ≤ 204.8/γ GB/s, with γ = 1
- ρ limited, at best, to only 50% due to ring interference
- 78 GB/s (measured)
31. Software Considerations
- The network should not be considered in isolation from the rest of the system, nor the system in isolation from the network
- Given the EIB's wide range of performance, Cell BE software developers must consider the EIB's strengths and limitations
- Software should plan transactions to take full advantage of the EIB
  - Traffic patterns
- Latency can be mitigated through bandwidth
  - Software cache, double-buffering, pre-fetching, etc.
- SPE tasks should be carefully assigned
  - Context switches take at least 5 microseconds (7700 bus cycles), assuming no bus contention, no waiting for memory, and no outstanding EIB transactions
32. Closing Remarks
- The design of interconnection networks is end-to-end
  - Injection links/interface → network fabric → reception links/interface
  - Topology, routing, arbitration, switching, flow control, and microarchitecture are among the key aspects in realizing high-performance designs
- The Cell BE's EIB requires a delicate balance
  - Capable of providing high bandwidth
  - Should not overlook EIB control and end-to-end effects
- Performance characteristics are highly dependent on how the network is used
  - Software teams cannot rely on the compiler to handle Cell for them (at least not yet)
  - Manual and semi-automated workload management is necessary
- Future Work
  - This analysis is only a first step
  - Further analysis using simulation- and hardware-based evaluation will give better insight into and understanding of the EIB
33. Questions
34. Network Characterization Summary
35. Acronym List
- CBE - Cell Broadband Engine
- PPE - POWER Processing Element
- SPE - Synergistic Processing Element
- ISA - Instruction Set Architecture
- DMA - Direct Memory Access
- MMIO - Memory-Mapped Input/Output
- EIB - Element Interconnect Bus
- MFC - Memory Flow Controller
- MIC - Memory Interface Controller
- IOIF - Input/Output Interface
- LS - Local Store
- SIMD - Single Instruction, Multiple Data
- NoC - Network on Chip
- OCN - On-Chip Network
- SO - Sending Overhead
- TF - Time of Flight
- TT - Transmission Time
- RO - Receiver Overhead
36. References
- [1] William J. Dally, "Interconnect limited VLSI architecture," in Proceedings of the International Interconnect Technology Conference, pp. 15-17, May 1999.
- [2] William J. Dally and Brian Towles, "Route Packets, Not Wires: On-Chip Interconnection Networks," in Proc. of the Design Automation Conf., pp. 684-689, ACM, June 2001.
- [3] Luca Benini and Giovanni De Micheli, "Networks on Chip: A New SoC Paradigm," IEEE Computer, Volume 35, Issue 1, pp. 70-80, January 2002.
- [4] Ahmed Hemani, Axel Jantsch, Shashi Kumar, Adam Postula, Johnny Oberg, Mikael Millberg, and Dan Lidqvist, "Network on a Chip: An Architecture for the Billion Transistor Era," in Proc. of the IEEE NorChip Conference, Nov. 2000.
- [5] Marco Sgroi, M. Sheets, A. Mihal, Kurt Keutzer, Sharad Malik, Jan M. Rabaey, and Alberto L. Sangiovanni-Vincentelli, "Addressing the system-on-a-chip interconnect woes through communication-based design," in Proceedings of the Design Automation Conference, pp. 667-672, June 2001.
- [6] IBM CoreConnect, Technical Report, www.chips.ibm.com/products/powerpc/cores, 2000.
- [7] Kanishka Lahiri, Anand Raghunathan, and Ganesh Lakshminarayana, "LotteryBus: A New High-Performance Communication Architecture for System-on-Chip Designs," in Proc. of the Design Automation Conf., ACM, June 2001.
- [8] David Flynn, "AMBA: Enabling Reusable On-Chip Designs," IEEE Micro, pp. 20-27, July/August 1997.
- [9] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen, "Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads, and Scaling," in Proceedings of the International Symposium on Computer Architecture, IEEE Computer Society, June 2005.
- [10] Jian Liang, Sriram Swaminathan, and Russell Tessier, "aSoC: A Scalable, Single-chip Communications Architecture," in Intl Conference on Parallel Architectures and Compilation Techniques, pp. 37-46, October 2000.
- [11] Pierre Guerrier and Alain Greiner, "A Generic Architecture for On-Chip Packet-Switched Interconnections," in Proceedings of the Design Automation and Test in Europe, pp. 250-256, March 2000.
- [12] Adrijean Adriahantenaina, Herve Charlery, Alain Greiner, Laurent Mortiez, and Cesar Albenes Zeferino, "SPIN: A Scalable, Packet Switched, On-Chip Micro-Network," in Proceedings of the Design, Automation and Test in Europe Conference, March 2003.
- [13] Shashi Kumar, Axel Jantsch, Juha-Pekka Soininen, Martti Forsell, Mikael Millberg, Johnny Oberg, Kari Tiensyrja, and Ahmed Hemani, "A Network on Chip Architecture and Design Methodology," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pp. 105-112, 2002.
- [14] Jacob Chang, Srivaths Ravi, and Anand Raghunathan, "FLEXBAR: A Crossbar Switching Fabric with Improved Performance and Utilization," in Proceedings of the IEEE CICC, pp. 405-408, 2002.
- [15] A. Brinkmann, J. C. Niemann, I. Hehemann, D. Langen, M. Porrmann, and U. Ruckert, "On-chip Interconnects for Next Generation Systems-on-Chips," in Proc. of the 15th Annual IEEE Intl ASIC/SOC Conf., pp. 211-215, Sept. 2002.
- [16] Timothy Mark Pinkston and Jeonghee Shin, "Trends Toward On-Chip Networked Microsystems," in International Journal of High Performance Computing and Networking, Volume 3, Number 1, pp. 3-18, December 2005.
- [17] Joan-Manuel Parcerisa, Julio Sahuquillo, Antonio Gonzalez, and Jose Duato, "Efficient Interconnects for Clustered Microarchitectures," in Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques, September 2002.
- [18] Aneesh Aggarwal and Manoj Franklin, "Hierarchical Interconnects for On-chip Clustering," in Proc. of the Intl Parallel and Distributed Processing Symposium, April 2002.
- [19] Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, and David Webb, "The Alpha 21364 Network Architecture," in Proceedings of Hot Interconnects 9, pp. 113-117, August 2001.
37. References (Continued)
- [23] Doug Burger, Steve Keckler, and the TRIPS Project Team, "Design and Implementation of the TRIPS EDGE Architecture," ISCA-32 Tutorial, pp. 1-239, June 2005.
- [24] Ramadass Nagarajan, Karthikeyan Sankaralingam, Doug Burger, and Stephen W. Keckler, "A design space evaluation of grid processor architectures," in Proceedings of the 34th Annual International Symposium on Microarchitecture, pp. 40-51, December 2001.
- [25] H. Peter Hofstee, "Power Efficient Processor Architecture and the Cell Processor," in Proceedings of the Eleventh Intl Symposium on High-Performance Computer Architecture, IEEE Computer Society, February 2005.
- [26] Kevin Krewell, "Sun's Niagara Pours on the Cores," Microprocessor Report, Vol. 18, No. 9, pp. 1-3, Sept. 2004.
- [27] James M. Baker Jr., Brian Gold, Mark Bucciero, Sidney Bennett, Rajneesh Mahajan, Priyadarshini Ramachandran, and Jignesh Shah, "SCMP: A Single-Chip Message-Passing Parallel Computer," in the Journal of Supercomputing, Volume 30, pp. 133-149, 2004.
- [28] Terry Tao Ye, Luca Benini, and Giovanni De Micheli, "Packetized On-Chip Interconnect Communication Analysis for MPSoC," in Proceedings of the Design, Automation and Test in Europe Conference, pp. 344-349, March 2003.
- [29] Peter N. Glaskowsky, "IBM raises curtain on Power5," Microprocessor Report, Volume 17, Issue 10, pp. 13-14, October 2003.
- [30] Joel M. Tendler, Steve Dodson, Steve Fields, Hung Le, and Balaram Sinharoy, "POWER4 system microarchitecture," IBM Journal of Research and Development, Volume 46, Issue 1, pp. 5-26, January 2002.
- [31] Seon Wook Kim, Chong-Liang Ooi, Il Park, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar, "Multiplex: unifying conventional and speculative thread-level parallelism on a chip multiprocessor," in Proc. of the Intl Conference on Supercomputing, pp. 368-380, June 2001.
- [32] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy, "Introduction to the Cell multiprocessor," IBM Journal of Research and Development, Volume 49, Number 4/5, 2005.
- [33] Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata, "Cell Broadband Engine Architecture and its first implementation: A performance view," 29 Nov 2005, http://www-128.ibm.com/developerworks/power/library/pa-cellperf/.
- [34] Timothy M. Pinkston and Jose Duato, "Appendix E: Interconnection Networks," in Computer Architecture: A Quantitative Approach, 4th Edition, by John L. Hennessy and David A. Patterson, pp. 1-114, Elsevier Publishers, 2007.
- [35] Mike Kistler, Michael Perrone, and Fabrizio Petrini, "Cell Microprocessor Communication Network: Built for Speed," IBM Austin Research Lab, White Paper.
- [36] Mike Kistler (Private Communication), IBM, Austin, TX, June 2006.
- [37] Scott Clark (Private Communication), IBM, Rochester, MN, December 2005.
- [38] Scott Clark (Private Communication), IBM, Rochester, MN, June/July 2006.
- [39] David Krolak, "Just like being there: Papers from the Fall Processor Forum 2005: Unleashing the Cell Broadband Engine Processor: The Element Interconnect Bus," November 29, 2005, http://www-128.ibm.com/developerworks/power/library/pa-fpfeib/.
- [40] Michael B. Taylor, "Scalar Operand Networks for Tiled Microprocessors," invited talk at the Workshop on On- and Off-chip Interconnection Networks, Stanford University, December 2006.
- [41] DeveloperWorks, IBM, "Meet the experts: The Mambo team on the IBM Full-System Simulator for the Cell Broadband Engine processor," 22 Nov 2005, http://www-128.ibm.com/developerworks/library/pa-expert7/.