Title: Evaluation of On-Chip Interconnect Architectures for Multi-Core DSP
Students: Haim Assor, Horesh Ben Shitrit
Project Number: p-2006-092
Supervisors: Dr. Shlomo Greenberg, Mr. Ori Goren,
Mr. Norman Goldstein
1. Introduction: The Need for Multi-Core
Chip manufacturers now build multiple processing
units into a single integrated chip. Using several
energy-efficient cores instead of one powerful core
reduces power consumption while increasing
performance. Each core does not necessarily run as
fast as the highest-performing single-core module,
but the multi-core architecture improves overall
performance by exploiting parallelism.
Multi-core systems allow parallel processing on a
chip using many small processors, or simply allow
communication processors to handle more data
streams, such as communication channels. Letting
several cores share the same resources causes new
communication problems, such as resource
unavailability and maintaining coherency. Using the
classical shared bus to connect the components
inside the chip creates a major bottleneck in
today's multi-core systems: only one transaction at
a time can use the bus, and all other components
that wish to use the bus have to wait for it to be
free.
[Figure: the need for multi-core. Labels: Performance
through Parallelism; More Data Transfer; More Data
Processing; Increased Algorithmic Complexity; High
Energy Consumption; Multi-Core System.]
Project Goal
In this project we address the connectivity problem
in a multi-core DSP. We explore and model new
interconnect architectures intended to replace the
classical shared bus. Our goal is to analyze these
architectures and give a quantitative evaluation of
multi-core performance with each architecture.
2. Shared Bus
- The shared bus is the classical way to connect
  components inside the chip. Components are divided
  into masters, which can initiate a transaction (for
  example cores, DMA, and peripherals such as Ethernet
  controllers), and slaves, which can only reply (such
  as memories).
- The 3-bus architecture (address, data and control)
  is a shared transmission medium. A signal
  transmitted by one device is available for reception
  by all other devices attached to the bus. Only one
  device can successfully transmit at a time, so an
  arbitration mechanism is needed to control the
  transfers.
- In a multi-core system all cores connect to the same
  shared bus, which places an extra burden on the bus.
- Shared bus main features:
  - Simplicity (small design and verification effort).
  - Blocking: only one transaction at a time.
  - The transfer approach contains address, data and
    control.
  - Preemptive (can stall a low-priority transaction
    when a higher-priority one arrives).
  - Some buses can handle out-of-order transactions
    (out-of-order is a feature for handling data that
    is returned in a different order from the one in
    which it was requested).
3. Fabric
- In the fabric architecture every master has a
  dedicated bus to each of the slaves; each bus has
  full bandwidth and supports every feature of the
  main bus.
- The fabric interconnect enables concurrent
  transactions between different components.
- At the entry to each slave device there is an
  arbiter, which decides which transaction takes place
  at a given moment.
- The fabric architecture enables high performance but
  is expensive in terms of area.
- Fabric main features:
  - Connects multi-master, multi-slave systems and is
    therefore more complex than the shared bus (design
    and verification effort).
  - Non-blocking: many concurrent transactions.
  - The transfer approach contains address, data and
    control, similar to the shared bus.
  - Preemptive.
  - Like the shared bus, it can handle out-of-order
    transactions, but this feature increases design
    effort and should be taken into consideration.
  - Enables memory bank interleaving.
4. Network on Chip (NoC)
- The network on chip uses a packet-switched approach
  to transfer data between chip components. The NoC is
  based on computer networking. Each packet contains
  the destination address, the data and the other
  control fields needed for a correct transfer.
  Transactions that move through the network are out
  of order: packets from different initiators can mix
  on the network, and re-order buffers ensure proper
  ordering at the target.
- The NoC is built from identical routers, which form
  a homogeneous, scalable network that can grow
  easily.
- NoC main features:
  - The NoC is based on routers, which are considered
    simple; their major advantage is that a network
    can be built easily from identical components.
  - The disadvantage of the router is the need for a
    re-order buffer. This is necessary because network
    transactions are out of order by nature. Re-order
    buffers are complicated and require a large area,
    and therefore increase design effort.
  - The transfer approach is packet based (each packet
    contains the address, data and control
    information).
  - Non-preemptive.
  - Semi non-blocking: many transactions at a time,
    but a transaction can be stalled.
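The difference between the blocking shared bus and the non-blocking fabric can be illustrated with a minimal cycle-counting sketch. It is not the PANAMA model used in the project. The function `simulate`, the flag `per_slave_arbiter`, the FIFO arbitration policy, the one-cycle transaction length and the slave names are all assumptions made for illustration; priorities, preemption and out-of-order handling are left out.

```python
from collections import deque


def simulate(requests, per_slave_arbiter):
    """Count the cycles needed to serve all requests.

    requests: list of (master, slave) pairs. Every request is assumed to
        be pending from cycle 0 and to occupy its path for one cycle.
    per_slave_arbiter: False models a shared bus (one grant per cycle for
        the whole chip); True models a fabric (one grant per cycle at
        each slave, with each master issuing at most one transaction
        per cycle).
    """
    # One FIFO of waiting masters per slave (hypothetical FIFO arbitration).
    queues = {}
    for master, slave in requests:
        queues.setdefault(slave, deque()).append(master)

    cycles = 0
    while any(queues.values()):
        cycles += 1
        busy_masters = set()
        for waiting in queues.values():
            # Grant the head of this slave's queue unless that master
            # was already granted another slave in this cycle.
            if waiting and waiting[0] not in busy_masters:
                busy_masters.add(waiting.popleft())
                if not per_slave_arbiter:
                    break  # shared bus: only one transaction per cycle
    return cycles


# Four cores; cores 0 and 3 target the same memory "M0".
requests = [(0, "M0"), (1, "M1"), (2, "M2"), (3, "M0")]
bus_cycles = simulate(requests, per_slave_arbiter=False)
fabric_cycles = simulate(requests, per_slave_arbiter=True)
```

Under these assumptions the shared bus serializes all four requests, while the fabric serves different slaves in parallel and only the two requests to "M0" wait for each other.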
5. Modeling
To examine the performance of the different
interconnects, we modeled typical systems that
contain the same components but use different
interconnect solutions. We used the PANAMA tool, a
SystemC-based software package for modeling
components and transactions. PANAMA helps architects
model the behavior of a complete chip even before
the RTL stage. The diagrams represent the modeled
systems. Each system simulates a real multi-core
chip that contains 4 cores, memories, DMA and
peripherals. Traces of several typical applications
were executed on these systems. Besides the
classical shared bus, we modeled a split shared bus,
which is a common way to overcome some of the shared
bus limitations. We also modeled a specific fabric
that is used in Freescale's chips, and a split
fabric to make the comparison complete. Results
comparing these 4 interconnect architectures are
shown at the bottom; it is clear that the fabric
outperforms the shared bus in this type of system.
[Figure labels: Shared Bus, Fabric, NoC]
6. Asymptotic comparison
The table shows a theoretical comparison of
multi-core interconnect solutions at the asymptotic
limit, for n on the order of 100 or more (n is the
number of cores in the system). The cost and
performance advantages of the NoC over the other
interconnect solutions are clear, but the table is
misleading: today's most advanced chips have only 2
to 8 cores, and the NoC suffers from a large
overhead that does not appear in the table (the
coefficients of the cost and performance functions
are dropped). The advantages of the NoC will
therefore come to fruition only in future technology
generations, when the number of cores increases.
[Table columns: Architecture | Total area | Power
dissipation | Operating frequency]
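The remark in the asymptotic comparison about dropped coefficients can be illustrated with a small sketch. The cost functions and every constant below are hypothetical and are not taken from the table or from the project's results. The sketch assumes a fabric whose cost grows as n squared (one dedicated bus per master-slave pair, with the number of slaves growing with n) and a NoC whose cost grows linearly in n but carries a large per-router coefficient and a fixed overhead.

```python
def fabric_cost(n, per_link=1.0):
    # Hypothetical model: one dedicated link per master-slave pair,
    # so the cost grows as n squared.
    return per_link * n * n


def noc_cost(n, per_router=20.0, fixed_overhead=100.0):
    # Hypothetical model: cost linear in n, but with a large
    # coefficient and a fixed overhead that asymptotic notation hides.
    return per_router * n + fixed_overhead


def crossover(max_n=1000):
    """Return the smallest core count at which the NoC is cheaper, or None."""
    for n in range(1, max_n + 1):
        if noc_cost(n) < fabric_cost(n):
            return n
    return None


crossover_n = crossover()
```

With these made-up constants the NoC is more expensive than the fabric for a handful of cores and becomes cheaper only at a few dozen cores. The actual crossover point depends entirely on the real coefficients, which is the quantity the project aims to measure.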
7. Conclusions and future research
It is clear that in the long run, as technology
progresses, chips will become more complicated and
will contain many cores on the same die. In that
scenario we assume that chip manufacturers will
choose the NoC as their interconnect solution.
Meanwhile, the most complicated chips contain only
several computing units (cores), and the advantages
of the NoC are not yet obvious; furthermore, the NoC
has many limitations when used with a small number
of cores. Initial results show that for multi-core
chips containing several cores, the best
interconnect solution is the fabric. Our purpose is
to produce quantitative conclusions that will help
in choosing the best interconnect solution according
to the number of cores in the system.