Single-Chip Multiprocessor - PowerPoint PPT Presentation

Slides: 21

Provided by: Nirm84

Learn more at: https://www.eng.auburn.edu

Transcript and Presenter's Notes

1
Single-Chip Multiprocessor

2
Case for single chip multiprocessors

Advances in integrated-circuit processing technology:
- Gate density (more transistors per chip)
- Cost of wires
Studies done at Stanford University during the late 90s argued that the CMP (single-chip multiprocessor) is better than competing technologies.

3
Parallelism

Parallelism has become a necessity for improving performance.
Parallelism was made possible by dynamic scheduling, multiple instruction issue, speculative execution, non-blocking caches, etc. (late 90s).
Parallelism classifications:
- Instruction level
- Loop level
- Thread level (future trend)
- Process level (future trend)

4
Loop level parallelism

To increase the amount of parallelism, exploit parallelism among the iterations of a loop.
LLP is the ILP that results from data-independent loop iterations.
It requires that there be no circular dependencies between iterations. Such dependencies can sometimes be avoided as well, using loop unrolling (beyond the scope of this lecture).
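As a minimal sketch of the distinction above (the function names and data are illustrative and not from the lecture): the first loop has data-independent iterations, so they may run in any order or concurrently, while the second carries a value from one iteration to the next and must run in order. Python threads are used only to show that reordering is legal here, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_independent(a, k):
    """Each iteration reads only a[i]: no dependency between iterations."""
    return [k * x for x in a]


def scale_parallel(a, k, workers=4):
    """The same loop, with iterations handed to a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if work runs out of order.
        return list(pool.map(lambda x: k * x, a))


def prefix_sum(a):
    """Iteration i needs the total from iteration i-1 (loop-carried dependency)."""
    out = []
    total = 0
    for x in a:
        total += x
        out.append(total)
    return out


data = [1, 2, 3, 4, 5]
print(scale_independent(data, 2))
print(scale_parallel(data, 2))
print(prefix_sum(data))
```

The first two calls produce the same list, which is what makes the loop a candidate for LLP; the running total in `prefix_sum` cannot be split across workers in this naive form.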

5
LLP (Loop Level Parallelism)

6
Competing technology - Superscalar

7
Competing technology - Superscalar
8
Wide issue superscalar
9
Fetch Phase

3 phases: Fetch, Issue, Execution.
Bottlenecks: Issue and Execution phases.
Fetch phase: provide a large and accurate window of decoded instructions. 3 issues: instruction misalignment, cache misses, mispredicted branches.
- Misprediction reduced to under 5% using the branch predictor designed by McFarling.
- Instruction misalignment reduced to under 3% by dividing the cache into banks (Conte).
- Rosenblum et al. show that 60% of the latency caused by cache misses can be hidden.
S. McFarling, Combining branch predictors, WRL Technical Note TN-36, Digital Equipment Corporation, 1993.
M. Rosenblum, E. Bugnion, S. Herrod, E. Witchel, and A. Gupta, The impact of architectural trends on operating system performance, Proceedings of the 15th ACM Symposium on Operating Systems Principles, Colorado, December 1995.

10
Issue phase

Issue phase: register renaming.
2 techniques for register renaming:
- Use a table to map architectural registers to physical registers. Ports required = operands per instruction × instruction window size.
- Use a reorder buffer. Comparators are required to find which physical register should provide data to which instruction. A large number of comparators is required.
In the HP PA-8000, 20% of the die area is occupied by comparators.
The instruction queue grows quadratically with issue width.
The queue uses broadcast to connect to the registers, which increases the number of wires used and therefore the delay and cost.
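The table-based renaming technique can be sketched as follows. This is a toy model under stated assumptions, not the design of any particular processor: the class name, register names (`r0`, `p0`, ...) and sizes are hypothetical, and physical registers are never returned to the free list because retirement is not modeled.

```python
class RenameTable:
    """Maps architectural registers to physical registers (illustrative only)."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register i lives in physical register i.
        self.table = {f"r{i}": f"p{i}" for i in range(num_arch)}
        # The remaining physical registers are free for new destinations.
        self.free = [f"p{i}" for i in range(num_arch, num_phys)]

    def rename(self, dest, srcs):
        # Sources are looked up first so they see the previous mapping.
        phys_srcs = [self.table[s] for s in srcs]
        # The destination gets a fresh physical register, which removes
        # write-after-write and write-after-read hazards on `dest`.
        phys_dest = self.free.pop(0)
        self.table[dest] = phys_dest
        return phys_dest, phys_srcs


rt = RenameTable(num_arch=4, num_phys=8)
program = [("r1", ["r2", "r3"]),   # r1 = r2 op r3
           ("r2", ["r1", "r3"]),   # r2 = r1 op r3
           ("r1", ["r2", "r0"])]   # r1 = r2 op r0
renamed = [rt.rename(d, s) for d, s in program]
for dest, srcs in renamed:
    print(dest, srcs)
```

Each source lookup and each destination update is one table access, which is why the number of table ports has to grow with the number of instructions renamed together.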

11
Execute phase

The execution phase has similar issues.
An increase in issue width increases the number of renamed registers, leading to a quadratic increase in register file complexity.
An increase in the number of execution units causes a quadratic increase in the complexity of the bypass logic.
Bottleneck: interconnect delay between execution units.
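To make the quadratic growth of the bypass logic concrete, the sketch below counts forwarding paths under a simple assumed model that is not taken from the lecture: every execution unit's result can be forwarded to every operand input of every unit, and each unit has two operand inputs.

```python
def bypass_paths(num_units, inputs_per_unit=2):
    """Forwarding paths if each unit's output reaches every input of every unit."""
    return num_units * num_units * inputs_per_unit


for n in (2, 4, 8):
    print(n, bypass_paths(n))
```

Under this model, doubling the number of execution units quadruples the number of paths.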

12
Architecture - Superscalar
13
Competing technologies - Simultaneous Multithreading

The simultaneous multithreading (SMT) architecture is similar to that of the superscalar.
SMT processors extend a wide superscalar processor with hardware to execute instructions from multiple threads concurrently.
Provides latency tolerance.
Reduces to a conventional wide-issue superscalar when multiple threads are not available.
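A toy model of how SMT fills issue slots is sketched below. It assumes, purely for illustration, that each thread offers a fixed number of ready instructions per cycle and that slots are handed out to threads in order; the function name and the numbers are hypothetical.

```python
def issued_per_cycle(issue_width, ready_per_thread):
    """Fill issue slots from each thread in turn until the slots run out."""
    slots_left = issue_width
    issued = []
    for ready in ready_per_thread:
        take = min(ready, slots_left)
        issued.append(take)
        slots_left -= take
    return issued


# One thread with 2 ready instructions on an 8-wide machine: 6 slots stay idle.
print(issued_per_cycle(8, [2]))
# Four such threads together fill all 8 slots.
print(issued_per_cycle(8, [2, 2, 2, 2]))
```

With a single thread the model behaves like an ordinary wide-issue superscalar, limited by that thread's own instruction-level parallelism.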

14
Competing technologies - Simultaneous Multithreading

15
Centralized architecture

Disadvantages of centralized architectures such as SMT and superscalar:
- Area increases quadratically with the core's complexity.
- Cycle time increases because of interconnect delays. Wire delay dominates the delay of the CPU's critical path. It is possible to build simpler clusters, but this results in a deeper pipeline and a higher branch misprediction penalty.
- Design verification cost is high because of the complexity of a single large processor.
- Large demand on the memory system.

16
Single Chip multiprocessor

The motivation for a decentralized architecture comes from the disadvantages of the competing technologies.
Simple individual processors and a high clock rate.
Low interconnect latency.
Exploits thread-level and process-level parallelism.

17
Single chip Multiprocessor architecture
18
Performance comparison

Examples: the 8-core Cell processor in the PS3 and the 3-core Xenon processor in the Xbox 360.
Performance chart: run for different benchmark programs.

19
Summary (CMP)

CMP (chip-level multiprocessor) provides superior performance with simpler hardware.
- No parallelism: superscalar performance is 30% better than CMP.
- Fine-grained thread-level parallelism: superscalar is 10% better in performance.
- Coarse-grained thread-level parallelism: CMP is 50-100% better than superscalar.
Disadvantage: slow when there is no multithreading; matching development of software is required.

20
Reference

K Olukotun, BA Nayfeh, L Hammond, K Wilson, K, The case for a single-chip multiprocessor, ACM SIGPLAN Notices, 1996.
L Hammond, BA Nayfeh, K Olukotun, A Single-Chip Multiprocessor, IEEE, Sept 1997.
Wikipedia, Superscalar, April 2008. http://en.wikipedia.org/wiki/Superscalar
Wikipedia, Multi-core, April 2008. http://en.wikipedia.org/wiki/Multi-core_computing