Complex multiprocessor architectures presentation

Title: Complex multiprocessor architectures

1
Chapter 9 Complex multiprocessor architectures

Outline
Introduction: limitations of simple multiprocessors
Application domain characteristics: multi-window TV
Top-level architecture
Architecture of the signal processing subsystem

2
Example: DVP platform
3
Discussion

Architecture
  the limit of scalability of busses is reached
  communication via central memory doubles the required bandwidth (each item is written and then read back)
Programming (software) issues
  synchronisation via a central CPU leads to coarse-grain tasks
  scheduling of central resources (bus and memory) is difficult and time-consuming, especially for real-time applications
VLSI (hardware) issues
  interconnect delay dominates gate delay, which leads to multi-hop communication
  clocking

4
From busses to concentrators to ...
5
Networks-on-Silicon
Hierarchical architecture with clusters of processors and memories, autonomously operating and cooperating via an on-chip network with routers, segmented interconnect and distributed memories
[Figure: clusters C1 to C16; labels: external SDRAM, embedded cores, embedded memories]
6
Networks-on-Silicon
[Figure: clusters C1 to C16]
7
Design paradigm shift every 7-8 years
Networks-on-Silicon
Multi-processor architectures: platform-based design, reuse of IP
Embedded processors: HLS, SW compilers, ILP, VLIW
FU design: logic and RT synthesis, parametrised libraries
8
Outline

Introduction: limitations of simple multiprocessors
Application domain characteristics: multi-window TV
Top-level architecture
Architecture of the periodic subsystem

9
High-end TV architecture of 1998

ad hoc architecture
local optima, but globally not cost-effective

10
Market analysis
New features: PC-like windows with variable sizes and shapes, e.g. PIP, TXT, OSD
13
Application domain analysis
14
Application domain analysis
control: UI, menu, modem, cond. access, select tt page
soft real time: tt generation
hard real time: video
[Figure: application graph with nodes Video In1, Video In2, NR, HSRC, VSRC, mix, 100Hz, Peak, Matrix, Txt gen, RC and mem]

nodes: a limited number of well-known, weakly programmable tasks
one graph represents one application
50 ... 100 applications
run-time switching between applications
15
Application domain analysis
Application graph > subgraphs > tasks

set of closely coupled tasks
[Figure: application graph with nodes Video In1, Video In2, NR, HSRC, VSRC, mix, 100Hz, Peak, Matrix, Txt gen and mem]

processing power: Gops/s
bandwidth between tasks: GB/s
11 internal streams
20 external streams to mem
5 IO streams
asynchronous

16
Outline

Introduction: limitations of simple multiprocessors
Application domain characteristics: multi-window TV
Top-level architecture
Architecture of the signal processing subsystem

17
Top-level architecture
[Figure: top-level architecture with three subsystems. Labels: signal processing subsystem (processors P_1 to P_6, network, periodic requests); control subsystem (embedded CPU with I and D, random requests); memory subsystem (SDRAM interface)]
18
Hosseini-Khayat 95
[Figure: timing diagrams of the bus server (labels: server, T, time); a service cycle consists of time slots 1, 2, ..., N]
Bus time slot (e.g. 18 cc)
Bus service cycle = N time slots (e.g. N = 64)
Q = number of bus time slots per service cycle reserved for periodic streams
Goal: minimize the latency of random requests while guaranteeing the throughput of periodic requests
19
Algorithm

N = number of time slots per service cycle
Q = number of time slots per service cycle reserved for periodic requests
n = remaining number of time slots (initially n = N)
q = remaining number of time slots for periodic requests (initially q = Q)

1. If n > q, choose a random request if available.
2. If n ≤ q, choose a periodic request if available.
3. Decrement q if a periodic request was chosen.
4. Decrement n. If n = 0, restart (n = N, q = Q).

Claim: the average delay of random requests in the presence of periodic requests approaches the delay when periodic requests are absent (provided there is no overload).
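
The four steps can be sketched as a small per-slot arbiter. This is a minimal illustration, not the implementation from the cited work: the class and method names are hypothetical, and two behaviours the slide leaves open are filled in as assumptions. When n > q and no random request is waiting, a periodic request may use the slot, and periodic requests never receive more than Q slots per service cycle.

```python
class SlotArbiter:
    """Per-slot arbiter between periodic and random bus requests.

    N, Q, n and q follow the slide; all other names are illustrative.
    """

    def __init__(self, n_slots: int, q_periodic: int) -> None:
        if not 0 <= q_periodic <= n_slots:
            raise ValueError("Q must satisfy 0 <= Q <= N")
        self.N = n_slots
        self.Q = q_periodic
        self.n = n_slots      # remaining slots in this service cycle
        self.q = q_periodic   # remaining slots still owed to periodic requests

    def grant(self, random_pending: bool, periodic_pending: bool) -> str:
        """Decide one time slot; returns 'random', 'periodic' or 'idle'."""
        if self.n > self.q:
            # Step 1: there is slack, so a random request goes first.
            if random_pending:
                choice = "random"
            elif periodic_pending and self.q > 0:
                # Assumption: an otherwise unused slot may serve a periodic request.
                choice = "periodic"
            else:
                choice = "idle"
        else:
            # Step 2: n <= q, all remaining slots are reserved for periodic requests.
            choice = "periodic" if periodic_pending else "idle"
        if choice == "periodic":
            self.q -= 1                       # step 3
        self.n -= 1                           # step 4
        if self.n == 0:
            self.n, self.q = self.N, self.Q   # restart the service cycle
        return choice


arbiter = SlotArbiter(n_slots=8, q_periodic=3)
schedule = [arbiter.grant(random_pending=True, periodic_pending=True) for _ in range(8)]
```

With both request types always pending, N = 8 and Q = 3, the random requests are served in the first five slots and the periodic requests in the last three, so the periodic throughput is still met while random requests never wait behind a periodic one as long as slack remains.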
20
From here always choose periodic request
21
Outline

Introduction: limitations of simple multiprocessors
Application domain characteristics: multi-window TV
Top-level architecture
Architecture of the signal processing subsystem

22
Signal Processing Subsystem
[Figure: blocks labelled A, B, C and D shown in several arrangements]
(Re)configuration as in FPGAs, but at a coarse-grained level
23
Signal Processing Subsystem
Communication network

different flow graphs implemented via programmable switches
separate processing <-> communication
fifos (signals)
dynamic dataflow
asynchronous streams
dynamic functions (VLD)
simpler
scalable

[Figure: fifos, processors P_1, P_2, ..., P_n, fifos]
1-to-1 mapping
24
Signal Processing Subsystem
Reconfigurable communication network
[Figure: fifos, processors P_1, P_2, ..., P_n, fifos]
Inverse communication network
25
Signal Processing Subsystem
Resource sharing of processors
[Figure: tasks A, B and C with connections numbered 1 to 4; A and B are mapped on processor P, C on processor Q]
Mapping: multiplexing
Data identification is needed -> header or tag
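
A minimal sketch of why the tag is needed when two tasks share one processor: once the streams are multiplexed, the tag is the only way to select the right task and to separate the results again. The function name, the token layout and the two task bodies are hypothetical and serve only as illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Token = Tuple[str, int]  # (tag, payload); the tag identifies the task


def shared_processor(tokens: Iterable[Token],
                     tasks: Dict[str, Callable[[int], int]]) -> Dict[str, List[int]]:
    """Run several tasks on one processor, selecting the task by token tag."""
    outputs: Dict[str, List[int]] = {tag: [] for tag in tasks}
    for tag, payload in tokens:
        # The tag selects the task and the output stream (demultiplexing).
        outputs[tag].append(tasks[tag](payload))
    return outputs


# Hypothetical task bodies for tasks A and B, both mapped on one processor.
tasks = {"A": lambda v: v + 1, "B": lambda v: v * 2}
mixed = [("A", 1), ("B", 10), ("B", 11), ("A", 2)]
result = shared_processor(mixed, tasks)
```

The per-tag output lists keep the order of each stream even though the tokens arrive interleaved.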
26
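[Figure (a): process flow graph with processes A, B, C, D, E, P and Q and token types x and y]
Proper schedule: A B C P D E Q
Forbidden schedule: A B D C
[Figure (b): mapping, with x and y tokens queued for the shared processor P/Q]
Deadlock: data is present in the system but there is no progress.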
27
Signal Processing Subsystem
Ways to avoid deadlock
1. Graph transformations
2. Extra memory processes (rearrange the order of tokens)
[Figure: process flow graph with processes A, B, C, D, E, P and Q and extra memory processes M]
28
Signal Processing Subsystem
3. Extra control on the sequence of firing
[Figure: process flow graph with processes A, B, C, D, E, P and Q]
29
Signal Processing Subsystem
4. Out-of-order execution
[Figure: shared processor P/Q]
Separate fifos for tokens with different colors; no sharing of fifos
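
The effect of separate fifos per color can be illustrated with a small sketch. It does not reproduce the process flow graph of the slides. It only shows the underlying head-of-line blocking: in a shared fifo a token whose task cannot fire holds up every token behind it. The `ready` map is an assumption that stands in for whatever firing condition applies to each color.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Token = Tuple[str, int]  # (color, payload)


def run_shared_fifo(tokens: List[Token], ready: Dict[str, bool]) -> List[Token]:
    """One fifo for all colors: only the token at the head can be served."""
    fifo: Deque[Token] = deque(tokens)
    done: List[Token] = []
    while fifo and ready[fifo[0][0]]:
        done.append(fifo.popleft())
    return done


def run_colored_fifos(tokens: List[Token], ready: Dict[str, bool]) -> List[Token]:
    """One fifo per color: a blocked color does not hold up the others."""
    fifos: Dict[str, Deque[Token]] = {}
    for token in tokens:
        fifos.setdefault(token[0], deque()).append(token)
    done: List[Token] = []
    for color, fifo in fifos.items():
        while fifo and ready[color]:
            done.append(fifo.popleft())
    return done


tokens = [("x", 1), ("y", 2), ("y", 3)]
ready = {"x": False, "y": True}   # the task for x cannot fire, the task for y can
shared = run_shared_fifo(tokens, ready)
colored = run_colored_fifos(tokens, ready)
```

With the shared fifo nothing is served although y tokens are present; with one fifo per color both y tokens are processed out of arrival order relative to x.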
30
Processor model
[Figure: processor model with clock generation, clock gating, two groups of fifos (fifo_1 to fifo_4), shared logic, state registers State_1 to State_m, debug, and local control for task switching]

4 streams in parallel
no fifo/state sharing
zero-overhead context switch
blocking protocol via clock gating
round-robin scheduler
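
One possible reading of this processor model is sketched below: each stream owns its fifos and its state, the shared logic serves the streams in round-robin order, and a stream that has no input or no output space is simply not clocked. The class, the output capacity and the running-sum task are assumptions made for illustration; the slides do not specify them.

```python
from collections import deque
from typing import Deque, List, Optional


class MultiStreamProcessor:
    """Round-robin processor over several streams, each with private state."""

    def __init__(self, n_streams: int = 4, out_capacity: int = 2) -> None:
        self.inputs: List[Deque[int]] = [deque() for _ in range(n_streams)]
        self.outputs: List[Deque[int]] = [deque() for _ in range(n_streams)]
        self.state: List[int] = [0] * n_streams  # per-stream state (running sum)
        self.out_capacity = out_capacity
        self.next_stream = 0

    def _can_fire(self, s: int) -> bool:
        # Blocking protocol: an input token and room in the output fifo are needed.
        return bool(self.inputs[s]) and len(self.outputs[s]) < self.out_capacity

    def cycle(self) -> Optional[int]:
        """Serve one stream; return its index, or None if every stream is blocked."""
        n = len(self.inputs)
        for offset in range(n):
            s = (self.next_stream + offset) % n
            if self._can_fire(s):
                # Switching streams only selects another state entry (no save/restore).
                self.state[s] += self.inputs[s].popleft()
                self.outputs[s].append(self.state[s])
                self.next_stream = (s + 1) % n
                return s
        return None  # all streams blocked: no state changes, as with a gated clock


proc = MultiStreamProcessor()
proc.inputs[0].extend([1, 2])
proc.inputs[2].append(5)
served = [proc.cycle() for _ in range(4)]
```

Because every stream keeps its own state entry, interleaving streams 0 and 2 does not disturb either running sum.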
31
[Figure: signals a1 to a4 and b1 to b3]
32
Space switch
Time switch
[Figure: space switch and time switch; labels: a1 to a4, b1 to b4, s1, s2, frame, time]
33
TST interconnect network
[Figure: TST network with time, space and time stages between the outputs from processors and the inputs to processors; streams x and y over phases 1 2 3 4]
Communication ctrl: active task memory, cyclostatic
Configuration memories
Configuration ctrl: run-time reconfiguring
Appl. Graph 1
Appl. Graph 2
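
The two kinds of stage in such a network can be sketched in a few lines. The functions below are a generic illustration of a time switch (reordering the slots within a frame) and a space switch (selecting an input per output in every slot). The function names, the sample labels and the configuration tables are assumptions, not the contents of the configuration memories on the slide.

```python
from typing import List, Sequence

Frame = List[str]


def time_switch(frame: Sequence[str], read_order: Sequence[int]) -> Frame:
    """Reorder one frame: output slot t carries input slot read_order[t]."""
    return [frame[src] for src in read_order]


def space_switch(frames: Sequence[Sequence[str]],
                 select: Sequence[Sequence[int]]) -> List[Frame]:
    """In time slot t, output j carries the sample of input select[t][j]."""
    n_slots = len(frames[0])
    n_outputs = len(select[0])
    return [[frames[select[t][j]][t] for t in range(n_slots)]
            for j in range(n_outputs)]


a = ["a1", "a2", "a3", "a4"]
b = ["b1", "b2", "b3", "b4"]

# Time switch: a fixed permutation of the four slots of one frame.
reordered = time_switch(a, [2, 3, 1, 0])

# Space switch: the two outputs alternate between the two inputs every slot.
crossed = space_switch([a, b], [[0, 1], [1, 0], [0, 1], [1, 0]])
```

A time-space-time arrangement would apply `time_switch` per input, then `space_switch`, then `time_switch` per output, with the tables replayed cyclically every frame.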
34
Example: communication backbone
[Figure: time lines for Task 1 to Task 6, Video stream 1 and Video stream 2, with blanking intervals marked]
Overlap of the old and the new application graph
35
Chip metrics
36
Chip layout
[Figure: chip layout with blocks SE, S2MEM, MEM2S, JUGGLER, NR, OUT, RC, video, IN, CC, VS, SDRAM, HS, INT and CPU]
37
The End

Course goals
understand the design space and the trade-offs between area, time and power
understand the trends and the driving forces behind the different types of embedded cores (hardware and software)
understand the role and the task of the system-level architect

Be aware that learning, just like architecting, is an iterative and interactive process.
iterative => read and consume it again
interactive => contact me if you have questions
