Title: Complex multiprocessor architectures


1
Chapter 9: Complex multiprocessor architectures
  • Outline
  • Introduction: limitations of simple multiprocessors
  • Application domain characteristics: multi-window TV
  • Top level architecture
  • Architecture of the signal processing subsystem

2
Example: DVP platform
3
Discussion
  • Architecture
    • the limit of scalability of busses is reached
    • communication via central memory doubles the required bandwidth (every stream is written once and read once)
  • Programming (software) issues
    • synchronisation via the central CPU leads to coarse-grain tasks
    • scheduling of central resources (bus and memory) is difficult and time consuming, especially for real-time applications
  • VLSI (hardware) issues
    • interconnect delay dominates gate delay, which leads to multi-hop communication
    • clocking

4
From busses to concentrators to ...
5
Networks-on-Silicon
Hierarchical architecture with clusters of processors and memories, operating autonomously and cooperating via an on-chip network with routers, segmented interconnect and distributed memories.
[Figure: 16 clusters, C1 to C16, of embedded cores and embedded memories, with external SDRAM]
6
Networks-on-Silicon
[Figure: clusters C1 to C16]
7
Design paradigm shift every 7-8 years
  • Networks-on-Silicon: multi-processor architectures, platform-based design, reuse of IP
  • Embedded processors: HLS, SW compilers, ILP, VLIW
  • FU design: logic and RT synthesis, parametrised libraries
8
Outline
  • Introduction: limitations of simple multiprocessors
  • Application domain characteristics: multi-window TV
  • Top level architecture
  • Architecture of the periodic subsystem

9
High-end TV architecture of 1998
  • ad hoc architecture
  • local optima, globally not cost-effective

10
Market analysis
New features: PC-like windows with variable sizes and shapes, e.g. PIP, TXT, OSD
11
(No Transcript)
12
(No Transcript)
13
Application domain analysis
14
Application domain analysis
  • control: UI, menu, modem, cond. access, select tt page
  • soft real time: tt generation
  • hard real time: video
[Figure: application graph with nodes Video In1, Video In2, Txt gen, NR, HSRC, VSRC, mix, 100Hz, Peak, Matrix, RC and mem]
  • nodes: limited number of well-known, weakly programmable tasks
  • graph represents 1 application
  • 50 ... 100 applications
  • run-time switching between applications
15
Application domain analysis
Application graph => subgraphs => tasks
Subgraph: set of closely coupled tasks
[Figure: application graph with nodes Video In1, Video In2, Txt gen, NR, HSRC, VSRC, mix, 100Hz, Peak, Matrix and mem]
  • processing power: Gops/s
  • bandwidth between tasks: GB/s
  • 11 internal streams
  • 20 external streams to mem
  • 5 IO streams
  • asynchronous

16
Outline
  • Introduction: limitations of simple multiprocessors
  • Application domain characteristics: multi-window TV
  • Top level architecture
  • Architecture of the signal processing subsystem

17
Top-level architecture
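[Figure: top-level architecture with three subsystems. The signal processing subsystem (processors P_1 to P_6 on a network) issues periodic requests; the control subsystem (embedded CPU with I and D) issues random requests; the memory subsystem contains the SDRAM interface.]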
18
Hosseini-Khayat 95
[Figure: time axes with a server, a period T and time slots 1, 2, ..., N]
Bus time slot (e.g. 18 cc)
Bus service cycle: N time slots (e.g. N = 64)
Q = number of bus time slots per service cycle reserved for periodic streams
Goal: minimize the latency for random requests while guaranteeing the throughput for periodic requests
19
Algorithm
  • N = number of time slots per service cycle
  • Q = number of time slots per service cycle for periodic requests
  • n = remaining number of time slots (initially n = N)
  • q = remaining number of time slots for periodic requests (initially q = Q)

1. If n > q, choose a random request if available.
2. If n ≤ q, choose a periodic request if available.
3. Decrement q if a periodic request was chosen.
4. Decrement n. If n = 0, restart (n = N, q = Q).
Claim: the average delay of random requests in the presence of periodic requests approaches the delay when periodic requests are absent (if there is no overload situation).
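The slot-selection rule above can be sketched in Python. This is a minimal illustration, not the implementation from the cited work: the class name `SlotArbiter`, the method `next_slot`, and the fallback to the other request class when the preferred class has nothing pending are all assumptions made for the example.

```python
from typing import Optional


class SlotArbiter:
    """Per-slot choice between random and periodic requests (illustrative sketch)."""

    def __init__(self, n_slots: int, q_periodic: int) -> None:
        if not 0 <= q_periodic <= n_slots:
            raise ValueError("Q must lie between 0 and N")
        self.N = n_slots        # time slots per service cycle
        self.Q = q_periodic     # slots per cycle reserved for periodic requests
        self.n = n_slots        # remaining slots in the current cycle
        self.q = q_periodic     # remaining reserved periodic slots

    def next_slot(self, random_pending: bool, periodic_pending: bool) -> Optional[str]:
        """Return 'random', 'periodic' or None (idle slot) for the next bus slot."""
        pending = {"random": random_pending, "periodic": periodic_pending}
        # Steps 1 and 2: random requests are preferred while n > q.
        order = ("random", "periodic") if self.n > self.q else ("periodic", "random")
        # Assumption: if the preferred class has nothing pending, serve the other one.
        choice = next((kind for kind in order if pending[kind]), None)
        # Step 3: a served periodic request consumes one reserved slot.
        if choice == "periodic" and self.q > 0:
            self.q -= 1
        # Step 4: one slot has passed; restart at the end of the service cycle.
        self.n -= 1
        if self.n == 0:
            self.n, self.q = self.N, self.Q
        return choice


arbiter = SlotArbiter(n_slots=8, q_periodic=3)
grants = [arbiter.next_slot(random_pending=True, periodic_pending=True) for _ in range(8)]
print(grants)
```

With both request classes always pending, random requests get the early slots of each cycle and the reserved periodic slots are served at the end, which is the point from which a periodic request is always chosen.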
20
From here always choose periodic request
21
Outline
  • Introduction: limitations of simple multiprocessors
  • Application domain characteristics: multi-window TV
  • Top level architecture
  • Architecture of the signal processing subsystem

22
Signal Processing Subsystem
[Figure: processing elements of types A, B, C and D, configured into different arrangements]
(Re)configuration as in FPGAs, but at a coarse-grained level
23
Signal Processing Subsystem
Communication network
  • different flowgraphs implemented via programmable switches
  • separate proc. <-> comm.
  • fifo (signals)
  • dynamic dataflow
  • async. streams
  • dyn. funct. (VLD)
  • simpler
  • scalable

[Figure: processors P_1, P_2, ..., P_n with fifos on both sides, 1-to-1 mapping]
24
Signal Processing Subsystem
Reconfigurable communication network
[Figure: processors P_1, P_2, ..., P_n with fifos, a communication network and an inverse communication network]
25
Signal Processing Subsystem
Resource sharing of processors
[Figure: tasks A, B and C connected by streams 1 to 4; A and B are mapped on Proc P, C on Proc Q]
Mapping: multiplexing
Data identification is needed -> header or tag
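A small Python sketch can illustrate why a header or tag is needed once several tasks are multiplexed on one processor. The names `Token` and `SharedProcessor` and the two placeholder task functions are hypothetical and chosen for the example only; the slides do not specify a token format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Token:
    tag: str       # identifies the task (stream) the payload belongs to
    payload: int


class SharedProcessor:
    """One processor that time-multiplexes several tasks, selected by the token tag."""

    def __init__(self, tasks: Dict[str, Callable[[int], int]]) -> None:
        self.tasks = tasks

    def process(self, stream: List[Token]) -> List[Token]:
        # The tag selects the task to run and is kept on the result,
        # so that the network can route the output token afterwards.
        return [Token(t.tag, self.tasks[t.tag](t.payload)) for t in stream]


# Tasks A and B share Proc P; their bodies are arbitrary placeholders.
proc_p = SharedProcessor({"A": lambda v: v + 1, "B": lambda v: v * 2})
out = proc_p.process([Token("A", 1), Token("B", 5), Token("A", 2)])
print(out)
```

Without the tag, the interleaved tokens of A and B would be indistinguishable at the input of the shared processor.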
26
(a) Process flow graph, with processes A, B, C, D, E, P and Q and streams x and y
Proper schedule: A B C P D E Q
Forbidden schedule: A B D C
(b) Mapping: P and Q share one processor (P/Q) that receives the streams x and y
Deadlock! Data is present in the system but there is no progress.
27
Signal Processing Subsystem
Ways to avoid deadlock
1. Graph transformations
2. Extra memory processes (rearrange the order of tokens)
[Figure: process flow graph of A, B, C, D, E, P and Q with added memory processes M]
28
Signal Processing Subsystem
3. Extra control on the sequence of firing
[Figure: process flow graph of A, B, C, D, E, P and Q]
29
Signal Processing Subsystem
4. Out-of-order execution
Separate fifos for tokens with different colors on the shared processor P/Q; no sharing of fifos.
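The effect of separate fifos per token color can be shown with a small sketch. It is an illustration under assumptions, not the slide's exact graph: the function names are hypothetical, and a color is called "blocked" when its task cannot fire at the moment (for example because it waits for another input).

```python
from collections import deque
from typing import Deque, Dict, List, Sequence, Set


def drain_shared(tokens: Sequence[str], blocked: Set[str]) -> List[str]:
    """In-order service from one shared fifo: stops at the first blocked color."""
    fifo: Deque[str] = deque(tokens)
    served: List[str] = []
    while fifo and fifo[0] not in blocked:
        served.append(fifo.popleft())
    return served


def drain_per_color(tokens: Sequence[str], blocked: Set[str]) -> List[str]:
    """One fifo per color: a blocked color does not hold back the other colors."""
    fifos: Dict[str, Deque[str]] = {}
    for color in tokens:
        fifos.setdefault(color, deque()).append(color)
    served: List[str] = []
    for color, fifo in fifos.items():
        if color not in blocked:
            served.extend(fifo)
    return served


arrivals = ["y", "x", "x"]          # a y token arrives ahead of two x tokens
print(drain_shared(arrivals, blocked={"y"}))      # nothing can be served
print(drain_per_color(arrivals, blocked={"y"}))   # the x tokens still make progress
```

In the shared fifo the blocked y token at the head stalls the x tokens behind it; with one fifo per color the x tokens overtake it, which is the out-of-order execution named on the slide.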
30
Processor model
  • 4 streams in parallel
  • no fifo/state sharing
  • zero-overhead context switch
  • blocking protocol via clock gating
  • round-robin scheduler

[Figure: processor with fifos fifo_1 to fifo_4 on both sides, shared logic, per-stream state (State_1 to State_m), clock generation with clock gating, local control for task switching, and debug]
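A behavioural sketch of this processor model is given below. It assumes that one task fires per cycle on the shared logic, that a task is blocked when its input fifo is empty or its output fifo is full, and that a blocked task is simply skipped (the software analogue of gating its clock). `TaskContext`, `Processor`, the fifo capacity and the "+1" operation are placeholders, not details from the slides.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class TaskContext:
    """Private fifos and state of one stream; nothing is shared between streams."""
    inp: Deque[int] = field(default_factory=deque)
    out: Deque[int] = field(default_factory=deque)
    out_capacity: int = 2
    state: int = 0              # here: number of tokens processed so far

    def can_fire(self) -> bool:
        # Blocking protocol: no input data or no output space means no firing.
        return bool(self.inp) and len(self.out) < self.out_capacity


class Processor:
    def __init__(self, n_tasks: int = 4) -> None:
        self.tasks: List[TaskContext] = [TaskContext() for _ in range(n_tasks)]
        self.next_task = 0      # round-robin pointer

    def cycle(self) -> Optional[int]:
        """Run one cycle; return the index of the task that fired, or None."""
        for offset in range(len(self.tasks)):
            idx = (self.next_task + offset) % len(self.tasks)
            task = self.tasks[idx]
            if task.can_fire():
                task.out.append(task.inp.popleft() + 1)   # placeholder for the shared logic
                task.state += 1
                self.next_task = (idx + 1) % len(self.tasks)
                return idx
        return None             # all tasks blocked: the processor idles


proc = Processor()
proc.tasks[0].inp.extend([1, 2, 3])
proc.tasks[2].inp.extend([10])
fired = [proc.cycle() for _ in range(5)]
print(fired)
```

Because every stream keeps its own state and fifos, switching between tasks needs no save or restore step in this model; task 0 stalls once its output fifo is full and resumes as soon as a token is removed from it.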
31
[Figure: labels a1 to a4 and b1 to b3]
32
Space switch and time switch
[Figure: a space switch and a time switch with labels a1 to a4 and b1 to b4; frames are divided into time slots s1 and s2 along a time axis]
33
TST interconnect network
[Figure: time, space and time switch stages between the outputs from processors and the inputs to processors, carrying streams x and y]
  • Communication ctrl: active task memory, cyclostatic (phases 1 2 3 4)
  • Configuration memories
  • Configuration ctrl: run-time reconfiguring (Appl. Graph 1, Appl. Graph 2)
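The general time-space-time (TST) principle can be sketched as follows. This is a generic illustration of TST switching, not the network of the chip described here: the function names, the table layouts (`slot_map`, `port_maps`) and the sample labels are assumptions. The per-slot tables play the role of a configuration memory that is read cyclically, one entry per phase.

```python
from typing import List, Sequence

Frame = List[str]   # one sample per time slot


def time_switch(frame: Sequence[str], slot_map: Sequence[int]) -> Frame:
    """Output slot t carries the sample that arrived in slot slot_map[t]."""
    return [frame[src] for src in slot_map]


def space_switch(frames: Sequence[Sequence[str]],
                 port_maps: Sequence[Sequence[int]]) -> List[Frame]:
    """In slot t, output port o is connected to input port port_maps[t][o]."""
    n_slots = len(port_maps)
    n_out = len(port_maps[0])
    return [[frames[port_maps[t][o]][t] for t in range(n_slots)]
            for o in range(n_out)]


def tst(frames: Sequence[Sequence[str]],
        in_maps: Sequence[Sequence[int]],
        port_maps: Sequence[Sequence[int]],
        out_maps: Sequence[Sequence[int]]) -> List[Frame]:
    """Time switch per input port, space switch per slot, time switch per output port."""
    stage1 = [time_switch(f, m) for f, m in zip(frames, in_maps)]
    stage2 = space_switch(stage1, port_maps)
    return [time_switch(f, m) for f, m in zip(stage2, out_maps)]


frames = [["a1", "a2"], ["b1", "b2"]]          # two ports, two slots per frame
result = tst(frames,
             in_maps=[[1, 0], [0, 1]],          # port 0 swaps its two slots
             port_maps=[[1, 0], [0, 1]],        # crossed in slot 0, straight in slot 1
             out_maps=[[0, 1], [1, 0]])         # output port 1 swaps its two slots
print(result)
```

Loading a different set of tables corresponds to switching to another application graph at run time.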
34
Example: communication backbone
[Figure: time lines of Task 1 to Task 6 for Video stream 1 and Video stream 2, with blanking intervals]
Overlap of old and new appl. graph
35
Chip metrics
36
Chip layout
[Figure: chip layout with blocks SE, S2MEM, MEM2S, JUGGLER, NR, OUT, RC, video, IN, CC, VS, SDRAM, HS, INT and CPU]
37
The End
  • Course goals
  • understand the design space and the trade-offs between area, time and power
  • understand the trends and the driving forces behind the different types of embedded cores (hardware and software)
  • understand the role and the task of the system-level architect

Be aware that learning, just like architecting, is an iterative and interactive process.
iterative => read and consume it again
interactive => contact me if you have questions