Title: MorphoSys: Case Study of A Reconfigurable System Targeting Multimedia Applications
1. MorphoSys: Case Study of a Reconfigurable System Targeting Multimedia Applications
- Hartej Singh, Ming-Hau Lee, Guangming Lu, Fadi Kurdahi, Nader Bagherzadeh
  - University of California, Irvine
- Eliseu M. C. Filho
  - Federal Univ. of Rio de Janeiro, Brazil
- Rafael Maestre
  - Univ. Complutense, Madrid, Spain
2. Outline
- Motivation
- M1 MorphoSys Implementation
- System Operation
- Software Environment
- Application Mapping (Performance)
- Future Work
3. Motivation
[Figure: performance versus range of applications. Labels: application-specific embedded systems (ASICs); reconfigurable computing; general-purpose computing.]
4. Related Work
- PADDI, PADDI-II (Chen and Rabaey, 1992; Yeung and Rabaey, 1995)
- MATRIX (Mirsky and DeHon, 1996)
- RAW (Babb, Frank, Waingold, et al., 1997)
- RaPiD (Ebeling, Cronquist and Franklin, 1997)
- REMARC (Miyamori and Olukotun, 1998)
8. Target Application Domains
- Application Domains
  - Data-parallel, computation-intensive, block or streaming data, high throughput
- Examples
  - Image Processing (block-based)
    - Multimedia (Video Compression)
    - Template Matching (computation-intensive)
  - Digital Signal Processing (streaming operations)
    - FIR filters, Viterbi codecs (for high throughput)
    - FFT, DCT/IDCT, wavelets (computation-intensive)
  - Information security (block-based)
    - Data Encryption/Decryption (data-intensive)
10. MorphoSys Model
- Reconfigurable processor array
- High-bandwidth data interface
- Mainstream processor
[Figure: system block diagram. Labels: Main Processor (e.g. advanced RISC); MorphoSys Reconfigurable Processor Array; High Bandwidth Data Interface; System Bus; Instr./Data Cache (L1); External Memory (e.g. SDRAM, RDRAM).]
11. M1 Architecture
[Figure: M1 block diagram with the chip boundary marked.]
Chip: 0.35 micron, 100 MHz
14. RC Array
- 8 x 8 array of RCs
- 4 quads of 4 x 4 RCs
15. RC Array and Context Memory
- Context Broadcast
  - Row or column
  - Switching from column broadcast to row broadcast avoids data movement
  - Possible to activate only one column of RCs
16. RC Array and Context Memory
- Context Memory
  - 2 blocks
  - 8 sets in each block
  - A set controls one row or column (SIMD)
  - 16 contexts in each set
  - Possible to overlap context broadcast with context reloading
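The Context Memory organization and the row/column broadcast can be sketched as a small model. This is only an illustration built from the sizes on the slide; the function names, the `mode` parameter, and the tagged "words" are hypothetical, and the slide does not say how the two blocks map to broadcast directions, so the direction is passed in explicitly.

```python
# Sizes from the slide: 2 blocks, 8 sets per block, 16 contexts per set.
BLOCKS, SETS_PER_BLOCK, CONTEXTS_PER_SET = 2, 8, 16
ARRAY_DIM = 8  # 8 x 8 RC Array


def context_capacity():
    """Total number of context words the Context Memory can hold."""
    return BLOCKS * SETS_PER_BLOCK * CONTEXTS_PER_SET


def broadcast(context_memory, block, index, mode):
    """Return the 8 x 8 grid of context words seen by the RCs.

    context_memory[block][s][index] is context number `index` of set `s`.
    In "row" mode set s drives every RC in row s; in "column" mode it
    drives every RC in column s (SIMD within that row or column).
    """
    if mode not in ("row", "column"):
        raise ValueError("mode must be 'row' or 'column'")
    grid = []
    for r in range(ARRAY_DIM):
        row = []
        for c in range(ARRAY_DIM):
            s = r if mode == "row" else c
            row.append(context_memory[block][s][index])
        grid.append(row)
    return grid


# Hypothetical contents: each word is just a (block, set, index) tag.
memory = [[[(b, s, k) for k in range(CONTEXTS_PER_SET)]
           for s in range(SETS_PER_BLOCK)]
          for b in range(BLOCKS)]
by_row = broadcast(memory, 0, 3, "row")
by_col = broadcast(memory, 0, 3, "column")
print(context_capacity(), by_row[2][5], by_col[2][5])
```

In row mode the RC at row 2, column 5 receives the word from set 2; in column mode the same RC receives the word from set 5. No RC data moves when the mode changes, which is the point made on the Context Broadcast slide.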
17. M1 Architecture: RC Array
19. Reconfigurable Cell (RC)
[Figure: RC datapath. Labels: inputs from neighbor RCs, the data bus, and the register file feeding MUX_A and MUX_B; ALU+MULT; SHIFT; Register File (R0 to R3); O/P REG, driving other RCs and the data bus; Context Register, loaded with the context word from Context Memory and supplying a constant and control signals. Bus widths marked in the figure: 12, 16, and 28.]
20. RC Array Quad Interconnect
[Figure: a 4 x 4 quad of RCs.]
- North, South, East, West connections (level 1)
- Full column and row connectivity within each 4 x 4 partition (level 2)
- Each RC in a column or row of a quad can receive data from one other RC in its column or row
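The two interconnect levels can be made concrete by listing which RCs a given cell can read from. The following is a rough sketch, not the M1 netlist: the function names are hypothetical, and it assumes no wrap-around at the array edges, which the slide does not specify.

```python
N, QUAD = 8, 4  # 8 x 8 array built from 4 x 4 quads


def level1_neighbors(r, c):
    """Nearest North, South, East, West neighbors of RC (r, c).

    Assumption: edges do not wrap around.
    """
    candidates = [(r - 1, c), (r + 1, c), (r, c + 1), (r, c - 1)]
    return [(i, j) for i, j in candidates if 0 <= i < N and 0 <= j < N]


def level2_neighbors(r, c):
    """Other RCs in the same row or column of the quad containing (r, c)."""
    r0, c0 = (r // QUAD) * QUAD, (c // QUAD) * QUAD
    same_row = [(r, j) for j in range(c0, c0 + QUAD) if j != c]
    same_col = [(i, c) for i in range(r0, r0 + QUAD) if i != r]
    return same_row + same_col


print(level1_neighbors(0, 0))
print(level2_neighbors(5, 2))
```

With these definitions every RC has six level-2 sources (three in its quad row, three in its quad column); per the slide, it selects one of them at a time.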
21. M1 Architecture: TinyRISC
22. Control Processor: TinyRISC
Execute Stage
Data Cache
ALU
Shift Unit
Memory Unit
MorphoSys Unit
23. M1 Architecture: Data Interface
24. Frame Buffer
- 2 sets; each set has 2 banks (Bank A and Bank B, 64 x 8 bytes each)
- Access is to 8 consecutive bytes
- Bank A provides operand A for the RCs; Bank B provides operand B
- Enables overlap of computations with data transfer
[Figure: frame buffer organization. Labels: SET ZERO and SET ONE, each with BANK A and BANK B; COL OFFSET; MSByte to LSByte.]
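A minimal model of the two-set Frame Buffer helps show why it allows computation to overlap with data transfer. This is a sketch under assumptions: the class and method names are hypothetical, and the rule that loads always target the set the RC Array is not reading is one plausible reading of the slide, not a statement of the M1 control logic.

```python
ROWS, ROW_BYTES = 64, 8  # each bank is 64 x 8 bytes


class FrameBuffer:
    def __init__(self):
        bank_size = ROWS * ROW_BYTES
        # Two sets, each with a bank for operand A and a bank for operand B.
        self.sets = [{"A": bytearray(bank_size), "B": bytearray(bank_size)}
                     for _ in range(2)]
        self.compute_set = 0  # set currently feeding the RC Array

    def capacity_bytes(self):
        return sum(len(bank) for s in self.sets for bank in s.values())

    def load(self, bank, row, data):
        """DMA-style write into the set the RC Array is NOT reading."""
        start = row * ROW_BYTES
        target = self.sets[1 - self.compute_set][bank]
        if start + len(data) > len(target):
            raise ValueError("write exceeds bank size")
        target[start:start + len(data)] = data

    def read_row(self, bank, row):
        """RC Array read: 8 consecutive bytes from the active set."""
        start = row * ROW_BYTES
        return bytes(self.sets[self.compute_set][bank][start:start + ROW_BYTES])

    def swap(self):
        """Make the freshly loaded set the one that feeds the RC Array."""
        self.compute_set = 1 - self.compute_set


fb = FrameBuffer()
fb.load("A", 0, bytes(range(8)))   # goes to the inactive set
before = fb.read_row("A", 0)       # active set is still untouched
fb.swap()
after = fb.read_row("A", 0)        # now the loaded data is visible
print(fb.capacity_bytes(), before, after)
```

With the sizes on the slide the total capacity works out to 2 sets x 2 banks x 64 x 8 bytes = 2048 bytes.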
25. DMA Controller
Tiny RISC Core Processor
Main Memory
Frame Buffer
DMA Controller
Context Memory
28. System Operation: steps (a), (b), (c), (d)
Each row lists one unit's activity from left to right. Steps (a) and (b) take one column each; steps (c) and (d) take two columns each.
Steps ->: (a) | (b) | (c) | (c) | (d) | (d)
- Tiny RISC: LDCTXT | LDFB | CBCAST, LDFB | CBCAST, LDCTXT | CBCAST, STFB, and LDFB | CBCAST, LDCTXT
- RC Array: IDLE | IDLE | EXECUTE | EXECUTE | EXECUTE | EXECUTE
- CM: Load new context data | IDLE | Context to RC Array | Reload context | Context to RC Array | Reload context
- FB Set 0: IDLE | Load new application data | Data to RC Array | Data to RC Array | Write out previous data, load new data | IDLE
- FB Set 1: IDLE | IDLE | Load new application data | IDLE | Data to RC Array | Data to RC Array
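The benefit of the overlap in steps (c) and (d) can be estimated with a simple double-buffering timing model. This is a back-of-the-envelope sketch, not a measurement: the function names and the cycle counts are hypothetical, and the model ignores write-back (STFB) and context reloading.

```python
def serial_time(tiles, load, compute):
    """Cycles if every tile is loaded and then processed, with no overlap."""
    return tiles * (load + compute)


def overlapped_time(tiles, load, compute):
    """Cycles if tile i+1 is loaded into the other FB set while tile i runs.

    The first load and the last computation cannot be hidden; in between,
    each stage costs whichever of load or compute is longer.
    """
    if tiles == 0:
        return 0
    return load + (tiles - 1) * max(load, compute) + compute


# Hypothetical numbers: 4 data tiles, 10 cycles to load, 30 to compute.
print(serial_time(4, 10, 30), overlapped_time(4, 10, 30))
```

When computation takes longer than loading, the RC Array stays busy after the first load, which matches the unbroken run of EXECUTE entries once step (c) begins.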
33. Software Environment
[Figure: software tool flow. Labels: mView; App. (C Code), with snippet "TR_app a b c p a 1"; Configuration context, "ZRC_F(X) WRC_F(Y)"; RC Array functions; Context Lib., with context words 0100011....11100, 1100110....00010, 0011101....10100; mcc; mSched; Executable; MuLate, MorphoSim (C, VHDL); MorphoSys Chip (TinyRISC, RC Array).]
34. mcc: MorphoSys C Compiler
C source:

    MPEG_Encode (int macroblk, int blk_type)
    {
        int curr_blk, new_blk, rev_blk, motion_vectors,
            quant_factor, contextdataaddr;
        motion_comp(curr_blk, motion_vectors, blk_type, contextdataaddr);
        . .
    }

    void motion_comp (int curr_blk, int motion_vectors, int blk_type,
                      int contextdataaddr)
    {
        TR_ldfb(contextdataaddr, BANKA, 0, 256);
        TR_cbcast(0,0,0,0);
        TR_cbcast(0,0,0,1);
        . . . .
        TR_cbcast(0,0,0,6);
        TR_cbcast(0,0,0,7);
        return;
    }

Corresponding TinyRISC assembly:

    MPEG_Encode:
        subi 15,15,13
        ldi 9,2
        add 9,15,9
        stw 14,9
        stw 6,9
        addi 9,9,8
        ldw 6,9
        lda 14,motion_comp
        stw 6,15
        jal 14,14
    motion_comp:
        subi 15,15,3
        ldw 14,15
        ldfb 2,0,0,256
        cbcast 0, 0, 0, 0
        cbcast 0, 0, 0, 1
        ..
        cbcast 0, 0, 0, 6
        cbcast 0, 0, 0, 7
        jal 0,14
35. M1 Program Components
- Sequential functions: TinyRISC assembly program
- Data-Parallel (RC Array) functions: RC context program
[Figure: program components mapped onto the hardware. Labels: DATA PATH; CONTROL FLOW; CONTROL DATA; TinyRISC; Context Memory; Frame Buffer; RC Array; DMAC.]
36. M1 Program Components
Data-Parallel (RC context program):

    set 0,1 KEEP I I
    set 1,1 CLOAD!0x004 I I >2
    set 2,1 ADD E r0 LSL 2 >0 WE
    set 3,1 ADD XQ r0 LSL 2 >0
    set 4,1 SUB XQ r0 LSL 2 >0
    set 5,1 SUB E r0 LSL 2 >0 WE
    set 6,1 CLOAD!0x004 I I >2
    set 7,1 KEEP I I

Control (TinyRISC instructions for the RC Array, Frame Buffer, and Context Memory):

    ldfb 2,0,0,256
    ldctxt 1, 0, 80
    cbcast 0, 0, 0, 0
    sbcb 0, 2, 1, 17
    dbcbc 1, 1, 4, 64
    stfb 6, 1, 0, 16
    wfbi 0, 0, 1, 32

Sequential (TinyRISC assembly):

    subi 15,15,3
    addi 2,0,0,256
    ldw 1, 0, 0, 0,
    jal 0, 2, 0, 1,
    lui 0, 0, 0, 5
    addw 0, 0, 0, 1,

[Figure labels: DATA PATH; CONTROL; CONTROL DATA; TinyRISC; Context Memory; Frame Buffer; RC Array; DMAC.]
37. Simulation Environment
TinyRISC Executable Code:

    08000000 09010001 08023000 39030037 09047777 .

RC Array Configuration program:

    0-0-11-00000-0000-000-11111001-00000000
    0-0-10-00000-0000-000-1001-000000000100
    0-1-00-00010-0111-100-11110011-00000000
    0-0-00-00010-1000-100-11110011-00000000
    0-0-00-00010-1000-100-11110100-00000000
    0-1-00-00010-0111-100-11110100-00000000

Application (e.g. image) data:

    00234567 09876534 92834723 29382034 20342034 .
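Each hyphen-separated bit string in the RC Array configuration program adds up to 32 bits, so it can be packed into a single word for a simulator input file. The helper below is a hypothetical sketch; the slide does not name the individual fields, so the code only checks the total width and does not interpret them.

```python
def pack_context_word(text):
    """Pack a hyphen-separated bit string into a 32-bit integer."""
    bits = text.replace("-", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    return int(bits, 2)


def field_widths(text):
    """Widths of the hyphen-separated fields, in order."""
    return [len(field) for field in text.split("-")]


word = "0-0-11-00000-0000-000-11111001-00000000"
print(format(pack_context_word(word), "08X"), field_widths(word))
```

For the first word listed on the slide this yields the hexadecimal value 3000F900.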
40. mSched: Kernel Scheduler
- Optimization criteria
  - Context reloading (minimize)
  - Data reuse (maximize)
  - Overlap of RC computation and data movements (maximize)
- The criteria conflict
- An optimal solution implies exploration of the whole search space
- To improve performance, the proposed methodology divides the problem into two guided subtasks:
  1. Partitioning of the DFG (data-flow graph)
  2. Scheduling within a partition
41. Target Applications
- Image Processing
  - Video Compression (multimedia)
  - Template Matching (target recognition)
  - Hyperspectral Imaging (video analysis)
  - Medical Diagnostic Imaging (MRI)
- Digital Signal Processing
  - Matrix and vector operations, FIR filters, Viterbi codecs, beam-forming (SONAR)
  - FFT, DCT/IDCT, wavelets (DWT)
- Information security
  - Data Encryption/Decryption (DES, IDEA, 3-Way)
42. Video Compression: MPEG-2
Motion estimation for an 8x8 block with a 16x16 search area (comparison is in clock cycles)
[Chart: cycle-count comparison, with an axis annotation reading "Better".]
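For reference, the kernel being compared can be sketched as a full-search block match. This is an illustrative sketch only: the slide does not name the matching criterion, so the sum of absolute differences (SAD) used here is an assumption, and the function names and test data are hypothetical.

```python
BLOCK, SEARCH = 8, 16  # 8x8 block, 16x16 search area


def sad(block, area, dy, dx):
    """Sum of absolute differences between the block and one candidate."""
    return sum(abs(block[y][x] - area[dy + y][dx + x])
               for y in range(BLOCK) for x in range(BLOCK))


def full_search(block, area):
    """Try every candidate position; return (best_cost, dy, dx)."""
    best = None
    for dy in range(SEARCH - BLOCK + 1):
        for dx in range(SEARCH - BLOCK + 1):
            cost = sad(block, area, dy, dx)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Hypothetical data: a distinctive block embedded at offset (3, 5).
block = [[8 * y + x + 1 for x in range(BLOCK)] for y in range(BLOCK)]
area = [[0] * SEARCH for _ in range(SEARCH)]
for y in range(BLOCK):
    for x in range(BLOCK):
        area[3 + y][5 + x] = block[y][x]

candidates = (SEARCH - BLOCK + 1) ** 2
print(candidates, full_search(block, area))
```

There are 9 x 9 = 81 candidate positions, each needing 64 absolute differences. The candidates are independent of one another, which is what makes the kernel a natural fit for a data-parallel array.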
43. Video Compression: MPEG-2
M1 vs. other processors (reconfigurable, DSP, general-purpose, and multimedia) for 2-D DCT on an 8x8 block, in cycles
[Chart: cycle-count comparison, with an axis annotation reading "Better".]
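The 2-D DCT on an 8x8 block is commonly computed as eight 1-D row transforms followed by eight 1-D column transforms. The sketch below shows that row-column decomposition with an orthonormal DCT-II; it is a plain reference version with hypothetical function names, and whether the charted implementations use this exact decomposition is an assumption.

```python
import math

N = 8


def dct_1d(v):
    """Orthonormal 1-D DCT-II of an 8-element sequence."""
    out = []
    for k in range(N):
        s = sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Row transforms, then column transforms, on an 8x8 block."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d([rows[y][x] for y in range(N)]) for x in range(N)]
    # cols[x][k] holds frequency k of column x; transpose back to [k][x].
    return [[cols[x][k] for x in range(N)] for k in range(N)]


flat = [[10] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))
```

For a constant block all the energy lands in the DC coefficient. The row pass and the column pass each consist of eight independent 1-D transforms, which lines up with the row and column context broadcast modes described earlier in the deck.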
44. MPEG-2 Video Encoder
[Figure: encoder block diagram. Labels: Input stream; Pre-Processing; Frame Memory (two instances); Motion Estimation; Motion Compensation; Motion Vectors; Predictive frame; subtractor (-); DCT; Quantization; Zig-zag scan; VLC Encoder; Output Buffer; Output stream; Regulator; Inv. Quant.; IDCT.]
45. MPEG-2 Video Encoder
46. Automatic Target Recognition
M1 vs. other ATR systems for the ATR second level of detection (1 template pair, 128x128 image)
47. Data Encryption (IDEA)
48. Future Work
- Architecture
  - Low power, enhanced functionality
  - Reconfigurability for multiple data types
- Software/Programming
  - Data-parallel function identification
  - Control flow generation for RC Array
  - Automatic mapping to RC (or multiple RCs)
- Application Mapping
  - Communication applications
  - Bio-medical imaging