MorphoSys: Case Study of a Reconfigurable System Targeting Multimedia Applications

1
MorphoSys: Case Study of a Reconfigurable System
Targeting Multimedia Applications
  • Hartej Singh, Ming-Hau Lee, Guangming Lu, Fadi
    Kurdahi, Nader Bagherzadeh,
  • University of California, Irvine
  • Eliseu M. C. Filho,
  • Federal Univ. of Rio de Janeiro, Brazil
  • Rafael Maestre,
  • Univ. Complutense, Madrid, Spain

2
Outline
  • Motivation
  • M1 MorphoSys Implementation
  • System Operation
  • Software Environment
  • Application Mapping (Performance)
  • Future Work

3
Motivation
[Figure: performance plotted against range of applications, positioning application-specific embedded systems (ASICs), reconfigurable computing, and general-purpose computing.]
4
Related Work
  • PADDI, PADDI-II (Chen and Rabaey, 1992; Yeung and
    Rabaey, 1995)
  • MATRIX (Mirsky and DeHon, 1996)
  • RAW (Babb, Frank, Waingold, et al., 1997)
  • RaPiD (Ebeling, Cronquist and Franklin, 1997)
  • REMARC (Miyamori and Olukotun, 1998)

8
Target Application Domains
  • Application Domains
      • Data-parallel, computation-intensive, block or
        streaming data, high throughput
  • Examples
      • Image Processing (block-based)
          • Multimedia (Video Compression)
          • Template Matching (computation-intensive)
      • Digital Signal Processing - streaming operations
          • FIR filters, Viterbi codecs (for high throughput)
          • FFT, DCT/IDCT, wavelets (computation-intensive)
      • Information security (block-based)
          • Data Encryption/Decryption (data-intensive)

10
MorphoSys Model
  • Reconfigurable processor array
  • High-bandwidth data interface
  • Mainstream processor

[Figure: MorphoSys model. Blocks shown: main processor (e.g. advanced RISC), MorphoSys reconfigurable processor array, instruction/data cache (L1), high-bandwidth data interface, system bus, and external memory (e.g. SDRAM, RDRAM).]
13
M1 Architecture
[Figure: M1 block diagram with the chip boundary marked.]
Chip: 0.35 micron, 100 MHz
14
RC Array
  • 8 x 8 array of reconfigurable cells (RCs)
  • 4 quads of 4 x 4 RCs
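
The partition of the 8 x 8 array into four 4 x 4 quads can be sketched as a coordinate mapping. This is only an illustration: the row-major quad numbering and the names `quad_of` and `rcs_in_quad` are assumptions, not taken from the slides.

```python
ROWS, COLS, QUAD = 8, 8, 4  # 8 x 8 RC array split into 4 x 4 quads (from the slide)

def quad_of(row, col):
    """Return the quad number (0..3) of the RC at (row, col).

    Quads are numbered row-major here; that numbering is an assumption.
    """
    if not (0 <= row < ROWS and 0 <= col < COLS):
        raise ValueError("RC coordinates out of range")
    return (row // QUAD) * (COLS // QUAD) + (col // QUAD)

def rcs_in_quad(q):
    """List the (row, col) coordinates of every RC in quad q."""
    return [(r, c) for r in range(ROWS) for c in range(COLS) if quad_of(r, c) == q]
```

For example, `quad_of(0, 4)` falls in the quad to the right of the one holding `quad_of(0, 0)`, and each quad holds 16 RCs.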

15
RC Array and Context Memory
  • Context Broadcast
      • by row or by column
      • switching from column broadcast to row broadcast
        avoids data movement
      • it is possible to activate only 1 column of RCs

16
RC Array and Context Memory
  • Context Memory
      • 2 blocks
      • 8 sets in each block
      • A set controls 1 row or 1 column of RCs (SIMD)
      • 16 contexts in each set
      • Context broadcast can overlap with context
        reloading
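
The organization above (2 blocks, 8 sets per block, 16 contexts per set, one set per row or column) can be modeled in a few lines. This is a rough sketch: the class name `ContextMemory`, the `by_row` flag, and the use of plain integers as context words are assumptions for illustration, and the slides do not say how the two blocks are assigned to row and column broadcast.

```python
BLOCKS, SETS_PER_BLOCK, CONTEXTS_PER_SET = 2, 8, 16  # figures from the slide
ARRAY_DIM = 8                                        # 8 x 8 RC array
TOTAL_CONTEXTS = BLOCKS * SETS_PER_BLOCK * CONTEXTS_PER_SET

class ContextMemory:
    def __init__(self):
        # words[block][set][context]; 0 stands in for an unloaded context word
        self.words = [[[0] * CONTEXTS_PER_SET for _ in range(SETS_PER_BLOCK)]
                      for _ in range(BLOCKS)]

    def load(self, block, set_idx, ctx_idx, word):
        """Store one context word (hypothetical integer encoding)."""
        self.words[block][set_idx][ctx_idx] = word

    def broadcast(self, block, ctx_idx, by_row=True):
        """Return the 8 x 8 grid of context words the RCs would receive.

        Set i drives row i (row broadcast) or column i (column broadcast),
        so all RCs in that row or column get the same word (SIMD).
        """
        grid = [[None] * ARRAY_DIM for _ in range(ARRAY_DIM)]
        for i in range(SETS_PER_BLOCK):
            word = self.words[block][i][ctx_idx]
            for j in range(ARRAY_DIM):
                if by_row:
                    grid[i][j] = word
                else:
                    grid[j][i] = word
        return grid
```

With these figures the memory holds 2 x 8 x 16 = 256 context words in total.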

17
M1 Architecture: RC Array
19
Reconfigurable Cell (RC)
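[Figure: RC datapath. Multiplexers MUX_A and MUX_B select operands from neighbor RCs, the data bus, and the register file (R0 to R3). They feed the ALU/multiplier (ALU+MULT), which is followed by a shift unit and an output register (O/P REG) that drives other RCs and the data bus. A context register holds the context word from context memory and supplies the control signals and a constant. Bus widths marked in the figure: 12, 16, and 28.]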
20
RC Array Quad Interconnect
  • Level 1: north, south, east, and west connections.
  • Level 2: full column and row connectivity within a 4x4
    partition (quad). Each RC in a column or row of a quad
    can receive data from one other RC in its column or row.

[Figure: 4 x 4 grid of RCs forming one quad.]
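
The two interconnect levels can be expressed as neighbor sets over (row, column) coordinates. The sketch below assumes no wraparound at the array edges and assumes level 1 spans quad boundaries; neither detail is stated on the slide, and the function names are hypothetical.

```python
N, QUAD = 8, 4  # 8 x 8 array, 4 x 4 quads

def level1_neighbors(r, c):
    """North, south, east, west neighbors (assumes no wraparound at edges)."""
    candidates = [(r - 1, c), (r + 1, c), (r, c + 1), (r, c - 1)]
    return [(i, j) for i, j in candidates if 0 <= i < N and 0 <= j < N]

def level2_sources(r, c):
    """Other RCs in the same row or column of the same quad.

    At any time an RC would pick one of these as its data source.
    """
    r0, c0 = (r // QUAD) * QUAD, (c // QUAD) * QUAD  # top-left corner of the quad
    same_row = [(r, j) for j in range(c0, c0 + QUAD) if j != c]
    same_col = [(i, c) for i in range(r0, r0 + QUAD) if i != r]
    return same_row + same_col
```

Under these assumptions every RC has six level 2 candidates: three in its quad row and three in its quad column.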
21
M1 Architecture: TinyRISC
22
Control processor: TinyRISC
[Figure: TinyRISC datapath. Labels shown: execute stage, data cache, ALU, shift unit, memory unit, MorphoSys unit.]
23
M1 Architecture: Data Interface
24
Frame Buffer
  • 2 sets (set 0 and set 1); each set has 2 banks, bank A
    and bank B (64 x 8 bytes each).
  • Access is to 8 consecutive bytes.
  • Bank A provides operand A for the RCs; bank B provides
    operand B.
  • Enables overlap of computations with data transfer.

[Figure: frame buffer layout showing banks A and B in set 0 and set 1, addressed by column offset, from MSByte to LSByte.]
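
A small model helps check the frame buffer arithmetic. It reads "64 x 8 bytes" as 64 rows of 8 bytes per bank, which is an interpretation of the slide, and the class and method names are invented for illustration.

```python
SETS, BANKS = (0, 1), ("A", "B")
ROWS_PER_BANK, BYTES_PER_ROW = 64, 8           # "64 x 8 bytes" per bank
BANK_BYTES = ROWS_PER_BANK * BYTES_PER_ROW     # 512 bytes per bank
TOTAL_BYTES = len(SETS) * len(BANKS) * BANK_BYTES

class FrameBuffer:
    def __init__(self):
        # One zero-filled byte array per (set, bank) pair
        self.banks = {(s, b): bytearray(BANK_BYTES) for s in SETS for b in BANKS}

    def write(self, fb_set, bank, offset, payload):
        """Copy payload into one bank, refusing writes past the bank end."""
        if offset < 0 or offset + len(payload) > BANK_BYTES:
            raise ValueError("write outside bank")
        self.banks[(fb_set, bank)][offset:offset + len(payload)] = payload

    def read8(self, fb_set, bank, offset):
        """Return 8 consecutive bytes, the access width named on the slide."""
        if offset < 0 or offset + 8 > BANK_BYTES:
            raise ValueError("read outside bank")
        return bytes(self.banks[(fb_set, bank)][offset:offset + 8])
```

Under this reading the whole frame buffer holds 2 x 2 x 512 = 2048 bytes, and one `read8` call returns 8 bytes, one per RC in a row or column of the 8 x 8 array.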
25
DMA Controller
[Figure: block diagram linking the TinyRISC core processor, main memory, frame buffer, DMA controller, and context memory.]
28
System Operation: steps (a) to (d)
Step  TinyRISC                RC Array  CM                     FB Set 0                                FB Set 1
(a)   LDCTXT                  IDLE      Load new context data  IDLE                                    IDLE
(b)   LDFB                    IDLE      IDLE                   Load new application data               IDLE
(c)   CBCAST, LDFB            EXECUTE   Context to RC Array    Data to RC Array                        Load new application data
(c)   CBCAST, LDCTXT          EXECUTE   Reload context         Data to RC Array                        IDLE
(d)   CBCAST, STFB, and LDFB  EXECUTE   Context to RC Array    Write out previous data, load new data  Data to RC Array
(d)   CBCAST, LDCTXT          EXECUTE   Reload context         IDLE                                    Data to RC Array
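
The overlap shown in these steps is a double-buffering (ping-pong) pattern: the RC Array reads one frame buffer set while the other set is refilled. The sketch below is a simplified model under stated assumptions: it ignores context loading and result write-back, treats each data tile as one step, and uses invented names.

```python
def pingpong_schedule(num_tiles):
    """Return one dict per step with the activity of the RC array and both FB sets."""
    if num_tiles < 1:
        raise ValueError("need at least one tile")
    # Prologue: nothing to execute yet, so the RC array idles while set 0 loads.
    steps = [{"rc": "IDLE", "fb0": "load tile 0", "fb1": "IDLE"}]
    for t in range(num_tiles):
        active = f"fb{t % 2}"        # set feeding the RC array in this step
        other = f"fb{(t + 1) % 2}"   # set being refilled in the background
        step = {"rc": f"EXECUTE tile {t}", active: f"tile {t} to RC array"}
        step[other] = f"load tile {t + 1}" if t + 1 < num_tiles else "IDLE"
        steps.append(step)
    return steps

if __name__ == "__main__":
    for i, step in enumerate(pingpong_schedule(3)):
        print(i, step["rc"], "|", step["fb0"], "|", step["fb1"])
```

In this model the RC array idles only during the prologue; every later load is hidden behind computation on the other set.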
33
Software Environment
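[Figure: software tool flow. Labels shown:]
  • mView: configuration context, stored in the context library
    (Context Lib.), e.g.
    0100011....11100 1100110....00010 0011101....10100
  • Application (C code), e.g. TR_app, with RC Array functions
    (slide text: "TR_app a b c p a 1", "ZRC_F(X) WRC_F(Y)")
  • mSched
  • mcc: produces the executable
  • MuLate, MorphoSim (C, VHDL)
  • MorphoSys chip: TinyRISC, RC Array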
34
mcc: MorphoSys C Compiler
C source:
    MPEG_Encode (int macroblk, int blk_type)
    {
        int curr_blk, new_blk, rev_blk, motion_vectors,
            quant_factor, contextdataaddr;
        motion_comp(curr_blk, motion_vectors, blk_type,
                    contextdataaddr);
        . .
    }

    void motion_comp (int curr_blk, int motion_vectors,
                      int blk_type, int contextdataaddr)
    {
        TR_ldfb(contextaddr, BANKA, 0, 256);
        TR_cbcast(0,0,0,0);
        TR_cbcast(0,0,0,1);
        . . . .
        TR_cbcast(0,0,0,6);
        TR_cbcast(0,0,0,7);
        return;
    }
TinyRISC assembly:
    MPEG_Encode:
        subi 15,15,13
        ldi 9,2
        add 9,15,9
        stw 14,9
        stw 6,9
        addi 9,9,8
        ldw 6,9
        lda 14,motion_comp
        stw 6,15
        jal 14,14
    motion_comp:
        subi 15,15,3
        ldw 14,15
        ldfb 2,0,0,256
        cbcast 0, 0, 0, 0
        cbcast 0, 0, 0, 1
        ..
        cbcast 0, 0, 0, 6
        cbcast 0, 0, 0, 7
        jal 0,14
35
M1 Program Components
  • Sequential functions: TinyRISC assembly program
  • Data-parallel (RC Array) functions: RC context program

[Figure: the two program components mapped onto TinyRISC, context memory, frame buffer, RC Array, and DMAC. Path labels: DATA PATH, CONTROL FLOW, CONTROL DATA.]
36
M1 Program Components
Data-parallel (RC context program):
    set 0,1 KEEP I I
    set 1,1 CLOAD!0x004 I I >2
    set 2,1 ADD E r0 LSL 2 >0 WE
    set 3,1 ADD XQ r0 LSL 2 >0
    set 4,1 SUB XQ r0 LSL 2 >0
    set 5,1 SUB E r0 LSL 2 >0 WE
    set 6,1 CLOAD!0x004 I I >2
    set 7,1 KEEP I I
Sequential (TinyRISC program):
    ldfb 2,0,0,256
    ldctxt 1, 0, 80
    cbcast 0, 0, 0, 0
    sbcb 0, 2, 1, 17
    dbcbc 1, 1, 4, 64
    stfb 6, 1, 0, 16
    wfbi 0, 0, 1, 32

    subi 15,15,3
    addi 2,0,0,256
    ldw 1, 0, 0, 0,
    jal 0, 2, 0, 1,
    lui 0, 0, 0, 5
    addw 0, 0, 0, 1,

[Figure labels: DATA PATH, CONTROL, CONTROL DATA; blocks: TinyRISC, context memory, frame buffer, RC Array, DMAC.]
38
Simulation Environment
Inputs shown on the slide: RC Array configuration program,
application (e.g. image) data, TinyRISC executable code.

    08000000 09010001 08023000 39030037 09047777 .

    0-0-11-00000-0000-000-11111001-00000000
    0-0-10-00000-0000-000-1001-000000000100
    0-1-00-00010-0111-100-11110011-00000000
    0-0-00-00010-1000-100-11110011-00000000
    0-0-00-00010-1000-100-11110100-00000000
    0-1-00-00010-0111-100-11110100-00000000

    00234567 09876534 92834723 29382034 20342034 .
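
The dash-separated bit strings on this slide can be packed into machine words with a few lines of code. This is a hedged sketch: the slide does not name the fields, so the function only splits on dashes and reports the total width; `parse_context_word` is an invented helper name.

```python
def parse_context_word(text):
    """Split a dash-separated bit string into fields and pack it into an integer."""
    fields = text.strip().split("-")
    bits = "".join(fields)
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("expected only 0, 1 and '-' characters")
    return {
        "fields": fields,               # field boundaries as written on the slide
        "widths": [len(f) for f in fields],
        "width": len(bits),             # total number of bits
        "value": int(bits, 2),          # packed value, most significant bit first
    }

if __name__ == "__main__":
    word = parse_context_word("0-0-11-00000-0000-000-11111001-00000000")
    print(word["width"], hex(word["value"]))
```

Applied to the first word on the slide, the fields add up to 32 bits.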
40
mSched: Kernel Scheduler
  • Optimization criteria
      • Context reloading (minimize).
      • Data reuse (maximize).
      • Overlap of RC computation and data
        movements (maximize).

Conflicting criteria: an optimal solution implies exploration of
the whole search space.
To improve performance, the proposed methodology divides the
problem into two guided subtasks:
  1. Partitioning of the DFG (data-flow graph).
  2. Scheduling within a partition.
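
To see why exploring the whole search space is costly, consider a toy exhaustive scheduler that optimizes only one of the criteria above, context reloading. This is not the mSched algorithm; the kernel names, the dependence list, and the cost function are assumptions chosen for illustration.

```python
from itertools import permutations

def context_reloads(order, context_of):
    """Count how often consecutive kernels need a different context set."""
    return sum(1 for a, b in zip(order, order[1:]) if context_of[a] != context_of[b])

def respects(order, deps):
    """True if every (producer, consumer) pair appears in that order."""
    pos = {k: i for i, k in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in deps)

def best_order_exhaustive(kernels, deps, context_of):
    """Try every valid ordering and keep one with the fewest reloads."""
    valid = (p for p in permutations(kernels) if respects(p, deps))
    return min(valid, key=lambda p: context_reloads(p, context_of))

if __name__ == "__main__":
    kernels = ["k0", "k1", "k2", "k3"]
    context_of = {"k0": "A", "k1": "B", "k2": "A", "k3": "B"}  # hypothetical
    deps = [("k0", "k3")]                                      # k0 must run before k3
    print(best_order_exhaustive(kernels, deps, context_of))
```

The number of orderings grows factorially with the number of kernels, which is the motivation for splitting the problem into partitioning and per-partition scheduling.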
41
Target Applications
  • Image Processing
      • Video Compression (multimedia)
      • Template Matching (target recognition)
      • Hyperspectral Imaging (video analysis)
      • Medical Diagnostic Imaging (MRI)
  • Digital Signal Processing
      • matrix and vector operations, FIR filters,
        Viterbi codecs, beam-forming (SONAR)
      • FFT, DCT/IDCT, wavelets (DWT)
  • Information security
      • Data Encryption/Decryption (DES, IDEA, 3-Way)

42
Video Compression: MPEG-2
Motion estimation for an 8x8 block with a 16x16 search
area (comparison is in clock cycles)
[Chart not reproduced in the transcript; an arrow labeled "Better" marks the preferred direction.]
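
For reference, the benchmark kernel can be sketched as a full-search block match. The slide does not name the matching metric; the sum of absolute differences (SAD) used here is a common choice and is an assumption, as are the function names.

```python
def sad(block, area, dy, dx):
    """Sum of absolute differences between block and the area window at (dy, dx)."""
    return sum(abs(block[y][x] - area[dy + y][dx + x])
               for y in range(len(block)) for x in range(len(block[0])))

def full_search(block, area):
    """Return (cost, dy, dx) of the best match over every window position."""
    n, m = len(block), len(area)
    best = None
    for dy in range(m - n + 1):
        for dx in range(m - n + 1):
            cost = sad(block, area, dy, dx)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best

if __name__ == "__main__":
    # Synthetic 16 x 16 search area; the reference block is cut out at (3, 5).
    area = [[(31 * y + 17 * x) % 256 for x in range(16)] for y in range(16)]
    block = [row[5:13] for row in area[3:11]]
    print(full_search(block, area))
```

An 8x8 block in a 16x16 area has 9 x 9 = 81 candidate positions with 64 absolute differences each, and the candidates are independent of one another, which is what makes the kernel data-parallel.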
43
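Video Compression: MPEG-2
M1 vs. other processors for 2-D DCT on an 8x8 block
(cycles). The other processors are reconfigurable, DSP,
general-purpose, and multimedia processors.
[Chart not reproduced in the transcript; an arrow labeled "Better" marks the preferred direction.]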
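
As a reference for what this benchmark computes, here is a plain 2-D DCT built from 1-D transforms over rows and then columns. It is a sketch of the standard orthonormal DCT-II, not the M1 mapping, and the function names are invented.

```python
import math

def dct_1d(v):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """2-D DCT: transform each row, then each column of the result."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]  # zip(*rows) walks the columns
    return [list(r) for r in zip(*cols)]          # transpose back to row-major

if __name__ == "__main__":
    flat = [[1.0] * 8 for _ in range(8)]
    print(round(dct_2d(flat)[0][0], 6))
```

The row pass followed by a column pass is the kind of access pattern that switching between row and column context broadcast would serve without moving data, though the slides do not spell out the actual M1 mapping.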
44
MPEG-2 Video Encoder
[Figure: MPEG-2 encoder block diagram. Blocks shown: pre-processing, DCT, quantization, zig-zag scan, VLC encoder, output buffer, regulator, inverse quantization, IDCT, motion estimation, motion compensation, and two frame memories. Signals shown: input stream, predictive frame, motion vectors, output stream.]
45
MPEG-2 Video Encoder
46
Automatic Target Recognition
M1 vs. other ATR systems for ATR Second Level of
Detection (1 template pair, 128x128 image)
47
Data Encryption (IDEA)
48
Future Work
  • Architecture
      • low power, enhanced functionality
      • reconfigurability for multiple data types
  • Software/Programming
      • Data-parallel function identification
      • Control flow generation for RC Array
      • Automatic mapping to RC (or multiple RCs)
  • Application Mapping
      • communication applications
      • bio-medical imaging
