Title: Synthesis of Heterogeneous Pipelined Multiprocessor Systems Using ILP: JPEG Case Study
1. Synthesis of Heterogeneous Pipelined Multiprocessor Systems Using ILP: JPEG Case Study
- Haris Javaid UNSW, Australia
- Sri Parameswaran UNSW, Australia
2. Overview
- Introduction
- Motivation
- Aims
- Case Study
- Experimental Setup
- Results
- Conclusion and Future Work
3. Introduction
- Transition in embedded devices: from single-processor systems to heterogeneous multiprocessor systems
  - Achieve high performance gains
  - Minimize area and power consumption
[Figure: single processor, homogeneous multiprocessor, and heterogeneous multiprocessor]
4. Introduction: ASIPs
- Application Specific Instruction Set Processors
- Instruction set and underlying architecture configured for a specific application
- Extensible processors consist of a base processor and optional instructions
[Figure: Heterogeneous Multiprocessor System using ASIPs. Three example processors; size labels in the figure: 32 KB, 8 KB, 4 KB, 8 KB.]
- Processor with base instructions, 1 KB instruction cache, 1 KB data cache, no additional instructions
- Processor with base instructions, 2 KB instruction cache, 4 KB data cache, 10 additional instructions
- Processor with base instructions, 16 KB instruction cache, 32 KB data cache, 25 additional instructions
5. Motivation
- Single processors cannot attain high performance through
  - Higher clock speeds
  - Instruction-level parallelism
- Billions of transistors available
  - Task-level parallelism can be exploited
- Multiprocessor System on Chip (MPSoC) with ASIPs as building blocks
  - Task-level parallelism (coarse-grained)
  - Instruction-level parallelism (fine-grained)
6. Related Work
- S. L. Shee et al., "Design Methodology for Pipelined Heterogeneous Multiprocessor System," in DAC 2007
  - Pipelined multiprocessor system using ASIPs
  - Heuristic to rapidly search the design space
  - Minimized runtime x area of the system
- F. Sun et al., "Synthesis of application-specific heterogeneous multiprocessor architectures using extensible processors," in VLSID 2005
  - Heuristic to simultaneously explore application partitioning and custom instructions for ASIPs
  - Minimize runtime within an area budget
7. Why Pipelined Multiprocessor Systems?
- S. L. Shee et al., "Heterogeneous Multiprocessor Implementations for JPEG: A Case Study," in CODES+ISSS 2006
  - A pipelined configuration is better for streaming applications
8. Aims
- To implement an application as a heterogeneous pipelined multiprocessor system
  - Application: JPEG compression algorithm
  - Six-processor pipelined system
- Minimize system area while the runtime constraint is satisfied
- Explore a design space consisting of different configurations for each processor in the system
  - Additional instructions
  - Differing instruction and data cache sizes
- Speed up the exploration process to target large design spaces
9. Case Study: JPEG Encoder
- Tasks 1-8
  - JPEG encoder kernel
  - Processes macro blocks one by one
- Tasks 9-11
  - Initialise quantization tables
  - Finalization functions
10. Case Study: JPEG Encoder
[Figure: six-processor pipeline implementation of the JPEG encoder]
11. Runtime Calculation
[Figure: a raw 256x128 image is divided into 512 macro blocks (Macro Block 1, Macro Block 2, Macro Block 3, ...) that move through the pipeline one after another. Example latencies (cycles) shown in the figure: 1000, 2000, 1300, 900, 1000.]
12. Runtime Calculation
- First macro block processing time
- Average latency of the critical processor
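The two quantities named on this slide suggest a simple pipeline timing model. The sketch below is an assumption, not the formula from the slides: it takes the runtime to be the time for the first macro block to pass through every stage, plus one critical-stage latency for each remaining macro block, and it uses a fixed latency per stage instead of an average. The function name `pipeline_runtime` is hypothetical; the latencies and the macro block count are the example values from the Runtime Calculation figure.

```python
def pipeline_runtime(stage_latencies, num_blocks):
    """Estimate total cycles for a linear pipeline (assumed model, not from the slides)."""
    # Time for the first macro block to traverse all stages.
    first_block_time = sum(stage_latencies)
    # The slowest stage (the critical processor) sets the steady-state rate.
    critical_latency = max(stage_latencies)
    return first_block_time + (num_blocks - 1) * critical_latency


latencies = [1000, 2000, 1300, 900, 1000]  # example latencies in cycles
print(pipeline_runtime(latencies, 512))    # 6200 + 511 * 2000 = 1028200
```

Under this model the critical processor dominates: with 512 macro blocks, the 2000-cycle stage accounts for almost all of the estimated runtime.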
13. Processor Configurations
[Figure labels: Program Executable; Tensilica's XPRES technology; Overhead Granularity; Base Processor; Configuration 1 through Configuration 5, each with Extended Instructions]
14. Pipelined Multiprocessor System
Runtime Constraint Satisfied
Minimum Area
15. Design Space Exploration
- Formulated the problem of mapping the processes of an application onto processor configurations as a 0-1 ILP problem
- Objective
  - Minimize the area of the overall system
- Constraints
  - Only one configuration can be selected for each processor
  - Amongst the selected configurations, one processor configuration is considered critical in the runtime calculation
  - System runtime < runtime constraint set by the designer
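To make the objective and constraints concrete, here is a minimal sketch that solves the same selection problem by exhaustive enumeration on a tiny instance. It is not an ILP solver and not the authors' formulation: it simply tries every "one configuration per processor" choice, which is only practical for very small design spaces. The runtime model (sum of latencies plus one critical latency per remaining macro block), the function name `select_min_area`, and all area and latency numbers are assumptions made for illustration.

```python
from itertools import product


def select_min_area(configs, num_blocks, runtime_limit):
    """Pick one (area, latency) option per processor, minimizing total area.

    configs[i] lists the candidate (area, latency) pairs for processor i.
    Returns (total_area, chosen_indices), or None if no choice meets the limit.
    """
    best = None
    # Each tuple in the product selects exactly one configuration per processor.
    for choice in product(*(range(len(opts)) for opts in configs)):
        picked = [configs[i][j] for i, j in enumerate(choice)]
        area = sum(a for a, _ in picked)
        latencies = [lat for _, lat in picked]
        # Assumed runtime model: the slowest selected processor is critical.
        runtime = sum(latencies) + (num_blocks - 1) * max(latencies)
        if runtime < runtime_limit and (best is None or area < best[0]):
            best = (area, choice)
    return best


# Hypothetical data: two candidate configurations for each of three processors.
example_configs = [
    [(10, 2000), (25, 1200)],
    [(8, 1500), (20, 900)],
    [(12, 1000), (30, 600)],
]
print(select_min_area(example_configs, 512, 800_000))  # (45, (1, 0, 0))
```

In this toy instance the cheapest configuration of the first processor is too slow to meet the limit, so the minimum-area design pays for a faster first processor and keeps the cheapest options elsewhere.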
16. Design Space Pruning
- Runtime constraint imposed by the designer
- Some configurations of a processor cannot be part of the optimal design
- Only removes the inferior processor configurations
- Three different times are defined
17. Design Space Pruning
[Figure: pruning example. Labels: Processor 0 selected; Min. Latency; Max(min latencies); Critical Processor; Min. Critical Processing Time; Pruned Design Space]
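One way to read the pruning idea is as a best-case bound check. The sketch below is an assumption, not the authors' algorithm: for each configuration of a processor, it sets every other processor to its minimum latency, computes the best runtime that configuration could ever be part of, and discards the configuration if even that best case misses the runtime constraint. The runtime model, the function name `prune`, and the data are hypothetical.

```python
def prune(configs, num_blocks, runtime_limit):
    """Drop configurations that cannot appear in any design meeting runtime_limit.

    configs[i] lists the candidate (area, latency) pairs for processor i.
    Assumed runtime model: sum(latencies) + (num_blocks - 1) * max(latencies).
    """
    # Minimum latency achievable by each processor.
    min_lat = [min(lat for _, lat in opts) for opts in configs]
    pruned = []
    for i, opts in enumerate(configs):
        others = min_lat[:i] + min_lat[i + 1:]
        others_sum = sum(others)
        others_max = max(others, default=0)  # max of the other minimum latencies
        kept = []
        for area, lat in opts:
            # Best case for this option: all other processors as fast as possible.
            best_runtime = lat + others_sum + (num_blocks - 1) * max(lat, others_max)
            if best_runtime < runtime_limit:
                kept.append((area, lat))
        pruned.append(kept)
    return pruned


# Hypothetical data: two candidate configurations for each of three processors.
example_configs = [
    [(10, 2000), (25, 1200)],
    [(8, 1500), (20, 900)],
    [(12, 1000), (30, 600)],
]
for options in prune(example_configs, 512, 800_000):
    print(options)
```

Because a configuration is removed only when it cannot satisfy the constraint under any choice for the other processors, this check never discards a feasible design under the assumed model. Here it removes the 2000-cycle option of the first processor and halves the toy design space from 8 points to 4.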
18. Experimental Setup
- Tensilica C/C++ compilation tools
- ISS and XTMP
  - Instruction Set Simulator
  - Multiprocessor environment
  - Queues are used to connect processors
- XPRES and TIE Compiler used to create tailored processors
- lp_solve is used as the 0-1 ILP solver
19. Results: JPEG Encoder
- Configurations include additional instructions and differing instruction and data cache sizes
- Design space: 4.2 x 10^13 design points
20. Results: JPEG Encoder
- Time Comparison of ILP Solver
21. Results: JPEG Encoder
- Pseudo Pareto optimal points of the design space
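As an illustration of what a set of Pareto optimal points looks like, the sketch below filters a list of (runtime, area) designs down to those that no other design beats on both measures. It shows plain Pareto filtering only; the slides do not define "pseudo" here, so that qualifier is not modelled. The function name `pareto_front` and the sample points are hypothetical, not results from the study.

```python
def pareto_front(points):
    """Return the (runtime, area) points not dominated by any other point.

    Both runtime and area are minimized. A point is dominated if another
    point is at least as good on both measures and differs from it.
    """
    front = []
    best_area = float("inf")
    # Visit points from fastest to slowest; ties are ordered by smaller area.
    for runtime, area in sorted(set(points)):
        # A slower design is worth keeping only if it is strictly smaller.
        if area < best_area:
            front.append((runtime, area))
            best_area = area
    return front


# Hypothetical (runtime in cycles, area) designs.
designs = [(615900, 75), (770200, 45), (700000, 60), (780000, 50), (620000, 80)]
print(pareto_front(designs))  # [(615900, 75), (700000, 60), (770200, 45)]
```

Each point on the front represents a trade-off: moving along it buys a smaller area at the cost of a longer runtime.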
22. Conclusion
- Formulated the mapping of an application onto ASIP configurations in a pipelined multiprocessor system as a 0-1 ILP problem
- Presented a novel design space pruning algorithm to reduce the complexity of the ILP problem
- Targeted a design space of 4.2 x 10^13 points, obtaining each of the pseudo Pareto optimal designs in less than 100 seconds
- Future work: design heuristics to search the design space faster, comparing with the ILP solutions
23. THANK YOU