Srijan: A Methodology for Synthesis of ASIP Based Multiprocessor SoCs Project Progress Presentation

Description:

Srijan: A Methodology for Synthesis of ASIP Based Multiprocessor SoCs ... in the proceedings of Workshop on Application Specific Processors (WASP) ...

Transcript and Presenter's Notes

1
Srijan: A Methodology for Synthesis of ASIP Based
Multiprocessor SoCs
Project Progress Presentation

Anshul Kumar

2
Outline

Introduction
Participants
Design Space
Methodology
Key Achievements
Work Done
Work in Progress

3
Srijan

Objective
To develop an integrated framework for
synthesis of embedded systems built around
ASIP (RISC and VLIW) based application specific
multiprocessors, in which both system-level and
subsystem-level design spaces can be
efficiently explored.
A 3-year project started in November 2002
Funded by the Naval Research Board, Govt. of
India

4
Participants

Faculty
Prof. Anshul Kumar (chief investigator)
Prof. M. Balakrishnan
Dr. Preeti Ranjan Panda
Prof. Subhashis Banerjee
Dr. Prem Kalra
Project Staff
Satya Kiran M.N.V.
Nitin Bhardwaj
Research Scholars (PhD)
Anup Gangwar
Basant Kumar Dwivedi
Students
M.Techs: 7
B.Techs: 12

5
Why Application Specific Multiprocessors?

Higher performance
Smaller area
Low power
Lower cost

[Figure: an application with a compute-intensive part and a control part, mapped onto a general purpose multiprocessor (no customization, avg. performance) versus an application specific multiprocessor (customization, higher performance)]
6
Customization Opportunities -> System Level
7
Customization Opportunities -> System Level

Compute units
No. and types: ASICs, VLIW or RISC ASIPs, DSPs
Interconnection Network
Shared bus, MINs or crossbar switches
Custom interconnection based on communication
pattern of the application
Memory architecture
Types: synchronous/asynchronous,
pipelined/non-pipelined, etc.
Various transfer modes
Custom memories: FIFOs, frame buffers, etc.

8
Customization Opportunities -> Processor Level
[Figure: processor datapath showing register files (RF1, RF2), functional units (FU1 to FU4, AFU1, AFU2) and their interconnect, annotated with three customization points: no. and type of register files, interconnect, and no. and type of FUs]
9
Customization Opportunities -> Processor Level

Functional Units
MISO, MIMO, MIMO with LD/ST
Rigid or flexible I/O timeshapes
Register File Clustering
Each FU can read from and write to only a subset
of registers
Area grows as N^3, Delay grows as N^(3/2), Power
grows as N^3
where N is the no. of Functional Units connected
to the register file
Powerful application analysis required to
minimize data copying
Interconnects
between different clusters and between clusters
and memory
Analysis of data access patterns required for
evaluating cost-performance tradeoffs
Current ASIP vendors do not offer customizable
interconnects
Instruction encoding and decoding
Reduce or remove explicit NOPs in code
Affects code size, object code compatibility,
branch misprediction penalty, hardware cost,
address specification in code size
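
To make the register file scaling above concrete, the following is a minimal sketch that compares one monolithic register file with an even split into clusters. It applies the N^3 / N^(3/2) / N^3 growth rates quoted on the slide in relative units only. The function names are hypothetical, and the sketch assumes that inter-cluster interconnect and data copying cost nothing, which is exactly the overhead the slide says must be analyzed.

```python
def regfile_cost(n_fus):
    """Relative area, delay and power of one register file serving n_fus FUs.

    Uses the growth rates from the slide: area ~ N^3, delay ~ N^(3/2),
    power ~ N^3. Values are unitless and only meaningful as ratios.
    """
    return {"area": n_fus ** 3, "delay": n_fus ** 1.5, "power": n_fus ** 3}


def clustered_cost(n_fus, n_clusters):
    """Relative cost when n_fus are split evenly over n_clusters register files.

    Assumption (not from the slides): inter-cluster interconnect and copy
    operations are not costed, so this is an optimistic bound.
    """
    if n_clusters <= 0 or n_fus % n_clusters != 0:
        raise ValueError("n_fus must divide evenly into n_clusters")
    per_cluster = regfile_cost(n_fus // n_clusters)
    return {
        "area": n_clusters * per_cluster["area"],    # all clusters add up
        "delay": per_cluster["delay"],               # clusters work in parallel
        "power": n_clusters * per_cluster["power"],
    }


if __name__ == "__main__":
    for clusters in (1, 2, 4):
        print(clusters, clustered_cost(8, clusters))
```

Under these assumptions, splitting 8 FUs into 2 clusters cuts relative area and power by a factor of 4, which is why minimizing the data copying that clustering introduces matters so much.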

10
Overall Methodology
Parallel Application Model
Constraints
Manual parallel model refinement
Refined performance numbers
Verification using simulation
FPGA Prototype
RTOS Specialization
11
System Level Exploration
Parallel Application Model
Constraints
Component Library

Estimations
communication time
context switch overheads etc.

Annotated Task Graph
Estimator

Measurements
- Performance
- Resource utilizations
- Power

Processor Selection
Partitioning
Interconnection Arch. Evaluation
Memory Arch. Evaluation
Constraints met?
Y: System Arch. Description
N: Reconfiguration
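
The exploration flow on this slide (select processors, partition, evaluate, check constraints, otherwise reconfigure) can be illustrated with a very small exhaustive search. This is only a sketch under strong assumptions: the component library, its speed and cost figures, and the function names are all hypothetical, tasks are independent, and communication time and context switch overheads are ignored. The actual Srijan estimator is not described in enough detail in the slides to reproduce.

```python
from itertools import product

# Hypothetical component library: relative speed and cost per processor type.
LIBRARY = {
    "risc": {"speed": 1.0, "cost": 1.0},
    "vliw": {"speed": 2.5, "cost": 3.0},
}


def estimate(tasks, mapping, procs):
    """Return (finish_time, cost) for one candidate architecture.

    tasks   : list of work amounts, one per task
    mapping : mapping[i] is the index of the processor running task i
    procs   : tuple of processor type names taken from LIBRARY
    """
    load = [0.0] * len(procs)
    for work, proc_index in zip(tasks, mapping):
        load[proc_index] += work / LIBRARY[procs[proc_index]]["speed"]
    cost = sum(LIBRARY[name]["cost"] for name in procs)
    return max(load), cost


def explore(tasks, deadline, max_procs=2):
    """Return the cheapest (cost, procs, mapping, time) meeting the deadline, or None."""
    best = None
    for count in range(1, max_procs + 1):
        for procs in product(LIBRARY, repeat=count):                   # processor selection
            for mapping in product(range(count), repeat=len(tasks)):   # partitioning
                time, cost = estimate(tasks, mapping, procs)
                # "Constraints met?" check; failing candidates are simply skipped,
                # which plays the role of the reconfiguration loop.
                if time <= deadline and (best is None or cost < best[0]):
                    best = (cost, procs, mapping, time)
    return best


if __name__ == "__main__":
    print(explore([4, 4], deadline=4))
```

Exhaustive enumeration only works for toy inputs; it is used here because it keeps the select, partition, estimate, check structure of the slide visible.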
12
VLIW ASIP Synthesis Methodology
Task Set and Constraints
Architecture Description
Application Parameter Extraction
Architecture Design Space Exploration
Retargetable Compiler
Instruction Encoding Specialization
Validation (Simulation with encoded instructions)
Architecture Description (Output to synthesizer)
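
As an illustration of the instruction encoding specialization step, the sketch below removes explicit NOPs from a VLIW bundle by storing a slot mask plus only the non-NOP operations. Mask-based compression is a generic technique chosen here for illustration; the slides do not say which encoding Srijan uses, and the names `encode`, `decode` and `NOP` are hypothetical.

```python
NOP = "nop"


def encode(bundle):
    """Compress one VLIW bundle into (mask, ops).

    Bit s of mask is set when issue slot s holds a real operation;
    ops lists those operations in slot order, with NOPs dropped.
    """
    mask = 0
    ops = []
    for slot, op in enumerate(bundle):
        if op != NOP:
            mask |= 1 << slot
            ops.append(op)
    return mask, ops


def decode(mask, ops, width):
    """Rebuild the full bundle of `width` slots from (mask, ops)."""
    remaining = iter(ops)
    return [next(remaining) if mask & (1 << slot) else NOP
            for slot in range(width)]


if __name__ == "__main__":
    bundle = ["add", NOP, NOP, "ld"]
    mask, ops = encode(bundle)
    print(bin(mask), ops)
    print(decode(mask, ops, width=4))
```

The trade-off is the one listed under instruction encoding and decoding earlier: code size shrinks, but the decoder needs extra hardware to expand the mask, and variable-length bundles complicate branch targets.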
13
Validation Framework
Task-set
Architecture Description
Retargetable Compiler
Retargetable Assembler
Performance and Power Numbers
Output to Other Tools
Retargetable Simulator
Power Consumption Information Gen.
14
Frameworks in Place

System Level Activities
Synthesis framework for application specific
multiprocessors
Heterogeneous multiprocessor simulation
infrastructure
Prototype and validation platform for LEON based
multiprocessor SoC
Real time kernel for multiprocessor LEON
A random process network generator
Subsystem Level Activities
A framework to evaluate clustered VLIW processors
Synthesizable RTL for single cluster VLIW
processor
High level synthesis framework with optimizations

15
Work Done: System Level

System Level Design Space Exploration
Addresses the problem of synthesizing and mapping
process networks onto heterogeneous multiprocessors
Existing work
Deals mostly with acyclic process networks
Assumes data-independent process behavior
Models have been developed to estimate the
additional delays due to communication conflicts
Statistical process behavior is exploited to
provide cheaper solutions
Quality-of-service and energy efficiency are also
being considered
One publication in VLSI Design, Jan. 2004, Mumbai,
India

16
Work Done: System Level (cont.)

Validation
Estimation models are validated against
performance statistics generated by system
simulation
Requires a simulator framework that is flexible
enough to assemble a heterogeneous multiprocessor
with highly customized cores
Should be fast enough to be able to simulate
realistic applications
Developed SrijanSim, a cycle-accurate simulator
framework centered around transaction level
modeling
Generic and highly modular simulator
Uses both state-of-the-art and novel techniques
to expedite simulation
Supports a rich variety of system components and
communication architectures
Reduces model development time and system
composition effort
Implemented simulation models of a retargetable
VLIW ASIP core, SRAM, FIFO memory, point-to-point
links, shared buses within this simulation
framework
Planning to release the simulator for public
access in the first quarter of 2004

17
Work Done: Sub-system Level

Subsystem Level Design Space Exploration
Exploring the design choices in VLIW ASIP
Analyzed real applications to judge the
suitability of high ILP architectures
Identified the design space for inter-cluster
communication
Built a framework to analyze various
inter-cluster communication mechanisms
Systematic evaluation of the impact of various
FU-FU, FU-RF and RF-RF interconnection
architectures on achievable ILP
Demonstrated that the most commonly used type of
interconnection, RF-to-RF, is not a good
candidate
Proposed a new interconnection network
Accepted for publication in the proceedings of
Workshop on Application Specific Processors (WASP)

18
Work Done: Sub-system Level (cont.)

Power and Energy Estimation
Recently, power has become the driving design
constraint
More power optimization opportunities exist at
the system and sub-system synthesis levels, but
exploiting them requires reliable estimates
Set up the tool flow for power characterization
Power library for VLIW ASIP components
In this project we built power models for various
VLIW components and developed a methodology to use
them in the design space exploration phase
Memory Synthesis
Customization of Cache Memory for Embedded
Systems
Goal of this project was to generate an on-chip
cache configuration by performing application
analysis and estimations
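
A minimal sketch of what application-driven cache customization can look like: simulate candidate configurations on an address trace and keep the smallest one that meets a miss-rate target. This assumes a direct-mapped cache and a trace-driven approach; the slides do not state the project's actual analysis or estimation method, and the function names and the example trace are hypothetical.

```python
def miss_rate(trace, num_lines, line_size):
    """Miss rate of a direct-mapped cache on a list of byte addresses."""
    if not trace:
        raise ValueError("trace must not be empty")
    stored = [None] * num_lines          # block number held by each cache line
    misses = 0
    for address in trace:
        block = address // line_size
        index = block % num_lines
        if stored[index] != block:       # cold or conflict miss
            stored[index] = block
            misses += 1
    return misses / len(trace)


def smallest_cache(trace, configs, target):
    """Return the (num_lines, line_size) with the least capacity whose miss
    rate is at most `target`, or None if no candidate qualifies."""
    feasible = [(lines * size, lines, size)
                for lines, size in configs
                if miss_rate(trace, lines, size) <= target]
    if not feasible:
        return None
    _, lines, size = min(feasible)
    return lines, size


if __name__ == "__main__":
    trace = [0, 4, 8, 12, 0, 4, 8, 12]   # hypothetical access pattern
    candidates = [(1, 4), (4, 4), (1, 16)]
    print(smallest_cache(trace, candidates, target=0.2))
```

A real flow would replace the simulation with faster estimates and add associativity and energy to the search, but the selection loop stays the same.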

19
Work Done: Sub-system Level (cont.)

Implementation and prototyping
Requires synthesizable descriptions of various
components of the target architecture
We use the synthesizable LEON as the RISC
processor, but no comparable core is available
for VLIW
Designed and implemented a synthesizable VLIW
core
Studied various micro-architectural choices
available
Designed a parameterized VLIW processor
Synthesized the core using ASIC synthesis tools
and the VTVT standard-cell library from Virginia
Tech
A 4-issue-slot configuration runs at a clock
speed of 200 MHz in 0.25 um technology; higher
clock speeds (up to 400 MHz, we hope) should be
possible with more sophisticated libraries
This core is useful not only for prototyping but
also in generating realistic power and
performance estimates for high level exploration
tools

20
Work Done: Behavioral Synthesis

Source level optimizations
Goal was to enhance the C-to-VHDL translator with
optimizations such as loop unrolling and
bit-width analysis
FSM derivation from SystemC
SystemC is becoming a de facto standard for system
specification
Goal was to develop a framework for hardware
synthesis from SystemC specification by
leveraging available high-level synthesis tool
support
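
Bit-width analysis, one of the source level optimizations named above, can be sketched as value-range propagation: track the range of each expression and size the hardware signal to the smallest width that holds it. This is a generic illustration, not the C-to-VHDL translator's actual algorithm, and the helper names are hypothetical.

```python
def bits_for_range(lo, hi):
    """Smallest width that represents every integer in [lo, hi].

    Unsigned if lo >= 0, two's complement otherwise.
    """
    if lo > hi:
        raise ValueError("empty range")
    if lo >= 0:
        return max(1, hi.bit_length())
    # Two's complement: magnitude bits for both ends, plus a sign bit.
    return max((-lo - 1).bit_length(), max(hi, 0).bit_length()) + 1


def add_range(a, b):
    """Range of x + y for x in a and y in b (ranges are (lo, hi) tuples)."""
    return (a[0] + b[0], a[1] + b[1])


def mul_range(a, b):
    """Range of x * y for x in a and y in b."""
    corners = [x * y for x in a for y in b]
    return (min(corners), max(corners))


if __name__ == "__main__":
    pixel = (0, 255)                      # hypothetical 8-bit input
    result = add_range(mul_range(pixel, pixel), pixel)
    print(result, bits_for_range(*result))
```

For the expression x*y + x with 8-bit unsigned inputs, the result fits in 16 bits, whereas a direct translation of a C `int` would typically allocate 32.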

21
Work Done: Software Synthesis

Compiler optimization to exploit pipeline
registers and forwarding circuitry
Idea is to maximize the utilization of available
architectural resources through code
optimizations
This optimization aims to reduce pressure on the
register file, a potential bottleneck in
multiple-issue processors
Goal was to design and modify the scheduling and
register allocation passes in IMPACT compiler to
incorporate this optimization
Extensions to RtKer
RtKer is an in-house real-time OS that has been
ported to x86, ARM and TriMedia
First goal was to map RtKer onto the LEON
multiprocessor
Second goal was to develop a framework to
customize scheduler(s)
Binary utilities for multiprocessor code
generation
Objective was to develop assembler and linker
tools to generate memory footprints for various
multiprocessor architectures
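
The compiler optimization at the top of this slide exploits pipeline registers and forwarding circuitry to relieve the register file. One way to picture the effect is the sketch below: a value whose uses all fall within a forwarding window after its definition is assumed to be read from the bypass path and never needs an architectural register. The window model, the data layout and the function name are assumptions made for illustration; the actual IMPACT scheduling and register allocation passes are not described in the slides.

```python
def peak_pressure(values, window=0):
    """Peak number of simultaneously live values that need a register.

    values : dict name -> (def_cycle, [use_cycles])
    window : assumed number of cycles after definition during which a
             result can still be taken from the forwarding path
    """
    events = []
    for def_cycle, uses in values.values():
        if not uses:
            continue
        if all(use - def_cycle <= window for use in uses):
            continue                      # served entirely by forwarding
        events.append((def_cycle, 1))     # register becomes live
        events.append((max(uses), -1))    # register freed at last use
    live = peak = 0
    # Sorting puts frees (-1) before allocations (+1) within the same cycle.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


if __name__ == "__main__":
    schedule = {"a": (0, [1]), "b": (0, [1]), "c": (0, [3])}  # hypothetical
    print(peak_pressure(schedule, window=0))
    print(peak_pressure(schedule, window=1))
```

In this toy schedule, a one-cycle forwarding window removes two of the three short-lived values from the register file, which is the kind of pressure reduction the optimization is after.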

22
Work Done: Case Studies and Prototyping

Application modeling and case studies
LipSync
Converts text into an audiovisual speech
stream incorporating the lip movements
Goal was to map the LipSync application onto an
embedded ARM platform
Ray Tracer
A very computationally intensive graphics
rendering technique
Objective was to develop an FPGA hardware
accelerator for the computation-intensive part
and interface it with the host through PCI
Prototyping
Extended LEON-MP, a shared-bus, shared-memory
multiprocessor built around LEON, to
incorporate local memories, which reduce
contention for global resources

23
Current Status
24
Work in Progress

Energy aware synthesis of application specific
multiprocessors
Impact of inter-cluster connectivity on clock
period in clustered VLIW processors
Case studies on MPEG4 and text-to-speech
applications
Loop unrolling optimizations in high level
synthesis
Cache design space exploration
Extensions to RtKer-MP

25
Expenditure