Exploring VLIW ASIP Design Space using Trimaran Based Framework

1
Exploring VLIW ASIP Design Space using Trimaran
Based Framework

Under the guidance of
Prof. Anshul Kumar
Department of Computer Science & Engineering
IIT Delhi.

Diviya Jain 2001MCS022
2
Motivation

Application Specific Instruction Set Processors
(ASIPs)
Combine the advantages of ASICs and GPPs.
Achieve the desired performance, cost and power
consumption for a targeted set of applications
through:
Processor Specialisation.
Instruction Set Customization.

3
Processor Specialisation Techniques

Instruction Set Specialisation.
Function Unit & Data Path Adaptation.
Memory Specialisation.
Cache Configuration.
Interconnect Specialisation.
Current Approach:
Instruction Set Specialisation.
Function Unit & Data Path Specialisation.

4
Objectives

Design of a largely automated framework for ASIP
design & Evaluation.
Enable automatic Identification, Evaluation and
Selection of critical parts of the application to
be mapped on special FUs.
Validate the performance of the ASIP designed
through the suitably modified Trimaran Framework.

5
Architectural Choices

Instruction Level Parallelism (ILP) for high
performance.
ILP choices: Superscalar and VLIW.
Our Target Architecture:
VLIW architecture
Core with fine-grain FUs.
Application Specific coarse-grain FUs.
Core may be fixed or parameterized.

6
Why Coarse-Grain FUs?

Chaining a sequence of operations reduces
computation time.
Limited-resolution operations lead to faster
hardware.
Concurrent operations within a group are more
easily parallelized.
Intermediate results are stored locally, reducing
Register Pressure.
Collapsing a series of operations into one
instruction shortens code and reduces pressure on
I-cache.

7
AFU Design Space

Type of code mapped:
Clusters of elementary arithmetic or logic
operations.
Complex, control-oriented sections, e.g. complete
loops.
The Number of Inputs and Outputs to the AFU.
The Number of AFUs or Number of operations per
single AFU.
Accessibility to Memory.

8
Spectrum of Custom FUs
[Figure: spectrum of custom FUs; visible label: REGISTER FILE]
9
Automatic Processor Specialization Process
Application Code
Automatic Identification Of Coarse Grained FU
Identification Algorithm
Evaluation Of FUs identified
Selection Criteria
Selection Of FUs
Modification of Application Code
Customization Of Machine Architecture
Instruction Set Specialisation
Code Generation
Simulation
Performance Statistics
Synthesis
10
ASIP Architecture Exploration Requirements

Execution & Evaluation of Application Code
on Simulated Architectures.
Tools needed:
Language for Architecture Description.
Re-targetable Compiler or Compiler Generator.
Reconfigurable Simulator.

11
Trimaran Framework

The infrastructure comprises the following
components:
A Machine description language, HMDES, for
describing ILP architectures.
A parameterized ILP Architecture called HPL-PD.
A compiler front-end (IMPACT) for C, performing
parsing, type checking, and a large suite of
high-level (i.e. machine independent)
optimizations.

12
Trimaran Framework

A Compiler back-end (ELCOR), parameterized by a
machine description, performing instruction
scheduling, register allocation, and machine
dependent optimizations.
An Extensible IR called Rebel.
A Cycle-level Simulator of the HPL-PD
architecture, which is configurable by a machine
description and provides run-time information on
execution time, branch frequencies, and resource
utilization.
An Integrated graphical user interface (GUI) for
configuring and running the Trimaran system.

13
Trimaran Compiler Infrastructure
14
Limitations of Trimaran

Restricted to the HPL-PD (PlayDoh) architecture.
Introduction of a new instruction requires
modification of the Trimaran source code.
Since it is a VLIW compiler, instructions cannot
have variable latency (this excludes the
exploration of conditionals and loops).

15
Earlier Framework
Original C Program
PERL SCRIPT (for automation)
Patch Files
Instrumented C Program
IMPACT
ELCOR
SIMULATOR GENERATOR
Generated Simulator
Results and Statistics
MDES(Description of FUs)
16
Limitations Of Earlier Framework

Manual selection of Application Code to be mapped
on the special FUs designed:
Failure to identify potentially good candidates.
Allowed validation of the framework only on small
benchmarks.
Instrumented Code generated was not completely
functional:
Erroneous Code Profiling.
Frequently the code failed to pass through the
Trimaran Front End Compiler.
Sub-optimally scheduled code.
17
Limitations Of Earlier Framework

Poor Design Of Evaluation Framework:
Instruction Set Specialisation was delayed to the
Elcor Stage of the Trimaran Framework.
A large number of data movement instructions were
introduced by the Front End Compiler.
Execution Statistics obtained did not match the
estimated performance gain.

18
Identification of Special FUs
Application Code
Identification Algorithm
Automatic Identification Of Coarse Grained FU
Evaluation Of FUs identified
Selection Criteria
Selection Of FUs
19
MachSUIF

Features Of MachSUIF
IR has a DFG/CFG representation, but is
architecture independent.
Operations resemble generic assembly operations.
Provides control and data flow libraries.
Well suited to the identification process, which
depends on the topological characteristics of
the DAG constructed.

20
Inefficient MachSUIF IR

Original Source Code
for (i = 0; i < NUM; i++)
    a[i] = i;
Trimaran failed to recognize the for loop.
Loop optimizations, e.g. loop unrolling and
software pipelining, were not performed.

Layout of the IR generated:
loop condition check
loop body
exit code
21
Generation of Efficient IR

The IR generated is modified to add a loop
condition check along with the loop body.
Trimaran then recognized the for loop, and loop
optimizations were performed.

Layout of the modified IR:
loop condition check
loop body + condition check
exit code
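
The two IR layouts listed on the slides above can be read as a top-tested loop and a bottom-tested loop. The C sketch below only illustrates that reading: the function names `fill_top_tested` and `fill_bottom_tested` and the value `NUM = 8` are hypothetical, and `goto` stands in for the IR branches. It is not the actual MachSUIF or Trimaran IR.

```c
#include <assert.h>
#include <stddef.h>

#define NUM 8  /* hypothetical loop bound, for illustration only */

/* Layout of the original IR: condition check at the top,
   body in a separate block that jumps back to the check. */
void fill_top_tested(int *a) {
    int i = 0;
    assert(a != NULL);
check:
    if (!(i < NUM)) goto done;   /* loop condition check */
    a[i] = i;                    /* loop body */
    i++;
    goto check;
done:
    return;                      /* exit code */
}

/* Layout of the modified IR: one guard before the loop,
   then a body block that ends with its own condition check. */
void fill_bottom_tested(int *a) {
    int i = 0;
    assert(a != NULL);
    if (!(i < NUM)) goto done;   /* loop condition check */
body:
    a[i] = i;                    /* loop body */
    i++;
    if (i < NUM) goto body;      /* condition check inside the body block */
done:
    return;                      /* exit code */
}
```

Both functions compute the same result. In the second form the body and its back-edge test sit in a single block, which is the shape the slides report Trimaran recognizing as a loop.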
22
MISO Identification Algorithm

for all Node ∈ Nodes_to_be_analysed
    Generate_MaxMISO(Node)
    Nodes_to_be_analysed = Nodes_to_be_analysed - Nodes_in_MaxMISO

Generate_MaxMISO(Node)
    for all Parent_of_Node ∈ Parents_of_Node(Node)
        if (fanout_of_Parent_of_Node == 1)
            include(Parent_of_Node)
            Generate_MaxMISO(Parent_of_Node)
        else
            fanout_of_Parent_of_Node--
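
The pseudocode above can be sketched in C roughly as follows. Several details are assumptions made for this illustration and do not come from the slides: the `Dag` layout, the limits `MAX_NODES` and `MAX_PARENTS`, numbering the nodes in topological order, visiting candidate roots from the last node backwards, and restarting from a fresh copy of the fanout counters for every root so that decrements made while growing one MISO do not leak into the next.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES   32   /* hypothetical capacity limits */
#define MAX_PARENTS 4

/* Hypothetical DAG layout: nodes are numbered in topological order,
   so every parent has a smaller index than its children. */
typedef struct {
    int n;                                 /* number of nodes */
    int num_parents[MAX_NODES];
    int parents[MAX_NODES][MAX_PARENTS];
    int fanout[MAX_NODES];                 /* number of consumers of each node */
} Dag;

void add_edge(Dag *g, int parent, int child) {
    assert(parent >= 0 && parent < child && child < g->n);
    assert(g->num_parents[child] < MAX_PARENTS);
    g->parents[child][g->num_parents[child]++] = parent;
    g->fanout[parent]++;
}

/* Generate_MaxMISO: absorb a parent once its last remaining consumer
   has been reached from inside the MISO, otherwise count the visit. */
void grow_miso(const Dag *g, int node, int id, int *remaining, int *miso_of) {
    for (int k = 0; k < g->num_parents[node]; k++) {
        int p = g->parents[node][k];
        if (miso_of[p] != -1)
            continue;                      /* already assigned elsewhere */
        if (remaining[p] == 1) {
            miso_of[p] = id;               /* include(Parent_of_Node) */
            grow_miso(g, p, id, remaining, miso_of);
        } else {
            remaining[p]--;                /* fanout_of_Parent_of_Node-- */
        }
    }
}

/* Outer loop: every node not yet covered becomes the root of a new MISO.
   Returns the number of MISOs; miso_of[i] is the MISO id of node i. */
int find_max_misos(const Dag *g, int *miso_of) {
    int count = 0;
    for (int i = 0; i < g->n; i++)
        miso_of[i] = -1;
    for (int root = g->n - 1; root >= 0; root--) {
        int remaining[MAX_NODES];
        if (miso_of[root] != -1)
            continue;
        memcpy(remaining, g->fanout, sizeof remaining);
        miso_of[root] = count;
        grow_miso(g, root, count, remaining, miso_of);
        count++;
    }
    return count;
}
```

For a graph where node 2 consumes nodes 0 and 1 and is itself consumed by nodes 3 and 4, the sketch groups {0, 1, 2} into one MISO and leaves 3 and 4 as single-node MISOs, because node 2 has two consumers that end up in different groups.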

23
Implementation Of Identification Algorithm

Identified all the MISOs in the Application Code.
Capable of including or excluding memory
operations as a part of a MISO.
Annotated each instruction with the
identification information.
Annotated the inputs and output of the MISO.
Generated graphical representation of each MISO
identified.

24
Evaluation & Selection Technique

Let λ_sw represent the execution time of an
instruction on the processor.
Let λ_hw represent the relative delay of an
operation when executed on a dedicated hardware unit.
λ_hw is expressed as a fraction or a multiple of
the 32-bit multiply-accumulate delay.
Let CP represent the critical path delay of the
MISO identified.
Let f represent the number of times a basic block
is executed.

25
Evaluation & Selection Technique

Execution time of a MISO on the processor is
calculated as
T_sw = Σ λ_sw  (summed over all instructions of the MISO)
Execution time of a MISO on a special FU is
calculated as
T_hw = ceil(CP_hw)
Thus the Speed Up Potential of a MISO is calculated
as
SpeedUp = (T_sw - T_hw) × f
Finally, the best N candidates expected to provide
the highest Speedup are selected.
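
A minimal C sketch of this evaluation and ranking step follows. It assumes that the speed-up potential is the cycle difference between software and hardware execution multiplied by the basic-block execution count, and that one multiply-accumulate delay corresponds to one cycle. The struct `Miso` and the functions `ceil_cycles`, `speedup_potential` and `rank_by_speedup` are hypothetical names, not part of the framework.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for one identified MISO. */
typedef struct {
    double t_sw;   /* sum of software latencies of its instructions (cycles) */
    double cp_hw;  /* critical-path delay in hardware, in MAC-delay units */
    long   freq;   /* execution count of the enclosing basic block */
} Miso;

/* Round a non-negative delay up to a whole number of cycles. */
long ceil_cycles(double x) {
    assert(x >= 0.0);
    long c = (long)x;
    return ((double)c < x) ? c + 1 : c;
}

/* SpeedUp = (T_sw - T_hw) * f, with T_hw = ceil(CP_hw). */
double speedup_potential(const Miso *m) {
    return (m->t_sw - (double)ceil_cycles(m->cp_hw)) * (double)m->freq;
}

/* qsort comparator: larger speed-up potential sorts first. */
int by_speedup_desc(const void *a, const void *b) {
    double sa = speedup_potential((const Miso *)a);
    double sb = speedup_potential((const Miso *)b);
    return (sa < sb) - (sa > sb);
}

/* After this call, the first N entries are the best N candidates. */
void rank_by_speedup(Miso *candidates, size_t count) {
    qsort(candidates, count, sizeof *candidates, by_speedup_desc);
}
```

For example, a MISO with T_sw = 10 cycles, CP_hw = 2.3 and 100 executions has a potential of (10 - 3) × 100 = 700 cycles. As a later slide notes, this estimate assumes the MISO's instructions run one after another on the processor, so it can overstate the gain on a VLIW core.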

26
Modeling Of MISOs

Instrumented Source Code
(Current Framework)
int FU_miso(int a, int b, int c) {
    return a + b*c + b*c;
}
int main() {
    int a, b, c;
    while (a < 1000) {
        /* identified MISO */
        a = FU_miso(a, b, c);
    }
}

Instrumented Source Code
(Earlier Framework)
int FU_miso(int a, int b, int c) {
    return 1;
}
int main() {
    int a, b, c;
    while (a < 1000) {
        /* identified MISO */
        a = FU_miso(a, b, c);
    }
}

Original Source Code
int main() {
    int a, b, c;
    while (a < 1000) {
        /* identified MISO */
        a = a + b*c + b*c;
    }
}

27
Advantages of New Approach

Completely functional instrumented code.
Eliminates erroneous profiling.
Generation of optimally scheduled code.
Elimination of Semantic Analysis.
Elimination of illegal memory access and hence
segmentation faults.

28
Modeling Of MIMOs

Instrumented Source Code
(Earlier Framework)
void FU_mimo(int a, int b) {}
int main() {
    int a, b, c, d;
    int j1, j2, r1, r2;
    scanf("%d", &j1);
    scanf("%d", &j2);
    r1 = j1;
    r2 = j2;
    /* identified MIMO */
    FU_mimo(a, b);
    c = r1;
    d = r2;
}

Instrumented Source Code
(Current Framework)
int FU_mimo_one(int a, int b) {
    return a + b;
}
int FU_mimo_two(int a, int b) {
    return a - b;
}
void FU_mimo(int a, int b) {}
int main() {
    int a, b, c, d;
    /* identified MIMO */
    c = FU_mimo_one(a, b);
    d = FU_mimo_two(a, b);
    FU_mimo(a, b);
}

Original Source Code
int main() {
    int a, b, c, d;
    /* identified MIMO */
    c = a + b;
    d = a - b;
}

29
Advantages Of the Approach

Completely functional instrumented code.
No need to explicitly reserve registers through
introduction of scanf instructions.
No Erroneous profiling.
Generation of optimally scheduled code.

30
MISO/MIMO with load/store units

Modeled in exactly the same manner as the
MISO/MIMO.
During Resource definition, the load/store unit is
reserved for a few cycles before and after the
computation for memory access.

Original Code
int main() {
    int a[10];
    for (int i = 0; i < 10; i++)
        a[i] = a[i] + i*2;
}
Modified Code
int FU_miso_ld(int a, int i) {
    return a + i*2;
}
int main() {
    int a[10];
    for (int i = 0; i < 10; i++)
        FU_miso_ld(a[i], i);
}

31
Instruction Set Specialisation in Earlier Framework

The function call representing the new
instruction to be introduced passed through
Impact without any modifications.
Impact requires the function call arguments to be
present either in Macro Registers or on the
stack.
The introduction of a new machine instruction was
delayed to the Elcor Stage.

32
Overheads Introduced

Source Code
d = FU_main_mimofun(a, b, c, d, e)
Impact Generated IR
(op 113 st_f2 (mac OP i)(i -24)(r 36 f2) <(tm (i 300))>)
(op 115 st_i (mac OP i)(i -28)(r 111 i) <(tm (i 301))>)
(op 116 st_f2 (mac OP i)(i -36)(r 41 f2) <(tm (i 302))>)
(op 117 mov_f2 (mac P5 f2) (r 26 f2))
(op 118 mov_f2 (mac P7 f2) (r 31 f2))
(op 119 jsr (l_g_abs fn_FU_main_mimofun) <(tr (mac P5 f2)(mac P7 f2)) (tm (i 300)(i 301)(i 302)) (tmo (i -24)(i -28)(i -36)) (ret (mac P4 f2)) (param size (I 36))>(call info (s_l_abs doubledoubledoubledoubleintdouble )))

33
Impact Compilation Phases
C Language Source
Pcode
Hcode
Machine Independent Lcode
Mcode
Target Architecture Code
34
Selection Of Hcode Phase

Pcode generated needs to be reverse translated
into C for execution and subsequent collection of
profiling information.
At Lcode level, data movement instructions are
already introduced. Elimination requires complex
handling of data dependencies.
Hcode forms a natural choice:
No extra overhead is yet introduced.
Instruction Set Extension is easy to accomplish.

35
Customization of Machine Architecture

A new Functional Unit is introduced using HMDES,
the machine description language.
Operation format, latency, resource usage etc.
are all specified.
Semantics of the special machine instruction are
provided to the new Simulator.
Trimaran Back End Compiler is modified so that
it recognizes the new machine instruction and
optimally schedules the execution of the new
instruction on the special FU.

36
Enhanced Performance Evaluation Framework
Original C Program
IMPACT
Identification & Selection Of FUs
Pcode
Hcode
Lcode
Instrumented C Program
ELCOR
SIMULATOR GENERATOR
Generated Simulator
Results and Statistics
MDES(Description of FUs)
37
Case Studies (Kalman Filter)

Modeled 5 MISOs with Load/Store Units.
The latency of each FU was set according to the
amount of computation involved.

38
Discussion Of Results (Kalman Filter)

Kalman_Update: the better performance evaluated
can be attributed to:
Removal of the extra data movement instructions
that were required in the earlier framework.
Reuse of register values containing memory
addresses for successive MISOs.
Predict_State: the performance efficiency evaluated
remains the same:
Completely different memory addresses are required
for successive MISOs.
Addresses generated are stored in GPRs instead of
Macro registers. Thus, elimination of data
movement instructions is of no use.

39
Case Studies (Fast Fourier Transform)

Modeled a MIMO performing the butterfly operation.
Latency of the MIMO is assumed to be 8.
MIMO has 6 sources and 4 destinations.

[Figure: butterfly with inputs a, b and w, and outputs a + bw and a - bw]
a = ar + i(ac), b = br + i(bc), w = wr + i(wc), i = √-1
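
The butterfly modeled as a MIMO can be written out in C as below, which makes the count of 6 sources and 4 destinations explicit. This is a sketch of the arithmetic only, not of the FU model used in the framework: the function name `bfly` is taken from the operation label on a later slide, the source names follow the slide's real and complex parts (ar, ac, br, bc, wr, wc), and the destination names `xr`, `xc`, `yr`, `yc` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Radix-2 butterfly: x = a + b*w, y = a - b*w, on complex values.
   6 sources: real and complex parts of a, b and w.
   4 destinations: real and complex parts of x and y. */
void bfly(double ar, double ac, double br, double bc,
          double wr, double wc,
          double *xr, double *xc, double *yr, double *yc) {
    assert(xr != NULL && xc != NULL && yr != NULL && yc != NULL);

    /* Complex product t = b * w. */
    double tr = br * wr - bc * wc;
    double tc = br * wc + bc * wr;

    *xr = ar + tr;   /* real part of a + bw */
    *xc = ac + tc;   /* complex part of a + bw */
    *yr = ar - tr;   /* real part of a - bw */
    *yc = ac - tc;   /* complex part of a - bw */
}
```

Written out this way, the operation contains four multiplications and six additions or subtractions, which gives a feel for how much work is collapsed into the single MIMO whose latency the slide assumes to be 8.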
40
Results (Fast Fourier Transform)

Overheads due to additional scanfs are removed.
Quality of the code generated by the new
framework is much better.
Loop optimizations like software pipelining could
be applied unlike the previous framework.
Although these optimizations were performed,
performance efficiency was lowered.

41
Explanation of the Anomaly

The scheduled code generated shows no evidence of
software pipelining.
Extra code is added to support these optimization
features.
Generation of Statistics is not done accurately.

Scheduled code (cycle: operations issued)
Cycle   Operations
0       st
1       ld, ld, ld
2       ld
4       bfly
12      st
13      st, br
14      st
42
Case Studies (FFT)

For fair comparison, modulo scheduling algorithm
is explicitly switched off.

43
Case Studies (AdpcmDecode)

2 MISOs were introduced:
dest1 = (src2 + (src1 >> 1))
dest1 = (src2 + (src1 >> 2))
Latency of the FUs was taken to be 1, assuming
chaining of operations.
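
The two MISOs can be modeled as C functions in the same style as the earlier instrumented code. The sketch assumes that the operator lost from the slide text is an addition, which matches the `vpdiff` update of the widely used reference ADPCM decoder. The function names `FU_miso_shr1`, `FU_miso_shr2` and `vpdiff_update` are hypothetical, and the surrounding decoder step is reproduced from memory of that reference code, so treat it as an assumption too.

```c
#include <assert.h>

/* MISO 1: dest1 = src2 + (src1 >> 1) */
int FU_miso_shr1(int src1, int src2) {
    assert(src1 >= 0);   /* keep the right shift well defined */
    return src2 + (src1 >> 1);
}

/* MISO 2: dest1 = src2 + (src1 >> 2) */
int FU_miso_shr2(int src1, int src2) {
    assert(src1 >= 0);
    return src2 + (src1 >> 2);
}

/* Assumed context: the decoder's step-difference computation,
   with the two shift-and-add pairs replaced by the MISO calls. */
int vpdiff_update(int step, int delta) {
    int vpdiff = step >> 3;
    if (delta & 4) vpdiff += step;
    if (delta & 2) vpdiff = FU_miso_shr1(step, vpdiff);
    if (delta & 1) vpdiff = FU_miso_shr2(step, vpdiff);
    return vpdiff;
}
```

Each MISO replaces only a shift and an add, so with a latency of 1 it saves at most one cycle per use.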

44
Case Studies (AdpcmDecode)

The lack of performance gain is attributed to:
Presence of only small MISOs.
The reduction of execution time in hardware is
matched by the VLIW compiler executing the
instructions of the MISO in parallel with other
instructions.
A poor estimation technique, which assumes
sequential execution of instructions on the
processor.

45
Case Studies (Predicated AdpcmDecode)

Predication was done to identify larger MISOs.
3 MISOs were identified.
Critical path lengths of the MISOs are 14, 3 and 4
respectively.
Their latencies are assumed to be 7, 2 and 2
respectively.

46
Results & Discussion

Though the latency of the computations is
dramatically reduced, the gain is not as expected:
The VLIW compiler is able to schedule the component
instructions in parallel with the remaining
application code.
Gain is achieved by shortening the critical path of
the application.
There is a tradeoff between the ILP and the
granularity of the MISOs considered.
This raises the question: given a core with enough
resources to handle the maximum parallelism
available, will special FUs enhance performance
further?

47
Need for VLIW + Special FU?

Modeled 1 MISO having critical path length of 14.
Latency assumed to be 7.
Varied the number of Integer ALUs.
Once VLIW compiler extracts maximum parallelism,
MISOs which shorten critical path length of
application will enhance efficiency.
Graph shows that modified code has a lower level
of ILP.

48
Conclusions & Contributions

A largely automated framework for the design and
evaluation of ASIPs is achieved.
Pluggable Modules for Identification, Evaluation &
Selection of critical parts of the application are
implemented.
The Trimaran Evaluation Framework is enhanced to
give better insight into the gain achieved.
The performance gain evaluated is significantly
improved, and depends upon the base architecture
and the nature of the application.
VLIW architecture augmented with special FUs will
perform better provided the FU is capable of
reducing the critical path of the application as
dictated by the base architecture.

49
Future Work

Explore the complexity/performance tradeoff of
ASIPs with control flows mapped on to the FUs.
Better Evaluation and Selection Criteria which
depend on architectural constraints.
Multi-Objective Selection including area, power,
I/O constraints etc.
Design of a better Memory Model to evaluate the
gain.

50
References

Bhuvan Middha, Varun Raj, Anup Gangwar, Anshul
Kumar, M. Balakrishnan, and Paolo Ienne. A
Trimaran based framework for exploring the design
space of VLIW ASIPs with coarse grain functional
units. In Proceedings of the 15th International
Symposium on System Synthesis, Kyoto, October
2002.
Paolo Ienne, Laura Pozzi, and Miljan Vuletic. On
the limits of processor specialisation by mapping
dataflow sections on ad-hoc functional units.
Technical Report 01/376, Swiss Federal Institute
of Technology Lausanne (EPFL), Computer Science
Department (DI), Lausanne, December 2001.
Laura Pozzi, Miljan Vuletic, and Paolo Ienne.
Automatic topology-based identification of
instruction-set extensions for embedded
processors. Technical Report 01/377, Swiss
Federal Institute of Technology Lausanne (EPFL),
Computer Science Department (DI), Lausanne,
December 2001.

51
References

B. Ramakrishna Rau, Michael S. Schlansker, and
P. P. Tirumalai. Code generation schema for modulo
scheduled loops. Technical Report HPL-92-47,
Hewlett Packard Laboratories, April 1992.
Machine SUIF, http://www.eecs.harvard.edu/hube/software
The Trimaran compiler infrastructure,
http://www.trimaran.org
Vinod Kathail, Michael S. Schlansker, and
B. Ramakrishna Rau. HPL-PD architecture specification:
Version 1.1. Technical Report HPL-93-80(R.1),
Compiler and Architecture Research, HP
Laboratories Palo Alto, February 2000 (Revised).

52
Acknowledgements