Exploring VLIW ASIP Design Space using Trimaran Based Framework

1
Exploring VLIW ASIP Design Space using Trimaran
Based Framework
  • Under the guidance of
  • Prof. Anshul Kumar
  • Department of Computer Science and Engineering
  • IIT Delhi.

Diviya Jain 2001MCS022
2
Motivation
  • Application Specific Instruction Set Processors
    (ASIPs)
  • Combine the advantages of ASICs and GPPs.
  • Achieve the desired performance, cost and power consumption
    for a targeted set of applications through:
  • Processor Specialisation.
  • Instruction Set Customization.

3
Processor Specialisation Techniques
  • Instruction Set Specialisation.
  • Functional Unit and Data Path Adaptation.
  • Memory Specialisation.
  • Cache Configuration.
  • Interconnect Specialisation.
  • Current Approach:
  • Instruction Set Specialisation.
  • Functional Unit and Data Path Specialisation.

4
Objectives
  • Design of a largely automated framework for ASIP
    design and evaluation.
  • Enable automatic Identification, Evaluation and
    Selection of critical parts of application to be
    mapped on special FUs.
  • Validate performance of ASIP designed through the
    suitably modified Trimaran Framework.

5
Architectural Choices
  • Instruction Level Parallelism (ILP) for high
    performance.
  • ILP choices: Superscalar and VLIW.
  • Our Target Architecture:
  • VLIW architecture
  • Core with fine-grain FUs.
  • Application Specific coarse-grain FUs.
  • Core may be fixed or parameterized.

6
Coarse-Grain FUs?
  • Chaining a sequence of operations reduces
    computation time.
  • Limited Resolution Operation leads to faster
    hardware.
  • Concurrent operations within a group are more easily
    parallelized.
  • Intermediate results are stored locally, reducing
    register pressure.
  • Collapsing a series of operations into one
    instruction shortens code and reduces pressure on
    I-cache.

7
AFU Design Space
  • Type of code mapped:
  • Clusters of elementary arithmetic or logic
    operations.
  • Complex, control-oriented sections, e.g. complete
    loops.
  • The Number of Inputs and Outputs to the AFU.
  • The Number of AFUs or Number of operations per
    single AFU.
  • Accessibility to Memory.

8
Spectrum of Custom FUs
[Figure: spectrum of custom FUs; surviving label: REGISTER FILE]
9
Automatic Processor Specialization Process
  • Application Code
  • Automatic Identification of Coarse-Grained FUs
  • Identification Algorithm
  • Evaluation of FUs Identified
  • Selection Criteria
  • Selection of FUs
  • Modification of Application Code
  • Customization of Machine Architecture
  • Instruction Set Specialisation
  • Code Generation
  • Simulation
  • Performance Statistics
  • Synthesis
10
ASIP Architecture Exploration Requirements
  • Execution and evaluation of application code
    on simulated architectures.
  • Tools needed:
  • Language for Architecture Description.
  • Re-targetable Compiler or Compiler Generator.
  • Reconfigurable Simulator.

11
Trimaran Framework
  • The infrastructure comprises the following
    components:
  • A machine description language, HMDES, for
    describing ILP architectures.
  • A parameterized ILP architecture called HPL-PD.
  • A compiler front-end (IMPACT) for C, performing
    parsing, type checking, and a large suite of
    high-level (i.e. machine independent)
    optimizations.

12
Trimaran Framework
  • A Compiler back-end (ELCOR), parameterized by a
    machine description, performing instruction
    scheduling, register allocation, and machine
    dependent optimizations.
  • An Extensible IR called Rebel.
  • A cycle-level simulator of the HPL-PD
    architecture, configurable by a machine
    description, that provides run-time information on
    execution time, branch frequencies, and resource
    utilization.
  • An Integrated graphical user interface (GUI) for
    configuring and running the Trimaran system.

13
Trimaran Compiler Infrastructure
14
Limitations of Trimaran
  • Restricted to the HPL-PD (PlayDoh) architecture.
  • Introduction of a new instruction requires
    modification of the source code.
  • Since it is a VLIW-based compiler, instructions
    cannot have variable latency (this excludes the
    exploration of conditionals and loops).

15
Earlier Framework
  • Original C Program
  • Perl Script (for automation)
  • Patch Files
  • Instrumented C Program
  • IMPACT
  • ELCOR
  • Simulator Generator
  • Generated Simulator
  • Results and Statistics
  • MDES (Description of FUs)
16
Limitations Of Earlier Framework
  • Manual selection of the application code to be
    mapped on the special FUs designed:
  • Failure to identify potentially good candidates.
  • Allowed validation of the framework only on small
    benchmarks.
  • The instrumented code generated was not completely
    functional:
  • Erroneous code profiling.
  • The code frequently failed to pass through the
    Trimaran front-end compiler.
  • Sub-optimally scheduled code.

17
Limitations Of Earlier Framework
  • Poor Design Of Evaluation Framework.
  • Instruction Set Specialisation delayed to the
    Elcor Stage of Trimaran Framework.
  • Introduction of a large number of data movement
    instructions by the Front End Compiler.
  • Execution Statistics obtained did not conform to
    the performance gain estimated.

18
Identification of Special FUs
  • Application Code
  • Identification Algorithm
  • Automatic Identification of Coarse-Grained FUs
  • Evaluation of FUs Identified
  • Selection Criteria
  • Selection of FUs
19
MachSUIF
  • Features Of MachSUIF
  • IR has a DFG/CFG representation, but is
    architecture independent.
  • Operations resemble generic assembly operations.
  • Provides control and data flow libraries.
  • Suited to the process of identification, which
    depends upon the topological characteristics of
    the DAG constructed.

20
Inefficient MachSUIF IR
  • Original Source Code
  • for (i = 0; i < NUM; i++)
  •     a[i] = i;
  • Trimaran failed to recognize the for loop.
  • Loop optimizations, e.g. loop unrolling and
    software pipelining, were not performed.

[Figure: IR layout, top to bottom: loop condition check; loop body; exit code]
21
Generation of Efficient IR
  • The generated IR is modified to add the loop
    condition check along with the loop body.
  • Trimaran then recognized the for loop, and loop
    optimizations were performed.

[Figure: IR layout, top to bottom: loop condition check; loop body + condition check; exit code]
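
The two IR layouts can be pictured at source level as follows. This is only a sketch: the slides give the block labels but not the IR itself, so the function names, the use of goto to mimic IR blocks, and the value of NUM are assumptions made for illustration.

```c
#define NUM 8   /* hypothetical array length */

/* Layout before the change: condition check at the top, body, jump back. */
void fill_top_tested(int a[NUM]) {
    int i = 0;
top:
    if (!(i < NUM)) goto done;   /* loop condition check */
    a[i] = i;                    /* loop body */
    i++;
    goto top;
done:
    return;                      /* exit code */
}

/* Layout after the change: a guard check, then a body that ends with
 * its own condition check, which is the shape of a counted loop. */
void fill_inverted(int a[NUM]) {
    int i = 0;
    if (!(i < NUM)) return;      /* loop condition check (guard) */
    do {
        a[i] = i;                /* loop body */
        i++;
    } while (i < NUM);           /* condition check at the end of the body */
}
```

Both functions store the same values; only the position of the condition check differs, which is what a loop recognizer typically keys on.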
22
MISO Identification Algorithm
  • for all Node ∈ Nodes_to_be_analysed
  •     Generate_MaxMISO(Node)
  •     Nodes_to_be_analysed -= Nodes_in_MaxMISO
  • Generate_MaxMISO(Node)
  •     for all Parents_of_Node(Node)
  •         if (fanout_of_Parent_of_Node(Node) == 1)
  •             include(Parent_of_Node)
  •             Generate_MaxMISO(Parent_of_Node)
  •         else
  •             fanout_of_Parent_of_Node--
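
A minimal C sketch of this kind of MaxMISO growth is given below. It is not the framework's implementation: the `Dfg` type, the array sizes and the function names are hypothetical. It also departs from the pseudocode in one respect, by counting the consumers seen per root instead of decrementing a shared fanout, so that a producer is absorbed only when all of its consumers lie in the same MISO. Nodes are assumed to be numbered in topological order (parents before consumers).

```c
#include <string.h>

#define MAX_NODES   16
#define MAX_PARENTS 2

/* Data-flow graph of operation nodes (illustrative layout). */
typedef struct {
    int n;                                /* number of nodes */
    int nparents[MAX_NODES];              /* producers feeding each node */
    int parents[MAX_NODES][MAX_PARENTS];
    int fanout[MAX_NODES];                /* consumer edges of each node */
} Dfg;

/* Add node to the MISO rooted at root, then absorb every parent whose
 * consumers are now all inside that MISO. */
static void grow(const Dfg *g, int node, int root, int *seen, int *miso_of) {
    miso_of[node] = root;
    for (int k = 0; k < g->nparents[node]; k++) {
        int p = g->parents[node][k];
        seen[p]++;                        /* one more consumer inside the MISO */
        if (seen[p] == g->fanout[p])
            grow(g, p, root, seen, miso_of);
    }
}

/* miso_of[i] receives the root of the MaxMISO that contains node i. */
void find_max_misos(const Dfg *g, int *miso_of) {
    int seen[MAX_NODES];
    for (int i = 0; i < g->n; i++)
        miso_of[i] = -1;
    /* Visit consumers before producers so that every unassigned node
     * becomes the output (root) of a new MaxMISO. */
    for (int root = g->n - 1; root >= 0; root--) {
        if (miso_of[root] != -1)
            continue;
        memset(seen, 0, sizeof seen);
        grow(g, root, root, seen, miso_of);
    }
}
```

For a graph in which node 1 feeds both node 2 and node 3, node 1 is not absorbed by either consumer and becomes the root of its own single-node MISO.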

23
Implementation Of Identification Algorithm
  • Identified all the MISOs in the Application Code.
  • Capable of including or excluding memory
    operations as a part of a MISO.
  • Annotated each instruction with the
    identification information.
  • Annotated the inputs and output of the MISO.
  • Generated graphical representation of each MISO
    identified.

24
Evaluation and Selection Technique
  • Let $\lambda_{sw}$ represent the execution time of an
    instruction on the processor.
  • Let $\lambda_{hw}$ represent the relative delay of an
    operation when executed on a dedicated hardware unit.
  • $\lambda_{hw}$ is expressed as a fraction or a multiple of
    the 32-bit multiply-accumulate delay.
  • Let $CP_{hw}$ represent the critical path delay of the
    MISO identified, measured in $\lambda_{hw}$ units.
  • Let $f$ represent the number of times a basic block
    is executed.

25
Evaluation and Selection Technique
  • Execution time of a MISO on the processor is
    calculated as
  • $T_{sw} = \sum \lambda_{sw}$ over all instructions of the MISO.
  • Execution time of a MISO on a special FU is
    calculated as
  • $T_{hw} = \lceil CP_{hw} \rceil$
  • Thus the speed-up potential of a MISO is calculated
    as
  • $\mathit{SpeedUp} = (T_{sw} - T_{hw}) \times f$
  • Finally, the best N candidates expected to provide
    the highest speed-up are selected.
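
The evaluation and selection step can be sketched in C as below. This assumes the speed-up potential is the number of cycles saved per execution multiplied by the execution count of the basic block; the `Candidate` type, its field names and the helper functions are hypothetical and not taken from the framework.

```c
#include <stdlib.h>

/* One candidate MISO (illustrative field names). */
typedef struct {
    int    id;
    long   t_sw;    /* sum of software latencies of its instructions */
    double cp_hw;   /* critical path delay on dedicated hardware */
    long   freq;    /* execution count of the enclosing basic block */
    long   gain;    /* filled in by rank_candidates() */
} Candidate;

/* Round a non-negative delay up to a whole number of cycles. */
static long ceil_cycles(double x) {
    long t = (long)x;
    return ((double)t < x) ? t + 1 : t;
}

/* Cycles saved over the whole run: (T_sw - T_hw) * freq. */
long speedup_potential(const Candidate *c) {
    return (c->t_sw - ceil_cycles(c->cp_hw)) * c->freq;
}

/* qsort comparator: larger gain first. */
static int by_gain_desc(const void *a, const void *b) {
    long ga = ((const Candidate *)a)->gain;
    long gb = ((const Candidate *)b)->gain;
    return (ga < gb) - (ga > gb);
}

/* Sort by decreasing gain; the first N entries form the selection. */
void rank_candidates(Candidate *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i].gain = speedup_potential(&c[i]);
    qsort(c, n, sizeof c[0], by_gain_desc);
}
```

A short MISO inside a hot basic block can outrank a longer one that executes rarely, because the per-execution saving is weighted by the execution count.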

26
Modeling Of MISOs
  • Instrumented Source Code (Current Framework)
        int FU_miso(int a, int b, int c)
        {
            return a + b*c + b*c;
        }
        main()
        {
            int a, b, c;
            while (a < 1000)
                /* identified MISO */
                a = FU_miso(a, b, c);
        }
  • Instrumented Source Code (Earlier Framework)
        int FU_miso(int a, int b, int c)
        {
            return 1;
        }
        main()
        {
            int a, b, c;
            while (a < 1000)
                /* identified MISO */
                a = FU_miso(a, b, c);
        }
  • Original Source Code
        main()
        {
            int a, b, c;
            while (a < 1000)
                /* identified MISO */
                a = a + b*c + b*c;
        }

27
Advantages of New Approach
  • Completely functional instrumented code.
  • Eliminates erroneous profiling.
  • Generation of optimally scheduled code.
  • Elimination of Semantic Analysis.
  • Elimination of illegal memory access and hence
    segmentation faults.
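
The profiling point can be illustrated with a small C sketch. It assumes the MISO computes a + b*c + b*c; the function names, the function-pointer type and the iteration cap are hypothetical and serve only to compare a functional model with a constant-returning stub.

```c
/* A model of the special FU: takes the MISO inputs, returns its output. */
typedef int (*FuModel)(int, int, int);

/* Functional model: computes what the replaced instructions computed. */
int fu_functional(int a, int b, int c) { return a + b * c + b * c; }

/* Constant stub: ignores its inputs and always returns 1. */
int fu_constant(int a, int b, int c) { (void)a; (void)b; (void)c; return 1; }

/* Count how often the loop body runs; cap guards against a loop that
 * never terminates when the model returns a wrong value. */
long count_iterations(FuModel fu, int a, int b, int c, long cap) {
    long n = 0;
    while (a < 1000 && n < cap) {
        a = fu(a, b, c);   /* identified MISO */
        n++;
    }
    return n;
}
```

With a = 0, b = 2, c = 3 the functional model leaves the loop after 84 iterations, while the constant stub keeps a at 1 and runs until the cap, so any profile gathered from the stubbed program would not reflect the original one.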

28
Modeling Of MIMOs
  • Instrumented Source Code (Earlier Framework)
        void FU_mimo(int a, int b) { }
        main()
        {
            int a, b, c, d;
            int j1, j2, r1, r2;
            scanf("%d", &j1);
            scanf("%d", &j2);
            r1 = j1;
            r2 = j2;
            /* identified MIMO */
            FU_mimo(a, b);
            c = r1;
            d = r2;
        }
  • Instrumented Source Code (Current Framework)
        int FU_mimo_one(int a, int b) { return a + b; }
        int FU_mimo_two(int a, int b) { return a - b; }
        void FU_mimo(int a, int b) { }
        main()
        {
            int a, b, c, d;
            /* identified MIMO */
            c = FU_mimo_one(a, b);
            d = FU_mimo_two(a, b);
            FU_mimo(a, b);
        }
  • Original Source Code
        main()
        {
            int a, b, c, d;
            /* identified MIMO */
            c = a + b;
            d = a - b;
        }

29
Advantages Of the Approach
  • Completely functional instrumented code.
  • No need to explicitly reserve registers through
    introduction of scanf instructions.
  • No Erroneous profiling.
  • Generation of optimally scheduled code.

30
MISO/MIMO with load/store units
  • Modeled in exactly the same manner as the
    MISO/MIMO.
  • During resource definition, the load/store unit is
    reserved for a few cycles before and after the
    computation for memory access.
  • Original Code
        main()
        {
            int a[10];
            for (int i = 0; i < 10; i++)
                a[i] = a[i] + i*2;
        }
  • Modified Code
        int FU_miso_ld(int a, int i)
        {
            return a + i*2;
        }
        main()
        {
            int a[10];
            for (int i = 0; i < 10; i++)
                FU_miso_ld(a[i], i);
        }

31
Instruction Set Specialisation in Earlier
Framework
  • The function call representing the new
    instruction to be introduced passed through
    Impact without any modifications.
  • Impact requires the function call arguments to be
    present either in Macro Registers or on the
    stack.
  • The introduction of a new machine instruction was
    delayed to the Elcor Stage.

32
Overheads Introduced
  • Source Code
  • d = FU_main_mimofun(a, b, c, d, e);
  • Impact Generated IR
  • (op 113 st_f2 (mac OP i)(i -24)(r 36 f2) <(tm (i 300))>)
  • (op 115 st_i (mac OP i)(i -28)(r 111 i) <(tm (i 301))>)
  • (op 116 st_f2 (mac OP i)(i -36)(r 41 f2) <(tm (i 302))>)
  • (op 117 mov_f2 (mac P5 f2) (r 26 f2))
  • (op 118 mov_f2 (mac P7 f2) (r 31 f2))
  • (op 119 jsr (l_g_abs fn_FU_main_mimofun) <(tr (mac P5 f2)(mac P7 f2))
    (tm (i 300)(i 301)(i 302)) (tmo (i -24)(i -28)(i -36))
    (ret (mac P4 f2)) (param size (I 36))> (call info (s_l_abs
    doubledoubledoubledoubleintdouble)))

33
Impact Compilation Phases
C Language Source → Pcode → Hcode → Machine Independent Lcode → Mcode → Target Architecture Code
34
Selection Of Hcode Phase
  • Pcode generated needs to be reverse translated
    into C for execution and subsequent collection of
    profiling information.
  • At Lcode level, data movement instructions are
    already introduced. Elimination requires complex
    handling of data dependencies.
  • Hcode forms a natural choice:
  • No extra overhead is yet introduced.
  • Instruction Set Extension is easy to accomplish.

35
Customization of Machine Architecture
  • A new Functional Unit is introduced using HMDES,
    machine description language.
  • Operation format, latency, resource usage, etc.
    are all specified.
  • Semantics of the special machine instruction are
    provided to the new Simulator.
  • Trimaran Back End Compiler is modified so that
    it recognizes the new machine instruction and
    optimally schedules the execution of the new
    instruction on the special FU.

36
Enhanced Performance Evaluation Framework
  • Original C Program
  • IMPACT
  • Identification and Selection of FUs
  • Pcode
  • Hcode
  • Lcode
  • Instrumented C Program
  • ELCOR
  • Simulator Generator
  • Generated Simulator
  • Results and Statistics
  • MDES (Description of FUs)
37
Case Studies (Kalman Filter)
  • Modeled 5 MISOs with load/store units.
  • Latency of the FUs conformed to amount of
    computation involved.

38
Discussion Of Results (Kalman Filter)
  • Kalman_Update: the better performance evaluated
    can be attributed to:
  • Removal of the extra data movement instructions
    that were required in the earlier framework.
  • Reuse of register values containing memory
    addresses for successive MISOs.
  • Predict_State: the performance efficiency evaluated
    remains the same:
  • Completely different memory addresses are required
    for successive MISOs.
  • The addresses generated are stored in GPRs instead
    of macro registers, so elimination of data
    movement instructions is of no use.

39
Case Studies (Fast Fourier Transform)
  • Modeled a MIMO performing butterfly operation.
  • Latency of the MIMO is assumed to be 8.
  • MIMO has 6 sources and 4 destinations.

[Figure: butterfly operation with inputs a, b and w and outputs a + bw and a - bw, where a = ar + i(ac), b = br + i(bc), w = wr + i(wc), and i = √-1]
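
A functional C model of such a butterfly, with six scalar sources and four scalar destinations, might look like the sketch below. The type and function names are hypothetical; only the arithmetic (a + bw and a - bw on complex values) comes from the slide.

```c
/* Four destinations: real and imaginary parts of a + b*w and a - b*w. */
typedef struct {
    double sum_r, sum_c;     /* a + b*w */
    double diff_r, diff_c;   /* a - b*w */
} BflyOut;

/* Six sources: real and imaginary parts of a, b and w. */
BflyOut fu_bfly(double ar, double ac, double br, double bc,
                double wr, double wc) {
    double pr = br * wr - bc * wc;   /* real part of b*w */
    double pc = br * wc + bc * wr;   /* imaginary part of b*w */
    return (BflyOut){ ar + pr, ac + pc, ar - pr, ac - pc };
}
```

Returning a struct keeps the model a pure function of its inputs, in the same spirit as the functional MISO and MIMO models used for instrumentation.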
40
Results (Fast Fourier Transform)
  • Overheads due to additional scanfs are removed.
  • Quality of the code generated by the new
    framework is much better.
  • Loop optimizations like software pipelining could
    be applied, unlike in the previous framework.
  • Although the optimizations were performed, the
    performance efficiency evaluated was lower.

41
Explanation of the Anomaly
  • The scheduled code generated shows no evidence of
    software pipelining.
  • Extra code added to support these optimization
    features.
  • Generation of Statistics is not done accurately.

[Figure: schedule of the generated code, as cycle: operations]
    0: st
    1: ld, ld, ld
    2: ld
    4: bfly
    12: st
    13: st, br
    14: st
42
Case Studies (FFT)
  • For a fair comparison, the modulo scheduling
    algorithm is explicitly switched off.

43
Case Studies (AdpcmDecode)
  • 2 MISOs were introduced
  • dest1 = src2 + (src1 >> 1)
  • dest1 = src2 + (src1 >> 2)
  • Latency of the FUs was taken to be 1 assuming
    chaining of operations.

44
Case Studies (AdpcmDecode)
  • The absence of performance gain is attributed to:
  • Presence of only small MISOs.
  • The reduction of execution time in hardware is
    matched by the VLIW compiler executing the
    instructions of the MISO in parallel with other
    instructions.
  • A poor estimation technique, which assumes
    sequential execution of instructions on the
    processor.

45
Case Studies (Predicated AdpcmDecode)
  • Predication was applied to identify larger MISOs.
  • 3 MISOs were identified.
  • Critical path lengths of the MISOs are 14, 3 and 4
    respectively.
  • Their latencies are assumed to be 7, 2 and 2
    respectively.
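
Why predication exposes larger MISOs can be sketched in C. The kernel below is an assumption: it follows the difference computation of the widely circulated reference ADPCM decoder, which the slides do not reproduce, and it imitates predicates with bit masks because C has no predicated operations. The function names are hypothetical.

```c
/* Branching form: each guarded add sits in its own small basic block. */
int vpdiff_branchy(int step, int delta) {
    int vpdiff = step >> 3;
    if (delta & 4) vpdiff += step;
    if (delta & 2) vpdiff += step >> 1;
    if (delta & 1) vpdiff += step >> 2;
    return vpdiff;
}

/* If-converted form: the guards become data (masks), so the whole
 * computation is one straight-line data-flow region. */
int vpdiff_predicated(int step, int delta) {
    int p4 = -((delta >> 2) & 1);   /* all ones when bit 2 is set, else 0 */
    int p2 = -((delta >> 1) & 1);   /* all ones when bit 1 is set, else 0 */
    int p1 = -(delta & 1);          /* all ones when bit 0 is set, else 0 */
    int vpdiff = step >> 3;
    vpdiff += step & p4;
    vpdiff += (step >> 1) & p2;
    vpdiff += (step >> 2) & p1;
    return vpdiff;
}
```

In the branching form an identification pass limited to basic blocks sees only fragments such as a single shift and add; after if-conversion all operations share one block and can be grouped into a longer single-output chain.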

46
Results and Discussion
  • Although the latency of the computations is
    dramatically reduced, the gain is not as expected:
  • The VLIW compiler is able to schedule the component
    instructions in parallel with the remaining
    application code.
  • Gain is achieved by shortening the critical path of
    the application.
  • There is a tradeoff between the ILP and the
    granularity of the MISOs considered.
  • This raises a question: given a core with enough
    resources to exploit the maximum parallelism
    available, will special FUs enhance performance
    further?
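
The gap between estimated and observed gain can be made concrete with a small C sketch. It rests on two stated assumptions: every operation of the MISO takes one cycle in software, and an idealised VLIW core with enough resources executes the MISO in a time equal to its critical path. The type and function names are hypothetical.

```c
/* Simplified description of one MISO. */
typedef struct {
    int num_ops;      /* operations in the MISO, one cycle each (assumed) */
    int crit_path;    /* critical path length in cycles */
    int fu_latency;   /* latency of the special FU in cycles */
} MisoModel;

/* Estimate that assumes the operations execute one after another. */
int saved_sequential(const MisoModel *m) {
    return m->num_ops - m->fu_latency;
}

/* Estimate for an ideal VLIW core: software time is the critical path. */
int saved_ideal_vliw(const MisoModel *m) {
    return m->crit_path - m->fu_latency;
}
```

For a MISO with a critical path of 14 and an FU latency of 7, as in the predicated case study, and a hypothetical count of 20 operations, the sequential estimate predicts 13 cycles saved per execution, whereas the ideal VLIW bound allows at most 7.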

47
Need for VLIW + Special FU?
  • Modeled 1 MISO having critical path length of 14.
  • Latency assumed to be 7.
  • Varied the number of Integer ALUs.
  • Once VLIW compiler extracts maximum parallelism,
    MISOs which shorten critical path length of
    application will enhance efficiency.
  • Graph shows that modified code has a lower level
    of ILP.

48
Conclusions and Contributions
  • A largely automated framework for the design and
    evaluation of ASIPs is achieved.
  • Pluggable modules for identification, evaluation
    and selection of critical parts of the application
    are implemented.
  • The Trimaran evaluation framework is enhanced to
    give better insight into the gain achieved.
  • The performance gain evaluated is significantly
    improved, and depends upon the base architecture
    and the nature of the application.
  • VLIW architecture augmented with special FUs will
    perform better provided the FU is capable of
    reducing the critical path of the application as
    dictated by the base architecture.

49
Future Work
  • Explore the complexity/performance tradeoff of
    ASIPs with control flows mapped on to the FUs.
  • Better evaluation and selection criteria which
    depend on architectural constraints.
  • Multi-Objective Selection including area, power,
    I/O constraints etc.
  • Design of a better Memory Model to evaluate the
    gain.

52
Acknowledgements
  • Prof. M.Balakrishnan
  • Dr. P.R. Panda
  • Anup Gangwar
  • Basant K. Dwivedi

53
Thank You