Title: Instruction Level Parallelism ILP

Instruction Level Parallelism (ILP)
  • Advanced Computer Architecture
  • CSE 8383
  • Spring 2004, 2/19/2004
  • Presented By
  • Saad Al-Harbi
  • Saeed Abu Nimeh

  • What's ILP
  • ILP vs Parallel Processing
  • Sequential execution vs ILP execution
  • Limitations of ILP
  • ILP Architectures
  • Sequential Architecture
  • Dependence Architecture
  • Independence Architecture
  • ILP Scheduling
  • Open Problems
  • References

What's ILP
  • Architectural technique that allows the overlap
    of individual machine operations ( add, mul,
    load, store )
  • Multiple operations will execute in parallel
  • Goal: speed up execution
  • Example:
  • load R1 ← R2          add R3 ← R3, 1
  • add R3 ← R3, 1       add R4 ← R3, R2
  • add R4 ← R4, R2      store R4 ← R0

Example: Sequential vs ILP
  • Sequential execution (without ILP)
  • add r1, r2 → r8 (4 cycles)
  • add r3, r4 → r7 (4 cycles)
  • Total of 8 cycles
  • ILP execution (overlapped execution)
  • add r1, r2 → r8
  • add r3, r4 → r7
  • Total of 5 cycles
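
The 8-cycle and 5-cycle totals are consistent with a simple overlap model, sketched below in Python. The sketch assumes that each add takes 4 cycles and that, with ILP, an independent instruction can start one cycle after the previous one. The one-cycle start interval and both function names are assumptions made for illustration; the slide does not state them.

```python
def sequential_cycles(n_instructions: int, latency: int) -> int:
    """Each instruction starts only after the previous one has finished."""
    return n_instructions * latency


def overlapped_cycles(n_instructions: int, latency: int,
                      start_interval: int = 1) -> int:
    """Independent instructions start `start_interval` cycles apart (assumed)."""
    if n_instructions == 0:
        return 0
    return latency + (n_instructions - 1) * start_interval


print(sequential_cycles(2, 4))  # 8
print(overlapped_cycles(2, 4))  # 5
```

Under this model the saving grows with the number of independent instructions, since each extra one adds only the start interval instead of the full latency.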

ILP vs Parallel Processing
  • ILP
  • Overlaps individual machine operations (add, mul,
    load) so that they execute in parallel
  • Transparent to the user
  • Goal: speed up execution
  • Parallel Processing
  • Separate processors work on separate chunks of
    the program (the processors are programmed to
    do so)
  • Not transparent to the user
  • Goal: speed up and quality up

ILP Challenges
  • To achieve parallelism, there must be no
    dependences among instructions that are
    executing in parallel
  • H/W terminology: data hazards (RAW, WAR, WAW)
  • S/W terminology: data dependences

Dependences and Hazards
  • Dependences are a property of programs
  • If two instructions are data dependent, they
    cannot execute simultaneously
  • A dependence may result in a hazard, and the
    hazard may cause a stall; both depend on the
    pipeline organization
  • Data dependences may occur through registers or
    memory

Types of Dependencies
  • Name dependencies
  • Output dependence
  • Anti-dependence
  • Data (true) dependence
  • Control Dependence
  • Resource Dependence

Name dependences
  • Output dependence
  • When instructions i and j write the same register
    or memory location. The ordering must be
    preserved so that the value finally written is
    the one from instruction j
  • i: add r7, r4, r3
  • j: div r7, r2, r8
  • Anti-dependence
  • When instruction j writes a register or memory
    location that instruction i reads
  • i: add r6, r5, r4
  • j: sub r5, r8, r11
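
The two name dependences above, together with the true (RAW) dependence, can be detected by comparing register sets. The Python sketch below assumes a three-operand format in which the first operand is the destination, which matches the examples on this slide. The `Instr` type and the `classify` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str      # register written (assumed to be the first operand)
    srcs: tuple    # registers read


def classify(i: Instr, j: Instr) -> set:
    """Dependences of the later instruction j on the earlier instruction i."""
    kinds = set()
    if i.dest in j.srcs:
        kinds.add("true (RAW)")      # j reads what i writes
    if j.dest in i.srcs:
        kinds.add("anti (WAR)")      # j overwrites what i reads
    if i.dest == j.dest:
        kinds.add("output (WAW)")    # both write the same register
    return kinds


# The slide's output-dependence pair (both write r7).
print(classify(Instr("add", "r7", ("r4", "r3")),
               Instr("div", "r7", ("r2", "r8"))))   # {'output (WAW)'}

# The slide's anti-dependence pair (j writes r5, which i reads).
print(classify(Instr("add", "r6", ("r5", "r4")),
               Instr("sub", "r5", ("r8", "r11"))))  # {'anti (WAR)'}
```

Memory locations are ignored here; handling them would require comparing addresses, which is generally not possible from the instruction text alone.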

Data Dependences
  • An instruction j is data dependent on instruction
    i if either of the following holds:
  • instruction i produces a result that may be used
    by instruction j, or
  • instruction j is data dependent on instruction k,
    and instruction k is data dependent on
    instruction i
  • LOOP: LD F0, 0(R1)
  • ADD F4, F0, F2
  • SD F4, 0(R1)
  • SUB R1, R1, -8
  • BNE R1, R2, LOOP
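
Both parts of the definition can be checked on this loop with a short Python sketch: direct dependences come from matching each register read with its most recent writer, and the chained case is the transitive closure of those edges. The read and write sets below are my reading of the assembly (first operand is the destination, except for SD, which reads F4 and R1), and only one iteration is considered. The function names are hypothetical.

```python
def raw_edges(program):
    """Direct true dependences as (producer_index, consumer_index) pairs."""
    last_writer = {}
    edges = set()
    for idx, (_name, writes, reads) in enumerate(program):
        for reg in reads:                 # reads happen before this op's writes
            if reg in last_writer:
                edges.add((last_writer[reg], idx))
        for reg in writes:
            last_writer[reg] = idx
    return edges


def transitive_closure(edges):
    """Add (i, j) whenever j depends on k and k depends on i."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


loop_body = [
    ("LD",  ["F0"], ["R1"]),        # 0
    ("ADD", ["F4"], ["F0", "F2"]),  # 1
    ("SD",  [],     ["F4", "R1"]),  # 2
    ("SUB", ["R1"], ["R1"]),        # 3
    ("BNE", [],     ["R1", "R2"]),  # 4
]

direct = raw_edges(loop_body)
print(sorted(direct))                      # [(0, 1), (1, 2), (3, 4)]
print(sorted(transitive_closure(direct)))  # [(0, 1), (0, 2), (1, 2), (3, 4)]
```

The extra pair (0, 2) is the chained case: SD depends on LD through ADD.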

Control Dependences
  • A control dependence determines the ordering of
    an instruction i with respect to a branch
    instruction, so that instruction i is executed
    in correct program order
  • Example:
  • if p1 { S1 }
  • if p2 { S2 }
  • S1 is control dependent on p1; S2 is control
    dependent on p2 but not on p1
  • Two constraints imposed by control dependences:
  • An instruction that is control dependent on a
    branch cannot be moved before the branch
  • An instruction that is not control dependent on
    a branch cannot be moved after the branch

Resource dependences
  • An instruction is resource-dependent on a
    previously issued instruction if it requires a
    hardware resource which is still being used by a
    previously issued instruction.
  • Example: two divides competing for the same divide unit
  • div r1, r2, r3
  • div r4, r2, r5
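
The two divides have no data dependence on each other, so any delay between them comes from the hardware alone. The Python sketch below computes start cycles when each functional unit is unpipelined, meaning it stays busy for the whole latency of an operation. The 4-cycle divide latency, the unit counts, and the name `issue_times` are assumptions chosen for illustration.

```python
def issue_times(ops, units, latency):
    """Earliest start cycle of each op, given unpipelined functional units.

    ops     : list of unit-type names, in program order
    units   : {unit_type: number of units of that type}
    latency : {unit_type: cycles an op occupies the unit}
    """
    free_at = {unit: [0] * count for unit, count in units.items()}
    starts = []
    for op in ops:
        slots = free_at[op]
        # Pick the unit of this type that becomes free first.
        k = min(range(len(slots)), key=slots.__getitem__)
        starts.append(slots[k])
        slots[k] += latency[op]
    return starts


print(issue_times(["div", "div"], {"div": 1}, {"div": 4}))  # [0, 4]
print(issue_times(["div", "div"], {"div": 2}, {"div": 4}))  # [0, 0]
```

With one divider the second divide waits 4 cycles; with two dividers the resource dependence disappears.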

ILP Architectures
  • A computer architecture is a contract (instruction
    format and the interpretation of the bits that
    constitute an instruction) between the class of
    programs that are written for the architecture
    and the set of processor implementations of that
    architecture
  • In ILP architectures, the contract additionally
    covers information embedded in the program about
    the parallelism available between its
    instructions and operations
ILP Architectures Classifications
  • Sequential architectures: the program is not
    expected to convey any explicit information
    regarding parallelism (superscalar processors)
  • Dependence architectures: the program explicitly
    indicates the dependences that exist between
    operations (dataflow processors)
  • Independence architectures: the program provides
    information as to which operations are
    independent of one another (VLIW processors)

Sequential architecture and superscalar processors
  • The program contains no explicit information
    regarding the dependencies that exist between
    instructions
  • Dependencies between instructions must be
    determined by the hardware
  • It is only necessary to determine dependencies
    with sequentially preceding instructions that
    have been issued but not yet completed
  • The compiler may re-order instructions to facilitate
    the hardware's task of extracting parallelism
Superscalar Processors
  • Superscalar processors attempt to issue multiple
    instructions per cycle
  • However, essential dependencies are specified by
    sequential ordering so operations must be
    processed in sequential order
  • This proves to be a performance bottleneck that
    is very expensive to overcome
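
A minimal Python sketch of this bottleneck, assuming in-order issue, single-cycle latency, and an issue width of 2 (all assumptions for illustration, as is the name `in_order_issue`). An instruction cannot join the current issue group if the group is full or if it reads or writes a register that an instruction already in the group writes.

```python
def in_order_issue(program, width):
    """Group (dest, srcs) instructions into issue cycles, in program order."""
    cycles = []
    current = []
    for dest, srcs in program:
        written = {d for d, _ in current}
        conflict = dest in written or any(s in written for s in srcs)
        if len(current) == width or conflict:
            cycles.append(current)   # close the group and start a new cycle
            current = []
        current.append((dest, srcs))
    if current:
        cycles.append(current)
    return cycles


a = ("r8", ("r1", "r2"))
b = ("r9", ("r8", "r3"))    # needs the result of a
c = ("r7", ("r3", "r4"))
d = ("r10", ("r5", "r6"))

print(len(in_order_issue([a, b, c, d], width=2)))  # 3
print(len(in_order_issue([a, c, b, d], width=2)))  # 2
```

The same four instructions take 3 cycles in the original order and 2 once the independent instruction c is moved up, which is the kind of reordering a compiler can do to help the hardware.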

Dependence architecture and data flow processors
  • The compiler (or programmer) identifies the
    parallelism in the program and communicates it to
    the hardware (by specifying the dependences between
    operations)
  • The hardware determines at run-time when each
    operation is independent of the others and performs
    the scheduling
  • No scanning of the sequential program is needed to
    determine dependences
  • Objective: execute each instruction at the
    earliest possible time (subject to available input
    operands and functional units)

Dependence architectures Dataflow processors
  • Dataflow processors are representative of
    dependence architectures
  • They execute each instruction at the earliest
    possible time, subject to availability of input
    operands and functional units
  • Dependencies are communicated by providing with each
    instruction a list of all successor instructions
  • As soon as all input operands of an instruction
    are available, the hardware fetches the
    instruction
  • The instruction is executed as soon as a
    functional unit is available
  • Few dataflow processors currently exist
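
The firing rule described above can be sketched in Python. Each instruction carries a list of successors and a count of operands it still waits for; it becomes ready when the count reaches zero and fires when a functional unit is free. Unit latency, identical functional units, and the name `dataflow_schedule` are assumptions for illustration, and operands that come from outside the graph are treated as already available.

```python
def dataflow_schedule(successors, n_inputs, num_units):
    """Return {instruction: firing cycle} under a simple dataflow firing rule.

    successors : {instr: [instrs that consume its result]}
    n_inputs   : {instr: number of operands produced by other instrs}
    num_units  : functional units available per cycle
    """
    pending = dict(n_inputs)
    ready = sorted(i for i, n in pending.items() if n == 0)
    cycle, fired_at = 0, {}
    while ready:
        firing, ready = ready[:num_units], ready[num_units:]
        for instr in firing:
            fired_at[instr] = cycle
        for instr in firing:
            # Deliver the result to every successor on the list.
            for succ in successors.get(instr, []):
                pending[succ] -= 1
                if pending[succ] == 0:
                    ready.append(succ)
        cycle += 1
    return fired_at


succ = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
need = {"a": 0, "b": 0, "c": 2, "d": 1}
print(dataflow_schedule(succ, need, num_units=2))  # {'a': 0, 'b': 0, 'c': 1, 'd': 2}
print(dataflow_schedule(succ, need, num_units=1))  # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```

No scan of a sequential instruction stream is involved: readiness is driven entirely by operand arrival.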

Dataflow strengths and limitations
  • Dataflow processors use control parallelism alone
    to fully utilize the functional units
  • A dataflow processor is more successful than others
    at looking far down the execution path to find
    control parallelism
  • When successful, it is better than speculative
    execution:
  • Every instruction that is executed is useful
  • The processor does not have to deal with error
    conditions caused by speculative operations

Independence architecture and VLIW processors
  • By knowing which operations are independent, the
    hardware needs no further checking to determine
    which instructions can be issued in the same
    cycle
  • The set of independent operations >> the set of
    dependent operations
  • Only a subset of the independent operations is
    specified
  • The compiler may additionally specify on which
    functional unit and in which cycle an operation
    is executed
  • The hardware needs to make no run-time decisions

VLIW processors
  • Operation vs instruction
  • Operation: a unit of computation (add, load,
    branch; an instruction in a sequential architecture)
  • Instruction: a set of operations that are intended
    to be issued simultaneously
  • The compiler decides which operations go into each
    instruction (scheduling)
  • All operations that are supposed to begin at the
    same time are packaged into a single VLIW
    instruction
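
How a compiler might package operations into VLIW instructions can be sketched with a greedy list scheduler in Python. This is a simplified illustration, not a description of any particular compiler: it assumes unit latency, interchangeable slots, and operations supplied in an order where every predecessor comes first. The name `pack_vliw` is hypothetical.

```python
def pack_vliw(ops, deps, width):
    """Pack operations into VLIW instructions of `width` slots each.

    ops   : operation names, predecessors listed before their consumers
    deps  : {op: set of ops whose results it needs}
    width : number of operation slots per VLIW instruction
    """
    word_of = {}   # op -> index of the instruction it was placed in
    words = []     # words[k] = operations issued together in instruction k
    for op in ops:
        # An op must be placed after every instruction holding a predecessor.
        earliest = max((word_of[p] + 1 for p in deps.get(op, ())), default=0)
        w = earliest
        while w < len(words) and len(words[w]) >= width:
            w += 1                      # skip instructions that are full
        while len(words) <= w:
            words.append([])
        words[w].append(op)
        word_of[op] = w
    # Unused slots must still be encoded, here as explicit nops.
    return [word + ["nop"] * (width - len(word)) for word in words]


deps = {"c": {"a", "b"}, "d": {"c"}, "e": {"a"}}
for word in pack_vliw(["a", "b", "c", "d", "e"], deps, width=2):
    print(word)
# ['a', 'b']
# ['c', 'e']
# ['d', 'nop']
```

The nop in the last instruction shows why unfilled slots translate directly into code size.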

VLIW strengths
  • The hardware is very simple, consisting of a
    collection of function units
    (adders, multipliers, branch units, etc.)
    connected by a bus, plus some registers and
  • More silicon goes to the actual processing
    (rather than being spent on branch prediction,
    for example),
  • It should run fast, as the only limit is the
    latency of the function units themselves.
  • Programming a VLIW chip is very much like writing

VLIW limitations
  • The need for a powerful compiler
  • Increased code size arising from aggressive
    scheduling policies
  • Larger memory bandwidth and register-file
    bandwidth
  • Limitations due to lock-step operation
  • Binary compatibility across implementations with
    varying numbers of functional units and latencies

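Summary: ILP Architectures
  • Additional info required in the program
    Sequential architecture: none
    Dependence architecture: specification of dependences between operations
    Independence architecture: minimally, a partial list of independences; a
    complete specification of when and where each operation is to be executed
  • Typical kind of ILP processor
    Sequential architecture: superscalar
    Dependence architecture: dataflow
    Independence architecture: VLIW
  • Dependence analysis
    Sequential architecture: performed by HW
    Dependence architecture: performed by compiler
    Independence architecture: performed by compiler
  • Independence analysis
    Sequential architecture: performed by HW
    Dependence architecture: performed by HW
    Independence architecture: performed by compiler
  • Scheduling
    Sequential architecture: performed by HW
    Dependence architecture: performed by HW
    Independence architecture: performed by compiler
  • Role of compiler
    Sequential architecture: rearranges the code to make the analysis and
    scheduling HW more successful
    Dependence architecture: replaces some analysis HW
    Independence architecture: replaces virtually all the analysis and
    scheduling HW

ILP Scheduling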
  • Static scheduling boosted by parallel code
    optimization
  • Done by the compiler
  • The processor receives dependency-free and
    optimized code for parallel execution
  • Typical for VLIWs and a few pipelined processors
    (e.g. MIPS)
  • Dynamic scheduling without static parallel code
    optimization
  • Done by the processor
  • The code is not optimized for parallel execution.
    The processor detects and resolves dependencies
    on its own
  • Early ILP processors (e.g. CDC 6600, IBM 360/91)
  • Dynamic scheduling boosted by static parallel
    code optimization
  • Done by the processor in conjunction with a parallel
    optimizing compiler
  • The processor receives optimized code for
    parallel execution, but it detects and resolves
    dependencies on its own
  • Usual practice for pipelined and superscalar
    processors (e.g. RS6000)

ILP Scheduling: Trace scheduling
  • An optimization technique that has been widely
    used for VLIW, superscalar, and pipelined
    processors
  • It selects a sequence of basic blocks as a trace
    and schedules the operations from the trace
  • Example:
  • Instr1
  • Instr2
  • Branch x
  • Instr3

Trace Scheduling
  • Extracts more ILP
  • Increases machine fetch bandwidth by storing
    logically consecutive blocks in physically
    contiguous cache locations (making it possible to
    fetch multiple basic blocks in one cycle)
  • Trace scheduling can be implemented by hardware
    or software

Trace Scheduling in HW
  • The hardware technique makes use of the large amount
    of information available during dynamic execution to
    form traces dynamically and schedule the instructions
    in a trace more efficiently
  • Since dependences and memory access addresses
    have been resolved during dynamic execution,
    instructions in a trace can be reordered more
    easily and efficiently
  • Example: the trace cache approach

Trace scheduling in SW
  • A supplement for machines without hardware trace
    scheduling support
  • Forms traces based on static profile data, and
    schedules instructions using traditional compiler
    scheduling and optimization techniques
  • It faces some difficulties, such as code explosion
    and exception handling
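
The trace-forming step can be sketched in Python as a walk over a control-flow graph that always follows the most frequently taken edge. This covers only trace selection; scheduling the trace and inserting compensation code are not shown. The graph, the edge counts, and the name `select_trace` are invented for illustration and stand in for real profile data.

```python
def select_trace(cfg, counts, seed):
    """Form one trace by following the hottest successor edge from `seed`.

    cfg    : {block: [successor blocks]}
    counts : {(block, successor): profiled execution count of that edge}
    """
    trace = [seed]
    block = seed
    while cfg.get(block):
        nxt = max(cfg[block], key=lambda s: counts.get((block, s), 0))
        if nxt in trace:       # stop before the trace loops back on itself
            break
        trace.append(nxt)
        block = nxt
    return trace


cfg = {"B1": ["B2", "B3"], "B2": ["B4"], "B3": ["B4"], "B4": []}
counts = {("B1", "B2"): 90, ("B1", "B3"): 10,
          ("B2", "B4"): 90, ("B3", "B4"): 10}
print(select_trace(cfg, counts, "B1"))  # ['B1', 'B2', 'B4']
```

Block B3 is left off the trace; any operation moved across the B1 branch would need a compensating copy on the B3 path, which is one source of the code explosion mentioned above.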

ILP open problems
  • Pipelined scheduling: optimized scheduling of
    pipelined behavioral descriptions. There are two
    simple types of pipelining (structural and functional).
  • Controller cost: most scheduling algorithms do
    not consider the controller cost, which is
    directly dependent on the controller style used
    during scheduling.
  • Area constraints: the resource-constrained
    algorithms could have better interaction between
    scheduling and floorplanning.
  • Realism:
  • Scheduling realistic design descriptions that
    contain several special language constructs.
  • Using more realistic libraries and cost
  • Scheduling algorithms must also be expanded to
    incorporate different target architectures.

References
  • Instruction-Level Parallel Processing: History,
    Overview and Perspective. B. Ramakrishna Rau,
    Joseph A. Fisher. Journal of Supercomputing, Vol.
    7, No. 1, Jan. 1993, pages 9-50.
  • Limits of Control Flow on Parallelism. Monica S.
    Lam, Robert P. Wilson. 19th ISCA, May 1992, pages
  • Global Code Generation for Instruction-Level
    Parallelism: Trace Scheduling-2. Joseph A.
    Fisher. Technical Report, HPLabs HPL-93-43, Jun.
  • VLIW at IBM Research
  • http//
  • Intel and HP hope to speed CPUs with VLIW
    technology that's riskier than RISC, Dick
  • http//
  • Hardware and Software Trace Scheduling
  • http//
  • ILP open problems
  • http//
  • Computer Architecture: A Quantitative Approach,
    Hennessy & Patterson, 3rd edition, Morgan Kaufmann