1
Multithreaded Processors
  • A) Introduction
  • B) Processor Architecture
  • C) Instruction Scheduling Strategy
  • D) Static Code Scheduling
  • E) Estimation
  • F) Conclusion
  • G) References

2
A) Introduction
  • Why do we present a multithreaded processor
    architecture?
  • The generation of high-quality images requires great
    processing power. Furthermore, to model the real world
    as faithfully as possible, intensive numerical
    computations are also needed. This architecture could
    run such a graphics system.
  • To give an example:
  • Simulation results show that by executing 2 and 4
    threads in parallel on a nine-functional-unit processor,
    a 2.02 and a 3.72 times speed-up, respectively, can be
    achieved over a conventional RISC processor.

3
Introduction
  • Why do we use multiple threads?
  • Improves the utilization of the functional units.
  • Multiple Threads: Instructions from different threads
    are issued simultaneously to multiple functional units,
    and these instructions can begin execution unless there
    are functional unit conflicts.
  • Applicable to the efficient execution of a single loop.
  • Scheduling Technique: In order to control functional
    unit conflicts between loop iterations, a new static
    code scheduling technique has been developed.

4
Introduction
  • Single-thread execution has some disadvantages.
  • Each thread executes a number of data accesses and
    conditional branches. In the case of a distributed
    shared memory system:
  • Low processor utilization can result from long
    latencies due to remote memory accesses.
  • Low utilization of functional units within a processor
    can result from inter-instruction dependencies and
    functional operation delays.

5
Introduction
  • Concurrent Multithreading
  • The processor attempts to remain active during long
    latencies due to remote memory accesses.
  • When a thread encounters an absence of data, the
    processor rapidly switches to another thread.
  • Parallel Multithreading
  • Parallel multithreading within a processor is a
    latency-hiding technique at the instruction level.
  • When an instruction from a thread cannot be issued
    because of either a control or data dependency within
    the thread, an independent instruction from another
    thread is executed instead (see the sketch below).
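  To make the issue rule concrete, here is a minimal Python
  sketch of one issue cycle under parallel multithreading.
  The Inst and Thread structures, the unit names, and the
  ready flag are illustrative assumptions, not taken from
  the paper.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class Inst:
        op: str      # mnemonic, e.g. "add" or "load" (illustrative)
        unit: str    # functional unit required, e.g. "alu" or "ls"
        ready: bool  # True once control/data dependencies are resolved

    @dataclass
    class Thread:
        insts: deque = field(default_factory=deque)

    def issue_cycle(threads, free_units):
        # One issue cycle: instructions from different threads are
        # issued simultaneously; an instruction begins execution
        # unless its unit is already claimed (a functional unit
        # conflict).
        issued = []
        for t in threads:
            if not t.insts:
                continue
            inst = t.insts[0]
            if inst.ready and inst.unit in free_units:
                free_units.discard(inst.unit)  # claim the unit this cycle
                issued.append(t.insts.popleft())
            # a blocked thread simply stalls; other threads can
            # still fill the remaining units independently
        return issued

    t0 = Thread(deque([Inst("add", "alu", True), Inst("load", "ls", True)]))
    t1 = Thread(deque([Inst("load", "ls", True)]))
    print([i.op for i in issue_cycle([t0, t1], {"alu", "ls"})])  # ['add', 'load']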

6
B) Processor Architecture
7
Processor Architecture
  • The processor is provided with several instruction
    queue unit and decode unit pairs, called thread slots.
  • Each thread slot, associated with a program counter,
    makes up a logical processor (sketched below).
  • Instruction Queue Unit: Has a buffer which holds some
    instructions succeeding the instruction indicated by
    the program counter.
  • Decode Unit: Gets an instruction from an instruction
    queue unit and decodes it. Branch instructions are
    executed within the decode unit.
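  A minimal sketch of one thread slot, assuming a simplified
  model in which the program counter and the queue buffer
  are plain Python fields; all names here are illustrative.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class ThreadSlot:
        # An instruction queue unit / decode unit pair plus a
        # program counter: together, one logical processor.
        pc: int = 0
        queue: deque = field(default_factory=deque)  # instructions after the PC

        def decode(self):
            # The decode unit takes the next instruction from the
            # queue; a branch would be resolved here rather than
            # being sent to a functional unit.
            if not self.queue:
                return None
            self.pc += 1
            return self.queue.popleft()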

8
Processor Architecture
  • Issued instructions are dynamically scheduled by
    instruction schedule units and delivered to functional
    units.
  • When an instruction is not selected by an instruction
    schedule unit, it is stored in a standby station and
    remains there until it is selected (see the sketch
    below).
  • Large Register File: Divided into banks, each of which
    is used as a full register set private to a thread.
    Each bank has two read ports and one write port. When a
    thread is executed, the bank allocated for the thread
    is logically bound to the logical processor.
  • Queue Registers: Special registers which enable
    communication between logical processors at the
    register-transfer level.
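  The interplay between a schedule unit and its standby
  station might look like the following sketch; the capacity
  parameter and the selection order are simplifying
  assumptions, not the paper's exact policy.

    from collections import deque

    class ScheduleUnit:
        # One instruction schedule unit with its standby station:
        # an instruction not selected in a cycle waits there.
        def __init__(self):
            self.standby = deque()

        def select(self, newly_issued, capacity):
            # Standby instructions compete together with newly
            # issued ones; up to `capacity` go to functional units
            # per cycle, the rest (re)enter the standby station.
            candidates = list(self.standby) + list(newly_issued)
            self.standby = deque(candidates[capacity:])
            return candidates[:capacity]

    su = ScheduleUnit()
    print(su.select(["i0", "i1", "i2"], 2))  # ['i0', 'i1']; 'i2' stands by
    print(su.select(["i3"], 2))              # ['i2', 'i3']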

9
Processor Architecture
  • Instruction Pipeline
  • IF: The instruction is read from a buffer of an
    instruction queue unit in one cycle.
  • D1: The format or type of the instruction is tested. In
    the case of a branch instruction, an instruction fetch
    request is sent to the instruction fetch unit at the
    end of stage D1.
  • D2: The instruction is checked to determine whether it
    is issuable or not.

10
Processor Architecture
  • Instruction Pipeline (continued)
  • S: This stage is inserted for the dynamic scheduling in
    the instruction schedule units. Required operands are
    read from registers in stage S.
  • EX: This stage depends on the kind of instruction.
  • W: The result value is written back to a register. (The
    full six-stage flow is sketched below.)
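  Putting the two slides together, the six-stage flow (IF,
  D1, D2, S, EX, W) can be sketched as a chain of stage
  functions over a small context; the stage bodies are
  illustrative placeholders, not the paper's logic.

    from collections import deque

    def IF(ctx):  ctx["inst"] = ctx["queue"].popleft()            # read from queue buffer
    def D1(ctx):  ctx["is_branch"] = ctx["inst"].startswith("b")  # test format/type
    def D2(ctx):  ctx["issuable"] = not ctx["is_branch"]          # issuable or not?
    def S(ctx):   ctx["operands"] = ("r2", "r3")                  # read operands while scheduling
    def EX(ctx):  ctx["result"] = "executed " + ctx["inst"]       # depends on instruction kind
    def W(ctx):   ctx["written_back"] = True                      # write result to a register

    PIPELINE = [IF, D1, D2, S, EX, W]

    ctx = {"queue": deque(["add r1, r2, r3"])}
    for stage in PIPELINE:
        stage(ctx)
    print(ctx["result"])  # executed add r1, r2, r3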

11
C) Instruction Scheduling Strategy
  • A dynamic instruction scheduling policy is implemented
    in the instruction schedule unit, which works in one of
    two modes (both sketched below):
  • 1) Implicit-rotation mode: Priority rotation occurs at
    a given interval of cycles (the rotation interval), as
    shown in Figure 4.
  • 2) Explicit-rotation mode: Rotation of priority is
    controlled by software. The rotation is done when a
    change-priority instruction is executed on the logical
    processor with the highest priority.
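  A minimal sketch of the two rotation modes, assuming a
  simple deque of slot priorities; the class and method
  names are invented for illustration.

    from collections import deque

    class PriorityRotator:
        # Front of `order` is the highest-priority logical processor.
        def __init__(self, n_slots, interval=None):
            self.order = deque(range(n_slots))
            self.interval = interval  # set => implicit mode; None => explicit
            self.cycles = 0

        def tick(self):
            # Implicit-rotation mode: rotate every `interval` cycles.
            self.cycles += 1
            if self.interval and self.cycles % self.interval == 0:
                self.order.rotate(-1)

        def change_priority(self, slot):
            # Explicit-rotation mode: rotate only when a
            # change-priority instruction runs on the
            # highest-priority logical processor.
            if self.interval is None and slot == self.order[0]:
                self.order.rotate(-1)

    r = PriorityRotator(4)  # explicit mode
    r.change_priority(0)
    print(list(r.order))    # [1, 2, 3, 0]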

12
Instruction Scheduling Strategy
  • Why is explicit-rotation mode used for our
    architecture?
  • To aid the compiler in scheduling the code of threads
    executed in parallel when it is possible.
  • To parallelize loops which are difficult to parallelize
    on other architectures.

13
D) Static Code Scheduling
  • Main Goal: The compiler reorders the code without
    consideration of other threads, and concentrates on
    shortening the processing time of each thread.
  • A new algorithm which makes the most of the standby
    stations and instruction schedule units has been
    developed.
  • The algorithm employs a resource reservation table and
    a standby table.

14
Static Code Scheduling
  • Resource Reservation Table
  • To avoid resource conflicts.
  • To tell the compiler when an instruction in the standby
    station is executed.
  • Standby Table
  • Stores the instructions which are not issued.
  • Explicit-rotation mode enables the compiler to know
    which instruction is selected.
  • A sketch of scheduling against these tables follows.
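  A minimal sketch of list scheduling against a resource
  reservation table and a standby table (essentially the
  strategy B evaluated later), assuming each instruction
  occupies one functional unit for one cycle; the names and
  the horizon parameter are illustrative.

    def schedule(insts, horizon):
        # insts: (name, functional unit, earliest cycle) triples.
        reservation = {c: set() for c in range(horizon)}  # reservation table
        standby, placed = [], {}
        for name, unit, earliest in insts:
            for c in range(earliest, horizon):
                if unit not in reservation[c]:  # avoid a resource conflict
                    reservation[c].add(unit)
                    placed[name] = c
                    break
            else:
                standby.append(name)  # standby table: not issued in the horizon
        return placed, standby

    # Three loads and a store contending for one load/store unit:
    code = [("ld1", "ls", 0), ("ld2", "ls", 0), ("ld3", "ls", 1), ("st1", "ls", 2)]
    print(schedule(code, 6))
    # ({'ld1': 0, 'ld2': 1, 'ld3': 2, 'st1': 3}, [])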

15
E) Estimation
  • Cache simulation has not been implemented in our
    simulator, so we assumed that all cache accesses were
    hits.
  • We assumed that there were no bank conflicts. The
    latencies of each instruction are listed in Table 1.
  • In order to evaluate our architecture, we use the
    speed-up ratio as a criterion.

16
Estimation
  • A 1.83 times speed-up is gained by parallel execution
    using 2 thread slots, although not all of the hardware
    of a single-threaded processor is duplicated in the
    processor.
  • By using 4 thread slots the speed-up is 2.89, so the
    gain over 2 slots is 2.89 / 1.83 ≈ 1.58, a less
    effective increase.
  • When 8 thread slots are provided, the utilization of
    the busiest functional unit, the load/store unit,
    reaches 99%. This is the reason why the speed-up
    saturates at only 3.22 times.
  • The addition of another load/store unit improves
    speed-up ratios by 10.4% to 79.8%.

17
Estimation
  • Standby stations improve the speed-up ratio by 0% to
    2.2%.
  • In the case of application programs whose threads are
    rich in fine-grained parallelism, greater improvement
    can be achieved.

18
Estimation
  • The sample program is Livermore Kernel 1, written in
    Fortran.
  • Table 3 lists average execution cycles for one
    iteration.
  • Strategy A represents a simple list scheduling
    approach.
  • Strategy B represents list scheduling with a resource
    reservation table and a standby table.

19
Estimation
  • Strategy B is overall superior to the other strategies.
    It achieves a performance improvement of 0% to 19.3%.
  • The object code contains three load instructions and
    one store instruction, so at least (3 + 1) × 2 = 8
    cycles are required for one iteration.

20
F) Conclusion
  • A 2-threaded, 2 load/store unit processor achieves a
    factor of 2.02 speed-up over a sequential machine, and
    a 4-threaded processor achieves a factor of 3.72.
  • A new static code scheduling algorithm has been
    developed, which is derived from the idea of software
    pipelining.
  • The poor variety of tested programs (e.g., regarding
    cache effects ...) is a weak point.
  • Working on evaluating finite cache effects and the
    detailed design of the processor will help us confirm
    the effectiveness of the architecture.

21
G) References
  • H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A.
    Nishimura, Y. Nakase, and T. Nishizawa, "An Elementary
    Processor Architecture with Simultaneous Instruction
    Issuing from Multiple Threads," in International
    Symposium on Computer Architecture, pages 136-145,
    1992.
  • A. Farcy and O. Temam, "Improving Single-Process
    Performance with Multithreaded Processors," Universite
    de Versailles, Paris.
  • H. Hirata, Y. Mochizuki, A. Nishimura, Y. Nakase, and
    T. Nishizawa, "A Multithreaded Processor Architecture
    with Simultaneous Instruction Issuing," in Proc. of
    Intl. Symp. on Supercomputing, Fukuoka, Japan, pp.
    87-96, November 1991.