Multithreading and Dataflow Architectures CPSC 321 - PowerPoint PPT Presentation


1
Multithreading and Dataflow Architectures CPSC 321
  • Andreas Klappenecker

2
Plan
  • T November 16 Multithreading
  • R November 18 Quantum Computing
  • T November 23 QC Exam prep
  • R November 25 Thanksgiving
  • M November 29 Review ???
  • T November 30 Exam
  • R December 02 Summary and Outlook
  • T December 07 move to November 29?

3
Announcements
  • Office hours: 2:00pm-3:00pm
  • Bonfire memorial

4
Parallelism
  • Hardware parallelism: all current architectures
  • Instruction-level parallelism: superscalar
    processor, VLIW processor
  • Thread-level parallelism: Niagara, Pentium 4, ...
  • Process parallelism: MIMD computer

5
What is a Thread?
  • A thread is a sequence of instructions that can
    be executed in parallel with other sequences.
  • Threads typically share the same resources and
    have a minimal context.

6
Threads
  • Definition 1: Different threads within the same
    process share the same address space, but have
    separate copies of the register file, PC, and
    stack.
  • Definition 2: Alternatively, different threads
    have separate copies of the register file, PC,
    and page table (more relaxed than the previous
    definition).
  • One can use a multiple-issue, out-of-order
    execution engine.
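
As a software-level analogy for Definition 1 (it does not model hardware multithreading), the sketch below uses Python's standard threading module. All threads update one object in the shared address space, while each keeps a private running sum on its own stack. The names worker, shared_counter, and private_sums are hypothetical and chosen only for illustration.

```python
import threading

shared_counter = {"value": 0}   # one object, visible to every thread (shared address space)
private_sums = {}               # hypothetical record of each thread's private result
lock = threading.Lock()         # serializes updates to the shared objects


def worker(name, n):
    private_sum = 0             # local variable: lives on this thread's own stack
    for i in range(n):
        private_sum += i
        with lock:
            shared_counter["value"] += 1
    with lock:
        private_sums[name] = private_sum


threads = [threading.Thread(target=worker, args=(f"t{k}", 1000)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_counter["value"])  # every thread incremented the same counter
print(private_sums)             # each thread computed its own independent sum
```

The lock is needed precisely because the address space is shared; the per-thread sums need no protection while they are being computed.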

7
Why Thread-Level Parallelism?
  • Extracting instruction-level parallelism is
    non-trivial:
  • hazards and stalls
  • data dependencies
  • structural limitations
  • static optimization limits

8
Von Neumann Execution Model
Each node is an instruction. The pink arrow
indicates a static scheduling of the
instructions. If an instruction stalls (e.g. due
to a cache miss) then the entire program must
wait for the stalled instruction to resume
execution.
9
The Dataflow Execution Model
Each node represents an instruction. The
instructions are not scheduled until run-time. If
an instruction stalls, other instructions can
still execute, provided their input data is
available.
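
The firing rule described on this slide can be sketched as a tiny interpreter: an instruction executes as soon as all of its operands are available, regardless of program order. This is a minimal sketch, not a model of any real dataflow machine; the function run_dataflow and the graph encoding are assumptions made for illustration.

```python
def run_dataflow(nodes, inputs):
    """Execute a dataflow graph.

    nodes:  dict mapping a node name to (function, list of operand names)
    inputs: dict mapping input names to their values
    Returns the computed values and the order in which nodes fired.
    """
    values = dict(inputs)
    pending = dict(nodes)
    order = []
    while pending:
        # A node is ready when every operand already has a value.
        ready = [name for name, (_, operands) in pending.items()
                 if all(op in values for op in operands)]
        if not ready:
            raise ValueError("no instruction has all of its operands")
        for name in ready:
            fn, operands = pending.pop(name)
            values[name] = fn(*(values[op] for op in operands))
            order.append(name)
    return values, order


# (a + b) * (c - d): "sum" and "diff" are independent, "prod" needs both.
graph = {
    "prod": (lambda x, y: x * y, ["sum", "diff"]),
    "sum": (lambda x, y: x + y, ["a", "b"]),
    "diff": (lambda x, y: x - y, ["c", "d"]),
}
values, order = run_dataflow(graph, {"a": 1, "b": 2, "c": 7, "d": 3})
print(values["prod"], order)
```

Although "prod" is listed first, it fires last, because its operands only become available after "sum" and "diff" have executed.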
10
The Multithreaded Execution Model
Each node represents an instruction and each gray
region represents a thread. The instructions
within each thread are statically scheduled while
the threads themselves are dynamically scheduled.
If an instruction stalls, the thread stalls but
other threads can continue execution.
11
Single-Threaded Processors
Memory access latency can dominate the processing
time, because each time a cache miss occurs,
hundreds of clock cycles can be lost while a
single-threaded processor waits for the memory.
Top: Increasing the clock speed improves the
processing time, but does not affect the memory
access time.
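
A back-of-the-envelope calculation makes the point concrete. The numbers below (instruction count, CPI, miss count, and a 100 ns miss latency) are assumptions chosen for illustration, not figures from the slides; run_time_ns is a hypothetical helper.

```python
def run_time_ns(instructions, cpi, clock_ghz, misses, miss_latency_ns):
    """Split run time into a compute part and a memory-stall part (nanoseconds)."""
    compute_ns = instructions * cpi / clock_ghz   # cycles divided by cycles-per-ns
    memory_ns = misses * miss_latency_ns          # independent of the clock speed
    return compute_ns, memory_ns


# Assumed workload: 1M instructions, CPI 1, 10,000 cache misses of 100 ns each.
base = run_time_ns(1_000_000, 1.0, 2.0, 10_000, 100.0)   # 2 GHz clock
fast = run_time_ns(1_000_000, 1.0, 4.0, 10_000, 100.0)   # 4 GHz clock

speedup = sum(base) / sum(fast)
print(base, fast, speedup)
```

Under these assumptions, doubling the clock halves the compute time but leaves the memory time untouched, so the overall speedup is only 1.2 rather than 2. At 2 GHz a 100 ns miss already costs 200 cycles; at 4 GHz the same miss costs 400.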
12
Multi-Threaded Processors
13
Multithreading Types
  • Coarse-grained multithreading: if a thread faces
    a costly stall, switch to another thread. Usually
    flushes the pipe before switching threads.
  • Fine-grained multithreading: interleave the issue
    of instructions from multiple threads
    (cycle-by-cycle), skipping the threads that are
    stalled. Instructions issued in any given cycle
    come from the same thread.
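
The coarse-grained policy can be sketched as a toy single-issue simulation. Each thread is a string in which "I" is an ordinary instruction and "M" is an instruction that causes a costly miss. The encoding, the function name coarse_grained, and the latency and flush-penalty values are assumptions for illustration only.

```python
def coarse_grained(threads, miss_latency=6, flush_penalty=2):
    """Return a per-cycle timeline: a thread id when it issues, '-' for a bubble."""
    n = len(threads)
    pc = [0] * n             # per-thread program counter
    ready_at = [0] * n       # first cycle in which each thread may issue again
    timeline = []
    current = 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        cycle = len(timeline)
        if pc[current] < len(threads[current]) and ready_at[current] <= cycle:
            op = threads[current][pc[current]]
            pc[current] += 1
            timeline.append(str(current))
            if op == "M":    # costly stall: thread is blocked for miss_latency cycles
                ready_at[current] = cycle + 1 + miss_latency
            continue
        # The current thread is stalled or finished: look for another ready thread.
        others = [t for t in range(n)
                  if t != current and pc[t] < len(threads[t]) and ready_at[t] <= cycle]
        if others:
            current = others[0]
            timeline.extend("-" * flush_penalty)   # flush the pipe before switching
        else:
            timeline.append("-")                   # nothing is ready: idle cycle
    return "".join(timeline)


trace = coarse_grained(["IMII", "IIMI"])
print(trace)
```

In this model a switch only pays off when the stall is longer than the flush penalty, which is why coarse-grained designs switch on costly stalls only.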

14
Scalar Execution
Dependencies reduce throughput and utilization.
15
Superscalar Execution
16
Chip Multiprocessor
17
Fine-Grained Multithreading
Instructions issued in the same cycle come from
the same thread.
18
Fine-Grained Multithreading
  • Threads are switched every clock cycle, in
    round-robin fashion, among active threads.
  • Throughput is improved, because instructions can
    be issued every cycle.
  • Single-thread performance is decreased, because
    one thread is expected to get just every n-th
    clock cycle among n threads.
  • Fine-grained multithreading requires hardware
    modifications to keep track of threads (separate
    register files, renaming tables, and commit
    buffers).
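
The round-robin policy with skipping of stalled threads can be sketched as a toy single-issue simulation. Each thread is a string in which "I" is an ordinary instruction and "M" is an instruction that stalls its thread for a few cycles. The encoding, the function name fine_grained, and the miss latency are assumptions for illustration only.

```python
def fine_grained(threads, miss_latency=3):
    """Return a per-cycle timeline: a thread id when it issues, '-' for a bubble."""
    n = len(threads)
    pc = [0] * n             # per-thread program counter
    ready_at = [0] * n       # first cycle in which each thread may issue again
    timeline = []
    last = -1                # thread that issued most recently
    while any(pc[t] < len(threads[t]) for t in range(n)):
        cycle = len(timeline)
        # Round-robin order, starting with the thread after the last issuer.
        order = [(last + 1 + k) % n for k in range(n)]
        chosen = next((t for t in order
                       if pc[t] < len(threads[t]) and ready_at[t] <= cycle), None)
        if chosen is None:
            timeline.append("-")       # every unfinished thread is stalled
            continue
        op = threads[chosen][pc[chosen]]
        pc[chosen] += 1
        if op == "M":
            ready_at[chosen] = cycle + 1 + miss_latency
        timeline.append(str(chosen))
        last = chosen
    return "".join(timeline)


interleaved = fine_grained(["IMI", "III", "II"])
alone = fine_grained(["IMI"])
print(interleaved, alone)
```

With three threads, the stall of thread 0 is hidden behind instructions from the other threads and no cycle is wasted; run alone, the same thread leaves three bubbles. The interleaved timeline also shows thread 0 issuing only about every third cycle, which is the single-thread slowdown mentioned above.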

19
Multithreading Types
  • A single thread cannot effectively use all
    functional units of a multiple-issue processor.
  • Simultaneous multithreading uses multiple issue
    slots in each clock cycle for different threads.
  • More flexible than fine-grained MT.
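
The slot-filling idea can be sketched as follows. For each cycle we are given how many instructions each thread could issue, and the issue slots are filled from the threads in a fixed priority order. This is a deliberately simplified model (no carry-over of unissued instructions, no functional-unit constraints); smt_issue and its input format are assumptions for illustration.

```python
def smt_issue(ready, width):
    """Fill `width` issue slots per cycle from several threads.

    ready[c][t] is the number of instructions thread t could issue in cycle c.
    Returns one row per cycle; each slot holds a thread id, or None if unused.
    """
    grid = []
    for per_thread in ready:
        slots = []
        for thread_id, count in enumerate(per_thread):
            take = min(count, width - len(slots))   # never exceed the issue width
            slots.extend([thread_id] * take)
        slots.extend([None] * (width - len(slots))) # pad with empty slots
        grid.append(slots)
    return grid


# Two threads on a 4-issue processor, four cycles.
grid = smt_issue([[2, 1], [0, 3], [1, 0], [3, 2]], width=4)
used = sum(slot is not None for row in grid for slot in row)
for row in grid:
    print(row)
print(used, "of", 4 * len(grid), "slots used")
```

In the first and last cycles, slots from both threads are issued together, which is exactly what fine-grained multithreading does not allow.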

20
Simultaneous Multithreading
21
Comparison
  • Superscalar: looks at multiple instructions from
    the same process; suffers both horizontal and
    vertical waste.
  • Multithreaded: minimizes vertical waste and
    tolerates long-latency operations.
  • Simultaneous multithreading: selects instructions
    from any "ready" thread.
22
Figure: issue slots for superscalar, multithreaded,
and SMT execution.
23
SMT Issues
24
A Glance at a Pentium 4 Chip
Picture courtesy of Tom's Hardware Guide
25
The Pipeline
Trace cache
26
Intel's Hyperthreading Patent
27
Pentium 4 Pipeline
  • Trace cache access, predictor: 5 clock cycles
  • Microoperation queue
  • Reorder buffer allocation, register renaming: 4
    clock cycles
  • Functional unit queues
  • Scheduling and dispatch unit: 5 clock cycles
  • Register file access: 2 clock cycles
  • Execution: 1 clock cycle
  • Reorder buffer
  • Commit: 3 clock cycles (total: 20 clock cycles)
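
The total of 20 clock cycles follows from adding the stage latencies listed above. The short check below uses only those numbers; the variable names are arbitrary.

```python
from itertools import accumulate

# (stage, clock cycles) as listed on the slide
stages = [
    ("Trace cache access, predictor", 5),
    ("Reorder buffer allocation, register renaming", 4),
    ("Scheduling and dispatch unit", 5),
    ("Register file access", 2),
    ("Execution", 1),
    ("Commit", 3),
]

cumulative = list(accumulate(cycles for _, cycles in stages))
total = cumulative[-1]

for (name, cycles), depth in zip(stages, cumulative):
    print(f"{name}: {cycles} cycles (pipeline depth so far: {depth})")
print("total:", total)
```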

28
PACT XPP
  • The XPP processes a stream of data using
    configurable arithmetic-logic units.
  • The architecture owes much to dataflow
    processing.

29
A Matrix-Vector Multiplication
Graphic courtesy of PACT
30
Basic Idea
  • Replace the von Neumann instruction stream with
    fixed instruction scheduling by a configuration
    stream.
  • Process streams of data as opposed to processing
    of small data entities.
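
The idea of configuring a fixed pipeline once and then streaming data through it can be sketched with Python generators, using a matrix-vector multiplication as the example. This is only an analogy under assumed names (multiply, accumulate_rows, matvec_stream); it does not reproduce PACT's actual mapping of the computation onto the XPP array.

```python
def multiply(a_stream, x_stream):
    """Multiplier stage: consumes two streams, emits a stream of products."""
    for a, x in zip(a_stream, x_stream):
        yield a * x


def accumulate_rows(products, n):
    """Accumulator stage: sums every n consecutive products into one result."""
    total, count = 0, 0
    for p in products:
        total += p
        count += 1
        if count == n:
            yield total
            total, count = 0, 0


def matvec_stream(matrix, x):
    """Configure the two-stage pipeline once, then stream the whole matrix through it."""
    n = len(x)
    a_stream = (a for row in matrix for a in row)             # matrix, row by row
    x_stream = (x[i % n] for i in range(len(matrix) * n))     # vector, repeated per row
    return list(accumulate_rows(multiply(a_stream, x_stream), n))


result = matvec_stream([[1, 2], [3, 4]], [5, 6])
print(result)
```

The pipeline structure (the "configuration") stays fixed while the data flows through it, in contrast to a von Neumann machine that fetches an instruction for every individual operation.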

31
von Neumann vs. XPP
32
Basic Components of XPP
  • Processing arrays
  • Packet-oriented communication network
  • Hierarchical configuration manager tree
  • A set of I/O modules
  • Supports the execution of multiple dataflow
    applications running in parallel.

33
Four Processing Arrays
Graphics courtesy of PACT. SCM is short for
supervising configuration manager.
34
Data Processing
35
Event Packets
37
XPP 64-A
38
Further Reading
  • PACT XPP: A Reconfigurable Data Processing
    Architecture, by Baumgarte, May, Nueckel, Vorbach,
    and Weinhardt