Multithreading and Dataflow Architectures CPSC 321

Provided by: faculty

Learn more at: https://people.engr.tamu.edu

1
Multithreading and Dataflow Architectures CPSC 321

Andreas Klappenecker

2
Plan

T November 16 Multithreading
R November 18 Quantum Computing
T November 23 QC Exam prep
R November 25 Thanksgiving
M November 29 Review ???
T November 30 Exam
R December 02 Summary and Outlook
T December 07 move to November 29?

3
Announcements

Office hours 2:00pm-3:00pm
Bonfire memorial

4
Parallelism

Hardware parallelism: all current architectures
Instruction-level parallelism: superscalar processor, VLIW processor
Thread-level parallelism: Niagara, Pentium 4, ...
Process parallelism: MIMD computer

5
What is a Thread?

A thread is a sequence of instructions that can
be executed in parallel with other sequences.
Threads typically share the same resources and
have a minimal context.

6
Threads

Definition 1: Different threads within the same
process share the same address space, but have
separate copies of the register file, PC, and
stack.
Definition 2: Alternatively, different threads
have separate copies of the register file, PC,
and page table (more relaxed than the previous
definition).
One can use a multiple-issue, out-of-order
execution engine.
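
Definition 1 can be illustrated at the software level with a minimal Python sketch. It is only an analogy: the register file and PC are managed by the hardware and operating system, while the code shows the visible consequence, namely that data in the process address space is shared and local variables live on each thread's own stack. The names `shared_counter`, `local_results`, and `worker` are made up for this example.

```python
import threading

shared_counter = {"hits": 0}   # lives in the process address space, visible to every thread
local_results = {}             # also shared; used only to report each thread's private result
lock = threading.Lock()        # protects updates to the shared data

def worker(name, n):
    private_sum = 0            # local variable: lives on this thread's own stack
    for i in range(n):
        private_sum += i
        with lock:
            shared_counter["hits"] += 1
    with lock:
        local_results[name] = private_sum

threads = [threading.Thread(target=worker, args=(f"t{k}", 100)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_counter["hits"])  # 400: all four threads updated the same object
print(local_results)           # every thread computed its own private_sum of 4950
```

The lock is needed precisely because the address space is shared: without it, concurrent updates to `shared_counter` could interfere with each other.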

7
Why Thread-Level Parallelism?

Extracting instruction-level parallelism is
non-trivial:
hazards and stalls
data dependencies
structural limitations
static optimization limits

8
Von Neumann Execution Model
Each node is an instruction. The pink arrow
indicates a static scheduling of the
instructions. If an instruction stalls (e.g. due
to a cache miss) then the entire program must
wait for the stalled instruction to resume
execution.
9
The Dataflow Execution Model
Each node represents an instruction. The
instructions are not scheduled until run-time. If
an instruction stalls, other instructions can
still execute, provided their input data is
available.
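
The firing rule of the dataflow model (an instruction may execute as soon as all of its inputs are available) can be sketched in a few lines of Python. This is a toy interpreter under simplifying assumptions, not a model of any real dataflow machine; `run_dataflow` and the node names are hypothetical.

```python
import operator

def run_dataflow(nodes, inputs):
    """nodes maps a result name to (function, list of operand names).
    inputs maps operand names to initial values.
    Returns all computed values and the order in which nodes fired."""
    values = dict(inputs)
    pending = dict(nodes)
    order = []
    while pending:
        # A node is ready when every operand it needs has a value.
        ready = [name for name, (_, deps) in pending.items()
                 if all(d in values for d in deps)]
        if not ready:
            raise ValueError("no instruction has all of its inputs available")
        for name in ready:
            fn, deps = pending.pop(name)
            values[name] = fn(*(values[d] for d in deps))
            order.append(name)
    return values, order

# (a + b) * (c - d), with the multiply deliberately listed first:
graph = {
    "prod": (operator.mul, ["sum", "diff"]),
    "sum":  (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["c", "d"]),
}
values, order = run_dataflow(graph, {"a": 1, "b": 2, "c": 7, "d": 3})
print(values["prod"], order)   # 12 ['sum', 'diff', 'prod']
```

The textual order of the nodes does not matter: `prod` is listed first but fires last, because the schedule is derived from data availability at run time.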
10
The Multithreaded Execution Model
Each node represents an instruction and each gray
region represents a thread. The instructions
within each thread are statically scheduled while
the threads themselves are dynamically scheduled.
If an instruction stalls, the thread stalls but
other threads can continue execution.
11
Single-Threaded Processors
Memory access latency can dominate the processing
time, because each cache miss can cost hundreds
of clock cycles while a single-threaded processor
waits for the memory. Increasing the clock speed
improves the processing time, but does not affect
the memory access time.
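
A back-of-the-envelope calculation makes the point concrete. The sketch below assumes a very simple model in which compute time and miss time add up without overlap; the function name and all numbers are illustrative assumptions, not measurements.

```python
def execution_time_ns(instructions, cpi, clock_ghz, misses, miss_latency_ns):
    """Simple additive model: compute time plus time spent waiting on cache misses."""
    compute_ns = instructions * cpi / clock_ghz   # cycles divided by cycles per nanosecond
    memory_ns = misses * miss_latency_ns          # independent of the clock speed
    return compute_ns + memory_ns

# Assumed workload: 1,000,000 instructions, CPI of 1, 10,000 misses of 100 ns each.
base = execution_time_ns(1_000_000, 1, 1.0, 10_000, 100)   # 1 GHz clock
fast = execution_time_ns(1_000_000, 1, 2.0, 10_000, 100)   # 2 GHz clock

print(base, fast, base / fast)   # 2000000.0 1500000.0 1.333...
```

Under these assumptions, doubling the clock speed yields only a 1.33x speedup, because the memory portion of the time is unchanged.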
12
Multi-Threaded Processors
13
Multithreading Types

Coarse-grained multithreading
If a thread faces a costly stall, switch to
another thread. Usually flushes the pipeline
before switching threads.
Fine-grained multithreading
Interleaves the issue of instructions from
multiple threads (cycle by cycle), skipping the
threads that are stalled. Instructions issued in
any given cycle come from the same thread.
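
The coarse-grained policy can be sketched as a small single-issue simulation. The model is an assumption made for illustration: each thread is a list of stall lengths (the number of cycles the thread must wait after issuing that instruction), a stall of at least `threshold` cycles counts as costly, and a thread switch costs `flush` idle cycles. The function name and parameters are hypothetical.

```python
def coarse_grained(threads, threshold=2, flush=1):
    """Return the issue trace: the thread id issued in each cycle, or None if idle."""
    n = len(threads)
    pc = [0] * n          # per-thread index of the next instruction
    ready_at = [0] * n    # first cycle in which each thread may issue again
    trace = []
    cur = 0               # thread that currently owns the pipeline

    def unfinished(t):
        return pc[t] < len(threads[t])

    while any(unfinished(t) for t in range(n)):
        cycle = len(trace)
        if unfinished(cur) and ready_at[cur] <= cycle:
            stall = threads[cur][pc[cur]]
            pc[cur] += 1
            ready_at[cur] = cycle + 1 + stall
            trace.append(cur)
            others = [(cur + k) % n for k in range(1, n) if unfinished((cur + k) % n)]
            # Switch only on a costly stall (or when the thread is done), paying the flush.
            if (stall >= threshold or not unfinished(cur)) and others:
                trace.extend([None] * flush)
                cur = others[0]
        else:
            trace.append(None)   # short stall: wait it out without switching
    return trace

trace = coarse_grained([[0, 3, 0], [0, 0, 0]])
print(trace)   # [0, 0, None, 1, 1, 1, None, 0]
```

In this run the 3-cycle stall of thread 0 is hidden behind thread 1, at the cost of one idle cycle per switch.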

14
Scalar Execution
Dependencies reduce throughput and utilization.
15
Superscalar Execution
16
Chip Multiprocessor
17
Fine-Grained Multithreading
Instructions issued in the same cycle come from
the same thread.
18
Fine-Grained Multithreading

Threads are switched every clock cycle, in
round-robin fashion, among active threads.
Throughput is improved, because instructions can
be issued every cycle.
Single-thread performance is decreased, because
one thread is expected to get just every n-th
clock cycle among n threads.
Fine-grained multithreading requires hardware
modifications to keep track of threads (separate
register files, renaming tables, and commit
buffers).
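
The round-robin policy can be sketched with a small single-issue simulation. As a modeling assumption, each thread is a list of stall lengths: the number of cycles the thread is unavailable after issuing that instruction. The function `fine_grained` and its arguments are hypothetical names for this illustration.

```python
def fine_grained(threads):
    """Return the issue trace: the thread id issued in each cycle, or None if idle."""
    n = len(threads)
    pc = [0] * n          # per-thread index of the next instruction
    ready_at = [0] * n    # first cycle in which each thread may issue again
    trace = []
    last = -1             # thread that issued most recently
    while any(pc[t] < len(threads[t]) for t in range(n)):
        cycle = len(trace)
        issued = None
        for k in range(1, n + 1):            # round robin, starting after the last issuer
            t = (last + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                issued = last = t
                break                        # single issue: one instruction per cycle
        trace.append(issued)
    return trace

both = fine_grained([[0, 3, 0], [0, 0, 0]])
alone = fine_grained([[0, 3, 0]])
print(both)    # [0, 1, 0, 1, 1, None, 0]
print(alone)   # [0, 0, None, None, None, 0]
```

Running the two threads one after the other would take 6 + 3 = 9 cycles, while the interleaved run takes 7, so throughput improves. Thread 0 itself finishes in cycle 6 instead of cycle 5, which illustrates the loss in single-thread performance.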

19
Multithreading Types

A single thread cannot effectively use all
functional units of a multiple issue processor
Simultaneous multithreading
uses multiple issue slots in each clock cycle for
different threads.
More flexible than fine-grained MT.

20
Simultaneous Multithreading
21
Comparison

Superscalar
Looks at multiple instructions from the same
process; suffers both horizontal and vertical
waste.
Multithreaded
Minimizes vertical waste; tolerates long-latency
operations.
Simultaneous multithreading
Selects instructions from any "ready" thread.
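
The two kinds of waste can be counted from an issue-slot diagram. In the usual terminology, vertical waste is the slots lost in cycles where nothing issues at all, and horizontal waste is the unused slots in cycles where at least one instruction issues. The sketch below assumes a grid with one row per cycle and one column per issue slot; the function name and the example grid are made up.

```python
def issue_waste(grid):
    """grid: list of cycles, each a list of issue slots (thread id or None if empty).
    Returns (vertical_waste, horizontal_waste) counted in issue slots."""
    vertical = horizontal = 0
    for row in grid:
        empty = sum(1 for slot in row if slot is None)
        if empty == len(row):
            vertical += empty      # completely idle cycle
        else:
            horizontal += empty    # partially filled cycle
    return vertical, horizontal

# A 4-wide superscalar running a single thread "A" (assumed example):
superscalar = [
    ["A", "A", None, None],
    [None, None, None, None],
    ["A", None, None, None],
    ["A", "A", "A", None],
]
print(issue_waste(superscalar))   # (4, 6)
```

Under this accounting, fine-grained multithreading attacks the first number by filling idle cycles with another thread, while SMT can also reduce the second by filling leftover slots within a cycle.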

22
[Figure: issue slots for superscalar, multithreaded, and SMT execution]
23
SMT Issues
24
A Glance at a Pentium 4 Chip
Picture courtesy of Tom's Hardware Guide
25
The Pipeline
Trace cache
26
Intel's Hyperthreading Patent
27
Pentium 4 Pipeline

Trace cache access, predictor: 5 clock cycles
Microoperation queue
Reorder buffer allocation, register renaming: 4 clock cycles
Functional unit queues
Scheduling and dispatch unit: 5 clock cycles
Register file access: 2 clock cycles
Execution: 1 clock cycle
Reorder buffer
Commit: 3 clock cycles (total: 20 clock cycles)
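
As a quick check, the per-stage figures listed on this slide add up to the stated total. The snippet only restates the slide's numbers; the variable names are arbitrary.

```python
# Stage depths in clock cycles, copied from the slide.
stages = [
    ("Trace cache access, predictor", 5),
    ("Reorder buffer allocation, register renaming", 4),
    ("Scheduling and dispatch unit", 5),
    ("Register file access", 2),
    ("Execution", 1),
    ("Commit", 3),
]
total_cycles = sum(cycles for _, cycles in stages)
print(total_cycles)   # 20
```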

28
PACT XPP

The XPP processes a stream of data using
configurable arithmetic-logic units.
The architecture owes much to dataflow
processing.

29
A Matrix-Vector Multiplication
Graphic courtesy of PACT
30
Basic Idea

Replace the von Neumann instruction stream, with
its fixed instruction scheduling, by a
configuration stream.
Process streams of data rather than small
individual data entities.
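
The stream-processing idea can be loosely illustrated in Python with generators: a fixed pipeline of operators is set up once, and data then flows through it element by element. This is only an analogy and does not reflect the actual XPP programming model; `multiply`, `accumulate`, and `matvec_stream` are hypothetical names.

```python
def multiply(stream_a, stream_b):
    """Multiplier stage: consumes two streams and emits their element-wise products."""
    for a, b in zip(stream_a, stream_b):
        yield a * b

def accumulate(stream, length):
    """Accumulator stage: emits one sum for every `length` consecutive inputs."""
    total, count = 0, 0
    for x in stream:
        total += x
        count += 1
        if count == length:
            yield total
            total, count = 0, 0

def matvec_stream(matrix, vector):
    """Matrix-vector product expressed as streams flowing through a fixed pipeline."""
    n = len(vector)
    a = (x for row in matrix for x in row)                  # matrix elements, row by row
    b = (vector[i % n] for i in range(len(matrix) * n))     # vector repeated once per row
    return list(accumulate(multiply(a, b), n))

print(matvec_stream([[1, 2], [3, 4]], [5, 6]))   # [17, 39]
```

The pipeline structure (multiply, then accumulate) plays the role of the configuration, and it stays fixed while the data streams through it.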

31
von Neumann vs. XPP
32
Basic Components of XPP

Processing Arrays
Packet oriented communication network
Hierarchical configuration manager tree
A set of I/O modules
Supports the execution of multiple data flow
applications running in parallel.

33
Four Processing Arrays
Graphics courtesy of PACT. SCM is short for
supervising configuration manager.
34
Data Processing
35
Event Packets
37
XPP 64-A
38
Further Reading