Title: Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading - The EARTH Model (in more detail)
1Topic 2 -- II: Compilers and Runtime Technology:
Optimization Under Fine-Grain Multithreading -
The EARTH Model (in more detail)
Guang R. Gao, ACM Fellow and IEEE Fellow,
Endowed Distinguished Professor, Electrical and Computer Engineering,
University of Delaware, ggao@capsl.udel.edu
2Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
3Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
4The EARTH Multithreaded Execution Model
- Two levels of fine-grain threads:
  - threaded procedures (invoked asynchronously)
  - fibers (each fiber runs within a frame)
- An async function invocation invokes a threaded function
- A sync operation signals a waiting fiber
5EARTH vs. CILK
(diagram: the CILK model contrasted with the EARTH model)
Note: EARTH has its origin in the static dataflow model
6The Fiber Execution Model
(Slides 6-17 animate the fiber execution model: as sync signals arrive, each fiber's sync-slot count -- shown as 2, 1, 4, etc. -- is decremented, and a fiber is enabled for execution when its count reaches zero.)
18A Loop Example
for (i = 1; i < N; i++) {
  S1: ...
  S2: x[i] = ...
  S3: y[i] = x[i-1] ...
  ...
  Sk: ...
}
(diagram: iterations i = 1, 2, 3, ..., N, with statements S1..Sk spread across threads T1, T2, T3)
Note:
- How are loop-carried dependencies (e.g., S3's use of x[i-1]) handled?
- And what are the implications for cross-core software pipelining?
19Main Features of EARTH
- Fast thread context switching
- Efficient parallel function invocation
- Good support for fine-grain dynamic load balancing
- Efficient support for split-phase transactions and fibers
- Features unique to the EARTH model in comparison to the CILK model
20Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
21Compiling C for EARTH: Objectives
- Design simple high-level extensions for C that allow programmers to write programs that will run efficiently on multithreaded architectures. (EARTH-C)
- Develop compiler techniques to automatically translate programs written in EARTH-C into multithreaded programs. (EARTH-C to Threaded-C)
- Determine whether the EARTH-C compiler can compete with hand-coded Threaded-C programs.
22Summary of EARTH-C Extensions
- Explicit Parallelism
  - Parallel versus sequential statement sequences
  - Forall loops
- Locality Annotation
  - Local versus remote memory references (global, local, replicate, ...)
- Dynamic Load Balancing
  - Basic versus remote functions and invocation sites
23EARTH-C Compiler Environment
(diagram: a C or EARTH-C program enters the McCAT-based EARTH-C compiler, which performs program dependence analysis on the EARTH SIMPLE representation and then thread generation, emitting Threaded-C; the Threaded-C compiler completes the EARTH compilation environment)
24The McCAT/EARTH Compiler
EARTH-C
  -> PHASE I (standard McCAT analyses and transformations)
EARTH-SIMPLE-C
  -> PHASE II (parallelization)
EARTH-SIMPLE-C
  -> PHASE III
THREADED-C
25The Fibonacci Example
THREADED fib (int n, int *result, SLOT *done)
  if (n < 2)
    DATA_RSYNC (1, result, done);
  else {
    TOKEN (fib, n-1, &sum1, slot_1);
    TOKEN (fib, n-2, &sum2, slot_2);
  }
  END_THREAD ( );
  /* sync slot counts (2, 2): THREAD_1 fires after both results arrive */
  THREAD_1: DATA_RSYNC (sum1 + sum2, result, done);
            END_THREAD ( );
  END_FUNCTION
26Matrix Multiplication (Sequential Version)
void main ( ) {
  int i, j, k;
  float sum;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      sum = 0;
      for (k = 0; k < N; k++)
        sum = sum + a[i][k] * b[k][j];
      c[i][j] = sum;
    }
}
27The Inner Product Example
THREADED inner (float *a, float *b, float *result, SLOT *done)
  BLKMOV_SYNC (a, row_a, N, slot_1);
  BLKMOV_SYNC (b, column_b, N, slot_1);
  sum = 0;
  END_THREAD ( );
  /* sync slot counts (2, 2): THREAD_1 fires after both block moves complete */
  THREAD_1: for (i = 0; i < N; i++)
              sum = sum + (row_a[i] * column_b[i]);
            DATA_RSYNC (sum, result, done);
            END_THREAD ( );
  END_FUNCTION
28Summary of EARTH-C Extensions
- Explicit Parallelism
  - Parallel versus sequential statement sequences
  - Forall loops
- Locality Annotation
  - Local versus remote memory references (global, local, replicate, ...)
- Dynamic Load Balancing
  - Basic versus remote functions and invocation sites
29EARTH-C to Threaded-C (Thread Generation)
- Given a sequence of statements s1, s2, ..., sn, we wish to create threads such that we:
  - maximize thread length (minimize thread-switching overhead)
  - retain sufficient parallelism
  - issue remote memory requests as early as possible (prefetching)
  - compile split-phase remote memory operations and remote function calls correctly
30An Example
31Example Partitioned into Four Fibers
(diagram: the example code partitioned into Fiber-0, Fiber-1, Fiber-2, and Fiber-3, with sync counts on the arcs between them)
32Better Strategy: Using List Scheduling
- Put each instruction in the earliest possible thread.
- Within a thread, execute remote operations as early as possible.
- Build a Data Dependence Graph (DDG) and use a list-scheduling strategy in which the selection of instructions is guided by Earliest Thread Number and Statement Type.
33Instruction Types
- Ordered from schedule-first to schedule-last:
  - remote_read, remote_write
  - remote_fn_call
  - local_simple
  - remote_compound
  - local_compound
  - basic_fn_call
34List Scheduling: Previous Example
(0,RR) a = x[i]      (0,RR) b = x[j]       (0,LS) fact = 1
(1,LS) sum = a + b   (1,LS) prod = a * b   (1,LC) fact = fact * a
(1,RF) r1 = g(sum)   (1,RF) r2 = g(prod)   (1,RF) r3 = g(fact)
(2,LS) return (r1 + r2 + r3)
(each statement is annotated with its (earliest thread number, statement type))
35Resulting List-Scheduled Threads
Thread 0: a = x[i]; b = x[j]; fact = 1
  (sync count 2)
Thread 1: sum = a + b; r1 = g(sum); prod = a * b; r2 = g(prod); fact = fact * a; r3 = g(fact)
  (sync count 3)
Thread 2: return (r1 + r2 + r3)
36Generating Threaded-C Code
THREADED f (int *ret_parm, SLOT *rsync_parm, int *x, int i, int j)
{
  SLOTS SYNC_SLOTS[2];
  int a, b, sum, prod, fact, r1, r2, r3;

  /* THREAD_0 */
  INIT_SYNC (0, 2, 2, 1);
  INIT_SYNC (1, 3, 3, 2);
  GET_SYNC_L (&x[i], &a, 0);
  GET_SYNC_L (&x[j], &b, 0);
  fact = 1;
  END_THREAD ( );

  THREAD_1:
  sum = a + b;      TOKEN (g, &r1, SLOT_ADR(1), sum);
  prod = a * b;     TOKEN (g, &r2, SLOT_ADR(1), prod);
  fact = fact * a;  TOKEN (g, &r3, SLOT_ADR(1), fact);
  END_THREAD ( );

  THREAD_2:
  DATA_RSYNC_L (r1 + r2 + r3, ret_parm, rsync_parm);
  END_FUNCTION ( );
}
37Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
38Fine-Grain Synchronization: Two Types
- Enforce mutual exclusion:
  - Order: no specific order required
  - Fine-grain sync solutions: software fine-grained locks, lock-free concurrent data structures, full/empty bits
- Enforce data dependencies:
  - Order: uni-directional
  - Fine-grain sync solutions: full/empty bits, I-structures
39Enforcing Data Dependencies
- A DoAcross loop with a positive, constant dependence distance D:
for (i = D; i < N; i++)
  A[i] = ... A[i-D] ...;
- In parallel execution, iterations are assigned to different threads:
  T0 (i = 2):     A[2]   = ... A[2-D] ...
  T1 (i = 2 + D): A[2+D] = ... A[2] ...
- The data dependence (T1 reads A[2] only after T0 writes it) must be enforced by synchronization.
40Memory-Based Fine-Grain Synchronization
- Full/empty bits (HEP, Tera MTA, etc.); I-structures (dataflow-based machines)
- Associate state with a memory location (fine granularity). Fine-grain synchronization for the memory location is realized through transitions on that state.
- (figure: I-structure state transitions, Arvind et al. '89, TOPLAS)
41With Memory-Based Fine-Grain Sync
- A single atomic operation completes a synchronized write/read directly in memory
- No need to implement the synchronization with other resources, e.g., separate shared-memory flags
- Low overhead: just one memory transaction
43An Alternative: Control-Flow Based Synchronization
- The post/wait instructions need to be implemented in shared memory, in coordination with the underlying memory (consistency) model
- You may need to worry about this:
for (i = D; i < N; i++) {
  A[i] = ...;   post(i);
  wait(i-D);    ... = A[i-D];
}
- There is no data dependency between post(i) and the write of A[i], nor between wait(i-D) and the read of A[i-D], so fences are required:
A[i] = ...;  fence;  post(i);
wait(i-D);   fence;  ... = A[i-D];
- For computations with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. (Arvind et al. '89, TOPLAS)
44A Question!
Is it really necessary to tag every word in the entire memory to support memory-based fine-grain synchronization?
45Key Observation
At any instant of a reasonable parallel execution, only a small fraction of memory locations are actively participating in synchronization.
Solution
Synchronization State Buffer (SSB): only record and manage the states of actively synchronized data units to support fine-grain synchronization.
46What is SSB?
- A small hardware buffer attached to the memory controller of each memory bank
- Records and manages the states of actively synchronized data units
- Hardware cost:
  - Each SSB is a small look-up table: easy to implement
  - SSBs are independent of one another: hardware cost increases only linearly with the number of memory banks
47SSB on Many-Core (IBM C64)
IBM Cyclops-64, Designed by Monty Denneau.
48SSB Synchronization Functionalities
- Data synchronization: enforce RAW data dependencies
  - Supported at word level
  - Two single-writer-single-reader (SWSR) modes
  - One single-writer-multiple-reader (SWMR) mode
- Fine-grain locking: enforce mutual exclusion
  - Supported at word level
  - Write lock (exclusive lock)
  - Read lock (shared lock)
  - Recursive lock
- SSB is capable of supporting more functionality
49Experimental Infrastructure
50SSB Fine-Grain Sync. is Efficient
- For all the benchmarks, the SSB-based version shows significant performance improvement over the versions based on other synchronization mechanisms.
- For example, with up to 128 threads:
  - Livermore Loop 6 (linear recurrence): a 312% improvement over the barrier-based version
  - Ordered integer set (hash table): outperforms the software-based fine-grain methods by up to 84%
51Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
52Research Layout: Future Programming Models
(diagram)
- Advanced execution/programming model: Location Consistency, Percolation
- Applications: scientific computation kernels, high-performance bio-computing kernels, other high-end applications
- Infrastructure / Tools: system software, simulation/emulation, analytical modeling
- Base execution model: fine-grain multithreading (e.g., EARTH, CARE)
53Percolation Model: A User's Perspective
(diagram of the memory/processing hierarchy)
- High-speed CPUs + CRAM: the primary execution engine; prepares and percolates parceled threads
- SRAM PIM (S-PIM engine): performs intelligent memory operations
- DRAM PIM (D-PIM engine) + main memory: global memory management
54The Percolation Model
- What is percolation?
  - Dynamic, adaptive computation/data movement, migration, and transformation, in-place or on-the-fly, to keep system resources usefully busy
- Features of percolation:
  - Both data and threads may percolate
  - Computation reorganization and data-layout reorganization
  - Asynchronous invocation
55Performance of SSCA#2 Kernel 4
threads   C64        SMPs      MTA-2
4         2917082    5369740   752256
8         5513257    2141457   619357
16        9799661    915617    488894
32        17349325   362390    482681
- Reasonable scalability:
  - Scales well with the number of threads
  - Linear speedup for threads < 32
- Commodity SMPs show poor performance
- Competitive vs. the MTA-2
Metric: TEPS -- traversed edges per second
SMPs: 4-way Xeon dual-core, 2MB L2 cache
56Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary