Title: Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading - The EARTH Model (in more detail)
1Topic 2 -- II: Compilers and Runtime Technology:
Optimization Under Fine-Grain Multithreading -
The EARTH Model (in more detail)
Guang R. Gao, ACM Fellow and IEEE Fellow,
Endowed Distinguished Professor, Electrical and Computer Engineering,
University of Delaware, ggao@capsl.udel.edu
2Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
3Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
4The EARTH Multithreaded Execution Model
- Two levels of fine-grain threads:
  - threaded procedures (invoked asynchronously)
  - fibers (each fiber runs within a frame)
- An async function invocation invokes a threaded function
- A sync operation signals a waiting fiber
5EARTH vs. CILK
(diagram: the CILK model contrasted with the EARTH model)
Note: EARTH has its origin in the static dataflow model
6The Fiber Execution Model
(Slides 6-17 animate the fiber execution model: as sync signals arrive, each fiber's sync-slot count -- shown as 2, 1, 4, etc. -- is decremented, and a fiber is enabled for execution when its count reaches zero.)
18A Loop Example
for (i = 1; i < N; i++) {
  S1: ...
  S2: x[i] = ...
  S3: y[i] = x[i-1] ...
  ...
  Sk: ...
}
(diagram: iterations i = 1, 2, 3, ..., N, with statements S1..Sk spread across threads T1, T2, T3)
Note:
- How are loop-carried dependencies (e.g., S3's use of x[i-1]) handled?
- And what are the implications for cross-core software pipelining?
19Main Features of EARTH
- Fast thread context switching
- Efficient parallel function invocation
- Good support for fine-grain dynamic load balancing
- Efficient support for split-phase transactions and fibers
- Features unique to the EARTH model in comparison to the CILK model
20Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
21Compiling C for EARTH: Objectives
- Design simple high-level extensions for C that allow programmers to write programs that will run efficiently on multithreaded architectures. (EARTH-C)
- Develop compiler techniques to automatically translate programs written in EARTH-C into multithreaded programs. (EARTH-C to Threaded-C)
- Determine whether the EARTH-C compiler can compete with hand-coded Threaded-C programs.
22Summary of EARTH-C Extensions
- Explicit Parallelism
  - Parallel versus sequential statement sequences
  - Forall loops
- Locality Annotation
  - Local versus remote memory references (global, local, replicate, ...)
- Dynamic Load Balancing
  - Basic versus remote functions and invocation sites
23EARTH-C Compiler Environment
(diagram: a C or EARTH-C program enters the McCAT-based EARTH-C compiler, which performs program dependence analysis on the EARTH SIMPLE representation and then thread generation, emitting Threaded-C; the Threaded-C compiler completes the EARTH compilation environment)
24The McCAT/EARTH Compiler
EARTH-C
  -> PHASE I (standard McCAT analyses and transformations)
EARTH-SIMPLE-C
  -> PHASE II (parallelization)
EARTH-SIMPLE-C
  -> PHASE III
THREADED-C
25The Fibonacci Example
THREADED fib (int n, int *result, SLOT *done)
  if (n < 2)
    DATA_RSYNC (1, result, done);
  else {
    TOKEN (fib, n-1, &sum1, slot_1);
    TOKEN (fib, n-2, &sum2, slot_2);
  }
  END_THREAD ( );
  /* sync slot counts (2, 2): THREAD_1 fires after both results arrive */
  THREAD_1: DATA_RSYNC (sum1 + sum2, result, done);
            END_THREAD ( );
  END_FUNCTION
26Matrix Multiplication (Sequential Version)
void main ( ) {
  int i, j, k;
  float sum;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      sum = 0;
      for (k = 0; k < N; k++)
        sum = sum + a[i][k] * b[k][j];
      c[i][j] = sum;
    }
}
27The Inner Product Example
THREADED inner (float *a, float *b, float *result, SLOT *done)
  BLKMOV_SYNC (a, row_a, N, slot_1);
  BLKMOV_SYNC (b, column_b, N, slot_1);
  sum = 0;
  END_THREAD ( );
  /* sync slot counts (2, 2): THREAD_1 fires after both block moves complete */
  THREAD_1: for (i = 0; i < N; i++)
              sum = sum + (row_a[i] * column_b[i]);
            DATA_RSYNC (sum, result, done);
            END_THREAD ( );
  END_FUNCTION
28Summary of EARTH-C Extensions
- Explicit Parallelism
  - Parallel versus sequential statement sequences
  - Forall loops
- Locality Annotation
  - Local versus remote memory references (global, local, replicate, ...)
- Dynamic Load Balancing
  - Basic versus remote functions and invocation sites
29EARTH-C to Threaded-C (Thread Generation)
- Given a sequence of statements s1, s2, ..., sn, we wish to create threads such that we:
  - maximize thread length (minimize thread-switching overhead)
  - retain sufficient parallelism
  - issue remote memory requests as early as possible (prefetching)
  - compile split-phase remote memory operations and remote function calls correctly
30An Example
31Example Partitioned into Four Fibers
(diagram: the example code partitioned into Fiber-0, Fiber-1, Fiber-2, and Fiber-3, with sync counts on the arcs between them)
32Better Strategy: Using List Scheduling
- Put each instruction in the earliest possible thread.
- Within a thread, execute remote operations as early as possible.
- Build a Data Dependence Graph (DDG) and use a list-scheduling strategy in which the selection of instructions is guided by Earliest Thread Number and Statement Type.
33Instruction Types
- Ordered from schedule-first to schedule-last:
  - remote_read, remote_write
  - remote_fn_call
  - local_simple
  - remote_compound
  - local_compound
  - basic_fn_call
34List Scheduling: Previous Example
(0,RR) a = x[i]      (0,RR) b = x[j]       (0,LS) fact = 1
(1,LS) sum = a + b   (1,LS) prod = a * b   (1,LC) fact = fact * a
(1,RF) r1 = g(sum)   (1,RF) r2 = g(prod)   (1,RF) r3 = g(fact)
(2,LS) return (r1 + r2 + r3)
(each statement is annotated with its (earliest thread number, statement type))
35Resulting List-Scheduled Threads
Thread 0: a = x[i]; b = x[j]; fact = 1
  (sync count 2)
Thread 1: sum = a + b; r1 = g(sum); prod = a * b; r2 = g(prod); fact = fact * a; r3 = g(fact)
  (sync count 3)
Thread 2: return (r1 + r2 + r3)
36Generating Threaded-C Code
THREADED f (int *ret_parm, SLOT *rsync_parm, int *x, int i, int j)
{
  SLOTS SYNC_SLOTS[2];
  int a, b, sum, prod, fact, r1, r2, r3;

  /* THREAD_0 */
  INIT_SYNC (0, 2, 2, 1);
  INIT_SYNC (1, 3, 3, 2);
  GET_SYNC_L (&x[i], &a, 0);
  GET_SYNC_L (&x[j], &b, 0);
  fact = 1;
  END_THREAD ( );

  THREAD_1:
  sum = a + b;      TOKEN (g, &r1, SLOT_ADR(1), sum);
  prod = a * b;     TOKEN (g, &r2, SLOT_ADR(1), prod);
  fact = fact * a;  TOKEN (g, &r3, SLOT_ADR(1), fact);
  END_THREAD ( );

  THREAD_2:
  DATA_RSYNC_L (r1 + r2 + r3, ret_parm, rsync_parm);
  END_FUNCTION ( );
}
37Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
38Fine-Grain Synchronization: Two Types
- Enforce mutual exclusion:
  - Order: no specific order required
  - Fine-grain sync solutions: software fine-grained locks, lock-free concurrent data structures, full/empty bits
- Enforce data dependencies:
  - Order: uni-directional
  - Fine-grain sync solutions: full/empty bits, I-structures
39Enforcing Data Dependencies
- A DoAcross loop with a positive, constant dependence distance D:
for (i = D; i < N; i++)
  A[i] = ... A[i-D] ...;
- In parallel execution, iterations are assigned to different threads:
  T0 (i = 2):     A[2]   = ... A[2-D] ...
  T1 (i = 2 + D): A[2+D] = ... A[2] ...
- The data dependence (T1 reads A[2] only after T0 writes it) must be enforced by synchronization.
40Memory-Based Fine-Grain Synchronization
- Full/empty bits (HEP, Tera MTA, etc.); I-structures (dataflow-based machines)
- Associate state with a memory location (fine granularity). Fine-grain synchronization for the memory location is realized through transitions on that state.
- (figure: I-structure state transitions, Arvind et al. '89, TOPLAS)
41With Memory-Based Fine-Grain Sync
- A single atomic operation completes a synchronized write/read directly in memory
- No need to implement the synchronization with other resources, e.g., separate shared-memory flags
- Low overhead: just one memory transaction
43An Alternative: Control-Flow Based Synchronization
- The post/wait instructions need to be implemented in shared memory, in coordination with the underlying memory (consistency) model
- You may need to worry about this:
for (i = D; i < N; i++) {
  A[i] = ...;   post(i);
  wait(i-D);    ... = A[i-D];
}
- There is no data dependency between post(i) and the write of A[i], nor between wait(i-D) and the read of A[i-D], so fences are required:
A[i] = ...;  fence;  post(i);
wait(i-D);   fence;  ... = A[i-D];
- For computations with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. (Arvind et al. '89, TOPLAS)
44A Question!
Is it really necessary to tag every word in the entire memory to support memory-based fine-grain synchronization?
45Key Observation
At any instant of a reasonable parallel execution, only a small fraction of memory locations are actively participating in synchronization.
Solution
Synchronization State Buffer (SSB): only record and manage the states of actively synchronized data units to support fine-grain synchronization.
46What is SSB?
- A small hardware buffer attached to the memory controller of each memory bank
- Records and manages the states of actively synchronized data units
- Hardware cost:
  - Each SSB is a small look-up table: easy to implement
  - SSBs are independent of one another: hardware cost increases only linearly with the number of memory banks
47SSB on Many-Core (IBM C64)
IBM Cyclops-64, Designed by Monty Denneau.
48SSB Synchronization Functionalities
- Data synchronization: enforce RAW data dependencies
  - Supported at word level
  - Two single-writer-single-reader (SWSR) modes
  - One single-writer-multiple-reader (SWMR) mode
- Fine-grain locking: enforce mutual exclusion
  - Supported at word level
  - Write lock (exclusive lock)
  - Read lock (shared lock)
  - Recursive lock
- SSB is capable of supporting more functionality
49Experimental Infrastructure
50SSB Fine-Grain Sync. is Efficient
- For all the benchmarks, the SSB-based version shows significant performance improvement over the versions based on other synchronization mechanisms.
- For example, with up to 128 threads:
  - Livermore Loop 6 (linear recurrence): a 312% improvement over the barrier-based version
  - Ordered integer set (hash table): outperforms the software-based fine-grain methods by up to 84%
51Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary
52Research Layout: Future Programming Models
(diagram)
- Advanced execution/programming model: Location Consistency, Percolation
- Applications: scientific computation kernels, high-performance bio-computing kernels, other high-end applications
- Infrastructure / Tools: system software, simulation/emulation, analytical modeling
- Base execution model: fine-grain multithreading (e.g., EARTH, CARE)
53Percolation Model: A User's Perspective
(diagram of the memory/processing hierarchy)
- High-speed CPUs + CRAM: the primary execution engine; prepares and percolates parceled threads
- SRAM PIM (S-PIM engine): performs intelligent memory operations
- DRAM PIM (D-PIM engine) + main memory: global memory management
54The Percolation Model
- What is percolation?
  - Dynamic, adaptive computation/data movement, migration, and transformation, in-place or on-the-fly, to keep system resources usefully busy
- Features of percolation:
  - Both data and threads may percolate
  - Computation reorganization and data-layout reorganization
  - Asynchronous invocation
55Performance of SSCA#2 Kernel 4
threads   C64        SMPs      MTA-2
4         2917082    5369740   752256
8         5513257    2141457   619357
16        9799661    915617    488894
32        17349325   362390    482681
- Reasonable scalability:
  - Scales well with the number of threads
  - Linear speedup for threads < 32
- Commodity SMPs show poor performance
- Competitive vs. the MTA-2
Metric: TEPS -- traversed edges per second
SMPs: 4-way Xeon dual-core, 2MB L2 cache
56Outline
- Overview
- Fine-grain multithreading
- Compiling for fine-grain multithreading
- The power of fine-grain synchronization - SSB
- The percolation model and its applications
- Summary