Topic 2 -- II: Compilers and Runtime Technology: Optimization Under Fine-Grain Multithreading - The EARTH Model (in more details)

Transcript and Presenter's Notes

1
Topic 2 -- II: Compilers and Runtime Technology
Optimization Under Fine-Grain Multithreading -
The EARTH Model (in more details)
Guang R. Gao, ACM Fellow and IEEE Fellow
Endowed Distinguished Professor
Electrical and Computer Engineering, University of Delaware
ggao@capsl.udel.edu
2
Outline
  • Overview
  • Fine-grain multithreading
  • Compiling for fine-grain multithreading
  • The power of fine-grain synchronization - SSB
  • The percolation model and its applications
  • Summary

3
Outline
  • Overview
  • Fine-grain multithreading
  • Compiling for fine-grain multithreading
  • The power of fine-grain synchronization - SSB
  • The percolation model and its applications
  • Summary

4
The EARTH Multithreaded Execution Model
(Figure.)
Two levels of fine-grain threads:
  • threaded procedures (invoked asynchronously)
  • fibers (run within a frame, enabled by sync operations)
5
EARTH vs. CILK
CILK Model
EARTH Model
Note: EARTH has its origin in the static dataflow model.
6-17
The Fiber Execution Model
(Figure-only animation across slides 6-17: sync-slot counters count down as producer fibers fire, and a consumer fiber becomes enabled once its count reaches zero.)
18
A Loop Example
(Figure: iterations i = 1, i = 2, i = 3, ..., i = N mapped to
threads T1, T2, T3, ...)

for (i = 1; i < N; i++) {
  S1;
  S2: x[i] = ...;
  S3: y[i] = x[i-1];
  ...
  Sk;
}

Note:
How are loop-carried dependencies handled?
And what are the implications for cross-core software
pipelining?
19
Main Features of EARTH
  • Fast thread context switching
  • Efficient parallel function invocation
  • Good support for fine-grain dynamic load balancing
  • Efficient support for split-phase transactions and
    fibers
  • Features unique to the EARTH model in comparison
    to the CILK model

20
Outline
  • Overview
  • Fine-grain multithreading
  • Compiling for fine-grain multithreading
  • The power of fine-grain synchronization - SSB
  • The percolation model and its applications
  • Summary

21
Compiling C for EARTH: Objectives
  • Design simple high-level extensions for C that
    allow programmers to write programs that will run
    efficiently on multi-threaded architectures.
    (EARTH-C)
  • Develop compiler techniques to automatically
    translate programs written in EARTH-C to
    multi-threaded programs. (EARTH-C, Threaded-C)
  • Determine if EARTH-C compiler can compete with
    hand-coded Threaded-C programs.

22
Summary of EARTH-C Extensions
  • Explicit Parallelism
  • Parallel versus Sequential statement sequences
  • Forall loops
  • Locality Annotation
  • Local versus remote memory references (global,
    local, replicate, ...)
  • Dynamic Load Balancing
  • Basic versus remote function and invocation sites

23
EARTH-C Compiler Environment
(Figure: the EARTH compilation environment. C / EARTH-C
source enters the McCAT-based EARTH-C compiler, which
performs program dependence analysis on the EARTH-SIMPLE
intermediate form and then thread generation, emitting
Threaded-C; the Threaded-C compiler produces the final
executable.)
24
The McCAT/EARTH Compiler
EARTH-C
  |
PHASE I (Standard McCAT Analyses and Transformations)
  |
EARTH-SIMPLE-C
  |
PHASE II (Parallelization)
  |
EARTH-SIMPLE-C
  |
PHASE III
  |
THREADED-C
25
The Fibonacci Example

(Figure: the frame holds fib, n, result, done; the sync
slot for THREAD-1 is initialized to a count of 2.)

/* THREAD_0 */
if (n < 2)
  DATA_RSYNC (1, result, done);
else {
  TOKEN (fib, n-1, sum1, slot_1);
  TOKEN (fib, n-2, sum2, slot_2);
}
END_THREAD ();

THREAD_1:
  DATA_RSYNC (sum1 + sum2, result, done);
  END_THREAD ();
END_FUNCTION ();
26
Matrix Multiplication
Sequential Version:

void main ()
{
  int i, j, k;
  float sum;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      sum = 0;
      for (k = 0; k < N; k++)
        sum = sum + a[i][k] * b[k][j];
      c[i][j] = sum;
    }
}
27
The Inner Product Example

(Figure: the frame holds inner, a, b, result, done; the
sync slot for THREAD-1 is initialized to a count of 2.)

/* THREAD_0 */
BLKMOV_SYNC (a, row_a, N, slot_1);
BLKMOV_SYNC (b, column_b, N, slot_1);
sum = 0;
END_THREAD ();

THREAD_1:
  for (i = 0; i < N; i++)
    sum = sum + (row_a[i] * column_b[i]);
  DATA_RSYNC (sum, result, done);
  END_THREAD ();
END_FUNCTION ();
28
Summary of EARTH-C Extensions
  • Explicit Parallelism
  • Parallel versus Sequential statement sequences
  • Forall loops
  • Locality Annotation
  • Local versus remote memory references (global,
    local, replicate, ...)
  • Dynamic Load Balancing
  • Basic versus remote function and invocation sites

29
EARTH-C to Threaded-C (Thread Generation)
  • Given a sequence of statements s1, s2, ..., sn, we
    wish to create threads such that we
  • Maximize thread length (minimize thread-switching
    overhead)
  • Retain sufficient parallelism
  • Issue remote memory requests as early as possible
    (prefetching)
  • Compile split-phase remote memory operations and
    remote function calls correctly

30
An Example
31
Example Partitioned into Four Fibers
(Figure: the example's dependence graph partitioned into
Fiber-0 through Fiber-3, with sync counts on the
inter-fiber dependence edges.)
32
Better Strategy Using List Scheduling
  • Put each instruction in the earliest possible
    thread.
  • Within a thread, the remote operations are
    executed as early as possible.
  • Build a Data Dependence Graph (DDG), and use a
    list scheduling strategy, where the selection of
    instructions is guided by Earliest Thread Number
    and Statement Type.

33
Instruction Types
  • Schedule First
  • remote_read, remote_write
  • remote_fn_call
  • local_simple
  • remote_compound
  • local_compound
  • basic_fn_call
  • Schedule Last

34
List Scheduling Previous Example
Each instruction is labeled (earliest thread number,
statement type): RR = remote_read, LS = local_simple,
LC = local_compound, RF = remote_fn_call.

(0,RR)  a = x[i]
(0,RR)  b = x[j]
(0,LS)  fact = 1

(1,LS)  sum = a + b
(1,LS)  prod = a * b
(1,LC)  fact = fact * a

(1,RF)  r1 = g(sum)
(1,RF)  r2 = g(prod)
(1,RF)  r3 = g(fact)

(2,LS)  return (r1 + r2 + r3)
35
Resulting List Scheduled Threads
Thread 0:
  a = x[i]; b = x[j]; fact = 1;
    (sync count 2)
Thread 1:
  sum = a + b;      r1 = g(sum);
  prod = a * b;     r2 = g(prod);
  fact = fact * a;  r3 = g(fact);
    (sync count 3)
Thread 2:
  return (r1 + r2 + r3);
36
Generating Threaded-C Code
THREADED f (int *ret_parm, SLOT *rsync_parm,
            int *x, int i, int j)
{
  SLOTS SYNC_SLOTS[2];
  int a, b, sum, prod, fact, r1, r2, r3;

  /* THREAD_0 */
  INIT_SYNC (0, 2, 2, 1);
  INIT_SYNC (1, 3, 3, 2);
  GET_SYNC_L (x[i], a, 0);
  GET_SYNC_L (x[j], b, 0);
  fact = 1;
  END_THREAD ();

THREAD_1:
  sum = a + b;
  TOKEN (g, r1, SLOT_ADR(1), sum);
  prod = a * b;
  TOKEN (g, r2, SLOT_ADR(1), prod);
  fact = fact * a;
  TOKEN (g, r3, SLOT_ADR(1), fact);
  END_THREAD ();

THREAD_2:
  DATA_RSYNC_L (r1 + r2 + r3, ret_parm, rsync_parm);
  END_FUNCTION ();
}
37
Outline
  • Overview
  • Fine-grain multithreading
  • Compiling for fine-grain multithreading
  • The power of fine-grain synchronization - SSB
  • The percolation model and its applications
  • Summary

38
Fine-Grain Synchronization Two Types
Sync Type        | Enforce Mutual Exclusion      | Enforce Data Dependencies
Order            | No specific order required    | Uni-directional
Fine-Grain Sync. | Software fine-grained locks,  | I-structures,
Solutions        | lock-free concurrent data     | Full/Empty bits
                 | structures, Full/Empty bits   |
39
Enforce Data Dependencies
  • A DoAcross loop with positive and constant
    dependence distance D.

In parallel execution, iterations are assigned to
different threads (T0, T1, ...):

for (i = D; i < N; i++) {
  A[i]  = ... ;
  ...   = A[i-D];
}

(e.g., iteration i = 2D reads A[D], which is written
by iteration i = D)

The data dependence needs to be enforced by
synchronization.
40
Memory Based Fine-Grain Synchronization
  • Full/Empty bits (HEP, Tera MTA, etc.);
    I-structures (dataflow-based machines)
  • Associate a state with each memory location
    (fine granularity). Fine-grain synchronization
    for a memory location is realized through
    transitions on its state.

(Figure: I-structure state transitions, [ArvindEtAl89]
@ TOPLAS)
41
With Memory-Based Fine-Grain Sync
  • Using a single atomic operation, complete a
    synchronized write/read in memory directly
  • No need to implement synchronization with other
    resources, e.g., shared memory
  • Low overhead: just one memory transaction

43
An Alternative: Control-Flow Based Synchronization
  • The post/wait instructions need to be implemented
    in shared memory in coordination with the
    underlying memory (consistency) model
  • You may need to worry about this:

for (i = D; i < N; i++) {
  A[i] = ...;  post(i);
  wait(i-D);  ... = A[i-D];
}

(There is no data dependency between post(i) and the
write of A[i], nor between wait(i-D) and the read of
A[i-D], so fences are required:)

A[i] = ...; fence; post(i);
wait(i-D); fence; ... = A[i-D];

For computation with more complicated data
dependencies, memory-based fine-grain
synchronization is more effective and efficient.
[ArvindEtAl89] @ TOPLAS
44
A Question!
Is it really necessary to tag every word in the
entire memory to support memory-based fine-grain
synchronization?
45
Key Observation
At any instant of a reasonable parallel execution,
only a small fraction of memory locations is
actively participating in synchronization.
Solution
Synchronization State Buffer (SSB): only record
and manage states of actively synchronized data
units to support fine-grain synchronization.
46
What is SSB?
  • A small hardware buffer attached to the memory
    controller of each memory bank
  • Records and manages states of actively synchronized
    data units
  • Hardware cost:
  • Each SSB is a small look-up table, easy to
    implement
  • Because each SSB is independent, hardware cost
    grows only linearly with the number of memory banks

47
SSB on Many-Core (IBM C64)
IBM Cyclops-64, designed by Monty Denneau.
48
SSB Synchronization Functionalities
  • Data Synchronization: enforce RAW data
    dependencies
  • Supported at word level
  • Two single-writer-single-reader (SWSR) modes
  • One single-writer-multiple-reader (SWMR) mode
  • Fine-Grain Locking: enforce mutual exclusion
  • Supported at word level
  • write lock (exclusive lock)
  • read lock (shared lock)
  • recursive lock
  • SSB is capable of supporting more functionality

49
Experimental Infrastructure
50
SSB Fine-Grain Sync. is Efficient
  • For all the benchmarks, the SSB-based version
    shows significant performance improvement over
    versions based on other synchronization
    mechanisms.
  • For example, with up to 128 threads:
  • Livermore Loop 6 (linear recurrence): a 312%
    improvement over the barrier-based version
  • Ordered integer set (hash table): outperforms the
    software-based fine-grain methods by up to 84%

51
Outline
  • Overview
  • Fine-grain multithreading
  • Compiling for fine-grain multithreading
  • The power of fine-grain synchronization - SSB
  • The percolation model and its applications
  • Summary

52
Research Layout: Future Programming Models

(Figure: layered research layout.)
Applications: scientific computation kernels,
high-performance bio-computing kernels, other
high-end applications
Advanced execution / programming model: Location
Consistency, Percolation
  • Infrastructure Tools
  • System Software
  • Simulation / Emulation
  • Analytical Modeling
Base execution model: fine-grain multithreading
(e.g., EARTH, CARE)
53
Percolation Model: A User's Perspective

(Figure: three-level memory/processing hierarchy.)
  • High-speed CPUs + CRAM: primary execution engine;
    prepare and percolate parceled threads
  • SRAM PIM (S-PIM engine): perform intelligent
    memory operations
  • DRAM PIM (D-PIM engine, main memory): global
    memory management
54
The Percolation Model
  • What is percolation?
  • Dynamic, adaptive computation/data movement,
    migration, and transformation, in place or on the
    fly, to keep system resources usefully busy
  • Features of percolation:
  • both data and threads may percolate
  • computation reorganization and data layout
    reorganization
  • asynchronous invocation

55
Performance of SSCA2, Kernel 4

threads      C64       SMPs     MTA-2
    4     2917082   5369740    752256
    8     5513257   2141457    619357
   16     9799661    915617    488894
   32    17349325    362390    482681

  • Reasonable scalability
  • Scales well with thread count
  • Linear speedup for threads < 32
  • Commodity SMPs show poor performance
  • Competitive vs. MTA-2

Metric: TEPS -- Traversed Edges Per Second
SMPs: 4-way Xeon dual-core, 2MB L2 cache
56
Outline
  • Overview
  • Fine-grain multithreading
  • Compiling for fine-grain multithreading
  • The power of fine-grain synchronization - SSB
  • The percolation model and its applications
  • Summary