Cyclops-64 TNT Threads Tutorial

Fall 2008. 9/5/2007. eleg652-010-07F.

Slides: 44
Provided by: cabe1


1
Cyclops-64 TNT Threads Tutorial
  • Fall 2008

2
Shared Memory Machines
  • Logical or Physical Shared Memory
  • Low cost interconnect
  • Shared fabric (bus)
  • Switch packet fabric (Crossbar)
  • Logical single address space
  • PSM: Physical Shared Memory; DSM: Distributed
    Shared Memory

[Figure: block diagrams of processors (P), memories (M) and interconnects (IC)]
3
Shared-Memory Execution Architecture Models
  • Uniform-memory-access (UMA) model
  • Non-uniform-memory-access (NUMA) model
  • without caches (BBN, Cedar, Sequent)
  • COMA (Kendall Square KSR-1, DDM)
  • CC-NUMA (DASH)
  • Hybrid COMA-NUMA (Sun Wildfire)
  • Symmetric vs. asymmetric multiprocessors (MPs)
  • Symmetric MPs (SMPs)
  • Asymmetric MPs (some processors act as masters,
    some as slaves)
  • Programming: OpenMP, etc.

[Figure: block diagrams of the models above, built from elements labeled P, M, IC, S and PM]
4
Cyclops-64 Architecture
  • Massive cellular architecture system
  • A system can be composed of thousands of C64
    chips arranged in a mesh-type network
  • Each chip is composed of
  • 80 processors
  • Two thread units and one floating-point unit per
    processor
  • A crossbar interconnect network
  • Non-uniform shared memory
  • Scratch pad memory
  • Small memory with the lowest latency (around 3
    to 5 cycles)
  • Directly accessible to the thread units
  • Global SRAM
  • Every thread unit can be seen as being matched
    with an SRAM bank: 160 banks of 30 KB each
    inside the same chip
  • Global DRAM
  • Four large off-chip memory banks with the highest
    latency (around 57 cycles)
  • Note that SRAM and DRAM accesses must go
    through the crossbar

5
Cyclops-64 Levels
  • Host level
  • Communication between the host (a cluster) and
    the Cyclops-64 system
  • Protocol used: CDP
  • Inter-chip level
  • Communication between chips
  • Chip component used: the A-switch
  • Languages / protocols: CNET and SHMEM
  • Intra-chip level
  • Communication inside the chip
  • Languages: OpenMP and TNT Threads

6
An Example Cyclops64 Configuration
7
Note 1: The SPMD Mode
  • In this mode, all the threads receive the same
    code
  • Thread-specific work can be performed by using if
    statements on the thread ids
  • Helper threads (additional threads that prefetch
    data from other levels of memory) can be created
    in SPMD mode
  • Some synchronization constructs (like barriers)
    behave slightly differently in this mode.

8
TNT Threads
  • A programming model which uses the hardware of
    Cyclops-64
  • Can be used to create helper threads in SPMD
    mode or normal threads in non-SPMD mode
  • Very similar to POSIX threads, but with key
    differences
  • Tiny threads are mapped one-to-one to Cyclops-64
    thread units
  • They are not preemptive
  • They do not allow context switching
  • They have a sleep and signal model implemented in
    hardware
  • Very fast creation (on the order of hundreds of
    cycles)

9
Compiling and Running
  • Machines that are available
  • atlantic.capsl.udel.edu and europa.capsl.udel.edu
  • Permissions must be requested
  • Compiling your code
  • (CYCLOPS_COMPILER) hello.c -o hello -ltnt -lc
  • The (CYCLOPS_COMPILER) path differs from server
    to server, but the compiler's name is
    cyclops64-linux-elf-gcc
  • It supports almost all the flags that the gcc-4.1
    compiler supports.

10
Running your code
  • (CYCLOPS64-SIM) options binary
  • The (CYCLOPS64-SIM) path can differ from server
    to server, but the simulator's name is
    cyclops64-linux-elf-sim
  • The most common options are
  • --no-spmd and --spmd: turn SPMD mode off or on
  • --bw: accurately simulate network and memory
    contention
  • -pn: set the number of processors (given by n)
    that this program will run on
  • Please remember that a processor is two thread
    units.
  • --memory=s,i,d: set the size of the scratch pad
    (s, in KB), the interleaved memory (i, in KB) and
    the DRAM (d, in MB)

11
An Example of SPMD Mode
Source Code
    #include <stdio.h>
    #include <cyclops64.h>
    #include <tnt.h>

    int main(int argc, char **argv)
    {
        printf("Hello, World from thread %d\n",
               tnt_self()->my_thread);
    }
Simulator Line
    /usr/local/cyclops64/release_2_4/local/bin/cyclops64-linux-elf-sim --spmd -p3 ./hello
Output
    Hello, World from thread 3
    Hello, World from thread 2
    Hello, World from thread 4
    Hello, World from thread 1
    Hello, World from thread 0
12
The TNT Threads Tutorial
  • The POSIX threads one

13
TNT Threads Creation and Destruction
tnt_ret_t tnt_create(tnt_desc_t *th, const tnt_fn_t start_routine, const tnt_arg_t arg)
  • tnt_desc_t *th: The descriptor of the thread. It
    is a structure that contains many of the thread
    attributes.
  • tnt_fn_t start_routine: The pointer to the
    function that the created thread will run.
  • tnt_arg_t arg: The argument for the thread
    function. It can be cast to a pointer in order to
    pass any type.
tnt_ret_t tnt_join(const tnt_desc_t th, void **th_ret)
  • void **th_ret: The thread's return value. It can
    be left NULL.
14
An Example of Thread Creation
    #define NUM_THRS 12

    /* Create threads init..thrs-1, storing their descriptors in ids. */
    #define CREATE_RNG(init, thrs, ids, func, args)               \
    {                                                             \
        int i;                                                    \
        for (i = init; i < thrs; i++)                             \
        {                                                         \
            if (tnt_create(&ids[i-init], func, args[i-init]))     \
            {                                                     \
                perror("Thread cannot be created\n");             \
                exit(9);                                          \
            }                                                     \
        }                                                         \
    }

    /* Join threads init..thrs-1, ignoring their return values. */
    #define JOIN_RNG(init, thrs, ids)                             \
    {                                                             \
        int i;                                                    \
        for (i = init; i < thrs; i++)                             \
        {                                                         \
            if (tnt_join(ids[i-init], NULL))                      \
            {                                                     \
                perror("Problem when joining\n");                 \
                exit(10);                                         \
            }                                                     \
        }                                                         \
    }

    void thr_func(void *ptr)
    {
        int id = (int)ptr;
        printf("Hello world from thread %d\n", id);
    }

    int main(int argc, char **argv)
    {
        int i;
        tnt_desc_t th[NUM_THRS];
        void *args[NUM_THRS];

        for (i = 0; i < NUM_THRS; i++)
            args[i] = (void *)i;

        CREATE_RNG(0, NUM_THRS, th, thr_func, args)
        JOIN_RNG(0, NUM_THRS, th)
        return 0;
    }

Simulator Line
    /usr/local/cyclops64/release_2_4/local/bin/cyclops64-linux-elf-sim --no-spmd -p10 ./hello
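Since TNT threads are described as very similar to POSIX threads, the create-then-join pattern above can also be tried on an ordinary workstation. The following is a minimal sketch that uses standard pthreads instead of the TNT library; run_hello_threads, hello_func and MAX_THRS are names invented for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_THRS 64

/* Thread body: the thread id travels inside the void* argument. */
static void *hello_func(void *ptr)
{
    int id = (int)(intptr_t)ptr;
    printf("Hello world from thread %d\n", id);
    return NULL;
}

/* Create n threads, then join every thread that was created.
   Returns the number of threads joined successfully. */
int run_hello_threads(int n)
{
    pthread_t th[MAX_THRS];
    int created, i, joined = 0;

    assert(n > 0 && n <= MAX_THRS);
    for (created = 0; created < n; created++)
        if (pthread_create(&th[created], NULL, hello_func,
                           (void *)(intptr_t)created) != 0)
            break;                      /* stop at the first failure */
    for (i = 0; i < created; i++)
        if (pthread_join(th[i], NULL) == 0)
            joined++;
    return joined;
}
```

Calling run_hello_threads(4) prints four greetings in an order chosen by the scheduler, much like the unordered output of the SPMD example.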
15
Other Useful Calls
  • tnt_desc_t tnt_self(void): Returns the thread
    descriptor of the calling thread.
  • uint64_t tnt_clock(void): Returns the clock
    cycles elapsed since the start of the machine.
  • int tnt_is_my_spm(void *addr): Reports whether
    addr is part of the calling thread's scratch pad
    memory.
  • int tnt_is_spmd(void): Returns which mode is
    currently active.
  • int tnt_my_thread(void): Returns the thread id of
    the calling thread.
  • int tnt_num_threads(void): Returns the number of
    threads in the node.
16
Synchronization
  • Methods to safeguard against data races

A data race is an anomaly in which two or more
threads concurrently access the same shared memory
location and at least one of the accesses is a write.
The final result depends on the order of the
accesses, and both are incorrect!
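To make the anomaly concrete without depending on real thread timing, the sketch below replays fixed interleavings of two simulated threads that each execute shared = shared + 1 as a load, an add and a store. It is only an illustration: sim_thread, step and run_schedule are hypothetical helpers, not part of the TNT API, and the schedules are chosen by hand.

```c
#include <assert.h>

/* One simulated thread: a private register plus a program counter. */
typedef struct {
    long reg;
    int pc;
} sim_thread;

/* Execute the next step of "shared = shared + 1" for thread t:
   step 0 loads, step 1 adds, step 2 stores. */
static void step(sim_thread *t, long *shared)
{
    switch (t->pc) {
    case 0: t->reg = *shared; break;   /* load  */
    case 1: t->reg += 1;      break;   /* add   */
    case 2: *shared = t->reg; break;   /* store */
    default: break;                    /* thread already finished */
    }
    t->pc++;
}

/* Replay a schedule such as "AAABBB" or "ABABAB", where each letter
   lets thread A or thread B run one step, and return the final value
   of the shared variable (which starts at 0). */
long run_schedule(const char *schedule)
{
    sim_thread a = {0, 0}, b = {0, 0};
    long shared = 0;
    const char *p;

    for (p = schedule; *p != '\0'; p++) {
        assert(*p == 'A' || *p == 'B');
        step(*p == 'A' ? &a : &b, &shared);
    }
    return shared;
}
```

With the schedule "AAABBB" the two increments run one after the other and the result is 2. With "ABABAB" both threads load 0 before either stores, so one update is lost and the result is 1.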
17
Types
  • Semaphore: Intelligent resource counters. Zero
    means not available. It allows two operations, P
    and V.
  • Critical sections: Only one thread can execute the
    code region in the critical section at a time.
  • Monitors: A monitor consists of (1) a set of
    procedures to work on shared variables, (2) a set
    of shared variables, (3) an invariant and (4) a
    lock to protect from access by other threads.
  • Mutex / locks: A binary semaphore.
  • Barrier: Ensures an ordering between threads /
    operations.
  • Conditional variables: Consist of an invariant, a
    mutex lock and a signal placeholder which records
    other threads' activities.
18
TNT Threads: Synchronization
Types
Mutex (mutual exclusion)
  • tnt_mutex_t: The actual lock variable. It can be
    of 4 different types.
  • tnt_mutexattr_t: A variable that holds the
    attributes of the created lock.
Conditional variables (inter-thread communication)
  • tnt_cond_t: The actual conditional variable.
  • tnt_condattr_t: A variable that holds the
    attributes of the created conditional variable.
19
TNT Threads: Synchronization
API: Creating and Destroying
Mutex (mutual exclusion)
  • tnt_ret_t tnt_mutex_init(tnt_mutex_t *mutex,
    const tnt_mutexattr_t *attr): Initializes the lock
    with a given attribute object. If the attribute
    object is NULL, the lock is initialized to the
    default values.
  • tnt_ret_t tnt_mutex_destroy(tnt_mutex_t *mutex):
    Destroys the lock object once it is no longer
    needed.
Conditional variables (inter-thread communication)
  • tnt_ret_t tnt_cond_init(tnt_cond_t *cond, const
    tnt_condattr_t *cond_attr): Initializes the
    conditional variable with a given attribute
    object. If the attribute object is NULL, the
    conditional variable is initialized to the default
    values.
  • tnt_ret_t tnt_cond_destroy(tnt_cond_t *cond):
    Destroys the conditional variable once it is no
    longer needed.
20
TNT Threads: Synchronization
API: Using Them
Mutex (mutual exclusion)
  • tnt_ret_t tnt_mutex_lock(tnt_mutex_t *mutex):
    Gives the lock to the calling thread, or makes it
    wait until the lock becomes available. In case of
    relocking, the behavior is decided by the type of
    lock.
  • tnt_ret_t tnt_mutex_unlock(tnt_mutex_t *mutex):
    Releases the lock held by the calling thread. If
    the lock was not locked, the behavior is decided
    by the type of lock being used.
Conditional variables (inter-thread communication)
  • tnt_ret_t tnt_cond_wait(tnt_cond_t *cond,
    tnt_mutex_t *mutex): Makes the calling thread wait
    for a signal before continuing. The mutex MUST be
    locked at this time.
  • tnt_ret_t tnt_cond_broadcast(tnt_cond_t *cond):
    Sends a signal to one or more threads waiting on
    the conditional variable.
21
TNT Threads: Synchronization, Extra Note
Different types of mutexes
Set by the tnt_mutexattr_settype or by the
tnt_mutex_init functions
  • tnt_mutex_default: The same as the fast mutex.
  • tnt_mutex_fast: The simplest lock. It might
    deadlock if relocked, and the behavior of
    excessive unlocking is usually undefined.
  • tnt_mutex_errorcheck: It will return errors if it
    is relocked or excessively unlocked.
  • tnt_mutex_recursive: It will allow multiple
    relockings, but it will only release the lock once
    an equal number of unlocks has been performed.
  • tnt_mutex_priority: Ensures FIFO ordering.
22
An Example of Synchronization (1)
    tnt_desc_t th[NUM_THRS+1];
    int cnt = 0;
    tnt_mutex_t mutex;
    tnt_cond_t condv;

    void thr_func(void *ptr)
    {
        int id = (int)ptr;
        int local_cnt = id * 5;        /* value this thread waits for */
        int next_cnt = (id+1) * 5;     /* value that releases the next thread */

        tnt_sleep(rand()%201);
        printf(...);
        if (tnt_mutex_lock(&mutex) != tnt_ok)
        {
            perror("Mutex lock error\n");
            exit(5);
        }
        while (cnt < local_cnt)
            tnt_cond_wait(&condv, &mutex);
        cnt = next_cnt;
        printf(...);
        tnt_cond_broadcast(&condv);
        printf(...);
        if (tnt_mutex_unlock(&mutex) != tnt_ok)
        {
            perror("Mutex unlock error\n");
            exit(6);
        }
    }

    int main(int argc, char **argv)
    {
        int i;
        // Prepare arguments
        if (tnt_mutex_init(&mutex, NULL) != tnt_ok)
        {
            perror("Mutex Initialization failed\n");
            exit(1);
        }
        if (tnt_cond_init(&condv, NULL) != tnt_ok)
        {
            perror("Conditional Variable Initialization failed\n");
            exit(9);
        }
        // Create and join threads
        if (tnt_mutex_destroy(&mutex) != tnt_ok)
        {
            perror("Mutex Destruction failed\n");
            exit(2);
        }
        if (tnt_cond_destroy(&condv) != tnt_ok)
        {
            perror("Conditional Variable Destruction failed\n");
            exit(10);
        }
        return 0;
    }
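For comparison, the same pattern (lock, wait in a loop on a shared counter, update it, broadcast, unlock) can be written with standard POSIX threads. This is a sketch under the assumption that the pthread calls behave like their TNT counterparts described above; run_in_order, turn_func and the order array are invented here to make the resulting ordering observable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_THRS 16

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t condv = PTHREAD_COND_INITIALIZER;
static int cnt;              /* shared counter, protected by mutex */
static int order[MAX_THRS];  /* order[k] is the id of the k-th thread to pass */
static int passed;           /* number of threads that have passed so far */

static void *turn_func(void *ptr)
{
    int id = (int)(intptr_t)ptr;
    int local_cnt = id * 5;        /* value this thread waits for */
    int next_cnt = (id + 1) * 5;   /* value that releases the next thread */

    pthread_mutex_lock(&mutex);
    while (cnt < local_cnt)        /* recheck the condition after every wakeup */
        pthread_cond_wait(&condv, &mutex);
    order[passed++] = id;
    cnt = next_cnt;
    pthread_cond_broadcast(&condv);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

/* Run n threads; return 1 if they passed the critical section in id order. */
int run_in_order(int n)
{
    pthread_t th[MAX_THRS];
    int i, created, ok = 1;

    assert(n > 0 && n <= MAX_THRS);
    cnt = 0;
    passed = 0;
    for (created = 0; created < n; created++)
        if (pthread_create(&th[created], NULL, turn_func,
                           (void *)(intptr_t)created) != 0)
            break;
    for (i = 0; i < created; i++)
        pthread_join(th[i], NULL);
    if (created != n)
        return 0;
    for (i = 0; i < n; i++)
        if (order[i] != i)
            ok = 0;
    return ok;
}
```

Thread k can only leave the wait loop after thread k-1 has raised the counter to 5k, so the threads pass in id order no matter how the scheduler starts them.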
23
TNT Threads: Synchronization
Barriers
Type
  • tnt_barrier_t: The barrier object. It must be
    visible to all threads that are signed in with the
    barrier.
Creation
  • tnt_ret_t tnt_barrier_init(tnt_barrier_t *barrier,
    long num_threads): Initializes the barrier object
    to expect num_threads threads to sign in with it.
    The default barrier does not need to be
    initialized.
  • The default barrier can be used by calling
    tnt_barrier_wait(NULL)
Destruction
  • tnt_ret_t tnt_barrier_destroy(tnt_barrier_t
    *barrier): Destroys the barrier object.
24
TNT Threads: Synchronization
Signing in and out
  • void tnt_barrier_include(tnt_barrier_t *barrier):
    Signs the current thread in with the barrier
    object.
  • void tnt_barrier_exclude(tnt_barrier_t *barrier):
    Signs the current thread out of the barrier
    object.
Usage
  • void tnt_barrier_wait(tnt_barrier_t *barrier):
    Waits on the barrier object for all the threads
    that have signed in with it.
  • SPMD note: The default barrier must be used to
    ensure the correct initialization of the barriers.
  • Non-SPMD note: The first call to the barrier will
    be a software barrier, to ensure that all the
    threads involved have signed in.
25
An Example of Synchronization (2)
    #define NUM_THRS 12
    tnt_desc_t th[NUM_THRS+1];
    tnt_barrier_t barr;

    void thr_func(void *ptr)
    {
        int id = (int)ptr;
        tnt_barrier_include(&barr);
        tnt_barrier_wait(&barr);
        printf(...);
        tnt_sleep(rand()%201);
        tnt_barrier_wait(&barr);
        printf(...);
    }

    int main(int argc, char **argv)
    {
        int i;
        void *args[NUM_THRS];

        for (i = 0; i < NUM_THRS; i++)
            args[i] = (void *)i;
        tnt_barrier_init(&barr, NUM_THRS);
        // Creation and joining of Threads
        tnt_barrier_destroy(&barr);
        return 0;
    }
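The idea of a barrier that expects a fixed number of signed-in threads can be sketched on top of a mutex and a conditional variable. The code below is an illustrative pthreads implementation, not the TNT barrier (which relies on hardware support); sketch_barrier, sketch_barrier_wait and run_barrier_demo are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_THRS 16

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t all_here;
    int expected;     /* number of threads signed in with the barrier */
    int waiting;      /* threads that have arrived in the current phase */
    unsigned phase;   /* incremented each time the barrier opens */
} sketch_barrier;

static void sketch_barrier_wait(sketch_barrier *b)
{
    unsigned my_phase;

    pthread_mutex_lock(&b->lock);
    my_phase = b->phase;
    if (++b->waiting == b->expected) {   /* last arrival opens the barrier */
        b->waiting = 0;
        b->phase++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (b->phase == my_phase)     /* sleep until the phase changes */
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

static sketch_barrier barr;
static int marks[MAX_THRS];  /* written before the barrier */
static int seen[MAX_THRS];   /* marks counted after the barrier */
static int nthreads;

static void *phase_func(void *ptr)
{
    int id = (int)(intptr_t)ptr, i, sum = 0;

    marks[id] = 1;                   /* phase 1: publish my mark */
    sketch_barrier_wait(&barr);      /* nobody continues until all have arrived */
    for (i = 0; i < nthreads; i++)   /* phase 2: every mark must be visible */
        sum += marks[i];
    seen[id] = sum;
    return NULL;
}

/* Return 1 if every thread saw all n marks after the barrier. */
int run_barrier_demo(int n)
{
    pthread_t th[MAX_THRS];
    int i, ok = 1;

    assert(n > 0 && n <= MAX_THRS);
    nthreads = n;
    pthread_mutex_init(&barr.lock, NULL);
    pthread_cond_init(&barr.all_here, NULL);
    barr.expected = n;
    barr.waiting = 0;
    barr.phase = 0;
    for (i = 0; i < n; i++)
        marks[i] = seen[i] = 0;
    for (i = 0; i < n; i++)
        if (pthread_create(&th[i], NULL, phase_func, (void *)(intptr_t)i) != 0)
            abort();  /* a missing thread would leave the others waiting forever */
    for (i = 0; i < n; i++)
        pthread_join(th[i], NULL);
    for (i = 0; i < n; i++)
        if (seen[i] != n)
            ok = 0;
    pthread_cond_destroy(&barr.all_here);
    pthread_mutex_destroy(&barr.lock);
    return ok;
}
```

The phase counter is what lets the same barrier be reused for a second wait, as the example above does with its two tnt_barrier_wait calls.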
26
The TNT Threads Tutorial
  • The Memory Side of things

27
Cyclops-64 Memory Hierarchy
  • Off-chip DRAM
  • The highest latency but the biggest capacity
  • Access: 57 cycles
  • Defaults: code sections and heap variables
    allocated with malloc
  • Macro: #pragma dram var
  • The var parameter can be either a function call
    or a variable
  • Interleaved SRAM
  • Access: 30 cycles
  • Defaults: data sections (data and bss)
  • Macro: #pragma sram var
  • Function: void *sram_malloc(size_t nbytes)
  • Scratch pad
  • Access: 5 cycles
  • Defaults: local (stack-based) variables
  • Macro: #pragma spm var

28
Memory Hierarchy Example
    #include <stdio.h>
    #include <cyclops64.h>
    #include <tnt.h>

    #pragma spm a
    #pragma sram b
    #pragma dram c
    int a, b, c;

    int main(int argc, char **argv)
    {
        int *d = sram_malloc(sizeof(int));

        a = 10;
        b = 11;
        c = 13;
        printf("Addresses %p %p %p %p\n", &a, &b, &c, d);
        free(d);
    }
29
The TNT Threads Tutorial
  • The Signal and Wait Schemes

30
TNT Threads: Signal and Wait Schemes
  • Why?
  • Low-power schemes
  • Threads go to sleep and conserve power until
    another thread sends them a signal
  • On Cyclops-64 this is extremely efficient due to
    extra hardware support (the signal bus)

31
TNT Threads: Signal and Wait Schemes
  • tnt_sleep(cycles): Sleeps for the number of cycles
    given by the parameter cycles. Its maximum value
    is 255; if it is zero, the thread sleeps until
    another thread wakes it up.
  • void tnt_suspend(void): Analogous to calling
    tnt_sleep with zero cycles.
  • void tnt_awake(const tnt_desc_t th): Sends a
    wake-up signal to the thread that the descriptor
    th belongs to.
32
An Example for Signal and Wait Schemes
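    void thr_func(void *ptr)
    {
        int id = (int)ptr;
        tnt_suspend();           /* sleep until the previous thread wakes us */
        printf(...);
        tnt_awake(th[id+1]);     /* pass the signal down the chain */
    }

    int main(int argc, char **argv)
    {
        int i;
        void *args[NUM_THRS];

        for (i = 0; i < NUM_THRS; i++)
            args[i] = (void *)i;
        CREATE_RNG(0, NUM_THRS, th, thr_func, args)
        th[NUM_THRS] = tnt_self();   /* the last thread wakes main */
        printf(...);
        tnt_awake(th[0]);
        tnt_sleep(0);
        printf(...);
        return 0;
    }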
33
The TNT Threads Tutorial
  • Intrinsics

34
TNT Threads: Intrinsics
  • C API calls which have a one-to-one correspondence
    to assembly instructions
  • In Cyclops-64, they are used to do atomic
    operations in memory
  • These operations are usually very limited in scope
    (single arithmetic and logical operations)
  • For a complete list of the intrinsics, please
    refer to the Cyclops-64 Programming Manual
  • The example presented next uses _l_add_m to
    achieve atomic memory increments

35
An Example of Intrinsics
    long long cnt1 = 0;
    long long cnt2 = 0;
    long long cnt3 = 0;
    long long cnt4 = 0;
    tnt_mutex_t mutex;

    /* Atomic in-memory increments through an intrinsic. */
    void thr_func1(void *ptr)
    {
        int id = (int)ptr, i;
        for (i = 0; i < 100; i++)
            _add_m(cnt1, 1);
    }

    /* Increments protected by a mutex. */
    void thr_func2(void *ptr)
    {
        int id = (int)ptr, i;
        for (i = 0; i < 100; i++)
        {
            tnt_mutex_lock(&mutex);
            cnt2 = cnt2 + 1;
            tnt_mutex_unlock(&mutex);
        }
    }

    /* Unprotected increments (subject to data races). */
    void thr_func3(void *ptr)
    {
        int id = (int)ptr, i;
        for (i = 0; i < 100; i++)
            cnt3 = cnt3 + 1;
    }

    int main(int argc, char **argv)
    {
        int i;
        void *args[NUM_THRS];

        for (i = 0; i < NUM_THRS; i++)
            args[i] = (void *)i;
        if (tnt_mutex_init(&mutex, NULL) != tnt_ok)
        {
            perror(...);
            exit(1);
        }
        CREATE_RNG(0, NUM_THRS, th, thr_func1, args)
        JOIN_RNG(0, NUM_THRS, th)
        CREATE_RNG(0, NUM_THRS, th, thr_func2, args)
        JOIN_RNG(0, NUM_THRS, th)
        CREATE_RNG(0, NUM_THRS, th, thr_func3, args)
        JOIN_RNG(0, NUM_THRS, th)
        if (tnt_mutex_destroy(&mutex) != tnt_ok)
        {
            perror("Mutex Destruction failed\n");
            exit(2);
        }
        /* Serial reference count. */
        for (i = 0; i < 100 * NUM_THRS; i++)
            cnt4 = cnt4 + 1;
        printf(...);
        return 0;
    }
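The Cyclops-64 intrinsics are not available on other machines, but the comparison between an atomic increment and a mutex-protected increment can be approximated with C11 atomics and pthreads. The sketch below assumes atomic_fetch_add as a stand-in for the atomic add intrinsic; count_with_atomics, count_with_mutex and run_all are invented names, and the unprotected variant is left out on purpose because a data race is undefined behavior in standard C.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THRS 8
#define INCS 100

static atomic_llong atomic_cnt;   /* updated with an atomic read-modify-write */
static long long locked_cnt;      /* updated only while holding mutex */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *atomic_func(void *unused)
{
    int i;
    (void)unused;
    for (i = 0; i < INCS; i++)
        atomic_fetch_add(&atomic_cnt, 1);
    return NULL;
}

static void *locked_func(void *unused)
{
    int i;
    (void)unused;
    for (i = 0; i < INCS; i++) {
        pthread_mutex_lock(&mutex);
        locked_cnt = locked_cnt + 1;
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

/* Run NUM_THRS copies of func to completion; return how many were started. */
static int run_all(void *(*func)(void *))
{
    pthread_t th[NUM_THRS];
    int created, i;

    for (created = 0; created < NUM_THRS; created++)
        if (pthread_create(&th[created], NULL, func, NULL) != 0)
            break;
    for (i = 0; i < created; i++)
        pthread_join(th[i], NULL);
    return created;
}

long long count_with_atomics(void)
{
    int started;

    atomic_store(&atomic_cnt, 0);
    started = run_all(atomic_func);
    assert(started == NUM_THRS);
    return atomic_load(&atomic_cnt);
}

long long count_with_mutex(void)
{
    int started;

    locked_cnt = 0;
    started = run_all(locked_func);
    assert(started == NUM_THRS);
    return locked_cnt;
}
```

Both functions should return NUM_THRS * INCS; the atomic version reaches that total without ever taking a lock.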
36
Random Access
  • Exercise Number 1

37
Description
  • A small benchmark used to test memory bandwidth
  • Benchmark characteristics
  • An array which should be at least half the size
    of the physical memory of the system
  • The size must be a power of 2
  • A series of pseudo-random numbers which can be
    reproduced easily.
  • A synchronization method to keep the updates to
    the table safe from data races.
  • Each update to the array should XOR its
    location with another value (usually a random
    number)

38
Description
  • However
  • This benchmark would take too long in the
    simulator
  • Thus
  • A fixed array size is given (it must still be a
    power of 2)
  • The random numbers are saved into an array
    instead of being produced by an RNG function.
  • The random numbers are generated in the range 0
    to table size - 1.
  • This simplified version is called Toy RA
  • This specific version is used to test
    synchronization costs
  • Metrics: time in seconds and GUPS

39
Description
  • table[0, ..., TABLE_SIZE - 1]
  • ran_table[0, ..., UPDATES]
  • The update procedure: table[j] ^= j, where each j
    is read from ran_table and 0 <= j <= TABLE_SIZE - 1
  • UPDATES should be 4 times TABLE_SIZE
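A serial version of the update loop helps to pin down what the parallel version must reproduce. The sketch below assumes that each update XORs the table entry selected by a random number with that same number, as the earlier description of the benchmark suggests; init_table_sketch, update_serial and count_errors_sketch are hypothetical stand-ins and not the routines supplied with the exercise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill the table with consecutive integers: table[i] = i. */
void init_table_sketch(uint64_t *table, size_t table_size)
{
    size_t i;
    for (i = 0; i < table_size; i++)
        table[i] = (uint64_t)i;
}

/* Apply every update once: table[j] ^= j for each j taken from ran. */
void update_serial(uint64_t *table, size_t table_size,
                   const uint64_t *ran, size_t updates)
{
    size_t i;

    /* table_size must be a power of 2 so that masking keeps j in range */
    assert(table_size > 0 && (table_size & (table_size - 1)) == 0);
    for (i = 0; i < updates; i++) {
        uint64_t j = ran[i] & (uint64_t)(table_size - 1);
        table[j] ^= j;
    }
}

/* Destructive check: XOR is its own inverse, so replaying the same updates
   must restore table[i] == i. Returns the number of entries that differ. */
size_t count_errors_sketch(uint64_t *table, size_t table_size,
                           const uint64_t *ran, size_t updates)
{
    size_t i, errors = 0;

    update_serial(table, table_size, ran, updates);
    for (i = 0; i < table_size; i++)
        if (table[i] != (uint64_t)i)
            errors++;
    return errors;
}
```

Because the check replays the updates on the table itself, it changes the array while verifying it, which is the same kind of destructive check described for the supplied error-checking routine.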
40
What is given
  • An array of 64-bit elements whose size is
    defined by the POWER_2_SIZE macro
  • It resides in DRAM
  • TABLE_SIZE = 2^POWER_2_SIZE
  • Another array filled with random numbers, whose
    size is four times the table size
  • Initialization routines
  • init_table(t, s): initializes the table t of size
    s with consecutive integers
  • random_fill(): fills the random array with random
    numbers
  • Clocking routines
  • get_time(): gets the wall clock time in seconds
  • Error checking routines
  • check_err(): checks the array for errors. This
    function is destructive, since it changes the
    array in order to check for correctness

41
What is given
  • Two files
  • The parallel.c file: contains a C skeleton with
    the necessary headers and routines.
  • The Makefile file: used to compile the
    parallel.c file and to run its executable.
  • To clean your project: make clean
  • To compile your project: make
  • To run your executable: make run

42
What is expected
  • A parallel update function
  • It must have less than 1% of errors in the final
    array
  • It must use one method of synchronization or a
    method to ensure atomicity
  • A structure which contains the parameters for the
    thread
  • typedef struct __params { unsigned long long
    begin, size; } params;
  • Create at most 16 threads using the TNT thread
    library, distributing the work according to this
    pseudocode
  • size_per_thread = UPDATES / NUM_THREADS
  • thread_begin = thread_id * size_per_thread
  • if (LAST_THREAD) size_per_thread += UPDATES %
    NUM_THREADS
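The work distribution pseudocode can be written as a small helper that fills in the per-thread parameters. This is a sketch that assumes the last thread picks up the remainder UPDATES % NUM_THREADS; chunk_for_thread is an invented name, and the struct mirrors the begin and size fields of the parameter structure.

```c
#include <assert.h>

typedef struct {
    unsigned long long begin;   /* index of the first update for this thread */
    unsigned long long size;    /* number of updates this thread performs */
} params;

/* Split `updates` items over `num_threads` threads in contiguous chunks.
   Every thread gets updates / num_threads items; the last thread also
   takes the leftover items when the division is not exact. */
params chunk_for_thread(unsigned long long updates,
                        unsigned num_threads, unsigned thread_id)
{
    params p;
    unsigned long long size_per_thread;

    assert(num_threads > 0 && thread_id < num_threads);
    size_per_thread = updates / num_threads;
    p.begin = thread_id * size_per_thread;
    p.size = size_per_thread;
    if (thread_id == num_threads - 1)
        p.size += updates % num_threads;
    return p;
}
```

For example, 100 updates over 16 threads give every thread 6 updates, except the last one, which starts at index 90 and performs the remaining 10.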

43
What is expected
  • Finally, students must present their methods with
    values for 1, 2, 4, 8 and 16 threads, together
    with their scalability
  • Scalability: runtime for a single thread /
    runtime for N threads.
  • This means that when calculating your values, the
    scalability for one thread should be 1
  • Ideal scalability equals the number of threads
  • If using two threads, then the ideal scalability
    is 2
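The two reported metrics reduce to simple ratios. The helpers below are a sketch: scalability is taken as the single-thread runtime divided by the N-thread runtime, so that its ideal value with N threads is N, and GUPS is taken as giga-updates per second; the function names are invented for this illustration.

```c
#include <assert.h>

/* Scalability (speedup) of a run with N threads relative to one thread. */
double scalability(double runtime_single, double runtime_n)
{
    assert(runtime_single > 0.0 && runtime_n > 0.0);
    return runtime_single / runtime_n;
}

/* Giga-updates per second for `updates` table updates done in `seconds`. */
double gups(unsigned long long updates, double seconds)
{
    assert(seconds > 0.0);
    return (double)updates / seconds / 1e9;
}
```

A run that takes 8 seconds on one thread and 2 seconds on four threads has a scalability of 4, which matches the ideal value for four threads.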