1
Programming multicores using the sequential
paradigm
  • Chris Jesshope
  • Professor of Systems Architecture Engineering
  • University of Amsterdam
  • jesshope@science.uva.nl

Workshop on programming multicores - Tsinghua
University, 10/9/2006
2
First an apology
  • This is a workshop on programming multicores, but
  • I will be talking quite a bit about architectural
    issues
  • Why? Because I believe we cannot program
    multicores effectively without a lot of support
    from the hardware
  • certainly not in a sequential language!

3
Overview
  • Architectural support for microthreading or
    Dynamic RISC (DRISC)
  • µTC - an intermediate language reflecting the
    support for named families of microthreads
  • Some compilation examples: C to µTC
  • Status of our work

4
Why use sequential code?
  • To avoid non-determinism in programs
  • The legacy-code problem, both source and binary
  • I also believe that extracting concurrency from
    sequential code is not the problem
  • the difficulty is resolving and scheduling
    dependencies
  • This has been known since the 1980s
  • e.g. dataflow with single-assignment constraints
  • but tomorrow's multi-core processors must support
    legacy C and its pointers as well as any new
    languages
  • This means the support we need must be dynamic

5
Why hardware support?
Concurrency captured ∝ size of synchronising
memory
  • To schedule a unit of concurrency (UoC), some of
    its data must be available so that it can
    proceed
  • this requires synchronisers and schedulers
  • a UoC can be one instruction or a complete
    program
  • Mainstream processors are not designed to do this
    and are not candidates for multi-core designs
  • they perform concurrency management in the
    operating system or the compiler's run-time
    system
  • We must have context management, synchronisers
    and schedulers implemented in the processor
    hardware
  • implemented in ways that scale with the
    concurrency captured - both distributed and
    interleaved concurrency

6
ISA(DRISC) = ISA(RISC) + 5 instructions
  • create - creates a family of threads - yields a
    family id, fid
  • the family size is parametric and can even be
    infinite
  • sync(fid) - waits for a specified family to
    complete
  • a family barrier; n.b. multiple families can be
    active at one time
  • break - terminates a family from one of its
    threads
  • stops allocating new threads and kills all active
    threads
  • kill(fid) - kills a specified family from a
    thread in another family, e.g. a control
    environment
  • squeeze(fid) - preempts an entire family from a
    control environment so that it can be restarted
    on other resources
  • N.b. I will only be using 1-3 (sketched below);
    4 and 5 support self-adaptive systems, which is
    outside the scope of this talk
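
As a minimal µTC-level sketch of instructions 1-3 (the array a and
its bound are hypothetical; the µTC syntax itself is defined on
slide 14):

int fid;
create(fid, 0, 99)       /* 1: create a family of 100 threads */
{
    index int i;         /* each thread gets one index in 0..99 */
    if (a[i] < 0.0)
        break;           /* 3: stop creating threads, kill active ones */
    a[i] = 2.0 * a[i];
}
sync(fid);               /* 2: family barrier */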

7
DRISC pipeline
  • Note the potential for power efficiency
  • If a thread is inactive its TIB line is turned
    off
  • If the queue is empty the processor turns off
  • The queue length measures local load and can be
    used to adjust the local clock rate

[Pipeline diagram: synchronising memory, thread instruction buffer
(TIB), queue of active threads, fixed-delay operations, and
variable-delay operations (e.g. memory), linked by instruction and
data paths]
0. Threads are created dynamically with a context
of synchronising memory
1. Instructions are issued and read synchronising
memory
2. If data is available it is sent for processing,
otherwise the instruction suspends on the empty
register
3. Suspended instructions are rescheduled when
data is written
8
Example Chip Architecture
[Chip diagram: level 0 and level 1 tiles; each level 0 tile contains
pipes 0-3 with shared FPU pipe(s), data-diffusion memory, and
configuration switches]
Networks
  • coherency network - packet-switched static ring
    (64 bytes wide)
  • register-sharing network - circuit-switched
    ring(s) (8 bytes wide)
  • create delegation network - packet-switched
    (1 byte wide)
9
Dynamic threads
  • A thread is
  • created
  • allocated a context of synchronising registers
  • executes its instructions, possibly just one
  • and terminates
  • On termination its context of synchronising
    registers is recycled and no state remains
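
A µTC sketch of this lifecycle (the body and the shared counter v
are hypothetical): the thread's only state is its context of
synchronising registers, which is recycled when it terminates:

thread step(shared int v)  /* created with a context of synchronising registers */
{
    index int i;
    v = v + i;             /* possibly the thread's only instruction */
}                          /* terminates: context recycled, no state remains */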

10
Dynamic distribution
  • The create instruction distributes threads to an
    arbitrary set of processors deterministically
  • remember the thread leaves no state in the
    processors!
  • The number of threads can also be determined
    at runtime
  • infinite families of homogeneous threads are
    possible
  • create by block and use the break instruction to
    terminate creation (see slide 21)
  • The model does not admit communication deadlock
    as blocking reads occur only in acyclic graphs
  • Can deadlock on resources - just like dataflow
  • this deadlock can often be managed statically,
    although we have dynamic solutions for the
    exceptions

11
Dynamic interleaving
  • Each instruction in a thread executes as long as
    it has the data it requires in the registers it
    reads
  • if not the instruction halts and the thread
    suspends
  • Delay in obtaining data may be attributed to
  • prior operations in the same thread with
    non-deterministic delay e.g. memory fetch or
    maybe FPU operations
  • data generated within another thread
  • Any thread in the active queue is able to proceed
    by executing at least one instruction
  • This data-driven interleaving of instruction
    execution on a single processor is dynamic and
    provides massive latency tolerance

12
DRISC Processor characteristics
  • Thousands of registers in synchronising memory -
    hundreds or thousands of processors per chip
  • Chip can manage O(10^6) units of concurrency
  • Little or no D-cache; the I-cache is managed by
    thread
  • Processor is capable of
  • creating or rescheduling a thread on each cycle -
    concurrently with pipeline execution
  • context switching on each cycle
  • synchronising one external event on each cycle
  • Compile source code once and run on an arbitrary
    number of processors unmodified

13
µTC - models DRISC hardware
  • µTC is a low-level language used in our tool
    chain, compiling from C

14
µTC = C + 8 constructs
  • create(fid, start, limit, step, block, local)
  • fid - variable to receive a unique family
    identifier
  • start, limit, step - expressions defining the
    indices of the threads created; each thread gets
    its index variable set to a value in this range
  • block - expression defining the number of threads
    from this family allocated per processor at any
    one time
  • local - if present, creation is to the processor
    executing the create
  • break
  • kill(fid)
  • squeeze(fid)
  • thread - defines a function as a microthread
  • index - defines one local variable as the thread
    index
  • shared - defines a thread's local variable as
    shared (a worked sketch follows below)
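
A hedged sketch putting the constructs together (the arrays a, b and
c are hypothetical; the family computes c = a + b elementwise):

thread body()              /* 'thread': a function usable as a microthread */
{
    index int i;           /* 'index': this thread's index value */
    c[i] = a[i] + b[i];
}

int fid;
create(fid, 0, 99, 1, 8)   /* indices 0..99, step 1, at most 8 threads */
    body();                /* per processor; 'local' omitted, so threads */
sync(fid);                 /* may be distributed; then barrier on fid */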

15
Synchronisation in µTC
  • There is an assumption in µTC that local scalar
    variables in a thread are strictly synchronising
    - these are register variables
  • a thread cannot proceed unless its register
    variables are set
  • memory latency is tolerated with loads to
    register variables - the load must complete
    before a dependent instruction can execute
  • to support thread-to-thread dependencies these
    variables are defined as shared - between
    adjacent threads only
  • The sync is a barrier on a family which enforces
    shared or distributed memory consistency for
    writes in that family
  • two threads in the same family cannot read and
    write the same location in memory
    deterministically; instead they can use shared
    variables (minimal sketch below)
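
As a minimal sketch of these blocking semantics (x, y and the
prefix-sum computation are hypothetical; the Fibonacci family on
slide 19 is the fuller version):

thread scan(shared float acc)  /* acc is a strictly synchronising register */
{                              /* variable: reading it blocks until the */
    index int i;               /* previous thread in index order writes it */
    acc = acc + x[i];
    y[i] = acc;                /* running prefix sum over x */
}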

16
Compilation schemas using µTC
17
Heterogeneous concurrency from sequence
  • Why should we compile sequence to concurrency?
  • to allow doacross concurrency - needs shared
    variables
  • to avoid sequential state when using squeeze

float f1();
void f2(float y);

float a, r, pi = 3.142;
r = f1();
a = 2.0 * pi * r;
f2(a);

thread f1(shared float x);
thread f2(shared float y);
thread f1pt5(shared float z)
{
    float pi = 3.142;
    z = 2.0 * pi * z;
}

int fid; float r = 0;
create(fid, 1, 3, local)
    f1(r), f1pt5(r), f2(r);
In f1pt5, z is shared: reads to it go to the prior
thread and writes are available to the next
thread. Parameterising x, y and z with r in the
create provides a dependency chain through all
threads, initialised in the creating environment.
18
Homogeneous concurrency
float a[100], b[100], c[100];
int fid;
create(fid, 0, 99)
{
    index int i;
    c[i] = a[i]*a[i] + b[i]*b[i];
}
sync(fid);

float a[100], b[100], c[100];
int i;
for (i = 0; i < 100; i++)
    c[i] = a[i]*a[i] + b[i]*b[i];
  • In each thread the local index variable is
    initialised on creation to a value in the range
    defined by the create
  • Threads are created in index order subject to
    resource availability

19
Homogeneous concurrency with dependencies - to
multiple threads
int fid, t1, t2, fib[10];
t1 = fib[0] = 0;
t2 = fib[1] = 1;
create(fid, 2, 9, local)
{
    index int i;
    shared int t1, t2;
    fib[i] = t1 + t2;
    t1 = t2;
    t2 = fib[i];
}
sync(fid);

int i, fib[10];
fib[0] = 0; fib[1] = 1;
for (i = 2; i < 10; i++)
    fib[i] = fib[i-1] + fib[i-2];
Dependencies start as locals in the creating
environment and pass from thread to thread in
index order via shared variables. The non-local
communication is exposed in µTC using two shared
variables, t1 and t2 - t1 = t2 implements the
routing step. N.b. shareds must be scalar
variables.
20
Homogeneous concurrency with regular non-local
dependency
float a[1024], t[1024];
int i;
for (i = 32; i < 1024; i++)
    a[i] = (a[i] - a[i-32]) * t[i];

float a[1024], t[1024];
int f1, f2;
create(f1, 0, 31, 1, 4)
{
    index int i;
    float s = a[i];
    create(f2, 1, 31)
    {
        index int j;
        shared float s;
        s = (a[32*j + i] - s) * t[32*j + i];
        a[32*j + i] = s;
    }
    sync(f2);
}
sync(f1);
In this example the non-local dependency is
compiled to a pair of nested creates which gives
32 independent threads each creating a family of
32 dependent threads
21
Unbounded homogeneous concurrency
int i = 0;
while (true) {
    if (<cond>) break;
    i++;
}
print(i);

int i, fid, block;
/* set block */
create(fid, 0, maxint, 1, block)
{
    index int k;
    if (<cond>) break(k);
}
sync(fid, i);
print(i);
  • break can return a single value to the creating
    environment on synchronisation
  • The parameter block in create allows resources to
    be managed so as to avoid resource deadlock

22
Pointer disambiguation
void f(float *a, float *b)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

thread f(float *a, float *b)
{
    int f;
    k = b - a;
    create(f, 1, 3, local)
        case1(), case2(), case3();
    sync(f);
}
23
Pointer disambiguation (cont)
thread case2()
{
    int f1, f2, block;
    if (mod(k) < n && k < 0)
    {
        /* set block */
        create(f1, 0, k-1, 1, block)
        {
            index int i;
            float s = b[i];
            if (i < n%k)
                create(f2, 0, n/k)
                {
                    index int j;
                    shared float s;
                    s = a[i + j*k] + s;
                    a[i + j*k] = s;
                }
            else
                create(f2, 0, n/k - 1)
                {
                    index int j;
                    shared float s;
                    s = a[i + j*k] + s;
                    a[i + j*k] = s;
                }
            sync(f2);
        }
    }
    sync(f1);
}
Case 2 is like the non-local dependent example,
except that here n may not be divisible by k and
thus one of two possible bounds on the inner
create is required.
[Diagram: arrays a and b overlapping with pointer offset k; inner
family bounds n/k and n%k]
24
Summary
  • DRISC: hardware support for dynamic concurrency
    creation, distribution and interleaving, i.e.
  • contexts, synchronisers and schedulers
  • supports the compilation of sequential programs
    into maximally concurrent binaries
  • Our design work has shown that 100s of in-order
    DRISC processors (FPU per processor) would fit
    onto today's chips
  • We are working on FPGA prototypes and a compiler
    tool chain based on gcc and CoSy
  • We would welcome serious collaboration in this
    work

25
Dynamic concurrency
  • Concurrency is exploited to
  • gain throughput - real concurrency
  • tolerate latency - virtual concurrency
  • Lines of iso-concurrency identify tradeoffs
    between the two
  • Dynamic interleaving provides the mechanism for
    tolerating latency
  • Dynamic distribution provides the mechanism for
    this tradeoff

[Plot: virtual concurrency (log) vs real concurrency (log); lines of
iso-concurrency show the tradeoff between latency tolerance and
throughput]
26
Dataflow ETS (explicit token store) synchronisation
[Diagram: dataflow program fragment with arcs ml, mr and sl incident
on a matching store - dataflow matches on nodes in the dependency
graph!]
  • Two arcs incident on a node have the same tag or
    address - e.g. m
  • arcs are ordered (l/r) for non-commutative
    operations
  • The first data token sets a full/empty bit from
    empty (0) to full (1)
  • The second data token schedules the operation to
    a queue at the local ALU
  • N.b.
  • tags are used to distribute data to any
    processor - address in ETS = Pid + address
  • the matching store must be contextualised -
    address in ETS = context(i) + offset
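
A minimal C sketch of this matching rule, assuming a simplified
store layout and a hypothetical schedule callback (not the actual
hardware):

#include <stdbool.h>

typedef struct {           /* one slot of the matching store */
    bool  full;            /* full/empty bit, initially empty (0) */
    float value;           /* first operand, once it has arrived */
} Slot;

/* Token arrives for the slot at address context + offset.
   First token: store it and set the bit to full.  Second token:
   clear the bit and schedule the operation with both operands.
   (A real ETS also records which side, l or r, arrived first,
   so non-commutative operations get their operands in order.) */
void match(Slot *store, int context, int offset, float token,
           void (*schedule)(float left, float right))
{
    Slot *s = &store[context + offset];
    if (!s->full) {                /* first token */
        s->value = token;
        s->full  = true;
    } else {                       /* second token fires the node */
        s->full = false;           /* slot recycled for reuse */
        schedule(s->value, token); /* enqueue at the local ALU */
    }
}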
27
Synchronisation in Dynamic RISC
  • In DRISC, dataflow arcs are mapped to registers
    in synchronising memory
  • contexts are allocated dynamically
  • registers are set empty on allocation
  • Instructions read from registers but cannot
    execute until data is written
  • they suspend if data has not been written
  • In a multiprocessor the synchronising memory is
    distributed to the processors
  • It is larger than the ISA can address in order to
    manage many concurrent contexts

[Diagram: synchronising memory - the register file, with registers
allocated empty (E)]
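
A hedged C sketch of the same discipline on synchronising registers
(the type names and the reschedule callback are hypothetical; DRISC
does this in pipeline hardware):

#include <stdbool.h>

typedef enum { EMPTY, FULL, WAITING } RegState;

typedef struct {
    RegState state;   /* registers are set EMPTY on allocation */
    long     value;   /* valid only when FULL */
    int      waiter;  /* suspended thread id, valid when WAITING */
} SyncReg;

/* Issue-time read: yields the value if the register is full,
   otherwise the instruction suspends on the empty register. */
bool read_reg(SyncReg *r, int thread, long *out)
{
    if (r->state == FULL) {
        *out = r->value;
        return true;
    }
    r->state  = WAITING;   /* thread suspends until data is written */
    r->waiter = thread;
    return false;
}

/* A write makes the data available and reschedules any waiter
   back onto the active queue. */
void write_reg(SyncReg *r, long v, void (*reschedule)(int thread))
{
    bool waiting = (r->state == WAITING);
    int  waiter  = r->waiter;
    r->value = v;
    r->state = FULL;
    if (waiting)
        reschedule(waiter);
}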
28
Power conservation
Two feedback loops driven by a workload measure:
  • Schedule work - the hardware scheduler schedules
    work to the processor when its data is available
    (data availability -> schedule instructions ->
    DRISC pipeline)
  • Power/clock control - voltage/frequency scaling
    adjusts voltage and frequency to the local or
    relative workload; clocks stop and the processor
    stands by when there is no work
[Diagram: scheduler and power/clock control feeding the DRISC
pipeline (single-cycle operations) and asynchronous operations,
e.g. memory or FPU]
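
A toy C sketch of the power/clock loop, assuming the active-queue
length as the workload measure (the thresholds and state names are
illustrative only):

/* Choose a clock state from the length of the active-thread queue
   (the local load measure from the pipeline slide). */
typedef enum { STANDBY, LOW_FREQ, HIGH_FREQ } ClockState;

ClockState clock_control(int queue_length)
{
    if (queue_length == 0)
        return STANDBY;    /* queue empty: stop clocks, stand by */
    if (queue_length < 8)
        return LOW_FREQ;   /* light load: scale voltage/frequency down */
    return HIGH_FREQ;      /* heavy load: run at full rate */
}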
29
Performance with and without D-cache
The residual D-cache made no difference to
performance
30
Historical reflection
  • 25 years ago, in the book Parallel Computers, I
    introduced the principle of conservation of
    parallelism - P = L = M
  • where P, L, M are functions representing
    concurrency at the problem stage, HLL stage, and
    machine-code stage respectively
  • Today I would change this to L = 1, P = M
  • although I would also accept P = L = M
  • i.e. the binary captures all the concurrency there
    is in the problem but the HLL code can be
    sequential

31
So what changed?
  • It took the last 10 years to understand how to
    capture and schedule data-driven concurrency
    dynamically in a conventional RISC processor
  • Programming concurrency is (a) difficult [1], in
    part because (b) it often involves static
    scheduling
  • There are exceptions to (b), e.g.
  • data-parallel languages - e.g. Single-assignment
    C (SAC)
  • stream languages - e.g. S-Net (essentially
    dataflow)

[1] E. A. Lee (2006), The Problem with Threads,
IEEE Computer, 39 (5), May 2006, pp. 33-42