Microthreaded models for CMPs

1
Microthreaded models for CMPs
  • IFIP 10.3 Seminar 6/12/2005
  • By Chris Jesshope
  • University of Amsterdam

2
Motivation
  • Problems and opportunities in designing chip
    multi-processor architectures
  • Memory wall - large memories are slow; need to
    tolerate long-latency operations
  • Global communication - the proportion of the chip
    reachable in one clock cycle is diminishing
    exponentially; need asynchrony
  • Unscalable support structures - uni-processor
    issue width does not scale; need distributed
    structures
  • Power barrier - performance cannot be obtained
    indefinitely through frequency scaling; need to
    fully exploit concurrency in instruction execution

3
The facts of concurrency
  • Or how to teach mother to suck eggs

4
Concurrency - real and virtual
  • All code has concurrency
  • either explicitly or implicitly
  • This concurrency can be exploited to
  • gain throughput - real
  • tolerate latency - virtual
  • Lines of iso-concurrency in this space identify
    tradeoffs between the two
  • Schedule invariance allows that tradeoff to be
    dynamic

[Diagram: the concurrency space - virtual concurrency (latency tolerance) plotted against real concurrency (throughput), with lines of iso-concurrency; schedule invariance permits movement along them.]
5
Example
For i = 1, n : sum = sum + a(i)*b(i)
  • Take the simplest of examples the inner product
    operation
  • Depending on how this is viewed it has either
  • No concurrency
  • O(n) concurrency in parts
  • O(n) concurrency
  • What determines how the concurrency is exploited
    is the schedule of operations
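As a minimal illustration (a C sketch, not taken from the slides), the loop below makes the dependency structure explicit: the loads and multiplications are independent across iterations, while the accumulation into sum is the only loop-carried dependency.

#include <stdio.h>

/* Minimal sketch: the inner-product loop.  The loads of a[i] and b[i]
 * and the multiplications are independent across iterations (O(n)
 * concurrency "in parts"); only the accumulation into sum is a
 * loop-carried dependency. */
double inner_product(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double p = a[i] * b[i];   /* independent per iteration */
        sum += p;                 /* sequential unless reassociated */
    }
    return sum;
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    printf("%f\n", inner_product(a, b, 4)); /* 1*4 + 2*3 + 3*2 + 4*1 = 20 */
    return 0;
}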

6
Schedules - distribution and interleaving
  • Different kinds of schedules are required: for
    physical concurrency - distribution; for virtual
    concurrency - interleaving
  • Ideally
  • distribution should be static or deterministic -
    to provide control over locality and
    communication
  • interleaving should be dynamic or
    non-deterministic - to allow for asynchrony in
    distributing operations and data

7
Dynamic scheduling
  • Dynamic scheduling requires synchronisation which
    in turn requires synchronising memory
  • the amount of synchronising memory limits the
    amount of dynamic concurrency
  • This is independent of how the concurrency is
    identified or executed, e.g.
  • Out-of-order issue - issue windows or reservation
    stations and reorder buffers
  • Dataflow - matching stores
  • Any other approach!

8
Example
For i = 1, n : sum = sum + a(i)*b(i)
  • Dynamic scheduling must resolve dependencies -
    there are dependencies both within and between
    iterations in this example
  • within - the sequence of operations (which
    complete with non-deterministic delay)
  • between - generation of the index; summation of
    the products

9
Conventional vs. dataflow ISAs
  • Synchronisation of dynamic concurrency in
    different ISAs
  • In a dataflow ISA synchronisation is on the nodes
  • receive one or more inputs with (almost)
    identical tags
  • schedule the instruction for execution
  • write the result to one or more target nodes
  • In a conventional ISA synchronisation is on the
    arcs
  • wait for one or two inputs with different tags
  • schedule the instruction for execution
  • write the result to a single arc

10
Synchronisation memory
  • Implemented with either
  • associative or pseudo-associative memory
  • expensive to implement but supports flexibility
    in tagging
  • explicit token store - memory with full/empty
    bits
  • here the tag is used as the address into memory
  • ETS is now more widely used
  • in an out-of-order issue processor the tags are
    the renamed, global register specifiers
  • in a dataflow architecture the tags identify a
    processor and a matching location in its
    distributed memory

11
Dataflow matching
  • Two arcs incident on a node have the same
    tag/address - m
  • arc ordering is needed for non-associative
    operations
  • hence each token carries an l/r discriminator
  • The first token arriving sets a full/empty bit to
    full (1)
  • The second token arriving at the same node
    schedules the operation

[Diagram: tokens a(i) and b(i) carry the same tag m with l/r discriminators (ml, mr); the first to arrive finds the slot empty, stores its value and sets the full/empty bit; the second matches it, and a(i), b(i) and the destination sl are dispatched to the ALU.]
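A minimal C sketch of the matching step, assuming a simplified explicit-token-store slot (the l/r discriminator needed for non-associative operations is omitted): the first token to arrive is stored and the full/empty bit is set; the second token matches it and the operation fires.

#include <stdio.h>
#include <stdbool.h>

/* Simplified explicit-token-store matching location with a full/empty bit. */
typedef struct {
    bool   full;     /* full/empty bit, initially empty */
    double operand;  /* operand stored by the first token */
} match_slot;

/* Deliver a token to slot m; returns true if the node fired. */
bool deliver(match_slot *m, double value, double *result)
{
    if (!m->full) {                    /* first token: store and set full */
        m->operand = value;
        m->full = true;
        return false;
    }
    *result = m->operand * value;      /* second token: fire (here, multiply) */
    m->full = false;                   /* slot becomes empty again */
    return true;
}

int main(void)
{
    match_slot m = { false, 0.0 };
    double r;
    deliver(&m, 3.0, &r);              /* a(i) arrives first */
    if (deliver(&m, 4.0, &r))          /* b(i) arrives, the node fires */
        printf("a(i)*b(i) = %f\n", r); /* 12.0 */
    return 0;
}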
12
Conventional ISA synchronisation
  • Each arc corresponds to a unique address in the
    synchronisation memory
  • This is the renamed register specifier in
    out-of-order issue
  • The location is flagged as empty when initialised
  • An instruction that requires this location can
    not be executed until a value has been written
  • The flag is set to full on a write to this
    location

13
Contextual information
  • For reentrant code contextual information is
    required to manage the use of synchronisation
    memory
  • General-purpose dataflow - frames (see [1])
  • Wavescalar - a wave number identifies and
    sequentialises contexts (and memory references)
  • TRIPS - small fixed number of contexts exposed by
    speculatively executing high-level control flow
    graph
  • Out-of-order issue - no explicit contextual
    information

[1] G. Papadopoulos (1996), Implementation of a
General-Purpose Dataflow Multiprocessor, MIT Press.
14
Summation dependency
All loads can be scheduled simultaneously in this
example; the n independent multiplications are each
dependent on their respective loads; the summation,
however, is sequential - or is it?
15
Reduction operations
For i = 1, n : s(i) = a(i)*b(i) ; sab = sum(s)
  • The code can be transformed using a reduction
    operation - sum
  • however, the schedule below is too specific as
    the reduction can be performed in any order

[Diagram: pairs of Load a(i) / Load b(i) feed independent multiplications whose products are summed by one particular chain of additions - a single, over-specified schedule of the reduction.]
16
Dataflow scheduling of reductions
  • Sum is implemented most generally using a set of
    values s(i) where each scheduled operation
  • removes a pair of values from the set, sums them
    and returns the result to the set
  • n-1 such operations are required and at most n/2
    can be performed concurrently

[Diagram: the same loads and multiplications, with the products now returned to a set and summed by combining arbitrary pairs from that set.]
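The set-based reduction can be sketched in C as follows (an illustrative sequential model, not the dataflow hardware): pairs are repeatedly removed from the set, summed, and the result returned to it, taking exactly n-1 additions in an arbitrary order.

#include <stdio.h>

/* Order-independent reduction over a set: repeatedly remove two values,
 * add them, and return the result to the set.  Exactly n-1 additions
 * are performed, and in the dataflow model up to n/2 of them could run
 * concurrently at the first step. */
double reduce_sum(double *set, int n)
{
    while (n > 1) {
        double x = set[--n];      /* remove two values from the set ... */
        double y = set[--n];
        set[n++] = x + y;         /* ... and return their sum (any pairing works) */
    }
    return set[0];
}

int main(void)
{
    double s[6] = {1, 2, 3, 4, 5, 6};
    printf("%f\n", reduce_sum(s, 6));  /* 21.0 */
    return 0;
}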






17
Matching for reductions
  • Example of the decentralised control of
    reductions
  • Use special reduction tokens addressing special
    locations
  • The location matches pairs of operands and sends
    results to the same location
  • Again distribution can be deterministic
  • but the ordering of operations and the pairs
    matched is arbitrary
  • Termination can use token counting, i.e. tag one
    token with n-1 and all others with -1; terminate
    when their sum reaches 0

[Token format: tag (a special ID naming the sum location), Pid (the processor id to perform the operation), s(i) (the data to be summed).]
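A C sketch of the token-counting termination rule, under the assumption that the counter tags are n-1 on one token and -1 on the others: counts are added whenever two tokens are merged, and the reduction is complete when the surviving token's count is 0.

#include <stdio.h>

/* Token-counting termination for a decentralised reduction. */
typedef struct {
    double value;  /* partial sum s(i) */
    int    count;  /* n-1 on one token, -1 on all others (assumed encoding) */
} token;

token merge(token a, token b)
{
    token r = { a.value + b.value, a.count + b.count };
    return r;
}

int main(void)
{
    enum { N = 4 };
    token t[N] = { {1.0, N - 1}, {2.0, -1}, {3.0, -1}, {4.0, -1} };

    /* Merge in an arbitrary order; the pairing does not matter. */
    token acc = t[0];
    for (int i = 1; i < N; i++)
        acc = merge(acc, t[i]);

    if (acc.count == 0)                   /* termination detected */
        printf("sum = %f\n", acc.value);  /* 10.0 */
    return 0;
}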
18
The loop dependency
  • The loop provides contextual information
  • out-of-order issue - the loop control is predicted
    and the loop is unrolled into synchronising
    memory, but no notion of context is retained
  • Dataflow - loop-control operations are executed as
    instructions and identify loop frames
  • Vector - the loop is embedded completely in a
    single instruction
  • In general this dependency can be statically or
    deterministically removed

19
Removal of loop dependency
20
Microthreads
  • Fragmentation of sequential code

21
Goals of our work
  • Concurrent instruction issue such that
  • silicon area is proportional to issue width
  • power dissipated is proportional to issue width
    and
  • performance is proportional to power dissipated
    for a given clock frequency
  • Should be programmable from sequential code
  • should be backwards compatible
  • unmodified code should run on a single processor
  • or be translated once for any number of
    processors
  • binary-to-binary translation or by recompilation

(Issue width: the number of instructions issued
concurrently on chip, regardless of mechanism.)
22
Concurrency in instruction issue
  • How to manage concurrency when departing from
    sequential-instruction issue
  • No controls - execute instructions out of
    programmed order, e.g. superscalar
  • Fixed schedules - execute instructions using
    compiled schedules, e.g. VLIW
  • Dataflow - execute instructions dynamically when
    data is available, e.g. Wavescalar
  • Software - thread-level concurrency on multiple
    processors, e.g. multithreading

Support dynamic concurrency
23
5. Code fragmentation
  • Transform sequential code into code fragments by
    adding explicit concurrency controls to the ISA
  • Execute the fragments out of order - concurrently
  • Execute instructions within a fragment in-order
  • Schedule instructions dynamically - data driven
  • interleave fragments on one processor to give
    tolerance to latency (e.g. memory and other long
    operations)
  • distribute fragments to multiple processors to
    give scalable performance from a wide issue width
  • Examples: Tera, Microthreads, Intrathreads

This is an incremental model that adds a handful
of additional instructions to a given ISA - for
compatibility
24
Microthreads
  • Microthreading is instruction-level concurrency
  • it uses registers as synchronising memory to give
    the lowest latency in dependency resolution
  • it is a bulk-synchronous model with MIMD, SPMD or
    mixed blocks of concurrent execution separated by
    barriers
  • it can support shared or distributed memory
  • Schedules have deterministic distribution and
    dynamic interleaving
  • the code is schedule invariant and can trade
    latency tolerance and speedup within resource
    constraints
  • The concurrency controls are efficiently
    implemented and the support structures are
    scalable

25
Microthreading
[Diagram: the source loop "for i = 1, n ..." is compiled once into binary code containing a create (i = 1, n) instruction plus the loop body as code fragments (SPMD; MIMD blocks are also possible); in hardware a deterministic global schedule distributes iterations across processors (e.g. i1, i4, i7, i10 / i2, i5, i8, i11 / i3, i6, i9, i12) into per-processor µ-thread queues, where local schedulers interleave them onto the pipelines.]
26
Microcontexts
  • A typical RISC ISA has only a 5-bit register
    address
  • Microthreaded CMPs use a large distributed
    register file that must be addressed by
    instructions
  • e.g. thousands of processors, each with up to a
    thousand registers
  • Microcontexts give a mechanism to bridge this gap
  • A microcontext is a window on a given processor
    at a given location in the register file
    allocated dynamically to a thread
  • All registers implement i-structures
  • they are allocated empty and implement a blocking
    read
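A C sketch of the idea, with assumed (hypothetical) names and a trivial blocking model: the 5-bit architectural register number from the instruction is added to the thread's microcontext base to index the large register file, and every register carries an i-structure full/empty bit so that a read of an empty register suspends the thread.

#include <stdio.h>
#include <stdbool.h>

enum { FILE_SIZE = 1024 };

typedef struct {
    bool   full[FILE_SIZE];   /* full/empty bit per register */
    double data[FILE_SIZE];
} regfile;

typedef struct {
    int base;                 /* microcontext base, set when the thread is created */
} thread_state;

/* Translate an architectural register (0..31) to a physical index. */
static int translate(const thread_state *t, int arch_reg)
{
    return t->base + (arch_reg & 31);
}

/* Returns false if the register is still empty (the thread would suspend). */
static bool read_reg(regfile *rf, const thread_state *t, int arch_reg, double *out)
{
    int idx = translate(t, arch_reg);
    if (!rf->full[idx]) return false;   /* blocking read: not yet written */
    *out = rf->data[idx];
    return true;
}

static void write_reg(regfile *rf, const thread_state *t, int arch_reg, double v)
{
    int idx = translate(t, arch_reg);
    rf->data[idx] = v;
    rf->full[idx] = true;               /* a write sets the register full */
}

int main(void)
{
    regfile rf = {0};
    thread_state thr = { .base = 64 };  /* this microcontext starts at entry 64 */
    double v;
    printf("ready before write: %d\n", read_reg(&rf, &thr, 3, &v)); /* 0 */
    write_reg(&rf, &thr, 3, 2.5);
    if (read_reg(&rf, &thr, 3, &v)) printf("r3 = %f\n", v);          /* 2.5 */
    return 0;
}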

27
Local register files
[Diagram: a large local register file (e.g. 1024 entries, no register renaming) is addressed by adding an offset from the instruction (the architectural register set, e.g. 32 entries) to a context base taken from the thread state; microcontexts are shared locally by this address translation and remotely over the network.]
28
Different models
  • We can identify three models based on the
    flexibility of inter-context communication
  • a) Vector - threads read/write their own context
    and a subset of the enclosing context (globals)
  • b) Fixed dependency - as a) plus a subset of one or
    more other contexts at the same level and at a
    fixed distance from it (dependents)
  • c) General - as a) plus individual registers from
    any context at the same level as it
  • We are currently focusing on b)

29
Fixed-dependency model
  • To implement this model we store and associate
    with each instruction in the pipeline
  • two 10-bit offsets into the register file, and
  • two 5-bit boundaries within a context
  • The advantages of this model are
  • ring connectivity provides all remote
    communication with modulo schedules (see the
    sketch after this list)
  • on the same processor, static mappings enable
    register bypassing on all local register reads
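The modulo-schedule property can be illustrated with a small C sketch (assumed schedule: iteration i runs on processor i mod P): with a fixed dependency distance D, the producer of the value needed by iteration i is always exactly D ring hops away, whatever i is.

#include <stdio.h>

int main(void)
{
    const int P = 4;   /* processors on the ring */
    const int D = 1;   /* dependency distance between iterations */

    for (int i = D; i < 12; i++) {
        int owner    = i % P;                       /* processor running iteration i   */
        int producer = (i - D) % P;                 /* processor running iteration i-D */
        int hops     = (owner - producer + P) % P;  /* always D % P, independent of i  */
        printf("i=%2d: proc %d depends on proc %d (%d hop(s))\n",
               i, owner, producer, hops);
    }
    return 0;
}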

30
Fixed-dependency model
G, S and (in this case) a dependency distance of 1
are invariants in the threads created.
[Diagram: the microcontexts of thread i and thread i+1 on the same processor and of thread i-1 on another processor; each is partitioned into globals (G, broadcast between processors), shared (S), dependent and local registers, located by the thread offsets and a global offset; thread i's dependent registers map onto the shared registers of thread i-1.]
31
CMP Concept
[Diagram: CMP concept - a pool of independently clocked (possibly heterogeneous) processors sharing a multi-ported shared-memory system with decoupled loads (Lw / D read), a reconfigurable broadcast bus carrying create and writes to globals (create / write G, initialise L0), and a reconfigurable ring network for micro-context sharing.]
32
Typical structure of loop code
  1. Define profile - i.e. number of processors on
    which to execute the loop
  2. Define context parameters - i.e. partitions of
    context L, S, G
  3. Set any global variables used in loop
  4. Create loop - as concurrent microthreads, using a
    control block: start, step, limit, dependency
    distance, schedule block and pointer(s) to code
    (see the sketch below the code outline)

Profile n
Context 6 1 1
Mv G0 L1
Create controlblock
Bsync
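A C sketch of this structure with hypothetical names (control_block, create), not the actual µt ISA; one possible reading of the example is a profile of n processors and a context partitioned as 6 locals, 1 shared and 1 global, with Mv G0 L1 copying a local into a global before the create.

#include <stdio.h>

/* Hypothetical rendering of the loop-creation control block. */
typedef struct {
    int start, step, limit;    /* iteration range                */
    int dep_distance;          /* fixed dependency distance      */
    int block;                 /* schedule block per processor   */
    void (*body)(int i);       /* pointer to the fragment's code */
} control_block;

static void loop_body(int i) { printf("iteration %d\n", i); }

/* Stand-in for the Create instruction: here the "threads" simply run in
 * order; a real implementation distributes and interleaves them. */
static void create(const control_block *cb)
{
    for (int i = cb->start; i <= cb->limit; i += cb->step)
        cb->body(i);
}

int main(void)
{
    /* Profile n; Context 6 1 1 (assumed to mean 6 locals, 1 shared, 1 global);
     * Mv G0 L1 (global initialised from a local) is omitted here. */
    control_block cb = { 1, 1, 8, 1, 2, loop_body };
    create(&cb);    /* Create controlblock */
    /* Bsync: barrier - all created threads have completed at this point. */
    return 0;
}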
33
Scalable implementation
  • Broadcast and register sharing are implemented by
    the same asynchronous ring network
  • Register file is distributed and has only 5 ports
  • Independent of the number of processors!
  • The scheduler memory is also distributed and is
    similar in size to the register file
  • Both can be scaled in size to adjust the latency
    tolerance
  • Uses a simple in-order processor
  • issue is stalled when data is not available -
    processors go into standby mode clock-cycle by
    clock cycle
  • wake up is asynchronous

34
GALS implementation
  • Each processor can be implemented in its own
    synchronous domain
  • with asynchronous interfaces to ring network,
    memory and FPU processors
  • Single-cycle operations are statically scheduled;
    all other operations use the asynchronous ports
    to the register file
  • if a processor has no active threads it is
    waiting for an asynchronous input and can stop
    its clocks

35
Power conservation
Two feedback loops around the fragment processor
(single-cycle operations in the pipeline;
asynchronous operations, e.g. memory or FPU):
  • Hardware scheduler - schedules instructions on
    data availability: when data is available, work is
    scheduled to the processor, providing a workload
    measure
  • Power/clock control - voltage/frequency scaling
    adjusts voltage and frequency to the relative
    workload; clocks are stopped and the processor
    goes to standby when there is no work
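A C sketch of one possible control policy (assumed numbers and policy, not the actual hardware): the scheduler's workload measure drives frequency selection, and an empty thread queue stops the clock altogether.

#include <stdio.h>

enum { F_MIN = 200, F_MAX = 1600 };   /* hypothetical frequencies in MHz */

static int choose_frequency(int ready_threads, int capacity)
{
    if (ready_threads == 0)
        return 0;                     /* no work: stop clocks, go to standby */
    /* Scale frequency with the relative workload, clamped to F_MAX. */
    int f = F_MIN + (F_MAX - F_MIN) * ready_threads / capacity;
    return f > F_MAX ? F_MAX : f;
}

int main(void)
{
    int samples[] = { 0, 4, 16, 32, 8, 0 };   /* ready-thread counts over time */
    for (int t = 0; t < 6; t++)
        printf("t=%d ready=%2d -> %4d MHz\n",
               t, samples[t], choose_frequency(samples[t], 32));
    return 0;
}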
36
Vector code performance
Speedup of µ-threaded code is linear over 2.5
orders of magnitude, to within 2%
Speedup is super-linear with respect to
non-threaded code, as 20% fewer instructions are
executed to complete the computation
Max IPC of 1612 on 2048 processors - 80% of the
theoretical maximum
37
Scalable power dissipation
Energy of computation is constant to within 2%
over 2.5 orders of magnitude
Using fine-grain control of clock (dynamic
dissipation) and coarse grain control of power
(static dissipation)
38
Performance with and without D-cache
The residual D-cache made no difference to
performance
39
Simulation framework
  • Based on a cycle-accurate CMP simulator of
    ISA-Alpha + ISA-µt
  • The same binary was executed on profiles of 1 to
    2048 processors with cold caches (I-cache and
    D-cache)
  • Fixed number of iterations (64K) of the Livermore
    hydro kernel

40
Adaptive System Environment
  • Legacy code executes unchanged on 1 processor
  • Microthreaded code executes on n processors
  • n can be dynamic (e.g. per loop) to adapt to
    system or user goals, e.g. performance or power
    dissipation
  • n processors can be drawn from a pool
  • When the concurrency collapses, only the
    architectural state remains - on the creating
    processor
  • Software concurrency (distributed memory) can
    also be supported with I/O mapped to
    micro-threaded registers
  • Need a dynamic model of the resources to get
    self-adaptive execution of compiled code

41
Summary and future directions
  • We have been working on these models and
    verifying their scalability for some years
  • We have just started a 4-year project to
  • formalise the models
  • develop compilers for the models
  • thoroughly investigate performance relative to
    other approaches
  • implement IP