Dataflow Architecture - PowerPoint PPT Presentation

1
Dataflow Architecture
Chaofeng Liu, Chaowei Dai, Mao Li
2
Dataflow Overview
Dataflow Overview
  • What is dataflow and what can it achieve
  • Distinguish dataflow from von Neumann's
    control-flow model

3
Dataflow Overview
What is dataflow and what can it achieve
Arvind and Iannucci identify two fundamental
issues that must be addressed to construct a
successful multiprocessor:
  • Memory latency: the time between issuing a
    memory request and receiving the corresponding
    response
  • Synchronization: needed to enforce the ordering
    of instruction execution according to the data
    dependencies among instructions

Before dataflow, the majority of multiprocessor
computer systems were based on von Neumann style
processors. These processors use a program counter
to sequence the execution of instructions in a
program. This sequential execution style can make
it difficult to exploit the parallelism in a
program. In the 1970s, dataflow computers were
proposed and developed to address these
deficiencies of von Neumann control-flow
multiprocessors.
4
Distinguish dataflow from von Neumann's
control-flow model -- von Neumann's control-flow
model
Dataflow Overview
The control-flow model assumes that a program is
a series of addressable instructions, each of
which either specifies an operation along with the
memory locations of its operands, or specifies a
transfer of control to another instruction, either
unconditionally or when some condition holds. The
following example illustrates the difference
between the control-flow and dataflow
architectures: c := if n > 0 then a + b else a - b
fi. When using a control-flow model, the program
is translated into a series of instructions
starting with an instruction comparing n to 0,
which then transfers control either to an
instruction adding a to b or to another
instruction subtracting b from a. In both cases,
the result is stored in c. What a control-flow
machine essentially does is make the branch
instruction wait until the instruction comparing
n to 0 has completed.
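The branch-based translation just described can be sketched as a tiny program-counter interpreter (a hedged illustration only: the opcodes, the instruction encoding, and the `n > 0` test are invented for this sketch, not taken from any particular machine):

```python
# Control-flow sketch: a program counter sequences the instructions,
# and the add/sub can only run AFTER the compare-and-branch completes.

def run_control_flow(n, a, b):
    program = [
        ("cmp_gt", "n", 0),   # compare n with 0, set a flag
        ("br_false", 4),      # branch must wait for the comparison
        ("add", "a", "b"),    # then-arm: c = a + b
        ("jmp", 5),
        ("sub", "a", "b"),    # else-arm: c = a - b
        ("halt",),
    ]
    env = {"n": n, "a": a, "b": b}
    pc, flag, c = 0, False, None
    while True:
        op = program[pc]
        if op[0] == "cmp_gt":
            flag = env[op[1]] > op[2]; pc += 1
        elif op[0] == "br_false":
            pc = op[1] if not flag else pc + 1
        elif op[0] == "add":
            c = env[op[1]] + env[op[2]]; pc += 1
        elif op[0] == "sub":
            c = env[op[1]] - env[op[2]]; pc += 1
        elif op[0] == "jmp":
            pc = op[1]
        else:
            return c

print(run_control_flow(3, 10, 4))   # 14 (the add ran only after the branch)
```

Note that exactly one of the two arms ever executes, and only after the branch resolves; the dataflow model on the next slide relaxes this.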
5
Distinguish dataflow from von Neumann's
control-flow model -- dataflow model
Dataflow Overview
Unlike the control-flow model, dataflow assumes
that a program is a data-dependency graph whose
nodes denote operations and whose edges denote
dependencies between operations. The operation
denoted by a node executes as soon as its incoming
edges carry the necessary operands. In particular,
as soon as n is available, the comparison can be
applied to its operands n and the constant 0.
Similarly, as soon as a and b are available, both
the + and the - operations can be applied to a and
b, even before the comparison of n with 0 has
completed.
Fig. Dataflow graph for c := if n > 0 then a + b
else a - b fi
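This firing rule can be illustrated with a small token-driven evaluator (a sketch only; the graph encoding and the function `fire_when_ready` are invented here): any node whose input tokens all exist may fire, so + and - both fire without waiting for the comparison.

```python
# A tiny token-driven evaluator: an operation fires as soon as all of
# its input tokens are present, regardless of program order.

def fire_when_ready(ops, tokens):
    """ops: result name -> (function, [input names]);
    tokens: name -> available value. Fires nodes until quiescent."""
    fired = True
    while fired:
        fired = False
        for out, (fn, ins) in ops.items():
            if out not in tokens and all(i in tokens for i in ins):
                tokens[out] = fn(*(tokens[i] for i in ins))
                fired = True
    return tokens

# Graph for: c := if n > 0 then a + b else a - b fi
ops = {
    "p":    (lambda n: n > 0,               ["n"]),       # comparison node
    "sum":  (lambda a, b: a + b,            ["a", "b"]),  # fires independently
    "diff": (lambda a, b: a - b,            ["a", "b"]),  # fires independently
    "c":    (lambda p, s, d: s if p else d, ["p", "sum", "diff"]),
}
tokens = fire_when_ready(ops, {"n": 3, "a": 10, "b": 4})
print(tokens["c"])  # 14
```

Both "sum" and "diff" are computed here before the selector node "c" fires, mirroring the slide's point that the additions need not wait for the comparison.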
6
Static vs dynamic dataflow architecture
Static vs dynamic dataflow architecture
  • Based on the node firing rules, the dataflow
    model is classified into two types:
  • Static dataflow architecture
  • Dynamic dataflow architecture
  • In the first approach, the firing rule restricts
    each edge to holding at most one token at a time,
    and an operation executes when tokens (values)
    are present on each of its input edges. This also
    implies that an executable operation can actually
    execute only when its output edge has no token on
    it. The static dataflow model can exploit
    structural parallelism (as in different unrelated
    operators executing at the same time) and
    pipeline parallelism (as in different stages of
    the graph consuming different tokens of a stream
    of tokens at the same time).
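The static firing rule above can be sketched as a one-slot-per-edge check (an illustration under invented names, not an actual machine's logic):

```python
# Static firing rule: an operator may fire only when every input edge
# holds a token AND its output edge is empty (each edge has capacity 1).

def can_fire(inputs, output):
    """Each edge is modeled as a list used as a one-slot buffer
    (empty list = no token, one element = one token)."""
    return all(len(e) == 1 for e in inputs) and len(output) == 0

e1, e2, out = [5], [7], []
assert can_fire([e1, e2], out)       # both inputs present, output edge free
out.append(e1.pop() + e2.pop())      # fire: consume inputs, produce result
assert not can_fire([e1, e2], out)   # inputs consumed and output occupied
print(out)  # [12]
```

Because the output slot must be free before firing, a second token cannot enter this operator until a downstream consumer drains `out`; this is exactly what limits the model to one active iteration, as the next slide explains.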

7
Static dataflow
Static vs dynamic dataflow architecture
  • Disadvantage
  • Since each edge can only hold one token at a
    time, only one iteration or one function
    invocation can be active. Thus it cannot exploit
    dynamic forms of parallelisms such as loop
    parallelism (from simultaneously executing
    different unrelated iterations of a loop body) or
    recursive parallelism (from simultaneously
    evaluating multiple recursive function calls).
  • Application
  • This model is thus well suited for applications
    with regular numeric computational structures
    such as signal processing and image processing
    applications that do not make heavy use of
    iterative or recursive program structures.

8
Dynamic dataflow
Static vs dynamic dataflow architecture
  • In contrast to the static dataflow model,
    unbounded storage is allowed on each arc in the
    dynamic dataflow model. Since the data values for
    a particular instantiation of an operator have to
    be identified, tags are assigned to the data
    tokens. For this reason the second scheme is
    often referred to as the tagged-token approach.
    The tag associated with each token is a
    four-tuple <c, i, b, a>:
  • c is the invocation ID: distinguishes between
    tokens of different function invocations
  • i is the iteration ID: distinguishes tokens
    belonging to different iterations
  • b is the code block address: the code block
    address along with the instruction address of a
    token identifies its destination
  • a is the instruction address within the code
    block.
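The four-tuple tag can be sketched directly (type names here are invented for illustration): two tokens are partners for the same operator instance exactly when all four tag fields agree.

```python
from collections import namedtuple

# Tagged-token sketch: each token carries a tag <c, i, b, a> and a value.
Tag = namedtuple("Tag", ["c", "i", "b", "a"])   # invocation, iteration,
                                                # code block, instruction
Token = namedtuple("Token", ["tag", "value"])

t1 = Token(Tag(c=1, i=0, b=0x40, a=3), 10)   # iteration 0 of invocation 1
t2 = Token(Tag(c=1, i=0, b=0x40, a=3), 4)    # partner token: identical tag
t3 = Token(Tag(c=1, i=1, b=0x40, a=3), 99)   # different iteration: no match

def matches(x, y):
    """Tokens belong to the same operator instance iff their tags are equal."""
    return x.tag == y.tag

print(matches(t1, t2), matches(t1, t3))  # True False
```

This is what lets multiple iterations be in flight at once: tokens from iteration 0 and iteration 1 coexist on the same arc without being confused.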

9
Dynamic dataflow
Static vs dynamic dataflow architecture
Fig. Overall organization of the tagged-token
dataflow architecture (processing elements PE1-PE4
connected by an interconnection network)
The overall organization of the tagged-token
dataflow architecture is shown above. It consists
of several processing elements (PEs), each of
which is illustrated in the next figure,
interconnected via a packet-switching
interconnection network (IN).
10
Dynamic dataflow
Static vs dynamic dataflow architecture
A processing element (PE) consists of a matching
store unit (MU), an instruction fetch unit (IFU),
a processing unit (PU), and an output unit (OU).
Fig. The Internal Structure of a PE
11
Dynamic dataflow
Static vs dynamic dataflow architecture
Step 1: If a token arriving at the matching unit
completes all the input requirements for the
execution of an instruction, a group token is
formed from all the input data and is sent to the
instruction fetch unit. Otherwise, the token is
stored in the matching unit along with the tokens
already gathered for the instruction.
Step 2: When the instruction fetch unit (IFU)
receives a packet with all the data for a
particular instruction, the corresponding
instruction is fetched. The instruction along with
the data then forms an executable packet that is
sent to the processing unit (PU).
Step 3: The PU contains a number of function units
(FUs) that can perform dataflow operations in
parallel. The results generated by the PU are sent
to the output unit (OU).
Step 4: The main function of the OU is to form
tokens from the results generated by the PU.
Further, the OU also evaluates the assignment
function to determine the physical address of the
PE to which each token needs to be sent.
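Step 1, the matching-store behavior, can be sketched as follows (a hedged illustration; the token format, the `arity=2` assumption, and the function name are all invented):

```python
# Matching-unit sketch: tokens wait in a store keyed by tag until every
# operand for an instruction has arrived; then a group token is formed.

def match(store, token, arity=2):
    """Return a group token if `token` completes an instruction's
    inputs, else buffer it and return None.
    store: tag -> list of values already gathered."""
    tag, value = token
    waiting = store.setdefault(tag, [])
    waiting.append(value)
    if len(waiting) == arity:       # all inputs present: form group token,
        del store[tag]              # which would go to the instruction
        return (tag, waiting)       # fetch unit (Step 2)
    return None                     # still waiting in the matching store

store = {}
print(match(store, (("f", 0, 0x40, 3), 10)))  # None: first operand buffered
print(match(store, (("f", 0, 0x40, 3), 4)))   # (('f', 0, 64, 3), [10, 4])
```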
12
Dynamic dataflow
Static vs dynamic dataflow architecture
In short, the evolution of dataflow computers has
been motivated by the need for better performance.
Dynamic dataflow computers were designed to
execute loop iterations and subprogram invocations
in parallel. The two best-known examples are the
MIT tagged-token dataflow architecture and the
Manchester dataflow computer. More recently,
hybrid architectures have been proposed that
combine positive features of the von Neumann and
dataflow architectures; they bridge the gap
between existing systems and new dataflow
supercomputers by allowing execution of existing
software written for conventional processors. This
will be discussed in depth by Chaowei Dai.
13
(No Transcript)
14
Hybrid Dataflow Architecture Model
  • Presented by Chaowei Dai

15
Computer Architecture Models
  • There are two basic models: the von Neumann
    sequential control model and the data-driven
    distributed control model.
  • The von Neumann model uses a program counter to
    sequence the execution of instructions in a
    program. The dataflow model is an alternative to
    the conventional stored-program (von Neumann)
    execution model.

16
Computer Architectures Models
  • In the dataflow architecture model, program
    parallelism is expressed in terms of a directed
    graph, in which the nodes describe the operations
    to be performed and the arcs represent the data
    dependencies among the operations. Execution of
    the directed graph is data-driven in the sense
    that a node does not execute until the data at
    its inputs are available.

17
Advantages of Dataflow Architecture Models
  • The dataflow model of computation offers a sound,
    simple, yet powerful model of parallel
    computation.
  • In dataflow programming and architectures, there
    is no notion of a single point or locus of
    control.
  • Dataflow architectures promise to address the two
    fundamental problems of von Neumann computers in
    multiprocessing: memory latency and
    synchronization overhead.

18
Advantages of Dataflow Architecture Models
  • The advantage of the von Neumann model is the
    efficiency and simplicity of its instruction
    sequencing mechanism, as well as over 40 years of
    optimization of its instruction execution
    mechanism.
  • Research on combining the advantages of these two
    models was carried out over the last 10 years.
    Some of the research achievements are introduced
    as follows.

19
Research Work on Hybrid Dataflow Model
  • One of the most valuable research efforts was
    conducted by Dr. Gao. In his work, an efficient
    hybrid architecture model was proposed. The model
    employs:
  • A conventional architectural technique to achieve
    fast pipelined instruction execution, while
    exploiting fine-grain parallelism through
    data-driven instruction scheduling
  • An efficient mechanism that supports concurrent
    operation of multiple instruction threads on the
    hybrid architecture model

20
Research Work on Hybrid Dataflow Model
  • (3) A compiling paradigm for dataflow software
    pipelining which efficiently exploits loop
    parallelism through limited balancing. A set of
    basic results was established by the author.
  • It showed that the fine-grain parallelism in a
    loop exposed through limited balancing can be
    fully exploited by a simple greedy runtime
    data-driven scheduling scheme, achieving both
    time and space efficiency simultaneously.

21
Research Work on Hybrid Dataflow Model
  • 2. Based on his experience with the MIT dynamic
    (tagged-token) dataflow architecture, Iannucci
    combined dataflow ideas with sequential thread
    execution to define a hybrid computation model.
    The ideas later evolved into a multithreaded
    architecture project at the IBM Yorktown research
    center. The architecture includes features such
    as cache memory with synchronization controls,
    prioritized processor ready queues, and features
    for efficient process migration to facilitate
    load balancing.

22
Research Work on Hybrid Dataflow Model
  • 3. The P-RISC is a hybrid model exploring the
    possibility of constructing a multithreaded
    architecture around a RISC processor (R. S.
    Nikhil, Arvind). The StarT (*T) project, a
    successor of the Monsoon project, has defined a
    multiprocessor architecture using an extension of
    an off-the-shelf processor architecture to
    support fine-grain communication and scheduling
    of user microthreads. The architecture is
    intended to retain the latency-hiding feature of
    Monsoon's split-phase global memory operations.

23
Example of Hybrid Dataflow Architecture Model
  • Dr. Gao's research was primarily focused on the
    organization of a single processor that can
    support multiple instruction threads, where
    instructions from different threads are subject
    to pipelined execution.
  • The proposed hybrid dataflow architecture model
    is an extension of the McGill dataflow
    architecture (MDFA) model, which employs the
    argument-fetching principle.

24
MDFA Model
25
MDFA Model
26
Hybrid Dataflow Architecture Model
  • The idea is that the IPU directly generates the
    address of the next p-instruction to be executed,
    instead of going through the scheduling phase in
    the ISU.
  • Each p-instruction is extended to carry an extra
    field, a tag field (also called the von Neumann
    bit), which indicates whether the instruction
    follows dataflow-style scheduling or von
    Neumann-style scheduling.
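The effect of the von Neumann bit can be sketched as a two-way dispatch (a hedged illustration: the field names and return values are invented, not the MDFA encoding):

```python
# Hybrid scheduling sketch: if the von Neumann bit is set, the IPU
# itself supplies the next instruction address (sequential mode);
# otherwise the instruction's successors go back through the
# scheduling unit and fire only when their operands arrive.

def next_action(instr, pc):
    if instr["vn_bit"]:                  # von Neumann style: fall through
        return ("execute_next", pc + 1)
    return ("schedule_in_ISU", None)     # dataflow style: data-driven

print(next_action({"vn_bit": True}, 100))   # ('execute_next', 101)
print(next_action({"vn_bit": False}, 100))  # ('schedule_in_ISU', None)
```

Because the bit is per-instruction, a compiler can mix both modes freely within one thread, which is the orthogonality the "Simplicity" slide below emphasizes.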

27
Hybrid Dataflow Model
28
Features of the Hybrid MDFA Model
  • 1. Generality: the hybrid MDFA model supports
    both thread-level and instruction-level
    parallelism through efficient fine-grain
    synchronization. At any time, the IPU can execute
    several instructions in parallel. This model
    differs from the so-called macro-dataflow
    schemes, where dataflow scheduling can only be
    done at the inter-procedure level. It retains the
    advantage of dataflow models in dealing with the
    two fundamental issues of von Neumann
    multiprocessing.

29
Features of the Hybrid MDFA Model
  • 2. Flexibility: there are no restrictions on the
    size of an instruction thread that can be
    supported by this model. In fact, multiple
    instruction threads, each with a different size,
    can be active concurrently. This is an important
    advantage over other multithreaded architectures
    where the number of threads is fixed a priori.

30
Features of the Hybrid MDFA Model
  • 3. Simplicity: under the hybrid MDFA model, any
    instruction in a program can be set to either of
    the two modes, regardless of its function or
    type. This flexibility makes the job of a
    compiler easier, since the mode control and the
    operation of an instruction become orthogonal.

31
Decoupled Scheduled Dataflow Multithreaded
Architecture
  • Mao Li
  • CIS
  • Cleveland State University

32
Decoupling memory accesses from Execution Pipeline
  • The gap between processor speed and average
    memory access speed limits achieving high
    performance.
  • Decoupled architectures offers a solution in
    leaping over the memory wall
  • Integrating the decoupled architecture with
    multithreading presents a wide range of
    implementations for next-generation architecture
  • Two multithreaded architectures that support
    decoupled memory accesses Rhamma and PL/PS

33
Rhamma Processor
  • Rhamma uses two separate processors: a Memory
    Processor performs all load and store
    instructions, and all other instructions are
    performed by an Execute Processor.
  • A single sequence of instructions (a thread) is
    generated for both processors.
  • When a memory access instruction is decoded by
    the Execute Processor, a context switch returns
    the thread to the Memory Processor.
  • When the Memory Processor decodes a non-memory
    access instruction, a context switch causes the
    thread to be handed over to the Execute Processor.
  • Threads are blocking (no other thread can run
    until the current one finishes).
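Rhamma's hand-over rule can be sketched as a thread ping-ponging between the two processors (a hedged sketch: the opcode names and the starting processor are assumptions for illustration):

```python
# Rhamma decoupling sketch: a single blocking thread is handed back
# and forth whenever an instruction is decoded on the wrong processor.

MEMORY_OPS = {"load", "store"}

def run(thread):
    """Count context switches as the thread moves between the Memory
    Processor and the Execute Processor (assumed to start on Memory)."""
    switches, where = 0, "memory"
    for op in thread:
        needed = "memory" if op in MEMORY_OPS else "execute"
        if needed != where:     # decoded on the wrong side:
            switches += 1       # context switch, hand the thread over
            where = needed
        # ... the instruction then executes on `where` ...
    return switches

print(run(["load", "load", "add", "mul", "store"]))  # 2 context switches
```

Interleaving memory and compute instructions inflates the switch count, which is one reason longer Rhamma threads do not automatically run faster (see the thread-granularity slide below).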

34
PL/PS Architecture
  • Threads are non-blocking
  • All memory accesses are done by the Memory
    Processor, which delivered enabled threads to the
    Execute Processor
  • Each thread is enabled when the required inputs
    are available and all operands are pre-loaded
    into a register context
  • Once enabled, a thread executes to completion
    without blocking where the instructions belonging
    to a thread will execute on the Execute Processor
  • The Results from completed threads are
    post-stored by the Memory Processor
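The pre-load / execute / post-store discipline can be sketched in three phases (a hedged illustration: the memory model and function names are invented, and the Memory and Execute Processors are collapsed into sequential phases of one function):

```python
# PL/PS sketch: the Memory Processor pre-loads all operands into a
# register context, the Execute Processor then runs the thread to
# completion with no memory operations, and results are post-stored.

def run_thread(memory, addrs_in, body, addr_out):
    regs = [memory[a] for a in addrs_in]   # pre-load  (Memory Processor)
    result = body(*regs)                   # execute   (non-blocking,
                                           #            registers only)
    memory[addr_out] = result              # post-store (Memory Processor)
    return result

mem = {0: 10, 1: 4, 2: None}
print(run_thread(mem, [0, 1], lambda a, b: a * b, 2))  # 40
```

Since the execute phase touches only registers, the thread can never stall on a cache miss mid-flight; misses are absorbed in the pre-load and post-store phases, which can overlap with other threads' execution.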

35
Limitations of Pure Dataflow
  • Dataflow model holds the promise of an elegant
    execution paradigm with parallelism in
    applications, but no actual implementation can
    offer the promised performance.
  • Major limitations of the pure dataflow model that
    prevented commercial implementation
  • Too fine-grained (instruction level)
    multithreading
  • Difficulty in using memory hierarchies and
    registers
  • Asynchronous triggering of instructions

36
Pipeline Structure of Scheduled Dataflow
37
Analytical Models Evaluating New Architecture
Effect of thread parallelism
  • The same normalized workload is used for all
    architectures (all architectures execute the same
    amount of useful work).
  • Latency of a pair of threads is the time
    difference between the termination of a thread
    and the initiation of a successive thread.
  • Both Scheduled Dataflow and Rhamma show
    performance gains as thread parallelism increases.
  • Scheduled Dataflow executes the multithreaded
    workload faster than Rhamma for all values of
    thread parallelism.
  • Scheduled Dataflow should provide a higher degree
    of thread parallelism than Rhamma, since the
    non-blocking nature of Scheduled Dataflow leads
    to finer-grained threads.

38
Thread Granularity
  • Normalized thread length includes only functional
    instructions and excludes architecture-specific
    overhead instructions.
  • For both the conventional and the Scheduled
    Dataflow architectures, increasing thread
    run-lengths yields performance gains up to a
    point (since longer threads imply fewer context
    switches).
  • For Rhamma, longer threads do not guarantee
    shorter execution times.

39
Fraction of memory access instructions
  • In a conventional architecture, increasing the
    fraction of memory access instructions leads to
    more cache misses, thus increasing the execution
    time.
  • The decoupling allows the two multithreaded
    processors to tolerate the cache miss penalties.
  • Scheduled Dataflow outperforms Rhamma for all
    fractions of memory access instructions, because
    Scheduled Dataflow groups memory accesses into
    pre-loading and post-storing.

40
Conclusions
  • This paper presented a new dataflow architecture
    that uses control-flow-like scheduling of
    instructions and separates memory accesses from
    instruction execution to tolerate the long
    latencies incurred by memory accesses.
  • The proposed Scheduled Dataflow system is
    instruction-driven, using program-counter-style
    sequencing to execute instructions; the
    instructions within a thread still retain
    dataflow (functional) properties, thus
    eliminating the need for complex hardware.
  • Decoupled access/execute implementations combined
    with a multithreaded model present better
    opportunities for exploiting the decoupling of
    memory accesses from the execution pipeline.
  • Grouping memory accesses (pre-load and
    post-store) for threads eliminates unnecessary
    delays caused by memory accesses.