Forrest Brewer, forrest@ece.ucsb.edu. UCSB CAD and Test Group, ECE/UCSB, Santa Barbara, CA 93106.

1
NDFA Based Scheduling
  • Forrest Brewer, Steve Haynal
  • University of California
  • Santa Barbara

2
Scheduling is Behavioral Synthesis
  • Exploits fundamental freedom -- ordering and
    binding of operations, operands
  • Subdivided into DFG transformation, resource
    allocation, time-scheduling, operation binding,
    memory binding, communication binding, resource
    modeling, reallocation...
  • Complexity of tasks requires top-down flow -- yet
    evaluations/constraints are bottom-up
  • Behavioral Synthesis difficult to use!
  • Seemingly trivial changes cause vast output
    changes
  • Design tradeoffs tied to a particular point
    language (VHDL, Verilog, Silage, Esterel...)
  • No direct control of implementation
  • No direct control of binding, mapping
  • No distinction between problem statement and
    constraints
  • No canonical representation of design space
  • Fundamental problem covers enormous scope
  • Universality issues in specification
  • How to capture design mapping knowledge?
  • How to create verifiable design representation
    without canonical model?
  • Our viewpoint -- wrong problem

3
Simpler Problem
  • Assume Designer creates the design
  • Support incremental refinement of design at all
    levels of representation
  • Support incremental design synthesis when
    possible
  • Provide well defined hierarchy on which to place
    constraints, trial implementations ...
  • Provide mechanism for subsystem abstraction,
    modeling and evaluation at each level
  • How to do this?
  • Drop representation distinction between logic,
    module, and sub-system levels
  • Drop potential for universality in internal
    representations
  • Create mechanism for automatic design abstraction
    within designer's design decomposition
  • Use efficient representation of fundamental model
  • Provide feedback to designer for evaluating both
    the design itself and the representation
  • Where do we start?
  • Interface Protocols are key complexity growth
    problem
  • Designer constructs system model with abstract
    protocols, required data-flows, possible maps
  • Generalize scheduling to provide possible
    sequencing of sub-systems into systems meeting
    external protocol constraints (models)

4
Protocol Constrained Scheduling
  • Problem: Conventional scheduling algorithms
    cannot accommodate the typical complex sequencing
    and timing constraints of modern design.
  • Three Problems: Specification, Scheduling, and
    Problem Scale
  • Specification: How to specify the required timing
    in a concise, explicit way?
  • Scheduling: How to systematically exploit mapping
    freedom while meeting the timing requirements?
  • Problem Scale: Problems of interest to industry
    are enormously complex!
  • Idea: Protocol specification is amenable to NDFA
    modeling, so create an automata-based model to
    represent Control/Data-flow freedom => All
    possible implementations exist as sequences of
    states of the joint automaton

5
Protocol Specification
  • Sequencing complexity of digital system
    interfaces increasing
  • Specification languages (Verilog, VHDL) require
    implicit protocol specification
  • Alternative specification via NDFA automata (e.g.
    PBS, Esterel, Custom point language)
  • Representation is finite
  • Synthesis can be very efficient -- can handle
    very complex designs
  • Provides mechanism for time sequence
    specification relatively independent of data-flow
    control semantics
  • Protocol + CDFG semantics + mapping abstractions
    make a complete model
  • No ad-hoc mapping library (beyond control of
    designer)
  • No convenient dependency binding assumptions (to
    be worked around by designer)
  • No encrypting desired sequential FSM in higher
    level language!
  • Designer specifies event sequences he wants
  • System evaluates/synthesizes ensemble FSM

6
Design Representation
  • Model System as hierarchy of design frames
  • Frames have external protocol specification NDFA,
    CDFG, and allowed Mappings
  • Frames contain instances of other frames'
    abstractions (abstracted NDFA/CDFG model)
  • Resource utilization and sharing restricted to
    within a design frame

[Figure: design frame, labeled Frame, External Protocol, Control Data Flow Graph, Sub-frame Model]
7
Hierarchy of Refinement
  • Exact protocol scheduling intractable for
    practical large problems
  • Hierarchy of Refinement
  • partition the problem into manageable
    abstractions
  • hides lower level details
  • allows systematic high-level pruning of designs
    before more detailed treatment
  • Completed sub-frame designs can be abstracted to
    high level component models
  • allows incremental design change/refinement at
    any level
  • provides mechanism for consistency verification

8
Protocol Scheduling Implementation
  • Represent CDFG model as Causal (NDFA) Automaton
  • Generalization of current scheduling model
  • Models all valid data flows
  • Models code hoisting, unrolling,
    transformations...
  • Represent External Protocol as NDFA automaton
  • Very general, efficient model
  • Synchronous timing model (can be generalized;
    future work)
  • Alternative behavior as NDFA alternatives
  • CDFG maps I/O operations among sub-frames
  • Sub-frames have interface protocols, abstracted
    CDFG semantics
  • Construct ensemble automata model with all valid
    sequences of events meeting internal and external
    protocols and causal data-flow constraints
  • Need only find complete sub-set of all possible
    states for solution

9
Scheduling Solution
  • Every schedule is some subset of states of the
    ensemble automaton
  • Must construct causal and complete set of states
  • Exact solution strategies
  • Construct all states up to resource bounds
  • Depth-first search of states
  • Heuristic search -- choose good path, complete
    schedule automatically
  • Prune solution space
  • Additional constraints or objectives -- technique
    works best when highly constrained
  • Heuristic strategies
  • Sub-set BDD representation of reachable states
  • Incremental search (this is not verification!)
  • Possible objectives
  • Communication
  • Temporary storage (memory)
  • Performance
  • Control complexity

10
DFA Model of Two Stage Pipe
  • Input 1 indicates operands are supplied to the
    pipe
  • Output 1 indicates operand is produced by the
    pipe

[Figure: state transition table for the two stage pipe DFA, states a, b, c, d]
11
NDFA Protocol for Two Stage Pipe
  • Inputs and outputs same as DFA model
  • Some transitions produce no outputs

12
Operand Scheduling a CDFG on NDFA Protocol
  • CDFG to Schedule
  • Two stage NDFA protocol description for component
  • Protocol alone is insufficient -- need internal
    data-flow requirements
  • Mapping is trivial (in this case)
  • Protocol + CDFG is sufficient -- but also
    describes information not needed externally
  • Solution: Simplify scheduling solution of
    sub-frame to make abstracted model

13
Operand Schedule on NDFA Protocol
  • Optimal one multiplier schedule (co-execution of
    protocol and causal automata)

14
Causal Automaton Formulation of Scheduling
  • Scheduling Problem $(V, E, C, R)$
  • vertex $v \in V$ is an operation
  • edge $(u, v) \in E$ is a directed edge representing a
    data dependency
  • hyper-edge $(v_c, V_{Tc}, V_{Fc}) \in C$ groups a control
    operation and corresponding subsets of operations
  • hyper-edge $(\text{bound}, T \subseteq V) \in R$ represents a
    resource bound applied to a subset of (mapped)
    operations
  • The edge set is partitioned into a forest of
    forward edges and a subset of looping edges which
    point backward
  • Scheduling solution is a complete, compatible set
    of deterministic sequences of vertices such that
    all dependencies are causal and all resource
    bounds are met at each state, and the set has
    sequences for each possible future value of the
    set of controls.
  • In the following, we will discuss minimum latency
    and maximal throughput as objective functions.
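As a rough illustration of the formulation above, the sketch below stores the data-flow part of a problem (V, E, R) and checks a single straight-line schedule for causality and resource bounds. The class and function names are hypothetical, and control hyper-edges (C) are deliberately left out, so this covers only the case with no branching.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    ops: set      # V: operations
    deps: set     # E: pairs (u, v), meaning v needs the result of u
    bounds: list  # R: pairs (bound, subset of ops sharing a resource)


def is_valid_schedule(problem, steps):
    """steps is a list of sets: the operations issued in each cycle."""
    done = set()
    for issued in steps:
        # Causality: every predecessor must have completed in an earlier cycle.
        for u, v in problem.deps:
            if v in issued and u not in done:
                return False
        # Resource bounds: limit operations issued together from each subset.
        for bound, subset in problem.bounds:
            if len(issued & subset) > bound:
                return False
        done |= issued
    return done == problem.ops


p = Problem(ops={"a", "b", "c"},
            deps={("a", "c"), ("b", "c")},
            bounds=[(1, {"a", "b", "c"})])
print(is_valid_schedule(p, [{"a"}, {"b"}, {"c"}]))  # True
print(is_valid_schedule(p, [{"a", "b"}, {"c"}]))    # False: exceeds the bound
```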

15
Single-Cycle Operation Modeling Automata
  • 0 -> 0: Operation unscheduled and remains so
  • 0 -> 1: Operation scheduled next cycle
  • 1 -> 1: Operation scheduled and remains so
  • 1 -> 0: Operation scheduled but result lost
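The four transitions of the one-bit modeling automaton can be written down directly. This is a small sketch; the `classify` helper and the label strings are illustrative names, not part of the original tool.

```python
# One state bit per operation: 0 = result not available, 1 = result available.
LABELS = {
    (0, 0): "unscheduled, remains so",
    (0, 1): "scheduled next cycle",
    (1, 1): "scheduled, remains so",
    (1, 0): "scheduled, result lost",
}


def classify(trace):
    """Label each consecutive pair of bit values in a per-cycle trace."""
    return [LABELS[(cur, nxt)] for cur, nxt in zip(trace, trace[1:])]


for label in classify([0, 0, 1, 1, 0]):
    print(label)
```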

16
Scheduling Automata
  • State represents current set of available
    operands and state of modeling protocol automata
  • Constraints on transitions
  • Representation Compact
  • Product of Mapped Modeling automata for each
    resource protocol

17
Resource Bounds
  • A 0 -> 1 transition indicates resource use
  • Resource bounds constrain simultaneous 0 -> 1
    transitions
  • Iterative constraint on CA
  • ROBDD representation
  • 2 x bound x |operations|

[Figure: one resource]
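To make the resource constraint concrete, here is an explicit enumeration (not the ROBDD construction used in the talk) of all transitions between bit-vector states that start at most `bound` operations at once. The function name is hypothetical.

```python
from itertools import product


def transitions_within_bound(n_ops, bound):
    """All (current, next) bit-vector pairs with at most `bound` 0 -> 1 bits."""
    allowed = []
    for cur in product((0, 1), repeat=n_ops):
        for nxt in product((0, 1), repeat=n_ops):
            started = sum(1 for c, x in zip(cur, nxt) if c == 0 and x == 1)
            if started <= bound:
                allowed.append((cur, nxt))
    return allowed


# With 2 operations there are 16 transitions in total; one resource
# removes only (0, 0) -> (1, 1), which starts both operations together.
print(len(transitions_within_bound(2, 1)))
```

The explicit list grows exponentially with the number of operations, which is why a symbolic representation is attractive for anything beyond toy sizes.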
18
Dependency Implication
  • All transitions in which j is active before
    all of its predecessors are known are removed
  • BDD complexity is O(|predecessors| x
    |operations|)
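The dependency rule can be sketched as a predicate on a single transition. The dictionary-based state encoding and the name `causal` are assumptions for illustration; the talk applies the same rule symbolically to the whole transition relation.

```python
def causal(cur, nxt, preds):
    """True if no operation becomes known before all its predecessors are.

    cur, nxt: dicts mapping operation -> 0/1 (result unknown / known).
    preds: dict mapping operation -> set of predecessor operations.
    """
    for op in nxt:
        starts_now = cur[op] == 0 and nxt[op] == 1
        if starts_now and any(cur[p] == 0 for p in preds.get(op, ())):
            return False
    return True


preds = {"c": {"a", "b"}}
print(causal({"a": 1, "b": 0, "c": 0}, {"a": 1, "b": 1, "c": 1}, preds))  # False
print(causal({"a": 1, "b": 1, "c": 0}, {"a": 1, "b": 1, "c": 1}, preds))  # True
```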

19
Example NFA
  • Assume 1 resource
  • Transition relation induces graph
  • Any path from all operations unknown to all known
    is a valid schedule
  • Shortest paths are minimum latency schedules
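A minimal explicit-state sketch of the shortest-path idea: states are sets of known operations, and a breadth-first search from "all unknown" to "all known" returns the minimum latency. The small three-operation graph and the function name are made up for illustration, and the sketch ignores 1 -> 0 transitions.

```python
from collections import deque
from itertools import combinations


def min_latency(ops, preds, bound):
    """Fewest cycles to make every operation known, or None if impossible."""
    start, goal = frozenset(), frozenset(ops)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return dist[state]
        # Operations whose predecessors are all known may be issued.
        ready = [o for o in ops
                 if o not in state and preds.get(o, set()) <= state]
        for k in range(1, bound + 1):
            for chosen in combinations(ready, k):
                nxt = state | frozenset(chosen)
                if nxt not in dist:
                    dist[nxt] = dist[state] + 1
                    queue.append(nxt)
    return None


ops = ["a", "b", "c"]
preds = {"c": {"a", "b"}}       # c depends on both a and b
print(min_latency(ops, preds, bound=1))  # 3
print(min_latency(ops, preds, bound=2))  # 2
```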

20
All Minimum Latency Schedules
  • Symbolic reachable state analysis
  • Newly reached states are saved each cycle
  • Backward pruning preserves transitions used in
    all shortest paths
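The forward/backward procedure can be imitated with explicit sets in place of symbolic (BDD) state sets. This is only a sketch under that substitution: each forward layer holds the newly reached states, and the backward pass keeps exactly the transitions that lie on some shortest path. Names and the example graph are hypothetical.

```python
from itertools import combinations


def successors(state, ops, preds, bound):
    """States reachable in one cycle by issuing 1..bound ready operations."""
    ready = [o for o in ops if o not in state and preds.get(o, set()) <= state]
    for k in range(1, bound + 1):
        for chosen in combinations(ready, k):
            yield state | frozenset(chosen)


def min_latency_transitions(ops, preds, bound):
    goal = frozenset(ops)
    layers, seen = [{frozenset()}], {frozenset()}
    # Forward pass: save the newly reached states of each cycle.
    while goal not in layers[-1]:
        new = {n for s in layers[-1]
               for n in successors(s, ops, preds, bound)} - seen
        if not new:
            return set()  # goal unreachable
        seen |= new
        layers.append(new)
    # Backward pass: keep transitions that lead toward the goal layer by layer.
    keep, kept = {goal}, set()
    for layer in reversed(layers[:-1]):
        prev_keep = set()
        for s in layer:
            for n in successors(s, ops, preds, bound):
                if n in keep:
                    kept.add((s, n))
                    prev_keep.add(s)
        keep = prev_keep
    return kept


edges = min_latency_transitions(["a", "b", "c"], {"c": {"a", "b"}}, 1)
print(len(edges))  # 5 transitions, covering both orders of a and b
```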

24
All Minimum Latency Schedules
  • Described construction is exact
  • Suitable heuristics are available and, since they
    can use arbitrary subsets of the potential
    schedules, are powerful

25
CDFG Representation
26
CDFGs Multiple Control Paths
  • Guard automata differentiate control paths
  • Before control operation scheduled: control
    value unknown
  • After control operation scheduled
  • Guards are implemented as modified operation
    automata

27
CDFGs Multiple Control Paths
  • All control paths form ensemble schedule
  • Possibly 2^c control paths to schedule
    (non-looping case)
  • Dummy operation identifies when control path
    terminates
  • Only one termination operation
  • Ensemble schedule need not be causal!
  • Need solution for each control path
    (Completeness)
  • Need compatibility between paths whose control is
    not resolved (Causality)
  • Solution validation algorithm
  • Validation is a path to path property for all
    control paths in ensemble schedule
  • Fixed Point Iteration

28
CDFG Example
  • One green resource
  • Shortest paths
  • False termination

29
Validated CDFG Example
  • Validation algorithm ensures control paths don't
    bifurcate before control value is known

30
Validated CDFG Example
  • Validation algorithm ensures control paths don't
    bifurcate before control value is known
  • Pruned for all shortest paths as before

31
Validation Algorithm
  • Validation Proceeds on potential traces
  • Re-traverse Automata, Dynamically Modifying
    Transition Relation based on current available
    states in each time step. Allow guard computation
    only for states with matching histories if the
    guard is true or false.
  • Iterate until fixed point on all paths
  • Apply the following non-linear filter to each
    transition

32
Selected CDFG Benchmarks
33
Large Benchmarks
34
Comparison of CPU Times
35
Required CPU Seconds
36
Construction for Looping DFGs
  • Use a trick: the 0/1 representation of the MA
    could be interpreted as 2 mutually exclusive
    operand productions
  • Schedule from known -> known -> known, where each
    0 -> 1 or 1 -> 0 transition requires a resource.
  • Since dependencies are on operands, add new
    dependencies in the 1 -> 0 sense as well
  • Idea is to remove all transitions which do not
    have a complete set of known predecessors for
    the respective sense of the operation
  • So we get looping DFG automata as nearly the same
    automata as before
  • preserve efficient representation
  • Selection of Minimal Latency solutions is more
    difficult

37
Loop construction resources
  • Resources: we now count both 0 -> 1 and 1 -> 0
    transitions as requiring a resource.
  • Use Tuple BDD construction: at most k bits of n
    BDD
  • Despite exponential number of product terms, BDD
    complexity is O(bound x |V|)
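The size claim can be checked on a small hand-rolled decision diagram. The sketch below builds a reduced ordered decision diagram for "at most k of n bits are 1" with a plain dictionary as the unique table; it is not the Tuple BDD code from the talk, and no BDD library is used.

```python
def at_most_k_bdd(n, k):
    """Return (root id, internal node count) for the function sum(bits) <= k."""
    unique = {}  # (variable index, low child, high child) -> node id
    memo = {}    # (variable index, ones seen so far) -> node id

    def build(i, ones):
        if ones > k:
            return 0                      # terminal FALSE: bound exceeded
        if ones + (n - i) <= k:
            return 1                      # terminal TRUE: bound cannot be exceeded
        key = (i, ones)
        if key not in memo:
            low = build(i + 1, ones)      # bit i is 0
            high = build(i + 1, ones + 1) # bit i is 1
            if low == high:
                memo[key] = low           # redundant test, skip the node
            else:
                memo[key] = unique.setdefault((i, low, high), len(unique) + 2)
        return memo[key]

    root = build(0, 0)
    return root, len(unique)


for n, k in [(8, 2), (16, 3), (32, 4)]:
    print(n, k, at_most_k_bdd(n, k)[1])
```

For this function the node count works out to (k + 1)(n - k), which is linear in n for a fixed bound, even though the number of satisfying assignments grows combinatorially.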

38
Example CA
  • State order (v1,v2,v3,v4)
  • Path 0, 9, C, 7, 2, 9, C, 7, 2, ... is a valid schedule.
  • By construction, only 1 instance of any operator
    can occur in a state.
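The hexadecimal state names in the path can be unpacked into per-operation bits. The sketch assumes v1 is the most significant bit of each hex digit, which the slide's state order suggests but does not state outright.

```python
# State order from the slide; treating v1 as the most significant bit
# is an assumption.
OPS = ("v1", "v2", "v3", "v4")


def decode(hex_digit):
    """Turn one hex digit into a tuple of four state bits (v1..v4)."""
    value = int(hex_digit, 16)
    return tuple((value >> (3 - i)) & 1 for i in range(4))


def toggled(a, b):
    """Operations whose bit changes between states a and b."""
    return [op for op, x, y in zip(OPS, decode(a), decode(b)) if x != y]


path = ["0", "9", "C", "7", "2", "9"]
for a, b in zip(path, path[1:]):
    print(a, "->", b, toggled(a, b))
```

In the looping construction every toggled bit, in either direction, corresponds to one execution of that operation, so the printed lists show which operators run in each cycle of the path.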

39
Strategy to Find Maximal Throughput
  • CA construction is simple
  • How to find closed subset of paths guaranteeing
    optimal throughput?
  • Could start from known initial state and prune
    slow paths as before, but this is not optimal!
  • Instead find all reachable states (without
    resource bounds)
  • Use state set to prune unreachable transitions
    from CA
  • Choose operator at random to be pinned (marked)
  • Propagate all states with chosen operator until
    it appears again in same sense
  • Verify closure of constructed paths by Fixed
    Point iteration
  • If set is empty -- add one clock to latency and
    verify again
  • Result is maximal closed set of paths for which
    optimal throughput is guaranteed

40
Maximal Throughput Example
  • DFG above has closed 3-cycle solution (2
    resources)
  • However, average latency is 2.5 cycles
  • (a,d) (b,e) (a,c) (b,d) (c,e) (a,d)
  • Requires 5 states to implement optimal throughput
    instance
  • In general, it is possible that a k-cycle closed
    solution may exist, even if no k-state solution
    can be found
  • Current implementation finds all possible k-cycle
    solutions
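The 2.5-cycle figure follows from the listed state sequence: the five distinct states execute each of the five operations twice, so two loop iterations complete every five cycles. A short check, with a hypothetical helper name:

```python
from collections import Counter
from fractions import Fraction

# The five states of the repeating sequence; the sixth entry on the
# slide, (a,d), is the first state again.
cycle = [("a", "d"), ("b", "e"), ("a", "c"), ("b", "d"), ("c", "e")]


def average_latency(states):
    """Cycles per loop iteration for a repeating sequence of states."""
    counts = Counter(op for state in states for op in state)
    per_op = set(counts.values())
    if len(per_op) != 1:
        raise ValueError("operations do not all execute equally often")
    return Fraction(len(states), per_op.pop())


print(average_latency(cycle))         # 5/2
print(float(average_latency(cycle)))  # 2.5
```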

41
EWF Looping Benchmarks
42
Synthetic Benchmarks
  • Over 100 synthetic benchmarks tested
  • Sizes: 50 operators, 100 operators; randomly
    assigned dependency chains, resources
  • 32 had no causal schedule
  • 35 had all maximum throughput schedules found in
    15 minute timeout (1 minute Reachable States, 14
    minute Fixed Point)
  • 33 Timed Out
  • Analysis of timeout cases: most included
    disconnected independent sub-graphs
  • Trial partitioning of the Transition Relation
    looks very promising on these cases (time/space
    reduction nearly quadratic!)
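The slides do not say how the transition relation was partitioned. One plausible first step, shown here purely as an assumption, is to find the disconnected sub-graphs of the dependency graph with a union-find pass so that each group can be handled separately.

```python
def components(ops, deps):
    """Group operations into connected components of the dependency graph."""
    parent = {o: o for o in ops}

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in deps:
        parent[find(u)] = find(v)

    groups = {}
    for o in ops:
        groups.setdefault(find(o), set()).add(o)
    return sorted(groups.values(), key=lambda g: sorted(g))


print(components("abcde", [("a", "b"), ("c", "d")]))
```

Operations in different groups share no dependencies, although they may still compete for the same resources, so any real partitioning would have to account for shared resource bounds as well.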

43
Synthetic Loop Benchmarks
44
Schedule Exploration Loops
  • Idea: Use partial symbolic traversal to find
    states bounding minimal latency paths
  • Latency: Identify all paths completing cycle in
    given number of steps
  • Repeatability: Fixed Point Algorithm to
    eliminate all paths which cannot repeat in given
    latency
  • Validation: Ensure all possible control paths
    are present for each remaining path
  • Optimization: Selection of Performance Objective

45
Kernel Execution Sequence Set
  • Path from Loop cut to first repeating states
  • Represents candidates for loop kernel

[Figure: loop kernel]
46
Repeatable Kernel Execution Sequence Set
  • Fixed-point prunes non-repeating states
  • Only repeatable loop kernels remain
  • Paths not all same length
  • Average latency < shortest Repeating Kernel

[Figure: loop cut and repeatable loop kernel]
47
Validation I
  • Schedule Consists of bundle of compatible paths
    for each possible future
  • Not Feasible to identify all schedules
  • Instead, eliminate all states which do not belong
    to some ensemble schedule
  • Fragile since any further pruning requires
    re-validation
  • Double fixed point

48
Validation II
  • Path Divergence -- Control Behavior
  • Ensure each path is part of some complete set for
    each control outcome
  • Ensure that each set is Causal

49
Loop Cuts and Kernels
  • Method Covers all Conventional Loop
    Transformations
  • Sequential Loop
  • Loop winding
  • Loop Pipelining

[Figure: loop cuts and loop kernels]
50
Results
  • Conventional Scheduling
  • 100-500x speedup over ILP
  • Control Scheduling Complexity typically pseudo
    polynomial in number of branching variables
  • Cyclic Scheduling
  • Reduced preamble complexity
  • Capacity 200-500 operands in exact
    implementation
  • General Control Dominated Scheduling
  • Implicit formulation of all forms of CDFG
    transformation
  • Exact Solutions with Millions of Control paths
  • Protocol Constrained Scheduling
  • Exact for small instances; needs sensible
    pruning of domain

51
MIPS Model
  • SimpleScalar (MIPS IV superset) Model
  • Trace Probabilities from MediaBench
  • Hierarchical Model
  • Collection of Instruction Tasks in Flight
  • Each Instruction Task is Complete Behavioral
    Model of Instruction Execution, including all
    instruction types, hazards, controls, and
    Contention for Physical Resources
  • Additional Sequential Protocols for Memory
    Subsystem, both Fetch and Load/Store

52
Processor Composition
  • Ordered Fetch/Commit
  • 3 Simultaneous Instruction Executions
  • Sequencing of Instructions separated from
    pipeline
  • Out of Order Prefetch or Commit can be Modeled

53
PC update Speculative Fetch
  • Speculate Joins to allow early prefetch and
    address computation

54
MIPS Transaction Dependencies
55
MIPS Results Constraints
  • Scenario A
  • 1/2 cycle tasks, Single Bypass
  • 2 cycle Pipelined Double Word Memory Fetch
  • 2 cycle Pipelined Multiply
  • 2R/1W Register File, 2 ALU's, 2 port Memory
  • Scenario B
  • 2 cycle Memory Read/Write/Fetch
  • 2R -1R/1W Register File, 1 ALU, 1 port Memory
  • Cache 1 cycle hit/3 cycle miss, Deferred Pipeline

56
MIPS Results Instruction Mix
  • Media Bench Tuning
  • 88 reg-reg, reg-imm, br taken, load single
  • 80 branch taken
  • 35 Single Bypass Hazard
  • 1 Multiple Bypass (Stall in model)
  • Two Sets of Priority Mixes
  • Mix 1 favors (reg-reg, reg-imm, br-taken)
  • Mix 2 favors (load-sw, br-taken)

57
MIPS Results Mix 1
  • Mix 1 favors reg-reg, reg-imm, and br-taken

58
MIPS Results Mix 2
  • Mix 2 Favors loads, reg-reg w. branches

59
Cache and I/O Protocol
  • For 3 instructions in flight: > 542,000 control
    paths!
  • Schedules still exact: every optimal sequence is
    constructed

60
Conclusions
  • NFA protocol modeling shown to be effective
    representation for generalized scheduling problem
  • Efficiency of algorithms so far is comparable or
    superior to any known exact technique
  • Potential for powerful heuristics based on
    sub-set representation
  • First exact solutions for a wide variety of
    generalized scheduling problems