Loop Dissevering: A Technique for Temporally Partitioning Loops in Dynamically Reconfigurable Computing Platforms

1
Loop Dissevering A Technique for Temporally
Partitioning Loops in Dynamically Reconfigurable
Computing Platforms
João M. P. Cardoso
University of Algarve, Faro / INESC-ID, Lisboa, Portugal
10th Reconfigurable Architectures Workshop (RAW
2003), Nice, France, April 22, 2003
17th Annual Int'l Parallel & Distributed
Processing Symposium (IPDPS 2003)
2
Motivation
  • How to map sets of computational structures
    requiring more resources than available?

for(int i=0; i<8; i++)
  for(int j=0; j<8; j++)
    CosTrans[j*8+i] = CosBlock[i*8+j];
for(int i=0; i<8; i++)
  for(int j=0; j<8; j++) {
    TempBlock[i+j*8] = 0;
    for(int k=0; k<8; k++)
      TempBlock[i+j*8] += InIm[i+k*8] * CosTrans[k+j*8];
  }
3
Motivation
  • How to map sets of computational structures
    requiring more resources than available?
  • Temporal Partitioning

for(int i=0; i<8; i++)
  for(int j=0; j<8; j++)
    CosTrans[j*8+i] = CosBlock[i*8+j];
for(int i=0; i<8; i++)
  for(int j=0; j<8; j++) {
    TempBlock[i+j*8] = 0;
    for(int k=0; k<8; k++)
      TempBlock[i+j*8] += InIm[i+k*8] * CosTrans[k+j*8];
  }
4
Motivation
  • How to map sets of computational structures
    requiring more resources than available?
  • Temporal Partitioning
  • Other motivations for Partitioning Computations
    in Time
  • each design is simpler
  • may lead to better performance!
  • amortize some configuration time
  • by overlapping execution stages
  • use of smaller reconfigurable arrays to implement
    complex applications

For more info see Cardoso and Weinhardt, DATE
2003
5
Motivation
  • How to map sets of computational structures
    requiring more resources than available?
  • Temporal Partitioning
  • Computational structures for each loop or set of
    nested loops implemented in a single partition
  • But, what to do with a Loop requiring more
    resources than available?

6
Outline
  • Motivation
  • Configure-Execute Paradigm (execution stages)
  • Target Architecture
  • PACT XPP Architecture
  • XPP Configuration Flow
  • XPP-VC Compilation Flow
  • Temporal Partitioning of Loops
  • Experimental Results
  • Conclusions & Future Work

7
Configure-Execute Paradigm (Execution Stages)
  • the whole program in a single configuration
  • the program split into two configurations:
  • without on-chip context planes and without
    partial reconfiguration
  • with partial reconfiguration
  • with on-chip context planes

[Timeline diagram: each configuration passes through Fetch (f), Configure (c), and Compute (comp) stages; the scenarios above differ in how the fetch/configure of the second configuration (f2, c2) overlaps the compute of the first (comp1).]
8
PACT XPP Architecture (briefly)
  • X × Y coarse-grained array
  • Processing elements (PEs) compute typical ALU
    operations
  • Two columns of SRAMs (Ms)
  • I/O ports for data streaming

9
PACT XPP Architecture (briefly)
  • Ready/ack. protocol for each programmable
    interconnection
  • Flow of data (pre-foundry parameterized
    bit-widths)
  • Flow of events (1-bit lines)

10
PACT XPP Architecture (briefly)
  • Dynamically reconfigurable
  • On-chip configuration cache and configuration
    manager
  • Partial reconfiguration (only those used
    resources are configured)

[Diagram: the Configuration Manager (CM), with ports CMPort0 and CMPort1, fetches configurations into the Configuration Cache (CC) and configures the array.]
11
XPP Configuration Flow
  • Uses 3 stages to execute each configuration
  • Array may request the next configuration
  • Configuration manager accepts requests and
    proceeds without intervention from the external
    host

12
XPP-VC Compilation Flow
  • TempPart partitions and generates
    reconfiguration statements, which are executed
    by the Configuration Manager
  • MODGen maps C subset to NML (PACT proprietary
    structural language with reconfiguration
    primitives)

For more info see Cardoso and Weinhardt, FPL 2002
13
Temporal Partitioning
  • One partition for each node at the TOP level of
    the Hierarchical Task Graph (HTG)
  • Merge adjacent nodes if combination of both can
    be mapped to XPP device and if merge does not
    degrade overall performance
  • If HTG node too large, create separate partition
    for each node of the inner-HTG and call algorithm
    recursively

14
Temporal Partitioning of Loops
  • What to do when loops in the program cannot be
    mapped due to insufficient resources?
  • Software/reconfigware approach
  • loop control stays in software
  • inner code sections migrate to reconfigware,
    each one mapped to a single configuration
  • Loop Distribution
  • transforms a loop into two or more loops
  • each loop with the same iteration-space traversal
    of the original loop
  • inner statements of the original loop are split
    among the loops
  • Loop Dissevering
  • transforms a loop into a set of configurations
  • the cyclic behavior is implemented by the
    configuration flow

15
Temporal Partitioning of Loops
for(nx=0; nx<X_DIM_BLK; nx++)
  for(ny=0; ny<Y_DIM_BLK; ny++) {
    for(i=0; i<N; i++)
      for(j=0; j<N; j++) {     /* Inner Loop 1 */
        tmp = 0;
        for(k=0; k<N; k++)
          tmp += X[i+ny*N][k+nx*N] * CosBlock[j][k];
        TempBlock[i][j] = tmp;
      }
    // to be partitioned here
    for(i=0; i<N; i++)
      for(j=0; j<N; j++) {     /* Inner Loop 2 */
        tmp = 0;
        for(k=0; k<N; k++)
          tmp += TempBlock[k][j] * CosBlock[i][k];
        Y[i+ny*N][j+nx*N] = tmp;
      }
  }
  • Loop Distribution
  • Loop Dissevering

16
Loop Distribution
for(nx=0; nx<X_DIM_BLK; nx++)
  for(ny=0; ny<Y_DIM_BLK; ny++)
    for(i=0; i<N; i++)
      for(j=0; j<N; j++) {
        /* Inner Loop 1 */
        TempBlock[i+ny*N][j+nx*N] = tmp;
      }
for(nx=0; nx<X_DIM_BLK; nx++)
  for(ny=0; ny<Y_DIM_BLK; ny++)
    for(i=0; i<N; i++)
      for(j=0; j<N; j++) {
        tmp = 0;
        for(k=0; k<N; k++)
          tmp += TempBlock[k+ny*N][j+nx*N] * CosBlock[i][k];
        Y[i+ny*N][j+nx*N] = tmp;
      }
[Diagram: configuration flow begin → Conf. 1 → Conf. 2 → end; each distributed loop nest maps to one configuration.]
17
Loop Distribution
  • Cannot be applied to all loops
  • statements on a cycle of the original loop's
    dependence graph must stay in the same loop
  • Use of auxiliary array variables
  • for each loop-independent flow dependence of a
    scalar variable (known as scalar expansion) and
  • for each control dependence in the place where we
    want to partition the loop
  • Expansion of some arrays
  • But, it preserves the software pipelining
    potential, and
  • may improve parallelization, cache hit/miss
    ratio, etc.

18
Loop Dissevering
        nx = 0; write nx;
L1:     read nx;
        if(nx >= X_DIM_BLK) goto Finish;
        ny = 0; write ny;
L3:     read ny; read nx;
        if(ny >= Y_DIM_BLK) goto L4;
        for(i=0; i<N; i++)
          for(j=0; j<N; j++) {
            /* Inner Loop 1 */
            TempBlock[i][j] = tmp;
          }
        read ny; read nx;
        for(i=0; i<N; i++)
          for(j=0; j<N; j++) {
            /* Inner Loop 2 */
            Y[i+ny*N][j+nx*N] = tmp;
          }
        ny++; write ny; goto L3;
L4:     nx++; write nx; goto L1;
Finish:
[Diagram: configuration flow from begin through Conf. 1 to Conf. 5 and then end; the goto arcs of the dissevered loop become transitions in the configuration flow.]
19
Loop Dissevering
  • Applicable to every loop
  • Only relies on a configuration manager to execute
    complex loops
  • May free the host microprocessor to execute
    other tasks
  • No array or scalar expansion (only scalar
    communication)
  • But,
  • Besides furnishing feasible mappings, is it
    worth applying? Does it lead to efficient
    solutions (in terms of performance)?
  • What are the improvements if the architecture
    can switch between configurations in a few
    clock cycles?

20
Experimental Results
  • Compared Architectures
  • Both with runtime support to partial
    reconfiguration
  • ARCH-A
  • word-grained partial reconfiguration
  • ARCH-B
  • context-planes
  • with switching between contexts in a few clock
    cycles

[Timeline diagram: fetch/configure/compute stages (f, c, comp) for two configurations under ARCH-A and ARCH-B.]
21
Experimental Results
  • Benchmarks

Benchmark   Description                                  LoC   loops   loops after loop dist.
DCT         8×8 Discrete Cosine Transform on an image     80       8                       10
BPIC        Binary pattern image coding                  151       8                       10
Life        Conway's game of life algorithm              118      10                        -
22
Experimental Results (resource savings)
  • Using loop dissevering
  • When compared to implementations without loop
    dissevering, only 44% (DCT), 66% (BPIC), and
    85% (Life) of the resources are used

                w/o loop dissevering     w/ loop dissevering
Benchmark       configs      PEs         configs      PEs         Ratio (PEs)
DCT             1            123         5            54/132      0.44
BPIC            1            148         5            97/189      0.66
Life            4            144/304     6            123/416     0.85
23
Experimental Results (speedups)
  • Architecture A (ARCH-A)
  • Word-grained partial reconfiguration
  • Architecture B (ARCH-B)
  • Context-planes
  • DCT

24
Experimental Results (speedups)
  • Life
  • Applying Loop Dissevering
  • the benefits of ARCH-B become negligible when
    partitions in the loop compute for long times

25
Conclusions
  • Temporal Partitioning with Loop Dissevering
  • guarantees the mapping of theoretically unlimited
    computational structures
  • Loop Dissevering and Loop Distribution
  • may lead to performance enhancements
  • saving of resources
  • Loop Dissevering applicable to every loop
  • performance efficient implementations may require
    fast reconfiguration
  • the resultant performance may decrease
  • when innermost loops are partitioned (no more
    potential for loop pipelining)
  • when each active partition computes for short
    times (does not amortize the reconfiguration time)

26
Future Work
  • More study on the impact of Loop Dissevering and
    Loop Distribution
  • To understand the impact of the number of
    context-planes, configuration cache size, etc.
  • To evaluate loop partitioning when mapping to
    FPGAs
  • Automatic implementation of Loop Distribution
  • Methods to decide between Loop Dissevering and
    Loop Distribution

27
Acknowledgments (in the paper)
  • Part of this work was done while the author
    was with PACT XPP Technologies, Inc., Munich,
    Germany.
  • We gratefully acknowledge the support of all the
    members of PACT XPP Technologies, Inc.,
    especially the help of Daniel Bretz, Armin
    Strobl, and Frank May, regarding the XDS tools. A
    special thanks to Markus Weinhardt regarding the
    fruitful discussions about loop dissevering and
    the XPP-VC compiler.