Title: Loop Dissevering: A Technique for Temporally Partitioning Loops in Dynamically Reconfigurable Computing Platforms
1. Loop Dissevering: A Technique for Temporally Partitioning Loops in Dynamically Reconfigurable Computing Platforms
João M. P. Cardoso
University of Algarve, Faro / INESC-ID, Lisboa, Portugal
10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 22, 2003
17th Annual Intl. Parallel and Distributed Processing Symposium (IPDPS 2003)
2. Motivation
- How to map sets of computational structures requiring more resources than available?
    for(int i=0; i<8; i++) for(int j=0; j<8; j++)
      CosTrans[j+8*i] = CosBlock[i+8*j];
    for(int i=0; i<8; i++) for(int j=0; j<8; j++) {
      TempBlock[i+j*8] = 0;
      for(int k=0; k<8; k++)
        TempBlock[i+j*8] += InIm[i+k*8] * CosTrans[k+j*8];
    }
3. Motivation
- How to map sets of computational structures requiring more resources than available?
  - Temporal Partitioning
4. Motivation
- How to map sets of computational structures requiring more resources than available?
  - Temporal Partitioning
- Other motivations for Partitioning Computations in Time
  - each design is simpler
  - may lead to better performance!
    - amortize some configuration time
      - by overlapping execution stages
  - use of smaller reconfigurable arrays to implement complex applications
For more info see Cardoso and Weinhardt, DATE 2003
5. Motivation
- How to map sets of computational structures requiring more resources than available?
  - Temporal Partitioning
- Computational structures for each loop or set of nested loops implemented in a single partition
- But, what to do with a loop requiring more resources than available?
6. Outline
- Motivation
- Configure-Execute Paradigm (execution stages)
- Target Architecture
- PACT XPP Architecture
- XPP Configuration Flow
- XPP-VC Compilation Flow
- Temporal Partitioning of Loops
- Experimental Results
- Conclusions and Future Work
7. Configure-Execute Paradigm (Execution Stages)
- the program in a single configuration
- two configurations
  - without on-chip context planes and without partial reconfiguration
  - with partial reconfiguration
  - with on-chip context planes
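[Figure: timelines of the Fetch (f), Configure (c), and Compute (comp) stages of two configurations (f1, c1, comp1 and f2, c2, comp2) for the three two-configuration cases; in the partial-reconfiguration case c2 is split into c2_1 and c2_2. Horizontal axis: time.]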
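The three two-configuration cases can be compared with a small timing model. The sketch below is only an illustration under stated assumptions, not the model used in the talk: the `Stages` struct, the function names, and the overlap rules (f2 starts once c1 is done; c2_1 may overlap comp1; a context switch has a fixed cost) are all hypothetical.

```c
#include <assert.h>

/* Durations of the three execution stages of one configuration
   (hypothetical values, arbitrary time units). */
typedef struct {
    int fetch;      /* f    */
    int configure;  /* c    */
    int compute;    /* comp */
} Stages;

int later(int a, int b) { return a > b ? a : b; }

/* Assumption: f2 starts as soon as c1 is done and may overlap comp1. */
int f2_end(Stages s1, Stages s2) { return s1.fetch + s1.configure + s2.fetch; }
int comp1_end(Stages s1) { return s1.fetch + s1.configure + s1.compute; }

/* No context planes, no partial reconfiguration: c2 waits for comp1. */
int total_plain(Stages s1, Stages s2) {
    int c2_start = later(comp1_end(s1), f2_end(s1, s2));
    return c2_start + s2.configure + s2.compute;
}

/* Partial reconfiguration (assumed rule): the first c2_1 time units of c2
   target resources not used by configuration 1 and may overlap comp1;
   the remainder (c2_2) waits for comp1 to finish. */
int total_partial(Stages s1, Stages s2, int c2_1) {
    assert(c2_1 >= 0 && c2_1 <= s2.configure);
    int c2_1_end = f2_end(s1, s2) + c2_1;
    int c2_2_start = later(comp1_end(s1), c2_1_end);
    return c2_2_start + (s2.configure - c2_1) + s2.compute;
}

/* Context planes (assumed rule): all of c2 is loaded into an inactive plane
   during comp1; activating that plane costs switch_time. */
int total_context_planes(Stages s1, Stages s2, int switch_time) {
    int c2_end = f2_end(s1, s2) + s2.configure;
    return later(comp1_end(s1), c2_end) + switch_time + s2.compute;
}
```

With the made-up durations f1=2, c1=3, comp1=10 and f2=2, c2=4, comp2=6, this model gives 25 time units for the plain case, 22 with c2_1=3, and 22 with context planes and a switch cost of 1.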
8. PACT XPP Architecture (briefly)
- X × Y coarse-grained array
- Processing elements (PEs) compute typical ALU operations
- Two columns of SRAMs (Ms)
- I/O ports for data streaming
9. PACT XPP Architecture (briefly)
- Ready/ack. protocol for each programmable interconnection
- Flow of data (pre-foundry parameterized bit-widths)
- Flow of events (1-bit lines)
10. PACT XPP Architecture (briefly)
- Dynamically reconfigurable
- On-chip configuration cache and configuration manager
- Partial reconfiguration (only the resources actually used are configured)
[Figure: block diagram with the Configuration Manager (CM), the Configuration Cache (CC), the fetch and configure paths, and the ports CMPort0 and CMPort1.]
11. XPP Configuration Flow
- Uses 3 stages (fetch, configure, compute) to execute each configuration
- Array may request the next configuration
- Configuration manager accepts requests and proceeds without intervention from an external host
12. XPP-VC Compilation Flow
- TempPart partitions and generates reconfiguration statements which are executed by the Configuration Manager
- MODGen maps a C subset to NML (PACT proprietary structural language with reconfiguration primitives)
For more info see Cardoso and Weinhardt, FPL 2002
13. Temporal Partitioning
- One partition for each node in the Hierarchical Task Graph (HTG) TOP level
- Merge adjacent nodes if the combination of both can be mapped to the XPP device and if the merge does not degrade overall performance
- If an HTG node is too large, create a separate partition for each node of the inner-HTG and call the algorithm recursively
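The three rules above can be sketched as a recursive pass over one HTG level. This is a minimal illustration, not the XPP-VC implementation: the `HtgNode` type, the PE-count resource estimate, and `partition_level` are hypothetical, and the "does not degrade overall performance" check is assumed to always pass.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 8

/* Hypothetical HTG node: a resource estimate plus an optional inner HTG. */
typedef struct HtgNode {
    int pes;                                   /* PEs needed if mapped as a whole */
    size_t n_children;                         /* nodes of the inner HTG (0 = leaf) */
    const struct HtgNode *children[MAX_CHILDREN];
} HtgNode;

/* Partitions a sequence of adjacent nodes for a device with `capacity` PEs.
   Returns the number of partitions, or -1 if a leaf node does not fit
   (the case that loop distribution or loop dissevering would have to handle). */
int partition_level(const HtgNode *const *nodes, size_t n, int capacity) {
    int partitions = 0;
    int used = 0;  /* PEs in the currently open partition (0 = none open) */
    for (size_t i = 0; i < n; i++) {
        const HtgNode *nd = nodes[i];
        assert(nd->pes > 0);
        if (nd->pes > capacity) {
            /* Node too large: close the open partition and recurse. */
            if (used > 0) { partitions++; used = 0; }
            if (nd->n_children == 0) return -1;
            int inner = partition_level(nd->children, nd->n_children, capacity);
            if (inner < 0) return -1;
            partitions += inner;
        } else if (used > 0 && used + nd->pes <= capacity) {
            used += nd->pes;               /* merge with the adjacent node */
        } else {
            if (used > 0) partitions++;    /* close the previous partition */
            used = nd->pes;                /* open a new one */
        }
    }
    if (used > 0) partitions++;
    return partitions;
}
```

A return value of -1 marks the situation the rest of the talk addresses: a node with no inner HTG to split, such as a single loop body, that still exceeds the device.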
14. Temporal Partitioning of Loops
- What to do when loops in the program cannot be mapped due to the lack of enough resources?
- Software/reconfigware approach
  - control of the loop in software,
  - migrates inner-code sections to reconfigware, each one mapped to a single configuration
- Loop Distribution
  - transforms a loop into two or more loops
  - each loop with the same iteration-space traversal as the original loop
  - inner statements of the original loop are split among the loops
- Loop Dissevering
  - transforms a loop into a set of configurations
  - cyclic behavior implemented by the configuration flow
15. Temporal Partitioning of Loops
    for(nx=0; nx<X_DIM_BLK; nx++)
      for(ny=0; ny<Y_DIM_BLK; ny++) {
        for(i=0; i<N; i++) for(j=0; j<N; j++) {   // Inner Loop 1
          tmp = 0;
          for(k=0; k<N; k++)
            tmp += X[i+ny*N][k+nx*N] * CosBlock[j][k];
          TempBlock[i][j] = tmp;
        }
        // to be partitioned here
        for(i=0; i<N; i++) for(j=0; j<N; j++) {   // Inner Loop 2
          tmp = 0;
          for(k=0; k<N; k++)
            tmp += TempBlock[k][j] * CosBlock[i][k];
          Y[i+ny*N][j+nx*N] = tmp;
        }
      }
- Loop Distribution
- Loop Dissevering
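16. Loop Distribution
    // Conf. 1
    for(nx=0; nx<X_DIM_BLK; nx++)
      for(ny=0; ny<Y_DIM_BLK; ny++)
        for(i=0; i<N; i++) for(j=0; j<N; j++) {
          /* Inner Loop 1 */
          TempBlock[i+ny*N][j+nx*N] = tmp;
        }
    // Conf. 2
    for(nx=0; nx<X_DIM_BLK; nx++)
      for(ny=0; ny<Y_DIM_BLK; ny++)
        for(i=0; i<N; i++) for(j=0; j<N; j++) {
          tmp = 0;
          for(k=0; k<N; k++)
            tmp += TempBlock[k+ny*N][j+nx*N] * CosBlock[i][k];   // replaces: tmp += TempBlock[k][j] * CosBlock[i][k];
          Y[i+ny*N][j+nx*N] = tmp;
        }
[Figure: configuration flow: begin, Conf. 1, Conf. 2, end.]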
17. Loop Distribution
- Cannot be applied to all loops
  - it must not break cycles in the dependence graph of the original loop
- Use of auxiliary array variables
  - for each loop-independent flow dependence of a scalar variable (known as scalar expansion) and
  - for each control dependence in the place where we want to partition the loop
- Expansion of some arrays
- But, it preserves the software pipelining potential, and
- may improve parallelization, cache hit/miss ratio, etc.
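Scalar expansion, mentioned above, is easiest to see on a toy loop. The sketch below is an assumed example and is not taken from the benchmarks: `LEN`, the loop bodies, and all function names are hypothetical. It shows a scalar `tmp` that crosses the desired partition point being replaced by an auxiliary array `tmp_x`, so that the loop can be split in two.

```c
#include <assert.h>

#define LEN 8  /* hypothetical trip count */

/* Original loop: tmp carries a value from S1 to S2 inside one iteration
   (a loop-independent flow dependence on a scalar). */
void original_loop(const int in[LEN], int out[LEN]) {
    for (int i = 0; i < LEN; i++) {
        int tmp = in[i] * in[i];  /* S1 */
        /* desired partition point */
        out[i] = tmp + 3;         /* S2 */
    }
}

/* After loop distribution: tmp is expanded into the auxiliary array tmp_x. */
void distributed_loops(const int in[LEN], int out[LEN]) {
    int tmp_x[LEN];
    for (int i = 0; i < LEN; i++)   /* first loop, e.g. one configuration */
        tmp_x[i] = in[i] * in[i];
    for (int i = 0; i < LEN; i++)   /* second loop, e.g. another configuration */
        out[i] = tmp_x[i] + 3;
}

/* Checks that both versions agree on a given input. */
void check_equivalent(const int in[LEN]) {
    int a[LEN], b[LEN];
    original_loop(in, a);
    distributed_loops(in, b);
    for (int i = 0; i < LEN; i++)
        assert(a[i] == b[i]);
}
```

The cost of the transformation is visible in `tmp_x`: one extra array element per iteration, which is the memory overhead that loop dissevering avoids.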
18. Loop Dissevering
    Conf. 1:           nx = 0;
                       write nx;
    Conf. 2:  L1:      read nx;
                       if(nx >= X_DIM_BLK) goto Finish;
                       ny = 0;
                       write ny;
    Conf. 3:  L3:      read ny; read nx;
                       if(ny >= Y_DIM_BLK) goto L4;
                       for(i=0; i<N; i++) for(j=0; j<N; j++) {
                         /* Inner Loop 1 */
                         TempBlock[i][j] = tmp;
                       }
    Conf. 4:           read ny; read nx;
                       for(i=0; i<N; i++) for(j=0; j<N; j++) {
                         /* Inner Loop 2 */
                         Y[i+ny*N][j+nx*N] = tmp;
                       }
                       ny++;
                       write ny;
                       goto L3;
    Conf. 5:  L4:      nx++;
                       write nx;
                       goto L1;
              Finish:
[Figure: configuration flow graph connecting begin, Conf. 1 to Conf. 5, and end.]
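The behavior of the dissevered loop can be mimicked in plain C, with a small dispatcher standing in for the configuration manager. This is a sketch under assumptions, not PACT tooling or NML: the `Memory` struct, the tiny dimensions, and the function names are hypothetical, and the split into five configurations follows the read/write points on the slide. Only the scalars `nx` and `ny` are passed between configurations.

```c
#include <assert.h>

#define N 2            /* block size (hypothetical, small for illustration) */
#define X_DIM_BLK 2
#define Y_DIM_BLK 2
#define ROWS (N * Y_DIM_BLK)
#define COLS (N * X_DIM_BLK)

enum { CONF1, CONF2, CONF3, CONF4, CONF5, FINISH };

/* State that survives reconfiguration: the scalars nx, ny and the arrays. */
typedef struct {
    int nx, ny;
    int X[ROWS][COLS], Y[ROWS][COLS];
    int CosBlock[N][N], TempBlock[N][N];
} Memory;

void inner_loop_1(Memory *m) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int tmp = 0;
            for (int k = 0; k < N; k++)
                tmp += m->X[i + m->ny * N][k + m->nx * N] * m->CosBlock[j][k];
            m->TempBlock[i][j] = tmp;
        }
}

void inner_loop_2(Memory *m) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int tmp = 0;
            for (int k = 0; k < N; k++)
                tmp += m->TempBlock[k][j] * m->CosBlock[i][k];
            m->Y[i + m->ny * N][j + m->nx * N] = tmp;
        }
}

/* Executes one configuration and returns the next one to load. */
int run_configuration(int conf, Memory *m) {
    assert(conf >= CONF1 && conf <= CONF5);
    switch (conf) {
    case CONF1: m->nx = 0; return CONF2;
    case CONF2: if (m->nx >= X_DIM_BLK) return FINISH;      /* L1 */
                m->ny = 0; return CONF3;
    case CONF3: if (m->ny >= Y_DIM_BLK) return CONF5;       /* L3 */
                inner_loop_1(m); return CONF4;
    case CONF4: inner_loop_2(m); m->ny++; return CONF3;
    default:    m->nx++; return CONF2;                      /* CONF5, L4 */
    }
}

/* Stand-in for the configuration manager; returns how many configurations ran. */
int run_dissevered(Memory *m) {
    int conf = CONF1, executed = 0;
    while (conf != FINISH) {
        conf = run_configuration(conf, m);
        executed++;
    }
    return executed;
}

/* Reference: the original nested loops. */
void run_original(Memory *m) {
    for (m->nx = 0; m->nx < X_DIM_BLK; m->nx++)
        for (m->ny = 0; m->ny < Y_DIM_BLK; m->ny++) {
            inner_loop_1(m);
            inner_loop_2(m);
        }
}
```

Running both versions on the same input should give identical `Y` arrays. The count returned by `run_dissevered` shows how often configurations are switched, which is why the reconfiguration time matters for this technique.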
19. Loop Dissevering
- Applicable to every loop
- Only relies on a configuration manager to execute complex loops
- May free the host microprocessor to execute other tasks
- No array or scalar expansion (only scalar communication)
- But,
  - Besides its use to furnish feasible mappings, is it worth applying? Does it lead to efficient solutions (in terms of performance)?
  - What are the improvements if the architecture can switch between configurations in a few clock cycles?
20. Experimental Results
- Compared Architectures
  - Both with runtime support for partial reconfiguration
  - ARCH-A
    - word-grained partial reconfiguration
  - ARCH-B
    - context-planes
    - with switching between contexts in a few clock cycles
[Figure: execution-stage timelines (f1, c1, comp1, f2, c2, comp2) for the two compared architectures; in one timeline c2 is split into c2_1 and c2_2.]
21. Experimental Results
Benchmark  Description                                  LoC   loops  loops after loop dist.
DCT        8×8 Discrete Cosine Transform on an image    80    8      10
BPIC       Binary pattern image coding                  151   8      10
Life       Conway's game of life algorithm              118   10     -
22. Experimental Results (resource savings)
- Using loop dissevering
  - When compared to implementations without loop dissevering, only 44% (DCT), 66% (BPIC), and 85% (Life) of resources are used
            w/o loop dissevering   w/ loop dissevering
Benchmark   configs   PEs          configs   PEs         Ratio (PEs)
DCT         1         123          5         54/132      0.44
BPIC        1         148          5         97/189      0.66
Life        4         144/304      6         123/416     0.85
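The Ratio (PEs) column can be cross-checked with a few lines of arithmetic. The slide does not define the two numbers in entries such as 54/132, so the reading below is an assumption: the ratio appears to be the first PE figure with loop dissevering divided by the first PE figure without it. The helper `pe_percent` is hypothetical.

```c
#include <assert.h>

/* PEs used with loop dissevering as a percentage of the PEs used without it,
   rounded to the nearest integer. */
int pe_percent(int pes_with, int pes_without) {
    assert(pes_with >= 0 && pes_without > 0);
    return (100 * pes_with + pes_without / 2) / pes_without;
}
```

Under that reading, `pe_percent(54, 123)`, `pe_percent(97, 148)`, and `pe_percent(123, 144)` give 44, 66, and 85, matching the ratios 0.44, 0.66, and 0.85 in the table.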
23. Experimental Results (speedups)
- Architecture A (ARCH-A)
  - Word-grained partial reconfiguration
- Architecture B (ARCH-B)
  - Context-planes
- DCT [Figure: speedup plot]
24. Experimental Results (speedups)
- Life [Figure: speedup plot]
- Applying Loop Dissevering
  - Benefits of ARCH-B are negligible when partitions in the loop compute for long times
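The last point follows from simple amortization: the longer each partition computes, the smaller the share of time spent switching configurations. The sketch below illustrates this under simplifying assumptions (one switch per partition per iteration, no overlap of stages). The function names and all cycle counts are hypothetical and are not measurements from the experiments.

```c
#include <assert.h>

/* Cycles to run a dissevered loop when every iteration executes each
   partition once and pays one configuration switch per partition. */
long dissevered_cycles(long iterations, int partitions,
                       long compute_per_partition, long switch_cost) {
    assert(iterations >= 0 && partitions > 0);
    return iterations * partitions * (compute_per_partition + switch_cost);
}

/* Speedup of a fast-switching architecture over a slow-switching one
   for the same loop; iteration and partition counts cancel out. */
double switch_speedup(long compute_per_partition,
                      long slow_switch, long fast_switch) {
    long slow = dissevered_cycles(1, 1, compute_per_partition, slow_switch);
    long fast = dissevered_cycles(1, 1, compute_per_partition, fast_switch);
    return (double)slow / (double)fast;
}
```

With a made-up switch cost of 1000 cycles versus 5 cycles, a partition that computes for 100 cycles gains a factor of about 10 from fast switching, while one that computes for 100000 cycles gains about 1%.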
25. Conclusions
- Temporal Partitioning + Loop Dissevering
  - guarantees the mapping of theoretically unlimited computational structures
- Loop Dissevering and Loop Distribution
  - may lead to performance enhancements
  - saving of resources
- Loop Dissevering applicable to every loop
  - performance efficient implementations may require fast reconfiguration
  - the resultant performance may decrease
    - when innermost loops are partitioned (no more potential for loop pipelining)
    - when each active partition computes for short times (does not amortize the reconfiguration time)
26. Future Work
- More study on the impact of Loop Dissevering and Loop Distribution
  - To understand the impact of the number of context-planes, configuration cache size, etc.
  - To evaluate loop partitioning when mapping to FPGAs
- Automatic implementation of Loop Distribution
- Methods to decide between Loop Dissevering and Loop Distribution
27. Acknowledgments (in the paper)
- Part of this work was done while the author was with PACT XPP Technologies, Inc., Munich, Germany.
- We gratefully acknowledge the support of all the members of PACT XPP Technologies, Inc., especially the help of Daniel Bretz, Armin Strobl, and Frank May regarding the XDS tools. Special thanks to Markus Weinhardt for the fruitful discussions about loop dissevering and the XPP-VC compiler.