Loop Dissevering: A Technique for Temporally Partitioning Loops in Dynamically Reconfigurable Computing Platforms

10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 22, 2003

Transcript and Presenter's Notes

1
Loop Dissevering: A Technique for Temporally Partitioning Loops in Dynamically Reconfigurable Computing Platforms
João M. P. Cardoso
University of Algarve, Faro / INESC-ID, Lisboa, Portugal
10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 22, 2003
17th Annual Int'l Parallel and Distributed Processing Symposium (IPDPS 2003)
2
Motivation

How to map sets of computational structures
requiring more resources than available?

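for (int i = 0; i < 8; i++)
  for (int j = 0; j < 8; j++)
    CosTrans[j + 8*i] = CosBlock[i + 8*j];

for (int i = 0; i < 8; i++)
  for (int j = 0; j < 8; j++) {
    TempBlock[i + j*8] = 0;
    for (int k = 0; k < 8; k++)
      TempBlock[i + j*8] += InIm[i + k*8] * CosTrans[k + j*8];
  }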
3
Motivation

How to map sets of computational structures
requiring more resources than available?
Temporal Partitioning

4
Motivation

How to map sets of computational structures
requiring more resources than available?
Temporal Partitioning
Other motivations for partitioning computations in time
  each design is simpler
  may lead to better performance!
  amortize some configuration time by overlapping execution stages
  use of smaller reconfigurable arrays to implement complex applications

For more info see Cardoso and Weinhardt, DATE 2003
5
Motivation

How to map sets of computational structures
requiring more resources than available?
Temporal Partitioning
Computational structures for each loop or set of nested loops are implemented in a single partition
But what to do with a loop requiring more resources than available?

6
Outline

Motivation
Configure-Execute Paradigm (execution stages)
Target Architecture
PACT XPP Architecture
XPP Configuration Flow
XPP-VC Compilation Flow
Temporal Partitioning of Loops
Experimental Results
Conclusions and Future Work

7
Configure-Execute Paradigm (Execution Stages)

[Figure: timelines of the three execution stages, Fetch (f), Configure (c), and Compute (comp), along a time axis. Cases shown: the program in a single configuration, and the program split into two configurations (1) without on-chip context planes and without partial reconfiguration (f1, c1, comp1, f2, c2, comp2), (2) with partial reconfiguration (c2 split into c2_1 and c2_2), and (3) with on-chip context planes.]
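The effect of overlapping stages can be illustrated by adding up stage times. The following is only a toy model: the `StageTimes` struct, the function names, the cycle counts, and the overlap rule (fetch and configure of the second configuration hidden behind the first compute stage) are assumptions made for illustration, not figures or mechanisms taken from the XPP.

```c
#include <assert.h>

/* Hypothetical per-configuration stage times, in clock cycles. */
typedef struct {
    long fetch;
    long configure;
    long compute;
} StageTimes;

static long max_long(long a, long b) { return a > b ? a : b; }

/* No overlap: all six stages of the two configurations run back to back. */
long total_sequential(StageTimes c1, StageTimes c2) {
    return c1.fetch + c1.configure + c1.compute
         + c2.fetch + c2.configure + c2.compute;
}

/* Assumed overlap model: f2 and c2 proceed while comp1 is running, and
 * activating the second configuration then costs switch_cycles. */
long total_overlapped(StageTimes c1, StageTimes c2, long switch_cycles) {
    assert(switch_cycles >= 0);
    return c1.fetch + c1.configure
         + max_long(c1.compute, c2.fetch + c2.configure)
         + switch_cycles + c2.compute;
}
```

With the made-up times {10, 20, 100} and {10, 20, 50}, the sequential total is 210 cycles and the overlapped total with a 2-cycle switch is 182 cycles. When comp1 is shorter than f2 + c2, the overlap hides only part of the reconfiguration.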
8
PACT XPP Architecture (briefly)

X × Y coarse-grained array
Processing elements (PEs) compute typical ALU operations
Two columns of SRAMs (Ms)
I/O ports for data streaming

9
PACT XPP Architecture (briefly)

Ready/acknowledge protocol for each programmable interconnection
Flow of data (bit-widths parameterized pre-foundry)
Flow of events (1-bit lines)

10
PACT XPP Architecture (briefly)

Dynamically reconfigurable
On-chip configuration cache and configuration manager
Partial reconfiguration (only the resources actually used are configured)

[Figure: Configuration Manager (CM) and Configuration Cache (CC), with "fetch" and "configure" paths and ports CMPort0 and CMPort1]
11
XPP Configuration Flow

Uses 3 stages to execute each configuration (fetch, configure, compute)
The array may request the next configuration
The configuration manager accepts requests and proceeds without intervention from an external host

12
XPP-VC Compilation Flow

TempPart partitions the program and generates reconfiguration statements, which are executed by the Configuration Manager
MODGen maps a C subset to NML (PACT's proprietary structural language with reconfiguration primitives)

For more info see Cardoso and Weinhardt, FPL 2002
13
Temporal Partitioning

One partition for each node at the top level of the Hierarchical Task Graph (HTG)
Merge adjacent nodes if the combination of both can be mapped to the XPP device and the merge does not degrade overall performance
If an HTG node is too large, create a separate partition for each node of the inner HTG and call the algorithm recursively
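The three rules above can be sketched as a short recursive routine. This is a simplified illustration, not the TempPart implementation: the `Node` struct, the single `cost` number per node, and the `capacity` parameter are hypothetical, and the "does not degrade overall performance" test for merging is left out, so adjacent nodes are merged whenever they fit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical HTG node: cost is the number of array resources needed to
 * map the whole node; child lists the nodes of its inner HTG, in order. */
typedef struct Node {
    int cost;
    size_t nchild;
    const struct Node *child;
} Node;

/* Partitions the node sequence nodes[0..n) for a device offering
 * `capacity` resources. Returns the number of partitions, or -1 if a
 * node without inner HTG does not fit (the case slides call for loop
 * distribution or loop dissevering). */
int partition(const Node *nodes, size_t n, int capacity) {
    assert(capacity > 0);
    int parts = 0;
    int open = 0;   /* 1 while a partition can still absorb neighbours */
    int used = 0;   /* resources used by the open partition */
    for (size_t i = 0; i < n; i++) {
        int c = nodes[i].cost;
        if (c > capacity) {
            /* Too large: recurse into the inner HTG. */
            if (nodes[i].nchild == 0) return -1;
            int inner = partition(nodes[i].child, nodes[i].nchild, capacity);
            if (inner < 0) return -1;
            parts += inner;
            open = 0;           /* do not merge across a split node */
        } else if (open && used + c <= capacity) {
            used += c;          /* merge with the adjacent partition */
        } else {
            parts++;            /* start a new partition */
            open = 1;
            used = c;
        }
    }
    return parts;
}
```

For example, three top-level nodes costing 40, 50, and 30 on a device with 100 resources yield two partitions: the first two nodes are merged, and the third starts a new one.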

14
Temporal Partitioning of Loops

What to do when loops in the program cannot be mapped for lack of resources?
Software/reconfigware approach
  control of the loop stays in software
  inner code sections migrate to reconfigware, each one mapped to a single configuration
Loop Distribution
  transforms a loop into two or more loops
  each loop has the same iteration-space traversal as the original loop
  inner statements of the original loop are split among the loops
Loop Dissevering
  transforms a loop into a set of configurations
  cyclic behavior is implemented by the configuration flow

15
Temporal Partitioning of Loops
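for (nx = 0; nx < X_DIM_BLK; nx++)
  for (ny = 0; ny < Y_DIM_BLK; ny++) {
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        // Inner Loop 1
        tmp = 0;
        for (k = 0; k < N; k++)
          tmp += X[i + ny*N][k + nx*N] * CosBlock[j][k];
        TempBlock[i][j] = tmp;
      }
    // to be partitioned here
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        // Inner Loop 2
        tmp = 0;
        for (k = 0; k < N; k++)
          tmp += TempBlock[k][j] * CosBlock[i][k];
        Y[i + ny*N][j + nx*N] = tmp;
      }
  }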

Loop Distribution
Loop Dissevering

16
Loop Distribution
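for (nx = 0; nx < X_DIM_BLK; nx++)            // Conf. 1
  for (ny = 0; ny < Y_DIM_BLK; ny++)
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        [Inner Loop 1]
        TempBlock[i + ny*N][j + nx*N] = tmp;
      }

for (nx = 0; nx < X_DIM_BLK; nx++)            // Conf. 2
  for (ny = 0; ny < Y_DIM_BLK; ny++)
    for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
        tmp = 0;
        for (k = 0; k < N; k++)
          tmp += TempBlock[k + ny*N][j + nx*N] * CosBlock[i][k];
        Y[i + ny*N][j + nx*N] = tmp;
      }

Configuration flow: begin -> Conf. 1 -> Conf. 2 -> end
Statement in the original loop: tmp += TempBlock[k][j] * CosBlock[i][k]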
17
Loop Distribution

Cannot be applied to all loops
  no break of cycles in the dependence graph of the original loop
Use of auxiliary array variables
  for each loop-independent flow dependence of a scalar variable (known as scalar expansion), and
  for each control dependence in the place where we want to partition the loop
Expansion of some arrays
But it preserves the software pipelining potential, and may improve parallelization, cache hit/miss ratio, etc.
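Scalar expansion, mentioned above, can be shown on a deliberately small loop. The sketch below is not taken from the benchmarks: the function names, the loop body, and the caller-provided scratch array `tmp_x` are hypothetical. The scalar `tmp` carries a value across the intended partition point, so distributing the loop requires one array element per iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: the scalar tmp is written before the intended partition
 * point and read after it, within the same iteration. */
void single_loop(const int *a, int *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int tmp = a[i] * a[i];   /* first half of the body */
        /* intended partition point */
        out[i] = tmp + 1;        /* second half of the body */
    }
}

/* After loop distribution: tmp is expanded into tmp_x[], so each of the
 * two loops could be mapped to its own configuration. */
void distributed_loops(const int *a, int *out, int *tmp_x, size_t n) {
    assert(a != NULL && out != NULL && tmp_x != NULL);
    for (size_t i = 0; i < n; i++)
        tmp_x[i] = a[i] * a[i];      /* would become the first configuration */
    for (size_t i = 0; i < n; i++)
        out[i] = tmp_x[i] + 1;       /* would become the second configuration */
}
```

Both functions produce the same `out`; the price of distribution is the extra `n`-element array, which is the storage overhead that loop dissevering avoids.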

18
Loop Dissevering
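Conf. 1:
          nx = 0;
          write nx;
Conf. 2:
  L1:     read nx;
          if (nx >= X_DIM_BLK) goto Finish;
          ny = 0;
          write ny;
Conf. 3:
  L3:     read ny; read nx;
          if (ny >= Y_DIM_BLK) goto L4;
          for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
              [Inner Loop 1]
              TempBlock[i][j] = tmp;
            }
Conf. 4:
          read ny; read nx;
          for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
              [Inner Loop 2]
              Y[i + ny*N][j + nx*N] = tmp;
            }
          ny++;
          write ny;
          goto L3;
Conf. 5:
  L4:     nx++;
          write nx;
          goto L1;
  Finish:

Configuration flow: begin -> Conf. 1 -> Conf. 2; Conf. 2 -> Conf. 3 or end; Conf. 3 -> Conf. 4 or Conf. 5; Conf. 4 -> Conf. 3; Conf. 5 -> Conf. 2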
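The control structure of this slide can be mimicked in plain C by treating each configuration as one case of a dispatcher loop, with the dispatcher standing in for the configuration manager. This is only a software analogy under stated assumptions: the `enum Conf` names, the tiny loop bounds, the stand-in computations for Inner Loop 1 and Inner Loop 2, and the `reg_nx`/`reg_ny` variables (modelling the scalars written and read between configurations) are all made up for illustration.

```c
#include <assert.h>

enum { X_DIM_BLK = 2, Y_DIM_BLK = 3 };
enum Conf { CONF1, CONF2, CONF3, CONF4, CONF5, FINISH };

/* Reference: the original nested loops, body split into two stages. */
void reference(int out[X_DIM_BLK][Y_DIM_BLK]) {
    for (int nx = 0; nx < X_DIM_BLK; nx++)
        for (int ny = 0; ny < Y_DIM_BLK; ny++) {
            int temp = 10 * nx + ny;     /* stands in for Inner Loop 1 */
            out[nx][ny] = temp * temp;   /* stands in for Inner Loop 2 */
        }
}

/* Dissevered version: the while loop plays the configuration manager and
 * each case plays one configuration. Returns the number of activations. */
int dissevered(int out[X_DIM_BLK][Y_DIM_BLK]) {
    int reg_nx = 0, reg_ny = 0;   /* scalars communicated between confs */
    int temp_block = 0;           /* stands in for the TempBlock memory */
    int activations = 0;
    enum Conf next = CONF1;
    while (next != FINISH) {
        activations++;
        switch (next) {
        case CONF1:
            reg_nx = 0;
            next = CONF2;
            break;
        case CONF2:                               /* label L1 */
            if (reg_nx >= X_DIM_BLK) { next = FINISH; break; }
            reg_ny = 0;
            next = CONF3;
            break;
        case CONF3:                               /* label L3 */
            if (reg_ny >= Y_DIM_BLK) { next = CONF5; break; }
            temp_block = 10 * reg_nx + reg_ny;
            next = CONF4;
            break;
        case CONF4:
            out[reg_nx][reg_ny] = temp_block * temp_block;
            reg_ny++;
            next = CONF3;
            break;
        case CONF5:                               /* label L4 */
            reg_nx++;
            next = CONF2;
            break;
        default:
            assert(0);
            next = FINISH;
            break;
        }
    }
    return activations;
}
```

Only scalars cross configuration boundaries here, which mirrors the "no array or scalar expansion" property, but every iteration of the ny loop costs two configuration activations; that is why the switching time between configurations matters.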
19
Loop Dissevering

Applicable to every loop
Relies only on a configuration manager to execute complex loops
May free the host microprocessor to execute other tasks
No array or scalar expansion (only scalar communication)
But,
  Besides its use to obtain feasible mappings, is it worth applying? Does it lead to efficient solutions (in terms of performance)?
  What are the improvements if the architecture can switch between configurations in a few clock cycles?

20
Experimental Results

Compared Architectures
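Both with runtime support for partial reconfiguration
ARCH-A: word-grained partial reconfiguration
ARCH-B: context planes, with switching between contexts in a few clock cycles

[Figure: fetch (f), configure (c), and compute (comp) timelines for two configurations on each architecture; in one timeline the second configure stage is split into c2_1 and c2_2]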
21
Experimental Results

Benchmarks

Benchmark  Description                                 LoC  Loops  Loops after loop distribution
DCT        8×8 Discrete Cosine Transform on an image   80   8      10
BPIC       Binary pattern image coding                 151  8      10
Life       Conway's Game of Life algorithm             118  10     -
22
Experimental Results (resource savings)

Using loop dissevering
When compared to implementations without loop dissevering, only 44% (DCT), 66% (BPIC), and 85% (Life) of the resources are used

           w/o loop dissevering   w/ loop dissevering
Benchmark  configs  PEs           configs  PEs          Ratio (PEs)
DCT        1        123           5        54/132       0.44
BPIC       1        148           5        97/189       0.66
Life       4        144/304       6        123/416      0.85
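The Ratio (PEs) column can be reproduced from the first number in each PEs entry. Reading that first number as the PE count of the largest single configuration is an assumption (the slide does not define the a/b notation), but the ratios in the table match under that reading. The helper name `pe_ratio` is hypothetical.

```c
#include <assert.h>

/* Ratio of the PEs needed with loop dissevering to the PEs needed without
 * it, using the first number of each PEs entry in the table (assumed here
 * to be the size of the largest configuration). */
double pe_ratio(int pes_with, int pes_without) {
    assert(pes_without > 0);
    return (double)pes_with / (double)pes_without;
}
```

For DCT, `pe_ratio(54, 123)` is about 0.439, which rounds to the 0.44 in the table; BPIC (97 and 148) and Life (123 and 144) give about 0.655 and 0.854.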
23
Experimental Results (speedups)

Architecture A (ARCH-A): word-grained partial reconfiguration
Architecture B (ARCH-B): context planes
[Chart: speedups for DCT]

24
Experimental Results (speedups)

Life
Applying Loop Dissevering
Benefits of ARCH-B are negligible when the partitions in the loop compute for long times

25
Conclusions

Temporal Partitioning + Loop Dissevering
  guarantees the mapping of theoretically unlimited computational structures
Loop Dissevering and Loop Distribution
  may lead to performance enhancements
  may save resources
Loop Dissevering
  applicable to every loop
  performance-efficient implementations may require fast reconfiguration
  the resultant performance may decrease
    when innermost loops are partitioned (no more potential for loop pipelining)
    when each active partition computes for short times (does not amortize the reconfiguration time)

26
Future Work

More study on the impact of Loop Dissevering and Loop Distribution
To understand the impact of the number of context planes, configuration cache size, etc.
To evaluate loop partitioning when mapping to FPGAs
Automatic implementation of Loop Distribution
Methods to decide between Loop Dissevering and Loop Distribution

27
Acknowledgments (in the paper)

Part of this work was done while the author was with PACT XPP Technologies, Inc., Munich, Germany.
We gratefully acknowledge the support of all the members of PACT XPP Technologies, Inc., especially the help of Daniel Bretz, Armin Strobl, and Frank May with the XDS tools. Special thanks to Markus Weinhardt for the fruitful discussions about loop dissevering and the XPP-VC compiler.