Processor Architectures and Program Mapping

1
Processor Architectures and Program Mapping
Data Memory Management, Part a: Overview
  • 5kk10 TU/e
  • Henk Corporaal
  • Jef van Meerbergen
  • Bart Mesman

2
Data Memory Management Overview
  • Motivation
  • Example application
  • DMM steps
  • Results
  • Notes
  • We concentrate on Static Data structures like
    arrays
  • The Data Transfer and Storage Exploration
    (DTSE) methodology, on which these slides are
    based, has been developed at IMEC, Leuven

3
Design flow
4
The underlying idea
for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=1; k<7; k++)
      B[j] = A[i*4+k];
5
Platform example TriMedia
6
Platform architecture model
[Figure: platform memory hierarchy, Level-1 to Level-4. Labels: chip with CPUs, HW accel, ICache, DCache, local memories and on-chip busses; L2 Cache; Main Memory; bus-if and bridge; SCSI bus with disks]
7
Data transfer and storage power
8
What about delay of memories?
  • Global wiring delay becomes dominant over gate
    delay

9
Positioning in the Y-chart
[Figure: Y-chart. Labels: Applications, Architecture Instance, Mapping, Performance Analysis, Performance Numbers]
10
Mapping
  • Given
  • architecture e.g. TriMedia TM1000
  • reference C code for application, e.g. MPEG-4
    Motion Estimation
  • Task
  • map application on architecture
  • But wait a moment
  • me@work> tmcc -o mpeg4_me mpeg4_me.c
    Thank you for running TriMedia compiler.
    Your program uses 257321886 bytes, 78 Watt,
    and 428798765291 clock cycles

11
Let's help the compiler ... DTSE: data transfer
and storage exploration
  • Transforms C-code of the application
  • By focusing on multi-dimensional signals (arrays)
  • To better exploit platform capabilities
  • This overview covers the major steps to improve
    power, area, performance trade-off in the context
    of platform based design

12
Application example
  • Application domain
  • Computer Tomography in medical imaging
  • Algorithm
  • Cavity detection in CT-scans
  • Detect dark regions in successive images
  • Indicate cavity in brain

Bad news for owner of brain
13
Application
[Figure: processing stages of the algorithm. Labels: Gauss Blur x, Gauss Blur y, Compute Edges, Max Value, Reverse, Detect Roots]
  • Reference (conceptual) C code for the algorithm
  • all functions: image_in[N x M] at t-1 -> image_out[N
    x M] at t
  • new value of pixel depends on its neighbors
  • neighbor pixels read from background memory
  • approximately 110 lines of C code (ignoring file
    I/O etc)
  • experiments with N x M = 640 x 400 pixels
  • straightforward implementation: 6 image buffers

14
DMM (data mem. mgt.) principles
Off-chip SDRAM
Exploit limited life-time
15
DMM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse and memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
Address optimization
C-out
16
The DMM steps
  • Preprocessing
  • Rewrite code in 3 layers (parts)
  • Selective inlining, Single Assignment form, ....
  • Data flow transformations
  • Eliminate redundant transfers and storage
  • Loop and control flow transformations
  • Improve regularity of accesses and data locality
  • Data re-use and memory hierarchy layer assignment
  • Determine when to move which data between
    memories to meet the cycle budget of the
    application with low cost
  • Determine in which layer to put the arrays (and
    copies)

17
The DMM steps
  • Per memory layer
  • Cycle budget distribution
  • determine memory access constraints for given
    cycle budget
  • Memory allocation and assignment
  • which memories to use, and where to put the
    arrays
  • Data layout
  • determine how to combine and put arrays into
    memories
  • Address optimization on the final C-code

18
Preprocessing: dividing an application into the 3
layers
[Figure: modules Module1a, Module1b, Module2 and Module3 with Synchronisation, split into three layers]
LAYER1
- testbench call
- dynamic event behaviour
- mode selection
LAYER2
LAYER3
int func1(int a, int b) { return a*b; }

19
Layered code structure
main() {  /* Layer 1 code */
  read_image(IN_NAME, image_in);
  cav_detect();
  write_image(image_out);
}

void cav_detect() {  /* Layer 2 code */
  for (x=GB; x<=N-1-GB; x++) {
    for (y=GB; y<=M-1-GB; y++) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; k++)
        gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}

20
Layered code structure
void cav_detect() {  /* Layer 2 code */
  for (x=GB; x<=N-1-GB; x++) {
    for (y=GB; y<=M-1-GB; y++) {
      gauss_x_tmp = 0;
      for (k=-GB; k<=GB; k++)
        gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    }
  }
}
/* Makes code for data access   */
/* and data transfer explicit   */

int foo(int arg1) {  /* Layer 3 */
  /* arithmetic, data-dependent operations to be
     mapped to data-path, controller */
}
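The Layer 2 / Layer 3 split can be tried out with a small compilable sketch. The image size, the radius GB, the Gauss coefficients and the body of foo are assumptions chosen for illustration; the slides do not give them. Layer 1 (file I/O and the testbench) is left out.

```c
#include <assert.h>
#include <stdlib.h>   /* abs */

#define N  8          /* image width  (assumed; the slides use 640) */
#define M  6          /* image height (assumed; the slides use 400) */
#define GB 1          /* blur radius  (assumed) */

/* Assumed filter coefficients, indexed by |k|. */
static const int Gauss[GB + 1] = { 2, 1 };

/* Layer 3: pure arithmetic, no array accesses.
   The normalisation by the coefficient sum (1 + 2 + 1) is an assumption. */
static int foo(int arg1) {
    return arg1 / 4;
}

/* Layer 2: loop nests in which every array access is explicit. */
static void gauss_blur_x(int in_image[N][M], int gauss_x_image[N][M]) {
    assert(in_image != gauss_x_image);   /* this version is not in-place */
    for (int x = GB; x <= N - 1 - GB; x++) {
        for (int y = GB; y <= M - 1 - GB; y++) {
            int gauss_x_tmp = 0;
            for (int k = -GB; k <= GB; k++)
                gauss_x_tmp += in_image[x + k][y] * Gauss[abs(k)];
            gauss_x_image[x][y] = foo(gauss_x_tmp);
        }
    }
}
```

Keeping foo free of array accesses is what lets the later steps reorder and rewrite the Layer 2 loops without touching the arithmetic.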
21
Data-flow trafo - cavity detection
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    gauss_x_image[x][y] = 0;

for (x=1; x<=N-2; x++) {
  for (y=1; y<=M-2; y++) {
    gauss_x_tmp = 0;
    for (k=-1; k<=1; k++)
      gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
    gauss_x_image[x][y] = foo(gauss_x_tmp);
  }
}
# accesses: N * M + (N-2) * (M-2)
22
Data-flow trafo - cavity detection
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else {
      gauss_x_image[x][y] = 0;
    }
# accesses: N * M (gain is almost 50%)
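The access counts quoted on these two slides can be checked with a small sketch that counts writes to gauss_x_image in both versions. The array sizes, coefficients and foo are assumed values for illustration only.

```c
#include <assert.h>
#include <stdlib.h>   /* abs */

#define N 8           /* assumed image width  */
#define M 6           /* assumed image height */

static const int Gauss[2] = { 2, 1 };     /* assumed coefficients */
static int foo(int v) { return v / 4; }   /* assumed Layer 3 function */

/* Before: clear the whole image, then overwrite the interior.
   Returns the number of writes to gauss_x_image. */
static long blur_two_passes(int image_in[N][M], int gauss_x_image[N][M]) {
    long writes = 0;
    assert(image_in != gauss_x_image);
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++) { gauss_x_image[x][y] = 0; writes++; }
    for (int x = 1; x <= N - 2; x++)
        for (int y = 1; y <= M - 2; y++) {
            int tmp = 0;
            for (int k = -1; k <= 1; k++)
                tmp += image_in[x + k][y] * Gauss[abs(k)];
            gauss_x_image[x][y] = foo(tmp); writes++;
        }
    return writes;
}

/* After: a single pass in which each element is written exactly once. */
static long blur_one_pass(int image_in[N][M], int gauss_x_image[N][M]) {
    long writes = 0;
    assert(image_in != gauss_x_image);
    for (int x = 0; x < N; x++)
        for (int y = 0; y < M; y++) {
            if (x >= 1 && x <= N - 2 && y >= 1 && y <= M - 2) {
                int tmp = 0;
                for (int k = -1; k <= 1; k++)
                    tmp += image_in[x + k][y] * Gauss[abs(k)];
                gauss_x_image[x][y] = foo(tmp);
            } else {
                gauss_x_image[x][y] = 0;
            }
            writes++;
        }
    return writes;
}
```

Both functions produce the same image; only the number of writes differs, N*M + (N-2)*(M-2) against N*M.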
23
Data-flow transformation
  • In total 5 types of data-flow transformations
  • advanced signal substitution and (copy)
    propagation
  • algebraic transformations (associativity etc.)
  • shifting delay lines
  • re-computation
  • transformations to eliminate bottlenecks for
    subsequent loop transformations

24
Data-flow transformation - result
25
Loop transformations
  • Loop transformations
  • improve regularity of accesses
  • improve temporal locality: production ↔
    consumption
  • Expected influence
  • reduce temporary storage and (anticipated)
    background storage

26
Global loop transformation steps applied to
cavity detection
  • Make all loop dimensions equal
  • Regularize loop traversal: Y and X loop
    interchange
  • follow order of input stream
  • Y loop folding and global merging
  • X loop folding and global merging
  • full, global scope regularity
  • nearly complete locality for main signals

27
Data enters Cavity Detector row-wise
[Figure: serial scan of image_in into a buffer that feeds the GaussBlur loop of the Cavity Detector]
28
Loop trafo - cavity detection
[Figure: N x M image delivered by the scanner, with X and Y axes]
From double buffer to single buffer
29
Loop interchange (Y ↔ X)
for (x=0; x<N; x++)
  for (y=0; y<M; y++)
    /* filtering code */

becomes

for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* filtering code */
  • Not always possible: check dependences

30
Loop trafo - cavity detection
[Figure: repeated fold and loop merge across Gauss Blur x, Gauss Blur y and Compute Edges]
From N x M to N x (2GB+1) buffer size
From N x M to N x 3 buffer size (offset arrays)
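The drop from an N x M buffer to N x (2GB+1) lines can be sketched as follows. This is an illustrative reconstruction rather than the code from the slides: it uses an unweighted vertical sum as a stand-in for the blur, and the names blur_y_full, blur_y_lines and lines are hypothetical.

```c
#include <assert.h>

#define N    4                /* assumed image width  */
#define M    8                /* assumed image height */
#define GB   1                /* assumed blur radius  */
#define ROWS (2 * GB + 1)     /* lines kept in the small buffer */

/* Reference: the consumer reads from the full N x M array. */
static void blur_y_full(int in[N][M], int out[N][M]) {
    assert(in != out);
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++) {
            out[x][y] = 0;
            if (y >= GB && y <= M - 1 - GB)
                for (int k = -GB; k <= GB; k++)
                    out[x][y] += in[x][y + k];
        }
}

/* Line-buffer version: only the last 2*GB+1 rows are stored,
   addressed modulo ROWS; the consumer runs GB rows behind. */
static void blur_y_lines(int in[N][M], int out[N][M]) {
    int lines[N][ROWS];
    assert(in != out);
    for (int y = 0; y < M + GB; y++)
        for (int x = 0; x < N; x++) {
            if (y < M)
                lines[x][y % ROWS] = in[x][y];   /* newest row replaces oldest */
            if (y >= GB) {
                int yo = y - GB;                 /* output row */
                out[x][yo] = 0;
                if (yo >= GB && yo <= M - 1 - GB)
                    for (int k = -GB; k <= GB; k++)
                        out[x][yo] += lines[x][(yo + k) % ROWS];
            }
        }
}
```

The modulo addressing used here is the kind of index arithmetic that the final address optimization step has to clean up again.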
31
Improve regularity and locality → Loop Merging
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 1st filtering code */
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    /* 2nd filtering code */

becomes

for (y=0; y<M; y++) {
  for (x=0; x<N; x++)
    /* 1st filtering code */
  for (x=0; x<N; x++)
    /* 2nd filtering code */
}
  • !! Impossible due to dependencies!

32
Data dependencies between 1st and 2nd loop
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    ... gauss_x_image[x][y] ...
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    for (k=-GB; k<=GB; k++)
      ... gauss_x_image[x][y+k] ...
33
Enable merging with Loop Folding (bumping)
for (y=0; y<M; y++)
  for (x=0; x<N; x++)
    ... gauss_x_image[x][y] ...
for (y=0+GB; y<M+GB; y++)
  for (x=0; x<N; x++)
    ... y-GB ...
    for (k=-GB; k<=GB; k++)
      ... gauss_x_image[x][y+k-GB] ...
34
Y-loop merging on 1st and 2nd loop nest
for (y=0; y<M+GB; y++) {
  if (y<M)
    for (x=0; x<N; x++)
      ... gauss_x_image[x][y] ...
  if (y>=GB)
    for (x=0; x<N; x++)
      if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
        for (k=-GB; k<=GB; k++)
          ... gauss_x_image[x][y-GB+k] ...
      else
        ...
}
35
Simplify conditions in merged loop
for (y=0; y<M+GB; y++) {
  for (x=0; x<N; x++)
    if (y<M)
      ... gauss_x_image[x][y] ...
  for (x=0; x<N; x++)
    if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB)
      for (k=-GB; k<=GB; k++)
        ... gauss_x_image[x][y-GB+k] ...
    else if (y>=GB)
      ...
}
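The folding and merging steps can be checked on a reduced example. The sketch below assumes stand-in filters (a doubling producer and an unweighted vertical sum as consumer) and omits the x-direction border conditions; the function names are hypothetical. It compares two separate loop nests against the folded and merged nest.

```c
#include <assert.h>

#define N  4   /* assumed image width  */
#define M  8   /* assumed image height */
#define GB 1   /* assumed filter radius */

/* Reference: two separate loop nests over y. */
static void two_nests(int in[N][M], int a[N][M], int b[N][M]) {
    assert(a != b);
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++)
            a[x][y] = 2 * in[x][y];              /* stand-in for 1st filter */
    for (int y = 0; y < M; y++)
        for (int x = 0; x < N; x++) {
            b[x][y] = 0;
            if (y >= GB && y <= M - 1 - GB)
                for (int k = -GB; k <= GB; k++)
                    b[x][y] += a[x][y + k];      /* stand-in for 2nd filter */
        }
}

/* Folded and merged: the consumer is shifted by GB rows so that row
   y-GB+k has already been produced when it is read. */
static void merged(int in[N][M], int a[N][M], int b[N][M]) {
    assert(a != b);
    for (int y = 0; y < M + GB; y++) {
        for (int x = 0; x < N; x++)
            if (y < M)
                a[x][y] = 2 * in[x][y];
        for (int x = 0; x < N; x++) {
            if (y >= GB && (y - GB) >= GB && (y - GB) <= M - 1 - GB) {
                b[x][y - GB] = 0;
                for (int k = -GB; k <= GB; k++)
                    b[x][y - GB] += a[x][y - GB + k];
            } else if (y >= GB) {
                b[x][y - GB] = 0;                /* border rows */
            }
        }
    }
}
```

Without the shift by GB, the merged consumer would read row y+GB before the producer has written it, which is the dependence that made plain merging impossible.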
36
Global loop merging/folding steps
  • 1 x ↔ y loop interchange (done)
  • 2 Global y-loop folding/merging 1st and 2nd nest
    (done)
  • 3 Global y-loop folding/merging 1st/2nd and 3rd
    nest
  • 4 Global y-loop folding/merging 1st/2nd/3rd and
    4th nest
  • 5 Global x-loop folding/merging 1st and 2nd nest
  • 6 Global x-loop folding/merging 1st/2nd and 3rd
    nest
  • 7 Global x-loop folding/merging 1st/2nd/3rd and
    4th nest

37
End result of global loop trafo
for (y=0; y<M+GB+2; y++) {
  for (x=0; x<N+2; x++) {
    ...
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute[x][y-GB][0] = 0;
      for (k=-GB; k<=GB; k++)
        gauss_xy_compute[x][y-GB][GB+k+1] =
          gauss_xy_compute[x][y-GB][GB+k]
          + gauss_x_image[x][y-GB+k] * Gauss[abs(k)];
      gauss_xy_image[x][y-GB] =
        gauss_xy_compute[x][y-GB][(2*GB)+1] / tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
    ...
38
Loop transformations - result
39
Data re-use memory hierarchy
[Figure: processor data paths and register file in front of a chain of memories; access counts per layer: 100, 10, 1]
P (original) = # accesses x power/access = 100
P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3
  • Introduce memory hierarchy
  • reduce number of reads from main memory
  • heavily accessed arrays stored in smaller memories

40
Data re-use
  • Data flow transformations to introduce
    extra copies of heavily accessed signals
  • Step 1 figure out data re-use possibilities
  • Step 2 calculate possible gain
  • Step 3 decide on data assignment to memory
    hierarchy

41
Data re-use
  • Data flow transformations to introduce
    extra copies of heavily accessed signals
  • Step 1 figure out data re-use possibilities
  • Step 2 calculate possible gain
  • Step 3 decide on data assignment to memory
    hierarchy

42
Data re-use tree
[Figure: data re-use tree with one chain per signal (image_in, gauss_x, gauss_xy/comp_edge, image_out), each ending at the CPU. Labels in the figure: N*M, M*3, N*1, 1*1, 1*3, 3*1, 3*3, N*M*3, N*M*8 and 0]
43
Memory hierarchy assignment
[Figure: assignment of image_in, gauss_x, gauss_xy, comp_edge and image_out to three layers: 1MB SDRAM, 16KB Cache and 128 B RegFile. Labels in the figure: N*M, M*3, 1*1, 3*1, 3*3, N*M*3, N*M*8 and 0]
44
Data-reuse - cavity detection code
Code before reuse transformation
for (y=0; y<M+3; y++) {
  for (x=0; x<N+2; x++) {
    if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++)
        gauss_x_tmp += image_in[x+k][y] * Gauss[abs(k)];
      gauss_x_image[x][y] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y] = 0;
    /* Other merged code omitted */
  }
}
45
Data-reuse - cavity detection code
Code after reuse transformation
for (y=0; y<M+3; y++) {
  for (x=0; x<N+2; x++) {
    /* first in_pixels initialized */
    if (x==0 && y>=1 && y<=M-2)
      for (k=0; k<1; k++)
        in_pixels[(x+k)%3] = image_in[x+k][y];
    /* copy rest of in_pixels in row */
    if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
      in_pixels[(x+1)%3] = image_in[x+1][y];
    if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; k++)
        gauss_x_tmp += in_pixels[(x+k)%3] * Gauss[abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;
  }
}
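The effect of the in_pixels copy can be measured on a single image row. The following sketch is a one-dimensional simplification with assumed coefficients and hypothetical function names; it counts reads from image_in with and without the three-element reuse buffer.

```c
#include <assert.h>
#include <stdlib.h>   /* abs */

#define N 10          /* assumed row length */

static const int Gauss[2] = { 2, 1 };   /* assumed coefficients */

/* Direct version: three reads of image_in per interior output pixel. */
static long blur_direct(const int image_in[N], int out[N]) {
    long reads = 0;
    assert(image_in != out);
    for (int x = 0; x < N; x++) {
        if (x >= 1 && x <= N - 2) {
            int tmp = 0;
            for (int k = -1; k <= 1; k++) {
                tmp += image_in[x + k] * Gauss[abs(k)];
                reads++;
            }
            out[x] = tmp;
        } else {
            out[x] = 0;
        }
    }
    return reads;
}

/* Reuse version: each pixel is read once into in_pixels[3],
   which is addressed modulo 3 as on the slide. */
static long blur_reuse(const int image_in[N], int out[N]) {
    int in_pixels[3];
    long reads = 0;
    assert(image_in != out);
    for (int x = 0; x < N; x++) {
        if (x == 0) { in_pixels[0] = image_in[0]; reads++; }
        if (x <= N - 2) { in_pixels[(x + 1) % 3] = image_in[x + 1]; reads++; }
        if (x >= 1 && x <= N - 2) {
            int tmp = 0;
            for (int k = -1; k <= 1; k++)
                tmp += in_pixels[(x + k) % 3] * Gauss[abs(k)];
            out[x] = tmp;
        } else {
            out[x] = 0;
        }
    }
    return reads;
}
```

For a row of N pixels the direct version issues 3*(N-2) reads to the large memory and the reuse version N; the remaining accesses go to the three-element buffer, which is small enough for a register file.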
46
Data reuse memory hierarchy
47
Data layout optimization
  • At this point multi-dimensional arrays are to be
    assigned to physical memories
  • Data layout optimization determines exactly where
    in each memory an array should be placed, to
  • reduce memory size by in-placing arrays that do
    not overlap in time (disjoint lifetimes)
  • avoid cache misses due to conflicts
  • exploit spatial locality of the data in memory to
    improve performance of e.g. page-mode memory
    access sequences

48
In-place mapping
[Figure: memory addresses versus time for three cases: inter in-place, intra in-place, and both intra + inter]
49
In-place mapping
  • Implements all the anticipated memory size
    savings obtained in previous steps
  • Modifies code to introduce one array per real
    memory
  • Changes indices to addresses in mem. arrays

b8 A[100][100]; b6 B[20][20];
for (i,j,k,l) B[i][j] = f(B[j][i], A[i+k][j+l]);
50
In-place mapping
  • Input image is partly consumed by the time first
    results for output image are ready

[Figure: index versus time for Image_in and Image_out]
51
In-place - cavity detection code
for (y=0; y<M+3; y++)
  for (x=0; x<N+5; x++)
    image_out[x-5][y-3] = /* code removed */ image_in[x+1][y];

becomes

for (y=0; y<M+3; y++)
  for (x=0; x<N+5; x++)
    image[x-5][y-3] = /* code removed */ image[x+1][y];
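The idea that output can overwrite input that has already been consumed is sketched below on a single row. This is a simplified stand-in for the slide's code, not a reconstruction of it: it assumes a 3-tap filter with made-up coefficients and keeps the live input pixels in three scalar variables, so the output for position x can safely be stored in the slot of input x.

```c
#include <assert.h>

#define N 10          /* assumed row length */

static const int Gauss[2] = { 2, 1 };   /* assumed coefficients */

/* Reference with separate input and output arrays. */
static void blur_two_arrays(const int image_in[N], int image_out[N]) {
    assert(image_in != image_out);
    for (int x = 0; x < N; x++) {
        if (x >= 1 && x <= N - 2)
            image_out[x] = image_in[x - 1] * Gauss[1]
                         + image_in[x]     * Gauss[0]
                         + image_in[x + 1] * Gauss[1];
        else
            image_out[x] = 0;
    }
}

/* In-place version: one array holds both input and output. */
static void blur_in_place(int image[N]) {
    assert(image);
    int prev = 0;            /* original image[x-1] */
    int cur  = image[0];     /* original image[x]   */
    for (int x = 0; x < N; x++) {
        /* image[x+1] has not been overwritten yet */
        int next = (x + 1 < N) ? image[x + 1] : 0;
        if (x >= 1 && x <= N - 2)
            image[x] = prev * Gauss[1] + cur * Gauss[0] + next * Gauss[1];
        else
            image[x] = 0;
        prev = cur;
        cur  = next;
    }
}
```

The write always trails the read, so no input pixel is destroyed before its last use, and one of the two image buffers disappears.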
52
In-place mapping - results
53
The last step: ADOPT
(Address OPTimization)
  • Increased execution time introduced by DTSE
  • Complicated address arithmetic (modulo!)
  • Additional complex control flow
  • Additional transformations needed to
  • Simplify control flow
  • Simplify address arithmetic: common
    sub-expression elimination, modulo expansion, ...
  • Match remaining expressions on target machine

54
ADOPT principles
Example: full-search motion estimation
for (i=-8; i<=8; i++)
  for (j=-4; j<=3; j++)
    for (k=-4; k<=3; k++)
      A[((208+i)*257+8+j)*257+16+i+k] = B[(8+j)*257+16+i+k];
    dist += A[3096] - B[((208+i)*257+4)*257+16+i-4];

becomes

cse1 = (33025*i+6869616)*2;
cse3 = 1040+i;
cse4 = j*257+1032;
cse5 = k+cse4;
A[cse5+cse1] = B[cse5+cse3];
dist += A[3096] - B[cse1];
Algebraic transformations at word-level
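The cse expressions can be checked against the original index expressions by brute force. The sketch below reads the slide's indices as ((208+i)*257+8+j)*257+16+i+k, (8+j)*257+16+i+k and ((208+i)*257+4)*257+16+i-4, and the loop ranges as i in [-8, 8] and j, k in [-4, 3]; both readings are assumptions where the transcript lost operators. The function names are hypothetical.

```c
#include <assert.h>

/* Original address expressions, as read from the slide. */
static long addr_a(long i, long j, long k) {
    return ((208 + i) * 257 + 8 + j) * 257 + 16 + i + k;
}
static long addr_b(long i, long j, long k) {
    return (8 + j) * 257 + 16 + i + k;
}
static long addr_b_ref(long i) {
    return ((208 + i) * 257 + 4) * 257 + 16 + i - 4;
}

/* Walks the loop nest, asserts that every cse form matches the
   original expression, and returns the number of checked iterations. */
static long check_cse_forms(void) {
    long checked = 0;
    for (long i = -8; i <= 8; i++) {
        long cse1 = (33025 * i + 6869616) * 2;   /* depends on i only */
        long cse3 = 1040 + i;
        assert(addr_b_ref(i) == cse1);
        for (long j = -4; j <= 3; j++) {
            long cse4 = j * 257 + 1032;          /* hoisted out of the k loop */
            for (long k = -4; k <= 3; k++) {
                long cse5 = k + cse4;
                assert(addr_a(i, j, k) == cse5 + cse1);
                assert(addr_b(i, j, k) == cse5 + cse3);
                checked++;
            }
        }
    }
    return checked;
}
```

After the rewrite the innermost loop needs three additions per iteration instead of several multiplications, since cse1, cse3 and cse4 are computed in the outer loops.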
55
Address optimization - result
56
Fixing platform parameters
  • Assume configurable on-chip memory hierarchy
  • Trade-off power versus cycle-budget

[Figure: power (mW, axis 5 to 25) versus storage cycle budget (axis 50,000 to 150,000)]
57
Conclusion
  • Many applications use large (static) data
    structures
  • Access and layout of this data can be heavily
    optimized
  • Compilers don't do this
  • Source code (C-to-C) transformations needed !!
  • Showed systematic approach
  • Platform independent high-level transformations
  • Platform dependent transformations exploit
    platform characteristics (optimal use of cache,
    ...)
  • Substantial energy, memory size (cost) and
    performance improvements
  • MPEG-4, OFDM, H.263, ADSL, ...