A Code Layout Framework for Embedded Processors with Configurable Memory Hierarchy

Kaushal Sanghai and David Kaeli, ECE Department, Northeastern University

1
A Code Layout Framework for Embedded Processors
with Configurable Memory Hierarchy
  • Kaushal Sanghai
  • David Kaeli
  • ECE Department
  • Northeastern University
  • Boston, MA

2
Outline
  • Motivation and goals
  • Blackfin 53x memory architecture
  • L1 code memory configurations
  • Code layout algorithm
  • PGO linker tool
  • Methodology
  • Results
  • Conclusions and future work
  • References

3
Motivation
  • Blackfin processor cores provide highly
    configurable memory subsystems to better match
    application-specific workload characteristics
  • Spatial and temporal locality present in
    applications should be exploited to produce
    efficient layouts
  • Code and data layout can be optimized by profile
    guidance

4
Motivation
  • Most developers rely on hand tuning the layout,
    which not only increases the time-to-market of
    embedded products but also results in an
    inefficient memory mapping
  • Program optimization techniques that automatically
    optimize memory layout for such memory subsystems
    are therefore needed

5
Goals
  • Develop a complete code-mapping framework that
    provides for automatic code layout for the range
    of L1 memory configurations available on Blackfin
  • Create tools that enable fast and easy design
    space exploration across the range of L1 memory
    configurations
  • Utilize execution profiles to tune code layout
  • Evaluate performance of the code mapping
    algorithms on the available L1 memory
    configurations for embedded multimedia
    applications

6
Memory Architecture
[Figure: Blackfin memory hierarchy, shown in builds that highlight the L1 SRAM, L1 Cache, and L1 SRAM/Cache configurations]
  • External SDRAM, 4 x (16-128 MB), and an optional
    L2 SRAM
  • External memory access: 10-12 system clock cycles
  • L1 instruction memory access from the core:
    single cycle
  • L1 instruction memory can be configured as SRAM,
    Cache, or SRAM/Cache

10
Tradeoffs Involved
  • L1 SRAM
      • Most cache misses are avoided by mapping the
        most frequently executed code (i.e., hot code)
        or critical code sections in SRAM
      • Performance can suffer if all of the hot code
        cannot be mapped to L1 SRAM
  • L1 Cache
      • Exploits temporal locality in code
      • May increase external memory bandwidth
        requirements
      • Performance can suffer if the application has
        poor temporal locality
  • L1 SRAM/Cache
      • Mapping hot sections in L1 SRAM reduces external
        memory bandwidth requirements
      • Cache provides low-latency access to
        infrequently executed code

11
Code Layout Algorithms
Memory configuration   Greedy algorithms implemented within the framework
L1 SRAM                Greedy sort to solve the Knapsack problem
L1 SRAM/Cache          Greedy heuristics to solve the Graph Coarsening problem
12
L1 SRAM Layout
Why Knapsack?

Objects                  A     B   C    D   E    F   G   H
Value (execution freq)   25    20  15   10  7    3   2   1
Weight (code size)       2     5   2    2   2    1   1   2
Value/Weight             12.5  4   7.5  5   3.5  3   2   0.5

Weight bound (size of L1 SRAM space): 8

Algorithm       Objects   Total value in Knapsack
Most Executed   A,B,F     48
Optimal         A,C,D,E   57
Greedy Sort     A,C,D,E   57
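
The "Most Executed" and "Greedy Sort" rows of the table can be reproduced with a short sketch. It assumes that both strategies walk a sorted list and skip any object that no longer fits; the slides do not spell out this detail, and the names `Section` and `greedy_fill` are illustrative only.

```python
from typing import NamedTuple


class Section(NamedTuple):
    name: str
    value: int   # execution frequency
    weight: int  # code size


# Data copied from the example table on the slide.
SECTIONS = [
    Section("A", 25, 2), Section("B", 20, 5), Section("C", 15, 2),
    Section("D", 10, 2), Section("E", 7, 2), Section("F", 3, 1),
    Section("G", 2, 1), Section("H", 1, 2),
]


def greedy_fill(sections, bound, key):
    """Take sections in descending `key` order, skipping any that no longer fit."""
    chosen, used = [], 0
    for s in sorted(sections, key=key, reverse=True):
        if used + s.weight <= bound:
            chosen.append(s)
            used += s.weight
    return chosen


def total_value(chosen):
    return sum(s.value for s in chosen)


# "Most Executed": rank by raw execution frequency.
most_executed = greedy_fill(SECTIONS, 8, key=lambda s: s.value)
# "Greedy Sort": rank by value density (execution frequency per unit of size).
greedy_sort = greedy_fill(SECTIONS, 8, key=lambda s: s.value / s.weight)

print([s.name for s in most_executed], total_value(most_executed))
print([s.name for s in greedy_sort], total_value(greedy_sort))
```

Ranking by value density keeps the large object B from crowding out several small, hot objects, which is why it reaches 57 here while ranking by raw frequency stops at 48.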
16
Efficient L1 SRAM Layout
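Maximize $\sum_i p_i x_i$ subject to $\sum_i s_i x_i \le S$, with $x_i \in \{0, 1\}$
where $p_i$ is the execution percentage of code section $i$ relative to the entire
execution, $s_i$ is the size of code section $i$, $S$ is the size of the L1 SRAM
space, and $x_i = 1$ when section $i$ is mapped to L1 SRAM
This is an NP-complete problem!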
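
For instances as small as the eight-object Knapsack example, the optimum can still be checked exactly. The sketch below uses the textbook dynamic program for the 0/1 Knapsack problem, which runs in time proportional to the number of objects times the weight bound. It is a swapped-in verification aid, not the greedy method the framework uses, and the function name `knapsack_01` is an assumption.

```python
def knapsack_01(values, weights, bound):
    """Exact 0/1 knapsack by dynamic programming over capacities 0..bound."""
    n = len(values)
    # best[i][c] = best total value using the first i items within capacity c
    best = [[0] * (bound + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, w = values[i - 1], weights[i - 1]
        for c in range(bound + 1):
            best[i][c] = best[i - 1][c]          # skip item i-1
            if w <= c:                            # or take it, if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - w] + v)
    # Walk back through the table to recover which items were taken.
    chosen, c = [], bound
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    return best[n][bound], sorted(chosen)


names = "ABCDEFGH"
values = [25, 20, 15, 10, 7, 3, 2, 1]
weights = [2, 5, 2, 2, 2, 1, 1, 2]
best_value, picked = knapsack_01(values, weights, 8)
print(best_value, [names[i] for i in picked])
```

The table grows with the weight bound, so this check is practical for small examples but is not a substitute for the greedy heuristic on real code sizes measured in bytes.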
17
Efficient Cache Layout
[Figure: weighted call graph over functions A to H, shown with its original mapping and with an improved mapping]
Nodes = functions; edge weight = calling frequency
Each color represents a cache line. Functions mapped to the same color conflict
[Hashemi, Kaeli 98]
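
The coloring idea in the caption can be illustrated with a small sketch. It assumes a direct-mapped cache with hypothetical parameters (`LINE_SIZE`, `NUM_LINES`); these values and the function names are for illustration only and are not taken from the slides or from the Blackfin hardware.

```python
LINE_SIZE = 32    # bytes per cache line (assumed for illustration)
NUM_LINES = 128   # lines in a direct-mapped cache (assumed): 4 KB in total


def colors(start, size):
    """Set of cache-line indices (colors) occupied by code at [start, start + size)."""
    first = start // LINE_SIZE
    last = (start + size - 1) // LINE_SIZE
    return {line % NUM_LINES for line in range(first, last + 1)}


def conflicts(a, b):
    """True when two (start_address, size) functions share at least one color."""
    return bool(colors(*a) & colors(*b))


caller = (0, 256)            # occupies colors 0..7
callee_far = (4096, 128)     # wraps around onto colors 0..3
callee_adjacent = (256, 128) # occupies colors 8..11

print(conflicts(caller, callee_far))       # True: they evict each other
print(conflicts(caller, callee_adjacent))  # False: contiguous placement avoids it
```

Placing a frequently called callee right after its caller gives the pair disjoint colors as long as their combined size fits in the cache, which is the effect the improved mapping aims for.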
19
Efficient L1 SRAM/Cache layout
  • Partition code into sections to be placed in L1
    SRAM and L1 Cache
  • L1 SRAM mapping
      • Maximize the amount of execution from L1 SRAM
      • Map functions with low temporal locality in L1
        SRAM
      • Solve the Knapsack problem for all functions
        based on the execution percentage, size, and
        temporal reuse distance
  • L1 Cache mapping
      • Of the remaining functions, merge frequently
        executed caller/callee function pairs and map
        them into contiguous memory locations

20
Algorithm
  • Inputs
      • Execution percentage and size
      • Weighted call graph
      • Temporal re-use distance (RUD) for every function
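
Two of these inputs can be derived from a function call trace, as sketched below. The slides do not define RUD precisely, so the sketch assumes one common reading: the number of distinct other functions called between two consecutive calls of the same function, averaged over the trace. The trace format, a list of `(caller, callee)` pairs, and both function names are assumptions as well.

```python
from collections import Counter, defaultdict


def weighted_call_graph(trace):
    """Count how often each (caller, callee) pair appears in the trace."""
    return Counter(trace)


def mean_reuse_distance(trace):
    """Average number of distinct other callees between repeat calls of a function.

    Functions called only once have no reuse and are left out of the result.
    """
    callees = [callee for _, callee in trace]
    last_seen = {}                  # callee -> index of its previous call
    distances = defaultdict(list)
    for i, f in enumerate(callees):
        if f in last_seen:
            between = set(callees[last_seen[f] + 1:i])
            distances[f].append(len(between))
        last_seen[f] = i
    return {f: sum(d) / len(d) for f, d in distances.items()}


# Hypothetical trace for illustration.
trace = [("main", "a"), ("a", "b"), ("main", "a"),
         ("a", "b"), ("main", "c"), ("main", "a")]
graph = weighted_call_graph(trace)
rud = mean_reuse_distance(trace)
print(dict(graph))
print(rud)
```

The slice-based distance computation is quadratic in the worst case; it is kept here for readability rather than speed.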

21
Algorithm
  • L1 SRAM mapping
      • Step 1: Filter out functions with less than 1%
        of execution percentage
      • Step 2: Compute (Execution % / Size) / RUD for
        the remaining functions
      • Step 3: Solve the Knapsack problem and map the
        solution to the L1 SRAM space
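
Steps 1 to 3 can be sketched as follows. The sketch assumes that the Knapsack step is solved greedily in descending score order, that every RUD is positive, and that each function is described by a hypothetical `(exec_pct, size, rud)` tuple; the function and benchmark names are made up for illustration.

```python
def map_to_sram(functions, sram_size, min_exec_pct=1.0):
    """functions: dict name -> (exec_pct, size, rud). Returns names placed in L1 SRAM."""
    # Step 1: drop functions below the execution-percentage threshold.
    hot = {name: f for name, f in functions.items() if f[0] >= min_exec_pct}

    # Step 2: score each remaining function as (execution % / size) / RUD.
    def score(item):
        exec_pct, size, rud = item[1]
        return (exec_pct / size) / rud

    # Step 3: greedy knapsack, highest score first, skipping what does not fit.
    placed, used = [], 0
    for name, (_, size, _) in sorted(hot.items(), key=score, reverse=True):
        if used + size <= sram_size:
            placed.append(name)
            used += size
    return placed


# Hypothetical profile: name -> (execution %, size in bytes, RUD).
profile = {
    "idct":  (40.0, 2048, 2.0),
    "huff":  (30.0, 4096, 1.0),
    "quant": (20.0, 1024, 4.0),
    "init":  (0.5, 512, 1.0),   # filtered out in step 1
}
print(map_to_sram(profile, sram_size=4096))
```

With a 4096-byte budget, `huff` is skipped because it no longer fits after `idct` is placed, and the smaller `quant` takes the remaining space.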

22
Algorithm
  • L1 cache mapping
      • Step 4: Form the call graph of the remaining
        functions and sort by edge weights
      • Step 5: Set the threshold on the maximum merged
        node size (MNsize); this is equal to the size
        of one way of the cache
      • Step 6: For all edges in the sorted list, merge
        nodes as long as the merged node size stays
        < MNsize

23
Algorithm
  • Let A and B be the nodes connected by an edge,
    and SA and SB be their corresponding sizes. There
    are 4 cases, based on the merged node assignment
    of the nodes connected by the edge
  • Step 7
      • Case 1: neither A nor B is in a merged node and
        SA + SB < MNsize
          • merge A and B and assign a common merged
            node id
      • Case 2: A is in a merged node and B is not
          • if (SA + SB < MNsize) then merge B with A
          • else proceed to the next edge
      • Case 3: B is in a merged node and A is not
          • same as case 2, with A and B swapped
      • Case 4: both A and B are in merged nodes
          • if the total size of the two merged nodes
            is less than MNsize, merge the two merged
            nodes to form a bigger node
          • else proceed to the next edge
  • Step 8: Map the resulting merged nodes in
    contiguous memory locations, starting with the
    merged node containing the heaviest edge
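
The four merge cases can be sketched as a single pass over the edges sorted by weight. The sketch makes several assumptions that the slides leave open: in cases 2 and 3 the size of the existing merged node is used as SA or SB, self-edges are skipped, functions that never merge are left out of the result, and the name `coarsen` and the example graph are hypothetical.

```python
def coarsen(edges, sizes, mn_size):
    """edges: dict (a, b) -> call frequency; sizes: dict function -> code size.

    Returns merged nodes (lists of functions) in layout order, with the node
    that was created from the heaviest edge first.
    """
    group_of = {}   # function -> merged node id
    members = {}    # merged node id -> list of functions
    next_id = 0

    def group_size(gid):
        return sum(sizes[f] for f in members[gid])

    for (a, b), _ in sorted(edges.items(), key=lambda e: e[1], reverse=True):
        if a == b:
            continue                                  # ignore recursion edges
        ga, gb = group_of.get(a), group_of.get(b)
        if ga is None and gb is None:                 # case 1
            if sizes[a] + sizes[b] < mn_size:
                members[next_id] = [a, b]
                group_of[a] = group_of[b] = next_id
                next_id += 1
        elif gb is None:                              # case 2
            if group_size(ga) + sizes[b] < mn_size:
                members[ga].append(b)
                group_of[b] = ga
        elif ga is None:                              # case 3
            if group_size(gb) + sizes[a] < mn_size:
                members[gb].append(a)
                group_of[a] = gb
        elif ga != gb:                                # case 4
            keep, drop = min(ga, gb), max(ga, gb)     # keep the older node's position
            if group_size(keep) + group_size(drop) < mn_size:
                for f in members[drop]:
                    group_of[f] = keep
                members[keep].extend(members.pop(drop))
    # Ids grow in creation order, so sorting them puts the heaviest edge first.
    return [members[g] for g in sorted(members)]


# Hypothetical sizes (bytes) and call frequencies.
sizes = {"A": 100, "B": 200, "C": 300, "D": 150, "E": 400, "F": 100}
edges = {("A", "B"): 300, ("C", "D"): 200, ("B", "C"): 50,
         ("B", "F"): 40, ("D", "E"): 30}
print(coarsen(edges, sizes, mn_size=512))
```

In this example the B-C edge cannot join the two merged nodes because their combined size would reach 750 bytes, so the layout keeps them as separate contiguous blocks.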

24
PGO Linker Framework
[Figure: tool flow for the program application, built up step by step over ten slides]
  • Read function symbol module
  • Gather profile information module
  • Program instrumentation module (instrumentation)
  • Call trace processing module: function call trace
    and temporal reuse distance
  • Inputs to layout: call graph, reuse distance,
    EP/size
  • Code layout module
  • Generate linker directive file
  • Relink the application

34
Methodology
  • Evaluated the algorithms on six consumer
    benchmark programs from the EEMBC suite

Benchmark       Code Size (KB)   # of functions
JPEG2 encoder   56               380
JPEG2 decoder   61               388
MPEG2 encoder   84               330
MPEG2 decoder   68               351
MPEG4 encoder   197              480
MPEG4 decoder   131              404

35
Methodology
  • Configured L1 memory as L1 SRAM/Cache for all the
    benchmarks
  • All experiments are performed on the Blackfin 533
    EZ-Kit hardware board
  • 4 different L1 memory configurations considered
      • 12K L1: 8K SRAM and 4K Cache
      • 16K L1: 12K SRAM and 4K Cache, compared to
      • 16K L1: 8K SRAM and 8K Cache, compared to
      • 80K L1: 64K SRAM and 16K Cache

36
Results
39
Enhanced System Implementation Cycle
  • Code development
  • Debug successful
  • Program optimization
  • Compiler optimization and/or profile-guided
    compiler optimizations
  • Evaluate L1 memory configurations and size
    within the PGO linker framework
  • System design

40
Features of the framework
  • Process is completely automated
      • Gather profiles
      • Generate dynamic function call graphs
      • Run optimization algorithms
      • Re-link the project for improved layout
  • Can be used with hardware, compiled simulation, or
    cycle-accurate simulation sessions in the
    VisualDSP development environment for BFxxx
  • Code mapping at function-level granularity
  • Efficient in run time

41
Conclusion
  • We have developed a completely automated and
    efficient code layout framework for the
    configurable L1 code memory supported by the
    BFxxx
  • We show a minimum of 3% to a maximum of 33%
    performance improvement (20% on average) for the
    six benchmark programs with a 12K L1 memory
  • We show that by efficiently mapping code, a 16K
    L1 memory delivers performance similar to that of
    an 80K L1 memory

42
Future work
  • The mapping can be extended to basic block
    granularity
  • Code mapping to avoid external memory bank
    contention (SDRAM) can be incorporated
  • Code layout techniques for multi-core
    architectures can be developed considering
    shared memory accesses
  • The framework can be extended to data layout
    techniques

43
References
  • Kaushal Sanghai, David Kaeli and Richard Gentile,
    "Code and Data Partitioning on Blackfin for
    partitioned multimedia benchmark programs," in
    Proceedings of the 2005 Workshop on
    Optimizations for DSP and Embedded Systems,
    Mar. 2005
  • Kaushal Sanghai, David Kaeli, Alex Raikman and
    Ken Butler, "A Code Layout Framework for
    Configurable Memory Systems in Embedded
    Processors," General Technical Conference, Analog
    Devices Inc., Jun. 2006

44
Command Line Interface
  • PGOLinker <dxefile> <linker directive output
    file(.asm)> -multicore algorithm
  • Sample Output
      • Algorithm Selected--> KNAPSACK
      • Connecting to the IDDE and loading Program
      • Connection to the IDDE established and Program
        loaded
      • Gathering the function symbol information
      • Function symbol information obtained
      • No existing profile session. A new profile
        session will be created
      • Application Running.
      • Processor Halted
      • Getting profile Information
      • Analyzing the profile information obtained
      • Analysis Done