Spring 2008 CSE 591 Compilers for Embedded Systems - PowerPoint PPT Presentation

1
Spring 2008 CSE 591 Compilers for Embedded Systems
  • Aviral Shrivastava
  • Department of Computer Science and Engineering
  • Arizona State University

2
Lecture 5 Scratch Pad Memories
  • Motivation

3
Processor-Memory Performance Gap
Moore's Law
  • Huge Processor-Memory Performance Gap
  • Cold start can take billions of cycles

4
More serious dimensions of the memory problem
Sub-banking
  • Applications are getting larger and larger

5
Memory Performance Impact on Performance
  • Suppose a processor executes with
  • ideal CPI = 1.1
  • 50% arith/logic, 30% ld/st, 20% control
  • and that 10% of data memory operations miss with
    a 50-cycle miss penalty
  • CPI = ideal CPI + average stalls per
    instruction
  • = 1.1 (cycles) + ( 0.30 (data mem ops/instr) x
    0.10 (miss/data mem op) x 50 (cycles/miss) )
    = 1.1 cycles + 1.5 cycles = 2.6
  • so 58% of the time the processor is stalled
    waiting for memory!
  • A 1% instruction miss rate would add an
    additional 0.5 to the CPI!
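The slide's arithmetic can be checked with a short sketch. The instruction mix, miss rate, and penalty are the slide's numbers; the helper function name is ours.

```python
# CPI = ideal CPI + average memory stall cycles per instruction.
def cpi_with_stalls(ideal_cpi, mem_op_frac, miss_rate, miss_penalty):
    return ideal_cpi + mem_op_frac * miss_rate * miss_penalty

cpi = cpi_with_stalls(ideal_cpi=1.1, mem_op_frac=0.30,
                      miss_rate=0.10, miss_penalty=50)
print(round(cpi, 2))                # 2.6
print(round((cpi - 1.1) / cpi, 2))  # 0.58 -> stalled 58% of the time

# A 1% miss rate over all instruction fetches (1 fetch/instr)
# adds 1.0 x 0.01 x 50 = 0.5 to the CPI:
print(cpi_with_stalls(0.0, 1.0, 0.01, 50))  # 0.5
```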

6
The Memory Hierarchy Goal: Create an Illusion
  • Fact: large memories are slow, and fast memories
    are small
  • How do we create a memory that gives the illusion
    of being large, cheap, and fast (most of the
    time)?
  • With hierarchy
  • With parallelism

7
A Typical Memory Hierarchy
  • By taking advantage of the principle of locality
  • Can present the user with as much memory as is
    available in the cheapest technology
  • at the speed offered by the fastest technology

[Figure: a typical memory hierarchy. On-chip
components (Control, Datapath, RegFile, ITLB/DTLB,
Instr Cache, Data Cache, eDRAM), then Second Level
Cache (SRAM), Main Memory (DRAM), and Secondary
Memory (Disk).
Speed (cycles): ½s-1s, 10s-100s, 1,000s.
Size (bytes): 100s-Ks, 10Ks-Ms, Gs to Ts.
Cost: highest (on-chip) to lowest (disk).]
8
Memory system frequently consumes >50% of the
energy used for processing
Multi-processor with cache
  • Uni-processor without caches

M. Verma, P. Marwedel Advanced Memory
Optimization Techniques for Low-Power Embedded
Processors, Springer, May 2007
Osman S. Unsal, Israel Koren, C. Mani Krishna,
Csaba Andras Moritz, U. of Massachusetts,
Amherst, 2001
[Segars 01] according to [Vahid @ ISSS'01]
9
Cache
  • Decoder logic

10
Energy Efficiency
[Figure: energy efficiency in Operations/Watt
(GOPS/W), from 0.01 to 10, versus technology node
(1.0µ, 0.5µ, 0.25µ, 0.13µ, 0.07µ), for ASICs,
reconfigurable computing, DSP-ASIPs, and processors
(µPs); "Ambient Intelligence" marks the high end]
Necessary to optimize; otherwise the price for
flexibility cannot be paid!
H. de Man, Keynote, DATE'02; T. Claasen, ISSCC'99
11
Timing Predictability
Worst case execution time (WCET) larger than
without cache
G.721 using unified cache @ ARM7TDMI
12
Objectives for Memory System Design
  • (Average) Performance
  • Throughput
  • Latency
  • Energy consumption
  • Predictability, good worst case execution time
    bound (WCET)
  • Size
  • Cost
  • ...

13
Scratch pad memories (SPM): fast,
energy-efficient, timing-predictable
  • SPMs are small, physically separate memories
    mapped into the address space (e.g. addresses 0
    through FFF..); selection is by an appropriate
    address decoder (simple!)
  • Small: no tag memory
  • Example: ARM7TDMI cores, well-known for low
    power consumption
14
Comparison of currents
E.g. ATMEL board with ARM7TDMI and ext. SRAM
[Figure: measured currents; ratio ≈ 1/3]
15
Scratchpad vs. main memory
Example: Atmel ARM evaluation board
> 86% energy savings
energy reduction: 1/7.06; 100% predictable
16
Why not just use a cache?
Energy consumption in tags, comparators and muxes
is significant.
R. Banakar, S. Steinke, B.-S. Lee, 2001
17
Influence of the associativity
18
Systems with SPM
  • Most ARM architectures have an on-chip SPM,
    termed Tightly-Coupled Memory (TCM)
  • GPUs such as Nvidia's 8800 have a 16KB SPM
  • It's typical for a DSP to have scratch pad RAM
  • Embedded processors like Motorola M-CORE, TI
    TMS370C
  • Commercial network processors Intel IXP
  • And many more

19
And for the Cell processor
  • Local SPE processors fetch instructions and data
    from local storage LS (256 kB).
  • LS not designed as a cache. Separate DMA
    transfers required to fill and spill.

Main Memory
  • Same motivation
  • Large memory latency
  • Huge overhead for automatically managed caches

20
Advantages of Scratch Pads
  • Area advantage: for the same area, we can fit
    around 34% more memory in an SPM than in a cache
  • An SPM consists of just a memory array +
    address decoding circuitry
  • Less energy consumption per access
  • Absence of tag memory and comparators
  • Performance comparable with cache
  • Predictable WCET, as required for real-time
    embedded systems (RTES)

21
Challenges in using SPMs
  • With SPMs, the application developer or the
    compiler has to explicitly move data between
    memories
  • Data mapping is transparent in cache-based
    architectures
  • Binary compatible?
  • Do advantages translate to a different machine?

22
Data Allocation on SPM
  • Techniques focus on mapping
  • Global data
  • Stack data
  • Heap data
  • Broadly, we can classify techniques as
  • Static: mapping of data decided at compile time,
    remaining constant throughout the execution
  • Compile-time dynamic: mapping of data decided at
    compile time, but the data in the SPM changes
    during execution
  • Goals are
  • To minimize off-chip memory access
  • To reduce energy consumption
  • To achieve better performance

23
Global Data
  • Panda et al., Efficient Utilization of
    Scratch-Pad Memory in Embedded Processor
    Applications
  • Map all scalars to SPM
  • Very small in size
  • Estimate conflicts between arrays
  • IAC(u), Interference Access Count: number of
    accesses to other arrays during the lifetime of u
  • VAC(u), Variable Access Count: number of accesses
    to elements of u
  • IF(u) = IAC(u) + VAC(u)
  • Loop Conflict Graph
  • Nodes are arrays
  • Edge weight of (u, v) is the number
    of accesses to u and v in the loop
  • More conflict → SPM
  • Either the whole array goes to SPM or not
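A minimal sketch of this selection idea, assuming IF(u) = IAC(u) + VAC(u) and all-or-nothing placement of whole arrays; the array sizes and access counts below are made-up illustrations, not numbers from Panda et al.

```python
# Interference factor: accesses to u plus accesses to other
# arrays during u's lifetime (higher -> better SPM candidate).
def interference_factor(iac, vac):
    return iac + vac

# array name -> (size in bytes, VAC, IAC); hypothetical values
arrays = {
    "A": (512, 400, 900),   # heavily interfered with
    "B": (1024, 300, 200),
    "C": (256, 100, 50),
}

def select_for_spm(arrays, spm_size):
    """Greedily place whole arrays with the highest IF into the
    SPM (either the whole array goes in or it does not)."""
    ranked = sorted(arrays.items(),
                    key=lambda kv: interference_factor(kv[1][2], kv[1][1]),
                    reverse=True)
    chosen, free = [], spm_size
    for name, (size, vac, iac) in ranked:
        if size <= free:
            chosen.append(name)
            free -= size
    return chosen

print(select_for_spm(arrays, spm_size=1024))  # ['A', 'C']
```

With a 1024-byte SPM, A (IF = 1300) is placed first, B no longer fits, and C fills the remaining space.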

24
ILP Formulation
  • For Functions
  • For Basic Blocks
  • For global variables
  • ILP Variables

25
ILP Formulation
  • Energy Savings
  • Size Constraint
  • No need to jump to memory and back for
    consecutive BBs
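The allocation problem these slides formulate can be illustrated with a toy sketch: choose 0/1 decision variables for functions, basic blocks, and global variables so that energy savings is maximized under the SPM size constraint. A real compiler would hand this to an ILP solver; this brute-force version, with made-up sizes and savings, only shows the objective and constraint.

```python
from itertools import product

# object -> (size in bytes, energy saved if placed in SPM);
# the names and numbers are hypothetical
objects = {
    "func_fir":   (600, 90),
    "bb_loop1":   (200, 70),
    "glob_coeff": (400, 50),
    "glob_buf":   (800, 60),
}

def best_allocation(objects, spm_size):
    """Enumerate every 0/1 assignment of the decision variables,
    keeping the feasible one with the highest energy savings."""
    names = list(objects)
    best, best_saving = (), -1
    for bits in product((0, 1), repeat=len(names)):
        size = sum(objects[n][0] for n, b in zip(names, bits) if b)
        saving = sum(objects[n][1] for n, b in zip(names, bits) if b)
        if size <= spm_size and saving > best_saving:  # size constraint
            best, best_saving = bits, saving
    return {n for n, b in zip(names, best) if b}, best_saving

alloc, saving = best_allocation(objects, spm_size=1024)
print(alloc, saving)  # {'func_fir', 'bb_loop1'} 160
```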