Title: CS252 Graduate Computer Architecture, Lecture 14: 3+1 Cs of Caching and many ways of Cache Optimizations
Slide 1: CS252 Graduate Computer Architecture
Lecture 14: 3+1 Cs of Caching and many ways of Cache Optimizations
- John Kubiatowicz
- Electrical Engineering and Computer Sciences
- University of California, Berkeley
- http://www.eecs.berkeley.edu/~kubitron/cs252
Slide 2: Review: Cache performance
- Miss-oriented Approach to Memory Access:
  CPUtime = IC × (CPI_Execution + MemAccess/Inst × MissRate × MissPenalty) × CycleTime
- Separating out Memory component entirely:
  CPUtime = IC × (AluOps/Inst × CPI_AluOps + MemAccess/Inst × AMAT) × CycleTime
- AMAT = Average Memory Access Time = HitTime + MissRate × MissPenalty
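A quick illustration of the AMAT formula in C; the parameter values in main are invented for illustration, not taken from the slides:

    /* AMAT = HitTime + MissRate * MissPenalty (all times in cycles). */
    #include <stdio.h>

    static double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void)
    {
        /* e.g., 1-cycle hit, 5% miss rate, 20-cycle penalty -> 2.0 cycles */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 20.0));
        return 0;
    }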
Slide 3: 12 Advanced Cache Optimizations (cont'd)
- Reducing hit time:
  - Small and simple caches
  - Way prediction
  - Trace caches
- Increasing cache bandwidth:
  - Pipelined caches
  - Multibanked caches
  - Nonblocking caches
- Reducing Miss Penalty:
  - Critical word first
  - Merging write buffers
- Reducing Miss Rate:
  - Victim Cache
  - Hardware prefetching
  - Compiler prefetching
  - Compiler Optimizations
Slide 4: 3. Fast (Instruction Cache) Hit times via Trace Cache
- Key Idea: Pack multiple non-contiguous basic blocks into one contiguous trace cache line
[Figure: basic blocks separated by branches (BR) packed into a single trace cache line]
- Single fetch brings in multiple basic blocks
- Trace cache indexed by start address and next n branch predictions
Slide 5: 3. Fast Hit times via Trace Cache (Pentium 4 only; and last time?)
- Find more instruction level parallelism? How to avoid translation from x86 to micro-ops? ⇒ Trace cache in Pentium 4
  - Dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
    - Built-in branch predictor
  - Cache the micro-ops vs. x86 instructions
    - Decode/translate from x86 to micro-ops on trace cache miss
  - Pro: better utilize long blocks (don't exit in middle of block, don't enter at label in middle of block)
  - Con: complicated address mapping since addresses no longer aligned to power-of-2 multiples of word size
  - Con: instructions may appear multiple times in multiple dynamic traces due to different branch outcomes
Slide 6: 4. Increasing Cache Bandwidth by Pipelining
- Pipeline cache access to maintain bandwidth, but higher latency
- Instruction cache access pipeline stages:
  - 1: Pentium
  - 2: Pentium Pro through Pentium III
  - 4: Pentium 4
- ⇒ greater penalty on mispredicted branches
- ⇒ more clock cycles between the issue of the load and the use of the data
Slide 7: 5. Increasing Cache Bandwidth: Non-Blocking Caches
- Non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
  - requires Full/Empty bits on registers or out-of-order execution
  - requires multi-bank memories
- "hit under miss" reduces the effective miss penalty by working during miss vs. ignoring CPU requests
- "hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
  - Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  - Requires multiple memory banks (otherwise cannot support it)
  - Pentium Pro allows 4 outstanding memory misses
Slide 8: Value of Hit Under Miss for SPEC (old data)
[Chart: memory stall time relative to base for "hit under n misses", n stepping 0→1, 1→2, 2→64, across floating point and integer SPEC92 programs]
- FP programs on average: Miss Penalty 0.68 → 0.52 → 0.34 → 0.26
- Int programs on average: Miss Penalty 0.24 → 0.20 → 0.19 → 0.19
- 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss, SPEC 92
Slide 9: 6. Increasing Cache Bandwidth via Multiple Banks
- Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  - E.g., T1 ("Niagara") L2 has 4 banks
- Banking works best when accesses naturally spread themselves across banks ⇒ mapping of addresses to banks affects behavior of memory system
- Simple mapping that works well is "sequential interleaving" (see the sketch below):
  - Spread block addresses sequentially across banks
  - E.g., if there are 4 banks, Bank 0 has all blocks whose address modulo 4 is 0; Bank 1 has all blocks whose address modulo 4 is 1; ...
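A minimal sketch of sequential interleaving in C; the block size is an illustrative assumption, with the T1's 4 banks as the example count:

    /* Map a byte address to its cache bank under sequential interleaving. */
    #include <stdint.h>

    #define BLOCK_BYTES 64u   /* assumed cache block size */
    #define NUM_BANKS    4u   /* e.g., 4 banks as in the T1 L2 */

    static unsigned bank_of(uint64_t addr)
    {
        uint64_t block = addr / BLOCK_BYTES;   /* block address */
        return (unsigned)(block % NUM_BANKS);  /* bank = block address mod #banks */
    }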
Slide 10: 7. Reduce Miss Penalty: Early Restart and Critical Word First
- Don't wait for the full block before restarting the CPU
- Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Spatial locality ⇒ tend to want the next sequential word, so not clear how large the benefit of just early restart is
- Critical Word First: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
- Long blocks more popular today ⇒ Critical Word First widely used
Slide 11: 8. Merging Write Buffer to Reduce Miss Penalty
- Write buffer allows processor to continue while waiting to write to memory
- If buffer contains modified blocks, the addresses can be checked to see if the address of new data matches the address of a valid write buffer entry
- If so, new data are combined with that entry (see the sketch below)
- Increases block size of write for write-through cache of writes to sequential words/bytes, since multiword writes are more efficient to memory
- The Sun T1 (Niagara) processor, among many others, uses write merging
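A hedged sketch of the merging check, assuming one write-buffer entry covers an aligned 4-word region with one valid bit per word (the entry layout is invented for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_ENTRY 4u

    struct wb_entry {
        uint64_t block_addr;            /* address of the aligned 4-word region */
        uint32_t word[WORDS_PER_ENTRY]; /* buffered data */
        uint8_t  valid;                 /* one valid bit per word; 0 = empty */
    };

    /* Try to merge a new 32-bit write into an existing entry. */
    static bool wb_merge(struct wb_entry *e, uint64_t addr, uint32_t data)
    {
        uint64_t region = addr & ~(uint64_t)(4 * WORDS_PER_ENTRY - 1);
        unsigned idx    = (unsigned)((addr >> 2) & (WORDS_PER_ENTRY - 1));

        if (e->valid != 0 && e->block_addr == region) {
            e->word[idx] = data;               /* combine new data with entry */
            e->valid |= (uint8_t)(1u << idx);
            return true;                       /* merged; no new entry consumed */
        }
        return false;                          /* caller allocates a new entry */
    }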
Slide 12: 9. Reducing Misses: a "Victim Cache"
- How to combine the fast hit time of direct mapped yet still avoid conflict misses?
- Add a buffer to place data discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
- Used in Alpha, HP machines
[Figure: direct-mapped cache (TAGS + DATA) backed by a fully associative victim cache of four cache lines, each with its own tag and comparator, connected to the next lower level in the hierarchy]
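A sketch of the victim-cache probe on a main-cache miss, assuming a 4-entry fully associative array of tags; the structure is invented for illustration, and real hardware compares all tags in parallel:

    #include <stdbool.h>
    #include <stdint.h>

    #define VC_ENTRIES 4u   /* Jouppi's 4-entry victim cache */

    struct victim_cache {
        uint64_t tag[VC_ENTRIES];
        bool     valid[VC_ENTRIES];
    };

    /* Returns true if the missing block is in the victim cache; on a hit
       the block would be swapped back into the main cache. */
    static bool vc_probe(const struct victim_cache *vc, uint64_t block_tag)
    {
        for (unsigned i = 0; i < VC_ENTRIES; i++)
            if (vc->valid[i] && vc->tag[i] == block_tag)
                return true;
        return false;   /* true miss: go to next lower level in hierarchy */
    }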
Slide 13: 10. Reducing Misses by Hardware Prefetching of Instructions & Data
- Prefetching relies on having extra memory bandwidth that can be used without penalty
- Instruction Prefetching:
  - Typically, CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  - Requested block is placed in the instruction cache when it returns, and the prefetched block is placed into an instruction stream buffer
- Data Prefetching:
  - Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
  - Prefetching invoked if there are 2 successive L2 cache misses to a page, and the distance between those cache blocks is < 256 bytes
Slide 14: Issues in Prefetching
- Usefulness: should produce hits
- Timeliness: not late and not too early
- Cache and bandwidth pollution
[Figure: CPU (with register file) connected to L1 Instruction and L1 Data caches and a unified L2 cache that receives the prefetched data]
Slide 15: Hardware Instruction Prefetching
- Instruction prefetch in Alpha AXP 21064:
  - Fetch two blocks on a miss: the requested block (i) and the next consecutive block (i+1)
  - Requested block placed in cache, next block in the instruction stream buffer
  - If miss in cache but hit in stream buffer, move the stream buffer block into the cache and prefetch the next block (i+2) (see the sketch below)
[Figure: on a miss, the unified L2 cache returns the requested block to the L1 instruction cache and the prefetched block to the stream buffer; the CPU (RF) fetches from the L1 instruction cache]
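A sketch of this policy in C; the cache hooks are hypothetical stubs for illustration, not a real API:

    #include <stdbool.h>
    #include <stdint.h>

    enum src { FROM_MEMORY, FROM_STREAM_BUFFER };

    /* Hypothetical hooks into the fetch path, stubbed for illustration. */
    static bool cache_has(uint64_t blk)              { (void)blk; return false; }
    static void cache_fill(uint64_t blk, enum src s) { (void)blk; (void)s; }
    static void stream_fetch(uint64_t blk)           { (void)blk; }

    static uint64_t stream_blk;   /* block held (or in flight) in stream buffer */
    static bool     stream_valid;

    void on_ifetch_block(uint64_t blk)
    {
        if (cache_has(blk))
            return;                              /* ordinary cache hit */
        if (stream_valid && stream_blk == blk)
            cache_fill(blk, FROM_STREAM_BUFFER); /* move buffer block into cache */
        else
            cache_fill(blk, FROM_MEMORY);        /* true miss: fetch block i */
        stream_blk   = blk + 1;                  /* prefetch next consecutive
                                                    block: (i+1), or (i+2)
                                                    after a stream-buffer hit */
        stream_valid = true;
        stream_fetch(stream_blk);
    }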
Slide 16: Hardware Data Prefetching
- Prefetch-on-miss:
  - Prefetch b+1 upon miss on b
- One Block Lookahead (OBL) scheme:
  - Initiate prefetch for block b+1 when block b is accessed
  - Why is this different from doubling block size?
  - Can extend to N-block lookahead
- Strided prefetch:
  - If we observe a sequence of accesses to blocks b, b+N, b+2N, then prefetch b+3N, etc. (see the sketch below)
- Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
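A minimal single-stream stride detector in C; real prefetchers track many streams in a table (e.g., indexed by PC), so this is only a sketch of the idea:

    #include <stdint.h>

    static uint64_t last_blk;        /* last block address seen */
    static uint64_t stride;          /* candidate stride N (in blocks) */
    static int      confirmations;   /* how many times the stride repeated */

    /* Call on each data access; `prefetch` issues a prefetch for a block.
       Handles forward strides only, for simplicity. */
    void on_data_access(uint64_t blk, void (*prefetch)(uint64_t))
    {
        uint64_t d = blk - last_blk;
        if (d != 0 && d == stride) {
            if (++confirmations >= 2)    /* saw b, b+N, b+2N */
                prefetch(blk + stride);  /* so prefetch b+3N */
        } else {
            stride = d;                  /* start tracking a new stride */
            confirmations = 0;
        }
        last_blk = blk;
    }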
Slide 17: Administrivia
- Exam: This Wednesday; Location: 310 Soda; Time: 6:00-9:00pm
- Material: Everything up to next Monday, including papers (especially ones discussed in detail in class)
- Closed Book, but 1 page of hand-written notes (both sides)
- Meet at LaVal's afterwards for Pizza and Beverages
- We have been reading Chapter 5
  - You should take a look, since it might show up in the test
Slide 18: 11. Reducing Misses by Software Prefetching Data
- Data Prefetch:
  - Load data into register (HP PA-RISC loads)
  - Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9); see the sketch below
  - Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing Prefetch Instructions takes time
  - Is the cost of prefetch issues < savings in reduced misses?
  - Wider superscalar issue reduces the difficulty of issue bandwidth
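A small example of software data prefetching in C, using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance is a tuning guess, not a value from the slides:

    #define PF_DIST 16   /* assumed prefetch distance, in elements */

    void scale(double *a, const double *b, int n)
    {
        for (int i = 0; i < n; i++) {
            if (i + PF_DIST < n)   /* stay within the array */
                __builtin_prefetch(&b[i + PF_DIST], /*rw=*/0, /*locality=*/1);
            a[i] = 2.0 * b[i];
        }
    }

The hardware prefetch instruction itself is non-faulting, as the slide notes; the bounds check just avoids issuing useless prefetches past the end of the array.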
Slide 19: 12. Reducing Misses by Compiler Optimizations
- McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
- Instructions:
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data:
  - Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
  - Loop Interchange: change nesting of loops to access data in the order stored in memory
  - Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
Slide 20: Merging Arrays Example

    /* Before: 2 sequential arrays */
    int val[SIZE];
    int key[SIZE];

    /* After: 1 array of structures */
    struct merge {
        int val;
        int key;
    };
    struct merge merged_array[SIZE];

- Reducing conflicts between val & key; improve spatial locality
Slide 21: Loop Interchange Example

    /* Before */
    for (k = 0; k < 100; k = k+1)
        for (j = 0; j < 100; j = j+1)
            for (i = 0; i < 5000; i = i+1)
                x[i][j] = 2 * x[i][j];

    /* After */
    for (k = 0; k < 100; k = k+1)
        for (i = 0; i < 5000; i = i+1)
            for (j = 0; j < 100; j = j+1)
                x[i][j] = 2 * x[i][j];

- Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Slide 22: Loop Fusion Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            a[i][j] = 1/b[i][j] * c[i][j];
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1)
            d[i][j] = a[i][j] + c[i][j];

    /* After */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            a[i][j] = 1/b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }

- 2 misses per access to a & c vs. one miss per access; improves temporal locality
Slide 23: Blocking Example

    /* Before */
    for (i = 0; i < N; i = i+1)
        for (j = 0; j < N; j = j+1) {
            r = 0;
            for (k = 0; k < N; k = k+1)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

- Two Inner Loops:
  - Read all NxN elements of z
  - Read N elements of 1 row of y repeatedly
  - Write N elements of 1 row of x
- Capacity Misses: a function of N & Cache Size
  - 2N^3 + N^2 words accessed (assuming no conflicts; otherwise worse)
- Idea: compute on a BxB submatrix that fits in the cache
Slide 24: Blocking Example

    /* After */
    for (jj = 0; jj < N; jj = jj+B)
        for (kk = 0; kk < N; kk = kk+B)
            for (i = 0; i < N; i = i+1)
                for (j = jj; j < min(jj+B-1,N); j = j+1) {
                    r = 0;
                    for (k = kk; k < min(kk+B-1,N); k = k+1)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

- B is called the Blocking Factor
- Capacity Misses: from 2N^3 + N^2 down to 2N^3/B + N^2
- Conflict Misses Too?
Slide 25: Reducing Conflict Misses by Blocking
- Conflict misses in caches that are not fully associative vs. blocking size
- Lam et al [1991]: a blocking factor of 24 had a fifth the misses vs. 48, despite both fitting in the cache
Slide 26: Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
Slide 27: Impact of Hierarchy on Algorithms
- Today CPU time is a function of (ops, cache misses)
- What does this mean to Compilers, Data structures, Algorithms?
- Quicksort: fastest comparison-based sorting algorithm when keys fit in memory
- Radix sort: also called "linear time" sort; for keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys
- "The Influence of Caches on the Performance of Sorting" by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997, 370-379
- For Alphastation 250: 32 byte blocks, direct mapped 2MB L2 cache, 8 byte keys, from 4000 to 4000000 keys
Slide 28: Quicksort vs. Radix: Instructions
[Chart: instruction count vs. job size in keys]
Slide 29: Quicksort vs. Radix: Instructions & Time
[Chart: time and instruction count vs. job size in keys]
Slide 30: Quicksort vs. Radix: Cache misses
[Chart: cache misses vs. job size in keys]
Slide 31: Experimental Study (Membench)
- Microbenchmark for memory system performance
- One experiment: for a given array length and stride s, time a loop that loads every s-th element of the array (see the sketch below)
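A sketch of the membench inner loop in C; the timing harness and the sweep over array lengths and strides are omitted, and `sink` simply keeps the loads from being optimized away:

    #include <stddef.h>

    volatile char sink;   /* defeat dead-code elimination */

    /* One experiment: read every stride-th byte of an array of given length. */
    void strided_reads(const char *a, size_t length, size_t stride)
    {
        for (size_t i = 0; i < length; i += stride)
            sink = a[i];
    }

Timing this for each (length, stride) pair, with lengths and strides swept by doubling, yields one line per array length in the plots that follow.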
Slide 32: Membench: What to Expect
[Plot: average cost per access vs. stride s, one line per array length]
- Consider the average cost per load
- Plot one line for each array length, time vs. stride
- Small stride is best: if a cache line holds 4 words, at most ¼ of accesses miss
- If the array is smaller than a given cache, all those accesses will hit (after the first run, which is negligible for large enough runs)
- Picture assumes only one level of cache
- Values have gotten more difficult to measure on modern procs
Slide 33: Memory Hierarchy on a Sun Ultra-2i
[Plot: Sun Ultra-2i, 333 MHz; average access time vs. stride, one line per array length]
- See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details
Slide 34: Memory Hierarchy on a Power3
[Plot: Power3, 375 MHz; average access time vs. stride, one line per array size]
Slide 35: Compiler Optimization vs. Memory Hierarchy Search
- Compiler tries to figure out memory hierarchy optimizations
- New approach: "Auto-tuners" first run variations of the program on a computer to find the best combinations of optimizations (blocking, padding, ...) and algorithms, then produce C code to be compiled for that computer
- "Auto-tuner" targeted to numerical method
  - E.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity (Sparse linear algebra), Spiral (DSP), FFT-W
Slide 36: Sparse Matrix: Search for Blocking
[Plot: search over block sizes for a sparse matrix from a finite element problem; Im, Yelick, Vuduc, 2005]
Slide 37: Best Sparse Blocking for 8 Computers
[Figure: grid of row block size (r) vs. column block size (c), each ranging over 1, 2, 4, 8, marking the best block shape for each machine]
- All possible column block sizes selected for the 8 computers; how could a compiler know?
Slide 39: Main Memory Background
- Performance of Main Memory:
  - Latency: Cache Miss Penalty
    - Access Time: time between request and word arrival
    - Cycle Time: time between requests
  - Bandwidth: I/O & Large Block Miss Penalty (L2)
- Main Memory is DRAM: Dynamic Random Access Memory
  - Dynamic since it needs to be refreshed periodically (8 ms, 1% of time)
  - Addresses divided into 2 halves (Memory as a 2D matrix):
    - RAS or Row Address Strobe
    - CAS or Column Address Strobe
- Cache uses SRAM: Static Random Access Memory
  - No refresh (6 transistors/bit vs. 1 transistor); Size: DRAM/SRAM ~4-8; Cost & Cycle time: SRAM/DRAM ~8-16
Slide 40: Core Memories (1950s & 60s)
[Photos: DEC PDP-8/E board, 4K words x 12 bits (1968); the first magnetic core memory, from the IBM 405 Alphabetical Accounting Machine]
- Core Memory stored data as magnetization in iron rings
  - Iron "cores" woven into a 2-dimensional mesh of wires by hand (25 billion a year at peak production)
  - Invented by Forrester in the late 40s/early 50s at MIT for Whirlwind
  - Origin of the term "Dump Core"
  - Rumor that IBM consulted the Life Saver company
- Robust, non-volatile storage
  - Used on space shuttle computers until recently
  - Core access time ~1µs
- See http://www.columbia.edu/acis/history/core.html
Slide 41: Semiconductor Memory, DRAM
- Semiconductor memory began to be competitive in the early 1970s
  - Intel formed to exploit the market for semiconductor memory
- First commercial DRAM was the Intel 1103
  - 1Kbit of storage on a single chip
  - Charge on a capacitor used to hold value
- Semiconductor memory quickly replaced core in the 1970s
- Today (March 2009), 4GB DRAM < $40
  - People can easily afford to fill a 32-bit address space with DRAM (4GB)
  - New Vista systems often shipping with 6GB
Slide 42: DRAM Architecture
- Bits stored in 2-dimensional arrays on chip
- Modern chips have around 4 logical banks on each chip
  - Each logical bank physically implemented as many smaller arrays
Slide 43: Review: 1-T Memory Cell (DRAM)
[Figure: one-transistor cell with a row select line and a bit line]
- Write:
  1. Drive bit line
  2. Select row
- Read:
  1. Precharge bit line to Vdd/2
  2. Select row
  3. Cell and bit line share charges
     - Very small voltage changes on the bit line
  4. Sense (fancy sense amp)
     - Can detect changes of ~1 million electrons
  5. Write: restore the value
- Refresh:
  1. Just do a dummy read to every cell
Slide 44: DRAM Capacitors: more capacitance in a small area
- Trench capacitors:
  - Logic ABOVE capacitor
  - Gain in surface area of capacitor
  - Better scaling properties
  - Better planarization
- Stacked capacitors:
  - Logic BELOW capacitor
  - Gain in surface area of capacitor
  - 2-dim cross-section quite small
Slide 45: DRAM Operation: Three Steps
- Precharge:
  - Charges bit lines to a known value; required before the next row access
- Row access (RAS):
  - Decode row address, enable addressed row (often multiple Kb in a row)
  - Bitlines share charge with the storage cell
  - Small change in voltage detected by sense amplifiers, which latch the whole row of bits
  - Sense amplifiers drive bitlines full rail to recharge the storage cells
- Column access (CAS):
  - Decode column address to select a small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
  - On read, send latched bits out to chip pins
  - On write, change sense amplifier latches, which then charge storage cells to the required value
  - Can perform multiple column accesses on the same row without another row access (burst mode)
Slide 46: DRAM Read Timing (Example)
- Every DRAM access begins at the assertion of RAS_L
- 2 ways to read: early or late v. CAS
[Timing diagram: one DRAM read cycle; RAS_L then CAS_L latch the row and column addresses off the multiplexed address bus (with junk between the valid address phases); WE_L stays deasserted; the data bus leaves high-Z and drives Data Out after the read access time plus the output enable delay]
- Early Read Cycle: OE_L asserted before CAS_L
- Late Read Cycle: OE_L asserted after CAS_L
Slide 47: Main Memory Performance
[Timing diagram: access time vs. the longer cycle time]
- DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time
  - ~2:1; why?
- DRAM (Read/Write) Cycle Time:
  - How frequently can you initiate an access?
  - Analogy: A little kid can only ask his father for money on Saturday
- DRAM (Read/Write) Access Time:
  - How quickly will you get what you want once you initiate an access?
  - Analogy: As soon as he asks, his father will give him the money
- DRAM Bandwidth Limitation analogy:
  - What happens if he runs out of money on Wednesday?
Slide 48: Increasing Bandwidth - Interleaving
[Figure: without interleaving, the CPU starts the access for D1, waits until D1 is available, and only then starts the access for D2; with 4-way interleaving across Memory Banks 0-3, accesses to Banks 0, 1, 2, 3 start back-to-back, and Bank 0 can be accessed again as soon as its cycle completes]
Slide 49: Main Memory Performance
- Wide: CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
- Interleaved: CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved
- Simple: CPU, Cache, Bus, Memory all the same width (32 bits)
Slide 50: Main Memory Performance
- Timing model (in clock cycles):
  - 1 to send address,
  - 4 for access time, 10 for cycle time, 1 to send data
- Cache Block is 4 words
- Simple M.P.      = 4 × (1 + 10 + 1) = 48
- Wide M.P.        = 1 + 10 + 1 = 12
- Interleaved M.P. = 1 + 10 + 4×1 = 15
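The same arithmetic as a check, in C (values straight from the timing model above):

    #include <stdio.h>

    int main(void)
    {
        int addr = 1, cycle = 10, xfer = 1, words = 4;
        printf("simple:      %d\n", words * (addr + cycle + xfer)); /* 48 */
        printf("wide:        %d\n", addr + cycle + xfer);           /* 12 */
        printf("interleaved: %d\n", addr + cycle + words * xfer);   /* 15 */
        return 0;
    }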
Slide 51: Avoiding Bank Conflicts
- Lots of banks:

    int x[256][512];
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];

- Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
- SW: loop interchange, or declaring the array not a power of 2 ("array padding"; see the sketch below)
- HW: prime number of banks
  - bank number = address mod number of banks
  - address within bank = address / number of words in bank
  - modulo & divide per memory access with a prime number of banks?
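A sketch of the software fix by array padding; the extra column makes the row length 513 words, which is coprime to any power-of-2 number of banks:

    /* Padded declaration: successive elements of a column no longer map
       to the same bank when the bank count is a power of 2. */
    int x[256][512 + 1];

    void update_columns(void)
    {
        for (int j = 0; j < 512; j = j + 1)
            for (int i = 0; i < 256; i = i + 1)
                x[i][j] = 2 * x[i][j];
    }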
Slide 52: Finding Bank Number and Address within a bank
- Problem: Determine the number of banks, Nb, and the number of words in each bank, Wb, such that:
  - given address x, it is easy to find the bank where x will be found, B(x), and the address of x within the bank, A(x)
  - for any address x, B(x) and A(x) are unique
  - the number of bank conflicts is minimized
- Solution: Use the following relation to determine B(x) and A(x): B(x) = x MOD Nb, A(x) = x MOD Wb, where Nb and Wb are co-prime (no common factors)
  - The Chinese Remainder Theorem shows that B(x) and A(x) are unique
- The condition is satisfied if Nb is prime and of the form 2^m - 1:
  - Then 2^k = 2^(k-m) × (2^m - 1) + 2^(k-m) ⇒ 2^k MOD Nb = 2^(k-m) MOD Nb = 2^j with j < m
  - And remember that (A+B) MOD C = ((A MOD C) + (B MOD C)) MOD C
- Simple circuit for x MOD Nb (see the sketch below):
  - for every power of 2, compute its single-bit MOD (in advance)
  - B(x) = sum of these values MOD Nb (low complexity circuit, adder with m bits)
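A sketch of that computation in C for m = 3 (Nb = 7 banks): because 2^3 ≡ 1 (mod 7), summing the 3-bit groups of x gives x MOD 7 without any divider:

    #include <stdint.h>

    /* x MOD 7, division-free: fold 3-bit digits (a base-8 digit sum), since
       8 ≡ 1 (mod 7). Hardware does the same thing with small adders. */
    static unsigned mod7(uint64_t x)
    {
        while (x > 7)
            x = (x & 7) + (x >> 3);
        return (x == 7) ? 0 : (unsigned)x;
    }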
Slide 53: Quest for DRAM Performance
- Fast Page mode:
  - Add timing signals that allow repeated accesses to the row buffer without another row access time
  - Such a buffer comes naturally, as each array will buffer 1024 to 2048 bits for each access
- Synchronous DRAM (SDRAM):
  - Add a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the DRAM controller
- Double Data Rate (DDR SDRAM):
  - Transfer data on both the rising edge and falling edge of the DRAM clock signal ⇒ doubling the peak data rate
  - DDR2 lowers power by dropping the voltage from 2.5 to 1.8 volts; offers higher clock rates: up to 400 MHz
  - DDR3 drops to 1.5 volts; higher clock rates: up to 800 MHz
- Improved Bandwidth, not Latency
Slide 54: Fast Memory Systems: DRAM specific
- Multiple CAS accesses: several names (page mode)
  - Extended Data Out (EDO): 30% faster in page mode
- Newer DRAMs to address the gap; what will they cost, will they survive?
  - RAMBUS: startup company; reinvented the DRAM interface
    - Each chip a module vs. a slice of memory
    - Short bus between CPU and chips
    - Does own refresh
    - Variable amount of data returned
    - 1 byte / 2 ns (500 MB/s per chip)
  - Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to the system clock (66 - 150 MHz)
  - DDR DRAM: two transfers per clock (on rising and falling edge)
  - Intel claims FB-DIMM is the next big thing
    - Stands for Fully-Buffered Dual-Inline Memory Module
    - Same basic technology as DDR, but utilizes a serial daisy-chain channel between different memory components
Slide 55: Fast Page Mode Operation
- Regular DRAM Organization:
  - N rows x N columns x M bits
  - Read & Write M bits at a time
  - Each M-bit access requires a RAS/CAS cycle
- Fast Page Mode DRAM:
  - N x M "SRAM" to save a row
- After a row is read into the register:
  - Only CAS is needed to access other M-bit blocks on that row
  - RAS_L remains asserted while CAS_L is toggled
[Figure: DRAM array of N rows; one row latched into the N x M SRAM row buffer; the column address then selects the M-bit output]
Slide 56: SDRAM timing (Single Data Rate)
- Micron 128Mb DRAM (using the 2Meg x 16bit x 4bank version)
- Row (12 bits), bank (2 bits), column (9 bits); see the sketch below
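A sketch of pulling those fields out of a word address in C; the ordering of row/bank/column within the address is an assumption for illustration, since real controllers choose different mappings:

    #include <stdint.h>

    struct dram_addr { unsigned row, bank, col; };

    static struct dram_addr split(uint32_t a)   /* 23-bit address: 12+2+9 */
    {
        struct dram_addr d;
        d.col  =  a         & 0x1FF;   /* low 9 bits: column */
        d.bank = (a >> 9)   & 0x3;     /* next 2 bits: bank  */
        d.row  = (a >> 11)  & 0xFFF;   /* top 12 bits: row   */
        return d;
    }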
Slide 57: Double-Data Rate (DDR2) DRAM
[Timing diagram: 200 MHz clock; row activate, column read, data burst, precharge, next row activate; data transferred on both clock edges for a 400 Mb/s data rate per pin]
- Micron, 256Mb DDR2 SDRAM datasheet
Slide 58: DRAM name based on Peak Chip Transfers/Sec; DIMM name based on Peak DIMM MBytes/Sec
Slide 59: DRAM Packaging
[Figure: DRAM chip with ~7 clock and control signals, ~12 multiplexed row/column address lines, and a 4b/8b/16b/32b data bus]
- DIMM (Dual Inline Memory Module) contains multiple chips arranged in "ranks"
- Each rank has clock/control/address signals connected in parallel (sometimes need buffers to drive signals to all chips), and data pins work together to return a wide word
  - e.g., a rank could implement a 64-bit data bus using 16 x4-bit chips, or a 64-bit data bus using 8 x8-bit chips
- A modern DIMM usually has one or two ranks (occasionally 4 if high capacity)
- A rank will contain the same number of banks as each constituent chip (e.g., 4-8)
Slide 60: DRAM Channel
[Figure: memory controller drives a command/address bus and a 64-bit data bus shared by multiple ranks]
Slide 61: FB-DIMM Memories
[Figure: regular DIMM vs. FB-DIMM]
- Uses commodity DRAMs with a special controller on the actual DIMM board
- Connection is in a serial form
Slide 62: FLASH Memory
[Photo: Samsung 2007: 16GB, NAND Flash]
- Like a normal transistor, but:
  - Has a floating gate that can hold charge
  - To write: raise or lower the wordline high enough to cause charges to tunnel
  - To read: turn on the wordline as if a normal transistor
    - Presence of charge changes the threshold and thus the measured current
- Two varieties:
  - NAND: denser, must be read and written in blocks
  - NOR: much less dense, fast to read and write
Slide 63: Phase Change memory (IBM, Samsung, Intel)
- Phase Change Memory (called PRAM or PCM)
  - Chalcogenide material can change from amorphous to crystalline state with the application of heat
  - The two states have very different resistive properties
  - Similar to the material used in the CD-RW process
- Exciting alternative to FLASH
  - Higher speed
  - May be easy to integrate with CMOS processes
Slide 64: Tunneling Magnetic Junction
- Tunneling Magnetic Junction RAM (TMJ-RAM)
  - Speed of SRAM, density of DRAM, non-volatile (no refresh)
  - "Spintronics": combination of quantum spin and electronics
  - Same technology used in high-density disk drives
Slide 65: Conclusion
- "Memory wall" inspires optimizations since much performance is lost:
  - Reducing hit time: Small and simple caches, Way prediction, Trace caches
  - Increasing cache bandwidth: Pipelined caches, Multibanked caches, Nonblocking caches
  - Reducing Miss Penalty: Critical word first, Merging write buffers
  - Reducing Miss Rate: Compiler optimizations
  - Reducing miss penalty or miss rate via parallelism: Hardware prefetching, Compiler prefetching
- Performance of programs can be a complicated function of architecture
  - To write fast programs, need to consider architecture
    - True on sequential or parallel processor
  - We would like simple models to help us design efficient algorithms
- Will "Auto-tuners" replace compilation to optimize performance?
- Main memory is Dense, Slow
  - Cycle time > Access time!
- Techniques to optimize memory:
  - Wider Memory
  - Interleaved Memory: for sequential or independent accesses
  - Avoiding bank conflicts: SW & HW
  - DRAM specific optimizations: page mode & Specialty DRAM