CS252 Graduate Computer Architecture, Lecture 14: 3+1 Cs of Caching and many ways Cache Optimizations (Transcript)

1
CS252 Graduate Computer Architecture
Lecture 14: 3+1 Cs of Caching and many ways Cache Optimizations
  • John Kubiatowicz
  • Electrical Engineering and Computer Sciences
  • University of California, Berkeley
  • http://www.eecs.berkeley.edu/~kubitron/cs252

2
Review: Cache performance
  • Miss-oriented Approach to Memory Access
  • Separating out Memory component entirely
  • AMAT = Average Memory Access Time
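The equations themselves did not survive the transcript; as a reference, these are the standard Hennessy & Patterson forms the three bullets refer to:

\begin{aligned}
\text{CPU time} &= IC \times \Big(CPI_{\text{Execution}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}\Big) \times \text{Cycle time}\\
\text{CPU time} &= IC \times \Big(\frac{\text{AluOps}}{\text{Instruction}} \times CPI_{\text{AluOps}} + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{AMAT}\Big) \times \text{Cycle time}\\
\text{AMAT} &= \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
\end{aligned}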

3
12 Advanced Cache Optimizations (Cont.)
  • Reducing hit time:
  • Small and simple caches
  • Way prediction
  • Trace caches
  • Increasing cache bandwidth:
  • Pipelined caches
  • Multibanked caches
  • Nonblocking caches
  • Reducing Miss Penalty:
  • Critical word first
  • Merging write buffers
  • Reducing Miss Rate:
  • Victim Cache
  • Hardware prefetching
  • Compiler prefetching
  • Compiler Optimizations

4
3. Fast (Instruction Cache) Hit times via Trace
Cache
  • Key Idea: Pack multiple non-contiguous basic
    blocks into one contiguous trace cache line

(Figure: several basic blocks, each ending in a branch (BR), packed into one trace cache line)
  • Single fetch brings in multiple basic blocks
  • Trace cache indexed by start address and next n
    branch predictions

5
3. Fast Hit times via Trace Cache (Pentium 4
only and last time?)
  • Find more instruction level parallelism? How to
    avoid translation from x86 to micro-ops?
  • Trace cache in Pentium 4:
  • Dynamic traces of the executed instructions vs.
    static sequences of instructions as determined by
    layout in memory
  • Built-in branch predictor
  • Cache the micro-ops vs. x86 instructions
  • Decode/translate from x86 to micro-ops on trace
    cache miss
  • + better utilize long blocks (don't exit in the
    middle of a block, don't enter at a label in the
    middle of a block)
  • - complicated address mapping since addresses are
    no longer aligned to power-of-2 multiples of word
    size
  • - instructions may appear multiple times in
    multiple dynamic traces due to different branch
    outcomes

6
4. Increasing Cache Bandwidth by Pipelining
  • Pipeline cache access to maintain bandwidth, but
    at higher latency
  • Instruction cache access pipeline stages:
  • 1: Pentium
  • 2: Pentium Pro through Pentium III
  • 4: Pentium 4
  • ⇒ greater penalty on mispredicted branches
  • ⇒ more clock cycles between the issue of the load
    and the use of the data

7
5. Increasing Cache Bandwidth Non-Blocking
Caches
  • Non-blocking cache or lockup-free cache allows the
    data cache to continue to supply cache hits
    during a miss
  • requires F/E bits on registers or out-of-order
    execution
  • requires multi-bank memories
  • "hit under miss" reduces the effective miss
    penalty by working during a miss vs. ignoring CPU
    requests
  • "hit under multiple miss" or "miss under miss"
    may further lower the effective miss penalty by
    overlapping multiple misses
  • Significantly increases the complexity of the
    cache controller as there can be multiple
    outstanding memory accesses
  • Requires multiple memory banks (otherwise cannot
    support)
  • Pentium Pro allows 4 outstanding memory misses
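As an illustration of that bookkeeping, here is a minimal software model of the miss status holding registers (MSHRs) such a controller maintains; the structure, sizes, and interface are assumptions for illustration, not a description of any particular design:

    // Illustrative model of MSHRs in a non-blocking cache: a hit proceeds
    // immediately; a miss either merges with an outstanding miss to the
    // same block ("miss under miss") or allocates a new MSHR.
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_MSHRS 4   // e.g., Pentium Pro allowed 4 outstanding misses

    typedef struct {
        bool     valid;
        uint64_t block_addr;   // block being fetched from memory
        int      waiters;      // loads/stores waiting on this block
    } mshr_t;

    static mshr_t mshrs[NUM_MSHRS];

    // Returns true if the access can be accepted (hit, merged, or new
    // MSHR allocated); false means the cache must stall the processor.
    bool cache_access(uint64_t block_addr, bool is_hit) {
        if (is_hit)
            return true;                    // hit under miss: serviced now
        for (int i = 0; i < NUM_MSHRS; i++) // merge with an outstanding miss?
            if (mshrs[i].valid && mshrs[i].block_addr == block_addr) {
                mshrs[i].waiters++;
                return true;
            }
        for (int i = 0; i < NUM_MSHRS; i++) // otherwise allocate a new MSHR
            if (!mshrs[i].valid) {
                mshrs[i] = (mshr_t){ true, block_addr, 1 };
                return true;
            }
        return false;                       // all MSHRs busy: stall
    }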

8
Value of Hit Under Miss for SPEC (old data)
(Figure: effective miss penalty for "hit under n misses", n = 0→1, 1→2, 2→64, relative to the blocking-cache base, for SPEC92 floating point and integer programs)
  • FP programs on average: Miss Penalty 0.68 → 0.52 → 0.34 → 0.26
  • Int programs on average: Miss Penalty 0.24 → 0.20 → 0.19 → 0.19
  • 8 KB Data Cache, Direct Mapped, 32B block, 16
    cycle miss, SPEC 92

9
6. Increasing Cache Bandwidth via Multiple Banks
  • Rather than treat the cache as a single
    monolithic block, divide it into independent banks
    that can support simultaneous accesses
  • E.g., T1 (Niagara) L2 has 4 banks
  • Banking works best when accesses naturally spread
    themselves across banks ⇒ mapping of addresses to
    banks affects behavior of memory system
  • Simple mapping that works well is sequential
    interleaving
  • Spread block addresses sequentially across banks
  • E.g., if there are 4 banks, Bank 0 has all blocks
    whose address modulo 4 is 0; Bank 1 has all
    blocks whose address modulo 4 is 1; etc.
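A minimal sketch of sequential interleaving, assuming 4 banks and 64-byte blocks (both illustrative values, not from the slide):

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BANKS   4
    #define BLOCK_BYTES 64

    static unsigned bank_of(uint64_t addr) {
        uint64_t block = addr / BLOCK_BYTES;     // block address
        return (unsigned)(block % NUM_BANKS);    // sequential interleaving
    }

    int main(void) {
        // Consecutive blocks map to banks 0, 1, 2, 3, 0, 1, ...
        for (uint64_t a = 0; a < 8 * BLOCK_BYTES; a += BLOCK_BYTES)
            printf("addr 0x%03llx -> bank %u\n",
                   (unsigned long long)a, bank_of(a));
        return 0;
    }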

10
7. Reduce Miss Penalty Early Restart and
Critical Word First
  • Don't wait for the full block before restarting the CPU
  • Early restart: As soon as the requested word of
    the block arrives, send it to the CPU and let the
    CPU continue execution
  • Spatial locality ⇒ tend to want the next sequential
    word, so the benefit of just early restart is not clear
  • Critical Word First: Request the missed word first
    from memory and send it to the CPU as soon as it
    arrives; let the CPU continue execution while
    filling the rest of the words in the block
  • Long blocks more popular today ⇒ Critical Word
    First widely used

11
8. Merging Write Buffer to Reduce Miss Penalty
  • Write buffer to allow processor to continue while
    waiting to write to memory
  • If buffer contains modified blocks, the addresses
    can be checked to see if address of new data
    matches the address of a valid write buffer entry
  • If so, new data are combined with that entry
  • Increases the effective block size of writes for a
    write-through cache of writes to sequential
    words/bytes, since multiword writes are more
    efficient to memory
  • The Sun T1 (Niagara) processor, among many
    others, uses write merging
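A minimal software model of the merging check described above (the entry count, block size, and 8-byte word granularity are assumed example values):

    // Each write-buffer entry holds one block's address plus per-word
    // valid bits; a write to a word in a block already buffered merges
    // into that entry instead of taking a new slot.
    #include <stdbool.h>
    #include <stdint.h>

    #define WB_ENTRIES      4
    #define WORDS_PER_BLOCK 8          // 8 x 8-byte words = 64-byte block

    typedef struct {
        bool     valid;
        uint64_t block_addr;
        uint8_t  word_valid;           // bitmap: which words are held
        uint64_t data[WORDS_PER_BLOCK];
    } wb_entry;

    static wb_entry wb[WB_ENTRIES];

    // Returns true if the write was buffered (merged or new entry),
    // false if the buffer is full and the processor must stall.
    bool write_buffer_put(uint64_t addr, uint64_t value) {
        uint64_t block = addr / (8 * WORDS_PER_BLOCK);
        unsigned word  = (addr / 8) % WORDS_PER_BLOCK;
        for (int i = 0; i < WB_ENTRIES; i++)      // try to merge
            if (wb[i].valid && wb[i].block_addr == block) {
                wb[i].data[word] = value;
                wb[i].word_valid |= 1u << word;
                return true;
            }
        for (int i = 0; i < WB_ENTRIES; i++)      // else allocate new entry
            if (!wb[i].valid) {
                wb[i].valid = true;
                wb[i].block_addr = block;
                wb[i].word_valid = 1u << word;
                wb[i].data[word] = value;
                return true;
            }
        return false;                             // buffer full: stall
    }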

12
9. Reducing Misses via a Victim Cache
  • How to combine the fast hit time of direct mapped
    yet still avoid conflict misses?
  • Add a buffer to place data discarded from the cache
    (sketched below)
  • [Jouppi, 1990]: a 4-entry victim cache removed 20%
    to 95% of conflicts for a 4 KB direct mapped data
    cache
  • Used in Alpha, HP machines

(Figure: victim cache organization — a small fully associative buffer of cache lines, each with its own tag and comparator, sitting between the cache's TAGS/DATA arrays and the next lower level in the hierarchy)
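A minimal software model of the victim-cache organization sketched above; the 4 KB direct-mapped cache and 4-entry victim cache match the slide's example, while the 32-byte line size and FIFO replacement are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_BYTES   32
    #define CACHE_LINES  128        // 4 KB direct-mapped / 32-byte lines
    #define VICTIM_LINES 4

    typedef struct { bool valid; uint64_t tag; } line_t;

    static line_t cache[CACHE_LINES];   // direct-mapped; tag = line / CACHE_LINES
    static line_t victim[VICTIM_LINES]; // fully associative; tag = full line address
    static unsigned victim_next;        // simple FIFO replacement

    // Returns true on a hit in either the main cache or the victim cache.
    bool access_cache(uint64_t addr) {
        uint64_t line = addr / LINE_BYTES;
        unsigned idx  = (unsigned)(line % CACHE_LINES);
        uint64_t tag  = line / CACHE_LINES;

        if (cache[idx].valid && cache[idx].tag == tag)
            return true;                              // hit in main cache

        for (int i = 0; i < VICTIM_LINES; i++)
            if (victim[i].valid && victim[i].tag == line) {
                // hit in victim cache: swap the two lines
                line_t evicted = cache[idx];
                cache[idx] = (line_t){ true, tag };
                victim[i].valid = evicted.valid;
                victim[i].tag   = evicted.tag * CACHE_LINES + idx;
                return true;
            }

        // miss everywhere: move the evicted line into the victim cache,
        // then fetch the requested line from the next level
        if (cache[idx].valid) {
            victim[victim_next].valid = true;
            victim[victim_next].tag   = cache[idx].tag * CACHE_LINES + idx;
            victim_next = (victim_next + 1) % VICTIM_LINES;
        }
        cache[idx] = (line_t){ true, tag };
        return false;
    }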
13
10. Reducing Misses by Hardware Prefetching of
Instructions & Data
  • Prefetching relies on having extra memory
    bandwidth that can be used without penalty
  • Instruction Prefetching:
  • Typically, CPU fetches 2 blocks on a miss: the
    requested block and the next consecutive block
  • Requested block is placed in the instruction cache
    when it returns, and the prefetched block is placed
    into an instruction stream buffer
  • Data Prefetching:
  • Pentium 4 can prefetch data into the L2 cache from up
    to 8 streams from 8 different 4 KB pages
  • Prefetching invoked if there are 2 successive L2 cache
    misses to a page, and if the distance between those
    cache blocks is < 256 bytes

14
Issues in Prefetching
  • Usefulness: should produce hits
  • Timeliness: not late and not too early
  • Cache and bandwidth pollution

(Figure: CPU with register file, split L1 instruction and data caches, and a unified L2 cache; prefetched data is brought into the L2/L1)
15
Hardware Instruction Prefetching
  • Instruction prefetch in Alpha AXP 21064
  • Fetch two blocks on a miss: the requested block
    (i) and the next consecutive block (i+1)
  • Requested block placed in cache, and next block
    in instruction stream buffer
  • If miss in cache but hit in stream buffer, move
    stream buffer block into cache and prefetch next
    block (i+2)

(Figure: CPU and L1 instruction cache backed by a unified L2 cache; the requested block is returned to the L1 while the prefetched instruction block goes into a stream buffer beside it)
16
Hardware Data Prefetching
  • Prefetch-on-miss:
  • Prefetch b+1 upon a miss on b
  • One Block Lookahead (OBL) scheme:
  • Initiate a prefetch for block b+1 when block b is
    accessed
  • Why is this different from doubling block size?
  • Can extend to N-block lookahead
  • Strided prefetch:
  • If we observe a sequence of accesses to blocks b, b+N,
    b+2N, then prefetch b+3N, etc.
  • Example: IBM Power 5 [2003] supports eight
    independent streams of strided prefetch per
    processor, prefetching 12 lines ahead of the current
    access
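A minimal software model of a strided prefetcher may make the idea concrete; this is not the Power 5 design, and the table size and interface are assumptions for illustration only:

    #include <stdint.h>

    #define RPT_ENTRIES 64   // reference prediction table, indexed by load PC

    typedef struct {
        uint64_t pc;         // load instruction being tracked
        uint64_t last_addr;  // last block address it touched
        int64_t  stride;     // last observed stride, in blocks
    } rpt_entry;

    static rpt_entry rpt[RPT_ENTRIES];

    // Called on every load with its PC and the block address it touches.
    // Returns a block address to prefetch, or 0 for "no prefetch".
    uint64_t rpt_access(uint64_t pc, uint64_t block_addr) {
        rpt_entry *e = &rpt[pc % RPT_ENTRIES];
        uint64_t prefetch = 0;

        if (e->pc == pc) {
            int64_t stride = (int64_t)(block_addr - e->last_addr);
            // same non-zero stride seen twice in a row: predict b + N
            if (stride != 0 && stride == e->stride)
                prefetch = block_addr + (uint64_t)stride;
            e->stride = stride;
        } else {
            e->pc = pc;           // new load replaces the old entry
            e->stride = 0;
        }
        e->last_addr = block_addr;
        return prefetch;
    }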

17
Administrivia
  • Exam: This Wednesday. Location: 310 Soda. Time: 6:00-9:00pm
  • Material: Everything up to next Monday, including
    papers (especially ones discussed in detail in
    class)
  • Closed Book, but 1 page of hand-written notes (both
    sides)
  • Meet at La Val's afterwards for Pizza and
    Beverages
  • We have been reading Chapter 5
  • You should take a look, since it might show up on the
    test

18
11. Reducing Misses by Software Prefetching Data
  • Data Prefetch:
  • Load data into register (HP PA-RISC loads)
  • Cache Prefetch: load into cache (MIPS IV,
    PowerPC, SPARC v. 9)
  • Special prefetching instructions cannot cause
    faults; a form of speculative execution
  • Issuing Prefetch Instructions takes time
  • Is cost of prefetch issues < savings in reduced
    misses?
  • Higher superscalar width reduces the difficulty of
    issue bandwidth
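For a concrete (compiler-specific, not from the lecture) illustration, GCC and Clang expose a prefetch intrinsic, __builtin_prefetch; the loop below software-prefetches a large array a fixed distance ahead, where the distance of 16 elements is an assumed, machine-dependent tuning value:

    #include <stddef.h>

    // Sum a large array, prefetching ahead so each cache block arrives
    // before it is needed. __builtin_prefetch(addr, rw, locality) is a
    // hint and never faults, so prefetching past useful data is safe.
    double sum_with_prefetch(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/1);
            s += a[i];
        }
        return s;
    }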

19
12. Reducing Misses by Compiler Optimizations
  • [McFarling, 1989] reduced cache misses by 75% on an
    8KB direct mapped cache with 4 byte blocks, in
    software
  • Instructions:
  • Reorder procedures in memory so as to reduce
    conflict misses
  • Profiling to look at conflicts (using tools they
    developed)
  • Data:
  • Merging Arrays: improve spatial locality by using a
    single array of compound elements vs. 2 arrays
  • Loop Interchange: change nesting of loops to
    access data in the order stored in memory
  • Loop Fusion: Combine 2 independent loops that
    have the same looping and some variables overlap
  • Blocking: Improve temporal locality by accessing
    blocks of data repeatedly vs. going down whole
    columns or rows

20
Merging Arrays Example
  • /* Before: 2 sequential arrays */
  • int val[SIZE];
  • int key[SIZE];
  • /* After: 1 array of structures */
  • struct merge {
  •   int val;
  •   int key;
  • };
  • struct merge merged_array[SIZE];
  • Reducing conflicts between val & key; improves
    spatial locality

21
Loop Interchange Example
  • /* Before */
  • for (k = 0; k < 100; k = k+1)
  •   for (j = 0; j < 100; j = j+1)
  •     for (i = 0; i < 5000; i = i+1)
  •       x[i][j] = 2 * x[i][j];
  • /* After */
  • for (k = 0; k < 100; k = k+1)
  •   for (i = 0; i < 5000; i = i+1)
  •     for (j = 0; j < 100; j = j+1)
  •       x[i][j] = 2 * x[i][j];
  • Sequential accesses instead of striding through
    memory every 100 words; improved spatial locality

22
Loop Fusion Example
  • /* Before */
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •     a[i][j] = 1/b[i][j] * c[i][j];
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •     d[i][j] = a[i][j] + c[i][j];
  • /* After */
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •   { a[i][j] = 1/b[i][j] * c[i][j];
  •     d[i][j] = a[i][j] + c[i][j]; }
  • 2 misses per access to a & c vs. one miss per
    access; improves temporal locality

23
Blocking Example
  • /* Before */
  • for (i = 0; i < N; i = i+1)
  •   for (j = 0; j < N; j = j+1)
  •   { r = 0;
  •     for (k = 0; k < N; k = k+1)
  •       r = r + y[i][k]*z[k][j];
  •     x[i][j] = r;
  •   }
  • Two Inner Loops:
  • Read all N×N elements of z
  • Read N elements of 1 row of y repeatedly
  • Write N elements of 1 row of x
  • Capacity Misses are a function of N & Cache Size:
  • 2N³ + N² words accessed ⇒ (assuming no conflicts; otherwise ...)
  • Idea: compute on a B×B submatrix that fits in the cache

24
Blocking Example
  • / After /
  • for (jj 0 jj lt N jj jjB)
  • for (kk 0 kk lt N kk kkB)
  • for (i 0 i lt N i i1)
  • for (j jj j lt min(jjB-1,N) j j1)
  • r 0
  • for (k kk k lt min(kkB-1,N) k k1)
  • r r yikzkj
  • xij xij r
  • B called Blocking Factor
  • Capacity Misses from 2N3 N2 to 2N3/B N2
  • Conflict Misses Too?

25
Reducing Conflict Misses by Blocking
  • Conflict misses in caches that are not fully
    associative vs. blocking size
  • [Lam et al., 1991]: a blocking factor of 24 had a
    fifth the misses of 48, despite both fitting in the cache

26
Summary of Compiler Optimizations to Reduce Cache
Misses (by hand)
27
Impact of Hierarchy on Algorithms
  • Today, CPU time is a function of (ops, cache
    misses)
  • What does this mean to Compilers, Data
    structures, Algorithms?
  • Quicksort: fastest comparison-based sorting
    algorithm when keys fit in memory
  • Radix sort: also called "linear time" sort; for
    keys of fixed length and fixed radix, a constant
    number of passes over the data is sufficient,
    independent of the number of keys
  • "The Influence of Caches on the Performance of
    Sorting" by A. LaMarca and R.E. Ladner,
    Proceedings of the Eighth Annual ACM-SIAM
    Symposium on Discrete Algorithms, January 1997,
    370-379
  • For an Alphastation 250: 32 byte blocks, direct
    mapped 2MB L2 cache, 8 byte keys, from 4000 to
    4,000,000 keys

28
Quicksort vs. Radix: Instructions
(Figure: instruction count vs. job size in keys)
29
Quicksort vs. Radix: Instructions & Time
(Figure: time and instruction count vs. job size in keys)
30
Quicksort vs. Radix: Cache misses
(Figure: cache misses vs. job size in keys)
31
Experimental Study (Membench)
  • Microbenchmark for memory system performance

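The transcript does not preserve the benchmark's code; the following is a minimal C sketch of this kind of experiment (array sizes, strides, and repetition count are illustrative choices, and clock_gettime is just one way to time the loop):

    #include <stddef.h>
    #include <stdio.h>
    #include <time.h>

    #define MAX_BYTES (8 * 1024 * 1024)
    static char A[MAX_BYTES];

    // Time repeated strided loads over an array of 'len' bytes.
    // Returns average nanoseconds per access.
    double time_strided(size_t len, size_t stride, int reps) {
        volatile char sink = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < len; i += stride)
                sink += A[i];                      // one load per access
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        return ns / ((double)reps * (len / stride));
    }

    int main(void) {
        // One curve per array length: average cost per access vs. stride.
        for (size_t len = 4096; len <= MAX_BYTES; len *= 2)
            for (size_t stride = 4; stride <= len / 2; stride *= 2)
                printf("len=%zu stride=%zu ns/access=%.2f\n",
                       len, stride, time_strided(len, stride, 10));
        return 0;
    }
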
32
Membench What to Expect
(Figure: average cost per access vs. stride s, one curve per array length)
  • Consider the average cost per load
  • Plot one line for each array length, time vs.
    stride
  • Small stride is best: if a cache line holds 4
    words, at most ¼ of accesses miss
  • If the array is smaller than a given cache, all those
    accesses will hit (after the first run, which is
    negligible for large enough runs)
  • Picture assumes only one level of cache
  • Values have gotten more difficult to measure on
    modern processors
33
Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz
(Figure: measured average access time vs. stride, one curve per array length)
See http://www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details
34
Memory Hierarchy on a Power3
Power3, 375 MHz
(Figure: measured average access time vs. stride, one curve per array size)
35
Compiler Optimization vs. Memory Hierarchy Search
  • Compiler tries to figure out memory hierarchy
    optimizations
  • New approach: "Auto-tuners" first run variations of
    the program on a computer to find the best combinations of
    optimizations (blocking, padding, ...) and
    algorithms, then produce C code to be compiled
    for that computer
  • Auto-tuners targeted to numerical methods
  • E.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity
    (sparse linear algebra), Spiral (DSP), FFTW

36
Sparse Matrix: Search for Blocking
for a finite element problem [Im, Yelick, Vuduc, 2005]
37
Best Sparse Blocking for 8 Computers
(Figure: best register block size, row block size r vs. column block size c, each in {1, 2, 4, 8}, chosen separately for each of 8 computers)
  • All possible column block sizes selected for the 8
    computers; how could a compiler know?

38
(No Transcript)
39
Main Memory Background
  • Performance of Main Memory:
  • Latency: Cache Miss Penalty
  • Access Time: time between the request and when the word
    arrives
  • Cycle Time: minimum time between requests
  • Bandwidth: I/O & Large Block Miss Penalty (L2)
  • Main Memory is DRAM: Dynamic Random Access Memory
  • Dynamic since it needs to be refreshed periodically
    (8 ms, 1% of time)
  • Addresses divided into 2 halves (Memory as a 2D
    matrix):
  • RAS or Row Address Strobe
  • CAS or Column Address Strobe
  • Cache uses SRAM: Static Random Access Memory
  • No refresh (6 transistors/bit vs. 1 transistor);
    Size: DRAM/SRAM ~ 4-8; Cost/Cycle time: SRAM/DRAM ~ 8-16

40
Core Memories (1950s & 60s)
DEC PDP-8/E Board, 4K words x 12 bits (1968)
First magnetic core memory, from the IBM 405
Alphabetical Accounting Machine
  • Core Memory stored data as magnetization in iron
    rings
  • Iron cores woven into a 2-dimensional mesh of
    wires by hand (25 billion a year at peak
    production)
  • Invented by Forrester in the late 40s/early 50s at
    MIT for Whirlwind
  • Origin of the term "Dump Core"
  • Rumor that IBM consulted the Life Savers company
  • Robust, non-volatile storage
  • Used on space shuttle computers until recently
  • Core access time ~ 1 µs
  • See http://www.columbia.edu/acis/history/core.html

41
Semiconductor Memory, DRAM
  • Semiconductor memory began to be competitive in
    early 1970s
  • Intel formed to exploit market for semiconductor
    memory
  • First commercial DRAM was Intel 1103
  • 1Kbit of storage on single chip
  • charge on a capacitor used to hold value
  • Semiconductor memory quickly replaced core in
    1970s
  • Today (March 2009), 4GB of DRAM costs < $40
  • People can easily afford to fill 32-bit address
    space with DRAM (4GB)
  • New Vista systems often shipping with 6GB

42
DRAM Architecture
  • Bits stored in 2-dimensional arrays on chip
  • Modern chips have around 4 logical banks on each
    chip
  • each logical bank physically implemented as many
    smaller arrays

43
Review: 1-T Memory Cell (DRAM)
(Figure: one-transistor cell; the row select line gates the storage capacitor onto the bit line)
  • Write:
  • 1. Drive bit line
  • 2. Select row
  • Read:
  • 1. Precharge bit line to Vdd/2
  • 2. Select row
  • 3. Cell and bit line share charge
  • Very small voltage changes on the bit line
  • 4. Sense (fancy sense amp)
  • Can detect changes of ~1 million electrons
  • 5. Write: restore the value
  • Refresh:
  • 1. Just do a dummy read to every cell
44
DRAM Capacitors: more capacitance in a small area
  • Trench capacitors
  • Logic ABOVE capacitor
  • Gain in surface area of capacitor
  • Better Scaling properties
  • Better Planarization
  • Stacked capacitors
  • Logic BELOW capacitor
  • Gain in surface area of capacitor
  • 2-dim cross-section quite small

45
DRAM Operation: Three Steps
  • Precharge
  • charges bit lines to known value, required before
    next row access
  • Row access (RAS)
  • decode row address, enable addressed row (often
    multiple Kb in row)
  • bitlines share charge with storage cell
  • small change in voltage detected by sense
    amplifiers which latch whole row of bits
  • sense amplifiers drive bitlines full rail to
    recharge storage cells
  • Column access (CAS)
  • decode column address to select small number of
    sense amplifier latches (4, 8, 16, or 32 bits
    depending on DRAM package)
  • on read, send latched bits out to chip pins
  • on write, change sense amplifier latches, which
    then charge storage cells to the required value
  • can perform multiple column accesses on same row
    without another row access (burst mode)

46
DRAM Read Timing (Example)
  • Every DRAM access begins with the assertion of
    RAS_L
  • 2 ways to read: early or late relative to CAS

(Timing diagram: one DRAM read cycle showing RAS_L, CAS_L, the multiplexed row/column address, WE_L, OE_L, and the data bus going from high-Z to valid data out; read access time and output enable delay are marked)
Early Read Cycle: OE_L asserted before CAS_L
Late Read Cycle: OE_L asserted after CAS_L
47
Main Memory Performance
(Figure: timeline showing that the access time is shorter than the cycle time)
  • DRAM (Read/Write) Cycle Time >> DRAM
    (Read/Write) Access Time
  • Ratio roughly 2:1; why?
  • DRAM (Read/Write) Cycle Time:
  • How frequently can you initiate an access?
  • Analogy: A little kid can only ask his father for
    money on Saturday
  • DRAM (Read/Write) Access Time:
  • How quickly will you get what you want once you
    initiate an access?
  • Analogy: As soon as he asks, his father will give
    him the money
  • DRAM Bandwidth Limitation analogy:
  • What happens if he runs out of money on Wednesday?

48
Increasing Bandwidth - Interleaving
(Figure: without interleaving, the CPU must wait for each access to a single memory bank to complete — start access for D1, wait until D1 is available, then start access for D2. With 4-way interleaving, accesses to banks 0-3 are started back-to-back and overlap, so bank 0 can be accessed again as soon as its cycle time has elapsed)
49
Main Memory Performance
  • Wide:
  • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words
    (Alpha: 64 bits & 256 bits)
  • Interleaved:
  • CPU, Cache, Bus 1 word; Memory N modules (4
    modules); example is word interleaved
  • Simple:
  • CPU, Cache, Bus, Memory same width (32 bits)

50
Main Memory Performance
  • Timing model:
  • 1 cycle to send address,
  • 4 for access time, 10 cycle time, 1 to send data
  • Cache Block is 4 words
  • Simple M.P.      = 4 x (1 + 10 + 1) = 48
  • Wide M.P.        = 1 + 10 + 1 = 12
  • Interleaved M.P. = 1 + 10 + 4 x 1 = 15
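Spelling out where these miss penalties come from (reading the model as: 1 cycle to send the address, roughly 10 cycles of memory occupancy per access, 1 cycle to move each word over the bus, 4-word blocks):

\begin{aligned}
\text{Simple:} &\quad 4 \times (1 + 10 + 1) = 48 \text{ cycles (the four words are fetched one after another)}\\
\text{Wide:} &\quad 1 + 10 + 1 = 12 \text{ cycles (one wide access returns the whole block)}\\
\text{Interleaved:} &\quad 1 + 10 + 4 \times 1 = 15 \text{ cycles (the four banks overlap their accesses; only the word transfers are serialized on the narrow bus)}
\end{aligned}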

51
Avoiding Bank Conflicts
  • Lots of banks
  • int x[256][512];
  • for (j = 0; j < 512; j = j+1)
  •   for (i = 0; i < 256; i = i+1)
  •     x[i][j] = 2 * x[i][j];
  • Even with 128 banks, since 512 is a multiple of
    128, word accesses conflict on the same bank
  • SW: loop interchange, or declare the array with a
    non-power-of-2 dimension ("array padding")
  • HW: prime number of banks
  • bank number = address mod number of banks
  • address within bank = address / number of words
    in bank
  • modulo & divide per memory access with a prime
    number of banks?

52
Finding Bank Number and Address within a bank
  • Problem: Determine the number of banks, Nb, and
    the number of words in each bank, Wb, such that:
  • given address x, it is easy to find the bank
    where x will be found, B(x), and the address of x
    within the bank, A(x)
  • for any address x, B(x) and A(x) are unique
  • the number of bank conflicts is minimized
  • Solution: Use the following relation to determine
    B(x) and A(x): B(x) = x MOD Nb, A(x) = x MOD Wb,
    where Nb and Wb are co-prime (no common factors)
  • Chinese Remainder Theorem shows that B(x) and
    A(x) are unique
  • Condition is satisfied if Nb is a prime of the form
    2^m - 1
  • Then, 2^k = 2^(k-m) x (2^m - 1) + 2^(k-m)
    ⇒ 2^k MOD Nb = 2^(k-m) MOD Nb = 2^j with j < m
  • And, remember that (A + B) MOD C = ((A MOD C) + (B
    MOD C)) MOD C
  • Simple circuit for x mod Nb:
  • for every power of 2, compute its single-bit MOD (in
    advance)
  • B(x) = sum of these values MOD Nb (low
    complexity circuit, adder with m bits)
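A small C model may make that circuit concrete; the bank count Nb = 7 (m = 3) is an assumed example value, and the per-power-of-2 residues are folded into m-bit chunks, which is equivalent because 2^m ≡ 1 (mod 2^m - 1):

    #include <stdint.h>
    #include <stdio.h>

    // Compute x mod Nb where Nb = 2^m - 1, by summing the m-bit
    // "digits" of x and reducing; this mirrors the adder-based circuit.
    static unsigned mod_mersenne(uint64_t x, unsigned m) {
        uint64_t nb = (1ULL << m) - 1;       // Nb = 2^m - 1
        uint64_t sum = 0;
        while (x > 0) {                      // add up the m-bit chunks of x
            sum += x & nb;
            x >>= m;
        }
        while (sum > nb)                     // fold until the sum fits in m bits
            sum = (sum & nb) + (sum >> m);
        if (sum == nb) sum = 0;              // nb ≡ 0 (mod nb)
        return (unsigned)sum;
    }

    int main(void) {
        unsigned m = 3;                      // Nb = 7 banks (example)
        for (uint64_t x = 0; x < 64; x++)    // sanity-check against %
            if (mod_mersenne(x, m) != x % 7)
                printf("mismatch at %llu\n", (unsigned long long)x);
        printf("bank of address 37 = %u\n", mod_mersenne(37, m));
        return 0;
    }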

53
Quest for DRAM Performance
  • Fast Page mode
  • Add timing signals that allow repeated accesses
    to row buffer without another row access time
  • Such a buffer comes naturally, as each array will
    buffer 1024 to 2048 bits for each access
  • Synchronous DRAM (SDRAM)
  • Add a clock signal to DRAM interface, so that the
    repeated transfers would not bear overhead to
    synchronize with DRAM controller
  • Double Data Rate (DDR SDRAM)
  • Transfer data on both the rising edge and falling
    edge of the DRAM clock signal ? doubling the peak
    data rate
  • DDR2 lowers power by dropping the voltage from
    2.5 to 1.8 volts; offers higher clock rates, up
    to 400 MHz
  • DDR3 drops to 1.5 volts; higher clock rates, up
    to 800 MHz
  • Improved Bandwidth, not Latency

54
Fast Memory Systems DRAM specific
  • Multiple CAS accesses: several names (page mode)
  • Extended Data Out (EDO): 30% faster in page mode
  • Newer DRAMs to address the gap; what will they cost,
    will they survive?
  • RAMBUS: startup company; reinvented the DRAM
    interface
  • Each chip a module vs. a slice of memory
  • Short bus between CPU and chips
  • Does own refresh
  • Variable amount of data returned
  • 1 byte / 2 ns (500 MB/s per chip)
  • Synchronous DRAM: 2 banks on chip, a clock signal
    to DRAM, transfer synchronous to the system clock (66
    - 150 MHz)
  • DDR DRAM: Two transfers per clock (on rising and
    falling edge)
  • Intel claims FB-DIMM is the next big thing
  • Stands for Fully-Buffered Dual-Inline RAM
  • Same basic technology as DDR, but utilizes a
    serial daisy-chain channel between different
    memory components
55
Fast Page Mode Operation
  • Regular DRAM Organization:
  • N rows x N columns x M bits
  • Read & Write M bits at a time
  • Each M-bit access requires a RAS / CAS cycle
  • Fast Page Mode DRAM:
  • N x M SRAM to save a row
  • After a row is read into the register:
  • Only CAS is needed to access other M-bit blocks
    on that row
  • RAS_L remains asserted while CAS_L is toggled

(Figure: DRAM array of N rows addressed by the row address; the selected row is latched into an N x M SRAM row register, from which the column address selects the M-bit output)
56
SDRAM timing (Single Data Rate)
  • Micron 128 Mbit DRAM (using the 2 Meg x 16 bit x 4 bank version)
  • Row (12 bits), bank (2 bits), column (9 bits)

57
Double-Data Rate (DDR2) DRAM
(Timing diagram: 200 MHz clock; row activate, column read, precharge, then the next row; data is transferred on both clock edges for a 400 Mb/s per-pin data rate)
  • Micron, 256 Mb DDR2 SDRAM datasheet

58
DRAM name is based on Peak Chip Transfers / Sec;
DIMM name is based on Peak DIMM MBytes / Sec
59
DRAM Packaging
(Figure: a DRAM chip with ~7 clock and control signals, ~12 multiplexed row/column address lines, and a 4-, 8-, 16-, or 32-bit data bus)
  • DIMM (Dual Inline Memory Module) contains
    multiple chips arranged in ranks
  • Each rank has clock/control/address signals
    connected in parallel (sometimes need buffers to
    drive signals to all chips), and data pins work
    together to return wide word
  • e.g., a rank could implement a 64-bit data bus
    using 16x4-bit chips, or a 64-bit data bus using
    8x8-bit chips.
  • A modern DIMM usually has one or two ranks
    (occasionally 4 if high capacity)
  • A rank will contain the same number of banks as
    each constituent chip (e.g., 4-8)

60
DRAM Channel
Rank
Rank
64-bit Data Bus
Memory Controller
Command/Address Bus
61
FB-DIMM Memories
Regular DIMM
FB-DIMM
  • Uses Commodity DRAMs with special controller on
    actual DIMM board
  • Connection is in a serial form

62
FLASH Memory
Samsung 2007: 16 GB NAND Flash
  • Like a normal transistor, but:
  • Has a floating gate that can hold charge
  • To write: raise or lower the wordline high enough to
    cause charges to tunnel
  • To read: turn on the wordline as if a normal transistor
  • Presence of charge changes the threshold and thus the
    measured current
  • Two varieties:
  • NAND: denser, must be read and written in blocks
  • NOR: much less dense, fast to read and write

63
Phase Change memory (IBM, Samsung, Intel)
  • Phase Change Memory (called PRAM or PCM)
  • Chalcogenide material can change from amorphous
    to crystalline state with application of heat
  • Two states have very different resistive
    properties
  • Similar to material used in CD-RW process
  • Exciting alternative to FLASH
  • Higher speed
  • May be easy to integrate with CMOS processes

64
Tunneling Magnetic Junction
  • Tunneling Magnetic Junction RAM (TMJ-RAM)
  • Speed of SRAM, density of DRAM, non-volatile (no
    refresh)
  • Spintronics: combination of quantum spin and
    electronics
  • Same technology used in high-density disk-drives

65
Conclusion
  • Memory wall inspires optimizations since so much
    performance is lost
  • Reducing hit time: Small and simple caches, Way
    prediction, Trace caches
  • Increasing cache bandwidth: Pipelined caches,
    Multibanked caches, Nonblocking caches
  • Reducing Miss Penalty: Critical word first,
    Merging write buffers
  • Reducing Miss Rate: Compiler optimizations
  • Reducing miss penalty or miss rate via
    parallelism: Hardware prefetching, Compiler
    prefetching
  • Performance of programs can be a complicated
    function of architecture
  • To write fast programs, need to consider
    architecture
  • True on sequential or parallel processor
  • We would like simple models to help us design
    efficient algorithms
  • Will "Auto-tuners" replace compilation to
    optimize performance?
  • Main memory is Dense, Slow
  • Cycle time > Access time!
  • Techniques to optimize memory:
  • Wider Memory
  • Interleaved Memory for sequential or independent
    accesses
  • Avoiding bank conflicts: SW & HW
  • DRAM-specific optimizations: page mode &
    Specialty DRAM