

1
Performance Analysis, Modeling, and Optimization
Understanding the Memory Wall
  • Leonid Oliker (LBNL) and
  • Katherine Yelick (UCB and LBNL)

2
Berkeley Institute for Performance Studies
  • Joint venture between U.C. Berkeley (Demmel,
    Yelick) and LBNL (Oliker, Strohmaier, Bailey,
    and others)
  • Three performance techniques
  • Analysis (benchmarking)
  • Modeling (prediction)
  • Optimization (tuning)

3
Investigating Architectural Balance using
Adaptable Probes
  • Kaushik Datta, Parry Husbands, Paul Hargrove,
    Shoaib Kamil, Leonid Oliker, John Shalf,
    Katherine Yelick

4
Overview
  • Gap between peak and sustained performance: a
    well-known problem in HPC
  • Generally attributed to memory system, but
    difficult to identify bottleneck
  • Application benchmarks too complex to isolate
    specific architectural features
  • Microbenchmarks too narrow to predict actual code
    performance
  • We use adaptable probes to isolate performance
    limitations
  • Give application developers possible
    optimizations
  • Give hardware designers feedback on current and
    proposed architectures
  • Single-processor probes
  • Sqmat captures regular and irregular memory
    access patterns (such as dense and sparse linear
    algebra)
  • Stencil captures nearest-neighbor computation
    (work in progress)
  • Architectures examined
  • Commercial: Intel Itanium2, AMD Opteron, IBM
    Power3, IBM Power4, IBM G5
  • Research: Imagine, IRAM, DIVA

5
Sqmat overview
  • Sqmat based on matrix multiplication and linear
    solvers
  • Java program used to generate optimally unrolled
    C code
  • Square a set of matrices M times in (use enough
    matrices to exceed cache)
  • M controls computational intensity (CI) - the
    ratio between flops and mem access
  • Each matrix is size NxN
  • N controls working set size 2N2 registers
    required per matrix
  • Direct Storage Sqmats matrix entries stored
    continuously in memory
  • Indirect Entries accessed indirectly through
    pointer Parameter S controls degree of
    indirection, S matrix entries stored
    contiguously, then random jump in memory
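
A minimal C sketch of the kind of kernel Sqmat exercises (the real
probe emits fully unrolled code from its Java generator; the compact
loop form and names here are illustrative assumptions):

    /* Illustrative Sqmat-style kernel, not the actual generated code.
     * Squares each NxN matrix in the working set M times.
     * N sets the register working set (about 2*N*N registers per
     * matrix); M sets computational intensity: each squaring costs
     * N*N*(2*N-1) flops against 2*N*N words of memory traffic, so
     * CI grows roughly as M*(2*N-1)/2 flops per word. */
    #define N 3

    static void square_once(double a[N][N], double c[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i][k] * a[k][j];
                c[i][j] = s;
            }
    }

    /* Direct storage: matrix entries stored contiguously in memory. */
    void sqmat_direct(double (*mats)[N][N], int nmats, int M) {
        double tmp[N][N];
        for (int m = 0; m < nmats; m++)     /* working set exceeds cache */
            for (int r = 0; r < M; r++) {   /* M raises flops per word   */
                square_once(mats[m], tmp);
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < N; j++)
                        mats[m][i][j] = tmp[i][j];
            }
    }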

6
Unit Stride Algorithmic Peak
  • Curve increases until the memory system is fully
    utilized, then plateaus when the FPUs saturate
  • Itanium2 requires longer to reach the plateau
    due to its register spill penalty
  • Opteron: the SIMD nature of SSE2 inhibits a high
    algorithmic peak
  • Power3 effectively hides cache-access latency
  • Power4's deep pipeline inhibits finding
    sufficient ILP to saturate the FPUs

7
Slowdown due to Indirection
Unit stride access via indirection (S ≫ 1)
  • Opteron and Power3/4: less than 10% penalty once
    M > 8, demonstrating that the bandwidth between
    cache and processor effectively delivers both
    addresses and values
  • Itanium2 shows a high penalty for indirection;
    the issue is currently under investigation

8
Cost of Irregularity (1)
  • Itanium2 and Opteron perform well for irregular
    accesses due to
  • Itanium2's L2 caching of FP values (reduces the
    cost of a cache miss)
  • Opteron's low memory latency from its on-chip
    memory controller

9
Cost of Irregularity (2)
  • Power3 and Power4 perform poorly for irregular
    accesses due to
  • Power3's high cache-miss penalty (35 cycles) and
    limited prefetch abilities
  • Power4 requiring 4 cache-line hits to activate
    prefetching

10
Tolerating Irregularity
  • S50
  • Start with some M at S = ∞ (indirect unit stride)
  • For a given M, how large must S be to achieve at
    least 50% of the original performance?
  • M50
  • Start with M = 1, S = ∞
  • At S = 1 (every access random), how large must M
    be to achieve 50% of the original performance?
    (See the index-array sketch below.)
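
To make the S parameter concrete, here is a hedged C sketch of how an
index array with runs of S contiguous entries separated by random
jumps might be set up (the slides do not show the probe's actual
construction, so the details here are assumptions):

    #include <stdlib.h>

    /* Illustrative setup for Sqmat's indirect mode (details assumed).
     * Runs of S consecutive indices are followed by a random jump, so
     * S = 1 makes every access random while large S approaches unit
     * stride; matrix entries are then touched as data[idx[i]]. */
    void build_indirection(int *idx, int nentries, int S) {
        int pos = 0;
        for (int i = 0; i < nentries; i++) {
            if (i % S == 0)                 /* random jump every S entries */
                pos = rand() % nentries;
            idx[i] = pos;
            pos = (pos + 1) % nentries;     /* contiguous within a run */
        }
    }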

11
Tolerating Irregularity
  • S50: What fraction of memory accesses can be
    random before performance decreases by half?
  • M50: How much computational intensity is required
    to hide the penalty of all-random access?
  • Gather/scatter is expensive on commodity
    cache-based systems
  • Power4 can tolerate only 1.6% random accesses
    (1 in 64); Itanium2 is much less sensitive at 25%
    (1 in 4)
  • A huge amount of computation may be required to
    hide the overhead of irregular data access:
    Itanium2 requires a CI of about 9 flops/word;
    Power4 requires a CI of almost 75!
  • Interested in developing application-driven
    architectural probes for evaluating emerging
    petascale systems
12
Emerging Architectures
  • General-purpose processors are badly suited for
    data-intensive ops
  • Large caches are not useful if reuse is low
  • Low memory bandwidth, especially for irregular
    patterns
  • Superscalar methods of increasing ILP are
    inefficient
  • High power consumption
  • Application-specific ASICs
  • Good, but expensive and slow to design
  • Solution: general-purpose, memory-aware
    processors
  • Large number of ALUs to exploit data parallelism
  • Huge memory bandwidth to keep the ALUs busy
  • Concurrency: overlap memory access with
    computation

13
VIRAM Overview
  • MIPS core (200 MHz)
  • Main memory system
  • 8 banks w/ 13 MB of on-chip DRAM
  • Large 6.4 GB/s on-chip peak bandwidth
  • Cache-less vector unit
  • Energy-efficient way to express fine-grained
    parallelism and exploit bandwidth
  • Single issue, in order
  • Low power consumption: 2.0 W
  • Peak vector performance
  • 1.6/3.2/6.4 Gops
  • 1.6 Gflops (single-precision)
  • Fabricated by IBM
  • Deep pipelines mask DRAM latency
  • Cray's vcc compiler adapted to VIRAM
  • Simulator used for results

14
VIRAM Power Efficiency
  • Comparable performance with a lower clock rate
  • Large power/performance advantage for VIRAM from
    PIM technology and the data-parallel execution
    model

15
Imagine Overview
  • Vector VLIW processor
  • Coprocessor to an off-chip host processor
  • 8 arithmetic clusters controlled in SIMD w/ VLIW
    instructions
  • Central 128KB Stream Register File (SRF) @ 32 GB/s
  • SRF can overlap computation w/ memory via double
    buffering (see the sketch after this list)
  • SRF can reuse intermediate results
    (producer-consumer locality)
  • Stream-aware memory system with 2.7 GB/s off-chip
    bandwidth
  • 544 GB/s intercluster communication
  • Host sends instructions to the stream controller;
    the SC issues commands to on-chip modules
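
A rough C illustration of the double-buffering idea (the fetch and
kernel routines are hypothetical stand-ins, not Imagine's actual
stream API; only the overlap structure is the point):

    #include <stddef.h>

    void fetch_async(double *dst, const double *src, size_t n); /* hypothetical */
    void wait_fetch(void);                                      /* hypothetical */
    void kernel(double *buf, size_t n);                         /* hypothetical */

    /* While the kernel works on one buffer, the next block of the
     * stream is fetched into the other, overlapping memory access
     * with computation, as the SRF allows on Imagine. */
    void stream_process(const double *in, size_t total, size_t blk,
                        double *buf[2]) {
        fetch_async(buf[0], in, blk);               /* prime the pipeline  */
        for (size_t off = 0, cur = 0; off < total; off += blk, cur ^= 1) {
            wait_fetch();                           /* current block ready */
            if (off + blk < total)                  /* start the next load */
                fetch_async(buf[cur ^ 1], in + off + blk, blk);
            kernel(buf[cur], blk);                  /* compute on current  */
        }
    }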

16
VIRAM and Imagine
                          VIRAM        Imagine (memory)   Imagine (SRF)
  Bandwidth (GB/s)        6.4          2.7                32
  Peak float (32-bit)     1.6 GF/s     20 GF/s            20 GF/s
  Peak float/word         1            30                 2.5
  Clock (MHz)             200          400
  Chip area               15x18 mm     12x12 mm
  Data widths (bits)      64/32/16     32/16/8
  Transistors             130 x 10^6   21 x 10^6
  Power consumption       2 W          10 W
  • Imagine: order of magnitude higher performance
  • VIRAM: twice the memory bandwidth, less power
    consumption
  • Note the peak float/word ratios

17
What Does This Have to Do with PIMs?
  • Performance of Sqmat on PIMs and other systems
    for 3x3 matrices, squared 10 times (high
    computational intensity!)
  • Imagine is much faster for long streams, slower
    for short ones

18
SQMAT Performance Crossover
  • Large number of ops per word: M = 10 with N = 3
    (3x3 matrices)
  • Crossover point: L = 64 (cycles), L = 256
    (MFlops)
  • Imagine's power becomes apparent at long stream
    lengths: almost 4x VIRAM at L = 1024; codes at
    this end of the spectrum greatly benefit from the
    Imagine architecture

19
Stencil Probe
  • Stencil computations are the core of a wide range
    of scientific applications
  • Applications include Jacobi solvers, complex
    multigrid, and block-structured AMR
  • We are developing an adaptable stencil probe to
    model a range of computations (a sketch of the
    kind of kernel it models appears below)
  • Findings isolate the importance of streaming
    memory accesses, which engage automatic prefetch
    engines and thus greatly increase memory
    throughput
  • Previous L1 tiling techniques are mostly
    ineffective for stencil computations on modern
    microprocessors
  • Small blocks inhibit automatic prefetching
    performance
  • Modern large on-chip L2/L3 caches have bandwidth
    similar to L1
  • Currently investigating tradeoffs between
    blocking and prefetching (paper in preparation)
  • Interested in exploring the potential benefits of
    enhancing commodity processors with explicitly
    programmable prefetching
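
A minimal C sketch of the kind of nearest-neighbor kernel the stencil
probe models (a Jacobi-style 5-point sweep on a 2D grid; the probe
itself is adaptable and its code is not shown in the slides):

    /* Jacobi-style 5-point stencil sweep. Walking j in unit stride
     * streams through memory, which engages hardware prefetch engines;
     * small L1 tiles break these streams up, one reason L1 tiling
     * helps little on the microprocessors studied here. */
    void jacobi_sweep(int nx, int ny, const double *u, double *unew) {
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                unew[i*ny + j] = 0.25 * (u[(i-1)*ny + j] + u[(i+1)*ny + j]
                                       + u[i*ny + j - 1] + u[i*ny + j + 1]);
    }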