1
CS 395T: Software for Multicore Architectures
2
Administration
  • Instructor: Keshav Pingali
  • 4.126A ACES
  • pingali@cs.utexas.edu
  • TA: Milind Kulkarni
  • 4.104 ACES
  • milind@cs.utexas.edu

3
Course content
  • Understand high-end programming paradigms, compilers, and runtime systems:
  • Application requirements
  • Shared-memory programming (OpenMP)
  • Optimistic and pessimistic parallelization
  • Transactional memory
  • Dependence analysis
  • Memory hierarchy optimization
  • Self-optimizing systems
  • Focus on the software problem for multicore processors

4
Prerequisites
  • Knowledge of basic computer architecture
  • Software and math maturity
  • Comfortable with implementing large programs
  • Some background in compilers (Dragon book)
  • Comfortable with mathematical concepts like
    linear programming
  • Ability to read and evaluate papers on current
    research

5
What is a processor?
  • A single chip package that fits in a socket
  • ≥ 1 core
  • Cores can have functional units, cache, etc. associated with them
  • Cores can be fast or slow
  • Shared resources:
  • Lower cache levels
  • Buses, cache/memory controllers, high-speed serial links, etc.
  • One system interface no matter how many cores
  • Number of signal pins doesn't scale with the number of cores

6
Need for multicore processors
  • Commercial end-customers are demanding:
  • More capable systems with more capable processors
  • New systems must stay within the existing power/thermal infrastructure
  • High-level argument:
  • Silicon designers can choose a variety of approaches to increase processor performance, but these are maxing out
  • Meanwhile, processor frequency and power consumption are scaling in lockstep
  • One solution: multicore processors

Material adapted from a presentation by Paul Teich of AMD
7
Conventional approaches to improving performance
  • Add functional units
  • Superscalar is known territory
  • Diminishing returns from adding more functional blocks
  • Alternatives like VLIW have been considered and rejected by the market
  • Wider data paths
  • Increasing bandwidth between functional units in a core makes a difference
  • Such as a comprehensive 64-bit design, but then where to?

8
Conventional approaches (contd.)
  • Deeper pipeline
  • A deeper pipeline buys frequency at the expense of increased branch mis-prediction and cache miss penalties
  • Deeper pipelines → higher clock frequency → more power
  • Industry is converging on a middle ground: 9 to 11 stages
  • Successful RISC CPUs are in the same range
  • More cache
  • More cache buys performance only until the working set of the program fits in cache

9
Power problem
  • Moore's Law isn't dead: more transistors for everyone!
  • But... it doesn't really mention scaling transistor power
  • Chemistry and physics at the nano-scale
  • Stretching materials science
  • Transistor leakage current is increasing
  • As manufacturing economies and frequency increase, power consumption is increasing disproportionately
  • There are no process quick-fixes

10
Static Current vs. Frequency
Non-linear as processors approach max frequency
[Chart: static current (y-axis, 0 to 15) vs. normalized frequency (x-axis, 1.0 to 1.5), with "Fast, High Power" and "Fast, Low Power" operating points marked]
11
Power vs. Frequency
  • AMD's process
  • Frequency step: 200MHz
  • Two steps back in frequency cuts power consumption by 40% from maximum frequency
  • Result:
  • A dual-core processor running 400MHz slower than a single-core running flat out operates in the same thermal envelope
  • Substantially lower power consumption with lower frequency


12
AMD Multi-Core Processor
  • The dual-core AMD Opteron processor is 199 mm² in 90nm technology
  • The single-core AMD Opteron processor is 193 mm² in 130nm technology

13
Multi-Core Software
  • More aggregate performance for:
  • Multi-threaded apps
  • Transactions: many instances of the same app
  • Multi-tasking
  • Problem:
  • Most apps are not multithreaded
  • Writing multithreaded code increases software costs dramatically
  • A factor of 3 for the Unreal game engine (Tim Sweeney, Epic Games)

14
First software problem: parallelization
"We are at the cusp of a transition to multicore, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require... I have talked with a few people at Microsoft Research who say this is also at or near the top of their list of critical CS research problems."
Justin Rattner, Senior Fellow, Intel
15
Second software problem: memory hierarchy
  • "The CPU chip industry has now reached the point that instructions can be executed more quickly than the chips can be fed with code and data. Future chip design is memory design. Future software design is also memory design. ... Controlling memory access patterns will drive hardware and software designs for the foreseeable future."
  • Richard Sites, DEC

16
Memory Hierarchy of SGI Octane
  Level       Size                   Access time (cycles)
  Registers   64 registers           -
  L1 cache    32KB (I) + 32KB (D)    2
  L2 cache    1MB                    10
  Memory      128MB                  70
  • R10K processor:
  • 4-way superscalar, 2 floating-point ops/cycle, 195MHz
  • Peak performance: 390 Mflops
  • Experience: sustained performance is less than 10% of peak
  • The processor often stalls waiting for the memory system to load data

17
Memory-wall solutions
  • Latency avoidance:
  • Multi-level memory hierarchies (caches)
  • Latency tolerance:
  • Pre-fetching (see the sketch below)
  • Multi-threading
  • The techniques are not mutually exclusive:
  • Most microprocessors have caches and pre-fetching
  • Modest multi-threading is coming into vogue
  • Our focus: memory hierarchies
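
As a concrete illustration of latency tolerance, here is a minimal C sketch of software pre-fetching using GCC's __builtin_prefetch; the distance of 16 elements is a hypothetical tuning parameter, not a value from the slides:

  /* Latency tolerance via software pre-fetching: request a[i + DIST]
     while working on a[i], so the load overlaps with computation.
     DIST is a hypothetical tuning knob that depends on memory latency. */
  #include <stddef.h>

  #define DIST 16

  double sum_with_prefetch(const double *a, size_t n) {
      double sum = 0.0;
      for (size_t i = 0; i < n; i++) {
          if (i + DIST < n)
              __builtin_prefetch(&a[i + DIST], 0, 0); /* read, low temporal locality */
          sum += a[i];
      }
      return sum;
  }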

18
Hiding latency in numerical codes
  • Most numerical kernels: O(n³) work, O(n²) data
  • All factorization codes
  • Cholesky factorization: A = LL^T (A is spd)
  • LU factorization: A = LU
  • LU factorization with pivoting: PA = LU
  • QR factorization: A = QR (Q is orthogonal)
  • BLAS-3 matrix multiplication
  • Use latency-avoidance techniques
  • Matrix-vector product: O(n²) work, O(n²) data
  • Use latency-tolerance techniques such as pre-fetching
  • Particularly important for iterative solution of large sparse systems

19
Software problem
  • Caches are useful only if programs have locality of reference:
  • Temporal locality: program references to a given memory address are clustered together in time
  • Spatial locality: program references that are clustered in address space are clustered in time
  • Problem:
  • Programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference
  • Worrying about locality when coding algorithms complicates the software process enormously (a small C sketch follows)
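
To make the two notions concrete, a small C sketch (not from the slides): both loops read every element of the same row-major array, but the first walks along rows with unit stride (good spatial locality) while the second walks down columns with stride N (poor spatial locality), so it misses far more often on large arrays:

  #define N 1024
  double a[N][N];                      /* C stores arrays in row-major order */

  double sum_rowwise(void) {           /* good spatial locality */
      double s = 0.0;
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              s += a[i][j];            /* consecutive addresses */
      return s;
  }

  double sum_colwise(void) {           /* poor spatial locality */
      double s = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              s += a[i][j];            /* stride of N doubles */
      return s;
  }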

20
Example: matrix multiplication

  DO I = 1, N    // assume arrays stored in row-major order
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
  • Great algorithmic data reuse: each array element is touched O(N) times!
  • All six loop permutations are computationally equivalent (even modulo round-off error).
  • However, the execution times of the six versions can be very different if the machine has a cache.

21
IJK version (large cache)
  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
  • Large cache scenario:
  • Matrices are small enough to fit into cache
  • Only cold misses, no capacity misses
  • Miss ratio:
  • Data size = 3N² elements
  • Each miss brings in b floating-point numbers
  • The loop nest makes 4N³ memory references, so the miss ratio = (3N²/b) / 4N³ = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)

22
IJK version (small cache)
  DO I = 1, N
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
  • Small cache scenario:
  • Matrices are large compared to the cache; row-major storage
  • Cold and capacity misses
  • Miss ratio:
  • C: N²/b misses (good temporal locality)
  • A: N³/b misses (good spatial locality)
  • B: N³ misses (poor temporal and spatial locality)
  • Miss ratio ≈ (N³/b + N³) / 4N³ = 0.25(b+1)/b = 0.3125 (for b = 4)

23
MMM Experiments
  • Simulated L1 cache miss ratio for the Intel Pentium III
  • MMM with N = 1 ... 1300
  • 16KB cache, 32B/block, 4-way associative, 8-byte elements

24
Quantifying performance differences
  DO I = 1, N    // assume arrays stored in row-major order
    DO J = 1, N
      DO K = 1, N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
  • Octane:
  • L2 cache hit = 10 cycles, cache miss = 70 cycles
  • Time to execute the IKJ version (miss ratio 0.13):
  • 2N³ + 70 × 0.13 × 4N³ + 10 × 0.87 × 4N³ = 73.2N³
  • Time to execute the JKI version (miss ratio 0.5):
  • 2N³ + 70 × 0.5 × 4N³ + 10 × 0.5 × 4N³ = 162N³
  • Speed-up = 162/73.2 ≈ 2.2
  • Key transformation: loop permutation (see the C sketch below)
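
The same effect is easy to reproduce in C, where arrays are also row-major; a hypothetical test harness would time these two variants (IKJ streams through rows of B and C in the inner loop, while JKI strides down columns):

  /* Two loop orders for C = C + A*B on row-major arrays. */
  #define N 512
  static double A[N][N], B[N][N], C[N][N];

  void mmm_ikj(void) {                 /* fast order: unit-stride inner loop */
      for (int i = 0; i < N; i++)
          for (int k = 0; k < N; k++)
              for (int j = 0; j < N; j++)
                  C[i][j] += A[i][k] * B[k][j];
  }

  void mmm_jki(void) {                 /* slow order: stride-N inner loop */
      for (int j = 0; j < N; j++)
          for (int k = 0; k < N; k++)
              for (int i = 0; i < N; i++)
                  C[i][j] += A[i][k] * B[k][j];
  }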

25
Even better...
  • Break MMM into a bunch of smaller MMMs so that the large-cache model holds for each small MMM
  • → the large-cache model is then valid for the entire computation
  • → the miss ratio will be 0.75/(bt) for the entire computation, where t is the tile size

26
Loop tiling
  DO It = 1, N, t
    DO Jt = 1, N, t
      DO Kt = 1, N, t
        DO I = It, It+t-1
          DO J = Jt, Jt+t-1
            DO K = Kt, Kt+t-1
              C(I,J) = C(I,J) + A(I,K)*B(K,J)
  • Break the big MMM into a sequence of smaller MMMs, where each smaller MMM multiplies sub-matrices of size t×t.
  • Parameter t (tile size) must be chosen carefully (see the C sketch below):
  • As large as possible
  • The working set of the small matrix multiplication must fit in cache
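
The same tiling scheme as a minimal C sketch; the tile size T is a hypothetical value that must be tuned so that three T×T blocks fit in cache, and N is assumed to be a multiple of T to keep the sketch short:

  #define N 1200
  #define T 40          /* hypothetical tile size: 3*T*T doubles should fit in cache */
  static double A[N][N], B[N][N], C[N][N];

  void mmm_tiled(void) {
      for (int it = 0; it < N; it += T)
          for (int jt = 0; jt < N; jt += T)
              for (int kt = 0; kt < N; kt += T)
                  /* multiply one TxT block of A by one TxT block of B */
                  for (int i = it; i < it + T; i++)
                      for (int k = kt; k < kt + T; k++)
                          for (int j = jt; j < jt + T; j++)
                              C[i][j] += A[i][k] * B[k][j];
  }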

27
Speed-up from tiling
  • Miss ratio for the blocked computation
  • = miss ratio for the large-cache model
  • = 0.75/(bt)
  • ≈ 0.001 (b = 4, t = 200) for the Octane
  • Time to execute the tiled version:
  • 2N³ + 70 × 0.001 × 4N³ + 10 × 0.999 × 4N³ = 42.3N³
  • Speed-up over the JKI version ≈ 162/42.3 ≈ 4

28
Observations
  • Locality-optimized code is more complex than the high-level algorithm.
  • Loop orders and tile size must be chosen carefully:
  • Cache size is the key parameter
  • Associativity matters
  • Actual code is even more complex: it must also optimize for processor resources
  • Registers: register tiling
  • Pipeline: loop unrolling
  • Optimized MMM code can be 1000 lines of C code

29
One solution to both problems: restructuring compilers (1975-)
  • Programmer writes high-level, architecture-independent code
  • The restructuring compiler optimizes the program for:
  • Number of cores
  • Number of registers
  • Cache organization
  • Instruction set: multiply-add? vector extensions?

30
Two key issues
  • Program restructuring: given a program P, determine a set of equivalent programs P1, P2, P3, ...
  • Program selection: determine which program performs best on the target architecture
31
Automatic parallelization
  • Pessimistic parallelization:
  • The compiler determines a partial order on program operations by determining dependences
  • At run-time, operations execute in parallel, respecting those dependences
  • Works reasonably well for array programs, but not for irregular data structures like trees and graphs
  • Optimistic parallelization (see the sketch below):
  • Execute operations speculatively in parallel, assuming that dependences do not exist
  • Check at runtime whether dependences are violated
  • If so, roll back execution to a safe point and re-execute sequentially
  • Works only if the optimism is warranted
  • There is a lot of interest in transactional memory, which is one model of optimistic parallelization
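
A minimal sketch of the optimistic idea on a single shared word, using C11 atomics (not a real transactional memory system): the update is computed speculatively without a lock, and the compare-and-swap commit fails, forcing a re-execution (the rollback), if another thread changed the value in the meantime:

  /* Optimistic update: read, compute speculatively, commit with CAS. */
  #include <stdatomic.h>

  atomic_long balance;

  void deposit(long amount) {
      long seen, updated;
      do {
          seen = atomic_load(&balance);       /* speculative read */
          updated = seen + amount;            /* compute without locking */
      } while (!atomic_compare_exchange_weak(&balance, &seen, updated));
      /* The CAS fails if balance changed since the read: the dependence
         was violated, so the body re-executes (the rollback). */
  }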

32
Automatic locality enhancement
  • Some methodology exists for array programs, but little is known for irregular programs
  • Many compilers can perform tiling and permutation automatically (e.g., gcc)
  • Choosing parameter values (tile sizes, etc.):
  • The compiler can use architectural models
  • Self-optimizing systems: the system determines the best values using some kind of heuristic search (ATLAS, FFTW); see the sketch below
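
A minimal sketch of the empirical-search idea behind ATLAS-style systems (mmm_tiled_t and the candidate list are hypothetical): time the kernel at several tile sizes and keep the fastest, instead of predicting the best value from a cache model:

  /* Self-optimization by heuristic search over tile sizes. */
  #include <stdio.h>
  #include <time.h>

  extern void mmm_tiled_t(int t);  /* hypothetical tiled MMM taking the tile size at runtime */

  int best_tile_size(void) {
      int candidates[] = {16, 32, 40, 64, 128};
      int best = candidates[0];
      double best_time = 1e30;
      for (int i = 0; i < 5; i++) {
          clock_t start = clock();
          mmm_tiled_t(candidates[i]);
          double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
          if (elapsed < best_time) {
              best_time = elapsed;
              best = candidates[i];
          }
      }
      printf("best tile size: %d\n", best);
      return best;
  }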

33
Course outline
  • Application requirements
  • Scientific and engineering applications
  • Shared-memory programming
  • Memory consistency models
  • OpenMP
  • Optimistic and pessimistic parallelization
  • Dependence analysis techniques for array and
    irregular programs
  • Transactional memory models and implementations
  • Automatic locality enhancement
  • Self-optimizing systems

34
Course work
  • Small number of programming assignments
  • Paper presentations and class participation
  • Substantial course project:
  • Independent reading
  • Implementation work
  • Presentation