CS 395T: Software for Multicore Architectures

Provided by: Ping60

Learn more at: https://www.cs.utexas.edu

1
CS 395T: Software for Multi-core Architectures
2
Administration

Instructor: Keshav Pingali
  4.126A ACES
  pingali_at_cs.utexas.edu
TA: Milind Kulkarni
  4.104 ACES
  milind_at_cs.utexas.edu

3
Course content

Understand high-end programming paradigms, compilers and runtime systems:
  Application requirements
  Shared-memory programming (OpenMP)
  Optimistic and pessimistic parallelization
  Transactional memory
  Dependence analysis
  Memory hierarchy optimization
  Self-optimizing systems
Focus on the software problem for multicore processors

4
Prerequisites

Knowledge of basic computer architecture
Software and math maturity:
  Comfortable with implementing large programs
  Some background in compilers (Dragon book)
  Comfortable with mathematical concepts like linear programming
Ability to read and evaluate papers on current research

5
What is a processor?

A single chip package that fits in a socket
One or more cores:
  Cores can have functional units, cache, etc. associated with them
  Cores can be fast or slow
Shared resources:
  Lower cache levels
  Buses, cache/memory controllers, high-speed serial links, etc.
One system interface no matter how many cores:
  Number of signal pins doesn't scale with number of cores

6
Need for multicore processors

Commercial end-customers are demanding:
  More capable systems with more capable processors
  New systems must stay within existing power/thermal infrastructure
High-level argument:
  Silicon designers can choose a variety of approaches to increase processor performance, but these are maxing out
  Meanwhile, processor frequency and power consumption are scaling in lockstep
  One solution: multicore processors

Material adapted from a presentation by Paul Teich of AMD
7
Conventional approaches to improving performance

Add functional units:
  Superscalar is known territory
  Diminishing returns for adding more functional blocks
  Alternatives like VLIW have been considered and rejected by the market
Wider data paths:
  Increasing bandwidth between functional units in a core makes a difference
  Such as comprehensive 64-bit design, but then where to?

8
Conventional approaches (contd.)

Deeper pipeline:
  A deeper pipeline buys frequency at the expense of increased branch misprediction penalty and cache miss penalty
  Deeper pipelines => higher clock frequency => more power
  Industry converging on middle ground: 9 to 11 stages
  Successful RISC CPUs are in the same range
More cache:
  More cache buys performance until the working set of the program fits in cache

9
Power problem

Moore's Law isn't dead, more transistors for everyone!
  But it doesn't really mention scaling transistor power
Chemistry and physics at nano-scale:
  Stretching materials science
  Transistor leakage current is increasing
As manufacturing economies and frequency increase, power consumption is increasing disproportionately
There are no process quick-fixes

10
Static Current vs. Frequency
Non-linear as processors approach max frequency
[Figure: static current (axis from 0 to 15) plotted against frequency (axis ticks at 1.0 and 1.5), with labels "Fast, High Power" and "Fast, Low Power"]
11
Power vs. Frequency

AMD's process:
  Frequency step: 200 MHz
  Two steps back in frequency cuts power consumption by 40% from maximum frequency
Result:
  A dual-core processor running 400 MHz slower than a single-core processor running flat out operates in the same thermal envelope
  Substantially lower power consumption with lower frequency

12
AMD Multi-Core Processor

Dual-core AMD Opteron processor is 199 mm^2 in 90 nm technology
Single-core AMD Opteron processor is 193 mm^2 in 130 nm technology

13
Multi-Core Software

More aggregate performance for:
  Multi-threaded apps
  Transactions: many instances of same app
  Multi-tasking
Problem:
  Most apps are not multithreaded
  Writing multithreaded code increases software costs dramatically: factor of 3 for the Unreal game engine (Tim Sweeney, Epic Games)

14
First software problem: parallelization
"We are at the cusp of a transition to multicore, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require... I have talked with a few people at Microsoft Research who say this is also at or near the top of their list of critical CS research problems."
Justin Rattner, Senior Fellow, Intel
15
Second software problem: memory hierarchy

"The CPU chip industry has now reached the point that instructions can be executed more quickly than the chips can be fed with code and data. Future chip design is memory design. Future software design is also memory design. ... Controlling memory access patterns will drive hardware and software designs for the foreseeable future."
Richard Sites, DEC

16
Memory Hierarchy of SGI Octane
Level      Size                  Access time (cycles)
Regs       64                    -
L1 cache   32KB (I) + 32KB (D)   2
L2 cache   1MB                   10
Memory     128MB                 70

R10K processor:
  4-way superscalar, 2 floating-point operations (fpo) per cycle, 195 MHz
  Peak performance: 390 Mflops
  Experience: sustained performance is less than 10% of peak
  Processor often stalls waiting for memory system to load data

17
Memory-wall solutions

Latency avoidance:
  Multi-level memory hierarchies (caches)
Latency tolerance:
  Pre-fetching
  Multi-threading
Techniques are not mutually exclusive:
  Most microprocessors have caches and pre-fetching
  Modest multi-threading is coming into vogue
Our focus: memory hierarchies

18
Hiding latency in numerical codes

Most numerical kernels: O(n^3) work, O(n^2) data
  All factorization codes:
    Cholesky factorization: A = LL^T (A is symmetric positive definite)
    LU factorization: A = LU
    LU factorization with pivoting: PA = LU (P is a permutation matrix)
    QR factorization: A = QR (Q is orthogonal)
  BLAS-3: matrix multiplication
  Use latency avoidance techniques
Matrix-vector product: O(n^2) work, O(n^2) data
  Use latency tolerance techniques such as pre-fetching
  Particularly important for iterative solution of large sparse systems
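
One way to read the work/data contrast above is to count arithmetic operations per data element: when that ratio grows with n, data brought into cache can be reused many times, and when it stays constant there is little reuse to exploit. The sketch below is only illustrative. The function names are hypothetical, and the operation counts (about 2n^3 flops on 3n^2 elements for matrix multiplication, about 2n^2 flops on n^2 + 2n elements for a matrix-vector product) are standard estimates assumed here, not figures taken from the slides.

```python
def mmm_flops_per_element(n):
    """Matrix-matrix multiply: about 2n^3 flops over 3n^2 matrix elements."""
    return (2 * n**3) / (3 * n**2)


def mvp_flops_per_element(n):
    """Matrix-vector product: about 2n^2 flops over n^2 + 2n elements."""
    return (2 * n**2) / (n**2 + 2 * n)


# Reuse grows linearly with n for matrix multiplication,
# but stays below 2 for the matrix-vector product.
for n in (10, 100, 1000):
    print(n, round(mmm_flops_per_element(n), 2), round(mvp_flops_per_element(n), 2))
```

Under these assumptions, a cache can pay off for matrix multiplication because each element is needed many times, whereas a matrix-vector product touches most of its data only once, which is why hiding the latency (pre-fetching) is the more natural tool there.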

19
Software problem

Caches are useful only if programs have locality of reference:
  Temporal locality: program references to a given memory address are clustered together in time
  Spatial locality: program references clustered in address space are clustered in time
Problem:
  Programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference
  Worrying about locality when coding algorithms complicates the software process enormously

20
Example: matrix multiplication
DO I = 1, N    // assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

Great algorithmic data reuse: each array element is touched O(N) times!
All six loop permutations are computationally equivalent (even modulo round-off error).
However, execution times of the six versions can be very different if the machine has a cache.
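
The claim that all six loop orders are equivalent, even in floating point, can be checked directly. The sketch below is a hypothetical helper (`mmm` and its `order` argument are names invented here) that runs the same triple loop with the three indices nested in any requested order. For a fixed (i, j), every order still visits k in increasing order, so the additions into C(i,j) happen in the same sequence and the results match bit for bit.

```python
from itertools import permutations, product


def mmm(A, B, order="ijk"):
    """Multiply square matrices with the loops nested as given by `order`.

    order[0] is the outermost loop index and order[2] the innermost.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    # product() varies its last component fastest, like the innermost loop.
    for outer, middle, inner in product(range(n), repeat=3):
        idx = dict(zip(order, (outer, middle, inner)))
        i, j, k = idx["i"], idx["j"], idx["k"]
        C[i][j] += A[i][k] * B[k][j]
    return C


A = [[0.1, 0.2, 0.3], [1.5, -2.25, 4.0], [7.0, 0.5, -1.0]]
B = [[2.0, 0.7, -0.3], [0.9, 1.1, 3.3], [-4.0, 0.25, 0.6]]
reference = mmm(A, B, "ijk")
all_equal = all(mmm(A, B, "".join(p)) == reference for p in permutations("ijk"))
print("all six permutations agree exactly:", all_equal)
```

The helper says nothing about speed; in an interpreted language the cache effects discussed on the slides are largely hidden, so it only demonstrates the equivalence of the results.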

21
IJK version (large cache)
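[Figure: matrices A, B and C, with the K dimension marked on A and B]

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

Large cache scenario:
  Matrices are small enough to fit into cache
  Only cold misses, no capacity misses
Miss ratio:
  Data size = 3N^2
  Each miss brings in b floating-point numbers
  Total memory references = 4N^3 (three loads and one store per iteration)
  Miss ratio = 3N^2 / (b * 4N^3) = 0.75/(bN) = 0.019 (b = 4, N = 10)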

22
IJK version (small cache)
[Figure: matrices A, B and C, with the K dimension marked on A and B]

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

Small cache scenario:
  Matrices are large compared to cache; row-major storage
  Cold and capacity misses
Miss ratio:
  C: N^2/b misses (good temporal locality)
  A: N^3/b misses (good spatial locality)
  B: N^3 misses (poor temporal and spatial locality)
  Miss ratio ≈ (N^3/b + N^3) / (4N^3) = 0.25 (b+1)/b = 0.3125 (for b = 4)
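
The large-cache and small-cache estimates for the IJK loop nest can be checked with a toy cache simulation. The sketch below is an assumption-laden model rather than a description of any real machine: the cache is fully associative with LRU replacement, the three arrays are laid out back to back in row-major order, N^2 is a multiple of the line size b, and each iteration issues four references (load C, load A, load B, store C). The function name `ijk_miss_ratio` and its parameters are hypothetical.

```python
from collections import OrderedDict


def ijk_miss_ratio(n, b, cache_lines):
    """Simulate the IJK loop nest on a fully associative LRU cache.

    n: matrix dimension, b: elements per cache line,
    cache_lines: cache capacity in lines.
    """
    cache = OrderedDict()  # line number -> True, least recently used first
    misses = 0
    accesses = 0

    def touch(addr):
        nonlocal misses, accesses
        accesses += 1
        line = addr // b
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used line

    base_a, base_b, base_c = 0, n * n, 2 * n * n
    for i in range(n):
        for j in range(n):
            for k in range(n):
                touch(base_c + i * n + j)  # load C(i,j)
                touch(base_a + i * n + k)  # load A(i,k)
                touch(base_b + k * n + j)  # load B(k,j)
                touch(base_c + i * n + j)  # store C(i,j)
    return misses / accesses


large = ijk_miss_ratio(10, 4, 1000)  # everything fits: cold misses only
small = ijk_miss_ratio(32, 4, 8)     # cache far smaller than one row of A
print(f"large cache: {large:.4f}, small cache: {small:.4f}")
```

With b = 4 and N = 10 the large-cache run should give 75 misses in 4000 references, matching 0.75/(bN); the small-cache run lands slightly above 0.25(b+1)/b because the simulation also counts the N^2/b misses on C that the slide's approximation drops.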

23
MMM Experiments

Simulated L1 cache miss ratio for Intel Pentium III
MMM with N 11300
16KB cache, 32B/block, 4-way set-associative, 8-byte elements

24
Quantifying performance differences

DO I = 1, N    // assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

Octane:
  L2 cache hit: 10 cycles, cache miss: 70 cycles
Time to execute IKJ version:
  2N^3 + 70*0.13*4N^3 + 10*0.87*4N^3 = 73.2 N^3
Time to execute JKI version:
  2N^3 + 70*0.5*4N^3 + 10*0.5*4N^3 = 162 N^3
Speed-up = 2.2
Key transformation: loop permutation
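
The timing estimates on this slide follow a simple cost model that can be written down as a function of the miss ratio. The sketch below is one reading of that model, with hypothetical names: 2 cycles of arithmetic per iteration (the 2N^3 term) plus 4 memory references per iteration, each costing 70 cycles on a miss and 10 cycles on a hit. The miss ratios 0.13 and 0.5 are the ones used on the slide.

```python
def cycles_per_n3(miss_ratio, flops=2, refs=4, hit_cycles=10, miss_cycles=70):
    """Estimated cycles per inner-loop iteration (the coefficient of N^3)."""
    memory = refs * (miss_cycles * miss_ratio + hit_cycles * (1.0 - miss_ratio))
    return flops + memory


ikj = cycles_per_n3(0.13)  # IKJ version
jki = cycles_per_n3(0.5)   # JKI version
print(f"IKJ: {ikj:.1f} N^3, JKI: {jki:.1f} N^3, speed-up: {jki / ikj:.1f}")
```

Run as written, this reproduces the slide's 73.2 N^3, 162 N^3 and speed-up of 2.2. The same function can be evaluated at other miss ratios to see how sensitive the estimate is to cache behavior.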

25
Even better...

Break MMM into a bunch of smaller MMMs so that the large cache model is true for each small MMM
  => large cache model is valid for entire computation
  => miss ratio will be 0.75/(bt) for entire computation, where t is the tile size

26
Loop tiling
DO It = 1, N, t
  DO Jt = 1, N, t
    DO Kt = 1, N, t
      DO I = It, It+t-1
        DO J = Jt, Jt+t-1
          DO K = Kt, Kt+t-1
            C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: matrices A, B and C divided into t x t tiles, with tile origins It, Jt, Kt and element indices I, J, K marked]

Break big MMM into a sequence of smaller MMMs where each smaller MMM multiplies sub-matrices of size t x t.
Parameter t (tile size) must be chosen carefully:
  As large as possible
  Working set of small matrix multiplication must fit in cache
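
The tiled loop nest can be sketched in runnable form as follows. This is a minimal illustration, not the course's reference code: the function names are hypothetical, indices are 0-based, and the `min(...)` bounds are an addition that lets N be any size rather than a multiple of t. Because k still increases monotonically for each (i, j), the tiled version should return exactly the same values as the plain triple loop.

```python
def mmm_naive(A, B):
    """Plain IJK matrix multiplication for square matrices."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def mmm_tiled(A, B, t):
    """Tiled matrix multiplication: a sequence of t x t sub-matrix products."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for it in range(0, n, t):              # tile loops
        for jt in range(0, n, t):
            for kt in range(0, n, t):
                # Small MMM on one tile triple; min() clips partial edge tiles.
                for i in range(it, min(it + t, n)):
                    for j in range(jt, min(jt + t, n)):
                        for k in range(kt, min(kt + t, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C


n = 5
A = [[0.5 * i + j for j in range(n)] for i in range(n)]
B = [[1.25 * i - j for j in range(n)] for i in range(n)]
tiled_matches = mmm_tiled(A, B, 2) == mmm_naive(A, B)
print("tiled result matches naive result:", tiled_matches)
```

In pure Python the benefit of tiling is not visible in timings; the point of the sketch is the loop structure, which carries over directly to compiled languages where the working set of each small multiplication can stay in cache.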

27
Speed-up from tiling

Miss ratio for block computation
  = miss ratio for large cache model
  = 0.75/(bt)
  ≈ 0.001 (b = 4, t = 200) for Octane
Time to execute tiled version:
  2N^3 + 70*0.001*4N^3 + 10*0.999*4N^3 = 42.2 N^3
Speed-up over JKI version = 162/42.2 ≈ 3.8

28
Observations

Locality-optimized code is more complex than the high-level algorithm.
Loop orders and tile size must be chosen carefully:
  Cache size is the key parameter
  Associativity matters
Actual code is even more complex: must optimize for processor resources
  Registers: register tiling
  Pipeline: loop unrolling
Optimized MMM code can be 1000 lines of C code

29
One solution to both problems: restructuring compilers (1975-)

Programmer writes high-level, architecture-independent code
Restructuring compiler optimizes program for:
  Number of cores
  Number of registers
  Cache organization
  Instruction set: mul-add? vector extensions?

30
Two key issues
[Figure: program P and equivalent programs P1, P2, P3, with steps labeled 1 and 2]

Program restructuring: given program P, determine set of equivalent programs P1, P2, P3, ...
Program selection: determine which program performs best on target architecture

31
Automatic parallelization

Pessimistic parallelization:
  Compiler determines partial order on program operations by determining dependences
  At run-time, execute operations in parallel, respecting dependences
  Works reasonably well for array programs but not for irregular data structures like trees and graphs
Optimistic parallelization:
  Execute operations speculatively in parallel, assuming that dependences do not exist
  Check at run-time if dependences are violated
  If so, roll back execution to safe point and re-execute sequentially
  Works only if optimism is warranted
  Lots of interest in transactional memory, which is one model of optimistic parallelization
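
The speculate, check, roll back cycle of optimistic parallelization can be illustrated with a toy model. The sketch below is not a transactional memory implementation and does not use real threads: all names (`TrackedView`, `run_optimistically`, the example tasks) are hypothetical, and "parallel" execution is simulated by running every task against the same initial snapshot. Each task logs what it reads and buffers what it writes; at commit time, in program order, a task whose reads overlap an earlier task's writes is discarded and re-executed on the up-to-date state.

```python
class TrackedView:
    """Gives a task access to shared state while logging reads and buffering writes."""

    def __init__(self, base):
        self.base = base
        self.reads = set()
        self.writes = {}

    def get(self, key):
        if key in self.writes:       # a task sees its own earlier writes
            return self.writes[key]
        self.reads.add(key)
        return self.base[key]

    def set(self, key, value):
        self.writes[key] = value     # buffered until commit


def run_optimistically(state, tasks):
    """Run tasks speculatively, then validate and commit in program order.

    Returns the number of tasks that had to be rolled back and re-executed.
    """
    snapshot = dict(state)
    views = []
    for task in tasks:               # speculative phase: everyone sees the snapshot
        view = TrackedView(snapshot)
        task(view)
        views.append(view)

    written = set()
    reexecuted = 0
    for task, view in zip(tasks, views):
        if view.reads & written:     # dependence violated: speculation was wrong
            view = TrackedView(state)  # roll back by discarding the buffered writes
            task(view)               # re-execute on the current state
            reexecuted += 1
        state.update(view.writes)    # commit
        written.update(view.writes)
    return reexecuted


def add_a_and_b(v):
    v.set("c", v.get("a") + v.get("b"))


def reset_a(v):
    v.set("a", 10)


def double_c(v):
    v.set("b", v.get("c") * 2)       # depends on the write in add_a_and_b


demo_state = {"a": 1, "b": 2, "c": 0}
demo_reexecuted = run_optimistically(demo_state, [add_a_and_b, reset_a, double_c])
print(demo_state, "re-executed:", demo_reexecuted)
```

In this example only the third task reads a value written by an earlier one, so it is the only task re-executed, and the final state matches what sequential execution would produce. When most tasks conflict, nearly everything is re-executed, which is the sense in which the approach "works only if optimism is warranted".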

32
Automatic locality enhancement

Some methodology exists for array programs but little is known for irregular programs
Many compilers can perform tiling and permutation automatically (gcc)
Choosing parameter values (tile sizes etc.):
  Compiler can use architectural models
  Self-optimizing systems: system determines best values using some kind of heuristic search (ATLAS, FFTW)
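
The self-optimizing idea, choosing a parameter by measuring rather than by modeling, can be sketched as a small empirical search. The code below is a deliberately simple stand-in: it times a tiled matrix multiplication for every candidate tile size and keeps the fastest, which is an exhaustive search rather than the more elaborate heuristics real systems use. The names `tiled_mmm` and `pick_tile_size`, the test matrices and the candidate list are all assumptions made for illustration.

```python
import time


def tiled_mmm(A, B, t):
    """Tiled square matrix multiplication with tile size t."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for it in range(0, n, t):
        for jt in range(0, n, t):
            for kt in range(0, n, t):
                for i in range(it, min(it + t, n)):
                    for j in range(jt, min(jt + t, n)):
                        for k in range(kt, min(kt + t, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C


def pick_tile_size(n, candidates, repeats=3):
    """Time each candidate tile size and return (best_size, timings)."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for t in candidates:
        best = float("inf")
        for _ in range(repeats):     # keep the fastest of several runs
            start = time.perf_counter()
            tiled_mmm(A, B, t)
            best = min(best, time.perf_counter() - start)
        timings[t] = best
    return min(timings, key=timings.get), timings


best_t, timings = pick_tile_size(16, [2, 4, 8])
print("selected tile size:", best_t)
```

In pure Python the measured differences mostly reflect interpreter overhead rather than cache behavior, so the selected value is not meaningful in itself; the structure (generate candidates, measure on the target machine, keep the best) is the part that carries over.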

33
Course outline