1
Parallel Sorting
  • James Demmel
  • www.cs.berkeley.edu/~demmel/cs267_Spr10

2
Some Sorting algorithms
  • Choice of algorithm depends on
  • Is the data all in memory, or on disk/tape?
  • Do we compare operands, or just use their values?
  • Sequential or parallel? Shared or distributed
    memory? SIMD or not?
  • We will consider all data in memory and parallel
  • Bitonic Sort
  • Naturally parallel, suitable for SIMD
    architectures
  • Sample Sort
  • Good generalization of Quick Sort to parallel
    case
  • Radix Sort
  • Data measured on CM-5
  • Written in Split-C (precursor of UPC)
  • www.cs.berkeley.edu/~culler/papers/sort.ps

3
LogP model to predict, understand performance
(Diagram: P processor/memory modules connected by an interconnection network of limited volume (at most L/g messages in flight to or from any one processor), annotated with o (overhead), g (gap), and L (latency).)
  • L: latency in sending a (small) message between
    modules
  • o: overhead felt by the processor on sending or
    receiving a message
  • g: gap between successive sends or receives (1/bandwidth)
  • P: number of processors

4
Bottom Line on CM-5 using Split-C (Preview)
(Chart: time per key for each algorithm as the number of processors varies.)
  • Good fit between predicted (using LogP model) and
    measured (within 10%)
  • No single algorithm always best
  • scaling by processor, input size, sensitivity
  • All are global / local hybrids
  • the local part is hard to implement and model

5
Bitonic Sort (1/2)
  • A bitonic sequence is one that is
  • Monotonically increasing and then monotonically
    decreasing
  • Or can be circularly shifted to satisfy this
  • A half-cleaner takes a bitonic sequence and
    produces
  • Every element in the first half is no larger than the
    smallest element in the 2nd half
  • Both halves are bitonic
  • Apply recursively to each half to complete
    sorting
  • Where do we get a bitonic sequence to start with?

6
Bitonic Sort (2/2)
  • A bitonic sequence is one that is
  • Monotonically increasing and then monotonically
    decreasing
  • Or can be circularly shifted to satisfy this
  • Any sequence of length 2 is bitonic
  • So we can sort it using bitonic sorter on last
    slide
  • Sort all length-2 sequences in alternating
    increasing/decreasing order
  • Two such length-2 sequences form a length-4
    bitonic sequence
  • Recursively:
  • Sort length-2^k sequences in alternating
    increasing/decreasing order
  • Two such length-2^k sequences form a length-2^(k+1)
    bitonic sequence (that can be sorted); see the sketch below
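A minimal sequential sketch of this recursion, assuming the input length is a power of two (the function names are illustrative and are not taken from the Split-C code used in the measurements):

    /* Compare-and-swap so that a[i] <= a[j] when dir == 1 (ascending). */
    static void cmp_swap(int a[], int i, int j, int dir) {
        if ((a[i] > a[j]) == dir) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
        }
    }

    /* Half-cleaner plus recursion: sorts a bitonic sequence of length n. */
    static void bitonic_merge(int a[], int lo, int n, int dir) {
        if (n > 1) {
            int m = n / 2;
            for (int i = lo; i < lo + m; i++)
                cmp_swap(a, i, i + m, dir);   /* the half-cleaner */
            bitonic_merge(a, lo, m, dir);     /* both halves are bitonic */
            bitonic_merge(a, lo + m, m, dir);
        }
    }

    /* Sort the two halves in opposite directions, producing a bitonic
       sequence, then merge it. */
    static void bitonic_sort(int a[], int lo, int n, int dir) {
        if (n > 1) {
            int m = n / 2;
            bitonic_sort(a, lo, m, 1);        /* increasing half */
            bitonic_sort(a, lo + m, m, 0);    /* decreasing half */
            bitonic_merge(a, lo, n, dir);
        }
    }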

7
Bitonic Sort for n = 16, all dependencies shown
Similar pattern as FFT; similar optimizations
possible
Block Layout
8
Bitonic Sort time per key
(Graphs: predicted (LogP model) and measured time in µs per key, roughly 0 to 80, versus N/P from 16384 to 1048576, for P = 32, 64, 128, 256, and 512 processors.)
9
Sample Sort
1. Compute P-1 values of keys that would split the input into roughly
   equal pieces: take S = 64 samples per processor; sort the P·S sampled
   keys (on processor 0); let keys S, 2S, ..., (P-1)·S be the splitters;
   broadcast the splitters.
2. Distribute keys based on the splitters: keys with
   Splitter(i-1) < key ≤ Splitter(i) are all sent to proc i.
3. Local sort of keys on each proc.
4. Possibly redistribute, so each proc has N/P keys.
If the samples represent the total population, then the splitters should
divide the population into P roughly equal pieces (see the sketch below).
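A minimal serial sketch of steps 1 and 2, choosing splitters from the sorted samples and computing each key's destination processor (the names choose_splitters and dest_proc are illustrative; a real implementation gathers the samples and broadcasts the splitters with message passing):

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* samples: the P*S sampled keys, S from each of the P processors.
       Fills splitter[0..P-2]; keys in (splitter[i-1], splitter[i]] go to proc i. */
    static void choose_splitters(int *samples, int P, int S, int *splitter) {
        qsort(samples, (size_t)P * S, sizeof(int), cmp_int);   /* sort P*S keys */
        for (int i = 1; i < P; i++)
            splitter[i - 1] = samples[i * S];   /* keys at ranks S, 2S, ..., (P-1)S */
    }

    /* Destination processor of one key: first bucket whose splitter is >= key. */
    static int dest_proc(int key, const int *splitter, int P) {
        int p = 0;
        while (p < P - 1 && key > splitter[p]) p++;
        return p;
    }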
10
Sample Sort Times
(Graphs: predicted and measured time in µs per key, roughly 0 to 30, versus N/P from 16384 to 1048576, for P = 32, 64, 128, 256, and 512 processors.)
11
Sample Sort Timing Breakdown
Predicted and Measured (-m) times
12
Sequential Radix Sort: Counting Sort
  • Idea: build a histogram of the keys and compute the
    position in the answer array for each element
  • A = [3, 5, 4, 1, 3, 4, 1, 4]
  • Make temp array B, and write values into position
  • B = [1, 1, 3, 3, 4, 4, 4, 5]
  • Cost: O(#keys + size of histogram)
  • What if the histogram is too large (e.g., all 32-bit ints?
    All words?)

13
Radix Sort: Separate Key Into Parts
  • Divide keys into parts, e.g., by digits (radix)
  • Use counting sort on each such part
  • Start with the least-significant part (example below)

Original    sort on 3rd character    sort on 2nd character    sort on 1st character
sat         run                      sat                      pin
saw         pin                      saw                      run
tip         tip                      pin                      sat
run         sat                      tip                      saw
pin         saw                      run                      tip
  • Cost: O(#keys × #characters); a sequential sketch follows
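A minimal sketch of this least-significant-digit radix sort for the fixed-length keys in the table above, using a stable counting sort per character position (ordinary sequential C; W, N, R, and counting_pass are illustrative names):

    #include <stdio.h>
    #include <string.h>

    #define W 3      /* characters per key */
    #define N 5      /* number of keys     */
    #define R 256    /* radix: byte values */

    /* One stable counting-sort pass on character position pos. */
    static void counting_pass(char keys[N][W + 1], int pos) {
        char out[N][W + 1];
        int count[R + 1] = {0};
        for (int i = 0; i < N; i++)                    /* histogram */
            count[(unsigned char)keys[i][pos] + 1]++;
        for (int r = 0; r < R; r++)                    /* prefix sums -> start positions */
            count[r + 1] += count[r];
        for (int i = 0; i < N; i++)                    /* stable distribution */
            strcpy(out[count[(unsigned char)keys[i][pos]]++], keys[i]);
        memcpy(keys, out, sizeof(out));
    }

    int main(void) {
        char keys[N][W + 1] = {"sat", "saw", "tip", "run", "pin"};
        for (int pos = W - 1; pos >= 0; pos--)         /* least-significant first */
            counting_pass(keys, pos);
        for (int i = 0; i < N; i++)
            printf("%s\n", keys[i]);                   /* pin run sat saw tip */
        return 0;
    }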

14
Histo-radix sort
  • Per pass:
  • 1. compute local histogram
    (r-bit keys, 2^r bins)
  • 2. compute position of the 1st member of each
    bucket in the global array
    (2^r scans with end-around; sketch of this step below)
  • 3. distribute all the keys
  • Only r = 4, 8, 11, 16 make sense
    for sorting 32-bit numbers

(Figure: P processors, each holding n = N/P keys and a local histogram of 2^r bins; arrows indicate steps 2 and 3.)
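A minimal serial sketch of step 2, computing where each processor's first key of each bucket lands in the global array, given the P local histograms (array and function names are illustrative; on a real machine this is one scan across processors per bin):

    /* hist[p][b]   = number of keys on processor p that fall in bucket b.
       offset[p][b] = global position of the first such key after the pass.
       Buckets are laid out in increasing order; within a bucket, processor
       0's keys come first, then processor 1's, and so on.                  */
    static void global_offsets(int P, int B, int hist[][B], int offset[][B]) {
        int pos = 0;
        for (int b = 0; b < B; b++)          /* one scan per bin, wrapping to the next bin */
            for (int p = 0; p < P; p++) {
                offset[p][b] = pos;
                pos += hist[p][b];
            }
    }

Here B = 2^r, the number of histogram bins per pass.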
15
Histo-Radix Sort (again)
(Figure: local data and local histograms on each of the P processors.)
Each pass: form local histograms; form the global
histogram; globally distribute the data
16
Radix Sort Times
(Graph: time per key versus N/P for P = 32, 64, 128, 256, and 512 processors.)
17
Radix Sort Timing Breakdown
Predicted and Measured (-m) times
18
Local Sort Performance on CM-5
Entropy = -Σ_i p_i log2 p_i, where p_i = probability of
key i; ranges from 0 to log2(#different keys)
(11-bit radix sort of 32-bit numbers)
19
Radix Sort Timing dependence on Key distribution
Entropy = -Σ_i p_i log2 p_i, where p_i = probability of
key i; ranges from 0 to log2(#different keys)
Slowdown due to contention in redistribution
20
Bottom Line on CM-5 using Split-C
(Chart: time per key for each algorithm as the number of processors varies.)
  • Good fit between predicted (using LogP model) and
    measured (within 10%)
  • No single algorithm always best
  • scaling by processor, input size, sensitivity
  • All are global / local hybrids
  • the local part is hard to implement and model

21
Sorting Conclusions
  • Distributed memory model leads to hybrid global /
    local algorithm
  • Use best local algorithm combined with global
    part
  • LogP model is good enough to model global part
  • bandwidth (g) or overhead (o) matter most
  • including end-point contention
  • latency (L) only matters when bandwidth doesn't
  • Modeling local computational performance is
    harder
  • dominated by effects of the storage hierarchy (e.g.,
    TLBs),
  • depends on entropy
  • See http://www.cs.berkeley.edu/~culler/papers/sort.ps
  • See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps
  • disk-to-disk parallel sorting

22
Extra slides
23
Radix: Stream Broadcast Problem
  • Processor 0 does only sends
  • Others receive then send
  • Receives prioritized over sends
  • => Processor 0 needs to be delayed

n messages: (P-1) · (2o + L + (n-1)·g) ?
Need to slow first processor to pipeline well
24
What's the right communication mechanism?
  • Permutation via writes
  • consistency model?
  • false sharing?
  • Reads?
  • Bulk Transfers?
  • what do you need to change in the algorithm?
  • Network scheduling?

25
Comparison
  • Good fit between predicted and measured (within 10%)
  • Different sorts for different sorts
  • scaling by processor, input size, sensitivity
  • All are global / local hybrids
  • the local part is hard to implement and model

26
Outline
  • Some Performance Laws
  • Performance analysis
  • Performance modeling
  • Parallel Sorting: combining models with
    measurements
  • Reading
  • Chapter 3 of Foster's "Designing and Building
    Parallel Programs" online text
  • http://www-unix.mcs.anl.gov/dbpp/text/node26.html
  • Abbreviated as DBPP in this lecture
  • David Bailey's "Twelve Ways to Fool the Masses"

27
Measuring Performance
  • Performance criterion may vary with domain
  • There may be limits on acceptable running time
  • E.g., a climate model must run 1000x faster than
    real time.
  • Any performance improvement may be acceptable
  • E.g., faster on 4 cores than on 1.
  • Throughput may be more critical than latency
  • E.g., number of images processed per minute
    (throughput) vs. total delay for one image
    (latency) in a pipelined system.
  • Execution time per unit cost
  • E.g., GFlop/s per dollar, or GFlop/s per Watt
  • Parallel scalability (speedup or parallel
    efficiency)
  • Percent relative to best possible (some kind of
    peak)

28
Amdahl's Law (review)
  • Suppose only part of an application seems
    parallel
  • Amdahl's law:
  • let s be the fraction of work done sequentially,
    so (1-s) is the fraction parallelizable
  • P = number of processors

Speedup(P) = Time(1)/Time(P)
           ≤ 1/(s + (1-s)/P) ≤ 1/s
  • Even if the parallel part speeds up perfectly,
    performance is limited by the sequential
    part
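(Worked example, not on the original slide: with s = 0.05 and P = 1024, Speedup ≤ 1/(0.05 + 0.95/1024) ≈ 19.6, far below 1024.)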

29
Amdahl's Law (for 1024 processors)
Does this mean parallel computing is a hopeless
enterprise?
Source: Gustafson, Montry, Benner
30
Scaled Speedup
See Gustafson, Montry, and Benner, "Development of
Parallel Methods for a 1024-Processor Hypercube",
SIAM J. Sci. Stat. Comput. 9(4), 1988, p. 609.
31
Scaled Speedup (background)
32
Little's Law
  • Latency vs. Bandwidth
  • Latency is physics (wire length)
  • e.g., the network latency on the Earth Simulator
    is only about 2x the speed-of-light travel time across
    the machine room
  • Bandwidth is cost
  • add more cables to increase bandwidth
    (over-simplification)
  • Principle (Little's Law): the relationship for a
    production system in steady state is
  • Inventory = Throughput × Flow Time
  • For parallel computing, Little's Law gives the
    concurrency required to be limited by bandwidth
    rather than latency
  • Required concurrency = Bandwidth × Latency

  • (the bandwidth-delay product)
  • For parallel computing, this means
  • Concurrency = bandwidth × latency

33
Little's Law
  • Example 1: a single processor
  • If the latency to memory is 50 ns, and the
    bandwidth is 5 GB/s (0.2 ns/byte, or 12.8 ns per
    64-byte cache line)
  • The system must support 50/12.8 ≈ 4 outstanding
    cache-line misses to keep things balanced (run at
    bandwidth speed)
  • An application must be able to prefetch 4 cache-line
    misses in parallel (without dependencies
    between them)
  • Example 2: a 1000-processor system
  • 1 GHz clock, 100 ns memory latency, 100 words of
    memory in data paths between CPU and memory.
  • Main memory bandwidth is
  • 1000 x 100 words x 10^9/s = 10^14
    words/sec.
  • To achieve full performance, an application
    needs
  • 10^-7 s x 10^14 words/s = 10^7-way concurrency
  • (some of that may be hidden in the
    instruction stream)

34
In Performance Analysis Use more Data
  • Whenever possible, use a large set of data rather
    than one or a few isolated points. A single point
    has little information.
  • E.g., from DBPP
  • Serial algorithm scales as N + N^2
  • Speedup of 10.8 on 12 processors with problem
    size N = 100
  • Case 1: T = N + N^2/P
  • Case 2: T = (N + N^2)/P + 100
  • Case 3: T = (N + N^2)/P + 0.6·P^2
  • All have speedup ≈ 10.8 on 12 procs
  • Performance graphs (N = 100, 1000) show
    differences in scaling
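(Check, not on the original slide: at N = 100, P = 12 the serial time is N + N^2 = 10100, and the three models give T ≈ 933, 942, and 928 respectively, i.e., speedups of roughly 10.8 in every case, even though the three formulas scale very differently as N and P grow.)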

35
Example Immersed Boundary Simulation
  • Using Seaborg (Power3) at NERSC and DataStar
    (Power4) at SDSC
  • How useful is this data? What are ways to make
    it more useful/interesting?

Joint work with Ed Givelberg, Armando Solar-Lezama
36
Performance Analysis
37
Building a Performance Model
  • Based on measurements/scaling of components
  • FFT time is
  • (5 n log n flops) / (flops/sec measured for the FFT)
  • Other costs are linear in either material or
    fluid points
  • Measure constants:
  • flops/point (independent of machine or problem
    size)
  • flops/sec (measured per machine, per phase)
  • Time = a + b × points
  • Communication done similarly
  • Find a formula for message size as a function of
    problem size
  • Check the formula using tracing of some kind
  • Use the a/b model to predict running time:
    a + b × size (see the sketch below)
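A minimal sketch of fitting and using such a linear a + b·size model from two measured points (the type and function names are illustrative):

    /* Fit time = a + b * size from two measured (size, time) points,
       then use the model to predict the time for a new problem size. */
    typedef struct { double a, b; } LinModel;

    static LinModel fit_ab(double size1, double t1, double size2, double t2) {
        LinModel m;
        m.b = (t2 - t1) / (size2 - size1);   /* marginal cost per point       */
        m.a = t1 - m.b * size1;              /* fixed overhead (startup cost) */
        return m;
    }

    static double predict(LinModel m, double size) {
        return m.a + m.b * size;
    }

Separate (a, b) pairs would be fit per phase and per machine, as the slide describes.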

38
A Performance Model
  • 512^3 in < 1 second per timestep is not possible
  • Primarily limited by bisection bandwidth

39
Model Success and Failure
40
OSKI SPMV What We Expect
  • Assume
  • Cost(SpMV) ≈ time to read the matrix
  • 1 double word = 2 integers
  • r, c in {1, 2, 4, 8}
  • CSR: 1 int / non-zero
  • BCSR(r x c): 1 int / (r·c non-zeros)
  • As r·c increases, speedup should
  • Increase smoothly
  • Approach 1.5x (eliminating all index overhead)
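(Worked note, not on the original slide: since one double word equals two integers, CSR reads 1 + 0.5 = 1.5 double words per nonzero while BCSR(r x c) reads 1 + 0.5/(r·c), so the ratio 1.5/(1 + 0.5/(r·c)) approaches 1.5 as r·c grows, which is the expected asymptotic speedup.)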

41
What We Get (The Need for Search)
42
Using Multiple Models
43
Multiple Models
44
Multiple Models
45
Extended Example: Using Performance Modeling (LogP)
to Explain Data. Application to Sorting
46
Deriving the LogP Model
  • Processing
  • powerful microprocessor, large DRAM, cache
    => P
  • Communication
  • significant latency (100's to 1000's of
    cycles) => L
  • limited bandwidth (1% to 5% of memory bw) => g
  • significant overhead (10's to 100's of
    cycles) => o
  • - on both ends
  • no consensus on topology
  • => should not exploit structure
  • limited network capacity
  • no consensus on programming model
  • => should not enforce one

47
LogP
(Diagram: P processor/memory modules connected by an interconnection network of limited volume (at most L/g messages in flight to or from any one processor), annotated with o (overhead), g (gap), and L (latency).)
  • L: latency in sending a (small) message between
    modules
  • o: overhead felt by the processor on sending or
    receiving a message
  • g: gap between successive sends or receives (1/bandwidth)
  • P: number of processors

48
Using the LogP Model
(Timeline: a send costs overhead o, then latency L, then receive overhead o; successive sends or receives are separated by the gap g.)
  • Send n messages from proc to proc in time
  • 2o + L + g·(n-1)
  • each processor does o·n cycles of overhead
  • and has (g-o)·(n-1) + L available compute cycles
  • Send n total messages from one to many
  • in the same time
  • Send n messages from many to one
  • in the same time
  • all but L/g processors block
  • so fewer available cycles, unless scheduled
    carefully (see the sketch below)

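A minimal sketch of these two formulas in code (the struct and function names are illustrative; the parameter values would come from measurements such as the CM-5 numbers on a later slide):

    /* LogP parameters: latency, overhead, and gap, e.g., in microseconds. */
    typedef struct { double L, o, g; } LogP;

    /* Time until the last of n small messages is received: 2o + L + g*(n-1). */
    static double send_n_time(LogP m, int n) {
        return 2.0 * m.o + m.L + m.g * (n - 1);
    }

    /* Compute cycles the sender has free while the messages drain:
       (g - o) per message after the first, plus the latency of the last. */
    static double sender_slack(LogP m, int n) {
        return (m.g - m.o) * (n - 1) + m.L;
    }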
49
Use of the LogP Model (cont)
  • Two processors sending n words to each other
    (i.e., exchange)
  • 2o + L + max(g, 2o)·(n-1)
  • P processors each sending n words to all
    processors (n/P each) in a static, balanced
    pattern without conflicts , e.g., transpose, fft,
    cyclic-to-block, block-to-cyclic
  • exercise: what's wrong with the formula above?

Assumes optimal pattern of send/receive, so could
underestimate time
50
LogP "philosophy"
  • Think about
  • mapping of N words onto P processors
  • computation within a processor, its cost, and
    balance
  • communication between processors, its cost,
    and balance
  • given a characterization of processor and network
    performance
  • Do not think about what happens within the
    network

This should be good enough!
51
Typical Sort
  • Exploits the n = N/P grouping
  • Significant local computation
  • Very general global communication /
    transformation
  • Computation of the transformation

52
Costs of Split-C (UPC predecessor) Operations
  • Read, Write: x = *G, *G = x : 2·(L + 2o)
  • Store: *G := x : L + 2o
  • Get: x := *G : o
  • .... 2L + 2o
  • sync(): o
  • with interval g
  • Bulk store (n words with w words/message):
  • 2o + (n-1)·g + L
  • Exchange: 2o + 2L + (⌈n/w⌉ - 1 - L/g)·max(g, 2o)
  • One to many
  • Many to one

53
LogP model
  • CM-5
  • L = 6 µs
  • o = 2.2 µs
  • g = 4 µs
  • P varies from 32 to 1024
  • NOW
  • L = 8.9 µs
  • o = 3.8 µs
  • g = 12.8 µs
  • P varies up to 100
  • What is the processor performance?
  • Application-specific
  • 10s of Mflops for these machines

54
LogP Parameters Today
55
Local Computation Parameters - Empirical
Sort     Parameter    Operation                          µs per key
Bitonic  swap         Simulate cycle butterfly per key   0.025 lg N
Bitonic  mergesort    Sort a bitonic sequence            1.0
Bitonic  scatter      Move key for cyclic-to-block       0.46
Bitonic  gather       Move key for block-to-cyclic       0.52 if n < 64K or P < 64, 1.1 otherwise
Column   local sort   Local radix sort (11 bit)          4.5 if n < 64K, 9.0 - (281000/n) otherwise
Column   merge        Merge sorted lists                 1.5
Column   copy         Shift key                          0.5
Radix    zero         Clear histogram bin                0.2
Radix    hist         Produce histogram                  1.2
Radix    add          Produce scan value                 1.0
Radix    bsum         Adjust scan of bins                2.5
Radix    address      Determine destination              4.7
Radix    compare      Compare key to splitter            0.9
Sample   localsort8   Local radix sort of samples        5.0
56
Odd-Even Merge - classic parallel sort
N values to be sorted
Treat as two lists of M = N/2
Sort each separately
A0 A1 A2 A3 ... AM-1
B0 B1 B2 B3 ... BM-1
Redistribute into even and odd sublists
A0 A2 ... AM-2
A1 A3 ... AM-1
B0 B2 ... BM-2
B1 B3 ... BM-1
Merge into two sorted lists
E0 E1 E2 E3 ... EM-1
O0 O1 O2 O3 ... OM-1
Pairwise swaps of Ei and Oi will put it in order (see the sketch below)
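A minimal sequential sketch of this odd-even merge and the sort built on it (the classic recursive formulation for power-of-two n; this is not the Split-C code that was measured on the CM-5):

    /* Exchange a[i] and a[j] if they are out of order. */
    static void compare_swap(int a[], int i, int j) {
        if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
    }

    /* Merge a[lo..lo+n-1], whose two halves are sorted, looking at
       elements r apart (r = 1 on the initial call).                */
    static void odd_even_merge(int a[], int lo, int n, int r) {
        int step = r * 2;
        if (step < n) {
            odd_even_merge(a, lo, n, step);        /* even-indexed sublist */
            odd_even_merge(a, lo + r, n, step);    /* odd-indexed sublist  */
            for (int i = lo + r; i + r < lo + n; i += step)
                compare_swap(a, i, i + r);         /* pairwise Ei/Oi swaps */
        } else {
            compare_swap(a, lo, lo + r);
        }
    }

    /* Sort a[lo..lo+n-1] (n a power of two) by sorting the halves and merging. */
    static void odd_even_merge_sort(int a[], int lo, int n) {
        if (n > 1) {
            int m = n / 2;
            odd_even_merge_sort(a, lo, m);
            odd_even_merge_sort(a, lo + m, m);
            odd_even_merge(a, lo, n, 1);
        }
    }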
57
Where's the Parallelism?
(Figure: the recursion turns 1 list of N into 2 lists of N/2, then 4 of N/4, and so on; the even (E) and odd (O) sublists are then merged back into 1 list of N.)
Mapping to a Butterfly (or Hypercube)
Two sorted sublists; reverse the order of one list via the cross edges, then do pairwise swaps on the way back down the butterfly.
(Figure: butterfly on the keys 2 3 4 8 and 1 5 6 7; after reversing the second list the sequence is 2 3 4 8 7 6 5 1, and the swap stages produce the sorted output 1 2 3 4 5 6 7 8.)
59
Bitonic Sort with N/P per node
A bitonic sequence decreases and then increases
(or vice versa). Bitonic sequences can be merged
like monotonic sequences.
  • all_bitonic(int A[PROCS]::[n])
  • sort(tolocal(&A[ME][0]), n, 0)
  • for (d = 1; d <= logProcs; d++)
  • for (i = d-1; i >= 0; i--)
  • swap(A, T, n, pair(i))
  • merge(A, T, n, mode(d, i))
  • sort(tolocal(&A[ME][0]), n, mask(i))

sort
swap
60
Bitonic Breakdown
P = 512, random
61
Bitonic Effect of Key Distributions
P = 64, N/P = 1M
62
Sequential Radix Sort: Counting Sort
  • Idea: build a histogram of the keys and compute the
    position in the answer array for each element
  • A = [3, 5, 4, 1, 3, 4, 1, 4]
  • Make temp array B, and write values into position

63
Counting Sort Pseudo Code
  • Counting Sort
  • static void countingSort(int[] A) {
  •   int N = A.length;
  •   int L = min(A), U = max(A);
  •   int[] count = new int[U - L + 2];
  •   for (int i = 0; i < N; i += 1)
  •     count[A[i] - L + 1] += 1;
  •   for (int j = 1; j < count.length; j += 1)
  •     count[j] += count[j-1];

A = 3, 5, 4, 1, 3, 4, 1, 4
N = 8
L = 1, U = 5
count = 0,0,0,0,0,0
count = 0,2,0,2,3,1
count = 0,2,2,4,7,8
64
Distribution Sort Continued
  • static void countingSort(int[] A) {  // continued
  •   int[] B = new int[N];
  •   for (int i = 0; i < N; i += 1) {
  •     B[count[A[i] - L]] = A[i];
  •     count[A[i] - L] += 1;
  •   }
  •   // copy back into A
  •   for (int i = 0; i < N; i += 1)
  •     A[i] = B[i];
  • }

A = 3, 5, 4, 1, 3, 4, 1, 4
count = 0,2,2,4,7,8
B = 0, 0, 0, 0, 0, 0, 0, 0
65
Analysis of Counting Sort
  • What is the complexity of each step for an
    n-element array?
  • Find min and max: Θ(n)
  • Fill in count histogram: Θ(n)
  • Compute sums of count: Θ(max-min)
  • Fill in B (run over count): Θ(max-min)
  • Copy B to A: Θ(n)
  • So this is a Θ(n + m) algorithm,
  • where m = max - min
  • Great if the range of keys isn't too large
  • If m < n, then Θ(n) overall