Title: A Memory-hierarchy Conscious and Self-tunable Sorting Library
1A Memory-hierarchy Conscious and Self-tunable
Sorting Library
Xiaoming Li, María Jesús Garzarán, and David Padua
University of Illinois at Urbana-Champaign
- To appear in the 2004 International Symposium on Code Generation and Optimization (CGO04)
2Motivation
- Sorting
- Core operation in many applications, such as databases
- Well understood symbolic computing problem
- Library generators such as ATLAS and SPIRAL have used empirical search to adapt to
- Architectural features of the target machine
- Size of the input data
But the performance of sorting also depends on the distribution of the values to be sorted
3Motivation
- Main difficulties to build a sorting library
- Theoretical complexity is not sufficient to measure quality
- Cache effects, instructions executed
- Performance depends on the characteristics of the input
- Amount and distribution of the data to sort
- A single algorithm is not optimal for all possible input sets
4Contributions
- Identify the architectural and runtime factors that affect the performance of the sorting algorithms.
- Use empirical search to identify the best shape and parameter values of a sorting algorithm.
- Use machine learning and runtime adaptation to select the best sorting algorithm for a specific input set.
5Contributions
[Figure: IBM Power3, sorting 12 M keys (32-bit integers). Execution time (cycles) vs. standard deviation of the inputs]
6Outline
- Sorting Algorithms
- Factors that determine performance
- The Library
- Evaluation
- Future Work
- Conclusions
7Sorting Algorithms
- Our sorting library contains
- Quicksort
- CC-Radix
- Multiway Merge
- Insertion Sort (for small partitions)
- Sorting Networks (for small partitions)
8Quicksort
- Divide and conquer in-place sorting algorithm
- Our implementation includes Sedgewick's optimizations
- Set guardians at both ends of the input array.
- Eliminate recursion.
- Correctly select the pivot.
- Use insertion sort for small partitions.
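The optimizations listed above can be illustrated with a minimal sketch. It is not the library's implementation: the cutoff `INSERTION_THRESHOLD`, the median-of-three pivot rule, and the function names are assumptions made for illustration, and the sketch relies on the pivot value itself to stop the scans rather than placing explicit guardians at the array ends.

```python
INSERTION_THRESHOLD = 16  # hypothetical cutoff; the slides say such values are tuned empirically


def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive bounds) in place."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def quicksort(a, threshold=INSERTION_THRESHOLD):
    """In-place quicksort that uses an explicit stack instead of recursion."""
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= threshold:
            insertion_sort(a, lo, hi)  # small partition: switch algorithm
            continue
        # Median-of-three pivot selection (an assumed pivot rule).
        mid = (lo + hi) // 2
        if a[mid] < a[lo]:
            a[lo], a[mid] = a[mid], a[lo]
        if a[hi] < a[lo]:
            a[lo], a[hi] = a[hi], a[lo]
        if a[hi] < a[mid]:
            a[mid], a[hi] = a[hi], a[mid]
        pivot = a[mid]
        # Partition: the pivot value bounds both scans, so no index checks are needed.
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return a
```

Calling `quicksort(keys)` sorts the list in place and also returns it; passing a different `threshold` changes the point at which insertion sort takes over.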
9Radix sort
[Figure: radix sort example on the vector to sort 31 1 12 23 33 4]
10CC-radix (Cache Conscious Radix Sort)
- Tries to exploit data locality in caches
- Based on radix sort (Jiménez and Larriba, UPC)
CC-radix(bucket)
  if fits in cache (bucket) then
    radix sort (bucket)
  else
    sub-buckets = Reverse sorting(bucket)
    for each sub-bucket in sub-buckets
      CC-radix(sub-bucket)
    endfor
  endif
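The CC-radix recursion can be sketched as follows. This is a minimal illustration under several assumptions: keys are non-negative integers below `2**key_bits`, "fits in cache" is approximated by a key count `cache_keys`, "reverse sorting" is read as partitioning on the most significant remaining digit, and Python lists stand in for the counting arrays a real implementation would use. All names and default values are hypothetical.

```python
def radix_sort(keys, key_bits=32, radix_bits=8):
    """LSD radix sort: one stable bucket pass per digit, least significant first."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]
    return keys


def cc_radix(keys, cache_keys=4096, key_bits=32, radix_bits=8):
    """Split on the most significant digit until each bucket is small enough."""
    if len(keys) <= cache_keys or key_bits <= 0:
        # The bucket "fits in cache": sort the remaining low digits directly.
        return radix_sort(keys, key_bits, radix_bits)
    shift = max(key_bits - radix_bits, 0)
    mask = (1 << radix_bits) - 1
    sub_buckets = [[] for _ in range(1 << radix_bits)]
    for k in keys:
        sub_buckets[(k >> shift) & mask].append(k)
    out = []
    for sub in sub_buckets:
        # Keys in a sub-bucket share their high digits; only `shift` bits remain.
        out.extend(cc_radix(sub, cache_keys, shift, radix_bits))
    return out
```

Both functions return a new sorted list rather than sorting in place, which keeps the sketch short at the cost of extra copying.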
11Multiway Merge Sort
- This algorithm exploits data locality very
efficiently
[Figure: a heap with 2p - 1 nodes merging p sorted subsets]
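A heap-based merge of p sorted subsets can be sketched as below. Two things are assumptions rather than the library's method: Python's `heapq` keeps only one entry per subset (p entries) instead of the 2p - 1 node heap drawn on the slide, and the built-in `sorted` stands in for whatever algorithm sorts each subset. The function names are hypothetical.

```python
import heapq


def multiway_merge(runs):
    """Merge sorted lists using a min-heap that holds the head of each list."""
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, idx, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(runs[idx]):
            # Refill the heap from the subset that just supplied the minimum.
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))
    return out


def multiway_merge_sort(keys, p=4):
    """Split keys into at most p subsets, sort each, then merge them."""
    if not keys:
        return []
    size = -(-len(keys) // p)  # ceiling division
    runs = [sorted(keys[i:i + size]) for i in range(0, len(keys), size)]
    return multiway_merge(runs)
```

Each subset is read sequentially and only the small heap is accessed at random, which is the locality property the slide refers to.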
12Sorting algorithms for small partitions
- Insertion sort → Exploits locality in the cache line
- Sorting networks → Register blocking
13Performance Comparison
Pentium III Xeon, 16 M keys (float)
14Outline
- Sorting Algorithms
- Factors that determine performance
- The Library
- Evaluation
- Future Work
- Conclusions
15Factors that determine performance
- Architectural Factors Considered
- Cache / TLB size
- Number of Registers
- Cache Line Size
- Runtime Factors Considered
- Amount of data to Sort
- Distribution of the data
16Architectural: Cache Size/TLB Size
- Tiling: partition the data into subsets that fit in the cache
- Quicksort
- Using multiple pivots to tile
- CC-radix
- Fit each partition into cache
- Number of active partitions < TLB size
- Multiway Merge Sort
- Fit the heap into cache
- Fit sorted subsets into cache
17Architectural: Number of Registers
- For small partitions, sort in place using the processor registers
- Optimizations like unrolling and scheduling can be applied
cmpswap(r0,r1) cmpswap(r2,r3) cmpswap(r1,r2) cmpswap(r0,r3) cmpswap(r4,r5) ..
cmpswap(r0,r1) cmpswap(r2,r3) cmpswap(r4,r5) cmpswap(r1,r2) cmpswap(r0,r3)
18Architectural: Cache Line Size
- Fanout: tied to the cache line size
- Increases cache line utilization when accessing child nodes
Cache Line
19Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) vs. number of keys (millions)]
20Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) vs. number of keys (millions)]
21Runtime: Standard Deviation
[Figure: Pentium III Xeon, 16 M keys. Execution time (cycles) vs. standard deviation of the keys]
22Outline
- Sorting Algorithms
- Factors that determine performance
- The Library
- Evaluation
- Future Work
- Conclusions
23Library adaptation
- Architectural Factors → Empirical Search
- Cache / TLB size
- Number of Registers
- Cache Line Size
- Runtime Factors
- Distribution shape of the data → Does not matter
- Amount of data to sort → Machine learning and runtime adaptation
- Standard deviation → Machine learning and runtime adaptation
24The Library
- Building the library → Installation time
- Empirical Search
- Learning Procedure
- Use of training data
- Running the library → Runtime
- Runtime Procedure
Runtime Adaptation
25Runtime Adaptation: Learning Procedure
- Goal function
- f(N, E) → {Multiway Merge Sort, Quicksort, CC-radix}
- N: amount of input data
- E: the entropy vector
- Use N to choose between Multiway Merge or Quicksort
- Use the entropy and the Winnow algorithm to learn the best algorithm
- Output: weight vector (w) and threshold (θ)
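For readers unfamiliar with Winnow, the sketch below shows the general shape of a mistake-driven learner with multiplicative weight updates. It is an illustration only: the slides do not give the exact variant, so the update rule (scaling each weight by `alpha ** x_i`, a common extension of Winnow to real-valued features), the default threshold, and all names are assumptions. Label 1 stands for "CC-radix is best" and label 0 for "another algorithm is best".

```python
def winnow_train(samples, labels, theta=None, alpha=2.0, epochs=50):
    """Learn a weight vector so that sum(w_i * x_i) >= theta predicts label 1."""
    n = len(samples[0])
    theta = float(n) if theta is None else theta  # assumed default threshold
    w = [1.0] * n
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            s = sum(wi * xi for wi, xi in zip(w, x))
            predicted = 1 if s >= theta else 0
            if predicted == y:
                continue
            mistakes += 1
            # Promote weights after a missed positive, demote after a false positive.
            factor = alpha if y == 1 else 1.0 / alpha
            w = [wi * factor ** xi for wi, xi in zip(w, x)]
        if mistakes == 0:
            break  # every training sample is classified correctly
    return w, theta
```

Weights change only when a training sample is misclassified, so training stops early once a full pass produces no mistakes.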
26Runtime Adaptation: Runtime Procedure
- Sample the input array
- Compute the entropy vector
- Compute S = Σ_i w_i · entropy_i
- If S ≥ θ
- choose CC-radix
- else
- choose others
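The runtime procedure can be sketched as follows. The slides do not define the entropy vector, so this sketch assumes one Shannon entropy per radix digit of the sampled keys; the sampling stride, the direction of the comparison against the threshold, and the function names are likewise assumptions.

```python
import math
from collections import Counter


def entropy_vector(sample, key_bits=32, radix_bits=8):
    """Shannon entropy (in bits) of each radix digit across the sampled keys."""
    mask = (1 << radix_bits) - 1
    total = len(sample)
    vector = []
    for shift in range(0, key_bits, radix_bits):
        counts = Counter((k >> shift) & mask for k in sample)
        vector.append(sum((c / total) * math.log2(total / c) for c in counts.values()))
    return vector


def choose_algorithm(keys, weights, theta, sample_step=64):
    """Sample the input, compute S = sum(w_i * entropy_i), compare with theta."""
    sample = keys[::sample_step]  # assumed sampling scheme: every sample_step-th key
    s = sum(w * h for w, h in zip(weights, entropy_vector(sample)))
    return "CC-radix" if s >= theta else "others"
```

Only a sample of the input is inspected, so the cost of the decision stays small relative to the sort itself.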
27Outline
- Sorting Algorithms
- Factors that determine performance
- The Library
- Evaluation
- Future Work
- Conclusions
28Experimental Setup
- Test Platforms
- SGI R12000, 300 MHz, L1 I/D 32 KB, L2 4 MB
- UltraSparcIII, 750 MHz, L1 I/D 32 KB, 64 KB, L2 8 MB
- PentiumIII Xeon, 550 MHz, L1 I/D 16 KB, L2 512 KB
- IBM Power3, 375 MHz, L1 I/D 64 KB, L2 8 MB
29Sun UltraSparcIII 12 M keys
[Figure: execution time (cycles per key) vs. standard deviation of the keys]
30IBM Power3 12 M Keys
[Figure: execution time (cycles per key) vs. standard deviation of the keys]
31Conclusions
- Identify the architectural and runtime factors
- Use empirical search to find the best parameter values
- Our machine learning techniques prove to be quite effective
- Always selects the best algorithm.
- The wrong decision introduces a 37% average performance degradation
- Overhead (average 5%, worst case 7%)
32Future Work
- Search in the space of sorting algorithms using high-level primitives
- Extend sorting to include more data types
- Include other comparison strategies
- For example, less than to sort vectors, graphs,
- Parallel algorithms
- Explore other database operations, such as join.
33Empirical Search
- Adaptation to the architecture of the machine
- Quicksort and CC-radix: the best configuration does not change significantly with the characteristics of the input data set.
- Quicksort, CC-Radix
- Use of insertion sort/sorting networks for small partitions
- Threshold to use them
- CC-radix
- Size of the radix
- Multiway Merge Sort: the best configuration changes with the amount and the distribution of the input data.
- The best values will be searched during the learning procedure.
35Multiway Merge Sort
[Figure: a heap merging four sorted runs]
36Empirical Search
- Example
- Multiway Merge
- Search the heap size that obtains the best performance
- Different amount of data and standard deviation
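The idea of empirical search can be sketched as a loop that times each candidate parameter value on training inputs and keeps the fastest. This is a minimal sketch, not the library's search procedure: `empirical_search`, its arguments, and the exhaustive scan over candidates are assumptions, and `run` is a hypothetical callback that sorts one input with the given parameter value (for example, a heap size).

```python
import time


def empirical_search(candidates, run, inputs, timer=time.perf_counter):
    """Return the candidate value with the lowest total measured time."""
    best_value, best_cost = None, float("inf")
    for value in candidates:
        start = timer()
        for data in inputs:
            run(list(data), value)  # copy so every candidate sees the same input
        cost = timer() - start
        if cost < best_cost:
            best_value, best_cost = value, cost
    return best_value
```

In line with the slide, such a search would be repeated for inputs of different sizes and standard deviations, keeping one best value per input class. The `timer` argument exists so that a deterministic clock can be substituted when testing.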