A Memory-hierarchy Conscious and Self-tunable Sorting Library

Description:

To appear in 2004 International Symposium on Code Generation and Optimization (CGO'04) ... Use the entropy and Winnow algorithm to learn the best algorithm ...

Provided by: polaris

Learn more at: http://polaris.cs.uiuc.edu

Transcript and Presenter's Notes

1
A Memory-hierarchy Conscious and Self-tunable
Sorting Library
Xiaoming Li, María Jesús Garzarán, and David Padua
University of Illinois at Urbana-Champaign

To appear in 2004 International Symposium on
Code Generation and Optimization (CGO'04)

2
Motivation

Sorting
Core operation in many applications, such as
databases
Well understood symbolic computing problem
Library generators such as ATLAS and SPIRAL
have used empirical search to adapt to
Architectural features of the target machine
Size of the input data

But, performance of sorting also depends on the
distribution of the values to be sorted
3
Motivation

Main difficulties in building a sorting library
Theoretical complexity is not sufficient to
measure quality
Cache effects and the number of instructions executed also matter
Performance depends on the characteristics of the
input
Amount and distribution of the data to sort
A single algorithm is not optimal for all
possible input sets

4
Contributions

Identify the architectural and runtime factors
that affect the performance of the sorting
algorithms.
Use empirical search to identify the best shape
and parameter values of a sorting algorithm.
Use machine learning and runtime adaptation to
select the best sorting algorithm for a specific
input set.

5
Contributions
[Figure: IBM Power3, sorting 12 M keys (32-bit integers). Execution time (cycles) versus standard deviation of the inputs.]
6
Outline

Sorting Algorithms
Factors that determine performance
The Library
Evaluation
Future Work
Conclusions

7
Sorting Algorithms

Our sorting library contains
Quicksort
CC-Radix
Multiway Merge
Insertion Sort (for small partitions)
Sorting Networks (for small partitions)
8
Quicksort

Divide and conquer in-place sorting algorithm
Our implementation includes Sedgewick's
optimizations
Set guardians at both ends of the input array.
Eliminate recursion.
Correctly select the pivot.
Use insertion sort for small partitions.
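
The optimizations listed above can be sketched as follows. This is a minimal illustration, not the library's code: the threshold `SMALL`, the median-of-three pivot rule, and the Hoare-style partition are assumptions chosen for the example. Ordering the first, middle, and last keys before partitioning also places values that act as guardians at both ends of the range.

```python
SMALL = 16  # hypothetical threshold; the slides say such values come from empirical search


def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive) in place."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def median_of_three(a, lo, hi):
    """Order a[lo], a[mid], a[hi] and return the median value as the pivot."""
    mid = (lo + hi) // 2
    if a[mid] < a[lo]:
        a[lo], a[mid] = a[mid], a[lo]
    if a[hi] < a[lo]:
        a[lo], a[hi] = a[hi], a[lo]
    if a[hi] < a[mid]:
        a[mid], a[hi] = a[hi], a[mid]
    return a[mid]


def quicksort(a):
    """In-place quicksort that uses an explicit stack instead of recursion."""
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= SMALL:
            insertion_sort(a, lo, hi)  # small partition: switch algorithm
            continue
        pivot = median_of_three(a, lo, hi)
        i, j = lo, hi
        while i <= j:  # Hoare-style partition around the pivot value
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return a
```

Calling `quicksort(values)` sorts the list in place and also returns it.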

9
Radix sort

Non-comparison algorithm

[Figure: radix sort example on the vector 31 1 12 23 33 4, showing the per-digit counters and the destination vector]
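
A least-significant-digit radix sort can be sketched as below. The function name, the 8-bit digit width, and the 32-bit key width are assumptions for illustration; the slides do not give the library's parameters at this point.

```python
def radix_sort(keys, bits_per_digit=8, key_bits=32):
    """LSD radix sort of non-negative integers below 2**key_bits."""
    radix = 1 << bits_per_digit
    mask = radix - 1
    src = list(keys)
    for shift in range(0, key_bits, bits_per_digit):
        # 1) Count how many keys fall into each bucket for this digit.
        counts = [0] * radix
        for k in src:
            counts[(k >> shift) & mask] += 1
        # 2) Prefix sum: first destination index of each bucket.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # 3) Stable move of every key to its destination slot.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1
        src = dst
    return src
```

No keys are compared with each other; the order comes only from counting digits and moving keys.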
10
CC-radix (Cache Conscious Radix Sort)

Tries to exploit data locality in caches
Based on radix sort (Jimenez and Larriba, UPC)

CC-radix(bucket)
  if fits in cache (bucket) then
    radix sort (bucket)
  else
    sub-buckets = Reverse sorting (bucket)
    for each sub-bucket in sub-buckets
      CC-radix (sub-bucket)
    endfor
  endif
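
The recursion above might look like this in Python. It is a sketch under stated assumptions: `cache_keys` stands in for the "fits in cache" test, "Reverse sorting" is taken to mean partitioning on the most significant remaining digit, and the helper `lsd_radix_sort` uses simple bucket lists rather than counting arrays.

```python
def lsd_radix_sort(keys, bits, digit_bits):
    """Plain LSD radix sort on the low `bits` bits of each key."""
    mask = (1 << digit_bits) - 1
    for shift in range(0, bits, digit_bits):
        buckets = [[] for _ in range(mask + 1)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return list(keys)


def cc_radix(keys, bits=32, digit_bits=8, cache_keys=4096):
    """Sort non-negative integers whose bits above `bits` are all equal."""
    # Base case: the bucket is assumed to fit in cache (or no digits remain).
    if len(keys) <= cache_keys or bits <= 0:
        return lsd_radix_sort(keys, bits, digit_bits)
    # Otherwise split on the most significant remaining digit.
    shift = max(bits - digit_bits, 0)
    width = bits - shift
    sub_buckets = [[] for _ in range(1 << width)]
    for k in keys:
        sub_buckets[(k >> shift) & ((1 << width) - 1)].append(k)
    out = []
    for sub in sub_buckets:  # sub-buckets are already in key order
        out.extend(cc_radix(sub, shift, digit_bits, cache_keys))
    return out
```

With a small `cache_keys` the function recurses more deeply, which is a convenient way to exercise the partitioning path on short inputs.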
11
Multiway Merge Sort

This algorithm exploits data locality very
efficiently

[Figure: a heap with 2p - 1 nodes that merges p sorted subsets]
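
A p-way merge driven by a heap can be sketched as below. Note the substitution: this example uses Python's `heapq`, a binary heap holding one entry per subset, rather than the 2p - 1 node heap shown on the slide. The function name and the default `p` are assumptions.

```python
import heapq


def multiway_merge_sort(keys, p=4):
    """Split keys into at most p subsets, sort each, then merge them with a heap."""
    if not keys:
        return []
    size = -(-len(keys) // p)  # ceiling division
    runs = [sorted(keys[i:i + size]) for i in range(0, len(keys), size)]
    # Heap entries: (key, run index, position inside that run).
    heap = [(run[0], r, 0) for r, run in enumerate(runs)]
    heapq.heapify(heap)
    out = []
    while heap:
        key, r, pos = heapq.heappop(heap)
        out.append(key)
        if pos + 1 < len(runs[r]):
            # Refill the heap from the run that just supplied the minimum.
            heapq.heappush(heap, (runs[r][pos + 1], r, pos + 1))
    return out
```

Each sorted subset is read sequentially and only the small heap is accessed at random, which is where the locality comes from.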
12
Sorting algorithms for small partitions

Insertion sort: exploits locality in the cache
line
Sorting networks: register blocking

13
Performance Comparison
[Figure: performance comparison on Pentium III Xeon, 16 M keys (float)]
14
Outline

Sorting Algorithms
Factors that determine performance
The Library
Evaluation
Future Work
Conclusions

15
Factors that determine performance

Architectural Factors Considered
Cache / TLB size
Number of Registers
Cache Line Size
Runtime Factors Considered
Amount of data to Sort
Distribution of the data

16
Architectural: Cache Size / TLB Size

Tiling: partition the data into subsets that fit in
the cache
Quicksort
Using multiple pivots to tile
CC-radix
Fit each partition into cache
The number of active partitions < TLB size
Multiway Merge Sort
Fit the heap into cache
Fit sorted subsets into cache

17
Architectural: Number of Registers

For small partitions, sort in place using the
processor registers
Optimizations like unrolling and instruction scheduling can be
applied

cmpswap(r0,r1) cmpswap(r2,r3) cmpswap(r1,r2) cmpswap(r0,r3) cmpswap(r4,r5) ..
cmpswap(r0,r1) cmpswap(r2,r3) cmpswap(r4,r5) cmpswap(r1,r2) cmpswap(r0,r3)
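
A sorting network is a fixed sequence of compare-and-swap steps that does not depend on the data. The sketch below uses a standard 5-comparator network for 4 keys, which is not necessarily the sequence on the slide; `cmpswap` mirrors the slide's name, while `NETWORK_4` and `sort4` are names invented for the example.

```python
def cmpswap(v, i, j):
    """Compare-and-swap: afterwards v[i] <= v[j]."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]


# A standard 5-comparator sorting network for 4 keys.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(v):
    """Sort a 4-element list in place with a fixed comparator sequence."""
    for i, j in NETWORK_4:
        cmpswap(v, i, j)
    return v
```

Because the sequence is fixed, a compiled version can keep all four keys in registers, unroll the loop completely, and reorder independent comparators such as (0, 1) and (2, 3).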
18
Architectural: Cache Line Size

Heap fanout is matched to the cache line size
Increases cache line utilization when accessing
children nodes
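
One way to read the relation between fanout and cache line size is that all children of a heap node should share one cache line. The sketch below assumes that reading, a 64-byte line, 4-byte keys, and an implicit array layout with the root at index 0; none of these values come from the slides, and aligning the child groups to line boundaries would need extra padding that is not shown.

```python
def heap_fanout(cache_line_bytes=64, key_bytes=4):
    """Number of keys, and so of children, that fit in one cache line."""
    return cache_line_bytes // key_bytes


def children(node, fanout):
    """Indices of the children of `node` in an implicit heap with root 0."""
    first = node * fanout + 1
    return range(first, first + fanout)
```

For example, `heap_fanout(64, 4)` gives 16, so visiting the children of a node touches 16 consecutive array slots.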
19
Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) versus number of keys (millions)]
20
Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) versus number of keys (millions)]
21
Runtime: Standard Deviation
[Figure: Pentium III Xeon, 16 M keys. Execution time (cycles) versus standard deviation of the keys]
22
Outline

Sorting Algorithms
Factors that determine performance
The Library
Evaluation
Future Work
Conclusions

23
Library adaptation

Architectural Factors (handled by empirical search)
Cache / TLB size
Number of Registers
Cache Line Size

Runtime Factors (handled by machine learning and runtime adaptation)
Distribution shape of the data (does not matter)
Amount of data to sort
Standard deviation
24
The Library

Building the library: installation time
Empirical Search
Learning Procedure (runtime adaptation)
Use of training data
Running the library: runtime
Runtime Procedure (runtime adaptation)
25
Runtime Adaptation: Learning Procedure

Goal function
f(N, E) → {Multiway Merge Sort, Quicksort, CC-radix}
N: amount of input data
E: the entropy vector
Use N to choose between Multiway Merge Sort and
Quicksort
Use the entropy and the Winnow algorithm to learn the
best algorithm
Output: weight vector (w) and threshold (θ)
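
A Winnow-style learner that outputs a weight vector and a threshold could be sketched as follows. Several things here are assumptions rather than the authors' procedure: the update scales by the feature value so that real-valued entropies can be used, the threshold defaults to the number of features, and the label is True when CC-radix was the fastest algorithm on a training input.

```python
def train_winnow(samples, n_features, alpha=2.0, epochs=50):
    """samples: list of (entropy_vector, label) pairs. Returns (w, theta)."""
    w = [1.0] * n_features
    theta = float(n_features)  # assumed default threshold
    for _ in range(epochs):
        mistakes = 0
        for x, label in samples:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if (s >= theta) == label:
                continue  # correct prediction: weights stay unchanged
            mistakes += 1
            # Multiplicative update: promote on a missed positive,
            # demote on a false positive, scaled by each feature value.
            sign = 1.0 if label else -1.0
            w = [wi * alpha ** (sign * xi) for wi, xi in zip(w, x)]
        if mistakes == 0:
            break
    return w, theta
```

Training stops early once a full pass over the samples produces no mistakes.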

26
Runtime Adaptation: Runtime Procedure

Sample the input array
Compute the entropy vector
Compute S = Σ_i w_i × entropy_i
If S ≥ θ
choose CC-radix
else
choose others
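
The runtime steps can be sketched as below. The assumptions are: the entropy vector holds the Shannon entropy of each 8-bit digit of 32-bit keys, the sample is taken at evenly spaced positions, and CC-radix is chosen when the weighted sum reaches the threshold. The weights and threshold passed in would come from the learning procedure.

```python
import math
from collections import Counter


def entropy_vector(sample, digit_bits=8, key_bits=32):
    """Shannon entropy, in bits, of each digit position across the sample."""
    mask = (1 << digit_bits) - 1
    n = len(sample)
    vec = []
    for shift in range(0, key_bits, digit_bits):
        counts = Counter((k >> shift) & mask for k in sample)
        vec.append(-sum(c / n * math.log2(c / n) for c in counts.values()))
    return vec


def choose_algorithm(keys, w, theta, sample_size=1024):
    """Return "CC-radix" or "others" from a weighted sum of digit entropies."""
    step = max(1, len(keys) // sample_size)
    sample = keys[::step]  # evenly spaced sample (an assumption)
    e = entropy_vector(sample)
    s = sum(wi * ei for wi, ei in zip(w, e))
    return "CC-radix" if s >= theta else "others"
```

Keys that are all equal give zero entropy in every digit, so the sum stays below any positive threshold and the procedure falls through to the other algorithms.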

27
Outline

Sorting Algorithms
Factors that determine performance
The Library
Evaluation
Future Work
Conclusions

28
Experimental Setup

Test Platforms
SGI R12000, 300 MHz, L1 I/D 32 KB, L2 4 MB
UltraSparc III, 750 MHz, L1 I/D 32 KB / 64 KB, L2 8 MB
Pentium III Xeon, 550 MHz, L1 I/D 16 KB, L2 512 KB
IBM Power3, 375 MHz, L1 I/D 64 KB, L2 8 MB

29
[Figure: Sun UltraSparc III, 12 M keys. Execution time (cycles per key) versus standard deviation of the keys]
30
[Figure: IBM Power3, 12 M keys. Execution time (cycles per key) versus standard deviation of the keys]
31
Conclusions

Identify the architectural and runtime factors
Use empirical search to find the best parameter
values
Our machine learning techniques prove to be quite
effective
Always select the best algorithm.
A wrong decision introduces a 37% average
performance degradation
Overhead (average 5%, worst case 7%)

32
Future Work

Search in the space of sorting algorithms using
high-level primitives
Extend sorting to include more data types
Include other comparison strategies
(for example, less than to sort vectors, graphs, ...)
Parallel algorithms
Explore other database operations, such as join.

33
Empirical Search

Adaptation to the architecture of the machine
Quicksort and CC-radix
The best configuration does not change
significantly with the characteristics of the
input data set.
Quicksort, CC-radix: use of insertion sort/sorting networks for small
partitions, and the threshold to use them
CC-radix: size of the radix
Multiway Merge Sort
The best configuration changes with the amount
and the distribution of the input data.
The best values are searched for during the
learning procedure.

35
Multiway Merge Sort
[Figure: a heap of keys merging four sorted runs]
36
Empirical Search

Example
Multiway Merge
Search for the heap size that obtains the best
performance
for different amounts of data and standard deviations
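
An empirical search of this kind amounts to timing each candidate value and keeping the fastest. The sketch below is a generic timing loop, not the library's search: `empirical_search` and `merge_with_p_runs` are hypothetical names, and the workload simply merges `p` sorted runs with `heapq.merge` so that `p` plays the role of the heap size.

```python
import heapq
import time


def empirical_search(candidates, run, repeats=3):
    """Time run(c) for each candidate; return the fastest one and all timings."""
    timings = {}
    for c in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(c)
            best = min(best, time.perf_counter() - start)
        timings[c] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings


def merge_with_p_runs(p, data=tuple((i * 7919) % 10007 for i in range(20000))):
    """Toy workload: sort p runs of the data, then merge them."""
    size = -(-len(data) // p)  # ceiling division
    runs = [sorted(data[i:i + size]) for i in range(0, len(data), size)]
    return list(heapq.merge(*runs))
```

A call such as `empirical_search([2, 4, 8, 16], merge_with_p_runs)` returns the fastest value of `p` on the current machine; repeating the search for several input sizes and standard deviations would give one best value per case.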
