A Memory-hierarchy Conscious and Self-tunable Sorting Library

About This Presentation
Title:

A Memory-hierarchy Conscious and Self-tunable Sorting Library

Description:

To appear in the 2004 International Symposium on Code Generation and Optimization (CGO'04) ... Use the entropy and Winnow algorithm to learn the best algorithm ...

Slides: 37
Provided by: polaris

Transcript and Presenter's Notes

1
A Memory-hierarchy Conscious and Self-tunable
Sorting Library
Xiaoming Li, María Jesús Garzarán, and David Padua
University of Illinois at Urbana-Champaign
  • To appear in the 2004 International Symposium on
    Code Generation and Optimization (CGO'04)

2
Motivation
  • Sorting
      • Core operation in many applications, such as
        databases
      • Well-understood symbolic computing problem
  • Library generators such as ATLAS and SPIRAL
    have used empirical search to adapt to
      • Architectural features of the target machine
      • Size of the input data

However, the performance of sorting also depends on the
distribution of the values to be sorted.
3
Motivation
  • Main difficulties in building a sorting library
      • Theoretical complexity is not sufficient to
        measure quality
          • Cache effects, instructions executed
      • Performance depends on the characteristics of the
        input
          • Amount and distribution of the data to sort
      • A single algorithm is not optimal for all
        possible input sets

4
Contributions
  • Identify the architectural and runtime factors
    that affect the performance of the sorting
    algorithms.
  • Use empirical search to identify the best shape
    and parameter values of a sorting algorithm.
  • Use machine learning and runtime adaptation to
    select the best sorting algorithm for a specific
    input set.

5
Contributions
[Figure: IBM Power3, sorting 12 M keys (32-bit integers). Execution time (cycles) versus standard deviation of the inputs.]
6
Outline
  • Sorting Algorithms
  • Factors that determine performance
  • The Library
  • Evaluation
  • Future Work
  • Conclusions

7
Sorting Algorithms
  • Our sorting library contains
      • Quicksort
      • CC-radix
      • Multiway Merge
      • Insertion Sort (for small partitions)
      • Sorting Networks (for small partitions)

8
Quicksort
  • Divide-and-conquer, in-place sorting algorithm
  • Our implementation includes Sedgewick's
    optimizations
      • Set sentinels (guards) at both ends of the input array.
      • Eliminate recursion.
      • Select the pivot carefully.
      • Use insertion sort for small partitions.
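
The optimizations listed on this slide can be sketched as follows. This is a minimal illustration, not the library's code: the median-of-three pivot rule, the cutoff `small=16`, and the function names are assumptions, and the sentinel trick is omitted for brevity.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive) in place."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def quicksort(a, small=16):
    """In-place quicksort without recursion (explicit stack)."""
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= small:
            # Small partition: hand over to insertion sort.
            insertion_sort(a, lo, hi)
            continue
        mid = (lo + hi) // 2
        # Median of three (assumed pivot rule): order a[lo], a[mid], a[hi].
        if a[mid] < a[lo]:
            a[lo], a[mid] = a[mid], a[lo]
        if a[hi] < a[lo]:
            a[lo], a[hi] = a[hi], a[lo]
        if a[hi] < a[mid]:
            a[mid], a[hi] = a[hi], a[mid]
        pivot = a[mid]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return a
```

In a tuned library the cutoff would be a parameter found by measurement rather than a fixed constant.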

9
Radix sort
  • Non-comparison algorithm

[Figure: worked radix sort example on the vector 31 1 12 23 33 4, whose least-significant digits are 1 1 2 3 3 4.]
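
As a rough illustration of a non-comparison sort, here is a least-significant-digit radix sort using a counting pass per digit. The digit width `bits`, the key width `key_bits`, and the function name are assumptions chosen for the sketch; keys are taken to be non-negative integers.

```python
def radix_sort(keys, bits=4, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    mask = (1 << bits) - 1
    for shift in range(0, key_bits, bits):
        # Count how many keys have each digit value.
        counts = [0] * (1 << bits)
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum: first output slot for each digit value.
        pos = [0] * (1 << bits)
        total = 0
        for d, c in enumerate(counts):
            pos[d] = total
            total += c
        # Stable distribution of the keys into the output vector.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[pos[d]] = k
            pos[d] += 1
        keys = out
    return keys
```

No key is ever compared with another key; the order comes entirely from the digit counts.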
10
CC-radix (Cache Conscious Radix Sort)
  • Tries to exploit data locality in caches
  • Based on radix sort (Jimenez and Larriba, UPC)

CC-radix(bucket)
  if fits in cache (bucket) then
    radix sort (bucket)
  else
    sub-buckets = Reverse sorting(bucket)
    for each sub-bucket in sub-buckets
      CC-radix(sub-bucket)
    endfor
  endif
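
The pseudocode above could be made concrete along these lines. This sketch assumes that "Reverse sorting" means partitioning on the most significant remaining digit, and it approximates "fits in cache" with a key-count limit `cache_keys`; both are assumptions, as are the names and default values.

```python
def lsd_radix(keys, bits, top_shift):
    """Plain radix sort on the digits at shifts 0..top_shift."""
    mask = (1 << bits) - 1
    for shift in range(0, top_shift + 1, bits):
        buckets = [[] for _ in range(1 << bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys


def cc_radix(bucket, cache_keys=1024, bits=8, shift=24):
    """Sort non-negative integers below 2**(shift + bits)."""
    if len(bucket) <= cache_keys or shift < 0:
        # The bucket "fits in cache": sort the remaining digits directly.
        return lsd_radix(bucket, bits, shift)
    # Otherwise split on the most significant remaining digit.
    mask = (1 << bits) - 1
    sub_buckets = [[] for _ in range(1 << bits)]
    for k in bucket:
        sub_buckets[(k >> shift) & mask].append(k)
    out = []
    for sub in sub_buckets:
        out.extend(cc_radix(sub, cache_keys, bits, shift - bits))
    return out
```

Keys inside one sub-bucket share all higher digits, so each sub-bucket only needs sorting on the lower digits.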
11
Multiway Merge Sort
  • This algorithm exploits data locality very
    efficiently

[Figure: a heap with 2p - 1 nodes merging p sorted subsets.]
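
A minimal sketch of the merge step, assuming p sorted subsets that are merged through a heap. It uses Python's `heapq` binary heap, which holds p entries, as a stand-in for the 2p - 1 node heap on the slide; the function names and the default `p=4` are illustrative.

```python
import heapq


def multiway_merge(runs):
    """Merge already sorted lists into one sorted list using a heap."""
    # Each heap entry: (value, run index, position inside that run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, i, j = heapq.heappop(heap)
        out.append(value)
        if j + 1 < len(runs[i]):
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out


def multiway_merge_sort(keys, p=4):
    """Split keys into at most p subsets, sort each, then merge them."""
    if not keys:
        return []
    size = -(-len(keys) // p)  # ceiling division
    runs = [sorted(keys[s:s + size]) for s in range(0, len(keys), size)]
    return multiway_merge(runs)
```

Each sorted subset is read sequentially, which is where the data locality comes from.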
12
Sorting algorithms for small partitions
  • Insertion sort → Exploits locality in the cache
    line
  • Sorting networks → Register blocking

13
Performance Comparison
[Figure: Pentium III Xeon, 16 M keys (float).]
14
Outline
  • Sorting Algorithms
  • Factors that determine performance
  • The Library
  • Evaluation
  • Future Work
  • Conclusions

15
Factors that determine performance
  • Architectural Factors Considered
      • Cache / TLB size
      • Number of Registers
      • Cache Line Size
  • Runtime Factors Considered
      • Amount of data to sort
      • Distribution of the data

16
Architectural: Cache Size / TLB Size
  • Tiling: partition the data into subsets that fit in
    the cache
  • Quicksort
      • Use multiple pivots to tile
  • CC-radix
      • Fit each partition into the cache
      • Number of active partitions < TLB size
  • Multiway Merge Sort
      • Fit the heap into the cache
      • Fit the sorted subsets into the cache

17
Architectural: Number of Registers
  • For small partitions, sort in place using the
    processor registers
  • Optimizations such as unrolling and instruction scheduling can be
    applied

cmpswap(r0,r1) cmpswap(r2,r3) cmpswap(r1,r2) cmpswap(r0,r3) cmpswap(r4,r5) ..
cmpswap(r0,r1) cmpswap(r2,r3) cmpswap(r4,r5) cmpswap(r1,r2) cmpswap(r0,r3)
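
To make the compare-and-swap idea concrete, here is a small sketch of a sorting network for four values. The five-comparator sequence is a standard 4-input network, not the (truncated) sequence shown on the slide, and the list `r` merely stands in for processor registers.

```python
def cmpswap(r, i, j):
    """Compare-and-swap: afterwards r[i] <= r[j]."""
    if r[i] > r[j]:
        r[i], r[j] = r[j], r[i]


# A standard 5-comparator network for 4 inputs (assumed for illustration).
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(values):
    """Sort exactly four values with a fixed comparator sequence."""
    r = list(values)
    for i, j in NETWORK_4:
        cmpswap(r, i, j)
    return r
```

The comparator sequence is fixed and independent of the data, so it can be fully unrolled, and comparators that touch disjoint positions, such as (0, 1) and (2, 3), can be scheduled next to each other.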
18
Architectural: Cache Line Size
  • Heap fanout is chosen to match the cache line size
  • Increases cache line utilization when accessing
    children nodes

[Figure: children nodes of the heap laid out within a cache line.]
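
One way to read the relation between fanout and cache line size is a heap stored in an array whose nodes have as many children as fit in one cache line. The sketch below is an assumption about that layout: the 64-byte line, the 4-byte node, and the function names are all illustrative, and array alignment is ignored.

```python
def fanout_for_cache_line(line_bytes=64, node_bytes=4):
    """Number of children per node so that siblings fill one cache line."""
    return max(2, line_bytes // node_bytes)


def children(i, fanout):
    """Indices of the children of node i in an array-based heap."""
    first = fanout * i + 1
    return range(first, first + fanout)


def parent(i, fanout):
    """Index of the parent of node i (for i > 0)."""
    return (i - 1) // fanout
```

With this layout the children of a node are contiguous, so visiting all of them touches a single line when the array is suitably aligned.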
19
Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) versus number of keys (millions).]
20
Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) versus number of keys (millions).]
21
Runtime: Standard Deviation
[Figure: Pentium III Xeon, 16 M keys. Execution time (cycles) versus standard deviation of the keys.]
22
Outline
  • Sorting Algorithms
  • Factors that determine performance
  • The Library
  • Evaluation
  • Future Work
  • Conclusions

23
Library adaptation
  • Architectural Factors (handled by empirical search)
      • Cache / TLB size
      • Number of Registers
      • Cache Line Size
  • Runtime Factors (handled by machine learning and runtime adaptation)
      • Distribution shape of the data (does not matter)
      • Amount of data to sort
      • Standard Deviation
24
The Library
  • Building the library → Installation time
      • Empirical Search
      • Learning Procedure (runtime adaptation)
          • Use of training data
  • Running the library → Runtime
      • Runtime Procedure (runtime adaptation)

25
Runtime Adaptation: Learning Procedure
  • Goal function
      • f(N, E) → {Multiway Merge Sort, Quicksort,
        CC-radix}
      • N: amount of input data
      • E: the entropy vector
  • Use N to choose between Multiway Merge Sort and
    Quicksort
  • Use the entropy vector and the Winnow algorithm to learn the
    best algorithm
      • Output: weight vector (w) and threshold (θ)
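
The slide names the Winnow algorithm but does not give its update rule. The sketch below uses the textbook multiplicative update (promote on a missed positive, demote on a false positive), extended to real-valued features by using each feature value as an exponent; that extension, the defaults `alpha=2.0` and `theta = n / 2`, and the convention that a `True` label means "CC-radix is best" are all assumptions.

```python
def winnow_train(samples, labels, theta=None, alpha=2.0, epochs=50):
    """Learn a weight vector w so that sum(w_i * x_i) >= theta predicts True.

    samples: list of feature vectors (e.g. entropy vectors scaled to [0, 1])
    labels:  list of booleans, True when the positive class is correct
    """
    n = len(samples[0])
    theta = n / 2 if theta is None else theta
    w = [1.0] * n
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            s = sum(wi * xi for wi, xi in zip(w, x))
            if (s >= theta) != y:
                mistakes += 1
                # Promote on a missed positive, demote on a false positive.
                factor = alpha if y else 1.0 / alpha
                w = [wi * factor ** xi for wi, xi in zip(w, x)]
        if mistakes == 0:
            break
    return w, theta
```

Features that are zero in a misclassified sample keep their weight, so only the features that contributed to the mistake are adjusted.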

26
Runtime Adaptation: Runtime Procedure
  • Sample the input array
  • Compute the entropy vector
  • Compute S = Σᵢ wᵢ · entropyᵢ
  • If S ≥ θ
      • choose CC-radix
  • else
      • choose others
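
The runtime steps above can be sketched as follows. The slide does not define the entropy vector, so this example assumes one Shannon entropy per radix digit of the sampled keys; the sampling stride, the 8-bit digits, the direction of the threshold test, and the names are likewise assumptions.

```python
import math
from collections import Counter


def entropy_vector(sample, bits=8, key_bits=32):
    """Shannon entropy (in bits) of each radix digit over the sample."""
    mask = (1 << bits) - 1
    n = len(sample)
    vec = []
    for shift in range(0, key_bits, bits):
        counts = Counter((k >> shift) & mask for k in sample)
        h = -sum(c / n * math.log2(c / n) for c in counts.values())
        vec.append(h)
    return vec


def choose_algorithm(keys, w, theta, sample_step=16):
    """Sample the input, compute S = sum(w_i * entropy_i), compare with theta."""
    sample = keys[::sample_step]
    e = entropy_vector(sample)
    s = sum(wi * ei for wi, ei in zip(w, e))
    return "CC-radix" if s >= theta else "others"
```

Only a sample of the input is inspected, which keeps the selection cost small relative to the sort itself.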

27
Outline
  • Sorting Algorithms
  • Factors that determine performance
  • The Library
  • Evaluation
  • Future Work
  • Conclusions

28
Experimental Setup
  • Test Platforms
      • SGI R12000, 300 MHz; L1 I/D: 32 KB; L2: 4 MB
      • UltraSparcIII, 750 MHz; L1 I/D: 32 KB, 64 KB; L2: 8 MB
      • Pentium III Xeon, 550 MHz; L1 I/D: 16 KB; L2: 512 KB
      • IBM Power3, 375 MHz; L1 I/D: 64 KB; L2: 8 MB

29
[Figure: Sun UltraSparcIII, 12 M keys. Execution time (cycles per key) versus standard deviation of the keys.]
30
[Figure: IBM Power3, 12 M keys. Execution time (cycles per key) versus standard deviation of the keys.]
31
Conclusions
  • Identify the architectural and runtime factors
  • Use empirical search to find the best parameter
    values
  • Our machine learning techniques prove to be quite
    effective
      • Always selects the best algorithm.
      • A wrong decision introduces a 37% average
        performance degradation
      • Overhead (average 5%, worst case 7%)

32
Future Work
  • Search in the space of sorting algorithms using
    high-level primitives
  • Extend sorting to include more data types
  • Include other comparison strategies
      • For example, "less than" to sort vectors, graphs, ...
  • Parallel algorithms
  • Explore other database operations, such as join.

33
Empirical Search
  • Adaptation to the architecture of the machine
  • Quicksort and CC-radix
      • The best configuration does not change
        significantly with the characteristics of the
        input data set.
      • Quicksort, CC-radix: use of insertion sort / sorting networks
        for small partitions, and the threshold to use them
      • CC-radix: size of the radix
  • Multiway Merge Sort
      • The best configuration changes with the amount
        and the distribution of the input data.
      • The best values are searched for during the
        learning procedure.
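
An empirical search over one parameter (for example the small-partition threshold or the radix size) might look like the sketch below: run the routine with each candidate value, measure it, and keep the fastest. The function names, the `repeats` count, and the pluggable `measure` hook are assumptions made for illustration.

```python
import time


def wall_time(run, value):
    """Default measurement: elapsed wall-clock seconds of run(value)."""
    start = time.perf_counter()
    run(value)
    return time.perf_counter() - start


def empirical_search(candidates, run, measure=wall_time, repeats=3):
    """Return the candidate parameter value with the lowest measured cost."""
    best_value, best_cost = None, float("inf")
    for value in candidates:
        # Take the minimum over several runs to reduce timing noise.
        cost = min(measure(run, value) for _ in range(repeats))
        if cost < best_cost:
            best_value, best_cost = value, cost
    return best_value
```

For instance, `empirical_search([4, 8, 16, 32], lambda t: sorted(range(1000)))` would time a placeholder workload for each value; a real search would call the sorting routine configured with that parameter on representative training inputs.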

35
Multiway Merge Sort
[Figure: multiway merge sort example, with a heap merging four sorted runs.]
36
Empirical Search
  • Example
      • Multiway Merge
      • Search for the heap size that obtains the best
        performance
      • For different amounts of data and standard deviations