Title: An Install-Time System for Automatic Generation of Optimized Parallel Sorting Algorithms
1 An Install-Time System for Automatic Generation of Optimized Parallel Sorting Algorithms
- Marek Olszewski and Michael Voss
- ECE Department
- University of Toronto
2 Motivation
- Sorting is a fundamental algorithm
- Many algorithmic choices for sorting
- Performance heavily influenced by
- Data being sorted (type, entropy)
- Target machine being used
- How can we build the best sort for a given machine?
- An empirical install-time system
3 Outline of Talk
- Motivation
- An Overview of Sorting Algorithms
- Our install-time empirical system
- An adaptive hybrid sequential sort
- An adaptive hybrid parallel sort
- An Evaluation
- Related Work
- Conclusions
4 An overview of sorting algorithms
- The Art of Computer Programming, Vol. 3 (Knuth)
- 25 algorithms comprehensively studied
- Comparison sorts
- Lower bound shown to be Ω(n log n)
- Examples include insertion sort, quick sort and merge sort
- Non-comparison sorts
- Can be linear time, i.e. O(n)
- But require knowing the range of the data
- Examples include radix sort and bucket sort
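
To illustrate why a non-comparison sort needs the range of the data, here is a minimal counting sort sketch. It is not taken from the talk; the function name `counting_sort` and the `max_value` parameter are assumptions made for illustration, and the input is assumed to hold only integers in [0, max_value].

```cpp
#include <cstddef>
#include <vector>

// Counting sort for integers assumed to lie in [0, max_value].
// Runs in O(n + k) time with k = max_value + 1 and never compares two
// elements with each other. max_value is the "range of the data" that
// has to be known in advance.
std::vector<int> counting_sort(const std::vector<int>& input, int max_value) {
    std::vector<std::size_t> counts(static_cast<std::size_t>(max_value) + 1, 0);
    for (int v : input) {
        ++counts[static_cast<std::size_t>(v)];  // assumes 0 <= v <= max_value
    }
    std::vector<int> output;
    output.reserve(input.size());
    for (int v = 0; v <= max_value; ++v) {
        // Emit each value as many times as it was seen.
        output.insert(output.end(), counts[static_cast<std::size_t>(v)], v);
    }
    return output;
}
```

The count array grows with the range, not with the input size, which is why this approach only pays off when the range is known and modest.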
5 An overview of sorting algorithms
- Hybrid sorts
- Divide and conquer sorts are recursive
- May be beneficial to switch algorithms
- Most C++ STL sorts are hybrid sorts
- GNU std::sort is a hybrid sort with pre-defined points to switch between heap sort, quick sort, merge sort and insertion sort
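
A hybrid sort with a pre-defined switch point can be sketched as below. This is an illustrative quicksort that falls back to insertion sort on small ranges, not the GNU implementation; the names `hybrid_sort`, `insertion_sort` and the cutoff value 16 are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Iter = std::vector<int>::iterator;

// Hypothetical fixed switch point; a tuned system would pick it per machine.
constexpr std::ptrdiff_t kInsertionCutoff = 16;

// Insertion sort on [first, last): rotate each element into its place.
void insertion_sort(Iter first, Iter last) {
    for (Iter i = first; i != last; ++i) {
        std::rotate(std::upper_bound(first, i, *i), i, i + 1);
    }
}

// Quicksort that switches to insertion sort once a range is small enough.
void hybrid_sort(Iter first, Iter last) {
    if (last - first <= kInsertionCutoff) {
        insertion_sort(first, last);
        return;
    }
    const int pivot = *(first + (last - first) / 2);
    // Three-way partition: [< pivot | == pivot | > pivot].
    Iter less_end = std::partition(first, last, [pivot](int x) { return x < pivot; });
    Iter equal_end = std::partition(less_end, last, [pivot](int x) { return x == pivot; });
    hybrid_sort(first, less_end);
    hybrid_sort(equal_end, last);
}
```

A call such as `hybrid_sort(v.begin(), v.end())` sorts a `std::vector<int>` in place. The fixed constant is exactly the kind of choice an install-time system would replace with a measured one.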
6 An overview of parallel sorts
- Ideally, O((n log n) / p)
- If p = n, then O(log n)
- Several parallel sorts demonstrate this bound, e.g. Column sort
- Parallelized sequential sorts often better for low numbers of processors (our focus).
- Parallelized divide and conquer algorithms
- Effective for small numbers of processors
- Use a work-queue model
- Tasks are placed in a shared work-queue
- Idle processors remove tasks from the queue
- Good load balance
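
The work-queue model above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the class `WorkQueueSorter`, the wrapper `work_queue_sort`, the single mutex-protected queue, and the use of `std::sort` for small tasks are all choices made here for brevity.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Illustrative work-queue quicksort; all names are hypothetical.
class WorkQueueSorter {
public:
    WorkQueueSorter(std::vector<int>& data, std::size_t cutoff)
        : data_(data), cutoff_(cutoff) {}

    void run(unsigned num_threads) {
        if (data_.size() < 2) return;
        push(0, data_.size());
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < std::max(1u, num_threads); ++t)
            workers.emplace_back([this] { work(); });
        for (std::thread& w : workers) w.join();
    }

private:
    using Task = std::pair<std::size_t, std::size_t>;  // half-open index range

    void push(std::size_t first, std::size_t last) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(Task(first, last));
            ++pending_;  // counts queued plus in-progress tasks
        }
        ready_.notify_one();
    }

    void work() {
        for (;;) {
            Task task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return !tasks_.empty() || pending_ == 0; });
                if (tasks_.empty()) return;  // nothing queued, nothing in progress
                task = tasks_.front();
                tasks_.pop();
            }
            process(task);
            bool done;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                done = (--pending_ == 0);
            }
            if (done) ready_.notify_all();  // wake idle workers so they can exit
        }
    }

    void process(const Task& task) {
        auto begin = data_.begin();
        auto first = begin + static_cast<std::ptrdiff_t>(task.first);
        auto last = begin + static_cast<std::ptrdiff_t>(task.second);
        if (task.second - task.first <= cutoff_) {
            std::sort(first, last);  // small task: finish it locally
            return;
        }
        const int pivot = *(first + (last - first) / 2);
        auto mid1 = std::partition(first, last, [pivot](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, last, [pivot](int x) { return x == pivot; });
        const std::size_t m1 = static_cast<std::size_t>(mid1 - begin);
        const std::size_t m2 = static_cast<std::size_t>(mid2 - begin);
        if (m1 - task.first > 1) push(task.first, m1);    // share the left part
        if (task.second - m2 > 1) push(m2, task.second);  // share the right part
    }

    std::vector<int>& data_;
    std::size_t cutoff_;
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<Task> tasks_;
    std::size_t pending_ = 0;
};

void work_queue_sort(std::vector<int>& data, unsigned num_threads, std::size_t cutoff) {
    WorkQueueSorter(data, cutoff).run(num_threads);
}
```

Idle threads block on the condition variable until a task appears or all work is finished, which is what gives the model its load balance. The price is one lock acquisition per task, so very small tasks are better sorted locally.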
7 Our install-time system
Flowchart of the installer:
- Start
- Sample input data provided to installer
- Time sorts: random algorithms at each recursive step
- Calculate best sorting algorithm for each data set size
- C4.5 creates decision tree
- Convert tree to C++
- Specialized decision function placed in library
- Parallel? If not, end
- Time sorts: different input sizes and work-share points
- Work-share cutoff point tree and C++ functions generated
- End
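
As a rough idea of what a decision tree converted to code might look like, consider the sketch below. It only shows the shape of such a function: the enum `SortChoice`, the function `choose_sort`, the set of algorithms, and both thresholds are hypothetical placeholders, not values produced by the system described in this talk.

```cpp
#include <cstddef>

// Hypothetical set of algorithms the decision function can select.
enum class SortChoice { Insertion, Quick, Merge };

// Shape of a decision function derived from a decision tree keyed on
// input size. The thresholds are placeholders, not measured results.
inline SortChoice choose_sort(std::size_t n) {
    if (n <= 32) {
        return SortChoice::Insertion;
    }
    if (n <= 65536) {
        return SortChoice::Quick;
    }
    return SortChoice::Merge;
}
```

A hybrid sort would call such a function at each recursive step and dispatch on the result, so the switch points live in generated code and not in hand-written constants.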
8 Algorithms available to our hybrid sort
9 Hybrid Adaptive Sequential Sort
- Use random data to train system
- Up to 10 million elements
- Insertion sort not used for large inputs
- Not all inputs sorted to completion
- Dynamic programming used to find best choice
- Assume best sort at each subsequent step
- Per-step timings were measured
- C4.5 decision tree used to analyze this data
- C4.5 tree converted to C++ template code
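
The dynamic programming step can be sketched under a simplified cost model. The assumptions here are not from the talk: input sizes are powers of two, a recursive algorithm splits a problem of size 2^k into two problems of size 2^(k-1), and a non-recursive algorithm (such as insertion sort) has a single total cost per size. The types `Candidate` and `Plan` and the function `plan_best_sorts` are hypothetical names.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Candidate {
    bool recursive;                 // true: one step, then two half-size subproblems
    std::vector<double> step_cost;  // step_cost[k]: measured time at size 2^k
};

struct Plan {
    std::vector<double> best_cost;       // cheapest total time at size 2^k
    std::vector<std::size_t> best_algo;  // index of the candidate achieving it
};

// Bottom-up dynamic program: the best choice at size 2^k assumes the best
// choice is used at every subsequent (smaller) step.
Plan plan_best_sorts(const std::vector<Candidate>& candidates, std::size_t levels) {
    Plan plan;
    plan.best_cost.assign(levels, std::numeric_limits<double>::infinity());
    plan.best_algo.assign(levels, 0);
    for (std::size_t k = 0; k < levels; ++k) {
        for (std::size_t a = 0; a < candidates.size(); ++a) {
            const Candidate& c = candidates[a];
            double total = c.step_cost[k];
            if (c.recursive) {
                if (k == 0) continue;  // a single element cannot be split
                total += 2.0 * plan.best_cost[k - 1];
            }
            if (total < plan.best_cost[k]) {
                plan.best_cost[k] = total;
                plan.best_algo[k] = a;
            }
        }
    }
    return plan;
}
```

Because only per-step timings are needed, a recursive candidate does not have to be run to completion on every input, which is consistent with the note that not all inputs are sorted to completion. The resulting (size, best algorithm) pairs are the kind of table a decision-tree learner can then generalize.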
10 Hybrid Adaptive Parallel Sort
- Start with sequential hybrid sort
- Determine work-sharing cutoff point
- When should a thread execute its own tasks?
- When should a thread place tasks in the work queue?
- Determines the point at which tasks become too small to amortize synchronization costs
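
Finding the work-sharing cutoff empirically amounts to timing the parallel sort at several candidate cutoffs and keeping the fastest. The sketch below shows that selection loop only. The names `time_seconds` and `select_cutoff` are hypothetical, and `measure` is an assumed caller-supplied hook that runs the sort on training data with a given cutoff and returns the elapsed seconds.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Wall-clock seconds taken by one call to `work`.
double time_seconds(const std::function<void()>& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Returns the candidate cutoff with the smallest measured time, or 0 if
// there are no candidates. `measure(cutoff)` is a hypothetical hook that
// times the parallel sort with that cutoff.
std::size_t select_cutoff(const std::vector<std::size_t>& candidates,
                          const std::function<double(std::size_t)>& measure) {
    std::size_t best = 0;
    double best_time = std::numeric_limits<double>::infinity();
    for (std::size_t cutoff : candidates) {
        const double t = measure(cutoff);
        if (t < best_time) {
            best_time = t;
            best = cutoff;
        }
    }
    return best;
}
```

In practice `measure` would wrap a call to `time_seconds` around the parallel sort, probably repeated a few times to reduce timing noise.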
11 Methodology: Platforms
- Sequential platforms
- Linux 2.4.18 on an Intel Pentium 4 1.6 GHz Xeon
- Linux 2.4.24 on an AMD Athlon XP 1700
- SunOS 5.8 on a 600 MHz Sparc Workstation
- Parallel platform
- 4 processor 1.6 GHz Intel Xeon SMP
- Modified 2.4.18-smp kernel (allowed binding)
12 Methodology: Comparisons
- Adaptive Hybrid Sequential Sort
- Adaptive Hybrid Parallel Sort
- GNU G++ 2.96 std::sort and std::stable_sort
- Also hybrid sorts
- Complex, not easily parallelized
- 8 equally sized merge sorts that called std::sort and std::stable_sort in parallel
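
One plausible reading of this parallel baseline is sketched below: split the input into 8 near-equal chunks, sort each chunk with `std::sort` on its own thread, then merge the sorted chunks. The function name `chunked_parallel_sort`, the use of `std::thread`, and the sequential pairwise merge are assumptions; the talk does not show how its baseline was written.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Sorts `data` by sorting num_chunks near-equal chunks in parallel with
// std::sort, then merging neighbouring sorted runs pairwise.
void chunked_parallel_sort(std::vector<int>& data, std::size_t num_chunks = 8) {
    if (num_chunks == 0) num_chunks = 1;
    const std::size_t n = data.size();

    // bounds[i] is the start index of chunk i; bounds[num_chunks] == n.
    std::vector<std::size_t> bounds(num_chunks + 1);
    for (std::size_t i = 0; i <= num_chunks; ++i) bounds[i] = n * i / num_chunks;

    auto at = [&data, &bounds](std::size_t i) {
        return data.begin() + static_cast<std::ptrdiff_t>(bounds[i]);
    };

    // Phase 1: one thread per chunk, each working on a disjoint range.
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_chunks; ++i) {
        workers.emplace_back([&at, i] { std::sort(at(i), at(i + 1)); });
    }
    for (std::thread& w : workers) w.join();

    // Phase 2: merge runs of width 1, 2, 4, ... chunks (done sequentially here).
    for (std::size_t width = 1; width < num_chunks; width *= 2) {
        for (std::size_t i = 0; i + width < num_chunks; i += 2 * width) {
            const std::size_t hi = std::min(i + 2 * width, num_chunks);
            std::inplace_merge(at(i), at(i + width), at(hi));
        }
    }
}
```

Replacing `std::sort` with `std::stable_sort` gives the stable variant of the same baseline.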
13 Serial Non-Optimized (w/o O) Results
14 Serial Optimized (w O) Results
15 Parallel Work-share Cutoff Point
16 Parallel Non-Optimized (w/o O) Results
17 Parallel Optimized (with O) Results
18 Parallel Sort Speedups
19 Related Work
- Install-time empirical optimization systems
- ATLAS: Level 3 BLAS
- FFTW: FFT
- STAPL: Adaptive Parallel C++ Library
- Uses decision trees like our approach
- Uses only single-level sorts, not hybrids
- Not available for comparison
- A Dynamically Tuned Sorting Library (CGO '04)
- Install-time tuning of sequential sorts
- Only single-level sorts, not hybrid
20 Conclusion
- Presented an install-time system for empirically constructing a best sorting algorithm for a target machine
- Competitive with STL sort on 1 processor
- Better than a parallelized STL sort on multiple processors