Title: Operating System Support for Fine-Grain Parallelism on Multicore Architectures
1 Operating System Support for Fine-Grain Parallelism on Multicore Architectures
John Giacomoni
- Manish Vachharajani
- University of Colorado at Boulder
- 2007.10.14
2 Problem
- Uniprocessor (UP) performance at end of life
- Chip-Multiprocessor systems
- What do we want from multicore systems?
- Individual cores less powerful than UP
- Asymmetric and Heterogeneous
- 10s-100s-1000s of cores
Performance!
[Figure labels: Intel (2x2-core), MIT RAW (16-core), 100-core, 400-core]
3 Extracting Performance
- Task Parallelism
- Desktop
- Data Parallelism
- Web serving
- Split/Join, MapReduce, etc
- Pipeline Parallelism
- Video decoding
- Network processing
4 Extracting Performance (2)
- Stream Parallelism
- Combines
- Data Parallelism
- Pipeline Parallelism
- Ad-Hoc Parallelism
- Semi- or unstructured
- Usual thread model
5 Focus on Pipeline Parallelism
- Most stringent timing requirements
- Example applications
- Network Processing
- Network Intrusion Detection
- DDoS Filtering
- Multimedia processing
- Transcoding
- Signal Processing
- Software Defined Radio
- Also applies to
- Data parallelism
- Stream Parallelism
6 Soft Network Processing (Soft-NP)
- GigE Network Properties
- 1,488,095 frames/sec
- 672 ns/frame
- Frame dependencies
- Frame Shared Memory: Line-Rate Networking on Commodity Hardware. To appear in Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), December 2007. John Giacomoni, John K. Bennett, Antonio Carzaniga, Douglas C. Sicker, Manish Vachharajani, and Alexander L. Wolf.
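The frame rate and per-frame budget above follow from Ethernet framing arithmetic. The sketch below reproduces both numbers, assuming the standard minimum frame of 64 bytes plus 8 bytes of preamble/start delimiter and a 12-byte inter-frame gap. These constants are standard Ethernet values rather than figures from the slide, and the function names are illustrative.

```python
# Worst-case (minimum-size frame) timing on Gigabit Ethernet.
LINK_RATE_BPS = 1_000_000_000   # 1 Gb/s
MIN_FRAME_BYTES = 64            # minimum Ethernet frame, including FCS
PREAMBLE_SFD_BYTES = 8          # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12       # mandatory idle time between frames


def wire_bits_per_frame(frame_bytes=MIN_FRAME_BYTES):
    """Bits of link time one frame occupies, including framing overhead."""
    return (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8


def ns_per_frame(frame_bytes=MIN_FRAME_BYTES):
    """Time budget per frame in nanoseconds at line rate."""
    return wire_bits_per_frame(frame_bytes) * 1_000_000_000 // LINK_RATE_BPS


def frames_per_second(frame_bytes=MIN_FRAME_BYTES):
    """Whole frames that fit on the link in one second."""
    return LINK_RATE_BPS // wire_bits_per_frame(frame_bytes)


print(ns_per_frame(), frames_per_second())
```

Every pipeline stage therefore has to finish its work on a frame, including any communication with its neighbors, within 672 ns to keep up with line rate.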
7 Frame Shared Memory (Soft-NP)
[Figure labels: Input (IP), Output (OP)]
8 What OS support is necessary?
9 Low-Overhead Communication
Gigabit Ethernet
- Syscalls: 170 ns
- pthread mutex: 200 ns
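To put these costs in context, the sketch below expresses each one as a share of the 672 ns per-frame budget from the Gigabit Ethernet slide. The `ops_per_frame` parameter is an assumption added for illustration: a middle pipeline stage typically performs at least one receive and one send per frame.

```python
FRAME_BUDGET_NS = 672                               # per-frame budget at GigE line rate
COSTS_NS = {"syscall": 170, "pthread mutex": 200}   # costs quoted on this slide


def budget_share(cost_ns, ops_per_frame=1, budget_ns=FRAME_BUDGET_NS):
    """Fraction of the per-frame budget consumed by ops_per_frame operations."""
    return cost_ns * ops_per_frame / budget_ns


for name, cost in COSTS_NS.items():
    one = budget_share(cost)
    two = budget_share(cost, ops_per_frame=2)
    print(f"{name}: {one:.1%} per operation, {two:.1%} for two operations")
```

Under that assumption, two mutex-protected queue operations alone would consume more than half of the budget before the stage does any useful work.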
10 FastForward
- Portable, software-only framework
- 35-40 ns/queue operation on a 2.0 GHz AMD Opteron
- 26-28 ns/queue operation on a 2.6 GHz AMD Opteron
- Architecturally tuned concurrent lock-free (CLF) queues
- Works with strong to weak consistency models
- Hides die-die communication
- Robust against unbalanced stages
- Poster: FastForward for Efficient Pipeline Parallelism. Proceedings of the 16th International Conference on Parallel Architectures and Compilation Techniques (PACT), September 2007. John Giacomoni, Tipp Moseley, Manish Vachharajani.
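The central idea behind a FastForward-style queue, as described in the FastForward publications, is that each slot signals its own state: an empty slot holds a sentinel, so the producer and consumer never read each other's index. The sketch below shows only that logic for a single producer and a single consumer. The class name is hypothetical, `None` stands in for the NULL sentinel, and Python reproduces neither the cache behavior nor the timings quoted above.

```python
class SlotSentinelQueue:
    """Single-producer/single-consumer ring buffer; None marks an empty slot."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0  # next slot to write; touched only by the producer
        self._tail = 0  # next slot to read; touched only by the consumer

    def enqueue(self, item):
        """Producer side: returns False instead of blocking when full."""
        if item is None:
            raise ValueError("None is reserved as the empty-slot marker")
        if self._slots[self._head] is not None:
            return False  # slot not yet consumed, so the queue is full
        self._slots[self._head] = item
        self._head = (self._head + 1) % len(self._slots)
        return True

    def dequeue(self):
        """Consumer side: returns None instead of blocking when empty."""
        item = self._slots[self._tail]
        if item is None:
            return None  # nothing produced into this slot yet
        self._slots[self._tail] = None  # hand the slot back to the producer
        self._tail = (self._tail + 1) % len(self._slots)
        return item
```

Because full and empty are detected from the slot contents, the only memory shared between the two sides is the slot array itself. A real implementation would be written in C and would add the memory-ordering and cache-line considerations that this sketch omits.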
11 FastForward Performance
[Figure legend: Lamport, FF, FF Unbalanced, FF Re-Balanced]
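The Lamport baseline in this comparison is the classic single-producer/single-consumer lock-free queue in which both sides consult a shared head index and a shared tail index. A minimal sketch of that logic follows; the class name is hypothetical and the code illustrates the algorithm only, not the measured implementation.

```python
class LamportQueue:
    """Single-producer/single-consumer ring buffer with shared indices."""

    def __init__(self, capacity):
        # One extra slot stays unused so that full and empty are distinguishable.
        self._slots = [None] * (capacity + 1)
        self._head = 0  # written by the producer, read by the consumer
        self._tail = 0  # written by the consumer, read by the producer

    def enqueue(self, item):
        """Producer side: returns False when the queue is full."""
        nxt = (self._head + 1) % len(self._slots)
        if nxt == self._tail:   # reads the consumer's index
            return False
        self._slots[self._head] = item
        self._head = nxt
        return True

    def dequeue(self):
        """Consumer side: returns None when the queue is empty."""
        if self._head == self._tail:   # reads the producer's index
            return None
        item = self._slots[self._tail]
        self._tail = (self._tail + 1) % len(self._slots)
        return item
```

Each operation reads an index that the other side writes, so on a multicore machine the cache lines holding the indices move between cores on nearly every operation.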
12 Zero-Stall Guarantee
13 Gang Scheduling
- Optimize for application performance
- Instead of system throughput or fairness
- Computer Utility -> max(System Utilization)
- Multicore system -> excess of resources.
- Dedicate resources to pipeline applications
- Want selective timesharing
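Today, dedicating cores to a pipeline can only be approximated from user space with CPU affinity. The sketch below plans one core per stage and pins the caller where the platform allows it. The function names are hypothetical. `os.sched_setaffinity` exists only on some platforms (notably Linux), and it does not stop the OS from timesharing other tasks onto the same core, which is the gap that selective timesharing would close.

```python
import os


def assign_stages_to_cores(stages, cores):
    """Map each pipeline stage to its own core; refuse to share a core."""
    if len(stages) > len(cores):
        raise ValueError("not enough cores to dedicate one per stage")
    return dict(zip(stages, sorted(cores)))


def pin_caller_to_core(core):
    """Restrict the caller to one core; returns False if unsupported here."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, {core})  # 0 means "the caller"
    return True


plan = assign_stages_to_cores(["input", "analyze", "output"], {0, 1, 2, 3})
print(plan)
```

Each stage would call `pin_caller_to_core(plan[stage_name])` when it starts. The plan itself is just data, so it can be computed and checked before any thread is launched.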
14 System Services
- Fast!
- Synchronous calls introduce too much overhead
- System calls: 170 ns
- Asynchronous calls may limit parallelism
- Want system services with independent I/O paths
15 Pipelinable System Services
- Mixing stages from multiple process domains
- Push model vs. call/return or poll
- Hardware can be an active participant
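One way to picture the push model is sketched below: every stage, whether application code or a system service, consumes from its own inbox and pushes its result into the next stage's inbox, so no stage waits on a return value or polls a service. The function name is hypothetical, a single loop stands in for stages that would really run on separate cores or devices, and at least one stage is assumed.

```python
from collections import deque


def run_push_pipeline(stages, inputs):
    """Run items through stages (one-argument functions) in pipeline order."""
    inboxes = [deque() for _ in stages]
    outputs = []
    inboxes[0].extend(inputs)
    while any(inboxes):                       # some stage still has work queued
        for i, work in enumerate(stages):
            if not inboxes[i]:
                continue                      # nothing has been pushed to this stage yet
            result = work(inboxes[i].popleft())
            # Push forward; the final stage pushes to the pipeline output.
            target = inboxes[i + 1] if i + 1 < len(stages) else outputs
            target.append(result)
    return outputs


# The middle stage plays the role of a system service placed in the pipeline.
stages = [str.strip, str.upper, lambda frame: frame + "!"]
print(run_push_pipeline(stages, [" a ", " b "]))
```

Data only moves forward, so a service stage could live in another process domain or in hardware without changing how its neighbors are written.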
16 Heterogeneous Gang Scheduling
- Need a single scheduling label for every pipeline stage
- Ensures simultaneous scheduling of every necessary resource (zero-stall guarantee)
- Including hardware stages.
- Scheduling multi-domain entities
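The all-or-nothing nature of gang scheduling across heterogeneous resources can be sketched as a placement check: a pipeline carrying one scheduling label is admitted only if every stage, hardware stages included, can be given a resource at the same time. The function name, the resource kinds (`"cpu"`, `"nic"`), and the data layout are hypothetical and chosen for illustration only.

```python
def try_schedule_gang(gang, free):
    """Place every stage of a gang at once, or place none of them.

    gang: {stage_name: resource_kind}
    free: {resource_kind: [available resource ids]}
    Returns {stage_name: resource_id}, or None if any stage cannot be placed.
    """
    remaining = {kind: list(ids) for kind, ids in free.items()}  # leave caller's data intact
    placement = {}
    for stage, kind in gang.items():
        if not remaining.get(kind):
            return None          # one missing resource blocks the whole gang
        placement[stage] = remaining[kind].pop(0)
    return placement


pipeline = {"rx": "nic", "filter": "cpu", "tx": "cpu"}
print(try_schedule_gang(pipeline, {"cpu": [0, 1], "nic": ["eth0"]}))
print(try_schedule_gang(pipeline, {"cpu": [0], "nic": ["eth0"]}))
```

Refusing a partial placement is what keeps a running stage from stalling on a neighbor that was never scheduled.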
17 Multi-Domain Entities
- Application state
- Shared with local stages
- Pipeline private state
- Stage state shared with pipeline and parent process.
- The multi-domain application model respects the private data model implicit in single-domain applications while providing first-class naming for multi-domain pipelines.
18 Summary of Discussion
- Low-overhead communication
- Zero-stall guarantee
- Selective timesharing
- Pipelinable system services
- Heterogeneous gang scheduling
- Pipelines as multi-domain applications
19 Questions?
john.giacomoni_at_colorado.edu