Title: Flexible and Efficient I/O Optimization for Parallel Applications
2Hiding Periodic I/O Costs in Parallel Applications
- Xiaosong Ma
- Department of Computer Science
- University of Illinois at Urbana-Champaign
- Spring 2003
3Roadmap
- Introduction
- Active buffering: hiding recurrent output cost
- Ongoing work: hiding recurrent input cost
- Conclusions
4Introduction
- Fast-growing technology propels high-performance applications
- Scientific computation
- Parallel data mining
- Web data processing
- Games, movie graphics
- Growth of individual components is uncoordinated
- Manual performance tuning needed
5We Need Adaptive Optimization
- Flexible and automatic performance optimization desired
- Efficient high-level buffering and prefetching for parallel I/O in scientific simulations
6Scientific Simulations
- Important
- Detail and flexibility
- Save money and lives
- Challenging
- Multi-disciplinary
- High performance crucial
7Parallel I/O in Scientific Simulations
- Write-intensive
- Collective and periodic
- Poor stepchild
- Bottleneck-prone
- Existing collective I/O focused on data transfer
[Figure: execution timeline alternating between computation and I/O phases]
8My Contributions
- Idea: I/O optimizations in larger scope
- Parallelism between I/O and other tasks
- Individual simulations' I/O needs
- I/O-related self-configuration
- Approach: hide the I/O cost
- Results
- Publications, technology transfer, software
9Roadmap
- Introduction
- Active buffering: hiding recurrent output cost
- Ongoing work: hiding recurrent input cost
- Conclusions
10 Latency Hierarchy on Parallel Platforms
local memory access
inter-processor communication
disk I/O
wide-area transfer
- Along path of data transfer
- Smaller throughput
- Lower parallelism and less scalable
11Basic Idea of Active Buffering
- Purpose: maximize overlap between computation and I/O
- Approach: buffer data as early as possible
12Challenges
- Accommodate multiple I/O architectures
- No assumption on buffer space
- Adaptive
- Buffer availability
- User request patterns
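The challenges above can be made concrete with a small sketch. It shows one way a buffering layer might adapt to whatever buffer space is available: copy the data and return immediately while space remains, and fall back to a direct write on overflow. The class name `AdaptiveBuffer`, the `capacity_bytes` budget, and the `drain()` method are hypothetical names chosen for illustration; they are not taken from Panda or any other library mentioned in these slides.

```python
import io


class AdaptiveBuffer:
    """Buffers output blocks in memory; writes through when space runs out."""

    def __init__(self, sink, capacity_bytes):
        self.sink = sink                # any object with a write(bytes) method
        self.capacity = capacity_bytes  # hypothetical buffer budget
        self.pending = []               # buffered blocks, in arrival order
        self.used = 0

    def write(self, block):
        """Called by the application; returns 'buffered' or 'overflow'."""
        if self.used + len(block) <= self.capacity:
            # Copy the data so the caller may reuse its own array right away.
            self.pending.append(bytes(block))
            self.used += len(block)
            return "buffered"
        # Out of buffer space: flush older data first to preserve ordering,
        # then pay the visible cost of writing this block directly.
        self.drain()
        self.sink.write(block)
        return "overflow"

    def drain(self):
        """Write out everything buffered; meant to run while the app computes."""
        for block in self.pending:
            self.sink.write(block)
        self.pending.clear()
        self.used = 0


sink = io.BytesIO()
buf = AdaptiveBuffer(sink, capacity_bytes=8)
results = [buf.write(b"aaaa"), buf.write(b"bbbb"), buf.write(b"cccc")]
buf.drain()
```

In this run the first two blocks fit in the 8-byte budget and the third overflows, so only the third write is visible to the caller as real I/O.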
13Roadmap
- Introduction
- Active buffering: hiding recurrent output cost
- With client-server I/O architecture [IPDPS 02]
- With server-less architecture
- Ongoing work: hiding recurrent input cost
- Related work and future work
- Conclusions
14Client-Server I/O Architecture
compute processors
I/O servers
15Client State Machine
16Server State Machine
[Figure: state-machine diagram. Labels recoverable from the slide: init., alloc. buffers, prepare, idle-listen, busy-listen, recv., fetch, write, exit, data to receive, enough buffer space, out of buffer space, receive a block, fetch a block, write a block, got write req., all data received, data to fetch, write done, no request, exit msg.]
17Maximize Apparent Throughput
- Ideal apparent throughput per server:
$$T_{ideal} = \frac{D_{total}}{\dfrac{D_{c\text{-}buffered}}{T_{mem\text{-}copy}} + \dfrac{D_{c\text{-}overflow}}{T_{msg\text{-}passing}} + \dfrac{D_{s\text{-}overflow}}{T_{write}}}$$
- More expensive data transfer only becomes visible when overflow happens
- Efficiently masks the difference in write speeds
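A short worked example may help in reading the throughput expression on this slide. The sketch below assumes that the D terms are data amounts (data buffered at the clients, data overflowing the client buffers, and data overflowing the server buffers), that the T terms in the denominator are throughputs, and that the three terms are summed. The function name and all numeric values are hypothetical and are not measurements from the talk.

```python
def ideal_apparent_throughput(d_c_buffered, d_c_overflow, d_s_overflow,
                              t_mem_copy, t_msg_passing, t_write):
    """Data amounts in MB, throughputs in MB/s; returns MB/s."""
    d_total = d_c_buffered + d_c_overflow + d_s_overflow
    # Each portion of the data is charged at the speed of the cheapest
    # level of the hierarchy that could absorb it.
    visible_time = (d_c_buffered / t_mem_copy
                    + d_c_overflow / t_msg_passing
                    + d_s_overflow / t_write)
    return d_total / visible_time


# Hypothetical speeds: memory copy 400 MB/s, message passing 100 MB/s,
# disk write 20 MB/s.
no_overflow = ideal_apparent_throughput(64, 0, 0, 400, 100, 20)
client_overflow = ideal_apparent_throughput(64, 32, 0, 400, 100, 20)
```

With these assumed numbers, the application sees memory-copy speed when everything fits in the client buffers, and roughly half of that once 32 MB has to be shipped by message passing.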
18Write Throughput without Overflow
- Panda Parallel I/O library
- SGI Origin 2000, SHMEM
- Per client: 16MB output data per snapshot, 64MB buffer
- Two servers, each with 256MB buffer
19Write Throughput with Overflow
- Panda Parallel I/O library
- SGI Origin 2000, SHMEM, MPI
- Per client: 96MB output data per snapshot, 64MB buffer
- Two servers, each with 256MB buffer
20Give Feedback to Application
- Softer I/O requirements
- Parallel I/O libraries have been passive
- Active buffering allows I/O libraries to take more active role
- Find optimal output frequency automatically
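The slides do not say how the output frequency would be chosen. As a minimal sketch of one possible criterion, the function below picks the smallest number of timesteps between snapshots for which the background write of one snapshot finishes before the next snapshot is produced. The function name, the 20 MB/s background write speed, and the 2-second timestep are assumptions for illustration; only the 160MB snapshot size echoes a figure quoted elsewhere in the talk.

```python
import math


def min_hidden_interval(snapshot_mb, background_write_mb_s, step_seconds):
    """Smallest snapshot interval (in timesteps) that fully hides the write."""
    write_seconds = snapshot_mb / background_write_mb_s
    # At least one timestep must separate consecutive snapshots.
    return max(1, math.ceil(write_seconds / step_seconds))


# Hypothetical machine: 20 MB/s background write speed, 2 s per timestep.
interval = min_hidden_interval(snapshot_mb=160, background_write_mb_s=20,
                               step_seconds=2)
```

Under these assumptions one snapshot takes 8 seconds to write, so an I/O library could report back that output every 4 timesteps or more costs the application almost nothing.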
21Server-side Active Buffering
[Figure: state-machine diagram. Labels recoverable from the slide: init., alloc. buffers, prepare, idle-listen, busy-listen, recv., fetch, write, exit, data to receive, enough buffer space, out of buffer space, receive a block, fetch a block, write a block, got write req., all data received, data to fetch, write done, no request, exit msg.]
22Performance with Real Applications
- Application overview: GENX
- Large-scale, multi-component, detailed rocket simulation
- Developed at Center for Simulation of Advanced Rockets (CSAR), UIUC
- Multi-disciplinary, complex, and evolving
- Providing parallel I/O support for GENX
- Identification of parallel I/O requirements [PDSECA 03]
- Motivation and test case for active buffering
23Overall Performance of GEN1
- SDSC IBM SP (Blue Horizon)
- 64 clients, 2 I/O servers with AB
- 160MB output data per snapshot (in HDF4)
24Aggregate Write Throughput in GEN2
- LLNL IBM SP (ASCI Frost)
- 1 I/O server per 16-way SMP node
- Write in HDF4
25Scientific Data Migration
- Output data need to be moved
- Online migration
- Extend active buffering to migration
- Local storage becomes another layer in buffer
hierarchy
[Figure: execution timeline alternating between computation and I/O phases]
26I/O Architecture with Data Migration
compute processors
27Active Buffering for Data Migration
- Avoid unnecessary local I/O
- Hybrid migration approach: memory-to-memory transfer and disk staging
- Combined with data compression [ICS 02]
- Self-configuration for online visualization
28Roadmap
- Introduction
- Active buffering: hiding recurrent output cost
- With client-server I/O architecture
- With server-less architecture [IPDPS 03]
- Ongoing work: hiding recurrent input cost
- Conclusions
29Server-less I/O Architecture
I/O thread
compute processors
30Making ABT Transparent and Portable
- Unchanged interfaces
- High-level and file-system independent
- Design and evaluation [IPDPS 03]
- Ongoing transfer to ROMIO
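To illustrate what "unchanged interfaces" could look like with an I/O thread, here is a minimal sketch: a wrapper that exposes the same `write()` and `close()` calls as the object it wraps, but only copies the data and hands it to a background thread. The class name `BufferedWriter` is hypothetical, this is not the ROMIO or Panda implementation, and the queue is left unbounded for brevity (a real layer would cap buffer space and handle overflow).

```python
import io
import queue
import threading


class BufferedWriter:
    """Same write()/close() interface as the wrapped sink; a background
    thread performs the actual writes."""

    def __init__(self, sink):
        self._sink = sink
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            block = self._queue.get()
            if block is None:           # sentinel: no more data will arrive
                return
            self._sink.write(block)

    def write(self, data):
        # Copy the data so the caller can immediately reuse its own buffer.
        self._queue.put(bytes(data))
        return len(data)

    def close(self):
        self._queue.put(None)
        self._thread.join()             # wait until buffered data is written


sink = io.BytesIO()
out = BufferedWriter(sink)
for step in range(3):
    out.write(f"snapshot {step}\n".encode())
    # ... the computation for the next timestep would run here ...
out.close()
```

Because the wrapper keeps the familiar call signatures, application code that previously wrote to `sink` directly would only need to change the line that constructs the output object.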
31Active Buffering vs. Asynchronous I/O
32Roadmap
- Introduction
- Active buffering: hiding recurrent output cost
- Ongoing work: hiding recurrent input cost
- Conclusions
33I/O in Visualization
- Periodic reads
- Dual modes of operation
- Interactive
- Batch-mode
- Harder to overlap reads with computation
[Figure: execution timeline alternating between computation and I/O phases]
34Efficient I/O Through Data Management
- In-memory database of datasets
- Manage buffers or values
- Hub for I/O optimization
- Prefetching for batch mode
- Caching for interactive mode
- User-supplied read routine
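The data-management idea above can be sketched as a small in-memory hub that owns the datasets, calls a user-supplied read routine on a miss, and serves repeat requests from memory. The class name `DatasetHub`, the least-recently-used eviction policy, and the `frame0`/`frame1` dataset names are assumptions made for this example; the slides do not specify them. Prefetching is shown synchronously for simplicity, whereas a real system would presumably issue it in the background.

```python
from collections import OrderedDict


class DatasetHub:
    """In-memory store of datasets with caching and explicit prefetching."""

    def __init__(self, read_fn, capacity):
        self.read_fn = read_fn      # user-supplied read routine: name -> data
        self.capacity = capacity    # max number of datasets kept in memory
        self.cache = OrderedDict()  # name -> data, least recently used first

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)        # mark as most recently used
            return self.cache[name]
        data = self.read_fn(name)
        self.cache[name] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return data

    def prefetch(self, name):
        """Batch mode: load a dataset that is known to be needed next."""
        self.get(name)


reads = []  # records every call that actually reaches the read routine


def read_frame(name):
    reads.append(name)
    return f"data for {name}"


hub = DatasetHub(read_frame, capacity=2)
hub.get("frame0")
hub.prefetch("frame1")          # batch mode: the next frame is predictable
frame1 = hub.get("frame1")      # served from memory, no new read
frame0 = hub.get("frame0")      # interactive revisit, also served from memory
```

Only two calls reach the read routine in this run; the later requests are satisfied by the hub.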
35Related Work
- Overlapping I/O with computation
- Replacing synchronous calls with async calls [Agrawal et al. ICS 96]
- Threads [Dickens et al. IPPS 99, More et al. IPPS 97]
- Automatic performance optimization
- Optimization with performance models [Chen et al. TSE 00]
- Graybox optimization [Arpaci-Dusseau et al. SOSP 01]
36Roadmap
- Introduction
- Active buffering: hiding recurrent output cost
- Ongoing work: hiding recurrent input cost
- Conclusions
37Conclusions
- If we can't shrink it, hide it!
- Performance optimization can be done
- more actively
- at higher-level
- in larger scope
- Make I/O part of data management
38References
- [IPDPS 03] Xiaosong Ma, Marianne Winslett, Jonghyun Lee and Shengke Yu, Improving MPI-IO Output Performance with Active Buffering Plus Threads, 2003 International Parallel and Distributed Processing Symposium
- [PDSECA 03] Xiaosong Ma, Xiangmin Jiao, Michael Campbell and Marianne Winslett, Flexible and Efficient Parallel I/O for Large-Scale Multi-component Simulations, The 4th Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications
- [ICS 02] Jonghyun Lee, Xiaosong Ma, Marianne Winslett and Shengke Yu, Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations' Data Transport Needs, the 16th ACM International Conference on Supercomputing
- [IPDPS 02] Xiaosong Ma, Marianne Winslett, Jonghyun Lee and Shengke Yu, Faster Collective Output through Active Buffering, 2002 International Parallel and Distributed Processing Symposium