Title: High Performance I/O and Data Management System Group Seminar
1High Performance I/O and Data Management System Group Seminar
- Xiaosong Ma
- Department of Computer Science
- North Carolina State University
- September 12, 2003
2Roadmap
- Introduction
- Research area description
- Past research
- Future research directions
3About Myself
- Xiaosong Ma
- Pronunciation: Shiao-song
- Homepage through the faculty directory
- Brief bio
- B.S., Peking University, China
- Ph.D., UIUC
- Hobbies
- Traveling
- Food
- Photography, movies, tennis
4High-Performance Computing
- Enabled by increasing computational power
- Scientific computation
- Parallel data mining
- Web data processing
- High-performance computing in daily life
- Weather forecast
- Web crawling and web search
- Games, movie graphics, virtual reality
5Past Research
- I/O performance optimization for parallel applications
- High-level buffering and prefetching techniques
- Hiding the I/O cost
- Utilizes idle resources for maximizing inter-task parallelism
- Lightweight database support for visualization applications
- Making optimizations portable and adaptive
6Parallel I/O in Scientific Simulations
- Write-intensive
- Collective and periodic
- Bottleneck-prone
- Poor stepchild
- Traditional collective I/O focused on data transfer
[Figure: execution timeline alternating Computation and I/O phases: Computation, I/O, Computation, I/O, Computation, I/O, Computation]
7Active Buffering
- Hides periodic I/O costs behind computation phases [IPDPS 02, ICS 02, IPDPS 03]
- Organizes idle memory resources into buffer hierarchy
- Controlled by state machines
- Flexible regarding buffer space availability
- Adapts to the application's output pattern
- Flexible software architecture
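The idea of hiding output cost behind computation can be illustrated with a small sketch. This is not the Panda or ROMIO implementation: the `ActiveBuffer` class, its `capacity` parameter, and the `slow_write` sink are hypothetical names chosen for illustration, and a bounded queue stands in for the buffer hierarchy. The sketch assumes one background writer thread and falls back to a direct write when no buffer space is available.

```python
import queue
import threading
import time


class ActiveBuffer:
    """Toy active buffer: buffered outputs are written by a background thread."""

    _STOP = object()  # sentinel telling the writer thread to exit

    def __init__(self, sink, capacity):
        self._sink = sink                            # callable(step, data) doing the real write
        self._queue = queue.Queue(maxsize=capacity)  # stands in for idle memory
        self.buffered = 0                            # outputs handed to the background writer
        self.direct = 0                              # outputs written by the caller itself
        self._writer = threading.Thread(target=self._drain)
        self._writer.start()

    def _drain(self):
        # Background thread: write buffered outputs while the caller computes.
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._sink(*item)

    def write(self, step, data):
        try:
            # Copy the data so the caller can reuse its own array immediately.
            self._queue.put_nowait((step, bytes(data)))
            self.buffered += 1
        except queue.Full:
            # No buffer space left: fall back to a blocking write.
            self._sink(step, data)
            self.direct += 1

    def close(self):
        self._queue.put(self._STOP)  # waits for room, then signals shutdown
        self._writer.join()          # returns once all buffered data is written


def run_simulation(num_steps=5, capacity=2):
    written = {}
    lock = threading.Lock()

    def slow_write(step, data):
        time.sleep(0.01)             # pretend the file system is slow
        with lock:
            written[step] = data

    ab = ActiveBuffer(slow_write, capacity)
    for step in range(num_steps):
        data = bytes([step]) * 4     # stand-in for a computation phase
        ab.write(step, data)
    ab.close()
    return written, ab.buffered, ab.direct
```

Calling `run_simulation()` returns every snapshot in `written`, however the outputs were split between buffered and direct writes; that split depends on timing and on the assumed `capacity`.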
8AB vs. Asynchronous I/O
9Deployment of Active Buffering
- Panda Parallel I/O Library
- University of Illinois
- Client-server architecture
- ROMIO Parallel I/O Library
- Argonne National Lab
- Popular MPI-IO implementation, included in MPICH
- Server-less architecture
- ABT (Active Buffering with Threads)
10Sample Execution with ABT
[Figure: sample execution timeline along a time axis. Labels: comp. phase 1 through comp. phase 4; a "Data reorganization and buffering" step after each of comp. phases 1 to 3; I/O phase 2 and I/O phase 3]
11I/O in Visualization
- Periodic reads
- Dual modes of operation
- Interactive
- Batch-mode
- Harder to overlap I/O with computation
[Figure: execution timeline alternating Computation and I/O phases: Computation, I/O, Computation, I/O, Computation, I/O, Computation]
12Lightweight Data Management
- Process a large number of datasets
- Scientific data are structured
- Conventional DBMS rarely used in parallel scientific codes
- GODIVA framework [ICDE 04]
- General Object Data Interface for Visualization Applications
- In-memory database managing data buffer locations
- Relational database-like interfaces
- Developer controllable prefetching and caching
- Developer-supplied read functions
13GODIVA Architecture
14Sample Record Instance
- Sample query
- Where is the temperature array holding block_0003 at time-step 0.000075 in a fluid record?
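The sample query can be read as a keyed lookup in an in-memory table that maps record keys to the buffers holding each field. The sketch below is only an illustration of that reading: `RecordStore`, `insert`, and `lookup` are hypothetical names, not the GODIVA interface, and the time step is stored as a string to avoid floating-point key comparisons.

```python
from array import array


class RecordStore:
    """Toy in-memory table mapping record keys to field buffers."""

    def __init__(self, record_type, key_fields):
        self.record_type = record_type
        self.key_fields = tuple(key_fields)
        self._records = {}  # key tuple -> {field name: buffer}

    def insert(self, keys, **buffers):
        key = tuple(keys[name] for name in self.key_fields)
        self._records[key] = buffers

    def lookup(self, field, **keys):
        # Return the buffer itself (its location), not a copy of the data.
        key = tuple(keys[name] for name in self.key_fields)
        return self._records[key][field]


fluid = RecordStore("fluid", ["block_id", "time_step"])
temperature = array("d", [300.0, 301.5, 299.8])  # made-up values
fluid.insert({"block_id": "block_0003", "time_step": "0.000075"},
             temperature=temperature)

# "Where is the temperature array holding block_0003 at time-step 0.000075?"
found = fluid.lookup("temperature", block_id="block_0003", time_step="0.000075")
```

Here `found` is the same `array` object that was inserted, which matches the idea of a database that manages buffer locations rather than copies of the data.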
15Prefetching and Caching
- process unit
- readUnit
- addUnit and waitUnit
- finishUnit and deleteUnit
    // add all units
    addUnit("fluid_file1", read_file)
    addUnit("fluid_file2", read_file)

    // process array records in fluid_file1
    waitUnit("fluid_file1")
    do_visualization_computation("fluid_file1")
    deleteUnit("fluid_file1")

    // process array records in fluid_file2
    waitUnit("fluid_file2")
    do_visualization_computation("fluid_file2")
    deleteUnit("fluid_file2")
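The add/wait/delete pattern above can be mimicked with a single background I/O thread. The sketch below is an assumption-laden illustration, not GODIVA's implementation: `UnitManager` and its method names are hypothetical Python counterparts of `addUnit`, `waitUnit`, and `deleteUnit`, and `read_file` is a stand-in for a developer-supplied read function that returns made-up data.

```python
from concurrent.futures import ThreadPoolExecutor


class UnitManager:
    """Toy prefetcher: units are read in the background in the order added."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)  # one background I/O thread
        self._pending = {}  # unit name -> Future of the read
        self._cache = {}    # unit name -> data already read

    def add_unit(self, name, read_fn):
        # Queue the read; it starts as soon as the I/O thread is free.
        self._pending[name] = self._pool.submit(read_fn, name)

    def wait_unit(self, name):
        # Block until the unit is in memory, then return its data.
        if name not in self._cache:
            self._cache[name] = self._pending.pop(name).result()
        return self._cache[name]

    def delete_unit(self, name):
        # Release the cached data once processing is finished.
        self._cache.pop(name, None)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def read_file(name):
    # Stand-in for a developer-supplied read function.
    return {"name": name, "values": [len(name)] * 3}


units = UnitManager()
for name in ("fluid_file1", "fluid_file2"):
    units.add_unit(name, read_file)

results = []
for name in ("fluid_file1", "fluid_file2"):
    data = units.wait_unit(name)          # fluid_file2 may load meanwhile
    results.append(sum(data["values"]))   # stand-in for visualization work
    units.delete_unit(name)
units.shutdown()
```

While the first unit is being processed, the background thread is free to read the second one, which is the overlap the slide's code relies on.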
16Voyager on a Single-processor Workstation
17Voyager on a Dual-processor Cluster node
18Future work: I/O Performance Prediction
- Objective: to predict the I/O time for high-performance applications
- Challenge: lack of information in the Grid environment
- Knowledge on applications or systems not available
- Hard to simulate real applications in real environments
- Hard to predict scalability
- How do we parameterize an application?
19Future work: Sci. Data Management
- Objective: to manage data in scientific applications effectively and efficiently
- Challenge: two research worlds not well connected
- Conventional databases not suitable for HPC
- Scientific databases designed for specific applications
- General approach? Need to handle storage and I/O for different types of datasets and their distribution
20Summary
- Wide area of potential research
- Parallel computing
- Databases
- Operating systems/storage systems
- Many open problems and new challenges
21References
- [ICDE 04] Xiaosong Ma, Marianne Winslett, John Norris, Xiangmin Jiao and Robert Fiedler, GODIVA: Lightweight Data Management for Scientific Visualization, the 20th International Conference on Data Engineering, 2004
- [PhD Thesis] Xiaosong Ma, Hiding Periodic I/O Costs for Parallel Applications, PhD thesis, University of Illinois, 2003
- [IPDPS 03] Xiaosong Ma, Marianne Winslett, Jonghyun Lee and Shengke Yu, Improving MPI-IO Output Performance with Active Buffering Plus Threads, 2003 International Parallel and Distributed Processing Symposium
- [PDSECA 03] Xiaosong Ma, Xiangmin Jiao, Michael Campbell and Marianne Winslett, Flexible and Efficient Parallel I/O for Large-Scale Multi-component Simulations, the 4th Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications
- [ICS 02] Jonghyun Lee, Xiaosong Ma, Marianne Winslett and Shengke Yu, Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations' Data Transport Needs, the 16th ACM International Conference on Supercomputing
- [IPDPS 02] Xiaosong Ma, Marianne Winslett, Jonghyun Lee and Shengke Yu, Faster Collective Output through Active Buffering, 2002 International Parallel and Distributed Processing Symposium