Title: Scalable Parallel I/O Alternatives for Massively Parallel Partitioned Solver Systems
- Jing Fu, Ning Liu, Onkar Sahni,
- Ken Jansen, Mark Shephard, Chris Carothers
- Computer Science Department
- Scientific Computation Research Center (SCOREC)
- Rensselaer Polytechnic Institute
- chrisc@cs.rpi.edu
Acknowledgments
- Partners: Simmetrix, Acusim, Kitware, IBM
- NSF: PetaApps, ITR, CTS
- DOE: INCITE, SciDAC-ITAPS, NERI
- AFOSR
- Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
- Computer Resources: TeraGrid, ANL, NERSC, RPI-CCNI
2. Outline
- Motivating application CFD
- Blue Gene Platforms
- I/O Alternatives
- POSIX
- PMPIO
- syncIO
- reduced blocking (rbIO)
- Blue Gene Results
- Summary
3. PHASTA Flow Solver: Parallel Paradigm
- Time-accurate, stabilized FEM flow solver
- Input partitioned on a per-processor basis
- Unstructured mesh parts mapped to cores
- Two types of work
- Equation formation
- O(40) peer-to-peer non-blocking comms
- Overlapping comms with comp
- Scales well on many machines
- Implicit, iterative equation solution
- Matrix assembled on processor ONLY
- Each Krylov vector is
- qAp (matrix-vector product)
- Same peer-to-peer comm of q PLUS
- Orthogonalize against prior vectors
- REQUIRES NORMS → MPI_Allreduce
- This sets up a cycle of global comms. separated by a modest amount of work
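The global norm in the orthogonalization step is what forces the MPI_Allreduce. A minimal pure-Python sketch (not PHASTA code; the rank count and vector slices are illustrative) of how per-rank partial sums combine into one global norm:

```python
import math

def allreduce_sum(per_rank_partials):
    # Stand-in for MPI_Allreduce(..., MPI_SUM): every rank would receive
    # the same global sum of the per-rank contributions.
    return sum(per_rank_partials)

# A Krylov vector q distributed across 4 "ranks" (each rank owns a slice).
slices = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Each rank computes its local partial sum of squares; the reduce combines
# them, and only then can any rank normalize its slice -- this is the
# global synchronization point between chunks of local work.
partials = [sum(x * x for x in s) for s in slices]
global_norm = math.sqrt(allreduce_sum(partials))
print(round(global_norm, 4))  # sqrt(204) ~= 14.2829
```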
4. Parallel Implicit Flow Solver
Incompressible Abdominal Aortic Aneurysm (AAA), IBM BG/L @ RPI-CCNI

Cores (avg. elems./core)    t (secs.)    scale factor
  512 (204,800)              2119.7      1 (base)
 1024 (102,400)              1052.4      1.01
 2048 (51,200)                529.1      1.00
 4096 (25,600)                267.0      0.99
 8192 (12,800)                130.5      1.02
16384 (6,400)                  64.5      1.03
32768 (3,200)                  35.6      0.93

32K parts show modest degradation due to 15% node imbalance
5. AAA Adapted to 10^9 Elements: Scaling on Blue Gene/P
New @ 294,912 cores → 82% scaling. But getting I/O done is a challenge
6. Blue Gene/L Layout
- CCNI fen
- 32K cores/ 16 racks
- 12 TB / 8 TB usable RAM
- 1 PB of disk over GPFS
- Custom OS kernel
7. Blue Gene/P Layout
- ALCF/ANL Intrepid
- 163K cores/ 40 racks
- 80TB RAM
- 8 PB of disk over GPFS
- Custom OS kernel
8. Blue Gene/P (vs. BG/L)
9. Blue Gene I/O Architectures
- Blue Gene/L @ CCNI
- One 2-core I/O node per 32 compute nodes
- 32K system has 512 1 Gbit/sec network interfaces
- I/O nodes connected to 48 GPFS file servers
- Servers 0, 2, 4, and 6 are metadata servers
- Server 0 does RAS and other duties
- 800 TB of storage from 26 IBM DS4200 storage arrays
- Split into 240 LUNs; each server has 10 LUNs (7 @ 1MB and 3 @ 128KB)
- Peak bandwidth is 8 GB/sec read and 4 GB/sec write
- Blue Gene/P @ ALCF
- Similar I/O node to compute node ratio
- 128 dual-core file servers over Myrinet w/ 4MB GPFS block size
- Metadata can be done by any server
- 16x DDN 9900 → 7.5 PB (raw) storage w/ peak bandwidth of 60 GB/sec
10. Non-Parallel I/O: A Bad Approach
- Sequential I/O
- All processes send data to rank 0, and rank 0 writes it to the file

Lacks scaling and results in excessive memory use on rank 0. Must think parallel from the start, but that implies data/file partitioning
11. 1 POSIX File Per Processor (1PFPP)
- Pros
- parallelism, high performance at small core counts
- Cons
- lots of small files to manage
- LOTS OF METADATA stresses the parallel filesystem
- difficult to read back data from a different number of processes
- @ 300K cores yields 600K files
- @ JSC → kernel panic!!
- PHASTA currently uses this approach
12. New Partitioned-Solver Parallel I/O Format
- Assumes data is accessed in a coordinated manner
- File = master header + series of data blocks
- Each data block has a header and data
- Ex: 4 parts w/ 2 fields per part
- Allows for different processor configs:
- (1 core @ 4 parts)
- (2 cores @ 2 parts)
- (4 cores @ 1 part)
- Allows 1 to many files to control metadata overheads
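The slide's example can be made concrete with a hypothetical byte layout in the spirit of the format above: a master header recording the part and field counts, followed by self-describing data blocks, each with its own (tag, byte-length) header. The field names and exact header fields here are illustrative, not the actual PHASTA format:

```python
import struct

def write_file(parts):
    # parts: list of dicts mapping field name -> raw bytes for that field.
    out = bytearray()
    out += struct.pack("<II", len(parts), len(parts[0]))       # master header
    for part in parts:
        for name, data in part.items():
            tag = name.encode()[:8].ljust(8, b"\0")            # 8-byte tag
            out += struct.pack("<8sI", tag, len(data)) + data  # block header + payload
    return bytes(out)

# 4 parts with 2 fields each, as in the slide's example. Because each block
# is self-describing, the same file can be read back by 1, 2, or 4 readers.
blob = write_file([{"coords": b"\x01" * 12, "soln": b"\x02" * 8} for _ in range(4)])
nparts, nfields = struct.unpack_from("<II", blob, 0)
print(nparts, nfields)  # 4 2
```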
13. MPI_File alternatives: PMPIO
- PMPIO → "poor man's parallel I/O" from the Silo mesh and field library
- Divides the app into groups of writers
- Within a group, only 1 writer at a time per file
- Passing of a token ensures synchronization within a group
- Support for HDF5 file format
- Uses MPI_File_read/write_at routines
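A serial sketch of the PMPIO-style grouping described above (illustrative only; the real token-passing helpers live in the Silo library): ranks are split into groups, each group shares one file, and the token handed down each group serializes writes so only one rank writes at a time while the groups themselves proceed concurrently:

```python
def pmpio_write_order(nranks, ngroups):
    # Returns, per group, the ordered list of ranks that write to that
    # group's file -- the order the token is handed along.
    group_size = nranks // ngroups
    order = {}
    for rank in range(nranks):
        # Contiguous grouping (an assumption): the first rank in each group
        # writes first, then passes the token to the next rank in the group.
        order.setdefault(rank // group_size, []).append(rank)
    return order

# 8 ranks, 2 files: two groups of 4, each serialized internally.
print(pmpio_write_order(8, 2))  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```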
14. MPI_File alternatives: syncIO
- Flexible design allows a variable number of files and procs/writers per file
- Within a file, can be configured to write on block-size boundaries, which are typically 1 to 4MB
- Implemented using collective I/O routines, e.g., MPI_File_write_at_all_begin
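The block-boundary idea above amounts to rounding each writer's slot in the shared file up to a whole number of filesystem blocks, so collective writes never straddle a block. A small sketch with example sizes (the part sizes are made up; the 4 MB block matches the GPFS block size quoted for the BG/P system):

```python
BLOCK = 4 * 1024 * 1024  # 4 MB GPFS block size

def aligned_offsets(part_sizes, block=BLOCK):
    # Each part's region starts on a block boundary; its slot is its size
    # rounded up to a whole number of blocks.
    offsets, off = [], 0
    for size in part_sizes:
        offsets.append(off)
        off += -(-size // block) * block  # ceil-divide, then back to bytes
    return offsets

offs = aligned_offsets([5_000_000, 1_000_000, 9_000_000])
print(offs)  # [0, 8388608, 12582912]
```

The trade-off is padding: the 1,000,000-byte part still occupies a full 4 MB slot, which is why the sweet spot depends on the filesystem's block size.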
15. MPI_File alternatives: rbIO
- rb → reduced blocking
- Targets checkpointing
- Divides the application into workers and writers, with 1 writer MPI task per group of workers
- Workers send I/O to writers via MPI_Isend and are free to continue
- i.e., hides the latency of blocking parallel I/O
- Writers then perform a blocking MPI_File_write_at operation using the MPI_COMM_SELF communicator
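This split is why rbIO's "perceived" bandwidth dwarfs the actual disk bandwidth: a worker only pays for the MPI_Isend hand-off before resuming computation, while the writer absorbs the slow blocking write. A back-of-envelope sketch (the timings are illustrative, chosen to reproduce the BG/L figures reported later in the deck, not measurements):

```python
def perceived_vs_actual(data_bytes, isend_secs, disk_write_secs):
    perceived = data_bytes / isend_secs       # what the workers experience
    actual = data_bytes / disk_write_secs     # what the disks deliver
    return perceived, actual

# 7.7 GB checkpoint: hand-off completes in ~0.35 ms (aggregate), while the
# data takes ~3.3 s to actually reach disk.
p, a = perceived_vs_actual(7.7e9, 3.5e-4, 3.3)
print(f"perceived {p / 1e12:.1f} TB/s, actual {a / 1e9:.1f} GB/s")
# perceived 22.0 TB/s, actual 2.3 GB/s
```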
16. BG/L 1PFPP w/ 7.7 GB data

17. BG/L PMPIO w/ 7.7 GB data
HDF5 Peak: 600 MB/sec
Raw MPI-IO Peak: 900 MB/sec

18. BG/L syncIO w/ 7.7 GB data
Write Performance Peak: 1.3 GB/sec
Read Performance Peak: 6.6 GB/sec

19. BG/P syncIO w/ 60 GB data

20. BG/L rbIO actual BW w/ 7.7 GB data

21. BG/L rbIO perceived BW w/ 7.7 GB data
22 TB/sec
11 TB/sec

22. BG/P rbIO actual BW w/ 60 GB data
17.9 GB/sec

23. BG/P rbIO perceived BW w/ 60 GB data
21 TB/sec
24. Related Work
- A. Nisar, W. Liao, and A. Choudhary, "Scaling Parallel I/O Performance through I/O Delegate and Caching System," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
- Performs rbIO-style delegation inside MPI via threads, using up to 10% of compute cores as I/O workers
- Benchmark studies (highlighting just a few):
- H. Yu et al. [18]: BG/L, 2 GB/sec @ 1K cores
- Saini et al. [19]: 512 NEC SX-8 cores; I/O was not scalable when all processors access a shared file
- Larkin et al. [17]: large performance drop at 2K core count for Cray XT3/XT4
- Lang et al. [30]: large I/O study across many benchmarks on Intrepid/BG-P. Found 60 GB/s read and 45 GB/s write; in practice, Intrepid has a peak I/O rate of around 35 GB/sec
25. Summary and Future Work
- We examined several parallel I/O approaches:
- 1 POSIX File per Proc: < 1 GB/sec on BG/L
- PMPIO: < 1 GB/sec on BG/L
- syncIO: all processors write as groups to different files
- BG/L: 6.6 GB/sec read, 1.3 GB/sec write
- BG/P: 11.6 GB/sec read, 25 GB/sec write
- rbIO: gives up 3 to 6% of compute nodes to hide the latency of blocking parallel I/O
- BG/L: 2.3 GB/sec actual write, 22 TB/sec perceived write
- BG/P: 18 GB/sec actual write, 22 TB/sec perceived write
- Good trade-off on Blue Gene
- All procs writing to 1 file does not yield good performance, even if aligned
- The performance sweet spot for syncIO depends significantly on the I/O architecture, so the file format must be tuned accordingly
- BG/L @ CCNI has a metadata bottleneck and the number of files must be adjusted accordingly, e.g., 32 to 128 writers
- BG/P @ ALCF can sustain much higher performance, but requires more files, e.g., 1024 writers
- Suggests collective I/O is sensitive to underlying file system performance
- For rbIO, we observed that 1024 writers gave the best performance so far on both the BG/L and BG/P platforms
- Future Work: impact of different filesystems on performance
- Leverage Darshan logs @ ALCF to better understand Intrepid performance
- More experiments on Blue Gene/P under PVFS and Cray XT5 under Lustre