Title: Scalable Parallel I/O Alternatives for Massively Parallel Partitioned Solver Systems
- Jing Fu, Ning Liu, Onkar Sahni,
- Ken Jansen, Mark Shephard, Chris Carothers
- Computer Science Department
- Scientific Computation Research Center (SCOREC)
- Rensselaer Polytechnic Institute
- chrisc@cs.rpi.edu
Acknowledgments
- Partners: Simmetrix, Acusim, Kitware, IBM
- NSF: PetaApps, ITR, CTS
- DOE: INCITE, SciDAC-ITAPS, NERI
- AFOSR
- Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
- Computer Resources: TeraGrid, ANL, NERSC, RPI-CCNI
2. Outline
- Motivating application CFD
- Blue Gene Platforms
- I/O Alternatives
- POSIX
- PMPIO
- syncIO
- reduced blocking (rbIO)
- Blue Gene Results
- Summary
3. PHASTA Flow Solver: Parallel Paradigm
- Time-accurate, stabilized FEM flow solver
- Input partitioned on a per-processor basis
- Unstructured mesh parts mapped to cores
- Two types of work
- Equation formation
- O(40) peer-to-peer non-blocking comms
- Overlapping comms with comp
- Scales well on many machines
- Implicit, iterative equation solution
- Matrix assembled on processor ONLY
- Each Krylov vector is
- qAp (matrix-vector product)
- Same peer-to-peer comm of q PLUS
- Orthogonalize against prior vectors
- REQUIRES NORMS → MPI_Allreduce
- This sets up a cycle of global comms. separated by a modest amount of work
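The global norm in the orthogonalization step is what forces the MPI_Allreduce. A minimal pure-Python sketch (not PHASTA code; the rank count and vector slices are illustrative) of how per-rank partial sums combine into one global norm:

```python
import math

def allreduce_sum(per_rank_partials):
    # Stand-in for MPI_Allreduce(..., MPI_SUM): every rank would receive
    # the same global sum of the per-rank contributions.
    return sum(per_rank_partials)

# A Krylov vector q distributed across 4 "ranks" (each rank owns a slice).
slices = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Each rank computes its local partial sum of squares; the reduce combines
# them, and only then can any rank normalize its slice -- this is the
# global synchronization point between chunks of local work.
partials = [sum(x * x for x in s) for s in slices]
global_norm = math.sqrt(allreduce_sum(partials))
print(round(global_norm, 4))  # sqrt(204) ~= 14.2829
```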
4. Parallel Implicit Flow Solver
Incompressible Abdominal Aortic Aneurysm (AAA), IBM BG/L @ RPI-CCNI

Cores (avg. elems./core)    t (secs.)    scale factor
  512 (204,800)              2119.7      1 (base)
 1024 (102,400)              1052.4      1.01
 2048 (51,200)                529.1      1.00
 4096 (25,600)                267.0      0.99
 8192 (12,800)                130.5      1.02
16384 (6,400)                  64.5      1.03
32768 (3,200)                  35.6      0.93

32K parts show modest degradation due to 15% node imbalance
5. AAA Adapted to 10^9 Elements: Scaling on Blue Gene/P
New @ 294,912 cores → 82% scaling. But getting I/O done is a challenge
6. Blue Gene/L Layout
- CCNI fen
- 32K cores/ 16 racks
- 12 TB / 8 TB usable RAM
- 1 PB of disk over GPFS
- Custom OS kernel
7. Blue Gene/P Layout
- ALCF/ANL Intrepid
- 163K cores/ 40 racks
- 80TB RAM
- 8 PB of disk over GPFS
- Custom OS kernel
8. Blue Gene/P (vs. BG/L)
9. Blue Gene I/O Architectures
- Blue Gene/L @ CCNI
- One 2-core I/O node per 32 compute nodes
- 32K system has 512 1 Gbit/sec network interfaces
- I/O nodes connected to 48 GPFS file servers
- Servers 0, 2, 4, and 6 are metadata servers
- Server 0 does RAS and other duties
- 800 TB of storage from 26 IBM DS4200 storage arrays
- Split into 240 LUNs; each server has 10 LUNs (7 @ 1MB and 3 @ 128KB)
- Peak bandwidth is 8 GB/sec read and 4 GB/sec write
- Blue Gene/P @ ALCF
- Similar I/O node to compute node ratio
- 128 dual-core file servers over Myrinet w/ 4MB GPFS block size
- Metadata can be done by any server
- 16x DDN 9900 → 7.5 PB (raw) storage w/ peak bandwidth of 60 GB/sec
10. Non-Parallel I/O: A Bad Approach
- Sequential I/O
- All processes send data to rank 0, and rank 0 writes it to the file

Lacks scaling and results in excessive memory use on rank 0. Must think parallel from the start, but that implies data/file partitioning
11. 1 POSIX File Per Processor (1PFPP)
- Pros
- parallelism, high performance at small core counts
- Cons
- lots of small files to manage
- LOTS OF METADATA stresses the parallel filesystem
- difficult to read back data from a different number of processes
- @ 300K cores yields 600K files
- @ JSC → kernel panic!!
- PHASTA currently uses this approach
12. New Partitioned-Solver Parallel I/O Format
- Assumes data is accessed in a coordinated manner
- File = master header + series of data blocks
- Each data block has a header and data
- Ex: 4 parts w/ 2 fields per part
- Allows for different processor configs:
- (1 core @ 4 parts)
- (2 cores @ 2 parts)
- (4 cores @ 1 part)
- Allows 1 to many files to control metadata overheads
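The slide's example can be made concrete with a hypothetical byte layout in the spirit of the format above: a master header recording the part and field counts, followed by self-describing data blocks, each with its own (tag, byte-length) header. The field names and exact header fields here are illustrative, not the actual PHASTA format:

```python
import struct

def write_file(parts):
    # parts: list of dicts mapping field name -> raw bytes for that field.
    out = bytearray()
    out += struct.pack("<II", len(parts), len(parts[0]))       # master header
    for part in parts:
        for name, data in part.items():
            tag = name.encode()[:8].ljust(8, b"\0")            # 8-byte tag
            out += struct.pack("<8sI", tag, len(data)) + data  # block header + payload
    return bytes(out)

# 4 parts with 2 fields each, as in the slide's example. Because each block
# is self-describing, the same file can be read back by 1, 2, or 4 readers.
blob = write_file([{"coords": b"\x01" * 12, "soln": b"\x02" * 8} for _ in range(4)])
nparts, nfields = struct.unpack_from("<II", blob, 0)
print(nparts, nfields)  # 4 2
```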
13. MPI_File alternatives: PMPIO
- PMPIO → "poor man's parallel I/O" from the Silo mesh and field library
- Divides the app into groups of writers
- Within a group, only 1 writer at a time per file
- Passing of a token ensures synchronization within a group
- Support for HDF5 file format
- Uses MPI_File_read/write_at routines
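A serial sketch of the PMPIO-style grouping described above (illustrative only; the real token-passing helpers live in the Silo library): ranks are split into groups, each group shares one file, and the token handed down each group serializes writes so only one rank writes at a time while the groups themselves proceed concurrently:

```python
def pmpio_write_order(nranks, ngroups):
    # Returns, per group, the ordered list of ranks that write to that
    # group's file -- the order the token is handed along.
    group_size = nranks // ngroups
    order = {}
    for rank in range(nranks):
        # Contiguous grouping (an assumption): the first rank in each group
        # writes first, then passes the token to the next rank in the group.
        order.setdefault(rank // group_size, []).append(rank)
    return order

# 8 ranks, 2 files: two groups of 4, each serialized internally.
print(pmpio_write_order(8, 2))  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```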
14. MPI_File alternatives: syncIO
- Flexible design allows a variable number of files and procs/writers per file
- Within a file, can be configured to write on block-size boundaries, which are typically 1 to 4MB
- Implemented using collective I/O routines, e.g., MPI_File_write_at_all_begin
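The block-boundary idea above amounts to rounding each writer's slot in the shared file up to a whole number of filesystem blocks, so collective writes never straddle a block. A small sketch with example sizes (the part sizes are made up; the 4 MB block matches the GPFS block size quoted for the BG/P system):

```python
BLOCK = 4 * 1024 * 1024  # 4 MB GPFS block size

def aligned_offsets(part_sizes, block=BLOCK):
    # Each part's region starts on a block boundary; its slot is its size
    # rounded up to a whole number of blocks.
    offsets, off = [], 0
    for size in part_sizes:
        offsets.append(off)
        off += -(-size // block) * block  # ceil-divide, then back to bytes
    return offsets

offs = aligned_offsets([5_000_000, 1_000_000, 9_000_000])
print(offs)  # [0, 8388608, 12582912]
```

The trade-off is padding: the 1,000,000-byte part still occupies a full 4 MB slot, which is why the sweet spot depends on the filesystem's block size.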
15. MPI_File alternatives: rbIO
- rb → reduced blocking
- Targets checkpointing
- Divides the application into workers and writers, with 1 writer MPI task per group of workers
- Workers send I/O to writers via MPI_Isend and are free to continue
- i.e., hides the latency of blocking parallel I/O
- Writers then perform a blocking MPI_File_write_at operation using the MPI_COMM_SELF communicator
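This split is why rbIO's "perceived" bandwidth dwarfs the actual disk bandwidth: a worker only pays for the MPI_Isend hand-off before resuming computation, while the writer absorbs the slow blocking write. A back-of-envelope sketch (the timings are illustrative, chosen to reproduce the BG/L figures reported later in the deck, not measurements):

```python
def perceived_vs_actual(data_bytes, isend_secs, disk_write_secs):
    perceived = data_bytes / isend_secs       # what the workers experience
    actual = data_bytes / disk_write_secs     # what the disks deliver
    return perceived, actual

# 7.7 GB checkpoint: hand-off completes in ~0.35 ms (aggregate), while the
# data takes ~3.3 s to actually reach disk.
p, a = perceived_vs_actual(7.7e9, 3.5e-4, 3.3)
print(f"perceived {p / 1e12:.1f} TB/s, actual {a / 1e9:.1f} GB/s")
# perceived 22.0 TB/s, actual 2.3 GB/s
```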
16. BG/L 1PFPP w/ 7.7 GB data

17. BG/L PMPIO w/ 7.7 GB data
HDF5 Peak: 600 MB/sec
Raw MPI-IO Peak: 900 MB/sec

18. BG/L syncIO w/ 7.7 GB data
Write Performance Peak: 1.3 GB/sec
Read Performance Peak: 6.6 GB/sec

19. BG/P syncIO w/ 60 GB data

20. BG/L rbIO actual BW w/ 7.7 GB data

21. BG/L rbIO perceived BW w/ 7.7 GB data
22 TB/sec
11 TB/sec

22. BG/P rbIO actual BW w/ 60 GB data
17.9 GB/sec

23. BG/P rbIO perceived BW w/ 60 GB data
21 TB/sec
24. Related Work
- A. Nisar, W. Liao, and A. Choudhary, "Scaling Parallel I/O Performance through I/O Delegate and Caching System," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
- Performs rbIO-style delegation inside MPI via threads, using up to 10% of compute cores as I/O workers
- Benchmark studies (highlighting just a few):
- H. Yu et al. [18]: BG/L, 2 GB/sec @ 1K cores
- Saini et al. [19]: 512 NEC SX-8 cores; I/O was not scalable when all processors access a shared file
- Larkin et al. [17]: large performance drop at 2K core count for Cray XT3/XT4
- Lang et al. [30]: large I/O study across many benchmarks on Intrepid/BG-P. Found 60 GB/s read and 45 GB/s write; in practice, Intrepid has a peak I/O rate of around 35 GB/sec
25. Summary and Future Work
- We examined several parallel I/O approaches:
- 1 POSIX File per Proc: < 1 GB/sec on BG/L
- PMPIO: < 1 GB/sec on BG/L
- syncIO: all processors write as groups to different files
- BG/L: 6.6 GB/sec read, 1.3 GB/sec write
- BG/P: 11.6 GB/sec read, 25 GB/sec write
- rbIO: gives up 3 to 6% of compute nodes to hide the latency of blocking parallel I/O
- BG/L: 2.3 GB/sec actual write, 22 TB/sec perceived write
- BG/P: 18 GB/sec actual write, 22 TB/sec perceived write
- Good trade-off on Blue Gene
- All procs writing to 1 file does not yield good performance, even if aligned
- The performance sweet spot for syncIO depends significantly on the I/O architecture, so the file format must be tuned accordingly
- BG/L @ CCNI has a metadata bottleneck and the number of files must be adjusted accordingly, e.g., 32 to 128 writers
- BG/P @ ALCF can sustain much higher performance, but requires more files, e.g., 1024 writers
- Suggests collective I/O is sensitive to underlying file system performance
- For rbIO, we observed that 1024 writers gave the best performance so far on both the BG/L and BG/P platforms
- Future Work: impact of different filesystems on performance
- Leverage Darshan logs @ ALCF to better understand Intrepid performance
- More experiments on Blue Gene/P under PVFS and Cray XT5 under Lustre