Scalable Parallel I/O Alternatives for Massively Parallel Partitioned Solver Systems - PowerPoint PPT Presentation

1
Scalable Parallel I/O Alternatives for Massively
Parallel Partitioned Solver Systems
  • Jing Fu, Ning Liu, Onkar Sahni,
  • Ken Jansen, Mark Shephard, Chris Carothers
  • Computer Science Department
  • Scientific Computation Research Center (SCOREC)
  • Rensselaer Polytechnic Institute
  • chrisc_at_cs.rpi.edu

Acknowledgments
Partners: Simmetrix, Acusim, Kitware, IBM
Funding: NSF PetaApps, DOE INCITE, ITR, CTS, DOE SciDAC-ITAPS, NERI, AFOSR
Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
Computer Resources: TeraGrid, ANL, NERSC, RPI-CCNI
2
Outline
  • Motivating application: CFD
  • Blue Gene Platforms
  • I/O Alternatives
  • POSIX
  • PMPIO
  • syncIO
  • reduced-blocking rbIO
  • Blue Gene Results
  • Summary

3
PHASTA Flow Solver Parallel Paradigm
  • Time-accurate, stabilized FEM flow solver
  • Input partitioned on a per-processor basis
  • Unstructured mesh parts mapped to cores
  • Two types of work
  • Equation formation
  • O(40) peer-to-peer non-blocking comms
  • Overlapping comms with comp
  • Scales well on many machines
  • Implicit, iterative equation solution
  • Matrix assembled on processor ONLY
  • Each Krylov vector is
  • q = Ap (matrix-vector product)
  • Same peer-to-peer comm of q PLUS
  • Orthogonalize against prior vectors
  • REQUIRES NORMS → MPI_Allreduce (reduction sketch below)
  • This sets up a cycle of global comms separated
    by a modest amount of work
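
A minimal illustration of that global-reduction pattern in C/MPI
(not PHASTA source; local_dot is a hypothetical helper standing in
for the per-rank partial dot product):

#include <mpi.h>

/* Hypothetical helper: local piece of the dot product q . p */
static double local_dot(const double *q, const double *p, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += q[i] * p[i];
    return s;
}

/* Each orthogonalization/norm step ends in an MPI_Allreduce,
 * the global synchronization point described above. */
double global_dot(const double *q, const double *p, int n, MPI_Comm comm)
{
    double local = local_dot(q, p, n), global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}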

4
Parallel Implicit Flow Solver
Incompressible Abdominal Aorta Aneurysm (AAA)

IBM BG/L (RPI-CCNI)
Cores (avg. elems./core)    t (secs.)    scale factor
512 (204800)                2119.7       1 (base)
1024 (102400)               1052.4       1.01
2048 (51200)                529.1        1.00
4096 (25600)                267.0        0.99
8192 (12800)                130.5        1.02
16384 (6400)                64.5         1.03
32768 (3200)                35.6         0.93
32K parts show modest degradation due to ~15%
node imbalance
5
AAA Adapted to 10⁹ Elements: Scaling on Blue Gene/P
New @ 294,912 cores → 82% scaling, but getting the
I/O done is a challenge
6
Blue Gene /L Layout
  • CCNI fen
  • 32K cores/ 16 racks
  • 12 TB / 8 TB usable RAM
  • 1 PB of disk over GPFS
  • Custom OS kernel

7
Blue Gene /P Layout
  • ALCF/ANL Intrepid
  • 163K cores/ 40 racks
  • 80TB RAM
  • 8 PB of disk over GPFS
  • Custom OS kernel

8
Blue Gene/P (vs. BG/L)
9
Blue Gene I/O Architectures
  • Blue Gene/L @ CCNI
  • 1 2-core I/O node per 32 compute nodes
  • 32K-core system has 512 1-Gbit/sec network interfaces
  • I/O nodes connected to 48 GPFS file servers
  • Servers 0, 2, 4, and 6 are metadata servers
  • Server 0 does RAS and other duties
  • 800 TB of storage from 26 IBM DS4200 storage
    arrays
  • Split into 240 LUNs; each server has 10 LUNs (7 @
    1 MB and 3 @ 128 KB)
  • Peak bandwidth is 8 GB/sec read and 4 GB/sec
    write
  • Blue Gene/P @ ALCF
  • Similar I/O node to compute node ratio
  • 128 dual-core file servers over Myrinet w/ 4 MB
    GPFS block size
  • Metadata can be done by any server
  • 16× DDN 9900 → 7.5 PB (raw) storage w/ peak
    bandwidth of 60 GB/sec

10
Non-Parallel I/O: A Bad Approach
  • Sequential I/O
  • All processes send data to rank 0, and rank 0
    writes it to the file

Lacks scaling and results in excessive memory use
on rank 0 (see the sketch below). Must think parallel
from the start, but that implies data/file partitioning
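
A short C/MPI sketch of this funnel-through-rank-0 pattern, assuming
each rank holds nlocal doubles; it only illustrates why rank 0's
memory and bandwidth become the bottleneck:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void write_via_rank0(const double *local, int nlocal, const char *path,
                     MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *all = NULL;
    if (rank == 0)   /* rank 0 must hold ALL of the data at once */
        all = malloc((size_t)size * nlocal * sizeof(double));

    MPI_Gather(local, nlocal, MPI_DOUBLE,
               all,   nlocal, MPI_DOUBLE, 0, comm);

    if (rank == 0) { /* one serial write; everyone else waits */
        FILE *f = fopen(path, "wb");
        fwrite(all, sizeof(double), (size_t)size * nlocal, f);
        fclose(f);
        free(all);
    }
}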
11
1 POSIX File Per Processor (1PFPP)
  • Pros
  • parallelism, high performance at small core
    counts
  • Cons
  • lots of small files to manage
  • LOTS OF METADATA stresses the parallel filesystem
  • difficult to read back data from a different number
    of processes
  • @ 300K cores yields 600K files
  • @ JSC → kernel panic!!
  • PHASTA currently uses this approach (minimal
    sketch below)
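
A minimal 1PFPP sketch (the restart.<rank> file naming is an
assumption for illustration, not PHASTA's actual naming):

#include <mpi.h>
#include <stdio.h>

void write_1pfpp(const double *local, int nlocal, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    char path[64];
    snprintf(path, sizeof(path), "restart.%d", rank); /* one file per rank */

    /* At 300K ranks this creates 300K+ files worth of metadata. */
    FILE *f = fopen(path, "wb");
    fwrite(local, sizeof(double), (size_t)nlocal, f);
    fclose(f);
}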

12
New Partitioned Solver Parallel I/O Format
  • Assumes data accessed in a coordinated manner
  • File = master header + series of data blocks
  • Each data block has a header and data
  • Ex: 4 parts w/ 2 fields per part
  • Allows for different processor configs:
  • (1 core @ 4 parts),
  • (2 cores @ 2 parts),
  • (4 cores @ 1 part)
  • Allows for 1 to many files to control metadata
    overheads (layout sketch below)
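
An illustrative C layout for such a file (struct and field names are
assumptions made for exposition, not the actual PHASTA headers):

#include <stdint.h>

/* Master header at the start of the file: enough information for
 * any number of reading processes to locate their parts/fields. */
typedef struct {
    char     magic[8];          /* format tag / endianness check */
    uint32_t num_parts;         /* e.g., 4 parts */
    uint32_t fields_per_part;   /* e.g., 2 fields per part */
    uint64_t block_offsets[];   /* byte offset of each data block */
} MasterHeader;

/* Each data block carries its own header followed by raw data. */
typedef struct {
    uint32_t part_id;           /* which mesh part this block belongs to */
    uint32_t field_id;          /* which field within that part */
    uint64_t num_bytes;         /* size of the data that follows */
} BlockHeader;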

13
MPI_File alternatives: PMPIO
  • PMPIO → "poor man's" parallel I/O from the Silo
    mesh and field library
  • Divides app into groups of writers
  • w/i a group, only 1 writer at a time to a file
  • Passing of a token ensures synchronization w/i
    a group
  • Support for HDF5 file format
  • Uses MPI_File_read/write_at routines (token-passing
    sketch below)
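
A hand-rolled sketch of the token-passing idea (for illustration only;
the actual implementation builds on the Silo PMPIO interface):

#include <mpi.h>

/* Ranks in a group take turns writing to the shared group file;
 * a zero-byte token message enforces one writer at a time. */
void pmpio_style_write(const char *path, const void *buf, int nbytes,
                       MPI_Offset my_offset, MPI_Comm group_comm)
{
    int grank, gsize;
    MPI_Comm_rank(group_comm, &grank);
    MPI_Comm_size(group_comm, &gsize);

    if (grank > 0)  /* wait for the token from the previous rank */
        MPI_Recv(NULL, 0, MPI_BYTE, grank - 1, 0, group_comm,
                 MPI_STATUS_IGNORE);

    MPI_File fh;    /* open/write/close while holding the token */
    MPI_File_open(MPI_COMM_SELF, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, my_offset, buf, nbytes, MPI_BYTE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    if (grank < gsize - 1)  /* pass the token to the next rank */
        MPI_Send(NULL, 0, MPI_BYTE, grank + 1, 0, group_comm);
}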

14
MPI_File alternatives: syncIO
  • Flexible design allows a variable number of files
    and procs/writers per file
  • Within a file, can be configured to write on
    block-size boundaries, which are typically 1 to
    4 MB
  • Implemented using collective I/O routines, e.g.,
    MPI_File_write_at_all_begin (sketch below)
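
A sketch of the syncIO idea under stated assumptions (ranks split
into nfiles groups, one shared file per group, per-rank offsets
aligned to an assumed 4 MB block size, at most one block of data
per rank); this is not the actual syncIO source:

#include <mpi.h>
#include <stdio.h>

#define BLOCK_SIZE (4 * 1024 * 1024)   /* assumed GPFS block size */

void syncio_write(const void *buf, int nbytes, int nfiles, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* One sub-communicator (group of procs/writers) per file. */
    MPI_Comm file_comm;
    MPI_Comm_split(comm, rank % nfiles, rank, &file_comm);

    int frank;
    MPI_Comm_rank(file_comm, &frank);

    char path[64];
    snprintf(path, sizeof(path), "restart-group.%d", rank % nfiles);

    MPI_File fh;
    MPI_File_open(file_comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Collective (split) write to a block-aligned offset. */
    MPI_Offset off = (MPI_Offset)frank * BLOCK_SIZE;
    MPI_File_write_at_all_begin(fh, off, buf, nbytes, MPI_BYTE);
    MPI_File_write_at_all_end(fh, buf, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Comm_free(&file_comm);
}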

15
MPI_File alternatives: rbIO
  • rb → reduced blocking
  • Targets checkpointing
  • Divides application into workers and writers with
    1 writer MPI task per group of workers
  • Workers send I/O data to writers via MPI_Isend and
    are free to continue
  • i.e., hides the latency of blocking parallel I/O
  • Writers then perform a blocking MPI_File_write_at
    operation using the MPI_COMM_SELF communicator
    (worker/writer sketch below)
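
A sketch of the worker/writer split (function names and the single
message per worker are assumptions; this is not the actual rbIO source):

#include <mpi.h>

/* Worker side: hand the checkpoint data to the writer and keep
 * computing; the write latency is hidden from the worker. */
void rbio_worker_send(const void *buf, int nbytes, int writer_rank,
                      MPI_Comm comm, MPI_Request *req)
{
    MPI_Isend(buf, nbytes, MPI_BYTE, writer_rank, 99, comm, req);
    /* Caller must MPI_Wait(req, ...) before reusing buf. */
}

/* Writer side: receive from one worker in its group, then do the
 * blocking file write locally with MPI_COMM_SELF. */
void rbio_writer(void *buf, int nbytes, int worker_rank,
                 MPI_Offset offset, const char *path, MPI_Comm comm)
{
    MPI_Recv(buf, nbytes, MPI_BYTE, worker_rank, 99, comm,
             MPI_STATUS_IGNORE);

    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, offset, buf, nbytes, MPI_BYTE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}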

16
BG/L 1PFPP w/ 7.7 GB data
17
BG/L PMPIO w/ 7.7 GB data
HDF5 peak: 600 MB/sec
Raw MPI-IO peak: 900 MB/sec
18
BG/L syncIO w/ 7.7 GB data
Write performance peak: 1.3 GB/sec
Read performance peak: 6.6 GB/sec
19
BG/P syncIO w/ 60 GB data
20
BG/L rbIO actual BW w/ 7.7 GB data
21
BG/L rbIO perceived BW w/ 7.7 GB data
22 TB/sec
11 TB/sec
22
BG/P rbIO actual BW w/ 60 GB data
17.9 GB/sec
23
BG/P rbIO perceived BW w/ 60 GB data
21 TB/sec
24
Related Work
  • A. Nisar, W. Liao, and A. Choudhary, Scaling
    Parallel I/O Performance through I/O Delegate and
    Caching System, in Proceedings of the 2008
    ACM/IEEE conference on Supercomputing, 2008.
  • Performs rbIO-like I/O delegation inside MPI via
    threads, using up to 10% of compute cores as I/O
    workers
  • Benchmark studies (highlighting just a few)
  • H. Yu et al. [18]: BG/L, 2 GB/sec @ 1K
  • Saini et al. [19]: 512 NEC SX-8 cores; I/O was
    not scalable when all processors access a shared
    file
  • Larkin et al. [17]: large performance drop at 2K
    core count for Cray XT3/XT4
  • Lang et al. [30]: large I/O study across many
    benchmarks on Intrepid/BG-P; found 60 GB/sec read
    and 45 GB/sec write. In practice, Intrepid has a
    peak I/O rate of around 35 GB/sec

25
Summary and Future Work
  • We examined several parallel I/O approaches:
  • 1 POSIX File per Proc: < 1 GB/sec on BG/L
  • PMPIO: < 1 GB/sec on BG/L
  • syncIO: all processors write as groups to
    different files
  • BG/L: 6.6 GB/sec read, 1.3 GB/sec write
  • BG/P: 11.6 GB/sec read, 25 GB/sec write
  • rbIO gives up 3 to 6% of compute nodes to hide the
    latency of blocking parallel I/O
  • BG/L: 2.3 GB/sec actual write, 22 TB/sec
    perceived write
  • BG/P: 18 GB/sec actual write, 22 TB/sec
    perceived write
  • Good trade-off on Blue Gene
  • All procs to 1 file does not yield good
    performance even if aligned.
  • The performance sweet spot for syncIO depends
    significantly on the I/O architecture, so the file
    format must be tuned accordingly
  • BG/L @ CCNI has a metadata bottleneck and the
    number of files must be adjusted accordingly,
    e.g., 32 to 128 writers
  • BG/P @ ALCF can sustain much higher performance,
    but requires more files, e.g., 1024 writers
  • This suggests collective I/O is sensitive to the
    underlying file system performance
  • For rbIO, we observed that 1024 writers gave the
    best performance so far on both BG/L and BG/P
    platforms
  • Future Work: impact of different filesystems on
    performance
  • Leverage Darshan logs @ ALCF to better understand
    Intrepid performance
  • More experiments on Blue Gene/P under PVFS,
    CrayXT5 under Lustre