1
Parallel and Grid I/O Infrastructure
  • Rob Ross, Argonne National Lab
  • Parallel Disk Access and Grid I/O (P4)
  • SDM All Hands Meeting
  • March 26, 2002

2
Participants
  • Argonne National Laboratory
  • Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham,
    Anthony Chan
  • Northwestern University
  • Alok Choudhary, Wei-keng Liao, Avery Ching, Kenin
    Coloma, Jianwei Li
  • Collaborators
  • Lawrence Livermore National Laboratory
  • Ghaleb Abdulla, Tina Eliassi-Rad, Terence
    Critchlow
  • Application groups

3
Focus Areas in Project
  • Parallel I/O on clusters
  • Parallel Virtual File System (PVFS)
  • MPI-IO hints
  • ROMIO MPI-IO implementation
  • Grid I/O
  • Linking PVFS and ROMIO with Grid I/O components
  • Application interfaces
  • NetCDF and HDF5
  • Everything is interconnected!
  • Wei-keng Liao will drill down into specific tasks

4
Parallel Virtual File System
  • Lead developer: R. Ross (ANL)
  • R. Latham (ANL), developer
  • A. Ching, K. Coloma (NWU), collaborators
  • Open source, scalable parallel file system
  • Project began in the mid-90s at Clemson University
  • Now a collaboration between Clemson and ANL
  • Successes
  • In use on large Linux clusters (OSC, Utah,
    Clemson, ANL, Phillips Petroleum, …)
  • 100 unique downloads/month
  • 160 users on mailing list, 90 on developers
    list
  • Multiple Gigabyte/second performance shown

5
Keeping PVFS Relevant: PVFS2
  • Scaling to thousands of clients and hundreds of
    servers requires some design changes
  • Distributed metadata
  • New storage formats
  • Improved fault tolerance
  • New technology, new features
  • High-performance networking (e.g. InfiniBand,
    VIA)
  • Application metadata
  • New design and implementation warranted (PVFS2)

6
PVFS1, PVFS2, and SDM
  • Maintaining PVFS1 as a resource to community
  • Providing support, bug fixes
  • Encouraging use by application groups
  • Adding functionality to improve performance (e.g.
    tiled display)
  • Implementing next-generation parallel file system
  • Basic infrastructure for future PFS work
  • New physical distributions (e.g. chunking)
  • Application metadata storage
  • Ensuring that a working parallel file system will
    continue to be available on clusters as they scale

7
Data Staging for Tiled Display
  • Contact: Joe Insley (ANL)
  • Commodity components
  • projectors, PCs
  • Provides very high resolution visualization
  • Staging application preprocesses frames into a
    tile stream for each visualization node
  • Uses MPI-IO to access data from PVFS file system
  • Streams of tiles are merged into movie files on
    visualization nodes
  • End goal is to display frames directly from PVFS
  • Enhancing PVFS and ROMIO to improve performance

8
Example Tile Layout
  • 3x2 display, 6 readers
  • Frame size is 2532x1408 pixels
  • Tile size is 1024x768 pixels (overlapped; see the
    sketch after this list)
  • Movies broken into frames with each frame stored
    in its own file in PVFS
  • Readers pull data from PVFS and send to display
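
For concreteness, the overlap arithmetic works out as below; this is a minimal sketch that assumes the excess pixels are distributed uniformly across the seams (our assumption, not stated on the slide).

    #include <stdio.h>

    int main(void)
    {
        const int frame_w = 2532, frame_h = 1408;  /* frame size in pixels */
        const int tile_w  = 1024, tile_h  = 768;   /* tile size in pixels  */
        const int cols = 3, rows = 2;              /* 3x2 display          */

        /* Uniform overlap: 270 px horizontally, 128 px vertically. */
        int step_x = (frame_w - tile_w) / (cols - 1);   /* = 754 */
        int step_y = (frame_h - tile_h) / (rows - 1);   /* = 640 */

        for (int r = 0; r < rows; r++)
            for (int c = 0; c < cols; c++)
                printf("tile %d: origin (%4d,%4d)\n",
                       r * cols + c, c * step_x, r * step_y);
        return 0;
    }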

9
Tested Access Patterns
  • Subtile
  • Each reader grabs a piece of a tile
  • Small noncontiguous accesses
  • Lots of accesses for a frame
  • Tile
  • Each reader grabs a whole tile
  • Larger noncontiguous accesses
  • Six accesses for a frame
  • Reading individual pieces is simply too slow

10
Noncontiguous Access in ROMIO
  • ROMIO performs data sieving to cut down the
    number of I/O operations
  • Uses large reads that grab multiple
    noncontiguous pieces
  • Example: reading tile 1

11
Noncontiguous Access in PVFS
  • ROMIO data sieving
  • Works for all file systems (just uses contiguous
    reads)
  • Reads extra data (three times the desired amount)
  • Noncontiguous access primitive allows requesting
    just the desired bytes (A. Ching, NWU)
  • Support in ROMIO allows transparent use of the
    new optimization (K. Coloma, NWU)
  • PVFS and ROMIO support implemented (see the
    sketch below)
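
A minimal sketch of how a reader could describe its tile as a noncontiguous file region with an MPI subarray type; ROMIO then services the collective read by data sieving or, where available, by the PVFS noncontiguous primitive. The file name, 3-byte RGB pixels, and a 6-process run are assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define PIXEL 3  /* assumed 3 bytes per RGB pixel */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with 6 readers */

        /* Frame is 1408 rows x 2532 columns; each reader takes one tile. */
        int sizes[2]    = { 1408, 2532 * PIXEL };
        int subsizes[2] = { 768, 1024 * PIXEL };
        int starts[2]   = { (rank / 3) * 640, (rank % 3) * 754 * PIXEL };

        MPI_Datatype tile;
        MPI_Type_create_subarray(2, sizes, subsizes, starts,
                                 MPI_ORDER_C, MPI_BYTE, &tile);
        MPI_Type_commit(&tile);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "frame0000.rgb",  /* name assumed */
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_BYTE, tile, "native", MPI_INFO_NULL);

        char *buf = malloc(768 * 1024 * PIXEL);
        MPI_Status status;
        MPI_File_read_all(fh, buf, 768 * 1024 * PIXEL, MPI_BYTE, &status);

        MPI_File_close(&fh);
        MPI_Type_free(&tile);
        free(buf);
        MPI_Finalize();
        return 0;
    }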

12
Metadata in File Systems
  • Associative arrays of information related to a
    file
  • Seen in other file systems (MacOS, BeOS,
    ReiserFS)
  • Some potential uses
  • Ancillary data (from applications)
  • Derived values
  • Thumbnail images
  • Execution parameters
  • I/O library metadata
  • Block layout information
  • Attributes on variables
  • Attributes of dataset as a whole
  • Headers
  • Keeps header out of data stream
  • Eliminates need for alignment in libraries

13
Metadata and PVFS2 Status
  • Prototype metadata storage for PVFS2 implemented
  • R. Ross (ANL)
  • Uses Berkeley DB for storing keyword/value
    pairs (see the sketch after this list)
  • Need to investigate how to interface to MPI-IO
  • Other components of PVFS2 coming along
  • Networking in testing (P. Carns, Clemson)
  • Client side API under development (Clemson)
  • PVFS2 beta early fourth quarter?
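
For illustration, a minimal sketch of keyword/value storage with the Berkeley DB C API (4.x-style open); the database file name and schema are our assumptions, not the prototype's actual layout.

    #include <string.h>
    #include <db.h>

    int store_attr(const char *keyword, const char *value)
    {
        DB *dbp;
        DBT key, data;

        if (db_create(&dbp, NULL, 0) != 0)
            return -1;
        /* "meta.db" is a placeholder database file name. */
        if (dbp->open(dbp, NULL, "meta.db", NULL,
                      DB_BTREE, DB_CREATE, 0664) != 0)
            return -1;

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data  = (void *) keyword;
        key.size  = strlen(keyword) + 1;
        data.data = (void *) value;
        data.size = strlen(value) + 1;

        int ret = dbp->put(dbp, NULL, &key, &data, 0);
        dbp->close(dbp, 0);
        return ret;
    }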

14
ROMIO MPI-IO Implementation
  • Written by R. Thakur (ANL)
  • R. Ross and R. Latham (ANL), developers
  • K. Coloma (NWU), collaborator
  • Implementation of MPI-2 I/O specification
  • Operates on wide variety of platforms
  • Abstract Device Interface for I/O (ADIO) aids in
    porting to new file systems
  • Successes
  • Adopted by industry (e.g. Compaq, HP, SGI)
  • Used at ASCI sites (e.g. LANL Blue Mountain)

15
ROMIO Current Directions
  • Support for PVFS noncontiguous requests
  • K. Coloma (NWU)
  • Hints: key to efficient use of HW/SW
    components
  • Collective I/O
  • Aggregation (synergy)
  • Performance portability
  • Controlling ROMIO Optimizations
  • Access patterns
  • Grid I/O
  • Scalability
  • Parallel I/O benchmarking

16
ROMIO Aggregation Hints
  • Part of ASCI Software Pathforward project
  • Contact: Gary Grider (LANL)
  • Implementation by R. Ross, R. Latham (ANL)
  • Hints control which processes do I/O in
    collectives (see the sketch below)
  • Examples
  • All processes on same node as attached storage
  • One process per host
  • Additionally limit the number of processes that
    open the file
  • Good for systems w/out shared FS (e.g. O2K
    clusters)
  • More scalable
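
A sketch of passing such hints at open time; cb_nodes and cb_config_list are ROMIO's hint names for aggregator count and placement, while the values and path here are examples.

    #include <mpi.h>

    /* Open a file with aggregation hints set. */
    MPI_File open_with_aggregators(void)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "4");         /* 4 aggregators      */
        MPI_Info_set(info, "cb_config_list", "*:1"); /* one per host       */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/pvfs/datafile",  /* path assumed */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }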

17
Aggregation Example
  • Cluster of SMPs
  • Only one SMP box has connection to disks
  • Data is aggregated to processes on single box
  • Processes on that box perform I/O on behalf of
    the others

18
Optimization Hints
  • MPI-IO calls should be chosen to best describe
    the I/O taking place
  • Use of file views
  • Collective calls for inherently collective
    operations
  • Unfortunately, sometimes choosing the right
    calls can result in lower performance
  • Allow application programmers to tune ROMIO with
    hints rather than using different MPI-IO calls
  • Avoid the misapplication of optimizations
    (aggregation, data sieving)

19
Optimization Problems
  • ROMIO checks for applicability of two-phase
    optimization when collective I/O is used
  • With the tiled display application using subtile
    access, this optimization is never used
  • Checking for applicability requires communication
    between processes
  • Results in a 33% drop in throughput (on the test
    system)
  • A hint that tells ROMIO not to apply the
    optimization can avoid this without changes to
    the rest of the application (see the sketch below)
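
A sketch of the hint in question; romio_cb_read is ROMIO's control for collective buffering (two-phase) on reads, and the wrapper function is ours.

    #include <mpi.h>

    /* Tell ROMIO not to attempt two-phase collective reads. */
    MPI_File open_without_two_phase(const char *path)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_read", "disable");

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, (char *) path,
                      MPI_MODE_RDONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }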

20
Access Pattern Hints
  • Collaboration between ANL and LLNL (and growing)
  • Examining how access pattern information can be
    passed to the MPI-IO interface and through to the
    underlying file system
  • Used as input to optimizations in MPI-IO layer
  • Used as input to optimizations in FS layer as
    well
  • Prefetching
  • Caching
  • Writeback

21
Status of Hints
  • Aggregation control finished
  • Optimization hints
  • Collectives, data sieving read finished
  • Data sieving write control in progress
  • PVFS noncontiguous I/O control in progress
  • Access pattern hints
  • Exchanging log files, formats
  • Getting up to speed on respective tools

22
Parallel I/O Benchmarking
  • No common parallel I/O benchmarks
  • New effort (consortium) to
  • Define some terminology
  • Define test methodology
  • Collect tests
  • Goal: provide a meaningful test suite with
    consistent measurement techniques
  • Interested parties at numerous sites (and
    growing)
  • LLNL, Sandia, UIUC, ANL, UCAR, Clemson
  • In infancy

23
Grid I/O
  • Looking at ways to connect our I/O work with
    components and APIs used in the Grid
  • New ways of getting data in and out of PVFS
  • Using MPI-IO to access data in the Grid
  • Alternative mechanisms for transporting data
    across the Grid (synergy)
  • Working towards more seamless integration of the
    tools used in the Grid and those used on clusters
    and in parallel applications (specifically MPI
    applications)
  • Facilitate moving between Grid and Cluster worlds

24
Local Access to GridFTP Data
  • Grid I/O Contact: B. Allcock (ANL)
  • GridFTP striped server provides high-throughput
    mechanism for moving data across Grid
  • Relies on proprietary storage format on striped
    servers
  • Must manage metadata on stripe location
  • Data stored on servers must be read back from
    servers
  • No alternative/more direct way to access local
    data
  • Next version assumes shared file system underneath

25
GridFTP Striped Servers
  • Remote applications connect to multiple striped
    servers to quickly transfer data over Grid
  • Multiple TCP streams better utilize WAN network
  • Local processes would need to use same mechanism
    to get to data on striped servers

26
PVFS under GridFTP
  • With PVFS underneath, GridFTP servers would store
    data on PVFS I/O servers
  • Stripe information stored on PVFS metadata server

27
Local Data Access
  • Application tasks that are part of a local
    parallel job could access data directly off PVFS
    file system
  • Output from application could be retrieved
    remotely via GridFTP

28
MPI-IO Access to GridFTP
  • Applications such as the tiled display reader
    desire remote access to GridFTP data
  • Access through MPI-IO would allow this with no
    code changes
  • ROMIO ADIO interface provides the infrastructure
    necessary to do this
  • MPI-IO hints provide a means for specifying the
    number of stripes, transfer sizes, etc. (see the
    sketch below)
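
A sketch using the reserved striping_factor and striping_unit hints; the values and the wrapper function are examples.

    #include <mpi.h>

    /* Request 16 stripes of 64 KB each at file-creation time. */
    MPI_File create_striped(const char *path)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "16");  /* stripe count      */
        MPI_Info_set(info, "striping_unit", "65536"); /* stripe size, bytes */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, (char *) path,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }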

29
WAN File Transfer Mechanism
  • B. Gropp (ANL), P. Dickens (IIT)
  • Applications
  • PPM and COMMAS (Paul Woodward, UMN)
  • Alternative mechanism for moving data across Grid
    using UDP
  • Focuses on requirements for file movement
  • All data must arrive at destination
  • Ordering doesn't matter
  • Lost blocks can be retransmitted when detected,
    but need not stop the remainder of the transfer

30
WAN File Transfer Performance
  • Comparing TCP utilization to the WAN FT technique
  • See 10-12% utilization with a single TCP stream
    (8 streams needed to approach max. utilization)
  • With WAN FT, obtain near 90% utilization and more
    uniform performance

31
Grid I/O Status
  • Planning with Grid I/O group
  • Matching up components
  • Identifying useful hints
  • Globus FTP client library is available
  • 2nd generation striped server being implemented
  • XIO interface prototyped
  • Hooks for alternative local file systems
  • Obvious match for PVFS under GridFTP

32
NetCDF
  • Applications in climate and fusion
  • PCM
  • John Drake (ORNL)
  • Weather Research and Forecast Model (WRF)
  • John Michalakes (NCAR)
  • Center for Extended Magnetohydrodynamic Modeling
  • Steve Jardin (PPPL)
  • Plasma Microturbulence Project
  • Bill Nevins (LLNL)
  • Maintained by Unidata Program Center
  • API and file format for storing multidimensional
    datasets and associated metadata (in a single
    file)

33
NetCDF Interface
  • Strong points
  • It's a standard!
  • I/O routines allow for subarray and strided
    access with single calls
  • Access is clearly split into two modes (see the
    sketch after this list)
  • Defining the datasets (define mode)
  • Accessing and/or modifying the datasets (data
    mode)
  • Weakness: no parallel writes, limited parallel
    read capability
  • This forces applications to ship data to a single
    node for writing, severely limiting usability in
    I/O-intensive applications
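
For reference, a minimal sketch of the two-mode structure in the serial netCDF C API; the file, dimension, and variable names are examples.

    #include <netcdf.h>

    int write_temperature(const float *data)
    {
        int ncid, dimids[2], varid;

        /* Define mode: create the file and declare its shape. */
        nc_create("climate.nc", NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "lat", 180, &dimids[0]);
        nc_def_dim(ncid, "lon", 360, &dimids[1]);
        nc_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
        nc_enddef(ncid);                 /* switch to data mode */

        /* Data mode: write the variable (subarray/strided calls
           such as nc_put_vara_float also work here). */
        nc_put_var_float(ncid, varid, data);
        return nc_close(ncid);
    }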

34
Parallel NetCDF
  • Rich I/O routines and explicit define/data modes
    provide a good foundation
  • Existing applications are already describing
    noncontiguous regions
  • Modes allow for a synchronization point when file
    layout changes
  • Missing
  • Semantics for parallel access
  • Collective routines
  • Option for using MPI datatypes
  • Implement in terms of MPI-IO operations (see the
    sketch below)
  • Retain file format for interoperability
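
A sketch of what the collective write path could look like, using ncmpi_-style names in the spirit of the prototype; the names, signatures, and sizes here are illustrative, not final.

    #include <mpi.h>
    #include <pnetcdf.h>

    /* Each process writes its own row block of a shared variable;
       assumes nprocs evenly divides 1024. */
    int checkpoint(MPI_Comm comm, int rank, int nprocs, const float *buf)
    {
        int ncid, dimids[2], varid;
        MPI_Offset start[2], count[2];

        ncmpi_create(comm, "ckpt.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "rows", 1024, &dimids[0]);
        ncmpi_def_dim(ncid, "cols", 1024, &dimids[1]);
        ncmpi_def_var(ncid, "state", NC_FLOAT, 2, dimids, &varid);
        ncmpi_enddef(ncid);

        start[0] = rank * (1024 / nprocs);  start[1] = 0;
        count[0] = 1024 / nprocs;           count[1] = 1024;

        /* Collective write: all processes participate, MPI-IO underneath. */
        ncmpi_put_vara_float_all(ncid, varid, start, count, buf);
        return ncmpi_close(ncid);
    }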

35
Parallel NetCDF Status
  • Design document created
  • B. Gropp, R. Ross, and R. Thakur (ANL)
  • Prototype in progress
  • J. Li (NWU)
  • Focus is on write functions first
  • Biggest bottleneck for checkpointing applications
  • Read functions follow
  • Investigate alternative file formats in future
  • Address differences in access modes between
    writing and reading

36
FLASH Astrophysics Code
  • Developed at ASCI Center at University of Chicago
  • Contact: Mike Zingale
  • Adaptive mesh refinement (AMR) code for simulating
    astrophysical thermonuclear flashes
  • Written in Fortran90, uses MPI for communication,
    HDF5 for checkpointing and visualization data
  • Scales to thousands of processors, runs for
    weeks, needs to checkpoint
  • At the time, I/O was a bottleneck (½ of runtime
    on 1024 processors)

37
HDF5 Overhead Analysis
  • Instrumented FLASH I/O to log calls to H5Dwrite

[Chart: FLASH I/O time split between H5Dwrite and the
underlying MPI_File_write_at]
38
HDF5 Hyperslab Operations
  • White region is hyperslab gather (from memory)
  • Cyan is scatter (to file)

39
Hand-Coded Packing
  • Packing time is in the black regions between bars
  • Nearly an order of magnitude improvement (see the
    sketch below)
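
The idea, roughly: gather the interior (non-guard-cell) data of each block into one contiguous buffer with a simple loop, then hand that buffer to H5Dwrite with a contiguous memory dataspace. A minimal sketch with assumed block dimensions:

    #include <string.h>

    /* Pack the NXB x NYB interior of a block stored with NG guard
       cells on each side; dimensions are assumptions, not FLASH's. */
    #define NG   4
    #define NXB  8
    #define NYB  8

    void pack_interior(const double src[NYB + 2*NG][NXB + 2*NG],
                       double dst[NYB][NXB])
    {
        for (int j = 0; j < NYB; j++)
            /* One contiguous row copy per interior row. */
            memcpy(dst[j], &src[j + NG][NG], NXB * sizeof(double));
        /* dst is now contiguous and can be passed to H5Dwrite
           with a simple contiguous memory dataspace. */
    }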

40
Wrap Up
  • Progress being made on multiple fronts
  • ANL/NWU collaboration is strong
  • Collaborations with other groups maturing
  • Balance of immediate payoff and medium term
    infrastructure improvements
  • Providing expertise to application groups
  • Adding functionality targeted at specific
    applications
  • Building core infrastructure to scale, ensure
    availability
  • Synergy with other projects
  • On to Wei-keng!