IO Best Practices For Franklin - PowerPoint Presentation Transcript
1
IO Best Practices For Franklin
Katie Antypas, User Services Group, kantypas@lbl.gov
NERSC User Group Meeting, September 19, 2007
2
Outline
  • Goals and scope of tutorial
  • IO Formats
  • Parallel IO strategies
  • Striping
  • Recommendations

Thanks to Julian Borrill, Hongzhang Shan, John Shalf and Harvey Wasserman for slides and data, Nick Cardo for Franklin/Lustre tutorials, and the NERSC IO group for feedback
3
Goals
  • At a very high level, answer the question: how should I do my IO on Franklin?
  • With X GB of data to output running on Y processors: do this.

4
Axes of IO
This is why IO is complicated: there are many interacting dimensions, including
  • Total output size
  • Number of processors
  • File size per processor
  • Transfer size
  • Blocksize
  • Number of files per output dump
  • Collective vs independent IO
  • Strided or contiguous access
  • Weak vs strong scaling
  • Striping
  • Chunking
  • IO library
  • File system hints
5
Axes of IO
(The same set of axes, repeated from the previous slide.)
6
Axes of IO
Axes examined in this tutorial:
  • Total file size
  • File size per processor
  • Number of processors
  • Number of writers
  • Transfer size and blocksize (primarily large block IO, with transfer size equal to blocksize)
  • Strong scaling
  • Striping
  • IO library (HDF5 used)
  • Some basic tips
7
Parallel I/O: A User Perspective
  • Wish list
    • Write data from multiple processors into a single file
    • File can be read in the same manner regardless of the number of CPUs that read from or write to the file (e.g. want to see the logical data layout, not the physical layout)
    • Do so with the same performance as writing one file per processor (users only write one file per processor because of performance problems)
    • And make all of the above portable from one machine to the next

8
I/O Formats
9
Common Storage Formats
  • ASCII
    • Slow
    • Takes more space
    • Inaccurate
  • Binary
    • Non-portable (e.g. byte ordering and type sizes)
    • Not future proof
    • Parallel I/O via MPI-IO
  • Self-describing formats
    • NetCDF/HDF4, HDF5, Parallel NetCDF
    • Example: the HDF5 API implements an object DB model in a portable file
    • Parallel I/O via pHDF5/pNetCDF (which hide MPI-IO)
  • Community file formats
    • FITS, HDF-EOS, SAF, PDB, Plot3D
    • Modern implementations built on top of HDF, NetCDF, or other self-describing object-model APIs

10
HDF5 Library
HDF5 is a general-purpose library and file format for storing scientific data
  • Can store data structures, arrays, vectors, grids, complex data types, text
  • Can use basic HDF5 types (integers, floats, reals) or user-defined types such as multi-dimensional arrays, objects and strings
  • Stores the metadata necessary for portability: endianness, type sizes, architecture

11
HDF5 Data Model
  • Groups
    • Arranged in a directory hierarchy
    • The root group is always /
  • Datasets
    • Dataspace
    • Datatype
  • Attributes
    • Bind to a Group or Dataset
  • References
    • Similar to softlinks
    • Can also be subsets of data

Example file layout: a root group / with attributes (author = "Jane Doe", date = "10/24/2006"); datasets Dataset0 and Dataset1, each with a datatype and dataspace and attributes such as time = 0.2345 and validity = None; and a subgroup subgrp containing Dataset0.1 and Dataset0.2 (each with a datatype and dataspace).
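A minimal serial sketch of this model using the HDF5 C API (1.8-style calls). The file name, group and dataset names, array shape and attribute values below are illustrative, not taken from the talk:

#include "hdf5.h"

int main(void) {
    /* Create a new file; "example.h5" is a hypothetical name */
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* A group under the root group "/" */
    hid_t grp = H5Gcreate2(file, "/subgrp", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* A 2-D dataset: dataspace (shape) plus datatype (native double) */
    hsize_t dims[2] = {4, 6};
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(grp, "Dataset0", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    double data[4][6] = {{0}};
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    /* An attribute bound to the dataset, e.g. a timestamp */
    hid_t ascalar = H5Screate(H5S_SCALAR);
    hid_t attr = H5Acreate2(dset, "time", H5T_NATIVE_DOUBLE, ascalar,
                            H5P_DEFAULT, H5P_DEFAULT);
    double t = 0.2345;
    H5Awrite(attr, H5T_NATIVE_DOUBLE, &t);

    H5Aclose(attr); H5Sclose(ascalar);
    H5Dclose(dset); H5Sclose(space);
    H5Gclose(grp);  H5Fclose(file);
    return 0;
}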
12
A Plug for Self Describing Formats ...
  • Application developers shouldn't have to care about the physical layout of data
  • Using your own binary file format forces you to understand the layers below the application to get optimal IO performance
  • Every time the code is ported to a new machine, or the underlying file system is changed or upgraded, you are required to make changes to maintain IO performance
  • Let other people do the work
    • HDF5 can be optimized for given platforms and file systems by the HDF5 developers
    • The user can stay at the high level
  • But what about performance?

13
IO Library Overhead
Very little, if any, overhead from HDF5 for one-file-per-processor IO compared to POSIX and MPI-IO
Data from Hongzhang Shan
14
Ways to do Parallel IO
15
Serial I/O
[Diagram: processors 0-5 send their data to the master processor, which writes one file]
  • Each processor sends its data to the master, which then writes the data to a file
  • Advantages
    • Simple
    • May perform OK for very small IO sizes
  • Disadvantages
    • Not scalable
    • Not efficient; slow for large numbers of processors or large data sizes
    • May not be possible if memory constrained
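A minimal MPI sketch of this serial-IO pattern (the per-rank size and the output name out.dat are hypothetical): every rank sends its block to rank 0 with MPI_Gather, and only rank 0 touches the file.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1024;                        /* elements per rank (illustrative) */
    double *local = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) local[i] = rank;   /* stand-in data */

    double *global = NULL;
    if (rank == 0) global = malloc((size_t)N * nprocs * sizeof(double));

    /* All ranks send their piece to the master */
    MPI_Gather(local, N, MPI_DOUBLE, global, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {                           /* only the master writes the file */
        FILE *fp = fopen("out.dat", "wb");
        fwrite(global, sizeof(double), (size_t)N * nprocs, fp);
        fclose(fp);
        free(global);
    }
    free(local);
    MPI_Finalize();
    return 0;
}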
16
Parallel I/O Multi-file
[Diagram: processors 0-5 each write their data to a separate file]
  • Each processor writes its own data to a separate file
  • Advantages
    • Simple to program
    • Can be fast (up to a point)
  • Disadvantages
    • Can quickly accumulate many files
    • With Lustre, can hit the metadata server limit
    • Hard to manage
    • Requires post-processing
    • Difficult for storage systems such as HPSS to handle many small files
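A minimal sketch of the file-per-processor pattern; the per-rank file name pattern and data size are illustrative assumptions:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[1024];
    for (int i = 0; i < 1024; i++) data[i] = rank;         /* stand-in data */

    char fname[64];
    snprintf(fname, sizeof(fname), "dump_%05d.dat", rank); /* one file per rank */

    FILE *fp = fopen(fname, "wb");
    fwrite(data, sizeof(double), 1024, fp);
    fclose(fp);

    MPI_Finalize();
    return 0;
}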

17
Flash Center IO Nightmare
  • Large 32,000-processor run on LLNL BG/L
  • Parallel IO libraries not yet available
  • Intensive I/O application
    • Checkpoint files: 0.7 TB each, dumped every 4 hours, 200 dumps
      • Used for restarting the run
      • Full-resolution snapshots of the entire grid
    • Plotfiles: 20 GB each, 700 dumps
      • Coarsened by a factor of two via averaging
      • Single precision
      • Subset of grid variables
    • Particle files: 1400 files, 470 MB each
  • 154 TB of disk capacity
  • 74 million files!
  • Unix tool problems
  • Two years later, still trying to sift through data and sew files together

18
Parallel I/O Single-file
[Diagram: processors 0-5 all write to a single shared file]
  • Each processor writes its own data to the same file using an MPI-IO mapping
  • Advantages
    • Single file
    • Manageable data
  • Disadvantages
    • Lower performance than one file per processor at some concurrencies
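A minimal MPI-IO sketch of the single-shared-file pattern (the file name and per-rank count are hypothetical): each rank writes its block at a byte offset computed from its rank.

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1024;
    double buf[1024];
    for (int i = 0; i < N; i++) buf[i] = rank;   /* stand-in data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Offset in bytes: this rank's slot in the global array */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}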

19
Parallel IO single file
[Diagram: processors 0-5 each write to their own section of a shared data array]
Each processor writes to a section of a data array. Each must know its offset from the beginning of the array and the number of elements to write.
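The same offset-plus-count idea expressed through parallel HDF5, the interface used in the benchmarks later in this talk; this sketch assumes an MPI-enabled HDF5 build, and the file name, dataset name and sizes are illustrative. Each rank selects its own hyperslab of a shared dataset and writes collectively:

#include "hdf5.h"
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t N = 1024;                       /* elements per rank */
    double buf[1024];
    for (hsize_t i = 0; i < N; i++) buf[i] = rank;

    /* Open the file for parallel access through MPI-IO */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global dataset holds nprocs * N elements */
    hsize_t gdims[1] = {(hsize_t)nprocs * N};
    hid_t filespace = H5Screate_simple(1, gdims, NULL);
    hid_t dset = H5Dcreate2(file, "array", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank's offset and count within the global array */
    hsize_t offset[1] = {(hsize_t)rank * N};
    hsize_t count[1]  = {N};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    /* Collective write */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}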
20
Trade-offs
  • Ideally users want speed, portability and usability
    • Speed: one file per processor
    • Portability: high-level IO library
    • Usability: a single shared file, and your own file format or a community file format layered on top of a high-level IO library

It isn't hard to have speed, portability or usability alone. It is hard to have speed, portability and usability in the same implementation.
21
Benchmarking Methodology and Results
22
Disclaimer
  • IO runs done during production time
  • Rates dependent on other jobs running on the
    system
  • Focus on trends rather than one or two outliers
  • Some tests ran twice, others only once

23
Peak IO Performance on Franklin
  • Expectation was that IO rates would continue to rise linearly
  • Back end saturated around 250 processors
  • Weak scaling IO, 300 MB/proc
  • Peak performance 11 GB/sec (5 DDNs, ~2 GB/sec each)
Image from Julian Borrill
24
Description of IOR
  • Developed by LLNL; used for the Purple procurement
  • Focuses on parallel/sequential read/write operations that are typical in scientific applications
  • Can exercise one-file-per-processor or shared-file access for a common set of testing parameters
  • Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), HDF5 and Parallel NetCDF
  • Parameterized parallel file access patterns to mimic different application situations

25
Benchmark Methodology
Focus on the performance difference between a single shared file and one file per processor
26
Benchmark Methodology
  • Using the IOR HDF5 interface
  • Contiguous IO
  • Not intended to be a scaling study
  • Blocksize and transfer size are always equal to each other, but vary from run to run
  • Goal is to fill out the chart below with the best IO strategy

[Chart to fill in: Processors (256, 512, 1024, 2048, 4096) vs Aggregate Output Size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]
27
Small Aggregate Output Sizes: 100 MB - 1 GB
One File per Processor vs Shared File - Rate (GB/sec)
[Plots: aggregate file size 100 MB and aggregate file size 1 GB; a peak-performance line is marked; anything greater than this is due to caching effects or timer granularity]
Clearly the one-file-per-processor strategy wins in the low-concurrency cases, correct?
28
Small Aggregate Output Sizes: 100 MB - 1 GB
One File per Processor vs Shared File - Time
[Plots: aggregate file size 100 MB and aggregate file size 1 GB, absolute write time]
But when looking at absolute time, the difference doesn't seem so big...
29
Aggregate Output Size 100 GB
One File per Processor vs Shared File
[Plots: rate (GB/sec) and time (seconds) vs processor count; peak performance line marked; annotations: 2.5 mins, 390 MB/proc, 24 MB/proc]
Is there anything we can do to improve the performance of the 4096-processor shared-file case?
30
Hybrid Model
[Diagram: processors 0-5 split into groups, each group writing to its own shared file]
  • Examine the 4096-processor case more closely
  • Group subsets of processors to write to separate shared files
  • Try grouping 64, 256, 512, 1024, and 2048 processors per file to see the performance difference from the file-per-processor and single-shared-file cases (see the communicator sketch below)
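A minimal sketch of how such grouping can be done with an MPI communicator split, similar in spirit to what was added to IOR for these tests; group_size, the file name pattern, and the per-group use of MPI-IO are illustrative assumptions:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int group_size = 512;          /* e.g. 512 writers per shared file */
    int color = rank / group_size;       /* which file this rank belongs to */

    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

    char fname[64];
    snprintf(fname, sizeof(fname), "dump_group_%04d.dat", color);

    MPI_File fh;
    MPI_File_open(subcomm, fname, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    /* ... each group writes its portion into its own shared file ... */
    MPI_File_close(&fh);

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}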

31
Effect of Grouping Processors into Separate
Smaller Shared Files
100 GB Aggregate Output Size on 4096 procs
  • Each processor writes out 24 MB
  • The only difference between runs is the number of files into which processors are grouped
  • Created a new MPI communicator in IOR for multiple shared files
  • Users gain some performance from grouping into files
  • Since very little data is written per processor, synchronization overhead dominates

[Chart: write rate vs number of files, from a single shared file, through 2048, 512 and 64 procs per shared file, to one file per proc]
32
Aggregate Output Size 1 TB
One File per Processor vs Shared File
[Plots: rate (GB/sec) and time (seconds) vs processor count; annotations: 3 mins, 976 MB/proc, 244 MB/proc]
Is there anything we can do to improve the performance of the 4096-processor shared-file case?
33
Effect of Grouping Processors into Separate
Smaller Shared Files
  • Each processor writes out 244 MB
  • The only difference between runs is the number of files into which processors are grouped
  • Created a new MPI communicator in IOR for multiple shared files

[Chart: write rate vs number of files, from a single shared file, through 2048, 512 and 64 procs per shared file, to one file per proc]
34
Effect of Grouping Processors into Separate
Smaller Shared Files
  • Each processor writes out 488 MB
  • The only difference between runs is the number of files into which processors are grouped
  • Created a new MPI communicator in IOR for multiple shared files

[Chart: write rate vs number of files, from a single shared file, through 512 and 64 procs per shared file, to one file per proc]
35
What is Striping?
  • The Lustre file system on Franklin is made up of an underlying set of file systems called Object Storage Targets (OSTs), essentially a set of parallel IO servers
  • A file is said to be striped when read and write operations access multiple OSTs concurrently
  • Striping can be a way to increase IO performance, since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth

36
What is Striping?
  • File striping will most likely improve performance for applications which read or write to a single (or multiple) large shared files
  • Striping will likely have little effect for the following types of IO patterns
    • Serial IO, where a single processor performs all the IO
    • Multiple nodes perform IO, but access files at different times
    • Multiple nodes perform IO simultaneously to different files that are small (each < 100 MB)
    • One file per processor

37
Striping Commands
  • Striping can be set at a file or directory level
  • Set striping on a directory and all files created in that directory will inherit the striping level of the directory
  • Moving a file into a directory with a set striping will NOT change the striping of that file
  • stripe size
    • Number of bytes in each stripe (multiple of the 64 KB block)
  • OST offset
    • Always keep this at -1
    • Chooses the starting OST in round robin
  • stripe count
    • Number of OSTs to stripe over
    • -1: stripe over all OSTs
    • 1: stripe over one OST

lfs setstripe <directory|file> <stripe size> <OST offset> <stripe count>
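For example (directory name hypothetical), lfs setstripe checkpoint_dir 0 -1 -1 would give files created in checkpoint_dir the default stripe size, a round-robin starting OST, and striping over all OSTs, matching the large-shared-file suggestion on the next slide.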
38
Stripe-Count Suggestions
  • Franklin default striping
    • 1 MB stripe size
    • Round robin starting OST (OST offset -1)
    • Stripe over 4 OSTs (stripe count 4)
  • Many small files, one file per proc
    • Use default striping
    • Or 0 -1 1
  • Large shared files
    • Stripe over all available OSTs (0 -1 -1)
    • Or some number larger than 4 (0 -1 X)
    • Stripe over odd numbers? Prime numbers?

39
Recommendations
Legend:
  • Single shared file, default or no striping
  • Single shared file, stripe over some OSTs (10)
  • Single shared file, stripe over many OSTs
  • Single shared file, stripe over many OSTs OR file per processor with default striping
  • Benefits to mod-n shared files

[Chart: recommended strategy as a function of processor count (256, 512, 1024, 2048, 4096) and aggregate file size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]
40
Recommendations
  • Think about the big picture
    • Run time vs post-processing trade-off
    • Decide how much IO overhead you can afford
  • Data analysis
  • Portability
  • Longevity
    • h5dump works on all platforms
    • Can view an old file with h5dump
    • If you use your own binary format, you must keep track of not only your file format version but the version of your file reader as well
  • Storability

41
Recommendations
  • Use a standard IO format, even if you are following a one-file-per-processor model
    • The one-file-per-processor model really only makes sense when writing out very large files at high concurrencies; for small files, the shared-file overhead is low
    • If you must do one-file-per-processor IO, then at least put it in a standard IO format so the pieces can be put back together more easily
  • Splitting large shared files into a few files appears promising
    • An option for some users, but requires code changes and output format changes
    • Could be implemented better in IO library APIs
  • Follow the striping recommendations
  • Ask the consultants, we are here to help!

42
Questions?