Transcript and Presenter's Notes

Title: File Consistency in a Parallel Environment


1
File Consistency in a Parallel Environment
  • Kenin Coloma
  • kcoloma_at_ece.northwestern.edu

2
Outline
  • Data consistency in parallel file systems
  • Consistency Semantics
  • File caching effect
  • Consistency in MPI-IO
  • 2-phase collective IO in ROMIO (a popular MPI-IO
    implementation)
  • Intuitive Solutions
  • Persistent File Domains
  • PFDs - concept
  • PFDs - statically blocked assignment
  • PFDs - statically striped assignment
  • PFDs - dynamic assignment
  • Performance Comparisons
  • Conclusions & Future Work

3
Consistency Semantics
  • POSIX and UNIX sequential consistency
  • Once a write has returned, the resulting file
    must be visible to all processors
  • MPI-IO sequential consistency
  • Once a write has returned, the resulting file
    must be visible only to processors in the same
    Communicator
  • If the underlying file system does not support
    POSIX or UNIX consistency semantics, MPI-IO must
    enforce its sequential consistency semantics
    itself

4
Caching and Consistency
  • The client-server model for file systems often
    relies on client-side caching for performance
    benefits
  • Client-side caching reduces the amount of data
    that needs to be transferred from the server
  • NFS is one such file system, and does not enforce
    POSIX or UNIX consistency semantics

5
Caching and Consistency
  • A simple example using MPI and Unix IO on NFS - 4
    procs

[Diagram: four processes (p0-p3), each with a user buffer and a client-side file cache, execute: Open, Seek(0 byte_off), Read(16 bytes), Barrier, Seek(rank*4 byte_off), Write(4 bytes), Barrier, Seek(0 byte_off), Read(16 bytes), Close - the final read can return stale data from the client-side caches]
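A minimal sketch of this example in C with MPI, assuming a shared NFS-mounted file at a hypothetical path; after the second barrier each process rereads the whole 16-byte region and may see stale cached data instead of its peers' writes.

    /* Sketch only: each of 4 MPI processes reads 16 bytes, writes its own
       4-byte slice, then rereads the region.  On NFS the reread can be
       served from a stale client-side cache.  The path is hypothetical. */
    #include <mpi.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[16];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int fd = open("/nfs/shared.dat", O_RDWR);
        lseek(fd, 0, SEEK_SET);
        read(fd, buf, 16);                 /* read the whole 16-byte region */
        MPI_Barrier(MPI_COMM_WORLD);

        lseek(fd, rank * 4, SEEK_SET);
        write(fd, buf, 4);                 /* each rank writes its 4-byte slice */
        MPI_Barrier(MPI_COMM_WORLD);

        lseek(fd, 0, SEEK_SET);
        read(fd, buf, 16);                 /* may return cached, stale data */
        close(fd);
        MPI_Finalize();
        return 0;
    }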
6
2-phase Collective IO in ROMIO
  • 2-phase I/O, proposed and designed in PASSION (by
    Prof. Choudhary), is widely used in parallel I/O
    optimizations
  • The MPI-IO implementation ROMIO uses 2-phase
    collective I/O
  • Advantages of collective IO
  • Awareness of access patterns (often
    non-contiguous) of all participating processes
  • Means of coordinating participating processes to
    optimize overall IO performance

7
2-phase Collective IO in ROMIO
  • 2-phase IO
  • Communication phase
  • IO phase
  • Reduces the number of IO calls to the IO servers as
    well as the number of IO requests generated at
    the server
  • All the IO done is more localized than it would
    otherwise be

2-phase Collective Write
[Diagram: data moves from the user buffers through communication buffers into IO buffers, and from there to the file]
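A highly simplified sketch of the two phases shown above, assuming every process is also an aggregator owning one contiguous file domain and that the send counts/displacements describing the redistribution have already been computed from the access pattern; real ROMIO code additionally handles non-contiguous requests, end cases, and buffer-size limits.

    /* Sketch only: naive 2-phase collective write.  io_buf must be large
       enough to hold this process's whole file domain, which starts at
       file offset fd_start. */
    #include <mpi.h>
    #include <stdlib.h>

    void two_phase_write(MPI_File fh, MPI_Comm comm, const char *user_buf,
                         const int *sendcounts, const int *senddispls,
                         char *io_buf, MPI_Offset fd_start)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);

        int *recvcounts = malloc(nprocs * sizeof(int));
        int *recvdispls = malloc(nprocs * sizeof(int));

        /* learn how much data will arrive for this process's file domain */
        MPI_Alltoall((void *)sendcounts, 1, MPI_INT,
                     recvcounts, 1, MPI_INT, comm);
        recvdispls[0] = 0;
        for (int i = 1; i < nprocs; i++)
            recvdispls[i] = recvdispls[i - 1] + recvcounts[i - 1];
        int total = recvdispls[nprocs - 1] + recvcounts[nprocs - 1];

        /* Phase 1: communication - redistribute user data to the aggregators */
        MPI_Alltoallv((void *)user_buf, (int *)sendcounts, (int *)senddispls,
                      MPI_BYTE, io_buf, recvcounts, recvdispls, MPI_BYTE, comm);

        /* Phase 2: IO - each aggregator issues one large contiguous write */
        MPI_File_write_at(fh, fd_start, io_buf, total, MPI_BYTE,
                          MPI_STATUS_IGNORE);

        free(recvcounts);
        free(recvdispls);
    }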
8
2-phase Collective IO in ROMIO
  • A simple example to exhibit the file consistency
    problems even with collective IO in ROMIO - 4
    procs

[Diagram: four processes (p0-p3), each with a user buffer and a client-side file cache, execute: MPI_File_open, MPI_File_read_all (whole file), MPI_File_write_all (striped over the first half of the file), MPI_File_read_all (whole file), MPI_File_close - the second read_all can again be served stale data from the client-side caches]
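A minimal MPI-IO sketch of this pattern, assuming a hypothetical shared file and a 4-process run, and using the explicit-offset collectives rather than file views for brevity; on a file system without POSIX consistency the second collective read can still be satisfied from stale client-side caches.

    /* Sketch only: collective read / striped collective write / collective
       read.  File name and sizes are hypothetical. */
    #include <mpi.h>

    #define FILESIZE 16
    #define STRIPE    2   /* each rank writes 2 bytes of the first half */

    int main(int argc, char **argv)
    {
        int rank;
        char buf[FILESIZE];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "/nfs/shared.dat",
                      MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* whole-file collective read */
        MPI_File_read_at_all(fh, 0, buf, FILESIZE, MPI_BYTE,
                             MPI_STATUS_IGNORE);

        /* striped collective write over the first half of the file */
        MPI_File_write_at_all(fh, rank * STRIPE, buf, STRIPE, MPI_BYTE,
                              MPI_STATUS_IGNORE);

        /* whole-file collective read again - may see stale cached data */
        MPI_File_read_at_all(fh, 0, buf, FILESIZE, MPI_BYTE,
                             MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }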
9
Intuitive Solutions
  • The cause: obsolete data cached in the client-side
    system buffer
  • Simple solutions
  • Disabling client-side caching
  • entails changes to system configuration
  • lose performance benefits of caching
  • Use file locking
  • can serialize I/O
  • not feasible on large scale parallel systems
  • effectively disables client-side caching
  • Explicitly flushing out the cached data is the
    simplest solution, as done on Cplant (see the
    sketch after this list)
  • ioctl(fd, BLKFLSBUF) invalidates the cached data
  • fsync(fd) ensures the writes reside on disk
  • also effectively disables client-side caching
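A minimal sketch of this flush-before-read / sync-after-write discipline, assuming a file system that honors the BLKFLSBUF ioctl on a regular file descriptor, as Cplant's ENFS did; the wrapper names are hypothetical.

    /* Sketch only: wrappers that flush the client-side cache around IO. */
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>          /* BLKFLSBUF */

    ssize_t consistent_read(int fd, void *buf, size_t len)
    {
        ioctl(fd, BLKFLSBUF);      /* drop possibly stale cached pages */
        return read(fd, buf, len); /* fetch fresh data from the server */
    }

    ssize_t consistent_write(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        fsync(fd);                 /* make sure the write reaches disk */
        return n;
    }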

10
File locking
  • File locking can cause IO serialization even if
    accesses do not logically overlap
  • This is evident in collective IO where file
    domains never overlap

[Diagram: accesses of p0 and p1 serialized by file locking even though their file domains do not overlap]
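A small sketch of how coarse-grained locking serializes IO, assuming each process conservatively locks the whole file around its write; even writes to disjoint byte ranges then execute one at a time.

    /* Sketch only: a conservative whole-file lock around each write. */
    #include <fcntl.h>
    #include <unistd.h>

    void locked_write(int fd, const void *buf, size_t len, off_t offset)
    {
        struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 0 };  /* len 0 = whole file */
        fcntl(fd, F_SETLKW, &lk);     /* blocks until the lock is granted */
        pwrite(fd, buf, len, offset);
        lk.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &lk);      /* release */
    }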
11
fsync and ioctl
  • On Cplant
  • Flush before every read
  • Fsync after every write
  • Performance ramifications
  • Could be invalidating perfectly good data

[Diagram: the earlier timeline - Open, Seek(0 byte_off), Read(16 bytes), Barrier, Seek(rank*4 byte_off), Write(4 bytes), Barrier, Seek(0 byte_off), Read(16 bytes), Close - annotated with the cache flush before each read and fsync(fd) after each write]
12
Persistent File Domains
  • Similar to the file domains concept in ROMIO's
    collective IO routines
  • Enforces MPI-IO consistency semantics while
    retaining client-side file caching
  • Safe concurrent accesses
  • 3 assignment strategies
  • Statically blocked assignment
  • Statically striped assignment
  • Dynamic (on-the-fly) assignment

13
Statically blocked assignment
  • Client-side caches are made coherent at open
    (fsync(fd->fd_sys), ioctl(fd->fd_sys, BLKFLSBUF))
  • File domains are kept the same between collective
    IO calls
  • Maintains file consistency -- each byte can only
    be accessed by one processor
  • Avoids excessive fsync and ioctl

[Diagram: MPI_File_open, MPI_File_set_size, MPI_File_read_all, MPI_File_write_all, MPI_File_read_all, MPI_File_close issued from the compute nodes against the ENFS servers; the caches are flushed with fsync(fd->fd_sys) and ioctl(fd->fd_sys, BLKFLSBUF), the file size is used to create the file domains at open, and the file domains are deleted at close]
14
Statically blocked assignment
  • Statically Blocked Assignment
  • Based on an equal division of the whole file
  • Least complexity - least amount of changes to
    ROMIO
  • ADIOI_Calc_aggregator() - just a calculation
    (sketched below), based on
  • File size
  • Number of processes
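A minimal sketch of such a calculation, assuming the file is split into nprocs equal contiguous blocks; this illustrates the idea and is not ROMIO's actual ADIOI_Calc_aggregator code.

    /* Sketch only: map a file offset to the rank owning that byte when the
       file is divided into nprocs equal contiguous file domains. */
    #include <mpi.h>

    int calc_aggregator_blocked(MPI_Offset off, MPI_Offset file_size, int nprocs)
    {
        MPI_Offset fd_size = (file_size + nprocs - 1) / nprocs;  /* ceiling */
        int owner = (int)(off / fd_size);
        return owner < nprocs ? owner : nprocs - 1;  /* clamp the last block */
    }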

15
Statically blocked assignment
  • A Key Structure - ADIOI_Access
  • struct
  • ADIO_Offset offsets
  • int lens
  • MPI_Aint mem_ptrs
  • int file_domains
  • int count

my_reqs[nprocs], others_reqs[nprocs]
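For reference, a sketch of this structure as a C declaration; the array fields match how ROMIO keeps one entry per request, file_domains is the field added for persistent file domains, and the exact pointer types are an assumption.

    /* Sketch: request lists kept per process, used as my_reqs[nprocs] and
       others_reqs[nprocs].  ADIO_Offset comes from ROMIO's adio.h. */
    #include "adio.h"

    typedef struct {
        ADIO_Offset *offsets;       /* file offset of each request        */
        int         *lens;          /* length of each request             */
        MPI_Aint    *mem_ptrs;      /* user-buffer location of each piece */
        int         *file_domains;  /* file domain owning each request    */
        int          count;         /* number of requests                 */
    } ADIOI_Access;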
16
Statically blocked assignment
[Diagram animation, slides 16-19: the example steps through MPI_File_open, MPI_File_set_size, MPI_File_read_all, and MPI_File_close, showing how the statically blocked file domains are used at each step]
20
Statically blocked assignment
  • Drawback
  • File inconsistency typically arises when there are
    multiple IO calls, often to different regions of
    the file rather than to the whole file
  • As a result, this assignment scheme will not be
    efficient unless each access covers a rather large
    portion of the file (around 3/4 of the file size)

[Diagram: user buffers and client-side file caches of p0-p3 under statically blocked file domains]
21
Statically striped assignment
  • Statically Striped Assignment
  • Based on a striping block size parameter passed
    to ROMIO through the file system hints mechanism
  • Somewhat more complex than the statically blocked
    assignment
  • Processes can own multiple file domains
  • More end cases
  • ADIOI_Calc_aggregator() - still just a
    calculation (sketched below), based on
  • Striping block size
  • Number of processes
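A minimal sketch of the striped counterpart, assuming file domains are assigned round-robin in units of the striping block size; again an illustration rather than ROMIO's actual code.

    /* Sketch only: with striped file domains, the owner of an offset is the
       rank whose turn it is in a round-robin of stripe-sized blocks. */
    #include <mpi.h>

    int calc_aggregator_striped(MPI_Offset off, MPI_Offset stripe_size,
                                int nprocs)
    {
        MPI_Offset block = off / stripe_size;   /* which stripe block */
        return (int)(block % nprocs);           /* round-robin owner  */
    }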

22
Statically striped assignment
[Diagram: MPI_File_open, MPI_File_set_size, MPI_File_read_all, MPI_File_close under statically striped file domains]
23
Statically striped assignment
  • One significant change, because processes can have
    multiple file domains, is in the communication
  • Mapping communicated data to or from the user
    buffer (tracked with buf_idx)

[Diagram: buf_idx mapping between the user buffers and the multiple file domains of p0 and p1]
24
Statically striped assignment
[Diagram animation, slides 24-26: the example steps through MPI_File_open, MPI_File_set_size, MPI_File_read_all, and MPI_File_close, showing how the striped file domains are used at each step]
27
Statically striped assignment
  • Opportunity to match the stripe size to the access
    pattern
  • Should work particularly well if the aggregate
    access region of each IO call is fairly
    consistently nprocs x stripe size
  • This becomes less significant if the stripe size
    is greater than the data sieve buffer (default 4MB)

[Diagram: user buffers and client-side file caches of p0-p3 under statically striped file domains]
28
Dynamically assigned
  • Static approaches cannot autonomously adapt to
    actual file access patterns
  • 2 approaches
  • Incremental bookkeeping
  • Reassignment
  • Most complex of the three
  • Multiple file domains
  • With respect to the file layout, file domains are
    irregular
  • Assignment - a definitive assignment policy must be
    established

[Diagram: two collective writes (write_all 1 and write_all 2) by p0-p3; the file regions accessed by each process differ between the two calls, so the assigned file domains change]
29
Dynamically assigned
  • ADIOI_Calc_aggregator will become a search
    function (a rough sketch follows)
  • Augment ADIOI_Access
  • struct
  • ADIO_Offset offsets
  • int lens
  • int count
  • Data structure pointers (e.g. B-tree)
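A rough sketch of what a search-based lookup could look like, assuming the dynamically assigned file domains are kept sorted by offset (a simple array standing in for the B-tree mentioned above); this is purely hypothetical since the dynamic approach is not yet implemented.

    /* Sketch only: file domains recorded on the fly, searched by offset.
       A sorted array stands in for a B-tree. */
    #include <mpi.h>

    typedef struct {
        MPI_Offset start, end;    /* byte range [start, end) of the domain */
        int owner;                /* rank that owns it                     */
    } FileDomain;

    /* binary search for the domain containing off; -1 if unassigned */
    int calc_aggregator_dynamic(MPI_Offset off, const FileDomain *doms,
                                int ndoms)
    {
        int lo = 0, hi = ndoms - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (off < doms[mid].start)      hi = mid - 1;
            else if (off >= doms[mid].end)  lo = mid + 1;
            else                            return doms[mid].owner;
        }
        return -1;   /* not yet assigned to any process */
    }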

30
Performance Comparisons
Benchmark: MPI_File_open, MPI_File_set_size(), Loop (iter) { MPI_File_read_all, MPI_File_write_all }, MPI_File_close

Factors: collective buffer size (4MB), stripe size in the application, available cache, aggregate access, file size (static block), number of procs
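A minimal sketch of this benchmark loop, assuming each rank reads and writes an equal contiguous share of the file; file name, file size, and iteration count are assumptions.

    /* Sketch only: repeated collective read/write benchmark. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iter = 10;                         /* assumed */
        const MPI_Offset fsize = 64 * 1024 * 1024;   /* assumed file size */
        int rank, nprocs;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Offset chunk = fsize / nprocs;           /* each rank's share */
        char *buf = malloc((size_t)chunk);

        MPI_File_open(MPI_COMM_WORLD, "bench.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
        MPI_File_set_size(fh, fsize);

        for (int i = 0; i < iter; i++) {
            MPI_File_read_at_all(fh, rank * chunk, buf, (int)chunk, MPI_BYTE,
                                 MPI_STATUS_IGNORE);
            MPI_File_write_at_all(fh, rank * chunk, buf, (int)chunk, MPI_BYTE,
                                  MPI_STATUS_IGNORE);
        }

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }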
31
Conclusions & Future Work
  • File consistency can be realized without locking
    or any changes to system configuration
  • Except for the statically block-assigned method,
    all the methods tested produced similar results
  • The exact conditions under which each solution
    performs best still need to be determined through
    further experimentation
  • The dynamic approach to persistent file domains
    is still unimplemented and under design
    consideration
  • Reassignment vs. bookkeeping
  • Specifics of each policy also need to be worked
    out

32
Data sieving in ROMIO
Read case
  • Quick overview of data sieving
  • Data sieving is best suited for small, densely
    distributed non-contiguous accesses (a sketch
    follows the diagram)

[Diagram: a large contiguous region of the file is read into the data sieve buffer, and the requested pieces are copied from it into the user buffer]
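A minimal sketch of a data-sieving read, assuming the non-contiguous requests arrive as (offset, length) pairs; ROMIO's actual implementation additionally bounds the sieve buffer size and handles writes with read-modify-write.

    /* Sketch only: one large contiguous read covers all the small requests,
       then the wanted pieces are copied out into the user buffer. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    void sieve_read(int fd, char *user_buf,
                    const off_t *offs, const size_t *lens, int nreq)
    {
        off_t lo = offs[0], hi = offs[0] + lens[0];
        for (int i = 1; i < nreq; i++) {            /* bounding extent */
            if (offs[i] < lo) lo = offs[i];
            if (offs[i] + (off_t)lens[i] > hi) hi = offs[i] + lens[i];
        }

        char *sieve = malloc(hi - lo);
        pread(fd, sieve, hi - lo, lo);              /* one contiguous read */

        size_t out = 0;
        for (int i = 0; i < nreq; i++) {            /* copy out the pieces */
            memcpy(user_buf + out, sieve + (offs[i] - lo), lens[i]);
            out += lens[i];
        }
        free(sieve);
    }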