Transcript and Presenter's Notes

Title: File Consistency in a Parallel Environment


1
File Consistency in a Parallel Environment
  • Kenin Coloma
  • kcoloma_at_ece.northwestern.edu

2
Outline
  • Data consistency in parallel file systems
  • Consistency Semantics
  • File caching effect
  • Consistency in MPI-IO
  • 2-phase collective IO in ROMIO (a popular MPI-IO
    implementation)
  • Intuitive Solutions
  • Persistent File Domains
  • PFDs - concept
  • PFDs - statically blocked assignment
  • PFDs - statically striped assignment
  • PFDs - dynamic assignment
  • Performance Comparisons
  • Conclusions & Future Work

3
Consistency Semantics
  • POSIX and UNIX sequential consistency
  • Once a write has returned, the resulting file
    must be visible to all processors
  • MPI-IO sequential consistency
  • Once a write has returned, the resulting file
    must be visible only to processors in the same
    Communicator
  • If the underlying file system does not support
    POSIX or UNIX consistency semantics, MPI-IO must
    enforce its sequential consistency semantics
    itself

4
Caching and Consistency
  • The client-server model for file systems often
    relies on client-side caching for performance
    benefits
  • Client-side caching reduces the amount of data
    that needs to be transferred from the server
  • NFS is one such file system, and does not enforce
    POSIX or UNIX consistency semantics

5
Caching and Consistency
  • A simple example using MPI and Unix IO on NFS - 4
    procs

[Diagram: four processes (p0-p3), each with a user buffer and a client-side file cache, execute: Open, Seek(0 byte_off), Read(16 bytes), Barrier, Seek(rank*4 byte_off), Write(4 bytes), Barrier, Seek(0 byte_off), Read(16 bytes), Close - the final read can return stale data from the client-side caches]
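A minimal sketch of this example in C with MPI, assuming a shared NFS-mounted file at a hypothetical path; after the second barrier each process rereads the whole 16-byte region and may see stale cached data instead of its peers' writes.

    /* Sketch only: each of 4 MPI processes reads 16 bytes, writes its own
       4-byte slice, then rereads the region.  On NFS the reread can be
       served from a stale client-side cache.  The path is hypothetical. */
    #include <mpi.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        char buf[16];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int fd = open("/nfs/shared.dat", O_RDWR);
        lseek(fd, 0, SEEK_SET);
        read(fd, buf, 16);                 /* read the whole 16-byte region */
        MPI_Barrier(MPI_COMM_WORLD);

        lseek(fd, rank * 4, SEEK_SET);
        write(fd, buf, 4);                 /* each rank writes its 4-byte slice */
        MPI_Barrier(MPI_COMM_WORLD);

        lseek(fd, 0, SEEK_SET);
        read(fd, buf, 16);                 /* may return cached, stale data */
        close(fd);
        MPI_Finalize();
        return 0;
    }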
6
2-phase Collective IO in ROMIO
  • 2-phase I/O, proposed and designed in PASSION (by
    Prof. Choudhary), is widely used in parallel I/O
    optimizations
  • The MPI-IO implementation ROMIO uses 2-phase
    collective I/O
  • Advantages of collective IO
  • Awareness of access patterns (often
    non-contiguous) of all participating processes
  • Means of coordinating participating processes to
    optimize overall IO performance

7
2-phase Collective IO in ROMIO
  • 2-phase IO
  • Communication phase
  • IO phase
  • Reduces the number of IO calls to the IO servers as
    well as the number of IO requests generated at
    the server
  • All the IO done is more localized than it would
    otherwise be

2-phase Collective Write
[Diagram: data moves from the user buffers through communication buffers into IO buffers, and from there to the file]
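A highly simplified sketch of the two phases shown above, assuming every process is also an aggregator owning one contiguous file domain and that the send counts/displacements describing the redistribution have already been computed from the access pattern; real ROMIO code additionally handles non-contiguous requests, end cases, and buffer-size limits.

    /* Sketch only: naive 2-phase collective write.  io_buf must be large
       enough to hold this process's whole file domain, which starts at
       file offset fd_start. */
    #include <mpi.h>
    #include <stdlib.h>

    void two_phase_write(MPI_File fh, MPI_Comm comm, const char *user_buf,
                         const int *sendcounts, const int *senddispls,
                         char *io_buf, MPI_Offset fd_start)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);

        int *recvcounts = malloc(nprocs * sizeof(int));
        int *recvdispls = malloc(nprocs * sizeof(int));

        /* learn how much data will arrive for this process's file domain */
        MPI_Alltoall((void *)sendcounts, 1, MPI_INT,
                     recvcounts, 1, MPI_INT, comm);
        recvdispls[0] = 0;
        for (int i = 1; i < nprocs; i++)
            recvdispls[i] = recvdispls[i - 1] + recvcounts[i - 1];
        int total = recvdispls[nprocs - 1] + recvcounts[nprocs - 1];

        /* Phase 1: communication - redistribute user data to the aggregators */
        MPI_Alltoallv((void *)user_buf, (int *)sendcounts, (int *)senddispls,
                      MPI_BYTE, io_buf, recvcounts, recvdispls, MPI_BYTE, comm);

        /* Phase 2: IO - each aggregator issues one large contiguous write */
        MPI_File_write_at(fh, fd_start, io_buf, total, MPI_BYTE,
                          MPI_STATUS_IGNORE);

        free(recvcounts);
        free(recvdispls);
    }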
8
2-phase Collective IO in ROMIO
  • A simple example to exhibit the file consistency
    problems even with collective IO in ROMIO - 4
    procs

[Diagram: four processes (p0-p3), each with a user buffer and a client-side file cache, execute: MPI_File_open, MPI_File_read_all (whole file), MPI_File_write_all (striped over the first half of the file), MPI_File_read_all (whole file), MPI_File_close - the second read_all can again be served stale data from the client-side caches]
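A minimal MPI-IO sketch of this pattern, assuming a hypothetical shared file and a 4-process run, and using the explicit-offset collectives rather than file views for brevity; on a file system without POSIX consistency the second collective read can still be satisfied from stale client-side caches.

    /* Sketch only: collective read / striped collective write / collective
       read.  File name and sizes are hypothetical. */
    #include <mpi.h>

    #define FILESIZE 16
    #define STRIPE    2   /* each rank writes 2 bytes of the first half */

    int main(int argc, char **argv)
    {
        int rank;
        char buf[FILESIZE];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "/nfs/shared.dat",
                      MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* whole-file collective read */
        MPI_File_read_at_all(fh, 0, buf, FILESIZE, MPI_BYTE,
                             MPI_STATUS_IGNORE);

        /* striped collective write over the first half of the file */
        MPI_File_write_at_all(fh, rank * STRIPE, buf, STRIPE, MPI_BYTE,
                              MPI_STATUS_IGNORE);

        /* whole-file collective read again - may see stale cached data */
        MPI_File_read_at_all(fh, 0, buf, FILESIZE, MPI_BYTE,
                             MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }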
9
Intuitive Solutions
  • The cause: obsolete data cached in the client-side
    system buffer
  • Simple solutions
  • Disabling client-side caching
  • entails changes to system configuration
  • lose performance benefits of caching
  • Use file locking
  • can serialize I/O
  • not feasible on large scale parallel systems
  • effectively disables client-side caching
  • Explicitly flushing out the cached data is the
    simplest solution, as done on Cplant (see the
    sketch after this list)
  • ioctl(fd, BLKFLSBUF) invalidates the cached data
  • fsync(fd) ensures the writes reside on disk
  • also effectively disables client-side caching
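A minimal sketch of this flush-before-read / sync-after-write discipline, assuming a file system that honors the BLKFLSBUF ioctl on a regular file descriptor, as Cplant's ENFS did; the wrapper names are hypothetical.

    /* Sketch only: wrappers that flush the client-side cache around IO. */
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>          /* BLKFLSBUF */

    ssize_t consistent_read(int fd, void *buf, size_t len)
    {
        ioctl(fd, BLKFLSBUF);      /* drop possibly stale cached pages */
        return read(fd, buf, len); /* fetch fresh data from the server */
    }

    ssize_t consistent_write(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        fsync(fd);                 /* make sure the write reaches disk */
        return n;
    }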

10
File locking
  • File locking can cause IO serialization even if
    accesses do not logically overlap
  • This is evident in collective IO where file
    domains never overlap

[Diagram: accesses of p0 and p1 serialized by file locking even though their file domains do not overlap]
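A small sketch of how coarse-grained locking serializes IO, assuming each process conservatively locks the whole file around its write; even writes to disjoint byte ranges then execute one at a time.

    /* Sketch only: a conservative whole-file lock around each write. */
    #include <fcntl.h>
    #include <unistd.h>

    void locked_write(int fd, const void *buf, size_t len, off_t offset)
    {
        struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 0 };  /* len 0 = whole file */
        fcntl(fd, F_SETLKW, &lk);     /* blocks until the lock is granted */
        pwrite(fd, buf, len, offset);
        lk.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &lk);      /* release */
    }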
11
fsync and ioctl
  • On Cplant
  • Flush before every read
  • Fsync after every write
  • Performance ramifications
  • Could be invalidating perfectly good data

[Diagram: the earlier timeline - Open, Seek(0 byte_off), Read(16 bytes), Barrier, Seek(rank*4 byte_off), Write(4 bytes), Barrier, Seek(0 byte_off), Read(16 bytes), Close - annotated with the cache flush before each read and fsync(fd) after each write]
12
Persistent File Domains
  • Similar to the file domains concept in ROMIO's
    collective IO routines
  • Enforces MPI-IO consistency semantics while
    retaining client-side file caching
  • Safe concurrent accesses
  • 3 assignment strategies
  • Statically blocked assignment
  • Statically striped assignment
  • Dynamic (on-the-fly) assignment

13
Statically blocked assignment
  • Client-side caches are made coherent at open
    (fsync(fd->fd_sys), ioctl(fd->fd_sys, BLKFLSBUF))
  • File domains are kept the same between collective
    IO calls
  • Maintains file consistency -- each byte can only
    be accessed by one processor
  • Avoids excessive fsync and ioctl

[Diagram: MPI_File_open, MPI_File_set_size, MPI_File_read_all, MPI_File_write_all, MPI_File_read_all, MPI_File_close issued from the compute nodes against the ENFS servers; the caches are flushed with fsync(fd->fd_sys) and ioctl(fd->fd_sys, BLKFLSBUF), the file size is used to create the file domains at open, and the file domains are deleted at close]
14
Statically blocked assignment
  • Statically Blocked Assignment
  • Based on an equal division of the whole file
  • Least complexity - least amount of changes to
    ROMIO
  • ADIOI_Calc_aggregator() - just a calculation
    (sketched below), based on
  • File size
  • Number of processes
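A minimal sketch of such a calculation, assuming the file is split into nprocs equal contiguous blocks; this illustrates the idea and is not ROMIO's actual ADIOI_Calc_aggregator code.

    /* Sketch only: map a file offset to the rank owning that byte when the
       file is divided into nprocs equal contiguous file domains. */
    #include <mpi.h>

    int calc_aggregator_blocked(MPI_Offset off, MPI_Offset file_size, int nprocs)
    {
        MPI_Offset fd_size = (file_size + nprocs - 1) / nprocs;  /* ceiling */
        int owner = (int)(off / fd_size);
        return owner < nprocs ? owner : nprocs - 1;  /* clamp the last block */
    }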

15
Statically blocked assignment
  • A Key Structure - ADIOI_Access
  • struct
  • ADIO_Offset offsets
  • int lens
  • MPI_Aint mem_ptrs
  • int file_domains
  • int count

my_reqs[nprocs], others_reqs[nprocs]
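For reference, a sketch of this structure as a C declaration; the array fields match how ROMIO keeps one entry per request, file_domains is the field added for persistent file domains, and the exact pointer types are an assumption.

    /* Sketch: request lists kept per process, used as my_reqs[nprocs] and
       others_reqs[nprocs].  ADIO_Offset comes from ROMIO's adio.h. */
    #include "adio.h"

    typedef struct {
        ADIO_Offset *offsets;       /* file offset of each request        */
        int         *lens;          /* length of each request             */
        MPI_Aint    *mem_ptrs;      /* user-buffer location of each piece */
        int         *file_domains;  /* file domain owning each request    */
        int          count;         /* number of requests                 */
    } ADIOI_Access;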
16
Statically blocked assignment
[Diagram animation, slides 16-19: the example steps through MPI_File_open, MPI_File_set_size, MPI_File_read_all, and MPI_File_close, showing how the statically blocked file domains are used at each step]
20
Statically blocked assignment
  • Drawback
  • File inconsistency typically arises when there are
    multiple IO calls, often to different regions of
    the file rather than to the whole file
  • As a result, this assignment scheme will not be
    efficient unless each access covers a rather large
    portion of the file (around 3/4 of the file size)

[Diagram: user buffers and client-side file caches of p0-p3 under statically blocked file domains]
21
Statically striped assignment
  • Statically Striped Assignment
  • Based on a striping block size parameter passed
    to ROMIO through the file system hints mechanism
  • Somewhat more complex than the statically blocked
    assignment
  • Processes can own multiple file domains
  • More end cases
  • ADIOI_Calc_aggregator() - still just a
    calculation (sketched below), based on
  • Striping block size
  • Number of processes
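A minimal sketch of the striped counterpart, assuming file domains are assigned round-robin in units of the striping block size; again an illustration rather than ROMIO's actual code.

    /* Sketch only: with striped file domains, the owner of an offset is the
       rank whose turn it is in a round-robin of stripe-sized blocks. */
    #include <mpi.h>

    int calc_aggregator_striped(MPI_Offset off, MPI_Offset stripe_size,
                                int nprocs)
    {
        MPI_Offset block = off / stripe_size;   /* which stripe block */
        return (int)(block % nprocs);           /* round-robin owner  */
    }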

22
Statically striped assignment
[Diagram: MPI_File_open, MPI_File_set_size, MPI_File_read_all, MPI_File_close under statically striped file domains]
23
Statically striped assignment
  • One significant change, because processes can have
    multiple file domains, is in the communication
  • Mapping communicated data to or from the user
    buffer (tracked with buf_idx)

[Diagram: buf_idx mapping between the user buffers and the multiple file domains of p0 and p1]
24
Statically striped assignment
[Diagram animation, slides 24-26: the example steps through MPI_File_open, MPI_File_set_size, MPI_File_read_all, and MPI_File_close, showing how the striped file domains are used at each step]
27
Statically striped assignment
  • Opportunity to match the stripe size to the access
    pattern
  • Should work particularly well if the aggregate
    access region of each IO call is fairly
    consistently nprocs x stripe size
  • This becomes less significant if the stripe size
    is greater than the data sieve buffer (default 4MB)

[Diagram: user buffers and client-side file caches of p0-p3 under statically striped file domains]
28
Dynamically assigned
  • Static approaches cannot autonomously adapt to
    actual file access patterns
  • 2 approaches
  • Incremental bookkeeping
  • Reassignment
  • Most complex of the three
  • Multiple file domains
  • With respect to the file layout, file domains are
    irregular
  • Assignment - a definitive assignment policy must be
    established

[Diagram: two collective writes (write_all 1 and write_all 2) by p0-p3; the file regions accessed by each process differ between the two calls, so the assigned file domains change]
29
Dynamically assigned
  • ADIOI_Calc_aggregator will become a search
    function (a rough sketch follows)
  • Augment ADIOI_Access
  • struct
  • ADIO_Offset offsets
  • int lens
  • int count
  • Data structure pointers (e.g. B-tree)
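A rough sketch of what a search-based lookup could look like, assuming the dynamically assigned file domains are kept sorted by offset (a simple array standing in for the B-tree mentioned above); this is purely hypothetical since the dynamic approach is not yet implemented.

    /* Sketch only: file domains recorded on the fly, searched by offset.
       A sorted array stands in for a B-tree. */
    #include <mpi.h>

    typedef struct {
        MPI_Offset start, end;    /* byte range [start, end) of the domain */
        int owner;                /* rank that owns it                     */
    } FileDomain;

    /* binary search for the domain containing off; -1 if unassigned */
    int calc_aggregator_dynamic(MPI_Offset off, const FileDomain *doms,
                                int ndoms)
    {
        int lo = 0, hi = ndoms - 1;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (off < doms[mid].start)      hi = mid - 1;
            else if (off >= doms[mid].end)  lo = mid + 1;
            else                            return doms[mid].owner;
        }
        return -1;   /* not yet assigned to any process */
    }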

30
Performance Comparisons
Benchmark: MPI_File_open, MPI_File_set_size(), Loop (iter) { MPI_File_read_all, MPI_File_write_all }, MPI_File_close

Factors: collective buffer size (4MB), stripe size in the application, available cache, aggregate access, file size (static block), number of procs
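A minimal sketch of this benchmark loop, assuming each rank reads and writes an equal contiguous share of the file; file name, file size, and iteration count are assumptions.

    /* Sketch only: repeated collective read/write benchmark. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int iter = 10;                         /* assumed */
        const MPI_Offset fsize = 64 * 1024 * 1024;   /* assumed file size */
        int rank, nprocs;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Offset chunk = fsize / nprocs;           /* each rank's share */
        char *buf = malloc((size_t)chunk);

        MPI_File_open(MPI_COMM_WORLD, "bench.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
        MPI_File_set_size(fh, fsize);

        for (int i = 0; i < iter; i++) {
            MPI_File_read_at_all(fh, rank * chunk, buf, (int)chunk, MPI_BYTE,
                                 MPI_STATUS_IGNORE);
            MPI_File_write_at_all(fh, rank * chunk, buf, (int)chunk, MPI_BYTE,
                                  MPI_STATUS_IGNORE);
        }

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }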
31
Conclusions & Future Work
  • File consistency can be realized without locking
    or any changes to system configuration
  • Except for the statically block-assigned method,
    all the methods tested produced similar results
  • The exact conditions under which each solution
    performs best still need to be determined through
    further experimentation
  • The dynamic approach to persistent file domains
    is still unimplemented and under design
    consideration
  • Reassignment vs. bookkeeping
  • Specifics of each policy also need to be worked
    out

32
Data sieving in ROMIO
Read case
  • Quick overview of data sieving
  • Data sieving is best suited for small, densely
    distributed non-contiguous accesses (a sketch
    follows the diagram)

[Diagram: a large contiguous region of the file is read into the data sieve buffer, and the requested pieces are copied from it into the user buffer]
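A minimal sketch of a data-sieving read, assuming the non-contiguous requests arrive as (offset, length) pairs; ROMIO's actual implementation additionally bounds the sieve buffer size and handles writes with read-modify-write.

    /* Sketch only: one large contiguous read covers all the small requests,
       then the wanted pieces are copied out into the user buffer. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    void sieve_read(int fd, char *user_buf,
                    const off_t *offs, const size_t *lens, int nreq)
    {
        off_t lo = offs[0], hi = offs[0] + lens[0];
        for (int i = 1; i < nreq; i++) {            /* bounding extent */
            if (offs[i] < lo) lo = offs[i];
            if (offs[i] + (off_t)lens[i] > hi) hi = offs[i] + lens[i];
        }

        char *sieve = malloc(hi - lo);
        pread(fd, sieve, hi - lo, lo);              /* one contiguous read */

        size_t out = 0;
        for (int i = 0; i < nreq; i++) {            /* copy out the pieces */
            memcpy(user_buf + out, sieve + (offs[i] - lo), lens[i]);
            out += lens[i];
        }
        free(sieve);
    }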