Transcript and Presenter's Notes

Title: Parallel I/O


1
Parallel I/O
  • Sources/Credits
  • R. Thakur, W. Gropp, E. Lusk. A Case for Using MPI's Derived Datatypes to Improve I/O Performance. Supercomputing 98.
  • http://www.cs.dartmouth.edu/pario/bib/short.html (bibliography)
  • Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI-IO output performance with active buffering plus threads. In Proceedings of the International Parallel and Distributed Processing Symposium. IEEE Computer Society Press, April 2003.
  • Mahmut Kandemir. Compiler-directed collective I/O. IEEE Transactions on Parallel and Distributed Systems, 12(12):1318-1331, December 2001.
  • Meenakshi A. Kandaswamy, Mahmut Kandemir, Alok Choudhary, and David Bernholdt. An experimental evaluation of I/O optimizations on different applications. IEEE Transactions on Parallel and Distributed Systems, 13(7):728-744, July 2002.

2
High Performance with Derived Datatypes (Thakur et al., SC 98)
  • The potential of parallel file systems is not fully utilized because of applications' I/O access patterns
  • Many small requests to noncontiguous blocks
  • Most parallel file systems perform best when accessing a single large contiguous chunk
  • Hence the motivation for making a single I/O call using a derived datatype (see the sketch below)
  • ROMIO performs two optimizations: data sieving and collective I/O
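As a rough illustration of this point (this sketch is not from the Thakur et al. paper; the file name, block sizes, and access pattern are made up), the fragment below describes a noncontiguous pattern, every other 1 KB block of a file, with a single vector derived datatype and reads it with one MPI-IO call instead of a loop of small reads:

  /* Minimal sketch (hypothetical layout): this process needs every other
     1 KB block of "data.bin".  Instead of a loop of small reads, the whole
     noncontiguous pattern is described once with a vector derived datatype
     and read with a single request that the MPI-IO library can optimize. */
  #include <mpi.h>
  #include <stdlib.h>

  #define NBLOCKS 1024
  #define BLKINTS 256                      /* 1 KB blocks of ints */

  int main(int argc, char **argv)
  {
      MPI_File fh;
      MPI_Datatype filetype;
      int *buf = malloc((size_t)NBLOCKS * BLKINTS * sizeof(int));

      MPI_Init(&argc, &argv);
      MPI_File_open(MPI_COMM_WORLD, "data.bin",
                    MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

      /* NBLOCKS blocks of BLKINTS ints each, with a stride of 2*BLKINTS
         ints between block starts in the file. */
      MPI_Type_vector(NBLOCKS, BLKINTS, 2 * BLKINTS, MPI_INT, &filetype);
      MPI_Type_commit(&filetype);

      MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
      MPI_File_read(fh, buf, NBLOCKS * BLKINTS, MPI_INT, MPI_STATUS_IGNORE);

      MPI_Type_free(&filetype);
      MPI_File_close(&fh);
      free(buf);
      MPI_Finalize();
      return 0;
  }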

3
ROMIO Architecture
4
Datatype Constructors in MPI
  • contiguous
  • vector/hvector
  • indexed/hindexed/indexed_block
  • struct
  • subarray
  • darray
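As a hedged sketch (the array and process-grid sizes below are illustrative, not taken from the slides), the subarray constructor can describe one process's block of a global 2-D array and serve directly as the file view; darray can express the same layout from a distribution description:

  #include <mpi.h>

  /* Sketch: describe this process's 256 x 256 block of a 512 x 512 global
     int array, block-distributed over a 2 x 2 process grid, with the
     subarray constructor, and install it as the file view. */
  void set_block_view(MPI_File fh, int rank)
  {
      int gsizes[2] = {512, 512};           /* global array dimensions      */
      int lsizes[2] = {256, 256};           /* this process's block         */
      int starts[2] = {(rank / 2) * 256,    /* row of the block's corner    */
                       (rank % 2) * 256};   /* column of the block's corner */
      MPI_Datatype filetype;

      MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                               MPI_ORDER_C, MPI_INT, &filetype);
      MPI_Type_commit(&filetype);
      MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
      MPI_Type_free(&filetype);
  }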

5
Different levels of access
6
Different levels of access
7
Different levels of access
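The figures for these slides are not reproduced in this transcript. The four access levels they illustrate (from the Thakur et al. paper) can be outlined as follows; the datatype and view setup is elided, as in the earlier sketches:

  /* The four levels of access for a distributed-array read (outline only):

     Level 0: independent, contiguous (Unix-style) - one small call per row
         for each row of the local block:
             MPI_File_read_at(fh, offset_of_row, row, n, MPI_INT, &st);

     Level 1: collective, contiguous - still one small call per row
         for each row of the local block:
             MPI_File_read_at_all(fh, offset_of_row, row, n, MPI_INT, &st);

     Level 2: independent, noncontiguous - derived-datatype file view
         MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", info);
         MPI_File_read(fh, local_array, count, MPI_INT, &st);

     Level 3: collective, noncontiguous - a single collective request
         MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", info);
         MPI_File_read_all(fh, local_array, count, MPI_INT, &st);          */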
8
Optimizations in ROMIO for derived-datatype
noncontiguous access
  • Data sieving (see the sketch below)
  • Make a few large, contiguous requests to the file system even if the user's request consists of several small, noncontiguous requests
  • Extract (pick out) in memory only the data that is actually needed
  • This is fine for reads; what about writes?
  • Use a smaller buffer for writing with data sieving than for reading with data sieving. Why?

Writes require read-modify-write along with locking: the greater the size of the write buffer, the greater the contention among processes for locks.
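A hedged sketch of the read side of data sieving follows (the request representation and stdio calls are illustrative; ROMIO's internal code differs): read one large contiguous extent that covers all of the small requests, then copy out only the pieces the user asked for. For writes, the same extent must be read, modified, and written back while the region is locked, which is why a smaller sieve buffer is preferred there.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* One noncontiguous user request: n pieces, each (file offset, length). */
  struct piece { long off; long len; };

  /* Data-sieving read: instead of n small reads, read the whole extent
     [lo, hi) that covers every piece with one contiguous request, then
     pick the needed bytes out of the sieve buffer in memory. */
  void sieve_read(FILE *f, const struct piece *p, int n, char *user_buf)
  {
      long lo = p[0].off, hi = p[0].off + p[0].len;
      for (int i = 1; i < n; i++) {
          if (p[i].off < lo)            lo = p[i].off;
          if (p[i].off + p[i].len > hi) hi = p[i].off + p[i].len;
      }

      char *sieve = malloc((size_t)(hi - lo));   /* one large contiguous read */
      fseek(f, lo, SEEK_SET);
      fread(sieve, 1, (size_t)(hi - lo), f);

      long out = 0;                              /* extract only the data needed */
      for (int i = 0; i < n; i++) {
          memcpy(user_buf + out, sieve + (p[i].off - lo), (size_t)p[i].len);
          out += p[i].len;
      }
      free(sieve);
  }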
9
Optimizations in ROMIO for derived-datatype
noncontiguous access
  • Data sieving
  • Collective I/O
  • During collective I/O functions, the implementation can analyze and merge the requests of different processes
  • The merged request can be large and contiguous even though the individual requests were noncontiguous
  • Perform I/O in two phases (see the sketch below)
  • I/O phase: processes perform I/O for the merged request. Some of the data may belong to other processes. If the merged request is not contiguous, use data sieving
  • Communication phase: processes redistribute the data to obtain the desired distribution
  • The additional cost of the communication phase can be offset by the performance gain due to contiguous access
  • Data sieving and collective I/O also help improve caching and prefetching in the underlying file system
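A hedged sketch of the two phases (this is not ROMIO's actual code; the per-process file domains and the exchange counts are assumed to have been computed from the merged request, and that computation is omitted):

  #include <mpi.h>
  #include <stdlib.h>

  /* Two-phase collective read sketch.  Each process owns one contiguous
     "file domain" of fd_bytes bytes starting at fd_offset.  sendcounts,
     sdispls, recvcounts and rdispls describe which bytes of each domain
     belong to which process (precomputed from the merged request). */
  void two_phase_read(MPI_File fh, MPI_Offset fd_offset, int fd_bytes,
                      int *sendcounts, int *sdispls,
                      int *recvcounts, int *rdispls,
                      char *user_buf, MPI_Comm comm)
  {
      char *domain = malloc((size_t)fd_bytes);

      /* I/O phase: one large contiguous read per process, even though some
         of the data read belongs to other processes. */
      MPI_File_read_at_all(fh, fd_offset, domain, fd_bytes, MPI_BYTE,
                           MPI_STATUS_IGNORE);

      /* Communication phase: redistribute so every process ends up with
         exactly the pieces it originally requested. */
      MPI_Alltoallv(domain, sendcounts, sdispls, MPI_BYTE,
                    user_buf, recvcounts, rdispls, MPI_BYTE, comm);

      free(domain);
  }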

10
Collective I/O Illustration
(Figure: collective I/O illustration with two processes, P0 and P1; diagram not reproduced in this transcript)
11
Results
Table 1: Read performance for distributed-array access (array size 512 x 512 x 512 integers, file size 512 Mbytes). (Table not reproduced; its annotations mark the improvement due to data sieving and the improvement due to collective I/O.)

If the requests of the processes that call a collective function are not interleaved in the file, ROMIO's collective implementation simply calls the corresponding independent-I/O function on each process. Hence Level 1 performs about the same as Level 0.
12
Results
Table 3: Write performance for distributed-array access (array size 512 x 512 x 512 integers, file size 512 Mbytes). The IBM SP's PIOFS does not support file locking.

13
Active Buffering with Threads (Xiaosong Ma et al., IPDPS 2003)
  • The above optimizations alone are not enough
  • Active buffering: use of separate I/O nodes
  • Overlap I/O access with computation by using threads
  • Buffer space is automatically adjusted to the available memory

14
Original Scheme (Ma, IPDPS 2002)
  • Hierarchical buffering scheme
  • Dedicated I/O server nodes
  • During I/O:
  •   if (no overflow in compute nodes)
  •     compute nodes -> local buffers
  •   else if (no overflow in server nodes)
  •     compute nodes -> server buffers (using MPI)
  •   else
  •     server nodes -> I/O system
  • During computation:
  •   Server nodes clear their local buffers and write them to the I/O system
  •   Server nodes fetch data from compute nodes (one-sided communication) and write it to the I/O system

15
Current Scheme
  • I/O threads: collective I/O overlapped with the main threads' computation and communication
  • Uses pthreads with kernel-level scheduling
  • Interception of ROMIO's I/O calls
  • Main threads and I/O threads coordinate through a buffer queue
  • A producer-consumer / bounded-buffer problem (see the sketch below)
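A hedged sketch of such a buffer queue is below (a textbook bounded buffer with pthreads, not the paper's implementation; the queue length and the fixed 4 KB buffer size are arbitrary). The main thread appends write buffers, the background I/O thread drains them and writes, and a NULL entry plays the role of the special termination buffer appended when the file is closed.

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define QSIZE 8                 /* bounded buffer: at most QSIZE pending buffers */
  #define BUFSZ 4096              /* each queued buffer holds 4 KB in this sketch  */

  static void            *queue[QSIZE];
  static int              head, tail, count;
  static pthread_mutex_t  lock      = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t   not_full  = PTHREAD_COND_INITIALIZER;
  static pthread_cond_t   not_empty = PTHREAD_COND_INITIALIZER;

  /* Producer side: called by the main thread in place of a blocking write. */
  void enqueue(void *buf)
  {
      pthread_mutex_lock(&lock);
      while (count == QSIZE)                    /* bounded-buffer back-pressure */
          pthread_cond_wait(&not_full, &lock);
      queue[tail] = buf;
      tail = (tail + 1) % QSIZE;
      count++;
      pthread_cond_signal(&not_empty);
      pthread_mutex_unlock(&lock);
  }

  /* Consumer side: background I/O thread, overlapped with computation. */
  void *io_thread(void *arg)
  {
      FILE *f = arg;
      for (;;) {
          pthread_mutex_lock(&lock);
          while (count == 0)
              pthread_cond_wait(&not_empty, &lock);
          void *buf = queue[head];
          head = (head + 1) % QSIZE;
          count--;
          pthread_cond_signal(&not_full);
          pthread_mutex_unlock(&lock);

          if (buf == NULL)                      /* termination buffer seen */
              return NULL;
          fwrite(buf, 1, BUFSZ, f);             /* actual write to the file */
          free(buf);
      }
  }

On file close, the main thread would call enqueue(NULL) and then join the I/O thread, mirroring the special termination buffer described on the next slides.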

16
Execution Timeline
17
Other issues
  • Background thread initiated during first
    collective I/O
  • Interesting termination scheme
  • During MPI_FILE_CLOSE, a special buffer with a special tag is appended to the buffer queue. On seeing it, the background thread terminates.

18
Compiler-directed collective I/O (Kandemir 2001)
  • Under what circumstances is collective I/O useful? Should we use level-3 access all the time?
  • Compiler analysis of data access and storage
    access patterns
  • Selective insertion of MPI collective I/O or
    independent I/O calls

19
Example
  • Conforming and non-conforming access patterns

20
Bibliography
  • Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317-327, Atlanta, GA, October 2000. USENIX Association.
  • Jose Aguilar. A graph theoretical model for scheduling simultaneous I/O operations on parallel and distributed environments. Parallel Processing Letters, 12(1):113-126, March 2002.
  • Rajesh Bordawekar. Implementation of collective I/O in the Intel Paragon parallel file system: Initial experiences. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 20-27. ACM Press, July 1997.
  • Peter Brezany, Marianne Winslett, Denis A. Nicole, and Toni Cortes. Parallel I/O and storage technology. In Proceedings of the Seventh International Euro-Par Conference, volume 2150 of Lecture Notes in Computer Science, pages 887-888, Manchester, UK, August 2001. Springer-Verlag.
  • Bradley Broom, Rob Fowler, and Ken Kennedy. KelpIO: A telescope-ready domain-specific I/O library for irregular block-structured applications. In Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 148-155, Brisbane, Australia, May 2001. IEEE Computer Society Press.

21
Bibliography
  • J. Carretero, F. Pérez, P. de Miguel, F. García, and L. Alonso. I/O data mapping in ParFiSys: support for high-performance I/O in parallel and distributed systems. In Euro-Par '96, volume 1123 of Lecture Notes in Computer Science, pages 522-526. Springer-Verlag, August 1996.
  • Ying Chen, Marianne Winslett, Y. Cho, and S. Kuo. Automatic parallel I/O performance optimization using genetic algorithms. In Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, pages 155-162. IEEE Computer Society Press, July 1998.
  • Ying Chen, Ian Foster, Jarek Nieplocha, and Marianne Winslett. Optimizing collective I/O performance on parallel computers: A multisystem study. In Proceedings of the 11th ACM International Conference on Supercomputing, pages 28-35. ACM Press, July 1997.
  • Avery Ching, Alok Choudhary, Kenin Coloma, Wei-keng Liao, Robert Ross, and William Gropp. Noncontiguous I/O accesses through MPI-IO. In Proceedings of the Third IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 104-111, Tokyo, Japan, May 2003. IEEE Computer Society Press.
  • Phillip M. Dickens and Rajeev Thakur. Evaluation of collective I/O implementations on parallel architectures. Journal of Parallel and Distributed Computing, 61(8):1052-1076, August 2001.

22
Bibliography
  • Félix García-Carballeira, Alejandro Calderon, Jesus Carretero, Javier Fernandez, and Jose M. Perez. The design of the Expand parallel file system. The International Journal of High Performance Computing Applications, 17(1):21-38, 2003.
  • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 96-108, Bolton Landing, NY, October 2003. ACM Press.
  • James V. Huber, Jr., Christopher L. Elford, Daniel A. Reed, Andrew A. Chien, and David S. Blumenthal. PPFS: A high performance portable parallel file system. In Hai Jin, Toni Cortes, and Rajkumar Buyya, editors, High Performance Mass Storage and Parallel I/O: Technologies and Applications, chapter 22, pages 330-343. IEEE Computer Society Press and Wiley, New York, NY, 2001.
  • Meenakshi A. Kandaswamy, Mahmut Kandemir, Alok Choudhary, and David Bernholdt. An experimental evaluation of I/O optimizations on different applications. IEEE Transactions on Parallel and Distributed Systems, 13(7):728-744, July 2002.
  • Mahmut Kandemir. Compiler-directed collective I/O. IEEE Transactions on Parallel and Distributed Systems, 12(12):1318-1331, December 2001.

23
Bibliography
  • Xiaosong Ma, Marianne Winslett, Jonghyun Lee, and Shengke Yu. Improving MPI-IO output performance with active buffering plus threads. In Proceedings of the International Parallel and Distributed Processing Symposium. IEEE Computer Society Press, April 2003.
  • Tara M. Madhyastha and Daniel A. Reed. Learning to classify parallel input/output access patterns. IEEE Transactions on Parallel and Distributed Systems, 13(8):802-813, August 2002.
  • Ethan L. Miller and Randy H. Katz. RAMA: An easy-to-use, high-performance parallel file system. Parallel Computing, 23(4-5):419-446, June 1997.
  • Bill Nitzberg and Virginia Lo. Collective buffering: Improving parallel I/O performance. In Proceedings of the Sixth IEEE International Symposium on High Performance Distributed Computing, pages 148-157, Portland, OR, August 1997. IEEE Computer Society Press. See also the later version of this paper.
  • Huseyin Simitci and Daniel Reed. A comparison of logical and physical parallel I/O patterns. The International Journal of High Performance Computing Applications, 12(3):364-380, Fall 1998.

24
Bibliography
  • Domenico Talia and Pradip K. Srimani. Parallel data-intensive algorithms and applications. Parallel Computing, 28(5):669-671, May 2002.
  • Len Wisniewski, Brad Smisloff, and Nils Nieuwejaar. Sun MPI I/O: Efficient I/O for parallel applications. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.
  • K. K. Lee, M. Kallahalla, B. S. Lee, and P. J. Varman. Performance comparison of prefetching and placement policies for parallel I/O. International Journal of Parallel and Distributed Systems and Networks, 5(2):76-84, 2002.
  • M. Kallahalla and P. J. Varman. PC-OPT: Optimal offline prefetching and caching for parallel I/O systems. IEEE Transactions on Computers, 51(11):1333-1344, November 2002.

25
  • JUNK !

26
SCF 1.1: effect of efficient interface and prefetching
27
SCF 3.0: effect of balanced I/O
28
FFT: effect of layout optimization
29
BTIO: effect of collective I/O
30
AST: effect of collective I/O
31
Collective Buffering (Nitzberg et al., HPDC 97)
  • A mapping problem between the memory layout and the physical layout
  • There is a mismatch between memory and file in the data distribution, the individual units of data access, and the order of accesses

32
Canonical File
  • The canonical file, i.e., the sequence of file bytes, is usually distributed in a cyclic manner across I/O nodes in parallel file systems (see the sketch below)
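A hedged sketch of such a cyclic placement (the stripe size and I/O node count are illustrative, not those of any particular file system):

  /* Round-robin placement of canonical-file stripes across I/O nodes. */
  #define STRIPE_BYTES (64 * 1024)      /* illustrative stripe size    */
  #define NUM_IO_NODES 4                /* illustrative I/O node count */

  /* Returns which I/O node holds the byte at the given canonical offset. */
  int io_node_of(long file_offset)
  {
      return (int)((file_offset / STRIPE_BYTES) % NUM_IO_NODES);
  }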

33
Collective buffering techniques
  • Trades extra network traffic for reduced disk latencies
  • Collective buffering performance depends on:
  •   - the intermediate data distribution
  •   - the efficiency of the permutations
  •   - the number of nodes used for the permutation
  •   - the buffer sizes used on the nodes

34
Other issues
  • Background thread initiated during the first collective I/O
  • Interesting termination scheme
  • During MPI_FILE_CLOSE, a special buffer with a special tag is appended to the buffer queue. On seeing it, the background thread terminates.
  • ABT is implemented within ROMIO
  • Its I/O cost is lower than that of ROMIO
  • During reads, the write buffers are checked and flushed to disk

35
Results
36
Compiler analysis
  • How do the program components access the data?
  • If the components access the data in its storage pattern, store it in that fashion and use independent parallel I/O
  • If not, find the majority access pattern and use it as the storage pattern
  • For the components that do not adhere to this access pattern, use collective I/O
  • For the others, use independent parallel I/O (see the sketch below)
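A hedged sketch of this selection rule follows (the pattern enumeration and data structures are made up for illustration; the compiler's actual analysis works on loop nests and the weighted communication graph described on the next slide):

  /* Pick the access pattern used by the most code blocks as the storage
     pattern; blocks that conform use independent parallel I/O, the rest
     use collective I/O. */
  enum pattern { ROW_MAJOR, COL_MAJOR, BLOCKED, NPATTERNS };

  enum pattern choose_storage_pattern(const enum pattern *block_pattern, int nblocks)
  {
      int votes[NPATTERNS] = {0};
      for (int i = 0; i < nblocks; i++)
          votes[block_pattern[i]]++;

      int best = ROW_MAJOR;                     /* majority access pattern  */
      for (int p = 1; p < NPATTERNS; p++)
          if (votes[p] > votes[best])
              best = p;
      return (enum pattern)best;                /* becomes the storage pattern */
  }

  int needs_collective_io(enum pattern access, enum pattern storage)
  {
      return access != storage;   /* non-conforming blocks get collective I/O */
  }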

37
Compiler analysis
  • Weighted communication graph (WCG)
  • A node represents a code block (within a block, data is kept in memory)
  • Between code blocks, data is stored to disk
  • There is an edge between nodes 1 and 2 iff a data set produced in node 1 is consumed in node 2
  • The weight on the edge represents the number of such transitions
  • Producers and consumers are also defined for each data set
  • Determine the access patterns in the consumers and from them determine the storage pattern in the producer (see the sketch below)
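A hedged sketch of the graph itself (the representation is illustrative; the paper builds the WCG during compiler analysis):

  /* Weighted communication graph over code blocks: wcg[i][j] counts how
     many times a data set produced in block i is later consumed in block j
     (the edge weight, i.e. the number of transitions). */
  #define MAX_BLOCKS 16

  static int wcg[MAX_BLOCKS][MAX_BLOCKS];

  void record_transition(int producer_block, int consumer_block)
  {
      wcg[producer_block][consumer_block]++;
  }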

38
Strategy
  • Determine access patterns
  • Determine storage patterns
  • Decide on I/O strategy
  • Rewrite the code using appropriate MPI I/O calls

39
Steps
  • Access pattern detection by loop-index analysis; the representative access pattern is the one with the most references made through it
  • The storage layout detection algorithm uses Producer-Consumer Subgraphs (PCSs) of the WCG
  • I/O insertion

40
Results
  • Version 1: independent parallel I/O for all
  • Version 2: collective I/O for all
  • Version 3: selective collective I/O

41
Experimental Evaluation of I/O Optimizations (Kandaswamy et al., 2002)
  • Five different I/O applications considered
  • Five software optimizations studied: collective I/O, prefetching, file layout, efficient I/O interface, and balanced I/O
  • Experiments carried out with different numbers of I/O nodes on the Intel Paragon and IBM SP2

42
Summary
43
Guidelines