Used on all runs:
•-a: API for I/O
•-b 16m: block size used
•-r: reads file
•-w: writes file
•-i 5: number of repetitions
•-t: transfer size
•256k and 16M
Used on some runs:
•-c: collective I/O (Used by
•-V: use MPI_File_set_view
•IOR_HINT__MPI__romio_ds_read
•IOR_HINT__MPI__romio_ds_write
•MPICH_ROMIO_NO_RECORD_LOCKING=1
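As an illustration of how the options listed here combine into one IOR invocation, the following is a minimal sketch that assembles the argument list for a single run. The function name `ior_args` is hypothetical, and the API spellings `MPIIO` and `HDF5` are the values IOR accepts for `-a`; the job launcher and the path to the IOR binary are deliberately left out because the poster does not state them.

```python
def ior_args(api, transfer_size, collective=False, fileview=False):
    """Build the IOR argument list for one run (hypothetical helper).

    Flags used on all runs: -a, -b 16m, -t, -r, -w, -i 5.
    Flags used on some runs: -c (collective I/O), -V (MPI_File_set_view).
    """
    if api not in ("MPIIO", "HDF5"):
        raise ValueError("api must be 'MPIIO' or 'HDF5'")
    if transfer_size not in ("256k", "16m"):
        raise ValueError("transfer_size must be '256k' or '16m'")

    args = ["-a", api, "-b", "16m", "-t", transfer_size, "-r", "-w", "-i", "5"]
    if collective:
        args.append("-c")
    if fileview:
        args.append("-V")
    return args


# Example: MPI-IO, small transfers, collective I/O with a file view.
print(" ".join(ior_args("MPIIO", "256k", collective=True, fileview=True)))
# -a MPIIO -b 16m -t 256k -r -w -i 5 -c -V
```

The returned list can be appended to whatever launcher command the site uses.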
Comparison of MPI-IO and HDF5 Parallel I/O on the Cray XT3/XT4
Joylika Yvette Adams
Fisk University
Research Alliance in Math and Science
Computer Science and Mathematics Division
Mentor: Dr. Mark Fahey
http://www.csm.ornl.gov/Internships/rams_07/poster/Joylika_Poster.pdf
Special thanks go to my faculty advisor, Dr. Stephen Egarievwe of Fisk University, for helping me receive this internship. Many thanks also go to my mentor, Dr. Mark Fahey, for all of his help, especially when I had no clue what he was talking about. Finally, special thanks go to Debbie McCoy, who made this research experience possible and also exciting.
Conclusions
•MPI-IO and HDF5 can provide good I/O bandwidth on XT platforms with Lustre
•MPI-IO and HDF5 performance can be poor when using collectives and fileview, respectively
•But performance can be much better when lesser-known tricks are used
•MPI-IO fileview needs hints set
•IOR_HINT__MPI__romio_ds_read
•IOR_HINT__MPI__romio_ds_write
•Each can be set to enable, automatic, or disable
•HDF5 collective needs environment variable set
•MPICH_ROMIO_NO_RECORD_LOCKING=1
•For the tests with a small number of clients, the HDF5 rates were almost as fast as the MPI-IO rates
•For the tests with a large number of clients, the MPI-IO rates were twice as fast as the HDF5 rates
•Larger transfer sizes only help collective I/O
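The hint and environment-variable settings named in the conclusions can be collected in one place before a run. The sketch below is only an illustration: the helper name `hint_env` is hypothetical, the variable names and the three allowed values come from the list above, and the poster does not say which value performed best, so the values in the example are arbitrary.

```python
VALID_DS_VALUES = ("enable", "automatic", "disable")


def hint_env(ds_read=None, ds_write=None, no_record_locking=False):
    """Return a dict of environment variables for one run (hypothetical helper).

    ds_read / ds_write: value for the romio_ds_read / romio_ds_write hint,
    or None to leave that hint unset.
    no_record_locking: if True, set MPICH_ROMIO_NO_RECORD_LOCKING=1.
    """
    env = {}
    for hint, value in (("romio_ds_read", ds_read), ("romio_ds_write", ds_write)):
        if value is None:
            continue
        if value not in VALID_DS_VALUES:
            raise ValueError(f"{hint} must be one of {VALID_DS_VALUES}")
        env["IOR_HINT__MPI__" + hint] = value
    if no_record_locking:
        env["MPICH_ROMIO_NO_RECORD_LOCKING"] = "1"
    return env


# Example values only; the poster does not state which setting was used.
for name, value in sorted(hint_env("disable", "disable", no_record_locking=True).items()):
    print(f"{name}={value}")
```

The resulting dictionary could be merged into `os.environ` or passed as the `env` argument of `subprocess.run` when launching the benchmark.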
Abstract
Future Research
•Run more tests to fill in the missing data (HDF5 with hints using the larger number of clients)
•Run netcdf tests to compare with HDF5 and MPI-IO
The Research Alliance in Math and Science program is sponsored by the Office of Advanced Scientific Computing Research, U.S. Department of Energy. The work was performed as part of a joint project funded by the Office of Naval Research Discovery and Innovation Program at the Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725. This work has been authored by a contractor of the U.S. Government; accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
Background
•Jaguar
•11,706 dual-core processors
•Peak performance of 119 TFlops
•46 TB of memory
• Lustre file system
•600 TB of scratch disk space
•Lustre
•3 separate file systems (two 150 TB and one 300 TB)
•Previously measured read and write bandwidth of 43 and 26 GB/s, respectively
Results
The National Center for Computational Sciences (NCCS) was founded in 1992 to provide world-class, high-performance scientific resources to scientists for the purpose of advancing science and technology research. To accomplish these goals, researchers and scientists need to determine how to run their applications most effectively on supercomputers with tens of thousands of processing cores. One of the most daunting challenges is how to efficiently write application data to disk, which can be on the order of hundreds of terabytes for a simulation. The tests were performed on Jaguar (Cray XT3/XT4) with a Lustre file system. Lustre is a parallel, object-based file system designed to provide large, high-bandwidth storage on large clustered computers. The project continues previous work by evaluating and cataloging the performance of various I/O methodologies and libraries. To evaluate various parallel I/O methodologies, the IOR (Interleaved Or Random) code was used to perform parallel reads and writes to and from a file system using the MPI-IO and HDF5 interfaces. Only these two interfaces were tested, primarily to replicate the I/O done by typical users of the NCCS Cray XT4. With these interfaces, performance results were obtained using IOR with a constant buffer size per client while increasing from 2 to 1024 processes. The collective and fileview options were also tested. Previous test results indicated that good parallel I/O performance could be obtained with MPI-IO, but HDF5 performance had yet to be studied. These new tests show that parallel HDF5 rates are nearly as good as the MPI-IO rates for relatively small tests and within a factor of 2 for large processor counts. It was also discovered that the I/O performance could be very poor for both the MPI-IO and the HDF5 interfaces within IOR when using the collective and fileview options. However, using “hints” can result in collective and fileview tests performing very well. These findings will be documented and made available to NCCS users.
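To make the scaling of these tests concrete, the short sketch below works out how much data is moved as the client count grows. It rests on assumptions not stated in the abstract: that the constant per-client buffer is the 16m block size, that each client writes one block per repetition, and that `k` and `m` denote binary multiples (KiB and MiB). The function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
BLOCK_BYTES = 16 * MIB  # assumed per-client buffer (the 16m block size)


def aggregate_bytes(clients, block_bytes=BLOCK_BYTES):
    """Total bytes written per repetition if every client writes one block."""
    return clients * block_bytes


def transfers_per_block(transfer_bytes, block_bytes=BLOCK_BYTES):
    """Number of I/O calls needed to move one block at a given transfer size."""
    if block_bytes % transfer_bytes != 0:
        raise ValueError("block size must be a multiple of the transfer size")
    return block_bytes // transfer_bytes


for clients in (2, 1024):
    print(clients, "clients:", aggregate_bytes(clients) // MIB, "MiB")
# 2 clients: 32 MiB
# 1024 clients: 16384 MiB

print(transfers_per_block(256 * KIB))  # 64 transfers of 256 KiB per block
print(transfers_per_block(16 * MIB))   # 1 transfer of 16 MiB per block
```

Under these assumptions the largest run moves 16 GiB per repetition, and the small transfer size issues 64 times as many I/O calls per block as the large one.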
Methods
•Used IOR, a benchmark for parallel file systems
•Ran multiple instances of IOR with the HDF5 and MPI-IO interfaces
•Reported the maximum in the plots
•The runs were done in non-dedicated time
• Compared I/O libraries
•With and without fileview
•With and without collective
•With small and large transfer sizes
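The comparison described in this list can be viewed as a small test matrix. The sketch below enumerates it as a full cross product, which is an assumption: the poster does not state that every combination was actually run, and the variable names are hypothetical.

```python
from itertools import product

interfaces = ("MPI-IO", "HDF5")
fileview_options = (False, True)
collective_options = (False, True)
transfer_sizes = ("256k", "16M")

# Full cross product of the compared options (assumed, not confirmed by the poster).
matrix = list(product(interfaces, fileview_options, collective_options, transfer_sizes))

for interface, fileview, collective, transfer in matrix:
    label = (
        f"{interface}, "
        f"{'with' if fileview else 'without'} fileview, "
        f"{'with' if collective else 'without'} collective, "
        f"transfer size {transfer}"
    )
    print(label)

print(len(matrix))  # 16 configurations
```

Each configuration would then be repeated across the process counts of interest.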
Motivation
•Researchers have very large I/O requirements [Kothe2007]
•Some require writing out O(100) GB of data every hour
•A year from now, some will need to write out O(10) TB of data every hour
•Ensure that users make efficient use of the Lustre file system
•I/O portions of their codes should not dominate users' runtime
•Produce scientific results
•Continue previous work by further studying performance of various I/O libraries
•Identify the best practices
•Compare different I/O libraries
•Compare Interfaces
MPI-IO
HDF5
Simulation requirements on a 1-PF LC system with 200 TB of memory [Kothe2007]
[Kothe2007] "Computational Science Requirements for Leadership Computing," D. Kothe et al., in preparation, Leadership Computing Facility, Oak Ridge National Laboratory, 2007.
Options Used While Collecting Data