Title: Testing Network Attached Storage: Presenting the Fermi Disk Test Suite and Some Preliminary Results
1 Testing Network Attached Storage: Presenting the Fermi Disk Test Suite and Some Preliminary Results
- C. Brew, L. Giacchetti, J. Kaiser, H. Wenzel, R. Pasetes - Fermilab
- http://www-oss.fnal.gov/projects/disksuite/
2 Dealing With Demo Systems
3 Aim
- To test the overall performance of some of the Network Attached Storage systems becoming available
- To develop a suite of tests and a procedure that can be used for objective comparisons between these different NAS devices, independently of the technology being used
4 How Would We Use The Storage?
- Interactive Hot Data Store
- Batch Farm Shared Disk
- Home Area Server
5 Interactive Hot Data Store
- Shared disk on an interactive analysis farm where users/admins can put heavily used data, to save copying it from mass storage
- Requires
  - Support for large numbers of clients
  - Moderate throughput on writes
  - High throughput on reads
6 Batch Farm Shared Disk
- Shared disk for a batch farm, used for storage of executables and configuration files, and as a work disk where data sets copied out of mass storage are worked on
- Requires
  - Support for very large numbers of clients
  - Ability to handle large numbers of simultaneous reads and writes
  - High throughput on reads and writes
7 Home Area Server
- Need to look at future technologies for user home areas
- Requires
  - Support for access by all FERMI OSs
  - Global namespace
  - I/O ops rate more important than overall throughput
  - Scalable
  - Ability to back up/snapshot important areas
  - ACL and Kerberos support desirable
8 Fermilab Disk Test Suite
- Set of scripts and binaries for running single-client and cluster disk performance tests
- Technology agnostic: can test anything that presents as a file system to the clients
- Supports multiple clients and multiple processes per client
- Standard tools: Bash scripts, Perl scripts, IOZone, Bonnie and some simple C programs
- Linux, Solaris and Irix
- Configuration via a simple text file
- http://www-oss.fnal.gov/projects/disksuite/fdts.tgz
9 Tests
- Performance Tests
  - Max throughput, read and write
  - Max throughput reading a single file
  - Simultaneous reads and writes
  - Creation, listing and deletion of large numbers of small files
  - Data integrity
- Manageability Tests
  - Ease of setup
  - Ease of reconfiguration
- Failure Tests
  - Fail various parts of the system and see what happens
10 FDTS Components
- Benchmark Binaries
  - IOZone, Bonnie and Reader/Writer
- Test Scripts
  - ops_cluster.sh, tput_cluster.sh and single_node.sh
- Data Processing Scripts
  - proc_data_ops.pl, proc_data_tpt.pl and proc_data_single.pl
- Internal Control Scripts
  - rfork, lfork, rw_control.pl and parse_config.sh
- Configuration File
  - disksuite.conf
11 Benchmark Binaries
- FDTS uses four benchmark binaries
- Each should be located in the bin directory, named exe_name.uname
- Reader/Writer
  - Used for the throughput measurements
  - Simple C programs written at Fermilab
  - Use the C native read and write functions
- Bonnie
  - Used for the operations measurements
- IOZone
  - Used for the optional data integrity test
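As an illustration of the exe_name.uname naming convention, a populated bin directory might look like the listing below; the base names and platform suffixes are assumptions for illustration (the suffix is whatever uname reports on each supported platform), not taken verbatim from the FDTS distribution.

  $ ls bin/
  bonnie.Linux   bonnie.SunOS   bonnie.IRIX64
  iozone.Linux   iozone.SunOS   iozone.IRIX64
  reader.Linux   reader.SunOS   reader.IRIX64
  writer.Linux   writer.SunOS   writer.IRIX64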
12 Test Scripts
- ops_cluster.sh and tput_cluster.sh
  - Run the cluster operations and throughput tests respectively
- single_node.sh
  - Runs the throughput and operations tests on a single node
- Usage
  - ops_cluster.sh config_file key
  - tput_cluster.sh config_file key
  - single_node.sh config_file key
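For example, assuming a configuration file named disksuite.conf and an arbitrary label "trial1" as the key (both names are placeholders, not part of the suite):

  # Cluster throughput test, results labelled "trial1"
  ./tput_cluster.sh disksuite.conf trial1
  # Cluster operations test with the same configuration
  ./ops_cluster.sh disksuite.conf trial1
  # Throughput and operations tests on this node only
  ./single_node.sh disksuite.conf trial1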
13 Data Processing Scripts
- Take the raw data produced by the testing scripts, process it into numbers, and output the results in comma-separated value (CSV) format
- Usage
  - proc_data_ops.pl --key test_key --datadir results_dir --out output_file --debug
  - proc_data_tpt.pl --key test_key --datadir results_dir --out output_file --debug
  - proc_data_single.pl --key test_key --host hostname --datadir results_dir --out output_file
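Continuing the placeholder example above, the throughput results for the key "trial1" could then be reduced to a CSV file with something like the following (the results directory and output file name are again placeholders):

  ./proc_data_tpt.pl --key trial1 --datadir ./results --out trial1_tput.csv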
14 Internal Control Scripts
- rfork
  - Runs multiple copies of a command on remote nodes
- lfork
  - Runs multiple copies of a command on the local node
- rw_control.pl
  - Wrapper script for reader and writer
- parse_config.sh
  - Contains the subroutine to parse the config file; called by the testing scripts
- See http://www-oss.fnal.gov/projects/disksuite/control.html for a full write-up
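To make the role of the fork helpers concrete, the idea behind lfork (start N copies of a command on the local node and wait for all of them) can be sketched in a few lines of Bash. This is only an illustration of the concept, not the actual FDTS script; the real interfaces are documented at the control.html page above.

  #!/bin/bash
  # Sketch only: run N copies of a command in parallel on the local node.
  # Usage: ./lfork_sketch.sh N command [args...]
  ncopies=$1; shift
  for i in $(seq 1 "$ncopies"); do
      "$@" &          # launch one copy in the background
  done
  wait                # block until every copy has finished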
15 Configuration File (1)
- NODEFILE: File containing a list of nodes for the cluster tests, one node per line
- HOSTBASE: Host name prefix for the cluster tests. Ignored if NODEFILE is set
- STARTNODE: Host suffix to start at for the cluster tests. Ignored if NODEFILE is set
- TEST_RUNS: Space-separated list of the numbers of nodes to use for each cluster test
- TEST_THREADS: Space-separated list of the numbers of processes to run on each node in the throughput tests
- WORK_DIR: Working directory; must be a full path and visible from all clients
- RESULTS_DIR: Results directory; only needs to be visible from the node from which the tests are started
- KERBEROS: If "YES", switch on Kerberos ticket and AFS token renewal
- DEFUNCT: Comma-separated list of node suffixes to skip in the cluster tests. Ignored if NODEFILE is set
- FILESIZE: File size for the throughput tests
16 Configuration File (2)
- BLOCKSIZE: Block sizes for the throughput tests. Single-node tests will take a space-separated list; cluster tests will use just the first entry
- DF_TEST: If "YES", monitor the output of df and calculate write throughput from it
- DF_KEY: String identifying the line in the output of df that contains the working directory
- OPS_FILES: Number of thousands of files to create per node during the operations tests
- OPS_MIN: Minimum size of the files created during the operations tests (in bytes)
- OPS_MAX: Maximum size of the files created during the operations tests (in bytes)
- OPS_DIRS: Number of directories to create the files in during the operations tests
- DATA_INTEG: If "YES", perform the IOZone data integrity test on all the nodes after the operations test has finished
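Putting the two slides together, a disksuite.conf for a small cluster run might look like the sketch below. The key names come from the slides above; the values, the comments and the assumption that the file is simple KEY=value text (it is read by the Bash helper parse_config.sh) are illustrative only.

  # disksuite.conf - illustrative values only
  NODEFILE=/home/tester/nodes.txt      # one client host per line
  TEST_RUNS="1 5 10 25 50"             # cluster sizes to test
  TEST_THREADS="1 2"                   # processes per node for the throughput tests
  WORK_DIR=/mnt/nas/fdts_work          # must be visible from all clients
  RESULTS_DIR=/home/tester/results     # only needed on the launching node
  KERBEROS=NO
  FILESIZE=1024                        # throughput test file size
  BLOCKSIZE="64 256 1024"              # single-node tests use the whole list
  DF_TEST=YES
  DF_KEY=/mnt/nas                      # identifies the df line for WORK_DIR
  OPS_FILES=10                         # thousands of files per node
  OPS_MIN=1024                         # smallest file created (bytes)
  OPS_MAX=16384                        # largest file created (bytes)
  OPS_DIRS=10
  DATA_INTEG=NO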
17 How to Measure Total Throughput?
- Uses simple C programs to measure read/write speeds
- We are testing black boxes, so we cannot run a process to monitor throughput on the server; we have to calculate it from the client side
- Each client or process writes/reads five 1 GB files in succession and calculates the throughput individually for each file
- We calculate two measures of throughput
  - Sustained throughput
  - Overall throughput
18 Measures of Throughput
- Sustained Throughput
  - The end time (T1) is defined as the moment the first client finishes its fifth file
  - The stats of every file finished before the end time are included
  - Stats are averaged on each node, then summed across all nodes
- Overall Throughput
  - (Total data written/read) / (total time taken, T2)
  - If a node has not completed any files, the stats from its first file are included
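- Worked example (invented numbers, just to make the two definitions concrete): if 20 clients each average 25 MB/s over the files they complete before T1, the sustained throughput is 500 MB/s; if the whole run writes 100 GB (20 clients x 5 files x 1 GB) and the last client finishes at T2 = 250 s, the overall throughput is 100 GB / 250 s = 400 MB/s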
19 Operations Tests
- Storage companies always quote I/O ops per second or NFS ops per second, but what does that really mean? We still don't know!
- We use Bonnie to measure
  - File creates per second
  - File stats per second
  - File deletes per second
  - for both sequentially and randomly chosen files
- Data Integrity
  - Uses two processes on all available clients to run IOZone in verify mode, which writes a pattern into the test file and checks it during the read
20 Some NAS Systems We Tested
21 The Test Farm
64 dual AMD 1.9 nodes in a 1U form factor. The machines are connected via Fast Ethernet to a Cisco 6509 switch.
22 Results (Throughput)
- Zambeel, Spinnaker, Linux File Server
- Linux Server did not drop clients
Zambeel had problems doing server-side caching
23 Results (Operations)
- Zambeel, Spinnaker, Linux File Server
- At 50 clients, Zambeel timed out on an average of 7 of them
24 In addition, we also developed a test suite for storage systems which don't access files through a mounted file system. Usually these systems stage files in and out using get and put commands. In some cases the data can be accessed directly from within an application via POSIX-compliant function calls (e.g. TDCacheFile). The processes are synchronized using the FBSNG batch system.
25 dfarm is a product developed at Fermilab which utilizes the data disks on the farm nodes.
- The name space is organized as a virtual file name space
  - Virtual path /E123/data/file.5: this is what the user knows
  - Physical path fnpc221/local/stage2/XYZ123: this is what the disk farm knows, so that the user does not have to
- The user operates in a familiar UNIX-like file name space using familiar commands
- The solution to the node unreliability problem is to replicate data
  - Make 2, 3 or 4 copies of the file on different nodes
  - Data that is easy to reproduce or has a short life: 1 copy
  - Data that is precious: 2, 5 or 10 copies
  - Disk Farm replicates data off-line
- Remote access via GridFTP
- Load sharing and control
  - Each node has a limit on the number of simultaneous reads/writes
  - Load is evenly distributed and optimized
26 dfarm was installed on the test farm described on the earlier slides. For this test we used about 50 dual AMD 1.9 nodes with 80 GB of local data disk. In this test the nodes serve as servers and clients at the same time.
27 [Diagram: CD/ISD storage layout, showing file transfer between ENSTORE (Hierarchical Storage Manager) and the disk cache, with Production, Personal Analysis and CMS-specific pools; random access vs. sequential access]
28 What Do We Expect from dCache?
- Making a multi-terabyte server farm look like one coherent and homogeneous storage system
- Rate adaptation between the application and the tertiary storage resources
- Optimized usage of tape robot systems and drives by coordinated read and write requests
- No explicit staging is necessary to access the data (but pre-staging is possible)
- The data access method is the same regardless of where the data resides
- High-performance and fault-tolerant transport protocol between applications and data servers
- Fault tolerant: no specialized servers which can cause severe downtime when crashing
- Can be accessed directly from an application (e.g. the ROOT TDCacheFile class)
- Can be used as a scalable file store without an HSM
- Remote access via GridFTP/SRM
29 dCache
30 Conclusion
- With these tools we have the basis of a test suite and procedure for comparing the different storage technologies that are becoming available
- We have tested devices from Spinnaker and Zambeel, along with a Linux terabyte file server for comparison. We have also tested dfarm and dCache. In the coming months we hope to test devices from Panasas and DataDirect. We also hope to work with DESY to run these tests on the Exanet system they have been testing.