1
ADIC / CASPUR / CERN / DataDirect / ENEA / IBM / RZ Garching / SGI
New results from CASPUR Storage Lab
  • Marco Mililotti
  • CASPUR Consortium
  • May 2004

2
Participants:
ADIC Software: E. Eastman
CASPUR: A. Maslennikov (*), M. Mililotti, G. Palumbo
CERN: C. Curran, J. Garcia Reyero, M. Gug, A. Horvath, J. Iven, P. Kelemen, G. Lee, I. Makhlyueva, B. Panzer-Steindel, R. Többicke, L. Vidak
DataDirect Networks: L. Thiers
ENEA: G. Bracco, S. Pecoraro
IBM: F. Conti, S. De Santis, S. Fini
RZ Garching: H. Reuter
SGI: L. Bagnaschi, P. Barbieri, A. Mattioli
(*) Project Coordinator
3
Sponsors for these test sessions:
ACAL Storage Networking: loaned a 16-port Brocade switch
ADIC Software: provided the StorNext file system product, actively participated in tests
DataDirect Networks: loaned an S2A 8000 disk system, actively participated in tests
E4 Computer Engineering: loaned 10 assembled biprocessor nodes
Emulex Corporation: loaned 16 Fibre Channel HBAs
IBM: loaned a FAStT900 disk system and the SANFS product complete with 2 MDS units, actively participated in tests
Infortrend-Europe: sold 4 EonStor disk systems at a discount price
INTEL: donated 10 motherboards and 20 CPUs
SGI: loaned the CXFS product
StorCase: loaned an InfoStation disk system
4
Contents
  • Goals
  • Components under test
  • Measurements
  • - SATA/FC systems
  • - SAN File Systems
  • - AFS Speedup
  • - Lustre (preliminary)
  • - LTO2
  • Final remarks

5
Goals for these test series
  • Performance of low-cost SATA/FC disk systems
  • Performance of SAN File Systems
  • AFS Speedup options
  • Lustre
  • Performance of LTO-2 tape drive

6
Components
Disk systems:
4x Infortrend EonStor A16F-G1A2: 16-bay SATA-to-FC arrays, Maxtor Maxline Plus II 250 GB SATA disks (7200 rpm), dual Fibre Channel outlet at 2 Gbit, 1 GB cache
2x IBM FAStT900: dual-controller arrays with SATA expansion units, 4x EXP100 expansion units with 14 Maxtor SATA disks of the same type, dual Fibre Channel outlet at 2 Gbit, 1 GB cache
1x StorCase InfoStation: 12-bay array, same Maxtor SATA disks, dual Fibre Channel outlet at 2 Gbit, 256 MB cache
1x DataDirect S2A 8000 system: 2 controllers with 74 FC disks of 146 GB, 8 Fibre Channel outlets at 2 Gbit, 2.56 GB cache
7
Infortrend EonStor A16F-G1A2
- Two 2 Gbps Fibre host channels
- RAID levels supported: 0, 1 (0+1), 3, 5, 10, 30, 50, NRAID and JBOD
- Multiple arrays configurable with dedicated or global hot spares
- Automatic background rebuild
- Configurable stripe size and write policy per array
- Up to 1024 LUNs supported
- 3.5", 1"-high, 1.5 Gbps SATA disk drives
- Variable stripe size per logical drive
- Up to 64 TB per logical drive
- Up to 1 GB SDRAM cache
8
FAStT900 Storage Server
- 2 Gbps SFP host interfaces
- Expansion units: EXP700 (FC) / EXP100 (SATA)
- Four SAN (FC-SW) or eight direct (FC-AL) host connections
- Four (redundant) 2 Gbps drive channels
- Capacity: min 250 GB, max 56 TB (14 disks x EXP100 SATA); min 32 GB, max 32 TB (14 disks x EXP700 FC)
- Dual-active controllers
- 2 GB cache
- RAID support: 0, 1, 3, 5, 10
(Pictures: FAStT900 controller and EXP100 expansion unit)
9
STORCase Fibre-to-SATA
- SATA and Ultra ATA/133 drive interface
- 12 hot-swappable drives
- Switched or FC-AL host connections
- RAID levels 0, 1, 0+1, 3, 5, 30, 50 and JBOD
- Dual 2 Gbps Fibre host ports
- Supports up to 8 arrays and 128 LUNs
- Up to 1 GB PC200 DDR cache memory
10
DataDirect S²A8000
- Single 2U S2A8000 with four 2 Gb/s ports, or dual 4U with eight 2 Gb/s ports
- Up to 1120 disk drives, 8192 LUNs supported
- 5 TB to 130 TB with FC disks, 20 TB to 250 TB with SATA disks
- Sustained performance well over 1 GB/s (1.6 GB/s theoretical)
- Full Fibre Channel duplex performance on every port
- PowerLUN: 1 GB/s individual LUNs without host-based striping
- Up to 20 GB of cache, LUN-in-cache solid state disk functionality
- Real-time any-to-any virtualization
- Very fast rebuild rate
11
Components
  • High-end Linux units for both servers and clients
  • Dual-processor Pentium 4 Xeon 2.4 GHz, 1 GB RAM
  • QLogic QLA2300 2 Gbit or Emulex LP9xxx Fibre Channel HBAs
  • Network
  • 2x Dell 5224 GigE switches
  • SAN
  • Brocade 3800 switch, 16 ports (test series 1)
  • QLogic SANbox 5200, 32 ports (test series 2)
  • Tapes
  • 2x IBM Ultrium LTO2 (3580-TD2, Rev 36U3)

12
Qlogic SANbox 5200 Stackable Switch
  • 8, 12 or 16 auto-detecting 2 Gb/1 Gb device ports with 4-port incremental upgrade
  • Stacking of up to 4 units for 64 available user ports
  • Interoperable with all FC-SW-2 compliant Fibre Channel switches
  • Full-fabric, public-loop or switch-to-switch connectivity on 2 Gb or 1 Gb front ports
  • "No-Wait" routing: guaranteed maximum performance independent of data traffic
  • Supports traffic between switches, servers and storage at up to 10 Gb/s
  • Low cost: the 5200/16p costs less than half as much as the Brocade 3800/16p
  • May be upgraded in 8-port steps

13
IBM LTO Ultrium 2 Tape Drive Features
- 200 GB native capacity (400 GB compressed)
- 35 MB/s native transfer rate (70 MB/s compressed)
- Native 2 Gb FC interface
- Backward read/write compatibility with Ultrium 1 cartridges
- 64 MB buffer (vs. 32 MB in Ultrium 1)
- 512 tracks (vs. 384 in Ultrium 1)
- Speed matching, channel calibration
- Faster load/unload, data access and rewind times
14
SATA / FC Systems
15
SATA / FC Systems: hw details
Typical array features:
- single or dual (active-active) controller
- up to 1 GB of RAID cache
- battery to keep the cache afloat during power cuts
- 8 to 16 drive slots
- cost: 4-6 KUSD per 12/16-bay unit (Infortrend, StorCase)
Case and backplane directly impact the disks' lifetime:
- protection against inrush currents
- protection against rotational vibration
- orientation (horizontal better than vertical, remark by A. Sansum)
Infortrend EonStor: well engineered (removable controller module, lower vibration, horizontal orientation)
StorCase: special protection against inrush currents (soft-start drive power circuitry), low vibration
16
SATA / FC Systems: hw details
High-capacity ATA/SATA disk drives:
- 250 GB (Maxtor, IBM), 400 GB (Hitachi)
- 7200 RPM
- improved quality: 3-year warranty, 5-year component design lifetime
CASPUR experience with Maxtor drives:
- in 1.5 years we lost 5 drives out of 100, 2 of which due to power cuts
- factory quality of the recent Maxtor Maxline Plus II 250 GB disks: out of 66 disks purchased, 4 were replaced shortly after; the others stand the stress very well
Learned during this meeting:
- the RAL annual failure rate is 21 out of 920 Maxtor Maxline drives (about 2.3%)
17
SATA / FC Systems: test setup
Test setup: 16 dual 2.4 GHz nodes with QLogic 2310F HBAs, 2x QLogic SANbox 5200 FC switches, a Dell 5224 GigE switch, 4x Infortrend A16F-G1A2, 4x IBM FAStT900 and the StorCase InfoStation.
Parameters to select / tune:
- stripe size for RAID-5
- SCSI queue depth on the controller and on the QLogic HBAs
- number of disks per logical drive
In the end we were working with RAID-5 LUNs composed of 8 HDs each, stripe size 128K (and 256K in some tests).
18
SATA / FC tests: kernel and fs details
Kernel settings:
- Kernels: 2.4.20-30.9smp, 2.4.20-20.9.XFS1.3.1smp
- vm.bdflush = 2 500 0 0 500 1000 20 10 0
- vm.max(min)-readahead = 256(127) for large streaming writes, 4(3) for random reads with small block sizes
File systems:
- EXT3 (128K RAID-5 stripe size)
  fs options: -m 0 -j -J size=128 -R stride=32 -T largefile4
  mount options: data=writeback
- XFS 1.3.1 (128K RAID-5 stripe size)
  fs options: -i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k
  mount options: logbsize=262144,logbufs=8
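For reference, these settings map onto shell commands roughly as follows; a minimal sketch for a 2.4-era system, where the device (/dev/sdb1) and mount point (/bigfs) are placeholders rather than the ones used in the tests.

  # Kernel VM tuning (2.4 kernel, values taken from the slide above)
  sysctl -w vm.bdflush="2 500 0 0 500 1000 20 10 0"
  sysctl -w vm.max-readahead=256    # large streaming writes
  sysctl -w vm.min-readahead=127    # (use 4 and 3 instead for small-block random reads)

  # EXT3 on a RAID-5 LUN with 128K stripe size
  mke2fs -j -m 0 -J size=128 -R stride=32 -T largefile4 /dev/sdb1
  mount -t ext3 -o data=writeback /dev/sdb1 /bigfs

  # XFS 1.3.1 on the same kind of LUN
  mkfs.xfs -i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k /dev/sdb1
  mount -t xfs -o logbsize=262144,logbufs=8 /dev/sdb1 /bigfs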
19
SATA / FC tests: benchmarks used
Large serial writes and reads:
- lmdd from the lmbench suite (http://sourceforge.net/projects/lmbench)
  typical invocation: lmdd of=/fs/file bs=1000k count=8000 fsync=1
Random reads:
- Pileup benchmark (Rainer.Toebbicke@cern.ch), designed to emulate the disk activity of multiple data analysis jobs:
  1) a series of 2 GB files is created in the destination directory
  2) these files are then read in a random way, in many threads
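As an illustration, a matching write/read pair with lmdd looks like this; a minimal sketch in which /fs is a placeholder mount point and the read form assumes lmdd's of=internal sink (data is read and discarded).

  # Sequential write: 8000 x 1000 KB blocks (about 8 GB), fsync'ed at the end
  lmdd of=/fs/file bs=1000k count=8000 fsync=1
  # Sequential read-back of the same file, discarding the data
  lmdd if=/fs/file of=internal bs=1000k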
20
SATA / FC results
EXT3 results: filling 1.7 TB with 8 GB files.
IFT systems show anomalous behaviour with the EXT3 file system: performance varies along the file system, and the effect visibly depends on the RAID-5 stripe size.
(Plots: write speed along the file system for 32K, 128K and 256K stripe sizes.)
! The problem was reproduced and understood by Infortrend. New firmware is due in July.
21
SATA / FC results
IBM FAStT and StorCase behave in a more predictable manner with EXT3. Both systems may however lose up to 20% in performance along the file system.
22
SATA / FC results
XFS results: filling 1.7 TB with 8 GB files.
The situation changes radically with this file system. The curves are now almost flat, and everything is much faster compared with EXT3.
(Plots: IBM, StorCase and Infortrend write/read speeds along the file system.)
Infortrend and StorCase show comparable write speeds of about 135-140 MB/sec; IBM is much slower on writes (below 100 MB/sec). Read speeds are visibly higher thanks to the read-ahead function of the controller (the IBM and IFT systems had 1 GB of RAID cache, StorCase only 256 MB).
23
SATA / FC results
Pileup tests: these tests were done only on the IFT and StorCase systems. The results depend to a large extent on the number of threads that access the previously prepared files (beyond a certain number of threads, performance may drop since the test machines may have problems handling many threads at a time). The best result was obtained with the Infortrend array and the XFS file system.
24
SATA / FC results
Operation in degraded mode: we tried it on a single Infortrend LUN of 5 HDs with EXT3. One of the disks was removed, and the rebuild process was started. The write speed went down from 105 to 91 MB/sec; the read speed went down from 105 to 28 MB/sec and even less.
25
SATA / FC results - conclusions
1) The recent low-cost SATA-to-FC disk arrays (Infortrend, StorCase) operate very well and are able to deliver excellent I/O speeds, far exceeding that of Gigabit Ethernet. The cost of such systems may be as low as 2.5 USD per raw GB. The quality of these systems is dominated by the quality of the SATA disks.
2) The choice of local file system is fundamental: XFS easily outperforms EXT3. On one occasion we observed an XFS hang under very heavy load; xfs_repair was run, and the error never reappeared. We are now planning to investigate this in depth. CASPUR AFS and NFS servers are all XFS-based, and there has been only one XFS-related problem since we put XFS in production 1.5 years ago. But probably we were simply lucky.
26
SAN File Systems
27
SAN File Systems
  • SAN FS placement
  • These advanced distributed file systems allow clients to operate directly with block devices (block-level file access). Metadata traffic goes via GigE.
  • Requires a Storage Area Network.
  • Current cost of a single Fibre Channel connection: > 1000 USD
  • Switch port: min 500 USD including GBIC
  • Host bus adapter (HBA): min 800 USD
  • Special discounts for massive purchases are not impossible, but it is very hard to imagine that the cost of connection will drop significantly in the near future.

28
SAN File Systems
  • Where SAN file systems may be used:
  • 1) High-performance computing: fast parallel I/O, faster sequential I/O
  • 2) Hybrid SAN / NAS systems: a relatively small number of SAN clients acting as (also redundant) NAS servers
  • 3) HA clusters with file locking: Mail (shared pool), Web, etc.

29
SAN File Systems
  • So far, we have tried these products:
  • 0) Sistina GFS (see our 2003 report)
  • 1) ADIC StorNext File System
  • 2) IBM SANFS (StorTank)
  • 3) SGI CXFS (work still in progress)

30
SAN File Systems
31
SAN File Systems
Test setup: 16 dual 2.4 GHz nodes with QLogic 2310F HBAs, 2x QLogic SANbox 5200 switches, a Dell 5224 GigE switch, 4x Infortrend A16F-G1A2, 4x IBM FAStT900, an IA32 IBM StorTank MDS and an Origin 200 CXFS MDS.
  • What was measured (StorNext and StorTank):
  • 1) Aggregate write and read speeds on 1, 7 and 14 clients (a sketch of how such aggregate runs can be driven is shown after this list)
  • 2) Aggregate Pileup speed on 1, 7 and 14 clients accessing
  •    A) different sets of files
  •    B) the same set of files
  • During these tests we used 4 LUNs of 13 HDs each, as recommended by IBM
  • For each SAN FS we tried both the IFT and the FAStT disk systems
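A minimal sketch of how such an aggregate run can be driven from a head node, assuming passwordless ssh, hypothetical client host names (node01 ... node07) and a shared /sanfs mount point; the per-client lmdd figures are summed to obtain the aggregate speed.

  #!/bin/sh
  # Start one lmdd writer per client in parallel and wait for all of them
  CLIENTS="node01 node02 node03 node04 node05 node06 node07"
  for c in $CLIENTS; do
      ssh $c "lmdd of=/sanfs/$c.dat bs=1000k count=8000 fsync=1" &
  done
  wait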

32
SAN File Systems
Large sequential files: StorNext and StorTank behave in a similar manner on writes; StorNext does better on reads. The IBM disk systems perform better than IFT on reads with multiple clients.
(Tables: IBM StorTank and ADIC StorNext results; all numbers in MB/sec.)
33
SAN File Systems
Pileup tests: StorTank definitively outperforms StorNext in this type of benchmark. The results are very interesting, as it turns out that peak Pileup speeds with StorTank on a single client may reach GigE speed (in the case of the IFT disks).
(Tables: IBM StorTank and ADIC StorNext results; all numbers in MB/sec.)
! StorTank was unstable on IFT with more than 1 client.
34
SAN File Systems
  • CXFS experience
  • MDS on an SGI Origin 200 with 1 GB of RAM (IRIX 6.5.22), 4 IFT arrays
  • The first numbers were not bad, but with 4 clients or more the system becomes unstable (when all of them are used at the same time, one client hangs).
  • That is what we have observed so far.

We are currently investigating the problem
together with SGI.
35
SAN File Systems
  • StorNext on DataDirect system

Test setup: 2x S2A8000 with 8 FC outlets, 2x Brocade 3800 switches, 16 dual 2.4 GHz nodes with Emulex LP9xxx HBAs, and a Dell 5224 GigE switch.
- The S2A 8000 came with FC disks, although we had asked for SATA
- Quite easy to configure, extremely flexible
- Multiple levels of redundancy, small declared performance degradation on rebuilds
- We ran only large serial write and read 8 GB lmdd tests, using all the available power
36
SAN File Systems: some remarks
- Performance of a SAN file system is quite close to that of the disk hardware it is built upon.
- StorNext is the easiest to configure. It does not require a standalone MDS and works smoothly with all kinds of disk systems, FC switches, etc. We were able to export it via NFS, but with the loss of 50% of the available bandwidth.
- StorTank is probably the most solid implementation of a SAN FS, and it has a lot of useful options. It delivers the best numbers for random reads, and may probably be considered a good candidate for small clusters destined for express data analysis. It may have issues with 3rd-party disks.
- CXFS uses the very performant XFS base and hence should have good potential, although the 2 TB file system size limit on Linux/32bit is a real limitation (the same is true for GFS). Some functions, like MDS fencing, require particular hardware.
- MDS load: small for StorNext and CXFS, very high for StorTank.

37
AFS Speedup
38
AFS speedup options
- AFS performance for large files is quite poor (max 35-40 MB/sec even on very performant hardware). To a large extent this is due to the limitations of the Rx RPC protocol, and to a suboptimal implementation of the file server.
- One possible workaround is to replace the Rx protocol with an alternative one in all cases where it is used for file serving. We evaluated two such experimental implementations:
1) AFS with OSD support (Rainer Toebbicke). Rainer stores AFS data inside Object-based Storage Devices (OSDs), which do not necessarily reside inside the AFS file servers. The OSD performs basic space management and access control and is implemented as a Linux daemon in user space on an EXT2 file system. The AFS file server acts only as an MDS.
2) Reuter's Fast AFS (Hartmut Reuter). In this approach, AFS partitions (/vicepXX) are made visible on the clients through a fast SAN or NAS mechanism. As in case 1), the AFS file server acts as an MDS and directs the clients to the right files inside /vicepXX for faster data access.
39
AFS speedup options
Both methods worked! The AFS/OSD scheme was tested during the Fall 2003 test session; the tests were done with the DataDirect S2A 8000 system. In one particular test we were able to achieve a 425 MB/sec write speed for both the native EXT2 and the AFS/OSD configurations. The Reuter AFS was evaluated during the Spring 2004 session: the StorNext SAN file system was used to distribute a /vicepX partition among several clients. As in the previous case, AFS/Reuter performance was practically equal to the native performance of StorNext for large files. To learn more about the DataDirect system and the Fall 2003 session, please visit http://afs.caspur.it/slab2003b.
40
Lustre!
41
Lustre: preliminary results
- Lustre 1.0.4
- We used 4 Object Storage Targets on 4 Infortrend arrays, no striping
- Very interesting numbers for sequential I/O (8 GB files, MB/sec)
- These numbers may be directly compared with the SAN FS results obtained with the same disk arrays
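For comparison with the "no striping" configuration, striping a directory across the 4 OSTs would be set up roughly as follows; a minimal sketch using present-day lfs syntax (the options available in Lustre 1.0.4 may have differed), with /lustre as a placeholder mount point.

  mkdir -p /lustre/nostripe /lustre/striped
  # Files created here stay on a single OST (stripe count 1, as in these tests)
  lfs setstripe -c 1 /lustre/nostripe
  # Files created here are striped across all 4 OSTs in 1 MB chunks
  lfs setstripe -c 4 -S 1m /lustre/striped
  lfs getstripe /lustre/striped    # show the resulting layout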
42
LTO-2 Tape Drive
43
LTO-2 tape drive
The drive is a factor-2 evolution of its predecessor, LTO-1. According to the specs, it should be able to deliver up to 35 MB/sec native I/O speed and 200 GB of native capacity. We were mainly interested in checking the following (see next page):
- write speed as a function of block size
- time to write a tape mark
- positioning times
The overall judgement is quite positive. The drive fits well for backup applications and is acceptable for staging systems. Its strong point is definitely its relatively low cost (10-11 KUSD), which makes it quite competitive (compare with 30 KUSD for an STK 9940B).
44
LTO-2
Write speed as a function of block size: > 31 MB/sec native for large blocks, very stable.
Tape mark writing is rather slow: 1.4-1.5 sec per tape mark.
Positioning: it may take up to 1.5 minutes to fsf to the needed file (average: 1 minute).
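These measurements can be reproduced with standard tools; a minimal sketch, assuming the drive appears as the non-rewinding device /dev/nst0 and that testdata.bin is a placeholder for pre-generated, poorly compressible data (/dev/zero compresses perfectly and would overstate the native rate).

  # Write speed vs. block size: write 2 GB with a given block size and time it
  time dd if=testdata.bin of=/dev/nst0 bs=256k count=8192
  # Time to write a single tape mark
  time mt -f /dev/nst0 weof 1
  # Positioning: rewind, then skip forward N file marks and time it
  mt -f /dev/nst0 rewind
  time mt -f /dev/nst0 fsf 10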
45
Final remarks
Our immediate plans include:
- further investigation of StorTank, CXFS and yet another SAN file system (Veritas), including NFS export
- further Lustre testing on IFT and IBM hardware (new version 1.2, striping, other benchmarks)
- evaluation of iSCSI-enabled SATA RAID arrays
Feel free to join us at any moment!