Title: Experience with Fabric Storage Area Network and HSM Software at the Tier1 INFN CNAF
1. Experience with Fabric Storage Area Network and HSM Software at the Tier1 INFN CNAF
Pier Paolo Ricci et al., on behalf of the INFN Tier1 staff
pierpaolo.ricci_at_cnaf.infn.it
- ACAT 2007
- April 22-27 2007
- Nikhef Amsterdam
2. Summary
- Overall Tier1 Hardware description
- Castor v.2 HSM Software
- SAN Fabric implementation
- Monitoring and Administration tools
- GPFS and Castor performance test
3. Overall Tier1 Fabric Hardware Description
- Here is what we have in production
- Disk (SAN): 980 TB raw (ATA RAID-5)
  - 9 Infortrend A16F-R1211-M2: 56 TB
  - 1 SUN STK BladeStore: 32 TB
  - 4 IBM FastT900 (DS4500): 200 TB
  - 5 SUN STK: 290 TB
  - 3 DELL EMC: 400 TB
- Tape: 1 PB uncompressed (tapes for 670 TB)
  - 1 SUN STK L5500, partitioned into 2000 LTO-2 slots (200 GB) and 3500 9940B slots (200 GB)
  - 6 LTO-2 drives (20-30 MB/s each)
  - 7 9940B drives (25-30 MB/s each)
4. TIER1 INFN CNAF Storage
[Diagram: overall storage layout]
- HSM (1 PB): CASTOR-2 HSM, Castor services servers and tape servers; STK L5500 robot (5500 slots) with 6 IBM LTO-2 and 7 STK 9940B drives, Fibre Channel attached
- Worker Nodes (LSF batch system): farm of 800 nodes for 1500 kSI2k (3000 kSI2k in the 2nd half of 2007), accessing storage via RFIO, GPFS and Xrootd over the WAN or Tier1 LAN
- 90 diskservers with QLogic FC HBAs (2340 and 2462)
- STK L180 with 180 LTO-1 tapes (18 TB native); W2003 server with LEGATO Networker (backup)
- SAN: 980 TB raw (minus 15-25% for net space, > 770 TB net)
  - 1 SUN STK BladeStore: 24000 GB, 250 GB SATA blades, 4 x 2 Gb FC interfaces (32 TB raw)
  - 5 SUN STK FLX680: 5 x 46000 GB, 500 GB SATA blades, 4 x 2 Gb FC interfaces each (290 TB raw)
  - 4 Infortrend A16F-R1A2-M1: 4 x 3200 GB SATA, 2 x 2 Gb FC interfaces each
  - 5 Infortrend A16F-R1211-M2 JBOD: 5 x 6400 GB SATA, 2 x 2 Gb FC interfaces each (56 TB raw)
  - 4 IBM FastT900 (DS4500): 4 x 43000 GB SATA, 4 x 2 Gb FC interfaces each (200 TB raw)
  - 3 EMC CX380: 3 x 114000 GB FATA, 8 x 4 Gb FC interfaces each (400 TB raw)
5. Castor v.2 Hardware
- Core services run on machines with SCSI disks, hardware RAID-1 and redundant power supplies
- Tape servers and disk servers have lower-level hardware, like the WNs
- Sun Blade v100 with 2 internal IDE disks in software RAID-1, running ACSLS 7.0 on Solaris 9.0
- 40 disk servers attached to the SAN with fully redundant 2 Gb/s or 4 Gb/s FC connections (dual-controller hardware, with QLogic SANsurfer Path Failover software or vendor-specific software)
- STK L5500 silo (5500 slots, partitioned with 2 form-factor slots: about 2000 for LTO-2 and 3500 for 9940B, 200 GB cartridges, total capacity 1.1 PB uncompressed)
- 6 LTO-2 and 7 9940B drives, 2 Gb/s FC interfaces, 20-30 MB/s rate each (some more 9940B drives to be acquired in the next months)
- Brocade FC SAN; 13 tape servers; STK FlexLine 600 and IBM FastT900 disk arrays
6. Setup: Core Services
- CASTOR core services v2.1.1-9 on 4 machines (v2.1.1-9 also for the clients), plus 2 more machines for the Name Server and Stager DBs
- castor-6: rhserver, stager, rtcpclientd, MigHunter, cleaningDaemon
- castorlsf01: Master LSF, rmmaster, expert
- castor-8: nsdaemon, vmgr, vdqm, msgd, cupvd
- dlf01: dlfserver, Cmonit, Oracle for DLF
- Name Server Oracle DB (Oracle 9.2); Stager Oracle DB (Oracle 10.2)
- 2 SRMv1 endpoints, DNS balanced:
  - srm://castorsrm.cr.cnaf.infn.it:8443 (used for tape svc classes)
  - srm://sc.cr.cnaf.infn.it:8443 (used for disk-only svc classes for CMS and ATLAS)
  - srm://srm-lhcb-durable.sc.cr.cnaf.infn.it:8443 (used as a disk-only svc class for LHCb)
- v2.1.1, disk server in the TURL
7. Setup: Disk Servers
- 40 disk servers
- About 5-6 filesystems per node, both XFS and EXT3 in use, typical size 1.5-2 TB
- LSF software distributed via NFS (exported by the LSF Master node)
- LSF slots from 30 to 450, modified many times (lower or higher values only for tests)
- Many servers are used both for file transfers and for job reco/analysis, so a max-slots limitation is not very useful in such a case
8. Supported VOs / Svcclasses / Diskpools (270 TB net)
9. Setup: Monitoring and Alarms
We use Nagios with RRD for alarm notifications.
10. Setup: Nagios
- Typical parameters such as disk I/O, CPU, network, connections, procs, available disk space, RAID status
- CASTOR-specific parameters such as tape and disk pool free space, daemons, LSF queues
- Still missing: checks on the stager DB tables, such as newrequests and subrequests
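A Nagios service check along these lines can cover the CASTOR-specific parameters; this is only a sketch, where the host name follows the slides but the `check_castor_diskpool` command (and the plugin behind it, with warning/critical free-space thresholds) is a hypothetical local plugin, not part of the standard Nagios plugin set:

```
# Sketch of a Nagios service definition; check_castor_diskpool is a
# hypothetical local plugin with warning/critical thresholds.
define service {
    use                   generic-service
    host_name             castor-6
    service_description   CASTOR diskpool free space
    check_command         check_castor_diskpool!20%!10%
}
```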
11. Setup: Monitoring (Lemon)
- Lemon v.1.0.7-15 is in production as a monitoring tool
- Lemon is the monitoring tool suggested by CERN, with strong integration with Castor v.2
- Oracle v.10.2 as the database backend
12. Lemon Monthly Usage (production)
- Castor diskservers (disk-to-disk transfers)
  - Max 301.3 MB/s
  - Average 106.9 MB/s
- Castor tapeservers (disk-to/from-tape transfers)
  - Max 110.3 MB/s
  - Average 41.7 MB/s
13. Lemon Monthly Usage (production)
- GPFS
  - Max 205 MB/s
  - Average 18.3 MB/s
- XRootD
  - Max 47.6 MB/s
  - Average 6.28 MB/s
14. SAN Fabric Implementation
- Why a Storage Area Network?
- As a Tier1 we need to grant a 7/24 service to the users (LHC and non-LHC)
- A good SAN hardware installation can be a "real" No Single Point of Failure (NSPF) system (if the software supports it!)
- So failures of storage infrastructure components, or planned events (like firmware upgrades), can be handled without stopping any service
- A SAN also gives the best flexibility:
  - we can dynamically vary the diskserver/disk-storage assignment (adding diskservers, or changing ratios...)
  - we can use clustered filesystems like GPFS as diskpools
15. SAN Fabric Implementation
- Single SAN (980 TB raw), hardware based on Brocade switches
- The SAN is one single fabric, managed with a single web management tool plus the Fabric Manager software for failure and performance monitoring
- QLogic QLA2340 (2 Gb/s) and QLA2462 (4 Gb/s) HBAs; HA failover implemented with the SANsurfer configuration or a vendor-specific tool (EMC PowerPath)
- 1st Director: 24000, fully redundant, with 128 2 Gb/s ports (the 2005 tender price was 1 kEuro/port)
- 2nd Director: 48000, fully redundant, with 128 (out of 256) 4 Gb/s ports (the 2006 tender price was 0.5 kEuro/port)
- 2 Silkworm 3900 switches with 32 2 Gb/s ports each, attached to the Directors with 2 x 2 Gb/s trunked uplinks
- DISK STORAGE
  - 4 x IBM FastT900 (DS4500) (4 x 2 Gb/s outputs each), 200 TB -> 20 diskservers with single HBA
  - 4 x FlexLine 600 (4 x 2 Gb/s each), 290 TB -> 20 diskservers with double HBAs
  - 3 x CX-380 (8 x 4 Gb/s each), 400 TB -> 36 diskservers with double HBAs
  - 1 x SUN STK BladeStore (4 x 2 Gb/s) -> 5 diskservers with single HBA
  - 9 x Infortrend A16F-R1211-M2 (2 x 2 Gb/s each) JBOD, 56 TB -> 9 primary diskservers with single HBA
- About 6-12 TB raw accessed by one diskserver can be enough, depending on the fs/protocol
- Fibre Channel physical connections, failover and zoning are configured in the simplest way; traffic from the diskservers remains in the local switch in most cases, so uplink usage is minimized
16. DISK Access: Typical Case (NSPF)
[Diagram: NSPF disk access path]
- LAN with Gb Ethernet connections to every diskserver
- 12 diskservers: Dell 1950, dual-core biprocessors 2 x 1.6 GHz with 4 MB L2 cache, 4 GB RAM, 1066 MHz FSB; SL 3.0 or 4.0 OS, hardware RAID-1 on the system disks and redundant power supplies
- 4 TB RAID groups (8+1), exported as 2 TB logical disks (LUN0, LUN1, ...) mapped as LUN0 -> /dev/sda, LUN1 -> /dev/sdb, ...
- 2 x 4 Gb QLogic 2460 FC redundant connections on every diskserver
- SAN ZONING
  - Each diskserver -> 4 paths to the storage
  - EMC PowerPath for load balancing and failover (or QLogic SANsurfer if problems with SL arise)
- Example of application: high-availability GPFS configuration with Network Shared Disks (NSD)
  - /dev/sda: primary diskserver 1, secondary diskserver 2
  - /dev/sdb: primary diskserver 2, secondary diskserver 3
  - ...
- 110 TB EMC CX380 with dual redundant controllers (Storage Processors A and B), 4 Gb FC connections, 4 outputs for each SP (A1-A4, B1-B4)
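The primary/secondary NSD assignment above maps onto a GPFS 3.x disk descriptor file (the input to mmcrnsd); a minimal sketch, with illustrative server and NSD names:

```
# GPFS 3.x disk descriptor sketch, one line per NSD:
# DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName
# Server and NSD names below are illustrative placeholders.
/dev/sda:diskserver1:diskserver2:dataAndMetadata:1:nsd01
/dev/sdb:diskserver2:diskserver3:dataAndMetadata:2:nsd02
```

With PowerPath handling the FC path failover underneath, the backup server in each descriptor covers a diskserver failure, keeping the filesystem online.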
17. SAN Monitoring Tools
Web Admin Tool (from Browser)
Web Admin Tool (zoning)
18. SAN Monitoring Tools
Fabric Manager software (installed on a dedicated machine)
Fabric Manager software (performance monitoring showing the PowerPath load balancing)
19. SAN Disk Distribution
- Storage used in production:
  - 270 TB net: Castor v.2 staging area, with tape backend or disk-only pools (see above)
  - 140 TB net: Xrootd (BaBar)
  - 130 TB net: GPFS v3.1.0-10
  - 230 TB net: still unassigned (used for tests in these weeks)
- NFS is used mainly for accessing experiment software (TB); it is strongly discouraged for data access (Virgo) and currently under migration (to Castor v.2 and GPFS)
20. GPFS Implementation
- The idea of GPFS is to provide a fast and reliable (NSPF) pure diskpool storage with direct access (POSIX file protocol) from the Worker Nodes farm
- SRM interface: StoRM (http://storm.forge.cnaf.infn.it/doku.php)
- One single "big filesystem" for each VO could be possible (strongly preferred by the users)
- Further step: creation of a single (or, in the future, multiple) cluster with all the Worker Nodes (700-800) and the 40 NSD GPFS diskservers. Before, only the front-ends (i.e. the storage elements) accessed the GPFS cluster and the WNs used to copy the data locally; now all the WNs can access the GPFS system directly
- Test of the whole system using part of the storage hardware infrastructure (24 dedicated NSD diskservers with SL 4.4 64-bit and 2 EMC CX-380 storage arrays, tot. 230 TB), locally and remotely with dedicated farm queues (280 WNs distributed in 8 racks, for a total of 1100 LSF queue slots)
21. Test Layout
22. Test Phase (local)
- Local access using the 24 diskservers (actually 23)
- XFS locally mounted filesystems
- GPFS cluster: 1 filesystem, one single "200 TB filesystem"
- Test using the Linux command "dd" reading from memory (/dev/zero), with a block size of 1024k and 12 GB per thread (dd processes in background), equally distributed over the diskservers
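The per-diskserver load of this test can be sketched with a small shell helper that emits one 12 GB dd writer per host; the host names and the target path are illustrative placeholders, not the actual harness used:

```shell
#!/bin/sh
# gen_dd_cmds: print, for each diskserver given as an argument, the dd
# command used in the local write test (12 GB per thread: bs=1024k x
# 12288 blocks from /dev/zero). The /gpfs/testfs path is illustrative.
gen_dd_cmds() {
    for host in "$@"; do
        printf 'ssh %s dd if=/dev/zero of=/gpfs/testfs/dd.%s bs=1024k count=12288\n' "$host" "$host"
    done
}

gen_dd_cmds diskserver01 diskserver02 diskserver03
```

Piping the output to sh starts the writers; repeating the dd line per host scales the number of concurrent threads.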
23. Test Results (local)
24. GPFS works using parallel I/O, so the maximum bandwidth (plateau) is reached with a very limited number of threads (1 dd process for each diskserver is enough).
In general GPFS works better when reading. When writing, all the diskservers must communicate with each other to maintain sync; this generates "background" traffic that can limit the write throughput. Anyway, the array controller limit is still reached at our site (the disk is the bottleneck).
25. Test Phase (remote)
- Remote access using dedicated farm nodes (dedicated slots in the LSF batch system)
- Castor (RFIO over XFS filesystems), disk-only pool (no tape backend)
- GPFS cluster: 1 filesystem, one single "200 TB filesystem"
- Test using C-coded "dd" commands as farm jobs (5 GB files, bs=64k, 1000 jobs)
- We were interested in the reliability of GPFS and in an overall performance comparison between our two production disk storage pool systems
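Each farm job's I/O pattern (write a 5 GB file with a 64k block size, then read it back) can be approximated with plain dd; a sketch, where the function name and paths are illustrative and the real jobs used a small C program:

```shell
#!/bin/sh
# run_io_job FILE COUNT: write COUNT 64 KiB blocks from /dev/zero to
# FILE, then read the file back -- mimicking the C-coded "dd" farm jobs
# (the real jobs used COUNT=81920, i.e. 5 GiB per file).
run_io_job() {
    file=$1
    count=$2
    dd if=/dev/zero of="$file" bs=64k count="$count" 2>/dev/null &&
    dd if="$file" of=/dev/null bs=64k 2>/dev/null
}
```

For example, `run_io_job /gpfs/testfs/job.$LSB_JOBID 81920` inside an LSF job, with `LSB_JOBID` set by LSF (path illustrative).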
26. Test Results (remote)
Castor remote read
Castor remote write
Some job failures (10%) were detected when reading, probably due mainly to Castor2 queue timeouts. When writing, the efficiency was higher (98%).
The aggregate network statistics reported on the uplink connections show an identical trend, with a 10% overhead.
27. Test Results (remote)
GPFS remote read
GPFS remote write
Overall efficiency 98% (2% of the jobs failed due to multiple reasons: WN problems, the .exe NFS area down, etc.).
The aggregate network statistics reported on the uplink connections show an identical trend, with a 10% overhead.
28. Conclusion
- The GPFS system cluster is working fine in a single "big cluster" implementation (all the WNs in the cluster)
- Tests show that the theoretical hardware bandwidth (the limit set by the controllers of the disk arrays) can be saturated both locally and remotely with the GPFS cluster
- Remote test comparisons with Castor v.2 show that jobs writing to and reading from GPFS are the fastest (1200 MB/s vs 850 MB/s writing, and 1500 MB/s vs 1300 MB/s reading). This could prove very useful in some critical I/O access activities (i.e. critical data transfers or analysis jobs)
- Reliability is also improved, since a GPFS cluster is very close to an NSPF system (while a failure in a Castor diskserver node puts the corresponding filesystems offline, so part of the diskpool becomes inaccessible)
- GPFS administration is also simpler compared to Castor (no Oracle, no LSF, intuitive admin commands, etc.)
29. Abstract
Title: Experience with Fabric Storage Area Network and HSM Software at the Tier1 INFN CNAF
Abstract: This paper is a report from the INFN Tier1 (CNAF) about the storage solutions we have implemented over the last few years of activity. In particular we describe the current Castor v.2 installation at our site, the HSM (Hierarchical Storage Manager) software chosen as a (low-cost) tape storage archiving solution. Besides Castor, we also have in production a large GPFS cluster relying on a Storage Area Network (SAN) infrastructure, to obtain a fast disk-only solution for the users. In this paper, summarizing our experience with these two storage system solutions, we focus on the management and monitoring tools implemented and on the technical solutions needed to improve the reliability of the whole system.