Title: Experience with Fabric Storage Area Network and HSM Software at the Tier1 INFN CNAF
1. Experience with Fabric Storage Area Network and HSM Software at the Tier1 INFN CNAF
Pier Paolo Ricci et al., on behalf of the INFN Tier1 staff
pierpaolo.ricci_at_cnaf.infn.it
- ACAT 2007
- April 22-27 2007
- Nikhef Amsterdam
2. Summary
- Overall Tier1 Hardware description
- Castor v.2 HSM Software
- SAN Fabric implementation
- Monitoring and Administration tools
- GPFS and Castor performance test
3. Overall Tier1 Fabric Hardware Description
- Here is what we have in production
- Disk (SAN): 980 TB raw (ATA RAID-5)
  - 9 Infortrend A16F-R1211-M2: 56 TB
  - 1 SUN STK BladeStore: 32 TB
  - 4 IBM FastT900 (DS4500): 200 TB
  - 5 SUN STK: 290 TB
  - 3 DELL EMC: 400 TB
- Tape: 1 PB uncompressed (tapes for 670 TB)
  - 1 SUN STK L5500, partitioned into 2000 LTO-2 slots (200 GB) and 3500 9940B slots (200 GB)
  - 6 LTO-2 drives (20-30 MB/s each)
  - 7 9940B drives (25-30 MB/s each)
4. TIER1 INFN CNAF Storage
[Diagram: overall storage layout]
- HSM (1 PB): CASTOR-2 HSM, Castor services servers and tape servers; STK L5500 robot (5500 slots) with 6 IBM LTO-2 and 7 STK 9940B drives, Fibre Channel attached
- Worker Nodes (LSF batch system): farm of 800 nodes for 1500 kSI2k (3000 kSI2k in the 2nd half of 2007), accessing storage via RFIO, GPFS and Xrootd over the WAN or Tier1 LAN
- 90 diskservers with QLogic FC HBAs (2340 and 2462)
- STK L180 with 180 LTO-1 tapes (18 TB native); W2003 server with LEGATO Networker (backup)
- SAN: 980 TB raw (minus 15-25% for net space, > 770 TB net)
  - 1 SUN STK BladeStore: 24000 GB, 250 GB SATA blades, 4 x 2 Gb FC interfaces (32 TB raw)
  - 5 SUN STK FLX680: 5 x 46000 GB, 500 GB SATA blades, 4 x 2 Gb FC interfaces each (290 TB raw)
  - 4 Infortrend A16F-R1A2-M1: 4 x 3200 GB SATA, 2 x 2 Gb FC interfaces each
  - 5 Infortrend A16F-R1211-M2 JBOD: 5 x 6400 GB SATA, 2 x 2 Gb FC interfaces each (56 TB raw)
  - 4 IBM FastT900 (DS4500): 4 x 43000 GB SATA, 4 x 2 Gb FC interfaces each (200 TB raw)
  - 3 EMC CX380: 3 x 114000 GB FATA, 8 x 4 Gb FC interfaces each (400 TB raw)
5. Castor v.2 Hardware
- Core services run on machines with SCSI disks, hardware RAID-1 and redundant power supplies
- Tape servers and disk servers have lower-level hardware, like the WNs
- Sun Blade v100 with 2 internal IDE disks in software RAID-1, running ACSLS 7.0 on Solaris 9.0
- 40 disk servers attached to the SAN with fully redundant 2 Gb/s or 4 Gb/s FC connections (dual-controller hardware, with QLogic SANsurfer Path Failover software or vendor-specific software)
- STK L5500 silo (5500 slots, partitioned with 2 form-factor slots: about 2000 for LTO-2 and 3500 for 9940B, 200 GB cartridges, total capacity 1.1 PB uncompressed)
- 6 LTO-2 and 7 9940B drives, 2 Gb/s FC interfaces, 20-30 MB/s rate each (some more 9940B drives to be acquired in the next months)
- Brocade FC SAN; 13 tape servers; STK FlexLine 600 and IBM FastT900 disk arrays
6. Setup: Core Services
- CASTOR core services v2.1.1-9 on 4 machines (v2.1.1-9 also for the clients), plus 2 more machines for the Name Server and Stager DBs
- castor-6: rhserver, stager, rtcpclientd, MigHunter, cleaningDaemon
- castorlsf01: Master LSF, rmmaster, expert
- castor-8: nsdaemon, vmgr, vdqm, msgd, cupvd
- dlf01: dlfserver, Cmonit, Oracle for DLF
- Name Server Oracle DB (Oracle 9.2); Stager Oracle DB (Oracle 10.2)
- 2 SRMv1 endpoints, DNS balanced:
  - srm://castorsrm.cr.cnaf.infn.it:8443 (used for tape svc classes)
  - srm://sc.cr.cnaf.infn.it:8443 (used for disk-only svc classes for CMS and ATLAS)
  - srm://srm-lhcb-durable.sc.cr.cnaf.infn.it:8443 (used as a disk-only svc class for LHCb)
- v2.1.1, disk server in the TURL
7. Setup: Disk Servers
- 40 disk servers
- About 5-6 filesystems per node, both XFS and EXT3 in use, typical size 1.5-2 TB
- LSF software distributed via NFS (exported by the LSF Master node)
- LSF slots from 30 to 450, modified many times (lower or higher values only for tests)
- Many servers are used both for file transfers and for job reco/analysis, so a max-slots limitation is not very useful in such a case
8. Supported VOs / Svcclasses / Diskpools (270 TB net)
9. Setup: Monitoring and Alarms
We use Nagios with RRD for alarm notifications.
10. Setup: Nagios
- Typical parameters such as disk I/O, CPU, network, connections, procs, available disk space, RAID status
- CASTOR-specific parameters such as tape and disk pool free space, daemons, LSF queues
- Still missing: checks on the stager DB tables, such as newrequests and subrequests
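A Nagios service check along these lines can cover the CASTOR-specific parameters; this is only a sketch, where the host name follows the slides but the `check_castor_diskpool` command (and the plugin behind it, with warning/critical free-space thresholds) is a hypothetical local plugin, not part of the standard Nagios plugin set:

```
# Sketch of a Nagios service definition; check_castor_diskpool is a
# hypothetical local plugin with warning/critical thresholds.
define service {
    use                   generic-service
    host_name             castor-6
    service_description   CASTOR diskpool free space
    check_command         check_castor_diskpool!20%!10%
}
```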
11. Setup: Monitoring (Lemon)
- Lemon v.1.0.7-15 is in production as a monitoring tool
- Lemon is the monitoring tool suggested by CERN, with strong integration with Castor v.2
- Oracle v.10.2 as the database backend
12. Lemon Monthly Usage (production)
- Castor diskservers (disk-to-disk transfers)
  - Max 301.3 MB/s
  - Average 106.9 MB/s
- Castor tapeservers (disk-to/from-tape transfers)
  - Max 110.3 MB/s
  - Average 41.7 MB/s
13. Lemon Monthly Usage (production)
- GPFS
  - Max 205 MB/s
  - Average 18.3 MB/s
- XRootD
  - Max 47.6 MB/s
  - Average 6.28 MB/s
14. SAN Fabric Implementation
- Why a Storage Area Network?
- As a Tier1 we need to grant a 7/24 service to the users (LHC and non-LHC)
- A good SAN hardware installation can be a "real" No Single Point of Failure (NSPF) system (if the software supports it!)
- So failures of storage infrastructure components, or planned events (like firmware upgrades), can be handled without stopping any service
- A SAN also gives the best flexibility:
  - we can dynamically vary the diskserver/disk-storage assignment (adding diskservers, or changing ratios...)
  - we can use clustered filesystems like GPFS as diskpools
15. SAN Fabric Implementation
- Single SAN (980 TB raw), hardware based on Brocade switches
- The SAN is one single fabric, managed with a single web management tool plus the Fabric Manager software for failure and performance monitoring
- QLogic QLA2340 (2 Gb/s) and QLA2462 (4 Gb/s) HBAs; HA failover implemented with the SANsurfer configuration or a vendor-specific tool (EMC PowerPath)
- 1st Director: 24000, fully redundant, with 128 2 Gb/s ports (the 2005 tender price was 1 kEuro/port)
- 2nd Director: 48000, fully redundant, with 128 (out of 256) 4 Gb/s ports (the 2006 tender price was 0.5 kEuro/port)
- 2 Silkworm 3900 switches with 32 2 Gb/s ports each, attached to the Directors with 2 x 2 Gb/s trunked uplinks
- DISK STORAGE
  - 4 x IBM FastT900 (DS4500) (4 x 2 Gb/s outputs each), 200 TB -> 20 diskservers with single HBA
  - 4 x FlexLine 600 (4 x 2 Gb/s each), 290 TB -> 20 diskservers with double HBAs
  - 3 x CX-380 (8 x 4 Gb/s each), 400 TB -> 36 diskservers with double HBAs
  - 1 x SUN STK BladeStore (4 x 2 Gb/s) -> 5 diskservers with single HBA
  - 9 x Infortrend A16F-R1211-M2 (2 x 2 Gb/s each) JBOD, 56 TB -> 9 primary diskservers with single HBA
- About 6-12 TB raw accessed by one diskserver can be enough, depending on the fs/protocol
- Fibre Channel physical connections, failover and zoning are configured in the simplest way; traffic from the diskservers remains in the local switch in most cases, so uplink usage is minimized
16. DISK Access: Typical Case (NSPF)
[Diagram: NSPF disk access path]
- LAN with Gb Ethernet connections to every diskserver
- 12 diskservers: Dell 1950, dual-core biprocessors 2 x 1.6 GHz with 4 MB L2 cache, 4 GB RAM, 1066 MHz FSB; SL 3.0 or 4.0 OS, hardware RAID-1 on the system disks and redundant power supplies
- 4 TB RAID groups (8+1), exported as 2 TB logical disks (LUN0, LUN1, ...) mapped as LUN0 -> /dev/sda, LUN1 -> /dev/sdb, ...
- 2 x 4 Gb QLogic 2460 FC redundant connections on every diskserver
- SAN ZONING
  - Each diskserver -> 4 paths to the storage
  - EMC PowerPath for load balancing and failover (or QLogic SANsurfer if problems with SL arise)
- Example of application: high-availability GPFS configuration with Network Shared Disks (NSD)
  - /dev/sda: primary diskserver 1, secondary diskserver 2
  - /dev/sdb: primary diskserver 2, secondary diskserver 3
  - ...
- 110 TB EMC CX380 with dual redundant controllers (Storage Processors A and B), 4 Gb FC connections, 4 outputs for each SP (A1-A4, B1-B4)
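The primary/secondary NSD assignment above maps onto a GPFS 3.x disk descriptor file (the input to mmcrnsd); a minimal sketch, with illustrative server and NSD names:

```
# GPFS 3.x disk descriptor sketch, one line per NSD:
# DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName
# Server and NSD names below are illustrative placeholders.
/dev/sda:diskserver1:diskserver2:dataAndMetadata:1:nsd01
/dev/sdb:diskserver2:diskserver3:dataAndMetadata:2:nsd02
```

With PowerPath handling the FC path failover underneath, the backup server in each descriptor covers a diskserver failure, keeping the filesystem online.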
17. SAN Monitoring Tools
Web Admin Tool (from Browser)
Web Admin Tool (zoning)
18. SAN Monitoring Tools
Fabric Manager software (installed on a dedicated machine)
Fabric Manager software (performance monitoring showing the PowerPath load balancing)
19. SAN Disk Distribution
- Storage used in production:
  - 270 TB net: Castor v.2 staging area, with tape backend or disk-only pools (see above)
  - 140 TB net: Xrootd (BaBar)
  - 130 TB net: GPFS v3.1.0-10
  - 230 TB net: still unassigned (used for tests in these weeks)
- NFS is used mainly for accessing experiment software (TB); it is strongly discouraged for data access (Virgo) and currently under migration (to Castor v.2 and GPFS)
20. GPFS Implementation
- The idea of GPFS is to provide a fast and reliable (NSPF) pure diskpool storage with direct access (POSIX file protocol) from the Worker Nodes farm
- SRM interface: StoRM (http://storm.forge.cnaf.infn.it/doku.php)
- One single "big filesystem" for each VO could be possible (strongly preferred by the users)
- Further step: creation of a single (or, in the future, multiple) cluster with all the Worker Nodes (700-800) and the 40 NSD GPFS diskservers. Before, only the front-ends (i.e. the storage elements) accessed the GPFS cluster and the WNs used to copy the data locally; now all the WNs can access the GPFS system directly
- Test of the whole system using part of the storage hardware infrastructure (24 dedicated NSD diskservers with SL 4.4 64-bit and 2 EMC CX-380 storage arrays, tot. 230 TB), locally and remotely with dedicated farm queues (280 WNs distributed in 8 racks, for a total of 1100 LSF queue slots)
21. Test Layout
22. Test Phase (local)
- Local access using the 24 diskservers (actually 23)
- XFS locally mounted filesystems
- GPFS cluster: 1 filesystem, one single "200 TB filesystem"
- Test using the Linux command "dd" reading from memory (/dev/zero), with a block size of 1024k and 12 GB per thread (dd processes in background), equally distributed over the diskservers
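The per-diskserver load of this test can be sketched with a small shell helper that emits one 12 GB dd writer per host; the host names and the target path are illustrative placeholders, not the actual harness used:

```shell
#!/bin/sh
# gen_dd_cmds: print, for each diskserver given as an argument, the dd
# command used in the local write test (12 GB per thread: bs=1024k x
# 12288 blocks from /dev/zero). The /gpfs/testfs path is illustrative.
gen_dd_cmds() {
    for host in "$@"; do
        printf 'ssh %s dd if=/dev/zero of=/gpfs/testfs/dd.%s bs=1024k count=12288\n' "$host" "$host"
    done
}

gen_dd_cmds diskserver01 diskserver02 diskserver03
```

Piping the output to sh starts the writers; repeating the dd line per host scales the number of concurrent threads.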
23. Test Results (local)
24. GPFS works using parallel I/O, so the maximum bandwidth (plateau) is reached with a very limited number of threads (1 dd process for each diskserver is enough).
In general GPFS works better when reading. When writing, all the diskservers must communicate with each other to maintain sync; this generates "background" traffic that can limit the write throughput. Anyway, the array controller limit is still reached at our site (the disk is the bottleneck).
25. Test Phase (remote)
- Remote access using dedicated farm nodes (dedicated slots in the LSF batch system)
- Castor (RFIO over XFS filesystems), disk-only pool (no tape backend)
- GPFS cluster: 1 filesystem, one single "200 TB filesystem"
- Test using C-coded "dd" commands as farm jobs (5 GB files, bs=64k, 1000 jobs)
- We were interested in the reliability of GPFS and in an overall performance comparison between our two production disk storage pool systems
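Each farm job's I/O pattern (write a 5 GB file with a 64k block size, then read it back) can be approximated with plain dd; a sketch, where the function name and paths are illustrative and the real jobs used a small C program:

```shell
#!/bin/sh
# run_io_job FILE COUNT: write COUNT 64 KiB blocks from /dev/zero to
# FILE, then read the file back -- mimicking the C-coded "dd" farm jobs
# (the real jobs used COUNT=81920, i.e. 5 GiB per file).
run_io_job() {
    file=$1
    count=$2
    dd if=/dev/zero of="$file" bs=64k count="$count" 2>/dev/null &&
    dd if="$file" of=/dev/null bs=64k 2>/dev/null
}
```

For example, `run_io_job /gpfs/testfs/job.$LSB_JOBID 81920` inside an LSF job, with `LSB_JOBID` set by LSF (path illustrative).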
26. Test Results (remote)
Castor remote read
Castor remote write
Some job failures (10%) were detected when reading, probably due mainly to Castor2 queue timeouts. When writing, the efficiency was higher (98%).
The aggregate network statistics reported on the uplink connections show an identical trend, with a 10% overhead.
27. Test Results (remote)
GPFS remote read
GPFS remote write
Overall efficiency 98% (2% of the jobs failed due to multiple reasons: WN problems, the .exe NFS area down, etc.).
The aggregate network statistics reported on the uplink connections show an identical trend, with a 10% overhead.
28. Conclusion
- The GPFS system cluster is working fine in a single "big cluster" implementation (all the WNs in the cluster)
- Tests show that the theoretical hardware bandwidth (the limit set by the controllers of the disk arrays) can be saturated both locally and remotely with the GPFS cluster
- Remote test comparisons with Castor v.2 show that jobs writing to and reading from GPFS are the fastest (1200 MB/s vs 850 MB/s writing, and 1500 MB/s vs 1300 MB/s reading). This could prove very useful in some critical I/O access activities (i.e. critical data transfers or analysis jobs)
- Reliability is also improved, since a GPFS cluster is very close to an NSPF system (while a failure in a Castor diskserver node puts the corresponding filesystems offline, so part of the diskpool becomes inaccessible)
- GPFS administration is also simpler compared to Castor (no Oracle, no LSF, intuitive admin commands, etc.)
29. Abstract
Title: Experience with Fabric Storage Area Network and HSM Software at the Tier1 INFN CNAF
Abstract: This paper is a report from the INFN Tier1 (CNAF) about the storage solutions we have implemented over the last few years of activity. In particular we describe the current Castor v.2 installation at our site, the HSM (Hierarchical Storage Manager) software chosen as a (low-cost) tape storage archiving solution. Besides Castor, we also have in production a large GPFS cluster relying on a Storage Area Network (SAN) infrastructure, to obtain a fast disk-only solution for the users. In this paper, summarizing our experience with these two storage system solutions, we focus on the management and monitoring tools implemented and on the technical solutions needed to improve the reliability of the whole system.