1
Experience with Fabric Storage Area Network and
HSM Software at the Tier1 INFN CNAF
Ricci Pier Paolo et al., on behalf of the INFN TIER1 Staff
pierpaolo.ricci@cnaf.infn.it
  • ACAT 2007
  • April 22-27 2007
  • Nikhef Amsterdam

2
Summary
  • Overall Tier1 Hardware description
  • Castor v.2 HSM Software
  • SAN Fabric implementation
  • Monitoring and Administration tools
  • GPFS and Castor performance test

3
Overall Fabric Tier1 Hardware description
  • Here is what we have in production:
  • Disk (SAN): 980 TB RAW (ATA RAID-5)
  • 9 Infortrend A16F-R1211-M2: 56 TB
  • 1 SUN STK BladeStore: 32 TB
  • 4 IBM FastT900 (DS 4500): 200 TB
  • 5 SUN STK FLX680: 290 TB
  • 3 EMC CX380: 400 TB
  • Tape: 1 PByte uncompressed (tapes loaded for 670 TB)
  • 1 SUN STK L5500 library, partitioned into 2000 LTO-2 slots
    (200 GB) and 3500 9940B slots (200 GB)
  • 6 LTO-2 drives (20-30 MB/s each)
  • 7 9940B drives (25-30 MB/s each)

4
TIER1 INFN CNAF Storage
  • Worker Nodes (LSF Batch System): farm of 800 nodes for 1500 kSI2k
    (3000 kSI2k in the 2nd half of 2007), accessing the storage via RFIO,
    GPFS and Xrootd
  • 90 diskservers with Qlogic FC HBAs (2340 and 2462), attached to the SAN
    and serving data via RFIO over the WAN or the TIER1 LAN
  • HSM (1 PB): CASTOR-2 services servers and tapeservers; STK L5500 robot
    (5500 slots) with 6 IBM LTO-2 and 7 STK 9940B drives, attached via
    Fibre Channel
  • STK180 library with 180 LTO-1 cartridges (18 TByte native), used by a
    W2003 Server with LEGATO Networker for backups
  • SAN (980 TB RAW; minus 15-25% for net space, i.e. > 770 TB):
  • 3 EMC CX380, 3 x 114000 GByte FATA, 8 x 4 Gb/s FC interfaces each
  • 5 SUN STK FLX680, 5 x 46000 GByte (500 GB SATA blades), 4 x 2 Gb/s FC
    interfaces each
  • 4 IBM FastT900 (DS 4500), 4 x 43000 GByte SATA, 4 x 2 Gb/s FC
    interfaces each
  • 1 SUN STK BladeStore, 1 x 24000 GByte SATA blades, 4 x 2 Gb/s FC
    interfaces
  • 5 Infortrend A16F-R1211-M2 JBOD, 5 x 6400 GByte SATA, 2 x 2 Gb/s FC
    interfaces each
  • 4 Infortrend A16F-R1A2-M1, 4 x 3200 GByte SATA, 2 x 2 Gb/s FC
    interfaces each
5
Castor v.2 Hardware
  • Core services are on machines with SCSI disks, hardware RAID-1 and
    redundant power supplies
  • Tape servers and disk servers have lower-level hardware, like the WNs
  • Sun Blade v100 with 2 internal IDE disks in software RAID-1, running
    ACSLS 7.0 on Solaris 9.0
  • 40 disk servers attached to a SAN with fully redundant FC 2 Gb/s or
    4 Gb/s (latest) connections (dual-controller hardware and Qlogic
    SANsurfer Path Failover software or vendor-specific software)
  • STK L5500 silos (5500 slots, partitioned with 2 form-factor slots:
    about 2000 for LTO-2 and 3500 for 9940B; 200 GB cartridges, total
    capacity 1.1 PB uncompressed, i.e. 5500 x 200 GB)
  • 6 LTO-2 and 7 9940B drives, 2 Gbit/s FC interface, 20-30 MB/s rate
    each (some more 9940B drives are going to be acquired in the next
    months)

(Diagram: Brocade FC SAN connecting 13 tape servers to the STK FlexLine 600
and IBM FastT900 arrays)
6
Setup core services
CASTOR core services v2.1.1-9 on 4 machines (v2.1.1-9 also for the
clients), plus 2 more machines for the Name Server and Stager DBs:
  • Name server Oracle DB (Oracle 9.2)
  • Stager Oracle DB (Oracle 10.2)
  • castor-6: rhserver, stager, rtcpclientd, MigHunter, cleaningDaemon
  • castorlsf01: Master LSF, rmmaster, expert
  • dlf01: dlfserver, Cmonit, Oracle for DLF
  • castor-8: nsdaemon, vmgr, vdqm, msgd, cupvd
  • 2 SRMv1 endpoints, DNS balanced:
  • srm://castorsrm.cr.cnaf.infn.it:8443 (used for tape service classes)
  • srm://sc.cr.cnaf.infn.it:8443 (used for disk-only service classes for
    CMS and ATLAS)
  • srm://srm-lhcb-durable.sc.cr.cnaf.infn.it:8443 (used as a disk-only
    service class for LHCb)
  • v2.1.1: disk server in the TURL
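To illustrate the "disk server in the TURL" point above: a request against
one of these SRM endpoints returns a transfer URL (TURL) that names the
disk server directly, roughly as in the sketch below. The namespace path,
VO and host names are invented placeholders:

    SURL: srm://castorsrm.cr.cnaf.infn.it:8443/castor/cnaf.infn.it/<vo>/<path>/file1
    TURL: rfio://<diskserver>.cr.cnaf.infn.it//<local-filesystem-path>/file1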
7
Setup disk servers
  • 40 disk servers
  • about 5-6 filesystems per node, both XFS and EXT3 in use, typical size
    1.5-2 TB
  • LSF software distributed via NFS (exported by the LSF Master node)
  • LSF slots from 30 to 450, modified many times (lower or higher values
    only for tests)
  • Many servers are used both for file transfers and for job
    reco/analysis, so a max-slots limitation is not very useful in such a
    case

8
Supported VOs - service classes - disk pools (270 TB net)
9
Setup Monitoring and Alarms
We use Nagios with RRD for alarm notifications.
10
Setup Nagios
  • typical parameters such as disk I/O, CPU, network, connections,
    processes, available disk space, RAID status (a minimal check sketch
    is shown below)
  • CASTOR-specific parameters such as tape and disk pool free space,
    daemons, LSF queue
  • checks on the stager DB tables (such as newrequests, subrequests) are
    still missing

(Nagios screenshots for the castorlsf01 and castor-6 hosts)
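As an illustration of the "available disk space" check listed above, a
Nagios plugin only needs to print a status line and exit with the standard
Nagios return codes (0 = OK, 1 = WARNING, 2 = CRITICAL). This is a minimal
sketch, not the check used at CNAF; the mount point and thresholds are
invented:

    #!/usr/bin/env python
    # Minimal Nagios-style check of free space on a filesystem.
    import os, sys

    MOUNT = "/castor/pool1"        # hypothetical disk pool filesystem
    WARN_GB, CRIT_GB = 200, 50     # hypothetical thresholds

    st = os.statvfs(MOUNT)
    free_gb = st.f_bavail * st.f_frsize / 1e9

    if free_gb < CRIT_GB:
        print("CRITICAL - %.0f GB free on %s" % (free_gb, MOUNT))
        sys.exit(2)
    elif free_gb < WARN_GB:
        print("WARNING - %.0f GB free on %s" % (free_gb, MOUNT))
        sys.exit(1)
    print("OK - %.0f GB free on %s" % (free_gb, MOUNT))
    sys.exit(0)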
11
Setup Monitoring (Lemon)
  • Lemon v.1.0.7-15 is in production as a monitoring tool
  • Lemon is the monitoring tool suggested by CERN, with strong
    integration with Castor v.2
  • Oracle v.10.2 as database backend

12
Lemon Monthly usage (production)
  • Castor diskservers (disk to disk transfers)
  • Max 301.3 MB/s
  • Average 106.9 MB/s
  • Castor tapeservers (disk to/from tape transfers)
  • Max 110.3 MB/s
  • Average 41.7 MB/s

13
Lemon Monthly usage (production)
  • GPFS
  • Max 205 MB/s
  • Average 18.3 MB/s
  • XRootD
  • Max 47.6 MB/s
  • Average 6.28 MB/s

14
SAN Fabric Implementation
  • Why a Storage Area Network?
  • As a Tier1 we need to provide a 24/7 service to the users (LHC and
    non-LHC)
  • A good SAN hardware installation can be a "real" No Single Point of
    Failure (NSPF) system (if the software supports it!)
  • So failures of storage infrastructure components, or planned
    interventions (like firmware upgrades), can be handled without
    stopping any service
  • A SAN also gives the best flexibility:
  • we can dynamically change the diskserver/disk-storage assignment
    (adding diskservers, or changing ratios...)
  • we can use a clustered filesystem like GPFS as a diskpool

15
SAN Fabric Implementation
SINGLE SAN (980 TB RAW), hardware based on Brocade switches. The SAN is
managed as one single fabric, with a single web management tool plus the
Fabric Manager software for failure and performance monitoring. Qlogic
QLA2340 (2 Gb/s) and QLA2462 (4 Gb/s) HBAs; HA failover implemented with
the SANsurfer configuration or with vendor-specific tools (EMC PowerPath).
  • Fabric layout: a 1st Director (Brocade 24000), FULLY REDUNDANT with
    128 2 Gb/s ports (2005 tender price: 1 kEuro/port); a 2nd Director
    (Brocade 48000), FULLY REDUNDANT with 128 (out of 256) 4 Gb/s ports
    (2006 tender price: 0.5 kEuro/port); two Silkworm 3900 switches with
    32 2 Gb/s ports each, connected through 2 x 2 Gb/s trunked uplinks.
  • DISK STORAGE
  • 4 x IBM FastT900 DS 4500 (4 x 2 Gb/s outputs per box) 200 TB => 20
    diskservers with single HBA
  • 4 x FlexLine 600 (4 x 2 Gb/s each) 290 TB => 20 diskservers with
    double HBAs
  • 3 x CX-380 (8 x 4 Gb/s each) 400 TB => 36 diskservers with double HBAs
  • 1 x SUN STK BladeStore (4 x 2 Gb/s) => 5 diskservers with single HBA
  • 9 x Infortrend A16F-R1211-M2 (2 x 2 Gb/s each) JBOD 56 TB => 9 primary
    diskservers with single HBA
  • About 6-12 TB RAW accessed by one diskserver; depending on the
    filesystem/protocol this can be enough.
  • Fibre Channel physical connections, failover and zoning are configured
    in the simplest way (see the zoning sketch below); traffic from the
    diskservers stays on the local switch in most cases, so uplink usage
    is minimized.
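As an illustration of the "simplest way" zoning mentioned above, a
single-initiator zone on a Brocade switch can be defined with the standard
zoning commands roughly as follows; the alias names and WWPNs below are
invented for this sketch and are not the CNAF configuration:

    alicreate "diskserver01_hba0", "21:00:00:e0:8b:aa:bb:01"
    alicreate "cx380_spa1", "50:06:01:60:41:e0:11:22"
    zonecreate "z_diskserver01_cx380", "diskserver01_hba0; cx380_spa1"
    cfgadd "TIER1_SAN_cfg", "z_diskserver01_cx380"
    cfgenable "TIER1_SAN_cfg"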

16
DISK access typical case (NSPF)
  • 12 diskservers Dell 1950, dual-core biprocessors 2 x 1.6 GHz with 4 MB
    L2 cache, 4 GByte RAM, 1066 MHz FSB, SL 3.0 or 4.0 OS, hardware RAID-1
    on the system disks and redundant power supply; Gb Ethernet
    connections to the LAN
  • 2 x 4 Gb/s Qlogic 2460 FC redundant connections on every diskserver
  • On the array: 4 TB RAID groups (8+1) exported as 2 TB logical disks
    (LUN0, LUN1, ...), seen by the diskserver as LUN0 => /dev/sda,
    LUN1 => /dev/sdb, ...
  • SAN ZONING
  • Each diskserver => 4 paths to the storage
  • EMC PowerPath for load-balancing and failover (or Qlogic SANsurfer if
    problems with SL arise)
  • Example of application: high-availability GPFS configuration with
    Network Shared Disks (see the descriptor sketch below)
  • /dev/sda: primary Diskserver1, secondary Diskserver2
  • /dev/sdb: primary Diskserver2, secondary Diskserver3
  • .....
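A minimal sketch of how such a primary/secondary assignment could be
expressed, assuming the colon-separated disk descriptor format accepted by
mmcrnsd in GPFS 3.x; the server and NSD names below are placeholders:

    # DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName
    /dev/sda:diskserver1:diskserver2:dataAndMetadata:1:nsd_ds1_sda
    /dev/sdb:diskserver2:diskserver3:dataAndMetadata:2:nsd_ds2_sdb

A descriptor file like this would then be fed to mmcrnsd -F, and the
resulting NSDs to mmcrfs when the filesystem is created.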

(Diagram: 110 TB EMC CX380 with dual redundant controllers (Storage
Processors A and B), 4 x 4 Gb/s FC outputs per SP (ports A1-A4 and B1-B4)
connected to the SAN)
17
SAN Monitoring Tool
Web Admin Tool (from Browser)
Web Admin Tool (zoning)
18
SAN Monitoring Tool
Fabric Manager software (installed on a dedicated machine)
Fabric Manager software (performance monitoring showing PowerPath load
balancing)
19
SAN Disk Distribution
  • Storage used in production:
  • 270 TB net: Castor v.2 staging area, with tape backend or disk-only
    pools (see above)
  • 140 TB net: Xrootd (BaBar)
  • 130 TB net: GPFS v3.1.0-10
  • 230 TB net: still unassigned (used for tests in these weeks)
  • NFS is used mainly for accessing experiment software (TB); it is
    strongly discouraged for data access (Virgo) and currently under
    migration (to Castor v.2 and GPFS)

20
GPFS implementation
  • The idea of GPFS is to provide a fast and reliable (NSPF) pure
    disk-pool storage with direct access (POSIX file protocol) from the
    Worker Nodes farm
  • SRM interface: StoRM (http://storm.forge.cnaf.infn.it/doku.php)
  • One single "big filesystem" for each VO is possible (strongly
    preferred by users)
  • Further step: creation of a single (or, in the future, multiple)
    cluster containing all the Worker Nodes (700-800) and the 40 NSD GPFS
    diskservers. Before, only the front-ends (i.e. the storage elements)
    accessed the GPFS cluster and the WNs copied the data locally; now all
    the WNs can access the GPFS filesystem directly (see the sketch after
    this list)
  • Test of the whole system using part of the storage hardware
    infrastructure (24 dedicated NSD diskservers with SL 4.4 64-bit and 2
    EMC CX-380 storage arrays, 230 TB in total), both locally and
    remotely, with dedicated farm queues (280 WNs distributed over 8 racks
    for a total of 1100 LSF queue slots)
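A minimal sketch of how Worker Nodes could be added as plain client nodes
to an existing GPFS cluster, using standard GPFS administration commands;
the node list file and filesystem name are placeholders, not the commands
actually run at CNAF:

    # wn_nodes.txt contains one WN hostname per line (wn001, wn002, ...)
    mmaddnode -N wn_nodes.txt            # add the WNs to the cluster
    mmstartup -N wn_nodes.txt            # start GPFS on the new nodes
    mmmount gpfs_data -N wn_nodes.txt    # mount the filesystem on them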

21
Test Layout
22
Test Phase (local)
  • Local access using the 24 diskservers (actually 23)
  • XFS locally mounted filesystems
  • GPFS cluster: one single "200 TB" filesystem
  • Test using the Linux "dd" command from memory (/dev/zero), block size
    1024k and 12 GByte per thread (dd processes in background), equally
    distributed over the diskservers (a driver sketch is shown below)
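A minimal sketch of the local write test described above, assuming Python
around the standard dd command; the target directory and the number of
threads per diskserver are placeholders:

    #!/usr/bin/env python
    # Launch N parallel dd writers (12 GiB each, bs=1024k) and report the
    # aggregate rate, mimicking the local test above. Paths are examples.
    import subprocess, time

    TARGET_DIR = "/gpfs/testpool"   # hypothetical filesystem under test
    N_THREADS = 4                   # dd processes on this diskserver

    start = time.time()
    procs = [subprocess.Popen(["dd", "if=/dev/zero",
                               "of=%s/ddtest_%d" % (TARGET_DIR, i),
                               "bs=1024k", "count=12288"])  # 12288 MiB = 12 GiB
             for i in range(N_THREADS)]
    for p in procs:
        p.wait()

    elapsed = time.time() - start
    print("wrote %d GiB in %.1f s -> %.1f MiB/s aggregate"
          % (12 * N_THREADS, elapsed, 12.0 * 1024 * N_THREADS / elapsed))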

23
Test Results (local)
24
GPFS uses parallel I/O, so the maximum bandwidth (the plateau) is reached
with a very limited number of threads (1 dd process per diskserver is
enough).
In general GPFS performs better when reading. When writing, all the
diskservers must communicate with each other to stay in sync; this
generates "background" traffic that can limit the write throughput. In any
case the array controller limit is still reached at our site (the disk is
the bottleneck).
25
Test Phase (remote)
  • Remote access using dedicated farm nodes (dedicated slots in the LSF
    batch system)
  • Castor (RFIO over XFS filesystems), disk pool only (no tape backend)
  • GPFS cluster: one single "200 TB" filesystem
  • Test using C-coded "dd" commands submitted as farm jobs (5 GByte
    files, bs=64k, 1000 jobs), as sketched below
  • We were interested in the reliability of GPFS and in an overall
    performance comparison between our two production disk storage pool
    systems
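A minimal sketch of how such read jobs could be submitted to the LSF batch
system; the queue name and file paths are placeholders and, for brevity,
plain dd stands in for the C-coded reader used in the real test:

    #!/usr/bin/env python
    # Submit 1000 sequential-read jobs to a hypothetical dedicated queue.
    import subprocess

    for i in range(1, 1001):
        subprocess.call(["bsub", "-q", "gpfs_test", "-o", "/dev/null",
                         "dd", "if=/gpfs/testpool/file_%04d" % i,
                         "of=/dev/null", "bs=64k"])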

26
Test Results (remote)
Castor remote read
Castor remote write
Some job failures (10%) were detected when reading, probably due mainly to
Castor2 queue timeouts. When writing, the efficiency was higher (98%).
The aggregate network statistics reported on the uplink connections show
an identical trend with a ~10% overhead.
27
Test Results (remote)
GPFS remote read
GPFS remote write
Overall efficiency 98% (2% of the jobs failed for various reasons: WN
problems, the .exe NFS area being down, etc...)
The aggregate network statistics reported on the uplink connections show
an identical trend with a ~10% overhead.
28
Conclusion
  • The GPFS cluster is working fine in a single "big cluster"
    implementation (all WNs in the cluster)
  • Tests show that the theoretical hardware bandwidth (the limit from the
    controllers of the disk arrays) can be saturated both locally and
    remotely with the GPFS cluster
  • Remote test comparisons with Castor v.2 show that jobs writing to and
    reading from GPFS are the fastest (1200 MB/s vs 850 MB/s writing,
    1500 MB/s vs 1300 MB/s reading). This could prove very useful in some
    I/O-critical activities (i.e. critical data transfers or analysis
    jobs)
  • Reliability is also improved, since a GPFS cluster is very close to an
    NSPF system (while a failure of a Castor diskserver node puts the
    corresponding filesystems offline, so part of the diskpool becomes
    inaccessible)
  • GPFS administration is also simpler than Castor's (no Oracle, no LSF,
    intuitive admin commands, etc...)

29
Abstract
Title: Experience with Fabric Storage Area Network and HSM Software at the
Tier1 INFN CNAF
Abstract: This paper is a report from the INFN Tier1 (CNAF) about the
storage solutions we have implemented over the last few years of activity.
In particular we describe the current Castor v.2 installation at our site,
the HSM (Hierarchical Storage Manager) software chosen as a (low-cost)
tape storage archiving solution. Besides Castor, we also have in
production a large GPFS cluster relying on a Storage Area Network (SAN)
infrastructure, which provides a fast, disk-only solution for the users.
In this paper, summarizing our experience with these two storage
solutions, we focus on the management and monitoring tools implemented and
on the technical solutions needed to improve the reliability of the whole
system.