1
Lustre File System Evaluation at FNAL
Stephen Wolbers for Alex Kulyavtsev, Matt
Crawford, Stu Fuess, Don Holmgren, Dmitry
Litvintsev, Alexander Moibenko, Stan Naymola,
Gene Oleynik, Timur Perelmutov, Don Petravick,
Vladimir Podstavkov, Ron Rechenmacher, Nirmal
Seenu, Jim Simone Fermilab
  • CHEP'09, Prague
    March 23, 2009

2
Outline
  • Goals of Storage Evaluation Project and
    Introduction
  • General Criteria
  • HPC Specific Criteria
  • HSM Related Criteria
  • HEP Specific Criteria
  • Test Suite and Results
  • Lustre at FNAL HPC Facilities: Cosmology and LQCD
  • Conclusions

3
Storage Evaluation
  • Fermilab's Computing Division regularly
    investigates global/high-performance file systems
    which meet criteria in
  • capacity, scalability and I/O performance
  • data integrity, security and accessibility
  • usability, maintainability, ability to troubleshoot and isolate problems
  • tape integration
  • namespace and its performance
  • Produced a list of weighted criteria a system would need to meet (HEP and HPC): FNAL CD DocDB 2576
  • http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=2576
  • Set up test stands to get experience with file
    systems and to perform measurements where possible

4
Storage Evaluation
  • Additional input for evaluation
  • File system documentation
  • Design and performance of existing installations
  • Communications with vendor/organization staff
  • Training
  • The focus of this talk is our evaluation of
    Lustre
  • Most effort so far concentrated on general
    functionality and HPC (Cosmological Computing
    Cluster and Lattice QCD)
  • We are in the process of evaluating Lustre for HEP
  • Lustre Installations
  • Preproduction and production systems on Cosmology
    Computation Cluster
  • LQCD Cluster

5
Storage Evaluation Criteria
  • Capacity of 5 PB, scalable by adding storage units
  • Aggregate I/O > 5 GB/s today, scalable by adding I/O units
  • LLNL BlueGene/L has 200,000 client processes accessing a 2.5 PB Lustre file system through 1,024 I/O nodes over a 1 Gbit Ethernet network at 35 GB/s
  • The disk subsystem should impose no limit on file sizes. Typical files used in HEP today are 1 GB to 50 GB
  • Legend for the criteria:
  • satisfies the criterion
  • either doesn't satisfy or only partially satisfies the criterion
  • not tested
  • green: an example exists
  • purple: coming soon
  • red: needs attention

6
Criteria: Functionality
  • Storage capacity and aggregate data I/O bandwidth scale linearly by adding scalable storage units
  • Storage runs on a general-purpose platform. Ethernet is the primary access medium for capacity computing
  • Easy to add, remove, replace scalable storage
    unit. Can work on mix of storage hardware. System
    scales up when units are replaced by ones with
    advanced technology.
  • addition and replacement are fine
  • removal still has issues

7
Criteria: Namespace
  • Provides a hierarchical namespace mountable with POSIX access on approx. 2,000 nodes (approx. 25,000 nodes on Red Storm at SNL)
  • Supports millions of online (and tape-resident) files
  • 74 million at LLNL
  • Client processes can open at least 100 files, and tens of thousands of files can be open for read in the system. We have not tested this requirement, but Lustre's metadata server does not limit the number of open files
  • Supports hundreds of metadata ops/sec without affecting I/O
  • The measured metadata rate is better by a factor of 10-100
  • It must be possible to make a backup or dump of
    the namespace metadata without taking the
    system down
  • We perform hourly LVM snapshots, but we really need the equivalent of transaction logging (a minimal backup sketch follows this list)
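
To illustrate the hourly-snapshot approach above, here is a minimal sketch, assuming the MDT sits on an LVM logical volume; the volume group, snapshot size, mount point and backup path are all hypothetical, and details such as quiescing and full extended-attribute handling are simplified.

```python
# Minimal sketch of an LVM-snapshot-based backup of the namespace metadata (MDT).
# Assumptions: the MDT is an ldiskfs filesystem on /dev/vg_mdt/mdt (hypothetical
# names) and the volume group has free space for the copy-on-write snapshot.
import datetime
import subprocess

VG, LV = "vg_mdt", "mdt"                                  # hypothetical LVM names
SNAP = f"mdt-snap-{datetime.datetime.now():%Y%m%d%H%M}"   # timestamped snapshot name
MNT = "/mnt/mdt-snap"                                     # hypothetical mount point

# 1. Take a copy-on-write snapshot of the MDT volume; the MDS keeps running.
subprocess.run(["lvcreate", "--snapshot", "--size", "10G",
                "--name", SNAP, f"/dev/{VG}/{LV}"], check=True)
try:
    # 2. Mount the snapshot read-only and archive it, keeping extended attributes.
    subprocess.run(["mount", "-t", "ldiskfs", "-o", "ro",
                    f"/dev/{VG}/{SNAP}", MNT], check=True)
    subprocess.run(["tar", "--xattrs", "-czf", f"/backup/{SNAP}.tar.gz",
                    "-C", MNT, "."], check=True)
finally:
    # 3. Clean up so snapshots do not fill the volume group over time.
    subprocess.run(["umount", MNT])
    subprocess.run(["lvremove", "-f", f"/dev/{VG}/{SNAP}"])
```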

8
Criteria: Scaling, Data Transfers and Recovery
  • Must support at least 600 WAN and 6000 LAN
    transfers simultaneously, with a mixture of
    writes and reads (perhaps 1 to 4 ratio). One set
    must not starve the other
  • Must be able to control number of WAN and LAN
    transfers independently, and/or set limits for
    each transfer protocol
  • Must be able to limit striping across storage units to contain the impact of a total disk failure to a small percentage of files (see the sketch after this list)
  • Serving hot files to multiple clients may conflict with less striping
  • We are interested in some future features:
  • file migration for space rebalancing
  • a set of tools for Information Lifecycle Management
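
As an illustration of the striping-limit criterion above, a minimal sketch using Lustre's standard `lfs setstripe` / `lfs getstripe` client commands; the directory path and stripe count are placeholders.

```python
# Minimal sketch: keep the stripe count small for a directory tree so that each
# new file lives on one OST, which limits the files affected by a disk failure.
import subprocess

def limit_striping(directory: str, stripe_count: int = 1) -> None:
    """Set the default stripe count for files created under `directory`."""
    subprocess.run(["lfs", "setstripe", "-c", str(stripe_count), directory],
                   check=True)

def show_striping(path: str) -> str:
    """Return the current striping layout of a file or directory."""
    result = subprocess.run(["lfs", "getstripe", path],
                            check=True, capture_output=True, text=True)
    return result.stdout

if __name__ == "__main__":
    limit_striping("/lustre/experiment/data", stripe_count=1)  # placeholder path
    print(show_striping("/lustre/experiment/data"))
```

Hot files served to many clients may instead want a larger stripe count per directory, which is exactly the trade-off noted in the list above.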

9
Criteria: Data Integrity, Security
  • Support for hashing or checksum algorithms
  • Adler32, CRC32 and more will be provided in Lustre v2.0. End-to-end data integrity will be provided by the ZFS DMU
  • The system must scan itself, or allow scanning, for silent file corruption without undue impact on performance (a scanning sketch follows this list)
  • under investigation
  • Security over the WAN is provided by WAN
    protocols such as GridFTP, SRM, etc.
  • We require communication integrity rather than
    confidentiality
  • Kerberos support will be available in Lustre v2.0
  • ACLs, user/group quotas
  • Space management (v2.x ? )
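
To make the checksum and silent-corruption scanning items above concrete, here is a minimal sketch using the Adler32 and CRC32 routines from Python's zlib (the same algorithms named for Lustre v2.0); the catalogue of expected checksums is an assumption of the sketch, not something Lustre provides.

```python
# Minimal sketch: compute Adler32/CRC32 checksums of files in large chunks and
# compare them to previously recorded values to detect silent corruption.
# The catalogue (path -> checksums) is a plain dict here; a real scanner would
# persist it and throttle I/O to limit the impact on production traffic.
import zlib
from pathlib import Path

CHUNK = 4 * 1024 * 1024  # read in 4 MiB chunks to keep memory use flat

def file_checksums(path: Path) -> tuple[int, int]:
    """Return (adler32, crc32) of a file, streamed chunk by chunk."""
    adler, crc = 1, 0
    with path.open("rb") as f:
        while chunk := f.read(CHUNK):
            adler = zlib.adler32(chunk, adler)
            crc = zlib.crc32(chunk, crc)
    return adler, crc

def scan(root: Path, catalogue: dict[str, tuple[int, int]]) -> list[str]:
    """Return paths whose current checksums differ from the catalogue."""
    corrupted = []
    for path in root.rglob("*"):
        if path.is_file() and str(path) in catalogue:
            if file_checksums(path) != catalogue[str(path)]:
                corrupted.append(str(path))
    return corrupted
```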

10
HPC Specific Criteria and Lustre
  • Lustre on Computational Cosmology and LQCD
    clusters
  • Large, transparently (without downtime) extensible, hierarchical file system accessible through standard Unix system calls
  • Parallel file access
  • File system visible to all executables, with
    possibility of parallel I/O
  • Deadlock free for MPI jobs
  • POSIX IO
  • More stable than NFS (Computational Cosmology
    only)
  • Ability to run on commodity hardware and Linux OS
    to reuse existing hardware

11
HSM-Related Criteria
  • Integration of the file system with the site's HSM (e.g., Enstore, CASTOR, HPSS) is required for use at large HEP installations
  • The HSM shall provide transparent access to 10 to
    100 PB of data on tape (growing in time)
  • Must be able to create file stubs in Lustre for
    millions of files already existing in HSM in a
    reasonable time
  • Automatically migrates designated files to tape
  • Transparent file restore on open()
  • Pre-stage large file sets from HSM to disk.
    Enqueue many read requests (current CMS FNAL T1
    peak 30,000) with O(100) active transfers
  • Evict files already archived when disk space is
    needed

12
Lustre HSM Feature
  • Lustre does not yet have an HSM feature. Some sites implement simple tape backup schemes
  • The HSM integration feature is under development by CEA and Sun
  • HSM version v1.0
  • Basic HSM in a future release of Lustre (beta in fall 2009?)
  • Integration with HPSS (v1); others will follow
  • Metadata scans to select files to store in HSM (v1)
  • File store on close() / on write in HSM (v2)
  • Integration work
  • Work specific to the HSM is required for
    integration

13
HEP and General Criteria and Lustre
  • GridFTP server from the Globus Toolkit v4.0.7 worked out of the box (see the transfer sketch after this list)
  • A BeStMan SRM gateway server is installed on the LQCD cluster
  • StoRM SRM performance on top of Lustre is reported at this conference
  • Open source, training and commercial support
    available
  • Issues reported to Lustre-discuss list are
    quickly answered
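
As a usage illustration of the GridFTP access mentioned in the first bullet, a minimal sketch that drives globus-url-copy to move a local file into a Lustre-backed path; the host name, paths and stream count are hypothetical.

```python
# Minimal sketch: copy a local file into a Lustre-backed area through the
# GridFTP server, using the standard globus-url-copy client.
import subprocess

def gridftp_put(local_path: str, host: str, remote_path: str, streams: int = 4) -> None:
    """Transfer local_path to gsiftp://host/remote_path with parallel TCP streams."""
    subprocess.run(
        ["globus-url-copy",
         "-p", str(streams),                 # number of parallel data streams
         f"file://{local_path}",             # source: local file (absolute path)
         f"gsiftp://{host}{remote_path}"],   # destination: GridFTP server over Lustre
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical host and paths, for illustration only.
    gridftp_put("/data/run123.dat", "gridftp.example.org", "/lustre/store/run123.dat")
```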

14
Lustre Tests at Fermilab
  • Developed test harness and test suite to evaluate
    systems against criteria
  • Torture tests to emulate large loads for large
    data sets
  • metadata stressor: creates millions of files (see the sketch after this list)
  • Used standard tests against a Lustre filesystem
  • data I/O: IOZone, b_eff_io
  • metadata I/O: fileop/IOZone, metarates, mdtest
  • Pilot test system was used to get experience and
    validate system stability
  • Initial throughputs were limited by the disks used for Lustre's data storage. Subsequent installations used high-performance disk arrays and achieved higher speeds.
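
For illustration, a minimal sketch of a metadata stressor in the spirit of the one above: it creates, stats and unlinks many small files under a Lustre-mounted directory and reports the achieved rates. The directory and file count are placeholders, not the actual harness.

```python
# Minimal sketch of a metadata stressor: create, stat and unlink many small
# files in a Lustre-mounted directory and report the achieved operation rates.
import os
import time

def stress_metadata(directory: str, n_files: int = 100_000) -> None:
    os.makedirs(directory, exist_ok=True)
    paths = [os.path.join(directory, f"f{i:08d}") for i in range(n_files)]

    for name, op in [
        ("create", lambda p: open(p, "w").close()),  # file creation (touch-like)
        ("stat",   os.stat),                         # metadata lookups
        ("unlink", os.unlink),                       # deletes
    ]:
        start = time.monotonic()
        for p in paths:
            op(p)
        rate = n_files / (time.monotonic() - start)
        print(f"{name}: {rate:,.0f} ops/s")

if __name__ == "__main__":
    stress_metadata("/lustre/scratch/mdstress", n_files=10_000)  # placeholder path
```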

15
Lustre Test System
  • Three to five client nodes, two data servers
  • Each node: dual quad-core Xeon CPUs, 16 GB memory, 500 GB local SATA disk
16
Lustre Metadata Rates
Single-client metadata rates measured with the fileop/IOZone benchmark
17
Lustre Metadata Rates
Aggregate metadata rates measured with the metarates benchmark for 1 to 128 clients running on 5 nodes with 8 cores each
18
Lustre on Computational Cosmology Cluster
  • 125 compute nodes
  • 1 Gb Ethernet stackable switch (SMS 8848M)
  • 4 Lustre data servers, one shared with the metadata server (250 GB on LVM)
  • Lustre data: 2 SATABeasts, 72 TB RAID6, 2 × 12 LUNs (3 volumes × 4 partitions each)
19
Lustre on FNAL LQCD clusters
  • Lustre InfiniBand network, clients attached via IPoIB
  • 1 metadata server (750 GB RAID1)
  • 4 file servers, 72 TB data, RAID6, two 4 Gbit FC links per SATABeast
  • 10 GigE link between the LCC and GCC buildings
20
Lustre Experience - HPC
  • From our experience in production on the Computational Cosmology Cluster (starting summer 2008) and limited pre-production on the LQCD JPsi cluster (December 2008):
  • Lustre doesn't suffer the MPI deadlocks of dCache
  • direct access eliminates the staging of files to/from worker nodes that was needed with dCache (POSIX I/O)
  • improved I/O rates compared to NFS and eliminated periodic NFS server freezes
  • reduced administration effort

21
Conclusions - HEP
  • The Lustre file system meets and exceeds our storage evaluation criteria in most areas, such as system capacity, scalability, I/O performance, functionality, stability and high availability, accessibility, maintenance, and WAN access.
  • Lustre has much faster metadata performance than
    our current storage system.
  • At present Lustre can only be used for HEP applications not requiring large-scale tape I/O, such as LHC T2/T3 centers or scratch or volatile disk space at T1 centers.
  • Lustre's near-term roadmap (about one year) for HSM in principle satisfies our HSM criteria. Some work will still be needed to integrate any existing tape system.

22
  • Backup Slides

23
Lustre Jargon
  • Client: client node where the user application runs; it talks to the MetaData Server and the data servers (OSS)
  • MDS: MetaData Server, the node serving metadata; one active per system
  • MDT: MetaData Target, disk storage for metadata, connected to the MDS
  • OSS: Object Store Server, the node serving data files or file stripes
  • OST: Object Store Target, disk storage for data files or file stripes, connected to an OSS

24
What is Lustre?
(Diagram: client nodes (workers) connect over the Lustre network to a metadata server and to data servers with attached data storage)