1
The RHIC Computing Facility at BNL
  • HEPIX-HEPNT
  • Vancouver, BC, Canada
  • October 20, 2003
  • Ofer Rind
  • RHIC Computing Facility
  • Brookhaven National Laboratory

2
RCF - Overview
  • Brookhaven National Lab is a multi-disciplinary
    DOE research laboratory
  • RCF formed in the mid-90s to provide computing
    infrastructure for the RHIC experiments. Named US
    Atlas Tier 1 computing center in late 90s
  • Currently supports both HENP and HEP scientific
    computing efforts as well as various general
    services (backup, email, web hosting, off-site
    data transfer)
  • 25 FTEs (expanding soon)
  • RHIC Run-3 completed in Spring. Run-4 slated to
    begin in Dec/Jan

3
RCF Structure
4
Mass Storage
  • 4 StorageTek tape silos managed by HPSS (v4.5)
  • Upgraded to 37 9940B drives (200GB/cartridge)
    prior to Run-3 (2 mos. to migrate data)
  • Total data store of 836TB (4500TB capacity)
  • Aggregate bandwidth of up to 700MB/s; expect 300MB/s
    in the next run
  • 9 data movers with 9TB of disk (Future: array to be
    fully replaced after the next run with faster disk)
  • Access via pftp and HSI, both integrated with K5
    authentication (Future: authentication through Globus
    certificates)
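
A minimal sketch of what HSI access to HPSS can look like from a user script, assuming an hsi client in the PATH and a valid K5 ticket already in hand; the HPSS and local paths are hypothetical, and this is not an RCF-specific tool.

  import subprocess

  def hpss_get(hpss_path, local_path):
      # Stage one file out of HPSS with the HSI client
      # ("get local : remote" is HSI's copy-to-local syntax).
      cmd = ["hsi", "get %s : %s" % (local_path, hpss_path)]
      result = subprocess.run(cmd, capture_output=True, text=True)
      if result.returncode != 0:
          raise RuntimeError("hsi get failed: %s" % result.stderr.strip())

  # Hypothetical paths, for illustration only.
  hpss_get("/hpss/rhic/run3/raw/example.dat", "/scratch/example.dat")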

5
Mass Storage
6
Centralized Disk Storage
  • Large SAN served via NFS
  • Processed data store, user home directories and
    scratch
  • 16 Brocade switches and 150TB of Fibre Channel RAID5
    managed by Veritas (MTI, Zzyzx peripherals)
  • 25 Sun servers (E450, V480) running Solaris 8 (load
    issues with nfsd and mountd precluded update to
    Solaris 9)
  • Can deliver data to farm at up to 55MB/sec/server
  • RHIC and USAtlas AFS cells
  • Software repository, user home directories
  • Total of 11 AIX servers, 1.2TB (RHIC), 0.5TB (Atlas)
  • Transarc on server side, OpenAFS on client side
  • RHIC cell recently renamed (standardized)

7
Centralized Disk Storage
(Photos: Sun E450 servers and MTI / Zzyzx storage hardware)
8
The Linux Farm
  • 1097 dual Intel CPU VA and IBM rackmounted servers;
    total of 918 kSpecInt2000
  • Nodes allocated by experiment and further divided for
    reconstruction and analysis
  • 1GB memory, typically 1.5GB swap
  • Combination of local SCSI/IDE disk with aggregate
    storage of >120TB available to users
  • Experiments starting to make significant use of
    local disk through custom job schedulers, data
    repository managers and rootd

9
The Linux Farm
10
The Linux Farm
  • Most RHIC nodes recently upgraded to latest RH8
    rev. (Atlas still at RH7.3)
  • Installation of customized image via Kickstart
    server
  • Support for networked file systems (NFS, AFS) as
    well as distributed local data storage
  • Support for open-source and commercial compilers
    (gcc, PGI, Intel) and debuggers (gdb, TotalView,
    Intel)

11
Linux Farm - Batch Management
  • Central Reconstruction Farm
  • Up to now, data reconstruction was managed by a
    locally produced Perl-based batch system
  • Over the past year, this has been completely
    rewritten as a Python-based custom frontend to
    Condor
  • Leverages DAGMan functionality to manage job
    dependencies
  • User defines the task using a JDL identical to that of
    the former system; the Python DAG-builder then creates
    the job and submits it to the Condor pool (see the
    sketch after this list)
  • Tk GUI provided to users to manage their own jobs
  • Job progress and file transfer status monitored
    via Python interface to a MySQL backend
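
A minimal sketch of the DAG-builder idea described above, assuming stock Condor/DAGMan tooling (condor_submit_dag) on the submit host; the task structure, file names and executables are hypothetical, and this is not the actual RCF frontend or its JDL.

  import subprocess

  def build_and_submit(task_name, steps):
      # steps: list of (name, executable, arguments, parent_names).
      dag_lines = []
      for name, exe, args, parents in steps:
          sub_file = "%s.sub" % name
          with open(sub_file, "w") as f:
              # One vanilla-universe submit description per step.
              f.write("universe   = vanilla\n")
              f.write("executable = %s\n" % exe)
              f.write("arguments  = %s\n" % args)
              f.write("output     = %s.out\nerror      = %s.err\n" % (name, name))
              f.write("log        = %s.log\nqueue\n" % name)
          dag_lines.append("JOB %s %s" % (name, sub_file))
          dag_lines += ["PARENT %s CHILD %s" % (p, name) for p in parents]
      dag_file = "%s.dag" % task_name
      with open(dag_file, "w") as f:
          f.write("\n".join(dag_lines) + "\n")
      # Hand the whole dependency graph to DAGMan.
      subprocess.run(["condor_submit_dag", dag_file], check=True)

  # Hypothetical two-step task: staging must finish before reconstruction.
  build_and_submit("run4_reco", [
      ("stage", "/usr/local/bin/stage_from_hpss.sh", "example.dat", []),
      ("reco",  "/usr/local/bin/run_reco.sh",        "example.dat", ["stage"]),
  ])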

12
Linux Farm - Batch Management
  • Central Reconstruction Farm (cont.)
  • New system solves scalability problems of former
    system
  • Currently deployed for one experiment, with others
    expected to follow prior to Run-4

13
Linux Farm - Batch Management
  • Central Analysis Farm
  • LSF 5.1 licensed on virtually all nodes, allowing
    use of CRS nodes in between data reconstruction
    runs
  • One master for all RHIC queues, one for Atlas
  • Allows efficient use of limited hardware,
    including moderation of NFS server loads through
    (voluntary) shared resources
  • Peak dispatch rates of up to 350K jobs/week and
    6K jobs/hour
  • Condor is being deployed and tested as a possible
    complement or replacement; still nascent, awaiting
    some features expected in an upcoming release
  • Both accepting jobs through Globus gatekeepers
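
A minimal sketch of the two submission paths mentioned above, assuming LSF's bsub and the GT2-era globus-job-submit client are available; the queue name and gatekeeper contact string are hypothetical placeholders, not the facility's actual configuration.

  import subprocess

  def submit_lsf(command, queue="cas_short"):
      # Local submission to the Central Analysis Farm via LSF.
      subprocess.run(["bsub", "-q", queue, command], check=True)

  def submit_grid(command,
                  gatekeeper="gatekeeper.example.bnl.gov/jobmanager-lsf"):
      # Grid submission through a Globus gatekeeper (GT2 style),
      # which hands the job to the batch system behind it.
      subprocess.run(["globus-job-submit", gatekeeper, command], check=True)

  submit_lsf("/usr/local/bin/analysis.sh")
  submit_grid("/usr/local/bin/analysis.sh")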

14
Security & Authentication
  • Two layers of firewall with limited network
    services and limited interactive access
    exclusively through secured gateways
  • Conversion to Kerberos5-based single sign-on
    paradigm
  • Simplify life by consolidating password databases
    (NIS/Unix, SMB, email, AFS, Web). SSH gateway
    authentication → password-less access inside the
    facility with automatic AFS token acquisition (see
    the sketch after this list)
  • RCF status: AFS/K5 fully integrated; dual K5/NIS
    authentication, with NIS to be eliminated soon
  • USAtlas status: K4/K5 parallel authentication paths
    for AFS, with full K5 integration on Nov. 1; NIS
    passwords already gone
  • Ongoing work to integrate K5/AFS with LSF, solve
    credential forwarding issues with multihomed
    hosts, and implement a Kerberos certificate
    authority
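
A minimal sketch of the single sign-on step described above, assuming MIT Kerberos kinit and OpenAFS aklog on the client; the principal and cell names are hypothetical placeholders.

  import subprocess

  def k5_sign_on(principal="user@EXAMPLE.BNL.GOV", cell="rhic"):
      # Obtain a Kerberos 5 TGT (kinit prompts for the password) ...
      subprocess.run(["kinit", principal], check=True)
      # ... then convert it into an AFS token for the given cell.
      subprocess.run(["aklog", "-c", cell], check=True)

  k5_sign_on()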

15
US Atlas Grid Testbed
(Diagram: testbed layout showing the giis01 information server, a Globus gatekeeper/job manager and Globus client, GridFTP, the Globus RLS server, an AFS server, HPSS with the amds mover, the LSF (Condor) pool, 17TB of disk, hosts aafs, atlas02 and aftpexp00, a 70MB/s link, and grid job requests arriving from the Internet.)
Local Grid development currently focused on
monitoring and user management
16
Monitoring & Control
  • Facility monitored by a cornucopia of
    vendor-provided, open-source and home-grown
    software. Recently:
  • Ganglia was deployed on the entire farm, as well
    as the disk servers
  • Python-based Farm Alert scripts were changed from SSH
    push (slow), to multi-threaded SSH pull (still too
    slow), to TCP/IP push, which finally solved the
    scalability issues (a sketch follows below)
  • Cluster management software is a requirement for
    linux farm purchases (VACM, xCAT)
  • Console access, power up/down: really came in useful
    this summer!
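
A minimal sketch of the TCP/IP push approach that the Farm Alert scripts moved to, assuming a central collector listening on a well-known port; the host name, port and message format are hypothetical, not the actual Farm Alert protocol.

  import socket

  COLLECTOR = ("alarm-collector.example.bnl.gov", 9999)   # hypothetical

  def push_alert(message):
      # Node side: open a short-lived TCP connection and push one alert.
      with socket.create_connection(COLLECTOR, timeout=5) as s:
          s.sendall((message + "\n").encode())

  def run_collector(port=9999):
      # Collector side: accept connections and log each alert line.
      srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      srv.bind(("", port))
      srv.listen(64)
      while True:
          conn, addr = srv.accept()
          with conn:
              line = conn.recv(4096).decode().strip()
              print("ALERT from %s: %s" % (addr[0], line))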

17
The Great Blackout of '03
18
Future Plans & Initiatives
  • Linux farm expansion this winter: addition of >100 2U
    servers packed with local disk
  • Plans to move beyond NFS-served SAN with more
    scalable solutions
  • Panasas - file system striping at block level
    over distributed clients
  • dCache - potential for managing distributed disk
    repository
  • Continuing development of grid services with
    increasing implementation by the two large RHIC
    experiments
  • Very successful RHIC run with a large
    high-quality dataset!