1
RHIC/US ATLAS Tier 1 Computing Facility Site Report
HEPiX, Upton, NY, USA, October 18, 2004
  • Christopher Hollowell
  • Physics Department
  • Brookhaven National Laboratory
  • hollowec@bnl.gov

2
Facility Overview
  • Created in the mid-1990s to provide centralized computing services for the RHIC experiments
  • Expanded our role in the late 1990s to act as the Tier 1 computing center for ATLAS in the United States
  • Currently employ 28 staff members; planning to add 5 additional employees in the next fiscal year

3
Facility Overview (Cont.)
  • Ramping up resources provided to ATLAS; Data Challenge 2 (DC2) underway
  • RHIC Run 5 scheduled to begin in late December
    2004

4
Centralized Disk Storage
  • 37 NFS servers running Solaris 9; recently upgraded from Solaris 8
  • Underlying filesystems upgraded to VxFS 4.0
  • Issue with quotas on filesystems larger than 1 TB
  • 220 TB of Fibre Channel SAN-based RAID5 storage available; added 100 TB in the past year

5
Centralized Disk Storage (Cont.)
  • Scalability issues with NFS (network-limited to 70 MB/s max per server; 75-90 MB/s max local I/O in our configuration); testing of new network storage models, including Panasas and IBRIX, in progress
  • Panasas tests look promising. 4.5 TB of storage
    on 10 blades available for evaluation by our user
    community. DirectFlow client in use on over 400
    machines
  • Both systems allow for NFS export of data

6
Centralized Disk Storage (Cont.)
7
Centralized Disk Storage AFS
  • Moving servers from Transarc AFS running on AIX to OpenAFS 1.2.11 on Solaris 9 (volume migration sketch below)
  • The move from Transarc to OpenAFS motivated by Kerberos4/Kerberos5 issues and Transarc AFS end-of-life
  • Total of 7 fileservers and 6 DB servers; 2 DB servers and 2 fileservers running OpenAFS
  • 2 Cells
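
Moving volumes off the retiring Transarc fileservers onto the new OpenAFS servers would typically be done with "vos move"; the sketch below assumes that approach, and the server, partition, and volume names are hypothetical.

```python
#!/usr/bin/env python
# Hedged sketch: drain AFS volumes from an old Transarc fileserver onto a
# new OpenAFS fileserver using "vos move". All names below are invented.
import subprocess

def move_volume(volume, src_server, src_part, dst_server, dst_part):
    # vos move <volume> <src server> <src partition> <dst server> <dst partition>
    subprocess.check_call(["vos", "move", volume,
                           src_server, src_part,
                           dst_server, dst_part])

if __name__ == "__main__":
    for vol in ["user.alice", "user.bob", "proj.example"]:
        move_volume(vol, "afs-old01.example.gov", "/vicepa",
                    "afs-new01.example.gov", "/vicepa")
```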

8
Mass Tape Storage
  • Four STK Powderhorn silos, each capable of holding 6,000 tapes
  • 1.7 PB of data currently stored
  • HPSS version 4.5.1; likely upgrade to version 6.1 or 6.2 after RHIC Run 5
  • 45 tape drives available for use
  • Latest STK tape technology: 200 GB/tape
  • 12 TB disk cache in front of the system

9
Mass Tape Storage (Cont.)
  • PFTP, HSI and HTAR available as interfaces
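
As a rough illustration of the HSI and HTAR interfaces, the sketch below drives both from a script; the local file names and HPSS destination paths are illustrative, not actual facility paths.

```python
#!/usr/bin/env python
# Hedged sketch of using the HSI and HTAR interfaces to HPSS from a script.
# The local files and HPSS destination paths are purely illustrative.
import subprocess

def hsi_put(local_path, hpss_path):
    # hsi takes its command as a single argument: "put <local> : <hpss>"
    subprocess.check_call(["hsi", "put %s : %s" % (local_path, hpss_path)])

def htar_archive(hpss_tar_path, directory):
    # htar writes a tar archive of the directory directly into HPSS,
    # keeping many small files off the tape system as a single object
    subprocess.check_call(["htar", "-cvf", hpss_tar_path, directory])

if __name__ == "__main__":
    hsi_put("calib_run5.root", "/home/user/calib_run5.root")
    htar_archive("/home/user/dst_run5.tar", "dst_run5")
```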

10
CAS/CRS Farm
  • Farm of 1423 dual-CPU (Intel) systems
  • Added 335 machines this year
  • 245 TB local disk storage (SCSI and IDE)
  • Upgrade of the RHIC Central Analysis Servers/Central Reconstruction Servers (CAS/CRS) to Scientific Linux 3.0.2 (updates) underway; should be complete before the next RHIC run

11
CAS/CRS Farm (Cont.)
  • LSF (5.1) and Condor (6.6.6/6.6.5) batch systems
    in use. Upgrade to LSF 6.0 planned
  • Kickstart used to automate node installation
  • GANGLIA and custom software used for system monitoring
  • Phasing out the original RHIC CRS batch system; replacing it with a system based on Condor (submission sketch after this list)
  • Retiring 142 VA Linux 2U PIII 450 MHz systems
    after next purchase
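
A minimal sketch of how a reconstruction job might be handed to the Condor-based replacement for the CRS batch system: write a submit description and pass it to condor_submit. The executable, arguments, and file names are hypothetical, not the facility's actual job wrappers.

```python
#!/usr/bin/env python
# Hedged sketch: submit a reconstruction job to Condor by writing a submit
# description and calling condor_submit. Names below are hypothetical.
import os
import subprocess
import tempfile

SUBMIT_DESCRIPTION = """\
universe   = vanilla
executable = reco.sh
arguments  = run5_segment_001
output     = reco_001.out
error      = reco_001.err
log        = reco_001.log
queue
"""

def submit(description):
    fd, path = tempfile.mkstemp(suffix=".sub")
    try:
        os.write(fd, description.encode())
        os.close(fd)
        subprocess.check_call(["condor_submit", path])  # hand the job to the scheduler
    finally:
        os.remove(path)

if __name__ == "__main__":
    submit(SUBMIT_DESCRIPTION)
```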

12
CAS/CRS Farm (Cont.)
13
CAS/CRS Farm (Cont.)
14
Security
  • Elimination of NIS, complete transition to
    Kerberos5/LDAP in progress
  • Expect K5 TGT to X.509 certificate transition in the future (KCA?)
  • Hardening/monitoring of all internal systems
  • Growing web service issues: unknown services accessed through port 80

15
Grid Activities
  • Brookhaven planning to upgrade external network connectivity from OC12 (622 Mbps) to OC48 (2.488 Gbps) to support ATLAS activity
  • ATLAS Data Challenge 2 jobs submitted via Grid3
  • GUMS (Grid User Management System)
  • Generates grid-mapfiles for gatekeeper hosts (example format below)
  • In production since May 2004
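
For illustration, a toy sketch of the grid-mapfile format that GUMS produces for the gatekeepers: each line pairs a certificate distinguished name (DN) with a local account. The DNs and account names here are invented.

```python
#!/usr/bin/env python
# Toy sketch of the grid-mapfile that GUMS generates for gatekeeper hosts:
# one line per mapping of a certificate DN to a local account.
# All DNs and account names below are invented for the example.
DN_TO_ACCOUNT = {
    "/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 123456": "usatlas1",
    "/DC=org/DC=doegrids/OU=People/CN=John Collider 654321": "grid01",
}

def write_grid_mapfile(mapping, path="grid-mapfile"):
    with open(path, "w") as out:
        for dn, account in sorted(mapping.items()):
            # the gatekeeper expects the DN in quotes, then the local user name
            out.write('"%s" %s\n' % (dn, account))

if __name__ == "__main__":
    write_grid_mapfile(DN_TO_ACCOUNT)
```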

16
Storage Resource Manager (SRM)
  • SRM middleware providing dynamic storage
    allocation and data management services
  • Automatically handles network/space allocation
    failures
  • HRM (Hierarchical Resource Manager)-type SRM
    server in production
  • Accessible from within and outside the facility
  • 350 GB Cache
  • Berkeley HRM 1.2.1

17
dCache
  • Provides global name space over disparate storage
    elements
  • Hot spot detection
  • Client software: data access through the libdcap library or the libpdcap preload library (usage sketch below)
  • ATLAS and PHENIX dCache pools
  • PHENIX pool: expanding performance tests to production machines
  • ATLAS pool: interacting with HPSS using HSI; no way of throttling data transfer requests as of yet
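
A rough sketch of the preload approach: setting LD_PRELOAD to libpdcap lets an unmodified program open /pnfs paths with its I/O routed through dCache. The library path, program, and data path below are illustrative assumptions, not facility-specific settings.

```python
#!/usr/bin/env python
# Hedged sketch: run an unmodified program against dCache by preloading
# libpdcap, so ordinary POSIX open()/read() calls on /pnfs paths go
# through the dCache door. Library, program, and data paths are
# illustrative only.
import os
import subprocess

def run_with_pdcap(command, preload="/usr/lib/libpdcap.so"):
    env = dict(os.environ)
    env["LD_PRELOAD"] = preload  # intercept the process's file I/O calls
    subprocess.check_call(command, env=env)

if __name__ == "__main__":
    run_with_pdcap(["md5sum", "/pnfs/example.gov/data/run5/dst_0001.root"])
```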