1
RHIC/US ATLAS Tier 1 Computing Facility Site Report
HEPiX, Upton, NY, USA, October 18, 2004
  • Christopher Hollowell
  • Physics Department
  • Brookhaven National Laboratory
  • hollowec@bnl.gov

2
Facility Overview
  • Created in the mid-1990s to provide centralized computing services for the RHIC experiments
  • Expanded our role in the late 1990s to act as the Tier 1 computing center for ATLAS in the United States
  • Currently employ 28 staff members; planning to add 5 additional employees in the next fiscal year

3
Facility Overview (Cont.)
  • Ramping up resources provided to ATLAS; Data Challenge 2 (DC2) underway
  • RHIC Run 5 scheduled to begin in late December
    2004

4
Centralized Disk Storage
  • 37 NFS servers running Solaris 9; recently upgraded from Solaris 8
  • Underlying filesystems upgraded to VxFS 4.0
  • Issue with quotas on filesystems larger than 1 TB
  • 220 TB of Fibre Channel SAN-based RAID5 storage available; added 100 TB in the past year

5
Centralized Disk Storage (Cont.)
  • Scalability issues with NFS (network-limited to 70 MB/s max per server; 75-90 MB/s max local I/O in our configuration); testing of new network storage models, including Panasas and IBRIX, in progress
  • Panasas tests look promising. 4.5 TB of storage
    on 10 blades available for evaluation by our user
    community. DirectFlow client in use on over 400
    machines
  • Both systems allow for NFS export of data

6
Centralized Disk Storage (Cont.)
7
Centralized Disk Storage AFS
  • Moving servers from Transarc AFS running on AIX to OpenAFS 1.2.11 on Solaris 9 (volume migration sketch below)
  • The move from Transarc to OpenAFS motivated by Kerberos4/Kerberos5 issues and Transarc AFS end-of-life
  • Total of 7 fileservers and 6 DB servers; 2 DB servers and 2 fileservers running OpenAFS
  • 2 Cells
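
Moving volumes off the retiring Transarc fileservers onto the new OpenAFS servers would typically be done with "vos move"; the sketch below assumes that approach, and the server, partition, and volume names are hypothetical.

```python
#!/usr/bin/env python
# Hedged sketch: drain AFS volumes from an old Transarc fileserver onto a
# new OpenAFS fileserver using "vos move". All names below are invented.
import subprocess

def move_volume(volume, src_server, src_part, dst_server, dst_part):
    # vos move <volume> <src server> <src partition> <dst server> <dst partition>
    subprocess.check_call(["vos", "move", volume,
                           src_server, src_part,
                           dst_server, dst_part])

if __name__ == "__main__":
    for vol in ["user.alice", "user.bob", "proj.example"]:
        move_volume(vol, "afs-old01.example.gov", "/vicepa",
                    "afs-new01.example.gov", "/vicepa")
```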

8
Mass Tape Storage
  • Four STK Powderhorn silos, each capable of holding 6,000 tapes
  • 1.7 PB of data currently stored
  • HPSS version 4.5.1; likely upgrade to version 6.1 or 6.2 after RHIC Run 5
  • 45 tape drives available for use
  • Latest STK tape technology: 200 GB/tape
  • 12 TB disk cache in front of the system

9
Mass Tape Storage (Cont.)
  • PFTP, HSI and HTAR available as interfaces
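
As a rough illustration of the HSI and HTAR interfaces, the sketch below drives both from a script; the local file names and HPSS destination paths are illustrative, not actual facility paths.

```python
#!/usr/bin/env python
# Hedged sketch of using the HSI and HTAR interfaces to HPSS from a script.
# The local files and HPSS destination paths are purely illustrative.
import subprocess

def hsi_put(local_path, hpss_path):
    # hsi takes its command as a single argument: "put <local> : <hpss>"
    subprocess.check_call(["hsi", "put %s : %s" % (local_path, hpss_path)])

def htar_archive(hpss_tar_path, directory):
    # htar writes a tar archive of the directory directly into HPSS,
    # keeping many small files off the tape system as a single object
    subprocess.check_call(["htar", "-cvf", hpss_tar_path, directory])

if __name__ == "__main__":
    hsi_put("calib_run5.root", "/home/user/calib_run5.root")
    htar_archive("/home/user/dst_run5.tar", "dst_run5")
```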

10
CAS/CRS Farm
  • Farm of 1423 dual-CPU (Intel) systems
  • Added 335 machines this year
  • 245 TB local disk storage (SCSI and IDE)
  • Upgrade of the RHIC Central Analysis Servers/Central Reconstruction Servers (CAS/CRS) to Scientific Linux 3.0.2 (updates) underway; should be complete before the next RHIC run

11
CAS/CRS Farm (Cont.)
  • LSF (5.1) and Condor (6.6.6/6.6.5) batch systems
    in use. Upgrade to LSF 6.0 planned
  • Kickstart used to automate node installation
  • GANGLIA and custom software used for system monitoring
  • Phasing out the original RHIC CRS batch system; replacing it with a system based on Condor (submission sketch after this list)
  • Retiring 142 VA Linux 2U PIII 450 MHz systems
    after next purchase
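
A minimal sketch of how a reconstruction job might be handed to the Condor-based replacement for the CRS batch system: write a submit description and pass it to condor_submit. The executable, arguments, and file names are hypothetical, not the facility's actual job wrappers.

```python
#!/usr/bin/env python
# Hedged sketch: submit a reconstruction job to Condor by writing a submit
# description and calling condor_submit. Names below are hypothetical.
import os
import subprocess
import tempfile

SUBMIT_DESCRIPTION = """\
universe   = vanilla
executable = reco.sh
arguments  = run5_segment_001
output     = reco_001.out
error      = reco_001.err
log        = reco_001.log
queue
"""

def submit(description):
    fd, path = tempfile.mkstemp(suffix=".sub")
    try:
        os.write(fd, description.encode())
        os.close(fd)
        subprocess.check_call(["condor_submit", path])  # hand the job to the scheduler
    finally:
        os.remove(path)

if __name__ == "__main__":
    submit(SUBMIT_DESCRIPTION)
```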

12
CAS/CRS Farm (Cont.)
13
CAS/CRS Farm (Cont.)
14
Security
  • Elimination of NIS, complete transition to
    Kerberos5/LDAP in progress
  • Expect K5 TGT to X.509 certificate transition in the future (KCA?)
  • Hardening/monitoring of all internal systems
  • Growing web service issues: unknown services accessed through port 80

15
Grid Activities
  • Brookhaven planning to upgrade external network connectivity from OC12 (622 Mbps) to OC48 (2.488 Gbps) to support ATLAS activity
  • ATLAS Data Challenge 2 jobs submitted via Grid3
  • GUMS (Grid User Management System)
  • Generates grid-mapfiles for gatekeeper hosts (example format below)
  • In production since May 2004
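
For illustration, a toy sketch of the grid-mapfile format that GUMS produces for the gatekeepers: each line pairs a certificate distinguished name (DN) with a local account. The DNs and account names here are invented.

```python
#!/usr/bin/env python
# Toy sketch of the grid-mapfile that GUMS generates for gatekeeper hosts:
# one line per mapping of a certificate DN to a local account.
# All DNs and account names below are invented for the example.
DN_TO_ACCOUNT = {
    "/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 123456": "usatlas1",
    "/DC=org/DC=doegrids/OU=People/CN=John Collider 654321": "grid01",
}

def write_grid_mapfile(mapping, path="grid-mapfile"):
    with open(path, "w") as out:
        for dn, account in sorted(mapping.items()):
            # the gatekeeper expects the DN in quotes, then the local user name
            out.write('"%s" %s\n' % (dn, account))

if __name__ == "__main__":
    write_grid_mapfile(DN_TO_ACCOUNT)
```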

16
Storage Resource Manager (SRM)
  • SRM middleware providing dynamic storage
    allocation and data management services
  • Automatically handles network/space allocation
    failures
  • HRM (Hierarchical Resource Manager)-type SRM
    server in production
  • Accessible from within and outside the facility
  • 350 GB Cache
  • Berkeley HRM 1.2.1

17
dCache
  • Provides global name space over disparate storage
    elements
  • Hot spot detection
  • Client software: data access through the libdcap library or the libpdcap preload library (usage sketch below)
  • ATLAS and PHENIX dCache pools
  • PHENIX pool: expanding performance tests to production machines
  • ATLAS pool: interacting with HPSS using HSI; no way of throttling data transfer requests as of yet
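
A rough sketch of the preload approach: setting LD_PRELOAD to libpdcap lets an unmodified program open /pnfs paths with its I/O routed through dCache. The library path, program, and data path below are illustrative assumptions, not facility-specific settings.

```python
#!/usr/bin/env python
# Hedged sketch: run an unmodified program against dCache by preloading
# libpdcap, so ordinary POSIX open()/read() calls on /pnfs paths go
# through the dCache door. Library, program, and data paths are
# illustrative only.
import os
import subprocess

def run_with_pdcap(command, preload="/usr/lib/libpdcap.so"):
    env = dict(os.environ)
    env["LD_PRELOAD"] = preload  # intercept the process's file I/O calls
    subprocess.check_call(command, env=env)

if __name__ == "__main__":
    run_with_pdcap(["md5sum", "/pnfs/example.gov/data/run5/dst_0001.root"])
```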