1
The RHIC Computing Facility at BNL
  • HEPIX-HEPNT
  • Vancouver, BC, Canada
  • October 20, 2003
  • Ofer Rind
  • RHIC Computing Facility
  • Brookhaven National Laboratory

2
RCF - Overview
  • Brookhaven National Lab is a multi-disciplinary
    DOE research laboratory
  • RCF formed in the mid-90s to provide computing
    infrastructure for the RHIC experiments. Named US
    Atlas Tier 1 computing center in late 90s
  • Currently supports both HENP and HEP scientific
    computing efforts as well as various general
    services (backup, email, web hosting, off-site
    data transfer)
  • 25 FTEs (expanding soon)
  • RHIC Run-3 completed in Spring. Run-4 slated to
    begin in Dec/Jan

3
RCF Structure
4
Mass Storage
  • 4 StorageTek tape silos managed by HPSS (v4.5)
  • Upgraded to 37 9940B drives (200GB/cartridge)
    prior to Run-3 (2 mos. to migrate data)
  • Total data store of 836TB (4500TB capacity)
  • Aggregate bandwidth of up to 700MB/s; expect 300MB/s
    in the next run
  • 9 data movers with 9TB of disk (Future: array to be
    fully replaced after the next run with faster disk)
  • Access via pftp and HSI, both integrated with K5
    authentication (Future: authentication through Globus
    certificates)
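
A minimal sketch of what HSI access to HPSS can look like from a user script, assuming an hsi client in the PATH and a valid K5 ticket already in hand; the HPSS and local paths are hypothetical, and this is not an RCF-specific tool.

  import subprocess

  def hpss_get(hpss_path, local_path):
      # Stage one file out of HPSS with the HSI client
      # ("get local : remote" is HSI's copy-to-local syntax).
      cmd = ["hsi", "get %s : %s" % (local_path, hpss_path)]
      result = subprocess.run(cmd, capture_output=True, text=True)
      if result.returncode != 0:
          raise RuntimeError("hsi get failed: %s" % result.stderr.strip())

  # Hypothetical paths, for illustration only.
  hpss_get("/hpss/rhic/run3/raw/example.dat", "/scratch/example.dat")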

5
Mass Storage
6
Centralized Disk Storage
  • Large SAN served via NFS
  • Processed data store, user home directories and
    scratch
  • 16 Brocade switches and 150TB of Fibre Channel RAID5
    managed by Veritas (MTI, Zzyzx peripherals)
  • 25 Sun servers (E450, V480) running Solaris 8 (load
    issues with nfsd and mountd precluded update to
    Solaris 9)
  • Can deliver data to farm at up to 55MB/sec/server
  • RHIC and USAtlas AFS cells
  • Software repository, user home directories
  • Total of 11 AIX servers, 1.2TB (RHIC), 0.5TB (Atlas)
  • Transarc on server side, OpenAFS on client side
  • RHIC cell recently renamed (standardized)

7
Centralized Disk Storage
(Photos: Sun E450 servers and MTI / Zzyzx storage hardware)
8
The Linux Farm
  • 1097 dual Intel CPU VA and IBM rackmounted servers;
    total of 918 kSpecInt2000
  • Nodes allocated by experiment and further divided for
    reconstruction and analysis
  • 1GB memory, typically 1.5GB swap
  • Combination of local SCSI/IDE disk with aggregate
    storage of >120TB available to users
  • Experiments starting to make significant use of
    local disk through custom job schedulers, data
    repository managers and rootd

9
The Linux Farm
10
The Linux Farm
  • Most RHIC nodes recently upgraded to latest RH8
    rev. (Atlas still at RH7.3)
  • Installation of customized image via Kickstart
    server
  • Support for networked file systems (NFS, AFS) as
    well as distributed local data storage
  • Support for open-source and commercial compilers
    (gcc, PGI, Intel) and debuggers (gdb, TotalView,
    Intel)

11
Linux Farm - Batch Management
  • Central Reconstruction Farm
  • Up to now, data reconstruction was managed by a
    locally produced Perl-based batch system
  • Over the past year, this has been completely
    rewritten as a Python-based custom frontend to
    Condor
  • Leverages DAGMan functionality to manage job
    dependencies
  • User defines the task using a JDL identical to that of
    the former system; the Python DAG-builder then creates
    the job and submits it to the Condor pool (see the
    sketch after this list)
  • Tk GUI provided to users to manage their own jobs
  • Job progress and file transfer status monitored
    via Python interface to a MySQL backend
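
A minimal sketch of the DAG-builder idea described above, assuming stock Condor/DAGMan tooling (condor_submit_dag) on the submit host; the task structure, file names and executables are hypothetical, and this is not the actual RCF frontend or its JDL.

  import subprocess

  def build_and_submit(task_name, steps):
      # steps: list of (name, executable, arguments, parent_names).
      dag_lines = []
      for name, exe, args, parents in steps:
          sub_file = "%s.sub" % name
          with open(sub_file, "w") as f:
              # One vanilla-universe submit description per step.
              f.write("universe   = vanilla\n")
              f.write("executable = %s\n" % exe)
              f.write("arguments  = %s\n" % args)
              f.write("output     = %s.out\nerror      = %s.err\n" % (name, name))
              f.write("log        = %s.log\nqueue\n" % name)
          dag_lines.append("JOB %s %s" % (name, sub_file))
          dag_lines += ["PARENT %s CHILD %s" % (p, name) for p in parents]
      dag_file = "%s.dag" % task_name
      with open(dag_file, "w") as f:
          f.write("\n".join(dag_lines) + "\n")
      # Hand the whole dependency graph to DAGMan.
      subprocess.run(["condor_submit_dag", dag_file], check=True)

  # Hypothetical two-step task: staging must finish before reconstruction.
  build_and_submit("run4_reco", [
      ("stage", "/usr/local/bin/stage_from_hpss.sh", "example.dat", []),
      ("reco",  "/usr/local/bin/run_reco.sh",        "example.dat", ["stage"]),
  ])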

12
Linux Farm - Batch Management
  • Central Reconstruction Farm (cont.)
  • New system solves scalability problems of former
    system
  • Currently deployed for one experiment, with others
    expected to follow prior to Run-4

13
Linux Farm - Batch Management
  • Central Analysis Farm
  • LSF 5.1 licensed on virtually all nodes, allowing
    use of CRS nodes in between data reconstruction
    runs
  • One master for all RHIC queues, one for Atlas
  • Allows efficient use of limited hardware,
    including moderation of NFS server loads through
    (voluntary) shared resources
  • Peak dispatch rates of up to 350K jobs/week and
    6K jobs/hour
  • Condor is being deployed and tested as a possible
    complement or replacement; still nascent, awaiting
    some features expected in an upcoming release
  • Both accepting jobs through Globus gatekeepers
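
A minimal sketch of the two submission paths mentioned above, assuming LSF's bsub and the GT2-era globus-job-submit client are available; the queue name and gatekeeper contact string are hypothetical placeholders, not the facility's actual configuration.

  import subprocess

  def submit_lsf(command, queue="cas_short"):
      # Local submission to the Central Analysis Farm via LSF.
      subprocess.run(["bsub", "-q", queue, command], check=True)

  def submit_grid(command,
                  gatekeeper="gatekeeper.example.bnl.gov/jobmanager-lsf"):
      # Grid submission through a Globus gatekeeper (GT2 style),
      # which hands the job to the batch system behind it.
      subprocess.run(["globus-job-submit", gatekeeper, command], check=True)

  submit_lsf("/usr/local/bin/analysis.sh")
  submit_grid("/usr/local/bin/analysis.sh")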

14
Security & Authentication
  • Two layers of firewall with limited network
    services and limited interactive access
    exclusively through secured gateways
  • Conversion to Kerberos5-based single sign-on
    paradigm
  • Simplify life by consolidating password databases
    (NIS/Unix, SMB, email, AFS, Web). SSH gateway
    authentication → password-less access inside the
    facility with automatic AFS token acquisition (see
    the sketch after this list)
  • RCF status: AFS/K5 fully integrated; dual K5/NIS
    authentication, with NIS to be eliminated soon
  • USAtlas status: K4/K5 parallel authentication paths
    for AFS, with full K5 integration on Nov. 1; NIS
    passwords already gone
  • Ongoing work to integrate K5/AFS with LSF, solve
    credential forwarding issues with multihomed
    hosts, and implement a Kerberos certificate
    authority
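
A minimal sketch of the single sign-on step described above, assuming MIT Kerberos kinit and OpenAFS aklog on the client; the principal and cell names are hypothetical placeholders.

  import subprocess

  def k5_sign_on(principal="user@EXAMPLE.BNL.GOV", cell="rhic"):
      # Obtain a Kerberos 5 TGT (kinit prompts for the password) ...
      subprocess.run(["kinit", principal], check=True)
      # ... then convert it into an AFS token for the given cell.
      subprocess.run(["aklog", "-c", cell], check=True)

  k5_sign_on()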

15
US Atlas Grid Testbed
(Diagram: testbed layout showing the giis01 information server, a Globus gatekeeper/job manager and Globus client, GridFTP, the Globus RLS server, an AFS server, HPSS with the amds mover, the LSF (Condor) pool, 17TB of disk, hosts aafs, atlas02 and aftpexp00, a 70MB/s link, and grid job requests arriving from the Internet.)
Local Grid development currently focused on
monitoring and user management
16
Monitoring & Control
  • Facility monitored by a cornucopia of
    vendor-provided, open-source and home-grown
    software. Recently:
  • Ganglia was deployed on the entire farm, as well
    as the disk servers
  • Python-based Farm Alert scripts were changed from SSH
    push (slow), to multi-threaded SSH pull (still too
    slow), to TCP/IP push, which finally solved the
    scalability issues (a sketch follows below)
  • Cluster management software is a requirement for
    linux farm purchases (VACM, xCAT)
  • Console access, power up/down: really came in useful
    this summer!
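
A minimal sketch of the TCP/IP push approach that the Farm Alert scripts moved to, assuming a central collector listening on a well-known port; the host name, port and message format are hypothetical, not the actual Farm Alert protocol.

  import socket

  COLLECTOR = ("alarm-collector.example.bnl.gov", 9999)   # hypothetical

  def push_alert(message):
      # Node side: open a short-lived TCP connection and push one alert.
      with socket.create_connection(COLLECTOR, timeout=5) as s:
          s.sendall((message + "\n").encode())

  def run_collector(port=9999):
      # Collector side: accept connections and log each alert line.
      srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      srv.bind(("", port))
      srv.listen(64)
      while True:
          conn, addr = srv.accept()
          with conn:
              line = conn.recv(4096).decode().strip()
              print("ALERT from %s: %s" % (addr[0], line))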

17
The Great Blackout of '03
18
Future Plans & Initiatives
  • Linux farm expansion this winter: addition of >100 2U
    servers packed with local disk
  • Plans to move beyond NFS-served SAN with more
    scalable solutions
  • Panasas - file system striping at block level
    over distributed clients
  • dCache - potential for managing distributed disk
    repository
  • Continuing development of grid services with
    increasing implementation by the two large RHIC
    experiments
  • Very successful RHIC run with a large
    high-quality dataset!