1
Site Report: The Linux Farm at the RCF
  • HEPIX-HEPNT
  • October 22-25, 2002
  • Ofer Rind
  • RHIC Computing Facility
  • Brookhaven National Laboratory

2
RCF - Overview
  • Provide computing facilities for RHIC users
  • General computing environment
  • General interactive tasks (email, document
    processing, web)
  • Data analysis facility
  • Computing infrastructure for RHIC experiments
  • Code development, repository and distribution
  • Raw data recording and reconstruction
  • Data analysis
  • ACF: US Atlas Tier 1 Computing Facility
  • Shared infrastructure and synergy with RCF
  • Support staff: 25 FTEs (4 dedicated to the Linux
    Farm)

Ofer Rind - RHIC Computing Facility Site Report
3
RCF - Structure
4
RCF - Component Summary
  • Mass Storage Subsystem
  • StorageTek library managed by HPSS
  • 4 Silos, 1.2PB capacity (expanding to 4.5PB)
  • In Run-2, raw data recorded at a common rate of
    70MB/sec for a total of 170TB
  • Total data store 300TB
  • Disk Storage
  • Fibre channel SAN served by NFS
  • 110TB Raid5
  • 14 Sun 450, Solaris 8 2-02 (5 Sun 480 coming
    online)
  • IBM AFS servers (AIX)
  • Linux Server Farm
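As a rough cross-check of the Run-2 figures above, the sketch below converts the quoted 170TB total and 70MB/sec rate into an equivalent time of continuous recording. It assumes decimal units (1TB = 1e12 bytes, 1MB = 1e6 bytes) and a sustained rate, neither of which the slide states; the function name is illustrative only.

```python
# Figures quoted on the slide; decimal units are an assumption.
RUN2_RAW_TB = 170
RATE_MB_PER_S = 70


def seconds_to_record(total_tb: float, rate_mb_per_s: float) -> float:
    """Time needed to write total_tb at a sustained rate of rate_mb_per_s."""
    return (total_tb * 1e12) / (rate_mb_per_s * 1e6)


days = seconds_to_record(RUN2_RAW_TB, RATE_MB_PER_S) / 86400
print(f"{days:.1f} days of continuous recording")
```

Under these assumptions the Run-2 raw data corresponds to about four weeks of recording at the quoted rate.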

5
Linux Farm Hardware
  • 840 1U and 2U servers (pre-'99 towers have been
    retired)
  • 69 kSPECint95, expanding to 100 kSPECint95 (2
    TFLOPS)
  • Most have 1GB mem (at least 500MB)
  • Local SCSI disks up to 140GB/node
  • Allocated by experiment
  • Further allocated for Raw Data Reconstruction
    (CRS) and Reconstructed Data Analysis (CAS)

Vendor    CPU           Nodes  Date
VA Linux  PIII 450MHz   148    Jun 99
VA Linux  PIII 700MHz    48    Aug 00
VA Linux  PIII 800MHz   168    Nov 00
IBM       PIII 1000MHz  316    Aug 01
IBM       PIII 1400MHz  160    Oct 02
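The hardware list above can be tallied with a few lines of Python. This sketch assumes the third figure in each entry is a node count (the slide does not label it); the structure and variable names are illustrative.

```python
# (vendor, CPU, node count, date) as listed on the slide.
# Reading the third field as a node count is an assumption.
PURCHASES = [
    ("VA Linux", "PIII 450MHz", 148, "Jun 99"),
    ("VA Linux", "PIII 700MHz", 48, "Aug 00"),
    ("VA Linux", "PIII 800MHz", 168, "Nov 00"),
    ("IBM", "PIII 1000MHz", 316, "Aug 01"),
    ("IBM", "PIII 1400MHz", 160, "Oct 02"),
]

total_nodes = sum(count for _, _, count, _ in PURCHASES)

# Per-vendor subtotals.
by_vendor = {}
for vendor, _, count, _ in PURCHASES:
    by_vendor[vendor] = by_vendor.get(vendor, 0) + count

print(total_nodes, by_vendor)
```

The total comes to 840, which matches the number of servers quoted at the top of the slide.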
6
Linux Farm Software Configuration
  • RedHat 7.2 upgraded to 2.4.9-31 kernel
  • Image(s) installed via Kickstart server and
    customized for RCF environment via rpm
  • NFS/AFS home directory and file access
  • Interactive login allowed on selected nodes
  • Job management
  • (CAS) LSF 4.2 - slightly re-architected for
    robustness. Peak throughput before summer
    conferences was >150K jobs/week.
  • (CRS) Locally produced Perl-based batch system
    (AIX needed for HPSS API). Approx. 670K jobs
    processed for Run-2.
  • Expanding use of distributed disk models (rootd,
    ??)
  • Atlas Grid testbed

7
Tracking LSF Usage
[Figure: STAR queues weekly job statistics (week of Oct. 10); panels: job starts/hr, avg runtime/hr, runtime]
8
Security and Monitoring
  • Security
  • RCF firewall within BNL site firewall
  • SSH2-only access through gateway bastion nodes
    (Solaris x86)
  • User access restricted to a subset of systems
    (CAS only)
  • Monitoring
  • 24 hr. on-call staff for critical systems during
    RHIC operation
  • Cluster mgmt. software
  • VACM (VA Linux)
  • xCAT (IBM, http://www.x-cat.org)
  • Cron scripts to "clean" nodes and head off
    possible problems (memory leaks, full disks,
    etc.)
  • CTS system for problem reports
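The slide does not show the cron scripts themselves. As a minimal sketch of one check such a script might make (full disks), the Python below reports filesystems above a usage threshold. The threshold value and the function names are hypothetical, not taken from the RCF setup.

```python
import shutil

FULL_THRESHOLD = 0.90  # hypothetical cutoff; the slide gives no value


def fraction_used(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disks(paths, threshold=FULL_THRESHOLD):
    """Return (path, fraction) pairs whose usage exceeds `threshold`."""
    flagged = []
    for path in paths:
        used = fraction_used(path)
        if used > threshold:
            flagged.append((path, used))
    return flagged
```

A cron job could call `check_disks` on a list of local mount points and act on, or simply report, whatever it returns.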

9
Farm Alert System
  • Web-monitoring (user-accessible) plus
    paging/email alerts
  • Python scripts run locally, transferring node
    status information to a MySQL database
  • Notification of problems with NFS/AFS (e.g.
    stale file handles), LSF daemons, high load, etc.
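The flow described above (a local script gathers node status and writes it to a central database, which is then queried for problems) might look roughly like the sketch below. To stay self-contained it uses the standard-library `sqlite3` module as a stand-in for MySQL. The table layout, the load threshold, and all function names are assumptions for illustration, not the RCF implementation.

```python
import os
import socket
import sqlite3
import time

HIGH_LOAD = 8.0  # hypothetical alert threshold


def collect_status():
    """Gather a minimal status record for the local node (Unix only)."""
    load1, _, _ = os.getloadavg()
    return {"node": socket.gethostname(), "ts": int(time.time()), "load1": load1}


def store_status(conn, status):
    """Insert one status record; sqlite3 stands in for the MySQL database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_status (node TEXT, ts INTEGER, load1 REAL)"
    )
    conn.execute(
        "INSERT INTO node_status VALUES (?, ?, ?)",
        (status["node"], status["ts"], status["load1"]),
    )
    conn.commit()


def nodes_over_load(conn, limit=HIGH_LOAD):
    """Return (node, load1) rows whose recorded load exceeds `limit`."""
    rows = conn.execute(
        "SELECT node, load1 FROM node_status WHERE load1 > ?", (limit,)
    )
    return rows.fetchall()
```

On each node, a periodic job would call `store_status(conn, collect_status())`; the alerting side would poll `nodes_over_load` and send a page or email for each row returned.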
10
Network Operation Status
Perl scripts monitor network service connectivity
for all nodes (ssh, yp, etc.)
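The monitoring described here is done with Perl scripts that are not shown. The sketch below illustrates the same idea in Python instead: try a TCP connection to each service port on a node. The port numbers (22 for ssh, 111 for the portmapper that yp relies on) and the function names are assumptions.

```python
import socket

# Service name -> TCP port; the ports are assumptions, not from the slide.
SERVICES = {"ssh": 22, "portmapper": 111}


def service_reachable(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_node(host, services=SERVICES):
    """Map each service name to its reachability on `host`."""
    return {name: service_reachable(host, port) for name, port in services.items()}
```

A successful connection only shows that the port accepts connections; it does not show that the service behind it is working correctly.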
11
Load Monitoring and History
  • MySQL database for usage history
  • History available back to Sept. '01 via web
    interface
Figure: CPU load averaged over (98) Phenix machines
during the month of September.
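A farm-wide average like the one plotted could be computed from per-node samples along the following lines. The record layout `(day, node, load)` and the function name are hypothetical; the slide does not describe the database schema.

```python
from collections import defaultdict


def average_load_by_day(samples):
    """Average load across all nodes for each day.

    `samples` is an iterable of (day, node, load) tuples (assumed layout).
    Returns {day: mean load over every sample recorded that day}.
    """
    totals = defaultdict(lambda: [0.0, 0])  # day -> [sum of loads, sample count]
    for day, _node, load in samples:
        totals[day][0] += load
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in sorted(totals.items())}
```

In practice the samples would come from a query against the usage-history database rather than from an in-memory list.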
12
Plans for the Near Future
  • 160 newly delivered IBM nodes to be brought
    online
  • Expect purchase bid to go out for 220 more nodes
    at beginning of FY03 (pending funding approval)
  • Scaling up data storage capacity and throughput
    for Run-3 (up to 10X data increase over Run-2,
    starting in December)
  • Evaluation of LSF 5 and Condor ongoing, with an
    eye towards distributed disk services
  • Expanding Atlas GRID services
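To put the Run-3 scaling item in context, the sketch below combines it with the mass-storage figures quoted earlier in the talk (170TB of Run-2 raw data, a 300TB total store, 1.2PB capacity expanding to 4.5PB). It assumes the "up to 10X" factor applies to the raw data volume and ignores reconstructed and derived data, so it is an upper-bound illustration on raw data only, not a projection from the talk.

```python
# Figures from the talk, in TB (1PB = 1000TB assumed).
RUN2_RAW_TB = 170
TOTAL_STORE_TB = 300
CAPACITY_NOW_TB = 1200
CAPACITY_PLANNED_TB = 4500
RUN3_FACTOR = 10  # "up to 10X"; applying it to raw volume is an assumption

run3_raw_tb = RUN2_RAW_TB * RUN3_FACTOR
projected_total_tb = TOTAL_STORE_TB + run3_raw_tb

fits_now = projected_total_tb <= CAPACITY_NOW_TB
fits_planned = projected_total_tb <= CAPACITY_PLANNED_TB

print(run3_raw_tb, projected_total_tb, fits_now, fits_planned)
```

Under these assumptions Run-3 raw data alone could reach 1700TB, for a store of about 2000TB. That would exceed the current 1.2PB capacity but fit within the planned 4.5PB.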
