1
HEPiX Fall 2001 Report (2) NERSC, Berkeley
  • Wolfgang Friebel,
  • 16.11.2001
  • C5 report

2
Further topics covered
  • Batch (Sun Grid Engine Enterprise Edition)
  • Distributed Filesystems (Benchmarks)
  • Security (again) (the concept at NERSC)

3
Batch systems
  • Two talks on SGEEE (formerly known as Global
    Resource Director GRD or Codine), see below
  • FNAL presented a new version of their batch
    system, FBSNG
  • Its main scope is resource management, not load
    balancing
  • FBSNG is written primarily in Python; a Python
    API exists
  • Comes with Kerberos 5 support
  • NERSC reported experiences with LSF
  • Not very pleased with LSF; they will also
    evaluate alternatives

4
SGEEE Batch
  • Ease of installation from source
  • Access to the source code
  • Can be integrated into a monitoring system
  • APIs for C and Perl
  • Excellent load balancing mechanisms (4 scheduler
    policies)
  • Manages the requests of concurrent groups
  • Mechanisms for recovery from machine crashes
  • Fallback solutions for dying daemons
  • Weakest point is the AFS integration and token
    prolongation mechanism (basically the same code
    as for LoadLeveler and for older LSF versions)

5
SGEEE Batch
  • SGEEE has all the ingredients to build a
    company-wide batch infrastructure
  • Allocation of resources according to policies,
    ranging from departmental policies to individual
    user policies
  • Dynamic adjustment of priorities for running jobs
    to meet policies
  • Supports interactive jobs, array jobs and
    parallel jobs
  • Can be used with Kerberos (4 and 5) and AFS
  • Globus integration is underway
  • SGEEE is open source, maintained by Sun
  • Deeper knowledge can be gained by studying the
    code
  • The code can be enhanced (examples: more
    schedulers, tighter AFS integration,
    monitoring-only daemons)
  • The code is centrally maintained by a core
    developer team
  • Could play a more important role in HEP
    (component of a grid environment, open
    industry-grade batch system as the recommended
    solution within HEPiX?)

6
Scheduling policies
  • Within SGEEE, tickets are used to distribute the
    workload
  • User-based functional policy
    • Tickets are assigned to projects, users and
      jobs. More tickets mean higher priority and
      faster execution (if concurrent jobs are
      running on a CPU)
  • Share-based policy
    • Certain fractions of the system resources
      (shares) can be assigned to projects and users
    • Projects and users receive those shares during
      a configurable moving time window (e.g. CPU
      usage for a month based on usage during the
      past month)
  • Deadline policy
    • By redistributing tickets, the system can give
      jobs an increasing weight to meet a certain
      deadline. Can be used by authorized users only
  • Override policy
    • Sysadmins can give additional tickets to jobs,
      users or projects to temporarily adjust their
      relative importance
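The ticket idea can be illustrated with a small sketch (a toy illustration with hypothetical job names and numbers, not SGEEE code): each job's share of the CPU is proportional to its ticket count.

```python
# Toy illustration of ticket-based scheduling (not actual SGEEE
# code): each job's CPU share is proportional to its tickets.

def cpu_shares(jobs):
    """Map job name -> fraction of CPU, proportional to tickets."""
    total = sum(jobs.values())
    return {name: tickets / total for name, tickets in jobs.items()}

# Functional tickets per job (hypothetical numbers)
jobs = {"analysis": 600, "simulation": 300, "test": 100}
shares = cpu_shares(jobs)
# "analysis" gets 60% of the CPU, "simulation" 30%, "test" 10%
```

Redistributing tickets (as the deadline and override policies do) simply changes these fractions for the affected jobs.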

7
Distributed Filesystems
  • Candidates for benchmarking
    • NFS versions 2 and 3
    • GFS (University of Minnesota/Sistina Software)
    • AFS
    • GPFS (IBM cluster file system, being ported to
      Linux)
    • PVFS (Parallel Virtual Filesystem)
  • Not taken
    • GPFS: IBM could not get it working at NERSC
      under Linux (not ready?)
    • PVFS: unstable in tests, single point of
      failure (metadata server)
    • AFS: slower than NFS; tests were done
      elsewhere, where it is running successfully
    • GFS: designed for SANs; runs over TCP with
      significant performance penalties, lock
      management not mature, stability for a high
      number of clients not expected to be good.
      A good candidate for SANs, though

8
Distributed Filesystems
  • Conclusion for NERSC: only NFS remains; AFS is
    too heavy for them
  • The talk discussed various combinations of Linux
    kernel versions (2.2.x and 2.4.x), NFS clients
    (v2 and v3) and servers (v2 and v3)
  • Benchmarking tools used
    • Bonnie
    • IOzone
    • PostMark
  • Benchmarked equipment
    • Dual 866 MHz PIII with 512 MB RAM
    • Escalade 6200 series 4-channel IDE RAID, with
      3 72 GB drives striped
  • Results
    • By carefully choosing kernel and NFS versions,
      throughput can be increased
    • For many more details, consult the talk
  • Other sites reported very bad NFS performance
    (confirming NERSC's finding that tuning for NFS
    is a must)
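The sequential-write throughput measurement at the heart of tools like Bonnie can be sketched as follows (a minimal illustration, not one of the tools named above; file size and block size are arbitrary):

```python
import os
import tempfile
import time

def write_throughput(path, size_mb=64, block_kb=64):
    """Write size_mb of zeros to path in block_kb chunks,
    return the achieved throughput in MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = size_mb * 1024 // block_kb
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # force data out of the page cache
    return size_mb / (time.time() - start)

with tempfile.NamedTemporaryFile() as tmp:
    mb_per_s = write_throughput(tmp.name, size_mb=16)
```

Pointing `path` at an NFS mount instead of a local temporary file is what makes the client/server and kernel-version comparisons in the talk possible.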

9
Distributed Filesystems GFS
  • Caspur is looking for a filesystem attached to a
    multi-node Linux farm
  • Looked for SAN-based solutions
  • NFS and GPFS were discarded (NFS performance,
    GPFS extra HW/SW)
  • Have chosen GFS, but are trying to use GFS over
    IP (see next slide)
  • By using a SCSI-to-IP converter (Axis from
    Dothill) they would be able to set up a
    serverless GFS
  • Currently there are contradicting kernel
    requirements for GFS and the Axis converter
  • These issues are probably solved (11/2001) with
    equipment from Cisco
  • Looks promising to them; more investigations to
    come

11
Computer Security at NERSC
  • Very open community; a balance between security
    and availability is needed
  • Main concepts used
    • Intrusion detection using BRO (in-house
      development, open source)
    • Immediate actions against attackers (shunning)
    • Scanning systems for vulnerabilities
    • Keeping systems/software up to date
    • Firewall for critical assets only (operation
      consoles, development systems)
    • Virus wall for incoming e-mails
    • Top-level staff in computer security and
      networking
  • Observed ever-increasing scans (30-40 a day!)
    and threats
  • Were able to track down hackers and reconstruct
    the attacks

12
Computer Security BRO
  • Passively monitors the network
  • Carefully designed to avoid packet drops at high
    speeds, up to 622 Mbps (OC-12)
  • Two main components
    • Event engine: converts network traffic into
      events (compression)
    • Policy script interpreter: interprets the
      output of the event handlers
  • BRO interacts with the border router to drop
    hosts immediately (using ACLs) on attacks
  • BRO records all input in interactive sessions
  • Allows reconstructing data even if type-ahead or
    completion mechanisms are used
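The two-component design can be sketched as a toy pipeline (hypothetical event names, packet fields and handlers, not BRO's actual scripting language): the event engine reduces raw traffic to events, and policy handlers map events to actions such as an ACL that shuns the attacker at the border router.

```python
# Toy sketch of BRO's two-stage design (not BRO's real API):
# an event engine turns raw "packets" into events, and policy
# handlers react to selected events.

def event_engine(packets):
    """Compress raw packets into higher-level events."""
    events = []
    for pkt in packets:
        if pkt.get("dst_port") == 23:        # telnet connection
            events.append(("login_attempt", pkt["src"]))
        if pkt.get("failed_logins", 0) > 3:  # repeated failures
            events.append(("bruteforce", pkt["src"]))
    return events

def shun(host):
    """Hypothetical ACL line a border router could apply."""
    return f"access-list 101 deny ip host {host} any"

POLICY = {"bruteforce": shun}  # policy script: event -> action

def interpret(events):
    """Run the policy handlers over the event stream."""
    return [POLICY[name](arg) for name, arg in events
            if name in POLICY]

packets = [{"src": "10.0.0.9", "dst_port": 23, "failed_logins": 5}]
acls = interpret(event_engine(packets))
# acls now holds one deny rule for 10.0.0.9
```

Keeping the fast, fixed event engine separate from the interpreted, site-specific policy is what lets a site adapt quickly, as NERSC did for Code Red.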

13
Computer Security BRO
  • Some of the analysis is done in real time; deeper
    analysis is done once a day offline
  • NERSC relies heavily on intrusion detection by
    BRO
  • NERSC was able to react quickly to the Code Red
    worm (changes to BRO)
  • Subsequently, Nimda did very little damage
  • Many more useful tips on practical security (have
    a look at the talk)