Transcript and Presenter's Notes

Title: Jon Wakelin, Physics


1
Jon Wakelin, Physics
  • ACRC, Bristol

2
ACRC
  • Server Rooms
  • PTR: 48 APC water-cooled racks (hot aisle / cold aisle)
  • MVB: 12 APC water-cooled racks (hot aisle / cold aisle)
  • HPC
  • IBM, ClusterVision, ClearSpeed.
  • Storage
  • 2008-2011?
  • Petabyte-scale facility
  • 6 Staff
  • 1 Director, 2 HPC Admins, 1 Research Facilitator
  • 1 Visualization Specialist, 1 e-Research Specialist
  • (1 Storage Admin post?)

3
ACRC Resources
  • Phase 1 - March 07
  • 384-core AMD Opteron 2.6 GHz dual-socket, dual-core system, 8 GB mem.
  • MVB server room.
  • CVOS and SL 4 on the WNs. GPFS, Torque/Maui, QLogic InfiniPath.
  • Phase 2 - May 08
  • 3328-core Intel Harpertown 2.8 GHz dual-socket, quad-core system, 8 GB mem.
  • PTR server room - 600 metres from the MVB server room.
  • CVOS and SL? on the WNs. GPFS, Torque/Moab, QLogic InfiniPath.
  • Storage Project (2008-2011)
  • Initial purchase of an additional 100 TB for the PP and Climate Modelling groups.
  • PTR server room.
  • Operational by Sep 08.
  • GPFS will be installed on the initial 100 TB.

4
ACRC Resources
  • 184 Registered Users
  • 54 Projects
  • 5 Faculties
  • Engineering
  • Science
  • Social Science
  • Medicine & Dentistry
  • Medical & Veterinary Sciences

5
PP Resources
  • Initial LCG/PP setup
  • SE (DPM), CE and a 16-core PP cluster, MON and UI
  • CE for HPC (plus SE and GridFTP servers for use with the ACRC facilities)
  • HPC Phase 1
  • PP have a fair-share target of 5, and up to 32 concurrent jobs (see the Maui sketch after this list)
  • New CE, but uses the existing SE - accessed via NAT (and slow).
  • Operational since end of Feb 08.
  • HPC Phase 2
  • SL 5 will limit PP exploitation in the short term.
  • Exploring virtualization, but this is a medium- to long-term solution.
  • PP to negotiate a larger share of the Phase 1 system to compensate.
  • Storage
  • 50 TB to arrive shortly, operational Sep 08.
  • Additional networking necessary for short/medium-term access.
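A minimal sketch of how such a fair-share target and job cap might be expressed in Maui's maui.cfg; the 'pp' group name and the policy/weight values are hypothetical illustrations, not the production configuration:

    # maui.cfg (illustrative; group name 'pp' and the values are assumptions)
    FSPOLICY      DEDICATEDPS     # fair-share based on dedicated processor-seconds
    FSWEIGHT      1
    FSDEPTH       7               # number of fair-share windows kept
    FSINTERVAL    24:00:00        # length of each window
    GROUPCFG[pp]  FSTARGET=5 MAXJOB=32   # fair-share target of 5, at most 32 running jobs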

6
Storage
  • Storage Cluster
  • Separate from the HPC cluster
  • Will run GPFS
  • Being installed and configured as we speak
  • Running a test StoRM SE
  • This is the second attempt, due to changes in the underlying architecture
  • Passing simple SAM SE tests
  • But now removed from the BDII
  • Direct access between storage and WNs
  • Through multi-cluster GPFS rather than NAT (see the sketch after this list)
  • Test and production systems may differ in the following ways
  • The production system will have a separate GridFTP server
  • Possibly an NFS export for the Physics cluster
  • 10 Gb NICs (Myricom Myri-10G, PCI-Express)
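A minimal sketch of the multi-cluster GPFS arrangement described above, using the standard GPFS mm* administration commands; the cluster names, contact nodes, key file paths, device name and mount point are all hypothetical:

    # On the storage (owning) cluster: exchange keys and authorise the HPC cluster
    mmauth genkey new
    mmauth add hpc.cluster.example -k /tmp/hpc_cluster_key.pub
    mmauth grant hpc.cluster.example -f gpfs_storage

    # On the HPC (accessing) cluster: register the remote cluster and mount its filesystem
    mmremotecluster add storage.cluster.example -n nsd1,nsd2 -k /tmp/storage_cluster_key.pub
    mmremotefs add gpfs_storage -f gpfs_storage -C storage.cluster.example -T /gpfs/storage
    mmmount gpfs_storage -a

With this in place the WNs read and write the storage filesystem directly, rather than routing traffic through the NAT box.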

7-13
(No transcript)
14
(Network diagram: HPC Phase 2, four x3650 servers with Myri-10G interfaces, and the storage cluster in the PTR server room, connected via Nortel 5510-48 switches to HPC Phase 1 in the MVB server room. NB: all network components are Nortel.)
15
(Network diagram: as the previous slide, with a Nortel 5530 added to the switch stack in the PTR server room. NB: all network components are Nortel.)
16
(No Transcript)
17
SoC
  • Separation of Concerns
  • Storage/compute managed independently of the grid interfaces
  • Storage/compute managed by dedicated HPC experts
  • Tap into storage/compute in the manner the electricity-grid analogy suggested
  • Provide PP with centrally managed compute and storage
  • Tarball WN install on the HPC cluster (see the sketch after this list)
  • StoRM writing files to a remote GPFS mount (developers and tests confirm this works)
  • In theory this is a good idea - in practice it is hard to achieve
  • (Originally) an implicit assumption that the admin has full control over all components
  • The software now allows for (mainly) non-root installations
  • We depend on others for some aspects of support
  • Impact on turn-around times for resolving issues (SLAs?)
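A minimal sketch of the kind of non-root, tarball-based WN install referred to above; the tarball name, shared path and environment script are placeholders rather than the actual gLite file names:

    # Unpack the WN tarball into a shared software area as an unprivileged user
    WN_AREA=/gpfs/shared/glite-wn        # hypothetical shared area visible from all WNs
    mkdir -p "$WN_AREA"
    tar -xzf glite-WN-tarball.tar.gz -C "$WN_AREA"
    # Jobs then pick up the grid environment by sourcing the set-up script shipped
    # with the tarball, e.g. from the batch prologue or a job wrapper:
    #   . "$WN_AREA"/<env-setup-script>.sh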

18
General Issues
  • Limit the number of tasks that we pass on to the HPC admins
  • Set up user and admin accounts (sudo) and shared software areas
  • Torque - allow a remote submission host (i.e. our CE); see the sketch after this list
  • Maui - ADMIN3 access for certain users (all users are ADMIN3 anyway)
  • NAT
  • Most other issues are solvable with fewer privileges
  • SSH keys
  • RPM or rsync for certificate updates
  • WN tarball for software
  • Other issues
  • APEL accounting assumes ExecutingCE = SubmitHost (bug reported)
  • Workaround for the Maui client - key embedded in the binaries! (now changed)
  • Home directory path has to be exactly the same on the CE and the cluster
  • Static route into the HPC private network
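A minimal sketch of the low-privilege settings listed above; all hostnames, usernames and addresses are hypothetical:

    # Torque (run by the HPC admins on the batch server): allow our CE as a remote submission host
    qmgr -c "set server submit_hosts += lcgce.example.ac.uk"

    # Maui: grant ADMIN3 access to selected grid users via a line in maui.cfg
    #   ADMIN3  gridpp001 gridpp002

    # CA certificate updates pushed with rsync rather than root-installed RPMs
    rsync -a /etc/grid-security/certificates/ hpc-login.example.ac.uk:/gpfs/shared/grid-security/certificates/

    # Static route from the CE into the HPC private network via the NAT gateway
    route add -net 10.10.0.0 netmask 255.255.0.0 gw 10.0.0.1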

19
Qs?
  • Any questions?
  • https://webpp.phy.bris.ac.uk/wiki/index.php/Grid/HPC_Documentation
  • http://www.datadirectnet.com/s2a-storage-systems/capacity-optimized-configuration
  • http://www.datadirectnet.com/direct-raid/direct-raid
  • hepix.caspur.it/spring2006/TALKS/6apr.dellagnello.gpfs.ppt