1
FermiGrid Fermilab Grid Gateway
  • Keith Chadwick
  • Bonnie Alcorn
  • Steve Timm

2
FermiGrid - Strategy and Goals
  • In order to better serve the entire program of
    the laboratory, the Computing Division will place
    all of its production resources in a Grid
    infrastructure called FermiGrid. This strategy
    will continue to allow the large experiments that
    currently have dedicated resources to have
    first-priority usage of certain resources that
    are purchased on their behalf. It will allow
    access to these dedicated resources, as well as
    other shared Farm and Analysis resources, for
    opportunistic use by the various Virtual
    Organizations (VOs) that participate in FermiGrid
    (i.e. all of our lab programs) and by certain VOs
    that use the Open Science Grid. (Add something
    about prioritization and scheduling via new
    lab/CD forums.) The strategy will allow us:
  • to optimize use of resources at Fermilab
  • to provide a coherent way of putting Fermilab on
    the Open Science Grid
  • to save some effort and resources by implementing
    certain shared services and approaches
  • to work together more coherently to move all of
    our applications and services to run on the Grid
  • to better handle a transition from Run II to LHC
    (and eventually to BTeV) in a time of shrinking
    budgets and possibly shrinking resources for Run
    II worldwide
  • to fully support Open Science Grid and the LHC
    Computing Grid and gain positive benefit from
    this emerging infrastructure in the US and Europe.

3
FermiGrid What It Is
  • FermiGrid is a meta-facility composed of a number
    of existing resources, many of which are
    currently dedicated to the exclusive use of a
    particular stakeholder.
  • FermiGrid (the facility) provides a way for jobs
    of one VO to run either on shared facilities
    (such as the current General Purpose Farm or a
    new GridFarm?) or on the Farms primarily provided
    for other VOs. (>>> needs wordsmithing to say
    what, not how)
  • FermiGrid will require some development and test
    facilities to be put in place in order to make it
    happen.
  • FermiGrid will provide access to storage elements
    and storage and data movement services for jobs
    running on any of the compute elements of
    FermiGrid.
  • The resources that comprise FermiGrid will
    continue to be accessible in local mode as well
    as Grid mode.

4
The FermiGrid Project
  • This is a cooperative project across the
    Computing Division and its stakeholders to define
    and execute the steps necessary to achieve the
    goals of FermiGrid
  • Effort is expected to come from:
  • Providers of shared resources and services - CSS
    and CCF
  • Stakeholders and providers of currently dedicated
    resources - Run II, CMS, MINOS, SDSS
  • The total program of work is not fully known at
    this time, but the WBS is being fleshed out. It
    will involve at least the following:
  • Adding services required by some stakeholders to
    other stakeholders' dedicated resources
  • Work on authorization and accounting
  • Providing some common FermiGrid Services (e.g. )
  • Providing some head-nodes and gateway machines
  • Modifying some stakeholders' scripts, codes, etc.
    to run in the FermiGrid environment
  • Working with OSG technical activities to make
    sure FermiGrid and OSG (and thereby LCG) are well
    aligned and interoperable
  • Working on monitoring and web pages and whatever
    else it takes to make this all work and happen
  • Evolving and defining forums for prioritizing
    access to resources and scheduling

5
FermiGrid Some Notations
  • Condor = Condor / Condor-G, as necessary.

6
FermiGrid The Situation Today
  • Many separate clusters
  • CDF (x3), CMS, D0 (x3), GP Farms, FNALU Batch,
    etc.
  • When the cluster "landlord" does not fully
    utilize the cluster cycles, it is very difficult
    for others to opportunistically utilize the
    excess computing capacity.
  • In the face of flat or declining budgets, we need
    to make the most effective use of the computing
    capacity.
  • We need some sort of system to capture the unused
    available computing and put it to use.

7
FermiGrid The State of Chaos Today
[Diagram: separate CDF clusters, D0 clusters, CMS
clusters, and GP Farms.]
8
FermiGrid The Vision
  • The future is Grid-enabled computing.
  • Dedicated systems' resources will be assimilated
    slowly...
  • Existing access to resources will be maintained.
  • "I am Chadwick of Grid - prepare to be
    assimilated!" Not!
  • Enable Grid-based computing, but do not require
    all computing to be Grid.
  • Preserve existing access to resources for current
    installations.
  • "Let a thousand flowers bloom" - well, not quite.
  • Implement Grid interfaces to existing resources
    without perturbation of existing access
    mechanisms.
  • Once FermiGrid is in production, deploy new
    systems as Grid-enabled from the get-go.
  • People will naturally migrate when they need
    expanded resources.
  • Help people with their migrations?

9
FermiGrid The Mission
  • FermiGrid is the Fermilab Grid Gateway
    infrastructure to accept jobs from the Open
    Science Grid and, following appropriate
    credential authorization, schedule these jobs for
    execution on Fermilab Grid resources.

10
FermiGrid The Rules
  • First, do no harm:
  • Wherever possible, implement such that existing
    systems and infrastructure are not compromised.
  • Only when absolutely necessary, require changes
    in existing systems or infrastructure, and work
    with those affected to minimize and mitigate the
    impact of the required changes.
  • Provide resources and infrastructure to help
    experiments transition to a Grid enabled model of
    operation.

11
FermiGrid Players and Roles
  • CSS
  • Hardware and Operating System Management Support.
  • CCF
  • Grid Infrastructure and Application Management
    Support.
  • OSG - a cast of thousands:
  • Submit jobs / utilize resources.
  • CDF
  • D0
  • CMS
  • Lattice QCD
  • Sloan
  • MINOS
  • MiniBooNE
  • FNAL
  • Others?

12
FermiGrid System Evolution
  • Start small, but plan for success.
  • Build the FermiGrid gateway system as a cluster
    of redundant server systems to provide 24x7
    service.
  • Initial implementation will not be redundant;
    that will follow as soon as we learn how to
    implement the necessary failovers.
  • We're going to have to experiment a bit and learn
    how to operate these services.
  • We will need the capability of testing upgrades
    without impacting production services.
  • Schedule OSG jobs on excess/unused cycles from
    existing systems and infrastructure.
  • How? Initial thoughts were to utilize the
    checkpoint capability within Condor.
  • Feedback from D0 and CMS is that this is not an
    acceptable solution.
  • Alternatives: 24 hour CPU limit? nice? other?
    (see the sketch below)
  • Will think about this more (policy?).
  • Just think of FermiGrid like PACMAN (munch,
    munch, munch).
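  A minimal sketch of one of the alternatives above - a 24 hour
  limit on opportunistic jobs - expressed as Condor startd
  policy. This is illustrative only, not decided FermiGrid
  policy; the local account name "osgguest" and the grace
  period are assumptions.

    # condor_config.local on a FermiGrid worker node (illustrative only).
    # Assumption: guest (opportunistic) jobs are mapped by the gateway
    # to the hypothetical local account "osgguest".
    IsGuestJob   = ( TARGET.Owner =?= "osgguest" )

    # Approximate run time of the current claim on this machine.
    GuestRunTime = ( CurrentTime - EnteredCurrentState )

    # Evict a guest job after 24 hours instead of relying on Condor
    # checkpointing (which D0 and CMS jobs cannot use).
    PREEMPT = $(IsGuestJob) && ( $(GuestRunTime) > (24 * 3600) )

    # Hard-kill the job if it has not vacated 15 minutes after preemption.
    KILL    = $(IsGuestJob) && ( $(GuestRunTime) > (24 * 3600 + 15 * 60) )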

13
FermiGrid Software Components
  • Operating System and Tools
  • Scientific Linux 3.0.3
  • VDT Globus toolkit.
  • Cluster tools
  • Keep the cluster sane.
  • Migrate services as necessary.
  • Cluster aware file system
  • Google file system?
  • Lustre?
  • other?
  • Applications and Tools
  • VOMS / VOMRS
  • GUMS
  • Condor-G, GRIS, GIIS (see the example below)
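  To illustrate how these components fit together, a Condor-G
  (globus universe) submit file of this era might look like the
  sketch below. The gateway host name and job manager are
  made-up placeholders, not actual FermiGrid endpoints.

    # submit.cmd - illustrative Condor-G submit file
    universe        = globus
    # Hypothetical FermiGrid gateway contact string:
    globusscheduler = fermigate1.fnal.gov/jobmanager-condor
    executable      = myjob.sh
    output          = myjob.out
    error           = myjob.err
    log             = myjob.log
    queue

  Submitted with condor_submit, such a job would be handed by
  Condor-G to the VDT Globus gatekeeper on the gateway,
  authorized via the VOMS/GUMS chain, and then scheduled onto a
  FermiGrid compute element.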

14
FermiGrid Overall Architecture
[Diagram: FermiGrid Common Gateway Services (including
SAZ and Storage via SRM / dCache) sitting in front of
the CDF, D0, CMS, GP Farm, Lattice QCD, and SDSS
clusters, each with its own head node (HN) and
storage.]
15
FermiGrid General Purpose Farm Example
[Diagram: GP Farm users and FermiGrid both reach the
Farm head node (running FBS); FermiGrid jobs arrive
via Globus / Condor.]
"The D0 wolf stealing food out of the mouths of
babies."
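  Once such a gateway path exists, it could be exercised with a
  trivial Globus test job; the contact string below is a
  made-up placeholder.

    # Run a simple test job through the gateway's Globus job
    # manager (hypothetical host / jobmanager names):
    globus-job-run fermigate1.fnal.gov/jobmanager-condor /bin/hostname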
16
FermiGrid D0 Example
[Diagram: D0 jobs arrive both via SamGrid and via
FermiGrid (Globus / Condor) onto FNSF0 and the
SamGfarm, which run FBS.]
"Babies stealing food out of the mouth of the D0
wolf."
17
FermiGrid Future Grid Farms?
[Diagram: future Grid farms accessed from FermiGrid
via Globus / Condor.]
18
FermiGrid Gateway Software
See http://computing.fnal.gov/docs/products/voprivilege/index.html
19
FermiGrid Gateway Hardware Architecture
[Diagram: FermiGrid gateway systems within the FNAL
network.]
20
FermiGrid Gateway Hardware Roles
  • FermiGate1
  • Primary for Condor, GRIS, GIIS
  • Backup for FermiGate2
  • Secondary backup for FermiGate3
  • FermiGate2
  • Primary for VOMS / VOMRS
  • Backup for FermiGate3
  • Secondary backup for FermiGate1
  • FermiGate3
  • Primary for GUMS / PRIMA (eventually)
  • Backup for FermiGate1
  • Secondary backup for FermiGate2
  • All FermiGate systems will have the VDT Globus
    job manager.

21
FermiGrid Gateway Hardware Specification
  • 3 x PowerEdge 6650
  • Dual processor 3.0 GHz Xeon MP, 4 MB cache
  • Rapid rails for Dell rack
  • 4 GB DDR SDRAM (8 x 512 MB)
  • PERC3-DC, 128 MB, 1 internal / 1 external channel
  • 2 x 36 GB 15k RPM drives
  • 2 x 73 GB 10k RPM drives
  • Dual on-board 10/100/1000 NICs
  • Redundant power supply
  • Dell Remote Access Card, Version III, without
    modem
  • 24x IDE CD-ROM
  • PowerEdge Basic Setup
  • 3 yr same-day 4 hr response, parts and onsite
    labor, 24x7
  • $14,352.09 each
  • Cyclades console, dual PM20, local switch, and
    rack
  • Total system cost ~$50K
  • Expandable in place by addition of processors or
    disks within systems.

22
FermiGrid Alternate Hardware Specification
  • 3 x PowerEdge 2850 (2U server)
  • Dual processor 3.6 GHz Xeon, 1 MB cache, 800 MHz
    FSB
  • Rapid rails for Dell rack
  • 4 GB DDR2 400 MHz (4 x 1 GB)
  • Embedded PERC4e/i controller
  • 2 x 36 GB 15k RPM drives
  • 2 x 73 GB 10k RPM drives
  • Dual on-board 10/100/1000 NICs
  • Redundant power supply
  • Dell Remote Access Card, 4th generation
  • 24x IDE CD-ROM
  • PowerEdge Basic Setup
  • 3 yr same-day 4 hr response, 24x7 parts and
    onsite labor
  • $6,951.24 each
  • Cyclades console, dual PM20, local switch, and
    rack
  • Total system cost ~$25K
  • Limited CPU expandability - can only add whole
    systems or perform a forklift upgrade.

23
FermiGrid Condor and Condor-G
  • Condor (Condor-G) will be used for batch queue
    management.
  • Within FermiGrid gateway systems - definitely.
  • May feed into other head node batch systems (e.g.
    FBS) as necessary.
  • VOs that own a resource will have priority access
    to that resource (see the sketch below).
  • Policy? - guest VOs will only be allowed to
    utilize idle/unused resources.
  • Policy? - how quickly must a guest VO free a
    resource when it is desired by the owner VO?
  • Condor checkpointing would provide this, but D0
    and CMS jobs will not function in this
    environment.
  • Alternatives: 24 hour CPU limit? nice? other?
  • More thought required (perhaps helped by the
    policy decisions above?).
  • For Condor information see
  • http://www.cs.wisc.edu/condor/
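  One way the owner-priority question (and the "how quickly must
  a guest VO yield" question) could be expressed in Condor
  startd policy is sketched below. This is illustrative only;
  the local account name "cdfgrid" is an assumption, and the
  eviction timing remains an open policy question.

    # condor_config.local on a CDF-owned worker node (illustrative only).
    # Assumption: owner-VO jobs arrive under the hypothetical local
    # account "cdfgrid"; all other jobs are treated as guests.
    IsOwnerJob = ( TARGET.Owner =?= "cdfgrid" )

    # Accept any job, but rank owner-VO jobs higher so that an
    # arriving owner job can preempt a running guest job and
    # reclaim the node.
    START = True
    RANK  = $(IsOwnerJob)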

24
FermiGrid VO Management
  • Currently VO management is performed via CMS in a
    "back pocket" fashion.
  • Not a viable solution for the long term.
  • CMS would probably like to direct that effort
    towards their own work.
  • We recommend that the FermiGrid infrastructure
    take over the VO Management Server/services and
    migrate them onto the appropriate gateway system
    (FermiGate2).
  • Existing VOs should be migrated to the new VO
    Management Server (in the FermiGrid gateway) once
    the FermiGrid gateway is commissioned.
  • Existing VO management roles delegated to
    appropriate members of the current VOs.
  • New VOs for existing infrastructure clients (e.g.
    FNAL, CDF, D0, CMS, Lattice QCD, SDSS, others)
    should be created as necessary/authorized.
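  From the user side, membership in one of these VOs is
  exercised with the standard VOMS client tools; the VO name
  "fermilab" below is illustrative only.

    # Obtain a Grid proxy carrying VOMS attributes for a
    # (hypothetical) VO:
    voms-proxy-init -voms fermilab
    # Inspect the VO / role attributes attached to the proxy:
    voms-proxy-info -all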

25
FermiGrid VO Creation and Support
  • All new VOs created on the new VO Management
    Server by FermiGrid project personnel or
    Helpdesk.
  • Policy? - VO creation authorization mechanism?
  • VO management authority delegated to the
    appropriate members of the VO.
  • Policy? - FNAL VO membership administered by
    the Helpdesk?
  • Like accounts in the FNAL Kerberos domain and
    Fermi Windows 2000 domain.
  • Policy? - Small experiments may apply to CD to
    have their VO managed by the Helpdesk also?
  • Need to provide the Helpdesk with the necessary
    tools for VO membership management.

26
FermiGrid GUMS
  • Grid User Management System
  • Developed at BNL
  • Translates a Grid identity to a local identity
    (certificate -> local user)
  • Think of it as an automated mechanism to maintain
    the gridmap file (see the example entry below).
  • See
  • http://www.rhic.bnl.gov/hepix/talks/041018pm/carcassi.ppt
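  For reference, a hand-maintained gridmap file (conventionally
  /etc/grid-security/grid-mapfile) is just a list of
  certificate-to-local-account mappings, one per line, which
  GUMS generates automatically. The DN and account name below
  are made up.

    "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456" fnalgrid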

27
FermiGrid Project Management
  • Weekly FermiGrid project management meeting
  • Fridays from 2:00 PM to 3:00 PM in FCC1.
  • We would like to empanel a set of Godparents
  • Representatives from
  • CMS
  • Run II
  • Grid Developers?
  • Security Team?
  • Other?
  • Godparent panel would be used to provide (short
    term?) guidance and feedback to the FermiGrid
    project management team.
  • Longer term guidance and policy from CD line
    management.

28
FermiGrid Time Scale for Implementation
  • Today - Decide and order hardware for gateway
    systems.
  • Explore / kick the tires on existing software.
  • Jan 2005 - Hardware installation.
  • Begin software installation and initial
    configuration.
  • Feb-Mar 2005 - Common Grid services available in
    non-redundant mode (Condor-G, VOMS, GUMS, etc.).
  • Future - Transition to redundant mode as
    hardware/software matures.

29
FermiGrid Open Questions
  • Policy Issues?
  • Lots of policy issues need direction from CD
    management.
  • Role of FermiGrid?
  • Direct Grid access to Fermilab Grid resources,
    without FermiGrid?
  • Grid access to Fermilab Grid resources only via
    FermiGrid?
  • Guest VO access to Fermilab Grid resources only
    via FermiGrid?
  • Resource Allocation?
  • owner VO vs. guest VO?
  • How fast?
  • Under what circumstances?
  • Grid Users Meeting, a la the Farm Users Meeting?
  • Accounting?
  • Who, where, what, when, how?
  • Recording vs. Access.

30
FermiGrid Guest vs. Owner VO Access
[Diagram: guest vs. owner VO access paths through the
FermiGrid Gateway and the Resource Head Node,
annotated "Required?", "Allowed", "Allowed?", and
"Not Allowed?".]
31
FermiGrid Fin
  • Any Questions?