1
Readiness and Operation of Grid Infrastructure
for US CMS
  • Ruth Pordes
  • US CMS Grid Services and Interfaces
  • Fermilab

2
CMS Distributed Facility in the US
  • US CMS contributes to and relies on the Open
    Science Grid (OSG) to provide a common, shared
    infrastructure in the US (plus non-US sites that
    opt in) that contributes to the WLCG.
  • The US Tier-1 and all US and Brazil Tier-2s are
    part of the OSG.
  • Some Tier-3s already participating in the OSG.
  • Integration with and interfacing to the WLCG is
    achieved through participation in many
    management, operational and technical activities.
  • Successes in Grid software, services, integration
    and operation are the result of very good
    cooperation and collaboration across US CMS S&C,
    Fermilab CD/FermiGrid, the Tier-2 administrators,
    the NSF DISUN project and OSG.
  • In 2006, OSG contributed effectively to CSA06 and
    to CMS simulation production.
  • In 2007, OSG plans to improve the reliability and
    scalability of the infrastructure to meet LHC
    needs, as well as to add and support needed
    additional services, sites and users.

3
US Grid Deliverables 2006
  • Good level of contributions to CSA06 (see other
    talks).
  • Tier-2s
    • Supported ongoing production simulation.
    • Provided distribution of CMS software to all
      OSG sites used by CMS.
    • Contributed monitoring and storage support,
      utilities, and troubleshooting.
  • DISUN contributions to
    • CSA06 OSG resources,
    • Compute Element scalability, performance and
      reliability improvements,
    • Testing of OSG storage software deployments,
      and integration and testing of new software
      services.
  • US CMS S&C / Fermilab contributions to
    • CMS WLCG storage software,
    • authorization and security services,
    • workload management software integration
      testing,
    • information and accounting.

4
OSG - Project Funding Organization
  • OSG funding from DOE and NSF of $6M/year for
    five years (pending successful annual reviews)
    started at the end of FY06 (Sept/Oct).
  • US LHC people have management roles in the
    project.
  • Executive Board includes technical project leads
    of external projects (such as US LHC) with many
    defined interactions.
  • Stakeholder connections to the project well
    defined and active.
  • The project reports to a Council of stakeholder
    representatives (including US LHC, Tier-1
    facilities, etc.), with ex-officio membership of
    partners (including WLCG, EGEE), and to the
    DOE-NSF Joint Oversight Team.
  • The OSG Project has a Year 1 Project Plan;
    Project Execution and Management Plans; a
    Security Plan; a WBS (in the process of being
    baselined); Milestones (both agency and
    non-agency reportable, including deliverables
    relying on Experiment contributions); a Change
    Control plan (in review); and funded (newly
    hired) project support staff.

5
OSG Project Scope in support of US CMS/US LHC
  • Operation and evolution of a common, shared,
    distributed Facility to provide a mix of
    guaranteed use at the US CMS Tier-1 and Tier-2
    sites as well as opportunistic use of non-US CMS
    farms and disk caches (e.g. CMS Tier-3 and US
    ATLAS sites).
  • A distributed operations support and services
    infrastructure which provides dispatch and
    follow-up of any Grid-level issues.
  • A reference software stack (based on the Virtual
    Data Toolkit - VDT) which provides an integrated,
    tested stack for compute and storage resources,
    selected application level common middleware and
    common Grid services.
  • Security infrastructure, policies and processes
  • Troubleshooting and documentation activities
    based on the shared, common expertise and support
    of OSG staff.
  • Support for interoperation with the EGEE and WLCG
    (and campus and community grids e.g. FermiGrid).
  • Integration and extensions of new, externally
    developed, software and baseline services into
    the Facility -- needed by US LHC and other
    science collaborations.
  • Note: this funding addresses concerns about
    support for the Grid middleware used by US LHC,
    and about US peering with EGEE in security and
    operations, for the next several years.

6
OSG Project Organization
7
OSG Project Organization
8
OSG - Project Constraints
  • No hardware purchases. All compute, storage and
    network resources contributed by consortium
    stakeholders (e.g. CMS, Fermilab).
  • No software development. OSG relies on external
    projects to develop new services. (e.g. US CMS,
    Fermilab, Condor, etc).
  • Activities targeted to benefit many (or more
    than one) Experiment/Science Community, with
    non-physics communities an integral part of the
    scope.

OSG - Effort Distribution (FTE)
Facility operations: 6.0
Security and troubleshooting: 4.5
Software release and support: 6.5
Engagement: 2.0
Education, outreach and training: 2.0
Facility management: 1.0
Extensions in capability and scale: 8.0
Staff: 3.0
Total Full Time Equivalents: 33.0
9
OSG Milestones Year 1
Define Operational Metrics for Year 1: 1/1/07
Release Security Plan: 1/1/07
Release OSG 0.6.0 (main deliverables: first release of storage components, support in place, s/w updates, accounting, new information and resource selection services): 2/27/07
Production use of OSG by one additional science community: 3/31/07
OSG-TeraGrid software using common Globus and Condor releases: 4/2/07
Complete deployment and registration of 15 Storage Resources using srm/dCache from the VDT: 6/10/07
Release OSG 0.8.0 (main deliverables: just-in-time workload management, multi-VO storage reservation and allocation, TBD): 8/15/07
Report on Operational Metrics for Year 1: 9/1/07
Production use of OSG by a 2nd additional science community: 9/28/07
LIGO: Binary Inspiral Analysis runs on OSG (Warren Anderson): 6/15/07
ATLAS: Validation of OSG infrastructure and extensions in a full-chain production challenge (Jim Shank): 6/15/07
CMS: Full support for opportunistic use of OSG resources for MC production and data processing (Lothar Bauerdick): 6/15/07
STAR: Migration of >80% of simulation to OSG (Jerome Lauret): 6/15/07
CDF: Full use of OSG for MC (Ashutosh Kotwal): 6/15/07
D0: Full use of OSG sites for D0 reprocessing in 2007 (Brad Abbott): 6/15/07
10
US Grid Deliverables for 2007
  • Goals
    • Mid-2007: full functionality for initial data
      taking (commissioning).
    • End-2007: 2 x CSA06 performance.
  • Data Movement and Storage needs
    • Space reservation (SRM V2.2) and compatibility
      with WLCG Storage Classes definitions (not
      needed by CMS in the US).
    • Deployed, interoperable access control
      (gPlazma).
    • Improved maintainability and ease of
      configuration and support (VDT).
    • Accounting (Gratia).
  • Workload Management and Job Execution needs
    • Performance and robustness improvements at the
      execution sites.
    • Experiment-level prioritization by job class
      and user role (sketched after this list).
    • Improved success rate (through understanding
      and fixing of errors, and improved retry and
      fault tolerance in the workload management
      system and workflows).
  • Operational needs
    • Security infrastructure, processes and auditing
      following CDF OSG plans.
    • Improved coverage, usability and interpretation
      of monitoring and error information.
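
To make the prioritization item above concrete, here is a minimal sketch of an experiment-level policy that combines a job-class weight with a user-role weight. The class names, role names and weights are invented for illustration; they are not CMS or OSG policy, and real systems express this through scheduler configuration rather than code like this.

```python
# Hypothetical experiment-level prioritization by job class and user role.
# All classes, roles and weights below are illustrative, not CMS policy.

JOB_CLASS_WEIGHT = {"reconstruction": 30, "simulation": 20, "analysis": 10}
ROLE_WEIGHT = {"production": 15, "software": 10, "user": 0}

def job_priority(job_class: str, role: str) -> int:
    """Return a priority score; higher runs sooner. Unknown values score 0."""
    return JOB_CLASS_WEIGHT.get(job_class, 0) + ROLE_WEIGHT.get(role, 0)

# A production reconstruction job outranks a user analysis job.
assert job_priority("reconstruction", "production") > job_priority("analysis", "user")
print(job_priority("simulation", "production"))  # 35
```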

11
WLCG Plans - infrastructure readiness for the
experiments
Month    | Activity                                              | VOs involved | Goals/Milestones
February | SRM v2.2 preparation                                  | All          | SRM v2.2 support in experiment s/w.
March    | Multi-VO T0-T1 transfer tests                         | All          | Understand in detail possible bottlenecks and couplings between VOs, maintenance and house-keeping operations etc.; extend to all sites.
March    | CMS MTCC3                                             | CMS          |
April    | ATLAS FDR, CMS CSA07                                  | ATLAS, CMS   | Services required for ATLAS Full Data Reconstruction; CMS CSA07 in production, ready for sub-system testing.
April    | gLite 3.x / Scientific Linux(C)4 production services  |              |
June     | CMS CSA07                                             | CMS          | Readiness for later CMS milestone.
July     | SRM v2.2 services                                     | All          | Full SRM v2.2 production services at all sites.
July     | 3D services                                           | All          | Full database production services at all sites.
July     | CMS CSA07                                             | CMS          | Readiness for later CMS milestone; Facility Readiness.
October  | LHC operations testing                                |              |
November | Full LHC machine checkout                             |              |
December | Beam commissioning / Engineering run                  |              |
12
WLCG (OSG, EGEE) Baseline Services
Processing / Workload Management: Goal of 100,000 job submissions per day. Joint project to test workload management/job scheduling technologies and decide which one to recommend for deployment and use. Collaboration between OSG, US ATLAS and US CMS on use of the PanDA Workload Management System. OSG plans to support just-in-time job scheduling in the next 6 months (the late-binding idea is sketched after this table).
Storage Management: Support for Storage Resource Manager (SRM) V2.2 based storage elements and space reservation. Joint project including US CMS S&C, OSG and Fermilab CD. OSG will deploy SRM V2.2 SEs as soon as they are available and tested, and is responsible for testing the interoperability of SRM V2 installations on EGEE/OSG/WLCG.
File Transfer Service (FTS): OSG supports VOs' installation of FTS. OSG is working with the experiments and EGEE towards a scalable service to address current problems.
Job/Application Monitoring: CMS dashboard and MonALISA application monitoring.
Accounting: OSG accounting service deployed at the Tier-1 and in deployment at Tier-2s and across OSG. Accounting records have been sent monthly to the WLCG accounting databases for the past year.
VO Boxes (Edge Services): Supported by VOs. Low priority for OSG at present.
Access Control: Role-based authorization based on extended X.509 certificates, in common with EGEE. The OSG authorization infrastructure is in place, but not all sites configure it yet (FQAN handling is sketched after this table).
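
As a minimal sketch of the just-in-time (late-binding) scheduling idea mentioned in the Processing row: a pilot job occupies a worker slot and only then pulls real work from a VO-level queue, so tasks bind to CPUs at the last moment. The queue contents and names are invented; this illustrates the general pattern behind pilot-based systems such as PanDA, not their actual code.

```python
# Toy illustration of just-in-time ("pilot") scheduling: the pilot
# starts on a worker node and only then fetches real work from a
# VO-level queue. Task names are invented placeholders.

from queue import Queue, Empty

def pilot(task_queue: Queue, node: str) -> None:
    """Run tasks pulled from the VO queue until it is drained."""
    while True:
        try:
            task = task_queue.get_nowait()
        except Empty:
            return  # no work left: the pilot exits and frees the slot
        print(f"{node}: running {task}")

vo_queue = Queue()
for task in ["sim-001", "sim-002", "reco-001"]:
    vo_queue.put(task)

pilot(vo_queue, "worker-a")
```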
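
For the Access Control row: the extended X.509 proxies carry VOMS attributes as FQANs of the form /vo[/group...][/Role=role][/Capability=cap], which a site maps to a local identity. The parsing below follows that documented FQAN shape, but the account-mapping table is a made-up illustration, not the actual gPlazma or GUMS configuration format.

```python
# Parse a VOMS FQAN and map (VO, role) to a hypothetical local account.

def parse_fqan(fqan: str):
    """Split an FQAN into its VO/group path and optional Role."""
    groups, role = [], None
    for part in fqan.strip("/").split("/"):
        if part.startswith("Role="):
            role = part[len("Role="):]
        elif not part.startswith("Capability="):
            groups.append(part)
    return groups, role

# Illustrative site policy table (invented account names).
ACCOUNT_MAP = {("cms", "production"): "cmsprod", ("cms", None): "cmsuser"}

def local_account(fqan: str):
    groups, role = parse_fqan(fqan)
    vo = groups[0] if groups else None
    return ACCOUNT_MAP.get((vo, role))

print(local_account("/cms/Role=production/Capability=NULL"))  # cmsprod
print(local_account("/cms"))                                  # cmsuser
```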
13
Other US CMS requirements
  • Increased use of opportunistic resources for
    simulation production
    • Local storage elements for output/input of
      intermediate results.
    • Dynamic deployment of CMS software (available
      in some form today).
    • Access to 5GB of shared space accessible from
      all worker nodes.
  • OSG plans wide deployment of SRM-interfaced,
    dCache-based SEs: 15 sites by June, 25 sites by
    September.
  • CMS-specific tests in the WLCG Site Availability
    Monitoring (SAM) infrastructure
    • Monitor individual site performance and
      availability (see the sketch after this list).
    • The tests will be CMS workflows.
    • Goal: monitor OSG sites equivalently to EGEE
      sites.
  • OSG operations working with the WLCG team to test
    common infrastructure on basic/bare OSG sites.
  • Information and Accounting systems and
    interoperability
    • Need to ensure contributions of OSG sites are
      properly represented and acknowledged by the
      WLCG.
    • Accounting being installed on Tier-2 sites;
      Tier-1 information delivered to WLCG; storage
      transfer accounting in test.
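
The SAM-based availability monitoring above amounts to aggregating per-site pass/fail test records into an availability fraction. A minimal sketch of that aggregation follows; the site names, test names and canned results are placeholders standing in for real SAM records of CMS workflow tests.

```python
# SAM-style availability summary over (site, test, passed) records.
from collections import defaultdict

# Canned stand-ins for real SAM test results.
RESULTS = [
    ("T1_US_FNAL", "job-submit", True),
    ("T1_US_FNAL", "srm-copy", True),
    ("T2_US_Example", "job-submit", True),
    ("T2_US_Example", "srm-copy", True),
    ("T2_US_Example", "sw-install", False),
]

def availability(results):
    """Fraction of passing tests per site."""
    passed, total = defaultdict(int), defaultdict(int)
    for site, _test, ok in results:
        total[site] += 1
        passed[site] += ok
    return {site: passed[site] / total[site] for site in total}

for site, frac in sorted(availability(RESULTS).items()):
    print(f"{site}: {frac:.0%} of tests passing")
```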

14
Security
  • Integrated Security Management means everyone
    addresses security for their own
    responsibilities.
  • US CMS addresses Grid Security through depending
    on and contributing to Fermilab CD security
    groups and OSG security activities.
  • Fermilab Security provides leadership in defining
    and implementing security models, infrastructure
    and processes.
  • OSG Security
    • Identifies and provides security management,
      operations and technical controls for the OSG
      core assets, responsibilities and roles.
    • Facilitates (raises awareness, provides
      templates, includes common s/w utilities in the
      VDT) processes and controls for the non-core,
      i.e. the sites' and the experiments' middleware
      and applications.
    • Communicates and coordinates responses to risks
      and incidents, identifying chains-of-trust with
      Site and VO managers, Grid Security contacts,
      end users and site support.
    • Collaborates with peer organizations.

15
Security Incident Communication and Response
  • The 24x7 Grid Operations Center receives
    notification of a potential security incident and
    alerts the Security Contact (Operations
    Coordinator), who determines the criticality of
    the report.
  • Urgent includes a compromise of multiple OSG
    resources; the Security Contact will contact the
    OSG Security Officer and the OSG Executive
    Director by cell phone (the triage flow is
    sketched below).
  • http://www.grid.iu.edu/docsfiles/docs/osg_sec.php
  • We have had 4 such incidents and the procedure
    has worked well, including
    • broad notification of the OSG participants and
      partners, including EGEE and WLCG; a meeting of
      the OSG Security team was convened within a few
      hours of notification,
    • discussion, research, and then a response
      appropriate to the risk and the vulnerability
      identified,
    • root-cause and debriefing sessions held to
      determine future best practice.
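
The escalation path above is a severity classification followed by routed notification. The sketch below captures that flow: the classification rule (urgent when more than one OSG resource is compromised) comes from this slide, while the contact identifiers are invented placeholders rather than real OSG contact points.

```python
# Toy triage for the incident-response flow described above.
# Contact names are placeholders, not real OSG contact points.

def classify(report: dict) -> str:
    """'urgent' when multiple OSG resources are compromised."""
    return "urgent" if len(report.get("compromised_resources", [])) > 1 else "routine"

def notify(report: dict) -> list:
    recipients = ["security-contact"]  # GOC always alerts the Security Contact
    if classify(report) == "urgent":
        # Urgent incidents also go to the Security Officer and
        # Executive Director (by cell phone, per the procedure).
        recipients += ["security-officer", "executive-director"]
    return recipients

print(notify({"compromised_resources": ["site-a", "site-b"]}))
# ['security-contact', 'security-officer', 'executive-director']
```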

16
Operations
  • US CMS, Fermilab, OSG, EGEE and WLCG Operations
    organizations working well together.
  • Interfaces, ownership and hand-off of problems
    are constantly exercised.
  • Understanding of the different WLCG/EGEE
    responsibilities and roles is slowly improving
    (sometimes we have to explain it many times).
  • Many US CMS central Grid services are supported
    by the Fermilab campus grid, FermiGrid.
  • The next OSG Consortium/All Hands meeting in
    March is co-scheduled with the US ATLAS and US
    CMS Tier-3 meetings, and will include training
    and support discussions.