Transcript and Presenter's Notes

Title: Grid Testbeds: A U.S. View, HEP-CCC, 29 June 2002


1
Grid Testbeds: A U.S. View, HEP-CCC, 29 June 2002
  • Larry Price
  • Argonne National Laboratory

2
Existing or Planned US-Testbed Activities
  • US-ATLAS Test Grid
  • US-CMS Test Grid
  • Combine to make an iVDGL-1 Test Grid by November
  • Combine further with EDG Testbed to make an iVDGL
    / DataTAG Testbed
  • GLUE Collaboration (iVDGL / DataTAG)
  • TeraGrid Facility

3
US-ATLAS Test Grid
  • Testbed has been functional for 1 year
  • Decentralized account management
  • Grid credentials (based on Globus CA)
  • updating to ESnet CA
  • Grid software: Globus 1.1.4/2.0, Condor 6.3
    (moving towards full VDT 1.x)
  • ATLAS core software distribution at 2 sites (RH
    6.2)

4
ATLAS-related grid software
  • GridView - simple tool to monitor status of the
    testbed (Kaushik De, Patrick McGuigan)
  • Gripe - unified user accounts (Rob Gardner)
  • Magda - MAnager for Grid DAta (Torre Wenaus,
    Wensheng Deng; see Gardner and Wenaus talks)
  • Pacman - package management and distribution tool
    (Saul Youssef)
  • Being widely used or adopted by iVDGL VDT, Ganga,
    and others
  • Grappa - web portal using active notebook
    technology (Shava Smallen; see Gardner talk)
  • GRAT - GRid Application Toolkit
  • Gridsearcher - MDS browser (Jennifer Schopf)
  • GridExpert - Knowledge Database (Mark Sosebee)
  • VO Toolkit - Site AA (Rich Baker)

5
Some Near Term US-ATLAS Plans
  • Develop a data movement/management prototype
    using
  • Condor, GDMP, MAGDA
  • Continue development of "easy to use" data
    analysis tools
  • Enhance GRAPPA web portal
  • Further develop leading work on packaging and
    Pacman
  • Automate the packaging of a grid production
    mechanism
  • Develop an MDS information provider for
    Pacman-deployed software (a sketch follows this
    list)
  • Participate in Data Challenge 1 (begins this
    month, June!)
  • Interoperate with US-CMS Test Grid and EDG
  • Run ATLAS apps on US-CMS Test Grid (done!)
  • Run ATLAS apps from US-ATLAS Site on EDG Testbed
    (done!)
  • Expand to Interregional ATLAS Testbed in late 2002
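
The MDS information provider mentioned above is the kind of small script that a GRIS invokes and whose LDIF output it publishes. A minimal sketch in Python follows, assuming a hypothetical Pacman install-record directory and made-up pacman* attribute names; this is an illustration, not the schema the testbed adopted.

#!/usr/bin/env python
"""Hedged sketch of an MDS information provider for Pacman-deployed software.

MDS 2.x information providers are programs that write LDIF to stdout for a
GRIS to publish.  The install-record directory and the pacman* attribute and
object class names below are illustrative assumptions only.
"""
import os
import socket

PACMAN_CACHE = "/opt/pacman/installed"   # hypothetical: one "<name>-<version>" marker per package
DN_SUFFIX = "Mds-Software-deployment=pacman, Mds-Host-hn=%s, Mds-Vo-name=local, o=grid"

def list_packages(cache_dir):
    if not os.path.isdir(cache_dir):
        return []
    return sorted(os.listdir(cache_dir))

def emit_ldif(hostname, packages):
    for pkg in packages:
        name, sep, version = pkg.rpartition("-")
        if not sep:                              # marker file carries no version suffix
            name, version = pkg, "unknown"
        print("dn: pacmanPackageName=%s, %s" % (name, DN_SUFFIX % hostname))
        print("objectclass: pacmanPackage")      # hypothetical object class
        print("pacmanPackageName: %s" % name)
        print("pacmanPackageVersion: %s" % version)
        print("")                                # blank line separates LDIF entries

if __name__ == "__main__":
    emit_ldif(socket.getfqdn(), list_packages(PACMAN_CACHE))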

6
ATLAS Summer Schedule
  • July 1-15: Phase 0, 10^7 events
  • Globus 2.0 beta, Athena 3.0.1, Grappa, common
    disk model, Magda, 5 physics processes, BNL VO
    manager, minimal job scheduler, GridView
    monitoring
  • August 5-19: Phase 1, 10^8 events
  • VDT 1.1.1, Hierarchical GIIS server,
    Athena-atlfast 3.2.0, Grappa, Magda - data
    replica management with metadata catalogue, 10
    physics processes, static MDS based job
    scheduler, new visualization
  • September 2-16: Phase 2, 10^9 events, 1 TB
    storage, 40k files
  • Athena-atlfast 3.2.0 instrumented, 20 physics
    processes, upgraded BNL VO manager, dynamic job
    scheduler, fancy monitoring
  • Need some planning of analysis tools

7
Atlfast Production Architecture
[Diagram: components shown: user; Grappa portal or GRAT
script; JobOptions for Higgs, SUSY, QCD, Top, W/Z;
boxed Athena-Atlfast; Globus; resource broker; MDS;
Magda; VDC; storage elements.]
8
US-CMS Test Grid
  • Testbed has been functional for 1/2 year
  • Decentralized account management
  • Grid credentials (based on Globus CA)
  • updating to ESnet CA
  • Grid software: VDT 1.0
  • Globus 2.0 beta
  • Condor-G 6.3.1
  • Condor 6.3.1
  • ClassAds 0.9
  • GDMP 3.0
  • Objectivity 6.1
  • CMS Grid-Related Software:
  • MOP - distributed CMS Monte Carlo Production
  • VDGS - Virtual Data Grid System Prototype
  • CLARENS - Distributed CMS Physics Analysis
  • DAR - Distribution After Release for CMS
    applications (RH 6.2)

9
Some Near Term US-CMS Plans
  • Prototype Virtual Data Grid System (VDGS)
  • Based upon the GriPhyN Data Grid Architecture
  • First prototype by August
  • Production prototype for November
  • Grid-enabled Monte Carlo Production
  • Build upon the CMS experience (already quite
    mature)
  • Run live CMS production this Summer
  • Integrate with VDGS for November
  • Grid-enabled Analysis Environment
  • Based upon web services (XML, RPC, SOAP, etc)
  • Integrate with VDT and VDGS for November
  • Interoperate with US-ATLAS Test Grid and EDG
  • Run CMS apps on US-ATLAS Test Grid
  • Run CMS apps from US-CMS Site on EDG Testbed

10
Example of CMS MC Production on a Grid
[Diagram: a master site running mop_submitter,
IMPALA/BOSS, DAGMan/Condor-G and GDMP; remote sites 1
to N, each with a batch queue and GDMP.]
11
Recent CMS Results
  • Assigned 200 000 Monte Carlo Events from CMS
  • Requested a real assignment to ensure production
    quality of the grid software
  • Produced 100 000 Fully Validated Events (many
    times over)
  • Discovered/corrected many fundamental core-grid
    software bugs (Condor and Globus)
  • huge success from this point of view alone
  • Anticipate finishing full assignment soon
  • Will request another assignment for the
    summer

12
ATLAS-CMS Demo Architecture (SC2002 Demo)
[Diagram: proposed architecture for the SC2002 demo,
with open questions marked. Components shown: ATLAS-CMS
user job; Globus, Condor-G?; visualization of status
and physics via MDS, Ganglia, PAW/Root; scheduling
policy (??); Condor, Python?; production jobs via MOP,
GRAT, Grappa on the ATLAS-CMS testbed.]
13
General Observations
  • Easy to say "We have a Grid!"
    ... more difficult to make it do
    real work
  • a grid is like an "on-line system" for a physics
    detector.
  • a grid is complex with many modes of failure
  • often difficult to track down simple problems
  • host certificates expired
  • gatekeepers not synchronised
  • sometimes difficult to fix the problem
  • bugs still exist in the core-grid software itself
  • need for multiple monitoring systems
    (performance, match-making, debugging,
    heart-beat, etc)
  • need for more transparent data access
    (Virtual Data)
  • Nevertheless, these are typical of "growing
    pains!"

14
Lessons Learned (so far)
  • Test Grid commissioning revealed a need for:
  • grid-wide debugging
  • ability to log into a remote site and talk to the
    System Manager over the phone proved vital...
  • remote logins and telephone calls are not a
    scalable solution!
  • site configuration monitoring
  • how are Globus, Condor, etc. configured?
  • on what port is a particular grid service
    listening? (see the probe sketch after this list)
  • should these be monitored by standard monitoring
    tools?
  • programmers to write very robust code!
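
The "which port is a service listening on" question can be answered to first order with a plain TCP probe from a central host. The sketch below is an illustration only, not a tool the testbed ran; the host names are hypothetical, and the ports are the conventional defaults of that era (2119 gatekeeper, 2135 GRIS, 2811 GridFTP).

#!/usr/bin/env python
"""Hedged sketch: probe conventional grid-service ports on testbed hosts.

Not a tool the testbed ran; hosts are hypothetical and the ports (2119
gatekeeper, 2135 GRIS, 2811 GridFTP) are the usual defaults of that era.
"""
import socket

HOSTS = ["gatekeeper.usatlas.example.edu", "gatekeeper.uscms.example.edu"]
PORTS = {"gatekeeper": 2119, "GRIS": 2135, "GridFTP": 2811}

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return True
    except (socket.error, socket.timeout):
        return False
    finally:
        sock.close()

if __name__ == "__main__":
    for host in HOSTS:
        for service, port in sorted(PORTS.items()):
            state = "up" if port_open(host, port) else "DOWN"
            print("%-35s %-10s port %-5d %s" % (host, service, port, state))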

15
Current Grid-Monitoring Efforts
  • MDS (Globus)
  • Uses an LDAP schema as well as "sensors"
    (Information Providers)
  • Scalable: GRISes update information to a central
    GIIS (see the query sketch after this list)
  • FLAMES (Caltech package)
  • Provides a complete, seamless picture from LAN to
    WAN
  • Based on modular services
  • e.g. SNMP with interfaces to JINI, MDS, Web, etc.
  • Hawkeye (Condor)
  • Leverages the ClassAd system of collecting
    dynamic information on large pools
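
To make the MDS item concrete: querying a GRIS is just an anonymous LDAP search against the MDS port. The hedged sketch below uses the third-party python-ldap module (sites of the era mostly used the ldapsearch command line instead); the host name is hypothetical, while port 2135 and the base DN "Mds-Vo-name=local, o=grid" are the usual GRIS defaults.

#!/usr/bin/env python
"""Hedged sketch: anonymous LDAP query against a GRIS (Globus MDS 2.x).

Requires the third-party python-ldap module; the host is hypothetical, while
port 2135 and the base DN "Mds-Vo-name=local, o=grid" are the usual defaults.
"""
import ldap

GRIS_URL = "ldap://gatekeeper.usatlas.example.edu:2135"   # hypothetical host
BASE_DN = "Mds-Vo-name=local, o=grid"

def dump_gris(url, base_dn):
    con = ldap.initialize(url)
    con.simple_bind_s()            # anonymous bind
    # Pull every entry below the base DN; a real monitor would filter by
    # object class and request only the attributes it needs.
    for dn, attrs in con.search_s(base_dn, ldap.SCOPE_SUBTREE, "(objectclass=*)"):
        print(dn)
        for name, values in sorted(attrs.items()):
            for value in values:
                print("  %s: %s" % (name, value.decode() if isinstance(value, bytes) else value))

if __name__ == "__main__":
    dump_gris(GRIS_URL, BASE_DN)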

16
Virtual Organisation Management on US-Test Grids
  • US-projects are just beginning work on VO
    management
  • looking to adopt much of the excellent work from
    the EDG.
  • Currently no VO management on the Test Grids
  • Account management is fully decentralized
  • Local account policy/management varies widely
    from site to site
  • Some sites do not allow group accounts!
  • Decentralized situation is starting to become
    unwieldy
  • EDG VO Scripts are being deployed and tested
    across two US-CMS Test Grid sites
  • Florida
  • Caltech

17
VO Management (cont.)
  • Group (or pooled) accounts vs. individual
    accounts
  • Current VO management scripts only work well if
    group accounts are allowed at every site
  • Not every site allows group accounts (for very
    good reasons)!
  • The interaction between local and grid-wide
    account management needs to be understood
  • Will sites be willing (or even able) to delegate
    account management to a trusted CA or RA?
  • Need a centralized LDAP server where information
    for all iVDGL VOs is kept (a VO-script sketch
    follows this list)
  • Hope to begin evaluating the Globus Community
    Authorization Service (CAS) and Policy
    enforcement soon
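
The EDG-style VO scripts referred to above essentially walk a VO's LDAP server and rewrite the local grid-mapfile so that every member DN maps to a group (or pooled) account, which is exactly why the group-account question matters. The sketch below only illustrates that idea; the server URL, base DN, the use of the 'description' attribute, and the group account name are assumptions, and the real EDG tooling handles much more (multiple VOs, allow/deny lists, pooled accounts).

#!/usr/bin/env python
"""Hedged sketch of an EDG-style VO script: list member DNs from a VO LDAP
server and map each one to a single group account in a generated grid-mapfile.

Server URL, base DN, the use of the 'description' attribute for the subject DN
and the group account are assumptions; real EDG tooling does much more.
"""
import ldap

VO_SERVER = "ldap://vo.uscms.example.org:389"        # hypothetical VO server
VO_BASE_DN = "ou=People,o=uscms,dc=example,dc=org"   # hypothetical base DN
GROUP_ACCOUNT = "uscms"                              # group account; some sites forbid these

def vo_member_dns(url, base_dn):
    con = ldap.initialize(url)
    con.simple_bind_s()                              # anonymous bind
    members = []
    for _, attrs in con.search_s(base_dn, ldap.SCOPE_SUBTREE, "(objectclass=*)", ["description"]):
        for value in attrs.get("description", []):
            members.append(value.decode() if isinstance(value, bytes) else value)
    return members

def write_gridmap(dns, path="grid-mapfile.generated"):
    with open(path, "w") as out:
        for dn in sorted(set(dns)):
            out.write('"%s" %s\n' % (dn, GROUP_ACCOUNT))

if __name__ == "__main__":
    write_gridmap(vo_member_dns(VO_SERVER, VO_BASE_DN))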

18
GLUE: Grid Laboratory Uniform Environment
  • Originated by the HICB/HIJTB process
  • Goal: interoperability between EU (EDG) and US
    (iVDGL) HEP Grid Projects at the core-grid
    middleware level
  • Define a common schema to describe Grid Resources
  • Advantage: independent of the particular Data
    Grid Architecture implementation

19
[Diagram slide: CDF and GLUE.]
20
GLUE history (its ancient history)
  • HICB and JTB established; looked for
    Intercontinental Testbed Showcases
  • Experiment organizations responded with their
    already planned data challenge schedules and/or
    cross-experiment tests; experiment top-down
    organizational process already in place
  • HICB and JTB organization naturally focus on
    collaboration between Grid Middleware/Service
    projects; no top-down organization spans
    the projects
  • Proposed common TestBed: GRIT

21
GRIT: Grid Reference Intercontinental Testbed
  • GRIT - a TestBed solely for the purposes of
    testing interoperability across the Grid Projects
    without impacting progress of each End User,
    Project or Experiment.
  • Concern about overlap with other projects'
    deliverables and duplication of effort.
  • At that time there was a strategy of one and
    only one EDG TestBed in the EU (Trident had not
    been agreed to); US Experiment TestGrids were
    only just being considered.
  • Disposable Logical Grids: making the raising of
    TestBeds, TestGrids, prototype grids, development,
    integration and stable grids easy and ubiquitous
    is a goal for the future.

22
So we have GLUE
  • Acquire and define software, configuration
    guidelines and policies to allow Middleware
    Services developed by the participating Grid
    Projects to interoperate.
  • Results must be independent of the particular
    Data Grid Architecture implementation.
  • But to succeed will need agreement on
    principles.
  • What does "interoperate" mean?
  • Someone in the US/EU can run on the EDG/US Grid
    using their US/EU-obtained credentials.
  • Someone in the EU/US can submit a job in the EU/US
    that will be scheduled on resources that span EU
    and US sites.
  • Someone in the US/EU can move data from/to US and
    EU sites without specific knowledge of which
    continent the site is on; knowledge of the site's
    resources, performance, and policies is
    sufficient.

23
Glue is NOT
  • What does "interoperate" not mean?
  • Interoperability between the Experiments' and
    end-user applications: User Interfaces and
    Portals, VO-specific Planners, Brokers, Optimizers
  • That all Grid Sites being used necessarily use
    the same software
  • That all Grid Sites necessarily have consistent
    policies
  • GLUE is not an independent software development
    project; software development is proposed and the
    work done by others.
  • GLUE is not a project staffed to the scope of its
    deliverables and schedule.

24
Organization and Responsibility
  • First 2 projects that in practice need to
    interoperate with Middleware and Grid Services:
  • EU EDG/DataTAG and the US Physics Grid Projects
    (Trillium)
  • DataTAG and iVDGL have mandates to work on
    interoperability issues.
  • Thus GLUE is a collaboration coordinated by
    DataTAG and iVDGL.
  • These projects are responsible for the
    deliverables and delivering to the schedules.
  • Deliverables will be made to:
  • EDG Release and the Virtual Data Toolkit (VDT)

25
Documents, Plans and Meetings
  • GLUE V1.1.2 now in existence; recent edits by
    Ewa (iVDGL) and Antonia (DataTAG)
  • It is a living document - if projects are to rely
    on its deliverables this is not sufficient
  • Joint DataTAG WP4 and iVDGL Interoperability
    meetings have proven useful.
  • Core Team to date: Antonia, Cristina, Sergio,
    Flavia, Roberto, Rick, Ewa, Jenny, Brian, Doug,
    Ruth
  • Core Team relies on many others to get the work
    done
  • Core Team not sufficient to successfully meet
    milestones for deliverables
  • Need to tie the deliverables to software release
    schedules
  • Need good coordination with EDG Release
    Management (Bob, Cal) and VDT Release Management
    (Scott, Alain)

26
Basic Components as Defined in the Document
  • Authentication Infrastructure
  • Service Discovery (Information) Infrastructure
  • Data Movement Infrastructure
  • Authorization Services
  • Computational services
  • Added through the work of DataTAG; maps well to
    other activities of iVDGL
  • Packaging
  • Interoperable Experiment Grids

27
Authentication
  • Cross-organizational authentication using the
    Grid Security Infrastructure
  • DONE?
  • DOE SG and EDG CAs have mutual trust for
    certificates.
  • Policies are agreed to.
  • BUT: still issues with respect to Revocation
    Lists.
  • BUT: distribution of certificate files is not
    consistent or automated.
  • Work remains (a consistency-check sketch follows).
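
Both "BUT" items above come down to files in the trusted-CA directory. A hedged consistency-check sketch follows; the /etc/grid-security/certificates location and the <hash>.0 / <hash>.r0 naming are the usual Globus/OpenSSL conventions, but the check itself is an illustration, not an agreed procedure.

#!/usr/bin/env python
"""Hedged sketch: flag trusted CAs that have a certificate installed but no CRL.

Globus-style installs keep CA certificates as <hash>.0 and revocation lists as
<hash>.r0 under /etc/grid-security/certificates; the check itself is only an
illustration, not an agreed GLUE procedure.
"""
import os

CA_DIR = "/etc/grid-security/certificates"

def missing_crls(ca_dir):
    """Return CA hashes that have a certificate file but no matching CRL."""
    entries = os.listdir(ca_dir)
    ca_hashes = set(f[:-2] for f in entries if f.endswith(".0"))
    crl_hashes = set(f[:-3] for f in entries if f.endswith(".r0"))
    return sorted(ca_hashes - crl_hashes)

if __name__ == "__main__":
    for ca_hash in missing_crls(CA_DIR):
        print("CA %s: certificate present, CRL %s.r0 missing" % (ca_hash, ca_hash))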

28
Service Discovery (Information) Infrastructure
  • The GLUE SCHEMA Project.
  • Define common and base set of information to be
    supplied by both EDG and VDT software.
  • Idea adopted by Globus MDS group as useful for
    inclusion as part of base release of MDS.
  • Goal of consistent naming and structure of
    information
  • Requires also upgrade of Information Providers
    and Consumers to match the new information.

29
Defined new structure for Compute Element
information
30
GLUE Schema: the full flavor of collaboration
  • Spreadsheet version 12, listing EDG, Globus and
    new schemas for the CE
  • First draft March 19; 12 calls, more than 500
    emails, and MANY hours of work later, Draft 12
    sent out May 22 pm
  • Document defining schema structure, v. 6: layered
    structure of CE, cluster, (homogeneous)
    subcluster, node; attributes collected into
    object classes; UML class diagram for generic
    representation (see the sketch after this list)
  • Mapping into LDAP is going on now.
  • Modification of information providers:
  • MDS information providers with new schemas part
    of the MDS 2.2 release, mid July
  • EDG information providers need updating - what
    is the schedule for this?
  • After CE: SE and NE. Discussion of SEs has
    started, with other experts joining in (Arie
    Shoshani, John Gordon). GLUE Storage Model
    beginning to be discussed to understand general
    constraints.
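
The layered CE / cluster / (homogeneous) subcluster / node structure can be pictured as nested records. The sketch below is only an illustration of that layering; the field names imitate the Glue* naming style but are not the attributes fixed by Draft 12.

"""Hedged sketch of the layered Compute Element structure (CE -> cluster ->
homogeneous subcluster -> nodes).  Field names imitate the Glue* naming style
but are illustrative, not the attributes fixed in the 2002 drafts."""
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubCluster:
    """Homogeneous set of worker nodes sharing OS, memory and CPU type."""
    unique_id: str
    os_name: str
    ram_mb: int
    node_hostnames: List[str] = field(default_factory=list)

@dataclass
class Cluster:
    unique_id: str
    subclusters: List[SubCluster] = field(default_factory=list)

@dataclass
class ComputeElement:
    """Queue-level view: one CE per batch queue behind a gatekeeper."""
    unique_id: str              # e.g. "gatekeeper.example.edu:2119/jobmanager-pbs"
    lrms_type: str              # PBS, LSF, Condor, fork, ...
    free_cpus: int
    total_jobs: int
    cluster: Optional[Cluster] = None

ce = ComputeElement(
    unique_id="gatekeeper.example.edu:2119/jobmanager-pbs",   # hypothetical
    lrms_type="PBS", free_cpus=12, total_jobs=3,
    cluster=Cluster("cluster.example.edu",
                    [SubCluster("sub0", "RedHat 6.2", 512,
                                ["wn01.example.edu", "wn02.example.edu"])]))
print(ce.cluster.subclusters[0].node_hostnames)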

31
UML class diagram: GLUE Schema
32
GLUE Schema Test Grid
[Diagram: GLUE Schema Test Grid spanning iVDGL and
DataTAG. Components shown: GIIS giis.ivdgl.org
(mds-vo-name=ivdgl-glue); GIIS edt004.cnaf.infn.it
(Mds-vo-name=Datatag); gatekeepers at the Padova site,
US-CMS, US-ATLAS, edt004.cnaf.infn.it,
grid006f.cnaf.infn.it, dc-user.isi.edu,
hamachi.cs.uchicago.edu and rod.mcs.anl.gov; job
managers LSF, Condor, PBS and fork; Computing Element 1
(PBS) and Computing Element 2 (fork/PBS) with worker
nodes WN1 edt001.cnaf.infn.it and WN2
edt002.cnaf.infn.it; Resource Broker (tbc).]
33
Initial Timeline (very optimistic)
  • 1) Refine and agree on above terminology
    (April 1-15) (done?)
  • 2) Define common schema attributes for a CE.
    This includes definitions of all attributes,
    constants, and relationships between objects.
    Ensure that the schema maps to LDAP, R-GMA
    (relational tables), and XML. (April 7-30)
    (glue-schema group)
  • 3) Convert CE schema to GOS. (May 1-7)
    (Cristina?)
  • 4) Identify GLUE testbed hosts, install
    Globus, get host certificates, and configure
    trusted CAs. Install EDG CE/SE software in the
    European (and possibly US?) part of the GLUE
    testbed. There should be at least 2 CEs and 1 SE
    in both the EU and US. (April 15 - May 15) (TBD??)
  • 5) Add new CE schemas to the MDS information
    provider package, add the new CE schema to
    EDG/MDS information providers, and write new
    information providers as necessary. Write unit
    tests to test modifications. (May 1-30) (ISI?)
  • 6) Grid Searcher is a tool for browsing MDS
    (http://people.cs.uchicago.edu/hai/GridSearcher/).
    This should be a good tool for testing new
    schemas. Need to modify the Grid Searcher tool to
    work with new schemas. (May 1-30) (Jenny, student)
  • 7) Modify the EDG WP1 Resource Broker and other
    middleware to use the new schemas, and test. (May
    1 - June 30) (EDG?)
  • 8) Define a schema versioning mechanism to
    allow easy changes/additions to the schemas in
    the future. (May 1-30) (Brian, Cristina, Jenny)
  • 9) Install R-GMA on the GLUE testbed, and test.
    Configure EDG CE/SE with R-GMA information
    providers, and write new information providers if
    necessary. (July 1-30) (EDG?)
  • 10) Modify the R-GMA equivalent of Grid Searcher
    to work with new schemas (does this tool
    currently exist?). (July 1-30) (EDG?)
  • 11) Define a common schema for the Storage
    Element in GOS, and repeat steps 4 and 5.
    (June 1-30) (glue-schema group)
  • 12) Define a common schema for Network Elements
    in GOS, and repeat steps 4 and 5. (July 1-30)
    (glue-schema group)
  • 13) More testing (August) (all)

34
Data Movement Infrastructure
  • Move data from storage services operated by one
    organization to another.
  • Work already started:
  • GDMP - in EDG and VDT
  • SRM (HRM, DRM) - conceptual buy-in from VDT and
    EDG
  • GridFTP protocol server work (a client sketch
    follows this list)
  • This could use a dedicated meeting of the right
    technical people to discuss.
  • In the iVDGL core team: Brian. In DataTAG it is ?
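
For the GridFTP item, client-driven or third-party transfers in this era went through the globus-url-copy client. Below is a minimal sketch that shells out to it; the storage-element host names and paths are hypothetical, and a valid grid proxy (from grid-proxy-init) is assumed to exist already.

#!/usr/bin/env python
"""Hedged sketch: copy one file between storage elements with globus-url-copy.

Assumes the Globus client tools are installed and a valid grid proxy already
exists (grid-proxy-init).  Host names and paths are hypothetical.
"""
import subprocess

SRC = "gsiftp://se.uscms.example.edu/data/run42/events.root"
DST = "gsiftp://se.datatag.example.org/data/run42/events.root"

def copy(src, dst):
    # globus-url-copy <source URL> <destination URL>; a non-zero exit code
    # signals failure (expired proxy, unreachable server, bad path, ...).
    rc = subprocess.call(["globus-url-copy", src, dst])
    if rc != 0:
        raise RuntimeError("globus-url-copy failed with exit code %d" % rc)

if __name__ == "__main__":
    copy(SRC, DST)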

35
Authorization services
  • Perform some level of cross organization,
    community based authorization
  • Still a bit chaotic on the US/VDT side.
  • EDG has a set of LDAP/VO management scripts and
    policies. Continuous discussion in this area.
  • For the iVDGL Core Team, Doug is organizing
    meetings of Site AA on the issue, and Rick and
    Kaushik are aware of Experiment needs and
    short-term deployments. Ewa is working with
    Cristina on specific needs for the GLUE Testbed.

36
Authorization: Site, Service and Virtual
Organization
  • In the US we have already engaged the Site
    Security people. US Physics Grid Project
    experimental groups are looking to use their
    Grids in production.
  • Possible 3 layers of authorization (work in
    progress):
  • Cert publishing directory that ESnet is providing
  • LDAP VO server from each VO
  • Mapping cert DN to local user account (a parsing
    sketch follows this list).
  • Proposal for a new BOF at GGF for presentation of
    the Site Security view.
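
The last layer, mapping a certificate DN to a local account, is what the Globus grid-mapfile does. A hedged parsing sketch follows; the per-line file format is the standard convention, while the example DN is made up.

#!/usr/bin/env python
"""Hedged sketch: resolve a certificate DN to a local account via a grid-mapfile.

The '"/subject DN" localaccount' per-line format of
/etc/grid-security/grid-mapfile is the standard Globus convention; the example
DN below is made up.
"""

MAPFILE = "/etc/grid-security/grid-mapfile"

def load_gridmap(path):
    """Return {certificate DN: local account}, skipping blanks and comments."""
    mapping = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or not line.startswith('"'):
                continue
            dn, _, account = line[1:].partition('"')
            mapping[dn] = account.strip()
    return mapping

if __name__ == "__main__":
    gridmap = load_gridmap(MAPFILE)
    dn = "/O=Grid/O=ExampleCA/OU=example.edu/CN=Jane Physicist"   # hypothetical DN
    print(gridmap.get(dn, "no local account mapped"))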

37
Deployments of Authorization Infrastructure
starting in the US
  • Currently no VO management on the Test Grids
  • Account management is fully decentralized
  • Local account policy/management varies widely
    from site to site
  • Some sites do not allow group accounts!
  • Decentralized situation is starting to become
    unwieldy
  • EDG VO Scripts are being deployed and tested
    across two US-CMS Test Grid sites
  • Group (or pooled) accounts vs. individual
    accounts
  • Current VO management scripts only work well if
    group accounts are allowed at every site
  • Not every site allows group accounts (for very
    good reasons)!
  • The interaction between local and grid-wide
    account management needs to be understood
  • Will sites be willing (or even able) to delegate
    account management to a trusted CA or RA?

38
Computational services
  • Coordinate computation across organizations to
    allow submission of jobs in the EU to run on US
    sites and vice versa.
  • All Grids use GRAM as the Job Submission
    Interface. Thus we can interoperate at this time
    if we try (a bare-GRAM sketch follows this list).
    This is in the short-term plans for:
  • CMS, using MOP/IMPALA/BOSS
  • ATLAS, using the EDG
  • D0/CDF, using a Job Information Management layer
    with SAM
  • Making it robust enough to do actual work is
    necessary before we can claim any success.
  • Would like to have interoperability at the level
    of High Level Services:
  • Job Submission Language and
    Resource Broker
  • These may well be VO-specific and so there will
    be several interoperating infrastructures, which
    may NOT be packaged with the VDT (or, at the HENP
    Application Layer, with the EDG).
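
Since GRAM is the common job-submission interface, the lowest common denominator for a cross-continent test looks like the sketch below, which shells out to the globus-job-run client. The gatekeeper contact strings are hypothetical and a valid proxy is assumed; real production ran Condor-G/DAGMan (MOP) or the EDG broker on top of GRAM rather than bare globus-job-run.

#!/usr/bin/env python
"""Hedged sketch: run the same trivial job on EU and US gatekeepers via GRAM.

Uses the globus-job-run client from the Globus toolkit and assumes a valid
grid proxy.  Gatekeeper contact strings are hypothetical; real production went
through Condor-G/DAGMan or the EDG broker layered on top of GRAM.
"""
import subprocess

GATEKEEPERS = [
    "gatekeeper.uscms.example.edu/jobmanager-condor",   # hypothetical US site
    "gatekeeper.datatag.example.org/jobmanager-pbs",    # hypothetical EU site
]

def run_everywhere(executable="/bin/hostname"):
    for contact in GATEKEEPERS:
        # globus-job-run <contact> <executable> submits through GRAM and
        # streams the job's stdout back; a non-zero exit code means failure.
        print("== %s ==" % contact)
        subprocess.call(["globus-job-run", contact, executable])

if __name__ == "__main__":
    run_everywhere()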

39
Experiment Grid Interoperability
  • DataTAG has significant focus on Experiment Grid
    Interoperability
  • iVDGL has significant focus on deployment and
    operation of multi-application, multi-site Grids
  • Part of the wider DataTAG and US Physics Grid
    Project collaboration, and is part of the joint
    Trillium and DataTAG meeting agendas.
  • Explicitly includes non-LHC experiments: D0,
    CDF, LIGO/Virgo

40
GLUE, the LCG and Schedules
  • Deliverables from GLUE are needed to meet the LCG
    Production Grid Milestone at the end of the year.
  • That means deliverables from GLUE are needed for
    EDG TestBed 2.
  • To date the effort available and schedules of the
    actual Sub-Projects are slower than the
    milestones require.
  • Does sufficient infrastructure already exist in
    practice?
  • Experiment jobs scheduled through Condor-G and
    GRAM
  • Data movement through GridFTP and GDMP
    interfaced to HRM
  • Authorization: grid mapfiles done by hand

41
The TeraGrid: A National Infrastructure
www.teragrid.org
42
NSF TeraGrid: 14 TFLOPS, 750 TB
[Diagram: four sites (ANL, visualization; Caltech, data
collection and analysis; NCSA, compute-intensive; SDSC,
data-intensive) linked by the WAN, with Myrinet
interconnects within the sites.]
  • WAN Bandwidth Options:
  • Abilene (2.5 Gb/s, possibly 10 Gb/s late 2002)
  • State and regional fiber initiatives plus CANARIE
    CANet
  • Leased OC-48
  • Dark Fiber, Dim Fiber, Wavelengths
  • WAN Architecture Options:
  • Myrinet-to-GbE, Myrinet as a WAN
  • Layer 2 design
  • Wavelength Mesh
  • Traditional IP Backbone
43
NSF TeraGrid: 14 TFLOPS, 750 TB
[Diagram: per-site hardware. Labels shown: 574p IA-32
Chiba City; 256p HP X-Class; HR display and VR
facilities; 128p HP V2500; 92p IA-32; 1024p IA-32 and
320p IA-64; 1176p IBM SP Blue Horizon; 1500p Origin;
Sun E10K; HPSS and UniTree archival storage; Myrinet
interconnects; sites labelled ANL (visualization),
Caltech (data collection and analysis), NCSA
(compute-intensive), SDSC (data-intensive).]
44
To Build a Distributed Terascale Cluster
[Diagram: two 10 TB clusters joined through a big, fast
interconnect (OC-192; n x GbE for large n). The 5 GB/s
aggregate corresponds to 200 streams of 25 MB/s each;
25 MB/s is 200 Mb/s, roughly 20% of a GbE link.]
45
Terascale Status
  • Finalizing node architecture (e.g. 2p vs 4p
    nodes) and site configurations (due by end of
    June)
  • Expecting initial delivery of cluster hardware
    late 2002
  • IA-32 TeraGrid Lite clusters up today, will be
    expanded in June/July
  • Testbed for service definitions, configuration,
    coordination, etc.
  • NSF has proposed Extensible TeraScale Facility
  • Joint between ANL, Caltech, NCSA, PSC, and SDSC
  • PSC Alpha-based TCS1 as test case for
    heterogeneity, extensibility
  • Transformation of 4-site internal mesh network
    to extensible hub-based hierarchical network
  • First Chicago-to-LA lambda on order
  • Expect to order 2nd lambda in July, 3rd and 4th
    in September
  • Initial definitions being developed
  • Batch_runtime
  • Data_intensive_on-demand_runtime
  • Interactive_runtime
  • Makefile Compatibility service
  • Merging Grid, Cluster, and Supercomputer
    Center cultures is challenging!