1
LCG Monitoring and Accounting
  • Dave Kant
  • CCLRC e-Science Centre, UK
  • HEPSYSMAN
  • April 2005

2
Introduction
  • Overview of some of the monitoring tools in
    action in the LHC Computing Grid
  • GOCDB
  • GPPMON
  • GRIDICE
  • GSTAT
  • CERTIFICATION TESTING
  • REAL TIME GRID MONITOR
  • Accounting Use Case
  • Future Plans

3
Monitoring the LCG Grid is a Challenge!
  • The number of participating sites is growing every
    day
  • August 2003: 12 sites, 100 CPUs
  • October 2004: 83 sites, 8000 CPUs
  • April 2005: 138 sites, 14000 CPUs, 4TB disk
  • Grid Operations Centre: monitor the operational
    status of sites
  • Fault detection
  • Problem management: identify problems, escalate,
    track
4
Monitoring Challenges
  • With so many sites participating, there is a
    requirement for operational information in order
    to manage a grid environment
  • What are the core grid services (e.g.
    RBs/SEs/BDIIs) that the VOs are using for data
    challenges?
  • Who do we contact when there is a security
    incident?
  • Require a toolkit to test specific core services.
  • We have to concentrate on functional behaviour of
    services e.g. If an RB sends your job to a CE,
    then we must assume the RB is working fine. Is
    this the only test of a RB?
  • Not all the tests that we perform are effective
    at finding problems.
  • We must develop tests which simulate the life
    cycle of real applications in a Grid environment.
  • and lots more

5
GOC Configuration Database
http://goc.grid-support.ac.uk/gridsite/gocdb
Secure database management via HTTPS / X.509.
Stores a subset of the Grid Information System:
people, contact information, resources, scheduled
maintenance.
  • Monitoring Services
  • Operations Maps
  • Configure other Tools
  • Organisation Structure
  • - People/Institutes/Projects
  • Secure services
  • - News
  • Self Certification
  • Accounting

[Diagram labels: GOC GridSite server, MySQL, SQL,
https; Resource Centre (RC) resources and site
information (EDG, LCG-1, LCG-2); bdii, ce, se, rb]
GOC DB can also contain information that is not
present in the IS, such as scheduled maintenance,
news, organisational structures, and geographic
coordinates for maps.
6
Operations Map: Job Submission Tests
GPPMON displays the results of tests
against sites. Test: Job Submission. The job is a
simple test of the grid middleware components,
e.g. the Gatekeeper service, the RB service, and the
Information System via JDL requirements.
This kind of test deals with the functional
behaviour of core grid services: do simple jobs
run? These are lightweight tests which run hourly.
However, they have certain limitations, e.g. Dteam
VO, WN reach (specialised monitoring queues).
7
Operations Map: Certificate Lifetime
GPPMON displays the results of tests
against sites. Test: Certificate Lifetime. Many
grid services require a valid certificate for
security.
By probing the host certificates on CEs and SEs
at sites with a simple SSL client service, we can
identify certificates which are due to expire and
send an early warning to the sites. A predictive tool!
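The certificate lifetime probe could be sketched roughly as follows. This is only an illustration, not the GPPMON implementation: the function names, the 30-day warning threshold and the default port are assumptions, and only Python standard-library calls are used.

```python
import socket
import ssl
import time

WARN_DAYS = 30  # hypothetical warning threshold, not taken from the slides


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time.

    not_after uses the format reported by Python's ssl module,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0


def needs_warning(not_after, now=None, warn_days=WARN_DAYS):
    """True if the certificate expires within warn_days (or already has)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host, port=443, timeout=10):
    """Open a TLS connection and return the host certificate's notAfter string.

    The port is an assumption; a real probe would use the service's own port.
    The default context verifies the certificate, so an already expired or
    untrusted certificate shows up as a failed handshake instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A monitor would call `fetch_not_after` for each CE and SE and mail the site contact whenever `needs_warning` returns true.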
8
GRIDICE Architecture
A different kind of monitoring tool: processes /
low level metrics / grid metrics. Developed by the
INFN-GRID Team: http://infnforge.cnaf.infn.it/gridice
  • Data harvest via discovery service (PostgreSQL)
  • Publication service
  • Measurement service: monitoring sensor agents
    probe the process table, memory, cpu
9
GRIDICE Global View
Different views of the data: Site / VO /
Geographic
Resource usage: CPU, Load, Storage, Job Info
List of Sites
Display shows the processes belonging to the
Broker service. Problems are flagged.
10
GridIce Job Monitoring
  • Version 1.6.3, recently deployed on LCG, features
    job monitoring (Queued, Running, Finished)
    organised in different ways (site, VO, etc.)
  • XML views of data

11
GRIDICE Expert View
Node
Processes
Display shows the processes belonging to the
Broker service. Problems are flagged
12
Ganglia Monitoring
  • http://gridpp.ac.uk/ganglia
  • Can use Ganglia to monitor a cluster

Scalable distributed monitoring system for
clusters and grids, using RRD for storage and
visualisation. The RAL Tier-1 Centre LCG PBS server
displays job status for each VO. Get a lot for
little effort.
13
Federating Cluster Information
  • Can also use Ganglia to monitor clusters of
    clusters

Ganglia/R-GMA integration through Ranglia.
14
GIIS Monitor
  • Developed by Min Tsai (GOC Taipei)
  • Tool to display and check information published
    by the site GIIS (sanity checks, fault detection)
  • http://goc.grid.sinica.edu.tw/gstat/

15
Regional Monitoring
Dealing with the complexities of
managing a grid.
  • EGEE is made up of regions.
  • Each region contains many computing centres.
  • A Regional Operations Centre (ROC) is the focus
    for operations in its region.

16
Regional Monitoring Maps
  • http://goc.grid-support.ac.uk/roc_map/map.php
  • Provide ROCs with a package to monitor the
    resources in the region
  • Tailored Monitoring
  • GUIs to create organisations and populate them
    with sites
  • Hierarchical view of Resources
  • Example: UK Particle Physics (GridPP)
  • Materialised Path encoding
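
Materialised path encoding stores, for every node in a hierarchy, the full path from the root, so that "everything under this organisation" becomes a simple prefix match. A minimal sketch follows; the hierarchy, the "/" separator and the function names are illustrative assumptions, not the GOC schema.

```python
# Each site stores its full path from the root of the organisation tree.
# The tree below is made up for illustration only.
SITE_PATHS = {
    "RAL-LCG2": "UKI/GridPP/Tier1/RAL-LCG2",
    "IC-LCG2": "UKI/GridPP/London/IC-LCG2",
    "UCL-HEP": "UKI/GridPP/London/UCL-HEP",
    "IN2P3-CC": "France/IN2P3-CC",
}


def sites_under(paths, node):
    """Return the sites whose path lies below the given node.

    Equivalent to a SQL filter such as: path LIKE 'node/%'.
    The trailing separator stops 'UKI' from matching a node named 'UKI2'.
    """
    prefix = node.rstrip("/") + "/"
    return sorted(name for name, path in paths.items() if path.startswith(prefix))


def ancestors(path):
    """Return every ancestor of a path, from the root downwards."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts))]
```

For example, `sites_under(SITE_PATHS, "UKI/GridPP/London")` selects the two London sites without any recursive query, which is what makes the encoding convenient for hierarchical views.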

17
Site Functional Tests (SFT)
  • In terms of middleware, the installation and
    configuration of a site is quite a complicated
    procedure.
  • When there is a new release, sites don't upgrade
    at the same time
  • Some upgrades don't always go smoothly
  • Unexpected things happen (who turned off the
    power?)
  • Day-to-day problems: robustness of service under
    load?
  • It's necessary to actively hunt for problems
  • Site certification testing is run by the CERN
    deployment team on a daily basis. The first step
    toward providing this service involves running a
    series of replica manager tests which register
    files onto the grid, move them around, delete
    them and make 3rd party copies from a remote SE.
  • Unlike the simple job submission tests
    implemented in GPPMON, these tests are more
    heavyweight and attempt to simulate the life
    cycle of real applications.

18
Certification Test Results
http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
19
Syndication of Monitoring Information
GOC generates RSS feeds which clients can pull
using an RSS aggregator. How can we integrate
feeds and ticketing systems?
Aggregator: RSSReader (Windows client)
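One possible way to link feeds to a ticketing system is to poll the feed and draft a ticket for every item not seen before. The sketch below assumes a plain RSS 2.0 feed; the sample feed, the ticket fields and the function names are all made up for illustration and do not describe the GOC feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, standing in for a real monitoring feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example monitoring feed</title>
    <item>
      <title>Job submission test failed at SITE-A</title>
      <link>http://example.org/report/1</link>
      <description>Hypothetical failure report</description>
    </item>
    <item>
      <title>Certificate about to expire at SITE-B</title>
      <link>http://example.org/report/2</link>
      <description>Hypothetical expiry warning</description>
    </item>
  </channel>
</rss>"""


def feed_items(xml_text):
    """Parse RSS 2.0 text and return one dict per <item>."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        }
        for item in root.iter("item")
    ]


def new_tickets(items, seen_links):
    """Draft a ticket for each feed item whose link has not been seen yet."""
    return [
        {"summary": i["title"], "details": i["description"], "source": i["link"]}
        for i in items
        if i["link"] not in seen_links
    ]
```

Keeping the set of already seen links between polls is what stops the same report from opening duplicate tickets.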
20
Real Time Grid Monitor
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
Why are jobs failing? Why are jobs queued at
sites while others are empty?
A Visualisation tool to track jobs currently
running on the grid. Applet queries the logging
and bookkeeping service to get information about
grid jobs.
21
Problems with existing tools
  • Lots of monitoring tools have been described; they
    have a few things in common:
  • - all the information which they generate is
    hidden away or difficult to access
  • - limited interfaces: the data can only be
    accessed in specific ways
  • Therefore, it's difficult to build on-demand
    services to allow community players to
    interact with the data.
  • Examples include:
  • Job Accounting service: allows an organisation
    to compare resource usage for each VO
  • Certification Testing service: a secure service
    to allow a site administrator to run the
    certification test suite against their site
    through an RB of their choice?
  • The idea is for the services to collect
    information and put it into a common repository
    such as an RGMA Archiver. In this way, the
    information can be shared and accessible to all.
  • Services (in EGEE parlance, ROC and CIC services)
    munch the data and present it to the community.
  • Example: the problem with GIIS is that it's hard
    to drill down to the information you want, e.g.
    How much CPU in GridPP today? How much disk in
    the UKI ROC? The new paradigm solves this problem
    by allowing the data to be aggregated in
    different ways.

22
Monitoring Paradigm
A Better way to unify monitoring information. GOC
Services collect information and publish into an
archiver. ROC/CIC Services provide a means for
the community to interact with this information
on-demand. GOC provides services tailored to the
requirements of the community.
23
GOC Use Case: Job Accounting
  • An accounting package for LCG has been developed
    by the GOC at RAL
  • There are two main parts:
  • the accounting data-gathering infrastructure
    based on R-GMA which brings the data to a central
    point
  • a web portal to allow on-demand reports for a
    variety of players.

24
Requirements
  • A historical record of grid usage to identify the
    use of individual sites by VOs as a function of
    time
  • To demonstrate the total delivery of resources by
    that site to the Grid
  • Aggregated views of the collected data by:
  • VO
  • Country: a requirement of LCG, which has a
    country-based structure
  • EGEE Region: for use by an EGEE Regional
    Operations Centre (ROC)
  • A presentation front-end to the data to allow the
    selection on-demand of the views described above
    for different VOs and periods of time.
  • To present the data as:
  • A graphical view for interpretation
  • A tabular view for precision
  • To support sites that already had their own
    methods of data collection by allowing arbitrary
    data collection techniques and insertion of the
    data in the standard schema into the central
    database.

25
Requirements
  • It was not an explicit requirement that user
    information be captured but we included this in
    the design as we were sure this would be a
    secondary requirement
  • This is a reporting system, not a charging
    mechanism.
  • The information is under the control of the site,
    so it does not meet the requirement of a charging
    system to be digitally signed and irrefutable.
  • Information is gathered centrally, not under the
    control of the VO

26
Design
  • Information collected at each site from batch
    logs, gatekeeper logs etc
  • Information joined at site level to select grid
    jobs and stored in database on R-GMA MON box at
    site.
  • Information published through R-GMA and collected
    centrally in an R-GMA archive at GOC
  • Web site presents various views of this data for
    presentation
  • Structure of the Grid is taken from GOC DB, the
    grid configuration database.
  • Only normalised cpu time collected

27
APEL: Accounting Processor for Event Logs
28
How Does APEL Work?
  • PBS/LSF log processed daily on site CE to extract
    required data; filter acts as R-GMA DBProducer ->
    PbsRecords table
  • Gatekeeper log processed daily on site CE to
    extract required data; filter acts as R-GMA
    DBProducer -> GkRecords table
  • Message log processed daily on site CE to extract
    required data; filter acts as R-GMA DBProducer ->
    MessageRecords table
  • Site GIIS interrogated daily on site CE to obtain
    SpecInt and SpecFloat values for CE; acts as
    DBProducer -> SpecRecords table, one dated record
    per day
  • These three tables are joined daily on the MON box
    to produce the LcgRecords table. As each record is
    produced, the program acts as a StreamProducer to
    send the entries to the LcgRecords table on the
    GOC site.
  • The site now has a table containing its own
    accounting data; the GOC has an aggregated table
    over the whole of LCG.
  • Interactive and regular reports are produced by
    the site or at the GOC site as required.
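
The daily join can be pictured with a small in-memory example. The table names come from the slide, but the column names, the sample rows and the use of SQLite in place of the site database and R-GMA are assumptions made purely for illustration.

```python
import sqlite3


def build_lcg_records(conn):
    """Join the three per-log tables into LcgRecords (column names are assumed).

    A batch job only survives the join if the message log maps it to a
    gatekeeper job, which is how local (non-grid) jobs drop out.
    """
    conn.execute(
        """
        CREATE TABLE LcgRecords AS
        SELECT g.UserDN     AS UserDN,
               p.LocalJobId AS LocalJobId,
               p.UnixGroup  AS UnixGroup,
               p.CpuSeconds AS CpuSeconds
        FROM PbsRecords AS p
        JOIN MessageRecords AS m ON m.LocalJobId = p.LocalJobId
        JOIN GkRecords AS g ON g.GramScriptJobId = m.GramScriptJobId
        """
    )


def demo():
    """Build tiny sample tables, run the join and return the joined rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE PbsRecords (LocalJobId TEXT, UnixGroup TEXT, CpuSeconds INTEGER);
        CREATE TABLE GkRecords (GramScriptJobId TEXT, UserDN TEXT);
        CREATE TABLE MessageRecords (GramScriptJobId TEXT, LocalJobId TEXT);
        INSERT INTO PbsRecords VALUES ('101.ce', 'atlas', 3600);
        INSERT INTO PbsRecords VALUES ('102.ce', 'localuser', 500);
        INSERT INTO GkRecords VALUES ('gram-1', '/C=UK/O=Example/CN=Some User');
        INSERT INTO MessageRecords VALUES ('gram-1', '101.ce');
        """
    )
    build_lcg_records(conn)
    rows = conn.execute(
        "SELECT UserDN, LocalJobId, UnixGroup, CpuSeconds FROM LcgRecords"
    ).fetchall()
    conn.close()
    return rows
```

In this sample only job `101.ce` reaches LcgRecords; job `102.ce` has no message-log entry and is treated as local work.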

29
GOC
Job records in via RGMA:
1 record per grid job (millions of records
expected)
RGMA MON
SQL query to accounting server: 1 query / hour
Summary data refreshed every hour (max records
about 100K per year)
Home Page
On-demand accounting pages based on SQL queries
to summary data
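The hourly refresh reduces millions of job records to a small summary table that the web pages query. A rough sketch of that reduction, assuming a hypothetical record layout (`site`, `vo`, `end_date` as `YYYY-MM-DD`, `cpu_seconds`) and a summary keyed by site, VO and month:

```python
from collections import defaultdict


def summarise(job_records):
    """Collapse per-job records into one row per (site, VO, month).

    The field names are assumptions for illustration; end_date is expected
    in ISO form so that its first seven characters give the month.
    """
    summary = defaultdict(lambda: {"jobs": 0, "cpu_seconds": 0})
    for rec in job_records:
        key = (rec["site"], rec["vo"], rec["end_date"][:7])
        summary[key]["jobs"] += 1
        summary[key]["cpu_seconds"] += rec["cpu_seconds"]
    return dict(summary)
```

Because the number of summary rows grows with sites, VOs and months, not with jobs, on-demand pages can stay fast however many job records arrive.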
30
Description
  • The web portal allows information to be selected
    by:
  • VO, time range, and scope (whole Grid, country,
    EGEE region, site)
  • Also shows information on data collected

31
Select date range
Select VOs (Default All)
Web form to apply selection criteria on the data
Aggregate data across an organisation structure
(Default All ROCs)
32
Summed CPU (Seconds) consumed by resources in
selected Region
VO Index
Selected Date Range
33
List of Sites Belonging to the Selected ROC
A breakdown of the resource usage per Site, per
VO, per Month
34
http://goc.grid-support.ac.uk//
65 sites publishing data to GOC (April 2005).
Over 1.3 million job records.
50K records per week.
35
GOC Accounting Services
http://goc.grid-support.ac.uk/gridsite/accounting/index.html
On-demand services to the EGEE community
Simple interface to customise views of data: VO,
time frame and Region (default EGEE)
BaseCpuSeconds aggregated across EGEE:
each Region, per VO, per Month
Other distributions: Normalised CPU, Jobs
36
Provide Interface to the Data Driven
by User Requirements
Materialised Path Library
Tier-1 View, Regional View, Country View
37
Including Graphing Features
38
Number of Sites per Country Publishing Accounting
Records to GOC
39
GridPP Accounting Status April 2005
  • Sites that have never published or have not
    published recently:
  • CAVENDISH-LCG2 -- never published
  • Dublin-CSTCDIE -- never published
  • DURHAM -- last published 18th Feb 2005
  • IC-LCG2 -- last published 9th April 2005
  • RAL-LCG2 -- last published 16th March 2005
  • HP-Bristol -- never published
  • Lancs-LCG2 -- never published
  • LivHep-LCG2 -- never published
  • QMUL-eScience -- never published
  • RHUL-LCG2 -- never published
  • ScotGrid-Glas -- last published 17th Jan 2005
  • UCL-CCC -- last published 12th Feb 2005
  • UCL-HEP -- never published
  • Contact Dave if you need advice about
    installing Apel
  • D.Kant_at_rl.ac.uk Tel 01235 778178

40
Batch System Support
  • APEL supports PBS (Released) and LSF (Testing)
  • Implementations are separate and independent of
    one another. Currently LCG2_4_0 has PBS support
    only.
  • Re-factoring to a single package with plug-in
    batch specific components is currently in
    progress.
  • What is the current status of LSF support?
  • LSF currently comes in three flavours (version 4,
    5 and 6); each has a different usage record
    format
  • New RPM edg-rgma-apel-lsf has been released to
    CERN for testing.
  • Expect a release in the 2_4_1 tag next month.

41
Issues
  • Which RPM Version?
  • Latest version on
    http://goc.grid-support.ac.uk/gridsite/accounting
  • 3.4.44 for LCG2_4_0
  • Change Log 3.4.37 to 3.4.43
  • Apel 3.4.43 (April 6th): Startup script modified
    for RGMA 2_4_0 s/w release
  • Apel 3.4.42 (Mar 20th): Improved core
    functionality
  • Better handling of dn suppression
  • Check flexible archiver on-line before attempting
    to send job records
  • Apel 3.4.41 (Feb 2nd): Minor fix to SQL script
  • Apel 3.4.40 (Jan 17th):
  • Normalisation issue (see later)
  • CatchAll specInt/specFloat set to value in GIIS
    rather than 0
  • Apel 3.4.39 (Dec 16th): Current PBS log excluded
    from archive
  • Apel 3.4.38 (Nov 19th):
  • Bug in reprocess option during Join
  • Added cleanAll option
  • Apel 3.4.37 (Oct 14th):
  • grant mechanism to allow GK and CE to connect to
    MySQL database

42
Issues
  • VO Filtering
  • National Grid VO activities run on the same
    infrastructure as EGEE/LCG
  • Privacy reasons: sites don't want to publish
    national VO data to GOC
  • APEL does not discriminate between the VOs
  • Develop a solution? What can we suggest today?
  • GOCDB can hide resources
  • APEL meets the requirement that local work is
    not published, but non-LCG work does come through.
  • What's the model: 1 CE per VO? What do people do?
  • Don't need to install Apel on non-LCG VO CEs
  • SARA-LCG2, IISAS-Bratislava
  • GridPP?

43
Issues
  • Development of Tests to Check the Accounting
    Service
  • Is site accounting working?
  • Is the GOC listening for new data?
  • Is the RGMA Registry working?
  • GSTAT
  • GOC Flexible archiver service listens for
    accounting producers
  • If the service is down, no data can be sent to
    the GOC!
  • Use the service every 5 minutes to update a
    timestamp in a test record in the accounting
    database. GSTAT can query the table, look at the
    timestamp and compare it with the current
    date/time.
  • 3rd party to use the flexi service.
  • Use RGMA to compare records in the site database
    and GOC
  • Site Functional Tests
  • Can check the RPM version installed on the CE
  • Testing the Whole Thing instead of the Pieces
  • Investigate an Apel heart-beat
  • Site cron writes a test record every hour and
    publishes to GOC
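
The timestamp comparison behind such a heartbeat test is simple enough to sketch. The 5-minute interval comes from the slide; the tolerance of two missed updates and the function names are assumptions.

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(minutes=5)  # update period quoted on the slide
ALLOWED_MISSES = 2  # hypothetical tolerance before raising an alarm


def heartbeat_age(last_update, now):
    """Time elapsed since the test record's timestamp was last refreshed."""
    return now - last_update


def service_is_alive(last_update, now,
                     interval=HEARTBEAT_INTERVAL,
                     allowed_misses=ALLOWED_MISSES):
    """True if the test record is no older than the tolerated number of intervals."""
    return heartbeat_age(last_update, now) <= interval * allowed_misses
```

The same check would work for an hourly site-side heartbeat by passing `interval=timedelta(hours=1)`.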

44
Issues
  • Which Log Files Should Site Administrators
    Back Up?
  • To build accounting records, we need to process
    data from THREE log file sources. This is a
    mandatory requirement in order to reconstruct
    what has been done during the 2004 period.
  • /var/log/globus-gatekeeper
  • Match between grid-user DN and GramScriptJobId
  • /var/spool/pbs/server_priv/accounting/
  • Local jobID and details of resources consumed
  • No distinction between grid jobs and non-grid
    jobs.
  • /var/log/messages
  • Map GramScriptJobId to local JobID
  • This is how we separate grid jobs from local user
    jobs which run on the local fabric.
  • If the site has deleted its messages files, we
    may be able to work around this by matching local
    unix groups in the batch logs. Accounting records
    formed in this way will not contain the dn of the
    grid-user.

45
Issues
  • siteName Changes
  • Recent problem with presenting data from the
    French ROC where CCIN2P3 was renamed to IN2P3-CC
    via GOCDB portal
  • All records associated with the site are updated
    in order for SQL queries to match the new
    siteName.
  • Namespace Convention?
  • Naming scheme to identify data belonging to large
    sites which provide services for different
    communities etc.
  • NIKHEF: lcgprod.nikhef.nl, lcg2prod.nikhef.nl,
    edgapptb.nikhef.nl
  • SiteName is a bad choice because we get
    multiple hits
  • IC-LCG2 gives multiple matches: PIC-LCG2 and
    IFIC-LCG2
  • Request sites stick to the convention .SiteName
  • h1.desy.de, zeus.desy.de
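
The multiple-hit problem is what a substring match on the site name would produce. The sketch below contrasts it with an exact match and with a match on a whole dot-separated suffix; the last function is only one possible reading of the naming convention, and all function names are illustrative.

```python
SITE_NAMES = ["IC-LCG2", "PIC-LCG2", "IFIC-LCG2", "RAL-LCG2"]


def substring_matches(query, names):
    """Mimics a SQL pattern such as: siteName LIKE '%query%'."""
    return [name for name in names if query in name]


def exact_matches(query, names):
    """Mimics: siteName = 'query'."""
    return [name for name in names if name == query]


def domain_matches(site_domain, hosts):
    """Match names ending in a whole dot-separated suffix.

    Assumed interpretation of a prefix.SiteName style convention: the
    leading dot prevents partial-word hits.
    """
    suffix = "." + site_domain
    return [h for h in hosts if h == site_domain or h.endswith(suffix)]
```

With these definitions, `substring_matches("IC-LCG2", SITE_NAMES)` returns three sites, while `exact_matches` returns only the intended one.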

46
Issues
  • Normalisation
  • We want to perform a reasonably sensible first
    order estimate to account for the differences in
    worker node performance.
  • Homogeneous vs Heterogeneous
  • PBS Job Records don't have any information about
    the worker node benchmarks, so we must insert one
    manually
  • PBS farms set up in different ways can lead to an
    error in the normalisation calculation (Blindman
    vs internal normalisation)
  • Histories - What SpecInts do we use in order to
    process archived Job Records?
  • LSF Job Records have a CPU_FACTOR (1 - 4) in the
    Job Record.
  • What does a value of 1 correspond to?
  • Different calibration value at each site
  • Conversion table?
  • Can the site publish a weighted specInt2000 for
    the farm?
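
A first-order normalisation could look like the sketch below. The slides do not give the formula, so the linear scaling to a reference benchmark, the reference value of 1000 and the node-count weighting are all assumptions chosen for illustration.

```python
REFERENCE_SPECINT = 1000.0  # hypothetical reference benchmark value


def weighted_specint(node_groups):
    """Farm-wide benchmark as a node-count weighted average.

    node_groups is a list of (number_of_nodes, specint_per_node) pairs.
    """
    total_nodes = sum(count for count, _ in node_groups)
    if total_nodes == 0:
        raise ValueError("farm has no worker nodes")
    return sum(count * specint for count, specint in node_groups) / total_nodes


def normalised_cpu_seconds(cpu_seconds, specint, reference=REFERENCE_SPECINT):
    """Scale raw CPU time linearly by benchmark relative to the reference."""
    return cpu_seconds * specint / reference
```

For a farm with 10 nodes rated 400 and 30 nodes rated 800, the weighted value is 700, so one hour of raw CPU counts as 2520 normalised seconds under these assumptions.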

47
Issues
  • Service Reliability Hardening
  • If flexible archiver is down, sites unable to
    publish data to GOC
  • Update: in 3.4.42/43 the Apel core checks that the
    flexible archiver service is available before
    attempting to publish data.
  • GOC publishes a test record every 5 minutes to
    check the service is alive; an automatic service
    recovery mechanism is now in place
  • Investigate running multiple flexible archiver
    services
  • 1 per GOC or 1 per ROC?
  • At the moment, the archiver service listens for
    all producers rather than producers belonging to
    a ROC.
  • Single point of failure if registry is down?
  • Multiple registry replicas supported in the RC1
    (gLite) release?
  • Update: multiple registries supported in
    LCG2_4_0?

48
Future Plans
  • Integration into gLite Framework?
  • Work started
  • Apel for Storage
  • Capturing billing information for dcache
  • Cron runs, publish recent data into R-GMA
  • SE snapshot, e.g. df of filesystem
  • Use of disk and tape
  • A cron job runs a script on the SE, with the
    script tailored to each SE type, e.g. dcache,
    tapestore, etc.
  • Web Services Interface to accounting data
  • How would such a thing work?
  • Any UseCases?

49
Accounting Issues
  • A stable release of the accounting package has
    been certified and tested at CERN. Should sites
    wait for the official release or press ahead
    independently?
  • Package supports PBS only; initial implementation
    for LSF.
  • 80 sites advertising 313 Job managers
  • - 300 PBS (91% of sites)
  • - 3 CONDOR (KFKI, FNAL, TRIUMF)
  • - 7 LSF (GSI, LNL, CERN)
  • Accounting requires the R-GMA infrastructure to
    be deployed at the site.
  • The VO associated with a user's DN is not
    available in the batch or gatekeeper logs. It
    will be assumed that the group ID used to execute
    user jobs, which is available, is the same as the
    VO name.
  • The global jobID assigned by the Resource Broker
    is not available in the batch or gatekeeper logs.
    This global jobID cannot therefore appear in the
    accounting reports. The RB Events Database
    contains this, but that is not accessible, nor is
    it designed to be easily processed. Andrea
    Guarise JRA1 proposal.

50
Accounting Issues
  1. Most sites keep GK/Batch logs but throw away
    message log files after 9 weeks due to default
    log rotation.
  2. At present the logs provide no means of
    distinguishing sub-clusters of a CE which have
    nodes of differing processing power. Changes to
    the information logged by the batch system will
    be required before such heterogeneous sites can
    be accounted properly. At present it is believed
    all sites are homogeneous.

51
Summary
  • Accounting Information gathering infrastructure
    has been developed
  • It has been through the CT cycle and should be
    deployed in the next release.
  • A web portal for display of this information has
    been developed (work in progress)
  • This is an EGEE deliverable (DSA1.3)
  • The display infrastructure can be deployed for
    other monitoring information.
  • Development towards on-demand services to provide
    the community with up-to-date information,
    aggregated at different levels.
  • Development of Visualisation tools to enhance our
    understanding of the grid.