Grid Troubleshooting - PowerPoint PPT Presentation

About This Presentation
Title:

Grid Troubleshooting

Slides: 27
Provided by: Carl1171

Transcript and Presenter's Notes


1
Grid Troubleshooting
  • Brian Tierney, Dan Gunter
  • Lawrence Berkeley National Laboratory
  • Laura Pearlman
  • Information Sciences Institute
  • http://www.cedps-scidac.org/

2
CEDPS in a Nutshell
  • Center for Enabling Distributed Petascale Science
  • CEDPS is pronounced "seds" (silent P)
  • DOE SciDAC Center for Enabling Technology
  • July 1, 2006 to June 30, 2011, $2.4M/yr
  • Collaboration between 5 sites
  • Argonne National Laboratory
  • Fermi National Accelerator Laboratory
  • Lawrence Berkeley National Laboratory
  • USC Information Sciences Institute
  • University of Wisconsin-Madison
  • Three focus areas
  • Moving data to compute resources
  • Moving compute services to data sites
  • Troubleshooting and diagnosis tools

3
The Troubleshooting Problem
  • Large production Grids (OSG, TeraGrid, etc.)
    report a high failure rate
  • 20-30% of jobs submitted to the Grid fail
  • mostly authentication errors and disk space
    problems
  • Users don't always notice, as jobs may be
    automatically resubmitted and may succeed the
    next time
  • Troubleshooting in this environment is very
    difficult
  • Current Approach
  • Log into all hosts used (if possible)
  • grep various log files looking for problems
  • Inconsistent logging levels
  • Multiple file formats
  • Often a tedious and time-consuming process

4
CEDPS Troubleshooting Goals
  • Be useful to all of the following
  • Grid Operation Center folks
  • Site Administrators
  • Grid Users
  • Grid Developers

5
Use Case 1: Troubleshooting
  • Allow GOC personnel to
  • find log messages for jobs from VO=Atlas running
    at site=FNAL
  • find log messages related to service=condor,
    user=Joe, site=Indiana
  • find log messages for user=Joe
  • find log messages with status=error
  • find all logs where job manager status=killed
    (i.e., jobs that were killed for running too long)
  • find log messages with start events that have no
    matching end event
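
Once log messages sit in a relational table, the queries above reduce to simple attribute filters. The following is a minimal sketch, not the CEDPS schema: Python's built-in sqlite3 stands in for the MySQL database used in the presentation, and the table name `log_events`, its columns, the `find` helper, and the sample rows are all hypothetical.

```python
import sqlite3

# sqlite3 stands in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log_events ("
    "ts TEXT, event TEXT, service TEXT, user_name TEXT, "
    "site TEXT, vo TEXT, status TEXT)"
)
# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO log_events VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("2008-02-27T10:00:00Z", "job.run.start", "condor", "Joe", "Indiana", "Atlas", None),
        ("2008-02-27T10:05:00Z", "job.run.end", "condor", "Joe", "Indiana", "Atlas", "error"),
        ("2008-02-27T10:01:00Z", "job.run.start", "gram", "Ann", "FNAL", "Atlas", None),
        ("2008-02-27T10:09:00Z", "job.run.end", "gram", "Ann", "FNAL", "Atlas", "killed"),
    ],
)

COLUMNS = {"ts", "event", "service", "user_name", "site", "vo", "status"}


def find(**criteria):
    """Return (ts, event, user_name, site, status) rows matching every criterion."""
    unknown = set(criteria) - COLUMNS
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    sql = "SELECT ts, event, user_name, site, status FROM log_events"
    if criteria:
        # Column names are checked against COLUMNS; values are bound as parameters.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in criteria)
    return conn.execute(sql + " ORDER BY ts", tuple(criteria.values())).fetchall()


print(find(service="condor", user_name="Joe", site="Indiana"))
print(find(status="error"))
print(find(vo="Atlas", site="FNAL"))
```

Keeping each attribute in its own column is one possible design; a name/value side table would handle arbitrary attributes at the cost of more joins.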

6
Use Case 2: Monitoring / Performance Analysis
  • Allow GOC/site admins to determine
  • what sites had connection attempts for a given
    user DN
  • what data files were accessed most often
  • which user moved the largest amount of data
  • find log messages where the time between
    start/end events is more than 3x the baseline
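
The last query can be sketched in a few lines once start and end events are paired. This is an illustrative sketch only: it assumes events are already parsed into dictionaries with `ts`, `event`, and `guid` keys, and the function names and the single `baseline_seconds` parameter are hypothetical simplifications (a real baseline would likely differ per operation).

```python
from datetime import datetime


def parse_ts(ts):
    """Parse an ISO-format UTC timestamp such as 2006-12-08T18:39:25.864369Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def slow_operations(events, baseline_seconds, factor=3.0):
    """Yield (guid, operation, seconds) for start/end pairs slower than factor x baseline."""
    started = {}
    for ev in events:
        name = ev["event"]
        if name.endswith(".start"):
            started[(ev["guid"], name[: -len(".start")])] = parse_ts(ev["ts"])
        elif name.endswith(".end"):
            key = (ev["guid"], name[: -len(".end")])
            if key in started:
                elapsed = (parse_ts(ev["ts"]) - started.pop(key)).total_seconds()
                if elapsed > factor * baseline_seconds:
                    yield key[0], key[1], elapsed


# Sample events; the guid "g1" is a shortened placeholder.
events = [
    {"ts": "2006-12-08T18:39:23.114567Z", "event": "org.globus.gridFTP.authn.start", "guid": "g1"},
    {"ts": "2006-12-08T18:39:25.514369Z", "event": "org.globus.gridFTP.authn.end", "guid": "g1"},
    {"ts": "2006-12-08T18:39:25.864369Z", "event": "org.globus.gridFTP.transfer.start", "guid": "g1"},
    {"ts": "2006-12-08T18:45:02.214369Z", "event": "org.globus.gridFTP.transfer.end", "guid": "g1"},
]

for guid, operation, seconds in slow_operations(events, baseline_seconds=60):
    print(guid, operation, seconds)
```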

7
Use Case 3: User Debugging / Provenance
  • Allow a user to query for their own logs
  • find log messages for all my jobs
  • find log messages with status=error
  • Allow a user to determine all hosts/services that
    their job used
  • find log messages related to Job X
  • use this to determine what hosts were actually
    used

8
Overview of our approach
[Figure: Grid stack]
  • Current focus
  • Monitoring of Grid middleware
  • Normalize information being logged
  • Collect logs at each site (syslog-ng)
  • Load logs at each site into a relational database
    (MySQL)
  • Query cross-site with a distributed database
    layer (OGSA-DAI)

9
Log Normalization
  • Effective troubleshooting requires correct,
    precise, and understandable logs
  • Time-synchronized hosts (using NTP)
  • Information logged at the
  • Start of all important operations
  • End of all important operations
  • Unique identifier in space/time for the operation
  • Basic attributes describing the operation
  • who (user DN)
  • where (host IP addresses)
  • what (operation arguments)
  • A simple, parseable log format

10
Social aspects of logging
  • Developers add log messages for their own benefit
  • Their test environment is far simpler and more
    reliable than the actual deployment
  • maybe even their own laptop!
  • Necessary to convince developers that
  • this will not be a performance bottleneck
  • there is a benefit for them
  • namely, better ways to find bugs before deploying
    the software and recreate them afterwards

11
Correlating log messages
  • Grid workflows are highly parallel
  • multiple sites
  • data resources, instruments, compute resources
  • multiple components in each place
  • all this being used by multiple users and jobs at
    the same time!
  • In order to separate out which component
    activities were associated with your job, local
    and global identifiers need to be recorded
    together whenever possible

12
Logging Best Practices Recommendations
  • Practices
  • All logs should contain a unique event name and
    an ISO-format timestamp
  • All system operations that might fail or
    experience performance variations should be
    wrapped with start and end events.
  • All logs from a given execution thread should be
    tagged with a globally unique ID (or GUID), such
    as a Universally Unique Identifier (UUID)
  • Log format
  • Logs should be composed of lines of ASCII
    name=value pairs
  • Example
  • ts=2006-12-08T18:48:27.598448Z
    event=org.globus.gridFTP.transfer.start
    prog=GridFTP-v4.2
    guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
    file=filename src.host=H1 src.port=P1
    dst.host=H2 dst.port=P2
  • http://www.cedps.net/wiki/index.php/LoggingBestPractice
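
A log line following these recommendations could be produced as sketched below. This is an illustration, not CEDPS code: the `format_event` and `quote` helpers and the `org.example` event name are hypothetical, and double-quoting values that contain whitespace is an assumed convention that the slide does not spell out.

```python
import uuid
from datetime import datetime, timezone


def quote(value):
    """Double-quote values containing whitespace (an assumed convention)."""
    text = str(value)
    return f'"{text}"' if any(ch.isspace() for ch in text) else text


def format_event(event, guid, attrs=None, now=None):
    """Build one name=value log line with an ISO-format UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    fields = {"ts": now.strftime("%Y-%m-%dT%H:%M:%S.%fZ"), "event": event, "guid": guid}
    fields.update(attrs or {})
    return " ".join(f"{name}={quote(value)}" for name, value in fields.items())


guid = str(uuid.uuid4()).upper()
print(format_event("org.example.transfer.start", guid, {"src.host": "H1", "dst.host": "H2"}))
```

Attributes are passed as a dictionary rather than keyword arguments because names such as `src.host` contain a dot.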

13
Event Names
  • Use a '.' as a separator and go from general to
    specific
  • Same as Java class names
  • First part of name should be used as a unique
    namespace (e.g. org.globus)
  • Use start/end suffixes whenever possible
  • Helps immensely with troubleshooting
  • Examples
  • org.globus.gridFTP.start
  • org.globus.gridFTP.authn.start
  • org.globus.gridFTP.authn.end
  • org.globus.gridFTP.transfer.start
  • org.globus.gridFTP.transfer.end
  • org.globus.gridFTP.end
  • org.globus.MDS.response.start
  • org.globus.MDS.query.start
  • org.globus.MDS.query.end
  • org.globus.MDS.write.net.start
  • org.globus.MDS.write.net.end
  • org.globus.MDS.response.end

14
Reporting Errors
  • Errors should be reported as part of the end
    event if possible
  • Use status=N (N >= 0 means success)
  • Not attempting to define other status codes
  • too hard to get agreement on these
  • Example
  • ts=2006-12-08T18:39:23.114369Z
    event=org.globus.authz.gridmap.end status=-1
    DN=/O=CEDS/CN=Some User
    msg=Cannot open gridmap file
    /etc/grid-security/grid-mapfile for reading
    guid=F7D64975-069A-4152-A21F-57109AA46DFA
    level=ERROR
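
One way to guarantee that every operation gets an end event carrying its status is to wrap it in a context manager. The sketch below is only an illustration: `emit`, `logged_operation`, the list used as a log sink, and the `org.example` event names are hypothetical, and quoting the message text is an assumption.

```python
import contextlib
from datetime import datetime, timezone


def emit(sink, event, **attrs):
    """Append one name=value log line to sink (a plain list in this sketch)."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    pairs = [f"ts={ts}", f"event={event}"] + [f"{k}={v}" for k, v in attrs.items()]
    sink.append(" ".join(pairs))


@contextlib.contextmanager
def logged_operation(sink, name, guid):
    """Log name.start on entry and name.end with a status on exit."""
    emit(sink, name + ".start", guid=guid)
    try:
        yield
    except Exception as exc:
        # Report the error as part of the end event, then re-raise it.
        emit(sink, name + ".end", guid=guid, status=-1, level="ERROR", msg=f'"{exc}"')
        raise
    else:
        emit(sink, name + ".end", guid=guid, status=0)


log = []
with logged_operation(log, "org.example.copy", "g1"):
    pass  # the real work would go here

try:
    with logged_operation(log, "org.example.authz", "g1"):
        raise OSError("Cannot open gridmap file")
except OSError:
    pass

print("\n".join(log))
```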

15
Globally Unique IDs
  • Use the guid reserved name to allow correlation
    of a set of events together
  • event=org.globus.gridFTP.authn.start guid=27023
  • event=org.globus.gridFTP.authn.end guid=27023
  • event=org.globus.gridFTP.transfer.start
    guid=27023
  • event=org.globus.gridFTP.transfer.end guid=27023
  • Recommend use of standard program uuidgen to
    generate globally unique ID
  • e.g. A5A563CD-D80C-4E58-9ECD-79C6B611E122
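
For code that cannot shell out to uuidgen, Python's standard `uuid` module produces the same kind of identifier. The sketch below also shows the correlation step: grouping already-parsed events by their `guid`. The helper names `new_guid` and `group_by_guid` are hypothetical.

```python
import uuid
from collections import defaultdict


def new_guid():
    """Return a random UUID in the upper-case form shown on the slide."""
    return str(uuid.uuid4()).upper()


def group_by_guid(events):
    """Map each guid to the ordered list of event names that carry it."""
    groups = defaultdict(list)
    for ev in events:
        groups[ev["guid"]].append(ev["event"])
    return dict(groups)


events = [
    {"event": "org.globus.gridFTP.authn.start", "guid": "27023"},
    {"event": "org.globus.gridFTP.authn.end", "guid": "27023"},
    {"event": "org.globus.gridFTP.transfer.start", "guid": "27023"},
    {"event": "org.globus.gridFTP.transfer.end", "guid": "27023"},
]

print(new_guid())
print(group_by_guid(events))
```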

16
Logging Example
Log file
  • time=T1 event=job.create.start
  • time=T2 event=job.create.end job.id=J1 status=0
  • time=T3 event=job.run.start job.id=J1
  • time=T4 event=job.run.end job.id=J1 status=-1
  • time=T5 event=job.copy.start job.id=J1
  • time=T6 event=job.copy.end job.id=J1 status=0
Logical flow
  • create job
  • run job
  • copy results

17
Example Log: GridFTP
  • ts=2006-12-08T18:39:23.114369Z
    event=org.globus.gridFTP.start prog=GridFTP-4.0.3
    localhost=myhost remoteHost=somehost.gov:56010
    serverMode=inetd
    guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
  • ts=2006-12-08T18:39:23.114567Z
    event=org.globus.gridFTP.authn.start
    DN=/DC=org/DC=doegrids/OU=People/CN=Somebody
    guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
  • ts=2006-12-08T18:39:25.514369Z
    event=org.globus.gridFTP.authn.end
    DN=/DC=org/DC=doegrids/OU=People/CN=Somebody
    msg=123456 successfully authorized
    localUser=uscmspool381
    guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0
  • ts=2006-12-08T18:39:25.864369Z
    event=org.globus.gridFTP.transfer.start
    file=/tmp/myfile tcpBufferSize=128KB
    dataBlockSize=262144 numStreams=1 numStripes=1
    destHost=129.79.4.64
    guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
  • ts=2006-12-08T18:45:02.214369Z
    event=org.globus.gridFTP.transfer.end
    file=/tmp/myfile bytesTransferred=678433
    guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0
  • ts=2006-12-08T18:45:02.214386Z
    event=org.globus.gridFTP.end
    guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
    status=226
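
Lines in this format are simple to parse. The sketch below is not the CEDPS log parser: `parse_line` is a hypothetical helper, and treating a token without a leading `name=` as a continuation of the previous value is a heuristic chosen here so that unquoted multi-word values (such as the `msg` text) survive.

```python
def parse_line(line):
    """Parse one name=value log line into a dict of strings."""
    record = {}
    last = None
    for token in line.split():
        name, sep, value = token.partition("=")
        if sep and name:
            # Split on the first "=" only, so DN=/DC=org/... keeps its value intact.
            record[name] = value
            last = name
        elif last is not None:
            # No "name=" prefix: treat the token as more of the previous value.
            record[last] += " " + token
    return record


line = (
    "ts=2006-12-08T18:39:25.514369Z event=org.globus.gridFTP.authn.end "
    "DN=/DC=org/DC=doegrids/OU=People/CN=Somebody "
    "msg=123456 successfully authorized localUser=uscmspool381 "
    "guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0"
)
record = parse_line(line)
print(record["event"], record["msg"], record["status"])
```

The heuristic misreads a continuation word that itself contains "=", which is one reason a quoting rule for values is worth agreeing on.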

18
Example Scenario: Firewall blocking GridFTP
  • globus-url-copy from server A to server B (3rd
    party transfer) hangs. Why?
  • gridftp server logs contain the following
  • ts=2008-02-27. id=30922
    event=globus-gridftp-server.start mode=inetd
  • ts=2008-02-27. id=30922
    event=globus-gridftp-server.session.start
  • ts=2008-02-27. id=30922
    event=globus-gridftp-server.session.authn.start
    DN=/DC
  • ts=2008-02-27. id=30922
    event=globus-gridftp-server.session.authn.end
    status=0
  • ts=2008-02-27. id=30922
    event=globus-gridftp-server.session.authz.start
  • ts=2008-02-27. id=30922
    event=globus-gridftp-server.session.authz.end
    status=0
  • ts=2008-02-27. id=30922
    event=globus-gridftp-server.transfer.start
  • But the logs are missing the remaining events
  • globus-gridftp-server.transfer.end
  • globus-gridftp-server.session.end
  • globus-gridftp-server.end
  • Since both authn and authz succeeded, a firewall
    is likely the problem
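
The diagnosis above, start events with no matching end event, can be automated. A minimal sketch, assuming events are already parsed into dictionaries with the `id` and `event` fields used in this scenario; `unmatched_starts` is a hypothetical helper name.

```python
def unmatched_starts(events):
    """Return (id, operation) pairs that have a .start event but no .end event."""
    open_ops = []
    for ev in events:
        name = ev["event"]
        if name.endswith(".start"):
            open_ops.append((ev["id"], name[: -len(".start")]))
        elif name.endswith(".end"):
            key = (ev["id"], name[: -len(".end")])
            if key in open_ops:
                open_ops.remove(key)
    return open_ops


# Event names taken from the scenario; timestamps omitted.
names = [
    "globus-gridftp-server.start",
    "globus-gridftp-server.session.start",
    "globus-gridftp-server.session.authn.start",
    "globus-gridftp-server.session.authn.end",
    "globus-gridftp-server.session.authz.start",
    "globus-gridftp-server.session.authz.end",
    "globus-gridftp-server.transfer.start",
]
events = [{"id": "30922", "event": name} for name in names]

for op_id, operation in unmatched_starts(events):
    print(f"id={op_id} missing event={operation}.end")
```

For the scenario's log this reports the server, session, and transfer operations as still open, matching the three missing end events listed above.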

19
WSRF Logging
  • Goal is to include enough information in log
    messages to
  • correlate log messages involved in servicing a
    single request
  • correlate log messages associated with the same
    resource
  • Example
  • GRAM invokes an RFT service within the same
    container when servicing a GRAM request
  • need to be able to correlate the GRAM logs with
    the RFT logs for this user job
  • For WSRF services, we suggest
  • use the guid as the session ID
  • add a resource ID to uniquely identify a resource
  • for each unique guid (session ID), there should
    be at least one log message linking that guid
    with a resource.id
  • a hash of the EPR can be used for the resource ID

20
Log collection architecture
[Figure: within each site, the nodes send their logs through syslog-ng to a log parser and DB loader, which loads them into MySQL]
21
syslog-ng
  • No need to invent something new for this: the
    syslog-ng tool fills all requirements
  • Open source, runs on all major OSes
  • Fault tolerant, secure (via stunnel), scalable,
    easy to configure, etc.
  • Large user base
  • Can filter logs based on level and content
  • Any number of sources and destinations
  • Can act as a proxy, tunnel thru firewalls
  • Execute programs: send email, load DB, etc.
  • Built-in log rotation
  • Timezone support and fully qualified host names
  • http://www.balabit.com/products/syslog-ng/

[Figure: on each node, software writes logs under /opt/log/; a syslog-ng sender forwards them to a syslog-ng receiver]
22
Logging Architecture for the Open Science Grid
23
OGSA-DAI Integration Goals
  • Provide access to log data over the grid
  • Support flexible authorization policies
  • E.g., Site admins can see local site data, VO
    admins can see data related to their VO, and
    users can see data related to jobs running under
    their DN
  • Facilitate queries across multiple sites
  • Possibly add support for joins with other
    existing databases
  • E.g., GRAM audit db

24
OGSA-DAI Deployment
[Figure: two site DBs, each exposing View1 and View2 through a site-level OGSA-DAI service guarded by PDPs/PIPs; a further OGSA-DAI service presents them as Resource group 1 and Resource group 2 to user 1 and user 2]
25
Current Status
  • syslog-ng -> parser -> loader -> MySQL pipeline
    now running on LBL OSG cluster and at NERSC
  • log parsers for Globus 4.2 logs, Globus 4.1
    gatekeeper, SGE, Pegasus
  • working on DB query tools
  • working on OGSA-DAI integration
  • working on Condor log parsers

26
More Information
  • CEDPS Troubleshooting
  • http://www.cedps.net/index.php/Troubleshooting
  • Contact us if you need troubleshooting help!
  • email: BLTierney_at_lbl.gov, DKGunter_at_lbl.gov