Dan Gunter, Brian L. Tierney, Keith Beattie: LBNL - PowerPoint PPT Presentation


1

CEDPS Troubleshooting Effort Status Report, April 2008
  • Dan Gunter, Brian L. Tierney, Keith Beattie, LBNL
  • Laura Pearlman, ISI

Center for Enabling Distributed Petascale Science
http://www.cedps-scidac.org
2
Overview
  • Updated Architecture for OSG
  • New Grid Passport Idea
  • Log Database schema work
  • OGSA-DAI use
  • NERSC deployment plans
  • Plans for the next 6 months

3
Use Case: Troubleshooting
  • Allow GOC personnel to:
    • find log messages for jobs from VO=Atlas running at site=FNAL
    • find log messages related to service=condor, user=Joe, site=Indiana
    • find log messages for user=Joe
    • find log messages with status=error
    • find all logs where job manager status=killed (i.e., jobs that were killed for running too long)
    • find log messages for start events that have no matching end event
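
The last query in this list (start events with no matching end event) can be sketched in a few lines of Python. This is only an illustration: the record layout (dicts with `event` and `guid` keys) and the `.start` / `.end` suffix convention are assumptions modeled on the Globus event names that appear later in this talk, not the actual CEDPS schema.

```python
def unmatched_starts(records):
    """Return the start records that have no matching end record.

    Assumption: each record is a dict whose 'event' name ends in
    '.start' or '.end', and whose 'guid' ties a start/end pair together.
    """
    ended = set()
    for r in records:
        if r["event"].endswith(".end"):
            ended.add((r["event"][: -len(".end")], r["guid"]))
    return [
        r
        for r in records
        if r["event"].endswith(".start")
        and (r["event"][: -len(".start")], r["guid"]) not in ended
    ]


# Hypothetical sample data: job "b" started but never finished.
logs = [
    {"event": "job.create.start", "guid": "a"},
    {"event": "job.create.end", "guid": "a"},
    {"event": "job.create.start", "guid": "b"},
]
print(unmatched_starts(logs))
```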

4
Use Case 2: Monitoring / Performance Analysis
  • Allow GOC personnel to run queries such as:
    • what sites had connection attempts for a given user DN
    • what data files were accessed most often
    • which user moved the largest amount of data
    • find log messages where the time between start/end events is more than 3x the baseline
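
As a rough sketch of the "more than 3x the baseline" check: the slides do not say how the baseline is computed, so the median duration is used here purely as an assumption, and the job names and durations are made up.

```python
from statistics import median


def slow_jobs(durations, factor=3.0):
    """Return the jobs whose duration exceeds factor * baseline.

    Assumption: the baseline is the median duration; the talk does not
    define it. `durations` maps a job id to elapsed seconds.
    """
    baseline = median(durations.values())
    return sorted(job for job, d in durations.items() if d > factor * baseline)


# Hypothetical start/end deltas in seconds.
durations = {"job-1": 6.5, "job-2": 7.0, "job-3": 25.0, "job-4": 6.9}
print(slow_jobs(durations))
```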

5
Use Case 3: User Debugging / Provenance
  • Allow a user to query for their own logs:
    • find log messages for all my jobs
    • find log messages with status=error
  • Allow a user to determine all hosts/services that their job used:
    • find log messages related to Job X

6
Previously suggested architecture
7
Problems with this approach
  • Main idea of old architecture:
    • all grid logs are sent to a central collector
  • After talking to several sites, the following has become clear:
    • some sites are quite worried about sensitive data in the log files
    • some sites want to be able to control exactly who gets access to what log data
  • Proposed solution:
    • most logs stay local to the site; only a minimal subset is sent to the central collector
    • sites deploy a new service that provides X.509-authenticated access to logs

8
Updated Architecture
9
New Ideas
  • Key points:
  • Minimal logging sent to central collector by default
    • e.g., resource name, job ID, start time, end time, DN, VO
    • enough information to locate log files of interest
    • basically the same info currently collected by Gratia
    • site can send more if they choose to
  • Site admins have control over access to log database
    • hopefully sites will allow users to see their own logs
  • Site admins see exactly what data is being sent to the central collector
    • data is sent to the central collector using SSL
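
To make the "minimal subset" concrete, here is a sketch of what one record sent to the central collector might look like. The six fields come from the list above; the class name, field types, sample values, and the JSON encoding are assumptions for illustration, not the CEDPS wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SummaryRecord:
    """Minimal per-job record (field list from the slide, types assumed)."""
    resource_name: str
    job_id: str
    start_time: float  # seconds since the epoch (assumed unit)
    end_time: float
    dn: str
    vo: str


def to_wire(rec):
    """Serialize a record so a site admin can inspect exactly what is sent."""
    return json.dumps(asdict(rec), sort_keys=True)


# Hypothetical values throughout.
rec = SummaryRecord(
    resource_name="site-a.example.org",
    job_id="job-42",
    start_time=1207264066.0,
    end_time=1207264075.0,
    dn="/DC=org/DC=example/OU=People/CN=Example User",
    vo="examplevo",
)
wire = to_wire(rec)
print(wire)
```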

10
New Functionality
  • Deployment of site log archives will provide OSG with the following new functionality:
    • OSG security staff can easily query site archives to see what DNs have been used
    • Users can query site archives for their own logs
    • GOC staff can query site archives to aid troubleshooting

11
Grid Passport Stamps
  • We are working with Miron to define the concept of a Grid "passport stamp"
  • Every middleware component that comes in contact with a Grid workflow adds an "entry" and an "exit" stamp to the passport
  • Condor sandbox makes this doable
  • This provides the following:
    • Tells the user exactly which components and hosts were used by their workflow
    • provides workflow provenance
    • For workflows that fail, provides a mechanism to determine where the failure occurred
    • passport stamps must also be collected along the way

12
Grid Passport Stamps
  • Stamps are generated and recorded by an Issuer
  • Issuer attributes: name, address, and GUID
  • timestamp
  • local identity (if exists)
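
A passport with entry and exit stamps could be modeled roughly as follows. The issuer attributes, timestamp, and optional local identity follow the bullets above; every class and method name, the `kind` field, and the "entry without exit" heuristic for locating a failure are assumptions made for this sketch, since the concept was still being defined.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Issuer:
    name: str
    address: str
    guid: str


@dataclass(frozen=True)
class Stamp:
    issuer: Issuer
    kind: str  # "entry" or "exit" (assumed encoding)
    timestamp: float
    local_identity: Optional[str] = None


@dataclass
class Passport:
    workflow_id: str
    stamps: List[Stamp] = field(default_factory=list)

    def stamp(self, issuer, kind, local_identity=None):
        if kind not in ("entry", "exit"):
            raise ValueError("kind must be 'entry' or 'exit'")
        self.stamps.append(Stamp(issuer, kind, time.time(), local_identity))

    def open_components(self):
        """Names of issuers with more entry stamps than exit stamps."""
        balance = {}
        for s in self.stamps:
            delta = 1 if s.kind == "entry" else -1
            balance[s.issuer] = balance.get(s.issuer, 0) + delta
        return [issuer.name for issuer, n in balance.items() if n > 0]


# Hypothetical components and hosts.
gram = Issuer("gram", "gram.example.org", str(uuid.uuid4()))
condor = Issuer("condor", "condor.example.org", str(uuid.uuid4()))

passport = Passport(workflow_id="wf-1")
passport.stamp(gram, "entry")
passport.stamp(gram, "exit")
passport.stamp(condor, "entry", local_identity="joe")
print(passport.open_components())
```

In this toy run the workflow entered Condor and never left, so Condor is the first place to look for the failure.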

13
Open Issues
  • What about workflows that spawn sub-workflows?
    • need to clone the passport, and reassemble it at the end of the job
  • probably many more issues.

14
CEDPS-TS DB Schema
  • Dan Gunter

15
DB Schema Requirements
  • Deal with semi-structured data
    • timestamp, event name, and level are the only guaranteed fields
  • Load data at relatively high rates
    • detailed logs from Globus, Condor, etc.
  • Perform many different types of queries
    • new troubleshooting scenarios
    • site admins, users, etc.

16
DB Schema Approach
  • Put only required info in the main table (avoid NULLs): time, event type, severity (level)
  • Make tables for the event type and the attribute names so they can be referred to by an integer
    • better for speed, since indexing by integers is fast (0.1 sec for 0.5M records), but adds an extra lookup to map the name to the id
  • Add some special tables for DN, identifiers, and large text values (less important)
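
The approach can be sketched as a small schema. The table and column names `event`, `event_type`, `ident`, `dn`, `et_id`, `e_id`, `relationship`, and `value` are taken from the query shown later in the talk; the `attr` and `attr_name` tables, the `severity` column name, and all column types are assumptions. SQLite (via Python's standard library) stands in for MySQL so the sketch runs without a server.

```python
import sqlite3

SCHEMA = """
-- Lookup tables: names are stored once and referenced by integer id.
CREATE TABLE event_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE attr_name  (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);

-- Main table: only the guaranteed fields, so no NULLs.
CREATE TABLE event (
    id       INTEGER PRIMARY KEY,
    time     REAL    NOT NULL,
    et_id    INTEGER NOT NULL REFERENCES event_type(id),
    severity INTEGER NOT NULL
);

-- Optional, semi-structured attributes (hypothetical table).
CREATE TABLE attr (
    e_id  INTEGER NOT NULL REFERENCES event(id),
    an_id INTEGER NOT NULL REFERENCES attr_name(id),
    value TEXT
);

-- Special tables for identifiers and DNs.
CREATE TABLE ident (
    e_id         INTEGER NOT NULL REFERENCES event(id),
    relationship TEXT    NOT NULL,
    value        TEXT    NOT NULL
);
CREATE TABLE dn (
    e_id  INTEGER NOT NULL REFERENCES event(id),
    value TEXT    NOT NULL
);

CREATE INDEX event_et_idx ON event(et_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO event_type (id, name) "
    "VALUES (1, 'org.globus.execution.job.create.start')"
)
conn.execute("INSERT INTO event (time, et_id, severity) VALUES (1207264066.0, 1, 6)")
rows = conn.execute(
    "SELECT event.time, event_type.name "
    "FROM event JOIN event_type ON event_type.id = event.et_id"
).fetchall()
print(rows)
```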

17
Schema Loading Perf.
  • For efficiency, the loader program keeps the mappings from attribute names and event type names to their integer ids in memory (loaded at startup), so only INSERT statements are needed
  • For MySQL, the extended multiple-insert syntax is used for further speedup
  • Impressively fast: 7200 records/sec from my Mac to localhost, which is faster than parsing!
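
A minimal sketch of that loading strategy, again using SQLite as a stand-in for MySQL: the name-to-id mapping is read once at startup, and each batch goes in as a single multi-row INSERT. The table layout, the event names, and the `load_batch` helper are assumptions for illustration, not the actual loader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE event (id INTEGER PRIMARY KEY, time REAL NOT NULL, et_id INTEGER NOT NULL);
INSERT INTO event_type (name) VALUES ('job.start'), ('job.end');
""")

# Load the name -> id mapping once, so the hot path needs no SELECTs.
type_ids = {name: id_ for id_, name in conn.execute("SELECT id, name FROM event_type")}


def load_batch(conn, parsed):
    """Insert (time, event_name) tuples with one multi-row INSERT.

    Keep batches modest: databases cap the number of bound parameters
    per statement.
    """
    if not parsed:
        return 0
    placeholders = ", ".join(["(?, ?)"] * len(parsed))
    params = []
    for t, name in parsed:
        params.extend((t, type_ids[name]))
    conn.execute(f"INSERT INTO event (time, et_id) VALUES {placeholders}", params)
    return len(parsed)


n = load_batch(conn, [(1.0, "job.start"), (2.0, "job.end")])
count = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
print(n, count)
```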

18
Fun Query on GT-4.2 Logs
select DATE_ADD('1970-01-01', interval start_time second) as 'start time',
       (end_time - start_time) as 'duration', dn.value as 'user DN', jobResource
from (
  select e.time as 'start_time', e_done.time as 'end_time', e.guid, e.jobResource
  from (
    select e1.time, e1.id, e1.value as 'guid', ident.value as 'jobResource'
    from (
      select event.id, event.time, ident.value as 'value'
      from event join event_type on event_type.id = event.et_id
           left join ident on ident.e_id = event.id
      where event_type.name = 'org.globus.execution.job.create.start' and
            ident.relationship = 'guid'
    ) as e1
    join (
      select event.id, event.time, ident.value as 'value'
      from event join event_type on event_type.id = event.et_id
           left join ident on ident.e_id = event.id
      where event_type.name = 'org.globus.execution.job.create.end' and
      ...
  join ident on ident.value = e.jobResource
  join event as e_done on e_done.id = ident.e_id
  join event_type on event_type.id = e_done.et_id
  where event_type.name = 'org.globus.execution.job.terminate.end' and
        ident.relationship = 'jobResource'
) as e
join ident on ident.value = e.jobResource
join ident as id2 on id2.e_id = ident.e_id
join ident as id3 on id3.value = id2.value
left join dn on dn.e_id = id3.e_id
left join event on event.id = dn.e_id
left join event_type on event_type.id = event.et_id
where ident.relationship = 'jobResource' and
      id2.relationship = 'guid' and
      id3.relationship = 'guid' and
      event_type.name = 'org.globus.security.authn.transport.end'
group by jobResource
order by start_time
19
Fun Query Results
Duration, user, and jobResource.id of all jobs (in a given time period). Despite the messy query, it runs very quickly (0.03 s with >250K events).

+---------------------+-------------------+----------------------------------------------------------+--------------------------------------+
| start time          | duration          | user DN                                                  | jobResource                          |
+---------------------+-------------------+----------------------------------------------------------+--------------------------------------+
| 2008-04-03 23:07:46 | 8.74900007247925  | /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017    | c56196d0-01d2-11dd-874c-80235764bb7a |
| 2008-04-03 23:09:45 | 10.710000038147   | /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017    | 0c6a4040-01d3-11dd-b78e-8fb0dd1ed847 |
| 2008-04-03 23:10:39 | 7.16599988937378  | /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017    | 2ca067e0-01d3-11dd-b78e-8fb0dd1ed847 |
| 2008-04-03 23:10:51 | 6.73299980163574  | /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017    | 3392f860-01d3-11dd-b78e-8fb0dd1ed847 |
| 2008-04-03 23:11:43 | 6.90599989891052  | /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017    | 53002ab0-01d3-11dd-b78e-8fb0dd1ed847 |
| 2008-04-03 23:12:02 | 6.91799998283386  | /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017    | 5debbac0-01d3-11dd-b78e-8fb0dd1ed847 |
| 2008-04-03 23:14:37 | 7.9760000705719   | /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017    | baa9cb80-01d3-11dd-b78e-8fb0dd1ed847 |
| 2008-04-04 00:10:31 | 6.84200000762939  | /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017    | 89b74d60-01db-11dd-b78f-8fb0dd1ed847 |
| 2008-04-04 00:15:22 | 6.05399990081787  | /DC=org/DC=doegrids/OU=People/CN=Brian Tierney 180017    | 374e8ab0-01dc-11dd-b78f-8fb0dd1ed847 |
| 2008-04-04 00:22:53 | 6.45799994468689  | /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921 | 44094000-01dd-11dd-b78f-8fb0dd1ed847 |
| 2008-04-04 00:32:59 | 10.5170001983643  | /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921 | acdbb3f0-01de-11dd-a9fc-d8b7a3370b2a |
| 2008-04-04 00:36:14 | 5.73799967765808  | /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921 | 21698080-01df-11dd-a9fc-d8b7a3370b2a |
| 2008-04-04 00:54:42 | 6.32100009918213  | /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921 | b5ce04b0-01e1-11dd-a9fc-d8b7a3370b2a |
| 2008-04-04 01:14:39 | 6.54200005531311  | /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921 | 7f87f250-01e4-11dd-a9fc-d8b7a3370b2a |
| 2008-04-04 01:19:11 | 0.796000003814697 | /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921 | 216ea1e0-01e5-11dd-a9fc-d8b7a3370b2a |
| 2008-04-04 01:20:15 | 0.749000072479248 | /DC=org/DC=doegrids/OU=People/CN=Keith R. Jackson 633921 | 477d3770-01e5-11dd-a9fc-d8b7a3370b2a |
+---------------------+-------------------+----------------------------------------------------------+--------------------------------------+

20
OGSA-DAI and CEDPS-TS
  • Laura Pearlman

21
OGSA-DAI Integration Goals
  • Provide access to log data over the grid
  • Support flexible authorization policies
  • E.g., Site admins can see local site data, VO
    admins can see data related to their VO, and
    users can see data related to jobs running under
    their DN
  • Facilitate queries across multiple sites
  • Possibly add support for joins with other
    existing databases
  • E.g., GRAM audit db

22
OGSA-DAI Useful Features
  • Can use Globus authz framework to specify authorization of OGSA-DAI resources
  • View resources (planned feature) represent database views
    • Can use DN in a view definition to represent the caller's DN
    • This will enable us to enforce policies like "users can only see their own records"
  • Resource groups fan queries out to multiple OGSA-DAI resources
    • Resource groups of OGSA-DAI views will allow us to define resources representing things like "all your own records"

23
OGSA-DAI Proposed Deployment
[Diagram: GOC DB, two OGSA-DAI services, Res. group 1, Res. group 2]
  • View1: select where DN = DN
    • Shows records with the same DN as the user making the OGSA-DAI call
  • Res. group 1 then contains all records at all sites with the same DN as the user making the query.
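
The combination of per-caller views and resource groups can be illustrated with plain Python. This does not use or imitate the OGSA-DAI API; `make_view`, `resource_group`, the record layout, and the sample DNs are all hypothetical, and only the idea (filter by the caller's DN at each site, then fan the query out across sites) comes from the slide.

```python
def make_view(records):
    """Return a view that shows only records matching the caller's DN."""
    def view(caller_dn):
        return [r for r in records if r["dn"] == caller_dn]
    return view


def resource_group(views):
    """Return a query function that fans out to every view and merges results."""
    def query(caller_dn):
        results = []
        for view in views:
            results.extend(view(caller_dn))
        return results
    return query


# Hypothetical log records held at two sites.
site_a = [
    {"dn": "/CN=Joe", "site": "A", "msg": "job started"},
    {"dn": "/CN=Ann", "site": "A", "msg": "job started"},
]
site_b = [
    {"dn": "/CN=Joe", "site": "B", "msg": "transfer failed"},
]

group1 = resource_group([make_view(site_a), make_view(site_b)])
print(group1("/CN=Joe"))
```

A caller never passes a filter of their own here; the DN comes from the authenticated call, which is what lets the site enforce the policy.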

24
NERSC Collaboration
  • Attending weekly NERSC production grid meetings
  • Deploying CEDPS syslog-ng-based log collection configuration at NERSC
  • Helping NERSC with troubleshooting

25
Globus Best Practice Logs
  • There are still a number of issues with the logs
  • http://www.cedps.net/index.php/GT4.2_Logging

26
Focus for the next 6 months
  • log parser / DB loader deployment at NERSC
  • write additional data parsers
  • Log database scalability testing
  • OGSA-DAI integration
  • Improve interface to the database

27
More Information
  • General Information
  • http://www.cedps.net/index.php/Troubleshooting