Lessons Learned: The Organizers
GLAST DC2 Closeout, June 1 2006. R. Dubois

Transcript and Presenter's Notes

1
Lessons Learned: The Organizers
2
Kinds of Lessons
  • Operational
    • Distributing the code
    • Making the sky data
      • Required compute resources
      • Required people resources
    • Remaking the sky data
    • Distributing the data - DataServers
  • Functional
    • Problems extracting livetime history
    • Problems extracting pointing history (SAA entry/exit)
  • Organizational
    • How/when to draw on expert help for problem solving
    • Sky model
    • Confluence/Workbook
  • Analysis
    • Access to standard cuts
    • GTIs
    • Livetime cubes, diffuse response

Or things to fix for next time
3
Making the Sky 1
  • Code distribution
    • Navid made nice self-installers, with wrappers that took care of
      env vars etc.
    • Creation of distributions is semi-manual; should find out how to
      automate it with rules.
  • We needed a lot more compute resources than we anticipated (see the
    back-of-envelope sketch after this list)
    • 200k CPU-hrs for background and sky generation
    • Did sky gen (30k CPU-hrs) twice
    • → need more compute resources under our control than planned;
      maxed out at SLAC with SVAC, DC2, BT and Handoff
    • Aiming for 350-400 GLAST boxes; call on SLAC general queues for
      noticeable periods
    • Berrie ran 10,000 jobs at Lyon for the original background CT
      runs: a horrible thing to have to do
      • Manually transferred merit files back to SLAC
    • → extend the LAT automated pipeline infrastructure to make use of
      non-SLAC compute farms (may or may not have to transfer files
      back to SLAC): Lyon, UW, Padova? GSFC-LHEA?
    • Speaks to maximizing sims capability
  • We juggled priority with SVAC commissioning
    • Pipeline 1 handles 2 streams well enough; more would have been
      tricky
  • Ate up about 3-4 TB of disk to keep all MC, Digi, Recon etc. files
    • → Pipeline 2
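To make the compute scale concrete, here is a back-of-envelope estimate
using only the CPU-hour and box counts quoted above; the ideal-scaling
assumption (no queue waits, no failed jobs) is mine:

```python
# Back-of-envelope wall-clock time for DC2 background + sky generation.
# The 200k CPU-hr total and the 350-400 box range are from this slide;
# perfect scaling across boxes is assumed for illustration only.
total_cpu_hrs = 200_000

for n_boxes in (350, 400):
    wall_days = total_cpu_hrs / n_boxes / 24
    print(f"{n_boxes} boxes: ~{wall_days:.0f} days of wall clock")

# 350 boxes: ~24 days; 400 boxes: ~21 days. Weeks of running even on a
# dedicated farm, hence the overflow onto the SLAC general queues.
```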

4
Making the Sky 2
  • People resources
    • Tom Glanzman put his BABAR expertise to work to minimize exposure
      to SLAC resource bottlenecks
      • Accessing NFS from upwards of 400 CPUs was the biggest problem
      • Use AFS and batch-node local disk as much as possible (see the
        staging sketch after this list)
    • Made good use of SCS Ganglia server/disk monitoring tools
    • Developed pipeline performance plots (as shown at the Kickoff
      meeting)
  • Tom and I (mostly Tom) ran off the DC2 datasets
    • Some complexity due to the secret sky code and configs
    • Some complexity due to last-minute additions of variables
      calculated outside Gleam
    • Effort was front-loaded in setting up the tasks; now a fairly
      small load to monitor/repair during routine running
    • Some cleanup at the end
  • Root4 → Root5 transition disrupted the DataServer
  • Will likely need a volunteer for future big LAT simulations
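The NFS bottleneck above is usually attacked with a stage-in / compute /
stage-out pattern on batch-node local disk. A minimal sketch of that
pattern; the paths and file names here are hypothetical, not the actual
DC2 task layout:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Hypothetical locations, for illustration only.
NFS_INPUT = Path("/nfs/glast/dc2/input/run_000123.root")
NFS_OUTPUT_DIR = Path("/nfs/glast/dc2/output")

def run_staged(app):
    """Stage input to node-local scratch, run there, copy the result back.

    Each job touches NFS exactly twice (one read, one write) instead of
    doing all of its I/O over NFS from hundreds of nodes at once.
    """
    with tempfile.TemporaryDirectory(prefix="dc2_") as tmp:
        scratch = Path(tmp)
        local_in = scratch / NFS_INPUT.name
        local_out = scratch / "merit.root"
        shutil.copy2(NFS_INPUT, local_in)                      # stage in
        subprocess.run([app, str(local_in), str(local_out)],
                       check=True)                             # run locally
        shutil.copy2(local_out, NFS_OUTPUT_DIR / local_out.name)  # stage out
```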

5
Grab Bag
  • Great to have GBM involved!
    • Should at least have an archival copy of the GBM simulation code
      used
  • DC2 Confluence worked
    • Nice organization by Seth on the Forum and Analysis pages
    • Easy to use and peruse
    • Will clone it for Beamtest
  • Great teamwork
    • It was really fun to work with this group
    • The secret sky made it hard to ask many people to help with
      problems, but that is behind us now
  • Histories
    • Pointing and livetime needed manual intervention to fix SAA
      passages etc. Should track that down.
  • Analysis details
    • Might have been nice to have Class A/B in merit (IMHO)
    • GTIs were a pain if you got them wrong; the tools are now more
      tolerant (a sketch of the basic GTI test follows this list)
    • Livetime cubes were made by hand
    • Diffuse response in FT1 was somewhat cobbled together
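Most GTI mistakes come down to the basic interval test being applied
inconsistently between event selection and exposure. A minimal sketch of
that test; the function and interval values are illustrative, not the
DC2 tools:

```python
def in_gti(t, gtis):
    """Return True if time t falls inside any (start, stop) GTI."""
    return any(start <= t < stop for start, stop in gtis)

# Illustrative GTIs with a gap cut out, e.g. for an SAA passage.
gtis = [(0.0, 5400.0), (6600.0, 12000.0)]

events = [100.0, 5500.0, 7000.0]               # event arrival times
kept = [t for t in events if in_gti(t, gtis)]
print(kept)  # [100.0, 7000.0]; 5500.0 falls in the excluded gap
```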

6
GSSC Data Server
890 hits total during DC2
  • Repopulating the server is manual: 2 months of data takes about
    5 hrs
  • Brings up questions:
    • What chunks of data will be retransmitted to the GSSC?
    • What are the failure modes for data delivery?
    • What will Event data look like?
    • How many versions of the data are to be kept online in the
      servers?

7
LAT DataServer Usage
  • ½ of the usage came from Julie!
  • Similar questions posed as for the GSSC server

8
(No Transcript)
9
Lessons
  • Statistics don't include astro data server or WIRED event display
    use.
  • Lessons Learned
    • Problem: jobs running out of time. Need a more accurate way to
      predict the time, or run jobs with no time limit (a sketch of one
      approach follows this list).
    • Problem: need clearer notification to the user if a job fails.
  • The LAT Astro server never got the GTIs right
    • Hence little used, even as a west-coast US mirror
    • Were not able to implement an efficient connection to the Root
      files (the main reason for its existence). Still needs work.
  • Unknown if the limited use of the Event Display is significant.
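One way to attack the time-limit problem: scale the requested limit from
the job size and a measured per-event rate, with a safety margin, rather
than using one fixed limit. A minimal sketch; the rate and margin values
are made up for illustration:

```python
def predict_time_limit(n_events, sec_per_event, margin=1.5, floor_sec=600):
    """Suggest a batch time limit from a measured per-event CPU rate.

    sec_per_event should be measured from recent jobs of the same task;
    the margin absorbs slow nodes and tails in the time distribution.
    """
    return max(floor_sec, int(n_events * sec_per_event * margin))

# e.g. 200k events at a measured 0.05 s/event -> request 15000 s
print(predict_time_limit(200_000, 0.05))  # 15000
```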