Dave Newbold, University of Bristol, GridPP Middleware Meeting

1
Real World issues from DC04
  • DC04
    • Trying to operate the CMS computing system at
      25Hz for one month
    • We are three days in!
  • We are using components that are ready NOW
    • Even if it's not politically correct
    • Often using several different approaches for
      comparison
  • This talk concentrates on data management issues
    • Real World issues that have come up during DC04
      preparation
    • Stuff that is not (yet) well covered by the
      available tools
  • I know that:
    • Some issues may be application problems, not
      middleware ones
    • Some issues may be covered by components under
      development
    • Some issues may be self-inflicted injuries
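To put the DC04 target in perspective, here is a rough count. It assumes that the 25Hz refers to events per second and that a month means 30 days of continuous running; both are assumptions, not figures from the talk.

```latex
N \approx 25\ \mathrm{s^{-1}} \times 86\,400\ \mathrm{s\,day^{-1}} \times 30\ \mathrm{days} = 6.48 \times 10^{7}\ \text{events}
```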

2
Directed data transfer
  • Data management type I: replica management
    • The (automatic?) movement of data products to
      where they are needed, managing the relevant
      system and application metadata
    • Best-effort optimisation of data location in
      response to dynamic workload needs
    • Well covered by current and future middleware
  • Data transfer type II: bulk data management
    • The predictable, straight(ish) production line
      of data flow:
      Detector → DAQ → Buffer → Reco farm → T1 →
      MSS → calib → …
    • Requirements are different to replica management
    • Robustness and reliability are paramount (raw
      data is the crown jewels)
    • Throughput is very important; best effort is
      not good enough
    • Not explicitly addressed by current middleware
      products
    • Data distribution is explicitly directed by
      policy
    • Seeds the replica management system from the
      Tier-1s
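The contrast between type I and type II data management can be made concrete. In directed transfer, destinations follow from a fixed policy rather than from current workload. The sketch below shows this idea under stated assumptions: the stream names, the site names and the POLICY table itself are hypothetical illustrations, not the DC04 configuration.

```python
from dataclasses import dataclass

# Hypothetical policy table: which Tier-1 sites receive each data stream.
# Stream and site names are illustrative assumptions only.
POLICY: dict[str, list[str]] = {
    "raw": ["T1_A", "T1_B"],          # raw data: two custodial copies
    "reco": ["T1_A", "T1_B", "T1_C"],
    "calib": ["T1_A"],
}


@dataclass(frozen=True)
class FileRecord:
    guid: str
    stream: str


def assign_destinations(files, policy=POLICY):
    """Return (guid, site) transfer assignments dictated purely by policy.

    Nothing here depends on the current workload: the same input always
    yields the same assignments. A stream with no policy entry raises,
    rather than its files being dropped silently.
    """
    assignments = []
    for f in files:
        if f.stream not in policy:
            raise KeyError(f"no distribution policy for stream {f.stream!r}")
        for site in policy[f.stream]:
            assignments.append((f.guid, site))
    return assignments
```

Raising on an unknown stream reflects the "raw data is the crown jewels" requirement. In a directed system, an unplanned file is an error to surface, not something to optimise around.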

3
Directed data transfer
  • Our current solution
    • A cooperating system of simple agents at Tier-0
      and Tier-1
    • They communicate only through a shared (Oracle)
      DB
    • They have little or no state; it's all held in
      the central DB
    • Could this be useful as generic middleware?
  • Other related issues
    • The lack of a single consistent interface to MSS
      (in Europe and the US) makes life difficult
      (being addressed?)
    • There are very many failure modes in the data
      management system that we must think of
    • It would be good to factorise out the problems
      of failing storage components by having the MSS
      remap our data when required
    • We predict at least one disk failure per day
      somewhere in DC04
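The agent design can be sketched briefly: agents are stateless and coordinate only through rows in a shared database. This is a minimal sketch, with an in-memory SQLite database standing in for the central Oracle DB. The transfers table layout and the state names (pending, in_progress, done) are hypothetical.

```python
import sqlite3


def make_db():
    """Create the shared task table (in-memory SQLite stands in for Oracle)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transfers ("
        " guid TEXT, dest TEXT, state TEXT, agent TEXT,"
        " PRIMARY KEY (guid, dest))"
    )
    return db


def claim_next(db, agent, dest):
    """Claim one pending transfer for `dest`; return its guid, or None.

    The agent keeps no state of its own: everything it needs is read from,
    and written back to, the shared table.
    """
    with db:  # commit on success, roll back on error
        row = db.execute(
            "SELECT guid FROM transfers WHERE dest = ? AND state = 'pending' LIMIT 1",
            (dest,),
        ).fetchone()
        if row is None:
            return None
        # Re-checking state in the UPDATE means that if two agents race for
        # the same row, only one of them sees rowcount == 1.
        cur = db.execute(
            "UPDATE transfers SET state = 'in_progress', agent = ?"
            " WHERE guid = ? AND dest = ? AND state = 'pending'",
            (agent, row[0], dest),
        )
        return row[0] if cur.rowcount == 1 else None


def mark_done(db, guid, dest):
    """Record completion in the shared table, where any other agent can see it."""
    with db:
        db.execute(
            "UPDATE transfers SET state = 'done' WHERE guid = ? AND dest = ?",
            (guid, dest),
        )
```

Because all state is in the table, a crashed agent can simply be restarted. A hung transfer can be found by querying for rows left in_progress.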

4
Data transfer tools
  • Need low-level transfer tools that:
    • Log what is going on! (We have ad-hoc solutions
      here for DC04)
    • Adjust policy automatically for optimum
      throughput according to network conditions
    • Fail gracefully when something is wrong at an
      end-point
    • Play nice with firewalls, etc.
    • NB: performance is not currently the problem,
      but the tools are
  • Checksumming
    • We would like a system that performs a fast
      file-level checksum of data ON THE DISK
    • No, the TCP checksum does not catch all errors
      • Silent disk problems, filesystem errors, NFS
        problems, etc.
    • Checksumming data from MSS after the fact is
      very difficult
  • Would also like
    • Some SIMPLE means of distributed, authenticated,
      atomic, reliable message-passing between agents
      over the Grid
    • With a command-line level API for scripting
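The first two items on the transfer-tool wish-list are to log what is going on and to fail gracefully at a bad end-point. Both can be sketched as a thin retry wrapper. This is a minimal sketch under stated assumptions: copy_fn is a hypothetical placeholder for whatever low-level copy tool is actually in use, and the exponential backoff schedule is an arbitrary choice. Automatic tuning to network conditions is not shown.

```python
import logging
import time

log = logging.getLogger("transfer")


class TransferFailed(Exception):
    """Raised once all retries for a transfer are exhausted."""


def transfer_with_retry(copy_fn, src, dst, max_attempts=4, base_delay=1.0,
                        sleep=time.sleep):
    """Run copy_fn(src, dst), logging every attempt and backing off on failure.

    copy_fn is a hypothetical stand-in for the real copy tool; it is assumed
    to raise OSError when an end-point misbehaves. Returns the number of
    attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        log.info("attempt %d/%d: %s -> %s", attempt, max_attempts, src, dst)
        try:
            copy_fn(src, dst)
        except OSError as exc:
            log.warning("attempt %d failed: %s", attempt, exc)
            if attempt == max_attempts:
                # Fail gracefully: a clear, catchable error rather than a hang.
                raise TransferFailed(f"{src} -> {dst}: {exc}") from exc
            sleep(base_delay * 2 ** (attempt - 1))  # 1 s, 2 s, 4 s, ...
        else:
            log.info("done: %s -> %s", src, dst)
            return attempt
```

The sleep parameter is injectable so that the backoff schedule can be checked without actually waiting.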
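A file-level checksum of data as it sits on the disk means re-reading the file itself. A checksum recorded over the network stream is not enough, because it cannot see silent disk or filesystem corruption. The minimal sketch below streams the file in chunks so that memory use stays flat for large files. Adler-32 (from Python's zlib) is an assumption chosen because it is cheap to compute; the talk does not name an algorithm.

```python
import zlib


def file_adler32(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of the file at `path` as 8 hex digits.

    The file is read from disk in fixed-size chunks, so multi-gigabyte
    files never need to fit in memory.
    """
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value:08x}"


def verify(path, expected):
    """Re-read the file from disk and compare against the recorded checksum."""
    return file_adler32(path) == expected
```

A typical pattern would be to record file_adler32 when a file is first written and call verify after each copy.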
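Some of the properties asked for in simple message-passing between agents can be illustrated at small scale. The minimal sketch below covers authenticated, atomic and reliable delivery through a shared drop directory. It rests on several assumptions. An HMAC over a shared secret stands in for real Grid (certificate-based) authentication. Atomicity comes from writing a temporary file and renaming it into place. Reliability means at-least-once delivery: a message is deleted only when the receiver calls ack. A real Grid deployment would also need a transport between sites, which is not shown.

```python
import hashlib
import hmac
import json
import os
import tempfile


def send(drop_dir, key, msg_id, payload):
    """Write a signed message into drop_dir so readers never see a partial file."""
    body = json.dumps({"id": msg_id, "payload": payload}, sort_keys=True)
    sig = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    fd, tmp = tempfile.mkstemp(dir=drop_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"body": body, "sig": sig}, f)
        f.flush()
        os.fsync(f.fileno())  # make the message durable before it is visible
    # os.replace is an atomic rename within one POSIX filesystem.
    os.replace(tmp, os.path.join(drop_dir, f"{msg_id}.msg"))


def receive(drop_dir, key):
    """Return (msg_id, payload) pairs for every complete, correctly signed message."""
    out = []
    for name in sorted(os.listdir(drop_dir)):
        if not name.endswith(".msg"):
            continue  # skip in-flight temporary files
        with open(os.path.join(drop_dir, name)) as f:
            envelope = json.load(f)
        expected = hmac.new(key, envelope["body"].encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, envelope["sig"]):
            raise ValueError(f"bad signature on {name}")
        body = json.loads(envelope["body"])
        out.append((body["id"], body["payload"]))
    return out


def ack(drop_dir, msg_id):
    """Delete a message only after it has been processed (at-least-once)."""
    os.remove(os.path.join(drop_dir, f"{msg_id}.msg"))
```

For the command-line API mentioned on the slide, these three functions could be wrapped behind a small argparse front end. That wrapper is not shown here.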

5
Other issues
  • Small files!
    • They seem to be inevitable, but play havoc with
      efficiency
    • Huge lists of files in catalogues
    • Not dealt with efficiently by MSS, transfer
      tools, etc.
    • Basic unit of information management: the data
      produced by one MC, reco or filter job during
      its run (with a unique GUID)
    • We do not want to make jobs too long (too much
      state in the system)
    • Can aggregation help? Perhaps, but we need the
      tools
  • Metadata
    • Currently a hot topic?
    • How do we handle efficient distribution of
      system- and user-level metadata?
    • Which metadata are immutable after creation?
      Which need to be distributed widely? How do we
      handle schema extension on a per-user basis?
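One way aggregation might help with small files is to pack many of them into a single archive. The catalogue, MSS and transfer tools then handle one large object, and an embedded manifest still lets each small file be found. The minimal sketch below uses a tar archive. The tar format, the MANIFEST.json member name and the requirement that file basenames be unique (for example, GUID-based names) are all assumptions for illustration.

```python
import io
import json
import os
import tarfile

MANIFEST = "MANIFEST.json"  # hypothetical name for the embedded index


def aggregate(paths, archive_path):
    """Pack many small files into one tar archive with an embedded manifest.

    Returns the manifest, a mapping of member name to size in bytes.
    Basenames are assumed unique; a collision raises ValueError.
    """
    manifest = {}
    with tarfile.open(archive_path, "w") as tar:
        for p in paths:
            name = os.path.basename(p)
            if name in manifest or name == MANIFEST:
                raise ValueError(f"duplicate or reserved member name {name!r}")
            tar.add(p, arcname=name)
            manifest[name] = os.path.getsize(p)
        data = json.dumps(manifest, sort_keys=True).encode()
        info = tarfile.TarInfo(MANIFEST)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return manifest


def read_member(archive_path, name):
    """Fetch one small file back out of the aggregate without unpacking the rest."""
    with tarfile.open(archive_path, "r") as tar:
        return tar.extractfile(name).read()
```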
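The metadata questions on this slide are left open in the talk. The sketch below shows one possible answer, offered as an assumption rather than the talk's design. System-level fields are fixed at creation and frozen, so they are safe to replicate widely. User-level fields live in per-user namespaces, so one user's schema extension never alters what others see. The specific field names are hypothetical.

```python
from dataclasses import asdict, dataclass, field, fields
from types import MappingProxyType


@dataclass(frozen=True)
class SystemMetadata:
    """Fields fixed when the file is created (hypothetical selection)."""
    guid: str
    size: int
    checksum: str
    producer_job: str


@dataclass
class FileMetadata:
    system: SystemMetadata
    # user name -> {field: value}; each user extends the schema independently.
    user: dict = field(default_factory=dict)

    def set_user_field(self, user, key, value):
        """Add or update a user-level field; system fields cannot be shadowed."""
        if key in {f.name for f in fields(SystemMetadata)}:
            raise ValueError(f"{key!r} is a system field and cannot be overridden")
        self.user.setdefault(user, {})[key] = value

    def view_for(self, user):
        """Read-only view: shared system fields plus one user's extensions."""
        return MappingProxyType({**asdict(self.system), **self.user.get(user, {})})
```

Only the frozen SystemMetadata part would need wide distribution. User namespaces could stay close to the users who created them.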