[Amusing title goes here] GridPP project management - PowerPoint PPT Presentation

Slides: 16
Provided by: neasan
Transcript and Presenter's Notes

1
[Amusing title goes here] GridPP project management
  • Sarah Pearce
  • 8 September 2009
  • GridPP23

2
Project Map GridPP3 Q4 08
3
Project Map GridPP3 Q2 09
4
Project map - statistics
                                    Q208  Q308  Q408  Q109  Q209
Metrics
  Metric OK                           99   142   155   172   184
  Metric close to target              24    47    39    32    22
  Metric not OK                       41    32    32    21    27
  Not able to be measured             27    22    11    10     3
Milestones
  Milestone achieved                  11    22    32    42    57
  Milestone overdue                    2     7    13    17     4
Milestone not due / metric n/a       101    80    69    60    58
Suspended                              0     6     6     9    12
Awaiting input                        34     5    12    10     3
Total                                339   363   369   373   370
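
The totals row can be cross-checked against the individual rows with a few lines of Python. This is only an illustrative sketch: the counts are transcribed from the table on this slide, while the "share of measured metrics that are OK" figure is an assumed summary that does not appear in the presentation, and the function names are hypothetical.

```python
# Counts transcribed from the "Project map - statistics" slide.
QUARTERS = ["Q208", "Q308", "Q408", "Q109", "Q209"]
COUNTS = {
    "Metric OK": [99, 142, 155, 172, 184],
    "Metric close to target": [24, 47, 39, 32, 22],
    "Metric not OK": [41, 32, 32, 21, 27],
    "Not able to be measured": [27, 22, 11, 10, 3],
    "Milestone achieved": [11, 22, 32, 42, 57],
    "Milestone overdue": [2, 7, 13, 17, 4],
    "Milestone not due / metric n/a": [101, 80, 69, 60, 58],
    "Suspended": [0, 6, 6, 9, 12],
    "Awaiting input": [34, 5, 12, 10, 3],
}
STATED_TOTALS = [339, 363, 369, 373, 370]


def column_totals(counts):
    """Sum every row for each quarter (one total per column)."""
    return [sum(column) for column in zip(*counts.values())]


def share_ok(counts):
    """Assumed summary: 'Metric OK' as a fraction of the metrics that
    were actually measured (OK + close to target + not OK)."""
    shares = []
    for ok, close, bad in zip(
        counts["Metric OK"],
        counts["Metric close to target"],
        counts["Metric not OK"],
    ):
        shares.append(ok / (ok + close + bad))
    return shares


if __name__ == "__main__":
    totals = column_totals(COUNTS)
    for quarter, total, stated, share in zip(
        QUARTERS, totals, STATED_TOTALS, share_ok(COUNTS)
    ):
        flag = "matches" if total == stated else "DIFFERS"
        print(f"{quarter}: total {total} ({flag}), OK share {share:.1%}")
```

Under these assumptions the computed totals agree with the stated totals for every quarter.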
5
Experiments - red metrics
  • ATLAS, CMS and Other experiments
      • No red metrics (although ATLAS has lots of
        amber)
      • Previously ATLAS job failures were dominated by
        access to storage.
  • LHCb
      • 1.2.2 - MC production (generation) efficiency
        (84/target 95)
      • 1.2.3 - T1 MC production (reconstruction,
        stripping) efficiency (55/90)
      • 1.2.4 - T1 MC/Event user analysis UK efficiency
        (43/70)
      • 1.2.11 - LHCb SAM tests uptime T1 (98/82)
      • 1.2.23 - Keep LHCb GANGA user training material
        updated

Mainly problems with LHCb application software.
Various scheduled downtimes at the Tier-1 (moving
to new building, CASTOR and network developments).
LHCb note that support at the UK Tier-1 and Tier-2
sites for MC production has been excellent and
that communications between sites and experiment
have improved.
6
Grid services
  • Operations
      • 2.1.3 - Proportion of available job slots used
        (51/target 80)
      • 2.1.6 - Job success rates (85/95); was 90 in
        Q408. Expect it to decrease as sites get busier.
      • 2.1.10 - GridPP deployment web pages up to date:
        review underway
  • Rest of Grid Services
      • No red metrics or milestones

7
Tier-1 - metrics
  • Front end systems
      • 3.1.8 - Availability of CE service (91/99).
        Scheduled downtime: R89, network.
  • Resource delivery
      • 3.2.11 - Farm Occupancy (67/target 80), up
        from 45 in Q408
      • 3.2.13 - quarterly report not available
      • Previously 3.2.10 - Job Efficiency (now 88, was
        69)
  • Storage systems
      • 3.4.4 - % met of UB Allocation for
        Disk (87/100). UB allocations to be revised.
      • 3.4.8 - CASTOR SAM tests LHC VOs (93/99).
        CASTOR development, R89, network.

8
Tier-1 overdue milestones
  • Front end systems
      • 3.1.22 - LHC Monitoring infrastructure
        operational at RAL: waiting on work by Dante
  • Resource delivery
      • 3.2.16 - Disaster and Business Continuity Plan
        Available.
      • 3.2.18 - Disaster Plan fully implemented
      • New disaster management system is operational
        and working well, but some contingency plans
        remain to be completed.
  • Storage systems
      • General ADS Service Ends. Not been a priority,
        but the closure process has started.

9
Tier-2s
  • % of promised disk and CPU available: green for
    all Tier-2s (metrics 1 and 2).
  • SAM availability and reliability tests: green for
    most Tier-2s (metrics 3 and 4). Not a weighted
    average, so can be brought down by a couple of
    poorly performing small sites.
  • Other red metrics
      • Metric 5 - SLL ATLAS test performance:
        LondonGrid and SouthGrid.
      • 4.2.6 - Average SLL SE test performance:
        ScotGrid (86/95)
      • CPU utilisation (wall clock time and CPU time,
        metrics 7 and 8): LondonGrid, SouthGrid
      • % of disk used (metric 9): ScotGrid, SouthGrid
      • Number of management meetings: NorthGrid
        (metric 11)
      • Middleware upgrading: LondonGrid (metric 14).
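
The remark that the SAM figures are "not a weighted average" can be illustrated with a small sketch. The site availabilities and capacities below are invented purely for illustration (they are not GridPP data), and weighting by capacity is an assumed choice; the slide does not say which weighting, if any, would be preferred.

```python
def unweighted_mean(sites):
    """Plain average: every site counts equally, whatever its size."""
    return sum(availability for availability, _ in sites) / len(sites)


def weighted_mean(sites):
    """Average weighted by capacity (assumed weighting, arbitrary units)."""
    total_capacity = sum(capacity for _, capacity in sites)
    return sum(a * c for a, c in sites) / total_capacity


# Hypothetical Tier-2: two large, healthy sites and two small, poor ones.
# Each entry is (availability in %, capacity in arbitrary units).
HYPOTHETICAL_SITES = [
    (98.0, 1000),
    (97.0, 800),
    (60.0, 50),
    (55.0, 30),
]

if __name__ == "__main__":
    print(f"unweighted: {unweighted_mean(HYPOTHETICAL_SITES):.1f}%")
    print(f"weighted:   {weighted_mean(HYPOTHETICAL_SITES):.1f}%")
```

With these made-up numbers the unweighted figure is 77.5%, while the capacity-weighted figure is about 95.9%: two small sites pull the plain average well below what most of the capacity actually delivers.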

10
Management and external
  • Project execution red metrics
      • Nearly all staff now in post
      • 5.2.9 - CB meetings (target 1 per year)
  • NGI
      • Milestones amended in light of EGI developments
  • Outreach, LCG and EGEE
      • No red metrics

11
Risk register
12
High level risks
  • R1 Recruitment and retention difficulties
      • Likelihood 3, impact 3 (reduced from 4,3)
      • Nearly all staff now in place, but staff turnover
        remains a concern
  • R12 Machine room problems compromise Tier-1
      • Likelihood 4, impact 3
      • Transfer to R89 went smoothly, but issues since
        have triggered Tier-1 disaster planning process
        (air conditioning problem and water leak)
  • R5 Service insufficiently resilient wrt storage:
    downgraded to medium risk (2,4)
      • Resilience expressed explicitly
      • CASTOR more stable
      • But impact of problems increases as data taking
        approaches
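
One way to compare the register entries above is to combine each likelihood and impact into a single score. The sketch below multiplies the two, which is a common risk-matrix convention but is an assumption here: the slides do not say how GridPP combines the values or where the high/medium boundary lies. It also assumes the pair "(2,4)" for R5 is in the same (likelihood, impact) order as the other entries. The `Risk` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    ident: str
    description: str
    likelihood: int
    impact: int

    @property
    def score(self) -> int:
        # Assumption: score = likelihood * impact (not stated in the slides).
        return self.likelihood * self.impact


# Values as quoted on the "High level risks" slide.
RISKS = [
    Risk("R1", "Recruitment and retention difficulties", 3, 3),
    Risk("R12", "Machine room problems compromise Tier-1", 4, 3),
    Risk("R5", "Service insufficiently resilient wrt storage", 2, 4),
]


def ranked(risks):
    """Return risks ordered from highest to lowest assumed score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


if __name__ == "__main__":
    for risk in ranked(RISKS):
        print(f"{risk.ident:>3}  score {risk.score:>2}  {risk.description}")
```

Under this assumed scoring, R12 ranks highest, followed by R1 and then R5, which is at least consistent with R5 being the entry described as downgraded to medium.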

13
Finances
  • £335k of Tier-1 hardware rolled over from FY08 to
    FY09
  • £1m of Tier-1 hardware delayed until FY10
  • Most Tier-2 hardware grants should be in early
    FY10; a small number of sites require them in FY09

14
Staffing
  • Some areas have not finished recruiting, so funded
    effort is below that expected
  • But in all cases this is more than compensated by
    unfunded effort

15
Next steps
  • New Oversight Committee next week
  • Transition plan for EGEE posts to EGI
  • This quarter finishes at the end of September; will
    send a reminder out then for next round of reports
  • Aim to keep the Quarterly reports metric green
    but NOT using the Dilbert method