Transcript and Presenter's Notes

Title: DIVISION INFRASTRUCTURE and PLANNING (DOE Tevatron Operations Review)


1
  • DIVISION INFRASTRUCTURE and PLANNING
  • Vicky White
  • Fermilab
  • March 17, 2004

2
Division Direction (26 FTE)
  • Head: Vicky White (also the lab's Cyber Security
    Executive, with active deputy Dane Skow)
  • Deputy Head: Robert Tschirhart
  • Chief Scientist of the Division
  • MOUs and Stakeholder requirements
  • 3 Associate Heads with cross-cutting
    responsibilities and a small team of staff each:
  • - Facilities, ES&H: Gerry Bellendir (6.5 FTE)
  • - Budget, Computing resource planning: Steve
    Wolbers (7 FTE)
  • - External communications, Admin staff, Project
    initiation status, Division web: Ruth Pordes
    (9 FTE)
  • 1 Assistant Head, DOE relations: Irwin Gaines (at
    DOE)

3
How the Division works
  • Unique (among HEP labs) in having a Computing
    Division that fully contributes to the scientific
    program
  • Mix of Scientists, Engineers, Computing
    Professionals, Technical and Administrative Staff
  • We think this works and we are very proud of it.
  • We encourage our scientists to do science and are
    proud of their scientific contributions
  • We think communication with our stakeholders is
    outstanding and is aided by this
  • We believe in:
  • System solutions: hardware and software
  • Matrixed project work across organizations
  • Common services and solutions
  • Evolving all of our systems aggressively (e.g. ->
    Grid)

4
Computing Division ES&H Program
  • 793 days without a lost-time injury!
  • - 3 first-aid cases (15 month period)

5
Computing Division ES&H Program
  • Training: Ergonomics, Beryllium Handling, Lead,
    Computer Room, GERT, Emergency Warden, Service
    Coordinator
  • - 96% complete on required ES&H courses
  • Ergonomic Workstation Reviews (about ¼ of the
    division have had their workstations reviewed)
  • Hold annual fire and tornado drills

6
Computing Division ES&H Program
  • Monthly walkthroughs with Department Heads
    (average 2 per month)
  • Quarterly walkthroughs with DOE
  • Assist in improving or maintaining building
    safety (on-going)
  • Investigate and record injuries (first-aid and
    recordable)
  • Assist in writing and review of Hazard Analysis

7
Facility Operations and Planning
  • This is a big job !!
  • We now have 3 Facilities plus a smaller computing
    room in Wilson Hall
  • Feynman Computing Center
  • New Muon center for Lattice Gauge
  • High Density Computing Facility demolition and
    reconstruction starting April 5
  • We have a posted opening for an assistant
    building manager
  • Computer Facility planning: space, power,
    networking, installations, etc.
  • Facility construction planning: working with
    FESS
  • Monitoring all of the systems in our facility
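
As an aside on the last bullet, a minimal sketch of the kind of facility-systems check involved (sensor names and alarm thresholds below are hypothetical and purely illustrative; this is not the division's actual monitoring software):

    # Minimal facility-monitoring sketch (Python). Sensor readings and alarm
    # thresholds are made up for illustration; not the production system.

    SENSORS = {
        "FCC computer room": {"temp_c": 24.5, "power_kw": 310.0},
        "New Muon center":   {"temp_c": 22.1, "power_kw": 85.0},
        "Wilson Hall room":  {"temp_c": 27.8, "power_kw": 40.0},
    }

    LIMITS = {"temp_c": 26.0, "power_kw": 350.0}   # assumed alarm thresholds

    def check_facilities(sensors, limits):
        """Return (room, quantity, value) tuples for readings above their limit."""
        alarms = []
        for room, readings in sensors.items():
            for quantity, value in readings.items():
                if value > limits[quantity]:
                    alarms.append((room, quantity, value))
        return alarms

    for room, quantity, value in check_facilities(SENSORS, LIMITS):
        print(f"ALARM: {room}: {quantity} = {value}")

In this made-up example the Wilson Hall room would be flagged for temperature; the real point is simply that every room's environment and power draw are checked against limits continuously.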

8
Meetings and Communications inside CD
  • Kitchen Cabinet meeting of Division Head, Deputy
    and Assoc Heads weekly
  • Department Heads meetings monthly
  • Operations meeting weekly
  • Budget Meeting monthly, 2 Budget retreats per
    year
  • Briefings on issues/project proposals weekly
  • Facility planning meetings regularly
  • Project status reports weekly
  • Matrixed projects meetings
  • R2D2 (Run 2 Data handling) -> Grid Projects
  • Accel Projects meeting (monthly)
  • CMS Projects (monthly)
  • All-hands division meetings (2 or 3 per year)

9
Stakeholder communications
  • Bi-weekly meeting slot with CDF and D0
    spokespersons and Computing leaders
  • As needed meetings with other expt spokespersons
    and/or computing leaders
  • CD Head is a member of the CMS S&C PMG and the CMS
    Advisory Software and Computing Board
  • CD Deputy attends BTeV PMG
  • CD representative on CDF and D0 PMGs
  • Stakeholders participate in briefings and status
    meetings, presenting needs, requests, etc.
  • Lab Scheduling, LAM and All-Experimenters
    meetings
  • Windows Policy committee
  • Computer Security Working Group

10
How do we set Priorities?
  • Listen and discuss with
  • Director, Associate Director for Research, Deputy
    Director, other division/section heads
  • PAC, HEPAP
  • Experiment Spokespersons, Computing leaders and
    liaisons
  • Run II Reviews
  • Project Reviews
  • US-CMS S&C PMG, ASCB and Reviews
  • Budget Retreat discussions
  • External Project Steering Groups and
    collaborators
  • Funding Agency contacts
  • Then we just make decisions and do it!

11
Evolving our workforce
  • I issued a challenge in Jan 2003 to each
    department to become 10% more efficient in
    operational areas
  • So we would be able to invest and move forward
    into the future
  • Big emphasis on measuring what we are doing:
    define your own metrics, but show us
  • Strong encouragement to reassign staff, train,
    offer opportunities to change assignments

12
Has it worked?
  • I believe it has worked to a large extent
  • But there is much more to do
  • We are down from 275 FTEs in Sep 03 to 258 FTEs,
    but that has brought stress in places (see the
    rough check after this list)
  • We have taken on 15 FTE of work in Accelerator
    Division (some taken from BTeV)
  • We need to hire; we have 10 openings posted
  • We will go to lights-out operations
  • We need more skilled computer professionals and
    fewer limited skill level operational staff
  • We have taken a tough stance on performance
  • We have no fat
  • no-one is messing around on some unapproved
    project
  • everyone submits effort reports each month
  • we need to work smarter not harder in some areas
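
A rough consistency check on the numbers above, inferred from the slide's own figures rather than stated in the review: going from 275 to 258 FTEs frees 17 FTEs, about 6% of the Sep 03 level; counting the roughly 15 FTEs of Accelerator Division work absorbed over the same period, the effective shift is about 32 FTEs, or close to 12%, broadly in line with the 10% efficiency challenge.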

13
Some Common Services
Common Service | Customer/Stakeholder | Comments
Storage and data movement and caching | CDF, D0, CMS, MINOS, Theory, SDSS, KTeV, all | Enstore: 1.5 Petabytes of data! dCache, SRM
Databases | CDF, D0, MINOS, CMS, Accelerator, ourselves | Oracle 24x7, MySQL, Postgres
Networks, Mail, Print Servers, Helpdesk, Windows, Linux, etc. | Everyone! | First class, many 24x7; services lead Cyber Security
SAM-Grid | CDF, D0, MINOS | Aligning with LHC
Simulation, MC and Analysis Tools | CDF, D0, CMS, MINOS, Fixed Target, Accel. Div. | Growing needs
Farms | All experiments | Moving to Grid
Engineering Support and R&D | CDF, D0, BTeV, JDEM, Accel. Div. Projects | Queue outside our door
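
To make the storage row above concrete, the disk cache (dCache) in front of the tape store (Enstore) behaves as a read-through cache: a file is served from disk if a copy is already there, otherwise it is staged from tape and kept on disk for subsequent reads. The class and path names in this Python sketch are hypothetical; it is not the real dCache or Enstore interface:

    # Minimal read-through cache sketch in the spirit of dCache over Enstore.
    # Class names, methods and paths are illustrative, not the real APIs.

    class TapeStore:
        """Stand-in for the tape back end: slow but authoritative."""
        def __init__(self, files):
            self._files = files          # path -> bytes on tape

        def stage(self, path):
            return self._files[path]     # pretend this is a slow tape mount

    class DiskCache:
        """Serve reads from disk, falling back to tape on a miss."""
        def __init__(self, tape):
            self._tape = tape
            self._disk = {}              # path -> bytes already on disk

        def read(self, path):
            if path not in self._disk:   # cache miss: stage from tape
                self._disk[path] = self._tape.stage(path)
            return self._disk[path]      # hit on all later reads

    tape = TapeStore({"/pnfs/example/run2/file0001": b"event data"})
    cache = DiskCache(tape)
    cache.read("/pnfs/example/run2/file0001")    # first read: staged from tape
    cache.read("/pnfs/example/run2/file0001")    # second read: served from disk

The same pattern is what lets the risk table later claim that the caching strategy could move transparently to an all-disk solution: only the back end behind the cache would change.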
14
Budget FY04-FY06
15

16
(No Transcript)
17
FTE spread FY04
18
FTE spread FY05
19
FTE spread FY06
20
Risks
Risk | Type of Risk | Plan/mitigation
Provision of computer center building infrastructure fails to keep up with programmatic demands for power and cooling for computing power | Infrastructure | Multi-year plan to re-use existing buildings; separate plan each year to build to match characteristics of systems, given changing technologies
Processing time for CDF or D0 events and/or need to reprocess pushes computing needs outside the planning envelope | Programmatic | Establish Grid model for provision of computing resources in a seamless way (already close to established). Execute plan at Fermilab to make all computing generic Grid computing, to meet peak demands by load sharing.
Demands for serving up Run II data, both on-site and off-site, escalate to a point where the central storage and caching systems fail to scale | Programmatic | Much work has been done to assure scalability of the central storage system. We have many robots and can add tape drives to robots in a scalable way.
Tape technologies do not continue to follow the cost/GB curve we plan for, or tape technologies become obsolete | Programmatic | We have two different types of robots, including two large ADIC flexible-media robots that can take a broad range of media types. If STK silos become obsolete and STK makes no new media, we can expect LTO drives or their descendants to continue for several years. Our caching strategy allows us to transparently go to an all-disk solution, and to replicate data on disk, should this become cost effective.
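
The second row's mitigation, meeting peak demands by load sharing onto generic Grid computing, is essentially an overflow policy: fill the local farms first, then spill the remainder onto Grid sites. A minimal sketch, with hypothetical site names and capacities (this is not the SAM-Grid scheduler):

    # Overflow-scheduling sketch: local capacity first, then Grid sites.
    # Site names and slot counts are invented for illustration.

    def dispatch(n_jobs, local_capacity, grid_sites):
        """Return a {site: jobs} plan, filling local slots before Grid sites."""
        plan = {"local farm": min(n_jobs, local_capacity)}
        remaining = n_jobs - plan["local farm"]
        for site, capacity in grid_sites.items():
            if remaining == 0:
                break
            plan[site] = min(remaining, capacity)
            remaining -= plan[site]
        if remaining:
            plan["unscheduled"] = remaining   # peak exceeds all known capacity
        return plan

    # e.g. a reprocessing peak of 1500 jobs against 800 local slots
    print(dispatch(1500, 800, {"grid site A": 500, "grid site B": 400}))

Anything left in "unscheduled" is exactly the exposure this risk describes: demand that cannot be met without more resources or reprioritization.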
21
Risks
Rely on Grid Computing to solve many problems: if the Grid has been oversold or oversubscribed, and Run II experiments have increasing difficulty getting resources as we approach LHC turn-on, this could limit the physics from Run II. | Programmatic | We plan to maintain a solid base of processing capability at Fermilab. Experiments will have to make hard choices that could limit the physics.
Success with Accelerator Division joint projects means we are likely to be asked to be engaged in this work longer; already this is happening. Applying resources to BTeV has to be balanced with these needs. | Programmatic | Plan carefully what we take on.
For the Grid to work, the network infrastructure must be highly performant to all locations | Programmatic | Fermilab is procuring a fiber connection to StarLight. Fermilab worked on the ESnet roadmap report in the Office of Science and is now working with ESnet to use the fiber for a Metropolitan Area Network, with ANL. R&D proposals and a continual push on improved networking capabilities worldwide (ICFA SCIC), Internet working group, etc.
22
Risk | Type of Risk | Plan/mitigation
All data tapes in FCC. All data tapes for one experiment in one or at most two tape silos. Risk of catastrophic data loss low, but non-zero. | Programmatic and Infrastructure | Working on physical infrastructure to house silo(s); combining all silos into one logical system; dispersal of data to multiple physical locations
Satellite computer center buildings will not have generator backup, only UPS to allow for orderly shutdown of systems on power failure | Programmatic | Need 10% more processors to mitigate effects of power outages, which leave many dead systems in their wake. Have adopted a policy on use of buildings to minimize effects of downtime of worker nodes, keeping file servers and machines with state in FCC.
Satellite computer centers need money to run them. FCC costs us a lot to run. FESS do not provide all of the services; we have to pay for many contracts ourselves. Each additional building will need maintenance, to high standards, if millions of dollars of computing are to be housed within and monitored. | Infrastructure | We still need to squeeze these costs out of the budget. If necessary we will have to tax purchasers of computing.
Plan for a lights-out computing center could get derailed. Two legacy tape systems are being migrated to robotic storage. Building monitoring systems need improvement. | Programmatic | Finish executing the plan to put all active data into a robot. Work with FESS on enhanced and secure access to building monitoring information is ongoing.
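
The first row's mitigation, dispersal of data to multiple physical locations, can be expressed as a simple invariant: every dataset should have copies in at least two distinct locations (buildings or silos). A minimal check of that invariant, with hypothetical dataset and location names:

    # Data-dispersal check sketch: flag datasets whose copies all sit in one
    # location. Dataset and location names are invented for illustration.

    def under_dispersed(replicas, min_locations=2):
        """replicas maps dataset -> list of locations; return datasets at risk."""
        return [ds for ds, locations in replicas.items()
                if len(set(locations)) < min_locations]

    replicas = {
        "CDF raw, Run II": ["FCC silo 1", "FCC silo 1"],       # single silo
        "D0 thumbnails":   ["FCC silo 2", "New Muon center"],  # dispersed
    }
    for ds in under_dispersed(replicas):
        print(f"WARNING: all copies of {ds} are in one location")

Combining all silos into one logical system, as the row proposes, makes such a check easy to run centrally, since every copy's location is known to a single catalogue.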