1
Preparations for CMS Milestones
  • Ian M. Fisk

2
Introduction
  • Three activities are driving most of the US-CMS
    Software and Computing efforts through the next
    twelve months
  • The CMS 2004 Data Challenge
  • The preparation for the Physics TDR
  • The roll out of the LCG distributed computing
    grid
  • In order for US-CMS to meet its obligations,
    substantial effort is being expended in both
    sub-projects, Core Application Software (CAS) and
    User Facilities (UF). We present the activities
    together because the milestones and goals are
    common to both.

3
CMS Milestones: DC04
  • Data Challenge 2004 (DC04)
  • The purpose of the milestone is to demonstrate
    the validity of the software baseline
  • Successfully cope with a sustained data-taking
    rate of 25 Hz at a luminosity of 0.2x10^34
    cm^-2 s^-1 for a period of 1 month (see the rough
    estimate below)
  • Validate the deployed grid model on a sufficient
    number of Tier-0, Tier-1, and Tier-2 sites.
  • Completing this milestone requires
  • Software Development
  • Enhancing the distributed computing
    infrastructure
  • Generating 50 million events
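As a rough cross-check of the scale of this milestone (this estimate is ours, not from the slide), a 25 Hz rate sustained for one month corresponds to tens of millions of events, which is consistent with the planned 50-million-event sample:

# Back-of-envelope estimate of the DC04 event volume (our estimate, not from the slides).
RATE_HZ = 25                        # sustained data-taking rate to be demonstrated
SECONDS_PER_MONTH = 30 * 24 * 3600

events = RATE_HZ * SECONDS_PER_MONTH
print(f"25 Hz for one month ~ {events / 1e6:.0f} million events")
# -> roughly 65 million events; reading the slide, the 50 million generated
#    events would then allow for a realistic duty cycle during the challenge.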

4
CMS Milestones: Physics TDR
  • Physics TDR
  • A physics validity test of software, computing,
    and people's knowledge and skills
  • Consists of two volumes
  • Volume 1: detector response, physics objects,
    calibration, and parameterization
  • Volume 2: high-level analysis; a small number of
    full analyses and a larger number of general
    physics topics
  • Completing this milestone requires
  • Lots of Analysis
  • More Event Production
  • Improvements in software and analysis tools

5
LCG Rollout
  • The deployment of the LCG prototypes is not by
    itself a CMS milestone, but the functionality
    expected from the releases is often tightly
    coupled to CMS milestones
  • CMS expects certain functionality from the LCG
    releases for DC04 data and process management
  • LCG can provide a consistent set of grid
    middleware
  • Their initial methods for doing this have often
    been more intrusive than is acceptable to
    participating institutions
  • The longer-term goal is individual components
    and services
  • The short-term technique is installing a
    complete environment
  • The LCG ramp-up of functionality has not been as
    fast as expected
  • US-CMS is evaluating ways of interfacing
    efficiently with the LCG
  • Trying to balance our need for distributed
    computing development and tools for producing
    events with the need to develop a worldwide
    system

6
Preparations for DC04
  • CAS Pieces
  • Preparing the Software Framework (Bill Tanenbaum,
    Vladimir Litvin)
  • Preparing the Software Support Components (Lassi
    Tuura, Zhen Xie, Michael Case, Ianna Osborne,
    Shahzad Muzaffar)
  • Preparing the Tools for Distributing Software and
    Running Jobs (Natalia Ratnikova and Greg Graham)
  • UF Pieces
  • Facilities Improvements
  • Capacity (David Fagan, Michael Ernst)
  • Capability (Yujun Wu, James Letts, Michael Ernst,
    Tanya Levshina)
  • Distributed Production (Greg Graham, Anzar Afaq,
    Joe Kaiser)
  • Analysis Environment (Hans Wenzel)

7
Software Framework Improvements
  • A little over a year ago CMS decided to formally
    switch away from Objectivity.
  • Objectivity provided us with a persistency
    mechanism, but also a file catalogue, data
    streaming tools, and failure recovery tools
  • The LCG-developed persistency mechanism, POOL, is
    our new baseline solution
  • Bill Tanenbaum worked to extract Objectivity from
    the CMS software framework COBRA
  • There was a prototype that wrote directly to
    ROOT-IO files
  • Bill is now working on interfacing POOL to COBRA
  • We are about a month delayed on milestone
    2.1.2.14, the release of the COBRA-POOL
    integration for the pre-challenge production
  • This schedule was very tight to begin with
  • We expect a release on the 14th of July

8
Software Framework Improvements
  • Vladimir Litvin is working on coordinating and
    developing the Calorimetry Framework
  • Improving navigation efficiency.
  • Interfacing the Calorimeter code with the common
    XML detector description database (DDD)
  • On schedule for the release of code for
    pre-challenge production reconstruction
    (2.1.3.1.6)

9
Software Support Tools
  • The biggest and most important is the new
    persistency mechanism POOL, which is progressing
    as planned (WBS 2.1.1.1) (Zhen Xie)
  • Version 0.5 provided a clean and generic
    interface to persistency and was released
    on time
  • References to object identifiers were added in
    V1.0 (WBS 2.1.1.1.7), also released on time
  • The POOL catalogue can be XML for local sites,
    MySQL for large installations and clusters, and
    soon an EDG Grid catalogue for distributed
    systems
  • POOL has been remarkably true to a fairly
    ambitious schedule that was set in November, but
    the interface of POOL to the Distributed Data
    Management System has been delayed
  • It should still be available in time for DC04,
    but will not be useful in the Pre-Challenge
    Production

10
Software Support Tools (2)
  • Completing a maintenance and integration phase
    of the Detector Description Database (WBS
    2.1.3.3) (Michael Case)
  • Code has been released in the CMSToolBox
  • Reconstruction now mainly reading geometry from
    DDD
  • Michael Case currently working on cleaning up and
    adding features
  • The expertise gained during this project is being
    directed toward the development of the CMS
    conditions database
  • IGUANA visualization progressing (Ianna Osborne,
    Shahzad Muzaffar)
  • Modified to read the new ROOT-IO persistency in
    COBRA without incident.
  • Impressive demonstration at CHEP
  • Common SEAL Framework (Lassi Tuura)
  • New LCG Project
  • Puts a lot of common infrastructure in a common
    library

11
Tools for Running and Distributing
  • Greg Graham has released a version of MC_RunJob
    for Pre-Challenge Production (completing WBS
    2.3.2.2.5)
  • MC_RunJob is the CMS tool which replaced IMPALA
    for specifying CMS production jobs.
  • MC_RunJob is used for local and grid production
  • Interesting additional effort from D0 and CDF
    Groups
  • Natalia Ratnikova has released proposed
    improvements for the Distribution After Release
    tool (DAR)
  • Ability to read the software configuration
    automatically from a reference database
  • Should reduce human interaction and the
    possibility of error
  • Investigating possibility of a simple database of
    installed software distributions
  • Looking at an environment for automating the
    build and install process

12
Software Status
  • The OSCAR GEANT4-based simulation has been slow
    to be validated
  • US-CMS does not currently have an OSCAR
    contribution
  • OSCAR was listed as a necessary component of
    DC04. However,
  • There were significant stability issues at the
    beginning
  • Then there were speed issues
  • OSCAR events were initially 10 times slower than
    CMSIM events. They have since improved to 2-3
    times slower, and recently to less than 2 times
    slower
  • There were issues with the amount of memory
    used, but this should be OK
  • We will start the Pre-Challenge Production with
    10 million CMSIM events. We hope to start OSCAR
    production at the beginning of August.
  • This doesn't leave a lot of time for testing the
    infrastructure, but we need to get started
    running events

13
Preparing the Facilities
  • In order to be ready for the Data Challenge, the
    User Facilities project
  • Needs to increase US-CMS capacity
  • Increase the processing and storage at the Tier1
    and Tier2 centers
  • Needs to increase the services offered
  • R&D effort to increase the efficiency with which
    we use the facilities we have
  • Networking, data serving and transfer, etc.
  • Needs to improve the automation of simulation
    processing and, later, event reconstruction
  • Improvements and extensions to the Distributed
    Processing Environment (DPE)
  • Increase in scale and robustness
  • Testing middleware and components
  • Establishing grid services

14
Increasing the US Resources
  • Predicted resources for CMS
  • The US Tier1 Center represents 10% of all the
    offsite resources

[Chart: predicted CMS computing resources in
kSI2000-months per year (log scale, 10 to 100,000),
2002-2009, with the number of boxes at FNAL growing
over the period: 136, 190, 270, 300, 700, 1100, 1500]
15
Tier1 Procurements in FY03
  • Currently the US Tier1 Center has
  • 40 dual-CPU P3 750 nodes, which will be out of
    maintenance in Oct.
  • We will keep them running until the end of the
    year on RH 6.1, for Objectivity licensing
    reasons, to access and analyze old data
  • 66 dual-CPU Athlon 1900 nodes purchased 6 months
    ago
  • Together these are less than half the SI2000
    expected from the US Tier1 center in 2003
  • Currently procuring 76 new dual CPU 2.4 GHz Xeon
    nodes
  • 60 will be used for production and reconstruction
  • 16 will be reserved for analysis users

16
Resources at Tier2s
  • The hierarchical Tier model calls for the total
    of the Tier2 resources to sum to the resources at
    the Tier1 center
  • Through iVDGL support the Tier2 centers have been
    upgrading to fulfill US-CMS obligations to CMS
  • Caltech has procured 32 dual Xeon 2.4 GHz compute
    nodes
  • UCSD has 20 and will procure an additional 10
  • Florida is in the process of upgrading with
    additional support from U-Florida
  • The resources available are about what was
    expected
  • Tier2 centers made significant contributions to
    the last official CMS production
  • Expected to make a big contribution to DC04

17
Improving our services: Networking
  • Offsite data transfer requirements have
    consistently outpaced available bandwidth
  • The upgrade by ESnet to OC12 (12/02) is already
    becoming saturated
  • FNAL is planning to obtain an optical network
    connection to StarLight in Chicago, the premier
    optical network switching center on the North
    American continent. This enables network
    research and holds promise for
  • Handling peak production loads when production
    demand exceeds what ESnet can supply
  • Acting as a backup if the ESnet link is
    unavailable
  • Potential of a single fiber pair
  • Wavelength Division Multiplexing (WDM) for
    multiple independent data links
  • Capable of supporting 66 independent 40 GB/s
    links if fully configured
  • Initial configuration is for 4 independent
    1 Gbps links across 2 wavelengths
  • Allows us to configure bandwidth to provide a
    mix of immediate service upgrades as well as
    validation of non-traditional network
    architectures
  • Immediate benefit to production bulk data
    transfers, a test bed for high-performance
    network investigations, and scalability into the
    era of LHC operations

18
Current Off-site Networking
  • All FNAL off-site traffic carried by ESnet link
  • The ESnet Chicago PoP has a 1 Gb/s StarLight link
  • Peering with CERN, SURFnet, and CA*net there
  • Also peering with Abilene there (for now)
  • ESnet peers with other networks at other places

19
Proposed Network Configuration
  • Dark fiber is an alternate path to
    StarLight-connected networks
  • Also an alternate path back into ESnet

20
End-to-End Performance / Network Performance and
Prediction
  • Need to actively pursue integration of network
    stack implementations supporting ultrascale
    networking for rapid data transactions and
    data-intensive dynamic workspaces
  • Maintain statistical multiplexing and end-to-end
    flow control
  • Maintain functional compatibility with the Reno
    TCP implementation
  • The FAST project (Caltech) has shown dramatic
    improvements over the Reno stack by moving from
    a loss-based to a delay-based congestion control
    mechanism, with standard segment size and fewer
    streams (sketched below)
  • Sender-side-only modifications
  • Fermilab/CMS is a FAST partner and well-supported
    user, with the FAST stack installed on Facility
    R&D data servers (first results look very
    promising)
  • Aiming at installations and evaluations for
    integration with the production environment at
    Fermilab, CERN, and Tier-2 sites
  • Working in collaboration with the FAST project
    team at Caltech and the Fermilab CCF Department
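To illustrate the loss-based versus delay-based distinction referred to above, here is a deliberately simplified sketch. It is not the actual FAST algorithm; the gain (gamma) and target-queueing (alpha) parameters are purely illustrative.

# Schematic contrast between loss-based (Reno-like) and delay-based (FAST-like)
# congestion control. Parameter values are illustrative only.

def reno_update(cwnd, loss):
    """Loss-based: grow until packets are dropped, then halve."""
    return cwnd / 2 if loss else cwnd + 1            # AIMD per round trip

def delay_based_update(cwnd, rtt, base_rtt, alpha=100, gamma=0.5):
    """Delay-based: steer toward keeping a small, fixed amount of data queued.

    alpha ~ target number of packets held in the queue,
    gamma ~ smoothing gain; both are illustrative, not FAST's tuned values.
    """
    target = (base_rtt / rtt) * cwnd + alpha          # window keeping ~alpha packets queued
    return (1 - gamma) * cwnd + gamma * target        # move part of the way toward the target

# With no queueing delay (rtt == base_rtt) the window grows; as queueing delay
# builds up, rtt rises and the update levels off without waiting for a loss.
cwnd = 1000.0
for rtt in (0.10, 0.11, 0.12, 0.13):                  # seconds; base RTT is 0.10 s
    cwnd = delay_based_update(cwnd, rtt, base_rtt=0.10)
    print(round(cwnd, 1))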

21
(No Transcript)
22
Improving Data Server Service
  • The Objectivity AMS had a nice set of features.
    We've been looking at dCache to replace some of
    that functionality
  • dCache is a disk caching system developed at
    DESY as a front end for a Mass Storage System
    (MSS), with significant developer support at
    FNAL
  • We are using it as a way to utilize disk space
    on the worker nodes and to efficiently supply
    data to intensive applications like simulation
    with pile-up
  • Applications access the data in dCache space
    through a POSIX-compliant interface. From the
    user's perspective the dCache directory (/pnfs)
    looks like any other cross-mounted file system
    (see the example below)
  • Since dCache was designed as a front end to an
    MSS, files cannot be appended once closed
  • Very promising set of features for load
    balancing and error recovery
  • dCache can replicate data between servers if the
    load is too high
  • If a server fails, dCache can create a new pool
    and the application can wait until the data is
    available
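A minimal sketch of what the POSIX-style access described above looks like from a user job. The /pnfs path below is hypothetical, and the retry loop simply illustrates the "wait until the data is available" behaviour mentioned on the slide.

# Minimal sketch of POSIX-style access to dCache-managed data (hypothetical path).
import time

PILEUP_FILE = "/pnfs/cms/minbias/sample_0001.root"    # hypothetical /pnfs location

def read_first_bytes(path, n=1024, retries=5, wait_s=30):
    """Open a file under /pnfs like any other file; retry while dCache
    restages or re-replicates the data (as described on the slide)."""
    for attempt in range(retries):
        try:
            with open(path, "rb") as f:               # ordinary POSIX open/read
                return f.read(n)
        except OSError:
            time.sleep(wait_s)                        # wait for the data to become available
    raise RuntimeError(f"{path} not available after {retries} attempts")

if __name__ == "__main__":
    header = read_first_bytes(PILEUP_FILE)
    print(f"read {len(header)} bytes")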

23
Data Intensive Applications
  • Simulation of the CMS detector is difficult
  • There are 17 interactions per crossing on average
  • There are 25 ns between crossings
  • The previous 5 crossings and the following 3
    influence the detector response
  • Each simulated signal event requires 170 minimum
    bias events
  • Simulating new minimum bias events for each
    signal event would take about 90 minutes
  • A large sample is created and recycled
  • The sample is sufficiently large that it doesn't
    usually fit on local disk
  • It is about 70 MB per event
  • These events are randomly sampled, so it is
    taxing on the minimum bias servers and the
    network (see the back-of-envelope estimate below)
  • We were able to saturate 100 Mbit network links
    in this application using systems that were
    three times slower than what we are proposing to
    procure
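A back-of-envelope reading of these numbers shows how quickly a modest farm saturates a 100 Mbit link. The interpretation of "70 MB per event" as the minimum bias data read per signal event, the per-event processing time, and the farm size are our assumptions, not figures from the slide.

# Rough estimate of the network load from pile-up simulation.
MB_PER_SIGNAL_EVENT = 70       # from the slide, read as data pulled in per signal event (assumption)
SECONDS_PER_EVENT = 300        # assumed time to simulate one signal event on one node
NODES = 50                     # assumed farm size

rate_mbit_s = NODES * MB_PER_SIGNAL_EVENT * 8 / SECONDS_PER_EVENT
print(f"aggregate demand ~ {rate_mbit_s:.0f} Mbit/s from the minimum bias servers")
# With these assumptions, ~50 nodes already demand ~93 Mbit/s, so a single
# 100 Mbit link to the minimum bias servers is easily saturated, as observed.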

24
(No Transcript)
25
Data Rate Into the Application
  • Performance is fairly flat across the nodes and
    is good
  • The performance in the pile-up application is
    sufficient that analysis applications should be
    well served

26
Load Balancing
27
Performance with Load Balancing
28
Improving our automation
  • At the time of the last large official CMS event
    production (Spring 2002) we used 5 large centers
    in the US, with 5 production coordinators
  • Production is not enormously difficult, but it
    does require diligence at each site to make sure
    the system is running efficiently
  • In the Fall of 2002, the first grid enabled
    production system was deployed in the US.
  • The Distributed Production Environment (DPE)
    consists of
  • VDT for grid middleware
  • MC_Runjob for job specification
  • MOP, a custom-developed application which
    submits jobs to Condor-G
  • 1.5 million events were produced, with jobs
    submitted from a single site
  • Looking this year to harden the environment and
    improve the efficiency.

29
Introduction to DPE
  • Rolling Prototypes deployed across three test
    environments
  • DGT - Development Grid Testbed
  • R&D efforts, evaluation of new components,
    shake-out testing
  • IGT - Integration Grid Testbed
  • Scalability and stability testing and development
  • PG - Production Grid
  • Stable production running
  • Working to carefully version and release DPE
    packages
  • Validating and testing versions on DGT before
    migrating to IGT and PG

30
Version 1.0
  • In Version 1.0
  • Using a few hundred CPUs at 5 sites
  • Based on VDT 1.1.4 (Condor 6.4.3, Globus 2.0) for
    primary middleware
  • MOP and Impala for CMS production tools
  • Locally installed DAR distributions for CMS
    application software
  • What we learned:
  • Found scaling issues with Condor at 200 CPUs
  • Found scaling issues in the client architectures
  • Issues with writing out too much information
    into common areas
  • Found scaling issues with the way data is
    returned to the submitting site
  • Currently globus-url-copy transfers are
    submitted at the end of each job
  • This puts a significant load on the head node

31
Recent Efforts
  • Increasing number of sites in DGT by adding three
    sites
  • MIT, Iowa and Rice
  • Some of these will be added to IGT
  • Approached by Taiwan to join IGT
  • Increases the scale at which we test the system

32
DPE 1.5
  • Created and released DPE version 1.5 on May 23
  • VDT 1.1.8 based
  • Changes the Globus version to 2.2.4
  • We support multiple queuing systems on DPE
    sites. FBSNG, used at FNAL, is not supported by
    the Globus gatekeeper; when the gatekeeper
    scripting changed from shell to Perl, this
    support had to be rewritten
  • We believe Condor scaling issues have been solved
    for this release, so we expect to be able to
    utilize more CPUs efficiently
  • Current release of MC_RunJob for job
    specification
  • All software for client sites has been packaged
    with Pacman
  • VDT has been Pacman based since the beginning
  • Also added CMS application software
  • Reduces installation of DPE client-site software
    to one command issued at the head node
  • pacman -get
    http://computing.fnal.gov/cms/software/DPE-download:DPE-client
  • The local queue is assumed to be Condor; Condor
    must be started on the nodes

33
Current Activities
  • Installing new DPE version on IGT
  • Verifying that the new installation method works
    on all DGT sites; ran 400 small-scale test jobs
  • Moving the master site software into Pacman.
  • MOP and MCRunJob
  • Should allow rapid installation of additional MOP
    masters
  • Will allow additional groups to easily evaluate
    this effort
  • We should be ready for large scale CMSIM
    production in a week
  • Environment should be well shaken out by the time
    OSCAR production starts
  • Prepare and deploy a PG version of DPE 1.x for
    PCP
  • On schedule to avoid local (non-grid) production
    in the US

34
DPE Development Activities
  • Working on a Configuration Monitoring tool based
    on MDS
  • Currently the location where the DPE software is
    installed and the space available for output are
    passed between client and master by e-mail
  • This is prone to error and doesn't give the
    client administrator sufficient control
  • Current methods of data management are
    insufficient for a large scale distributed
    production system
  • Output is written using globus-url-copy from
    headnode
  • This prevents nodes from needing external network
    access, but stresses headnode
  • At a minimum we need the ability to queue
    transfers (see the sketch below)
  • Currently transfers commence as soon as the jobs
    are finished
  • In the near future a real data management system
    is needed
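A minimal sketch of the kind of transfer queuing described above: instead of launching every globus-url-copy as soon as a job finishes, completed outputs are queued and only a few transfers run on the head node at a time. The hosts, paths, and concurrency limit are hypothetical.

# Minimal sketch of queued output transfers from the head node
# (hypothetical hosts and paths; limits are illustrative).
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_TRANSFERS = 3          # throttle the load on the head node

def transfer(local_path, remote_url):
    """Copy one finished output file with globus-url-copy."""
    cmd = ["globus-url-copy", f"file://{local_path}", remote_url]
    return subprocess.run(cmd, check=False).returncode

if __name__ == "__main__":
    # Outputs that finished jobs have queued for transfer (hypothetical).
    pending = [
        ("/data/out/run_0001.root", "gsiftp://tier1.example.org/cms/pcp/run_0001.root"),
        ("/data/out/run_0002.root", "gsiftp://tier1.example.org/cms/pcp/run_0002.root"),
        ("/data/out/run_0003.root", "gsiftp://tier1.example.org/cms/pcp/run_0003.root"),
    ]
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_TRANSFERS) as pool:
        results = list(pool.map(lambda job: transfer(*job), pending))
    print("return codes:", results)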

35
Data Management Systems
  • As we increase the amount of data generated by
    our automated production system, and as we
    prepare for analysis applications, we need to
    improve the data management tools deployed.
  • We've taken a two-pronged approach
  • CMS has adopted the Storage Resource Broker
    (SRB), developed in part by PPDG, to handle our
    data transfers during Pre-Challenge Production
  • SRB is a working solution, which is well
    supported and has a global catalogue
  • The SRB architecture currently has some
    limitations
  • US-CMS and LCG have started to develop the
    Storage Element (SE) which will be used to
    provide data management services
  • Based on the Storage Resource Manager (SRM) and
    Replica Location services (RLS)

36
SRB Architecture
37
SRB Status
  • SRB tests started with a demonstration that a
    sufficient amount of data for PCP could be
    transferred from US sites to CERN using SRB
  • One server at FNAL and one server at UCSD
    transferred data to a server at CERN
  • There was no other complete, tested solution for
    this problem, so CMS decided to move ahead with a
    large scale deployment.
  • Currently there are servers in
  • 2 in Russia
  • 3 in Wisconsin
  • 1 at CERN
  • 1 in Spain
  • 1 in Ukraine
  • 1 in San Diego
  • 1 in Germany
  • 2 at FNAL (one of which writes directly to
    dCache)
  • plus 1 client installation in Pakistan
  • SRB has some limitations
  • The current implementation with a single MCAT
    requires good networking and is a single point
    of failure, but it is working for PCP

38
Storage Element Development
  • While compute resource scheduling technology is
    fairly advanced
  • Batch systems are grid-enabled; Condor-G
    allocates processing requests reasonably
    efficiently over distributed resources
  • However, mechanisms for Data Access and Storage
    Resource Management in the Grid environment are
    lacking. Issues include
  • Shared storage resource allocation and scheduling
  • Staging management - files are typically
    archived on a mass storage system (MSS)
  • Wide area networks - minimizing transfers
  • File replication and caching
  • A working group was mandated by the GDB to
    understand the Grid File Access requirements and
    present a proposal to enable applications to
    perform file access using the LCG infrastructure

39
(No Transcript)
40
(No Transcript)
41
Storage Virtualization with SRM
  • A collaboration of Fermilab, Jefferson Lab, LBNL
  • Participants also include CERN, EDG, and others
  • Supported by PPDG
  • SRM functionality (a conceptual sketch follows
    this list)
  • Manage space
  • Negotiate and assign space to users
  • Manage lifetime of spaces
  • Manage files on behalf of a user
  • Pin files in storage until they are released
  • Manage lifetime of files
  • Manage file sharing
  • Policies on what should reside on a storage
    resource at any one time
  • Get files from remote locations when necessary
  • Manage multi-file requests
  • A brokering function: queue file requests and
    pre-stage when possible
  • Provide grid access to/from mass storage systems
  • Enstore (FNAL), Castor (CERN), ATLAS (RAL),
    JASMine (JLAB),
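The sketch below is purely conceptual: it restates the functionality listed above as a small Python interface to make the roles (space reservation, pinning, multi-file brokering) concrete. It is not the real SRM protocol or API, and all names and numbers are hypothetical.

# Conceptual illustration of the SRM roles listed above; NOT the real SRM API.
from dataclasses import dataclass, field

@dataclass
class SpaceReservation:
    owner: str
    size_gb: int
    lifetime_s: int                   # intended lifetime (not enforced in this toy)

@dataclass
class StorageResourceManager:
    """Toy stand-in for an SRM sitting in front of a mass storage system."""
    capacity_gb: int
    reservations: list = field(default_factory=list)
    pinned: set = field(default_factory=set)
    request_queue: list = field(default_factory=list)

    def reserve_space(self, owner, size_gb, lifetime_s):
        used = sum(r.size_gb for r in self.reservations)
        if used + size_gb > self.capacity_gb:
            raise RuntimeError("no space available")
        reservation = SpaceReservation(owner, size_gb, lifetime_s)
        self.reservations.append(reservation)
        return reservation

    def pin(self, filename):
        self.pinned.add(filename)      # keep the file staged until released

    def release(self, filename):
        self.pinned.discard(filename)  # the space may now be reused

    def request_files(self, filenames):
        """Brokering function: queue a multi-file request for pre-staging."""
        self.request_queue.extend(filenames)

srm = StorageResourceManager(capacity_gb=1000)
srm.reserve_space("cms-production", size_gb=200, lifetime_s=86400)
srm.request_files(["minbias_001.root", "minbias_002.root"])
srm.pin("minbias_001.root")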

42
Advantages of SRM
  • Provides uniform Grid access to heterogeneous
    Mass Storage Systems
  • Synchronization between storage resources
  • Pinning files, releasing files
  • Allocating space dynamically on an as-needed
    basis
  • Insulate clients from storage and network system
    failures
  • Transient MSS failure
  • Network failures
  • Interruption of large file transfers
  • Facilitate file sharing
  • Eliminate unnecessary file transfers
  • Support streaming model
  • Space allocation policies are applied by the
    SRMs; no reservations needed
  • Use explicit release by client for reuse of space
  • Control number of concurrent file transfers
    (queuing and traffic shaping)
  • From/to MSS: avoid flooding the head/gateway
    node and the MSS, and avoid thrashing
  • From/to network: avoid flooding and packet loss

43
Status of SE Development Project
  • We have identified manpower in the US to start
    the evaluation and development of the Storage
    Element
  • SRM is relatively stable and understood
  • The local area data serving tools (dCache, rfio,
    ROOTD, nfs) are stable and well debugged
  • The Replica Location Service (RLS) and the
    Replica Location Index (RLI), and the
    functionality that provides all the data
    cataloguing services, are still prototypes
  • Several implementations exist, not all of which
    are compatible
  • The interface of the catalogue services to the
    applications is still in the development phase
  • Still trying to understand how the SE
    communicates with POOL and the applications

44
New Project for Distributed Production
  • As the DPE work shifts to stable running toward
    the latter portion of the summer, we expect to
    migrate the functionality to GRID3
  • GRID3 is a combined US-ATLAS, US-CMS, PPDG,
    GriPhyN, iVDGL effort to form a persistent and
    large scale, interoperable grid of computing
    resources.
  • We are currently in the process of determining
    the middleware required by US-CMS and US-ATLAS
  • Try to keep as close as possible to the existing
    DPE, but make it compatible with US-ATLAS
    installations
  • This will increase the scale at which we validate
    our environment and increase the resources
    available to us.
  • We expect a minimal set of services to be
    installed at the beginning of August. The
    complexity and functionality will increase
    through the fall
  • We are in the process of defining the metrics
    for success: complexity achieved, data
    transferred, concurrent users, and jobs running

45
Summary of DC04 Preparations
  • Software preparations are nearly finished
  • We are behind on POOL preparations, but this has
    not slowed down PCP preparations
  • CMS is further behind on OSCAR validation
  • Software support tools are in good shape
  • User Facility preparations are in progress
  • We have a reasonable procurement to perform
  • Service improvements are proceeding well, but
    more work is needed.
  • DPE is proceeding well and we should be able to
    avoid local production in the US.
  • We look forward to the additional resources and
    scale expected from the GRID3 effort

46
Analysis Preparation
  • After DC04 is completed, CMS will enter a year of
    intense analysis activity preparing for the
    Physics TDR. US-CMS needs to increase the
    analysis capability of the Tier1 center and the
    number of users working in the US.
  • A lot of the preparations overlap with DC04
    preparations
  • Data serving
  • Data management
  • Software distributions
  • Some are extensions of DC04 preparations
  • We need some simple VO management for DC04, but
    the multi-user environment of analysis requires
    more services
  • Extensions of the production tools to allow
    custom user simulation and distributed analysis
    applications
  • Some are new efforts
  • Load balancing for interactive users and the user
    analysis environment

47
VO Management
  • The first virtual organization infrastructure
    will be deployed for PCP, but there are only a
    few production users and the application is
    predictable and organized
  • We don't worry that production users will do
    something malicious or foolish
  • The analysis environment is much more
    complicated.
  • Many more users with diverse applications,
    abilities, and access patterns
  • The VO Project is working with US-ATLAS to
    develop the infrastructure for authenticating
    and authorizing users
  • First prototypes will concentrate on
    authentication
  • Need to satisfy experiment wide and local site
    policies
  • Authorization at the level of individual
    resources is necessary, and it quickly couples
    to auditing and usage policies
  • This needs additional work

48
The Analysis User Environment
  • US-CMS is trying to increase the number of people
    working at the Tier1 center
  • Improving the software and developer environment,
    getting a better match to CERN
  • Hans Wenzel has written a nice document
    describing the analysis farm and its use
  • Using Fermilab batch system to balance
    interactive log-ins
  • Working with the networking people at FNAL to
    evaluate the solution CERN uses for this
  • Deploying dFarm and dCache to provide data space
    to analysis users
  • Working on procuring additional analysis
    resources
  • More analysis CPUs
  • More data servers for analysis
  • A high-performance replacement for the current
    analysis home directories in AFS

49
Distributed and Batch Analysis
  • We are working to provide functionality for users
    to produce small custom simulation productions
    and run analysis jobs in a DPE-like environment.
  • This is an extension of tools like MC_RunJob
  • Also an extension of the current software
    packaging we use in DPE
  • Should allow physics groups to investigate new
    ideas and validate applications without involving
    the full production team
  • Should also allow physics users more freedom
  • The grid-enabled production through the fall
    should free human resources to work on analysis
    environment development, so that we are ready
    for the Physics TDR analysis
  • The analysis techniques used in the Physics TDR
    are intended to be as close as possible to those
    used at the start of the experiment. US-CMS
    cannot afford to have the analysis centralized at
    CERN.

50
LCG Roll-out
  • In May of 2003 the LCG rolled out the first
    prototype of their production grid (LCG-0)
  • It was based on an older version of the VDT and
    the EDG software
  • It was mainly a packaging and installation test
  • This was installed at FNAL and 5 other Tier1
    centers
  • The installation was successful, but the
    environment was fairly rigid: it only worked
    with specific queuing systems, the components
    were tightly coupled to each other, and the
    components had to be installed within the LCG
    environment
  • LCG-1 is expected shortly
  • Some functionality has been pushed back because
    it is not available
  • The VDT and many of the EDG components are
    up-to-date
  • The packaging is improved, but the components are
    still tightly coupled

51
LCG Roll-out
  • US-CMS, in our DPE development, services
    development, and GRID3 integration, is pushing
    for interoperable services
  • We are attempting to stress test the newest
    versions of the middleware
  • Develop and push for components that can be
    installed independently
  • Develop well defined and reasonably lightweight
    interfaces that can be installed on top of large
    computing centers without making too many demands
    on the centers themselves
  • This is the only way we can see to leverage
    resources we don't have complete control over,
    like the TeraGrid
  • We are aiming to be compatible with the LCG but
    not necessarily adopt all the LCG infrastructure

52
Outlook
  • We have been making reasonable progress preparing
    for DC04 and the Physics TDR but there is a lot
    of work left to do
  • Software preparations coming along well. We
    should be ready to commence all steps of
    Pre-Challenge Production by August
  • Delivery of 50 million events by Christmas is
    tight
  • User Facility preparations are proceeding
  • A lot of our activity is still ramping up
  • SE Development
  • Network RD
  • Grid3
  • Many thanks for the reasonable personnel and
    equipment profiles. They are making our current
    efforts possible.