1
Case Studies: MGRID and DZero
Abhijit Bose, The University of Michigan, Ann Arbor, MI 48109, abose@eecs.umich.edu
2003 Summer Computing Institute, SDSC
Slides contributed by Shawn McKee, Lee Lueking, Igor Terekhov, Gabriele Garzoglio, and Jianming Qian
2
Just What is a Grid, Anyway?
  • Three key criteria:
    • coordinates distributed resources
    • using standard, open protocols and interfaces
    • to deliver non-trivial qualities of service to users and applications
  • What is not a Grid?
    • A cluster, network-attached storage device, scientific instrument, network, etc.
    • Each is an important component of a Grid, but by itself does not constitute a Grid

3
Why Grids?
  • A biochemist exploits 10,000 computers to screen
    100,000 compounds in an hour
  • 1,000 physicists worldwide pool resources for
    petaop analyses of petabytes of data
  • Civil engineers collaborate to design, execute,
    analyze shake table experiments
  • Climate scientists visualize, annotate, analyze
    terabyte simulation datasets
  • An emergency response team couples real time
    data, weather model, population data

4
Why Grids? (cont'd)
  • A multidisciplinary analysis in aerospace couples
    code and data in four companies
  • An application service provider purchases cycles
    from compute cycle providers
  • Scientists use a specialized (and expensive)
    microscope at a remote facility on a time-shared
    basis
  • The bandwidth, reliability, and ubiquity of networks enable us to envision transparent access to geographically distributed resources: computers, storage, instruments, communication devices, and data.

5
The Advent of Grid Computing
  • I would argue we have crossed a threshold in
    technology that has enabled this new paradigm to
    arise.
  • There are numerous examples of the advent of grid
    computing

6
Network for Earthquake Engineering Simulation
  • NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, and each other
  • On-demand access to experiments, data streams, computing, archives, collaboration

NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
7
ATLAS
  • A Toroidal LHC ApparatuS
  • Collaboration:
    • 150 institutes
    • 1850 physicists
  • Detector:
    • Inner tracker
    • Calorimeter
    • Magnet
    • Muon
  • United States ATLAS:
    • 29 universities, 3 national labs
    • 20% of ATLAS

8
(No Transcript)
9
(No Transcript)
10
How Much Data is Involved?
[Figure: Level-1 trigger rate (Hz) versus event size (bytes) for LEP, UA1, NA49, H1/ZEUS, ALICE, CDF/D0, KLOE, HERA-B, TeV II, LHCb, ATLAS, and CMS. The LHC experiments combine a high Level-1 trigger rate (about 1 MHz for LHCb), a high channel count and bandwidth (about 500 Gbit/s for ATLAS/CMS), and a petabyte-scale data archive. Source: Hans Hoffmann, DOE/NSF Review, Nov 2000.]
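The figure compares experiments by Level-1 rate and event size; their product sets the required readout bandwidth. A quick order-of-magnitude check, using assumed round numbers for ATLAS/CMS rather than values read off the original plot:

    # Rough check: readout bandwidth = Level-1 accept rate x event size.
    # Both values below are assumed order-of-magnitude figures, not data
    # points taken from the original plot.
    level1_rate_hz = 1e5         # ~100 kHz Level-1 accept rate (assumed for ATLAS/CMS)
    event_size_bytes = 1e6       # ~1 MB per event (assumed)
    bandwidth_gbit_s = level1_rate_hz * event_size_bytes * 8 / 1e9
    print(f"~{bandwidth_gbit_s:.0f} Gbit/s")   # ~800 Gbit/s, same order as the 500 Gbit/s annotation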
11
Where does Michigan Fit?
  • ATLAS (and D0) will require grid technologies to
    meet their computational and data handling needs
  • Many other fields are or will be encountering
    equivalent problems
  • Michigan has a long history of leadership in networking and computer science R&D

12
MGRID (www.mgrid.umich.edu): Michigan Grid Research and Infrastructure Development
  • A center to develop, deploy, and sustain an institutional grid at Michigan
  • Many groups across the University participate in compute-, data-, and network-intensive research grants; increasingly, a Grid is the solution
  • ATLAS, DZero, NPACI, NEESgrid, Visible Human, MCBI, NFSv4, NMI
  • MGRID allows work on common infrastructure instead of custom solutions

13
MGRID Center
  • Central core of technical staff (new hires)
  • Faculty and staff from participating units
  • Executive committee from participating units and the provost's office
  • Technical Leadership Team to guide design and development
  • Collaborative grid research and development with technical staff from participating units
  • Primary goal: develop and deploy an institutional Grid for the University of Michigan

14
MGRID Research Project Partners
  • Center for Advanced Computing (cac.engin.umich.edu)
  • College of LSA (Physics) (www.lsa.umich.edu)
  • Center for Information Technology Integration (www.citi.umich.edu)
  • Michigan Center for BioInformatics (www.ctaalliance.org)
  • Visible Human Project (vhp.med.umich.edu)
  • Mental Health Research Institute (www.med.umich.edu/mhri)
  • ITCom (www.itcom.itd.umich.edu)
  • School of Information (si.umich.edu)
  • Electron Microbeam Analysis Laboratory (emalwww.engin.umich.edu/emal/fset.html)

15
MGRID Goals
  • Provide participating units knowledge, support
    and a framework to deploy Grid technologies
  • Provide testbeds for existing and emerging Grid
    technologies
  • Coordinate activities within the national Grid community (GGF, GlobusWorld, ...)
  • Provide a context for the University to invest in
    computational and other Grid resources

16
MGRID Activities for Year 1
  • Grid-enable the CAC computer clusters and mass store.
  • Install NMI and NPACKage on development computers
    and assess changes necessary for NMI and NPACKage
    to support UM IT infrastructure.
  • Enable the AccessGrid across the campus backbone
    and within the units that elect to utilize the
    AccessGrid.

17
MGRID Activities for Year 1 (cont.)
  • Issue Request for Expression of Interest to
    faculty and staff across the UM. These projects
    may include instructional activity or other
    scholarly endeavor as a "research application" of
    grid computing.

18
MGRID Activities for Year 1 (cont.)
  • Work with network experts (e.g., the Internet2 End-to-End Performance Initiative Technical Advisory Group) to design and deploy campus network monitors and beacons to measure baseline campus network performance.
  • Establish collaborative efforts within the UM
    community to develop and submit proposals to
    funding agencies to support grid computing at
    Michigan and to support the use of MGRID
    resources and services for research and
    instructional activities.
  • Pursue interactions with Internet2

19
Longer-term goals
  • We hope to make significant contributions to general grid problems related to:
    • sharing resources among multiple virtual organizations (VOs)
    • network monitoring and QoS issues for grids
    • integration of middleware with domain-specific applications
    • grid file systems
  • We want to enable Michigan researchers to take advantage of grid resources both on campus and beyond

20
[Diagram: an MGRID User accessing an MGRID Resource]
21
An Application Grid Domain Using NPACI Resources
DZero/SAM-Grid Deployment at Michigan
(Slides contributed by Lee Lueking, Igor Terekhov, and Gabriele Garzoglio, FNAL)
22
Timelines
  • Planning meetings (CAC and Fermilab): Sep-Oct 2002
  • Demonstration of SAM-Grid at SC2002: Nov 2002
  • Deployment/site customization: Dec 2002 - Mar 2003
  • NPACI production runs at Michigan: July 2003 (plus site visits; students spent part of their time at Fermilab)
  • Moving on to scheduling and data-movement issues at DZero resource sites
23
Michigan DZero NPACI Processing and Grid Team
A group of physicists and computer scientists
  • Project leaders: Jianming Qian (Physics), Abhijit Bose (CAC, EECS)
  • Graduate students: Brian Wickman (CAC, EECS), James Degenhardt (Physics), Jeremy Herr (Physics), Glenn Lopez (Physics)
  • Core DZero faculty in Physics: Jianming Qian, Bing Zhou, Drew Alten
  • Need both science/application experts as well as computer scientists
24
The D0 Collaboration
  • 500 physicists
  • 72 institutions
  • 18 countries

25
(No Transcript)
26
Scale of Challenges
  • Computing: sufficient CPU cycles and storage, good network bandwidth
  • Software: efficient reconstruction program, realistic simulation, easy data access, ...

[Figure: DØ trigger chain. The 2 MHz collision rate is reduced to about 5 kHz by Level 1, to about 1 kHz by Level 2, and to 50 Hz out of Level 3.]

  • A luminosity-dependent physics menu leads to an approximately constant Level 3 output rate
  • With a combined duty factor of 50%, we are writing at 25 Hz DC, corresponding to 800 million events a year
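A quick arithmetic check of the quoted yield (the only assumption is year-round running):

    # Check: 25 Hz average ("DC") rate sustained over a full year.
    seconds_per_year = 365 * 24 * 3600        # ~3.15e7 s (assumes year-round running)
    avg_rate_hz = 25                          # 50 Hz peak x 50% combined duty factor
    print(f"{avg_rate_hz * seconds_per_year:.2e} events/year")   # ~7.9e8, i.e. ~800 million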

27
Data Path
[Diagram: data path. Raw data and Monte Carlo (MC) from the Fermilab farm and offsite farms flow through the data handling system to users at Fermilab and offsite.]
28
Major Software
  • Trigger algorithms (Level-2/Level-3) and simulation
    • Filtering through firmware programming and/or (partial) reconstruction, event building, and monitoring. Simulating Level-1 hardware and wrapping Level-2 and Level-3 code for offline simulation.
  • Data management and access
    • SAM (Sequential Access to data via Meta-data) is used to catalog data files produced by the experiment and provides distributed data storage and access to production and analysis sites.
  • Event reconstruction
    • Reconstructing all physics object candidates, producing Data Summary Tapes (DST) and Thumbnails (TMB) for further analyses.
  • Physics and detector simulation
    • Off-the-shelf event generators to simulate physics processes, and home-grown Geant-based (slow) and parameterized (fast) programs to simulate detector responses.

29
Computing Architecture
[Diagram: computing architecture. Central data storage connects to dØmino, which serves the Central Analysis Backend (CAB) and remote analysis.]
30
Storage and Disk Needs
For two-year running:
  • Storage: store all officially produced and user-derived datasets at the Fermilab robotic tape system, roughly 1.5 PB
  • Disk: all data and some MC TMBs are disk-resident, with sufficient disk cache for user files, roughly 40 TB at analysis centers
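Combining these targets with the roughly 800 million events per year quoted earlier gives a back-of-envelope feel for the per-event volume; this is an illustrative estimate, not a number from the slides:

    # Implied average stored volume per event over the two-year sample.
    events = 800e6 * 2                 # ~800 million events/year x 2 years
    tape_bytes = 1.5e15                # 1.5 PB robotic tape estimate
    disk_bytes = 40e12                 # 40 TB disk cache estimate
    print(f"~{tape_bytes / events / 1e6:.1f} MB on tape per event (all formats combined)")
    print(f"~{disk_bytes / events / 1e3:.0f} kB of disk cache per event")
    # ~0.9 MB/event on tape vs ~25 kB/event on disk: only a TMB-sized slice stays disk-resident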
31
Analysis CPU Needs
CPU needs are estimated based on the layered analysis approach:
  • DST-based
    • Resource-intensive; limited to physics, object ID, and detector groups
    • Examples: insufficient TMB information, improved algorithms, bug fixes, ...
  • TMB-based
    • Medium resources required; expected to be done mostly by subgroups
    • Examples: creating derived datasets, direct analyses on TMB, ...
  • Derived datasets
    • Done daily by individuals on their desktops and/or laptops
    • Examples: ROOT-tree-level analyses, selection optimization, ...

The CPU need is about 4 THz for the data sample of two-year running.
32
Analysis Computing
RAC (Regional Analysis Center)
  • dØmino and its backend (CAB) at the Fermilab Computing Center
    • provided and managed by the Fermilab Computing Division
    • dØmino is a cluster of SGI O2000 CPUs; it provides limited CPU power but large disk caches and high-performance I/O
    • CAB is a 160-node dual 1.8 GHz AMD CPU Linux farm on the dØmino backend; it should provide the majority of analysis computing at Fermilab
  • CluEDØ at DØ
    • over 200 Linux desktop PCs from collaborating institutions
    • managed by volunteers from the collaboration
  • CAB and CluEDØ are expected to provide half of the estimated analysis CPU needs; the remainder is to be provided by regional analysis centers
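The slides quote aggregate CPU needs in "THz" (summed clock rate); applying the same bookkeeping to CAB's numbers gives a rough sense of where it sits against the 4 THz estimate:

    # Aggregate clock rate of CAB in the slides' "THz" bookkeeping.
    nodes, cpus_per_node, clock_ghz = 160, 2, 1.8
    cab_thz = nodes * cpus_per_node * clock_ghz / 1000
    print(f"CAB ~ {cab_thz:.2f} THz of the ~4 THz analysis estimate")
    # ~0.58 THz; CluEDO desktops and the regional analysis centers supply the rest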

33
Overview of SAM (Sequential Access to data via
Meta-Data)
[Diagram: SAM architecture. Shared globally: database server(s) holding the central database, a name server, global resource manager(s), and a log server. Shared locally: station servers (Station 1 ... Station n) and mass storage system(s). Arrows indicate control and data flow.]
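To make the division of labor concrete, the sketch below is a hypothetical, heavily simplified stand-in (not the real SAM interface) for the kind of question the shared catalog answers: which files make up a dataset, and which station caches already hold them.

    # Hypothetical sketch of a SAM-like lookup; names and structures are invented.
    CATALOG = {
        "run2-thumbnails-2003": ["tmb_0001.root", "tmb_0002.root"],
    }
    STATION_CACHE = {
        "fnal-cab":    {"tmb_0001.root", "tmb_0002.root"},
        "umich-npaci": {"tmb_0001.root"},
    }

    def files_in_dataset(dataset):
        """Resolve a dataset name to its file list via the central database."""
        return CATALOG[dataset]

    def stations_with_file(filename):
        """Which station caches already hold this file (else fetch from mass storage)."""
        return [s for s, cache in STATION_CACHE.items() if filename in cache]

    for f in files_in_dataset("run2-thumbnails-2003"):
        print(f, "->", stations_with_file(f) or ["mass storage"])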
34
Components of a SAM Station
[Diagram: components of a SAM station: producers/consumers and project managers on the worker nodes, a station cache manager, a file storage server with file storage clients, file stager(s), cache disk and temp disk, and connections to the MSS or other stations. Arrows show control and data flow.]
35
SAM Deployment
  • The success of the SAM data handling system is the first step towards utilizing offsite computing resources
  • SAM stations deployed at collaborating institutions provide easy data storage and access

(Only the most active sites are shown.)
36
(No Transcript)
37
SAM-GRID
SAM-GRID is a Particle Physics Data Grid project. It integrates Job and Information Management (JIM) with the SAM data management system. A first version of SAM-GRID was successfully demonstrated at Supercomputing 2002 in Baltimore. SAM-GRID could be an important job management tool for our offsite analysis efforts.
38
Objectives of SAMGrid
  • Bring standard grid technologies (Globus and
    Condor) to the Run II experiments.
  • Enable globally distributed computing for D0 and
    CDF.
  • JIM complements SAM by adding job management and
    monitoring to data handling.
  • Together, JIM + SAM = SAMGrid

39
Principal Functionality
  • Enable all authorized users to use computing resources off the Fermilab site
  • Provide a standard interface for all job submission and monitoring
  • Jobs can be: 1. analysis, 2. reconstruction, 3. Monte Carlo, 4. generic (vanilla)
  • JIM v1 features:
    • Submission to the SAM station of the user's choice
    • Automatic selection of a SAM station based on the amount of input data cached at each station
    • Web-based monitoring
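A minimal sketch of the automatic station-selection rule described above, i.e. prefer the station that already caches the largest share of the job's input; the names and data structures are hypothetical, not JIM's actual code:

    # Pick the station that already caches the largest fraction of the input files.
    def pick_station(input_files, station_caches):
        """station_caches: {station_name: set of cached file names}"""
        def cached_fraction(station):
            cache = station_caches[station]
            return sum(1 for f in input_files if f in cache) / len(input_files)
        return max(station_caches, key=cached_fraction)

    caches = {"fnal-cab": {"a.root", "b.root"}, "umich-npaci": {"a.root"}}
    print(pick_station(["a.root", "b.root", "c.root"], caches))   # -> fnal-cab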

40
Job Management
[Diagram: job management architecture. A user interface and submission client at each submission site send jobs to a matchmaking service (broker) with a queuing system; an information collector gathers state from grid sensors at each execution site (Execution Site 1 ... n), where computing elements, storage elements, and the data handling system run the job.]
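Since SAMGrid builds on Condor (slide 38), the broker step is Condor-style matchmaking between a job's requirements and the site state reported by the grid sensors. The sketch below is a simplified, hypothetical stand-in for that step, not the actual JIM broker:

    # Hypothetical matchmaking: filter sites on the job's requirements,
    # then rank the survivors by advertised free batch slots.
    def match(job, site_ads):
        candidates = [ad for ad in site_ads
                      if ad["arch"] == job["arch"] and ad["free_slots"] > 0]
        if not candidates:
            return None
        return max(candidates, key=lambda ad: ad["free_slots"])["name"]

    sites = [
        {"name": "fnal-cab",    "arch": "i386", "free_slots": 12},
        {"name": "umich-npaci", "arch": "i386", "free_slots": 40},
    ]
    print(match({"arch": "i386"}, sites))   # -> umich-npaci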
41
Site Requirements
  • Linux i386 hardware architecture machines
  • SAM station with a working "sam submit"
  • For MC clusters, mc_runjob installed
  • Submission and execution sites continuously run SAM and JIM servers, with auto-restart procedures provided
  • Execution sites can configure their Grid users and batch queues to avoid self-inflicted DoS, e.g. all users mapped to one user "d0grid" with limited resources
  • Firewalls must have specific ports open for incoming connections to SAM and JIM; client hosts may include FNAL and all submission sites
  • Execution sites trust grid credentials used by D0, including DOE Science Grid, FNAL KCA, and others by agreement. To use the FNAL KCA, a Kerberos client must be installed at the submission site
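The "map all grid users to one limited local account" suggestion is typically done with a Globus grid-mapfile, which pairs a quoted certificate distinguished name with a local account. A hypothetical sketch (the DNs are invented for illustration):

    # Hypothetical grid-mapfile content: quoted certificate DN, then local account.
    import shlex
    grid_mapfile = (
        '"/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 123456" d0grid\n'
        '"/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=John Analyst" d0grid\n'
    )
    # Conceptually what the gatekeeper does: map each DN to a local user.
    dn_to_user = dict(shlex.split(line) for line in grid_mapfile.splitlines())
    print(dn_to_user["/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 123456"])   # d0grid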

42
JIM V1 Deployment
  • A site can join SAM-Grid with any combination of services:
    • Monitoring
    • Execution
    • Submission
  • April 1, 2003: expect 5 initial sites for SAMGrid deployment, and 20 submission sites
  • May 1, 2003: a few additional sites, depending on the success and use of the initial deployment
  • Summer 2003: continue to add execution and submission sites; hope to grow to dozens of execution sites and hundreds of submission sites
  • CAB is a powerful resource at FNAL, but...
    • Globus software is not well supported on IRIX (the CAB station server runs on d0mino)
    • The FNAL computer security team restricts Grid jobs to in-situ executables, or to KCA certificates for user-supplied executables

43
SAMGrid Dependencies
44
Expectations from D0
  • By March 15: need 2 volunteers to help set up beta sites and conduct submission tests
  • We expect runjob to be interfaced with the JIM v1 release to run MC jobs
  • In the early stages, it may require ½ FTE at each site to deploy, help troubleshoot, and fix problems
  • Initial deployment expectations:
    • GridKa: analysis site
    • Imperial College and Lancaster: MC sites
    • U. Michigan (NPACI): reconstruction center
  • Second round of deployments: Lyon (ccin2p3), Manchester, MSU, Princeton, UTA
  • Others include NIKHEF and Prague; need to understand EDG/LCG implications

45
How Do We Spell Success?
  • Figures of merit:
    • Number of jobs successfully started in remote execution batch queues
      • If a job crashes, that is beyond our control
      • There may be issues related to data delivery that could be included as a special failure mode
    • How effectively CPU is used at remote sites
      • May require changing the scheduling algorithm for job submission and/or tuning queue configuration at sites
      • Requires cooperation of participating sites
    • Ease of use, and how much work gets done on the Grid
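Both figures of merit reduce to simple ratios once per-job records exist; a hypothetical sketch of how they might be computed from accounting data (the field names are invented):

    # Hypothetical figures-of-merit calculation from per-job records.
    jobs = [
        {"started": True,  "cpu_hours": 11.0, "wall_hours": 12.0},
        {"started": True,  "cpu_hours":  2.5, "wall_hours": 10.0},
        {"started": False, "cpu_hours":  0.0, "wall_hours":  0.0},
    ]
    start_rate = sum(j["started"] for j in jobs) / len(jobs)
    started = [j for j in jobs if j["started"]]
    cpu_eff = sum(j["cpu_hours"] for j in started) / sum(j["wall_hours"] for j in started)
    print(f"jobs started in remote queues: {start_rate:.0%}")
    print(f"CPU/wall-clock efficiency at remote sites: {cpu_eff:.0%}")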