1. The GIOD Project (Globally Interconnected Object Databases) For High Energy Physics
Harvey Newman, Julian Bunn, Koen Holtman and Richard Wilkinson
A Joint Project between Caltech (HEP and CACR), CERN and Hewlett Packard
http://pcbunn.cacr.caltech.edu/
2. The GIOD Project - Overview
- GIOD Project began in 1997, a joint effort of Caltech and CERN with funding from Hewlett Packard for two years, with collaboration from FNAL and SDSC
- Leveraging existing facilities at Caltech's Center for Advanced Computing Research (CACR): Exemplar SPP2000, HPSS system, high-speed WAN, CACR expertise
- Build a prototype LHC data processing and analysis center using:
  - Object Oriented software, tools and an ODBMS
  - Large scale data storage equipment and software
  - High bandwidth LAN (campus) and WAN (regional, national, transoceanic) connections
- Measure, evaluate and tune the components of the center for LHC data analysis and physics
- Confirm the viability of the LHC Computing Models
3. Components of the GIOD Infrastructure
- Supercomputer facilities at CACR
- Large pool of fully simulated multi-jet events in CMS
- Experienced large-scale systems engineers at CACR
- Connections at T3 -> OC3 in the Local and Wide Area Networks; fiber-optic links between Caltech HEP and CACR
- Strong collaborative ties with CMS, RD45, Fermilab, the San Diego Supercomputer Center, CERN, and the CALREN-2 and Internet2 network teams
4. Generation of CMS multi-jet events
(Figure: simple Tag class; an illustrative sketch follows this slide)
- Made possible by 1998 and 1999 (NSF-sponsored) NPACI Exemplar allocations
- Produced 1,000,000 fully-simulated multi-jet QCD events since May 1998, selected from 2 × 10^9 pre-selected generated events
- Allows direct study of Higgs → γγ backgrounds for the first time
- Computing power of the HP Exemplar SPP2000 (0.2 TIPS) made this attainable
- Events used to populate a GIOD Object Database system
- Tag database implemented and kept separately; it can be quickly replicated to client machines
- In 2000, a proposal to NPACI requesting 25% of the Exemplar was granted, targeted at event simulation for ORCA (CMS)
- Replicas of this database were installed at FNAL and at Padua/INFN (Italy)
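
The "simple Tag class" figure did not survive extraction. As a stand-in, below is a minimal sketch of what such an event-tag class might look like, written in plain Java with invented field names (run/event numbers, jet multiplicity, ET sums); the real GIOD class used the Objectivity/DB persistence bindings, and its actual schema is not reproduced here.

    // Illustrative event-tag sketch: a small, fixed-size summary object that
    // can be replicated to client machines and scanned quickly, without
    // touching the full event store. All field names are hypothetical.
    public class EventTag implements java.io.Serializable {
        public int    runNumber;     // run this event belongs to
        public int    eventNumber;   // event index within the run
        public int    nJets;         // number of reconstructed jets
        public double sumEt;         // scalar sum of transverse energy (GeV)
        public double missingEt;     // missing transverse energy (GeV)
        public long   eventRef;      // reference back to the full event object

        // A typical tag-level cut: cheap to evaluate on the replicated tag
        // database, so only surviving events need the full event data.
        public boolean passesDijetCut(double etThreshold) {
            return nJets >= 2 && sumEt > etThreshold;
        }
    }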
5. Scalability tests using the Exemplar
- Caltech Exemplar used as a relatively convenient testbed for multiple-client tests with Objectivity
- Two main thrusts:
  - Using simple fixed object data
  - Using simulated LHC events
- Results gave support to the viability of the ODBMS system for LHC data
- CMS 100 MB/sec milestone met (170 MB/sec achieved):
  - > 170 MB/sec writing LHC raw event data to the database
  - Up to 240 clients reading simple objects from the database (a minimal concurrent-reader sketch follows this slide)
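
As a rough illustration of the many-client read tests, the sketch below (plain Java, not the Objectivity client API) spawns a configurable number of reader threads against an in-memory 8 kB "page" and reports the aggregate rate. It models only the shape of the test; the real measurements ran up to 240 separate Objectivity client processes against the database.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicLong;

    // Minimal stand-in for the multi-client read scaling test: N concurrent
    // readers repeatedly fetch a fixed-size object; aggregate MB/sec printed.
    public class ReadScaling {
        public static void main(String[] args) throws Exception {
            int clients = 240;                  // the slide's maximum client count
            byte[] page = new byte[8 * 1024];   // 8 kB, one Objy database page
            AtomicLong bytes = new AtomicLong();
            ExecutorService pool = Executors.newFixedThreadPool(clients);
            long start = System.nanoTime();
            for (int c = 0; c < clients; c++) {
                pool.submit(() -> {
                    byte[] buf = new byte[page.length];
                    for (int i = 0; i < 10_000; i++) {
                        System.arraycopy(page, 0, buf, 0, buf.length); // "read"
                        bytes.addAndGet(buf.length);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("aggregate: %.1f MB/sec%n", bytes.get() / secs / 1e6);
        }
    }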
6. Java 3D Applet to view GIOD events
- Attaches to the GIOD database; allows scanning of all events in the database at multiple detail levels
- Demonstrated at the Internet2 meetings in 1998 and 1999, and at SuperComputing '98 in Florida at the iGrid, NPACI and CACR stands
(Screenshot labels: ECAL crystals, HCAL towers, tracker geometry and hitmap, reconstructed tracks, reconstructed jets, Java2 GUI, run/event selection widget)
7. Other ODBMS tests
- Tests with Versant (the fallback ODBMS)
- DRO (Objectivity's Data Replication Option) WAN tests with CERN
- Production on CERN's PCSF and file movement to Caltech
- Objectivity/DB: creation of a 32,000-database federation
8. Tests with the Objy/Java binding and JAS
- Objy DIM and analysis using Java Analysis Studio
- Java2D Tracker viewer
- Java Track Fitter (a minimal fit sketch follows this slide)
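
The Java Track Fitter itself is not reproduced in the extracted slides. As a stand-in, here is the simplest possible version of what such a fitter does: a least-squares straight-line fit through tracker hits. A real fitter handles curvature in the magnetic field and measurement errors; the hit coordinates below are invented.

    // Minimal track-fit sketch: least-squares straight line y = a + b*x
    // through a handful of tracker hits. Values are illustrative only.
    public class LineFit {
        public static void main(String[] args) {
            double[] x = {10, 20, 30, 40, 50};       // layer radii (cm)
            double[] y = {1.1, 2.0, 3.1, 3.9, 5.2};  // measured hit positions (cm)
            int n = x.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i]; sy += y[i];
                sxx += x[i] * x[i]; sxy += x[i] * y[i];
            }
            double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
            double a = (sy - b * sx) / n;                          // intercept
            System.out.printf("fitted track: y = %.3f + %.3f x%n", a, b);
        }
    }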
9. WAN tests: Caltech ↔ SDSC, FNAL
- Client tests between SDSC/CACR, CACR/FNAL and CACR/HEP: ftp, LHC event reconstruction, event analysis, event scanning
- Investigated network throughput dependence on:
  - TCP window size, MSS, round trip time (RTT), etc.
  - payload type (ftp, Objy, Web, telnet, etc.)
(Plot annotations: simple ftp traffic, flattened by staggering client startups; Objectivity schema transfer in 8 kB DB pages)
10. WAN tests: Caltech ↔ SDSC, FNAL
- Using out-of-the-box single-stream ftp, achieved:
  - 7 MB/sec over LAN ATM @ OC3
  - 3 MB/sec over WAN @ OC3
- Expect to ramp up capability by use of:
  - Tuned ftp (buffer, packet and window sizes)
  - Jumbo frames
  - New IP implementations or other protocols
- Predict 1 GB/sec in the WAN by LHC startup in 2005, using parallel streams (a bandwidth-delay worked example follows this slide)
- Measurements to be used as a basis for model parameters in further MONARC simulations
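
The window-size dependence seen in these tests follows from the bandwidth-delay product: a single TCP stream can carry at most one window per round trip. The short worked example below uses an OC3 line rate (155 Mbit/s) and an assumed 60 ms WAN round-trip time; both figures are illustrative, not the measured GIOD values.

    // Bandwidth-delay product: the TCP window needed to fill a path, and the
    // throughput ceiling a fixed window imposes. Figures are illustrative.
    public class Bdp {
        public static void main(String[] args) {
            double lineBitsPerSec = 155e6;  // OC3 line rate
            double rttSec = 0.060;          // assumed WAN round-trip time

            // Window required to keep the OC3 pipe full at this RTT:
            double windowBytes = lineBitsPerSec / 8 * rttSec;
            System.out.printf("window to fill OC3: %.2f MB%n", windowBytes / 1e6);
            // ~1.16 MB, far above the 64 kB default windows common at the time.

            // Conversely, a 64 kB window caps a single stream at:
            double capBytesPerSec = 64 * 1024 / rttSec;
            System.out.printf("64 kB window cap: %.2f MB/sec%n", capBytesPerSec / 1e6);
            // ~1.1 MB/sec: hence the tuned windows and parallel streams.
        }
    }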
11. Using the Globus Tools
- Tests with gsiftp, a modified ftp server/client that allows control of the TCP buffer size
- Transfers of Objy database files from the Exemplar to:
  - Itself
  - An O2K at Argonne (via CalREN2 and Abilene)
  - A Linux machine at INFN (via the US-CERN transatlantic link)
- Target /dev/null in multiple streams (1 to 16 parallel gsiftp sessions)
- Aggregate throughput measured as a function of the number of streams and the send/receive buffer sizes (a parallel-stream sketch follows this slide):
  - 25 MB/sec on HiPPI loop-back
  - 4 MB/sec to Argonne by tuning the TCP window size
  - Saturating the available bandwidth to Argonne
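
gsiftp's own command set is not reproduced here; the sketch below only illustrates the idea being tested: open several TCP connections, enlarge their socket buffers, and split one file across them so the aggregate rate exceeds a single stream's window-limited ceiling. The host name and port are placeholders, and error handling is minimal.

    import java.io.*;
    import java.net.Socket;

    // Parallel-stream transfer sketch: N sockets, each with enlarged buffers,
    // each pushing one slice of the file. Host/port are placeholders.
    public class ParallelSend {
        public static void main(String[] args) throws Exception {
            String host = "sink.example.org";  // hypothetical receiver
            int streams = 8;                   // 1..16 were tested on the slide
            int bufBytes = 1 << 20;            // 1 MB socket buffer (the knob)
            File file = new File(args[0]);
            long slice = file.length() / streams;
            Thread[] workers = new Thread[streams];
            for (int s = 0; s < streams; s++) {
                final long off = s * slice;
                final long len = (s == streams - 1) ? file.length() - off : slice;
                workers[s] = new Thread(() -> {
                    try (Socket sock = new Socket(host, 5000);
                         RandomAccessFile in = new RandomAccessFile(file, "r")) {
                        sock.setSendBufferSize(bufBytes);  // tuned TCP buffer
                        in.seek(off);
                        OutputStream out = sock.getOutputStream();
                        byte[] buf = new byte[64 * 1024];
                        long left = len;
                        while (left > 0) {
                            int n = in.read(buf, 0, (int) Math.min(buf.length, left));
                            if (n < 0) break;
                            out.write(buf, 0, n);
                            left -= n;
                        }
                    } catch (IOException e) { e.printStackTrace(); }
                });
                workers[s].start();
            }
            for (Thread t : workers) t.join();
        }
    }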
12. GIOD - Summary
- GIOD investigated:
  - Usability, scalability and portability of Object Oriented LHC codes
  - In a hierarchy of large servers and medium/small client machines
  - With fast LAN and WAN connections
  - Using realistic raw and reconstructed LHC event data
- GIOD has:
  - Constructed a large set of fully simulated events and used these to create a large OO database
  - Learned how to create large database federations
  - Developed prototype reconstruction and analysis codes that work with persistent objects
  - Deployed facilities and database federations as testbeds for Computing Model studies
13. Associated Projects
- MONARC - Models Of Networked Analysis at Regional Centers (CERN)
  - Caltech, CERN, FNAL, Heidelberg, INFN, KEK, Marseilles, Munich, Orsay, Oxford, Tufts
  - Specify candidate models' performance: throughputs, latencies
  - Find feasible models for LHC matched to network capacity and data handling
  - Develop Baseline Models in the feasible category
- PPDG - Particle Physics Data Grid (DoE Next Generation Internet)
  - Argonne Natl. Lab., Caltech, Lawrence Berkeley Lab., Stanford Linear Accelerator Center, Thomas Jefferson National Accelerator Facility, University of Wisconsin, Brookhaven Natl. Lab., Fermi Natl. Lab., San Diego Supercomputer Center
  - Delivery of infrastructure for widely distributed analysis of particle physics data at multi-PetaByte scales by 100s to 1000s of physicists
  - Acceleration of development of network and middleware infrastructure aimed at data-intensive collaborative science
- ALDAP - Accessing Large Data Archives in Astronomy and Particle Physics (NSF Knowledge Discovery Initiative)
  - Caltech, Johns Hopkins University, FNAL
  - Explore data structures and physical data storage hierarchies for archival of next generation astronomy and particle physics data
  - Develop spatial indexes, novel data organisations, distribution and delivery strategies
  - Create prototype data query execution systems using autonomous agent workers
14. Future Directions: GIOD II
- Review the advantages of ODBMS vs. (O)RDBMS for persistent LHC data, in light of recent (e.g. Web-enabled) RDBMS developments, for HEP and other scientific fields:
  - Fast traversal of complex class hierarchies?
  - Global (federation) schema and transparent access?
  - Impedance match between the database and the OO code?
- What are the scalability and use issues associated with implementing a traditional RDBMS as a persistent object store for LHC data?
- What benefits would the use of an RDBMS bring, if any?
- Which RDBMS systems, if any, are capable of supporting, or projected to support, the size, distribution and access patterns of the LHC data?
15. GIOD II: Other New Investigations
- What are the implications/benefits for the globally-distributed LHC computing systems of:
  - Web-like object caching and delivery mechanisms (distributed content delivery, distributed cache management)
  - The use of Autonomous Agent query systems
  - Organizing the data and resources in an N-tiered hierarchy
  - Choosing (de facto) standard Grid tools as middleware
- How can data migration flexibility be built in?
  - Schema/data to XML conversion (Wisdom, Goblin)?
  - Data interchange using JDBC or ODBC (a JDBC sketch follows this slide)
  - Known-format binary files for bulk data interchange, for simple and efficient transport across WANs
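
A minimal sketch of the JDBC interchange path: read rows from a relational copy of the tag data and emit them as XML. The connection URL, table and column names are invented for illustration; any JDBC-compliant RDBMS would serve.

    import java.sql.*;

    // JDBC-to-XML interchange sketch. URL, table and columns are hypothetical.
    public class TagToXml {
        public static void main(String[] args) throws Exception {
            try (Connection con =
                     DriverManager.getConnection("jdbc:odbc:tagdb"); // placeholder
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT run, event, njets, sumet FROM tags")) {
                System.out.println("<tags>");
                while (rs.next()) {
                    System.out.printf(
                        "  <tag run='%d' event='%d' njets='%d' sumet='%.2f'/>%n",
                        rs.getInt("run"), rs.getInt("event"),
                        rs.getInt("njets"), rs.getDouble("sumet"));
                }
                System.out.println("</tags>");
            }
        }
    }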
16. GIOD II and ALDAP
- Optimizing performance of Objectivity for LHC/SDSS data:
  - Use of Self Organizing Maps (e.g. Kohonen) to recluster frequently accessed data into collections in contiguous storage (a minimal SOM sketch follows this slide)
  - Use of Autonomous Agents to carry queries and data in a WAN-distributed database system
- Identify known performance issues and get them fixed by the vendor:
  - Example 1: 11,000 cycles vs. 300 cycles of overhead to open an object
  - Example 2: selection speeds with simple cuts on Tag objects
- Make new performance comparisons between Objectivity and an ER database (SQLServer):
  - on identical platforms,
  - with identical data,
  - with identical queries,
  - with all recommended tweaks,
  - with all recommended coding tricks
- We have begun tests with SDSS sky objects, and with GIOD Tag objects
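
To make the reclustering idea concrete, below is a minimal one-dimensional Kohonen map: objects described by access-pattern vectors are trained onto a line of nodes, and objects landing on the same or neighbouring nodes become candidates for placement in contiguous storage. The map size, learning schedule and random "access patterns" are all invented for illustration.

    import java.util.Random;

    // Minimal 1-D Kohonen self-organizing map. Objects with similar access
    // patterns end up on nearby nodes, suggesting a contiguous clustering.
    public class Som {
        static double[][] nodes;  // one weight vector per map node

        public static void main(String[] args) {
            int nNodes = 16, dim = 4, epochs = 200;
            Random rnd = new Random(42);
            nodes = new double[nNodes][dim];
            for (double[] node : nodes)
                for (int d = 0; d < dim; d++) node[d] = rnd.nextDouble();

            double[][] data = new double[500][dim];  // fake access-pattern vectors
            for (double[] v : data)
                for (int d = 0; d < dim; d++) v[d] = rnd.nextDouble();

            for (int e = 0; e < epochs; e++) {
                double lr = 0.5 * (1.0 - (double) e / epochs);  // decaying rate
                int radius = Math.max(1, (nNodes / 2) * (epochs - e) / epochs);
                for (double[] v : data) {
                    int best = bestNode(v);  // winning node for this object
                    for (int n = Math.max(0, best - radius);
                             n <= Math.min(nNodes - 1, best + radius); n++)
                        for (int d = 0; d < dim; d++)
                            nodes[n][d] += lr * (v[d] - nodes[n][d]);
                }
            }
            // Objects mapping to the same node are co-clustering candidates.
            System.out.println("object 0 -> node " + bestNode(data[0]));
        }

        static int bestNode(double[] v) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int n = 0; n < nodes.length; n++) {
                double dist = 0;
                for (int d = 0; d < v.length; d++)
                    dist += (nodes[n][d] - v[d]) * (nodes[n][d] - v[d]);
                if (dist < bestDist) { bestDist = dist; best = n; }
            }
            return best;
        }
    }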
17. GIOD II and PPDG
- Distributed analysis of ORCA data:
  - Using Grid middleware (notably gsiftp and SRB) to move database files across the WAN
  - Custom tools to select the subset of database files required in local replica federations, and to attach them once copied
  - Making compact data collections
- Remote requests from clients for sets of DB files:
  - Simple staging schemes that asynchronously make data available, give an ETA for delivery, and migrate cool files to tertiary storage
- Marshalling of distributed resources to achieve production task goals:
  - Complementary ORCA DB files in the Caltech, FNAL and CERN replicas
  - A full-pass analysis involves distributing the task to all three sites
  - Move/compute cost decision (a cost-model sketch follows this slide)
  - Task and results carried by Autonomous Agents between sites (work shared with ALDAP)
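
A sketch of the move/compute decision: compare the time to ship the DB files to a (possibly faster) remote site against processing in place where the data already resides. All rates and sizes below are placeholders, except the 3 MB/sec WAN figure taken from the earlier ftp measurements.

    // Move-vs-compute cost sketch. Rates and sizes are illustrative.
    public class MoveOrCompute {
        static double moveThenComputeSecs(double dataMB, double wanMBps,
                                          double remoteEvPerSec, long events) {
            return dataMB / wanMBps + events / remoteEvPerSec;
        }
        static double computeInPlaceSecs(double localEvPerSec, long events) {
            return events / localEvPerSec;
        }
        public static void main(String[] args) {
            double dataMB = 50_000;   // 50 GB of ORCA DB files (assumed)
            double wanMBps = 3.0;     // single-stream WAN rate from slide 10
            long events = 1_000_000;
            double local = computeInPlaceSecs(20, events);   // 20 ev/sec locally
            double moved = moveThenComputeSecs(dataMB, wanMBps, 100, events);
            System.out.printf("in place: %.0f s, move+remote: %.0f s -> %s%n",
                local, moved, moved < local ? "move the data" : "compute in place");
        }
    }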