Title: Anaphe - OO Libraries for Data Analysis using C++ and Python; AIDA
1 Anaphe - OO Libraries for Data Analysis using C++ and Python; AIDA - Abstract Interfaces for Data Analysis
2 Anaphe - OO Libraries for Data Analysis using C++ and Python
Andreas Pfeiffer, CERN IT/API, andreas.pfeiffer_at_cern.ch
3 Outline
- Motivation
- Anaphe Components
- C++
- Lizard: Interactive Data Analysis
- Python
- Software quality control
- Summary
5 LHC
[Figure: LHC site; labels: The Alps, interaction points, 100 m deep, 27 km circumference]
6 LHC Computing Challenge
- 4 experiments will create a huge amount of data
- > 1 PetaByte/year for each experiment!
- 10^15 Bytes
- 1,000 TeraBytes
- 20,000 Redwood tapes
- 100,000 dual-sided DVD-RAM disks
- 1,500,000 sets of the Encyclopaedia Britannica (w/o photos)
- Need lots of CPU power to reconstruct/analyse
- about 1000 PC boxes per experiment (2005 ones!)
- 40,000 of today's boxes (dual P-III 800 MHz)
- complex data models
- reconstruction s/w is also used for online filtering
- needs high-quality s/w in order not to waste beam time
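As a quick sanity check of the figures above, the capacity per medium implied by the slide can be worked out directly. This is a back-of-the-envelope sketch that assumes decimal units (1 PB = 10^15 bytes); the function name is illustrative only.

```python
# Back-of-the-envelope check of the storage figures quoted on the slide.
# Assumption: decimal units, i.e. 1 PB = 10**15 bytes and 1 GB = 10**9 bytes.
PETABYTE = 10**15
TERABYTE = 10**12
GIGABYTE = 10**9


def implied_capacity_gb(total_bytes, n_media):
    """Capacity per medium (in GB) if total_bytes is spread evenly over n_media."""
    return total_bytes / n_media / GIGABYTE


terabytes_per_pb = PETABYTE // TERABYTE
tape_gb = implied_capacity_gb(PETABYTE, 20_000)    # per Redwood tape
dvd_gb = implied_capacity_gb(PETABYTE, 100_000)    # per dual-sided DVD-RAM disk

print(terabytes_per_pb, tape_gb, dvd_gb)
```

The slide's tape and disk counts therefore correspond to roughly 50 GB per tape and 10 GB per disk.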
7 Lifetime of LHC software: 25 yrs
8 Technology (R)Evolution
- 10 yrs major cycle length (HW, SW, OS)
- 12 evolutionary changes in the market
- 1 revolutionary change
- towards greater diversity
- don't forget changes of requirements
- Consequences
- s/w written today will most probably be rewritten tomorrow
- we must anticipate changes
9 Anaphe: what it is
- Analysis for physics experiments
- Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments
- memory management
- I/O
- foundation classes
- histogramming
- minimizing/fitting
- visualization
- interactive data analysis
- Trying to use standards wherever possible
- Trying to re-use existing class libraries
10 Anaphe Components
11
- AIDA
- Abstract Interfaces for Data Analysis
- see next talk
13 Layered Approach
- Basic functionalities (histograms, fitting, etc.) are available as individual C++ class libraries.
- Easy to replace one part without throwing away everything
- Objectivity/DB to provide persistence
- HepODBMS library (insulating layer, tags)
- Histogram library (HTL)
- Fitting libraries (Gemini, HepFitting)
- Graphics libraries (Qt, Qplotter)
- Insulate components through Abstract Interfaces
- wrapper layer to implement Interfaces in terms of existing libs
- Apply s/w quality control tools
- code checking, testing
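The idea of insulating components through abstract interfaces, with a wrapper layer on top of an existing library, can be sketched as follows. This is a minimal illustration in Python rather than C++; the names IHisto, LegacyHisto and LegacyHistoWrapper are hypothetical and are not the actual Anaphe or AIDA classes.

```python
from abc import ABC, abstractmethod


class IHisto(ABC):
    """Hypothetical abstract interface: the only thing client code sees."""

    @abstractmethod
    def fill(self, x, weight=1.0):
        ...

    @abstractmethod
    def entries(self):
        ...


class LegacyHisto:
    """Stand-in for an existing library with its own, different API."""

    def __init__(self):
        self.data = []

    def add_point(self, value, w):
        self.data.append((value, w))


class LegacyHistoWrapper(IHisto):
    """Wrapper layer: implements the interface in terms of the existing library."""

    def __init__(self):
        self._impl = LegacyHisto()

    def fill(self, x, weight=1.0):
        self._impl.add_point(x, weight)

    def entries(self):
        return len(self._impl.data)


def fill_all(histo, values):
    """Client code depends only on the abstract interface, not on LegacyHisto."""
    for v in values:
        histo.fill(v)
    return histo.entries()


n = fill_all(LegacyHistoWrapper(), [1.0, 2.5, 2.7])
print(n)
```

Replacing LegacyHisto with another implementation only requires a new wrapper; fill_all stays unchanged.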
14 ANAPHE Components
[Diagram labels: User Interface - using Abstract Types; Python / SWIG; Objectivity/DB; HBook; NAG-C; Minuit; Qt (free edition)]
15 Basic 3D Graphics Libraries
- OpenGL (basic graphics)
- De-facto industry standard for basic 3D graphics
- Used in CAD/CAE, games, VR, medical imaging
- OpenInventor (scene mgmt.)
- OO 3D toolkit for graphics
- Cubes, polygons, text, materials
- Cameras, lights, picking
- 3D viewers/editors, animation
- Based on OpenGL/MesaGL
16 2D Graphics libraries
- Qt
- multi-platform C++ GUI toolkit
- C++ class library, not wrapper around C libs
- superset of Motif and MFC
- available on Unix and MS Windows
- no change for developer
- commercial but with public domain version
- www.troll.no
- Qplotter
- add-on functionality for HEP
- HIGZ/HPLOT
17 Mathematical Libraries
- NAG (Numerical Algorithms Group) C Library
- Covers a broad range of functionality
- Linear algebra
- differential equations
- quadrature, etc.
- Special functions of CERNLIB added to Mark-6 release
- mostly for theory and accelerator
- Quality assurance
- extensive testing done by NAG
- www.nag.com
18 CLHEP - foundation classes
- HEP foundation class library
- Random number generators
- Physics vectors
- 3- and 4- vectors
- Geometry
- Linear algebra
- System of units
- more packages recently added
- will continue to evolve
- wwwinfo.cern.ch/asd/lhc/clhep/
19 Histograms: the HTL package
- Histograms are the basic tool for physics analysis
- Statistical information of density distributions
- Histogram Template Library (HTL)
- design based on C++ templates
- Modular: separation between sampling and display
- Extensible: open for user-defined binning systems
- Flexible: supports transient/persistent at the same time
- Open: large use of abstract interfaces
- recent addition: 3D histograms
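The separation between sampling and display, and the openness to user-defined binning, can be illustrated with a small sketch. It is written in Python for brevity and does not reproduce the HTL template design; all class and function names are hypothetical.

```python
import bisect


class EvenBinning:
    """Equal-width bins on [lo, hi)."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi

    def index(self, x):
        if not (self.lo <= x < self.hi):
            return None  # out of range
        i = int((x - self.lo) * self.nbins / (self.hi - self.lo))
        return min(i, self.nbins - 1)  # guard against rounding at the upper edge


class VariableBinning:
    """User-defined bin edges (must be sorted in increasing order)."""

    def __init__(self, edges):
        self.edges = list(edges)
        self.nbins = len(self.edges) - 1

    def index(self, x):
        if not (self.edges[0] <= x < self.edges[-1]):
            return None
        return bisect.bisect_right(self.edges, x) - 1


class Histo1D:
    """Sampling only: accumulates weights, knows nothing about display."""

    def __init__(self, binning):
        self.binning = binning
        self.counts = [0.0] * binning.nbins

    def fill(self, x, weight=1.0):
        i = self.binning.index(x)
        if i is not None:
            self.counts[i] += weight


def render_text(histo):
    """Display only: one possible viewer, independent of the sampling code."""
    return "\n".join("%2d %s" % (i, "*" * int(c)) for i, c in enumerate(histo.counts))


even = Histo1D(EvenBinning(4, 0.0, 4.0))
var = Histo1D(VariableBinning([0.0, 1.0, 10.0, 100.0]))
for x in (0.5, 1.5, 1.7, 3.9, 4.0):
    even.fill(x)
    var.fill(x)

print(render_text(even))
```

Any object that provides nbins and index(x) can serve as a binning, so new binning systems do not require changes to Histo1D or to the viewer.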
20 Fitting and Minimization
- Fitting and Minimization Library (FML)
- common OO interface
- NAG-C, MINUIT
- based on Abstract Interfaces
- IVector, IModelFunction, ...
- fitting as a special case of minimization
- minimize distance between data and model
- replacement for HepFitting (and Gemini)
- Gemini
- common interface to minimizer engine
- very thin layer
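"Fitting as a special case of minimization" can be made concrete with a short sketch: the fit builds an objective (here a chi-square distance between data and model) and hands it to a generic minimizer. The golden-section search below is only a stand-in for a real engine such as MINUIT or NAG-C, and none of the names reflect the FML API.

```python
import math


def chi2(model, params, xs, ys, sigmas):
    """Distance between data and model: sum of squared, error-weighted residuals."""
    return sum(((y - model(x, params)) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))


def golden_section_minimize(f, lo, hi, tol=1e-8):
    """Generic 1-D minimizer for a function that is unimodal on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2.0


def line_through_origin(x, params):
    """Model function y = slope * x with a single free parameter."""
    return params[0] * x


# Made-up data points for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
sigmas = [0.1] * 4


def objective(slope):
    # The fit is just the minimization of this objective over the parameter.
    return chi2(line_through_origin, [slope], xs, ys, sigmas)


best_slope = golden_section_minimize(objective, 0.0, 5.0)
print(best_slope)
```

The minimizer knows nothing about data or models, which is what allows different engines to be swapped behind a common interface.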
21
- Opening bracket
- Persistency
22 Object persistency: two concepts, serial and page I/O
- Sequential access to objects (streaming)
- good in networking context or serial writes to file(s)
- much like good old Fortran
- often perceived to be simpler to implement (<<, >>)
- Navigational access to objects (buffered)
- I/O on demand for complex data models
- location-transparent (for user) access to object
- typically by de-referencing of a smart pointer
- optimized for (random) disk access (disks deliver pages)
- sequential write to file(s) still ok
- Both concepts need to take care of changes of the internal structure of the objects (schema evolution)
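The two access styles can be contrasted in a few lines. The sketch below uses Python's pickle as a stand-in for a streaming layer and a toy Ref class as a stand-in for a smart pointer; the in-memory dictionary replaces a real paged object store, and all names are hypothetical.

```python
import io
import pickle

# Sequential (streaming) access: objects are written and read back in order.
stream = io.BytesIO()
for track in ({"id": 1, "pt": 12.5}, {"id": 2, "pt": 3.1}):
    pickle.dump(track, stream)
stream.seek(0)
first = pickle.load(stream)
second = pickle.load(stream)


# Navigational access: a reference that fetches its target on first use.
class Ref:
    """Toy 'smart pointer': holds an object id and loads the object on demand."""

    def __init__(self, store, oid):
        self._store = store
        self._oid = oid
        self._obj = None
        self.loads = 0  # counts how often the store was actually read

    def get(self):
        if self._obj is None:
            self._obj = pickle.loads(self._store[self._oid])
            self.loads += 1
        return self._obj


# Stand-in for a paged object store: object id -> serialized bytes.
store = {"trk/1": pickle.dumps({"id": 1, "pt": 12.5})}
ref = Ref(store, "trk/1")
pt = ref.get()["pt"]        # triggers the I/O
pt_again = ref.get()["pt"]  # served from memory, no second read

print(first, second, pt, ref.loads)
```

The user of Ref never says where the object lives; the I/O happens only when the reference is followed.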
23 Architectural Issue: Persistency (Object-I/O)
- Brings a completely new quality into the design
- Objects now have a lifetime
- don't delete until you really are sure you want to
- persistency is a kind of intended memory leak
- would like to see no difference between memory and disk
- Layout of objects may change during (extended) life
- schema evolution
- additions/deletions of attributes
- changes of inheritance relations
24 Architectural Issue: Persistency (Object-I/O) (II)
- Objects can be placed (clustering)
- de-coupling of logical and physical view of data
- Special care needed to ensure consistency in data set
- avoid reading group of objects (tracks, events, ...) for which writing/updating is not (yet) complete
- clean up if only part of the objects are written
- typically taken care of by using transactions
- Complications possible in distributed computing
- need to protect disk access now like memory access in past (segmentation violation)
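How a transaction keeps a half-written group of objects from ever becoming visible can be shown with a small sketch. SQLite (from the Python standard library) is used here purely as a stand-in for the object database; the table layout and the write_event helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (event_id INTEGER, pt REAL)")
conn.commit()


def write_event(conn, event_id, pts, fail_after=None):
    """Write all tracks of one event inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        for i, pt in enumerate(pts):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated crash while writing")
            conn.execute("INSERT INTO tracks VALUES (?, ?)", (event_id, pt))


# Event 1 is written completely.
write_event(conn, 1, [10.0, 20.0, 30.0])

# Event 2 fails after two of its three tracks; the partial write is rolled back.
try:
    write_event(conn, 2, [5.0, 6.0, 7.0], fail_after=2)
except RuntimeError:
    pass

rows = conn.execute(
    "SELECT event_id, COUNT(*) FROM tracks GROUP BY event_id"
).fetchall()
print(rows)
```

Readers see either the whole event or nothing, which covers both the "incomplete write" and the "clean up" cases listed above.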
25 Physical Model and Logical Model
- Physical model may be changed to optimise performance
- Existing applications continue to work transparently!
26 Object Model
Thanks to Vincenzo Innocente (CMS)
27 Physical clustering
Thanks to Vincenzo Innocente (CMS)
28
- Closing bracket
- Persistency
29 Tags, Ntuples and Events
- Tags - a special kind of Ntuple
- Always associated with an underlying persistent store
- Tags may be used to store ntuple-like data
- extracted from all over the event
- minPt, maxEmiss, nJets, nMuon, trigger, ...
- Main use: speed up data selection for analysis
- Tag simplifies selection without losing complexity
- Events are more complex than a tree structure (CWN)
- lots of cross-references between classes, containers
- Association from the Tag to the Event may be used to navigate to any other part of the Event
- even from an interactive visualization program
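The role of a tag can be sketched as a small summary record that carries an association back to the full event. In the toy example below, plain dictionaries stand in for the persistent event store; the attribute names nJets and minPt are taken from the slide, everything else is hypothetical.

```python
# Full events: a stand-in for the complex persistent event store.
events = {
    101: {"tracks": [{"pt": 45.0}, {"pt": 12.0}], "jets": ["j1", "j2", "j3"]},
    102: {"tracks": [{"pt": 8.0}], "jets": ["j1"]},
    103: {"tracks": [{"pt": 60.0}, {"pt": 30.0}], "jets": ["j1", "j2"]},
}

# Tags: a few summary attributes per event plus an association to the event.
tags = [
    {
        "event": event_id,
        "nJets": len(event["jets"]),
        "minPt": min(track["pt"] for track in event["tracks"]),
    }
    for event_id, event in events.items()
]


def select(tags, cut):
    """Run the selection on the small tag records only."""
    return [tag["event"] for tag in tags if cut(tag)]


selected = select(tags, lambda tag: tag["nJets"] >= 2 and tag["minPt"] > 10.0)

# Follow the association from the tag to the full event only for selected events.
n_tracks = [len(events[event_id]["tracks"]) for event_id in selected]

print(selected, n_tracks)
```

The cut never touches the full events, yet every selected tag still gives access to the complete event structure.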
30 Anaphe components
31 Anaphe Internals: (Abstract) Interfaces
32 AIDA compliance of Anaphe
- Presently (Anaphe 3.x) only AIDA 1.0 compliant
- Plan to implement AIDA 2.2 Interfaces by end 2001 (Anaphe 4.x)
- initially as wrappers to existing interfaces/packages
- Will maintain 3.x for some time
- ensures stability for users
- Development will concentrate on 4.x
- while AIDA will evolve further
- Similar time schedule as JAS (Tony Johnson)
- OpenScientist (Guy Barrand) already there
33
- Lizard: a tool for Interactive Data Analysis
34 Interactive Data Analysis
- Aim: OO replacement for PAW (at least)
- analysis of ntuple-like data (Tags, Ntuples, ...)
- visualisation of data (Histograms, scatter-plots, Vectors)
- fitting of histograms (and other data)
- access to experiment-specific data/code
- Maximize flexibility and re-use
- Foresee customization/integration
- allow use from within experiments' s/w
- Plan for extensions
- code for now, design for the future
- Ensure maintainability
- use of s/w quality control tools
35 Scripting - why
- Typical use of scripting is quite different from programming (reconstruction, analysis, ...)
- history: go back to where I was before
- repetition/looping - with modifiable parameters
- avoid "one size fits all" or using a power-tool as a hammer
- rapid prototyping in scripting language
- quick turn-around times
- performance-critical code in core language
- exploit richer set of features/functionality (e.g. templates in C++)
- scripting languages usually less susceptible to changes than mainstream languages
- potentially longer lifetimes
36 Python - why
- Python - OO (scripting) language
- no strange !-variables
- sensitive to indentation
- Easier for users
- as Java
- Lots of user-supplied modules available and ready for use
- scientific, numerics, graphics, GUI, network, OS, games, DBs, ...
- example: http://www.vex.net/parnassus/
- Parnassus totals: 1173 items in 49 categories.
- Also usable in Java (Jython)
- used in JAS for scripting
- minimize changes needed within AIDA-compliant environments
37 Python - how
- SWIG to (semi-)automatically create connection to chosen scripting language
- allows flexibility to choose amongst several scripting languages
- Python, Perl, Tcl, Guile, Ruby, (Java)
- Very easy to use
- swig -c++ -python -shadow -c myClass.h
- create shared lib from myClass.cpp and myClass_wrap.c
- start python and import myClass to use it
- Very easy to extend
- simply inherit from swiggified class in python
- modifications can later be fed back into C++
- performance, type safety, special language features (templates), ...
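The "inherit from a swiggified class" step can be sketched without running SWIG. In the example below, MyClass is a pure-Python stand-in for the shadow class that SWIG would generate from myClass.h; its constructor and compute method are assumptions made for illustration only.

```python
class MyClass:
    """Pure-Python stand-in for a SWIG-generated shadow class (hypothetical API)."""

    def __init__(self, scale):
        self.scale = scale

    def compute(self, x):
        # In a real setup this call would be forwarded to the compiled C++ code.
        return self.scale * x


class MyPrototype(MyClass):
    """Extension written in the scripting language during prototyping."""

    def compute(self, x):
        # Reuse the wrapped behaviour, then add an experimental offset.
        return super().compute(x) + 1.0


result = MyPrototype(2.0).compute(3.0)
print(result)
```

Once the modified behaviour has proven useful, it can be moved into the C++ class and the Python subclass dropped.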
38 PAW -> Lizard translation
- Ntuple projection: Lizard
- lizard --useHBook
- :-) nt = ntm.findNtuple("higgscand.hbkcands")
- :-) nplot1D(nt, "mass", "quality==5 && cut > 198")
- Ntuple projection: PAW
- pawX11
- paw> h/file 1 higgscand.hbk
- paw> nt/pl 10.mass quality=5.and.cut>198
- Assuming file higgscand.hbk contains ntuple with number 10 and title cands
- Cut string in Lizard: any valid C++ expression
39 Tutorials and Examples available
40 Users and Collaborations
- AIDA spoken here!
- IGUANA (CMS visualization)
- GAUDI (LHCb/HARP) framework
- ATHENA (Atlas) framework
- Analyzer modules in Geant 4
- JAS
- Open Scientist
- you?
42 Software quality control
- Using tools for testing/checking has started
- Insure++, CodeWizard
- Package dependencies: Ignominy
- Set of perl and shell scripts by Lassi Tuura (CMS)
- Ignominy scans
- Make dependency data produced by the compilers (.d files)
- Source code for includes (resolved against the ones actually seen)
- Shared library dependencies (ldd output)
- Defined and required symbols (nm output)
- And maps
- Source code and binaries into packages
- include dependencies into package dependencies
- Unresolved/defined symbols into package dependencies
ignominy: dishonour, disgrace, shame; infamy; the condition of being in disgrace, etc. (Oxford English Dictionary)
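The mapping step described above (file-level include dependencies folded into package-level dependencies) can be sketched in a few lines. This is not Ignominy's implementation, which is a set of perl and shell scripts; the file names, package names and the package_dependencies helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical input: which file belongs to which package ...
file_to_package = {
    "histo/H1D.cpp": "histo",
    "histo/H1D.h": "histo",
    "fit/Fitter.cpp": "fit",
    "fit/Fitter.h": "fit",
    "plot/Plotter.cpp": "plot",
}

# ... and the resolved include dependencies between files.
includes = {
    "histo/H1D.cpp": ["histo/H1D.h"],
    "fit/Fitter.cpp": ["fit/Fitter.h", "histo/H1D.h"],
    "plot/Plotter.cpp": ["histo/H1D.h", "fit/Fitter.h"],
}


def package_dependencies(includes, file_to_package):
    """Fold file-level include edges into package-level dependency edges."""
    deps = defaultdict(set)
    for source, headers in includes.items():
        source_pkg = file_to_package[source]
        for header in headers:
            target_pkg = file_to_package[header]
            if target_pkg != source_pkg:  # ignore dependencies inside one package
                deps[source_pkg].add(target_pkg)
    return {pkg: sorted(targets) for pkg, targets in deps.items()}


pkg_deps = package_dependencies(includes, file_to_package)
print(pkg_deps)
```

The same folding applies to symbol-level data (nm output) once each symbol has been attributed to the package that defines it.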
43 Ignominy Analysis of Anaphe
- Distribution of tools and utilities for LHC-era physics
- Combination of commercial, free and HEP software
- Claims to be a toolkit
- Seems to live up to its toolkit claims
- Good work on modularity
- Clean design is evident in many places
- Dependency diagrams often split naturally into functional units
Thanks to Lassi Tuura (CMS)
44 Package Metrics
- Size: total amount of source code (not normalised across projects!)
- ACD: average component dependency (libraries linked in)
- CCD: sum of single-package component dependencies over whole release
- Indicates testing/integration cost
- NCCD: measure of CCD compared to a balanced binary tree
- A good toolkit's NCCD will be close to 1.0
- < 1.0: structure is flatter than a binary tree (independent packages)
- > 1.0: structure is more strongly coupled (vertical or cyclic)
- Aim: NCCD close to 1 for given software/functionality
Thanks to Lassi Tuura (CMS)
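The three metrics can be computed from a package dependency graph as sketched below. The sketch assumes the usual definitions from large-scale C++ design (CCD counts, for every package, the package itself plus everything it depends on transitively; the CCD of a balanced binary tree with N nodes is (N+1)·log2(N+1) − N); the example graph is made up and is not Anaphe's.

```python
import math


def reachable(graph, start):
    """All packages reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for dep in graph.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def metrics(graph):
    """Return (CCD, ACD, NCCD) for a graph mapping package -> direct dependencies."""
    n = len(graph)
    ccd = sum(len(reachable(graph, pkg)) for pkg in graph)
    acd = ccd / n
    ccd_binary_tree = (n + 1) * math.log2(n + 1) - n
    return ccd, acd, ccd / ccd_binary_tree


# Hypothetical 7-package release arranged exactly as a balanced binary tree.
tree = {
    "a": ["b", "c"],
    "b": ["d", "e"],
    "c": ["f", "g"],
    "d": [], "e": [], "f": [], "g": [],
}
ccd, acd, nccd = metrics(tree)

# The same packages with no dependencies at all: flatter than a tree.
flat = {name: [] for name in tree}
flat_nccd = metrics(flat)[2]

print(ccd, acd, nccd, flat_nccd)
```

A cycle makes every package in it reach every other one, which is why cyclic structures push NCCD well above 1.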
45 Metrics: NCCD vs Cycles
[Plot: NCCD vs cycles for ATLAS, ROOT, ORCA, G4, COBRA, Anaphe and IGUANA, grouped into toolkits and frameworks; annotation: includes Fortran]
- NCCD (spaghetti index)
- near 1.0: good toolkit
- < 1.0: indep. packages
- > 1.0: strongly coupled
Thanks to Lassi Tuura (CMS)
46 History
- Started after CHEP-2000
- Full version out since June 2001
- Established functionality exceeding PAW
- Analyzer component giving direct access to data and libraries of the experiment framework
- Based on Abstract Interfaces
- Flexible and extensible
- Established parallel development of license-free version while re-using existing libraries
- Direct reading/writing of HBook files as an alternative to Objectivity/DB based persistency
- Use of Minuit as a replacement for the minimizer of NAG-C
47 Ongoing activities
- Persistency
- De-emphasize Objectivity/DB (in coordination with experiments, IT/DB and LCG)
- Use of HBook ntuples
- Text files (using AIDA-defined XML format)
- Planning to use LCG persistency (POOL)
- Investigating direct reading of ROOT files
- Fitting
- Implementing minimizer from GSL
- Discussing with the IGUANA team (CMS) to integrate their GUI components
- Looking forward to confirmation and/or re-direction of our efforts following the SC2 (RTAGs)
48 Future enhancements
- Access to other implementations of components
- HBOOK CWNtuples
- Communication with Java tools/packages (JAS, Wired)
- via AIDA
- Reading of ROOT (> V3.0) files
- similar to Tony Johnson's (Java) RootIO package
- depends on stability of ROOT file format?
- AIDA Ntuple/Histo store
- optimized for Ntuples, Histograms as (compressed) XML
- Adding other scripting languages
- Perl, Tcl, cint?
49 Challenge: Distributed Computing
- Motivation
- move code to data
- parallel analysis
- Techniques
- services via AI (Abstract Interfaces)
- late binding
- plug-in architecture
- End-user (Lizard)
- look-and-feel of local analysis
- R&D started and first prototype available soon
- CORBA based
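Late binding in a plug-in architecture can be illustrated with a small registry that resolves a service name to an implementation only when it is first requested. The sketch is local and uses Python standard-library modules as stand-in plug-ins; it does not show the CORBA-based prototype, and PluginManager is a hypothetical name.

```python
import importlib


class PluginManager:
    """Maps service names to implementations that are loaded on first use."""

    def __init__(self):
        self._registry = {}  # service name -> "module:attribute"
        self._loaded = {}    # service name -> resolved object

    def register(self, service, target):
        # Registration stores only a string; nothing is imported yet.
        self._registry[service] = target

    def get(self, service):
        if service not in self._loaded:
            module_name, attribute = self._registry[service].split(":")
            module = importlib.import_module(module_name)  # late binding
            self._loaded[service] = getattr(module, attribute)
        return self._loaded[service]


plugins = PluginManager()
plugins.register("sqrt", "math:sqrt")
plugins.register("mean", "statistics:mean")

value = plugins.get("sqrt")(16.0)
average = plugins.get("mean")([1.0, 2.0, 3.0])
print(value, average)
```

Client code asks for a service by name, so an implementation can be replaced by changing the registration and nothing else.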
50 Summary
- The architecture of Anaphe shows some important items for flexible and modular data analysis
- weak coupling between components through use of Abstract Interfaces
- basic functionality is covered by individual C++ class libraries
- emphasis on usability and maintainability
- Major criteria are flexibility, extensibility and interoperability
- Recent example: GEANT-4 examples (based on AIDA)
- Lizard is an Interactive Data Analysis Tool based on Anaphe components and the Python scripting language (through SWIG)
- Lizard is young but has a very solid base in the mature Anaphe libraries
- real plug-in structure
- Software quality control is important
- tools help to optimize dependencies / minimize maintenance effort
51 More information
- cern.ch/Anaphe
- cern.ch/Anaphe/Lizard
- aida.freehep.org/
- cern.ch/DB
- wwwinfo.cern.ch/asd/lhc/clhep/
53 Analysis of Geant4
- Fairly large C++ project
- Very fine-grained (and multi-level) package structuring
- Seems quite clean from the preliminary analysis
- Fine package subdivision helps in many ways but makes analysis and code understanding more complicated
- One subsystem seems strongly coupled and needs attention
- Need to study the use of the internal command system
Thanks to Lassi Tuura (CMS)
54 Analysis of ROOT
- ROOT developers have done a formidable job of breaking binary (shared library) dependencies, but
- For example: by static analysis, nothing seems to use the postscript package directly (no incoming dependencies), but there is this code
- void TPad::Print(const char *filename, Option_t *option)
- TVirtualPS *psave = gVirtualPS;
- if (gROOT->LoadClass("TPostScript","Postscript")) return;
- gROOT->ProcessLineFast("new TPostScript()");
- gVirtualPS->Open(psname,pstype);
- gVirtualPS->SetBit(kPrintingPS);
- Taking these and global objects into account makes the dependency diagrams very different
- Sign of fast growth? Need a next evolutionary step?
- So coherent that replacing parts could get painful
Thanks to Lassi Tuura (CMS)
55 Analysis of ROOT
[Diagrams: ROOT dependency views; labels: Binary, Source, Logical, Real, Binary only]
Thanks to Lassi Tuura (CMS)
56 Metrics: NCCD vs ACD
[Plot: ATLAS, ROOT, ORCA, G4, COBRA, IGUANA, Anaphe; grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
57 Metrics: NCCD vs Size
[Plot: ATLAS, ROOT, ORCA, G4, COBRA, IGUANA, Anaphe; grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
58 Metrics: NCCD vs AID
[Plot: ATLAS, ROOT, ORCA, COBRA, G4, Anaphe, IGUANA; grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
59 Metrics: Packages vs Size
[Plot: ATLAS, ORCA, G4, COBRA, IGUANA, Anaphe, ROOT; grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
60 Metrics: Packages vs Size
[Plot: ATLAS, ORCA, G4, COBRA, IGUANA, ROOT, Anaphe; grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
61 Example script (ntuple)
# get list of names of all tuples from tuplemanager
ntm.listTuples()
# retrieve tuple by name
nt1 = ntm.findNtuple("Charm1")
# create 1D histos to project into
h1 = hm.create1D(10, "mass", 100, 0., 5000.)
h2 = hm.create1D(20, "mass for pt1>10", 100, 0., 5000.)
# project the attribute "MASS" into histo h1 without cut ("")
nt1.project1D(h1, "", "MASS")
# project the attribute "MASS" into histo h2 with cut ("PT1>10")
nt1.project1D(h2, "PT1>10", "MASS")