Anaphe - OO Libraries for Data Analysis using C++ and Python; AIDA - Abstract Interfaces for Data Analysis

Description:

OpenScientist (Guy Barrand) already there. Gran Sasso Lab, Jul-2002 ... scientific, numerics, graphics, GUI, network, OS, games, DBs, ...

Slides: 62
Provided by: din106

Transcript and Presenter's Notes


1
Anaphe - OO Libraries for Data Analysis using C++
and Python; AIDA - Abstract Interfaces for Data
Analysis
2
Anaphe: OO Libraries for Data Analysis using C++
and Python
Andreas Pfeiffer, CERN IT/API, andreas.pfeiffer_at_cern.ch
3
Outline
  • Motivation
  • Anaphe Components
  • C++
  • Lizard: Interactive Data Analysis
  • Python
  • Software quality control
  • Summary

4
  • LHC Computing challenge

5
LHC The Alps
Interaction Points
100m deep
27km circumference
6
LHC Computing Challenge
  • 4 experiments will create a huge amount of data
  • > 1 PetaByte/year for each experiment!
  • 10^15 Bytes
  • 1,000 TeraBytes
  • 20,000 Redwood tapes
  • 100,000 dual-sided DVD-RAM disks
  • 1,500,000 sets of the Encyclopaedia Britannica
    (w/o photos)
  • Need lots of CPU power to reconstruct/analyse
  • about 1000 PC boxes per experiment (2005 ones!)
  • 40,000 of today's boxes (dual P-III 800 MHz)
  • complex data models
  • reconstruction s/w is also used for online
    filtering
  • needs high quality s/w in order not to waste beam
    time
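
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only divides the numbers quoted above; the per-tape and per-disk capacities it prints are implied by those numbers, not stated in the talk.

```python
# Back-of-the-envelope check of the figures quoted on the slide.
PETABYTE = 10**15          # bytes per experiment per year (slide figure)
TERABYTE = 10**12
GIGABYTE = 10**9

terabytes_per_year = PETABYTE // TERABYTE
bytes_per_tape = PETABYTE // 20_000      # implied capacity of one Redwood tape
bytes_per_dvd = PETABYTE // 100_000      # implied capacity of one DVD-RAM disk
boxes_ratio = 40_000 / 1_000             # today's boxes per assumed 2005 box

print(terabytes_per_year, "TB per year")
print(bytes_per_tape // GIGABYTE, "GB per tape (implied)")
print(bytes_per_dvd // GIGABYTE, "GB per disk (implied)")
print(boxes_ratio, "times more CPU per box assumed for 2005")
```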

7
Lifetime of LHC software: 25 yrs
8
Technology (R)Evolution
  • 10 yrs major cycle length (HW,SW,OS)
  • 12 evolutionary changes in the market
  • 1 revolutionary change
  • towards greater diversity
  • don't forget changes of requirements
  • Consequences
  • s/w written today most probably will be rewritten
    tomorrow
  • we must anticipate changes

9
Anaphe: what it is
  • Analysis for physics experiments
  • Modular (OO/C++) replacement of CERNLIB
    functionality for use in HEP experiments
  • memory management
  • I/O
  • foundation classes
  • histogramming
  • minimizing/fitting
  • visualization
  • interactive data analysis
  • Trying to use standards wherever possible
  • Trying to re-use existing class libraries

10
Anaphe Components
11
  • AIDA
  • Abstract Interfaces for Data Analysis
  • -> next talk

12
  • Anaphe components

13
Layered Approach
  • Basic functionalities (histograms, fitting,
    etc.) are available as individual C++ class
    libraries.
  • Easy to replace one part without throwing away
    everything
  • Objectivity/DB to provide persistence
  • HepODBMS library (insulating layer, tags)
  • Histogram library (HTL)
  • Fitting libraries (Gemini, HepFitting)
  • Graphics libraries (Qt, Qplotter)
  • Insulate components through Abstract Interfaces
  • wrapper layer to implement Interfaces in terms
    of existing libs
  • Apply s/w quality control tools
  • code checking, testing
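
The layering idea (user code sees only an abstract interface, and a wrapper implements that interface in terms of an existing library) can be sketched in Python as follows. This is an illustration only: IHistogram1D, LegacyHisto and LegacyHistoWrapper are hypothetical names invented here, not Anaphe or AIDA classes.

```python
from abc import ABC, abstractmethod


class IHistogram1D(ABC):
    """Hypothetical abstract interface; user code depends only on this."""

    @abstractmethod
    def fill(self, x, weight=1.0):
        ...

    @abstractmethod
    def entries(self):
        ...


class LegacyHisto:
    """Stand-in for an existing library with its own conventions."""

    def __init__(self):
        self.values = []

    def book_entry(self, value, w):
        self.values.append((value, w))


class LegacyHistoWrapper(IHistogram1D):
    """Wrapper layer: implements the interface in terms of the existing lib."""

    def __init__(self):
        self._impl = LegacyHisto()

    def fill(self, x, weight=1.0):
        self._impl.book_entry(x, weight)

    def entries(self):
        return len(self._impl.values)


def analyse(histo, data):
    """User code: works with any implementation of the interface."""
    for x in data:
        histo.fill(x)
    return histo.entries()


print(analyse(LegacyHistoWrapper(), [1.0, 2.5, 3.2]))
```

Replacing LegacyHisto with another library would then only require a new wrapper; analyse() stays unchanged.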

14
ANAPHE Components
[Diagram labels: Python / SWIG; Objectivity/DB; HBook; NAG-C;
Minuit; Qt (free edition);
User Interface - using Abstract Types]
15
Basic 3D Graphic Libraries
  • OpenGL (basic graphics)
  • De-facto industry standard for basic 3D graphics
  • Used in CAD/CAE, games, VR, medical imaging
  • OpenInventor (scene mgmt.)
  • OO 3D toolkit for graphics
  • Cubes, polygons, text, materials
  • Cameras, lights, picking
  • 3D viewers/editors,animation
  • Based on OpenGL/MesaGL

16
2D Graphics libraries
  • Qt
  • multi-platform C++ GUI toolkit
  • C++ class library, not a wrapper around C libs
  • superset of Motif and MFC
  • available on Unix and MS Windows
  • no change for developer
  • commercial but with public domain version
  • www.troll.no
  • Qplotter
  • add-on functionality for HEP
  • HIGZ/HPLOT

17
Mathematical Libraries
  • NAG (Numerical Algorithms Group) C Library
  • Covers a broad range of functionality
  • Linear algebra
  • differential equations
  • quadrature, etc.
  • Special functions of CERNLIB added to Mark-6
    release
  • mostly for theory and accelerator
  • Quality assurance
  • extensive testing done by NAG
  • www.nag.com

18
CLHEP - foundation classes
  • HEP foundation class library
  • Random number generators
  • Physics vectors
  • 3- and 4- vectors
  • Geometry
  • Linear algebra
  • System of units
  • more packages recently added
  • will continue to evolve
  • wwwinfo.cern.ch/asd/lhc/clhep/

19
Histograms: the HTL package
  • Histograms are the basic tool for physics
    analysis
  • Statistical information of density distributions
  • Histogram Template Library (HTL)
  • design based on C++ templates
  • Modular: separation between sampling and
    display
  • Extensible: open for user defined binning
    systems
  • Flexible: support transient/persistent at the
    same time
  • Open: large use of abstract interfaces
  • recent addition: 3D histograms
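
Two of the design points above, user-defined binning and the separation between sampling and display, can be illustrated with a small Python sketch. It is not the HTL API (HTL is a C++ template library); all class and function names below are invented for the illustration.

```python
import bisect


class EqualBinning:
    """Equal-width bins on [lo, hi)."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi

    def index(self, x):
        if not (self.lo <= x < self.hi):
            return None                      # out of range
        i = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
        return min(i, self.nbins - 1)        # guard against rounding at the edge


class VariableBinning:
    """User-defined bin edges, e.g. finer bins where the data are dense."""

    def __init__(self, edges):
        self.edges = sorted(edges)
        self.nbins = len(self.edges) - 1

    def index(self, x):
        if not (self.edges[0] <= x < self.edges[-1]):
            return None
        return bisect.bisect_right(self.edges, x) - 1


class Histogram:
    """Sampling only: accumulates weights; knows nothing about display."""

    def __init__(self, binning):
        self.binning = binning
        self.counts = [0.0] * binning.nbins

    def fill(self, x, weight=1.0):
        i = self.binning.index(x)
        if i is not None:
            self.counts[i] += weight


def text_display(histo):
    """A separate display component that only reads the bin contents."""
    return "\n".join(f"bin {i}: {'*' * int(c)}" for i, c in enumerate(histo.counts))


h = Histogram(VariableBinning([0, 1, 2, 5, 10]))
for x in [0.5, 1.5, 1.7, 4.0, 9.9, 12.0]:
    h.fill(x)
print(text_display(h))
```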

20
Fitting and Minimization
  • Fitting and Minimization Library (FML)
  • common OO interface
  • NAG-C, MINUIT
  • based on Abstract Interfaces
  • IVector, IModelFunction, ...
  • fitting as a special case of minimization
  • minimize distance between data and model
  • replacement for HepFitting (and Gemini)
  • Gemini
  • common interface to minimizer engine
  • very thin layer
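
"Fitting as a special case of minimization" can be made concrete with a toy example: a generic minimizer that knows nothing about data, and a fit that simply hands it a chi-square objective. This is a sketch under simplifying assumptions (one parameter, golden-section search); it does not use or imitate the FML, NAG-C or MINUIT interfaces.

```python
def chi_square(model, params, xs, ys, errors):
    """Distance between data and model for a given parameter set."""
    return sum(((y - model(x, params)) / e) ** 2
               for x, y, e in zip(xs, ys, errors))


def golden_section_minimize(f, lo, hi, tol=1e-9):
    """Generic 1D minimizer; it knows nothing about fitting."""
    inv_phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while abs(b - a) > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d        # minimum lies in [a, d]
        else:
            a = c        # minimum lies in [c, b]
    return (a + b) / 2


def fit(model, xs, ys, errors, lo, hi):
    """Fitting: minimize the chi-square as a function of the parameter."""
    def objective(p):
        return chi_square(model, [p], xs, ys, errors)
    return golden_section_minimize(objective, lo, hi)


def line(x, params):
    """One-parameter model y = a * x."""
    return params[0] * x


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
slope = fit(line, xs, ys, [0.1] * 4, lo=0.0, hi=10.0)
print(round(slope, 3))
```

Swapping the minimizer engine means replacing golden_section_minimize only; the model and the chi-square definition are untouched.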

21
  • Opening bracket
  • Persistency

22
Object persistency: two concepts, serial and page
I/O
  • Sequential access to objects (streaming)
  • good in networking context or serial writes to
    file(s)
  • much like good old Fortran
  • often perceived to be simpler to implement
    (<<, >>)
  • Navigational access to objects (buffered)
  • I/O on demand for complex data models
  • location transparent (for user) access to object
  • typically by de-referencing of a smart pointer
  • optimized for (random) disk access (disks deliver
    pages)
  • sequential write to file(s) still ok
  • Both concepts need to take care about changes of
    the internal structure of the objects (schema
    evolution)
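
The difference between the two concepts can be sketched with the Python standard library. The streaming half uses pickle on a byte stream; the navigational half uses a toy object store and a toy "smart pointer" (ObjectStore and Ref are invented names) that performs I/O only when de-referenced. It is an illustration of the idea, not of Objectivity/DB or HepODBMS.

```python
import io
import pickle

# --- Sequential access: objects are streamed one after another. ---
stream = io.BytesIO()
for track in [{"id": 1, "pt": 12.5}, {"id": 2, "pt": 48.0}]:
    pickle.dump(track, stream)          # analogous to "out << track"

stream.seek(0)
first = pickle.load(stream)             # objects come back in written order
second = pickle.load(stream)


# --- Navigational access: a reference loads its target on demand. ---
class ObjectStore:
    """Toy 'database': object id -> serialized bytes."""

    def __init__(self):
        self._pages = {}
        self.reads = 0

    def put(self, oid, obj):
        self._pages[oid] = pickle.dumps(obj)

    def get(self, oid):
        self.reads += 1
        return pickle.loads(self._pages[oid])


class Ref:
    """Toy smart pointer: I/O happens only when it is de-referenced."""

    def __init__(self, store, oid):
        self._store, self._oid, self._obj = store, oid, None

    def deref(self):
        if self._obj is None:
            self._obj = self._store.get(self._oid)
        return self._obj


store = ObjectStore()
store.put("track/2", {"id": 2, "pt": 48.0})
ref = Ref(store, "track/2")             # no I/O yet
reads_before = store.reads
pt = ref.deref()["pt"]                  # first access triggers the read
print(reads_before, pt, store.reads)
```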

23
Architectural Issue: Persistency (Object-I/O)
  • Brings a completely new quality into the design
  • Objects now have a lifetime
  • don't delete until you really are sure you want
    to
  • persistency is a kind of intended memory leak
  • would like to see no difference between memory
    and disk
  • Layout of objects may change during (extended)
    life
  • schema evolution
  • additions/deletions of attributes
  • changes of inheritance relations
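
A minimal sketch of schema evolution for additions and deletions of attributes, assuming records carry a version number. The "schema" field, the version numbers and the attribute names are invented for the illustration; real object databases handle this in their own ways.

```python
def upgrade(record):
    """Bring a stored record of any older layout up to the current one."""
    record = dict(record)               # never modify the stored copy
    version = record.get("schema", 1)
    if version == 1:
        # Layout 2 added an attribute: old objects get a default value.
        record["charge"] = 0
        version = 2
    if version == 2:
        # Layout 3 deleted an attribute that the class no longer has.
        record.pop("debug_flag", None)
        version = 3
    record["schema"] = version
    return record


old_track = {"schema": 1, "pt": 12.5, "debug_flag": True}
print(upgrade(old_track))
```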

24
Architectural Issue: Persistency (Object-I/O)
(II)
  • Objects can be placed (clustering)
  • de-coupling of logical and physical view of data
  • Special care needed to ensure consistency in data
    set
  • avoid reading group of objects (tracks,
    events,...) for which writing/updating is not
    (yet) complete
  • clean up if only part of the objects are written
  • typically taken care of by using transactions
  • Complications possible in distributed computing
  • need to protect disk access now like memory
    access in past (Segmentation violation)

25
Physical Model and Logical Model
  • Physical model may be changed to optimise
    performance
  • Existing applications continue to work
    transparently !

26
Object Model
Thanks to Vincenzo Innocente (CMS)
27
Physical clustering
Thanks to Vincenzo Innocente (CMS)
28
  • Closing bracket
  • Persistency

29
Tags, Ntuples and Events
  • Tags - a special kind of Ntuple
  • Always associated with an underlying persistent
    store
  • Tags may be used to store ntuple-like data
  • extracted from all over the event
  • minPt, maxEmiss, nJets, nMuon, trigger, ...
  • Main use: speed up data selection for analysis
  • Tag simplifies selection without losing
    complexity
  • Events more complex than a tree structure (CWN)
  • lots of cross-references between classes,
    containers
  • Association from the Tag to the Event may be used
    to navigate to any other part of the Event
  • even from an interactive visualization program
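
The role of tags can be sketched with plain Python data structures: a flat tag per event holds a few extracted numbers plus an association back to the full event, so the selection touches only the tags. The event contents and the attribute names maxPt and nJets below are invented stand-ins for a persistent store.

```python
# Hypothetical in-memory stand-in for a persistent event store.
events = {
    101: {"tracks": [{"pt": 5.0}, {"pt": 42.0}], "jets": 3},
    102: {"tracks": [{"pt": 2.0}], "jets": 0},
    103: {"tracks": [{"pt": 60.0}, {"pt": 7.5}], "jets": 2},
}

# One small tag per event: numbers extracted from all over the event,
# plus the association (here just the event id) back to the full event.
tags = [
    {"event_id": eid,
     "maxPt": max(t["pt"] for t in ev["tracks"]),
     "nJets": ev["jets"]}
    for eid, ev in events.items()
]

# Fast selection runs over the flat tags only ...
selected = [tag for tag in tags if tag["maxPt"] > 40.0 and tag["nJets"] >= 2]

# ... and the association is followed only for the selected events.
for tag in selected:
    event = events[tag["event_id"]]
    print(tag["event_id"], len(event["tracks"]))
```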

30
Anaphe components
31
Anaphe Internals: (Abstract) Interfaces
32
AIDA compliance of Anaphe
  • Presently (Anaphe 3.x) only AIDA 1.0 compliant
  • Plan to implement AIDA 2.2 Interfaces by end 2001
    (Anaphe 4.x)
  • initially as wrappers to existing
    interfaces/packages
  • Will maintain 3.x for some time
  • ensures stability for users
  • Development will concentrate on 4.x
  • while AIDA will evolve further
  • Similar time schedule to JAS (Tony Johnson)
  • OpenScientist (Guy Barrand) already there

33
  • Lizard: a tool for Interactive Data Analysis

34
Interactive Data Analysis
  • Aim: OO replacement for PAW (at least)
  • analysis of ntuple-like data (Tags,
    Ntuples, ...)
  • visualisation of data (Histograms, scatter-plot,
    Vectors)
  • fitting of histograms (and other data)
  • access to experiment specific data/code
  • Maximize flexibility and re-use
  • Foresee customization/integration
  • allow use from within the experiment's s/w
  • Plan for extensions
  • code for now, design for the future
  • Ensure maintainability
  • use of s/w quality control tools

35
Scripting - why
  • Typical use of scripting is quite different from
    programming (reconstruction, analysis, ...)
  • history: go back to where I was before
  • repetition/looping - with modifiable parameters
  • avoid "one size fits all" or using a power tool as a
    hammer
  • rapid prototyping in scripting language
  • quick turn-around times
  • performance critical code in core language
  • exploit richer set of features/functionality
    (e.g. templates in C++)
  • scripting languages usually less susceptible to
    changes than mainstream languages
  • potentially longer lifetimes

36
Python - why
  • Python - OO (scripting) language
  • no strange !-variables
  • sensitive to indentation
  • Easier for users
  • as Java
  • Lots of user supplied modules available and ready
    for use
  • scientific, numerics, graphics, GUI, network, OS,
    games, DBs,
  • example: http://www.vex.net/parnassus/
  • Parnassus totals: 1173 items in 49 categories.
  • Also usable in Java (Jython)
  • used in JAS for scripting
  • minimize changes needed within AIDA compliant
    environments

37
Python - how
  • SWIG to (semi-) automatically create connection
    to chosen scripting language
  • allows flexibility to choose amongst several
    scripting languages
  • Python, Perl, Tcl, Guile, Ruby, (Java)
  • Very easy to use
  • swig -c++ -python -shadow -c myClass.h
  • create shared lib from myClass.cpp and
    myClass_wrap.c
  • start python and import myClass to use it
  • Very easy to extend
  • simply inherit from swiggified class in python
  • modifications can later be fed back into C++
  • performance, type safety, special language
    features (templates), ...
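
Extending a wrapped class from Python might look like the sketch below. Since no SWIG-generated module is available here, MyClass is a pure-Python stand-in for the shadow class that SWIG would generate; its compute() method and the subclass are invented for the illustration.

```python
class MyClass:
    """Stand-in for a SWIG-generated shadow class wrapping a C++ class."""

    def __init__(self, scale):
        self.scale = scale

    def compute(self, x):
        return self.scale * x


class MyExtendedClass(MyClass):
    """Extension written in Python by inheriting from the wrapped class."""

    def compute_all(self, values):
        # New behaviour prototyped in Python, reusing the wrapped method.
        return [self.compute(v) for v in values]


obj = MyExtendedClass(2.0)
print(obj.compute_all([1.0, 2.5]))
```

Once compute_all has proven useful, it could be moved into the C++ class and re-wrapped, which is the feedback path the slide describes.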

38
PAW -> Lizard translation
  • Ntuple projection: Lizard
  • lizard --useHBook
  • :-) nt = ntm.findNtuple("higgscand.hbk::cands")
  • :-) nplot1D(nt, "mass", "quality>5 && cut > 198")
  • Ntuple projection: PAW
  • pawX11
  • paw> h/file 1 higgscand.hbk
  • paw> nt/pl 10.mass quality>5.and.cut>198
  • Assuming file higgscand.hbk contains an ntuple with
    number 10 and title cands

Any valid C++ expression
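
What an ntuple projection with a cut does can be written out in plain Python. This is not Lizard code: plot1d is a hypothetical helper, and the ntuple rows are invented values that reuse the column names from the slide (mass, quality, cut).

```python
def plot1d(ntuple, attribute, selection, nbins, lo, hi):
    """Project one attribute into equal-width bins for rows passing the cut."""
    counts = [0] * nbins
    for row in ntuple:
        if not selection(row):
            continue
        x = row[attribute]
        if lo <= x < hi:
            i = int((x - lo) / (hi - lo) * nbins)
            counts[min(i, nbins - 1)] += 1
    return counts


# Toy ntuple; the values are invented for the illustration.
cands = [
    {"mass": 114.0, "quality": 7, "cut": 200},
    {"mass": 91.0, "quality": 3, "cut": 210},
    {"mass": 115.5, "quality": 9, "cut": 150},
    {"mass": 116.0, "quality": 6, "cut": 199},
]

counts = plot1d(cands, "mass",
                lambda r: r["quality"] > 5 and r["cut"] > 198,
                nbins=4, lo=100.0, hi=120.0)
print(counts)
```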
39
Tutorials and Examples available
40
Users and Collaborations
  • AIDA spoken here!
  • IGUANA (CMS visualization)
  • GAUDI (LHCb/HARP) framework
  • ATHENA (Atlas) framework
  • Analyzer modules in Geant 4
  • JAS
  • Open Scientist
  • you?

41
  • Software quality control

42
Software quality control
  • Using tools for testing/checking has started
  • Insure, CodeWizard
  • Package dependencies: Ignominy
  • Set of perl and shell scripts by Lassi Tuura
    (CMS)
  • Ignominy scans:
  • Make dependency data produced by the compilers
    (.d files)
  • Source code for includes (resolved against the
    ones actually seen)
  • Shared library dependencies (ldd output)
  • Defined and required symbols (nm output)
  • And maps:
  • Source code and binaries into packages
  • include dependencies into package dependencies
  • Unresolved/defined symbols into package
    dependencies
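
The mapping step (file-level include edges lifted to package-level dependencies) can be sketched as below. This is not Ignominy itself, which is a set of perl and shell scripts; the file names and include lists are invented, with only the package names HTL, FML and Lizard taken from the talk.

```python
# Hypothetical inputs: which package owns each file, and file-level includes.
file_to_package = {
    "HTL/Histo1D.h": "HTL",
    "HTL/Histo1D.cpp": "HTL",
    "FML/Fitter.h": "FML",
    "FML/Fitter.cpp": "FML",
    "Lizard/Plot.cpp": "Lizard",
}
includes = {
    "HTL/Histo1D.cpp": ["HTL/Histo1D.h"],
    "FML/Fitter.cpp": ["FML/Fitter.h", "HTL/Histo1D.h"],
    "Lizard/Plot.cpp": ["HTL/Histo1D.h", "FML/Fitter.h"],
}


def package_dependencies(file_to_package, includes):
    """Lift file-level include edges to package-level dependency edges."""
    deps = {pkg: set() for pkg in file_to_package.values()}
    for source, headers in includes.items():
        src_pkg = file_to_package[source]
        for header in headers:
            dst_pkg = file_to_package[header]
            if dst_pkg != src_pkg:          # ignore includes within a package
                deps[src_pkg].add(dst_pkg)
    return deps


deps = package_dependencies(file_to_package, includes)
for pkg in sorted(deps):
    print(pkg, "->", sorted(deps[pkg]))
```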

ignominy: dishonour, disgrace, shame; infamy; the
condition of being in disgrace, etc. (Oxford
English Dictionary)
43
Ignominy Analysis of Anaphe
  • Distribution of tools and utilities for LHC era
    physics
  • Combination of commercial, free and HEP software
  • Claims to be a toolkit
  • Seems to live up to its toolkit claims
  • Good work on modularity
  • Clean design is evident in many places
  • Dependency diagrams often split naturally into
    functional units

Thanks to Lassi Tuura (CMS)
44
Package Metrics
  • Size: total amount of source code (not
    normalised across projects!)
  • ACD: average component dependency (libraries
    linked in)
  • CCD: sum of single-package component
    dependencies over whole release
  • Indicates testing/integration cost
  • NCCD: measure of CCD compared to a balanced
    binary tree
  • A good toolkit's NCCD will be close to 1.0
  • < 1.0: structure is flatter than a binary tree
    (independent packages)
  • > 1.0: structure is more strongly coupled
    (vertical or cyclic)
  • Aim: NCCD ≈ 1 for given software/functionality
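
The metrics can be computed from a package dependency graph in a few lines. The sketch assumes the usual definitions: each package counts itself plus everything it depends on transitively, CCD is the sum of these counts, ACD is CCD divided by the number of packages, and the reference CCD of a balanced binary tree with N nodes is (N+1)*log2(N+1) - N. The three example graphs are invented.

```python
import math


def transitive_deps(graph, node):
    """All packages reachable from node, including node itself."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def metrics(graph):
    """Return (CCD, ACD, NCCD) for a package dependency graph."""
    n = len(graph)
    ccd = sum(len(transitive_deps(graph, pkg)) for pkg in graph)
    acd = ccd / n
    ccd_binary_tree = (n + 1) * math.log2(n + 1) - n
    return ccd, acd, ccd / ccd_binary_tree


# Hypothetical 7-package release shaped as a balanced binary tree.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
        "D": [], "E": [], "F": [], "G": []}
# The same packages with no dependencies at all (fully independent).
flat = {pkg: [] for pkg in tree}
# The same packages chained vertically: each depends on the next one.
names = sorted(tree)
chain = {a: [b] for a, b in zip(names, names[1:])}
chain[names[-1]] = []

for label, graph in [("tree", tree), ("flat", flat), ("chain", chain)]:
    ccd, acd, nccd = metrics(graph)
    print(f"{label}: CCD={ccd} ACD={acd:.2f} NCCD={nccd:.2f}")
```

The tree comes out at NCCD = 1, the flat set below 1 and the chain above 1, matching the interpretation given on the slide.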

Thanks to Lassi Tuura (CMS)
45
Metrics: NCCD vs Cycles
  • NCCD (spaghetti index)
  • ≈ 1.0: good toolkit
  • < 1.0: independent packages
  • > 1.0: strongly coupled

[Plot: ATLAS, ROOT, ORCA, G4, COBRA, Anaphe and IGUANA, grouped into toolkits and frameworks; annotation "Includes Fortran"]
Thanks to Lassi Tuura (CMS)
46
History
  • Started after CHEP-2000
  • Full version out since June 2001
  • Established functionality exceeding PAW
  • Analyzer component giving direct access to data
    and libraries of the experiment framework
  • Based on Abstract Interfaces
  • Flexible and extensible
  • Established parallel development of license
    free version while re-using existing libraries
  • Direct reading/writing of HBook files as an
    alternative to Objectivity/DB based persistency
  • Use of Minuit as a replacement for the minimizer
    of NAG-C

47
Ongoing activities
  • Persistency
  • De-emphasize Objectivity/DB (in coordination with
    experiments, IT/DB and LCG)
  • Use of HBook ntuples
  • Text files (using AIDA defined XML format)
  • Planning to use LCG persistency (POOL)
  • Investigating direct reading of ROOT files
  • Fitting
  • Implementing minimizer from GSL
  • Discussing with the IGUANA team (CMS) to
    integrate their GUI components
  • Looking forward to confirmation and/or
    re-direction of our efforts following the SC2
    (RTAGs)

48
Future enhancements
  • Access to other implementations of components
  • HBOOK CWNtuples
  • Communication with Java tools/packages (JAS,
    Wired)
  • via AIDA
  • Reading of ROOT (> V3.0) files
  • similar to Tony Johnson's (Java) RootIO package
  • depends on stability of ROOT file format?
  • AIDA Ntuple/Histo store
  • optimized for Ntuples, Histograms as (compressed)
    XML
  • Adding other scripting languages
  • Perl, Tcl, CINT?

49
Challenge: Distributed Computing
  • Motivation
  • move code to data
  • parallel analysis
  • Techniques
  • services via AI
  • late binding
  • plug-in architecture
  • End-user (Lizard)
  • look-and-feel of local analysis
  • R&D started and first prototype available soon
  • CORBA based
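
The combination of abstract interfaces, late binding and a plug-in architecture can be sketched as a small registry. Everything here is hypothetical (the plugin decorator, IHistogramService, and both services); in particular the "remote" class is a local stub and makes no network or CORBA call.

```python
from abc import ABC, abstractmethod

_plugins = {}


def plugin(name):
    """Class decorator that registers an implementation under a name."""
    def register(cls):
        _plugins[name] = cls
        return cls
    return register


class IHistogramService(ABC):
    """Abstract interface the end-user tool programs against."""

    @abstractmethod
    def mean(self, values):
        ...


@plugin("local")
class LocalService(IHistogramService):
    def mean(self, values):
        return sum(values) / len(values)


@plugin("remote")
class RemoteServiceStub(IHistogramService):
    """Stand-in for a proxy that would send the request to where the data live."""

    def mean(self, values):
        # A real proxy would make a remote call here; this stub computes locally.
        return sum(values) / len(values)


def load_service(name):
    """Late binding: the implementation is chosen by name at run time."""
    return _plugins[name]()


for backend in ("local", "remote"):
    print(backend, load_service(backend).mean([1.0, 2.0, 6.0]))
```

The calling loop is identical for both back ends, which is the "look-and-feel of local analysis" aimed at for the end user.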


50
Summary
  • The architecture of Anaphe shows some important
    items for flexible and modular data analysis
  • weak coupling between components through use of
    Abstract Interface
  • basic functionality is covered by individual C++
    class libraries
  • emphasis on usability and maintainability
  • Major criteria are flexibility, extensibility and
    interoperability
  • Recent example: GEANT-4 examples (based on AIDA)
  • Lizard is an Interactive Data Analysis Tool based
    on Anaphe components and the Python scripting
    language (through SWIG)
  • Lizard is young but has very solid base in mature
    Anaphe libraries
  • real plug-in structure
  • Software quality control is important
  • tools help to optimize dependencies / minimize
    maintenance effort

51
More information
  • cern.ch/Anaphe
  • cern.ch/Anaphe/Lizard
  • aida.freehep.org/
  • cern.ch/DB
  • wwwinfo.cern.ch/asd/lhc/clhep/

52
  • Additional slides

53
Analysis of Geant4
  • Fairly large C++ project
  • Very fine-grained (and multi-level) package
    structuring
  • Seems quite clean from the preliminary analysis
  • Fine package subdivision helps in many ways but
    makes analysis and code understanding more
    complicated
  • One subsystem seems strongly coupled and needs
    attention
  • Need to study the use of the internal
    command system

Thanks to Lassi Tuura (CMS)
54
Analysis of ROOT
  • ROOT developers have done a formidable job of
    breaking binary (shared library) dependencies,
    but
  • For example: by static analysis, nothing seems to
    use the postscript package directly (no incoming
    dependencies), but there is this code:
  • void TPad::Print(const char *filename, Option_t
    *option)
  • TVirtualPS *psave = gVirtualPS;
  • if (gROOT->LoadClass("TPostScript","Postscript"))
    return;
  • gROOT->ProcessLineFast("new TPostScript()");
  • gVirtualPS->Open(psname,pstype);
  • gVirtualPS->SetBit(kPrintingPS);
  • Taking these and global objects into account
    makes the dependency diagrams very different
  • Sign of fast growth? Need a next evolutionary
    step?
  • So coherent that replacing parts could get
    painful
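
The same effect, a dependency that a static scan cannot see because it is named only in a string and resolved at run time, can be reproduced in Python. This is an analogy to the ROOT excerpt, not ROOT code: the scanned source and the choice of the json module are invented for the illustration.

```python
import ast

# Source to analyse: "json" is loaded by name at run time, so no import
# statement ever mentions it.
source = '''
import math

def print_pad(filename):
    import importlib
    ps = importlib.import_module("json")   # dependency named only in a string
    return ps.dumps({"file": filename, "pi": math.pi})
'''


def static_imports(code):
    """Module names visible to a purely static scan of import statements."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


found = static_imports(source)
print(sorted(found))
print("json" in found)
```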

Thanks to Lassi Tuura (CMS)
55
Analysis of ROOT
[Dependency diagrams; labels: Binary, Source, Logical, Real, Binary only]
Thanks to Lassi Tuura (CMS)
56
Metrics: NCCD vs ACD
[Plot: ATLAS, ROOT, ORCA, G4, COBRA, IGUANA and Anaphe, grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
57
Metrics: NCCD vs Size
[Plot: ATLAS, ROOT, ORCA, G4, COBRA, IGUANA and Anaphe, grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
58
Metrics: NCCD vs AID
[Plot: ATLAS, ROOT, ORCA, COBRA, G4, Anaphe and IGUANA, grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
59
Metrics: Packages vs Size
[Plot: ATLAS, ORCA, G4, COBRA, IGUANA, Anaphe and ROOT, grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
60
Metrics: Packages vs Size
[Plot: ATLAS, ORCA, G4, COBRA, IGUANA, ROOT and Anaphe, grouped into toolkits and frameworks]
Thanks to Lassi Tuura (CMS)
61
Example script (ntuple)
# get list of names of all tuples from tuplemanager
ntm.listTuples()
nt1 = ntm.findNtuple("Charm1")  # retrieve tuple by name
# create 1D histos to project into
h1 = hm.create1D(10, "mass", 100, 0., 5000.)
h2 = hm.create1D(20, "mass for pt1>10", 100, 0., 5000.)
# project the attribute "MASS" into histo h1 without cut ("")
nt1.project1D(h1, "", "MASS")
# project the attribute "MASS" into histo h2 with cut ("PT1>10")
nt1.project1D(h2, "PT1>10", "MASS")