Transcript: Algorithm Timing and Performance Issues with emphasis on HLT algorithm online timing
1
Algorithm Timing and Performance Issues (with
emphasis on HLT algorithm online timing)
  • Xin Wu (University of Geneva)
  • On behalf of TDAQ
  • TP Week, June 7, 2007

2
Introduction
  • Offline algorithm timing/memory performance is
    linked directly to the efficiency of doing
    physics analyses
  • Think of it as a sort of luminosity
  • Faster algorithms → earlier results, or results
    with a larger data sample
  • HLT algorithm timing/memory performance is even
    more critical
  • Slow algorithms or crashed processes contribute
    to DAQ dead time
  • DAQ dead time → loss of luminosity
  • The issue is serious for offline
  • The Offline Performance Task Force has been
    formed to tackle it globally
  • The issue is serious for HLT
  • High LVL1 rate and limited budget (and space!)
    for the HLT farms
  • Up to 100 kHz LVL1 rate, up to 3 kHz LVL2/EB rate
  • HLT algorithm timing is being optimized
    individually offline
  • Great progress has been achieved in the past
    year
  • Global optimization is done online with the
    actual TDAQ hardware and a realistic mixture of
    input events in Technical Runs

See next slide from Wim Lavrijsen
3
Update from Performance Task Force (Wim Lavrijsen)
  • Mandate: reduce computing resource consumption of
    ATLAS software using technical means
  • Provide tools and know-how for developers
  • Identify problem areas and come up with
    solutions
  • Goals (with HLT requirements in mind)
  • Improve overall "uptime" of athena jobs
  • Reduce memory leaks, improve startup time, reduce
    initial memory sizes
  • Reduce CPU, memory, and I/O usage
  • Monitor and put in the hands of algorithm
    developers etc.
  • Focus on peak usage and recovery from peak
  • Current work
  • Provide machinery to continuously monitor
    performance (early interception of major changes)
  • http://atlas-computing.web.cern.ch/atlas-computing
    /links/distDirectory/nightlies/aid_perfmon
  • Provide standard jobs for reference and
    benchmarking
  • Identify structural problems in ATLAS software
  • Dictionary sizes/use (ROOT team is working on
    next-gen dicts)
  • python2.4 memory allocation (move to python2.5),
    xml configuration (DB, python), malloc overhead
    (use arenas instead)
  • dld.so overhead (improve configuration, use
    on-demand loading)
  • Take-home point for everyone: FIX YOUR LD/LINK
    OPTIONS AND JO FILES!
  • object size increase in 64-bit builds (remove
    padding)

See D. Quarrie's talk on Monday
4
Timing Requirement on HLT Algorithms
  • The benchmark: the 10 ms per event requirement
    of LVL2
  • Hard limit
  • 100 kHz LVL1 → 10 μs LVL2 average processing time
    (latency)
  • 500 1U slots for the LVL2 farm
  • Optimization with many scenarios
  • 500 dual-CPU 8 GHz = 1000 LVL2 processes → 10 ms
    per L2PU
  • Or 500 dual quad-core 2 GHz = 4000 L2PUs → 40 ms
    per L2PU
  • >1 L2PU per core to improve CPU efficiency
  • Multi-threaded mode to improve memory and CPU
    efficiency
  • Timing includes data access
  • Requires non-trivial optimization (150 ROSs send
    requested data to thousands of L2PUs through the
    LVL2 data network)
  • The benchmark: the 1 s per event requirement of EF
  • Hard limit
  • 3 kHz LVL2 → 333 μs EF average processing time
    (latency)
  • 1800 1U slots for the EF farm
  • Optimization with many scenarios
  • 1800 dual-CPU 8 GHz = 3600 EF processes → 1.2 s
    per PT
  • Or 1800 dual quad-core 2 GHz = 14400 PTs → 4.8 s
    per PT
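The farm arithmetic above is the same simple budget formula in every scenario: the average time each process may spend per event is the number of parallel processes divided by the input rate. A minimal sketch (`time_budget_ms` is a hypothetical helper for illustration, not TDAQ code):

```python
def time_budget_ms(input_rate_hz: float, n_processes: int) -> float:
    """Average per-event processing budget, in ms, for one trigger process."""
    return n_processes / input_rate_hz * 1e3

# LVL2: 100 kHz LVL1 input rate
assert round(time_budget_ms(100e3, 1000)) == 10     # 1000 L2PUs -> 10 ms each
assert round(time_budget_ms(100e3, 4000)) == 40     # 4000 L2PUs -> 40 ms each

# EF: 3 kHz LVL2/EB input rate
assert round(time_budget_ms(3e3, 3600)) == 1200     # 3600 PTs  -> 1.2 s each
assert round(time_budget_ms(3e3, 14400)) == 4800    # 14400 PTs -> 4.8 s each
```

The budget scales linearly with farm size, which is why running more (slower-clocked) cores per node relaxes the per-process latency requirement without changing total throughput.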

5
Brief Summary of the March TR (19/3-23/3)
  • Hardware
  • Final ROIB + LVL1 emulator
  • Pre-series machines (dual 1-core 3.2 GHz or 2.4
    GHz)
  • 12 ROSs, 2 L2SVs, 12 L2 nodes running 2 L2PUs each
  • 29 EF nodes running 2 PT applications each
  • Software
  • tdaq-01-07-00, AtlasHLT 2.0.5-HLT, Offline
    12.0.5-HLT-1
  • All basic HLT slices integrated
  • e10, g10, mu6, tau10, jet20, cosmic, Bphysics,
    met
  • combined e10+g10+mu6+tau10+jet20
  • Input events
  • 6k events (mixed physics processes, 60% jets
    and 40% W/Z)
  • LVL1 simulated with CSC-05
  • Main achievements
  • Validated DAQ and HLT infrastructure with
    tdaq-01-07-00
  • Successfully configured and ran slices
    (individual and combined)

6
Brief Summary of the May TR (21/5-25/5)
  • Final hardware
  • ROIB (+ LVL1 emulator), 120 ROSs, 29 SFIs
  • 4 HLT racks (130 dual quad-core 1.8 GHz), about
    5% of the final system
  • Basically the same software setup as the March TR
  • Same input events as the March TR
  • Main achievements
  • Validated TDAQ and HLT infrastructure with the
    final hardware
  • Measurements with dummy algorithms at LVL2 and EF
    with the final hardware
  • Functionality test with the combined algorithm only
  • Tested DBProxy and triggerDB configuration
  • Preparation for the M3 week
  • Good shift participation, as in the March TR

7
Algorithm Online Timing General Remarks
  • Caveat: online timing measurement is a complex
    issue
  • Depends on many variables: network layout, number
    of L2PUs/PTs per node, CPU speed, trigger slices,
    input events, algorithms, ...
  • Most are not final; continuous optimization needs
    to be done
  • The March/May TRs are only first attempts
  • Results will certainly be improved
  • Will show mainly basic e, γ, mu, tau, jet reco
    timing (per RoI)
  • LVL2 calorimeter-based reconstruction:
    T2CaloEgamma, T2CaloTau, T2CaloJet
  • LVL2 muon reconstruction: muFast
  • LVL2 tracking: IdScan (e, tau, mu), SiTrack (e
    only)
  • EF calorimeter data preparation:
    TrigCaloCellMaker, TrigCaloTowerMaker,
    TrigCaloClusterMaker
  • EF tracking: 10 tools
  • EF Egamma, Muon, Tau, Jet reconstruction:
    TrigEgammaRec, TrigMoore, TrigTauRec, TrigJetRec
  • Will not have time to go over all of them in
    detail, apologies!

LVL2: dedicated algorithms
EF: uses offline tools
8
LVL2 Egamma Reco T2CaloEgamma
Photon run, March: mean 6.6 ms/RoI
Egamma run, March: mean 7.4 ms
Combined run, March: mean 6.9 ms
Combined run, May: mean 6.2 ms
(March: 12 ROSs, L2PU at 3.2 GHz; May: 120 ROSs, L2PU at 1.8 GHz)
9
LVL2 Tau Calo Reco T2CaloTau
tau run, March: mean 12.9 ms/RoI
combined run, March: mean 6.4 ms
  • Faster in the combined slice
  • Shorter data transfer time because of em/tau
    overlap (ROBDataProvider caching)!
  • Further improvement possible with common data
    preparation

combined run, May: mean 6.2 ms
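The caching effect credited above can be sketched as per-event memoization of ROB requests: when em and tau RoIs overlap, the second algorithm's lookups hit the cache instead of the network. A minimal illustration only; `RobCache` and `fetch_from_ros` are hypothetical names, not the ATLAS ROBDataProvider API:

```python
class RobCache:
    """Per-event cache in front of an expensive ROS data request."""

    def __init__(self, fetch_from_ros):
        self._fetch = fetch_from_ros   # expensive network request to a ROS
        self._cache = {}               # ROB id -> data, valid for one event
        self.requests = 0              # count of actual network requests

    def get(self, rob_id):
        if rob_id not in self._cache:
            self.requests += 1
            self._cache[rob_id] = self._fetch(rob_id)
        return self._cache[rob_id]

    def clear(self):
        """Invalidate at the start of each new event."""
        self._cache.clear()

cache = RobCache(fetch_from_ros=lambda rob_id: f"data-{rob_id}")
for rob_id in [0x42, 0x43]:   # em RoI touches these ROBs
    cache.get(rob_id)
for rob_id in [0x43, 0x44]:   # overlapping tau RoI: 0x43 is already cached
    cache.get(rob_id)
assert cache.requests == 3    # 4 lookups, only 3 network fetches
```

Data-collection time then scales with the number of distinct ROBs touched per event rather than per algorithm, which is why overlapping slices in a combined run come out faster than the same slices run alone.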
10
LVL2 Jet Reco T2CaloJet
Jet run, March: mean 28.1 ms/RoI
Combined run, March: mean 26.0 ms
  • The current algo is cell-based with a fixed cone
    of 0.4
  • Sped up by ×2 in release 13 (Jonathan Ferland)
  • Further improvement under investigation using ROD
    preprocessed TriggerTower info

Combined run, May: mean 25.0 ms
11
LVL2 Muon Reco muFast
muon run, March: mean 6.2 ms/RoI
combined run, March: mean 4.9 ms
Combined run, May: mean 6.4 ms
(note the scale change)
12
LVL2 Tracking IDScan for egamma
Electron run, March: mean 17.1 ms/RoI
Combined run, March: mean 14.9 ms
Offline with Xeon 2.4 GHz (D. Emeliyanov)
Combined run, May: mean 16.8 ms
13
LVL2 Tracking IDScan for Tau
Tau run, March: mean 13.0 ms/RoI
Combined run, March: mean 6.6 ms
  • Faster in the combined slice
  • Because of ROB data and SpacePoints caching
  • Further improvement with EM/tau common tracking?

Combined run, May: mean 8.1 ms
14
LVL2 Tracking IDScan for Muon
Muon run, March: mean 19.4 ms/RoI
Combined run, March: mean 14.8 ms
  • Uses a larger RoI than egamma in release 12
  • Will be the same in release 13

Combined run, May: mean 15.9 ms
15
LVL2 Tracking SiTrack for Egamma
Electron run, March: mean 8.8 ms
Combined run, March: mean 7.8 ms
Combined run, May: mean 8.3 ms
  • Helped by upstream IdScan ROB data and
    SpacePoints caching
  • Offline studies show IdScan and SiTrack are
    comparable in timing

16
L2PU Timing for Electron Run March
Total time for accepted: mean 71.5 ms/event
Total time for rejected: mean 19.7 ms
Why the offset? An IDC feature used by the L2 code,
found by Werner Wiedenmann, now fixed by RD
Processing time for accepted: mean 53.0 ms
Data collection time for accepted: mean 25.0 ms
Two tracking algorithms!
17
L2PU Time for Combined Slice Run in May (1)
Total time for accepted: mean >94.3 ms/event
Processing time for accepted: mean 82.4 ms
Data requests for accepted: mean 24/event
Data collection time for accepted: mean 26.5 ms
(about 1 ms per request)
18
L2PU Time for Combined Slice Run in May (2)
Total time for rejected: mean 31.5 ms
Processing time for rejected: mean 25.7 ms
Data requests for rejected: mean 5.3/event
Data collection time for rejected: mean 6.0 ms
19
Egamma EF Calo Reconstruction Timing
TrigCaloCellMaker: combined run, May: mean 27.0 ms
TrigCaloTowerMaker: combined run, May: mean 16.0 ms/RoI
TrigCaloClusterMaker: combined run, May: mean 65.4 ms
20
Tau EF Calo Reconstruction Timing
TrigCaloCellMaker: combined run, May: mean 62.6 ms
TrigTauRec: combined run, May: mean 13 ms/RoI
21
Jet EF Calo Reconstruction Timing
TrigCaloCellMaker: combined run, May: mean 7.0 ms/RoI
TrigCaloTowerMaker: combined run, May: mean 6.6 ms
TrigJetRec (cone 0.4): combined run, May: mean 44.0 ms
TrigJetRec doNoise: combined run, May: mean 48.6 ms
90% of TrigJetRec. Good to monitor timing within
an algorithm!
22
Electron EF Track Reconstruction Timing (1)
PixelClustering: combined run, May: mean 5.2 ms/RoI
SCTClustering: combined run, May: mean 5.7 ms
TRTDriftCircleMaker: combined run, May: mean 9.1 ms
SiTrigSpacePointFinder: combined run, May: mean 5.0 ms
23
Electron EF Track Reconstruction Timing (2)
SiTrigTrackFinder: combined run, May: mean 168.6 ms/RoI, rms 141.9 ms
TrigAmbiguitySolver: combined run, May: mean 22.3 ms, rms 23.3 ms
TRTTrackExtAlg: combined run, May: mean 6.8 ms, rms 4.2 ms
TrigExtProcessor: combined run, May: mean 27.4 ms, rms 23.7 ms
24
Electron EF Track Reconstruction Timing (3)
  • Summary (ms):
  • PixelClustering 5.2
  • SCTClustering 5.7
  • TRTDriftCircleMaker 9.1
  • SiTrigSpacePointFinder 5.0
  • SiTrigTrackFinder 168.6
  • AmbiguitySolver 22.3
  • TRTTrackExtAlg 6.8
  • TrigExtProcessor 27.4
  • TrigVxPrimary 5.1
  • TrigParticlesCreator 5.0
  • Total EF electron tracking timing:
  • ~260 ms/RoI, with a large rms

TrigVxPrimary: combined run, May: mean 5.1 ms
TrigParticlesCreator: combined run, May: mean 5.0 ms
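As a quick consistency check, the per-tool means quoted in the summary do add up to the ~260 ms/RoI total reported on this slide (a sketch using only the numbers shown here):

```python
# Per-tool EF electron-tracking means (ms/RoI) as quoted on this slide.
ef_electron_tracking_ms = {
    "PixelClustering":        5.2,
    "SCTClustering":          5.7,
    "TRTDriftCircleMaker":    9.1,
    "SiTrigSpacePointFinder": 5.0,
    "SiTrigTrackFinder":      168.6,
    "AmbiguitySolver":        22.3,
    "TRTTrackExtAlg":         6.8,
    "TrigExtProcessor":       27.4,
    "TrigVxPrimary":          5.1,
    "TrigParticlesCreator":   5.0,
}
total = sum(ef_electron_tracking_ms.values())
assert round(total, 1) == 260.2   # ~260 ms/RoI, as quoted
```

The sum makes clear that SiTrigTrackFinder alone accounts for roughly two thirds of the budget, so optimization effort is best spent there.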
25
Tau EF Track Reconstruction Timing
  • Summary (ms):
  • PixelClustering 5.7
  • SCTClustering 11.9
  • TRTDriftCircleMaker 20.2
  • SiTrigSpacePointFinder 5.1
  • SiTrigTrackFinder 46.2
  • AmbiguitySolver 8.7
  • TRTTrackExtAlg 5.1
  • TrigExtProcessor 9.9
  • TrigVxPrimary 5.1
  • TrigParticlesCreator 5.1
  • Total tau tracking timing:
  • ~123 ms/RoI, with a large rms
  • No time saving in the combined run!
  • Further optimization possible

Ntracks / tau RoI: combined run, May: mean 3.4
tracks (tighter LVL2 cuts)
Ntracks / egamma RoI: combined run, May: mean 7.9
tracks (very loose LVL2 cuts)
26
EF Track Reconstruction Timing from Offline
  • Done by I. Grabowska-Bold
  • Dual P4 @ 2.8 GHz
  • Same events as used in the Technical Runs
  • KF for release 12, ModKF for release 13
  • Not directly comparable with online numbers

27
EF InDet Full Event Reconstruction
  • Done by I. Grabowska-Bold
  • Dual P4 @ 2.8 GHz
  • FullScan and MinBias slices in release 13
  • Very interesting to try out online

28
Egamma and Muon EF Reconstruction Timing
TrigEgammaRec: combined run, May: mean 33.6 ms
Muon Reconstruction: combined run, May: mean 40.8 ms
Muon Identification: combined run, May: mean 13.0 ms
29
EF Total Processing Time
combined run, May: mean 1.57 s
  • Remember again the caveat about online timing
    measurements stated earlier
  • This is only a snapshot of one particular setup,
    still far from being representative of the final
    hardware setup, a typical high-luminosity trigger
    menu, and actual LHC events!

30
Conclusions
  • HLT algorithm timing has made great progress in
    the past year
  • Individual algorithms give reasonable numbers
  • Further optimization is necessary (faster
    algorithms → more events, margin for real data,
    complex menus, noise, pile-up, ...)
  • Investigate strategy changes (merging RoIs of
    different types, ...)?
  • Online study of overall HLT algorithm timing
    performance is starting
  • Confirmed offline numbers for simple inclusive
    slices
  • Overall performance couples strongly with TDAQ
    hardware/software
  • Requires a big collaborative effort to
    understand, with well-designed tools and enough
    access time to the hardware!
  • It needs to be constantly monitored
  • Next step is an online test of Rel 13 with a more
    complete trigger menu
  • HLT algorithm performance is much more than just
    execution time. Efforts are ongoing in many
    areas, in collaboration with the Task Force
  • Memory usage/memory leaks (goal: 10 B/evt in
    LVL2, 1 kB/evt in EF)
  • Algorithm configuration and initialization time
  • Conditions database access
  • Multi-threaded mode