Monitoring, Accounting and Automated Decision Support for the ALICE Experiment Based on the MonALISA Framework


1
Monitoring, Accounting and Automated Decision
Support for the ALICE Experiment Based on the
MonALISA Framework
  • Catalin Cirstoiu, Costin Grigoras, Latchezar
    Betev, Alexandru Costan, Iosif Legrand
  • 25/06/2007
  • HPDC 2007 Workshop on Grid Monitoring
  • Monterey, California

2
Contents
  • Monitoring requirements
  • MonALISA overview
  • Application monitoring
  • Monitoring architecture in AliEn
  • Jobs monitoring
  • Traffic monitoring
  • Services monitoring
  • Nodes monitoring
  • Actions framework
  • Feature snapshots

3
Monitoring Requirements
  • Global view of the entire distributed system
  • Least-intrusive
  • As accurate as possible
  • Best-effort data transport
  • Minimizing the requirements for open ports
  • Providing
      • Near real-time information
      • Long-term history of aggregated data
  • On key parameters like
      • System status
      • Resource usage
  • Helping with
      • Correlating events
      • System debugging
      • Generating reports
      • Taking automated actions based on the monitored
        data

4
MonALISA Overview
  • MonALISA is a dynamic distributed framework
  • Collects any type of information from different
    systems
  • Aggregates and analyzes it in near real time
  • Provides support for automated control decisions
    and global optimization of workflows in complex
    distributed systems.

[Diagram: MonALISA service architecture. Data modules,
filters and agents feed a data cache service backed by
a data store (Postgres, MySQL). Services register with
the lookup services, where clients discover them. Java
clients and WS clients (other services) obtain data via
the ML proxy or the web service (WSDL, SOAP), using
predicates and agents. Applications send data to the
service; configuration control uses SSL.]
5
ML Discovery System & Services
  • Hierarchical structure of loosely coupled
    services
  • Independent autonomous entities able to
      • Publish their existence
      • Discover other available Jini-enabled services
      • Use a dynamic set of proxies to cooperate with
        them

[Diagram: layered architecture, top to bottom.
Global services or clients: clients, repositories, HL
services.
Proxies: dynamic load balancing, scalability and
replication, security, AAA for clients.
MonALISA services and agents: distributed system for
gathering and analyzing information.
Network of JINI-LUSs (secure and public): distributed
dynamic discovery, based on a lease mechanism and REN.]
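
The lease mechanism behind the dynamic discovery can be illustrated with a minimal sketch. This is not the Jini lookup service API: `LeaseRegistry`, its methods and the 60-second default are hypothetical names and values chosen for illustration. The idea shown is only that a service stays discoverable while it keeps renewing its lease, and drops out on its own once it stops.

```python
import time


class LeaseRegistry:
    """Toy lookup registry: entries vanish unless their lease is renewed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable clock, convenient for testing
        self._expiry = {}     # service name -> time at which the lease ends

    def register(self, name, lease_seconds=60.0):
        """Register a service, or extend the lease of a known one."""
        self._expiry[name] = self._clock() + lease_seconds

    # Renewing is the same operation as registering in this sketch.
    renew = register

    def discover(self):
        """Return the names of services whose lease has not expired."""
        now = self._clock()
        # Drop expired entries so a crashed service disappears by itself.
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

A real lookup service would also notify interested clients when entries appear or disappear; that part is left out here.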
6
ApMon: Application Monitoring
  • Lightweight library of APIs (C, C++, Java, Perl,
    Python) that can be used to send any information
    to MonALISA Services
  • High communication performance
  • Flexible
  • Accounting
  • System monitoring

[Diagram: applications instrumented with ApMon send
data to MonALISA Services on the MonALISA hosts. The
ApMon configuration is generated automatically by a
servlet / CGI script (config servlet) and is reloaded
dynamically. ApMon also provides system monitoring.
Label: "No Lost Packages". Example parameters:
load1 0.24, processes 97, pages_in 83.]
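
ApMon sends its data as UDP datagrams, which fits the best-effort transport requirement. The sketch below imitates only that fire-and-forget pattern. It is not the ApMon API and does not reproduce ApMon's wire encoding: JSON is used as a stand-in payload, and `encode_sample`, `send_sample` and the cluster/node names are hypothetical.

```python
import json
import socket
import time


def encode_sample(cluster, node, params):
    """Serialize one sample; JSON is a stand-in, not ApMon's wire format."""
    record = {"cluster": cluster, "node": node,
              "time": time.time(), "params": params}
    return json.dumps(record).encode("utf-8")


def send_sample(destination, cluster, node, params):
    """Fire-and-forget send: no reply is awaited, errors are swallowed."""
    payload = encode_sample(cluster, node, params)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(payload, destination)
            return True
        except OSError:
            # Monitoring must never break the monitored application.
            return False
```

A call such as `send_sample(("127.0.0.1", 9999), "Jobs", "worker01", {"load1": 0.24, "processes": 97, "pages_in": 83})` returns immediately whether or not anything is listening; the address and port are placeholders.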
7
Monitoring architecture in AliEn
[Diagram: monitored parameters are collected as
aggregated data in the MonALISA Repository, which
provides the long history DB, alerts and actions.
Parameters shown: job slots, net in/out, run time,
cpu time, free space, processes, load, jobs status,
vsz, sockets, rss, migrated mbytes, active sessions,
nr. of files, open files, queued JobAgents, job status,
cpu ksi2k, disk used, MyProxy status. LCG tools are
also shown as a data source.]
  • http://pcalimonitor.cern.ch:8889/

8
Job status monitoring
  • Global summaries
      • For each/all conditions
      • For each/all sites
      • For each/all users
  • Running and cumulative
  • Error status
      • From job agents
      • From central services
  • Real-time map view
  • Integrated pie charts
  • History plots
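
The "for each/all sites" and "for each/all users" summaries amount to counting jobs per status under different groupings. A minimal sketch, assuming each job is a dictionary with `site`, `user` and `status` fields (a hypothetical record layout, not the AliEn schema):

```python
from collections import Counter


def summarize(jobs, key):
    """Count jobs per (key value, status) pair, e.g. key='site' or 'user'."""
    counts = Counter()
    for job in jobs:
        counts[(job[key], job["status"])] += 1
    return counts


def totals(jobs):
    """Count jobs per status over all sites and users."""
    return Counter(job["status"] for job in jobs)
```

The same two functions cover both views: `summarize(jobs, "site")` gives the per-site breakdown and `totals(jobs)` the global one.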

9
Real-time Map
10
Integrated Pie Charts
11
History Plots, Annotations
12
Job Resource Usage
  • Cumulative parameters
      • CPU time, CPU KSI2K
      • Wall time, Wall KSI2K
      • Read and written files
      • Input and output traffic (xrootd)
  • Running parameters
      • Resident memory
      • Virtual memory
      • Open files
      • Workdir size
      • Disk usage
      • CPU usage
  • Aggregated per site

13
Job Network Traffic
  • Based on the xrootd transfers from every job
  • Aggregated statistics for
      • Sites (incoming, outgoing, site to site,
        internal)
      • Storage Elements (incoming, outgoing)
  • Of
      • Read and written files
      • Transferred MB/s
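
The four per-site views (incoming, outgoing, site to site, internal) can all be derived from one stream of transfer records. A minimal sketch, assuming each record is a `(source_site, destination_site, megabytes)` tuple; that layout and the function name are hypothetical, not the format reported by xrootd:

```python
from collections import defaultdict


def aggregate_traffic(transfers):
    """Aggregate (source_site, destination_site, megabytes) records."""
    stats = {
        "incoming": defaultdict(float),      # site -> MB received from elsewhere
        "outgoing": defaultdict(float),      # site -> MB sent elsewhere
        "internal": defaultdict(float),      # site -> MB moved inside the site
        "site_to_site": defaultdict(float),  # (src, dst) -> MB
    }
    for src, dst, megabytes in transfers:
        if src == dst:
            stats["internal"][src] += megabytes
        else:
            stats["outgoing"][src] += megabytes
            stats["incoming"][dst] += megabytes
            stats["site_to_site"][(src, dst)] += megabytes
    return stats
```

Dividing each total by the length of the reporting interval would give the MB/s rates.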

14
Individual job tracking
  • Based on AliEn shell commands
      • top, ps, spy, jobinfo, masterjob
  • Using the GUI ML Client
      • Status and resource usage per job

15
AliEn and LCG Services monitoring
  • AliEn services
      • Periodically checked
      • PID check and SOAP call
      • Simple functional tests
      • SE space usage
      • Efficiency
  • LCG environment and tools
      • Integrating the VoBOX tests previously run by ML
        within the SAM framework
      • Proxy lifetime, gsiscp, LCG CE/SE, job
        submission, BDII, local catalog, software area,
        etc.
      • Error messages in case of failure
      • Efficiency
  • ML Alerts are used for problem notification
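
The PID half of the periodic service check can be sketched with the standard POSIX idiom of sending signal 0. The function names are hypothetical, and defining efficiency as the fraction of successful checks is an assumption; the slides do not give the formula. The SOAP call is not shown.

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only).

    Expects a positive PID; zero and negative values address process
    groups in os.kill and are not meaningful here.
    """
    try:
        os.kill(pid, 0)   # signal 0: existence check, nothing is delivered
    except ProcessLookupError:
        return False      # no such process
    except PermissionError:
        return True       # process exists but belongs to another user
    return True


def efficiency(results):
    """Fraction of successful checks in a list of booleans (assumed metric)."""
    return sum(results) / len(results) if results else 0.0
```

A live PID only shows that the process exists, which is why it is paired with a functional call to the service.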

16
FTD/FTS Monitoring
  • Status of the transfers
  • Transfer rates
  • Success/failures
  • Efficiency via ARDA Experiment Dashboard

17
VOBox/Head node monitoring
  • Machine parameters, real-time and history
  • Load, memory and swap usage, processes, sockets

18
Actions framework
  • Based on monitoring information, actions can be
    taken in
      • ML Service
      • ML Repository
  • Actions can be triggered by
      • Values above/below given thresholds
      • Absence/presence of values
      • Correlation between multiple values
  • Possible action types
      • Alerts
          • e-mail
          • Instant messaging
          • RSS feeds
      • External commands
      • Event logging
[Diagram: monitoring data on traffic, jobs, hosts and
apps, plus sensors (temperature, humidity, A/C power),
feeds the ML Services, which take actions based on
local information (local decisions), and the ML
Repository, which takes actions based on global
information (global decisions).]
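
The three trigger types listed for the actions framework (thresholds, absence/presence, correlation) can be modelled as small predicates over the latest monitored values. This is a minimal sketch, not MonALISA's configuration syntax; every function, parameter and action name below is hypothetical.

```python
def above(name, threshold):
    """Trigger when the named value is present and above the threshold."""
    return lambda latest: latest.get(name) is not None and latest[name] > threshold


def below(name, threshold):
    """Trigger when the named value is present and below the threshold."""
    return lambda latest: latest.get(name) is not None and latest[name] < threshold


def absent(name):
    """Trigger when no value has been received for the named parameter."""
    return lambda latest: latest.get(name) is None


def all_of(*triggers):
    """Correlation: fire only when every sub-trigger fires."""
    return lambda latest: all(trigger(latest) for trigger in triggers)


def evaluate(rules, latest):
    """Return the actions whose trigger fires on the latest values."""
    return [action for trigger, action in rules if trigger(latest)]


# Example rule set; names and limits are illustrative only.
EXAMPLE_RULES = [
    (above("mysql_vsz_mb", 2048), "restart_mysql"),
    (below("waiting_jobs", 200), "resubmit_jobs"),
    (absent("service_heartbeat"), "email_admin"),
    (all_of(above("load1", 10), below("free_space_gb", 5)), "log_event"),
]
```

Keeping triggers as plain functions makes correlation a matter of composition: `all_of` accepts any mix of the other three.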
19
Alerts and actions
  • The MySQL daemon is automatically restarted when it
    runs out of memory. Trigger: threshold on VSZ
    memory usage
  • The ALICE production job queue is automatically kept
    full by automatic resubmission. Trigger: threshold
    on the number of aliprod waiting jobs
  • Administrators are kept up to date on the status of
    the services. Trigger: presence/absence of
    monitored information
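
The queue refill action can be reduced to one decision: how many jobs to submit once the number of waiting jobs falls below the threshold. A minimal sketch under assumed semantics; `low_water` and `target` are hypothetical parameters, since the slides give neither the actual threshold nor the refill size.

```python
def jobs_to_submit(waiting, low_water, target):
    """Return how many jobs to submit to bring the queue back to `target`.

    Nothing is submitted while at least `low_water` jobs are waiting, so
    the action fires only when the threshold is crossed.
    """
    if low_water > target:
        raise ValueError("low_water must not exceed target")
    if waiting >= low_water:
        return 0
    return target - waiting
```

Using a refill target above the trigger threshold avoids resubmitting a handful of jobs on every monitoring cycle.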
20
Facts and figures
  • Raw parameters when running 4K jobs
      • Unique data series: 300K, with frequency 1-15
        minutes
      • Message rate: 16K / minute
  • Site aggregated parameters
      • Message rate: 2K / minute
      • Bandwidth rate: 300 Kbps
  • Repository
      • DB size: 70 GB, 700M records
      • Data reduction schema
          • 2 months with 2-minute bins
          • 1 year with 30-minute bins
          • Forever with 2-hour bins
  • Response time
      • App -> ML Service: network speed
      • ML Service -> ML Clients
          • Subscribed parameters: network speed
          • One-shot requests (history requests): 5 seconds
      • Repository dynamic history requests: 300 ms /
        page
  • No incoming ports are required
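
The data reduction schema keeps recent data at fine resolution and older data in progressively wider bins. A minimal sketch of the idea, with two assumptions that the slides do not state: samples in a bin are reduced by averaging, and "2 months" and "1 year" are approximated as 60 and 365 days. All names are hypothetical.

```python
from collections import defaultdict

MINUTE = 60
DAY = 24 * 60 * MINUTE

# (maximum age in seconds, bin width in seconds), finest resolution first.
TIERS = [
    (60 * DAY, 2 * MINUTE),        # about 2 months in 2-minute bins
    (365 * DAY, 30 * MINUTE),      # 1 year in 30-minute bins
    (float("inf"), 120 * MINUTE),  # forever in 2-hour bins
]


def bin_width_for(age_seconds):
    """Pick the bin width of the first tier that still covers this age."""
    for max_age, width in TIERS:
        if age_seconds <= max_age:
            return width
    return TIERS[-1][1]


def reduce_series(samples, width):
    """Average (timestamp, value) samples into fixed-width time bins."""
    bins = defaultdict(list)
    for timestamp, value in samples:
        bins[int(timestamp // width) * width].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(bins.items())]
```

A history request would then be served from the finest tier that covers the requested interval, which keeps the number of points per plot roughly constant.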

21
Summary
  • The MonALISA framework has been used as a primary
    monitoring tool for the ALICE Grid since 2004
  • Presently the system is used for monitoring all
    (identified) services, jobs and network
    parameters necessary for Grid operation and
    debugging
  • The add-on tools for automatic event
    notification allow a more efficient reaction to
    problems
  • The framework's design and flexibility answer all
    requirements for a monitoring system
  • The accumulated information allows automated
    decision-making algorithms to be constructed and
    implemented, further increasing the efficiency
    of Grid operations

22
Thank you!
  • Questions?

http://alien.cern.ch http://monalisa.caltech.edu