Title: Workpackage 3: Automatic Performance Analysis and Grid Computing
1Workpackage 3: Automatic Performance Analysis and Grid Computing
- Peter Kacsuk
- Laboratory of Parallel and Distributed Systems
- MTA SZTAKI Research Institute
- kacsuk_at_sztaki.hu
- www.lpds.sztaki.hu
2Contents
- Goals of monitoring in a grid
- Task 3.1 Requirement for performance analysis
- Task 3.2 Performance monitoring and performance data access (Grid Monitoring Architecture)
- Task 3.3 Automatic performance analysis and grid computing
- Conclusions
3Goals of grid monitoring
- The question is
- not how to measure resources (reuse good existing solutions)
- but how to deliver information to end-users and system/grid administrators
- Status information
- Inform users/administrators about the grid and applications
- available resources
- status and progress of applications
- Propagate errors to users/management
- Performance monitoring to
- tune the application,
- use the grid more efficiently
4Task 3.1 Requirement for performance analysis
Differences between high performance computers
and the grid
Grid (performance) monitoring scenarios
5Differences
- Complex distributed system => unexpectedly low performance is often observed
- Where is the bottleneck?
- application
- operating system
- disks
- network adapters on either the sending or the receiving host
- network switches, routers
- Experience of the NetLogger group
- 40% network, 40% application, 20% host problems
- application: 50% client, 50% server process problems
6Differences
- Dynamic environment with
- many computing resources
- many services
- many users
- many single jobs
- many distributed applications => SCALABILITY
- World-wide distributed environment with
- high latency
- frequent faults
- very heterogeneous resources
- Security (authentication, authorisation, encoding)
7Scenario 1 (S5 of GGF)
- Fault detection and analysis, heartbeats
- monitoring data is used to determine faults in system components and applications
- monitoring data could also be used to find the cause of the faults
- requirements
- push model data delivery
- heartbeat events are valid only for a short time interval and should be delivered within this time constraint
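The validity requirement above can be illustrated with a minimal sketch. It assumes a hypothetical `Heartbeat` record that carries its own validity interval and a hypothetical `FaultDetector` that receives pushed heartbeats; none of these names come from the GGF scenarios, and the time values are passed in explicitly so that the behaviour is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    source: str       # hypothetical identifier of the monitored process or host
    sent_at: float    # sender timestamp in seconds
    valid_for: float  # validity interval in seconds


def is_fresh(hb, now):
    """A heartbeat only counts if it arrives inside its validity interval."""
    return 0.0 <= now - hb.sent_at <= hb.valid_for


class FaultDetector:
    """Receives pushed heartbeats and reports sources that went silent."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # source -> arrival time of last fresh heartbeat

    def push(self, hb, now):
        if not is_fresh(hb, now):
            return False  # delivered too late: the event is no longer valid
        self.last_seen[hb.source] = now
        return True

    def suspected_faults(self, now):
        return sorted(s for s, t in self.last_seen.items()
                      if now - t > self.timeout)
```

A source is only suspected after its last fresh heartbeat is older than `timeout`; a heartbeat that arrives after its validity interval is dropped rather than treated as a sign of life.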
8HeartBeat
[Figure: heartbeat monitoring diagram. Labels: Application Level Fault Handler; System Monitoring Tools; Process and Host Heartbeat (Host 1, Host 2); Process Status Inquiry; Register/Unregister]
9Scenario 2 (S6)
- Job status/progress monitoring
- determine if a job is running, has died or hung
- follow the status of long running jobs
- requirements
- pull model data delivery
- per job data
- problems
- finding the processes that belong to a job: either the jobs should identify themselves or we need an interface to the scheduler to get this information
10Scenario 3 (S1, S13)
- Application performance monitoring
- determine performance characteristics of an application
- real-time visualisation of events
- requirements
- push model data delivery
- large volume of data in real-time
- data archival for off-line analysis
- user defined event types
11Scenario 3 (cont.)
- requirements (cont.)
- remote data processing (statistics, data reduction)
- dynamic sensor management
- problems
- finding the right producers before the application is started (start-up problem)
- remote data reduction: from list, scripts, plug-ins
- level of sensor control: enable/disable, parameters
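The two control questions above (sensor control and remote data reduction) can be made concrete with a small sketch. The `Sensor` class, its `configure` method, the `sample_every` parameter and the `REDUCTIONS` table are all hypothetical names chosen for illustration; the slides do not prescribe an interface.

```python
from statistics import mean


class Sensor:
    """Hypothetical sensor that can be enabled/disabled and re-parameterised."""

    def __init__(self, name, sample_every=1):
        self.name = name
        self.enabled = False
        self.sample_every = sample_every  # keep every n-th raw event
        self._seen = 0
        self.buffer = []

    def configure(self, enabled=None, sample_every=None):
        # Dynamic sensor management: both settings may change at run time.
        if enabled is not None:
            self.enabled = enabled
        if sample_every is not None:
            self.sample_every = sample_every

    def record(self, value):
        if not self.enabled:
            return
        self._seen += 1
        if self._seen % self.sample_every == 0:
            self.buffer.append(value)


# Reductions selectable "from a list"; scripts or plug-ins could add entries.
REDUCTIONS = {"mean": mean, "max": max, "count": len}


def reduce_at_producer(sensor, reduction):
    """Apply the reduction next to the sensor and ship only the result."""
    if not sensor.buffer:
        return None
    result = REDUCTIONS[reduction](sensor.buffer)
    sensor.buffer.clear()
    return result
```

Only the reduced value would cross the network in this arrangement, which is the point of doing the reduction remotely.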
12Start-up procedure
[Figure: start-up procedure. Labels: scheduler; job manager; application process 1; application process 2; Local Monitor (two instances); LAN; WAN]
13Scenario 4 (S7)
- Performance analysis of distributed systems
- locate performance bottlenecks in complex distributed systems
- requirements (in addition to the requirements of scenario 3)
- very large volume of data from many different sources in real-time
- accurate and consistent measurements
- accurate cross-site timestamps
- comparable data from different sources
14Scenario 4 (cont.)
- problems
- time synchronisation (use NTP?)
- data accuracy: error bounds should be provided, sensors should be able to limit their impact on the monitored systems
- data consistency: measurements should be co-ordinated (producers may need to communicate)
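As an illustration of cross-site timestamps with error bounds, the sketch below uses the standard NTP-style estimate from one request/reply exchange: the clock offset is taken from the four timestamps and the true offset lies within half the round-trip delay of that estimate. The function names are hypothetical, and a real deployment would rely on an NTP daemon rather than this calculation.

```python
def estimate_offset(t0, t1, t2, t3):
    """Estimate the remote clock offset from one request/reply exchange.

    t0: request sent (local clock)    t1: request received (remote clock)
    t2: reply sent (remote clock)     t3: reply received (local clock)
    Returns (offset, bound): remote minus local clock, and the maximum
    error of that estimate (half the round-trip network delay).
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay / 2.0


def to_reference_time(local_ts, offset, bound):
    """Convert a local timestamp to an interval on the reference clock."""
    corrected = local_ts + offset
    return corrected - bound, corrected + bound
```

Reporting each timestamp as an interval rather than a point lets a consumer decide whether two events from different sites can be ordered at all.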
15Scenario 5 (S10, S8, S14)
- Scheduling services, self tuning applications
- determine the optimal resources for a job
- applications could use monitoring data to adapt themselves to the current situation
- requirements
- pull model data delivery
- accurate usage/availability information from all computing resources in the grid
- freshness of data is critical
- availability of accurate forecasted data is desirable
16Scenario 5 (cont.)
- problems
- gathering current measurements from all resources
- forecasting needs archives
- maybe consumers should be able to announce themselves as well, to allow producers to discover demand
17Scenario 6 (S9)
- Data replication services
- data is replicated at or migrated to different places to optimise access time
- monitoring might be used to identify stale replicas
- requirements
- pull model data delivery
- current available space measurements from all storage devices
- data access patterns of applications
- network measurements (preferably forecasted)
18Scenario 6 (cont.)
- problems
- network topology information is needed
- Should the Grid Information System (GIS) contain this information?
- Should monitoring services be involved in discovering this information?
19Scenario 7 (S11)
- Accounting and auditing
- account for resource utilisation
- verify resource utilisation and level of service
- requirements
- per user or per bank account data
- accuracy of measurement is important
- privacy of data is important
- problems
- privacy: where are policies enforced?
20Task 3.2 Performance monitoring and performance
data access (Grid Monitoring Architecture)
- Why not use an existing monitoring system?
- Most existing monitors cannot be embedded in tools or applications
- Limited fault detection functionality
- System- or application-specific information but not both
- Lack of scalable data forwarding and gathering mechanisms
- Incompatibility with security and authentication requirements of the grid
21Grid Monitoring Architecture
- Goals
- Develop a general framework for monitoring and management
- Monitor and manage a variety of resources, services and applications
- Scalable, secure
- Framework should be extensible for specific tasks
- new components for monitoring and performing actions
- new logic for management
- modular structure
- Compatible with emerging standards
- to be able to communicate with other systems
22Existing monitoring tools
- NetLogger
- GRM/PROVE
- Network Weather Service
- Globus HeartBeat Monitor
- Autopilot
- NASA Information Power Grid - monitor
- GGF GMA architecture
23NetLogger monitoring structure
[Figure: NetLogger monitoring structure. Labels: Local Host; Trace file; Host 1; Host 2]
24NetLogger Toolkit (cont.)
- The NetLogger approach is novel in that it combines
- network, host, and application-level monitoring to provide a complete view of the entire system.
- Valuable tool for
- isolating and correcting performance bottlenecks
- debugging distributed applications
- selecting hardware components to upgrade (to alleviate bottlenecks)
25NetLogger Toolkit (cont.)
- Disadvantages
- single user tool
- hand-made set up of monitoring session
- not scalable to large systems
- single trace collector process
- each event individually transferred through the net
- only push model
- no security
26GRM/PROVE performance analysis system
- Part of the P-GRADE integrated development environment.
- Monitoring and visualising parallel programs at GRAPNEL level.
- Portability: heterogeneous UNIX clusters.
- Ensures evaluation of long-running programs
- Support for debugger in P-GRADE with execution visualisation
- Collection of both statistics and event trace
- No loss of trace data when a program aborts. The execution up to the point of abortion can be visualised.
- Execution (and monitoring) remotely from the user environment
27Usage Performance Visualization
28Various Windows in PROVE
29GRM semi-on-line monitor
- Semi-on-line
- store trace events in local storage (off-line)
- make them available for analysis at any time during execution (on-line)
- NO real-time (or fast response) requirements!
- Advantages
- analyse the state (performance) of the application at any time
- scalability: analyse trace data in smaller sections and delete them if they are no longer needed
- Less overhead/intrusion to the execution system than with on-line collection
- Less stress on the collection side: pull model instead of push model. Collections initiated only from top.
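The semi-on-line idea can be sketched as follows. This is not GRM code: `LocalTraceBuffer` and `collect_all` are hypothetical names, and an in-memory list stands in for the local trace storage. Events are stored locally without any network traffic, and a collection started from the top pulls each section and deletes it at the source.

```python
class LocalTraceBuffer:
    """Hypothetical local monitor: stores events, hands them over on request."""

    def __init__(self):
        self._events = []

    def record(self, timestamp, event):
        self._events.append((timestamp, event))  # local only, no network use

    def collect(self):
        """Pull: return the section gathered so far and delete it locally."""
        section, self._events = self._events, []
        return section


def collect_all(buffers):
    """Collection initiated from the top: pull every buffer, merge by time."""
    merged = []
    for host, buf in buffers.items():
        merged.extend((ts, host, ev) for ts, ev in buf.collect())
    return sorted(merged)
```

Because each `collect` call empties the local buffer, the analysis side can work through the trace in sections of whatever size it chooses.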
30Semi-on-line observation
31GRM-Grid monitoring structure
[Figure: GRM-Grid monitoring structure. Labels: Local Host; Trace file; Host 1; Host 2; Trace file]
32GRM/PROVE monitoring
- Advantages
- much more scalable than NetLogger
- local data reduction is possible
- Less overhead/intrusion to the execution system than in NetLogger
- Less stress on the collection side: pull model instead of push model. Collections initiated only from top.
- Drawbacks
- single user tool
- no security
- no standard data format
33Network Weather Service
- NWS is a distributed system that
- periodically
- monitors and
- dynamically forecasts the performance of
- various networks and computational resources
- over a given time interval.
34Network Weather Service
- Components
- Sensors
- network, cpu, memory, disk monitors
- network bandwidth, latency, round-trip time, connection time
- Memory
- store event records in flat text files
- each type of events from a sensor in a different file
- Name Server
- directory capability used to bind process and data names with low-level contact information (e.g. address, port).
- LDAP protocol
- Forecaster
- predicts performance data for the future based on information stored in Memory components.
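To make the Forecaster role concrete, here is a minimal sketch of forecasting from stored measurements. It is an illustration only and does not reproduce the NWS code: the three predictors and the selection rule (lowest accumulated absolute error on the history) are assumptions chosen for the example.

```python
from statistics import mean, median

# Candidate predictors; each maps a measurement history to a forecast.
PREDICTORS = {
    "last": lambda h: h[-1],
    "mean": lambda h: mean(h),
    "median": lambda h: median(h),
}


def forecast(history):
    """Pick the predictor with the lowest past error; return (name, value)."""
    if len(history) < 2:
        raise ValueError("need at least two measurements")
    errors = {name: 0.0 for name in PREDICTORS}
    # Replay the history: predict each measurement from the ones before it.
    for i in range(1, len(history)):
        for name, predict in PREDICTORS.items():
            errors[name] += abs(predict(history[:i]) - history[i])
    best = min(errors, key=errors.get)
    return best, PREDICTORS[best](history)
```

With a history such as `[10, 10, 10, 50, 10, 10]` the median wins because it ignores the single spike, whereas for a steadily rising series the last value is the better guess.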
35Network Weather Service
36Network Weather Service
- Advantages
- scalable for many users and resources using several Memory processes to store data
- large network testing and monitoring using clique (grouping) technique
- intrusiveness to the monitored system can be kept low using passive monitoring techniques (active techniques are available as well)
- Forecasting (schedulers need predictions not archives!)
37Network Weather Service
- Disadvantages
- resolution in seconds
- not scalable for frequent monitor information
- no support for application monitoring
- non-standard message format
- query/response model => no error reporting
- name server and forecaster are centralized
- no security (this will change when Globus GIS is used)
38HeartBeat Monitor
- Monitors process state (system or application) for status report and fault detection.
- Components
- Client library (HBMCL)
- register/unregister process to HBLM
- Local Monitor (HBLM)
- checks status of registered processes on its host
- reports status periodically to HBMDC(s)
- Data Collector (HBMDC)
- receives reports sent by HBLMs and incorporates these reports into its local repository
- infers the unavailability or failure of monitored components
- calls activation callback functions registered by applications for fault management
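The division of work between the three components can be sketched as below. This is not the Globus HBM API: `LocalMonitor` and `DataCollector` are hypothetical stand-ins for the HBLM and HBMDC, the client library is reduced to the `register`/`unregister` calls, and the failure rule (no "alive" report for longer than `max_silence`) is an assumption.

```python
class DataCollector:
    """Stand-in for the HBMDC: stores reports and infers failures."""

    def __init__(self, max_silence):
        self.max_silence = max_silence
        self.repository = {}  # (host, pid) -> time of last "alive" report
        self.callbacks = []

    def on_failure(self, callback):
        self.callbacks.append(callback)

    def receive(self, host, statuses, now):
        for pid, alive in statuses.items():
            if alive:
                self.repository[(host, pid)] = now

    def forget(self, host, pid):
        self.repository.pop((host, pid), None)

    def check(self, now):
        failed = [key for key, seen in self.repository.items()
                  if now - seen > self.max_silence]
        for key in failed:
            del self.repository[key]
            for callback in self.callbacks:
                callback(*key)  # fault management hook of the application
        return failed


class LocalMonitor:
    """Stand-in for the HBLM: tracks registered processes on one host."""

    def __init__(self, host, collector):
        self.host, self.collector = host, collector
        self.registered = {}  # pid -> function returning True while alive

    def register(self, pid, is_alive):
        self.registered[pid] = is_alive

    def unregister(self, pid):
        self.registered.pop(pid, None)
        self.collector.forget(self.host, pid)  # clean exit is not a failure

    def report(self, now):
        statuses = {pid: probe() for pid, probe in self.registered.items()}
        self.collector.receive(self.host, statuses, now)
```

In use, `report` would be called periodically on each host and `check` periodically on the collector; a process that unregisters cleanly is removed from the repository so that it is not reported as failed.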
39HeartBeat Monitor
40HeartBeat Monitor
- Advantages
- process status monitoring and fault detection
- scalable for many resources and users
- Disadvantages
- no support for performance monitoring
- non-standard, non-extensible message format
41NASA IPG monitor
- 3 basic components
- Sensors, actuators, an event service
- Users write monitors and fault managers, with help from the components provided
[Figure: NASA IPG monitor components. Labels: Event Service; Monitor; Sensors; Fault Manager; Actuators]
42Actuators
- An actuator performs an action
- Input parameters
- Output parameters
- Common actuator API
- Basic actuators are implemented
- Send email, run application, ...
- Users can implement their own actuators
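A common actuator API of the kind described above might look like the following sketch. The class and method names are hypothetical and are not taken from the NASA IPG software; the only assumption carried over from the slide is that an actuator takes input parameters, performs an action, and returns output parameters.

```python
import subprocess
from abc import ABC, abstractmethod


class Actuator(ABC):
    """Hypothetical common API: input parameters in, output parameters out."""

    @abstractmethod
    def act(self, **inputs):
        """Perform the action and return a dict of output parameters."""


class RunApplication(Actuator):
    """Basic actuator: run a command and report how it ended."""

    def act(self, **inputs):
        done = subprocess.run(inputs["command"], capture_output=True, text=True)
        return {"exit_code": done.returncode, "stdout": done.stdout}


class RecordAlert(Actuator):
    """User-written actuator: append a message to an in-memory alert list."""

    def __init__(self):
        self.alerts = []

    def act(self, **inputs):
        self.alerts.append(inputs["message"])
        return {"alert_count": len(self.alerts)}
```

A fault manager can then call `act` on any actuator without knowing which action it performs, which is what makes user-written actuators interchangeable with the basic ones.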
43Event Service
- Propagates events from producers to consumers
- Publisher-subscriber paradigm
- Consumers subscribe for events from producers
- Asynchronous delivery of events
- Automatic delivery of events to all interested clients
- Request-reply paradigm on the way
- Consumer requests an event, the producer replies with it
- Communication
- TCP
- Future: UDP, SSL-based authentication and encryption
- XML for events and messages
- Publisher and subscriber APIs
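The two paradigms listed above can be sketched in a few lines. This is an in-process illustration under stated assumptions, not the IPG event service: `EventService`, its method names, and the XML element layout are hypothetical, delivery here is synchronous rather than asynchronous, and there is no TCP transport.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def to_xml(event_type, fields):
    """Encode an event as XML (the element layout is an assumption)."""
    event = ET.Element("event", {"type": event_type})
    for name, value in fields.items():
        ET.SubElement(event, name).text = str(value)
    return ET.tostring(event, encoding="unicode")


class EventService:
    def __init__(self):
        self._subscribers = defaultdict(list)  # event type -> callbacks
        self._latest = {}                      # event type -> last message

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, fields):
        """Publisher-subscriber: deliver to every interested consumer."""
        message = to_xml(event_type, fields)
        self._latest[event_type] = message
        for callback in list(self._subscribers[event_type]):
            callback(message)

    def request(self, event_type):
        """Request-reply: return the most recent event of this type, if any."""
        return self._latest.get(event_type)
```

A consumer that subscribes receives every event as it is published, while a consumer that only needs the current value can call `request` instead.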
44Higher-Level Components
- Built on three basic components
- Sensor manager
- Manages sensors, subscriptions, and queries
- Directory service
- Holds information about monitors
- Allows managers to find them
- Currently defining
- Expert system (CLIPS)
- Rules for fault management
- Currently investigating
45Autopilot
- Autopilot is a distributed performance measurement and resource control system that is based on the Pablo performance toolkit.
- Its goal is to support monitoring and steering of complex distributed applications.
46Grid Monitoring Architecture
- Global Grid Forum proposal for a new monitoring architecture
- Basic building blocks
- sensor
- producer
- consumer
- event directory service
- communication protocols
- event schemas
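How the building blocks fit together can be shown with a small sketch. It assumes the usual reading of the GMA proposal, in which the directory service is used only to discover producers and the monitoring data itself flows directly from producer to consumer; the class names, the `query` method and the event layout are hypothetical.

```python
class Directory:
    """Stand-in for the event directory service: event type -> producers."""

    def __init__(self):
        self._producers = {}

    def register(self, event_type, producer):
        self._producers.setdefault(event_type, []).append(producer)

    def lookup(self, event_type):
        return list(self._producers.get(event_type, []))


class Producer:
    """Wraps a sensor (here just a function) and answers consumer queries."""

    def __init__(self, name, sensor):
        self.name, self._sensor = name, sensor

    def query(self):
        return {"producer": self.name, "value": self._sensor()}


def consume(directory, event_type):
    """Consumer: discover producers via the directory, then ask them directly."""
    return [producer.query() for producer in directory.lookup(event_type)]
```

The directory never sees a measurement in this arrangement, so it stays small and is not a bottleneck for high-volume event streams.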
47Grid Monitoring Architecture
[Figure: Grid Monitoring Architecture diagram. Surviving label: XML]
48Open questions
- Producer-Consumer event schema and protocols
- GGF proposes XML
- Sensors and Producers separated or not?
- Support for application monitoring in the proposed architecture
- Steering of systems, applications
- Where and how to store monitoring information?
49Comparison of Representative Grid Monitoring
Tools
- Report of LPDS, 2000 (www.lpds.sztaki.hu)
- Comparison metrics
- Scalability
- Intrusiveness
- Validity of Information
- Data format
- Extensibility
- Communication
- Security
- Measurement
50Conclusion
- The grid is a very complex world-wide distributed system with many resources, services and users => conventional approaches are not adequate
- There are many scenarios for monitoring the grid (task 3.1)
- Several architecture designs are under development, some of them for performance analysis (task 3.2)
- No research yet on automatic performance analysis in the grid => Task 3.3 is crucial in the project
51Thank you