Workpackage 3: Automatic Performance Analysis and Grid Computing

Computer and Automation Research Institute, Hungarian Academy of Sciences


1
Workpackage 3: Automatic Performance Analysis and
Grid Computing

Peter Kacsuk
Laboratory of Parallel and Distributed Systems
MTA SZTAKI Research Institute
kacsuk_at_sztaki.hu
www.lpds.sztaki.hu

2
Contents

Goals of monitoring in a grid
Task 3.1 Requirement for performance analysis
Task 3.2 Performance monitoring and performance
data access (Grid Monitoring Architecture)
Task 3.3 Automatic performance analysis and grid
computing
Conclusions

3
Goals of grid monitoring

The question is
not how to measure resources (proven existing
solutions can be reused)
but how to deliver information to end-users and
system/grid administrators
Status information
Inform users/ administrators about the grid and
applications
available resources
status and progress of applications
Propagate errors to users/management
Performance monitoring to
tune the application,
use the grid more efficiently

4
Task 3.1 Requirement for performance analysis

Differences between high performance computers
and the grid
Grid (performance) monitoring scenarios

5
Differences

Complex distributed system => unexpectedly low
performance is often observed
Where is the bottleneck?
application
operating system
disks
network adapters on either the sending or the
receiving host
network switches, routers
Experience of the Netlogger group
40% network, 40% application, 20% host problems
application: 50% client, 50% server process
problems

6
Differences

Dynamic environment with
many computing resources
many services
many users
many single jobs
many distributed applications
=> SCALABILITY
World-wide distributed environment with
high latency
frequent faults
very heterogeneous resources
Security (authentication, authorisation, encoding)

7
Scenario 1 (S5 of GGF)

Fault detection and analysis, heartbeats
monitoring data is used to determine faults in
system components and applications
monitoring data could also be used to find the
cause of the faults
requirements
push model data delivery
heartbeat events are valid only for a short time
interval and must be delivered within that
interval
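
The two requirements above (push delivery and a short validity interval) can be illustrated with a minimal sketch. It is not taken from any of the tools discussed here; the class names, the `validity` and `timeout` parameters and the explicit `now` argument are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    source: str      # hypothetical identifier of the monitored process or host
    sent_at: float   # sender timestamp in seconds


class HeartbeatDetector:
    """Receives pushed heartbeats and reports sources that have gone silent."""

    def __init__(self, validity: float, timeout: float):
        self.validity = validity   # a heartbeat older than this on arrival is stale
        self.timeout = timeout     # silence longer than this is a suspected fault
        self.last_seen = {}

    def push(self, hb: Heartbeat, now: float) -> bool:
        """Accept a pushed heartbeat; return False if it arrived too late."""
        if now - hb.sent_at > self.validity:
            return False
        previous = self.last_seen.get(hb.source, hb.sent_at)
        self.last_seen[hb.source] = max(previous, hb.sent_at)
        return True

    def suspected(self, now: float) -> list:
        """Names of known sources whose last valid heartbeat is older than the timeout."""
        return sorted(s for s, t in self.last_seen.items() if now - t > self.timeout)
```

A late heartbeat is discarded rather than treated as proof of life, which is one possible reading of "valid only for a short time interval".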

8
HeartBeat

[Figure: heartbeat monitoring diagram. Labels: Application Level Fault Handler, System Monitoring Tools, Process and Host Heartbeat, Host 1, Host 2, Process Status Inquiry, Register/Unregister]

9
Scenario 2 (S6)

Job status/progress monitoring
determine if a job is running, has died or hung
follow the status of long running jobs
requirements
pull model data delivery
per job data
problems
finding the processes that belong to a job
either the jobs should identify themselves or we
need an interface to the scheduler to get this
information

10
Scenario 3 (S1, S13)

Application performance monitoring
determine performance characteristics of an
application
real-time visualisation of events
requirements
push model data delivery
large volume of data in real-time
data archival for off-line analysis
user defined event types

11
Scenario 3 (cont.)

requirements (cont.)
remote data processing (statistics, data
reduction)
dynamic sensor management
problems
finding the right producers before the
application is started (start-up problem)
remote data reduction: from list, scripts,
plug-ins
level of sensor control: enable/disable,
parameters

12
Start-up procedure

[Figure: start-up procedure diagram. Labels: scheduler, job manager, LAN, WAN, application process1, application process2, Local Monitor (two instances)]

13
Scenario 4 (S7)

Performance analysis of distributed systems
locate performance bottlenecks in complex
distributed systems
requirements (in addition to the requirements of
scenario 3)
very large volume of data from many different
sources in real-time
accurate and consistent measurements
accurate cross-site timestamps
comparable data from different sources

14
Scenario 4 (cont.)

problems
time synchronisation (use NTP?)
data accuracy: error bounds should be provided,
sensors should be able to limit their impact on
the monitored systems
data consistency: measurements should be
co-ordinated (producers may need to communicate)
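
As an illustration of cross-site timestamps with an explicit error bound, the sketch below applies the standard four-timestamp offset calculation used by NTP. The function names are hypothetical, and the sketch assumes a single request/reply exchange between a local and a remote host.

```python
def estimate_offset(t0, t1, t2, t3):
    """Estimate the remote clock offset from one request/reply exchange.

    t0: request sent (local clock)     t1: request received (remote clock)
    t2: reply sent (remote clock)      t3: reply received (local clock)
    Returns (offset, error_bound); the true offset lies within
    offset +/- error_bound.
    """
    round_trip_delay = (t3 - t0) - (t2 - t1)
    offset = ((t1 - t0) + (t2 - t3)) / 2
    return offset, round_trip_delay / 2


def to_local_time(remote_timestamp, offset):
    """Convert a timestamp taken on the remote clock to the local time base."""
    return remote_timestamp - offset
```

The error bound grows with the network round-trip time, so on high-latency wide-area links a measurement should be published together with its bound, not as a bare timestamp.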

15
Scenario 5 (S10, S8, S14)

Scheduling services, self tuning applications
determine the optimal resources for a job
applications could use monitoring data to adapt
themselves to the current situation
requirements
pull model data delivery
accurate usage/availability information from all
computing resources in the grid
freshness of data is critical
availability of accurate forecasted data is
desirable

16
Scenario 5 (cont.)

problems
gathering current measurements from all resources
forecasting needs archives
consumers may also need to announce themselves so
that producers can discover demand

17
Scenario 6 (S9)

Data replication services
data is replicated at or migrated to different
places to optimise access time
monitoring might be used to identify stale
replicas
requirements
pull model data delivery
current available space measurements from all
storage devices
data access patterns of applications
network measurements (preferably forecasted)

18
Scenario 6 (cont.)

problems
network topology information is needed
Should the Grid Information System (GIS) contain
this information?
Should monitoring services be involved in
discovering this information?

19
Scenario 7 (S11)

Accounting and auditing
account for resource utilisation
verify resource utilisation and level of service
requirements
per user or per bank account data
accuracy of measurement is important
privacy of data is important
problems
privacy: where are policies enforced?

20
Task 3.2 Performance monitoring and performance
data access (Grid Monitoring Architecture)

Why not use an existing monitoring system?
Most existing monitors cannot be embedded in
tools or applications
Limited fault detection functionality
System- or application-specific information but
not both
Lack of scalable data forwarding and gathering
mechanisms
Incompatibility with security and authentication
requirements of the grid

21
Grid Monitoring Architecture

Goals
Develop a general framework for monitoring and
management
Monitor and manage a variety of resources,
services and applications
Scalable, secure
Framework should be extensible for specific tasks
new components for monitoring and performing
actions
new logic for management
modular structure
Compatible with emerging standards
to be able to communicate with other systems

22
Existing monitoring tools

NetLogger
GRM/PROVE
Network Weather Service
Globus HeartBeat Monitor
Autopilot
NASA Information Power Grid - monitor
GGF GMA architecture

23
NetLogger monitoring structure

[Figure: NetLogger monitoring structure. Labels: Local Host, Trace file, Host 1, Host 2]

24
NetLogger Toolkit (cont.)

The NetLogger approach is novel in that it
combines
network, host, and application-level monitoring
to provide a complete view of the entire system.
Valuable tool for
isolating and correcting performance bottlenecks
debugging distributed applications
selecting hardware components to upgrade (to
alleviate bottlenecks)

25
NetLogger Toolkit (cont.)

Disadvantages
single user tool
manual set-up of each monitoring session
not scalable to large systems
single trace collector process
each event individually transferred through the
net
only push model
no security

26
GRM/PROVE performance analysis system

Part of the P-GRADE integrated development
environment.
Monitoring and visualising parallel programs at
GRAPNEL level.
Portability. Heterogeneous UNIX clusters.
Ensures evaluation of long-running programs
Support for debugger in P-GRADE with execution
visualisation
Collection of both statistics and event trace
No loss of trace data when the program aborts.
The execution up to the point of abortion can be
visualised.
Execution (and monitoring) remotely from the user
environment

27
Usage Performance Visualization
28
Various Windows in PROVE
29
GRM semi-on-line monitor

Semi-on-line
store trace events in local storage (off-line)
make it available for analysis at any time during
execution (on-line)
NO real-time (or fast response) requirements!
Advantages
analyse the state (performance) of the
application at any time
scalability: analyse trace data in smaller
sections and delete them when they are no longer
needed
Less overhead/intrusion to the execution system
than with on-line collection
Less stress on the collection side: pull model
instead of push model. Collection is initiated
only from the top.
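
The semi-on-line idea (buffer locally, hand over a section only when asked, then delete it) can be sketched as follows. This is not GRM code; the class and method names and the host-to-buffer mapping are assumptions for illustration.

```python
class LocalTraceBuffer:
    """Keeps trace events next to the application until a collector asks for them."""

    def __init__(self):
        self._events = []

    def record(self, timestamp, name):
        # Called by instrumented code; nothing is sent over the network here.
        self._events.append((timestamp, name))

    def collect(self):
        """Pull-style collection: return the buffered section and drop it locally."""
        section, self._events = self._events, []
        return section


class Collector:
    """Top-level collector; it alone decides when trace sections are transferred."""

    def __init__(self, buffers):
        self.buffers = buffers   # hypothetical mapping: host name -> LocalTraceBuffer

    def pull_all(self):
        merged = []
        for host, buf in self.buffers.items():
            merged.extend((ts, host, name) for ts, name in buf.collect())
        return sorted(merged)    # ordered by timestamp for analysis
```

Because each call to `collect` empties the local buffer, the trace can be analysed in sections during execution without the local storage growing for the whole run.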

30
Semi-on-line observation
31
GRM-Grid monitoring structure

[Figure: GRM-Grid monitoring structure. Labels: Local Host, Trace file (on the local host and on the remote side), Host 1, Host 2]

32
GRM/PROVE monitoring

Advantages
much more scalable than NetLogger
local data reduction is possible
Less overhead/intrusion to the execution system
than in NetLogger
Less stress on the collection side: pull model
instead of push model. Collection is initiated
only from the top.
Drawbacks
single user tool
no security
no standard data format

33
Network Weather Service

NWS is a distributed system that
periodically
monitors and
dynamically forecasts the performance of
various networks and computational resources
over a given time interval.

34
Network Weather Service

Components
Sensors
network, cpu, memory, disk monitors
network bandwidth, latency, round-trip time,
connection time
Memory
store event records in flat text files
each type of events from a sensor in a different
file
Name Server
directory capability used to bind process and
data names with low-level contact information
(e.g. address, port).
LDAP protocol
Forecaster
predicts performance data for the future based on
information stored in Memory components.
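
To make the Forecaster's role concrete, here is a minimal sketch of one way to forecast from stored measurements: run several simple predictors over the history and use the one with the smallest accumulated error. It is an illustrative assumption, not the NWS implementation; the predictor set and the function name are hypothetical.

```python
from statistics import mean, median

# Candidate predictors; each maps the measurement history to a predicted next value.
PREDICTORS = {
    "last": lambda history: history[-1],
    "mean": lambda history: mean(history),
    "median": lambda history: median(history),
}


def forecast(measurements):
    """Return (predictor_name, predicted_next_value) for a series of measurements."""
    if len(measurements) < 2:
        raise ValueError("need at least two measurements")
    errors = {name: 0.0 for name in PREDICTORS}
    # Replay the history: predict each value from the values that preceded it.
    for i in range(1, len(measurements)):
        history, actual = measurements[:i], measurements[i]
        for name, predict in PREDICTORS.items():
            errors[name] += abs(predict(history) - actual)
    best = min(errors, key=errors.get)
    return best, PREDICTORS[best](measurements)
```

For a steadily rising series the "last value" predictor wins, while for a series that oscillates around a constant level an averaging predictor does better.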

35
Network Weather Service
36
Network Weather Service

Advantages
scalable for many users and resources using
several Memory processes to store data
large network testing and monitoring using clique
(grouping) technique
intrusiveness to the monitored system can be kept
low using passive monitoring techniques (active
techniques are available as well)
Forecasting (schedulers need predictions not
archives!)

37
Network Weather Service

Disadvantages
resolution in seconds
does not scale to frequently updated monitoring
information
no support for application monitoring
non-standard message format
query/response model => no error reporting
name server and forecaster are centralized
no security (this will change when Globus GIS is
used)

38
HeartBeat Monitor

Monitors process state (system or application)
for status report and fault detection.
Components
Client library (HBMCL)
registers/unregisters a process with the HBLM
Local Monitor (HBLM)
checks status of registered processes on its host
reports status periodically to HBMDC(s)
Data Collector (HBMDC)
receives reports sent by HBLMs and incorporates
these reports into its local repository
infers the unavailability or failure of monitored
components
calls activation callback functions registered by
applications for fault management

39
HeartBeat Monitor
40
HeartBeat Monitor

Advantages
process status monitoring and fault detection
scalable for many resources and users
Disadvantages
no support for performance monitoring
non-standard, non-extensible message format

41
NASA IPG monitor

3 basic components
Sensors, actuators, an event service
Users write monitors and fault managers, with
help from components provided

[Figure: component diagram. Labels: Event Service, Monitor, Sensors, Fault Manager, Actuators]

42
Actuators

An actuator performs an action
Input parameters
Output parameters
Common actuator API
Basic actuators are implemented
Send email, run application, ...
Users can implement their own actuators

43
Event Service

Propagates events from producers to consumers
Publisher-subscriber paradigm
Consumers subscribe for events from producers
Asynchronous delivery of events
Automatic delivery of events to all interested
clients
Request-reply paradigm (planned)
Consumer requests an event, the producer replies
with it
Communication
TCP
Future: UDP, SSL-based authentication and
encryption
XML for events and messages
Publisher and subscriber APIs

44
Higher-Level Components

Built on the three basic components
Sensor manager
Manages sensors, subscriptions, and queries
Directory service
Holds information about monitors
Allows managers to find them
Currently being defined
Expert system (CLIPS)
Rules for fault management
Currently under investigation

45
Autopilot

Autopilot is a distributed performance
measurement and resource control system that is
based on the Pablo performance toolkit.
Its goal is to support monitoring and steering of
complex distributed applications.

46
Grid Monitoring Architecture

Global Grid Forum proposal for a new monitoring
architecture
Basic building blocks
sensor
producer
consumer
event directory service
communication protocols
event schemes

47
Grid Monitoring Architecture

[Figure: architecture diagram; the only surviving label is XML]

48
Open questions

Producer-Consumer event schema and protocols
GGF proposes XML
Sensors and Producers separated or not?
Support for application monitoring in the
proposed architecture
Steering of systems, applications
Where and how to store monitoring information?

49
Comparison of Representative Grid Monitoring
Tools

Report of LPDS, 2000 www.lpds.sztaki.hu
Comparison metrics
Scalability
Intrusiveness
Validity of Information
Data format
Extensibility
Communication
Security
Measurement

50
Conclusion

The grid is a very complex world-wide distributed
system with many resources, services and users =>
conventional approaches are not adequate
There are many scenarios for monitoring the grid
(Task 3.1)
Several architecture designs are under
development, some of them for performance
analysis (Task 3.2)
No research yet on automatic performance analysis
in the grid => Task 3.3 is crucial in the project

51

?
Thank you
