Workpackage 3: Automatic Performance Analysis and Grid Computing

Description:

Computer and Automation Research Institute. Hungarian Academy of Sciences ... Monitoring and visualising parallel programs at GRAPNEL level. Portability. ... – PowerPoint PPT presentation

Slides: 52
Provided by: Kati177

1
Workpackage 3: Automatic Performance Analysis and
Grid Computing
  • Peter Kacsuk
  • Laboratory of Parallel and Distributed Systems
  • MTA SZTAKI Research Institute
  • kacsuk_at_sztaki.hu
  • www.lpds.sztaki.hu

2
Contents
  • Goals of monitoring in a grid
  • Task 3.1 Requirement for performance analysis
  • Task 3.2 Performance monitoring and performance
    data access (Grid Monitoring Architecture)
  • Task 3.3 Automatic performance analysis and grid
    computing
  • Conclusions

3
Goals of grid monitoring
  • The question is
  • not how to measure resources (inherit good old
    solutions)
  • but how to deliver information to end-users and
    system/grid administrators
  • Status information
  • Inform users/administrators about the grid and
    applications
  • available resources
  • status and progress of applications
  • Propagate errors to users/management
  • Performance monitoring to
  • tune the application,
  • use the grid more efficiently

4
Task 3.1 Requirement for performance analysis
  • Differences between high performance computers
    and the grid
  • Grid (performance) monitoring scenarios

5
Differences
  • Complex distributed system => often observe
    unexpectedly low performance
  • Where is the bottleneck?
  • application
  • operating system
  • disks
  • network adapters on either the sending or the
    receiving host
  • network switches, routers
  • Experience of the Netlogger group
  • 40% network, 40% application, 20% host problems
  • application: 50% client, 50% server process
    problems

6
Differences
  • Dynamic environment with
  • many computing resources
  • many services
  • many users
  • many single jobs
  • many distributed applications => SCALABILITY
  • World-wide distributed environment with
  • high latency
  • frequent faults
  • very heterogeneous resources
  • Security (authentication, authorisation, encoding)

7
Scenario 1 (S5 of GGF)
  • Fault detection and analysis, heartbeats
  • monitoring data is used to determine faults in
    system components and applications
  • monitoring data could also be used to find the
    cause of the faults
  • requirements
  • push model data delivery
  • heartbeat events are valid only for a short time
    interval and should be delivered in this time
    constraint
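
The validity requirement above can be sketched as follows. This is a minimal illustration only and is not taken from any of the tools discussed; the `Heartbeat` class, its `validity` field and the `accept` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heartbeat:
    source: str       # host or process that emitted the heartbeat
    sent_at: float    # emission time in seconds
    validity: float   # seconds during which the event is still meaningful


def accept(event: Heartbeat, received_at: float) -> bool:
    """Return True if a pushed heartbeat arrived inside its validity window."""
    age = received_at - event.sent_at
    return 0.0 <= age <= event.validity


hb = Heartbeat(source="host1", sent_at=100.0, validity=5.0)
on_time = accept(hb, received_at=103.0)   # delivered within the window
too_late = accept(hb, received_at=109.0)  # delivered after the window closed
```

The sketch assumes sender and receiver timestamps are comparable, which in a grid is itself a problem (see the time synchronisation issue of Scenario 4).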

8
HeartBeat
[Figure: heartbeat monitoring diagram. Labels: Application Level Fault Handler; System Monitoring Tools; Process and Host Heartbeat; Host 1; Host 2; Process Status Inquiry; Register/Unregister]

9
Scenario 2 (S6)
  • Job status/progress monitoring
  • determine if a job is running, has died or hung
  • follow the status of long running jobs
  • requirements
  • pull model data delivery
  • per job data
  • problems
  • finding the processes that belong to a job:
    either the jobs should identify themselves or we
    need an interface to the scheduler to get this
    information

10
Scenario 3 (S1, S13)
  • Application performance monitoring
  • determine performance characteristics of an
    application
  • real-time visualisation of events
  • requirements
  • push model data delivery
  • large volume of data in real-time
  • data archival for off-line analysis
  • user defined event types

11
Scenario 3 (cont.)
  • requirements (cont.)
  • remote data processing (statistics, data
    reduction)
  • dynamic sensor management
  • problems
  • finding the right producers before the
    application is started (start-up problem)
  • remote data reduction: from list, scripts,
    plug-ins
  • level of sensor control: enable/disable,
    parameters

12
Start-up procedure
[Figure: start-up procedure diagram. Labels: scheduler; job manager; LAN; WAN; application process1; application process2; Local Monitor (two instances)]

13
Scenario 4 (S7)
  • Performance analysis of distributed systems
  • locate performance bottlenecks in complex
    distributed systems
  • requirements (in addition to the requirements of
    scenario 3)
  • very large volume of data from many different
    sources in real-time
  • accurate and consistent measurements
  • accurate cross-site timestamps
  • comparable data from different sources

14
Scenario 4 (cont.)
  • problems
  • time synchronisation (use NTP?)
  • data accuracy: error bounds should be provided,
    sensors should be able to limit their impact on
    the monitored systems
  • data consistency: measurements should be
    co-ordinated (producers may need to communicate)
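
As an illustration of time synchronisation with an explicit error bound, the sketch below uses the standard four-timestamp offset estimate that NTP is built on. It is not an NTP implementation; the function names and the sample timestamps are assumptions made for the example.

```python
from typing import Tuple


def estimate_offset(t0: float, t1: float, t2: float, t3: float) -> Tuple[float, float]:
    """Estimate a remote clock's offset and an error bound from one exchange.

    t0: request sent (local clock)     t1: request received (remote clock)
    t2: reply sent (remote clock)      t3: reply received (local clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    round_trip = (t3 - t0) - (t2 - t1)
    # The true offset lies within +/- round_trip / 2 of the estimate.
    return offset, round_trip / 2.0


def to_local_time(remote_timestamp: float, offset: float) -> float:
    """Map a remote event timestamp onto the local time axis."""
    return remote_timestamp - offset


# Hypothetical exchange: the remote clock runs about 10 s ahead.
offset, bound = estimate_offset(t0=100.0, t1=110.02, t2=110.03, t3=100.07)
local_event_time = to_local_time(110.50, offset)
```

Reporting the bound together with each converted timestamp gives consumers the error bounds that the data accuracy item asks for.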

15
Scenario 5 (S10, S8, S14)
  • Scheduling services, self tuning applications
  • determine the optimal resources for a job
  • applications could use monitoring data to adapt
    themselves to the current situation
  • requirements
  • pull model data delivery
  • accurate usage/availability information from all
    computing resources in the grid
  • freshness of data is critical
  • availability of accurate forecasted data is
    desirable

16
Scenario 5 (cont.)
  • problems
  • gathering current measurements from all resources
  • forecasting needs archives; maybe consumers
    should be able to announce themselves as well to
    allow producers to discover demands

17
Scenario 6 (S9)
  • Data replication services
  • data is replicated at or migrated to different
    places to optimise access time
  • monitoring might be used to identify stale
    replicas
  • requirements
  • pull model data delivery
  • current available space measurements from all
    storage devices
  • data access patterns of applications
  • network measurements (preferably forecasted)

18
Scenario 6 (cont.)
  • problems
  • network topology information is needed
  • Should the Grid Information System (GIS) contain
    this information?
  • Should monitoring services be involved in
    discovering this information?

19
Scenario 7 (S11)
  • Accounting and auditing
  • account for resource utilisation
  • verify resource utilisation and level of service
  • requirements
  • per user or per bank account data
  • accuracy of measurement is important
  • privacy of data is important
  • problems
  • privacy: where are policies enforced?

20
Task 3.2 Performance monitoring and performance
data access (Grid Monitoring Architecture)
  • Why not use an existing monitoring system?
  • Most existing monitors cannot be embedded in
    tools or applications
  • Limited fault detection functionality
  • System- or application-specific information but
    not both
  • Lack of scalable data forwarding and gathering
    mechanisms
  • Incompatibility with security and authentication
    requirements of the grid

21
Grid Monitoring Architecture
  • Goals
  • Develop a general framework for monitoring and
    management
  • Monitor and manage a variety of resources,
    services and applications
  • Scalable, secure
  • Framework should be extensible for specific tasks
  • new components for monitoring and performing
    actions
  • new logic for management
  • modular structure
  • Compatible with emerging standards
  • to be able to communicate with other systems

22
Existing monitoring tools
  • NetLogger
  • GRM/PROVE
  • Network Weather Service
  • Globus HeartBeat Monitor
  • Autopilot
  • NASA Information Power Grid - monitor
  • GGF GMA architecture

23
NetLogger monitoring structure
[Figure: NetLogger monitoring structure. Labels: Local Host; Trace file; Host 1; Host 2]

24
NetLogger Toolkit (cont.)
  • The NetLogger approach is novel in that it
    combines
  • network, host, and application-level monitoring
    to provide a complete view of the entire system.
  • Valuable tool for
  • isolating and correcting performance bottlenecks
  • debugging distributed applications
  • selecting hardware components to upgrade (to
    alleviate bottlenecks)

25
NetLogger Toolkit (cont.)
  • Disadvantages
  • single user tool
  • hand-made set up of monitoring session
  • not scalable to large systems
  • single trace collector process
  • each event individually transferred through the
    net
  • only push model
  • no security

26
GRM/PROVE performance analysis system
  • Part of the P-GRADE integrated development
    environment.
  • Monitoring and visualising parallel programs at
    GRAPNEL level.
  • Portability. Heterogeneous UNIX clusters.
  • Ensures evaluation of long-running programs
  • Support for debugger in P-GRADE with execution
    visualisation
  • Collection of both statistics and event trace
  • No loss of trace data at program abortion. The
    execution to the point of abortion can be
    visualised.
  • Execution (and monitoring) remotely from the user
    environment

27
Usage Performance Visualization
28
Various Windows in PROVE
29
GRM semi-on-line monitor
  • Semi-on-line
  • store trace events in local storage (off-line)
  • make it available for analysis at any time during
    execution (on-line)
  • NO real-time (or fast response) requirements!
  • Advantages
  • analyse the state (performance) of the
    application at any time
  • scalability: analyse trace data in smaller
    sections and delete them when they are no longer
    needed
  • Less overhead/intrusion to the execution system
    than with on-line collection
  • Less stress on the collection side: pull model
    instead of push model. Collection is initiated
    only from the top.
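
The semi-on-line idea can be sketched as a local buffer that is drained only on request. This is not GRM code; `LocalTraceBuffer` and its methods are hypothetical names, and a real monitor would keep the events in local files or shared memory rather than in a Python list.

```python
from typing import List


class LocalTraceBuffer:
    """Hypothetical per-host trace store for a semi-on-line monitor."""

    def __init__(self) -> None:
        self._events: List[str] = []

    def record(self, event: str) -> None:
        # Instrumented code only appends locally; nothing is sent over the network.
        self._events.append(event)

    def collect(self) -> List[str]:
        # Called by the collector (pull model). The returned section is
        # dropped locally, so storage stays bounded between collections.
        section, self._events = self._events, []
        return section


buffer = LocalTraceBuffer()
buffer.record("send p1->p2")
buffer.record("recv p2<-p1")
first = buffer.collect()     # the collector asks for what has accumulated
buffer.record("exit p1")
second = buffer.collect()
```

Because the collector decides when `collect` is called, the application never blocks on a slow or overloaded analysis side.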

30
Semi-on-line observation
31
GRM-Grid monitoring structure
[Figure: GRM-Grid monitoring structure. Labels: Local Host; Host 1; Host 2; Trace file (two instances)]

32
GRM/PROVE monitoring
  • Advantages
  • much more scalable than NetLogger
  • local data reduction is possible
  • Less overhead/intrusion to the execution system
    than in NetLogger
  • Less stress on the collection side: pull model
    instead of push model. Collection is initiated
    only from the top.
  • Drawbacks
  • single user tool
  • no security
  • no standard data format

33
Network Weather Service
  • NWS is a distributed system that
  • periodically
  • monitors and
  • dynamically forecasts the performance of
  • various networks and computational resources
  • over a given time interval.

34
Network Weather Service
  • Components
  • Sensors
  • network, cpu, memory, disk monitors
  • network bandwidth, latency, round-trip time,
    connection time
  • Memory
  • store event records in flat text files
  • each type of events from a sensor in a different
    file
  • Name Server
  • directory capability used to bind process and
    data names with low-level contact information
    (e.g. address, port).
  • LDAP protocol
  • Forecaster
  • predicts performance data for the future based on
    information stored in Memory components.
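
The Forecaster role can be illustrated with a small sketch that replays the stored measurements, scores several simple predictors by their accumulated error, and forecasts with the best one. This is only an illustration of that selection idea, not NWS code; the three predictors and all names below are assumptions.

```python
from statistics import mean, median
from typing import Callable, Dict, List, Sequence, Tuple

Predictor = Callable[[Sequence[float]], float]

# Illustrative predictor set; a real forecaster may use different ones.
PREDICTORS: Dict[str, Predictor] = {
    "last": lambda history: history[-1],
    "mean": lambda history: mean(history),
    "median": lambda history: median(history),
}


def forecast(measurements: List[float]) -> Tuple[str, float]:
    """Replay the series, score each predictor, forecast with the best one."""
    errors = {name: 0.0 for name in PREDICTORS}
    for i in range(1, len(measurements)):
        history, actual = measurements[:i], measurements[i]
        for name, predict in PREDICTORS.items():
            errors[name] += abs(predict(history) - actual)
    best = min(errors, key=errors.get)
    return best, PREDICTORS[best](measurements)


# Hypothetical bandwidth samples (Mbit/s) with one transient spike.
samples = [10.0, 10.0, 10.0, 50.0, 10.0, 10.0]
best_name, predicted = forecast(samples)
```

On this series the median predictor accumulates the least error because it ignores the single spike.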

35
Network Weather Service
36
Network Weather Service
  • Advantages
  • scalable for many users and resources using
    several Memory processes to store data
  • large network testing and monitoring using clique
    (grouping) technique
  • intrusiveness to the monitored system can be kept
    low using passive monitoring techniques (active
    techniques are available as well)
  • Forecasting (schedulers need predictions not
    archives!)

37
Network Weather Service
  • Disadvantages
  • resolution in seconds
  • not scalable for frequent monitor information
  • no support for application monitoring
  • non-standard message format
  • query/response model => no error report
  • name server and forecaster are centralized
  • no security (this will change when Globus GIS
    is used)

38
HeartBeat Monitor
  • Monitors process state (system or application)
    for status report and fault detection.
  • Components
  • Client library (HBMCL)
  • register/unregister process to HBLM
  • Local Monitor (HBLM)
  • checks status of registered processes on its host
  • reports status periodically to HBMDC(s)
  • Data Collector (HBMDC)
  • receives reports sent by HBLMs and incorporates
    these reports into its local repository
  • infers the unavailability or failure of monitored
    components
  • calls activation callback functions registered by
    applications for fault management
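
The Data Collector behaviour described above (store reports, infer failure from missing reports, call registered callbacks) can be sketched as follows. This is not the Globus HBM API; the class, method names and the timeout rule are assumptions made for illustration.

```python
from typing import Callable, Dict, List


class DataCollector:
    """Hypothetical stand-in for a heartbeat data collector."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout                     # tolerated silence in seconds
        self.last_report: Dict[str, float] = {}    # local repository
        self.callbacks: List[Callable[[str], None]] = []

    def on_failure(self, callback: Callable[[str], None]) -> None:
        self.callbacks.append(callback)

    def report(self, process: str, now: float) -> None:
        # Called for each status report sent by a local monitor.
        self.last_report[process] = now

    def check(self, now: float) -> List[str]:
        # Infer failure of every process whose reports have stopped.
        failed = [p for p, seen in self.last_report.items()
                  if now - seen > self.timeout]
        for process in failed:
            del self.last_report[process]
            for callback in self.callbacks:
                callback(process)
        return failed


alerts: List[str] = []
collector = DataCollector(timeout=10.0)
collector.on_failure(alerts.append)
collector.report("solver@host1", now=0.0)
collector.report("solver@host2", now=0.0)
collector.report("solver@host1", now=8.0)
failed = collector.check(now=12.0)   # host2 has been silent for 12 s
```

Note that a missing report cannot distinguish a dead process from a dead host or a broken network link, so the collector can only infer unavailability.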

39
HeartBeat Monitor
40
HeartBeat Monitor
  • Advantages
  • process status monitoring and fault detection
  • scalable for many resources and users
  • Disadvantages
  • no support for performance monitoring
  • non-standard, non-extendible message format

41
NASA IPG monitor
  • 3 basic components
  • Sensors, actuators, an event service
  • Users write monitors and fault managers, with
    help from components provided

[Figure: NASA IPG monitor components. Labels: Event Service; Monitor; Sensors; Fault Manager; Actuators]

42
Actuators
  • An actuator performs an action
  • Input parameters
  • Output parameters
  • Common actuator API
  • Basic actuators are implemented
  • Send email, run application, ...
  • Users can implement their own actuators

43
Event Service
  • Propagates events from producers to consumers
  • Publisher-subscriber paradigm
  • Consumers subscribe for events from producers
  • Asynchronous delivery of events
  • Automatic delivery of events to all interested
    clients
  • Request-reply paradigm on the way
  • Consumer requests an event, the producer replies
    with it
  • Communication
  • TCP
  • Future: UDP, SSL-based authentication and
    encryption
  • XML for events and messages
  • Publisher and subscriber APIs

44
Higher-Level Components
  • Built on three basic components
  • Sensor manager
  • Manages sensors, subscriptions, and queries
  • Directory service
  • Holds information about monitors
  • Allows managers to find them
  • Currently defining
  • Expert system (CLIPS)
  • Rules for fault management
  • Currently investigating

[Figure label: Directory Service]

45
Autopilot
  • Autopilot is a distributed performance
    measurement and resource control system that is
    based on the Pablo performance toolkit.
  • Its goal is to support monitoring and steering of
    complex distributed applications.

46
Grid Monitoring Architecture
  • Global Grid Forum proposal for a new monitoring
    architecture
  • Basic building blocks
  • sensor
  • producer
  • consumer
  • event directory service
  • communication protocols
  • event schemes

47
Grid Monitoring Architecture
[Figure: Grid Monitoring Architecture diagram. Label: XML]

48
Open questions
  • Producer-Consumer event schema and protocols
  • GGF proposes XML
  • Sensors and Producers separated or not?
  • Support for application monitoring in the
    proposed architecture
  • Steering of systems, applications
  • Where and how to store monitoring information?

49
Comparison of Representative Grid Monitoring
Tools
  • Report of LPDS, 2000 www.lpds.sztaki.hu
  • Comparison metrics
  • Scalability
  • Intrusiveness
  • Validity of Information
  • Data format
  • Extensibility
  • Communication
  • Security
  • Measurement

50
Conclusion
  • Grid is a very complex world-wide distributed
    system with many resources, services and users =>
    conventional approaches are not adequate
  • There are many scenarios for monitoring the grid
    (task 3.1)
  • Several architecture designs are under
    development, some of them for performance
    analysis (task 3.2)
  • No research yet on automatic performance analysis
    in the grid => Task 3.3 is crucial in the project

51

?
Thank you