GridMonitor: Integration of Large Scale Facility Monitoring With MDS

Slides: 17

Provided by: bruce278

Learn more at: https://chep03.ucsd.edu

Transcript and Presenter's Notes

1
GridMonitor: Integration of Large Scale Facility Monitoring With MDS

Richard Baker, Antonio Chan
Jason Smith, Dantong Yu
USATLAS/RHIC Computing Facility
Brookhaven National Lab

2
Outline

Requirements
System Framework, Structure and Characteristics
I Ganglia and Its Information Provider
II Relational Database Based Archiving and Its Information Provider
Gridview and GStat, Front End System
http://heppc1.uta.edu/atlas/grid-status/mds.gremlin.usatlas.bnl.gov.html
Current Status and Future Work

3
Requirements

Modularity and Extensibility: Make Use of Existing Monitoring Pieces
Flexibility: Adjustable to the Dynamics of the Monitored Systems
Overhead: Non-intrusive
Scalability
Security, Consistency, Inter-operability, Etc-bility

4
What Needs to Be Monitored

Linux Farm Monitoring
Description:
About 1100 Dual CPU Linux Nodes
Performance Data Must Be Summarized for Advertising to the Grid
Performance Events Required:
Configuration Information
Status Information: CPU Load (1, 5, 10, 15), Memory Load, Disk Load, and Network Load
Example Usage: A Resource Broker Might Ask About the Availability of Linux Farm System Resources in Order to Plan the Efficient Execution of Tasks

5
More

Network Monitoring
Description:
8 USATLAS Testbeds
Publish the Connectivity of These Testbeds, Monitor the Health of the USATLAS Network
Archived Performance Data Can Be Used to Predict the Network Behavior, So a User Can Choose the Source and Destination for File Replication
Performance Events Required:
Bandwidth, Delay (Round Trip Time), Trace Route
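
As a rough illustration of the replication use case above, the following Python sketch picks a replica source from archived bandwidth measurements, using the mean of past samples as a naive prediction. The site names, the sample values, and the `pick_source` helper are all hypothetical; the slides do not say how a client would rank sites.

```python
from statistics import mean

# Hypothetical archive: (source, destination) -> list of
# (bandwidth_mbps, rtt_ms) samples. All values are invented.
ARCHIVE = {
    ("site_a", "site_c"): [(80.0, 40.0), (90.0, 42.0)],
    ("site_b", "site_c"): [(120.0, 65.0), (110.0, 70.0)],
}


def pick_source(archive, destination):
    """Return the source with the highest mean archived bandwidth."""
    candidates = {}
    for (src, dst), samples in archive.items():
        if dst == destination and samples:
            # Mean of past bandwidth is used as a naive predictor.
            candidates[src] = mean(bw for bw, _rtt in samples)
    if not candidates:
        raise ValueError(f"no archived paths to {destination}")
    return max(candidates, key=candidates.get)
```

A real predictor would likely weight recent samples more heavily and take round trip time into account; the mean is used here only to keep the sketch short.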

6
Monitoring Framework
7
Monitoring System Components

Four Tier Structure
Sensors
Host: Ganglia, top, /proc and LSF Host Load
Archive System (Database System)
Round Robin Database (RRD)
Relational Database: unixODBC, MyODBC, MySQL Database
Information Providers
Monitoring and Discovery Service (MDS 2.2), GLUE Schema, Customized Ganglia Client Tool Reporting the Latest Monitoring Data, and Database Client Tools Reporting the Summary Information
Front-end Browsing System
Gridview, GStat (Grid Visualization Tool Developed at Univ. of Texas at Arlington)
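
To make the sensor tier concrete, here is a minimal Python sketch of a /proc-based host load sensor. It relies only on the standard Linux /proc/loadavg layout (1, 5 and 15 minute load averages first); the function names and the returned dictionary keys are assumptions made for illustration, not part of the system described in the slides.

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict of load averages."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("unexpected /proc/loadavg format")
    # The first three fields are the 1, 5 and 15 minute load averages.
    one, five, fifteen = (float(f) for f in fields[:3])
    return {"load_1": one, "load_5": five, "load_15": fifteen}


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the load averages of the local Linux host."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```

Keeping the parsing separate from the file read makes it easy to swap the data source, for example to feed the same function from a test string or from another sensor.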

8
Advantages

Information Provider Provides a Cache for the Newest Value From the MySQL Database
Non-intrusiveness: Information Provider Can Eliminate Random User Accesses to the Database Server
Scalability Can Be Significantly Increased
1000 Linux Nodes Are Being Monitored
Network Connectivity of Eight USATLAS Testbeds: Each Site Monitors the Paths From Itself to the Other Seven. Network Topology and Traffic Can Be Easily Constructed
Flexibility
Independent of Sensors: Many Sensors Can Be Easily Plugged In As Long As Each Has a Well Defined Protocol and API. We Could Switch Among Ganglia, top, /proc
Archive System Is Independent of the Underlying Database
Can Be an RDBMS (Oracle, MySQL, Sybase, Informix), Flat Files, or Objectivity, As Long As the ODBC Driver Is Available
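
The caching idea above can be sketched in a few lines of Python. This is only an illustration of a time-based cache placed in front of a database query: the `CachedProvider` class, its parameters, and the injected `fetch` callable are hypothetical and are not taken from the actual information provider.

```python
import time


class CachedProvider:
    """Serve the newest value from a cache, refreshing at most once per TTL."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # callable that queries the database
        self._ttl = ttl_seconds    # how long a cached value stays valid
        self._clock = clock        # injectable clock, useful for testing
        self._value = None
        self._stamp = None

    def get(self):
        now = self._clock()
        # Hit the database only when the cache is empty or has expired.
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = self._fetch()
            self._stamp = now
        return self._value
```

With a TTL of 600 seconds, any number of client lookups inside a ten minute window would cost the database server a single query, which is the non-intrusiveness argument made on this slide.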

9
I Ganglia Monitoring with MDS

Ganglia Information Provider
Front-end: GLUE Schema, http://www.cnaf.infn.it/~sergio/datatag/glue/
Back-end: XML

[Figure: Gmond daemons on cluster multicast channels (Cluster A Multicast Channel) send XML to filtered Gmetad instances (Layered Gmetad); the Ganglia IP turns XML into GLUE for MDS]

10
I Ganglia Monitoring with MDS

gremlin grid-info-search -x -h spider.usatlas.bnl.gov -s one

# ATLAS Linux Cluster, local, grid
dn: cl=ATLAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: ATLAS Linux Cluster
GlueClusterUniqueID: ATLAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute

# PHOBOS CAS Linux Cluster, local, grid
dn: cl=PHOBOS CAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
GlueClusterName: PHOBOS CAS Linux Cluster
GlueClusterUniqueID: PHOBOS_CAS_Linux_Cluster-RCF_and_ACF_Linux_Farm_Group
GlueClusterService: compute

# STAR CAS Linux Cluster, local, grid
dn: cl=STAR CAS Linux Cluster, mds-vo-name=local, o=grid
objectClass: GlueClusterTop
objectClass: GlueCluster
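
Entries like the ones above could be derived from Ganglia XML along the following lines. This Python sketch is an assumption-laden illustration: the sample XML is hand-written and heavily simplified, the `clusters_to_ldif` helper is hypothetical, and the rule for building GlueClusterUniqueID (cluster name, a hyphen, the grid name, spaces replaced by underscores) is only inferred from the search output, not confirmed by the slides.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample; real Ganglia XML carries many more
# attributes and metrics per host.
SAMPLE_XML = """<GANGLIA_XML VERSION="2.5.3" SOURCE="gmetad">
  <GRID NAME="RCF and ACF Linux Farm Group">
    <CLUSTER NAME="ATLAS Linux Cluster">
      <HOST NAME="node1.example.org">
        <METRIC NAME="load_one" VAL="0.42" TYPE="float"/>
      </HOST>
    </CLUSTER>
  </GRID>
</GANGLIA_XML>"""


def clusters_to_ldif(xml_text, suffix="mds-vo-name=local, o=grid"):
    """Emit one GLUE-style LDIF entry per cluster found in the XML."""
    root = ET.fromstring(xml_text)
    entries = []
    for grid in root.iter("GRID"):
        # Only direct children, so nested grids are not counted twice.
        for cluster in grid.findall("CLUSTER"):
            name = cluster.get("NAME")
            # Assumed naming rule, inferred from the output above.
            unique_id = f"{name}-{grid.get('NAME')}".replace(" ", "_")
            entries.append("\n".join([
                f"dn: cl={name}, {suffix}",
                "objectClass: GlueClusterTop",
                "objectClass: GlueCluster",
                f"GlueClusterName: {name}",
                f"GlueClusterUniqueID: {unique_id}",
                "GlueClusterService: compute",
            ]))
    return "\n\n".join(entries)
```

Calling `print(clusters_to_ldif(SAMPLE_XML))` prints a single entry for the ATLAS Linux Cluster in the same shape as the search result.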

11
II Farm Monitoring

Linux Farm Is Divided Into Different Sub-clusters Based on Site Policy, Different Experiments, OS and Version, CPU Speed. A Sub-cluster Contains the Hosts With the Same Configuration
BNL ATLAS Farm Is Partitioned Into Sub-clusters by CPU Speed: Cpu400mhz, Cpu700mhz, Cpu1ghz, Cpu1.4ghz and Cpu2.4ghz
The Status Information of a Sub-cluster Is Summarized From All Nodes in This Sub-cluster
Grid Resource Broker Schedules at the Level of Farm Sub-clusters
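
The per-sub-cluster summary described above might look like the following Python sketch. The input record layout (`subcluster`, `cpus`, `load_5`) and the `summarize` function are assumptions made for the example; the slides do not specify which statistics are computed beyond an average load.

```python
from collections import defaultdict


def summarize(nodes):
    """Aggregate per-node records into one summary per sub-cluster.

    Each node is a dict with the hypothetical keys 'subcluster',
    'cpus' and 'load_5'.
    """
    groups = defaultdict(list)
    for node in nodes:
        groups[node["subcluster"]].append(node)

    summary = {}
    for name, members in groups.items():
        summary[name] = {
            "hosts": len(members),
            "cpus": sum(m["cpus"] for m in members),
            "avg_load_5": sum(m["load_5"] for m in members) / len(members),
        }
    return summary
```

A resource broker would then compare a handful of sub-cluster summaries instead of inspecting every node.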

12
Information Schema (Linux Farm Monitoring)

Queue-Info
objectclass ( 1.3.6.1.4.1.3536.2.6.0.0.0.0
    NAME 'Queue-Info' SUP 'Mds' STRUCTURAL
    MUST ( MdsQueueNumberOfCpu $
           MdsQueueSpeed $
           MdsQueueAverageLoad $
           MdsQueueAverageUserPercent $
           MdsQueueAverageSysPercent ) )
Needs to Be Replaced by the GLUE Schema

13
Backend Data Structure

Node Status Information
mysql> describe node_load;
+------------+------------------+------+-----+---------+----------------+
| Field      | Type             | Null | Key | Default | Extra          |
+------------+------------------+------+-----+---------+----------------+
| load_index | int(10) unsigned |      | PRI | NULL    | auto_increment |
| sampletime | timestamp(14)    | YES  | MUL | NULL    |                |
| machine_id | varchar(31)      |      |     |         |                |
| owner      | varchar(8)       |      |     |         |                |
| load_5     | float(10,2)      |      |     | 0.00    |                |
| user_cpu   | float(10,2)      |      |     | 0.00    |                |
| sys_cpu    | float(10,2)      |      |     | 0.00    |                |
+------------+------------------+------+-----+---------+----------------+

14
Information Provider (Linux Farm Monitoring)

# generate Farm information every 10 minutes
dn: MdsFarmQueueName=1000, MdsHostNodeDomainName=usatlas.bnl.gov, Mds-Host-hn=gremlin.usatlas.bnl.gov, Mds-Vo-name=local, o=grid
objectclass: GlobusTop
objectclass: GlobusActiveObject
objectclass: GlobusActiveSearch
type: exec
path: /usr/local/globus-new/customize
base: mds-farm-batch-info.pl
args: -dn MdsFarmQueueName=1000,MdsHostNodeDomainName=usatlas.bnl.gov,Mds-Host-hn=gremlin.usatlas.bnl.gov,Mds-Vo-name=local,o=grid -ttl 900
cachetime: 600
timelimit: 20
sizelimit: 400
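
The configuration above tells MDS to run a provider script and read LDIF from its output. The real provider is the Perl script named in the configuration and its output is not shown in the slides, so the following Python sketch is purely illustrative: the `queue_info_ldif` function, the number formatting, and the sample values are assumptions, while the attribute names come from the Queue-Info schema.

```python
def queue_info_ldif(dn, cpus, speed, avg_load, avg_user, avg_sys):
    """Format one Queue-Info entry as LDIF text.

    The unit of 'speed' is not stated in the slides; it is passed
    through unchanged here.
    """
    lines = [
        f"dn: {dn}",
        "objectclass: Queue-Info",
        f"MdsQueueNumberOfCpu: {cpus}",
        f"MdsQueueSpeed: {speed}",
        f"MdsQueueAverageLoad: {avg_load:.2f}",
        f"MdsQueueAverageUserPercent: {avg_user:.2f}",
        f"MdsQueueAverageSysPercent: {avg_sys:.2f}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Invented sample values for a single sub-cluster entry.
    print(queue_info_ldif(
        "MdsFarmQueueName=1000, MdsHostNodeDomainName=usatlas.bnl.gov, "
        "Mds-Host-hn=gremlin.usatlas.bnl.gov, Mds-Vo-name=local, o=grid",
        cpus=200, speed=1000, avg_load=0.75, avg_user=35.0, avg_sys=4.5,
    ))
```

Because MDS caches the result for the configured cachetime, the script itself can stay simple and stateless.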

15
Observation from Grid-View
16
Current Status and Future Work

Current Status
Sensors: Local Monitoring Tools Put Less Than 1 Percent CPU Load (Non-intrusive)
Improved the Ganglia Information Provider: It Can Obtain Information From Both Gmond and Gmetad
Multiple Hierarchical Clusters Are Supported
Future Work
Merge the Ganglia RRD Information Provider and the Archive DB Information Provider
Work With the Ganglia Team and the GLUE Schema Group, Help to Define Requirements for What Information Should Be Monitored for Job Scheduling
Automate the Mapping From XML to LDIF (via GLUE Schema?), Provide Flexibility
Continue to Optimize the Information Provider to Deliver Data Faster
Scalability Test