Monitoring MySQL master-slave replication with Nagios

# cd /etc/nagios/command

# wget http://www.james.rcpt.to/svn/trunk/nagios/check_mysql_replication/check_mysql_replication.pl

# chmod 755 check_mysql_replication.pl

Check the usage:

# ./check_mysql_replication.pl -h
check_mysql_replication.pl: check replication between MySQL database instances

 check_replication.pl [ --slave <host> ] [ --slave-pass <pass> ]
 [ --slave-port <d> ] [ --slave-user <user> ] [ --master <host> ]
 [ --master-pass <pass> ] [ --master-port <port> ] [ --master-user <user> ]
 [ --crit <positions> ] [ --warn <positions> ] [ --check-random-database ]
 [ --table-rows-diff-absolute-crit <number> ]
 [ --table-rows-diff-absolute-warn <number> ]
 [ --schema <db> ]

 --slave <host>         - MySQL instance running as a slave server
 --slave-port <d>       - port for the slave
 --slave-user <user>    - Username with File/Process/Super privs
 --slave-pass <pass>    - Password for above user
 --master <host>        - MySQL instance running as server (override)
 --master-port <d>      - port for the master (override)
 --master-user <user>   - Username for master (override)
 --master-pass <pass>   - Password for master
 --crit <positions>     - Number of complete master binlogs for critical state
 --warn <positions>     - Number of complete master binlogs for warning state
 --check-random-database - Select a random DB from the slave's list of
                databases and compare to the master's information for
                these (need SELECT priv)
 --table-rows-diff-absolute-crit <number> - If we do the check-random-database,
                then ensure that the change in row count between master and
                slave is below this threshold, and go critical if not
 --table-rows-diff-absolute-warn <number> - If we do the check-random-database,
                then ensure that the change in row count between master and
                slave is below this threshold, and go warning if not
 --schema <db>          - The database schema to use
 --help                 - This help page
 --version              - Script version information

By default, you should use your configured replication user, as you will
then only need to specify the user and password once, and this script will
find the master from the slave’s running configuration.

Critical and warning values are now measured as fractions of a complete master-
sized binlog. If your master has the default 1GB binlog size, then specifying
a warning value of 0.1 means that you will let the slave get 100MB out of
sync before warning; you may want to set warning to 0.01, and critical at 0.1.
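To make those fractions concrete, here is a quick sketch of the arithmetic (assuming the default 1 GB max_binlog_size mentioned above; the function name is mine, not part of the plugin):

```python
# How far behind (in bytes) the slave may fall before alerting,
# expressed as a fraction of one complete master binlog.
MAX_BINLOG_SIZE = 1 * 1024 ** 3  # default max_binlog_size: 1 GiB

def lag_bytes(fraction, binlog_size=MAX_BINLOG_SIZE):
    """Translate a --warn/--crit fraction into a byte threshold."""
    return int(fraction * binlog_size)

print(lag_bytes(0.01))  # warn after roughly 10 MiB of lag
print(lag_bytes(0.1))   # crit after roughly 100 MiB of lag
```

So --warn 0.01 --crit 0.1 alerts at roughly 10 MB and 100 MB of lag respectively.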

MySQL 3: GRANT File, Process on *.* TO repl@192.168.0.% IDENTIFIED BY <pass>
MySQL 4: GRANT Super, Replication client on *.* TO repl@192.168.0.% IDE…

If you want to use the check-random-database option, then the user needs
SELECT privileges on all replicated tables on the master and the slave.

Note: Any mysqldump tables (for backups) may lock large tables for a long
time. If you dump from your slave for this, then your master will gallop
away from your slave, and the difference will become large. The trick is to
set 'crit' above this difference and 'warn' below.

If you are using the host name "localhost" to connect to port forwards, you'll
probably hit the issue where MySQL uses the named pipe (socket) on the file
system instead of the TCP loopback address. Use host name "127.0.0.1".

(c) 2010 James Bromberger, james@rcpt.to, www.james.rcpt.to

First, test it on the command line:

# /etc/nagios/command/check_mysql_replication.pl --slave 192.168.1.132 --slave-user user --slave-pass passwd --master 192.168.1.131 --master-user user --master-pass passwd
OK: 0.000 diff, 0 secs, 192.168.1.131:3306 (5.0.77-log) -> 192.168.1.132:3306 (5.0.77-log)

It runs fine, so let's configure it in Nagios.

# vim objects/commands.cfg

Add the following section:

define command {
        command_name    check_mysql_replication
        command_line    /etc/nagios/command/check_mysql_replication.pl --slave $HOSTADDRESS$ --slave-user $ARG1$ --slave-pass $ARG2$ --master $ARG3$ --master-user $ARG4$ --master-pass $ARG5$
        }

# vim objects/hosts.cfg

Add the following section:

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       mysql_slave
        service_description             mysql replication
        check_command                   check_mysql_replication!user!passwd!192.168.1.131!user!passwd
        notifications_enabled           1
        }

Then reload the Nagios configuration:

# service nagios reload

Five minutes later I realized the result was not what I wanted: the check result inside Nagios differed from running the script by hand, and Nagios reported:

Service check did not exit properly

Seeing this error, I went back and checked the configuration twice more, but found nothing wrong with it.

Reading the error again carefully, it says the script does not exit properly. I looked at the script right away and found exits like these:

exit 3;

exit 2;

exit 1;

exit 0;

I had written a Nagios check script for Redis before, and vaguely remembered that Nagios is picky about this: you cannot call exit with bare numbers like that. So I changed the script as follows:

exit 3 becomes exit $ERRORS{"UNKNOWN"};
exit 2 becomes exit $ERRORS{"CRITICAL"};
exit 1 becomes exit $ERRORS{"WARNING"};
exit 0 becomes exit $ERRORS{"OK"};

and added one line to the use section at the top:

use utils qw($TIMEOUT %ERRORS &print_revision &support);

In addition, the following line must appear within the first 10 lines of the script:

# nagios: -epn

It tells Nagios not to run this script with its embedded Perl interpreter.
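For reference, the exit-status convention Nagios expects from any plugin, in any language, sketched here as a minimal Python plugin skeleton (the check/finish helpers and the thresholds are hypothetical, shown only to illustrate the 0/1/2/3 codes):

```python
import sys

# Nagios maps a plugin's exit status to a service state:
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

def check(value, warn, crit):
    """Classify a measured value against warning/critical thresholds."""
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        return "WARNING"
    return "OK"

def finish(state, message):
    """Print one status line (shown in the web UI) and exit with the matching code."""
    print(f"{state}: {message}")
    sys.exit(STATES[state])

# e.g. finish(check(0.5, warn=1.0, crit=2.0), "0.5 binlogs behind")  # would exit 0
```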

After another few minutes, the script was running normally inside Nagios.

 

Oddly, I have seen many people online using this script, and nobody mentions hitting this problem. Could it depend on the version or configuration in use?

 

Monitoring VMware ESXi with Nagios

I found a script online for monitoring VMware ESXi; after a little configuration it works very well.

The script:

http://exchange.nagios.org/directory/Plugins/Operating-Systems/*-Virtual-Environments/VMWare/Check-hardware-running-VMware-ESXi/details

 

Once the script is downloaded, just run it with the right arguments:

#./check_esx_wbem.py https://192.168.1.10 root passwd
20110310 01:53:16 Connection to https://192.168.1.10
20110310 01:53:16 Check classe CIM_ComputerSystem
20110310 01:53:16 Element Name = System Board 7:1
20110310 01:53:16 Element Op Status = 0
20110310 01:53:16 Element Name = Add-in Card 11:2
20110310 01:53:16 Element Op Status = 0
20110310 01:53:16 Element Name = localhost.localdomain
20110310 01:53:16 Element Name = Hardware Management Controller (Node 0)
20110310 01:53:16 Element Op Status = 0
20110310 01:53:16 Element Name = Controller 0 (PERC 6/i Integrated)
20110310 01:53:16 Element Op Status = 2
20110310 01:53:16 Check classe CIM_NumericSensor
20110310 01:53:17 Element Name = System Board 1 System Level
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Element Name = Power Supply 1 Voltage 1
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Element Name = Power Supply 1 Current 1
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Element Name = System Board 1 FAN 5 RPM
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Element Name = System Board 1 FAN 4 RPM
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Element Name = System Board 1 FAN 3 RPM
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Element Name = System Board 1 FAN 2 RPM
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Element Name = System Board 1 FAN 1 RPM
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Element Name = System Board 1 Ambient Temp
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Check classe CIM_Memory
20110310 01:53:17 Element Name = CPU1 Level-1 Cache
20110310 01:53:17 Element Op Status = 0
20110310 01:53:17 Element Name = CPU1 Level-2 Cache
20110310 01:53:17 Element Op Status = 0
20110310 01:53:17 Element Name = CPU1 Level-3 Cache
20110310 01:53:17 Element Op Status = 0
20110310 01:53:17 Element Name = CPU2 Level-1 Cache
20110310 01:53:17 Element Op Status = 0
20110310 01:53:17 Element Name = CPU2 Level-2 Cache
20110310 01:53:17 Element Op Status = 0
20110310 01:53:17 Element Name = CPU2 Level-3 Cache
20110310 01:53:17 Element Op Status = 0
20110310 01:53:17 Element Name = Memory
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Check classe CIM_Processor
20110310 01:53:17 Element Name = CPU1
20110310 01:53:17 Element Op Status = 2
20110310 01:53:17 Element Name = CPU2
20110310 01:53:17 Element Op Status = 15
20110310 01:53:17 Check classe CIM_RecordLog
20110310 01:53:18 Element Name = IPMI SEL
20110310 01:53:18 Element Op Status = 2
20110310 01:53:18 Check classe OMC_DiscreteSensor
20110310 01:53:19 Element Name = Add-in Card 2 SD vFLash Status 1
20110310 01:53:19 Element Name = Disk Drive Bay 3 ROMB Battery 0: Low
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 3 ROMB Battery 0: Failed
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Cable SAS B 0: Connected
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Cable SAS B 0: Config Error
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Cable SAS A 0: Connected
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Cable SAS A 0: Config Error
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 5: Drive Present
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 5: Drive Fault
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 5: Predictive Failure
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 5: Hot Spare
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 5: Parity Check In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 5: In Critical Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 5: In Failed Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 5: Rebuild In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 5: Rebuild Aborted
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 4: Drive Present
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 4: Drive Fault
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 4: Predictive Failure
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 4: Hot Spare
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 4: Parity Check In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 4: In Critical Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 4: In Failed Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 4: Rebuild In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 4: Rebuild Aborted
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 3: Drive Present
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 3: Drive Fault
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 3: Predictive Failure
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 3: Hot Spare
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 3: Parity Check In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 3: In Critical Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 3: In Failed Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 3: Rebuild In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 3: Rebuild Aborted
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 2: Drive Present
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 2: Drive Fault
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 2: Predictive Failure
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 2: Hot Spare
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 2: Parity Check In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 2: In Critical Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 2: In Failed Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 2: Rebuild In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 2: Rebuild Aborted
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 1: Drive Present
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 1: Drive Fault
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 1: Predictive Failure
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 1: Hot Spare
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 1: Parity Check In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 1: In Critical Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 1: In Failed Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 1: Rebuild In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 1: Rebuild Aborted
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 0: Drive Present
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 0: Drive Fault
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 0: Predictive Failure
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 0: Hot Spare
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 0: Parity Check In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 0: In Critical Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 0: In Failed Array
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 0: Rebuild In Progress
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Disk Drive Bay 1 Drive 0: Rebuild Aborted
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 Power Optimized 0: OEM
20110310 01:53:19 Element Name = System Board 1 Power Optimized 0: Unknown
20110310 01:53:19 Element Name = System Board 1 Power Optimized 0: Unknown
20110310 01:53:19 Element Name = System Board 1 Power Optimized 0: Unknown
20110310 01:53:19 Element Name = System Board 1 Fan Redundancy 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 Intrusion 0: General Chassis intrusion
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 OS Watchdog 0: Timer expired
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 OS Watchdog 0: Hard reset
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 OS Watchdog 0: Power down
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 OS Watchdog 0: Power cycle
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 Riser Config 0: Connected
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 Riser Config 0: Config Error
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Power Supply 1 Status 0: Presence detected
20110310 01:53:19 Element Name = Power Supply 1 Status 0: Failure detected
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Power Supply 1 Status 0: Predictive failure
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Power Supply 1 Status 0: Power Supply AC lost
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Power Supply 1 Status 0: Config Error: Vendor Mismatch
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 2 Status 0: IERR
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 2 Status 0: Thermal Trip
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 2 Status 0: Configuration Error
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 2 Status 0: Presence detected
20110310 01:53:19 Element Name = Processor 2 Status 0: Throttled
20110310 01:53:19 Element Name = Processor 1 Status 0: IERR
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 1 Status 0: Thermal Trip
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 1 Status 0: Configuration Error
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 1 Status 0: Presence detected
20110310 01:53:19 Element Name = Processor 1 Status 0: Throttled
20110310 01:53:19 Element Name = Disk Drive Bay 1 Presence  0: Present
20110310 01:53:19 Element Name = Disk Drive Bay 1 Presence  0: Absent
20110310 01:53:19 Element Name = Power Supply 2 Presence 0: Present
20110310 01:53:19 Element Name = Power Supply 2 Presence 0: Absent
20110310 01:53:19 Element Name = Power Supply 1 Presence 0: Present
20110310 01:53:19 Element Name = Power Supply 1 Presence 0: Absent
20110310 01:53:19 Element Name = Processor 2 Presence 0: Present
20110310 01:53:19 Element Name = Processor 2 Presence 0: Absent
20110310 01:53:19 Element Name = Processor 1 Presence 0: Present
20110310 01:53:19 Element Name = Processor 1 Presence 0: Absent
20110310 01:53:19 Element Name = System Board 1 Riser1 Pres 0: Present
20110310 01:53:19 Element Name = System Board 1 Riser1 Pres 0: Absent
20110310 01:53:19 Element Name = System Board 1 Riser2 Pres 0: Present
20110310 01:53:19 Element Name = System Board 1 Riser2 Pres 0: Absent
20110310 01:53:19 Element Name = System Board 1 Stor Adapt Pres 0: Present
20110310 01:53:19 Element Name = System Board 1 Stor Adapt Pres 0: Absent
20110310 01:53:19 Element Name = System Board 1 USB Cable Pres 0: Present
20110310 01:53:19 Element Name = System Board 1 USB Cable Pres 0: Absent
20110310 01:53:19 Element Name = System Board 1 iDRAC6 Ent Pres 0: Present
20110310 01:53:19 Element Name = System Board 1 iDRAC6 Ent Pres 0: Absent
20110310 01:53:19 Element Name = System Board 1 Heatsink Pres 0: Present
20110310 01:53:19 Element Name = System Board 1 Heatsink Pres 0: Absent
20110310 01:53:19 Element Name = System Board 1 1.05 V PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 1.0 AUX PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 1.0 LOM PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 1.1 V PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 8.0 V PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 1 1.8 PLL PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 2 1.8 PLL  PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 0.9V PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 1 VTT  0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 2 VTT  0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 1 MEM PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 2 MEM PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 5V PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 3.3V PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 1.8V PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 1.5V PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 1 0.75 VTT CPU1 PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 2 0.75 VTT CPU2 PG 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 2 VCORE 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = Processor 1 VCORE 0
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Element Name = System Board 1 CMOS Battery 0: Failed
20110310 01:53:19 Element Op Status = 2
20110310 01:53:19 Check classe VMware_StorageExtent
20110310 01:53:20 Element Name = Drive 0 in enclosure 32 on controller 0 Fw: SN11 – ONLINE
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Element Name = Drive 1 in enclosure 32 on controller 0 Fw: 3B05 – ONLINE
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Element Name = Drive 2 in enclosure 32 on controller 0 Fw: KA05 – ONLINE
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Element Name = Drive 3 in enclosure 32 on controller 0 Fw: 0001 – ONLINE
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Check classe VMware_Controller
20110310 01:53:20 Element Name = Controller 0 (PERC 6/i Integrated)
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Check classe VMware_StorageVolume
20110310 01:53:20 Element Name = RAID 10 Logical Volume 0 on controller 0, Drives(0e32,1e32,2e32,3e32)  – OPTIMAL
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Check classe VMware_Battery
20110310 01:53:20 Element Name = Battery on Controller 0
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Check classe VMware_SASSATAPort
20110310 01:53:20 Element Name = Port 0 on Controller 0
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Element Name = Port 1 on Controller 0
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Element Name = Port 2 on Controller 0
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Element Name = Port 3 on Controller 0
20110310 01:53:20 Element Op Status = 2
20110310 01:53:20 Element Name = Port 4 on Controller 0
20110310 01:53:20 Element Op Status = 15
20110310 01:53:20 Element Name = Port 5 on Controller 0
20110310 01:53:20 Element Op Status = 15
20110310 01:53:20 Element Name = Port 6 on Controller 0
20110310 01:53:20 Element Op Status = 15
20110310 01:53:20 Element Name = Port 7 on Controller 0
20110310 01:53:20 Element Op Status = 15
OK

The output is very verbose; you can silence it by changing verbose = 1 to verbose = 0 inside the script.
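When reading that output, the "Element Op Status" numbers are DMTF CIM OperationalStatus values (0 = Unknown, 2 = OK, 6 = Error, 15 = Dormant, and so on), which the script folds into a single Nagios verdict. A rough sketch of such a mapping; this is my own summary of the CIM enumeration, not the script's exact table:

```python
# DMTF CIM OperationalStatus values seen in the output above (subset).
CIM_OP_STATUS = {
    0: "Unknown",
    2: "OK",
    3: "Degraded",
    5: "Predictive Failure",
    6: "Error",
    15: "Dormant",
}

def classify(status):
    """Rough per-sensor health verdict (illustrative, not the script's exact logic)."""
    name = CIM_OP_STATUS.get(status, "Other")
    if name in ("OK", "Unknown", "Dormant"):   # 0, 2 and 15 all appear on a healthy host
        return "OK"
    if name in ("Degraded", "Predictive Failure"):
        return "WARNING"
    return "CRITICAL"

print(classify(2), classify(15))  # OK OK
```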

 

This script depends on the PyWBEM module; if it is not on the machine, install it:

PyWBEM homepage: http://pywbem.sourceforge.net/

# wget "http://downloads.sourceforge.net/project/pywbem/pywbem/pywbem-0.7/pywbem-0.7.0.tar.gz?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fpywbem%2Ffiles%2Fpywbem%2F&ts=1299742557&use_mirror=voxel"

# tar -xvzf pywbem-0.7.0.tar.gz

# cd pywbem-0.7.0

# python setup.py build

# python setup.py install

 

Now configure it in Nagios:

1) Add the command:

#vi /etc/nagios/objects/commands.cfg

Append at the end:

define command{
        command_name    check_esxi
        command_line    /etc/nagios/command/check_esx_wbem.py https://$HOSTADDRESS$ $ARG1$ $ARG2$
        }

 

2) Add the service:

#vi linuxhosts.cfg

Also append at the end:

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       esxi
        service_description             check_esxi
        check_command                   check_esxi!root!passwd
        notifications_enabled           1
        }

 

3) Reload Nagios

#service nagios reload

 

Nagios Monitoring System Installation and Configuration Guide

1. Revision History

Revision    Author(s)    Date    Summary of activity
1.0            罗辉        2008-11-19    Created this document

2. References

[1] http://www.nagiosexchange.org
[2] http://www.nagios.org/

3. Introduction

As a system administrator running dozens or hundreds of servers, one pressing need is to know the status of those servers and the services running on them, and to be notified the moment a server or service goes down so it can be handled promptly, minimizing the resulting impact and losses. Nagios exists to solve exactly this problem; among current monitoring tools, its stability and rich feature set have made it the industry's first choice.

As the Nagios official website puts it:
"Nagios is an open source host, service and network monitoring program. Who uses it? Lots of people, including many big companies and organizations."

4. How Nagios Monitoring Works

The figure above shows the Nagios monitoring architecture. Nagios can monitor in active mode or passive mode.

Active mode relies on Nagios's own plugins, optionally combined with NRPE: Nagios actively probes the monitored servers or services at the scheduled times. Passive mode relies on NSCA: NSCA checks the servers or services on a schedule and forwards the results to Nagios.

Passive mode suits large deployments (generally at least 100 or more monitored servers), where it effectively reduces the load on the monitoring server. With fewer servers, active mode is more convenient, since most of the configuration lives on the monitoring host and little setup is needed on the monitored side.

Our monitoring uses Nagios with NRPE in active mode.

5. Installing Nagios

5.1. Installing with YUM

Our monitoring server runs CentOS Linux 4.8, so we can install with yum:

# wget http://dag.wieers.com/rpm/packages/rpmforge-release/rpmforge-release-0.3.6-1.el4.rf.i386.rpm
# rpm -ivh rpmforge-release-0.3.6-1.el4.rf.i386.rpm
# yum install nagios*
# chkconfig --level 2345 nagios on
# service nagios start

Note: all configuration later in this document assumes the yum-installed layout.

5.2. Installing from Source

1) Install Nagios

# wget http://jaist.dl.sourceforge.net/sourceforge/nagios/nagios-3.0.6.tar.gz
# tar zxvf nagios-3.0.6.tar.gz
# cd nagios-3.0.6
# ./configure --prefix=/usr/local/nagios
# make install                  // install the main program, CGIs and HTML files
# make install-commandmode      // give external commands permission to access the nagios configuration files
# make install-config           // copy the sample configuration files into the nagios install directory
# make install-init             // install an init script so nagios starts with the system
Or the four steps above can be replaced with a single command:
# make all

2) Install the Nagios plugins

# wget http://nchc.dl.sourceforge.net/sourceforge/nagiosplug/nagios-plugins-1.4.13.tar.gz
# tar zxvf nagios-plugins-1.4.13.tar.gz
# cd nagios-plugins-1.4.13
# ./configure --prefix=/usr/local/nagios
# make
# make install

6. Nagios Configuration Files

Nagios really has only one configuration file, /etc/nagios/nagios.cfg; every other file is pulled into nagios.cfg with an include. For example:
# You can specify individual object config files as shown below:
cfg_file=/etc/nagios/objects/commands.cfg
cfg_file=/etc/nagios/objects/contacts.cfg
cfg_file=/etc/nagios/objects/timeperiods.cfg
cfg_file=/etc/nagios/objects/templates.cfg

# Definitions for monitoring the local (Linux) host
cfg_file=/etc/nagios/objects/hosts.cfg
cfg_file=/etc/nagios/objects/services.cfg

commands.cfg defines the monitoring commands; contacts.cfg the alert contacts; timeperiods.cfg the time periods; templates.cfg the templates provided for users' convenience; hosts.cfg the monitored hosts; and services.cfg the monitored services.
The only exception is cgi.cfg, which configures the web interface.

6.1. Nagios Web Configuration

# htpasswd -bc /etc/nagios/htpasswd.users nagiosadmin 123456
After creating a web user, open http://ip/nagios/ in a browser and log in with that user and password to reach the Nagios web interface.

The web interface reads /etc/nagios/cgi.cfg; edit that file to change its settings.
# vi /etc/nagios/cgi.cfg
use_authentication=1                         # enable user authentication
authorized_for_system_information=nagiosadmin
authorized_for_configuration_information=nagiosadmin
authorized_for_system_commands=nagiosadmin # separate multiple users with commas
authorized_for_all_services=nagiosadmin
authorized_for_all_hosts=nagiosadmin
authorized_for_all_service_commands=nagiosadmin
authorized_for_all_host_commands=nagiosadmin

6.2.  hosts.cfg

define host{            # defines a monitored host.
host_name             # the name that identifies the host; host groups and services refer to the host by this name, and one host can carry several services. Where applicable, the macro $HOSTNAME$ holds this value.
alias                 # a longer name or description for the host, mainly to make it easier to identify. Where applicable, the macro $HOSTALIAS$ holds this value.
address               # the host's address, normally its IP. An FQDN also works, but will cause problems when no DNS server is reachable. Where applicable, the macro $HOSTADDRESS$ holds this value.
max_check_attempts    # how many times Nagios retries the check command when it returns a non-OK result. A value of 1 makes Nagios alert without retrying.
check_period          # the name of a time period during which active checks of this host are enabled; time periods are defined elsewhere and referenced here by name.
contact_groups        # a list of contact groups, referenced by name and separated by commas.
notification_interval # how long to wait before re-notifying the contact groups while the host stays down or unreachable.
notification_period   # the time period (referenced by name) during which notifications are sent to the contact groups.
notification_options  # when to send notifications: d = on down states, u = on unreachable states, r = on recoveries, f = when the host starts or stops flapping; with n, notifications are never sent.
}

define hostgroup{     # defines a group of monitored hosts.
hostgroup_name  # host group name, usually short
alias           # host group alias, usually longer
members         # the hosts in the group
}

6.3. services.cfg

define service{                      # defines a monitored service.
host_name             # host name
service_description   # service description
check_command         # the command to run
max_check_attempts    # maximum number of failed attempts; with 1, alert without re-checking
normal_check_interval # interval between regular checks, 60 minutes by default (a regular check happens regardless of service state, once the check count has reached max_check_attempts)
retry_check_interval  # interval between retries, 60 minutes by default (a retry happens while the service is not OK and the check count has not yet reached max_check_attempts)
check_period          # the time period during which checks run
notification_interval # interval between repeated notifications while the service stays in a problem state; 0 disables repeat notifications
notification_period   # the time period during which notifications are sent
notification_options  # w = warning, u = unknown, c = critical, f = flapping start/stop, n = no notifications
contact_groups        # contact groups
}

define servicegroup{                 # defines a group of monitored services.
servicegroup_name     # service group name, usually short
alias                 # service group alias, usually longer
members               # the services in the group
}

6.4. contacts.cfg

define contact{              # defines a contact.
contact_name                 # a short name for the contact, referenced when defining contact groups. Where applicable, the macro $CONTACTNAME$ holds this value.
alias                        # a fuller description of the contact. Where applicable, the macro $CONTACTALIAS$ holds this value.
host_notification_period     # the time period during which this contact may be notified about host problems or recoveries; think of it as the hours during which the contact is reachable for host alerts.
service_notification_period  # the time period during which this contact may be notified about service problems or recoveries.
host_notification_options    # which host states trigger notifications to this contact: d = the host is down, u = the host is unreachable, r = the host recovers (comes back up), f = the host starts or stops flapping, n = never notify.
service_notification_options # which service states trigger notifications: w = warning, u = unknown, c = critical, r = recovery, f = the service starts or stops flapping, n = never notify.
host_notification_commands   # a comma-separated list of commands used to notify the contact about host problems or recoveries.
service_notification_commands# a comma-separated list of commands used to notify the contact about service problems or recoveries.
email                        # the contact's email address; how it is used depends on your notification commands, e.g. to send the contact alert mail. Where applicable, the macro $CONTACTEMAIL$ holds this value.
}

define contactgroup{         # defines a contact group.
contactgroup_name   # contact group name, usually short
alias               # contact group alias, usually longer
members             # the contacts in the group
}

6.5. timeperiods.cfg

define timeperiod{
timeperiod_name  # time period name, usually short
alias            # time period alias, usually longer
sunday           # Sunday time ranges
monday           # Monday time ranges
tuesday          # Tuesday time ranges
wednesday        # Wednesday time ranges
thursday         # Thursday time ranges
friday           # Friday time ranges
saturday         # Saturday time ranges
}

6.6. commands.cfg

define command{
command_name        # a short name for the command
command_line        # what Nagios executes when the check runs; all valid macros are replaced with their values before execution.
}

7. Monitoring Linux Hosts with NRPE

7.1. Installing NRPE

# wget http://dag.wieers.com/rpm/packages/rpmforge-release/rpmforge-release-0.3.6-1.el4.rf.i386.rpm
# rpm -ivh rpmforge-release-0.3.6-1.el4.rf.i386.rpm
# yum -y install nagios-nrpe
# chkconfig --level 2345 nrpe on
# service nrpe start

7.2. Configuring NRPE

Edit the NRPE configuration file:

#  vi /etc/nagios/nrpe.cfg

Of the lines beginning with command, keep only these two:
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20 -c 10 -p /

The first line monitors the system load; the second monitors disk space.
check_load inside command[check_load] is the NRPE command name; the check_nrpe plugin on the monitoring host uses this name to fetch the check result.
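The -w 15,10,5 -c 30,25,20 values above are warning/critical thresholds for the 1-, 5- and 15-minute load averages. A sketch of the comparison semantics (my own illustration of how such triples are evaluated, not the plugin's source):

```python
def load_state(loads, warn, crit):
    """loads, warn and crit are (1min, 5min, 15min) triples, as in -w 15,10,5 -c 30,25,20."""
    if any(l > c for l, c in zip(loads, crit)):   # any average over its critical bound
        return "CRITICAL"
    if any(l > w for l, w in zip(loads, warn)):   # any average over its warning bound
        return "WARNING"
    return "OK"

print(load_state((3.2, 2.1, 1.5), warn=(15, 10, 5), crit=(30, 25, 20)))   # OK
print(load_state((16.0, 4.0, 2.0), warn=(15, 10, 5), crit=(30, 25, 20)))  # WARNING
```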

7.3. Configuring Nagios

1) Add the Nagios command

# vi commands.cfg
Append at the end:
define command{
command_name check_nrpe
command_line /usr/local/nagios/libexec/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

2) Add the monitored host

# vi /etc/nagios/objects/hosts.cfg
define host{                   
use                     linux-server
host_name               mysql.server
alias                   10.11.12.80
address                 10.11.12.80
}

3) Add the monitored services

# vi /etc/nagios/objects/services.cfg
define service{
use                     local-service
host_name               mysql.server
service_description     nrpe_disk
check_command           check_nrpe!check_disk
notifications_enabled   1
}

define service{
use                     local-service
host_name               mysql.server
service_description     nrpe_load
check_command           check_nrpe!check_load
notifications_enabled   1
}
check_load and check_disk in the configuration above are the commands defined on the monitored host in the NRPE configuration (the command[check_load] lines). Finally, make the configuration take effect:
# nagios -v /etc/nagios/nagios.cfg  # syntax-check the configuration
# service nagios reload

8. Monitoring Web and Tomcat Services

Web and Tomcat services can be monitored with the check_http plugin that ships with Nagios.
# vi commands.cfg
增加:
# 'check_tomcat' command definition
define command{
command_name    check_tomcat
command_line    $USER1$/check_http -I $HOSTADDRESS$ -p 8080 $ARG1$
}
# 'check_http' command definition
define command{
command_name    check_http
command_line    $USER1$/check_http -I $HOSTADDRESS$ -H $HOSTADDRESS$ $ARG1$
}

# vi services.cfg
Add:
define service{
use                     local-service
host_name               web1.ihompy.com
hostgroup_name          web-servers
service_description     check-http
check_command           check_http
max_check_attempts      3
normal_check_interval   3
retry_check_interval    1
check_period            24x7
notification_interval   60
notification_period     24x7
notification_options    w,u,c,r
}
define service{
use                     local-service
host_name               l7ejb,l7admin,l7web,l7ds
#        hostgroup_name          l7-servers
service_description     check-tomcat
check_command           check_tomcat
max_check_attempts      3
normal_check_interval   3
retry_check_interval    1
check_period            24x7
notification_interval   60
notification_period     24x7
notification_options    w,u,c,r
}
No configuration is needed on the monitored hosts; just reload the modified Nagios configuration.


9. Monitoring squid

9.1. Download the squid check script

# wget http://workaround.org/squid/nagios-plugin/check_squid
# chmod  755 check_squid
# cp check_squid /usr/lib/nagios/plugins/
I hit a problem when running this script:
Parsing of undecoded UTF-8 will give garbage when decoding entities at /usr/lib/perl5/vendor_perl/5.8.5/LWP/Protocol.pm line 114.
The cause:
HTML::HeadParser gets confused by undecoded UTF-8 when its parse() method is called; content must be properly decoded before being passed in.
Reference: http://www.xinjiezuo.com/blog/?p=43
The workaround is to add one line below my $ua = new LWP::UserAgent;:
$ua->parse_head(0);

This skips head parsing entirely, and the warning goes away.
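In context, the patched part of check_squid would look roughly like this (a sketch; apart from the two lines quoted above, the comments are mine, not the script's):

```perl
# create the UserAgent used to fetch squid's status pages
my $ua = new LWP::UserAgent;
# workaround: stop HTML::HeadParser from parsing the undecoded UTF-8 response head
$ua->parse_head(0);
```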

9.2. Modify the configuration files

# vi commands.cfg
Add:
# 'squid' command definition
define command {
command_name check_squid
command_line $USER1$/check_squid '$ARG1$' '$ARG2$' '$ARG3$' $HOSTADDRESS$ '$ARG4$' '$ARG5$' '$ARG6$' '$ARG7$'
}

# vi services.cfg
Add:
define service {
use                     local-service
host_name               squid1.ihompy.com
service_description     check-squid
check_command           check_squid!http://www.ihompy.com!-!-!80!-!-!2
max_check_attempts      3
normal_check_interval   3
retry_check_interval    1
check_period            24x7
notification_interval   60
notification_period     24x7
notification_options    w,u,c,r
}
No configuration is needed on the monitored host; just reload the modified Nagios configuration.

10. Monitoring MySQL and MSSQL services

MySQL can be monitored with the check_mysql plugin bundled with Nagios. Nagios also ships a check_mssql plugin for monitoring SQL Server, but it requires FreeTDS to be installed first.
# yum install freetds
# vi commands.cfg
Add:
# 'mysql' command definition
define command{
command_name check_mysql
command_line $USER1$/check_mysql -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$
}
# 'check_mssql' command definition
define command{
command_name    check_mssql
command_line    $USER2$/check_mssql.sh $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$
}

# vi services.cfg
Add:

define service{
use                     local-service
host_name               mysql.ihompy.com
hostgroup_name          mysql-servers
service_description     check-mysql
check_command           check_mysql!root!1qaz2wsx
max_check_attempts      3
normal_check_interval   3
retry_check_interval    1
check_period            24x7
notification_interval   60
notification_period     24x7
notification_options    w,u,c,r
}
define service{
use                     local-service
host_name               sqlserver
#        hostgroup_name          backup-servers
service_description     check_sqlserver
check_command           check_mssql!sa!"!2000
notifications_enabled   1
}
No configuration is needed on the monitored host; just reload the modified Nagios configuration.

11. Configuring notification methods and contacts

Nagios supports many notification channels: e-mail, SMS, MSN, and more.
For SMS in China, the two common options are China Mobile's Fetion client and buying an SMS modem with a SIM card. The former is currently free; the latter costs a little for the modem and card, but not much, under 200 RMB in total.
MSN, for reasons of its own, is not very stable.
We use a compromise: Nagios is configured to send e-mail, but to China Mobile 139 mailboxes, which forward incoming mail to the user as free SMS. That way we get both e-mail and SMS, and so far it has been quite reliable.

11.1. Configuring contacts

# vi contacts.cfg
define contact{
contact_name                    luohui         
use                             generic-contact
alias                           Nagios Admin   
email                           farmer.luo@139.com 
pager                           13761802324324
address1                        huilinux@hotmail.com
}

define contact{
contact_name                    xuyong       
use                             generic-contact
alias                           Nagios Admin   
email                           xuyong76@139.com
pager                           133434323443
address1                        xuyong@newsky.sh
}
define contactgroup{
contactgroup_name       admins
alias                   Nagios Administrators
members                 luohui,xuyong
}
This defines two contacts, luohui and xuyong, then adds them to the contact group admins.

11.2. Configuring notification commands

These are already configured by default, in the following lines of commands.cfg:

# 'notify-host-by-email' command definition
define command{
command_name    notify-host-by-email
command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}

# 'notify-service-by-email' command definition
define command{
command_name    notify-service-by-email
command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" | /bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}

11.3. Enabling e-mail alerts for hosts and services

Because all our hosts and services are defined from templates, changing the template file updates every host and service at once.
# vi templates.cfg
define contact{
name                            generic-contact          
service_notification_period     24x7
host_notification_period        24x7
service_notification_options    w,u,c,r,f,s            
host_notification_options       d,u,r,f,s              
service_notification_commands   notify-service-by-email
host_notification_commands      notify-host-by-email   
register                        0                      
}

In the host and service templates, set contact_groups to the contact group admins:
contact_groups                  admins

12. Conclusion

The above covers only a few common services as examples. Nagios can monitor many more, such as DNS, POP, and SMTP, and you can write your own scripts to check special services.
We currently monitor nearly 40 servers and over 100 services with Nagios. It has been running for about half a month, has already fired several correct alerts, and has been very pleasant to use.

A Perl script I wrote to check redis (a Nagios plugin)

It is in my code repository: http://farmerluo.googlecode.com/files/check_redis.pl

How to install it:

The script uses the Perl Redis module, so install that first:
# perl -MCPAN -e shell
# install Redis

wget http://farmerluo.googlecode.com/files/check_redis.pl

cp check_redis.pl  /etc/nagios/command/

chown cacti.nagios check_redis.pl

Add the plugin to Nagios:
vi /etc/nagios/objects/commands.cfg

# 'check_redis' command definition
define command{
command_name    check_redis
command_line    /etc/nagios/command/check_redis.pl -h $HOSTADDRESS$ $ARG1$
}

Add a service:
vi /etc/nagios/objects/linuxhost.cfg

define service{
use                             generic-service         ; Name of service template to use
host_name                       memcached.ha2,memcached.web2
service_description             redis
check_command                   check_redis
notifications_enabled           1
}

Check that the Nagios configuration is correct:
nagios -v  /etc/nagios/nagios.cfg

Nagios Core 3.2.1
Copyright (c) 2009-2010 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 03-09-2010
License: GPL

Website: http://www.nagios.org
Reading configuration data...
Read main config file okay...
Processing object config file '/etc/nagios/objects/commands.cfg'...
Processing object config file '/etc/nagios/objects/contacts.cfg'...
Processing object config file '/etc/nagios/objects/timeperiods.cfg'...
Processing object config file '/etc/nagios/objects/templates.cfg'...
Processing object config file '/etc/nagios/objects/linuxhosts.cfg'...
Processing object config file '/etc/nagios/objects/windows.cfg'...
Read object config files okay...

Running pre-flight check on configuration data...

Checking services...
Checked 32 services.
Checking hosts...
Checked 14 hosts.
Checking host groups...
Checked 4 host groups.
Checking service groups...
Checked 0 service groups.
Checking contacts...
Checked 3 contacts.
Checking contact groups...
Checked 2 contact groups.
Checking service escalations...
Checked 0 service escalations.
Checking service dependencies...
Checked 0 service dependencies.
Checking host escalations...
Checked 0 host escalations.
Checking host dependencies...
Checked 0 host dependencies.
Checking commands...
Checked 29 commands.
Checking time periods...
Checked 5 time periods.
Checking for circular paths between hosts...
Checking for circular host and service dependencies...
Checking global event handlers...
Checking obsessive compulsive processor commands...
Checking misc settings...

Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check

No problems, so we reload the configuration.

service nagios reload

Next, a few things to keep in mind when writing Nagios plugins in Perl:

  1. Always produce some output;
  2. Include 'use utils' and use its common exports for output ($TIMEOUT, %ERRORS, &print_revision, &support, etc.);
  3. Know the standard conventions of Perl plugins:
    1. always exit with $ERRORS{CRITICAL}, $ERRORS{OK}, etc.;
    2. use a getopt function to handle the command line;
    3. handle timeouts in the program;
    4. provide a print_usage to call when no command-line arguments are given;
    5. use the standard command-line switches (such as -H 'host', -V 'version').

check_redis.pl source:

#!/usr/bin/perl

# nagios: -epn

################################################################################
# check_redis - Nagios Plugin for Redis checks.
#
# @author  farmer.luo at gmail.com
# @date    2012-01-15
# @license GPL v2
#
# check_redis.pl -h <redis host> -p <redis port> -w <warning time> -c <critical time>
#
# Running the script requires the Redis module:
#
# perl -MCPAN -e shell
# install Redis
#
################################################################################

use strict;
use warnings;
use Redis;
use File::Basename;
use utils qw($TIMEOUT %ERRORS &print_revision &support);
use Time::Local;
use vars qw($opt_h); # Redis host
use vars qw($opt_p); # Redis port
use vars qw($opt_w); # warning time, in seconds
use vars qw($opt_c); # critical time, in seconds
use Getopt::Std;

$opt_h = "";
$opt_p = "6379";
$opt_w = 5;
$opt_c = 10;
my $r = "";
my $role = "master";

getopt('hpwcd');

if ( $opt_h eq "" ) {
        help();
        exit(1);
}

my $start = time();

redis_connect();

# print $@;
if ( $@ ) {
        print "UNKNOWN - can't connect to redis server:" . $opt_h . ".";
        exit $ERRORS{"UNKNOWN"};
}

#sleep(3);
my $stop = time();

my $run = $stop - $start;

if ( $run > $opt_c ) {

        print "CRITICAL - redis server(" . $opt_h . ") run for " . $run . " seconds!";
        exit $ERRORS{"CRITICAL"};

} elsif ( $run > $opt_w ) {

        print "WARNING - redis server(" . $opt_h . ") run for " . $run . " seconds!";
        exit $ERRORS{"WARNING"};

} else {

        redis_info();

#       print "role = " . $role;

        if ( $role eq "master" ){

                if ( redis_set() ) {
                       print "WARNING - redis server:" . $opt_h . ",set key error.";
                       exit $ERRORS{"WARNING"};
                }

                if ( redis_get() ) {
                       print "WARNING - redis server:" . $opt_h . ",get key error.";
                       exit $ERRORS{"WARNING"};
                }

                if ( redis_del() ) {
                       print "WARNING - redis server:" . $opt_h . ",del key error.";
                       exit $ERRORS{"WARNING"};
                }

        }

        redis_quit();
        exit $ERRORS{"OK"};

}

sub help{

        die "Usage:\n" , basename( $0 ) , " -h hostname -p port -w warning time -c critical time -d down time\n";

}

sub redis_connect{

        my $redis_hp = $opt_h . ":" . $opt_p;

        eval{ $r = Redis->new( server => $redis_hp ); };

}

sub redis_set{

        $r->set( redis_nagios_key => 'test' ) || return 1;

        return 0;
}

sub redis_get{

        my $value = $r->get( 'redis_nagios_key' ) || return 1;

        return 0;
}

sub redis_del{

        $r->del( 'redis_nagios_key' ) || return 1;

        return 0;
}

sub redis_info{

        my $info_hash = $r->info;

        print "OK - redis server(" . $opt_h . ") info:";

        while ( my ($key, $value) = each(%$info_hash) ) {
            print "$key => $value, ";
        }

        my %info = %$info_hash;

        $role = $info{"role"};
}

sub redis_quit{

        $r->quit();

}

Reference:
http://nagios-cn.sourceforge.net/nagios-cn/develope.html

How to write Nagios plugins

I started using redis quite early on; at the time I could not find a Nagios plugin for monitoring it, but fortunately redis is stable and has basically never given trouble.

Recently I searched again, still found nothing, so I decided to write one myself and will post it in a couple of days. A Nagios plugin is actually quite simple: you just have to get the program's exit status codes right.

Each time Nagios checks a service, it spawns a child process and uses that command's output and exit code to determine the service's state. The exit codes mean:

  • OK (exit code 0): the service is working normally.
  • WARNING (exit code 1): the service is in a warning state.
  • CRITICAL (exit code 2): the service is in a critical state.
  • UNKNOWN (exit code 3): the service is in an unknown state.

The last state usually means the plugin could not determine the service's state, for example because an internal error occurred.

Below is a sample Python script that checks the UNIX® load average. It treats a load above 2.0 as a warning state and above 5.0 as critical. Both values are hard-coded, and the one-minute load average is always used.

Listing 5. Python plugin: a sample working plugin

#!/usr/bin/env python

import os,sys

(d1, d2, d3) = os.getloadavg()

if d1 >= 5.0:
    print "GETLOADAVG CRITICAL: Load average is %.2f" % (d1)
    sys.exit(2)
elif d1 >= 2.0:
    print "GETLOADAVG WARNING: Load average is %.2f" % (d1)
    sys.exit(1)
else:
    print "GETLOADAVG OK: Load average is %.2f" % (d1)
    sys.exit(0)
Reference: http://www.ibm.com/developerworks/cn/aix/library/au-nagios/index.html

Calling external scripts from net-snmp, and logging to a separate file

net-snmp can call external scripts to extend its functionality, for example:

vi /etc/snmp/snmpd.conf
exec .1.3.6.1.4.1.2021.18 tcpCurrEstab /etc/snmp/tcpconn.sh
exec .1.3.6.1.4.1.2021.19 tcpCurrHttp /etc/snmp/tcphttp.sh
exec .1.3.6.1.4.1.2021.20 tcpCurrPhp-fpm /etc/snmp/tcpphp.sh
exec .1.3.6.1.4.1.2021.21 tcpCurrMemcache /etc/snmp/tcpmemcache.sh

That is the old syntax, now deprecated; newer versions use:
extend .1.3.6.1.4.1.2021.18 tcpCurrEstab /etc/snmp/tcpconn.sh
extend .1.3.6.1.4.1.2021.19 tcpCurrHttp /etc/snmp/tcphttp.sh
extend .1.3.6.1.4.1.2021.20 tcpCurrPhp-fpm /etc/snmp/tcpphp.sh
extend .1.3.6.1.4.1.2021.21 tcpCurrMemcache /etc/snmp/tcpmemcache.sh

To allow a whole subnet to query snmpd, configure:
com2sec notConfigUser  192.168.1.0/24       public
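Note that com2sec only maps the source network and community string to a security name; snmpd also needs matching group, view, and access lines before queries are answered. A minimal sketch based on the defaults the net-snmp package ships (the group and view names are those defaults, with the view widened here to expose the whole OID tree):

```
com2sec notConfigUser  192.168.1.0/24   public
group   notConfigGroup v1               notConfigUser
group   notConfigGroup v2c              notConfigUser
view    systemview     included         .1.3.6.1
access  notConfigGroup ""  any  noauth  exact  systemview  none  none
```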

[root@ha1 log]# cat /etc/snmp/tcpconn.sh
#!/bin/sh
conn=`netstat -s -t | grep 'connections established' | awk '{print $1}'`
echo $conn
[root@ha1 log]# cat /etc/snmp/tcphttp.sh
#!/bin/sh
netstat -an | grep ':80 ' | grep ESTABLISHED | wc -l
[root@ha1 log]# cat /etc/snmp/tcpmemcache.sh
#!/bin/sh
netstat -an | grep :11211 | grep ESTABLISHED | wc -l
[root@ha1 log]# cat /etc/snmp/tcpphp.sh
#!/bin/sh
netstat -an | grep :9000 | grep ESTABLISHED | wc -l
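All four scripts are the same grep-and-count pattern over netstat output. As a quick sanity check of what tcphttp.sh counts, the same pipeline can be fed a few canned netstat lines (the sample data below is made up):

```shell
# Canned netstat -an style output: three port-80 lines, two of them ESTABLISHED.
netstat_sample='tcp 0 0 192.168.1.4:80 10.0.0.7:51234 ESTABLISHED
tcp 0 0 192.168.1.4:80 10.0.0.8:51235 ESTABLISHED
tcp 0 0 192.168.1.4:80 10.0.0.9:51236 TIME_WAIT
tcp 0 0 192.168.1.4:22 10.0.0.5:51300 ESTABLISHED'

# Same pipeline as tcphttp.sh: match ":80 " then keep only established connections.
count=$(printf '%s\n' "$netstat_sample" | grep ':80 ' | grep ESTABLISHED | wc -l)
echo "$count"
```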

Restart snmpd:
service snmpd restart

Test:
[root@ha1 log]# snmpwalk -v 2c -c public 192.168.1.4 .1.3.6.1.4.1.2021.18
UCD-SNMP-MIB::ucdavis.18.1.1 = INTEGER: 1
UCD-SNMP-MIB::ucdavis.18.2.1 = STRING: "tcpCurrEstab"
UCD-SNMP-MIB::ucdavis.18.3.1 = STRING: "/etc/snmp/tcpconn.sh"
UCD-SNMP-MIB::ucdavis.18.100.1 = INTEGER: 0
UCD-SNMP-MIB::ucdavis.18.101.1 = STRING: "5023"
UCD-SNMP-MIB::ucdavis.18.102.1 = INTEGER: 0
UCD-SNMP-MIB::ucdavis.18.103.1 = ""

Output like the above means it is working.

By default net-snmpd logs to /var/log/messages; to send the log to a separate file, configure:

vi /etc/sysconfig/snmpd.options
# snmpd command line options
# OPTIONS="-Lsd -Lf /dev/null -p /var/run/snmpd.pid -a"
OPTIONS="-Lf /var/log/snmpd.log"

Restart snmpd:
service snmpd restart

cat /var/log/snmpd.log

Monitoring squid with cacti

Configure squid's SNMP support; squid must be built with the --enable-snmp option.

vi /usr/local/squid/etc/squid.conf

Add the following lines:
snmp_port 3401
acl snmppublic snmp_community public
acl logger src 127.0.0.1/32
snmp_access allow snmppublic logger
snmp_access deny all

Restart squid:
service squid restart

Test with the following command:
snmpwalk -v1 -c public 127.0.0.1:3401 .1.3.6.1.4.1.3495.1

If you see something like the output below, it works.
SNMPv2-SMI::enterprises.3495.1.1.1.0 = INTEGER: 16360
SNMPv2-SMI::enterprises.3495.1.1.2.0 = INTEGER: 21626872

Because squid's SNMP agent listens on port 3401 while the default SNMP port is 161, use net-snmpd to proxy the requests:

vi /etc/snmp/snmpd.conf

Add the following line:
proxy -v 1 -c public 127.0.0.1:3401 .1.3.6.1.4.1.3495.1

Restart snmpd:

service snmpd restart

Test again with the command below; if you get the same information as above, everything is fine.
snmpwalk -v1 -c public 127.0.0.1 .1.3.6.1.4.1.3495.1

Then configure cacti: import the template from http://forums.cacti.net/about4142.html and create the graphs.

Monitoring the squid service with cacti

Original: http://www.puppeter.cn/?p=36

Script location:
http://forums.cacti.net/about3158.html&highlight=squidstats

squid installation
When compiling squid, you must add --enable-snmp

snmp configuration
# vi /usr/local/squid/etc/squid.conf
snmp_port 3401
acl snmppublic snmp_community valesquid
snmp_access allow snmppublic localhost
snmp_access deny all
snmp_incoming_address 0.0.0.0
snmp_outgoing_address 0.0.0.0

# vi /etc/snmp/snmpd.conf
proxy -v 1 -c valesquid 127.0.0.1:3401 .1.3.6.1.4.1.3495.1

This uses net-snmp's proxy feature to relay queries to squid's port 3401.

Testing
# snmpwalk -v2c -c valesquid 127.0.0.1:3401 .1.3.6.1.4.1.3495.1.1
SNMPv2-SMI::enterprises.3495.1.1.1.0 = INTEGER: 56784
SNMPv2-SMI::enterprises.3495.1.1.2.0 = INTEGER: 773424
SNMPv2-SMI::enterprises.3495.1.1.3.0 = Timeticks: (114466355) 13 days, 5:57:43.55

# snmpwalk -v2c -c valeftp 127.0.0.1 .1.3.6.1.4.1.3495.1.1
SNMPv2-SMI::enterprises.3495.1.1.1.0 = INTEGER: 56784
SNMPv2-SMI::enterprises.3495.1.1.2.0 = INTEGER: 773424
SNMPv2-SMI::enterprises.3495.1.1.3.0 = Timeticks: (114469771) 13 days, 5:58:17.71

# netstat -nlp -u
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 0 0 0.0.0.0:161 0.0.0.0:* 9549/snmpd
udp 0 0 0.0.0.0:58674 0.0.0.0:* 1291/(squid)
udp 0 0 0.0.0.0:3401 0.0.0.0:* 1291/(squid)
udp 0 0 0.0.0.0:3401 0.0.0.0:* 1291/(squid)

squidstats installation
Very simple; omitted here.

Building net-snmp from source to fix inaccurate cacti graphs for traffic above 100 Mbps

Original: http://www.puppeter.cn/?p=624

Installation and configuration
# cd /home/src
# wget http://nchc.dl.sourceforge.net/sourceforge/net-snmp/net-snmp-5.4.2.1.tar.gz
# tar xvfz net-snmp-5.4.2.1.tar.gz
# cd net-snmp-5.4.2.1
# ./configure --prefix=/usr/local/net-snmp --enable-mfd-rewrites
(use SNMP protocol v2c when polling)
# make && make install
# vi /etc/rc.local
/usr/local/net-snmp/sbin/snmpd -Lsd -Lf /dev/null -p /var/run/snmpd.pid -a -c /etc/snmp/snmpd.conf &

Testing
# snmpwalk -v2c -c valeftp <IP address> system

Update the Cacti configuration
In the Cacti admin pages, go to Console -> Data Source.
Find the data sources for the affected interfaces (those carrying more than 100 Mbps) and change the Output Type ID to In/Out bits (64-bit counters) (it was In/Out bits).

How to check whether a server's net-snmp supports 64-bit counters

Use the OID "ifHCInOctets":

Not supported:
# snmpwalk -v 2c -c public 192.168.0.1 ifHCInOctets
IF-MIB::ifHCInOctets = No Such Object available on this agent at this OID

Supported:
# snmpwalk -v 2c -c public 192.168.0.2 ifHCInOctets
IF-MIB::ifHCInOctets.1 = Counter64: 190305466
IF-MIB::ifHCInOctets.2 = Counter64: 2238924259791
IF-MIB::ifHCInOctets.3 = Counter64: 12021323
IF-MIB::ifHCInOctets.4 = Counter64: 0