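Monitoring HP Server Hardware with Zabbix
As a Linux systems engineer, maintaining servers means more than looking after the operating system. Monitoring the server hardware matters just as much: whether the RAID status is normal (a faulty RAID controller slows down reads and writes), whether the disks are healthy (a failed disk can, in the worst case, lose data), whether a power supply has failed, and so on. Beyond that, the temperatures of key components such as the CPU, memory, and processors should be monitored, with an alert sent whenever a reading exceeds the server's critical temperature.
For hardware management, HP provides its own tool, hpacucli, which shows the RAID configuration and disk information of an HP server.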
1) Install the hpacucli tool (download: HP hpacucli management tool).
# rpm -ivh hpacucli-9.40-12.0.x86_64.rpm
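Before relying on either HP tool from a script, it can help to confirm that the binaries are actually on PATH. The sketch below is only an illustration: the function name require_tools is hypothetical, and it checks nothing beyond presence of the commands.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every named command is found on PATH.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use on an HP server (as root); nothing is executed here:
#   PATH=$PATH:/usr/sbin:/sbin
#   require_tools hpacucli hpasmcli || exit 1
```

Keeping the check in a function lets a monitoring script exit early with a clear message instead of sending empty values.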
2) Check the server's RAID information and whether the disks are healthy.
# hpacucli ctrl all show config
Smart Array P410i in Slot 0 (Embedded) (sn: 5001438018042FF0)
array A (SAS, Unused Space: 0 MB)
logicaldrive 1 (279.4 GB, RAID 1, OK)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, OK)
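The summary output can be reduced to a single OK/NO value for monitoring. The following is a minimal sketch, assuming every logicaldrive and physicaldrive line ends in "OK)" when healthy, as in the output shown here; the function name raid_summary is hypothetical, and no particular failure string is assumed.

```shell
#!/bin/sh
# Reads `hpacucli ctrl all show config` output on stdin.
# Prints "OK" if every logicaldrive/physicaldrive line ends in "OK)",
# otherwise "NO" (also "NO" when no drive lines are seen at all).
raid_summary() {
    awk '
        /logicaldrive|physicaldrive/ {
            total++
            if ($NF == "OK)") ok++
        }
        END {
            if (total > 0 && total == ok) print "OK"; else print "NO"
        }
    '
}

# Possible use (requires hpacucli and root):
#   hpacucli ctrl all show config | raid_summary
```

Treating "no drives found" as NO is a deliberate choice in this sketch, so that a failed or empty command run does not look healthy.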
3) The command hpacucli ctrl all show config detail shows the RAID and disk information in detail.
# hpacucli ctrl all show config detail
Smart Array P410i in Slot 0 (Embedded)
Bus Interface: PCI
Slot: 0
Serial Number: 5001438018042FF0
Cache Serial Number: PBCDH0CRH1FH62
RAID 6 (ADG) Status: Disabled
Controller Status: OK
Chassis Slot:
Hardware Revision: Rev C
Firmware Version: 5.14
Rebuild Priority: Medium
Expand Priority: Medium
Surface Scan Delay: 15 secs
Monitor and Performance Delay: 60 min
Elevator Sort: Enabled
Degraded Performance Optimization: Disabled
Inconsistency Repair Policy: Disabled
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 25% Read / 75% Write
Drive Write Cache: Disabled
Total Cache Size: 512 MB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Capacitors
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True
Array: A
Interface Type: SAS
Unused Space: 0 MB
Status: OK
Logical Drive: 1
Size: 279.4 GB
Fault Tolerance: RAID 1
Heads: 255
Sectors Per Track: 32
Cylinders: 65535
Stripe Size: 128 KB
Status: OK
Array Accelerator: Enabled
Unique Identifier: 600508B1001034373220202020200002
Disk Name: /dev/cciss/c0d0
Mount Points: /boot 99 MB
Logical Drive Label: A00ADBD9PR7AMU1472 898D
Mirror Group 0:
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)
Mirror Group 1:
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 300 GB, OK)
physicaldrive 1I:1:1
Port: 1I
Box: 1
Bay: 1
Status: OK
Drive Type: Data Drive
Interface Type: SAS
Size: 300 GB
Rotational Speed: 10000
Firmware Revision: HPD4
Serial Number: ECA1PC80GTS31234
Model: HP EG0300FBDSP
PHY Count: 2
PHY Transfer Rate: 6.0GBPS, Unknown
physicaldrive 1I:1:2
Port: 1I
Box: 1
Bay: 2
Status: OK
Drive Type: Data Drive
Interface Type: SAS
Size: 300 GB
Rotational Speed: 10000
Firmware Revision: HPD7
Serial Number: PMX6902D
Model: HP EG0300FBDBR
PHY Count: 2
PHY Transfer Rate: 6.0GBPS, Unknown
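Individual fields of the detailed output can be pulled out by label. This is a rough sketch under the assumption that each field is printed as "Label: value" on its own line, possibly indented; detail_field is a hypothetical name, not part of hpacucli.

```shell
#!/bin/sh
# Print the value of the first "<label>: value" line read from stdin.
# $1 is the exact label text, for example "Controller Status".
detail_field() {
    awk -v label="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # drop leading indentation
            prefix = label ": "
            if (index(line, prefix) == 1) {   # the label must start the line
                print substr(line, length(prefix) + 1)
                exit
            }
        }
    '
}

# Possible use (requires hpacucli and root):
#   hpacucli ctrl all show config detail | detail_field 'Cache Status'
```

Matching the label at the start of the line keeps a bare "Status" lookup from accidentally matching "Controller Status" or "Cache Status".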
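HP also provides the hpasmcli management tool, which reports detailed health information, including temperatures, for the server's CPU, memory, processors, power supplies, and other components.
1) Install the hpasmcli tool, which ships in the hp-health package (download: HP hpasmcli management tool).
# rpm -ivh hp-health-9.40-1602.44.rhel6.x86_64.rpm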
2) hpasmcli shows the temperature of each component. Temp is the component's current temperature and Threshold is its critical temperature; pay attention whenever the current temperature exceeds the threshold.
# hpasmcli -s 'show temp'
Sensor   Location              Temp       Threshold
------   --------              ----       ---------
#1       AMBIENT               23C/73F    42C/107F
#2       CPU#1                 40C/104F   82C/179F
#3       CPU#2                 40C/104F   82C/179F
#4       MEMORY_BD             33C/91F    87C/188F
#5       MEMORY_BD             33C/91F    78C/172F
#6       MEMORY_BD             -          87C/188F
#7       MEMORY_BD             32C/89F    78C/172F
#8       MEMORY_BD             32C/89F    87C/188F
#9       MEMORY_BD             32C/89F    78C/172F
#10      MEMORY_BD             -          87C/188F
#11      MEMORY_BD             32C/89F    78C/172F
#12      POWER_SUPPLY_BAY      33C/91F    59C/138F
#13      POWER_SUPPLY_BAY      47C/116F   73C/163F
#14      MEMORY_BD             29C/84F    72C/161F
#15      PROCESSOR_ZONE        32C/89F    73C/163F
#16      PROCESSOR_ZONE        30C/86F    64C/147F
#17      MEMORY_BD             28C/82F    63C/145F
#18      PROCESSOR_ZONE        39C/102F   69C/156F
#19      SYSTEM_BD             35C/95F    69C/156F
#20      SYSTEM_BD             38C/100F   71C/159F
#21      SYSTEM_BD             44C/111F   65C/149F
#22      SYSTEM_BD             45C/113F   71C/159F
#23      SYSTEM_BD             39C/102F   69C/156F
#24      SYSTEM_BD             47C/116F   69C/156F
#25      SYSTEM_BD             35C/95F    63C/145F
#26      SYSTEM_BD             45C/113F   66C/150F
#27      SCSI_BACKPLANE_ZONE   35C/95F    60C/140F
#28      SYSTEM_BD             73C/163F   110C/230F
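Comparing Temp against Threshold by eye gets tedious with 28 sensors. As a minimal sketch, assuming each sensor row has the four whitespace-separated fields seen in the show temp output (sensor, location, temp, threshold), the hypothetical function hot_sensors lists sensors that have reached a given percentage of their threshold. The default of 90 percent is an arbitrary choice for illustration.

```shell
#!/bin/sh
# Reads `hpasmcli -s 'show temp'` output on stdin. Prints every sensor whose
# Celsius reading is at or above PCT percent of its threshold (default 90).
# Sensors that report "-" instead of a reading are skipped.
hot_sensors() {
    awk -v pct="${1:-90}" '
        /^[ \t]*#[0-9]+/ && $3 != "-" {
            temp = $3;  sub(/C.*/, "", temp);  temp += 0    # "47C/116F" -> 47
            limit = $4; sub(/C.*/, "", limit); limit += 0   # "73C/163F" -> 73
            if (limit > 0 && temp * 100 >= limit * pct)
                printf "%s %s %dC of %dC\n", $1, $2, temp, limit
        }
    '
}

# Possible use (requires hpasmcli and root):
#   hpasmcli -s 'show temp' | hot_sensors 80
```

Working with a percentage of each sensor's own threshold avoids hard-coding one limit for components whose critical temperatures differ widely.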
3) Running hpasmcli -s 'show' without a subcommand prints a help-like list of everything that can be shown. For monitoring, focus on DIMM (memory), FANS (fans), POWERSUPPLY (power supply modules), SERVER (system), CPU, and TEMP (temperature).
# hpasmcli -s 'show'
Invalid Arguments
SHOW ASR
SHOW BOOT
SHOW DIMM [ SPD ]
SHOW F1
SHOW FANS
SHOW HT
SHOW IML
SHOW IPL
SHOW NAME
SHOW PORTMAP
SHOW POWERMETER
SHOW POWERSUPPLY
SHOW PXE
SHOW SERIAL [ BIOS | EMBEDDED | VIRTUAL ]
SHOW SERVER
SHOW TEMP
SHOW TPM
SHOW UID
SHOW WOL
4) Some common hpasmcli examples.
Memory information: hpasmcli -s 'show dimm' | egrep -i 'module|stat'
Fan information: hpasmcli -s 'show fans'
Hardware temperatures: hpasmcli -s 'show temp'
Power supply modules: hpasmcli -s 'show powersupply'
Machine model, ***, CPU, and memory size: hpasmcli -s 'show server'
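Because server vendors differ and each ships its own management tools, Zabbix has no detailed, comprehensive solution for server hardware. dl528888 previously wrote about monitoring Dell servers from Zabbix with the OMSA tools, which is a good approach that I borrowed from; many thanks.
Zabbix monitoring comes down to two approaches. In the first, the server fetches data through agentd, which requires defining a UserParameter, that is, a key. In the second, the server receives data through trapper items: the monitored host actively pushes its data to the server or proxy with zabbix_sender. I use the second, trapper-based approach here. With the first approach the server sometimes fails to get a value, and the item then reports this error:
became not supported: Received value [] is not suitable for value type [Numeric (unsigned)] and data type [Decimal]
Here is my monitoring script. Because it follows the trapper approach, the log_file it writes lists, in order, the hostname of the monitored server, the item key, and the monitored value.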
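The file passed to zabbix_sender with -i holds one item per line in the form "hostname key value", separated by whitespace. The sketch below writes such lines and refuses empty values, since an empty value is one plausible way for a numeric item to end up with an unusable reading like the "Received value []" error quoted above. The function name add_item and the host name web01 are hypothetical.

```shell
#!/bin/sh
# Append one "<host> <key> <value>" line to file $1, the whitespace-separated
# format read by `zabbix_sender -i`. Empty values are skipped and reported.
add_item() {
    file=$1; host=$2; key=$3; value=$4
    if [ -z "$value" ]; then
        echo "skip $key: empty value" >&2
        return 1
    fi
    printf '%s %s %s\n' "$host" "$key" "$value" >> "$file"
}

# Possible use (web01 is a made-up host name; nothing is sent here):
#   add_item /tmp/hpacuclizabbix.log web01 hp_cpu.temp_num 40
#   zabbix_sender -z "$zabbix_server" -i /tmp/hpacuclizabbix.log
```

The host name in each line has to match the host name configured on the Zabbix server, otherwise the value is rejected.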
# cat hpacuclizabbix.sh
#!/bin/sh
# create by sfzhang 20140517
# This script monitors an HP server: Smart Array status, hardware health and temperatures.
zabbix_server="*.*.*.*"    # IP of the Zabbix server or proxy the data is sent to
zabbix_sender="/usr/local/zabbix/bin/zabbix_sender"
log_file='/tmp/hpacuclizabbix.log'    # host, key and value of every monitored item
hpacucli='/usr/sbin/hpacucli'
options='ctrl all show config detail'
hpacucli_log="/tmp/result.log"
PATH=$PATH:/usr/sbin:/sbin
${hpacucli} ${options} > ${hpacucli_log}
# Smart Array status fields, taken from the detailed controller report
Cache_status=$(awk '/Cache Status:/{print $NF}' ${hpacucli_log})
Controller_status=$(awk '/Controller Status:/{print $NF}' ${hpacucli_log})
Battery_capacitor_status=$(awk '/Battery\/Capacitor Status:/{print $NF}' ${hpacucli_log})
# Hardware checks: "OK" when the number of healthy parts equals the total, else "NO".
# count+0 makes awk print 0 instead of an empty string when nothing is healthy.
Physicaldrive_status=$(awk -v total="$(hpacucli ctrl slot=0 pd all show status | grep -c physicaldrive)" -v normal="$(hpacucli ctrl slot=0 pd all show status | awk '/physicaldrive/{if($NF=="OK") count+=1}END{print count+0}')" 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
Memory_status=$(awk -v total="$(hpasmcli -s 'SHOW DIMM' | grep -ci 'Status')" -v normal="$(hpasmcli -s 'SHOW DIMM' | awk '/Status:/{if($NF=="Ok") count+=1}END{print count+0}')" 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
Fans_status=$(awk -v total="$(hpasmcli -s 'SHOW FANS' | grep -c "#")" -v normal="$(hpasmcli -s 'SHOW FANS' | awk '/#/{if($3=="Yes") count+=1}END{print count+0}')" 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
Power_status=$(awk -v total="$(hpasmcli -s 'SHOW POWERSUPPLY' | grep -c "Power supply")" -v normal="$(hpasmcli -s 'SHOW POWERSUPPLY' | awk '/Condition:/{if ($NF=="Ok") count+=1}END{print count+0}')" 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
Processor_status=$(awk -v total="$(hpasmcli -s 'SHOW SERVER' | grep -c "Processor:")" -v normal="$(hpasmcli -s 'SHOW SERVER' | awk '/Status/{if ($NF=="Ok") count+=1}END{print count+0}')" 'BEGIN{if(total==normal) {print "OK"} else {print "NO"}}')
# Temperatures: highest Celsius reading per location; $1+0 forces a numeric comparison
Power_temp_num=$(hpasmcli -s 'SHOW TEMP' | awk '/POWER_SUPPLY_BAY/{print $3}' | awk -F "C" '{print $1}' | awk 'BEGIN {max = 0} {if ($1+0 > max) max = $1+0} END {print max}')
Ambient_temp_num=$(hpasmcli -s 'SHOW TEMP' | awk '/AMBIENT/{print $3}' | awk -F "C" '{print $1}')
Cpu_temp_num=$(hpasmcli -s 'SHOW TEMP' | awk '/CPU/{print $3}' | awk -F "C" '{print $1}' | awk 'BEGIN {max = 0} {if ($1+0 > max) max = $1+0} END {print max}')
Memory_temp_num=$(hpasmcli -s 'SHOW TEMP' | awk '/MEMORY_BD/{print $3}' | awk -F "C" '{print $1}' | awk 'BEGIN {max = 0} {if ($1+0 > max) max = $1+0} END {print max}')
System_temp_num=$(hpasmcli -s 'SHOW TEMP' | awk '/SYSTEM_BD/{print $3}' | awk -F "C" '{print $1}' | awk 'BEGIN {max = 0} {if ($1+0 > max) max = $1+0} END {print max}')
Processor_temp_num=$(hpasmcli -s 'SHOW TEMP' | awk '/PROCESSOR_ZONE/{print $3}' | awk -F "C" '{print $1}' | awk 'BEGIN {max = 0} {if ($1+0 > max) max = $1+0} END {print max}')
# One "host key value" line per item; the first echo truncates the file
echo $HOSTNAME hp_smart_array.cache_status $Cache_status > ${log_file}
echo $HOSTNAME hp_smart_array.controller_status $Controller_status >> ${log_file}
echo $HOSTNAME hp_smart_array.battery_capacitor_status $Battery_capacitor_status >> ${log_file}
# "hpysicaldrive" (sic): the key must stay identical to the trapper item key on the server
echo $HOSTNAME hp_hardware.hpysicaldrive_status $Physicaldrive_status >> ${log_file}
echo $HOSTNAME hp_hardware.memory_status $Memory_status >> ${log_file}
echo $HOSTNAME hp_hardware.fans_status $Fans_status >> ${log_file}
echo $HOSTNAME hp_hardware.power_status $Power_status >> ${log_file}
echo $HOSTNAME hp_hardware.processor_status $Processor_status >> ${log_file}
echo $HOSTNAME hp_power.temp_num $Power_temp_num >> ${log_file}
echo $HOSTNAME hp_ambient.temp_num $Ambient_temp_num >> ${log_file}
echo $HOSTNAME hp_cpu.temp_num $Cpu_temp_num >> ${log_file}
echo $HOSTNAME hp_memory.temp_num $Memory_temp_num >> ${log_file}
echo $HOSTNAME hp_system.temp_num $System_temp_num >> ${log_file}
echo $HOSTNAME hp_processor.temp_num $Processor_temp_num >> ${log_file}
$zabbix_sender -z $zabbix_server -i ${log_file} > /tmp/zabbix.temp
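The script runs hpasmcli -s 'SHOW TEMP' once per temperature item and chains several awk commands for each. One possible alternative, shown only as a sketch, captures the output once and computes each maximum in a single awk pass. The function name max_temp is hypothetical, and the location argument is treated as a regular expression matched against the second column.

```shell
#!/bin/sh
# Print the highest Celsius reading among sensors whose location (column 2)
# matches the regular expression $1. Reads `hpasmcli -s 'show temp'` output
# on stdin. Sensors reporting "-" are ignored; prints 0 if nothing matches.
max_temp() {
    awk -v loc="$1" '
        $2 ~ loc && $3 != "-" {
            t = $3; sub(/C.*/, "", t); t += 0    # "40C/104F" -> 40 (numeric)
            if (t > max) max = t
        }
        END { print max + 0 }
    '
}

# Possible use inside the monitoring script (requires hpasmcli and root):
#   temps=$(hpasmcli -s 'SHOW TEMP')
#   Cpu_temp_num=$(printf '%s\n' "$temps" | max_temp CPU)
#   Memory_temp_num=$(printf '%s\n' "$temps" | max_temp MEMORY_BD)
```

Adding 0 to the extracted reading forces a numeric comparison, so that 9 is not treated as larger than 10, which a string comparison would do.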
Finally, add a crontab entry so that the script runs every 5 minutes.
# echo "*/5 * * * * /etc/zabbix/scripts/hpacuclizabbix.sh" >> /var/spool/cron/root
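Appending with >> adds a duplicate entry every time the command is repeated. A small sketch of an idempotent variant follows; add_cron_line is a hypothetical helper that works on any file path, so it can be tried on a scratch file before touching root's crontab.

```shell
#!/bin/sh
# Append line $2 to crontab file $1 only if that exact line is not there yet.
add_cron_line() {
    cron_file=$1; entry=$2
    # -x: match whole lines, -F: treat the entry as a fixed string (it has "*")
    if [ -f "$cron_file" ] && grep -qxF -- "$entry" "$cron_file"; then
        return 0
    fi
    printf '%s\n' "$entry" >> "$cron_file"
}

# Possible use (as root; nothing is modified until this is called):
#   add_cron_line /var/spool/cron/root \
#       "*/5 * * * * /etc/zabbix/scripts/hpacuclizabbix.sh"
```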
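The item keys for HP server hardware are all defined in Zabbix as trapper items, so every value is collected through the trapper.
Among the triggers for HP server hardware, the nodata(600) trigger guards against data collection breaking on the monitored host, for example when crontab is not working or the script has been deleted by mistake. If the server receives no data from the monitored host for 10 minutes, it raises an alert.
On the Zabbix server, Latest data shows the values the server has received through the trapper.
The temperature of each component of the monitored server can be viewed there as well.
When something goes wrong on the monitored host, Zabbix alerts promptly.
Summary: how to monitor HP server hardware with Zabbix:
1) Install the hpacucli and hpasmcli management tools on the HP server.
2) In hpacuclizabbix.sh, set zabbix_server to the IP address of your own server or proxy, and add the script to crontab.
3) Import the template from the attachment and link it to the hosts you want to monitor.
4) If you run into other problems, feel free to get in touch.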