Trying to monitor a system that occasionally half crashes.

  • scalft
    Junior Member
    • Apr 2008
    • 12

    #1

    Hello,

    I have a Linux server that will occasionally kernel panic, but not take the system completely down. The server will still ping, but will not accept TCP connections (ssh, zabbix agent, etc.). I am trying to design a way to effectively monitor for this situation. I would also like it to work from a template so I don't have to go through about 30 servers to configure each individually.

    I have attempted

    Code:
    tcp,10050
    as a simple check on a test host.
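For readers unfamiliar with this kind of check, a TCP simple check amounts to roughly the following. This is a minimal Python sketch, not Zabbix's actual implementation; the function name `tcp_port_open` and the 3-second default timeout are assumptions made for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> int:
    """Return 1 if a TCP connection to host:port succeeds within timeout, else 0.

    A host that still answers ping but cannot accept TCP connections
    (refused or timed out) yields 0 here.
    """
    try:
        # create_connection performs the full TCP handshake or raises OSError
        # (this includes connection refused and timeouts).
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:
        return 0


if __name__ == "__main__":
    # 10050 is the agent port mentioned in the post; the address is a placeholder.
    print(tcp_port_open("127.0.0.1", 10050))
```

The explicit timeout matters for the half-crashed case: a host that silently drops connections would otherwise block the check for a long time instead of reporting 0.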

    When I stop zabbix_agent on the host, the value sometimes changes to zero, but sometimes doesn't. Checking zabbix_server.log I see the following:

    Code:
      4961:20080515:092422 Host [hostname]: another network error, wait for 15 seconds
      4961:20080515:092437 Get value from agent failed. Error: Cannot connect to [a.b.c.d:10050] [Connection refused]
      4961:20080515:092437 Host [hostname] will be checked after 60 seconds
      4961:20080515:092547 Timeout while answering request
      4961:20080515:092547 Get value from agent failed. Error: Cannot connect to [a.b.c.d:10050] [Interrupted system call]
    This means the host is not being checked during that interval, so my simple check item is not updated and the trigger never fires.

    Thoughts?

    Thad
  • nelsonab
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Sep 2006
    • 1233

    #2
    You could trigger on nodata: key[item].nodata(time), which fires when the item has received no new data within the given time.
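The idea behind a nodata-style trigger can be sketched as follows. This is an illustrative Python model of the logic only, not Zabbix code; the function signature, the `last_update` timestamp, and the choice to treat "never updated" as no data are all assumptions.

```python
import time
from typing import Optional


def nodata(last_update: Optional[float], period: float,
           now: Optional[float] = None) -> int:
    """Return 1 if no value arrived within the last `period` seconds, else 0.

    last_update: Unix timestamp of the most recent value, or None if the
                 item has never received one (treated as "no data" here).
    """
    if now is None:
        now = time.time()
    if last_update is None:
        return 1
    return 1 if (now - last_update) > period else 0


if __name__ == "__main__":
    # Last value arrived 100 s ago, allowed gap is 60 s.
    print(nodata(last_update=time.time() - 100, period=60))
```

Because this fires on the absence of updates rather than on a value of zero, it still works when the server stops polling the host and the item is simply never refreshed.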
    RHCE, author of zbxapi
    Ansible, the missing piece (Zabconf 2017): https://www.youtube.com/watch?v=R5T9NidjjDE
    Zabbix and SNMP on Linux (Zabconf 2015): https://www.youtube.com/watch?v=98PEHpLFVHM
