
All Agents "Unreachable" At Same Time - Graphs Still Populating

  • chrisw
    Junior Member
    • Nov 2014
    • 10

    #1

    Hello all,

    I've come across a seemingly random issue today. Since yesterday, while I was away, every single monitored server has been alerting for "Zabbix Agent Unreachable", yet on the Configuration pages all of the Availability icons are green and all the graphs are still populating fine.

    I'm seeing network connection errors in the logs, but I can't find any network issues: I can ping the hosts from the Zabbix server with no packet loss.

    Even the local Zabbix Agent on the Zabbix Server is showing unreachable.

    I am having connectivity issues with one host, but I have 36 monitored, and no issues with the other 35.

    Can anyone help point me in the right direction? Rebooting the server had no effect.
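A successful ping only shows that ICMP gets through; it does not show that the agent's TCP port answers. A minimal sketch of a port check, assuming the agents listen on the default Zabbix agent port 10050 (the host list in the example is hypothetical):

```python
import socket

AGENT_PORT = 10050  # default Zabbix agent listen port (assumed unchanged)


def agent_port_open(host: str, port: int = AGENT_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical host list; replace with the real monitored hosts.
    for host in ["127.0.0.1"]:
        state = "open" if agent_port_open(host) else "closed or filtered"
        print(f"{host}:{AGENT_PORT} is {state}")
```

Run from the Zabbix server, this only tests the TCP handshake, not whether the agent actually returns item values.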
  • tchjts1
    Senior Member
    • May 2008
    • 1605

    #2
    Graphs can "appear" to be populating fine, depending on the size of the time period you are looking at. Compare 1 hour with, say, 7 days.

    If that were happening to me, I would look at my Zabbix internal processes and see if maybe you need to allocate some additional pollers or other config adjustment. To see the data for Internal Processes, see here, the last paragraph and the graphs that follow it: https://www.zabbix.com/forum/showthread.php?t=41219
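Before changing anything, it can help to see what the poller and unreachable-host settings currently are. A rough sketch that pulls those parameters out of the text of zabbix_server.conf; the sample values are made up for illustration, and any parameter missing from the result is simply unset or commented out in the file:

```python
# Server parameters related to pollers and unreachable-host handling.
POLLER_KEYS = (
    "StartPollers",
    "StartPollersUnreachable",
    "StartPingers",
    "Timeout",
    "UnreachablePeriod",
    "UnreachableDelay",
    "UnavailableDelay",
)


def read_settings(conf_text, keys=POLLER_KEYS):
    """Return {parameter: value} for the requested keys found in conf_text."""
    found = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in keys:
            found[key] = value.strip()
    return found


# Illustrative sample only; these are not recommended values.
SAMPLE = """
# StartPollers=5
StartPollers=10
StartPollersUnreachable=2
Timeout=4
LogFile=/tmp/zabbix_server.log
"""

if __name__ == "__main__":
    for key, value in read_settings(SAMPLE).items():
        print(f"{key} = {value}")
```

On a real server you would pass in the contents of the actual config file (often /etc/zabbix/zabbix_server.conf, though the path depends on the distribution) instead of the sample string.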


    • chrisw
      Junior Member
      • Nov 2014
      • 10

      #3
      Hey tchjts1, Thanks for the reply!

      I do have them defaulted to 1 hour views, and they are still populating as we speak.

      I resolved the one host's issue (duplicate IP) and it no longer shows as Unreachable, but the other hosts aren't experiencing even remotely similar symptoms. The host that had the duplicate IP was practically inaccessible and disconnecting me every few minutes, while all my other hosts are active and stable.

      Literally all my hosts fired off that alert yesterday at 15:15:30, with the exception of one that fired at 15:16:00. I have confirmed via our security panel that no one was even in the building at that time to cause anything physical (and the duplicate IP issue only started this morning around 8AM).

      There is a small gap from my restart, but literally every other host that is alerting for Unreachable is populating up to date data on the graphs - even going to Latest Data is showing values and changes for the last check, which at this time, occurred just under 1 minute ago (Nov 17th 16:55 EST).

      I had those graphs set up on a screen, but everything is okay now. It does lead me to some other potential issues though, as it was pretty hairy prior to my restart (CPU load and busy% were quite a bit higher than they are now). With the exception of the recovery spike, however, the past 12 hours have been steady and clear at 30% busy or below and a CPU load of 2 or below.

      I'm just really baffled that the checks and graphs keep updating while the hosts are reported as unreachable for over 24 hours.

      I will keep digging around the Zabbix server though, as it doesn't seem to be a problem with the clients at least.

      Thanks again!


      • chrisw
        Junior Member
        • Nov 2014
        • 10

        #4
        I've resolved a few other one-off problems now and cleared up my list quite a bit, and all but 3 of the remaining hosts have one thing in common that I did not notice before:

        Code:
        Cannot evaluate function "ServerName:agent.ping.nodata(5m)"
        Where ServerName is different for each server affected.

        This doesn't sound particularly good, but I'll keep digging.


        • tchjts1
          Senior Member
          • May 2008
          • 1605

          #5
          What version of Zabbix server are you using?


          • tchjts1
            Senior Member
            • May 2008
            • 1605

            #6
            Is this your trigger expression at the template level?

            Code:
            {Template OS Windows:agent.ping.nodata(5m)}=1
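For anyone following along, here is a simplified model of what that expression checks. It is only a sketch of the documented behaviour of nodata() (1 if no data arrived during the period, 0 otherwise), not Zabbix's actual implementation, and the function and argument names are made up:

```python
def nodata(last_value_ts, now, period_s):
    """Simplified model: 1 if no value arrived in the last period_s seconds, else 0.

    last_value_ts is the Unix timestamp of the newest agent.ping value,
    or None if the item has never received data.
    """
    if last_value_ts is None:
        return 1
    return 1 if now - last_value_ts > period_s else 0


def trigger_fires(last_value_ts, now, period_s=300):
    """Model of {host:agent.ping.nodata(5m)}=1 with 5m taken as 300 seconds."""
    return nodata(last_value_ts, now, period_s) == 1


if __name__ == "__main__":
    now = 1_416_261_300  # arbitrary example timestamp
    print(trigger_fires(now - 55, now))   # value under 1 minute old
    print(trigger_fires(now - 900, now))  # value 15 minutes old
```

By this model, a host whose Latest Data shows a value less than a minute old should not satisfy the expression, so a trigger that stays in the problem state may mean the expression is not being re-evaluated at all, which would fit the "Cannot evaluate function" message.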


            • chrisw
              Junior Member
              • Nov 2014
              • 10

              #7
              The latest currently available in Gentoo's Portage repo, 2.2.7. I recently did an upgrade, I believe from 2.2.2, but that was over 2 weeks ago and things have been running fine up until yesterday.

              I was having some issues installing sudo but I managed to get that going, and I confirmed the Zabbix user can use ping, for argument's sake.

              Yes to your trigger question, for both the Windows and Linux templates (3 Linux hosts and 5 Windows hosts remain in this state, with the same errors for both OSes):

              Code:
              {Template OS Windows:agent.ping.nodata(5m)}=1
              {Template OS Linux:agent.ping.nodata(5m)}=1
              There is also no firewall between this machine and the clients, nor anything like GRSEC or SELINUX to get in the way.

              Also strange: these "cannot evaluate function" errors don't appear to be showing up in the logs. The only other similar instances I have relate to one of our MFCs, but I know that's down to the differences between a ColorQube and a WorkCentre. I was hoping there would be something more in the logs related to this, but I can't find anything.

              I also just confirmed there are a few agents still at 2.2.2, but the problem is affecting some on 2.2.2 and some on 2.2.7, so the version difference doesn't appear to be the root cause.
              Last edited by chrisw; 18-11-2014, 01:28. Reason: Added correct previous version, note about client versions


              • chrisw
                Junior Member
                • Nov 2014
                • 10

                #8
                Well, I finally gave up, as I don't have a backup for Zabbix and there's only so long you can go without your monitoring system.

                I went to do an update to 2.4.1, and upon trying to back up the database, I found there were quite a few corrupted tables. Not sure if this was related to the issues I was seeing or not.

                I blew away the database and started fresh with 2.4.1. Was able to export everything, and everything imported but the Screens, which will be a pain. Also of course lost all my historical data, but at least I've got my Zabbix back up and running properly.
