High unreachable use, but ... not. What are they doing?

  • Linwood
    Senior Member
    • Dec 2013
    • 398

    #1

    I run zabbix for numerous clients and am not seeing this problem.

    I am running it on my home network, have for some time (on 4.4.5 so a bit old), and in the last month or so have been getting alerts from "Unreachable poller processes more than 75% busy".

    So I did what I normally do, just increased them and moved on. Error came back. Was distracted so just doubled them without thinking. Error came back.

    So I looked more closely -- I was up to 100 processes. I have between zero and one item unreachable depending on whether my laptop is booted.

    So... what in the world are they doing to be busy? Is there some other functionality now rolled into those processes? When I look at the history they range from 2% to 97% busy, hovering around 50% on average, but with excursions where it stays above 90%. But again -- at most 1 system is unreachable. And 66 total hosts defined.

    Which actually raises another question: what does "unreachable" mean in this context -- unreachable for SNMP polls? Unreachable for pings?

    If I look at ps I see them looking like this (representative sample):

    zabbix 11419 11183 0 135138 10484 0 Aug31 ? 00:00:01 /usr/local/sbin/zabbix_server: unreachable poller #43 [got 1 values in 60.058949 sec, getting values]
    zabbix 11424 11183 0 135138 10496 1 Aug31 ? 00:00:01 /usr/local/sbin/zabbix_server: unreachable poller #44 [got 1 values in 60.054875 sec, getting values]
    zabbix 11425 11183 0 135138 11644 3 Aug31 ? 00:00:01 /usr/local/sbin/zabbix_server: unreachable poller #45 [got 1 values in 60.031330 sec, getting values]
    zabbix 11426 11183 0 135138 11764 3 Aug31 ? 00:00:01 /usr/local/sbin/zabbix_server: unreachable poller #46 [got 1 values in 60.059361 sec, getting values]
    zabbix 11427 11183 0 135138 10480 3 Aug31 ? 00:00:01 /usr/local/sbin/zabbix_server: unreachable poller #47 [got 1 values in 60.053022 sec, getting values]
    zabbix 11428 11183 0 135138 10728 2 Aug31 ? 00:00:01 /usr/local/sbin/zabbix_server: unreachable poller #48 [got 0 values in 0.000049 sec, getting values]
    zabbix 11429 11183 0 135138 11040 0 Aug31 ? 00:00:01 /usr/local/sbin/zabbix_server: unreachable poller #49 [got 1 values in 60.058609 sec, getting values]
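A minimal sketch for summarizing process titles like the ones in the listing above, assuming the `[got N values in S sec, STATE]` format shown there stays stable. The function name `summarize` and the one-second "slow" threshold are illustrative choices, not anything Zabbix defines. The two sample lines are copied from the listing.

```python
import re
from statistics import mean

# Matches the status part of a zabbix_server process title as printed by ps.
# The pattern is derived only from the listing in this post.
TITLE_RE = re.compile(
    r"unreachable poller #(?P<num>\d+) "
    r"\[got (?P<values>\d+) values in (?P<secs>\d+\.\d+) sec, (?P<state>[^\]]+)\]"
)

SAMPLE = [
    "zabbix 11427 11183 0 135138 10480 3 Aug31 ? 00:00:01 /usr/local/sbin/zabbix_server: "
    "unreachable poller #47 [got 1 values in 60.053022 sec, getting values]",
    "zabbix 11428 11183 0 135138 10728 2 Aug31 ? 00:00:01 /usr/local/sbin/zabbix_server: "
    "unreachable poller #48 [got 0 values in 0.000049 sec, getting values]",
]


def summarize(ps_lines, slow_threshold=1.0):
    """Count unreachable pollers and those whose last cycle took a long time.

    slow_threshold (seconds) is an arbitrary cutoff chosen for this sketch.
    """
    rows = []
    for line in ps_lines:
        m = TITLE_RE.search(line)
        if m:
            rows.append((int(m["num"]), int(m["values"]), float(m["secs"]), m["state"]))
    slow = [r for r in rows if r[2] >= slow_threshold]
    return {
        "pollers": len(rows),
        "slow_pollers": len(slow),
        "mean_slow_secs": mean(r[2] for r in slow) if slow else 0.0,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Run against the full listing, a summary like this makes it easy to see that six of the seven sampled pollers each spent about 60 seconds fetching a single value.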

    I do have discovery configured, but it is for ICMP only, in groups of 128 IPs at 10-day intervals, so I cannot see it being related to that.

    Where do I start looking?

    Linwood
  • isaqueprofeta
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Aug 2020
    • 154

    #2
    A lot of good stuff about this in this thread: https://www.zabbix.com/forum/zabbix-...e-this-problem

    • Linwood
      Senior Member
      • Dec 2013
      • 398

      #3
      So I read through that but I do not think it helped. It says it is for passive hosts; I think mine are active. But that said, I have started 100 poller processes and I have a grand total of 66 hosts. Active hosts. All reachable.

      I've got sites with thousands of hosts and maybe only 100 or 200 unreachable processes, and never get an alert from them.

      I still feel like I'm missing something here -- why are so many in use in this instance, and in particular, in use for what?

      • tim.mooney
        Senior Member
        • Dec 2012
        • 1427

        #4
        If I were in your situation I would first go back to the default # of StartUnreachablePollers, since your environment shouldn't require any more than the default. Having 100 of them running and all encountering the same problem is just going to cloud the next debugging steps.

        Then, I would use the runtime control option to increase the logging (greatly) for the unreachable poller processes. Examine the logs. See if you can determine where the delays are coming from (for example, broken DNS resolution of your clients).

        If that doesn't give you a good clue to investigate, my next step would be to just attach to one of the unreachable poller processes with strace and follow what it's doing. Use options to strace to write the output to a file and to include some kind of timestamps for entry of system calls -- whichever of strace's timing options you prefer, as long as it gives you a clue where it's spending its time. If you're not comfortable reading strace output you can follow up and I can provide suggestions for options and if you post the logs I (or someone else that beats me to it) can help you interpret where it's spending its time.
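A rough sketch of how a saved strace log could be scanned for the slowest system calls, assuming the trace was captured with strace's `-tt` (wall-clock timestamps), `-T` (time spent in each call, printed as a trailing `<seconds>`), `-o FILE`, and `-p PID` options. The function name `slowest_calls` is hypothetical, and the sample lines are invented for illustration only. They are not output from this thread.

```python
import re

# One strace line captured with -tt -T looks roughly like:
#   HH:MM:SS.micros syscall(args...) = result <seconds>
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) (?P<call>\w+)\(.*<(?P<dur>\d+\.\d+)>$"
)

# Invented sample lines (documentation IP range), NOT taken from the thread.
SAMPLE = [
    '10:15:01.000100 connect(7, {sa_family=AF_INET, sin_port=htons(10050), '
    'sin_addr=inet_addr("192.0.2.10")}, 16) = -1 EINPROGRESS '
    '(Operation now in progress) <0.000080>',
    '10:15:01.000300 poll([{fd=7, events=POLLOUT}], 1, 3000) = 0 (Timeout) <3.003010>',
    '10:15:04.003500 close(7) = 0 <0.000020>',
]


def slowest_calls(lines, top=3):
    """Return up to `top` (duration, syscall, timestamp) tuples, slowest first.

    Lines that do not match the expected format (signals, unfinished
    calls, exit messages) are skipped.
    """
    calls = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            calls.append((float(m["dur"]), m["call"], m["ts"]))
    return sorted(calls, reverse=True)[:top]


if __name__ == "__main__":
    for dur, call, ts in slowest_calls(SAMPLE):
        print(f"{ts}  {call:<10} {dur:.6f}s")
```

For a real trace, the lines would come from the file written by `-o`, for example `slowest_calls(open(path))` with `path` pointing at that file. Long durations on calls such as `poll` or `connect` would point at network or name-resolution waits.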

        • Linwood
          Senior Member
          • Dec 2013
          • 398

          #5
          So I'm completely confused. I tried strace and learned nothing. The one I picked seemed to do, well, basically nothing. So I reduced them to 10 and increased logging. Again, learned nothing -- they appeared to be doing nothing. But here's what is weird. After I changed from 100 to 10, the number in use went from near 100% to about 30%, then went down yet again. The period near zero is where I disabled all hosts -- and indeed usage went to zero. Re-enabled, and it came back. I looked more carefully and there are 2-3 Windows servers that are offline but have some external checks that return errors, so there should be a few unreachables -- so the 20% looks about right. But what was the near 100%? That was over multiple reboots and several increases in startup processes.

          I guess I need to wait for it to start failing again.


          [Attached image: use.jpg]
