Zabbix proxy stopped working properly after reboot

  • NilsA
    Senior Member
    • Sep 2020
    • 102

    #1

    Zabbix proxy stopped working properly after reboot

    Hello there,

    One of our Zabbix proxies recently stopped working properly. According to the graphs, the problem began after the last automated reboot of the virtual machine.
    Since then, only simple checks on the hosts the proxy monitors have delivered data as expected. Items that use the Zabbix agent or SNMP have delivered data at most once after each reboot of the proxy and then stopped completely, without any error in the logs or the web interface.

    My Zabbix server is running version 5.0.4, the proxies are on 5.0.4 to 5.0.6, and the agents are anywhere between 4.0 and 5.0.6. The Zabbix server and proxies are installed on Debian 10 virtual machines on Hyper-V.

    I was not able to learn anything from the proxy or agent logs. The proxy log (at debug level 3) only reports that the configuration is received from the server and that the housekeeper is running. The agent logs are almost empty.
    The web interface shows no sign of the SNMP or Zabbix agent interfaces being unavailable on the hosts. I have also tested an SNMP get from the proxy, which works.
    The port my proxy uses to communicate with the server (10051) is open on both sides, which I tested with
    Code:
    nc -vz ip.ip.ip.ip 10051
    and the Zabbix agent ports are also open on the host side.
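
The port test above can be wrapped in a small helper so that both Zabbix ports are checked the same way. This is only a sketch: the function name `check_port` and the addresses in the usage comments are hypothetical, and `-w 3` is added so that a filtered port fails after three seconds instead of hanging. It covers TCP only; SNMP runs over UDP port 161, which `nc -z` cannot verify reliably.

```shell
# check_port HOST PORT: prints "HOST:PORT open" or "HOST:PORT closed".
# Same nc flags as in the post, plus a three-second timeout (-w 3).
check_port() {
    if nc -z -w 3 "$1" "$2" >/dev/null 2>&1; then
        echo "$1:$2 open"
    else
        echo "$1:$2 closed"
    fi
}

# Example calls (placeholder addresses, not taken from the thread):
#   check_port 192.0.2.10 10051   # from the proxy to the server
#   check_port 192.0.2.20 10050   # from the proxy to an agent host
```

An open TCP port only shows that something is listening; it does not prove that the agent accepts requests from the proxy.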

    I tried checking the proxy db (mariadb) but I simply don't know enough about the inner workings of Zabbix to find out anything of help to my issue.

    My guess is that the problem lies in the proxy somehow not working properly with agent or SNMP checks.

    Can anyone help me, or does anyone know where I could start looking for the problem?
  • Hamardaban
    Senior Member
    Zabbix Certified SpecialistZabbix Certified Professional
    • May 2019
    • 2713

    #2
    The proxy and server versions must match.

    • Zdenek_OMNISENSUIT
      Member
      Zabbix Certified SpecialistZabbix Certified Professional
      • Nov 2020
      • 55

      #3
      Hello.

      Yes, the Zabbix server and proxy must have the same version, but this applies only to the major version.
      This setup is OK: 5.0.x and 5.0.x.


      Please try increasing the log level and check the log again.

      • NilsA
        Senior Member
        • Sep 2020
        • 102

        #4
        Thanks for the help. Something I forgot to mention in my original post is that I actually tried to fix this problem by shutting down my original proxy and setting up a completely new virtual machine on which I then installed a Zabbix proxy.
        I gave it the same IP address as the original (which is of course still shut down) and I see the same behaviour with the new one.

        I checked the documentation and tried out a couple things:
        - increasing the poller log level to 5: nothing new was written to the log, only the usual "received configuration data from server at ...". That is surprising, since the pollers should be doing a lot of work in this network.
        - increasing the unreachable poller log level to 4: some new entries were written to the log, but I couldn't find anything helpful. The unreachable pollers never received any values, and I'm not sure what to make of that.
        - increasing the icmp pinger log level to 4: this also produced new log entries. The ICMP ping items are working fine and I can see them in my web interface, so I assume this is what the log of a working poller is supposed to look like.
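
For reference, per-process log levels like the ones listed above are changed through the proxy's runtime control option (`-R`). The sketch below only prints the commands instead of running them, and the helper name `loglevel_cmd` is hypothetical. Each call changes the level by one step, so going from the default level 3 to level 5 takes two increases.

```shell
# loglevel_cmd ACTION [TARGET]: prints a zabbix_proxy runtime-control command.
# ACTION is "increase" or "decrease"; TARGET is a process type such as
# "poller", "unreachable poller" or "icmp pinger". Without TARGET the
# command applies to all proxy processes.
loglevel_cmd() {
    if [ -n "$2" ]; then
        printf 'zabbix_proxy -R log_level_%s="%s"\n' "$1" "$2"
    else
        printf 'zabbix_proxy -R log_level_%s\n' "$1"
    fi
}

loglevel_cmd increase "poller"
loglevel_cmd increase "unreachable poller"
loglevel_cmd increase "icmp pinger"
loglevel_cmd increase
```

The printed commands have to be run on the proxy host itself, and the matching `log_level_decrease` calls bring the levels back down afterwards.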

        Here are some snippets from each of the tests:
        poller loglevel increase:
        Code:
        59104:20201217:103854.970 Got signal [signal:10(SIGUSR1),sender_pid:59082,sender_uid:1000,value_int:1(0x00000001)].
        59104:20201217:103854.970 log level has been increased to 5 (trace)
        59105:20201217:103854.970 Got signal [signal:10(SIGUSR1),sender_pid:59082,sender_uid:1000,value_int:1(0x00000001)].
        59105:20201217:103854.970 log level has been increased to 5 (trace)
        59106:20201217:103854.970 Got signal [signal:10(SIGUSR1),sender_pid:59082,sender_uid:1000,value_int:1(0x00000001)].
        59106:20201217:103854.970 log level has been increased to 5 (trace)
        59107:20201217:103854.970 Got signal [signal:10(SIGUSR1),sender_pid:59082,sender_uid:1000,value_int:1(0x00000001)].
        59107:20201217:103854.970 log level has been increased to 5 (trace)
        59108:20201217:103854.970 Got signal [signal:10(SIGUSR1),sender_pid:59082,sender_uid:1000,value_int:1(0x00000001)].
        59108:20201217:103854.970 log level has been increased to 5 (trace)
        59109:20201217:103854.970 Got signal [signal:10(SIGUSR1),sender_pid:59082,sender_uid:1000,value_int:1(0x00000001)].
        59109:20201217:103854.970 log level has been increased to 5 (trace)
        59110:20201217:103854.970 Got signal [signal:10(SIGUSR1),sender_pid:59082,sender_uid:1000,value_int:1(0x00000001)].
        59110:20201217:103854.970 log level has been increased to 5 (trace)
        59111:20201217:103854.970 Got signal [signal:10(SIGUSR1),sender_pid:59082,sender_uid:1000,value_int:1(0x00000001)].
        59111:20201217:103854.970 log level has been increased to 5 (trace)
        59083:20201217:104019.269 received configuration data from server at "192.168.1.15", datalen 88844
        unreachable poller loglevel increase:
        Code:
        59118:20201217:103802.121 In DCconfig_get_poller_items() poller_type:1
        59118:20201217:103802.121 End of DCconfig_get_poller_items():0
        59118:20201217:103802.121 In DCconfig_get_poller_nextcheck() poller_type:1
        59118:20201217:103802.121 End of DCconfig_get_poller_nextcheck():-1
        59118:20201217:103802.121 End of get_values():0
        59118:20201217:103802.121 zbx_setproctitle() title:'unreachable poller #7 [got 0 values in 0.000138 sec, idle 5 sec]'
        59117:20201217:103802.121 zbx_setproctitle() title:'unreachable poller #6 [got 0 values in 0.000103 sec, getting values]'
        59117:20201217:103802.121 In get_values()
        59117:20201217:103802.121 In DCconfig_get_poller_items() poller_type:1
        59117:20201217:103802.121 End of DCconfig_get_poller_items():0
        59117:20201217:103802.121 In DCconfig_get_poller_nextcheck() poller_type:1
        59117:20201217:103802.121 End of DCconfig_get_poller_nextcheck():-1
        59117:20201217:103802.121 End of get_values():0
        59117:20201217:103802.121 zbx_setproctitle() title:'unreachable poller #6 [got 0 values in 0.000128 sec, idle 5 sec]'
        59116:20201217:103802.121 zbx_setproctitle() title:'unreachable poller #5 [got 0 values in 0.000101 sec, getting values]'
        59116:20201217:103802.121 In get_values()
        59116:20201217:103802.121 In DCconfig_get_poller_items() poller_type:1
        59116:20201217:103802.121 End of DCconfig_get_poller_items():0
        59116:20201217:103802.121 In DCconfig_get_poller_nextcheck() poller_type:1
        59116:20201217:103802.121 End of DCconfig_get_poller_nextcheck():-1
        59116:20201217:103802.121 End of get_values():0
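
Status lines like the ones in this snippet can be condensed into one line per process to see at a glance which pollers are fetching values. This is a sketch only: the function name `poller_idle_summary` is hypothetical, and the log path in the usage comment is an assumption based on common package defaults.

```shell
# poller_idle_summary: reads proxy log lines on stdin and prints
# "<process name> <values fetched>" for every "idle" status line
# written by zbx_setproctitle(). Other lines are ignored.
poller_idle_summary() {
    sed -n "s/.*title:'\([a-z ]* #[0-9]*\) \[got \([0-9]*\) values in .* idle .*/\1 \2/p"
}

# Example (log path is an assumption):
#   poller_idle_summary < /var/log/zabbix/zabbix_proxy.log | sort | uniq -c
```

A process that only ever reports 0 values has nothing scheduled, while a process type with no status lines at all is either not logging at this level or not completing its polling cycle.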
        icmp pinger loglevel:
        Code:
        59120:20201217:104417.316 zbx_setproctitle() title:'icmp pinger #1 [pinging hosts]'
        59120:20201217:104417.316 In zbx_ping() hosts_count:2
        59120:20201217:104417.316 /tmp/zabbix_proxy_59120.pinger
        59120:20201217:104417.316 10.10.100.40
        59120:20201217:104417.316 10.10.100.1
        59120:20201217:104417.316 /usr/bin/fping -C3 -i0 2>&1 </tmp/zabbix_proxy_59120.pinger;
        59120:20201217:104417.319 read line [10.10.100.1 : [0], 84 bytes, 0.62 ms (0.62 avg, 0% loss)]
        59120:20201217:104417.320 read line [10.10.100.40 : [0], 84 bytes, 1.42 ms (1.42 avg, 0% loss)]
        59120:20201217:104418.321 read line [10.10.100.40 : [1], 84 bytes, 1.05 ms (1.23 avg, 0% loss)]
        59120:20201217:104418.321 read line [10.10.100.1 : [1], 84 bytes, 1.13 ms (0.87 avg, 0% loss)]
        59084:20201217:104419.223 received configuration data from server at "192.168.1.15", datalen 88844
        59120:20201217:104419.321 read line [10.10.100.40 : [2], 84 bytes, 1.12 ms (1.19 avg, 0% loss)]
        59120:20201217:104419.321 read line [10.10.100.1 : [2], 84 bytes, 1.11 ms (0.95 avg, 0% loss)]
        59120:20201217:104419.321 read line []
        59120:20201217:104419.321 read line [10.10.100.40 : 1.42 1.05 1.12]
        59120:20201217:104419.321 read line [10.10.100.1 : 0.62 1.13 1.11]
        59120:20201217:104419.322 End of zbx_ping():SUCCEED
        59120:20201217:104419.322 In process_values()
        59120:20201217:104419.322 host [10.10.100.40] cnt=3 rcv=3 min=0.001050 max=0.001420 sum=0.003590
        59120:20201217:104419.322 In process_value()
        59120:20201217:104419.322 In zbx_preprocess_item_value()
        59120:20201217:104419.322 End of zbx_preprocess_item_value()
        59120:20201217:104419.322 End of process_value()
        I then also tested zabbix_get against a host on which I had deactivated TLS PSK, and that worked fine.
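
As a side note, zabbix_get can also query an agent with PSK still enabled, so encryption does not have to be switched off for such a test. The sketch below assumes a hypothetical helper name `agent_ping`, and the host, identity, and key file in the usage comments are placeholders.

```shell
# agent_ping HOST [PSK_IDENTITY PSK_FILE]: queries agent.ping on port 10050.
# With only HOST the request is unencrypted; with an identity and a key
# file the request uses TLS PSK.
agent_ping() {
    if [ -n "$2" ] && [ -n "$3" ]; then
        zabbix_get -s "$1" -p 10050 -k agent.ping \
            --tls-connect psk --tls-psk-identity "$2" --tls-psk-file "$3"
    else
        zabbix_get -s "$1" -p 10050 -k agent.ping
    fi
}

# Example calls (placeholder values):
#   agent_ping 192.0.2.20
#   agent_ping 192.0.2.20 "PSK 001" /etc/zabbix/agent.psk
```

The agent only answers if the address zabbix_get runs from is listed in its Server parameter, so the test is most meaningful when run on the proxy itself.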

        If you have any clue what the issue could be or what I should test next, please let me know.

        • Zdenek_OMNISENSUIT
          Member
          Zabbix Certified SpecialistZabbix Certified Professional
          • Nov 2020
          • 55

          #5
          About increasing the log level: don't do it for just one process type, do it for the whole proxy.
          You can check the system logs too.
          A question about the reboot: did it follow some change on the server?
          You mentioned port 10051, but what about 10050? Is it allowed for incoming communication from the server?
          And is 10051 open for incoming (active) communication from the agents?
          From the description, it looks like you can still see only passive checks but not active ones.
          There is not enough information here, but try the above.

          • NilsA
            Senior Member
            • Sep 2020
            • 102

            #6
            I automated reboots on all proxies and the server with crontab a few weeks back because free memory was shrinking over time. So no change was made.
            A colleague updated the Hyper-V host on which all the virtual machines are running on the day of the reboot. That shouldn't have any effect on Zabbix functionality, though.

            Regarding the ports: my proxy is communicating with my server on port 10051. This works - ICMP items are being sent, heartbeat works fine. My server is monitoring each proxy directly on port 10050. This works fine as well.
            Here is the iptables from the proxy:
            Code:
            -P INPUT ACCEPT
            -P FORWARD ACCEPT
            -P OUTPUT ACCEPT
            -A INPUT -p tcp -m tcp --dport 10050 -j ACCEPT
            -A INPUT -p tcp -m tcp --dport 10051 -j ACCEPT
            Neither active nor passive checks are currently working on the proxy. All Zabbix agent items and SNMP items deliver no data to the server. Only ICMP Ping works.
            I have attached two screenshots of host screens. One is from the Hyper-V (Zabbix agent items only) - the other is from a printer in the network (one ICMP item, one SNMP item).
            As you can see, all non-ICMP items stopped providing data after the 12th of December, 11 pm. The small sections where those items do have data are from when I restarted the zabbix-proxy service or rebooted the server.

            I have also added a link to a part of the zabbix-proxy log from right after I set the debug level to 5 and restarted the service. I removed the details of the transmitted configuration for obvious security reasons.
            If you can make anything of it, please let me know. If not, I will look for the issue on the hosts: https://pastebin.com/P3RDbHQd

            Edit: the images couldn't be opened so here they are
            [Image: printer.PNG]
            [Image: hyper-v.PNG]
            Last edited by NilsA; 18-12-2020, 12:23.

            • NilsA
              Senior Member
              • Sep 2020
              • 102

              #7
              Small update:
              In the web interface, we disabled all hosts in the network except one, and we are now receiving data normally from that host.
              We will run more tests to find out what went wrong / where the problem lies.

              • NilsA
                Senior Member
                • Sep 2020
                • 102

                #8
                The problem has been resolved. In case anyone runs into a similar issue in the future here is what happened:

                My colleague was closely involved in the troubleshooting and found that a Lancom 1326 switch had malfunctioned after receiving SNMP requests. This apparently left the switch not working as it should.
                From the proxy and agent logs, we found that the Zabbix agents were still sending data for passive checks, but shortly after each proxy restart the packets arrived empty, without the actual data from the agent items.
                After rebooting the switch everything worked fine again.
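
The thread does not say how the faulty switch was identified. One possible way to narrow down a misbehaving SNMP device from the proxy is to send each device a single request with a short timeout and see which ones fail to answer. The sketch below assumes SNMP v2c and the net-snmp `snmpget` tool; the function name `snmp_probe`, the community string, and the addresses are placeholders.

```shell
# snmp_probe COMMUNITY HOST...: sends one SNMP v2c GET for sysUpTime.0
# (OID 1.3.6.1.2.1.1.3.0) to each host and prints "<host> ok" or
# "<host> no-response". -t 2 -r 0 means one attempt with a 2 s timeout.
snmp_probe() {
    community=$1
    shift
    for host in "$@"; do
        if snmpget -v2c -c "$community" -t 2 -r 0 "$host" 1.3.6.1.2.1.1.3.0 >/dev/null 2>&1; then
            echo "$host ok"
        else
            echo "$host no-response"
        fi
    done
}

# Example (placeholder community and addresses):
#   snmp_probe public 192.0.2.1 192.0.2.2
```

This complements the approach from post #7 of disabling hosts in the web interface until data flows again.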

                Anyway, thanks for the help.

                • Zdenek_OMNISENSUIT
                  Member
                  Zabbix Certified SpecialistZabbix Certified Professional
                  • Nov 2020
                  • 55

                  #9
                  Hello.

                  That is nice
