Zabbix server stops responding to clients

  • cunningdavid
    Junior Member
    • Mar 2021
    • 11

    #1

    We recently upgraded our Zabbix server to version 7.0.18. Since then we have several times had issues where, after running fine for a few days, alerts were generated for all clients because of a lack of updates. Looking at the Zabbix client logs on all hosts, we see this:

    5583:20251003:025848.279 active check data upload to [zabbix.company.com:10051] started to fail ([connect] cannot connect to [[zabbix.company.com]:10051]: [4] Interrupted system call)
    5583:20251003:030118.338 active check data upload to [zabbix.company.com:10051] is working again

    Even the Zabbix client on the Zabbix server itself logged this:

    893612:20251003:025810.835 Unable to connect to [zabbix.company.com]:10051 [cannot connect to [[zabbix.company.com]:10051]: connection timed out]
    893612:20251003:025810.835 Unable to send heartbeat message to [zabbix.company.com]:10051 [cannot connect to [[zabbix.company.com]:10051]: connection timed out]

    On the Zabbix server we verified that the zabbix_server process was still running and still listening on TCP port 10051. We did note that "netstat" showed several thousand TCP connections to port 10051 in CLOSE_WAIT state. A tcpdump showed data arriving on port 10051, and at least some requests got a response from the Zabbix server, although I assume not all clients did, given the errors they logged. Restarting zabbix_server immediately resolves the problem for a few days.

    Can someone please point us in the right direction of how to debug the issue? Thank you in advance.
    Last edited by cunningdavid; 03-10-2025, 05:28.
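To track how the CLOSE_WAIT count on port 10051 grows between restarts, the "netstat" check described above could be scripted. The following is only a rough sketch, assuming a Linux host where /proc/net/tcp and /proc/net/tcp6 are readable; the function name count_states is made up for this example and is not part of Zabbix.

```python
from collections import Counter

# State codes as they appear in /proc/net/tcp (Linux kernel TCP states).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, local_port=10051):
    """Count connections per TCP state whose local port matches local_port."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is the local "HEXADDR:HEXPORT", fields[3] the state code.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        if port == local_port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__":
    total = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as fh:
                total += count_states(fh.read())
        except OSError:
            pass  # file missing, e.g. IPv6 disabled or not a Linux host
    for state, n in total.most_common():
        print(f"{state:12} {n}")
```

Run from cron every few minutes and appended to a file, the output would show whether CLOSE_WAIT climbs steadily over days or jumps at a particular moment.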
  • cunningdavid
    Junior Member
    • Mar 2021
    • 11

    #2
    When sniffing traffic from one of the Zabbix clients we see things like the following. What would cause the failed items in the response?

    T 2025/10/03 04:14:30.914547 xx.xx.168.2:10051 -> yy.yy.1.71:51100 [AP] #1770
    ZBXD.\.......{"response":"success","info":"processed: 1; failed: 18; total: 19; seconds spent: 0.000773"}

    T 2025/10/03 04:14:55.029307 xx.xx.168.2:10051 -> yy.yy.1.71:59010 [AP] #1774
    ZBXD.\.......{"response":"success","info":"processed: 0; failed: 19; total: 19; seconds spent: 0.000134"}
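To log these counters over time rather than reading them off a capture, the response can be decoded programmatically. This is a minimal sketch, assuming the uncompressed frame layout seen in the capture: the "ZBXD" signature, one flags byte (0x01), a 4-byte little-endian payload length, a 4-byte reserved field, then the JSON body. The function names decode_zbxd and parse_info are invented for illustration.

```python
import json
import struct


def decode_zbxd(frame: bytes) -> dict:
    """Decode one uncompressed ZBXD frame and return its JSON payload."""
    if frame[:4] != b"ZBXD":
        raise ValueError("not a ZBXD frame")
    flags = frame[4]
    if flags != 0x01:
        # Compressed or large-packet frames are outside this sketch.
        raise ValueError(f"unsupported flags 0x{flags:02x}")
    # Two little-endian uint32 fields: payload length, then a reserved field.
    length, _reserved = struct.unpack_from("<II", frame, 5)
    payload = frame[13:13 + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return json.loads(payload)


def parse_info(info: str) -> dict:
    """Turn 'processed: 1; failed: 18; ...' into a dict of numbers."""
    stats = {}
    for part in info.split(";"):
        key, _, value = part.partition(":")
        value = value.strip()
        stats[key.strip()] = int(value) if value.isdigit() else float(value)
    return stats


if __name__ == "__main__":
    body = (b'{"response":"success","info":"processed: 1; failed: 18; '
            b'total: 19; seconds spent: 0.000773"}')
    # Rebuild a frame like the one in the capture, then decode it again.
    frame = b"ZBXD\x01" + struct.pack("<II", len(body), 0) + body
    print(parse_info(decode_zbxd(frame)["info"]))
```

Comparing the "failed" count per agent before and during an outage would show whether failures only appear once the server has degraded or are present all along.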

    • cyber
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Dec 2006
      • 4807

      #3
      Local firewalls? SELinux? If possible, try disabling them and see if anything gets better, then switch them on again and look through their logs.

      • cunningdavid
        Junior Member
        • Mar 2021
        • 11

        #4
        Originally posted by cyber
        Local firewalls? SELinux? If possible, try disabling them and see if anything gets better, then switch them on again and look through their logs.
        Thanks for the suggestion, but it doesn't seem to fit the problem description. Why would a firewall or SELinux problem let data collection work fine and then stop it after the program has been running for several days? And why would it start working again after restarting zabbix_server?
        I've a feeling that the "failed" part of the zabbix_server response in the sniffed network traffic above is important. What would cause failed items?

        • cyber
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Dec 2006
          • 4807

          #5
          That sniffed part does not say much. There can be many reasons, starting with network issues or missing items (OK, that is maybe relevant for trapper items, not agent items). It seems you have a connection ({"response":"success") but the data does not get processed.
          CLOSE_WAIT would suggest that your side has not finished its work and has not closed the socket. Is it a standalone setup or split up into UI/server/DB? Is some part of it running out of resources?

          • cunningdavid
            Junior Member
            • Mar 2021
            • 11

            #6
            Thanks for the reply, cyber. The sniffed part which says "processed: 1; failed: 18; total: 19" is from the Zabbix server. That suggests the problem isn't data failing to reach the Zabbix server, but data not being updated on the Zabbix server. Even the Zabbix server software itself knows something is going wrong if it reports "failed" items.

            The server is standalone, with the Zabbix server/UI/DB all on the same machine. Have you any suggestions for which resources we might check to see whether they're running out?
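One resource that is cheap to check, given the thousands of CLOSE_WAIT sockets, is file descriptors: every socket left in CLOSE_WAIT still occupies a descriptor in the owning process until it is closed. The sketch below compares open descriptors against the soft limit for each process. It assumes Linux /proc, that it is run as root or as the zabbix user, and that the processes report "zabbix_server" in /proc/<pid>/comm; the helper names are hypothetical. It does not establish that descriptors are the cause here, only whether they are close to the limit.

```python
import os
import re


def parse_max_open_files(limits_text):
    """Return the soft 'Max open files' limit, or None if unlimited/absent."""
    m = re.search(r"^Max open files\s+(\S+)\s+(\S+)", limits_text, re.MULTILINE)
    if not m or m.group(1) == "unlimited":
        return None
    return int(m.group(1))


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for one process."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as fh:
        return open_fds, parse_max_open_files(fh.read())


def pids_by_name(name="zabbix_server"):
    """Yield PIDs whose /proc/<pid>/comm equals name."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as fh:
                if fh.read().strip() == name:
                    yield int(entry)
        except OSError:
            continue  # process exited or is not readable


if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid in pids_by_name():
        try:
            used, limit = fd_usage(pid)
        except OSError:
            continue  # process went away or permission denied
        print(f"pid {pid}: {used} open fds, soft limit {limit}")
```

A process whose count sits near its limit during an outage would be worth a closer look in the zabbix_server log around the same timestamps.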

            • cyber
              Senior Member
              Zabbix Certified Specialist, Zabbix Certified Professional
              • Dec 2006
              • 4807

              #7
              How big is the setup? NVPS (new values per second)? Number of monitored hosts? The biggest resource need is the DB, then the server, then the UI.

              • cunningdavid
                Junior Member
                • Mar 2021
                • 11

                #8
                While those questions will help with giving the right resources to the Zabbix server, I'm not sure how they're relevant to a situation where the Zabbix server runs fine for several days and then stops updating data.

                Let's focus on a specific question to start with: can anyone help me understand what causes the Zabbix server to report non-zero "failed" items in its response to clients?
