Problem with Zabbix 7 agent active checks in a Zabbix HA deployment

  • Markku
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
    • Sep 2018
    • 1781

    #16
    Originally posted by jhboricua
    I have two of them pointed directly to the Fargate container task's IP addresses, bypassing the Network Load Balancer. Should be fine as long as the containers are not stopped.
    Curious: Why do you need the NLBs there in the first place if you can reach the containers directly anyway? Is it because the container IP can change at any time, so it's not possible to keep an up-to-date DNS name for the container?

    Markku


    • Markku
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
      • Sep 2018
      • 1781

      #17
      FWIW, yesterday I set up a Zabbix HA cluster with one Linux agent2 just to test this; no issues yet (but I don't have any LBs or other middleboxes here).
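
      For reference, the agent side of the test is just the standard HA setup from the documentation: all cluster nodes listed in ServerActive, separated by semicolons. A minimal sketch (the node addresses and hostname are placeholders):

      # /etc/zabbix/zabbix_agent2.conf (sketch; hypothetical node addresses)
      # With a Zabbix HA cluster, list every node in ServerActive separated by
      # semicolons; the agent sends active check data to whichever node is active.
      ServerActive=zbx-node1.example.com;zbx-node2.example.com
      Hostname=ha-test-host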

      Markku


      • jhboricua
        Senior Member
        • Dec 2021
        • 113

        #18
        Originally posted by Markku

        Curious: Why do you need the NLBs there in the first place if you can reach the containers directly anyway? Is it because the container IP can change at any time, so it's not possible to keep an up-to-date DNS name for the container?

        Markku
        Yes, this is why.


        • jhboricua
          Senior Member
          • Dec 2021
          • 113

          #19
          Update for this morning. The two hosts I have targeting the containers directly have stopped communicating with the primary Zabbix node since the last agent restart. Same behavior as when they were targeting the load balancer: they start attempting to communicate with the stand-by node and continue to do so without ever sending traffic back to the primary. So it doesn't appear to be a load balancer issue. Unfortunately, increasing the logging verbosity didn't help, because I misunderstood the way Zabbix does 'log rotation'. It only keeps one file as .old, so the increased log volume caused it to be overwritten far too frequently, and I could not go back to the time the issue started.

          What I do see in the Wireshark capture on the agent is that there's a reset event at the time the issue starts, and the agent then flips to talking to the stand-by node. All traffic to the primary ceases.
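
          For anyone wanting to watch for the same thing, a capture filter along these lines shows only the resets (a sketch, assuming the default trapper port 10051; adjust to your setup):

          # Show only TCP RST packets on the Zabbix server port (default 10051)
          tcpdump -i any 'tcp port 10051 and (tcp[tcpflags] & tcp-rst) != 0'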

          I'm restarting the agent on these hosts and increasing the log file size to the 1 GB maximum, in hopes it retains enough entries to see what the logs show the next time the agent drops.
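
          In config terms that's roughly the following (a sketch; per the documentation LogFileSize is given in MB with 1024 as the maximum, and only a single .old rotation is kept):

          # /etc/zabbix/zabbix_agent2.conf (sketch)
          LogFileSize=1024   # maximum allowed value, in MB
          DebugLevel=4       # debug verbosity, to catch the moment the agent flips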


          • Markku
            Senior Member
              Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
            • Sep 2018
            • 1781

            #20
            A reminder that one key is still the server-side logs: what they say at the same time the agent ceases to connect to the active node.

            Markku


            • jhboricua
              Senior Member
              • Dec 2021
              • 113

              #21
              Yes, I plan on updating the log level on the backend side today. I might not be able to come back with another update until the end of this month due to planned time off starting tomorrow.

              Is the Zabbix cluster you deployed also container-based?


              • jhboricua
                Senior Member
                • Dec 2021
                • 113

                #22
                Is there something in particular I should be looking at in the server debug logs? After setting the level to 4 as suggested, there's just far too much output for me to parse effectively via CloudWatch, because it limits the number of results to a maximum of 10,000. Even after filtering out the proxy messages, I can hit the 10,000 limit with as little as seven seconds' worth of log entries, so trying to find something that happened in a 5-minute span is incredibly tedious.
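
                In case it helps anyone else, this is the shape of the CloudWatch Logs Insights query I'm using to cut the noise (a sketch; the proxy-message pattern is a placeholder for whatever your server actually logs):

                fields @timestamp, @message
                | filter @message not like /configuration data to proxy/
                | sort @timestamp asc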


                • Markku
                  Senior Member
                  Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                  • Sep 2018
                  • 1781

                  #23
                  Originally posted by jhboricua
                  Is the Zabbix cluster you deployed also container-based?
                  No, because I don't need the infrastructure complexity that containers would add; I'm interested in Zabbix performance. So I'm using VMs for this.

                  Markku


                  • Markku
                    Senior Member
                    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                    • Sep 2018
                    • 1781

                    #24
                    Originally posted by jhboricua
                    Is there something in particular I should be looking at in the server debug logs? After setting the level to 4 as suggested, there's just far too much output for me to parse effectively via CloudWatch, because it limits the number of results to a maximum of 10,000. Even after filtering out the proxy messages, I can hit the 10,000 limit with as little as seven seconds' worth of log entries, so trying to find something that happened in a 5-minute span is incredibly tedious.
                    Oh, sorry that I wasn't clear: normally (without increasing the logging level) Zabbix server logs the proxy configuration messages every 10 seconds, yes, generating "lots" of logs. (There is a ZBXNEXT for changing this.) But excluding those log lines, there shouldn't be much more, so you should be able to see whether anything else is logged during the times some agents decide to change their behavior: incomplete agent/proxy connections, or cluster switchover messages, giving you hints about what was happening.
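
                    If you have shell access to a node, filtering the raw server log can be easier than the log service; something like this (a sketch, the exact proxy-message wording can vary between versions):

                    # Drop the periodic proxy configuration lines, keep the rest
                    grep -v 'configuration data to proxy' /var/log/zabbix/zabbix_server.log

                    # The cluster's own view of nodes and switchovers
                    zabbix_server -R ha_status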

                    Markku


                    • jhboricua
                      Senior Member
                      • Dec 2021
                      • 113

                      #25
                      After much troubleshooting and no clear answers from the server logs, I disabled HA mode and went back to a single Zabbix server container to see if the agents would still randomly drop their connections. One week in so far, and not a single issue with active checks failing. We might simply rely on Fargate to handle a Zabbix server container failure by replacing the task with a new one, if that's what it takes to keep the monitoring solution stable.
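
                      For reference, going back to standalone is essentially just this on the server side (a sketch; per the documentation, HA mode is active exactly when HANodeName is set, and the node name below is hypothetical):

                      # zabbix_server.conf (sketch)
                      # Commenting out HANodeName starts the server in standalone mode.
                      # HANodeName=zbx-node1

                      # Optional cleanup: remove the retired node from the cluster
                      # (node name is hypothetical)
                      zabbix_server -R ha_remove_node=zbx-node2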

