Problem with Zabbix 7 agents active checks with Zabbix HA deployment.

  • jhboricua
    Senior Member
    • Dec 2021
    • 113

    #1

    We're in the process of upgrading our environment to Zabbix 7 and have set up a containerized HA deployment of Zabbix 7.0.2 in our non-prod environment. Our current production deployment is on Zabbix 5.0. We picked a subset of our Linux and Windows hosts and pointed them to the new 7.0 non-prod deployment so that we can start tuning templates. Their agents were upgraded to the 7.0.2 Agent2 package. On these agents we configured the Server and ServerActive parameters as follows:

    Code:
    Server=zbx-node-01.uat.zabbix.mydomain.com,zbx-node-02.uat.zabbix.mydomain.com
    ServerActive=zbx-node-01.uat.zabbix.mydomain.com;zbx-node-02.uat.zabbix.mydomain.com
    Notice we are using a semicolon between the two values in the ServerActive parameter, as this is a native Zabbix HA deployment.
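To make the separator rule concrete: as far as I understand the agent documentation, commas in ServerActive separate independent servers (or clusters), while semicolons separate the nodes of a single HA cluster. Below is a minimal Python sketch of that splitting rule. `parse_server_active` is a hypothetical helper written for illustration, not part of Zabbix, and it does not handle IPv6 literals; the default port 10051 is the trapper port.

```python
def parse_server_active(value, default_port=10051):
    """Split a ServerActive value into clusters of (host, port) nodes.

    Assumption for this sketch: commas separate independent
    servers/clusters, semicolons separate nodes of one HA cluster.
    IPv6 literals are not handled.
    """
    clusters = []
    for cluster in value.split(","):
        nodes = []
        for node in cluster.split(";"):
            node = node.strip()
            if not node:
                continue
            # An optional ":port" suffix overrides the default port.
            host, sep, port = node.rpartition(":")
            if sep and port.isdigit():
                nodes.append((host, int(port)))
            else:
                nodes.append((node, default_port))
        if nodes:
            clusters.append(nodes)
    return clusters


ha = parse_server_active(
    "zbx-node-01.uat.zabbix.mydomain.com;zbx-node-02.uat.zabbix.mydomain.com"
)
print(len(ha))     # 1: a single cluster
print(len(ha[0]))  # 2: two nodes in that cluster
```

With a comma instead of the semicolon, the same value would parse as two separate one-node targets, which is not what an HA pair needs.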

    On the 5.0 deployment we do passive checks. For the 7.0 deployment we would like to use active checks to reduce the load on the Zabbix servers. We are running into a problem. After a while, which could be hours or days, the active agents stop sending data and we get alarms for 'Active checks are not available'.

    On the agent logs we see this when the problem starts happening:

    Code:
    2024/08/19 11:08:47.198891 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
    2024/08/19 11:09:05.197837 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60222->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
    2024/08/19 11:09:05.197837 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
    2024/08/19 11:09:11.198846 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60232->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
    2024/08/19 11:09:11.198846 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
    2024/08/19 11:09:17.197849 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60244->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
    2024/08/19 11:09:17.197849 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
    2024/08/19 11:09:18.197859 [103] sending of heartbeat message to [zbx-node-02.uat.zabbix.mydomain.com:10051] is working again
    2024/08/19 11:09:23.197859 [103] cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60255->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
    2024/08/19 11:09:23.197859 [103] active check configuration update from host [svw005syseng700.mydomain.com] started to fail
    2024/08/19 11:09:29.197862 [103] connection closed
    2024/08/19 11:09:29.197862 [103] connection closed

    For some reason these failed agents stop communicating with the active node and then just continuously attempt to check in with the standby node, which of course tells them to take a hike. The only way to recover is to restart the Agent service. I ran a Wireshark capture and the Zabbix Agent generates no traffic to the active node, only to the standby node. As soon as I restart the Agent service, it starts sending traffic to the active node again.
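To check which node the failing agents keep hitting across many log files, the "cannot receive data from [...]" lines can be tallied per target. This is only a rough Python sketch that assumes the log format shown in the excerpt above; `count_failures_by_target` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches agent log lines such as:
#   ... cannot receive data from [host:port]: Cannot read message: ...
FAIL_RE = re.compile(r"cannot receive data from \[([^\]]+)\]")


def count_failures_by_target(lines):
    """Count 'cannot receive data' errors per [host:port] target."""
    counts = Counter()
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "2024/08/19 11:09:05.197837 [103] cannot receive data from "
    "[zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: '...'",
    "2024/08/19 11:09:18.197859 [103] sending of heartbeat message to "
    "[zbx-node-02.uat.zabbix.mydomain.com:10051] is working again",
]
for target, count in count_failures_by_target(sample).items():
    print(target, count)
```

If every failure points at the standby node and none at the active one, that matches what the Wireshark capture shows.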

    What could possibly be triggering this behavior? So far, it seems to be only happening with Windows agents.
  • Markku
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
    • Sep 2018
    • 1781

    #2
    The way Zabbix server HA works is that the standby node does not listen to the trapper port (10051). Thus, the agent possibly trying to connect to it gets a TCP reset from the server's TCP stack.

    Now, your log looks a bit suspicious when it says that an "existing connection was forcibly closed". This does not match the behavior I just described: since there is nobody listening to the connection, nobody ever accepts the connection, and thus the connection cannot be closed either (because there is no TCP connection).

    So, is it possible that there is something (like a firewall or other TCP proxy) that intercepts those connections?

    Since you are using Wireshark (that's great!), you are also able to see exactly what happens at the TCP level.

    About some agents being suddenly unable to connect to the active node, that's also something that you should see better with Wireshark. And, why would some agents behave this way and not the others, there is something for you to think about: how is the topology different, etc.

    You should also increase the logging level in the agent side to see what the agent is sending, receiving and thinking.

    Markku

    • Markku
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
      • Sep 2018
      • 1781

      #3
      To add to the previous post, it would also help to capture the affected agent connections (using tcpdump) on the server side. Then you can see exactly whether the connection (or attempt) from the agent matches what arrives on the server side, or whether someone is intercepting the connection (attempt) in transit.

      Markku

      • jhboricua
        Senior Member
        • Dec 2021
        • 113

        #4
        Originally posted by Markku
        The way Zabbix server HA works is that the standby node does not listen to the trapper port (10051). Thus, the agent possibly trying to connect to it gets a TCP reset from the server's TCP stack.
        This is what I see in the Wireshark captures: resets from the 2nd node when communications with the primary stop. But that's what I expect the 2nd node to do. I'm more interested in why the Windows Agent2 is suddenly not attempting to reach the primary. I mean, there was no traffic whatsoever to the primary in the Wireshark capture when the issue happens.

        Originally posted by Markku
        Now, your log looks a bit suspicious when it says that an "existing connection was forcibly closed". This does not match the behavior I just described: since there is nobody listening to the connection, nobody ever accepts the connection, and thus the connection cannot be closed either (because there is no TCP connection).

        So, is it possible that there is something (like a firewall or other TCP proxy) that intercepts those connections?
        This is a container deployment in AWS Fargate. There is a network load balancer in front of each backend server.

        zbx-node-01.uat.zabbix.mydomain.com --> backend1_NLB --> backend1_container
        zbx-node-02.uat.zabbix.mydomain.com --> backend2_NLB --> backend2_container

        By the way, these are network load balancers operating at layer 4, not layer 7 application load balancers.

        Originally posted by Markku
        About some agents being suddenly unable to connect to the active node, that's also something that you should see better with Wireshark. And, why would some agents behave this way and not the others, there is something for you to think about: how is the topology different, etc.
        The only difference is the OS of the server running the agent. So far, Linux servers are unaffected; only Windows agents are doing this. They are both configured exactly the same from a Server/ServerActive standpoint.

        Originally posted by Markku
        You should also increase the logging level in the agent side to see what the agent is sending, receiving and thinking.
        I'll try that. Do you recommend level 4 or 5?
        Last edited by jhboricua; 19-08-2024, 20:41.

        • Markku
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
          • Sep 2018
          • 1781

          #5
          Logging level 4 should be enough for us mere mortals, 5 is for hardcore Zabbix engineers

          By the way these are network load balancers, operating at layer 4, no layer 7 application load balancers.
          Ok, I don't have enough knowledge to say off the bat how it works, but I'd assume an NLB does not act as a TCP proxy but just forwards (and mangles) packets. So, if you saw TCP resets in the capture (for the same new connection attempts) while the agent log says "existing connection was forcibly closed", so be it, even though it doesn't make sense (output-wise).

          Interesting case anyway, hopefully the additional logging gives you some useful information about why the agent thinks it should change to using the other server.

          Markku

          • jhboricua
            Senior Member
            • Dec 2021
            • 113

            #6
            The AWS NLB can do proxy protocol v1 or v2. We have client IP preservation enabled on them, meaning the target (zabbix server) sees the agent traffic source IP of the actual client (the NLB becomes transparent).

            But I'm concerned with the fact that when this problem with the active agent not communicating happens, the agent simply stops generating traffic towards the active node. This morning I have 8 out of 17 hosts having this problem. Five of them are Windows hosts and three are Linux, so yes, it is not exclusive to Windows agents, I guess. On the Windows hosts I see no traffic being generated to the active node by the agent in Wireshark; it just keeps trying to reach the standby. Why does the agent suddenly stop attempting to reach the primary? That's not an NLB issue.

            Unfortunately I didn't have additional logging enabled on these new failed agents.

            • Markku
              Senior Member
              Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
              • Sep 2018
              • 1781

              #7
              Do you have any idea how the failing agents differ from the other agents? Things like recently restarted or not, network connectivity/topology, configuration.

              How are your active agents configured in Zabbix server, do you have agent interfaces configured, if you have, do you use IP (127.0.0.1 or real IP?) or DNS name?

              Markku

              • jhboricua
                Senior Member
                • Dec 2021
                • 113

                #8
                We install and enforce the agent configuration with Chef. The agent package installed via the Chef cookbook is Zabbix Agent2 version 7.0.2 (Windows/Linux).
                The monitored servers receive the same parameters in their agent configuration file. We only set the following parameters via Chef, leaving the rest at default values:

                Code:
                Server and ServerActive: as per my opening post in this thread.
                Hostname: fqdn of server
                HostInterface: IP address of server. We had to add this parameter with Zabbix 7.0 because without it, the agents don't autoregister their agent interface correctly. Not sure if this is a bug or not because we didn't have to do this on Zabbix 5 or 6. See my thread on this: https://www.zabbix.com/forum/zabbix-help/489529-odd-autoregistration-behavior-with-zabbix-7-0
                DebugLevel: I just added this attribute on our Chef cookbook this morning to make it easier to flip the debug level on agents.
                I have a simple autoregistration action defined that attaches the Linux or Windows by Zabbix agent active template to the servers as they register. That's it. They register with their real IP as per the thread I reference above.

                • Markku
                  Senior Member
                  Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                  • Sep 2018
                  • 1781

                  #9
                  Btw, about this error:

                  cannot receive data from [zbx-node-02.uat.zabbix.mydomain.com:10051]: Cannot read message: 'read tcp 10.134.37.126:60222->10.132.148.205:10051: wsarecv: An existing connection was forcibly closed by the remote host.'
                  This is coming from this code (link works as of today): https://git.zabbix.com/projects/ZBX/...s/comms.go#387

                  Before reaching the line that emits that error text, there has been the connect phase and also one send (write) of data.

                  The takeaway here is that something had accepted the connection and then disconnected it (= that is not a standard situation for the agent, so unexpected behavior may follow). I'd recommend checking the packet capture again to see if it gives you more information.
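To tell "nothing is listening" apart from "something accepted and then dropped the connection" without a full packet capture, a small probe that reports the stage at which the exchange failed can help. This is only a sketch in Python: `probe` is a hypothetical helper, the payload is arbitrary bytes rather than a real Zabbix protocol message, and a failure at the "recv" stage only says that the peer (or something in the path) accepted the TCP connection first.

```python
import socket


def probe(host, port, payload=b"", timeout=5.0):
    """Return (stage, detail) describing how far a TCP exchange got.

    stage is "connect", "send", "recv", or "ok".
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # Refused or timed out: nothing accepted the connection.
        return ("connect", repr(exc))
    with sock:
        try:
            if payload:
                sock.sendall(payload)
        except OSError as exc:
            return ("send", repr(exc))
        try:
            data = sock.recv(4096)
        except OSError as exc:
            # For example ConnectionResetError: accepted, then reset.
            return ("recv", repr(exc))
        if not data:
            return ("recv", "peer closed the connection without sending data")
        return ("ok", f"{len(data)} bytes received")
```

Running something like `probe("zbx-node-02.uat.zabbix.mydomain.com", 10051, b"x")` from an affected host should, if the standby really has no listener and nothing sits in between, fail at the "connect" stage; a failure at "recv" would point to something accepting the connection on the way.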

                  Also, can you capture packets on the containers?

                  Markku

                  • josefilho
                    josefilho commented
                    To solve this problem, my friend, you need to make sure your database has the same password as the zabbix_server.conf and conf.php files; the password must be the same in all three. Once that is sorted out, you NEED to log in to the database and grant the privileges:

                    mysql -uroot -p
                    If you don't know the password, use this command, setting 'password' to the same password as in the files mentioned above:
                    ALTER USER 'zabbix'@'localhost' IDENTIFIED BY 'password';
                    grant all privileges on zabbix.* to zabbix@localhost;
                    set global log_bin_trust_function_creators = 1;
                    FLUSH PRIVILEGES;
                    quit;

                    Once that is done:
                    systemctl restart zabbix-server zabbix-agent apache2
                    systemctl enable zabbix-server zabbix-agent apache2

                    Command to follow the database upgrade:

                    tail -f /var/log/zabbix/zabbix_server.log | grep database

                    Now see the whole update process from scratch.



                    Update Zabbix version 7.0

                    Command to check the Zabbix version:
                    grep "ZABBIX_VERSION" /usr/share/zabbix/include/defines.inc.php

                    Back up Zabbix line by line, or use the script further down:

                    mkdir /opt/zabbix-backup/
                    cp /etc/zabbix/zabbix_server.conf /opt/zabbix-backup/
                    cp /etc/zabbix/zabbix_agentd.conf /opt/zabbix-backup/
                    cp -R /usr/lib/zabbix/alertscripts/ /opt/zabbix-backup/
                    cp -R /usr/lib/zabbix/externalscripts/ /opt/zabbix-backup/
                    cp -R /usr/share/zabbix/ /opt/zabbix-backup/
                    mysqldump -u root -p zabbix > /opt/zabbix-backup/zabbix_backup.sql   # enter the database password when prompted
                    mkdir /opt/zabbix-backup/bkp-bd
                    cp /opt/zabbix-backup/zabbix_backup.sql /opt/zabbix-backup/bkp-bd


                    Script that creates the backup directories and copies the files, including the database backup:

                    mkdir /opt/zabbix-backup/ && \
                    cp /etc/zabbix/zabbix_server.conf /opt/zabbix-backup/ && \
                    cp /etc/zabbix/zabbix_agentd.conf /opt/zabbix-backup/ && \
                    cp -R /usr/lib/zabbix/alertscripts/ /opt/zabbix-backup/ && \
                    cp -R /usr/lib/zabbix/externalscripts/ /opt/zabbix-backup/ && \
                    cp -R /usr/share/zabbix/ /opt/zabbix-backup/ && \
                    mysqldump -u root -p zabbix > /opt/zabbix-backup/zabbix_backup.sql && \
                    mkdir /opt/zabbix-backup/bkp-bd && \
                    cp /opt/zabbix-backup/zabbix_backup.sql /opt/zabbix-backup/bkp-bd


                    Download and install Zabbix

                    zabbix=7.0&os_distribution=ubuntu&os_version=22.04&components=server_frontend_agent&db=mysql&ws=apache

                    Use the link above to download the Zabbix version, then use the commands below.

                    Create a directory and download the file above into the /zabbix-7/ directory:
                    wget https://repo.zabbix.com/zabbix/7.0/u...u22.04_all.deb
                    dpkg -i zabbix-release_7.0-2+ubuntu22.04_all.deb
                    apt update
                    apt upgrade -y
                    apt install zabbix-server-mysql zabbix-frontend-php zabbix-apache-conf zabbix-sql-scripts zabbix-agent
                    When you reach the screen below, select option N.
                    (image: Update Zabbix version 7.0 2)
                    Log in to the database and use the commands below to grant privileges:
                    mysql -uroot -p
                    grant all privileges on zabbix.* to zabbix@localhost;
                    set global log_bin_trust_function_creators = 1;
                    FLUSH PRIVILEGES;
                    quit;
                    Restart the Zabbix services:
                    systemctl restart zabbix-server zabbix-agent apache2
                    systemctl enable zabbix-server zabbix-agent apache2
                    Command to follow the database upgrade:
                    tail -f /var/log/zabbix/zabbix_server.log | grep database
                    Command to follow the Zabbix log:
                    tail -f /var/log/zabbix/zabbix_server.log
                • Markku
                  Senior Member
                  Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                  • Sep 2018
                  • 1781

                  #10
                  I don't see you mentioning anything about the Zabbix servers' logs, so how do things look there?

                  Markku

                  • jhboricua
                    Senior Member
                    • Dec 2021
                    • 113

                    #11
                    For some reason my reply to your post #7 got flagged as spam and it's pending moderation, so you probably can't see it.

                    Originally posted by Markku
                    I don't see you telling anything about the Zabbix servers' logs, so how are things shown there?

                    Markku
                    Might have to increase logging level on the backend servers too. Right now the server logs are riddled with 'sending configuration data to proxy....' entries which is pretty annoying when trying to search stuff.

                    • Markku
                      Senior Member
                      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                      • Sep 2018
                      • 1781

                      #12
                      Originally posted by jhboricua
                      For some reason my reply to your post #7 got flagged as spam and it's pending moderation, so you probably can't see it.
                      Right

                      Originally posted by jhboricua
                      Might have to increase logging level on the backend servers too. Right now the server logs are riddled with 'sending configuration data to proxy....' entries which is pretty annoying when trying to search stuff.
                      Yeah, "grep -v" is your friend there

                      Markku

                      • jhboricua
                        Senior Member
                        • Dec 2021
                        • 113

                        #13
                        Can't do grep on the backend logs as they are Fargate containers. I have to search the logs via CloudWatch, and that is an art form in itself, lol. Here's my filter to exclude those entries after multiple attempts:

                        Code:
                        [unixtime, var1, var2, var3, var4, var5, proxy !=*zbx-proxy*, var6, var7, var8, var9, var10, var11, var12, var13, var14, var15]
                        It's a PITA.

                        • Markku
                          Senior Member
                          Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                          • Sep 2018
                          • 1781

                          #14
                          I'm not much of a CloudWatch user, but I'm accustomed to using search terms there like -"not_this_word" (= minus in front of the term). Would that suffice?

                          Markku

                          • jhboricua
                            Senior Member
                            • Dec 2021
                            • 113

                            #15
                            I turned up the logging level on my test hosts and I have two of them pointed directly to the Fargate container task's IP addresses, bypassing the Network Load Balancer. Should be fine as long as the containers are not stopped.
