
Delayed items

  • TRNX
    Member
    • Oct 2019
    • 54

    #16
    I sent you a private message with a URL for downloading the server log. (I don't want to publish it.)
    I don't think there is anything special in the configuration. First I installed Debian 10, and then I installed Zabbix 4.4.1 (after some time I updated it to 4.4.4). When I installed Zabbix I got an error and had to edit PATH (Google helped me). I am not a Linux user, but I think the configuration is good. I proceeded step by step according to these instructions: https://www.zabbix.com/download?zabb...ysql&ws=apache

    And yes, I am the primary IT admin. I have some colleagues, but they have access only to the frontend.


    • Markku
      Senior Member
      Zabbix Certified SpecialistZabbix Certified ProfessionalZabbix Certified Expert
      • Sep 2018
      • 1782

      #17
      Sorry for the delayed reply. I looked at the server log file, and it was full of various error messages, from PowerShell and other sources. What I would recommend is to try to get rid of all the errors by removing or disabling the configurations that cause them. The errors should be quite self-explanatory to you, as you know best what special things you have added there. The goal is of course to eliminate the possible sources of problems.

      Editing PATH sounds strange to me, as I've installed countless Zabbix servers on Debian (various versions of both, using the Zabbix repo) and I've never had to change any of those system-level variables. If you have a fellow Linux admin available, you should ask for a second opinion about the overall system state.

      Markku


      • TRNX
        Member
        • Oct 2019
        • 54

        #18
        Thank you for your answer. I know about the errors. We have a lot of hosts with different PowerShell versions; some hosts have problems, others don't. I edited the PowerShell command for one of these errors and now it should be OK. Now I must update all hosts.
        Editing PATH: I had a problem when I installed Zabbix. I proceeded step by step according to the instructions on the official Zabbix website, and in one of these steps I got an error. So I googled the error message and found that the way to resolve it was to edit PATH. I don't remember what the error message was, but after I edited it, it worked.


        • TRNX
          Member
          • Oct 2019
          • 54

          #19
          Hello.

          I tried to follow your advice and disable or "repair" the unsupported items. Now I have only 3 unsupported items in Zabbix, of which 2 are PowerShell scripts that time out (my timeout is already at the maximum of 30 s). In the Zabbix agent log (on the Zabbix server) I have some unknown connections (I think they are internet bots), and sometimes there are these records:

          :active check configuration update from [127.0.0.1:10051] started to fail (cannot connect to [[127.0.0.1]:10051]: [4] Interrupted system call)
          :active check configuration update from [127.0.0.1:10051] is working again
          I don't know why the agent sometimes can't connect to the server (agent and server are on the same machine).
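
To see how often the agent loses its connection to the server, the matching log records can be counted. This is only a minimal sketch, assuming a standard grep; the function name is hypothetical, and the log path in the usage comment is an assumption (check the LogFile setting in the agent configuration):

```shell
# Hypothetical helper: count how often active-check configuration updates
# "started to fail" in the agent log file given as the first argument.
count_active_check_failures() {
  grep -c 'active check configuration update from .* started to fail' "$1"
}

# Usage (the path is an assumption, adjust it to your LogFile setting):
#   count_active_check_failures /var/log/zabbix/zabbix_agentd.log
```

Comparing the count across several days may show whether the failures line up with the times when the queue grows.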

          In the Zabbix server log there are some denied autoregistrations.

          The main problem is still the flood of e-mails saying that no data was received from a host for 10 minutes. It is about 100-200 e-mail messages every day, but it is not true. The hosts are still online. Sometimes there are only a few items in the queue and sometimes there are a lot. I don't know why and I don't know what I should do.
          Attached Files


          • TRNX
            Member
            • Oct 2019
            • 54

            #20
            Any ideas?


            • TRNX
              Member
              • Oct 2019
              • 54

              #21
              If nobody has any ideas, the topic can be closed. Thank you for your advice.


              • rakkioo7
                Junior Member
                • Sep 2020
                • 1

                #22
                You didn't mention the Zabbix server and agent versions. However, take one problem host and update its Zabbix agent to the newest possible version in your Zabbix server release series. I had a single misbehaving active agent that was running an older agent version, and updating the agent solved the issues. I was left unsure whether the previous installation was somehow faulty or whether the agent version made the difference.


                • TRNX
                  Member
                  • Oct 2019
                  • 54

                  #23
                  Excuse me for reviving the topic, but the problem persists and it is becoming annoying. I still have a lot of items in the queue (not always, but most of the time).
                  The main problem is that two hosts keep firing one trigger, which sends e-mail. So when I come to work, I have for example 300-500 mail messages per night. A screenshot from Outlook is attached. We don't have any bandwidth limit on the Zabbix server.

                  All hosts have the same agent version: 5.0.0.2400
                  The server version is: 5.0.2
                  There were problems in previous versions too.

                  Here is our system information:
                  Number of enabled hosts: 205
                  Number of enabled items: 25498
                  Required server performance, new values per second: 26.95

                  Any idea where the problem could be?
                  Attached Files


                  • Markku
                    Senior Member
                    Zabbix Certified SpecialistZabbix Certified ProfessionalZabbix Certified Expert
                    • Sep 2018
                    • 1782

                    #24
                    What is the item that is failing?

                    Markku


                    • TRNX
                      Member
                      • Oct 2019
                      • 54

                      #25
                      It is the average CPU load check (system.cpu.load[percpu,avg1]), which has a 5-minute interval. And I have this trigger: {TemplateName:system.cpu.load[percpu,avg1].nodata(10m)}=1
                      This trigger signals that data from the host wasn't received. I have approximately 40 host groups and only two (always the same host groups) have the problem.
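
The relationship between the 5-minute item interval and nodata(10m) can be illustrated with a rough sketch. It is a simplification that only looks at gaps between value arrival times, and it is not how the Zabbix server evaluates the function internally; the helper name nodata_gaps and the sample timestamps are made up for illustration:

```shell
# Hypothetical helper: reads one arrival time per line (seconds, ascending)
# and prints every gap between consecutive values that is longer than the
# window given as the first argument (default 600 s, i.e. 10m).
nodata_gaps() {
  window="${1:-600}"
  awk -v w="$window" '
    NR > 1 && ($1 - prev) > w { printf "gap %ds after %d\n", $1 - prev, prev }
    { prev = $1 }
  '
}

# Values are expected every 300 s; the value due at 900 and 1200 never
# arrives, so the gap between 600 and 1500 exceeds the 600 s window.
printf '%s\n' 0 300 600 1500 1800 | nodata_gaps 600
# prints: gap 900s after 600
```

With a 300 s interval, a single late or lost value is not enough to exceed 600 s on its own; a value has to be delayed by more than one full interval, which matches the suspicion that the data is arriving late rather than not at all.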


                      • Markku
                        Senior Member
                        Zabbix Certified SpecialistZabbix Certified ProfessionalZabbix Certified Expert
                        • Sep 2018
                        • 1782

                        #26
                        As far as I understand your situation, two of your hosts (active agents) fail to have the data correctly received by the Zabbix server. As other similar hosts are working fine (= the Zabbix server is functioning properly), it must be something with the two agents or their connectivity with the Zabbix server. (Let us know if the situation was understood incorrectly, you are sometimes talking about hosts and sometimes about host groups)

                        What kind of errors do you get in the Zabbix logs (server and agents) for those two hosts?

                        You can run tcpdump to see the behaviour in the actual agent connection TCP sessions if you need more information about the connections.

                        Markku


                        • TRNX
                          Member
                          • Oct 2019
                          • 54

                          #27
                          I apologize for the delay.
                          Sorry for the confusion. I mean two hosts ("host group" was a mistake), each of them in a different host group. Data from these hosts arrives in Zabbix, but it is probably delayed.

                          This trigger has an action to send an e-mail when the problem starts. And when the problem is resolved, we get a second e-mail about the resolution. So the CPU load check/item must be delayed by more than 5 minutes, if I understand it right, because we get 300-400 of these e-mail messages per night. In addition, this check is set on every monitored host in our Zabbix, but only two behave like this.
                          It is the average CPU load check (system.cpu.load[percpu,avg1]), which has a 5-minute interval. And I have this trigger: {TemplateName:system.cpu.load[percpu,avg1].nodata(10m)}=1
                          This trigger signals that data from the host wasn't received. I have approximately 40 host groups and only two (always the same host groups) have the problem.
                          The first problem host is the only one in its host group (Customer A has 1 server), but the second problem host is in a host group that has another 2 non-problem hosts (Customer B has 3 servers).

                          I have attached agent logs from both problem hosts. There are no errors, only an occasional timeout of one script that takes a long time. I have also added a screenshot where you can see (in the queue tab) the delay of one problem host.

                          Thank you very much for any help.
                          Attached Files


                          • Markku
                            Senior Member
                            Zabbix Certified SpecialistZabbix Certified ProfessionalZabbix Certified Expert
                            • Sep 2018
                            • 1782

                            #28
                            So you have resolved your previous errors in the Zabbix server? Previously you had some connectivity problems within your server.

                            Are you sure that all the systems are having the correct synchronized time?

                            You can also run tcpdump on the Zabbix server and see what is happening in the data of the affected agents.

                            Markku


                            • TRNX
                              Member
                              • Oct 2019
                              • 54

                              #29
                              Hello, and thank you for the answer.

                              "So you have resolved your previous errors in the Zabbix server? Previously you had some connectivity problems within your server."
                              Today I looked at the two latest zabbix-server logs and I didn't find any records about connectivity problems between server and agent. But there were a lot of PowerShell script errors. This is the next problem. For example, I have a PowerShell script that checks the Windows Server Backup result. On some hosts the script runs fast (maybe 10 sec?), but on other hosts the same script takes 1 or 2 minutes. I can't increase the timeout parameter because the maximum is 30 sec. But we need this check; we need information about the backup results. So I don't know what to do in this situation.

                              "Are you sure that all the systems are having the correct synchronized time?"
                              Thank you for the advice. The Zabbix server has the correct time. One of the problem hosts has its time synchronized too, but the second problem host did not. So I fixed it. I will see if it helps.

                              "You can also run tcpdump on the Zabbix server and see what is happening in the data of the affected agents."
                              I ran tcpdump on the Zabbix server for approximately 20 minutes. This is the command I used:
                              Code:
                              /usr/sbin/tcpdump -i any host HOST-IP-ADDRESS and port 10051 -vvv -n -w tcpdump.pcap
                              I hope I used the command correctly. If not, please correct me. Can you check the output file? I don't understand it.
                              Here is the download link to the .pcap output: https://drive.google.com/file/d/1Oxh...ew?usp=sharing


                              Next I tried disabling all other unsupported items (except the PowerShell scripts that time out).
                              Last edited by TRNX; 15-04-2021, 09:10.


                              • Markku
                                Senior Member
                                Zabbix Certified SpecialistZabbix Certified ProfessionalZabbix Certified Expert
                                • Sep 2018
                                • 1782

                                #30
                                I filtered only the packets starting the TCP sessions (source = client, TCP has SYN bit on):

                                [Image: delay-pcap.png]
                                In the delta display time column you can see very irregular intervals when the client is initiating the sessions. As there are lots of different items with different intervals being monitored, I guess that is normal.
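
A similar view can be approximated on the command line. The sketch below assumes the capture file name tcpdump.pcap from the earlier post and uses the pcap filter for packets with SYN set and ACK clear; the helper name syn_deltas is hypothetical, and this is not necessarily the exact filter used for the screenshot:

```shell
# Hypothetical helper: reads lines whose first field is an epoch timestamp
# (as printed by "tcpdump -tt") and prints the seconds elapsed since the
# previous line, one delta per line.
syn_deltas() {
  awk 'NR > 1 { printf "%.3f\n", $1 - prev } { prev = $1 }'
}

# Usage (not executed here; needs tcpdump and the capture file):
#   tcpdump -tt -n -r tcpdump.pcap \
#     'tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' | syn_deltas
```

Each printed number is the time between two consecutive session starts from the client, which corresponds roughly to the delta time column in the screenshot.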

                                Your connection is TLS-encrypted, so we cannot see exactly what data the packets contain (and so we don't know what happens with the CPU items that show the delay problem). You may want to first remove all items from the host, then add back only some specific items (like the CPU item) that don't include sensitive data, temporarily remove TLS encryption, and then repeat the packet capture to see what you get.

                                (Update, forgot to say: In the capture there isn't any indication of network problems that could explain your problems, as far as I see it)

                                Markku

