Gaps in graphs for a single host with plenty of resources

  • ConstipatedNinja
    Junior Member
    • Jan 2020
    • 7

    #1


    Hello!

    We have a few thousand hosts in our Zabbix instance, but there is exactly one host that is misbehaving. The graphs are all extremely choppy for just this one host. We're currently using Zabbix 3.4.14, both server-side and client-side. The client that's experiencing problems is a rather large VM running CentOS 7.2.1511 and hosting an Artifactory instance.

    Things I've ruled out:
    • CPU load too high - the client in question continuously runs at a load of about 4, but it has 16 cores available to it.
    • Not enough memory - the client consistently has 50-52 GB of free RAM.
    • Not enough disk - there's plenty of disk space available (about 40% full).
    • Network problems - ifconfig shows >50 TiB of RX traffic and >30 TiB of TX traffic with zero dropped packets.
    • VM host problems - there are several other VMs on the same vCenter host, and none of them are showing gaps in their data.
    • DNS issues - I've tried both specifying the hostname (which works for everything else) and specifying the IP, and there is no difference.
    • A rogue web scenario taking too long - I did identify one web scenario that averaged 30 seconds to return, but I've disabled it. 3 web scenarios remain, and they all return within milliseconds.
    • Crazy IPv6 issues - since IPv6 troubles can manifest in weird ways, I've stripped everything down to IPv4 only. There is no difference.

    I've slowly increased the StartAgents, BufferSend, BufferSize, MaxLinesPerSecond, and Timeout values to larger and larger numbers. Originally they were 5, 30, 300, 100, and 15 (respectively), but now they're at a monstrous 8, 3600, 8000, 500, and 30 (respectively), and no setting in between has affected the issue.
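    For reference, a zabbix_agentd.conf excerpt with the current values from the post would look roughly like this (the file path is the common default and may differ per install):

    ```
    # /etc/zabbix/zabbix_agentd.conf (excerpt; values from the post)
    StartAgents=8          # passive-check listener processes
    BufferSend=3600        # active checks only: max seconds to hold data in buffer
    BufferSize=8000        # active checks only: max values in the memory buffer
    MaxLinesPerSecond=500  # active log/logrt checks only
    Timeout=30             # max seconds to spend processing one check
    ```

    Worth noting: BufferSend, BufferSize, and MaxLinesPerSecond only affect active checks (and log items), so if the gappy items are passive, tuning them would be expected to have no effect.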


    This machine is pretty integral to our company's workflow, and we'd be receiving complaints hand over fist if it had even the slightest performance problem.

    Attached is a picture of the Template OS Linux CPU Load graph for the troubled client, scaled to the last 6 hours of data.

    Any and all help, hare-brained ideas, and conjecture are very welcome. I've run through everything I can think of checking and suggestions of new things to check will help me keep my sanity.

    Thanks in advance!
    -Lily
  • Markku
    Senior Member
    Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
    • Sep 2018
    • 1781

    #2
    Hi Lily, as a networking professional, in cases like this I would definitely take a packet capture on both the agent side and the server side. That would give a definitive answer to the question: does the data even get from the agent to the server?

    sudo tcpdump -v -s 0 -w capture1.pcap port 10050 and host <ipaddress>
    (port 10050 assumes a passive agent; use 10051 for an active agent)

    Then use Wireshark to inspect the captures.

    If you have TLS enabled, you won't be able to see the data, so in that case you could temporarily disable TLS to see it.

    (I frequently give this suggestion here to see the actual Zabbix traffic, but seldom does anyone do it... Happy anyway if people find answers to their problems somehow.)

    Markku


    • Markku
      Senior Member
      Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
      • Sep 2018
      • 1781

      #3
      Btw do you have any interesting messages in the server or agent logs?

      Markku


      • ConstipatedNinja
        Junior Member
        • Jan 2020
        • 7

        #4
        Hi, Markku! Thanks a ton for the reply!

        I'll start some captures right away. I appreciate the suggestion! I'm admittedly much more used to diagnosing infiniband issues than ethernet issues, and they're entirely different beasts.

        I'm afraid that there were no interesting messages in the logs. The most interesting bit I found was an absence of logs. An example item of system.cpu.load[all,avg1] has an interval of 1m for this agent. I was able to see on the agent side the "Requested [system.cpu.load[all,avg1]]" line followed shortly by the "Sending back [<number>]" line every minute for several minutes, followed suddenly by gaps where there was no request (and thus no sending anything back). On the server-side there are occasional lines for other hosts showing that an item failed due to a network error, but never for the host in question.
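    As an aside (not from the thread itself): if you can export the item's history timestamps, e.g. via the API or straight from the database, gaps like the ones described are easy to flag programmatically. A minimal sketch, assuming a sorted list of Unix timestamps and a 60-second update interval:

    ```python
    def find_gaps(timestamps, interval=60, slack=1.5):
        """Return (start, end, length) for every gap longer than slack*interval.

        timestamps: sorted Unix timestamps of received values
        interval:   the item's update interval in seconds
        slack:      tolerance factor before a delay counts as a gap
        """
        gaps = []
        for prev, cur in zip(timestamps, timestamps[1:]):
            delta = cur - prev
            if delta > slack * interval:
                gaps.append((prev, cur, delta))
        return gaps

    # Example: values arrive every minute, then a 5-minute hole
    ts = [0, 60, 120, 180, 480, 540]
    print(find_gaps(ts))  # -> [(180, 480, 300)]
    ```

    Running this against the history of system.cpu.load[all,avg1] would tell you exactly when the gaps occur and whether they cluster at regular intervals.
    
    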


        Once I have some captures I'll dig through and report back. Thanks again!
        -Lily


        • Markku
          Senior Member
          Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
          • Sep 2018
          • 1781

          #5
          In the situation you described (passive agents, and you have already observed that the agent is not receiving the requests), it would be more interesting to inspect the captures on the server side.

          I wrote some Wireshark plugins to dissect Zabbix traffic, https://github.com/markkuleinio/wire...bix-dissectors, in case they are helpful. They're not required, though.

          Is there a specific reason for not using active agents? At the scale you described, active items are preferred, because the server no longer needs to poll for all the data.

          Markku


          • ConstipatedNinja
            Junior Member
            • Jan 2020
            • 7

            #6
            Thanks for the plugins, I really appreciate it!

            If I'm being entirely honest, our Zabbix instance started as someone else's pet project to monitor the servers that they cared about. It eventually expanded to encompass our operations. Adding to that, past management decided that every possible aspect of every server should be monitored and should be capable of alerting. I've done a LOT of cleanup, but it's still definitely a product of its upbringing.

            Attached is what I'm seeing from the server-side packet captures.
            Attached Files


            • Markku
              Senior Member
              Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
              • Sep 2018
              • 1781

              #7
              Now there is some interesting output, in the session with Zabbix server-side port 37224:
              • In line 4 the server sends the item request
              • Agent ACKs that packet immediately in line 5 (= it certainly received the request)
              • In line 16 (almost 15 seconds later!) the server apparently has had enough of waiting and sends FIN to close the connection, which is ACKed by the agent in the next packet
              • In line 18 the agent eventually sends the response, 25 seconds after the actual request
              • The server responds with a reset because it is not expecting anything anymore (the response is most probably discarded in that case).
              Compare to the session with port 40394, in which the response comes much faster, practically immediately.

              So, check inside packet number 4: which item does it contain that makes the agent wait so long before responding?
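              A side note for anyone digging into the payloads by hand (not part of the thread): Zabbix frames its TCP data with a 13-byte header, the literal bytes ZBXD, a protocol flag, and an 8-byte little-endian payload length; depending on the version, passive-check requests may instead be a bare item key plus newline. A minimal parser sketch for the framed form:

              ```python
              import struct

              def parse_zabbix_frame(data: bytes):
                  """Parse one Zabbix protocol frame:
                  b'ZBXD' + 1 flag byte + 8-byte little-endian length + payload.

                  Returns the payload bytes, or raises ValueError on a malformed frame.
                  """
                  if len(data) < 13 or data[:4] != b"ZBXD":
                      raise ValueError("not a ZBXD frame")
                  (length,) = struct.unpack_from("<Q", data, 5)
                  payload = data[13:13 + length]
                  if len(payload) != length:
                      raise ValueError("truncated frame")
                  return payload

              # Example: a frame carrying the value "0.25"
              frame = b"ZBXD\x01" + struct.pack("<Q", 4) + b"0.25"
              print(parse_zabbix_frame(frame))  # -> b'0.25'
              ```

              In Wireshark you can get the same bytes via Follow TCP Stream and feed them to a helper like this to see exactly which item key the server asked for.
              
              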

              Markku


              • Markku
                Senior Member
                Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
                • Sep 2018
                • 1781

                #8
                If at all possible, it is actually still preferable to have simultaneous captures on both the server and the agent, just to rule out any strange effects of packet loss in the path.

                Markku


                • ConstipatedNinja
                  Junior Member
                  • Jan 2020
                  • 7

                  #9
                  It appears that the check in question was the web scenario averaging 30 seconds, the one I had disabled for a while to rule it out as the culprit. I'm afraid that nothing changed about the gappiness when I disabled that check.

                  I just used at(1) to synchronize a packet capture. Here's another example of one of these resets, but from both sides.
                  -Lily
                  Attached Files


                  • ConstipatedNinja
                    Junior Member
                    • Jan 2020
                    • 7

                    #10
                    I disabled that check and reran a simultaneous packet capture for the same length as the last one. Instead of ~330 packets being captured over a 5 minute span, this time there were >2100 captured packets, and only one conversation that ended in resets. However, that one conversation that ended in resets was the disabled check. For some reason the server still ran it one last time more than 15 minutes after the check was disabled.

                    During the time of this packet capture, there were still gigantic gaps in the graphs.


                    • Markku
                      Senior Member
                      Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
                      • Sep 2018
                      • 1781

                      #11
                      Ok, in the agent-side capture we see:
                      • A large request is received in packet #45 (session with port 42114)
                      • The agent does nothing with it until packet #58 (15 seconds later), where it FINs the session - it looks like it never intends to answer it; maybe an agent-side timeout kicks in?
                      • HOWEVER, 15 seconds later (#60) the agent sends the data, but as we see in the server-side #61, the server does not accept it because the agent already sent the FIN earlier. Thus the data does not get handled at all.
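                      The accept/discard logic seen in both captures can be summarized in a few lines (a sketch; the 15-second window is an assumption about the server-side Timeout in effect here):

                      ```python
                      def response_accepted(request_ts, response_ts, timeout=15.0):
                          """Model the behavior seen in the captures: the server waits
                          `timeout` seconds for a passive-check response, then FINs the
                          connection; any response arriving later is met with a RST and
                          the data is discarded."""
                          return (response_ts - request_ts) <= timeout

                      # Session on port 42114: response 30 s after the request -> discarded
                      print(response_accepted(0.0, 30.0))  # -> False
                      # Session on port 45170: immediate response -> accepted
                      print(response_accepted(0.0, 0.2))  # -> True
                      ```
                      
                      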

                      Again the question: what was that request, and why did the agent just sit on it? What is in the response?

                      The other session in the agent-side capture (with port number 45170):
                      • The request was received in #50
                      • It was responded to immediately in #52
                      • Everybody was happy at the connection level.

                      What did that request and response contain? Is it shown in the Latest data?

                      Markku
