Gaps in graphs for a single host with plenty of resources

  • ConstipatedNinja
    Junior Member
    • Jan 2020
    • 7

    #1


    Hello!

    We have a few thousand hosts in our Zabbix instance, but there is exactly one host that is misbehaving. The graphs are all extremely choppy for just this one host. We're currently using Zabbix 3.4.14, both server-side and client-side. The client that's experiencing problems is a rather large VM running CentOS 7.2.1511 and hosting an Artifactory instance.

    Things I've ruled out:
    • CPU load too high - the client in question continuously runs at a load of about 4, but it has 16 cores available to it.
    • Not enough memory - the client consistently has 50-52 GB of free RAM.
    • Not enough disk - there's plenty of disk space available (about 40% full).
    • Network problems - ifconfig shows >50 TiB of RX traffic and >30 TiB of TX traffic with zero dropped packets.
    • VM host problems - there are several other VMs on the same vCenter host, and none of them are showing gaps in their data.
    • DNS issues - I've tried both specifying the hostname (which works for everything else) and specifying the IP, and there is no difference.
    • A rogue web scenario taking too long - I did identify one web scenario that averaged 30 seconds to return, but I've disabled it. 3 web scenarios remain, and they all return within milliseconds.
    • Crazy IPv6 issues - since IPv6 troubles can manifest in weird ways, I've stripped everything down to IPv4 only. There is no difference.

    I've slowly increased the StartAgents, BufferSend, BufferSize, MaxLinesPerSecond, and Timeout values to larger and larger numbers. Originally they were 5, 30, 300, 100, and 15 (respectively), but now they're at a monstrous 8, 3600, 8000, 500, and 30 (respectively), and no setting in between has affected the issue.
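    For reference, a zabbix_agentd.conf excerpt with the current values from the post would look roughly like this (the file path is the common default and may differ per install):

    ```
    # /etc/zabbix/zabbix_agentd.conf (excerpt; values from the post)
    StartAgents=8          # passive-check listener processes
    BufferSend=3600        # active checks only: max seconds to hold data in buffer
    BufferSize=8000        # active checks only: max values in the memory buffer
    MaxLinesPerSecond=500  # active log/logrt checks only
    Timeout=30             # max seconds to spend processing one check
    ```

    Worth noting: BufferSend, BufferSize, and MaxLinesPerSecond only affect active checks (and log items), so if the gappy items are passive, tuning them would be expected to have no effect.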


    This machine is pretty integral to our company's workflow, and we'd be receiving complaints hand over fist if it had even the slightest performance problem.

    Attached is a picture of the Template OS Linux CPU Load graph for the troubled client, scaled to the last 6 hours of data.

    Any and all help, hare-brained ideas, and conjecture are very welcome. I've run through everything I can think of checking and suggestions of new things to check will help me keep my sanity.

    Thanks in advance!
    -Lily
  • Markku
    Senior Member
    Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
    • Sep 2018
    • 1781

    #2
    Hi Lily, as a networking professional, in cases like this I would definitely take a packet capture on both the agent side and the server side. That would give a definitive answer to the question: does the data even get from the agent to the server?

    sudo tcpdump -v -s 0 -w capture1.pcap port 10050 and host <ipaddress>
    (port 10050 assumes a passive agent; use 10051 for an active agent)

    Then use Wireshark to inspect the captures.

    If you have TLS enabled, you won't be able to see the data, so in that case you could temporarily disable TLS to see it.

    (I frequently give this suggestion here to see the actual Zabbix traffic, but seldom does anyone do it... Happy anyway if people find answers to their problems somehow.)

    Markku


    • Markku
      Senior Member
      Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
      • Sep 2018
      • 1781

      #3
      Btw do you have any interesting messages in the server or agent logs?

      Markku


      • ConstipatedNinja
        Junior Member
        • Jan 2020
        • 7

        #4
        Hi, Markku! Thanks a ton for the reply!

        I'll start some captures right away. I appreciate the suggestion! I'm admittedly much more used to diagnosing infiniband issues than ethernet issues, and they're entirely different beasts.

        I'm afraid that there were no interesting messages in the logs. The most interesting bit I found was an absence of logs. An example item of system.cpu.load[all,avg1] has an interval of 1m for this agent. I was able to see on the agent side the "Requested [system.cpu.load[all,avg1]]" line followed shortly by the "Sending back [<number>]" line every minute for several minutes, followed suddenly by gaps where there was no request (and thus no sending anything back). On the server-side there are occasional lines for other hosts showing that an item failed due to a network error, but never for the host in question.
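    As an aside (not from the thread itself): if you can export the item's history timestamps, e.g. via the API or straight from the database, gaps like the ones described are easy to flag programmatically. A minimal sketch, assuming a sorted list of Unix timestamps and a 60-second update interval:

    ```python
    def find_gaps(timestamps, interval=60, slack=1.5):
        """Return (start, end, length) for every gap longer than slack*interval.

        timestamps: sorted Unix timestamps of received values
        interval:   the item's update interval in seconds
        slack:      tolerance factor before a delay counts as a gap
        """
        gaps = []
        for prev, cur in zip(timestamps, timestamps[1:]):
            delta = cur - prev
            if delta > slack * interval:
                gaps.append((prev, cur, delta))
        return gaps

    # Example: values arrive every minute, then a 5-minute hole
    ts = [0, 60, 120, 180, 480, 540]
    print(find_gaps(ts))  # -> [(180, 480, 300)]
    ```

    Running this against the history of system.cpu.load[all,avg1] would tell you exactly when the gaps occur and whether they cluster at regular intervals.
    
    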


        Once I have some captures I'll dig through and report back. Thanks again!
        -Lily


        • Markku
          Senior Member
          Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
          • Sep 2018
          • 1781

          #5
          In the situation you described (passive agents, and you have already observed that the agent is not receiving the requests), it would be more interesting to inspect the captures on the server side.

          I wrote some Wireshark plugins to dissect Zabbix traffic, https://github.com/markkuleinio/wire...bix-dissectors, in case they are helpful. They're not required, though.

          Is there a specific reason for not using active agents? At the scale you described, active items are preferred, because the server no longer needs to poll for all the data.

          Markku


          • ConstipatedNinja
            Junior Member
            • Jan 2020
            • 7

            #6
            Thanks for the plugins, I really appreciate it!

            If I'm being entirely honest, our Zabbix instance started as someone else's pet project to monitor the servers that they cared about. It eventually expanded to encompass our operations. Adding to that, past management decided that every possible aspect of every server should be monitored and should be capable of alerting. I've done a LOT of cleanup, but it's still definitely a product of its upbringing.

            Attached is what I'm seeing from the server-side packet captures.
            Attached Files


            • Markku
              Senior Member
              Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
              • Sep 2018
              • 1781

              #7
              Now there is some interesting output, in the session with Zabbix server-side port 37224:
              • In line 4 the server sends the item request
              • Agent ACKs that packet immediately in line 5 (= it certainly received the request)
              • In line 16 (almost 15 seconds later!) the server apparently has had enough of waiting and sends FIN to close the connection, which is ACKed by the agent in the next packet
              • In line 18 the agent eventually sends the response, 25 seconds after the actual request
              • The server responds with a reset because it is not expecting anything anymore (the response is most probably discarded in that case).
              Compare to the session with port 40394, in which the response comes much faster, practically immediately.

              So, check inside packet number 4: which item does it contain that makes the agent wait so long before responding?
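              A side note for anyone digging into the payloads by hand (not part of the thread): Zabbix frames its TCP data with a 13-byte header, the literal bytes ZBXD, a protocol flag, and an 8-byte little-endian payload length; depending on the version, passive-check requests may instead be a bare item key plus newline. A minimal parser sketch for the framed form:

              ```python
              import struct

              def parse_zabbix_frame(data: bytes):
                  """Parse one Zabbix protocol frame:
                  b'ZBXD' + 1 flag byte + 8-byte little-endian length + payload.

                  Returns the payload bytes, or raises ValueError on a malformed frame.
                  """
                  if len(data) < 13 or data[:4] != b"ZBXD":
                      raise ValueError("not a ZBXD frame")
                  (length,) = struct.unpack_from("<Q", data, 5)
                  payload = data[13:13 + length]
                  if len(payload) != length:
                      raise ValueError("truncated frame")
                  return payload

              # Example: a frame carrying the value "0.25"
              frame = b"ZBXD\x01" + struct.pack("<Q", 4) + b"0.25"
              print(parse_zabbix_frame(frame))  # -> b'0.25'
              ```

              In Wireshark you can get the same bytes via Follow TCP Stream and feed them to a helper like this to see exactly which item key the server asked for.
              
              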

              Markku


              • Markku
                Senior Member
                Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
                • Sep 2018
                • 1781

                #8
                If at all possible, it is actually still preferable to have simultaneous captures on both the server and the agent, just to rule out any strange effects of packet loss in the path.

                Markku


                • ConstipatedNinja
                  Junior Member
                  • Jan 2020
                  • 7

                  #9
                  It appears that the check in question was the web scenario averaging 30 seconds, the one I had disabled for a while to rule it out as the culprit. I'm afraid that nothing changed about the gappiness when I disabled that check.

                  I just used at(1) to synchronize a packet capture. Here's another example of one of these resets, but from both sides.
                  -Lily
                  Attached Files


                  • ConstipatedNinja
                    Junior Member
                    • Jan 2020
                    • 7

                    #10
                    I disabled that check and reran a simultaneous packet capture for the same length as the last one. Instead of ~330 packets being captured over a 5 minute span, this time there were >2100 captured packets, and only one conversation that ended in resets. However, that one conversation that ended in resets was the disabled check. For some reason the server still ran it one last time more than 15 minutes after the check was disabled.

                    During the time of this packet capture, there were still gigantic gaps in the graphs.


                    • Markku
                      Senior Member
                      Zabbix Certified Specialist · Zabbix Certified Professional · Zabbix Certified Expert
                      • Sep 2018
                      • 1781

                      #11
                      Ok, in the agent-side capture we see:
                      • A large request is received in packet #45 (session with port 42114)
                      • The agent does nothing with it until packet #58 (15 seconds later), where it FINs the session - it looks like it never intends to answer it; maybe an agent-side timeout kicks in?
                      • HOWEVER, 15 seconds later (#60) the agent sends the data, but as we see in the server-side #61, the server does not accept it because the agent already sent the FIN earlier. Thus the data does not get handled at all.
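                      The accept/discard logic seen in both captures can be summarized in a few lines (a sketch; the 15-second window is an assumption about the server-side Timeout in effect here):

                      ```python
                      def response_accepted(request_ts, response_ts, timeout=15.0):
                          """Model the behavior seen in the captures: the server waits
                          `timeout` seconds for a passive-check response, then FINs the
                          connection; any response arriving later is met with a RST and
                          the data is discarded."""
                          return (response_ts - request_ts) <= timeout

                      # Session on port 42114: response 30 s after the request -> discarded
                      print(response_accepted(0.0, 30.0))  # -> False
                      # Session on port 45170: immediate response -> accepted
                      print(response_accepted(0.0, 0.2))  # -> True
                      ```
                      
                      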

                      Again the question: what was that request, and why did the agent just sit on it? What is in the response?

                      The other session in the agent-side capture (with port number 45170):
                      • The request was received in #50
                      • It was responded to immediately in #52
                      • Everybody was happy at the connection level.

                      What did that request and response contain? Is it shown in the Latest data?

                      Markku
