Hello!
We have a few thousand hosts in our Zabbix instance, but there is exactly one host that is misbehaving. The graphs are all extremely choppy for just this one host. We're currently using Zabbix 3.4.14, both server-side and client-side. The client that's experiencing problems is a rather large VM running CentOS 7.2.1511, and it hosts an Artifactory instance.
Things I've ruled out:
CPU load too high - the client in question continuously runs at a load of about 4, but it has 16 cores available to it.
No memory - the client in question consistently has 50-52 GB of free RAM.
No disk - there's plenty of disk space available (about 40% full)
Network problems - ifconfig shows >50 TiB of RX traffic and >30 TiB of TX traffic with zero dropped packets.
VM host problems - there are several other VMs on the same vCenter host, and none of them are showing gaps in their data.
DNS issues - I've tried both specifying the hostname (works for everything else) and specifying the IP, and there is no difference.
Rogue web scenario taking too long - I did identify one web scenario that took an average of 30 seconds to return, but I've disabled that. 3 web scenarios remain, but they all return within milliseconds.
Crazy IPv6 issues - since IPv6 troubles can manifest in weird ways, I've stripped everything to only using IPv4. There is no difference.
I've slowly increased the StartAgents, BufferSend, BufferSize, MaxLinesPerSecond, and Timeout values to larger and larger numbers. Originally they were 5, 30, 300, 100, 15 (respectively) but now they're at a monstrous 8, 3600, 8000, 500, 30 (respectively), and none of the settings in between has made any difference to the issue.
This machine is pretty integral to our company's workflow, and we'd be receiving complaints hand over fist if there were any performance problems with it in the least.
Attached is a picture of the Template OS Linux CPU Load graph for the troubled client, scaled to the last 6 hours of data.
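For anyone who wants to put numbers on the choppiness rather than eyeball the graph, here is a rough sketch of how the gaps could be measured. It assumes the item's sample times have been exported as Unix timestamps and that the item's update interval is 60 seconds; both are assumptions for illustration, and the function names find_gaps and missing_samples are made up for this example, not part of Zabbix.

```python
from typing import List, Tuple


def find_gaps(clocks: List[int], interval: int = 60,
              tolerance: float = 1.5) -> List[Tuple[int, int]]:
    """Return (previous, next) timestamp pairs that are too far apart.

    clocks    -- sample times as Unix timestamps (assumed export format)
    interval  -- expected seconds between samples (assumed 60 here)
    tolerance -- how many intervals may pass before it counts as a gap
    """
    ordered = sorted(clocks)
    limit = interval * tolerance
    # Compare each sample with the one that follows it.
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


def missing_samples(clocks: List[int], interval: int = 60,
                    tolerance: float = 1.5) -> int:
    """Estimate how many samples were skipped across all gaps."""
    gaps = find_gaps(clocks, interval, tolerance)
    return sum((b - a) // interval - 1 for a, b in gaps)


if __name__ == "__main__":
    # Made-up timestamps purely to show the call; not real data from the host.
    sample = [0, 60, 120, 300, 360, 600]
    print(find_gaps(sample))
    print(missing_samples(sample))
```

Running the same check against a healthy host with the same template would show whether the gaps are unique to this client or just more visible on it.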
Any and all help, hare-brained ideas, and conjecture are very welcome. I've run through everything I can think of checking, and suggestions of new things to check will help me keep my sanity.
Thanks in advance!
-Lily