Dear all,
I am running a containerized Zabbix server v7.0.0, with no proxy so far.
All data is fed in via trapper items, using zabbix_sender.
The incoming message rate is fairly constant at about 100 messages per second, although every item uses a throttling preprocessing step with a 60-second time threshold (I have never managed to figure out whether the displayed message rate is before or after throttling).
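For context, the values are pushed roughly like this (the server address, hostname and item key below are placeholders, not my real ones):

    zabbix_sender -z zabbix.example.com -p 10051 -s "some-host" -k some.trapper.key -o 42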
I have two problems that I cannot explain and therefore cannot solve; perhaps they are unrelated. Both problems are visible in the metrics charts I am sharing here, see below for an explanation:
1. First problem: every night at 3 am sharp there is an unexplained spike in value cache usage. There is no recurring activity on the server running Zabbix, such as a backup, so I cannot explain how that could happen. As you can see in the charts, both the data collectors and the internal processes show no sign of being busy; the spike happens all by itself in the value cache. The spike in the history syncer that you see in the second graph is really just the server being restarted after the cache reaches 100%.
2. Second problem: as you can see in the second chart above, sometimes the history syncer gets stuck at about 25% utilization forever. This is really just a single history syncer process getting stuck inside the Zabbix server; I will provide a screenshot of the ps aux output when it happens again, as it is not the case right now. The problem is that the history cache then reaches 100% and the server needs to be restarted. As you can see from the blue line above, the history cache grows endlessly because of this one stuck process, even though the other syncers are literally idle and doing no work at all. Note that this situation is most likely to arise when the Zabbix server and the database are not on the same node, so there is some network latency between the two. (The internal items I could use to keep an eye on both caches are sketched right after this list.)
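For reference, these are the internal items that I believe should expose both caches and the syncer utilization (assuming the standard internal item keys, please correct me if any of them are wrong):

    zabbix[vcache,buffer,pfree]              (free value cache, %)
    zabbix[vcache,cache,mode]                (value cache operating mode: 0 normal, 1 low memory)
    zabbix[wcache,history,pfree]             (free history write cache, %)
    zabbix[process,history syncer,avg,busy]  (average history syncer utilization, %)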
Is there any way I can debug these two problems? And do you believe that using a proxy to buffer the trapper input could help? In my opinion this should not make a big difference, as the issue here is not the incoming rate or the process utilization.
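For what it is worth, my current debugging idea (assuming the standard runtime control options, run against the server binary inside the container; the process number 1 below is just an example of targeting a single syncer) would be something like:

    zabbix_server -R diaginfo=valuecache
    zabbix_server -R diaginfo=historycache
    zabbix_server -R log_level_increase="history syncer,1"
    # ... wait for the problem to appear, collect the log ...
    zabbix_server -R log_level_decrease="history syncer,1"

but I am not sure this is the right approach, hence my question.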
Also note that CPU, RAM and disk I/O on all nodes are quite stable and show no sign of peaks around 3 am or when the history syncer gets stuck.