Hi Team,
We are facing an intermittent issue where some of our items become not supported with the error shown below.
By intermittent we mean it occurs randomly, roughly once or twice a day. Each occurrence produces a false alert, because the nodata() part of the trigger expression fires when no data is received from the item. See the screenshot below, where the values for 11:29:09 and 11:30:09 are missing, which is what raises the alert.

Error:
Code:
Item became not supported: Cannot execute script: RangeError: execution timeout
Trigger:
Code:
last(/aws-kube-prod-01-powercompany-cloud/service_preview-powercompany-cloud_service-usermanagement-backend-liveness.live)<>"{\"status\":\"UP\"}" or nodata(/aws-kube-prod-01-powercompany-cloud/service_preview-powercompany-cloud_service-usermanagement-backend-liveness.live,90s)=1
Item:
Code:
var url = "https://service_preview.powercompany.cloud/service-usermanagement-backend/actuator/health/liveness";
var maxRetries = 3;      // maximum number of GET attempts
var retryDelay = 5000;   // delay between attempts, in milliseconds
var attempt = 0;
var result = null;

// Perform a single GET against the liveness endpoint; on failure return the error text instead.
function makeRequest() {
    try {
        var request = new HttpRequest();
        return request.get(url);
    } catch (error) {
        return error.message || error;
    }
}

// Retry until the endpoint reports {"status":"UP"} or the attempts are exhausted.
while (attempt < maxRetries) {
    result = makeRequest();
    if (result && result.includes('{"status":"UP"')) {
        break;
    }
    attempt++;
    if (attempt < maxRetries) {
        Zabbix.sleep(retryDelay);   // wait 5 seconds before the next attempt
    }
}

return result;
Background:
- We have numerous liveness/readiness check endpoints; the one above is just one example.
- All of the liveness/readiness endpoints we monitor respond in well under one second.
To address the issue we have tried the following, all to no avail.
1. Increased the timeout of the item script from 3 seconds to 60 seconds.
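For reference, a minimal sketch of how the item definition might appear in a Zabbix 7.0 template YAML export after this change. Only the item key and the 60-second timeout come from our setup; the name, delay, and value type below are placeholders.
Code:
items:
  - name: 'service-usermanagement-backend liveness'   # placeholder name
    type: SCRIPT
    key: service_preview-powercompany-cloud_service-usermanagement-backend-liveness.live
    delay: 1m          # assumed polling interval
    timeout: 60s       # raised from the previous 3s
    value_type: TEXT   # the script returns the raw response body
    params: |
      // the retry script shown under "Item:" above goes here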
2. Increased the zabbix-server and zabbix-postgresql resource requests/limits.
We know that zabbix-server and zabbix-postgresql are resource-heavy, especially when monitoring a large amount of data.
For the server:
Code:
resources:
  limits:
    cpu: '2'
    memory: 9Gi
  requests:
    cpu: '2'
    memory: 9Gi
For the database:
Code:
resources:
  limits:
    cpu: '2'
    memory: 9Gi
  requests:
    cpu: '2'
    memory: 9Gi
3. Modified some Zabbix server internals, such as StartPollers, to accommodate more items.
As we understand it, StartPollers controls how many poller processes are available to fetch item values, so more pollers should let items be polled in time. The default is 5; we have increased it to 35. The Zabbix processes are shown in the screenshot below.
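For reference, a minimal sketch of one way to apply this in a Kubernetes Deployment, assuming the official zabbix/zabbix-server-pgsql image, which maps the ZBX_STARTPOLLERS environment variable to StartPollers in zabbix_server.conf. The container and image names below are illustrative.
Code:
spec:
  template:
    spec:
      containers:
        - name: zabbix-server                       # illustrative name
          image: zabbix/zabbix-server-pgsql:7.0.3   # assumed image/tag
          env:
            - name: ZBX_STARTPOLLERS
              value: '35'                           # default is 5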
4. We also switched to TimescaleDB. We noticed a performance improvement in the housekeeper process compared to plain PostgreSQL, where housekeeper utilization used to reach 100%. Still, we keep hitting the same issue mentioned in the title.
5. We also increased the ingress controller replica count, assuming the problem could be network related. Our Istio ingress gateway is not heavily utilized, but we still increased it to 3 replicas. The same issue still reoccurs intermittently.
> We have noticed, though, that no request actually reaches our istio-ingressgateway. In other words, the zabbix-server does not even issue the GET against the service: when we traced it in the Istio access logs, the request appears to be missing. This is why we took all of the actions above to make sure Zabbix has enough capacity to process all of the items, but to no avail.
6. We even reduced the polling frequency of the native Kubernetes checks (from every 1 minute to every 2 minutes) to halve the number of values per second (VPS) Zabbix is processing, from 300 VPS down to 150 VPS. This did not help either.
Has anyone faced a similar problem? We hope to find a fix for this, as we keep receiving false alerts due to the nodata issue.
Here are the Zabbix components that we are using:
- Zabbix-Web 7.0.3
- Zabbix-Server 7.0.3
- Zabbix Agent 7.0.3
- Zabbix PostgreSQL 16 with TimescaleDB 2.15.3
* All of the above are deployed inside a Kubernetes cluster, version 1.30
We will appreciate any help. Thank you!