**Pada** · 06-02-2017, 11:54

Could you provide us with some more info, like:
1) What are your current Zabbix performance figures, such as NVPS (new values per second)? These are visible on the Dashboard.
2) Some basic hardware/software specs, e.g. VM with 8 CPU cores, 16GB RAM, 4x 15000rpm hard drives in RAID 10, InnoDB MySQL engine, Ubuntu 16.04.1 LTS for the Zabbix DB and another VM for the Zabbix server.
3) How do your Zabbix server's cache and internal process busy percentages look? To get these, you'll need to assign the Template_Zabbix_Server to your Zabbix server.
It would also be great if you could post your Zabbix server config, in case these figures point to some other issue.
4) What kind of monitoring interfaces are you using, e.g. JMX / Zabbix Agent (passive) / Zabbix Agent (active) / SNMP v2?
5) What does your Zabbix Administration > Queue page look like? And are you using Zabbix Proxy servers?
6) Has Zabbix completely stopped collecting data, or is it just not keeping up? For example, we had a case where our Zabbix Proxies couldn't send data fast enough to the Zabbix Server because our network link was saturated.

**HamzaB** · 07-02-2017, 18:16

Thanks for your reply:

1) Number of new values per second is around 162

2) It's running on an AWS instance, OS: CentOS 6.2, 4 CPU Cores, 16GB of memory, MySQL DB running on AWS RDS, InnoDB engine.

3) Cache and busy processes:
- Zabbix configuration cache, % free: 98.88
- Zabbix buffer write vcache, % free: 92
- Zabbix history write cache, % free: 99.47
- Zabbix text write cache, % free: 99.46
- Zabbix trend write cache, % free: 99.92

- Zabbix busy alerter processes, in %: 0.02 %
- Zabbix busy configuration syncer processes, in %: 0.66 %
- Zabbix busy db watchdog processes, in %: 0 %
- Zabbix busy discoverer processes, in %: 0.02 %
- Zabbix busy escalator processes, in %: 40.9 %
- Zabbix busy history syncer processes, in %: 25.31 %
- Zabbix busy housekeeper processes, in %: 100 %
- Zabbix busy http poller processes, in %: 1.19 %
- Zabbix busy icmp pinger processes, in %: 17.23 %
- Zabbix busy poller processes, in %: 6.13 %
- Zabbix busy proxy poller processes, in %: 0 %
- Zabbix busy self-monitoring processes, in %: 0.02 %
- Zabbix busy timer processes, in %: 0.03 %
- Zabbix busy trapper processes, in %: 0.27 %
- Zabbix busy unreachable poller processes, in %: 0.09 %

4) Using only zabbix agent active and zabbix agent passive

5) The queue is usually fine, the delayed metrics over 10 mins are around 20, and they belong to a host that is not always up.

6) I believe it's completely stopping. We see metrics not updated for several minutes.

One more thing: we looked at the agent logs and found these events:

Code:

10457:20170203:183613.286 active check data upload to [zabbix_server:10051] started to fail ([connect] cannot connect to [[zabbix_server]:10051]: [111] Connection refused)
10457:20170203:183614.313 active check data upload to [zabbix_server:10051] is working again

But when we check connectivity with telnet or nc, port 10051 is responding
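
A one-off telnet or nc test can easily miss a refusal that lasts about a second, so it may help to measure how long each failed-upload window in the agent log actually lasts. The following is a minimal sketch in Python, assuming the agent log lines keep the `PID:YYYYMMDD:HHMMSS.mmm message` layout shown above; the function names are made up for illustration and are not part of Zabbix.

```python
import re
from datetime import datetime

# Matches the agent log prefix "PID:YYYYMMDD:HHMMSS.mmm" seen in the lines above.
LINE_RE = re.compile(r"^\s*\d+:(\d{8}):(\d{6})\.(\d{3})\s+(.*)$")


def parse_ts(date_part, time_part, millis):
    """Build a datetime from the date, time and millisecond fields of a log line."""
    base = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return base.replace(microsecond=int(millis) * 1000)


def upload_outages(lines):
    """Return (start, end, seconds) for each failed active-check upload window."""
    outages = []
    started = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not follow the assumed layout
        ts = parse_ts(m.group(1), m.group(2), m.group(3))
        message = m.group(4)
        if "active check data upload" not in message:
            continue
        if "started to fail" in message:
            started = ts
        elif "is working again" in message and started is not None:
            outages.append((started, ts, (ts - started).total_seconds()))
            started = None
    return outages


# The two lines posted above, used as sample input.
sample = [
    "10457:20170203:183613.286 active check data upload to [zabbix_server:10051] "
    "started to fail ([connect] cannot connect to [[zabbix_server]:10051]: "
    "[111] Connection refused)",
    "10457:20170203:183614.313 active check data upload to [zabbix_server:10051] "
    "is working again",
]

for start, end, seconds in upload_outages(sample):
    print(f"{start} -> {end}: {seconds:.3f}s")
```

Run against a full agent log (for example by passing `open(path)` instead of `sample`), this would give a list of outage windows that can be compared with what the server was doing at the time.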

**aib** · 07-02-2017, 18:22

Originally posted by HamzaB

- Zabbix busy housekeeper processes, in %: 100 %

In my setup, when the Housekeeper starts to clean the database, the Zabbix server just stops responding.
Did you try to correlate your HouseKeeper activity with Zabbix unavailability?

**HamzaB** · 07-02-2017, 18:34

Originally posted by aib

In my setup, when the Housekeeper starts to clean the database, the Zabbix server just stops responding.
Did you try to correlate your HouseKeeper activity with Zabbix unavailability?

It could be. I just looked in the logs and the last time this happened the housekeeper was running. How can I make sure of this? And are there any parameters I can modify to fix this?
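
One way to make sure is to write down the start and end time of each housekeeper run from the server log and check whether every outage falls inside one of those windows. Below is a rough Python sketch of that check. The housekeeper window in the sample data is HYPOTHETICAL and only there to show the idea; the outage time is the one from the agent log posted earlier in the thread.

```python
from datetime import datetime


def overlapping_runs(housekeeper_runs, outage_times):
    """Map each outage timestamp to the (start, end) run that contains it, or None."""
    result = {}
    for outage in outage_times:
        result[outage] = next(
            ((start, end) for start, end in housekeeper_runs if start <= outage <= end),
            None,
        )
    return result


# HYPOTHETICAL housekeeper window; replace with times taken from your server log.
runs = [(datetime(2017, 2, 3, 18, 0, 0), datetime(2017, 2, 3, 19, 15, 0))]

# Outage time taken from the agent log quoted earlier in this thread.
outages = [datetime(2017, 2, 3, 18, 36, 13)]

for outage, run in overlapping_runs(runs, outages).items():
    status = "during a housekeeper run" if run else "no housekeeper run active"
    print(outage, status)
```

If most outages land inside housekeeper windows and none land outside, that is fairly strong evidence for the correlation, though it does not prove the cause by itself.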

**Pada** · 07-02-2017, 19:50

What kind of underlying storage are you using for RDS? Magnetic, SSD or provisioned IOPS?
For comparison, our Zabbix 1.8 server's DB has similar specs (16GB RAM, 4x CPU cores, 5x 15000rpm HDDs in RAID6) on a non-AWS VM and it can do ~1000 NVPS.

It is probably OK if the HouseKeeper is at 100% for a minute or two, but it should not run for more than an hour in one go. If it does, it probably means that your DB is too slow at deleting the old entries.
Restarting the Zabbix server service would stop the HouseKeeper.
I'm not sure why your Zabbix is not collecting any new data at all though, because our old Zabbix 1.8 server's housekeeper runs for 4h or more at a time and it's still collecting data.

What do your CloudWatch metrics show in terms of CPU usage and read/write latencies during the time the Housekeeper ran?

Lastly, do your Zabbix server logs have anything useful around the time when it stops collecting data?
If it's the DB that can't cope, the "history syncer processes" would become busier and the "history write cache % free" would decline, if I'm not mistaken.

Because the HouseKeeper takes too long for our amount of data (~1000 NVPS), I've now set up our new Zabbix 3.2.3 database in AWS Aurora and disabled the Housekeeper in favour of using MySQL partitions for the item history & trends. We haven't replaced our old Zabbix just yet.
See https://www.zabbix.org/wiki/Docs/howto/mysql_partition for more info on the partitions. Unfortunately, if you only start using partitions now (and disable the HouseKeeper), the DB size will keep growing for at least as long as you want to keep your partitions, AND you'll probably need some downtime while the partitions are created on an existing table with lots of data.
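
To give a feel for what range partitioning on the history tables looks like, here is a small Python sketch that only prints the MySQL statement and does not run it. It is not the procedure from the wiki page. The daily granularity, the `pYYYY_MM_DD` partition names, the UTC day boundaries and the helper names are all assumptions made for illustration; `clock` is the Unix timestamp column in the Zabbix history and trends tables.

```python
from datetime import datetime, timedelta, timezone


def daily_partitions(first_day, days):
    """Yield (name, upper_bound_epoch) for `days` consecutive daily partitions."""
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        upper = day + timedelta(days=1)  # partition holds rows with clock < upper
        yield day.strftime("p%Y_%m_%d"), int(upper.timestamp())


def partition_ddl(table, first_day, days):
    """Build an ALTER TABLE ... PARTITION BY RANGE (clock) statement as a string."""
    parts = ",\n".join(
        f"  PARTITION {name} VALUES LESS THAN ({bound})"
        for name, bound in daily_partitions(first_day, days)
    )
    return f"ALTER TABLE {table} PARTITION BY RANGE (clock) (\n{parts}\n);"


# Assumed start date and partition count, for illustration only.
start = datetime(2017, 2, 13, tzinfo=timezone.utc)
print(partition_ddl("history", start, 3))
```

Review the generated statement and test it on a copy of the database first, since repartitioning a large existing table rewrites it and can take a long time.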

Side notes:
1) The company I work for is currently experiencing lots of unexplained issues with RDS (running on the magnetic storage type) that relates to high IO, which is why we're also moving to Aurora.

2) We had to modify the parameter groups for the Aurora instances to enable the Event Scheduler, because we want Aurora to recreate (new/missing) partitions every hour for Zabbix. We initially forgot to apply the settings to the read-replica, so when that node became master our event was never scheduled and our Zabbix stopped writing since there were no partitions for it to write to.

3) Also, Zabbix doesn't handle Aurora failovers that well: it can reconnect to a read-replica, and then it doesn't close that connection and retry against the new master node.

**HamzaB** · 08-02-2017, 12:53

- We're using SSD storage for the RDS instance.
- I looked at the logs from a week ago and saw the housekeeper take 1h15min to finish. Over the last couple of days, it has been taking 45-55 mins.
- During the housekeeper's runs, CPU usage is fine (not above 25%), but the write latency on the RDS instance spikes and reaches 20ms (not sure if this is a worrisome value).
- Zabbix server logs do not mention anything useful about this. The only relevant trace is in the agent logs (I posted it above).

I think I might disable housekeeping and see if the issue happens again.

**batchenr** · 09-02-2017, 08:11

Code:

10457:20170203:183613.286 active check data upload to [zabbix_server:10051] started to fail ([connect] cannot connect to [[zabbix_server]:10051]: [111] Connection refused)
10457:20170203:183614.313 active check data upload to [zabbix_server:10051] is working again

When I used to have this, I went to /etc/zabbix/zabbix_agent.conf,
uncommented
#port=10050
and restarted. Hope it helps.

**HamzaB** · 13-02-2017, 15:43

I disabled housekeeping for the History and the Trends. The issue has not occurred again since then, so I believe it's safe to say the housekeeper was causing this.
However, without housekeeping, we face the problem of the growing size of the DB. Any idea how to fix this while keeping the housekeeper running?

**aib** · 13-02-2017, 16:28

Check the forum for "Partitioning".
A lot of people have faced the same problem, created threads, and got the answer: "Partition your DB and create scripts which will delete data on schedule".
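
As a rough idea of what such a scheduled cleanup script could do, the Python sketch below works out which partitions are older than the retention period and prints the matching `ALTER TABLE ... DROP PARTITION` statements. It assumes daily partitions named `pYYYY_MM_DD`, which is only a naming convention chosen for this example; the partition list and the 7-day retention are hypothetical values.

```python
from datetime import date, datetime, timedelta


def expired_partitions(partition_names, today, keep_days):
    """Return the partitions whose day is older than `keep_days` before `today`."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in partition_names:
        # Assumed naming convention: p2017_02_13 holds data for 2017-02-13.
        day = datetime.strptime(name, "p%Y_%m_%d").date()
        if day < cutoff:
            expired.append(name)
    return expired


def drop_ddl(table, names):
    """Build one DROP PARTITION statement per expired partition."""
    return [f"ALTER TABLE {table} DROP PARTITION {name};" for name in names]


# HYPOTHETICAL partition list and retention, for illustration only.
existing = ["p2017_01_01", "p2017_02_01", "p2017_02_10"]
old = expired_partitions(existing, date(2017, 2, 13), keep_days=7)
for statement in drop_ddl("history", old):
    print(statement)
```

Dropping a whole partition avoids the row-by-row deletes that keep the housekeeper busy, which is the reason partitioning is usually suggested for this problem.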

Zabbix_server is not collecting data
