How to alert on Zabbix servers failures in HA mode?

  • olegus
    Member
    • Dec 2023
    • 68

    #1

    How to alert on Zabbix servers failures in HA mode?

    We run Zabbix 6.4 in Docker containers in HA mode.
    Yesterday both the active and passive servers crashed and nobody noticed. The Zabbix web frontend kept working and hosts were shown green. The only indication that the servers were down was a popup at the bottom of the page that disappeared within a second. The Docker containers were also shown green.
    The logs showed cache size errors. We set ZBX_CACHESIZE=512M and the servers were happy again.
    But it left us with a question: how do we make sure we are notified immediately if a Zabbix server service is down? We have an HA cluster and we expected to see alerts immediately if the active server goes down.
    What probably happened is that when the active node crashed, control passed to the passive one, which also crashed because of the insufficient cache size. But there must have been time to send an alert, right? So I'm guessing we did not set up alerting correctly.

    How to properly set up HA cluster failure alerts?
  • cyber
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4807

    #2
    It probably did not have enough time to send anything, for example if it crashed during startup. And even if it had time, you need some monitoring in place for this. The best way seems to be a dashboard with Zabbix's own internal metrics: monitor those "manually" and tune your cache sizes early. Keep them at no more than 40% occupied, and they can probably also withstand value storms after interruptions.
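
The 40% rule of thumb above can be sketched as a simple check. This is only an illustration: the function name, the cache names, and the sample readings are hypothetical, and in practice the percentages would come from the server's own internal cache metrics.

```python
def caches_over_limit(used_percent, limit=40.0):
    """Return the names of caches whose usage exceeds the limit, sorted by name.

    used_percent maps a cache name to its used space in percent.
    The 40% default is the rule of thumb from the post above.
    """
    return sorted(name for name, pct in used_percent.items() if pct > limit)


# Hypothetical sample readings, not real measurements.
readings = {
    "configuration cache": 92.5,
    "history cache": 12.0,
    "value cache": 38.0,
}
print(caches_over_limit(readings))  # ['configuration cache']
```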


    • olegus
      Member
      • Dec 2023
      • 68

      #3
      Thanks for the reply, but manual monitoring is NOT the way to go in 2024. A better way would be to periodically ping some Zabbix server API endpoint with cron, for example.
      And the best way would be to trigger an event when certain server metrics reach their limits, but we would need to know which server metrics are critical enough to lead to an unexpected shutdown.
      Or even better: it would be great if we could raise an event on a proxy when a server is not responding.


      • cyber
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Dec 2006
        • 4807

        #4
        If you tune your caches to a proper size, you don't really have to do much manual work after that, maybe glance over them once in a while. I totally agree that constantly looking at some screen sounds stupid. If you add the Zabbix server self-monitoring templates to your server, they will also add metrics for all kinds of caches etc. So you can have alerting before crashes, if things do not crash too fast: https://git.zabbix.com/projects/ZBX/.../zabbix_server

        An API endpoint does not give you much: the frontend (including the API) talks to the DB directly, so the server is not involved there. You can test whether port 10051 is answering on the server(s). Proxies also do not produce events. They collect and forward data to the server; all the rest is done by the server.
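
The port test suggested above could be run from cron on a machine outside the cluster. Below is a minimal sketch using only the Python standard library. The function names and the node list are hypothetical. It also assumes that only the active node answers on 10051, so it treats the cluster as down only when no node answers at all.

```python
import socket


def port_is_open(host, port=10051, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_nodes(nodes, port=10051, timeout=3.0):
    """Map each node name to whether its server port answered."""
    return {host: port_is_open(host, port, timeout) for host in nodes}


def cluster_is_down(status):
    """Assumption: the cluster is healthy as long as at least one node answers."""
    return not any(status.values())
```

A cron job could call `check_nodes(["zabbix-node-1", "zabbix-node-2"])` (placeholder host names) and send mail or a chat message through a channel that does not depend on Zabbix whenever `cluster_is_down` returns True. A successful TCP connect only shows that something is listening on the port, not that the server is processing data correctly.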
