Monitored items start to fail after zabbix server failover in HA setup

  • rivermigue
    Junior Member
    • Jul 2024
    • 8

    #1

    Hello,
    I am trying to identify why all monitored items in a cluster setup become unsupported after failing over to the secondary Zabbix server node.
    The current environment is:
    Alma Linux 8
    Zabbix 7.0.8
    • Each agent has the FQDN of its closest Zabbix proxy in the ServerActive directive
    • Each proxy has zabbix.domain.com with port 10051 configured in the Server directive (zabbix.domain.com points to one Zabbix master at a time, depending on which one is listening on 10051, since only one process can listen on that port in an HA setup)
    • The servers have their load balancer virtual IP specified in the NodeAddress directive
    When I stop zbx-master-1, zbx-master-2 becomes the new active node, and both the Zabbix UI and our load balancer detect that. I can see on zbx-master-2 that it is sending the requested configs to all proxies and that all proxies are receiving them. However, every item starts to fail, especially those that use nodata. Comparing the logs: when zbx-master-1 was the master, its log showed that all items were supported, but on zbx-master-2 the only entries I see are the proxies requesting the config from the master. I am unsure what is causing this. Any suggestions?
    It looks like the proxies are not sending all metrics back to the new Zabbix master for a reason I can't detect. Turning the debug level up to 4 or 5 on the proxy and the server doesn't show any information about connectivity between the two.
    Edit: I should also mention that when I use "Test" on items that run from the proxies (SQL or web API calls), they do work. However, if I leave them alone, the proxies don't seem to send those metrics to the Zabbix server.
    Attachments: Untitled Diagram.jpg, Screenshot 2025-01-21 at 12.53.16 PM.png
    Last edited by rivermigue; 22-01-2025, 18:33. Reason: Adding extra information
  • rivermigue
    Junior Member
    • Jul 2024
    • 8

    #2
    Does anyone have a suggestion? I have the secondary master in standby, but it doesn't seem to be of much help when the failover happens, since it is as if it isn't accepting the metrics from the proxies for some reason.

    • cyber
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Dec 2006
      • 4806

      #3
      If you restart any of the proxies after failover, do they start sending data?
      Have you tested the more conventional approach of listing both server addresses as a cluster in the proxy settings, bypassing the F5? I.e.: "Server=zbx-master-1;zbx-master-2"
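
To confirm that a proxy really carries the cluster form of the Server directive, a small check like the following can help. This is a minimal sketch: the function name `server_nodes`, the sample file contents, and the host name `proxy-dc1` are illustrative assumptions, and if several Server= lines are present the sketch simply keeps the last one it sees.

```python
def server_nodes(conf_text: str) -> list[str]:
    """Return the addresses listed in the Server= line of a proxy config.

    Assumption: comments start with '#', and the cluster form separates
    node addresses with semicolons, as in the suggestion above.
    """
    nodes: list[str] = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "Server":
            # Split the cluster list and drop empty entries
            nodes = [n.strip() for n in value.split(";") if n.strip()]
    return nodes


# Hypothetical excerpt of a zabbix_proxy.conf for an active proxy
sample = """
# zabbix_proxy.conf (excerpt)
ProxyMode=0
Server=zbx-master-1;zbx-master-2
Hostname=proxy-dc1
"""

nodes = server_nodes(sample)
print(nodes)
if len(nodes) < 2:
    print("Only one server address configured; no cluster list in use")
```

Running this against each proxy's config shows quickly whether every proxy lists both HA nodes or still points at a single name.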

      • markfree
        Senior Member
        • Apr 2019
        • 868

        #4
        I've seen proxies that show up as active, have no errors logged, but are not sending any data to the server due to a time mismatch.
        Check that all hosts are synchronized.
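
One way to check this is to compare Unix epoch seconds collected from every host at the same moment (for example with `date +%s`). Epoch time does not depend on the time zone, so hosts in different zones can be compared directly. The sketch below assumes the timestamps have already been gathered; the host names, the sample values, and the 5-second threshold are hypothetical, not a limit taken from Zabbix.

```python
def max_clock_skew(epoch_by_host: dict[str, float]) -> float:
    """Return the largest difference, in seconds, between any two host clocks."""
    if not epoch_by_host:
        raise ValueError("no timestamps supplied")
    values = list(epoch_by_host.values())
    return max(values) - min(values)


# Hypothetical readings taken at (nearly) the same instant on each host
readings = {
    "zbx-master-1": 1737568400.0,
    "zbx-master-2": 1737568401.0,
    "proxy-dc1": 1737568400.5,
}

THRESHOLD_SECONDS = 5.0  # hypothetical tolerance, chosen for illustration only

skew = max_clock_skew(readings)
print(f"max skew: {skew:.1f} s")
if skew > THRESHOLD_SECONDS:
    print("clocks differ by more than the chosen tolerance; check NTP/chrony")
```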

        • rivermigue
          Junior Member
          • Jul 2024
          • 8

          #5

          1) Restarting the proxies after the failover does not make them send data to the new master.
          2) I removed the load balancer IP and specified the list of Zabbix masters, separated by semicolons, in each proxy, since these are active proxies. Same result: the proxies are not sending data although they show up as online in the web UI, and the per-proxy queues just keep increasing and never catch up.
          3) Both Zabbix masters are in CST and one proxy is in PST; time is in sync.
          4) The database is in the same datacenter as zbx-master-1. We do have another DB instance in the datacenter where zbx-master-2 is, which acts as our last resort for a failover. The latency between the two DCs is around 45 ms. When I fail over Zabbix, I am not taking the database into consideration. Would a latency of 45 ms cause something like this?
          5) SELinux is permissive.

          The strange part to me is that the proxies appear online in the UI and are also able to get config updates from both masters when failing over, as that shows up in each Zabbix proxy log. The statistics on the Zabbix master seem to show that it is processing some data in the pollers, yet the queue never seems to clear, and we just start to receive a bunch of alerts unless we fail back to the original master.

          • rivermigue
            Junior Member
            • Jul 2024
            • 8

            #6
            After much thought, I found the cause of this behavior.
            Our PSQL cluster has nodes in both datacenters, as does Zabbix, and the datacenters are separated by ~40 ms of latency. If we fail over the Zabbix master to DC2 while the PSQL master is in DC1, that is when we start seeing all these problems, with items getting queued and never making their way to the Zabbix master for some reason. This is resolved by having both masters (PSQL and Zabbix) in the same datacenter.
            I don't know if this is true for setups with fewer items, but at least for us it is, and I am not sure whether this is logged anywhere in the Zabbix logs. Anyway, this is resolved for us.
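
A back-of-the-envelope calculation suggests why ~40 ms between the server and its database could matter. This is only one possible explanation, not something confirmed in the thread. The sketch assumes that each batch of values written to the database costs a fixed number of sequential round trips and that a fixed number of writer processes work in parallel. The function name and the figures of 10 round trips and 4 writers are illustrative assumptions, not measurements.

```python
def max_flushes_per_second(rtt_ms: float, round_trips_per_flush: int, writers: int) -> float:
    """Upper bound on write batches per second when latency dominates.

    Each flush waits for `round_trips_per_flush` sequential round trips,
    and `writers` processes flush in parallel.
    """
    seconds_per_flush = (rtt_ms / 1000.0) * round_trips_per_flush
    return writers / seconds_per_flush


# Assumed figures: 10 round trips per flush, 4 parallel writers
same_dc = max_flushes_per_second(rtt_ms=0.5, round_trips_per_flush=10, writers=4)
cross_dc = max_flushes_per_second(rtt_ms=40.0, round_trips_per_flush=10, writers=4)

print(f"same datacenter (0.5 ms): about {same_dc:.0f} flushes/s")
print(f"cross datacenter (40 ms): about {cross_dc:.0f} flushes/s")
print(f"slowdown factor: about {same_dc / cross_dc:.0f}x")
```

Under these assumptions the ceiling on write throughput drops by the same factor as the latency grows (80x here), which would be consistent with a queue that keeps increasing and never catches up on a setup with many items.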
