Good morning.
I'm trying out some WMI-query-based triggers to detect split-brain situations on a cluster. So far, my tests have led me to the following.
Getting the string from the WMI query goes like this:
ITEM:
{NODE1:wmi.get[root\Mscluster,select * from MSCluster_NodeToActiveGroup where PartComponent='MSCluster_ResourceGroup.Name="CLUST R"'].str(MSCluster_Node.Name="NODE1")}=1
(this reports "true" if Node 1 finds that it is the cluster owner)
This way, a split-brain situation can be detected by running the query on both nodes and checking whether both report "true" for owning the service.
TRIGGER:
{NODE1:wmi.get[root\Mscluster,select * from MSCluster_NodeToActiveGroup where PartComponent='MSCluster_ResourceGroup.Name="CLUST R"'].str(MSCluster_Node.Name="NODE1")}=1
and
{NODE2:wmi.get[root\Mscluster,select * from MSCluster_NodeToActiveGroup where PartComponent='MSCluster_ResourceGroup.Name="CLUST R"'].str(MSCluster_Node.Name="NODE2")}=1
But I got a false positive during a failover, and I'm trying to find out how Zabbix handles the data from these queries. When the service failed on Node 1 due to a network error, the last value received from Node 1 was "Node 1 is the active node". When Node 2 took over, it began reporting that Node 2 was the active node, but the last value from Node 1 still seemed to stand, so the trigger condition was met. How does Zabbix treat the last response from Node 1 if the WMI service stops? Does it keep the last response as valid?
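To illustrate what I suspect happened, here is a minimal Python sketch. It is not Zabbix code, and it rests on my assumption that the trigger simply compares the most recent stored value from each node, however old that value is. The names (`Sample`, `split_brain_naive`, `split_brain_fresh`, `max_age`) are made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    value: int        # 1 = node claims to own the group, 0 = it does not
    timestamp: float  # seconds, when the value was received


def split_brain_naive(node1: Sample, node2: Sample) -> bool:
    # Mirrors the trigger above: both last values equal 1.
    return node1.value == 1 and node2.value == 1


def split_brain_fresh(node1: Sample, node2: Sample,
                      now: float, max_age: float = 120.0) -> bool:
    # Same check, but only trusted when both values are recent
    # (max_age is a hypothetical freshness window in seconds).
    fresh = all(now - s.timestamp <= max_age for s in (node1, node2))
    return fresh and split_brain_naive(node1, node2)


# Node 1 last answered at t=0 and then went silent;
# Node 2 took over and answered at t=300.
node1 = Sample(value=1, timestamp=0.0)
node2 = Sample(value=1, timestamp=300.0)

print(split_brain_naive(node1, node2))             # True  (false positive)
print(split_brain_fresh(node1, node2, now=300.0))  # False (Node 1 value is stale)
```

If that assumption holds, the false positive comes from the stale value from Node 1, and some kind of freshness condition on both items would be needed alongside the two comparisons.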
Is there any way to "aggregate" the answers from the queries? Is there a "count" method I could call to evaluate several responses from the query over the same time window?
(And no, unfortunately, I can't detect the situation from the service status parameter: the service is running on both nodes, but only one is the primary, and that is managed by the Cluster Manager.)
Thanks in advance. I'll post any new findings about this.