As far as I know, one of the main reasons Zabbix INC is reviewing the entire node structure is the set of bugs in the synchronization process between a node and its master node.
Database synchronization has always been complex, even for large DBMSes, so it would be no different for a monitoring tool.
Here is my analysis of the problem. I am not saying it is right or wrong; it is just what I understand my problem to be and how I plan to solve it.
My problem:
1) The company I work for has clients and representatives virtually nationwide, at various physical locations across the country (in most cases, more than one building with elements to monitor in the same state).
2) We have remote management teams and local management teams. Third-level support teams are "always" remote.
Because of this, we need a tool that provides both a local view and a regional/national view. This is what nodes were created for, and this is what we use them for.
My employer chose the following approach to monitoring the environment:
1) A national team responsible for standardizing the monitoring and for 24x7 operation;
2) Local teams (wherever there are network administrators) responsible for registering and maintaining the hosts to be monitored;
3) National monitoring screens (usually displayed in NOCs) with macro and micro views;
4) Regional monitoring screens (usually displayed on LCD screens where the regional administrators sit) with macro and micro views of that region.
Obviously we have LAN, MAN and WAN, each with an SLA on top of it (often with SLAs on top of the applications as well), so we need the right monitoring granularity and visibility to meet the contracted levels.
This is the macro strategy that we understand to be the best we can get right now for monitoring the nearly 15,000 hosts we are responsible for (growing around 10% per year).
So far, so good: in theory, Zabbix could handle this without problems by using nodes. The problem is that nodes have some "potential bugs related to the synchronization process, made worse when you have multilevel node monitoring (master node => child node => grandson node, etc.)". Because of these bug reports I was forced to think about alternatives, since I cannot give up monitoring: my contracts require it.
How do Zabbix nodes work?
The node system is described in detail at the following URLs:
For 2.0.x: https://www.zabbix.com/documentation/doku.php?id=2.0%2Fmanual%2Fdistributed_monitoring%2Fnodes
For 2.2.x (not yet released): https://www.zabbix.com/documentation...nitoring/nodes
As described at these two URLs, you can create a single monitoring level (parent => child) or more than one level (parent => child => grandson => great-grandson => etc.). The company I work for makes use of this possibility, as well as the support for proxies, as stated earlier in this post.
Nodes were built to enable data synchronization, which is very beautiful in theory. However, several DBMSes simply abandoned this kind of synchronization, because it is extremely complex to guarantee the integrity of the database: the dependency relationships that can arise along the way (with concurrent changes, deletions and insertions made in different locations, and data arriving at a later time) can literally wreck the database.
The only way I see, until Zabbix INC offers another option, is to ensure that data flows in a single direction.
The way I found to mitigate the problem:
Force Zabbix to send data in one direction only:
-- from Node to Master
and no longer
-- from Node to Master AND from Master to Node.
Say this is my environment (my real environment is MUCH larger):
NODE | IP | Purpose
101 | 10.10.10.10 | Master-Node
- 201 | 10.10.11.10 | Node Child 1 responsible for São Paulo
- 202 | 10.10.12.10 | Node Child 2 responsible for Rio de Janeiro
- 203 | 10.10.15.10 | Node Child 3 responsible for Brasília
-- 301 | 10.10.16.10 | Node "Grandson" 1 responsible for Campinas / SPO connected to Child 1
-- 302 | 10.10.26.10 | Node "Grandson" 2 responsible for Barueri 2 / SPO connected to Child 1
-- 303 | 10.10.36.10 | Node "Grandson" 3 responsible for Barretos 3 / SPO connected to Child 1
-- 304 | 10.10.46.10 | Node "Grandson" 4 responsible for Buzios / RJO connected to Child 2
So we have monitoring with 3 hierarchical levels:
Level 1) Master
Level 2) Regional Nodes
Level 3) Sector Nodes
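To make the three levels concrete, here is a minimal Python sketch that derives each node's level from the example table above. The PARENT dictionary and the function names are hypothetical, made up for illustration; this is not how Zabbix itself stores or exposes node data.

```python
# Hypothetical model of the example environment (node ID -> parent node ID).
# This layout is an illustration only, not a Zabbix data structure.
PARENT = {
    101: None,                      # Master
    201: 101, 202: 101, 203: 101,   # Regional nodes (children of the master)
    301: 201, 302: 201, 303: 201,   # Sector nodes under Child 1 (SPO)
    304: 202,                       # Sector node under Child 2 (RJO)
}


def level(node_id, parent=PARENT):
    """Return 1 for the master, 2 for regional nodes, 3 for sector nodes."""
    depth = 1
    while parent[node_id] is not None:
        node_id = parent[node_id]
        depth += 1
    return depth


def nodes_by_level(parent=PARENT):
    """Group node IDs by hierarchical level, in ascending ID order."""
    grouped = {}
    for node_id in sorted(parent):
        grouped.setdefault(level(node_id, parent), []).append(node_id)
    return grouped


if __name__ == "__main__":
    for lvl, ids in sorted(nodes_by_level().items()):
        print("Level", lvl, ids)
```

A region with no sector nodes (such as 203 in the example) simply has no entries pointing at it, which matches the remark that the number of sector nodes varies.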
Note that the number of sector nodes varies, and there may be none at all, depending on the monitoring requirements. We also have (in this hypothetical case, because I cannot publish real information about my monitoring environment without authorization) local teams in each physical building where there is a monitoring node.
Ensuring this mode of operation is very simple: I just need to get my settings "wrong on purpose".
On the level 1 node (the master), do the correct configuration as explained at the URLs above.
On the level 2 and level 3 nodes (and on any deeper levels), we make an "error" when providing the MASTER ID.
On "Child 1, 2 and 3", for things to work as designed, I should declare that the MASTER node ID is 101. Instead, I will say that the master node ID is 999, and I will never use this ID in my infrastructure to reference a valid node.
On "Grandson 1, 2, 3 and 4", I should put the node ID of their "Child" (the second level of the hierarchical tree) in the configuration. However, I will "get it wrong" again and set the ID to 999.
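The plan can be written down as a small Python sketch: given the real hierarchy, produce the master ID that each node will deliberately be told. Everything here (PARENT, BOGUS_MASTER_ID, declared_master_ids) is a hypothetical name for illustration. It only computes the plan; it does not touch any Zabbix configuration.

```python
# Hypothetical model of the example environment (node ID -> real master ID).
PARENT = {
    101: None,
    201: 101, 202: 101, 203: 101,
    301: 201, 302: 201, 303: 201,
    304: 202,
}

# The deliberately wrong master ID from the post. It must never be given
# to a real node, otherwise the trick stops working.
BOGUS_MASTER_ID = 999


def declared_master_ids(parent, bogus=BOGUS_MASTER_ID):
    """Map each node to the master ID it will be configured with.

    The top-level master has no master (None). Every other node gets the
    bogus ID instead of its real master.
    """
    if bogus in parent:
        raise ValueError("bogus master ID %d is used by a real node" % bogus)
    return {
        node_id: (None if real_master is None else bogus)
        for node_id, real_master in parent.items()
    }


if __name__ == "__main__":
    for node_id, master in sorted(declared_master_ids(PARENT).items()):
        print(node_id, "->", master)
```

The check at the top of the function encodes the one rule the approach depends on: 999 stays reserved and is never assigned to a valid node.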
This will produce an "error" line in the log, because on the child nodes zabbix_server checks whether the node that is sending data is a valid node for that action.
On the level 2 nodes (which receive from the level 1 master), the line will look similar to this:
"2778:20130102:141305.404 NODE 201: Received configuration changes from unknown node 101"
And on the level 3 nodes connected to the SPO node, it will look similar to this:
"2778:20130102:141305.404 NODE 303: Received configuration changes from unknown node 201"
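If you want to count or alert on these rejections, a short Python sketch like the one below can pull the receiving and sending node IDs out of such lines. The regular expression is based only on the two sample lines quoted in this post, so treat the exact format as an assumption; the function name rejected_pushes is made up for illustration.

```python
import re

# Pattern derived from the two sample lines in this post (an assumption,
# not an official description of the zabbix_server log format).
REJECTED = re.compile(
    r"(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6}\.\d{3}) "
    r"NODE (?P<receiver>\d+): Received configuration changes "
    r"from unknown node (?P<sender>\d+)"
)


def rejected_pushes(lines):
    """Return (receiver_id, sender_id) pairs for every rejection line."""
    found = []
    for line in lines:
        match = REJECTED.search(line)
        if match:
            found.append((int(match.group("receiver")),
                          int(match.group("sender"))))
    return found


if __name__ == "__main__":
    sample = [
        "2778:20130102:141305.404 NODE 201: Received configuration changes from unknown node 101",
        "2778:20130102:141305.404 NODE 303: Received configuration changes from unknown node 201",
        "some unrelated log line",
    ]
    for receiver, sender in rejected_pushes(sample):
        print("node", receiver, "rejected changes from node", sender)
```

Each pair tells you which higher-level node tried to push configuration down and which node refused it.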
With this I guarantee that even if someone, against the guidelines, decides to register or change something on a lower-level node from a higher-level node (e.g. changing data of 202 from 101, or changing data of 303 from 201), the change will never be accepted by the lower-level node, because the sender does not match the master node ID it recognizes.
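The reasoning can be illustrated with a toy Python model. It assumes, based only on the behaviour described in this post, that a node accepts configuration changes solely from the node ID it has configured as its master. The function name accepts_config is hypothetical, and this is not taken from the Zabbix source code.

```python
def accepts_config(declared_master_id, sender_id):
    """Toy model: accept configuration only from the declared master.

    declared_master_id is the master ID configured on the receiving node
    (None for the top-level master, which has no master at all).
    """
    return declared_master_id is not None and sender_id == declared_master_id


if __name__ == "__main__":
    # With the correct master ID, node 202 would accept changes from 101.
    print(accepts_config(101, 101))
    # With the deliberately wrong ID 999, changes from 101 are refused.
    print(accepts_config(999, 101))
    # The same holds one level down: 303 refuses changes coming from 201.
    print(accepts_config(999, 201))
```

Since no real node ever has ID 999, no sender can match it, so under this model every downward configuration push is refused.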
After doing this, I guarantee that the database will not be corrupted by operational errors or bugs.
Only one gap remains: identifying unauthorized changes made on a higher-level node against a lower-level node. This gap will be filled by a non-compliance report that points out the unauthorized modification (via the auditing feature of Zabbix).
I hope this post contributes to the Zabbix community. It solves MY problem, but it may not fully solve other people's, so I would like to receive your feedback on it.