Zabbix Pseudo HA Proposal - requesting feedback

  • tgrissom
    Junior Member
    Zabbix Certified Specialist | Zabbix Certified Professional
    • Nov 2017
    • 7

    #1

    Zabbix Pseudo HA Proposal - requesting feedback

    Request for feedback:
    I know HA has been discussed in many threads, but I found that many of the discussions date back to the 2.x versions of Zabbix, and none of the threads address all my questions using a single version of the software. Much has changed. So....

    I have inherited responsibility for the care and feeding of my company's Zabbix infrastructure/application, and as such I have been given some latitude in implementing a new, more resilient and efficient architecture as part of our upgrade from Zabbix 1.8.3 to 3.4 (maybe 4.0, depending on when it comes out).

    Simplified high-level view of the network(s):
    We have a few dozen sites across the country with various types of hosts/devices. Not all are monitored by Zabbix, but almost all sites have at least a couple of hosts we currently monitor. My hope is, in the near future, to replace some of the other monitoring tools we use with Zabbix, but that is a conversation for a different day.

    We have two NOC sites, a primary (PNOC) and a backup (BNOC). The backup site is, for all intents and purposes, a duplicate of the primary site; the server and database infrastructure is mirrored. The BNOC is located in a different city and is activated proactively, for instance in preparation for a hurricane at the primary site. Otherwise the backup site is basically idle. Except perhaps for a fire or tornado disaster, switching operating sites is very controlled and planned.

    The goal is to provide a stable/resilient Zabbix environment that facilitates meeting the SLA with our customer. That means continuous monitoring and reasonably uninterrupted collection of state and performance data from the various hosts/devices across the networks. Secondarily, the goal is to minimize the administrative effort to maintain such an environment, and finally to minimize the effort and complexity for an admin to move the Zabbix applications from site to site.

    Our operational procedures require that the service NOT be autonomous; any switch in location must be initiated by a human. For that reason, a full HA/auto-failover setup is not required. Besides, the NOC sites are geographically separated, and a full inter-site clustered HA solution would be prohibitively complicated and expensive.

    After reading through the documentation and thread after thread here and on other forums, I have the following draft/skeleton proposal on which I would like community feedback.

    --------------------------
    I would implement:

    > MySQL database configured in master/slave mode using GTID-based row replication, with partitioning implemented for the history/trends tables. Master node at PNOC and slave node at BNOC. (A small replication health-check sketch follows this list.)
    > Zabbix 3.4 server/frontend installed at each site. Each instance would access the local DB instance. The BNOC Zabbix server would be offline until a switch-over.
    > All hosts will be monitored via proxy and will be configured with Server/ServerActive parameters referencing proxies at both NOC sites.
    > Initially all proxies will be located in the NOC data centers and will be deployed in pairs, one at each site. (Ideally they would move out to the regional hubs or PoPs, but for now they must be at the NOCs.)
    > Proxies will be configured to use DNS to reference the active Zabbix server, a poor man's VIP. That FQDN will be provisioned in the proxies' hosts files and will reference the active Zabbix server's IP.
    > All agent-monitored items that reasonably can be will be made active items for performance and resiliency purposes.
    > All agents and proxies will use certificates for encryption/authentication.
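
    To make the replication bullet above a little more concrete, here is a rough health-check sketch I am considering for deciding when it is safe to switch over (Python, shelling out to the mysql client; the slave host name, credentials via ~/.my.cnf, and the lag threshold are all placeholders/assumptions on my part):

    Code:
    #!/usr/bin/env python
    """Rough replication health check for the PNOC -> BNOC MySQL pair.

    Assumes the mysql client is installed and that credentials for the slave
    are provided via ~/.my.cnf (otherwise extend MYSQL_CMD with -u/-p options).
    """
    import subprocess
    import sys

    SLAVE_HOST = "bnoc-db.example.com"   # placeholder
    MAX_LAG_SECONDS = 30                 # placeholder threshold

    MYSQL_CMD = ["mysql", "-h", SLAVE_HOST, "-e"]

    def mysql_query(sql):
        """Run a statement via the mysql CLI and return its raw text output."""
        return subprocess.check_output(MYSQL_CMD + [sql], universal_newlines=True)

    def slave_status():
        """Parse SHOW SLAVE STATUS\\G output into a dict."""
        status = {}
        for line in mysql_query("SHOW SLAVE STATUS\\G").splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                status[key.strip()] = value.strip()
        return status

    def main():
        print(mysql_query("SHOW VARIABLES LIKE 'binlog_format'").strip())

        st = slave_status()
        lag = st.get("Seconds_Behind_Master", "NULL")
        running = (st.get("Slave_IO_Running"), st.get("Slave_SQL_Running"))
        print("IO/SQL threads running: %s/%s, lag: %s seconds" % (running + (lag,)))

        if running != ("Yes", "Yes") or lag == "NULL" or int(lag) > MAX_LAG_SECONDS:
            sys.exit("Replication is not healthy enough to switch over yet.")
        print("Replication looks healthy; safe to proceed with the switch-over.")

    if __name__ == "__main__":
        main()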


    In the event we need to move operations to the BNOC, I would follow this procedure:

    1. Stop the PNOC Zabbix server (PServer) processes.
    2. Wait for the DB replication to catch up.
    3. Swap the DB master role to the BNOC.
    4. Point all proxies to BServer via a hosts-file update (no proxy restart required). This would be done using a script, Puppet, or maybe even a Zabbix-initiated script; TBD (see the hosts-file sketch below this list).
    5. Start the Zabbix BServer processes.
    6. Once everything is stabilized and the backlog of buffered items is processed, configure/move all hosts from the PProxies to the BProxies. This would be done manually or using a script/API (see the API sketch further below).
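
    For step 4, the sketch below shows the kind of hosts-file update I have in mind. It only covers the local edit on a single proxy (pushing it out via Puppet/SSH is a separate problem), and the FQDN and IP addresses are placeholders of my own, not our real values:

    Code:
    #!/usr/bin/env python
    """Repoint the 'poor man's VIP' FQDN in /etc/hosts at the active Zabbix server.

    Runs locally on each proxy (pushed out via Puppet/SSH). The FQDN and the
    server IPs below are placeholders, not real values from our environment.
    """
    import shutil
    import sys

    HOSTS_FILE = "/etc/hosts"
    VIP_FQDN = "zabbix-active.example.com"   # placeholder FQDN the proxies use for the server
    SERVERS = {
        "pnoc": "10.1.1.10",                 # placeholder PServer IP
        "bnoc": "10.2.1.10",                 # placeholder BServer IP
    }

    def repoint(site):
        new_ip = SERVERS[site]
        with open(HOSTS_FILE) as f:
            lines = f.readlines()

        # Drop any existing entry for the VIP FQDN, then append the new one.
        kept = [line for line in lines if VIP_FQDN not in line.split()]
        kept.append("%s\t%s\n" % (new_ip, VIP_FQDN))

        shutil.copy(HOSTS_FILE, HOSTS_FILE + ".bak")   # keep a backup, just in case
        with open(HOSTS_FILE, "w") as f:
            f.writelines(kept)
        print("%s now resolves to %s (%s)" % (VIP_FQDN, new_ip, site))

    if __name__ == "__main__":
        if len(sys.argv) != 2 or sys.argv[1] not in SERVERS:
            sys.exit("usage: repoint_vip.py pnoc|bnoc")
        repoint(sys.argv[1])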

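    And for step 6, a minimal sketch of the host-to-proxy move via the Zabbix JSON-RPC API (host.massupdate with proxy_hostid), as far as I understand the API docs. The URL, credentials and proxy names are placeholders, and error handling is mostly omitted:

    Code:
    #!/usr/bin/env python
    """Move all hosts monitored by one proxy to another via the Zabbix API.

    URL, credentials and proxy names are placeholders. Requires the 'requests'
    package. Sketch only: no error handling beyond raising on API errors.
    """
    import requests

    ZABBIX_URL = "https://zabbix.example.com/zabbix/api_jsonrpc.php"   # placeholder
    USER, PASSWORD = "api-user", "secret"                              # placeholders

    def api(method, params, auth=None):
        """Single JSON-RPC call to the Zabbix frontend."""
        payload = {"jsonrpc": "2.0", "method": method, "params": params,
                   "id": 1, "auth": auth}
        reply = requests.post(ZABBIX_URL, json=payload).json()
        if "error" in reply:
            raise RuntimeError(reply["error"])
        return reply["result"]

    def proxy_id(name, auth):
        """Look up a proxy ID by the proxy's name (the 'host' field of the proxy object)."""
        result = api("proxy.get", {"output": ["proxyid"], "filter": {"host": [name]}}, auth)
        return result[0]["proxyid"]

    def move_hosts(src_proxy, dst_proxy):
        auth = api("user.login", {"user": USER, "password": PASSWORD})
        src, dst = proxy_id(src_proxy, auth), proxy_id(dst_proxy, auth)

        hosts = api("host.get", {"output": ["hostid", "host"], "proxyids": [src]}, auth)
        print("Moving %d hosts from %s to %s" % (len(hosts), src_proxy, dst_proxy))

        # host.massupdate lets us set proxy_hostid on all of them in one call
        api("host.massupdate",
            {"hosts": [{"hostid": h["hostid"]} for h in hosts], "proxy_hostid": dst},
            auth)

    if __name__ == "__main__":
        move_hosts("PProxy1", "BProxy1")   # placeholder proxy names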

    Questions/Concerns
    > I am not a DBA and know very little about the implications of the replication methods. Is row-based the correct replication format (vs. statement-based) for the Zabbix application?
    > I am assuming that when the agent references two proxies in ServerActive, it will continue to query both proxies for its active check config. The proxy that is NOT configured to monitor the host will log complaints that it knows nothing about that host. If that is the worst of the impact of this config, I can live with that. Are there any other concerns/impacts of multiple configured active servers?
    > When the proxies connect to BServer after the switchover, will the proxies and subtending hosts see this connection as the same 'source/path' as PServer? And as such, will the cached config and all the host/item data they have buffered still be valid and reported to the server after the connection is established, just as if they were talking to PServer again?
    > There should be very little 'data loss' after the transition, with the understanding that passive agent, ODBC, SNMP, etc. items will all have a gap in data during the front-end downtime. Any holes in this?
    > When the hosts are reconfigured (via the front end) to report to the secondary proxy, will the current path/set of active items the agent knows about be cleared before the second path/items are provisioned and take effect? I understand there may be a short period of 'data loss' during this transition, but I want to avoid the same data being sent to two proxies and generating errors or buffered data on the inactive proxy.
    > Will history and trend data associated with any given host be maintained before and after a swap from one proxy to another?
    > Does implementing DB partitioning impact the upgradability of either Zabbix or MySQL in the future?
    > Could I maintain both front ends as active as long as I point them to the active DB server instance?
    > I could achieve the proxies' connection swap to the new site using site DNS, BUT my understanding is that the proxy would have to perform a DNS query each time it needed to send data to the server. Is this correct? Putting a DNS cache on each proxy could mitigate this but would then delay the proxy seeing the new server address.

    Anything major I am missing here?


    This is my first post so please forgive me for any forum faux pas.
    Thanks in advance....
  • tgrissom
    Junior Member
    Zabbix Certified Specialist | Zabbix Certified Professional
    • Nov 2017
    • 7

    #2
    I guess I am on my own.

    No response from the peanut gallery. I guess I am on my own.


    • kloczek
      Senior Member
      • Jun 2006
      • 1771

      #3
      Originally posted by tgrissom
      Questions/Concerns
      > I am not a DBA and know very little about the implications of the replication methods. Is row-based the correct replication format (vs. statement-based) for the Zabbix application?
      MIXED is more compact, and I have found many times that ROW-based replication causes problems on the slave side, as locks quite often hit places in the middle of transactions (between BEGIN and END).

      > I am assuming that when the agent references two proxies in ServerActive, it will continue to query both proxies for its active check config. The proxy that is NOT configured to monitor the host will log complaints that it knows nothing about that host. If that is the worst of the impact of this config, I can live with that. Are there any other concerns/impacts of multiple configured active servers?
      It will not work, as both proxies will receive exactly the same data and will push duplicated metric data to the server. That is the active-proxy case.
      In the case of passive proxies, the server will connect to one of the proxies at random, causing total confusion of the stack.

      > When the proxies connect to BServer after the switchover, will the proxies and subtending hosts see this connection as the same 'source/path' as PServer? And as such, will the cached config and all the host/item data they have buffered still be valid and reported to the server after the connection is established, just as if they were talking to PServer again?
      > There should be very little 'data loss' after the transition, with the understanding that passive agent, ODBC, SNMP, etc. items will all have a gap in data during the front-end downtime. Any holes in this?
      Proxy history is preserved in the proxy's DB backend.
      If the DB content is preserved during the switchover and the proxy downtime is shorter than the time it takes the agent's local buffer to fill up, there will be no loss of metric data.

      > When the hosts are reconfigured (via the front end) to report to the secondary proxy, will the current path/set of active items the agent knows about be cleared before the second path/items are provisioned and take effect? I understand there may be a short period of 'data loss' during this transition, but I want to avoid the same data being sent to two proxies and generating errors or buffered data on the inactive proxy.
      Such switching will not happen instantly.
      It all depends on how frequently the proxies and agents refresh their configuration. Generally, the switching time is not lower than the sum of those two periods.

      > Will history and trend data associated with any given host be maintained before and after a swap from one proxy to another?
      Look, for example, at the history table definition:
      Code:
      Create Table: CREATE TABLE `history` (
        `itemid` bigint(20) unsigned NOT NULL,
        `clock` int(11) NOT NULL DEFAULT '0',
        `value` double(16,4) NOT NULL DEFAULT '0.0000',
        `ns` int(11) NOT NULL DEFAULT '0',
        KEY `history_1` (`itemid`,`clock`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
      As you can see, all data in the history/trends tables is identified by item ID. Changing the proxy ID does not change item IDs.
      > Does implementing DB partitioning impact the upgradability of either Zabbix or MySQL in the future?
      No.

      > Could I maintain both front ends as active as long as I point them to the active DB server instance?
      Yes. Frontends can be scaled horizontally.
      My advice: switch to nginx and php-fpm, which in most cases will remove the need for more than one frontend, since Apache with the PHP module has much higher resource consumption per HTTP session.

      > I could achieve the proxies' connection swap to the new site using site DNS, BUT my understanding is that the proxy would have to perform a DNS query each time it needed to send data to the server. Is this correct? Putting a DNS cache on each proxy could mitigate this but would then delay the proxy seeing the new server address.
      Generally, the proxy is such a lightweight process that it can be treated as stateless and killed and rebuilt as a low-impact procedure. Even a proxy with its own local MySQL backend within the same system image can be treated the same way.
      http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
      https://kloczek.wordpress.com/
      zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
      My zabbix templates https://github.com/kloczek/zabbix-templates


      • tgrissom
        Junior Member
        Zabbix Certified Specialist | Zabbix Certified Professional
        • Nov 2017
        • 7

        #4
        Thanks for the insight.

        Thanks for the reply.

        I have implemented "MIXED" replication after doing a little more investigation. Thanks for giving me a little more comfort with that decision.

        As for both proxies reporting data to the front end: I am getting ready to test this today. I have the agents pointing to both proxies, but it is my understanding that the agent will only send data to the proxy that sent it the active items. E.g., if it is configured with ServerActive=Proxy1,Proxy2, and Proxy1 responds with active items while Proxy2 does not, the items will only be sent to Proxy1, not Proxy2. I will post the results of the testing.

        Timing of switching which proxy a host is configured to report to: I will review the proxy/agent config refresh intervals.

        As for the multiple front ends: the load on the front end is not really a concern at this point, so it does not justify a load balancer or swapping the web stack. The reason for running multiple front ends would be solely the convenience of being able to access a local front end from either site, and to provide a pseudo hot/hot front end so that if maintenance (or whatever) is required, it can be done with minimal impact/effort.

        Along those same lines, we decided to break it up / configure the system so that any of the elements (DB, Zserver, Zfrontend, proxies) could be migrated to the alternate site independently. This provides us some flexibility when performing maintenance or if any of the elements fail. Again, this is in testing; I will let you know how it fleshes out.

