Problem recovering after disruption of proxy/server connectivity
  • abjornson
    Member
    • Oct 2013
    • 34

    #1

    Problem recovering after disruption of proxy/server connectivity

    I have been using Zabbix 2.4.8 to monitor a large ISP network for almost 2 years. It works very well!

    My zabbix server is in the cloud on AWS. I have 2 Zabbix proxies in my network's core datacenter that split the load of monitoring my ~2000 hosts.

    The load is split evenly, with each proxy monitoring about 1000 hosts and handling 700-850 VPS (values per second).

    This works very well when there is no disruption of communication between proxies and server.

    However, we sometimes suffer upstream connectivity outages at our datacenter. These cut the proxies off from the server for anywhere between 15 minutes and a few hours.

    I've found the proxies / server have a lot of trouble "catching up" after a disruption like that.

    What I see when connectivity is restored is that all hosts monitored by the proxies have delayed data/graphs. Even though Administration/Proxies shows a "last seen" of just a few seconds ago, graphs for proxy-monitored hosts run 45 minutes to 1.5 hours behind for a while. So it looks like no data is collecting, when in fact data is collecting but filling in with a delay.

    I think this is because I'm close to the 1000-value sync limit imposed on proxies by ZBX_MAX_HRECORDS. After an outage, all the buffered history data doesn't fit through the "pipe" alongside the realtime data, since data is sent sequentially.
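    If that's right, the arithmetic explains the delay I see. A rough sketch (assuming ~800 VPS per proxy, the middle of my range, and one batch sent per second, which I believe is the DataSenderFrequency default):

```shell
# Rough catch-up math. Assumptions: ~800 new values/sec per proxy and
# one batch of ZBX_MAX_HRECORDS=1000 values sent per second.
backlog=$((800 * 3600))   # values queued during a 1-hour outage
drain=$((1000 - 800))     # net catch-up rate once the link is back, values/sec
echo "hours to catch up: $((backlog / drain / 3600))"   # -> hours to catch up: 4
```

    By that math a one-hour outage takes about four hours to drain, which is consistent with the 45 min to 1.5 h lag I see after shorter blips.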

    I wish I could tell the proxy to prioritize the live data over the old data, but I don't think I can do that. The options I see:

    --Reduce the amount of history data stored. I'd actually rather just have a gap during the outage and drop the data the proxy couldn't send, if it meant avoiding this delay problem. Would I do this by reducing HistoryCacheSize to 0 on the proxies, or is there another setting for this?
    --Recompile zabbix_proxy with ZBX_MAX_HRECORDS above 1000 (https://zabbix.com/forum/showthread.php?t=56509). This would probably really help, but it's labor intensive and makes future upgrades harder.
    --Add a third proxy. This would be the easiest fix for me right now because it's relatively quick.
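    Edit: reading through zabbix_proxy.conf, it looks like ProxyOfflineBuffer (not HistoryCacheSize) may be the relevant knob for the first option. If I read the docs right, it controls how many hours of unsent data the proxy keeps when it can't reach the server. Something like this (values illustrative):

```ini
# /etc/zabbix/zabbix_proxy.conf (excerpt)

# Hours to keep data that could not be sent to the server;
# anything older is discarded. Default 1, range 1-720.
ProxyOfflineBuffer=1

# Hours to keep data locally even after it has been synced
# (0 = don't keep, the default).
ProxyLocalBuffer=0
```

    Can anyone confirm whether ProxyOfflineBuffer=1 would give me the "gap instead of backlog" behavior?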

    I may be imagining things, but I feel like if I reboot the proxies and server after the outage, they "catch up" more quickly and get back to realtime data.

    Any help with this is greatly appreciated!
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    I don't know the current state of the proxy sync code, but it seems increasing ZBX_MAX_HRECORDS is so far the only solution (I'm using 50k). I remember the Zabbix team planned to add a configuration parameter for this so it wouldn't force recompiling the binaries.
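    For reference, the change itself is tiny: the define is a compile-time constant (grep the 2.4 source tree for it; src/libs/zbxdbhigh/proxy.c is where I'd expect it, but treat that path as an assumption). A sketch of the edit, shown here on a stand-in file:

```shell
# Stand-in for the real source file (grep the tree for the actual location):
printf '#define\tZBX_MAX_HRECORDS\t1000\n' > proxy.c.sample

# Raise the sync batch size from 1000 to e.g. 10000:
sed -i 's/\(ZBX_MAX_HRECORDS[[:space:]]*\)1000/\110000/' proxy.c.sample
cat proxy.c.sample   # prints the define, now with 10000

# Then rebuild the proxy as usual, e.g.:
#   ./configure --enable-proxy --with-mysql && make
```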

    Sometimes the max sync speed is limited by the write speed of the Zabbix server's DB backend.
    There is a really wide range of factors that can limit max insert speed. A few major ones:
    1) If you are using MySQL, max_allowed_packet, which is 4MB by default
    2) Use MySQL >= 5.6 (https://kloczek.wordpress.com/2016/0...rade-surprise/)
    3) Increase memory so more data is held in RAM (part of the insert query is reading data)
    4) Use partitioned history* and trends* tables
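    For points 1 and 3, in my.cnf terms that means something along these lines (values are illustrative; size them to your RAM and daily volume):

```ini
# /etc/mysql/my.cnf (excerpt, illustrative values)
[mysqld]
# default is 4M; large batched inserts should fit in one packet
max_allowed_packet = 64M
# main "hold more data in memory" knob for InnoDB
innodb_buffer_pool_size = 8G
```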

    If you need really high write speed, I recommend switching the DB backend OS from Linux to Solaris, on hardware with enough RAM that the ZFS ARC roughly matches a daily volume of data (just measure the size of the previous day's history* table partitions). Switching from Linux to Solaris on the same hardware may even double write speed and cut select latency several times over. MySQL's own caching looks worse than the ZFS ARC (Adaptive Replacement Cache).
    There are other benefits to ZFS, like simplifying the creation of slave DB instances via ZFS snapshots and "zfs send/receive", which doesn't trash the data cached in memory.
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • abjornson
      Member
      • Oct 2013
      • 34

      #3
      Thanks so much for your reply @kloczek - very helpful validation....and wow! 50k ZBX_MAX_HRECORDS - I'd planned just to go to 2.5k up from 1k as a start. Do you know a good way to tune if you've gone too high? I'd imagine if you increase too much you'd go beyond the capacity of your hardware?

      I will try this soon and report back. I had seen mention that this was under consideration as a configuration option; it would be awesome if they implemented it! I love the ease of upgrading from the apt repo and would hate to break that with a custom-compiled version.

      Thanks also for the backend db scale tips - these will be useful in the future. I'm pretty confident my current issue is proxy/server issues around ZBX_MAX_HRECORDS....but interested to check those out as well.


      • abjornson
        Member
        • Oct 2013
        • 34

        #4
        An additional question about the scenario I outlined above (proxy cut off from the server for some amount of time):

        Does anyone know if there's an easy way to tell the proxy to just dump the backlog and "fast forward" to the present? Obviously in a perfect world I'd keep all the backlog data, but failing that, until I get ZBX_MAX_HRECORDS sorted it would be preferable to dump the backlog if it meant realtime data starts flowing again as soon as the outage is resolved.


        • kloczek
          Senior Member
          • Jun 2006
          • 1771

          #5
          Originally posted by abjornson
          Thanks so much for your reply @kloczek - very helpful validation....and wow! 50k ZBX_MAX_HRECORDS - I'd planned just to go to 2.5k up from 1k as a start. Do you know a good way to tune if you've gone too high? I'd imagine if you increase too much you'd go beyond the capacity of your hardware?

          I will try this soon and report back. I had seen mention that this was under consideration to be implemented as a configuration option. This would be awesome if they'd implement! I love the ease of upgrading from apt repo - hate to break that with a custom compiled version.

          Thanks also for the backend db scale tips - these will be useful in the future. I'm pretty confident my current issue is proxy/server issues around ZBX_MAX_HRECORDS....but interested to check those out as well.
          My understanding of ZBX_MAX_HRECORDS is that it is more or less a way of throttling the volume of data coming from a proxy when the server doesn't have enough processing power to check new data against trigger definitions, or the DB backend isn't strong enough. In none of my past cases were those what limited the sync speed from the proxies after an outage (planned or not).
          The problem probably only hits the server when there is a big enough flow of data from active proxies. If that is the real cause, it would probably be better solved in the srv<>prx protocol with a signal like "I'm busy right now, please send the next batch later, or send half a batch". Conversely, when a proxy is more than N seconds behind its own sending schedule, it could be allowed to send, say, 10-50% more data than in the previous cycle. With such logic the data bandwidth could be self-regulating.
          The max size of a batch from the proxy would then be limited only by the size of the write caches on the server side. A similar algorithm could let the server suggest that a passive proxy send more data in the next cycle when there is no congestion processing its write cache.

          On your question about tuning: if the bandwidth of data from the proxies is the issue, there are potentially two bottlenecks on the server side. The first is the speed of processing new data against trigger definitions (the ratio of items to triggers plays into this); if there is no issue there, the remaining bottleneck is the max performance of the DB backend. In other words, there is no straight or simple answer to the tuning question.
          The problem is that anyone hitting one of those barriers will probably struggle to diagnose where the bottleneck is.
          IMO it would help to add a few new server internal metrics, for example a counter incremented after each trigger is processed. If the rate of that counter reaches a plateau, that is a clear indicator some bottleneck has been hit.


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by abjornson
            An additional question: in the scenario i outlined above (proxy cutoff from server for some amount of time)

            Does anyone know if there's an easy way to tell the proxy to just dump the backlog and "fast forward" to the present? Obviously in a perfect world I'd keep all the backlog data, but failing that, until I get ZBX_MAX_HRECORDS sorted it would be preferable to dump the backlog if it meant realtime data starts flowing again as soon as the outage is resolved.
            The simplest way I know of to drop that data is just to reinitialize the proxy's DB backend. The proxy uses the same DB schema as the server (this simplifies schema upgrades on startup for both), but it only uses 4 or 5 tables of the whole DB, so deleting the database, recreating it, and importing the schema usually takes only a few seconds.
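            Concretely it is just a few commands. A sketch for a MySQL-backed proxy (the database name, user, and schema path are assumptions; check where your package ships the proxy schema):

```
# stop the proxy so it is not writing to the DB
service zabbix-proxy stop

# drop and recreate the proxy database
mysql -u root -p -e "DROP DATABASE zabbix_proxy; CREATE DATABASE zabbix_proxy CHARACTER SET utf8;"

# re-import the proxy schema (path varies by package/version)
zcat /usr/share/zabbix-proxy-mysql/schema.sql.gz | mysql -u root -p zabbix_proxy

service zabbix-proxy start
```

            With an SQLite-backed proxy it is even simpler: stop the proxy, delete the .db file, and start it again; the proxy recreates the schema itself.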
