Zabbix deployment scaling issues advise

  • ChenAvi
    Junior Member
    • Jan 2023
    • 7

    #1

    Hello,

    I would like some help tuning our Zabbix deployment: we are running into scaling issues that stop data from being synced on the server.
    Currently, as a short-term solution, I restart the server whenever it happens.

    In our Kubernetes cluster, we have 6 proxies between the agents and the server, and the DB is TimescaleDB (PostgreSQL-based).

    Our system information:
    [Image: image.png]

    Server graphs that depict the load on it (restart of the server marked in red):
    [Image: zabbix_server_graphs1.png]
    [Image: zabbix_server_graphs3.png]
    [Image: zabbix_server_graphs4.png]

    Additional changes to the image default values:

    For the Zabbix server:
    - name: ZBX_MEMORYLIMIT
      value: 516M
    - name: ZBX_POSTMAXSIZE
      value: 256M
    - name: ZBX_MAXEXECUTIONTIME
      value: "3000"
    - name: ZBX_CACHESIZE
      value: "6G"
    - name: ZBX_HISTORYCACHESIZE
      value: "2G"
    - name: ZBX_HISTORYINDEXCACHESIZE
      value: "1G"
    - name: ZBX_TRENDCACHESIZE
      value: "2G"
    - name: ZBX_VALUECACHESIZE
      value: "4G"
    - name: ZBX_STARTPOLLERS
      value: "40"
    - name: ZBX_STARTLLDPROCESSORS
      value: "20"

    For the Zabbix proxy:
    - name: ZBX_PROXYMODE
      value: "0"
    - name: ZBX_CONFIGFREQUENCY
      value: "600"
    - name: ZBX_STARTTRAPPERS
      value: "25"
    - name: ZBX_HISTORYINDEXCACHESIZE
      value: 64M
    - name: ZBX_STARTPREPROCESSORS
      value: "30"
    - name: ZBX_CACHESIZE
      value: "2G"


    Please let me know if other data is needed. Hopefully, I can get a lead on what needs to be tuned for better performance.

    Thanks,
    Chen
  • Markku
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
    • Sep 2018
    • 1782

    #2
    These are always interesting cases for learning. What is your Zabbix version?

    The documentation (https://www.zabbix.com/documentation.../zabbix_server) for StartDBSyncers says "default value (4) should be enough to handle up to 4000 NVPS". Have you tried increasing it to 5 to improve the history syncer situation?

    How is your database performing, do you have statistics for it? (I'm not a professional DBA but maybe someone can comment on those.)

    There is some contention with the unreachable pollers; try increasing their number moderately as well.

    How are your items actually configured? You have active proxies, yet your pollers (on the server) are still highly utilized. How does increasing the pollers even more (but moderately) affect your case?

    Markku
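
    A minimal sketch of the two suggestions above, written in the same env-list style as the first post. It assumes the server container image maps ZBX_STARTDBSYNCERS and ZBX_STARTPOLLERSUNREACHABLE to the StartDBSyncers and StartPollersUnreachable server parameters in the same way as the variables already listed; the value 3 for unreachable pollers is only an illustrative "moderate" increase, not a figure from this thread.

    ```yaml
    # Hypothetical additions to the Zabbix server container's env list.
    - name: ZBX_STARTDBSYNCERS
      value: "5"   # one above the default of 4, as suggested above
    - name: ZBX_STARTPOLLERSUNREACHABLE
      value: "3"   # illustrative value only; the default is 1
    ```

    Changing one value at a time and comparing the internal process graphs before and after makes it easier to see which change actually helped.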

    • medl
      Junior Member
      • Nov 2022
      • 9

      #3
      Poller utilization goes to 100%. My interpretation would be that there are either additional checks running or a latency or timing issue.
      What Timeout do you have configured?

      I observed that when Zabbix is using SNMP bulk requests and they fail (timeout or similar), Zabbix may fall back to standard SNMP requests.
      This puts a lot of pressure on the pollers, and after a restart it will do bulk requests again. (Observed behaviour in version 6.0.8.)
      Your problem sounds similar, but here I am just assuming that you have an SNMP workload. The unreachable poller spike fits into that picture.

      Once I had identified the unreliable devices, I deployed a special proxy for those so they don't clog up regular processing.
      Updating the SNMP information on an unreliable device also helps (sometimes). Standard requests also put a lot of pressure on the target devices, which in turn slows down processing even more.
      So far I have not found a way to force Zabbix to keep doing bulk requests.

      You can identify those devices by checking the Zabbix Queue.
      Side note: at the moment I suspect that my installation will not survive a larger network outage because of this.

      Back to your case: you said you have 6 proxies, but the graphs show that the server is also polling (and may be running into the issue I described).
      Personally, I'd recommend offloading as much as possible to the proxies; that keeps the server and frontend "clean and working" even if there are problems in processing.

      Do you also have utilization graphs from your proxies?

      • ChenAvi
        Junior Member
        • Jan 2023
        • 7

        #4
        Hi, thank you for your answers, I'll try to implement your suggestions.
        Our version is zabbix_server (Zabbix) 5.0.26.
        Regarding the DB, we haven't noticed any performance issues with it. If I'm not mistaken, the timeout we have is 30 seconds, and we don't have graphs for the proxies.

        Chen

        • cyber
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Dec 2006
          • 4807

          #5
          Considering the number of hosts (just below 2k) and an NVPS of 5117, you do A LOT of polling... And that shows... pollers are busy... config and history syncers are busy... But as suggested, offload ALL polling from the server. I would also recheck all the check intervals... Do you really need to check as often as you do? (And I don't even know how often you check.)
