Zabbix 7.4.x on Nutanix VMs

  • butlerm_kd
    Junior Member
    • Apr 2024
    • 2

    #1

    Zabbix 7.4.x on Nutanix VMs

    Does anyone have experience (good, bad or indifferent) running Zabbix at large scale (>15K hosts) on Nutanix [AHV] virtual machines?
    I'm interested in hearing your experiences for a possible customer deployment.
  • ZabbixSeeker
    Junior Member
    • Mar 2026
    • 9

    #2
    Yes, all of the above. But I only have time for a quick post here. It all depends on what you're doing, your resources (hardware and man hours) and what you're expecting of Zabbix.

    You need to design your architecture and implementation early, or you will have severe challenges.

    Good to know (might be relevant in your case):

    Scalability: Zabbix isn't engineered for scalability out of the box. Discovering the shortcomings in production, especially if the SLA is high, is a really bad position to be in. If you're using MySQL, implement partitioning from the start or pay the consequences (maybe even with your job). How this is "officially" set up with Zabbix is not beautiful architecture-wise or maintenance-wise, but it is probably fully necessary in your case. I think it's a set of scripts that run as a cron job, which you'll find on the Zabbix blog, and you might have to micromanage some tweaks to them. PostgreSQL might work much better in your case. Do your research here.
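
To make the partitioning point concrete, here is a minimal sketch of what such a cron-driven script boils down to. It is not the script from the Zabbix blog. It assumes the table (here `history`) is already partitioned by RANGE on its integer `clock` column, that one partition per day is wanted, and that day boundaries are taken in UTC. The function name and the partition naming scheme are made up for illustration. It only builds the SQL text and does not connect to a database.

```python
from datetime import date, datetime, timedelta, timezone


def daily_partition_ddl(table: str, first_day: date, days: int) -> list[str]:
    """Build ALTER TABLE statements that add one RANGE partition per day.

    Assumption: the table is already partitioned by RANGE (clock),
    where clock is a Unix timestamp.
    """
    statements = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        # Upper bound of the partition: midnight UTC at the start of the next day.
        next_midnight = datetime.combine(
            day + timedelta(days=1), datetime.min.time(), tzinfo=timezone.utc
        )
        boundary = int(next_midnight.timestamp())
        name = f"p{day:%Y_%m_%d}"  # hypothetical naming scheme
        statements.append(
            f"ALTER TABLE `{table}` ADD PARTITION "
            f"(PARTITION {name} VALUES LESS THAN ({boundary}));"
        )
    return statements


if __name__ == "__main__":
    for stmt in daily_partition_ddl("history", date(2026, 3, 1), 3):
        print(stmt)
```

A real job would also drop partitions older than the retention period and would need to cover every history and trends table, which is where the micromanagement comes in.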

    Multi-tenancy: Please forget about multi-tenancy or pay the price (I almost lost my job) - Zabbix is NOT that. I'd say that advertisement is false in all shapes and forms. You will discover multiple issues with that claim if you attempt to run any kind of multi-tenant system. You can isolate users and host groups on a superficial level, but the users need to be from the same company and trust each other, since users can access each other's networks through each other's proxies. There are multiple other issues. You've been warned. One customer, one Zabbix system (web and all).

    Version control: there is no version control or good support for it. Your templates can be very expensive (hours of work) and important, yet be permanently ruined by a single change. Exports/imports of templates have several gotchas and shortcomings, such as flattening template hierarchies, enabling disabled items, and similar surprises. Making backup copies of templates or developing a new version of a template will be problematic - old hosts will point at the backup/old template (good luck updating 15k hosts with all the gotchas that involves) and/or backed-up templates will be linked to live templates and vice versa. I can't explain here how we solved this to some degree, but it's about compromises and basically playing with fire (risks), with hard restrictions on our routines. Consider how important this is to you and how you want to deal with it. Database backup is a must.
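
One partial safety net, with all the export/import gotchas above still applying, is to dump templates to text regularly and keep the files in an external version control system. The sketch below only builds the JSON-RPC request body for the Zabbix API method `configuration.export`. It does not send anything. The helper name and the template ID are hypothetical, and how you authenticate and where you POST the body (normally the frontend's `api_jsonrpc.php`) depend on your setup and Zabbix version, so check the API docs for your release.

```python
import json


def build_template_export_request(template_ids: list[str], request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 body asking Zabbix to export the given templates as YAML.

    Nothing is sent here; authentication is left to the caller.
    """
    return {
        "jsonrpc": "2.0",
        "method": "configuration.export",
        "params": {
            "format": "yaml",
            "options": {"templates": template_ids},
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # "10001" is a placeholder template ID, not a real one from any system.
    body = build_template_export_request(["10001"])
    print(json.dumps(body, indent=2))
```

The exported text can then be committed to a repository on a schedule, so at least a readable diff and a history exist outside the Zabbix database.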

    Database design: this will be hard unless you're an expert in that area. Binary logs and error logs will explode in your face - don't use them unless you need them. Expect a lot of micromanagement. The "optimal" database settings aren't formally documented - but if you buy Zabbix support, they do have some documentation (that's exactly what it is, although advertised and claimed otherwise) that they'll provide you with. That said, we were given deprecated database settings for MySQL 8.4 that MySQL ignores, some settings where the MySQL 8.4 defaults were more appropriate, and settings that didn't necessarily do us any good. My suggestion is that you research and solve these things yourself and benchmark everything.

    If you need backups and database HA, but aren't an expert in database clusters, I'd suggest a single regular database node to avoid the various risks that come with so much database activity. Ironically, you might actually achieve less availability (and risk database corruption or data loss) if you do it wrong.

    Security: too much to unpack here. Do your research. Know that the default setup is exploitable in multiple ways through normal use, especially the Zabbix agent. Those claiming otherwise are unaware of the dangers. I do security for a living; rest assured that multiple risks within normal usage were found.

    Documentation and semantics: when you're working at this scale, and in a production environment, I assure you that you won't find everything you need in the official documentation. You'll have to experiment with the details. It's clear that English isn't the Zabbix project's first language, although the project is in English. If semantics are important for your company, you will hit some semantic inconsistencies and bad terminology.

    User interface: the backend is coded in C and the frontend in PHP, which to a degree is rendered for a hardcoded screen resolution, etc. This architecture and the current code base are frankly outdated for fast modern web dev, so don't expect any big changes soon. There are some hacks in the UI implementation, to be frank (state from different browser tabs affecting each other), and it lacks proper UX engineering, but it will get the job done in the end. There is no real dedicated mobile web support - but your phone can browse the page since it has a browser. There are no beautiful 3rd-party implementations, depending on your requirements. This might be an issue if your production goal is as large as you claim.

    What you want to do is test every important requirement that you have first, and then focus on scaling. It can be done, once you know what you're doing.

    You need to tune your server and proxy settings. Follow the "queue" statistics in the admin panel closely. If your server and proxy settings aren't optimal, you'll run into long delays in your monitoring data flow and alerts.

    Sorry that I can't get into the nitty-gritty technical details, since they depend a lot on what you're doing.

    So, my suggestion to you is: you need to spend a lot of time - you're not going to get it out of the box. Build for scale early.

    In the Zabbix server config, raise the "StartXX" process counts (StartSNMPPollers, etc.) for everything from 1, or whatever the default is, to maybe 50 in general.

    If you're just doing a ping every 5 minutes on 15k hosts, you can just run an out-of-the-box Docker Zabbix setup. Any more items, or polling more often than that, will overload your system.
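
To put rough numbers on that, the load is usually reasoned about as new values per second (NVPS). The sketch below is plain arithmetic: the 15k hosts and the 5-minute ping come from the paragraph above, while the second scenario (50 items per host every 60 seconds) is a made-up comparison and not a measurement from any real deployment.

```python
def new_values_per_second(hosts: int, items_per_host: int, interval_seconds: float) -> float:
    """Average number of values arriving per second for a uniform polling setup."""
    return hosts * items_per_host / interval_seconds


if __name__ == "__main__":
    # One ping item per host, polled every 5 minutes (300 s).
    ping_only = new_values_per_second(hosts=15_000, items_per_host=1, interval_seconds=300)
    # Hypothetical heavier profile: 50 items per host, polled every 60 s.
    heavier = new_values_per_second(hosts=15_000, items_per_host=50, interval_seconds=60)
    print(f"ping only: {ping_only:.0f} NVPS")
    print(f"heavier:   {heavier:.0f} NVPS")
```

The gap between the two results is why the item count and polling interval matter far more than the host count by itself.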

    You will need to use proxies, at least one, at this scale. If the proxy is running on a customer's hardware (not your presumably fast Nutanix hardware), you will have challenges if anything more than a ping every 5 minutes is checked at this scale.

    If your Nutanix hardware isn't running on SSDs primarily, you do really really really need to tweak your setup.

    Test, get used to and document a Zabbix upgrade process that suits your environment. Read the docs on this too.

    That's all I have time for now.

    Monitor your database disks - the database will grow to several terabytes, depending on your monitoring tweaks. Design your backup and restoration processes early. Even on fast hardware, backups and restorations take time. If you're running a database cluster, it's even more complicated.
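
For a first guess at disk growth, a back-of-the-envelope estimate like the one below can help before you benchmark. It assumes roughly 90 bytes per stored history value, which is only a ballpark figure that varies with the database engine and value type, and the 1,000 NVPS and 90 days of retention are hypothetical inputs. It ignores trends, events, indexes and binary logs, so treat the result as a lower bound.

```python
def history_bytes(nvps: int, retention_days: int, bytes_per_value: int = 90) -> int:
    """Rough size of stored history: values per day * days kept * bytes per value."""
    seconds_per_day = 24 * 3600
    return nvps * seconds_per_day * retention_days * bytes_per_value


if __name__ == "__main__":
    size = history_bytes(nvps=1_000, retention_days=90)
    print(f"{size / 1e12:.2f} TB of history (decimal terabytes)")
```

Measure the real growth on your own system as early as possible and replace the assumed bytes-per-value figure with what you observe.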
    Last edited by ZabbixSeeker; Today, 10:28.
