Hi all, I've been searching for examples and considerations for a deployment like this and, apart from technical recommendations about hardware rightsizing (which is also critical, for sure), I haven't found much information about how to approach the essential task of grouping and managing hosts and items at large scale. The context is a large number of stores (nearly 4,000) across 2 countries, each with one or more points of sale (POS) and a technical room containing the following components in a rack:
- 2 Router / Firewall
- UPS
- VMware ESXi with the following VMs:
  - BackOffice Site Server
  - IoT Control Server
  - Netboot Server
- Enterprise-grade switch
- Other store service controllers (also critical)
In addition, outside each store there are 1 to 3 self-service POS (totems) for night-time customers; these must run 24x7 in unattended mode, which makes remote monitoring critical since there is nobody in the store to call for support.
We've finished a successful MVP with a couple of stores and our lab monitored, but scaling from 30 devices to more than 10,000 is a big step. I'm worried about host and item naming and aggregation so that store problems stay easy to manage. Does anyone have experience with this kind of deployment? Our concerns so far are:
- Host Groups – We’ve been analyzing host groups and we don’t know whether it’s better to aggregate hosts by store or by device family (servers, POS, etc.). We have also defined tags to carry some additional store information, but as said, we don’t know if this is the best approach with such a large number of hosts (see the host registration sketch after this list).
- Host Naming – Names follow a mnemonic rule like [device role (router, POS, server, totem)]_[store numerical code], and we think that is enough since it is the DNS-resolvable name for most of them; we’re also adding some tags to provide additional information. How do you name your hosts?
- Geolocation – Stores will be geolocated, but host groups don’t support coordinates; it seems we must enter them host by host (perhaps using the API for mass updates – see the inventory sketch after this list).
- Host management – With a list of more than 10,000 hosts we will probably run into visualization problems. Any suggestions on how to manage this effectively? Maybe the key is not to use the UI for host management at all and rely on the API for host registration/modification (again, see the sketch after this list). That, however, opens a different problem, since the visualization issues are simply transferred to another platform.
- Templates, items and triggers management – We are keeping template content minimal (just the items that really need monitoring) and using long update intervals (200 to 300 seconds), but as with the previous point, the sheer number of items will probably make them nearly impossible to manage in the UI, and the API may not work well for this (or may make it very complex).
- Problems management – In a big crash scenario (caused by network issues in a region, for example), the flood of problems will likely overwhelm the UI monitoring panels. Is there a way to avoid that? We don’t know if there is something to identify a root cause and raise just one event per store (trigger dependencies? – see the dependency sketch after this list).
- Active vs. passive checks – It has been a long discussion, but active checks gave better performance for the POS (CPU average, memory average and filesystem % free items, plus some service availability checks) and for the virtualization server (VM performance, console errors, etc.). However, I’ve read in many posts that users prefer passive checks, and we’re not certain on this point; we’re probably missing something…
- Zabbix Proxy – The idea is to have one Zabbix proxy per store (a VM on each store’s ESXi host) to reduce traffic and Zabbix server congestion, but we don’t know whether we’ll need some additional middleware (a node?) between the store proxies and the central Zabbix server, as we’ll have up to 4,000 stores sending data (see the proxy assignment sketch after this list).
- SLAs – Since our support contract is based on SLAs, I would appreciate some advice on this topic. I’m sure Zabbix can track the services provided by the different host/item groups, but sometimes store managers turn off totems on their own (maintenance, checks, blackouts, power outages, etc.) and this can distort the SLA figures. I know you can manage calendar-based maintenance windows, but most of these events are unplanned. Any recommendations? (See the maintenance sketch after this list.)
- Hardware rightsizing – It’s obviously one of the biggest concerns, but there is more information about it and it’s discussed in many other posts on the forum; still, any advice that takes this context into account will be greatly appreciated (a rough NVPS estimate is sketched after this list).
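To make the host group / naming / API registration points more concrete, this is roughly the kind of bulk registration call we have in mind. It is only a sketch: the URL, API token, group/template IDs and store data are placeholders, and the Bearer-token style applies to recent Zabbix versions (older ones use the "auth" field instead):

```python
import requests

ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"  # placeholder endpoint
AUTH_TOKEN = "REPLACE_ME"                                  # placeholder API token

def api_call(method, params):
    """Minimal JSON-RPC wrapper for the Zabbix API."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": 1}
    headers = {"Authorization": f"Bearer {AUTH_TOKEN}"}
    r = requests.post(ZABBIX_URL, json=payload, headers=headers, timeout=30)
    r.raise_for_status()
    return r.json()

# Example store record; in practice this would come from our store inventory (assumed data).
stores = [{"code": "0412", "store_groupid": "101", "pos_ip": "10.4.12.10"}]

for s in stores:
    api_call("host.create", {
        "host": f"pos_{s['code']}",                   # naming rule: [device role]_[store code]
        "groups": [
            {"groupid": s["store_groupid"]},          # store-based group
            {"groupid": "200"},                       # device-family group, e.g. "POS" (placeholder ID)
        ],
        "tags": [
            {"tag": "store", "value": s["code"]},
            {"tag": "country", "value": "XX"},        # placeholder tag value
        ],
        "interfaces": [{
            "type": 1, "main": 1, "useip": 1,
            "ip": s["pos_ip"], "dns": "", "port": "10050",
        }],
        "templates": [{"templateid": "300"}],         # placeholder POS template ID
    })
```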
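For the geolocation point: host groups indeed carry no coordinates, but the host inventory fields location_lat / location_lon can be mass-populated via the API. A sketch, assuming a CSV export from our store database (stores.csv with store_code, lat, lon columns) and the same placeholder credentials as above:

```python
import csv
import requests

ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"   # placeholder
AUTH_TOKEN = "REPLACE_ME"                                    # placeholder

def api_call(method, params):
    # Same minimal JSON-RPC wrapper as in the registration sketch above.
    r = requests.post(ZABBIX_URL, json={"jsonrpc": "2.0", "method": method,
                                        "params": params, "id": 1},
                      headers={"Authorization": f"Bearer {AUTH_TOKEN}"}, timeout=30)
    r.raise_for_status()
    return r.json()

with open("stores.csv", newline="") as fh:              # assumed file: store_code,lat,lon
    for row in csv.DictReader(fh):
        # Find every host whose name ends with the store code ([role]_[store code] rule).
        hosts = api_call("host.get", {
            "output": ["hostid", "host"],
            "search": {"host": f"_{row['store_code']}"},
        })["result"]
        for h in hosts:
            api_call("host.update", {
                "hostid": h["hostid"],
                "inventory_mode": 0,                     # manual inventory
                "inventory": {
                    "location_lat": row["lat"],
                    "location_lon": row["lon"],
                },
            })
```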
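For the problems-management point, trigger dependencies look like the standard way to collapse a store outage into a single problem: the POS/totem/server triggers of a store depend on the "router unreachable" trigger, so only the root cause is shown. A sketch with made-up trigger IDs:

```python
import requests

ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"   # placeholder
AUTH_TOKEN = "REPLACE_ME"                                    # placeholder

def api_call(method, params):
    # Same minimal JSON-RPC wrapper as in the registration sketch above.
    r = requests.post(ZABBIX_URL, json={"jsonrpc": "2.0", "method": method,
                                        "params": params, "id": 1},
                      headers={"Authorization": f"Bearer {AUTH_TOKEN}"}, timeout=30)
    r.raise_for_status()
    return r.json()

router_unreachable = "14001"               # placeholder: "router_0412 unreachable" trigger ID
dependent_triggers = ["14050", "14051"]    # placeholder: POS / totem availability trigger IDs

for t in dependent_triggers:
    api_call("trigger.adddependencies",
             {"triggerid": t, "dependsOnTriggerid": router_unreachable})
```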
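For the proxy point, the per-store proxy would be created once and all of that store's hosts pointed at it. Assigning hosts to a proxy is a simple host.update; note that the property name changed between Zabbix versions (proxy_hostid before 7.0, proxyid from 7.0 on), so this sketch with placeholder IDs would need adapting:

```python
import requests

ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"   # placeholder
AUTH_TOKEN = "REPLACE_ME"                                    # placeholder

def api_call(method, params):
    # Same minimal JSON-RPC wrapper as in the registration sketch above.
    r = requests.post(ZABBIX_URL, json={"jsonrpc": "2.0", "method": method,
                                        "params": params, "id": 1},
                      headers={"Authorization": f"Bearer {AUTH_TOKEN}"}, timeout=30)
    r.raise_for_status()
    return r.json()

store_proxy_id = "400"                       # placeholder: proxy VM running on the store's ESXi
store_hosts = api_call("host.get", {
    "output": ["hostid", "host"],
    "search": {"host": "_0412"},             # all hosts of store 0412 (placeholder code)
})["result"]

for h in store_hosts:
    # Pre-7.0 property name; in Zabbix 7.0+ this becomes "proxyid".
    api_call("host.update", {"hostid": h["hostid"], "proxy_hostid": store_proxy_id})
```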
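For the SLA point, one option we're considering is to open a one-time maintenance window via the API whenever a store reports an unplanned totem shutdown, so the downtime is excluded from availability reports. A sketch (parameter names such as hostids vs. hosts differ between Zabbix versions, and the host ID and duration are placeholders):

```python
import time
import requests

ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"   # placeholder
AUTH_TOKEN = "REPLACE_ME"                                    # placeholder

def api_call(method, params):
    # Same minimal JSON-RPC wrapper as in the registration sketch above.
    r = requests.post(ZABBIX_URL, json={"jsonrpc": "2.0", "method": method,
                                        "params": params, "id": 1},
                      headers={"Authorization": f"Bearer {AUTH_TOKEN}"}, timeout=30)
    r.raise_for_status()
    return r.json()

now = int(time.time())
duration = 2 * 3600                                   # assumed 2-hour window

api_call("maintenance.create", {
    "name": f"totem_0412 manual shutdown {now}",      # placeholder naming
    "active_since": now,
    "active_till": now + duration,
    "hostids": ["10584"],                             # placeholder totem host ID (older API syntax)
    "timeperiods": [{
        "timeperiod_type": 0,                         # one-time-only period
        "start_date": now,
        "period": duration,
    }],
})
```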
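Finally, for the rightsizing point, this is the back-of-the-envelope arithmetic we're using to express the load in new values per second (NVPS), which is how Zabbix sizing is usually discussed; the item count per host is an assumption based on our minimal templates:

```python
hosts = 10_000          # rough device count from the post
items_per_host = 20     # assumption: minimal template content
update_interval = 300   # seconds, the upper end of our 200-300 s range

nvps = hosts * items_per_host / update_interval
print(f"Estimated load: {nvps:.0f} new values per second")
# ~667 NVPS with these assumptions; halving the interval or doubling the items
# doubles this figure, which is what drives server/proxy/database sizing.
```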
Thank you in advance for your recommendations!
As a host can belong to several groups, there is no obstacle here: keep it in a store-based group and also in a device-family-based group. If you have device-family-based groups, it might also be easier to manage the templates related to that family.