Facing >3500 locations deployment

  • Koldogut
    Junior Member
    • Jul 2023
    • 9

    #1

    Hi all, I've been searching for examples and considerations for a deployment like this, and apart from technical recommendations about HW rightsizing (which is also critical, for sure) I haven’t found much information about how to handle the essential grouping and management of hosts and items at large scale. The context is a large number of stores (nearly 4K) in 2 countries, each with one or more points of sale (POS) and a technical room with these components in a rack:
    • 2 Router / Firewall
    • UPS
    • VMware ESXi with the following VMs:
      • BackOffice Site Server
      • IoT control Server
      • Netboot Server
      • Enterprise grade Switch
    • Other store services controllers (also critical)

    In addition, outside each store there are 1 to 3 self-service POS (totems) for night-time customers, which must run 24x7 in unattended mode (this makes remote monitoring critical, as there is nobody in the store to call for support).

    We've finished a successful MVP with a couple of stores and our lab monitored, but scaling from 30 devices to >10,000 is a big step. I’m worried about item naming and aggregation, so that store problems stay easy to manage. Does anyone have experience with this kind of deployment? Our concerns so far are:
    • Host Groups – We’ve been analyzing host groups and we don’t know whether it’s better to group hosts by store or by family of devices (Servers, POSs, etc). We have also defined tags to include some additional store information but, as said, we don’t know if this is the best way with such a number of hosts.
    • Host Naming - Names follow a mnemonic rule like [device role (Router, POS, Server, Totem)]_[Store Numerical Code], and we think this is enough as it's the DNS name for most of them; we're also adding some tags to provide additional information. How do you name your hosts?
    • Geolocation - Stores will be geolocated but host groups don’t allow this, it seems we must enter coordinates host by host (using the API perhaps for mass updating).
    • Host management – With a list of more than 10K hosts we’ll probably have visualization problems. Any suggestions on how to manage them effectively? Maybe the key is not to use the UI for host management and to use the API for host registration / modification. However, that opens a different problem, since the visualization issues are transferred to another platform.
    • Templates, items and triggers management – We are very focused on minimal template content (just the items to be monitored) and long item update intervals (200 to 300 secs) but, as with the previous point, the number of items will probably make it nearly impossible to manage them in the UI, and the API may not work for this (or may make it very complex).
    • Problems management – In a large outage (caused by network issues in a region, for example), the number of problems raised will likely take down the UI monitoring panels. Is there a way to avoid that? We don’t know if there is something to identify a root cause and raise just one event per store (trigger dependencies?).
    • Active vs passive checks – It has been a long discussion, but active checks gave better performance for POS (CPU average, Mem average and file system % free items, in addition to some service availability checks) and the Virtualization Server (VM performance, console errors, etc). However, I’ve read in many posts that users prefer passive checks and we’re not certain on this point; we’re probably missing something…
    • Zabbix Proxy – The idea is to have one Zabbix Proxy per store (a VM on each store's ESXi) in order to reduce traffic and Zabbix server congestion, but we don’t know if we’ll need some additional middleware (node?) between the store proxies and the central Zabbix server, as we’ll have up to 4K stores sending data.
    • SLAs – Since we have a support contract based on SLAs, I would appreciate some advice on this topic. I’m sure Zabbix can keep track of the services provided by the different host / item groups, but sometimes store managers turn off totems for convenience (maintenance, checks, blackouts, power cuts, etc.) and this can distort the SLA. I also know you can manage calendar-based maintenance periods, but most of these shutdowns are unplanned. Any recommendations on this?
    • Hardware rightsizing – It’s obviously one of the biggest concerns, but there is more information about it and it’s discussed in many other posts in the forum; however, any advice taking this context into consideration will be greatly appreciated.
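
One way to keep store-based and device-family-based grouping consistent at this scale is to derive both from the host name itself. The following is only a minimal sketch, assuming the `[device role]_[store code]` rule from the Host Naming point; the regular expression, the `Stores/...` and `Devices/...` group names, the tag names and the `C1` country code are all hypothetical and not taken from this deployment.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for the "[device role]_[store numerical code]" rule.
HOST_NAME = re.compile(r"^(?P<role>[A-Za-z]+)_(?P<store>\d+)$")


class HostPlacement(NamedTuple):
    groups: list[str]            # host group names, nested with "/"
    tags: list[dict[str, str]]   # host tags as {"tag": ..., "value": ...}


def place_host(name: str, country: str) -> HostPlacement:
    """Derive host groups and tags from a host name such as 'POS_0421'."""
    match = HOST_NAME.match(name)
    if match is None:
        raise ValueError(f"host name does not follow the naming rule: {name!r}")
    role = match["role"].lower()
    store = match["store"]
    # One group per store and one per device family: a host can be in both.
    groups = [f"Stores/{country}/{store}", f"Devices/{role}"]
    tags = [
        {"tag": "store", "value": store},
        {"tag": "role", "value": role},
        {"tag": "country", "value": country},
    ]
    return HostPlacement(groups, tags)


if __name__ == "__main__":
    # "C1" is a placeholder country code.
    print(place_host("POS_0421", "C1"))
```

Because the groups and tags are computed rather than typed in, a host that breaks the naming rule is rejected before it ever reaches Zabbix.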
    Finally, it’s planned to have a monitoring team keeping an eye on the problems 24x7 and quick response is a must.

    Thank you in advance for your recommendations!
    Last edited by Koldogut; 12-10-2023, 13:36.
  • cyber
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4806

    #2
    Host groups... you can have it both ways. As a host can be in several groups, there is no obstacle to having it in a store-based group and also in a device-family-based group. If you have device-family-based groups, it might be easier to manage the templates related to that family.
    Proxies... You cannot have any middleware between proxy and server. It is how it is: the proxy communicates with the server and vice versa, nothing in between. Active proxies should be the way to go; then your server does not have to do the polling, the proxies send in the data by themselves, and it does not all happen at once, it will probably spread out over time.
    Active vs passive. Use active if possible; anything passive is kind of double work, as both ends have to work to obtain the value.
    No one will be able to enter 10k hosts manually, so your initial host creation should be either autoregistration or, if you have some kind of CMDB, syncing data from there over the API. That would also be the right time to insert the geolocation into the inventory, and maybe some other data about the locations too: addresses, contact info, etc. You can utilize it later in events, tag values, etc. (all those {INVENTORY.*} macros).
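
As a rough illustration of syncing from a CMDB over the API, the sketch below only builds the JSON-RPC body for a `host.create` call with the coordinates placed in the host inventory (`location_lat` / `location_lon`). The host name, IP address, group ID and coordinates are made-up values. Sending the request and authenticating are left out on purpose, since how the API token is passed depends on the Zabbix version.

```python
import json


def host_create_request(host, ip, group_ids, lat, lon, request_id=1):
    """Build a JSON-RPC body for host.create with geolocation in inventory."""
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": host,
            "interfaces": [
                {
                    "type": 1,       # 1 = Zabbix agent interface
                    "main": 1,       # default interface of this type
                    "useip": 1,      # connect by IP rather than DNS
                    "ip": ip,
                    "dns": "",
                    "port": "10050",
                }
            ],
            "groups": [{"groupid": gid} for gid in group_ids],
            "inventory_mode": 0,     # 0 = manual inventory
            "inventory": {
                # Inventory fields are short strings, so fix the precision.
                "location_lat": f"{lat:.6f}",
                "location_lon": f"{lon:.6f}",
            },
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # All values below are placeholders standing in for one CMDB row.
    body = host_create_request("Router_0421", "192.0.2.10", ["12"], 40.4168, -3.7038)
    print(json.dumps(body, indent=2))
```

The same inventory values are then available through the {INVENTORY.*} macros mentioned above.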

    Hardware... always a tricky question. Split it up into frontend/server/DB and let each part do its own work. It is also easier to adjust later if needed.
    just an example...

    [Attached image: image.png]
    Server(s): 4 CPUs, 12G mem. Current load stays below 1. Pretty much the same size of host for the proxies too, but I have 16 of them (hundreds of hosts per proxy), not 4k, so your proxy can probably be a Raspberry Pi if it only needs to keep an eye on 5-10 hosts.
    DB is probably the most important part: keep it fast, keep it well equipped. It's 16 CPUs and 128G mem here, and it's pretty busy.
    Frontend... depends on your users. It does not have to be anything very big; it is the DB again that answers those queries and has to perform.
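
For sizing, a back-of-the-envelope figure that helps is the number of new values per second the server and DB have to absorb. The sketch below is just that arithmetic: the host count (>10,000) and the 300 s interval come from the first post, while the 50 items per host is purely an assumption to be replaced with the real template size.

```python
def values_per_second(hosts: int, items_per_host: int, interval_s: float) -> float:
    """Average new values per second if every item uses the same interval."""
    return hosts * items_per_host / interval_s


# Whole deployment: ~10,000 hosts, assumed 50 items each, 300 s interval.
total = values_per_second(10_000, 50, 300)

# One store proxy watching 10 hosts under the same assumptions.
per_store = values_per_second(10, 50, 300)

if __name__ == "__main__":
    print(f"central server: about {total:.0f} values/s")
    print(f"single store proxy: about {per_store:.1f} values/s")
```

Under these assumptions the per-store load is tiny, and the central DB is the part that has to absorb the total.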


    • Koldogut
      Koldogut commented
      Thank you for your quick answer!
  • Koldogut
    Junior Member
    • Jul 2023
    • 9

    #3
    Initial creation will be done via the API for sure, but I fear the day I'll need to change one single router in a list of more than 10K hosts (not everything is in our CMDB yet). Have you tweaked the dashboards to avoid initial filtering or similar? I know users make their own changes to standard panels; is this a sustainable activity? Do these changes survive upgrades, etc.?

    • cyber
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Dec 2006
      • 4806

      #4
      I have not made any changes to the standard code. It makes updates horrible: you would need to reapply your own changes each time you upgrade. Possible, but I would avoid doing it as much as possible.

      Search exists, so finding the correct host to change is not an issue.

      • Koldogut
        Junior Member
        • Jul 2023
        • 9

        #5
        Thank you for your support!

        Do you have any experience with SLAs in Zabbix? Do you make any use of them?

        • Hamardaban
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • May 2019
          • 2713

          #6
          Tracking SLAs in Zabbix is a rather complicated and capricious thing.
          Everything is built on tracking the time during which the configured "services" have problems.
          All this works, but it requires careful configuration.

          https://www.zabbix.com/documentation...al/it_services
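
To make the "time with problems" idea concrete, here is a small sketch of the arithmetic, including windows that should not count against the SLA (for example a totem switched off by the store manager). It is not how Zabbix computes its SLA reports, just an illustration; the dates are invented, and outages and excluded windows are each assumed not to overlap among themselves.

```python
from datetime import datetime, timedelta

Interval = tuple[datetime, datetime]


def overlap(a: Interval, b: Interval) -> timedelta:
    """Length of the intersection of two intervals (zero if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(end - start, timedelta(0))


def availability(period: Interval, outages: list[Interval],
                 excluded: list[Interval]) -> float:
    """Percent of counted time without problems; excluded windows are
    removed from both the counted time and the downtime."""
    excluded_time = sum((overlap(period, e) for e in excluded), timedelta(0))
    counted_time = (period[1] - period[0]) - excluded_time
    downtime = timedelta(0)
    for outage in outages:
        clipped = (max(outage[0], period[0]), min(outage[1], period[1]))
        excused = sum((overlap(clipped, e) for e in excluded), timedelta(0))
        downtime += overlap(outage, period) - excused
    return 100.0 * ((counted_time - downtime) / counted_time)


# Invented example: a 30-day period with two outages, the second of which
# falls inside a window agreed to be excluded.
period = (datetime(2023, 10, 1), datetime(2023, 10, 31))
outages = [
    (datetime(2023, 10, 5, 0), datetime(2023, 10, 5, 4)),
    (datetime(2023, 10, 10, 10), datetime(2023, 10, 10, 12)),
]
excluded = [(datetime(2023, 10, 10, 9), datetime(2023, 10, 10, 13))]

if __name__ == "__main__":
    print(f"raw: {availability(period, outages, []):.3f}%")
    print(f"with exclusions: {availability(period, outages, excluded):.3f}%")
```

The hard part in practice is the unplanned shutdowns from the first post: somebody or something still has to record them as excluded windows, otherwise they count as downtime.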

          • Koldogut
            Junior Member
            • Jul 2023
            • 9

            #7
            Great! thank you very much for your support
