Large scale setup documentation

  • webdev.gk
    Junior Member
    • Apr 2018
    • 3

    #1

    Large scale setup documentation

    Hi Zabbix gurus. I am looking for any documentation on setting up Zabbix for a large-scale deployment, such as 5k CentOS cluster nodes, 1k Windows VMs and 500 Linux VM servers.

    I need documentation on how to set up the Zabbix frontend on a separate server, as well as the Zabbix server and database.

    Thanks a lot.
  • steeladept
    Member
    • Sep 2018
    • 69

    #2
    All the documentation you are asking for is right here, between the official Zabbix documentation, the forum posts, and the wiki. A note of caution: the number of machines is largely immaterial; sizing is really based on the number of monitors, or values per second, you need. The documentation provides information on how to ballpark that based on what you want to monitor, so I recommend you start there.
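A minimal sketch of that kind of ballpark, assuming each item contributes one value per update interval. The per-host item counts and the intervals are hypothetical placeholders, not figures from this thread; only the host counts come from the question above.

```python
# Rough NVPS (new values per second) estimate: each item contributes
# 1 / update_interval values per second. Item counts per host and the
# intervals below are hypothetical placeholders.

def estimate_nvps(item_groups):
    """item_groups: iterable of (item_count, update_interval_seconds)."""
    return sum(count / interval for count, interval in item_groups)

example_groups = [
    (5000 * 40, 60),   # 5k cluster nodes, assumed 40 items each, every 60 s
    (1000 * 30, 60),   # 1k Windows VMs, assumed 30 items each, every 60 s
    (500 * 30, 300),   # 500 Linux VMs, assumed 30 items each, every 5 min
]

nvps = estimate_nvps(example_groups)
print(f"Estimated load: {nvps:.1f} NVPS")
```

Changing the intervals in `example_groups` shows how strongly the polling frequency, rather than the host count, drives the result.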


    • Mocho
      Junior Member
      • Apr 2019
      • 3

      #3
      Hi,

      I'm looking at replicating my current monitoring system, which is reacting to >150000 nodes, on Zabbix.
      I'm running a stress test now, importing all the online nodes into a Zabbix VM to see how it performs.
      I've been tweaking as I go, but I'm almost capping my VM resources. I will probably migrate to a proper server, but so far I'm impressed with Zabbix's behavior.
      There's a lot of item trimming and grooming that I still need to do, but as a starting point I've got device/node type grouping, with specific templates per type as well, plus a few cloned templates where I need different types of SNMP data.

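      This is my current dashboard and growing fast. It will eventually grow to >150000 hosts (scary).

      System information
      Parameter                                            Value      Details
      Zabbix server is running                             Yes        192.168.56.10:10051
      Number of hosts (enabled/disabled/templates)         35328      35231 / 0 / 97
      Number of items (enabled/disabled/not supported)     3341156    3339811 / 0 / 1345
      Number of triggers (enabled/disabled [problem/ok])   1487115    1487115 / 0 [7357 / 1479758]
      Number of users (online)                             3          2
      Required server performance, new values per second   17294.41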
      I had to increase the PHP memory_limit (for obvious reasons) in zabbix.conf.
      I also increased different settings in zabbix_server.conf step by step (512MB at a time), whenever it would cry for more.
      I really need to trim down the amount of #### I'm polling (SNMP-wise).

      I'll try to follow up on this whenever I can, I'm basically trying to find a proper cooking recipe for my monitoring demands.

      Current user custom settings in zabbix_server.conf are:

      StartPollers=100
      StartPollersUnreachable=5
      StartPingers=100
      StartDiscoverers=10
      # Might have gone overboard with this one ^_^' The first time it screamed for more I gave it 4GB.
      CacheSize=4G
      StartDBSyncers=6
      HistoryCacheSize=512M
      HistoryIndexCacheSize=512M
      TrendCacheSize=512M
      ValueCacheSize=4G
      UnreachablePeriod=60
      UnavailableDelay=120
      UnreachableDelay=30

      My VM has 10GB of RAM and 4 cores and is in constant meltdown, but I'm pushing it.
      I've also developed a 2-way ticketing plugin which binds ticketid and eventid both ways (operation/action with ack, and recovery action), besides autonomous host import based on provisioning details and on parameter changes.
      Keep in mind, though, that this is not a proper architecture, since I've got everything on the same machine (Zabbix + MySQL on 1x CentOS server).
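As a rough check on why this VM might be struggling, the sketch below adds up the cache sizes listed above and compares them with the VM's memory. It assumes "10GB" means 10 GiB and that the configured caches are reserved in full; it ignores the pollers, pingers, PHP and MySQL that share the same machine.

```python
# Sum the cache settings quoted in the post and compare them with the VM's RAM.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}

def parse_size(value):
    """Convert a size such as '512M' or '4G' to bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(value[:-1]) * UNITS[suffix]
    return int(value)

# Values copied from the zabbix_server.conf settings in the post.
cache_settings = {
    "CacheSize": "4G",
    "HistoryCacheSize": "512M",
    "HistoryIndexCacheSize": "512M",
    "TrendCacheSize": "512M",
    "ValueCacheSize": "4G",
}

vm_ram_bytes = 10 * 1024**3  # "10GB" from the post, read here as GiB (assumption)
total_cache_bytes = sum(parse_size(v) for v in cache_settings.values())

print(f"Configured caches: {total_cache_bytes / 1024**3:.1f} GiB "
      f"of {vm_ram_bytes / 1024**3:.0f} GiB RAM")
```

Under these assumptions the caches alone come close to the whole VM, which would fit the swapping described later in the thread.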

      Let us know what and how you're cooking please!
      Thanks
      Best regards


      • steeladept
        Member
        • Sep 2018
        • 69

        #4
        Mocho - it is quite impressive that you got to that vps number without more issues. I am curious whether you have been having any database issues, or did you partition it as suggested elsewhere? I have to assume you partitioned it, but I'm just looking to confirm. Also, how many proxies are you running? I run 10 of them and am only pushing 2200 vps (though to be fair, they are there far more to continue monitoring in the event of a site-to-site network outage than to offload work from the application server). I am also curious whether you would run into far more issues if you started doing more active monitoring using Zabbix Agents.

        As for my configuration, it is much more modest:
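        Parameter                                            Value      Details
        Zabbix server is running                             Yes        localhost:10051
        Number of hosts (enabled/disabled/templates)         1030       946 / 0 / 84
        Number of items (enabled/disabled/not supported)     137538     118771 / 37 / 18730
        Number of triggers (enabled/disabled [problem/ok])   68555      63919 / 4636 [2054 / 61865]
        Number of users (online)                             23         2
        Required server performance, new values per second   2147.13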
        This is a production environment using 100% Zabbix Agents for server monitoring, though we will be bringing in SNMP devices eventually for network monitoring as well (mostly for event correlation).

        I have this broken out across 10 proxies, as I already stated, with a separate MySQL server and the application and frontend servers both running on the same box. These boxes are intentionally small and distributed, to take advantage of the VMware performance profile. However, even at that small size, the only issues I have run across in my environment were cache sizes that were not big enough - the box configurations were more than fine. Once I configured my cache sizes and start processes, things started running smoothly.
        Last edited by steeladept; 11-04-2019, 15:09. Reason: Edited to add my configuration as requested


        • Mocho
          Junior Member
          • Apr 2019
          • 3

          #5
          Hi steeladept - Thanks for your feedback, much appreciated.
          As expected, even after increasing the memory_limit (adding hardware memory as well) and a couple of extra cores, I hit the ceiling, and swap is killing performance and everything else.
          But again, this was a stress test, my first time installing and testing Zabbix.
          I have too many items (SNMP), so I'm downsizing to just enough to cover each exotic monitoring demand (which will still include both SNMP and ICMP), but it was fun to see it grow and start to melt down.
          I'm also trying to figure out what will be the best combination of templates/applications/items for our purposes.
          When I'm able to replicate a small-scale setup that covers all the needs, I'll migrate this to proper staging hardware virtualization and also start looking into partitioning the monitoring, maybe by country, not sure yet, and start importing more nodes.
          I still need to go through the agent and proxy documentation.
          I'm curious whether I'll be able to have a single type of agent for all flavours without extra config needs per site.
          The main issue I see with the proxies is that they will require a lot of hardware and instances to cover all the nodes.
          I have to cover a load of SNMP data for around 155K hosts and growing.
          I'll try to keep this thread running as I move forward and whenever possible.
          Thanks again!
          Best regards



          • warp10
            Member
            • Apr 2019
            • 39

            #6
            Hi Mocho, how many templates do you use for each server you monitor?


            • Jason
              Senior Member
              • Nov 2007
              • 430

              #7
              You will almost certainly need to design your own templates, or at the very least heavily customise copies of the ones provided.

              It pays to put the time into examining carefully what you're monitoring from each template and the frequency with which you monitor those items. If you don't need to monitor something, then don't; and for the items that you do monitor, ask how quickly you need to know about any problems.


              • steeladept
                Member
                • Sep 2018
                • 69

                #8
                I would agree with Jason. I started with basic templates but I then recreated them, heavily modified to meet our needs, and use those. Currently I use only 2 or 3 templates per machine, but they are nested, so a template contains a template type of thing. This has caused me minor troubles in the past, as sometimes I don't want to monitor the included template on a specific machine, but breaking it out is a pain, so I don't suggest doing it the way I did. You can add all your templates separately to each machine, if you want, and I learned that is usually the better way to go.


                • Mocho
                  Junior Member
                  • Apr 2019
                  • 3

                  #9
                  Hi warp10, steeladept, Jason

                  That was a stress test (my first Zabbix setup test).

                  I've since moved away from the previous VM and migrated to a proper PVS environment, separating DB and APP, with no Zabbix proxies.
                  I also cleaned out the load of items that were overhead for our purposes.

                  Right now I've got grouping per device type, as well as 1 template per device type with 2 applications (ICMP and SNMP). The template with the most items is for access controllers, with 11 items; all the others are downsized to 1 or 2 metrics, plus the Zabbix self-monitoring ones.
                  That comes down to 171854 items for a total of 149747 nodes/hosts.
                  It is very stable right now, even with per-node actions, including recovery operations and trigger correlations. Automation is working fine both ways, populating the eventid in JSD and the ticketID in Zabbix.
                  I also managed to automate the node imports (create, update, delete) through the Zabbix API, adding crucial info in the tags (licensing, devtype, etc.), and integrated it with the licensing and provisioning APIs.
                  Very happy with the results so far. I also ran a stress test on this environment, blocking all access to simulate a major outage, and Zabbix didn't crash, although it sent a huge load of tickets to a Jira staging environment.
                  All custom automations and external scripting were made with Node.js (maybe not the best option, but the one with which I could build a proof of concept fastest, including integrating with all the different APIs plus the Zabbix API).
                  [Image: zbx.JPG]

                  The approach was as steeladept and Jason mentioned. I started with Zabbix's out-of-the-box templates, then started downsizing them, which was a good exercise.
                  In the end I made my own custom templates, which for now are more than enough to cover all the current business rules and alerting needs.

                  Many thanks
                  Best regards



                  • Mocho
                    Mocho commented
                    PS - All the info I needed could be found in the documentation; I spent a couple of weeks reading and going through trial and error.
                    The API has a few headaches, like getting interface IDs: to update the IP address on a host, you first need to get the host ID, etc. That type of gimmick. But once you get past that, it's all pretty straightforward and documented.
                    So far so good, no brick walls.
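The host ID to interface ID to update chain mentioned above can be sketched roughly as follows. This is an illustration, not Mocho's Node.js code: `API_URL`, `rpc_payload`, `http_call` and `update_host_ip` are hypothetical names, and passing the token in an `auth` field matches older Zabbix releases (newer ones prefer an Authorization header). The transport is passed in as `call`, so the chain can be tried against a stub before pointing it at a real server.

```python
import json
import urllib.request

API_URL = "http://zabbix.example.com/api_jsonrpc.php"  # hypothetical URL

def rpc_payload(method, params, auth=None, request_id=1):
    """Build a JSON-RPC 2.0 request body for the Zabbix API."""
    payload = {"jsonrpc": "2.0", "method": method,
               "params": params, "id": request_id}
    if auth is not None:
        payload["auth"] = auth  # assumption: token passed in the body
    return payload

def http_call(method, params, auth=None):
    """Send one request over HTTP and return its 'result' field."""
    body = json.dumps(rpc_payload(method, params, auth)).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=body,
        headers={"Content-Type": "application/json-rpc"})
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]

def update_host_ip(call, auth, host_name, new_ip):
    """Resolve host name -> hostid -> interfaceid, then update the IP."""
    hosts = call("host.get",
                 {"filter": {"host": [host_name]}, "output": ["hostid"]},
                 auth)
    if not hosts:
        raise LookupError(f"host not found: {host_name}")
    interfaces = call("hostinterface.get",
                      {"hostids": hosts[0]["hostid"],
                       "output": ["interfaceid", "ip"]},
                      auth)
    updated = []
    for interface in interfaces:
        # This sketch updates every interface on the host; a real script
        # would probably filter by interface type first.
        call("hostinterface.update",
             {"interfaceid": interface["interfaceid"], "ip": new_ip},
             auth)
        updated.append(interface["interfaceid"])
    return updated
```

With a real token, `update_host_ip(http_call, token, "node-1", "10.0.0.2")` would issue the three calls in order; the host name and address here are made up.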
                • kloczek
                  Senior Member
                  • Jun 2006
                  • 1771

                  #10
                  170k items is not large-scale monitoring today. Currently it is the bottom of the mid scale.
                  Combined with only 500-600 NVPS, I would say that it is even the bottom of the small scale.
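To put those two numbers side by side, here is a small worked calculation of the average update interval they imply. It assumes all 171854 items from post #9 are enabled and supported, which the thread does not confirm.

```python
def average_interval(items, nvps):
    """Average seconds between values per item, given a total NVPS."""
    return items / nvps

items = 171854  # item count from post #9
for nvps in (500, 600):  # NVPS range quoted above
    interval = average_interval(items, nvps)
    print(f"{nvps} NVPS -> about {interval:.0f} s between values per item")
```

In other words, on average each item would be collected roughly once every five to six minutes, which helps explain why the load is modest despite the host count.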

                  Additionally, the graph shows housekeeper activity (yellow line), which means that you are not using partitioned history* and trends* tables. If you want to improve your monitoring stack, start from that point.
                  However, with 500-600 NVPS the housekeeper should still be fine.
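For readers who have not partitioned these tables before, the sketch below generates the kind of MySQL statement usually involved: one range partition per day on the `clock` column. It is only an illustration under assumptions (daily partitions, UTC day boundaries, the hypothetical helper `daily_partitions`), not kloczek's own procedure, and it only prints the SQL rather than running it.

```python
from datetime import date, datetime, timedelta, timezone

def daily_partitions(table, first_day, days):
    """Return an ALTER TABLE statement with one partition per day,
    ranged on the integer `clock` column (Unix time, UTC boundaries)."""
    parts = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        upper = day + timedelta(days=1)
        upper_ts = int(datetime(upper.year, upper.month, upper.day,
                                tzinfo=timezone.utc).timestamp())
        parts.append(
            f"  PARTITION p{day:%Y_%m_%d} VALUES LESS THAN ({upper_ts})")
    return (f"ALTER TABLE {table} PARTITION BY RANGE (clock) (\n"
            + ",\n".join(parts) + "\n);")

# Example: three daily partitions for the history table.
print(daily_partitions("history", date(2019, 4, 11), 3))
```

The usual idea is that old data is then removed by dropping whole partitions instead of having the housekeeper delete rows; test any such change on a copy of the database first.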
                  http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                  https://kloczek.wordpress.com/
                  zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                  My zabbix templates https://github.com/kloczek/zabbix-templates


                  • ida.djurhuus@regionh.dk
                    Junior Member
                    • Dec 2023
                    • 5

                    #11
                    Interesting post to follow. I'm wondering if you have had any experience with API performance yet?
