Best practices of monitoring in medium/large environments

  • mushero
    Senior Member
    • May 2010
    • 101

    #16
    Slowly working through large-scale items

    How we use Zabbix:

    We have a few hundred servers and 25,000 items and 7500 triggers, 60 updates/second, and 17GB of data, for about 100 customers/groups of very diverse Linux servers, and we expect 10X growth in the next year, so this is of strong interest to us.

    Numerous engineers modify our system every day, as we are always adding hosts, triggers, new and custom items backed by scripts, etc. So we are always looking at how we've broken the system and caused issues on current hosts when adding new ones, as we are nearly 100% template-driven.

    The new option to not have ALL on drop downs is a HUGE HELP as loading pages with ALL set was killing us. But would like to have All option for some things (i.e. should not go away when Not Selected is default; we want None Selected but All available).

    We will soon have major group issues as we have 50-100 now and will later need to customize to enter a group number or something as the drop downs will be too long, hard to use. Not sure how to approach this as the number of groups, templates, and screens will grow > 100 and then > 1000.

    We have customized a few things, with more coming, mostly to improve usability for our NOC and 24x7 staff. In particular, we have a big red flag on the dashboard for Unacknowledged events, so you can see them across the room.

    We use the dashboard, with the custom flag (above) and sounds to warn of new alerts (need this feature for everyone). We are heavy graph and screen users, plus slideshows for critical systems, and we use Latest Data a lot. Monitoring Overview and Triggers are useless to us; we'd like the event history to be easier to use, and better click-through links from the dashboard, as it's hard to get what we want on diverse alerts.

    We are heavy ACK users, with 5-10 per alert - our 24x7 team ACKs on each step they take so everyone can see the alert status; we wish this was a bit easier to see/summarize, but it's decent.

    We have a test ACK system that generates fake alerts randomly for our support team to ACK, to make sure they are paying attention overnight; we have a complex SQL report to tell us how long it took to ACK, and how long the alert lasted (tough in SQL) in case it was very short.

    We use email alerts for most things, plus some SMS, but not much escalation yet; it's not that easy and hard to test.

    We use lots of SQL for special reports, which we hope to share soon. These look for disabled hosts that shouldn't be (using a new table/field to track who approved the disable, and until when), mismatched template vs. host items, mismatched intervals, missing / wrong URLs on triggers (we use these to link to our wiki, critical to us), missing profile data that we use in URLs, etc. Happy to share all.

    Built-in reports are useless as far as we can tell; we'd love to be able to add our own in some simple php config system, i.e. add SQL and arguments, get results.

    The DB scaling improvements in 1.8 dropped our I/O 10X or more.

    We do not use proxies yet, but will in some cases.

    We do not use maps; we'd love to, but no time to build them for dozens of different installations/systems.

    We do not use data pushing from the server as we don't like the security of an agent connecting to our server; we may route through a proxy at some point.

    We heavily use custom scripts on agents, though trying to do more in the agent config if we can control the time-outs.

    Our customers also use the system to see their hosts which has worked well so far. The new versions allow graphs to be on templates which has made this much simpler to manage.

    We are investigating the best way to have HA - probably replication to a standby server in another city, which will start checks if the main server dies.

    Waiting to hear more from this guy doing 3,000 hosts:


    We are happy to share anything we are doing - we have lots of people working on/in Zabbix every day, custom reports, some UI changes, and are thinking about our own agent patches to get it to monitor a lot more things.

    Comment

    • tchjts1
      Senior Member
      • May 2008
      • 1605

      #17
      You have some excellent observations, several of which parallel mine.

      Reporting is becoming a major pain for us. Like you, we need to be able to report on when an alert happened, what the duration was, when it was ACK'd and when it was resolved. I have been in discussion with Alexei on this. We also need the ability to export reports into Excel.

      We currently monitor about 200 hosts of varied platforms - Win, AIX, Linux, HP-UX and Solaris. In the next month we will quickly deploy to an additional 1,500 hosts. After that, we are looking at an additional 2,700 hosts, so we will be near 5,000 when all is said and done. We have a standalone Zabbix App, DB and 12 proxies located globally.

      I have also discussed with Alexei the ability to have sub-groups, because looking at a list of 5,000 hosts will be impossible. Same goes for the screens area. There has to be a way to group screens. Per platform, per application, per region, per individual Zabbix user, etc.

      The ability to clone screens now in 1.8 is fantastic, as is the ability to drag and drop graphs on the screens. We have also tasked Zabbix with providing functionality for an individual user to be able to create their own screens, presumably from the "Latest Data" section. If they see a graph they want on a screen, there will be some type of "add to screen" functionality. This will allow users to create screens of graphs that are valuable to them and take that burden off the Zabbix Admin. This will go hand-in-hand with screen groups. Zabbix Admin will publish the "official" screens, everyone else will have their own group for their screens with the ability for the Zabbix Admin to publish individual user screens to the Official Screens list in case of a good one.

      Zabbix recently created a script for me to do bulk loading of hosts, which totally rocked and saved me tons of time from manual entries. If anyone is interested, I have their permission to share it.

      Working with the Zabbix Devs and Alexei is a rewarding experience. We have the Platinum support package and I take advantage of that fact, and make it work for me. The Zabbix staff is very professional, and a pleasure to work with in problem resolution, guidance and collaborative efforts. In particular, Alexei, Igor and Rich have been nothing short of super.

      Comment

      • mushero
        Senior Member
        • May 2010
        • 101

        #18
        Be interested in scripts, hope to buy support soon

        Glad we're on the same wavelength, and we'd love to get the bulk loading scripts as we start loading 100 hosts at a time. We look forward to using proxies to do intermediate work, but prefer centralized if we can, though as we spread around the world (and even within China which is like a mini-world) we'll have to de-centralize. Will then wish we had global management of proxies, etc. (maybe already have that in some way; a project for next year).

        We're also happy to share all our stuff, including SQL for ACKs, event duration, etc. and GUI changes to see UnACK alerts, sounds on Dashboard for new alerts, etc. We have a big push next month on this, for all the safety reports, probably in a UI form so you can run them inside the Zabbix UI.

        Can send private msg or Steve.Mushero -at- ChinaNetCloud - pointy dot - com

        We also hope to get on support this summer, as we hear lots of good things about the team and want to be both supportive and supported. Being in inexpensive China, we may hire a part-time developer just to do patches for us, mostly on the agent to add more Linux keys and on the GUI to add reports/options; all to donate back.

        Comment

        • MikeBreton
          Junior Member
          • Nov 2009
          • 19

          #19
          ERD/Reporting

          Noticed you mentioned that you do Custom Reporting. Are you using a particular product? Did you develop your own ERDs? Trying to do much the same but having a hard time getting started.

          mab

          Comment

          • mushero
            Senior Member
            • May 2010
            • 101

            #20
            Have Partial ERD, DB details

            We have a partial ERD for key elements like items, triggers, etc. The relationship between triggers and their hosts / templates is a challenge as it's double routed (forward/backward) through functions, items, and hosts.

            I was just thinking yesterday we should update a full ERD and share it, along with our general info on the key tables and queries. If you send me a message I'll send you what we have now.

            I'm actually surprised the main docs don't have a DB table / field reference and ER diagram; would be hugely helpful for sophisticated users.

            One thing we do a lot is clone a templated item or trigger so we can change the intervals or thresholds, but then they become de-synched with their old parents - I have very nasty queries to find these by name - the trigger query is 24 lines long and only works for simple triggers and other constraints.

            And yes, we know we can use host macros in 1.8 but we'd have to do it on all hosts to use it on even one, which is quite painful to implement.

            Comment

            • danrog
              Senior Member
              • Sep 2009
              • 164

              #21
              mushero, in 1.8.x you can create a global macro, modify your triggers/items to use that macro name and then only add custom macros to the hosts that need it. No need to use macros on all your hosts. It works great for us; we use it for all our disk, CPU, mem, win services and other base OS triggers we want to be able to change if needed.

              P.S. Could you share your reports with me

              Comment

              • MikeBreton
                Junior Member
                • Nov 2009
                • 19

                #22
                Groups of Groups

                One option would be to allow groups of groups recursively. Much the same way Job Hierarchies/Trees are stored in relational D/Bs. There is a parent/child relationship that can reference another such relationship. I guess this could get messy, but it is a thought.
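A minimal sketch of the recursive parent/child idea, assuming groups are stored as an adjacency list. The names `children`, `members`, and `hosts_under` are hypothetical and do not correspond to any Zabbix table or API; the point is only that a hierarchy can be walked recursively, with a guard against the "messy" case of a group that ends up containing itself.

```python
def hosts_under(group, children, members, _seen=None):
    """Return every host in `group` and in all of its nested sub-groups.

    children: dict mapping a group name to a list of its direct sub-groups
    members:  dict mapping a group name to a list of hosts directly in it
    """
    seen = set() if _seen is None else _seen
    if group in seen:
        # Already visited: a cycle or a shared sub-group, so stop here.
        return set()
    seen.add(group)

    hosts = set(members.get(group, ()))
    for child in children.get(group, ()):
        hosts |= hosts_under(child, children, members, seen)
    return hosts


# Hypothetical hierarchy: Asia -> (China -> Shanghai), Japan
children = {"Asia": ["China", "Japan"], "China": ["Shanghai"]}
members = {"Asia": ["gw1"], "Shanghai": ["web1", "db1"], "Japan": ["web2"]}

all_asia = hosts_under("Asia", children, members)
```

With a structure like this, a drop-down would only need to list the top-level groups and expand children on demand.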

                MikeBreton


                Comment

                • mushero
                  Senior Member
                  • May 2010
                  • 101

                  #23
                  Confused on how to do that, at least the way we do it - almost all our triggers are in templates, so if I have a Linux template for 500 hosts that alerts if load_average > 5, then today, for any host where I want a different value, I have to clone the trigger.

                  I think the only way to use a host macro is to change the template macro to use load_average > {host_macro_name} but then that macro has to be on all 500 hosts; if not, the trigger is useless. So to use any host macro I have to do it on all affected hosts or none; I can do this, especially in the DB by adding a macro to all hosts with that template applied.

                  This is the only way I can see to do it - basically we use a template-wide setting until we have to break out too many servers, then we have to do this - with 200 items per server with 100 triggers this is a challenge, though I have no brilliant ideas other than the 'macro' being from the template unless I define it at the host level; that type of thing would be very useful, like items work now.

                  P.S. I'll post a PDF of what I have for the DB & Reports separately.

                  Originally posted by danrog
                  mushero, in 1.8.x you can create a global macro, modify your triggers/items to use that macro name and then only add custom macros to the hosts that need it. No need to use macros on all your hosts. It works great for us; we use it for all our disk, CPU, mem, win services and other base OS triggers we want to be able to change if needed.

                  P.S. Could you share your reports with me

                  Comment

                  • mushero
                    Senior Member
                    • May 2010
                    • 101

                    #24
                    DB Info & Reports

                    I've attached/uploaded an initial PDF of what we have on the DB and reports.

                    This has some DB info on hosts, items, triggers, functions, etc. and valid values, and a lot of queries on how to join all that together to get useful data, checks, and info from the system. Several of these have very specific uses that may not be obvious, so feel free to ask - we especially like our big query to tell how long it took the 24x7 staff to ACK something, that also shows total event duration in case it was too short to ACK. Event duration is very painful to get in Zabbix.
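As a rough illustration of the time-to-ACK and event-duration report described above, here is a sketch in plain Python rather than SQL. It assumes a simplified, hypothetical data model: a list of events with a trigger id, a Unix timestamp, and a value (1 = problem, 0 = ok), plus a mapping from problem event id to the time of its first ACK. This is not the query from the PDF and not the real Zabbix schema.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: int
    trigger_id: int
    clock: int   # Unix timestamp of the event
    value: int   # 1 = problem, 0 = ok


def alert_report(events, first_ack_clock):
    """Pair each problem event with the next ok event on the same trigger.

    first_ack_clock: dict mapping a problem event_id to the timestamp of
    its first ACK (missing key = never ACKed).
    Returns one row per resolved problem, in order of resolution.
    """
    rows = []
    open_problem = {}  # trigger_id -> problem Event still waiting for an ok
    for ev in sorted(events, key=lambda e: (e.clock, e.event_id)):
        if ev.value == 1:
            # Keep only the first problem event while the trigger stays down.
            open_problem.setdefault(ev.trigger_id, ev)
        elif ev.value == 0 and ev.trigger_id in open_problem:
            problem = open_problem.pop(ev.trigger_id)
            ack = first_ack_clock.get(problem.event_id)
            rows.append({
                "event_id": problem.event_id,
                "duration": ev.clock - problem.clock,
                "time_to_ack": None if ack is None else ack - problem.clock,
            })
    return rows


# Hypothetical data: one alert ACKed after 120 s, one too short to ACK.
events = [
    Event(1, 100, 1000, 1), Event(2, 100, 1300, 0),
    Event(3, 200, 2000, 1), Event(4, 200, 2020, 0),
]
report = alert_report(events, first_ack_clock={1: 1120})
```

The second row has a duration of 20 seconds and no ACK time, which is the "too short to ACK" case the report is meant to expose.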

                    We have a lot more in progress for safety checks, etc. I'm happy to help on something if I have time as we increase our library.

                    Warning - this doc is from our Wiki so it's not very well-formatted in the PDF process; I'll improve it in future versions. Let me know if you can't paste from the PDF.
                    Attached Files

                    Comment

                    • richlv
                      Senior Member
                      Zabbix Certified Trainer
                      Zabbix Certified Specialist
                      Zabbix Certified Professional
                      • Oct 2005
                      • 3112

                      #25
                      Originally posted by mushero
                      I think the only way to use a host macro is to change the template macro to use load_average > {host_macro_name} but then that macro has to be on all 500 hosts; if not, the trigger is useless.
                      that would be lame, wouldn't it ? fortunately, that's not what you have to do

                      you can set global macro, let's say, {$CPU_LOAD} to be 5. then you modify your trigger expression to use that macro instead of hardcoded value "5". and that's it, after these changes it still works as before.
                      but the benefit - now you can override the value of this macro either on host, or template level. so setting {$CPU_LOAD} to 10 for some host will leave all other hosts at 5, and this one will be at 10. again, you can also override it on template level. host takes precedence over template level, and both take precedence over global macro.
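The precedence described here (host over template, template over global) can be sketched as a simple lookup. This is only an illustration of the rule as stated in this post; the function name `resolve_macro` and the dictionaries are hypothetical, and it ignores details such as multiple linked templates.

```python
def resolve_macro(name, host_macros, template_macros, global_macros):
    """Return the effective value of a user macro for one host.

    Lookup order follows the rule above: host level first, then
    template level, then the global macro. Returns None if undefined.
    """
    for scope in (host_macros, template_macros, global_macros):
        if name in scope:
            return scope[name]
    return None


global_macros = {"{$CPU_LOAD}": "5"}

# A host with no override falls through to the global value.
default_value = resolve_macro("{$CPU_LOAD}", {}, {}, global_macros)

# A template-level override applies to every host linked to that template.
template_value = resolve_macro("{$CPU_LOAD}", {}, {"{$CPU_LOAD}": "8"}, global_macros)

# A host-level override wins over both template and global.
host_value = resolve_macro(
    "{$CPU_LOAD}", {"{$CPU_LOAD}": "10"}, {"{$CPU_LOAD}": "8"}, global_macros
)
```

So a trigger written against {$CPU_LOAD} keeps working for all 500 hosts from the single global value, and only the exceptions need their own entry.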
                      Zabbix 3.0 Network Monitoring book

                      Comment

                      • richlv
                        Senior Member
                        Zabbix Certified Trainer
                        Zabbix Certified Specialist
                        Zabbix Certified Professional
                        • Oct 2005
                        • 3112

                        #26
                        Originally posted by mushero
                        I've attached/uploaded an initial PDF of what we have on the DB and reports.
                        i noticed in the pdf that your itemtype list is not complete - see the helpful comment by nelsonab at http://www.zabbix.com/documentation/...i/objects/item

                        trigger type - 0 for normal, 1 for multiple problem events.
                        trigger value - 0 ok, 1 problem, 2 unknown

                        Comment

                        • mushero
                          Senior Member
                          • May 2010
                          • 101

                          #27
                          Well, look at that - I'm clearly an idiot for not looking at the template level - you have made my day ! Now I can see how glorious this can be for us since we run hundreds of very heterogeneous systems for Chinese site/games where many groups of hosts need very different values; we can finally really set useful triggers on a lot of things we've avoided. Happy Day.

                          Originally posted by richlv
                          that would be lame, wouldn't it ? fortunately, that's not what you have to do

                          you can set global macro, let's say, {$CPU_LOAD} to be 5. then you modify your trigger expression to use that macro instead of hardcoded value "5". and that's it, after these changes it still works as before.
                          but the benefit - now you can override the value of this macro either on host, or template level. so setting {$CPU_LOAD} to 10 for some host will leave all other hosts at 5, and this one will be at 10. again, you can also override it on template level. host takes precedence over template level, and both take precedence over global macro.

                          Comment

                          • mushero
                            Senior Member
                            • May 2010
                            • 101

                            #28
                            Excellent and thanks - and I am working my way through your book, looking for ideas and insights.

                            And any idea on the Trigger error column and how it's used ?

                            And what is a 'multiple problem event' ? I'm a little fuzzy on multi-host triggers and other advanced trigger tricks.

                            Originally posted by richlv
                            i noticed in the pdf that your itemtype list is not complete - see the helpful comment by nelsonab at http://www.zabbix.com/documentation/...i/objects/item

                            trigger type - 0 for normal, 1 for multiple problem events.
                            trigger value - 0 ok, 1 problem, 2 unknown

                            Comment

                            • richlv
                              Senior Member
                              Zabbix Certified Trainer
                              Zabbix Certified Specialist
                              Zabbix Certified Professional
                              • Oct 2005
                              • 3112

                              #29
                              Originally posted by mushero
                              Excellent and thanks - and I am working my way through your book, looking for ideas and insights.

                              And any idea on the Trigger error column and how it's used ?
                              i believe it is supposed to contain error messages when evaluating the trigger - for example, if one of items used in the trigger expression is missing the data etc
                              Originally posted by mushero
                              And what is a 'multiple problem event' ? I'm a little fuzzy on multi-host triggers and other advanced trigger tricks.
                              see trigger properties in frontend - there's a dropdown for that. multiple problem events will be generated whenever trigger evaluates to the problem condition (so while load is > 5 you would get a new event upon each new value, not just once when that happens for the first time)
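To make the difference concrete, here is a small simulation of the behaviour described above. It is a simplified model based only on this description, not Zabbix's actual implementation; `generate_events` and its parameters are hypothetical names.

```python
def generate_events(values, threshold, multiple):
    """Simulate event generation for a 'value > threshold' trigger.

    multiple=False: one PROBLEM event when the trigger first fires.
    multiple=True:  a new PROBLEM event on every value above the threshold.
    An OK event is generated once, when the value drops back.
    Returns a list of (index_of_value, event_type) tuples.
    """
    events = []
    in_problem = False
    for i, value in enumerate(values):
        problem = value > threshold
        if problem and (multiple or not in_problem):
            events.append((i, "PROBLEM"))
        elif not problem and in_problem:
            events.append((i, "OK"))
        in_problem = problem
    return events


load_values = [3, 6, 7, 8, 4]  # hypothetical load average samples

normal = generate_events(load_values, threshold=5, multiple=False)
multi = generate_events(load_values, threshold=5, multiple=True)
```

In normal mode the three consecutive high values produce a single PROBLEM event; in multiple mode each of them produces its own.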

                              Comment

                              • romale
                                Member
                                • Mar 2013
                                • 53

                                #30
                                Originally posted by richlv
                                you can set global macro, let's say, {$CPU_LOAD} to be 5. then you modify your trigger expression to use that macro instead of hardcoded value "5". and that's it, after these changes it still works as before.
                                but the benefit - now you can override the value of this macro either on host, or template level. so setting {$CPU_LOAD} to 10 for some host will leave all other hosts at 5, and this one will be at 10. again, you can also override it on template level. host takes precedence over template level, and both take precedence over global macro.
                                is there a way to change {$CPU_LOAD} for several hosts (50 hosts for example) by selecting them from a list and changing the CPU threshold via mass update, or in any other way?

                                Comment
