Slowly working through large-scale items
How we use Zabbix:
We have a few hundred servers, 25,000 items, 7,500 triggers, about 60 updates/second, and 17GB of data, covering about 100 customers/groups of very diverse Linux servers. We expect 10X growth in the next year, so this is of strong interest to us.
Numerous engineers modify our system every day, as we are always adding hosts, triggers, new and custom items backed by scripts, etc. Because we are nearly 100% template-driven, we are always checking how we've broken the system or caused issues on current hosts when adding new ones.
The new option to not have ALL on drop-downs is a HUGE HELP, as loading pages with ALL set was killing us. But we would still like an All option for some things (i.e. it should not go away when Not Selected is the default; we want None Selected as the default but All still available).
We will soon have major group issues: we have 50-100 groups now, and later we will need to customize the UI to enter a group number or something similar, as the drop-downs will be too long and hard to use. Not sure how to approach this as the number of groups, templates, and screens grows past 100 and then past 1000.
We have customized a few things, with more coming, mostly for better usability for our NOC and 24x7 staff. In particular, we have a big red flag on the dashboard for Unacknowledged events, so you can see them from across the room.
We use the dashboard with the custom flag (above) and sounds to warn of new alerts (everyone needs this feature). We are heavy graph and screen users, plus slideshows for critical systems, and we use Latest data a lot. Monitoring Overview and Triggers are useless to us; we'd like the event history to be easier to use and better click-through links from the dashboard, as it is hard to get to what we want on diverse alerts.
We are heavy ACK users, with 5-10 ACKs per alert. Our 24x7 team ACKs each step they take so everyone can see the alert status. We wish this was a bit easier to see/summarize, but it is decent.
We have a test ACK system that randomly generates fake alerts for our support team to ACK, to make sure they are paying attention overnight. A complex SQL report tells us how long it took to ACK and how long the alert lasted (tough in SQL), in case it was very short.
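To illustrate what that report computes, here is a minimal Python sketch, not our actual SQL. The event tuples `(alert_id, kind, timestamp)` and the kinds `PROBLEM`, `ACK`, and `OK` are made up for the example and are not the Zabbix schema.

```python
from datetime import datetime, timedelta

def ack_report(events):
    """Return {alert_id: (seconds_to_first_ack, alert_duration_seconds)}.

    Either value is None if the alert was never ACKed or never cleared.
    `events` is a hypothetical list of (alert_id, kind, timestamp) tuples.
    """
    start, first_ack, end = {}, {}, {}
    for alert_id, kind, ts in sorted(events, key=lambda e: e[2]):
        if kind == "PROBLEM":
            start.setdefault(alert_id, ts)       # keep the first PROBLEM time
        elif kind == "ACK":
            first_ack.setdefault(alert_id, ts)   # only the first ACK counts
        elif kind == "OK":
            end.setdefault(alert_id, ts)         # first recovery ends the alert

    report = {}
    for alert_id, t0 in start.items():
        ack = first_ack.get(alert_id)
        ok = end.get(alert_id)
        report[alert_id] = (
            (ack - t0).total_seconds() if ack is not None else None,
            (ok - t0).total_seconds() if ok is not None else None,
        )
    return report

if __name__ == "__main__":
    t0 = datetime(2010, 1, 1, 2, 0, 0)
    sample = [
        (1, "PROBLEM", t0),
        (1, "OK", t0 + timedelta(seconds=30)),    # very short alert
        (1, "ACK", t0 + timedelta(seconds=120)),  # ACKed after it cleared
        (2, "PROBLEM", t0 + timedelta(minutes=10)),
    ]
    print(ack_report(sample))
```

The point of tracking both numbers is the short-alert case: an alert that cleared after 30 seconds but was ACKed after 2 minutes should not be judged the same way as one that stayed open the whole time.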
We use email alerts for most things, plus some SMS, but not much escalation yet; it's not that easy to set up and is hard to test.
We use lots of SQL for special reports, which we hope to share soon. They look for disabled hosts that shouldn't be disabled (using a new table/field to track who approved the disable and until when), mismatched template vs. host items, mismatched intervals, missing/wrong URLs on triggers (we use these to link to our wiki, which is critical to us), missing profile data that we use in URLs, etc. Happy to share all.
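As a rough idea of the template vs. host check, here is a small Python sketch. It is not our SQL; it assumes the items have already been pulled out as plain `{item_key: interval_seconds}` dictionaries, and `custom.queue.depth` and `local.extra` are made-up keys.

```python
def find_mismatches(template_items, host_items):
    """Compare a template's items with a host's items.

    Both arguments are {item_key: interval_seconds} dictionaries
    (a simplified stand-in for what a real query would return).
    Returns a list of (item_key, problem) tuples.
    """
    problems = []
    for key, interval in sorted(template_items.items()):
        if key not in host_items:
            problems.append((key, "missing on host"))
        elif host_items[key] != interval:
            problems.append(
                (key, f"interval {host_items[key]}s, template says {interval}s")
            )
    # Items on the host that no template accounts for.
    for key in sorted(set(host_items) - set(template_items)):
        problems.append((key, "not in template"))
    return problems

if __name__ == "__main__":
    template = {"system.cpu.load": 60, "vfs.fs.size[/,pfree]": 300, "custom.queue.depth": 60}
    host = {"system.cpu.load": 60, "vfs.fs.size[/,pfree]": 600, "local.extra": 30}
    for key, problem in find_mismatches(template, host):
        print(f"{key}: {problem}")
```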
Built-in reports are useless as far as we can tell; we'd love to be able to add our own through some simple PHP config system, i.e. add SQL and arguments, get results.
The DB scaling improvements in 1.8 dropped our I/O by 10X or more.
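Something like the following is what we have in mind, sketched in Python with an in-memory SQLite database rather than PHP against the real Zabbix DB. The `REPORTS` registry, the `run_report` function, and the toy `hosts` table are all invented for the example.

```python
import sqlite3

# Hypothetical report registry: name -> (SQL with named parameters, default arguments).
REPORTS = {
    "disabled_hosts": (
        "SELECT name FROM hosts "
        "WHERE enabled = 0 AND customer = :customer ORDER BY name",
        {"customer": "acme"},
    ),
}

def run_report(conn, name, **overrides):
    """Look up a report by name, merge arguments, and return (columns, rows)."""
    sql, defaults = REPORTS[name]
    params = {**defaults, **overrides}
    cur = conn.execute(sql, params)            # values are bound, not string-pasted
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()

def demo():
    """Build a toy table (not the Zabbix schema) and run one report against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hosts (name TEXT, customer TEXT, enabled INTEGER)")
    conn.executemany(
        "INSERT INTO hosts VALUES (?, ?, ?)",
        [("web1", "acme", 0), ("web2", "acme", 1), ("db2", "acme", 0), ("mail1", "other", 0)],
    )
    return run_report(conn, "disabled_hosts", customer="acme")

if __name__ == "__main__":
    print(demo())
```

Adding a report would then mean adding one entry to the registry, with no new page code.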
We do not use proxies yet, but will in some cases.
We do not use maps; we'd love to, but have no time to build them for dozens of different installations/systems.
We do not use data pushing from the monitored servers, as we don't like the security implications of an agent connecting to our server; we may route this through a proxy at some point.
We heavily use custom scripts on agents, though we are trying to do more in the agent config if we can control the timeouts.
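The timeout concern is the usual one: a check script that hangs should fail fast instead of blocking. A minimal Python sketch of a wrapper that enforces this is below; `run_check` is a made-up helper for illustration, not part of the Zabbix agent.

```python
import subprocess
import sys

def run_check(cmd, timeout_s=3.0):
    """Run a check command and return (ok, output).

    The child process is killed if it runs longer than timeout_s,
    so a hung script cannot stall whatever is waiting on the result.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if result.returncode != 0:
        return False, result.stderr.strip()
    return True, result.stdout.strip()

if __name__ == "__main__":
    # A fast check, then one that would hang for 5 seconds.
    print(run_check([sys.executable, "-c", "print(42)"]))
    print(run_check([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5))
```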
Our customers also use the system to see their hosts, which has worked well so far. The new versions allow graphs on templates, which has made this much simpler to manage.
We are investigating the best way to get HA - probably replication to a standby server in another city, which will start running checks if the main server dies.
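We have not built this yet, so the following is only a sketch of the takeover decision, under the assumption that the standby probes the main server periodically and takes over after several consecutive failed probes. `StandbyMonitor` and its threshold are hypothetical.

```python
class StandbyMonitor:
    """Decide when a standby server should start running checks.

    Assumption for this sketch: the standby takes over after
    `failures_needed` consecutive failed probes of the main server,
    and stays active until someone fails back by hand.
    """

    def __init__(self, failures_needed=3):
        self.failures_needed = failures_needed
        self.failures = 0
        self.active = False

    def record_probe(self, primary_ok):
        """Record one probe result; return True if the standby is now active."""
        if primary_ok:
            self.failures = 0          # any success resets the streak
        else:
            self.failures += 1
            if self.failures >= self.failures_needed:
                self.active = True     # sticky: no automatic failback
        return self.active

if __name__ == "__main__":
    monitor = StandbyMonitor(failures_needed=3)
    for ok in [True, False, False, True, False, False, False]:
        print(ok, monitor.record_probe(ok))
```

Requiring consecutive failures keeps a single dropped probe between cities from triggering a takeover, and the sticky flag avoids both servers running checks while the main one flaps.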
Waiting to hear more from this guy doing 3,000 hosts:
We are happy to share anything we are doing - we have lots of people working on/in Zabbix every day, custom reports, some UI changes, and are thinking about our own agent patches to get it to monitor a lot more things.