My company's operations department is using Zabbix to monitor servers, which is where Zabbix appears to excel. I am a software developer and want to add instrumentation to our software to measure a host of things. Based on past experience, I was going to feed this into Prometheus, but the ops team would prefer I feed Zabbix. I can understand their request, so I'm trying to fulfill it.
I want to instrument things like:
- Request latency across various portions of the system
- Transactions per second, again, across various portions of the system
- A few other transactional types of metrics, such as the ebb and flow of various microservice instances
These sorts of things are never tied to a particular machine. I can't just monitor dev001 and dev002. There's a web server, which is behind a load balancer, so it might really be five web servers spread across five hosts. There's a Redis cluster and a PostgreSQL instance. On each web server will be one or more instances of our core application. There may also be other hosts, and somewhere all our microservices are running. Those come up for a while, do some work, and go away. Except during failure conditions, they tend to stay up for 30 minutes to two days. Once they go away, they're generally gone for good, as they are tied to various events in our system. (And in this case, an event is a real-world event with a configured start and stop time.)
I want to be able to track this kind of data for alerting if things start to fail, but I also want to track usage over time. This will help ops with machine sizing. It will also help dev if someone says, "Suddenly after last week's release, the system is slower." I'd love to be able to go back to all these statistics and compare performance over arbitrary time ranges.
Zabbix appears to be very host-based, but as a programmer, I don't care about hosts. I care about services / microservices. I care about latency. Operations cares about CPU and memory usage and page swaps, but I'm not here to instrument any of that.
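To make the shape of the data concrete, here is a rough Python sketch of what I mean by grouping on services rather than hosts. It is only an illustration: the `Sample` type, the `summarize_by_service` helper, and the "checkout" service name are all made up for this post, and nothing here talks to Zabbix or Prometheus.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One instrumented request. All field names are hypothetical."""
    service: str       # logical service name, e.g. "web"
    host: str          # where it happened to run; never used for grouping
    latency_ms: float  # request latency for this one transaction
    timestamp: float   # seconds since an arbitrary epoch


def summarize_by_service(samples, window_start, window_end):
    """Aggregate latency and throughput per service over a time window."""
    grouped = defaultdict(list)
    for s in samples:
        # Keep only samples inside the half-open window [start, end).
        if window_start <= s.timestamp < window_end:
            grouped[s.service].append(s.latency_ms)
    duration = window_end - window_start
    return {
        service: {
            "avg_latency_ms": mean(latencies),
            "max_latency_ms": max(latencies),
            "tps": len(latencies) / duration,
        }
        for service, latencies in grouped.items()
    }


# The same service reports from two hosts; the summary does not care which.
samples = [
    Sample("web", "dev001", 12.0, 0.5),
    Sample("web", "dev002", 18.0, 1.5),
    Sample("checkout", "dev002", 40.0, 1.0),
    Sample("web", "dev001", 30.0, 9.0),
]
summary = summarize_by_service(samples, 0.0, 10.0)
```

The point is that the host is just a tag on the sample. Whether "web" runs on one machine or on 45, the per-service numbers are what I want to alert on and compare week over week.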
So my question is: can Zabbix help with what I'm trying to do? In searching the forums, there are only two hits for microservices, and neither of them was terribly useful.
If Zabbix can help, can anyone give me pointers on how I should organize my data when I'm not dealing with hosts but instead with services? As far as I'm concerned, this could all run on a single host (like it does in development mode) or on 45. I don't care. I care about the overall performance of my software and about identifying when I've suddenly added a bottleneck that wasn't there last week.
Maybe there's a good writeup on how to do this type of monitoring. Searching the docs for services told me about monitoring Windows services, which isn't remotely the same thing.
Any pointers are appreciated.