This is likely to be a long post, and for that I apologize. To ask my questions, I first need to describe the environment we will be monitoring. I posted this to the mailing list first and got very little response.
Environment:
3 core switches
< 5000 compute nodes, divided into clusters of ~70 nodes
1 switch per standard rack
Infrastructure to support these clusters
Put as simply as I can while still painting the picture: each cluster has two racks of servers in a compute node role. Each rack holds the compute nodes and a switch. At present these are the elements I am trying to set up for monitoring. There will be more later, but the answers to the questions I ask here will give me what I need to move forward.
We need to make each node dependent on its own switch, and likewise each switch dependent on the core switch to which it is attached. With this volume of monitored items, manually configuring the triggers and dependencies is out of the question.
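To make the dependency chain concrete, here is a minimal Python sketch that derives the (dependent, depends-on) pairs from an inventory. It is tool-agnostic and makes no monitoring-system calls; the host names, the `Node` class and the `dependency_edges` function are all hypothetical and only illustrate the node -> rack switch -> core switch relationship described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    switch: str  # rack switch that serves this node (hypothetical naming)


def dependency_edges(nodes, switch_to_core):
    """Return (dependent, depends_on) pairs: switch -> core, then node -> switch."""
    edges = []
    # Each rack switch depends on the core switch it is attached to.
    for switch, core in sorted(switch_to_core.items()):
        edges.append((switch, core))
    # Each node depends on its own rack switch.
    for node in nodes:
        if node.switch not in switch_to_core:
            raise ValueError(f"{node.name}: unknown switch {node.switch}")
        edges.append((node.name, node.switch))
    return edges


# Made-up inventory for illustration only.
switch_to_core = {"rack-sw-01": "core-1", "rack-sw-02": "core-1"}
nodes = [Node("node-001", "rack-sw-01"), Node("node-002", "rack-sw-02")]
edges = dependency_edges(nodes, switch_to_core)
```

Generating the pairs from one inventory source, rather than entering them by hand, is the point: the same list could then drive whatever bulk mechanism the monitoring system offers.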
My first question, and I admittedly come from a Nagios background, is whether monitored services are automatically dependent on the host being up. In other words, if I define a host-alive check and that check fails, will the services automatically go into "disabled" mode because the host itself is down?
The scenario I see for setting up templates with dependencies is hard to describe, but I will do my best to make it clear. Keep in mind we are trying to automate as much of this as possible through the template system.
1. Assign each core switch a host-alive check and trigger.
2. Create one template per core switch, to be assigned to the switches attached to that core. It contains only the host-alive check for the switch to which it is assigned. In this template, the host-alive check's trigger will have a dependency on the host-alive check of the core that serves the switch.
   a. Create various templates for the different switches in use to monitor traffic, ports, etc.
3. Create one template per switch, to be assigned to the nodes served by that switch. This template contains only the host-alive check for the nodes, and the trigger for that check will be dependent on the trigger for the switch's host-alive check.
   a. Create as many templates as necessary for the nodes based on standard services, hardware, etc.
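The host-alive part of the plan above (one template per core for the switches, one template per switch for the nodes) can be sketched as follows. This is only an illustration of the bookkeeping, not an implementation against any monitoring system's API; the template naming scheme, the dictionary layout and `build_template_plan` are assumptions made up for the example.

```python
def build_template_plan(switch_to_core, node_to_switch):
    """Return (templates, assignments) for the host-alive templates only."""
    templates = {}    # template name -> what it checks and what it depends on
    assignments = {}  # host name -> host-alive template assigned to it

    # One template per core switch, assigned to every switch attached to that core.
    for switch, core in switch_to_core.items():
        name = f"alive-behind-{core}"  # hypothetical naming scheme
        templates[name] = {"check": "host-alive", "depends_on": core}
        assignments[switch] = name

    # One template per rack switch, assigned to every node served by that switch.
    for node, switch in node_to_switch.items():
        name = f"alive-behind-{switch}"
        templates[name] = {"check": "host-alive", "depends_on": switch}
        assignments[node] = name

    return templates, assignments


# Made-up inventory for illustration only.
switch_to_core = {"rack-sw-01": "core-1", "rack-sw-02": "core-1", "rack-sw-03": "core-2"}
node_to_switch = {"node-001": "rack-sw-01", "node-002": "rack-sw-01", "node-003": "rack-sw-03"}
templates, assignments = build_template_plan(switch_to_core, node_to_switch)
```

Under this scheme the number of host-alive templates is the number of core switches plus the number of rack switches that have nodes behind them, while the role templates (traffic, ports, services, hardware) stay shared and dependency-free.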
My biggest question is this: in points 2.a and 3.a, the monitored services won't have any dependencies defined, because we would like a single template, or a single set of templates, that can be assigned to any given host based on its role. Will the services monitored outside the templates described in points 2 and 3 be automatically dependent on a host-alive check?
This is the simplest way I can see to set up automatic dependency creation at the sheer scale of what we are monitoring. It is a necessary piece of the puzzle, and if the above method will not work, we could use some guidance on how to automate as much of the dependency creation as possible. We would prefer to use templates to eliminate as much human error as possible. Even mass updates could introduce errors simply because of the scale of this project. With templates, we can clone them and reduce the human error factor by a very large amount.
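Whichever mechanism ends up creating the dependencies, a small consistency check can catch the kind of human error mentioned above. The Python sketch below assumes the dependencies have been exported to a simple host -> parent mapping (a hypothetical format, as are the host names and `validate_chains`) and reports any host whose chain does not end at a core switch.

```python
def validate_chains(depends_on, cores):
    """Return a list of problems; an empty list means every chain reaches a core."""
    problems = []
    for host in depends_on:
        seen = set()
        current = host
        # Walk up the chain until a core switch is reached.
        while current not in cores:
            if current in seen:
                problems.append(f"{host}: dependency cycle")
                break
            seen.add(current)
            parent = depends_on.get(current)
            if parent is None:
                problems.append(f"{host}: chain stops at {current}, not a core switch")
                break
            current = parent
    return problems


# Made-up data: node-002 points at a switch that has no dependency defined.
cores = {"core-1"}
depends_on = {
    "node-001": "rack-sw-01",
    "rack-sw-01": "core-1",
    "node-002": "rack-sw-09",
}
problems = validate_chains(depends_on, cores)
```

Run against the full inventory after each cloning pass, a check like this would flag a template assigned to the wrong rack or a switch left without its core dependency.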
I have some much less desirable alternatives to consider, but I would prefer to exhaust all possible methods of automating this before testing them. Any and all help will be greatly appreciated.