I am trying to monitor the health of a Hadoop host group. Every host has the same item, proc.num[,,,datanode], which returns 1 if the datanode service is running. We have 150 nodes, so I don't care if the service goes down on a single host; what I care about is more than 10% of them going down. As far as I know, I have two options:
1. I can use a calculated item to sum the return values, then create a trigger on it. But I would need to reference 150 items in one calculated item, and I would have to modify it every time we add or decommission nodes.
My question is: can I have an item for a host group?
2. Create a virtual host with one calculated item per real host, like last("hostname:item"), each returning 1 or 0.
Can I count the return values of all these checks? Or can I have one trigger covering all the items on this virtual host?
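To make the condition concrete, here is a rough sketch in Python of the check I have in mind. It is not Zabbix syntax; the function name `too_many_down` and the sample values are made up for illustration, and it assumes one proc.num value per host, where 0 means the datanode is down.

```python
def too_many_down(statuses, max_down_percent=10):
    """Return True if more than max_down_percent of hosts report the service down.

    statuses: one proc.num value per host (0 = down, 1 or more = running).
    The name and the default threshold are illustrative assumptions.
    """
    total = len(statuses)
    if total == 0:
        # No data at all: nothing to alert on in this sketch.
        return False
    down = sum(1 for value in statuses if value < 1)
    # Integer comparison avoids floating-point rounding at the exact 10% boundary.
    return down * 100 > max_down_percent * total


# Hypothetical sample: 150 nodes, 16 of them down.
sample = [1] * 134 + [0] * 16
print(too_many_down(sample))
```

With 150 nodes, exactly 15 down is still within the limit, and the 16th failure is what should fire the trigger.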
Those are my ideas; feel free to tell me if you have a better way to do it! Many thanks!