Hi everyone!
Our Zabbix 6.0.18 causes heavy database load when users navigate to a page that returns many results, such as monitoring>hosts.
Some of the queries last for minutes, pin several CPUs at 100% and keep running even after the users navigate away from the Zabbix frontend or click another menu item such as reports>system info.
Changing the number of results per page in the user's settings makes no difference, and users have a reasonable number of results per page (<200) anyway. The problem is often triggered by loosely filtered queries, such as selecting all problems together with a hostgroup containing many hosts.
If users remove all filters by mistake, the page times out and that submenu page (for example monitoring>hosts) becomes unavailable to them, even after logging out and back in. Other submenu pages remain available. The only fix we have found in such cases is to ask a colleague to send a link to a filtered view, which restores access to the affected submenu.
htop shows all 20 CPUs running at 95-100%, with the long-running SELECT and UPDATE queries responsible for the CPU utilisation. Screenshot attached.
In some cases the only way to recover has been to restart the database; in other cases the queries eventually finish and CPU utilisation returns to normal.
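As a less drastic alternative to restarting, we are thinking of cancelling just the offending backends with the standard PostgreSQL admin functions. A minimal sketch of what we would run (12345 is a placeholder pid taken from pg_stat_activity, not a real value from our system):

-- politely ask the backend to cancel its current query
SELECT pg_cancel_backend(12345);
-- if the query does not stop, terminate the backend entirely
SELECT pg_terminate_backend(12345);

We have not tried this yet, so feedback on whether it is safe to do this against Zabbix frontend/server sessions would be welcome.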
Looking at pg_top, we captured part of one of the queries, which had been running for 20+ minutes:
SELECT f.triggerid,i.hostid FROM functions f,items i WHERE (f.triggerid IN (7849,7850,7851,7852,7856,7857,7863,7864,7866,7867,7869,7871,7872,7873,7875,7876,7877,7878,7879,7881,7883,7885,7889,7891,7892,7894,7895,7896,7898,7899,7900,7902,7904,7907,7908,7912,7913,7915,7917,7920,7930,7931,7932,7933,7934,7935,7937,7938,7940,7941,7943,7944,7945,7947,7949,7951,7953,7955,7958,7960,7963,7964,7966,7967,7969,7976,7980,7983,7984,7986,7987,7988,7989,7990,7991,7992,7993,7997,7999,8000,8001,8002,8005,8006,8007,8008,8010,8012,8014,8018,8019,8021,44186,44192,44200,44201,44202,44203,44204,44205,44207,44208,44209,44210,44211,44212,66698,66699,66702,66704,66705,66710,66712,67619,74766,74768,74798,74800,74811,74812,74813,74818,74819,74820,92470,92472,92473,92475,92476,92478,92479,92481,92482,92483,92484,92485,92486,92487,92491,92492,92493,92494,92495,92496,92497,92498,92499,92506,92507,92508,92510,92511,92512,92513,92514,92515,92516,92517,92521,92522,92523,92527,92528,92529,92531,92532,92534,92535,92539,92540,92541,92542,92543,9
Please note the query above was copied from pg_top and is truncated; the full statement was not available.
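Next time this happens we plan to pull the full statement straight from PostgreSQL rather than pg_top; a minimal query along these lines should do it (note that pg_stat_activity itself truncates query text at track_activity_query_size, 1024 bytes by default, so that setting may need to be raised first):

-- list non-idle backends whose current statement has been running for more than 5 minutes
SELECT pid,
       now() - query_start AS duration,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;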
We are running Zabbix 6.0.18 on the frontends and servers (which run on separate VMs), PostgreSQL 14.8 with TimescaleDB, and Ubuntu 22.04 on all VMs.
Zabbix is configured with ~40K hosts and ~2 million items.
Is there a known issue this might be related to? And any suggestions on what to do next?
Thanks
Stefano