Добрый день.
Текущая версия zabbix 4.4.7. Centos 7. postgresql 11.7. TimescaleDB 1.6 (1.7)
С момента включения timescaledb начались проблемы с перезапуском службы zabbix-server.
Через некоторое время после перезапуска слжбы начинаются тормоза в системе.
В логах zabbix-server появляется следующее:
24836:20200502:223422.345 server #23 started [trapper #4]
24834:20200502:223422.350 server #21 started [trapper #2]
24842:20200502:223422.356 server #29 started [alerter #3]
24845:20200502:223422.358 server #31 started [preprocessing worker #1]
24846:20200502:223422.359 server #32 started [preprocessing worker #2]
24848:20200502:223422.359 server #34 started [lld manager #1]
24847:20200502:223422.442 server #33 started [preprocessing worker #3]
24822:20200502:223835.381 item "s-dc:raid.winswen" became supported
24822:20200502:230004.808 slow query: 3.250235 sec, "update trends_uint set num=60,value_min=7386492928,value_avg=7391986004,v alue_max=7396642816 where itemid=28680 an
d clock=1588446000;
update trends_uint set num=60,value_min=4582731776,value_avg=5159637674,v alue_max=5736448000 where itemid=24420 and clock=1588446000;
update trends_uint set num=59,value_min=613,value_avg=1001,value_max=2042 where itemid=24180 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=3 where itemid=28620 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24780 and clock=1588446000;
update trends_uint set num=60,value_min=127,value_avg=128,value_max=133 where itemid=33840 and clock=1588446000;
update trends_uint set num=60,value_min=46837719040,value_avg=46839531724 ,value_max=46840750080 where itemid=30120 and clock=1588446000;
update trends_uint set num=60,value_min=4139196416,value_avg=4139218261,v alue_max=4140507136 where itemid=28800 and clock=1588446000;
update trends_uint set num=60,value_min=7378808832,value_avg=7382698120,v alue_max=7391145984 where itemid=32280 and clock=1588446000;
update trends_uint set num=60,value_min=2547122176,value_avg=2547122176,v alue_max=2547122176 where itemid=32700 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24720 and clock=1588446000;
update trends_uint set num=59,value_min=344072,value_avg=1290817,value_ma x=16475152 where itemid=24240 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=29340 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=31500 and clock=1588446000;
"
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly c
orrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly c
orrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
24820:20200502:230046.594 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[update trends_uint set num=60,value_min=1,value_avg=2,value_max=29 where itemid=33182 and clock=1588446000;
update trends_uint set num=60,value_min=40172,value_avg=40172,value_max=4 0172 where itemid=33149 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=8 where itemid=33194 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=39 where itemid=33196 and clock=1588446000;
...
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly c
orrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
24851:20200502:230101.712 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[select a.alertid,a.mediatypeid,a.sendto,a.subject,a.messa ge,a.status,a.retries,e.source,e.object,e.objectid ,a.parameters,a.eventid from alerts a left join events e on
a.eventid=e.eventid where alerttype=0 and a.status=3 order by a.alertid]
24820:20200502:230101.718 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode
24820:20200502:230101.718 database is down: reconnecting in 10 seconds
24821:20200502:230101.737 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode
24821:20200502:230101.737 database is down: reconnecting in 10 seconds
24851:20200502:230101.740 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode
24851:20200502:230101.740 database is down: reconnecting in 10 seconds
24819:20200502:230101.744 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[update trends_uint set num=60,value_min=1,value_avg=1,value_max=1 where itemid=24059 and clock=1588446000;
update trends_uint set num=60,value_min=7161635,value_avg=7163404,value_m ax=7165175 where itemid=24419 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24719 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24779 and clock=1588446000;
update trends_uint set num=60,value_min=46149763072,value_avg=46149763072 ,value_max=46149763072 where itemid=27119 and clock=1588446
...
и так далее...
в логах pgsql следующее в этот момент:
2020-05-02 23:00:46.532 MSK [914] LOG: server process (PID 24856) was terminated by signal 9: Killed
2020-05-02 23:00:46.532 MSK [914] DETAIL: Failed process was running: update trends_uint set num=60,value_min=1,value_avg=2,value_max=29 where itemid=33182 and clock=1
588446000;
update trends_uint set num=60,value_min=40172,value_avg=40172,value_max=4 0172 where itemid=33149 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=8 where itemid=33194 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=39 where itemid=33196 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=18 where itemid=33202 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=22 where itemid=33195 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=33110 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=2,value_max=35 where itemid=33191 and clock=1588446000;
update trends_uint set num=60,value_min=338423808,value_avg=338633523,val ue_max=338948096 where itemid=33583 and clock=1588446000;
update trends_uint
2020-05-02 23:00:46.533 MSK [914] LOG: terminating any other active server processes
2020-05-02 23:00:46.539 MSK [24871] WARNING: terminating connection because of crash of another server process
2020-05-02 23:00:46.539 MSK [24871] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
ess exited abnormally and possibly corrupted shared memory.
2020-05-02 23:00:46.539 MSK [24871] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:00:46.541 MSK [24862] WARNING: terminating connection because of crash of another server process
2020-05-02 23:00:46.541 MSK [24862] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
ess exited abnormally and possibly corrupted shared memory.
2020-05-02 23:00:46.541 MSK [24862] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:00:46.543 MSK [1222] WARNING: terminating connection because of crash of another server process
2020-05-02 23:00:46.543 MSK [1222] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proce
ss exited abnormally and possibly corrupted shared memory.
2020-05-02 23:00:46.543 MSK [1222] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:00:46.546 MSK [24855] WARNING: terminating connection because of crash of another server process
2020-05-02 23:00:46.546 MSK [24855] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
ess exited abnormally and possibly corrupted shared memory.
...
2020-05-02 23:01:01.704 MSK [29110] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:01:01.706 MSK [29128] WARNING: terminating connection because of crash of another server process
2020-05-02 23:01:01.706 MSK [29128] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
ess exited abnormally and possibly corrupted shared memory.
2020-05-02 23:01:01.706 MSK [29128] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:01:01.717 MSK [29145] FATAL: the database system is in recovery mode
2020-05-02 23:01:01.736 MSK [29146] FATAL: the database system is in recovery mode
2020-05-02 23:01:01.737 MSK [29147] FATAL: the database system is in recovery mode
2020-05-02 23:01:01.750 MSK [29148] FATAL: the database system is in recovery mode
2020-05-02 23:01:01.845 MSK [914] LOG: all server processes terminated; reinitializing
...
Когда служба начинает писать такое в логах. ее не удается рестартовать (systemctl restart zabbix-server зависает).
Сам сервер также не удается рестартовать по reboot. Процесс перезапуска останавливается на zabbix-server stopping. (ждал минут 10.)
Пока рестрат из этого состояния только по аппаратному Reset.
До установки timescaledb такого не было.
Ситуация аналогичная на двух разных серверах.
Подскажите, как справиться с этой проблеммой?
Заранее благодарен.
Текущая версия zabbix 4.4.7. Centos 7. postgresql 11.7. TimescaleDB 1.6 (1.7)
С момента включения timescaledb начались проблемы с перезапуском службы zabbix-server.
Через некоторое время после перезапуска слжбы начинаются тормоза в системе.
В логах zabbix-server появляется следующее:
24836:20200502:223422.345 server #23 started [trapper #4]
24834:20200502:223422.350 server #21 started [trapper #2]
24842:20200502:223422.356 server #29 started [alerter #3]
24845:20200502:223422.358 server #31 started [preprocessing worker #1]
24846:20200502:223422.359 server #32 started [preprocessing worker #2]
24848:20200502:223422.359 server #34 started [lld manager #1]
24847:20200502:223422.442 server #33 started [preprocessing worker #3]
24822:20200502:223835.381 item "s-dc:raid.winswen" became supported
24822:20200502:230004.808 slow query: 3.250235 sec, "update trends_uint set num=60,value_min=7386492928,value_avg=7391986004,v alue_max=7396642816 where itemid=28680 an
d clock=1588446000;
update trends_uint set num=60,value_min=4582731776,value_avg=5159637674,v alue_max=5736448000 where itemid=24420 and clock=1588446000;
update trends_uint set num=59,value_min=613,value_avg=1001,value_max=2042 where itemid=24180 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=3 where itemid=28620 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24780 and clock=1588446000;
update trends_uint set num=60,value_min=127,value_avg=128,value_max=133 where itemid=33840 and clock=1588446000;
update trends_uint set num=60,value_min=46837719040,value_avg=46839531724 ,value_max=46840750080 where itemid=30120 and clock=1588446000;
update trends_uint set num=60,value_min=4139196416,value_avg=4139218261,v alue_max=4140507136 where itemid=28800 and clock=1588446000;
update trends_uint set num=60,value_min=7378808832,value_avg=7382698120,v alue_max=7391145984 where itemid=32280 and clock=1588446000;
update trends_uint set num=60,value_min=2547122176,value_avg=2547122176,v alue_max=2547122176 where itemid=32700 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24720 and clock=1588446000;
update trends_uint set num=59,value_min=344072,value_avg=1290817,value_ma x=16475152 where itemid=24240 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=29340 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=31500 and clock=1588446000;
"
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly c
orrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly c
orrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
24820:20200502:230046.594 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[update trends_uint set num=60,value_min=1,value_avg=2,value_max=29 where itemid=33182 and clock=1588446000;
update trends_uint set num=60,value_min=40172,value_avg=40172,value_max=4 0172 where itemid=33149 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=8 where itemid=33194 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=39 where itemid=33196 and clock=1588446000;
...
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly c
orrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
24851:20200502:230101.712 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[select a.alertid,a.mediatypeid,a.sendto,a.subject,a.messa ge,a.status,a.retries,e.source,e.object,e.objectid ,a.parameters,a.eventid from alerts a left join events e on
a.eventid=e.eventid where alerttype=0 and a.status=3 order by a.alertid]
24820:20200502:230101.718 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode
24820:20200502:230101.718 database is down: reconnecting in 10 seconds
24821:20200502:230101.737 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode
24821:20200502:230101.737 database is down: reconnecting in 10 seconds
24851:20200502:230101.740 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode
24851:20200502:230101.740 database is down: reconnecting in 10 seconds
24819:20200502:230101.744 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[update trends_uint set num=60,value_min=1,value_avg=1,value_max=1 where itemid=24059 and clock=1588446000;
update trends_uint set num=60,value_min=7161635,value_avg=7163404,value_m ax=7165175 where itemid=24419 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24719 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24779 and clock=1588446000;
update trends_uint set num=60,value_min=46149763072,value_avg=46149763072 ,value_max=46149763072 where itemid=27119 and clock=1588446
...
и так далее...
в логах pgsql следующее в этот момент:
2020-05-02 23:00:46.532 MSK [914] LOG: server process (PID 24856) was terminated by signal 9: Killed
2020-05-02 23:00:46.532 MSK [914] DETAIL: Failed process was running: update trends_uint set num=60,value_min=1,value_avg=2,value_max=29 where itemid=33182 and clock=1
588446000;
update trends_uint set num=60,value_min=40172,value_avg=40172,value_max=4 0172 where itemid=33149 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=8 where itemid=33194 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=39 where itemid=33196 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=18 where itemid=33202 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=1,value_max=22 where itemid=33195 and clock=1588446000;
update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=33110 and clock=1588446000;
update trends_uint set num=60,value_min=1,value_avg=2,value_max=35 where itemid=33191 and clock=1588446000;
update trends_uint set num=60,value_min=338423808,value_avg=338633523,val ue_max=338948096 where itemid=33583 and clock=1588446000;
update trends_uint
2020-05-02 23:00:46.533 MSK [914] LOG: terminating any other active server processes
2020-05-02 23:00:46.539 MSK [24871] WARNING: terminating connection because of crash of another server process
2020-05-02 23:00:46.539 MSK [24871] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
ess exited abnormally and possibly corrupted shared memory.
2020-05-02 23:00:46.539 MSK [24871] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:00:46.541 MSK [24862] WARNING: terminating connection because of crash of another server process
2020-05-02 23:00:46.541 MSK [24862] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
ess exited abnormally and possibly corrupted shared memory.
2020-05-02 23:00:46.541 MSK [24862] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:00:46.543 MSK [1222] WARNING: terminating connection because of crash of another server process
2020-05-02 23:00:46.543 MSK [1222] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proce
ss exited abnormally and possibly corrupted shared memory.
2020-05-02 23:00:46.543 MSK [1222] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:00:46.546 MSK [24855] WARNING: terminating connection because of crash of another server process
2020-05-02 23:00:46.546 MSK [24855] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
ess exited abnormally and possibly corrupted shared memory.
...
2020-05-02 23:01:01.704 MSK [29110] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:01:01.706 MSK [29128] WARNING: terminating connection because of crash of another server process
2020-05-02 23:01:01.706 MSK [29128] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
ess exited abnormally and possibly corrupted shared memory.
2020-05-02 23:01:01.706 MSK [29128] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-05-02 23:01:01.717 MSK [29145] FATAL: the database system is in recovery mode
2020-05-02 23:01:01.736 MSK [29146] FATAL: the database system is in recovery mode
2020-05-02 23:01:01.737 MSK [29147] FATAL: the database system is in recovery mode
2020-05-02 23:01:01.750 MSK [29148] FATAL: the database system is in recovery mode
2020-05-02 23:01:01.845 MSK [914] LOG: all server processes terminated; reinitializing
...
Когда служба начинает писать такое в логах. ее не удается рестартовать (systemctl restart zabbix-server зависает).
Сам сервер также не удается рестартовать по reboot. Процесс перезапуска останавливается на zabbix-server stopping. (ждал минут 10.)
Пока рестрат из этого состояния только по аппаратному Reset.
До установки timescaledb такого не было.
Ситуация аналогичная на двух разных серверах.
Подскажите, как справиться с этой проблеммой?
Заранее благодарен.
Comment