Ad Widget

Collapse

zabbix4.4+timescaledb проблема перезапуска.

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Magikanin
    Junior Member
    • Dec 2018
    • 4

    #1

    zabbix4.4+timescaledb проблема перезапуска.

    Добрый день.
    Текущая версия zabbix 4.4.7. Centos 7. postgresql 11.7. TimescaleDB 1.6 (1.7)
    С момента включения timescaledb начались проблемы с перезапуском службы zabbix-server.
    Через некоторое время после перезапуска слжбы начинаются тормоза в системе.
    В логах zabbix-server появляется следующее:
    24836:20200502:223422.345 server #23 started [trapper #4]
    24834:20200502:223422.350 server #21 started [trapper #2]
    24842:20200502:223422.356 server #29 started [alerter #3]
    24845:20200502:223422.358 server #31 started [preprocessing worker #1]
    24846:20200502:223422.359 server #32 started [preprocessing worker #2]
    24848:20200502:223422.359 server #34 started [lld manager #1]
    24847:20200502:223422.442 server #33 started [preprocessing worker #3]
    24822:20200502:223835.381 item "s-dc:raid.winswen" became supported
    24822:20200502:230004.808 slow query: 3.250235 sec, "update trends_uint set num=60,value_min=7386492928,value_avg=7391986004,v alue_max=7396642816 where itemid=28680 an
    d clock=1588446000;
    update trends_uint set num=60,value_min=4582731776,value_avg=5159637674,v alue_max=5736448000 where itemid=24420 and clock=1588446000;
    update trends_uint set num=59,value_min=613,value_avg=1001,value_max=2042 where itemid=24180 and clock=1588446000;
    update trends_uint set num=60,value_min=1,value_avg=1,value_max=3 where itemid=28620 and clock=1588446000;
    update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24780 and clock=1588446000;
    update trends_uint set num=60,value_min=127,value_avg=128,value_max=133 where itemid=33840 and clock=1588446000;
    update trends_uint set num=60,value_min=46837719040,value_avg=46839531724 ,value_max=46840750080 where itemid=30120 and clock=1588446000;
    update trends_uint set num=60,value_min=4139196416,value_avg=4139218261,v alue_max=4140507136 where itemid=28800 and clock=1588446000;
    update trends_uint set num=60,value_min=7378808832,value_avg=7382698120,v alue_max=7391145984 where itemid=32280 and clock=1588446000;
    update trends_uint set num=60,value_min=2547122176,value_avg=2547122176,v alue_max=2547122176 where itemid=32700 and clock=1588446000;
    update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24720 and clock=1588446000;
    update trends_uint set num=59,value_min=344072,value_avg=1290817,value_ma x=16475152 where itemid=24240 and clock=1588446000;
    update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=29340 and clock=1588446000;
    update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=31500 and clock=1588446000;
    "
    WARNING: terminating connection because of crash of another server process
    DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly c
    orrupted shared memory.
    HINT: In a moment you should be able to reconnect to the database and repeat your command.
    WARNING: terminating connection because of crash of another server process
    DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly c
    orrupted shared memory.
    HINT: In a moment you should be able to reconnect to the database and repeat your command.
    24820:20200502:230046.594 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
    [update trends_uint set num=60,value_min=1,value_avg=2,value_max=29 where itemid=33182 and clock=1588446000;
    update trends_uint set num=60,value_min=40172,value_avg=40172,value_max=4 0172 where itemid=33149 and clock=1588446000;
    update trends_uint set num=60,value_min=1,value_avg=1,value_max=8 where itemid=33194 and clock=1588446000;
    update trends_uint set num=60,value_min=1,value_avg=1,value_max=39 where itemid=33196 and clock=1588446000;

    ...
    WARNING: terminating connection because of crash of another server process
    DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly c
    orrupted shared memory.
    HINT: In a moment you should be able to reconnect to the database and repeat your command.
    24851:20200502:230101.712 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
    [select a.alertid,a.mediatypeid,a.sendto,a.subject,a.messa ge,a.status,a.retries,e.source,e.object,e.objectid ,a.parameters,a.eventid from alerts a left join events e on
    a.eventid=e.eventid where alerttype=0 and a.status=3 order by a.alertid]
    24820:20200502:230101.718 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode

    24820:20200502:230101.718 database is down: reconnecting in 10 seconds
    24821:20200502:230101.737 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode

    24821:20200502:230101.737 database is down: reconnecting in 10 seconds
    24851:20200502:230101.740 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode

    24851:20200502:230101.740 database is down: reconnecting in 10 seconds
    24819:20200502:230101.744 [Z3005] query failed: [0] PGRES_FATAL_ERROR:server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
    [update trends_uint set num=60,value_min=1,value_avg=1,value_max=1 where itemid=24059 and clock=1588446000;
    update trends_uint set num=60,value_min=7161635,value_avg=7163404,value_m ax=7165175 where itemid=24419 and clock=1588446000;
    update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24719 and clock=1588446000;
    update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=24779 and clock=1588446000;
    update trends_uint set num=60,value_min=46149763072,value_avg=46149763072 ,value_max=46149763072 where itemid=27119 and clock=1588446

    ...

    и так далее...

    в логах pgsql следующее в этот момент:

    2020-05-02 23:00:46.532 MSK [914] LOG: server process (PID 24856) was terminated by signal 9: Killed
    2020-05-02 23:00:46.532 MSK [914] DETAIL: Failed process was running: update trends_uint set num=60,value_min=1,value_avg=2,value_max=29 where itemid=33182 and clock=1
    588446000;
    update trends_uint set num=60,value_min=40172,value_avg=40172,value_max=4 0172 where itemid=33149 and clock=1588446000;
    update trends_uint set num=60,value_min=1,value_avg=1,value_max=8 where itemid=33194 and clock=1588446000;
    update trends_uint set num=60,value_min=1,value_avg=1,value_max=39 where itemid=33196 and clock=1588446000;
    update trends_uint set num=60,value_min=1,value_avg=1,value_max=18 where itemid=33202 and clock=1588446000;
    update trends_uint set num=60,value_min=1,value_avg=1,value_max=22 where itemid=33195 and clock=1588446000;
    update trends_uint set num=59,value_min=0,value_avg=0,value_max=0 where itemid=33110 and clock=1588446000;
    update trends_uint set num=60,value_min=1,value_avg=2,value_max=35 where itemid=33191 and clock=1588446000;
    update trends_uint set num=60,value_min=338423808,value_avg=338633523,val ue_max=338948096 where itemid=33583 and clock=1588446000;
    update trends_uint
    2020-05-02 23:00:46.533 MSK [914] LOG: terminating any other active server processes
    2020-05-02 23:00:46.539 MSK [24871] WARNING: terminating connection because of crash of another server process
    2020-05-02 23:00:46.539 MSK [24871] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
    ess exited abnormally and possibly corrupted shared memory.
    2020-05-02 23:00:46.539 MSK [24871] HINT: In a moment you should be able to reconnect to the database and repeat your command.
    2020-05-02 23:00:46.541 MSK [24862] WARNING: terminating connection because of crash of another server process
    2020-05-02 23:00:46.541 MSK [24862] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
    ess exited abnormally and possibly corrupted shared memory.
    2020-05-02 23:00:46.541 MSK [24862] HINT: In a moment you should be able to reconnect to the database and repeat your command.
    2020-05-02 23:00:46.543 MSK [1222] WARNING: terminating connection because of crash of another server process
    2020-05-02 23:00:46.543 MSK [1222] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proce
    ss exited abnormally and possibly corrupted shared memory.
    2020-05-02 23:00:46.543 MSK [1222] HINT: In a moment you should be able to reconnect to the database and repeat your command.
    2020-05-02 23:00:46.546 MSK [24855] WARNING: terminating connection because of crash of another server process
    2020-05-02 23:00:46.546 MSK [24855] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
    ess exited abnormally and possibly corrupted shared memory.
    ...
    2020-05-02 23:01:01.704 MSK [29110] HINT: In a moment you should be able to reconnect to the database and repeat your command.
    2020-05-02 23:01:01.706 MSK [29128] WARNING: terminating connection because of crash of another server process
    2020-05-02 23:01:01.706 MSK [29128] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server proc
    ess exited abnormally and possibly corrupted shared memory.
    2020-05-02 23:01:01.706 MSK [29128] HINT: In a moment you should be able to reconnect to the database and repeat your command.
    2020-05-02 23:01:01.717 MSK [29145] FATAL: the database system is in recovery mode
    2020-05-02 23:01:01.736 MSK [29146] FATAL: the database system is in recovery mode
    2020-05-02 23:01:01.737 MSK [29147] FATAL: the database system is in recovery mode
    2020-05-02 23:01:01.750 MSK [29148] FATAL: the database system is in recovery mode
    2020-05-02 23:01:01.845 MSK [914] LOG: all server processes terminated; reinitializing

    ...


    Когда служба начинает писать такое в логах. ее не удается рестартовать (systemctl restart zabbix-server зависает).
    Сам сервер также не удается рестартовать по reboot. Процесс перезапуска останавливается на zabbix-server stopping. (ждал минут 10.)

    Пока рестрат из этого состояния только по аппаратному Reset.

    До установки timescaledb такого не было.

    Ситуация аналогичная на двух разных серверах.

    Подскажите, как справиться с этой проблеммой?

    Заранее благодарен.
  • atef.haloui
    Junior Member
    • Apr 2021
    • 10

    #2
    Hi,
    I had the exact problem on my server that houses postgresql and zabbix. Every time the problem happen I need to hard reset the VM (a sudo reboot has no effect).
    The problem appeared only recently after upgrading to version 5.2.6.
    After analysing the logs it turns out that it's due to my wrongly tuned postgresql instance that consume most of the ram.
    I've used https://pgtune.leopard.in.ua/#/ to tune postgresql and reboot the hole system --> all is working perfecty again.
    In the meanwhile, I've also disabled compression in housekeeping.

    These are the pieces of logs that helped me to understand what's wrong:
    Code:
    2021-04-16 16:25:04.373 CEST [24629] zabbix@zabbix WARNING: there is no transaction in progress
    2021-04-16 16:25:04.373 CEST [24631] zabbix@zabbix WARNING: there is no transaction in progress
    2021-04-16 16:25:04.441 CEST [24632] zabbix@zabbix WARNING: there is no transaction in progress
    2021-04-16 16:25:11.238 CEST [24624] LOG: could not close temporary statistics file "/var/run/postgresql/11-main.pg_stat_tmp/db_0.tmp": No space left on device
    2021-04-16 16:25:12.198 CEST [24624] LOG: could not close temporary statistics file "/var/run/postgresql/11-main.pg_stat_tmp/global.tmp": No space left on device
    2021-04-16 16:25:14.478 CEST [628] LOG: server process (PID 24632) was terminated by signal 9: Killed
    Code:
    "
    5134:20210416:162434.698 database is down: reconnecting in 10 seconds
    5141:20210416:162434.773 [Z3001] connection to database 'zabbix' failed: [0] FATAL: the database system is in recovery mode
    
    5141:20210416:162434.773 database is down: reconnecting in 10 seconds
    5136:20210416:162444.547 database connection re-established
    WARNING: there is no transaction in progress
    5139:20210416:162444.686 database connection re-established
    WARNING: there is no transaction in progress
    5134:20210416:162444.704 database connection re-established
    WARNING: there is no transaction in progress
    5141:20210416:162444.779 database connection re-established
    WARNING: there is no transaction in progress
    5139:20210416:162450.092 [Z3005] query failed: [0] PGRES_FATAL_ERROR:ERROR: out of memory
    DETAIL: Failed on request of size 16 in memory context "MessageContext".
    Cheers

    Comment

    • atef.haloui
      Junior Member
      • Apr 2021
      • 10

      #3
      After further testing, it seems that postgresql instability (loop on killing and restarting timescaledb) is due to the values applied by the timescaledb-tune utility.
      As soon as I rollback using pgtune recommanded settings postgresql is stable again.

      Comment

      Working...