I'm in way over my head: Housekeeping for over 24h and it's overloading my database

  • loro64
    Junior Member
    • Oct 2024
    • 7

    #1

    I'm in way over my head: Housekeeping for over 24h and it's overloading my database

    It's happening multiple times a week: my Zabbix setup's housekeeper processes go to 100% busy and the load on my database gets way too high.
    Here's some info about my setup:
    • We're running 2 web/backend nodes in HA, on a 3-node PostgreSQL cluster (HA with Patroni)
    • PostgreSQL version is: psql (16.4 (Ubuntu 16.4-1.pgdg22.04+1))
    • Zabbix server version 7.0.3
    • Number of hosts (enabled/disabled) 1676
    • Number of items (enabled/disabled/not supported) 194923
    • Number of triggers (enabled/disabled [problem/ok]) 118752
    • Required server performance, new values per second 3020.79
    • VPS (values per second): around 2600
    I think I've screwed something up and I'm really in need of assistance. Here's everything in the server config that we have with a value set:
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=0
    DebugLevel=2
    PidFile=/run/zabbix/zabbix_server.pid
    SocketDir=/run/zabbix
    DBHost=redacted
    DBName=zabbix
    DBUser=redacted
    DBPassword=redacted
    DBPort=5001
    StartPollers=10
    StartAgentPollers=5
    StartPollersUnreachable=10
    StartPingers=5
    SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
    CacheSize=1G
    HistoryIndexCacheSize=128M
    TrendCacheSize=24M
    ValueCacheSize=128M
    Timeout=4
    ExternalScripts=/usr/lib/zabbix/externalscripts
    FpingLocation=/usr/bin/fping
    Fping6Location=/usr/bin/fping6
    LogSlowQueries=3000
    StatsAllowedIP=127.0.0.1
    EnableGlobalScripts=0
    HANodeName=redacted
    NodeAddress=redacted:10051
    Housekeeping parameters:
    [screenshot attached]
    Housekeeping / History Sync (48H):
    [screenshot attached]
    System load for the databases:
    [screenshot attached]

    So yeah, I'm clueless as to what's going on, and I'm pretty sure this isn't normal behavior.
  • cyber
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4811

    #2
    I would start with the DB logs, but I can't say exactly what to search for. Just start from the time that DB load shot up. It may be quite a lot of log data, but grep and grep -v are your friends.
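    As a complement to grepping the logs, a generic pg_stat_activity check on the primary (a sketch only, nothing Zabbix-specific assumed) shows which statements are running and for how long:
    Code:
    -- Long-running, non-idle backends, longest first.
    SELECT pid,
           usename,
           state,
           now() - query_start AS runtime,
           wait_event_type,
           left(query, 80)     AS query_head
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC
    LIMIT 20;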


    • loro64
      Junior Member
      • Oct 2024
      • 7

      #3
      I'm not getting the regular PostgreSQL log output because of Patroni, but the latter shows this in a loop:
      Oct 24 10:35:32 server-name patroni[1536]: 2024-10-24 10:35:32.179 EDT [1536] LOG: checkpoint complete: wrote 146 buffers (0.9%); 0 WAL file(s) added, 0 removed, 33 recycled; write=45.749 s, sync=0.619 s, total=46.470 s; sync files=299, longest=0.213 s, average=0.003 s; distance=540304 kB, estimate=540835 kB; lsn=2DED/DD2C0A18, redo lsn=2DED/BF0124B8
      Oct 24 10:35:33 server-name patroni[2049363]: 2024-10-24 10:35:33.458 EDT [2049363] LOG: could not send data to client: Broken pipe
      Oct 24 10:35:33 server-name patroni[2049363]: 2024-10-24 10:35:33.458 EDT [2049363] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:35:33 server-name patroni[2049363]: 2024-10-24 10:35:33.459 EDT [2049363] FATAL: connection to client lost
      Oct 24 10:35:33 server-name patroni[2049363]: 2024-10-24 10:35:33.459 EDT [2049363] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:35:33 server-name patroni[2049365]: 2024-10-24 10:35:33.464 EDT [2049365] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:35:33 server-name patroni[2049365]: 2024-10-24 10:35:33.464 EDT [2049365] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:35:33 server-name patroni[1534]: 2024-10-24 10:35:33.738 EDT [1534] LOG: background worker "parallel worker" (PID 2049365) exited with exit code 1
      Oct 24 10:35:35 server-name patroni[1536]: 2024-10-24 10:35:35.747 EDT [1536] LOG: checkpoint starting: wal
      Oct 24 10:35:39 server-name patroni[980]: 2024-10-24 10:35:39,686 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:35:49 server-name patroni[980]: 2024-10-24 10:35:49,582 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:35:59 server-name patroni[980]: 2024-10-24 10:35:59,720 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:09 server-name patroni[980]: 2024-10-24 10:36:09,580 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:19 server-name patroni[1536]: 2024-10-24 10:36:19.107 EDT [1536] LOG: checkpoint complete: wrote 119 buffers (0.7%); 0 WAL file(s) added, 0 removed, 33 recycled; write=42.755 s, sync=0.412 s, total=43.360 s; sync files=282, longest=0.062 s, average=0.002 s; distance=541509 kB, estimate=541509 kB; lsn=2DED/FDF14D18, redo lsn=2DED/E00E3BE0
      Oct 24 10:36:19 server-name patroni[980]: 2024-10-24 10:36:19,579 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:23 server-name patroni[1536]: 2024-10-24 10:36:23.790 EDT [1536] LOG: checkpoint starting: wal
      Oct 24 10:36:28 server-name patroni[2049715]: 2024-10-24 10:36:28.558 EDT [2049715] LOG: could not send data to client: Broken pipe
      Oct 24 10:36:28 server-name patroni[2049715]: 2024-10-24 10:36:28.558 EDT [2049715] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:36:28 server-name patroni[2049715]: 2024-10-24 10:36:28.559 EDT [2049715] FATAL: connection to client lost
      Oct 24 10:36:28 server-name patroni[2049715]: 2024-10-24 10:36:28.559 EDT [2049715] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:36:28 server-name patroni[2049718]: 2024-10-24 10:36:28.564 EDT [2049718] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:36:28 server-name patroni[2049718]: 2024-10-24 10:36:28.564 EDT [2049718] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:36:28 server-name patroni[2049719]: 2024-10-24 10:36:28.572 EDT [2049719] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:36:28 server-name patroni[2049719]: 2024-10-24 10:36:28.572 EDT [2049719] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:36:28 server-name patroni[1534]: 2024-10-24 10:36:28.820 EDT [1534] LOG: background worker "parallel worker" (PID 2049719) exited with exit code 1
      Oct 24 10:36:28 server-name patroni[1534]: 2024-10-24 10:36:28.829 EDT [1534] LOG: background worker "parallel worker" (PID 2049718) exited with exit code 1
      Oct 24 10:36:29 server-name patroni[980]: 2024-10-24 10:36:29,580 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:39 server-name patroni[980]: 2024-10-24 10:36:39,588 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:49 server-name patroni[980]: 2024-10-24 10:36:49,580 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:59 server-name patroni[980]: 2024-10-24 10:36:59,580 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:07 server-name patroni[1536]: 2024-10-24 10:37:07.449 EDT [1536] LOG: checkpoint complete: wrote 155 buffers (0.9%); 0 WAL file(s) added, 0 removed, 33 recycled; write=43.149 s, sync=0.406 s, total=43.660 s; sync files=269, longest=0.126 s, average=0.002 s; distance=540757 kB, estimate=541434 kB; lsn=2DEE/1F068428, redo lsn=2DEE/10F9150
      Oct 24 10:37:09 server-name patroni[980]: 2024-10-24 10:37:09,585 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:12 server-name patroni[1536]: 2024-10-24 10:37:12.490 EDT [1536] LOG: checkpoint starting: wal
      Oct 24 10:37:19 server-name patroni[980]: 2024-10-24 10:37:19,581 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:29 server-name patroni[980]: 2024-10-24 10:37:29,581 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:39 server-name patroni[980]: 2024-10-24 10:37:39,581 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:49 server-name patroni[980]: 2024-10-24 10:37:49,581 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:53 server-name patroni[1536]: 2024-10-24 10:37:53.655 EDT [1536] LOG: checkpoint complete: wrote 24 buffers (0.1%); 0 WAL file(s) added, 0 removed, 33 recycled; write=40.035 s, sync=0.959 s, total=41.166 s; sync files=252, longest=0.326 s, average=0.004 s; distance=539709 kB, estimate=541262 kB; lsn=2DEE/3E9A68C0, redo lsn=2DEE/22008708
      Oct 24 10:37:56 server-name patroni[2050118]: 2024-10-24 10:37:56.300 EDT [2050118] LOG: could not send data to client: Broken pipe
      Oct 24 10:37:56 server-name patroni[2050118]: 2024-10-24 10:37:56.300 EDT [2050118] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:37:56 server-name patroni[2050118]: 2024-10-24 10:37:56.301 EDT [2050118] FATAL: connection to client lost
      Oct 24 10:37:56 server-name patroni[2050118]: 2024-10-24 10:37:56.301 EDT [2050118] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:37:56 server-name patroni[2050121]: 2024-10-24 10:37:56.307 EDT [2050121] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:37:56 server-name patroni[2050121]: 2024-10-24 10:37:56.307 EDT [2050121] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:37:56 server-name patroni[2050122]: 2024-10-24 10:37:56.307 EDT [2050122] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:37:56 server-name patroni[2050122]: 2024-10-24 10:37:56.307 EDT [2050122] STATEMENT: select itemid,min(clock) from history group by itemid
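      The statement that keeps failing here, select itemid,min(clock) from history group by itemid, looks like the housekeeper working out the oldest stored value per item, which on a large, bloated history table means a full scan plus an aggregate. A sketch for looking at the plan (plain EXPLAIN only; adding ANALYZE would actually execute the expensive scan, so keep that for off-peak hours):
      Code:
      -- Show the plan for the housekeeper-style scan without running it.
      EXPLAIN
      SELECT itemid, min(clock)
      FROM history
      GROUP BY itemid;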
      More about my patroni config:
      scope: pgcluster-zabbix7-prod
      namespace: /service/
      name: postgresql-0

      restapi:
        listen: masked-ip:8008
        connect_address: masked-ip:8008

      etcd:
        host: masked-ip:23790

      bootstrap:
        initialize: true
        dcs:
          ttl: 30
          loop_wait: 10
          retry_timeout: 10
          maximum_lag_on_failover: 1048576
          postgresql:
            use_pg_rewind: true
            parameters:
              wal_level: replica
              hot_standby: "on"
              max_connections: 100
              max_wal_senders: 10
              max_replication_slots: 10
              wal_keep_size: 64

        initdb:
          - encoding: UTF8
          - data-checksums

        pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
          - host replication replicator masked-ip/32 md5
          - host replication replicator masked-ip/32 md5
          - host replication replicator masked-ip/32 md5
          - host all all masked-ip/26 md5

        users:
          admin:
            password: admin
            options:
              - createrole
              - createdb

      postgresql:
        listen: masked-ip:5432
        connect_address: masked-ip:5432
        data_dir: /var/lib/postgresql/16/main
        bin_dir: /usr/lib/postgresql/16/bin
        authentication:
          replication:
            username: replicator
            password: masked-password
          superuser:
            username: postgres
            password: masked-password
        parameters:
          unix_socket_directories: '/var/run/postgresql'

      tags:
        nofailover: false
        noloadbalance: false
        clonefrom: false
        nosync: false


      • loro64
        Junior Member
        • Oct 2024
        • 7

        #4
        From what I noticed:
        - PostgreSQL vacuuming and housekeeping running at the same time is bad (213 343 607 dead tuples haven't been cleaned up; see the query sketch after this list).
        - Turning off housekeeping brought me back to sub-2 load. I had to manually force a zabbix-server switchover to my second node to make it take effect.
        - The history_uint table doesn't seem to be properly indexed; vacuuming is taking forever even though I gave it some more resources.
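        A generic way to check the bloat and to watch a running vacuum (standard PostgreSQL statistics views, nothing specific to this cluster):
        Code:
        -- Dead tuples per table, worst offenders first.
        SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;

        -- Progress of any vacuum that is currently running.
        SELECT p.pid, c.relname, p.phase, p.heap_blks_scanned, p.heap_blks_total
        FROM pg_stat_progress_vacuum p
        JOIN pg_class c ON c.oid = p.relid;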

        Steps taken so far to correct the situation:
        - Temporarily disabled housekeeping until vacuuming catches up
        - Made autovacuum on this table run faster at the cost of some extra resources
        Code:
        ALTER TABLE history_uint SET (
            autovacuum_vacuum_cost_delay = 2,  -- Lower the delay (faster)
            autovacuum_vacuum_cost_limit = 1000  -- Increase the cost limit (faster)
        );
        It would be far better to run a VACUUM FULL, but since that takes an exclusive lock on the table and I can't afford the downtime, autovacuum will have to do.
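        To confirm the per-table settings actually took effect, a quick catalog check (generic PostgreSQL, not from the thread):
        Code:
        -- Per-table storage/autovacuum options are stored in pg_class.reloptions.
        SELECT relname, reloptions
        FROM pg_class
        WHERE relname = 'history_uint';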

        Next steps:
        - Reindex the history_uint table (see the sketch below)
        - Re-enable housekeeping
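        A sketch of the reindex step, assuming PostgreSQL 12+ so the table stays writable while the indexes are rebuilt (it temporarily needs roughly double the index disk space):
        Code:
        -- Rebuild all indexes on history_uint without blocking reads/writes.
        REINDEX TABLE CONCURRENTLY history_uint;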

        Still taking any feedback or insight anyone can bring my way. I'll keep updating as I make progress on this issue.


        • loro64
          Junior Member
          • Oct 2024
          • 7

          #5
          Better late than never. We ended up using TimescaleDB on a single DB and everything is running peachy now.
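          For anyone taking the same route, the rough shape of the change (a sketch only; the script path in the comment is an assumption based on the Ubuntu zabbix-sql-scripts package, so check the Zabbix 7.0 docs for your distribution):
          Code:
          -- TimescaleDB must already be installed and listed in
          -- shared_preload_libraries before this works.
          CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

          -- Afterwards, apply the TimescaleDB schema script shipped with Zabbix,
          -- which converts the history/trends tables into hypertables, e.g.
          -- (path is an assumption, verify for your package):
          --   cat /usr/share/zabbix-sql-scripts/postgresql/timescaledb/schema.sql | psql zabbix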
