I'm in way over my head: Housekeeping for over 24h and it's overloading my database

  • loro64
    Junior Member
    • Oct 2024
    • 7

    #1

    I'm in way over my head: Housekeeping for over 24h and it's overloading my database

    It's happening multiple times a week: my Zabbix setup's housekeeper processes go to 100% busy and the load on my database gets way too high.
    Here's some info about my setup:
    • We're running 2 web/backend nodes in HA, on a 3-node PostgreSQL cluster (HA with Patroni)
    • PostgreSQL version is: psql (16.4 (Ubuntu 16.4-1.pgdg22.04+1))
    • Zabbix server version 7.0.3
    • Number of hosts (enabled/disabled) 1676
    • Number of items (enabled/disabled/not supported) 194923
    • Number of triggers (enabled/disabled [problem/ok]) 118752
    • Required server performance, new values per second 3020.79
    • VPS (values per second): around 2600
    I think I've screwed something up and I'm really in need of assistance. Here's everything in the server config that we have with a value set:
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=0
    DebugLevel=2
    PidFile=/run/zabbix/zabbix_server.pid
    SocketDir=/run/zabbix
    DBHost=redacted
    DBName=zabbix
    DBUser=redacted
    DBPassword=redacted
    DBPort=5001
    StartPollers=10
    StartAgentPollers=5
    StartPollersUnreachable=10
    StartPingers=5
    SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
    CacheSize=1G
    HistoryIndexCacheSize=128M
    TrendCacheSize=24M
    ValueCacheSize=128M
    Timeout=4
    ExternalScripts=/usr/lib/zabbix/externalscripts
    FpingLocation=/usr/bin/fping
    Fping6Location=/usr/bin/fping6
    LogSlowQueries=3000
    StatsAllowedIP=127.0.0.1
    EnableGlobalScripts=0
    HANodeName=redacted
    NodeAddress=redacted:10051
    Housekeeping parameters:
    [screenshot attached]
    Housekeeping / History Sync (48H):
    [screenshot attached]
    System load for the databases:
    [screenshot attached]

    So yeah, I'm clueless as to what's going on, and I'm pretty sure this isn't normal behavior.
  • cyber
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Dec 2006
    • 4811

    #2
    I would start with the DB logs, but I can't say exactly what to search for. Just start from the time that DB load shot up. It may be quite a lot of log data, but grep and grep -v are your friends.
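    As a complement to grepping the logs, a generic pg_stat_activity check on the primary (a sketch only, nothing Zabbix-specific assumed) shows which statements are running and for how long:
    Code:
    -- Long-running, non-idle backends, longest first.
    SELECT pid,
           usename,
           state,
           now() - query_start AS runtime,
           wait_event_type,
           left(query, 80)     AS query_head
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC
    LIMIT 20;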


    • loro64
      Junior Member
      • Oct 2024
      • 7

      #3
      I'm not getting the regular PostgreSQL log output because of Patroni, but the latter shows this in a loop:
      Oct 24 10:35:32 server-name patroni[1536]: 2024-10-24 10:35:32.179 EDT [1536] LOG: checkpoint complete: wrote 146 buffers (0.9%); 0 WAL file(s) added, 0 removed, 33 recycled; write=45.749 s, sync=0.619 s, total=46.470 s; sync files=299, longest=0.213 s, average=0.003 s; distance=540304 kB, estimate=540835 kB; lsn=2DED/DD2C0A18, redo lsn=2DED/BF0124B8
      Oct 24 10:35:33 server-name patroni[2049363]: 2024-10-24 10:35:33.458 EDT [2049363] LOG: could not send data to client: Broken pipe
      Oct 24 10:35:33 server-name patroni[2049363]: 2024-10-24 10:35:33.458 EDT [2049363] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:35:33 server-name patroni[2049363]: 2024-10-24 10:35:33.459 EDT [2049363] FATAL: connection to client lost
      Oct 24 10:35:33 server-name patroni[2049363]: 2024-10-24 10:35:33.459 EDT [2049363] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:35:33 server-name patroni[2049365]: 2024-10-24 10:35:33.464 EDT [2049365] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:35:33 server-name patroni[2049365]: 2024-10-24 10:35:33.464 EDT [2049365] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:35:33 server-name patroni[1534]: 2024-10-24 10:35:33.738 EDT [1534] LOG: background worker "parallel worker" (PID 2049365) exited with exit code 1
      Oct 24 10:35:35 server-name patroni[1536]: 2024-10-24 10:35:35.747 EDT [1536] LOG: checkpoint starting: wal
      Oct 24 10:35:39 server-name patroni[980]: 2024-10-24 10:35:39,686 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:35:49 server-name patroni[980]: 2024-10-24 10:35:49,582 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:35:59 server-name patroni[980]: 2024-10-24 10:35:59,720 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:09 server-name patroni[980]: 2024-10-24 10:36:09,580 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:19 server-name patroni[1536]: 2024-10-24 10:36:19.107 EDT [1536] LOG: checkpoint complete: wrote 119 buffers (0.7%); 0 WAL file(s) added, 0 removed, 33 recycled; write=42.755 s, sync=0.412 s, total=43.360 s; sync files=282, longest=0.062 s, average=0.002 s; distance=541509 kB, estimate=541509 kB; lsn=2DED/FDF14D18, redo lsn=2DED/E00E3BE0
      Oct 24 10:36:19 server-name patroni[980]: 2024-10-24 10:36:19,579 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:23 server-name patroni[1536]: 2024-10-24 10:36:23.790 EDT [1536] LOG: checkpoint starting: wal
      Oct 24 10:36:28 server-name patroni[2049715]: 2024-10-24 10:36:28.558 EDT [2049715] LOG: could not send data to client: Broken pipe
      Oct 24 10:36:28 server-name patroni[2049715]: 2024-10-24 10:36:28.558 EDT [2049715] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:36:28 server-name patroni[2049715]: 2024-10-24 10:36:28.559 EDT [2049715] FATAL: connection to client lost
      Oct 24 10:36:28 server-name patroni[2049715]: 2024-10-24 10:36:28.559 EDT [2049715] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:36:28 server-name patroni[2049718]: 2024-10-24 10:36:28.564 EDT [2049718] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:36:28 server-name patroni[2049718]: 2024-10-24 10:36:28.564 EDT [2049718] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:36:28 server-name patroni[2049719]: 2024-10-24 10:36:28.572 EDT [2049719] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:36:28 server-name patroni[2049719]: 2024-10-24 10:36:28.572 EDT [2049719] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:36:28 server-name patroni[1534]: 2024-10-24 10:36:28.820 EDT [1534] LOG: background worker "parallel worker" (PID 2049719) exited with exit code 1
      Oct 24 10:36:28 server-name patroni[1534]: 2024-10-24 10:36:28.829 EDT [1534] LOG: background worker "parallel worker" (PID 2049718) exited with exit code 1
      Oct 24 10:36:29 server-name patroni[980]: 2024-10-24 10:36:29,580 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:39 server-name patroni[980]: 2024-10-24 10:36:39,588 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:49 server-name patroni[980]: 2024-10-24 10:36:49,580 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:36:59 server-name patroni[980]: 2024-10-24 10:36:59,580 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:07 server-name patroni[1536]: 2024-10-24 10:37:07.449 EDT [1536] LOG: checkpoint complete: wrote 155 buffers (0.9%); 0 WAL file(s) added, 0 removed, 33 recycled; write=43.149 s, sync=0.406 s, total=43.660 s; sync files=269, longest=0.126 s, average=0.002 s; distance=540757 kB, estimate=541434 kB; lsn=2DEE/1F068428, redo lsn=2DEE/10F9150
      Oct 24 10:37:09 server-name patroni[980]: 2024-10-24 10:37:09,585 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:12 server-name patroni[1536]: 2024-10-24 10:37:12.490 EDT [1536] LOG: checkpoint starting: wal
      Oct 24 10:37:19 server-name patroni[980]: 2024-10-24 10:37:19,581 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:29 server-name patroni[980]: 2024-10-24 10:37:29,581 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:39 server-name patroni[980]: 2024-10-24 10:37:39,581 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:49 server-name patroni[980]: 2024-10-24 10:37:49,581 INFO: no action. I am (postgresql-0), the leader with the lock
      Oct 24 10:37:53 server-name patroni[1536]: 2024-10-24 10:37:53.655 EDT [1536] LOG: checkpoint complete: wrote 24 buffers (0.1%); 0 WAL file(s) added, 0 removed, 33 recycled; write=40.035 s, sync=0.959 s, total=41.166 s; sync files=252, longest=0.326 s, average=0.004 s; distance=539709 kB, estimate=541262 kB; lsn=2DEE/3E9A68C0, redo lsn=2DEE/22008708
      Oct 24 10:37:56 server-name patroni[2050118]: 2024-10-24 10:37:56.300 EDT [2050118] LOG: could not send data to client: Broken pipe
      Oct 24 10:37:56 server-name patroni[2050118]: 2024-10-24 10:37:56.300 EDT [2050118] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:37:56 server-name patroni[2050118]: 2024-10-24 10:37:56.301 EDT [2050118] FATAL: connection to client lost
      Oct 24 10:37:56 server-name patroni[2050118]: 2024-10-24 10:37:56.301 EDT [2050118] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:37:56 server-name patroni[2050121]: 2024-10-24 10:37:56.307 EDT [2050121] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:37:56 server-name patroni[2050121]: 2024-10-24 10:37:56.307 EDT [2050121] STATEMENT: select itemid,min(clock) from history group by itemid
      Oct 24 10:37:56 server-name patroni[2050122]: 2024-10-24 10:37:56.307 EDT [2050122] FATAL: terminating background worker "parallel worker" due to administrator command
      Oct 24 10:37:56 server-name patroni[2050122]: 2024-10-24 10:37:56.307 EDT [2050122] STATEMENT: select itemid,min(clock) from history group by itemid
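      The statement that keeps failing here, select itemid,min(clock) from history group by itemid, looks like the housekeeper working out the oldest stored value per item, which on a large, bloated history table means a full scan plus an aggregate. A sketch for looking at the plan (plain EXPLAIN only; adding ANALYZE would actually execute the expensive scan, so keep that for off-peak hours):
      Code:
      -- Show the plan for the housekeeper-style scan without running it.
      EXPLAIN
      SELECT itemid, min(clock)
      FROM history
      GROUP BY itemid;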
      More about my patroni config:
      scope: pgcluster-zabbix7-prod
      namespace: /service/
      name: postgresql-0

      restapi:
        listen: masked-ip:8008
        connect_address: masked-ip:8008

      etcd:
        host: masked-ip:23790

      bootstrap:
        initialize: true
        dcs:
          ttl: 30
          loop_wait: 10
          retry_timeout: 10
          maximum_lag_on_failover: 1048576
          postgresql:
            use_pg_rewind: true
            parameters:
              wal_level: replica
              hot_standby: "on"
              max_connections: 100
              max_wal_senders: 10
              max_replication_slots: 10
              wal_keep_size: 64

        initdb:
          - encoding: UTF8
          - data-checksums

        pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
          - host replication replicator masked-ip/32 md5
          - host replication replicator masked-ip/32 md5
          - host replication replicator masked-ip/32 md5
          - host all all masked-ip/26 md5

        users:
          admin:
            password: admin
            options:
              - createrole
              - createdb

      postgresql:
        listen: masked-ip:5432
        connect_address: masked-ip:5432
        data_dir: /var/lib/postgresql/16/main
        bin_dir: /usr/lib/postgresql/16/bin
        authentication:
          replication:
            username: replicator
            password: masked-password
          superuser:
            username: postgres
            password: masked-password
        parameters:
          unix_socket_directories: '/var/run/postgresql'

      tags:
        nofailover: false
        noloadbalance: false
        clonefrom: false
        nosync: false


      • loro64
        Junior Member
        • Oct 2024
        • 7

        #4
        From what I noticed:
        - PostgreSQL vacuuming and housekeeping running at the same time is bad (213 343 607 dead tuples haven't been cleaned up; see the query sketch after this list).
        - Turning off housekeeping brought me back to sub-2 load. I had to manually force a zabbix-server switchover to my second node to make it take effect.
        - The history_uint table doesn't seem to be properly indexed; vacuuming is taking forever even though I gave it some more resources.
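        A generic way to check the bloat and to watch a running vacuum (standard PostgreSQL statistics views, nothing specific to this cluster):
        Code:
        -- Dead tuples per table, worst offenders first.
        SELECT relname, n_dead_tup, n_live_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;

        -- Progress of any vacuum that is currently running.
        SELECT p.pid, c.relname, p.phase, p.heap_blks_scanned, p.heap_blks_total
        FROM pg_stat_progress_vacuum p
        JOIN pg_class c ON c.oid = p.relid;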

        Steps taken so far to correct the situation:
        - Temporarily disabled housekeeping until vacuuming catches up
        - Made autovacuum on this table run faster at the cost of some extra resources
        Code:
        ALTER TABLE history_uint SET (
            autovacuum_vacuum_cost_delay = 2,  -- Lower the delay (faster)
            autovacuum_vacuum_cost_limit = 1000  -- Increase the cost limit (faster)
        );
        It would be far better to run a VACUUM FULL, but since that takes an exclusive lock on the table and I can't afford the downtime, autovacuum will have to do.
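        To confirm the per-table settings actually took effect, a quick catalog check (generic PostgreSQL, not from the thread):
        Code:
        -- Per-table storage/autovacuum options are stored in pg_class.reloptions.
        SELECT relname, reloptions
        FROM pg_class
        WHERE relname = 'history_uint';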

        Next steps:
        - Reindex the history_uint table (see the sketch below)
        - Re-enable housekeeping
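        A sketch of the reindex step, assuming PostgreSQL 12+ so the table stays writable while the indexes are rebuilt (it temporarily needs roughly double the index disk space):
        Code:
        -- Rebuild all indexes on history_uint without blocking reads/writes.
        REINDEX TABLE CONCURRENTLY history_uint;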

        Still taking any feedback or insight anyone can bring my way. I'll keep updating as I make progress on this issue.


        • loro64
          Junior Member
          • Oct 2024
          • 7

          #5
          Better late than never. We ended up using TimescaleDB on a single DB and everything is running peachy now.
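          For anyone taking the same route, the rough shape of the change (a sketch only; the script path in the comment is an assumption based on the Ubuntu zabbix-sql-scripts package, so check the Zabbix 7.0 docs for your distribution):
          Code:
          -- TimescaleDB must already be installed and listed in
          -- shared_preload_libraries before this works.
          CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

          -- Afterwards, apply the TimescaleDB schema script shipped with Zabbix,
          -- which converts the history/trends tables into hypertables, e.g.
          -- (path is an assumption, verify for your package):
          --   cat /usr/share/zabbix-sql-scripts/postgresql/timescaledb/schema.sql | psql zabbix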
