Errors after upgrading Postgres and TimescaleDB

  • cstackpole
    Senior Member
    Zabbix Certified Specialist
    • Oct 2006
    • 225

    #1

    Errors after upgrading Postgres and TimescaleDB

    Greetings,

    I had a working Zabbix 5.0.2 system with PostgreSQL 9.2 on Scientific Linux 7. However, after the upgrade to Zabbix 5, server performance TANKED. It was nearly unusable for anything but basic queries. This is mostly because it runs on my tiny home system, where I do silly things before I try to do silly things at work. :-D Seriously, it has a lot of history that I don't want to lose, but it is just a quad-core Atom with 4GB of memory. So yeah, it's underpowered. But it's a home lab. :-)

    Anyway...

    I've been wanting to upgrade from 9.2 for a while, and there's potential to upgrade to 11 at work, so I thought I would give it a go at home first. The upgrade to PostgreSQL 11 was really smooth, and things seemed to work better. But if I went to Monitoring -> Latest data and tried to show all my hosts, it would absolutely time out. I could do one host at a time, or two of the hosts with fewer items, but nothing more than that. I was looking for ways to increase performance when I found this...

    https://blog.zabbix.com/upgrading-za...scaledb/11015/

    Hooray! The steps that the author, Alexander Petrov-Gavrilov, uses to upgrade to PostgreSQL 12 are pretty much the exact steps I used to upgrade to 11. Since I was already running 5.0.2, I skipped that bit. Then I followed the instructions to install TimescaleDB. Things appeared to have gone smoothly and I thought all was good... But it has now been running for a few hours and there are issues.

    1. "Database history tables upgraded -> No."
    Um...Why? Yes, I've stopped, started, and restarted the zabbix-server multiple times since this process began. I'm not sure why it won't update the tables and change to yes.
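    If I understand that flag right (and this is my guess, not something I've confirmed), it is about the new 5.0 extended-range float columns rather than about TimescaleDB itself, so the thing to look at is the data type of the value columns:

    Code:
    zabbix=# SELECT table_name, column_name, data_type
    zabbix-#   FROM information_schema.columns
    zabbix-#  WHERE table_name IN ('history', 'trends') AND column_name LIKE 'value%';
    -- 'numeric' here would mean the optional extended-range (float64) upgrade has
    -- not been applied to these tables yet; 'double precision' would mean it has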

    2. History items aren't updating. If I stop and then start zabbix-server, they will all update ONCE. After that, I get loads of errors like this in my zabbix-server log:

    Code:
    22337:20200814:181840.856 [Z3005] query failed: [0] PGRES_FATAL_ERROR:ERROR: "history" is not a hypertable or a continuous aggregate view
    HINT: It is only possible to drop chunks from a hypertable or continuous aggregate view
    [SELECT drop_chunks(1589671120,'history')]
    22337:20200814:181840.856 cannot drop chunks for history
    22337:20200814:181840.863 [Z3005] query failed: [0] PGRES_FATAL_ERROR:ERROR: "history_uint" is not a hypertable or a continuous aggregate view
    HINT: It is only possible to drop chunks from a hypertable or continuous aggregate view
    [SELECT drop_chunks(1589671120,'history_uint')]
    22337:20200814:181840.864 cannot drop chunks for history_uint
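    Those drop_chunks failures imply TimescaleDB does not think history and history_uint are hypertables at all. A quick way to confirm what it does consider hypertables is something like this (view name as it is on my TimescaleDB 1.x install; it may differ on other versions):

    Code:
    -- list every table TimescaleDB currently treats as a hypertable
    zabbix=# SELECT table_schema, table_name FROM timescaledb_information.hypertable;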
    Well... that's interesting: TimescaleDB thinks history and history_uint aren't hypertables at all. So I tried to rerun the conversion.
    Code:
    $ sudo systemctl stop zabbix-server
    $ cd /usr/share/doc/zabbix-server-pgsql-5.0.2/
    $ zcat timescaledb.sql | sudo -u zabbix psql zabbix
    NOTICE: migrating data to chunks
    DETAIL: Migration might take a while depending on the amount of data.
    ERROR: NULL value in column "clock" violates not-null constraint
    HINT: Columns used for time partitioning cannot be NULL
    NOTICE: migrating data to chunks
    DETAIL: Migration might take a while depending on the amount of data.
    ERROR: NULL value in column "clock" violates not-null constraint
    HINT: Columns used for time partitioning cannot be NULL
    ERROR: table "history_log" is already a hypertable
    ERROR: table "history_text" is already a hypertable
    ERROR: table "history_str" is already a hypertable
    ERROR: table "trends" is already a hypertable
    ERROR: table "trends_uint" is already a hypertable
    UPDATE 1
    UPDATE 1
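    If I'm reading those first two errors right, history and history_uint are the migrations choking on NULL clock values, so something like this should show whether rows are blocking the conversion (assuming the NULLs really are in the clock column):

    Code:
    zabbix=# SELECT count(*) FROM history WHERE clock IS NULL;
    zabbix=# SELECT count(*) FROM history_uint WHERE clock IS NULL;
    -- any non-zero count has to be cleaned out before
    -- create_hypertable(..., migrate_data => true) can partition on clock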
    I don't see history or history_uint among those "already a hypertable" errors, which makes me wonder whether the data in history is even intact...
    Code:
    zabbix=# select * from history;
    ERROR: compressed data is corrupted
    Errr.... that's not good. :-/

    3. Pick an item in Latest data, something that updates every minute or two, and go to "View as: 500 latest values". It should be updating, right? Watch the graph... it is updating... so why are the latest values 10, 15, 30, 45, or more minutes old? Keep refreshing... open it in a completely different browser... nope, still the same stale data... Refresh... Refresh... Refresh... OH! There are ALL the values that should have been there.... WTF?

    And I can probably guess one of the first responses: a quad-core, 4GB system? Psshhh, that can't handle the load! Yeah, well, I previously might not have argued with that... but since the jump to PostgreSQL 11 and TimescaleDB the load is trivial... it isn't breaking a sweat.

    I'm going to keep poking at it for a while, but I thought I'd toss this out and see if anyone has suggestions. I may just take the hit on the data for the past few hours and try to roll back to the known working 9.2 data and see if I can do this process again. Unsure at this time.

    Thoughts?
    Thanks


  • cstackpole
    Senior Member
    Zabbix Certified Specialist
    • Oct 2006
    • 225

    #2
    Greetings,

    Well, after a few hours of digging into this issue, I've got a few updates.

    1. Still broken. Thoughts/advice appreciated.

    2. From the errors, it looks like the pg_upgrade process had issues that it didn't flag. I found several reports of people all moving from 9.x to 11 who had similar errors.

    history_uint was completely shot. Only the first 1654 entries were good; the rest were just broken. I spent an hour cleaning bad rows one by one and still had fewer than 1700 good rows out of an unknown total (I couldn't even count the entries because the table was so broken), so I just took the hit and dumped the contents of the entire table. The primary error I was hitting here was "MultiXactId XXXXXXXXX has not been created yet -- apparent wraparound", which required me to find the bad row with an offset, then look up its ctid to delete it. Some of them I couldn't delete because the row supposedly didn't exist, so I had to force an entry in and then delete it. The process was slow and painful and not worth it for me on this home system.
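    For anyone hitting the same wraparound error, the hunt looked roughly like this; the offset and the ctid below are made-up placeholders, and you have to creep up on the broken tuple by hand:

    Code:
    -- read one row at a time until the SELECT blows up; the last offset that still
    -- works tells you roughly where the bad tuple sits (1654 happened to be mine)
    zabbix=# SELECT ctid, itemid, clock FROM history_uint OFFSET 1654 LIMIT 1;
    -- then remove the broken tuple directly by the ctid you worked out
    -- ('(4242,7)' is a placeholder, not a value from my table)
    zabbix=# DELETE FROM history_uint WHERE ctid = '(4242,7)';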

    history also had corruption. However, I was able to use the chk function from https://dba.stackexchange.com/questi...corrupt-errors to find and delete all the bad ctid entries. THEN I was getting errors about NULL values: there were 10 entries in history with NULLs that shouldn't have been there (probably further corruption). THEN I got "ERROR: row is too big: size 51792, maximum size 8160", so I sorted by clock and found a bunch of entries whose clock values were negative, or 0, or 1, or whatever. I went and found what looked like the start of the real clock values (a big 8-digit number) and deleted everything smaller than it. That cleaned up 100,000,000 entries! (And that was still only a small portion of the table.) THEN I got errors about indexes. GRRR!!! But the suggestion I found online was to run a full vacuum. It kept finding bad entries, which was annoying because it died after each one... so after EIGHT long, slow attempts where I just deleted the offending entry and reran it, it finally finished. FINALLY!!
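    Condensed down, the rest of the history cleanup was roughly this (the clock cutoff is a placeholder for the 8-digit value I eyeballed):

    Code:
    -- rows with a NULL clock can't be partitioned and had to go
    zabbix=# DELETE FROM history WHERE clock IS NULL;
    -- garbage rows with impossible timestamps; use whatever your first sane clock
    -- value actually is, the cutoff here is only a placeholder
    zabbix=# DELETE FROM history WHERE clock < 10000000;
    -- rewrite the table and rebuild the indexes; I had to rerun this after
    -- deleting each bad tuple it tripped over
    zabbix=# VACUUM FULL VERBOSE history;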

    Now that both of those tables are "clean", I reran create_hypertable on both (from the Zabbix-provided script). AND IT WORKED! Whoo!!!
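    For reference, the relevant calls look roughly like this; the chunk interval is my assumption, so lift the exact lines from the timescaledb.sql you ran earlier rather than trusting my memory:

    Code:
    -- recreate the two missing hypertables the same way the Zabbix script does,
    -- keeping the existing rows (the chunk_time_interval value is an assumption)
    zabbix=# SELECT create_hypertable('history', 'clock', chunk_time_interval => 86400, migrate_data => true);
    zabbix=# SELECT create_hypertable('history_uint', 'clock', chunk_time_interval => 86400, migrate_data => true);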

    Now all the errors in zabbix_server.log are gone. Whew!

    3. With all that cleaned up, this odd behavior is gone. However, some items aren't updating at all except on a restart of zabbix-server. Still digging into this.

    I still have issues I'd like thoughts on. However, this is just further reinforcement to have a clean backup before you start this process (I have one; I just wanted the learning experience of fighting through it), and to verify data integrity AFTER upgrading the DB and BEFORE starting to add new things to it.
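    On that last point, the cheapest integrity check I know of right after a pg_upgrade is to dump the database to nowhere, since pg_dump has to read every row and will error out on corrupted tuples:

    Code:
    # any "ERROR: ..." output here means a table still has broken tuples
    $ sudo -u postgres pg_dump zabbix > /dev/null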

    Thanks!
