Improving Zabbix server database upgrade 3.0LTS to 5.0LTS

  • prismanet1970
    • Feb 2023

    #1

    Hi everyone,

    I recently took over a Zabbix 3.0LTS infrastructure running on CentOS 7 and Postgres 9.x (with partitioning). The frontend, Zabbix server, and Postgres each run in a separate VM. We have about 1500 hosts, 95k items, and 63k triggers. Trend data is kept for 12 months, and the database is about 250 GB on disk.

    We plan to upgrade to Zabbix 5.0LTS on Ubuntu 20.04; we cannot go to 6.0LTS yet because the OS our remote proxies run on is not supported. We set up a new environment and are currently doing dry runs of the upgrade to make sure everything works. The database upgrade is the step that takes most of the time, essentially because we need to keep the history/trends. Unfortunately, we don't have access to SSDs to maximize disk IO, but we already cut the upgrade down to 2.5 hours by adjusting the PostgreSQL configuration. We'd like to reduce the downtime further.
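
    For reference, here is a minimal sketch of the kind of PostgreSQL settings commonly relaxed for a one-off bulk window like this (the values are illustrative, not our exact production ones; revert them after the upgrade):

        -- Illustrative values only; size to your RAM/disk and revert afterwards.
        -- ALTER SYSTEM needs PostgreSQL 9.4+; on older 9.x, edit postgresql.conf instead.
        ALTER SYSTEM SET maintenance_work_mem = '1GB';        -- faster index rebuilds
        ALTER SYSTEM SET max_wal_size = '8GB';                -- fewer checkpoints during bulk writes
        ALTER SYSTEM SET checkpoint_completion_target = 0.9;  -- spread checkpoint IO
        ALTER SYSTEM SET synchronous_commit = off;            -- acceptable risk during a dry run
        SELECT pg_reload_conf();                              -- all of the above apply without a restart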

    We analyzed the server log during the upgrade; two steps account for most of the time.
    • Step at the 3% mark (~60 min): slow queries like "select source,object,objectid,eventid,value from events where eventid>607307518 and source in (0,3) order by eventid limit 10000"
    • Step at the 16% mark (~90 min): slow queries like "update alerts set p_eventid=786671521 where eventid=786678760;"
    Do you have any ideas on how to speed up these two tasks (or even just one of them)?
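
    For anyone profiling these on a restored copy, a quick way to see whether the first query walks the eventid index or falls back to a scan-and-sort (reusing the eventid value from the log above):

        EXPLAIN (ANALYZE, BUFFERS)
        SELECT source, object, objectid, eventid, value
        FROM events
        WHERE eventid > 607307518 AND source IN (0, 3)
        ORDER BY eventid
        LIMIT 10000;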

    One approach I'm looking at (still to be tested):
    1. A few days BEFORE the migration, take a backup of the DB and upgrade it on the new instance.
    2. On migration day, take another backup of the configuration and of the history/trends/alerts written since the first backup (sketched below).
      • Would Zabbix detect that the upgrade has already been done and ONLY update the latest entries, or would it redo the full tables?
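
    A rough sketch of the delta export in step 2, run from psql (the cutoff 1672531200 is a placeholder for the Unix timestamp of the first backup, and the other history_* and trends_uint tables would need the same treatment):

        \copy (SELECT * FROM history WHERE clock > 1672531200) TO 'history_delta.csv' WITH CSV
        \copy (SELECT * FROM trends WHERE clock > 1672531200) TO 'trends_delta.csv' WITH CSV
        \copy (SELECT * FROM alerts WHERE clock > 1672531200) TO 'alerts_delta.csv' WITH CSV
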
    Any ideas you may have from past experience?

    Regards,
    Sylvain
  • prismanet1970
    • Feb 2023

    #2
    Hi everyone,

    After reviewing the database content, I noticed that the events table had about 245M rows. Looking more closely at them, there were two types of records:
    • Source 0: events caused by triggers
    • Source 3: internal events (like bad calculations because of invalid values)
    Source 3 events accounted for 90% of the table. In my lab environment, I restored the v3 database and deleted the Source 3 entries older than 7 days (sketch below).
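
    For reference, a sketch of that check and cleanup (clock is a Unix timestamp; as above, run it on a restored copy first):

        -- Break the events table down by source (0 = trigger events, 3 = internal events)
        SELECT source, COUNT(*) AS n
        FROM events
        GROUP BY source
        ORDER BY n DESC;

        -- Drop internal events older than 7 days
        DELETE FROM events
        WHERE source = 3
          AND clock < EXTRACT(EPOCH FROM NOW() - INTERVAL '7 days');
        -- Refresh stats and mark the dead rows reusable
        VACUUM ANALYZE events;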

    I did the same for the alerts table. It was smaller, but I still deleted the alerts older than 90 days (same pattern, below).
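
    The equivalent statement for alerts:

        -- Keep only the last 90 days of alerts
        DELETE FROM alerts
        WHERE clock < EXTRACT(EPOCH FROM NOW() - INTERVAL '90 days');
        VACUUM ANALYZE alerts;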

    Result: DB upgrade took 20 minutes :-)

    That should be enough for me for this upgrade.

    Hope this helps someone else.

    Regards,
    Sylvain
