Upgrade 5.4 to 6.0 - Server Services never start, hang on "configuration syncer"

  • cwhite
    Member
    • Aug 2015
    • 46

    #1

    Upgrade 5.4 to 6.0 - Server Services never start, hang on "configuration syncer"

    It seems I can never get a smooth upgrade of our production environment; it lab-tested perfectly with and without HA enabled. Once again I ran into issues that I hope someone else has seen as well. Our environment is all AWS instances: 3 EC2 instances running Ubuntu 20.04 with LXD, each EC2 running a DB container (Galera cluster) and a Zabbix Server container (w/ frontend). We currently manage HA with PCS installed and Route53 health checks for FQDN resolution, with multiple proxies in the wild phoning home to AWS.

    The upgrade of the actual database was not an issue (which is usually where I've had problems with version upgrades in the past); even scripting the primary key table updates and moving the data went smoothly, roughly 4 hours for a 200 GB DB. On my original attempt I shut down all PCS standby nodes (2 of the 3 EC2 instances), so only the 'active' EC2 was online. Once the primary key tables finished and I started the Server with HA enabled, it registered as active but then stopped at the "configuration syncer" process. It never did anything else: no failure, no restart, and no continuing on to the rest of the processes/services the Server needs to run.

    I disabled HA in the conf and restarted; it still registered as active and stopped again at "configuration syncer". I tried the runtime commands to remove the HA node from the DB, but they don't work because the Server is never fully started, so I manually removed the record from the "ha_node" table. After restarting the service it still registered as HA active, but now the record in "ha_node" was listed as localhost. When that record was manually deleted and the Server restarted, it would return, and the Server log again showed HA active and hung on "configuration syncer".
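
    For anyone following along, this is roughly what I was running (a sketch; the node name is a placeholder for whatever is in your ha_node table, and the DB credentials are examples):

        # runtime HA controls - these only work once the server is fully up,
        # which it never was here
        zabbix_server -R ha_status
        zabbix_server -R ha_remove_node=zbx-node-1

        # the manual cleanup I fell back to instead
        mysql -u zabbix -p zabbix -e "SELECT ha_nodeid,name,status FROM ha_node;"
        mysql -u zabbix -p zabbix -e "DELETE FROM ha_node WHERE name='zbx-node-1';"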

    I fully rolled our production environment back to 5.4 (snapshots) and tried again a few days later, this time removing all PCS conf and packages (thinking it possible some of that technology was used under the hood in Zabbix's HA implementation). I upgraded the Server to 6.0 again with no HA enabled in the conf, and it immediately registered in ha_node as active with localhost and hung on configuration syncer (I did not run the primary key table scripts this time). Rolled it all back again…

    Today I tried a fresh container install from packages, on the same LXD host as the DB. Same scenario: it immediately registered as HA active with localhost in the DB and hung on configuration syncer. I did not see any of this behavior in our lab, where I could enable and disable HA and the Server behaved as expected: starting as active or standby if HA was enabled in the conf, and starting with HA disabled and continuing on if it was commented out.
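
    For clarity, enabling/disabling HA in the conf here just means the HANodeName parameter in zabbix_server.conf (the node name below is an example):

        # HA enabled - the server registers in ha_node under this name
        HANodeName=zbx-node-1

        # HA disabled (standalone) - leave it commented out / empty
        # HANodeName=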

    Now I'm at a standstill with the upgrade: the Server never fully boots, and after the DB upgrade there is some remnant from my attempts that believes HA is active even when it is disabled in the conf file. I even did a CREATE TABLE … LIKE copy of the ha_node table, dropped the original, ALTERed the copy into place, etc., to no effect… someone please tell me you've seen/experienced this somewhere!
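
    The table rebuild attempt was along these lines (a sketch in MariaDB syntax; credentials are examples):

        # recreate ha_node from its own definition, then swap the copy in
        mysql -u zabbix -p zabbix -e "CREATE TABLE ha_node_copy LIKE ha_node;"
        mysql -u zabbix -p zabbix -e "DROP TABLE ha_node;"
        mysql -u zabbix -p zabbix -e "ALTER TABLE ha_node_copy RENAME TO ha_node;"

    The Server still re-registered itself as active on the next restart.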

    Of note: when I set the Server to debug level 4, it is constantly in a loop of checking ha_node status, which always returns SUCCEED. This is the same loop where, at normal debug level, startup hangs on configuration syncer.
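
    For reference, that was just DebugLevel in the conf plus watching the log (the log path is the Ubuntu package default; adjust to yours):

        # /etc/zabbix/zabbix_server.conf
        DebugLevel=4

        # then watch the ha_node polling loop
        tail -f /var/log/zabbix/zabbix_server.log | grep -i ha_node
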
  • Glencoe
    Zabbix developer
    • Oct 2019
    • 152

    #2
    Can you please raise an issue at support.zabbix.com?

    • cwhite
      Member
      • Aug 2015
      • 46

      #3
      Just posted; that was my next step, I just wanted to give the community the opportunity first. https://support.zabbix.com/browse/ZBX-20936

      • cwhite
        Member
        • Aug 2015
        • 46

        #4
        Follow-up: there are various issues with the sync process and MariaDB 10.6 reported as bugs. My bug report has links to the other bugs and forum posts… my end solution was to fully tear down the Galera DB cluster (we're running 10.5), restart a single node as standalone (no Galera conf), and then perform the upgrade. That completed as intended, and we then rebuilt the cluster. Now running 6.0.4.
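
        Roughly what that looked like (a sketch, not exact commands; unit names and the galera config location are the Ubuntu/MariaDB package defaults, adjust to your setup):

            # on every node: stop MariaDB so the cluster is fully down
            systemctl stop mariadb

            # on the node to be upgraded: disable Galera and start standalone,
            # e.g. by commenting out the wsrep_* settings in galera.cnf
            # (or setting wsrep_on=OFF), then:
            systemctl start mariadb

            # start zabbix-server; it performs the 5.4 -> 6.0 DB upgrade on boot
            systemctl start zabbix-server

            # once the DB upgrade completes: stop MariaDB, re-enable the wsrep_*
            # settings, and bootstrap a new cluster from this node
            systemctl stop mariadb
            galera_new_cluster

            # finally, start MariaDB on the other nodes so they resync (SST/IST)
            systemctl start mariadb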
