Ad Widget

Collapse

Performance issues after Zabbix 5.0 and MariaDB Upgrade

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • jameskirsop
    Member
    • Jul 2018
    • 32

    #1

    Performance issues after Zabbix 5.0 and MariaDB Upgrade

    I recently upgraded Zabbix to 5.0 (from 4.0) and as such needed to upgrade MariaDB from 5.5 (Default for CentOS 7 base repo) to 10.2.32 (from the MariaDB repo).

    Since upgrading we've had a number of performance-like problems, but I've not been able to find anything useful in zabbix_server.log or mariadb.log that points to the cause.

    Problems include:
    - Seeing the 'Zabbix server is not running' error message on the web front end (I've checked that zabbix_server processes are running, no errors logged to the server log file)
    - Having some (or occasionally all) proxies report that they're unable to connect to the Zabbix Server due to TCP errors: "cannot send heartbeat message to server at "zabbix.daraco.com.au": ZBX_TCP_READ() timed out"
    - I'm also seeing occasional SQL write error messages dumped as errors on the web GUI in various locations
    - Periods where the history write cache is filled every hour - for about 10-15 minutes (this sometimes correlates to a big jump in the Zabbix queue to 100K of items in the queue see attached graph)

    These issues were not present before we did the version upgrade.

    To address the problem I've attempted the following:
    - Increasing the allocated physical RAM on the host by 4GB (since doing this I've seen my load averages decrease by about 30% and typical CPU I/o wait times decrease a little too)
    - Increasing the allocated RAM setting on innodb_buffer_pool_size from 8G to 10G
    - Investigated changing the innodb_io_capacity setting from the default of 200, but this is being deprecated in MariaDB 10.5, so I'm not sure of its effectiveness in the MariaDB 10.2??
    - Increased the following Zabbix server parameters:
    -- CacheSize from 384 to 512M
    -- HistoryCacheSize from 24 to 32M

    The 'Value Cache effectiveness' graph shows relatively few misses, so I'm thinking that I've got the cacheing parameters set correctly.

    Some basic details about the server:
    - 6 cores
    - 24GB RAM (as it's running a few other things as well as Zabbix/MariaDB)
    - MySQL data store is close to 800GB (history_uint table size is 263GB, events about 180GB). /var/lib/mysql has its own dedicated volume.

    If I do a show processlist at the mysql CLI, I see results including this:
    | Id | User | Host | db | Command | Time | State | Info | Progress |
    | 2474 | zabbix_server | localhost | zabbix_prod | Query | 5801 | Sending data | SELECT DISTINCT e.eventid,e.clock,e.ns,e.objectid,e.acknowledged,e r1.r_eventid FROM events e LEFT JO | 0.000 |
    | 2475 | zabbix_server | localhost | zabbix_prod | Query | 5801 | Sending data | SELECT DISTINCT e.eventid,e.clock,e.ns,e.objectid,e.acknowledged,e r1.r_eventid FROM events e LEFT JO | 0.000 |


    Execution times of over 5801 (ie. an hour and a half) aren't ideal, so I'm imagining something is up there that will be causing performance bottlenecks, but I'm not sure how this would compare to other Zabbix instances.

    TLDR;
    I'm wondering there were significant changes to MariaDB between 5.5 and 10.2 that I need to tweak configuration for, or if there are things I can do to the Zabbix configuration to improve performance.

    At this stage, I expect that my MariaDB config will need tweaking, but I'm not really sure where to start - given things were going quite well prior to the upgrade. I'm making this assumption on the 3rd problem listed above. However, since I'm not seeing it reproduced by any other process or task, I'm unsure of the foundations of this thought.
    Attached Files
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    I haven't upgraded our site to Zabbix 5.x yet, mainly because any major upgrade always has unexpected issues and I wanted to wait for other people to find and report those. I initially thought I would wait for a 5.0.3 or 5.0.4 release before updating, but I may wait a little longer than that.

    I have however already upgraded our Zabbix 4.4.x install to use MariaDB 10.2.x (on RHEL 7.x), as part of my prep-work for getting ready for Zabbix 5.x. I also fixed up some issues with the schema that had been introduced by product updates over the years, and I converted all of the tables to use the new row_format=DYNAMIC. At least for my site's relatively small Zabbix install, the move from MariaDB 5.5.x to 10.2.x hasn't introduced any performance issues that I've noticed. It's possible that I just haven't noticed them because my site's DB is lightly loaded, but I think it's more likely that 10.2.x is working just fine with Zabbix 4.4.x.

    There are certainly big changes between MariaDB 5.5 and MariaDB 10.x, but most of the reports I've seen, including comments from Zabbix developers, are that MariaDB 10.x has some beneficial performance enhancements over MariaDB 5.x. There can always be outliers, but the general impression I've gotten from the research I've done indicates that MariaDB 10.x performance should be the same or slightly better in many cases, and potentially much better in a few very specific cases.

    Since 5.x was released, there have been a few reports on these forums about performance issues with 5.x. I don't know this for certain (again: not running Zabbix 5.x yet), but there have been enough reports here that I'm suspicious that there are some query issues that are new with 5.x. If my memory serves, I think some of the posts about performance issues were with MySQL/MariaDB as a backend, and some were with PostgreSQL as a backend, so if that's true, that furthers my guess that there may be some query optimization or changes that are still needed for the new release.

    It's totally possible that the issues you're seeing are indeed something specific to your install or some bit of misconfiguration with your MariaDB install, but my speculation at this point is that the issue is something specific to Zabbix 5.x.

    Comment

    • jameskirsop
      Member
      • Jul 2018
      • 32

      #3
      Thanks, Tim. Your whole post is rather reassuring. It's good to know that you're on 4.4 with MDB 10.2 without any issues.

      Originally posted by tim.mooney
      I have however already upgraded our Zabbix 4.4.x install to use MariaDB 10.2.x (on RHEL 7.x), as part of my prep-work for getting ready for Zabbix 5.x. I also fixed up some issues with the schema that had been introduced by product updates over the years, and I converted all of the tables to use the new row_format=DYNAMIC.
      Do you have a series of specific steps / queries that you ran to make these changes?? That might help in getting some of my issues resolved and make Zabbix more efficient.

      One thing I forgot to mention is the collation of my tables doesn't match the one supported by Zabbix, but I've not had any prior performance problems because of that so I've not been too concerned to address it until now. The specific error message is:

      Code:
      53633:20200716:134608.250 Zabbix supports only "utf8_bin" collation. Database "zabbix_prod" has default collation "utf8_general_ci"
      53633:20200716:134608.293 character set name or collation name that is not supported by Zabbix found in 421 column(s) of database "zabbix_prod"
      53633:20200716:134608.294 only character set "utf8" and collation "utf8_bin" should be used in database
      I've just also noticed some 'query failed: [1205] Lock wait timeout exceeded' errors in the Zabbix log (often happening on the hosts table and early in the morning, so not when I've noticed issues when at my desk). Given the times of the errors don't usually correlate to when I'm seeing other issues, I'm not convinced they're related.

      Comment

      • tim.mooney
        Senior Member
        • Dec 2012
        • 1427

        #4
        Originally posted by jameskirsop
        One thing I forgot to mention is the collation of my tables doesn't match the one supported by Zabbix, but I've not had any prior performance problems because of that so I've not been too concerned to address it until now. The specific error message is:

        Code:
        53633:20200716:134608.250 Zabbix supports only "utf8_bin" collation. Database "zabbix_prod" has default collation "utf8_general_ci"
        53633:20200716:134608.293 character set name or collation name that is not supported by Zabbix found in 421 column(s) of database "zabbix_prod"
        53633:20200716:134608.294 only character set "utf8" and collation "utf8_bin" should be used in database

        The collation requirements were more strictly spelled out starting with the 3.2 upgrade notes (see the section about "Case Sensitive MySQL"), and the software started warning about incorrect collation late in the 4.4.x series. It was when I was preparing to go from 3.0 through 3.2 to 4.2 that I started the process of fixing the collation of my install. I used an earlier version of the procedure from https://support.zabbix.com/browse/ZBX-17357 . Unfortunately, the early versions of that procedure had some serious problems (loss of default values for a bunch of columns because of an error in the process), and I discovered that after upgrading to 4.2.x. That's what got me looking closely at my database schema, and it's when I discovered that because my schema had been upgraded many times over the years, it had accumulated some "cruft".

        I think the procedure described at the end of the ZBX-17357 is now pretty good for fixing an existing collation, though last I heard, there were still some private or hidden comments on that issue that were causing some people to wonder if there might still be problems with the procedure.

        Originally posted by jameskirsop
        Do you have a series of specific steps / queries that you ran to make these changes?? That might help in getting some of my issues resolved and make Zabbix more efficient.
        I'm not suggesting that any of the schema differences my site had or that your site might have are causing the performance issues you're seeing. I think that's kind of unlikely, though not impossible. I did my investigation and cleanup to make my site's database as close to "pristine" as possible, with the belief that it would probably improve my chances of a successful 4.4.x to 5.x upgrade, and that it would also potentially improve the success rate for my future Zabbix upgrades too.

        I've posted a general overview of what I did to clean up my DB schema in a couple of posts,







        It's basically "create fresh schema of the exact same version on a different host and then diff them" followed by "fix any differences".

        Note that the absolute easiest way to ensure a correct schema would be to create a new, fresh database (with the correct charset and collation) and then import your data (making sure to not import any schema statements that would recreate the old schema) from your existing database. I didn't take that approach because I believed by fixing the existing database I would have less downtime. That probably turned out to be true, but it was probably closer in downtime to a full dump & reload than I was expecting. Probably the biggest benefit to the route I chose is that I learned quite a bit about recent MariaDB and I improved my knowledge of the Zabbix schema. It wasn't a "simple" procedure, but I learned a bunch of stuff in the process that may be generally useful for me with other projects, so I'm still happy with the choice.

        Comment

        • jameskirsop
          Member
          • Jul 2018
          • 32

          #5
          Originally posted by tim.mooney
          Since 5.x was released, there have been a few reports on these forums about performance issues with 5.x. I don't know this for certain (again: not running Zabbix 5.x yet), but there have been enough reports here that I'm suspicious that there are some query issues that are new with 5.x. If my memory serves, I think some of the posts about performance issues were with MySQL/MariaDB as a backend, and some were with PostgreSQL as a backend, so if that's true, that furthers my guess that there may be some query optimization or changes that are still needed for the new release.
          I'm now wondering why Zabbix is trying to run multiple queries several times. This row in the mariadb process list shows up 5 times,
          | 251673 | zabbix_server | localhost | zabbix_prod | Query | 26 | updating | DELETE FROM sessions WHERE status='1' AND userid='5' | 0.000 |

          I'm also seeing periods where there are multiple identical updates / inserts into the sessions table when I try and log in.

          Next step is to try and find some ways of reliably reproducing and logging a bug (or several...)

          Comment

          • jameskirsop
            Member
            • Jul 2018
            • 32

            #6
            tim.mooney do you happen to remember what schema differences you found when doing your clean up?

            Apart from the COMPACT vs DYNAMIC row types, and the collation changes, I've done a quick scan through mysqldump'ed schema and haven't been able to find any discrepancies. Unfortunately a diff won't work because the outputs of mysql dump and the 5.0.2 schema generated via a zcat create.sql.gz > filename.sql don't resemble each other particularly well (constraints are in a different location in the file, line breaks are different etc)

            In my scan I took a look at the larger tables in my instance (history_uint/history_text/trends_uint etc) and their constraints + some other tables such as users and groups.

            Comment

            • tim.mooney
              Senior Member
              • Dec 2012
              • 1427

              #7
              Originally posted by jameskirsop
              tim.mooney do you happen to remember what schema differences you found when doing your clean up?
              The process started because I had to fix up all the columns that had lost their DEFAULT value because of the wayward procedure to convert the COLLATIONs. You won't have that issue because you haven't done the conversion yet and presumably the procedure you'll use will work much better.

              The other difference was duplicate indexes (KEYs) in some tables that did not appear in a freshly-created schema. I'm not certain whether these were causing any issues or not, but they were present in my much-upgraded schema and not present in a freshly-created schema, so I got rid of them. I still had some CONSTRAINTs in my schema that were dumped in a different order than the same CONSTRAINT on a new table, but that's not a difference that's important.

              For example, here's how a freshly-created `drules` table (at 4.4.7, the version I'm currently running) looks:

              Code:
              DROP TABLE IF EXISTS `drules`;
              /*!40101 SET @saved_cs_client = @@character_set_client */;
              /*!40101 SET character_set_client = utf8 */;
              CREATE TABLE `drules` (
              `druleid` bigint(20) unsigned NOT NULL,
              `proxy_hostid` bigint(20) unsigned DEFAULT NULL,
              `name` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '',
              `iprange` varchar(2048) COLLATE utf8_bin NOT NULL DEFAULT '',
              `delay` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '1h',
              `nextcheck` int(11) NOT NULL DEFAULT 0,
              `status` int(11) NOT NULL DEFAULT 0,
              PRIMARY KEY (`druleid`),
              UNIQUE KEY `drules_2` (`name`),
              KEY `drules_1` (`proxy_hostid`),
              CONSTRAINT `c_drules_1` FOREIGN KEY (`proxy_hostid`) REFERENCES `hosts` (`hostid`)
              ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
              /*!40101 SET character_set_client = @saved_cs_client */;
              Here's how my much-upgraded `drules` looked, before I got rid of the extra indexes:

              Code:
              DROP TABLE IF EXISTS `drules`;
              /*!40101 SET @saved_cs_client = @@character_set_client */;
              /*!40101 SET character_set_client = utf8 */;
              CREATE TABLE `drules` (
              `druleid` bigint(20) unsigned NOT NULL,
              `proxy_hostid` bigint(20) unsigned DEFAULT NULL,
              `name` varchar(255) COLLATE utf8_bin NOT NULL,
              `iprange` varchar(2048) COLLATE utf8_bin NOT NULL,
              `delay` varchar(255) COLLATE utf8_bin NOT NULL DEFAULT '1h',
              `nextcheck` int(11) NOT NULL DEFAULT 0,
              `status` int(11) NOT NULL DEFAULT 0,
              PRIMARY KEY (`druleid`),
              UNIQUE KEY `drules_2` (`name`),
              KEY `c_drules_1` (`proxy_hostid`),
              KEY `drules_1` (`proxy_hostid`),
              CONSTRAINT `c_drules_1` FOREIGN KEY (`proxy_hostid`) REFERENCES `hosts` (`hostid`)
              ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
              /*!40101 SET character_set_client = @saved_cs_client */;
              Note the extra "c_drules_1" KEY in mine, vs. the freshly created? That matches up with the FOREIGN KEY CONSTRAINT `c_drules_1`, and indeed FOREIGN KEY CONSTRAINTs must be indexed, but I had 2 indexes (KEYs) for the same column. I think the "c_drules_1" was an earlier auto-generated KEY. Either way, it's not present in the fresh schema, so I got rid of it.

              Across all the tables in my database, there were 42 cases of these duplicate KEY (index) statements that didn't appear in a fresh schema.

              Comment

              • kitadisa
                Junior Member
                • Feb 2020
                • 3

                #8
                Originally posted by jameskirsop
                tim.mooney do you happen to remember what schema differences you found when doing your clean up?

                Apart from the COMPACT vs DYNAMIC row types, and the collation changes, I've done a quick scan through mysqldump'ed schema and haven't been able to find any discrepancies. Unfortunately a diff won't work because the outputs of mysql dump and the 5.0.2 schema generated via a zcat create.sql.gz > filename.sql don't resemble each other particularly well (constraints are in a different location in the file, line breaks are different etc)

                In my scan I took a look at the larger tables in my instance (history_uint/history_text/trends_uint etc) and their constraints + some other tables such as users and groups.
                Thanks, Tim. Your whole post is rather reassuring. It's good to know that you're on 4.4 with MDB 10.2 without any issues.

                Comment

                • jameskirsop
                  Member
                  • Jul 2018
                  • 32

                  #9
                  An update, I've managed to isolate the long running queries that I mentioned in my original post to an issue with the Problems widget. This has been reported here: https://support.zabbix.com/browse/ZBX-17866

                  Sadly the issue is marked as 'trivial' even though it actually severely degrades the performance of my Zabbix instance rendering it unstable at times.

                  Comment

                  • tim.mooney
                    Senior Member
                    • Dec 2012
                    • 1427

                    #10
                    Thank you for doing the work to isolate it and report it!

                    Hopefully in this case they've used "trivial" to mean it's easy to fix, rather than that they don't consider it a serious problem.

                    Comment

                    Working...