High load / thousands of false positive alert mails

  • joergi
    Member
    • Jul 2013
    • 32

    #1

    Hello everyone,

    we are using Zabbix 2.0.7, and last night we got thousands of false positive alert mails. I cannot find the reason for this. Please have a look at the attached graph.
    The false positive mails started at 23:13 and stopped at 23:58. It looks like the housekeeper process blocks something. The housekeeper also stopped at around 23:58, and since then everything has been fine.

    How can I debug this issue and avoid it in the future?

    Many thanks,
    Jörg
    Last edited by joergi; 03-01-2014, 11:18.
  • MaxM
    Member
    • Sep 2011
    • 42

    #2
    I've encountered similar effects before where the housekeeper ended up consuming 100% of CPU, causing calculations to fall behind, which causes timing issues leading to false alarms. When you're getting lots of false alerts like that, the best thing I've found to do is to apply a global maintenance window, clear the escalations table, and wait it out. That being said, there's a reason every large environment disables the housekeeper and enables partitioning.

    • joergi
      Member
      • Jul 2013
      • 32

      #3
      Hello Max,

      many thanks... I am not sure: is partitioning supported in Zabbix 2.0 with a MySQL database?

      Thanks,
      Jörg

      • MaxM
        Member
        • Sep 2011
        • 42

        #4
        I have used it in MySQL for both 2.0 and 2.2. The post Ricardo put together is a pretty good starting point. I only partition the history and trends tables.
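
A minimal sketch of what daily range partitioning on a history table could look like, written as a small Python helper that only generates the MySQL statement. It assumes the table is partitioned on its integer `clock` column (Unix time) with one partition per UTC day; the helper name `daily_partition_ddl` and the partition naming scheme are made up for illustration and are not taken from the post mentioned above. Check the statement against your own schema and indexes before running it.

```python
from datetime import date, datetime, timedelta, timezone


def daily_partition_ddl(table: str, first_day: date, days: int) -> str:
    """Build an ALTER TABLE statement that range-partitions `table` by day on `clock`."""
    parts = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        # A partition holds every row whose clock is below midnight (UTC) of the next day.
        upper = datetime(day.year, day.month, day.day, tzinfo=timezone.utc) + timedelta(days=1)
        parts.append(
            f"  PARTITION p{day:%Y_%m_%d} VALUES LESS THAN ({int(upper.timestamp())})"
        )
    return f"ALTER TABLE {table} PARTITION BY RANGE (clock) (\n" + ",\n".join(parts) + "\n);"


if __name__ == "__main__":
    # Hypothetical example: three daily partitions for the history table.
    print(daily_partition_ddl("history", date(2014, 3, 1), 3))
```

With a layout like this, old data is removed by dropping whole partitions (`ALTER TABLE history DROP PARTITION ...`) instead of the row-by-row deletes the housekeeper performs, which is the reason partitioning takes load off the database.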

        • tchjts1
          Senior Member
          • May 2008
          • 1605

          #5
          You may be able to avoid the congestion when housekeeper is running with some tuning of your Zabbix configurations.

          * Have you modified your housekeeper settings at all as far as MaxDelete and how often housekeeper should run?

          * Are you running with the stock default settings for your pollers and cache settings in zabbix_server.conf?

          * What kind of hardware setup are you running on? Stand-alone or VMs?

          * Are both the Zabbix App and DB server on the same server?

          * How much memory do you have for your DB server?

          • syndeysider
            Senior Member
            • Oct 2013
            • 115

            #6
            Having been through this recently:

            1. Zabbix's MySQL DB has foreign key constraints all over the place. This is not conducive to partitioning unless you only partition a select set of tables. It gets complicated, but as advised there are helpful tutorials on how to at least shrink the heavier tables using partitioning.

            2. RAM? RAMDISK for MYSQL /tmp? MaxM hit the nail on the head. Poor performance = false alerts based on timing constraints. Housekeeper may need to be tweaked or you may want to check out



            for performance tuning your MySQL instance.

            I went from sluggish, day-long housekeeping on 8GB RAM to minute-long housekeeping with 32GB RAM and a RAMDISK.

            Bottom line is that you are most likely getting false alerts because of the performance of Zabbix and its ability to write/read data to and from MySQL in a timely manner.

            • joergi
              Member
              • Jul 2013
              • 32

              #7
              Hello tchjts1,

              here is the information:

              * Have you modified your housekeeper settings at all as far as MaxDelete and how often housekeeper should run?
              HousekeepingFrequency=4
              the MaxHousekeeperDelete is not set -> default value

              * Are you running with the stock default settings for your pollers and cache settings in zabbix_server.conf?
              StartPollers=50
              StartIPMIPollers=16
              StartPollersUnreachable=15
              StartHTTPPollers=2
              StartProxyPollers=25
              CacheSize=512M
              HistoryCacheSize=128M
              TrendCacheSize=128M
              HistoryTextCacheSize=32M

              * What kind of hardware setup are you running on? Stand-alone or VMs?
              These are VMs

              * Are both the Zabbix App and DB server on the same server?
              Separate servers

              * How much memory do you have for your DB server?
              The Database server has 16 GB RAM

              Many thanks,
              Jörg

              • joergi
                Member
                • Jul 2013
                • 32

                #8
                Attached you will find the Zabbix server and MySQL configuration.
                • joergi
                  Member
                  • Jul 2013
                  • 32

                  #9
                  Hi syndeysider,

                  there is a RAM disk for mysql configured with a size of 1G and the mysqltuner shows the following:

                  Code:
                  -------- General Statistics --------------------------------------------------
                  [--] Skipped version check for MySQLTuner script
                  [OK] Currently running supported MySQL version 5.1.70-0ubuntu0.10.04.1-log
                  [OK] Operating on 64-bit architecture
                  
                  -------- Storage Engine Statistics -------------------------------------------
                  [--] Status: +Archive -BDB -Federated +InnoDB -ISAM -NDBCluster
                  [--] Data in InnoDB tables: 64G (Tables: 103)
                  [!!] Total fragmented tables: 20
                  
                  -------- Security Recommendations  -------------------------------------------
                  [OK] All database users have passwords assigned
                  
                  -------- Performance Metrics -------------------------------------------------
                  [--] Up for: 97d 1h 19m 55s (3B q [375.615 qps], 4M conn, TX: 1469B, RX: 471B)
                  [--] Reads / Writes: 29% / 71%
                  [--] Total buffers: 10.5G global + 5.6M per thread (512 max threads)
                  [OK] Maximum possible memory usage: 13.3G (84% of installed RAM)
                  [OK] Slow queries: 0% (4K/3B)
                  [OK] Highest usage of available connections: 42% (219/512)
                  [OK] Key buffer size / total MyISAM indexes: 16.0M/86.0K
                  [OK] Key buffer hit rate: 100.0% (1B cached / 0 reads)
                  [!!] Query cache efficiency: 0.0% (0 cached / 865M selects)
                  [OK] Query cache prunes per day: 0
                  [OK] Sorts requiring temporary tables: 0% (203K temp sorts / 42M sorts)
                  [!!] Joins performed without indexes: 495421
                  [OK] Temporary tables created on disk: 5% (7M on disk / 134M total)
                  [OK] Thread cache hit rate: 99% (3K created / 4M connections)
                  [OK] Table cache hit rate: 42% (512 open / 1K opened)
                  [OK] Open file limit used: 1% (47/2K)
                  [OK] Table locks acquired immediately: 100% (3B immediate / 3B locks)
                  [!!] InnoDB data size / buffer pool: 64.6G/10.0G
                  
                  -------- Recommendations -----------------------------------------------------
                  General recommendations:
                      Run OPTIMIZE TABLE to defragment tables for better performance
                      Adjust your join queries to always utilize indexes
                  Variables to adjust:
                      long_query_time (<= 10)
                      query_cache_limit (> 1M, or use smaller result sets)
                      join_buffer_size (> 3.0M, or always use indexes with joins)
                      innodb_buffer_pool_size (>= 64G)
                  The current size of the Zabbix database is 83GB.
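
The two figures that matter most in this report can be re-derived with a few lines of Python. This is only arithmetic on the numbers printed above; reading the result as "most of the data cannot be cached in the buffer pool" is an interpretation, not something the report itself states.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Figures copied from the MySQLTuner report above.
global_buffers = 10.5 * GIB      # "Total buffers: 10.5G global"
per_thread_buffers = 5.6 * MIB   # "+ 5.6M per thread"
max_threads = 512                # "(512 max threads)"
innodb_data = 64.6 * GIB         # "InnoDB data size"
buffer_pool = 10.0 * GIB         # "buffer pool"

# Worst case: every allowed thread allocates its per-thread buffers at once.
max_memory = global_buffers + per_thread_buffers * max_threads
# Share of the InnoDB data that fits into the buffer pool.
pool_coverage = buffer_pool / innodb_data

print(f"Maximum possible memory usage: {max_memory / GIB:.1f} GiB")
print(f"Buffer pool covers {pool_coverage:.0%} of the InnoDB data")
```

So the worst-case footprint matches the 13.3G in the report, and the buffer pool holds only about 15% of the InnoDB data, which is why the tuner flags `innodb_buffer_pool_size`.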

                  Regards,
                  Jörg

                  • joergi
                    Member
                    • Jul 2013
                    • 32

                    #10
                    Has anyone any thoughts on this? Any advice would be really great....

                    Thanks,
                    Jörg

                    • joergi
                      Member
                      • Jul 2013
                      • 32

                      #11
                      Hello everyone,

                      last night I got around 70 false positive alerts at 1:14am. At 1:09am I got an alert "Zabbix Database is down".
                      After checking some logs, I found that the DR backup (snapshot) of the Zabbix database server VM was running at that time.
                      I have no clue how I could fix this issue.
                      Does anyone have any thoughts on this?

                      Regards,
                      Jörg
                      Last edited by joergi; 04-09-2014, 09:44.

                      • syndeysider
                        Senior Member
                        • Oct 2013
                        • 115

                        #12
                        Hi joergi

                        I initially built my Zabbix environment in VMware and then realised that the more external dependencies Zabbix has, the more prone it is to false alerting.

                        This is a design flaw and not an Application flaw.

                        I found that during DRS/DR/Backups the database becomes unresponsive because VMware Tools locks down the DB. This leads to nodata in x triggers as well as timeouts, which could lead to triggers firing, i.e. false alerts.

                        My resolution was to move my infrastructure to physical hardware and where possible limit the dependencies as much as I could.

                        2 x physical HA clusters: one for Zabbix, one for the MySQL DB. Why the excess of physical hardware? Same reason a hardcore sysadmin keeps a physical domain controller... VMware is great! We use it extensively, but there's always that 1% of critical applications running off physical steel.

                        Long story short. Change your design or implement maintenance mode during your DR Backups.

                        • joergi
                          Member
                          • Jul 2013
                          • 32

                          #13
                          Hi syndeysider,

                          many thanks for your answer...
                          It is not an option for us to use physical hardware in our environment.
                          Now I am thinking about MySQL master/slave replication. Unfortunately I have no experience with this.
                          The idea is to disable the DR backup on the master and enable it on the slave. What do you think? Could this work?

                          Thanks,
                          Jörg

                          • syndeysider
                            Senior Member
                            • Oct 2013
                            • 115

                            #14
                            This will suit your needs if configured properly. http://dev.mysql.com/doc/mysql-backu...s-backups.html

                            I went through this initially but found that the MySQL .bin files used in master/slave sync grew very quickly, as I didn't have the master/slave setup configured correctly and Zabbix was writing a lot of data regularly. Something like 8GB every day. This was my first large setup. I also read:

                            http://downloads.mysql.com/docs/mysq...-5.1-en.a4.pdf

                            to try to get a better understanding of whether the complexity of MySQL replication was worth it for DR, or whether I could find an alternative. I am not a strong MySQL guy, as my recent problem posts might highlight, and thus I chose to go the hardware route and eliminate another dependency.

                            Just remember you will need to keep an eye on your Master/Slave replication state. There are some Forum posts on an HA Solution with MySQL which might prove useful.
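
A rough sketch of what keeping an eye on the replication state could mean in practice. It assumes you already have one row of `SHOW SLAVE STATUS` output on the slave as a dictionary (for example from a MySQL client library); the function name `replication_problems` and the 300-second lag threshold are illustrative assumptions, not Zabbix or MySQL defaults.

```python
def replication_problems(status: dict, max_lag_seconds: int = 300) -> list:
    """Return human-readable problems found in one SHOW SLAVE STATUS row."""
    problems = []
    # Both replication threads must report "Yes" for replication to be working.
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("I/O thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")
    # Seconds_Behind_Master is NULL (None here) when the lag cannot be determined.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        problems.append("lag is unknown (Seconds_Behind_Master is NULL)")
    elif int(lag) > max_lag_seconds:
        problems.append(f"slave is {int(lag)}s behind the master")
    return problems


if __name__ == "__main__":
    # Hypothetical status row for a slave that has fallen behind.
    row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 900}
    print(replication_problems(row))
```

An empty list means the slave looks healthy; anything else is worth alerting on before you rely on the slave for backups.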

                            Lastly, although not ideal, depending on how much of Zabbix you use i.e. IT Services etc. you might be able to quickly "turn off" the alerts by implementing a maintenance mode for the backup duration. The alerts still trigger and display but won't be actioned i.e. SMS/Email if the "Not in MAINTENANCE" option is selected in the Trigger Definition? It's an ugly workaround....
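
The maintenance workaround described above could be set up once as a recurring period, sketched here as the JSON-RPC request body for the Zabbix API method `maintenance.create`. This is a sketch under assumptions: the field names follow my understanding of the API and should be checked against the API documentation for your version, the group ID `"2"`, the token, and the 01:00 start with a one-hour window are placeholders, and the helper name `maintenance_request` is made up. The code only builds the payload and does not send anything.

```python
import json


def maintenance_request(auth, group_ids, start_of_day_s, duration_s, active_since, active_days=365):
    """Build a maintenance.create request for a daily maintenance window."""
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": "Nightly DR backup",              # placeholder name
            "active_since": active_since,             # Unix time the definition becomes valid
            "active_till": active_since + active_days * 86400,
            "groupids": group_ids,                    # host groups put into maintenance
            "timeperiods": [{
                "timeperiod_type": 2,                 # assumed: 2 = daily period
                "every": 1,                           # every day
                "start_time": start_of_day_s,         # seconds after midnight
                "period": duration_s,                 # window length in seconds
            }],
        },
        "auth": auth,                                 # token from a prior user.login call
        "id": 1,
    }


if __name__ == "__main__":
    # Hypothetical values: window from 01:00 to 02:00 around the backup.
    body = maintenance_request("<token>", ["2"], 3600, 3600, 1393632000)
    print(json.dumps(body, indent=2))
```

As far as I know, the "not in maintenance" check is a condition on the action rather than part of the trigger definition, so the action that sends the mails needs that condition for the window to suppress them.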

                            • joergi
                              Member
                              • Jul 2013
                              • 32

                              #15
                              Originally posted by syndeysider
                              I found that during DRS/DR/Backups the Database becomes unresponsive because VMWare Tools locks down the DB. This leads to nodata in x triggers as well as timeouts which could lead to triggers firing e.g. false triggers.
                              Btw, quiescing of the VM is disabled in the DR backup config, so the backup should not lock anything.
