Zabbix Constant High I/O / Queue Wait Time

  • jmusbach
    Member
    • Sep 2013
    • 37

    #1

    Zabbix Constant High I/O / Queue Wait Time

    Hello, we are in the final stages of getting Zabbix deployed for our business, but we regularly have items sitting in the queue for longer than 10 minutes. Looking at the output of iotop on the Zabbix server, MySQL is typically doing writes at a constant 3-5 MB/s. We've tried raising its buffer pool to 2 GB per the requirements docs (https://www.zabbix.com/documentation...n/requirements), but that only helped momentarily. We're monitoring about 263 nodes and Zabbix reports around 120 new values per second. Has anyone else run into this? Do we just need to add more RAM to the server and raise the MySQL buffer pool further? Thanks.
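
    For context, here is roughly what raising the buffer pool looks like in my.cnf, along with a couple of other InnoDB write-path settings that are commonly suggested for this kind of load; the values are illustrative, not necessarily what we actually run:

    Code:
    # /etc/my.cnf (location varies by distro); illustrative values only
    [mysqld]
    # Keep as much of the Zabbix history/trends working set in RAM as possible;
    # a common rule of thumb is 50-70% of RAM on a dedicated DB host.
    innodb_buffer_pool_size = 2G
    # Flush the redo log once per second instead of on every commit; this cuts
    # fsync pressure at the cost of up to ~1 second of data on a crash.
    innodb_flush_log_at_trx_commit = 2
    # Bypass the OS page cache so data is not buffered twice.
    innodb_flush_method = O_DIRECT
    # A tablespace file per table keeps the shared ibdata1 file from growing without bound.
    innodb_file_per_table = 1

    MySQL needs a restart after changing these, and iostat -x 5 (or iotop, as above) shows whether the write pressure actually drops.
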
  • tchjts1
    Senior Member
    • May 2008
    • 1605

    #2
    It is hard to say that just adding more RAM will take care of the issue. It is probably a little more complex than that, since there are a fair number of settings that can be tweaked between MySQL and the Zabbix server configuration.

    What version of Zabbix are you using?

    Take a look at this post... the last half of it anyway, where the graphs are. If you can screenshot those, we can probably start tuning some settings for you.


    And since you are mentioning high IO wait times, you can also take a look at this post:


    • jmusbach
      Member
      • Sep 2013
      • 37

      #3
      Thanks, we're using 2.0.8. Here are the graphs and some more things I think may be useful:

      zabbix cache usage: i42.tinypic.com/2j1uc8w.jpg
      zabbix internal processes: i39.tinypic.com/1tkdgj.jpg
      zabbix data gathering: i41.tinypic.com/30mvlhw.jpg
      zabbix queue: i43.tinypic.com/21zi9t.jpg
      iotop: i39.tinypic.com/20u6ghx.jpg
      top: i44.tinypic.com/1z6f7nl.jpg

      Let me know if you need anything else, thanks.
      Last edited by jmusbach; 10-10-2013, 00:49.


      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        I only looked at one of them, and it is only a 1-hour view. As per the links I posted, a 12-hour or 1-day view will be much more useful.


        • jmusbach
          Member
          • Sep 2013
          • 37

          #5
          Ah, sorry about that. Here are the graphs in a 1-day view:

          zabbix cache: goput.it/v/pff.tiff
          zabbix data gathering: goput.it/v/atm.tiff
          zabbix internal processes: goput.it/v/cei.tiff
          current zabbix statistics: goput.it/v/eje.tiff

          Thanks!


          • tchjts1
            Senior Member
            • May 2008
            • 1605

            #6
            Can't you just take screenshots of your graphs and upload them here? I use MWSnap, which is a free screenshot utility.

            Never mind, I downloaded them. The site you uploaded to has no viewer for .tiff files.
            Attached Files
            Last edited by tchjts1; 10-10-2013, 20:36.


            • jmusbach
              Member
              • Sep 2013
              • 37

              #7
              Ah, thanks. I tried uploading, but it said I didn't have enough space allotted to my quota to attach all the images, and when I tried making direct links to the images I got an error saying I had too much live content. Anyway, please let me know if you come up with anything we can tweak. Thanks!


              • tchjts1
                Senior Member
                • May 2008
                • 1605

                #8
                So, what I see in your graphs: your unreachable pollers were being hammered for about 5 hours straight. Were your servers actually unreachable during that time? That would explain your queue being very high.

                Otherwise, your settings are not that bad. These settings are all in zabbix_server.conf, and your cache settings look good. If it were me, I would increase StartPollers= and StartPollersUnreachable= a little, say by 20 and 5 respectively. I would also bump Timeout= up to 10 if you still have it at the default of 3.

                Restart your Zabbix server process after you make any changes.
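
                In zabbix_server.conf terms that would look roughly like this, assuming you are still on the stock defaults (5 regular pollers, 1 unreachable poller); adjust to whatever your config currently has:

                Code:
                # /etc/zabbix/zabbix_server.conf (path may differ on your install)
                # Default is 5; adding 20 gives regular checks more parallelism.
                StartPollers=25
                # Default is 1; adding 5 keeps unreachable hosts from backing up this pool.
                StartPollersUnreachable=6
                # Default is 3 seconds; slow agents and scripts get more time to answer.
                Timeout=10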

                What kind of infrastructure are you on? Standalone or VM servers?
                I am on VM and I run my App server on a different VM than my DB server.

                Outside of that, I would question why the unreachable pollers were being hammered. Were you having any network issues going on?


                • tchjts1
                  Senior Member
                  • May 2008
                  • 1605

                  #9
                  Originally posted by jmusbach
                  Ah, thanks. I tried uploading, but it said I didn't have enough space allotted to my quota to attach all the images.
                  Check out MWSnap sometime. It makes the picture size pretty compact while preserving clarity. I will also check and see if we can increase available space for uploads.
                  Last edited by tchjts1; 10-10-2013, 21:12.


                  • jmusbach
                    Member
                    • Sep 2013
                    • 37

                    #10
                    At that point I think we were having some network latency issues, but in general, availability-wise, things are fine. We still get a lot of items listed as waiting more than 10 minutes in the queue, and when I view the details they are items set to run against the Zabbix server's own agent: things like its own stats, but also some scripts we've configured the agent to run. While this is happening I can run the queued scripts by hand just fine, so I'm not sure of the cause. If you look at the iotop screenshot you can see MySQL is frequently causing 3+ MB/s of I/O. Could this be a cause of the queueing? Thanks again.
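
                    For reference, a quick way to tell whether it is the agent or the server side that is slow, assuming these are passive agent items on the default port 10050 (the second key is just a placeholder for one of our script items):

                    Code:
                    # Run from the Zabbix server host while the items are sitting in the queue.
                    # -s is the agent address (here the server's own agent), -k is the item key.
                    zabbix_get -s 127.0.0.1 -p 10050 -k agent.ping
                    zabbix_get -s 127.0.0.1 -p 10050 -k "custom.script[example]"

                    If these return instantly, the delay is on the poller/database side rather than in the agent or the scripts themselves.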


                    • tchjts1
                      Senior Member
                      • May 2008
                      • 1605

                      #11
                      Did you check my other link regarding high I/O wait?

                      I would also look at the swappiness setting at the OS layer. If it is at the default of 60, I would set it to 0 on the Zabbix DB server, especially if you are seeing high I/O wait coupled with abnormal swap usage.
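
                      Setting it looks like this (run as root; the sysctl.conf line makes it stick across reboots):

                      Code:
                      # Check the current value
                      cat /proc/sys/vm/swappiness
                      # Apply immediately to the running system
                      sysctl -w vm.swappiness=0
                      # Persist across reboots
                      echo "vm.swappiness = 0" >> /etc/sysctl.conf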


                      • jmusbach
                        Member
                        • Sep 2013
                        • 37

                        #12
                        Thanks, things seem to be better now that I've set the swappiness on the server to 0, so it only swaps as a last resort, and applied the configuration tweaks you suggested. However, one problem remains: if you look at my attached screenshot, you'll see that some graphs are randomly getting breaks in them. The graphs use data obtained from scripts the server runs as items, and the scripts themselves should be fine. The graphs weren't an issue until we finished adding all our servers and monitoring items, so now it seems like the server is too busy to maintain a steady stream of the data the graphs depend on. Is there anything to tweak to make the data come in more reliably? Thanks.
                        Attached Files


                        • jmusbach
                          Member
                          • Sep 2013
                          • 37

                          #13
                          Bump.

                          EDIT: It turns out the graph issue was caused by latency while the Zabbix server was in a datacenter separate from most of our main servers. Once the Zabbix server was moved to the same datacenter, the issue went away.
                          Last edited by jmusbach; 18-10-2013, 15:44.

