Queue Backups
  • gtuhl
    Junior Member
    • Oct 2011
    • 3

    #1

    Queue Backups

    We've been using Zabbix for a while to monitor all sorts of things, and generally it works really well. Lately, though, about once a day it just starts falling way behind. The values in the Administration->Queue screen climb into the hundreds, and all of our hosts start showing up dead.

    We are monitoring ~150 hosts and ~10k items, and the required server performance value on the dashboard is 130 new values per second. Most of the time everything is working great.

    I do not believe it is a hardware issue, as I have two reasonable boxes in the mix. The Zabbix server (v1.8.5) runs on one while the DB (PostgreSQL 9.0) runs on the other. Each is a bare-metal 8-core box with 12 GB RAM and a 4-disk 15k RPM RAID 10, running Ubuntu 10.04 LTS.

    Generally these boxes run a load average of only 1-2, use a couple GB of RAM, and certainly are not taxing the disks (by either vmstat or iostat -dx output). I rarely see I/O wait get above 3-4% and it is usually 0.
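
    For reference, a sketch of the kind of spot checks described above (the 5-second sampling interval is arbitrary):

    # Per-device utilization, await, and service times, sampled every 5 seconds
    iostat -dx 5
    # Run/blocked process queues, memory, swap, and the 'wa' (I/O wait) column
    vmstat 5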

    When these queue backups happen, nothing looks different on the servers - no jump in CPU or memory, and the DB looks perfectly healthy with no long-running queries present. The backlog does clear on its own eventually, but we go 20-30 minutes with no monitoring or, even worse, lots of false alarms going off.

    On the server running zabbix-server I have these settings in zabbix_server.conf:
    # Configuration cache (hosts, items, triggers)
    CacheSize=1024M
    # Cache for collected history values
    HistoryCacheSize=1024M
    # Cache for collected text, log, and character history values
    HistoryTextCacheSize=1024M
    # Cache for trend data
    TrendCacheSize=1024M
    # Pre-forked poller processes for passive checks
    StartPollers=100
    # Pre-forked trapper processes for incoming data
    StartTrappers=100

    Right now all of our items are of the passive variety, and I have no proxies or zabbix_sender setups in place. Is this simply too many items for a single server to collect? The fact that the hardware itself isn't being taxed makes me hope this isn't the case. Is there some configuration I could look into adjusting that might help?

    I've tried to keep our configs very clean and kept the common templates shared by many hosts fairly lean, with generally larger update intervals of 60s or more. The only messages I see in the Zabbix server logs look harmless - occasional curl timeouts from web scripts, unsupported items that I haven't yet cleaned up, etc.

    Any help is greatly appreciated.
  • Colttt
    Senior Member
    Zabbix Certified Specialist
    • Mar 2009
    • 878

    #2
    Hmm... have you tuned the database? What does that look like?
    How big is your database? Do you use housekeeping, and if so, how often?
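
    A sketch of one way to answer the size question on PostgreSQL (the database name "zabbix" is an assumption):

    # Overall database size, then the ten largest tables
    psql -d zabbix -c "SELECT pg_size_pretty(pg_database_size('zabbix'));"
    psql -d zabbix -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
                       FROM pg_statio_user_tables
                       ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"
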
    Debian-User

    Sorry for my bad english


    • gtuhl
      Junior Member
      • Oct 2011
      • 3

      #3
      The DB seems to be in excellent shape. The machine has loads of idle CPU/RAM/disk, there is never any query backup, and I've got it tuned up nicely (I set up a lot of Postgres installs).

      Housekeeping was causing trouble with its row-at-a-time cleaning previously, so I moved to this approach:

      - Disabled housekeeping completely
      - Don't keep detailed history for any item for more than 30 days
      - Run a query every morning that purges anything older than 30 days from history, history_uint, history_str, history_text, and history_log (see the sketch below)
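
      A minimal sketch of such a nightly purge, not the poster's exact query (assumes the stock Zabbix schema, where history tables store timestamps in the integer clock column, and a database named "zabbix"):

      # Delete history rows older than 30 days from each history table
      CUTOFF=$(date -d '30 days ago' +%s)
      for t in history history_uint history_str history_text history_log; do
          psql -d zabbix -c "DELETE FROM $t WHERE clock < $CUTOFF;"
      done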

      It perhaps isn't a recommended approach, but it seems to be working very well. I made that change a long time ago and it made a huge difference in Zabbix's stability.

      Generally everything runs great; it just falls apart occasionally, going from 0 items past the "10 second" line in the queue to hundreds.

      It's very odd, and only happens once every day or so (has not happened since I created this thread).

      I am wondering if I need to set up some proxies so I am spreading the outbound connections to all our hosts across a few different machines. It could be that a spike in network latency to enough hosts ties up the single Zabbix server to the point that it can't continue bringing in data.


      • gtuhl
        Junior Member
        • Oct 2011
        • 3

        #4
        We had another one of these events today, and I believe I've finally figured it out. It had absolutely nothing to do with Zabbix, so I wanted to wrap up this thread for any future readers.

        In our case it was some borderline malicious caching behavior in our hosting provider's DNS servers. I swapped in Google's resolvers and everything works flawlessly - the huge backlog cleared in a matter of minutes. The outbound connections were all getting hung up on DNS lookups that never returned.
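
        For future readers, a sketch of how to spot this and apply the same fix (the hostname is a placeholder; 8.8.8.8 and 8.8.4.4 are Google's public resolvers):

        # Compare lookup time through the current resolver vs. Google's
        time getent hosts some-monitored-host
        dig @8.8.8.8 some-monitored-host   # check the 'Query time' line in the output

        # Point the box at Google's resolvers (Ubuntu 10.04: edit /etc/resolv.conf)
        printf 'nameserver 8.8.8.8\nnameserver 8.8.4.4\n' | sudo tee /etc/resolv.conf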
