Huge Zabbix queue, but where is the bottleneck?

  • mcmyst
    Member
    • Feb 2012
    • 72

    #1

    Hi everyone,

    I have a hypervisor with this hardware: Xeon E5-2620 @ 2 GHz, 16 GB RAM and 3 x 300 GB SAS 15K drives in RAID 5.
    Three VMs are running on this server:
    - zabbix db: MySQL, 4 cores and 4.5 GB RAM, InnoDB with a 3.5 GB pool size
    - zabbix server: 2 cores, 1 GB RAM
    - zabbix proxy: 2 cores, 1 GB RAM

    I don't see any performance issues when browsing the web GUI, and when I look at these graphs, they seem correct:


    The problem is that the Zabbix server queue never empties. It is always between 300 and 3000 items:


    I know that running at least the database on a dedicated server is recommended, but my hypervisor is not loaded at all (I will post the image in a later post because I have reached the limit).

    So I am wondering: where is the bottleneck?

    Here are the tuned proxy parameters:
    ConfigFrequency=120
    StartPollers=15
    StartPollersUnreachable=5
    StartPingers=10
    CacheSize=64M
    StartDBSyncers=9
    HistoryCacheSize=64M

    Here are the tuned server parameters:
    StartPollers=15
    StartPollersUnreachable=15
    StartPingers=3
    ListenIP=172.16.3.50
    SenderFrequency=30
    CacheSize=64M
    StartDBSyncers=9
    HistoryCacheSize=64M
    TrendCacheSize=16M
    Timeout=5

    The proxy is handling 95% of the monitoring (around 60k SNMPv2 items).


    My questions are:
    Should I increase the number of pollers on the server or on the proxy?
    Do you think I should change some settings somewhere?
    Could you help me?
    Last edited by mcmyst; 23-01-2013, 11:12.
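For the poller sizing question above, a rough back-of-the-envelope estimate can help decide whether the poller count is even in the right range. The sketch below is only an illustration: the 60,000 items come from the post, but the 300-second update interval, the 20 ms average check time and the 50% target utilisation are assumptions, and `required_pollers` is a hypothetical helper, not anything provided by Zabbix. It ignores timeouts, unreachable-host handling and everything else Zabbix does internally.

```python
import math

def required_pollers(item_count, interval_s, avg_check_ms, target_busy=0.5):
    """Estimate checks per second and the pollers needed to stay below target_busy."""
    checks_per_second = item_count / interval_s
    # Little's law: average checks in flight = arrival rate x time per check.
    in_flight = checks_per_second * avg_check_ms / 1000
    return checks_per_second, math.ceil(in_flight / target_busy)

# 60,000 items is from the post; the interval and check time are ASSUMED values.
rate, pollers = required_pollers(60_000, interval_s=300, avg_check_ms=20)
print(f"{rate:.0f} checks/s, roughly {pollers} pollers at 50% busy")
```

If the real average check time is dominated by a few items that wait for the full Timeout, the estimate changes a lot, so measured values should replace the assumed ones.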
  • tchjts1
    Senior Member
    • May 2008
    • 1605

    #2
    In your internal process % statistics, I do not see "Zabbix busy poller process %" listed (I saw the proxy poller process). Your graph only shows a 1-hour timeframe. You should look at a larger window and see how that looks. Your proxy poller process looks relatively low for that one hour. You should take a look at Zabbix busy poller process %.

    Anyway, how many hosts are you monitoring? I have about 200 hosts and I have my StartPollers value set at 70 for Zabbix server. (No proxies) You should also take a look at your cache usage stats.

    • mcmyst
      Member
      • Feb 2012
      • 72

      #3
      I have around 300 network switches monitored via SNMPv2.
      Here are the graphs:




      I know that my poller count looks low, but before today StartPollers was at 8 for both the server and the proxy. Today I set it to 15, but there was no change in the queue. The pollers do not look busy, so in my opinion increasing the count will not change anything. Do you think I should try more?

      But I don't know where I should increase it: on the server or on the proxy?


      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        Originally posted by mcmyst
        I know that my poller count looks low, but before today StartPollers was at 8 for both the server and the proxy. Today I set it to 15, but there was no change in the queue. The pollers do not look busy, so in my opinion increasing the count will not change anything. Do you think I should try more?

        But I don't know where I should increase it: on the server or on the proxy?
        Yeah, you're right. All your pollers look fine for resource usage. Are you seeing anything helpful in your zabbix_server.log?

        Maybe you are having some timeouts happening. You could try increasing this variable by a few seconds to see if it helps:

        Code:
        ### Option: Timeout
        #       Specifies how long we wait for agent, SNMP device or external check (in seconds).
        #
        # Mandatory: no
        # Range: 1-30
        # Default:
        # Timeout=3
        And any changes you make to the zabbix_server.conf file require a restart of the Zabbix server process.


        • mcmyst
          Member
          • Feb 2012
          • 72

          #5
          OK, thank you. I will try tomorrow morning and post the results here.


          • tchjts1
            Senior Member
            • May 2008
            • 1605

            #6
            Check your server log first before you change settings... see if there is any obvious error happening there.


            • flako
              Member
              • Sep 2011
              • 40

              #7
              Hello
              How does the 'Zabbix performance' graph (the queue) look?
              Looking at your graphs, I would try disabling the housekeeper (in zabbix_server.conf). It is 100% saturated (I'll bet you a beer that it never finishes), which saturates the DB and makes the queue grow. You're also running it every 70 min (that's too often; once per day would be enough if the housekeeper works).


              • tchjts1
                Senior Member
                • May 2008
                • 1605

                #8
                Although the housekeeper may be contributing to some of the bottleneck, I wouldn't recommend disabling it unless you have your DB partitioned and are managing old data that way.

                My housekeeper is the same as yours, except a bit shorter in duration. I would look at optimizing your settings in my.cnf rather than disabling housekeeper.

                I, too, thought the same as flako and set it to run only every 12 hours. Bad choice, though, as it then ran for about 2 hours solid instead of 5 or 10 minutes every hour.

                • mcmyst
                  Member
                  • Feb 2012
                  • 72

                  #9
                  Thank you all for your replies.

                  I know that the housekeeper is a performance killer, but I have to run it to delete old data because I don't have a partitioned database (MySQL can't use foreign keys on partitioned tables). And even if I changed database engine, I want to keep some items for 3 years and others for one year, so it would never work.

                  Here are my logs from the proxy:
                  Code:
                  snmp_build: unknown failure 25485:20130123:064848.679 SNMP item [Ethernet0-0-5.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds
                  snmp_build: unknown failure 25491:20130123:064903.039 SNMP item [Ethernet0-0-32.ifHCInBroadcastPkts] on host [SWITCH] failed: another network error, wait for 15 seconds
                   25491:20130123:064918.054 resuming SNMP checks on host [SWITCH]: connection restored
                  snmp_build: unknown failure 25485:20130123:064928.660 SNMP item [Ethernet0-0-35.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds
                   25491:20130123:064943.079 resuming SNMP checks on host [SWITCH]: connection restored
                  snmp_build: unknown failure 25478:20130123:064948.530 SNMP item [Ethernet0-0-3.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds
                  snmp_build: unknown failure 25491:20130123:065003.085 SNMP item [Ethernet0-0-3.ifHCInBroadcastPkts] on host [SWITCH] failed: another network error, wait for 15 seconds
                   25491:20130123:065018.110 resuming SNMP checks on host [SWITCH]: connection restored
                  snmp_build: unknown failure 25481:20130123:065048.282 SNMP item [FastEthernet2-14.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds
                   25469:20130123:065050.155 Received configuration data from server. Datalen 10951816
                  snmp_build: unknown failure 25491:20130123:065106.423 SNMP item [FastEthernet2-14.ifHCInBroadcastPkts] on host [SWITCH] failed: another network error, wait for 15 seconds
                  snmp_build: unknown failure 25478:20130123:065108.605 SNMP item [Ethernet0-0-6.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds
                   25491:20130123:065118.432 resuming SNMP checks on host [SWITCH]: connection restored
                   25488:20130123:065123.439 resuming SNMP checks on host [SWITCH]: connection restored
                  snmp_build: unknown failure 25473:20130123:065128.598 SNMP item [Ethernet0-0-4.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds
                  snmp_build: unknown failure 25491:20130123:065143.439 SNMP item [Ethernet0-0-4.ifHCInBroadcastPkts] on host [SWITCH] failed: another network error, wait for 15 seconds
                  snmp_build: unknown failure 25491:20130123:065158.449 SNMP item [Ethernet0-0-41.ifHCInBroadcastPkts] on host [SWITCH] failed: another network error, wait for 15 seconds
                   25491:20130123:065213.507 resuming SNMP checks on host [SWITCH]: connection restored
                  snmp_build: unknown failure 25485:20130123:065223.634 SNMP item [Ethernet0-0-44.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds
                  snmp_build: unknown failure 25488:20130123:065238.459 SNMP item [Ethernet0-0-44.ifHCInBroadcastPkts] on host [SWITCH] failed: another network error, wait for 15 seconds
                   25488:20130123:065253.475 resuming SNMP checks on host [SWITCH]: connection restored
                   25469:20130123:065307.898 Received configuration data from server. Datalen 10951816
                  snmp_build: unknown failure 25486:20130123:065308.798 SNMP item [Ethernet0-0-40.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds
                  snmp_build: unknown failure 25488:20130123:065324.192 SNMP item [Ethernet0-0-40.ifHCInBroadcastPkts] on host [SWITCH] failed: another network error, wait for 15 seconds
                   25491:20130123:065338.211 resuming SNMP checks on host [SWITCH]: connection restored
                  These are the only errors I have. I think it is because there are too many SNMP queries on the host and it stops responding. I will see if I can raise the limits on all the switches.
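With a log like the one above, tallying the failures per item makes it easier to see whether the errors are spread evenly or concentrated on a few item keys. The following is a minimal sketch, assuming the log lines keep exactly the format shown in the excerpt; `count_snmp_failures` is a hypothetical helper name and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Matches the failure lines in the format shown in the proxy log excerpt.
FAILURE = re.compile(
    r"SNMP item \[(?P<item>[^\]]+)\] on host \[(?P<host>[^\]]+)\] "
    r"failed: (?:first|another) network error"
)

def count_snmp_failures(lines):
    """Return a Counter mapping (host, item key) to the number of error lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[(match["host"], match["item"])] += 1
    return counts

sample = [
    "snmp_build: unknown failure 25478:20130123:064948.530 SNMP item [Ethernet0-0-3.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds",
    "snmp_build: unknown failure 25491:20130123:065003.085 SNMP item [Ethernet0-0-3.ifHCInBroadcastPkts] on host [SWITCH] failed: another network error, wait for 15 seconds",
    " 25491:20130123:065018.110 resuming SNMP checks on host [SWITCH]: connection restored",
    "snmp_build: unknown failure 25481:20130123:065048.282 SNMP item [FastEthernet2-14.ifHCInBroadcastPkts] on host [SWITCH] failed: first network error, wait for 15 seconds",
]

# Print the most frequently failing items first.
for (host, item), n in count_snmp_failures(sample).most_common():
    print(host, item, n)
```

Because the function accepts any iterable of lines, an open log file can be passed in directly instead of the sample list.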

                  On the server side, I have only this:
                  Code:
                   1047:20130123:062534.813 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1048:20130123:062753.672 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1055:20130123:062837.192 housekeeper deleted: 1134607 records from history and trends, 0 records of deleted items, 0 events, 0 alerts, 0 sessions
                    1047:20130123:063011.860 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1046:20130123:063229.189 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1049:20130123:063446.985 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1049:20130123:063704.754 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1049:20130123:063922.709 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1048:20130123:064139.850 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1047:20130123:064357.460 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1050:20130123:064614.870 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1046:20130123:064832.537 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1048:20130123:065049.954 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1050:20130123:065307.683 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1049:20130123:065525.448 Sending configuration data to proxy 'proxy'. Datalen 10951816
                    1050:20130123:065742.999 Sending configuration data to proxy 'proxy'. Datalen 10951816
                  Yes, the housekeeper is saturated at 100%, but as you can see, it is not causing a lot of trouble for the database. But you made me realize that the housekeeper is still enabled on my proxy. Could that be the cause? I will try disabling it at work this morning just to see if it gets better.
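The server log above mostly shows configuration syncs to the proxy, spaced a little over two minutes apart. A small sketch like the one below can measure that spacing from the timestamps. It assumes the `PID:YYYYMMDD:HHMMSS.mmm message` layout seen in the excerpt; `sync_intervals` is a hypothetical helper and the sample lines are copied from the excerpt.

```python
import re
from datetime import datetime

# PID:YYYYMMDD:HHMMSS.mmm followed by the message, as in the excerpt.
STAMP = re.compile(r"(\d+):(\d{8}:\d{6}\.\d{3}) (.*)")

def sync_intervals(lines):
    """Return the seconds elapsed between consecutive proxy configuration syncs."""
    times = []
    for line in lines:
        m = STAMP.search(line)
        if m and m.group(3).startswith("Sending configuration data to proxy"):
            times.append(datetime.strptime(m.group(2), "%Y%m%d:%H%M%S.%f"))
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]

sample = [
    " 1047:20130123:062534.813 Sending configuration data to proxy 'proxy'. Datalen 10951816",
    "  1048:20130123:062753.672 Sending configuration data to proxy 'proxy'. Datalen 10951816",
    "  1055:20130123:062837.192 housekeeper deleted: 1134607 records from history and trends, 0 records of deleted items, 0 events, 0 alerts, 0 sessions",
    "  1047:20130123:063011.860 Sending configuration data to proxy 'proxy'. Datalen 10951816",
]

print(sync_intervals(sample))
```

The housekeeper line is skipped because only messages that start with the sync text are collected.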


                  • mcmyst
                    Member
                    • Feb 2012
                    • 72

                    #10
                    Ok so I think I have found my problem thanks to this post:


                    So I double-checked the proxy log entries saying "first network error" and found that some item OIDs were malformed, as follows:
                    .3.6.1.2.1.31.1.1.1.9.6 instead of 1.3.6.1.2.1.31.1.1.1.9.6

                    The thing is that these items were not in the "Not Supported" state but in the "Enabled" state.

                    In fact, I developed a program to automatically create Zabbix items/triggers/graphs through the API. The typo was in my program code...

                    So now I have to find all the malformed OIDs and it should be much better!
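Hunting for OIDs with this specific defect can be scripted once the OID strings have been exported (for example from the same program that created the items). The sketch below is only a heuristic under stated assumptions: it treats numeric OIDs under 1.3.6.1 as fine and flags those that start with 3.6.1 as probably missing the leading "1". `classify_oid` is a hypothetical helper, and textual OIDs are deliberately left unjudged.

```python
import re

# A numeric OID: optional leading dot, then dot-separated decimal arcs.
NUMERIC_OID = re.compile(r"\.?\d+(?:\.\d+)*")

def classify_oid(oid):
    """Return 'ok', 'missing-leading-1', or 'other' for an SNMP OID string."""
    if not NUMERIC_OID.fullmatch(oid):
        return "other"  # textual or otherwise non-numeric OID; not judged here
    arcs = oid.lstrip(".").split(".")
    if arcs[:4] == ["1", "3", "6", "1"]:
        return "ok"
    if arcs[:3] == ["3", "6", "1"]:
        return "missing-leading-1"
    return "other"

# The first two OIDs are the ones quoted in the post.
for oid in [".3.6.1.2.1.31.1.1.1.9.6", "1.3.6.1.2.1.31.1.1.1.9.6"]:
    print(oid, classify_oid(oid))
```

Anything classified as 'other' would still need a manual look, since the heuristic only recognises the one typo described in the post.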


                    • mcmyst
                      Member
                      • Feb 2012
                      • 72

                      #11
                      We did it!


                      So the problem was 29 items with malformed OIDs that were not reported as "Not Supported"...

                      Thank you, everyone, for your help, and thank you 'tchjts1' for pointing me to the logs!

                      And as you can see, even though the housekeeper is at 100%, the queue is getting lower and lower!
