Gaps in graphs, poller process constant ~37% busy, server resources barely touched

  • danodemano
    Junior Member
    • Jun 2014
    • 21

    #1

    Gaps in graphs, poller process constant ~37% busy, server resources barely touched

    Hello,

    I hope you fine people can help me with this. I've pored over the forums, Google searches, blog posts, etc. and have been unable to resolve it. I have a Zabbix VM monitoring a handful of devices through SNMP. My new values per second (NVPS) figure is sitting at ~47 right now and the "hardware" is hardly being touched. However, I am getting gaps in my graphs and the poller (data gathering) processes are always around ~37% busy. Can someone help me figure out what's going on? I have another ~40 devices I would like to add, but I am reluctant to do so because I'm worried things will get much worse.

    EDIT: Forgot to add that I'm using the default built-in SNMP interfaces template to generate these graphs.

    I do notice this regularly in the log as well and assume it's related:

    Code:
     10217:20150210:095232.389 SNMP agent item "ifOperStatus[backplane 1/A/19]" on host "[HOST]" failed: first network error, wait for 15 seconds
     10248:20150210:095247.838 resuming SNMP agent checks on host "[HOST]": connection restored
     10245:20150210:095332.795 SNMP agent item "ifOutErrors[internal 1/21/2]" on host "[HOST]" failed: first network error, wait for 15 seconds
     10246:20150210:095347.869 resuming SNMP agent checks on host "[HOST]": connection restored
     10238:20150210:095432.876 SNMP agent item "ifOutErrors[backplane 1/A/19]" on host "[HOST]" failed: first network error, wait for 15 seconds
     10248:20150210:095447.869 resuming SNMP agent checks on host "[HOST]": connection restored
     10181:20150210:095532.582 SNMP agent item "ifInOctets[backplane 1/A/18]" on host "[HOST]" failed: first network error, wait for 15 seconds
     10248:20150210:095547.888 resuming SNMP agent checks on host "[HOST]": connection restored
     10192:20150210:095632.768 SNMP agent item "ifOutOctets[backplane 1/A/3]" on host "[HOST]" failed: first network error, wait for 15 seconds
     10247:20150210:095647.900 resuming SNMP agent checks on host "[HOST]": connection restored
     10208:20150210:095732.298 SNMP agent item "ifOperStatus[backplane 1/A/16]" on host "[HOST]" failed: first network error, wait for 15 seconds
     10246:20150210:095748.052 resuming SNMP agent checks on host "[HOST]": connection restored
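The log excerpt above can be summarised mechanically. The following is a minimal Python sketch, not part of the original post, that pairs each "first network error" line with the next "connection restored" line for the same host and reports how long checks were paused. The names `parse_line`, `outage_durations` and `SAMPLE` are illustrative assumptions; the sample lines are copied from the excerpt.

```python
import re
from datetime import datetime

# Zabbix server log prefix: PID:YYYYMMDD:HHMMSS.mmm followed by the message.
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3})\s+(.*)$")
ERROR_RE = re.compile(r'on host "([^"]+)" failed: first network error')
RESUME_RE = re.compile(r'resuming SNMP agent checks on host "([^"]+)": connection restored')

SAMPLE = [
    ' 10217:20150210:095232.389 SNMP agent item "ifOperStatus[backplane 1/A/19]" on host "[HOST]" failed: first network error, wait for 15 seconds',
    ' 10248:20150210:095247.838 resuming SNMP agent checks on host "[HOST]": connection restored',
    ' 10245:20150210:095332.795 SNMP agent item "ifOutErrors[internal 1/21/2]" on host "[HOST]" failed: first network error, wait for 15 seconds',
    ' 10246:20150210:095347.869 resuming SNMP agent checks on host "[HOST]": connection restored',
]


def parse_line(line):
    """Return (pid, timestamp, message), or None for lines without the log prefix."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    pid, date, clock, message = match.groups()
    timestamp = datetime.strptime(date + clock, "%Y%m%d%H%M%S.%f")
    return int(pid), timestamp, message


def outage_durations(lines):
    """Pair each first error with the next restore for the same host.

    Returns a list of (host, error_time, paused_seconds) tuples.
    """
    pending = {}  # host -> time of the first unmatched error
    outages = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, timestamp, message = parsed
        error = ERROR_RE.search(message)
        if error:
            pending.setdefault(error.group(1), timestamp)
            continue
        resume = RESUME_RE.search(message)
        if resume and resume.group(1) in pending:
            started = pending.pop(resume.group(1))
            outages.append((resume.group(1), started, (timestamp - started).total_seconds()))
    return outages


if __name__ == "__main__":
    for host, started, seconds in outage_durations(SAMPLE):
        print(f"{host} paused at {started:%H:%M:%S} for {seconds:.3f}s")
```

On the sample lines each pause lasts a little over 15 seconds, in line with the "wait for 15 seconds" in the message, and the two errors start roughly 60 seconds apart.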
    Relevant info:

    Zabbix config:
    Code:
    StartPollers=80
    StartIPMIPollers=5
    StartPollersUnreachable=5
    StartTrappers=15
    StartPingers=5
    StartDiscoverers=5
    StartTimers=15
    SenderFrequency=15
    CacheSize=256M
    StartDBSyncers=4
    HistoryCacheSize=128M
    TrendCacheSize=64M
    HistoryTextCacheSize=64M
    ValueCacheSize=64M
    Timeout=15
    LogSlowQueries=1000
    MySQL config:
    Code:
    tmp-table-size                 = 128M
    max-heap-table-size            = 128M
    query-cache-type               = 1
    query-cache-size               = 128M
    query-cache-limit              = 128M
    max-connections                = 400
    thread-cache-size              = 300
    open-files-limit               = 65535
    table-definition-cache         = 4096
    table-open-cache               = 4096
    table-cache                    = 512
    join-buffer-size               = 8M
    read-buffer-size               = 512k
    read-rnd-buffer-size           = 512k
    innodb-flush-method            = O_DIRECT
    innodb-log-files-in-group      = 2
    innodb-log-file-size           = 256M
    innodb-flush-log-at-trx-commit = 2
    innodb-file-per-table          = 1
    innodb-buffer-pool-size        = 2G
    innodb-log-buffer-size         = 4M
    innodb-thread-concurrency      = 0
    Pictures:









    Please let me know if you need additional information. Thanks!!!!
    Last edited by danodemano; 10-02-2015, 17:29.
  • danodemano
    Junior Member
    • Jun 2014
    • 21

    #2
    Since I ran into the 4-picture limit, here are more:








    • danodemano
      Junior Member
      • Jun 2014
      • 21

      #3
      And the last one:


      • danodemano
        Junior Member
        • Jun 2014
        • 21

        #4
        Alright, just for fun I upped the polling interval on the ports to 300 seconds. Either I'm missing something very basic or the graphing/polling in Zabbix is just garbage.

        Here is what Zabbix is giving me:



        Here is what Observium shows for the same port:



        And this is from the ISP showing the port from their side (theirs is flipped, latest data is on the left):



        Not only is Zabbix still missing chunks of data, it's not even close to an accurate representation of the usage. Am I missing something here?
        Last edited by danodemano; 12-02-2015, 22:44.


        • jan.garaj
          Senior Member
          Zabbix Certified Specialist
          • Jan 2010
          • 506

          #5
          Increase the timeout and the debug level, then check the logs for more details. It could be https://support.zabbix.com/browse/ZBX-7936
          Devops Monitoring Expert advice: Dockerize/automate/monitor all the things.
          My DevOps stack: Docker / Kubernetes / Mesos / ECS / Terraform / Elasticsearch / Zabbix / Grafana / Puppet / Ansible / Vagrant


          • danodemano
            Junior Member
            • Jun 2014
            • 21

            #6
            I've tried with a timeout of 30, no change. I will try upping the debug level tomorrow and see if it gives anything useful. Thanks!


            • danodemano
              Junior Member
              • Jun 2014
              • 21

              #7
              So I enabled debug logging for ~20 minutes this morning and the log file is 170MB. I can't really browse it so I guess the question is what am I actually looking for? Is there something I can search to help track down what's going on? I can post the file somewhere if that would help. Thanks!


              • jan.garaj
                Senior Member
                Zabbix Certified Specialist
                • Jan 2010
                • 506

                #8
                Look at what happened just before the network errors, e.g.:
                Code:
                30682:20120821:225422.239 Item [localhost:net.tcp.service[smtp]] error: Get value from agent failed: ZBX_TCP_READ() failed: [4] Interrupted system call
                ...
                30682:20120821:225422.306 Zabbix agent item [net.tcp.service[smtp]] on host [localhost] failed: first network error, wait for 15 seconds
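For a debug log too large to browse, the lookup described above (what the same process logged just before the network error) can be automated. This is a minimal Python sketch under stated assumptions: `context_before_errors`, its `marker` and `keep` parameters, and `SAMPLE` are hypothetical names, and the sample lines are the two example lines quoted above. It streams the input, so the whole file never has to fit in memory.

```python
import re
from collections import defaultdict, deque

# Only the PID is needed here; it is the first field of every server log line.
PID_RE = re.compile(r"^\s*(\d+):\d{8}:\d{6}\.\d{3}\s")

SAMPLE = [
    "30682:20120821:225422.239 Item [localhost:net.tcp.service[smtp]] error: Get value from agent failed: ZBX_TCP_READ() failed: [4] Interrupted system call",
    "...",
    "30682:20120821:225422.306 Zabbix agent item [net.tcp.service[smtp]] on host [localhost] failed: first network error, wait for 15 seconds",
]


def context_before_errors(lines, marker="first network error", keep=5):
    """Yield (pid, preceding_lines, error_line) for every line containing marker.

    preceding_lines holds up to `keep` earlier lines written by the same PID,
    so output from other pollers interleaved in the log is filtered out.
    """
    recent = defaultdict(lambda: deque(maxlen=keep))
    for raw in lines:
        line = raw.rstrip("\n")
        match = PID_RE.match(line)
        if match is None:
            continue  # skip lines without the PID:date:time prefix
        pid = match.group(1)
        if marker in line:
            yield pid, list(recent[pid]), line
            recent[pid].clear()
        else:
            recent[pid].append(line)


if __name__ == "__main__":
    for pid, before, error in context_before_errors(SAMPLE):
        print(f"--- process {pid}")
        for earlier in before:
            print(earlier)
        print(error)
```

To run it against a real log, pass an open file object instead of `SAMPLE`; the log path depends on the installation.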

                • danodemano
                  Junior Member
                  • Jun 2014
                  • 21

                  #9
                  Looks like it's getting SNMP timeouts?

                  Code:
                  30316:20150213:092733.806 getting SNMP values failed: Timeout while connecting to "10.20.28.1:161".
                   30316:20150213:092733.806 End of get_values_snmp()
                   30316:20150213:092733.806 In deactivate_host() hostid:10106 itemid:41695 type:4
                   30316:20150213:092733.806 query [txnlev:1] [begin;]
                   30316:20150213:092733.806 query [txnlev:1] [update hosts set snmp_errors_from=1423837653,snmp_disable_until=1423837668,snmp_error='Timeout while connecting to "10.20.28.1:161".' where hostid=10106]
                   30316:20150213:092733.807 query [txnlev:1] [commit;]
                   30316:20150213:092733.807 SNMP agent item "ifAdminStatus[cross copy data 1/5/1]" on host "TA5006" failed: first network error, wait for 15 seconds
                   30316:20150213:092733.807 deactivate_host() errors_from:1423837653 available:1
                   30316:20150213:092733.807 End of deactivate_host()
                   30316:20150213:092733.807 End of get_values():1
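The `update hosts` query in the debug output above carries the back-off window as two Unix timestamps. As a small worked example (a sketch, not something from the thread; `backoff_window` and `to_utc` are illustrative names), the values can be extracted and compared like this:

```python
import re
from datetime import datetime, timezone

# Query text copied from the debug log above.
QUERY = """update hosts set snmp_errors_from=1423837653,snmp_disable_until=1423837668,snmp_error='Timeout while connecting to "10.20.28.1:161".' where hostid=10106"""


def backoff_window(query):
    """Return (errors_from, disable_until, window_seconds) from the query text."""
    errors_from = int(re.search(r"snmp_errors_from=(\d+)", query).group(1))
    disable_until = int(re.search(r"snmp_disable_until=(\d+)", query).group(1))
    return errors_from, disable_until, disable_until - errors_from


def to_utc(epoch_seconds):
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    start, until, window = backoff_window(QUERY)
    print(f"errors from   {to_utc(start):%Y-%m-%d %H:%M:%S} UTC")
    print(f"disable until {to_utc(until):%Y-%m-%d %H:%M:%S} UTC")
    print(f"window        {window} seconds")
```

The window comes out at 15 seconds, the same figure as in the "wait for 15 seconds" message, so SNMP checks for this host are paused for that long after the timeout. The first timestamp converts to 14:27:33 UTC while the log line is stamped 09:27:33, which suggests the server logs in a UTC-5 local time.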


                  • jan.garaj
                    Senior Member
                    Zabbix Certified Specialist
                    • Jan 2010
                    • 506

                    #10
                    Ask your network why you are getting SNMP timeouts :-) Firewalls, routers, switches, rate limits, Zabbix bulk requests, ...

                    Code:
                    getting SNMP values failed: Timeout while connecting to "10.20.28.1:161".

                    • danodemano
                      Junior Member
                      • Jun 2014
                      • 21

                      #11
                      That's the part I don't get: I have other servers polling these same devices on the same network that don't have a problem. The whole network is layer 2 and latency is <2ms across the board. Oh well, time to scrap Zabbix and look elsewhere. Thanks.


                      • jan.garaj
                        Senior Member
                        Zabbix Certified Specialist
                        • Jan 2010
                        • 506

                        #12
                        Did you try disabling bulk requests ("Use bulk requests" in the SNMP agent interface config)? And are you using the latest version, 2.4.3?


                        • danodemano
                          Junior Member
                          • Jun 2014
                          • 21

                          #13
                          I just disabled bulk requests, we'll see if that makes a difference. And I am on 2.4.3.


                          • Jason
                            Senior Member
                            • Nov 2007
                            • 430

                            #14
                            I see the same thing on 2.2.7.

                            We have bulk SNMP disabled and the Timeout set at 30.

                            Not sure if it's the way SNMP is called from Zabbix... snmpwalk or snmpget from the command line always returns quickly and never times out.
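One way to make the command-line comparison more systematic is to time many single requests from the Zabbix server itself with retries switched off, since the Net-SNMP tools retry by default and an occasional lost packet would otherwise go unnoticed. The sketch below is an assumption-laden illustration, not something posted in the thread: it assumes the Net-SNMP `snmpget` binary is installed, SNMP v2c, and a community string of `public`; `build_snmpget_cmd`, `time_command`, `probe` and `NOOP_CMD` are hypothetical names.

```python
import subprocess
import sys
import time


def build_snmpget_cmd(host, oid, community="public", timeout_s=1, retries=0):
    """Build a Net-SNMP snmpget command line (v2c and the community are assumptions)."""
    return ["snmpget", "-v2c", "-c", community,
            "-t", str(timeout_s), "-r", str(retries), host, oid]


def time_command(cmd, hard_limit_s=60):
    """Run cmd once and return (elapsed_seconds, return_code)."""
    start = time.monotonic()
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL, timeout=hard_limit_s)
    return time.monotonic() - start, proc.returncode


def probe(cmd, attempts=20, pause_s=1.0):
    """Run cmd repeatedly and collect (elapsed_seconds, return_code) pairs."""
    results = []
    for _ in range(attempts):
        results.append(time_command(cmd))
        time.sleep(pause_s)
    return results


# Harmless stand-in command for trying the timer without a device.
NOOP_CMD = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    # 1.3.6.1.2.1.1.3.0 is sysUpTime; the address is the one from the debug log.
    cmd = build_snmpget_cmd("10.20.28.1", "1.3.6.1.2.1.1.3.0")
    results = probe(cmd)
    failures = sum(1 for _, code in results if code != 0)
    slowest = max(elapsed for elapsed, _ in results)
    print(f"{failures}/{len(results)} failed, slowest {slowest:.3f}s")
```

If every attempt succeeds quickly here but Zabbix still logs timeouts, the difference more likely lies in how the requests are issued (volume, concurrency, retries) than in basic reachability.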


                            • danodemano
                              Junior Member
                              • Jun 2014
                              • 21

                              #15
                              Yep, I disabled bulk requests and upped the timeout to 30, and I am still seeing the issue.
