Zabbix proxy stops monitoring some hosts

  • Cray
    Member
    • Mar 2009
    • 72

    #1


    Hi all !

    I'm currently facing a problem that I've been investigating for 2 days (without any luck, I must say ...).

    For some reason I can't explain, one of my proxies stopped monitoring some hosts a few days ago (for instance, SNMP items and simple checks such as pings no longer return data).

    E.g. (data is no longer provided for the selected host):



    This is not a simple mistake (such as a host IP change, SNMP being disabled on the host, etc.):

    - I no longer receive data from simple checks (ICMP) and SNMP, but the devices are still polled correctly by the discovery rules!

    - I can run ping / fping / snmpget against the affected devices and it works! But it looks like no more data reaches Zabbix ôo

    The very strange thing is that it happens only for a few hosts (switches and printers); the others are being monitored just fine.
    I tried restarting the Zabbix proxy process, and even the server, without any luck.

    I would greatly appreciate any help.
    Last edited by Cray; 27-04-2009, 17:04.
  • MarkusL
    Member
    • Nov 2008
    • 41

    #2
    Hi Cray,

    you might want to check /tmp/zabbix_proxy.log (tail -f).

    We use 8 proxies at the moment and have seen nearly the same effects.
    In our situation the problem seemed to come from wrong SNMP OIDs, wrong perfmon counters or something like that (maybe query timeouts, too).

    My guess is the following. Maybe someone from the Zabbix team could comment on it:

    when one or more items have wrong keys, time out, etc., the proxy slows down their query interval. After some time X, it stops querying them completely. The only way to get it working again in our case was to reduce the number of "error items" on the template host (ideally to zero) and then restart the proxy.

    But, as mentioned at the beginning, our golden tool is the log file.


    Kind regards,

    Markus.


    • Cray
      Member
      • Mar 2009
      • 72

      #3
      Hello Markus

      I'm glad somebody else has this problem too; I often feel alone with some Zabbix issues.

      The log file was the first thing I checked when I first saw this problem. I don't know if it's normal, but the log file is full of SQL queries (DebugLevel is set to 3, but they appear even when it is set to 2).

      3887:20090428:110622 Starting zabbix_proxy. ZABBIX 1.6.4.
      3887:20090428:110622 **** Enabled features ****
      3887:20090428:110622 SNMP monitoring: YES
      3887:20090428:110622 WEB monitoring: YES
      3887:20090428:110622 ODBC: NO
      3887:20090428:110622 IPv6 support: NO
      3887:20090428:110622 **************************
      3889:20090428:110622 server #1 started [Configuration syncer]
      3891:20090428:110622 server #2 started [Datasender]

      [...]

      4068:20090428:110625 server #50 started [ICMP pinger]
      4072:20090428:110625 server #51 started [ICMP pinger]
      4075:20090428:110625 server #52 started [ICMP pinger]
      4078:20090428:110625 server #53 started [Housekeeper]
      4078:20090428:110625 Executing housekeeper
      4140:20090428:110628 server #74 started [HTTP Poller]
      4085:20090428:110628 server #54 started [Poller for unreachable hosts. SNMP:YES]
      4088:20090428:110628 server #55 started [Poller for unreachable hosts. SNMP:YES]
      4096:20090428:110628 server #56 started [Poller for unreachable hosts. SNMP:YES]
      4099:20090428:110628 server #58 started [Poller for unreachable hosts. SNMP:YES]
      4098:20090428:110628 server #57 started [Poller for unreachable hosts. SNMP:YES]
      4111:20090428:110629 server #59 started [Poller for unreachable hosts. SNMP:YES]
      4114:20090428:110629 server #60 started [Poller for unreachable hosts. SNMP:YES]
      4117:20090428:110629 server #61 started [Poller for unreachable hosts. SNMP:YES]
      4120:20090428:110629 server #62 started [Poller for unreachable hosts. SNMP:YES]
      4122:20090428:110629 server #63 started [Poller for unreachable hosts. SNMP:YES]
      4123:20090428:110629 server #64 started [Poller for unreachable hosts. SNMP:YES]
      4126:20090428:110629 server #65 started [Poller for unreachable hosts. SNMP:YES]
      4130:20090428:110629 server #67 started [Poller for unreachable hosts. SNMP:YES]
      4129:20090428:110629 server #66 started [Poller for unreachable hosts. SNMP:YES]
      4136:20090428:110630 server #70 started [Poller for unreachable hosts. SNMP:YES]
      4135:20090428:110630 server #69 started [Poller for unreachable hosts. SNMP:YES]
      4132:20090428:110630 server #68 started [Poller for unreachable hosts. SNMP:YES]
      4138:20090428:110630 server #72 started [Poller for unreachable hosts. SNMP:YES]
      4137:20090428:110630 server #71 started [Poller for unreachable hosts. SNMP:YES]
      4139:20090428:110630 server #73 started [Poller for unreachable hosts. SNMP:YES]
      4141:20090428:110630 server #75 started [Discoverer. SNMP:YES]
      4145:20090428:110630 server #76 started [Discoverer. SNMP:YES]
      4150:20090428:110630 server #77 started [Discoverer. SNMP:YES]
      3887:20090428:110630 server #0 started [Heartbeat sender]
      4157:20090428:110630 server #79 started [Discoverer. SNMP:YES]
      4154:20090428:110630 server #78 started [Discoverer. SNMP:YES]
      4168:20090428:110631 server #81 started [Discoverer. SNMP:YES]
      4161:20090428:110631 server #80 started [Discoverer. SNMP:YES]
      4171:20090428:110631 server #83 started [Discoverer. SNMP:YES]
      4170:20090428:110631 server #82 started [Discoverer. SNMP:YES]
      4172:20090428:110631 server #84 started [Discoverer. SNMP:YES]
      4179:20090428:110631 server #86 started [Discoverer. SNMP:YES]
      4174:20090428:110631 server #85 started [Discoverer. SNMP:YES]
      4188:20090428:110631 server #88 started [Discoverer. SNMP:YES]
      4184:20090428:110631 server #87 started [Discoverer. SNMP:YES]
      4191:20090428:110631 server #89 started [Discoverer. SNMP:YES]
      4193:20090428:110631 server #90 started [Discoverer. SNMP:YES]
      4197:20090428:110631 server #91 started [Discoverer. SNMP:YES]
      4200:20090428:110631 server #92 started [Discoverer. SNMP:YES]
      4206:20090428:110631 server #94 started [Discoverer. SNMP:YES]
      4202:20090428:110631 server #93 started [Discoverer. SNMP:YES]
      4078:20090428:110632 Deleted 8665 records from history [4.382097 seconds]
      3889:20090428:110641 [Z3005] Query failed: [0] columns hostid, key_ are not unique [update items set type=3,snmp_community='public',snmp_oid='interfaces.ifTable.ifEntry.ifInOctets.1',snmp_port=161,hostid=10126,key_='icmpping',delay=30,status=0,value_type=3,trapper_hosts='',units='',multiplier=0,delta=0,snmpv3_securityname='',snmpv3_securitylevel=0,snmpv3_authpassphrase='',snmpv3_privpassphrase='',formula='1',logtimefmt='',templateid=23264,valuemapid=0,delay_flex='',params='DSN=<database source name>\nuser=<user name>\npassword=<password>\nsql=<query>',ipmi_sensor='' where itemid=25194;
      update items set type=0,snmp_community='',snmp_oid='',snmp_port=161,hostid=10126,key_='perf_counter[\Processor(0)\% Processor Time]',delay=10,status=0,value_type=0,trapper_hosts='',units='',multiplier=0,delta=0,snmpv3_securityname='',snmpv3_securitylevel=0,snmpv3_authpassphrase='',snmpv3_privpassphrase='',formula='1',logtimefmt='',templateid=25076,valuemapid=0,delay_flex='',params='',ipmi_sensor='' where itemid=25195;
      update items set type=0,snmp_community='',snmp_oid='',snmp_port=161,hostid=10126,key_='perf_counter[\Processor(1)\% Processor Time]',delay=30,status=0,value_type=0,trapper_hosts='',units='',multiplier=0,delta=0,snmpv3_securityname='',snmpv3_securitylevel=0,snmpv3_authpassphrase='',snmpv3_privpassphrase='',formula='1',logtimefmt='',templateid=25077,valuemapid=0,delay_flex='',params='',ipmi_sensor='' where itemid=25196;
      update items set type=0,snmp_community='',snmp_oid='',snmp_port=161,hostid=10126,key_='agent.ping',delay=30,status=0,value_type=3,trapper_hosts='',units='',multiplier=0,delta=0,snmpv3_securityname='',snmpv3_securitylevel=0,snmpv3_authpassphrase='',snmpv3_privpassphrase='',formula='0',logtimefmt='',templateid=23222,valuemapid=1,delay_flex='',params='',ipmi_sensor='' where itemid=25197;
      update items set type=0,snmp_community='',snmp_oid='',snmp_port=161,hostid=10126,key_='agent.version',delay=3600,status=0,value_type=1,trapper_hosts='',units='',multiplier=0,delta=0,snmpv3_securityname='',snmpv3_securitylevel=0,snmpv3_authpassphrase='',snmpv3_privpassphrase='',formula='0',logtimefmt='',templateid=23229,valuemapid=0,delay_flex='',params='',ipmi_sensor='' where itemid=25198;
      update items set type=0,snmp_community='',snmp_oid='',snmp_port=161,hostid=10126,key_='perf_counter[\PhysicalDisk(_Total)\Avg. Disk Write Queue Length]',delay=30,status=0,value_type=0,trapper_hosts='',units='',multiplier=0,delta=0,snmpv3_securityname='',snmpv3_securitylevel=0,snmpv3_authpassphrase='',snmpv3_privpassphrase='',formula='1',logtimefmt='',templateid=23212,valuemapid=0,delay_flex='',params='',ipmi_sensor='' where itemid=25199;
      update items set type=0,snmp_community='',snmp_oid='',snmp_port=161,hostid=10126,key_='perf_counter[\System\File Read Bytes/sec]',delay=30,status=0,value_type=0,trapper_hosts='',units='Bps',multiplier=0,delta=0,snmpv3_securityname='',snmpv3_securitylevel=0,snmpv3_authpassphrase='',snmpv3_privpassphrase='',formula='1',logtimefmt='',templateid=23213,valuemapid=0,delay_flex='',params='',ipmi_sensor='' where itemid=25200;

      - Is there any way to prevent those SQL queries from appearing in the proxy log file? (They make it very hard to read, which is why I haven't pasted all the lines; you would have to scroll for an hour.)

      - One line that caught my attention is the one below:

      3889:20090428:110641 [Z3005] Query failed: [0] columns hostid, key_ are not unique [update items set [...]

      I don't know if it's 'normal'; maybe the Zabbix developers would like to comment on this log.

      - Anyway, if you want me to check the log file for a specific line, I will do it with pleasure.
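
Until the SQL dumps can be switched off, the log can at least be made readable after the fact. The following is a minimal sketch, not a Zabbix feature: it assumes every real log entry starts with `PID:YYYYMMDD:HHMMSS ` as in the excerpt above, and that lines without that prefix are continuations of a multi-line SQL dump. The function name `summarize_log` and the 100-character cut-off are illustrative choices.

```python
import re

# Assumed entry shape, taken from the log excerpt: "PID:YYYYMMDD:HHMMSS message"
LOG_LINE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}) (.*)$")


def summarize_log(lines, max_len=100):
    """Keep timestamped entries, skip untimestamped SQL continuation lines,
    and truncate long messages so failed queries stay visible but short."""
    out = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # assumed to be the continuation of a multi-line SQL dump
        pid, date, clock, msg = m.groups()
        if len(msg) > max_len:
            msg = msg[:max_len] + " [...]"
        out.append(f"{pid}:{date}:{clock} {msg}")
    return out
```

It can be fed an open file object, for example `summarize_log(open("/tmp/zabbix_proxy.log", errors="replace"))`, using the log path mentioned earlier in the thread.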


      About the solution you're proposing:

      - Generally, I only keep the items that exist on the target device, so that there are no "not supported" messages at all (I believe this is what you mean by 'error items'). But originally I started having these issues even with only simple checks (for example, a single ICMP check on a device; after a while, Zabbix would no longer gather the ICMP value).

      P.S.: I want to stress that if I run manual commands against the target device (ping, SNMP queries, etc.), they work.


      • MarkusL
        Member
        • Nov 2008
        • 41

        #4
        Hello Cray,

        this looks a little strange to me. I checked three of our proxies, and none of them has SQL statements in its log file.

        The line you posted (Query failed) could be the cause of your trouble, maybe (sorry, I'm not that deep into the DB structure).
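
The "columns hostid, key_ are not unique" message suggests that the update tried to give item 25194 a key that another item on host 10126 already uses. A minimal sketch for locating such a conflicting row in a copy of the proxy's SQLite database follows. It assumes only the `items` columns visible in the logged statements (`itemid`, `hostid`, `key_`); the helper name, the index name `items_1` and the row with itemid 30001 are made up for the demo.

```python
import sqlite3


def find_conflicting_items(conn, hostid, key, itemid):
    """Return itemids on the same host that already use `key`,
    excluding the item the failed update was addressed to."""
    rows = conn.execute(
        "SELECT itemid FROM items WHERE hostid = ? AND key_ = ? AND itemid <> ?",
        (hostid, key, itemid),
    ).fetchall()
    return [row[0] for row in rows]


# Stand-in database with a hypothetical conflicting row (itemid 30001).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (itemid INTEGER PRIMARY KEY, hostid INTEGER, key_ TEXT)")
conn.execute("CREATE UNIQUE INDEX items_1 ON items (hostid, key_)")  # index name is an assumption
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(25194, 10126, "old.key"), (30001, 10126, "icmpping")],
)
print(find_conflicting_items(conn, 10126, "icmpping", 25194))  # [30001]
```

Against a real database, the values to pass are the hostid, key_ and itemid from the failed statement in the log.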

        Did you try to set up a completely new template with only one item?
        I often ran into trouble when two templates with similar items were linked to a host. Even after fixing the problem, removing both templates from the host AND clearing all history, it did not work.
        Sometimes the only thing that helps is starting completely from scratch with the template involved.

        A little off-topic:
        I often ran into these problems back when I was using Applications in items / templates. When using the EXACTLY SAME Application name in different templates and linking two or three of them to one host, that's it: in 90% of my cases I was then unable to link another template to the host or change items in the linked templates. That's why I don't use Applications any more. My own solution is to name my items e.g. "000_System_Info| DNS-Name" or something similar. With the leading 000-999 I can group all items of all templates the way I want.


        Kind regards,

        Markus.


        • Cray
          Member
          • Mar 2009
          • 72

          #5
          Hello Markus,

          The fact that I see SQL queries in the proxy log files and you don't seems really weird (I have already posted a thread about this strange behavior; we'll see if somebody has come across the same problem).

          Which version of Zabbix are you running? (I have 1.6.4 for the server, the proxies and the agents.)

          Back to the problem of the proxy no longer monitoring some hosts:

          - I will try a fresh template and see if it solves the problem.

          BTW: I set the debug level to 4 (developer), and here's a sample of what I find when analyzing the logs:

          3964:20090428:170001 End of process_ping()
          3964:20090428:170001 End of do_ping() result=SUCCEED
          3964:20090428:170001 Host [10.0.0.243] alive [1] 0.002810 sec.
          3964:20090428:170001 In process_value([email protected])
          3964:20090428:170001 Query [select i.itemid,i.key_,h.host,h.port,i.delay,i.description,i.nextcheck,i.type,i.snmp_community,i.snmp_oid,h.useip,h.ip,i.history,i.lastvalue,i.prevvalue,i.hostid,h.status,i.value_type,h.errors_from,i.snmp_port,i.delta,i.prevorgvalue,i.lastclock,i.units,i.multiplier,i.snmpv3_securityname,i.snmpv3_securitylevel,i.snmpv3_authpassphrase,i.snmpv3_privpassphrase,i.formula,h.available,i.status,i.trapper_hosts,i.logtimefmt,i.valuemapid,i.delay_flex,h.dns,i.params,i.trends,h.useipmi,h.ipmi_port,h.ipmi_authtype,h.ipmi_privilege,h.ipmi_username,h.ipmi_password,i.ipmi_sensor,i.lastlogsize from hosts h, items i where h.hostid%10=6 and h.status=0 and h.hostid=i.hostid and h.proxy_hostid=0 and h.useip=1 and h.ip='10.0.0.243' and i.key_='icmpping' and i.status in (0,3) and i.type=3 and i.nextcheck<=1240930799]
          3922:20090428:170001 Get value from agent result: '1101.143953'
          3922:20090428:170001 End get_value()
          3922:20090428:170001 Query [begin;]

          The device at 10.0.0.243 is a switch for which Zabbix shows no data (but I know the switch is on; I can ping it, run SNMP queries against it, etc.).

          So Zabbix says it has no data about the switch, but in the proxy log file I can see that the Zabbix proxy successfully pinged the device and the host showed as alive.

          What the hell is going on ôo ...
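
To compare what the pinger saw with what the frontend shows, the ping results for one address can be pulled out of a debug-level-4 log. This is a minimal sketch that assumes the line shape from the sample above (`Host [ip] alive [1] 0.002810 sec.`) and, as a further assumption, that unreachable hosts are logged in the same shape with `alive [0]`. The name `ping_results` is illustrative.

```python
import re

# Assumed shape: "PID:YYYYMMDD:HHMMSS Host [ip] alive [0|1] <seconds> sec."
ALIVE = re.compile(
    r"^\s*\d+:(\d{8}):(\d{6}) Host \[([^\]]+)\] alive \[(\d)\] ([\d.]+) sec\."
)


def ping_results(lines, ip):
    """Return (date, time, alive, seconds) for every pinger result logged for `ip`."""
    results = []
    for line in lines:
        m = ALIVE.match(line)
        if m and m.group(3) == ip:
            results.append((m.group(1), m.group(2), m.group(4) == "1", float(m.group(5))))
    return results
```

If this returns recent successful pings for a host whose item shows no data, the values are being collected but lost somewhere after the pinger.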


          • MarkusL
            Member
            • Nov 2008
            • 41

            #6
            Hi Cray,

            OK, I don't want to go too deep into the log file problem, because I only have partial knowledge of it.

            I tested debug level 4 on one of our proxies. We get SQL queries, too. I think they have nothing to do with values going from the proxy to the server (or not going, in your case). I think the SQL queries come from the server->proxy configuration sync. When you update a template with a new item and it is linked to a host monitored by a proxy, the proxy has to update its database. In the config file there is a parameter called ConfigFrequency (default 3600; we use 120). This is how often the proxy asks for config updates.

            If my assumption is correct, I would say you have a database problem on your proxy. But, again, I only have partial knowledge of this. It's only my reasoning.

            We had 1.6.1; we are just upgrading all our proxies to 1.6.4 (the main server is 1.6.4).


            Kind regards,

            Markus.


            • Cray
              Member
              • Mar 2009
              • 72

              #7
              Hey Markus (and thanks for your replies, btw)

              My ConfigFrequency is set to 60 (1 min), and if I look at the master's log file, I can see it sending the config update to the proxy.

              Like you, I don't know if it's a database problem, but if it is, it would be great to know where it comes from (as the proxy only works with SQLite, I cannot test it with MySQL or another DB).

              I think I have found something that might be really interesting here:

              If I look at the Queue tab in the Administration menu, I have many pending checks whose next check is always at ..... 01/01/1970 (you might want to check that, as it is probably the reason the proxy is not polling the devices: this date is 40 years in the past, which may be why Zabbix never does the polling).



              And (surprisingly), the devices scheduled to be polled on 01/01/1970 are the ones Zabbix no longer receives data from.
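
A date of 01/01/1970 is what a Unix timestamp of 0 looks like when rendered, so one reading is that these items have a next-check time of 0, i.e. they were never scheduled. That the Queue column is derived from the item's `nextcheck` value is an assumption here, not something the thread confirms. The sketch below decodes a timestamp; `describe_nextcheck` is an illustrative name, and 1240930799 is the `nextcheck` bound from the query logged earlier.

```python
from datetime import datetime, timezone


def describe_nextcheck(ts):
    """Render a Unix timestamp; 0 is treated as 'never scheduled' (an assumption)."""
    if ts == 0:
        return "never scheduled (epoch 0, displayed as 01/01/1970)"
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


print(describe_nextcheck(0))           # never scheduled (epoch 0, displayed as 01/01/1970)
print(describe_nextcheck(1240930799))  # 2009-04-28 14:59:59 UTC
```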
              Last edited by Cray; 28-04-2009, 19:54.


              • Cray
                Member
                • Mar 2009
                • 72

                #8
                I (still) haven't found a solution to that problem

                I would love to have even a clue for this 01/01/1970 never-ending queue...


                • MarkusL
                  Member
                  • Nov 2008
                  • 41

                  #9
                  Hi Cray,

                  we are currently seeing the same thing in Administration\Queue. Our server has about 200 lines with 01.01.1970, followed by the actual queue entries.

                  Maybe someone from the Zabbix team could explain what the 01.01.1970 is about?


                  As our proxies all work fine (MySQL InnoDB), I do NOT believe that your basic problem is related to the 01.01.1970 issue. But that's just a guess of mine.


                  Kind regards,

                  Markus.


                  • Cray
                    Member
                    • Mar 2009
                    • 72

                    #10
                    Hi again Markus

                    You said your proxies are running on MySQL InnoDB, but as far as I remember, the Zabbix proxy only supports SQLite.

                    Regarding my still-unresolved problem:

                    You say the 01.01.1970 items being queued indefinitely might not be the reason my proxies aren't getting any data, but, as I already said:

                    - all the indefinitely queued items in the Zabbix administration panel are the ones I'm not getting any data for

                    - so basically it looks like the Zabbix server never tells the proxy to monitor some items (as they are supposed to be monitored on 01.01.1970)

                    This vicious circle is driving me mad. I've tried many things to solve this queue problem (checking time sync, disabling/re-enabling the items, flushing the proxy config .....)


                    • MarkusL
                      Member
                      • Nov 2008
                      • 41

                      #11
                      Hi Cray,

                      yep, our proxies all run on MySQL / InnoDB, just like our server. I think proxy support for MySQL/InnoDB was new in 1.6.1, but I'm not 100% sure.

                      Well, the 1970 entries might not be the problem, but again: I only have partial knowledge. Maybe someone from the Zabbix team can explain what the 1970 entries are about.

                      I can tell you the following:
                      we have a "nodata" trigger on ALL items, firing as soon as no data arrives for anywhere from 2 min up to 32 days (monthly backups). So far NO nodata trigger has fired except when it should :-) (a batch not running or something), and never for the items showing 1970.
                      That's why I believe your problem does not come from the 1970 entries. But again, it's only a comparison of what you have and what I have.
                      And since we have totally different databases, I think it doesn't make much sense to compare our systems in that much depth. Is it possible for you to set up a virtual proxy or something with 1.6.1+, MySQL and InnoDB? That would be quite interesting.


                      Kind regards,

                      Markus.


                      • Cray
                        Member
                        • Mar 2009
                        • 72

                        #12
                        Hey Markus

                        As you said, it makes more sense to compare similar configurations when troubleshooting.

                        I will set up a proxy with a MySQL/InnoDB database to see how it behaves.

                        (Dunno why I thought the Zabbix proxy only supported SQLite; I must have read it somewhere, or maybe I was already looking forward to embedded Zabbix hardware.)

                        I will keep you posted!


                        • Cray
                          Member
                          • Mar 2009
                          • 72

                          #13
                          Bump

                          I replaced one of the proxies that was running SQLite with one running MySQL (InnoDB), and so far, all the items that were stuck in the queue (and that were related to this particular proxy) were flushed almost instantly.

                          I will keep an eye on the proxy over the next few days to see how it behaves (the problem started happening on my SQLite proxy after a few days).

                          I was wondering if this problem is related to database engine performance (on p. 36 of the Zabbix manual, the comparison table states that SQLite is suitable for a light-duty proxy). Maybe the SQLite engine is overloaded when monitoring beyond N items.... but how could I check that?
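
One rough way to look into the SQLite question is to count the items in a copy of the proxy's database and time a query similar to the one the pinger logged. The sketch below is only a starting point under stated assumptions: it relies on the `items` columns `status` and `nextcheck` seen in the logged queries, it treats `nextcheck = 0` as "never scheduled", and the name `items_overview` is illustrative. Working on a copy of the database file avoids competing with the running proxy for locks.

```python
import sqlite3
import time


def items_overview(conn, now):
    """Return (total, due, never_scheduled, seconds) for the items table.

    `due` counts active items whose nextcheck is in the past,
    `never_scheduled` counts active items with nextcheck = 0,
    `seconds` is how long the three counting queries took.
    """
    start = time.perf_counter()
    total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    due = conn.execute(
        "SELECT COUNT(*) FROM items WHERE status = 0 AND nextcheck <= ?", (now,)
    ).fetchone()[0]
    never = conn.execute(
        "SELECT COUNT(*) FROM items WHERE status = 0 AND nextcheck = 0"
    ).fetchone()[0]
    return total, due, never, time.perf_counter() - start
```

A typical call would be `items_overview(sqlite3.connect(path_to_copy), int(time.time()))`, where `path_to_copy` is wherever the copied database file was placed. A large `due` count that keeps growing, or counting queries that take seconds rather than milliseconds, would point towards the database being the bottleneck.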


                          • elvar
                            Senior Member
                            • Feb 2008
                            • 226

                            #14
                            This problem is plaguing me as well. My proxy hosts (switches) just randomly stop pulling data and stay queued indefinitely. They work fine for a while, then bam, nothing. This is maddening. Running 1.8 on the server and 1.8 on the proxy.
