Zabbix 3.0.1 trappers stop processing new data

  • fire555
    Junior Member
    • May 2011
    • 7

    #1

    Zabbix 3.0.1 trappers stop processing new data

    Hey all,

    I have been pulling my hair out over this one.

    I recently upgraded a Zabbix 1.8.5 instance to 3.0.1. The upgrade went well and everything was working fine, except that periodically the trappers stop processing new data. This seems to coincide with something the timer or housekeeping process is doing.

    I thought perhaps the upgrade had left behind some rubbish, so I did a fresh Zabbix 3.0.1 install on a brand new CentOS 7 instance, using a brand new database created from the creation scripts. I copied the template over from the old instance via the export/import method.

    Hoping this would resolve the issue, I changed the DNS and my monitored systems started sending data to the new instance. Everything was going fine, until a couple of hours later the same behaviour presented itself.

    I have tried turning logging up to level 4, but the amount of information logged is almost impossible to wade through and has not revealed anything useful.

    The Zabbix frontend reports that it cannot connect to the Zabbix server, even though other Zabbix threads seem to be connecting to the database and running commands fine.

    The database is MySQL. The monitored systems all use the 1.8.5 version of the Zabbix agent, as this is very difficult to mass-update at this stage. Could this be the source of my issue? From what I understand, the 3.0.1 server shouldn't have a problem with an old agent.

    Does anyone have any ideas how to isolate this issue?

    Thanks
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    It seems like you have a performance bottleneck in the storage used by the Zabbix server's database backend.
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • fire555
      Junior Member
      • May 2011
      • 7

      #3
      A little more info.

      The pollers carry on without interruption. It is only the trappers that stop writing new values to the database.

      I cannot believe this is a storage bottleneck. This same environment barely worked up a sweat handling data from 500 monitored hosts on version 1.8.5. And if it were the case, why would the pollers continue to save data?

      The problem also does not appear to be linked to the housekeeper running. Last night the trappers stopped at 1:26 am, while the housekeeper was running at 14 minutes past the hour.


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        Originally posted by fire555
        I cannot believe this is a storage bottleneck.
        Engineering is not something that should be approached as a matter of belief.
        Housekeeping adds more read and write I/O while it is running.
        Do you have monitoring on your Zabbix DB? If not, just log in on the host where your DB backend is running and at least check the iostat/sar output.
        Do you know how many read and write IO/s the storage used by the DB is doing?
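
        For instance, a couple of one-off checks on the DB host with the standard sysstat tools (nothing specific to this setup) will show the per-device IO rates and utilisation:

        Code:
        # extended per-device stats (r/s, w/s, %util), three 5-second samples
        iostat -dxm 5 3

        # live per-device disk activity, five 1-second samples (requires sysstat)
        sar -d -p 1 5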
        Last edited by kloczek; 28-03-2016, 17:22.


        • mdiorio
          Junior Member
          • Mar 2016
          • 27

          #5
          I'm just getting up and running with Zabbix now, and I'm not seeing any data from trappers getting in either. I'm pulling data in from an Elasticsearch server. Agent data comes back properly, but trapper data does not.

          From the agent side, I'm seeing:

          Code:
          2016-03-28 11:21:22,214 DEBUG zbxsender send_to_zabbix:58 Got response from Zabbix: {u'info': u'processed: 0; failed: 4; total: 4; seconds spent: 0.000018', u'response': u'success'}
          2016-03-28 11:21:22,214 INFO zbxsender send_to_zabbix:59 processed: 0; failed: 4; total: 4; seconds spent: 0.000018
          Yet I'm seeing values in the server logs - and you're right, debug level 4 is insane; I can't imagine what 5 is like:

          Code:
            9185:20160328:112635.725 __zbx_zbx_setproctitle() title:'trapper #4 [processing data]'
            9185:20160328:112635.725 trapper got '{
                  "request":"sender data",
                  "data":[
                          {
                                  "host":"ghq-1delasticnode01.globalspec.net",
                                  "key":"health[initializing_shards]",
                                  "value":0,
                                  "clock":1459178782.36},
                          {
                                  "host":"ghq-1delasticnode01.globalspec.net",
                                  "key":"health[relocating_shards]",
                                  "value":0,
                                  "clock":1459178782.36},
                          {
                                  "host":"ghq-1delasticnode01.globalspec.net",
                                  "key":"health[unassigned_shards]",
                                  "value":62,
                                  "clock":1459178782.36},
                          {
                                  "host":"ghq-1delasticnode01.globalspec.net",
                                  "key":"health[delayed_unassigned_shards]",
                                  "value":0,
                                  "clock":1459178782.36}]
          }'
          But my host items do not get any data. I doubt my issue is a bottleneck either. I am only monitoring two hosts and about 10 web scenarios; one host is strictly a Zabbix agent on Windows, and the other is this one. I'm using an SQLite DB right now, but with this minimal number of hosts/items it should be working just fine.


          • fire555
            Junior Member
            • May 2011
            • 7

            #6
            Originally posted by kloczek
            Engineering is not something that should be approached as a matter of belief.
            Housekeeping adds more read and write I/O while it is running.
            Do you have monitoring on your Zabbix DB? If not, just log in on the host where your DB backend is running and at least check the iostat/sar output.
            Do you know how many read and write IO/s the storage used by the DB is doing?
            Sorry, that was a poor choice of words. I should have said that all the logging I have suggests the database is barely doing anything. Write IOPS sit at about 40/sec, reads maybe 10. The database is actually an RDS instance in AWS with SSD-backed storage; IOPS can easily surge to 3000. CPU usage is barely 2%.

            After looking a bit closer at this, I believe the trappers are no longer even bound to the TCP stack. Looking back at a few older logs I found these lines exactly when the trappers stopped processing data.

            Code:
            Cannot get socket IP address: [107] Transport endpoint is not connected
            This line appears numerous times for each trapper PID.
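
            One way to confirm whether the trappers are still listening (assuming the default ListenPort of 10051) is to check the listening sockets on the Zabbix server host:

            Code:
            # is anything still listening on the trapper port?
            ss -ltnp | grep 10051

            # which ports do the zabbix_server processes hold open?
            netstat -ltnp | grep zabbix_server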

            Any ideas?


            • kloczek
              Senior Member
              • Jun 2006
              • 1771

              #7
              Originally posted by mdiorio
              But my host items do not get any data. I doubt my issue is a bottleneck either. I am only monitoring two hosts and about 10 web scenarios; one host is strictly a Zabbix agent on Windows, and the other is this one. I'm using an SQLite DB right now, but with this minimal number of hosts/items it should be working just fine.
              SQLite stores all database tables in a single file, and every update/insert/delete locks that whole file for the duration of the statement.
              While something is deleting rows from the DB, nothing else can even run a select at the same time.
              In other words, SQLite as a DB backend does not provide any concurrency.

              You need to switch to, for example, MySQL.
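
              A minimal sketch of the relevant zabbix_server.conf settings for a MySQL backend, after creating the database and importing the schema from the creation scripts (the database name, user and password below are placeholders, not values from this thread):

              Code:
              # zabbix_server.conf - DB backend settings (placeholder credentials)
              DBHost=localhost
              DBName=zabbix
              DBUser=zabbix
              DBPassword=changeme
              # DBPort=3306   # only needed for a non-default MySQL port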

              In the context of Zabbix, SQLite is only good for some relatively small proxies, or for embedded systems running such a proxy. On a proxy the housekeeping overhead is far smaller than on the server.
              The proxy DB backend only stores data collected from agents, and from time to time runs a few queries to read a batch of data and send it to the server.
              On a proxy there are almost no selects, and the select pressure on the proxy's DB backend is only a fraction of what has to be sustained on the server.
              Last edited by kloczek; 29-03-2016, 14:13.


              • mdiorio
                Junior Member
                • Mar 2016
                • 27

                #8
                I made a booboo - I am using a MySQL database for Zabbix. I was originally using SQLite and reloaded with MySQL.

                I'm not seeing any network events in the log, unlike fire555. I'm only seeing the zbx_send_response with a 'failed' count in it for all trapper data received. But you can see the keys and values are returning valid results. It's the trapper that's not processing the data, even though it's sending a response of success.

                Code:
                3448:20160329:111444.433 __zbx_zbx_setproctitle() title:'trapper #4 [processing data]'
                  3448:20160329:111444.433 trapper got '{
                	"request":"sender data",
                	"data":[
                		{
                			"host":"ghq-1delasticnode01.globalspec.net",
                			"key":"health[active_shards]",
                			"value":63,
                			"clock":1459264491.56},
                		{
                			"host":"ghq-1delasticnode01.globalspec.net",
                			"key":"health[active_primary_shards]",
                			"value":63,
                			"clock":1459264491.56},
                		{
                			"host":"ghq-1delasticnode01.globalspec.net",
                			"key":"health[number_of_nodes]",
                			"value":1,
                			"clock":1459264491.56},
                		{
                			"host":"ghq-1delasticnode01.globalspec.net",
                			"key":"health[number_of_data_nodes]",
                			"value":1,
                			"clock":1459264491.56},
                		{
                			"host":"ghq-1delasticnode01.globalspec.net",
                			"key":"clusterstats[indices.count]",
                			"value":59,
                			"clock":1459264491.56},
                		{
                			"host":"ghq-1delasticnode01.globalspec.net",
                			"key":"clusterstats[indices.store.size_in_bytes]",
                			"value":231743576191,
                			"clock":1459264491.56}]
                }'
                  3448:20160329:111444.433 In recv_agenthistory()
                  3448:20160329:111444.433 In process_hist_data()
                  3448:20160329:111444.433 End of process_hist_data():SUCCEED
                  3448:20160329:111444.433 In zbx_send_response()
                  3448:20160329:111444.433 zbx_send_response() '{"response":"success","info":"processed: 0; failed: 6; total: 6; seconds spent: 0.000022"}'
                  3448:20160329:111444.433 End of zbx_send_response():SUCCEED
                  3448:20160329:111444.433 End of recv_agenthistory()
                  3448:20160329:111444.433 __zbx_zbx_setproctitle() title:'trapper #4 [processed data in 0.000384 sec, waiting for connection]'
                  3442:20160329:111444.458 get value from agent result: '0'
                  3442:20160329:111444.458 End of get_value_agent():SUCCEED
                  3442:20160329:111444.458 End of get_value():SUCCEED
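
                For reference, a single value can also be pushed by hand with zabbix_sender to see whether the server accepts it (the host and key below are taken from the log above; replace the server address with your own):

                Code:
                # send one test value and print the server's verbose response
                zabbix_sender -vv -z 127.0.0.1 -p 10051 -s "ghq-1delasticnode01.globalspec.net" -k "health[active_shards]" -o 63
                If that also reports "failed", it is worth double-checking that the items are defined as type "Zabbix trapper" and that the host name in the sender data exactly matches the technical host name configured in the frontend.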
