Two checks not surviving a reboot.

  • navaho
    Junior Member
    • Aug 2011
    • 8

    #1

    Two checks not surviving a reboot.

    We are running 2.4.3 on the server (hand-built) and 2.2.7 on the clients (latest Debian jessie).

    We have two checks that do not survive a client server reboot.

    proc.num[ntpd] and system.cpu.load simply do not survive a client server reboot. Every time we reboot a server that we are monitoring, the ntpd check stops returning a value. We have to log back in to the rebooted server and manually restart zabbix-agent before the check begins to return a value again. system.cpu.load has the same issue.

    We have 44 items on each client server and these two fail to return a value on a server reboot 100% of the time. The other 42 have never been an issue and will begin returning data with no issues at all.

    proc.num[ntp] is set to zabbix agent, while system.cpu.load is set to zabbix agent (active).

    Does anyone have any insights as to why the agent needs to be manually restarted after a reboot to get these checks going?
  • navaho
    Junior Member
    • Aug 2011
    • 8

    #2
    We still have to go back and issue an /etc/init.d/zabbix-agent restart for any VPS server that we reboot, or these two checks will not start returning values. Given the recent Xen vulnerabilities, that's a lot of VPS restarts and then a lot of going back to restart zabbix-agent. Any ideas? Anyone?


    • troffasky
      Senior Member
      • Jul 2008
      • 567

      #3
      Troubleshooting steps:
      - Turn up logging on agent.
      - Use zabbix_get to fetch item.
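
A minimal sketch of the second step, assuming Python 3.7+ and that zabbix_get is installed on the machine running it (typically the Zabbix server). The helper names and the example host are hypothetical; -s, -p and -k are the standard zabbix_get options for host, port and item key. For the first step, agent logging is usually raised with DebugLevel=4 in zabbix_agentd.conf. Note that zabbix_get queries the agent's passive listener, so on its own it cannot confirm that active checks are being sent.

```python
import subprocess


def zabbix_get_command(host, key, port=10050):
    """Build the zabbix_get command line for one passive item query."""
    return ["zabbix_get", "-s", host, "-p", str(port), "-k", key]


def fetch_item(host, key, port=10050, timeout=10):
    """Run zabbix_get and return its output (needs zabbix_get on PATH)."""
    result = subprocess.run(
        zabbix_get_command(host, key, port),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.strip()


# Example call ("client.example.com" is a placeholder, not a host from this thread):
# print(fetch_item("client.example.com", "system.cpu.load"))
```

Running the same query right after a reboot and again after an agent restart makes it easy to compare the two states.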


      • navaho
        Junior Member
        • Aug 2011
        • 8

        #4
        Thank you, I had not thought of zabbix_get.


        • navaho
          Junior Member
          • Aug 2011
          • 8

          #5
          Hi,

          Thank you for the help. I'm still not sure what is going on here, but I have more information to work with.

          The check that is failing is actually system.localtime. The agent runs this check every 60 seconds, and it has a trigger: {C_Template_Linux:system.localtime.fuzzytime(30)}=0
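
For readers unfamiliar with fuzzytime, here is a rough sketch of what the trigger expression evaluates. It assumes the documented behaviour: fuzzytime(sec) yields 1 when the item's timestamp value is within sec seconds of the Zabbix server's time, and 0 otherwise. The function names are illustrative and this is not Zabbix's own code.

```python
def fuzzytime(item_value, server_now, sec):
    """Return 1 if the item timestamp is within `sec` seconds of server time, else 0."""
    return 1 if abs(server_now - item_value) <= sec else 0


def trigger_fires(item_value, server_now, sec=30):
    """The trigger in the post is in the problem state when fuzzytime(sec) equals 0."""
    return fuzzytime(item_value, server_now, sec) == 0


# 1469742285 is the system.localtime value that appears in the agent log in this post.
fresh = trigger_fires(1469742285, server_now=1469742285)        # value is current
stale = trigger_fires(1469742285, server_now=1469742285 + 120)  # value is two minutes old
print(fresh, stale)
```

Under that assumption, an agent that stops delivering system.localtime would look the same to the trigger as a host with a drifting clock, because the last stored value falls further behind server time.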

          We have zabbix-server 2.4.3, and the clients use the Debian 8 package 1:2.2.7+dfsg-2, which is 2.2.7 (obviously).

          When a vps server is rebooted, this fails 100% of the time until we log in and manually restart the client. We've actually taken to adding a short sleep and restart to rc.local.

          With a reminder about zabbix_get from troffasky we turned up the logs and tried the check using zabbix_get when it's working and when it's not.

          Upon a VPS restart we can see the client request the list of checks:

          Code:
             759:20160728:144445.509 active checks #11 [getting list of active checks]
             759:20160728:144445.509 In refresh_active_checks() host:'199.255.144.121' port:10051
             757:20160728:144445.511 got [{"response":"success","data":[{"key":"check_dpkg[]","delay":60,"lastlogsize":0,"mtime":0},{"key":"check_mailq","delay":900,"l
          astlogsize":0,"mtime":0},{"key":"check_puppet_rc","delay":900,"lastlogsize":0,"mtime":0},{"key":"net.conn[ESTABLISHED]","delay":60,"lastlogsize":0,"mtime":0}
          ,{"key":"net.conn[TIME_WAIT]","delay":60,"lastlogsize":0,"mtime":0},{"key":"net.if.in[eth0]","delay":60,"lastlogsize":0,"mtime":0},{"key":"net.if.in[lo]","de
          lay":60,"lastlogsize":0,"mtime":0},{"key":"net.if.out[eth0]","delay":60,"lastlogsize":0,"mtime":0},{"key":"net.tcp.service[ssh,,522]","delay":60,"lastlogsize
          ":0,"mtime":0},{"key":"system.cpu.load","delay":60,"lastlogsize":0,"mtime":0},{"key":"system.cpu.util[,idle,]","delay":30,"lastlogsize":0,"mtime":0},{"key":"
          system.cpu.util[,iowait,]","delay":30,"lastlogsize":0,"mtime":0},{"key":"system.cpu.util[,system,]","delay":30,"lastlogsize":0,"mtime":0},{"key":"system.cpu.
          util[,user,]","delay":30,"lastlogsize":0,"mtime":0},{"key":"system.localtime","delay":60,"lastlogsize":0,"mtime":0},{"key":"system.swap.in[,pages]","delay":3
          0,"lastlogsize":0,"mtime":0},{"key":"system.swap.out[,pages]","delay":30,"lastlogsize":0,"mtime":0},{"key":"system.swap.size[,pused]","delay":60,"lastlogsize
          ":0,"mtime":0},{"key":"system.swap.size[,total]","delay":60,"lastlogsize":0,"mtime":0},{"key":"system.who","delay":60,"lastlogsize":0,"mtime":0},{"key":"vfs.
          fs.size[/,pused]","delay":300,"lastlogsize":0,"mtime":0},{"key":"vm.memory.size[available]","delay":60,"lastlogsize":0,"mtime":0},{"key":"vm.memory.size[free
          ]","delay":60,"lastlogsize":0,"mtime":0},{"key":"vm.memory.size[total]","delay":3600,"lastlogsize":0,"mtime":0}]}]
          Code:
             757:20160728:144445.513 In add_check() key:'system.localtime' refresh:60 lastlogsize:0 mtime:0
             757:20160728:144445.513 End of add_check()
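
The list of active checks logged above is plain JSON, so it can be inspected with a few lines of Python. This is only a sketch: the function name is made up, the sample is a shortened stand-in in the same shape as the log excerpt, and a real log line would need its hard line wraps removed before parsing.

```python
import json


def active_check_delays(response_text):
    """Map each item key in an active-check list response to its delay in seconds."""
    payload = json.loads(response_text)
    if payload.get("response") != "success":
        raise ValueError("server did not return a successful active check list")
    return {item["key"]: item["delay"] for item in payload.get("data", [])}


# Shortened sample with three of the items from the logged response.
sample = """{"response":"success","data":[
  {"key":"system.cpu.load","delay":60,"lastlogsize":0,"mtime":0},
  {"key":"system.localtime","delay":60,"lastlogsize":0,"mtime":0},
  {"key":"vm.memory.size[total]","delay":3600,"lastlogsize":0,"mtime":0}]}"""

delays = active_check_delays(sample)
print(delays["system.localtime"])  # 60
```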
          And it checks, but doesn't send the value?

          Code:
          757:20160728:144445.798 End of process_value():SUCCEED
             747:20160728:144445.798 End of process_value():SUCCEED
             757:20160728:144445.798 for key [system.localtime] received value [1469742285]
             747:20160728:144445.798 for key [system.localtime] received value [1469742285]
             757:20160728:144445.798 In process_value() key:'b-db074.dh01.groupee-inc.net:system.localtime' value:'1469742285'
             747:20160728:144445.798 In process_value() key:'b-db074.dh01.groupee-inc.net:system.localtime' value:'1469742285'
             757:20160728:144445.798 In send_buffer() host:'199.255.146.250' port:10051 values:14/100
             747:20160728:144445.798 In send_buffer() host:'mp003.mz01.groupee-inc.net' port:10051 values:14/100
             757:20160728:144445.798 Will not send now. Now 1469742285 lastsent 1469742285 < 5
             747:20160728:144445.798 Will not send now. Now 1469742285 lastsent 1469742285 < 5
             757:20160728:144445.798 End of send_buffer():SUCCEED
             747:20160728:144445.798 End of send_buffer():SUCCEED
             757:20160728:144445.798 buffer: new element 14
             747:20160728:144445.798 buffer: new element 14
             757:20160728:144445.798 End of process_value():SUCCEED
             747:20160728:144445.798 End of process_value():SUCCEED
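
The "Will not send now" lines are consistent with the agent's BufferSend and BufferSize settings, whose defaults are 5 seconds and 100 values (matching the "< 5" and "14/100" in the log). The following is a simplified sketch of that decision, not the agent's actual source, which may apply further conditions; the function and parameter names are illustrative.

```python
def should_send(now, lastsent, buffered, buffer_send=5, buffer_size=100):
    """Return True when the buffered values would be flushed to the server."""
    if buffered >= buffer_size:
        return True  # buffer is full: send regardless of how recently we sent
    # Otherwise hold values until they have waited at least buffer_send seconds.
    return now - lastsent >= buffer_send


# Values taken from the log excerpt: Now 1469742285, lastsent 1469742285, 14 buffered.
print(should_send(1469742285, 1469742285, 14))  # False
print(should_send(1469742290, 1469742285, 14))  # True
```

So a "Will not send now" message immediately after a send is expected behaviour rather than a symptom in itself.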
          Eventually it gets sent?

          Code:
           747:20160728:144450.832 JSON before sending [{
                  "request":"agent data",
                  "data":[
                          {
                                  "host":"b-db074.dh01.groupee-inc.net",
                                  "key":"check_dpkg[]",
                                  "value":"AOK",
                                  "clock":1469742285,
                                  "ns":536084400},
                          {
           -snip -
                          {
                                  "host":"b-db074.dh01.groupee-inc.net",
                                  "key":"system.localtime",
                                  "value":"1469742285",
                                  "clock":1469742285,
                                  "ns":798693319},
                          {
          After that, it never appears again in the zabbix-agent log. I'm not sure that any of the agentd checks are running; I never see the agent send data back again.

          Until a restart. Once we restart the agent, that check clears, and the agent will then continually and consistently start sending data.

          Code:
          6250:20160728:081712.143 collector [idle 1 sec]
            6255:20160728:081712.210 In send_buffer() host:'mp003.mz01.groupee-inc.net' port:10051 values:19/100
            6255:20160728:081712.210 JSON before sending [{
          	"request":"agent data",
          	"data":[
          		{
          			"host":"b-db074.dh01.groupee-inc.net",
          			"key":"net.conn[ESTABLISHED]",
          			"value":"61",
          			"clock":1469719027,
          			"ns":168178879},
          		{
          
          -- snip --
          
          		{
          			"host":"b-db074.dh01.groupee-inc.net",
          			"key":"system.localtime",
          			"value":"1469719027",
          			"clock":1469719027,
          			"ns":196562142},
          		{
          The entire time that the agent is NOT sending data (or doing anything?), it does show up on the process list:

          Code:
           8109 ?        S      0:00 /usr/sbin/zabbix_agentd
           8110 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: collector [idle 1 sec]
           8111 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: listener #1 [waiting for connection]
           8112 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: listener #2 [waiting for connection]
           8114 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: listener #3 [waiting for connection]
           8115 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #1 [idle 1 sec]
           8117 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #2 [idle 1 sec]
           8118 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #3 [getting list of active checks]
           8119 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #4 [getting list of active checks]
           8120 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #5 [getting list of active checks]
           8122 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #6 [getting list of active checks]
           8123 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #7 [getting list of active checks]
           8125 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #8 [getting list of active checks]
           8126 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #9 [getting list of active checks]
           8127 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #10 [idle 1 sec]
           8129 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #11 [getting list of active checks]
          So, I guess the question here is why does zabbix-agent start on a VPS restart, run momentarily, then fail to send data from there until we restart it?
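
In the process list above, 8 of the 11 active-check processes show "getting list of active checks" and 3 show idle. A small sketch for tallying those states from ps output, so the before-restart and after-restart pictures can be compared, might look like this. The regular expression and function name are assumptions based only on the process titles shown in this post.

```python
import re
from collections import Counter

# Matches process titles such as "zabbix_agentd: active checks #3 [idle 1 sec]".
ACTIVE_RE = re.compile(r"zabbix_agentd: active checks #(\d+) \[(.+?)\]")


def summarize_active_checks(ps_output):
    """Count active-check processes by the state shown in their process title."""
    counts = Counter()
    for line in ps_output.splitlines():
        match = ACTIVE_RE.search(line)
        if match:
            state = match.group(2)
            # Collapse "idle 1 sec", "idle 5 sec", ... into a single "idle" state.
            counts["idle" if state.startswith("idle") else state] += 1
    return counts


# Three lines copied from the process list in this post.
sample = r"""
 8115 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #1 [idle 1 sec]
 8118 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #3 [getting list of active checks]
 8119 ?        S      0:00  \_ /usr/sbin/zabbix_agentd: active checks #4 [getting list of active checks]
"""

summary = summarize_active_checks(sample)
print(dict(summary))
```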


          • navaho
            Junior Member
            • Aug 2011
            • 8

            #6
            Does anyone have any ideas why it is that on a server reboot zabbix_agentd fails to return data to the server until we manually restart it using the init script?


            • jonxor
              Junior Member
              • Jun 2016
              • 24

              #7
              Taking some shots in the dark:
              Does this item turn to "Not supported" in the web interface during the time when it does not poll data?

              That would indicate that it is at least sending some kind of response.

              Another thing to check is whether there is any other init script that specifies a different config file for your local agent.

              I would try to set up a zabbix_get of this item on the local


              • navaho
                Junior Member
                • Aug 2011
                • 8

                #8
                Hi,

                Thank you. They do not turn to unsupported. When we manually run zabbix_get on the server against the client, we get correct values. There are no alternate config files. The agentd starts when the system boots and is in the process space, but does nothing. Any checks on the server that are set up as zabbix agent return values. Any that are set to zabbix agent (active) get no values unless/until we go into the server and issue systemctl restart zabbix_agent.

                As a really poor workaround, we've put a script in /usr/local/bin that has a sleep 15 and then a systemctl restart zabbix_agent. In /etc/rc.local we have a call to that script with an &.
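
For illustration only, the same sleep-then-restart idea written in Python rather than as the shell script described in the post. The function name is made up, and the unit name is left as a parameter because it varies between systems; pass whatever name systemctl accepts on yours.

```python
import subprocess
import time


def delayed_restart(unit, delay=15, run=subprocess.run, sleep=time.sleep):
    """Wait `delay` seconds, then restart the given systemd unit."""
    sleep(delay)
    # check=True raises CalledProcessError if systemctl reports a failure.
    return run(["systemctl", "restart", unit], check=True)


# Example call (the unit name here is an assumption):
# delayed_restart("zabbix-agent")
```

The run and sleep parameters exist only so the function can be exercised without touching a real service.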

                It works, but it's really the wrong tool for the job.

                We've thought that maybe the order in which things start up on initial boot or reboot might be an issue, perhaps a dependency condition, but some experimenting with that has also been fruitless.

                It is frustrating in that it's not one or two hosts, it is all 351 and growing, but looking at these forums and the ticket system it doesn't seem to be common for everyone else.


                • jonxor
                  Junior Member
                  • Jun 2016
                  • 24

                  #9
                  Ah, that is outside my expertise; I only use passive checks. The only other thing I can think of is that the proxy/server the agents are reaching out to has too many TCP connections, or that for some other reason the agents can't reach the proxy/server on port 10051.
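
One way to test the reachability theory from a freshly rebooted client is a plain TCP connect to the server's trapper port. This is a minimal sketch using only Python's standard library; the function name and example host are hypothetical. It shows only whether a TCP connection can be opened, not whether the server accepts the agent's data.

```python
import socket


def can_reach(host, port=10051, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call ("zabbix.example.com" is a placeholder):
# print(can_reach("zabbix.example.com"))
```

Running it once right after boot and once after the manual agent restart would show whether connectivity differs between the two states.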
