zabbix_agentd dying mysteriously

  • gfirebright
    Junior Member
    • Apr 2006
    • 4

    #1

    zabbix_agentd dying mysteriously

    Greetings,

    I'm currently experiencing a weird problem with zabbix_agentd - it's dying, and I'm not sure why. I've upped the logging for it and have an excerpt from the log attached below.

    I've set Timeout to the maximum of 30 in zabbix_agentd.conf, but that didn't help the problem. agentd seems to die when the server is hitting it, but again, I'm not quite sure why or how. In addition to the zabbix_agentd.log, I did manage to capture a brief message from the server's web gui:

    Got empty string from [x.x] IP [x.x.x.x] Parameter [io[disk_wio]]

    Any ideas on what could be going wrong?

    Thanks,

    - Gary


    ----

    005997:20060404:084512 zabbix_agentd started. ZABBIX 1.1beta7.
    005998:20060404:084512 zabbix_agentd 5998 started
    005999:20060404:084512 zabbix_agentd 5999 started
    006000:20060404:084512 zabbix_agentd 6000 started
    006001:20060404:084512 zabbix_agentd 6001 started
    006002:20060404:084512 zabbix_agentd 6002 started
    006002:20060404:084512 In init_list()
    006002:20060404:084512 In refresh_metrics()
    006002:20060404:084512 get_active_checks: host[x.x.x.x] port[10051]
    006002:20060404:084512 Sending [ZBX_GET_ACTIVE_CHECKS x.x]
    006002:20060404:084512 Before read
    006002:20060404:084512 In delete_all_metrics()
    006002:20060404:084512 Parsed [ZBX_EOF]
    006002:20060404:084512 Sleeping for 60 seconds
    005998:20060404:084555 In check_security()
    005998:20060404:084555 Connection from [x.x.x.x]. Allowed servers [x.x.x.x]
    005998:20060404:084555 Before read()
    005998:20060404:084555 After read() 2 [13]
    005998:20060404:084555 Got line:io[disk_wio]
    005997:20060404:084555 One child process died. Exiting ...
    005999:20060404:084555 Got signal. Exiting ...
    006000:20060404:084555 Got signal. Exiting ...
    006001:20060404:084555 Got signal. Exiting ...
    006002:20060404:084555 Got signal. Exiting ...
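For anyone reading a log like the one above, here is a minimal sketch of how the "which child was doing what when it died" question could be answered mechanically. It assumes each line follows the `PID:YYYYMMDD:HHMMSS message` layout seen in the excerpt; the names `parse_line` and `last_entry_before_death` are made up for illustration and are not part of Zabbix.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Assumed layout, taken from the excerpt: "PID:YYYYMMDD:HHMMSS message"
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\s+(.*)$")


class Entry(NamedTuple):
    pid: int
    date: str
    time: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one log line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return Entry(int(m.group(1)), m.group(2), m.group(3), m.group(4))


def last_entry_before_death(lines: Iterable[str]) -> Optional[Entry]:
    """Return the entry logged immediately before 'One child process died'."""
    previous = None
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if "One child process died" in entry.message:
            return previous
        previous = entry
    return None


SAMPLE = """
005998:20060404:084555 Before read()
005998:20060404:084555 After read() 2 [13]
005998:20060404:084555 Got line:io[disk_wio]
005997:20060404:084555 One child process died. Exiting ...
005999:20060404:084555 Got signal. Exiting ...
"""

if __name__ == "__main__":
    print(last_entry_before_death(SAMPLE.splitlines()))
```

On the excerpt this points at PID 5998, whose last logged action was receiving `io[disk_wio]`. That only shows what was logged last, not what actually caused the exit.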
  • firebright
    Junior Member
    • Mar 2006
    • 2

    #2
    This continues...

    We now have a couple of people looking at this.

    This system is set up EXACTLY the same way as another host that's actually working (line for line in the config file), and yet this one dies as soon as the server makes contact with it.

    It works, and you can telnet to the port and get the beta information, but once the server connects it gives a permissions error and dies:
    003712:20060404:173043 Enabling host [mysite.org]
    003732:20060404:173043 Got empty string from [mysite.org] IP [64.84.32.104] Parameter [io[disk_wio]]
    003732:20060404:173043 Assuming that agent dropped connection because of access permissions
    003732:20060404:173043 Started network errors for [mysite.org]
    003732:20060404:173043 Host [mysite.org]: another network error, wait for 5 seconds
    003713:20060404:173043 Connection reset by peer.
    003713:20060404:173043 Started network errors for [mysite.org]
    003713:20060404:173043 Host [mysite.org]: another network error, wait for 5 seconds
    003712:20060404:173048 Cannot connect to [mysite.org] [Connection refused]
    003712:20060404:173048 Host [mysite.org]: another network error, wait for 5 seconds
    003712:20060404:173053 Cannot connect to [mysite.org] [Connection refused]
    003712:20060404:173053 Host [mysite.org]: another network error, wait for 5 seconds
    003732:20060404:173058 Cannot connect to [mysite.org] [Connection refused]
    003732:20060404:173058 Host [mysite.org]: another network error, wait for 5 seconds
    003713:20060404:173058 Cannot connect to [mysite.org] [Connection refused]
    003713:20060404:173058 Host [mysite.org]: another network error, wait for 5 seconds
    003713:20060404:173103 Cannot connect to [mysite.org] [Connection refused]
    003713:20060404:173103 Host [mysite.org]: another network error, wait for 5 seconds
    003713:20060404:173108 Cannot connect to [mysite.org] [Connection refused]
    003713:20060404:173108 Host [mysite.org] will be checked after [60] seconds

    I can replicate this on a couple of servers now, but they have IDENTICAL (line for line) client setups to other servers that are working, and were set up in the same way at the same time.

    We're running this properly as a zabbix user (non-system, straight user account), and starting the daemon as zabbix. The thing is, it's just dying, and we can't figure out why. We've moved the log file, set public permissions on both the pid and the log, as well as checked to see whether the client is starting with telnet (as discussed), and it is. But as soon as the server hits it, it throws that error and dies.
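The manual telnet check described above can be scripted so it is easy to repeat against several hosts. This is only a rough sketch under assumptions: the agent is taken to accept a plain-text item key terminated by a newline (which is consistent with the 13-byte read of `io[disk_wio]` in the agent log), and port 10050 is assumed as the agent's listen port since the thread does not state it. `build_request` and `query_agent` are illustrative names, not Zabbix functions.

```python
import socket

# Assumption: the agent listens on 10050; the thread does not state the port.
AGENT_PORT = 10050


def build_request(key: str) -> bytes:
    """Encode an item key the way a telnet session would send it: key + newline."""
    return key.encode("ascii") + b"\n"


def query_agent(host: str, key: str, port: int = AGENT_PORT,
                timeout: float = 5.0) -> bytes:
    """Send one item key to the agent and return whatever it sends back.

    An empty result means the agent closed the connection without answering.
    A ConnectionResetError or ConnectionRefusedError is left to propagate,
    since those are exactly the symptoms being investigated.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(key))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    # "agent.example.org" is a placeholder host name.
    print(query_agent("agent.example.org", "io[disk_wio]"))
```

Running the same key by hand that the server requests makes it possible to see whether the agent dies on that specific item or on any incoming request.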

    It is always consistent about Parameter [io[disk_wio]].

    Anyways, help would be appreciated.


    • gfirebright
      Junior Member
      • Apr 2006
      • 4

      #3
      Well, we've made progress. We fixed a strange permissions error we found and disabled io[disk_wio], and it's chugging along happily now. Still not sure what's causing the problem with io[disk_wio], but the rest of it is working.


      • Alexei
        Founder, CEO
        Zabbix Certified Trainer
        Zabbix Certified Specialist
        Zabbix Certified Professional
        • Sep 2004
        • 5654

        #4
        Thanks for the detailed report. What version of ZABBIX are you using?
        Alexei Vladishev
        Creator of Zabbix, Product manager
        New York | Tokyo | Riga

        • firebright
          Junior Member
          • Mar 2006
          • 2

          #5
          information...

          Ok, so it turned out to be a couple of interesting things.

          Here are some details, so that others who follow will not waste time like we did. First, the problem was related to what Gary was talking about. But even after fixing that, the agent would lose its connections on those two machines as soon as the server hit the client.

          We're running ZABBIX 1.1beta7. As soon as we changed the permissions on the /proc directory to o+rx, it started working. It seemed to be related to the openvz.org distro the machines were running.

          Maybe it'll save someone some time.
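For anyone who wants to check for the same condition before the agent dies, here is a small sketch that tests whether "other" users have both read and execute permission on a directory, which is what `o+rx` grants. The function names are invented for this example, and whether this is the only permission the agent needs is not something the thread establishes.

```python
import os
import stat


def others_can_list_and_traverse(mode: int) -> bool:
    """True if the mode bits grant both o+r and o+x."""
    needed = stat.S_IROTH | stat.S_IXOTH
    return (mode & needed) == needed


def check_path(path: str = "/proc") -> bool:
    """Apply the o+rx test to a real path (defaults to /proc)."""
    return others_can_list_and_traverse(os.stat(path).st_mode)


if __name__ == "__main__":
    if check_path("/proc"):
        print("/proc is readable and traversable by other users")
    else:
        print("/proc lacks o+rx; a non-root agent may not be able to read it")
```

Run it as the same unprivileged user that starts the agent, on both a working host and a failing one, to see whether the two really are set up identically.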

