A lot of trouble with Zabbix 1.1beta7

  • valain
    Junior Member
    • Mar 2006
    • 3

    #1

    A lot of trouble with Zabbix 1.1beta7

    Hello,

    I have installed zabbix_server and zabbix_agentd on a local server where everything works very nicely.

    I have then installed zabbix_agentd on a second, remote server that I want to monitor. This remote server is one of our main production machines and works well (i.e. no network trouble whatsoever).

    In my zabbix_server.log file I see a lot of errors with that host. Here's a short extract (real server name and IP address hidden):


    016135:20060302:142804 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [system.cpu.load[,avg1]]
    016135:20060302:142804 Assuming that agent dropped connection because of access permissions
    016135:20060302:142804 Started network errors for [SERVERNAME]
    016135:20060302:142804 Host [SERVERNAME]: another network error, wait for 5 seconds
    016135:20060302:143112 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[pop]]
    016135:20060302:143112 Assuming that agent dropped connection because of access permissions
    016135:20060302:143112 Host [SERVERNAME] will be checked after [60] seconds
    016135:20060302:143203 Enabling host [SERVERNAME]
    016135:20060302:143238 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [vfs.fs.size[/tmp,total]]
    016135:20060302:143238 Assuming that agent dropped connection because of access permissions
    016135:20060302:143238 Started network errors for [SERVERNAME]
    016135:20060302:143238 Host [SERVERNAME]: another network error, wait for 5 seconds
    016135:20060302:143309 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[smtp]]
    016135:20060302:143309 Assuming that agent dropped connection because of access permissions
    016135:20060302:143309 Host [SERVERNAME] will be checked after [60] seconds
    016135:20060302:143335 Enabling host [SERVERNAME]
    016135:20060302:143339 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[imap]]
    016135:20060302:143339 Assuming that agent dropped connection because of access permissions
    016135:20060302:143339 Started network errors for [SERVERNAME]
    016135:20060302:143339 Host [SERVERNAME]: another network error, wait for 5 seconds
    016135:20060302:143444 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [system.cpu.load[,avg15]]
    016135:20060302:143444 Assuming that agent dropped connection because of access permissions
    016135:20060302:143444 Host [SERVERNAME] will be checked after [60] seconds
    016135:20060302:143540 Enabling host [SERVERNAME]
    016135:20060302:144002 Timeout while receiving data from [SERVERNAME]
    016135:20060302:144002 Getting value of [system.users.num] from host [SERVERNAME] failed
    016135:20060302:144002 The value is not stored in database.
    016135:20060302:144012 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.port[A.B.C.D,10050]]
    016135:20060302:144012 Assuming that agent dropped connection because of access permissions
    016135:20060302:144012 Started network errors for [SERVERNAME]
    016135:20060302:144012 Host [SERVERNAME]: another network error, wait for 5 seconds
    016135:20060302:144052 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [vfs.fs.size[/home,used]]
    016135:20060302:144052 Assuming that agent dropped connection because of access permissions
    016135:20060302:144052 Host [SERVERNAME] will be checked after [60] seconds
    016135:20060302:144143 Enabling host [SERVERNAME]
    016135:20060302:144233 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [vm.memory.size[free]]
    016135:20060302:144233 Assuming that agent dropped connection because of access permissions
    016135:20060302:144233 Started network errors for [SERVERNAME]
    016135:20060302:144233 Host [SERVERNAME]: another network error, wait for 5 seconds


    Where can I start looking for the source of these errors? They look as if the remote zabbix_agent sometimes doesn't answer correctly. Strangely enough, the parameters that fail are never the same. Sometimes it's vm.memory.XXX, then it's tcp.XXX, then it's user-defined parameters... and a couple of seconds later, they all work again.
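To check whether the failures really are spread across all item keys rather than concentrated on a few, the "Got empty string" lines can be tallied per host and parameter. The following is only a minimal sketch in Python: the regular expression is derived from the log lines in the extract above, and the function name `count_empty_replies` is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the extract above, e.g.:
# 016135:20060302:142804 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [system.cpu.load[,avg1]]
EMPTY_REPLY = re.compile(
    r"^\s*\d+:\d{8}:\d{6} Got empty string from \[(.+?)\] IP \[.+?\] "
    r"Parameter \[(.+)\]\s*$"
)


def count_empty_replies(lines):
    """Count 'Got empty string' errors per (host, item key)."""
    counts = Counter()
    for line in lines:
        match = EMPTY_REPLY.match(line)
        if match:
            host, key = match.groups()
            counts[(host, key)] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the extract; a real run would read the log file.
    sample = [
        "016135:20060302:142804 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [system.cpu.load[,avg1]]",
        "016135:20060302:142804 Assuming that agent dropped connection because of access permissions",
        "016135:20060302:143112 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.service[pop]]",
        "016135:20060302:144012 Got empty string from [SERVERNAME] IP [A.B.C.D] Parameter [net.tcp.port[A.B.C.D,10050]]",
    ]
    for (host, key), n in count_empty_replies(sample).most_common():
        print(host, key, n)
```

If the counts come out roughly even across unrelated keys, that would suggest the problem is with the connection to the agent as a whole rather than with any single check.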

    I also notice that the "status" of this host changes from 0 to 2 and back to 0 every 15 seconds or so. Sometimes it stays "alive" for a couple of minutes before the status value starts to change every couple of seconds again.

    Ideas?
  • valain
    Junior Member
    • Mar 2006
    • 3

    #2
    Hi,

    Something more I have noticed on the host on which the agent seems to have trouble:

    * when I "killall zabbix_agentd" and restart it, I have to try anywhere between 1 and 10 times before it starts correctly, and in between I have to manually delete the .pid file;

    * I have set the debug level to 4 for that agentd, but the /tmp/zabbix_agentd.log file stays completely empty?

    Help? :-)


    • bytesize
      Member
      • Aug 2005
      • 71

      #3
      Hi,

      I'm seeing exactly the same errors with beta7:

      016135:20060302:143339 Host [SERVERNAME]: another network error, wait for 5 seconds

      When this error occurs, the trigger is often displayed as grey in the overview page. Even restarting the Zabbix server doesn't clear the status.

      John


      • bytesize
        Member
        • Aug 2005
        • 71

        #4
        Hi Valain,

        I have managed to resolve both these problems. To stop the network errors, I changed the following parameters in the zabbix_server.conf file:

        Server=10
        StartSuckers=24
        StartTrappers=16

        I then dropped all the data from the database and re-imported it. To make this task easier, I keep all my host information in a CSV file which can be imported using the bulkloader tool. This step may be unnecessary once you have changed the parameters above and restarted Zabbix, but I like to make sure all my data is consistent after upgrading.

        The problem with killing processes is that Zabbix does not shut down cleanly, so the pid file is left behind. You should download the init scripts from the Cookbook forum and use those to start/stop/restart Zabbix cleanly.
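Until proper init scripts are in place, a leftover pid file can at least be checked before it is deleted by hand. This is a rough sketch in Python, assuming a POSIX system. The helper names are invented for illustration, and the pid file path is whatever your agent configuration uses (it is not given in this thread).

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID currently exists (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_stale_pid_file(pid_path):
    """Delete pid_path if it does not point at a live process.

    Returns True if the file was removed, False if it was kept or missing.
    """
    try:
        with open(pid_path) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return False
    except ValueError:
        pid = None  # empty or garbled file: treat it as stale

    if pid is not None and pid > 0 and pid_is_running(pid):
        return False  # a process with that PID is alive, keep the file
    os.remove(pid_path)
    return True
```

Note that this only tests whether some process owns the recorded PID. It cannot tell whether that process is actually Zabbix, since PIDs can be reused after a crash.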

        I hope this helps!

        John
