connection resets now and then

  • cheesus
    Junior Member
    • Sep 2014
    • 2

    #1

    Hello,

    we have our Zabbix Server in a separate datacenter from the monitored hosts (by design - we wanted to monitor "from the outside").
    Everything works fine, except that roughly one out of twenty passive checks fails with "Connection reset".
    I don't know if that is to be expected between two professional data centers, both in Germany. However, I do not notice network problems in any other application: I have ssh sessions open for hours, transfer ~25G backups via sftp nightly, etc., all without problems.

    On the server, the log has entries like these:
    Code:
    watch01:/srv/data/zabbix/log # tail -1000000 zabbix_server.log | egrep -i "(network error|failed)" | grep ":1251"
     23715:20140904:125110.408 Item [prod01.xxxxx.net:proc.num[java,,,jboss-as-7.2.0.Final]] error: Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer
     23715:20140904:125110.409 query [txnlev:1] [update hosts set errors_from=1409827870,disable_until=1409827885,error='Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer' where hostid=10106]
     23715:20140904:125110.428 Zabbix agent item "proc.num[java,,,jboss-as-7.2.0.Final]" on host "prod01.xxxxx.net" failed: first network error, wait for 15 seconds
     23773:20140904:125114.411 Item [dev02.xxxxx.net:system.swap.size[,free]] error: Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer
     23773:20140904:125114.411 query [txnlev:1] [update hosts set errors_from=1409827874,disable_until=1409827889,error='Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer' where hostid=10105]
     23773:20140904:125114.428 Zabbix agent item "system.swap.size[,free]" on host "dev02.xxxxx.net" failed: first network error, wait for 15 seconds
     23729:20140904:125137.471 Item [prod01.xxxxx.net:net.if.in[eth1]] error: Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer
     23729:20140904:125137.472 query [txnlev:1] [update hosts set errors_from=1409827897,disable_until=1409827912,error='Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer' where hostid=10106]
     23729:20140904:125137.485 Zabbix agent item "net.if.in[eth1]" on host "prod01.xxxxx.net" failed: first network error, wait for 15 seconds
     23764:20140904:125154.494 Item [prod01.xxxxx.net:proc.num[,,run]] error: Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer
     23764:20140904:125154.495 query [txnlev:1] [update hosts set errors_from=1409827914,disable_until=1409827929,error='Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer' where hostid=10106]
     23764:20140904:125154.534 Zabbix agent item "proc.num[,,run]" on host "prod01.xxxxx.net" failed: first network error, wait for 15 seconds
     23771:20140904:125155.496 Item [dev02.xxxxx.net:net.if.out[eth0]] error: Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer
     23771:20140904:125155.496 query [txnlev:1] [update hosts set errors_from=1409827915,disable_until=1409827930,error='Get value from agent failed: ZBX_TCP_READ() failed: [104] Connection reset by peer' where hostid=10105]
     23771:20140904:125155.596 Zabbix agent item "net.if.out[eth0]" on host "dev02.xxxxx.net" failed: first network error, wait for 15 seconds
    and every one of those errors recovers by itself pretty soon.
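
    To see whether the resets cluster on particular hosts, the "first network error" lines can be tallied per host. The following is only a rough sketch: the function name `count_first_errors` and the regular expression are my own assumptions, written against the log format shown above, and the sample lines are copied from that output.

```python
import re
from collections import Counter

# Assumption: failures are logged in the form seen in the server log above:
#   ... on host "<hostname>" failed: first network error, ...
FIRST_ERROR = re.compile(r'on host "([^"]+)" failed: first network error')


def count_first_errors(lines):
    """Count 'first network error' entries per host (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FIRST_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the server log excerpt above.
SAMPLE = [
    ' 23715:20140904:125110.428 Zabbix agent item "proc.num[java,,,jboss-as-7.2.0.Final]" on host "prod01.xxxxx.net" failed: first network error, wait for 15 seconds',
    ' 23773:20140904:125114.428 Zabbix agent item "system.swap.size[,free]" on host "dev02.xxxxx.net" failed: first network error, wait for 15 seconds',
    ' 23729:20140904:125137.485 Zabbix agent item "net.if.in[eth1]" on host "prod01.xxxxx.net" failed: first network error, wait for 15 seconds',
    ' 23764:20140904:125154.534 Zabbix agent item "proc.num[,,run]" on host "prod01.xxxxx.net" failed: first network error, wait for 15 seconds',
    ' 23771:20140904:125155.596 Zabbix agent item "net.if.out[eth0]" on host "dev02.xxxxx.net" failed: first network error, wait for 15 seconds',
]

if __name__ == "__main__":
    for host, count in count_first_errors(SAMPLE).most_common():
        print(host, count)
```

    In practice the lines would come from iterating over zabbix_server.log rather than from a hard-coded list.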

    On the client, I see no corresponding message. For example, for the server entry '23771:20140904:125155.596 Zabbix agent item "net.if.out[eth0]" on host "dev02.xxxxx.net" failed', the agent log shows only:
    Code:
    $ grep "Requested \[net.if.out\[eth0\]\]" /var/log/zabbix/zabbix-agentd.log
     15197:20140904:125003.806 Requested [net.if.out[eth0]]
     15195:20140904:125304.159 Requested [net.if.out[eth0]]
    so it seems the request never reached the client.
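
    This comparison can be automated by parsing the `PID:YYYYMMDD:HHMMSS.mmm` prefix of both logs and checking whether the agent logged anything close to the server-side failure. The sketch below assumes that both machines have synchronized clocks and log in the same time zone; the helper names and the 30-second default window are arbitrary choices of mine, not anything Zabbix provides.

```python
import re
from datetime import datetime, timedelta

# Assumption: every log line starts with "PID:YYYYMMDD:HHMMSS.mmm ".
STAMP = re.compile(r"^\s*\d+:(\d{8}):(\d{6})\.(\d{3})\s")


def parse_stamp(line):
    """Return the timestamp of a Zabbix log line, or None if it has none."""
    match = STAMP.match(line)
    if match is None:
        return None
    date, clock, millis = match.groups()
    base = datetime.strptime(date + clock, "%Y%m%d%H%M%S")
    return base + timedelta(milliseconds=int(millis))


def has_request_near(failure_line, agent_lines, window_seconds=30):
    """True if any agent line is within window_seconds of the failure."""
    failed_at = parse_stamp(failure_line)
    if failed_at is None:
        return False
    window = timedelta(seconds=window_seconds)
    for line in agent_lines:
        stamp = parse_stamp(line)
        if stamp is not None and abs(stamp - failed_at) <= window:
            return True
    return False


# Lines copied from the excerpts above.
FAILURE = ' 23771:20140904:125155.596 Zabbix agent item "net.if.out[eth0]" on host "dev02.xxxxx.net" failed: first network error, wait for 15 seconds'
AGENT = [
    " 15197:20140904:125003.806 Requested [net.if.out[eth0]]",
    " 15195:20140904:125304.159 Requested [net.if.out[eth0]]",
]

if __name__ == "__main__":
    print(has_request_near(FAILURE, AGENT))
```

    With the lines above, neither agent entry falls within 30 seconds of the failure, which matches what I see by eye.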

    My graph looks like the attached image.


    Any ideas? Zabbix does not seem to retry until the next check is due, so my data timeline has a lot of gaps.

    I have looked through syslog, the firewall, and everything else on all systems, and found nothing obvious.

    Any input appreciated...

    Thanks, Cheesus.
    Attached Files