Active checks and proxies periodically all go offline...

  • Jason
    Senior Member
    • Nov 2007
    • 430

    #1

    We've got a CentOS 7 front-end Zabbix server running Zabbix 2.4.8, with a Postgres database running on a CentOS 6 server. The Zabbix database is ~330GB for 30 days' worth of data for ~800 hosts.

    Periodically, all of the active agents and proxies drop offline for anything from 15 minutes to over an hour. Attached is an uptime graph showing one outage from tonight.

    Passive checks are still collected and there are no gaps in that data.

    All of the storage is on fast 15K SAS disks, and all the monitoring I have shows that latency is low.

    I thought it could be network related, but some of the hosts that are dropping offline are on the local network with the zabbix server and connect by IP address. There is a firewall rule on the host allowing all traffic in without restriction on the zabbix server port.

    There is nothing untoward showing up in the zabbix-server log during these outages.

    Any thoughts on possible causes?
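
One way to tell a network problem from a server-side one would be to log plain TCP connect attempts to the Zabbix server port from an affected host while an outage is in progress. The following is only a minimal sketch under stated assumptions: `probe` and `watch` are hypothetical helper names, and the host and port passed in are whatever the agents and proxies are configured to use. It checks nothing about the Zabbix protocol itself, only whether the port accepts a connection.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Try a single TCP connect and return (ok, detail)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - started
            return True, f"connected in {elapsed:.3f}s"
    except OSError as exc:
        # OSError covers timeouts, refused connections and unreachable hosts.
        return False, f"{type(exc).__name__}: {exc}"


def watch(host, port, interval=10.0, rounds=6):
    """Probe repeatedly and print one timestamped line per attempt."""
    for _ in range(rounds):
        ok, detail = probe(host, port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        status = "OK" if ok else "FAIL"
        print(f"{stamp} {host}:{port} {status} {detail}")
        time.sleep(interval)
```

Calling `watch(host, port)` from a host on the same LAN as the server and from a remote one would show whether both see failures at the same moments, and whether the failures are timeouts or refusals.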
  • batchenr
    Senior Member
    • Sep 2016
    • 440

    #2
    Can you post the Zabbix server and agent logs?
    Also, try disabling all unsupported items on the hosts that are causing this issue.


    • Jason
      Senior Member
      • Nov 2007
      • 430

      #3
      I missed the logs for the last outage, but it's just happened again this morning.

      It's all proxies and hosts that use active checks that go offline.

      Excerpt from a proxy on the same LAN as the Zabbix server:
      2095:20170225:083939.951 sending heartbeat message to server failed: error:"no response: network error", info:""
      2094:20170225:084039.624 Error while receiving answer from server [ZBX_TCP_READ() failed: [104] Connection reset by peer]
      2095:20170225:084039.952 sending heartbeat message to server failed: error:"no response: network error", info:""
      2095:20170225:084139.952 sending heartbeat message to server failed: error:"no response: network error", info:""
      2095:20170225:084239.953 sending heartbeat message to server failed: error:"no response: network error", info:""
      2095:20170225:084339.953 sending heartbeat message to server failed: error:"no response: network error", info:""
      2094:20170225:084342.638 Unable to connect to the server [10.10.1.100]:10052 [cannot connect to [[10.10.1.100]:10052]: [110] Connection timed out]. Will retry every 120 second(s)
      2095:20170225:084439.954 sending heartbeat message to server failed: error:"no response: network error", info:""
      2094:20170225:084445.638 Still unable to connect...
      2095:20170225:084539.954 sending heartbeat message to server failed: error:"no response: network error", info:""
      2095:20170225:084639.955 sending heartbeat message to server failed: error:"no response: network error", info:""
      2095:20170225:084739.955 sending heartbeat message to server failed: error:"no response: network error", info:""
      2094:20170225:084748.638 Still unable to connect...
      2094:20170225:084948.639 Connection restored.
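
The lines above follow the `PID:YYYYMMDD:HHMMSS.mmm message` layout, so the length of an outage can be read straight from the log. The following is a minimal sketch, assuming that layout and assuming the three failure messages seen in this excerpt are the ones of interest. `parse_line`, `outage_windows` and the marker lists are hypothetical names, not part of Zabbix. It also treats all PIDs as one stream, which is a simplification, since the heartbeat and data sender processes log separately.

```python
import re
from datetime import datetime

# PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\.(\d{3})\s+(.*)$")

# Message fragments taken from the excerpt above (an assumed, partial list).
FAILURE_MARKERS = (
    "sending heartbeat message to server failed",
    "Error while receiving answer from server",
    "Unable to connect to the server",
)
RESTORED_MARKER = "Connection restored"


def parse_line(line):
    """Return (pid, timestamp, message), or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    pid, day, clock, millis, message = m.groups()
    stamp = datetime.strptime(f"{day} {clock}.{millis}", "%Y%m%d %H%M%S.%f")
    return int(pid), stamp, message


def outage_windows(lines):
    """Pair the first failure with the next 'Connection restored' line."""
    windows = []
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _pid, stamp, message = parsed
        if any(marker in message for marker in FAILURE_MARKERS):
            if start is None:
                start = stamp
        elif RESTORED_MARKER in message and start is not None:
            windows.append((start, stamp))
            start = None
    return windows
```

For the excerpt above, the first failure is logged at 08:39:39.951 and the connection is restored at 08:49:48.639, an outage of a little over 10 minutes as seen from this proxy.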

      Lines from another proxy during the outage:

      6648:20170225:084119.896 Unable to connect to the server [10.10.1.100]:10052 [cannot connect to [[10.10.1.100]:10052]: [4] Interrupted system call]
      6647:20170225:084120.324 Error while receiving answer from server [ZBX_TCP_READ() failed: [104] Connection reset by peer]


      From the server...

      4355:20170225:083919.686 unmatched trap received from "192.168.24.10": 08:39:14 2017/02/25 .1.3.6.1.4.1.231.2.54.2.0.2502 Normal "General event" 192.168.24.10 - 78.33.226.210 JAH-SVR01 1488011954 memory 81 80
      4280:20170225:083926.890 resuming SNMP agent checks on host "AirFibreRemote": connection restored
      4273:20170225:084026.276 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
      4277:20170225:084113.599 resuming SNMP agent checks on host "AirFibreRemote": connection restored
      4305:20170225:084126.614 sending configuration data to proxy "-ZABBIX-Proxy", datalen 526868
      4355:20170225:084134.707 unmatched trap received from "192.168.0.10": 08:41:32 2017/02/25 .1.3.6.1.4.1.318.0.35 Normal "General event" 192.168.0.10 - 176.35.188.187 AVR Trim No Longer Active
      4279:20170225:084211.565 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
      4279:20170225:084227.674 resuming SNMP agent checks on host "AirFibreRemote": connection restored
      4292:20170225:084306.962 sending configuration data to proxy "XX-ZabbixProxy", datalen 441081
      4292:20170225:084307.512 sending configuration data to proxy "XY Zabbix Proxy", datalen 286245
      4303:20170225:084338.768 sending configuration data to proxy "XZ Zabbix Proxy", datalen 385282
      4303:20170225:084338.859 sending configuration data to proxy "SJ proxy", datalen 311054
      4272:20170225:084426.078 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
      4271:20170225:084512.176 resuming SNMP agent checks on host "AirFibreRemote": connection restored
      4278:20170225:084611.174 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
      4276:20170225:084627.312 resuming SNMP agent checks on host "AirFibreRemote": connection restored
      4325:20170225:084708.486 sending configuration data to proxy "Cloud01-proxy", datalen 556085
      4351:20170225:084710.284 item "Zabbix Database Server:pgsql.get.pg.stat_database[{$PGSCRIPTDIR},{$PGSCRIPT_CONFDIR},{HOST.HOST},{$ZABBIX_AGENTD_CONF},zabbix]" became not supported: Timeout while executing a shell script.
      4271:20170225:084826.265 SNMP agent item "hrProcessorLoad" on host "AirFibreRemote" failed: first network error, wait for 15 seconds
      4304:20170225:084834.856 sending configuration data to proxy "HA Proxy", datalen 1535022
      4304:20170225:084836.126 sending configuration data to proxy "XX-ZabbixProxy", datalen 441081
      4304:20170225:084837.250 sending configuration data to proxy "XY Zabbix Proxy", datalen 286245


      • batchenr
        Senior Member
        • Sep 2016
        • 440

        #4

        I don't use proxies, so I don't know, but whenever I got a message from an agent saying it couldn't connect, I changed port 10052 in /etc/zabbix/zabbix_agent.conf:

        #port=10050 - just uncomment it and restart


        • Jason
          Senior Member
          • Nov 2007
          • 430

          #5
          All of our agents/proxies are configured to connect on that port for various reasons... I don't fancy changing that on over 800 hosts... It's worked fine for years till recently so I'd be surprised if that were it...


          • batchenr
            Senior Member
            • Sep 2016
            • 440

            #6
            Originally posted by Jason
            All of our agents/proxies are configured to connect on that port for various reasons... I don't fancy changing that on over 800 hosts... It's worked fine for years till recently so I'd be surprised if that were it...
            I just told you what works for me; you don't have to do it.
            Unfortunately, I don't have a proxy and I'm not familiar with these errors.

