Zabbix server crash (?) with many unreachable

  • chlehmann
    Junior Member
    • Mar 2019
    • 14

    #1

    Hi there
    Overview:
    Zabbix 4.4.4
    50 Hosts, many of them unreachable (!!!)
    Centos 7.7
    4GB RAM
    Postgres 11

    We see strange behavior with a Zabbix server in a pre-prod environment. As you can see, we have lots of unreachable devices (because it is pre-prod and the network is not ready yet). Still, it would be good to see the state of the reachable devices.

    When Zabbix is started, it works as expected, but after exactly 5 days it stops working and the frontend shows the "Zabbix-server not running" banner. However, the service is up and the port is listening (according to ss -ln). The firewall is temporarily stopped.
    The server log looks strange: it logged (normal) entries up to the same time the graphs stop, then the file was rotated, the new one is empty, and nothing has been logged since.
    Postgres is up and happy. Memory and disk space are available as well!

    Has any of you had similar issues? Do you know if it is related to the many unreachable servers, or is there something I am missing?
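For anyone who wants to catch this "service is up but nothing is logged" state automatically, here is a minimal sketch that flags a log file that has not been written to for a while. The function name, the threshold, and the log path in the usage comment are assumptions for illustration, not something taken from this setup.

```python
import os
import time
from typing import Optional


def log_is_stale(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the file is missing or was last modified too long ago."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable log is treated as stale as well.
        return True
    return (now - mtime) > max_age_seconds


# Hypothetical usage; the path is an assumption and depends on LogFile in zabbix_server.conf:
# log_is_stale("/var/log/zabbix/zabbix_server.log", max_age_seconds=3600)
```

A busy server logs something at least once per hour (the housekeeper run, for example), so a threshold a bit above one hour is a plausible starting point, but that is a guess and should be tuned.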
  • dimir
    Zabbix developer
    • Apr 2011
    • 1080

    #2
    It is essential to see the Zabbix server log file because it's the only way to find out why it crashed. Since it is lost at the time of the crash, perhaps you could increase the log size (LogFileSize in zabbix_server.conf):
    Code:
    ### Option: LogFileSize
    #       Maximum size of log file in MB.
    #       0 - disable automatic log rotation.
    #
    # Mandatory: no
    # Range: 0-1024
    # Default:
    # LogFileSize=1
    or even disable log file rotation by setting it to 0.
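
To double-check which value is actually in effect (the commented-out line is only the default), a small sketch like the following can read it from the config text. The function name is made up for illustration; the default of 1 is taken from the snippet above.

```python
def effective_log_file_size(conf_text: str, default: int = 1) -> int:
    """Return the LogFileSize that applies for the given zabbix_server.conf text.

    Commented-out lines are ignored; if the option appears more than once,
    the last uncommented occurrence is used (an assumption of this sketch).
    """
    value = default
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "LogFileSize":
            value = int(val.strip())
    return value


# Hypothetical usage:
# with open("/etc/zabbix/zabbix_server.conf") as f:   # path is an assumption
#     print(effective_log_file_size(f.read()))
```

A result of 0 means rotation is disabled, so the log from before a crash is not rotated away.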


    • chlehmann
      Junior Member
      • Mar 2019
      • 14

      #3
      Sorry for my late reply.

      I did not post the log because, in my opinion, it does not contain anything valuable. But see for yourself:

      Code:
      24309:20200127:173409.868 forced reloading of the configuration cache
       24331:20200127:173420.017 SNMP agent item "vfs.fs.units.total[65]" on host "ucaadsuv.domain.ch" failed: first network error, wait for 30 seconds
       24341:20200127:173425.165 SNMP agent item "system.uptime[sysUpTime]" on host "ucaadsv.domain.ch" failed: first network error, wait for 30 seconds
       24364:20200127:173450.801 resuming SNMP agent checks on host "ucaadsuv.domain.ch": connection restored
       24367:20200127:173455.644 resuming SNMP agent checks on host "ucaadsv.domain.ch": connection restored
       24338:20200127:173534.530 SNMP agent item "vfs.fs.units[31]" on host "ucaaes21.domain.ch" failed: first network error, wait for 30 seconds
       24346:20200127:173604.689 resuming SNMP agent checks on host "ucaaes21.domain.ch": connection restored
       24337:20200127:173813.830 SNMP agent item "net.if.duplex[dot3StatsDuplexStatus.2]" on host "ucacmv.domain.ch" failed: first network error, wait for 30 seconds
       24353:20200127:173843.672 resuming SNMP agent checks on host "ucacmv.domain.ch": connection restored
       24337:20200127:174022.290 SNMP agent item "system.cpu.util[196611]" on host "ucaads11.domain.ch" failed: first network error, wait for 30 seconds
       24364:20200127:174052.956 resuming SNMP agent checks on host "ucaads11.domain.ch": connection restored
       24341:20200127:174307.284 SNMP agent item "vfs.fs.units.used[59]" on host "ucacm11.domain.ch" failed: first network error, wait for 30 seconds
       24359:20200127:174337.929 resuming SNMP agent checks on host "ucacm11.domain.ch": connection restored
       24333:20200127:174828.360 SNMP agent item "memory.units.total[6]" on host "ucaads21.domain.ch" failed: first network error, wait for 30 seconds
       24346:20200127:174858.513 resuming SNMP agent checks on host "ucaads21.domain.ch": connection restored
       24309:20200127:174911.401 forced reloading of the configuration cache
       24334:20200127:175201.055 SNMP agent item "system.cpu.util[196612]" on host "ucabpm21.domain.ch" failed: first network error, wait for 30 seconds
       24363:20200127:175231.731 resuming SNMP agent checks on host "ucabpm21.domain.ch": connection restored
       24321:20200127:175501.672 item "ucaaes11.domain.ch:vfs.fs.units.total[66]" became not supported: No Such Instance currently exists at this OID
       24321:20200127:175501.672 item "ucaaes11.domain.ch:vfs.fs.units[66]" became not supported: No Such Instance currently exists at this OID
       24321:20200127:175501.672 item "ucaaes11.domain.ch:vfs.fs.units.used[66]" became not supported: No Such Instance currently exists at this OID
       24321:20200127:175511.731 item "ucaaes11.domain.ch:vfs.fs.pused[66]" became not supported: Cannot evaluate expression: "Cannot evaluate function "last()": item "ucaaes11.domain.ch:vfs.fs.units.total[66]" not supported.".
       24321:20200127:175512.737 item "ucaaes11.domain.ch:vfs.fs.total[66]" became not supported: Cannot evaluate expression: "Cannot evaluate function "last()": item "ucaaes11.domain.ch:vfs.fs.units[66]" not supported.".
       24320:20200127:175516.915 item "ucaaes11.domain.ch:vfs.fs.used[66]" became not supported: Cannot evaluate expression: "Cannot evaluate function "last()": item "ucaaes11.domain.ch:vfs.fs.units[66]" not supported.".
       24320:20200127:180101.618 item "ucaaes11.domain.ch:vfs.fs.units.total[66]" became supported
       24320:20200127:180101.618 item "ucaaes11.domain.ch:vfs.fs.units[66]" became supported
       24320:20200127:180101.619 item "ucaaes11.domain.ch:vfs.fs.units.used[66]" became supported
       24318:20200127:180311.667 item "ucaaes11.domain.ch:vfs.fs.pused[66]" became supported
       24318:20200127:180312.683 item "ucaaes11.domain.ch:vfs.fs.total[66]" became supported
       24318:20200127:180316.711 item "ucaaes11.domain.ch:vfs.fs.used[66]" became supported
       24309:20200127:180410.992 forced reloading of the configuration cache
       24340:20200127:180816.938 SNMP agent item "system.cpu.util[196612]" on host "ucasmgr01.domain.ch" failed: first network error, wait for 30 seconds
       24363:20200127:180846.920 resuming SNMP agent checks on host "ucasmgr01.domain.ch": connection restored
       24342:20200127:181010.947 SNMP agent item "net.if.duplex[dot3StatsDuplexStatus.3]" on host "ucacm21.domain.ch" failed: first network error, wait for 30 seconds
       24366:20200127:181040.044 resuming SNMP agent checks on host "ucacm21.domain.ch": connection restored
       24342:20200127:181423.177 SNMP agent item "system.location" on host "ucacmv.domain.ch" failed: first network error, wait for 30 seconds
       24360:20200127:181453.336 resuming SNMP agent checks on host "ucacmv.domain.ch": connection restored
       24309:20200127:181906.512 forced reloading of the configuration cache
       24313:20200127:182117.259 executing housekeeper
       24313:20200127:182117.287 housekeeper [deleted 0 hist/trends, 0 items/triggers, 0 events, 8 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 0.022007 sec, idle for 1 hour(s)]
       24340:20200127:182146.862 SNMP agent item "vfs.fs.units.used[31]" on host "ucaeqwgv.domain.ch" failed: first network error, wait for 30 seconds
       24351:20200127:182216.531 resuming SNMP agent checks on host "ucaeqwgv.domain.ch": connection restored
       24338:20200127:183252.990 SNMP agent item "vfs.fs.units.total[60]" on host "ucabai21.domain.ch" failed: first network error, wait for 30 seconds
       24366:20200127:183322.433 resuming SNMP agent checks on host "ucabai21.domain.ch": connection restored
       24309:20200127:183410.935 forced reloading of the configuration cache
       24332:20200127:183546.375 SNMP agent item "system.cpu.util[196615]" on host "ucaeqwgv.domain.ch" failed: first network error, wait for 30 seconds
       24347:20200127:183616.553 resuming SNMP agent checks on host "ucaeqwgv.domain.ch": connection restored
       24333:20200127:184010.390 SNMP agent item "net.if.duplex[dot3StatsDuplexStatus.2]" on host "ucacm21.domain.ch" failed: first network error, wait for 30 seconds
       24352:20200127:184040.746 resuming SNMP agent checks on host "ucacm21.domain.ch": connection restored
       24334:20200127:184343.654 SNMP agent item "system.cpu.util[196614]" on host "ucaeqwg21.domain.ch" failed: first network error, wait for 30 seconds
       24367:20200127:184413.370 resuming SNMP agent checks on host "ucaeqwg21.domain.ch": connection restored
       24331:20200127:184440.668 SNMP agent item "memory.units[10]" on host "ucaeqwg11.domain.ch" failed: first network error, wait for 30 seconds
       24363:20200127:184510.354 resuming SNMP agent checks on host "ucaeqwg11.domain.ch": connection restored
       24309:20200127:184910.607 forced reloading of the configuration cache
       24336:20200127:185252.962 SNMP agent item "system.cpu.util[196610]" on host "ucabai21.domain.ch" failed: first network error, wait for 30 seconds
       24348:20200127:185322.801 resuming SNMP agent checks on host "ucabai21.domain.ch": connection restored
       24341:20200127:185758.365 SNMP agent item "system.uptime[sysUpTime]" on host "ucabpm11.domain.ch" failed: first network error, wait for 30 seconds
       24366:20200127:185828.054 resuming SNMP agent checks on host "ucabpm11.domain.ch": connection restored
       24309:20200127:190410.141 forced reloading of the configuration cache
       24333:20200127:190510.574 SNMP agent item "memory.units.total[1]" on host "ucacm21.domain.ch" failed: first network error, wait for 30 seconds
       24361:20200127:190540.722 resuming SNMP agent checks on host "ucacm21.domain.ch": connection restored
       24332:20200127:190622.637 SNMP agent item "vfs.fs.units.used[60]" on host "ucasmm21.domain.ch" failed: first network error, wait for 30 seconds
       24351:20200127:190652.775 resuming SNMP agent checks on host "ucasmm21.domain.ch": connection restored
       24318:20200127:191320.563 item "ucabpm21.domain.ch:vfs.fs.units[65]" became supported
       24318:20200127:191320.563 item "ucabpm21.domain.ch:vfs.fs.units.total[65]" became supported
       24318:20200127:191320.563 item "ucabpm21.domain.ch:vfs.fs.units.used[65]" became supported
       24335:20200127:191352.445 SNMP agent item "storage.discovery" on host "ucabpm21.domain.ch" failed: first network error, wait for 30 seconds
       24346:20200127:191422.384 resuming SNMP agent checks on host "ucabpm21.domain.ch": connection restored
       24321:20200127:191424.682 item "ucabpm21.domain.ch:vfs.fs.pused[65]" became supported
       24321:20200127:191424.682 item "ucabpm21.domain.ch:vfs.fs.total[65]" became supported
       24321:20200127:191424.682 item "ucabpm21.domain.ch:vfs.fs.used[65]" became supported
       24318:20200127:191520.744 item "ucabpm21.domain.ch:vfs.fs.units[65]" became not supported: No Such Instance currently exists at this OID
       24318:20200127:191520.744 item "ucabpm21.domain.ch:vfs.fs.units.used[65]" became not supported: No Such Instance currently exists at this OID
       24318:20200127:191520.745 item "ucabpm21.domain.ch:vfs.fs.units.total[65]" became not supported: No Such Instance currently exists at this OID
       24317:20200127:191521.749 item "ucabpm21.domain.ch:vfs.fs.pused[65]" became not supported: Cannot evaluate expression: "Cannot evaluate function "last()": item "ucabpm21.domain.ch:vfs.fs.units.total[65]" not supported.".
       24317:20200127:191521.749 item "ucabpm21.domain.ch:vfs.fs.total[65]" became not supported: Cannot evaluate expression: "Cannot evaluate function "last()": item "ucabpm21.domain.ch:vfs.fs.units[65]" not supported.".
       24317:20200127:191521.749 item "ucabpm21.domain.ch:vfs.fs.used[65]" became not supported: Cannot evaluate expression: "Cannot evaluate function "last()": item "ucabpm21.domain.ch:vfs.fs.units[65]" not supported.".
       24333:20200127:191749.748 SNMP agent item "memory.units.used[7]" on host "ucabai11.domain.ch" failed: first network error, wait for 30 seconds
       24365:20200127:191819.373 resuming SNMP agent checks on host "ucabai11.domain.ch": connection restored
       24309:20200127:191906.284 forced reloading of the configuration cache
       24313:20200127:192118.051 executing housekeeper
       24313:20200127:192118.079 housekeeper [deleted 0 hist/trends, 0 items/triggers, 0 events, 0 problems, 0 sessions, 0 alarms, 0 audit, 0 records in 0.018368 sec, idle for 1 hour(s)]
       24336:20200127:192146.276 SNMP agent item "system.cpu.util[196614]" on host "ucaeqwgv.domain.ch" failed: first network error, wait for 30 seconds
       24353:20200127:192216.620 resuming SNMP agent checks on host "ucaeqwgv.domain.ch": connection restored
       24341:20200127:192334.895 SNMP agent item "vfs.fs.units.used[57]" on host "ucaaes21.domain.ch" failed: first network error, wait for 30 seconds
       24360:20200127:192404.720 resuming SNMP agent checks on host "ucaaes21.domain.ch": connection restored
       24332:20200127:192419.250 SNMP agent item "memory.units.total[1]" on host "ucaadsuv.domain.ch" failed: first network error, wait for 30 seconds
       24349:20200127:192449.743 resuming SNMP agent checks on host "ucaadsuv.domain.ch": connection restored
       24309:20200127:193409.982 forced reloading of the configuration cache
       24336:20200127:193501.548 SNMP agent item "system.cpu.util[196613]" on host "ucabpm21.domain.ch" failed: first network error, wait for 30 seconds
       24367:20200127:193531.206 resuming SNMP agent checks on host "ucabpm21.domain.ch": connection restored
       24334:20200127:193958.454 SNMP agent item "system.cpu.util[196618]" on host "ucabpm11.domain.ch" failed: first network error, wait for 30 seconds
       24350:20200127:194028.812 resuming SNMP agent checks on host "ucabpm11.domain.ch": connection restored
       24337:20200127:194131.722 SNMP agent item "net.if.out.discards[ifOutDiscards.2]" on host "ucaaes11.domain.ch" failed: first network error, wait for 30 seconds
       24351:20200127:194201.692 resuming SNMP agent checks on host "ucaaes11.domain.ch": connection restored
       24339:20200127:194313.499 SNMP agent item "system.cpu.util[196608]" on host "ucacmv.domain.ch" failed: first network error, wait for 30 seconds
       24343:20200127:194343.989 resuming SNMP agent checks on host "ucacmv.domain.ch": connection restored
       24338:20200127:194610.448 SNMP agent item "system.cpu.util[196609]" on host "ucacm21.domain.ch" failed: first network error, wait for 30 seconds
       24337:20200127:194616.333 SNMP agent item "vfs.fs.units[65]" on host "ucasmgr01.domain.ch" failed: first network error, wait for 30 seconds
       24357:20200127:194640.849 resuming SNMP agent checks on host "ucacm21.domain.ch": connection restored
       24346:20200127:194646.995 resuming SNMP agent checks on host "ucasmgr01.domain.ch": connection restored
       24309:20200127:194906.483 forced reloading of the configuration cache
       24331:20200127:195134.562 SNMP agent item "net.if.vlan[vmVlan.3]" on host "ucaaes21.domain.ch" failed: first network error, wait for 30 seconds
       24349:20200127:195204.377 resuming SNMP agent checks on host "ucaaes21.domain.ch": connection restored
       24335:20200127:195243.803 SNMP agent item "vfs.fs.units.total[31]" on host "ucaeqwg21.domain.ch" failed: first network error, wait for 30 seconds
       24363:20200127:195313.409 resuming SNMP agent checks on host "ucaeqwg21.domain.ch": connection restored
       24340:20200127:200116.179 SNMP agent item "vfs.fs.units.total[69]" on host "ucasmgr01.domain.ch" failed: first network error, wait for 30 seconds
       24349:20200127:200146.865 resuming SNMP agent checks on host "ucasmgr01.domain.ch": connection restored
      As you can see, there are a lot of unreachable events, and then nothing. The graphs stop at 27.01.2020 17:50, and I guess the server stopped on 02.02.2020, although there are no log entries between the last line in my paste and now. The reason I guess it died on the 2nd of February is that the log file (zabbix_server.log) was created at that time. But it's empty.

      It's really strange!
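
To get an overview of a log like the one above without reading it line by line, a rough sketch along these lines counts the "first network error" events per host and reports the timestamp of the last parsed line. The regular expressions are derived only from the lines in this paste, so other message formats are simply skipped; the function name is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Line layout as seen in the paste: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r'^\s*(\d+):(\d{8}):(\d{6})\.(\d{3})\s+(.*)$')
NET_ERR_RE = re.compile(r'on host "([^"]+)" failed: first network error')


def summarize(lines):
    """Return (Counter of network errors per host, datetime of last parsed line)."""
    errors = Counter()
    last_seen = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a regular log line (for example a wrapped continuation)
        _pid, date, clock, _ms, message = m.groups()
        last_seen = datetime.strptime(date + clock, "%Y%m%d%H%M%S")
        hit = NET_ERR_RE.search(message)
        if hit:
            errors[hit.group(1)] += 1
    return errors, last_seen


# Hypothetical usage:
# with open("zabbix_server.log") as f:
#     errors, last_seen = summarize(f)
#     print(last_seen, errors.most_common(5))
```

The last timestamp is the interesting part here, since it shows when the server stopped writing to the log.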


      • chlehmann
        Junior Member
        • Mar 2019
        • 14

        #4
        I figured out that it is in fact a Postgres performance issue. There was nothing in the logs, neither in PostgreSQL's nor in the Zabbix server's, but after using https://pgtune.leopard.in.ua and adjusting the values, it seems to be running stably now.
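
Which values were changed is not recorded here. Purely as an illustration of the kind of sizing such a tool suggests, the sketch below applies two commonly quoted rules of thumb to the 4GB machine from the first post: shared_buffers at roughly 25% of RAM (the starting value the PostgreSQL documentation suggests) and effective_cache_size at roughly 75%. The function name and the exact fractions are assumptions, not pgtune's actual algorithm and not the settings that were applied here.

```python
def starting_points(total_ram_mb: int) -> dict:
    """Rule-of-thumb starting values for two PostgreSQL memory settings.

    Assumptions of this sketch: 25% of RAM for shared_buffers and
    75% of RAM for effective_cache_size, on a host mostly dedicated to the database.
    """
    return {
        "shared_buffers": f"{total_ram_mb // 4}MB",
        "effective_cache_size": f"{total_ram_mb * 3 // 4}MB",
    }


# For the 4GB host described in the first post:
# starting_points(4096)
```

Since Zabbix server and Postgres share the same 4GB here, real values would probably need to be lower than these.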
