Fresh Zabbix 7.2 Install Issue(s)

  • biggsclarkes
    Junior Member
    • Jan 2025
    • 4

    #1

    So, I'm trying to get Zabbix running on a Windows Server 2019 three-node Hyper-V cluster, backed with CSV storage on a SAN.

    I initially did what I did the last time I installed Zabbix server: fired up a Gen 2 VM and installed Debian. Installed MySQL, set up the DB, then installed Zabbix-Server using nginx as the web server.

    It worked for a few days until the weekend when, after Cluster Aware Updates ran, monitoring of the VMs broke. I'm now getting the below in zabbix_server.log.

    I have configured the agents to communicate via PSK, with the same PSK and Identity on all hosts and the server. The instructions on using PSK-encrypted comms aren't clear on whether this should work. I've tried removing the PSK comms but I still get no availability from the agent.

    Code:
    1488:20250122:160912.730 failed to accept an incoming connection: from xxx.xxx.xxx.47: unspecified certificate verification error: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 316 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"
      1485:20250122:160914.268 failed to accept an incoming connection: from xxx.xxx.xxx.159: reading first byte from connection failed: [11] Resource temporarily unavailable
      1489:20250122:160915.683 failed to accept an incoming connection: from xxx.xxx.xxx.70: reading first byte from connection failed: [11] Resource temporarily unavailable
      1487:20250122:160916.183 failed to accept an incoming connection: from xxx.xxx.xxx.50: reading first byte from connection failed: [11] Resource temporarily unavailable
      1486:20250122:160916.704 failed to accept an incoming connection: from xxx.xxx.xxx.213: reading first byte from connection failed: [11] Resource temporarily unavailable
      1488:20250122:160916.735 failed to accept an incoming connection: from xxx.xxx.xxx.137: reading first byte from connection failed: [11] Resource temporarily unavailable
      1485:20250122:160918.273 failed to accept an incoming connection: from xxx.xxx.xxx.201: reading first byte from connection failed: [11] Resource temporarily unavailable
      1485:20250122:160918.273 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected
      1485:20250122:160918.273 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected
      1489:20250122:160919.687 failed to accept an incoming connection: from xxx.xxx.xxx.159: reading first byte from connection failed: [11] Resource temporarily unavailable
      1487:20250122:160920.187 failed to accept an incoming connection: from xxx.xxx.xxx.70: reading first byte from connection failed: [11] Resource temporarily unavailable
      1486:20250122:160920.708 failed to accept an incoming connection: from xxx.xxx.xxx.50: reading first byte from connection failed: [11] Resource temporarily unavailable
      1486:20250122:160920.709 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected
      1488:20250122:160920.739 failed to accept an incoming connection: from xxx.xxx.xxx.201: reading first byte from connection failed: [11] Resource temporarily unavailable
      1485:20250122:160922.277 failed to accept an incoming connection: from xxx.xxx.xxx.213: reading first byte from connection failed: [11] Resource temporarily unavailable
      1485:20250122:160922.629 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected
      1489:20250122:160923.692 failed to accept an incoming connection: from xxx.xxx.xxx.50: reading first byte from connection failed: [11] Resource temporarily unavailable
      1489:20250122:160923.692 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected
      1489:20250122:160923.692 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected
      1489:20250122:160923.693 failed to accept an incoming connection: from xxx.xxx.xxx.164: unspecified certificate verification error: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 316 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"
      1487:20250122:160924.192 failed to accept an incoming connection: from xxx.xxx.xxx.137: reading first byte from connection failed: [11] Resource temporarily unavailable
      1486:20250122:160924.713 failed to accept an incoming connection: from xxx.xxx.xxx.201: reading first byte from connection failed: [11] Resource temporarily unavailable
      1488:20250122:160924.744 failed to accept an incoming connection: from xxx.xxx.xxx.159: reading first byte from connection failed: [11] Resource temporarily unavailable
      1485:20250122:160926.633 failed to accept an incoming connection: from xxx.xxx.xxx.137: reading first byte from connection failed: [11] Resource temporarily unavailable
      1489:20250122:160927.697 failed to accept an incoming connection: from xxx.xxx.xxx.212: reading first byte from connection failed: [11] Resource temporarily unavailable
      1487:20250122:160928.196 failed to accept an incoming connection: from xxx.xxx.xxx.50: reading first byte from connection failed: [11] Resource temporarily unavailable
      1487:20250122:160928.197 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected
      1487:20250122:160928.197 failed to accept an incoming connection: from xxx.xxx.xxx.241: unspecified certificate verification error: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 316 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"
      1487:20250122:160928.198 failed to accept an incoming connection: from xxx.xxx.xxx.239: unspecified certificate verification error: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 316 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"
      1487:20250122:160928.198 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected
      1487:20250122:160928.199 failed to accept an incoming connection: from xxx.xxx.xxx.163: unspecified certificate verification error: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 316 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"
      1487:20250122:160928.200 failed to accept an incoming connection: from xxx.xxx.xxx.184: unspecified certificate verification error: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 316 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"
      1487:20250122:160928.200 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected
      1486:20250122:160928.718 failed to accept an incoming connection: from xxx.xxx.xxx.213: reading first byte from connection failed: [11] Resource temporarily unavailable
      1488:20250122:160928.748 failed to accept an incoming connection: from xxx.xxx.xxx.137: reading first byte from connection failed: [11] Resource temporarily unavailable
    There are three different errors there and I haven't found a conclusive answer to any of them. I've tested the connection between the server and agents and it's fine. No issues I can trace.
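    To see which of the three errors dominates and which agents trigger each one, the excerpt can be tallied by message type and source address. This is only a rough sketch in Python, not a Zabbix tool: the line pattern is inferred from the excerpt above, and the category labels are made-up names for illustration.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the excerpt: "PID:YYYYMMDD:HHMMSS.mmm message".
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3})\s+(.*)$")

# Hypothetical labels for the three messages that appear in the excerpt.
CATEGORIES = {
    "tls_handshake": "TLS handshake set result code",
    "first_byte_read": "reading first byte from connection failed",
    "getpeername": "getpeername() failed",
}


def classify(message):
    """Return the label of the first matching category, else 'other'."""
    for label, needle in CATEGORIES.items():
        if needle in message:
            return label
    return "other"


def tally(lines):
    """Count messages per category and per (category, source address)."""
    by_category = Counter()
    by_source = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not a log line
        message = match.group(4)
        label = classify(message)
        by_category[label] += 1
        # The getpeername() message carries no "from <address>:" part.
        source = re.search(r"from (\S+?):", message)
        if source:
            by_source[(label, source.group(1))] += 1
    return by_category, by_source


# One line of each type, copied from the excerpt above.
sample = [
    '1488:20250122:160912.730 failed to accept an incoming connection: from xxx.xxx.xxx.47: unspecified certificate verification error: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 316 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"',
    '1485:20250122:160914.268 failed to accept an incoming connection: from xxx.xxx.xxx.159: reading first byte from connection failed: [11] Resource temporarily unavailable',
    '1485:20250122:160918.273 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected',
]

by_category, by_source = tally(sample)
for label, count in by_category.most_common():
    print(label, count)
```

    For a real log, pass an open file to `tally()` instead of `sample`. If one set of addresses only ever produces the TLS error and another only the first-byte error, that hints at two groups of agents in different states rather than one common fault.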

    Additionally, from an agent on a Windows VM on the same cluster:
    Code:
    2025/01/22 16:10:53.224917 [101] cannot connect to [zabbix.dc.local:10051]: read tcp xxx.xxx.xxx.76:28301->xxx.xxx.xxx.65:10051: i/o timeout
    2025/01/22 16:10:53.224917 [101] history upload to [zabbix.dc.local:10051] [host] started to fail
    2025/01/22 16:11:08.227014 [101] cannot connect to [zabbix.dc.local:10051]: read tcp xxx.xxx.xxx.76:28330->xxx.xxx.xxx.65:10051: i/o timeout
    2025/01/22 16:11:08.228026 [101] sending of heartbeat message for [host] started to fail
    2025/01/22 16:11:53.227295 [101] cannot connect to [zabbix.dc.local:10051]: read tcp xxx.xxx.xxx.76:28329->xxx.xxx.xxx.65:10051: i/o timeout
    2025/01/22 16:11:53.227295 [101] history upload to [zabbix.dc.local:10051] [host] started to fail
    2025/01/22 16:12:08.231388 [101] cannot connect to [zabbix.dc.local:10051]: read tcp xxx.xxx.xxx.76:28364->xxx.xxx.xxx.65:10051: i/o timeout
    2025/01/22 16:12:08.232425 [101] active check configuration update from host [host] started to fail
    Also, on both the Debian VM (which I've rebuilt about three times) and now on an Ubuntu VM, the front-end can't seem to communicate with the Zabbix Server and I get the below:
    [Two screenshots (image.png) were attached to the original post here.]
    Anyone got any troubleshooting ideas? I've searched as hard as I can and I've come up with nothing that's got me close to fixing this. The only thing I've got left to try is running Zabbix on a physical host to see if it's something related to the virtualised network on the cluster.
    Last edited by biggsclarkes; 23-01-2025, 10:41.
  • Blevar
    Member
    • Jan 2025
    • 68

    #2
    Hi,

    This looks like a lot of SSL/TLS errors.
    First check if the server is actually running:
    Code:
    sudo systemctl status zabbix-server



    • biggsclarkes
      biggsclarkes commented
      Hello, there wouldn't be any logs in zabbix_server.log if it wasn't running... Should've made it clear, the logs are from zabbix_server.log
  • PZakrzewski
    Junior Member
    • Dec 2024
    • 12

    #3
    It seems there are a few potential issues causing the errors:

    1. TLS Handshake Errors: Verify that PSK and identity values match exactly on the server and agents. If unsure, regenerate the PSK securely using openssl rand -hex 32 and update it on all hosts.
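    2. Agent Connection Timeouts: Ensure port 10051 on the server is open and reachable from the agents (and port 10050 on the agents from the server, if you use passive checks), DNS resolves correctly for zabbix.dc.local, and there are no network issues like latency or packet loss.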
    3. Front-End Issue: Confirm the Zabbix server service is running, listening on the correct IP/port (netstat -tuln), and there are no restrictions on the web interface.


    So overall:

    • Test with a new PSK generated via `openssl rand -hex 32`.
    • Temporarily disable PSK/TLS on one agent to isolate the issue.
    • Increase log verbosity (DebugLevel=4) for more insights.
    • Use telnet/nc to test agent-server connectivity on port 10051.
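    Where telnet or nc isn't installed, the same reachability check can be sketched in Python. This is a minimal illustration, not a Zabbix utility: `can_connect` is a hypothetical helper name, and it only proves that a TCP connection to the port opens. It says nothing about whether the TLS/PSK handshake succeeds afterwards, which matters here because the agent log shows reads timing out on the connection rather than the connect itself being refused.

```python
import socket


def can_connect(host, port=10051, timeout=3.0):
    """Return (True, "") if a TCP connection opens, else (False, reason)."""
    try:
        # create_connection resolves the name and tries each returned address,
        # so a DNS failure shows up here as well (socket.gaierror is an OSError).
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
```

    From an agent host you would call something like `can_connect("zabbix.dc.local")` and print the result. A name-resolution failure, a refused connection and a timeout each produce a different reason string, which helps separate DNS problems from firewall problems.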


    If these don’t help, testing on a physical host might rule out virtualization-related issues. Let us know how it goes!


    • biggsclarkes
      Junior Member
      • Jan 2025
      • 4

      #4
      I've ditched PSK encryption for now. Having disabled it, the server appears to be quite happy. Before that I'd tried ensuring I'm using the correct PSK everywhere, I'd tried unique PSK Idents on agents, and I'd verified the server is up and listening.

      Now I've ditched PSK, the Zabbix-Server stats are back and all hosts are working fine... I'll look at Certificate encryption instead in a couple of months.



      • PZakrzewski
        PZakrzewski commented
        It sounds like the issue may have been caused by a PSK or identity mismatch between the server and some of the agents. Just to clarify, a PSK (Pre-Shared Key) only works when both ends of a connection hold exactly the same key and identity, so what is configured for a host on the server must match what that agent uses. If you’d like to revisit PSK in the future, you can regenerate a single secure PSK (e.g., openssl rand -hex 32) and apply it uniformly across all agents and the server.
    • PZakrzewski
      Junior Member
      • Dec 2024
      • 12

      #5

      Anyway, if you'd like to go back to PSK configuration later, you might want to check these things:


      1. Verify PSK Configuration:
      • Ensure the same PSK is used on the server and all agents.
      • Double-check TLSPSKIdentity and TLSPSKFile in your zabbix_agentd.conf or zabbix_agent2.conf:
      Code:
      TLSAccept=psk
      TLSConnect=psk
      TLSPSKFile=/path/to/psk
      TLSPSKIdentity=zabbix-agent-identity
      2. Test Connection:
      Use openssl to test PSK communication:
      Code:
      openssl s_client -connect <zabbix-server-ip>:10051 -psk_identity '<PSK Identity>' -psk '<PSK>'
      This will confirm if the PSK handshake works correctly.

      3. Check Logs:
      Set DebugLevel=4 in both zabbix_server.conf and the agent config for more detailed logs, restart the services, and look for errors related to TLS or PSK.

      4. Firewall:
      Ensure no firewall rules are blocking port 10051 (agent to server) or, for passive checks, port 10050 (server to agent).
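      Before rolling a PSK out again, it can also help to sanity-check the key and identity strings themselves. The following is a minimal Python sketch, not an official tool. The limits in it (32 to 512 hex digits for the key, a non-empty identity of at most 128 bytes) are my reading of the Zabbix encryption documentation and should be treated as assumptions to verify, and both function names are hypothetical. `generate_psk()` is meant as an equivalent of `openssl rand -hex 32`.

```python
import secrets
import string

# Assumed limits (check against the Zabbix docs for your version):
# PSK is 32 to 512 hex digits (128 to 2048 bits); identity is 1 to 128 bytes of UTF-8.
MIN_HEX_DIGITS = 32
MAX_HEX_DIGITS = 512
MAX_IDENTITY_BYTES = 128


def generate_psk(num_bytes=32):
    """Return a random PSK as a hex string (64 hex digits for 32 bytes)."""
    return secrets.token_hex(num_bytes)


def psk_problems(psk_text, identity):
    """Return a list of obvious problems; an empty list means none were found."""
    problems = []
    psk = psk_text.strip()  # ignore surrounding whitespace and the trailing newline
    if not psk or any(ch not in string.hexdigits for ch in psk):
        problems.append("PSK must consist of hexadecimal digits only")
    if len(psk) % 2:
        problems.append("PSK has an odd number of hex digits")
    if not MIN_HEX_DIGITS <= len(psk) <= MAX_HEX_DIGITS:
        problems.append(f"PSK has {len(psk)} hex digits, outside the assumed range")
    identity_bytes = len(identity.encode("utf-8"))
    if not 1 <= identity_bytes <= MAX_IDENTITY_BYTES:
        problems.append(f"identity is {identity_bytes} bytes, outside the assumed range")
    return problems
```

      Running `psk_problems()` on the contents of the agent's TLSPSKFile and on the value entered for the host in the frontend will catch truncated or mistyped keys, but it cannot tell you whether the two sides actually match. For that, compare the two values directly.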



      • biggsclarkes
        biggsclarkes commented
        I started with a single PSK and ident across everything, then it broke over the weekend and I haven't been able to identify the cause. I suspect it may be something on our windows hosts that's caused it, but thanks for the PSK testing command, I didn't run into that the whole time I was looking into it somehow. My searching skills are apparently no longer working well.