Zabbix Agent 2 - SSL_shutdown: error:0A000197:SSL routines::shutdown while in init

This topic has been answered.
  • AccidentallyTheCable
    Junior Member
    • Apr 2024
    • 7

    #1

    Zabbix Agent 2 - SSL_shutdown: error:0A000197:SSL routines::shutdown while in init

    I have a system where the agent has started to fail intermittently, and I'm not sure where to go from here. I've tried everything I can think of and still end up with the same problem. This agent was working fine, and then about 3 days ago it started having problems.

    Server: `zabbix-server-mysql 1:7.0.0~beta1-4+debian12`
    Agent: `zabbix-agent2 1:7.0.0~beta2-4+debian12`

    Server Side Error: `zabbix_server[902554]: SSL_shutdown() with $AGENT set result code to 1: file ../ssl/ssl_lib.c line 2259 func SSL_shutdown: error:0A000197:SSL routines::shutdown while in init`
    Agent Side Error: `zabbix_agent2[2131144]: failed to process an incoming connection from $SERVER: EOF`

    I've double-checked that the TLS settings on both sides are set correctly, and as mentioned, the problem is intermittent. For example, I'll get a bunch of data from this agent, and then about 40 lines of those error messages. The data being received is different each time, meaning that in one run:

    - Data for `Network interface discovery: Interface wg0: Outbound packets discarded net.if.out["wg0",dropped]` succeeds
    - Data for `Network interface discovery: Interface wg0: Outbound packets with errors net.if.out["wg0",errors]` fails

    The next run, it might be flipped, it might not.

    Things I've tried / verified:
    - Multiple verifications that this is not network related.
    - The system the agent is running on is a Proxmox host, and all of the VMs are running fine, including constant data streaming, without problems. Their agents are also not showing any problems. This is the only problem agent.
    - Verified that the certs are valid, everything is in order, and the chain can be validated.
    - Restarted both agent and server.
    - Tried toying with `TLSConnect` and `TLSAccept`; neither made any difference. Current settings are:
      - `TLSConnect`: Not set
      - `TLSAccept`: `cert`
    - Verified that `cert` is selected for this host in Zabbix. Tried toying with these options. No change.
    - Turned up logging on both sides to `trace`, and nothing immediately stood out. Output is below.

    AGENT:
    ```
    received passive check request: '{"request":"passive checks","data":[{"key":"net.if.out["wg0",dropped]","timeout":3}]}' from '$SERVER'
    [1] processing update request (1 requests)
    [1] adding new request for key: 'net.if.out["wg0",dropped]'
    [1] created direct exporter task for plugin 'NetIf' itemid:0 key 'net.if.out["wg0",dropped]'
    executing direct exporter task for key 'net.if.out["wg0",dropped]'
    executed direct exporter task for key 'net.if.out[[wg0 dropped]]'
    sending passive check response: '{"version":"7.0.0","data":[{"value":"0"}]}' to '$SERVER'
    Calling C function "tls_connected()"
    Calling C function "tls_write()"
    Calling C function "tls_recv()"
    Calling C function "tls_recv()"
    Calling C function "tls_close()"
    Calling C function "tls_free()"
    Calling C function "tls_new_server()"
    Calling C function "tls_free()"
    Calling C function "tls_ready()"
    Calling C function "tls_send()"
    Calling C function "tls_connected()"
    Calling C function "tls_accept()"
    Calling C function "tls_recv()"
    Calling C function "tls_send()"
    Calling C function "tls_accept()"
    Calling C function "tls_recv()"
    Calling C function "tls_recv()"
    Calling C function "tls_recv()"
    failed to process an incoming connection from $SERVER: EOF
    Calling C function "tls_new_server()"
    Calling C function "tls_free()"
    Calling C function "tls_ready()"
    Calling C function "tls_send()"
    Calling C function "tls_connected()"
    Calling C function "tls_accept()"
    Calling C function "tls_recv()"
    Calling C function "tls_send()"
    Calling C function "tls_accept()"
    Calling C function "tls_recv()"
    Calling C function "tls_recv()"
    Calling C function "tls_recv()"
    failed to process an incoming connection from $SERVER: EOF
    ```

    SERVER:
    ```
    zbx_tls_connect() peer certificate issuer:"CN=PROTECTED" subject:"CN=PROTECTED"
    End of zbx_tls_connect():SUCCEED (established TLSv1.3 TLS_CHACHA20_POLY1305_SHA256)
    agent_task_process() step 'send' event:2 itemid:47982
    Sending [net.if.in["vmbr1",errors]] itemid:47982
    End of async_event():ZBX_ASYNC_TASK_READ
    In zbx_ipc_async_socket_recv() timeout:0
    End of zbx_ipc_async_socket_recv():0
    itemid:41993 hostid:10499 templateid:0
    itemid:41994 hostid:10499 templateid:0
    itemid:41995 hostid:10499 templateid:0
    itemid:41996 hostid:10499 templateid:0
    itemid:41997 hostid:10499 templateid:0
    itemid:41998 hostid:10499 templateid:0
    itemid:41999 hostid:10499 templateid:0
    itemid:42000 hostid:10499 templateid:0
    itemid:42001 hostid:10499 templateid:0
    itemid:42002 hostid:10499 templateid:0
    itemid:42003 hostid:10499 templateid:0
    itemid:42004 hostid:10499 templateid:0
    itemid:42005 hostid:10500 templateid:0
    itemid:42006 hostid:10500 templateid:0
    itemid:42007 hostid:10500 templateid:0
    itemid:42008 hostid:10500 templateid:0
    itemid:42009 hostid:10500 templateid:0
    itemid:42010 hostid:10500 templateid:0
    itemid:42011 hostid:10500 templateid:0
    itemid:42012 hostid:10500 templateid:0
    itemid:42013 hostid:10500 templateid:0
    itemid:42014 hostid:10500 templateid:0
    itemid:42015 hostid:10500 templateid:0
    itemid:42016 hostid:10500 templateid:0
    itemid:42017 hostid:10500 templateid:0
    itemid:42018 hostid:10500 templateid:0
    itemid:42019 hostid:10500 templateid:0
    itemid:42020 hostid:10500 templateid:0
    itemid:42021 hostid:10500 templateid:0
    itemid:42022 hostid:10500 templateid:0
    itemid:42023 hostid:10500 templateid:0
    itemid:42024 hostid:10500 templateid:0
    itemid:42025 hostid:10500 templateid:0
    itemid:42026 hostid:10501 templateid:0
    itemid:42027 hostid:10501 templateid:0
    itemid:42028 hostid:10501 templateid:0
    itemid:42029 hostid:10501 templateid:0
    itemid:42030 hostid:10501 templateid:0
    SSL_shutdown() with $AGENT set result code to 1: file ../ssl/ssl_lib.c line 2259 func SSL_shutdown: error:0A000197:SSL routines::shutdown while in init
    SSL_shutdown() with $AGENT set result code to 1: file ../ssl/ssl_lib.c line 2259 func SSL_shutdown: error:0A000197:SSL routines::shutdown while in init
    SSL_shutdown() with $AGENT set result code to 1: file ../ssl/ssl_lib.c line 2259 func SSL_shutdown: error:0A000197:SSL routines::shutdown while in init
    SSL_shutdown() with $AGENT set result code to 1: file ../ssl/ssl_lib.c line 2259 func SSL_shutdown: error:0A000197:SSL routines::shutdown while in init
    ```
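One way to get an overview of trace output like the above without reading it line by line is to tally the OpenSSL error codes it contains. The following is only a sketch: `tally_openssl_errors` is a hypothetical helper name, and the regular expression assumes the OpenSSL 3 message layout seen in the log lines in this thread (`error:<code>:<library>::<reason>`).

```python
import re
from collections import Counter

# Matches the OpenSSL 3 style error text seen in the logs above, e.g.
# "error:0A000197:SSL routines::shutdown while in init".
# Assumption: code, library, (often empty) function, and reason are
# separated by single colons, as in the lines quoted in this thread.
OPENSSL_ERROR = re.compile(
    r"error:(?P<code>[0-9A-Fa-f]{8}):(?P<lib>[^:]*):(?P<func>[^:]*):(?P<reason>[^:\n]+)"
)


def tally_openssl_errors(lines):
    """Count (code, reason) pairs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for match in OPENSSL_ERROR.finditer(line):
            counts[(match["code"].upper(), match["reason"].strip())] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "SSL_shutdown() with $AGENT set result code to 1: file ../ssl/ssl_lib.c "
        "line 2259 func SSL_shutdown: error:0A000197:SSL routines::shutdown while in init",
    ]
    for (code, reason), n in tally_openssl_errors(sample).most_common():
        print(f"{n:5d}  {code}  {reason}")
```

Feeding it an open log file (for example `tally_openssl_errors(open(path))`, with `path` pointing at wherever the server log lives on your system) shows whether one error dominates or several different ones are mixed together.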
  • Answer selected by AccidentallyTheCable at 13-04-2024, 00:10.
    AccidentallyTheCable
    Junior Member
    • Apr 2024
    • 7

    I think I might have figured it out finally.. it was a timeout / load problem :|

    I wish that the `first network error` message was more descriptive without having to turn up debugging and get spammed; could've saved me 8 days of digging

    I'm not sure why / how though; no systems are using even 15% of their CPU at any time as far as I can see, but bumping the timeout values up seems to have fixed it.

    • markfree
      Senior Member
      • Apr 2019
      • 868

      #2
      This seems more like an OpenSSL error. Has it been updated on your system?

      I'm assuming you're using certificate-based encryption, and it looks like you're trying Zabbix 7. Right?
      Have you (re-)checked your TLS cert options?

      • AccidentallyTheCable
        Junior Member
        • Apr 2024
        • 7

        #3
        I've been using 7 from the start of this install, and this is the first issue I've run into.

        All systems are currently using: `openssl 3.0.11-1~deb12u2`

        I have checked, and rechecked the TLS options and certs. As I mentioned, all other agents are fine, AND, this agent is sending *some* data, AND this agent was previously working fine until about 4 days ago. There are 13 other agents, all using the same settings, versions, etc, and not having issues. All of them are configured the same way, using TLS, all options are the same on all agents, same zabbix-agent2 versions, etc.

        As noted in my original post, this agent was working completely fine until a few days ago, consistently getting data, with none of what's happening now. I've attached graphs of CPU and memory usage to show what's going on. You can see that data is getting through, and that it's not always the same data getting through.

        This is `now to now-8h`
        [Image: Capture1.png]

        Below are graphs from `now-5d to now-5d+8h` and `now-4d to now-4d+8h`
        [Images: Capture2.png, Capture1.png]
        You can see when data started to break. I checked journalctl on both server and agent, nothing out of the ordinary leading up to the issue.

        • AccidentallyTheCable
          Junior Member
          • Apr 2024
          • 7

          #4
          It might be a bit early to say it, but it seems upgrading the server fixed this? The agent version on the problem host originally matched the server version, even after the problem started. I decided to upgrade it (hence the version in my original post), but that didn't seem to fix anything. Today I decided just to upgrade the server and see if it did anything, and it seems to have fixed it.

          I'm very confused as to how a problem came from no changes, and how upgrading fixed it where restarts did not, especially given the errors. The only packages that upgraded were Zabbix-specific; no supporting libraries got upgraded.

          I'm going to give it another 24 hours before I mark this as resolved, but I have no idea what broke, or why, especially since the version of the agent on all other hosts remains at `1:7.0.0~beta1-4+debian12`, and this one (and now the server) are at `1:7.0.0~beta2-4+debian12`.

          • AccidentallyTheCable
            Junior Member
            • Apr 2024
            • 7

            #5
            Spoke too soon..

            Server:
            zabbix_server[1213897]: Zabbix agent item "net.if.out["eno1",errors]" on host "$AGENT" failed: first network error, wait for 15 seconds

            Agent:
            (many)
            zabbix_agent2[2131144]: failed to process an incoming connection from $SERVER: EOF
            zabbix_agent2[2131144]: failed to process an incoming connection from $SERVER: read tcp $AGENT:10050->$SERVER:41500: read: connection reset by peer
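To check the earlier observation that a different set of items fails on each run, the server-side `first network error` messages can be grouped by host and item key. This is a rough sketch, not a Zabbix tool: `tally_failed_items` is a hypothetical name, and the pattern is written only against the message wording quoted in this thread.

```python
import re
from collections import Counter

# Assumption: the server logs failures in the form quoted above:
#   Zabbix agent item "<key>" on host "<host>" failed: <reason>
# The key itself may contain double quotes, so it is matched greedily
# up to the last '" on host "' in the line.
FAILED_ITEM = re.compile(
    r'Zabbix agent item "(?P<key>.+)" on host "(?P<host>[^"]+)" failed: (?P<reason>.+)'
)


def tally_failed_items(lines):
    """Count failures per (host, item key) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = FAILED_ITEM.search(line)
        if match:
            counts[(match["host"], match["key"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'zabbix_server[1213897]: Zabbix agent item "net.if.out["eno1",errors]" '
        'on host "$AGENT" failed: first network error, wait for 15 seconds',
    ]
    for (host, key), n in tally_failed_items(sample).most_common():
        print(f"{n:5d}  {host}  {key}")
```

If the counts end up spread roughly evenly over many keys, that would point away from a single misbehaving item and towards something connection-level such as timing.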

            • AccidentallyTheCable
              Junior Member
              • Apr 2024
              • 7

              #6
              I decided to install agent v1 and see if anything changed. Unfortunately, the only thing that changed was the error message. This seems to point to an SSL problem, but I can't figure out why/what. Cert chains are valid, certs are valid, permissions OK.

              Server:
              Zabbix agent item "vfs.file.contents["/sys/class/net/eno1/operstate"]" on host "$AGENT" failed: first network error, wait for 15 seconds

              Agent:
              (many)
              zabbix_agentd[3346937]: failed to accept an incoming connection: from $SERVER: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 303 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"

              • tim.mooney
                Senior Member
                • Dec 2012
                • 1427

                #7
                The problem is probably not this, but it's one thing you should eliminate: make sure you're not running out of entropy on the system running the agent.

                I think it's rare (especially with modern kernels), but running out of entropy (draining the entropy pool) can cause weird, intermittent issues with anything that relies on random numbers. TLS is one of those operations that could potentially be impacted.

                • AccidentallyTheCable
                  Junior Member
                  • Apr 2024
                  • 7

                  #8
                  I ran `cat /proc/sys/kernel/random/entropy_avail` on multiple systems, all gave the same result, `256`. Even did a `watch -n 1 cat /proc/sys/kernel/random/entropy_avail` for a bit, same value.
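For logging the same check over a longer period, a small script can stand in for `watch`. This is a minimal sketch; `sample_entropy` is a hypothetical helper, and the only thing it relies on is the procfs path used above. As far as I know, recent Linux kernels (roughly 5.18 and later) report a constant `256` here once the pool is initialised, so an unchanging value would be expected rather than suspicious.

```python
import time
from pathlib import Path

# Same procfs file that was read with `cat` above (Linux only).
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")


def sample_entropy(samples=5, interval=1.0, path=ENTROPY_PATH):
    """Read the entropy estimate `samples` times, `interval` seconds apart."""
    readings = []
    for i in range(samples):
        readings.append(int(Path(path).read_text().strip()))
        if i < samples - 1:
            time.sleep(interval)
    return readings


# Example use on the agent host:
#   readings = sample_entropy(samples=60)
#   print(min(readings), max(readings))
```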

                  I was going to make an attempt to use gdb to trace the agent2 and see what I could find, but it appears someone goofed...

                  ```
                  [Thread debugging using libthread_db enabled]
                  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
                  runtime.futex () at /home/packager/go1.21.3/src/runtime/sys_linux_amd64.s:558
                  558 /home/packager/go1.21.3/src/runtime/sys_linux_amd64.s: No such file or directory.
                  warning: Unsupported auto-load script at offset 0 in section .debug_gdb_scripts
                  of file /usr/lib/debug/.build-id/b6/e88ed910fc38d2f0319f40082a63c26b7d6d05.debug.
                  ```
                  Note the path to the Go runtime source file (`/home/packager/...`).

                  Would something like a buffer overflow cause this? Perhaps the amount of data being sent is just overwhelming? I don't know; it seems odd given that I have another host with as many items.

                  • tim.mooney
                    Senior Member
                    • Dec 2012
                    • 1427

                    #9
                    Any kind of terminating signal (segfault, whatever) could cause the connection to shut down abruptly and cause the other end to potentially log some weird messages, but you should be seeing errors in the system logs or elsewhere if the agent is failing in that way.

                    If you're comfortable using gdb, then you may want to download the agent2 source code and build it yourself, rather than trying to rely upon the separate debug symbols. My site doesn't (yet) use the agent2, so I can't provide much in the way of useful info on building. If I were in your situation, I would start with the source package, as that should have the recipe to build it and should list the dependencies you need to have installed on your build box.

                    • AccidentallyTheCable
                      Junior Member
                      • Apr 2024
                      • 7

                      #10
                      I think I might have figured it out finally.. it was a timeout / load problem :|

                      I wish that the `first network error` message was more descriptive without having to turn up debugging and get spammed; could've saved me 8 days of digging

                      I'm not sure why / how though; no systems are using even 15% of their CPU at any time as far as I can see, but bumping the timeout values up seems to have fixed it.
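For anyone wanting to test a timeout theory without trace logging, one option is to time the TLS handshake against the agent port from the server host and compare it with the 3-second timeout visible in the passive check request earlier in the thread. The sketch below rests on several assumptions: the function names are hypothetical, the file paths are placeholders you would supply yourself, it assumes the agent starts TLS directly on the connection (port `10050`, as in the log above), and it disables hostname checking on the guess that the certificates are not issued for host names.

```python
import socket
import ssl
import time


def build_client_context(cafile, certfile, keyfile):
    """TLS client context using a CA file and a client certificate/key pair."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Assumption: the certificate names need not match the host name,
    # so only the chain is verified here.
    ctx.check_hostname = False
    ctx.load_verify_locations(cafile=cafile)
    ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return ctx


def time_tls_handshake(ctx, host, port=10050, timeout=3.0):
    """Return (ok, seconds, detail) for one TCP connect plus TLS handshake."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # wrap_socket performs the handshake on the connected socket.
            with ctx.wrap_socket(raw) as tls:
                return True, time.monotonic() - start, tls.version()
    except OSError as exc:  # covers timeouts, resets, and ssl.SSLError
        return False, time.monotonic() - start, repr(exc)


# Example use (paths and host are placeholders):
#   ctx = build_client_context("ca.crt", "server.crt", "server.key")
#   print(time_tls_handshake(ctx, "agent.example", timeout=3.0))
```

Because the probe disconnects right after the handshake, the agent will probably log its own EOF message for each attempt; repeated runs that come close to or exceed the timeout would be consistent with the load/timeout explanation.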

                      • vso
                        Zabbix developer
                        • Aug 2016
                        • 190

                        #11
                        Could you please be so kind and check if this is the issue https://support.zabbix.com/browse/ZBX-23941 and https://support.zabbix.com/browse/ZBXNEXT-9024 ?
                        Last edited by vso; 15-04-2024, 16:56.
