Zabbix 6.4: TLS write fatal alert "decode error" when new agent is added

  • dimic
    Junior Member
    • Sep 2010
    • 9

    #1

    Hi,
    I have a Zabbix setup with about a dozen hosts. Everything was working well.
    I cloned one Windows host and one Linux host that were already monitored by Zabbix, renamed them, updated the host name in the Zabbix config, and added them to Zabbix.
    The two cloned VMs are located in a new subnet, but they can communicate with the Zabbix server over the network.
    Then something really odd started to happen: Zabbix became unresponsive, and in the logs I started to see messages like these for ALL hosts that exist in Zabbix, both the old and the new ones:

    Code:
    zabbix-server-1  |    235:20240422:022037.623 failed to accept an incoming connection: from 172.xx.xx.xx: unspecified certificate verification error: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 308 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"
    zabbix-server-1  |    235:20240422:022037.636 failed to accept an incoming connection: from 192.xx.xx.xx: unspecified certificate verification error: TLS handshake set result code to 1: file ../ssl/record/rec_layer_s3.c line 308 func ssl3_read_n: error:0A000126:SSL routines::unexpected eof while reading: TLS write fatal alert "decode error"
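
To see which source addresses produce these handshake failures and how often, the server log can be tallied with a short script. This is only a rough sketch: the function name `count_failed_sources` is made up for illustration, the regular expression assumes the exact message wording shown above, and it does not handle IPv6 addresses.

```python
import re
from collections import Counter

# Assumption: the server log uses exactly the wording quoted in this post.
# The pattern captures the address that follows "from" up to the next colon.
ACCEPT_FAIL = re.compile(
    r"failed to accept an incoming connection: from ([^\s:]+):"
)


def count_failed_sources(lines):
    """Count failed-accept log lines per source address."""
    counts = Counter()
    for line in lines:
        match = ACCEPT_FAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Shortened copies of the two log lines above (addresses are masked as in the post).
sample = [
    "zabbix-server-1  |    235:20240422:022037.623 failed to accept an "
    "incoming connection: from 172.xx.xx.xx: unspecified certificate "
    "verification error",
    "zabbix-server-1  |    235:20240422:022037.636 failed to accept an "
    "incoming connection: from 192.xx.xx.xx: unspecified certificate "
    "verification error",
    "zabbix-server-1  |    236:20240422:024954.896 cannot process returned result",
]

for address, hits in count_failed_sources(sample).most_common():
    print(address, hits)
```

With a real log, the lines could come from an open file object instead of the `sample` list. A sharp rise for every address at the moment the new agents start would match what is described here.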

    In the agent logs I see corresponding messages:
    Code:
    2024/04/22 02:08:25.541293 [101] sending of heartbeat message for [xxx] started to fail
    2024/04/22 02:08:59.544035 [101] cannot receive data from [192.yy.yy.yy:10051]: Cannot read message: 'read tcp 172.xx.xx.xx:53395->192.yy.yy.yy:10051: i/o timeout'
    2024/04/22 02:08:59.544066 [101] history upload to [192.yy.yy.yy:10051] [xxx] started to fail
    2024/04/22 02:09:59.556672 [101] cannot receive data from [192.yy.yy.yy:10051]: Cannot read message: 'read tcp 172.xx.xx.xx:54557->192.yy.yy.yy:10051: i/o timeout'
    I've tried connecting to the new agents both with PSK and without encryption; the result is the same.
    Once I stop the agents on the new VMs and remove the hosts from Zabbix, it starts working properly again.
    Does anyone know what the issue is?

    Thanks a lot!

    P.S. The PSK is the same on all the old agents. On the new agents I've tried both that same PSK and new PSKs.
    P.P.S. I also noticed another error in the Zabbix server log:
    Code:
    zabbix-server-1  |    236:20240422:024954.896 cannot process returned result: cannot parse as a valid JSON object: invalid object format, expected opening character '{' or '[' at: 'OK'
    It is likely unrelated to TLS, but if someone knows how to find out which agent/check it is coming from, that would be great.
    Last edited by dimic; 22-04-2024, 00:53.
  • dimic
    Junior Member
    • Sep 2010
    • 9

    #2
    Additional info.
    When I added the 2 cloned hosts:
    1. The problem "Zabbix poller processes more than 75% busy" was triggered.
    2. The metric "Zabbix server: Utilization of trapper data collector processes, in %" spiked to 100%. But I have no trapper items configured in the templates, and the agents don't have any plugins.

    This is how it looked on the graphs

    [Attached image: image.png]
    Last edited by dimic; 24-04-2024, 02:25.

    • Markku
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
      • Sep 2018
      • 1781

      #3
      Do you use the same TLSPSKIdentity values on the hosts?

      Markku

      • dimic
        Junior Member
        • Sep 2010
        • 9

        #4
        Hi

        Yes, it was the same. I've also tried setting a dedicated TLSPSKIdentity for the new nodes, but the effect was the same.

        • dimic
          Junior Member
          • Sep 2010
          • 9

          #5
          Mmm, I'm not sure what was wrong, but I removed both cloned VMs from Zabbix, set a new TLSPSKIdentity on both of these servers, and added them back. After that it was working OK.
          Then I tried changing the TLSPSKIdentity back to the original values, and the problem reappeared.

          I ran `select host, tls_psk_identity, tls_psk from hosts;` against the Zabbix DB, and as far as I can see there are no different PSKs assigned to the same TLSPSKIdentity, but the problem still persists even after a Zabbix server restart.
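
The manual check of the query output can be automated by grouping the rows per identity and flagging any identity that maps to more than one PSK. This is a minimal sketch, not an official Zabbix tool: the function name `find_psk_conflicts` and the sample rows (host names, identities, PSK values) are hypothetical, and the rows are assumed to be the `(host, tls_psk_identity, tls_psk)` tuples returned by the query above.

```python
from collections import defaultdict


def find_psk_conflicts(rows):
    """Return identities that are paired with more than one distinct PSK.

    rows: iterable of (host, tls_psk_identity, tls_psk) tuples.
    Result: {identity: {psk: [hosts using that pair]}}.
    """
    by_identity = defaultdict(lambda: defaultdict(list))
    for host, identity, psk in rows:
        if not identity:
            continue  # host has no PSK identity configured
        by_identity[identity][psk].append(host)
    return {
        identity: dict(psks)
        for identity, psks in by_identity.items()
        if len(psks) > 1
    }


# Hypothetical rows for illustration only, not real data from this setup.
rows = [
    ("web-01", "PSK 001", "aa11"),
    ("web-02", "PSK 001", "aa11"),
    ("db-01", "PSK 002", "bb22"),
    ("db-02", "PSK 002", "cc33"),
    ("plain-01", "", ""),
]

print(find_psk_conflicts(rows))
```

An empty result only confirms that the server-side configuration is consistent. It says nothing about the PSK actually stored on each agent, so a cloned VM could still present an old key under a shared identity.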

          Do you know what could be wrong?
          Last edited by dimic; 25-04-2024, 02:51.
