After upgrade from Zabbix 6.4 to 7.0.7, zabbix_server now constantly crashes

  • marcos.della

    #1


    We are seeing the following over and over (both nodes in our HA cluster keep crashing):

    6259:20241223:231851.669 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...
    6259:20241223:231851.669 ====== Fatal information: ======
    6259:20241223:231851.669 Program counter: 0x7fe14d48fa5c
    6259:20241223:231851.669 === Registers: ===
    6259:20241223:231851.669 r8 = 0 = 0 = 0
    6259:20241223:231851.669 r9 = 7fff3de7c63f = 140734231987775 = 140734231987775
    6259:20241223:231851.669 r10 = 7fff3de7c0f7 = 140734231986423 = 140734231986423
    6259:20241223:231851.669 r11 = 6 = 6 = 6
    6259:20241223:231851.669 r12 = 7b = 123 = 123
    6259:20241223:231851.669 r13 = 0 = 0 = 0
    6259:20241223:231851.669 r14 = 7fff3de7c860 = 140734231988320 = 140734231988320
    6259:20241223:231851.669 r15 = 55b4f857a9e0 = 94235748968928 = 94235748968928
    6259:20241223:231851.669 rdi = 0 = 0 = 0
    6259:20241223:231851.669 rsi = 7b = 123 = 123
    6259:20241223:231851.669 rbp = 55b4f857cb30 = 94235748977456 = 94235748977456
    6259:20241223:231851.669 rbx = 7b = 123 = 123
    6259:20241223:231851.669 rdx = 2 = 2 = 2
    6259:20241223:231851.669 rax = 0 = 0 = 0
    6259:20241223:231851.669 rcx = 1 = 1 = 1
    6259:20241223:231851.669 rsp = 7fff3de7c3c8 = 140734231987144 = 140734231987144
    6259:20241223:231851.669 rip = 7fe14d48fa5c = 140605640997468 = 140605640997468
    6259:20241223:231851.669 efl = 10283 = 66179 = 66179
    6259:20241223:231851.669 csgsfs = 2b000000000033 = 12103423998558259 = 12103423998558259
    6259:20241223:231851.669 err = 4 = 4 = 4
    6259:20241223:231851.670 trapno = e = 14 = 14
    6259:20241223:231851.670 oldmask = 0 = 0 = 0
    6259:20241223:231851.670 cr2 = 0 = 0 = 0
    6259:20241223:231851.670 === Backtrace: ===
    6259:20241223:231851.672 18: /usr/sbin/zabbix_server: lld worker #4 started(zbx_backtrace+0x41) [0x55b4f630f001]
    6259:20241223:231851.672 17: /usr/sbin/zabbix_server: lld worker #4 started(zbx_log_fatal_info+0x2b5) [0x55b4f630f3d5]
    6259:20241223:231851.672 16: /usr/sbin/zabbix_server: lld worker #4 started(+0x10a616) [0x55b4f630f616]
    6259:20241223:231851.672 15: /lib64/libpthread.so.0(+0x12d10) [0x7fe14f6acd10]
    6259:20241223:231851.672 14: /lib64/libc.so.6(+0xbba5c) [0x7fe14d48fa5c]
    6259:20241223:231851.672 13: /lib64/libc.so.6(+0x9ffa2) [0x7fe14d473fa2]
    6259:20241223:231851.672 12: /usr/sbin/zabbix_server: lld worker #4 started(zbx_strmatch_condition+0x45) [0x55b4f62d9f25]
    6259:20241223:231851.672 11: /usr/sbin/zabbix_server: lld worker #4 started(lld_override_trigger+0xbe) [0x55b4f64d138e]
    6259:20241223:231851.672 10: /usr/sbin/zabbix_server: lld worker #4 started(+0x29ed19) [0x55b4f64a3d19]
    6259:20241223:231851.672 9: /usr/sbin/zabbix_server: lld worker #4 started(lld_update_triggers+0x151d) [0x55b4f64a60bd]
    6259:20241223:231851.672 8: /usr/sbin/zabbix_server: lld worker #4 started(lld_process_discovery_rule+0x144f) [0x55b4f64d387f]
    6259:20241223:231851.672 7: /usr/sbin/zabbix_server: lld worker #4 started(lld_worker_thread+0x5e4) [0x55b4f64d46e4]
    6259:20241223:231851.672 6: /usr/sbin/zabbix_server: lld worker #4 started(zbx_thread_start+0x27) [0x55b4f63ce037]
    6259:20241223:231851.672 5: /usr/sbin/zabbix_server: lld worker #4 started(+0xd9e7a) [0x55b4f62dee7a]
    6259:20241223:231851.672 4: /usr/sbin/zabbix_server: lld worker #4 started(MAIN_ZABBIX_ENTRY+0x105e) [0x55b4f65df2fe]
    6259:20241223:231851.672 3: /usr/sbin/zabbix_server: lld worker #4 started(zbx_daemon_start+0x10d) [0x55b4f630fc3d]
    6259:20241223:231851.672 2: /usr/sbin/zabbix_server: lld worker #4 started(main+0x3ea) [0x55b4f62c839a]
    6259:20241223:231851.672 1: /lib64/libc.so.6(__libc_start_main+0xe5) [0x7fe14d40e7e5]
    6259:20241223:231851.672 0: /usr/sbin/zabbix_server: lld worker #4 started(_start+0x2e) [0x55b4f62ceece]
    ...
    6259:20241223:231851.674 ================================
    6259:20241223:231851.674 Please consider attaching a disassembly listing to your bug report.
    6259:20241223:231851.674 This listing can be produced with, e.g., objdump -DSswx zabbix_server.
    6259:20241223:231851.674 ================================
    6246:20241223:231851.729 One child process died (PID:6259,exitcode/signal:1). Exiting ...
    6247:20241223:231851.729 HA manager has been paused
    zabbix_server [6246]: Error waiting for process with PID 6259: [10] No child processes
    6247:20241223:231851.759 HA manager has been stopped
    6246:20241223:231851.775 syncing history data...
    ...

    No idea how to get out of this loop. We've stopped the second node and are trying to run only node 1, but that doesn't change anything. Unfortunately this is our production environment, and your staff member who was on site a week earlier isn't available either.

    Any clues? We were in the process of upgrading all of our proxies from 6.4.20 to 7.0.7 (we're about halfway through) when this started happening, and now we're dead in the water.

    Marcos
  • marcos.della

    #2
    Further information: the crash continues on both servers. Each one runs just long enough to pull in some data before crashing again. We get the following right before each crash:

    409700:20241228:004916.865 failed to accept an incoming connection: from 100.64.2.61: unspecified certificate verification error: TLS handshake set result code to 5:
    409698:20241228:004916.865 failed to accept an incoming connection: from 10.45.8.250: unspecified certificate verification error: TLS handshake set result code to 5:
    409698:20241228:004916.866 failed to accept an incoming connection: from 100.64.4.42: unspecified certificate verification error: TLS handshake set result code to 5:
    409700:20241228:004916.866 failed to accept an incoming connection: from 100.64.2.8: unspecified certificate verification error: TLS handshake set result code to 5:
    409698:20241228:004916.866 failed to accept an incoming connection: from 100.64.4.109: unspecified certificate verification error: TLS handshake set result code to 5:
    409726:20241228:004916.933 server #68 started [connector worker #3]
    409728:20241228:004916.934 server #70 started [connector worker #5]
    409724:20241228:004916.939 server #66 started [connector worker #1]
    409698:20241228:004917.125 Proxy "zabbix-vpc5-proxy1.-------.com" version 6.4.19 is outdated, only data collection and remote execution is available with server version 7.0.7.
    409698:20241228:004917.134 Proxy "zabbix-vpc0-proxy1.-------.com" version 6.4.19 is outdated, only data collection and remote execution is available with server version 7.0.7.
    409701:20241228:004917.323 Proxy "zabbix-us-west-2-proxy1.-------.com" version 6.4.20 is outdated, only data collection and remote execution is available with server version 7.0.7.
    409674:20241228:004917.815 forced reloading of the snmp cache on [poller #3]
    409677:20241228:004917.815 forced reloading of the snmp cache on [poller #6]
    409681:20241228:004917.819 forced reloading of the snmp cache on [poller #10]
    409680:20241228:004917.820 forced reloading of the snmp cache on [poller #9]
    409672:20241228:004917.826 forced reloading of the snmp cache on [poller #1]
    409673:20241228:004917.827 forced reloading of the snmp cache on [poller #2]
    409675:20241228:004917.827 forced reloading of the snmp cache on [poller #4]
    409676:20241228:004917.828 forced reloading of the snmp cache on [poller #5]
    409678:20241228:004917.832 forced reloading of the snmp cache on [poller #7]
    409679:20241228:004917.834 forced reloading of the snmp cache on [poller #8]
    409699:20241228:004917.965 sending configuration data to proxy "zabbix-blue-proxy1.-------.com" at "100.64.2.57", datalen 8594740, bytes 704174 with compression ratio 12.2
    409697:20241228:004918.093 sending configuration data to proxy "zabbix-zone3-proxy1.-------.com" at "10.34.65.91", datalen 10768909, bytes 804992 with compression ratio 13.4
    409656:20241228:004918.127 Got signal [signal:11(SIGSEGV),reason:1,refaddr:(nil)]. Crashing ...
    409656:20241228:004918.127 ====== Fatal information: ======
    409656:20241228:004918.127 Program counter: 0x7f426b8faa5c


    • tim.mooney

      #3
      Multiple other people have reported similar issues with 7.0.7 on the forums. For all of them, it's one of the LLD workers that is crashing (for you it's worker #4).

      Just based upon the stack frames, I would bet that whatever Zabbix bug is being triggered is happening in 'zbx_strmatch_condition' (for you stack frame #12), though the underlying problem could be earlier in the call stack.

      You could downgrade to 7.0.6 until this issue is resolved.

      Alternately, if your environment is fairly mature, could you disable low-level discovery (LLD) for a while, until the problem is fixed and you can upgrade to a fixed version?
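
      If clicking through the frontend to disable every discovery rule is impractical, the Zabbix JSON-RPC API can pause them in bulk. Here's a minimal sketch, assuming a hypothetical URL, API token, and rule itemid; "discoveryrule.update" with status=1 disables a rule, and status=0 re-enables it. The leading echo makes it a dry run:

```shell
# Assumed values -- replace with your own.
ZABBIX_URL="https://zabbix.example.com/api_jsonrpc.php"  # hypothetical frontend URL
API_TOKEN="your-api-token"                               # create under Users -> API tokens
ITEMID="23456"                                           # hypothetical LLD rule itemid

# discoveryrule.update with status=1 disables the rule (status=0 re-enables it).
PAYLOAD="{\"jsonrpc\":\"2.0\",\"method\":\"discoveryrule.update\",\"params\":{\"itemid\":\"$ITEMID\",\"status\":1},\"id\":1}"

# Remove the leading 'echo' to actually send the request.
echo curl -s -X POST "$ZABBIX_URL" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_TOKEN" \
  -d "$PAYLOAD"
```

      You'd find the itemids first with "discoveryrule.get", and keep that list so you can re-enable the same rules once you're on a fixed version.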


      • Sara.Art

        #4
        Hello! We have a similar error, but ours is proxy related: with the 7.0.7 server I must disable the IPMI pollers (StartIPMIPollers) to avoid the crash.


        • marcos.della

          #5
          Originally posted by tim.mooney
          Multiple other people have reported similar issues with 7.0.7 on the forums. For all of them, it's one of the LLD workers that is crashing (for you it's worker #4).

          Just based upon the stack frames, I would bet that whatever Zabbix bug is being triggered is happening in 'zbx_strmatch_condition' (for you stack frame #12), though the underlying problem could be earlier in the call stack.

          You could downgrade to 7.0.6 until this issue is resolved.

          Alternately, if your environment is fairly mature, could you disable low-level discovery (LLD) for a while, until the problem is fixed and you can upgrade to a fixed version?
          As there are several packages involved on the server side, I am unclear how to "downgrade" to 7.0.6 on a Rocky 8 environment without breaking many things. Additionally, all of our proxies are at 7.0.7 (there are > 20 of them) which then would be incompatible with the 7.0.6 version?

          I do admit that having our entire infrastructure basically unmonitored at the moment (due to the constant crashes) is problematic, to say the least, but going backwards is worse (we just upgraded from 6.4.20 to 7.0.7, so it's not even a clean restore).


          • Sara.Art

            #6
            Ours is a VM, so I reverted to the last snapshot and rolled back to 7.0.6 (I tried to downgrade the packages, but with no success). Looking at Bugzilla, the problem is known, so I hope the fix will be released soon.

            Happy New Year despite the bugs!


            • tim.mooney

              #7
              Originally posted by marcos.della

              As there are several packages involved on the server side, I am unclear how to "downgrade" to 7.0.6 on a Rocky 8 environment without breaking many things. Additionally, all of our proxies are at 7.0.7 (there are > 20 of them) which then would be incompatible with the 7.0.6 version?

              I do admit that having our entire infrastructure basically unmonitored at the moment (due to constant crashes) is problematic to say the least, but playing with going backwards is worse (we just upgraded from 6.4.20 to 7.0.7 so its not even a clean restore).
              Downgrade your proxies first. Slightly older proxies will work fine with a server that's a micro-version newer, and starting with the proxies has the benefit that it's easier than the server.

              You can use "yum" or "dnf" in each of these commands, it amounts to the same thing.

              On each of the proxies:
              1. get a list of all the zabbix packages:
                Code:
                sudo yum list installed 'zabbix*'
              2. run the "yum downgrade" command with each of those packages listed on the command line:
                Code:
                sudo yum downgrade zabbix-package1 zabbix-package2 zabbix-package3 ...
                Replace "zabbix-package1" etc. with whatever zabbix packages you actually have on your proxies.

              Assuming all your proxies are running the same set of packages, you only have to get the list of packages on one of them, and then run the downgrade command in an ssh loop for the rest of them. If you use configuration management, it should be even easier to get the proxies to downgrade the zabbix-related packages.
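
              A sketch of that ssh loop, with placeholder hostnames and packages (take the real package list from "sudo yum list installed 'zabbix*'" on one proxy); the leading echo keeps it a dry run:

```shell
# Placeholder hostnames and packages -- substitute your own.
PROXIES="zabbix-proxy1.example.com zabbix-proxy2.example.com"
PKGS="zabbix-proxy-mysql zabbix-agent"

for host in $PROXIES; do
  # Remove the leading 'echo' to actually run the downgrade on each host.
  echo ssh "$host" "sudo yum downgrade -y $PKGS && sudo systemctl restart zabbix-proxy"
done
```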

              Once you're done downgrading ALL your proxies, you repeat the exact same process for your server. The only difference is that you probably have more zabbix-* packages on your server than you do on your proxies.

              NOTE: You do not have to downgrade your agent or agent2 packages, if you don't want to. You should be fine running zabbix-agent-7.0.7 on your proxies with 7.0.6 versions of the other zabbix packages. Compatibility within an X.Y series is generally really good and the agent protocol should never change at a micro version, so in a pinch you could run 7.0.7 agents with a 7.0.6 server and proxies. If you prefer keeping everything in sync, though, including the agent or agent2 is fine.

              NOTE2: You likely wouldn't even need to downgrade your proxies first: you probably could run 7.0.7 proxies with a 7.0.6 server. I wouldn't do that unless you're really desperate, though; it's a lot safer to downgrade them too, before tackling your server.


              • marcos.della

                #8
                So we basically took the zabbix_server 7.0.6 binary and copied it over on our two master servers, since we had JUST upgraded from 6.4.20 to 7.0.7 and had no valid downgrade path. That, and the fact that there are over 20 proxies, made getting this working again an almost impossible task. The path of least resistance for the time being was to just drop in the 7.0.6 binaries (the packages still report 7.0.7) and hope there's a fix in the near term to get us out of this mess. Needless to say, this was really bad timing for us: our senior management is making decisions on future directions next week, and the last two weeks have been a disaster, right after one of the support engineers was here on site for a week.

                So, crossing fingers that this mishmash carries us through to 7.0.8 so we can breathe a sigh of relief (the engineering work on our side has been way too many hours for what should have been a straightforward upgrade).

                Marcos


                • tim.mooney

                  #9
                  Originally posted by marcos.della
                  So we basically took the zabbix_server 7.0.6 binary and copied it over on our two master servers, since we had JUST upgraded from 6.4.20 to 7.0.7 and had no valid downgrade path. That, and the fact that there are over 20 proxies, made getting this working again an almost impossible task. The path of least resistance for the time being was to just drop in the 7.0.6 binaries (the packages still report 7.0.7) and hope there's a fix in the near term to get us out of this mess. Needless to say, this was really bad timing for us: our senior management is making decisions on future directions next week, and the last two weeks have been a disaster, right after one of the support engineers was here on site for a week.

                  So, crossing fingers that this mishmash carries us through to 7.0.8 so we can breathe a sigh of relief (the engineering work on our side has been way too many hours for what should have been a straightforward upgrade).

                  Marcos
                  Every environment is different, so if that procedure works for your environment, great!

                  With 20 proxies and 2 servers, if you're not using a software repository (either Zabbix upstream or a local copy that you maintain) for your Zabbix packages, you're really working against the features of your OS.

                  As long as you have the necessary 7.0.6 packages in a package repository that your systems can access, it doesn't matter that your systems never ran version 7.0.6 previously. A yum/dnf "downgrade" command with the zabbix package names would downgrade 7.0.7 packages to whatever appears to be the next lower version.
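
                  For what it's worth, yum/dnf also accept explicit name-version arguments, so the downgrade can be pinned to 7.0.6 rather than relying on "next lower". A sketch with hypothetical package names (list yours with "rpm -qa 'zabbix*'"); the echo keeps it a dry run:

```shell
# Hypothetical server package set -- check what you actually have installed.
VERSION="7.0.6"
PKGS="zabbix-server-mysql zabbix-web zabbix-web-mysql"

for p in $PKGS; do
  # Remove 'echo' to perform the downgrade for real.
  echo "sudo dnf downgrade -y ${p}-${VERSION}"
done
```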

                  I hope your workaround gets you to a working environment.


                  • marcos.della

                    #10
                    Just as a side note: we don't always have control over which version of Zabbix gets installed, only the series. For instance, on a fleet of OPNSense firewalls, the available plugins are locked to versions managed by the appliance maintainers, so we can't really pick a point release. Currently we can only choose from the 6.0, 6.4, or 7.0 series for plugins (including the proxies).

