near-constant "first network error, wait for 15 seconds"

  • Markku
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
    • Sep 2018
    • 1782

    #16
    resolv.conf points to 127.0.0.x when using systemd-resolved, right? Did you restart zabbix-server after any of the DNS changes? (Asking because I have no idea whether libevent caches some settings when the zabbix-server service first starts.)

    Based on your tests above at least the caching doesn't work in this case. (Edit: rather, the libevent DNS resolution is not affected by the caching)

    https://libevent.org/doc/dns_8h.html has a description of how the DNS resolution works.

    Capturing and analyzing the DNS traffic on the server might also give some insight into what happens.

    Markku
    Last edited by Markku; 10-08-2024, 09:18.


    • ddrucker
      Member
      • Feb 2019
      • 35

      #17
      HAH! Well, as they say "There are only two hard things in Computer Science: cache invalidation and naming things."

      YES, restarting zabbix-server fixed the issue. It looks like it does cache resolv.conf.

      I've got to say I hate this trend of applications handling DNS lookup themselves instead of asking the OS to do it (looking at you, every web browser).


      • Markku
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
        • Sep 2018
        • 1782

        #18
        I thought it was two problems: cache invalidation, naming things, and off-by-one errors...

        Yeah, I like the fact that the Zabbix people have considered optimizing DNS (by using an existing lib and not trying to reinvent the wheel). It's just that this new async way is not well documented yet: apparently any DNS change requires a zabbix-server (or -proxy) restart, because resolv.conf seems to be read only when the event loop is started (or something like that, judging from the libevent link above).

        To recap, did you disable systemd-resolved, or did you keep using it and just restart the zabbix-server process?

        Markku


        • ddrucker
          Member
          • Feb 2019
          • 35

          #19
          The underlying problem was that Zabbix's internal DNS resolver doesn't handle even slightly (a hundred or so instead of tens of milliseconds) delayed responses from upstream, and treats such a slow response as if it were NXDOMAIN.
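For comparison, the getaddrinfo-style API does report these two outcomes separately. The sketch below is only an illustration in Python (not Zabbix or libevent code): `classify_lookup_failure` and `resolve` are hypothetical helper names, and the `resolver` parameter exists only so the logic can be exercised without a real DNS server.

```python
import socket


def classify_lookup_failure(exc):
    """Map a getaddrinfo error to a coarse category."""
    # EAI_AGAIN: temporary failure, e.g. the nameserver did not answer in time.
    if exc.errno == socket.EAI_AGAIN:
        return "retry"
    # EAI_NONAME: the name does not resolve (this is where NXDOMAIN ends up).
    if exc.errno == socket.EAI_NONAME:
        return "no-such-host"
    return "other"


def resolve(host, resolver=socket.getaddrinfo):
    """Return (status, addresses); status is 'ok' or a failure category."""
    try:
        infos = resolver(host, None)
    except socket.gaierror as exc:
        return classify_lookup_failure(exc), []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return "ok", sorted({info[4][0] for info in infos})
```

A caller could then retry on "retry" and only mark the host as unresolvable on "no-such-host".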

          Installing and activating systemd-resolved should have fixed the problem, by caching results locally, but Zabbix didn't actually use it once installed because it had cached the nameserver IPs.

          Restarting zabbix-server caused it to reload the (now local) nameserver IP from resolv.conf, fixing the problem.

          This is what happens when you don't use the OS-provided APIs that everyone has been using for the last 45 years (first gethostbyname and then getaddrinfo)...
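The stale-nameserver effect described in this post can be modelled in a few lines. This is a simplified Python sketch of the behaviour reported in the thread, not libevent's actual code: `SnapshotResolverConfig` is a hypothetical class, and 192.0.2.53 is a documentation placeholder for the original upstream resolver.

```python
import os
import tempfile


def parse_nameservers(text):
    """Return the addresses from the 'nameserver' lines of resolv.conf text."""
    servers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


class SnapshotResolverConfig:
    """Reads the file once; later edits are ignored until reload() is called."""

    def __init__(self, path="/etc/resolv.conf"):
        self.path = path
        self.reload()

    def reload(self):
        with open(self.path) as fh:
            self.nameservers = parse_nameservers(fh.read())


def demo():
    """Show that editing the file does not change an already-loaded snapshot."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "resolv.conf")
        with open(path, "w") as fh:
            fh.write("nameserver 192.0.2.53\n")
        cfg = SnapshotResolverConfig(path)   # "process start"
        with open(path, "w") as fh:
            fh.write("nameserver 127.0.0.53\n")  # systemd-resolved installed later
        before_reload = list(cfg.nameservers)
        cfg.reload()                          # stands in for a process restart
        return before_reload, cfg.nameservers
```

In this model, `demo()` keeps returning the old address until `reload()` runs, which corresponds to the restart that fixed the issue here.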


          • Markku
            Senior Member
            Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
            • Sep 2018
            • 1782

            #20
            This sounds like a fun case of labbing, thanks for the ideas.

            My experience in C programming is currently limited to using the Wireshark dissector C APIs, but you seem knowledgeable enough about POSIX C APIs to be able to criticize the new Zabbix 7.0 implementation of DNS resolution in asynchronous code. Can you give a rough idea of how, in your opinion, the blocking (as far as I understand them) gethostbyname etc. calls should be used in async code, instead of using libevent or similar libraries? Mind you, 7.0 (and the async way of working in Zabbix) is still new, and if there is room for improvement, all ideas might be considered.

            Markku


            • ddrucker
              Member
              • Feb 2019
              • 35

              #21
              I'm definitely not an expert on async code, but I think it would be reasonable to use the getaddrinfo API - even if it is blocking - and cache the result for the TTL (which is typically between minutes and hours - whereas currently zabbix is re-querying every few seconds). And just like that you don't have to have your own DNS client! That's what the OS is for!
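One common way to use a blocking getaddrinfo from async code is to hand the call to a worker thread so the event loop keeps running. The following is a minimal Python sketch of that pattern, offered as an illustration only (Zabbix itself is written in C and uses libevent); `resolve_async` and `resolve_many` are hypothetical names.

```python
import asyncio
import socket


async def resolve_async(host, port=None):
    """Run the blocking OS resolver in the default thread pool."""
    loop = asyncio.get_running_loop()
    # socket.getaddrinfo blocks, so it runs in a worker thread, not on the loop.
    infos = await loop.run_in_executor(None, socket.getaddrinfo, host, port)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


async def resolve_many(hosts):
    """Resolve several names concurrently; failures are returned, not raised."""
    results = await asyncio.gather(
        *(resolve_async(h) for h in hosts), return_exceptions=True
    )
    return dict(zip(hosts, results))
```

For example, `asyncio.run(resolve_many(["127.0.0.1", "127.0.0.2"]))` resolves both literals without touching DNS. The trade-off is one thread per in-flight lookup, which is the cost that libraries like libevent's evdns avoid.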


              • Markku
                Senior Member
                Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                • Sep 2018
                • 1782

                #22
                But if the app is going to cache the result of a DNS query, it would mean building its own in-app DNS client, right? Otherwise nothing would use the in-app cache.

                I see Zabbix uses libevent's evdns_getaddrinfo(), which "Make a non-blocking getaddrinfo request [...]".

                To still verify, is this a correct understanding:
                - First you saw that an unreliable/slow DNS server was not useful for Zabbix (at this point resolv.conf had the actual DNS resolver address configured)
                - So you installed systemd-resolved to implement client-side caching (= resolv.conf was now changed to have 127.0.0.53 or something else localhost), but did not yet restart zabbix-server service
                - Problems continued, you tried things but eventually got back to using systemd-resolved again
                - Then you restarted zabbix-server service and that fixed the issue (= libevent re-read resolv.conf and started using the systemd-resolved-provided cache)

                Or did I misunderstand something? I'm trying to understand the sequence of events.

                Markku


                • ddrucker
                  Member
                  • Feb 2019
                  • 35

                  #23
                  Oh. I just realized that, indeed, there's no way to get the TTL from the OS (the possibility was discussed on the libc-alpha list but doesn't appear to have gone anywhere). Which makes my idea impossible. I was assuming the API call would return IP and TTL, which would enable a very simple function - not an entire DNS client - that just did getaddrinfo and saved the result to a dict (name -> (ip,expire time)) and returned cached results until the expire time.
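For reference, the "dict (name -> (ip, expire time))" idea would look roughly like the Python sketch below. Because getaddrinfo does not return the record's TTL, the sketch has to assume a fixed lifetime, which is exactly the limitation noted above; `FixedLifetimeCache`, its `lifetime` default, and the injectable `resolver` and `clock` parameters are all hypothetical.

```python
import socket
import time


class FixedLifetimeCache:
    """Cache getaddrinfo results for an assumed lifetime (not the real DNS TTL)."""

    def __init__(self, lifetime=60.0, resolver=socket.getaddrinfo,
                 clock=time.monotonic):
        self.lifetime = lifetime
        self.resolver = resolver
        self.clock = clock
        self._entries = {}  # name -> (addresses, expire_time)

    def lookup(self, name):
        now = self.clock()
        cached = self._entries.get(name)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh: no resolver call
        infos = self.resolver(name, None)
        # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
        addresses = sorted({info[4][0] for info in infos})
        self._entries[name] = (addresses, now + self.lifetime)
        return addresses
```

A lifetime that is too long serves stale addresses after a record changes, and one that is too short brings back the frequent re-querying, so without the TTL any value is a guess.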

                  Your understanding of my sequence of events is correct.

                  I do think it is confusing that libevent caches resolv.conf forever.


                  • Markku
                    Senior Member
                    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                    • Sep 2018
                    • 1782

                    #24
                    Thanks for confirming. I agree that ignoring any changes in resolv.conf after the app startup is confusing, as it is unusual. I'll look around and open an issue to get it documented somehow.

                    Markku


                    • Markku
                      Senior Member
                      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                      • Sep 2018
                      • 1782

                      #25
                      This thread got Zabbix's attention: https://support.zabbix.com/browse/ZBX-25025

                      Markku
