Debug Autodiscovery

  • perun.84
    Member
    • May 2016
    • 73

    #1

    Debug Autodiscovery

    How can I debug the discovery process? I've configured an SNMP autodiscovery rule for a huge subnet, and a lot of hosts were discovered. But discovery suddenly stopped, even though there are still lots of undiscovered hosts. How can I confirm that the discovery process is still working? Is there a separate log for discovery? Thanks in advance.
  • Linwood
    Senior Member
    • Dec 2013
    • 398

    #2
    Depending on the flavor of unix, you can tell if it's running by looking here:

    ps -F -A | grep -i discovery    # <<< get PIDs

    The process strings should be self-explanatory, and if you want to see which host it is on (say it is scanning a subnet) you can do:

    strace -p 12345

    where 12345 is the PID. I've found these two together to be the simplest way to tell when it is done, provided there is adequate delay between reruns, as it won't show you whether it is on run 1, 2, 3, etc.
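
    If you just want to jump straight to watching one of them, a one-liner along these lines (assuming pgrep is available and you have permission to trace the zabbix processes; 'discover' is matched against the process title) combines the two steps:

    sudo strace -p "$(pgrep -f discover | head -n 1)"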

    Debugging is harder. You can use the runtime control of zabbix_server (if you are on ... I think it was 2.4.6 or later) to increase the debug level of just the discoverer processes, and you get way more detail than you want in the server log if you set it to 4 (with 3 you get way less than you need). It's helpful to limit the number of discoveries running in that case.
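
    As a rough sketch (syntax per the manual; each invocation bumps the level by one from your configured DebugLevel, and remember the matching decrease when you are done):

    $ zabbix_server --runtime-control log_level_increase=discoverer
    $ zabbix_server --runtime-control log_level_decrease=discoverer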

    If you are doing wide subnets (say more than a class C), I find it helpful to divide them up into chunks no bigger than a class C; I even wrote a (rather safe) routine to clone a class B into 256 class C discoveries so they run more in parallel. If you have SNMP and other protocol polling (not just a ping) in there, large numbers of candidate addresses can take a long time to run. I've also found it helpful at times to discover with ping only, then clone that discovery into one rule per host found and re-discover all the other protocols just for those IPs. If you had (say) 65534 possible addresses and 150 hosts, it is MUCH faster to poll the 150 specifically for services than to poll all 65534 and let each protocol time out.
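
    As a minimal sketch of that kind of split (assuming a 10.1.0.0/16 network; the prefix is only an example), something like this prints the 256 class C ranges, each of which would go into its own discovery rule:

    for third in $(seq 0 255); do
        echo "10.1.${third}.0-255"
    done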

    • perun.84
      Member
      • May 2016
      • 73

      #3
      First of all, thanks a lot for the answer.

      I'm doing SNMP v2 discovery on a very large network. I'm planning to install Zabbix proxy servers for parts of the network. I added six /16 subnets for discovery; it's tough to divide them into /24 subnets (I don't know if it is even possible). What about the number of discoverers? How many of them should I start if the Zabbix server has 16 GB of RAM and 4 vCPUs?

      • Linwood
        Senior Member
        • Dec 2013
        • 398

        #4
        Originally posted by perun.84
        First of all, thanks a lot for the answer.

        I'm doing SNMP v2 discovery on a very large network. I'm planning to install Zabbix proxy servers for parts of the network. I added six /16 subnets for discovery; it's tough to divide them into /24 subnets (I don't know if it is even possible). What about the number of discoverers? How many of them should I start if the Zabbix server has 16 GB of RAM and 4 vCPUs?
        I have no numerical guidance, but generally I found they put very little load on the system; I ran around 150 at a time. With large numbers they can add up to a significant network load, though, if you are doing the discovery over a low-bandwidth WAN.
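
        For reference, the number of discoverer processes is set with StartDiscoverers in zabbix_server.conf and needs a server restart to take effect; the value below is only an illustration (the default is 1):

        StartDiscoverers=50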

        But I'm unclear on what you mean by a /16 not dividing into /24s.

        In the best scenario with a /16, you know what the third octet might be. Say you have 10.1.x.y as a subnet, but you actually know that x is only 1, 2, 3 or 4 and you haven't used the rest. You could then do 4 separate /24 searches, and those run in parallel. If you search using a single 1-4 range, it will use only one poller.

        So far as I know (from observation, not from looking at the code), each discovery range is executed sequentially. This means if you discover:

        10.1.0-255.0-255

        then it will test all 65536 entries one at a time, and for the vast majority it has to wait through the entire timeout period, as nothing will be there. But if you actually had random usage throughout the whole range and instead set up 256 entries:

        10.1.0.0-255
        10.1.1.0-255
        10.1.2.0-255
        etc.

        Then these run 256 in parallel (limited by the number of pollers) and will be done MUCH faster, but with the same result. The only time this doesn't work is if your discovery actions match on very specific rule names, as opposed to service names or responses; then you would need to clone all the actions as well.

        I mention this in case you have what I ran into at the last site -- they had set up very sparse /16 networks, no one could tell me much of anything about what was in use, and they wanted me to just hunt it all down. Over WANs. That's why I started decomposing /16s into /24s for the discovery scan, so I could finish before my attention span wandered.
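
        If you would rather script that decomposition than create the rules by hand, something along these lines against the Zabbix JSON-RPC API should do it; the URL, auth token and rule names are assumptions, and for brevity each rule gets only an ICMP ping check (dcheck type 12):

        URL="https://zabbix.example.com/api_jsonrpc.php"    # assumption: your frontend URL
        AUTH="<token returned by user.login>"               # assumption: a valid API session
        for third in $(seq 0 255); do
            curl -s -H 'Content-Type: application/json-rpc' "$URL" -d '{
                "jsonrpc": "2.0",
                "method": "drule.create",
                "params": {
                    "name": "ping sweep 10.1.'"$third"'.0/24",
                    "iprange": "10.1.'"$third"'.0-255",
                    "dchecks": [ { "type": "12" } ]
                },
                "auth": "'"$AUTH"'",
                "id": 1
            }'
            echo
        done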

        • perun.84
          Member
          • May 2016
          • 73

          #5
          I have set up a lot of /24 subnets now. With strace I've noticed the following: after checking 25-30 addresses, the discovery process restarts (goes back to the first address). Strace says:

          sendmsg(8, {msg_name(16)={sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("10.1.0.27")}, msg_iov(1)=[{"0)\2\1\1\4\6xxxxx\240\34\2\4[\204r\212\2\1\0\2\1\0000\0160\f\6"..., 43}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 43
          select(9, [8], NULL, NULL, {3, 999992}) = ? ERESTARTNOHAND (To be restarted if no handler)
          --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=21163, si_uid=996} ---
          close(3) = 0
          exit_group(1) = ?
          +++ exited with 1 +++

          After that it goes back to the 10.1.0.1 address. :-/

          • Linwood
            Senior Member
            • Dec 2013
            • 398

            #6
            That seems odd. I have never explored failure conditions to see what it does, but I would assume that means it terminated unexpectedly. I'd suggest running one at a time, in debug mode for discovery only, and seeing if there are errors.

            Is Zabbix itself happy? Do the internal check items all look good for caches and other processes?
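
            The internal item keys I would check for that (assuming the stock internal checks) are along these lines:

            zabbix[process,discoverer,avg,busy]
            zabbix[rcache,buffer,pfree]
            zabbix[wcache,trend,pfree]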

            • perun.84
              Member
              • May 2016
              • 73

              #7
              I don't know how to put discovery into debug mode...

              • perun.84
                Member
                • May 2016
                • 73

                #8
                I found a way to debug it. I get the following message:

                5403:20160527:100008.236 Got signal [signal:15(SIGTERM),sender_pid:5128,sender_uid:996, reason:0]. Exiting ...

                And after that, the discovery process is restarted.

                • perun.84
                  Member
                  • May 2016
                  • 73

                  #9
                  I found the PID of the discovery process and captured its logs. The problem was a low value for TrendCacheSize. After I increased it, discovery seems to be OK. Now there are no more process restarts. Thanks.
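
                  For anyone hitting the same thing: that parameter lives in zabbix_server.conf and needs a server restart to take effect; the value below is just an example (the default is 4M):

                  TrendCacheSize=32M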

                  • Linwood
                    Senior Member
                    • Dec 2013
                    • 398

                    #10
                    Glad you got it.

                    For those curious, the nice feature in later Zabbix versions that allows debug control at runtime is described here:

                    Runtime loglevel changing

                    An example command:

                    $ zabbix_server --runtime-control log_level_increase=trapper

                    Here "trapper" is an example and the full list is under the internal items checks in the manual but includes for example some common ones:

                    discoverer
                    poller
                    http poller
                    icmp pinger
                    snmp trapper

                    Quote the ones with spaces in them. Obviously, remember to do a "decrease" afterwards.

                    The tough one is that LLD is included in poller, so it's really hard to debug those without a lot of noise from regular polling activity. I've even at times disabled every host but one to limit the noise, then turned them back on after debugging. It would sure be nice to have a debug filter for "host=xxxx".
