Complete recipe for monitoring DNS and NTP on your Network

  • tokind
    Member
    • May 2007
    • 47

    #1

    Complete recipe for monitoring DNS and NTP on your Network

    THIS VERSION SUPERSEDED - SEE PHP VERSION BELOW



    I plan to put this on the Wiki. I just wanted comments and suggestions prior to doing so.

    Complete recipe for monitoring DNS and NTP on your Network
    The article assumes that you are using a single *nix Zabbix Server to monitor distributed DNS and/or NTP services running on or accessible from your network. Since I have not figured out a way to run external scripts tied to Items and Triggers from the server, I use zabbix_agentd to run these external checks. The scripts that execute the checks are BASH scripts based on the HOST (for DNS), and NTPQ (for NTP) commands.

    The scripts
    You may place these scripts wherever it is easiest for you. I placed them in /usr/local/www/data/zabbix/scripts.
    Use the command which bash to determine the path to your bash environment. In my example the path is /usr/local/bin/bash. Your system may be different.

    DNS
    Code:
    #!/usr/local/bin/bash
    #dnslookup
    #DNS lookup scripts for Zabbix monitor. Conditional return
    # of 1=success | 0=failed
    
    DNS_SERVER=$1
    HOST_QUERY=$2
    
    if [ "$(host "$HOST_QUERY" "$DNS_SERVER" | grep -c "has address")" -eq 0 ]; then
    
     #lookup failed, bad DNS lookup
     echo "0"
    
    else
    
     echo "1"
    
    fi
    NTP
    Code:
    #!/usr/local/bin/bash
    #ntptest
    #NTP test scripts for Zabbix monitor. Conditional return
    # of 1=success | 0= for failed response
    
    HOST_QUERY=$1
    
    if [ "$(ntpq -pn "$HOST_QUERY" | grep -E -c '^\*')" -eq 1 ]; then
    
     #Sync responded, OK
     echo "1"
    
    else
    
     echo "0"
    
    fi
    Be sure to set your scripts to the proper owner and use chmod +x to make them executable.

    These two lookups may take many seconds to complete and return a value. While they generally respond in less than a second for a successful query, a timeout may take more than 15 seconds. Before we complete the setup we will extend the timeout period for both zabbix_server and zabbix_agentd so that we always get some sort of response under normal circumstances.

    Note: In my initial testing with these scripts, a timeout response would fail to return any value. The result of this is that the Trigger would not trip, as there were no new samples to evaluate. The timeout message appeared in the /var/log/zabbix_agentd.log file. Increasing the agent timeout resolved this problem.
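One way to make sure a check always reports a value, even when the underlying lookup hangs, is to bound it with a time limit and print 0 on expiry. The sketch below assumes the coreutils `timeout` command is available on the agent host; `bounded_check` and the 10-second limit are illustrative, not part of the original scripts.

```shell
# bounded_check LIMIT COMMAND [ARGS...]
# Runs COMMAND under a time limit (assumes coreutils `timeout` is installed).
# Prints 1 if COMMAND exits successfully within LIMIT seconds, otherwise 0,
# so the agent always has a sample to return.
bounded_check() {
    bc_limit=$1
    shift
    if timeout "$bc_limit" "$@" >/dev/null 2>&1; then
        echo 1
    else
        echo 0
    fi
}

# Example (server address and name taken from the configuration in this post):
# bounded_check 10 sh -c 'host helpdesk 192.168.1.10 | grep -q "has address"'
```

The limit should stay below the agent's own Timeout value, otherwise the agent gives up before the wrapper can print its 0.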
    Zabbix_agentd configuration
    Edit /etc/zabbix/zabbix_agentd.conf. Set Timeout=30.

    Add the following, assuming your path to the scripts:

    Code:
    UserParameter=DNSbr1,/usr/local/www/data/zabbix/scripts/dnslookup 192.168.1.10 helpdesk
    UserParameter=DNSbr6,/usr/local/www/data/zabbix/scripts/dnslookup 192.168.6.10 helpdesk
    UserParameter=DNSbr4,/usr/local/www/data/zabbix/scripts/dnslookup 192.168.4.10 helpdesk
    UserParameter=DNSbr10,/usr/local/www/data/zabbix/scripts/dnslookup 192.168.10.10 helpdesk
    UserParameter=DNSbr2,/usr/local/www/data/zabbix/scripts/dnslookup 192.168.2.10 helpdesk
    Note: "helpdesk" is a defined hostname in DNS on my network.

    And for NTP:
    Code:
    UserParameter=NTPs1,/usr/local/www/data/zabbix/scripts/ntptest 192.168.1.68
    You may have as many such tests as you want. Just keep track of the names for when you set up your Triggers in Zabbix.

    In order to load your new agent configuration, use ps aux to find the PID of your zabbix_agentd: main process and kill it. Then start the agent again:

    Code:
    >cd /usr/local/bin
    >./zabbix_agentd
    (If you need to troubleshoot the agent process for any reason, make sure the log path is set and that the agent's user owns, and has permission to write to, /var/log/zabbix_agentd.log.)

    Assuming that you are about ready to set up triggers, you must now change the default timeout for zabbix_server. It was set to 3 (seconds) here, so when lookups failed I was getting nothing (instead of 0) in my triggers. Edit /etc/zabbix/zabbix_server.conf to set Timeout=30.

    Kill zabbix_server (sleeping...) and then use ./zabbix_server to start it again with your new values.

    Set up the Triggers
    The Host
    Time to switch to the Zabbix web interface. Log in as an administrator, then go to Configuration -> Hosts and create a host. I suggest the name "ExternalTests". I typed in a new group, "External", set Use IP address, and typed in localhost. Port 10050 (or your configured port).

    Triggers
    Go to (Configuration) Triggers and Create Trigger. Give your new Trigger a name like BR1 DNS Server and an expression like

    Code:
    {ExternalTests:DNSbr1.last(0)}=0
    I set the Severity to Warning.

    Repeat this for as many checks as you set up in the agentd configuration.

    Actions
    Now set up an Action to warn you when one of your services goes down. Go to (Configuration) Actions and Create Action. Select the Action type you want (I use Send message, and the media is a cell phone SMS service). The Source must be a Trigger. I set two Conditions:

    Host group = External
    Trigger name like DNS Server

    Set your other options as desired. Since I am sending an email or SMS message I set the subject to {TRIGGER.NAME} Problem, and the message to {TRIGGER.NAME} may be down as of {DATE}-{TIME}.

    Once created, this action will trip when any of your monitored services returns a "0".

    You may wish to make your triggers a bit smarter or a bit less sensitive depending on your environment or the load on the servers. E.g. a trigger of:

    Code:
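    {ExternalTests:DNSbr1.sum(#3)}<2
    will trip once at least two of the last three tests have failed: each check returns 1 on success, so a sum below 2 means no more than one of the three succeeded.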

    Code:
    {ExternalTests:DNSbr1.min(120)}=0
    will trip if any test in the last 120 seconds failed, since the minimum over the window drops to 0 as soon as one check returns 0. I think that last one will only trip every 120 seconds in case of an on-going failure.
    Last edited by tokind; 20-03-2008, 01:08.
  • Tenzer
    Senior Member
    • Nov 2007
    • 316

    #2
    I have made PHP versions of your scripts, and I think that my DNS check script is better:
    PHP Code:
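    #!/usr/bin/php
    <?php
        // The DNS server to check is the first command-line argument
        if(!empty($_SERVER['argv'][1]))
        {
            $ns_server = $_SERVER['argv'][1];
        } else {
            echo "You need to supply a DNS server to check. Quitting.\n";
            exit;
        }

        // Names to query, each mapped to the address it is expected to return
        $hosts = array("zabbix.com" => "85.113.250.92",
            "php.net" => "69.147.83.197");

        // Do query
        $failed = FALSE;
        foreach($hosts as $host => $ip)
        {
            $result = shell_exec("dig +time=1 +tries=1 +short ".escapeshellarg("@".$ns_server)." ".escapeshellarg($host));
            // Quote the address so that its dots match literally
            if(!preg_match('/'.preg_quote($ip, '/').'/', (string)$result))
            {
                $failed = TRUE;
            }
        }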

        if($failed)
        {
            echo "0\n";
        } else {
            echo "1\n";
        }
    ?>
    It takes the IP address or hostname of the DNS server to check, and it checks the records defined in the array. I use this feature to both check recursive queries and domains the DNS server is hosting.
    If a DNS server is not responding, the script needs only 1 second per domain it checks to find out, rather than the 10 seconds your script takes.
    I don't know how easy it would be to do the above in shell scripting, though more people could probably make use of it that way.

    One thing I noticed is that you write the path to bash as /usr/local/bin/bash. On most Linux systems the path is /bin/bash, so I think you should at least add a note about that on the page, and tell users to run "which bash" to find the path to use.


    • tokind
      Member
      • May 2007
      • 47

      #3
      I agree with you that PHP is a better platform for these scripts. I was using bash scripting mostly because I want to improve my bash scripting skills. However, these scripts are trivial, and since PHP is always present and configured on ANY Zabbix system, it will be the most appropriate platform.

      I would not check for specific IP address returns as you do in your script. If an IP address changed, I would have to edit the array. However I do appreciate the fact that your script may be used to check both local lookups and forwarded lookups; a definite advantage.

      I am having trouble with the NTP script I posted here. I will use your script as an example and re-direct this effort to PHP.

      Thank you.
      Last edited by tokind; 14-03-2008, 17:27.


      • tokind
        Member
        • May 2007
        • 47

        #4
        A Complete recipe for monitoring DNS and NTP on your Network
        The article assumes that you are using a single *nix Zabbix Server to monitor distributed DNS and/or NTP services running on or accessible from your network. Since I have not figured out a way to run external scripts tied to Items and Triggers from the server, I use zabbix_agentd to run these external checks. The scripts that execute the checks are PHP scripts based on the HOST (for DNS), and NTPQ (for NTP) commands.

        The scripts
        You may place these scripts wherever it is easiest for you. I placed them in /usr/local/www/data/zabbix/scripts.


        DNS
        Code:
        <?php
            // Number of lookups that failed
            $failures=0;
            if(!empty($_SERVER['argv'][1]))
            {
                $ns_server = $_SERVER['argv'][1];
            } else {
                echo "You need to supply a DNS server to check. Quitting.\n";
                exit;
            }

            $hosts = array("helpdesk",
                        "ns1.nmsu.edu");

            // Do query
            foreach($hosts as $host)
            {
                // Count the "has address" lines in the answer from this server
                $found = (int) shell_exec("host ".escapeshellarg($host)." ".escapeshellarg($ns_server)." | grep -c 'has address'");
                if($found == 0)
                {
                    $failures = $failures+1; // failure: no address returned
                }
            }

            // 1 = every lookup succeeded, 0 = at least one failed
            if($failures > 0)
            {
                $result=0;
            } else {
                $result=1;
            }

            echo $result;

        ?>
        (Two or more lookups may be used to test for various DNS lookup scenarios, e.g. referrals, reverse lookups.)
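The note above mentions reverse lookups. With the `host` command those produce a "domain name pointer" line rather than "has address", so the pattern to look for changes with the query type. A minimal sketch in shell, where `lookup_ok` is a hypothetical helper that reads `host` output on standard input:

```shell
# lookup_ok TYPE
# Reads `host` output on stdin and prints 1 if it contains an answer of the
# requested TYPE ("forward" or "reverse"), otherwise 0.
lookup_ok() {
    case "$1" in
        forward) lo_pattern='has address' ;;
        reverse) lo_pattern='domain name pointer' ;;
        *) echo 0; return ;;
    esac
    if grep -q "$lo_pattern"; then
        echo 1
    else
        echo 0
    fi
}

# Example (addresses are placeholders from this thread):
# host helpdesk 192.168.1.10     | lookup_ok forward
# host 192.168.1.68 192.168.1.10 | lookup_ok reverse
```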

        NTP
        Code:
        <?php
            // Define defaults
            $result=0;
            if(!empty($_SERVER['argv'][1]))
            {
                $ntp_server = $_SERVER['argv'][1];
            } else {
                echo "You need to supply an NTP server to check. Quitting.\n";
                exit;
            }

            // Do query: a peer line starting with "*" marks the selected sync source
            if((int) shell_exec("ntpq -pn ".escapeshellarg($ntp_server)." | grep -E -c '^\*'")==1)
            {
        
                $result= 1; // success
        
            } else {
        
               $result= 0; // failure
        
            }
        
            echo $result;
        
        ?>
        Be sure to set your scripts to the proper owner and use chmod +x to make them executable.

        These two lookups may take many seconds to complete and return a value. While they generally respond in less than a second for a successful query, a timeout may take more than 15 seconds. Before we complete the setup we will extend the timeout period for both zabbix_server and zabbix_agentd so that we always get some sort of response under normal circumstances.

        Note: In my initial testing with these scripts, a timeout response would fail to return any value. The result of this is that the Trigger would not trip, as there were no new samples to evaluate. The timeout message appeared in the /var/log/zabbix_agentd.log file. Increasing the agent timeout resolved this problem.
        Zabbix_agentd configuration
        Edit /etc/zabbix/zabbix_agentd.conf. Set Timeout=30.

        Add the following, assuming your path to the scripts:

        Code:
        UserParameter=DNSbr1,php /usr/local/www/data/zabbix/scripts/dnschk.php 192.168.1.10
        UserParameter=DNSbr6,php /usr/local/www/data/zabbix/scripts/dnschk.php 192.168.6.10
        UserParameter=DNSbr4,php /usr/local/www/data/zabbix/scripts/dnschk.php 192.168.4.10
        And for NTP:
        Code:
        UserParameter=NTPs1,php /usr/local/www/data/zabbix/scripts/ntpchk.php 192.168.1.68
        You may have as many such tests as you want. Just keep track of the names for when you set up your Triggers in Zabbix.
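With many servers to check, the UserParameter lines can be generated rather than typed. A rough sketch, assuming the naming scheme of `DNS` plus a site label used in the examples above; `make_userparams` and its "label address" input format are illustrative:

```shell
# make_userparams SCRIPT
# Reads "label address" pairs on stdin and prints one UserParameter line
# per pair, in the form used in zabbix_agentd.conf above.
make_userparams() {
    mu_script=$1
    while read -r mu_label mu_addr; do
        # Skip blank lines and comments
        case "$mu_label" in
            ''|'#'*) continue ;;
        esac
        printf 'UserParameter=DNS%s,php %s %s\n' "$mu_label" "$mu_script" "$mu_addr"
    done
}

# Example:
# printf 'br1 192.168.1.10\nbr6 192.168.6.10\n' |
#     make_userparams /usr/local/www/data/zabbix/scripts/dnschk.php
```

The output can be reviewed and then pasted into the agent configuration, which keeps the item keys and the server list in step.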

        In order to load your new agent configuration, use ps aux to find the PID of your zabbix_agentd: main process and kill it. Then start the agent again:

        Code:
        >cd /usr/local/bin
        >./zabbix_agentd
        (If you need to troubleshoot the agent process for any reason, make sure the log path is set and that the agent's user owns, and has permission to write to, /var/log/zabbix_agentd.log.)

        Assuming that you are about ready to set up triggers, you must now change the default timeout for zabbix_server. It was set to 3 (seconds) here, so when lookups failed I was getting nothing (instead of 0) in my triggers. Edit /etc/zabbix/zabbix_server.conf to set Timeout=30.

        Kill zabbix_server (sleeping...) and then use ./zabbix_server to start it again with your new values.

        Set up the Triggers
        The Host
        Time to switch to the Zabbix web interface. Log in as an administrator, then go to Configuration -> Hosts and create a host. I suggest the name "ExternalTests". I typed in a new group, "External", set Use IP address, and typed in localhost. Port 10050 (or your configured port).

        Triggers
        Go to (Configuration) Triggers and Create Trigger. Give your new Trigger a name like BR1 DNS Server and an expression like

        Code:
        {ExternalTests:DNSbr1.last(0)}=0
        I set the Severity to Warning.

        Repeat this for as many checks as you set up in the agentd configuration.

        Actions
        Now set up an Action to warn you when one of your services goes down. Go to (Configuration) Actions and Create Action. Select the Action type you want (I use Send message, and the media is a cell phone SMS service). The Source must be a Trigger. I set two Conditions:

        Host group = External
        Trigger name like DNS Server

        Set your other options as desired. Since I am sending an email or SMS message I set the subject to {TRIGGER.NAME} Problem, and the message to {TRIGGER.NAME} may be down as of {DATE}-{TIME}.

        Once created, this action will trip when any of your monitored services returns a "0".

        You may wish to make your triggers a bit smarter or a bit less sensitive depending on your environment or the load on the servers. E.g. a trigger of:

        Code:
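        {ExternalTests:DNSbr1.sum(#3)}<2
        will trip once at least two of the last three tests have failed: each check returns 1 on success, so a sum below 2 means no more than one of the three succeeded.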

        Code:
        {ExternalTests:DNSbr1.min(120)}=0
        will trip if any test in the last 120 seconds failed, since the minimum over the window drops to 0 as soon as one check returns 0. I think that last one will only trip every 120 seconds in case of an on-going failure.
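Because each check returns 1 on success and 0 on failure, the sum over a window counts successes, and the failures are the window size minus that sum. The following shell sketch works through that arithmetic for a few samples; `failures_in` is a hypothetical helper, not a Zabbix function.

```shell
# failures_in SAMPLE...
# Each SAMPLE is 1 (success) or 0 (failure). Prints how many failed,
# computed as (number of samples) - (sum of samples).
failures_in() {
    fi_sum=0
    fi_count=0
    for fi_sample in "$@"; do
        fi_sum=$((fi_sum + fi_sample))
        fi_count=$((fi_count + 1))
    done
    echo $((fi_count - fi_sum))
}

# failures_in 1 0 0   # sum is 1, so two of the three samples failed
# failures_in 1 1 0   # sum is 2, so one of the three samples failed
```

Two or more failures in the last three samples therefore corresponds to a sum of 1 or 0.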
        Last edited by tokind; 19-03-2008, 22:55.


        • tokind
          Member
          • May 2007
          • 47

          #5
          Now on Wiki

          Thanks for your feedback. Now featured on the Wiki:

          http://www.zabbix.com/wiki/doku.php?...n_your_network


          • georgew
            Member
            • Mar 2008
            • 50

            #6
            I have a bind9 server that has a strange failure mode. It will answer queries, but only if they are local, or in its cache.

            When I test it by hand, it is easy enough to come up with an uncached domain name to test... However if I set up a script, all of the domains in the script will end up cached by the time DNS stops working again.

            Anyone experience this DNS bug? Anyone have an idea how to create an automated test for it?

            I could create an incredibly large list of domain names, and cycle through it, but the more domains you have, the more likely you will get false positives from other servers being down.

            If I took the long list, and tested multiple servers, comparing the answer, that might work... thoughts?
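The comparison idea could be sketched roughly as follows. This is only an illustration: `same_answer` is a hypothetical helper, and it treats two servers as agreeing when they return the same non-empty set of addresses, which will not hold for names that deliberately answer differently per resolver.

```shell
# same_answer ANSWER_A ANSWER_B
# Each argument is the newline-separated address list one server returned
# (for example from `dig +short`). Prints 1 if both servers returned the
# same non-empty set of addresses, regardless of order, otherwise 0.
same_answer() {
    sa_a=$(printf '%s\n' "$1" | sort)
    sa_b=$(printf '%s\n' "$2" | sort)
    if [ -n "$sa_a" ] && [ "$sa_a" = "$sa_b" ]; then
        echo 1
    else
        echo 0
    fi
}

# Example (server addresses and name are placeholders):
# same_answer "$(dig +short @192.168.1.10 example.com)" \
#             "$(dig +short @192.168.6.10 example.com)"
```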

            Meanwhile I am working to spread the DNS load around, so that servers being attacked by the net are not the same servers used to do dns work... that seems to be helping reduce the issues...

            In the past I was able to make my dns servers self healing, they tested themselves, and restarted named if there was a problem, but this new failure mode is too tricky for my old scripts.

            I have really massive email servers, that are hit really hard by spammers, so they put a large load on the name servers, so that was part of the problem. The faster I made the mail servers, the more load the spammers would send.... the price you pay when you have 100 domains that have had the same 4000 usernames since 1994... the spammers have more mail servers than I do, and they only send one email at a time, using a particular server/ip only a few times a day... so blacklists stop very little. I can quarantine the spam itself, but I can't successfully blacklist all of their botnet mail servers. And of course my nameservers continue to see the resulting workload grow as I add more mail servers.

            George
            Last edited by georgew; 30-03-2008, 16:09.


            • Tenzer
              Senior Member
              • Nov 2007
              • 316

              #7
              Originally posted by georgew
              I have a bind9 server that has a strange failure mode. It will answer queries, but only if they are local, or in its cache.

              When I test it by hand, it is easy enough to come up with an uncached domain name to test... However if I set up a script, all of the domains in the script will end up cached by the time DNS stops working again.

              Anyone experience this DNS bug? Anyone have an idea how to create an automated test for it?
              You could rely on a catch-all domain and query a domain name like [random-number].example.com. It should return the same IP address all the time, and it would be easy to do in PHP.

              That way you should get around the caching, as long as the number is long enough. It could, for instance, just be the current Unix timestamp, so that you never query the same domain name twice.
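A minimal shell sketch of this idea, assuming a wildcard domain you control: `probe_name` and `answer_matches` are hypothetical helpers, and example.com and 192.0.2.10 are placeholders for the wildcard domain and the address it returns.

```shell
# probe_name DOMAIN [LABEL]
# Builds a name that is unlikely to be cached by prefixing DOMAIN with LABEL,
# which defaults to the current Unix timestamp.
probe_name() {
    pn_label=${2:-$(date +%s)}
    printf '%s.%s\n' "$pn_label" "$1"
}

# answer_matches EXPECTED_IP
# Reads resolver output (e.g. from `dig +short`) on stdin; prints 1 if one
# of the lines is exactly EXPECTED_IP, otherwise 0.
answer_matches() {
    if grep -qxF "$1"; then
        echo 1
    else
        echo 0
    fi
}

# Example; 192.168.1.10 stands for the DNS server under test:
# dig +time=1 +tries=1 +short @192.168.1.10 "$(probe_name example.com)" |
#     answer_matches 192.0.2.10
```

Matching the whole line keeps an answer such as 192.0.2.100 from being mistaken for 192.0.2.10.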


              • tokind
                Member
                • May 2007
                • 47

                #8
                My suggestion is a bit of a daunting project, but could go a long way toward replacing your old scripts.
                1. Create a large list of DNS lookups in a new table on your Zabbix Server.
                2. Add a quasi-random lookup from the DNS table, selecting two or more lookups (possibly of different classes, e.g. one local, two public) to your agent script.
                3. Store the results to an array and run your queries.


                With a little creative table work and a little PHP you could attain an effective mechanism not only to test, but also to send a remote command to restart named on the target server when a DNS failure is detected, say, two or three consecutive failures.
                You will have to enable active mode on the Zabbix Agent. You would also have to write a script on your server to periodically test the DNS entries in your table to clean out entries which have expired.
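The quasi-random selection in step 2 might look something like the sketch below. It swaps the database table for a plain text file with one name per line, purely to keep the example short, and it assumes GNU coreutils `shuf` is installed; `pick_lookups` is a hypothetical name.

```shell
# pick_lookups FILE COUNT
# Prints COUNT distinct names chosen at random from FILE (one name per line).
# Assumes GNU coreutils `shuf`; a flat file stands in for the DNS table.
pick_lookups() {
    shuf -n "$2" "$1"
}

# Example (file name and server address are placeholders):
# for name in $(pick_lookups /tmp/dns-names.txt 3); do
#     host "$name" 192.168.1.10 | grep -c 'has address'
# done
```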

                This would actually add a level of sophistication to my simple tests. My plate is pretty full right now, but I will add this to my to-do list.


                • eli.stair
                  Junior Member
                  • May 2006
                  • 20

                  #9
                  Complete CLI DNS tool

                  You beat me to a HOWTO on doing this with zabbix, but I'll go ahead and point everyone at a Perl script I wrote a couple years ago (and still use and maintain) which, IMO, is a fully-featured tool for monitoring DNS records.



                  I've been using this in production on a few thousand hosts, monitoring every A/PTR record to alert on ones that our MS DNS server has "lost". I call it to ensure that all production network devices have a proper matching SNMP sysName. It even supports DNS round-robin entries, checking both the total number of entries and the IPs that should be present; we use this for everything from simple RR-based load balancing to our giant Spinnaker/NetappGX servers with "large" numbers of entries in the RR.

                  I haven't fully replaced my existing system yet and am currently working on integrating this with Zabbix. The big restriction for me is the inability to use macros in item templates, but for a small installation simply specifying the arguments to the "external" call is totally functional.

                  I'll post a new detailed entry on using this when Zabbix has underlying support for item-macros so it can be used most efficiently.

                  Cheers,

                  /eli


                  • georgew
                    Member
                    • Mar 2008
                    • 50

                    #10
                    What great answers!

                    Tokind, you were reading my mind... that is what I had been thinking I would need to do, get some large set of domains, and query some subset of them with each test.

                    Eli, your solution is a great one for testing local zones, my DNS consists of a few hundred zones that live on my name servers, and several hundred that are served by web servers (and mirrored to our main servers as secondaries). I suppose a script could be written to parse our dns config and zone files, and turn that into a set of tests. We use a script to mirror our two main DNS servers (rather than using a secondary server method, so our servers are 100% redundant, rather than primary/secondary). The mirror script also copies the DNS files to a backup server, and could easily drop a copy on a zabbix server too. Every time we edit DNS, we run the mirror/backup script.

                    Tenzer, you win the gold star!!! Any off-site domain with a wildcarded A record could be used! Do you know of any? I use wildcarded dns A records in-house on a couple domains, but I honestly don't know of any such domains off-site. Such a clean and simple solution!! Why didn't I think of that??!?



                    George


                    • Tenzer
                      Senior Member
                      • Nov 2007
                      • 316

                      #11
                      Originally posted by georgew
                      Tenzer, you win the gold star!!! Any off-site domain with a wildcarded A record could be used! Do you know of any? I use wildcarded dns A records in-house on a couple domains, but I honestly don't know of any such domains off-site. Such a clean and simple solution!! Why didn't I think of that??!?
                      I'm glad I could help.
                      I just tried a couple of domain names and found that "ning.com" has a wildcard record, so you could just query 1207301681.ning.com and check that it returns 8.6.19.68.
                      Of course, with a third-party domain name the owner could change the IP address at any time, but I guess that is the cost of this method, unless you have a remotely hosted domain name of your own that you could use.


                      • OneLoveAmaru
                        Member
                        • Jan 2008
                        • 41

                        #12
                        Hey guys, I took everything you all put in this post and wrote the code to suit me. The only problem is that I'm getting the errors below in my logs, and when I check latest data, it's not getting a value at all for DNS. When I run the script at the command line it gives me a 1, though. I'm using the PHP script that maps each DNS name to its IP address, since we only use static IPs here.

                        4608:20080707:154427 Run remote command [php /etc/zabbix/alert.d/dnschk.php 192.168.2.1] Result [1] [1]...
                        4608:20080707:154427 Sending back [1]
                        4608:20080707:154427 Got SIGPIPE. Where it came from???
                        4608:20080707:154427 Process listener error: ZBX_TCP_WRITE() failed [Broken pipe]

                        It's quite strange; everything else works. I am running Debian etch 4.0 on all production servers except the 2 Zabbix servers we have. On those I'm running Ubuntu 8.04 so I could get the latest Zabbix without compiling it myself. I thought it was permissions, so I set the file to zabbix:www-data, and still no go. Yes, I have it executable, but that doesn't matter because the script is run with php in front of it anyway. The file is in /etc/zabbix/alert.d/dnschk.php
                        I also set EnableRemoteCommands=1 in the zabbix-agentd.conf on that machine but still get that error. Anybody have an idea? I have tried stopping the agent, waiting, then starting it back up. I even did a killall zabbix-agentd, which did kill them all, but on restart I get the same error.

                        Here is the line I have in the zabbix_agentd.conf:
                        UserParameter=DNS1,php /etc/zabbix/alert.d/dnschk.php 192.168.2.1

                        Any help to resolve this would be awesome, thanks guys! I also attached the template and the script I use.

                        • tokind
                          Member
                          • May 2007
                          • 47

                          #13
                          Just off the top of my head... try dropping the "\n" from the result?

                          I'm assuming that `which php' shows that you have the correct path to php?

                          Looking at your XML configuration statement: I do not understand the snmp_community, oid, and port definitions shown there. Shouldn't this be configured as a Type: ZABBIX Agent item?

                          The statement suggests that you are collecting the result from an SNMP query. Or are these just default values, not used because you did in fact select Type: ZABBIX Agent?


                          • OneLoveAmaru
                            Member
                            • Jan 2008
                            • 41

                            #14
                            The php script is running and giving correct results, as you can see from the log entries I posted

                            I can tell you port 161 is always default, why it exported an OID is beyond me. Here is how they are configured and have been configured this way since I created them. I created them from scratch, not copying anything.


                            Here is the item:


                            here is the trigger:


                            Ideas? The script is running, it's getting a value but the agent on the machine can't send it back to the server. I can telnet on port 10050 and 10051 from that box to the zabbix server, so it's not firewall. The agent can send back every other item.
                            Last edited by OneLoveAmaru; 08-07-2008, 16:31.


                            • tokind
                              Member
                              • May 2007
                              • 47

                              #15
                              Have you tried using zabbix_get from the server, to see what it returns?

                              zabbix_get -s {ip address} -k {parameter name}

                              I am not sure what to make of the log entry:

                              Process listener error: ZBX_TCP_WRITE() failed [Broken pipe]

                              You may have to look at the source code for the client to make any sense of this. Maybe zabbix_get will show you something useful.
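When running zabbix_get by hand, it can help to label what comes back. A small sketch follows; `classify_reply` is a hypothetical helper, and the only reply strings it assumes are the 0/1 values these scripts print and the agent's ZBX_NOTSUPPORTED response for a key it cannot serve.

```shell
# classify_reply REPLY
# Interprets the text returned for one of the 0/1 check items.
classify_reply() {
    case "$1" in
        1) echo "ok" ;;
        0) echo "check failed" ;;
        '') echo "no value returned" ;;
        ZBX_NOTSUPPORTED*) echo "item not supported by agent" ;;
        *) echo "unexpected: $1" ;;
    esac
}

# Example, run on the Zabbix server (address and key from the post above):
# classify_reply "$(zabbix_get -s 192.168.2.1 -k DNS1)"
```

An empty reply points at the connection or a timeout, while an unexpected string usually means the script printed something other than a bare 0 or 1.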
