False Positive Trigger Problem with ssh Service

  • bberger

    #1

    Hi,

    I have a small Zabbix installation monitoring about 25 Linux computers running a variety of distributions and versions. I am experiencing a problem with SSH reporting: on 3 of my computers, Zabbix shows SSH as down. The message is:

    SSH server is down on [computer name]

    This is a message from the service trigger expression:

    {net.tcp.service[ssh].last(0)}=0

    In each case the corresponding process trigger is not being activated:

    {proc.num[sshd].last(0)}<1

    In addition, investigating each of the 3 computers shows sshd running, and each computer accepts SSH connections correctly. I have no other false positives on these or any of my other Linux computers.

    1. Has anyone seen similar behavior and found a solution?

    2. Does anyone know exactly what the net.tcp.service check looks for? Does it check the port and, if so, what response does it expect? Or does it check whether the service is running, e.g. by grepping for ssh in the output of service --status-all?

    So far I've tried:
    -- compiling the agent natively on the 3 malfunctioning computers
    -- comparing OS versions and settings
    -- turning off iptables and SELinux

    Thanks in advance.
  • tzn

    #2
    I'm suffering from the same issue on several hosts. We are testing Zabbix as a replacement for our Nagios installation (2500+ hosts, 30000+ checks), and on some hosts we are getting false positives from net.tcp.service[ssh].
    I have ruled out a custom kernel (grsec, etc.), the agent version, and so on. Where should I look next?

    • richlv

      #3
      proc.num returns the number of currently running processes matching the given name; the default triggers check that value.
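To make the comparison concrete, here is a minimal Python sketch of a process count in the spirit of proc.num[sshd]. It is an assumption-based approximation for Linux, not the agent's implementation: it counts /proc/<pid>/comm entries equal to the given name, and count_processes is a hypothetical helper name. A process count only shows that an sshd process exists; it says nothing about whether sshd answers on port 22, which is why the process trigger and the service trigger can disagree.

```python
import os


def count_processes(name: str, proc_root: str = "/proc") -> int:
    """Count processes whose /proc/<pid>/comm equals `name` (Linux only, hypothetical helper)."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return 0  # no /proc (e.g. non-Linux system): nothing to count
    count = 0
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            path = os.path.join(proc_root, entry, "comm")
            with open(path, encoding="utf-8", errors="replace") as f:
                if f.read().strip() == name:
                    count += 1
        except OSError:
            continue  # process exited between listdir() and open(), or is unreadable
    return count


if __name__ == "__main__":
    n = count_processes("sshd")
    # Same condition as the default trigger {proc.num[sshd].last(0)}<1
    print(f"sshd processes: {n}", "PROBLEM" if n < 1 else "OK")
```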

      net.tcp.service connects to the port (from the agent) and checks for a correct response.

      If that fails, there can be multiple reasons, including network issues or an overloaded target host.

      Is the check failing constantly? Is sshd listening on all interfaces (including localhost)?
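A rough Python sketch of an SSH port check in the spirit of net.tcp.service[ssh]. Assumptions, not taken from the agent's source: it connects to 127.0.0.1:22, reads what the server sends first, and reports 1 only if a received line starts with "SSH-", the prefix of an SSH identification string such as the OpenSSH banner shown later in this thread. check_ssh and is_ssh_banner are hypothetical names.

```python
import socket


def is_ssh_banner(data: bytes) -> bool:
    """True if any received line starts with the SSH identification prefix "SSH-"."""
    # RFC 4253 lets a server send other lines before its version string.
    return any(line.startswith(b"SSH-") for line in data.splitlines())


def check_ssh(host: str = "127.0.0.1", port: int = 22, timeout: float = 3.0) -> int:
    """Return 1 if an SSH banner arrives, 0 otherwise, mimicking the item's 0/1 value."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # A single recv() is enough for a sketch; b"" means the server
            # closed the connection without sending anything.
            data = sock.recv(255)
    except OSError:  # refused, unreachable, or timed out
        return 0
    return 1 if is_ssh_banner(data) else 0
```

Note that a server which accepts the connection and then closes it without sending anything produces an empty read, so this sketch returns 0 even though sshd is running and a process count would look healthy.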

      • tzn

        #4
        Originally posted by richlv
        Is the check failing constantly? Is sshd listening on all interfaces (including localhost)?

        Network problems are not the cause; running the check locally reports the same status:
        Code:
        root@sXXXXX:/etc# time sudo -u zabbix /usr/sbin/zabbix_agentd -t net.tcp.service[ssh]
        net.tcp.service[]                             [u|0]
        
        real    0m5.061s
        user    0m0.004s
        sys     0m0.000s

        The check also takes a long time to finish (about 5 seconds).

        sshd is listening on both the external address and localhost:
        Code:
        root@sXXXXX:/etc# netstat -nltp |grep ssh
        tcp        0      0 10.10.x.x:22         0.0.0.0:*               LISTEN      4119/sshd       
        tcp        0      0 127.0.0.1:22            0.0.0.0:*               LISTEN      4119/sshd

        The host is also not busy:
        Code:
        root@sXXXXX:/etc# w
         11:40:27 up 7 days, 23:16,  1 user,  load average: 0.20, 0.30, 0.26

        Running the agent under strace gives the following output:
        Code:
        11:52:31.750369 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
        11:52:31.750416 connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
        11:52:31.750555 read(3, "", 5)          = 0
        11:52:36.793393 write(3, "0\n", 2)      = 2
        11:52:36.793484 close(3)                = 0
        11:52:36.793537 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
        11:52:36.793598 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe562115000
        11:52:36.793658 write(1, "net.tcp.service[]               "..., 52net.tcp.service[]                             [u|0]
        ) = 52
        11:52:36.793728 exit_group(0)           = ?
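
        The check takes about 5 seconds. In the trace, connect() succeeds immediately; the delay falls between the start of read(), which returns 0 bytes (the server closes the connection without sending its banner), and the agent writing its result. The agent connects to 127.0.0.1, which is the default address when the item key does not specify one. Connecting manually shows one significant difference: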
        Code:
        root@sXXXXX:/etc# nc 91.121.73.47 22
        SSH-2.0-OpenSSH_5.1p1 Debian-6ubuntu2
        
        Protocol mismatch.
        root@sXXXXX:/etc# nc 127.0.0.1 22
        root@sXXXXX:/etc#
        Last edited by tzn; 25-04-2011, 12:16.

        • tzn

          #5
          Problem solved

          DenyHosts was the problem: localhost had been blacklisted for SSH in /etc/hosts.deny.
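A quick way to spot this is to scan /etc/hosts.deny for rules that match sshd (or ALL) together with a loopback client. The Python sketch below is a deliberately simplified reading of the hosts_access(5) "daemon_list : client_list" format; local_sshd_denies is a hypothetical helper, and it ignores EXCEPT clauses, IPv6 clients, backslash continuations and /etc/hosts.allow, which is consulted first and can grant access before hosts.deny is checked.

```python
import re

LOOPBACK_V4 = "127.0.0.1"


def _client_matches_loopback(token: str) -> bool:
    # Exact address, the name "localhost", the ALL wildcard, or a network
    # prefix such as "127." (a token ending in a dot matches by prefix).
    if token in ("ALL", "localhost", LOOPBACK_V4):
        return True
    return token.endswith(".") and LOOPBACK_V4.startswith(token)


def local_sshd_denies(text: str, daemon: str = "sshd") -> list[str]:
    """Return the hosts.deny lines that deny `daemon` to a loopback client (simplified)."""
    hits = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        daemons_part, rest = line.split(":", 1)
        clients_part = rest.split(":", 1)[0]  # ignore an optional third field
        daemons = re.split(r"[\s,]+", daemons_part.strip())
        clients = re.split(r"[\s,]+", clients_part.strip())
        if (daemon in daemons or "ALL" in daemons) and any(
            _client_matches_loopback(c) for c in clients
        ):
            hits.append(raw.strip())
    return hits
```

For example, passing it the contents of /etc/hosts.deny would list a rule such as sshd: 127.0.0.1, the kind of entry that makes local SSH connections close immediately.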

          • hgomez

            #6
            I started seeing the same problem on some hosts after upgrading to Zabbix 2.0.6.

            I didn't have problems with 2.0.2 to 2.0.5, but now I very often see:

            Trigger: SSH server is down on mymach1.myorg.com
            Trigger status: PROBLEM
            Trigger severity: Average
            Trigger URL:

            Item values:

            1. SSH server (mymach1.myorg.com:net.tcp.service[ssh]): 0
            2. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
            3. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*

            And one minute later:

            Trigger: SSH server is down on mymach1.myorg.com
            Trigger status: OK
            Trigger severity: Average
            Trigger URL:

            Item values:

            1. SSH server (mymach1.myorg.com:net.tcp.service[ssh]): 1
            2. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*
            3. *UNKNOWN* (*UNKNOWN*:*UNKNOWN*): *UNKNOWN*


            All systems are running Zabbix 2.0.6, and I'm using passive checks.

            Has anyone had the same problem with 2.0.6?

            • Heilig

              #7
              Check the trigger expression. Here is the latest article on the subject:
              Zabbix trigger expressions provide an incredibly flexible way of defining problem conditions. If you can express your problem using plain English or any other human language, there is a great chance it could be represented using triggers. I’ve noticed that even experienced Zabbix users are not always aware of the true power of triggers. The […]

              • hgomez

                #8
                Interesting

                Would you recommend using TRIGGER.VALUE or min()?

                • Heilig

                  #9
                  In this situation I recommend using the min() function.

                  • hgomez

                    #10
                    Originally posted by Heilig
                    In this situation I recommend using the min() function.
                    I chose last(1m) to ensure the alert is triggered only if SSH has been down for 1 minute.
                    Since the item is polled every 30 s, that should cover two samples, if I'm right.

                    • Heilig

                      #11
                      No, the last() function isn't good in this situation: the trigger will flap every time a "0" value is received. In my previous post I made a mistake: min() is also better avoided for items that return only two values (0 or 1).
                      What you described corresponds to max(1m)=0, and that is what I recommend.
                      If the service keeps flapping, you can make the trigger less sensitive by increasing the time interval, but it is better to find and eliminate the cause of this behaviour.
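The difference between the three functions can be illustrated with a small Python simulation of a 0/1 item polled every 30 s. This is a simplified model, not Zabbix's trigger evaluator: a "1m" function is assumed to see the values whose timestamps fall in the 60 seconds ending at evaluation time, and each trigger has the form "<function>=0". problem and values_in_window are hypothetical names.

```python
def values_in_window(samples, now, seconds):
    """Values whose timestamp lies in the window (now - seconds, now]."""
    return [value for ts, value in samples if now - seconds < ts <= now]


def problem(samples, now, func, seconds=60):
    """True if the hypothetical trigger '<func>(...)=0' is in PROBLEM state at `now`.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by timestamp.
    """
    if func == "last":
        value = [v for ts, v in samples if ts <= now][-1]  # newest value only
    else:
        reducers = {"min": min, "max": max}
        value = reducers[func](values_in_window(samples, now, seconds))
    return value == 0


# SSH answers every 30 s poll except the one at t=60.
glitch = [(0, 1), (30, 1), (60, 0), (90, 1), (120, 1)]
for func in ("last", "min", "max"):
    print(func, [problem(glitch, now, func) for now in (30, 60, 90, 120)])
# last [False, True, False, False]
# min [False, True, True, False]
# max [False, False, False, False]
```

In this model a single failed poll makes last() fire once and min(1m) fire for two evaluations, while max(1m)=0 only fires after two consecutive failed polls.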

                      • hgomez

                        #12
                        Originally posted by Heilig
                        What you described corresponds to max(1m)=0, and that is what I recommend.
                        I finally used:

                        net.tcp.service[ssh].min(1m)

                        and for now it seems to prevent the glitches.

                        I will try max:

                        net.tcp.service[ssh].max(1m)

                        • rondeniable

                          #13
                          RE: CentOS 6

                          In my case, I had two issues.

                          First, I found that localhost was not set in /etc/hosts (*yeah, I know*).

                          Second, I did not have localhost set as a ListenAddress in sshd_config.

                          I added this line to /etc/ssh/sshd_config and restarted.
                          ListenAddress 0.0.0.0
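A Python sketch for checking whether sshd_config's ListenAddress directives cover 127.0.0.1, the address the agent connects to by default. Simplifying assumptions: Include files are not followed, keyword and value are whitespace-separated, IPv6 addresses and hostnames other than localhost are not resolved, and no ListenAddress at all means sshd listens on all local addresses (its documented default). listen_addresses and covers_loopback are hypothetical helpers.

```python
import ipaddress

LOOPBACK = ipaddress.ip_address("127.0.0.1")
IPV4_ANY = ipaddress.ip_address("0.0.0.0")


def listen_addresses(config_text: str) -> list[str]:
    """Address part of every 'ListenAddress' line (keyword is case-insensitive)."""
    addrs = []
    for raw in config_text.splitlines():
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0].lower() == "listenaddress":
            addr = fields[1]
            if addr.count(":") == 1:  # "address:port" or "host:port" (IPv4 style)
                addr = addr.split(":", 1)[0]
            addrs.append(addr)
    return addrs


def covers_loopback(config_text: str) -> bool:
    """True if sshd, per this config, should accept connections on 127.0.0.1."""
    addrs = listen_addresses(config_text)
    if not addrs:
        return True  # no ListenAddress: sshd listens on all local addresses
    for addr in addrs:
        if addr == "localhost":
            return True
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # other hostnames are not resolved in this sketch
        if ip in (LOOPBACK, IPV4_ANY):
            return True
    return False
```

With ListenAddress 0.0.0.0, as in this post, covers_loopback returns True; with only an external address such as 10.10.1.2 it returns False.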

                          Hope that helps

                          • rubendob

                            #14
                            Solved!

                            Originally posted by tzn
                            denyhosts was the problem. localhost was blacklisted for ssh in /etc/hosts.deny
                            Hey

                            Thanks, folks, it was exactly the same issue here.

                            Great job.
