Zabbix Server Version 2.2.5 - Zabbix Server not running in GUI

  • tammy
    Junior Member
    • Jun 2015
    • 5

    #1


    Hello,

    We have been having this issue for a few weeks now. We get the problem in the GUI that says: "Zabbix Server is not running: the information displayed may not be current"

    The GUI and Server are on the same server. The DB is a separate host.

    If I telnet to localhost on port 10051 while the problem is occurring, I cannot connect to the port at all. Once I restart the zabbix-server process, it clears up and connections work again.
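
    For reference, that telnet check can be scripted so every failure gets a timestamp. A minimal sketch (the check_port helper is made up for illustration; 10051 is the default zabbix_server ListenPort):

    ```shell
    # Probe a TCP port and log the result with a timestamp, so the
    # flapping can be correlated with other events. check_port is an
    # illustrative helper, not part of Zabbix.
    check_port() {
      local host=$1 port=$2
      if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$(date -Is) $host:$port open"
      else
        echo "$(date -Is) $host:$port NOT reachable"
      fi
    }
    check_port 127.0.0.1 10051
    ```

    Run from cron every minute, this would show exactly when the trapper port stops answering.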

    This is happening at random, anywhere from twice a day to once every few days.

    This is an Amazon EC2 host and it uses Amazon RDS for the DB. I have the server tuned in line with the Zabbix conference recommendations and the DB seems to be ok.

    The problem is intermittent and only started a few weeks ago. All the answers I found on here cover errors that are persistent or present since the initial install, so as far as I can tell they don't apply here.

    Any help would be greatly appreciated!
  • Parasin
    Member
    Zabbix Certified Specialist
    • Dec 2014
    • 53

    #2
    Is there any other service that may be trying to use that port?

    Also, make sure that no SSH connection (a forwarded tunnel, for example) is trying to use that port!

    Comment

    • zabanist
      Junior Member
      • Jun 2015
      • 16

      #3
      Well, at least you know that the cause is real -- Zabbix is not answering on 10051. As usual, I'd tail my way through /var/log/zabbix/zabbix_server and try to see whether the trapper processes are having a problem. It's possible various kernel parameters need tuning as well, which may be evident from dmesg or lsof -n output.





      Comment

      • tammy
        Junior Member
        • Jun 2015
        • 5

        #4
        Thanks. I've been trying to go through various options like you've listed. I will continue looking.

        Comment

        • tammy
          Junior Member
          • Jun 2015
          • 5

          #5
          As an update to this: I have found that when the problem occurs, netstat shows a high number of connections in CLOSE_WAIT. TIME_WAIT is often high anyway, but during the issue TIME_WAIT drops by about half while CLOSE_WAIT goes way up. Normally there are maybe 1 or 2 CLOSE_WAIT and 500-plus TIME_WAIT; during the issue TIME_WAIT falls to maybe 100-200 and CLOSE_WAIT climbs to about 200.
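
          For anyone following along, the per-state counts can be pulled out of netstat in one pass. A sketch, using sample lines in place of live output:

          ```shell
          # Tally TCP connections by state (last column of netstat output).
          # 'sample' stands in for live data; on the host you would run:
          #   netstat -ant | awk '/^tcp/ {c[$NF]++} END {for (s in c) print s, c[s]}'
          sample='tcp    15020      0 127.0.0.1:10051   127.0.0.1:40000   CLOSE_WAIT
tcp        0      0 10.0.0.5:10051    10.0.0.9:51234    CLOSE_WAIT
tcp        0      0 10.0.0.5:10051    10.0.0.9:51235    TIME_WAIT'
          printf '%s\n' "$sample" | awk '{c[$NF]++} END {for (s in c) print s, c[s]}'
          ```

          Wrapping the live one-liner in watch -n5 makes the CLOSE_WAIT spike easy to catch as it happens.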

          Comment

          • tammy
            Junior Member
            • Jun 2015
            • 5

            #6
            I've been able to capture some additional networking detail on the high CLOSE_WAIT counts. Most of the CLOSE_WAIT connections have no PID associated with them, so they appear orphaned. Some of the connections are to the local IP on port 10051 and some are on the loopback. Does that make a difference? And why would Zabbix be leaving a ton of orphaned CLOSE_WAIT connections behind?

            Comment

            • tammy
              Junior Member
              • Jun 2015
              • 5

              #7
              I tuned my host to a shorter TCP keepalive and set orphan retries to 3 (it was previously set to never kill orphaned connections, and I changed it to 1 today after the event happened again last night). What I notice with the ~150 CLOSE_WAIT connections is that the Recv-Q is really high on some of them - 15K kind of high. Everything I've been reading says this is a problem at the application level: the client has sent its FIN, and TCP is still waiting for the server application to pull the remaining data from that socket and close it.
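
              As a sysctl fragment, that tuning would look something like this (the exact values here are illustrative examples, not recommendations):

              ```
              # Illustrative /etc/sysctl.d fragment; values are examples only.
              net.ipv4.tcp_keepalive_time = 300     # probe idle connections sooner (default 7200)
              net.ipv4.tcp_keepalive_intvl = 30
              net.ipv4.tcp_keepalive_probes = 5
              net.ipv4.tcp_orphan_retries = 1       # give up on orphaned connections quickly
              ```

              Worth noting: keepalive tuning only makes the kernel give up on dead peers faster. With Recv-Q backed up, the root cause is still the application not reading the socket, so these settings treat the symptom rather than the cause.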

              Any help would be greatly appreciated!

              Comment

              • zabanist
                Junior Member
                • Jun 2015
                • 16

                #8
                15K on a CLOSE_WAIT?

                Are these all to/fro localhost?

                I think you need to do the following things:

                Set DebugLevel=4 (or 5) on the Zabbix server.

                Potentially launch zabbix_server under strace (e.g. "strace zabbix_server 2> myoutfile.txt").

                Between the two of these we'll hopefully get a good clue as to what is breaking. I'd be very curious whether some security measure is in place to prevent excessive connections...
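
                Concretely, the two steps might look like this (paths illustrative; the set_debug_level helper is made up to show the edit, demonstrated here on a here-string rather than the real config file):

                ```shell
                # Step 1: bump DebugLevel in zabbix_server.conf (level 4 logs every
                # trapper connection). Shown on sample input; on the host, run the
                # sed against /etc/zabbix/zabbix_server.conf and restart the server.
                set_debug_level() {  # usage: set_debug_level LEVEL < config
                  sed -E "s/^#? ?DebugLevel=.*/DebugLevel=$1/"
                }
                printf 'ListenPort=10051\nDebugLevel=3\n' | set_debug_level 4

                # Step 2: start the server under strace, following forked children,
                # with timestamps, output captured to a file for later reading:
                #   strace -f -tt -o /tmp/zabbix_strace.out zabbix_server
                ```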

                Comment

                • maheshme973
                  Junior Member
                  • Jul 2011
                  • 9

                  #9
                  We are having the exact same problem with version 2.0.9. It started about 3 months ago.

                  I was rather hoping an upgrade would solve the problem, but tammy's post dashed our hopes.

                  I ran strace on one of the zabbix_server processes whose connection was stuck in CLOSE_WAIT and got the following output. The process seems to be doing something with semaphores repeatedly that I am not able to understand.

                  Also, we are running this in our own datacenter - so I don't think AWS is a factor.
                  #############################################
                  strace -p 24263
                  Process 24263 attached - interrupt to quit
                  restart_syscall(<... resuming interrupted call ...>) = 0
                  semop(2818055, {{2, -1, SEM_UNDO}}, 1) = 0
                  semop(2818055, {{2, 1, SEM_UNDO}}, 1) = 0
                  rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
                  rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x3e8ac326a0}, 8) = 0
                  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
                  nanosleep({1, 0}, 0x7fffca066750) = 0
                  semop(2818055, {{2, -1, SEM_UNDO}}, 1) = 0
                  semop(2818055, {{2, 1, SEM_UNDO}}, 1) = 0
                  rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
                  rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x3e8ac326a0}, 8) = 0
                  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
                  nanosleep({1, 0}, 0x7fffca066750) = 0
                  [... the same six-call pattern (semop, rt_sigprocmask, rt_sigaction, nanosleep) repeats ...]
                  semop(2818055, {{2, -1, SEM_UNDO}}, 1^C <unfinished ...>
                  Process 24263 detached
                  #############################################

                  Comment

                  • zabanist
                    Junior Member
                    • Jun 2015
                    • 16

                    #10
                    Reading this doesn't provide me with a huge clue. It is clearly in a loop, and may be waiting for another process... regardless, none of the calls seem to be reporting an error.

                    I would be much more interested to see what is happening in the parent process. Again, please set the Zabbix server to debug mode and, if you have the time and inclination (and the processing power and storage space), run "strace -f" on the zabbix_server parent process when you start it up and watch what happens to all the children. Redirecting stderr to a file for later reference would be good. Zabbix server tends to do a pretty good job of describing problems.

                    So, not a lot to go on from this strace.

                    That's all I've got from what I've seen in this thread. You may want to check the limits on user memory and semaphore consumption, especially for non-root users. Again, dmesg may hint at something. If you are running SELinux, you could look at the audit logs for denials that could be causing problems.
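
                    A quick, Linux-specific way to eyeball those limits (zabbix_server coordinates its processes through SysV semaphores, which is what the semop calls in the strace above are operating on):

                    ```shell
                    # Linux-specific checks for the limits mentioned above.
                    # Kernel semaphore limits: SEMMSL SEMMNS SEMOPM SEMMNI
                    cat /proc/sys/kernel/sem
                    # Existing SysV semaphore arrays (zabbix_server owns one):
                    ipcs -s 2>/dev/null || true
                    # Recent kernel messages (OOM kills, conntrack overflows, fd limits):
                    dmesg 2>/dev/null | tail -n 20
                    ```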

                    Comment

                    • maheshme973
                      Junior Member
                      • Jul 2011
                      • 9

                      #11
                      Thanks for the inputs zabanist.

                      Unfortunately this is a critical system and I don't have the luxury of trying out different things.

                      Last night we put in an emergency change: We now have the DB, the Zabbix Server and the UI functions working on different VMs.

                      We were planning to do this for a long time. A recurrence of the problem yesterday forced us to expedite the split.

                      We have not seen the problem since last evening - but I have a sneaking suspicion that I will see the issue again.

                      Comment

                      • maheshme973
                        Junior Member
                        • Jul 2011
                        • 9

                        #12
                        Tammy,
                        For what it's worth, I threw more resources at the problem. Here is what I found:

                        The UI is now running on a separate server but points to the same DB instance as the Zabbix server. As part of the change, we also upgraded the MySQL disks to SSD.

                        For the last 2 days we have not seen any issues. It is difficult to tell whether the UI load separation did the trick or whether it was just the SSDs. If it's the latter, I guess we will start seeing issues again when the workload increases further.

                        There is one point of interest, though - the UI server shows a constant CPU load of about 2.5-3.0, and it has 12 vCPU cores! So the UI seems to be a bit heavy on resources. You may want to try moving your UI load off too.

                        Comment
