Zabbix server dies when database is unavailable

  • nmail_uk
    Member
    • May 2009
    • 65

    #1

    Zabbix server dies when database is unavailable

    Just logged into my Zabbix server and noticed it was not running, so I tail'd the log file, and found it was because the database server couldn't be contacted:

    Code:
    3976:20090520:152654 [Z3001] Connection to database 'zabbix' failed: [0] could not translate host name "sql-p1.networkmail.eu" to address: Temporary failure in name resolution
    3976:20090520:152654 [Z3005] Query failed: [0] Result is NULL [select oid from pg_type where typname = 'bytea']
    3962:20090520:152654 One child process died. Exiting ...
    3962:20090520:152801 [Z3001] Connection to database 'zabbix' failed: [0] could not translate host name "sql-p1.networkmail.eu" to address: Temporary failure in name resolution
    3962:20090520:152801 [Z3005] Query failed: [0] Result is NULL [select oid from pg_type where typname = 'bytea']
    This looks to be a temporary issue with the name service (the DNS servers are provided by my hosting company). Wouldn't it be better if Zabbix didn't die, and stored the result temporarily until the database server was back up?

    Also, after trying to restart the server, I get the following error in the log file:

    Code:
    /opt/nmail/zabbix/sbin/zabbix_server [2589]: Semaphore [0] error in semctl(SETVAL)
    /opt/nmail/zabbix/sbin/zabbix_server [2589]: Unable to create mutex for log file
    Any ideas what this means? I deleted the log file, restarted the service and it started up OK.

    I've then also got a couple of other errors for one of my custom checks -

    Code:
      2606:20090521:235907 Item [Adelaide:bind.qcount] error: Type of received value [1324021] is not suitable for value type [Numeric (integer 64bit)]
      2607:20090522:000352 Item [Hamilton:bind.qcount] error: Type of received value [378221] is not suitable for value type [Numeric (integer 64bit)]
    How are 1324021 and 378221 not valid integers? This check worked fine before it crashed.

    Any ideas would be appreciated.

    Many thanks,
    Andy
  • Calimero
    Senior Member
    • Nov 2006
    • 481

    #2
    Originally posted by nmail_uk
    This looks to be a temporary issue with the name service (the DNS servers are provided by my hosting company). Wouldn't it be better if Zabbix didn't die, and stored the result temporarily until the database server was back up?
    Well... if you remove zabbix_server's storage system, you can't expect it to work. And creating an extra storage system (flat files?) would probably be cumbersome... just to "correct" a problem that is outside Zabbix.

    As far as I remember (we don't use it), zabbix_server can notify one of the user groups before shutting down when the DB is gone. See configuration > General > Others.

    Anyway, what we do here is have a watchdog (we use a simple cronjob, but you could use monit or something similar) restart zabbix_server if anything goes wrong. And if that fails too, monitoring will tell us (we have two zabbix_servers that monitor each other, among other things).
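
A cron-driven watchdog of the kind described above could look roughly like the following. This is only a minimal sketch, not the poster's actual job: the PID file location and the idea of passing the restart action in as a callable are assumptions made for illustration, and the real paths depend on how zabbix_server was installed.

```python
import os
from pathlib import Path
from typing import Callable, Optional

# Hypothetical location; the real PID file path depends on the local install.
PID_FILE = Path("/var/run/zabbix_server.pid")


def read_pid(pid_file: Path) -> Optional[int]:
    """Return the PID stored in pid_file, or None if it is missing or malformed."""
    try:
        return int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None


def is_running(pid: Optional[int]) -> bool:
    """Check whether a process with this PID exists (signal 0 delivers nothing)."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def watchdog(pid_file: Path, restart: Callable[[], None]) -> bool:
    """Call restart() if the server is not running.

    Returns True when a restart was attempted, False when the server was alive.
    """
    if is_running(read_pid(pid_file)):
        return False
    restart()
    return True
```

Run from cron every few minutes, `watchdog(PID_FILE, restart)` would be given a `restart` callable that invokes whatever start script the installation uses. A stale PID file can point at an unrelated process that reused the number, so a production watchdog would probably also check the process name.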


    Originally posted by nmail_uk
    Also, after trying to restart the server, I get the following error in the log file:

    Code:
    /opt/nmail/zabbix/sbin/zabbix_server [2589]: Semaphore [0] error in semctl(SETVAL)
    /opt/nmail/zabbix/sbin/zabbix_server [2589]: Unable to create mutex for log file
    Any ideas what this means? I deleted the log file, restarted the service and it started up OK.
    Sometimes, when zabbix_server crashes, its semaphores aren't cleaned up properly and may prevent you from restarting it. You may then have to clean up Zabbix's dangling semaphores (see man ipcs) before you can restart. That happened to me once or twice.
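
To make the semaphore cleanup a little more concrete, here is a small sketch that finds semaphore sets owned by a given user in the output of `ipcs -s`. It assumes the column layout printed by the Linux (util-linux) version of `ipcs` (key, semid, owner, perms, nsems) and assumes the server runs as a user named `zabbix`; both may differ on other systems.

```python
import subprocess
from typing import List


def semaphores_owned_by(ipcs_output: str, owner: str) -> List[int]:
    """Extract the ids of semaphore sets owned by `owner` from `ipcs -s` text."""
    semids = []
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows are assumed to look like: key semid owner perms nsems
        if len(fields) >= 5 and fields[0].startswith("0x") and fields[1].isdigit():
            if fields[2] == owner:
                semids.append(int(fields[1]))
    return semids


def list_semaphores(owner: str = "zabbix") -> List[int]:
    """Run `ipcs -s` and return the semaphore set ids owned by `owner`."""
    result = subprocess.run(
        ["ipcs", "-s"], capture_output=True, text=True, check=True
    )
    return semaphores_owned_by(result.stdout, owner)
```

Each id returned can then be removed with `ipcrm -s <semid>`, but only after confirming that no zabbix_server process is still running, since removing semaphores from a live server would break it.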

    Originally posted by nmail_uk
    I've then also got a couple of other errors for one of my custom checks -

    Code:
      2606:20090521:235907 Item [Adelaide:bind.qcount] error: Type of received value [1324021] is not suitable for value type [Numeric (integer 64bit)]
      2607:20090522:000352 Item [Hamilton:bind.qcount] error: Type of received value [378221] is not suitable for value type [Numeric (integer 64bit)]
    How are 1324021 and 378221 not valid integers? This check worked fine before it crashed.

    Any ideas would be appreciated.
    Err... I admit this one is strange. Did the checks recover?

    • richlv
      Senior Member
      Zabbix Certified Trainer
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Oct 2005
      • 3112

      #3
      Originally posted by nmail_uk
      Wouldn't it be better if Zabbix didn't die, and stored the result temporarily until the database server was back up?
      which zabbix version is that ? i might be wrong, but i think latest versions might do a small bit of caching before dying.
      Originally posted by nmail_uk
      How are 1324021 and 378221 not valid integers? This check worked fine before it crashed.
      if the items still do that, check returned data for things like trailing newlines (cr/lf ones especially are known to cause problems). you can telnet to agent port and enter item key to see exact data returned by the agent.
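
As an illustration of the trailing-newline check suggested above, the sketch below inspects a raw value captured by hand from the agent. It is not Zabbix's own validation logic, only a way to make hidden characters visible and to test whether the bytes are a plain unsigned 64-bit integer; the function names are made up for this example.

```python
UINT64_MAX = 2**64 - 1


def describe_raw_value(raw: bytes) -> str:
    """Return a printable form of the exact bytes, so CR/LF characters show up."""
    return repr(raw)


def is_clean_uint64(raw: bytes) -> bool:
    """True only if raw consists solely of ASCII digits and fits in 64 bits."""
    return raw.isdigit() and int(raw) <= UINT64_MAX
```

For example, `is_clean_uint64(b"1324021")` is true, while the same digits followed by `\r\n` fail the check, and `describe_raw_value` shows the extra bytes explicitly.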
      Zabbix 3.0 Network Monitoring book

      • nmail_uk
        Member
        • May 2009
        • 65

        #4
        @Calimero:

        True, this is outside of Zabbix's control. An e-mail message warning of a database failure would be an ideal solution for me. I've just found and activated the option, so hopefully that will give us a warning if/when it happens again.

        I was thinking of something like a temporary caching mechanism. For example, Nagios writes all check results to a spool folder before applying them to the system's storage, so if something goes wrong, it can try to re-process the result after a delay. Might something like that be more useful?
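
The spool idea can be sketched in a few lines. This is a generic illustration of the pattern, not how Nagios or Zabbix implement it: the file naming, the JSON record format, and the `store` callback standing in for the database write are all assumptions.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Callable


def spool_result(spool_dir: Path, item: str, value: str) -> Path:
    """Write one check result to its own file in the spool directory."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    record = {"item": item, "value": value, "clock": time.time()}
    # Write under a temporary name, then rename, so a reader never sees
    # a half-written result file.
    fd, tmp_name = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
    final = Path(tmp_name).with_suffix(".result")
    os.replace(tmp_name, final)
    return final


def flush_spool(spool_dir: Path, store: Callable[[dict], None]) -> int:
    """Try to store every spooled result; keep the files if storing fails."""
    stored = 0
    for path in sorted(spool_dir.glob("*.result")):
        record = json.loads(path.read_text())
        try:
            store(record)
        except Exception:
            break  # storage is still unavailable; retry on the next pass
        path.unlink()  # delete only after a successful store
        stored += 1
    return stored
```

While the database is unreachable, `store` raises and the files simply stay in the spool. Once it is back, the next `flush_spool` pass writes them out, using the `clock` field recorded at collection time rather than the time of the flush.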

        The root cause of the database failure is that this server is using our hosting company's DNS servers. These will be changed to our own caching nameservers, so hopefully we won't see this too often again.

        I have two identical boxes and, when I get around to setting it up, each box will monitor the other and attempt to restart zabbix_server if it dies on either box. If it cannot be restarted, the working node will kill the other one, and a shared IP (due to be set up shortly by our hosting company) will fail over to the "working" box.

        @richlv:

        The results from my bind.qcount check haven't changed - they literally stopped working when Zabbix crashed. The agent is not returning any newlines in the output - I've verified this.

        This is on Zabbix 1.6.4.

        • richlv
          Senior Member
          Zabbix Certified Trainer
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Oct 2005
          • 3112

          #5
          is new data coming in for those items (you can verify in latest data tab) ?
          you can also check out nextcheck field for those items - if it is set to some unreasonable time in future, set it to 0 (of course, be careful, have backups and all the usual stuff when messing with the db directly)
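
A rough way to spot items with an unreasonable nextcheck is sketched below. It assumes the rows have already been read from the database as `(itemid, nextcheck)` pairs and that nextcheck is stored as a Unix timestamp; how the rows are fetched and the threshold chosen are left open, and the function name is invented for this example.

```python
import time
from typing import Iterable, List, Optional, Tuple


def stuck_items(
    rows: Iterable[Tuple[int, int]],
    max_ahead: int,
    now: Optional[float] = None,
) -> List[int]:
    """Return ids of items whose nextcheck is more than max_ahead seconds away."""
    now = time.time() if now is None else now
    return [itemid for itemid, nextcheck in rows if nextcheck - now > max_ahead]
```

With a 15-minute check interval, a `max_ahead` of an hour or so would flag only items scheduled implausibly far in the future, which are the candidates for resetting.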

          • nmail_uk
            Member
            • May 2009
            • 65

            #6
            @richlv:

            Zabbix does appear to be retrying it, as the "latest data" is showing a last check time of 22nd May 13:34 (it's a 15-minute check interval). It is, however, greyed out and listed as "not supported" on the dashboard.

            However, the graph cuts off on 20th May around 15:20, the same time as zabbix_server died (and where all other graphs stop as well). The other graphs start again last night, when I noticed zabbix_server wasn't running and restarted it, but the bind.qcount graphs don't.

            I may try deleting and re-adding the checks and see if it makes any difference.

            • richlv
              Senior Member
              Zabbix Certified Trainer
              Zabbix Certified Specialist, Zabbix Certified Professional
              • Oct 2005
              • 3112

              #7
              deleting items will nuke history data.
              i'd suggest disabling/enabling it first, and resetting nextcheck field in the db manually

              • nmail_uk
                Member
                • May 2009
                • 65

                #8
                Yeah I understand about the history - this platform isn't live yet, I'm still getting it configured as "proof of concept" so it's not a problem losing the history.

                I deleted the command and re-added it, and it's now receiving results. However, the 2 DNS servers were rebooted on Saturday evening, which had the side effect of resetting all statistics, so it probably was an issue with the data my command was returning to Zabbix.
