Zabbix 1.6.3 server down.

  • johnpeter
    Junior Member
    • Oct 2009
    • 4

    #1

    Dear all,

    Our Zabbix server stopped working sometime early this morning & we have been trying to get it up & running all day, but have had no joy. It will make on-call rather problematic because we won't know when systems are down.

    The configuration files have not been changed since the 28th of September, and the server has been bounced since that date but before this morning. Just to rule out oddities, we have rebooted the server.

    Linux t2nl-mgt004 2.6.25.20-0.5-default #1 SMP 2009-08-14 01:48:11 +0200 x86_64 x86_64 x86_64 GNU/Linux
    # cat /etc/*relea*
    openSUSE 11.0 (X86-64)


    Below is the server.log output from startup through the subsequent crash of the server, followed by the server and agent configuration files.

    I would be very grateful if someone might tell me what to investigate so that we can get it back.

    Yours faithfully, P,


    ########################
    ###### server.log #####
    ########################
    Code:
      9605:20091012:140915 Starting zabbix_server. ZABBIX 1.6.3.
      9605:20091012:140915 **** Enabled features ****
      9605:20091012:140915 SNMP monitoring:       YES
      9605:20091012:140915 WEB monitoring:         NO
      9605:20091012:140915 Jabber notifications:   NO
      9605:20091012:140915 ODBC:                   NO
      9605:20091012:140915 IPv6 support:           NO
      9605:20091012:140915 **************************
      9605:20091012:140920 ZABBIX semaphores already exist, trying to recreate.
      9605:20091012:140920 ZABBIX semaphores already exist, trying to recreate.
      9605:20091012:140920 ZABBIX semaphores already exist, trying to recreate.
      9622:20091012:140920 server #17 started [Poller. SNMP:YES]
      9618:20091012:140920 server #13 started [Poller. SNMP:YES]
      9606:20091012:140920 server #1 started [Poller. SNMP:YES]
      9608:20091012:140920 server #3 started [Poller. SNMP:YES]
      9609:20091012:140920 server #4 started [Poller. SNMP:YES]
      9623:20091012:140920 server #18 started [Poller. SNMP:YES]
      9616:20091012:140920 server #11 started [Poller. SNMP:YES]
      9617:20091012:140920 server #12 started [Poller. SNMP:YES]
      9607:20091012:140920 server #2 started [Poller. SNMP:YES]
      9611:20091012:140920 server #6 started [Poller. SNMP:YES]
      9610:20091012:140920 server #5 started [Poller. SNMP:YES]
      9621:20091012:140920 server #16 started [Poller. SNMP:YES]
      9619:20091012:140920 server #14 started [Poller. SNMP:YES]
      9615:20091012:140920 server #10 started [Poller. SNMP:YES]
      9635:20091012:140920 server #26 started [Trapper]
      9625:20091012:140920 server #20 started [Poller. SNMP:YES]
      9638:20091012:140920 server #27 started [Trapper]
      9640:20091012:140920 server #28 started [Trapper]
      9642:20091012:140920 server #29 started [Trapper]
      9644:20091012:140920 server #30 started [Trapper]
      9646:20091012:140920 server #31 started [ICMP pinger]
      9648:20091012:140920 server #32 started [Alerter]
      9650:20091012:140920 server #33 started [Housekeeper]
      9620:20091012:140920 server #15 started [Poller. SNMP:YES]
      9650:20091012:140920 Executing housekeeper
      9652:20091012:140920 server #34 started [Timer]
      9624:20091012:140920 server #19 started [Poller. SNMP:YES]
      9632:20091012:140920 server #23 started [Poller. SNMP:YES]
      9658:20091012:140920 server #36 started [Node watcher. Node ID:0]
      9630:20091012:140920 server #21 started [Poller. SNMP:YES]
      9612:20091012:140921 server #7 started [Poller. SNMP:YES]
      9655:20091012:140921 server #35 started [Poller for unreachable hosts. SNMP:YES]
      9613:20091012:140921 server #8 started [Poller. SNMP:YES]
      9614:20091012:140921 server #9 started [Poller. SNMP:YES]
      9631:20091012:140921 server #22 started [Poller. SNMP:YES]
      9633:20091012:140921 server #24 started [Poller. SNMP:YES]
      9668:20091012:140921 server #39 started [Escalator]
      9663:20091012:140921 server #37 started [Discoverer. SNMP:YES]
      9605:20091012:140921 server #0 started [Watchdog]
      9605:20091012:140921 In main_watchdog_loop()
      9667:20091012:140921 server #38 started [DB Syncer]
      9634:20091012:140921 server #25 started [Poller. SNMP:YES]
      9622:20091012:140921 Item [amshqb-bob01:system[procrunning]] error: Not supported by ZABBIX agent
      9617:20091012:140921 Item [amshqb-dbase04:system[procrunning]] error: Not supported by ZABBIX agent
      9619:20091012:140921 Item [t2nl-mgt001:agent.ping] error: Get value from agent failed: Cannot connect to [t2nl-mgt001:10050] [Connection refused]
      9619:20091012:140921 Host [t2nl-mgt001]: first network error, wait for 15 seconds
      9623:20091012:140921 Item [t2nl-app106:vfs.fs.size[/var,pfree]] error: Got empty string from [t2nl-app106]. Assuming that agent dropped connection because of access permissions
      9619:20091012:140921 Parameter [agent.ping] will be checked after 240 seconds on host [t2nl-mgt001]
      9623:20091012:140921 Host [t2nl-app106]: first network error, wait for 15 seconds
      9605:20091012:140921 One child process died. Exiting ...
      9605:20091012:140923 ZABBIX Server stopped. ZABBIX 1.6.3.
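The log above reports only "One child process died" without saying which one. As a rough aid, the sketch below maps each PID to the role it announced at startup, so that a PID learned from some other source can be matched to a poller, trapper, and so on. The function names `zbx_roles` and `zbx_role_of` are made up for this example, and the parsing assumes the `PID:DATE:TIME server #N started [Role]` line format shown in this log.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Zabbix.
# zbx_roles LOGFILE: print "PID<TAB>role" for every "server #N started [...]" line.
zbx_roles() {
    awk -F: '/ server #[0-9]+ started / {
        pid = $1
        gsub(/^[ \t]+/, "", pid)            # strip leading indentation
        start = index($0, "[")              # role text begins after the first "["
        if (start == 0) next
        role = substr($0, start + 1)
        sub(/\][ \t\r]*$/, "", role)        # drop the closing "]" and trailing blanks
        print pid "\t" role
    }' "$1"
}

# zbx_role_of LOGFILE PID: print the role announced by that PID, if any.
zbx_role_of() {
    zbx_roles "$1" | awk -F'\t' -v pid="$2" '$1 == pid { print $2 }'
}
```

For example, `zbx_role_of /home/zabbix/server.log 9635` would look up the role of PID 9635 in the log file named in the configuration below.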
    ########################
    ###### zabbix_server.conf #####
    ########################
    Code:
    # This is config file for ZABBIX server process
    # To get more information about ZABBIX, 
    # go http://www.zabbix.com
    
    ############ GENERAL PARAMETERS #################
    
    # Number of pre-forked instances of pollers
    # Default value is 5
    # This parameter must be between 0 and 255
    StartPollers=25
    
    # How often ZABBIX will perform housekeeping procedure
    # (in hours)
    # Default value is 1 hour
    # Housekeeping is removing unnecessary information from
    # tables history, alert, and alarms
    # This parameter must be between 1 and 24
    
    HousekeepingFrequency=1
    
    # How often ZABBIX will try to send unsent alerts
    # (in seconds)
    # Default value is 30 seconds
    SenderFrequency=30
    
    # Uncomment this line to disable housekeeping procedure
    #DisableHousekeeping=1
    
    # Specifies debug level
    # 0 - debug is not created
    # 1 - critical information
    # 2 - error information
    # 3 - warnings (default)
    # 4 - for debugging (produces lots of information)
    
    DebugLevel=3
    
    # Specifies how long we wait for agent response (in sec)
    # Must be between 1 and 30 
    Timeout=5
    
    # Specifies how many seconds trapper may spend processing new data
    # Must be between 1 and 30 
    #TrapperTimeout=5
    
    # After how many seconds of unreachability treat a host as unavailable
    UnreachablePeriod=45
    
    # How often to check host for availability during the unreachability period
    #UnavailableDelay=15
    
    # How often to check host for availability during the unavailability period
    #UnavailableDelay=60
    
    # Name of PID file
    
    PidFile=/var/tmp/zabbix_server.pid
    
    # Name of log file
    # If not set, syslog is used
    
    LogFile=/home/zabbix/server.log
    
    # Maximum size of log file in MB. Set to 0 to disable automatic log rotation.
    LogFileSize=10
    
    # Location for custom alert scripts
    AlertScriptsPath=/home/zabbix/bin/
    
    # Location of external scripts
    ExternalScripts=/etc/zabbix/externalscripts
    
    # Location of fping. Default is /usr/sbin/fping
    # Make sure that fping binary has root permissions and SUID flag set
    #FpingLocation=/usr/sbin/fping
    
    # Location of fping6. Default is /usr/sbin/fping6
    # Make sure that fping binary has root permissions and SUID flag set
    #Fping6Location=/usr/sbin/fping6
    
    # Frequency of ICMP pings (item keys 'icmpping' and 'icmppingsec'). Default is 60 seconds.
    #PingerFrequency=60
    
    # Database host name
    # Default is localhost
    
    #DBHost=t2nl-mgt102
    
    # Database name
    # SQLite3 note: path to database file must be provided. DBUser and DBPassword are ignored.
    DBName=zabbix
    
    # Database user
    
    DBUser=zabbix
    
    # Database password
    # Comment this line if no password used
    
    DBPassword=zabbix
    
    # Connect to MySQL using Unix socket?
    
    #DBSocket=/tmp/mysql.sock
    
    StartDBSyncers=
    ########################
    ###### zabbix_agentd.conf #####
    ########################
    Code:
    Server=127.0.0.1
    Hostname=t2nl-mgt004
    BufferSend=20
    ListenIP=127.0.0.1
    EnableRemoteCommands=1
    PidFile=/var/tmp/zabbix_agentd.pid
    LogFile=/var/tmp/zabbix_agentd.log
    LogFileSize=1
    Timeout=20
    StartAgents=10
    UserParameter=status.sshd,/home/zabbix/scripts/sshd_up.sh
    UserParameter=testdb[*],/home/zabbix/scripts/wrapper.sh $1
    #UserParameter=proccount[*],ps -e | grep $1 | grep -v grep | wc -l
    #UserParameter=disk_check[*],/home/zabbix/scripts/check-dual-connected -m $1 -s $1 -d $1 -z
    UserParameter=system.temp[*],/etc/zabbix/externalscripts/temp.py --host $1
    UserParameter=database.check_db[*],/home/zabbix/scripts/check_db.py $1 $2
    UserParameter=database.check_refresh[*],/home/zabbix/scripts/check_db_refresh_date.py $1 $2
  • johnpeter
    Junior Member
    • Oct 2009
    • 4

    #2
    Problem has been solved internally. Shall post resolution details later, when I have those.


    • johnpeter
      Junior Member
      • Oct 2009
      • 4

      #3
      Solution:

      Truncate the DB.
      Create a table called actions.backup.
      Back up the actions table (using a SELECT) into actions.backup.
      Drop the actions table and recreate an empty actions table.
      Start Zabbix.
      Stop Zabbix.
      Copy the data from the actions.backup table into the freshly created actions table.
      Start Zabbix.

      Result: Zabbix now works.

      Reason why: unknown


      • johnpeter
        Junior Member
        • Oct 2009
        • 4

        #4
        These are the actions we take to resurrect Zabbix. It's very strange, and I do not know why this works. Would someone kindly comment? At present we have to do this weekly. The file system holding the MySQL database is not full.

        1. /etc/init.d/zabbix_server stop
        2. mysql -u root zabbix
        3. drop table actions_old; (if exists)
        4. create table actions_old select * from actions;
        5. truncate table actions;
        6. exit mysql
        7. /etc/init.d/zabbix_server start
        8. wait ~ 1 minute or so (or tail ~zabbix/server.log)
        9. /etc/init.d/zabbix stop
        10. mysql -u root zabbix
        11. drop table actions;
        12. create table actions select * from actions_old;
        13. exit mysql
        14. /etc/init.d/zabbix start
        15. cross your fingers while tailing -f ~zabbix/server.log
        16. if zabbix works again, verify that zabbix_agentd also is running, if not do /etc/init.d/zabbix_agentd start
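The steps above could be collected into a small script along these lines. This is only a sketch: the `run`, `sql`, and `reset_actions` helpers and the `DRY_RUN`, `INIT`, and `DB` variables are invented for the example, the init script is assumed to be `/etc/init.d/zabbix_server` throughout, and MySQL root is assumed to need no password, as in the steps. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the weekly workaround; dry run by default (prints commands only).
DRY_RUN=${DRY_RUN:-1}
INIT=${INIT:-/etc/init.d/zabbix_server}   # assumed init script path
DB=${DB:-zabbix}                          # database name used in the steps

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run one SQL statement against the Zabbix database as root.
sql() {
    run mysql -u root -e "$1" "$DB"
}

reset_actions() {
    run "$INIT" stop
    sql "DROP TABLE IF EXISTS actions_old"
    # Note: CREATE TABLE ... SELECT copies rows but not indexes.
    sql "CREATE TABLE actions_old SELECT * FROM actions"
    sql "TRUNCATE TABLE actions"
    run "$INIT" start
    run sleep 60                          # step 8: give the server about a minute
    run "$INIT" stop
    sql "DROP TABLE actions"
    sql "CREATE TABLE actions SELECT * FROM actions_old"
    run "$INIT" start
}
```

After sourcing the file, `reset_actions` prints the ten commands in order; setting `DRY_RUN=0` first would actually execute them. Steps 15 and 16 (watching server.log and checking zabbix_agentd) are left as manual checks.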


        • sarathyme
          Member
          • Mar 2009
          • 58

          #5
          thanks!

          This happened to me as well.
          I debugged for a full day and almost gave up.

          Finally I gave your suggestion (discovery???) a try and it worked.
          Like you, I am not sure why it works, but I'm happy that it does.

          Thanks so much for the thread

          Thanks
          Vijay
