Strange problem, unable to nail it down.

  • joshuamcdo
    Member
    • Nov 2013
    • 76

    #1


    I am running Zabbix 2.0.9 server and client. ( Taken from EPEL repos in )

    We will try and make this easy...

    I am running Zabbix in the cloud, on an amazon centos image, using a large RDS instance for the DB.

    I noticed this morning that I had an alert for low disk space on the server. After checking it out I realized that the agent's DebugLevel was set to 4 and the log was taking up too much disk space. So I made the needed changes to the config file and restarted the agent.
    After 5 minutes, then 10, then 20, there is still no contact from the agent to the server. I restart the server service and wait. Some CPU stats start coming in, then I wait 5, 10, and 20 or more minutes and most things are still not checking in.

    So I *stop* the server process, wait 5 seconds and then start it again, and everything starts to report as normal almost immediately.

    Is this normal behavior? I am doubting it. Anyone have any ideas?

    Thanks,
    J
  • joshuamcdo
    Member
    • Nov 2013
    • 76

    #2

    Just wanted to add some data to this..

    I built an ec2 instance...

    I set up a host inside zabbix..

    Then I installed, set up and started the agent..

    Nothing..

    I watched the traffic using tcpdump on the client side and there was no traffic to or from the zabbix server.

    I deleted the host from zabbix and recreated it, and voilà, tcpdump instantly lit up and the agent started checking in...

    Is this a bug?

    J
    Last edited by joshuamcdo; 07-11-2013, 22:17.


    • ILIV
      Junior Member
      • Oct 2012
      • 28

      #3
      Originally posted by joshuamcdo
      Just wanted to add some data to this..

      I built an ec2 instance...

      I set up a host inside zabbix..

      Then I installed, set up and started the agent..

      Nothing..

      I watched the traffic using tcpdump on the client side and there was no traffic to or from the zabbix server.

      I deleted the host from zabbix and recreated it, and voilà, tcpdump instantly lit up and the agent started checking in...

      Is this a bug?

      J
      Is it reproducible? How often? What are the circumstances?

      Behavior you describe in this thread is not normal, and your troubleshooting apparently was not extensive, so it is really hard to tell what was really wrong.

      The key to issues like this is to establish whether it is reproducible in a controlled experiment. If it is, you're then looking at an issue that others should be able to recreate and possibly provide some help with.


      • joshuamcdo
        Member
        • Nov 2013
        • 76

        #4
        Re: Re: Strange problem, unable to nail it down.

        Originally posted by ILIV
        Is it reproducible? How often? What are the circumstances?

        If it is, you're then looking at an issue that others should be able to recreate and possibly provide some help.
        I was going to say yes, but I want to run one more experiment before I "confirm" that it's reproducible.

        J


        • joshuamcdo
          Member
          • Nov 2013
          • 76

          #5
          Originally posted by ILIV
          Is it reproducible? How often? What are the circumstances?
          Being fairly new to the Zabbix world, I have to admit that troubleshooting zabbix is currently a challenge I face.

          The circumstances..
          I restart the agent on the Zabbix server itself.
          Data stops coming in. 5, 10 minutes later, still nothing. I am observing this behavior right now.
          I stopped the zabbix server process, waited 5 seconds, started it back up, and so far it's still not checking in. The zabbix server has fallen into the alert category of not having responded in the last 5 mins.
          I stopped the server, then stopped the agent, then started the server and then started the agent back up. Still nothing reports.

          I restarted the zabbix agent on a random ec2 instance.
          Data stops coming in. 5, 10 minutes later, still nothing. I am observing this behavior right now.

          I have been waiting for a while now, and nothing starts reporting.

          IF I were to remove the host and re-add it, it would start reporting immediately. But I simply can't keep doing that as we need to be able to keep history for these nodes.

          Yes, I can manually get any data I want from the agent if I query it from the server. That's never an issue.
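For readers who want to repeat that manual check without extra tooling, here is a minimal Python sketch of a passive query against the agent on port 10050 (the default ListenPort shown in the agent config in this post). It assumes the agent accepts a request wrapped in the standard `ZBXD` header and answers with the same framing. The function names are made up for illustration. The query has to come from a machine listed in the agent's `Server=` setting, otherwise the agent drops the connection.

```python
import socket
import struct

ZBX_HEADER = b"ZBXD\x01"  # protocol signature followed by the version byte


def build_request(key: str) -> bytes:
    """Frame an item key: header, 8-byte little-endian length, payload."""
    payload = key.encode("utf-8")
    return ZBX_HEADER + struct.pack("<Q", len(payload)) + payload


def parse_response(raw: bytes) -> str:
    """Return the value from an agent reply, with or without the header."""
    if not raw.startswith(ZBX_HEADER):
        # Assumption: an unframed reply is plain text.
        return raw.decode("utf-8", errors="replace").strip()
    (length,) = struct.unpack("<Q", raw[5:13])
    return raw[13:13 + length].decode("utf-8", errors="replace")


def query_agent(host: str, key: str, port: int = 10050, timeout: float = 5.0) -> str:
    """Open a TCP connection to the agent, send one key, read until close."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(key))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))


# Example (hypothetical host name, run from the Zabbix server):
# print(query_agent("my-ec2-host.example.com", "system.cpu.num[]"))
```

If this returns data while the server's own pollers stay silent, the agent and the network path are probably fine and the problem is more likely on the server side.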

          UPDATE:

          The agent still hasn't checked in, and I continue to get alerts stating that the node is not responding.
          However, some odd things have started to check in.

          Checksum of /etc/passwd Nov 11th, 2013 09:31:28 PM
          Checksum of /etc/services Nov 11th, 2013 09:31:30 PM
          Checksum of /usr/bin/ssh Nov 11th, 2013 09:31:30 PM

          Free disk space on / in % Nov 11th, 2013 09:40:33 PM

          Used disk space on / in % Nov 11th, 2013 09:40:34 PM

          These all checked in on the zabbix server around 20 minutes after everything mentioned above.

          The random ec2 instance has the following that has checked in.

          Number of users connected Nov 11th, 2013 09:42:33 PM

          I have attached a screen shot of the CPU LOAD graph for the ec2-instance. As you can see it didn't start reporting for around 8ish hours. Then everything starts checking back in and the server stops reporting that the host is unavailable and returns to a normal state.

          I have pasted the client config below. Changed the servername to <servername> and the hostname to <hostname>. Those values were as declared. The <hostname> is the hostname of the host and the same as what's in the zabbix entry for the said host.


          # This is a config file for the Zabbix agent daemon (Unix)
          # To get more information about Zabbix, visit http://www.zabbix.com

          ############ GENERAL PARAMETERS #################

          ### Option: PidFile
          # Name of PID file.
          #
          # Mandatory: no
          # Default:
          PidFile=/var/run/zabbix/zabbix_agentd.pid

          ### Option: LogFile
          # Name of log file.
          # If not set, syslog is used.
          #
          # Mandatory: no
          # Default:
          # LogFile=

          LogFile=/var/log/zabbix/zabbix_agentd.log

          ### Option: LogFileSize
          # Maximum size of log file in MB.
          # 0 - disable automatic log rotation.
          #
          # Mandatory: no
          # Range: 0-1024
          # Default:
          LogFileSize=0

          ### Option: DebugLevel
          # Specifies debug level
          # 0 - no debug
          # 1 - critical information
          # 2 - error information
          # 3 - warnings
          # 4 - for debugging (produces lots of information)
          #
          # Mandatory: no
          # Range: 0-4
          # Default:
          DebugLevel=4

          ### Option: SourceIP
          # Source IP address for outgoing connections.
          #
          # Mandatory: no
          # Default:
          # SourceIP=


          ### Option: EnableRemoteCommands
          # Whether remote commands from Zabbix server are allowed.
          # 0 - not allowed
          # 1 - allowed
          #
          # Mandatory: no
          # Default:
          # EnableRemoteCommands=0

          ### Option: LogRemoteCommands
          # Enable logging of executed shell commands as warnings.
          # 0 - disabled
          # 1 - enabled
          #
          # Mandatory: no
          # Default:
          # LogRemoteCommands=0

          ##### Passive checks related

          ### Option: Server
          # List of comma delimited IP addresses (or hostnames) of Zabbix servers.
          # Incoming connections will be accepted only from the hosts listed here.
          # No spaces allowed.
          # If IPv6 support is enabled then '127.0.0.1', '::127.0.0.1', '::ffff:127.0.0.1' are treated equally.
          #
          # Mandatory: no
          # Default:
          # Server=

          Server=zabbix-servername.com

          ### Option: ListenPort
          # Agent will listen on this port for connections from the server.
          #
          # Mandatory: no
          # Range: 1024-32767
          # Default:
          # ListenPort=10050

          ### Option: ListenIP
          # List of comma delimited IP addresses that the agent should listen on.
          # First IP address is sent to Zabbix server if connecting to it to retrieve list of active checks.
          #
          # Mandatory: no
          # Default:
          # ListenIP=0.0.0.0

          ### Option: StartAgents
          # Number of pre-forked instances of zabbix_agentd that process passive checks.
          # If set to 0, disables passive checks and the agent will not listen on any TCP port.
          #
          # Mandatory: no
          # Range: 0-100
          # Default:
          # StartAgents=3

          ##### Active checks related

          ### Option: ServerActive
          # List of comma delimited IP:port (or hostname:port) pairs of Zabbix servers for active checks.
          # If port is not specified, default port is used.
          # IPv6 addresses must be enclosed in square brackets if port for that host is specified.
          # If port is not specified, square brackets for IPv6 addresses are optional.
          # If this parameter is not specified, active checks are disabled.
          # Example: ##ServerActive=127.0.0.1:20051,zabbix.domain,[::1]:30051,::1,[12fc::1]
          #
          # Mandatory: no
          # Default:
          # ServerActive=

          ##ServerActive=127.0.0.1
          #ServerActive=servername.com

          ### Option: Hostname
          # Unique, case sensitive hostname.
          # Required for active checks and must match hostname as configured on the server.
          # Value is acquired from HostnameItem if undefined.
          #
          # Mandatory: no
          # Default:
          # Hostname=

          Hostname=<hostname>

          ### Option: HostnameItem
          # Item used for generating Hostname if it is undefined.
          # Ignored if Hostname is defined.
          #
          # Mandatory: no
          # Default:
          # HostnameItem=system.hostname

          ### Option: RefreshActiveChecks
          # How often list of active checks is refreshed, in seconds.
          #
          # Mandatory: no
          # Range: 60-3600
          # Default:
          # RefreshActiveChecks=120

          ### Option: BufferSend
          # Do not keep data longer than N seconds in buffer.
          #
          # Mandatory: no
          # Range: 1-3600
          # Default:
          # BufferSend=5

          ### Option: BufferSize
          # Maximum number of values in a memory buffer. The agent will send
          # all collected data to Zabbix Server or Proxy if the buffer is full.
          #
          # Mandatory: no
          # Range: 2-65535
          # Default:
          # BufferSize=100

          ### Option: MaxLinesPerSecond
          # Maximum number of new lines the agent will send per second to Zabbix Server
          # or Proxy processing 'log' and 'logrt' active checks.
          # The provided value will be overridden by the parameter 'maxlines',
          # provided in 'log' or 'logrt' item keys.
          #
          # Mandatory: no
          # Range: 1-1000
          # Default:
          # MaxLinesPerSecond=100

          ### Option: AllowRoot
          # Allow the agent to run as 'root'. If disabled and the agent is started by 'root', the agent
          # will try to switch to user 'zabbix' instead. Has no effect if started under a regular user.
          # 0 - do not allow
          # 1 - allow
          #
          # Mandatory: no
          # Default:
          # AllowRoot=0

          ############ ADVANCED PARAMETERS #################

          ### Option: Alias
          # Sets an alias for parameter. It can be useful to substitute long and complex parameter name with a smaller and simpler one.
          #
          # Mandatory: no
          # Range:
          # Default:

          ### Option: Timeout
          # Spend no more than Timeout seconds on processing
          #
          # Mandatory: no
          # Range: 1-30
          # Default:
          # Timeout=3

          ### Option: Include
          # You may include individual files or all files in a directory in the configuration file.
          # Installing Zabbix will create include directory in /etc, unless modified during the compile time.
          #
          # Mandatory: no
          # Default:
          # Include=

          # Include=/etc/zabbix_agentd.userparams.conf
          # Include=/etc/zabbix_agentd.conf.d/

          ####### USER-DEFINED MONITORED PARAMETERS #######

          ### Option: UnsafeUserParameters
          # Allow all characters to be passed in arguments to user-defined parameters.
          # 0 - do not allow
          # 1 - allow
          #
          # Mandatory: no
          # Range: 0-1
          # Default:
          # UnsafeUserParameters=0

          ### Option: UserParameter
          # User-defined parameter to monitor. There can be several user-defined parameters.
          # Format: UserParameter=<key>,<shell command>
          # See 'zabbix_agentd' directory for examples.
          #
          # Mandatory: no
          # Default:
          # UserParameter=




          ***************************************
          Server config.

          # This is a configuration file for Zabbix Server process
          # To get more information about Zabbix,
          # visit http://www.zabbix.com

          ############ GENERAL PARAMETERS #################

          ### Option: NodeID
          # Unique NodeID in distributed setup.
          # 0 - standalone server
          #
          # Mandatory: no
          # Range: 0-999
          # Default:
          # NodeID=0

          ### Option: ListenPort
          # Listen port for trapper.
          #
          # Mandatory: no
          # Range: 1024-32767
          # Default:
          # ListenPort=10051

          ### Option: SourceIP
          # Source IP address for outgoing connections.
          #
          # Mandatory: no
          # Default:
          # SourceIP=

          ### Option: LogFile
          # Name of log file.
          # If not set, syslog is used.
          #
          # Mandatory: no
          # Default:
          # LogFile=

          LogFile=/var/log/zabbix/zabbix_server.log

          ### Option: LogFileSize
          # Maximum size of log file in MB.
          # 0 - disable automatic log rotation.
          #
          # Mandatory: no
          # Range: 0-1024
          # Default:
          LogFileSize=512

          ### Option: DebugLevel
          # Specifies debug level
          # 0 - no debug
          # 1 - critical information
          # 2 - error information
          # 3 - warnings
          # 4 - for debugging (produces lots of information)
          #
          # Mandatory: no
          # Range: 0-4
          # Default:
          DebugLevel=3

          ### Option: PidFile
          # Name of PID file.
          #
          # Mandatory: no
          # Default:
          PidFile=/var/run/zabbix/zabbix_server.pid

          ### Option: DBHost
          # Database host name.
          # If set to localhost, socket is used for MySQL.
          # If set to empty string, socket is used for PostgreSQL.
          #
          # Mandatory: no
          # Default:
          DBHost=<dbhostname>.rds.amazonaws.com


          ### Option: DBName
          # Database name.
          # For SQLite3 path to database file must be provided. DBUser and DBPassword are ignored.
          #
          # Mandatory: yes
          # Default:
          # DBName=

          DBName=<dbname>

          ### Option: DBSchema
          # Schema name. Used for IBM DB2.
          #
          # Mandatory: no
          # Default:
          # DBSchema=

          ### Option: DBUser
          # Database user. Ignored for SQLite.
          #
          # Mandatory: no
          # Default:
          # DBUser=

          DBUser=<dbuser>

          ### Option: DBPassword
          # Database password. Ignored for SQLite.
          # Comment this line if no password is used.
          #
          # Mandatory: no
          # Default:
          DBPassword=<dbpassword>

          ### Option: DBSocket
          # Path to MySQL socket.
          #
          # Mandatory: no
          # Default:
          # DBSocket=/tmp/mysql.sock

          ### Option: DBPort
          # Database port when not using local socket. Ignored for SQLite.
          #
          # Mandatory: no
          # Range: 1024-65535
          # Default (for MySQL):
          # DBPort=3306

          ############ ADVANCED PARAMETERS ################

          ### Option: StartPollers
          # Number of pre-forked instances of pollers.
          #
          # Mandatory: no
          # Range: 0-1000
          # Default:
          # StartPollers=5
          StartPollers=10

          ### Option: StartIPMIPollers
          # Number of pre-forked instances of IPMI pollers.
          #
          # Mandatory: no
          # Range: 0-1000
          # Default:
          # StartIPMIPollers=0

          ### Option: StartPollersUnreachable
          # Number of pre-forked instances of pollers for unreachable hosts (including IPMI).
          #
          # Mandatory: no
          # Range: 0-1000
          # Default:
          # StartPollersUnreachable=1

          ### Option: StartTrappers
          # Number of pre-forked instances of trappers.
          # Trappers accept incoming connections from Zabbix sender, active agents, active proxies and child nodes.
          # At least one trapper process must be running to display server availability in the frontend.
          #
          # Mandatory: no
          # Range: 0-1000
          # Default:
          # StartTrappers=5

          ### Option: StartPingers
          # Number of pre-forked instances of ICMP pingers.
          #
          # Mandatory: no
          # Range: 0-1000
          # Default:
          # StartPingers=1

          ### Option: StartDiscoverers
          # Number of pre-forked instances of discoverers.
          #
          # Mandatory: no
          # Range: 0-250
          # Default:
          # StartDiscoverers=1

          ### Option: StartHTTPPollers
          # Number of pre-forked instances of HTTP pollers.
          #
          # Mandatory: no
          # Range: 0-1000
          # Default:
          # StartHTTPPollers=1

          ### Option: JavaGateway
          # IP address (or hostname) of Zabbix Java gateway.
          # Only required if Java pollers are started.
          #
          # Mandatory: no
          # Default:
          # JavaGateway=

          ### Option: JavaGatewayPort
          # Port that Zabbix Java gateway listens on.
          #
          # Mandatory: no
          # Range: 1024-32767
          # Default:
          # JavaGatewayPort=10052

          ### Option: StartJavaPollers
          # Number of pre-forked instances of Java pollers.
          #
          # Mandatory: no
          # Range: 0-1000
          # Default:
          # StartJavaPollers=0

          ### Option: SNMPTrapperFile
          # Temporary file used for passing data from SNMP trap daemon to the server.
          # Must be the same as in zabbix_trap_receiver.pl or SNMPTT configuration file.
          #
          # Mandatory: no
          # Default:
          # SNMPTrapperFile=/tmp/zabbix_traps.tmp

          ### Option: StartSNMPTrapper
          # If 1, SNMP trapper process is started.
          #
          # Mandatory: no
          # Range: 0-1
          # Default:
          # StartSNMPTrapper=0

          ### Option: ListenIP
          # List of comma delimited IP addresses that the trapper should listen on.
          # Trapper will listen on all network interfaces if this parameter is missing.
          #
          # Mandatory: no
          # Default:
          # ListenIP=0.0.0.0

          # ListenIP=127.0.0.1

          ### Option: HousekeepingFrequency
          # How often Zabbix will perform housekeeping procedure (in hours).
          # Housekeeping is removing unnecessary information from history, alert, and alarms tables.
          #
          # Mandatory: no
          # Range: 1-24
          # Default:
          HousekeepingFrequency=1

          ### Option: MaxHousekeeperDelete
          # The table "housekeeper" contains "tasks" for housekeeping procedure in the format:
          # [housekeeperid], [tablename], [field], [value].
          # No more than 'MaxHousekeeperDelete' rows (corresponding to [tablename], [field], [value])
          # will be deleted per one task in one housekeeping cycle.
          # SQLite3 does not use this parameter, deletes all corresponding rows without a limit.
          # If set to 0 then no limit is used at all. In this case you must know what you are doing!
          #
          # Mandatory: no
          # Range: 0-1000000
          # Default:
          # MaxHousekeeperDelete=500
          MaxHousekeeperDelete=5000

          ### Option: DisableHousekeeping
          # If set to 1, disables housekeeping.
          #
          # Mandatory: no
          # Range: 0-1
          # Default:
          # DisableHousekeeping=0

          ### Option: SenderFrequency
          # How often Zabbix will try to send unsent alerts (in seconds).
          #
          # Mandatory: no
          # Range: 5-3600
          # Default:
          SenderFrequency=15

          ### Option: CacheSize
          # Size of configuration cache, in bytes.
          # Shared memory size for storing host, item and trigger data.
          #
          # Mandatory: no
          # Range: 128K-2G
          # Default:
          # CacheSize=8M

          ### Option: CacheUpdateFrequency
          # How often Zabbix will perform update of configuration cache, in seconds.
          #
          # Mandatory: no
          # Range: 1-3600
          # Default:
          # CacheUpdateFrequency=60

          ### Option: StartDBSyncers
          # Number of pre-forked instances of DB Syncers
          #
          # Mandatory: no
          # Range: 1-100
          # Default:
          # StartDBSyncers=4

          ### Option: HistoryCacheSize
          # Size of history cache, in bytes.
          # Shared memory size for storing history data.
          #
          # Mandatory: no
          # Range: 128K-2G
          # Default:
          # HistoryCacheSize=8M

          ### Option: TrendCacheSize
          # Size of trend cache, in bytes.
          # Shared memory size for storing trends data.
          #
          # Mandatory: no
          # Range: 128K-2G
          # Default:
          # TrendCacheSize=4M

          ### Option: HistoryTextCacheSize
          # Size of text history cache, in bytes.
          # Shared memory size for storing character, text or log history data.
          #
          # Mandatory: no
          # Range: 128K-2G
          # Default:
          # HistoryTextCacheSize=16M

          ### Option: NodeNoEvents
          # If set to '1' local events won't be sent to master node.
          # This won't impact ability of this node to propagate events from its child nodes.
          #
          # Mandatory: no
          # Range: 0-1
          # Default:
          # NodeNoEvents=0

          ### Option: NodeNoHistory
          # If set to '1' local history won't be sent to master node.
          # This won't impact ability of this node to propagate history from its child nodes.
          #
          # Mandatory: no
          # Range: 0-1
          # Default:
          # NodeNoHistory=0

          ### Option: Timeout
          # Specifies how long we wait for agent, SNMP device or external check (in seconds).
          #
          # Mandatory: no
          # Range: 1-30
          # Default:
          Timeout=30

          ### Option: TrapperTimeout
          # Specifies how many seconds trapper may spend processing new data.
          #
          # Mandatory: no
          # Range: 1-300
          # Default:
          # TrapperTimeout=300

          ### Option: UnreachablePeriod
          # After how many seconds of unreachability treat a host as unavailable.
          #
          # Mandatory: no
          # Range: 1-3600
          # Default:
          UnreachablePeriod=45

          ### Option: UnavailableDelay
          # How often host is checked for availability during the unavailability period, in seconds.
          #
          # Mandatory: no
          # Range: 1-3600
          # Default:
          # UnavailableDelay=60

          ### Option: UnreachableDelay
          # How often host is checked for availability during the unreachability period, in seconds.
          #
          # Mandatory: no
          # Range: 1-3600
          # Default:
          # UnreachableDelay=15

          ### Option: AlertScriptsPath
          # Full path to location of custom alert scripts.
          # Default depends on compilation options.
          #
          # Mandatory: no
          # Default:
          #AlertScriptsPath=/var/lib/zabbixsrv/alertscripts
          AlertScriptsPath=/opt/zabbix/zabbixsrv/alertscripts
          ### Option: ExternalScripts
          # Full path to location of external scripts.
          # Default depends on compilation options.
          #
          # Mandatory: no
          # Default:
          #ExternalScripts=/var/lib/zabbixsrv/externalscripts
          ExternalScripts=/opt/zabbix/zabbixsrv/externalscripts
          ### Option: FpingLocation
          # Location of fping.
          # Make sure that fping binary has root ownership and SUID flag set.
          #
          # Mandatory: no
          # Default:
          # FpingLocation=/usr/sbin/fping

          ### Option: Fping6Location
          # Location of fping6.
          # Make sure that fping6 binary has root ownership and SUID flag set.
          # Make empty if your fping utility is capable to process IPv6 addresses.
          #
          # Mandatory: no
          # Default:
          # Fping6Location=/usr/sbin/fping6

          ### Option: SSHKeyLocation
          # Location of public and private keys for SSH checks and actions
          #
          # Mandatory: no
          # Default:
          # SSHKeyLocation=

          ### Option: LogSlowQueries
          # How long a database query may take before being logged (in milliseconds).
          # Only works if DebugLevel set to 3 or 4.
          # 0 - don't log slow queries.
          #
          # Mandatory: no
          # Range: 1-3600000
          # Default:
          ###LogSlowQueries=1000

          ### Option: TmpDir
          # Temporary directory.
          #
          # Mandatory: no
          # Default:
          # TmpDir=/tmp

          ### Option: Include
          # You may include individual files or all files in a directory in the configuration file.
          # Installing Zabbix will create include directory in /etc, unless modified during the compile time.
          #
          # Mandatory: no
          # Default:
          # Include=

          # Include=/etc/zabbix_server.general.conf
          #Include=/etc/zabbix_server.conf.d/

          ### Option: StartProxyPollers
          # Number of pre-forked instances of pollers for passive proxies.
          #
          # Mandatory: no
          # Range: 0-250
          # Default:
          # StartProxyPollers=1

          ### Option: ProxyConfigFrequency
          # How often Zabbix Server sends configuration data to a Zabbix Proxy in seconds.
          # This parameter is used only for proxies in the passive mode.
          #
          # Mandatory: no
          # Range: 1-3600*24*7
          # Default:
          # ProxyConfigFrequency=3600

          ### Option: ProxyDataFrequency
          # How often Zabbix Server requests history data from a Zabbix Proxy in seconds.
          # This parameter is used only for proxies in the passive mode.
          #
          # Mandatory: no
          # Range: 1-3600
          # Default:
          # ProxyDataFrequency=1
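Both files above are mostly commented-out defaults, so the handful of values that are actually in effect is easy to miss. The following Python sketch filters a Zabbix-style config down to its active `key=value` lines. The function name is made up for illustration, and the sample text is a small excerpt of the server config above, not the whole file.

```python
def active_settings(config_text: str) -> dict:
    """Return the uncommented key=value pairs from a Zabbix-style config."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments, including commented-out options
        key, sep, value = line.partition("=")
        if sep:
            # Note: a repeated key (e.g. several Include lines) keeps only the last value.
            settings[key.strip()] = value.strip()
    return settings


sample = """\
### Option: StartPollers
# StartPollers=5
StartPollers=10

### Option: Timeout
Timeout=30

### Option: UnreachablePeriod
UnreachablePeriod=45
###LogSlowQueries=1000
"""

for name, value in active_settings(sample).items():
    print(f"{name}={value}")
```

Run against the full server file, this would leave only the non-default lines such as `StartPollers=10`, `MaxHousekeeperDelete=5000`, `Timeout=30` and `UnreachablePeriod=45`, which is a shorter list to review than the whole file.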

          Thanks in advance,

          J
          Last edited by joshuamcdo; 12-11-2013, 18:55. Reason: Added screen shot


          • joshuamcdo
            Member
            • Nov 2013
            • 76

            #6
            Re: Re: Strange problem, unable to nail it down.

            To provide a little bit of an update here..

            I have tried to increase the StartPollers from 10 to 20. That didn't seem to stop the problem.

            I had a monitored host just stop reporting out of the blue. No one had logged into it, nor changed anything. I started watching for packet activity to the said host and there was none. I could manually probe the host without issue, and it would always return the data.
            Out of frustration I stopped and started the server. I waited 5 seconds or so before starting it back up. Once started it took a few minutes but it came out of the "non reporting" state that it was stuck in.

            However, the agent on the server itself went into a coma. I stopped both the server and agent on the server itself. That did nothing. So I stopped the server, waited a few seconds and started it back up. This appears to have gotten everything reporting again. Only then did I decide to look at the queue.

            Looking at the queue there are a few things that don't make sense..

            Nov 18th, 2013 06:43:36 PM 2h 58m 57s <zabbix_sever> CPU count ( Cores )

            CPU count ( Cores ) looks like this.
            system.cpu.num[]

            I don't usually see an issue capturing this value, and the reason I capture it is so that I can build a dynamic CPU Load trigger like this.

            Processor load is too high for 5 minutes on {HOST.NAME}
            {Template Linux:system.cpu.load.min(300)}>{Template Linux:system.cpu.num[].last(0)}*2.5

            My understanding is that this will alert if the minimum of cpu.load over 300 seconds is greater than the core count x 2.5. This works out fairly well for us, as it saves us from having to create a separate template per machine size. The core count is checked every 86400 seconds (the max), as there is no need to check it more often than that. I originally tried to set it to every 30 days.
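To make the logic of that trigger concrete, here is a small Python sketch of the comparison as described in this post. It only imitates the arithmetic and is not how Zabbix evaluates triggers internally. The function name and the sample numbers are invented for illustration.

```python
def load_too_high(load_samples, cpu_cores, factor=2.5):
    """Imitate min(300) > cores * factor.

    Because the minimum of the window is compared, the result is True
    only when every sample in the window is above the threshold.
    """
    if not load_samples:
        return False  # no data in the window, nothing to compare
    return min(load_samples) > cpu_cores * factor


# Hypothetical 4-core host: threshold is 4 * 2.5 = 10.0
print(load_too_high([11.0, 12.5, 10.2], cpu_cores=4))  # every sample above 10.0
print(load_too_high([11.0, 9.0, 12.0], cpu_cores=4))   # one sample dipped below 10.0
```

Using the minimum means a single short spike does not fire the trigger; the load has to stay above the threshold for the whole 5 minutes.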

            There are some other checks for other systems that are also queued like file system free etc, none of that makes sense to me. They will eventually clear out though.

            This sounds like a mis-configuration or a bug to me. It's almost as if a poller gets lost out there and doesn't return to functional status until it's killed <completely> and then started back up. Just my 2 cents.

            Thanks in advance for any help.

            J


            • tchjts1
              Senior Member
              • May 2008
              • 1605

              #7
              What are your Zabbix internal and gathering processes looking like?
              Take a look at this post, specifically the last paragraph and the graphs below it.


              I would be interested to see how long your Housekeeper process is running because you have MaxHousekeeperDelete=5000

              Housekeeper taxes the system fairly heavily. I also see that you are running with pretty much the default configs for the cache settings. But... the graphs that I point to in the above post will tell the story on whether you need to tweak those settings. If you post those graphs here, please use a one day (24 hour) view.

              And do you have any Zabbix proxies in the mix?


              • joshuamcdo
                Member
                • Nov 2013
                • 76

                #8
                Re:

                Originally posted by tchjts1
                What are your Zabbix internal and gathering processes looking like?
                Take a look at this post, specifically the last paragraph and the graphs below it.


                I would be interested to see how long your Housekeeper process is running because you have MaxHousekeeperDelete=5000

                tchjts1

                Thanks for the reply.

                I commented out the MaxHousekeeperDelete setting, and when I restarted I got this message. " 764:20131120:003314.133 housekeeper deleted: 38648 records from history and trends, 0 records of deleted items, 0 events, 0 alerts, 0 sessions"

                I inherited this Zabbix mess, and have very low confidence in any of the settings. I don't have a large enough setup to worry about the housekeeper taking too long, so I am going to leave it commented out at this stage.

                I don't have any of the graphs shown in that post; I suspect they were deleted. However, I have done several fresh installs and don't recall ever seeing those screens or templates. I also don't see any template that I could obviously apply to the Zabbix server itself. Any thoughts on where I might get those templates?

                Thanks,

                J


                • tchjts1
                  Senior Member
                  • May 2008
                  • 1605

                  #9
                  Bottom half of this page:


                  Template App Zabbix Server is the one you want.
                  Last edited by tchjts1; 20-11-2013, 16:25.


                  • joshuamcdo
                    Member
                    • Nov 2013
                    • 76

                    #10
                    Re: Re: Re: Strange problem, unable to nail it down.

                    I imported and applied the Zabbix server metrics templates as suggested.
                    Wow, was that telling... I can't figure out for the life of me why anyone would have removed that template, but they did.

                    The first thing that happened was that an alert was triggered for the "busy unreachable poller processes" being 75% or more in use.

                    <snip>
                    Trigger: Zabbix unreachable poller processes more than 75% busy Trigger status: PROBLEM Trigger severity: Low Severity Trigger URL:

                    Item values:

                    1. Zabbix busy unreachable poller processes, in % (<server_name>:zabbix[process,unreachable poller,avg,busy]): 100 %
                    </snip>

                    Curious, I googled the error and found little information, but it didn't take me long to put two and two together. I did some more homework and changed quite a few things.

                    I changed the max connections to the RDS database to 512.
                    I changed StartPollers=10 to StartPollers=100, changed StartPollersUnreachable from being commented out to StartPollersUnreachable=50, and made some other tweaks. I have pasted the censored config below.

                    What I believe was happening, and for a long time, is that the Zabbix server config was preventing it from having the resources it needed to do its job. Hosts would go "unavailable" because there weren't enough pollers (StartPollers). Once that happened, it would take forever for the server to reach back out to them because the StartPollersUnreachable setting was commented out, leading me to believe that it defaults to 1 or some other very low number.

                    Zabbix was on its way out the door, and no one could figure it out, because the former admin who put it together made a few fatal errors. Now that it's running in a normal state, I am starting to see that some of their custom triggers never worked, but they never knew because Zabbix rarely had a moment to deal with the situation. Things are SO much smoother now, and there is actual CPU load on the system, which I had never previously seen because it wasn't allowed to work hard. I am trying hard to turn the tide of opinion about Zabbix in the company I work for, and this is a major step in doing so.

                    Please see the settings below and feel free to criticize anything you may think is off.


                    LogFile=<path_to>/zabbix_server.log
                    LogFileSize=512
                    DebugLevel=3
                    PidFile=<path_to>/zabbix_server.pid
                    DBHost=<dbs_hostname>.rds.amazonaws.com
                    DBName=<db_name>
                    DBUser=<db_user>
                    DBPassword=<db_password>
                    StartPollers=256
                    StartPollersUnreachable=50
                    StartTrappers=100
                    StartPingers=10
                    StartDiscoverers=10
                    StartHTTPPollers=20
                    StartSNMPTrapper=1
                    HousekeepingFrequency=2
                    SenderFrequency=5
                    CacheSize=1G
                    CacheUpdateFrequency=60
                    StartDBSyncers=10
                    TrendCacheSize=2G
                    HistoryTextCacheSize=1G
                    Timeout=30
                    UnreachablePeriod=120
                    UnavailableDelay=60
                    UnreachableDelay=15
                    AlertScriptsPath=<path_to>/alertscripts
                    ExternalScripts=<path_to>/externalscripts
                    ProxyDataFrequency=15
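Since the root cause here was a setting that was silently commented out, a quick way to see which settings are actually active is to strip the comments and list what is left. Below is a minimal Python sketch of that idea. It assumes the usual `Key=Value` layout with `#` comment lines; the sample text and the list of watched keys are hypothetical, not taken from my real file.

```python
def active_settings(conf_text):
    """Return a dict of the uncommented Key=Value pairs in a config text."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and anything commented out.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical sample: one watched setting is commented out.
SAMPLE_CONF = """
### Option: StartPollers
StartPollers=10
# StartPollersUnreachable=1
DebugLevel=3
"""

# Settings I would want to see set explicitly (my own choice of keys).
WATCHED = ("StartPollers", "StartPollersUnreachable")

found = active_settings(SAMPLE_CONF)
missing = [key for key in WATCHED if key not in found]

print("active:", found)
print("left at built-in default:", missing)
```

Anything reported as left at the built-in default is worth looking up in the documentation for your version before assuming it is fine.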


                    Also, thank you in a BIG way for your help! This one template answered a LOT of questions for me and has Zabbix finally performing the way it's supposed to.

                    J

                    P.S. - I can now stop, start, or restart any agent and only be missing the data that wasn't collected while it was stopped!
                    Last edited by joshuamcdo; 22-11-2013, 06:57. Reason: PS


                    • tchjts1
                      Senior Member
                      • May 2008
                      • 1605

                      #11
                      That template didn't become available until around 1.8.5, I think. Whether it was included or not depends on how the upgrade was done. A fresh install from compile gives you the new templates. I think upgrading through packages does not.

                      So anyway, how many hosts are you monitoring? What are your stats from the dashboard page that shows the number of items/triggers/nvps?

                      I think that your StartDBSyncers=10 is too high. Unless you have a valid reason to increase that, it should be left at the default. (I think default is 4)

                      Tell us a little bit about your infrastructure setup. Stand alone servers? VM's?
                      Separate App and DB server, or both on one server? How much memory do you have for each server?

                      Disabling housekeeper... I don't think that I would do that. Leave the setting at the default of whatever the MaxDelete was, which I think was 500. I would also leave it set at the default to run once an hour.


                      • joshuamcdo
                        Member
                        • Nov 2013
                        • 76

                        #12
                        Re: Re: Re: Re: Strange problem, unable to nail it down.

                        <text_output>
                        Number of hosts (monitored/not monitored/templates) 439 105 / 241 / 93
                        Number of items (monitored/disabled/not supported) 5744 4915 / 295 / 534
                        Number of triggers (enabled/disabled)[problem/unknown/ok] 2103 2038 / 65 [72 / 0 / 1966]

                        ( I have also attached a screen shot of this data)

                        Number of users (online) 70 1

                        Required server performance, new values per second 82.11 -
                        </text_output>
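As a sanity check on that "new values per second" figure: as far as I understand it, the number is roughly the sum of 1 / update interval over the monitored items. The Python sketch below works on that assumption. The function name is mine, and the uniform 60 second interval is purely hypothetical, since the real items have a mix of intervals.

```python
def required_nvps(item_intervals):
    """Estimate new values per second as the sum of 1/interval.

    item_intervals is a list of update intervals in seconds, one per
    monitored item. Items with a zero interval are skipped here.
    """
    return sum(1.0 / seconds for seconds in item_intervals if seconds > 0)


# Hypothetical: all 4915 monitored items polled every 60 seconds.
intervals = [60] * 4915

print(round(required_nvps(intervals), 2))
```

That lands close to the 82.11 the dashboard reports, so the average interval across my items is probably about a minute.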

                        I admit that I raised the number of "StartDBSyncers" arbitrarily, based on an assumption and without fully understanding what it does. I also can't seem to find a clear definition of what exactly that setting does. I will change it back to 4 for now and see what happens.

                        As far as I know, I have not disabled housekeeping. My belief was that I had lifted the previously imposed limit on how many rows it was allowed to delete at any given time.

                        <house_keeper>
                        ### Option: MaxHousekeeperDelete
                        # The table "housekeeper" contains "tasks" for housekeeping procedure in the format:
                        # [housekeeperid], [tablename], [field], [value].
                        # No more than 'MaxHousekeeperDelete' rows (corresponding to [tablename], [field], [value])
                        # will be deleted per one task in one housekeeping cycle.
                        # SQLite3 does not use this parameter, deletes all corresponding rows without a limit.
                        # If set to 0 then no limit is used at all. In this case you must know what you are doing!
                        #
                        # Mandatory: no
                        # Range: 0-1000000
                        # Default:
                        # MaxHousekeeperDelete=0
                        #MaxHousekeeperDelete=5000
                        #MaxHousekeeperDelete=5000
                        </house_keeper>

                        I have since commented that back out, for the time being while I work to understand things a little better..

                        I am just ecstatic that Zabbix is finally up and running at capacity. The only other related setting that I know of is this one.

                        <snip>
                        ### Option: DisableHousekeeping
                        # If set to 1, disables housekeeping.
                        #
                        # Mandatory: no
                        # Range: 0-1
                        # Default:
                        # DisableHousekeeping=0
                        </snip>

                        It also just ran..

                        18276:20131122:154027.892 housekeeper deleted: 340257 records from history and trends, 0 records of deleted items, 0 events, 0 alerts, 0 sessions
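To compare housekeeper runs over time, the log lines can be pulled apart with a small script. Here is a Python sketch that parses the two housekeeper lines quoted in this thread. The regular expression is my own guess at the layout based on those two lines, not an official log format definition.

```python
import re
from datetime import datetime

# Layout inferred from the two log lines in this thread:
# PID:YYYYMMDD:HHMMSS.mmm housekeeper deleted: N records from history and trends, ...
LINE_RE = re.compile(
    r"^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6})\.\d{3}\s+"
    r"housekeeper deleted: (?P<history>\d+) records from history and trends"
)


def parse_housekeeper_line(line):
    """Return (timestamp, history_rows_deleted), or None if no match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    when = datetime.strptime(
        match.group("date") + match.group("time"), "%Y%m%d%H%M%S"
    )
    return when, int(match.group("history"))


LINES = [
    " 764:20131120:003314.133 housekeeper deleted: 38648 records from "
    "history and trends, 0 records of deleted items, 0 events, 0 alerts, 0 sessions",
    "18276:20131122:154027.892 housekeeper deleted: 340257 records from "
    "history and trends, 0 records of deleted items, 0 events, 0 alerts, 0 sessions",
]

for line in LINES:
    print(parse_housekeeper_line(line))
```

Feeding it the whole server log would show how many rows each run removes and how often the runs happen.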

                        This Zabbix server is running in the cloud, and 90% of our clients reside in the cloud. There are 2 proxies, and both of those proxies serve hardware in one of our data centers. The memory available to each client varies widely but is typically no less than 3 gigs.

                        I don't have that information for the proxies; I believe one of them is in the cloud and one of them is hardware.

                        The Zabbix server has 7.4 gigs of RAM, of which 5 gigs are currently free, and 2 vCPUs, and it uses an RDS database instance in the cloud, in the same region. The Zabbix server rarely shows any real CPU load; load only appears, at between 2 and 4, when the housekeeper runs.

                        I still can't find any good information on "StartDBSyncers", but I changed it to 4 as suggested.

                        A new issue has crept into play now, but there is no telling what is causing it. I want to do some more homework and troubleshooting before I beg for help.


                        Thanks a million!
                        Attached Files


                        • tchjts1
                          Senior Member
                          • May 2008
                          • 1605

                          #13
                          Originally posted by joshuamcdo
                          StartPollers=256
                          StartPollersUnreachable=50
                          StartTrappers=100
                          StartPingers=10
                          StartDiscoverers=10
                          StartHTTPPollers=20
                          StartSNMPTrapper=1
                          HousekeepingFrequency=2
                          SenderFrequency=5
                          CacheSize=1G
                          CacheUpdateFrequency=60
                          StartDBSyncers=10
                          TrendCacheSize=2G
                          HistoryTextCacheSize=1G
                          Timeout=30
                          ProxyDataFrequency=15
                          As a comparison, here are my settings, and you can see the number of hosts/items/nvps in the screenshot at the bottom. My setup is on 2 VMs, with the App server having 4 vCPUs and 8GB of RAM and the DB server having 4 vCPUs and 16GB of RAM.
                          StartPollers=375
                          StartPollersUnreachable=20
                          StartTrappers=40
                          StartPingers=4
                          StartDiscoverers=1
                          StartHTTPPollers=1
                          StartSNMPTrapper=0 (Are you actually doing SNMP traps?)
                          HousekeepingFrequency=1
                          SenderFrequency=30
                          CacheSize=254M
                          StartDBSyncers=4 (Only needs increased if you have thousands of hosts)
                          HistoryCacheSize=254M
                          TrendCacheSize=254M
                          HistoryTextCacheSize=192M
                          Timeout=15
                          ProxyDataFrequency=1
                          Attached Files
