Some items are not refreshed at the given frequency for hours in 3.4.11

  • Nalim27
    Junior Member
    • Jul 2018
    • 7

    #1

    Some items are not refreshed at the given frequency for hours in 3.4.11

    Hello,
    I reported what is, from my point of view, a serious bug affecting core Zabbix functionality: getting data from monitored hosts. However, my bug report in JIRA was closed as Won't fix.

    I think there is a bug in item scheduling in Zabbix 3.4.x (the process that puts items into the queue for refresh), and I'm curious whether anyone has a similar problem. Or maybe a solution?

    Here is the original bug description:

    Hello,
    About 2 months ago we upgraded Zabbix 3.0.4 to Zabbix 3.4.9 (and then to 3.4.11). Everything looked fine, but recently we found that some items are not refreshed as they should be. Zabbix, however, does not report any problems with the gathering processes, the caches, or the item update queue.
    The problem is serious for us because some items that should be refreshed every 3-5 minutes go without a refresh for 12-24 hours, for example.
    I created an SQL procedure that reports all items that should be refreshed at least every 60 minutes but have not been refreshed for more than 90 minutes. Every run returns 40-100 such items. I executed it a few times and put all the results into the attached Excel file "ERROR_not_refreshed_items.xlsx". Each list contains the date and time of the run and all items that were not refreshed at that point.
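The check described above can be sketched roughly as follows. This is not the original SQL procedure (which is not shown in this thread), only a minimal Python illustration of the same rule; the `Item` record and its field names are hypothetical and do not mirror the Zabbix database schema.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical record; field names are illustrative, not the Zabbix schema.
    name: str
    interval_s: int   # configured update interval, in seconds
    last_update: int  # Unix timestamp of the newest stored value


def stale_items(items, now, max_interval_s=3600, max_age_s=5400):
    """Return items due at least every 60 minutes whose newest value
    is older than 90 minutes."""
    return [
        item for item in items
        # Skip items without a positive interval and items slower than hourly.
        if 0 < item.interval_s <= max_interval_s
        and now - item.last_update > max_age_s
    ]


if __name__ == "__main__":
    now = 1_530_000_000
    sample = [
        Item("cpu load", 300, now - 120),         # refreshed 2 minutes ago
        Item("db check", 180, now - 12 * 3600),   # not refreshed for 12 hours
        Item("daily report", 86400, now - 7200),  # slower than hourly, ignored
    ]
    for item in stale_items(sample, now):
        print(item.name)
```

With the sample data only the "db check" item is reported, because it is due every 3 minutes but its last value is 12 hours old.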
    Symptoms:
    1. Items on some hosts are not refreshed. It does not happen on all hosts, but the list of affected hosts looks very similar in every check.
    2. The items differ between checks, but some items appear in almost every check.
    3. The item type does not matter: it happens for discovered items, items from templates, and even items created manually on the server.
      It also happens for simple Zabbix checks like CPU load as well as for custom scripts.
    4. The items that are not refreshed are all enabled, and they are in neither the Unsupported state nor an Error state.
    5. It happens on different host types and different Zabbix agents: Solaris, Linux, ...
    6. The problem seems to occur mainly on two servers that have a very large number of items:
      czken3hr.vfcz.dc-ratingen.de - 1070 items
      aczfil10s-z1.vfcz.dc-ratingen.de - 650 items
      l5ucms35 - 571 items
    7. Those delayed (or not refreshed) items are still refreshed at least once per day (maybe twice).
    8. Administration - Queue in Zabbix shows only 1-10 delayed items, none of them delayed by more than 10 s. All the very long delayed items found by my SQL script are simply ignored by this queue.
    9. The Zabbix value cache is about 50% free.
    10. Zabbix poller processes are at most 43% busy, 31% on average.
    11. Some statistics: we monitor 111 hosts and 20656 items, with 42 values per second.
    12. This problem did not exist in Zabbix 3.0.4 and previous versions.
    13. We have been using Zabbix for 7 years, from version 1.8.8 to the current 3.4.11.
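As a quick sanity check on the statistics in point 11, the item count and the values-per-second rate imply an average update interval per item. The sketch below assumes that all 20656 items actively report values, which may not hold (the count could include disabled or trapper items), so treat the result as a rough estimate only.

```python
def average_interval_s(item_count, values_per_second):
    """Average seconds between two values of one item,
    assuming every counted item contributes to the rate."""
    return item_count / values_per_second


if __name__ == "__main__":
    avg = average_interval_s(20656, 42)
    # Prints: 492 s (~8.2 min)
    print(f"{avg:.0f} s (~{avg / 60:.1f} min)")
```

An average of roughly 8 minutes per item is in line with intervals of a few minutes, so the overall rate alone would not reveal a small set of items that go unrefreshed for 12-24 hours.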
    Expected:
    1. Please see the screenshot from the Zabbix GUI showing that the CPU load item is not refreshed (at 11:45 AM): 2018-06-26 11_45_56-Delayed_CPU_Items.png
    2. The second screenshot shows the configuration of this item: 2018-06-26 11_46_35-Configuration of items.png
    3. The third screenshot shows a graph of the item's values; you can see that the value is refreshed daily instead of every 5 minutes: 2018-06-26 11_48_32-History_CPU_Item.png

    Possible root cause:
    • It looks like the problem is in the calculation of which items should be refreshed. Maybe some problem in the value cache?
    • It looks like the problem occurs only on hosts that have a large number of items defined (more than 512 items).

    What we tried in order to solve this problem:
    • We restarted the Zabbix server a few times. Sometimes items were refreshed 30 minutes after the restart, but after that the problem occurred again. Sometimes the problem occurred immediately after the restart.
    • We tried to increase the number of poller processes on the Zabbix server, without any success.
    • We tried to increase and decrease the number of poller processes on the affected Zabbix agents, without any success.
    • We tried going back to the old agents on the affected hosts, without any success.





    Attached images:
    • 2018-06-26 11_45_56-Delayed_CPU_Items.png
    • 2018-06-26 11_46_35-Configuration of items.png
    • 2018-06-26 11_48_32-History_CPU_Item.png
    • 2018-07-25 16_10_32-Queue.png
  • vso
    Zabbix developer
    • Aug 2016
    • 190

    #2
    How large is your RefreshUnsupported parameter? Is there anything interesting in the Zabbix server log?


    • Nalim27
      Junior Member
      • Jul 2018
      • 7

      #3
      Hello,
      yes, in the server log I found these messages very often (for the 2 hosts that have the problem):

      29731:20180726:104103.383 Zabbix agent item "DBAliveChecker[.....,V4K1PER_TAF.TEST.CZ]" on host "aczfil10s-z1.vfcz.dc-ratingen.de" failed: first network error, wait for 3 seconds
      29731:20180726:104136.385 Zabbix agent item "TableSpacesChecker[.....,INFETST_TAF.WORLD,85]" on host "l5ucms35.oskarmobil.cz" failed: another network error, wait for 3 seconds

      In 3.0.4 we had UnreachableDelay=30s. When I found those network error messages, I decreased UnreachableDelay to the default 15s and then to 3s (as you can see in the log). I think it is a false report: it is a custom check with a custom bash script on the agent side. The script itself works correctly; we use it to check the health of many databases.
      Still, the error is sometimes raised in the log, even though I'm sure that the network is OK.

      Maybe the script returns some error value because the checked database is down. Or the check took longer than 30 seconds (the limit on custom script run time), but ...... in 3.0.4 this was handled correctly as an unsupported item.
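To see how often these messages occur per host, the server log could be summarized with a small script. The following is a minimal Python sketch; the regular expression is derived only from the two log lines quoted above, so other message variants may not match, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern based on the two log lines quoted in this post:
# PID:YYYYMMDD:HHMMSS.mmm ... on host "<host>" failed: first|another network error
NETWORK_ERROR = re.compile(
    r'^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6})\.\d{3} '
    r'.* on host "(?P<host>[^"]+)" failed: (?:first|another) network error'
)


def network_errors_per_host(lines):
    """Count 'network error' log messages per host name."""
    counts = Counter()
    for line in lines:
        match = NETWORK_ERROR.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '29731:20180726:104103.383 Zabbix agent item "DBAliveChecker[.....,V4K1PER_TAF.TEST.CZ]" '
        'on host "aczfil10s-z1.vfcz.dc-ratingen.de" failed: first network error, wait for 3 seconds',
        '29731:20180726:104136.385 Zabbix agent item "TableSpacesChecker[.....,INFETST_TAF.WORLD,85]" '
        'on host "l5ucms35.oskarmobil.cz" failed: another network error, wait for 3 seconds',
    ]
    for host, count in network_errors_per_host(sample).items():
        print(host, count)
```

Comparing the per-host counts with the list of hosts whose items are not refreshed would show whether the two observations are related.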


      Back to your first question: I checked server.conf, but RefreshUnsupported is not there. I also checked server.conf in the Zabbix source, and it does not exist there either. Can you please describe where that parameter is located?

      Here is part of our conf file - maybe it will be useful for you:
      ### Option: ValueCacheSize
      # Size of history value cache, in bytes.
      # Shared memory size for caching item history data requests.
      # Setting to 0 disables value cache.
      #
      # Mandatory: no
      # Range: 0,128K-64G
      # Default:
      # ValueCacheSize=8M
      ValueCacheSize=80M

      ### Option: Timeout
      # Specifies how long we wait for agent, SNMP device or external check (in seconds).
      #
      # Mandatory: no
      # Range: 1-30
      # Default:
      # Timeout=3
      Timeout=30

      ### Option: TrapperTimeout
      # Specifies how many seconds trapper may spend processing new data.
      #
      # Mandatory: no
      # Range: 1-300
      # Default:
      # TrapperTimeout=300
      TrapperTimeout=300

      ### Option: UnreachablePeriod
      # After how many seconds of unreachability treat a host as unavailable.
      #
      # Mandatory: no
      # Range: 1-3600
      # Default:
      # UnreachablePeriod=45
      UnreachablePeriod=120

      ### Option: UnavailableDelay
      # How often host is checked for availability during the unavailability period, in seconds.
      #
      # Mandatory: no
      # Range: 1-3600
      # Default:
      # UnavailableDelay=60
      UnavailableDelay=90

      ### Option: UnreachableDelay
      # How often host is checked for availability during the unreachability period, in seconds.
      #
      # Mandatory: no
      # Range: 1-3600
      # Default:
      # UnreachableDelay=15
      UnreachableDelay=3
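To make the effective settings in an excerpt like this easier to read, the commented lines can be filtered out and the remaining values compared with the defaults. The sketch below is a minimal Python illustration: the function names are hypothetical, and the `DEFAULTS` table is copied only from the "Default:" comments in the excerpt above.

```python
# Defaults taken from the "# Default:" comments in the excerpt above.
DEFAULTS = {
    "ValueCacheSize": "8M",
    "Timeout": "3",
    "TrapperTimeout": "300",
    "UnreachablePeriod": "45",
    "UnavailableDelay": "60",
    "UnreachableDelay": "15",
}


def effective_settings(conf_text):
    """Return key/value pairs from lines that are neither blank nor comments."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def changed_from_default(settings, defaults):
    """Map each changed key to a (default, configured) pair."""
    return {
        key: (defaults[key], value)
        for key, value in settings.items()
        if key in defaults and value != defaults[key]
    }


if __name__ == "__main__":
    excerpt = """
### Option: Timeout
# Default:
# Timeout=3
Timeout=30

### Option: UnreachableDelay
# Default:
# UnreachableDelay=15
UnreachableDelay=3
"""
    for key, (default, value) in changed_from_default(
        effective_settings(excerpt), DEFAULTS
    ).items():
        print(f"{key}: default {default}, configured {value}")
```

Applied to the full excerpt, every listed option except TrapperTimeout differs from its default.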


      • Nalim27
        Junior Member
        • Jul 2018
        • 7

        #4
        Hi,
        quick update:
        I just found this in the server log:
        29697:20180726:110351.000 value cache is fully used: please increase ValueCacheSize configuration parameter

        That is very strange because, according to the Zabbix internal item, we have plenty of free space in the value cache - please see this graph:
        Attached image: 2018-07-26 11_14_00-Window.png
