weird problem with history write cache filling

  • mjcig
    Junior Member
    • Apr 2014
    • 5

    #1

    weird problem with history write cache filling

    Just recently I have noticed an issue where the history cache is filling up. I have restarted the zabbix server processes and the cache will clear, then slowly begin to fill back up again.

    The cache does not fill up quickly; I would say it fills at about 1% an hour. The only way it recovers is if I restart the services again.

    I have increased the number of db syncers twice, first by 2 and then by 4, and there is no change in the behavior, except that the db syncer average busy rate has decreased, which I would expect.

    Reviewing the internal zabbix processes, all look good. The db syncer busy rate is fine and there is nothing in the zabbix log that appears to be a problem. An occasional slow query here and there, but nothing alarming.

    Any thoughts on what to look into next?

    The database admin states all looks good with the db, and besides the cache filling, everything else is performing well. The only weird thing I have noticed is that alerts and escalations have been slow at times and in some cases are queueing to be sent. Maybe a look at these tables?

    Anyway, here are a few details of our config:

    Zabbix 1.8.9 - built from source
    RHEL 5.8
    Oracle 11gR2 (11.2) with patches (cannot recall exactly, but after three service packs and patches, it solved our performance problems)
    2822 nvps

    Appreciate any input
  • mjcig
    Junior Member
    • Apr 2014
    • 5

    #2
    Had an opportunity to restart the zabbix server and there is no change in the behavior of the history cache filling. The frequency has increased, but nothing on the database or in the logs has pointed to any area to dig into further.

    I confirmed there are no log items defined as checks, and I am still trying to understand the impact if I let the cache fill completely. In past experience, when the history cache fills, the db syncers become busier as they are unable to process and write the data to the db quickly enough.

    The odd thing is this is not the case here. There is no performance impact based on the internal checks I have in place.

    Attached are graphs for the last 7 days to help illustrate what I am seeing. The changes in the history cache are from me restarting the zabbix services, since in the past when the history cache fills, data to be processed falls behind.
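    For reference, the write-cache utilisation can be tracked with Zabbix internal items; a possible set of keys (a sketch only, the exact key parameters may vary between versions) is:

    Code:
    zabbix[wcache,history,pfree]   # % free in the history write cache
    zabbix[wcache,text,pfree]      # % free in the text history write cache
    zabbix[wcache,trend,pfree]     # % free in the trend write cache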

    Any other thoughts?

    • mjcig
      Junior Member
      • Apr 2014
      • 5

      #3
      To add... I let the cache run to zero, and after bouncing along the bottom for a few hours, one of the proxy servers stopped sending values to the primary, creating a backlog of values.

      I am planning on moving to 1.8.20 for the short term while we evaluate upgrade options. We have a large history partition in our db with over 1.4 billion rows of data.

      Anyone running 1.8.20 and can chime in with any issues they may have encountered?

      Thanks in advance

      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        Increasing DB Syncers is not your answer. You should leave that at the default unless you have a very large installation monitoring thousands of hosts.

        Instead, there are some cache settings in your zabbix_server.conf that you should increase (I think 1.8.2 has them). There are 3 of them. Just search that conf file for "cache" and you'll see them. Maybe try bumping them up an additional 128M.

        After you adjust them, restart your Zabbix server process.

        One of the cache settings is this: (As you see, I have mine set to 256M)

        Code:
        ### Option: CacheSize
        #       Size of configuration cache, in bytes.
        #       Shared memory size for storing host, item and trigger data.
        #
        # Mandatory: no
        # Range: 128K-1G
        # Default:
        # CacheSize=8M
        CacheSize=256M
        It may also benefit you to bump your Timeout= setting to 10 if you are still at the default of 3.
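        Presumably the other two cache settings a search of zabbix_server.conf turns up are HistoryCacheSize and HistoryTextCacheSize; a sketch of bumping them (the sizes here are only examples, tune them to your available memory):

        Code:
        # Shared memory for history values waiting to be written to the database
        HistoryCacheSize=128M
        # Shared memory for text/log history values waiting to be written
        HistoryTextCacheSize=128M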

        • mjcig
          Junior Member
          • Apr 2014
          • 5

          #5
          Thanks for your thoughts.

          I agree that bumping up the number of dbsyncers will not solve the problem, but we write 150 million rows of data to the history table daily. I have played around with increasing and decreasing the dbsyncers here and there with no major changes.

          As for the History Cache, I have it set to 384M; it was previously configured for 256M. The odd thing is this behavior started occurring out of nowhere. Before 4/10 the cache with 256M never went below 99% free, and it was that way for a year. Whether I change it to more or less, I still see the same behavior. It is almost as if I stumbled upon some bug and the values in the buffer are not some sort of simple integer value.

          As for the timeout, it is my understanding this is for external checks... but it has been some time. We originally had this set higher and saw increased load on our zabbix host. We moved those processes to scripts leveraging zabbix_sender to push the values in, which is much more efficient at processing.
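          For context, the sort of push those scripts do looks roughly like this (server name, host name, key and value are placeholders):

          Code:
          # push a single value for item key "custom.metric" on host "web01"
          zabbix_sender -z zabbix-server.example.com -s web01 -k custom.metric -o 42
          # or send many values at once from a file of "<host> <key> <value>" lines
          zabbix_sender -z zabbix-server.example.com -i /tmp/values.txt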

          In fact our complete environment is configured for active monitoring of the agents. We saw a huge increase in the number of values a proxy and server can process, as the burden of polling and waiting for data is no longer an issue.

          • mjcig
            Junior Member
            • Apr 2014
            • 5

            #6
            Update -

            I compiled and updated the server and proxy binaries to 1.8.20 and am still experiencing the same behavior as before, with the history cache being consumed.

            Are there any other internal checks or tools to review what type of data is being written to the history cache or inspect what data is in the buffer?

            • waydena
              Junior Member
              • Feb 2012
              • 4

              #7
              Same history write cache problem out of nowhere

              We had the same problem out of nowhere after a system had been running for about 6 months. (Monitoring a system with 6 Windows servers and 30 Windows workstations, 4 switches, and a couple of VoIP gateways; running on a Dell R520 with 32 GB of RAM.)

              I had added a couple of log file monitors and tweaked some switch SNMP monitoring. This was before the initial problem.

              After the problem I turned off the log monitoring, reduced the polling frequency in some widely used templates (intervals from 30 to 60 secs), and reduced the history retention in other templates (from 30 days to 7).

              I suspect the system may have gotten even busier dumping all the data that was no longer required due to the reduced history settings; the housekeeping spiked a little also. I tried changing the housekeeping interval from the default 1 hr to 6 hrs (as suggested in other forum posts), but that appeared to make things worse, so I set it back to hourly. It may also have been that we had just rolled over into the period when a lot of history started getting dumped. These were both changes after the initial problem, though.
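              For reference, that interval is the HousekeepingFrequency parameter in zabbix_server.conf; a sketch of the change that was tried and then reverted:

              Code:
              # Housekeeper run interval, in hours (tried at 6, set back to the default of 1)
              HousekeepingFrequency=1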

              Our cache sizes were previously all defaults. I increased the settings as follows:

              Cache Size: 8M -> 32M
              History Cache: 16M -> 32M
              History Text Cache: 16M -> 32M

              The system is behaving now (with log monitors turned back on), but it is difficult to tell for sure whether the problem was resolved by the new cache settings, because it is back to running at 99.9% free all the time, as it was prior to "the incident"...
