Big delay between detecting an issue and sending a notification

  • hiddenrefuge
    Junior Member
    • Feb 2020
    • 3

    #1

    Big delay between detecting an issue and sending a notification

    Hello everyone,

    We're running a Zabbix 3.4.0 server (a physical server running Debian 9, 8 GB RAM, Intel Xeon X3340 and software RAID1) to monitor all our switches and other devices, including servers.

    For the last few months there has been a huge delay (up to 1 day) between Zabbix detecting an issue and the notification/alert e-mail being sent out. The same happens on recovery from the issue. Sometimes, however, it works instantly. We also monitor our Internet connection, and when it cuts out and comes back, the notification is usually sent instantly. Last week a NAS failed and I disabled it in Zabbix to investigate; the cancellation of the alert due to the host being disabled was also sent out instantly.

    Where can I start looking for possible causes?

    Thanks a lot for your help guys.
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    I'm on 4.2.x, and I think some of these places were re-organized or changed at 4.0.0, so things might look a little different in your 3.4.0 version.

    I would first check in Zabbix, in Monitoring->Problems, to track down when Zabbix *says* it sent an email. Find one of the problems for which you experienced a huge delay, and see if Zabbix sent the alert right away.

    If it did send the alert right away, then you need to start looking at your email logs, to see if your email relay accepted the message immediately, and what it did with the message once it was accepted.


    • hiddenrefuge
      Junior Member
      • Feb 2020
      • 3

      #3
      Thanks for your reply Tim.

      I went ahead and unplugged the network connection of a test device in my office that I also monitor through Zabbix. That was at 7:21 (UTC+01:00). At the time of posting this, Zabbix still hasn't detected that the host is down. Something is seriously wrong somewhere else, I guess.

      The load of the server is over 5.00 across the whole machine. The software RAID provides barely usable I/O and backups take ages. The database is over 150 GB. I actually took over administration of our monitoring server from another employee. I see a lot of things that need to be fixed and optimized.

      Where can I start?

      Here is one blunder: [screenshot not preserved]
      Last edited by hiddenrefuge; 12-02-2020, 08:52.


      • hiddenrefuge
        Junior Member
        • Feb 2020
        • 3

        #4
        A small rundown of my test:

        7:21 Test device disconnected from network
        - Somewhere in between I restarted the zabbix_server process due to the massive load as indicated in the screenshot from my post above.
        9:15 Zabbix detected test device downtime.
        9:25 Mail was sent out regarding downtime of device.
        9:28 Problem acknowledged (ACK) in Zabbix and test device reconnected to the network.
        9:28 Zabbix marked issue as resolved.
        9:34 Mail was sent out regarding test device coming back online.
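To make the gaps in that rundown explicit, here is a minimal Python sketch that computes the delays from the timestamps listed above. The dictionary keys and the helper name are made up for illustration, and the detection delay includes the zabbix_server restart that happened somewhere in between, so it is an upper bound rather than a clean measurement.

```python
from datetime import datetime

# Timestamps copied from the rundown above (same day, UTC+01:00).
# The key names are illustrative labels, not Zabbix terms.
events = {
    "disconnected": "7:21",
    "problem_detected": "9:15",
    "problem_mail": "9:25",
    "resolved": "9:28",
    "recovery_mail": "9:34",
}


def minutes_between(start, end):
    """Return the whole minutes between two H:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


detection_delay = minutes_between(events["disconnected"], events["problem_detected"])
problem_mail_delay = minutes_between(events["problem_detected"], events["problem_mail"])
recovery_mail_delay = minutes_between(events["resolved"], events["recovery_mail"])

print(f"Outage to detection:     {detection_delay} min")
print(f"Detection to alert mail: {problem_mail_delay} min")
print(f"Recovery to mail:        {recovery_mail_delay} min")
```

In this test most of the lag sits between the outage and detection (data collection), not between detection and the e-mail.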


        • tim.mooney
          Senior Member
          • Dec 2012
          • 1427

          #5
          Originally posted by hiddenrefuge
          The load of the server is over 5.00 across the whole machine. The software RAID provides barely usable I/O and backups take ages. The database is over 150 GB. I actually took over administration of our monitoring server from another employee. I see a lot of things that need to be fixed and optimized.
          It does indeed seem like some tuning is in order. Hopefully that's the extent of it.

          How many hosts and items do you monitor? Does it seem like your server is big enough, based on the sizing guidelines in the requirements documentation? Is your database on the same host, or is it on a separate server? Hosting Zabbix + the database on a server with just 8 GiB of RAM doesn't leave much room for database tuning, so your database may also be falling behind.
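One rough way to put a number on the "how many items" question is the new-values-per-second (NVPS) rate, which the Zabbix frontend also reports as required server performance. The sketch below estimates it as the sum of item count divided by update interval. The item counts and intervals are hypothetical placeholders, not figures from this thread, and the function name is invented for illustration.

```python
# Hypothetical inventory: (number of items, update interval in seconds).
# Replace these placeholders with the real counts from your own setup.
item_groups = [
    (1200, 60),   # e.g. items polled every minute
    (600, 300),   # e.g. items polled every five minutes
    (90, 30),     # e.g. items polled every 30 seconds
]


def new_values_per_second(groups):
    """Estimate NVPS as the sum of item_count / update_interval."""
    return sum(count / interval for count, interval in groups)


nvps = new_values_per_second(item_groups)
print(f"Estimated load: {nvps:.1f} new values per second")
```

Comparing that estimate with the value shown in the frontend gives a quick sanity check before looking at the sizing guidelines.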

          There is a "Template App Zabbix Server" in the default Templates. If you apply it to your Zabbix server, what problems does it identify? I'm hoping it will point out one or two areas where you just need to increase the number of pollers for a particular area of functionality, and that will address the majority of the issues.
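If the template does flag busy pollers, the settings to compare against live in zabbix_server.conf. Here is a minimal Python sketch that lists a few of them. StartPollers, StartPollersUnreachable, StartPingers and CacheSize are existing zabbix_server.conf parameters, but the sample text and its values are made up for illustration; in practice you would read your own config file instead (the path depends on how Zabbix was installed).

```python
# Hypothetical excerpt of a zabbix_server.conf; the values are invented.
SAMPLE_CONF = """\
# StartPollers=5
StartPollers=5
StartPollersUnreachable=1
StartPingers=1
CacheSize=8M
"""

# Parameters worth checking first when data collection is falling behind.
TUNABLES = ("StartPollers", "StartPollersUnreachable", "StartPingers", "CacheSize")


def read_tunables(conf_text, names=TUNABLES):
    """Return the active (uncommented) values of the given parameters."""
    found = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out defaults
        key, sep, value = line.partition("=")
        if sep and key.strip() in names:
            found[key.strip()] = value.strip()
    return found


settings = read_tunables(SAMPLE_CONF)
for name in TUNABLES:
    print(f"{name} = {settings.get(name, '(not set, server default applies)')}")
```

Any parameter reported as not set is running at the server default, which is a reasonable first candidate to raise if the matching process type shows as busy.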
