Zabbix Log file monitoring behaviour

  • AkvenJan
    Junior Member
    • Jun 2021
    • 19

    #1

    Zabbix Log file monitoring behaviour

    Zabbix 4.4.6

    Hello
    I collect some logs from servers via the Zabbix agent.
    These logs don't have many entries, but they go back 4-5 years (and I can't modify the log files to delete or archive old entries).
    The log entries have timestamps, which I parse.
    Log time format: yyyy-MM-ddphh:mm:ss

    1. First question
    I created a trigger that fires if the last log message contains a certain string:
    Code:
    {EXAMPLELog:logrt["{$PATH}Errors",,"windows-1251",80].regexp("2700100\s+[*]+",#1)}=1
    But when I add a new device/server, or delete and recreate the template (because it's easier to duplicate macro-based triggers in a text editor than in the GUI), Zabbix not only reads the log from the very beginning but also fires the trigger for old log entries, even though the regexp uses the #1 option.
    For example, this entry fired the trigger in 2021:
    Code:
    2018-01-22 09:25:59 2700100 *** A - type remote alarm
    How can I avoid that? If this is how Zabbix parses log files, is there a workaround to check trigger conditions only for entries with a timestamp within the last day or so?

    2. And second question
    I have noisy log entries that can open and close several times a minute (like a heartbeat, but with close conditions).
    For example, 3 open/close pairs within 2 minutes (* = open, - = close):
    Code:
    2018-01-22 09:25:59 2700100 *** A - type remote alarm
    2018-01-22 09:26:10 2700100 --- A - type remote alarm
    2018-01-22 09:26:10 2700100 *** A - type remote alarm
    2018-01-22 09:26:15 2700100 --- A - type remote alarm
    2018-01-22 09:26:20 2700100 *** A - type remote alarm
    2018-01-22 09:26:50 2700100 --- A - type remote alarm
    I want the trigger to fire only on the first entry and to close only if there has been no new firing in the last N minutes.

    I created a recovery condition with the count and regexp functions:
    Code:
    {EXAMPLELog:logrt["{$PATH}Errors",,"windows-1251",80].regexp("2700100\s+[-]+",#1)}=1
    and
    {EXAMPLELog:logrt["{$PATH}Errors",,"windows-1251",80].count(5m,"2700100\s+[*]+",regexp)}=0
    So it should close when the last message is a close message and there were no new open messages within the last 5 minutes.

    BUT: if the problem was opened and closed within a few seconds, the 5-minute condition is not satisfied and the incident stays open.
    How can I count entries in the last N minutes without counting the initial entry that fired the trigger?
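
A minimal Python sketch of the recovery condition above, to show why a quick open/close pair does not recover. It is not Zabbix code: `parse` and `recovery_holds` are hypothetical helpers, the log timestamps stand in for the time each value arrives, and it assumes the expression is only checked when a new line arrives (as concluded later in this thread).

```python
from datetime import datetime, timedelta
import re

OPEN = re.compile(r"2700100\s+[*]+")
CLOSE = re.compile(r"2700100\s+[-]+")

def parse(line):
    """Split a log line into (timestamp, message)."""
    stamp, message = line[:19], line[20:]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message

def recovery_holds(history, now, window=timedelta(minutes=5)):
    """Last value is a close message AND no open message inside the window."""
    last_is_close = bool(CLOSE.search(history[-1][1]))
    opens_in_window = sum(
        1 for ts, msg in history
        if now - window < ts <= now and OPEN.search(msg)
    )
    return last_is_close and opens_in_window == 0

lines = [
    "2018-01-22 09:25:59 2700100 *** A - type remote alarm",
    "2018-01-22 09:26:10 2700100 --- A - type remote alarm",
]
history = [parse(line) for line in lines]
at_close = history[-1][0]

# Checked when the close line arrives: the open line is still in the window.
print(recovery_holds(history, now=at_close))  # False
# It would hold six minutes later, but only if something re-checks it then.
print(recovery_holds(history, now=at_close + timedelta(minutes=6)))  # True
```

In this sketch the open entry that fired the trigger is itself inside the 5-minute window at the moment the close entry arrives, so the count is 1 rather than 0.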

    P.S. Is there a way to check whether the open and close messages arrived within 1 minute of each other, and not fire anything in that case? That is, to react only to long incidents, not to conditions that open and close within a few seconds. It's all easy when I deal with SNMP and metrics that get polled every n minutes and so on, but I don't know how to handle logs correctly. And these devices can't provide anything besides logs.
    Last edited by AkvenJan; 24-06-2021, 09:52.
  • AkvenJan
    Junior Member
    • Jun 2021
    • 19

    #2
    I managed to handle both brief messages and prolonged repeated messages with these conditions:
    Creating the incident:
    Code:
    {EXAMPLELog:logrt["{$PATH}Errors",,"windows-1251",80].count(30s,"2700100\s+[*]+",regexp)}>=1
    and
    {EXAMPLELog:logrt["{$PATH}Errors",,"windows-1251",80].count(30s,"2700100\s+[-]+",regexp)}<>1
    Recovery:
    Code:
    {EXAMPLELog:logrt["{$PATH}Errors",,"windows-1251",80].count(5m,"2700100\s+[*]+",regexp)}
    <=
    {EXAMPLELog:logrt["{$PATH}Errors",,"windows-1251",80].count(5m,"2700100\s+[-]+",regexp)}
    With this logic the trigger checks whether a pair of open/close messages occurred within a 30s interval. If there is exactly one such pair, it is ignored (open count = 1 satisfies >=1, but closed count = 1 fails <>1).
    If the messages keep repeating, an incident is created (open count = x, closed count = y).
    The incident is closed once the number of open messages is <= the number of close messages.
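
A rough Python sketch of these two conditions, using the sample entries from the first post. The helpers `at`, `count`, `problem` and `recovery` are hypothetical names, the window is assumed to cover (now - window, now], and each check is assumed to happen at the moment a line arrives.

```python
from datetime import datetime, timedelta

def at(clock):
    """Build a timestamp on the sample day from an HH:MM:SS string."""
    return datetime.strptime("2018-01-22 " + clock, "%Y-%m-%d %H:%M:%S")

def count(events, now, window, kind):
    """Count events of one kind that fall in (now - window, now]."""
    return sum(1 for ts, k in events if k == kind and now - window < ts <= now)

def problem(events, now):
    """count(30s, open) >= 1 and count(30s, close) <> 1"""
    w = timedelta(seconds=30)
    return count(events, now, w, "open") >= 1 and count(events, now, w, "close") != 1

def recovery(events, now):
    """count(5m, open) <= count(5m, close)"""
    w = timedelta(minutes=5)
    return count(events, now, w, "open") <= count(events, now, w, "close")

single = [(at("09:25:59"), "open"), (at("09:26:10"), "close")]
flapping = single + [(at("09:26:10"), "open"), (at("09:26:15"), "close")]

print(problem(single, at("09:26:10")))     # False: one open, one close
print(problem(flapping, at("09:26:15")))   # True: two opens, two closes
print(recovery(flapping, at("09:26:15")))  # True: 2 <= 2
```

One thing the sketch makes visible: checked at 09:25:59, when only the open line has arrived, `problem(single, ...)` is already true (one open, zero closes), so a single pair is only suppressed if the check happens after the close line is in.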

    THE ONLY PROBLEM IS
    How do I tell Zabbix not to rescan old log entries and to fire only on entries whose timestamps fall within the given range (30s in the example)? It creates 50-100 old incidents per device, and I have about 500 devices.
    Why doesn't it look at the entries' timestamps?


    • AkvenJan
      Junior Member
      • Jun 2021
      • 19

      #3
      The timestamps are correct, as far as I can see.
      Zabbix reads the time from the old log entry and correctly puts it into the local time field.

      But why doesn't it use the local time field for the triggers?

      [Image: zabbix-time.png]


      • AkvenJan
        Junior Member
        • Jun 2021
        • 19

        #4
        Originally posted by cyber
        mode - possible values:
        all (default), skip - skip processing of older data (affects only newly created items).

        Remove the template from the host, so the agent forgets its existence. Add skip mode to the item. Add the template to the host again. It should force the agent not to read all the old entries but to start from the end, i.e. only read new lines coming in after item creation.
        Thanks, I'll try it.
        It's strange that I missed this part of the documentation.
        Maybe because log file monitoring and the log[] and logrt[] syntax are on different pages. I don't understand why the logrt[] syntax is not described on the log file monitoring page.

        The only thing left is Zabbix's strange performance with log files. Incidents get created an hour or more after the log value that meets the creation condition is received, and this log gets maybe 100 lines per day. Why the delay?


        • AkvenJan
          Junior Member
          • Jun 2021
          • 19

          #5
          Still, strange behavior
          2021-06-28 09:47:49 ALR15 2700300 ... Boost charging |Alarm panel external 14 |
          2021-06-28 09:47:48 ALR15 2700300 ??? Boost charging |Alarm panel external 14 |
          2021-06-28 09:37:49 ALR15 2700300 ... Boost charging |Alarm panel external 14 |
          2021-06-28 09:37:48 ALR15 2700300 ??? Boost charging |Alarm panel external 14 |
          I've got these 4 entries and only these.

          The recovery condition is (I removed log expression for better understanding):
          Code:
          count(15m,"2700300\s+[*?]+",regexp) <= count(15m,"2700300\s+[-.]+",regexp)
          An hour has passed and the trigger hasn't recovered. Why?
          The count of opened alarms (???) is 2
          The count of closed alarms (...) is 2
          2<=2 within 15 minutes
          Why didn't Zabbix close this alarm?
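
A small Python check of the counts claimed above, run against the four entries exactly as posted. Python's `re` is used here as a stand-in for the regexp matching in the trigger functions, which is an assumption; the variable names are made up for the sketch.

```python
import re

lines = [
    "2021-06-28 09:47:49 ALR15 2700300 ... Boost charging |Alarm panel external 14 |",
    "2021-06-28 09:47:48 ALR15 2700300 ??? Boost charging |Alarm panel external 14 |",
    "2021-06-28 09:37:49 ALR15 2700300 ... Boost charging |Alarm panel external 14 |",
    "2021-06-28 09:37:48 ALR15 2700300 ??? Boost charging |Alarm panel external 14 |",
]

opened = re.compile(r"2700300\s+[*?]+")
closed = re.compile(r"2700300\s+[-.]+")

n_open = sum(1 for line in lines if opened.search(line))
n_closed = sum(1 for line in lines if closed.search(line))

print(n_open, n_closed, n_open <= n_closed)  # 2 2 True
```

So the arithmetic holds: all four entries fit in one 15-minute window and the comparison is true. The counts alone do not settle it, though. In Zabbix a recovery expression resolves the problem only while the problem expression is false, and the expression has to be re-evaluated at a moment when both are in that state.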


          • AkvenJan
            Junior Member
            • Jun 2021
            • 19

            #6
            OK, I found a solution to all of this. It's not perfect, but it will do.
            First of all, two Zabbix-related limitations/tweaks:
            • mode skip, so that old values are not read from the logs
            Code:
            logrt["{$PATH}Errors",,"windows-1251",80,skip]
            This way I get alarms only for new entries when adding a new device
            https://www.zabbix.com/documentation...s/zabbix_agent
            • Zabbix evaluates time-related functions (like count for the last 5 minutes and so on) only when a new value arrives from the log. So you can't analyze the period after the last value (for example, checking whether a log entry was repeated in the n minutes after the last value)
            SOLUTION
            • To deal with the timestamp situation I used the nodata function. I created a separate dependent item for every alarm I needed to track (written as element in the expressions below), each extracted by a regexp. For example:
            Regular expression: Mains\sAlarm;(.+) with Mains;\1 as the output
            So I get only
            Code:
            Mains;On
            Mains;Off
            for this item
            • For cases where we get random alarms lasting 30 seconds once a day and don't want to turn them into incidents, but do want to create incidents when there is a single long alarm or several On-Off micro-alarms. Creation expression:
            Code:
            (element.str(1m,"On",like)=1 and element.nodata(1m)=1) or (element.count(1m,"On",like)>1 and element.nodata(1m)=0)
            • For cases where we want to close the incident after a single Off entry or after a series of micro-alarms has stopped
            Code:
            (element.str(5m,"Off",like)=1 and element.nodata(5m)=1)
            So in this case we get:
            1. Long-duration incidents with a single On and a single Off entry are created with a 1-minute delay and closed with a 5-minute delay
            2. An incident is created when there is more than one repeat (for alarms that fire On-Off pairs every 10-30 seconds for several hours) and closed when there have been no repeats for 5 minutes
            3. Single On-Off incidents shorter than 1 minute are ignored.
            4...
            5. PROFIT
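
A Python sketch of the approach above, to make the three cases concrete. It is a model under assumptions, not Zabbix code: the raw line "Mains Alarm;On" is a hypothetical sample, str(...) is read as "the last value contains the text", times are plain seconds, and the check is assumed to run periodically even without new values (which is what nodata makes possible).

```python
import re

PATTERN = re.compile(r"Mains\sAlarm;(.+)")

def preprocess(line):
    """Mimic the regexp step: return 'Mains;<state>' or None if no match."""
    match = PATTERN.search(line)
    return match.expand(r"Mains;\1") if match else None

def nodata(values, now, seconds):
    """1 if no value arrived in (now - seconds, now], else 0."""
    return 0 if any(now - seconds < t <= now for t, _ in values) else 1

def count_on(values, now, seconds):
    """Number of 'On' values in (now - seconds, now]."""
    return sum(1 for t, v in values if now - seconds < t <= now and "On" in v)

def last_value(values, now):
    """Most recent value received up to `now` (values are sorted by time)."""
    seen = [v for t, v in values if t <= now]
    return seen[-1] if seen else ""

def should_create(values, now):
    long_alarm = "On" in last_value(values, now) and nodata(values, now, 60) == 1
    flapping = count_on(values, now, 60) > 1 and nodata(values, now, 60) == 0
    return long_alarm or flapping

def should_recover(values, now):
    return "Off" in last_value(values, now) and nodata(values, now, 300) == 1

# (seconds since first entry, item value); the raw lines are hypothetical.
short = [(0, preprocess("Mains Alarm;On")), (20, preprocess("Mains Alarm;Off"))]
long_alarm = [(0, "Mains;On"), (600, "Mains;Off")]
flap = [(0, "Mains;On"), (10, "Mains;Off"), (20, "Mains;On"), (30, "Mains;Off")]

print(should_create(short, 30))         # False: short pair is ignored
print(should_create(long_alarm, 61))    # True: On, then silence for 1 minute
print(should_recover(long_alarm, 901))  # True: Off, then silence for 5 minutes
print(should_create(flap, 30))          # True: two On values within 1 minute
```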

            One of the problems we had with Zabbix was repeated log entries. We once had a device that flooded Zabbix with On-Off pairs about once a second. With the old logic Zabbix tried to create an incident for every pair, and since Zabbix does not cope well with a high problem creation rate, it just got stuck.


            • cyber
              Senior Member
              Zabbix Certified SpecialistZabbix Certified Professional
              • Dec 2006
              • 4806

              #7
              mode - possible values:
              all (default), skip - skip processing of older data (affects only newly created items).
              Remove the template from the host, so the agent forgets its existence. Add skip mode to the item. Add the template to the host again. It should force the agent not to read all the old entries but to start from the end, i.e. only read new lines coming in after item creation.
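
One way to picture the difference between the two modes is a reader that remembers a byte offset. The Python sketch below is only an illustration of that idea, not the agent's actual implementation; `start_offset`, `read_new_lines`, the file name and the 2021 sample line are all hypothetical.

```python
import os
import tempfile

def start_offset(path, mode="all"):
    """Offset where a newly created log item would begin reading."""
    if mode == "skip":
        return os.path.getsize(path)  # start at the current end of the file
    return 0                          # "all": start from the first byte

def read_new_lines(path, offset):
    """Return the lines found after `offset` and the new offset."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    return data.decode("utf-8").splitlines(), offset + len(data)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Errors.log")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("2018-01-22 09:25:59 2700100 *** A - type remote alarm\n")

    # The item is created now, while the file already holds an old entry.
    offset_all = start_offset(path, "all")
    offset_skip = start_offset(path, "skip")

    with open(path, "a", encoding="utf-8") as fh:
        fh.write("2021-06-24 09:52:00 2700100 *** A - type remote alarm\n")

    seen_all, _ = read_new_lines(path, offset_all)
    seen_skip, _ = read_new_lines(path, offset_skip)

print(len(seen_all), len(seen_skip))  # 2 1
```

With the "all" offset the 2018 entry is read again and can fire the trigger; with the "skip" offset only the line written after item creation is seen.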
