Bug report: Master zabbix server crashes on receipt of invalid string data

  • erozen
    Junior Member
    Zabbix Certified Specialist
    • Apr 2007
    • 18

    #1

    Bug report: Master zabbix server crashes on receipt of invalid string data

    I'd post this in the proper place, but I don't have a support contract. I figured you'd still want to know about it - feel free to delete this thread once it's in your issue tracking software.

    Version: 1.4.5 (both servers, and node)

    Setup: 2 zabbix servers in a 2-node hierarchy, at least one host being monitored by the secondary server.

    Symptoms: Primary zabbix server crashes, seemingly randomly. Happens approximately once per minute.

    Cause: The primary server is receiving history_str data from the secondary for an item that no longer exists.

    Repro: Build a 2-tier server setup, and have a node reporting some string data to the slave. Delete the item, but time it such that the item doesn't exist in either server's items table, while there is still data for it in the secondary's history_str table that hasn't yet been replicated. Upon the replication attempt, the primary will crash.
  • Niels
    Senior Member
    • May 2007
    • 239

    #2
    I've also had my server crash on faulty data from a node. I made a post about it which was ignored.


    • vinny
      Senior Member
      • Jan 2008
      • 145

      #3
      I have seen this behaviour too...
      the problem is that the zabbix server process doesn't seem to check or cast values before updating the db...
      so when the db returns an error, the server crashes

      For me, I have to truncate the history tables sometimes...
      -------
      Zabbix 1.8.3, 1200+ Hosts, 40 000+ Items...zabbix's everywhere


      • ptietjens
        Junior Member
        • May 2008
        • 5

        #4
        Ditto

        I just discovered this little bug in the 1.4.5 code. It took me some time to figure out why the server was suddenly stopping when receiving node data. Not something that fills me with confidence going forward with a server meant to ADD reliability to a network.


        • erozen
          Junior Member
          Zabbix Certified Specialist
          • Apr 2007
          • 18

          #5
          Something else that's always bothered me - if one thread dies (for whatever reason), the WHOLE server/agent shuts down! What's the rationale for this? Surely, it would make sense to attempt to restart that thread, and if the thread keeps dying, shut it down, notify the admin, and keep the rest running? I don't want my monitoring system turning off at 3am because a string was one byte longer than expected....


          In the same thread - if an item doesn't return a valid value, the server stops monitoring it. So, if I'm tweaking my config and temporarily break something - it's no longer being monitored until I turn it back on! On a couple of occasions, we've had major problems and downtime, and our ops team has asked 'why didn't we get an alert?' - it's rather embarrassing to have to say 'Our monitoring system turned that metric off at some point in the past, for some unknown reason'


          • cstackpole
            Senior Member
            Zabbix Certified Specialist
            • Oct 2006
            • 225

            #6
            Originally posted by Niels
            I made a post about it which was ignored.
            Unfortunately there are too many of those ignored posts.

            Originally posted by erozen
            Something else that's always bothered me - if one thread dies (for whatever reason), the WHOLE server/agent shuts down! What's the rationale for this? Surely, it would make sense to attempt to restart that thread, and if the thread keeps dying, shut it down, notify the admin, and keep the rest running? I don't want my monitoring system turning off at 3am because a string was one byte longer than expected....


            In the same thread - if an item doesn't return a valid value, the server stops monitoring it. So, if I'm tweaking my config and temporarily break something - it's no longer being monitored until I turn it back on! On a couple of occasions, we've had major problems and downtime, and our ops team has asked 'why didn't we get an alert?' - it's rather embarrassing to have to say 'Our monitoring system turned that metric off at some point in the past, for some unknown reason'
            THIS! I cry a bit on the inside whenever I have to say 'Our monitoring system turned that metric off at some point in the past, for some unknown reason'. Of course I do my best to figure out the why and fix it, but it is hard to catch them when you have ~70 items on each of ~30 hosts. I am currently fighting this problem as one trigger (just a single trigger, on one host that is no different than any of the other hosts and triggers) turns itself off randomly about 3 times a month. Inevitably I get an angry phone call within an hour of it turning off (it is a very chatty trigger and people know when it stops). I still don't have a proper answer for them. If this wasn't a chatty trigger I probably wouldn't ever know, which leads me to worry about the triggers that aren't chatty. Are they still running right? This is the sole reason I am building my own regression tests, which I will use ~once a month to test as many triggers as I can.

            I am of the personal opinion that the reliability should be priority #1 and that it should take an act of God (or at very least root) to take down the zabbix server. I am seeing improved stability and performance on a regular basis so I am happy at the moment.


            • erozen
              Junior Member
              Zabbix Certified Specialist
              • Apr 2007
              • 18

              #7
              I should explain my workaround/fix a little better.

              On both (all) zabbix databases, you need to run the following sql:
              Code:
              DELETE FROM history_str WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_uint WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_str_sync WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_uint_sync WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_sync WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history_log WHERE itemid NOT IN (SELECT itemid FROM items);
              DELETE FROM history WHERE itemid NOT IN (SELECT itemid FROM items);
              This will clear out all the orphan history data. In theory it should be 0 rows for every table, but in practice you'll have a lot. I had about 250,000 in my string table - so this is good housekeeping, even if nothing is wrong!

              Zabbix chaps:
              • Is it worth doing this sort of thing in the housekeeper process?
              • Are there any other SQL integrity checks like this I can do to clean out bad data?
              • Foreign key constraints and cascading deletes would really help here.
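              To illustrate the foreign key point: on an InnoDB schema, a constraint along these lines (just a sketch - the constraint name is my invention, and I believe the stock history tables aren't InnoDB) would make the database drop orphans by itself:
              Code:
              ALTER TABLE history_str
                ADD CONSTRAINT fk_history_str_itemid
                FOREIGN KEY (itemid) REFERENCES items (itemid)
                ON DELETE CASCADE;
              With that in place, deleting an item would remove its history_str rows at the same time, and there'd be nothing left to orphan.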


              Now, the only problem with those DELETE statements is that they're really slow. Especially if you have lots of items, and especially if you use mysql, which optimises "IN ()" really badly. You can help with the last one by rewriting the query to use a join instead of a subselect - but this is much easier to mistype or otherwise get wrong, so I don't recommend it. If you need to though, the syntax is:
              Code:
              DELETE h FROM history h LEFT JOIN items i ON h.itemid = i.itemid WHERE i.itemid IS NULL;

              Depending on the DBMS, you will also lock tables - so your ability to monitor servers is diminished while you do this.
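              If the locking worries you, another option (an untested sketch, MySQL syntax, using the slow-but-safe IN () form, since mysql doesn't allow LIMIT on a multi-table delete) is to delete in small batches, so each statement only holds its locks briefly:
              Code:
              -- run repeatedly until it reports 0 rows affected
              DELETE FROM history
              WHERE itemid NOT IN (SELECT itemid FROM items)
              LIMIT 10000;
              You'd repeat that for each history table. Monitoring still slows down, but in short bursts rather than one long outage.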


              • Niels
                Senior Member
                • May 2007
                • 239

                #8
                Here's an example of the two methods:
                Code:
                mysql> SELECT COUNT(*) FROM history_uint WHERE itemid NOT IN (SELECT itemid FROM items);
                +----------+
                | count(*) |
                +----------+
                |  4721997 |
                +----------+
                1 row in set (9 min 34.17 sec)
                
                mysql> SELECT COUNT(*) FROM history_uint LEFT JOIN items ON history_uint.itemid=items.itemid WHERE items.itemid IS NULL;
                +----------+
                | COUNT(*) |
                +----------+
                |  4722488 |
                +----------+
                1 row in set (1 min 24.25 sec)
                The numbers don't add up, so maybe the two queries don't match.


                • erozen
                  Junior Member
                  Zabbix Certified Specialist
                  • Apr 2007
                  • 18

                  #9
                  Interesting.

                  I wonder if there are still new values going in there? The fact that the numbers differ by only 500 out of 4 million I find... suspicious. :-) What does a vanilla count(*) give you?

                  I ran it against my database, and got this:
                  Code:
                  mysql> SELECT COUNT(*) FROM history_uint WHERE itemid NOT IN (SELECT itemid FROM items);
                  +----------+
                  | COUNT(*) |
                  +----------+
                  |   114938 |
                  +----------+
                  1 row in set (12 min 26.51 sec)
                  
                  mysql> SELECT COUNT(*) FROM history_uint LEFT JOIN items ON history_uint.itemid=items.itemid WHERE items.itemid IS NULL;
                  
                  +----------+
                  | COUNT(*) |
                  +----------+
                  |   114938 |
                  +----------+
                  1 row in set (3 min 49.86 sec)
                  
                  mysql> SELECT COUNT(*) FROM history_uint;
                  +----------+
                  | COUNT(*) |
                  +----------+
                  | 15683505 |
                  +----------+
                  1 row in set (3 min 21.01 sec)
                  
                  mysql>

                  Could you try either adding a datetime restriction (don't forget, the housekeeper runs as well, so it needs a lower and an upper bound), or wrapping both queries in a 'LOCK TABLE/UNLOCK TABLE' or 'BEGIN WORK/ROLLBACK WORK', depending on your engine?
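                  For the LOCK TABLE option, I mean something like this (a sketch for MyISAM-style explicit locks - both tables have to be locked, since mysql only lets you touch locked tables inside the block):
                  Code:
                  LOCK TABLES history_uint READ, items READ;
                  SELECT COUNT(*) FROM history_uint WHERE itemid NOT IN (SELECT itemid FROM items);
                  SELECT COUNT(*) FROM history_uint LEFT JOIN items ON history_uint.itemid=items.itemid WHERE items.itemid IS NULL;
                  UNLOCK TABLES;
                  That way nothing can insert rows between the two counts, so they should agree.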


                  As a side note - I'm actually rather upset that I have 115,000 bad rows, only a matter of weeks since I cleaned them out! I don't *think* I've changed any items since I spring-cleaned, but....


                  • Niels
                    Senior Member
                    • May 2007
                    • 239

                    #10
                    I made a simple restriction on the clock field:
                    Code:
                    mysql> select count(*) FROM history_uint WHERE clock<1213177652 AND itemid NOT IN (SELECT itemid FROM items);
                    +----------+
                    | count(*) |
                    +----------+
                    |  4724668 |
                    +----------+
                    1 row in set (4 min 37.80 sec)
                    
                    mysql> SELECT COUNT(*) FROM history_uint LEFT JOIN items ON history_uint.itemid=items.itemid WHERE history_uint.clock<1213177652 AND items.itemid IS NULL;
                    +----------+
                    | COUNT(*) |
                    +----------+
                    |  4724668 |
                    +----------+
                    1 row in set (1 min 27.73 sec)
                    This would seem to indicate that faulty data are still being added, but we'll know that for sure when I remove the currently orphaned data.


                    • Niels
                      Senior Member
                      • May 2007
                      • 239

                      #11
                      OK, I've been looking through the history tables, and it looks like I'll be able to remove "millions and millions" of records. That's certainly enticing, I just need to build up some courage before I go through with it...


                      • erozen
                        Junior Member
                        Zabbix Certified Specialist
                        • Apr 2007
                        • 18

                        #12
                        I think a little mysqldump might be in order beforehand... :-)
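                        Or, if a full dump is too big, you could stash just the rows you're about to delete (a sketch - the _orphans table name is my invention):
                        Code:
                        CREATE TABLE history_uint_orphans AS
                          SELECT h.*
                          FROM history_uint h
                          LEFT JOIN items i ON h.itemid = i.itemid
                          WHERE i.itemid IS NULL;
                        If anything goes wrong you can INSERT...SELECT them straight back; otherwise just drop the table once you're happy.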

                        I would be very interested to see some input from the zabbix guys on this - they may have some good ideas on other ways to deal with it, non-obvious dangers, potential for serious damage, performance etc etc


                        Good luck with your expunging - let us know how it works out once you've plucked up the courage (I'd be tempted by the Dutch variety myself)!


                        • qix
                          Senior Member
                          Zabbix Certified Specialist
                          Zabbix Certified Professional
                          • Oct 2006
                          • 423

                          #13
                          Could this issue be related?

                          With kind regards,

                          Raymond


                          • erozen
                            Junior Member
                            Zabbix Certified Specialist
                            • Apr 2007
                            • 18

                            #14
                            Wow, that's a curly one. It could well be related - sounds like the kind of thing you could spend weeks trying to debug...

                            Are you able to reproduce it at all?


                            • qix
                              Senior Member
                              Zabbix Certified Specialist
                              Zabbix Certified Professional
                              • Oct 2006
                              • 423

                              #15
                              Nope, I can't anymore.

                              The whole distributed nodes thing was way too buggy for my taste, so I dropped the whole concept. I reinstalled my nodes as two standalone servers.

                              I think I'll try distributed nodes again when 1.6 is released.
                              I will probably go for the 'ZABBIX proxy' setup, where a 'dumb' box collects data and sends it to the central server.
                              With kind regards,

                              Raymond
