Server fails regularly after 1.6 to 1.8 upgrade

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • anrstone
    Member
    • Oct 2009
    • 61

    #1

    Server fails regularly after 1.6 to 1.8 upgrade

    So the situation is that we upgraded from 1.6 to 1.8, and everything that had been working seemed to work in the new version, apart from a few odd things that I posted about at the time. Since the upgrade we have had regular (sometimes daily) problems where Zabbix Server appears to fail. We believe this is a result of the DB becoming massively overloaded, which in turn takes the server down completely (and I don't just mean the Zabbix Server application: whatever fails takes the OS out as well).

    To be fair, we are monitoring a lot on the server and I guess we could move that around, but most of our estate is actively monitored, not passively. We have about 150 hosts and are monitoring around 3600 parameters. The system has a separate DB server and Zabbix host, as follows:

    DB Server: Dell server running Windows Server 2003 with 2 x 3GHz dual-core Xeon processors and 4GB RAM. Runs Postgres 8.3

    Zabbix Server: Dell server running Debian Lenny with 1 x 3.2GHz quad-core Xeon processor and 2GB RAM. Runs Zabbix 1.8.1

    The Zabbix Server runs the zabbix server application, one instance of the agent and the web interface.

    Clearly more information is needed to help, but I'm not sure what to post, so please let me know. Lastly, will 1.8.3 fix some of the above problems? I have read that similar-sounding issues have been occurring on Oracle-based systems.

    Thanks
  • anrstone
    Member
    • Oct 2009
    • 61

    #2
    So we've turned off a shed load of monitored items and we have just about got the DB server to perform within limits. Clearly the easiest option is to downgrade from 1.8 back to 1.6 and stay there until the performance issues are resolved. Before we do that does anyone have any thoughts on what might be going on?

    So after some reasonably lengthy analysis today, we believe the problem is firmly DB related because:

    (1) CPU load on the Zabbix Server never exceeds 10%
    (2) Memory usage on the Zabbix Server never exceeds 30%
    (3) CPU load on the DB Server is around 30-40%
    (4) Disk load on the DB Server is around 98% (we're running SCSI drives here, so these are not slow at all!)

    The query performance is as follows:

    Average: 23 inserts / second, 31 updates / second, 95 table scans / second (scanning a total of 1 million rows) and around 950 index scans / second. I'm not sure what Postgres is capable of, but these figures don't seem too bad.
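
    For anyone wanting to reproduce figures like these, here is a rough Python sketch of how they could be derived from two snapshots of PostgreSQL's cumulative counters in the pg_stat_user_tables view. The helper names, the 60-second sampling window and the sample numbers are all hypothetical, and it reads "1 million rows" as rows scanned per second, which is an assumption. It also works out how many rows an average table scan touches, which is a quick way to spot large tables being scanned without an index.

```python
# Sketch only: turns two snapshots of PostgreSQL's cumulative statistics
# counters into per-second rates and an average rows-per-table-scan figure.

# A query one could run to take a snapshot. pg_stat_user_tables is a standard
# PostgreSQL statistics view; running it is left out so the sketch needs no DB.
SNAPSHOT_SQL = """
SELECT sum(seq_scan), sum(seq_tup_read), sum(idx_scan),
       sum(n_tup_ins), sum(n_tup_upd)
FROM pg_stat_user_tables;
"""

COUNTERS = ("seq_scan", "seq_tup_read", "idx_scan", "n_tup_ins", "n_tup_upd")


def rates(before, after, seconds):
    """Return the per-second rate of each counter between two snapshots."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {name: (after[name] - before[name]) / seconds for name in COUNTERS}


def rows_per_seq_scan(rate):
    """Average number of rows read by one sequential (table) scan."""
    if rate["seq_scan"] == 0:
        return 0.0
    return rate["seq_tup_read"] / rate["seq_scan"]


# Hypothetical snapshots taken 60 seconds apart, chosen to match the averages
# quoted in the post (assumption: 1 million rows scanned per second).
WINDOW = 60
before = dict.fromkeys(COUNTERS, 0)
after = {
    "seq_scan": 95 * WINDOW,
    "seq_tup_read": 1_000_000 * WINDOW,
    "idx_scan": 950 * WINDOW,
    "n_tup_ins": 23 * WINDOW,
    "n_tup_upd": 31 * WINDOW,
}

per_second = rates(before, after, WINDOW)
print(per_second)
print("rows per table scan:", round(rows_per_seq_scan(per_second)))
```

    Under that reading, each table scan touches roughly ten thousand rows on average, so the per-table breakdown (the same view without the sums) would be the next thing to look at.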

    The key for us is that until upgrading we had not changed any default template settings. Now that we've upgraded, we've been forced to massively reduce the amount of data we're collecting just to try to keep Zabbix alive. Can anybody suggest what's going on in 1.8.1 that wasn't happening in 1.6.6?

    Thanks
    Last edited by anrstone; 28-04-2010, 18:22.


    • anrstone
      Member
      • Oct 2009
      • 61

      #3
      So the only solution I found to this was to re-install the whole thing from scratch and lose the history. This seems to have solved the problem, with everything looking a lot more stable. We have taken the opportunity to install a clean version of 1.8.2, but we have one problem remaining: we now cannot send email notifications from Zabbix. We've checked the server and we certainly can send via the command line, so the problem looks firmly in the Zabbix configuration or app...


      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        I have 3 environments of 1.8.2 running. Prod was a 1.6.6 --> 1.8.2 upgrade, as was our test environment. Our Dev environment was a fresh install of 1.8.2. All instances are sending alert notifications fine.


        • anrstone
          Member
          • Oct 2009
          • 61

          #5
          I can't tell for certain, but it looks like the real issue may well be related to the size of the DB, which is pretty large now. I haven't really got the time to confirm this at the moment, but if I do I'll post some notes up.
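
          As a rough way to see how quickly the history can grow, here is a back-of-the-envelope Python sketch. The item count is the "around 3600 parameters" from the first post; the 60-second polling interval and 90-day retention are purely hypothetical placeholders, not our real settings, and the function names are made up for illustration.

```python
# Sketch only: estimates how many history rows a monitoring setup stores.
# ASSUMED_INTERVAL_SECONDS and ASSUMED_RETENTION_DAYS are hypothetical values.

SECONDS_PER_DAY = 86400


def history_rows_per_day(items, interval_seconds):
    """Rows written per day if every item is polled once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return items * SECONDS_PER_DAY // interval_seconds


def history_rows_retained(items, interval_seconds, retention_days):
    """Rows held at steady state when history is kept for retention_days."""
    return history_rows_per_day(items, interval_seconds) * retention_days


ITEMS = 3600                     # "around 3600 parameters" from post #1
ASSUMED_INTERVAL_SECONDS = 60    # hypothetical polling interval
ASSUMED_RETENTION_DAYS = 90      # hypothetical history retention

per_day = history_rows_per_day(ITEMS, ASSUMED_INTERVAL_SECONDS)
retained = history_rows_retained(
    ITEMS, ASSUMED_INTERVAL_SECONDS, ASSUMED_RETENTION_DAYS
)
print(f"{per_day:,} rows/day, {retained:,} rows retained")
```

          Plugging in the real intervals and retention per item would give a better figure, but even with these placeholder values the history tables reach hundreds of millions of rows.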


          • tchjts1
            Senior Member
            • May 2008
            • 1605

            #6
            What is it telling you under Administration --> Audit --> dropdown "Actions"?
            Select the right zoom view to get data.


            • anrstone
              Member
              • Oct 2009
              • 61

              #7
              Apart from an issue with an SMTP server that was resolved, there is nothing in there indicating any actions to be fixed, which is as I'd expect, I think. Certainly our new clean install has none. There is certainly nothing around the time the server fails, nor is there anything in any of the server logs.

              I have now changed the server and installed a new daemon to keep sshd in memory and stop it from being paged out in the event of a crash, but I did that at the same time as the clean install, and we've not had issues since.
