Zabbix Server perf spikes after 2.0.5 to 2.2.4 upgrade

  • zillions
    Junior Member
    • Jan 2013
    • 22

    #1

    Zabbix Server perf spikes after 2.0.5 to 2.2.4 upgrade

    Hi,
    We upgraded our Zabbix cluster from 2.0.5 to 2.2.4 yesterday.
    The upgrade itself went smoothly, with no issues.

    One thing we did notice, however, is that server performance looks quite different from the previous version.

    Our server performance graphs show far more regular spikes than before, and our queues, which were normally fairly empty, now show more residual delay. Items are still being processed and moving through, but it takes longer. We're also seeing similar spikes in network traffic.

    I've looked at server resources, and we're doing ok as far as I can tell. The servers have pretty good specs, and don't seem heavily utilized:
    CPU: 32 cores, about 10% utilization
    Memory: 128GB, about 53GB free

    Our running config wasn't changed at all, as we manage it with Puppet.
    Here's our config:
    ############ GENERAL PARAMETERS #################
    LogFile=/redacted/log/zabbix/zabbix_server.log
    LogFileSize=100

    PidFile=/var/run/zabbix/zabbix_server.pid

    DBHost=<redacted>
    DBName=<redacted>
    DBUser=<redacted>
    DBPassword=<redacted>
    DBPort=3306

    StartPollers=60
    StartPollersUnreachable=10
    StartTrappers=20
    StartPingers=30
    StartDiscoverers=80
    StartHTTPPollers=10

    ############ ADVANCED PARAMETERS ################
    #DisableHousekeeping=1

    CacheSize=512M
    CacheUpdateFrequency=60
    StartDBSyncers=50
    HistoryCacheSize=64M
    TrendCacheSize=128M
    HistoryTextCacheSize=128M
    Timeout=20

    Include=/etc/zabbix/conf.d/server
    AlertScriptsPath=/redacted/data/zabbix/alertscripts
    JavaGateway=localhost
    JavaGatewayPort=10052
    StartJavaPollers=10
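When comparing a config like this one against what is actually deployed, it can help to list the Start* process counts side by side. The following is only a minimal sketch, assuming the file is plain `key=value` lines with `#` comments; `parse_zabbix_conf`, `start_process_counts` and `SAMPLE` are hypothetical names, and the sample values are copied from the config above. Repeated keys (for example several `Include` lines) would overwrite each other in this simple version.

```python
def parse_zabbix_conf(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' at all
            params[key.strip()] = value.strip()
    return params


def start_process_counts(params):
    """Return only the Start* parameters, converted to integers."""
    return {
        key: int(value)
        for key, value in params.items()
        if key.startswith("Start") and value.isdigit()
    }


# Values copied from the posted config; everything else is omitted.
SAMPLE = """\
StartPollers=60
StartPollersUnreachable=10
StartTrappers=20
StartPingers=30
StartDiscoverers=80
StartHTTPPollers=10
#DisableHousekeeping=1
CacheSize=512M
StartDBSyncers=50
StartJavaPollers=10
"""

if __name__ == "__main__":
    counts = start_process_counts(parse_zabbix_conf(SAMPLE))
    # Largest process pools first, then the overall total.
    for name, number in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{name:26s}{number:4d}")
    print(f"{'total':26s}{sum(counts.values()):4d}")
```

Running the same function over the pre-upgrade and post-upgrade files and diffing the two dictionaries would be one way to confirm that nothing changed.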





    Here are the graphs:



    Ideas:
    My first thought was that housekeeping had somehow been enabled by the upgrade, but it wasn't. We partition our databases every day, and they run on their own MySQL cluster with equivalent hardware. Everything there sits on a 10-drive SSD RAID array, so response time is very fast.

    We upgraded all of the proxies too (since there is no backward/forward compatibility), and they seem to be working fine. Item values appear current on all the hosts I've spot-checked.

    I checked whether the new version introduced any required system variables/settings, but none of the new ones appear to be mandatory.

    Any ideas?
    -Zillions
  • zillions
    Junior Member
    • Jan 2013
    • 22

    #2
    One other note: I've noticed that httpd takes quite a lot of resources on the server, and the httpd processes are quite persistent. Each one seems to take a fair amount of CPU. We do have the capacity for it, but it seems higher than I'd expect.

    At the moment we only have 11 online users, yet it looks like this:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    29684 apache 20 0 398m 16m 5156 S 62.5 0.0 0:12.75 httpd
    31235 apache 20 0 302m 14m 4148 R 61.8 0.0 0:02.34 httpd
    23735 apache 20 0 400m 18m 5504 S 60.2 0.0 0:53.08 httpd
    30227 apache 20 0 398m 17m 5500 R 60.2 0.0 0:11.83 httpd
    31146 apache 20 0 397m 15m 4344 S 60.2 0.0 0:03.50 httpd
    31260 apache 20 0 299m 11m 4076 S 56.9 0.0 0:01.74 httpd
    31224 apache 20 0 399m 17m 4348 S 54.3 0.0 0:02.47 httpd
    30165 apache 20 0 407m 26m 5472 R 36.0 0.0 0:08.34 httpd
    30190 apache 20 0 400m 18m 5452 R 32.7 0.0 0:10.40 httpd
    30228 apache 20 0 398m 17m 5460 R 32.7 0.0 0:08.25 httpd
    30899 apache 20 0 400m 18m 5436 S 32.7 0.0 0:04.91 httpd
    31272 apache 20 0 299m 11m 4068 S 26.8 0.0 0:00.82 httpd
    31273 apache 20 0 299m 11m 4076 S 25.8 0.0 0:00.79 httpd
    30278 apache 20 0 398m 17m 5488 S 25.5 0.0 0:12.93 httpd
    29983 apache 20 0 398m 16m 5444 S 19.6 0.0 0:08.38 httpd
    29616 apache 20 0 399m 18m 5204 S 12.1 0.0 0:14.55 httpd
    31283 apache 20 0 301m 12m 4116 R 8.8 0.0 0:00.27 httpd
    31286 apache 20 0 300m 11m 4064 R 8.8 0.0 0:00.27 httpd

    On one of our dev environments, the httpd processes normally sit at about 0.1% CPU. When someone opens the web page they spike briefly to 2-3% and then drop right back down, whereas on this prod server they stay high like this. I'm not sure whether it's related, and CPU doesn't seem to be a bottleneck, but it's something I noticed.
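To put a number on the httpd load, the %CPU column of a listing like the one above can be summed per command. This is a rough sketch, assuming the default `top` column order shown in the header line (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND); `httpd_cpu_total` and `SAMPLE_TOP` are hypothetical names, and the sample rows are the first three from the listing.

```python
def httpd_cpu_total(top_output, command="httpd"):
    """Count matching processes and sum their %CPU values.

    Assumes the 12-column layout from the header line in the post:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    count = 0
    total = 0.0
    for line in top_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a process row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        if fields[11] == command:
            total += float(fields[8])  # %CPU is the 9th column
            count += 1
    return count, total


# First three rows of the listing above, plus the header.
SAMPLE_TOP = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29684 apache 20 0 398m 16m 5156 S 62.5 0.0 0:12.75 httpd
31235 apache 20 0 302m 14m 4148 R 61.8 0.0 0:02.34 httpd
23735 apache 20 0 400m 18m 5504 S 60.2 0.0 0:53.08 httpd
"""

if __name__ == "__main__":
    count, total = httpd_cpu_total(SAMPLE_TOP)
    print(f"{count} httpd processes, {total:.1f} %CPU summed")
```

Note that `top` usually reports %CPU relative to a single core, so on a 32-core box the summed figure has to be read against 3200%, not 100%.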


    • spidernik84
      Junior Member
      • Aug 2011
      • 17

      #3
      Hi,
      As for the httpd load, we saw much the same behavior with nginx.
      We were able to mitigate it by installing the "apc" PHP cache and restarting the web server.
      This lowered the CPU load needed to generate and serve the frontend code and also dramatically reduced the response time. No tuning was needed on Ubuntu for nginx/fpm; they were already configured.

      Regarding performance: we noticed something similar when upgrading to 2.2. We got rid of the housekeeper for history and trends completely and instead started running a manual query on the DB. The query is roughly similar to the one you find in this ticket:


      It seems the housekeeper works in a for-loop, generating a separate DB query for each item to be deleted. A single custom query that does a "batch" delete performs much faster. Essentially, we're letting the DB do its job...
      Hope this helps.
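The difference between the two deletion styles can be sketched roughly as follows. This is not the query from the ticket and not the housekeeper's actual code; it is a minimal illustration that uses an in-memory SQLite database as a stand-in for MySQL, with a cut-down `history` table (`itemid`, `clock`, `value`). The function names and the cutoff value are hypothetical.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory stand-in for a (much reduced) history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE history (itemid INTEGER, clock INTEGER, value REAL)")
    conn.executemany("INSERT INTO history VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def delete_per_item(conn, itemids, cutoff):
    """Loop style: one DELETE statement per item. Returns statements issued."""
    statements = 0
    for itemid in itemids:
        conn.execute(
            "DELETE FROM history WHERE itemid = ? AND clock < ?",
            (itemid, cutoff),
        )
        statements += 1
    conn.commit()
    return statements


def delete_batch(conn, cutoff):
    """Batch style: a single DELETE covering every item. Returns statements issued."""
    conn.execute("DELETE FROM history WHERE clock < ?", (cutoff,))
    conn.commit()
    return 1


def remaining(conn):
    """Number of rows left in the history table."""
    return conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]


if __name__ == "__main__":
    # Three items with three samples each; clock values are arbitrary.
    rows = [(i, c, 0.0) for i in (1, 2, 3) for c in (100, 200, 300)]
    loop_db, batch_db = make_db(rows), make_db(rows)
    print("loop statements:", delete_per_item(loop_db, [1, 2, 3], cutoff=250))
    print("batch statements:", delete_batch(batch_db, cutoff=250))
    print("rows left:", remaining(loop_db), remaining(batch_db))
```

Both variants leave the same rows behind; the batch form simply hands the database one statement instead of one per item. On a real installation a large single delete can hold locks for a long time, so chunking by time range may still be worth considering.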
