Obtaining I/O ops, RAM, CPU details for zabbix server

  • zabbixfk
    Senior Member
    • Jun 2013
    • 256

    #1

    Obtaining I/O ops, RAM, CPU details for zabbix server

    Hello All,

    I am tuning our Zabbix server for benchmarking and sizing, and as part of that I need to find out how much I/O, RAM and CPU each Zabbix server session uses.

    I am using MySQL as the database server, running on CentOS 6.4.

    I was running 2 Zabbix proxy servers and 1 master server, and have now changed to 1 proxy and 1 master (the majority of the systems connect to the master, and location-based systems connect to the Zabbix proxy server).

    I had encountered slowness while accessing the Zabbix UI. That is somehow getting better, and I can see an improvement after playing around with templates and MySQL configs (mainly commenting out sync_binlog=1).

    I am sure others have tried to benchmark a Zabbix server, so please share your inputs too.

    Please help, guide or point me in the right direction on the questions below:

    a). How can I check how much RAM/CPU/I/O each Zabbix server session is using?
    b). How much RAM/CPU does each open web UI session take?
    c). How does adding new devices/templates or creating new users affect performance?
    d). I enabled debug logs, but they produce so much output that I am finding it difficult to handle.
    Is there any tool available to calculate this? (In addition to tuning, I am also looking at the sizing of the server.)

    I have written the script below to log free RAM, CPU load and I/O wait, but I am looking for something beyond this. Any pointers on improving/understanding things would be a great help.

    Code:
    # Number of decimal places in the percentages
    sc=4
    # Sleep duration between results (seconds)
    st=10
    # Disk to watch
    dev=sda
    
    while :; do
        # Use the second of two 1-second reports; the first one only shows averages since boot.
        # Fields 4, 5 and 10 of the device line are r/s, w/s and await in this iostat layout.
        iostat -xkd "/dev/$dev" 1 2 | awk -v d="$dev" -v sc="$sc" '
            $1 == d { rs = $4; ws = $5; wt = $10 }
            END {
                t = rs + ws                 # stays 0 on an idle disk, so guard the division
                fmt = "READ/WRITE/WAIT  ->%." sc "f | %." sc "f | %s\n"
                printf fmt, (t ? rs / t * 100 : 0), (t ? ws / t * 100 : 0), wt
            }'
        # Free memory as a percentage of total memory
        freeRAM=$(free | awk '/^Mem:/ { printf "%.4f", $4 / $2 * 100 }')
        # 1-minute load average; /proc/loadavg does not depend on the uptime output layout
        cpuLOAD=$(cut -d' ' -f1 /proc/loadavg)
        echo "FREE RAM: $freeRAM"
        echo "CPULoad : $cpuLOAD"
        echo; echo; echo
        sleep "$st"
    done
    I know it's a lot to ask, but any pointers would be a great help.

    Below are some details about the setup.
    System Details:
    1). OS : CentOS 6.4
    2). RAM : 36G
    3). CPU : Intel Xeon(R) E5606 @ 2.13GHz (8 cores)
    4). HDD : 500GB (RAID 5)
    5). MySQL : 5.1.69-log Source distribution
    6). Web server : httpd ( Apache/2.2.15 (Unix) )

    Zabbix server details:
    1). Zabbix server version : 2.0.6
    2). Number of hosts (monitored/not monitored/templates) : 646 : 563/28/55
    3). Number of items (monitored/disabled/not supported) : 20141 : 18754/661/726
    4). Number of triggers (enabled/disabled)[problem/unknown/ok] : 1904 : 1558/346 [20/0/1538]
    5). Number of Users ( online) : 45 : 40
    6). Required server performance, new value per second : 84.41

    MySql Configuration details
    Code:
    [mysqld]
    datadir=/var/lib/mysql
    socket=/var/lib/mysql/mysql.sock
    user=mysql
    symbolic-links=0
    #Slow query log details
    slow_query_log_file=/var/log/mysql/slow-query.log
    long_query_time = 30
    log-queries-not-using-indexes=YES
    innodb_lock_wait_timeout=500
    innodb_locks_unsafe_for_binlog=1
    expire_logs_days=5
    max_binlog_size=100M
    innodb_buffer_pool_size=20G
    innodb_file_per_table
    max_allowed_packet = 8M
    innodb_additional_mem_pool_size = 30M
    innodb_thread_concurrency = 8
    key_buffer_size = 60M
    max_connections=290
    table_cache=4096
    query_cache_size = 96M
    tmp_table_size = 32M
    thread_cache_size = 64
    sort_buffer_size = 12M
    query_cache_limit=64M
    thread_cache_size=512
    read_buffer_size=2M
    read_rnd_buffer_size=8M
    join_buffer_size=8M
    
    #sync_binlog=1
    Zabbix server conf file
    Code:
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=50
    DebugLevel=4
    PidFile=/var/run/zabbix/zabbix_server.pid
    DBName=zabbix
    DBUser=zabbix
    DBPassword=zabbix
    StartPollers=45
    StartPollersUnreachable=7
    StartTrappers=45
    StartPingers=45
    StartDiscoverers=3
    StartSNMPTrapper=1
    ListenIP=0.0.0.0
    HousekeepingFrequency=1
    MaxHousekeeperDelete=500
    SenderFrequency=300
    CacheSize=1G
    CacheUpdateFrequency=300
    StartDBSyncers=4
    HistoryCacheSize=128M
    TrendCacheSize=1G
    HistoryTextCacheSize=128M
    Timeout=30
    TrapperTimeout=120
    UnreachablePeriod=600
    UnavailableDelay=120
    AlertScriptsPath=/etc/zabbix/alert.d/
    FpingLocation=/usr/local/sbin/fping
    LogSlowQueries=1
  • tchjts1
    Senior Member
    • May 2008
    • 1605

    #2
    Use your Linux OS template to get CPU, memory, IO, etc stats from your Zabbix server.

    I recall the UI slowness from version 2.0.6. If you move to 2.0.9, I think you will see that it is resolved. You have an NVPS of about 84. Ours is currently just over 1,000 and the GUI, graphs and screens are lightning fast.

    See the bottom of this post and the graphs that follow it for ways to improve Zabbix internal tuning: https://www.zabbix.com/forum/showthread.php?t=41219


    • zabbixfk
      Senior Member
      • Jun 2013
      • 256

      #3
      Obtaining I/O ops, RAM, CPU details for zabbix server : For Benchmarking

      Thanks for the reply. Very sorry that I did not convey my message properly.

      I have already linked that template and am getting the CPU/RAM data.

      In the process of benchmarking and sizing (as my device count will grow, and thereby items/triggers and NVPS as well), I was looking to find out:

      a). How much CPU/RAM/IO each Zabbix server process (zabbix_server) takes (any formulas/commands/scripts to calculate this).
      b). How much CPU/RAM/IO is used when a device is added to the Zabbix server from the UI.
      c). How much load each UI session (a user who logs in to the Zabbix server from a browser) takes, as I need to add more users and most of them will be online.

      I am also posting some of the graphs, as you mentioned in your reply.
      Thanks.
      [IMG]http://s30.postimg.org/ru8glhxkx/Screen_Shot_2014_04_23_at_6_12_01_PM.png[/IMG]




      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        Your unreachable poller process is running very high. In zabbix_server.conf on your Zabbix server, try increasing StartPollersUnreachable by 5 (from 7 to 12) and restarting your Zabbix server process.

        Your housekeeper process also looks a bit strange to me. What are your settings for that? I run mine every 1 hour with a maxdelete of 500. With that setting, it runs for about 10 minutes every hour with a very predictable pattern. Can you show that same graph for a 1 day (24 hour) period instead of a 7 day period?

        And also, as I mentioned, if you have the ability to upgrade to version 2.0.9, I think you will see much better performance in the GUI.


        • jan.garaj
          Senior Member
          Zabbix Certified Specialist
          • Jan 2010
          • 506

          #5
          10% CPU iowait is not good. It can be acceptable if you have a slow HDD and a very fast CPU, but I think you have a slow HDD and a lot of IOPS (try disabling DebugLevel=4). Also, some standard CPU usage metrics are missing. Why? Standard Linux tools can help you find your bottleneck:
          iostat -xd..., mpstat -P ALL, top/htop, uptime (load), mysqltuner.pl, ....
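As one way to apply these tools to the per-process question, here is a minimal sketch that totals CPU and resident memory over all zabbix_server processes. The function name sum_usage and the output format are made up for illustration; the input is assumed to be "pid pcpu rss" lines as printed by procps ps, with rss in KiB.

```shell
#!/bin/sh
# Hypothetical helper: total %CPU and RSS from "pid pcpu rss" lines on stdin.
sum_usage() {
    awk '
        NF >= 3 { n++; cpu += $2; rss += $3 }
        END { printf "procs=%d cpu=%.1f rss_mb=%.1f\n", n, cpu, rss / 1024 }'
}

# Typical use on the Zabbix server (procps ps):
#   ps -C zabbix_server -o pid=,pcpu=,rss= | sum_usage
# Per-process disk I/O counters (root may be needed) are in /proc/<pid>/io:
#   cat /proc/<pid>/io
```

Two caveats: pcpu in ps is averaged over the lifetime of each process, so top is better for a live view, and RSS counts shared memory (such as the Zabbix caches) once per process that touches it, so the total overstates real usage.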
          Devops Monitoring Expert advice: Dockerize/automate/monitor all the things.
          My DevOps stack: Docker / Kubernetes / Mesos / ECS / Terraform / Elasticsearch / Zabbix / Grafana / Puppet / Ansible / Vagrant


          • tchjts1
            Senior Member
            • May 2008
            • 1605

            #6
            Originally posted by jan.garaj
            10% CPU iowait is not good - it can be good if you have slow hdd and super fast CPU, ....
            Speaking of IO Wait, we had a similar issue because of the swappiness setting in Linux. Check this post I wrote up: https://www.zabbix.com/forum/showthread.php?t=38575

            Whether this applies in this case or not, I don't know. But it is worth being informed about possible solutions.
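For reference, checking and changing the setting looks roughly like this. The value 10 is only an example (it is the value the original poster reports trying further down), and swappiness_line is a hypothetical helper name.

```shell
#!/bin/sh
# Show the current value; 60 is the usual kernel default.
show_swappiness() {
    cat /proc/sys/vm/swappiness
}

# Hypothetical helper: print the line to add to /etc/sysctl.conf
# so that the value survives a reboot.
swappiness_line() {
    printf 'vm.swappiness = %d\n' "$1"
}

swappiness_line 10
# To change it at runtime (as root): sysctl -w vm.swappiness=10
```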


            • zabbixfk
              Senior Member
              • Jun 2013
              • 256

              #7
              Thank you all for the replies.

              I will work on the tuning.
              - I am also a bit confused that the CPU metrics are not displayed on the Zabbix graphs (I have used the Zabbix server template). I will debug this further.

              - I will try again after lowering DebugLevel (it looks like too much log writing may be causing the disk wait).

              - I was looking at the top/iostat/uptime commands and also wrote a small script (please check the first message of this thread), but I am not able to diagnose the results/issues. We actually changed the HDD after seeing more than 12% iowait (from iostat output).

              - Sorry, I am not in a position to take downtime and upgrade to 2.0.9 (I see that the latest is 2.2.x). Moreover, the database size is close to 22G (MySQL), and I am not sure the automatic upgrade scripts in 2.2.x can handle that.

              - Good point about the swappiness setting. It was 60 (the default) and I have changed it to 10 now. Let's see how it goes (after changing this, iostat -xkcd shows 8.93% iowait).

              - The housekeeper settings are 1 and 500 (run every 1 hour, 500 deletes). During this process I can see the CPU shooting up.

              - Thanks again for all the replies. Sorry to deviate, but I wanted to find out more on the benchmarking part: what I was looking at was sizing. Assuming double the existing device count, what kind of server would I need in future with respect to CPU/RAM etc., based on measuring what each zabbix_server process and each Zabbix Apache process takes? That said, all the replies are a great help for tuning and understanding the Zabbix server, and for finding out what I am doing wrong.

              Thanks.


              • jan.garaj
                Senior Member
                Zabbix Certified Specialist
                • Jan 2010
                • 506

                #8
                12% iowait (from iostat output) is IMHO not OK. Find the reason: http://superuser.com/questions/50091...gnu-linux-base
                Your server sizing looks perfect for 100 NVPS, so keep investigating.

                • pc99096
                  Senior Member
                  • Oct 2011
                  • 193

                  #9
                  There might be a performance problem with a slow HDD; the other hardware specifications look OK. I would try to decrease the history/trend interval in the templates.


                  • zabbixfk
                    Senior Member
                    • Jun 2013
                    • 256

                    #10
                    Thank you all for the replies.

                    I spent a couple of hours analysing disk performance with both iostat and iotop, but was unable to come to a conclusion. Maybe my interpretation of the results is just bad.
                    iotop output:


                    I was told
                    Code:
                    jbd2/dm-2-8
                    is the disk controller, so it is okay for it to reach values close to 90% in the IO column (though it sounded scary to me).

                    IOSTAT output
                    Code:
                    Linux 2.6.32-358.18.1.el6.x86_64 (zabbixServer) 	04/28/2014 	_x86_64_	(8 CPU)
                    
                    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
                               4.30    0.00    2.04    9.09    0.00   84.57
                    
                    Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
                    sda               0.53   459.67    4.26  147.04   135.76  2359.41    32.98     1.07    7.06   4.99  75.53
                    SAR output. ( sar -p -d)
                    Code:
                                          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
                    Average:          sda    144.12    144.56   4036.38     29.01     10.57     73.38      6.24     89.96
                    Average:    VolGroup-lv_root     14.23      4.66    133.13      9.68      2.19    153.69      8.03     11.43
                    Average:    VolGroup-lv_swap      0.08      0.67      0.00      8.00      0.00     28.03     15.46      0.13
                    Average:    VolGroup-lv_var    490.61    139.22   3903.25      8.24     45.62     92.99      1.83     89.75
                    From this link, I used the formula below to calculate IOPS:
                    Code:
                     IOPS = d * dIOPS / ( %r + ( F * %w ) )
                    Where,
                    d = number of disks,
                    dIOPS = IOPS per disk (for 7.2K rpm, values range between 75-100; I chose 80),
                    %r = fraction of read workload ( rd_sec / ( rd_sec + wr_sec ) ), from the sar output,
                    %w = fraction of write workload ( wr_sec / ( rd_sec + wr_sec ) ),
                    F = RAID write penalty; for RAID 5 it is 4 for writes (reads count as 1).
                    I got 82 as the value, which is a lot less than I thought. I was expecting 200+.
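The arithmetic behind that figure can be reproduced with a short sketch. It assumes the usual form of the formula, with the write penalty multiplied by the write fraction, and it assumes d = 4 disks, which is not stated in the post but is the value that gives back 82 with the sar averages for sda. The function name raid_iops is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimated array IOPS.
# Arguments: disks, IOPS per disk, rd_sec/s, wr_sec/s, RAID write penalty.
raid_iops() {
    awk -v d="$1" -v per="$2" -v rd="$3" -v wr="$4" -v f="$5" 'BEGIN {
        total = rd + wr
        r = rd / total            # read fraction of the workload
        w = wr / total            # write fraction of the workload
        printf "%.0f\n", d * per / (r + f * w)
    }'
}

# sda averages from the sar output: rd_sec/s = 144.56, wr_sec/s = 4036.38
# d = 4 is an assumption; 80 IOPS per disk and F = 4 are from the post.
raid_iops 4 80 144.56 4036.38 4    # prints 82
```

With about 97% of the traffic being writes, nearly every request pays the RAID 5 write penalty, which is why the result is so far below the raw 4 * 80 = 320.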

                    It would be great if somebody can guide / share your thoughts/pointers.

                    Thanks.


                    • jan.garaj
                      Senior Member
                      Zabbix Certified Specialist
                      • Jan 2010
                      • 506

                      #11
                      Your Zabbix server's required performance is 84.41 new values per second.
                      That is 84.41*90 ~= 7.6 kb ~= 1 kB/s
                      1 kB/s + trend data ~= 2 kB/s
                      Zabbix needs to write about 2 kB/s of data to the DB, but the MySQL server writes >2 MB/s to the hard disk.
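The same estimate can be written out as a small calculation. This is only a sketch: it takes 90 as bytes per stored value and treats iostat's kB as 1024 bytes, and the function names are made up. Taken literally, 84.41 * 90 bytes comes to roughly 7.6 kB/s, somewhat above the rounded figure in the post, but still hundreds of times below what the disk is actually writing.

```shell
#!/bin/sh
# Hypothetical helper: expected history write rate in bytes per second.
# Arguments: new values per second, bytes per value.
expected_bps() {
    awk -v nvps="$1" -v size="$2" 'BEGIN { printf "%.0f\n", nvps * size }'
}

# Hypothetical helper: how many times larger the observed write rate is.
# Arguments: observed wkB/s from iostat, expected bytes per second.
write_ratio() {
    awk -v kb="$1" -v bps="$2" 'BEGIN { printf "%.0f\n", kb * 1024 / bps }'
}

bps=$(expected_bps 84.41 90)      # 7597
write_ratio 2359.41 "$bps"        # prints 318
```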

                      The average queue size (avgqu-sz) for your /var (VolGroup-lv_var) is 45.62. That is a terrible value :-)
                      But I don't understand:
                      - why sda in the iostat output has a "normal" avgqu-sz of 1.07
                      - why the controller has 58% IOPs
                      What happens to the iotop stats if you stop MySQL and Zabbix? Is your RAID/LVM healthy?

                      In my view, you should minimize I/O operations from MySQL. Do not use, or disable if possible:
                      - query log
                      - binary logs

                      • zabbixfk
                        Senior Member
                        • Jun 2013
                        • 256

                        #12
                        Thank you for the quick reply.

                        I did not understand your calculation of the 2K/s Zabbix write rate. Can you please elaborate?

                        I am not sure why this controller (I guess you are referring to jbd2/dm-2-8) is shooting up on I/O. I did some googling but am still not able to figure it out.

                        Since it is a production system, I am not in a position to stop Zabbix/MySQL :'(

                        I have disabled the query log (no change in the iotop output), but I can't disable the binary log, because this Zabbix server serves as the master in a master/slave replication of the MySQL DB to another server.

                        RAID/LVM looks healthy, as vgscan/lvmdiskscan didn't complain (maybe I am using the wrong commands to check?)
                        Code:
                        vgdisplay --verbose |grep PV |grep Name
                            Finding all volume groups
                            Finding volume group "VolGroup"
                          PV Name               /dev/sda2
                        Code:
                        vgscan 
                          Reading all physical volumes.  This may take a while...
                          Found volume group "VolGroup" using metadata type lvm2
                        Code:
                        lvmdiskscan 
                          /dev/ram0             [      16.00 MiB] 
                          /dev/root             [      50.00 GiB] 
                          /dev/ram1             [      16.00 MiB] 
                          /dev/sda1             [     500.00 MiB] 
                          /dev/VolGroup/lv_swap [       5.88 GiB] 
                          /dev/ram2             [      16.00 MiB] 
                          /dev/sda2             [     464.76 GiB] LVM physical volume
                          /dev/VolGroup/lv_var  [     408.88 GiB] 
                          /dev/ram3             [      16.00 MiB] 
                          /dev/ram4             [      16.00 MiB] 
                          /dev/ram5             [      16.00 MiB] 
                          /dev/ram6             [      16.00 MiB] 
                          /dev/ram7             [      16.00 MiB] 
                          /dev/ram8             [      16.00 MiB] 
                          /dev/ram9             [      16.00 MiB] 
                          /dev/ram10            [      16.00 MiB] 
                          /dev/ram11            [      16.00 MiB] 
                          /dev/ram12            [      16.00 MiB] 
                          /dev/ram13            [      16.00 MiB] 
                          /dev/ram14            [      16.00 MiB] 
                          /dev/ram15            [      16.00 MiB] 
                          3 disks
                          17 partitions
                          0 LVM physical volume whole disks
                          1 LVM physical volume
                        And this has journal enabled,
                        Code:
                        tune2fs -l /dev/mapper/VolGroup-lv_var  | grep has_journal
                        Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
                        When I checked for processes in 'uninterruptible sleep (IO)', jbd2/dm-2-8 (PID 1457) showed up most often, so
                        Code:
                        cat /proc/1457/io 
                        rchar: 0
                        wchar: 0
                        syscr: 0
                        syscw: 0
                        read_bytes: 0
                        write_bytes: 49116737536
                        cancelled_write_bytes: 0
                        I was very surprised to see that rchar and read_bytes are 0!
                        Code:
                        hdparm -tT /dev/sda2
                        
                        /dev/sda2:
                         Timing cached reads:   10770 MB in  2.00 seconds = 5391.83 MB/sec
                         Timing buffered disk reads:  342 MB in  3.01 seconds =  113.59 MB/sec
                        Any pointers are greatly helpful.

                        Thanks


                        • Navern
                          Member
                          • May 2013
                          • 33

                          #13
                          To check your RAID health you need the specific utility for your RAID controller. For example, if you use an Adaptec RAID controller, you can use the arcconf utility to check the health of your hardware RAID.


                          • jan.garaj
                            Senior Member
                            Zabbix Certified Specialist
                            • Jan 2010
                            • 506

                            #14
                            Originally posted by zabbixfk
                            I did not understand your calculation of the 2K/s Zabbix write rate. Can you please elaborate?
                            One numeric value requires about 90 bytes in the database. I used this, plus your required new values per second, plus some additional space for trends, to get my estimate of 2K/s of data writes (if you monitor only numeric values; if you monitor logs, it will of course be more).

                            Sorry, I don't have deep knowledge about disks, so I can't check your disk outputs.

                            My opinion:
                            - the disk is overloaded (disk queue of 45 for /var)
                            - the write load from MySQL is unexpectedly high; my expectation is 2K for the data file + 2K for the bin log + overhead ~= 100KB/s (not >2MB/s)

                            Try checking your DB server with mysqltuner.pl
                            Check what the DB is doing (SHOW FULL PROCESSLIST), its status (SHOW STATUS), ...
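To turn the process list into something countable, a sketch like the following can help. It assumes the tab-separated batch output of the mysql client with the standard SHOW FULL PROCESSLIST columns (Id, User, Host, db, Command, Time, State, Info); summarize_processlist is a made-up name, and credentials are left out.

```shell
#!/bin/sh
# Hypothetical helper: count sleeping connections and running statements
# by their first keyword, reading tab-separated processlist rows from stdin.
summarize_processlist() {
    awk -F'\t' '
        $5 == "Sleep" { n["Sleep"]++; next }
        {
            split($8, w, " ")                 # first word of the Info column
            k = toupper(w[1])
            if (k == "" || k == "NULL") k = "(other)"
            n[k]++
        }
        END { for (k in n) printf "%s %d\n", k, n[k] }' | LC_ALL=C sort
}

# Typical use (-N drops the header row, -B selects batch output):
#   mysql -N -B -e 'SHOW FULL PROCESSLIST' | summarize_processlist
```

Running it a few times during a slow period gives a rough picture of whether deletes, updates or idle connections dominate.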

                            • zabbixfk
                              Senior Member
                              • Jun 2013
                              • 256

                              #15
                              Thank you all for the replies.

                              Thanks @jan.garaj for the explanation.

                              I suspect disk issues too. MySQL seems to be the culprit: if I shut down MySQL, everything returns to normal (CPU load goes down, I/O goes back to normal).

                              SHOW FULL PROCESSLIST shows, most of the time, either deletes from the history_* tables or a lot of update queries, plus some Zabbix connections in sleep state.

                              I had set the housekeeping frequency to 1 and the maximum deletes to 500 items.


                              Thanks.
