proxy_history
  • fiddletwix
    Junior Member
    • Feb 2016
    • 4

    #1

    proxy_history

    Can someone better explain what the proxy_history table is used for? All metrics on the proxy and zabbix server look fine, but the number of rows in proxy_history varies from 100000 to 900000. Is this something I should be worried about?

    Here's my proxy config


    Server=zabbix-server
    ServerPort=10051
    Hostname=proxy01
    ListenPort=10051
    LogFile=/var/log/zabbix_proxy.log
    LogFileSize=0
    PidFile=/var/run/zabbix/zabbix_proxy.pid
    ProxyOfflineBuffer=1
    ConfigFrequency=300
    DataSenderFrequency=3
    StartPollers=600
    StartTrappers=200
    StartPingers=40
    StartHTTPPollers=20
    CacheSize=8G
    StartDBSyncers=40
    HistoryCacheSize=2G
    HistoryIndexCacheSize=2G
    Timeout=15
    ExternalScripts=/usr/lib/zabbix/externalscripts
    LogSlowQueries=3000
    Include=/etc/zabbix/zabbix_proxy.conf.d/*.conf


    A few other notes:

    My proxy server has 12 hyperthreaded cores and 48G of memory. The MySQL server currently resides on the same box but will be moving off; however, I see no issues on the MySQL side.

    This is my first proxy server in this infrastructure, and nvps is only about 215 or so.

    The proxy and zabbix server are separated by a WAN (~80ms latency) with more than enough bandwidth.

    Caches are near empty, processes are pretty idle and things look good but that proxy_history table is all over the place.

    I'm open to general proxy tuning suggestions as well.
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Originally posted by fiddletwix
    StartPollers=600
    Looks like you are still using passive monitoring.

    Until you change all "zabbix agent" items to "zabbix agent (active)" and move to an active-agent setup, the proxy will keep consuming more and more context switches: with so many poller threads you are probably saturating roughly num_of_CPUs*1000 involuntary context switches per second.
    I can bet that right now this is your only problem/bottleneck.
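
    For a rough illustration (these numbers are only a sketch, borrowed from the active proxy config I show later in this thread, not a recommendation for your exact load), a proxy serving mostly active agents needs far fewer pollers and more trappers:

    Code:
    # illustrative zabbix_proxy.conf fragment for a mostly-active setup
    StartPollers=10       # only for the few remaining passive checks
    StartTrappers=100     # trappers receive data pushed by active agents
    StartPingers=30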

    Your base OS template should always include monitoring of system.cpu.switches[].

    In the attachment you can find my "OS Linux" and "OS Solaris" templates; both have system.cpu.switches[] monitoring added. These templates also contain a "CPU::*::cs vs cores" graph, which shows the current cs/s against a horizontal line at 1000*num_of_CPU_cores.
    If system.cpu.switches on your system (the metric should be stored as "delta per second") reports more than 12000, you have your explanation of why, even with low CPU usage, the system lags under your workload.

    Feel free to use those templates, or peek at how to add the monitoring and presentation layer for cs to your own template.
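
    If you just want a quick reading without importing a template, a minimal sketch is to read the cumulative "ctxt" counter from /proc/stat twice and divide by the interval:

    Code:
    # context switches per second, sampled over 5 seconds
    a=$(awk '/^ctxt/ {print $2}' /proc/stat); sleep 5
    b=$(awk '/^ctxt/ {print $2}' /proc/stat)
    echo $(( (b - a) / 5 ))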

    One more thing: Linux does not provide global counters of voluntary/involuntary CSs; on Solaris it is possible. In my OS Solaris template there is a comment with a URL to a page describing how to count both types of cs (so far I have had no time to add monitoring of global ICX/VCX).

    Saturating involuntary cs is a very typical problem on systems running many heavily working, massively multithreaded applications, like many Java application servers.
    On Solaris you can check how things stand with cs using "prstat -m".
    prstat is very similar to top, and -m shows microstate accounting. The ICX (involuntary CX) and VCX (voluntary CX) columns hold those counters, and usually a quick look at them is enough to say ... "aha, here we have too many ICX".
    Example: the first few lines of prstat -m from a system working as the main zabbix MySQL DB backend:
    Code:
       PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP  
     14601 mysql    0.9 0.2 0.0 0.0 0.0  39  59 0.0 126   1  1K   0 mysqld/131
       617 root     0.0 0.4 0.0 0.0 0.0 0.0 100 0.0  33   1   0   0 zpool-data/172
      1790 zabbix   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0  40   0 140   0 zabbix_agent/1
     25870 root     0.1 0.1 0.0 0.0 0.0 0.0 100 0.0   8   0 140   0 bash/1
      4951 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0   2   0 178   0 sleep/1
      4740 admin    0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   1   0 175   0 prstat/1
    As you can see, in this case mysqld with 131 threads usually finishes its work before the scheduler takes the process off the CPUs (126 VCX vs only 1 ICX).

    On Linux you can only check the nr_voluntary_switches and nr_involuntary_switches lines in /proc/<pid>/sched, which are cumulative per-thread counters (so you have to sample them twice to get a per-second rate).
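
    A minimal sketch to aggregate those counters for a whole process (assuming the process is zabbix_proxy; the values are cumulative, so sample twice if you want a rate):

    Code:
    # sum voluntary/involuntary switches over all threads of the oldest zabbix_proxy process
    pid=$(pgrep -o zabbix_proxy)
    awk '/^nr_(in)?voluntary_switches/ {sum[$1] += $3} END {for (k in sum) print k, sum[k]}' /proc/$pid/task/*/sched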

    PS. Try to read up on voluntary/involuntary context switches. People quite often struggle with system performance issues without knowing about the impact of saturating the physical limit of involuntary cs/s on a given piece of hardware. This problem is quite "popular", and it hits even harder for applications running in various types of VMs, because on such systems the CS load is multiplied by the separate schedulers running in each VM.
    Attached Files
    Last edited by kloczek; 11-03-2016, 19:37.
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates

    • kloczek
      Senior Member
      • Jun 2006
      • 1771

      #3
      I just checked my biggest active proxy, which is doing almost 1.5K nvps over active agents.

      Code:
      # cat /etc/zabbix/zabbix_proxy/{Start*,CacheSize,ProxyOfflineBuffer,DataSenderFrequency}
      StartHTTPPollers=5
      StartPingers=30
      StartPollers=10
      StartTrappers=100
      CacheSize=128M
      ProxyOfflineBuffer=6
      DataSenderFrequency=10
      On the same system the zabbix server is also running, and when this proxy (with its own local DB backend) fails over to another system, resource utilization is only about 5-10% lower (CPU, cs, IOs). On the same host we also have a number of scripts feeding this proxy over zabbix trapper items, and at the moment those processes consume more resources than the proxy itself.
      So this may show you how much more effective active monitoring is compared to passive.

      Below is the "cs vs cores" graph for this system (I know that on this system more than 2/3 of the CSs are voluntary, so passing the 4k limit is not an issue).
      Attached Files
      Last edited by kloczek; 11-03-2016, 20:54.

      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        BTW, if anyone is interested in off-CPU analysis of a system workload: I just checked that it is now possible to use exactly the same DTrace one-liner on OL6/OL7. Example:

        Code:
        # cat /etc/redhat-release; echo; dtrace -n 'sched:::off-cpu { self->ts = timestamp; } sched:::on-cpu /self->ts/ { @["ns"] = quantize(timestamp - self->ts); self->ts = 0; } tick-10s {printa(@);}'
        Red Hat Enterprise Linux Server release 6.6 (Santiago)
        
        dtrace: description 'sched:::off-cpu ' matched 3 probes
        CPU     ID                    FUNCTION:NAME
          1    648                        :tick-10s 
          ns                                                
                   value  ------------- Distribution ------------- count    
                      64 |                                         0        
                     128 |                                         19       
                     256 |                                         96       
                     512 |                                         413      
                    1024 |                                         943      
                    2048 |@                                        2375     
                    4096 |@@@                                      6173     
                    8192 |@@@@@                                    9713     
                   16384 |@@@@@                                    10373    
                   32768 |@@@                                      7099     
                   65536 |@@@                                      5452     
                  131072 |@@@@@@                                   13620    
                  262144 |@@@@@                                    10787    
                  524288 |@@                                       4614     
                 1048576 |@                                        2766     
                 2097152 |@                                        3141     
                 4194304 |@                                        1378     
                 8388608 |@                                        1602     
                16777216 |@                                        1255     
                33554432 |                                         617      
                67108864 |                                         915      
               134217728 |                                         214      
               268435456 |                                         162      
               536870912 |@                                        1379     
              1073741824 |                                         467      
              2147483648 |                                         14       
              4294967296 |                                         19       
              8589934592 |                                         0        
             17179869184 |                                         0        
             34359738368 |                                         0        
             68719476736 |                                         0        
            137438953472 |                                         0        
            274877906944 |                                         0        
            549755813888 |                                         0        
           1099511627776 |                                         0        
           2199023255552 |                                         0        
           4398046511104 |                                         0        
           8796093022208 |                                         0        
          17592186044416 |                                         0        
          35184372088832 |                                         0        
          70368744177664 |                                         0        
         140737488355328 |                                         0        
         281474976710656 |                                         0        
         562949953421312 |                                         0        
        1125899906842624 |                                         0        
        2251799813685248 |                                         0        
        4503599627370496 |                                         0        
        9007199254740992 |                                         0        
        18014398509481984 |                                         0        
        36028797018963968 |                                         117      
        72057594037927936 |                                         0        
        
        
        ^C
        If anyone is interested in more details about this kind of analysis, please read Brendan Gregg's brilliant article at http://www.brendangregg.com/offcpuanalysis.html

        • fiddletwix
          Junior Member
          • Feb 2016
          • 4

          #5
          Originally posted by kloczek
          Looks like you are still using passive monitoring.

          Until you change all "zabbix agent" items to "zabbix agent (active)" and move to an active-agent setup, the proxy will keep consuming more and more context switches: with so many poller threads you are probably saturating roughly num_of_CPUs*1000 involuntary context switches per second.
          I can bet that right now this is your only problem/bottleneck.
          Sadly, the number of pollers is a red herring. I am using active monitoring now; that 600 is a leftover from when I was doing passive checks. My agent configs typically look like this:


          PidFile=/var/run/zabbix/zabbix_agentd.pid
          LogFile=/var/log/zabbix/zabbix_agentd.log
          LogFileSize=0
          Server=proxy01,zabbix-server
          ListenPort=30050
          StartAgents=50
          ServerActive=proxy01:30051
          Hostname=myhostname
          HostMetadata=linux
          Include=/etc/zabbix/zabbix_agentd.d/



          Originally posted by kloczek
          Your base OS template should always include monitoring of system.cpu.switches[]. [...] If system.cpu.switches on your system reports more than 12000, you have your explanation of why, even with low CPU usage, the system lags under your workload.
          Thanks, I definitely like having csw metrics and I'll be adding these. That said, I just looked at CS using dstat and I'm seeing 4-6k on average.

          And internal processes are nearly idle except for the data sender (attaching a graph of that).

          That said, I still don't understand the role of the proxy_history table. Is it the buffered data that needs to be transferred over to the zabbix server? What is updating and deleting from that table?
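
          For reference, that reading came from a quick look at the system-wide counters, roughly like this (vmstat's "cs" column shows the same figure as dstat):

          Code:
          # system-wide context switches per second, 5-second samples
          dstat -y 5 5     # "csw" column
          vmstat 5 5       # "cs" column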
          Attached Files

          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by fiddletwix
            That said, I still don't understand the role of proxy_history table. Is it that buffered data that needs to be transferred over to the zabbix server? What is updating and deleting from that table?
            This is just a circular buffer of the monitoring data received from agents; once delivered to the server, that data goes into the history* tables there. Because the proxy does not process the data, everything is stored as longtext in the proxy_history table, so the same table holds all float, uint, str, text and log metric data.
            The proxy housekeeper runs every hour by default (HousekeepingFrequency=1) and removes all data from this table older than ProxyOfflineBuffer hours.
            This table does not need to be optimized, because after a couple of days its size stabilizes as nvps is ~constant. For example, on my proxy the file backing the proxy_history table is about 2.5GB.
            If you give the proxy DB an innodb pool of more than 2-3GB, this table should be completely cached in memory for all read IOs (and on the IO graph you should see only write IOs).
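
            If you want to look at that buffer directly, a quick check along these lines (just a sketch, assuming the standard proxy schema and a DB named zabbix_proxy) shows how many rows are currently held and how old the oldest one is; remember that already-sent rows stay until the housekeeper removes them:

            Code:
            mysql zabbix_proxy -e "SELECT COUNT(*) AS rows_held, FROM_UNIXTIME(MIN(clock)) AS oldest_sample FROM proxy_history;"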

            • fiddletwix
              Junior Member
              • Feb 2016
              • 4

              #7
              Originally posted by kloczek
              This is just a circular buffer of the monitoring data received from agents; once delivered to the server, that data goes into the history* tables there. [...]
              Excellent! That makes a lot of sense. And right now my InnoDB buffer pool is 32G, so I should be good.
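
              (The current value is easy to confirm with a one-liner; just a trivial check, reported in bytes:)

              Code:
              mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"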
