6k Hosts, 50VPS, 8 Items per Host, 4 Triggers - Proxy queue never catches up

  • supa_marty
    Junior Member
    • Oct 2019
    • 1

    #1

    6k Hosts, 50VPS, 8 Items per Host, 4 Triggers - Proxy queue never catches up

    Hi all,

    I am seeking opinions and advice on deploying a performant active-mode proxy that needs to sustain roughly 50 VPS, monitoring 6,000 hosts via SNMP (8 items per host, 15-minute interval, 4 triggers per host).

    I have run a proxy for 3 weeks in the above scenario and attempted to tweak its performance every 2 or 3 days by changing config and DB parameters.

    After 21 days of troubleshooting I cannot find a solution. A queue builds, polled results drift further and further beyond their expected 15-minute interval, and a "no data" trigger (no data after 1000 s ≈ 16.7 minutes) litters the dashboard.
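
    As a sanity check on the 50 VPS figure, the raw numbers work out as follows (simple arithmetic, shown as a quick shell calculation):
    Code:
    # 6000 hosts x 8 items, each polled every 15 minutes (900 s)
    echo "scale=1; 6000 * 8 / 900" | bc   # => 53.3 new values per second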

    What's Happening?
    On average, once every 24 hours, the proxy's VPS drops to around 25 (50% of the required rate) and a queue appears. Once a queue appears, the proxy is never able to catch up, no matter how long I leave it alone to see whether it can sort itself out.

    Zabbix Version (On Proxy and Server)
    3.0
    Proxy Config
    ProxyMode=0
    ServerPort=10051
    HostnameItem=system.hostname
    LogType=console
    LogFile=/var/log/zabbix/zabbix_proxy.log
    LogFileSize=10
    DebugLevel=3
    ProxyOfflineBuffer=48
    HeartbeatFrequency=60
    ConfigFrequency=1800
    DataSenderFrequency=1
    StartPollers=10
    StartPollersUnreachable=10
    StartTrappers=5
    StartPingers=1
    StartDiscoverers=2
    StartHTTPPollers=2
    SNMPTrapperFile=/var/log/snmptt/snmptrap.log
    StartSNMPTrapper=1
    HousekeepingFrequency=1
    CacheSize=512M
    HistoryCacheSize=512M
    HistoryIndexCacheSize=512M
    Timeout=10
    TrapperTimeout=20
    UnreachablePeriod=45
    UnavailableDelay=20
    UnreachableDelay=5
    ExternalScripts=/usr/lib/zabbix/externalscripts
    LogSlowQueries=3000


    Proxy DB Config
    [mysqld]
    skip-host-cache
    skip-name-resolve
    max_allowed_packet = 32M
    table_open_cache = 1024
    wait_timeout = 86400
    innodb_buffer_pool_size = 8G
    max_connections = 500
    innodb_io_capacity = 8000
    innodb_io_capacity_max = 12000


    Zabbix-DB Config
    [mysqld]
    pid-file = /var/run/mysqld/mysqld.pid
    socket = /var/run/mysqld/mysqld.sock
    port = 3306
    basedir = /usr
    datadir = /var/lib/mysql
    tmpdir = /tmp
    lc_messages_dir = /usr/share/mysql
    lc_messages = en_US
    skip-external-locking
    connect_timeout = 5
    wait_timeout = 86400
    max_allowed_packet = 32M
    thread_cache_size = 128
    sort_buffer_size = 4M
    bulk_insert_buffer_size = 16M
    tmp_table_size = 32M
    max_heap_table_size = 32M
    key_buffer_size = 128M
    table_open_cache = 400
    myisam_sort_buffer_size = 512M
    concurrent_insert = 2
    read_buffer_size = 2M
    read_rnd_buffer_size = 1M
    query_cache_limit = 128K
    query_cache_size = 64M
    slow_query_log_file = /var/log/mysql/mariadb-slow.log
    long_query_time = 10
    expire_logs_days = 10
    max_binlog_size = 100M

    innodb_buffer_pool_size = 8G
    innodb_log_buffer_size = 32M
    innodb_file_per_table = 1
    innodb_open_files = 400
    innodb_io_capacity = 8000
    innodb_io_capacity_max = 12000
    innodb_flush_method = O_DIRECT

    skip-host-cache
    skip-name-resolve


    Zabbix-Server Config
    ListenPort=10051
    LogType=console
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=5
    PidFile=/var/run/zabbix/zabbix_server.pid
    StartPollers=20
    StartTrappers=10
    SNMPTrapperFile=/var/log/snmptt/snmptt.log
    HousekeepingFrequency=1
    SenderFrequency=30
    CacheSize=512M
    CacheUpdateFrequency=60
    HistoryCacheSize=128M
    HistoryIndexCacheSize=64M
    TrendCacheSize=512M
    ValueCacheSize=512M
    Timeout=5
    AlertScriptsPath=/usr/lib/zabbix/alertscripts
    ExternalScripts=/usr/lib/zabbix/externalscripts
    LogSlowQueries=3000
    StartProxyPollers=8
    ProxyConfigFrequency=300

    Proxy DB queries during poor performance
    Executed manually about once per second to watch the proxy's unsent-value backlog

    MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
    +-------------------------------------------------------------------------------+
    | max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
    +-------------------------------------------------------------------------------+
    |                                                                             0 |
    +-------------------------------------------------------------------------------+

    MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
    +-------------------------------------------------------------------------------+
    | max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
    +-------------------------------------------------------------------------------+
    |                                                                            32 |
    +-------------------------------------------------------------------------------+
    1 row in set (0.00 sec)

    MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
    +-------------------------------------------------------------------------------+
    | max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
    +-------------------------------------------------------------------------------+
    |                                                                            28 |
    +-------------------------------------------------------------------------------+
    1 row in set (0.00 sec)

    MariaDB [zabbix]> select max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) from proxy_history;
    +-------------------------------------------------------------------------------+
    | max(id) - (select nextid from ids where table_name = 'proxy_history' limit 1) |
    +-------------------------------------------------------------------------------+
    |                                                                             0 |
    +-------------------------------------------------------------------------------+
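
    A convenient way to repeat that check (a sketch; it assumes local socket auth to the proxy's zabbix database):
    Code:
    # re-run the backlog query once per second until interrupted
    while true; do
        mysql zabbix -N -e "SELECT MAX(id) - (SELECT nextid FROM ids WHERE table_name = 'proxy_history' LIMIT 1) FROM proxy_history;"
        sleep 1
    done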

    Proxy process during poor performance

    zabbix_proxy -f
    zabbix_proxy: configuration syncer [synced config 15762640 bytes in 1.900318 sec, idle 1800 sec]
    zabbix_proxy: heartbeat sender [sending heartbeat message success in 0.002096 sec, idle 60 sec]
    zabbix_proxy: data sender [sent 48 values in 0.003604 sec, idle 1 sec]
    zabbix_proxy: poller #1 [got 5 values in 0.354294 sec, idle 1 sec]
    zabbix_proxy: poller #2 [got 4 values in 0.307196 sec, idle 1 sec]
    zabbix_proxy: poller #3 [got 6 values in 0.387001 sec, idle 1 sec]
    zabbix_proxy: poller #4 [got 4 values in 0.301262 sec, idle 1 sec]
    zabbix_proxy: poller #5 [got 5 values in 0.347205 sec, idle 1 sec]
    zabbix_proxy: poller #6 [got 5 values in 0.331892 sec, idle 1 sec]
    zabbix_proxy: poller #7 [got 5 values in 0.339897 sec, idle 1 sec]
    zabbix_proxy: poller #8 [got 5 values in 0.368512 sec, idle 1 sec]
    zabbix_proxy: poller #9 [got 4 values in 0.281030 sec, idle 1 sec]
    zabbix_proxy: poller #10 [got 5 values in 0.359561 sec, idle 1 sec]
    zabbix_proxy: unreachable poller #1 [got 1 values in 20.029659 sec, getting values]
    zabbix_proxy: unreachable poller #2 [got 1 values in 20.033272 sec, getting values]
    zabbix_proxy: unreachable poller #3 [got 1 values in 20.021567 sec, getting values]
    zabbix_proxy: unreachable poller #4 [got 1 values in 20.037575 sec, getting values]
    zabbix_proxy: unreachable poller #5 [got 2 values in 20.142452 sec, getting values]
    zabbix_proxy: unreachable poller #6 [got 1 values in 20.029762 sec, getting values]
    zabbix_proxy: unreachable poller #7 [got 1 values in 20.038905 sec, getting values]
    zabbix_proxy: unreachable poller #8 [got 1 values in 20.028888 sec, getting values]
    zabbix_proxy: unreachable poller #9 [got 1 values in 20.029567 sec, getting values]
    zabbix_proxy: unreachable poller #10 [got 1 values in 20.038540 sec, getting values]
    zabbix_proxy: trapper #1 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: trapper #2 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: trapper #3 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: trapper #4 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: trapper #5 [processed data in 0.000000 sec, waiting for connection]
    zabbix_proxy: icmp pinger #1 [got 0 values in 0.000004 sec, idle 5 sec]
    zabbix_proxy: housekeeper [deleted 186839 records in 0.504918 sec, idle for 1 hour(s)]


    Graphs

    [Attached: graphs of the proxy's VPS and queue size over several days]
    Please view the attached graph of recent Zabbix proxy performance:
    1. The first values received are from when I enabled the agent. I restarted the proxy, which dumped the previous queue (notice the red line initially diving down). Performance is optimal at this point: the proxy sits at 50 VPS and no queue is seen for around 6 hours.
    2. VPS dives down and the queue begins creeping up.
    3. After a second restart the queue disappears and VPS returns to its average of 50. Around 13 hours pass without a queue or a VPS hit.
    4. 13 hours later the queue begins to rise and VPS begins to dive. This time I decided to leave the proxy untouched, without a restart, for more than 48 hours to see whether it could fix itself.
    5. Dropping the proxy 48+ hours later brought everything back to 50 VPS immediately. Before dropping it I ran the DB queries and process monitor shown above to see how many values were coming in and being sent. The proxy_history table held only ~28K rows and the gap to nextid was at most ~30, i.e. the backlog wasn't that big?
    6. A clear repeating pattern.


    What would I like to happen?
    I'd like to be able to leave the proxy untouched and have it perform without needing a process restart whenever a queue begins to form. I was hoping 50 VPS wouldn't be too much to ask (and it hasn't been, for roughly 24 hours at a time!).

    I would really like to isolate where the bottleneck exists. I'm hoping the above graphs, description, config and database parameters are enough for an expert to see where the problem may lie.
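
    One thing I'm considering for isolating it (a sketch; these are standard Zabbix internal item keys, attached to a host monitored by the proxy so that the proxy reports on its own processes):
    Code:
    # items of type "Zabbix internal" on a proxy-monitored host
    zabbix[process,poller,avg,busy]              # % of time the pollers are busy
    zabbix[process,unreachable poller,avg,busy]  # % busy for unreachable pollers
    zabbix[process,data sender,avg,busy]         # % busy for the data sender
    zabbix[wcache,history,pfree]                 # % free history write cache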

    Am I correct in presuming that once a queue begins, it sets off a cascade effect across processes that further impairs the ability to poll devices? So, in effect, a positive feedback loop?

    My troubleshooting cannot explain why restarting the process immediately fixes the problem: the proxy_history table appears to hold a similar number of values during perfect operation and during a big queue (~180k values housekept every hour).
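
    Incidentally, that housekeeping figure is consistent with the target rate (simple arithmetic, as a shell one-liner):
    Code:
    # one hour of history at 50 new values per second
    echo $((50 * 3600))   # => 180000, matching the ~180k rows housekept hourly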

    If you would like any more information or config please let me know.

    Network?
    When the proxy has trouble polling target devices during a high-queue scenario, I jump on the CLI and run bulk snmpgets to determine manually whether there is a network problem. In every such instance, the devices have responded without any problem. This makes me believe the problem lies in the SNMP poller processes, OR in the data sender process that ships results back to the zabbix-server. Does anyone agree?
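
    For reference, the manual check looks roughly like this (a sketch; hosts.txt and the 'public' community string are placeholders):
    Code:
    # poll sysUpTime.0 on each target to confirm it answers promptly
    while read -r host; do
        snmpget -v2c -c public -t 2 -r 1 "$host" 1.3.6.1.2.1.1.3.0 \
            || echo "FAILED: $host"
    done < hosts.txt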

    Can I start multiple data senders to ensure the data sender process isn't a bottleneck? Just a thought.

    Desired outcome from you guys
    I would love to hear from anyone who has had a similar experience, or who can highlight a glaring problem.

    Problem at Zabbix Server receiver process?
    Problem at Zabbix Server DB?
    Problem at Zabbix Proxy poller process?
    Problem at Zabbix Proxy data sender process?
    Problem at Zabbix Proxy database?

    Thank you kindly to all who have read this far.
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    1) Move to Zabbix >= 4.0.
    2) query_cache_limit = 128K, query_cache_size = 64M: the query cache cannot speed up Zabbix's DB queries, because almost all of them operate on constantly moving windows of data. Decrease those values to the minimum MySQL allows so that memory isn't wasted.
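
    For instance (a sketch of the my.cnf change; setting both to 0 disables the query cache outright):
    Code:
    [mysqld]
    query_cache_type = 0    # disable the query cache
    query_cache_size = 0    # and release its memory
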
    3) What is the ratio between write and read IOs on the main DB backend? You should see at least 20:1 writes to reads; anything less means you are not caching enough data in memory -> increase the InnoDB buffer pool.
    Low-latency write queries (inserts and updates) depend on low-latency read IOs. If a shortage of memory forces the engine to read data from storage (even fast NVMe storage) to satisfy selects, your DB backend will always be slow at ingesting new incoming data.
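
    One way to check that ratio (a sketch; these counters are cumulative since server start, so sample twice and diff for a rate):
    Code:
    # InnoDB data file reads vs writes since the server started
    mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('Innodb_data_reads','Innodb_data_writes');"
    # or watch the block device directly (reads/s vs writes/s)
    iostat -x 5
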
    4) You are using SNMP, so the fact that the proxy is active is not really relevant here: SNMP is a type of passive monitoring.
    5) How many of these proxies do you have? If all 6k hosts are monitored through a single proxy, it will never work correctly, because that proxy is the bottleneck. You must increase the proxy's pollers, and if that is not enough, you must spread the monitoring of those hosts over more proxies.
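
    As an illustration of the first step only (example values, not a sizing recommendation for this exact box):
    Code:
    # zabbix_proxy.conf - more concurrent pollers, then restart the proxy
    StartPollers=50
    StartPollersUnreachable=20
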
    6) If the rate of your SNMP item queries is not uniform, so that at peak you get a burst of SNMP queries and a flood of data that must be delivered to the server, and that is what causes the issue, you must compile your own proxy with a change like the one below (it is only an example, which raises the hardcoded 1k-records-per-exchange limit to 50k; you may use a different value):
    Code:
    $ cat zabbix-default-proxy_ZBX_MAX_HRECORDS_50000.patch 
    --- a/include/proxy.h~
    +++ b/include/proxy.h
    @@ -27,8 +27,8 @@
     #define ZBX_PROXYMODE_ACTIVE    0
     #define ZBX_PROXYMODE_PASSIVE    1
    
    -#define ZBX_MAX_HRECORDS    1000
    -#define ZBX_MAX_HRECORDS_TOTAL    10000
    +#define ZBX_MAX_HRECORDS    50000
    +#define ZBX_MAX_HRECORDS_TOTAL    100000
    
     #define ZBX_PROXY_DATA_DONE    0
     #define ZBX_PROXY_DATA_MORE    1
    ZBX_MAX_HRECORDS throttles the bandwidth of the proxy-to-server flow (your server and DB backend must be strong enough to process the incoming data and flush the Zabbix server write cache to the DB backend faster than data arrives).
    In many large-scale monitoring stacks, the default 1k points per srv<>prx exchange creates a bottleneck.
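
    Applying the patch looks roughly like this (a sketch; the configure options depend on your environment):
    Code:
    $ cd zabbix-source            # your unpacked Zabbix source tree (placeholder path)
    $ patch -p1 < zabbix-default-proxy_ZBX_MAX_HRECORDS_50000.patch
    $ ./configure --enable-proxy --with-mysql --with-net-snmp
    $ make && make install
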
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates
