Disconnected graphs on SNMP as well as zabbix-agent, no data received alerts

  • zabbixfk
    Senior Member
    • Jun 2013
    • 256

    #1

    Disconnected graphs on SNMP as well as zabbix-agent, no data received alerts

    Hello All,

    I have been facing a weird scenario with the agents of late: graphs show disconnected dots instead of continuous lines, and I keep getting no-data-received alerts even though telnet to the agent port looks fine. I really need help figuring out what's going wrong. The Zabbix queue is also piled up big time.
    I tried switching to active agent checks, but that didn't help; increasing the pollers didn't help either.
    I have a mix of items with intervals ranging from 60s minimum to 1 day maximum.
    Code:
    Number of hosts (enabled/disabled/templates)          5693     3018 / 2289 / 386
    Number of items (enabled/disabled/not supported)      24641    22095 / 1513 / 1033
    Number of triggers (enabled/disabled [problem/ok])    7277     6618 / 659 [373 / 6245]
    Number of users (online)                              73       23
    Required server performance, new values per second    84.56
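As a rough cross-check of the figures above, dividing the enabled item count by the required new values per second gives the average effective update interval. This is a minimal Python sketch using only the numbers from the status table; the function name is illustrative, and it assumes the NVPS figure is derived from all enabled items.

```python
def average_interval_seconds(enabled_items: int, nvps: float) -> float:
    """Average update interval implied by an item count and an NVPS figure."""
    if nvps <= 0:
        raise ValueError("nvps must be positive")
    return enabled_items / nvps


# Figures taken from the status table above.
avg = average_interval_seconds(enabled_items=22095, nvps=84.56)
print(f"average interval: {avg:.0f} s")
```

An average of a few minutes per item is consistent with the stated mix of 60s to 1 day intervals.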
    Zabbix Queue looks like this
    Code:
    Items                  5 seconds  10 seconds  30 seconds  1 minute  5 minutes  More than 10 minutes
    Zabbix agent           16         27          8           41        22         90
    Zabbix agent (active)  0          0           3           17        11         174
    Simple check           82         127         14          24        0          0
    SNMPv1 agent           0          0           0           0         0          0
    SNMPv2 agent           8          6           1           0         4          10
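To see where the delay concentrates, the queue table above can be totalled per item type and per delay bucket. The sketch below is plain Python over the numbers as posted; the variable names are illustrative.

```python
COLUMNS = ["5 seconds", "10 seconds", "30 seconds", "1 minute",
           "5 minutes", "More than 10 minutes"]

# Rows copied from the queue table above.
QUEUE = {
    "Zabbix agent":          [16, 27, 8, 41, 22, 90],
    "Zabbix agent (active)": [0, 0, 3, 17, 11, 174],
    "Simple check":          [82, 127, 14, 24, 0, 0],
    "SNMPv1 agent":          [0, 0, 0, 0, 0, 0],
    "SNMPv2 agent":          [8, 6, 1, 0, 4, 10],
}

# Total delayed items per item type (sum of each row).
totals_by_type = {name: sum(counts) for name, counts in QUEUE.items()}

# Total delayed items per delay bucket (sum of each column).
totals_by_delay = dict(zip(COLUMNS, (sum(col) for col in zip(*QUEUE.values()))))

for name, total in totals_by_type.items():
    print(f"{name:<22} {total}")
for bucket, total in totals_by_delay.items():
    print(f"{bucket:<22} {total}")
```

Most of the items delayed by more than 10 minutes are agent items (passive and active), while simple checks are delayed only briefly.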
    There are only a few active items; most are zabbix_agent and SNMP checks. Below is the proxy config, which is pretty similar to that of the other 6 proxies. The data gathering processes are only about 60% busy and everything else looks fine (according to the data gathering process graph). Values processed is about 125+, and the queue for this proxy averages 900.
    Code:
    Server=XX.XX.XX.XX
    Hostname=Zabbix-Proxy
    LogFile=/var/log/zabbix/zabbix_proxy.log
    LogFileSize=300
    DebugLevel=4
    PidFile=/var/run/zabbix/zabbix_proxy.pid
    DBName=zabbix
    DBUser=zabbix
    DBPassword=password
    ProxyLocalBuffer=3
    ProxyOfflineBuffer=4
    ConfigFrequency=120
    DataSenderFrequency=30
    StartPollers=275
    StartPollersUnreachable=120
    StartTrappers=60
    StartPingers=90
    StartSNMPTrapper=1
    HousekeepingFrequency=3
    CacheSize=1G
    StartDBSyncers=50
    HistoryCacheSize=1G
    HistoryIndexCacheSize=1G
    Timeout=30
    UnreachablePeriod=90
    FpingLocation=/usr/local/sbin/fping
    LogSlowQueries=300
    The servers are not heavily utilised. The proxy servers have enough RAM (only 60% used) and CPU load looks fine at 20-30% on all the proxies. Below is the server config.
    Code:
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=500
    DebugLevel=4
    PidFile=/var/run/zabbix/zabbix_server.pid
    DBName=zabbix
    DBUser=zabbix
    DBPassword=password
    StartPollers=300
    StartIPMIPollers=1
    StartPollersUnreachable=150
    StartTrappers=130
    StartPingers=120
    StartDiscoverers=10
    StartSNMPTrapper=1
    ListenIP=0.0.0.0
    HousekeepingFrequency=2
    MaxHousekeeperDelete=300
    SenderFrequency=360
    CacheSize=1G
    CacheUpdateFrequency=300
    StartDBSyncers=15
    HistoryCacheSize=256M
    HistoryIndexCacheSize=256M
    TrendCacheSize=1G
    ValueCacheSize=128M
    Timeout=30
    TrapperTimeout=180
    UnreachablePeriod=600
    UnavailableDelay=180
    AlertScriptsPath=/etc/zabbix/alert.d/
    FpingLocation=/usr/local/sbin/fping
    LogSlowQueries=300
    StartProxyPollers=2
    ProxyDataFrequency=180
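A quick way to see how many worker processes a config like the one above asks for is to add up its Start* parameters. The sketch below is an illustration only: the function name is made up, the sample text holds just the Start* lines from the server config above, and the total does not include processes that Zabbix always starts.

```python
from typing import Dict


def started_processes(conf_text: str) -> Dict[str, int]:
    """Collect numeric Start* parameters from key=value config text."""
    counts: Dict[str, int] = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key.startswith("Start") and value.isdigit():
            counts[key] = int(value)
    return counts


# Start* lines copied from the server config above.
SERVER_CONF = """
StartPollers=300
StartIPMIPollers=1
StartPollersUnreachable=150
StartTrappers=130
StartPingers=120
StartDiscoverers=10
StartSNMPTrapper=1
StartDBSyncers=15
StartProxyPollers=2
"""

server_counts = started_processes(SERVER_CONF)
print(sum(server_counts.values()), "processes requested via Start* parameters")
```

Comparing this total with the database's connection limit and with the per-process busy graphs is one way to judge whether the numbers are oversized.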
    Zabbix server details
    Code:
    [root@zbx_server ~]# free -g
                 total       used       free     shared    buffers     cached
    
    Mem:           125         86         39          0          0         60
    
    -/+ buffers/cache:         25        100
    
    Swap:            9          0          9
    [root@zbx_server ~]#
    CPU: 40 cores. MySQL DB: ~170G.
    MySQL config
    Code:
    [mysqld]
    datadir=/var/lib/mysql
    socket=/var/lib/mysql/mysql.sock
    user=mysql
    symbolic-links=0
    long_query_time = 10
    log-queries-not-using-indexes=YES
    innodb_lock_wait_timeout=500
    innodb_locks_unsafe_for_binlog=1
    innodb_file_per_table
    innodb_flush_method=O_DIRECT
    innodb_log_file_size=1G
    innodb_buffer_pool_size=48G
    innodb_file_per_table
    max_allowed_packet = 128M
    innodb_additional_mem_pool_size = 30M
    innodb_thread_concurrency = 8
    key_buffer_size = 60M
    max_connections=700
    table_cache=4096
    tmp_table_size = 32M
    thread_cache_size = 64
    query_cache_limit=64M
    thread_cache_size=512
    read_buffer_size=2M
    log-bin=mysql-bin
    binlog-do-db=zabbix
    server-id=9
    expire_logs_days=3
    max_binlog_size=100M
    [mysqld_safe]
    log-error=/var/log/mysqld.log
    pid-file=/var/run/mysqld/mysqld.pid
    [client]
    user=root
    password=password
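The [mysqld] section above lists some options more than once (for example thread_cache_size with two different values). MySQL generally uses the last occurrence of a repeated option, so it can help to spot repeats before tuning. The following is a small sketch with an illustrative function name and a shortened sample taken from the config above; it treats dashes and underscores in option names as equivalent, as MySQL does.

```python
from collections import Counter
from typing import Dict, List


def duplicate_options(cnf_text: str) -> Dict[str, List[str]]:
    """Return option names that appear more than once, per [section]."""
    counts: Dict[str, Counter] = {}
    section = ""
    for raw in cnf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        # Normalise "log-bin" and "log_bin" to the same name.
        name = line.split("=", 1)[0].strip().replace("-", "_")
        counts.setdefault(section, Counter())[name] += 1
    result: Dict[str, List[str]] = {}
    for sec, counter in counts.items():
        repeated = sorted(name for name, n in counter.items() if n > 1)
        if repeated:
            result[sec] = repeated
    return result


# Shortened sample built from the mysql config above.
MY_CNF = """
[mysqld]
innodb_file_per_table
innodb_flush_method=O_DIRECT
innodb_buffer_pool_size=48G
innodb_file_per_table
thread_cache_size = 64
query_cache_limit=64M
thread_cache_size=512
[mysqld_safe]
log-error=/var/log/mysqld.log
"""

duplicates = duplicate_options(MY_CNF)
print(duplicates)
```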
    I am more confused now and am trying to figure out the optimal configuration to get this working. I am not in a position to do multiple stops/starts of the Zabbix server and proxies (they are production), but two to four restarts can be tried. When this happened earlier, I increased the pollers and it went away; now increasing the pollers isn't helping. The logs don't show much, except network_error, trying after 15 seconds, and some ZBX_TCP_READ() timed out messages, whereas a continuous
    Code:
    nc -z host_ip 10050   OR nc -z host_ip 161
    both work flawlessly. I mean, I don't see disconnections using continuous nc commands or pings, so the network definitely isn't the issue.
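A check like nc -z only shows that a TCP connection can be opened. It does not show how long the connection takes or whether the agent answers an item request in time, and SNMP normally runs over UDP, which a plain TCP check does not exercise. As a small step further, the sketch below times repeated TCP connects from Python; the function names, the sample count and the timeout are all illustrative assumptions.

```python
import socket
import time
from typing import List, Optional


def tcp_connect_time(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def sample_connects(host: str, port: int, samples: int = 5) -> List[Optional[float]]:
    """Collect several connect timings so slow or failed attempts stand out."""
    return [tcp_connect_time(host, port) for _ in range(samples)]
```

For example, sample_connects("host_ip", 10050) returns a list of timings; None entries or values close to the timeout would point at intermittent problems that a single successful check hides.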

    This Zabbix installation is approximately 4-5 years old; I first installed 2.0.6 and it is now on 3.0.13 (all on CentOS 6.3/6.5: the server plus 6 proxies, with MySQL as the backend).
    Number of hosts monitored via server and proxies.
    Server : ~ 1500
    Proxies together : ~ 1500

    Currently I am stuck on the points below:
    1) Disconnected graphs on both SNMP-monitored items and zabbix-agent items; the number affected is ~20+ on agent and about 30 on SNMP.
    2) SNMP traffic data is disconnected and shows less traffic than the router reports, i.e. if the router says an interface is clocking 300Mb, I see only ~100Mb in Zabbix. This is a separate issue (the interval for traffic items is 5 minutes).
    3) No-data-received alerts on agents. I get 1 when I run agent.ping manually, but Zabbix raises alerts, and in Latest data I also see that data is missing (the interval for agent.ping is 1m).
    4) External scripts failing to run. I had some expect/shell/perl-based scripts that log in to routers, run commands and show some data in Zabbix. This used to work flawlessly on the previous version (2.2.12), but on 3.0.13 they fail: they either time out or hang when run on the console.
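One thing worth ruling out for point 2 is 32-bit counter wrap. This is only a possibility: nothing in the post says which OIDs are polled, so the sketch assumes the 32-bit octet counters (ifInOctets/ifOutOctets) rather than the 64-bit ifHCInOctets/ifHCOutOctets. A 32-bit byte counter wraps at 2^32, and if it wraps more than once between two polls, the extra traffic cannot be recovered from the two readings. The arithmetic, with illustrative function names:

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit octet counter wraps after this many bytes


def seconds_until_wrap(rate_bits_per_s: float) -> float:
    """Time for a 32-bit byte counter to wrap at a constant traffic rate."""
    return COUNTER32_MODULUS * 8 / rate_bits_per_s


def apparent_rate(true_rate_bits_per_s: float, interval_s: int) -> float:
    """Rate computed from two counter readings when wraps are not accounted for."""
    true_bytes = int(true_rate_bits_per_s / 8 * interval_s)
    seen_bytes = true_bytes % COUNTER32_MODULUS  # what the counter delta shows
    return seen_bytes * 8 / interval_s


wrap_s = seconds_until_wrap(300e6)      # 300 Mbit/s as reported by the router
shown = apparent_rate(300e6, 300)       # 5-minute polling interval
print(f"counter wraps every {wrap_s:.1f} s; apparent rate {shown / 1e6:.1f} Mbit/s")
```

At a steady 300 Mbit/s the counter wraps in under two minutes, so a 5-minute interval would under-report. Real traffic is not constant, so the exact figure will differ from what this sketch prints.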

    Any pointers are greatly helpful.

    Thanks
    Starts
    11-06-2018
    Ends
    12-06-2018
  • kernbug
    Senior Member
    • Feb 2013
    • 330

    #2
    Hello

    Do you use partitioning?
    Size of the DB?

    Could you provide the following graphs?
    • Zabbix server performance
    • Zabbix internal process busy
    • Zabbix data gathering process busy
    • Zabbix cache usage

    • zabbixfk
      Senior Member
      • Jun 2013
      • 256

      #3
      Thank you for the reply.
      No partitioning done on mysql.
      Size of the DB is ~ 176GB

      Zabbix cache usage, 7 days

      Zabbix server performance, 7 days

      Zabbix internal process busy, 7 days

      Zabbix data gathering process, 7 days


      Thanks
      Last edited by zabbixfk; 18-06-2018, 12:55.

      • kernbug
        Senior Member
        • Feb 2013
        • 330

        #4
        Originally posted by zabbixfk
        Thank you for the reply.
        No partitioning done on mysql.
        Size of the DB is ~ 176GB

        Zabbix Cache usage 7 days -

        zabbix server performance 7days

        zabbix zabbix internal process busy 7days

        zabbix data gathering process 7 days


        Thanks
        Thank you for the additional information.

        Could you gather strace output for about 5-10 minutes:
        Code:
        strace -s 100 -T -tt -fp PID -e trace=write
        where PID is the 'history syncer' process id in the system.

        • zabbixfk
          Senior Member
          • Jun 2013
          • 256

          #5
          Here is the file.
          (file link on TinyUpload.com)

          • kernbug
            Senior Member
            • Feb 2013
            • 330

            #6
            Originally posted by zabbixfk
            Here is the file.
            Thank you. Nothing alarming, except:
            Code:
            12:24:43.122005 write(6, " 24297:20180619:122443.121 __zbx_zbx_setproctitle() title:'history syncer #1 [synced 80 items in 0.2"..., 124) = 124 <0.000008>
            12:24:44.122502 write(6, " 24297:20180619:122444.122 __zbx_zbx_setproctitle() title:'history syncer #1 [synced 80 items in 0.2"..., 129) = 129 <0.000014>
            12:24:44.122715 write(6, " 24297:20180619:122444.122 In DCsync_history() history_num:21\n", 62) = 62 <0.000008>
            12:25:21.681305 write(6, " 24297:20180619:122521.681 __zbx_zbx_setproctitle() title:'history syncer #1 [synced 18 items in 0.9"..., 124) = 124 <0.000008>
            12:25:22.681851 write(6, " 24297:20180619:122522.681 __zbx_zbx_setproctitle() title:'history syncer #1 [synced 18 items in 0.9"..., 129) = 129 <0.000009>
            Please reduce:
            Code:
            StartDBSyncers=50 -> 4
            StartPollers=275 -> 150
            StartTrappers=130 -> 30
            And, if possible, apply partitioning, but take a backup first.
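In the strace excerpt above, the timestamps jump from 12:24:44 to 12:25:21 between two writes of the same history syncer, which appears to be the detail being pointed out. A small Python sketch for finding the largest gap between consecutive strace timestamps; the list holds only the five timestamps quoted above, the function name is illustrative, and it assumes all stamps fall on the same day.

```python
from datetime import datetime
from typing import List, Tuple

# Timestamps copied from the strace excerpt above.
STRACE_TIMES = [
    "12:24:43.122005",
    "12:24:44.122502",
    "12:24:44.122715",
    "12:25:21.681305",
    "12:25:22.681851",
]


def largest_gap(stamps: List[str]) -> Tuple[float, str, str]:
    """Return (gap in seconds, earlier stamp, later stamp) for the widest gap."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    gaps = [
        ((later - earlier).total_seconds(), stamps[i], stamps[i + 1])
        for i, (earlier, later) in enumerate(zip(times, times[1:]))
    ]
    return max(gaps)


gap_seconds, gap_start, gap_end = largest_gap(STRACE_TIMES)
print(f"largest gap: {gap_seconds:.3f} s between {gap_start} and {gap_end}")
```

Run against a full 5-10 minute capture, the same function shows whether such pauses are frequent or a one-off.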

            • zabbixfk
              Senior Member
              • Jun 2013
              • 256

              #7
              Thanks, I really appreciate your reply. May I ask how you arrived at these numbers? Can you help me figure out how to do the same?
              And how do I set up partitioning? Only on the server, or on all the proxies too?
              I am new to MySQL; I tuned all the parameters with help from the internet, so it would be great if you could share some pointers.
              Doesn't decreasing the pollers affect the incoming data? Another thread suggested increasing the pollers whenever the number of hosts/items on a server/proxy increases, so I kept them at this number. I'm not sure I'm doing this right; can you shed some light here, please?

              Thanks

              • kernbug
                Senior Member
                • Feb 2013
                • 330

                #8
                Hello

                Originally posted by zabbixfk
                May I ask how you arrived at these numbers? Can you help me figure out how to do the same?
                Sleepless nights with ~40000 hosts and ~10k NVPS, and I'm still far from Zabbix expert level.

                Originally posted by zabbixfk
                And how do I set up partitioning? Only on the server, or on all the proxies too?
                In most cases, partitioning the Zabbix server DB is enough. But if you want, the Zabbix proxy DB can also be partitioned (only the proxy_history table).

                Look here about setup instructions: https://zabbix.org/wiki/Docs/howto/mysql_partition
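To give an idea of what range partitioning by time looks like, here is a sketch that generates a MySQL ALTER TABLE statement with one partition per day on the clock column of a history table. It is not taken from the linked howto: the partition naming scheme, the UTC day boundaries and the function name are assumptions, and the generated statement should be checked against the howto and tried on a test copy before being run anywhere.

```python
from datetime import date, datetime, timedelta, timezone


def daily_partition_ddl(table: str, first_day: date, days: int) -> str:
    """Build an ALTER TABLE statement with one RANGE partition per day."""
    parts = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        # Upper bound: midnight UTC at the start of the following day.
        upper = datetime.combine(day + timedelta(days=1),
                                 datetime.min.time(), tzinfo=timezone.utc)
        parts.append(
            f"  PARTITION p{day:%Y_%m_%d} VALUES LESS THAN ({int(upper.timestamp())})"
        )
    body = ",\n".join(parts)
    return f"ALTER TABLE {table} PARTITION BY RANGE (clock) (\n{body}\n);"


ddl = daily_partition_ddl("history", date(2018, 6, 19), 3)
print(ddl)
```

With partitions in place, old data can be removed by dropping whole partitions instead of deleting rows one by one.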

                Originally posted by zabbixfk
                I am new to MySQL; I tuned all the parameters with help from the internet, so it would be great if you could share some pointers.
                Doesn't decreasing the pollers affect the incoming data? Another thread suggested increasing the pollers whenever the number of hosts/items on a server/proxy increases, so I kept them at this number. I'm not sure I'm doing this right; can you shed some light here, please?

                Thanks
                Zabbix components are well optimized and performance is great, but sometimes there is a bottleneck you need to take care of. Simply increasing the number of processes increases lock contention between them, StartDBSyncers for example. I have seen only a few DB server configs that can survive with StartDBSyncers>30. Simple rule: start small, monitor your load, increase parameters (one at a time, not all of them), and find the bottlenecks.



                • zabbixfk
                  Senior Member
                  • Jun 2013
                  • 256

                  #9
                  Thanks for the reply.
                  I am trying the changes you suggested (decreasing pollers etc.) on one of the proxies, and I see the queue shrinking in all the columns. But it suddenly grows, stays there for some time, then shrinks again. Should I be worried? (P.S. There are still >90 items in the "More than 10 minutes" column on that proxy.)
                  I could restart the master only once, and the queue in all columns decreased, but it is the same story: it changes with every refresh I do and is not consistent. I believe I have to restart zbx_server to apply config changes.
                  What is the best way to measure these things? Are there any documents you suggest I read to understand what works better for my environment, e.g. how many pollers are needed for how many hosts/items? I'm asking because you said increasing processes won't help. It's just a request, though. Any books/articles? The manuals don't provide anything on this.
                  - I can't partition the server DB right now, as that will need downtime. Let me see if that's possible; for a 170+G database I don't know how long it will take.
                  BTW, I am running all in one: Zabbix server, frontend and DB on one machine. Should I think about moving the frontend or the DB to another machine? Will it help? What kind of connectivity is required between the DB, the server and httpd; is 1G enough? Physical or virtual? Any preferred combinations you can think of?
                  - Any idea on taking backups? I have set up master-slave replication to one of the boxes, so that is the only DB backup I have. I hope that's enough.

                  OK, I am asking too many things. I'm really thankful to you for patiently pointing me in the right direction.

                  Thanks

                  • kernbug
                    Senior Member
                    • Feb 2013
                    • 330

                    #10
                    Originally posted by zabbixfk
                    Thanks for the reply.
                    I am trying the changes you suggested (decreasing pollers etc.) on one of the proxies, and I see the queue shrinking in all the columns. But it suddenly grows, stays there for some time, then shrinks again. Should I be worried?
                    The overall queue mostly depends on the Zabbix server processes: the proxy pushes new values to the server, and the server (history syncer) must write them to the database.

                    Are there any documents you suggest I read to understand what works better for my environment, e.g. how many pollers are needed for how many hosts/items? I'm asking because you said increasing processes won't help. It's just a request, though.
                    Increase pollers as you grow: if the load on the performance graphs is about 75%, add 10 pollers, restart, and check. One DB syncer writes at most 1000 values to the database at once; if that one operation completes in <500ms, your database performance is sufficient. It's better to have a test site running the same version as your main setup.
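The two rules of thumb above can be written down as a pair of small checks. This is only a sketch of the advice as stated in this reply, not anything built into Zabbix; the function names and the sample inputs are hypothetical.

```python
def next_poller_count(current_pollers: int, busy_percent: float,
                      threshold: float = 75.0, step: int = 10) -> int:
    """Add `step` pollers once the poller busy graph reaches the threshold."""
    if busy_percent >= threshold:
        return current_pollers + step
    return current_pollers


def db_keeps_up(batch_write_seconds: float, limit_seconds: float = 0.5) -> bool:
    """True if one history syncer batch (up to 1000 values) is written fast enough."""
    return batch_write_seconds < limit_seconds


# Hypothetical readings, for illustration only.
print(next_poller_count(current_pollers=150, busy_percent=80.0))
print(db_keeps_up(batch_write_seconds=0.2))
```

Changing one parameter at a time and re-reading the graphs after each restart keeps the effect of every change attributable.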

                    BTW, I am running all in one: Zabbix server, frontend and DB on one machine. Should I think about moving the frontend or the DB to another machine? Will it help? What kind of connectivity is required between the DB, the server and httpd; is 1G enough? Physical or virtual? Any preferred combinations you can think of?
                    A 2-node cluster is enough: replicated database, 1 server with fencing, web frontend on each node.

                    - Any idea on taking backups? I have set up master-slave replication to one of the boxes, so that is the only DB backup I have. I hope that's enough.
                    Percona XtraBackup is a must-have option.


                    • zabbixfk
                      Senior Member
                      • Jun 2013
                      • 256

                      #11
                      Thanks for the reply. I am kind of new to the sysadmin scene; can you elaborate on this part? Apologies.

                      A 2-node cluster is enough: replicated database, 1 server with fencing, web frontend on each node.
                      What tech should I be using? I mean, how do I set this whole thing up? Any pointers?

                      Thanks
