Remote agents losing connection to the server for prolonged periods

  • Jason
    Senior Member
    • Nov 2007
    • 430

    #1

    Remote agents losing connection to the server for prolonged periods

Recently we've moved from a single Zabbix server (was 2.4.7) running on CentOS 6 to a front-end/back-end scenario. The back-end is the original CentOS 6 server and just has the PostgreSQL database on it. The tables are partitioned, we keep 30 days' worth of data, and the database is just under 300GB.

For the front end I've built a new CentOS 7 server and have got it all up and running, and everything seemed fine. However, after a few days we get periods where the data just stops coming in for everything other than the Zabbix server itself. The two attached graphs are for proxies (one in the same DC, one remote) where the data stops. Data also stops coming in from servers on the same LAN as the Zabbix server, and that traffic doesn't go through any routers/firewalls.

I've found that if I log onto the Zabbix server when the data isn't coming in and restart the Zabbix server, then everything starts checking in again, which implies the zabbix_server process itself could be the problem. I've gone through the server log and cannot see anything out of the ordinary in there. On the database server there doesn't seem to be anything happening that can explain it either. No vacuuming on the tables that I can see at that time (autovacuum is on).

I've just moved up to 2.4.8 to see if this resolves the issue, but it doesn't seem to have made a difference.

    Server config file is below. The server itself has 12GB of RAM assigned. All of our agents/proxies are active.

    ListenPort=10052
    LogFile=/var/log/zabbix/zabbix_server.log
    LogFileSize=0
    PidFile=/var/run/zabbix/zabbix_server.pid
    DBHost=X.X.X.X
    DBName=zabbix
    DBUser=dbuser
    DBPassword=Password
    StartPollers=10
    StartPollersUnreachable=5
    StartTrappers=60
    StartDiscoverers=10
    SNMPTrapperFile=/tmp/zabbix_traps.tmp
    StartSNMPTrapper=1
    CacheSize=128M
    CacheUpdateFrequency=300
    StartDBSyncers=4
    HistoryCacheSize=128M
    TrendCacheSize=32M
    ValueCacheSize=64M
    Timeout=30
    LogSlowQueries=3000
    StartProxyPollers=0

I've also attached a graph of the Zabbix internal processes, which shows the history syncer process dropping to nothing during the outage.

    I'm out of ideas for what it could be. Any suggestions of things I've missed or where to look?
    Attached Files
  • Jason
    Senior Member
    • Nov 2007
    • 430

    #2
    I've put a daily restart in, but data is still dropping out on some evenings. Anyone any ideas?
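For reference, a daily restart like that is just a root cron entry; a minimal sketch, assuming the stock zabbix-server systemd unit on CentOS 7 (the file path and time of day are placeholders):

# /etc/cron.d/zabbix-server-restart (hypothetical file)
# restart zabbix_server once a day as a workaround while the root cause is unknown
0 6 * * * root /usr/bin/systemctl restart zabbix-server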

    • glebs.ivanovskis
      Senior Member
      • Jul 2015
      • 237

      #3
Anything interesting in the server log file? Have you tried stracing trapper processes during the problem? What does netstat show for the server ListenPort? Have you tried capturing network traffic? Overall, is it a good or a bad network?

It seems that when the network is bad, Zabbix can be even worse.
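Roughly what I mean, as a sketch (the PID placeholder, interface name and capture path are examples; 10052 is the ListenPort from the config above):

# attach strace to one of the trapper processes during the problem
pgrep -f 'zabbix_server: trapper'
strace -tt -T -f -p <trapper_pid> -o /tmp/trapper.strace

# what the server ListenPort looks like right now
netstat -ant | grep ':10052'

# capture traffic on the trapper port for offline analysis
tcpdump -i eth0 -w /tmp/zabbix_10052.pcap port 10052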

      • Jason
        Senior Member
        • Nov 2007
        • 430

        #4
I've got several agents that are on the local network, some of which are even on the same vhost as Zabbix. They are all working, and yet even data from the database server stops coming in...

        I've just found that during the outage I can see several lines starting like this in the logs... These lines go on for several screens

2847:20160620:215510.624 slow query: 3.061724 sec, "insert into trends_uint (itemid,clock,num,value_min,value_avg,value_max) values (122230,1466449200,62,3856,6036,11752),(214809,1466449200,1,0,0,0),(225066,1466449200,1,0,0,0),(221764,1466449200,2,0,0,0),(167436,1466449200,62,1864,7263,12120),(197553,1466449200,65,32680,1662689,3365120),(126092,1466449200,61,1,1,1),(136350,1466449200,63,1,1,1)

        I've also found this...

        2784:20160620:225603.327 sending configuration data to proxy "SNMP Proxy", datalen 1859461
        2784:20160620:225603.328 cannot send configuration: ZBX_TCP_WRITE() failed: [32] Broken pipe

        2848:20160620:233259.344 error reason for "5267e681-66d4-8fd2-a230-725290a19831:vmware.vm.uptime[{$URL},{HOST.HOST}]" changed: Couldn't connect to server

        I'm keeping a count of the number of open sockets and it's ticking along at just over 100...

        It is possible it's a network issue, but I'd not expect to lose access to the database server as it's on the same physical hardware.
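For reference, this is roughly how I'm counting them (a sketch; 5432 assumes the default PostgreSQL port, and 10052 is the trapper ListenPort from my config):

# total established sockets
netstat -ant | grep ESTABLISHED | wc -l

# split the count between database connections and trapper connections
netstat -ant | awk '$6 == "ESTABLISHED" && $5 ~ /:5432$/'  | wc -l    # outgoing, to PostgreSQL
netstat -ant | awk '$6 == "ESTABLISHED" && $4 ~ /:10052$/' | wc -l    # incoming, on the trapper port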

        • glebs.ivanovskis
          Senior Member
          • Jul 2015
          • 237

          #5
          Originally posted by Jason
          I'm keeping a count of the number of open sockets and it's ticking along at just over 100...

          It is possible it's a network issue, but I'd not expect to lose access to the database server as it's on the same physical hardware.
Do you mean these are over 100 ESTABLISHED connections on the trapper port? That might be a concern. What about other TCP states, e.g. SYN_RECV, CLOSED, TIME-WAIT?

Even on the same physical hardware they communicate through sockets; there is still a TCP layer between them.
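Something like this gives the breakdown by state (just a sketch):

# count sockets per TCP state (skip the two netstat header lines)
netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn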

          • Jason
            Senior Member
            • Nov 2007
            • 430

            #6
            No... just looking in more detail... They're all database connections....

That's the count of established connections from "netstat -an | grep ESTABLISHED | wc -l". It does seem constant at around 100.

            I've about 4000 in TIME_WAIT

            Puzzled as I don't think I should have so many.

We have just under 700 hosts and about 72,000 items.

Just looking at last night's outage... The socket count in the ESTABLISHED state suddenly shoots up to over 200 for the duration of the outage.
            Last edited by Jason; 22-06-2016, 09:06. Reason: added more info.

            • glebs.ivanovskis
              Senior Member
              • Jul 2015
              • 237

              #7
grep <port> might help too.

Here is a good article providing some insight into how TCP is implemented in Linux: it gives an in-depth description of how the TCP backlog works in Linux and, in particular, what happens when the accept queue is full, with references to the relevant kernel sources.
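And to see whether the accept queue on the trapper port is actually overflowing, something along these lines (a sketch; on LISTEN sockets ss reports the current accept-queue length in Recv-Q and the configured backlog in Send-Q):

# cumulative kernel counters for listen-queue overflows and dropped SYNs
netstat -s | grep -i -E 'listen|overflow'

# accept-queue depth and backlog for the listening trapper socket (port 10052)
ss -ltn | grep ':10052'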

              • Jason
                Senior Member
                • Nov 2007
                • 430

                #8
I know how TCP ports work... What I don't know is why we have so many connections open to our database server, and why the ESTABLISHED socket count suddenly shoots through the roof.

                I'm putting more monitoring in place to capture some more details.
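As part of that monitoring, something along these lines (a sketch; the database and user names come from the config earlier in the thread, and pg_stat_activity's state column assumes PostgreSQL 9.2 or later):

# established connections grouped by peer address
netstat -ant | awk '$6 == "ESTABLISHED" {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn

# what the sessions are doing on the database side (run on the DB server)
psql -U dbuser -d zabbix -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY 2 DESC;"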

                • Jason
                  Senior Member
                  • Nov 2007
                  • 430

                  #9
More digging in the database logs...

Last night's "outage" is the only time I'm seeing lots of slow queries logged here. I'm wondering if it's an issue with checkpointing that just happens to clash with something from Zabbix at that time...

2016-06-21 22:11:09.367 BST LOG: checkpoint starting: time
2016-06-21 22:11:10.966 BST LOG: automatic analyze of table "zabbix.public.hosts" system usage: CPU 0.01s/0.04u sec elapsed 1.03 sec
2016-06-21 22:11:15.856 BST LOG: automatic analyze of table "zabbix.public.host_inventory" system usage: CPU 0.00s/0.16u sec elapsed 0.47 sec
2016-06-21 22:11:32.614 BST LOG: automatic analyze of table "zabbix.public.item_discovery" system usage: CPU 0.09s/2.21u sec elapsed 7.78 sec
2016-06-21 22:11:44.497 BST LOG: automatic analyze of table "zabbix.public.hosts" system usage: CPU 0.00s/0.04u sec elapsed 1.03 sec
2016-06-21 22:12:08.904 BST LOG: automatic vacuum of table "zabbix.partitions.history_str_p2016_05_29": index scans: 0
                  pages: 0 removed, 90091 remain
                  tuples: 0 removed, 12939543 remain
                  buffer usage: 90185 hits, 180221 misses, 1 dirtied
                  avg read rate: 3.052 MiB/s, avg write rate: 0.000 MiB/s
                  system usage: CPU 2.87s/2.42u sec elapsed 461.25 sec
2016-06-21 22:12:24.681 BST LOG: automatic analyze of table "zabbix.public.hosts" system usage: CPU 0.00s/0.04u sec elapsed 0.70 sec
2016-06-21 22:15:39.222 BST LOG: checkpoint complete: wrote 67015 buffers (12.8%); 0 transaction log file(s) added, 0 removed, 28 recycled; write=269.652 s, sync=0.127 s, total=269.854 s; sync files=77, longest=0.045 s, average=0.001 s
2016-06-21 22:16:09.252 BST LOG: checkpoint starting: time
2016-06-21 22:16:46.505 BST LOG: duration: 2803.479 ms statement: SELECT MAX(e.eventid) AS eventid,e.objectid FROM events e WHERE e.object=0 AND e.source=0 AND e.objectid IN ('14945','16255','17067','17076','17752','17777','17882','18468','18661','19228','22874','23942','24873','24972','25200','25258','26693','26698','26702','27324','29222','29231','30717','31172','31206','31319','32397','34840','35305','35559','35606','35679','35785','36818','37562','39042','39086','39272','39621','39737','40010','41275','41603','41684','42939','42964','43145','44179','44343','44348','44512','44980','44981','44982','44984','44985','44986','45032','45033','45034','45036','45037','45038','45378','45506','45508','45578','45580','45611','46008','48745','48747','48748','48753','48754','48757','48759','48760','48996','49027','49259','49805','49806','49808','49809','51905','53115','53269','53280','53311','53341','53548','53551','53589','53805','54157','55867','56184','56591','56619','56720','57302','57396','57397','57400','57401','58407','58408','59648','59652','60167','60278','60293','60306','60939','60940','60941','60946','60947','60948','61409','61411','62118','62122','62172','62459','62461','62520','62541','62551','62791','62792','62793','62796','63406') AND e.value='1' GROUP BY e.objectid
2016-06-21 22:16:55.728 BST LOG: automatic vacuum of table "zabbix.partitions.history_log_p2016_05_29": index scans: 0
                  pages: 0 removed, 481 remain
                  tuples: 0 removed, 15481 remain
                  buffer usage: 565 hits, 588 misses, 1 dirtied
                  avg read rate: 3.415 MiB/s, avg write rate: 0.006 MiB/s
                  system usage: CPU 0.01s/0.00u sec elapsed 1.34 sec
2016-06-21 22:17:09.538 BST LOG: automatic vacuum of table "zabbix.public.host_inventory": index scans: 1
                  pages: 0 removed, 516 remain
                  tuples: 113 removed, 507 remain
                  buffer usage: 258 hits, 0 misses, 22 dirtied
                  avg read rate: 0.000 MiB/s, avg write rate: 1.326 MiB/s
                  system usage: CPU 0.00s/0.00u sec elapsed 0.12 sec
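Something like this should show whether checkpoints line up with the outages (a sketch, run on the database server; the user/database names come from the config earlier in the thread):

# sample before and after an outage window and compare the deltas
psql -U dbuser -d zabbix -c "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_backend FROM pg_stat_bgwriter;"

# checkpoint-related settings currently in effect
psql -U dbuser -d zabbix -c "SHOW checkpoint_timeout;"
psql -U dbuser -d zabbix -c "SHOW checkpoint_completion_target;"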

                  • Jason
                    Senior Member
                    • Nov 2007
                    • 430

                    #10
I've increased io_concurrency and dropped the number of trappers slightly on the server, and it seems to have settled down. If it stays working OK then I'll try dropping the nightly restart.
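For reference, the two settings involved would be PostgreSQL's effective_io_concurrency (assuming that's what "io_concurrency" refers to) and StartTrappers in zabbix_server.conf; the values below are placeholders rather than the exact changes made:

# postgresql.conf on the database server (path assumes the stock CentOS packaging)
grep effective_io_concurrency /var/lib/pgsql/data/postgresql.conf
#   effective_io_concurrency = 4      # placeholder value, up from the default of 1

# zabbix_server.conf on the front end
grep StartTrappers /etc/zabbix/zabbix_server.conf
#   StartTrappers=40                  # placeholder value, down from the 60 shown earlier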
