Ad Widget

**Jason** · 21-06-2016, 09:42

I've put a daily restart in, but data is still dropping out on some evenings. Anyone any ideas?

**glebs.ivanovskis** · 21-06-2016, 13:53

Anything interesting in server log file? Have you tried stracing trapper processes during problem? What does netstat show for server ListenPort? Have you tried collecting network traffic? Overall, is it a good or bad network?

It seems that when network is bad, Zabbix can be even worse.

**Jason** · 21-06-2016, 18:00

I've got several agents that are on the local network and some of which are even on the same vhost as zabbix. They are all working and even data from the database server stops coming in...

I've just found that during the outage I can see several lines starting like this in the logs... These lines go on for several screens

2847:20160620:215510.624 slow query: 3.061724 sec, "insert into trends_uint (itemid,clock,num,value_min,value_avg,value_max) values (122230,1466449200,62,3856,6036,11752),(214809,14
66449200,1,0,0,0),(225066,1466449200,1,0,0,0),(221 764,1466449200,2,0,0,0),(167436,1466449200,62,1864 ,7263,12120),(197553,1466449200,65,32680,1662689,3 365120),(126092,1466449200,61,1,1
,1),(136350,1466449200,63,1,1,1)

I've also found this...

2784:20160620:225603.327 sending configuration data to proxy "SNMP Proxy", datalen 1859461
2784:20160620:225603.328 cannot send configuration: ZBX_TCP_WRITE() failed: [32] Broken pipe

2848:20160620:233259.344 error reason for "5267e681-66d4-8fd2-a230-725290a19831:vmware.vm.uptime[{$URL},{HOST.HOST}]" changed: Couldn't connect to server

I'm keeping a count of the number of open sockets and it's ticking along at just over 100...

It is possible it's a network issue, but I'd not expect to lose access to the database server as it's on the same physical hardware.

**glebs.ivanovskis** · 22-06-2016, 08:08

Originally posted by Jason

I'm keeping a count of the number of open sockets and it's ticking along at just over 100...

It is possible it's a network issue, but I'd not expect to lose access to the database server as it's on the same physical hardware.

Do you mean these are over 100 ESTABLISHED connections on trapper port? That might be a concern. What about other TCP states, e.g. SYN_RECV, CLOSED, TIME-WAIT?

They communicate through sockets, there is still TCP layer between them.

**Jason** · 22-06-2016, 09:00

No... just looking in more detail... They're all database connections....

That's established connections found by "netstat -an | grep ESTABLISHED | wc -l" That does seem constant around 100

I've about 4000 in TIME_WAIT

Puzzled as I don't think I should have so many.

We have just under 700 hosts and about 72,000items

Just looking at last nights outage... The socket count in ESTABLISHED state suddenly shoots up to over 200 for the duration of the outage.

**glebs.ivanovskis** · 22-06-2016, 09:23

grep <port> might help too

Here is a good article providing some insights on how TCP is implemented in Linux:

How TCP backlog works in Linux

http://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html

This article gives an in-depth description how TCP backlog works in Linux and in particular what happens when the accept queue is full. Includes references to the relevant kernel sources.

**Jason** · 22-06-2016, 09:26

I know how the TCP ports work... What I don't know is why we have so many connections open to our database server and why suddenly the ESTABLISHED socket count shoots through the roof.

I'm putting more monitoring in place to capture some more details.

**Jason** · 22-06-2016, 10:53

More examining of the database logs...

During last nights "outage" it's the only time I'm seeing lots of the slow queries logged here. I'm wondering if it's an issue with the checkpointing and it just happens to clash with something from zabbix at that time...

2016-06-21 22:11:09.367 BSTLOG: checkpoint starting: time
2016-06-21 22:11:10.966 BSTLOG: automatic analyze of table "zabbix.public.hosts" system usage: CPU 0.01s/0.04u sec elapsed 1.03 sec
2016-06-21 22:11:15.856 BSTLOG: automatic analyze of table "zabbix.public.host_inventory" system usage: CPU 0.00s/0.16u sec elapsed 0.47 sec
2016-06-21 22:11:32.614 BSTLOG: automatic analyze of table "zabbix.public.item_discovery" system usage: CPU 0.09s/2.21u sec elapsed 7.78 sec
2016-06-21 22:11:44.497 BSTLOG: automatic analyze of table "zabbix.public.hosts" system usage: CPU 0.00s/0.04u sec elapsed 1.03 sec
2016-06-21 22:12:08.904 BSTLOG: automatic vacuum of table "zabbix.partitions.history_str_p2016_05_29": index scans: 0
pages: 0 removed, 90091 remain
tuples: 0 removed, 12939543 remain
buffer usage: 90185 hits, 180221 misses, 1 dirtied
avg read rate: 3.052 MiB/s, avg write rate: 0.000 MiB/s
system usage: CPU 2.87s/2.42u sec elapsed 461.25 sec
2016-06-21 22:12:24.681 BSTLOG: automatic analyze of table "zabbix.public.hosts" system usage: CPU 0.00s/0.04u sec elapsed 0.70 sec
2016-06-21 22:15:39.222 BSTLOG: checkpoint complete: wrote 67015 buffers (12.8%); 0 transaction log file(s) added, 0 removed, 28 recycled; write=269.652 s, sync=0.127 s, total=269.854 s; sync files=77, longest=0.045 s, average=0.001 s
2016-06-21 22:16:09.252 BSTLOG: checkpoint starting: time
2016-06-21 22:16:46.505 BSTLOG: duration: 2803.479 ms statement: SELECT MAX(e.eventid) AS eventid,e.objectid FROM events e WHERE e.object=0 AND e.source=0 AND e.objectid IN ('14945','16255','17067','17076','17752','17777',' 17882','18468','18661','19228','22874','23942','24 873','24972','25200','25258','26693','26698','2670 2','27324','29222','29231','30717','31172','31206' ,'31319','32397','34840','35305','35559','35606',' 35679','35785','36818','37562','39042','39086','39 272','39621','39737','40010','41275','41603','4168 4','42939','42964','43145','44179','44343','44348' ,'44512','44980','44981','44982','44984','44985',' 44986','45032','45033','45034','45036','45037','45 038','45378','45506','45508','45578','45580','4561 1','46008','48745','48747','48748','48753','48754' ,'48757','48759','48760','48996','49027','49259',' 49805','49806','49808','49809','51905','53115','53 269','53280','53311','53341','53548','53551','5358 9','53805','54157','55867','56184','56591','56619' ,'56720','57302','57396','57397','57400','57401',' 58407','58408','59648','59652','60167','60278','60 293','60306','60939','60940','60941','60946','6094 7','60948','61409','61411','62118','62122','62172' ,'62459','62461','62520','62541','62551','62791',' 62792','62793','62796','63406') AND e.value='1' GROUP BY e.objectid
2016-06-21 22:16:55.728 BSTLOG: automatic vacuum of table "zabbix.partitions.history_log_p2016_05_29": index scans: 0
pages: 0 removed, 481 remain
tuples: 0 removed, 15481 remain
buffer usage: 565 hits, 588 misses, 1 dirtied
avg read rate: 3.415 MiB/s, avg write rate: 0.006 MiB/s
system usage: CPU 0.01s/0.00u sec elapsed 1.34 sec
2016-06-21 22:17:09.538 BSTLOG: automatic vacuum of table "zabbix.public.host_inventory": index scans: 1
pages: 0 removed, 516 remain
tuples: 113 removed, 507 remain
buffer usage: 258 hits, 0 misses, 22 dirtied
avg read rate: 0.000 MiB/s, avg write rate: 1.326 MiB/s
system usage: CPU 0.00s/0.00u sec elapsed 0.12 sec

**Jason** · 24-06-2016, 10:11

I've increased io_concurrency and dropped the number of trappers slightly on the server and it seems to have settled down. If it stays working ok then I'll try dropping the nightly restart.

Ad Widget

Remote agents losing connection to the server for prolonged periods

Remote agents losing connection to the server for prolonged periods

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment