Hey all,
First post, so apologies if I'm missing something critical. I'm involved in a trial of LEO connectivity on some trains in the UK. As part of the trial, we've been evaluating on-board monitoring platforms to report the health of on-board systems and to pull SNMP statistics from supported devices. We have a central Zabbix server, with Zabbix proxies running as containers within the on-board FW on each train feeding data back. So far, so good. The connectivity isn't always reliable, even with LEO: coverage can be unavailable for periods of time given our geographic position, and the undulating path the train takes, plus tunnels, can also drop the backhaul.
Our on-board backhaul router maintains a tunnel to the DC where the main Zabbix server lives, and this tunnel can go down intermittently. The issue I have is that the remote Zabbix proxies lose track of host availability for the SNMP-monitored hosts, leaving them in an 'Unknown' state. If we remotely restart the proxy container via Ansible - bingo - the state of the monitored hosts returns within a few seconds. Even while the availability state is lost, we still receive data from those hosts for interface, CPU, memory, etc.
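The restart itself is nothing clever - roughly the task below, where the inventory group and the op-mode command are placeholders standing in for our real playbook:

Code:
- name: Bounce the on-board Zabbix proxy container
  hosts: onboard_fw
  gather_facts: false
  tasks:
    - name: Restart the zabbix-proxy container
      ansible.builtin.command: restart container zabbix-proxy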
Before we restart the container, we see logs which look like this -
Code:
8:20260223:013830.008 received configuration data from server at "1.2.3.4", datalen 233947
38:20260223:013833.536 enabling SNMP agent checks on host "FW-000000-001": interface became available
38:20260223:013836.548 enabling SNMP agent checks on host "SWI-000000-00-001": interface became available
38:20260223:013839.487 enabling SNMP agent checks on host "SWI-000000-00-002": interface became available
38:20260223:013840.551 enabling SNMP agent checks on host "SWI-000000-00-001": interface became available
38:20260223:013841.548 enabling SNMP agent checks on host "SWI-000000-00-002": interface became available
38:20260223:013842.533 enabling SNMP agent checks on host "RTR-000000-001": interface became available
16:20260223:020754.448 executing housekeeper
16:20260223:020754.480 housekeeper [deleted 0 records in 0.000380 sec, idle for 1 hour(s)]
8:20260223:023424.731 received configuration data from server at "1.2.3.4", datalen 29712
16:20260223:030754.004 executing housekeeper
16:20260223:030755.005 housekeeper [deleted 71882 records in 0.451045 sec, idle for 1 hour(s)]
15:20260223:031250.668 Unable to connect to [1.2.3.4]:10051 [cannot connect to [[1.2.3.4]:10051]: connection timed out]
15:20260223:031250.668 Will try to reconnect every 1 second(s)
15:20260223:031250.704 Connection restored.
8:20260223:031254.676 Unable to connect to [1.2.3.4]:10051 [cannot connect to [[1.2.3.4]:10051]: connection timed out]
8:20260223:031254.676 Will try to reconnect every 10 second(s)
8:20260223:031255.780 Connection restored.
After the restart of the container, the issue clears. The logs look remarkably similar -
Code:
user@fw:~$ sho container log zabbix-proxy | no-more
Preparing Zabbix proxy
Starting Zabbix Proxy (active) [000000]. Zabbix 7.4.2 (revision 7aa4e07).
Press Ctrl+C to exit.
1:20260224:121243.196 Starting Zabbix Proxy (active) [000000]. Zabbix 7.4.2 (revision 7aa4e07).
1:20260224:121243.196 **** Enabled features ****
1:20260224:121243.196 SNMP monitoring: YES
1:20260224:121243.196 IPMI monitoring: YES
1:20260224:121243.196 Web monitoring: YES
1:20260224:121243.196 VMware monitoring: YES
1:20260224:121243.196 ODBC: YES
1:20260224:121243.196 SSH support: YES
1:20260224:121243.196 IPv6 support: YES
1:20260224:121243.196 TLS support: YES
1:20260224:121243.196 **************************
1:20260224:121243.196 using configuration file: /etc/zabbix/zabbix_proxy.conf
1:20260224:121243.206 cannot open database file "/var/lib/zabbix/db_data/000000.sqlite": [2] No such file or directory
1:20260224:121243.207 creating database ...
1:20260224:121244.244 current database version (mandatory/optional): 07040000/07040000
1:20260224:121244.244 required mandatory version: 07040000
1:20260224:121244.246 proxy #0 started [main process]
8:20260224:121244.247 proxy #1 started [configuration syncer #1]
8:20260224:121244.265 no records in "settings" table
9:20260224:121244.274 proxy #2 started [trapper #1]
10:20260224:121244.274 proxy #3 started [trapper #2]
11:20260224:121244.275 proxy #4 started [trapper #3]
12:20260224:121244.276 proxy #5 started [trapper #4]
13:20260224:121244.278 proxy #6 started [trapper #5]
14:20260224:121244.279 proxy #7 started [preprocessing manager #1]
15:20260224:121244.280 proxy #8 started [data sender #1]
16:20260224:121244.285 proxy #9 started [housekeeper #1]
17:20260224:121244.286 proxy #10 started [http poller #1]
18:20260224:121244.288 proxy #11 started [browser poller #1]
19:20260224:121244.291 proxy #12 started [discovery manager #1]
20:20260224:121244.297 proxy #13 started [history syncer #1]
21:20260224:121244.299 proxy #14 started [history syncer #2]
22:20260224:121244.303 proxy #15 started [history syncer #3]
23:20260224:121244.308 proxy #16 started [history syncer #4]
24:20260224:121244.309 proxy #17 started [self-monitoring #1]
25:20260224:121244.310 proxy #18 started [task manager #1]
26:20260224:121244.311 proxy #19 started [poller #1]
27:20260224:121244.311 proxy #20 started [poller #2]
28:20260224:121244.311 proxy #21 started [poller #3]
29:20260224:121244.315 proxy #22 started [poller #4]
30:20260224:121244.317 proxy #23 started [poller #5]
31:20260224:121244.319 proxy #24 started [unreachable poller #1]
32:20260224:121244.321 proxy #25 started [icmp pinger #1]
33:20260224:121244.322 proxy #26 started [availability manager #1]
37:20260224:121244.330 proxy #30 started [snmp poller #1]
34:20260224:121244.337 proxy #27 started [odbc poller #1]
35:20260224:121244.339 proxy #28 started [http agent poller #1]
37:20260224:121244.340 thread started
36:20260224:121244.340 proxy #29 started [agent poller #1]
35:20260224:121244.340 thread started
38:20260224:121244.341 proxy #31 started [internal poller #1]
36:20260224:121244.341 thread started
14:20260224:121244.411 [2] thread started [preprocessing worker #2]
14:20260224:121244.411 [4] thread started [preprocessing worker #4]
14:20260224:121244.411 [1] thread started [preprocessing worker #1]
14:20260224:121244.412 [5] thread started [preprocessing worker #5]
14:20260224:121244.412 [7] thread started [preprocessing worker #7]
14:20260224:121244.412 [8] thread started [preprocessing worker #8]
14:20260224:121244.412 [6] thread started [preprocessing worker #6]
14:20260224:121244.412 [10] thread started [preprocessing worker #10]
14:20260224:121244.412 [3] thread started [preprocessing worker #3]
14:20260224:121244.412 [11] thread started [preprocessing worker #11]
14:20260224:121244.412 [12] thread started [preprocessing worker #12]
14:20260224:121244.413 [14] thread started [preprocessing worker #14]
14:20260224:121244.413 [15] thread started [preprocessing worker #15]
14:20260224:121244.415 [13] thread started [preprocessing worker #13]
14:20260224:121244.415 [9] thread started [preprocessing worker #9]
14:20260224:121244.415 [16] thread started [preprocessing worker #16]
8:20260224:121244.471 received configuration data from server at "1.2.3.4", datalen 233947
19:20260224:121245.648 thread started [discovery worker #1]
19:20260224:121245.648 thread started [discovery worker #3]
19:20260224:121245.648 thread started [discovery worker #4]
19:20260224:121245.648 thread started [discovery worker #5]
19:20260224:121245.648 thread started [discovery worker #2]
37:20260224:121246.396 enabling SNMP agent checks on host "FW-000000-001": interface became available
37:20260224:121247.337 enabling SNMP agent checks on host "RTR-000000-001": interface became available
37:20260224:121247.300 enabling SNMP agent checks on host "SWI-000000-00-001": interface became available
37:20260224:121247.385 enabling SNMP agent checks on host "SWI-000000-00-001": interface became available
37:20260224:121248.345 enabling SNMP agent checks on host "SWI-000000-00-002": interface became available
Restarting the container, either manually or in an automated fashion, works, but it's not a very elegant solution, and in the OS we use to run the containers we've hit issues with the number of volumes exceeding a watermark when restarting the Zabbix proxy container so often (leading to more maintenance on the podman side).
I was wondering whether anyone who has run the proxy over a less-than-100%-reliable backhaul might be able to offer some advice - in particular, whether there are optional environment variables we can pass to the proxy to mitigate the loss of monitoring. Our container config in the FW host OS is fairly basic and is as follows -
Code:
set container name zabbix-proxy allow-host-networks
set container name zabbix-proxy capability 'net-raw'
set container name zabbix-proxy environment ZBX_DEBUGLEVEL value '3'
set container name zabbix-proxy environment ZBX_HOSTNAME value '000000'
set container name zabbix-proxy environment ZBX_PROXYMODE value '0'
set container name zabbix-proxy environment ZBX_SERVER_HOST value '1.2.3.4'
set container name zabbix-proxy environment ZBX_SERVER_PORT value '10051'
set container name zabbix-proxy image 'zabbix/zabbix-proxy-sqlite3'
Thanks for taking the time to read this post!
Andy
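For reference, the variables I've been eyeing up are the ones that should map to the UnavailableDelay / UnreachableDelay / UnreachablePeriod and ProxyOfflineBuffer parameters in zabbix_proxy.conf. I haven't yet verified that the zabbix-proxy-sqlite3 image honours all of these ZBX_* names, and the values below are untested guesses, so treat this as a sketch rather than something we've proven -

Code:
set container name zabbix-proxy environment ZBX_UNAVAILABLEDELAY value '60'
set container name zabbix-proxy environment ZBX_UNREACHABLEDELAY value '15'
set container name zabbix-proxy environment ZBX_UNREACHABLEPERIOD value '45'
set container name zabbix-proxy environment ZBX_PROXYOFFLINEBUFFER value '24'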