This is what I see in the zabbix_proxy.log
All I see in the zabbix_server.log is a "sending configuration data to proxy" message with an approximately matching timestamp and exactly matching "data len".
Code:
1272715:20240912:170033.355 received configuration data from server at "1.2.3.4", datalen 497
1272715:20240912:170043.795 received configuration data from server at "1.2.3.4", datalen 2784836
1272715:20240912:170044.231 cannot get secrets for path "redacted/secrets/that/include/snmp/community/for/redacted.example.com": no data
1272715:20240912:170057.531 received configuration data from server at "1.2.3.4", datalen 22357858
1272715:20240912:170100.839 cannot get secrets for path "redacted/secrets/that/include/snmp/community/for/redacted.example.com": no data
1272770:20240912:170101.970 SNMP response from host "redacted.example.com" contains too few variable bindings
The vault server logs don't show anything useful, just the Zabbix server frequently reading those secrets and no failures. We have lots of other things that use vault constantly, and we are unable to reproduce anything like this with the vault CLI, curl, or other methods of accessing the vault API, so it seems very unlikely to be a problem on the Vault side.
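For reference, the reproduction attempt described above can be automated as a simple polling loop. This is a sketch, not anything from Zabbix itself: `VAULT_ADDR`, `VAULT_TOKEN`, and `SECRET_PATH` are placeholders you would substitute, and a KV v2 mount (which inserts `/data/` into the read path) is assumed.

```python
import json
import time
import urllib.request

# Placeholder values -- substitute your own Vault address, token, and the
# KV v2 path Zabbix reads; none of these come from the original post.
VAULT_ADDR = "https://vault.example.com:8200"
VAULT_TOKEN = "s.xxxxxxxx"
SECRET_PATH = "secret/data/zabbix/snmp"  # KV v2 mounts insert /data/ in the path


def build_request(addr: str, token: str, path: str) -> urllib.request.Request:
    """Authenticated KV read: GET <addr>/v1/<path> with the X-Vault-Token header."""
    return urllib.request.Request(
        f"{addr}/v1/{path}", headers={"X-Vault-Token": token}
    )


if __name__ == "__main__":
    # Hammer the secret once a second from the proxy host; any failure here
    # would implicate Vault or the network, while silence points back at Zabbix.
    while True:
        try:
            req = build_request(VAULT_ADDR, VAULT_TOKEN, SECRET_PATH)
            with urllib.request.urlopen(req, timeout=5) as resp:
                json.load(resp)  # raises if the body is not valid JSON
        except Exception as exc:
            print(f"{time.strftime('%H:%M:%S')} read failed: {exc}")
        time.sleep(1)
```

Running this alongside the proxy during a configuration sync would show whether Vault ever actually refuses the read, or whether the "no data" error is purely internal to the proxy.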
(Server IP, device hostname, and secret paths redacted. It shows this for basically all vault secrets in use at the same time, not just a single secret.)
Randomly we will get auth failures from things using those secrets (for a password or community string) and a corresponding gap in the item history. Those "no data" errors seem to happen approximately every 60 seconds, but failures of items that use them are much less frequent.
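To pin down that ~60-second cadence, something along these lines could measure the gaps between consecutive "cannot get secrets" messages. This is a sketch that only assumes the `pid:yyyymmdd:hhmmss.mmm` line prefix visible in the log excerpt above:

```python
import re
from datetime import datetime

# Matches the pid:yyyymmdd:hhmmss.mmm prefix used in zabbix_proxy.log lines.
LOG_RE = re.compile(r"^\d+:(\d{8}):(\d{6}\.\d{3})\s+(.*)$")


def parse_line(line):
    """Return (timestamp, message) for one log line, or None if it doesn't match."""
    m = LOG_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S.%f")
    return ts, m.group(3)


def secret_failure_gaps(lines):
    """Seconds between consecutive 'cannot get secrets' messages."""
    times = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and "cannot get secrets" in parsed[1]:
            times.append(parsed[0])
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Feeding the whole `zabbix_proxy.log` through `secret_failure_gaps` should show whether the errors really cluster on a 60-second period or track something else.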
It seems like there's a brief window, while the Zabbix server is sending a configuration update to the Zabbix proxy, during which the vault secrets are unavailable to the proxy. We only see auth failures when that window happens to coincide with an item that uses a vault secret running. And the larger the update, the longer the window seems to be, so the more configuration data is sent, the more likely it is to cause a failure of an item that uses the macro.
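That "bigger sync, longer window" hypothesis is testable from the log alone: pair each "received configuration data" line with the time until the next "cannot get secrets" error (or nothing, if another push arrives first), and see whether the delay grows with `datalen`. A sketch, again assuming only the line format shown in the excerpt:

```python
import re
from datetime import datetime

LINE = re.compile(r"^\d+:(\d{8}):(\d{6}\.\d{3})\s+(.*)$")
DATALEN = re.compile(r"received configuration data from server .*datalen (\d+)")


def config_push_to_failure(lines):
    """For each config push, pair its datalen with the seconds until the next
    'cannot get secrets' error (None if no error before the next push)."""
    events = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S.%f")
        events.append((ts, m.group(3)))

    results = []
    for i, (ts, msg) in enumerate(events):
        d = DATALEN.search(msg)
        if not d:
            continue
        delay = None
        for ts2, msg2 in events[i + 1:]:
            if "received configuration data" in msg2:
                break  # next push arrived first; no failure in this window
            if "cannot get secrets" in msg2:
                delay = (ts2 - ts).total_seconds()
                break
        results.append((int(d.group(1)), delay))
    return results
```

On the excerpt above this yields roughly `(497, None)`, `(2784836, 0.436)`, `(22357858, 3.308)`, which is at least consistent with the window scaling with the size of the configuration push.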
Has anybody else encountered this? Or have any ideas how to eliminate or mitigate the problem? Even just some specifics on how to track down more detail on this kind of failure would be useful.
I scoured the documentation and it really doesn't seem like there are any options to control caching of those secrets (except in the web frontend), which could otherwise be a significant mitigation.