Hello Community -
We recently deployed into a hybrid (on-prem/cloud) environment due to customer constraints. AWS has been a steep learning curve coming from traditional networking, so this may be unrelated to Zabbix.
Scenario:
We currently have a light Zabbix instance for testing on AWS, upgraded to 5.0. We operate in a single VPC with multiple AZs. Initially this instance ran standalone within a single EC2 instance; it is currently a container on an Ubuntu LXD host (EC2), meaning all services and MariaDB were in the same container. Everything ran as intended, so we moved forward with migrating our main on-prem Zabbix HA setup to AWS. The last part of testing was Galera clustering across EC2 instances in multiple AZs, this time breaking out services as we do on-prem, with dedicated containers for the frontend, the MariaDB Galera cluster, and the Zabbix server across AZs within the same VPC. We've had no performance issues in our on-prem setup for nearly two years, with all best practices implemented (partitioning, housekeeper, pcs cluster/HA, etc.). We want to replicate that on AWS, but...
After overcoming the AWS firewall/networking/security group hurdles, we have our original standalone instance (all services in a single container) and a Galera node in another AZ. The Galera cluster comes up successfully; the issue arises when we change the 'primary' DB in the appropriate conf files to point to the Galera node in the other AZ. Initially everything works fine (in fact, exactly as it's supposed to), but after a few hours we get MySQL/DB timeout messages on both the Zabbix and MySQL sides.
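For reference, the 'primary' DB change is just the standard connection parameters in zabbix_server.conf; a minimal sketch with hypothetical addresses (our real values differ):

```
# /etc/zabbix/zabbix_server.conf -- hypothetical values for illustration
DBHost=10.12.65.20    # Galera node in the other AZ
DBName=zabbix
DBUser=zabbix
DBPort=3306
```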
Zabbix:
"3782:20200717:161932.134 [Z3005] query failed: [2013] Lost connection to MySQL server during query [select actionid from operations where recovery=1 and actionid=16]
3782:20200717:161932.178 database is down: retrying in 10 seconds"
Galera:
"[Warning] Aborted connection 4037 to db: 'zabbix' user: 'zabbix' host: '10.11.64.18' (Got an error reading communication packets)"
I can confirm the DB is up, the Galera cluster is fine, and nothing changed other than the time since the last server service restart. The interesting correlation here is on the DB side: the IP trying to connect is a secondary IP on the Zabbix server (10.11.64.18), used for some on-prem datacenter monitoring and tracking via SLAs and tracks, so that if AWS were ever offline, services would get a new route back to our on-prem datacenter via EIGRP/BGP.
The real question is: why does this interface/IP initiate a connection at all (it correlates with the loss of DB connection in the Zabbix logs), even though the main IP (10.11.64.15) of the primary interface is defined in the server conf? Could this be a bug, or am I missing something 'new' in 5.0?
I will acknowledge that this IP can reach the Galera node (it is routable within AWS), but it does not have the appropriate iptables/firewall rules on the node's host in the other AZ. Our EC2 instances/hosts have to conform to a security standard, and all have snort/iptables/NAT entries for each container on them. Thus only the main IP on the Zabbix server is configured appropriately for SNMP and Zabbix ports; the secondary IP has an out-of-band port used just for our tracking. We could open up these rules to verify the issue is resolved, but I come back to my real question: why is Zabbix switching IPs, even with the primary explicitly defined in its conf?
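One thing worth noting here (general Linux behavior, not Zabbix-specific, and only my working theory): unless an application explicitly bind()s a source address before connecting, the kernel picks the outgoing source IP from the routing table for that destination. The DB host defined in the conf controls only where the connection goes, not which local IP it leaves from. A minimal sketch of that, using a loopback listener as a stand-in for the Galera node:

```python
import socket

# Loopback listener standing in for the Galera node (hypothetical stand-in).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))          # OS picks a free port
srv.listen(1)
port = srv.getsockname()[1]

# Connect WITHOUT binding a source address: the kernel chooses the
# source IP from its routing table, regardless of application config.
c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
c.connect(("127.0.0.1", port))
print("kernel-chosen source IP:", c.getsockname()[0])

# To force a specific source IP, an application must bind() first, e.g.:
# c.bind(("10.11.64.15", 0))  # hypothetical address; must exist locally
c.close()
srv.close()
```

If that theory holds, a route change (e.g. our EIGRP/BGP failover tracking touching the routing table) could flip which source IP the kernel selects for the Galera node's subnet, without Zabbix itself "switching" anything.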
EDIT:
A new wrinkle while troubleshooting: as soon as I made a host query, I immediately got the DB-side error about the wrong IP connecting (after a long pause on the query), along with an error in the Zabbix GUI. Not sure why Zabbix is trying the wrong IP, or if AWS is doing something with its 'hand of Bezos' routing...