Zabbix Server and Galera Cluster Across AWS Availability Zones

  • cwhite
    Member
    • Aug 2015
    • 46

    #1

    Zabbix Server and Galera Cluster Across AWS Availability Zones

    Hello Community -

    We recently deployed into a hybrid (on-prem/cloud) environment due to customer constraints. Coming from real networking, AWS is a steep learning curve, so this may turn out to be unrelated to Zabbix.

    Scenario:

    We currently have a light Zabbix instance for testing on AWS, upgraded to 5.0. We operate in a single VPC with multiple AZs. Initially this instance ran standalone within a single EC2 instance; it is currently a container on an Ubuntu LXD host (EC2), meaning all services and MariaDB were in the same container. Everything ran as intended, so we moved forward with migrating our main on-prem Zabbix HA to AWS; the last part of testing was Galera clustering across multiple AZs on EC2 instances. This time we are breaking out services as we do on-prem, with dedicated containers for the frontend, the MariaDB Galera cluster and the Zabbix server, across AZs but in the same VPC. We've had no performance issues in our on-prem setup for nearly 2 years, with all best practices implemented (partitioning, housekeeper, pcs cluster/HA, etc.). We want to replicate that on AWS, but...

    After overcoming the firewall/networking/security groups of AWS, we have our original standalone instance (all services in a single container) and a Galera node in another AZ. The Galera cluster is up successfully. The issue arises when we change the 'primary' DB in the appropriate conf files to point to the Galera node in the other AZ. Initially everything works fine (in fact, as it's supposed to), but after a few hours we get MySQL/DB timeout messages in both Zabbix and MySQL.

    Zabbix:
    "3782:20200717:161932.134 [Z3005] query failed: [2013] Lost connection to MySQL server during query [select actionid from operations where recovery=1 and actionid=16]
    3782:20200717:161932.178 database is down: retrying in 10 seconds"

    Galera:
    "[Warning] Aborted connection 4037 to db: 'zabbix' user: 'zabbix' host: '10.11.64.18' (Got an error reading communication packets)"
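
    To line up the two logs by time and client IP, the entries quoted above can be parsed with something like the following. This is only a minimal sketch: the regular expressions are written against the two sample lines in this post, and the function names are made up for illustration, so other log formats may need adjusting.

```python
import re
from datetime import datetime

# Zabbix server log prefix as seen above: PID:YYYYMMDD:HHMMSS.mmm message
ZABBIX_LINE = re.compile(
    r"^\s*(?P<pid>\d+):(?P<date>\d{8}):(?P<time>\d{6})\.(?P<ms>\d{3})\s+(?P<msg>.*)$"
)

# MariaDB "Aborted connection" warning as seen above
ABORTED = re.compile(
    r"Aborted connection (?P<id>\d+) to db: '(?P<db>[^']*)' "
    r"user: '(?P<user>[^']*)' host: '(?P<host>[^']*)'"
)


def parse_zabbix_line(line):
    """Return pid, timestamp and message of a Zabbix log line, or None."""
    m = ZABBIX_LINE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["date"] + m["time"], "%Y%m%d%H%M%S")
    ts = ts.replace(microsecond=int(m["ms"]) * 1000)
    return {"pid": int(m["pid"]), "timestamp": ts, "message": m["msg"]}


def parse_aborted_connection(line):
    """Return connection id, db, user and client host of a warning, or None."""
    m = ABORTED.search(line)
    if m is None:
        return None
    return {
        "connection_id": int(m["id"]),
        "db": m["db"],
        "user": m["user"],
        "host": m["host"],
    }


def db_errors(lines):
    """Yield parsed Zabbix entries that report a failed query or a DB outage."""
    for line in lines:
        entry = parse_zabbix_line(line)
        if entry and ("[Z3005]" in entry["message"]
                      or "database is down" in entry["message"]):
            yield entry
```

    Feeding the Zabbix server log through `db_errors` and the MariaDB error log through `parse_aborted_connection` gives timestamps on one side and client hosts on the other, which makes the correlation described below easier to check.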

    I can confirm the DB is up, the Galera cluster is fine, and nothing changed other than the time elapsed since the last server service restart. The interesting correlation is on the DB side: the IP trying to connect is a secondary IP on the Zabbix server (10.11.64.18), used for some on-prem datacenter monitoring and for tracking via SLAs and tracks, so that if AWS were ever offline, services would get a new route back to our on-prem datacenter via EIGRP/BGP.

    The real question is why this interface/IP tries to initiate a connection, since it correlates with the loss of DB connection in the Zabbix logs, even though the main IP (10.11.64.15) of the primary interface is defined in server_conf. Could this be a bug, or am I missing something 'new' in 5.0?

    I will acknowledge that the secondary IP can talk to the Galera node (it is routable on AWS), but it does not have the appropriate iptables/FW rules on its host in a different AZ. Our EC2 instances/hosts have to conform to a security standard, and all have snort/iptables/NAT entries for each container on them. Thus only the main IP on Zabbix is configured appropriately for SNMP and Zabbix ports; the secondary IP has an out-of-band port used just for our tracking. We could open up these rules to verify that the issue is resolved, but I go back to my real question: why is Zabbix trying to switch IPs, even with the primary explicitly defined in its conf?
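
    One thing that can be checked from the Zabbix server container: on Linux, when a client does not bind its socket to a specific local address, the kernel picks the source IP from the routing table for that destination. The sketch below asks the kernel which source address it would use. It assumes the DB client connection is not bound to a particular local IP (not confirmed here for Zabbix), and the function name is made up for illustration.

```python
import socket


def source_ip_for(dest_host, dest_port=3306):
    """Return the local IP the kernel would use as source to reach dest_host.

    connect() on a UDP socket sends no packets; it only makes the kernel
    select a route and a source address for the destination.
    3306 is the default MariaDB port; the port does not affect the route.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((dest_host, dest_port))
        return s.getsockname()[0]
```

    Calling `source_ip_for` with the address of the Galera node in the other AZ and getting 10.11.64.18 back instead of 10.11.64.15 would suggest the route to that node prefers the secondary interface, independent of anything in the Zabbix conf.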


    EDIT:
    New wrinkle while troubleshooting: as soon as I made a host query, I immediately got the DB-side error about the wrong-IP connection (after a long pause on the query), and an error in the Zabbix GUI as well. Not sure why Zabbix is trying the wrong IP, or if AWS is doing some 'hand of Bezos' routing in there...

    Attachment: Screen Shot 2020-07-17 at 1.56.04 PM.png (550.2 KB, ID 405476)
    Last edited by cwhite; 17-07-2020, 19:57.
  • cwhite
    Member
    • Aug 2015
    • 46

    #2
    Further troubleshooting -

    I removed the secondary interface, and this is pointing more and more to AWS routing/timeouts: I still get DB packet errors even on the single interface. Using the local-AZ DB, the errors go away; it's only across AZs. So I guess the next question is: does anyone have an HA Zabbix deployment in the cloud without these timeouts across AZs?
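
    To put numbers on the local-AZ versus cross-AZ difference, a rough probe like the one below could be run periodically from the server container against each Galera node. It is only a sketch: it times the TCP handshake to the DB port and says nothing about idle connections being dropped, and the function name and default timeout are arbitrary choices for illustration.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=5.0):
    """Return seconds taken to complete a TCP handshake, or None on failure."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and performs the handshake;
        # the socket is closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return None
```

    Logging the result for the local node and the remote node side by side over a few hours would show whether the cross-AZ path degrades around the time the "Lost connection" errors start.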

    • cwhite
      Member
      • Aug 2015
      • 46

      #3
      We've moved to an AWS cluster across the newer AZs on the US EAST cluster (C and F zones), eliminating our on-prem hardware. These AZs offer true 10G networking throughput, versus our old primary AZ (E zone). After vetting the options, we decided to stay with EC2 instances rather than AWS Aurora or RDS, whose pricing is much higher for what we get from EC2 instances. We deployed 3 instances across the 2 new AZs, each with LXD containers: one each for server/frontend and for the MariaDB Galera cluster. Server containers use PCS HA services to bring up the active server through Route53 DNS management in AWS (failover). Each AZ's server container is set to read/write only to its local-AZ container in the Galera cluster.

      The move to all-AWS was necessitated by throughput limitations in our Transit VPC over Cisco CSR instances ($$$$). The cost of increasing CSR licensing throughput for Zabbix resiliency alone made EC2 instances the more cost-effective option, and it also opened up throughput for VPN resources to the CSRs.

      We will be doing an RDS/Aurora test instance for proof of concept if it ever becomes needed....

      • Venkata Krishna Darbha
        Junior Member
        • Aug 2021
        • 1

        #4
        Hi All,
        I am new to Zabbix and looking for some help. I have enabled the firewalld service check on Linux 7 hosts using a template; however, I see that our AWS systems do not use the firewalld service.
        Is there a way to update the template with macros so that the AWS systems are excluded from firewalld service alerts while the other systems stay as they are?

        I am unable to post through my account so posting it here.

        Any advice would be helpful.

        Thank you.
