Occasional Timeouts with Zabbix agents behind an AWS Network Load Balancer

  • dimisjim
    Junior Member
    • Aug 2019
    • 2

    #1

    Hey there,

    My architecture in AWS is as follows:

    There are two identical Zabbix agents (based on zabbix/zabbix-agent:centos-4.0.11), each running on a different EC2 instance.
    The Zabbix server runs on a third instance (also dockerized, with dockbix, also on version 4.0). All three instances are inside the same VPC.

    The idea is to have a Network Load Balancer that listens on the port both agents use (10050), with the two agent instances registered in its target group.
    The DNS name of this NLB is then entered as the interface in the Zabbix host configuration. The goal is to have multiple Zabbix hosts targeting the same NLB, with their requests routed to one agent or the other according to traffic load. Each host has a Zabbix agent item that invokes a UserParameter (a Python script) defined in the conf file of each of the two agents.

    My problem is as follows: zabbix_get (and the equivalent call made automatically at the interval set in the host configuration) occasionally times out. Sometimes I get a successful response, "{"response":"success","info":"processed: 4; failed: 0; total: 4; seconds spent: 0.000106"}" (the Python script is pretty fast, it takes just 1 second), and other times I get a response such as "zabbix_get [4515]: Timeout while executing operation". The results also alternate: one call succeeds, the next times out, the next succeeds, and so on.

    I have tested the connection with telnet, and it works every time. I have even tried a simple TCP echo container, which also worked fine every time.

    Any ideas on what might be wrong would be greatly appreciated
  • Markku
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
    • Sep 2018
    • 1781

    #2
    Hi, I would definitely run tcpdump on all the servers at the same time to find out what the TCP sessions look like. Something like "sudo tcpdump -v -w dumpfileX.pcap port 10050", then fetch the dump files, open them in Wireshark, and see who does not respond or who misses the sent packets. Remember to check that the clocks on the hosts agree, and compare the pcaps for the same events.

    Also check that your iptables or similar rules allow the traffic as needed. With load balancing, this pattern might indicate that one host is working but the other is not (while the NLB still tries to use it for every other connection).
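
Alongside the packet captures, the alternating pattern can be made visible with a small repeated probe. The following is a minimal sketch, not a replacement for zabbix_get. It assumes the Zabbix 4.0 passive check exchange as I understand it: the request is the item key followed by a newline, and the reply is "ZBXD", a flags byte, an 8-byte little-endian length, then the data. The function names and the default timeout are my own choices for illustration.

```python
import socket
import struct

# Assumed reply prefix for Zabbix 4.0: signature "ZBXD" plus a flags byte of 0x01.
ZBX_HEADER = b"ZBXD\x01"


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before the full reply arrived")
        buf += chunk
    return buf


def passive_check(host, port, key, timeout=3.0):
    """Send one passive check request and return the reply data as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(key.encode() + b"\n")
        header = recv_exact(sock, len(ZBX_HEADER) + 8)
        if not header.startswith(ZBX_HEADER):
            raise ValueError("unexpected reply header")
        (length,) = struct.unpack("<Q", header[len(ZBX_HEADER):])
        return recv_exact(sock, length).decode()


def probe(host, port, key, attempts=10, timeout=3.0):
    """Repeat the check and record 'ok', 'timeout', or an error name per attempt."""
    results = []
    for _ in range(attempts):
        try:
            passive_check(host, port, key, timeout)
            results.append("ok")
        except socket.timeout:
            results.append("timeout")
        except (OSError, ValueError) as exc:
            results.append("error: " + type(exc).__name__)
    return results
```

For example, print(probe("my-nlb.example.internal", 10050, "agent.ping")) with a placeholder NLB hostname. A strict ok/timeout alternation would suggest that every other connection lands somewhere that nothing answers, which narrows down where to look in the pcaps.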

    Markku

    • dimisjim
      Junior Member
      • Aug 2019
      • 2

      #3
      So apparently the issue was that I hadn't enabled Cross-Zone Load Balancing on the NLB, so the zones couldn't talk to each other. When a request was routed to a zone with no service to respond, it timed out.
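
For anyone who wants to script the same fix, here is a minimal sketch using boto3, assuming AWS credentials and region are already configured. The attribute key load_balancing.cross_zone.enabled and the elbv2 call modify_load_balancer_attributes are the ones I know from the AWS API. The helper names and the ARN in the usage note are placeholders of mine, not something from my actual setup.

```python
def cross_zone_attributes(enabled):
    """Build the attribute list that switches cross-zone load balancing on or off."""
    return [
        {
            "Key": "load_balancing.cross_zone.enabled",
            "Value": "true" if enabled else "false",
        }
    ]


def enable_cross_zone(load_balancer_arn, client=None):
    """Enable cross-zone load balancing on the given load balancer.

    `client` is expected to be a boto3 "elbv2" client; one is created if omitted.
    """
    if client is None:
        import boto3  # imported lazily so the helper above works without boto3

        client = boto3.client("elbv2")
    return client.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=cross_zone_attributes(True),
    )
```

Usage would be something like enable_cross_zone("arn:aws:elasticloadbalancing:...") with the ARN of the NLB in question. The same setting can also be changed in the load balancer's attributes in the console.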
