Master/Child Node in Large Distributed Monitoring (DM) Environments

  • Palmertree
    Senior Member
    • Sep 2005
    • 746

    #1

    Master/Child Node in Large Distributed Monitoring (DM) Environments

    Tested on: Zabbix version 1.5.3 beta and a MySQL 5.0 database ONLY

    Note: Before I get started, I would like to mention that this is my own hack, and that does not mean it will be supported by future versions of Zabbix. I am posting this in case it benefits someone else. I spent several weeks testing and tweaking this with around 893 hosts and 131,974 active items on a single child node pushing history to the master node, with very good performance. I was able to push history to the master node every 10 seconds, and configuration changes propagated within 5 minutes without the history or graphs skipping a beat. I was also able to sync latest data so that lastcheck, lastvalue, and prevvalue (Last check, Last value, and Change) show up under Latest data for a child node, since this no longer impacts the history syncs.

    If you do decide to try this hack, please remember to backup your database before you get started.

    Problems found with the current child/master node setup in large environments with over 100 hosts and 20,000 active items
    • The TCP connection on port 10051 between the master and child nodes times out while waiting for data to be returned from the child or master node.
    • History data pauses for 5 to 10 minutes on graphs while nodesync syncs configuration changes between the nodes. Eventually the history data can never catch back up because it falls farther and farther behind.
    • The function “calculate_checksums” increases database load and causes deadlocks, because it runs a “UNION ALL” that hashes all of the tables marked ZBX_SYNC in dbschema.c at once.
    • If a template was removed from a host and re-added on the master node, the item INSERTs would sometimes occur before the DELETEs, corrupting items on the child node. You can tell this is happening by the duplicate-key warnings in zabbix_server.log on the child node.
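    The checksum problem in the third bullet can be illustrated schematically. The real query is generated from the ZBX_SYNC table list in dbschema.c; the table and column names below are only illustrative, not the actual SQL Zabbix emits:

```sql
-- Schematic of the single checksum pass calculate_checksums performs:
-- one big UNION ALL that hashes every ZBX_SYNC table in one statement,
-- touching locks on all of them at once (column names illustrative).
SELECT 'hosts' AS tablename, hostid AS recordid,
       MD5(CONCAT_WS('|', host, status)) AS cksum
  FROM hosts
UNION ALL
SELECT 'items', itemid,
       MD5(CONCAT_WS('|', key_, delay))
  FROM items;
```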

    Solution
    • Increased the timeout value on the trapper port (10051/tcp) to 10800 seconds (3 hours).
    • Created separate daemons for nodesync_data (history, trends, etc.) and nodesync_config (configuration changes), so configuration changes can be made while history is being pushed.
    • Sorted the “calculate_checksums” data so that DELETEs occur before INSERTs.
    • Created the node_cksum_temp table so that checksums can be determined one table (items, hosts, etc.) at a time, decreasing the number of deadlocks on the tables and reducing load on the database.
    • Inserted records into the database on the master node with LOW_PRIORITY to reduce deadlocks.
    • Increased ZBX_ITEMS_SIZE from 10000 to support 200,000 items in the DB cache.
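    The per-table checksum pass and the LOW_PRIORITY inserts can be sketched as follows. This is a hedged illustration, not code from the patch; the column lists and values are invented for the example:

```sql
-- Hypothetical per-table checksum pass: hash one ZBX_SYNC table at a
-- time into node_cksum_temp instead of one UNION ALL over all of them.
INSERT INTO node_cksum_temp (nodeid, tablename, recordid, cksumtype, cksum)
SELECT 1, 'items', itemid, 0, MD5(CONCAT_WS('|', itemid, key_, delay))
  FROM items;

-- Hypothetical LOW_PRIORITY insert on the master node. LOW_PRIORITY
-- defers the write until no other clients are reading the table; note
-- it only takes effect on table-locking engines such as MyISAM.
INSERT LOW_PRIORITY INTO history (itemid, clock, value)
VALUES (23296, 1213480800, 0.5);
```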

    Note: Most changes have to be done on both the master and child nodes.

    Setup
    1. Stop all Zabbix services on master and child nodes.

    2. Backup your databases on both the master and child nodes.

    3. Modify /etc/zabbix/zabbix_server.conf on both the master and child nodes by changing the following entry:
    TrapperTimeout=10800

    4. Create the following table, named node_cksum_temp, in the Zabbix database on both the master and child nodes to hold temporary checksum data:

    MySQL schema for table “node_cksum_temp”

    Code:
    CREATE TABLE `node_cksum_temp` (
      `nodeid` int(11) NOT NULL default '0',
      `tablename` varchar(64) NOT NULL default '',
      `recordid` bigint(20) unsigned NOT NULL default '0',
      `cksumtype` int(11) NOT NULL default '0',
      `cksum` text NOT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
    5. On both the master and child nodes, install the attached patch NodeSync_Patch-Version_1.5.3_20080614_2308.patch (the diff was taken between these two directories):
    zabbix-original (original source directory)
    zabbix (patched source directory)
    patch -p0 < NodeSync_Patch-Version_1.5.3_20080614_2308.patch
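    The -p0 strip level matters here: it tells patch to keep the full zabbix/... path from the diff headers, so the command must be run from the directory that contains the zabbix source tree. A toy demonstration of the same workflow (not the real Zabbix patch; the file name and values are made up):

```shell
# Toy demo of the step-5 workflow: diff an original tree against a
# patched tree, then apply the result elsewhere with patch -p0.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p zabbix-original zabbix
printf 'TrapperTimeout=300\n'   > zabbix-original/zabbix_server.conf
printf 'TrapperTimeout=10800\n' > zabbix/zabbix_server.conf
# diff exits 1 when the trees differ, hence the || true under set -e
diff -ru zabbix-original zabbix > demo.patch || true
# Simulate a fresh machine that has only the unmodified source tree
mkdir deploy
cp -r zabbix-original deploy/zabbix
cd deploy
# -p0 keeps the full "zabbix/..." path from the diff headers
patch -p0 < ../demo.patch
grep TrapperTimeout zabbix/zabbix_server.conf
```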

    6. Restart zabbix_agentd and zabbix_server on the master node.

    7. Restart zabbix_agentd and zabbix_server on the child node.

    8. Watch zabbix_server.log for NODE messages.
    Attached Files
    Last edited by Palmertree; 15-06-2008, 16:20.
  • xs-
    Senior Member
    Zabbix Certified Specialist
    • Dec 2007
    • 393

    #2
    Some questions:

    1)
    I'm running Zabbix 1.4.5 with 750+ hosts, 20k+ items and 15k+ triggers across 3 nodes (1 big, 2 small).
    I don't have any of the problems you are describing (not judging your problems or patch). Is the performance of 1.5.x that bad compared to 1.4.x? (I was under the impression 1.5.x would have huge performance improvements.) Could you elaborate on the point at which (how many nodes, hosts per node, average item update interval, etc.) you hit these problems? Just curious.

    2)
    Dividing the Zabbix server processes into multiple 'standalone' daemons (if I understand your post right) could be nice, but I kind of like the part where the entire zabbix_server process tree dies on error. Will your patch have the same behavior with these separate daemons?

    3)
    A 3-hour timeout?? Isn't that a bit too much? Have you tested this in worst-case scenarios? It could make matters worse on the master node side when child nodes start reconnecting and timing out again really quickly (open connections).

    I like the other fixes, though.


    • vinny
      Senior Member
      • Jan 2008
      • 145

      #3
      Hi palmertree,
      I'll test it with great eagerness, because I have faced all the problems you described.

      xs-'s question 2 is pertinent too, because to me this behaviour is the major drawback of using Zabbix.

      vinny
      -------
      Zabbix 1.8.3, 1200+ Hosts, 40 000+ Items...zabbix's everywhere


      • Palmertree
        Senior Member
        • Sep 2005
        • 746

        #4
        1. Not sure where the cutoff point was, but with a few hosts and items I did not see a problem. There would be times when the history would stop and then catch back up. After adding more hosts and items, the problem got worse.

        2. I just separated the existing daemon into 2. It will behave the same when the process dies.

        3. I used 3 hour timeout to account for my backups. This can be set to whatever is appropriate to your environment.


        • NOB
          Senior Member
          Zabbix Certified Specialist
          • Mar 2007
          • 469

          #5
          Originally posted by Palmertree
          1. Not sure where the cutoff point was, but with a few hosts and items I did not see a problem. There would be times when the history would stop and then catch back up. After adding more hosts and items, the problem got worse.

          2. I just separated the existing daemon into 2. It will behave the same when the process dies.

          3. I used 3 hour timeout to account for my backups. This can be set to whatever is appropriate to your environment.
          I like the distribution in two daemons:

          It follows the idea of ZABBIX (and other projects/people) very well, i.e. the separation of independent tasks wherever possible (Poller, Trapper, etc.), for obvious reasons.

          The implementation (patch) is clean and IMHO very easy to understand.

          I hope that the ZABBIX team will include it into 1.6 !

          According to the current Progress Report there is still 10% of the work to do for better distributed monitoring.
          So let's assume integrating this patch is part of it !

          The situation in Palmertrees case, AFAIR, is different from xs-.
          Palmertree uses one ZABBIX master with a large slave, while the latter
          uses one large ZABBIX master and two small ZABBIX slaves.
          Their statements in previous posts fit into the ZABBIX world very well:
          With a small number of hosts on the slaves you won't notice these problems.

          Keep up the good work, Palmertree, xs-, and, last but not least, the ZABBIX team, for creating one of the best monitoring solutions!

          Regards

          Norbert.


          • mcortinas
            Junior Member
            • Oct 2011
            • 8

            #6
            Hi,

            First of all, thank you for this post; this is very interesting data for big monitoring solutions based on Zabbix.

            I've implemented an infrastructure with 1 master and 3 child nodes, and I've just changed the TrapperTimeout parameter.

            Regards,
            Marc

