DM New master node, sync issues


xs-
29-09-2008, 16:04
Hi all,

I think I've found a bug in distributed monitoring (DM) synchronization. The scenario and the errors are below.
Env:
Both Zabbix servers are mostly identical:
- Version 1.6.0 (not trunk)
- MySQL 5
- Server binary is identical, a copy of the same build.
- Ubuntu 8 LTS

I've just set up a new 'master' node to eventually link several standalone Zabbix nodes.
This master node will not get any hosts of its own; it will just combine all child node information for viewing and alerting.

After setting up the new master node and emptying everything (hosts, templates, triggers, actions, etc.), I configured it to have one child node. This child node is an existing Zabbix node with around 600 hosts.
Then I configured the child node and restarted both zabbix_server daemons (without a restart, nothing really happens sync-wise).

After some time, configuration updates start (I was tailing the log file).
So far, the following updates on the master side:
- 1 successful
- 1 timeout
- 3 query errors


Query errors are all the same:
7502:20080929:152555 NODE 10: Received data from slave node 1 for node 1 datalen 28130879
7502:20080929:153055 Timeout while answering request
7502:20080929:154426 Query failed: [insert into hosts_profiles_ext (hostid,hostid,device_alias,device_type,device_chassis,device_os,device_os_short,device_hw_arch,device_serial,device_model,device_tag,device_vendor,device_contract,device_who,device_status,device_app_01,device_app_02,device_app_03,device_app_04,device_app_05,device_url_1,device_url_2,device_url_3,device_networks,device_notes,device_hardware,device_software,ip_subnet_mask,ip_router,ip_macaddress,oob_ip,oob_subnet_mask,oob_router,date_hw_buy,date_hw_install,date_hw_expiry,date_hw_decomm,site_street_1,site_street_2,site_street_3,site_city,site_state,site_country,site_zip,site_rack,site_notes,poc_1_name,poc_1_email,poc_1_phone_1,poc_1_phone_2,poc_1_cell,poc_1_screen,poc_1_notes,poc_2_name,poc_2_email,poc_2_phone_1,poc_2_phone_2,poc_2_cell,poc_2_screen,poc_2_notes) values(100100000010741,100100000010741,'','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','')] Column 'hostid' specified twice [1110]
7502:20080929:154426 Query failed: [insert into hosts_profiles_ext (hostid,hostid,device_alias,device_type,device_chassis,device_os,device_os_short,device_hw_arch,device_serial,device_model,device_tag,device_vendor,device_contract,device_who,device_status,device_app_01,device_app_02,device_app_03,device_app_04,device_app_05,device_url_1,device_url_2,device_url_3,device_networks,device_notes,device_hardware,device_software,ip_subnet_mask,ip_router,ip_macaddress,oob_ip,oob_subnet_mask,oob_router,date_hw_buy,date_hw_install,date_hw_expiry,date_hw_decomm,site_street_1,site_street_2,site_street_3,site_city,site_state,site_country,site_zip,site_rack,site_notes,poc_1_name,poc_1_email,poc_1_phone_1,poc_1_phone_2,poc_1_cell,poc_1_screen,poc_1_notes,poc_2_name,poc_2_email,poc_2_phone_1,poc_2_phone_2,poc_2_cell,poc_2_screen,poc_2_notes) values(100100000010767,100100000010767,'','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','')] Column 'hostid' specified twice [1110]
7502:20080929:154426 Query failed: [insert into hosts_profiles_ext (hostid,hostid,device_alias,device_type,device_chassis,device_os,device_os_short,device_hw_arch,device_serial,device_model,device_tag,device_vendor,device_contract,device_who,device_status,device_app_01,device_app_02,device_app_03,device_app_04,device_app_05,device_url_1,device_url_2,device_url_3,device_networks,device_notes,device_hardware,device_software,ip_subnet_mask,ip_router,ip_macaddress,oob_ip,oob_subnet_mask,oob_router,date_hw_buy,date_hw_install,date_hw_expiry,date_hw_decomm,site_street_1,site_street_2,site_street_3,site_city,site_state,site_country,site_zip,site_rack,site_notes,poc_1_name,poc_1_email,poc_1_phone_1,poc_1_phone_2,poc_1_cell,poc_1_screen,poc_1_notes,poc_2_name,poc_2_email,poc_2_phone_1,poc_2_phone_2,poc_2_cell,poc_2_screen,poc_2_notes) values(100100000010768,100100000010768,'','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','')] Column 'hostid' specified twice [1110]
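
For anyone scanning their own server log for the same symptom, here is a minimal Python sketch that pulls the column list out of a failed INSERT and reports names that appear more than once. The function name `duplicate_columns` and the shortened sample query are illustrative assumptions, not anything from Zabbix itself.

```python
import re
from collections import Counter
from typing import List


def duplicate_columns(query: str) -> List[str]:
    """Return column names listed more than once in an INSERT statement."""
    # Capture the parenthesised column list that follows the table name.
    match = re.search(r"insert\s+into\s+\w+\s*\(([^)]*)\)", query, re.IGNORECASE)
    if match is None:
        return []
    columns = [name.strip() for name in match.group(1).split(",")]
    counts = Counter(columns)
    # Counter keeps first-seen order, so duplicates come out in query order.
    return [name for name, seen in counts.items() if seen > 1]


# Shortened, hypothetical version of the query from the log above.
sample = (
    "insert into hosts_profiles_ext (hostid,hostid,device_alias) "
    "values(100100000010741,100100000010741,'')"
)
print(duplicate_columns(sample))
```

Run against the logged queries, this should flag `hostid`, which matches MySQL's "Column 'hostid' specified twice" message.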

thissolution
07-10-2008, 23:44
G'day

I am having the same issue. From your log, yours is trying to send 28 MB across to the other node, and I am guessing your two nodes are not on a LAN together. I have tried over and over again and keep getting the same issue. Even if I don't add any hosts or new templates, the very first time it tries to sync it tries to send across 4.1 MB! That's a lot of data considering both servers are brand new setups with nothing added (other than setting up the nodes in the admin section).

Is no one else trying to use DM in 1.6 where the two nodes are not on a LAN next to each other?

Paul

Alexei
08-10-2008, 13:30
Registered as ZBX-540.

stalker
09-10-2008, 11:24
Possibly this is a timeout problem. How can I change the timeouts?

The link between my nodes is 2 Mbps, and after adding a new switch to the slave from the master node I repeatedly see this on the master node:

16675:20081009:130543 Timeout while answering request
16675:20081009:130543 NODE 1: Error while receiving answer from Node [2] error: ZBX_TCP_READ() failed [Interrupted system call]


and on the slave node, repeatedly:
13537:20081009:130217 NODE 2: Received data from master node 1 for node 2 datalen 1944392

On the slave node, after receiving these 1944392 bytes, MySQL eats 100% CPU for 20-30 minutes, and then after the timeout the same data is received again. Perpetuum mobile :)

During the initial sync node 1 received 4 MB of data and it turned into the same perpetuum mobile.

How can I change the timeouts?

teferi
13-10-2008, 10:01
Possibly this is a timeout problem. How can I change the timeouts?


Use the Timeout variable in your conf file.

thissolution
13-10-2008, 12:49
Teferi

But the variable has a max of only 30 seconds, and syncing 4-19 MB over an ADSL2 link takes a lot longer than 30 seconds.

Also, how do we track when bug ZBX-540 is fixed?

Thanks
Paul
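
To put rough numbers on the timeout question, here is a back-of-the-envelope sketch in Python. It uses the two `datalen` values from the logs earlier in the thread, the 2 Mbps link mentioned above, and the 30 second Timeout cap. It assumes the link runs at full line rate with no protocol overhead and ignores the time the receiving node spends writing to its database, so treat the results as a lower bound.

```python
def transfer_seconds(payload_bytes: int, link_bits_per_second: float) -> float:
    """Ideal transfer time: payload size in bits divided by link speed."""
    return payload_bytes * 8 / link_bits_per_second


TIMEOUT_SECONDS = 30        # maximum Timeout value reported in this thread
LINK_BPS = 2_000_000        # 2 Mbps link, as reported in this thread

# datalen values copied from the log lines quoted in this thread
payloads = {
    "initial sync, about 600 hosts": 28130879,
    "config update after adding a switch": 1944392,
}

for label, size in payloads.items():
    seconds = transfer_seconds(size, LINK_BPS)
    verdict = "exceeds" if seconds > TIMEOUT_SECONDS else "fits within"
    print(f"{label}: {seconds:.1f} s, {verdict} the {TIMEOUT_SECONDS} s timeout")
```

Under these assumptions the 28 MB payload needs close to two minutes on the wire alone, while the 1.9 MB payload would fit into 30 seconds, which hints that in that case the time is lost on the receiving side (the MySQL load described above) rather than on the link.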

teferi
13-10-2008, 12:52
Teferi

But the variable has a max of only 30 seconds, and syncing 4-19 MB over an ADSL2 link takes a lot longer than 30 seconds.

Also, how do we track when bug ZBX-540 is fixed?

Thanks
Paul

About the timeout: you will probably have to hack the code to raise the limit.

About bug:
https://support.zabbix.com/browse/ZBX-540

thissolution
17-10-2008, 23:17
Alex

I have tried build 6180, but the same issue happens. I posted this and the log on the bug report on 15/10, but have had no correspondence from the Zabbix team.

Paul

thissolution
21-10-2008, 06:10
Hi

Is there any update on this bug? I am not sure how anyone could use this in a WAN-based DM setup with this issue at hand.

Paul

xs-
21-10-2008, 09:07
This specific bug has been fixed in the 1.6 branch in svn (or get the nightly build).