Distributed monitoring problem


stalker
30-09-2008, 12:31
I am trying to set up a master node with one child node. Zabbix version 1.6, PostgreSQL build. The child node is a clean install.

As standalone servers everything works, but after converting the databases (zabbix-server -n [1,2] -c /etc/zabbix/zabbix_server.conf) and adding the master/slave nodes in the GUI, I get an error when switching nodes in the GUI: Error: Unable to select configuration.

In the log files I see:

Master node zabbix-server.log:
23010:20080930:141743 NODE 1: Received auditlog from node 2 for node 2 datalen 904
23010:20080930:141743 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (200200000000001,200200000000002,1222768032,4,0,0x4d616e75616c204c6f676f7574)
23010:20080930:141743 Query failed:PGRES_FATAL_ERROR:ERROR: syntax error at or near "x4d616e75616c204c6f676f7574"
LINE 1: ... (200200000000001,200200000000002,1222768032,4,0,0x4d616e756...
23014:20080930:142214 NODE 1: Received auditlog from node 2 for node 2 datalen 904
23014:20080930:142214 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (200200000000001,200200000000002,1222768032,4,0,0x4d616e75616c204c6f676f7574)
23014:20080930:142214 Query failed:PGRES_FATAL_ERROR:ERROR: syntax error at or near "x4d616e75616c204c6f676f7574"
LINE 1: ... (200200000000001,200200000000002,1222768032,4,0,0x4d616e756...
^
[etc..]


Child node zabbix-server.log:
6078:20080930:141614 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
6078:20080930:141614 NOT OK
6078:20080930:141624 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
6078:20080930:141624 NOT OK
6078:20080930:141635 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
6078:20080930:141635 NOT OK
6078:20080930:141644 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
6078:20080930:141644 NOT OK
6078:20080930:141655 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
6078:20080930:141655 NOT OK
6078:20080930:141705 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
6078:20080930:141705 NOT OK
.....
6078:20080930:141824 Query::insert into node_cksum (nodeid,tablename,recordid,cksumtype,cksum) select 2,'slideshows',slideshowid,1,md5(name)||','||delay from slideshows where 1=1 and slideshowid between 200000000000000 and 299999999999999
union all select 2,'slides',slideid,1,slideshowid||','||screenid||','||step||','||delay from slides where 1=1 and slideid between 200000000000000 and 299999999999999
union all select 2,'drules',druleid,1,proxy_hostid||','||md5(name)||','||md5(iprange)||','||delay||','||nextcheck||','||status from drules where 1=1 and druleid between 200000000000000 and 299999999999999
union all select 2,'dchecks',dcheckid,1,druleid||','||type||','||md5(key_)||','||md5(snmp_community)||','||md5(ports) from dchecks where 1=1 and dcheckid between 200000000000000 and 299999999999999
[... very long query with many unions ...]
6078:20080930:141824 Query failed:PGRES_FATAL_ERROR:ERROR: function md5(numeric) not found
LINE 7: ...||status||','||md5(macros)||','||md5(agent)||','||md5(time)|...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.

6078:20080930:141825 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
6078:20080930:141825 NOT OK
6078:20080930:141834 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
6078:20080930:141834 NOT OK
6078:20080930:141845 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
[etc... all errors repeated many times]

stalker
30-09-2008, 14:45
I deleted everything and installed new Zabbix servers with default settings (a master and 2 slaves). The problem remains.

Everything was done exactly according to the official documentation.

On the master, in the Zabbix log:
31661:20080930:164405 NODE 1: Received auditlog from node 2 for node 2 datalen 1866
31661:20080930:164405 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (200000000000001,200000000000002,1222767853,3,0,0x4c6f67696e206661696c6564205b4275646e696b6f765d)
31661:20080930:164405 Query failed:PGRES_FATAL_ERROR:ERROR: syntax error at or near "x4c6f67696e206661696c6564205b4275646e696b6f765d"
LINE 1: ... (200000000000001,200000000000002,1222767853,3,0,0x4c6f67696...
^


on both slaves:
9014:20080930:164108 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
9014:20080930:164108 NOT OK
9014:20080930:164118 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
9014:20080930:164119 NOT OK
9014:20080930:164128 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
9014:20080930:164128 NOT OK
9014:20080930:164138 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
9014:20080930:164139 NOT OK
9014:20080930:164148 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
9014:20080930:164148 NOT OK
9014:20080930:164201 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
9014:20080930:164201 NOT OK

stalker
30-09-2008, 15:40
PS: Ubuntu 8.04 server

stalker
01-10-2008, 14:00
Is distributed monitoring usable in version 1.6?

With debug=4 on both sides:

Freshly installed node 3:
24417:20081001:154429 NODE 3: Sending [ZBX_GET_HISTORY_LAST_ID�3�3
auditlog�auditid] to Node [1]
24417:20081001:154429 NODE 3: Receiving [0] from Node [1]
24417:20081001:154429 Query [select auditid,userid,clock,action,resourcetype,details from auditlog where auditid>0 and auditid between 300000000000000 and 399999999999999 order by auditid limit 10000]
24417:20081001:154429 NODE 3: Sending auditlog of node 3 to node 1 datalen 1362
24417:20081001:154429 In connect_to_node(nodeid:1)
24417:20081001:154429 Query [select ip,port from nodes where nodeid=1]
24417:20081001:154429 NODE 3: Sending [History�3�3�auditlog
300300000000001�300300000000002�1222797789�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000002�300300000000001�1222776685�1�2�485454502041757468656e7469636174696f6e
300300000000003�300300000000001�1222776717�1�21�4e6f6465205b4c6f63616c206e6f64655d206964205b335d
300300000000004�300300000000001�1222776761�0�21�4e6f6465205b4d6173746572206e6f64655d206964205b315d
300300000000005�300300000000001�1222776771�1�21�4e6f6465205b454b54206e6f64655d206964205b335d
300300000000006�300300000000001�1222776947�1�21�4e6f6465205b4d6f73636f775d206964205b315d
300300000000007�300300000000001�1222776958�1�21�4e6f6465205b456b61746572696e627572675d206964205b335d
300300000000008�300300000000002�1222850788�3�0�4c6f67696e206661696c6564205b41646d696e5d
300300000000009�300300000000002�1222850804�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000010�300300000000001�1222852068�4�0�4d616e75616c204c6f676f7574
300300000000011�300300000000002�1222852075�3�0�4c6f67696e206661696c6564205b41646d696e5d
300300000000012�300300000000002�1222852084�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000013�300300000000001�1222852104�1�21�4e6f6465205b566f726f6e657a685d206964205b315d
300300000000014�300300000000001�1222861365�4�0�4d616e75616c204c6f676f7574
300300000000015�300300000000002�1222861373�3�0�436f7272656374206c6f67696e205b41646d696e5d] to Node [1]
24417:20081001:154429 NODE 3: Receiving [] from Node [1]
24417:20081001:154429 NOT OK


Freshly installed node 1:
18031:20081001:155232 Trapper got [ZBX_GET_HISTORY_LAST_ID�3�3
auditlog�auditid] len 44
18031:20081001:155232 In send_list_of_history_ids()
18031:20081001:155232 Query [select MAX(auditid) from auditlog where 1=1 and auditid between 300000000000000 and 399999999999999]
18031:20081001:155232 NODE 1: Sending [0] to Node [3]
18033:20081001:155232 Trapper got [History�3�3�auditlog
300300000000001�300300000000002�1222797789�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000002�300300000000001�1222776685�1�2�485454502041757468656e7469636174696f6e
300300000000003�300300000000001�1222776717�1�21�4e6f6465205b4c6f63616c206e6f64655d206964205b335d
300300000000004�300300000000001�1222776761�0�21�4e6f6465205b4d6173746572206e6f64655d206964205b315d
300300000000005�300300000000001�1222776771�1�21�4e6f6465205b454b54206e6f64655d206964205b335d
300300000000006�300300000000001�1222776947�1�21�4e6f6465205b4d6f73636f775d206964205b315d
300300000000007�300300000000001�1222776958�1�21�4e6f6465205b456b61746572696e627572675d206964205b335d
300300000000008�300300000000002�1222850788�3�0�4c6f67696e206661696c6564205b41646d696e5d
300300000000009�300300000000002�1222850804�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000010�300300000000001�1222852068�4�0�4d616e75616c204c6f676f7574
300300000000011�300300000000002�1222852075�3�0�4c6f67696e206661696c6564205b41646d696e5d
300300000000012�300300000000002�1222852084�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000013�300300000000001�1222852104�1�21�4e6f6465205b566f726f6e657a685d206964205b315d
300300000000014�300300000000001�1222861365�4�0�4d616e75616c204c6f676f7574
300300000000015�300300000000002�1222861373�3�0�436f7272656374206c6f67696e205b41646d696e5d] len 1362
18033:20081001:155232 In node_history()
18033:20081001:155232 Query [begin;]
18033:20081001:155232 NODE 1: Received auditlog from node 3 for node 3 datalen 1362
18033:20081001:155232 In process_record ()
18033:20081001:155232 Query [insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (300300000000001,300300000000002,1222797789,3,0,0x436f7272656374206c6f67696e205b41646d696e5d)]
18033:20081001:155232 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (300300000000001,300300000000002,1222797789,3,0,0x436f7272656374206c6f67696e205b41646d696e5d)
18033:20081001:155232 Query failed:PGRES_FATAL_ERROR:ERROR: syntax error at or near "x436f7272656374206c6f67696e205b41646d696e5d"
LINE 1: ... (300300000000001,300300000000002,1222797789,3,0,0x436f72726...
^

18033:20081001:155232 Query [rollback;]


In the database, the details field is a varchar, so its value should be quoted in the query.
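
To see what actually belongs in that column, the hex payload after 0x can be decoded by hand. The sketch below is a small standalone C helper (hex_to_text is a hypothetical name, not part of Zabbix) that turns the hex string from the failing insert back into text. It assumes the payload is plain single-byte text, which holds for the audit messages in these logs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode a hex string into out (NUL-terminated).
 * Returns 0 on success, -1 on odd length, bad digit, or short buffer. */
int hex_to_text(const char *hex, char *out, size_t out_size)
{
    size_t len, i;

    assert(hex != NULL && out != NULL);
    len = strlen(hex);
    if (len % 2 != 0 || out_size < len / 2 + 1)
        return -1;
    for (i = 0; i < len; i += 2) {
        int hi = hex_digit(hex[i]);
        int lo = hex_digit(hex[i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        out[i / 2] = (char)(hi * 16 + lo);
    }
    out[len / 2] = '\0';
    return 0;
}
```

Decoding 4d616e75616c204c6f676f7574 this way yields "Manual Logout", so the value PostgreSQL expects for details is an ordinary quoted string such as 'Manual Logout'.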

stalker
02-10-2008, 14:10
With the MySQL engine everything works :(

tekknokrat
06-10-2008, 11:44
Did you also try using a proxy for distributed monitoring?
Btw, I have packages for Ubuntu here (http://oss.travelping.com)

Alvils
06-10-2008, 12:07
It seems that the Zabbix server tries to insert data into PostgreSQL as if it were MySQL.

:)

I created a quick patch to check this:
In file src/zabbix_server/trapper/nodehistory.c
Find line 348. It should look like:
zbx_snprintf_alloc(sql, sql_allocated, &sql_offset, len + 8, "0x%s,",
Change that to
zbx_snprintf_alloc(sql, sql_allocated, &sql_offset, len + 38, "encode(decode('%s','hex'),'escape'),",

Recompile Zabbix and that should work.

Yet, it still crashes occasionally. Perhaps after receiving data from Zabbix remote nodes. For now, I stopped the remote nodes and will see if it still crashes. If not, I will take a look at this problem again... :)
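
The idea behind the patch can be shown outside of Zabbix: 0x... is MySQL's hexadecimal literal syntax, which PostgreSQL does not parse, while encode(decode('...','hex'),'escape') is a PostgreSQL expression that yields the decoded text. The sketch below picks the expression per backend. The enum and format_hex_value are hypothetical names for illustration; this is not how the Zabbix source is organised.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical backend selector, for illustration only. */
enum db_backend { DB_MYSQL, DB_POSTGRESQL };

/* Hypothetical helper: write the SQL expression for a hex-encoded
 * text value into buf. Returns 0 on success, -1 if hex contains a
 * non-hex character or buf is too small. */
int format_hex_value(char *buf, size_t size, enum db_backend db, const char *hex)
{
    int n;

    assert(buf != NULL && hex != NULL);
    /* Refuse anything that is not a hex digit, so a quote character
     * can never end up inside the generated SQL. */
    if (hex[strspn(hex, "0123456789abcdefABCDEF")] != '\0')
        return -1;
    if (db == DB_POSTGRESQL)
        /* PostgreSQL has no 0x... literal; decode the hex explicitly. */
        n = snprintf(buf, size, "encode(decode('%s','hex'),'escape')", hex);
    else
        /* MySQL accepts 0x... as a hexadecimal literal. */
        n = snprintf(buf, size, "0x%s", hex);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With "4d61" as input this produces 0x4d61 for MySQL and encode(decode('4d61','hex'),'escape') for PostgreSQL, which is why the PostgreSQL form should only be compiled into a PostgreSQL build.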

stalker
06-10-2008, 15:00
I do not use a proxy.

I have switched to the MySQL version. With kicks, reinstallations and shamanism I managed to join 4 nodes together. It appears to work.

The following is annoying:

1. Periodically in logs I see the message: "16602:20081006:165209 Timeout while answering request".

2. Sometimes the mysqld process starts to consume 100% of the CPU, and during that time (judging by the logs) data collection from the slave nodes stops.

stalker
06-10-2008, 15:27
After registering a new node and receiving the first portion of data (16603:20081006:171142 NODE 1: Received data from slave node 3 for node 3 datalen 4130113), CPU idle time drops to nearly 0% for a very long time (e.g.: Cpu(s): 30.3%us, 45.3%sy, 0.0%ni, 0.3%id, 22.3%wa, 0.7%hi, 1.0%si, 0.0%st).

At this time, in the node 3 logs:
7538:20081006:181059 server #19 started [HTTP Poller]
7540:20081006:181059 server #20 started [HTTP Poller]
7542:20081006:181059 server #21 started [HTTP Poller]
7545:20081006:181059 server #23 started [Escalator]
7511:20081006:181059 server #0 started [Watchdog]
7511:20081006:181059 In main_watchdog_loop()
7529:20081006:181059 server #15 started [Poller for unreachable hosts. SNMP:YES]
7544:20081006:181059 server #22 started [Discoverer. SNMP:YES]
7525:20081006:181114 Deleted 0 records from history and trends
7530:20081006:181115 NODE 3: Sending configuration changes to master node 1 for node 3 datalen 4130114

stalker
06-10-2008, 15:56
After 30 minutes the node has been added! :D

stalker
07-10-2008, 12:22
Problems:
1. Periodically: when I switch to some node in the GUI, I get the error "cannot select configuration".

2. Periodically: errors in the log while 100% CPU is in use:
Query failed: [delete from node_cksum where nodeid=2 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
Query failed: [delete from node_cksum where nodeid=3 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
Query failed: [delete from node_cksum where nodeid=4 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
Query failed: [delete from node_cksum where nodeid=5 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
Query failed: [delete from node_cksum where nodeid=4 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
....

stalker
07-10-2008, 15:48
About the errors "Query failed: [delete from node_cksum where nodeid=2 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]":

This error arises when a new node is added. MySQL consumes 100% of resources after receiving the configuration of the new node. After 20 minutes, "Deadlock found when trying to get lock; try restarting transaction" errors start to appear in the log.

Alexei
08-10-2008, 10:45
Registered as ZBX-537.

stalker
09-10-2008, 13:03
At this time on the slave node:

  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
13537 zabbix  25   5 1177m 421m  768 S  0.3 29.0   0:43.91 zabbix_server
11101 mysql   20   0  234m  84m 3504 S 95.0  5.8 976:27.65 mysqld



:(

stalker
09-10-2008, 13:04
Possibly this is a timeout problem. How do I change the timeouts?

The links between my nodes are 2 Mbps, and after adding a new switch to a slave from the master node I repeatedly see on the master node:

16675:20081009:130543 Timeout while answering request
16675:20081009:130543 NODE 1: Error while receiving answer from Node [2] error: ZBX_TCP_READ() failed [Interrupted system call]

and on the slave node, repeatedly:

13537:20081009:130217 NODE 2: Received data from master node 1 for node 2 datalen 1944392

On the slave node, after receiving these 1944392 bytes, MySQL eats 100% CPU for 20-30 minutes, and then, because of the timeout, the same data is received again. A perpetuum mobile.

During the initial sync, node 1 received 4 MB of data and turned into the same perpetuum mobile.

How do I change the timeouts?
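
For a rough sense of scale, the raw transfer time of these payloads over the link can be estimated. The C sketch below assumes that 2 Mbps means 2,000,000 bits per second and ignores protocol overhead and any other traffic, so the real time is longer. transfer_seconds is a hypothetical helper.

```c
#include <assert.h>

/* Hypothetical helper: seconds needed to move `bytes` over a link of
 * `bits_per_second`, ignoring protocol overhead and other traffic. */
double transfer_seconds(double bytes, double bits_per_second)
{
    assert(bits_per_second > 0.0);
    return bytes * 8.0 / bits_per_second;
}
```

Under those assumptions, the 1944392-byte configuration needs about 7.8 seconds on the wire and the 4130113-byte initial sync about 16.5 seconds, before the receiving node has done any database work. Any timeout shorter than transfer plus processing time would lead to the resend loop described above.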

Palmertree
09-10-2008, 17:04
Look at the following to change timeouts:

http://www.zabbix.com/forum/showthread.php?t=9890

You have to change it in the C code and the zabbix_server.conf file for trapper.

r3dn3ck
17-10-2008, 23:33
It seems that the Zabbix server tries to insert data into PostgreSQL as if it were MySQL.

:)

I created a quick patch to check this:
In file src/zabbix_server/trapper/nodehistory.c
Find line 348. It should look like:
zbx_snprintf_alloc(sql, sql_allocated, &sql_offset, len + 8, "0x%s,",
Change that to
zbx_snprintf_alloc(sql, sql_allocated, &sql_offset, len + 38, "encode(decode('%s','hex'),'escape'),",

Recompile Zabbix and that should work.

Yet, it still crashes occasionally. Perhaps after receiving data from Zabbix remote nodes. For now, I stopped the remote nodes and will see if it still crashes. If not, I will take a look at this problem again... :)


Would this change apply to a MySQL-backed setup?

vinny
22-10-2008, 13:37
I am facing this bug too...
I tried modifying nodehistory.c on the master and slave nodes, but without success.

Has this bug been fixed?

vinny

kombat
03-11-2008, 11:12
Hi,
We have 4 nodes running on MySQL, and we decided to replace MySQL with PostgreSQL.
We have the same problem with nodes on PostgreSQL, whereas on MySQL the system works well.

Maybe it's a bug?

Sorry for my English.