**stalker** · 30-09-2008, 14:45

I deleted everything and installed new Zabbix servers with default settings (one master and two slaves). The problem remains.

Everything was set up exactly as described in the official documentation.

On the master, in the Zabbix log:

Code:

31661:20080930:164405 NODE 1: Received auditlog from node 2 for node 2 datalen 1866
 31661:20080930:164405 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (200000000000001,200000000000002,1222767853,3,0,0x4c6f67696e206661696c6564205b4275646e696b6f765d)
 31661:20080930:164405 Query failed:PGRES_FATAL_ERROR:ERROR:  syntax error at or near "x4c6f67696e206661696c6564205b4275646e696b6f765d"
LINE 1: ... (200000000000001,200000000000002,1222767853,3,0,0x4c6f67696...
                                                             ^

On both slaves:

Code:

  9014:20080930:164108 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
  9014:20080930:164108 NOT OK
  9014:20080930:164118 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
  9014:20080930:164119 NOT OK
  9014:20080930:164128 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
  9014:20080930:164128 NOT OK
  9014:20080930:164138 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
  9014:20080930:164139 NOT OK
  9014:20080930:164148 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
  9014:20080930:164148 NOT OK
  9014:20080930:164201 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
  9014:20080930:164201 NOT OK

**stalker** · 30-09-2008, 15:40

PS: ubuntu 8.04 server

**stalker** · 01-10-2008, 14:00

Is distributed monitoring usable in version 1.6?

With debug=4 on both sides:

Freshly installed node 3:

Code:

 24417:20081001:154429 NODE 3: Sending [ZBX_GET_HISTORY_LAST_ID�3�3
auditlog�auditid] to Node [1]
 24417:20081001:154429 NODE 3: Receiving [0] from Node [1]
 24417:20081001:154429 Query [select auditid,userid,clock,action,resourcetype,details from auditlog where auditid>0 and auditid between 300000000000000 and 399999999999999 order by auditid limit 10000]
 24417:20081001:154429 NODE 3: Sending auditlog of node 3 to node 1 datalen 1362
 24417:20081001:154429 In connect_to_node(nodeid:1)
 24417:20081001:154429 Query [select ip,port from nodes where nodeid=1]
 24417:20081001:154429 NODE 3: Sending [History�3�3�auditlog
300300000000001�300300000000002�1222797789�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000002�300300000000001�1222776685�1�2�485454502041757468656e7469636174696f6e
300300000000003�300300000000001�1222776717�1�21�4e6f6465205b4c6f63616c206e6f64655d206964205b335d
300300000000004�300300000000001�1222776761�0�21�4e6f6465205b4d6173746572206e6f64655d206964205b315d
300300000000005�300300000000001�1222776771�1�21�4e6f6465205b454b54206e6f64655d206964205b335d
300300000000006�300300000000001�1222776947�1�21�4e6f6465205b4d6f73636f775d206964205b315d
300300000000007�300300000000001�1222776958�1�21�4e6f6465205b456b61746572696e627572675d206964205b335d
300300000000008�300300000000002�1222850788�3�0�4c6f67696e206661696c6564205b41646d696e5d
300300000000009�300300000000002�1222850804�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000010�300300000000001�1222852068�4�0�4d616e75616c204c6f676f7574
300300000000011�300300000000002�1222852075�3�0�4c6f67696e206661696c6564205b41646d696e5d
300300000000012�300300000000002�1222852084�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000013�300300000000001�1222852104�1�21�4e6f6465205b566f726f6e657a685d206964205b315d
300300000000014�300300000000001�1222861365�4�0�4d616e75616c204c6f676f7574
300300000000015�300300000000002�1222861373�3�0�436f7272656374206c6f67696e205b41646d696e5d] to Node [1]
 24417:20081001:154429 NODE 3: Receiving [] from Node [1]
 24417:20081001:154429 NOT OK
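
The `between` clause in the query above suggests that each node owns a block of 10^14 IDs starting at nodeid × 10^14. A minimal Python sketch of that reading follows. It is inferred only from the queries in these logs, not from the Zabbix source, and the function names are hypothetical:

```python
from typing import Tuple

# Assumption inferred from the logged queries: node N owns the IDs in
# [N * 10**14, N * 10**14 + 10**14 - 1]. Names are illustrative only.
NODE_BLOCK = 10**14


def node_id_range(nodeid: int) -> Tuple[int, int]:
    """Return the (low, high) bounds that appear in the 'between' clause."""
    low = nodeid * NODE_BLOCK
    return low, low + NODE_BLOCK - 1


def owning_node(record_id: int) -> int:
    """Return the node whose ID block contains record_id."""
    return record_id // NODE_BLOCK


print(node_id_range(3))              # (300000000000000, 399999999999999)
print(owning_node(300300000000001))  # 3
```

Under this assumption, the first auditid sent by node 3 falls inside node 3's own block, which matches the "from node 3 for node 3" wording in the master log.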

Freshly installed node 1:

Code:

 18031:20081001:155232 Trapper got [ZBX_GET_HISTORY_LAST_ID�3�3
auditlog�auditid] len 44
 18031:20081001:155232 In send_list_of_history_ids()
 18031:20081001:155232 Query [select MAX(auditid) from auditlog where 1=1 and auditid between 300000000000000 and 399999999999999]
 18031:20081001:155232 NODE 1: Sending [0] to Node [3]
 18033:20081001:155232 Trapper got [History�3�3�auditlog
300300000000001�300300000000002�1222797789�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000002�300300000000001�1222776685�1�2�485454502041757468656e7469636174696f6e
300300000000003�300300000000001�1222776717�1�21�4e6f6465205b4c6f63616c206e6f64655d206964205b335d
300300000000004�300300000000001�1222776761�0�21�4e6f6465205b4d6173746572206e6f64655d206964205b315d
300300000000005�300300000000001�1222776771�1�21�4e6f6465205b454b54206e6f64655d206964205b335d
300300000000006�300300000000001�1222776947�1�21�4e6f6465205b4d6f73636f775d206964205b315d
300300000000007�300300000000001�1222776958�1�21�4e6f6465205b456b61746572696e627572675d206964205b335d
300300000000008�300300000000002�1222850788�3�0�4c6f67696e206661696c6564205b41646d696e5d
300300000000009�300300000000002�1222850804�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000010�300300000000001�1222852068�4�0�4d616e75616c204c6f676f7574
300300000000011�300300000000002�1222852075�3�0�4c6f67696e206661696c6564205b41646d696e5d
300300000000012�300300000000002�1222852084�3�0�436f7272656374206c6f67696e205b41646d696e5d
300300000000013�300300000000001�1222852104�1�21�4e6f6465205b566f726f6e657a685d206964205b315d
300300000000014�300300000000001�1222861365�4�0�4d616e75616c204c6f676f7574
300300000000015�300300000000002�1222861373�3�0�436f7272656374206c6f67696e205b41646d696e5d] len 1362
 18033:20081001:155232 In node_history()
 18033:20081001:155232 Query [begin;]
 18033:20081001:155232 NODE 1: Received auditlog from node 3 for node 3 datalen 1362
 18033:20081001:155232 In process_record ()
 18033:20081001:155232 Query [insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (300300000000001,300300000000002,1222797789,3,0,0x436f7272656374206c6f67696e205b41646d696e5d)]
 18033:20081001:155232 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (300300000000001,300300000000002,1222797789,3,0,0x436f7272656374206c6f67696e205b41646d696e5d)
 18033:20081001:155232 Query failed:PGRES_FATAL_ERROR:ERROR:  syntax error "x436f7272656374206c6f67696e205b41646d696e5d"
LINE 1: ... (300300000000001,300300000000002,1222797789,3,0,0x436f72726...
                                                             ^

 18033:20081001:155232 Query [rollback;]

In the database, the details field is a varchar, so its value should be quoted in the query.
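
To illustrate the point, here is a small Python sketch that decodes the hex payload from the failing insert and turns it into a quoted string literal. The helper name is hypothetical, and the escaping covers only single quotes; real code would normally use parameterized queries instead:

```python
def hex_to_sql_literal(hex_payload: str) -> str:
    """Decode a hex-encoded details value and return it as a quoted SQL literal."""
    text = bytes.fromhex(hex_payload).decode("utf-8")
    # Standard SQL escapes a single quote inside a literal by doubling it.
    # Backslashes and other special cases are deliberately not handled here.
    return "'" + text.replace("'", "''") + "'"


payload = "436f7272656374206c6f67696e205b41646d696e5d"  # from the log above
print(hex_to_sql_literal(payload))  # 'Correct login [Admin]'
```

PostgreSQL rejects the bare `0x...` token, while a quoted literal like the one printed here is valid for a varchar column.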

**stalker** · 02-10-2008, 14:10

With the MySQL engine everything works.

**tekknokrat** · 06-10-2008, 11:44

Did you also try using a proxy for distributed monitoring?
By the way, I have packages for Ubuntu here

**Alvils** · 06-10-2008, 12:07

A small patch

It seems that the Zabbix server tries to insert data into PostgreSQL as if it were MySQL.

I created a quick patch to check this:
In file src/zabbix_server/trapper/nodehistory.c
Find line 348. It should look like:
zbx_snprintf_alloc(sql, sql_allocated, &sql_offset, len + 8, "0x%s,",
Change that to
zbx_snprintf_alloc(sql, sql_allocated, &sql_offset, len + 38, "encode(decode('%s','hex'),'escape'),",

Recompile Zabbix and that should work.
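
As a rough check of the patch, the Python sketch below renders both format strings with a payload from the log and counts how many characters each adds around the hex data. It assumes the fourth argument of `zbx_snprintf_alloc` is an upper bound on the formatted length, which the thread does not confirm. The helper names are hypothetical:

```python
MYSQL_FORMAT = "0x%s,"                                    # original line 348
POSTGRES_FORMAT = "encode(decode('%s','hex'),'escape'),"  # patched line


def overhead(fmt: str) -> int:
    """Characters the format adds around the payload (the '%s' itself excluded)."""
    return len(fmt) - len("%s")


def render(fmt: str, hex_payload: str) -> str:
    """Substitute the hex payload into the format, as the C call would."""
    return fmt % hex_payload


payload = "436f7272656374206c6f67696e205b41646d696e5d"
print(render(MYSQL_FORMAT, payload))
print(render(POSTGRES_FORMAT, payload))
print(overhead(MYSQL_FORMAT), overhead(POSTGRES_FORMAT))  # 3 34
```

Under that assumption, `len + 8` and `len + 38` both leave a few characters of slack over the 3 and 34 characters actually added.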

However, it still crashes occasionally, perhaps after receiving data from remote Zabbix nodes. For now, I have stopped the remote nodes and will see if it still crashes. If not, I will take another look at this problem...

**stalker** · 06-10-2008, 15:00

I do not use a proxy.

I have switched to the MySQL version. With kicks, reinstallations and shamanism I managed to join 4 nodes. It appears to be working.

The following is annoying:

1. Periodically I see this message in the logs: "16602:20081006:165209 Timeout while answering request".

2. Sometimes the mysqld process starts consuming 100% of the CPU, and during that time (judging by the logs) data collection from the slave nodes stops.

**stalker** · 06-10-2008, 15:27

After registering a new node and receiving the first portion of data (16603:20081006:171142 NODE 1: Received data from slave node 3 for node 3 datalen 4130113), CPU idle time drops to about 0% for a very long time (e.g. Cpu(s): 30.3%us, 45.3%sy, 0.0%ni, 0.3%id, 22.3%wa, 0.7%hi, 1.0%si, 0.0%st).

At this time, in the node 3 logs:

Code:

  7538:20081006:181059 server #19 started [HTTP Poller]
  7540:20081006:181059 server #20 started [HTTP Poller]
  7542:20081006:181059 server #21 started [HTTP Poller]
  7545:20081006:181059 server #23 started [Escalator]
  7511:20081006:181059 server #0 started [Watchdog]
  7511:20081006:181059 In main_watchdog_loop()
  7529:20081006:181059 server #15 started [Poller for unreachable hosts. SNMP:YES]
  7544:20081006:181059 server #22 started [Discoverer. SNMP:YES]
  7525:20081006:181114 Deleted 0 records from history and trends
  7530:20081006:181115 NODE 3: Sending configuration changes to master node 1 for node 3 datalen 4130114

**stalker** · 06-10-2008, 15:56

After 30 minutes the node was added!

**stalker** · 07-10-2008, 12:22

Problems:
1. Periodically, when I switch to some node in the GUI, I get the error "cannot select configuration".

2. Periodically, errors appear in the log while 100% CPU is in use:

Code:

 Query failed: [delete from node_cksum where nodeid=2 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
 Query failed: [delete from node_cksum where nodeid=3 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
 Query failed: [delete from node_cksum where nodeid=4 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
 Query failed: [delete from node_cksum where nodeid=5 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
 Query failed: [delete from node_cksum where nodeid=4 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
....

**stalker** · 07-10-2008, 15:48

errors "Query failed: [delete from node_cksum where nodeid=2 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]"

This error arises when a new node is added. MySQL consumes 100% of resources after receiving the configuration of the new node. After 20 minutes, "Deadlock found when trying to get lock; try restarting transaction" errors start to appear in the log.
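
For readers unfamiliar with error 1213, "try restarting transaction" means the whole transaction can simply be run again. The Python sketch below shows that generic pattern. It is not how Zabbix handles the error, and `DeadlockError` and `run_with_retry` are hypothetical names standing in for a real database driver's exception and calling code:

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver error carrying MySQL code 1213."""


def run_with_retry(transaction, attempts=3, delay=0.0):
    """Call transaction(); on DeadlockError, wait and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == attempts:
                raise  # give up and surface the last deadlock
            time.sleep(delay * attempt)  # simple linear backoff


# Usage with a fake transaction that deadlocks twice, then succeeds.
calls = {"n": 0}


def flaky_delete():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("Deadlock found when trying to get lock")
    return "deleted"


print(run_with_retry(flaky_delete))  # deleted
```

A retry only helps if the competing transactions eventually finish. It would not by itself explain or cure the 100% CPU load described here.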

**Alexei** · 08-10-2008, 10:45

Registered as ZBX-537.

**stalker** · 09-10-2008, 13:03

At this time on the slave node:

Code:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13537 zabbix    25   5 1177m 421m  768 S  0.3 29.0   0:43.91 zabbix_server
11101 mysql     20   0  234m  84m 3504 S 95.0  5.8 976:27.65 mysqld

**stalker** · 09-10-2008, 13:04

Possibly this is a timeout problem. How can I change the timeouts?

The links between my nodes are 2 Mbps, and after I add a new switch to a slave node from the master node, I repeatedly see this on the master node:

Code:

16675:20081009:130543 Timeout while answering request
 16675:20081009:130543 NODE 1: Error while receiving answer from Node [2] error: ZBX_TCP_READ() failed [Interrupted system call]

and on the slave node, repeatedly:

Code:

13537:20081009:130217 NODE 2: Received data from master node 1 for node 2 datalen 1944392

On the slave node, after receiving these 1944392 bytes, MySQL eats 100% CPU for 20-30 minutes, and then, because of the timeout, the same data is received again. Perpetuum mobile.

During the initial sync, node 1 received 4 MB of data and turned into the same perpetuum mobile.
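
As a back-of-the-envelope check, the Python sketch below estimates how long the two payloads from the logs would take to cross a 2 Mbps link. It assumes the link is fully available and that 1 Mbps means 10^6 bits per second, and it ignores protocol overhead. The function name is hypothetical:

```python
def transfer_seconds(size_bytes: int, link_mbps: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and competing traffic."""
    return size_bytes * 8 / (link_mbps * 1_000_000)


payloads = [
    ("node 1 -> node 2", 1944392),  # datalen from the slave log
    ("node 3 -> node 1", 4130113),  # datalen from the initial sync
]
for label, size in payloads:
    print(f"{label}: {transfer_seconds(size, 2):.1f} s")
# node 1 -> node 2: 7.8 s
# node 3 -> node 1: 16.5 s
```

Under these assumptions the transfer itself takes seconds, so the 20-30 minutes reported above would be spent mostly in processing on the receiving side rather than on the wire.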

How can I change the timeouts?

Distributed monitoring problem
