ODBC queries on an offline database cause massive freeze and cascading alerts

  • Sven
    Junior Member
    • Feb 2012
    • 26

    #1

    Hi All!

    I have an Oracle ODBC template (from the templates repository) which I use across a variety of Oracle 10, 11, 12 systems and all is well.

    We had a network outage that caused one database on one server to go offline at approx. 07:22

    The ODBC queries appear to have hung and blocked everything else from that point on. This set off a cascade in which Zabbix began to report other systems as offline, even though they weren't. The "Database down" alert fires because Zabbix server had not received any data for 10 minutes.

    I've redacted some details because of their sensitivity, but what remains should be enough to work with.

    Last time this happened, I increased the number of pollers and the number of discovery processes in case there was a capacity issue, but that obviously hasn't fixed the underlying problem.

    This is a proof-of-concept system at the moment and I have built a single Zabbix server and no proxies (yet).

    Any help appreciated !!

    S.


    Problems:


    [Image: Problems.PNG]

    Zabbix Process Utilisation:


    [Image: zabbix services utilisation.PNG]

    Zabbix Server Log

    I've removed some of the tables in the schema, as they relate to certain applications that run there.

    Code:
      5882:20200218:072251.132 error reason for "dbserver:db.odbc.select[archive log gap,{$DSN}]" changed: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5882:20200218:072256.166 error reason for "dbserver:db.odbc.select[max applied archived log,{$DSN}]" changed: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5949:20200218:072301.784 discovery rule "dbserver:db.odbc.discovery[ASM Disk Discovery,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072337.252 item "dbserver:db.odbc.select[Instance Status,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5882:20200218:072340.256 item "dbserver:db.odbc.select[Physical Read Bytes Per Sec,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5883:20200218:072342.258 item "dbserver:db.odbc.select[Physical Write Bytes Per Sec,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5882:20200218:072412.357 item "dbserver:db.odbc.select[Schema SYSTEM size,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072414.394 item "dbserver:db.odbc.select[Schema AUDSYS size,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072416.429 item "dbserver:db.odbc.select[Schema DBSNMP size,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072437.507 item "dbserver:db.odbc.select[Schema ORDDATA size,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072442.522 item "dbserver:db.odbc.select[Schema SYS size,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5883:20200218:072507.576 item "dbserver:db.odbc.select[Schema WMSYS size,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072508.577 item "dbserver:db.odbc.select[Schema PERFSTAT size,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5883:20200218:072709.730 item "dbserver:db.odbc.select[Blocked sessions count,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5883:20200218:072709.730 item "dbserver:db.odbc.select[Buffer Cache Hit Ratio,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5882:20200218:072711.734 item "dbserver:db.odbc.select[Cursor Cache Hit Ratio,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5885:20200218:072712.737 item "dbserver:db.odbc.select[Database CPU Time Ratio,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072713.748 item "dbserver:db.odbc.select[Database lock count,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5885:20200218:072714.749 item "dbserver:db.odbc.select[Host CPU Utilization,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5883:20200218:072715.751 item "dbserver:db.odbc.select[Long Table Scans Per Sec,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5882:20200218:072716.757 item "dbserver:db.odbc.select[PGA Cache Hit Percentage,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072717.761 item "dbserver:db.odbc.select[Physical Reads Per Sec,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5883:20200218:072718.785 item "dbserver:db.odbc.select[Physical Writes Per Sec,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072719.830 item "dbserver:db.odbc.select[Physical reads direct per sec,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072719.830 item "dbserver:db.odbc.select[Physical writes direct per sec,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072720.855 item "dbserver:db.odbc.select[Redo Allocation Hit Ratio,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072722.890 item "dbserver:db.odbc.select[Shared Pool Free %,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072737.912 item "dbserver:db.odbc.select[Soft Parse Ratio,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072740.942 item "dbserver:db.odbc.select[Waiting sessions count,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5882:20200218:072806.957 item "dbserver:db.odbc.select[db session count,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072807.959 item "dbserver:db.odbc.select[max archived log,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5885:20200218:072809.965 item "dbserver:db.odbc.select[physical reads direct temporary tablespace per sec,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5884:20200218:072810.967 item "dbserver:db.odbc.select[session blockers count,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5948:20200218:072812.304 discovery rule "dbserver:db.odbc.discovery[ASM Disk Group Discovery,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5949:20200218:072813.747 discovery rule "dbserver:db.odbc.discovery[Flash Recovery Area Discovery,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5949:20200218:072815.710 discovery rule "dbserver:db.odbc.discovery[blocking sessions discovery,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5948:20200218:072816.070 discovery rule "dbserver:db.odbc.discovery[waiting session discovery,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
      5885:20200218:073316.802 item "dbserver:db.odbc.select[Total Database Size,{$DSN}]" became not supported: Cannot connect to ODBC DSN: [SQL_ERROR]:[HY000][12170][[unixODBC][Oracle][ODBC][Ora]ORA-12170: TNS:Connect timeout occurred
    -- snip
      5884:20200218:075816.262 item "dbserver:db.odbc.select[Schema AUDSYS size,{$DSN}]" became supported
      5882:20200218:075817.265 item "dbserver:db.odbc.select[Schema DBSNMP size,{$DSN}]" became supported
      5882:20200218:075817.265 item "dbserver:db.odbc.select[Schema SYSTEM size,{$DSN}]" became supported
      5884:20200218:075817.272 item "dbserver:db.odbc.select[Schema ORDSYS size,{$DSN}]" became supported
      5883:20200218:075818.267 item "dbserver:db.odbc.select[Schema XDB size,{$DSN}]" became supported
      5883:20200218:075818.267 item "dbserver:db.odbc.select[Schema SYS size,{$DSN}]" became supported
      5885:20200218:075819.300 item "dbserver:db.odbc.select[Blocked sessions count,{$DSN}]" became supported
      5885:20200218:075819.300 item "dbserver:db.odbc.select[Buffer Cache Hit Ratio,{$DSN}]" became supported
      5885:20200218:075819.300 item "dbserver:db.odbc.select[Schema PERFSTAT size,{$DSN}]" became supported
      5885:20200218:075819.300 item "dbserver:db.odbc.select[Cursor Cache Hit Ratio,{$DSN}]" became supported
      5885:20200218:075819.300 item "dbserver:db.odbc.select[Database CPU Time Ratio,{$DSN}]" became supported
      5885:20200218:075819.300 item "dbserver:db.odbc.select[Database lock count,{$DSN}]" became supported
      5885:20200218:075819.300 item "dbserver:db.odbc.select[Host CPU Utilization,{$DSN}]" became supported
      5885:20200218:075819.300 item "dbserver:db.odbc.select[Instance Status,{$DSN}]" became supported
      5949:20200218:075821.522 discovery rule "dbserver:db.odbc.discovery[blocking sessions discovery,{$DSN}]" became supported
      5948:20200218:075824.200 discovery rule "dbserver:db.odbc.discovery[waiting session discovery,{$DSN}]" became supported
      5882:20200218:075838.338 item "dbserver:db.odbc.select[Long Table Scans Per Sec,{$DSN}]" became supported
      5882:20200218:075839.349 item "dbserver:db.odbc.select[PGA Cache Hit Percentage,{$DSN}]" became supported
      5882:20200218:075840.363 item "dbserver:db.odbc.select[Physical Read Bytes Per Sec,{$DSN}]" became supported
      5882:20200218:075841.374 item "dbserver:db.odbc.select[Physical Reads Per Sec,{$DSN}]" became supported
      5884:20200218:075842.377 item "dbserver:db.odbc.select[Physical Write Bytes Per Sec,{$DSN}]" became supported
      5882:20200218:075843.390 item "dbserver:db.odbc.select[Physical Writes Per Sec,{$DSN}]" became supported
      5882:20200218:075845.416 item "dbserver:db.odbc.select[Redo Allocation Hit Ratio,{$DSN}]" became supported
      5882:20200218:075847.429 item "dbserver:db.odbc.select[Shared Pool Free %,{$DSN}]" became supported
      5884:20200218:075848.436 item "dbserver:db.odbc.select[Soft Parse Ratio,{$DSN}]" became supported
      5884:20200218:075848.436 item "dbserver:db.odbc.select[Total Database Size,{$DSN}]" became supported
      5884:20200218:075850.459 item "dbserver:db.odbc.select[Waiting sessions count,{$DSN}]" became supported
      5882:20200218:075851.466 error reason for "dbserver:db.odbc.select[archive log gap,{$DSN}]" changed: SQL query returned NULL value.
      5884:20200218:075853.472 item "dbserver:db.odbc.select[db session count,{$DSN}]" became supported
      5884:20200218:075856.490 error reason for "dbserver:db.odbc.select[max applied archived log,{$DSN}]" changed: SQL query returned NULL value.
      5884:20200218:075856.490 item "dbserver:db.odbc.select[max archived log,{$DSN}]" became supported
      5884:20200218:075900.527 item "dbserver:db.odbc.select[session blockers count,{$DSN}]" became supported
      5948:20200218:075901.394 discovery rule "dbserver:db.odbc.discovery[ASM Disk Discovery,{$DSN}]" became supported
      5949:20200218:075902.619 discovery rule "dbserver:db.odbc.discovery[ASM Disk Group Discovery,{$DSN}]" became supported
      5948:20200218:075904.293 discovery rule "dbserver:db.odbc.discovery[Flash Recovery Area Discovery,{$DSN}]" became supported
      5884:20200218:080044.933 item "dbserver:db.odbc.select[Physical reads direct per sec,{$DSN}]" became supported
      5884:20200218:080044.934 item "dbserver:db.odbc.select[Physical writes direct per sec,{$DSN}]" became supported
      5882:20200218:080059.998 item "dbserver:db.odbc.select[physical reads direct temporary tablespace per sec,{$DSN}]" became supported

    Zabbix Processes:

    Code:
    # ps aux | grep disco
    zabbix    5872  0.0  0.0 279676  5556 ?        S    Feb14   0:12 /usr/sbin/zabbix_server: discoverer #1 [processed 0 rules in 0.000712 sec, idle 60 sec]
    zabbix    5873  0.0  0.0 279676  5556 ?        S    Feb14   0:12 /usr/sbin/zabbix_server: discoverer #2 [processed 0 rules in 0.000591 sec, idle 60 sec]
    zabbix    5874  0.0  0.0 279676  5556 ?        S    Feb14   0:13 /usr/sbin/zabbix_server: discoverer #3 [processed 0 rules in 0.000894 sec, idle 60 sec]
    zabbix    5875  0.0  0.0 279676  5556 ?        S    Feb14   0:12 /usr/sbin/zabbix_server: discoverer #4 [processed 0 rules in 0.000926 sec, idle 60 sec]
    zabbix    5876  0.0  0.0 279676  5556 ?        S    Feb14   0:12 /usr/sbin/zabbix_server: discoverer #5 [processed 0 rules in 0.000871 sec, idle 60 sec]
    zabbix    5877  0.0  0.0 279676  5556 ?        S    Feb14   0:12 /usr/sbin/zabbix_server: discoverer #6 [processed 0 rules in 0.000684 sec, idle 60 sec]
    zabbix    5878  0.0  0.0 279676  5556 ?        S    Feb14   0:12 /usr/sbin/zabbix_server: discoverer #7 [processed 0 rules in 0.000615 sec, idle 60 sec]
    zabbix    5879  0.0  0.0 279676  5556 ?        S    Feb14   0:12 /usr/sbin/zabbix_server: discoverer #8 [processed 0 rules in 0.000996 sec, idle 60 sec]
    zabbix    5880  0.0  0.0 279676  5556 ?        S    Feb14   0:12 /usr/sbin/zabbix_server: discoverer #9 [processed 0 rules in 0.000719 sec, idle 60 sec]
    zabbix    5881  0.0  0.0 279676  5556 ?        S    Feb14   0:12 /usr/sbin/zabbix_server: discoverer #10 [processed 0 rules in 0.000647 sec, idle 60 sec]
    
    # ps aux | grep poll
    zabbix    5871  0.0  0.0 175208  2968 ?        S    Feb14   0:28 /usr/sbin/zabbix_server: http poller #1 [got 0 values in 0.000732 sec, idle 5 sec]
    zabbix    5887  0.0  0.0 178524  4076 ?        S    Feb14   0:16 /usr/sbin/zabbix_server: java poller #1 [got 0 values in 0.000056 sec, idle 5 sec]
    zabbix    5888  0.0  0.0 178524  4076 ?        S    Feb14   0:17 /usr/sbin/zabbix_server: java poller #2 [got 0 values in 0.000060 sec, idle 5 sec]
    zabbix    5889  0.0  0.0 178524  4076 ?        S    Feb14   0:17 /usr/sbin/zabbix_server: java poller #3 [got 0 values in 0.000064 sec, idle 5 sec]
    zabbix    5890  0.0  0.0 178524  4076 ?        S    Feb14   0:17 /usr/sbin/zabbix_server: java poller #4 [got 0 values in 0.000064 sec, idle 5 sec]
    zabbix    5891  0.0  0.0 178524  4076 ?        S    Feb14   0:16 /usr/sbin/zabbix_server: java poller #5 [got 0 values in 0.000059 sec, idle 5 sec]
    zabbix    5892  0.0  0.0 175208  4076 ?        S    Feb14   0:17 /usr/sbin/zabbix_server: proxy poller #1 [exchanged data with 0 proxies in 0.000089 sec, idle 5 sec]
    zabbix    5895  0.2  0.3 522284 28772 ?        S    Feb14  14:23 /usr/sbin/zabbix_server: poller #1 [got 2 values in 0.030844 sec, idle 1 sec]
    zabbix    5896  0.2  0.3 522344 28776 ?        S    Feb14  14:24 /usr/sbin/zabbix_server: poller #2 [got 0 values in 0.000029 sec, idle 1 sec]
    zabbix    5897  0.2  0.3 522300 28804 ?        S    Feb14  14:23 /usr/sbin/zabbix_server: poller #3 [got 2 values in 0.009258 sec, idle 1 sec]
    zabbix    5898  0.2  0.3 522360 28768 ?        S    Feb14  14:34 /usr/sbin/zabbix_server: poller #4 [got 4 values in 0.017420 sec, idle 1 sec]
    zabbix    5899  0.2  0.3 522308 28796 ?        S    Feb14  14:22 /usr/sbin/zabbix_server: poller #5 [got 4 values in 0.017257 sec, idle 1 sec]
    zabbix    5900  0.2  0.3 522364 28736 ?        S    Feb14  14:17 /usr/sbin/zabbix_server: poller #6 [got 1 values in 0.005830 sec, idle 1 sec]
    zabbix    5901  0.2  0.3 522356 28692 ?        S    Feb14  14:14 /usr/sbin/zabbix_server: poller #7 [got 0 values in 0.000293 sec, idle 1 sec]
    zabbix    5902  0.2  0.3 522320 28760 ?        S    Feb14  14:29 /usr/sbin/zabbix_server: poller #8 [got 3 values in 0.020163 sec, idle 1 sec]
    zabbix    5903  0.2  0.3 522340 28780 ?        S    Feb14  14:23 /usr/sbin/zabbix_server: poller #9 [got 0 values in 0.000031 sec, idle 1 sec]
    zabbix    5904  0.2  0.3 522368 28784 ?        S    Feb14  14:21 /usr/sbin/zabbix_server: poller #10 [got 0 values in 0.000023 sec, idle 1 sec]
    zabbix    5905  0.2  0.3 522328 28760 ?        S    Feb14  14:20 /usr/sbin/zabbix_server: poller #11 [got 0 values in 0.000030 sec, idle 1 sec]
    zabbix    5906  0.2  0.3 522284 28760 ?        S    Feb14  14:50 /usr/sbin/zabbix_server: poller #12 [got 2 values in 0.007104 sec, idle 1 sec]
    zabbix    5907  0.2  0.3 522340 28712 ?        S    Feb14  14:28 /usr/sbin/zabbix_server: poller #13 [got 0 values in 0.000061 sec, idle 1 sec]
    zabbix    5908  0.2  0.3 522288 28776 ?        S    Feb14  14:38 /usr/sbin/zabbix_server: poller #14 [got 0 values in 0.000036 sec, idle 1 sec]
    zabbix    5909  0.2  0.3 522304 28792 ?        S    Feb14  14:23 /usr/sbin/zabbix_server: poller #15 [got 4 values in 0.009616 sec, idle 1 sec]
    zabbix    5910  0.2  0.3 522280 28724 ?        S    Feb14  14:17 /usr/sbin/zabbix_server: poller #16 [got 3 values in 0.012393 sec, idle 1 sec]
    zabbix    5911  0.2  0.3 522332 28772 ?        S    Feb14  14:22 /usr/sbin/zabbix_server: poller #17 [got 0 values in 0.000061 sec, idle 1 sec]
    zabbix    5912  0.2  0.3 522336 28748 ?        S    Feb14  14:20 /usr/sbin/zabbix_server: poller #18 [got 0 values in 0.000997 sec, idle 1 sec]
    zabbix    5918  0.2  0.3 522376 28716 ?        S    Feb14  14:20 /usr/sbin/zabbix_server: poller #19 [got 1 values in 0.007566 sec, idle 1 sec]
    zabbix    5919  0.2  0.3 522280 28732 ?        S    Feb14  14:25 /usr/sbin/zabbix_server: poller #20 [got 1 values in 0.006706 sec, idle 1 sec]
    zabbix    5920  0.0  0.0 282940  7316 ?        S    Feb14   0:16 /usr/sbin/zabbix_server: unreachable poller #1 [got 0 values in 0.000029 sec, idle 5 sec]
    zabbix    5921  0.0  0.0 282940  6644 ?        S    Feb14   0:17 /usr/sbin/zabbix_server: unreachable poller #2 [got 0 values in 0.000077 sec, idle 5 sec]
    zabbix    5922  0.0  0.0 282940  6512 ?        S    Feb14   0:16 /usr/sbin/zabbix_server: unreachable poller #3 [got 0 values in 0.000062 sec, idle 5 sec]
    zabbix    5923  0.0  0.0 282940  6288 ?        S    Feb14   0:16 /usr/sbin/zabbix_server: unreachable poller #4 [got 0 values in 0.000050 sec, idle 5 sec]
    zabbix    5925  0.0  0.0 282940  6720 ?        S    Feb14   0:17 /usr/sbin/zabbix_server: unreachable poller #5 [got 0 values in 0.000062 sec, idle 5 sec]
    zabbix    5926  0.0  0.0 282940  6100 ?        S    Feb14   0:17 /usr/sbin/zabbix_server: unreachable poller #6 [got 0 values in 0.000068 sec, idle 5 sec]
    zabbix    5927  0.0  0.0 282940  6684 ?        S    Feb14   0:16 /usr/sbin/zabbix_server: unreachable poller #7 [got 0 values in 0.000027 sec, idle 5 sec]
    zabbix    5928  0.0  0.0 282940  6188 ?        S    Feb14   0:16 /usr/sbin/zabbix_server: unreachable poller #8 [got 0 values in 0.000069 sec, idle 5 sec]
    zabbix    5929  0.0  0.0 282940  6692 ?        S    Feb14   0:16 /usr/sbin/zabbix_server: unreachable poller #9 [got 0 values in 0.000181 sec, idle 5 sec]
    zabbix    5930  0.0  0.0 282940  6420 ?        S    Feb14   0:16 /usr/sbin/zabbix_server: unreachable poller #10 [got 0 values in 0.000026 sec, idle 5 sec]
  • Sven
    Junior Member
    • Feb 2012
    • 26

    #2
    I have just increased the "StartLLDProcessors" option to 20.

    This Oracle template (and others I've created) includes lots of auto discovery, and I'm building more to cope with some of the dynamic environments.

    I'm not sure this will help, but I will let you know if it happens again.
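
For anyone comparing notes, a minimal sketch of how the process-count and timeout settings mentioned so far could be listed from the server config. The path /etc/zabbix/zabbix_server.conf is an assumption (the usual package default) and `show_settings` is a hypothetical helper name. Only uncommented lines are printed, so a parameter still at its built-in default will not appear.

```shell
#!/bin/sh
# Assumed location of the server config; override with CONF=/path/to/file.
CONF="${CONF:-/etc/zabbix/zabbix_server.conf}"

# Print the active (uncommented) process-count and timeout settings
# from the config file given as the first argument.
show_settings() {
    grep -E '^(StartPollers|StartPollersUnreachable|StartDiscoverers|StartLLDProcessors|Timeout)=' "$1"
}

# Only read the real config when asked to, so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    show_settings "$CONF"
fi
```

Raising these counts adds capacity, but it does not shorten the time a single process spends waiting on a connection that never answers.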


    • Sven
      Junior Member
      • Feb 2012
      • 26

      #3
      Well, it happened again.

      One of the databases was brought down to install a new version of Oracle. This started at around 10:53 on Saturday; an event triggered and I was rightly alerted.

      At around 12:14 I had more alerts stating that 3 other databases had gone offline, even though they hadn't.

      It almost looks like the system locks up and can no longer query other ODBC sources, so Zabbix reports that it hasn't received any data from those databases because no new values have come in.

      Has *anyone* any suggestions?


      [Image: gkrellShoot_04-06-20_112916.jpg]


      • tim.mooney
        Senior Member
        • Dec 2012
        • 1427

        #4
        We're not monitoring Oracle using db.* items so I have limited experience here.

        My guess is that the problem relates to these two notes from the ODBC item page:
        Important notes


        • Zabbix does not limit the query execution time. It is up to the user to choose queries that can be executed in a reasonable amount of time.
        • The Timeout parameter value from Zabbix server is used as the ODBC login timeout (note that depending on ODBC drivers the login timeout setting might be ignored).
        In your case, I'm guessing that the *login* process still works, since the listener remains up, but the actual queries fail. That seems like it must be causing a poller to hang? If that's the case, then it makes sense that once all the pollers have tried an ODBC query that fails, they're all effectively hung, so then other systems can't be checked.

        Your logs clearly show that Oracle returns an ODBC error, though. I would think that would cause Zabbix to at least close the connection and let the poller move on to its next tasks from the queue. It seems like there's probably a bug there, but without being able to test it myself it's hard to say what the problem is.
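
One way to check the timeout behaviour outside Zabbix is to time a single connection attempt by hand. The sketch below assumes unixODBC's `isql` and GNU `timeout` are installed. `DSN`, `DB_USER` and `DB_PASS` are hypothetical placeholders and `timed_probe` is an illustrative helper, not part of Zabbix.

```shell
#!/bin/sh
# Run a command under a hard time limit and report how long it took.
# An exit status of 124 from timeout(1) means the limit was reached.
timed_probe() {
    limit="$1"; shift
    start=$(date +%s)
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "elapsed=$((end - start))s status=$status"
}

# Only touch the database when asked to, so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    # isql -b runs in batch mode and reads the SQL statement from stdin.
    echo "select 1 from dual;" | timed_probe 30 isql "$DSN" "$DB_USER" "$DB_PASS" -b
fi
```

If the elapsed time against an offline database is much longer than the Zabbix server Timeout value, that would suggest the driver is not honouring the login timeout.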

        If I were in your situation, I would probably set up a test like this:
        1. Put nearly everything into a maintenance period, so the test doesn't cause an alert storm.
        2. Identify an Oracle database that can be taken offline without causing an actual outage for your users.
        3. Right before taking down the Oracle DB you've identified, use 'strace' to attach to every one of the Zabbix pollers on your Zabbix server, with the options that write the output to separate files.
        4. Take the Oracle DB offline to trigger the eventual hang.
        If all the pollers are eventually getting tied up because of hung queries, you should be able to eventually see every one of them stuck in roughly the same spot in each of the output files. In other words, they should all be doing the same thing.
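
The strace step above could be sketched roughly as follows. It assumes the process titles look like the `ps aux` output earlier in the thread ("zabbix_server: poller #N"). `OUT_DIR` is a hypothetical location, and attaching to the pollers needs root.

```shell
#!/bin/sh
# Hypothetical output directory for the per-poller trace files.
OUT_DIR="${OUT_DIR:-/tmp/zabbix-strace}"

# Read "ps aux"-style lines on stdin and print the PIDs of the regular pollers.
# "[#]" keeps this awk process from matching its own command line, and the
# pattern skips "http poller", "java poller", "unreachable poller" and so on.
poller_pids() {
    awk '/zabbix_server: poller [#]/ { print $2 }'
}

# Attach one strace per poller, each writing to its own file.
trace_pollers() {
    mkdir -p "$OUT_DIR"
    for pid in $(ps aux | poller_pids); do
        # -tt: wall-clock timestamps, -T: time spent inside each syscall
        strace -tt -T -p "$pid" -o "$OUT_DIR/poller-$pid.trace" &
    done
    wait    # returns once every strace has been stopped (e.g. with Ctrl-C)
}

# Only attach when asked to, so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    trace_pollers
fi
```

Once the database is offline, the last lines of each trace file show where that poller is sitting. If they all end in the same blocking call, that supports the hung-poller theory.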

        If you make progress on this issue, please do update this thread.


        • Sven
          Junior Member
          • Feb 2012
          • 26

          #5
          Well - I've found the issue.

          I had some more alerts and happened to be checking the logs around that time. There was a kernel message regarding the netpf10 module. It turns out that a new scan from the security team was causing these odd things to happen.

          Oddly, it is a combination of IPv6 being disabled on the server and SELinux, even though SELinux was set to permissive.

          I'm not an SELinux guy, so I cobbled together a list of commands to solve this. There may be a cleaner way:

          Code:
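          cd /root
          # Build a type-enforcement module from the zabbix_server entries in the audit log
          grep zabbix_server /var/log/audit/audit.log | audit2allow -m zabbix_netpf10 > zabbix-netpf10.te
          # Compile the module, package it, then install the policy package
          checkmodule -M -m -o zabbix-netpf10.mod zabbix-netpf10.te
          semodule_package -m zabbix-netpf10.mod -o zabbix-netpf10.pp
          semodule -i zabbix-netpf10.pp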
          Hope that helps everyone!


          • Sven
            Junior Member
            • Feb 2012
            • 26

            #6
            Well - I guess I was wrong.

            One of the engineers brought down one of the ODAs for patching on Friday and hadn't put it into maintenance mode. This resulted in another alert storm, as the ODBC checks for the other databases stopped receiving data and reported them as down.

            Guess it must be something else.

            Anyone have any ideas ?


            • Sven
              Junior Member
              • Feb 2012
              • 26

              #7
              I’m now thinking that moving to zabbix_agent2 may be the way to go: instead of the main server polling, offload the querying to the agents installed on the ODAs/Solaris.

              I am going to have an issue with this route in the next few months, as we’re planning on having 2 Exadata systems installed and I’m not sure what we can do about agents being installed on there.

              Does anyone have any thoughts on that?
