Hi,
I'm hoping to get some help with tuning parameters that might reduce or solve the problems we've been having with our production Zabbix infrastructure. It has been in place for at least 6 months, and we've recently seen a sudden decline in performance: gaps in our graphs, items showing no data, slow responses from the frontend, and a very large queue of Zabbix agent data. I'd appreciate any suggestions for tweaking our installation.
Details about the pieces involved:
Zabbix Server
Version 2.0.10
CentOS 6.5
24GB RAM
Intel(R) Xeon(R) CPU E5603 @ 1.60GHz
relevant zabbix_server.conf sections:
Code:
### Option: StartPollers
# StartPollers=5
StartPollers=128

### Option: StartIPMIPollers
# StartIPMIPollers=0
StartIPMIPollers=24

### Option: StartPollersUnreachable
# StartPollersUnreachable=1
StartPollersUnreachable=80

### Option: StartTrappers
# StartTrappers=5
StartTrappers=32

### Option: StartPingers
# StartPingers=1
StartPingers=24

### Option: StartDiscoverers
# StartDiscoverers=1
StartDiscoverers=4

### Option: StartHTTPPollers
# StartHTTPPollers=1

# Only required if Java pollers are started.
### Option: StartJavaPollers
StartJavaPollers=64

### Option: StartSNMPTrapper
# If 1, SNMP trapper process is started.
# StartSNMPTrapper=0
StartSNMPTrapper=1

### Option: StartDBSyncers
# StartDBSyncers=4
StartDBSyncers=4

### Option: StartProxyPollers
# StartProxyPollers=1

********

HousekeepingFrequency=1
DisableHousekeeping=1
CacheSize=2G
HistoryCacheSize=2G
TrendCacheSize=2G
HistoryTextCacheSize=2G
LogSlowQueries=3000
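To put a number on the settings above, here is a rough tally of how many worker processes they start. This is only a sketch: the `count_started_processes` helper is a hypothetical name, the values are copied from the config shown above, and I'm assuming (not confirmed) that most of these processes each hold their own database connection.

```python
import re

# Uncommented Start* values copied from the zabbix_server.conf excerpt above.
CONF = """
StartPollers=128
StartIPMIPollers=24
StartPollersUnreachable=80
StartTrappers=32
StartPingers=24
StartDiscoverers=4
StartJavaPollers=64
StartSNMPTrapper=1
StartDBSyncers=4
"""


def count_started_processes(conf_text):
    """Return ({option: count}, total) for every uncommented Start*=N line."""
    per_option = {}
    for line in conf_text.splitlines():
        match = re.match(r"^\s*(Start\w+)\s*=\s*(\d+)\s*$", line)
        if match:
            per_option[match.group(1)] = int(match.group(2))
    return per_option, sum(per_option.values())


per_option, total = count_started_processes(CONF)
print(total)  # 361 explicitly configured processes
```

The two options left commented out (StartHTTPPollers and StartProxyPollers) are not counted, so the real number is slightly higher.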
Backend Database
Postgres 9.1.8
DB size = 52GB, partitioned
Example:
zabbix=# \d+ history_uint
                    Table "public.history_uint"
 Column |     Type      |           Modifiers           | Storage | Description
--------+---------------+-------------------------------+---------+-------------
 itemid | bigint        | not null                      | plain   |
 clock  | integer       | not null default 0            | plain   |
 value  | numeric(20,0) | not null default (0)::numeric | main    |
 ns     | integer       | not null default 0            | plain   |
Indexes:
    "history_uint_1" btree (itemid, clock)
    "history_uint_mn" btree (itemid, clock)
Triggers:
    partition_trg BEFORE INSERT ON history_uint FOR EACH ROW EXECUTE PROCEDURE trg_partition('day')
Child tables: partitions.history_uint_p2014_03_04,
partitions.history_uint_p2014_03_05,
partitions.history_uint_p2014_03_06,
partitions.history_uint_p2014_03_07,
partitions.history_uint_p2014_03_08,
partitions.history_uint_p2014_03_09,
partitions.history_uint_p2014_03_10,
partitions.history_uint_p2014_03_11,
partitions.history_uint_p2014_03_12,
partitions.history_uint_p2014_03_13,
partitions.history_uint_p2014_03_14,
partitions.history_uint_p2014_03_15,
partitions.history_uint_p2014_03_16,
partitions.history_uint_p2014_03_17,
partitions.history_uint_p2014_03_18,
partitions.history_uint_p2014_03_19,
partitions.history_uint_p2014_03_20,
partitions.history_uint_p2014_03_21
Has OIDs: no
relevant postgres settings:
Code:
max_connections = 2048
shared_buffers = 8196MB
temp_buffers = 128MB
work_mem = 128MB
checkpoint_segments = 256
checkpoint_completion_target = 0.9
effective_cache_size = 8192MB
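One thing I can work out from these numbers is the worst-case memory they allow. The sketch below is simple arithmetic, assuming every one of the allowed connections runs a single sort or hash operation at the full work_mem at the same time (a pessimistic assumption, since work_mem is a per-operation limit and not a per-server one). The variable names are just for illustration.

```python
# Values copied from the postgres settings and server specs above.
MAX_CONNECTIONS = 2048
WORK_MEM_MB = 128
SHARED_BUFFERS_MB = 8196
PHYSICAL_RAM_MB = 24 * 1024  # 24GB of RAM on the server

# Pessimistic case: one work_mem-sized sort or hash per connection, all at once.
worst_case_work_mem_mb = MAX_CONNECTIONS * WORK_MEM_MB
worst_case_work_mem_gb = worst_case_work_mem_mb / 1024

# RAM left for work_mem, the OS, and the page cache once shared_buffers is taken.
remaining_after_shared_mb = PHYSICAL_RAM_MB - SHARED_BUFFERS_MB

print(worst_case_work_mem_gb)     # 256.0 (GB)
print(remaining_after_shared_mb)  # 16380 (MB)
```

So on paper the configuration permits far more work_mem than the machine has RAM, although I don't know how close actual usage gets to that limit.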
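In addition to what we're seeing on the frontend (the large queue and the gaps in graph data), I'm also seeing a lot of slow query lines in the Zabbix server log and a number of log lines in Postgres that look like this:
Code:
2014-03-21 21:02:39 UTC6159LOG: process 6159 still waiting for ShareLock on transaction 4243462660 after 1005.219 ms
2014-03-21 21:02:52 UTC1065LOG: duration: 13912.682 ms statement: select value from history_uint where itemid=50464 and clock<=1395432334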
Is this enough for a starting assessment?
Thanks in advance.