**tomtomclub** · 21-03-2014, 23:05

Hi,

I'm hoping to get some help with tuning parameters that might reduce or solve the problems we've been having with our production Zabbix infrastructure. It has been in place for at least 6 months, and we've recently seen a sudden decline in performance: gaps in our graphs, missing data, slow responses from the frontend, and a very large queue of Zabbix agent items. I'd appreciate any suggestions for tweaking our installation.

Details about the pieces involved:

Zabbix Server
Version 2.0.10
CentOS 6.5
24GB RAM
Intel(R) Xeon(R) CPU E5603 @ 1.60GHz

Relevant zabbix_server.conf sections:

Code:

### Option: StartPollers
# StartPollers=5
StartPollers=128
### Option: StartIPMIPollers
# StartIPMIPollers=0
StartIPMIPollers=24
### Option: StartPollersUnreachable
# StartPollersUnreachable=1
StartPollersUnreachable=80
### Option: StartTrappers
# StartTrappers=5
StartTrappers=32
### Option: StartPingers
# StartPingers=1
StartPingers=24
### Option: StartDiscoverers
# StartDiscoverers=1
StartDiscoverers=4
### Option: StartHTTPPollers
# StartHTTPPollers=1
#	Only required if Java pollers are started.
### Option: StartJavaPollers
StartJavaPollers=64
### Option: StartSNMPTrapper
#	If 1, SNMP trapper process is started.
# StartSNMPTrapper=0
StartSNMPTrapper=1
### Option: StartDBSyncers
# StartDBSyncers=4
StartDBSyncers=4
### Option: StartProxyPollers
# StartProxyPollers=1
********
HousekeepingFrequency=1
DisableHousekeeping=1
CacheSize=2G
HistoryCacheSize=2G
TrendCacheSize=2G
HistoryTextCacheSize=2G
LogSlowQueries=3000
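
For a rough sense of scale, the values above can be totalled with a short script. This is only a sketch: the dictionary names are made up for illustration, StartHTTPPollers and StartProxyPollers are left out because their active values are not shown in the excerpt, and the processes the server always starts on its own are not counted.

```python
# Values copied from the zabbix_server.conf excerpt above.
# Hypothetical helper names; nothing here comes from Zabbix itself.
start_processes = {
    "StartPollers": 128,
    "StartIPMIPollers": 24,
    "StartPollersUnreachable": 80,
    "StartTrappers": 32,
    "StartPingers": 24,
    "StartDiscoverers": 4,
    "StartJavaPollers": 64,
    "StartSNMPTrapper": 1,
    "StartDBSyncers": 4,
}

# All four cache settings are configured as 2G.
cache_sizes_gb = {
    "CacheSize": 2,
    "HistoryCacheSize": 2,
    "TrendCacheSize": 2,
    "HistoryTextCacheSize": 2,
}

total_processes = sum(start_processes.values())
total_cache_gb = sum(cache_sizes_gb.values())

print(f"configured worker processes: {total_processes}")
print(f"configured cache memory:     {total_cache_gb} GB of 24 GB RAM")
```

The totals say nothing about whether these settings are too high or too low; they only put the individual numbers side by side.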

Backend Database
PostgreSQL 9.1.8
DB size: 52 GB, partitioned
Example:

zabbix=# \d+ history_uint
                     Table "public.history_uint"
 Column |     Type      |           Modifiers           | Storage | Description
--------+---------------+-------------------------------+---------+-------------
 itemid | bigint        | not null                      | plain   |
 clock  | integer       | not null default 0            | plain   |
 value  | numeric(20,0) | not null default (0)::numeric | main    |
 ns     | integer       | not null default 0            | plain   |
Indexes:
    "history_uint_1" btree (itemid, clock)
    "history_uint_mn" btree (itemid, clock)
Triggers:
    partition_trg BEFORE INSERT ON history_uint FOR EACH ROW EXECUTE PROCEDURE trg_partition('day')
Child tables: partitions.history_uint_p2014_03_04,
partitions.history_uint_p2014_03_05,
partitions.history_uint_p2014_03_06,
partitions.history_uint_p2014_03_07,
partitions.history_uint_p2014_03_08,
partitions.history_uint_p2014_03_09,
partitions.history_uint_p2014_03_10,
partitions.history_uint_p2014_03_11,
partitions.history_uint_p2014_03_12,
partitions.history_uint_p2014_03_13,
partitions.history_uint_p2014_03_14,
partitions.history_uint_p2014_03_15,
partitions.history_uint_p2014_03_16,
partitions.history_uint_p2014_03_17,
partitions.history_uint_p2014_03_18,
partitions.history_uint_p2014_03_19,
partitions.history_uint_p2014_03_20,
partitions.history_uint_p2014_03_21
Has OIDs: no

Relevant PostgreSQL settings:

Code:

max_connections = 2048    
shared_buffers = 8196MB
temp_buffers = 128MB
work_mem = 128MB
checkpoint_segments = 256
checkpoint_completion_target = 0.9
effective_cache_size  = 8192MB
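
To put the memory-related settings in proportion, here is a small back-of-the-envelope calculation. It is a sketch under a simplifying assumption: every allowed connection uses exactly one work_mem allocation at the same time. In practice work_mem is a per-operation limit (a single query can use several, and most connections use none), so the figure is an illustrative bound, not a prediction. The variable names are invented for the example, and the RAM of the database host is not stated above, so no comparison against it is made.

```python
# Values copied from the PostgreSQL settings above.
max_connections = 2048
work_mem_mb = 128
shared_buffers_mb = 8196

# Assumption for illustration: one work_mem-sized sort or hash
# per connection, all active simultaneously.
worst_case_work_mem_mb = max_connections * work_mem_mb
worst_case_work_mem_gb = worst_case_work_mem_mb / 1024

print(f"shared_buffers:                  {shared_buffers_mb / 1024:.1f} GB")
print(f"work_mem x max_connections:      {worst_case_work_mem_gb:.0f} GB")
```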

In addition to what we're seeing on the frontend (the large queue and the gaps in graph data), I'm also seeing a lot of slow query lines in the Zabbix server log, and a number of lines in the PostgreSQL log that look like this:

Code:

2014-03-21 21:02:39 UTC6159LOG:  process 6159 still waiting for ShareLock on transaction 4243462660 after 1005.219 ms
2014-03-21 21:02:52 UTC1065LOG:  duration: 13912.682 ms  statement: select value from history_uint where itemid=50464 and clock<=1395432334
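
In case it helps anyone go through a larger log sample, below is a minimal sketch for pulling the slow statements out of lines in this format and converting the `clock` value in the query to a readable UTC time. The regular expressions and function names are my own and assume the log prefix shown above; a different log_line_prefix would need a different pattern.

```python
import re
from datetime import datetime, timezone

# The two sample lines from the PostgreSQL log above.
LOG_LINES = [
    "2014-03-21 21:02:39 UTC6159LOG:  process 6159 still waiting for "
    "ShareLock on transaction 4243462660 after 1005.219 ms",
    "2014-03-21 21:02:52 UTC1065LOG:  duration: 13912.682 ms  statement: "
    "select value from history_uint where itemid=50464 and clock<=1395432334",
]

DURATION_RE = re.compile(r"duration: (?P<ms>[\d.]+) ms\s+statement: (?P<sql>.*)")
CLOCK_RE = re.compile(r"clock<=(?P<clock>\d+)")


def slow_statements(lines):
    """Return (duration_ms, sql) for every line that logs a statement duration."""
    found = []
    for line in lines:
        match = DURATION_RE.search(line)
        if match:
            found.append((float(match.group("ms")), match.group("sql")))
    return found


def clock_to_utc(sql):
    """Convert the Unix timestamp in a 'clock<=N' condition to a UTC datetime."""
    match = CLOCK_RE.search(sql)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group("clock")), tz=timezone.utc)


for duration_ms, sql in slow_statements(LOG_LINES):
    print(f"{duration_ms / 1000:.1f} s  clock<= {clock_to_utc(sql)}  {sql}")
```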

Is this enough for a starting assessment?

Thanks in advance.


Large queue and no catching up