Large queue and no catching up

  • tomtomclub
    Junior Member
    • Oct 2013
    • 9

    #1

    Hi,

    I'm hoping to get some help with tuning parameters that might reduce or solve the problems we've been having with our production Zabbix infrastructure. It has been in place for at least 6 months, and we've now seen a sudden decline in performance: gaps in our graphs, missing data, slow responses from the frontend, and a very large queue of Zabbix agent items. I'd appreciate any suggestions for tuning our installation.

    Details about the pieces involved:

    Zabbix Server
    Version 2.0.10
    CentOS 6.5
    24GB RAM
    Intel(R) Xeon(R) CPU E5603 @ 1.60GHz

    Relevant zabbix_server.conf sections:
    Code:
    ### Option: StartPollers
    # StartPollers=5
    StartPollers=128
    ### Option: StartIPMIPollers
    # StartIPMIPollers=0
    StartIPMIPollers=24
    ### Option: StartPollersUnreachable
    # StartPollersUnreachable=1
    StartPollersUnreachable=80
    ### Option: StartTrappers
    # StartTrappers=5
    StartTrappers=32
    ### Option: StartPingers
    # StartPingers=1
    StartPingers=24
    ### Option: StartDiscoverers
    # StartDiscoverers=1
    StartDiscoverers=4
    ### Option: StartHTTPPollers
    # StartHTTPPollers=1
    #	Only required if Java pollers are started.
    ### Option: StartJavaPollers
    StartJavaPollers=64
    ### Option: StartSNMPTrapper
    #	If 1, SNMP trapper process is started.
    # StartSNMPTrapper=0
    StartSNMPTrapper=1
    ### Option: StartDBSyncers
    # StartDBSyncers=4
    StartDBSyncers=4
    ### Option: StartProxyPollers
    # StartProxyPollers=1
    ********
    HousekeepingFrequency=1
    DisableHousekeeping=1
    CacheSize=2G
    HistoryCacheSize=2G
    TrendCacheSize=2G
    HistoryTextCacheSize=2G
    LogSlowQueries=3000
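
As a rough sanity check on the settings above, here is a minimal Python sketch that adds up the configured Start* processes and compares the total with the Postgres max_connections value given later in the post. It assumes each configured server process can hold its own database connection, which the post does not state. StartHTTPPollers and StartProxyPollers are left out because their values are not visible in the excerpt, so the total is a lower bound and does not include frontend connections.

```python
# Start* values copied from the zabbix_server.conf excerpt in the post.
# StartHTTPPollers and StartProxyPollers are omitted: their values are not shown.
start_settings = {
    "StartPollers": 128,
    "StartIPMIPollers": 24,
    "StartPollersUnreachable": 80,
    "StartTrappers": 32,
    "StartPingers": 24,
    "StartDiscoverers": 4,
    "StartJavaPollers": 64,
    "StartSNMPTrapper": 1,
    "StartDBSyncers": 4,
}

# max_connections from the Postgres settings quoted in the post.
max_connections = 2048


def total_configured_processes(settings):
    """Sum the configured process counts (a lower bound on server processes)."""
    return sum(settings.values())


total = total_configured_processes(start_settings)
# Assumption: one database connection per configured process.
share_of_max = total / max_connections

print(f"configured processes: {total}")
print(f"share of max_connections: {share_of_max:.1%}")
```

Under that assumption, the Zabbix server alone accounts for only a fraction of the allowed connections, so the connection limit itself looks less likely to be the bottleneck than what those connections are doing.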




    Backend Database
    PostgreSQL 9.1.8
    DB size: 52GB, partitioned
    Example:

    zabbix=# \d+ history_uint
    Table "public.history_uint"
     Column |     Type      |           Modifiers           | Storage | Description
    --------+---------------+-------------------------------+---------+-------------
     itemid | bigint        | not null                      | plain   |
     clock  | integer       | not null default 0            | plain   |
     value  | numeric(20,0) | not null default (0)::numeric | main    |
     ns     | integer       | not null default 0            | plain   |
    Indexes:
    "history_uint_1" btree (itemid, clock)
    "history_uint_mn" btree (itemid, clock)
    Triggers:
    partition_trg BEFORE INSERT ON history_uint FOR EACH ROW EXECUTE PROCEDURE trg_partition('day')
    Child tables: partitions.history_uint_p2014_03_04,
    partitions.history_uint_p2014_03_05,
    partitions.history_uint_p2014_03_06,
    partitions.history_uint_p2014_03_07,
    partitions.history_uint_p2014_03_08,
    partitions.history_uint_p2014_03_09,
    partitions.history_uint_p2014_03_10,
    partitions.history_uint_p2014_03_11,
    partitions.history_uint_p2014_03_12,
    partitions.history_uint_p2014_03_13,
    partitions.history_uint_p2014_03_14,
    partitions.history_uint_p2014_03_15,
    partitions.history_uint_p2014_03_16,
    partitions.history_uint_p2014_03_17,
    partitions.history_uint_p2014_03_18,
    partitions.history_uint_p2014_03_19,
    partitions.history_uint_p2014_03_20,
    partitions.history_uint_p2014_03_21
    Has OIDs: no

    Relevant postgres settings:
    Code:
    max_connections = 2048    
    shared_buffers = 8196MB
    temp_buffers = 128MB
    work_mem = 128MB
    checkpoint_segments = 256
    checkpoint_completion_target = 0.9
    effective_cache_size  = 8192MB
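
To put the memory settings above in perspective, here is a minimal Python sketch of a worst-case estimate. In PostgreSQL, work_mem is a per-operation limit (each sort or hash in each backend can use up to that much), so the ceiling grows with the number of active connections. The one-operation-per-connection default and the figure of 400 connections are hypothetical, chosen only for illustration; the post does not say how much RAM the database host has.

```python
# Values copied from the Postgres settings in the post (in MB).
SHARED_BUFFERS_MB = 8196
WORK_MEM_MB = 128
MAX_CONNECTIONS = 2048


def worst_case_mb(connections, ops_per_connection=1,
                  work_mem_mb=WORK_MEM_MB,
                  shared_buffers_mb=SHARED_BUFFERS_MB):
    """Upper bound if every connection runs `ops_per_connection` sorts or
    hashes at once, each using the full work_mem, on top of shared_buffers."""
    return shared_buffers_mb + connections * ops_per_connection * work_mem_mb


# Every allowed connection busy with one memory-hungry operation.
at_max_connections = worst_case_mb(MAX_CONNECTIONS)

# Hypothetical: 400 busy connections (not a figure from the post).
at_400_connections = worst_case_mb(400)

print(f"all {MAX_CONNECTIONS} connections: {at_max_connections} MB")
print(f"400 connections (hypothetical): {at_400_connections} MB")
```

This is a ceiling, not a prediction: most queries never use their full work_mem. It does suggest the combination of work_mem = 128MB and max_connections = 2048 is worth a second look.
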
    In addition to what we're seeing on the frontend (the large queue and gaps in graph data), I'm also seeing many slow query lines in the Zabbix server log, and a number of Postgres log lines that look like this:

    Code:
    2014-03-21 21:02:39 UTC6159LOG:  process 6159 still waiting for ShareLock on transaction 4243462660 after 1005.219 ms
    2014-03-21 21:02:52 UTC1065LOG:  duration: 13912.682 ms  statement: select value from history_uint where itemid=50464 and clock<=1395432334
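
One detail in the slow statement above may be worth checking: it filters on clock with an upper bound only. Here is a minimal Python sketch that pulls the duration and the clock condition out of that log line and converts the epoch value to UTC. The regular expressions and the summarize helper are illustrative, not part of Zabbix or Postgres. If the daily child tables carry CHECK constraints on clock (an assumption, the post does not show them), a query with no lower bound on clock cannot exclude the older partitions.

```python
import re
from datetime import datetime, timezone

# Slow-query line copied from the Postgres log excerpt in the post.
LOG_LINE = (
    "2014-03-21 21:02:52 UTC1065LOG:  duration: 13912.682 ms  "
    "statement: select value from history_uint "
    "where itemid=50464 and clock<=1395432334"
)

DURATION_RE = re.compile(r"duration: ([\d.]+) ms")
# Two-character operators come first so "<=" is not matched as "<".
CLOCK_RE = re.compile(r"clock\s*(<=|>=|<|>|=)\s*(\d+)")


def summarize(line):
    """Return (duration_ms, clock_bounds, has_lower_bound) for one log line."""
    duration_ms = float(DURATION_RE.search(line).group(1))
    bounds = [
        (op, datetime.fromtimestamp(int(ts), tz=timezone.utc))
        for op, ts in CLOCK_RE.findall(line)
    ]
    # A lower bound (or equality) is what would let old partitions be skipped.
    has_lower_bound = any(op in (">", ">=", "=") for op, _ in bounds)
    return duration_ms, bounds, has_lower_bound


duration_ms, bounds, has_lower_bound = summarize(LOG_LINE)
print(f"duration: {duration_ms / 1000:.1f} s")
for op, when in bounds:
    print(f"clock {op} {when.isoformat()}")
print(f"lower bound on clock: {has_lower_bound}")
```

Running EXPLAIN on the same statement would show how many child tables the planner actually visits.
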
    Is this enough for a starting assessment?

    Thanks in advance.