Some Experience of tuning of large zabbix scalability

  • frankymryao
    Member
    • Oct 2011
    • 52

    #1

    A friend from the community emailed me about some tuning problems. I am copying my reply here.


    Hi Michael,


    Our Zabbix installation now has XXXX hosts, about 600,000 items and 200,000 triggers, but the vps (values per second) is only 1220. Oracle runs on 10 disks under ASM with normal redundancy (similar to RAID 1). Our architecture is 1 server and 3 proxies. The bottleneck is Oracle I/O. While Zabbix is running normally, I/O utilization is about 30% to 40%, but after restarting (or starting) Zabbix it reaches 100%, because Zabbix is catching up on the data missed while it was shut down. Yesterday our DBA suggested that we could drop normal redundancy. We estimate that about one disk will fail every three years, and dropping normal redundancy would double the I/O performance.

    I think you need to tune your system in two ways:


    1. Make the item intervals longer. At first our vps was very high; we lengthened the intervals and it worked. The longer the interval, the more obvious the effect. Below are our statistics:
    SQL> select delay, count(*) from items where status=0 group by delay order by 2 desc;
         DELAY   COUNT(*)  PERCENT
    ---------- ---------- --------
           600     119489   18.04%
         86400     103224   15.59%
           300      90168   13.62%
          2400      79051   11.94%
          7200      62286    9.41%
          1200      53981    8.15%
           120      37741     5.7%
           240      34252    5.17%
          3600      25578    3.86%
            60      19412    2.93%
         51840      15251     2.3%
           180       8026    1.21%
        172800       4125     .62%
           900       3344      .5%
        120960       3309      .5%
            30       2942     .44%
           150         16       0%
          6000          5       0%
          1800          4       0%

    2. Performance tuning

    (1) Patch the code. The files are attached. Zabbix_frontend.tar is the frontend (usually in /var/www/html), with some PHP code patched. libs.zip is the patched Zabbix source code (in src/libs of the install files); it solves the open cursor, sequence and shared pool problems. We have used it for months and it has been working well.

    (2) zabbix_proxy.conf and zabbix_server.conf are our configuration files; you can use our parameters as a reference.

    (3) Oracle tuning. Most importantly, history needs to be truncated to 7 days; the length of history has the biggest effect on performance. The history table is used to draw graphs over short periods (1 hour, 2 hours, 1 day, ...) and the trends table is used to draw graphs over long periods (1 week, 2 weeks, 1 month, ...). I strongly suggest you keep less than one week of data in the history table.
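
To make the effect of point 1 concrete, here is a small Python sketch that turns the interval statistics above into a rough values-per-second estimate. It assumes every enabled item delivers exactly one value per interval, which is a simplification, so treat the result as a rough upper bound rather than the measured vps. The function names and the 60-second what-if are illustrative assumptions, not part of Zabbix.

```python
# (delay in seconds, number of enabled items), copied from the statistics in point 1.
delay_counts = [
    (600, 119489), (86400, 103224), (300, 90168), (2400, 79051),
    (7200, 62286), (1200, 53981), (120, 37741), (240, 34252),
    (3600, 25578), (60, 19412), (51840, 15251), (180, 8026),
    (172800, 4125), (900, 3344), (120960, 3309), (30, 2942),
    (150, 16), (6000, 5), (1800, 4),
]


def estimated_vps(rows):
    """Assume each item yields one value every `delay` seconds and sum the rates."""
    return sum(count / delay for delay, count in rows)


def with_min_delay(rows, min_delay):
    """Hypothetical what-if: raise every interval shorter than `min_delay` to it."""
    return [(max(delay, min_delay), count) for delay, count in rows]


total_items = sum(count for _, count in delay_counts)
print(f"items: {total_items}")
print(f"estimated vps now: {estimated_vps(delay_counts):.0f}")
print(f"estimated vps with a 60 s minimum: "
      f"{estimated_vps(with_min_delay(delay_counts, 60)):.0f}")
```

The short intervals dominate the sum, which is why lengthening them has such an obvious effect.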

    --
    姚仁捷 Frank Yao
    @PPTV, Shanghai, China

    Weibo: weibo.com/frankymryao
    Blog: baniu.me



    ------------------------------------------------------
    Below is Michael's problem
    -----------------------------------------------------

    On 21 April 2012 06:51, Michael Julian wrote:


    Hi Frank-

    Thanks for the reply. I too do not check my Gmail much. My apologies.

    First off, in comparison of the number of hosts monitored, we are relatively small - about 4200. However, we are currently capturing 2573 nvps with Oracle based on our requirements. As it is now, we would not be able to hold this configuration without the 3 Oracle patches we have received and applied: 2 patches to address the shared memory pool leak and one to address the latches (mutex). We still see high concurrency at times, and when we do, it creates a condition where all processes are waiting for work to do while the one process completes its task. And while I am not a developer, I understand what may need to be done to increase performance more. I am just not familiar enough with what it would take to create a sequence in Oracle and rewrite the Zabbix code to take advantage of it.

    For what it is worth, we are running Oracle 11.2, partitioned on trends and history, and Zabbix 1.8.9 with 4 proxy servers. All agents are configured as active agents and I am still trying to find the right balance of dbsyncers, trappers and pollers to optimize performance. From your experience, how many dbsyncers do you presently have configured for Oracle and how does Zabbix perform?

    I think this is presently my biggest issue?

    Would be interested in your thoughts.

    Michael Julian


    --------------------------------------------
    Below is my letter sharing some experience
    --------------------------------------------
    On Tue, Apr 3, 2012 at 8:49 AM, Frank Yao wrote:


    Hi Michael,


    I've changed my mailbox from [email protected] to this one, [email protected], and I missed your mail. I am sorry for that.

    Last month I patched the code for the shared pool issue and it works well now. With Oracle, Zabbix always wraps its SQL statements in 'begin' and 'end'. As the installation grows, these big SQL blocks that start with 'begin' and end with 'end' CANNOT share cursors in Oracle. As a result, Oracle will crash because of the shared pool (shared memory). What I did was split each big SQL block into single SQL commands. The patch is mainly in the function 'vexecute' of 'zbxdb/db.c' (or 'zbxdbhigh/db.c', I do not remember which).
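
The idea of splitting one big 'begin ... end' block into single commands can be sketched in Python as below. This is only an illustration of the splitting step, not the actual C patch to 'vexecute'; the function names and the sample UPDATE statements are hypothetical.

```python
import re


def split_statements(sql):
    """Split on ';' while ignoring semicolons inside single-quoted strings."""
    statements, current, in_quote = [], [], False
    for ch in sql:
        if ch == "'":
            # A doubled quote ('') toggles twice, so we stay inside the string.
            in_quote = not in_quote
        if ch == ";" and not in_quote:
            statements.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return [s for s in statements if s]


def unwrap_block(sql):
    """Turn 'begin s1; s2; end;' into ['s1', 's2'] so each can be executed alone."""
    parts = split_statements(sql)
    if parts and re.match(r"begin\b", parts[0], re.IGNORECASE):
        parts[0] = parts[0][len("begin"):].strip()
    if parts and parts[-1].lower() == "end":
        parts.pop()
    return [p for p in parts if p]


# Hypothetical example block, similar in shape to what the text describes.
block = """begin
update items set lastclock=1 where itemid=10;
update items set lastclock=2 where itemid=11;
end;"""

for statement in unwrap_block(block):
    print(statement)
```

Each returned statement would then be sent to the database on its own instead of as one large anonymous block.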

    First, you must know the traditional method Zabbix uses to get the 'nextid'. Nextid is a very important variable in Zabbix; it is used to generate the 'eventid'. The traditional method of generating nextid in Zabbix performs very poorly on Oracle (you can read the source code). I created a sequence in Oracle and patched the Zabbix code to get 'nextid' from this sequence. After this patch, the number of TX locks in Oracle decreased very noticeably.
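
For readers who have not created an Oracle sequence before, the following Python sketch only builds the two SQL statements involved: one to create a sequence that starts above the current highest id, and one to fetch the next value. The sequence name 'zbx_eventid_seq', the cache size and the helper names are assumptions for illustration; the actual patch is in the attached C code and may differ.

```python
import re

# Conservative check for a plain, unquoted Oracle identifier.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,29}")


def create_sequence_sql(name, current_max_id, cache=1000):
    """Build a CREATE SEQUENCE statement that continues after current_max_id."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe sequence name: {name!r}")
    if current_max_id < 0:
        raise ValueError("current_max_id must not be negative")
    if cache < 2:
        raise ValueError("cache must be at least 2")
    return (
        f"CREATE SEQUENCE {name} "
        f"START WITH {current_max_id + 1} INCREMENT BY 1 CACHE {cache}"
    )


def next_id_sql(name):
    """Build the query that returns the next id from the sequence."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe sequence name: {name!r}")
    return f"SELECT {name}.NEXTVAL FROM dual"


# Hypothetical values: the name and the current maximum id are made up.
print(create_sequence_sql("zbx_eventid_seq", 5000000))
print(next_id_sql("zbx_eventid_seq"))
```

The starting value has to be higher than every id already stored, otherwise new ids would collide with existing rows.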

    In addition, the conditions above won't occur in a production environment unless your installation is large enough.