
Question about number of iops



    I'm working with a Zabbix installation which has:
    - 4363 enabled hosts,
    - 284k enabled items,
    - 148k enabled triggers,
    - NVPS ~1030.

    The database is located on a separate MariaDB Galera cluster (4 DB nodes behind a load balancer). Storage is a mix of 10k, 15k and SSD disks.

    From time to time we're facing storage performance issues, and our hardware specialists claim that an average of 3k IOPS is far too much for the storage to handle.

    Can you tell me how many IOPS you see on your installations and how it corresponds to the number of items/triggers?

    Thanks in advance

    Usually the biggest bottleneck in a Zabbix DB backend is writing data (not reading).
    By using a 4-node DB backend you are actually creating more problems than you solve: read speed is better addressed by simply adding more memory to the DB backend, which lowers the latency and raises the throughput of read operations.
    Just use 2 nodes for HA and double the memory in each.
    For the best possible speed, use a simple master-slave setup instead of Galera.
    This gives you zero overhead on data syncing between nodes, with good enough HA.

    You have the proper amount of memory when the DB backend IO stats show almost only write IOs and at least 30 times fewer read IOs. With enough memory to buffer most of the MRU/MFU data you should be able to completely drop the 10k/15k disks.
    If you are using Linux, you should use the SSD as the log device and read cache of the volume used by the DB engine.
    Deploying the whole DB on SSD is pointless.
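    On ZFS, the "SSD as log device and read cache" layout described above can be sketched like this (pool and device names are placeholders, not taken from the thread):

```shell
# Add a small SSD partition as a separate intent log (ZIL) device,
# and the rest of the SSD as an L2ARC read cache.
# "tank" and the /dev paths are example names only.
zpool add tank log /dev/disk/by-id/ssd-part1
zpool add tank cache /dev/disk/by-id/ssd-part2
zpool status tank    # verify that "logs" and "cache" sections appear
```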

    Look at what happens to the IO stats when you have enough memory for most of the read IOs. This example was on a Zabbix installation with almost 3k NVPS. As you can see, in this case the proportion between average physical read and write IOs is around 1:100.
    Although this was on Solaris, you can do the same on Linux as well. The only difference is that on Linux you will need 3+ times more RAM, because on Solaris ZFS can transparently keep buffered disk data compressed in memory, and Zabbix DB data usually compresses with at least a 3x ratio.
    In this case the DB backend had 96GB RAM, about 2/3 of it used as ARC. Because ZFS caches the data better, the MySQL InnoDB pool had only 12GB (enough to keep the DB indexes in those buffers).
    zapish - Zabbix API SHell binding
    My zabbix templates


      Have you partitioned your database and disabled housekeeping on history and trends?

      We have 1.25 million items, 5400 NVPS; our server is a VM with FC disk. We proved iSCSI and Nimble were inadequate for us.


        Originally posted by LenR View Post
        Have you partitioned your database and disabled housekeeping on history and trends?

        We have 1.25 mil items, 5400 NVPS, our server is VM with FC disk. We proved iscsi and Nimble were inadequate for us.
        Yes I have.
        The number of items or hosts is not important; only the actual flow of NVPS matters. This is a very good single factor describing how big a given Zabbix stack is.

        I'm using partitioned history*/trends* tables everywhere (even on my laptop, where I have a demo/test stack). Partitioning always saves a lot of IOs, and not only on housekeeping: shallower b-trees in each head partition dramatically decrease the number of VFS-layer read IOs (every update and insert query generates multiple read IOs as well, before the new data is written).
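        As a sketch of what such partitioning looks like (the partition names and range boundaries below are invented for illustration; in practice people generate them with a script or stored procedure):

```shell
# Partition the history table by the "clock" column (epoch seconds),
# one partition per month; boundary dates are examples only.
mysql zabbix <<'EOF'
ALTER TABLE history PARTITION BY RANGE (clock) (
    PARTITION p2017_07 VALUES LESS THAN (UNIX_TIMESTAMP('2017-08-01')),
    PARTITION p2017_08 VALUES LESS THAN (UNIX_TIMESTAMP('2017-09-01'))
);
EOF
```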

        Zabbix greatly optimizes writing new data to the history*/trends* tables by forming inserts with whole series of new data. Effectively, with 5k NVPS you should have ~50 inserts/s as a baseline, plus some fluctuations caused by traffic to other tables like events and audit. Passing such a barrier is not rocket science as long as all MRU/MFU (Most Recently/Frequently Used) data is well cached in memory.
        With big enough buffers in RAM, the proportion between physical block-layer read and write IOs should be at least 1:30 (as I wrote, I'm always trying to reach 1:100).
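        A quick way to sanity-check that proportion from cumulative counters (the read/write numbers below are invented for illustration):

```shell
# Given cumulative physical read/write IO counts (e.g. from iostat or
# "zpool iostat"), compute the read:write proportion. Sample numbers only.
reads=120
writes=3600
ratio=$(awk -v r="$reads" -v w="$writes" 'BEGIN { printf "1:%.0f", w / r }')
echo "physical read:write = $ratio"   # -> physical read:write = 1:30
```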

        The biggest problem with a DB backend is getting the lowest possible latency. In that context investing in faster storage is pointless, as RAM still is, and for quite a long time will remain, faster. With data well cached in RAM, an increase of the select rate by a few times is not a big problem, as you can see in my other blog entry. I always try to look at the monitoring data from time to time even if there are no "visible" problems. In this case a temporary increase of the select rate by a factor of 3-4 was not a problem, as long as all those selects were hitting data already well cached in RAM.

        In the example I documented in my blog entry, the storage was a pool of six 10k spindles plus an additional SATA SSD (300GB) used in two parts: a few GB as a ZIL device (ZFS Intent Log), and the rest of the space as L2ARC.
        With such hierarchical storage (first layer the ARC, Adaptive Replacement Cache, in RAM, ~64-70GB; second layer about 290GB of L2ARC on SSD, Level 2 ARC; and six spindles in RAIDZ2, the ZFS equivalent of classic double-parity RAID6) it was possible to have storage with the capacity of the spindles but the latency of the SSD.
        Additionally, restarting MySQL in such an architecture is not a problem: with the data cached in RAM outside the MySQL InnoDB pool, warming the top-layer caches can be done from the ZFS ARC. With an additional slave DB in the same storage layout, even a full OS reboot was not a problem, as before planned reboots the maintenance procedure was to promote the current slave, where all ARC caches were warm, to be the new master.

        What the graph in my blog entry shows is the flow of read/write IOs to the whole zpool, before L2ARC. The ZIL perfectly handles high-latency write operations: whatever must be written to the storage is written first to the ZIL and asynchronously moved to the spindles, without impacting the latency of write operations at the VFS layer. All data in the ZIL is written as uncompressed blocks even if the ZFS volume has compression enabled; compression of newly written data is applied when it is moved from the ZIL to the spindles. All this is done asynchronously, which is why even applying compression does not impact write IO latency.

        Whatever someone does on pure Linux, the above architecture will always be more expensive there, as Linux has no good read-caching block device layer. Even using OpenZFS on Linux is not as effective as Solaris 11.3, because OpenZFS cannot hold compressed ZFS records in ARC (RAM), so on Linux with OpenZFS it is necessary to use more RAM.
        Consider that even on AWS, jumping from an instance with 64-128GB RAM to one 2-4 times bigger generates a huge cost increase. This is even more true with bare-metal instances.

        In my example, 60-70GB of ARC in RAM worked like ~3.6x more RAM (3.6 was the average ZFS compression ratio in this case). Increasing RAM by such a factor usually forces choosing more expensive hardware. The cost of such an upgrade will always be way bigger than the $500/year cost of a Solaris license on non-Oracle hardware (even over the full 3-5 year lifespan of commodity hardware).
        I always use examples like this to show that in some scenarios Linux without support (theoretically free) is more expensive than Solaris with paid support,
        simply because it forces using much more powerful and expensive hardware than in the Solaris case.

        The second part of the difference is another "small" bit of ZFS. ZFS uses COW (Copy On Write). How does it work? When rewriting the same data, for example on update queries or when rewriting b-tree blocks during inserts, ZFS never uses the same physical location on storage. Old blocks are added to the free list and new ones are allocated in a new location. With a constant flow of write IOs, THIS allows allocating a SEQUENTIAL region for IOs that are random at the VFS layer. In other words, using COW reduces physical write IOs. It also reduces the number of hardware interrupts and PCI bus operations (people quite often forget that with NVMe they usually saturate hardware interrupts before the max bandwidth of the NVMe devices).

        IMO Linux in such cases will always be behind Solaris. Why? Because even after 12+ years of ZFS being available on the market, with source code, no one has even started working on a set of technologies that would be equivalent to ZFS.
        Such pieces of technology are not easy to develop and support, even for the biggest Linux-using and Linux-supporting companies like RH, IBM, HP or Dell.

        The only FS on Linux with well-working COW is btrfs. How heavy such technology is on Linux is shown by the fact that a few days ago RH announced abandoning btrfs support.
        Providing on ext4 the equivalent of the ZFS ZIL (to decrease write latency) by using faster storage is quite easy, as this FS can use a journal device on another block device (SATA or NVMe SSD), but it has no COW semantics (the ext and xfs dev teams are working on COW for those FSes).
        From mkfs.ext4(8):
               -J journal-options
                      Create the ext3 journal using options specified on the command-line.  Journal options are comma separated, and may take an argument using the equals
                      ('=') sign.  The following journal options are supported:

                             device=external-journal
                                    Attach the filesystem to the journal block device located on external-journal.  The external journal must already have been created using
                                    the command

                                           mke2fs -O journal_dev external-journal
        But how many people are using this, and how stable is it? btrfs has COW but cannot be used with an external journal device (it is possible to have a journal device when btrfs is placed on top of MD/LVM, but that is pointless, as btrfs, same as ZFS, is able to use multiple block devices to organize pooled storage).
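        Putting the man page fragment together, moving an ext4 journal to a faster device looks roughly like this (device names are examples only):

```shell
# 1. Turn an SSD/NVMe partition into a dedicated journal device.
mke2fs -O journal_dev /dev/nvme0n1p1
# 2. Create the data filesystem attached to that external journal.
mkfs.ext4 -J device=/dev/nvme0n1p1 /dev/sdb1
```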
        bcache on Linux is so rarely used that it is still very unstable.
        All those components are optional on Linux, and much of the kernel code only provides interfaces to those options. ZFS always uses all those bits together, so the whole has already been tested to a level not possible to see on Linux. This is also why the combined ZFS binary code used in kernel space is bigger than the ext4, bcache or cachefiles modules, and still ZFS provides many functionalities never seen on Linux.

        In other words Linux is in a kind of rabbit hole .. and it seems it will stay like this for at least the next 5 years (which is IMO the minimum time necessary to develop a GPL equivalent of ZFS on Linux).
        That is only my 5 cents in the general discussion about using Linux for Zabbix stacks with extreme DB backend performance needs. Using Linux has some well-drawn limits; beyond those limits at least OpenZFS must be added, and the next step is pure Solaris.


          We aren't using ZFS, or any compressing file system. I would think that would be counterproductive for a high-performance DB.

          I got a lot of benefit from increasing InnoDB buffer pools and instances. I also come from a proprietary OS background; we used to say if it wasn't water cooled, it wasn't really a computer :-)

          The best IO is the one you avoid: a database IO avoided by using the DB's own buffers is best, OS-layer buffers are subsystem calls away, and hardware cache is another layer away. When I started a new job once, I found a system with completely default buffer tuning. It couldn't even hold all index levels in RAM. Some napkin math to guesstimate the number of buffers per user times the number of users at busy times, and adding a few thousand buffers, made enough difference that I got a plaque for my wall :-)
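          That napkin math can be as simple as the following (all numbers are invented; the point is the shape of the estimate, not the values):

```shell
# Guesstimate buffer demand: buffers per user x users at peak,
# then convert to MB assuming a 16KB page size.
users=400
buffers_per_user=25
page_kb=16
total_buffers=$((users * buffers_per_user))
total_mb=$((total_buffers * page_kb / 1024))
echo "${total_buffers} buffers (~${total_mb} MB)"   # -> 10000 buffers (~156 MB)
```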


            Originally posted by LenR View Post
            The best IO is the one you avoid, a database IO that is avoided using it's own buffers is the best, OS layer buffers are subsystem calls away, hardware cache is another layer away. When I started a new job once I found a system with completely default buffer tuning. It couldn't even hold all index levels in RAM. Some napkin math to guesstimate the number of buffers per user x number of users at busy times and adding a few 1000 buffers made enough difference that I got a plaque for my wall :-)
            As much as I can generally agree about avoiding IOs, I cannot agree about increasing DB engine buffers, as long as we are not talking about Oracle DB.
            You must know that the combination of MRU/MFU caching algorithms came to ZFS from Oracle DB. Sun developers cooperating with Oracle were simply able to observe many things about how those cache strategies work at really big scale. With that knowledge it was possible to lift those approaches and implant them into a general technology which combines in one layer what is usually done separately in the VFS cache layer, the FS layer and the raw storage block layer, plus full transactionality and COW. This unique packing in a single box can, without tuning, deliver much higher performance than moving all or most of the ARC memory to the InnoDB pool. Why? Because nothing in MySQL or PostgreSQL knows anything about what is beneath the VFS layer.
            By migrating some parts of the technology (caching) out of the SQL engine, adding compression and many other things, it is possible OOTB to increase speed not by a few percent but by at least 50% and sometimes even 300% or more!!!! With some tuning it is easy to increase the maximum reachable speed by even a factor of 10 or more!!! It may sound like fantasy, but it is reality. You may ask "how is it possible that such a speedup was not noticed by most *nix admins?". The answer is that there are not too many places on Earth where very high speed is necessary. Simply, in a world where 95%+ of admins use Linux, something happened which can only be described by the old sentence "Millions of flies cannot be wrong. We must eat the poo!!!".
            Yes .. most *nix admins only have the impression that Linux is the most powerful OS ever. That is the sad truth!!!
            It may sound like a contradiction, but in most cases I'm using Linux and still like it most, because it was one of my first Unices (to be precise, the second one after DEC Ultrix).
            However I have no illusions about why it is used and what is possible and not possible to do with this OS.

            Consider that even when altering tables to add row compression, you are compressing only the data stored in the tables, which in the case of storing numeric data mostly slows things down rather than increasing overall speed.
            ZFS uses transparent compression, and you can alter each volume in a ZFS pool to use a specific compression: from null (files containing nothing but zero bytes take only an entry in the directory structures .. even petabytes of such files are "compressed" to zero physical space), through gzip-1 to gzip-9, with lz (lzma) at the end.
            As I wrote a few times here, I found that for Zabbix DB data the best balance between speed and compression is gzip-1, and the compression ratio is at least 3.5x. Just think what you could do with your storage if you could store 3.5x more monitoring data on the same hardware (??) .. all this without a single hardware change.
            Most of today's hardware has a lot of unused CPU power. Utilizing this unused CPU power allows you to "transmute" it into more REAL storage space!!!
            All this before deduplication!!!
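            On ZFS, setting the per-volume compression and reading back the achieved ratio looks like this (the pool/volume name is a placeholder):

```shell
# Enable gzip-1 on the DB volume and check the achieved ratio.
zfs set compression=gzip-1 tank/mysql
zfs get -H -o value compressratio tank/mysql   # prints e.g. "3.50x"
```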

            On my first Solaris MySQL DB backend I was using the maximum record size, which is 1MB on Sol 11.3 (+SRUs). What was I able to reach by this? A very funny thing.
            It is known that MySQL uses 64KB IOs to read/write data. In every documentation about MySQL and ZFS you can find that to have max speed you should do "zfs set recordsize=64K <mysql/vol>". However, sometimes you are working way below the max possible speed. So what happens if you form bigger batches of data, 1MB in size? Of course between memory and disks you will be shuffling those 1MB blocks: decompressing -> modifying -> compressing and storing back. It may look crazy, but even such bent settings can be an improvement: by using larger chunks of data over which your effective data is compressed, you increase the compression ratio. With a 1MB ZFS record size I was able to reach a 4.2 compression ratio with the same gzip-1. Today's SSDs are simply able to read/write at a speed of at least a few hundred MB/s, if not tenths of GB/s. With such settings it is possible to (again) "transmute" spare CPU power and the max bandwidth of those devices into even more logical disk space.
            With exactly this hardware, as DB performance grew, at some point it was necessary to decrease the record size to 256KB, because it was hitting the 300MB/s max bandwidth to the SSD.
            All this was architected almost three years ago on, at the time, very old hardware (HP blade gen6).
            Today's fresh hardware has even more completely unused CPU power and storage bandwidth, which allows "transmuting" those resources into pure disk space.
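            The record-size experiments above translate to commands like these (the volume name is a placeholder; note that the setting only affects newly written blocks):

```shell
# Larger records mean larger compression batches; existing data keeps
# its old record size until rewritten.
zfs set recordsize=1M tank/mysql
zfs get -H -o value recordsize tank/mysql
# Later, if device bandwidth becomes the bottleneck, step back down:
zfs set recordsize=256K tank/mysql
```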

            And now the sad part .. all those tricks, and/or combinations of multiple tricks, are not possible to use as long as you are only using Linux. Most Linux admins will have only a few words of comment on such stories .. "that is impossible".

            Again: about 3 years ago, in some Sol 11.3 SRU (Support Repository Update), ZFS ARC started keeping cached ZFS records compressed in memory. Because it works together with the other parts embedded in the ZFS guts, allowing the SQL engine (as long as it is not Oracle DB) to cache MRU/MFU data is a waste of resources. ZFS can do this way better .. this is why, on the single box with 96GB RAM from the example used in my quoted blog entries, the MySQL InnoDB pool had only 12 (literally: twelve) GB.
            Empirically I tested that giving more to MySQL had only a negative impact.


              Originally posted by kloczek View Post
              Again: about 3 years ago in some Sol 11.3 SRU (Service Recommended Updates) ZFS ARC keeps in memory compressed cached ZFS records.
              I just found in my bookmarks an Oracle developer blog entry from 2014 about this (AKA reARC) and many other changes.

              Just before this SRU, the Zabbix DB from my blog entry had been working almost at the maximum speed the HW allowed. After the first reboot on the new SRU, I remember checking a few times that MySQL was really fully working, because I didn't believe that we had just gained huge improvements and could call off the move to new HW. The database was able to work for the next 1.5 years with constantly increasing DB performance.
              When Zabbix 1.8 was in use 4 years ago, everything was working barely at the maximum attainable speed of 260 NVPS. Over the next 3 years, after the move to Solaris, in the end it was possible to reach 3.5k NVPS, and there was still plenty of headroom. The cost of those improvements was a move from a pair of 10krpm spindles, first to a pair of SATA SSDs; the next move was to 6 spindles in a separate disk blade, reusing the SSDs as L2ARC+ZIL. Memory was upgraded as well, from 48GB to 96GB. An additional cost was the 3 years of Solaris support (if such support is bought for a 3-year period on non-Oracle HW, Oracle offers a 30%/year discount). The raw cost of Solaris for a single box with up to 4 CPU sockets on non-Oracle HW is about 500 GBP/year.
              As an exercise .. try comparing those costs with the cost of buying 10x more powerful HW and migrating Linux with minimal changes (which people quite often do).
              I must honestly say that part of those performance improvements came in the meantime from MySQL upgrades (first from 5.1 to 5.5, and then from 5.5 to 5.6). The impact of the upgrade to 5.6 I've documented here (the DB backend was already on Solaris):