Better way of monitoring

  • vlam
    Senior Member
    Zabbix Certified Specialist
    • Jun 2009
    • 166

    #1

    Better way of monitoring

    Hi

    I need some help...

    We are in the process of fine-tuning our Zabbix environment, in which we currently monitor close to 2000 servers (at this stage just the OS).
    The business wants us to start adding application monitoring and also look at our storage and network devices.

    This is where my problem and question come in:

    Problem:
    Our primary MySQL DB only houses 3 months of data but is already sitting at 1.5 TB, and our secondary DB, which holds our historical data (1 year), is sitting at 6 TB.

    Question:
    Can I split my application and device monitoring?
    Keep the existing environment that does the OS, and run a separate Zabbix application server and DB for the application and device monitoring, but still use the same agent on my monitored devices for both application servers?
    And then also have the same frontends looking at both.
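    On the agent side, what I have in mind is simply pointing the same agent at both Zabbix servers, roughly like this (the hostnames below are placeholders, not our real servers):

        # the relevant lines in /etc/zabbix/zabbix_agentd.conf on each monitored device
        # (both parameters accept a comma-separated list)
        Server=zbx-os.example.com,zbx-app.example.com
        ServerActive=zbx-os.example.com,zbx-app.example.com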
    4 Zabbix Frontend Servers (Load balanced)
    2 Zabbix App Servers (HA)
    2 Zabbix Database Servers (HA)
    18 Zabbix Proxy Servers (HA)
    3897 Deployed Zabbix Agents
    6161 Values per second
    X-Layer Integration
    Jaspersoft report Servers (HA)
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Separating OS and application monitoring does not make sense, because with such a separation it will no longer be possible to define many trigger dependencies.
    How much memory is MySQL using now? Do you have partitioned history* and trends* tables?
    Why so many web frontends? Have you already switched to nginx and php-fpm?

    Recommendation: switch the Zabbix DB backend to Solaris and use at least gzip-1 transparent compression. With a compression ratio of at least 3x you will be able to use the same hardware much longer.
    The in-memory ZFS ARC holds the cached data in compressed form, so the memory currently in use will effectively behave like the physical memory multiplied by the compression ratio.
    3 months of raw history data is usually too long; 2-3 weeks is usually enough. With ZFS underneath, snapshotting and replicating snapshots to the slave DB node will be much easier as well.
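    For example, something along these lines (a minimal sketch; the pool/dataset name tank/zabbix-db and the slave hostname are only placeholders):

        # dataset dedicated to the MySQL datadir, with transparent gzip-1 compression
        zfs create -o recordsize=16k -o compression=gzip-1 tank/zabbix-db   # 16k matches the InnoDB page size
        # once data has been loaded, check what the compression actually achieves
        zfs get compression,compressratio tank/zabbix-db
        # snapshot + incremental replication to the slave DB node
        zfs snapshot tank/zabbix-db@2018-07-05
        zfs send -i tank/zabbix-db@2018-07-04 tank/zabbix-db@2018-07-05 | ssh db-slave zfs receive -F tank/zabbix-db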
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • vlam
      Senior Member
      Zabbix Certified Specialist
      • Jun 2009
      • 166

      #3
      Both DB servers have 96 GB of memory, and yes, the tables are all partitioned.
      No, we still use standard HTTP web servers. We currently have 2 servers running load balanced, and on each there are 2 Zabbix web frontends: one for live data and one for historical data access.
      A Solaris backend will be difficult for us due to availability and support. Is it not possible to do something similar on Linux, with MySQL on x86_64?


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        Originally posted by vlam
        Both DB servers have 96 GB of memory, and yes, the tables are all partitioned.
        No, we still use standard HTTP web servers. We currently have 2 servers running load balanced, and on each there are 2 Zabbix web frontends: one for live data and one for historical data access.
        A Solaris backend will be difficult for us due to availability and support. Is it not possible to do something similar on Linux, with MySQL on x86_64?
        AFAIK no one is yet able to provide real ZFS support on Linux (based on the OpenZFS code).

        Solaris is really worth the money: on one side you have the cost of much more powerful hardware, and on the other side ~$500/year (for up to 4 CPU sockets on non-Oracle x86_64 hardware) plus the cost of a contractor who will set up the Solaris box.
        With 96 GB RAM this box can work like at least 256 GB of new hardware. Just compare both costs.
        Solaris does not mean using the UltraSPARC platform. On x86_64 it works exactly the same way as on SPARC.
        If you have HP, Dell, IBM or SuperMicro hardware, it is all listed on the Solaris HCL and Oracle supports Solaris on those vendors' boxes.

        With 96 GB of RAM I'm guessing that, if the DB is not a VM, it may be sitting in your case on some not-so-fresh hardware. Am I right?
        Solaris administration is very easy for a typical Linux admin; 95%+ of the skills and knowledge from Linux can be reused on Solaris.
        In my career, in the first two years of using Solaris I was literally using it like Linux (I even spent a lot of time porting rpm to Solaris 8 and building my own library of Solaris rpm packages .. which was of course pointless, but at the time I didn't know that).
        Much better hardware diagnostics (almost every time Solaris gives signals about hardware issues, Linux stays completely quiet), aka PSH (Predictive Self Healing), plus DTrace and many more technologies which even after decades are still not available on Linux, make this OS still superior to Linux.
        The only issue is cost, because it is not free .. however, a case like yours is one of those where Solaris, even with paid base support, is cheaper than Linux, which is only theoretically free (because using it forces you onto much more powerful and expensive hardware).

        The biggest problem with OpenZFS is the lack of the reorganisation of the ZFS code targeting all the lock contention in this code (aka reARC), which Oracle did 4+ years ago.
        In other words, even if you use Linux + OpenZFS, such a solution will IMO still deliver 20-40% less than what it is possible to squeeze out of Solaris (storage performance and scalability) .. especially now with some improvements to deduplication, which with Zabbix database data may provide an even higher raw disk space reduction.
        Solaris with ZFS can sometimes provide far higher storage HA than some expensive external storage solutions.
        An L2ARC (Level 2 Adaptive Replacement Cache) on even a 0.5 TB Intel NVMe low-latency SSD can, in the case of a Zabbix DB, deliver DB operations at the latency of that SSD while the bulk of the data sits on slower SATA or even spinning storage.
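        Adding such a cache device is a one-liner (a sketch; the pool name and device path are placeholders):

            # attach a low-latency NVMe SSD to the pool holding the DB as L2ARC
            zpool add tank cache c1t1d0
            # watch how the cache device fills up and performs
            zpool iostat -v tank 10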

        With kernel zones it is much easier to partition bigger hardware and give an exact amount of memory for ZFS caching to each logical OS instance inside.
        Really, the list of very useful Solaris features which are (still) not available on any Linux is very long.


        • vlam
          Senior Member
          Zabbix Certified Specialist
          • Jun 2009
          • 166

          #5
          Originally posted by kloczek
          Separating OS and application monitoring does not make sense, because with such a separation it will no longer be possible to define many trigger dependencies.
          How much memory is MySQL using now? Do you have partitioned history* and trends* tables?
          Why so many web frontends? Have you already switched to nginx and php-fpm?
          So I have changed the web servers to Nginx with PHP-FPM.
          I can already see a big improvement in the solution there. With the DBs we do not have so much of a performance problem; I just have a bit of a size issue. With my historical DB that is 6 TB, is there a way that I can break up the existing history table into monthly segments?
          I believe that will also help.

          I have also started doing a lot of tweaks on the templates we use, to improve them and to reduce the number of metrics being gathered. On the Zabbix dashboard I am not too worried about pretty graphs, as I use Grafana for most of the dashboards that we provide to clients and to different departments for their requirements.

          I have also done some tweaks on the application and proxy side itself, and there we have also seen some improvements since.

          My biggest problem at this stage, I think, is still the DB size, as I know MySQL will get to a point where the size starts affecting performance, and that is my biggest concern.


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by vlam

            So I have changed the web servers to Nginx with PHP-FPM.
            I can already see a big improvement in the solution there. With the DBs we do not have so much of a performance problem; I just have a bit of a size issue. With my historical DB that is 6 TB, is there a way that I can break up the existing history table into monthly segments?
            I believe that will also help.
            Told you :P
            I'm betting that you should be able to move back to only one web frontend.

            I have also started doing a lot of tweaks on the templates we use, to improve them and to reduce the number of metrics being gathered. On the Zabbix dashboard I am not too worried about pretty graphs, as I use Grafana for most of the dashboards that we provide to clients and to different departments for their requirements.
            IMO using Zabbix with Grafana is more or less a waste of time, because whatever you can usually get with Grafana can usually be defined in Zabbix itself.
            FYI, in my templates you can find ready-to-use nginx and PHP monitoring templates.

            php-fpm for now is only in the devel branch:
            https://github.com/kloczek/zabbix-te...vice%20php-fpm
            https://github.com/kloczek/zabbix-te...ervice%20Nginx
            FYI, the devel version of the Nginx template only adds one new calculated item, with the number of HTTP queries per TCP session.

            My biggest problem at this stage, I think, is still the DB size, as I know MySQL will get to a point where the size starts affecting performance, and that is my biggest concern.
            Completely wrong impression, and all this is thanks (mostly) to the table partitioning which you already have.
            Just consider the fact that in 99% of typical cases the most active part of the Zabbix DB content is only the size of the last day's data.
            As long as you have, let's say, daily partitions, this is where most of the queries operate.
            The size of the older data is in this case almost completely irrelevant, because that data will be touched only when someone tries to look at older data or watches graphs over a longer time scale.
            However, even if many web clients request long-time-scale data, this is again not a problem, because the Zabbix server automatically switches to trends data instead of the raw history* table data. All this happens completely transparently.
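            For the monthly split you asked about, a minimal sketch of range partitioning on clock could look like this (the schema name, partition names and dates are only examples; note that the initial ALTER rebuilds the whole table, so on a 6 TB table it needs downtime and free disk space):

                mysql -e "ALTER TABLE zabbix.history PARTITION BY RANGE (clock) (
                            PARTITION p2018_06 VALUES LESS THAN (UNIX_TIMESTAMP('2018-07-01 00:00:00')),
                            PARTITION p2018_07 VALUES LESS THAN (UNIX_TIMESTAMP('2018-08-01 00:00:00'))
                          );"
                # retiring old data then becomes an almost instant metadata operation instead of huge DELETEs
                mysql -e "ALTER TABLE zabbix.history DROP PARTITION p2018_06;"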

            Generally, the overall performance of the Zabbix DB is related to the MRU/MFU (Most Recently Used/Most Frequently Used) hit/miss ratio.
            Linux provides a very poor observation deck here because it has no metrics at all showing hits/misses against buffer-cached data.
            The situation changes dramatically if you decide to switch from Linux to Solaris/OpenSolaris/FreeBSD (systems which provide ZFS OOTB).
            kstat (the kernel statistics interface) provides more than 150k metrics on a typical system install profile.
            ZFS has a very important set of metrics here showing hits/misses for the ARC cache (in memory) and for the L2ARC (Level 2 ARC, which is usually placed on some low-latency SATA or, better, NVMe SSD).
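            A quick way to eyeball those counters from the shell (these are the stock arcstats statistic names, nothing specific to this setup):

                # ARC and L2ARC hit/miss counters straight from the kernel stats
                kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
                kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses
                # sample the miss counter every 10 seconds while the DB is under load
                kstat -p -T d zfs:0:arcstats:misses 10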
            If you try my OS Solaris template you will find OOTB hits/misses stats and graphs.
            In my old post https://kloczek.wordpress.com/2016/0...rade-surprise/ you can find example data about the rate of IOs after passing the ARC layer.
            If you look closer you can see that most of the time the typical rate of IOs reaching the L2ARC was around a few per second. Just after a newly created daily history* partition came into use, the rate of read IOs for the first hour was quite often 0 IO/s (zero).
            I can tell you that in the case of this DB, Zabbix was reading about 120 MB/s .. with only some scraps from time to time hitting the slowest storage.

            Zabbix creates a typical warehouse DB workload. Really .. try to read some theory about such workloads and you will find that three factors are crucial to handling that type of workload:
            - avg IO read latency of MRU/MFU data (mostly against already cached data)
            - size of the MFU/MRU data
            - avg IO write latency
            However, the last factor depends strongly on the first one, because to write data coming in through insert and update queries, the engine first needs to locate (read) the data pointing to the exact places where the new data has to be written. In other words .. if you want to reach really extreme write speed you must focus on observing read IO latency (see the iostat sketch below). This is the single most important DB backend metric, just as the NVPS rate is for the whole Zabbix stack.
            As you see, the total size of the data is not on this short list.
            Q.E.D.
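            To keep an eye on that read latency, plain iostat on the DB volume is enough on either OS (standard iostat columns; the device name is a placeholder):

                # Linux (sysstat): r_await = average read latency in ms
                iostat -x sdb 5
                # Solaris: asvc_t = average active service time in ms per device
                iostat -xn 5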

            Another important factor, in the case of a warehouse database with a moving active window of MFU/MRU data, is the use of COW (Copy On Write) on the storage layer.
            You can "milk" this to transform the typically random write IO workload on such a moving front into a sequential IO workload.
            Simply put, COW deallocates the storage block that needs to be updated instead of modifying it in place, and allocates a contiguous new region where all new/modified data is written as a batch.
            ZFS is COW OOTB and you cannot disable it.
            I can give you a hint about the next possible move if you end up fighting with insufficient write speed .. switch to btrfs (as long as you still want to stick with Linux).
            Using a COW-capable/aware filesystem will be your last possible move .. and after this, Solaris.
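            If you go that route, the setup itself is trivial (a sketch; the device path and mount point are placeholders, and the compression option is just an example):

                # dedicated btrfs volume for the MySQL datadir; COW stays enabled (the default), which is the whole point here
                mkfs.btrfs -L zabbix-db /dev/sdX
                mount -o noatime,compress=zlib /dev/sdX /var/lib/mysql
                # do not apply the usual chattr +C / nodatacow database tweak, that would disable COW again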
            Last edited by kloczek; 05-07-2018, 22:07.


            • vlam
              Senior Member
              Zabbix Certified Specialist
              • Jun 2009
              • 166

              #7
              Originally posted by kloczek
              php-fpm for now is only in the devel branch:
              https://github.com/kloczek/zabbix-te...vice%20php-fpm
              https://github.com/kloczek/zabbix-te...ervice%20Nginx
              FYI, the devel version of the Nginx template only adds one new calculated item, with the number of HTTP queries per TCP session.
              You do not perhaps have a copy of these templates that will work with 3.2.11?

              Thanks


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Originally posted by vlam

                You do not perhaps have a copy of these templates that will work with 3.2.11?
                All my templates have been prepared on top of 3.4.x. Some of them should be usable on 3.2.x. The php-fpm and Nginx ones are definitely 3.4.x-only at the moment, as they use a master item.
                If someone wants to use those templates on 3.2.x it may be possible, but it still needs to be tested, and someone needs to take responsibility for maintaining those template variants.
                The biggest problem in working on my templates is still making them upgradeable and preparing a set of steps guaranteeing that whoever takes some version of those templates will be able to upgrade to the next version in the future. It requires some amount of time and access to at least one test Zabbix stack (for example on your own laptop or workstation).
                Really sorry, but I have no time to do this myself; if someone wants to cooperate on making at least some of those templates 3.2.x-ready, please contact me to discuss how it could be done (using git it should be pretty straightforward — see the sketch below).
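                A possible workflow sketch (the repository URL is the one from my signature; the branch name and steps are just a suggestion):

                    git clone https://github.com/kloczek/zabbix-templates
                    cd zabbix-templates
                    git checkout -b 3.2-backport
                    # edit the template XML: replace master/dependent items with plain items, then test an import on a 3.2.11 box
                    git commit -a -m "Backport Nginx/php-fpm templates to 3.2.x"
                    # and open a pull request so the changes can be reviewed and kept upgradeable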

