How to discover and monitor more than 2M ports in a large environment?


    How to discover and monitor more than 2M ports in a large environment?

    Hello Zabbix experts,

    A requirement has come up, so I would like to ask the following:

    How can I discover and monitor more than 2,000,000 ports in a large infrastructure environment?

    Does Zabbix have this ability? Any idea how I can perform such port monitoring?

    Thank you very much in advance.

    BR.
    Costas

    #2
    Originally posted by tritsako View Post
    How can I discover and monitor more than 2,000,000 ports in a large infrastructure environment?
    First you need to know what you need to monitor.

    (I don't want to be rude, but please try to be a bit more realistic about what you are asking for. Only you know what you need to monitor, and it seems you are asking the public forum to do all the normal engineering work of collecting requirements before implementation questions can even be answered.
    Please read the Zabbix documentation first, then do some initial work/experiments, and come back when you run into concrete problems. If you have no time to do this, just hire someone who will do it for you.
    In the end: yes, Zabbix can handle tens of millions of metrics.
    PS. A "metric" is the basic monitoring unit, not a "port".)
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates



      #3
      How can I discover and monitor more than 2,000,000 ports in a large infrastructure environment?

      Hi kloczek,


      Thank you for your reply. I will check it again.

      BR.
      Costas.

      PS: I have been working in systems monitoring for over 8 years; I am not trying to get a ready-made solution to my problem.



        #4
        This may be late, but I just saw the thread :-)

        Our network group had a database of deployed switches predating Zabbix. It had location, IP, make & model. From that database, they used the API to create hosts for each device, assign groups (implying support hours) and a base template. (All are monitored via SNMP.)
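        A minimal sketch of that kind of automation, using the community pyzabbix library (not their exact script; the URL, credentials, IDs and inventory rows below are made-up placeholders, and newer Zabbix versions also want a "details" block on SNMP interfaces):

        from pyzabbix import ZabbixAPI

        zapi = ZabbixAPI("https://zabbix.example.com")  # hypothetical URL
        zapi.login("api-user", "api-password")          # placeholder credentials

        # One row per switch, as exported from the inventory database
        switches = [
            {"name": "sw-bldg1-01", "ip": "10.0.1.1", "groupid": "42", "templateid": "10123"},
        ]

        for sw in switches:
            zapi.host.create(
                host=sw["name"],
                interfaces=[{
                    "type": 2,    # 2 = SNMP interface
                    "main": 1,
                    "useip": 1,
                    "ip": sw["ip"],
                    "dns": "",
                    "port": "161",
                }],
                groups=[{"groupid": sw["groupid"]}],
                templates=[{"templateid": sw["templateid"]}],
            )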

        Then LLD discovers the variable data for each switch: ports, fans, anything that occurs N times. If a switch is life-cycled, they just delete it from Zabbix; the automation will build its replacement.
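        For reference (not their exact rule), a standard SNMP interface-discovery rule in Zabbix looks roughly like this:

        Discovery rule SNMP OID:   discovery[{#IFDESCR},1.3.6.1.2.1.2.2.1.2]
        Item prototype key:        net.if.in[{#IFDESCR}]
        Item prototype SNMP OID:   1.3.6.1.2.1.31.1.1.1.6.{#SNMPINDEX}   (ifHCInOctets)

        Each discovered port then gets one item per prototype, which is how a handful of prototypes per port multiply into millions of items.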

        They have about 6500 devices and discover 6 items per port; 4 proxies do the actual monitoring of about 1.2 million items. We also have servers in the same Zabbix, and those hosts take us up to 8250 devices and 5400 NVPS.

        What they did wrong: 6 items per port is too many. They created IT Services for everything, and it is so slow that the GUI is unusable for all IT Services. They alert on 10k events a day; I'm not sure on what, but that is considered "normal" noise, so those should either be squelched, or fixed if they really are alerts.



          #5
          Hi LenR,

          Thank you for your nice reply. Very nice example!

          BR.
          Costas



            #6
            Hi LenR,

            Could you give a little more detail about the zabbix_server.conf setup and the DB architecture you have behind a system of this size? We currently monitor about 300 items and will add another 600 in the next 4 months. We are running on all-VM infrastructure, split across a high-performance flash SAN with offloading to a slower-disk SAN, but Zabbix is having issues with poller and process resources despite tuning.

            Many thanks,
            T



              #7
              These are sanitized

              zabbix_server.conf
              StartPollers=100
              StartIPMIPollers=2
              StartPollersUnreachable=100
              StartTrappers=30
              StartPingers=50
              StartEscalators=10
              JavaGateway=zabbix-
              StartJavaPollers=20
              CacheSize=4G
              CacheUpdateFrequency=600
              HistoryCacheSize=1G
              HistoryIndexCacheSize=512M
              TrendCacheSize=512M
              ValueCacheSize=2G
              Timeout=4
              UnreachablePeriod=45
              UnavailableDelay=150
              LogSlowQueries=3000

              Proxy for Linux/Windows servers
              Server=ip addr
              Hostname=zabbix
              ConfigFrequency=600
              StartPollers=120
              StartIPMIPollers=0
              StartPollersUnreachable=75
              StartTrappers=25
              StartPingers=25
              StartDiscoverers=0
              StartHTTPPollers=10
              JavaGateway=127.0.0.1
              StartJavaPollers=5
              StartVMwareCollectors=0
              CacheSize=300M
              HistoryCacheSize=256M
              HistoryIndexCacheSize=64M
              Timeout=30
              UnreachableDelay=30
              LogSlowQueries=3000


              Busy proxy for network devices (SNMP v2 mostly)
              Server=
              Hostname=zabbix-
              ConfigFrequency=600
              StartPollers=120
              StartIPMIPollers=0
              StartPollersUnreachable=150
              StartTrappers=25
              StartPingers=25
              StartDiscoverers=0
              StartHTTPPollers=10
              StartVMwareCollectors=0
              CacheSize=500M
              HistoryCacheSize=256M
              HistoryIndexCacheSize=64M
              Timeout=6
              UnreachableDelay=30
              LogSlowQueries=3000

              The Zabbix server is a VM: 8 cores, 36G RAM, boot disk + 4 disks for MySQL. Those are XFS and striped via LVM. MySQL 5.7.
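              For illustration, striping a logical volume across 4 disks with LVM looks something like this (volume-group name, stripe size and LV size are made up, not our exact values):

              # stripe a logical volume across 4 PVs with a 64 KiB stripe size
              lvcreate --stripes 4 --stripesize 64k --size 500G --name mysql vg_data
              mkfs.xfs /dev/vg_data/mysql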

              Selected parts of my.cnf:
              innodb_log_file_size=4G
              slow-query-log=on
              max_connections=1000
              innodb_buffer_pool_size=24G
              large-pages

              900 items shouldn't be a problem; even 900 hosts with ~100 items each at a reasonable collection frequency shouldn't be a problem. Make sure the Zabbix server template is applied to your server; it will give stats on Zabbix internal and data-collection processes. MySQL iowait is the kiss of death, and lots of slow-query errors is bad. Physical hardware with SSD is what Zabbix recommends, but that is not today's "best practice": our data center is full of VM hosts, with no room for every special need to have its own physical server. We had a bad experience with iSCSI VM disks, but FC-attached disk seems much better.
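              If it helps, these are the kinds of internal item keys that template relies on (standard Zabbix internal checks):

              zabbix[process,poller,avg,busy]   # % of time the pollers are busy
              zabbix[wcache,history,pfree]      # % free in the history write cache
              zabbix[queue,10m]                 # items delayed more than 10 minutes
              zabbix[rcache,buffer,pfree]       # % free in the configuration cache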

              We have a "lite" zabbix that basically just pings about 7000 hosts to see if they are alive. It averages 2 items per host, under 200 NVPS, I don't have to do much tuning for it to function. Proxies here only for network access, not Zabbix server offload.



                #8
                Originally posted by LenR View Post
                These are sanitized

                zabbix_server.conf
                StartPollers=100
                StartIPMIPollers=2
                StartPollersUnreachable=100
                StartTrappers=30
                StartPingers=50
                StartEscalators=10
                JavaGateway=zabbix-
                StartJavaPollers=20
                CacheSize=4G
                CacheUpdateFrequency=600
                HistoryCacheSize=1G
                HistoryIndexCacheSize=512M
                TrendCacheSize=512M
                ValueCacheSize=2G
                Timeout=4
                UnreachablePeriod=45
                UnavailableDelay=150
                LogSlowQueries=3000
                On the scale of 2M+ metrics, monitoring anything other than the Zabbix server's own internal metrics directly from the server is simply wrong.
                The number of pollers should be greater than the absolute minimum only if you are using passive proxies.
                Usually the ratio between pollers and proxies should be about 1:2. With active proxies and trappers this ratio can be even bigger.
                So StartPollers=100 may be OK for 200+ proxies.
                I'm almost sure the memory parameters are too low for 2M+ metrics.
                If none of the monitoring is done by the server itself, StartIPMIPollers=2, StartPollersUnreachable=100, StartPingers=50 and StartJavaPollers=20 do not make any sense either.
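                For illustration only (these numbers are assumptions, not a recommendation), a server that delegates all collection to active proxies could be trimmed to something like:

                StartPollers=10             # little left for the server to poll itself
                StartPollersUnreachable=1
                StartTrappers=50            # active proxies deliver data via trappers
                StartPingers=1
                StartIPMIPollers=0
                StartJavaPollers=0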

                Proxy for linux/windows servers
                Server=ip addr
                Hostname=zabbix
                ConfigFrequency=600
                StartPollers=120
                StartIPMIPollers=0
                StartPollersUnreachable=75
                StartTrappers=25
                StartPingers=25
                StartDiscoverers=0
                StartHTTPPollers=10
                JavaGateway=127.0.0.1
                StartJavaPollers=5
                StartVMwareCollectors=0
                CacheSize=300M
                HistoryCacheSize=256M
                HistoryIndexCacheSize=64M
                Timeout=30
                UnreachableDelay=30
                LogSlowQueries=3000
                The exact numbers for StartIPMIPollers, StartHTTPPollers, StartVMwareCollectors and StartJavaPollers should depend on what the particular proxy monitors.
                StartPollers should be related to the number of hosts monitored over IPMI and SNMP and the number of passive agents.
                StartTrappers should be correlated with the number of active agents connecting to the particular proxy.
                Logging slow queries makes little sense here: if someone has to dig through those logs, it means either the storage IO layer was never sized properly, or nobody is using those stats to evaluate whether the hardware behind the DB backend is strong enough (see below for why).

                Busy proxy for network devices (SNMP v2 mostly)
                Server=
                Hostname=zabbix-
                ConfigFrequency=600
                StartPollers=120
                StartIPMIPollers=0
                StartPollersUnreachable=150
                StartTrappers=25
                StartPingers=25
                StartDiscoverers=0
                StartHTTPPollers=10
                StartVMwareCollectors=0
                CacheSize=500M
                HistoryCacheSize=256M
                HistoryIndexCacheSize=64M
                Timeout=6
                UnreachableDelay=30
                LogSlowQueries=3000
                If it is for SNMP monitoring, why StartPollersUnreachable=150? That does not make much sense, especially if the proxy will be used to monitor hosts over active agents.

                The Zabbix server is a VM: 8 cores, 36G RAM, boot disk + 4 disks for MySQL. Those are XFS and striped via LVM. MySQL 5.7.

                Selected parts of my.cnf:
                innodb_log_file_size=4G
                slow-query-log=on
                max_connections=1000
                innodb_buffer_pool_size=24G
                large-pages
                max_connections=1000 .. you need one connection per Zabbix server process, so StartPollers=100 + StartIPMIPollers=2 + StartPollersUnreachable=100 + StartTrappers=30 + StartPingers=50 + StartEscalators=10 + StartJavaPollers=20 is ~310. I don't think you would be able to handle the remaining ~700 web/API sessions anyway.
                A quite good estimate of how much innodb_buffer_pool_size memory should be used is the volume of data stored in the Zabbix history tables multiplied by at least 0.5. The exact ratio depends on how far back in time some triggers need to reach for data; if, for example, a significant number of triggers use forecast functions, this ratio may be bigger.
                It is quite easy to calculate the amount of RAM by looking at the history table partition sizes.
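                A sketch of that check, assuming the history table is partitioned and the schema is named zabbix (both assumptions; this is a standard information_schema query):

                -- size of each history partition, newest first
                SELECT partition_name,
                       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
                FROM   information_schema.partitions
                WHERE  table_schema = 'zabbix'
                  AND  table_name = 'history'
                ORDER  BY partition_ordinal_position DESC;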

                One very important thing for MySQL settings at large scale is missing here. It is:

                transaction-isolation=READ-COMMITTED

                Zabbix locks almost all tables for write operations on any host/template modification. With the setting above it will still be possible to read data from those locked tables.
                In your MySQL settings there are no details related to using a slave DB.
                In the case of Zabbix with slave DB instances, binlog_format=MIXED should be used.
                Other things: memory for the query results cache can be chopped to zero. The Zabbix server already caches so well in its internal caches that whatever is left is almost completely non-cacheable, which means almost all SELECT query results are unique. That is because all those queries move their own window of data along the timescale.
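                In my.cnf terms that would be (MySQL 5.7 options; the query cache was removed entirely in MySQL 8.0):

                query_cache_type=0   # disable the query cache
                query_cache_size=0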

                900 items shouldn't be a problem; even 900 hosts with ~100 items each at a reasonable collection frequency shouldn't be a problem. Make sure the Zabbix server template is applied to your server; it will give stats on Zabbix internal and data-collection processes. MySQL iowait is the kiss of death, and lots of slow-query errors is bad. Physical hardware with SSD is what Zabbix recommends, but that is not today's "best practice": our data center is full of VM hosts, with no room for every special need to have its own physical server. We had a bad experience with iSCSI VM disks, but FC-attached disk seems much better.

                We have a "lite" zabbix that basically just pings about 7000 hosts to see if they are alive. It averages 2 items per host, under 200 NVPS, I don't have to do much tuning for it to function. Proxies here only for network access, not Zabbix server offload.
                The only way to get low latency on read operations is to have enough memory to hold everything the SELECT queries may need.
                As well, all INSERT and UPDATE queries generate read operations before they start writing data, to find where the b-trees need to be updated.
                In other words, enough RAM is the key factor in writing new monitoring data at sufficient speed.

                On a really large scale, reading even from SSD is the kiss of death. The read latency of a typical SATA SSD is 120-150 µs or more. With NVMe it is possible to go below 100 µs, down to about 10 µs in the case of 3D XPoint flash.
                That is still nothing when you realise that the latency of accessing data already in RAM is on the order of 100 ns, and far less when it sits in the L1/L2/L3 CPU caches.
                A DB backend with enough memory should have at least a 1:20 ratio between read and write IOs at the storage layer .. yes, almost everything that the SELECT queries need should be served almost entirely out of data already cached in RAM. Personally, I try to keep this ratio at the 1:40 to 1:50 level at least.
                Last edited by kloczek; 18-04-2018, 20:36.
                http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                https://kloczek.wordpress.com/
                zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                My zabbix templates https://github.com/kloczek/zabbix-templates
