How to discover and monitor more than 2M ports in a large environment?


    How to discover and monitor more than 2M ports in a large environment?

    Hello Zabbix experts,

    A requirement has come up, so I would like to ask the following:

    How can I discover and monitor more than 2,000,000 ports in a large infrastructure environment?

    Does Zabbix have this ability? Any idea how I can perform such port monitoring?

    Thank you very much in advance.

    BR.
    Costas

    #2
    Originally posted by tritsako View Post
    How can I discover and monitor more than 2,000,000 ports in a large infrastructure environment?
    First you need to know what you need to monitor.

    (I don't want to be rude, but please try to be a bit more realistic about what you are asking for. Only you know what you need to monitor, and it seems you are asking the public forum to do all the normal engineering work of collecting requirements before implementation questions can even be answered.
    Please read the Zabbix documentation first, then do some initial work/experiments, and come back when you run into concrete problems. If you have no time to do this, just hire someone who will do it for you.
    In the end: yes, Zabbix can handle tens of millions of metrics.
    PS. A "metric" is the basic monitoring unit, not a "port".)
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates



      #3
      How can I discover and monitor more than 2,000,000 ports in a large infrastructure environment?

      Hi kloczek,


      Thank you for your reply. I will check it again.

      BR.
      Costas.

      PS: I have been working in systems monitoring for over 8 years; I am not trying to get a ready-made solution to my problem.



        #4
        This may be late, but I just saw the thread :-)

        Our network group had a database of deployed switches predating Zabbix. It had location, IP, make & model. From that database, they used the API to create hosts for each device, assign groups (implying support hours) and a base template. (All are monitored via SNMP.)
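        A minimal sketch of that kind of automation, using the community pyzabbix library (not their exact script; the URL, credentials, IDs and inventory rows below are made-up placeholders, and newer Zabbix versions also want a "details" block on SNMP interfaces):

        from pyzabbix import ZabbixAPI

        zapi = ZabbixAPI("https://zabbix.example.com")  # hypothetical URL
        zapi.login("api-user", "api-password")          # placeholder credentials

        # One row per switch, as exported from the inventory database
        switches = [
            {"name": "sw-bldg1-01", "ip": "10.0.1.1", "groupid": "42", "templateid": "10123"},
        ]

        for sw in switches:
            zapi.host.create(
                host=sw["name"],
                interfaces=[{
                    "type": 2,    # 2 = SNMP interface
                    "main": 1,
                    "useip": 1,
                    "ip": sw["ip"],
                    "dns": "",
                    "port": "161",
                }],
                groups=[{"groupid": sw["groupid"]}],
                templates=[{"templateid": sw["templateid"]}],
            )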

        Then LLD discovers the variable data for each switch: ports, fans, anything that occurs N times. If a switch is life-cycled, they just delete it from Zabbix; the automation will build its replacement.
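        For reference (not their exact rule), a standard SNMP interface-discovery rule in Zabbix looks roughly like this:

        Discovery rule SNMP OID:   discovery[{#IFDESCR},1.3.6.1.2.1.2.2.1.2]
        Item prototype key:        net.if.in[{#IFDESCR}]
        Item prototype SNMP OID:   1.3.6.1.2.1.31.1.1.1.6.{#SNMPINDEX}   (ifHCInOctets)

        Each discovered port then gets one item per prototype, which is how a handful of prototypes per port multiply into millions of items.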

        They have about 6500 devices and discover 6 items per port; 4 proxies do the actual monitoring of about 1.2 million items. We also have servers in the same Zabbix, and those hosts take us up to 8250 devices and 5400 NVPS.

        What they did wrong: 6 items per port is too many. They created IT Services for everything, and it is so slow that the GUI is unusable for all IT Services. They alert on 10k events a day; I'm not sure on what, but that is considered "normal" noise, so those should either be squelched, or fixed if they really are alerts.



          #5
          Hi LenR,

          Thank you for your nice reply. Very nice example!

          BR.
          Costas



            #6
            Hi LenR,

            Could you give a little more detail about the zabbix_server.conf setup and the DB architecture you have behind a system of this size? We currently monitor about 300 items and will add another 600 in the next 4 months. We are running on all-VM infrastructure, split across a high-performance flash SAN with offloading to a slower-disk SAN, but Zabbix is having issues with poller and process resources despite tuning.

            Many thanks,
            T



              #7
              These are sanitized

              zabbix_server.conf
              StartPollers=100
              StartIPMIPollers=2
              StartPollersUnreachable=100
              StartTrappers=30
              StartPingers=50
              StartEscalators=10
              JavaGateway=zabbix-
              StartJavaPollers=20
              CacheSize=4G
              CacheUpdateFrequency=600
              HistoryCacheSize=1G
              HistoryIndexCacheSize=512M
              TrendCacheSize=512M
              ValueCacheSize=2G
              Timeout=4
              UnreachablePeriod=45
              UnavailableDelay=150
              LogSlowQueries=3000

              Proxy for Linux/Windows servers
              Server=ip addr
              Hostname=zabbix
              ConfigFrequency=600
              StartPollers=120
              StartIPMIPollers=0
              StartPollersUnreachable=75
              StartTrappers=25
              StartPingers=25
              StartDiscoverers=0
              StartHTTPPollers=10
              JavaGateway=127.0.0.1
              StartJavaPollers=5
              StartVMwareCollectors=0
              CacheSize=300M
              HistoryCacheSize=256M
              HistoryIndexCacheSize=64M
              Timeout=30
              UnreachableDelay=30
              LogSlowQueries=3000


              Busy proxy for network devices (SNMP v2 mostly)
              Server=
              Hostname=zabbix-
              ConfigFrequency=600
              StartPollers=120
              StartIPMIPollers=0
              StartPollersUnreachable=150
              StartTrappers=25
              StartPingers=25
              StartDiscoverers=0
              StartHTTPPollers=10
              StartVMwareCollectors=0
              CacheSize=500M
              HistoryCacheSize=256M
              HistoryIndexCacheSize=64M
              Timeout=6
              UnreachableDelay=30
              LogSlowQueries=3000

              The Zabbix server is a VM: 8 cores, 36G RAM, boot disk + 4 disks for MySQL. Those are XFS and striped via LVM. MySQL 5.7.
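              For illustration, striping a logical volume across 4 disks with LVM looks something like this (volume-group name, stripe size and LV size are made up, not our exact values):

              # stripe a logical volume across 4 PVs with a 64 KiB stripe size
              lvcreate --stripes 4 --stripesize 64k --size 500G --name mysql vg_data
              mkfs.xfs /dev/vg_data/mysql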

              Selected parts of my.cnf:
              innodb_log_file_size=4G
              slow-query-log=on
              max_connections=1000
              innodb_buffer_pool_size=24G
              large-pages

              900 items shouldn't be a problem; even 900 hosts with ~100 items each at a reasonable collection frequency shouldn't be a problem. Make sure the Zabbix server template is applied to your server; it will give stats on Zabbix internal and data-collection processes. MySQL iowait is the kiss of death, and lots of slow-query errors is bad. Physical hardware with SSD is what Zabbix recommends, but that is not today's "best practice": our data center is full of VM hosts, with no room for every special need to have its own physical server. We had a bad experience with iSCSI VM disks, but FC-attached disk seems much better.
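              If it helps, these are the kinds of internal item keys that template relies on (standard Zabbix internal checks):

              zabbix[process,poller,avg,busy]   # % of time the pollers are busy
              zabbix[wcache,history,pfree]      # % free in the history write cache
              zabbix[queue,10m]                 # items delayed more than 10 minutes
              zabbix[rcache,buffer,pfree]       # % free in the configuration cache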

              We have a "lite" zabbix that basically just pings about 7000 hosts to see if they are alive. It averages 2 items per host, under 200 NVPS, I don't have to do much tuning for it to function. Proxies here only for network access, not Zabbix server offload.



                #8
                Originally posted by LenR View Post
                These are sanitized

                zabbix_server.conf
                StartPollers=100
                StartIPMIPollers=2
                StartPollersUnreachable=100
                StartTrappers=30
                StartPingers=50
                StartEscalators=10
                JavaGateway=zabbix-
                StartJavaPollers=20
                CacheSize=4G
                CacheUpdateFrequency=600
                HistoryCacheSize=1G
                HistoryIndexCacheSize=512M
                TrendCacheSize=512M
                ValueCacheSize=2G
                Timeout=4
                UnreachablePeriod=45
                UnavailableDelay=150
                LogSlowQueries=3000
                On the scale of 2M+ metrics, monitoring anything other than the Zabbix server's own internal metrics directly from the server is simply wrong.
                The number of pollers should be greater than the absolute minimum only if you are using passive proxies.
                Usually the ratio between pollers and proxies should be about 1:2. With active proxies and trappers this ratio can be even bigger.
                So StartPollers=100 may be OK for 200+ proxies.
                I'm almost sure the memory parameters are too low for 2M+ metrics.
                If none of the monitoring is done by the server itself, StartIPMIPollers=2, StartPollersUnreachable=100, StartPingers=50 and StartJavaPollers=20 do not make any sense either.
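                For illustration only (these numbers are assumptions, not a recommendation), a server that delegates all collection to active proxies could be trimmed to something like:

                StartPollers=10             # little left for the server to poll itself
                StartPollersUnreachable=1
                StartTrappers=50            # active proxies deliver data via trappers
                StartPingers=1
                StartIPMIPollers=0
                StartJavaPollers=0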

                Proxy for linux/windows servers
                Server=ip addr
                Hostname=zabbix
                ConfigFrequency=600
                StartPollers=120
                StartIPMIPollers=0
                StartPollersUnreachable=75
                StartTrappers=25
                StartPingers=25
                StartDiscoverers=0
                StartHTTPPollers=10
                JavaGateway=127.0.0.1
                StartJavaPollers=5
                StartVMwareCollectors=0
                CacheSize=300M
                HistoryCacheSize=256M
                HistoryIndexCacheSize=64M
                Timeout=30
                UnreachableDelay=30
                LogSlowQueries=3000
                The exact numbers for StartIPMIPollers, StartHTTPPollers, StartVMwareCollectors and StartJavaPollers should depend on what the particular proxy monitors.
                StartPollers should be related to the number of hosts monitored over IPMI and SNMP and the number of passive agents.
                StartTrappers should be correlated with the number of active agents connecting to the particular proxy.
                Logging slow queries makes little sense here: if someone has to dig through those logs, it means either the storage IO layer was never sized properly, or nobody is using those stats to evaluate whether the hardware behind the DB backend is strong enough (see below for why).

                Busy proxy for network devices (SNMP v2 mostly)
                Server=
                Hostname=zabbix-
                ConfigFrequency=600
                StartPollers=120
                StartIPMIPollers=0
                StartPollersUnreachable=150
                StartTrappers=25
                StartPingers=25
                StartDiscoverers=0
                StartHTTPPollers=10
                StartVMwareCollectors=0
                CacheSize=500M
                HistoryCacheSize=256M
                HistoryIndexCacheSize=64M
                Timeout=6
                UnreachableDelay=30
                LogSlowQueries=3000
                If it is for SNMP monitoring, why StartPollersUnreachable=150? That does not make much sense, especially if the proxy will be used to monitor hosts over active agents.

                The Zabbix server is a VM: 8 cores, 36G RAM, boot disk + 4 disks for MySQL. Those are XFS and striped via LVM. MySQL 5.7.

                Selected parts of my.cnf:
                innodb_log_file_size=4G
                slow-query-log=on
                max_connections=1000
                innodb_buffer_pool_size=24G
                large-pages
                max_connections=1000 .. you need one connection per Zabbix server process, so StartPollers=100 + StartIPMIPollers=2 + StartPollersUnreachable=100 + StartTrappers=30 + StartPingers=50 + StartEscalators=10 + StartJavaPollers=20 is ~310. I don't think you would be able to handle the remaining ~700 web/API sessions anyway.
                A quite good estimate of how much innodb_buffer_pool_size memory should be used is the volume of data stored in the Zabbix history tables multiplied by at least 0.5. The exact ratio depends on how far back in time some triggers need to reach for data; if, for example, a significant number of triggers use forecast functions, this ratio may be bigger.
                It is quite easy to calculate the amount of RAM by looking at the history table partition sizes.
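                A sketch of that check, assuming the history table is partitioned and the schema is named zabbix (both assumptions; this is a standard information_schema query):

                -- size of each history partition, newest first
                SELECT partition_name,
                       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
                FROM   information_schema.partitions
                WHERE  table_schema = 'zabbix'
                  AND  table_name = 'history'
                ORDER  BY partition_ordinal_position DESC;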

                One very important thing for MySQL settings at large scale is missing here. It is:

                transaction-isolation=READ-COMMITTED

                Zabbix locks almost all tables for write operations on any host/template modification. With the setting above it will still be possible to read data from those locked tables.
                In your MySQL settings there are no details related to using a slave DB.
                In the case of Zabbix with slave DB instances, binlog_format=MIXED should be used.
                Other things: memory for the query results cache can be chopped to zero. The Zabbix server already caches so well in its internal caches that whatever is left is almost completely non-cacheable, which means almost all SELECT query results are unique. That is because all those queries move their own window of data along the timescale.
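                In my.cnf terms that would be (MySQL 5.7 options; the query cache was removed entirely in MySQL 8.0):

                query_cache_type=0   # disable the query cache
                query_cache_size=0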

                900 items shouldn't be a problem; even 900 hosts with ~100 items each at a reasonable collection frequency shouldn't be a problem. Make sure the Zabbix server template is applied to your server; it will give stats on Zabbix internal and data-collection processes. MySQL iowait is the kiss of death, and lots of slow-query errors is bad. Physical hardware with SSD is what Zabbix recommends, but that is not today's "best practice": our data center is full of VM hosts, with no room for every special need to have its own physical server. We had a bad experience with iSCSI VM disks, but FC-attached disk seems much better.

                We have a "lite" zabbix that basically just pings about 7000 hosts to see if they are alive. It averages 2 items per host, under 200 NVPS, I don't have to do much tuning for it to function. Proxies here only for network access, not Zabbix server offload.
                The only way to get low latency on read operations is to have enough memory to hold everything the SELECT queries may need.
                As well, all INSERT and UPDATE queries generate read operations before they start writing data, to find where the b-trees need to be updated.
                In other words, enough RAM is the key factor in writing new monitoring data at sufficient speed.

                On a really large scale, reading even from SSD is the kiss of death. The read latency of a typical SATA SSD is 120-150 µs or more. With NVMe it is possible to go below 100 µs, down to about 10 µs in the case of 3D XPoint flash.
                That is still nothing when you realise that the latency of accessing data already in RAM is on the order of 100 ns, and far less when it sits in the L1/L2/L3 CPU caches.
                A DB backend with enough memory should have at least a 1:20 ratio between read and write IOs at the storage layer .. yes, almost everything that the SELECT queries need should be served almost entirely out of data already cached in RAM. Personally, I try to keep this ratio at the 1:40 to 1:50 level at least.
                Last edited by kloczek; 18-04-2018, 20:36.
                http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                https://kloczek.wordpress.com/
                zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                My zabbix templates https://github.com/kloczek/zabbix-templates
