Hi!
My setup:
Everything (Zabbix server, frontend / apache2, PostgreSQL) is on 1 VM with the following parameters:
Hardware:
6-core Intel Xeon CPU E5-2670 v2 @ 2.50 GHz
8 GB RAM
50 GB SATA at 7500 RPM for the root disk of the system
300 GB SAS at 10000 RPM for /var/lib/postgresql
Zabbix server:
NVPS: 1800
Hosts: 210
Items: 187 700
(we do not have many hosts, but we have many items per host: around 900 items per host on average)
Triggers: 82 210
Number of users: 23 (online at the same time are 5-6 users)
Number of Zabbix proxies: 61 (our installations mostly consist of 3 local virtual machines; one of them is a proxy with 512 MB RAM that collects the values from the other two machines and sends them to the Zabbix server).
All items are either active items or trapper items (we use the Zabbix API and zabbix_sender to send values in bulk; a small example follows).
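For reference, our bulk sends look roughly like this (the host and key names are made up, the zabbix_sender flags are the real ones):

$ cat values.txt
host01 app.requests.count 1523
host01 app.latency.avg 0.042
$ zabbix_sender -z zabbix.example.com -i values.txt

Each line of the input file is "<host> <key> <value>", so one call pushes many values over a single connection.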
Database:
Size: 185 GB
Table sizes (including indexes):
public.history | 64 GB
public.trends_uint | 44 GB
public.history_uint | 31 GB
public.trends | 21 GB
public.trends_text | 14 GB
public.history_text | 8491 MB
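(For reference, I get these sizes with a standard catalog query along these lines:

SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r' AND (relname LIKE 'history%' OR relname LIKE 'trends%')
ORDER BY pg_total_relation_size(oid) DESC;
)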
I can say that even with this hardware everything works reasonably well.
I have been running this Zabbix installation on the same hardware for 4 years; in April 2017 I had around 1100 NVPS, and now I have around 1700 NVPS.
The problem comes when there are VACUUM operations (they are throttled with cost delays and limits to cap their I/O usage), DELETE operations (housekeeping in Zabbix terms), or SELECT queries from the frontend: the INSERT queries end up waiting on I/O and become very slow. On the Zabbix server side the effect is that many items pile up in the queue, including the over-10-minute queue, and the server falls behind.
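As a side note, this backlog is visible over time via the standard Zabbix internal items, which is how I watch it:

zabbix[queue]       number of monitored items delayed by at least 6 seconds (the default)
zabbix[queue,10m]   number of monitored items delayed by at least 10 minutes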
These are the Zabbix server parameters:
StartPollers=1
StartPollersUnreachable=1
StartHTTPPollers=0
StartTrappers=8
StartPingers=1
StartProxyPollers=0
StartDBSyncers=8
Timeout=30
TrapperTimeout=30
CacheSize=512M
ValueCacheSize=1024M
HistoryCacheSize=1024M
HistoryTextCacheSize=512M
TrendCacheSize=512M
I have done a lot of tuning on the PostgreSQL side (increasing shared_buffers and effective_cache_size; I even disabled fsync and full_page_writes for a while just to measure performance without them, disabled synchronous_commit, etc.). I know PostgreSQL well, and it is tuned as well as this virtual machine allows. I do not have dead rows or bloat; I monitor these metrics, and vacuuming runs smoothly. But whenever VACUUM, DELETE, or frontend SELECT queries run, the INSERT queries simply slow down. The I/O performance is on the edge: if more queries arrive, the Zabbix server falls behind.
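For context, the direction of that tuning looked roughly like the sketch below; the values are only illustrative for an 8 GB machine, not my exact configuration:

# postgresql.conf (illustrative values for ~8 GB RAM)
shared_buffers = 2GB                  # ~25 % of RAM
effective_cache_size = 6GB            # what the OS page cache can realistically hold
synchronous_commit = off              # risk losing the last commits, gain INSERT latency
checkpoint_completion_target = 0.9    # spread checkpoint writes over more time
# throttle (auto)vacuum so it competes less with INSERTs for I/O
vacuum_cost_delay = 10ms
autovacuum_vacuum_cost_delay = 20ms
autovacuum_vacuum_cost_limit = 200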
But these are the virtual machine and hardware parameters I have at the moment. What further software optimizations can I make?
I was thinking about partitioning/sharding, but I would need many small partition intervals, because history and history_uint together contain 481 591 089 live rows, which works out to around 100 million rows per day... What do you think of that?
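A rough sketch of what I mean, assuming PostgreSQL 10+ declarative partitioning (the table and partition names are made up; on older versions this would be trigger-based):

-- parent keeps the Zabbix history schema; children are one-day ranges on clock
CREATE TABLE history_part (
    itemid bigint        NOT NULL,
    clock  integer       NOT NULL DEFAULT 0,
    value  numeric(16,4) NOT NULL DEFAULT 0.0000,
    ns     integer       NOT NULL DEFAULT 0
) PARTITION BY RANGE (clock);

-- one partition per day (1527811200 = 2018-06-01 00:00 UTC)
CREATE TABLE history_p20180601 PARTITION OF history_part
    FOR VALUES FROM (1527811200) TO (1527897600);
CREATE INDEX ON history_p20180601 (itemid, clock);  -- mirrors Zabbix's history_1 index

-- housekeeping becomes a metadata operation instead of a huge DELETE
DROP TABLE history_p20180601;

The big win would be that dropping a whole day is instant and produces no dead rows, so the housekeeper DELETEs (and the vacuum work that follows them) disappear entirely.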
I am also going to optimize item intervals: right now 90 % of the items have a 60-second interval, but I think it would be good to review them and move some to 120, 300, etc.
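To find the candidates I would start with the items table (this assumes the Zabbix 3.x schema, where items.delay is an integer number of seconds):

-- distribution of update intervals across all enabled items
SELECT delay, count(*) AS items
FROM items
WHERE status = 0
GROUP BY delay
ORDER BY items DESC;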
I am open to ideas.
If there are no software ideas, tell me what I should upgrade in hardware. I think the bottleneck is I/O only (not CPU or RAM), and SSDs would do fine! But for now I only have these 10 000 RPM SAS drives...
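For what it's worth, this is how I see the saturation (sdb is a placeholder for the device behind /var/lib/postgresql):

$ iostat -x 5 sdb

When %util stays near 100 % and await keeps growing during VACUUM or housekeeping runs, the disk itself is the limit, and only faster storage or less I/O will help.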
Thank you!