1800 NVPS, slow hardware (mostly disks) and database INSERT issues

  • ivanni
    Junior Member
    • Mar 2015
    • 15

    #1

    1800 NVPS, slow hardware (mostly disks) and database INSERT issues

    Hi!

    My setup:

    Everything (Zabbix server, frontend / apache2, PostgreSQL) is on 1 VM with the following parameters:

    Hardware:

    6-core Intel Xeon CPU E5-2670 v2 @ 2.50 GHz
    8 GB RAM
    50 GB SATA at 7500 RPM for the root disk of the system
    300 GB SAS at 10000 RPM for /var/lib/postgresql

    Zabbix server:

    NVPS: 1800
    Hosts: 210
    Items: 187 700
    (we do not have many hosts, but we have many items per host: one host has an average of 900 items)
    Triggers: 82 210
    Number of users: 23 (online at the same time are 5-6 users)

    Number of Zabbix proxies: 61 (our installations are mostly made up of 3 local virtual machines; one of them is a proxy with 512 MB RAM that collects the values from the other 2 machines and sends them to the Zabbix server).

    All items are either active items or trapper items (we use the Zabbix API and zabbix_sender for sending bulk values).

    Database:

    Size: 185 GB
    Table size (including indexes):

    public.history | 64 GB
    public.trends_uint | 44 GB
    public.history_uint | 31 GB
    public.trends | 21 GB
    public.trends_text | 14 GB
    public.history_text | 8491 MB



    I can say that even with this hardware everything has been working somehow fine; I have been running this Zabbix installation for 4 years on it. But in April 2017 I had around 1100 NVPS, and now I have 1700 NVPS.

    And the problem is that when there are VACUUM operations (they are restricted by I/O usage with cost delays and limits), DELETE operations (in Zabbix terms, housekeeping) or SELECT queries from the interface by users, the INSERT queries just wait on I/O and become very slow. On the Zabbix server side the effect is that there are many items in the queue, the queue grows over 10 minutes, and the Zabbix server falls behind.
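
    (To be clear about what I mean by the cost delays and limits: the vacuum throttling is controlled by settings like the ones below. The values are only an illustration of the idea, not my exact configuration, and ALTER SYSTEM needs PostgreSQL 9.4 or newer.)

    -- illustrative only: cost-based throttling of (auto)vacuum I/O
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '20ms';
    ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 200;
    SELECT pg_reload_conf();  -- both settings take effect on a reload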

    These are the Zabbix server parameters:

    StartPollers=1
    StartPollersUnreachable=1
    StartHTTPPollers=0
    StartTrappers=8
    StartPingers=1
    StartProxyPollers=0
    StartDBSyncers=8
    Timeout=30
    TrapperTimeout=30

    CacheSize=512M
    ValueCacheSize=1024M
    HistoryCacheSize=1024M
    HistoryTextCacheSize=512M
    TrendCacheSize=512M


    I have done many optimizations in the PostgreSQL database server (increasing shared_buffers and effective_cache_size; I have even disabled fsync and full_page_writes for some time just to observe the performance without them, and disabled synchronous_commit, etc.). I know PostgreSQL well and it is tuned well enough for this virtual machine and its capabilities. I do not have any dead rows or bloat; I monitor these parameters and vacuuming is running pretty well. But when there are operations like VACUUM, DELETE or SELECT queries from the interface, the INSERT queries are just slow. The I/O performance is on the edge: if there are more queries, the Zabbix server falls behind.
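
    (For context, the kind of tuning I mean looks roughly like this. The values are illustrative for an 8 GB machine, not my exact settings, and again ALTER SYSTEM is PostgreSQL 9.4+.)

    -- illustrative values for an 8 GB VM, not my exact settings
    ALTER SYSTEM SET shared_buffers = '2GB';              -- needs a restart to take effect
    ALTER SYSTEM SET effective_cache_size = '4GB';
    ALTER SYSTEM SET checkpoint_completion_target = 0.9;
    ALTER SYSTEM SET synchronous_commit = off;            -- trades a few seconds of durability for faster commits
    SELECT pg_reload_conf();                              -- applies everything except shared_buffers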

    But this is the virtual machine and hardware parameters I have at the moment. What software optimizations can I do more?

    I was thinking about partitioning / sharding, but I would need to partition at many small intervals, because history and history_uint have a total of 481 591 089 live rows, which means around 100 million rows per day... What do you think of that?
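
    (What I have in mind is roughly the sketch below: daily range partitions on the clock column using PostgreSQL 10 declarative partitioning. The column definitions are taken from a Zabbix 3.x schema and should be checked against the real history table; on older PostgreSQL versions it would have to be inheritance plus triggers instead.)

    -- sketch only: a partitioned history table with daily range partitions on clock
    CREATE TABLE history_part (
        itemid bigint NOT NULL,
        clock  integer DEFAULT 0 NOT NULL,
        value  numeric(16,4) DEFAULT '0.0000' NOT NULL,  -- check the type against your history table
        ns     integer DEFAULT 0 NOT NULL
    ) PARTITION BY RANGE (clock);

    -- one partition per day, bounded by epoch seconds (here 2018-04-01 UTC)
    CREATE TABLE history_p20180401 PARTITION OF history_part
        FOR VALUES FROM (1522540800) TO (1522627200);
    CREATE INDEX ON history_p20180401 (itemid, clock);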

    Also I am going to optimize the item intervals: right now 90 % of the items have an interval of 60 seconds, but I think it would be good to review them and set some of them to 120, 300, etc.
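
    (A quick way to see the current distribution, assuming direct SQL access to the Zabbix schema; note that on Zabbix 3.4+ the delay column is text and can contain suffixes like 1m or 5m:)

    -- how many enabled items use each update interval
    SELECT delay, COUNT(*) AS items
    FROM items
    WHERE status = 0        -- 0 = enabled
    GROUP BY delay
    ORDER BY items DESC;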

    I am open to ideas.

    If there are no ideas, tell me what I can increase in hardware. I think the bottleneck is I/O only (not CPU or RAM usage) and SSD disks would be fine! But for now I only have these 10 000 RPM SAS drives...

    Thank you!
  • Satei
    Junior Member
    • Aug 2014
    • 1

    #2
    Hello

    I can tell you from my practice that you can do nearly nothing on the software side.
    Your problem is really simple: the 10k drive is just not fast enough.
    I run my database VM on a 6x 15k RPM RAID10 (old 3.5" drives).
    That works until I hit around 2500 NVPS.
    After that I got the same problems as you.
    Your only software option would be to stop the housekeeping process and use partitioning.
    That was not my way; I threw a not-so-old server with a 10x 15k RPM RAID10 (2.5" drives) into the ring and it worked very well.
    Also, I changed the housekeeping process to run 2 times a day at X:15,
    because at X:00 the server gets more values and 15 minutes later there is more silence on the drives...

    Regards
    Bodo


    • ingus.vilnis
      Senior Member
      Zabbix Certified Trainer
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Mar 2014
      • 908

      #3
      Hi,

      Yes, from the available info we can see that disk performance is the first thing to look at. This is definitely not the first instance where we observe this behavior.

      Do "sar -d -wp 1" or "iostat -d -x -h 1" or whichever disk analysis tool you prefer. Check for the await latency and % util, Latency higher than 2ms and util close to 100% are the first indicators.

      From the Zabbix UI, check Monitoring -> Graphs and select the Zabbix server host. The first things to look at:
      • CPU utilization
      • Zabbix Cache usage free
      • Zabbix internal process busy
      In the CPU graph, look first at the iowait time; if it is high (like more than 10%), then the disks are slow.

      Depending on the Zabbix graphs (which I have not seen), I would suggest the following:
      • HistoryCacheSize=256M (max)
      • Other caches apart from the config cache should be adjusted so they stay steadily ~80-90% free. No need to have them at 100% constantly; that is a waste of RAM. The config cache, if no rapid growth is planned, can be at 50% without worries.
      • StartDBSyncers=4. Each DB syncer can process ~1000 NVPS, and you are only halfway to utilizing the default 4 to their limits. But I don't know your triggers; maybe you do rocket-trajectory calculations there, so more syncers would be reasonable. Otherwise they are just unnecessary load on the already "on the limits" DB and disk subsystem.
      The MBs of memory taken away from the Zabbix caches you can now give to PostgreSQL instead. Just make sure to fit in the 8 GB total you have (leaving some for the OS and other services).

      Partitioning is highly recommended if you still rely on the Zabbix housekeeper process: daily partitions for history and monthly for trends. It is absolutely normal even for 10K+ NVPS instances, and your 100 million daily rows are not a problem at all. Those 481 591 089 live rows you mentioned are there only because of the selects for graphs, I suppose.

      All-SSD storage is the way to go in the future, but sure, let's see what you can squeeze out of the current limits.

      And yes, increasing the item update intervals will also be a benefit. Think wisely about whether you really need all the data collected at very short intervals. It is OK if you do, but in case you don't, it is just a waste of resources, and increasing the intervals is the way to fix it.

      Hope this helps a bit.


      • ivanni
        Junior Member
        • Mar 2015
        • 15

        #4
        Satei, thank you for your answer.

        ingus.vilnis, thank you for your answer, too! The disk utilization (disk busy) is around 100 % when there are several queries running on the database, including INSERTs, SELECTs, DELETEs or VACUUM. I have tested with iostat, but I have not posted the results here.
        You are right about the caches! It is a waste of RAM, because the history_* caches are 90 % free, I checked them. This freed RAM I am going to give to the PostgreSQL caches and buffers.
        I have decreased the DB syncer processes to 4; there was really no need for 8, too many processes, as you mentioned.
        And now I am making a plan for partitioning / sharding, because deleting from these large tables is slow and it can easily lead to bloat.
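
        (For the record, monitoring the dead rows is nothing fancy, just the standard statistics view, roughly like this:)

        -- dead vs. live tuples and last (auto)vacuum per table, to watch for bloat after big DELETEs
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_vacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;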

        Thank you very much!


        • Chewbakka-Wakka
          Junior Member
          • Feb 2018
          • 25

          #5
          You will need much more than just 8 GB RAM. Get more RAM! Get the SSD(s)!

          Partitioning - yes, do it!
          After you get more RAM, increase the caches used by PostgreSQL.
