Experiencing Challenges with Zabbix Scaling in Large Environments

  • nexusiceland
    Junior Member
    • Nov 2023
    • 2

    #1

    I've been encountering some hurdles while managing a large-scale Zabbix environment with thousands of devices and significant data throughput. The specific issues revolve around scaling, performance, high availability (HA), and maintenance. Despite going through the documentation, I'm seeking insights and practical advice from the experienced members of this community.
    Here are some specific details on the challenges I'm facing:
    • Scaling: As the number of devices increases, I'm noticing potential scalability issues. What are the best practices for scaling Zabbix in such environments?
    • Performance: With a high volume of data (1K+ values per second), I'm observing performance bottlenecks. Any tips on optimizing performance for large-scale deployments?
    • High availability: Ensuring uninterrupted monitoring is crucial. What strategies do you recommend for achieving high availability in Zabbix?
    • Maintenance: Routine maintenance tasks become more complex in large environments. How do you manage maintenance activities efficiently without impacting monitoring?
    I'm eager to hear from those who have successfully managed Zabbix in similar large-scale setups. Your experiences, suggestions, and best practices would be incredibly valuable.
  • LenR
    Senior Member
    • Sep 2009
    • 1005

    #2
    IIRC, we were about 2K nvps on a VM without issues. Random things: (from memory, I retired this year)
    • NFS storage wouldn’t scale, fiber disk was OK. Many recommend SSD, we didn’t need it.
    • Tune MySQL innodb buffers, give it ram, use hugepages. I think we gave it 8 to 12G.
    • Use DB partitioning; housekeeping of history doesn’t scale. We even broke history down to 6-hour tables. In a past MySQL release, performance got slower as the partitions aged.
    • Gather data with proxies. Use agent active items where possible.
    • We ran MySQL and Zabbix server on one host, web on another, 4 big proxies and a few others for network access.
    • Tune Zabbix buffers using the server template.
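
    A rough sketch of the 6-hour partitioning idea from the list above, not the actual setup described there. It only generates the MySQL statements; it assumes the table is already partitioned by RANGE on the integer `clock` column, and the partition naming scheme (`pYYYYMMDDHH`) and the helper name `partition_statements` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

PARTITION_HOURS = 6  # the post mentions 6-hour history tables


def partition_statements(table, start, count, hours=PARTITION_HOURS):
    """Build ALTER TABLE statements for `count` consecutive partitions.

    `start` should be a timezone-aware datetime; it is aligned down to the
    nearest window boundary in UTC. `hours` is assumed to divide 24 evenly.
    """
    start = start.astimezone(timezone.utc)
    aligned = start.replace(
        hour=start.hour - start.hour % hours, minute=0, second=0, microsecond=0
    )
    statements = []
    for i in range(count):
        begin = aligned + timedelta(hours=hours * i)
        end = begin + timedelta(hours=hours)
        name = begin.strftime("p%Y%m%d%H")  # hypothetical naming scheme
        # Each partition holds rows with clock < end of its window.
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(PARTITION {name} VALUES LESS THAN ({int(end.timestamp())}));"
        )
    return statements


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for stmt in partition_statements("history", now, 4):
        print(stmt)
```

    Old data then goes away by dropping whole partitions instead of letting the housekeeper delete rows, which is the point of the bullet above.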

    • Jason
      Senior Member
      • Nov 2007
      • 430

      #3
      We moved to Postgres a few years back and it seems much better. We use Timescale with partitioning, so we just drop old partitions to remove old data for history and trends. If you install Timescale from the Timescale repo you can use compression on the database. That really helps. (Postgres needs pgbouncer and pgpool to handle all the connections in an optimal manner.)
      Use lots of proxies and try not to have too many items on a single proxy. Where possible have all data coming into the proxies not directly to zabbix. This way they can do the pre-processing and filtering etc to cut down on the load on zabbix.
      Use discard data with heartbeat to minimise the amount of data coming in to zabbix from the proxies. This is very effective with items that don't change regularly and also very good for items such as switches where the ports are down etc.
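
      To illustrate why discard with heartbeat cuts the volume so much, here is a small simulation of the idea. It is only a sketch of the logic, not Zabbix's implementation: the class name and the exact boundary handling (a value is kept once the heartbeat period has fully elapsed) are assumptions.

```python
class DiscardUnchangedWithHeartbeat:
    """Keep a value only if it changed or the heartbeat period has elapsed."""

    def __init__(self, heartbeat_seconds):
        self.heartbeat = heartbeat_seconds
        self.last_value = None
        self.last_kept_at = None  # timestamp of the last value that was kept

    def accept(self, value, now):
        """Return True if the value should be passed on, False if discarded."""
        first = self.last_kept_at is None
        changed = first or value != self.last_value
        expired = not first and now - self.last_kept_at >= self.heartbeat
        if changed or expired:
            self.last_value = value
            self.last_kept_at = now
            return True
        return False


if __name__ == "__main__":
    # A switch port polled every 60 s that stays down, then comes up.
    step = DiscardUnchangedWithHeartbeat(heartbeat_seconds=300)
    samples = [(0, "down"), (60, "down"), (120, "down"),
               (300, "down"), (360, "up"), (420, "up")]
    kept = [(t, v) for t, v in samples if step.accept(v, t)]
    print(f"kept {len(kept)} of {len(samples)} samples: {kept}")
```

      For a port that stays down all day, only one value per heartbeat period gets through instead of one per poll.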
      Make sure you have separate servers for the database, Zabbix server and web front end.
      Possibly consider multiple Zabbix servers. There's an option coming in version 7 to combine multiple Zabbix servers into a single dashboard.
      Monitor your proxy performance regularly and adjust if necessary to even the load out.
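
      One simple way to reason about evening out proxy load is to estimate new values per second (NVPS) per proxy from item counts and update intervals. The sketch below does that arithmetic; the proxy names and item numbers are invented, and real load also depends on item types and preprocessing, which this ignores.

```python
def estimated_nvps(items):
    """Estimate NVPS from (item_count, update_interval_seconds) pairs."""
    return sum(count / interval for count, interval in items)


def load_share(proxies):
    """Return {proxy: (nvps, share_of_total)} for a dict of item groups."""
    per_proxy = {name: estimated_nvps(items) for name, items in proxies.items()}
    total = sum(per_proxy.values())
    return {
        name: (nvps, nvps / total if total else 0.0)
        for name, nvps in per_proxy.items()
    }


if __name__ == "__main__":
    # Hypothetical proxies: lists of (number of items, interval in seconds).
    proxies = {
        "proxy-a": [(6000, 60), (1200, 30)],
        "proxy-b": [(3000, 60)],
    }
    for name, (nvps, share) in load_share(proxies).items():
        print(f"{name}: {nvps:.1f} NVPS ({share:.0%} of total)")
```

      If one proxy carries most of the total, moving some hosts to a quieter proxy is the obvious adjustment.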

      • cyber
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Dec 2006
        • 4806

        #4
        Suggestion to the topic starter: describe what you already have, what HW setup, versions, etc. It would be easier to give suggestions if there are maybe obvious bottlenecks visible.
        Currently it does not look like a large setup.. :P

        • nexusiceland
          Junior Member
          • Nov 2023
          • 2

          #5
          Firstly, I want to express my gratitude for your prompt and insightful responses. Your experiences with large-scale Zabbix setups are incredibly valuable, and I appreciate the detailed suggestions. The transition to PostgreSQL with Timescale and the use of compression for database management are intriguing. The recommendation to distribute data through proxies for pre-processing and filtering makes a lot of sense. I'll certainly look into implementing discard data with heartbeat as well.
          Your suggestion to provide more details about the existing setup is duly noted. I'll share more information about the hardware, versions, and configurations to facilitate more targeted suggestions.

          I'm curious to know if any of you have encountered specific challenges or successes related to high availability strategies with Zabbix in large environments. Any insights on that front would be greatly appreciated.
