Huge Performance Issues - System.run - Powershell Scripts

  • zabbattical
    Junior Member
    • Jul 2022
    • 6

    #1

    Hi there

    I have performance problems on a host that uses many items with system.run keys that call PowerShell.

    Full story:
    I inherited the Zabbix server in our company.
    I saw that around 100 items are implemented that use system.run to run PowerShell scripts (system.run[powershell.exe -command D:\Zabbix-Monitoring\Scripts\Do_something.ps1]). I tried both passive and active agent items to see if it makes a difference, which it does not.

    In the latest data of these items I saw that the data quality isn't good. Almost none of these items deliver data at the configured item interval.
    Then I started creating nodata triggers so I get informed which items have problems with their current interval.

    Now I know that almost none of these items work as expected.
    We have some items with a configured 1-minute interval, but these only deliver data about every 4-5 minutes.

    [Attached image: image.png]

    We also have items with a 5-minute interval that only deliver data about every 15 minutes.
    I think this all stacks up: one item slows down the next one, the delays accumulate, and sometimes items get killed because they take too long, so you get no value at all instead of a delayed one.

    So can someone tell me what is wrong and how to work on this?

    What kind of limitation is there for system.run with PowerShell scripts?
    From the host it does not look like a performance issue. The CPU is mostly idling, and there are only a handful of PowerShell processes running at most. Why doesn't it start 50 PowerShell processes when 50 items are set to run every minute?
    Does Zabbix queue these up somehow and then work through them sequentially?
    What is best practice for using custom PowerShell scripts?
    What is too much for one host?

    We have to do a lot of custom checks for all kinds of application monitoring, so we have to get around these limitations somehow, or distribute the load to other servers or something.
    But for that I need to know where the limits are, so we can monitor this as well and don't run into trouble in the future.

    Unfortunately we are currently still using Zabbix 4.0.14,
    but we are in the process of migrating to Zabbix 6 (maybe this can solve something?).

    Hopefully someone can shed some light on this topic.

    BR
  • PeterZielony
    Senior Member
    • Nov 2022
    • 146

    #2
    I'd say this is bad design: you don't want the agent to run 50 scripts at the same time, even if they are very tiny.

    I'm working with tons of custom scripts, but I always keep in mind to run as little as possible on the boxes.

    There is a limit of 30 sec for each script. If you run 50 at the same time they will easily go beyond 30 sec, as each needs to initialize and then run.

    What are those scripts doing, if you don't mind me asking? Some operation, or reading something? If reading, then what exactly? I might be able to help, but I would need the whole picture.

    And yeah, a 4.x version probably doesn't help here either.
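A rough back-of-the-envelope sketch of that effect, in Python. All numbers are assumptions chosen for illustration: 2 seconds of interpreter startup, 1 second of actual work per script, and 3 scripts running in parallel. It also assumes the 30-second clock starts when the whole batch is requested, which is a simplification.

```python
def finish_times(n_scripts, startup_s, run_s, workers):
    """Completion time of each script when only `workers` run in parallel."""
    per_script = startup_s + run_s
    # Script i waits for i // workers earlier "rounds" before its own round ends.
    return [(i // workers + 1) * per_script for i in range(n_scripts)]


# Assumed figures, not measurements from the host in question.
times = finish_times(n_scripts=50, startup_s=2, run_s=1, workers=3)
late = [t for t in times if t > 30]
print(f"last script finishes after {max(times)} s, {len(late)} of 50 exceed 30 s")
```

Even with scripts that do only one second of real work, the startup cost multiplied by the number of rounds pushes a large part of the batch past the limit.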

    • Markku
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
      • Sep 2018
      • 1781

      #3
      As a side story, I once created a Python app that used PowerShell commands to interface with PowerShell-only APIs. But very quickly I had to abandon the design, because starting the PowerShell interpreter separately for each API call was so slow (on the order of seconds) that the app was unusable and requests couldn't be served. I had to redesign the whole app to run as a native long-running PowerShell script that read the requests in a loop from an external queue, instead of starting the script separately for each request.
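To get a feeling for that startup cost, here is a small Python sketch that compares starting a fresh interpreter for every check with running the same checks inside one process that is already up. It uses the Python interpreter as a stand-in for powershell.exe, so the absolute numbers will differ from PowerShell; `check` is a hypothetical placeholder for a real check.

```python
import subprocess
import sys
import time


def check():
    """Stand-in for one monitoring check; returns a fixed value."""
    return 42


def time_spawn_per_check(n):
    """Start a fresh interpreter for every check, like one system.run call each."""
    begin = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "print(42)"],
                       capture_output=True, check=True)
    return time.perf_counter() - begin


def time_long_running(n):
    """Run the same number of checks inside one already-started process."""
    begin = time.perf_counter()
    for _ in range(n):
        check()
    return time.perf_counter() - begin
```

Calling `time_spawn_per_check(20)` and `time_long_running(20)` on the monitored host gives a rough idea of how much of each interval is spent on process startup alone.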

      cmd.exe probably starts much faster, if that is usable in your case. Also, maybe some WMI items could be used instead of PowerShell scripts?

      Finally, another way to feed data to Zabbix is to send it to Zabbix trapper items: have the PowerShell script running all the time (started from Task Scheduler or similar), and make it loop and send the relevant metrics every X seconds with zabbix_sender. It is then almost like an active agent, just with the frequency controlled within the script instead of by the Zabbix item interval configuration.
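A minimal sketch of such a loop, written in Python rather than PowerShell. It assumes zabbix_sender is on the PATH and that matching trapper items already exist on the server; the host name `web01`, the key `app.queue.depth`, and the server name in the usage comment are hypothetical. The input format (`<host> <key> <value>` per line, read from stdin with `-i -`) follows the zabbix_sender man page.

```python
import subprocess
import time


def format_lines(host, metrics):
    """Render {key: value} as zabbix_sender input lines: <host> <key> <value>.

    Host names or values containing spaces would need quoting; not handled here.
    """
    return "".join(f"{host} {key} {value}\n" for key, value in metrics.items())


def send_with_zabbix_sender(server, payload):
    """Pipe the payload to zabbix_sender; '-i -' reads the values from stdin."""
    subprocess.run(["zabbix_sender", "-z", server, "-i", "-"],
                   input=payload, text=True, check=True, timeout=30)


def run_collector(collect, send, interval_s, iterations=None, sleep=time.sleep):
    """Call collect() and send() every interval_s seconds; None means forever."""
    done = 0
    while iterations is None or done < iterations:
        send(collect())
        sleep(interval_s)
        done += 1


# Usage (hypothetical names):
# run_collector(lambda: format_lines("web01", {"app.queue.depth": 7}),
#               lambda payload: send_with_zabbix_sender("zabbix.example.com", payload),
#               interval_s=60)
```

The interpreter is started once, and all values collected in one pass go out in a single zabbix_sender call.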

      Markku


      • tim.mooney
        tim.mooney commented
        +1 to everything Markku said.

        We use a few PowerShell scripts in our environment for things where there's no easy way to monitor using WMI or other methods. Starting a copy of PowerShell is expensive, and getting even carefully optimized scripts to run within the timeout can be difficult. We slightly increased the (passive) agent timeout (remember to do it on both the server *and* the client) to give our scripts a few more seconds to run, but even that isn't enough in some cases.

        Note that some Windows commands that might execute from PowerShell can hang indefinitely, so it can also be useful to use one of the PowerShell idioms to force a command to time out within a certain period, so your scripts don't hang and start piling up every time Zabbix tries to run them.
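The same idea, sketched in Python instead of a PowerShell idiom: wrap the external command so it is killed after a fixed number of seconds. `run_with_timeout` is a hypothetical helper name; the timeout values in the usage note are arbitrary.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its stripped stdout, or None if it did not finish in time."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return None
    return result.stdout.strip()
```

For example, `run_with_timeout([sys.executable, "-c", "print('ok')"], 10)` returns the command output, while a command that sleeps longer than the limit yields `None`, which the caller can report as a missing value instead of letting processes pile up. Note that only the direct child is killed; processes it started itself may survive.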

        As Markku suggested, keeping PowerShell running as a kind of pseudo-service and only submitting data back to Zabbix periodically (or even just writing its results to a well-known file location and having the Zabbix agent just check the file contents) will perform much better.
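A sketch of the file-based variant in Python. The long-running collector writes each result to a temporary file and then renames it over the well-known path, so a reader never sees a half-written value. The path in the usage note is hypothetical; on the Zabbix side an item such as `vfs.file.contents[<path>]` could then read the file.

```python
import os
import tempfile


def write_result_atomically(path, value):
    """Write value to path via a temp file in the same directory plus a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(f"{value}\n")
        # Replaces the old file in one step; on Windows this can fail while
        # another process has the target open, so a retry may be needed.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Usage, with a hypothetical path: `write_result_atomically(r"D:\Zabbix-Monitoring\Results\msmq_status.txt", 1)`. Reading a small file is cheap for the agent compared to starting an interpreter per check.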

        As far as going to Zabbix 6.x, I don't remember which version allows you to specify a longer timeout (Zabbix 7.x increases the max timeout dramatically), but newer versions of the agent or agent2 might have support for new items, so it's possible you could reduce your need for some of your powershell scripts.

        If you're able to program in a compiled language (Agent 2 plugins are written in Go; for the traditional agent it would be C or C++ or something that can link to C/C++), it's also possible to write your own loadable modules to create custom items. That can be a way (if you have the necessary technical expertise) to access native APIs at native speeds, without the cost of starting up a heavyweight interpreter shell every minute or 5 minutes or whatever.
    • zabbattical
      Junior Member
      • Jul 2022
      • 6

      #4
      PeterZielony
      - checking MSMQ on servers to see if replication is running fine
      - checking header information in files to see if they were built and transmitted correctly
      - checking SMB directories to see if files get processed
      - failover cluster monitoring
      - lots and lots of SQL query checks (we run these in PowerShell because the agent runs under a service account that has access, so we don't need to keep the user credentials in Zabbix)
      - checking backup data for consistency
      - Kafka Prometheus monitoring (response >64 KB, too large for a Zabbix web request)
      - and lots of other stuff I'm currently not thinking of

      The 30-second limit is a per-script limit, I think.
      But since the PowerShell queue as a whole can only run a defined number of parallel processes, the runtimes of the individual scripts add up. That's why my 1-minute items get delayed: the items that run every 5 minutes take so long in total that the 1-minute items are already queued again and won't run until the 5-minute items are worked through.

      So, for example, Zabbix sends all 1-minute items into the queue, and they only take 30 seconds in total to work through.
      Then after 5 minutes all the 5-minute items get shoved into the PowerShell queue for Zabbix to work through.
      But since that is such a huge number, the queue takes ~4 minutes to work through all the 5-minute items.
      That would be fine for the 5-minute items.
      But in the meantime the 1-minute items are already queued again.
      When they finally run, they are already 4 minutes delayed.

      This would also explain the behaviour that I'm seeing currently, but I really don't know how the queuing works; that's why I'm here.
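The scenario described above can be played through with a toy simulation in Python. It assumes a single worker that runs due items strictly one after another and schedules each item's next run at the next multiple of its interval after it finishes. That is a simplified model for illustration, not a description of the agent's real scheduler. Each batch is treated as one job, using the figures from the post: 30 seconds for all 1-minute items, 240 seconds for all 5-minute items.

```python
import heapq


def simulate(items, horizon):
    """items: list of (name, interval_s, duration_s). One worker, strictly serial.

    Returns {name: [start times in seconds]} for all runs starting before horizon.
    """
    heap = [(0, idx) for idx in range(len(items))]  # (next check time, item index)
    heapq.heapify(heap)
    starts = {name: [] for name, _, _ in items}
    now = 0
    while heap:
        nextcheck, idx = heapq.heappop(heap)
        name, interval, duration = items[idx]
        start = max(now, nextcheck)  # wait for the worker or for the due time
        if start >= horizon:
            break
        starts[name].append(start)
        now = start + duration
        # Next run is due at the first interval boundary after finishing.
        heapq.heappush(heap, ((now // interval + 1) * interval, idx))
    return starts


result = simulate([("5min-batch", 300, 240), ("1min-batch", 60, 30)], horizon=900)
print(result["1min-batch"])  # [240, 540, 840]
```

In this model the 1-minute batch only runs every 300 seconds, which matches the observed "data every 4-5 minutes". With these figures the single worker is asked for 30/60 + 240/300 = 1.3 times the time it has, so no ordering of the jobs can keep up.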

      Markku
      Zabbix sender was also my plan, but somewhere in the manual I read that you shouldn't use Zabbix sender if there is the possibility to run it through an agent.

      • cyber
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional
        • Dec 2006
        • 4807

        #5
        Originally posted by zabbattical
        Zabbix sender was also my plan, but somewhere in the manual I read that you shouldn't use Zabbix sender if there is the possibility to run it through an agent.
        That's total BS, pardon my French... It is clearly seen here that your chosen way does not work very well (and it really has nothing to do with Zabbix itself, but with the way Windows starts things and manages them)... You should try a better approach, and sender is more efficient here.
        I dare you to find that quote in the manual again... I'd really like to read it.

        • zabbattical
          zabbattical commented
          NVM, it was a Stack Overflow entry xD

          "The preferred way is to use a Zabbix Agent if it's possible."
          https://stackoverflow.com/questions/...s-zabbix-agent

          The manual itself says:
          "The utility is usually used in long running user scripts for periodical sending of availability and performance data."
      • PeterZielony
        Senior Member
        • Nov 2022
        • 146

        #6

        Hm... this is a lot and will require a redesign, that is for sure. I don't have much experience with MSMQ, but since this is Microsoft I'm sure it has some way of accessing the data other than PowerShell. I have to admit, I would love to be in your position to investigate everything and write up solutions for everything. Without being there, it would be hard for me to pinpoint a solution in a single message, since a lot is going on on your servers.

        - MSMQ is an MS product that surely exposes WMI metrics that can be collected natively via the agent (https://wutils.com/wmi/root/cimv2/default.html).
        - Headers? Do you mean from a file, or from some API call in flight?
        - "checking smb directories if files get processed": where do you get the file info that is supposed to expose this (how often does the file name change, or are there multiple files, where do you get it from, what are the paths, etc.)?
        - Failover cluster: this is one that can potentially be observed via PowerShell.
        - SQL query checks: again, PowerShell only if you cannot use ODBC. What info do you get from SQL to confirm failed/success, and based on what query? Is this the MSMQ DB?
        - Backups: meaning which backups? SQL?
        - etc.
        There are a lot of questions that need to be asked, and without access to the scripts and the environment it is nearly impossible to suggest a one-size-fits-all resolution.

        The best approach for you is to investigate every single script and document each task separately, with its objectives:
        - what is the purpose of this check
        - where can you get the info from (SNMP, PowerShell, WMI, SQL, log files, etc.); explore every option that is possible
        - how often does it need checking
        - is there a way to get a list of things for specific information, as long as the data is somewhat static, to create discovery rules and items, etc.

        Then you need to group things per service, for example all tasks related to MSMQ.
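One way to keep such an inventory is as structured data rather than free text, so it can be grouped and filtered. The following Python sketch is only an illustration; the field names and the sample entries in the usage note are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    service: str      # e.g. "MSMQ", "SQL", "Backup"
    purpose: str
    source: str       # e.g. "wmi", "odbc", "powershell", "log"
    interval_s: int


def group_by_service(checks):
    """Return {service: [Check, ...]} so related tasks can be reviewed together."""
    groups = defaultdict(list)
    for check in checks:
        groups[check.service].append(check)
    return dict(groups)


def still_need_powershell(checks):
    """Names of checks that have no alternative data source documented yet."""
    return [check.name for check in checks if check.source == "powershell"]
```

For example, with `Check("queue length", "MSMQ", "replication health", "wmi", 60)` and `Check("cluster state", "Cluster", "failover status", "powershell", 300)` in the list, `still_need_powershell` returns only `["cluster state"]`, which is the remaining script load on the host.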

        Test the individual metrics you need separately, but avoid running scripts on the host at all costs. If you cannot, then you will have to write a PowerShell service that collects the data and either writes it to log files (a separate log file per service/task, ideally rotating) or uses Zabbix sender to send the data to Zabbix for further processing.

        I get that this is a very vague answer, but really all the processes you have there require very specific technical documentation to see what is available, which will help connect everything together later on.

        If you need help, sure, we can help, but it seems we will have to take it step by step. Each script has a 30 sec timeout if triggered by the agent, but if there are 50 at the same time it simply won't work, and each of the tasks you described is a challenge in itself if you want reliable observability. Zabbix will help of course, but this has to be redesigned from the ground up.

        Unless you are familiar with Go or C, like Tim suggested: then you can write your own built-in functions without needing PS to check things.
        Last edited by PeterZielony; 13-06-2024, 17:28.

        • zabbattical
          Junior Member
          • Jul 2022
          • 6

          #7
          Thanks for your help.

          There's a lot of work for me to do...
          I will move scripts to the hosts where they belong instead of running them on one centralised script server.
          I will also change some items from PowerShell to WMI where possible.
          Use trappers, ODBC, etc.

          All in all a complete rework.

          • PeterZielony
            PeterZielony commented
            Also consider agent ver 1, which allows running multiple agent instances (each will have its own listening port, though), whereas agent 2 allows just one but is more "packed".
        • cyber
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Dec 2006
          • 4807

          #8
          PeterZielony commented
          14-06-2024, 15:36
          also consider agent ver 1 which allows running multiple instances (each will have its own listening port tho) of agents whereas agent 2 just 1 but it is more "packed"
          You can do that with agent2 also...

          • PeterZielony
            PeterZielony commented
            Oh, I never used ver1, but I thought this was the difference. Will visit the docs again. Thanks for clarifying.