Zabbix 3.2.11 IT Services SLA, downtime and period

  • Chrimos
    Junior Member
    • Jul 2019
    • 3

    #1

    Zabbix 3.2.11 IT Services SLA, downtime and period

    Hello,

    I am unclear about the SLA calculation formula in Zabbix 3.2.11, and I would also like some information about the Downtime function and period selection in IT services.

    SLA calculation - could somebody please explain how SLA and problem time are calculated here, and what the formula behind them is? I have given an example below (SLA example picture). Customer 4 has 13 machines. Customer 4's problem time is 24.6, but when I look at all of this customer's machines, only 3 of them have a problem.

    Downtime - there is currently no data in the Downtime column (see the attached Downtime IT Services picture), but I assume there should be some kind of function behind it. Could somebody explain what that function is? If there is a formula, that would also be good to know.

    Period - in IT Services there are 8 selections: Today; This week; This month; This year; Last 24 hours; Last 7 days; Last 30 days; Last 365 days. I really need more options, such as a calendar. Can this be added in this version somehow? I need monthly and weekly data from the past, and the current options do not allow this kind of report.

    Your help is much appreciated,
    Thank you in advance,

    Chris
  • splitek
    Senior Member
    • Dec 2018
    • 101

    #2



    SLA - It is very simple, but first read the documentation section "Status calculation algorithm" (the method of calculating service status).
    When an IT service is in a problem state, its SLA is reduced. When an IT service has children, the calculation (reduction) depends on the configuration (the chosen algorithm).

    Downtime - the service state within a downtime period does not affect SLA. You define downtime periods in the configuration of the IT service.

    Period - the periods are defined in the frontend. For now, the only way to get SLA for a period that is not predefined is to query the Zabbix API.
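A minimal sketch of such a query, assuming the `service.getsla` API method with its `serviceids` and `intervals` parameters. The helper names (`month_interval`, `build_getsla_request`, `call_api`), the use of UTC for the interval boundaries, and the service ID and URL in the usage note are illustrative assumptions, not taken from this thread.

```python
import calendar
import json
import urllib.request
from datetime import datetime, timezone


def month_interval(year, month):
    """Return (from, to) Unix timestamps covering one calendar month.

    Assumption: boundaries are taken in UTC; adjust if reports should
    follow the Zabbix server's local time zone.
    """
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    last_day = calendar.monthrange(year, month)[1]
    end = datetime(year, month, last_day, 23, 59, 59, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


def build_getsla_request(service_ids, intervals, auth_token, request_id=1):
    """Build the JSON-RPC body for a service.getsla call."""
    return {
        "jsonrpc": "2.0",
        "method": "service.getsla",
        "params": {
            "serviceids": [str(s) for s in service_ids],
            "intervals": [{"from": f, "to": t} for f, t in intervals],
        },
        "auth": auth_token,  # token obtained earlier via user.login
        "id": request_id,
    }


def call_api(url, payload):
    """POST a JSON-RPC payload to api_jsonrpc.php and decode the reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json-rpc"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `build_getsla_request([5], [month_interval(2019, 6)], token)` would ask for the June 2019 SLA of a hypothetical service with ID 5, and the result could be sent with `call_api("http://zabbix.example.com/api_jsonrpc.php", payload)`. Several intervals (one per past week or month) can be passed in a single request.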


    • Raido
      Junior Member
      • Jun 2019
      • 13

      #3
      The type of SLA calculation is "Problem, if at least one child has a problem".
      E.g. customer 4 has 14 machines, and Machine 1 has 2 active problems. During the Machine 1 SLA calculation, are the 2 active problems summed together or averaged together? In what way is the Machine 1 SLA calculated if it has multiple active problems?


      • splitek
        Senior Member
        • Dec 2018
        • 101

        #4
        Originally posted by Raido
        Machine 1 have 2 active problems.
        What do you mean by "problem" here? Are you thinking of the problems listed in the Problems view for some host (Machine 1)? It does not work like that.
        Think of it this way:
        An IT service in Zabbix is connected to a trigger (it can be only one trigger). The trigger can be OK or PROBLEM, and from that the IT service is OK or PROBLEM. As you can see, the number of problems on a host doesn't matter; only the state of the connected trigger matters. An IT service shows the UP/DOWN ratio for one chosen trigger. You can say something like "that service has an SLA of 50%, so it worked 50% of the time it should have worked".


        Now... you have a service "customer" with child services, and the calculation is "Problem, if at least one child has a problem". If one of these children goes to PROBLEM, it propagates PROBLEM up to the parent service "customer", and the parent will be in PROBLEM too. If you change the calculation to "Problem, if all children have problems", then all children need to be in PROBLEM for it to propagate to the parent.
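The two propagation rules described above can be sketched as follows. This is a simplified illustration, not Zabbix's actual code: statuses are reduced to OK/PROBLEM, and the function name and algorithm labels are made up for the example.

```python
OK, PROBLEM = 0, 1


def parent_status(child_statuses, algorithm):
    """Derive a parent service status from its children's statuses.

    algorithm is a hypothetical label:
      "at_least_one" -> "Problem, if at least one child has a problem"
      "all"          -> "Problem, if all children have problems"
    """
    has_children = len(child_statuses) > 0
    if algorithm == "at_least_one":
        in_problem = any(s == PROBLEM for s in child_statuses)
    elif algorithm == "all":
        # With no children there is nothing to propagate, so stay OK.
        in_problem = has_children and all(s == PROBLEM for s in child_statuses)
    else:
        raise ValueError("unknown algorithm: %r" % algorithm)
    return PROBLEM if in_problem else OK


children = [OK, PROBLEM, OK]
print(parent_status(children, "at_least_one"))  # one child is down
print(parent_status(children, "all"))           # not every child is down
```

With one of three children in PROBLEM, the first rule puts the parent in PROBLEM while the second leaves it OK.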


        • Raido
          Junior Member
          • Jun 2019
          • 13

          #5
          How is the parent (yellow) SLA calculated if the children's uptime is much higher?
          Attached Files


          • splitek
            Senior Member
            • Dec 2018
            • 101

            #6
            Hard to tell... but I will try.
            Let's say the parent configuration is "Problem, if at least one child has a problem". So our parent goes to PROBLEM when child 1 or child 2 or child 3 (and so on) is in PROBLEM.
            Let's draw a timeline with one slot per minute. For every minute we put 0 if the child is OK, or 1 if the child is in PROBLEM.

            Child 1: 00001111
            Child 2: 11001100
            Child 3: 11110000
            -------------------------
            Parent: 11111111

            As you can see, every child's uptime is higher (each child is up half of the time), yet the parent has no uptime at all.
            An "OR" operation is performed every second on all children's statuses, and the result is propagated to the parent. When the result is 1, the parent is in the PROBLEM state.
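The timeline above can be reproduced with a few lines of code. This is only a sketch of the OR logic described in this post, using the same three child timelines; the function names are illustrative.

```python
def parse(timeline):
    """Turn a string such as "00001111" into booleans (True = PROBLEM)."""
    return [slot == "1" for slot in timeline]


def or_timelines(children):
    """Parent is in PROBLEM in a slot if any child is in PROBLEM there."""
    return [any(slots) for slots in zip(*children)]


def uptime_percent(timeline):
    """Share of slots in which the service was OK, in percent."""
    return 100.0 * timeline.count(False) / len(timeline)


children = [parse("00001111"), parse("11001100"), parse("11110000")]
parent = or_timelines(children)

for number, child in enumerate(children, start=1):
    print("Child %d uptime: %.1f%%" % (number, uptime_percent(child)))
print("Parent uptime: %.1f%%" % uptime_percent(parent))
```

Each child comes out at 50% uptime, while the parent comes out at 0%, because in every slot at least one child is in PROBLEM.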
