Discussion thread for official Zabbix Template Ceph


    Discussion thread for official Zabbix Template Ceph

    This thread is designed to provide grounds for discussion of the official Zabbix Template for Ceph.
    The template and its details are available in the Git repository: https://git.zabbix.com/projects/ZBX/...pp/ceph_agent2


    Zabbix is always looking for ways to improve our services and to make our users happier.
    We pride ourselves on doing our best each and every day, but we know that there is always something more to learn.
    We would like to hear from you about what you liked and what you would improve in the template.
    Last edited by AlexL; 06-10-2020, 13:04.

    #2
    Hi Zabbix Team,
    I saw the new Zabbix plugin and, first of all, thank you for the work! But I have a question: why do we have to use Zabbix Agent 2 to collect REST API metrics? Could we not use the Zabbix HTTP check?
    Is there some special requirement that cannot be met using the HTTP item?
    Thanks so much


      #3
      Originally posted by dimi
      Hi Zabbix Team,
      I saw the new Zabbix plugin and, first of all, thank you for the work! But I have a question: why do we have to use Zabbix Agent 2 to collect REST API metrics? Could we not use the Zabbix HTTP check?
      Is there some special requirement that cannot be met using the HTTP item?
      Thanks so much
      Many of the metrics used in the template are aggregated metrics, and some metrics required for monitoring were not available through the API.
      Regards,
      Alex


        #4
        Hi Alex,
        Ok, thanks so much for the details!


          #5
          I am testing the ceph_agent2 template against my ceph cluster, using agent 5.0.5, server 5.0.3.

          I have successfully created the api key for the zabbix user to use, and have tested that the API key is valid, however the Zabbix side doesn't seem to work.

          I am imagining that this is the result of the SSL key being self-signed, which I feel like is going to be a very common scenario for most ceph clusters, and something that should be accounted for with a template macro for instance.
          I figure that the ceph.* keys in the agent2 are calling curl under the hood, instead of curl -k.
          I feel like this should be configurable, since not every org/homelab is going to have their own CA configured for this, and self-signed would be a valid method for this API.
          Code:
          $ curl https://$CEPHUSER:$CEPHAPIKEY@$CEPHHOST:8003/server
          curl: (60) SSL certificate problem: self signed certificate
          More details here: https://curl.haxx.se/docs/sslcerts.html
          
          curl failed to verify the legitimacy of the server and therefore could not
          establish a secure connection to it. To learn more about this situation and
          how to fix it, please visit the web page mentioned above.
          Code:
          $ curl -k https://$CEPHUSER:$CEPHAPIKEY@$CEPHHOST:8003/server
          [
          {
          "ceph_version": "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)",
          However,
          Code:
          $ zabbix_get -s $CEPHHOST -k ceph.ping["$CEPHCONNSTRING","$CEPHUSER","$CEPHAPIKEY"]
          0
          Any ideas on what to try next?

          I pulled the latest XML of the template from the 5.0 branch, though the removed XML from master/5.2 seems to be different from the XML in 5.0 by ~150 lines just looking at the line numbers between https://git.zabbix.com/projects/ZBX/...eph_agent2.xml and https://git.zabbix.com/projects/ZBX/...Frelease%2F5.0

          Hopefully someone can point me in the correct direction.


            #6
            Originally posted by reedacus25
            I am testing the ceph_agent2 template against my ceph cluster, using agent 5.0.5, server 5.0.3.

            I have successfully created the api key for the zabbix user to use, and have tested that the API key is valid, however the Zabbix side doesn't seem to work.

            I am imagining that this is the result of the SSL key being self-signed, which I feel like is going to be a very common scenario for most ceph clusters, and something that should be accounted for with a template macro for instance.
            I figure that the ceph.* keys in the agent2 are calling curl under the hood, instead of curl -k.
            I feel like this should be configurable, since not every org/homelab is going to have their own CA configured for this, and self-signed would be a valid method for this API.
            Code:
            $ curl https://$CEPHUSER:$CEPHAPIKEY@$CEPHHOST:8003/server
            curl: (60) SSL certificate problem: self signed certificate
            More details here: https://curl.haxx.se/docs/sslcerts.html
            
            curl failed to verify the legitimacy of the server and therefore could not
            establish a secure connection to it. To learn more about this situation and
            how to fix it, please visit the web page mentioned above.
            Code:
            $ curl -k https://$CEPHUSER:$CEPHAPIKEY@$CEPHHOST:8003/server
            [
            {
            "ceph_version": "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)",
            However,
            Code:
            $ zabbix_get -s $CEPHHOST -k ceph.ping["$CEPHCONNSTRING","$CEPHUSER","$CEPHAPIKEY"]
            0
            Any ideas on what to try next?

            I pulled the latest XML of the template from the 5.0 branch, though the removed XML from master/5.2 seems to be different from the XML in 5.0 by ~150 lines just looking at the line numbers between https://git.zabbix.com/projects/ZBX/...eph_agent2.xml and https://git.zabbix.com/projects/ZBX/...Frelease%2F5.0

            Hopefully someone can point me in the correct direction.
            cat /etc/zabbix/zabbix_agent2.d/ceph.conf
            Plugins.Ceph.InsecureSkipVerify=true

            Oh, and note: the monitoring only works on the Ceph node where the mgr is running.
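
For readers unfamiliar with what such a setting does: a skip-verify option typically switches off certificate validation for the client connection, much like `curl -k`. The sketch below is only an analogy in Python, not the agent's actual code; the function name `make_tls_context` is hypothetical.

```python
import ssl


def make_tls_context(insecure_skip_verify: bool) -> ssl.SSLContext:
    """Build a client TLS context; skip verification only when asked to."""
    ctx = ssl.create_default_context()
    if insecure_skip_verify:
        # Order matters: hostname checking must be disabled before
        # the verify mode can be set to CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


strict = make_tls_context(False)   # rejects self-signed certificates
relaxed = make_tls_context(True)   # accepts any certificate, like curl -k
```

Skipping verification removes protection against man-in-the-middle attacks, so trusting the cluster's own certificate is the safer choice where that is practical.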


              #7
              Originally posted by che666

              cat /etc/zabbix/zabbix_agent2.d/ceph.conf
              Plugins.Ceph.InsecureSkipVerify=true
              That worked immediately for ceph.ping and some others.
              This should really be included in the readme or, even better, in a ceph.conf file that lists any template-specific runtime variables with default values that can be uncommented and changed as needed.
              Otherwise, no one knows where to find any of this.

              However, this has led to new issues.

              ceph.ping, ceph.status, ceph.osd.dump, ceph.df all seem to work.

              ceph.osd.stats returns the error "Access Denied"
              Not sure if there is a specific API permission that it doesn't have?

              On the discovery side:
              ceph.pool.discovery appears to have issues parsing pool names that contain "-".
              Code:
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-metadata.bytes_used".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-hd3.bytes_used".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-ssd.bytes_used".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-hdd.bytes_used".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-hybrid.bytes_used".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-ssd.bytes_used".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-hd3-ec73.bytes_used".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-metadata.max_avail".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-hd3.max_avail".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-ssd.max_avail".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-hdd.max_avail".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-hybrid.max_avail".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-ssd.max_avail".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-hd3-ec73.max_avail".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpath starting with: "-metadata.objects".
              Cannot create item: invalid value for preprocessing step #1: unsupported construct in jsonpa
              ceph.osd.discovery appears to be having trouble parsing the crush map?
              Code:
              Cannot parse result: cannot find node "-36".

              Crush ID of -36 is a second ROOT in my crush map, one root for hdd, one root for ssd.

              Hopefully someone has an idea.
              Ubuntu 18.04.5, Agent2 5.0.5, Server 5.0.5.
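
A note on the pool-name errors above: in JSONPath dot notation a name may contain only letters, digits and underscores, so a hyphenated pool name has to be written in bracket notation. The sketch below shows one way to build such a path. The `$.pools` layout, the helper names and the pool name `example-metadata` are assumptions for illustration, not taken from the template.

```python
import re

# Conservative rule: only letters, digits and underscores are dot-safe.
_DOT_SAFE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def jsonpath_segment(name: str) -> str:
    """Return one path segment, using bracket notation when needed."""
    if _DOT_SAFE.match(name):
        return "." + name
    if "'" in name or "\\" in name:
        # Quoting rules are out of scope for this sketch.
        raise ValueError("unsupported character in name: " + name)
    return "['" + name + "']"


def pool_metric_path(pool: str, metric: str) -> str:
    """Build a path to one metric of one pool ("$.pools" is assumed)."""
    return "$.pools" + jsonpath_segment(pool) + jsonpath_segment(metric)


plain = pool_metric_path("rbd", "bytes_used")
hyphenated = pool_metric_path("example-metadata", "bytes_used")
```

With this approach a plain name keeps the shorter dot form, while a name containing "-" is quoted so the parser does not stop at the hyphen.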


                #8
                Hi,

                unfortunately I must confirm the problems reported by reedacus25.

                ceph.osd.stats says "Access Denied" and, in my case, ceph.osd.discovery says "Cannot parse result: cannot find node "-8"". "-8" is also a second ROOT in my crush map.

                Thank you!


                  #9
                  Hello,

                  I am having a more unusual issue. If I try a local API request:
                  Code:
                  curl -k https://$CEPHUSER:$CEPHAPIKEY@$CEPHHOST:8003/server
                  [
                  {
                  "ceph_version": "ceph version 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)",
                  That looks fine. However, from zabbix_get:

                  Code:
                  zabbix_get -s $CEPHHOST -k ceph.status["$CEPHCONNSTRING","$CEPHUSER","$CEPHAPIKEY"]
                  ZBX_NOTSUPPORTED: Cannot unmarshal JSON: invalid character '<' looking for beginning of value.
                  I have tried this with Ceph 15 and it seems to be fine. I dug into the code a little, since we have proxy servers, and I can confirm that the request is hitting the agent and that the agent is honoring the NO_PROXY environment variable. I added a check to test for this. However, since my knowledge of Go is a little lacking in places, I am still trying to determine the URL request it is constructing and how it is failing.

                  N.B. I have added an asterisk to the extra lines I added to determine what is going on
                  Code:
                  2020/11/19 13:44:01.676628 received passive check request: 'ceph.status[https://localhost:8003,zabbix,c68c4027-a9d8-49c9-891d-5d5010a9adb5]' from '192.168.7.18'
                  2020/11/19 13:44:01.676720 [1] processing update request (1 requests)
                  2020/11/19 13:44:01.676737 [1] registering new client
                  2020/11/19 13:44:01.676764 [1] adding new request for key: 'ceph.status[https://localhost:8003,zabbix,c68c4027-a9d8-49c9-891d-5d5010a9adb5]'
                  2020/11/19 13:44:01.676774 [1] created direct exporter task for plugin 'Ceph' itemid:0 key 'ceph.status[https://localhost:8003,zabbix,c68c4027-a9d8-49c9-891d-5d5010a9adb5]'
                  2020/11/19 13:44:01.676781 [1] created starter task for plugin Ceph
                  2020/11/19 13:44:01.676786 [1] created configurator task for plugin Ceph
                  2020/11/19 13:44:01.676830 plugin Ceph: executing configurator task
                  2020/11/19 13:44:01.676925 plugin Ceph: executing starter task
                  *2020/11/19 13:44:01.677044 Proxy (<nil>)
                  2020/11/19 13:44:01.677077 executing direct exporter task for key 'ceph.status[https://localhost:8003,zabbix,c68c4027-a9d8-49c9-891d-5d5010a9adb5]'
                  *2020/11/19 13:44:01.677099 URI (https://zabbix:[email protected]:8003/request?wait=1)
                  *2020/11/19 13:44:01.677119 metrics: {description:Returns an overall cluster's status. commands:[status] params:map[] handler:0x7b4a90}
                  *2020/11/19 13:44:01.706335 r &{cmd:status data:[] err:{err:{err:0xc000072f60 cause:<nil>} cause:0xc00036c2e0}}
                  *2020/11/19 13:44:01.706369 r,data []
                  2020/11/19 13:44:01.706386 failed to execute direct exporter task for key 'ceph.status[https://localhost:8003,zabbix,c68c4027-a9d8-49c9-891d-5d5010a9adb5]' error: 'Cannot unmarshal JSON: invalid character '<' looking for beginning of value.'
                  2020/11/19 13:44:01.706424 sending passive check response: ZBX_NOTSUPPORTED: 'Cannot unmarshal JSON: invalid character '<' looking for beginning of value.' to '192.168.7.18'
                  Any help would be appreciated

                  Thanks

                  Fran
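
A note on the error above: "invalid character '<' looking for beginning of value" usually means the body being parsed starts with markup such as an HTML error page rather than JSON. A minimal sketch of a check that makes this case explicit; the function name `parse_api_body` is hypothetical and this is not the agent's code.

```python
import json


def parse_api_body(body: str):
    """Parse a JSON body, with a clearer error when HTML comes back."""
    stripped = body.lstrip()
    if stripped.startswith("<"):
        # An HTML page here usually means the server returned an error.
        raise ValueError("response looks like HTML, not JSON: " + stripped[:40])
    return json.loads(stripped)


ok = parse_api_body('{"health": "HEALTH_OK"}')
```

Printing the first few characters of the body, as the error message does here, is often enough to tell an HTTP error page from a malformed JSON document.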


                  • vadimipatov
                    vadimipatov commented
                    Hello!
                    Could you please share full output of the API's "status" command?

                    Code:
                    curl -k  -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"prefix": "status", "format":"json"}' "https://$CEPHUSER:$CEPHAPIKEY@$CEPHHOST:8003/request?wait=1"
                    Last edited by vadimipatov; 23-11-2020, 10:01.

                  • fran@swansea
                    fran@swansea commented
                    Ahh yes, that is causing an issue:

                    Code:
                    # curl -k  -H "Accept: application/json" -H "Content-Type: application/json" -X POST -d '{"prefix": "status"}' "https://$CEPHUSER:$CEPHAPIKEY@$CEPHHOST:8003/request?wait=1"
                    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
                    <title>500 Internal Server Error</title>
                    <h1>Internal Server Error</h1>
                    <p>The server encountered an internal error and was unable to complete your request.  Either the server is overloaded or there is an error in the application.</p>
                    It looks like an issue with the mgr:
                    Code:
                    2020-11-19 22:14:28.193 7f1e8fa13700  0 mgr[restful] Traceback (most recent call last):
                      File "/lib/python3.6/site-packages/pecan/core.py", line 683, in __call__
                        self.invoke_controller(controller, args, kwargs, state)
                      File "/lib/python3.6/site-packages/pecan/core.py", line 574, in invoke_controller
                        result = controller(*args, **kwargs)
                      File "/usr/share/ceph/mgr/restful/decorators.py", line 35, in decorated
                        return f(*args, **kwargs)
                      File "/usr/share/ceph/mgr/restful/api/request.py", line 88, in post
                        return context.instance.submit_request([[request.json]], **kwargs)
                      File "/usr/share/ceph/mgr/restful/module.py", line 589, in submit_request
                        request = CommandsRequest(_request)
                      File "/usr/share/ceph/mgr/restful/module.py", line 69, in __init__
                        results = self.run(commands_arrays[0])
                      File "/usr/share/ceph/mgr/restful/module.py", line 87, in run
                        result.command = common.humanify_command(command)
                      File "/usr/share/ceph/mgr/restful/common.py", line 37, in humanify_command
                        for arg, val in command.iteritems():
                    AttributeError: 'dict' object has no attribute 'iteritems'
                    
                    2020-11-19 22:14:28.199 7f1e8fa13700  0 mgr[restful] Error on request:
                    Traceback (most recent call last):
                      File "/usr/lib/python3.6/site-packages/werkzeug/serving.py", line 209, in run_wsgi
                        execute(self.server.app)
                      File "/usr/lib/python3.6/site-packages/werkzeug/serving.py", line 197, in execute
                        application_iter = app(environ, start_response)
                      File "/usr/lib/python3.6/site-packages/pecan/middleware/recursive.py", line 56, in __call__
                        return self.application(environ, start_response)
                      File "/usr/lib/python3.6/site-packages/pecan/core.py", line 840, in __call__
                        return super(Pecan, self).__call__(environ, start_response)
                      File "/usr/lib/python3.6/site-packages/pecan/core.py", line 683, in __call__
                        self.invoke_controller(controller, args, kwargs, state)
                      File "/usr/lib/python3.6/site-packages/pecan/core.py", line 574, in invoke_controller
                        result = controller(*args, **kwargs)
                      File "/usr/share/ceph/mgr/restful/decorators.py", line 35, in decorated
                        return f(*args, **kwargs)
                      File "/usr/share/ceph/mgr/restful/api/request.py", line 88, in post
                        return context.instance.submit_request([[request.json]], **kwargs)
                      File "/usr/share/ceph/mgr/restful/module.py", line 589, in submit_request
                        request = CommandsRequest(_request)
                      File "/usr/share/ceph/mgr/restful/module.py", line 69, in __init__
                        results = self.run(commands_arrays[0])
                      File "/usr/share/ceph/mgr/restful/module.py", line 87, in run
                        result.command = common.humanify_command(command)
                      File "/usr/share/ceph/mgr/restful/common.py", line 37, in humanify_command
                        for arg, val in command.iteritems():
                    AttributeError: 'dict' object has no attribute 'iteritems'
                    Last edited by fran@swansea; 20-11-2020, 00:16.
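
A note on the traceback above: `dict.iteritems()` exists only in Python 2, so code that calls it fails with exactly this AttributeError when run under Python 3, where `dict.items()` is the replacement. A minimal sketch follows; the function `format_command` is a hypothetical stand-in, not the restful module's code.

```python
def format_command(command: dict) -> str:
    """Render a command dict as space-separated key=value pairs."""
    # dict.items() works on Python 2 and 3; iteritems() is Python 2 only.
    return " ".join("{}={}".format(key, val) for key, val in command.items())


rendered = format_command({"prefix": "status"})
has_iteritems = hasattr({}, "iteritems")  # False on any Python 3
```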

                  #10
                  Hello!

                  Originally posted by reedacus25
                  This should really be included in the readme, or even better, the ceph.conf file with any template specific runtime variables with default values that can be uncommented and changed as needed.
                  Otherwise, no one knows where to find any of this.
                  It's already there.

                  Originally posted by reedacus25
                  ceph.osd.stats returns the error "Access Denied"
                  Not sure if there is a specific API permission that it doesn't have?
                  We will dig into this.

                  Originally posted by reedacus25
                  ceph.pool.discovery appears to have issues parsing pool names that contain "-".
                  Cannot parse result: cannot find node "-36".
                  These two errors will be fixed in one of the upcoming versions.

                  Thanks for your report!


                    #11
                    [ninja delete]
                    Last edited by reedacus25; 19-11-2020, 18:24. Reason: deleting


                    • vadimipatov
                      vadimipatov commented
                      I'm sorry, forgot about formatting. POST data should be: {"prefix": "status", "format":"json"}
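
For anyone who prefers to script this check, the same POST request can be built as in the sketch below. It only constructs the request and does not send it. The host `ceph-mgr.example`, the key `not-a-real-key` and the function name are placeholders; the endpoint, headers and POST data follow the curl command in this thread.

```python
import base64
import json
import urllib.request


def build_status_request(base_url: str, user: str, api_key: str):
    """Build (but do not send) the "status" request with format=json."""
    body = json.dumps({"prefix": "status", "format": "json"}).encode("utf-8")
    # Basic auth is what curl derives from user:key@host in the URL.
    token = base64.b64encode((user + ":" + api_key).encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/request?wait=1",
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )


req = build_status_request("https://ceph-mgr.example:8003", "zabbix", "not-a-real-key")
```

Sending it would need `urllib.request.urlopen(req, context=...)` with a TLS context suited to the cluster's certificate.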

                    #12
                    I just found some low-hanging fruit to fix with this template.

                    The ceph.rd_ops.rate item is missing the "change per second" preprocessing step.
                    Right now it is just an ever-growing value.

                    This is only for the "cluster" stats. The discovered per-pool rd_ops.rate objects are not affected.
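
To illustrate what that step does: "change per second" turns an ever-growing counter into a rate by dividing the difference between two samples by the time between them. A minimal sketch with made-up sample values; the function name is hypothetical, and treating a smaller value as a counter reset is an assumption of this sketch.

```python
def change_per_second(prev_value, prev_time, value, time):
    """Return the per-second rate between two counter samples."""
    elapsed = time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if value < prev_value:
        # Assumed handling of a counter reset: skip this sample.
        return None
    return (value - prev_value) / elapsed


# Made-up samples: 1000 reads at t=0 s, 1600 reads at t=60 s.
rate = change_per_second(1000, 0, 1600, 60)
```

Without this step the item stores the raw counter (1000, then 1600), which only ever grows; with it, the item stores the rate of 10 reads per second.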
