Moving ~200 hosts to a new server

  • mork
    Junior Member
    • Jun 2016
    • 8

    #1

    Moving ~200 hosts to a new server

    Hi guys,

    We are moving monitoring for a portion of our servers from one zabbix server to a brand new one. I would like some input on how to manage this on the server side.

    So far I have created a script that will ssh into each host on my list and modify the agent conf file to point to a different server, then restart zabbix on the host.
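    A stripped-down sketch of the idea (not the actual script; the server name, config path and service name below are placeholders for whatever your environment uses):

        import csv
        import subprocess

        NEW_SERVER = "new-zabbix.example.com"          # placeholder
        AGENT_CONF = "/etc/zabbix/zabbix_agentd.conf"  # adjust to your layout

        def repoint_agent(host):
            """Point the agent at the new server over ssh, then restart it.
            Returns True only if both remote commands exited with status 0."""
            remote = (
                "sudo sed -i 's/^Server=.*/Server={0}/; "
                "s/^ServerActive=.*/ServerActive={0}/' {1} "
                "&& sudo service zabbix-agent restart"   # service name varies by distro
            ).format(NEW_SERVER, AGENT_CONF)
            return subprocess.call(["ssh", host, remote]) == 0

        with open("hosts_to_migrate.csv") as fh:
            for row in csv.reader(fh):
                host = row[0].strip()
                print(host, "repointed" if repoint_agent(host) else "FAILED - investigate")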

    We are considering two approaches on the server side. Currently the new server is an exact copy of the old one, but it is obviously not receiving any data from the hosts as they are still pointed at the old server. The problem with this is that it has ~600 hosts configured, most of which we will not want on the new server. Checking on the front end to see whether a host has moved over successfully is a bit of a hassle because there are so many configured hosts to wade through.

    Basically we can either continue with our exact copy of the old server or wipe it clean, export the hosts we DO want to monitor from the old server and then migrate them using the script, monitoring manually to make sure the migration has worked successfully.

    The reason we want to check each host manually is that this is a mission-critical monitoring system which is looking at servers in various datacenters across the world. Because of firewall rules and other unforeseen problems, it is possible that just changing the agent config and restarting Zabbix will not result in a successful migration, in which case we need to roll back the change and flag the host for further investigation.

    Does anyone have any experience doing something like this or any wisdom to offer? Thanks in advance.

    Version = 2.2.9
  • Linwood
    Senior Member
    • Dec 2013
    • 398

    #2
    I haven't.

    But in reading this I am a bit unsure what the real concern is.

    Is the concern to have no period when a host is not monitored? One possibility (though I haven't tried it) is to run two instances of the agent: the old one unchanged, and a new one with a separate config file pointed at the new server (a minor issue is that you have to switch port numbers; how "minor" that is of course depends on your viewpoint). If I were using SNMP I'd do that.
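    For what it's worth, the second instance mostly just needs its own config file, port and pid/log files; something along these lines (untested, and every name, path and port here is only illustrative):

        # /etc/zabbix/zabbix_agentd_new.conf - second agent instance (illustrative)
        Server=new-zabbix.example.com
        ServerActive=new-zabbix.example.com
        Hostname=web01.example.com
        # the existing instance keeps the default 10050
        ListenPort=10052
        PidFile=/var/run/zabbix/zabbix_agentd_new.pid
        LogFile=/var/log/zabbix/zabbix_agentd_new.log

    You'd start it with zabbix_agentd -c /etc/zabbix/zabbix_agentd_new.conf, and the host's interface on the new server would have to be configured with port 10052 instead of 10050.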

    I'm not sure whether the firewall issues would present a problem with the two running in parallel, but even there a two-stage effort might be safest.

    Is the issue the loss of monitored (historical) data during the migration? One technique might be to start with an exact copy and then delete the hosts you do not want; that reduces the gap to only the time it takes to migrate, and in principle you could move the historical data over after the fact with SQL if all the item keys stayed exactly the same. I'm not sure whether discovered items could be a problem there.

    • mork
      Junior Member
      • Jun 2016
      • 8

      #3
      Thanks for the reply!

      The concern is that as we move machines over to the new server, we want to semi-automate the config changes but have manual checks each time a host is moved to make sure it is talking to the new server before being marked off our list as "migrated".

      The manual checks would pretty much consist of me having two browser tabs open on "Configuration of hosts" and watching the availability square go red on one, and green on the other for each particular machine. If there is a better way to do this please let me know.

      Loss of historical data is not an issue, nor is absolutely zero downtime (although the less the better). The priority is thoroughness of the migration and reliability of the new monitoring system.

      Because of the large number of irrelevant hosts configured on our new server (which is an exact copy of our old server), the manual check looks like it could be extremely tedious, although maybe it's just something I will have to live with.

      • Linwood
        Senior Member
        • Dec 2013
        • 398

        #4
        I'm not quite sure what the difficulty with the manual check is. If it's just isolation, one possibility is to grab them 10 at a time (or whatever) and put them in a separate host group on both machines, so you can easily view just the ones being moved that day, and then move them until satisfied. Though if it were me I'd be a bit paranoid about whether all the checks (i.e. item definitions) move transparently, so I'd be tempted to also review each host's items before and after to make sure you have exactly the same number and none went unsupported; you can do that from the trigger screen using the same host group to show them all.
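        If you want to script that comparison rather than eyeball it, something like this against both APIs might do it (an untested sketch; the URLs, credentials and hostnames are placeholders):

            import json
            import urllib.request

            def call(url, method, params, auth=None):
                """One JSON-RPC request to a Zabbix API endpoint."""
                body = json.dumps({"jsonrpc": "2.0", "method": method,
                                   "params": params, "auth": auth, "id": 1}).encode()
                req = urllib.request.Request(url, body,
                                             {"Content-Type": "application/json-rpc"})
                reply = json.load(urllib.request.urlopen(req))
                if "error" in reply:
                    raise RuntimeError(reply["error"])
                return reply["result"]

            def item_count(url, auth, hostname):
                """Number of items configured for one host (countOutput returns a string)."""
                hostids = [h["hostid"] for h in
                           call(url, "host.get", {"filter": {"host": [hostname]}}, auth)]
                return int(call(url, "item.get",
                                {"hostids": hostids, "countOutput": True}, auth))

            OLD = "http://old-zabbix.example.com/zabbix/api_jsonrpc.php"   # placeholders
            NEW = "http://new-zabbix.example.com/zabbix/api_jsonrpc.php"
            old_auth = call(OLD, "user.login", {"user": "api", "password": "secret"})
            new_auth = call(NEW, "user.login", {"user": "api", "password": "secret"})

            for hostname in ["web01.example.com", "db01.example.com"]:   # one batch
                print(hostname, item_count(OLD, old_auth, hostname),
                      item_count(NEW, new_auth, hostname))

        The per-batch host group itself can also be created through the same API (hostgroup.create, then host.massadd) if you'd rather not click it together in the frontend.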

        If you have any Zabbix trapper items, SNMP trap items, or external checks, you may need to figure out a way to verify them separately; the Latest data page might handle some, but infrequent ones are a different matter.

        And cross your fingers and hope no one cheated on any item definitions or external checks and left hard-coded server names in somewhere.

        I've never tried changing port numbers to see how hard that would be to do thoroughly, but if it were me I'd try it at least briefly to see how difficult it is and think about running in parallel. I've always been a fan of parallel runs as a safety net for new servers, when possible. But it may be too much hassle to shift from the default port number. No idea.

        • mork
          Junior Member
          • Jun 2016
          • 8

          #5
          Thanks for the input.

          I think doing it in batches of 10 in specific "migration groups" is probably a good idea.

          Worst-case scenario, the host doesn't migrate properly and I roll back to the previous config and previous server.

          Fingers crossed it works out. If I gain any wisdom I will update here.

          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #6
            Originally posted by mork
            Hi guys,

            We are moving monitoring for a portion of our servers from one zabbix server to a brand new one. I would like some input on how to manage this on the server side.

            So far I have created a script that will ssh into each host on my list and modify the agent conf file to point to a different server, then restart zabbix on the host.
            Why didn't you just move the server's IP address to the new system?
            Or, if the agents' configuration uses the server's hostname, why wasn't the migration done with a simple DNS change?
            http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
            https://kloczek.wordpress.com/
            zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
            My zabbix templates https://github.com/kloczek/zabbix-templates

            • mork
              Junior Member
              • Jun 2016
              • 8

              #7
              Originally posted by kloczek
              Why didn't you just move the server's IP address to the new system?
              Or, if the agents' configuration uses the server's hostname, why wasn't the migration done with a simple DNS change?
              The existing Zabbix server will still be used to monitor the other ~400 hosts.

              The team is splitting, basically.

              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Originally posted by mork
                The existing Zabbix server will still be used to monitor the other ~400 hosts.
                Moving the existing address to the new Zabbix server to monitor the existing hosts, and assigning a new address to the new server to monitor the next 400 hosts, is still waaaay simpler.

                • mork
                  Junior Member
                  • Jun 2016
                  • 8

                  #9
                  Originally posted by kloczek
                  Moving the existing address to the new Zabbix server to monitor the existing hosts, and assigning a new address to the new server to monitor the next 400 hosts, is still waaaay simpler.
                  I'm not sure I understand.

                  Whichever way we go about it, there will be a point where a bunch of hosts need to be told to stop pointing at server X and start pointing at server Y. Ensuring this "migration" works flawlessly is still the problem.

                  Right now I've found a pretty good solution (I think). I have a script that takes in a CSV of IPs, these being the machines we need to migrate to the new server. It SSHes into each of these servers, modifies the config and restarts the Zabbix agent. Then it queries the Zabbix API on the new server with the hostname of the host it just "migrated" to see whether it returns "available": 1, which would indicate that the migrated host is now successfully communicating with the new server. At that point the host is considered "migrated" and is added to a list. If any of these steps fail, it is considered "not migrated" and is added to another list. The script generates a log each time it runs to keep track of everything.
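                  The availability check boils down to a single API call per host; a simplified version looks roughly like this (the URL and credentials are placeholders, not our real ones):

                      import json
                      import urllib.request

                      URL = "http://new-zabbix.example.com/zabbix/api_jsonrpc.php"  # placeholder

                      def call(method, params, auth=None):
                          """Single JSON-RPC request to the new server's API."""
                          body = json.dumps({"jsonrpc": "2.0", "method": method,
                                             "params": params, "auth": auth, "id": 1}).encode()
                          req = urllib.request.Request(URL, body,
                                                       {"Content-Type": "application/json-rpc"})
                          return json.load(urllib.request.urlopen(req))["result"]

                      def migrated(auth, hostname):
                          """True once the new server reports the host's agent as available.
                          In Zabbix 2.2 the host object carries 'available'; '1' means reachable."""
                          hosts = call("host.get", {"filter": {"host": [hostname]},
                                                    "output": ["available"]}, auth)
                          return bool(hosts) and hosts[0]["available"] == "1"

                      auth = call("user.login", {"user": "api", "password": "secret"})
                      print(migrated(auth, "web01.example.com"))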

                  Is this a giant ball ache? Yes. Have I over-engineered a solution? Maybe, but it beats the hell out of refreshing a page on the Zabbix frontend to see if a square has turned from green to red/red to green. It'll also be considerably more reliable than a human doing that.

                  I'll ask my boss if I can publish the code when I'm done. Maybe someone will find it useful.

                  • kloczek
                    Senior Member
                    • Jun 2006
                    • 1771

                    #10
                    Originally posted by mork
                    Right now I've found a pretty good solution (I think). I have a script that takes in a CSV of IPs,
                    [..]
                    Is this a giant ball ache? Yes. Have I over-engineered a solution? Maybe, but it beats the hell out of refreshing a page on the Zabbix frontend to see if a square has turned from green to red/red to green. It'll also be considerably more reliable than a human doing that.

                    I'll ask my boss if I can publish the code when I'm done. Maybe someone will find it useful.
                    I would say that in your case your solution looks not over-engineered but under-engineered, because you are not using DNS.
                    That is one of the good reasons to use DNS: it makes many other changes easier.
                    If everything uses hostnames, then as the environment grows, and when it becomes necessary to delegate network-layer responsibilities to a network team, you can move services between network segments without a single change at the application/service layer.

                    Massive changes like moving a whole environment to a new range of IPs can then be done by touching only two points, the DHCP server setup and the DNS server setup, instead of every host's settings.
                    With a lowered TTL in DNS and a lowered lease time on the DHCP subnet(s), all that is needed to initiate the whole change is a reload of the named and dhcpd processes. Of course, this assumes that no host has /etc/hosts entries and that there are no hard-coded IPs in Apache, Tomcat, Zabbix agent or other application settings. If the applications on each host listen on the "0" address (all locally available addresses), it is not even necessary to log in to each host to restart its services; a service that loses connectivity to another service usually keeps trying to reconnect. That behaviour is implemented, for example, in the Zabbix server/proxy when using the DB backend.
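                    As a purely illustrative fragment (the zone, record name and addresses below are made up, not from this thread), the cut-over then amounts to lowering the TTL ahead of time and changing a single A record:

                        ; zone file for example.com - illustrative only
                        $TTL 300        ; lowered ahead of the move; raise it again afterwards
                        ; old server address; change it to the new one and reload named
                        zabbix    IN    A    192.0.2.10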

                    I know it is too late for you now, but try to learn something from the above when building your new 400-host environment.
                    Or maybe it is not too late yet: if you still have time to change the settings one by one and introduce DNS, giving the whole environment enough flexibility, the final migration will be a piece of cake.
                    Last edited by kloczek; 29-06-2016, 17:04.

                    • Linwood
                      Senior Member
                      • Dec 2013
                      • 398

                      #11
                      Originally posted by kloczek
                      I would say that in your case your solution looks not over-engineered but under-engineered, because you are not using DNS.
                      I think you misunderstand his need.

                      His 400 hosts are being monitored by server X.

                      When he is done, 200 will be monitored by server X, and 200 by server Y. X doesn't go away.

                      • mork
                        Junior Member
                        • Jun 2016
                        • 8

                        #12
                        Originally posted by Linwood
                        I think you misunderstand his need.

                        His 400 hosts are being monitored by server X.

                        When he is done, 200 will be monitored by server X, and 200 by server Y. X doesn't go away.
                        This is correct.

                        Currently I think the program I have written is the best solution to the problem. It's semi-automated in that it checks the API for you instead of you having to keep hitting F5 on the front end, copy/pasting hostnames, searching, etc., but there is an element of human supervision to stop the script running away and ruining our monitoring because of weird firewall rules or something. Probably excessive paranoia on my part, but whatever.
