upgrade 1.8 to 2.0: The Need for Speed

  • 0siris
    Member
    Zabbix Certified Specialist
    • Nov 2010
    • 76

    #1

    upgrade 1.8 to 2.0: The Need for Speed

    No no, not talking about this

    Since getting past ZBX-5125, I've been able to start updating Zabbix to 2.0.
    This is done on a VM with a copy of the database. The DB itself has about 15 million records in history and trends, and about 4 million records in events.

    The update script has been running for almost 11 days now.
    It has been processing this query:
    Code:
    SELECT e1.eventid,(SELECT e2.value
                                    FROM events e2
                                    WHERE e2.source=e1.source
                                            AND e2.object=e1.object
                                            AND e2.objectid=e1.objectid
                                            AND (e2.clock<e1.clock OR (e2.clock=e1.clock AND e2.eventid<e1.eventid))
                                            AND e2.value<2
                                    ORDER BY e2.source DESC,
                                                    e2.object DESC,
                                                    e2.objectid DESC,
                                                    e2.clock DESC,
                                                    e2.eventid DESC,
                                                    e2.value DESC
                                    LIMIT 1) AS prev_value
                    FROM events e1
                    WHERE e1.source=0
                            AND e1.object=0
                            AND e1.value=0
                    HAVING prev_value IS NULL OR prev_value = 1
    And when I look at the update script, another query just as big as this one is coming next...

    This query runs on a single CPU core (on a multi-core VM guest), taking up 100% of that core. There is hardly any disk activity; the load is purely single-core CPU-bound.
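
    If anyone wants to experiment on a copy of the DB, one untested idea (the index name below is my own) is a composite index matching the subquery's WHERE and ORDER BY columns, so the "previous event" lookup can perhaps be resolved from the index instead of rescanning events for every row:
    Code:
    -- Untested, on a COPY of the database only. Building it may itself take a
    -- while on a large events table, and the upgrade may or may not use it.
    CREATE INDEX idx_events_prev
        ON events (source, object, objectid, clock, eventid, value);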

    Now I have 4 options:
    1) sit back, relax, good work takes some time, Rome wasn't built in a day... blablabla
    2) buy myself an Intel Core i7 3770K with a phase-change cooling system, overclock the hell out of it, let my boss pay for it and see if it goes any faster than before (so far I think this solution is by far the best somehow )
    3) truncate the large tables and go on with only the base tables, perhaps keeping the old DB available as an archive for the next couple of months
    4) see if things can be sped up software-wise. For example, it might be possible to do part of the update script, or at least this part of it, with tools like sed, awk, grep, cut and tr against an export of the database?

    About the 11 days of running: I haven't seen any progress in the MySQL tmpdir; the newest file in there is 5 days old, and there are no new entries in that file either.

    strace shows nothing going on:
    Code:
    root@zbxtst:~# strace -p 1431
    Process 1431 attached - interrupt to quit
    select(14, [12 13], NULL, NULL, NULL
    I left it at that for a couple of minutes.
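
    For anyone wanting to check whether the server is actually doing anything, the mysql client gives a bit more visibility than strace (just the checks I know of, nothing authoritative):
    Code:
    -- How long the statement has been running and what state it is in:
    SHOW FULL PROCESSLIST;

    -- InnoDB internals; the ROW OPERATIONS section near the end shows a
    -- "reads/s" rate - if it stays above zero, rows are still being read:
    SHOW ENGINE INNODB STATUS\G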

    Ideas welcome.

    <edit>
    I cancelled the upgrade after almost 15 days. Still no new files (or new rows in existing files) in the MySQL tmpdir, and still the same query running(?)
    I will try again with an emptied events table.
    Last edited by 0siris; 26-07-2012, 13:28.
  • kens
    Junior Member
    • Oct 2006
    • 27

    #2
    Maybe we need to book some time on Tianhe-1A

    Having got round ZBX-5125, I'm on day 3 on an i7 running the same query. Like yours, the disk activity has died down and I don't see any new files being created.

    Has anyone else done this, or is doing it now?

    Got any ideas?

    Thanks

    Ken


    • 0siris
      Member
      Zabbix Certified Specialist
      • Nov 2010
      • 76

      #3
      Originally posted by kens
      Maybe we need to book some time on Tianhe-1A
      The query runs single-threaded, so there's no use for a huge multithreaded beast here.


      • kens
        Junior Member
        • Oct 2006
        • 27

        #4
        That's what I thought. Mine is running natively on an i7, and the workload is just switching between the processors as the Linux scheduler allocates CPUs.

        Could any SQL experts advise on a quicker way to accomplish what this bit of the script is doing, or suggest an approach for dropping the data, or for converting offline and re-integrating data that can't be converted in a sensible time?

        As I have mentioned elsewhere, I have taken a snapshot of my production database to test the migration. If the conversion could be done in ~24 hours, I'd live with a 24-hour gap in the data in the converted database. That could be at a weekend.

        But it's not good to potentially have a one-month gap in the data, or to take a Zabbix system down for about a month, to do this upgrade.

        So the options appear to be: drop the migration of historical data, or convert the historical data offline and merge it back into the production database after it's converted.

        Worst case, I could live with losing all the historical data and just keeping the users/hosts/host groups/triggers/events. Maybe a script that would do that as an option would be useful. I'd keep the old database and query it from a 1.6 (in my case) web interface if I wanted to see some history.
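
        For what it's worth, the sort of thing I have in mind is roughly this (a sketch only, to be run against a copy first, since it throws history away permanently):
        Code:
        -- Sketch only; test on a COPY first - this discards history permanently.
        -- With the big tables empty, the 2.0 conversion should finish quickly
        -- while hosts, triggers and the rest of the configuration stay intact.
        TRUNCATE TABLE history;
        TRUNCATE TABLE history_uint;
        TRUNCATE TABLE history_str;
        TRUNCATE TABLE history_text;
        TRUNCATE TABLE history_log;
        TRUNCATE TABLE trends;
        TRUNCATE TABLE trends_uint;
        -- Note: the slow statement in this thread works on the events table, so
        -- if events are kept, that part of the upgrade will presumably still be slow.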

        Any suggestions?

        Thanks

        Ken


        • kens
          Junior Member
          • Oct 2006
          • 27

          #5
          I would have hoped that there would be some suggestions for moving this issue forward.

          For the present, I have stopped our database upgrade and put our migration to v2.0 on hold pending resolution of this issue. We will retain v1.6 in production for the moment and decide whether to move to 1.8 or 2.0 at a later stage. We will continue to do some testing on v2.0.

          In the meantime I will raise a bug report.

          Thanks

          Ken


          • kens
            Junior Member
            • Oct 2006
            • 27

            #6
            Bug report


            • heaje
              Senior Member
              Zabbix Certified Specialist
              • Sep 2009
              • 325

              #7
              Just a thought, but I ran this process in the past when doing an upgrade. I got all my history back, and the actual monitoring downtime was fairly small. Just look at my comments in the thread below:


              • kens
                Junior Member
                • Oct 2006
                • 27

                #8
                Thank you for that suggestion. I'll check it out.

                In the meantime I have had the following response to my bug report:

                richlv closed ZBX-5359.
                -----------------------

                Resolution: Won't Fix

                not really a bug. note that small innodb buffer pool size will result in a long upgrade.


                I'll also investigate that suggestion.

                Thanks

                Ken


                • kens
                  Junior Member
                  • Oct 2006
                  • 27

                  #9
                  Running for 7 days

                  Hi all,

                  I have increased the InnoDB buffer pool to 3 GB, and taken the other advice mysqltuner.pl suggests. The i7 machine I'm running the database conversion on has 4 GB of RAM. The upgrade process has now been running for 7 days at 100% CPU with negligible disk activity. It is still on the first block of SQL (the bit that 0siris posted above) in the upgrade script.
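
                  For anyone wanting to replicate the setup, the buffer pool change amounts to something like this in my.cnf (illustrative values; the two extra lines are common companion settings rather than necessarily what I used):
                  Code:
                  [mysqld]
                  # Buffer pool sized to leave room for the OS and other buffers on a 4 GB box
                  innodb_buffer_pool_size        = 3G
                  # Larger redo logs reduce checkpoint pressure during bulk changes; changing
                  # this on older MySQL needs a clean shutdown and removal of the old ib_logfile*
                  innodb_log_file_size           = 256M
                  # Relaxed durability, acceptable for a one-off offline conversion
                  innodb_flush_log_at_trx_commit = 2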

                  I ran strace -f on the MySQL process, in a similar way to 0siris, the original poster of this topic, and got the trace at the bottom of this post.

                  Can anyone who knows more about the internal workings of MySQL than I do tell whether this process is actually doing anything useful? The "Resource temporarily unavailable" and "Timeout" lines look suspicious. If it would be worth doing, I could increase the memory to more than 16 GB so the entire database would fit in RAM.
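
                  One crude check that might answer the "is it doing anything" question (I haven't verified how meaningful it is; better suggestions welcome) would be to sample the row-read counters a minute apart:
                  Code:
                  -- If the Handler_read_* counters keep climbing between the two samples,
                  -- the server is still churning through rows rather than being stuck:
                  SHOW GLOBAL STATUS LIKE 'Handler_read%';
                  SELECT SLEEP(60);
                  SHOW GLOBAL STATUS LIKE 'Handler_read%';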

                  I like the idea of the two-stage upgrade suggested by heaje. As a matter of interest, how long did the database upgrade take, and what size was it?

                  I still find the "not a bug / won't fix" response disappointing. I take it to mean that this process is considered to be working as designed and fit for purpose?

                  Any suggestions most welcome.

                  Thanks

                  Ken

                  Code:
                  [pid  3490] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
                  [pid  3483] <... futex resumed> )       = 0
                  [pid  3490] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3483] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3484] <... select resumed> )      = 0 (Timeout)
                  [pid  3484] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3483] <... select resumed> )      = 0 (Timeout)
                  [pid  3490] sched_yield()               = 0
                  [pid  3490] futex(0x2f5472c, FUTEX_WAIT_PRIVATE, 1389789, NULL <unfinished ...>
                  [pid  3483] futex(0x2f5472c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x2f546f0, 1389790 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3483] <... futex resumed> )       = 1
                  [pid  3490] futex(0x2f546f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
                  [pid  3483] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3483] <... futex resumed> )       = 1
                  [pid  3490] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3483] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3484] <... select resumed> )      = 0 (Timeout)
                  [pid  3484] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3483] <... select resumed> )      = 0 (Timeout)
                  [pid  3490] sched_yield()               = 0
                  [pid  3490] futex(0x2f5472c, FUTEX_WAIT_PRIVATE, 1389791, NULL <unfinished ...>
                  [pid  3483] futex(0x2f5472c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x2f546f0, 1389792 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3483] <... futex resumed> )       = 1
                  [pid  3490] futex(0x2f546f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
                  [pid  3483] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
                  [pid  3483] <... futex resumed> )       = 0
                  [pid  3490] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3483] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3484] <... select resumed> )      = 0 (Timeout)
                  [pid  3484] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3483] <... select resumed> )      = 0 (Timeout)
                  [pid  3490] sched_yield()               = 0
                  [pid  3490] futex(0x2f5472c, FUTEX_WAIT_PRIVATE, 1389793, NULL <unfinished ...>
                  [pid  3483] futex(0x2f5472c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x2f546f0, 1389794 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3483] <... futex resumed> )       = 1
                  [pid  3490] futex(0x2f546f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
                  [pid  3483] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
                  [pid  3483] <... futex resumed> )       = 0
                  [pid  3490] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3483] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3484] <... select resumed> )      = 0 (Timeout)
                  [pid  3484] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3485] <... select resumed> )      = 0 (Timeout)
                  [pid  3485] select(0, NULL, NULL, NULL, {5, 0} <unfinished ...>
                  [pid  3483] <... select resumed> )      = 0 (Timeout)
                  [pid  3490] sched_yield()               = 0
                  [pid  3490] futex(0x2f5472c, FUTEX_WAIT_PRIVATE, 1389795, NULL <unfinished ...>
                  [pid  3483] futex(0x2f5472c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x2f546f0, 1389796 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3483] <... futex resumed> )       = 1
                  [pid  3490] futex(0x2f546f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
                  [pid  3483] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
                  [pid  3483] <... futex resumed> )       = 0
                  [pid  3490] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3483] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3484] <... select resumed> )      = 0 (Timeout)
                  [pid  3484] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3483] <... select resumed> )      = 0 (Timeout)
                  [pid  3490] sched_yield()               = 0
                  [pid  3490] futex(0x2f5472c, FUTEX_WAIT_PRIVATE, 1389797, NULL <unfinished ...>
                  [pid  3483] futex(0x2f5472c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x2f546f0, 1389798 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3483] <... futex resumed> )       = 1
                  [pid  3490] futex(0x2f546f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
                  [pid  3483] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3483] <... futex resumed> )       = 1
                  [pid  3490] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1) = 0
                  [pid  3483] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3484] <... select resumed> )      = 0 (Timeout)
                  [pid  3484] select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
                  [pid  3483] <... select resumed> )      = 0 (Timeout)
                  [pid  3490] sched_yield()               = 0
                  [pid  3490] futex(0x2f5472c, FUTEX_WAIT_PRIVATE, 1389799, NULL <unfinished ...>
                  [pid  3483] futex(0x2f5472c, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x2f546f0, 1389800 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = 0
                  [pid  3483] <... futex resumed> )       = 1
                  [pid  3490] futex(0x2f546f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
                  [pid  3483] futex(0x2f546f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
                  [pid  3490] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)


                  • 0siris
                    Member
                    Zabbix Certified Specialist
                    • Nov 2010
                    • 76

                    #10
                    I already tried with an InnoDB buffer pool larger than the DB itself; it didn't help.

                    I tried the upgrade with truncated trends, trends_uint, history, history_uint and events tables. That went smoothly, but it's not exactly what I want.
                    So now I have my hopes set on the housekeeper: set it to a couple of weeks or so, let it run, upgrade the DB afterwards, and then set it back to the original values.
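
                    In practice I expect that to mean something like the following (sketch only; the backup table name is made up, and I'd keep the original per-item values so they can be restored):
                    Code:
                    -- Sketch only. Save the current per-item retention so it can be restored.
                    CREATE TABLE items_retention_backup AS
                        SELECT itemid, history, trends FROM items;

                    -- Drop retention to ~2 weeks of history and ~1 month of trends, then let
                    -- the housekeeper trim the big tables before running the upgrade.
                    UPDATE items SET history = 14, trends = 31;

                    -- After the upgrade, put the original values back:
                    UPDATE items i
                        JOIN items_retention_backup b ON b.itemid = i.itemid
                        SET i.history = b.history, i.trends = b.trends;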

                    I can see ZBX-5125 has been updated; I'm not sure if it helps with this case. I only have limited time, so unfortunately I can't test everything I'd like to with respect to the update script.


                    • heaje
                      Senior Member
                      Zabbix Certified Specialist
                      • Sep 2009
                      • 325

                      #11
                      @Kens

                      I don't really recall exactly how big the DB was when I did my migration (the history tables had at least a few hundred million rows in them), but I did have the luxury of having started my old installation on one of the alpha/beta versions of 2.0. As a result, I didn't end up needing to upgrade the schema of my history tables before migrating DBs (indexes excluded). I did follow the process I outlined, though, to migrate to a new DB that had the new indexes in the history tables. That way I avoided the problems that you are running into.

                      One mistake I made when copying over the old data was the manner in which I copied it. I wrote a stored procedure that looped through all my items and copied the history over one item at a time. That copy took a LONG time (several days). If I were to do it over, I would simply do a mysqldump of the history tables and import that. It would probably take a few hours at that point (after the import started).

                      However, the idea is still the same: simply copy everything except history/trends into another DB, upgrade that DB, point Zabbix at the new DB, upgrade the old DB (however long that takes - sounds like possibly weeks in your case), and then copy the old history into the new DB.
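
                      In shell terms, something along these lines (DB names, credentials and file names are placeholders, and the exact steps will need adapting to your setup):
                      Code:
                      # Placeholders throughout; adapt DB names and credentials.
                      # 1. Copy everything except the huge tables into a new DB.
                      mysqldump -u zabbix -p \
                          --ignore-table=zabbix.history \
                          --ignore-table=zabbix.history_uint \
                          --ignore-table=zabbix.history_str \
                          --ignore-table=zabbix.history_text \
                          --ignore-table=zabbix.history_log \
                          --ignore-table=zabbix.trends \
                          --ignore-table=zabbix.trends_uint \
                          zabbix > zabbix_config.sql
                      mysqladmin -u root -p create zabbix2
                      mysql -u zabbix -p zabbix2 < zabbix_config.sql

                      # 2. Create empty history/trends tables in zabbix2 from the 2.0 schema,
                      #    run the 2.0 upgrade patch against zabbix2, and point the Zabbix
                      #    server at zabbix2 so monitoring can resume.

                      # 3. Upgrade the old DB at leisure, then append its history to zabbix2
                      #    (--no-create-info so the new tables are not dropped; watch out for
                      #    overlapping time ranges if zabbix2 has already collected data).
                      mysqldump -u zabbix -p --no-create-info \
                          zabbix history history_uint trends trends_uint > zabbix_history.sql
                      mysql -u zabbix -p zabbix2 < zabbix_history.sql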

                      One option you could try for your history tables is to copy a small amount of data into a different table and then run the upgrade against that smaller amount (not sure how realistic that is). You could then copy that upgraded data into the new DB. Obviously this would have to be scripted, and it could take a while to finish the whole process. Looking at how long your previous upgrade attempts have taken, though, I would think that a slow copy would be preferable: you would slowly build up the history again in your new DB as data got copied over in chunks, instead of having to wait possibly a week or two until it finishes getting upgraded.
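
                      Copying in chunks could look roughly like this (illustrative only; the DB names and chunk size are placeholders, and it assumes both DBs are already on the 2.0 schema so the column lists match):
                      Code:
                      -- Illustrative only. Copy already-upgraded history one day (86400 s) at a time.
                      SET @from := (SELECT MIN(clock) FROM old_zabbix.history);
                      SET @to   := @from + 86400;

                      INSERT INTO zabbix2.history
                      SELECT * FROM old_zabbix.history
                      WHERE clock >= @from AND clock < @to;

                      -- Script this in a loop, advancing @from/@to by a day each pass until
                      -- MAX(clock) is reached; repeat for history_uint, trends and trends_uint.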


                      • kens
                        Junior Member
                        • Oct 2006
                        • 27

                        #12
                        If I do it at all, I may have to do this in chunks as you suggest.

                        My concerns at the moment are:
                        - my lack of understanding of the details of the 1.8 and 2.0 schemas, and therefore how well equipped I am to script a convert-in-chunks method
                        - how long it might take for me to gain that understanding and test my scripts
                        - whether the conversion that is currently in progress is actually doing anything of any validity, and whether the weeks of conversion processing may just be a futile waste of electricity keeping my office warmer than it needs to be
                        - whether retaining the history is worth that effort, frustration and hassle

                        The last point is a key one, as I could simply keep the history in 1.8 format and keep a 1.8 system in a VM so that I can query it if I need to.

                        I'm concerned that the Zabbix development team seem either to have infinite processing power available or not to understand these issues, as a product that could take potentially months to upgrade in real life does not seem to me to be fit for production. I know that's a strong statement, but data migration between versions is a key issue, especially when the schema changes.

                        I appreciate that this is an open source project, and I am genuinely grateful for the development effort on what could be a splendid product. But an acknowledgement that this is an issue might be a step forward, even if the reply were "we accept it as a problem, but resources to develop an alternative are not readily available". That might help regain some confidence.

                        Thanks

                        Ken


                        • ghoz
                          Senior Member
                          • May 2011
                          • 204

                          #13
                          Just FYI, I came up with the following procedure while testing the upgrade.


                          Basically it's what @heaje suggests: upgrading the configuration before the historical data.
                          You do, however, have to be careful with database IDs, so there is no clash when re-importing the migrated data.

                          Another point of interest is making sure you don't have any duplicates in the DB, as the update process is not forgiving, and some past bugs in the frontend and API would let you create duplicates...
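
                          For example, something like this shows whether there are duplicates before you start (hosts and items are just examples; adapt the table and key columns to whatever the upgrade complains about):
                          Code:
                          -- Hosts sharing the same name:
                          SELECT host, COUNT(*) AS n
                          FROM hosts
                          GROUP BY host
                          HAVING n > 1;

                          -- Items sharing the same key on one host:
                          SELECT hostid, key_, COUNT(*) AS n
                          FROM items
                          GROUP BY hostid, key_
                          HAVING n > 1;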


                          • dougbee
                            Member
                            • Apr 2011
                            • 68

                            #14
                             I performed a trial run of upgrading our Zabbix 1.8.12 DB using the 2.0.2 scripts. I'm doing the actual switchover in a couple of weeks; this was just to see if the DB upgrade scripts were feasible.

                             I did a mysqldump of the current DB and restored it to a new, empty DB on the same machine. I don't have exact specifics, but we're running a 32-bit MySQL on a 6-processor System/P. I'll get more info if that would help.

                            Between history, history_uint, trends, trends_uint there are about 160 million rows.

                            Zabbix itself is running under VMware. A quick/rough timeline of steps:

                             1. Stop the Zabbix server, run mysqldump. (This is being done from the Zabbix server itself, so the backup runs over the network.) Time was about 1 hour. The resulting bzipped file was only 700 MB; uncompressed it is ~5 GB.

                             2. Restore (using bzcat) to the new zabbix2 DB. I believe it took a few hours.

                             3. Run the DB upgrade scripts. Time was 16 hours.
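
                             For anyone who wants to reproduce it, the pipeline was essentially the following (host names, users and paths are placeholders; the patch location is relative to wherever you unpacked the 2.0 sources):
                             Code:
                             # Placeholders for hosts, users and paths; adjust as needed.
                             # 1. Dump and compress over the network from the Zabbix server:
                             mysqldump -h dbhost -u zabbix -p zabbix | bzip2 > zabbix-1.8.sql.bz2

                             # 2. Restore into a new, empty DB on the database machine:
                             mysqladmin -u root -p create zabbix2
                             bzcat zabbix-1.8.sql.bz2 | mysql -u zabbix -p zabbix2

                             # 3. Run the 2.0 DB upgrade patch against the new DB:
                             mysql -u zabbix -p zabbix2 < upgrades/dbpatches/2.0/mysql/patch.sql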

                            I don't know how our System/P compares to the machines everyone else is running, but if I can gather more info that would help, let me know.


                            • dougbee
                              Member
                              • Apr 2011
                              • 68

                              #15
                               I'm wondering if anyone has tested the updated script in ZBX-5359, and if there was a definite speed improvement. Thanks!
