Opening Latest data is very slow

  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #31
    To everyone who has issues with slow access to history data:
    Have you ever looked at your Zabbix database monitoring?
    Do you monitor your DB backend?
    Do you monitor storage IO?

    Your slow access to data is not a Zabbix issue but a too-weak DB backend.
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • steveboyson
      Senior Member
      • Jul 2013
      • 582

      #32
      Originally posted by kloczek
      To everyone who has issues with slow access to history data:
      Have you ever looked at your Zabbix database monitoring?
      Do you monitor your DB backend?
      Do you monitor storage IO?

      Your slow access to data is not a Zabbix issue but a too-weak DB backend.
      I don't think so. We have a quite fast backend and with 2.0.10 everything was OK. This misbehaviour was introduced in 2.2.something.


      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #33
        Originally posted by steveboyson
        I don't think so. We have a quite fast backend and with 2.0.10 everything was OK. This misbehaviour was introduced in 2.2.something.
        Don't get me wrong, but what you wrote does not answer my questions.
        Even with a "quite fast backend" you may be using it wrongly.
        So please have a look at your DB backend stats and try to tell us:
        - What is the ratio between read and write IOs?
        - Do you have partitioned history and trends tables?
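        On Linux, the first question can be answered from the cumulative counters in /proc/diskstats; a minimal sketch, assuming the device is `sda` and using fields 4 and 8 (reads and writes completed, per the kernel's iostats documentation) — the counter values below are made up for illustration:

```python
def io_ratio(diskstats_line: str) -> float:
    """Return reads-completed / writes-completed for one /proc/diskstats line."""
    fields = diskstats_line.split()
    reads, writes = int(fields[3]), int(fields[7])
    return reads / writes

# On a real system: line = next(l for l in open("/proc/diskstats") if " sda " in l)
line = "8 0 sda 120000 300 960000 5000 3600000 900 28800000 91000 0 64000 96000"
print(f"read:write = 1:{1 / io_ratio(line):.0f}")  # → read:write = 1:30
```

        Sampling the counters twice and subtracting gives the rate over an interval rather than the lifetime ratio.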

        • steveboyson
          Senior Member
          • Jul 2013
          • 582

          #34
          Originally posted by kloczek
          Don't get me wrong, but what you wrote does not answer my questions.
          Even with a "quite fast backend" you may be using it wrongly.
          So please have a look at your DB backend stats and try to tell us:
          - What is the ratio between read and write IOs?
          - Do you have partitioned history and trends tables?
          Neither partitioned history nor trends tables (I never read that this is a prerequisite, rather an optional optimization step).

          But now do not get me wrong: how come a bigger database (more host agents, more items & triggers, DB size ~40 GB) works fine on 2.0.10 while a smaller installation (maybe one fifth the size of the previous, ~8 GB) lacks "latest data" performance on 2.2.10?

          If you can explain that, I will look into database optimization.

          And please be so kind as to tell me how one could use a database "wrongly".
          The smaller install is a plain vanilla install without custom scripts, while the former is heavily customized with external scripts. So?


          • kloczek
            Senior Member
            • Jun 2006
            • 1771

            #35
            Originally posted by steveboyson
            Neither partitioned history nor trends tables (I never read that this is a prerequisite, rather an optional optimization step).

            But now do not get me wrong: how come a bigger database (more host agents, more items & triggers, DB size ~40 GB) works fine on 2.0.10 while a smaller installation (maybe one fifth the size of the previous, ~8 GB) lacks "latest data" performance on 2.2.10?

            If you can explain that, I will look into database optimization.

            And please be so kind as to tell me how one could use a database "wrongly".
            The smaller install is a plain vanilla install without custom scripts, while the former is heavily customized with external scripts. So?
            MySQL, like many other RDBMSs, uses B-trees to access table row data.
            These engines also group parts of the B-tree structure into blocks. If:

            B – records per block
            N – total number of records

            then a B-tree lookup costs about log N / log B disk reads or writes (the height of the tree).

            So, as you can see, beyond a certain DB/table size even inserting a single record forces more than one IO on physical storage, and as the DB grows the number of IOs per insert grows too (log N grows while log B is constant).

            Now assume you store, say, only the last two weeks of raw data, and redo the same math: the number of records searched will be not N but N/14, with the data kept in 14 partitions created daily.
            When most of your read operations fall within the last day's time window, the B-tree structures of the older partitions will not even sit in the page cache, your DB caches, or other caches (like the ARC when using ZFS).
            Under such conditions, with partitioned tables, you do far fewer IOs because you search and write much smaller files. Additionally, in the case of Zabbix, you write only to today's partition.

            In other words, with partitioned tables you have much lower average and peak IO/s.
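            The arithmetic above can be sketched quickly; the block size B = 100 and the one-billion-row table are purely illustrative figures, not measurements:

```python
import math

def btree_height(n: int, b: int) -> int:
    """Approximate IOs per lookup: the height of a B-tree holding
    n records with b records per block, i.e. ceil(log n / log b)."""
    return math.ceil(math.log(n) / math.log(b))

# One monolithic history table vs. one of 14 daily partitions.
full = btree_height(10**9, 100)         # lookup cost against the big table
daily = btree_height(10**9 // 14, 100)  # lookup cost against a single partition

print(full, daily)  # → 5 4
```

            Shaving one level off the tree removes one IO from every uncached lookup and insert, which is exactly the effect described above.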

            With partitioned history tables, instead of having the housekeeper delete the oldest data, doing possibly millions of IOs in a single housekeeping cycle, you do only a few IOs to create the new empty partition for the next day, and you delete the oldest data by dropping the oldest partition (effectively deleting a single file).
            Housekeeping Zabbix data with DELETE queries additionally hurts all read IO/s. Why? Because before data can be deleted, the DB engine sometimes needs many read IOs just to locate the areas to be released. Those reads evict useful content from the read caches, so they not only add read-IO bandwidth but also lower the cache hit ratio, which causes a new wave of reads.

            Now another trick. Assuming the majority of the data your Zabbix users are interested in sits in the last 24h time window, you can calculate quite precisely how much memory (RAM) you need in the InnoDB buffer pool or the ZFS ARC (if you are using Solaris and ZFS). With enough cache, the ratio between read and write IOs should be at least about 1:30 to 1:50 on average.
            A well-architected DB backend should be able to keep all MRU/MFU data in memory caches, so the speed of your IO system doesn't really matter.
            So with a big enough cache you get the effect that even a non-in-memory DB engine behaves like an in-memory one, and the log N / log B read IOs are replaced by only a few IO/s.
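            The RAM estimate is back-of-the-envelope arithmetic; the 1,000 new values per second and ~100 bytes per stored history row (row plus index overhead) are assumptions for illustration, not Zabbix-documented figures:

```python
def hot_cache_gib(nvps: float, window_hours: float = 24,
                  bytes_per_row: int = 100) -> float:
    """RAM needed to keep the last `window_hours` of history rows
    (index overhead folded into bytes_per_row) fully in cache."""
    return nvps * window_hours * 3600 * bytes_per_row / 2**30

print(f"{hot_cache_gib(1000):.1f} GiB")  # → 8.0 GiB buffer-pool/ARC budget
```

            If the buffer pool or ARC is at least this large, reads for the hot window never touch disk, which is where the 1:30 to 1:50 read:write ratio comes from.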

            As you can see, even with a "quite fast backend", if you don't know the above you have a very high chance of getting it wrong.

            You mention that you have a 40 GB database. Mine is more than 400 GB (and that is roughly an average size; people sometimes have databases of a few TB or even more). I just checked the last 24h IO rates:
            - read: min 0.7 IO/s, avg 15.4 IO/s, max 235 IO/s
            - write: min 110 IO/s, avg 654 IO/s, max 2060 IO/s

            If your storage does more reads than writes on average, that is the first red warning light that your DB backend is not "quite fast", because it is being used wrongly.

            Just as it is possible to write very bad code in even a perfect programming language, a "quite fast backend" can be used very badly.

            The above is not database tuning. It is nothing more than matching the DB backend architecture to the DB workload, which can improve speed not by a few or a few tens of percent, but by several times, or even tens of times.
            Last edited by kloczek; 28-09-2015, 16:30.

            • steveboyson
              Senior Member
              • Jul 2013
              • 582

              #36
              Thanks for your extensive explanation. Since I have a degree in computer science, I was aware of the logic of database systems, though not to that depth of detail.

              Nevertheless, it explains almost nothing about why this happens in 2.2.x and not in 2.0.x.

              By the way, both databases live on SSD storage arrays. IOPS > 50,000 pose no problem.

              The question persists: what happened to "latest data" in 2.2.x?


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #37
                Originally posted by steveboyson
                Thanks for your extensive explanation. Since I have a degree in computer science, I was aware of the logic of database systems, though not to that depth of detail.
                Computer science programs are quite often focused more on programming skills and on understanding modern technologies than on a proper mathematical and logical background, so much depends on where you got your CS education.

                Nevertheless, it explains almost nothing about why this happens in 2.2.x and not in 2.0.x.
                You should stop using 2.2.x and start preparing an upgrade to 2.4.
                Not all Zabbix data can be cached equally effectively in generic DB or FS caches, and 2.4 caches more data in the Zabbix server processes, using approaches/algorithms tightly suited to Zabbix's needs. So far, every major Zabbix release upgrade has decreased the number of selects.
                The effect is especially visible on switching to 2.4.

                By the way, both databases live on SSD storage arrays. IOPS > 50,000 pose no problem.
                Every IO causes an interrupt. Some of those interrupts cause a context switch between processes. I guarantee that with your SSDs you will saturate the interrupt rate and/or the number of context switches per second before you reach max IO/s or max bandwidth.
                BTW: when did you last check cs/s or interrupts/s on your system?
                Do you have system.cpu.switches[] and system.cpu.intr[] in your Zabbix monitoring?
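                On Linux, those two items boil down to the cumulative `ctxt` and `intr` counters in /proc/stat; a minimal sketch of extracting them (the sample text below stands in for the real file):

```python
def stat_counters(stat_text: str) -> dict:
    """Pull the cumulative context-switch and interrupt counters
    out of /proc/stat content (lines 'ctxt N' and 'intr N ...')."""
    counters = {}
    for line in stat_text.splitlines():
        if line.startswith(("ctxt ", "intr ")):
            fields = line.split()
            counters[fields[0]] = int(fields[1])
    return counters

# On a real system: stat_text = open("/proc/stat").read()
sample = "cpu 1 2 3 4\nintr 987654 1 2 3\nctxt 1234567\nbtime 1443420000\n"
print(stat_counters(sample))
```

                Take two snapshots and divide the difference by the interval to get cs/s and interrupts/s, which is what the Zabbix agent items report.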

                The question persists: what happened to "latest data" in 2.2.x?
                Really... do not expect many replies if you are not using the latest stable Zabbix version and do not have paid Zabbix support.
                Personally, for example, I'm more and more interested in the upcoming 3.0 than in 2.4.

                • steveboyson
                  Senior Member
                  • Jul 2013
                  • 582

                  #38
                  2.2.x is so far the one and only LTS release.
                  This is enough reason for us (and our contract customers) to stay on that.

                  And from a practical point of view, for our customers it's not interesting at all *how* database systems work internally. They just *HAVE TO* work. That's their purpose.

                  We're in an epoch where hardware is cheaper than software design (as opposed to the 80s).
                  A weakly written/designed application may suffer from poor performance, but up to a certain point this can be solved with more powerful hardware.
                  This is what we do, with good to excellent results.

                  2.0.10 is perfect but lacks features; 2.2.10 has more features but here and there lacks performance. We have not tried 2.4 so far because we can live with the weirdness of 2.2.10.
                  As soon as there is a new LTS release we will give it a try. And no, I will not start partitioning DB tables unless our DBs grow to more than 200 GB.

                  Our focus is application monitoring (not "system monitoring"). Applications are built on top of (or around) systems, so monitoring systems gives you only a partial view of how sane your environment is. Application monitoring is more than that.

                  We rather prefer to define an application (or call it "service oriented") view than a system (or "host") view.

                  The counter for "system context switches" rather bores me; I am interested, for example, in knowing whether a customer's request can be fulfilled within a defined period of time.
                  Only when performance is weak can such numbers (and the like) give me a clue where the power goes ;-) Then I am interested in system.cpu.switches or system.cpu.intr. Of course we measure these, but they never appear in any screens or graphs.

                  .... that's why I asked about "latest data", since it is then the only way to get a historical view of the values. And "latest data" is obviously broken since 2.2.x. The discussion starts here again where we began a couple of pages ago ...


                  • kloczek
                    Senior Member
                    • Jun 2006
                    • 1771

                    #39
                    Originally posted by steveboyson
                    2.2.x is so far the one and only LTS release.
                    This is enough reason for us (and our contract customers) to stay on that.
                    IMO LTS is for environments where the number of changes is limited and where the first rule of maintenance is MacGyver's first rule ("If it ain't broke, don't fix it").
                    In every other environment you simply cannot ignore that switching to 2.4 may give you, for example, significantly fewer selects/s.

                    We're in an epoch where hardware is cheaper than software design (as opposed to the 80s).
                    A weakly written/designed application may suffer from poor performance, but up to a certain point this can be solved with more powerful hardware.
                    This is what we do, with good to excellent results.
                    A little off-topic...
                    This is a one-dimensional generalization, and like all generalizations it is likely not to hold in a particular case. Such a generalization could be true for an exact type of workload anchored at a point in time. The problem is that not only hardware changes, but software as well.
                    A few decades ago a single computer could run applications for a few hundred thousand users sitting at teletype terminals. Today a single PC can handle a similar number of HTTP sessions, yet what needs to be done within a single HTTP session is completely different from what computers were doing 20-30 years ago. So, if you like, you could form a two-dimensional generalization here; in reality the problem is not 2-D but N-D, because the number of users has grown by many orders of magnitude. We now use computers for things that previously were not possible, even on a typical smartphone many times more powerful than the computers of 20-30 years ago, and nobody today uses those devices mainly for complicated mathematical computations.

                    Be careful with generalizations: they are useful only for showing certain aspects of a problem, and they never point to the exact solution of MyProblem(tm).

                    And from a practical point of view, for our customers it's not interesting at all *how* database systems work internally. They just *HAVE TO* work. That's their purpose.
                    Lack of understanding of the issue cannot be an excuse.
                    I quite often quote the saying that it is not stupid not to know something; it is stupid not to want to know.
                    The cost of hardware grows exponentially with performance. Many people think about computers as CPU + RAM. CPUs are very powerful today, but systems are much more complicated, and they are only as strong as their biggest bottleneck. Most people cannot list more than two or three possible bottlenecks in a modern system, but in reality there are many more areas that can make an application run very slowly in the end. Without a proper/adequate understanding of the whole stack (from hardware aspects, through the OS layer, sometimes down to base libraries, and finishing with the limitations of the virtual machine executing your application code: PHP, Perl, Ruby, Python, JS, Java VMs) you have a really small chance of building your complicated product stack.

                    As soon as there is a new LTS release we will give it a try. And no, I will not start partitioning DB tables unless our DBs grow to more than 200 GB.

                    Our focus is application monitoring (not "system monitoring"). Applications are built on top of (or around) systems, so monitoring systems gives you only a partial view of how sane your environment is. Application monitoring is more than that.

                    We rather prefer to define an application (or call it "service oriented") view than a system (or "host") view.

                    The counter for "system context switches" rather bores me; I am interested, for example, in knowing whether a customer's request can be fulfilled within a defined period of time.
                    Do you know that on x86 systems you can do only about 1k cs/s per CPU core, as long as your applications are involuntarily taken off the CPU?
                    Take a typical Java application that creates many threads, and say each one is so CPU-hungry that it sits almost constantly in the OS run queue. Result: your application will be choking. Why? Because none of the threads stays on a CPU long enough to finish its work. Now take an application that additionally processes so much data that it has a very low hit ratio in the CPU caches. Result: your application chokes while you see very low CPU usage. Why? Because the CPU spends most of its time waiting for new pages of memory to be delivered from RAM into the CPU caches. Can you solve this with more powerful hardware? Of course you can, but you can also solve a bottleneck sitting in the rate of data exchanged between the L1 and L2 or L3 (sometimes even L4) caches by binding to specific cores, minimizing the overhead of thread/process migration between cores, or by grouping threads that do similar work, minimizing the miss/hit ratio of per-core caching.
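                    The core-binding suggestion can be expressed directly on Linux via the scheduler-affinity syscalls; this sketch pins the current process to its first allowed core and then restores the original mask (Linux-only; the core choice is illustrative):

```python
import os

# Linux-only: sched_getaffinity/sched_setaffinity wrap the affinity syscalls.
original = os.sched_getaffinity(0)     # 0 = the calling process
first_core = min(original)

os.sched_setaffinity(0, {first_core})  # pin: the process now stays on one core
pinned = os.sched_getaffinity(0)

os.sched_setaffinity(0, original)      # restore the original CPU mask
print(pinned)
```

                    Pinning keeps a hot thread's working set in one core's L1/L2, avoiding the cache refills that each cross-core migration costs.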

                    Only when performance is weak can such numbers (and the like) give me a clue where the power goes ;-) Then I am interested in system.cpu.switches or system.cpu.intr. Of course we measure these, but they never appear in any screens or graphs.

                    .... that's why I asked about "latest data", since it is then the only way to get a historical view of the values. And "latest data" is obviously broken since 2.2.x. The discussion starts here again where we began a couple of pages ago ...
                    This is why the majority of monitoring data should not be about watching exact aspects of the running system and applications, but about collecting additional data that may be useful when WeHaveSomeIssue(tm), so that you are not completely blind.

                    Have a look at SpaceX and the recent mishap with the last Falcon 9. They had about 40k metrics on the flying rocket. Some were sampled so frequently that it was possible to locate where the initial explosion happened by observing the shock wave passing over microphones in many locations, and to use that data to almost triangulate the place where everything started.
                    The same goes for monitoring computers. Most of the data should be collected "just in case", and, just as the people at the screens in the SpaceX control room watch alarms on only a few crucial aspects, only a few things should get an additional alerting layer.

                    As long as you have free disk space/IO bandwidth, and your monitoring does not create what I call a "monitoring quantum effect" (when the state of the monitored object is affected by the monitoring), you should collect as much as you can.
                    In this context you should never say something like "the counter for system context switches rather bores me".

                    • steveboyson
                      Senior Member
                      • Jul 2013
                      • 582

                      #40
                      Sorry, I do not want to be offensive or rude at all, but: in my opinion your elaborate explanations do not add any new insights on the topic.

                      At the same time you seem to declare all users with this particular type of problem to be dumbasses who do nothing about their environment; at least that impression could arise.

                      In fact, there *IS* a problem.



                      Anyway, thanks for the lot of time you spent. But: I'm giving up.


                      • kloczek
                        Senior Member
                        • Jun 2006
                        • 1771

                        #41
                        Originally posted by steveboyson
                        At the same time you seem to declare all users with this particular type of problem to be dumbasses who do nothing about their environment; at least that impression could arise.
                        If you found any such declaration in what I wrote... please stop reading between the lines and try to read what I wrote literally.
                        I wrote that as long as you don't know what kind of IO characteristics you are seeing, you must first start observing them (I still don't know whether the majority of your IOs are writes, as they should be).
                        If someone is unaware of some typical DB aspects, it will be even more difficult.

                        I gave you a few simple questions and pieces of advice on what to look at first. Try to write down something about those details; they may point to what exactly is wrong with your setup.
                        IMO your DB backend setup is generally somehow incorrect, and it has nothing to do with Zabbix. A 40 GB DB is not a big deal.

                        Yes, you can find many people like you having the same problems, but only because many people look at Zabbix or its DB backend as black boxes and/or don't even know what to observe to find the root cause.

                        • steveboyson
                          Senior Member
                          • Jul 2013
                          • 582

                          #42
                          Originally posted by kloczek
                          ...
                          I gave you a few simple questions and pieces of advice on what to look at first.
                          ...
                          The most obvious advice (which is: let the developers fix the still-open issues) you are not giving. Anyway, as I mentioned before, I will not discuss this topic any longer, as we have different understandings of the importance, the reason, and the solution. Please respect this. Thank you.

                          Again, trying to be polite:
                          please save your time and do some consulting, or explain to someone else how databases work and how service providers, in your opinion, have to do their jobs.

                          Don't get me wrong: I accept your point of view. I just have a different one.

                          Now I'm really out here.
