possible performance/crash issue #13

anthonysomerset · 2020-03-27T16:06:31Z

anthonysomerset
Mar 27, 2020

Hi there

I need to caveat that this is is an i think scenario not confirmed as yet

i appear to be having routine performance issues with influx... now i would expect this not to be an issue here but it is appearing that as a result of said issues the zabbix module is getting backed up trying to write data to influx and not timing out or failing in any way.

this is appear to cause some silent backlog in zabbix and suddenly no data at all is being inserted into zabbix.

restart zabbix server and/or influxd seems to fix it (trying to narrow down which specific daemon at present)

raising this somewhat pre-emptively in case there is an obvious reason why the influx write would block up zabbix server like this

EDIT - semi confirmed this - my influxd instance is currently disk IO constrained and this backup seems to be causing the writer on the Zabbix side to get stuck and everything to get stuck behind it

the moment i stop influx the connection drops and things resume nearly immediately

i-ky · 2020-03-27T23:28:14Z

i-ky
Mar 27, 2020
Maintainer

Hi! Huge thanks for giving some feedback!

This behaviour is expected, it is a deliberate consequence of several design decisions I made. Let me explain a bit.

History export modules are not first-class citizens in Zabbix. There is no caching/buffering and no retries implemented on Zabbix side, which would allow modules to make a pause in sending data to alternative backends and catch up later. Module has a single chance of sending data to storage backend. If backend is having performance issues, module is basically between hammer and anvil. There are following options available (and none of them is particularly good):

Drop the data. Very simple solution, no performance issues. But it will cause data loss, this would undermine trust of module users and eventually no one would use it.
Buffering. It is a viable solution for temporary backend performance problems, but it would significantly increase complexity of the module. And data loss could still potentially happen if Zabbix decided to stop while module had loads of buffered data. Module would need to dump buffers on disk, which would increase complexity even further.
Block and wait. Another simple option. And this ensures consistency between Zabbix database and InfluxDB. These are two reasons why I have chosen this option. Timing out makes no sense in this strategy, module has no choice but to be fully committed. As a downside, when InfluxDB performs poorly, Zabbix will perform poorly as well.

In a way current module's behaviour mirrors the relationship between Zabbix database and Zabbix server - when database performs poorly, Zabbix may have troubles too. That's why it is very important to monitor health status of Zabbix database and, if you are using this module, you should monitor availability of InfluxDB as well. Preferably, with an independent monitoring setup.

I understand that current module's behaviour may not fit all use cases, so please let me know what are your thoughts on this topic. Would you prefer option 1? It is relatively easy to add a configuration parameter for this strategy. Or do you want me to pursue option 2? With enough support from other users I might give this a try. Again, with some support from other users, we can even push Zabbix to implement buffering on their side.

0 replies

i-ky · 2020-03-27T23:39:49Z

i-ky
Mar 27, 2020
Maintainer

I am reading that there is influxdb-relay which can do the buffering. Maybe you can plug it into your setup between module and InfluxDB?

0 replies

anthonysomerset · 2020-03-29T07:06:53Z

anthonysomerset
Mar 29, 2020
Author

Hi thanks for the feedback it helps my understanding and at least we have confirmed that the behaviour is by design - it would be nice to have a configurable option to drop if timedout or similar but i think by default it should be off by default to retain current behaviour for all

i am actually going to look at one of the influx relay binaries to solve this problem in the short term most likely while i review the DB scaling issue i have on the influx side which looks entirely unrelated to the zabbix data feed but something entirely seperate

0 replies

i-ky · 2020-03-29T12:11:11Z

i-ky
Mar 29, 2020
Maintainer

You are welcome! Feel free to throw all your questions at me.

Please share your findings afterwards. I think they will make a good addition to the documentation.

0 replies

i-ky · 2020-05-04T21:27:35Z

i-ky
May 4, 2020
Maintainer

@anthonysomerset,
Today I accidentally recalled "database watchdog" functionality Zabbix has to monitor and alert on unavailability of its own database. Given a module that can check availability of its backend, Zabbix could use the same alerting mechanism to notify you that module's storage backend is not working. So I have created ZBX-17674. Please add your vote if you think this could be a good solution.

2 replies

i-ky Aug 8, 2021
Maintainer

The latest response from Zabbix does not make me feel optimistic...

i-ky Jul 18, 2022
Maintainer

Official loadable module documentation now says:

In case of internal error in history export module it is recommended that module is written in such a way that it does not block whole monitoring until it recovers but discards data instead and allows Zabbix server to continue running.

@anthonysomerset, do you think that module's behaviour needs to be changed in order to match the above recommendations?

anthonysomerset · 2022-07-27T23:40:39Z

anthonysomerset
Jul 27, 2022
Author

It’s not ideal but it would make sense if this behaviour could be turned on and off depending on users preference. Perhaps default to not invoking the discard behaviour to preserve current behaviour for users That being said we have moved away from using this module due to the availability and native support for timescaledb and compression in zabbix which makes less moving parts for us to maintain On 18 Jul 2022, at 03:26, i-ky ***@***.***> wrote: Official loadable module documentation<https://www.zabbix.com/documentation/current/en/manual/config/items/loadablemodules#providing-history-export-callbacks> now says: In case of internal error in history export module it is recommended that module is written in such a way that it does not block whole monitoring until it recovers but discards data instead and allows Zabbix server to continue running. @anthonysomerset<https://github.com/anthonysomerset>, do you think that module's behaviour needs to be changed in order to match the above recommendations? — Reply to this email directly, view it on GitHub<#13 (reply in thread)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AACTE2GGERX4GEEOLJHGTXDVUUBIZANCNFSM5BYJ4N2A>. You are receiving this because you were mentioned.Message ID: ***@***.***>

1 reply

i-ky Jul 28, 2022
Maintainer

It’s not ideal but it would make sense if this behaviour could be turned on and off depending on users preference. Perhaps default to not invoking the discard behaviour to preserve current behaviour for users

Makes sense indeed. I will create an issue (#18).

That being said we have moved away from using this module due to the availability and native support for timescaledb and compression in zabbix which makes less moving parts for us to maintain

A bit sad to hear, but totally understandable. I am glad that Zabbix now provides some sort of a solution out of the box.

Thank you for the feedback!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

possible performance/crash issue #13

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

possible performance/crash issue #13

anthonysomerset Mar 27, 2020

Replies: 6 comments · 3 replies

i-ky Mar 27, 2020 Maintainer

i-ky Mar 27, 2020 Maintainer

anthonysomerset Mar 29, 2020 Author

i-ky Mar 29, 2020 Maintainer

i-ky May 4, 2020 Maintainer

i-ky Aug 8, 2021 Maintainer

i-ky Jul 18, 2022 Maintainer

anthonysomerset Jul 27, 2022 Author

i-ky Jul 28, 2022 Maintainer

anthonysomerset
Mar 27, 2020

Replies: 6 comments 3 replies

i-ky
Mar 27, 2020
Maintainer

i-ky
Mar 27, 2020
Maintainer

anthonysomerset
Mar 29, 2020
Author

i-ky
Mar 29, 2020
Maintainer

i-ky
May 4, 2020
Maintainer

i-ky Aug 8, 2021
Maintainer

i-ky Jul 18, 2022
Maintainer

anthonysomerset
Jul 27, 2022
Author

i-ky Jul 28, 2022
Maintainer