LogStash fails to acquire lock causing LockException. #16173
Thanks for filing the issue. Could you please provide the information requested on the issue template? That is:
Thank you! |
Hi @roaksoax, Thank you for acknowledging the issue. Please find below the requested information. Logstash Version: Installation Source with Steps followed to install: Pipeline sample causing the issue:
Steps to reproduce the issue: Logs (providing the traceback is not enough). |
The backtrace from the issue description appears to show the JRuby runtime hitting an NPE in the course of interpreting and running our code (as opposed to our code failing to acquire a lock). This is not a normal scenario, and is certainly caused by a bug in JRuby.
The NPE appears to have occurred while starting the pipeline, after the Logstash code had acquired the PQ's lock. That lock is two-fold: when opening the queue, we first ensure that no other process has the queue open using an on-disk lock file, and then ensure that the current process only opens it once. It appears that both levels of locks had been acquired before JRuby threw the NPE.
The Agent's config state converger prevented the exception from crashing the Logstash process, but the queue's lock was left in a locked state. Because the lock is supposed to live beyond the starting of the pipeline, there is no implicit auto-close handling. Since the lock had been acquired and was not released, subsequent reloads of the pipeline cannot acquire it. The only way to get the pipeline running again is to stop the process, manually remove the offending queue's lock file, and restart the process.
Logstash 8.11.3's distribution from Elastic is bundled with JRuby 9.4.5.0 and Adoptium's JDK 17.0.9p9, but since you have built from source there are a number of additional variables at play, so it would also be helpful to know:
|
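For readers following along, here is a minimal Ruby sketch of the failure mode described above: a queue guarded by an in-process registry plus an on-disk lock file, where an exception raised after the lock is taken leaves it held. `SketchQueue`, `LockError`, `start_pipeline`, and the `.lock` file name are all hypothetical names for illustration, not Logstash's implementation, and the order of the two checks is a choice made for the sketch.

```ruby
require "set"
require "tmpdir"

# Hypothetical stand-in for a queue guarded by two levels of locking.
class SketchQueue
  class LockError < StandardError; end

  # Directories this process currently has open (in-process level).
  OPEN_DIRS = Set.new
  REGISTRY_MUTEX = Mutex.new

  def initialize(dir)
    @dir = dir
    @lock_file = nil
  end

  def open
    # In-process level: refuse to open the same directory twice.
    added = REGISTRY_MUTEX.synchronize { OPEN_DIRS.add?(@dir) }
    raise LockError, "already open in this process: #{@dir}" unless added

    # On-disk level: non-blocking exclusive lock on a lock file.
    file = File.open(File.join(@dir, ".lock"), File::RDWR | File::CREAT, 0o644)
    unless file.flock(File::LOCK_EX | File::LOCK_NB)
      file.close
      REGISTRY_MUTEX.synchronize { OPEN_DIRS.delete(@dir) }
      raise LockError, "lock file is held by another process: #{@dir}"
    end
    @lock_file = file
    self
  end

  def close
    return unless @lock_file
    @lock_file.flock(File::LOCK_UN)
    @lock_file.close
    @lock_file = nil
    REGISTRY_MUTEX.synchronize { OPEN_DIRS.delete(@dir) }
  end
end

# Simulates a pipeline start that fails after the lock was acquired.
def start_pipeline(queue)
  queue.open
  raise "simulated runtime failure during pipeline start"
end

Dir.mktmpdir do |dir|
  leaked = SketchQueue.new(dir)
  begin
    start_pipeline(leaked)
  rescue RuntimeError => e
    # The caller survives the failure, but nothing released the lock.
    puts "pipeline start failed: #{e.message}"
  end

  begin
    SketchQueue.new(dir).open # a "reload" of the same pipeline
  rescue SketchQueue::LockError => e
    puts "reload failed: #{e.message}"
  end

  leaked.close # only so the temporary directory can be cleaned up
end
```

In this sketch the second `open` fails for as long as the process lives, which matches the reported symptom that reloads keep raising a lock error until the process is restarted.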
Hi @yaauie , Thank you for sharing the detailed analysis. Please find the below info, which you have requested.
|
Hi @yaauie, Just to add, we have observed the LockException multiple times, and the pattern is always that it is accompanied by some other error. Some of them are like
Proposal: Can we release the on-disk lock in this part of the code, https://github.com/elastic/logstash/blob/main/logstash-core/src/main/java/org/logstash/ackedqueue/Queue.java#L175-L177, by invoking the method releaseLockAndSwallow()? |
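The idea in this proposal (release the lock whenever the open step fails after acquiring it) can be sketched as follows. This is an illustrative Ruby sketch, not the actual Queue.java code: `SketchLock` and `open_with_cleanup` are hypothetical names, and the lock here is a simple in-memory flag.

```ruby
# Hypothetical in-memory stand-in for a queue lock.
class SketchLock
  class LockError < StandardError; end

  def initialize
    @mutex = Mutex.new
    @held = false
  end

  def acquire
    @mutex.synchronize do
      raise LockError, "lock already held" if @held
      @held = true
    end
  end

  def release
    @mutex.synchronize { @held = false }
  end

  def held?
    @mutex.synchronize { @held }
  end
end

# Acquires the lock, runs the risky setup in the block, and releases the
# lock again if the block does not complete. On success the lock stays
# held, because it is meant to outlive the open step.
def open_with_cleanup(lock)
  lock.acquire
  opened = false
  begin
    yield
    opened = true
  ensure
    lock.release unless opened
  end
end

lock = SketchLock.new
begin
  open_with_cleanup(lock) { raise "simulated failure while opening" }
rescue RuntimeError => e
  puts "open failed: #{e.message}"
end
puts "lock still held after failure? #{lock.held?}"

open_with_cleanup(lock) { :opened } # a retry can now acquire the lock
puts "lock held after successful open? #{lock.held?}"
```

Using `ensure` with a success flag, rather than rescuing a specific exception class, means the lock is released on any abnormal exit from the block while the original error still propagates to the caller.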
Hi, |
Hi @yaauie, We observed the issue again; in this case we saw a KeyError before the LockException. |
It seems challenging to reproduce the problem on demand. @yaauie wrote:
Could you clarify why this scenario requires protection? Why does re-acquisition throw an exception:
Is it intended to guard against situations where the lock is not released properly due to a bug in Logstash? Could the same issue occur, for example, if the Queue object is destroyed in a way that prevents it from properly closing and releasing the lock? |
Hi @yaauie , |
Hi, |
Hi @yaauie , We have been observing the issue happening for multiple users since we reported this issue. Can we have some fix for it? |
Hi @yaauie Any thoughts on how to prevent this issue? |
Hi @sasikiranvaddi since we haven't been able to reproduce on our side, can you check if a more modern Logstash, using the bundled jdk (8.15.0 bundles JDK 21) still shows this issue? Logstash now bundles JRuby 9.4 which has had significant changes since 9.3 and could have solved the problem. |
Hi @jsvd, In the latest logs, reported recently by one of our users, the LockException occurred with the following versions, where the JRuby version is 9.4.5. Versions:
[logstash.agent] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:opensearch, :exception=>'KeyError', :message=>'key not found', :backtrace=>['/opt/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:325:in |
Thanks, it does seem to be a different error, or at least in a different section of the code. The initial error was at converge_state
While that error is deeper into the initialization of a Pipeline:
How frequently does it happen? Does it always happen with the same output? Does the pipeline have more than one? |
Hi @jsvd. The issue is sporadic. We have noticed "key not found" a couple of times in different environments; the reproduction counts per pipeline are: OpenSearch - 2. So far we have noticed it in 3 environments, of which two involve the OpenSearch pipeline and one the Logstash pipeline. |
Hi, Could someone share any views on the comment highlighted above? |
Hi, |
Hi, Any suggestions or a way forward on how to overcome this issue? |
Hi, |
Hi, |
Does anyone have configs and step-by-step instructions to reproduce the case? @sasikiranvaddi can you provide I have built LS from
From the logs provided, some logs indicate flow metrics initialization failed.
another
The situation,
The short term fix: we can place a safeguard, but I would like to understand the case behaviour (I believe there will be a better fix). With the safeguard, we lose flow metrics.
+ if generated_value
fetch_or_store_namespaces(namespaces).fetch_or_store(key, generated_value)
+ end |
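To illustrate the trade-off of such a safeguard, here is a minimal Ruby sketch, assuming a simplified store built on a plain Hash and a Mutex. `SketchMetricStore` and `registered?` are hypothetical names and this is not Logstash's metric store; the point is only that skipping the store step when the generated value is nil avoids the failure, at the cost of that metric never being registered.

```ruby
# Hypothetical, simplified metric store (not Logstash's implementation).
class SketchMetricStore
  def initialize
    @mutex = Mutex.new
    @store = {}
  end

  # Returns the metric registered under namespaces/key, generating it with
  # the block on first use. If the block yields nil, nothing is stored.
  def fetch_or_store(namespaces, key)
    @mutex.synchronize do
      namespace = (@store[namespaces] ||= {})
      return namespace[key] if namespace.key?(key)

      generated_value = yield
      # The safeguard: only register a metric that was actually generated.
      namespace[key] = generated_value if generated_value
      generated_value
    end
  end

  def registered?(namespaces, key)
    @mutex.synchronize { @store.fetch(namespaces, {}).key?(key) }
  end
end

store = SketchMetricStore.new

# A generator that fails to produce a metric: nothing is registered.
store.fetch_or_store([:stats, :flow], :input_throughput) { nil }
puts store.registered?([:stats, :flow], :input_throughput)

# A later successful generation registers the metric as usual.
store.fetch_or_store([:stats, :flow], :input_throughput) { 42 }
puts store.registered?([:stats, :flow], :input_throughput)
```

As the comment above notes, a guard like this hides the symptom rather than explaining why the generated value was missing in the first place.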
Hi @mashhurs,
Also, could you help us with the queries below, raised by @tsaarni? We rarely see the issue in our local labs; it has been reported by customers, so getting trace logs is a bit challenging. |
My goal here is to make sure released versions (especially recent ones) are safe. As I understand from your previous comment, LS is built from source with different JRuby and JDK distributions, and also runs in a specific environment?
The error you posted shows
The PQ lock file is a synchronized place to persist PQ info (page, size, etc.) that workers access to process PQ events. At any given time, only a single process can hold the lock. In normal situations, LS acquires and releases the lock but
|
We observe a LockException while the Logstash process is running. Looking at the logs, before the LockException occurred, logstash.agent was trying to fetch the pipelines count but could not, causing a JavaNullPointerException. From the following traceback, Logstash is trying to execute the pipeline reload.
Tracing further back, it is probably when the lines of code below are executed that null is returned, leaving the lock in place.
elastic/logstash/blob/main/logstash-core/lib/logstash/pipeline_action/reload.rb#L39-L42
Could you please let us know in which scenarios the NullPointerException and LockException occur? If the transaction has failed, should the rescue path clean up the lock so that the next transaction can complete successfully?