Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LogStash fails to acquire lock causing LockException. #16173

Open
sasikiranvaddi opened this issue May 21, 2024 · 24 comments
Open

LogStash fails to acquire lock causing LockException. #16173

sasikiranvaddi opened this issue May 21, 2024 · 24 comments

Comments

@sasikiranvaddi
Copy link

sasikiranvaddi commented May 21, 2024

We observe LockException when logstash process is running. Looking at the logs, before LockException has occurred logstash.agent is trying to fetch the pipelines count but it couldn't get casing JavaNullPointerException.

[logstash.agent] Failed to execute action {:action=>LogStash::PipelineAction::Reload/pipeline_id:logstash, :exception=>'Java::JavaLang::NullPointerException', :message=>'', :backtrace=>['org.jruby.runtime.scopes.DynamicScope5.getValue(Unknown Source)', 'org.jruby.ir.interpreter.InterpreterEngine.retrieveOp(InterpreterEngine.java:594)', 'org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:348)', 'org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66)', 'org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116)', 'org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:136)', 'org.jruby.runtime.IRBlockBody.yieldSpecific(IRBlockBody.java:76)', 'org.jruby.runtime.Block.yieldSpecific(Block.java:158)', 'org.jruby.ext.monitor.Monitor.synchronize(Monitor.java:82)', 'org.jruby.ext.monitor.Monitor$INVOKER$i$0$0$synchronize.call(Monitor$INVOKER$i$0$0$synchronize.gen)', 'org.jruby.internal.runtime.methods.JavaMethod$JavaMethodZeroBlock.call(JavaMethod.java:561)', 'org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:90)', 'org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:103)', 'org.jruby.ir.instructions.CallBase.interpret(CallBase.java:545)', 'org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:367)', 'org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66)', 'org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:82)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:201)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:188)', 'org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:220)', 'org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:242)', 'org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:318)', 'org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66)', 'org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:82)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:201)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:188)', 'org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:257)', 'org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:270)', 'org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:341)', 'org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:66)', 'org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:88)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:238)', 'org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:225)', 'org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:228)', 'opt.logstash.logstash_minus_core.lib.logstash.agent.RUBY$block$converge_state$2(/opt/logstash/logstash-core/lib/logstash/agent.rb:386)', 'org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:141)', 'org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:64)', 'org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58)', 'org.jruby.runtime.Block.call(Block.java:144)', 'org.jruby.RubyProc.call(RubyProc.java:352)', 'org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:111)', 'java.base/java.lang.Thread.run(Thread.java:829)']}

From the following traceback, logstash is trying to execute the reload pipeline

github.com source

begin
  logger.debug("Executing action", :action => action)
  action_result = action.execute(self, @pipelines_registry)
  converge_result.add(action, action_result)
  unless action_result.successful?
    logger.error("Failed to execute action",
      :id => action.pipeline_id,
      :action_type => action_result.class,
      :message => action_result.message,
      :backtrace => action_result.backtrace
    )
  end

Further tracing back, probably when the below lines of code is executed it is returning null and leaving the lock not getting removed.

github.com source
elastic/logstash/blob/main/logstash-core/lib/logstash/pipeline_action/reload.rb#L39-L42

def execute(agent, pipelines_registry)
  old_pipeline = pipelines_registry.get_pipeline(pipeline_id)
  if old_pipeline.nil?

Could you please let us know on what all scenarios NullPointerException, LockException is occurred. In case if the transaction has failed then as a rescue should it clean the lock for upcoming transaction to complete successfully.

@roaksoax
Copy link
Contributor

Hi @sasikiranvaddi

Thanks for filing the issue. Could you please provide the information requested on the issue template?

That is:

  • Logstash Version
  • Installation Source with Steps followed to install
  • Pipeline sample causing the issue
  • Steps to reproduce the issue
  • Logs (providing the traceback is not enough).

Thank you!

@sasikiranvaddi
Copy link
Author

sasikiranvaddi commented May 22, 2024

Hi @roaksoax,

Thank you for acknowledging the issue. Please find below the requested information.

Logstash Version:
8.11.3

Installation Source with Steps followed to install:
Built from source code.

Pipeline sample causing the issue:

pipelines.yml:
----
- pipeline.id: logstash
  queue.type: persisted
  queue.max_bytes: 1024mb
  path.config: "/opt/logstash/resource/logstash.conf"
- pipeline.id: opensearch
  queue.type: persisted
  queue.max_bytes: 1024mb
  path.config: "/opt/logstash/resource/searchengine.conf"

Steps to reproduce the issue:
Steps are unknown on what is causing the Java NPE and why the lock is not getting released

Logs (providing the traceback is not enough).
Attached logs.txt
logs.txt

@yaauie
Copy link
Member

yaauie commented May 22, 2024

The backtrace from the issue description appears to be the jruby runtime hitting an NPE in the course of interpreting and running our code (as opposed to our code failing to acquire a lock).

This is not a normal scenario, and is certainly caused by a bug in Jruby.

The NPE appears to have occurred while starting the pipeline, after the Logstash code had acquired the PQ's lock (which is two-fold; when opening the queue, we first ensure that no other process has the queue open using an on-disk lock file, and then ensure that the current process only opens it once; it appears that both levels of locks had been acquired prior to jruby throwing the NPE).

The Agent's config state converger prevented the exception from crashing the Logstash process, but the queue's lock was left in a locked state. Because the lock is supposed to live beyond the starting of the pipeline, there is no implicit auto-close handling.

Since the lock had been acquired and was not released, subsequent reloads of the pipeline cannot acquire it. The only way to get the pipeline running again is to stop the process, manually remove the offending queue's lock file, and restart the process.


Logstash 8.11.3's distribution from Elastic is bundled with Jruby 9.4.5.0 and Adoptium's JDK 17.0.9p9, but since you have built from source there are a number of additional variables at play.

Since you built from source it would also be helpful to know:

  • whether you are running the assembled components directly ("${SOURCE}/bin/logstash") or from an assembled artifact (such as a tarball, rpm, or deb)
  • whether your build correctly vendored jruby 9.4.5.0 or later (I can see similar issues resolved in the jruby project as recently as 9.4.4.0)
  • which commit you built from (git rev-parse HEAD); if that commit is not on elastic/logstash, then the diff between your head and a base commit that is on elastic/logstash would be necessary.
  • which JDK you are using to build logstash
  • which JDK you are using to run the logstash process

@sasikiranvaddi
Copy link
Author

Hi @yaauie ,

Thank you for sharing the detailed analysis.
Just a quick question, The Agent's config state converger will not release lock in case of Exception and it is made on purpose. Is my understanding correct? The only way forward is to stop the logstash process manually, delete the lock and start again.

Please find the below info, which you have requested.

  1. whether you are running the assembled components directly ("${SOURCE}/bin/logstash") or from an assembled artifact (such as a tarball, rpm, or deb)
    whether your build correctly vendored jruby 9.4.5.0 or later (I can see similar issues resolved in the jruby project as recently as 9.4.4.0)
    The build uses 9.3.10 version of jruby and JDK version 11. We mirror github repository and use it from there.
  2. which commit you built from (git rev-parse HEAD); if that commit is not on elastic/logstash, then the diff between your head and a base commit that is on elastic/logstash would be necessary.
    $ git rev-parse HEAD
    45f6dce
  3. which JDK you are using to build logstash
    openjdk version "11" 2018-09-25
    OpenJDK Runtime Environment 18.9 (build 11+28)
    OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
  4. which JDK you are using to run the logstash process
    openjdk 11.0.22 2024-01-16
    OpenJDK Runtime Environment (build 11.0.22+7-suse-150000.3.110.1-x8664)
    OpenJDK 64-Bit Server VM (build 11.0.22+7-suse-150000.3.110.1-x8664, mixed mode)

@sasikiranvaddi
Copy link
Author

Hi @yaauie,

Just to add, we observe LockException multiple times and the pattern we observe is always it is followed by some error. Some of them like

  1. NPE
  2. Key Not Foundfailed to coerce
  3. rubyobj.LogStash.Instrument.MetricType.Gauge to org.logstash.instrument.metrics.counter.LongCounter

Proposal:
We see https://github.com/elastic/logstash/blob/main/logstash-core/src/main/java/org/logstash/ackedqueue/Queue.java#L167-L192 does the two-fold locking here. If the exception has occurred in this part of execution https://github.com/elastic/logstash/blob/main/logstash-core/src/main/java/org/logstash/ackedqueue/Queue.java#L171-L177, we could see the Thread lock has been released in the finally block but on-disk lock still exists.

Can we handle releasing of on-disk lock in this part of code https://github.com/elastic/logstash/blob/main/logstash-core/src/main/java/org/logstash/ackedqueue/Queue.java#L175-L177 by invoking method releaseLockAndSwallow();

@sasikiranvaddi
Copy link
Author

Hi,
Could you please check my latest comment ?

@sasikiranvaddi
Copy link
Author

sasikiranvaddi commented Jul 2, 2024

Hi @yaauie,

We observed the issue again, in this case we observed Key Error before Lock Exception.
Attached logs.
latest-logs-key-error-before-lockException.txt

@tsaarni
Copy link

tsaarni commented Aug 1, 2024

It seems challenging to reproduce the problem on demand.

@yaauie wrote:

Since the lock had been acquired and was not released, subsequent reloads of the pipeline cannot acquire it. The only way to get the pipeline running again is to stop the process, manually remove the offending queue's lock file, and restart the process.

Could you clarify why this scenario requires protection? Why does re-acquisition throw an exception:

if (!LOCK_HELD.add(realLockPath.toString())) {
throw new LockException("Lock held by this virtual machine on lock path: " + realLockPath);
}

Is it intended to guard against situations where lock is not released properly due to a bug in Logstash? Could the same issue occur, for example, if the Queue object is destroyed in a way that prevents it from properly closing and releasing the lock?

@sasikiranvaddi
Copy link
Author

Hi @yaauie ,
Could you please share your thoughts on preventing this issue?

@sasikiranvaddi
Copy link
Author

Hi,
Any thoughts on how to prevent this issue?

@sasikiranvaddi
Copy link
Author

Hi @yaauie ,

We have been observing the issue happening for multiple users since we reported this issue. Can we have some fix for it?

@sasikiranvaddi
Copy link
Author

Hi @yaauie

Any thoughts on how to prevent this issue?

@jsvd
Copy link
Member

jsvd commented Sep 23, 2024

Hi @sasikiranvaddi since we haven't been able to reproduce on our side, can you check if a more modern Logstash, using the bundled jdk (8.15.0 bundles JDK 21) still shows this issue? Logstash now bundles JRuby 9.4 which has had significant changes since 9.3 and could have solved the problem.

@sasikiranvaddi
Copy link
Author

Hi @jsvd,

From the latest logs that is reported by one of the user recently, the LockException has been obtained with following version, where the jruby version is 9.4.5

Versions:

Logstash: 8.13.0
Jruby: 9.4.5.0
JDK; 17

[logstash.agent] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:opensearch, :exception=>'KeyError', :message=>'key not found', :backtrace=>['/opt/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:325:in raise_fetch_no_key'', '/opt/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:186:in fetch'', '/opt/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:202:in fetch_or_store'', '/opt/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:85:in block in fetch_or_store'', '/opt/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:83:in block in fetch_or_store'', 'org/jruby/ext/thread/Mutex.java:171:in synchronize'', '/opt/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:80:in fetch_or_store'', '/opt/logstash/logstash-core/lib/logstash/instrument/collector.rb:92:in register?'', 'org/logstash/execution/AbstractPipelineExt.java:547:in initialize_flow_metrics'', '/opt/logstash/logstash-core/lib/logstash/java_pipeline.rb:127:in start'', '/opt/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:51:in block in execute'', '/opt/logstash/logstash-core/lib/logstash/pipelines_registry.rb:159:in create_pipeline'', '/opt/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:50:in execute'', '/opt/logstash/logstash-core/lib/logstash/agent.rb:386:in block in converge_state'']}

@jsvd
Copy link
Member

jsvd commented Sep 25, 2024

Thanks it does seem to be a different error, or at least in a different section of the code. The initial error was at converge_state

'opt.logstash.logstash_minus_core.lib.logstash.agent.RUBY$block$converge_state$2(/opt/logstash/logstash-core/lib/logstash/agent.rb:386)', 'org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:141)', 'org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:64)', 'org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58)', 'org.jruby.runtime.Block.call(Block.java:144)', 'org.jruby.RubyProc.call(RubyProc.java:352)',

While that error is deeper into the initialization of a Pipeline:

'/opt/logstash/logstash-core/lib/logstash/instrument/collector.rb:92:in register?'', 
'org/logstash/execution/AbstractPipelineExt.java:547:in initialize_flow_metrics'', 
'/opt/logstash/logstash-core/lib/logstash/java_pipeline.rb:127:in start'',
'/opt/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:51:in block in execute'',

How frequently does it happen? Does it always happen with the same output? does the pipeline have more than 1?

@sasikiranvaddi
Copy link
Author

Hi @jsvd.

The issue is spoardic and we have noticed key not found couple of times on different environments and the reproduction counts for the pipelines.

OpenSearch - 2
Logstash - 1

So far we have noticed in 3 environments, in which two are for Opensearch pipeline and one for Logstash pipeline.
As mentioned by @tsaarni #16173 (comment) above, we feel when there is a failure in pipeline, the clean-up of in-memory locks should be performed before reloading of the pipeline so that Logstash will not be in stuck state for ever.

@sasikiranvaddi
Copy link
Author

Hi,

Could someone share if there any views on the comment that highlighted above.

@sasikiranvaddi
Copy link
Author

Hi,
Could someone share if there are any suggestions to handle it in a more robust way

@sasikiranvaddi
Copy link
Author

Hi,

Any suggestions or way forward on how to overcome this issue

@sasikiranvaddi
Copy link
Author

Hi,
Any suggestions or way forward on how to overcome this issue

1 similar comment
@sasikiranvaddi
Copy link
Author

Hi,
Any suggestions or way forward on how to overcome this issue

@mashhurs mashhurs self-assigned this Dec 19, 2024
@mashhurs
Copy link
Contributor

mashhurs commented Dec 24, 2024

Anyone has configs and explain step-by-step to reproduce the case?
Does this happen with recent (8.17.0) Logstash version?

@sasikiranvaddi can you provide trace logs which will include more details such as LS metrics initialization and input-beats plugin connection/thread level details.


I have built LS from 45f6dce and also from LS main branch (HEAD), used two pipelines similar to @sasikiranvaddi has (following), frequently curled pipeline flow metrics (watch -n 1 curl -v http://127.0.0.1:9600/_node/stats/pipelines/logstash), sent intensive data to both beats-input and udp-input, reloaded pipeline by changing pipeline configs. However, it was impossible to face issue case, it seems the situation is very case specific.

 - pipeline.id: logstash
   pipeline.workers: 1
   pipeline.batch.size: 2048
   queue.type: persisted
   queue.max_bytes: 2048mb
   config.string: |
      input {
         beats { port => 5033 }
      }
      filter { ruby { code => "sleep(0.01)" }} # backpressure
      output {
         #stdout { } # output
      }
 - pipeline.id: opensearch
   pipeline.workers: 1
   pipeline.batch.size: 2048
   queue.type: persisted
   queue.max_bytes: 2048mb
   config.string: |
      input { 
         udp { port => 5055 }
      }
      filter { ruby { code => "sleep(0.01)" }} # backpressure
      output {
        # stdout { } # output
      }

From the logs provided, some logs indicate flow metrics initialization failed.

[org.logstash.instrument.metrics.gauge.LazyDelegatingGauge] A gauge metric of an unknown type (org.jruby.specialized.RubyArrayOneObject) has been created for key: send_to. This may result in invalid serialization. It is recommended to log an issue to the responsible developer/development team.

another

[logstash.agent] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:logstash, :exception=>'KeyError', :message=>'key not found', :backtrace=>['/opt/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:325:in `raise_fetch_no_key'', '/opt/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:186:in `fetch'', '/opt/logstash/vendor/bundle/jruby/3.1.0/gems/concurrent-ruby-1.1.9/lib/concurrent-ruby/concurrent/map.rb:202:in `fetch_or_store'', '/opt/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:85:in `block in fetch_or_store'', '/opt/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:83:in `block in fetch_or_store'', 'org/jruby/ext/thread/Mutex.java:171:in `synchronize'', '/opt/logstash/logstash-core/lib/logstash/instrument/metric_store.rb:80:in `fetch_or_store'', '/opt/logstash/logstash-core/lib/logstash/instrument/collector.rb:92:in `register?'', 'org/logstash/execution/AbstractPipelineExt.java:547:in `initialize_flow_metrics'', '/opt/logstash/logstash-core/lib/logstash/java_pipeline.rb:127:in `start'', '/opt/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:51:in `block in execute'', '/opt/logstash/logstash-core/lib/logstash/pipelines_registry.rb:159:in `create_pipeline'', '/opt/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:50:in `execute'', '/opt/logstash/logstash-core/lib/logstash/agent.rb:386:in `block in converge_state'']}

The situation, generated_value is nil which becomes default_value to fetch_or_store(key, default_value = NULL), so JRuby concurrent-map is raising raise_fetch_no_key exception.

The short term fix, we can place a safeguard but I would like to understand the case behaviour (I believe there will be better fix). With the safeguard, we lose flow metrics.

+  if generated_value
      fetch_or_store_namespaces(namespaces).fetch_or_store(key, generated_value)
+  end

@sasikiranvaddi
Copy link
Author

Hi @mashhurs,
Thank you for looking into this issue.
The issue is not reproducible all the times, and the we haven't tried with latest version of LT 8.17.0
There are couple of other cases where LockException has observed, below are those

  1. NPE ( For NPE, as mentioned here LogStash fails to acquire lock causing LockException. #16173 (comment), we will observe incase if still it is getting reported after upgrading to JRuby)
  2. Key Not Found failed to coerce
  3. rubyobj.LogStash.Instrument.MetricType.Gauge to org.logstash.instrument.metrics.counter.LongCounter

Also could you help us with below queries that are requested from @tsaarni
#16173 (comment)

We rarely see the issue in our local labs, it is been reported by the customers, so getting trace logs is bit challenging.

@mashhurs
Copy link
Contributor

mashhurs commented Jan 8, 2025

Hi @mashhurs, Thank you for looking into this issue. The issue is not reproducible all the times, and the we haven't tried with latest version of LT 8.17.0 There are couple of other cases where LockException has observed, below are those

My goal here to make sure released versions (especially recent) are safe.

As I do understand from your previous comment, LS is built from the source by changing the Jruby and JDK distros, also running on a specific environment?!
So, this will be hard chasing down the issue. If you please use released version of the LS and provide the full logs at the time when error happens, that would simplify investigation and faster support from our side.


  1. NPE ( For NPE, as mentioned here LogStash fails to acquire lock causing LockException. #16173 (comment), we will observe incase if still it is getting reported after upgrading to JRuby)
  2. Key Not Found failed to coerce
  3. rubyobj.LogStash.Instrument.MetricType.Gauge to org.logstash.instrument.metrics.counter.LongCounter

The error you posted shows AbstractPipelineExt.java:547 line throws (in initialize_flow_metrics method) exception but when I checked out the git sha (45f6dce3e8aa9919e6daa0043a882e12ba3ca97a) you had used, the AbstractPipelineExt.java:547 lines points to collect_flow_metrics method. There is inconsistency, can you double check?

Also could you help us with below queries that are requested from @tsaarni #16173 (comment)

PQ lock file is a synchronized place to persist PQ info (page, size, etc...) where workers access to process PQ events. At a given time, single process needs to acquire a lock. For the normal situations, LS acquires and releases the lock but Lock held by this virtual machine on lock path:.. exception indicates LS didn't release it properly - this we are chasing down here!

We rarely see the issue in our local labs, it is been reported by the customers, so getting trace logs is bit challenging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants