FrozenError during concurrent activity in 3.2.11 #445
Comments
From the backtrace, it seems that … Please try using the feature/URI-frozen-bug branch to see if that makes a difference. Another place that could be tried is in the …
Unfortunately I can't test it directly since our servers are stuck on Ruby 2.7 for a few more months at least. I may be able to monkey patch the one-line change in, though; I'll look into it.
I can create a branch off of an earlier release that supports 2.7, if that would help.
Try branch 3.2.11-patch, which is based on 3.2.11 with the small change applied.
Okay, thanks, I'll try that out. It might be a little while before I know whether it resolves the issue, since it's intermittent.
We ran another bulk ingest today with the branch deployed and got two FrozenErrors. The first one triggered in a different location than usual, while the second one looks similar to the previous error, at least from the perspective of the RDF gem. In both cases the jobs were executed by Sidekiq.
Might be easier for you to experiment with some changes in a local copy of RDF.rb. I don't really understand what's going on in the first trace. It seems like something with ObjectCache interaction may be involved. A change was made in 58d8c52 to address a memory leak; it's hard to see what the difference could be, but you might try reverting that particular commit. You could also try modifying RDF::Util::Cache.new to ensure that WeakRefCache is used instead of ObjectSpaceCache.

For the second trace, the problem could be in RDF::URI#freeze, where the mutex is grabbed after checking to see if it is frozen, which looks like a race condition. Try enclosing the entire method body in @mutex.synchronize:

```ruby
def freeze
  @mutex.synchronize do
    unless frozen?
      # Create derived components
      authority; userinfo; user; password; host; port
      @value = value.freeze
      @object = object.freeze
      @hash = hash.freeze
      super
    end
  end
  self
end
```

I'd start with the change to URI#freeze to see if that handles it.
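A rough sketch of the WeakRefCache experiment mentioned above, assuming the factory-style RDF::Util::Cache.new and the ObjectSpaceCache / WeakRefCache class names from the gem's lib/rdf/util/cache.rb (a diagnostic monkey patch to test the suggestion, not a proposed fix):

```ruby
require 'weakref' # WeakRefCache may only be defined when weakref is available
require 'rdf'

module RDF::Util
  class Cache
    # Override the factory so every cache is a WeakRefCache rather than an
    # ObjectSpaceCache. allocate + initialize avoids recursing back into this
    # overridden new, which the subclasses would otherwise inherit.
    def self.new(*args, &block)
      cache = WeakRefCache.allocate
      cache.send(:initialize, *args, &block)
      cache
    end
  end
end
```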
Sure, I will try some changes. The update to the freeze method makes sense, since if two threads called freeze at the same time, they could both potentially get past the frozen? check before either acquired the mutex. Regarding the Cache error, it's possible a similar fix (a mutex around access in RDF::Util::Cache) would help.
It might make sense, but let's start with the URI#freeze change to see if that does it. But, of course, it's up to you.
@bbpennel Any update on this? If the fix to RDF::URI does it, I can release a patch and merge the change into the develop branch.
@gkellogg We've run two bulk ingests since deploying this change and have not seen the error. So I can't say definitively that it's resolved, but it looks good so far, and I think it's a reasonable change to make either way.
Fixed in 3.2.12 and on develop.
We upgraded to rdf 3.2.12, but unfortunately that didn't seem to resolve the RDF::URI::FrozenError for us. We're still seeing them occasionally in background jobs run by Sidekiq. FWIW, it has happened about 200 times out of 800,000 jobs over two weeks.
Perhaps your idea about using a mutex in RDF::Util::Cache might solve the issue.
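A minimal sketch of what that mutex experiment might look like, assuming RDF::Util::Cache::ObjectSpaceCache exposes Hash-style [] and []= accessors (an untested diagnostic patch, not the gem's own fix):

```ruby
require 'rdf'

# Hypothetical patch: serialize cache reads and writes behind one Mutex to
# see whether unsynchronized interning contributes to the FrozenError.
module SynchronizedCachePatch
  LOCK = Mutex.new

  def [](key)
    LOCK.synchronize { super }
  end

  def []=(key, value)
    LOCK.synchronize { super }
  end
end

RDF::Util::Cache::ObjectSpaceCache.prepend(SynchronizedCachePatch)
```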
We have been running into periodic FrozenErrors originating from the Ruby RDF gem during periods of concurrent activity in our Samvera-based application. I have spoken to two other institutions that have been running into the same error.
The error appears to have started when upgrading from 3.2.9 to 3.3.0 in one case, while my own institution and the other are experiencing it in 3.2.11. I don't see the error happening in our logs prior to upgrading to this version, so it seems as though the issue may have been introduced between 3.2.9 and 3.2.11:
3.2.9...3.2.11
Based on the changeset and the error's inconsistent nature, I wonder if it could be related to the fix that causes the cache to get cleared out during garbage collection, so entries would now potentially need to get repopulated and could collide. I wonder whether that component needs protections to ensure thread safety, but this is just a guess. I may try downgrading to 3.2.9 next.
I had hoped to be able to share a test for reproducing the error, but my attempts so far haven't been successful. The error continues to occur almost every time we do a bulk ingest into our production system, though.
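To illustrate the kind of concurrent freeze that seems to be involved, here is a rough stress-test sketch; the URIs, thread count, and iteration count are arbitrary, and it has not been confirmed to reproduce the error:

```ruby
require 'rdf'

# Freeze the same freshly built RDF::URI from several threads at once. If the
# frozen? check and the later instance-variable writes are not done under one
# lock, a losing thread can mutate an already-frozen object and hit FrozenError.
1_000.times do |i|
  uri = RDF::URI.new("http://example.org/resource/#{i}")
  threads = 8.times.map { Thread.new { uri.freeze } }
  threads.each(&:join)
end
puts "completed without FrozenError"
```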
These are the related issues:
samvera/bulkrax#947
pulibrary/figgy#6391
avalonmediasystem/avalon#5783