-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nodupe_basis path
doesn't look at the path when the checksums are different
#1382
Comments
Things discussed/realized:
|
I just looked at the code, and nodupe_basis path is simply not implemented at all, afaict.
I think nodupe_basis path was an originally created basis in v2, but we found it was never used. All the cases where nodupe_basis path was considered, was eventually replaced by nodupe_basis name because the use cases did not actually want to compare the entire path. We have two options...
I'm guessing you ran into a case where somebody wanted nodupe_basis path to work? |
First the key is derived, if they key is already in the cache, then it compares the (rel)path sarracenia/sarracenia/flowcb/nodupe/disk.py Lines 87 to 121 in 372e77b
sarracenia/sarracenia/flowcb/nodupe/disk.py Lines 123 to 143 in 372e77b
Assuming |
The default behavious does not implement nodupe_basis path it implements the default processing, which is based deriving the key from the message using:
Nodupe uses an associative array by key... so for check_message to return correctly, the two messages have to have the same key. It only checks the path if the two messages have the same key. The deriveKey routine has NO means of setting the key to be the relPath alone, so nodupe_basis path is not implemented. The keys never match, so the path setting does not matter. One would need an override class to implement that, or a logic change in the routine based on nodupe_basis setting. Neither is present. It's a few lines to implement, but it isn't there. something like a path subclass: flowcb/nodupe/path.py:
and logic in the config to have it loaded. would be enough.
Other thoughts:
|
This is a good explanation :) we eventually figured that out, except we assumed that the default behaviour you explained was how "path" was supposed to work. And that's why we were confused by what we thought was an inconsistency between "path" (actually just the default processing) and name. I started working on this branch: development...issue1382 In that branch:
|
Suggest:
If you want to keep the _only name scheme... then you probably need to rename data to data_only Regardless, I think the first step is to re-work the unit tests... so that they confirm the behaviour of each. Here's that branch: https://github.com/MetPX/sarracenia/tree/issue1382_unit_tests It passes the tests with correctish fields (but no fileOps yet... maybe later.) |
maybe message is clearer than metadata meaning you derive the key from fields in the whole message. |
Currently, when
nodupe_basis path
is selected as the basis of the duplicate suppression, it will get overridden when the checksum of a file is different@reidsunderland and I found that this is because the default routing key for a
path
is set to a checksum.sarracenia/sarracenia/flowcb/nodupe/__init__.py
Lines 33 to 42 in a5fd5ed
The
path
doesn't have its own class unlike the name option. So by default it always has the checksum as the key.We also talked about adding an option to override not checking the checksum.
nodupe_check_data
might be a good name? It would be nice to give the user the option.So the current behaviour we have is
Ideally we would want the following
The text was updated successfully, but these errors were encountered: