Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nodupe_basis path doesn't look at the path when the checksums are different #1382

Open
andreleblanc11 opened this issue Feb 19, 2025 · 7 comments
Assignees

Comments

@andreleblanc11
Copy link
Member

Currently, when nodupe_basis path is selected as the basis of the duplicate suppression, it will get overridden when the checksum of a file is different

@reidsunderland and I found that this is because the default routing key for a path is set to a checksum.

if ('nodupe_override' in msg) and ('key' in msg['nodupe_override']):
key = msg['nodupe_override']['key']
elif 'fileOp' in msg :
if 'link' in msg['fileOp']:
key = msg['fileOp']['link']
elif 'directory' in msg['fileOp']:
if 'remove' not in msg['fileOp']:
key = msg['relPath']
elif ('identity' in msg) and not (msg['identity']['method'] in ['cod']):
key = msg['identity']['method'] + ',' + msg['identity']['value'].replace('\n', '')

The path doesn't have its own class unlike the name option. So by default it always has the checksum as the key.


We also talked about adding an option to override not checking the checksum. nodupe_check_data might be a good name? It would be nice to give the user the option.

So the current behaviour we have is

basis look at checksum? look at path? look at name?
path yes yes (not directly, included in path)
data yes no no
name no no yes

Ideally we would want the following

basis look at checksum? look at path? look at name?
path optional (no by default) yes (not directly, included in path)
data yes no no
name optional (no by default) no yes
@andreleblanc11
Copy link
Member Author

Things discussed/realized:

  • The values in nodupe_basis bypass the default nodupe algorithm when specified. This doesn't apply for the path
  • We want to have similar behaviour between the path/name options. To have this, we would need to
    • Add the option to duplicate suppress based on the file name, but also post files if the checksum/mtime/size/pubTime changes
    • Add the option to duplicate suppressi based on the path that ignores all the other data (checksum, time, size)

  • To do this, one option is to add new nodupe overrides, name_only and path_only.
  • Where path and name will go through the default nodupe algorithm before checking the name, path respectively.
  • path_only and name_only will override the key.

  • The documentation for deriveKey isn't the best and should be improved.

@petersilva
Copy link
Contributor

I just looked at the code, and nodupe_basis path is simply not implemented at all, afaict.

  • the nodupe_basis option is not referenced anywhere in the nodupe related modules.
  • for the data and name options, the config parser removes nodupe_basis from the options...
  • the result of the parse is an option nodupe_basis with value "path" that is never referenced anywhere.

I think nodupe_basis path was an originally created basis in v2, but we found it was never used. All the cases where nodupe_basis path was considered, was eventually replaced by nodupe_basis name because the use cases did not actually want to compare the entire path.

We have two options...

  • de-document path as an option, as it is never used.
  • implement path ... assuming we have an actualy use case for it.

I'm guessing you ran into a case where somebody wanted nodupe_basis path to work?

@reidsunderland
Copy link
Member

nodupe_basis path is implemented, it's the default behaviour of https://github.com/MetPX/sarracenia/blob/development/sarracenia/flowcb/nodupe/disk.py

First the key is derived, if they key is already in the cache, then it compares the (rel)path

def _not_in_cache(self, key, relpath) -> bool:
""" return False if the given key=relpath value is already in the cache,
True otherwise
side effect: add it to the cache if it isn't there.
"""
# not found
self.cache_hit = None
qpath = urllib.parse.quote(relpath)
if key not in self.cache_dict:
logger.debug("adding entry to NoDupe cache")
kdict = {}
kdict[relpath] = self.now
self.cache_dict[key] = kdict
self.fp.write("%s %f %s\n" % (key, self.now, qpath))
self.count += 1
return True
logger.debug( f"entry already in NoDupe cache: key={key}" )
kdict = self.cache_dict[key]
present = relpath in kdict
kdict[relpath] = self.now
# differ or newer, write to file
self.fp.write("%s %f %s\n" % (key, self.now, qpath))
self.count += 1
if present:
logger.debug( f"updated time of old NoDupe entry: relpath={relpath}" )
self.cache_hit = relpath
return False
else:
logger.debug( f"added relpath={relpath}")
return True

def check_message(self, msg) -> bool :
"""
derive keys to be looked up in cache of messages already seen.
then look them up in the cache,
return False if message is a dupe.
True if it is new.
"""
key = self.deriveKey(msg)
if ('nodupe_override' in msg) and ('path' in msg['nodupe_override']):
path = msg['nodupe_override']['path']
else:
path = msg['relPath'].lstrip('/')
msg['noDupe'] = { 'key': key, 'path': path }
msg['_deleteOnPost'] |= set(['noDupe'])
logger.debug("NoDupe calling check( %s, %s )" % (key, path))
return self._not_in_cache(key, path)

Assuming msg['nodupe_override']['path'] is not set, it uses msg['relPath'] for the (rel)path comparison.

@petersilva
Copy link
Contributor

The default behavious does not implement nodupe_basis path it implements the default processing, which is based deriving the key from the message using:

  • if this is a fileOp message, then pick a per fileOp value... else (reg file):
  • the identity field in the message (aka checksum) if available.
  • if not... ( this is a regular file that has no checksum,) then it looks at mtime, pubtime, and size, and relPath. and it comes up with some key derives from a combination of those.
  • It Never sets it to the relPath alone.

Nodupe uses an associative array by key... so for check_message to return correctly, the two messages have to have the same key. It only checks the path if the two messages have the same key.
For nodupe_basis path to work properly, the key has to be set to the relpath value alone, and then the path has to be set to the same thing for the same key...

The deriveKey routine has NO means of setting the key to be the relPath alone, so nodupe_basis path is not implemented. The keys never match, so the path setting does not matter. One would need an override class to implement that, or a logic change in the routine based on nodupe_basis setting. Neither is present.

It's a few lines to implement, but it isn't there. something like a path subclass:

flowcb/nodupe/path.py:

import logging
from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Path(FlowCB):
    """
      Override the the comparison so that files with the same name,
      regardless of what directory they are in, are considered the same.
      This is useful when receiving data from two different sources (two different trees)
      and winnowing between them.
    """
    def after_accept(self, worklist):
        for m in worklist.incoming:
            if not 'nodupe_override' in m:
                m['_deleteOnPost'] |= set(['nodupe_override'])
                m['nodupe_override'] = {}
            m['nodupe_override']['key'] = m['relPath']

and logic in the config to have it loaded. would be enough.
config/init.py: 2012 ish...


           elif self.nodupe_basis == 'path':
                self.plugins_early.append( 'nodupe.path' )
                delattr( self, 'nodupe_basis' )
        

Other thoughts:

  • Might want to add some logic to complain about unknown values of nodupe_basis.
  • Might want to document what the default behaviour is... the documentation says it's path... but it's actually different, and undocumented...

@reidsunderland
Copy link
Member

This is a good explanation :) we eventually figured that out, except we assumed that the default behaviour you explained was how "path" was supposed to work.

And that's why we were confused by what we thought was an inconsistency between "path" (actually just the default processing) and name.

I started working on this branch: development...issue1382

In that branch:

  • "Default processing" is named "path" (and is still the default basis)
  • Current "name" is renamed to "name_only"
  • A new "name" is added, that uses the default key derivation but overrides the path. @andreleblanc11 noted that this won't actually work properly without a modification to the deriveKey method
  • A new basis, "path_only" is added which is the same as your code above

@petersilva
Copy link
Contributor

Suggest:

  • rename the default processing ... called path in your branch to "metadata"
    • metatdata should be the default and it should be documented (perhaps using description above as starting point.)
  • change name_only back to name ... as that it should work properly already, and is documented.
  • change path_only to be called path ... as that should match already documented behaviour.

If you want to keep the _only name scheme... then you probably need to rename data to data_only

Regardless, I think the first step is to re-work the unit tests... so that they confirm the behaviour of each.
That's where I was going to start. The unit tests need the sum fields replaced with identity and perhaps fileOp fields.

Here's that branch:

https://github.com/MetPX/sarracenia/tree/issue1382_unit_tests

It passes the tests with correctish fields (but no fileOps yet... maybe later.)
I'm working on adding a new unit test for when path is used.

@petersilva
Copy link
Contributor

maybe message is clearer than metadata meaning you derive the key from fields in the whole message.

@petersilva petersilva self-assigned this Mar 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants