`nodupe_basis path` doesn't look at the path when the checksums are different #1382

andreleblanc11 · 2025-02-19T16:48:04Z

Currently, when nodupe_basis path is selected as the basis of the duplicate suppression, it will get overridden when the checksum of a file is different

@reidsunderland and I found that this is because the default routing key for a path is set to a checksum.

sarracenia/sarracenia/flowcb/nodupe/__init__.py

Lines 33 to 42 in a5fd5ed

    
           if ('nodupe_override' in msg) and ('key' in msg['nodupe_override']): 
        
               key = msg['nodupe_override']['key'] 
        
           elif 'fileOp' in msg : 
        
               if 'link' in msg['fileOp']: 
        
                   key = msg['fileOp']['link'] 
        
               elif 'directory' in msg['fileOp']: 
        
                   if 'remove' not in msg['fileOp']: 
        
                       key = msg['relPath'] 
        
           elif ('identity' in msg) and not (msg['identity']['method'] in ['cod']): 
        
               key = msg['identity']['method'] + ',' + msg['identity']['value'].replace('\n', '')

The path doesn't have its own class unlike the name option. So by default it always has the checksum as the key.

We also talked about adding an option to override not checking the checksum. nodupe_check_data might be a good name? It would be nice to give the user the option.

So the current behaviour we have is

basis	look at checksum?	look at path?	look at name?
path	yes	yes	(not directly, included in path)
data	yes	no	no
name	no	no	yes

Ideally we would want the following

basis	look at checksum?	look at path?	look at name?
path	optional (no by default)	yes	(not directly, included in path)
data	yes	no	no
name	optional (no by default)	no	yes

The text was updated successfully, but these errors were encountered:

andreleblanc11 · 2025-02-20T21:41:59Z

Things discussed/realized:

The values in nodupe_basis bypass the default nodupe algorithm when specified. This doesn't apply for the path
We want to have similar behaviour between the path/name options. To have this, we would need to
- Add the option to duplicate suppress based on the file name, but also post files if the checksum/mtime/size/pubTime changes
- Add the option to duplicate suppressi based on the path that ignores all the other data (checksum, time, size)

To do this, one option is to add new nodupe overrides, name_only and path_only.
Where path and name will go through the default nodupe algorithm before checking the name, path respectively.
path_only and name_only will override the key.

The documentation for deriveKey isn't the best and should be improved.

petersilva · 2025-02-22T18:02:03Z

I just looked at the code, and nodupe_basis path is simply not implemented at all, afaict.

the nodupe_basis option is not referenced anywhere in the nodupe related modules.
for the data and name options, the config parser removes nodupe_basis from the options...
the result of the parse is an option nodupe_basis with value "path" that is never referenced anywhere.

I think nodupe_basis path was an originally created basis in v2, but we found it was never used. All the cases where nodupe_basis path was considered, was eventually replaced by nodupe_basis name because the use cases did not actually want to compare the entire path.

We have two options...

de-document path as an option, as it is never used.
implement path ... assuming we have an actualy use case for it.

I'm guessing you ran into a case where somebody wanted nodupe_basis path to work?

reidsunderland · 2025-02-24T18:50:57Z

nodupe_basis path is implemented, it's the default behaviour of https://github.com/MetPX/sarracenia/blob/development/sarracenia/flowcb/nodupe/disk.py

First the key is derived, if they key is already in the cache, then it compares the (rel)path

sarracenia/sarracenia/flowcb/nodupe/disk.py

Lines 87 to 121 in 372e77b

    
               def _not_in_cache(self, key, relpath) -> bool: 
        
                   """  return False if the given key=relpath value is already in the cache, 
        
                               True otherwise 
        
                        side effect: add it to the cache if it isn't there. 
        
                   """ 
        
                   # not found 
        
                   self.cache_hit = None 
        
                   qpath = urllib.parse.quote(relpath) 
        
                   if key not in self.cache_dict: 
        
                       logger.debug("adding entry to NoDupe cache") 
        
                       kdict = {} 
        
                       kdict[relpath] = self.now 
        
                       self.cache_dict[key] = kdict 
        
                       self.fp.write("%s %f %s\n" % (key, self.now, qpath)) 
        
                       self.count += 1 
        
                       return True 
        
                   logger.debug( f"entry already in NoDupe cache: key={key}" ) 
        
                   kdict = self.cache_dict[key] 
        
                   present = relpath in kdict 
        
                   kdict[relpath] = self.now 
        
                   # differ or newer, write to file 
        
                   self.fp.write("%s %f %s\n" % (key, self.now, qpath)) 
        
                   self.count += 1 
        
                   if present: 
        
                       logger.debug( f"updated time of old NoDupe entry: relpath={relpath}" ) 
        
                       self.cache_hit = relpath 
        
                       return False 
        
                   else: 
        
                       logger.debug( f"added relpath={relpath}") 
        
                   return True

sarracenia/sarracenia/flowcb/nodupe/disk.py

Lines 123 to 143 in 372e77b

    
               def check_message(self, msg) -> bool : 
        
                   """ 
        
                      derive keys to be looked up in cache of messages already seen. 
        
                      then look them up in the cache,  
        
                      return False if message is a dupe. 
        
                             True if it is new. 
        
                   """ 
        
                   key = self.deriveKey(msg) 
        
                   if ('nodupe_override' in msg) and ('path' in msg['nodupe_override']): 
        
                       path = msg['nodupe_override']['path'] 
        
                   else: 
        
                       path = msg['relPath'].lstrip('/') 
        
                   msg['noDupe'] = { 'key': key, 'path': path } 
        
                   msg['_deleteOnPost'] |= set(['noDupe']) 
        
                   logger.debug("NoDupe calling check( %s, %s )" % (key, path)) 
        
                   return self._not_in_cache(key, path)

Assuming msg['nodupe_override']['path'] is not set, it uses msg['relPath'] for the (rel)path comparison.

petersilva · 2025-02-25T00:01:20Z

The default behavious does not implement nodupe_basis path it implements the default processing, which is based deriving the key from the message using:

if this is a fileOp message, then pick a per fileOp value... else (reg file):
the identity field in the message (aka checksum) if available.
if not... ( this is a regular file that has no checksum,) then it looks at mtime, pubtime, and size, and relPath. and it comes up with some key derives from a combination of those.
It Never sets it to the relPath alone.

Nodupe uses an associative array by key... so for check_message to return correctly, the two messages have to have the same key. It only checks the path if the two messages have the same key.
For nodupe_basis path to work properly, the key has to be set to the relpath value alone, and then the path has to be set to the same thing for the same key...

The deriveKey routine has NO means of setting the key to be the relPath alone, so nodupe_basis path is not implemented. The keys never match, so the path setting does not matter. One would need an override class to implement that, or a logic change in the routine based on nodupe_basis setting. Neither is present.

It's a few lines to implement, but it isn't there. something like a path subclass:

flowcb/nodupe/path.py:

import logging
from sarracenia.flowcb import FlowCB

logger = logging.getLogger(__name__)

class Path(FlowCB):
    """
      Override the the comparison so that files with the same name,
      regardless of what directory they are in, are considered the same.
      This is useful when receiving data from two different sources (two different trees)
      and winnowing between them.
    """
    def after_accept(self, worklist):
        for m in worklist.incoming:
            if not 'nodupe_override' in m:
                m['_deleteOnPost'] |= set(['nodupe_override'])
                m['nodupe_override'] = {}
            m['nodupe_override']['key'] = m['relPath']

and logic in the config to have it loaded. would be enough.
config/init.py: 2012 ish...


           elif self.nodupe_basis == 'path':
                self.plugins_early.append( 'nodupe.path' )
                delattr( self, 'nodupe_basis' )

Other thoughts:

Might want to add some logic to complain about unknown values of nodupe_basis.
Might want to document what the default behaviour is... the documentation says it's path... but it's actually different, and undocumented...

reidsunderland · 2025-02-25T01:27:05Z

This is a good explanation :) we eventually figured that out, except we assumed that the default behaviour you explained was how "path" was supposed to work.

And that's why we were confused by what we thought was an inconsistency between "path" (actually just the default processing) and name.

I started working on this branch: development...issue1382

In that branch:

"Default processing" is named "path" (and is still the default basis)
Current "name" is renamed to "name_only"
A new "name" is added, that uses the default key derivation but overrides the path. @andreleblanc11 noted that this won't actually work properly without a modification to the deriveKey method
A new basis, "path_only" is added which is the same as your code above

petersilva · 2025-02-25T17:01:35Z

Suggest:

rename the default processing ... called path in your branch to "metadata"
- metatdata should be the default and it should be documented (perhaps using description above as starting point.)
change name_only back to name ... as that it should work properly already, and is documented.
change path_only to be called path ... as that should match already documented behaviour.

If you want to keep the _only name scheme... then you probably need to rename data to data_only

Regardless, I think the first step is to re-work the unit tests... so that they confirm the behaviour of each.
That's where I was going to start. The unit tests need the sum fields replaced with identity and perhaps fileOp fields.

Here's that branch:

https://github.com/MetPX/sarracenia/tree/issue1382_unit_tests

It passes the tests with correctish fields (but no fileOps yet... maybe later.)
I'm working on adding a new unit test for when path is used.

petersilva · 2025-02-25T17:16:24Z

maybe message is clearer than metadata meaning you derive the key from fields in the whole message.

petersilva self-assigned this Mar 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`nodupe_basis path` doesn't look at the path when the checksums are different #1382

`nodupe_basis path` doesn't look at the path when the checksums are different #1382

andreleblanc11 commented Feb 19, 2025

andreleblanc11 commented Feb 20, 2025

petersilva commented Feb 22, 2025

reidsunderland commented Feb 24, 2025

petersilva commented Feb 25, 2025

reidsunderland commented Feb 25, 2025

petersilva commented Feb 25, 2025

petersilva commented Feb 25, 2025

nodupe_basis path doesn't look at the path when the checksums are different #1382

nodupe_basis path doesn't look at the path when the checksums are different #1382

Comments

andreleblanc11 commented Feb 19, 2025

andreleblanc11 commented Feb 20, 2025

petersilva commented Feb 22, 2025

reidsunderland commented Feb 24, 2025

petersilva commented Feb 25, 2025

reidsunderland commented Feb 25, 2025

petersilva commented Feb 25, 2025

petersilva commented Feb 25, 2025

`nodupe_basis path` doesn't look at the path when the checksums are different #1382

`nodupe_basis path` doesn't look at the path when the checksums are different #1382