cache collision #78

patxoca · 2020-02-20T10:15:23Z

Scraping two different PDFs yields the exact same results when using the FileCache.

The problem is that set_hash_key() always computes the same key, because the file position is already at the end of the file when the hash is taken. Reading from there returns an empty string (md5("") == "d41d8cd98f00b204e9800998ecf8427e"), so pdfquery ends up using the same cached data for both PDFs.

Adding file.seek(0) before computing the md5 seems to solve the issue.
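
A minimal sketch of the effect, using in-memory files instead of real PDFs. The helper names below are hypothetical and are not part of pdfquery; the sketch assumes only that the key is an md5 over whatever file.read() returns from the current position:

```python
import hashlib
import io


def md5_from_current_position(file):
    """Hash whatever remains between the current position and EOF."""
    return hashlib.md5(file.read()).hexdigest()


def md5_from_start(file):
    """Rewind first, so the whole file is hashed."""
    file.seek(0)
    return hashlib.md5(file.read()).hexdigest()


# Two in-memory stand-ins for different PDFs (hypothetical contents),
# both already read to the end, as happens after parsing.
first = io.BytesIO(b"%PDF-1.4 first document")
second = io.BytesIO(b"%PDF-1.4 second document")
first.read()
second.read()

# Without rewinding, both hashes are md5 of the empty string.
colliding = (md5_from_current_position(first), md5_from_current_position(second))

# After seek(0), each file gets its own key.
distinct = (md5_from_start(first), md5_from_start(second))

print(colliding)
print(distinct)
```

Both entries of colliding are the empty-input md5 quoted above, while the entries of distinct differ from each other.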

patxoca · 2020-03-11T13:11:33Z

As a temporary workaround until the issue is fixed, define a custom cache class:

from pdfquery.cache import FileCache as _FileCache

class FileCache(_FileCache):

    def set_hash_key(self, file):
        # Rewind so the md5 covers the whole file, not the empty tail at EOF.
        file.seek(0)
        return super().set_hash_key(file)
