Global cache_dir variable for exact, fuzzy, and semantic deduplication #384

Open
wants to merge 14 commits into main
Conversation

sarahyurick
Collaborator

@sarahyurick sarahyurick commented Nov 19, 2024

TODO:

  • Exact deduplication files
  • Fuzzy deduplication files
  • Semantic deduplication files

@sarahyurick sarahyurick changed the title Global cache variable for exact, fuzzy, and semantic deduplication Global cache_dir variable for exact, fuzzy, and semantic deduplication Nov 19, 2024
Comment on lines +154 to +160
You also need to set a global variable representing the cache directory where the outputs are written:

.. code-block:: python

from nemo_curator.cache import initialize_cache_directory

initialize_cache_directory("cache_dir")
Collaborator Author

It could also make sense to call this something else, like deduplication_outputs.

id_column=id_column,
id_column_type=id_column_type,
which_to_keep=config.which_to_keep,
output_dir=os.path.join(cache_dir, config.clustering_save_loc),
Collaborator Author

Might want to re-add output_dir to SemanticClusterLevelDedup and/or add it as a parameter for SemDedup

self.sorted_clusters_dir = os.path.join(
get_cache_directory(), "clustering", "sorted"
)
self.output_dir = os.path.join(get_cache_directory(), "clustering")
Collaborator Author

Suggested change
self.output_dir = os.path.join(get_cache_directory(), "clustering")
self.output_dir = os.path.join(get_cache_directory(), "duplicates")

?

sarahyurick and others added 6 commits November 19, 2024 16:13
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added the gpuci Run GPU CI/CD on PR label Nov 20, 2024
@sarahyurick sarahyurick marked this pull request as ready for review November 20, 2024 23:27
from nemo_curator.utils.file_utils import expand_outdir_and_mkdir

# Global variable to store the cache directory
_global_cache_dir = None
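The helper bodies are not shown in this hunk; a self-contained sketch of how initialize_cache_directory and get_cache_directory plausibly use this variable (os.makedirs stands in for expand_outdir_and_mkdir, which is what the PR actually imports):

```python
import os

# Global variable to store the cache directory
_global_cache_dir = None


def initialize_cache_directory(cache_dir: str) -> str:
    """Create the cache directory if needed and record it globally."""
    global _global_cache_dir
    os.makedirs(cache_dir, exist_ok=True)  # PR uses expand_outdir_and_mkdir here
    _global_cache_dir = cache_dir
    return cache_dir


def get_cache_directory() -> str:
    """Return the directory set by initialize_cache_directory (or None)."""
    return _global_cache_dir
```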
Collaborator

@Maghoumi Maghoumi Nov 25, 2024

Instead of a global variable, we should use the singleton design pattern here, where there is a Cache class that creates the static singleton instance (if it doesn't exist), or returns an existing one (if it was already created).

Subsequently, the init and get methods can be nicely wrapped inside that class.
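One way to read this suggestion as code (a minimal sketch; the Cache class and its method names are placeholders for illustration, not the API that was merged):

```python
class Cache:
    """Singleton wrapping the cache directory, per the suggested design."""

    _instance = None

    def __new__(cls):
        # Create the static singleton instance if it doesn't exist,
        # otherwise return the existing one.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._cache_dir = None
        return cls._instance

    def initialize(self, cache_dir: str) -> None:
        """Record the cache directory (replaces initialize_cache_directory)."""
        self._cache_dir = cache_dir

    def get_cache_directory(self) -> str:
        """Return the recorded cache directory (replaces get_cache_directory)."""
        return self._cache_dir
```

Every call to Cache() then yields the same object, so the init and get methods are wrapped inside the class rather than living as module-level functions around a global variable.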

Collaborator

@sarahyurick please let me know what you think and whether my comment makes sense to you

Collaborator Author

Thank you! Yes, that makes sense to me. Sorry I haven't gotten back to this earlier, but I expect to have more changes ready next week.

@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Dec 23, 2024
Labels
gpuci Run GPU CI/CD on PR

2 participants