Expanding to a full YFCC-100M filtered dataset #236

Open
landrumb opened this issue Nov 9, 2023 · 3 comments

@landrumb
Contributor

landrumb commented Nov 9, 2023

We are looking to test our submission on a 100M-scale filtered dataset, and would be happy to integrate it into datasets.py if the embeddings and metadata were hosted at the same location the current dataset downloads from. We would prepare them ourselves, but the corresponding dataset preparation file refers to an external script for generating the metadata, and we do not have the full set of CLIP descriptors.

@mdouze could you make the full 100M vector dataset available where the 10M subset is hosted at https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M/?

@mdouze
Collaborator

mdouze commented Nov 20, 2023

I could probably do it, but it would still be smaller than 100M (around 90M or so) because of missing images, videos, etc.
LMK if this is of interest to you.

@landrumb
Contributor Author

Definitely still interested.

If we're adding other sub-100M filter datasets, do you think we should try to standardize on a round number like 50M or just subset as needed for comparison?

@mdouze
Collaborator

mdouze commented Nov 30, 2023

Stay tuned, I have it on my TODO list...
