Feature request: RVL_CDIP and DocLayNet #44
Comments
RVL_CDIP has two issues: it contains 400K images, and its annotations would need to be converted to COCO format.
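For anyone attempting the conversion mentioned above, here is a minimal sketch of what it could look like. It assumes the labels come as plain-text lines of the form `<relative/path> <class_index>`, and it emits a COCO-style dictionary with `images`, `annotations`, and `categories`. The function name `labels_to_coco`, the example paths, and the two category names are hypothetical placeholders. The annotations carry only a category per image (no `bbox`), so whether a given tool accepts them is also an assumption.

```python
import json


def labels_to_coco(label_lines, category_names):
    """Build a COCO-style dict from '<relative/path> <class_index>' lines.

    Assumption: each non-empty line ends with an integer class index that
    is a valid position in category_names.
    """
    categories = [{"id": i, "name": name} for i, name in enumerate(category_names)]
    images, annotations = [], []
    for line in label_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last whitespace run so paths containing spaces survive.
        path, class_index = line.rsplit(maxsplit=1)
        image_id = len(images) + 1  # contiguous, 1-based ids
        images.append({"id": image_id, "file_name": path})
        annotations.append(
            {"id": image_id, "image_id": image_id, "category_id": int(class_index)}
        )
    return {"images": images, "annotations": annotations, "categories": categories}


# Hypothetical two-image example; real label files would be read from disk.
example = labels_to_coco(["docs/a.tif 0", "docs/b.tif 1"], ["letter", "form"])
print(json.dumps(example, indent=2))
```

Since the function only consumes an iterable of lines, a 400K-line label file can be passed as an open file handle rather than loaded into a list first.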
Hi @dnth! I just wanted to let you know that I was able to run fastdup on
Sharing the analysis HTMLs here: analysis. I do believe that this shows the usefulness of your tools on this dataset, requiring further visual inspection with the visual-layer tool :)
Hello @Jordy-VL! It's mind-blowing how many duplicates there are in the dataset! I think this would be very helpful to the community that works with this dataset. Thank you for sharing it :)
I would like to use your tool to investigate data noise in https://huggingface.co/datasets/aharley/rvl_cdip and https://ds4sd.github.io/icdar23-doclaynet/
It is already known in the literature that there is plenty of noise in RVL_CDIP, yet your tool could provide more quantitative insight.