Feature request: RVL_CDIP and DocLayNet #44

Open
Jordy-VL opened this issue Jun 20, 2023 · 4 comments


@Jordy-VL

I would like to use your tool to investigate data noise in https://huggingface.co/datasets/aharley/rvl_cdip and https://ds4sd.github.io/icdar23-doclaynet/

The literature already reports plenty of noise in RVL_CDIP, but your tool could provide more quantitative insight.

@Jordy-VL
Author

Jordy-VL commented Jun 20, 2023

RVL_CDIP poses two practical hurdles: it contains 400K images, and its annotations would need to be converted to COCO format.
It would be a great contribution to the document AI community if you could showcase this dataset's quality issues with your tool ;)
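
To make the COCO conversion mentioned above concrete, here is a minimal sketch. It assumes the labels come as plain-text lines of the form `relative/path.tif <class_index>`. The function name `labels_to_coco`, the sample paths, and the class names are hypothetical. Because this is a classification dataset, the sketch writes one annotation per image with only a `category_id`. Tools that expect full detection-style COCO records (image `width`/`height`, `bbox`) would need those fields added, which requires opening each image.

```python
import json


def labels_to_coco(label_lines, class_names):
    """Build a COCO-style dict from 'path class_index' lines (hypothetical helper)."""
    categories = [{"id": i, "name": name} for i, name in enumerate(class_names)]
    images, annotations = [], []
    for line in label_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last whitespace so paths containing spaces survive.
        path, class_index = line.rsplit(maxsplit=1)
        image_id = len(images)
        images.append({"id": image_id, "file_name": path})
        annotations.append(
            {"id": image_id, "image_id": image_id, "category_id": int(class_index)}
        )
    return {"images": images, "annotations": annotations, "categories": categories}


# Hypothetical sample input; real label files and class names will differ.
sample_lines = ["docs/a/0001.tif 1", "docs/b/0002.tif 0"]
coco = labels_to_coco(sample_lines, ["class_0", "class_1"])
print(json.dumps(coco, indent=2))
```

For 400K images, the same loop can stream lines from the label file instead of holding a list in memory.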

@dnth
Contributor

dnth commented Jun 20, 2023

Hi @Jordy-VL, thank you for the comment. We will add this to our roadmap. In the meantime, you can try it out yourself for free using our no-code platform here.

Or, if you're feeling adventurous and want to run some code, try fastdup.
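
For readers who want to try the code route, a minimal sketch of a fastdup run might look like the following. It assumes fastdup's v1-style Python API (`fastdup.create`, `run`, `similarity`), which may differ between releases. The wrapper name `run_duplicate_scan` and the directory names are hypothetical.

```python
from pathlib import Path


def run_duplicate_scan(input_dir, work_dir="fastdup_work"):
    """Run fastdup over a folder of images and return its pairwise similarity table.

    Hypothetical wrapper; assumes `pip install fastdup` and its v1-style API.
    """
    if not Path(input_dir).is_dir():
        raise FileNotFoundError(f"Image folder not found: {input_dir}")

    import fastdup  # imported lazily so the module loads without fastdup installed

    fd = fastdup.create(work_dir=work_dir, input_dir=str(input_dir))
    fd.run()  # builds embeddings and the nearest-neighbour index
    return fd.similarity()  # table of image pairs with their similarity score


# Example call (the path is an assumption, not taken from this thread):
# pairs = run_duplicate_scan("rvl_cdip/images")
# print(pairs.head())
```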

@Jordy-VL
Author

Jordy-VL commented Jul 3, 2023

Hi @dnth!

I just wanted to let you know that I was able to run fastdup on RVL-CDIP with the following results:

2023-06-22 11:56:43 [INFO] Found a total of 35106 fully identical images (d>0.990), which are 4.39 %
2023-06-22 11:56:43 [INFO] Found a total of 188747 nearly identical images(d>0.980), which are 23.59 %
2023-06-22 11:56:43 [INFO] Found a total of 769216 above threshold images (d>0.900), which are 96.15 %
2023-06-22 11:56:43 [INFO] Found a total of 40079 outlier images         (d<0.050), which are 5.01 %
2023-06-22 11:56:43 [INFO] Min distance found 0.684 max distance 1.000
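
As a sanity check on the summary above, the following sketch parses the four "Found a total of" lines and divides each count by its reported percentage to recover the implied denominator. The regular expression and the function name `parse_summary` are assumptions written for this specific log layout, not part of fastdup. The implied totals land near 800,000, roughly twice the 400K images mentioned earlier. The thread does not explain the difference, so it is worth confirming what the counts refer to before quoting the percentages.

```python
import re

# The four summary lines copied from the log above.
LOG = """\
2023-06-22 11:56:43 [INFO] Found a total of 35106 fully identical images (d>0.990), which are 4.39 %
2023-06-22 11:56:43 [INFO] Found a total of 188747 nearly identical images(d>0.980), which are 23.59 %
2023-06-22 11:56:43 [INFO] Found a total of 769216 above threshold images (d>0.900), which are 96.15 %
2023-06-22 11:56:43 [INFO] Found a total of 40079 outlier images         (d<0.050), which are 5.01 %
"""

# Assumed pattern for this log layout: count, description, threshold, percentage.
PATTERN = re.compile(
    r"Found a total of (\d+) (.+?)\s*\(d([<>])([\d.]+)\), which are ([\d.]+) %"
)


def parse_summary(log_text):
    """Return one dict per summary line, including the implied denominator."""
    rows = []
    for match in PATTERN.finditer(log_text):
        count = int(match.group(1))
        percent = float(match.group(5))
        rows.append(
            {
                "kind": match.group(2).strip(),
                "comparison": match.group(3),
                "threshold": float(match.group(4)),
                "count": count,
                "percent": percent,
                # count / (percent / 100) recovers the total the percentage was taken over
                "implied_total": round(count / (percent / 100)),
            }
        )
    return rows


for row in parse_summary(LOG):
    print(f'{row["kind"]:<28} {row["count"]:>7}  implied total ~ {row["implied_total"]}')
```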

Sharing the analysis HTML files here: analysis

I believe this shows how useful your tools are on this dataset; the results call for further visual inspection with the visual-layer tool :)

@dnth
Contributor

dnth commented Jul 4, 2023

Hello @Jordy-VL! It's mind-blowing how many duplicates are in the dataset! I think this would be very helpful to the community that works with this dataset. Thank you for sharing it :)
