Data found under https://github.com/Santosh-Gupta/datasets is not 700,000 question/answer pairs #15
Comments
In the meantime, here's a link to the binary TFRecord files: https://drive.google.com/drive/u/8/folders/1wRc1jtl5Q0objpfualNFwpg4H575tmks
The number of rows in HealthTap, WebMD, and AskDocs is about 137,000, 46,200, and 78,000, respectively. I do not see a few hundred thousand data points per dataset, as you mention.
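A quick way to reproduce row counts like these, assuming each dataset is a CSV file with a header row (the file contents below are synthetic stand-ins, not the real data):

```python
import csv
import io

def count_rows(csv_text: str) -> int:
    """Count data rows in a CSV export, excluding the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)

# Tiny in-memory example standing in for e.g. a healthtap.csv export
sample = "question,answer\nWhat is flu?,A viral infection.\nIs rest helpful?,Usually yes.\n"
print(count_rows(sample))  # 2 data rows
```

For the real files you would pass `open(path).read()` (or adapt the function to take a file handle); the point is only that the header must be excluded before comparing against the advertised totals.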
@ash3n @JayYip I'm going through the chats regarding the data. Unless I missed something, I believe that is the total amount of data we have: ~300,000 question/answer pairs, not 700,000. HealthTap, AskDocs, and WebMD are the most significant datasets we have; all the other datasets have very low counts compared to those three.
I'm not very familiar with the size of the data since I'm not the one who collected and parsed it.
I'll wait until @ash3n comments. If my analysis is correct, we should update the README.
#15 (comment): the link has expired. Please provide working links.
@JayYip that's your folder |
Link restored. Please check. |
This link is still not working: https://drive.google.com/drive/u/8/folders/1wRc1jtl5Q0objpfualNFwpg4H575tmks
also @JayYip |
@Better-Boy Restored. I'm running out of Google Drive space. @Santosh-Gupta @Better-Boy Could you two help host the data and open a PR to update the links in the README? Thanks.
Sure, let's chat over Slack.
Hello!
There are two links under https://github.com/Santosh-Gupta/datasets:
1 - https://drive.google.com/drive/folders/1PymmjbrgfOIs-HJ7oBmjZKH8j4rYsGZj
2 - https://drive.google.com/drive/folders/1kYD57uStDd4kXyb3JOYCTQd92Al6Il4K
However, there are duplicates across the two links; for example, AskDocs.csv and icliniqQAs.csv each appear in both. When I import all the non-duplicate data, I only see about 200,000 QA pairs, not the 700,000 that your repo mentions. Is the rest of the data somewhere else? Please let me know how to import the full 700,000-pair dataset.
Thank you!
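To illustrate the deduplication step described above, here is a minimal sketch that merges several CSV exports and counts distinct (question, answer) pairs; it assumes the files share a common question/answer schema, and the contents shown are synthetic:

```python
import csv
import io

def unique_qa_pairs(*csv_texts: str) -> int:
    """Merge CSV exports and count distinct (question, answer) rows."""
    seen = set()
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        next(reader, None)  # skip the header row
        for row in reader:
            seen.add(tuple(row))  # identical rows collapse to one entry
    return len(seen)

# Two folders sharing one duplicated file, as described in the issue
folder1 = "question,answer\nq1,a1\nq2,a2\n"
folder2 = "question,answer\nq2,a2\nq3,a3\n"  # q2/a2 is the duplicate
print(unique_qa_pairs(folder1, folder2))  # 3 unique pairs
```

A set of row tuples is enough here because exact duplicate rows are the failure mode being reported; near-duplicates (e.g. whitespace differences) would need normalization first.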