Data found under https://github.com/Santosh-Gupta/datasets is not 700,000 question/answer pairs #15
Comments
In the meantime, here's a link to the binary TFRecord files: https://drive.google.com/drive/u/8/folders/1wRc1jtl5Q0objpfualNFwpg4H575tmks
The number of rows in HealthTap, WebMD, and AskDocs is about 137,000, 46,200, and 78,000, respectively. I do not see a few hundred thousand data points per dataset, as you mention.
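A quick way to reproduce row counts like these, assuming each dataset is a CSV file with a header row (the file contents below are synthetic stand-ins, not the real data):

```python
import csv
import io

def count_rows(csv_text: str) -> int:
    """Count data rows in a CSV export, excluding the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)

# Tiny in-memory example standing in for e.g. a healthtap.csv export
sample = "question,answer\nWhat is flu?,A viral infection.\nIs rest helpful?,Usually yes.\n"
print(count_rows(sample))  # 2 data rows
```

For the real files you would pass `open(path).read()` (or adapt the function to take a file handle); the point is only that the header must be excluded before comparing against the advertised totals.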
@ash3n @JayYip I'm going through the chats regarding the data. Unless I missed something, I believe that is the total amount of data we have: ~300,000 question/answer pairs, not 700,000. HealthTap, AskDocs, and WebMD are the most significant datasets we have; all the other datasets have very low counts compared to those three.
I'm not very familiar with the size of the data since I'm not the one who collected and parsed it.
I'll wait until @ash3n comments. If my analysis is correct, we should update the README.
#15 (comment): the link has expired. Please provide working links.
@JayYip that's your folder |
Link restored. Please check. |
This link is still not working: https://drive.google.com/drive/u/8/folders/1wRc1jtl5Q0objpfualNFwpg4H575tmks
also @JayYip |
@Better-Boy Restored. I'm running out of Google Drive space. @Santosh-Gupta @Better-Boy Could you two help host the data and open a PR to update the links in the README? Thanks.
Sure, let's chat over Slack.
Hello!
There are two links under https://github.com/Santosh-Gupta/datasets:
1 - https://drive.google.com/drive/folders/1PymmjbrgfOIs-HJ7oBmjZKH8j4rYsGZj
2 - https://drive.google.com/drive/folders/1kYD57uStDd4kXyb3JOYCTQd92Al6Il4K
However, there are duplicates across the two links; for example, AskDocs.csv and icliniqQAs.csv each appear in both. When I import all the non-duplicate data, I only see about 200,000 QA pairs, not the 700,000 that your repo mentions. Is the rest of the data somewhere else? Please let me know how to import the full 700,000-pair dataset.
Thank you!
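To illustrate the deduplication step described above, here is a minimal sketch that merges several CSV exports and counts distinct (question, answer) pairs; it assumes the files share a common question/answer schema, and the contents shown are synthetic:

```python
import csv
import io

def unique_qa_pairs(*csv_texts: str) -> int:
    """Merge CSV exports and count distinct (question, answer) rows."""
    seen = set()
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        next(reader, None)  # skip the header row
        for row in reader:
            seen.add(tuple(row))  # identical rows collapse to one entry
    return len(seen)

# Two folders sharing one duplicated file, as described in the issue
folder1 = "question,answer\nq1,a1\nq2,a2\n"
folder2 = "question,answer\nq2,a2\nq3,a3\n"  # q2/a2 is the duplicate
print(unique_qa_pairs(folder1, folder2))  # 3 unique pairs
```

A set of row tuples is enough here because exact duplicate rows are the failure mode being reported; near-duplicates (e.g. whitespace differences) would need normalization first.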