About the data split of NQ 320K #15
Comments
To be honest, I'm also not quite sure why they call it NQ320k 😄.
Oh, I see. I think it's a rough number XD. And for MSMARCO:
Yeah, I just use the NQ dataset from Hugging Face. I did not choose the number 6980 myself; the official MS MARCO passage dev small set has 6980 queries, so I just used all of them by default.
Is it not in the Queries link? I don't exactly remember where to find it, but it is the official dev set. Maybe you can use ir_datasets to download it: https://ir-datasets.com/msmarco-passage.html#msmarco-passage
It is probably in this link: collectionandqueries.tar.gz
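As a rough sketch of how the dev-small query count could be verified: the queries file in `collectionandqueries.tar.gz` is a tab-separated file of `qid<TAB>query text` lines, and counting its rows should give 6980. The snippet below parses a small synthetic in-memory sample in the same format (the sample queries and IDs are made up for illustration, not taken from MS MARCO):

```python
import csv
import io

# Synthetic stand-in for queries.dev.small.tsv: "qid<TAB>query text" per line.
# The real file extracted from collectionandqueries.tar.gz has 6980 rows.
sample_tsv = (
    "101\thow tall is mount everest\n"
    "102\twhat is the capital of australia\n"
    "103\tsymptoms of vitamin d deficiency\n"
)

def count_queries(tsv_text: str) -> int:
    """Count (qid, query) rows in a dev-small-style TSV."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return sum(1 for row in reader if row)

print(count_queries(sample_tsv))  # -> 3; the real dev-small file gives 6980
```

Pointing `count_queries` at the extracted `queries.dev.small.tsv` (read as text) should reproduce the 6980 figure mentioned above.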
Thanks for the patient explanation and the links. I finally saw the 6.8K in the links XD
Hi, Dr. Zhuang. Sorry to bother you again, and I hope you are doing well! Following your information, I set it up accordingly. However, I couldn't reach the score reported in the Appendix. Below is the script I used to try to reproduce the experiment on NQ-320K (with a single A6000):
Seems you were generating 10 queries per document? Maybe try increasing the number to 50.
Thanks for the advice. Before I start reproducing: based on your gut feeling, will this change increase performance by 20%?
In my experience, more generated queries is always better, but it will indeed take even longer to converge.
Alright ... This is life ... 😢
Hi, Dr. Zhuang. I've finished the training with the increased query generation number. If this result is normal, should I try:
Thanks for the passionate reply!
How many steps have you trained?
There is no need to use CE when using 100 queries.
OK, I'll try 100 queries first. Thanks again for the confirmation!
@hi-i-m-GTooth Hi GTooth. Thanks for showing your reproduced result. I have tried to run the code on another, smaller dataset than NQ320k, but after 80k steps, hit@1 only reaches 0.03. I don't know why. Your figure shows that at 50k training steps, hit@1 already achieves 0.2. Running 50k steps means just 2 or 3 epochs, since there are about 228k unique documents in NQ320k, and if you generate 10 queries for each doc, there will be 11 * 228k training samples in total. Is my understanding correct? Thanks for your confirmation in advance!
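The sample/epoch arithmetic in the comment above can be sketched as follows. Note the batch size (128) is a purely hypothetical assumption for illustration; the thread does not state the actual value:

```python
# Sketch of the epoch arithmetic from the comment above.
num_docs = 228_000        # ~unique documents in NQ320k
queries_per_doc = 10 + 1  # 10 generated queries + 1 original query per doc
batch_size = 128          # ASSUMED for illustration only; not stated in the thread

total_samples = num_docs * queries_per_doc      # 2,508,000 training samples
steps_per_epoch = total_samples / batch_size    # ~19,594 steps per epoch
epochs_at_50k_steps = 50_000 / steps_per_epoch  # ~2.6 epochs

print(f"{total_samples=}, epochs at 50k steps ~= {epochs_at_50k_steps:.2f}")
```

Under this assumed batch size, 50k steps indeed works out to roughly 2-3 epochs, consistent with the commenter's estimate.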
Hi @yuxiang-guo, if you are getting very low scores, I suggest you have a look at the generated queries. Do they look correct?
Hi @yuxiang-guo , @ArvinZhuang already confirmed it XD. But I want to add:
Hi @ArvinZhuang Thanks for your reply! I checked my generated queries. They looked quite correct, but different from the queries in the test set. The docs in my dataset are more semantically rich, so each doc can correlate with many diverse queries. For a given doc, if the generated queries are not the same as the true query in the test set, would the accuracy be significantly affected?
Thanks for your reply and suggestions! I will try it.
Hi, Dr. Zhuang.
Thanks for your contribution again.
I've successfully conducted some experiments on MSMARCO with DSI-QG.
To keep going, I plan to conduct the experiments on the NQ dataset.
The only thing I want to make sure of is:
Referring to NQ's Hugging Face page, the data split is train: 307373 | dev: 7830.
These values are quite close to the amounts you mentioned in your work: "The NQ 320k dataset has ≈307k training query-document pairs and ≈8k dev query-document pairs."
May I treat 307373 ≈ 307k and 7830 ≈ 8k?
(By the way, I am also curious about the reason for the number dev: 6980 in the MSMARCO-100K dataset.)
Thanks for your confirmation in advance!
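For what it's worth, the rounding in the question can be checked directly (per the thread, "320k" is only a rough name, so this is just the arithmetic, not an official definition):

```python
# Split sizes quoted from the NQ Hugging Face page.
train, dev = 307_373, 7_830
total = train + dev

print(train, dev, total)  # 307373 7830 315203
# i.e. ~307k train and ~8k dev, totalling ~315k pairs -- close enough
# that the rough "NQ 320k" name plausibly refers to this combined count.
```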