
About the data split of NQ 320K #15

Open
hi-i-m-GTooth opened this issue Apr 15, 2024 · 22 comments

@hi-i-m-GTooth

Hi, Dr. Zhuang.

Thanks for your contribution again.
I've successfully conducted some experiments on MSMARCO with DSI-QG.
To keep going, I plan to conduct the experiments on NQ dataset.
The only thing I want to make sure is:

According to NQ's Hugging Face page, the data split is train: 307373 | dev: 7830.
These values are quite close to the amounts you mentioned in your work: "The NQ 320k dataset has ≈307k training query-document pairs and ≈8k dev query-document pairs."

May I treat 307373 ≈ 307k and 7830 ≈ 8k?
(By the way, I am also curious how the number dev: 6980 was chosen for the MSMARCO-100K dataset.)
Thanks in advance for your confirmation!

@ArvinZhuang
Owner

To be honest, I'm also not quite sure why they call it NQ320k 😄.
For MSMARCO, the 6980-query dev set is actually a subset of the full dev set (the full dev set is very large). This subset is used for MSMARCO leaderboard submissions, so papers typically evaluate on it.
Check out this web page: https://microsoft.github.io/msmarco/Datasets.html

@hi-i-m-GTooth
Author

Oh, I see. I think it's a rough number XD.
So may I confirm that you use the same split as NQ's Hugging Face page describes?

And for MSMARCO:
Thanks for the introduction; it helped me understand your data processing further!
But I am still curious about how you determined the number 6980.
Thanks again, you respond as quickly as usual!

@ArvinZhuang
Owner

Yeah, I just use the NQ dataset from Hugging Face.

I did not choose the number 6980; the official MSMARCO passage dev-small set has 6980 queries, so I simply used all of them by default.

@hi-i-m-GTooth
Author

Thanks for the confirmation!!
But I only found dev rather than dev-small. Did you mean you keep the dev rows that also appear in train-small?

@ArvinZhuang
Owner

Is it not in the Queries link? I don't remember exactly where to find it, but it is the official dev set. Maybe you can use ir_datasets to download it: https://ir-datasets.com/msmarco-passage.html#msmarco-passage

@ArvinZhuang
Owner

It is probably in this link: collectionandqueries.tar.gz
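For reference, a minimal sketch of how one might verify the query count once collectionandqueries.tar.gz is extracted; the two-row `sample` string below is a made-up stand-in for the real queries.dev.small.tsv (one tab-separated `qid` and query text per line, 6980 rows in total):

```python
import csv
import io

# Toy stand-in for queries.dev.small.tsv; the real file has 6980 rows,
# each formatted as "<qid>\t<query text>".
sample = "100\tsample query one\n101\tsample query two\n"

def count_queries(fh):
    """Count rows in a qid<TAB>query TSV file handle."""
    return sum(1 for _ in csv.reader(fh, delimiter="\t"))

print(count_queries(io.StringIO(sample)))  # 2 for the toy sample; 6980 for the real file
```

Running the same function over the extracted file should print 6980 if you have the dev-small set.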

@hi-i-m-GTooth
Author

Thanks for the patient explanation and the links. I finally found the 6.8K there XD

@hi-i-m-GTooth
Author

hi-i-m-GTooth commented May 14, 2024

Hi, Dr. Zhuang.

Sorry to bother you again, and I hope you are doing well!

According to your information, I set --train_num to 307373 and --eval_num to 7830 and tried to reproduce the experiment on NQ-320K.
The preprocessing code is modified from your old repo (I modified it to follow the logic and format of the new preprocessing code).

However, I couldn't reach the score of 82.36 for t5-base; I only reach 58.24.
I've checked the training pairs, and they look normal, generated with the QG ckpt mentioned in #12.
I also noticed that a document in NQ can be longer than max_seq_len for a transformer-based encoder.
I would like to know whether you did any other preprocessing for the training and dev data?
Thanks in advance.
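On the max_seq_len point: a crude sketch of how one might flag NQ documents that exceed the encoder's input limit, using a whitespace split as a rough proxy for the real T5 tokenizer (the actual subword count would be higher, so this undercounts):

```python
MAX_SEQ_LEN = 512  # typical T5 input limit; tokens beyond this get truncated

def exceeds_max_len(doc: str, max_seq_len: int = MAX_SEQ_LEN) -> bool:
    """Rough length check, using whitespace tokens as a proxy for subwords."""
    return len(doc.split()) > max_seq_len

# Two toy documents: one short, one well over the limit.
docs = ["a short document", "word " * 600]
print(sum(exceeds_max_len(d) for d in docs))  # 1
```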

Appendix

Below is the script I used to try to reproduce the experiment on NQ-320K (on a single A6000):

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 python3 run.py \
                --task "DSI" \
                --model_name "google-t5/t5-base" \
                --run_name "NQ-320k-baseline-t5-base-DSI-QG" \
                --max_length 32 \
                --train_file temp/nq_corpus.tsv.320k.q10.docTquery \
                --valid_file data/nq_data/320k/nq_DSI_dev_data.json \
                --output_dir "models/NQ-320k-t5-base-DSI-QG(baseline)" \
                --learning_rate 0.0005 \
                --warmup_steps 100000 \
                --per_device_train_batch_size 128 \
                --per_device_eval_batch_size 128 \
                --evaluation_strategy steps \
                --eval_steps 1000 \
                --max_steps 1000000 \
                --save_strategy steps \
                --dataloader_num_workers 10 \
                --save_steps 1000 \
                --save_total_limit 2 \
                --load_best_model_at_end \
                --gradient_accumulation_steps 2 \
                --report_to wandb \
                --logging_steps 100 \
                --dataloader_drop_last False \
                --metric_for_best_model Hits@10 \
                --greater_is_better True \
                --remove_prompt True

And here is the score and loss on the dashboard:
[screenshots of the score and loss curves]

@ArvinZhuang
Owner

Seems you were generating 10 queries per document? Maybe try increasing the number to 50.

@hi-i-m-GTooth
Author

Thanks for the advice. Before I start reproducing, do you expect this change to increase performance by roughly 20%, going by your gut feeling?
Since my computational resources are not very generous, I think I should run this script carefully.

@ArvinZhuang
Owner

In my experience, more generated queries is always better, but it will indeed take even longer to converge.

@hi-i-m-GTooth
Author

Alright ... This is life ... 😢

@hi-i-m-GTooth
Author

hi-i-m-GTooth commented May 17, 2024

Hi, Dr. Zhuang.

I've finished the training with the query generation number set to 50.
Compared to Hits@10 = 58.24 with 10 generated queries, it improves to Hits@10 = 69.12.
However, it still doesn't reach Hits@10 = 82.36.
Is it normal or expected for Hits@10 to only reach 69.12 in this setting?

If it is normal, should I try:

  1. Increasing the query generation number to 100
  2. The CE reranker with m = 50 (and may I ask for the training args for the NQ dataset? As for the training data, I think I can generate them in DPR format myself.)

Thanks for the passionate reply!

@ArvinZhuang
Owner

How many steps have you trained?
Maybe go ahead with 100 queries.

@hi-i-m-GTooth
Author

I've trained about 300k steps.
[screenshot of the training curves]

So you think I should try a query generation number of 100 without the CE reranker?
Will the CE reranker have a significant influence in this situation?

@ArvinZhuang
Owner

There is no need to use the CE reranker when using 100 queries.

@hi-i-m-GTooth
Author

Ok. I'll try 100 queries first. Thanks for the confirmation!
Nevertheless, I have one more question: what --max_length did you use to generate queries on the NQ dataset?

Thanks again!

@yuxiang-guo

@hi-i-m-GTooth Hi GTooth.

Thanks for showing your reproduced result.

I have tried to run the code on another dataset smaller than NQ320k, but after 80k steps hit@1 only reaches 0.03, and I don't know why.

Your figure shows that at 50k training steps, hit@1 already reaches 0.2. Running 50k steps means just 2 or 3 epochs, since there are about 228k unique documents in NQ320K, and if you generate 10 queries for each doc, there will be 11 * 228k training samples in total (10 generated queries plus the original doc). Is my understanding correct?

Thanks for your confirmation in advance!

@ArvinZhuang
Owner

Hi @yuxiang-guo,
In my code, I actually did not use the original doc at all, so that will be 10 * 228k training samples.

If you are getting very low scores, I suggest you have a look at the generated queries. Do they look correct?
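A quick back-of-the-envelope check of the numbers above (assuming ~228k unique documents and the effective batch of 256 from the earlier training script):

```python
unique_docs = 228_000
queries_per_doc = 10

# Generated queries only; the original document text is not itself a
# training sample, hence 10 * 228k rather than 11 * 228k.
train_samples = unique_docs * queries_per_doc
steps_per_epoch = train_samples // 256  # per-device batch 128 * grad accum 2

print(train_samples)    # 2280000
print(steps_per_epoch)  # 8906
```

So 50k steps at this batch size corresponds to roughly 5-6 passes over the generated-query training set under these assumptions.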

@hi-i-m-GTooth
Author

hi-i-m-GTooth commented Jul 14, 2024

Hi @yuxiang-guo ,

@ArvinZhuang already confirmed it XD.

But I want to add:
In my experience, batch size is also a critical factor.
You could also try increasing batch_size or gradient_accumulation_steps.
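For context, the effective batch size in the training setup above is the product of the per-device batch size, gradient accumulation steps, and GPU count; a tiny helper to see how the knobs interact:

```python
def effective_batch_size(per_device: int, grad_accum: int, n_gpus: int = 1) -> int:
    """Number of samples contributing to each optimiser update."""
    return per_device * grad_accum * n_gpus

# The script in this thread: 128 per device, 2 accumulation steps, 1 GPU.
print(effective_batch_size(128, 2))  # 256
# Doubling gradient accumulation doubles the effective batch with no extra memory.
print(effective_batch_size(128, 4))  # 512
```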

@yuxiang-guo

Hi @ArvinZhuang

Thanks for your reply!

I checked my generated queries. They look quite correct but differ from the queries in the test set. The docs in my dataset are semantically rich, so each doc can correlate with many diverse queries. So for a given doc, if the generated queries are not the same as the true query in the test set, would the accuracy be significantly affected?

@yuxiang-guo

Hi @hi-i-m-GTooth

Thanks for your reply and suggestions! I will try it.
