Create Danish retrieval dataset from Wikipedia #1353
-
Hey @KasperGroesLudvigsen! I was already experimenting with something like this a couple of months ago using Zephyr.
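For reference, a minimal sketch of what such an experiment could look like with Zephyr via the transformers pipeline. The prompt wording, the example paragraph, and the generation settings are my own illustrative assumptions, not what was actually used:

```python
# Sketch: generate a Danish search query for a Wikipedia paragraph with Zephyr.
# Prompt wording and generation settings are illustrative assumptions.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A (made-up) Danish Wikipedia paragraph.
paragraph = "H.C. Andersen blev født i Odense i 1805 ..."

messages = [
    {"role": "system", "content": "You write short Danish search queries."},
    {"role": "user", "content": f"Write one search query that this paragraph answers:\n\n{paragraph}"},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
out = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.7,
           return_full_text=False)
print(out[0]["generated_text"].strip())
```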
-
I tried it out like you suggested and was able to generate the queries. Hope it makes sense.
-
We're thinking of generating multiple candidate queries by using multiple LLMs. Any suggestions regarding how to select the best candidate would be appreciated. We were thinking of computing BERTScore between the paragraph and each generated query, or embedding both with Jina and choosing the query with the minimum embedding distance.
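A minimal sketch of the embedding-based option, using sentence-transformers with a Jina embedding model. The exact model name, the example texts, and cosine similarity as the distance criterion are assumptions on my part:

```python
# Sketch: pick the candidate query closest to the paragraph in embedding space.
# Model choice and distance metric are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

paragraph = "H.C. Andersen blev født i Odense i 1805 ..."
candidates = [
    "Hvor blev H.C. Andersen født?",
    "Hvornår udkom Den grimme ælling?",
    "H.C. Andersens fødselsår",
]

para_emb = model.encode(paragraph, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the paragraph and each candidate; the best
# candidate is the one with the highest similarity (minimum distance).
scores = util.cos_sim(para_emb, cand_embs)[0]
best = candidates[scores.argmax().item()]
print(best, scores.max().item())
```

The BERTScore option would look much the same: score each candidate against the paragraph and keep the one with the highest F1.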
-
We've made the dataset now. Any input is welcome. Would you, for instance, know how to use it? Is any info missing?
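In case it helps with the "how to use it" question, here is a rough sketch of how query–paragraph pairs from such a dataset could train an embedding model with sentence-transformers. The dataset name, column names, and base model are placeholders, not the actual release:

```python
# Sketch: train an embedding model on (query, paragraph) pairs with
# MultipleNegativesRankingLoss. Dataset/column names are placeholders.
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses

ds = load_dataset("your-org/danish-wiki-retrieval", split="train")  # hypothetical name

train_examples = [InputExample(texts=[row["query"], row["paragraph"]]) for row in ds]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

model = SentenceTransformer("intfloat/multilingual-e5-base")
# In-batch negatives: every other paragraph in the batch acts as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```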
-
Hi! The Danish Data Science Community has gotten access to an Nvidia A100 GPU for the rest of the year, kindly sponsored by Nvidia and Arrow ECS Denmark. I'd like to see if we can use it to create a Danish dataset for training embedding models on a retrieval task.
@KennethEnevoldsen suggested we take an approach similar to what @rasdani did in #718, where he, among other things, used an LLM to generate queries for paragraphs in Wikipedia data. Specifically, I envision that we create a dataset like that, but for paragraphs in the Danish Wikipedia.
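To make the Danish-Wikipedia part concrete, here is one way the source paragraphs could be pulled with the datasets library. The dump snapshot and the naive paragraph split are assumptions, not a settled preprocessing pipeline:

```python
# Sketch: load Danish Wikipedia and split articles into paragraphs.
# The snapshot id and the paragraph heuristic are illustrative assumptions.
from datasets import load_dataset

wiki = load_dataset("wikimedia/wikipedia", "20231101.da", split="train")

def to_paragraphs(article):
    # Naive split on blank lines; real preprocessing would also filter
    # headings, very short chunks, tables, etc.
    return [p.strip() for p in article["text"].split("\n\n") if len(p.strip()) > 100]

paragraphs = [p for article in wiki.select(range(100)) for p in to_paragraphs(article)]
print(len(paragraphs), paragraphs[0][:80])
```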
If I understand correctly, we can create such a dataset using this approach (paraphrasing #718 (comment)):
To that end, I have some clarifying questions:
Any comments are welcome! :)