Create Danish retrieval dataset from Wikipedia #1353
-
Hey @KasperGroesLudvigsen! I was already experimenting with something like this a couple of months ago using Zephyr.
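For reference, a minimal sketch of what such an experiment could look like with Zephyr via the transformers pipeline. The prompt wording, the example paragraph, and the generation settings are my own illustrative assumptions, not what was actually used:

```python
# Sketch: generate a Danish search query for a Wikipedia paragraph with Zephyr.
# Prompt wording and generation settings are illustrative assumptions.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A (made-up) Danish Wikipedia paragraph.
paragraph = "H.C. Andersen blev født i Odense i 1805 ..."

messages = [
    {"role": "system", "content": "You write short Danish search queries."},
    {"role": "user", "content": f"Write one search query that this paragraph answers:\n\n{paragraph}"},
]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
out = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.7,
           return_full_text=False)
print(out[0]["generated_text"].strip())
```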
-
I tried it out like you suggested and was able to generate the queries. Hope it makes sense.
-
We're thinking of generating multiple candidate queries by using multiple LLMs. Any suggestions regarding how to select the best candidate would be appreciated. We were thinking of computing BERTScore between the paragraph and each generated query, or embedding both with Jina and choosing the query with the minimum embedding distance.
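A minimal sketch of the embedding-based option, using sentence-transformers with a Jina embedding model. The exact model name, the example texts, and cosine similarity as the distance criterion are assumptions on my part:

```python
# Sketch: pick the candidate query closest to the paragraph in embedding space.
# Model choice and distance metric are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

paragraph = "H.C. Andersen blev født i Odense i 1805 ..."
candidates = [
    "Hvor blev H.C. Andersen født?",
    "Hvornår udkom Den grimme ælling?",
    "H.C. Andersens fødselsår",
]

para_emb = model.encode(paragraph, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity between the paragraph and each candidate; the best
# candidate is the one with the highest similarity (minimum distance).
scores = util.cos_sim(para_emb, cand_embs)[0]
best = candidates[scores.argmax().item()]
print(best, scores.max().item())
```

The BERTScore option would look much the same: score each candidate against the paragraph and keep the one with the highest F1.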
-
We've made the dataset now. Any input is welcome. Would you, for instance, know how to use it? Is any info missing?
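In case it helps with the "how to use it" question, here is a rough sketch of how query–paragraph pairs from such a dataset could train an embedding model with sentence-transformers. The dataset name, column names, and base model are placeholders, not the actual release:

```python
# Sketch: train an embedding model on (query, paragraph) pairs with
# MultipleNegativesRankingLoss. Dataset/column names are placeholders.
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses

ds = load_dataset("your-org/danish-wiki-retrieval", split="train")  # hypothetical name

train_examples = [InputExample(texts=[row["query"], row["paragraph"]]) for row in ds]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

model = SentenceTransformer("intfloat/multilingual-e5-base")
# In-batch negatives: every other paragraph in the batch acts as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```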
-
Hi! The Danish Data Science Community has gotten access to an Nvidia A100 GPU for the rest of the year, kindly sponsored by Nvidia and Arrow ECS Denmark. I'd like to see if we can use it to create a Danish dataset for training embedding models on a retrieval task.
@KennethEnevoldsen suggested we take an approach similar to what @rasdani did in #718, where he, among other things, used an LLM to generate queries for paragraphs in Wikipedia data. Specifically, I envision that we create a dataset like that, but for paragraphs in the Danish Wikipedia.
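To make the Danish-Wikipedia part concrete, here is one way the source paragraphs could be pulled with the datasets library. The dump snapshot and the naive paragraph split are assumptions, not a settled preprocessing pipeline:

```python
# Sketch: load Danish Wikipedia and split articles into paragraphs.
# The snapshot id and the paragraph heuristic are illustrative assumptions.
from datasets import load_dataset

wiki = load_dataset("wikimedia/wikipedia", "20231101.da", split="train")

def to_paragraphs(article):
    # Naive split on blank lines; real preprocessing would also filter
    # headings, very short chunks, tables, etc.
    return [p.strip() for p in article["text"].split("\n\n") if len(p.strip()) > 100]

paragraphs = [p for article in wiki.select(range(100)) for p in to_paragraphs(article)]
print(len(paragraphs), paragraphs[0][:80])
```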
If I understand correctly, we can create such a dataset using this approach (paraphrasing #718 (comment)):
To that end, I have some clarifying questions:
Any comments are welcome! :)