Has this LLM been pretrained on this corpus? #23
Comments
Relevant paper: https://arxiv.org/abs/2311.01964
Training Data Extraction:
Early Literature Review:
This is the first passage in the MS MARCO passage corpus:
Here's me playing around... https://chat.openai.com/share/8b19eeec-4be2-4a16-a1ad-12b68fad81f2
Investigating Data Contamination in Modern Benchmarks for Large Language Models
Let's try to characterize data pollution, i.e., has this LLM been pretrained on this corpus?
Simple task: pick a random passage and chop it in half. Feed the first half into the LLM and ask it to complete the passage.
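A rough sketch of that probe in Python, to make the idea concrete. Everything named here is an assumption for illustration: `complete` is a hypothetical callable standing in for whatever LLM call you use (prompt in, completion text out), and the word-level `difflib` similarity is just one possible way to score a completion against the true second half, not something specified in this issue.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple


def split_in_half(passage: str) -> Tuple[str, str]:
    """Split a passage at the middle word boundary into (first half, second half)."""
    words = passage.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def completion_overlap(passage: str, complete: Callable[[str], str]) -> float:
    """Feed the first half to `complete` and score its output against the true second half.

    `complete` is a hypothetical stand-in for an LLM call.
    Returns a word-level similarity in [0, 1]; 1.0 means verbatim reproduction.
    """
    prefix, suffix = split_in_half(passage)
    suffix_words = suffix.split()
    # Truncate the completion to the length of the true suffix so that
    # a model that keeps writing past the passage is not penalized.
    predicted_words = complete(prefix).split()[: len(suffix_words)]
    return SequenceMatcher(None, suffix_words, predicted_words).ratio()


def probe_corpus(
    passages: Sequence[str],
    complete: Callable[[str], str],
    sample_size: int = 10,
    seed: int = 0,
) -> float:
    """Average the overlap score over a random sample of passages."""
    if not passages:
        raise ValueError("need at least one passage")
    rng = random.Random(seed)
    sample = rng.sample(list(passages), min(sample_size, len(passages)))
    scores = [completion_overlap(p, complete) for p in sample]
    return sum(scores) / len(scores)
```

To use it, wrap your model call in a function that takes the prompt string and returns the completion string, then pass it as `complete` along with a list of passages. A high average would only hint at memorization; comparing against passages the model cannot have seen would give a baseline for what "high" means.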