Has this LLM been pretrained on this corpus? #23

lintool · 2024-01-18T18:49:11Z

Let's try to characterize data pollution - i.e., has this LLM been pretrained on this corpus?

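Simple task: pick a random passage and chop it in half. Feed the first half into the LLM and ask it to complete the passage. If the completion closely matches the real second half, that suggests the passage was in the pretraining data.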
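A minimal sketch of that probe in Python, to make the idea concrete. Everything here is an assumption for illustration: `complete` is a hypothetical stand-in for whatever LLM call you use (any function from prompt string to completion string), splitting at the middle word is one arbitrary choice, and the word-level overlap score from `difflib` is just one possible way to compare the completion against the true second half.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple


def split_passage(text: str) -> Tuple[str, str]:
    """Split a passage into two halves at the middle word."""
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def completion_overlap(completion: str, reference: str) -> float:
    """Word-level similarity in [0, 1] between a completion and the true second half."""
    ref_words = reference.split()
    # Truncate the completion to the reference length so extra generated
    # text after the passage ends is not penalized.
    comp_words = completion.split()[: len(ref_words)]
    return SequenceMatcher(None, comp_words, ref_words).ratio()


def probe(passage: str, complete: Callable[[str], str]) -> float:
    """Feed the first half to `complete` (hypothetical LLM wrapper) and score the result."""
    prefix, suffix = split_passage(passage)
    return completion_overlap(complete(prefix), suffix)
```

Usage would be something like `probe(passage_text, my_llm_call)`, averaged over many random passages. A high score hints at memorization but does not prove it, since formulaic text can be guessed without having been seen; what counts as "high" is not something this sketch settles.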

lintool · 2024-01-24T12:41:50Z

Relevant paper: https://arxiv.org/abs/2311.01964

AndreSlavescu · 2024-02-05T16:02:11Z

Training Data Extraction:

https://arxiv.org/abs/2311.17035

ASChampOmega · 2024-04-15T13:58:11Z

Early Literature Review:

Literature Review for Data Contamination.docx

lintool · 2024-04-15T15:48:39Z

This is the first passage in the MS MARCO passage corpus:

{"id": "0", "contents": "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated."}

Here's me playing around... https://chat.openai.com/share/8b19eeec-4be2-4a16-a1ad-12b68fad81f2

lintool · 2024-04-15T21:24:11Z

Investigating Data Contamination in Modern Benchmarks for Large Language Models
https://arxiv.org/abs/2311.09783

crystina-z · 2024-04-19T14:28:59Z

some related papers:
