
Commit 0cd0a36

add test dataset
1 parent bdfc394 commit 0cd0a36


42 files changed: +5538 -0 lines changed

data/test/README-v1.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Mu-SHROOM @ SemEval 2025: Test Set

This archive corresponds to the unlabeled test data for the Mu-SHROOM shared task (Task 3 at SemEval 2025), the Multilingual Shared-task on Hallucinations and Observable Overgeneration Mistakes.

It contains:
1. the present README,
2. JSONL files containing the unlabeled data corresponding to our test split (henceforth "the data files"),
3. a directory containing scripts for replicating our datapoint creation process; these are provided mainly for documentation purposes.

There are separate data files for all 14 languages of the shared task: Arabic (Modern Standard), **Basque**, **Catalan**, Chinese (Mandarin), **Czech**, English, Finnish, **Farsi**, French, German, Hindi, Italian, Spanish, and Swedish.
Languages in bold correspond to test-only languages: no validation data was provided for these languages, so as to evaluate participants' systems under pure generalization conditions.

## What is Mu-SHROOM?
The task consists of detecting spans of text that correspond to hallucinations.
Participants are asked to determine which parts of a given text produced by an LLM constitute hallucinations.
The task is held in a multilingual and multi-model context, i.e., we provide data in multiple languages, produced by a variety of LLMs with publicly available weights.

This task is a follow-up to last year's SemEval Task 6 (SHROOM).

More information is available on the official task website: https://helsinki-nlp.github.io/shroom/

## How will participants be evaluated?

Participants will be ranked along two (character-level) metrics:
1. the intersection-over-union between the characters marked as hallucinations in the gold reference and the characters predicted as such;
2. how well the probability assigned by the participant's system that a character is part of a hallucination correlates with the empirical probabilities observed among our annotators.

Rankings and submissions are handled separately per language.

For further information, you can have a look at the scoring program at [this url](https://helsinki-nlp.github.io/shroom/scorer.py).

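To make these metrics concrete, here is a minimal sketch of how a character-level IoU and a per-character correlation could be computed. This is an illustration only, not the official scorer; the choice of Spearman correlation and the NaN handling are assumptions, so defer to the linked `scorer.py`.

```python
from scipy.stats import spearmanr  # assumption: a rank correlation; see scorer.py for the official choice


def char_iou(gold_spans, pred_spans):
    """Intersection-over-union over the sets of character indices marked as hallucinated."""
    gold = {i for start, end in gold_spans for i in range(start, end)}
    pred = {i for start, end in pred_spans for i in range(start, end)}
    if not gold and not pred:
        return 1.0  # nothing to detect and nothing predicted
    return len(gold & pred) / len(gold | pred)


def char_correlation(gold_probs, pred_probs):
    """Correlation between gold and predicted per-character hallucination probabilities."""
    rho = spearmanr(gold_probs, pred_probs).correlation
    return 0.0 if rho != rho else rho  # treat NaN (constant inputs) as no correlation
```
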
## Data file format
The data files are formatted as JSON lines: each line is a JSON object and corresponds to an individual datapoint.

Each datapoint corresponds to a different annotated LLM production, and contains the following information:
- a unique datapoint identifier (`id`);
- a language (`lang`);
- a model input question (`model_input`), the input passed to the model for generation;
- a model identifier (`model_id`), the HuggingFace identifier of the corresponding model;
- a model output (`model_output_text`), the output generated by the LLM when provided the aforementioned input;
- a list of model output tokens (`model_output_tokens`), the tokenized output of the LLM response;
- a list of logit values for the tokens generated in the LLM response (`model_output_logits`).

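For illustration, the snippet below shows one way to iterate over a data file; the file name is hypothetical, and the printed fields simply mirror the list above.

```python
import json

# Hypothetical file name; substitute the actual data file for your language.
with open("mushroom.en-tst.v1.jsonl", encoding="utf-8") as fh:
    datapoints = [json.loads(line) for line in fh if line.strip()]

for dp in datapoints[:3]:
    print(dp["id"], dp["lang"], dp["model_id"])
    print("input: ", dp["model_input"])
    print("output:", dp["model_output_text"])
    print("tokens:", len(dp["model_output_tokens"]), "logit entries:", len(dp["model_output_logits"]))
```
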
Files contain different numbers of items: Basque, Catalan, Czech and Farsi contain around 100 items each, while the other languages contain around 150 items each.

We provide output logits so as to foster methods that investigate model behavior, but participants will likely be interested in richer attributes, such as probability distributions or model embeddings. As a starting point, you can look into the logit-reconstruction scripts that we provide for German and English (`scripts/english/recompute_logits_english.py` and `scripts/german/recompute_logits_german.py`), which explicitly retrieve the full output distributions.

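If you recompute the full per-step logits with those scripts, a natural next step is to turn them into token-level probabilities. The sketch below assumes `step_logits` is a list of full-vocabulary logit vectors (one per generated token) and `token_ids` holds the ids of the generated tokens; neither name comes from the provided scripts.

```python
import torch

def token_probabilities(step_logits, token_ids):
    """Probability the model assigned to each token it actually generated."""
    probs = []
    for logits, tok_id in zip(step_logits, token_ids):
        # Softmax over the vocabulary turns one step's logits into a distribution.
        dist = torch.softmax(torch.as_tensor(logits, dtype=torch.float32), dim=-1)
        probs.append(dist[tok_id].item())
    return probs
```
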
## Submission file format
Participants need to produce JSON lines files for their predictions, one file per language. Each line should correspond to one test point and should contain the following information:
- the unique datapoint identifier (`id`) of the test item this line is a prediction for (always required);
- binarized predictions (`hard_labels`), provided as a list of pairs, where each pair corresponds to the start (included) and end (excluded) of a hallucination (optional if `soft_labels` are included);
- continuous predictions (`soft_labels`), provided as a list of dictionary objects (optional if `hard_labels` are included). Each dictionary object must contain the following keys:
    + `start`, indicating the start of the hallucination span,
    + `end`, indicating the end of the hallucination span,
    + `prob`, the empirical probability (proportion of annotators) marking the span as a hallucination.

The inclusion of extra information is not penalized; i.e., participants may simply fill in the missing columns of the test file and submit the result to the platform.
Participants are encouraged to have a look at the neural baseline provided in the [`participant kit`](https://a3s.fi/mickusti-2007780-pub/participant_kit.zip), or at the validation files, for examples of the expected format.

Participants can submit any of the following:
1. both binarized and continuous predictions (`hard_labels` and `soft_labels`);
2. only binarized predictions (`hard_labels`);
3. only continuous predictions (`soft_labels`).

In the latter two cases, the scoring program will apply default rules to convert binarized predictions into continuous predictions, or vice versa.
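The exact defaults are defined in the scoring program; purely as an illustration (these rules are an assumption, not a description of `scorer.py`), one simple pair of conversions would be:

```python
def soft_from_hard(hard_labels):
    # Assumed rule: treat every predicted hard span as a certain hallucination.
    return [{"start": start, "end": end, "prob": 1.0} for start, end in hard_labels]

def hard_from_soft(soft_labels, threshold=0.5):
    # Assumed rule: keep only spans whose probability exceeds a threshold.
    return [[span["start"], span["end"]] for span in soft_labels if span["prob"] > threshold]
```
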
The hard labels (`hard_labels`) will be used to assess the intersection-over-union accuracy, whereas the soft labels (`soft_labels`) will be used to measure correlation.
In the evaluation phase, participants will be tasked with reconstructing the soft labels and providing the `start`, `end` and `prob` keys of all the spans they detect.

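For reference, a single prediction line could be produced as follows; the identifier, offsets and probabilities are invented purely for illustration.

```python
import json

# All values are made up for illustration; `id` must match a test datapoint.
prediction = {
    "id": "mushroom-en-tst-0001",  # hypothetical identifier
    "hard_labels": [[6, 17], [33, 41]],  # [start, end) character offsets
    "soft_labels": [
        {"start": 6, "end": 17, "prob": 0.8},
        {"start": 33, "end": 41, "prob": 0.6},
    ],
}

with open("predictions.en.jsonl", "a", encoding="utf-8") as fh:
    fh.write(json.dumps(prediction, ensure_ascii=False) + "\n")
```
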
## How will this dataset differ from upcoming data releases?

Labels will be released at the end of the evaluation phase.
We also intend to release supplementary annotation details after the evaluation phase.

data/test/scripts/.DS_Store

6 KB
Binary file not shown.

data/test/scripts/arabic/gen_ar.py

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
import os
import sys
import json
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import tqdm.auto as tqdm  # auto picks the right progress bar inside or outside notebooks
from transformers.utils import logging

logging.set_verbosity_warning()

seed = 42
torch.manual_seed(seed)


# Load the Arabic question list; each row becomes one record dict.
records = pd.read_csv("questions-ar.csv")
records = records.to_dict(orient="records")


# Run on GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

model_names = [
    "SeaLLMs/SeaLLM-7B-v2.5",
    "openchat/openchat-3.5-0106-gemma",
    "arcee-ai/Arcee-Spark",
]

# Marker used to strip the chat prompt from the decoded output, per model.
split_text_array = ["\n<|im_start|>assistant\n", "\nassistant", "\nassistant"]
configs = [
    ("k50_p0.90_t0.1", dict(top_k=50, top_p=0.90, temperature=0.1)),
    # ("k50_p0.95_t0.1", dict(top_k=50, top_p=0.95, temperature=0.1)),
    ("k50_p0.90_t0.2", dict(top_k=50, top_p=0.90, temperature=0.2)),
    # ("k50_p0.95_t0.2", dict(top_k=50, top_p=0.95, temperature=0.2)),
    ("k50_p0.90_t0.3", dict(top_k=50, top_p=0.90, temperature=0.3)),
    # ("k50_p0.95_t0.3", dict(top_k=50, top_p=0.95, temperature=0.3)),
    ("k75_p0.90_t0.1", dict(top_k=75, top_p=0.90, temperature=0.1)),
    # ("k75_p0.95_t0.1", dict(top_k=75, top_p=0.95, temperature=0.1)),
    ("k75_p0.90_t0.2", dict(top_k=75, top_p=0.90, temperature=0.2)),
    # ("k75_p0.95_t0.2", dict(top_k=75, top_p=0.95, temperature=0.2)),
    ("k75_p0.90_t0.3", dict(top_k=75, top_p=0.90, temperature=0.3)),
    # ("k75_p0.95_t0.3", dict(top_k=75, top_p=0.95, temperature=0.3)),
]

model_idx = 2
model_name = model_names[model_idx]
split_text = split_text_array[model_idx]

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(
    device
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


print("Model used ", model_name)
model_short = model_name.split("/")[-1]

i = 0  # datapoint counter, used to build per-question output paths
for record in tqdm.tqdm(records):
    i += 1
    print(
        f"Link in Arabic: {record['Link in Arabic']}"
    )  # Print 'Link in Arabic' once for every sample
    question = str(record["AR questions"])
    print(question)
    generated_answers = []
    configs_to_show = list(configs)[:5]  # Ensure only 5 configs are shown

    for shorthand, config in tqdm.tqdm(configs_to_show):
        messages = [
            # "Answer the following question accurately and concisely"
            {"role": "user", "content": "أجب عن السؤال التالي بشكل دقيق ومختصر"},
            # "Of course! What is the question you would like answered?"
            {
                "role": "assistant",
                "content": "بالطبع! ما هو السؤال الذي تود الإجابة عنه؟",
            },
            {"role": "user", "content": question},
        ]
        encodeds = tokenizer.apply_chat_template(
            messages, return_tensors="pt", add_generation_prompt=True
        )

        model_inputs = encodeds.to(device)

        outputs = model.generate(
            model_inputs,
            max_new_tokens=512,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            return_dict_in_generate=True,
            output_logits=True,
            do_sample=True,
            **config,
        )

        response_text = tokenizer.decode(
            outputs.sequences[0], skip_special_tokens=True
        ).strip()
        response_text = response_text.split(split_text)[-1].strip()
        print(response_text)
        # Keep only the generated continuation; the prompt tokens are dropped.
        response_token_ids = (
            outputs.sequences[0].to("cpu").tolist()[len(model_inputs[0]) :]
        )
        response_tokens = tokenizer.convert_ids_to_tokens(response_token_ids)
        # Full per-step logit vectors for the generated tokens.
        response_logits = [logit.to("cpu").tolist() for logit in outputs.logits]

        generated_answers.append(
            (shorthand, response_text, response_tokens, response_logits)
        )

    # Show all 5 answers and let the user pick one
    for idx, (shorthand, answer, tokens, logits) in enumerate(generated_answers):
        print(f"\nAnswer {idx + 1} (Config: {shorthand}):\n{answer}\n")

    choice = int(input("Choose the answer to save (1-5): ")) - 1

    selected_shorthand, selected_answer, selected_tokens, selected_logits = generated_answers[choice]

    output_file_path = f'./SHROOM/{i}/{model_name.split("/")[1]}/arabic-{model_name.split("/")[1]}.{i}.{selected_shorthand}.jsonl'
    os.makedirs(
        f'./SHROOM/{i}/{model_name.split("/")[1]}', exist_ok=True
    )
    os.makedirs(f"./SHROOM/{i}", exist_ok=True)

    # Save the selected answer and its logits
    with open(output_file_path, "w", encoding="utf-8") as file:
        record["model_id"] = model_name
        record["lang"] = "AR"
        record["output_text"] = selected_answer
        record["output_tokens"] = selected_tokens
        record["output_logits"] = selected_logits

        columns_to_extract = [
            "Link in Arabic",
            "lang",
            "AR questions",
            "model_id",
            "output_text",
            "output_tokens",
        ]
        records_small = {k: record[k] for k in columns_to_extract}
        # Store the sampling config of the *selected* answer, not the last one tried.
        records_small["gen_config"] = dict(configs_to_show)[selected_shorthand]
        json.dump(records_small, file, ensure_ascii=False)
        file.write("\n")

data/test/scripts/basque/gen_gemma.py

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
#!/usr/bin/env python
# coding: utf-8
import json
import os
import pathlib
import random

import pandas as pd
import torch
import tqdm
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import logging

logging.set_verbosity_warning()
random.seed(2202)

# GPU: this config has been tested on a V100 (32 GB).
# Cache directory used when downloading the models.
os.environ['HF_HOME'] = './.hf/'
#!pip install --upgrade pip
#!pip install huggingface_hub
#!export HF_HOME='./.hf'

os.makedirs('outputs/4annot', exist_ok=True)
os.makedirs('outputs/with_logits', exist_ok=True)


# Safely copy your HF token to this working directory to log in to Hugging Face.
with open('./hf_token', 'r') as file:
    hftoken = file.readlines()[0].strip()

login(token=hftoken, add_to_git_credential=True)
model_name = "google/gemma-7b-it"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


# Load the Basque question list; each row becomes one record dict.
file_path = "questions-eu.tsv"
records = pd.read_csv(file_path, sep='\t').to_dict(orient='records')

configs = [
    ('k50_p0.90_t0.1', dict(top_k=50, top_p=0.90, temperature=0.1)),
    ('k50_p0.90_t0.2', dict(top_k=50, top_p=0.90, temperature=0.2)),
    ('k50_p0.95_t0.1', dict(top_k=50, top_p=0.95, temperature=0.1)),
    ('k50_p0.95_t0.2', dict(top_k=50, top_p=0.95, temperature=0.2)),
    ('k75_p0.90_t0.1', dict(top_k=75, top_p=0.90, temperature=0.1)),
    ('k75_p0.90_t0.2', dict(top_k=75, top_p=0.90, temperature=0.2)),
    ('k75_p0.95_t0.1', dict(top_k=75, top_p=0.95, temperature=0.1)),
    ('k75_p0.95_t0.2', dict(top_k=75, top_p=0.95, temperature=0.2)),
    ('default', dict()),
]

random.shuffle(configs)


for shorthand, config in tqdm.tqdm(configs):
    print(config)
    output_file_path = f'outputs/with_logits/basque3-{model_name.split("/")[1]}.{shorthand}.jsonl'
    annotation_file_path = f'outputs/4annot/basque3-{model_name.split("/")[1]}-anotation.{shorthand}.jsonl'
    if not pathlib.Path(annotation_file_path).is_file():
        new_records = []
        with open(output_file_path, 'w', encoding='utf-8') as file:
            for record in tqdm.tqdm(records):
                record = {**record}
                message = [
                    # # Prompt 2:
                    # {"role": "user", "content": "Answer this question ONLY in Basque, as correctly and concisely as you can"},
                    # {"role": "model", "content": "Sure! What is the question that I need to answer in Basque?"},
                    # {"role": "user", "content": record['question']},

                    # Prompt 3 (the Basque rendering of Prompt 2 above):
                    {"role": "user", "content": "Erantzun galdera hau, BAKARRIK euskaraz, modu zuzen eta zehatzean"},
                    {"role": "model", "content": "Noski! Zein da euskaraz erantzun behar dudan galdera?"},
                    {"role": "user", "content": record['question']},
                ]

                inputs = tokenizer.apply_chat_template(
                    message,
                    add_generation_prompt=True,
                    return_tensors="pt"
                ).to(model.device)

                terminators = [
                    tokenizer.eos_token_id,
                    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
                    tokenizer.encode('\n')[-1],
                ]

                outputs = model.generate(
                    inputs,
                    max_new_tokens=512,
                    num_return_sequences=1,
                    eos_token_id=terminators,
                    pad_token_id=tokenizer.eos_token_id,
                    return_dict_in_generate=True,
                    output_logits=True,
                    do_sample=True,
                    **config
                )


                # The returned sequence repeats the input at the beginning; keep only the continuation.
                response = outputs.sequences[0][inputs.shape[-1]:]
                response_text = tokenizer.decode(response, skip_special_tokens=True)
                # Some OOM workarounds: move everything to CPU before storing.
                response_token_ids = response.to("cpu").tolist()
                response_tokens = tokenizer.convert_ids_to_tokens(response_token_ids)
                # Keep, for each generated token, the logit the model assigned to that token.
                response_logits = [l.squeeze().to("cpu").tolist()[response_token_ids[idx]] for idx, l in enumerate(outputs.logits)]


                record['model_id'] = model_name
                record['lang'] = 'EU'
                record['output_text'] = response_text
                record['output_tokens'] = response_tokens
                record['output_logits'] = response_logits

                json.dump(record, file, ensure_ascii=False)
                file.write('\n')


                # Smaller record (without logits) for the annotation interface.
                columns_to_extract = ['url-localized', 'lang', 'question', 'model_id', 'output_text', 'output_tokens', 'title']
                extracted_data = {key: record[key] for key in columns_to_extract if key in record}
                new_records.append(extracted_data)


        print(annotation_file_path)
        with open(annotation_file_path, 'w', encoding='utf-8') as file:
            for extracted_data in new_records:
                json.dump(extracted_data, file, ensure_ascii=False)
                file.write('\n')
