
Commit 0cd0a36

add test dataset
1 parent bdfc394 commit 0cd0a36


42 files changed: +5538 -0 lines changed

data/test/README-v1.md

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# Mu-SHROOM @ SemEval 2025: Test Set

This archive corresponds to the unlabeled test data for the Mu-SHROOM shared task (Task 3 at SemEval 2025), the Multilingual Shared-task on Hallucinations and Observable Overgeneration Mistakes.

It contains:
1. the present README,
2. JSONL files containing the unlabeled data corresponding to our test split (henceforth "the data files"),
3. a directory containing scripts for replicating our datapoint creation process; these are provided mainly for documentation purposes.

There are separate data files for all 14 languages of the shared task: Arabic (Modern Standard), **Basque**, **Catalan**, Chinese (Mandarin), **Czech**, English, Finnish, **Farsi**, French, German, Hindi, Italian, Spanish, and Swedish.
Languages in bold correspond to test-only languages: no validation data was provided for these languages, so as to evaluate participants' systems under pure generalization conditions.

## What is Mu-SHROOM?
The task consists of detecting spans of text that correspond to hallucinations.
Participants are asked to determine which parts of a given text produced by an LLM constitute hallucinations.
The task is held in a multilingual and multi-model context, i.e., we provide data in multiple languages, produced by a variety of LLMs with publicly available weights.

This task is a follow-up to last year's SemEval Task 6 (SHROOM).

More information is available on the official task website: https://helsinki-nlp.github.io/shroom/

## How will participants be evaluated?

Participants will be ranked along two (character-level) metrics:
1. the intersection-over-union between the characters marked as hallucinations in the gold reference and the characters predicted as such;
2. how well the probability assigned by the participant's system that a character is part of a hallucination correlates with the empirical probabilities observed among our annotators.

Rankings and submissions are handled separately per language.

For further information, you can have a look at the scoring program at [this url](https://helsinki-nlp.github.io/shroom/scorer.py).

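To make these metrics concrete, here is a minimal sketch of how a character-level IoU and a per-character correlation could be computed. This is an illustration only, not the official scorer; the choice of Spearman correlation and the NaN handling are assumptions, so defer to the linked `scorer.py`.

```python
from scipy.stats import spearmanr  # assumption: a rank correlation; see scorer.py for the official choice


def char_iou(gold_spans, pred_spans):
    """Intersection-over-union over the sets of character indices marked as hallucinated."""
    gold = {i for start, end in gold_spans for i in range(start, end)}
    pred = {i for start, end in pred_spans for i in range(start, end)}
    if not gold and not pred:
        return 1.0  # nothing to detect and nothing predicted
    return len(gold & pred) / len(gold | pred)


def char_correlation(gold_probs, pred_probs):
    """Correlation between gold and predicted per-character hallucination probabilities."""
    rho = spearmanr(gold_probs, pred_probs).correlation
    return 0.0 if rho != rho else rho  # treat NaN (constant inputs) as no correlation
```
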
## Data file format
The data files are formatted as JSON lines: each line is a JSON object and corresponds to an individual datapoint.

Each datapoint corresponds to a different annotated LLM production, and contains the following information:
- a unique datapoint identifier (`id`);
- a language (`lang`);
- a model input question (`model_input`), the input passed to the model for generation;
- a model identifier (`model_id`), the HuggingFace identifier of the corresponding model;
- a model output (`model_output_text`), the output generated by the LLM when provided the aforementioned input;
- a list of model output tokens (`model_output_tokens`), the tokenized output of the LLM response;
- a list of logit values for the tokens generated in the LLM response (`model_output_logits`).

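For illustration, the snippet below shows one way to iterate over a data file; the file name is hypothetical, and the printed fields simply mirror the list above.

```python
import json

# Hypothetical file name; substitute the actual data file for your language.
with open("mushroom.en-tst.v1.jsonl", encoding="utf-8") as fh:
    datapoints = [json.loads(line) for line in fh if line.strip()]

for dp in datapoints[:3]:
    print(dp["id"], dp["lang"], dp["model_id"])
    print("input: ", dp["model_input"])
    print("output:", dp["model_output_text"])
    print("tokens:", len(dp["model_output_tokens"]), "logit entries:", len(dp["model_output_logits"]))
```
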
Files contain different numbers of items: Basque, Catalan, Czech and Farsi contain around 100 items each, while the other languages contain around 150 items each.

We provide output logits so as to foster methods that investigate model behavior, but participants will likely be interested in richer attributes, such as probability distributions or model embeddings. As a starting point, you can look into the logit-reconstruction scripts that we provide for German and English (`scripts/english/recompute_logits_english.py` and `scripts/german/recompute_logits_german.py`), which explicitly retrieve the full output distributions.

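If you recompute the full per-step logits with those scripts, a natural next step is to turn them into token-level probabilities. The sketch below assumes `step_logits` is a list of full-vocabulary logit vectors (one per generated token) and `token_ids` holds the ids of the generated tokens; neither name comes from the provided scripts.

```python
import torch

def token_probabilities(step_logits, token_ids):
    """Probability the model assigned to each token it actually generated."""
    probs = []
    for logits, tok_id in zip(step_logits, token_ids):
        # Softmax over the vocabulary turns one step's logits into a distribution.
        dist = torch.softmax(torch.as_tensor(logits, dtype=torch.float32), dim=-1)
        probs.append(dist[tok_id].item())
    return probs
```
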
## Submission file format
Participants need to produce JSON lines files for their predictions, one file per language. Each line should correspond to one test point and should contain the following information:
- the unique datapoint identifier (`id`) of the test item this line is a prediction for (always required);
- binarized predictions (`hard_labels`), provided as a list of pairs, where each pair corresponds to the start (included) and end (excluded) of a hallucination (optional if `soft_labels` are included);
- continuous predictions (`soft_labels`), provided as a list of dictionary objects (optional if `hard_labels` are included). Each dictionary object must contain the following keys:
    + `start`, indicating the start of the hallucination span,
    + `end`, indicating the end of the hallucination span,
    + `prob`, the empirical probability (proportion of annotators) marking the span as a hallucination.

The inclusion of extra information is not penalized; i.e., participants may simply fill in the missing columns of the test file and submit the result to the platform.
Participants are encouraged to have a look at the neural baseline provided in the [`participant kit`](https://a3s.fi/mickusti-2007780-pub/participant_kit.zip), or at the validation files, for examples of the expected format.

Participants can submit any of the following:
1. both binarized and continuous predictions (`hard_labels` and `soft_labels`);
2. only binarized predictions (`hard_labels`);
3. only continuous predictions (`soft_labels`).

In the latter two cases, the scoring program will apply default rules to convert binarized predictions into continuous predictions, or vice versa.
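The exact defaults are defined in the scoring program; purely as an illustration (these rules are an assumption, not a description of `scorer.py`), one simple pair of conversions would be:

```python
def soft_from_hard(hard_labels):
    # Assumed rule: treat every predicted hard span as a certain hallucination.
    return [{"start": start, "end": end, "prob": 1.0} for start, end in hard_labels]

def hard_from_soft(soft_labels, threshold=0.5):
    # Assumed rule: keep only spans whose probability exceeds a threshold.
    return [[span["start"], span["end"]] for span in soft_labels if span["prob"] > threshold]
```
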
The hard labels (`hard_labels`) will be used to assess the intersection-over-union accuracy, whereas the soft labels (`soft_labels`) will be used to measure correlation.
In the evaluation phase, participants will be tasked with reconstructing the soft labels and providing the `start`, `end` and `prob` keys of all the spans they detect.

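For reference, a single prediction line could be produced as follows; the identifier, offsets and probabilities are invented purely for illustration.

```python
import json

# All values are made up for illustration; `id` must match a test datapoint.
prediction = {
    "id": "mushroom-en-tst-0001",  # hypothetical identifier
    "hard_labels": [[6, 17], [33, 41]],  # [start, end) character offsets
    "soft_labels": [
        {"start": 6, "end": 17, "prob": 0.8},
        {"start": 33, "end": 41, "prob": 0.6},
    ],
}

with open("predictions.en.jsonl", "a", encoding="utf-8") as fh:
    fh.write(json.dumps(prediction, ensure_ascii=False) + "\n")
```
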
## How will this dataset differ from upcoming data releases?

Labels will be released at the end of the evaluation phase.
We also intend to release supplementary annotation details after the evaluation phase.

data/test/scripts/.DS_Store

6 KB
Binary file not shown.

data/test/scripts/arabic/gen_ar.py

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
import os
import sys
import json
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import tqdm.auto as tqdm  # auto picks the right progress bar inside or outside notebooks
from transformers.utils import logging

logging.set_verbosity_warning()

seed = 42
torch.manual_seed(seed)


# Load the Arabic question list; each row becomes one record dict.
records = pd.read_csv("questions-ar.csv")
records = records.to_dict(orient="records")


# Run on GPU when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

model_names = [
    "SeaLLMs/SeaLLM-7B-v2.5",
    "openchat/openchat-3.5-0106-gemma",
    "arcee-ai/Arcee-Spark",
]

# Marker used to strip the chat prompt from the decoded output, per model.
split_text_array = ["\n<|im_start|>assistant\n", "\nassistant", "\nassistant"]
configs = [
    ("k50_p0.90_t0.1", dict(top_k=50, top_p=0.90, temperature=0.1)),
    # ("k50_p0.95_t0.1", dict(top_k=50, top_p=0.95, temperature=0.1)),
    ("k50_p0.90_t0.2", dict(top_k=50, top_p=0.90, temperature=0.2)),
    # ("k50_p0.95_t0.2", dict(top_k=50, top_p=0.95, temperature=0.2)),
    ("k50_p0.90_t0.3", dict(top_k=50, top_p=0.90, temperature=0.3)),
    # ("k50_p0.95_t0.3", dict(top_k=50, top_p=0.95, temperature=0.3)),
    ("k75_p0.90_t0.1", dict(top_k=75, top_p=0.90, temperature=0.1)),
    # ("k75_p0.95_t0.1", dict(top_k=75, top_p=0.95, temperature=0.1)),
    ("k75_p0.90_t0.2", dict(top_k=75, top_p=0.90, temperature=0.2)),
    # ("k75_p0.95_t0.2", dict(top_k=75, top_p=0.95, temperature=0.2)),
    ("k75_p0.90_t0.3", dict(top_k=75, top_p=0.90, temperature=0.3)),
    # ("k75_p0.95_t0.3", dict(top_k=75, top_p=0.95, temperature=0.3)),
]

model_idx = 2
model_name = model_names[model_idx]
split_text = split_text_array[model_idx]

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(
    device
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


print("Model used ", model_name)
model_short = model_name.split("/")[-1]

i = 0  # datapoint counter, used to build per-question output paths
for record in tqdm.tqdm(records):
    i += 1
    print(
        f"Link in Arabic: {record['Link in Arabic']}"
    )  # Print 'Link in Arabic' once for every sample
    question = str(record["AR questions"])
    print(question)
    generated_answers = []
    configs_to_show = list(configs)[:5]  # Ensure only 5 configs are shown

    for shorthand, config in tqdm.tqdm(configs_to_show):
        messages = [
            # "Answer the following question accurately and concisely"
            {"role": "user", "content": "أجب عن السؤال التالي بشكل دقيق ومختصر"},
            # "Of course! What is the question you would like answered?"
            {
                "role": "assistant",
                "content": "بالطبع! ما هو السؤال الذي تود الإجابة عنه؟",
            },
            {"role": "user", "content": question},
        ]
        encodeds = tokenizer.apply_chat_template(
            messages, return_tensors="pt", add_generation_prompt=True
        )

        model_inputs = encodeds.to(device)

        outputs = model.generate(
            model_inputs,
            max_new_tokens=512,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            return_dict_in_generate=True,
            output_logits=True,
            do_sample=True,
            **config,
        )

        response_text = tokenizer.decode(
            outputs.sequences[0], skip_special_tokens=True
        ).strip()
        response_text = response_text.split(split_text)[-1].strip()
        print(response_text)
        # Keep only the generated continuation; the prompt tokens are dropped.
        response_token_ids = (
            outputs.sequences[0].to("cpu").tolist()[len(model_inputs[0]) :]
        )
        response_tokens = tokenizer.convert_ids_to_tokens(response_token_ids)
        # Full per-step logit vectors for the generated tokens.
        response_logits = [logit.to("cpu").tolist() for logit in outputs.logits]

        generated_answers.append(
            (shorthand, response_text, response_tokens, response_logits)
        )

    # Show all 5 answers and let the user pick one
    for idx, (shorthand, answer, tokens, logits) in enumerate(generated_answers):
        print(f"\nAnswer {idx + 1} (Config: {shorthand}):\n{answer}\n")

    choice = int(input("Choose the answer to save (1-5): ")) - 1

    selected_shorthand, selected_answer, selected_tokens, selected_logits = generated_answers[choice]

    output_file_path = f'./SHROOM/{i}/{model_name.split("/")[1]}/arabic-{model_name.split("/")[1]}.{i}.{selected_shorthand}.jsonl'
    os.makedirs(
        f'./SHROOM/{i}/{model_name.split("/")[1]}', exist_ok=True
    )
    os.makedirs(f"./SHROOM/{i}", exist_ok=True)

    # Save the selected answer and its logits
    with open(output_file_path, "w", encoding="utf-8") as file:
        record["model_id"] = model_name
        record["lang"] = "AR"
        record["output_text"] = selected_answer
        record["output_tokens"] = selected_tokens
        record["output_logits"] = selected_logits

        columns_to_extract = [
            "Link in Arabic",
            "lang",
            "AR questions",
            "model_id",
            "output_text",
            "output_tokens",
        ]
        records_small = {k: record[k] for k in columns_to_extract}
        # Store the sampling config of the *selected* answer, not the last one tried.
        records_small["gen_config"] = dict(configs_to_show)[selected_shorthand]
        json.dump(records_small, file, ensure_ascii=False)
        file.write("\n")

data/test/scripts/basque/gen_gemma.py

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
#!/usr/bin/env python
# coding: utf-8
import json
import os
import pathlib
import random

import pandas as pd
import torch
import tqdm
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import logging

logging.set_verbosity_warning()
random.seed(2202)

# GPU: this config has been tested on a V100 (32 GB).
# Cache directory used when downloading the models.
os.environ['HF_HOME'] = './.hf/'
#!pip install --upgrade pip
#!pip install huggingface_hub
#!export HF_HOME='./.hf'

os.makedirs('outputs/4annot', exist_ok=True)
os.makedirs('outputs/with_logits', exist_ok=True)


# Safely copy your HF token to this working directory to log in to Hugging Face.
with open('./hf_token', 'r') as file:
    hftoken = file.readlines()[0].strip()

login(token=hftoken, add_to_git_credential=True)
model_name = "google/gemma-7b-it"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


# Load the Basque question list; each row becomes one record dict.
file_path = "questions-eu.tsv"
records = pd.read_csv(file_path, sep='\t').to_dict(orient='records')

configs = [
    ('k50_p0.90_t0.1', dict(top_k=50, top_p=0.90, temperature=0.1)),
    ('k50_p0.90_t0.2', dict(top_k=50, top_p=0.90, temperature=0.2)),
    ('k50_p0.95_t0.1', dict(top_k=50, top_p=0.95, temperature=0.1)),
    ('k50_p0.95_t0.2', dict(top_k=50, top_p=0.95, temperature=0.2)),
    ('k75_p0.90_t0.1', dict(top_k=75, top_p=0.90, temperature=0.1)),
    ('k75_p0.90_t0.2', dict(top_k=75, top_p=0.90, temperature=0.2)),
    ('k75_p0.95_t0.1', dict(top_k=75, top_p=0.95, temperature=0.1)),
    ('k75_p0.95_t0.2', dict(top_k=75, top_p=0.95, temperature=0.2)),
    ('default', dict()),
]

random.shuffle(configs)


for shorthand, config in tqdm.tqdm(configs):
    print(config)
    output_file_path = f'outputs/with_logits/basque3-{model_name.split("/")[1]}.{shorthand}.jsonl'
    annotation_file_path = f'outputs/4annot/basque3-{model_name.split("/")[1]}-anotation.{shorthand}.jsonl'
    if not pathlib.Path(annotation_file_path).is_file():
        new_records = []
        with open(output_file_path, 'w', encoding='utf-8') as file:
            for record in tqdm.tqdm(records):
                record = {**record}
                message = [
                    # # Prompt 2:
                    # {"role": "user", "content": "Answer this question ONLY in Basque, as correctly and concisely as you can"},
                    # {"role": "model", "content": "Sure! What is the question that I need to answer in Basque?"},
                    # {"role": "user", "content": record['question']},

                    # Prompt 3 (the Basque rendering of Prompt 2 above):
                    {"role": "user", "content": "Erantzun galdera hau, BAKARRIK euskaraz, modu zuzen eta zehatzean"},
                    {"role": "model", "content": "Noski! Zein da euskaraz erantzun behar dudan galdera?"},
                    {"role": "user", "content": record['question']},
                ]

                inputs = tokenizer.apply_chat_template(
                    message,
                    add_generation_prompt=True,
                    return_tensors="pt"
                ).to(model.device)

                terminators = [
                    tokenizer.eos_token_id,
                    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
                    tokenizer.encode('\n')[-1],
                ]

                outputs = model.generate(
                    inputs,
                    max_new_tokens=512,
                    num_return_sequences=1,
                    eos_token_id=terminators,
                    pad_token_id=tokenizer.eos_token_id,
                    return_dict_in_generate=True,
                    output_logits=True,
                    do_sample=True,
                    **config
                )


                # The returned sequence repeats the input at the beginning; keep only the continuation.
                response = outputs.sequences[0][inputs.shape[-1]:]
                response_text = tokenizer.decode(response, skip_special_tokens=True)
                # Some OOM workarounds: move everything to CPU before storing.
                response_token_ids = response.to("cpu").tolist()
                response_tokens = tokenizer.convert_ids_to_tokens(response_token_ids)
                # Keep, for each generated token, the logit the model assigned to that token.
                response_logits = [l.squeeze().to("cpu").tolist()[response_token_ids[idx]] for idx, l in enumerate(outputs.logits)]


                record['model_id'] = model_name
                record['lang'] = 'EU'
                record['output_text'] = response_text
                record['output_tokens'] = response_tokens
                record['output_logits'] = response_logits

                json.dump(record, file, ensure_ascii=False)
                file.write('\n')


                # Smaller record (without logits) for the annotation interface.
                columns_to_extract = ['url-localized', 'lang', 'question', 'model_id', 'output_text', 'output_tokens', 'title']
                extracted_data = {key: record[key] for key in columns_to_extract if key in record}
                new_records.append(extracted_data)


        print(annotation_file_path)
        with open(annotation_file_path, 'w', encoding='utf-8') as file:
            for extracted_data in new_records:
                json.dump(extracted_data, file, ensure_ascii=False)
                file.write('\n')
